June 24, 2017

June 22, 2017

Gaël Varoquaux

Scikit-learn Paris sprint 2017

Two weeks ago, we held in Paris a large international sprint on scikit-learn. It was incredibly productive and fun, as always. We are still busy merging in the work, but I think that now is a good time to try to summarize the sprint.

A massive workforce

We had a mix of core contributors and newcomers, which is a great combination, as it enables us to be productive while also fostering the new generation of core developers. Present were:

  • Albert Thomas
  • Alexandre Abadie
  • Alexandre Gramfort
  • Andreas Mueller
  • Arthur Imbert
  • Aurélien Bellet
  • Bertrand Thirion
  • Denis Engemann
  • Elvis Dohmatob
  • Gael Varoquaux
  • Jan Margeta
  • Joan Massich
  • Joris Van den Bossche
  • Laurent Direr
  • Lemaitre Guillaume
  • Loic Esteve
  • Mohamed Maskani Filali
  • Nathalie Vauquier
  • Nicolas Cordier
  • Nicolas Goix
  • Olivier Grisel
  • Patricio Cerda
  • Paul Lagrée
  • Raghav RV
  • Roman Yurchak
  • Sebastien Treger
  • Sergei Lebedev
  • Thierry Guillemot
  • Thomas Moreau
  • Tom Dupré la Tour
  • Vlad Niculae
  • Manoj Kumar (could not come to Paris because of visa issues)

Many more people participated remotely, and I am pretty certain that I have forgotten some.

Support and hosting

Hosting: As the sprint extended through a French bank holiday and the weekend, we were hosted in a variety of venues:

  • La Paillasse, a Paris bio-hacker space
  • Criteo, a French company doing world-wide ad-banner placement. The venue there was absolutely gorgeous, with a beautiful terrace on the roofs of Paris. They even had a social event with free drinks one evening.

Guillaume Lemaître did most of the organization, and at Criteo Ibrahim Abubakari was our host. We were treated like kings during the whole stay, each host welcoming us as well as they could.

Financial support by France is IA: Beyond our hosts, we need to thank France is IA, which funded the sprint, covering some of the lunches, accommodations, and travel expenses to bring in our contributors from abroad (3000 euros for travel & accommodation, and 1000 euros for food and a venue during the weekend).

Some achievements during the sprint

It would be hard to list everything that we did during the sprint (have a look at the development changelog if you’re curious). Here are some highlights:

  • Quantile transformer, to transform the data distribution into a uniform or Gaussian distribution (PR, example); a minimal usage sketch appears after this list:

    (Before / after plots of the transformed distributions.)

  • Memory saving by avoiding casting to float64 when X is given as float32: we are slowly making sure that, as much as possible, all models avoid using internal representations with dtype float64 when the data is given as float32. This significantly reduces memory usage and can give speed-ups of up to a factor of two.

  • API tests on instances rather than classes. This is to facilitate testing packages in scikit-learn-contrib.

  • Many small API fixes to ensure better consistency of models, as well as cleaning the codebase, making sure that examples display well under matplotlib 2.x.

  • Many bug fixes, including fixing corner cases in our average precision, which was dear to me (PR).
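
Here is a minimal sketch of the quantile transformer usage, following the API as it ended up in sklearn.preprocessing; the toy data is made up for illustration:

import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Toy data with a heavy-tailed distribution (illustration only)
rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))

# Map the marginal distribution to a Gaussian ('uniform' gives a flat output)
qt = QuantileTransformer(output_distribution='normal', random_state=0)
X_gauss = qt.fit_transform(X)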

Work soon to be merged

  • ColumnTransformer (PR): from pandas dataframe to feature matrix, by applying different transformers to different columns (see the sketch after this list).
  • Fixing t-SNE (PR): our t-SNE implementation was extremely memory-inefficient and, on top of this, had minor bugs. We are fixing it.
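
As a rough idea of what the ColumnTransformer PR is aiming at, here is a sketch based on the API that eventually shipped in scikit-learn 0.20; the dataframe and transformers are hypothetical, and the details were still being discussed at the time of the sprint:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical dataframe mixing numeric and categorical columns
df = pd.DataFrame({'age': [25, 32, 47],
                   'income': [40000., 52000., 81000.],
                   'city': ['Paris', 'Lyon', 'Paris']})

# Apply a different transformer to each group of columns
ct = ColumnTransformer([
    ('numeric', StandardScaler(), ['age', 'income']),
    ('categorical', OneHotEncoder(), ['city']),
])
X = ct.fit_transform(df)  # feature matrix ready for an estimator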

There is a lot more pending work that the sprint helped move forward. You can also glance at the monthly activity report on GitHub.

Joblib progress

Joblib, the parallel-computing engine used by scikit-learn, is getting extended to work in distributed settings, for instance using dask distributed as a backend. At the sprint, we made progress running a grid-search on Criteo’s Hadoop cluster.

by Gaël Varoquaux at June 22, 2017 10:00 PM

June 21, 2017

Continuum Analytics news

It’s Getting Hot, Hot, Hot: Four Industries Turning Up The Data Science Heat

Wednesday, June 21, 2017
Christine Doig
Christine Doig
Sr. Data Scientist, Product Manager

Summer 2017 has officially begun. As temperatures continue to rise, so does the use of data science across dozens of industries. In fact, IBM predicts the demand for data scientists will increase by 28 percent in just three short years, and our own survey recently revealed that 96 percent of company executives conclude data science is critical to business success. While it’s clear that health care providers, financial institutions and retail organizations are harnessing the growing power of data science, it’s time for more industries to turn up the data science heat. We take a peek below at some of the up-and-comers.
 
Aviation and Aerospace
As data science continues to reach for the sky, it’s only fitting that the aviation industry is also on track to leverage this revolutionary technology. Airlines and passengers generate an abundance of data everyday, but are not currently harnessing the full potential of this information. Through advanced analytics and artificial intelligence driven by data science, fuel consumption, flight routes and air congestion could be optimized to improve the overall flight experience. What’s more, technology fueled by data science could help aviation proactively avoid some of the delays and inefficiencies that burden both staff and passengers—airlines just need to take a chance and fly with it! 

Cybersecurity   
In addition to aviation, cybersecurity has become an increasingly hot topic during the past few years. The global cost of handling cyberattacks is expected to rise from $400 billion in 2015 to $2.1 trillion by 2019, but implementing technology driven by data science can help secure business data and reduce these attacks. By focusing on the abnormalities, using all available data and automating whenever possible, companies will have a better chance at standing up to threatening attacks. Not to mention, artificial intelligence software is already being used to defend cyber infrastructure. 
  
Construction
While improving data security is essential, the construction industry is another space that should take advantage of data science tools to improve business outcomes. As an industry that has long resisted change, some companies are now turning to data science technology to manage large teams, improve efficiency in the building process and reduce project delivery time, ultimately increasing profit margins. By embracing data analytics and these new technologies, the construction industry will also have more room to successfully innovate. 
 
Ecology
From aviation to cybersecurity to construction, it’s clear that product-focused industries are on track to leverage data science. But what about the more natural side of things? One example suggests ecologists can learn more about ocean ecosystems through the use of technology driven by data science. Through coding and the use of other data science tools, these environmental scientists found they could conduct better, more effective oceanic research in significantly less time. Our hope is for other scientists to continue these methods and unearth more pivotal information about our planet. 
 
So there you have it: four industries that are beginning to harness the power of data science to help transform business processes, drive innovation and ultimately change the world. Who will the next four be?

 

 

by swebster at June 21, 2017 05:56 PM

Enthought

Enthought Announces Canopy 2.1: A Major Milestone Release for the Python Analysis Environment and Package Distribution

Python 3 and multi-environment support, new state of the art package dependency solver, and over 450 packages now available free for all users

Enthought is pleased to announce the release of Canopy 2.1, a significant feature release that includes Python 3 and multi-environment support, a new state of the art package dependency solver, and access to over 450 pre-built and tested scientific and analytic Python packages completely free for all users. We highly recommend that all current Canopy users upgrade to this new release.

Ready to dive in? Download Canopy 2.1 here.


For those already familiar with Canopy, in this blog we’ll review the major new features in this exciting milestone release. For those of you looking for a tool to improve your workflow with Python, or perhaps coming to Python from a language like MATLAB or R, we’ll take you through the key reasons that scientists, engineers, data scientists, and analysts use Canopy to enable their work in Python.

First, let’s talk about the latest and greatest in Canopy 2.1!

  1. Support for Python 3 user environments: Canopy can now be installed with a Python 3.5 user environment. Users can benefit from all the Canopy features already available for Python 2.7 (syntax checking, debugging, etc.) in the new Python 3 environments. Python 3.6 is also available (and will be the standard Python 3 in Canopy 2.2).
  2. All 450+ Python 2 and Python 3 packages are now completely free for all users: Technical support, full installers with all packages for offline or shared installation, and the premium analysis environment features (graphical debugger, variable browser, and Data Import Tool) remain subscriber-exclusive benefits. See subscription options here to take advantage of those benefits.
  3. Built-in, state-of-the-art dependency solver (EDM, or Enthought Deployment Manager): the new EDM back end (which replaces the previous enpkg) provides additional features for robust package compatibility. EDM integrates a specialized dependency solver which automatically ensures you have a consistent package set after installation, removal, or upgrade of any packages.
  4. Environment bundles, which allow users to easily share environments directly with co-workers, or across various deployment solutions (such as the Enthought Deployment Server, continuous integration processes like Travis-CI and Appveyor, cloud solutions like AWS or Google Compute Engine, or deployment tools like Ansible or Docker). EDM environment bundles not only allow the user to replicate the set of installed dependencies but also support persistence for constraint modifiers, the list of manually installed packages, and the runtime version and implementation.
  5. Multi-environment support: with the addition of Python 3 environments and the new EDM back end, Canopy now also supports managing multiple Python environments from the user interface. You can easily switch between Python 2.7 and 3.5, or between multiple 2.7 or 3.5 environments. This is ideal especially for those migrating legacy code to Python 3, as it allows you to test as you transfer and also provides access to historical snapshots or libraries that aren’t yet available in Python 3.


Why Canopy is the Python platform of choice for scientists and engineers

Since 2001, Enthought has focused on making the scientific Python stack accessible and easy to use for both enterprises and individuals. For example, Enthought released the first scientific Python distribution in 2004, added robust and corporate support for NumPy on 64-bit Windows in 2011, and released Canopy 1.0 in 2013.

Since then, with its MATLAB-like experience, Canopy has enabled countless engineers, scientists and analysts to perform sophisticated analysis, build models, and create cutting-edge data science algorithms. Canopy’s all-in-one package distribution and analysis environment for Python has also been widely adopted in organizations who want to provide a single, unified platform that can be used by everyone from data analysts to software engineers.

Here are five of the top reasons that people choose Canopy as their tool for enabling data analysis, data modelling, and data visualization with Python:

1. Canopy provides a complete, self-contained installer that gets you up and running with Python and a library of scientific and analytic tools – fast

Canopy has been designed to provide a fast installation experience which not only installs the Canopy analysis environment but also the Python version of your choice (e.g. 2.7 or 3.5) and a core set of curated Python packages. The installation process can be executed in your home directory and does not require administrative privileges.

In just minutes, you’ll have a fully working Python environment with the primary tools for doing your work pre-installed: Jupyter, Matplotlib, NumPy and SciPy optimized with the latest MKL from Intel, Scikit-learn, and Pandas, plus instant access to over 450 additional pre-built and tested scientific and analytic packages to customize your toolset.

No command line, no complex multi-stage setups! (although if you do prefer a flat, standalone command line interface for package and environment management, we offer that too via the EDM tool)

2. Access to a curated, quality assured set of packages managed through Canopy’s intuitive graphical package manager

The scientific Python ecosystem is gigantic and vibrant. Enthought is continuously updating its Enthought Python Distribution package set to provide the most recent “Enthought approved” versions of packages, with rigorous testing and quality assessment by our experts in the Python packaging ecosystem before release.

Our users can’t afford to take chances with the stability of their software and applications, and using Canopy as their gateway to the Python ecosystem helps take the risk out of the “wild west” of open source software. With more than 450 tested, pre-built and approved packages available in the Enthought Python Distribution, users can easily access both the most current stable version as well as historical versions of the libraries in the scientific Python stack.

Consistent with our focus on ease-of-use, Canopy provides a graphical package manager to easily search, install and remove packages from the user environment. You can also easily roll back to earlier versions of a package. The underlying EDM back end takes care of complex dependency management when installing, updating, and removing packages to ensure nothing breaks in the process.

3. Canopy is designed to be extensible for the enterprise

Canopy not only provides a consistent Python toolset for all 3 major operating systems and support for a wide variety of use cases (from data science to data analysis to modelling and even application development), but it is also extensible with other tools.

Canopy can easily be integrated with other software tools in use at enterprises, such as with Excel via PyXLL or with LabVIEW from National Instruments using the Python Integration Toolkit for LabVIEW. The built-in Canopy Data Import Tool helps you automate your data ingestion steps and automatically import tabular data files such as CSVs into Pandas DataFrames.

But it doesn’t stop there. If an enterprise has Python embedded in another software application, Canopy can be directly connected to that application to provide coding and debugging capabilities. Canopy itself can even be customized or embedded to provide a sophisticated Python interface for your applications. Contact us to learn more about these options.

Finally, in addition to accessing the libraries in the Enthought Python Distribution from Canopy, users can use the same tools to share and deploy their own internal, private packages by adding the Enthought Deployment Server.  The Enthought Deployment Server also allows enterprises to have a private,  onsite copy of the full Enthought Python Distribution on their own approved servers and compliant with their existing security protocols.

 

5. Canopy’s straightforward analysis environment, specifically tailored to the needs and workflow of scientists, analysts, and engineers

Three integrated features of the Canopy analysis environment combine to create a powerful, yet streamlined platform: (1) a code editor, (2) an interactive graphical debugger with variable browser, and (3) an IPython window.

  • Canopy’s code editor comes with everything required to write analysis code, but without the burden of advanced development environments like PyCharm or Microsoft Visual Studio (although, if needed, other IDEs can be configured to use the Canopy Python environment). With syntax highlighting, Python code auto-completion, and error checking, users can quickly interact with Python code, write new code, and execute existing code.
  • Canopy’s interactive graphical debugger with variable browser helps you quickly find and fix code errors, understand and investigate code and data, and write new code more quickly.
  • The integrated IPython window lets you quickly test code, experiment with ideas and see the results of code run directly from the editor. Canopy also includes pre-configured Jupyter Notebook access.

Finally, access to package documentation at your fingertips in Canopy is a great benefit to faster coding. Canopy not only integrates online documentation and examples for many of the most used packages for data visualization, numerical analysis, machine learning, and more, but also lets you easily extract and execute code from that documentation so you can get started quickly.

We’re very excited for this major release and all of the new capabilities that it will enable for both individuals and enterprises, and encourage you to download or update to the new Canopy 2.1 today.

Have feedback on your experience with Canopy?

We’d love to hear about it! Contact the product development team at canopy.support@enthought.com.


Additional Resources:

Blog: New Year, New Enthought Products (Jan 2017)

Blog: Enthought Presents the Canopy Platform at the 2017 American Institute of Chemical Engineers (AIChE) Spring Meeting

Product pages:

The post Enthought Announces Canopy 2.1: A Major Milestone Release for the Python Analysis Environment and Package Distribution appeared first on Enthought Blog.

by dpinte at June 21, 2017 04:30 PM

June 20, 2017

Matthieu Brucher

Announcement: Audio TK 2.1.0

ATK has been updated to 2.1.0 with a major refactoring of the Python wrappers and extensive testing of them. New filters were also added to support more complex pipelines (mute/solo and circular buffers for real-time spectrum displays), and Audio ToolKit now provides a CMake configuration file for easier integration in CMake projects.

Thanks to Travis and Appveyor, binaries for the releases are now published on GitHub. On all platforms we compile static and shared libraries. On Linux, builds for gcc 5, gcc 6, clang 3.8 and clang 3.9 are generated; on OS X, XCode 7 and XCode 8 builds are available as universal binaries; and on Windows, 32-bit and 64-bit builds with dynamic or static runtime (no shared libraries in the static case) are also generated. Due to Travis’s strange configuration on macOS, the binaries there don’t yet support Python 2 or Python 3.

Download link: ATK 2.1.0

Changelog:
2.1.0
* Added a config file for CMake
* Rewrote the Python wrappers to use pybind11 instead of SWIG
* Added MuteSoloSumFilter to allow mute/solo operations on tracks with Python wrappers
* SumFilter can now sum multiple channels together
* Adding fourth order Linkwitz-Riley filters
* Adding a new circular buffer (for FFT plugins for instance)
* Added parameters for tube (inverters) filters definition
* Added Python wrappers in Travis-CI builds
* Added a modified implementation of the Munro-Piazza triode function to remove some artefacts

2.0.2
* Fix ARM compilation

2.0.1
* Turn set/get into properties when possible (Python wrapper)
* Enhanced Tools API (Audio ToolKit book)
* Added a Feedback Delay Network filter (FDN) with a Hadamard mixing matrix, with Python wrappers
* Fixed MultipleUniversalFixedDelayLineFilter parameters


by Matt at June 20, 2017 07:04 AM

June 15, 2017

Matthew Rocklin

Dask Release 0.15.0

This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.15.0. This release contains performance and stability enhancements as well as some breaking changes. This blogpost outlines notable changes since the last release on May 5th.

As always you can conda install Dask:

conda install dask distributed

or pip install from PyPI

pip install dask[complete] --upgrade

Conda packages are available both on the defaults and conda-forge channels.

Full changelogs are available here:

Some notable changes follow.

NumPy ufuncs operate as Dask.array ufuncs

Thanks to recent changes in NumPy 1.13.0, NumPy ufuncs now operate as Dask.array ufuncs. Previously they would convert their arguments into Numpy arrays and then operate concretely.

import dask.array as da
import numpy as np

x = da.arange(10, chunks=(5,))

# Before
>>> np.negative(x)
array([ 0, -1, -2, -3, -4, -5, -6, -7, -8, -9])

# Now
>>> np.negative(x)
dask.array<negative, shape=(10,), dtype=int64, chunksize=(5,)>

To celebrate this change we’ve also improved support for more of the NumPy ufunc and reduction API, such as support for out parameters. This means that a non-trivial subset of the actual NumPy API works directly out of the box with dask.arrays. This makes it easier to write code that seamlessly works with either array type.

Note: the ufunc feature requires that you update NumPy to 1.13.0 or later. Packages are available through PyPI and conda on the defaults and conda-forge channels.

Asynchronous Clients

The Dask.distributed API is capable of operating within a Tornado or Asyncio event loop, which can be useful when integrating with other concurrent systems like web servers or when building some more advanced algorithms in machine learning and other fields. The API to do this used to be somewhat hidden, known only to a few, and used underscores to signify that methods were asynchronous.

# Before
client = Client(start=False)
await client._start()

future = client.submit(func, *args)
result = await client._gather(future)

These methods are still around, but the process of starting the client has changed and we now recommend using the fully public methods even in asynchronous situations (these used to block).

# Now
client = await Client(asynchronous=True)

future = client.submit(func, *args)
result = await client.gather(future)  # no longer use the underscore

You can also await futures directly:

result = await future

You can use yield instead of await if you prefer Python 2.

More information is available at https://distributed.readthedocs.org/en/latest/asynchronous.html.

Single-threaded scheduler moves from dask.async to dask.local

The single-machine scheduler used to live in the dask.async module. With async becoming a keyword since Python 3.5 we’re forced to rename this. You can now find the code in dask.local. This will particularly affect anyone who was using the single-threaded scheduler, previously known as dask.async.get_sync. The term dask.get can be used to reliably refer to the single-threaded base scheduler across versions.
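
For example, here is a minimal sketch of selecting the single-threaded scheduler explicitly (handy for debugging with pdb); the toy computation is arbitrary:

import dask
import dask.bag as db

b = db.from_sequence(range(10))

# Force the single-threaded scheduler, now living in dask.local,
# through the version-stable alias dask.get
result = b.map(lambda x: x + 1).sum().compute(get=dask.get)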

Retired the distributed.collections module

Early blogposts referred to functions like futures_to_dask_array which resided in the distributed.collections module. These have since been entirely replaced by better interactions between Futures and Delayed objects. This module has been removed entirely.

Always create new directories with the –local-directory flag

Dask workers create a directory where they can place temporary files. Typically this goes into your operating system’s temporary directory (/tmp on Linux and Mac).

Some users on network file systems specify this directory explicitly with the dask-worker ... --local-directory option, pointing to some other better place like a local SSD drive. Previously Dask would dump files into the provided directory. Now it will create a new subdirectory and place files there. This tends to be much more convenient for users on network file systems.

$ dask-worker scheduler-address:8786 --local-directory /scratch
$ ls /scratch
worker-1234/
$ ls /scratch/worker-1234/
user-script.py disk-storage/ ...

Bag.map no longer automatically expands tuples

Previously the map method would inspect functions and automatically expand tuples to fill arguments:

import dask.bag as db
b = db.from_sequence([(1, 10), (2, 20), (3, 30)])

>>> b.map(lambda x, y: x + y).compute()
[11, 22, 33]

While convenient, this behavior gave rise to corner cases and stopped us from being able to support multi-bag mapping functions. It has since been removed. As an advantage though, you can now map two co-partitioned bags together.

a = db.from_sequence([1, 2, 3])
b = db.from_sequence([10, 20, 30])

>>> db.map(lambda x, y: x + y, a, b).compute()
[11, 22, 33]

Styling

Clients and Futures have nicer HTML reprs that show up in the Jupyter notebook.

And the dashboard stays a decent width and has a new navigation bar with links to other dashboard pages. This template is now consistently applied to all dashboard pages.

Multi-client coordination

More primitives to help coordinate between multiple clients on the same cluster have been added. These include Queues and shared Variables for futures.
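
A rough sketch of how these primitives can be used; the scheduler address and names are placeholders, and the distributed documentation has the exact semantics:

from dask.distributed import Client, Queue, Variable

client = Client('scheduler-address:8786')  # placeholder address

# A named queue: any client connected to the same scheduler that creates
# Queue('work') sees the same queue
q = Queue('work')
future = client.submit(sum, [1, 2, 3])
q.put(future)

# A shared variable holding the current best result, visible to all clients
best = Variable('best-result')
best.set(future)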

Joblib performance through pre-scattering

When using Dask to power Joblib computations (such as occur in Scikit-Learn) with the joblib.parallel_backend context manager, you can now pre-scatter select data to all workers. This can significantly speed up some scikit-learn computations by reducing repeated data transfer.

import distributed.joblib
from sklearn.externals.joblib import parallel_backend

# Serialize the training data only once to each worker
with parallel_backend('dask.distributed', scheduler_host='localhost:8786',
                      scatter=[digits.data, digits.target]):
      search.fit(digits.data, digits.target)

Other Array Improvements

  • Filled out the dask.array.fft module
  • Added a basic dask.array.stats module with functions like chisquare
  • Support the @ matrix multiply operator (a short sketch follows)
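
As an illustration of the new operator support, a small sketch on a dask array with arbitrary contents:

import dask.array as da

x = da.random.random((1000, 1000), chunks=(250, 250))

# The @ operator now builds a lazy dask matrix product
y = x @ x.T
print(y[:2, :2].compute())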

General performance and stability

As usual, a number of bugs were identified and resolved and a number of performance optimizations were implemented. Thank you to all users and developers who continue to help identify and implement areas for improvement. Users should generally have a smoother experience.

Removed ZMQ networking backend

We have removed the experimental ZeroMQ networking backend. This was not particularly useful in practice. However it was very effective in serving as an example while we were making our network communication layer pluggable with different protocols.

The following related projects have also been released recently and may be worth updating:

  • NumPy 1.13.0
  • Pandas 0.20.2
  • Bokeh 0.12.6
  • Fastparquet 0.1.0
  • S3FS 0.1.1
  • Cloudpickle 0.3.1 (pip)
  • lz4 0.10.0 (pip)

Acknowledgements

The following people contributed to the dask/dask repository since the 0.14.3 release on May 5th:

  • Antoine Pitrou
  • Elliott Sales de Andrade
  • Ghislain Antony Vaillant
  • John A Kirkham
  • Jim Crist
  • Joseph Crail
  • Juan Nunez-Iglesias
  • Julien Lhermitte
  • Martin Durant
  • Matthew Rocklin
  • Samantha Hughes
  • Tom Augspurger

The following people contributed to the dask/distributed repository since the 1.16.2 release on May 5th:

  • A. Jesse Jiryu Davis
  • Antoine Pitrou
  • Brett Naul
  • Eugene Van den Bulke
  • Fabian Keller
  • Jim Crist
  • Krisztián Szűcs
  • Matthew Rocklin
  • Simon Perkins
  • Thomas Arildsen
  • Viacheslav Ostroukh

June 15, 2017 12:00 AM

June 14, 2017

Continuum Analytics news

Here Comes The Data Science—And It’s All Right

Wednesday, June 14, 2017
Travis Oliphant
President, Chief Data Scientist & Co-Founder

Did you know that 94 percent of enterprises are using open source technologies for Data Science, and 96 percent of company executives say Data Science is critical to the success of their business?   We uncovered these statistics when surveying several hundred executives and data scientists to gain a better understanding of the state of Data Science in today’s organizations. Clearly, Data Science is becoming more popular by the minute—but, what tools and platforms are companies specifically using?
 
Many people start with Anaconda. As the leading and fastest-growing Open Data Science platform powered by Python, it has been downloaded more than 13 million times and had approximately two million active users in 2016—and this number is rapidly increasing!
 
Another proof of Data Science’s hockey stick growth curve is the 18th annual KDnuggets Software Poll, which shows Anaconda usage has increased by 37 percent from 2016—giving it the second highest growth rate in the poll. Furthermore, Anaconda earned a top 10 ranking for the industry’s most popular analytics/Data Science tools, with nearly a quarter of the KDnuggets’ survey respondents using the Open Data Science platform. In addition, Python surpassed R as the most popular Data Science language (52.6 percent of respondents use the language), a trend that will likely continue through the next few years, especially given its particular suitability to Artificial Intelligence and Machine Learning.

 
The growth numbers point to an undeniable market need for Data Science technologies that can deliver, and are delivering, tremendous insights and impactful business outcomes. By using Python, Anaconda and other Data Science technologies, organizations across dozens of industries will be able to identify patterns, uncover crucial insights and transform data into a goldmine of intelligence to solve the world’s most challenging problems—such as predicting the effects of public policy, curing rare genetic diseases and even discovering new planets. This is only the beginning of the power of Data Science.

 

by swebster at June 14, 2017 03:56 PM

June 13, 2017

numfocus

NumPy receives first ever funding, thanks to Moore Foundation

For the first time ever, NumPy—a core project for the Python scientific computing stack—has received grant funding. The proposal, “Improving NumPy for Better Data Science” will receive $645,020 from the Moore Foundation over 2 years, with the funding going to UC Berkeley Institute for Data Science. The principal investigator is Dr. Nathaniel Smith. NumFOCUS congratulates Nathaniel and all […]

by Gina Helfrich at June 13, 2017 02:00 PM

Matthieu Brucher

Audio ToolKit: RIAA correction curves

Vinyl has become trendy again, and as such, I’ve been asked to add some new filters in Audio ToolKit. Here is a small dive in RIAA land.

More than meets the eye

The RIAA filter compensates for lower bass frequencies and louder high frequencies on a vinyl record (this pre-mastering filter for vinyl is, of course, also available now in Audio ToolKit on the develop branch). The filter is quite simple: it’s an order-2 low-pass filter with known knees. There are three time constants: t_1 = 75*10^{-6} s, t_2 = 318*10^{-6} s and t_3 = 3180*10^{-6} s. These are used in the following continuous transfer function:

H(s) = \dfrac{t_2}{t_1}\dfrac{(1+st_2)}{(1+st_1)(1+st_3)}

The trick to convert this into discrete time is to first warp these time constants with the following equation: t_d = \dfrac{1}{f_s \tan(\pi / (t_a f_s))}, with f_s the sampling frequency. Only then do you get a matching curve:

(Figure: RIAA correction curve)
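
As an illustrative check (not the Audio ToolKit implementation), the analog magnitude response of H(s) can be evaluated directly from the constants above:

import numpy as np

# RIAA time constants (seconds)
t1, t2, t3 = 75e-6, 318e-6, 3180e-6

f = np.logspace(1, np.log10(20000), 500)      # 10 Hz to 20 kHz
s = 1j * 2 * np.pi * f

# Continuous-time transfer function from the post
H = (t2 / t1) * (1 + s * t2) / ((1 + s * t1) * (1 + s * t3))
gain_db = 20 * np.log10(np.abs(H))            # plot gain_db against f to get the curve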

If you compare this with Wikipedia’s curve, you will see that it doesn’t match mine. The gain for lower frequencies is not correct, and at 1 kHz the gain is not 0 dB. And if you try to get 0 dB at 1 kHz, you end up with an even lower gain! Why? Good question. These are the constants given by the association, and this is the gain that should be used.

RIAA is not simple. It’s not even consistent! But at least it gives a good first curve for vinyl correction that can be enhanced by adding additional EQ filters.


by Matt at June 13, 2017 07:16 AM

June 09, 2017

numfocus

NumFOCUS adds Shogun Machine Learning Toolbox to Sponsored Projects

NumFOCUS is delighted to announce the addition of the Shogun Machine Learning Toolbox to our fiscally sponsored projects. Shogun’s mission is to make powerful machine learning tools available to everyone —researchers, engineers, students — anyone curious to experiment with machine learning to leverage data. The Shogun Machine Learning Toolbox provides efficient implementations of standard and state-of-the-art machine […]

by Gina Helfrich at June 09, 2017 03:00 PM

Trichech

Types of Termites That Need to Be Exterminated with an Anti-Termite Service

Anti-termite services handle many kinds of jobs, from termites settled in walls, chairs, tables and doors to other household furnishings. The termites living in these kinds of objects come in a variety of shapes. Termites themselves fall into three types: worker termites, soldier termites, and alates (winged termites, known locally as laron). All three are a frightening nuisance when present in a house in large numbers. Each type of termite has its own habitat and role. Worker termites, for example, gather food supplies for their colony; in carrying out this task they often gnaw through wood, soil, or whatever other material they favor.


Besides worker termites, another type that calls for a professional exterminator is the soldier termite. These termites guard the nest and the food stores against enemy attacks. If you come across termites while destroying a nest, they are the soldiers; their main job is to keep the nest and everything inside it safe.
The third type is the alate (laron). Many people do not realize that alates are a kind of termite that can be dealt with by an anti-termite service. They grow from the ground and have wings, and they are usually fondest of circling around lamps. If you find them in your home, it is best to contact a professional who can get rid of them.

by admin at June 09, 2017 01:45 PM

June 06, 2017

Continuum Analytics news

Using Anaconda to Embrace Python 3 And Support Python 2

Tuesday, June 6, 2017
Ian Stokes-Rees
Continuum Analytics

The data science community received a special delivery in December with the release of Python 3.6. At the time, I had a conversation with The New Stack to discuss what’s new with this release, and why we at Continuum see 2017 as the year that Python 3 is beginning to dominate the data science landscape, with major adoption in the enterprise space. I’m excited that today we’re extending that story with the most recent version 4.4 release of Anaconda, available for both Python 2.7 and Python 3.6. Anaconda downloads have skyrocketed in the past six months, with more than a million downloads per month. Besides delivering a comprehensive platform for Python-centric data science with a single-click installer for Windows, Mac, Linux, and Power8, Anaconda 4.4 is also designed to make it easy to work with both Python 2 and Python 3 code.  If you haven’t already started using Python 3, there’s no better time than today. 

Why Python 3.6?

When Python 3.6 launched, I was excited to see this tweet from legendary Python core developer and former Python Software Foundation board member, Raymond Hettinger, as it expressed my sentiments exactly:

 

 

To back up a minute: in 2016, Python 3.5 proved itself among the established Python community, and word got out that Python 3 was good to go. This made Python 3.6 the first version of the language that was delivered to a maturing base of users, poising it for a prime growth opportunity. Having just returned from PyCon 2017 in Portland, Oregon—the flagship event for the Python community—I can tell you that Python 2 is mostly a footnote, and the buzz is largely around Python 3. With just over three years until Python 2 support is officially eliminated by the Python Software Foundation, and the scientific Python community publishing coordinated timelines for when Python 2 support will be suspended for many popular libraries, it is reasonable that there’s a widespread move to Python 3 happening now. This is great news because Python 3, and especially Python 3.6 (as Raymond alluded to in his tweet), offers many great new features and performance enhancements.

What’s New?

Of the 200 tools and libraries that Anaconda provides, there are dozens designed to help with co-development of Python 2 and Python 3 code. This means that many of the new Python 3 standard library capabilities are available through backported implementations that are bundled in Anaconda for Python 2. Enterprises that are still committed to maintaining their legacy Python 2 code-base can benefit from these features in advance of migrating to Python 3. Furthermore, Anaconda’s software sandboxing system means it is easy to run Python 2 and Python 3 in tandem on the same system, supporting a gradual migration and avoiding the need for any “big bang” cutover. For many, these capabilities alone are enough of a reason to use Anaconda. If you’re looking to migrate people and software to Python 3, then I’d recommend Lennart Regebro’s Python 3 Porting website and the shorter PSF HOWTO on Porting to Python 3 by Microsoft engineer Brett Cannon.

Anaconda 4.4 also ships with the Intel Math Kernel Library (MKL), providing a substantial performance boost for the 20+ Python libraries that are compiled to leverage these optimized routines on several generations of Intel processors. NumPy, SciPy and Scikit-learn, and all the libraries that build on these, benefit from this.

What’s Next?

We’re advising our clients to see Python 3.6 as a “reference release” that should be adopted for any new Python-based projects. 

Python 3.6 offers a more stable version of the language that enhances some core concurrency capabilities around a concept known as “coroutines.” These provide new language constructs for the creation of asynchronous, and possibly parallel, functions. This is an area that is obviously important in the world of multi-core CPUs, increasing demands for computational power, and the leveling off of peak processor speeds. Python 3 also offers fundamental improvements of the core data structures, in particular dictionaries on which the language is practically built. Python dictionaries are often known as “hash-maps” in other languages.
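
For illustration, a minimal coroutine sketch using the async/await syntax available since Python 3.5; the function and delays are arbitrary:

import asyncio

async def fetch(delay):
    # Coroutines suspend at await points, letting other tasks run
    await asyncio.sleep(delay)
    return delay

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(fetch(0.1), fetch(0.2)))
print(results)  # [0.1, 0.2]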

The piece I’m most excited about, however, is the introduction of type annotations that can be used to provide a degree of type checking. I believe this will take Python to a new level for enterprise adoption and introduce possibilities for interface design and program behavior that have been tricky until now. At Continuum, we’re looking forward to leveraging these in our Numba Just-In-Time compiler for Python.
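
As a small sketch of what Python 3.6 annotations look like (the names here are made up; checking is done by external tools such as mypy, not at runtime):

from typing import List

def mean(values: List[float]) -> float:
    # Annotations document the interface and enable static checks
    return sum(values) / len(values)

threshold: float = 0.5  # PEP 526 variable annotation, new in 3.6
print(mean([1.0, 2.0, 3.0]) > threshold)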

Join the Conversation

If you’re not already using Anaconda I’d encourage you to download it today. If you’re already an Anaconda user then you can update to the latest version with:

conda update conda anaconda

If you’re using Anaconda Navigator, update the “anaconda” package. If you want to get started with Python 3, then I’d recommend an O’Reilly booklet written by a colleague of mine, David Mertz, which describes the benefits and how to migrate to Python 3. It is called Picking a Python Version: A Manifesto.

I’d love to hear your thoughts and comments on Python 3 and Anaconda. Catch me on Twitter: @ijstokes.

by swebster at June 06, 2017 07:05 PM

Matthieu Brucher

Audio ToolKit: Mailing list for beta testers

Recently, I’ve struggled with releasing perfect plugins. There were some glitches in the last 2 plugins that could have been avoided easily with more testing.

So this is a call for people who are interested in my plugins. You can join the mailing list, share your ideas of the current plugins, and the future ones, about what can be made better, how it can be better…

by Matt at June 06, 2017 07:59 AM

June 05, 2017

Enthought

SciPy 2017 Conference to Showcase Leading Edge Developments in Scientific Computing with Python

Renowned scientists, engineers and researchers from around the world to gather July 10-16, 2017 in Austin, TX to share and collaborate to advance scientific computing tools


AUSTIN, TX – June 6, 2017 –
Enthought, as Institutional Sponsor, today announced the SciPy 2017 Conference will be held July 10-16, 2017 in Austin, Texas. At this 16th annual installment of the conference, scientists, engineers, data scientists and researchers will participate in tutorials, talks and developer sprints designed to foster the continued rapid growth of the scientific Python ecosystem. This year’s attendees hail from over 25 countries and represent academia, government, national research laboratories, and industries such as aerospace, biotechnology, finance, oil and gas and more.

“Since 2001, the SciPy Conference has been a highly anticipated annual event for the scientific and analytic computing community,” states Dr. Eric Jones, CEO at Enthought and SciPy Conference co-founder. “Over the last 16 years we’ve witnessed Python emerge as the de facto open source programming language for science, engineering and analytics with widespread adoption in research and industry. The powerful tools and libraries the SciPy community has developed are used by millions of people to advance scientific inquest and innovation every day.”

Special topical themes for this year’s conference are “Artificial Intelligence and Machine Learning Applications” and the “Scientific Python (SciPy) Tool Stack.” Keynote speakers include:

  • Kathryn Huff, Assistant Professor in the Department of Nuclear, Plasma, and Radiological Engineering at the University of Illinois at Urbana-Champaign  
  • Sean Gulick, Research Professor at the Institute for Geophysics at the University of Texas at Austin
  • Gaël Varoquaux, faculty researcher in the Neurospin brain research institute at INRIA (French Institute for Research in Computer Science and Automation)

In addition to the special conference themes, there will also be over 100 talk and poster paper speakers/presenters covering eight mini-symposia tracks including: Astronomy; Biology, Biophysics, and Biostatistics; Computational Science and Numerical Techniques; Data Science; Earth, Ocean, and Geo Sciences; Materials Science and Engineering; Neuroscience; and Open Data and Reproducibility.

New for 2017 is a sold-out “Teen Track,” a two-day curriculum designed to inspire the scientists of tomorrow.  From July 10-11, high school students will learn more about the Python language and how developers solve real world scientific problems using Python and its scientific libraries.

Conference and tutorial registration is open at https://scipy2017.scipy.org.

About the SciPy Conference

SciPy 2017, the sixteenth annual Scientific Computing with Python conference, will be held July 10-16, 2017 in Austin, Texas. SciPy is a community dedicated to the advancement of scientific computing through open source Python software for mathematics, science and engineering. The annual SciPy Conference allows participants from all types of organizations to showcase their latest projects, learn from skilled users and developers and collaborate on code development. For more information or to register, visit https://scipy2017.scipy.org.

About Enthought

Enthought is a global leader in scientific and analytic software, consulting and training solutions serving a customer base comprised of some of the most respected names in the oil and gas, manufacturing, financial services, aerospace, military, government, biotechnology, consumer products and technology industries. The company was founded in 2001 and is headquartered in Austin, Texas, with additional offices in Cambridge, United Kingdom and Pune, India. For more information visit www.enthought.com and connect with Enthought on Twitter, LinkedIn, Google+, Facebook and YouTube.

 

 

The post SciPy 2017 Conference to Showcase Leading Edge Developments in Scientific Computing with Python appeared first on Enthought Blog.

by admin at June 05, 2017 06:18 PM

June 03, 2017

Paul Ivanov

June 1st, 2017

We had another biannual Jupyter team meeting this week, this time it was right nearby in Berkeley. Since I had read a poem at the last meeting, I was encouraged to keep that going and decided to make this a tradition. Here's the result, as delivered this past Friday, recorded by Fernando Pérez (thanks, Fernando!).

June 1st, 2017

We struggle -- with ourselves and with each other
we plan -- we code and write
the pieces and ideas t'wards what we think is right
but we may disagree -- about
the means, about the goals, about the
shoulders we should stand on --
where we should stand, what we should stretch toward
shrink from, avoid, embrace --
a sense of urgency - but this is not a race

There's much to learn, to do...

Ours not the only path, no one coerced you here
You chose this -- so did I and here we are --
still at the barricades and gaining ground
against the old closed world:
compute communication comes unshackled

by Paul Ivanov at June 03, 2017 07:00 AM

June 02, 2017

Randy Olson

TPOT Automated Machine Learning Competition

Can AutoML beat humans on Kaggle? Automated Machine Learning (AutoML) is poised to make a transformative impact on data science in 2017. At the University of Pennsylvania, we’ve been working hard to develop TPOT, a state-of-the-art open source AutoML tool

by Randy Olson at June 02, 2017 05:54 PM

June 01, 2017

William Stein

RethinkDB must relicense NOW

What is RethinkDB?

UPDATE: Several months after I wrote this post, RethinkDB was relicensed. For the CoCalc project, it was too late, and by then we had already switched to PostgreSQL.


RethinkDB is an INCREDIBLE, high-quality, polished open source realtime database that is easy to deploy, shard and replicate, and that supports a reactive client programming model, which is useful for collaborative web-based applications. Shockingly, the 7-year-old company that created RethinkDB has just shut down. I am the CEO of a company, SageMath, Inc., that uses RethinkDB very heavily, so I have a strong interest in RethinkDB surviving as an independent open source project.

Three Types of Open Source Projects

There are many types of open source projects. RethinkDB was the type of open source project where most of the work has been full-time, focused work done by employees of the RethinkDB company. RethinkDB is licensed under the AGPL, but the company promised to make the software available to customers under other licenses.

Academia: I started the SageMath open source math software project in 2005, which has over 500 contributors and a relatively healthy volunteer ecosystem, with about a hundred contributors to each release, and many releases each year. These are mostly volunteer contributions by academics: usually grad students, postdocs, and math professors. They contribute because SageMath is directly relevant to their research, and they often contribute state of the art code that implements algorithms they have created or refined as part of their research. Sage is licensed under the GPL, and that license has worked extremely well for us. Academics sometimes even get significant grants from the NSF or the EU to support Sage development.

Companies: I also started the Cython compiler project in 2007, which has had dozens of contributors and is now the defacto standard for writing or wrapping fast code for use by Python. The developers of Cython mostly work at companies (e.g., Google) as a side project in their spare time. (Here's a message today about a new release from a Cython developer, who works at Google.) Cython is licensed under the Apache License.

What RethinkDB Will Become

RethinkDB will no longer be an open source project whose development is sponsored by a single company dedicated to the project. Will it be an academic project, a company-supported project, or dead?

A friend of mine at Oxford University surveyed his academic CS colleagues about RethinkDB, and they said they had zero interest in it. Indeed, from an academic research point of view, I agree that there is nothing interesting about RethinkDB. I myself am a college professor, and understand these people! Academic volunteer open source contributors are definitely not going to come to RethinkDB's rescue. The value in RethinkDB is not in the innovative new algorithms or ideas, but in the high quality carefully debugged implementations of standard algorithms (largely the work of bad ass German programmer Daniel Mewes). The RethinkDB devs had to carefully tune each parameter in those algorithms based on extensive automated testing, user feedback, the Jepsen tests, etc.

That leaves companies. Whether or not you like or agree with this, many companies will not touch AGPL licensed code:
"Google open source guru Chris DiBona says that the web giant continues to ban the lightning-rod AGPL open source license within the company because doing so "saves engineering time" and because most AGPL projects are of no use to the company."
This is just the way it is -- it's psychology and culture, so deal with it. In contrast, companies very frequently embrace open source code that is licensed under the Apache or BSD licenses, and they keep such projects alive. The extremely popular PostgreSQL database is licensed under an almost-BSD license. MySQL is freely licensed under the GPL, but there are good reasons why people buy a commercial MySQL license (from Oracle) for MySQL. Like RethinkDB, MongoDB is AGPL licensed, but they are happy to sell a different license to companies.

With RethinkDB today, the only option is the AGPL. This very strongly discourages use by the only possible group of users and developers that has any chance of keeping RethinkDB from death. If this situation is not resolved as soon as possible, I am extremely afraid that it never will be resolved. Ever. If you care about RethinkDB, you should be afraid too. Ignoring the landscape and culture of volunteer open source projects is dangerous.

A Proposal

I don't know who can make the decision to relicense RethinkDB. I don't know what is going on with investors or who is in control. I am an outsider. Here is a proposal that might provide a way out today:

PROPOSAL: Dear RethinkDB, sell me an Apache (or BSD) license to the RethinkDB source code. Make this the last thing your company sells before it shuts down. Just do it.


Hacker News Discussion

by William Stein (noreply@blogger.com) at June 01, 2017 01:25 PM

May 30, 2017

numfocus

Python in Astronomy 2017

#pyastro17 brought together 54 participants from five continents for a wide-ranging workshop on Python in Astronomy.

by Gina Helfrich at May 30, 2017 06:01 PM

Matthieu Brucher

Announcement: ATKGuitarPreamp 1.0.0

I’m happy to announce the release of a model of the Vox AC30 preamplifier stage followed by a JCM800 tone stack, based on the Audio ToolKit. It is available on Windows and OS X (min. 10.9) in different formats.

ATKGuitarPreamp

The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

Direct link for ATKGuitarPreamp.

The plugin, as well as the previous plugins and the source code, can be downloaded from SourceForge.


by Matt at May 30, 2017 07:51 AM

May 26, 2017

Enthought

Enthought at National Instruments’ NIWeek 2017: An Inside Look

This week I had the distinct privilege of representing Enthought at National Instruments‘ 23rd annual user conference, NIWeek 2017. National Instruments is a leader in test, measurement, and control solutions, and we share many common customers among our global scientific and engineering user base.

NIWeek kicked off on Monday with Alliance Day, where my colleague Andrew Collette and I went on stage to receive the LabVIEW Tools Network 2017 Product of the Year Award for Enthought’s Python Integration Toolkit, which provides a bridge between Python and LabVIEW, allowing you to create VI’s (virtual instruments) that make Python function and object method calls. Since its release last year, the Python Integration Toolkit has opened up access to a broad range of new capabilities for LabVIEW users,  by combining the best of Python with the best of LabVIEW. It was also inspiring to hear about the advances being made by other National Instruments partners. Congratulations to the award winners in other categories (Wineman Technology, Bloomy, and Moore Good Ideas)!

On Wednesday, Andrew gave a presentation titled “Building and Deploying Python-Powered LabVIEW Applications” to a standing-room-only crowd. He gave some background on the relative strengths of Python and LabVIEW (some of which is covered in our March 2017 webinar “Using Python and LabVIEW to Rapidly Solve Engineering Problems“) and then showcased some of the capabilities provided by the toolkit, such as plotting data acquisition results live to a web server using plotly, which is always a crowd-pleaser (you can learn more about that in the blog post “Using Plotly from LabVIEW via Python”). Other demos included making use of the Python scikit-learn library for machine learning (you can see Enthought’s CEO Eric Jones run that demo here, during the 2016 NIWeek keynotes).

For a mechanical engineer like me, attending NIWeek is a bit like giving a kid a holiday in a candy shop.  There was much to admire on the expo floor, with all kinds of mechatronic gizmos and gadgets.  I was most interested by the lightning-fast video and image processing possible with NI’s FPGA systems, like the part sorting system shown below.  Really makes me want to play around with nifpga.

Another thing really gaining traction is the implementation of machine learning for a number of applications. I attended one talk titled “Deep Learning With LabVIEW and Acceleration on FPGAs” that demonstrated image classification using a neural network and talked about strategies to reduce the code size to get it to fit on an FPGA.

Finally, of course, I was really excited by all of the activity in the Industrial Internet of Things (IIoT), which is an area of core focus for Enthought.  We have been in the big data analytics game for a long time, and writing software for hard science is in our company DNA. But this year especially, starting with the AIChE 2017 Spring Meeting and now at NIWeek 2017, it has been really energizing to meet with industry leaders and see some of the amazing things that are being implemented in the IIoT.  National Instruments has been a leader in the test and measurement sector for a long time, and they have been pioneers in IIoT.  Now it is easy to download and install an interface to Amazon S3 for LabVIEW, and just like that, your sensor is now a connected sensor … and your data is ready for analysis in Enthought’s Canopy Data platform.

After immersion in NIWeek, I guess you could say, I’ve been “LabVIEWed”:

The post Enthought at National Instruments’ NIWeek 2017: An Inside Look appeared first on Enthought Blog.

by Tim Diller at May 26, 2017 08:44 PM

Continuum Analytics news

Let’s Talk PyCon 2017 - Thoughts from the Anaconda Team

Friday, May 26, 2017
Peter Wang
Chief Technology Officer & Co-Founder

We’re not even halfway through the year, but 2017 has already been filled to the brim with dynamic presentations and action-packed conferences. This past week, the Anaconda team was lucky enough to attend PyCon 2017 in Portland, OR - the largest annual gathering for the community that uses and develops Python. We came, we saw, we programmed, we networked, we spoke, we ate, we laughed, and we learned. Some of our team members at the conference and I have shared details of our experiences - take a look and, if you attended, share your thoughts in the comment section below, or on Twitter @ContinuumIO

Did anything surprise you at PyCon? 

“I was surprised how many attendees were using Python for data. I missed last year's PyCon, and so comparing against PyCon 2015, there was a huge growth in the last two years. During Katy Huff's keynote, she asked how many people in the audience had degrees in science, and something like 40% of the people raised their hands. In the past, this was not the case - PyCon had a lot more "traditional" software developers.” - Peter Wang, CTO & co-founder, Anaconda

“Yes - how diverse the community is. Looking at the session topics provides an indicator about this, but having had somewhere between 60-80 interactions at the Anaconda booth, there was a huge range of discussions, all the way from "Tell me more about data science" to "I've been using Anaconda for years and am a huge fan" or "conda saved my life.” I also saw a huge range of roles and backgrounds among attendees, from enterprise, government, military, academic, and student to independent consultants. It was great to see a number of large players here: Facebook/Instagram, LinkedIn, Microsoft, Google, and Intel were all highly visible, supporting the community.” - Stephen Kearns, Product Marketing Manager, Anaconda

“What really struck me this year was how heavy the science and data science angles were from speakers, topics, exhibitors, and attendees.  The Thursday and Friday morning keynotes were Science + Python (Jake Vanderplas and Katy Huff), then the Sunday closing keynote was about containers and Kubernetes (Kelsey Hightower).” - Ian Stokes-Rees, Computational Scientist, Anaconda 

What was the most popular topic people were buzzing about? Was this surprising to you? 

“There's definitely a good feeling about the transition to Python 3 really happening, which has been a point of angst in the Python community for several years. To me, the sense of closure around this was palpable, in that people could spend their emotional energy talking about other things and not griping about ‘Python 2 vs. 3.’” - Peter Wang

“The talks! So great to see how fast the videos for the talks were getting posted.” - Stephen Kearns 

Did you attend any talks? Did any of them stand out? 

“Jake Vanderplas presented a well-researched and well-structured talk on the Python visualization landscape. The keynotes were all excellent. I appreciated the Instagram folks sharing their Python 3 migration story with everyone.” - Peter Wang

“There were some at-capacity tutorials by me on “Data Science Apps with Anaconda,” showing off our new Anaconda Project deployment capability and “Accelerating your Python Data Science code with Dask and Numba.” - Ian Stokes-Rees

How was the buzz around Anaconda at PyCon? 

“Awesome - we exhausted our entire supply of Anaconda Crew T-Shirts by the end of the second day. A conference first!” - Ian Stokes-Rees 

“It was great, and very positive. Lots of people were very interested in our various open source projects, but we also got a lot of interest from attendees in our enterprise offerings: commercially-supported Anaconda, our premium training, and the Anaconda Enterprise Data Science platform. In previous years, there were not as many people who I would characterize as "potential customers,” and this was a very positive change for us. I also think that it is a sign that the PyCon attendee audience is also changing, to include more people from the data science and machine learning ecosystem.” - Peter Wang

“Anaconda had lots of partnership engagement opportunities at the show, specifically with Intel, Microsoft and ESRI. It was exciting to hear Intel talk about how they’re using Anaconda as the channel for delivering optimized high performance Python, and great to see Microsoft giving SQL Server demonstrations of server-side Python using Anaconda. Lastly, great to hear that ESRI is increasing its Python interfaces to ArcGIS and have started to make the ArcGIS Python package available as a conda package from Anaconda Cloud.” - Ian Stokes-Rees

 

by swebster at May 26, 2017 04:58 PM

May 24, 2017

Filipe Saraiva

LaKademy 2017

LaKademy 2017 group photo

Some weeks ago we had the fifth edition of the KDE Latin-America summit, LaKademy. Since the first edition, the KDE community in Latin America has grown, and now we have several developers, translators, artists, promoters, and more people from the region involved in KDE activities.

This time LaKademy was held in Belo Horizonte, a nice city known for its amazing cachaça, cheese, homemade beers, cheese, hills, and of course, cheese. The city is very cosmopolitan, with plenty of options for activities and gastronomy, and the people are friendly. I would like to go back to Belo Horizonte, maybe on my next vacation.

LaKademy activities were held at CEFET, an educational technology institute. During the days of LaKademy there were political demonstrations and a general strike in the country, a consequence of the current political crisis here in Brazil. Although I support the demonstrations, I was in Belo Horizonte for the event, so I focused on the tasks while, in my mind, I was side-by-side with the workers on the streets.

Like in past editions, I worked a lot on Cantor, the mathematical software I maintain. This time the main tasks performed were an extensive set of reviews: revisions of pending patches, of the bug management system in order to close very old (and invalid) reports, and of the task management workboard, especially to ping developers with old tasks that had no comments in the last year.

There was some work to implement new features as well. I finished a backend refactoring in order to provide a recommended version of the programming language for each backend in Cantor. Since each programming language has its own planning and schedule, it is common for some version of a programming language not to be correctly supported by a Cantor backend (Sage, I am thinking of you). This feature presents a “recommended” version of the programming language supported by the Cantor backend, meaning that version was tested and will work correctly with Cantor. It is more of a workaround to maintain the sanity of the developer while trying to support 11 different programming languages.

Another feature I worked on, but have not finished, is an option to select different LaTeX processors in Cantor. Currently there are several LaTeX processors available (like pdflatex, pdftex, luatex, xetex, …), some of them with several additional features. This option will increase the versatility of Cantor and will allow the use of modern processors and their features in the software.

In addition to these tasks I fixed some bugs and helped Fernando Telles, my past SoK student, with some tasks in Cantor.

(Like in past editions)², at LaKademy 2017 I also worked on another set of tasks related to the management and promotion of KDE Brazil. I investigated how to bring back our unified feed of Brazilian blog posts, as in the old Planet KDE Português, which was used to send updates about KDE in Brazil to our social networks. Fred implemented the solution, so I updated this feed in the social networks, updated the contact e-mail used on those networks, and started a Bootstrap version of the LaKademy website (but the team is migrating to WordPress, so I think it will not be used). I also did a large review of the tasks on the KDE Brazil workboard, migrated last year from the TODO website. Besides all this, we had the promo meeting to discuss our actions in Latin America – all the tasks were documented in the workboard.

Of course, just as we worked intensely during those days, we also had a lot of fun between one push and the next. LaKademy is also an opportunity to meet old friends and make new ones. It is amazing to see the KDE fellows again, and I invite the newcomers to stay with us and come to the next LaKademy editions!

This year we had a problem that we must address in the next edition – all the participants were Brazilian. We need to think about how to integrate people from other Latin American countries into LaKademy. It would be bad if the event became only an Akademy-BR.

Filipe and Chicão

So, I give my greetings to the community and commit myself to continuing the work to grow Latin America into an important player in the development and future of KDE.

by Filipe Saraiva at May 24, 2017 08:37 PM

Enthought

Enthought Receives 2017 Product of the Year Award From National Instruments LabVIEW Tools Network

Python Integration Toolkit for LabVIEW recognized for extending LabVIEW connectivity and bringing the power of Python to applications in Test, Measurement and the Industrial Internet of Things (IIoT)

AUSTIN, TX – May 24, 2017 – Enthought, a global leader in scientific and analytic computing solutions, was honored this week by National Instruments with the LabVIEW Tools Network Platform Connectivity 2017 Product of the Year Award for its Python Integration Toolkit for LabVIEW.

First released at NIWeek 2016, the Python Integration Toolkit enables fast, two-way communication between LabVIEW and Python. With seamless access to the Python ecosystem of tools, LabVIEW users are able to do more with their data than ever before. For example, using the Toolkit, a user can acquire data from test and measurement tools with LabVIEW, perform signal processing or apply machine learning algorithms in Python, display the results in LabVIEW, then share them using a Python-enabled web dashboard.


Click to see the webinar “Using Python and LabVIEW to Rapidly Solve Engineering Problems” to learn more about adding capabilities such as machine learning by extending LabVIEW applications with Python.

“Python is ideally suited for scientists and engineers due to its simple, yet powerful syntax and the availability of an extensive array of open source tools contributed by a user community from industry and R&D,” said Dr. Tim Diller, Director, IIoT Solutions Group at Enthought. “The Python Integration Toolkit for LabVIEW unites the best elements of two major tools in the science and engineering world and we are honored to receive this award.”

Key benefits of the Python Integration Toolkit for LabVIEW from Enthought:

  • Enables fast, two-way communication between LabVIEW and Python
  • Provides LabVIEW users seamless access to tens of thousands of mature, well-tested scientific and analytic software packages in the Python ecosystem, including software for machine learning, signal processing, image processing and cloud connectivity
  • Speeds development time by providing access to robust, pre-developed Python tools
  • Provides a comprehensive out-of-the-box solution that allows users to be up and running immediately

Click to see the webinar “Introduction to the Python Integration Toolkit for LabVIEW” to learn more about the fast, two-way communication between Python and LabVIEW.

“Add-on software from our third-party developers is an integral part of the NI ecosystem, and we’re excited to recognize Enthought for its achievement with the Python Integration Toolkit for LabVIEW,” said Matthew Friedman, senior group manager of the LabVIEW Tools Network at NI.

The Python Integration Toolkit is available for download via the LabVIEW Tools Network, and also includes the Enthought Canopy analysis environment and Python distribution. Enthought’s training, support and consulting resources are also available to help LabVIEW users maximize their value in leveraging Python.

For more information on Enthought’s Python Integration Toolkit for LabVIEW, visit www.enthought.com/python-for-LabVIEW.

 

Additional Resources

Product Information

Python Integration Toolkit for LabVIEW product page

Download a free trial of the Python Integration Toolkit for LabVIEW

Webinars

Webinar: Using Python and LabVIEW to Rapidly Solve Engineering Problems | Enthought
April 2017

Webinar: Introducing the New Python Integration Toolkit for LabVIEW from Enthought
September 2016

About Enthought

Enthought is a global leader in scientific and analytic software, consulting, and training solutions serving a customer base comprised of some of the most respected names in the oil and gas, manufacturing, financial services, aerospace, military, government, biotechnology, consumer products and technology industries. The company was founded in 2001 and is headquartered in Austin, Texas, with additional offices in Cambridge, United Kingdom and Pune, India. For more information visit www.enthought.com and connect with Enthought on Twitter, LinkedIn, Google+, Facebook and YouTube.

About NI

Since 1976, NI (www.ni.com) has made it possible for engineers and scientists to solve the world’s greatest engineering challenges with powerful platform-based systems that accelerate productivity and drive rapid innovation. Customers from a wide variety of industries – from healthcare to automotive and from consumer electronics to particle physics – use NI’s integrated hardware and software platform to improve the world we live in.

About the LabVIEW Tools Network

The LabVIEW Tools Network is the NI app store equipping engineers and scientists with certified, third-party add-ons and apps to complete their systems. Developed by industry experts, these cutting-edge technologies expand the power of NI software and modular hardware. Each third-party product is reviewed to meet specific guidelines and ensure compatibility. With hundreds of products available, the LabVIEW Tools Network is part of a rich ecosystem extending the NI Platform to help customers positively impact our world. Learn more about the LabVIEW Tools Network at www.ni.com/labview-tools-network.

LabVIEW, National Instruments, NI and ni.com and NIWeek are trademarks of National Instruments. Enthought, Canopy and Python Integration Toolkit for LabVIEW are trademarks of Enthought, Inc.

Media Contact

Courtenay Godshall, VP, Marketing, +1.512.536.1057, cgodshall@enthought.com

The post Enthought Receives 2017 Product of the Year Award From National Instruments LabVIEW Tools Network appeared first on Enthought Blog.

by admin at May 24, 2017 01:42 PM

May 23, 2017

numfocus

Welcome Nancy Nguyen, the new NumFOCUS Events Coordinator!

NumFOCUS is pleased to announce Nancy Nguyen has been hired as our new Events Coordinator. Nancy has over five years of event management experience in the non-profit and higher education sectors. She graduated from The University of Texas at Austin in 2011 with a BA in History. Prior to joining NumFOCUS, Nancy worked in development and fundraising […]

by Gina Helfrich at May 23, 2017 04:48 PM

Matthieu Brucher

Announcement: ATKBassPreamp 1.0.0

I’m happy to announce the release of a modeling of the Fender Bassman preamplifier stage based on the Audio Toolkit. They are available on Windows and OS X (min. 10.9) in different formats.

ATKBassPreamp

The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

Direct link for ATKBassPreamp.

Update: it seems that a bug in the editor window reset the knobs to their default values. This is fixed in 1.0.1.
Update: people have reported an issue with the dry/wet mix. This is fixed in 1.0.2.

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.


by Matt at May 23, 2017 07:47 AM

May 22, 2017

numfocus

NumFOCUS Awards Small Development Grants to Projects

This spring the NumFOCUS Board of Directors awarded targeted small development grants to applicants from or approved by our sponsored and affiliated projects. In the wake of a successful 2016 end-of-year fundraising drive, NumFOCUS wanted to direct the donated funds to our projects in a way that would have impact and visibility to donors and […]

by Gina Helfrich at May 22, 2017 09:52 PM

May 17, 2017

numfocus

What is it like to chair a PyData conference?

Have you ever wondered what it’s like to be in charge of a PyData event? Vincent Warmerdam has collected some thoughts and reflections on his experience chairing this year’s PyData Amsterdam conference: This year I was the chair of PyData Amsterdam and I’d like to share some insights on what that was like. I was on the committee the […]

by Gina Helfrich at May 17, 2017 03:09 PM

May 16, 2017

Matthieu Brucher

Book review: OpenGL Data Visualization Cookbook

This review will actually be quite quick: I haven’t finished the book and I won’t finish it.

The book was published in August 2015 and is based on OpenGL 3. The authors may sometimes say that you can use shaders to do better, but the fact is that if you want to execute the code they propose, you need to use the backward compatibility layer, if it's available. OpenGL was published almost a decade ago; I can't understand why in 2015 two guys decided that a new book on scientific visualization should use an API that was deprecated a long time ago. What a waste of time.

by Matt at May 16, 2017 07:06 AM

May 15, 2017

Pierre de Buyl

Developing a Cython library

For some time, I have used Cython to accelerate parts of Python programs. One stumbling block in going from Python/NumPy code to Cython code is the fact that one cannot access NumPy's random number generator from Cython without explicit declaration. Here, I give the steps to make a pip-installable 'cimportable' module.

The aim

The aim is to start from Python code reading

import numpy as np

N=100
x=0
for i in range(N):
    x = x + np.random.normal()

and end up with Cython code reading

cimport numpy as np

cdef int i, N
cdef double x
N = 100
x = 0
for i in range(N):
    x = x + np.random.normal()

The obvious benefit is being able to use the same random number generator module, with a simple interface.

The challenge

Building a cimportable module just depends on having a corresponding .pxd file available in the path. The idea behind .pxd files is that they contain the C-level (or cdef-level) declarations, whereas the implementation goes in the .pyx file with the same basename.
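
As a minimal sketch of the packaging side (the package and file names here are hypothetical, not the ones used in this post), the key point is to ship the .pxd file alongside the compiled extension so that other packages can cimport it:

# setup.py (sketch): package a Cython module together with its .pxd declarations
from setuptools import setup
from Cython.Build import cythonize

setup(
    name='myrandom',                         # hypothetical package name
    packages=['myrandom'],
    ext_modules=cythonize('myrandom/*.pyx'),
    package_data={'myrandom': ['*.pxd']},    # install the .pxd next to the extension module
    zip_safe=False,                          # the .pxd must exist as a real file on disk
)

Another package can then cimport the module from its own .pyx files, provided the directory containing the installed .pxd is visible to Cython (for example via cythonize's include_path argument).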

by Pierre de Buyl at May 15, 2017 09:00 AM

May 14, 2017

Titus Brown

How to analyze, integrate, and model large volumes of biological data - some thoughts

This blog post stems from notes I made for a 12 minute talk at the Oregon State Microbiome Initiative, which followed from some previous thinking about data integration on my part -- in particular, Physics ain't biology (and vice versa) and What to do with lots of (sequencing) data.

My talk slides from OSU are here if you're interested.

Thanks to Andy Cameron for his detailed pre-publication peer review - any mistakes remaining are of course mine, not his ;).


Note: During the events below, I was just a graduate student. So my perspective is probably pretty limited. But this is what I saw and remember!

My graduate work was in Eric Davidson's lab, where we studied early development in the sea urchin. Eric had always been very interested in gene expression, and over the preceding decade or two (1980s and onwards) had invested heavily in genomic technologies. This included lots of cDNA macroarrays and BAC libraries, as well as (eventually) the sea urchin genome project.

The sea urchin is a great system for studying early development! You can get literally billions of synchronously developing embryos by fertilizing all the eggs simultaneously; the developing embryo is crystal clear and large enough to be examined using a dissecting scope; sea urchins are available world-wide; early development is mostly invariant with respect to cell lineage (although that comes with a lot of caveats); and sea urchin embryos have been studied since the 1800s, so there was a lot of background literature on the embryology.

The challenge: data integration without guiding theory

What we were faced with in the '90s and '00s was a challenge provided by the scads of new molecular data provided by genomics: that of data integration. We had discovered plenty of genes (all the usual homologs of things known in mice and fruit flies and other animals), we had cell-type specific markers, we could measure individual gene expression fairly easily and accurately with qPCR, we had perturbations working with morpholino oligos, and we had reporter assays working quite well with CAT and GFP. And now, between BAC sequencing and cDNA libraries and (eventually) genome sequencing, we had tons of genomic and transcriptomic data available.

How could we make sense of all of this data? It's hard to convey the confusion and squishiness of a lot of this data to anyone who hasn't done hands-on biology research; I would just say that single experiments or even collections of many experiments rarely provided a definitive answer, and usually just led to new questions. This is not rare in science, of course, but it typically took 2-3 years to figure out what a specific transcription factor might be doing in early development, much less nail down its specific upstream and downstream connections. Scale that to the dozens or 100s of genes involved in early development and, well, it was a lot of people, a lot of confusion, and a lot of discussion.

The GRN wiring diagram

To make a longer story somewhat shorter:

Eric ended up leading an effort (together with Hamid Bolouri, Dave McClay, Andy Cameron, and others in the sea urchin community) to build a gene regulatory network that provided a foundation for data integration and exploration. You can see the result here:

http://sugp.caltech.edu/endomes/

This network at its core is essentially a map of the genomic connections between genes (transcriptional regulation of transcription factors, together with downstream connections mediated by specific binding sites and signaling interactions between cells, as well as whatever other information we had). Eric named this "the view from the genome." On top of this is layered several different "views from the nucleus", which charted the different regulatory states initiated by asymmetries such as the localization of beta cadherin to the vegetal pole of the egg, and the location of sperm entry into the egg.

At least when it started, the network served primarily as a map of the interactions - a somewhat opinionated interpretation of both published and unpublished data. Peter et al., 2012 showed that the network could be used for in silico perturbations, but I don't know how much has been followed up on. During my experiences with it, it mainly served as a communications medium and a point of reference for discussions about future experiments as well as an integrative guide to published work.

What was sort of stunning in hindsight is the extent to which this model became a touchpoint for our lab and (fairly quickly) the community that studied sea urchin early development. Eric presented the network one year at the annual Developmental Biology of the Sea Urchin meeting, and by the next meeting, 18 months later, I remember it showing up in a good portion of talks from other labs. (One of my favorite memories is someone from Dave McClay's lab - I think it was Cyndi Bradham - putting up a view of the GRN inverted to make signaling interactions the core focus, instead of transcriptional regulation; heresy in Eric's lab!)

In essence, the GRN became a community resource fairly quickly. It was provided in both image and interactive form (using BioTapestry), and people felt free to edit and modify the network for their own presentations. It readily enabled in silico thought experiments - "what happens if I knock out this gene? The model predicts this, and this, and this should be downstream, and this other gene should be unaffected" that quickly led to choosing impactful actual experiments. In part because of this, arguments about the effects of specific genes quickly converged to conversation about how to test the arguments (for some definition of "quickly" and "conversation" - sometimes discussions were quite, ahem, robust in Eric's lab and the larger community!)

The GRN also served to highlight the unknowns and the insufficiencies in the model. Eric and others spent a lot of time thinking through questions such as this: "we know that transcription of gene X is repressed by gene Y; but something must still activate gene X. What could it be?" Eventually we did "crazy" things like measure the transcriptional levels and spatial expression patterns of all ~1000 transcription factors found in the sea urchin genome, which could then be directly integrated into the model for further testing.

In short, the GRN was a pretty amazing way for the community of people interested in early development in the sea urchin to communicate about the details. Universal agreement wasn't the major outcome, although I think significant points about early development were settled in part through the model - communication was the outcome.

And, importantly, it served as a central meeting point for data analysis. More on this below.

Missed opportunities?

One of the major missed opportunities (in my view, obviously - feel free to disagree, the comment section is below :) was that we never turned the GRN into a model that was super easy for experimentalists to play with. It would have required significant software development effort to make it possible to do click-able gene knockdown followed by predicted phenotype readout -- but this hasn't been done yet; apparently it has been tough to find funding for this purpose. Had I stayed in the developmental biology game, I like to think I would have invested significant effort in this kind of approach.

I also don't feel like much time was invested in the community annotation and updating aspect of things. The official model was tightly controlled by a few people (in the traditional scientific "experts know best!" approach) and there was no particular attempt to involve the larger community in annotating or updating the model except through 1-1 conversations or formal publications. It's definitely possible that I just missed it, because I was just a graduate student, and by mid-2004 I had also mentally checked out of grad school (it took me a few more years to physically check out ;).

Taking and holding ground

One question that occupies my mind a lot is the question of how we learn, as a community, from the research and data being produced in each lab. With data, one answer is to work to make the data public, annotate it, curate it, make it discoverable - all things that I'm interested in.

With research more broadly, though, it's more challenging. Papers are relatively poor methods for communicating the results of research, especially now that we have the Internet and interactive Web sites. Surely there are better venues (perhaps ones like Distill, the interactive visual journal for machine learning research). Regardless, the vast profusion of papers on any possible topic, combined with the array of interdisciplinary methods needed, means that knowledge integration is slow and knowledge diffusion isn't much faster.

I fear this means that when it comes to specific systems and questions, we are potentially forgetting many things that we "know" as people retire or move on to other systems or questions. This is maybe to be expected, but when we confront the level of complexity inherent in biology, with little obvious convergence between systems, it seems problematic to repose our knowledge in dead tree formats.

Mechanistic maps and models for knowledge storage and data integration

So perhaps the solution is maps and models, as I describe above?

In thinking about microbiomes and microbial communities, I'm not sure what form a model would take. At the most concrete and boring level, a directly useful model would be something that took in a bunch of genomic/transcriptomic/proteomic data and evaluated it against everything that we knew, and then sorted it into "expected" and "unexpected". (This is what I discussed a little bit in my talk at OSU.)

The "expected" would be things like the observation of carbon fixation pathways in well-understood autotrophs - "yep, there it is, sort of matches what we already see." The "unexpected" would be things like unannotated or poorly understood genes that were behaving in ways that suggested they were correlated with whatever conditions we were examining. Perhaps we could have multiple bins of unexpected, so that we could separate out things like genes where the genome, transcriptome, and proteome all provided evidence of expression versus situations where we simply saw a transcript with no other kind of data. I don't know.

If I were to indulge in fanciful thinking, I could imagine a sort of Maxwell's Daemon of data integration, sorting data into bins of "boring" and "interesting", churning through data sets looking for a collection of "interesting" that correlated with other data sets produced from the same system. It's likely that such a daemon would have to involve some form of deep correlational analysis and structure identification - deep learning comes to mind. I really don't know.

One interesting question is, how would this interact with experimental biology and experimental biologists? The most immediately useful models might be the ones that worked off of individual genomes, such as flux-balance models; they could be applied to data from new experimental conditions and knockouts, or shifted to apply to strain variants and related species and look for missing genes in known pathways, or new genes that looked potentially interesting.

So I don't know a lot. All I do know is that our current approaches for knowledge integration don't scale to the volume of data we're gathering or (perhaps more importantly) to the scale of the biology we're investigating, and I'm pretty sure computational modeling of some sort has to be brought into the fray in practical ways.

Perhaps one way of thinking about this is to ask what types of computational models would serve as good reference resources, akin to a reference genome. The microbiome world is surprisingly bereft of good reference resources, with the 16s databases and IMG/M serving as two of the big ones; but we clearly need more, along the vein of a community KEGG and other such resources, curated and regularly updated.

Some concluding thoughts

Communication of understanding is key to progress in science; we should work on better ways of doing that. Open science (open data, open source, open access) is one way of better communicating data, computational methods, and results.

One theme that stood out for me from the microbiome workshop at OSU was that of energetics, a point that Stephen Giovanonni made most clearly. To paraphrase, "Microbiome science is limited by the difficulty of assessing the pros and cons of metabolic strategies." The guiding force behind evolution and ecology in the microbial world is energetics, and if we can get a mechanistic handle on energy extraction (autotrophy and heterotrophy) in single genomes and then graduate that to metagenome and community analysis, maybe that will provide a solid stepping stone for progress.

I'm a bit skeptical that the patterns that ecology and evolution can predict will be of immediate use for developing a predictive model. On the other hand, Jesse Zaneweld at the meeting presented on the notion that all happy microbiomes look the same, while all dysfunctional microbiomes are dysfunctional in their own special way; and Jesse pointed towards molecular signatures of dysfunction; so perhaps I'm wrong :).

It may well be that our data is still far too sparse to enable us to build a detailed mechanistic understanding of even simple microbial ecosystems. I wouldn't be surprised by this.

Trent Northern from the JGI concluded in his talk that we need model ecosystems too; absolutely! Perhaps experimental model ecosystems, either natural or fabricated, can serve to identify the computational approaches that will be most useful.

Along this vein, are there a natural set of big questions and core systems for which we could think about models? In the developmental biology world, we have a few big model systems that we focused on (mouse, zebrafish, fruit fly, and worm) - what are the equivalent microbial ecosystems?

All things to think about.

--titus

p.s. There are a ton of references and they can be fairly easily found, but a decent starting point might be Davidson et al., 2002, "A genomic regulatory network for development."

by C. Titus Brown at May 14, 2017 10:00 PM

May 13, 2017

Matthieu Brucher

Analog modeling: Comparing preamps

In a previous post, I explained how I modeled the triode inverter circuit. I’ve decided to put it inside two different plugins, so I’d like to present in 4 pictures their differences.

Preamps plugins

One of the plugins will start as a model of the Fender Bassman preamp (inverter circuit, followed by its associated tone stack), and the other will model the inverter section of a Vox AC30 (followed by the tone stack of a JCM800). Compared to the default preamp of Audio ToolKit, the behaviors are quite different, all from just a few differences in the values of the components:

(Figures: preamp behavior at 30 Hz, 200 Hz, 1 kHz and 10 kHz.)

I will probably add an option to use a different triode model (Audio Toolkit has lots of options, and I definitely need to present the differences in terms of quality and performance), and perhaps also a way of selecting a different tone stack. But for now, the plugins will offer a single model of a triode inverter followed by a tone stack. To model a full amp, you still need to model the final stage and the loudspeaker.

The next picture displays the preamp behavior depending on the triode function used:

(Figure: response with different triode functions.)

The Leach model and the original Koren model don’t behave as well as the other models. This is probably due to different parameters, but they still give a good idea of the behavior of the tube. The modified Munro-Piazza model is a personal modification of the tube function to make the derivative continuous as well. It helps convergence when the state of the tube changes fast, even though it is clearly not enough to remove all discontinuities.

The following picture shows the cost of the different models, profiled with valgrind:

(Figure: triode preamp profiles.)

Obviously, for the more complex functions, the time is spent computing the logarithm of a double number. This is because the fast math functions in ATK don’t support this function yet. If we use floating point numbers, the cost is divided by 3 for the Koren model (for instance).

As the results are quite close, using floats or doubles is a matter of optimization and precision. In the plugins, I will use floats to maximize performance.

Coming next

In the next two weeks, the two plugins will be released. And depending on feedback and comments, I’ll add more options.


by Matt at May 13, 2017 07:19 AM

May 12, 2017

Trichech

Things to Pay Attention to When Selling Nasi Box

Selling nasi box (boxed rice meals) in Jakarta is one of the businesses most taken up lately. This is because more and more kinds of events are being held. As a result, this business can become an abundant and profitable source of income. The reason many people order boxed rice is that it is simple and hassle-free, something you do not get when cooking food in large quantities yourself. Cooking dishes yourself to serve a large number of guests is certainly tiring and far from cheap.

Nowadays the nasi box business can be a way to earn a living. The income from this business can cover all of a family's needs, from education and health costs to other primary and secondary needs. Those needs can be met as long as you know and understand what to pay attention to when running this culinary business.

When running a nasi box business it is very important to consider the sales location. The address is a key point because it is the first thing customers will look for when they want to place an order. Besides that, a telephone number that can be reached also matters a great deal. On top of these two things, you also need to ensure the availability of raw ingredients to cook. If supplies run short, the business will not run at its best, and in the worst case customers will leave and switch to another boxed rice seller.

by admin at May 12, 2017 02:05 AM

Hello world!

Welcome to WordPress. This is your first post. Edit or delete it, then start writing!

by admin at May 12, 2017 02:00 AM

May 11, 2017

Leonardo Uieda

Reviews of our Scipy 2017 talk proposal: Bringing the Generic Mapping Tools to Python


This year, Scipy is using a double-open peer-review system, meaning that both authors and reviewers know each other's identities. These are the reviews that we got for our proposal and our replies/comments (posted with permission from the reviewers). My sincerest thanks to all reviewers and editors for their time and effort.

The open review model is great because it increases the transparency of the process and might even result in better reviews. I started signing my reviews a few years ago and I found that I'm more careful with the tone of my review to make sure I don't offend anyone and provide constructive feedback.

Now, on to the reviews!

Review 1 - Paul Celicourt

The paper introduces a Python wrapper for the C-based Generic Mapping Tools used to process and analyze time series and gridded data. The content is well organized, but I encourage the authors to consider the following comments: While the authors promise to demonstrate an initial prototype of the wrapper, it is not sure that a WORKING prototype will be available by the time of the conference as claimed by the authors when looking at the potential functionalities to be implemented and presented in the second paragraph of the extended abstract. Furthermore, it is not clear what would be the functionalities of the initial prototype. On top of that, the approach to the implementation is not fully presented. For instance, the Simplified Wrapper and Interface Generator (SWIG) tool may be used to reduce the workload but the authors do not mention whether the wrapper would be manually developed or using an automated tool such as the SWIG. Finally, the portability of the shared memory process has not been addressed.

Thanks for all your comments, Paul! They are good questions and we should have addressed them better in the abstract.

That is a valid concern regarding the working prototype. We're not sure how much of the prototype will be ready for the conference. We are sure that we'll have something to show, even if it's not complete. The focus of the talk will be on our design decisions, implementation details, and the changes in the GMT modern execution mode on which the Python wrappers are based. We'll run some examples of whatever we have working mostly for the "Oooh"s and "Aaah"s.

The wrapper will be manually generated using ctypes. We chose this over SWIG or Cython because ctypes allows us to write pure Python code. It's a much simpler way of wrapping a C library. Not having any compiled extension modules also greatly facilitates distributing the package across operating systems. The same wrapper code can work on Windows, OSX, and Linux (as long as the GMT shared library is available).

The amount of C functions that we'll have to wrap is not that large. Mainly, we need GMT_Call_Module to run a command (like psxy), GMT_Create_Session for generating the session structure, and GMT_Open_VirtualFile and GMT_Read_VirtualFile for passing data to and from Python. The majority of the work will be in creating the Python functions for each GMT command, documenting them, and parsing the Python function arguments into something that GMT_Call_Module accepts. This work would have to be done manually using SWIG or Cython as well, so ctypes is not a disadvantage with regard to this. There are some more details about this in our initial design and goals.
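
To make that concrete, here is a rough ctypes sketch of the approach (this is not the actual wrapper code: the session arguments, module options, and simplified C signatures below are illustrative assumptions, and details differ between GMT versions):

import ctypes
import ctypes.util

# Load the GMT shared library; assumes libgmt is installed and discoverable.
libgmt = ctypes.CDLL(ctypes.util.find_library('gmt'))

# Simplified declarations for the two core entry points mentioned above.
libgmt.GMT_Create_Session.restype = ctypes.c_void_p
libgmt.GMT_Call_Module.argtypes = [ctypes.c_void_p, ctypes.c_char_p,
                                   ctypes.c_int, ctypes.c_char_p]

# Create an API session, then run a module as if it were a command line call.
session = libgmt.GMT_Create_Session(b'gmt-python-sketch', 2, 0, None)
status = libgmt.GMT_Call_Module(session, b'psbasemap', 0,
                                b'-R0/10/0/10 -JX10c -Ba -P ->sketch.ps')
assert status == 0, 'GMT module call failed'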

Review 2 - Ricardo Barros Lourenço

The authors submitted a clear abstract, in the sense that they will present a new Python library, which is a binding to the Generic Mapping Tools (GMT) C library, which is widely adopted by the Geosciences community. They were careful in detailing their reasoning in such implementation, and also in analogue initiatives by other groups.

In terms of completeness, the abstract precisely describes that the design plans and some of the implementation would be detailed and explained, as well on a demo of their current version of the library. It was very interesting that the authors, while describing their implementation, also pointed that the library could be used in other applications not necessarily related to geoscientific applications, by the generation of general line plots, bar graphs, histograms, and 3D surfaces. It would be beneficial to the audience to see how this aspect is sustained, by comparing such capabilities with other libraries (such as Matplotlib and Seaborn) and evaluating their contribution to the geoscientific domain, and also on the expanded related areas.

The abstract is highly compelling to the Earth Sciences community members at the event because the GMT module is already used for high-quality visualization (both in electronic, but also in printed outputs - maps - which is an important contribution to) , but with a Python integration it could simplify the integration of "Pythonic" workflows into it, expanding the possibilities in geoscientific visualization, especially in printed maps.

It would be interesting, aside from a presumed comparison in online visualization with matplotlib and cartopy, if the authors would also discuss in their presentation other possible contributions, such as online tile generation in map servers, which is very expensive in terms of computational resources and is still is challenging in an exclusive "Pythonic" environment. Additionally, it would be interesting if the authors provide some clarification if there is any limitation on the usage of such library, more specifically to the high variance in geoscientific data sources, and also in how netCDF containers are consumed in their workflow (considering that these containers don't necessarily conform to a strict standard, allowing users to customize their usage) in terms of the automation of this I/O.

The topic of high relevance because there is still few options for spatial data visualization in a "fully pythonic" environment, and none of them is used in the process of plotting physical maps, in a production setting, such as GMT is. Considering these aspects, I recommend such proposal for acceptance.

Thank you, Ricardo, for your incentives and suggestions for the presentation!

I hadn't thought about the potential use in map tiling but we'll keep an eye on that from now on and see if we have anything to say about it. Thanks!

Regarding netCDF, the idea is to leverage the xarray library for I/O and use their Dataset objects as input and output arguments for the grid related GMT commands. There is also the option of giving the Python functions the file name of a grid and have GMT handle I/O, as it already does in the command line. The appeal of using xarray is that it integrates well with numpy and pandas and can be used instead of gmt grdmath (no need to tie your head in knots over RPN anymore!).
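
As a rough illustration of that xarray-based I/O idea (the file and variable names here are hypothetical, and this snippet is independent of the GMT wrapper itself):

import xarray as xr

grid = xr.open_dataset('relief.nc')                      # read a netCDF grid directly
subset = grid.sel(lon=slice(-30, 0), lat=slice(30, 60))  # label-based slicing of the grid
mean_height = float(subset['elevation'].mean())          # numpy-style reductions, no RPN required
print(mean_height)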

Review 3 - Ryan May

Python bindings for GMT, as demonstrated by the authors, are very much in demand within the geoscience community. The work lays out a clear path towards implementation, so it's an important opportunity for the community to be able offer API and interaction feedback. I feel like this talk would be very well received and kick off an important dialogue within the geoscience Python community.

Thanks, Ryan! Getting community feedback was the motivation for submitting a talk without having anything ready to show yet. It'll be much easier to see what the community wants and thinks before we have fully committed to an implementation. We're very much open and looking forward to getting a ton of questions!


What would you like to see in a GMT Python library? Let us know if there are any questions/suggestions before the conference. See you at Scipy in July!


Thumbnail image for this post is modified from "ScientificReview" by the Center for Scientific Review which is in the public domain.


Comments? Leave one below or let me know on Twitter @leouieda or in the Software Underground Slack group.

Found a typo/mistake? Send a fix through Github and I'll happily merge it (plus you'll feel great because you helped someone). All you need is an account and 5 minutes!


May 11, 2017 12:00 PM

May 09, 2017

Matthieu Brucher

Book review: Getting Started With JUCE

After the announcement of the JUCE 5 release, I played a little bit with it, and then decided to read the only book on JUCE. It’s outdated and covers JUCE 2.1.2. But who knows, it may be a gem?

Content and opinions

The book starts with a chapter on JUCE and its installation. While the main application changed its name from Introjucer to Projucer (probably because of the change in licence), the rest seems to be quite similar. The app still creates a Visual Studio solution, an Xcode project or makefiles.

The second chapter was the one I was most interested in, because I plan on creating new components and GUIs for my plugins. The chapter still feels a little bit short; we could have worked more with overriding a custom look and feel, handling more events… It seems that not much has changed in this area since JUCE 2.1, so this is quite an achievement for ROLI and the team. If my components can last several major releases, I know I can invest time in learning JUCE.

The next chapter feels useless with C++11/14 or even Boost. It seems JUCE still uses a custom String class, which is too bad; I’m not sure it really brings anything to the library. There are also other APIs that are, in my opinion, now deprecated (data types like int32, file system handling…).

The fourth chapter deals with streaming data and building a small app that can play sound. It is a nice feature, but I have to say I read it even faster than the previous chapter because it was of no interest to me.

Finally, the last chapter ends the book with a sudden note on some utilities (I can’t even remember them without reading the chapter again), no final conclusion, and a feeling that the book was more a list of tutorials than a real book.

Conclusion

I wouldn’t recommend buying the book, as the tutorials cover the only bits that are still relevant. But I can hope for an updated version, one day.

by Matt at May 09, 2017 07:49 AM

May 08, 2017

Continuum Analytics news

Anaconda Joins Forces with Leading Companies to Further Innovate Open Data Science

Monday, May 8, 2017
Travis Oliphant
President, Chief Data Scientist & Co-Founder

In addition to announcing the formation of the GPU Open Analytics Initiative with H2O and MapD, today, we are pleased to announce an exciting collaboration with NVIDIA, H2O and MapD, with a goal of democratizing machine learning to increase performance gains of data science workloads. Using NVIDIA’s Graphics Processing Unit (GPU) technology, Anaconda is mobilizing the Open Data Science movement by helping teams avoid the data transfer process between Central Processing Units (CPUs) and GPUs and move toward their larger business goals. 

The new GPU Data Frame (GDF) will augment the Anaconda platform as the foundational fabric to bring data science technologies together allowing it to take full advantage of GPU performance gains. In most workflows using GPUs, data is first manipulated with the CPU and then loaded to the GPU for analytics. This creates a data transfer “tax” on the overall workflow.   With the new GDF initiative, data scientists will be able to move data easily onto the GPU and do all their manipulation and analytics at the same time without the extra transfer of data. With this collaboration, we are opening the door to an era where innovative AI applications can be deployed into production at an unprecedented pace and often with just a single click.

In a nutshell, this collaboration provides these key benefits:

  • Python Democratization. GPU Data Frame makes it easy to create new optimized data science models and iterate on ideas using the most innovative GPU and AI technologies.

  • Python Acceleration. The standard empowers data scientists with unparalleled acceleration within Python on GPUs for data science workloads, enabling Open Data Science to proliferate across the enterprise.

  • Python Production. Data science teams can move beyond ad-hoc analysis to unearthing game-changing results within production-deployed data science applications that drive measurable business impact.

Anaconda aims to bring the performance, insights and intelligence enterprises need to compete in today’s data-driven economy. We’re excited to be working with NVIDIA, mapD, and H2O as GPU Data Frame pushes the door to Open Data Science wide open by further empowering the data scientist community with unparalleled innovation, enabling Open Data Science to proliferate across the enterprise.

by swebster at May 08, 2017 08:51 PM

Anaconda Easy Button - Microsoft SQL Server and Python

Tuesday, May 9, 2017
Ian Stokes-Rees
Continuum Analytics

Previously there were many twisty roads that you may have followed if you wanted to use Python on a client system to connect to a Microsoft SQL Server database, and not all of those roads would even get you to your destination. With the news that Microsoft SQL Server 2017 has increased support for Python, by including a subset of Anaconda packages on the server-side, I thought it would be useful to demonstrate how Anaconda delivers the easy button to get Python on the client side connected to Microsoft SQL Server.

This blog post demonstrates how Anaconda and Anaconda Enterprise can be used on the client side to connect Python running on Windows, Mac, or Linux to a SQL Server instance. The instructions should work for many versions of SQL Server, Python and Anaconda, including Anaconda Enterprise, our commercially oriented version of Anaconda that adds strong collaboration, security, and server deployment capabilities. If you run into any trouble let us know either through the Anaconda Community Support mailing list or on Twitter @ContinuumIO.

TL;DR: For the Impatient

If you're the kind of person who just wants the punch line and not the story, there are three core steps to connect to an existing SQL Server database:

  1. Install the SQL Server drivers for your platform on the client system. That is described in the Client Driver Installation section below.

  2. conda install pyodbc

  3. Establish a connection to the SQL Server database with an appropriately constructed connection statement:

     import pyodbc  # the package installed in step 2

     conn = pyodbc.connect(
        r'DRIVER={ODBC Driver 13 for SQL Server};' +
        ('SERVER={server},{port};'   +
         'DATABASE={database};'      +
         'UID={username};'           +
         'PWD={password}').format(
                server= 'sqlserver.testnet.corp',
                  port= 1433,
              database= 'AdventureWorksDW2012',
              username= 'tanya',
              password= 'Tanya1234')
    )
    

For cut-and-paste convenience, here's the string:

 (r'DRIVER={ODBC Driver 13 for SQL Server};' +
 ('SERVER={server},{port};'   +
  'DATABASE={database};'      +
  'UID={username};'           +
  'PWD={password}').format(
                server= 'sqlserver.testnet.corp',
                  port= 1433,
              database= 'AdventureWorksDW2012',
              username= 'tanya',
              password= 'Tanya1234')
)
'DRIVER={ODBC Driver 13 for SQL Server};SERVER=sqlserver.testnet.corp,1433;DATABASE=AdventureWorksDW2012;UID=tanya;PWD=Tanya1234'

Hopefully that doesn't look too intimidating!

Here's the scoop: the Python piece is easy (yay Python!), whereas the challenges are installing the platform-specific drivers (Step 1) and, if you don't already have a database properly set up, the SQL Server installation, configuration, database loading, and setting up of appropriate security credentials. Those are the parts the rest of this blog post goes into in more detail, along with a fully worked example of a client-side connection and query.

And you can grab a copy of this blog post as a Jupyter Notebook from Anaconda Cloud.

On With The Story

While this isn't meant to be an exhaustive reference for SQL Server connectivity from Python and Anaconda it does cover several client/server configurations. In all cases I was running SQL Server 2016 on a Windows 10 system. My Linux system was CentOS 6.9 based. My Mac was running macOS 10.12.4, and my client-side Windows system also used Windows 10. The Windows and Mac Python examples were using Anaconda 4.3 with Python 3.6 and pyodbc version 3.0, while the Linux example used Anaconda Enterprise, based on Anaconda 4.2, using Python 2.7.

NOTE: In the examples below the $ symbol indicates the command line prompt. Do not include this in any commands if you cut-and-paste. Your prompt will probably look different!

Server Side Preparation

If you are an experienced SQL Server administrator then you can skip this section. All you need to know are:

  1. The hostname or IP address and port number for your SQL Server instance
  2. The database you want to connect to
  3. The user credentials that will be used to make the connection

The following provides details on how to set up your SQL Server instance so that you can exactly replicate the client-side Python-based connection that follows. If you do not have Microsoft SQL Server, it can be downloaded and installed for free and is now available for Windows and Linux. NOTE: The recently released SQL Server 2017 and SQL Server on Azure both require pyodbc version >= 3.2. This blog post has been developed using SQL Server 2016 with pyodbc version 3.0.1.

This demonstration is going to use the Adventure Works sample database provided by Microsoft on CodePlex. There are instructions on how to install this into your SQL Server instance in Step 3 of this blog post, however you can also simply connect to an existing database by adjusting the connection commands below accordingly.

Many of the preparation steps described below are most easily handled using the SQL Server Management Studio which can be downloaded and installed for free.

Additionally this example makes use of the Mixed Authentication Mode which allows SQL Server-based usernames and passwords for database authentication. By default this is not enabled, and only Windows Authentication is permitted, which makes use of the pre-existing Kerberos user authentication token wallet. It should go without saying that you would only change to Mixed Authentication Mode for testing purposes if SQL Server is not already so configured. While the example below focuses on SQL Server Authentication there are also alternatives presented for the Windows Authentication Mode that uses Kerberos tokens.

You should know the hostname and port on which SQL Server is running and be sure you can connect to that hostname (or IP address) and port from the client-side system. The easiest way to test this is with telnet from the command line, executing the command:

$ telnet sqlserver.testnet.corp 1433

Where you would replace sqlserver.testnet.corp with the hostname or IP address of your SQL Server instance and 1433 with the port SQL Server is running on. Port 1433 is the SQL Server default. Executing this command on the client system should return output like:

Trying sqlserver.testnet.corp...
Connected to sqlserver.testnet.corp.
Escape character is '^]'.

telnet> close

At which point you can then type CTRL-] and then close.

If your client system is also Windows you can perform this simple Universal Data Link (UDL) test.

Finally you will need to confirm that you have access credentials for a user known by SQL Server and that the user is permitted to perform SELECT operations (and perhaps others) on the database in question. In this particular example we are making use of a fictional user named Tanya who has the username tanya and a password of Tanya1234. It is a 3-step process to get Tanya access to the Adventure Works database:

  1. The SQL Server user tanya is added as a Login to the DBMS:

    • which can be found in SQL Server Management Studio
    • under Security->Logins
    • right-click on Logins
    • add a New Login...
    • provide a Login name of tanya
    • select SQL Server authentication
    • provide a password of Tanya1234
    • uncheck the option for Enforce password policy (the other two will automatically be unchecked and greyed out)
  2. The database user needs to be added:

    • under Databases->AdventureWorksDW2012->Security->Users
    • right-click on Users
    • add a New User...
    • select SQL user with Login for type
    • add a Username set to tanya
    • add a Login name set to tanya.
  3. Grant tanya permissions on the AdventureWorksDW2012 database by executing the following query:

    use AdventureWorksDW2012; 
    GRANT SELECT, INSERT, DELETE, UPDATE ON SCHEMA::DBO TO tanya;
    

Using the UDL test method described above is a good way to confirm that tanya can connect to the database, even just from localhost, though it does not confirm whether she can perform operations such as SELECT. For that I would recommend installing the free Microsoft Command Line Utilities 13.1 for SQL Server.

Client Driver Installation

You'll now need to get the drivers installed on your client system. There are two parts to this: the platform-specific dynamic libraries for SQL Server, and the Python ODBC interface. You won't be surprised to hear that the platform-specific libraries are harder to get set up, but this should still only take 10-15 minutes and there are established processes for all major operating systems.

Linux

The Linux drivers are available for RHEL, Ubuntu, and SUSE. This is the process to follow if you are using Anaconda Enterprise, as well as for anyone using a Linux variant with Anaconda installed.

My Linux test system was using CentOS 6.9, so I followed the RHEL6 installation procedure from Microsoft (linked above), which essentially consisted of 3 steps:

  1. Adding the Microsoft RPM repository to the yum configuration
  2. Pre-emptively removing some packages that may cause conflicts (I didn't have these installed)
  3. Using yum to install the msodbcsql RPM for version 13.1 of the drivers

In my case I had to play around with the yum command a bit; in the end, the following was all that was needed:

$ ACCEPT_EULA=Y yum install msodbcsql

Windows

The Windows drivers are dead easy to install. Download the msodbcsql MSI file, double click to install, and you're in business.

Mac

The SQL Server drivers for Mac need to be installed via Homebrew, a popular OS X package manager that is neither affiliated with nor supported by Apple. Microsoft has created its own tap, which is a Homebrew package repository. If you don't already have Homebrew installed you'll need to do that first; Microsoft has provided some simple instructions describing how to add the SQL Server tap and then install the mssql-tools package. The steps are simple enough that I'll repeat them here, though check out the link above if you need more details or background.

$ brew tap microsoft/mssql-preview https://github.com/Microsoft/homebrew-mssql-preview
$ brew update
$ brew install mssql-tools

One thing I'll note is that the Homebrew installation output suggested I should execute a command to remove one of the SQL Server drivers. Don't do this! That driver is required. If you've already done it then the way to correct the process is to reset the configuration file by removing and re-adding the package:

$ brew remove  mssql-tools
$ brew install mssql-tools

Install Anaconda

Download and install Anaconda if you don't already have it on your system. There are graphical and command line installers available for Windows, Mac, and Linux. It is about 400 MB to download and a bit over 1 GB installed. If you're looking for a minimal system you can install Miniconda instead (command line only installer) and then a la carte pick the packages you want with conda install commands.

Anaconda Enterprise users or administrators can simply execute the commands below in the conda environment where they want pyodbc to be available.

Python ODBC package

This part is easy. You can just do:

$ conda install pyodbc

And if you're not using Anaconda or prefer pip, then you can also do:

$ pip install pyodbc

NOTE: If you are using the recently released SQL Server 2017 you will need pyodbc >= 3.2. There should be a conda package available for that "shortly" but be sure to check which version you get if you use the conda command above.
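
One simple way to confirm which version you actually received is to ask the module itself:

import pyodbc
print(pyodbc.version)   # should report 3.2 or later for SQL Server 2017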

Connecting to SQL Server using pyodbc

Now that you've got your server-side and client-side systems setup with the correct software, databases, users, libraries, and drivers it is time to connect. If everything works properly these steps are very simple and work for all platforms. Everything that is platform-specific has been handled elsewhere in the process.

We start by importing the common pyodbc package. This is Microsoft's recommended Python interface to SQL Server. There was an alternative Python interface, pymssql, that at one point was more reliable for SQL Server connections and, until quite recently, was the only way to connect to SQL Server from Python on a Mac. However, with Microsoft's renewed support for Python and its own Mac Homebrew packages, pyodbc is now the leading choice on all platforms.

import pyodbc

Use a Python dict to define the configuration parameters for the connection:

config = dict(server=   'sqlserver.testnet.corp', # change this to your SQL Server hostname or IP address
              port=      1433,                    # change this to your SQL Server port number [1433 is the default]
              database= 'AdventureWorksDW2012',
              username= 'tanya',
              password= 'Tanya1234')

Create a template connection string that can be re-used.

conn_str = ('SERVER={server},{port};'   +
            'DATABASE={database};'      +
            'UID={username};'           +
            'PWD={password}')

If you are using the Windows Authentication mode where existing authorization tokens are picked up automatically this connection string would be changed to remove UID and PWD entries and replace them with TRUSTED_CONNECTION, as below:

trusted_conn_str = ('SERVER={server};'     +
                    'DATABASE={database};' +
                    'TRUSTED_CONNECTION=yes')

Check that your configuration looks right:

config
{'database': 'AdventureWorksDW2012',
 'password': 'Tanya1234',
 'port': 1433,
 'server': 'sqlserver.testnet.corp',
 'username': 'tanya'}

Now open a connection by specifying the driver and filling in the connection string with the connection parameters.

The following connection operation can take tens of seconds to complete.

conn = pyodbc.connect(
    r'DRIVER={ODBC Driver 13 for SQL Server};' +
    conn_str.format(**config)
    )

Executing Queries

Request a cursor from the connection that can be used for queries.

cursor = conn.cursor()

Perform your query.

cursor.execute('SELECT TOP 10 EnglishProductName FROM dbo.DimProduct;')
<pyodbc.Cursor at 0x7f7ca4a82d50>

Loop through to look at the results (an iterable of 1-tuples containing unicode strings).

for entry in cursor:
    print(entry)
(u'Adjustable Race', )
(u'Bearing Ball', )
(u'BB Ball Bearing', )
(u'Headset Ball Bearings', )
(u'Blade', )
(u'LL Crankarm', )
(u'ML Crankarm', )
(u'HL Crankarm', )
(u'Chainring Bolts', )
(u'Chainring Nut', )

Data Science Happens Here

Now that we've demonstrated how to connect to a SQL Server instance from Windows, Mac, and Linux using Anaconda or Anaconda Enterprise, it is possible to use T-SQL queries to interact with that database as you normally would.
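
For example, query results can be pulled straight into a pandas DataFrame for further analysis. Here is a minimal sketch that reuses the conn object opened above (pandas is included with Anaconda; the query itself is just an illustration):

import pandas as pd

products = pd.read_sql('SELECT TOP 100 * FROM dbo.DimProduct;', conn)
products.head()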

Looking to the future, the latest preview release of SQL Server 2017 includes a server-side Python interface built around Anaconda. There are lots of great resources on Python and SQL Server connectivity from the team at Microsoft, and here are a few that you may find particularly interesting:

Next Steps

My bet is that if you're reading a blog post on SQL Server and Python (and you can download a Notebook version of it here) then you're using it in a commercial context. Anaconda Enterprise is going to be the best way for you and your organization to make a strategic investment in Open Data Science.

See how Anaconda Enterprise is transforming data science through our webinar series or grab one of our white papers on Enterprise Open Data Science.

Let us help you be successful in your strategic adoption of Python and Anaconda for high-performance enterprise-oriented open data science connected to your existing data sources and systems, such as SQL Server.

by swebster at May 08, 2017 04:46 PM

Data Science And Deep Learning Application Leaders Form GPU Open Analytics Initiative

Monday, May 8, 2017

Continuum Analytics, H2O.ai and MapD Technologies Create Open Common Data Frameworks for GPU In-Memory Analytics

SAN JOSE, CA—May 8, 2017—Continuum Analytics, H2O.ai, and MapD Technologies have announced the formation of the GPU Open Analytics Initiative (GOAI) to create common data frameworks enabling developers and statistical researchers to accelerate data science on GPUs. GOAI will foster the development of a data science ecosystem on GPUs by allowing resident applications to interchange data seamlessly and efficiently. BlazingDB, Graphistry and Gunrock from UC Davis led by CUDA Fellow John Owens have joined the founding members to contribute their technical expertise.

The formation of the Initiative comes at a time when analytics and machine learning workloads are increasingly being migrated to GPUs. However, while individually powerful, these workloads have not been able to benefit from the power of end-to-end GPU computing. A common standard will enable intercommunication between the different data applications and speed up the entire workflow, removing latency and decreasing the complexity of data flows between core analytical applications. 

At the GPU Technology Conference (GTC), NVIDIA’s annual GPU developers’ conference, the Initiative announced its first project: an open source GPU Data Frame with a corresponding Python API. The GPU Data Frame is a common API that enables efficient interchange of data between processes running on the GPU. End-to-end computation on the GPU avoids transfers back to the CPU or copying of in-memory data, reducing compute time and cost for high-performance analytics common in artificial intelligence workloads.

Users of the MapD Core database can output the results of a SQL query into the GPU Data Frame, which then can be manipulated by the Continuum Analytics’ Anaconda NumPy-like Python API or used as input into the H2O suite of machine learning algorithms without additional data manipulation. In early internal tests, this approach exhibited order-of-magnitude improvements in processing times compared to passing the data between applications on a CPU. 

“The data science and analytics communities are rapidly adopting GPU computing for machine learning and deep learning. However, CPU-based systems still handle tasks like subsetting and preprocessing training data, which creates a significant bottleneck,” said Todd Mostak, CEO and co-founder of MapD Technologies. “The GPU Data Frame makes it easy to run everything from ingestion to preprocessing to training and visualization directly on the GPU. This efficient data interchange will improve performance, encouraging development of ever more sophisticated GPU-based applications.” 

“GPU Data Frame relies on the Anaconda platform as the foundational fabric that brings data science technologies together to take full advantage of GPU performance gains,” said Travis Oliphant, co-founder and chief data scientist of Continuum Analytics. “Using NVIDIA’s technology, Anaconda is mobilizing the Open Data Science movement by helping teams avoid the data transfer process between CPUs and GPUs and move nimbly toward their larger business goals. The key to producing this kind of innovation are great partners like H2O and MapD.”

“Truly diverse open source ecosystems are essential for adoption - we are excited to start GOAI for GPUs alongside leaders in data and analytics pipeline to help standardize data formats,” said Sri Ambati, CEO and co-founder of H2O.ai. “GOAI is a call for the community of data developers and researchers to join the movement to speed up analytics and GPU adoption in the enterprise.”

The GPU Open Analytics Initiative is actively welcoming participants who are committed to open source and to GPUs as a computing platform. 

Details of the GPU Data Frame can be found at the Initiative’s Github link - 
https://github.com/gpuopenanalytics

In conjunction with this announcement, MapD Technologies has announced the immediate open sourcing of the MapD Core database to foster open analytics on GPUs. Anaconda and H2O already have large open source communities, which can benefit from this project immediately and drive further development to accelerate the adoption of data science and analytics on GPUs. 

About Anaconda Powered by Continuum Analytics
Anaconda is the leading Open Data Science platform powered by Python, the fastest growing data science language with more than 13 million downloads and 4 million unique users to date. Continuum Analytics is the creator and driving force behind Anaconda, empowering leading businesses across industries worldwide with solutions to identify patterns in data, uncover key insights and transform data into a goldmine of intelligence to solve the world’s most challenging problems. Learn more at continuum.io

About H2O.ai
H2O.ai is focused on bringing AI to businesses through software. Its flagship product is H2O, the leading open source platform that makes it easy for financial services, insurance and healthcare companies to deploy AI and deep learning to solve complex problems. More than 9,000 organizations and 80,000+ data scientists depend on H2O for critical applications like predictive maintenance and operational intelligence. The company -- which was recently named to the CB Insights AI 100 -- is used by 169 Fortune 500 enterprises, including 8 of the world’s 10 largest banks, 7 of the 10 largest insurance companies and 4 of the top 10 healthcare companies. Notable customers include Capital One, Progressive Insurance, Transamerica, Comcast, Nielsen Catalina Solutions, Macy's, Walgreens and Kaiser Permanente.

About MapD Technologies
MapD Technologies is a next-generation analytics software company. Its technology harnesses the massive parallelism of modern graphics processing units (GPUs) to power lightning-fast SQL queries and visualization of large data sets. The MapD analytics platform includes the MapD Core database and MapD Immerse visualization client. These software products provide analysts and data scientists with the fastest time to insight, performance not possible with traditional CPU-based solutions. MapD software runs on-premise and on all leading cloud providers.

Founded in 2013, MapD Technologies originated from research at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). MapD is funded by GV, In-Q-Tel, New Enterprise Associates (NEA), NVIDIA, Vanedge Capital and Verizon Ventures. The company is headquartered in San Francisco.

Visit MapD at www.mapd.com or follow MapD on Twitter @mapd. For more information or to evaluate MapD, contact sales@mapd.com. Press inquiries, please contact press@mapd.com.

Media Contacts:

Jill Rosenthal
Continuum Analytics
anaconda@inkhouse.com

Mary Fuochi
MapD
press@mapd.com

James Christopherson
H2O.ai
james@vscpr.com

 

 

by swebster at May 08, 2017 12:47 PM

Matthew Rocklin

Dask Release 0.14.3

This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.14.3. This release contains a variety of performance and feature improvements. This blogpost includes some notable features and changes since the last release on March 22nd.

As always you can conda install from conda-forge

conda install -c conda-forge dask distributed

or you can pip install from PyPI

pip install dask[complete] --upgrade

Conda packages should be on the default channel within a few days.

Arrays

Sparse Arrays

Dask.arrays now support sparse arrays and mixed dense/sparse arrays.

>>> import dask.array as da

>>> x = da.random.random(size=(10000, 10000, 10000, 10000),
...                      chunks=(100, 100, 100, 100))
>>> x[x < 0.99] = 0

>>> import sparse
>>> s = x.map_blocks(sparse.COO)  # parallel array of sparse arrays

In order to support sparse arrays we did two things:

  1. Made dask.array support ndarray containers other than NumPy, as long as they were API compatible
  2. Made a small sparse array library that was API compatible to the numpy.ndarray

This process was pretty easy and could be extended to other systems. This also allows for different kinds of ndarrays in the same Dask array, as long as interactions between the arrays are well defined (using the standard NumPy protocols like __array_priority__ and so on.)

Documentation: http://dask.pydata.org/en/latest/array-sparse.html

Update: there is already a pull request for Masked arrays

Reworked FFT code

The da.fft submodule has been extended to include most of the functions in np.fft, with the caveat that multi-dimensional FFTs will only work along single-chunk dimensions. Still, given that rechunking is decently fast today this can be very useful for large image stacks.
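
As a minimal sketch (the array shape and chunking here are made up for illustration), an FFT along a single-chunk axis works directly, while a chunked axis needs a rechunk first:

>>> import dask.array as da

>>> x = da.random.random(size=(4096, 4096), chunks=(4096, 512))
>>> X = da.fft.fft(x, axis=0)                          # axis 0 is a single chunk
>>> Y = da.fft.fft(x.rechunk((512, 4096)), axis=1)     # rechunk so axis 1 is single-chunk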

Documentation: http://dask.pydata.org/en/latest/array-api.html#fast-fourier-transforms

Constructor Plugins

You can now run arbitrary code whenever a dask array is constructed. This empowers users to build in their own policies like rechunking, warning users, or eager evaluation. A dask.array plugin takes in a dask.array and returns either a new dask array, or returns None, in which case the original will be returned.

>>> def f(x):
...     print('%d bytes' % x.nbytes)

>>> with dask.set_options(array_plugins=[f]):
...     x = da.ones((10, 1), chunks=(5, 1))
...     y = x.dot(x.T)
80 bytes
80 bytes
800 bytes
800 bytes

This can be used, for example, to convert dask.array code into numpy code to identify bugs quickly:

>>> with dask.set_options(array_plugins=[lambda x: x.compute()]):
...     x = da.arange(5, chunks=2)

>>> x  # this was automatically converted into a numpy array
array([0, 1, 2, 3, 4])

Or to warn users if they accidentally produce an array with large chunks:

def warn_on_large_chunks(x):
    shapes = list(itertools.product(*x.chunks))
    nbytes = [x.dtype.itemsize * np.prod(shape) for shape in shapes]
    if any(nb > 1e9 for nb in nbytes):
        warnings.warn("Array contains very large chunks")

with dask.set_options(array_plugins=[warn_on_large_chunks]):
    ...

These features were heavily requested by the climate science community, which tends to serve both highly technical computer scientists, and less technical climate scientists who were running into issues with the nuances of chunking.

DataFrames

Dask.dataframe changes are both numerous and very small, making it difficult to give a representative accounting of recent changes within a blogpost. Typically these include small changes either to track new Pandas development or to fix slight inconsistencies in corner cases (of which there are many.)

Still, two highlights follow:

Rolling windows with time intervals
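
As a minimal sketch, the series s used below could be built like this (the values and one-second index are made up for illustration):

>>> import pandas as pd
>>> import dask.dataframe as dd

>>> index = pd.date_range('2017-01-01', periods=10, freq='1s')
>>> s = dd.from_pandas(pd.Series(1.0, index=index), npartitions=2)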

>>> s.rolling('2s').count().compute()
2017-01-01 00:00:00    1.0
2017-01-01 00:00:01    2.0
2017-01-01 00:00:02    2.0
2017-01-01 00:00:03    2.0
2017-01-01 00:00:04    2.0
2017-01-01 00:00:05    2.0
2017-01-01 00:00:06    2.0
2017-01-01 00:00:07    2.0
2017-01-01 00:00:08    2.0
2017-01-01 00:00:09    2.0
dtype: float64

Read Parquet data with Arrow

Dask now supports reading Parquet data with both fastparquet (a NumPy/Numba solution) and Parquet-CPP via Arrow.

df = dd.read_parquet('/path/to/mydata.parquet', engine='fastparquet')
df = dd.read_parquet('/path/to/mydata.parquet', engine='arrow')

Hopefully this capability increases the use of both projects and results in greater feedback to those libraries so that they can continue to advance Python’s access to the Parquet format.

Graph Optimizations

Dask performs a few passes of simple linear-time graph optimizations before sending a task graph to the scheduler. These optimizations currently vary by collection type, for example dask.arrays have different optimizations than dask.dataframes. These optimizations can greatly improve performance in some cases, but can also increase overhead, which becomes very important for large graphs.

As Dask has grown into more communities, each with strong and differing performance constraints, we’ve found that we needed to allow each community to define its own optimization schemes. The defaults have not changed, but now you can override them with your own. This can be set globally or with a context manager.

def my_optimize_function(graph, keys):
    """ Takes a task graph and a list of output keys, returns new graph """
    new_graph = {...}
    return new_graph

with dask.set_options(array_optimize=my_optimize_function,
                      dataframe_optimize=None,
                      delayed_optimize=my_other_optimize_function):
    x, y = dask.compute(x, y)

Documentation: http://dask.pydata.org/en/latest/optimize.html#customizing-optimization

Speed improvements

Additionally, task fusion has been significantly accelerated. This is very important for large graphs, particularly in dask.array computations.

Web Diagnostics

The distributed scheduler’s web diagnostic page is now served from within the dask scheduler process. This is both good and bad:

  • Good: It is much easier to make new visuals
  • Bad: Dask and Bokeh now share a single CPU

Because Bokeh and Dask now share the same Tornado event loop we no longer need to send messages between them to then send out to a web browser. The Bokeh server has full access to all of the scheduler state. This lets us build new diagnostic pages more easily. This has been around for a while but was largely used for development. In this version we've switched the new version to be the default and turned off the old one.

The cost here is that the Bokeh server can take 10-20% of the CPU use. If you are running a computation that heavily taxes the scheduler then you might want to close your diagnostic pages. Fortunately, this almost never happens; the dask scheduler is typically fast enough to never get close to this limit.

Tornado difficulties

Beware that the current versions of Bokeh (0.12.5) and Tornado (4.5) do not play well together. This has been fixed in development versions, and installing with conda is fine, but if you naively pip install then you may experience bad behavior.

Joblib

The Dask.distributed Joblib backend now includes a scatter= keyword, allowing you to pre-scatter select variables out to all of the Dask workers. This significantly cuts down on overhead, especially on machine learning workloads where most of the data doesn’t change very much.

# Send the training data only once to each worker
with parallel_backend('dask.distributed', scheduler_host='localhost:8786',
                      scatter=[digits.data, digits.target]):
    search.fit(digits.data, digits.target)

Early trials indicate that computations like scikit-learn’s RandomForest scale nicely on a cluster without any additional code.

Documentation: http://distributed.readthedocs.io/en/latest/joblib.html

Preload scripts

When starting a dask.distributed scheduler or worker people often want to include a bit of custom setup code, for example to configure loggers, authenticate with some network system, and so on. This has always been possible if you start scheduler and workers from within Python but is tricky if you want to use the command line interface. Now you can write your custom code as a separate standalone script and ask the command line interface to run it for you at startup:

# scheduler-setup.py
from distributed.diagnostics.plugin import SchedulerPlugin

class MyPlugin(SchedulerPlugin):
    """ Prints a message whenever a worker is added to the cluster """
    def add_worker(self, scheduler=None, worker=None, **kwargs):
        print("Added a new worker at", worker)

def dask_setup(scheduler):
    plugin = MyPlugin()
    scheduler.add_plugin(plugin)

$ dask-scheduler --preload scheduler-setup.py

This makes it easier for people to adapt Dask to their particular institution.

Documentation: http://distributed.readthedocs.io/en/latest/setup.html#customizing-initialization

Network Interfaces (for infiniband)

Many people use Dask on high performance supercomputers. This hardware differs from typical commodity clusters or cloud services in several ways, including very high performance network interconnects like InfiniBand. Typically these systems also have normal ethernet and other networks. You’re probably familiar with this on your own laptop when you have both ethernet and wireless:

$ ifconfig
lo          Link encap:Local Loopback                       # Localhost
            inet addr:127.0.0.1  Mask:255.0.0.0
            inet6 addr: ::1/128 Scope:Host
eth0        Link encap:Ethernet  HWaddr XX:XX:XX:XX:XX:XX   # Ethernet
            inet addr:192.168.0.101
            ...
ib0         Link encap:Infiniband                           # Fast InfiniBand
            inet addr:172.42.0.101

Dask's default mechanism for determining network interfaces often chooses ethernet. If you are on an HPC system then this is likely not optimal. You can direct Dask to use a particular network interface with the --interface keyword:

$ dask-scheduler --interface ib0
distributed.scheduler - INFO -   Scheduler at: tcp://172.42.0.101:8786

$ dask-worker tcp://172.42.0.101:8786 --interface ib0

Efficient as_completed

The as_completed iterator returns futures in the order in which they complete. It is the base of many asynchronous applications using Dask.
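
The snippets below assume a running client and a trivial inc function, for example:

>>> from dask.distributed import Client, as_completed
>>> client = Client('scheduler-address:8786')   # assumed scheduler address

>>> def inc(i):
...     return i + 1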

>>> x, y, z = client.map(inc, [0, 1, 2])
>>> for future in as_completed([x, y, z]):
...     print(future.result())
2
0
1

It can now also wait to yield an element until its result has arrived as well:

>>> for future, result in as_completed([x, y, z], with_results=True):
...     print(result)
2
0
1

It can also yield, in batches, all futures (and results) that have finished up to this point:

>>> for futures in as_completed([x, y, z]).batches():
...    print(client.gather(futures))
(2, 0)
(1,)

Both of these help to decrease the overhead of tight inner loops within asynchronous applications.

Example blogpost here: http://matthewrocklin.com/blog/work/2017/04/19/dask-glm-2

Co-released libraries

This release is aligned with a number of other related libraries, notably Pandas, and several smaller libraries for accessing data, including s3fs, hdfs3, fastparquet, and python-snappy, each of which has seen numerous updates over the past few months. Much of the work on these latter libraries is being coordinated by Martin Durant.

Acknowledgements

The following people contributed to the dask/dask repository since the 0.14.1 release on March 22nd

  • Antoine Pitrou
  • Dmitry Shachnev
  • Erik Welch
  • Eugene Pakhomov
  • Jeff Reback
  • Jim Crist
  • John A Kirkham
  • Joris Van den Bossche
  • Martin Durant
  • Matthew Rocklin
  • Michal Ficek
  • Noah D Brenowitz
  • Stuart Archibald
  • Tom Augspurger
  • Wes McKinney
  • wikiped

The following people contributed to the dask/distributed repository since the 1.16.1 release on March 22nd

  • Antoine Pitrou
  • Bartosz Marcinkowski
  • Ben Schreck
  • Jim Crist
  • Jens Nie
  • Krisztián Szűcs
  • Lezyes
  • Luke Canavan
  • Martin Durant
  • Matthew Rocklin
  • Phil Elson

May 08, 2017 12:00 AM

May 02, 2017

Enthought

Webinar: Python for Data Science: A Tour of Enthought’s Professional Training Course

What: A guided walkthrough and Q&A about Enthought’s technical training course “Python for Data Science and Machine Learning” with VP of Training Solutions, Dr. Michael Connell

Who Should Watch: individuals, team leaders, and learning & development coordinators who are looking to better understand the options to increase professional capabilities in Python for data science and machine learning applications


Enthought’s Python for Data Science training course is designed to accelerate the development of skill and confidence in using Python’s core data science tools — including the standard Python language, the fast array programming package NumPy, and the Pandas data analysis package, as well as tools for database access (DBAPI2, SQLAlchemy), machine learning (scikit-learn), and visual exploration (Matplotlib, Seaborn).

In this webinar, we give you the key information and insight you need to evaluate whether Enthought’s Python for Data Science course is the right solution to advance your professional data science skills in Python, including:

  • Who will benefit most from the course
  • A guided tour through the course topics
  • What skills you’ll take away from the course, how the instructional design supports that
  • What the experience is like, and why it is different from other training alternatives (with a sneak peek at actual course materials)
  • What previous course attendees say about the course


Presenter: Dr. Michael Connell, VP, Enthought Training Solutions

Ed.D, Education, Harvard University
M.S., Electrical Engineering and Computer Science, MIT


Considering Moving to Python for Data Science?

Then Enthought’s Python for Data Science training course is definitely for you! This class has been particularly appealing to people who have been using other tools like R or SAS (or even Excel) for their data science work and want to start applying their analytic skills using the Python toolset.  And it’s no wonder — Python has been identified as the most popular coding language for five years in a row for good reason.

One reason for Python’s broad popularity across a range of disciplines is its efficiency and ease-of-use. Many people consider Python more fun to work in than other languages (and we agree!). Another reason for its popularity among data analysts and data scientists in particular is Python’s extensive (and growing) open source library of powerful tools for preparing, visualizing, analyzing, and modeling data.

Python is also an extraordinarily comprehensive toolset – it supports everything from interactive analysis to automation to software engineering to web app development within a single language and plays very well with other languages like C/C++ or FORTRAN so you can continue leveraging your existing code libraries written in those other languages.

Many organizations are moving to Python so they can consolidate all of their technical work streams under a single comprehensive toolset. In the first part of this class we’ll give you the fundamentals you need to switch from another language to Python and then we cover the core tools that will enable you to do in Python what you were doing with other tools, only faster!

Additional Resources

  • Upcoming Open Python for Data Science Sessions: Austin, TX, June 12-16, 2017 and San Jose, CA, July 17-21, 2017 (Learn More)
  • Have a group interested in training? We specialize in group and corporate training. Contact us or call 512.536.1057.
  • Download Enthought’s Machine Learning with Python’s Scikit-Learn Cheat Sheets
  • Download Enthought’s Pandas Cheat Sheets

The post Webinar: Python for Data Science: A Tour of Enthought’s Professional Training Course appeared first on Enthought Blog.

by admin at May 02, 2017 08:03 PM

Matthieu Brucher

Using Audio ToolKit with JUCE 5

As some may have seen online, ROLI released a new version of JUCE. The nice thing is that they added a new tier for people like me who don’t sell plugins but who don’t want to release their code under the GPL license for various reasons (for me, it was formerly the incompatibility between the VST3 license and the GPL).

With JUCE 5, you have support for all major APIs, from VST2 to Audio Unit v3, and also AAX or VST3. And you can develop your own plugins. The caveat with this tier is that you get a splash screen and tracking of your users… (actually, there is a flag to remove both). The advantage is that on MacOS there are no more SDK conflicts, and I have Audio Unit v3 support.

So I’ve started playing with Projucer and built a barebones ATK plugin that doesn’t do anything. What I can say is that the worst part is handling universal binaries and supporting 32-bit plugins, as the JUCE project builder overwrites all my changes. Even adding ATK is painful with the project manager.

So instead, I’m going the WDL-OL route here, keeping this ATKJUCE plugin as the simple plugin that I’ll duplicate by changing the names and its content. I have my builders that build the plugins and create the installers, all while keeping the same JUCE core code (it is shared by all plugins).

The next step is trying to make sense of the API to build a nicer GUI than what I currently have (probably something flat). Indeed, the tutorials on the GUI are short and too basic, but WDL-OL was no better in that respect, just with more examples.

by Matt at May 02, 2017 07:21 AM

May 01, 2017

numfocus

NumFOCUS projects participate in Docathon 2017

Seven NumFOCUS sponsored projects participated in Docathon 2017: IPython, Project Jupyter, Julia, Matplotlib, pandas, SunPy, and yt. The Docathon is like a hackathon but is focused on developing material and tools for documentation. Documentation is one of the most important components of the open science ecosystem—and it’s everywhere! From examples that provide inspiration for things you can […]

by Gina Helfrich at May 01, 2017 02:00 PM

April 28, 2017

Leonardo Uieda

Thoughts from the Introduction to Python Workshop at UH Manoa


Last week, I taught a 3-day Python workshop at the Department of Geology and Geophysics of the University of Hawaii at Manoa, where I'm currently doing a postdoc. It covered the basics of computer programming with Python, starting from the very beginning. Below are thoughts and information about the workshop, the demographics of people who signed up, and the feedback that I got from the participants.

See the workshop page and Github repository for more information and links to material used.

My goals

I wanted this to be a hands-on workshop of the basic concepts needed to use Python for research. Participants who complete the workshop should be able to use Python to gather data from one or more files, process the data, run an analysis, make publication quality figures, and save the output. Most importantly, I wanted participants to know what they should type into Google to learn more about Python.

Materials

The class is based on a mixture of the Software Carpentry lessons Plotting and Programming in Python (under development) and Programming with Python. However, I use temperature data from Berkeley Earth instead of the Gapminder and inflammation data used by Software Carpentry. For example, our goal for the second day of the workshop was to reproduce this figure for average temperature variation in Hawaii from the website:

On the last day, we finished with some code that processed a list of country names to download the respective data file (using requests), load it into Python, make a figure, and save it to a different folder (see this Jupyter notebook for the code).

I also used a few techniques from the Software Carpentry Instructor Training, mainly the shared class notes (I used Google Docs instead of Etherpad) and colored sticky notes. The sticky notes were in two colors: pink and blue. Learners kept the blue sticky note on their laptop lid if everything was OK. They put up the pink one if they have a problem or need help. This way, myself and the teaching assistants can know at a glance who needs help. I had them write positive and negative feedback on the sticky notes at the end of the workshop.

The notebooks that I created during class and some notes for myself are in the Github repository (in the notebooks and notes folders, respectively).

Who attended

I asked all participants to sign up through a Google Form that asked a few questions regarding their operating system, background in programming, and position at the university. The file demographics.csv in the Github repository has the anonymous information from this form. I wrote some code to analyze the data and generate the figures below using pandas and matplotlib. You can find it in the demographics-analysis.ipynb Jupyter notebook (also in the Github repo).
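
As a rough sketch of the kind of analysis in that notebook (the column name below is hypothetical; the real fields are whatever the sign-up form produced in demographics.csv):

import pandas as pd
import matplotlib.pyplot as plt

demographics = pd.read_csv('demographics.csv')
demographics['experience'].value_counts().plot.bar()
plt.tight_layout()
plt.savefig('experience-level.png')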

First, let's look at how many people signed up and then actually attended each day.

Number of attendants per day of the workshop.

We were very lucky that most people who signed up also attended the workshop on the first day, even though there was no sign up fee. It seems that the workshop really fills a need in the community! Some people gave up after the first day. Maybe it was too fast, or too basic, or life just happened. I can't really tell because I failed to collect feedback at the end of each day. That is something to keep in mind for the next iteration: get feedback every day.

The experience level of participants was more evenly distributed than I expected. I was very pleased with the number of people who had never programmed before. But the distribution made it challenging to keep everyone motivated and following along. From the feedback (see below), it seems that I managed it well enough.

Not surprisingly, most participants who already programmed know Matlab. What was a bit surprising is how few people reported experience with Fortran. Is this a reflection of the age of participants (a lot of young grad students)? The number of Fortran users does correlate with the number of faculty who signed up, so maybe yes.

I was very pleased to have someone from the Nā Kūpuna Senior Citizen Visitor Program and a not insignificant number of faculty. We even had an "interested citizen" who studies film production and education (a personal friend)!

Feedback

Feedback on the colored sticky notes.

This is a synthesis from the feedback given by participants on the last day (using the pink and blue sticky notes):

The Good (number of sticky notes in parentheses):

  • Instructor style (5)
  • Examples and exercises (5)
  • Dense but efficient (learn a lot in little time) (3)
  • Using real data (3)
  • Simple and accessible level (3)
  • Shared notes (1)

The Bad:

  • Too short (5)
  • Too fast (3)
  • Too slow (3)
  • Hard to multi-task (pay attention + notes + exercise) (2)
  • No TA on Monday (2)
  • Instructor took too many tangents when teaching (1)
  • Too many people (1)
  • Ran late (1)
  • Jupyter Google maps example failed (1)

It was a very funny coincidence that the exact same number of people complained about the pace being either too fast or too slow.

Lessons learned

I was glad to see that the hands-on approach worked and the students appreciated using real data during the exercises. We had very little time (6h total) to cover a lot of material. So it's no wonder that people thought it was too short and maybe didn't explain quite as thoroughly some concepts (like for and if). Regarding the pace, it's hard to satisfy everyone. I expect that novices might have found the pace a bit too fast and more experienced programmers found it too slow (but I don't have data to back that up). Since only 6 people complained about the pace, I guess it wasn't too bad. Not having a TA on Monday (the first day) was not good because that is when the most serious problems occur (Jupyter won't start, where is my Python?, lost my files, etc). The next two days of the workshop were much smoother thanks to the generous help of volunteer TAs Sam Murphy and Julie Schnurr. We also didn't get to cover the last few topics on the last day (functions and getting data from headers).

A few things that I would have done differently:

  • Get feedback through sticky notes at the end of each day. The Software Carpentry material actually recommends this but I completely forgot.
  • Use more pair programming activities. This is also something recommended by Software Carpentry and that I had planned on doing. In the end, I left this as optional and didn't explicitly pair learners. A lot of people were interacting naturally but I would have liked to see more of it.

Have you taught or participated in a workshop like this before? What were your experiences?


Comments? Leave one below or let me know on Twitter @leouieda or in the Software Underground Slack group.

Found a typo/mistake? Send a fix through Github and I'll happily merge it (plus you'll feel great because you helped someone). All you need is an account and 5 minutes!


April 28, 2017 12:00 PM

Matthew Rocklin

Dask Development Log

This work is supported by Continuum Analytics and the Data Driven Discovery Initiative from the Moore Foundation

To increase transparency I’m blogging weekly(ish) about the work done on Dask and related projects during the previous week. This log covers work done between 2017-04-20 and 2017-04-28. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Development in Dask and Dask-related projects during the last week includes the following notable changes:

  1. Improved Joblib support, accelerating existing Scikit-Learn code
  2. A dask-glm powered LogisticRegression estimator that is scikit-learn compatible
  3. Additional Parquet support by Arrow
  4. Sparse arrays
  5. Better spill-to-disk behavior
  6. AsyncIO compatible Client
  7. TLS (SSL) support
  8. NumPy __array_ufunc__ protocol

Joblib

Scikit learn parallelizes most of their algorithms with Joblib, which provides a simple interface for embarrassingly parallel computations. Dask has been able to hijack joblib code and serve as the backend for some time now, but it had some limitations, particularly because we would repeatedly send data back and forth from a worker to client for every batch of computations.

import distributed.joblib
from joblib import Parallel, parallel_backend

with parallel_backend('dask.distributed', scheduler_host='HOST:PORT'):
    # normal Joblib code

Now there is a scatter= keyword, which allows you to pre-scatter select variables out to all of the Dask workers. This significantly cuts down on overhead, especially on machine learning workloads where most of the data doesn’t change very much.

# Send the training data only once to each worker
with parallel_backend('dask.distributed', scheduler_host='localhost:8786',
                      scatter=[digits.data, digits.target]):
    search.fit(digits.data, digits.target)

Early trials indicate that computations like scikit-learn’s RandomForest scale nicely on a cluster without any additional code.

This is particularly nice because it allows Dask and Scikit-Learn to play well together without having to introduce Dask within the Scikit-Learn codebase at all. From a maintenance perspective this combination is very attractive.

Work done by Jim Crist in dask/distributed #1022

Dask-GLM Logistic Regression

The convex optimization solvers in the dask-glm project allow us to solve common machine learning and statistics problems in parallel and at scale. Historically this young library has contained only optimization solvers and relatively little in the way of user API.

This week dask-glm grew new LogisticRegression and LinearRegression estimators that expose the scalable convex optimization algorithms within dask-glm through a Scikit-Learn compatible interface. This can both speedup solutions on a single computer or provide solutions for datasets that were previously too large to fit in memory.

from dask_glm.estimators import LogisticRegression

est = LogisticRegression()
est.fit(my_dask_array, labels)

This notebook compares performance to the latest release of scikit-learn on a dataset of 5,000,000 points running on a single machine. Dask-glm beats scikit-learn by a factor of four, which is also roughly the number of cores on the development machine. However in response this notebook by Olivier Grisel shows the development version of scikit-learn (with a new algorithm) beating out dask-glm by a factor of six. This just goes to show you that being smarter about your algorithms is almost always a better use of time than adopting parallelism.

Work done by Tom Augspurger and Chris White in dask/dask-glm #40

Parquet with Arrow

The Parquet format is quickly becoming a standard for parallel and distributed dataframes. There are currently two Parquet reader/writers accessible from Python: fastparquet, a NumPy/Numba solution, and Parquet-CPP, a C++ solution with wrappers provided by Arrow. Dask.dataframe has supported parquet for a while now with fastparquet.

However, users will now have an option to use Arrow instead by switching the engine= keyword in the dd.read_parquet function.

df = dd.read_parquet('/path/to/mydata.parquet', engine='fastparquet')
df = dd.read_parquet('/path/to/mydata.parquet', engine='arrow')

Hopefully this capability increases the use of both projects and results in greater feedback to those libraries so that they can continue to advance Python’s access to the Parquet format. As a gentle reminder, you can typically get much faster query times by switching from CSV to Parquet. This is often much more effective than parallel computing.

Work by Wes McKinney in dask/dask #2223.

Sparse Arrays

There is a small multi-dimensional sparse array library here: https://github.com/mrocklin/sparse. It allows us to represent arrays compactly in memory when most entries are zero. This differs from the standard solution in scipy.sparse, which can only support arrays of dimension two (matrices) and not greater.

pip install sparse
>>> import numpy as np
>>> x = np.random.random(size=(10, 10, 10, 10))
>>> x[x < 0.9] = 0
>>> x.nbytes
80000

>>> import sparse
>>> s = sparse.COO(x)
>>> s
<COO: shape=(10, 10, 10, 10), dtype=float64, nnz=1074>

>>> s.nbytes
12888

>>> sparse.tensordot(s, s, axes=((1, 0, 3), (2, 1, 0))).sum(axis=1)
array([ 100.93868073,  128.72312323,  119.12997217,  118.56304153,
        133.24522101,   98.33555365,   90.25304866,   98.99823973,
        100.57555847,   78.27915528])

Additionally, this sparse library more faithfully follows the numpy.ndarray API, which is exactly what dask.array expects. Because of this close API matching dask.array is able to parallelize around sparse arrays just as easily as it parallelizes around dense numpy arrays. This gives us a decent distributed multidimensional sparse array library relatively cheaply.

>>> import dask.array as da
>>> x = da.random.random(size=(10000, 10000, 10000, 10000),
...                      chunks=(100, 100, 100, 100))
>>> x[x < 0.9] = 0

>>> s = x.map_blocks(sparse.COO)  # parallel array of sparse arrays

Work on the sparse library is so far by myself and Jake VanderPlas and is available here. Work connecting this up to Dask.array is in dask/dask #2234.

Better spill to disk behavior

I’ve been playing with a 50GB sample of the 1TB Criteo dataset on my laptop (this is where I’m using sparse arrays). To make computations flow a bit faster I’ve improved the performance of Dask’s spill-to-disk policies.

Now, rather than depend on (cloud)pickle we use Dask’s network protocol, which handles data more efficiently, compresses well, and has special handling for common and important types like NumPy arrays and things built out of NumPy arrays (like sparse arrays).

As a result reading and writing excess data to disk is significantly faster. When performing machine learning computations (which are fairly heavy-weight) disk access is now fast enough that I don’t notice it in practice and running out of memory doesn’t significantly impact performance.

This is only really relevant when using common types (like numpy arrays) and when your computation to disk access ratio is relatively high (such as is the case for analytic workloads), but it was a simple fix and yielded a nice boost to my personal productivity.

Work by myself in dask/distributed #946.

AsyncIO compatible Client

The Dask.distributed scheduler maintains a fully asynchronous API for use with non-blocking systems like Tornado or AsyncIO. Because Dask supports Python 2 all of our internal code is written with Tornado. While Tornado and AsyncIO can work together, this generally requires a bit of excess book-keeping, like turning Tornado futures into AsyncIO futures, etc..

Now there is an AsyncIO specific Client that only includes non-blocking methods that are AsyncIO native. This allows for more idiomatic asynchronous code in Python 3.

async with AioClient('scheduler-address:8786') as c:
    future = c.submit(func, *args, **kwargs)
    result = await future

Work by Krisztián Szűcs in dask/distributed #1029.

TLS (SSL) support

TLS (previously called SSL) is a common and trusted solution for authentication and encryption. It is a commonly requested feature by companies or institutions where intra-network security is important. This is currently being worked on at dask/distributed #1034. I encourage anyone who may be affected by this to engage on that pull request.

Work by Antoine Pitrou in dask/distributed #1034 and previously by Marius van Niekerk in dask/distributed #866.

NumPy __array_ufunc__

This recent change in NumPy (literally merged as I was typing this blogpost) allows other array libraries to take control of the existing NumPy ufuncs, so if you call something like np.exp(my_dask_array) this will no longer convert to a NumPy array, but will rather call the appropriate dask.array.exp function. This is a big step towards writing generic array code that works both on NumPy arrays and on other array projects like dask.array, xarray, bcolz, sparse, etc.
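
A minimal sketch of what this enables, assuming a NumPy build that includes the new protocol:

>>> import numpy as np
>>> import dask.array as da

>>> x = da.ones((1000, 1000), chunks=(100, 100))
>>> y = np.exp(x)   # dispatches to da.exp and stays lazy instead of materializing a NumPy array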

As with all large changes in NumPy this was accomplished through a collaboration of many people. PR in numpy/numpy #8247.

April 28, 2017 12:00 AM

April 25, 2017

Matthieu Brucher

Announcement: ATKStereoUniversalDelay 1.0.0

I’m happy to announce the release of a stereo delay that allows ping-pong like effects based on the Audio Toolkit. They are available on Windows and OS X (min. 10.11) in different formats.

ATKStereoUniversalDelay

The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

Direct link for ATKStereoUniversalDelay.

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.

Buy Me a Coffee!
Other Amount:
Your Email Address:

by Matt at April 25, 2017 07:40 AM

April 24, 2017

numfocus

Moore Foundation gives grant to support NumFOCUS Diversity & Inclusion in Scientific Computing initiatives

As part of our mission to support and promote better science through support of the open source scientific software community, NumFOCUS champions technical progress through diversity. NumFOCUS recognizes that the open source data science community is currently highly homogenous. We believe that diverse contributors and community members produce better science and better projects. NumFOCUS strives […]

by Gina Helfrich at April 24, 2017 07:23 PM

Anyone Can Do Astronomy with Python and Open Data

Ole Moeller-Nilsson, CTO at Pivigo, was kind enough to share his insights on how a beginner can easily get started exploring astronomy using Python. This blog post grew out of a presentation he gave at PyData London meetup on March 7th. Python is a great language for science, and specifically for astronomy. The various packages […]

by Gina Helfrich at April 24, 2017 03:00 PM

April 23, 2017

Titus Brown

A (revised and updated) shotgun metagenome workshop at UC Santa Cruz

We just finished teaching a second version of our two-day shotgun metagenome analysis workshop, this time at UC Santa Cruz (the first one was in October 2016, at Scripps Institution of Oceanography). Harriet Alexander led the workshop and Phillip Brooks and I co-taught; Luiz Irber, Shannon Joslin, and Taylor Reiter TAed. The workshop was hosted by Professor Marilou Sison-Mangus at the Earth and Marine Sciences Building.

(Note that Harriet will be running an expanded version of this workshop at our summer institute, July 17-21. Registration is still open!)

About 30-35 people came the first day, and about 30 were there on the second.

Some good - new lessons!

In addition to our old lessons on Illumina read QC, assembly with MEGAHIT, annotation with Prokka, and quantification with Salmon, we introduced two new lessons --

For all of this we used subset data from Hu et al. (the Banfield Lab), 2016, which is a great low-complexity metagenome.

More good - using XSEDE Jetstream instead of Amazon Web Services!

This was the first genomics workshop in many years where we didn't use Amazon Web Services - we used XSEDE Jetstream instead. See our login instructions here.

Why are we abandoning Amazon? Two reasons --

  • while we've been teaching it for almost 8 years now, the conversion rate seems to be very low: AFAICT our students aren't using it, because it costs money and their advisors don't want to pay for AWS when they can use institutional resources. (This is anecdotal.)
  • since sometime before October 2016, Amazon changed their registration system so that newly registered people cannot start up instances for a few hours after their first try. This is death for half-day and two-day workshops. (You can read a bit more about it here.) There seems to be nothing that AWS folks can do to help us, so we are giving up.

I am happy to report that Jetstream went more smoothly than AWS in almost every way and seems to perfectly meet our needs for training! We may have more to say about it after our summer institute's use.

I also suspect that people will be more inclined to use Jetstream if they can get allocations on it for free; there was significant interest in this during the workshop.

Other good --

  • As always, the people that attended the workshop were fantastic, and dealt with our occasional hiccups pretty well!
  • We managed to pretty smoothly move between the command line and the Jupyter Notebook for two of the lessons, which was pretty cool.
  • We managed to implement a simple demo of a tetramer nucleotide frequency clustering system using sourmash and t-SNE - see the notebook on github (which should be run after the initial steps in the binning lesson).

There is no bad or ugly

Nothing went wrong! Which I guess is a 'good' all on its own!

There were a few minor issues with the Jetstream desktop and some problems with starting up Jetstream instances now and then, and the guest network at UCSC blocked port 8000 (which we used for Jupyter), but most of the time we could work around these issues.

Feedback from participants

The in-person feedback (which is admittedly always kinder than the anonymous feedback :) was excellent - students really liked the hands-on teaching style (Carpentry-style, but with copy/paste) and the slow teaching pace with lots of time for questions was well received.

Misc notes

As always, our materials are available under CC0 on github - the URL is https://github.com/ngs-docs/2017-ucsc-metagenomics.

--titus

by C. Titus Brown at April 23, 2017 10:00 PM

April 21, 2017

numfocus

NumFOCUS Welcomes SunPy, Our Newest Fiscally Sponsored Project

​NumFOCUS is pleased to announce the addition of SunPy to our fiscally sponsored projects. SunPy is a community-developed, free and open-source software library for solar physics based on Python. The aim of the SunPy project is to provide the software tools necessary so that anyone can analyze solar data. SunPy is written using the Python programming language […]

by Gina Helfrich at April 21, 2017 05:04 PM

April 20, 2017

Continuum Analytics news

Two Peas in a Pod: Anaconda + IBM Cognitive Systems

Thursday, April 20, 2017
Travis Oliphant
President, Chief Data Scientist & Co-Founder

There is no question that deep learning has come out to play across a wide range of sectors—finance, marketing, pharma, legal...the list goes on. What’s more, from now until 2022, the deep learning market is expected to grow more than 65 percent. Clearly, companies are increasingly looking deeply at this popular machine learning approach to help fulfill business needs. Deep learning makes it possible to process giant datasets with billions of elements and extract useful predictive models. Deep learning is transforming the businesses of leading consumer Web and mobile app companies and is also being adopted by more traditional business enterprises. 

That’s why this week we are pleased to announce the availability of Anaconda on IBM’s Cognitive Systems, the company’s high performance deep learning platform, highlighting the fact that Anaconda is regarded as an important capability for developers building cognitive solutions. The platform empowers these developers and data scientists to build and deploy deep learning applications that are ready to scale. Anaconda is also integrating with the IBM PowerAI software distribution that makes it simpler for companies to take advantage of Power performance and GPU optimization for data intensive cognitive workloads. 

At Anaconda, we’re helping leading businesses across the world, like IBM, solve the world’s most challenging problems—from improving medical treatments to discovering planets to predicting effects of public policy—by handing them tools to identify patterns in data, uncover key insights and transform basic data into a goldmine of intelligence. This news reiterates the importance of Open Data Science in all factors of business. 

Want to learn more about this news? Read the press release here

by swebster at April 20, 2017 03:05 PM

NeuralEnsemble

PyNN 0.9.0 released

I'm happy to announce the release of PyNN 0.9.0!

This version of PyNN adopts the new, simplified Neo object model, first released as Neo 0.5.0, for the data structures returned by Population.get_data(). For more information on the new Neo API, see the Neo release notes.

The main difference for a PyNN user is that the AnalogSignalArray class has been renamed to AnalogSignal, and similarly the Segment.analogsignalarrays attribute is now called Segment.analogsignals.
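In practice the change is mechanical. Here is a minimal sketch of where the renamed attribute shows up (assuming the NEST backend and an arbitrary cell model; any supported simulator exposes the same API):

import pyNN.nest as sim  # any supported backend works the same way

sim.setup(timestep=0.1)
pop = sim.Population(10, sim.IF_cond_exp())
pop.record('v')
sim.run(100.0)

block = pop.get_data()          # a Neo Block
segment = block.segments[0]
vm = segment.analogsignals[0]   # previously: segment.analogsignalarrays[0]
sim.end()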

What is PyNN?

PyNN (pronounced 'pine') is a simulator-independent language for building neuronal network models.

In other words, you can write the code for a model once, using the PyNN API and the Python programming language, and then run it without modification on any simulator that PyNN supports (currently NEURON, NEST and Brian as well as the SpiNNaker and BrainScaleS neuromorphic hardware systems).

Even if you don't wish to run simulations on multiple simulators, you may benefit from writing your simulation code using PyNN's powerful, high-level interface. In this case, you can use any neuron or synapse model supported by your simulator, and are not restricted to the standard models.

The code is released under the CeCILL licence (GPL-compatible).

by Andrew Davison (noreply@blogger.com) at April 20, 2017 11:11 AM

April 19, 2017

Matthew Rocklin

Asynchronous Optimization Algorithms with Dask

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

Summary

In a previous post we built convex optimization algorithms with Dask that ran efficiently on a distributed cluster and were important for a broad class of statistical and machine learning algorithms.

We now extend that work by looking at asynchronous algorithms. We show the following:

  1. APIs within Dask to build asynchronous computations generally, not just for machine learning and optimization
  2. Reasons why asynchronous algorithms are valuable in machine learning
  3. A concrete asynchronous algorithm (Async ADMM) and its performance on a toy dataset

This blogpost is co-authored by Chris White (Capital One) who knows optimization and Matthew Rocklin (Continuum Analytics) who knows distributed computing.

Reproducible notebook available here

Asynchronous vs Blocking Algorithms

When we say asynchronous we contrast it against synchronous or blocking.

In a blocking algorithm you send out a bunch of work and then wait for the result. Dask’s normal .compute() interface is blocking. Consider the following computation where we score a bunch of inputs in parallel and then find the best:

import dask

scores = [dask.delayed(score)(x) for x in L]  # many lazy calls to the score function
best = dask.delayed(max)(scores)
best = best.compute()  # Trigger all computation and wait until complete

This blocks. We can’t do anything while it runs. If we’re in a Jupyter notebook we’ll see a little asterisk telling us that we have to wait.

A Jupyter notebook cell blocking on a dask computation

In a non-blocking or asynchronous algorithm we send out work and track results as they come in. We are still able to run commands locally while our computations run in the background (or on other computers in the cluster). Dask has a variety of asynchronous APIs, but the simplest is probably the concurrent.futures API where we submit functions and then can wait and act on their return.

from dask.distributed import Client, as_completed
client = Client('scheduler-address:8786')

# Send out several computations
futures = [client.submit(score, x) for x in L]

# Find max as results arrive
best = 0
for future in as_completed(futures):
    score = future.result()
    if score > best:
        best = score

These two solutions are computationally equivalent. They do the same work and run in the same amount of time. The blocking dask.delayed solution is probably simpler to write down but the non-blocking futures + as_completed solution lets us be more flexible.

For example, if we get a score that is good enough then we might stop early. If we find that certain kinds of values are giving better scores than others then we might submit more computations around those values while cancelling others, changing our computation during execution.
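Continuing the futures example above, here is a rough sketch of the early-stopping case (GOOD_ENOUGH is a hypothetical threshold, not part of the original example):

# Stop as soon as some score exceeds a (hypothetical) threshold and
# cancel the remaining work.
GOOD_ENOUGH = 0.99

best = 0
for future in as_completed(futures):
    result = future.result()
    best = max(best, result)
    if best >= GOOD_ENOUGH:
        client.cancel(futures)  # cancel any tasks that haven't finished yet
        break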

This ability to monitor and adapt a computation during execution is one reason why people choose asynchronous algorithms. In the case of optimization algorithms we are doing a search process and frequently updating parameters. If we are able to update those parameters more frequently then we may be able to slightly improve every subsequently launched computation. Asynchronous algorithms enable increased flow of information around the cluster in comparison to more lock-step batch-iterative algorithms.

Asynchronous ADMM

In our last blogpost we showed a simplified implementation of Alternating Direction Method of Multipliers (ADMM) with dask.delayed. We saw that in a distributed context it performed well when compared to a more traditional distributed gradient descent. This algorithm works by solving a small optimization problem on every chunk of our data using our current parameter estimates, bringing these back to the local process, combining them, and then sending out new computation on updated parameters.

Now we alter this algorithm to update asynchronously, so that our parameters change continuously as partial results come in in real-time. Instead of sending out and waiting on batches of results, we now consume and emit a constant stream of tasks with slightly improved parameter estimates.

We show three algorithms in sequence:

  1. Synchronous: The original synchronous algorithm
  2. Asynchronous-single: updates parameters with every new result
  3. Asynchronous-batched: updates with all results that have come in since we last updated.

Setup

We create fake data

import numpy as np
import dask.array as da
from dask import delayed, persist

n, k, chunksize = 50000000, 100, 50000
nchunks = n // chunksize  # number of data chunks

beta = np.random.random(k)  # random beta coefficients, no intercept
zero_idx = np.random.choice(len(beta), size=10)
beta[zero_idx] = 0  # set some parameters to 0
X = da.random.normal(0, 1, size=(n, k), chunks=(chunksize, k))
y = X.dot(beta) + da.random.normal(0, 2, size=n, chunks=(chunksize,))  # add noise

X, y = persist(X, y)  # trigger computation in the background

We define local functions for ADMM. These correspond to solving an l1-regularized linear regression problem:

from functools import partial

def local_f(beta, X, y, z, u, rho):
    return ((y - X.dot(beta)) ** 2).sum() + (rho / 2) * np.dot(beta - z + u,
                                                                beta - z + u)

def local_grad(beta, X, y, z, u, rho):
    return 2 * X.T.dot(X.dot(beta) - y) + rho * (beta - z + u)


def shrinkage(beta, t):
    return np.maximum(0, beta - t) - np.maximum(0, -beta - t)

# local_update is the black-box per-chunk solver from the notebook (see the sketch below)
local_update2 = partial(local_update, f=local_f, fprime=local_grad)

lamduh = 7.2 # regularization parameter

# algorithm parameters
rho = 1.2
abstol = 1e-4
reltol = 1e-2

p = k  # dimensionality of the parameter vector
z = np.zeros(p)  # the initial consensus estimate

# an array of the individual "dual variables" and parameter estimates,
# one for each chunk of data
u = np.array([np.zeros(p) for i in range(nchunks)])
betas = np.array([np.zeros(p) for i in range(nchunks)])

Finally, because ADMM doesn't want to work on distributed arrays, but instead on lists of remote numpy arrays (one numpy array per chunk of the dask.array), we convert each of our dask arrays into a list of dask.delayed objects:

XD = X.to_delayed().flatten().tolist() # a list of numpy arrays, one for each chunk
yD = y.to_delayed().flatten().tolist()
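The local_update function itself is treated as a black box in this post; it is defined in the accompanying notebook. As a rough idea of what such a function might look like, here is a sketch that solves the chunk-local subproblem with SciPy's L-BFGS solver (an assumption for illustration, not necessarily what the notebook does):

from scipy.optimize import fmin_l_bfgs_b

def local_update(X, y, beta, z, u, rho, f, fprime):
    # Minimize the chunk-local ADMM objective, warm-starting from the
    # previous estimate for this chunk.
    beta_new, _, _ = fmin_l_bfgs_b(f, x0=beta, fprime=fprime,
                                   args=(X, y, z, u, rho))
    return beta_new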

Synchronous ADMM

In this algorithm we send out many tasks to run, collect their results, update parameters, and repeat. In this simple implementation we continue for a fixed amount of time but in practice we would want to check some convergence criterion.

from time import time

start = time()

while time() - start < MAX_TIME:  # MAX_TIME: wall-clock budget in seconds, set elsewhere
    # process each chunk in parallel, using the black-box 'local_update' function
    betas = [delayed(local_update2)(xx, yy, bb, z, uu, rho)
             for xx, yy, bb, uu in zip(XD, yD, betas, u)]
    betas = np.array(da.compute(*betas))  # collect results back

    # Update Parameters
    ztilde = np.mean(betas + np.array(u), axis=0)
    z = shrinkage(ztilde, lamduh / (rho * nchunks))
    u += betas - z  # update dual variables

    # track convergence metrics
    update_metrics()

Asynchronous ADMM

In the asynchronous version we send out only enough tasks to occupy all of our workers. We collect results one by one as they finish, update parameters, and then send out a new task.

import random  # used below to pick which chunk to resubmit

# Submit enough tasks to occupy our current workers
# (ncores is the total number of worker cores reported by the client)
starting_indices = np.random.choice(nchunks, size=ncores*2, replace=True)
futures = [client.submit(local_update, XD[i], yD[i], betas[i], z, u[i],
                           rho, f=local_f, fprime=local_grad)
           for i in starting_indices]
index = dict(zip(futures, starting_indices))

# An iterator that returns results as they come in
pool = as_completed(futures, with_results=True)

start = time()
count = 0

while time() - start < MAX_TIME:
    # Get next completed result
    future, local_beta = next(pool)
    i = index.pop(future)
    betas[i] = local_beta
    count += 1

    # Update parameters (this could be made more efficient)
    ztilde = np.mean(betas + np.array(u), axis=0)

    if count < nchunks:  # artificially inflate beta in the beginning
        ztilde *= nchunks / (count + 1)
    z = shrinkage(ztilde, lamduh / (rho * nchunks))
    update_metrics()

    # Submit new task to the cluster
    i = random.randint(0, nchunks - 1)
    u[i] += betas[i] - z
    new_future = client.submit(local_update2, XD[i], yD[i], betas[i], z, u[i], rho)
    index[new_future] = i
    pool.add(new_future)

Batched Asynchronous ADMM

With enough distributed workers we find that our parameter-updating loop on the client can be the limiting factor. After profiling it seems that our client was bound not by updating parameters, but rather by computing the performance metrics that we are going to use for the convergence plots below (so not actually a limitation in practice). However we decided to leave this in because it is good practice for what is likely to occur in larger clusters, where the single machine that updates parameters is possibly overwhelmed by a high volume of updates from the workers. To resolve this, we build in batching.

Rather than update our parameters one by one, we update them with however many results have come in so far. This provides a natural defense against a slow client. This approach smoothly shifts our algorithm back over to the synchronous solution when the client becomes overwhelmed. (though again, at this scale we’re fine).

Conveniently, the as_completed iterator has a .batches() method that iterates over all of the results that have come in so far.

# ... same setup as before

pool = as_completed(new_betas, with_results=True)

batches = pool.batches()            # <<<--- this is new

while time() - start < MAX_TIME:

    # Get all tasks that have come in since we checked last time
    batch = next(batches)           # <<<--- this is new
    for future, result in batch:
        i = index.pop(future)
        betas[i] = result
        count += 1

    ztilde = np.mean(betas + np.array(u), axis=0)
    if count < nchunks:
        ztilde *= nchunks / (count + 1)
    z = shrinkage(ztilde, lamduh / (rho * nchunks))
    update_metrics()

    # Submit as many new tasks as we collected
    for _ in batch:                 # <<<--- this is new
        i = random.randint(0, nchunks - 1)
        u[i] += betas[i] - z
        new_fut = client.submit(local_update2, XD[i], yD[i], betas[i], z, u[i], rho)
        index[new_fut] = i
        pool.add(new_fut)

Visual Comparison of Algorithms

To show the qualitative difference between the algorithms we include profile plots of each. Note the following:

  1. Synchronous has blocks of full CPU use followed by blocks of no use
  2. The asynchronous methods are smoother
  3. The asynchronous single-update method has a lot of whitespace / time when CPUs are idling. This is artificial: the code that tracks convergence diagnostics for the plots below is wasteful and sits inside the client's inner loop
  4. We intentionally leave in this wasteful code so that we can reduce it by batching in the third plot, which is more saturated.

You can zoom in using the tools to the upper right of each plot. You can view the full profile in a full window by clicking on the “View full page” link.

Synchronous

View full page

Asynchronous single-update

View full page

Asynchronous batched-update

View full page

Plot Convergence Criteria

Primal residual for async-admm

Analysis

To get a better sense of what these plots convey, recall that optimization problems always come in pairs: the primal problem is typically the main problem of interest, and the dual problem is a closely related problem that provides information about the constraints in the primal problem. Perhaps the most famous example of duality is the Max-flow-min-cut Theorem from graph theory. In many cases, solving both of these problems simultaneously leads to gains in performance, which is what ADMM seeks to do.

In our case, the constraint in the primal problem is that all workers must agree on the optimum parameter estimate. Consequently, we can think of the dual variables (one for each chunk of data) as measuring the “cost” of agreement for their respective chunks. Intuitively, they will start out small and grow incrementally to find the right “cost” for each worker to have consensus. Eventually, they will level out at an optimum cost.

So:

  • the primal residual plot measures the amount of disagreement; “small” values imply agreement
  • the dual residual plot measures the total “cost” of agreement; this increases until the correct cost is found

The plots then tell us the following:

  • the cost of agreement is higher for asynchronous algorithms, which makes sense because each worker is always working with a slightly out-of-date global parameter estimate, making consensus harder
  • blocked ADMM doesn’t update at all until shortly after 5 seconds have passed, whereas async has already had time to converge. (In practice with real data, we would probably specify that all workers need to report in every K updates).
  • asynchronous algorithms take a little while for the information to properly diffuse, but once that happens they converge quickly.
  • both asynchronous and synchronous converge almost immediately; this is most likely due to a high degree of homogeneity in the data (which was generated to fit the model well). Our next experiment should involve real world data.

What we could have done better

Analysis-wise, we expect richer results from performing this same experiment on a real-world data set that isn't as homogeneous as the current toy dataset.

Performance-wise, we can get much better CPU saturation by doing two things:

  1. Not running our convergence diagnostics, or making them much faster
  2. Not running full np.mean computations over all of beta when we’ve only updated a few elements. Instead we should maintain a running aggregation of these results.

With these two changes (each of which is easy) we're fairly confident that we can scale out to decently large clusters while still saturating hardware.
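For the second point, here is a minimal sketch of what such a running aggregation could look like, reusing the betas, u and nchunks variables from the setup above (the function name is ours, not from the post):

# Keep running sums so that updating a single chunk costs O(k)
# instead of a full np.mean over every chunk.
beta_sum = betas.sum(axis=0)
u_sum = u.sum(axis=0)

def consensus_after_update(i, new_beta, new_u):
    global beta_sum, u_sum
    beta_sum += new_beta - betas[i]
    u_sum += new_u - u[i]
    betas[i] = new_beta
    u[i] = new_u
    return (beta_sum + u_sum) / nchunks  # same value as np.mean(betas + u, axis=0)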

April 19, 2017 12:00 AM

April 18, 2017

Enthought

Handling Missing Values in Pandas DataFrames: the Hard Way, and the Easy Way

This is the second blog in a series. See the first blog here: Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way

No dataset is perfect and most datasets that we have to deal with on a day-to-day basis have values missing, often represented by “NA” or “NaN”. One of the reasons why the Pandas library is as popular as it is in the data science community is because of its capabilities in handling data that contains NaN values.

But spending time looking up the relevant Pandas commands might be cumbersome when you are exploring raw data or prototyping your data analysis pipeline. This is one of the places where the Canopy Data Import Tool helps make data munging faster and easier, by simplifying the task of identifying missing values in your raw data and removing/replacing them.

Why are missing values a problem you ask? We can answer that question in the context of machine learning. scikit-learn and TensorFlow are popular and widely used libraries for machine learning in Python. Both of them caution the user about missing values in their datasets. Various machine learning algorithms expect all the input values to be numerical and to hold meaning. Both of the libraries suggest removing rows and/or columns that contain missing values.

If removing the missing values is not an option, given the size of your dataset, then they suggest replacing the missing values. The scikit-learn library provides an Imputer class, which can be used to replace missing values. See the scikit-learn documentation for an example of how the Imputer class is used. Similarly, the decode_csv function in the TensorFlow library can be passed a record_defaults argument, which will replace missing values in the dataset. See the TensorFlow documentation for specifics.

The Data Import Tool provides capabilities to handle missing values in your dataset because we strongly believe that discovering and handling missing values in your dataset is a part of the data import and cleaning phase and not the analysis phase of the data science process.

Digging into the specifics, here we’ll compare how you can go about handling missing values with three typical scenarios, first using the Pandas library, then contrasting with the Data Import Tool:

  1. Identifying missing values in data
  2. Replacing missing values in data, and
  3. Removing missing values from data.

Note: Pandas’ internal representation of your data is called a DataFrame. A DataFrame is simply a tabular data structure, similar to a spreadsheet or a SQL table.


Identifying Missing Values – The Hard Way: Using Pandas

If you are interested in identifying missing values in a row/column of a DataFrame, you need to understand the isnull, any, all methods on a DataFrame.
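For reference, here is roughly how those methods combine (the DataFrame is invented for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})

df.isnull()              # boolean mask of missing cells
df.isnull().any()        # which columns contain at least one missing value
df.isnull().any(axis=1)  # which rows contain at least one missing value
df.isnull().all()        # which columns are missing in every row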

Taking a detour, we have so far described missing values as being represented by NA or NaN. But what if the missing values in a column are values that aren't of the same type as the rest of the cells in the column, say a string in a column otherwise containing integers? Identifying such values in Pandas is not trivial.
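One possible workaround (a sketch, not something the post prescribes) is to coerce the column to a numeric dtype and see which cells fail to convert:

import pandas as pd

raw = pd.DataFrame({"a": ["1", "2", "oops", "4"]})     # hypothetical integer column with a stray string
coerced = pd.to_numeric(raw["a"], errors="coerce")     # non-numeric cells become NaN
bad_rows = raw[coerced.isnull() & raw["a"].notnull()]  # cells that were present but not numeric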

Identifying Missing Values – The Easy Way: Using the Data Import Tool

Highlighting null values using the Data Import Tool

Instead of giving you the column names and index values of the cells containing missing values, the Data Import Tool shows them to you. Simply checking the `Highlight Missing Values` checkbox in the bottom-left corner of the Data Import Tool will paint the DataFrame to show you the cells that contain missing values. Further, the Data Import Tool understands that your data file might have errors, like having a string value in a column otherwise containing integers. The Data Import Tool highlights the cell and displays the underlying content too.

The Data Import Tool can highlight missing value cells, helping you easily identify columns or rows containing NaN values


Replacing Missing Values – The Hard Way: Using Pandas

While Pandas does a great job at handling column operations even if the columns contain NaN values, our data analysis workflow might need us to replace the missing values in our data.

After spending a little time browsing through the Pandas documentation, you will come across the `fillna` method on a DataFrame, which can be used to replace missing values. The arguments you pass to the fillna method determine what value the missing values in your DataFrame are replaced with and how the underlying column dtypes change after replacing the missing values.

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
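For example, on an invented DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

df.fillna(0)                # replace every NaN with 0
df.fillna(df["a"].mean())   # or with a statistic computed from the data
df.fillna(method="ffill")   # or propagate the last valid observation forward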

Replacing Missing Values – The Easy Way: Using the Data Import Tool

With the Data Import Tool, you can replace missing values by right-clicking on the column containing missing values and selecting the appropriate Fill Missing Values item. Opting to replace missing values in the column with a specific value will open an additional dialog, prompting you to enter that value.

Fill missing values

Replace missing values in your DataFrame using the Canopy Data Import Tool


Removing Missing Values – The Hard Way: Using Pandas

While removing columns or rows containing missing values might be a little extreme, it might be necessary. Pandas suggests that you use the dropna method on the DataFrame to drop columns or rows that contain missing values. The arguments you pass to the dropna method will determine what rows/columns are removed from the DataFrame.

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
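For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, np.nan]})

df.dropna()            # drop rows containing any missing value
df.dropna(axis=1)      # drop columns containing any missing value
df.dropna(how="all")   # drop only rows where every value is missing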

Removing Missing Values – The Easy Way: Using the Data Import Tool

With the Data Import Tool on the other hand, you can remove rows/columns containing missing values by selecting the “Delete Empty Columns” or “Delete Empty Rows” item from the “Transform” menu. An additional dialog will pop up asking you how lenient you want to be in removing rows/columns containing missing values – if you choose ‘any’, the Data Import Tool will remove rows/columns that contain any missing values; if you choose ‘all’, the Data Import Tool will only remove those rows/columns which contain only missing values.

Delete Empty Rows & Columns

Delete empty cells in rows/columns using the Canopy Data Import Tool

Delete Empty Columns

Choose to delete columns containing any null value or columns full of null values using the Canopy Data Import Tool

We now have data that contains no missing values. So far, we’ve used the DIT to easily discover the missing values in our dataset and to remove or replace them. Finally, by clicking on ‘Use DataFrame’, you can import the dataset as a pandas DataFrame into the IPython workspace of the Canopy Editor. If you’re a data scientist, your data is now free of missing values and can be converted to arrays or variables and passed on to scikit-learn, TensorFlow or any other machine learning library of your choice.

Ready to try the Canopy Data Import Tool?

Download Canopy (free) and click on the icon to start a free trial of the Data Import Tool today

This is the second blog in a series. See the first blog here: Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way


Additional resources:

Watch a 2-minute demo video to see how the Canopy Data Import Tool works:

See the Webinar “Fast Forward Through Data Analysis Dirty Work” for examples of how the Canopy Data Import Tool accelerates data munging:

The post Handling Missing Values in Pandas DataFrames: the Hard Way, and the Easy Way appeared first on Enthought Blog.

by Rahul Poruri at April 18, 2017 02:21 PM

Pierre de Buyl

A concise derivation of the Wiener-Khinchin theorem

Introduction

While teaching a class on statistical physics, I found myself unhappy with textbook derivations of the Wiener-Khinchin theorem. I worked my way to a very short derivation that is free of integral bounds manipulations and of holes, or so I believe.

In either the books by Balakrishnan (Elements of Nonequilibrium Statistical Mechanics, Ane Books, 2008), Risken (The Fokker-Planck Equation, 2nd edition, Springer-Verlag, 1989), Coffey-Kalmykov-Waldron (The Langevin Equation, 2nd edition, World Scientific, 2004) or MathWorld, I could not find a short derivation that would cleanly take into account the averaging procedure or that would not resort to splittings of the domain of integration. I present here one derivation and a numerical illustration with Python.

{% notebook 2017/wiener_khinchine.ipynb %}

by Pierre de Buyl at April 18, 2017 08:00 AM

Matthieu Brucher

Announcement: Audio TK 2.0.0

ATK is updated to 2.0.0 with a major refactoring to ensure signed/unsigned consistency, a new Adaptive module and EQ design support. Complex-valued filters are also now available, allowing simultaneous dual-channel processing and advanced filters like complex LMS filters.

Thanks to Travis and Appveyor, binaries for the releases are now uploaded to Github. On all platforms we compile static and shared libraries. On Linux, builds for gcc 5, gcc 6, clang 3.8 and clang 3.9 are generated; on OS X, XCode 7 and XCode 8 builds are available as universal binaries; and on Windows, 32-bit and 64-bit builds, with dynamic or static (no shared libraries in this case) runtime, are also generated.

Download link: ATK 2.0.0

Changelog:
2.0.0
* Refactored fixed line delays (performance improvement)
* Allow new filters to have unconnected inputs (can only be changed inside a filter)
* Refactored the stereo universal delay line to allow more simultaneous channels (renamed to MultipleUniversalDelayLineFilter)
* ATK now allows complex-valued filters, with filters to convert from real to/from complex
* Added a BlockLMSFilter with Python wrappers
* Added a LMSFilter with Python wrappers
* Added a RemezBasedCoefficients with Python wrappers to be used with FIRFilter to generate a FIR filter from a template
* Added a RLSFilter with Python wrappers
* Support for IPP as a FFT backend
* Refactored the API for global unsigned consistency


by Matt at April 18, 2017 07:10 AM

April 17, 2017

Titus Brown

Workshops posted for DIBSI - July 10-15, July 17-21

As part of our Summer Institute in Data Intensive Biology, we will be running nine week-long computational workshops from July 10 to July 21 at the University of California, Davis.

Week 1: July 10-15

Week 2: July 17-21

All workshops will take place at UC Davis; please see the venue information for details.

Workshops may extend into the evening hours; please plan on devoting the entire time to the workshop. Workshops are $350/wk.

On-campus housing information is available for approximately $400/wk, which includes breakfast and dinner. Housing registration currently closes April 26th.

Registration links for each workshop are under the workshop description; housing is linked there as well, and must be booked separately. Attendees of both weeks of workshops may book housing for both weeks, and attendees of the two-week introductory bioinformatics workshop, ANGUS, may book a full four weeks of housing.

For questions about registration, travel, invitation letters, or other general topics, please contact dibsi.training@gmail.com. For workshop specific questions, contact the instructors (e-mail links are under each workshop).

--titus

by C. Titus Brown at April 17, 2017 10:00 PM

April 13, 2017

Matthew Rocklin

Streaming Python Prototype

This work is supported by Continuum Analytics, and the Data Driven Discovery Initiative from the Moore Foundation.

This blogpost is about experimental software. The project may change or be abandoned without warning. You should not depend on anything within this blogpost.

This week I built a small streaming library for Python. This was originally an exercise to help me understand streaming systems like Storm, Flink, Spark-Streaming, and Beam, but the end result of this experiment is not entirely useless, so I thought I’d share it. This blogpost will talk about my experience building such a system and what I valued when using it. Hopefully it elevates interest in streaming systems among the Python community.

Background with Iterators

Python has sequences and iterators. We’re used to mapping, filtering and aggregating over lists and generators happily.

def inc(x): return x + 1          # helper functions assumed by the post
def iseven(x): return x % 2 == 0

seq = [1, 2, 3, 4, 5]
seq = map(inc, seq)
seq = filter(iseven, seq)

>>> sum(seq)  # 2 + 4 + 6
12

If these iterators are infinite, for example if they are coming from some infinite data feed like a hardware sensor or stock market signal, then most of these pieces still work, except for the final aggregation, which we replace with an accumulating aggregation.

def get_data():
    i = 0
    while True:
        i += 1
        yield i

from toolz import accumulate  # an accumulate with the (binop, seq) signature; toolz provides one

seq = get_data()
seq = map(inc, seq)
seq = filter(iseven, seq)
seq = accumulate(lambda total, x: total + x, seq)

>>> next(seq)  # 2
2
>>> next(seq)  # 2 + 4
6
>>> next(seq)  # 2 + 4 + 6
12

This is usually a fine way to handle infinite data streams. However this approach becomes awkward if you don’t want to block on calling next(seq) and have your program hang until new data comes in. This approach also becomes awkward when you want to branch off your sequence to multiple outputs and consume from multiple inputs. Additionally there are operations like rate limiting, time windowing, etc. that occur frequently but are tricky to implement if you are not comfortable using threads and queues. These complications often push people to a computation model that goes by the name streaming.

To introduce streaming systems in this blogpost I’ll use my new tiny library, currently called streams (better name to come in the future). However if you decide to use streaming systems in your workplace then you should probably use some other more mature library instead. Common recommendations include the following:

  • ReactiveX (RxPy)
  • Flink
  • Storm (Streamparse)
  • Beam
  • Spark Streaming

Streams

We make a stream, which is an infinite sequence of data into which we can emit values and from which we can subscribe to make new streams.

from streams import Stream
source = Stream()

From here we replicate our example above. This follows the standard map/filter/reduce chaining API.

s = (source.map(inc)
           .filter(iseven)
           .accumulate(lambda total, x: total + x))

Note that we haven’t pushed any data into this stream yet, nor have we said what should happen when data leaves. So that we can look at results, let’s make a list and push data into it when data leaves the stream.

results = []
s.sink(results.append)  # call the append method on every element leaving the stream

And now let’s push some data in at the source and see it arrive at the sink:

>>> for x in [1, 2, 3, 4, 5]:
...     source.emit(x)

>>> results
[2, 6, 12]

We’ve accomplished the same result as our infinite iterator, except that rather than pulling data with next we push data through with source.emit. And we’ve done all of this at only a 10x slowdown over normal Python iterators :) (this library takes a few microseconds per element rather than CPython’s normal 100ns overhead).

This will get more interesting in the next few sections.

Branching

This approach becomes more interesting if we add multiple inputs and outputs.

source = Stream()
s = source.map(inc)
evens = s.filter(iseven)
evens.accumulate(add)

odds = s.filter(isodd)
odds.accumulate(sub)

Or we can combine streams together

second_source = Stream()
s = combine_latest(second_source, odds).map(sum)

So you may have multiple different input sources updating at different rates and you may have multiple outputs, perhaps some going to a diagnostics dashboard, others going to long-term storage, others going to a database, etc.. A streaming library makes it relatively easy to set up infrastructure and pipe everything to the right locations.

Time and Back Pressure

When dealing with systems that produce and consume data continuously you often want to control the flow so that the rates of production are not greater than the rates of consumption. For example if you can only write data to a database at 10MB/s or if you can only make 5000 web requests an hour then you want to make sure that the other parts of the pipeline don’t feed you too much data, too quickly, which would eventually lead to a buildup in one place.

To deal with this, as our operations push data forward they also accept Tornado Futures as a receipt.

Upstream: Hey Downstream! Here is some data for you
Downstream: Thanks Upstream!  Let me give you a Tornado future in return.
            Make sure you don't send me any more data until that future
            finishes.
Upstream: Got it, Thanks!  I will pass this to the person who gave me the
          data that I just gave to you.

Under normal operation you don’t need to think about Tornado futures at all (many Python users aren’t familiar with asynchronous programming), but it’s nice to know that the library will keep track of balancing out flow. The code below uses the @gen.coroutine decorator and yield expressions common to Tornado coroutines. This is similar to the async/await syntax in Python 3. Again, you can safely ignore it if you’re not familiar with asynchronous programming.

@gen.coroutine
def write_to_database(data):
    with connect('my-database:1234/table') as db:
        yield db.write(data)

source = Stream()
(source.map(...)
       .accumulate(...)
       .sink(write_to_database))  # <- sink produces a Tornado future

for data in infinite_feed:
    yield source.emit(data)       # <- that future passes through everything
                                  #    and ends up here to be waited on

There are also a number of operations to help you buffer flow in the right spots, control rate limiting, etc..

source = Stream()
(source.timed_window(interval=0.050)  # Capture all records of the last 50ms into batches
       .filter(len)                   # Remove empty batches
       .map(...)                      # Do work on each batch
       .buffer(10)                    # Allow ten batches to pile up here
       .sink(write_to_database))      # Potentially rate-limiting stage

I’ve written enough little utilities like timed_window and buffer to discover both that in a full system you would want more of these, and that they are easy to write. Here is the definition of timed_window

class timed_window(Stream):
    def __init__(self, interval, child, loop=None):
        self.interval = interval
        self.buffer = []
        self.last = gen.moment

        Stream.__init__(self, child, loop=loop)
        self.loop.add_callback(self.cb)

    def update(self, x, who=None):
        self.buffer.append(x)
        return self.last

    @gen.coroutine
    def cb(self):
        while True:
            L, self.buffer = self.buffer, []
            self.last = self.emit(L)
            yield self.last
            yield gen.sleep(self.interval)

If you are comfortable with Tornado coroutines or asyncio then my hope is that this should feel natural.
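In the same spirit, here is what a rate_limit node might look like, written in the same style (a sketch for illustration, not the library's actual implementation):

from tornado import gen
from streams import Stream

class rate_limit(Stream):
    def __init__(self, interval, child, loop=None):
        self.interval = interval
        Stream.__init__(self, child, loop=loop)

    @gen.coroutine
    def update(self, x, who=None):
        yield self.emit(x)              # pass the element downstream
        yield gen.sleep(self.interval)  # then hold back-pressure for `interval` seconds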

Recursion and Feedback

By connecting the sink of one stream to the emit function of another we can create feedback loops. Here is a stream that produces the Fibonacci sequence. To stop it from overwhelming our local process we add in a rate-limiting step:

from streams import Stream
source = Stream()
s = source.sliding_window(2).map(sum)
L = s.sink_to_list()  # store result in a list

s.rate_limit(0.5).sink(source.emit)  # pipe output back to input

source.emit(0)  # seed with initial values
source.emit(1)
>>> L
[1, 2, 3, 5]

>>> L  # wait a couple seconds, then check again
[1, 2, 3, 5, 8, 13, 21, 34]

>>> L  # wait a couple seconds, then check again
[1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

Note: due to the time rate-limiting functionality this example relied on an event loop running somewhere in another thread. This is the case for example in a Jupyter notebook, or if you have a Dask Client running.

Things that this doesn’t do

If you are familiar with streaming systems then you may say the following:

Let’s not get ahead of ourselves; there’s way more to a good streaming system than what is presented here. You need to handle parallelism, fault tolerance, out-of-order elements, event/processing times, etc.

… and you would be entirely correct. What is presented here is not in any way a competitor to existing systems like Flink for production-level data engineering problems. There is a lot of logic that hasn’t been built here (and it’s good to remember that this project was built at night over a week).

Some of those things, though, and in particular the distributed computing bits, we may get for free.

Distributed computing

So, during the day I work on Dask, a Python library for parallel and distributed computing. The core task schedulers within Dask are more than capable of running these kinds of real-time computations. They handle far more complex real-time systems every day, including few-millisecond latencies, node failures, asynchronous computation, etc. People use these features today inside companies, but they tend to roll their own system rather than use a high-level API (indeed, they chose Dask because their system was complex enough or private enough that rolling their own was a necessity). Dask lacks any kind of high-level streaming API today.

Fortunately, the system we described above can be modified fairly easily to use a Dask Client to submit functions rather than run them locally.

from dask.distributed import Client
client = Client()       # start Dask in the background

(source.to_dask()
       .scatter()        # send data to a cluster
       .map(...)         # this happens on the cluster
       .accumulate(...)  # this happens on the cluster
       .gather()         # gather results back to local machine
       .sink(...))       # This happens locally

Other things that this doesn’t do, but could with modest effort

There are a variety of ways that we could improve this with modest cost:

  1. Streams of sequences: We can be more efficient if we pass not individual elements through a Stream, but rather lists of elements. This will let us lose the microseconds of overhead that we have now per element and let us operate at pure Python (100ns) speeds.
  2. Streams of NumPy arrays / Pandas dataframes: Rather than pass individual records we might pass bits of Pandas dataframes through the stream. So for example rather than filtering elements we would filter out rows of the dataframe. Rather than compute at Python speeds we can compute at C speeds. We’ve built a lot of this logic before for dask.dataframe. Doing this again is straightforward but somewhat time consuming.
  3. Annotate elements: we want to pass through event time, processing time, and presumably other metadata
  4. Convenient Data IO utilities: We would need some convenient way to move data in and out of Kafka and other common continuous data streams.

None of these things are hard. Many of them are afternoon or weekend projects if anyone wants to pitch in.

Reasons I like this project

This was originally built strictly for educational purposes. I (and hopefully you) now know a bit more about streaming systems, so I’m calling it a success. It wasn’t designed to compete with existing streaming systems, but still there are some aspects of it that I like quite a bit and want to highlight.

  1. Lightweight setup: You can import it and go without setting up any infrastructure. It can run (in a limited way) on a Dask cluster or on an event loop, but it’s also fully operational in your local Python thread. There is no magic in the common case. Everything up until time-handling runs with tools that you learn in an introductory programming class.
  2. Small and maintainable: The codebase is currently a few hundred lines. It is also, I claim, easy for other people to understand. Here is the code for filter:

    class filter(Stream):
        def __init__(self, predicate, child):
            self.predicate = predicate
            Stream.__init__(self, child)
    
        def update(self, x, who=None):
            if self.predicate(x):
                return self.emit(x)
    
  3. Composable with Dask: Handling distributed computing is tricky to do well. Fortunately this project can offload much of that worry to Dask. The dividing line between the two systems is pretty clear and, I think, could lead to a decently powerful and maintainable system if we spend time here.
  4. Low performance overhead: Because this project is so simple it has overheads in the few-microseconds range when in a single process.
  5. Pythonic: All other streaming systems were originally designed for Java/Scala engineers. While they have APIs that are clearly well thought through they are sometimes not ideal for Python users or common Python applications.

Future Work

This project needs both users and developers.

I find it fun and satisfying to work on and so encourage others to play around. The codebase is short and, I think, easily digestible in an hour or two.

This project was built without a real use case (see the project’s examples directory for a basic Daskified web crawler). It could use patient users with real-world use cases to test-drive things and hopefully provide PRs adding necessary features.

I genuinely don’t know if this project is worth pursuing. This blogpost is a test to see if people have sufficient interest to use and contribute to such a library or if the best solution is to carry on with any of the fine solutions that already exist.

pip install git+https://github.com/mrocklin/streams

April 13, 2017 12:00 AM

April 11, 2017

Enthought

Webinar- Get More From Your Core: Applying Artificial Intelligence to CT, Photo, and Well Log Analysis with Virtual Core

What: Presentation, demo, and Q&A with Brendon Hall, Geoscience Product Manager, Enthought

Who should watch this webinar:

  • Oil and gas industry professionals who are looking for ways to extract more value from expensive science wells
  • Those interested in learning how artificial intelligence and machine learning techniques can be applied to core analysis



Geoscientists and petroleum engineers rely on accurate core measurements to characterize reservoirs, develop drilling plans and de-risk play assessments. Whole-core CT scans are now routinely performed on extracted well cores; however, the data produced from these scans are difficult to visualize and integrate with other measurements.

Virtual Core automates aspects of core description for geologists, drastically reducing the time and effort required for core description, and its unified visualization interface displays cleansed whole-core CT data alongside core photographs and well logs. It provides tools for geoscientists to analyze core data and extract features from sub-millimeter scale to the entire core.

In this webinar and demo, we’ll start by introducing the Clear Core processing pipeline, which automatically removes unwanted artifacts (such as tubing) from the CT image. We’ll then show how the machine learning capabilities in Virtual Core can be used to describe the core, extracting features such as bedding planes and dip angle. Finally, we’ll show how the data can be viewed and analyzed alongside other core data, such as photographs, wellbore images, well logs, plug measurements, and more.

What You’ll Learn:

  • How core CT data, photographs, well logs, borehole images, and more can be integrated into a digital core workshop
  • How digital core data can shorten core description timelines and deliver business results faster
  • How new features can be extracted from digital core data using artificial intelligence
  • Novel workflows that leverage these features, such as identifying parasequences and strategies for determining net pay


Presenter:

Brendon Hall, Geoscience Product Manager and Application Engineer, Enthought

Additional Resources

Other Blogs and Articles on Virtual Core:

The post Webinar- Get More From Your Core: Applying Artificial Intelligence to CT, Photo, and Well Log Analysis with Virtual Core appeared first on Enthought Blog.

by Brendon Hall at April 11, 2017 01:00 PM

Matthieu Brucher

Audio Toolkit: Create a FIR Filter from a Template (EQ module)

Last week, I published a post on adaptive filtering. It was long overdue, but I actually had one other project on hold for even longer: allowing a user to specify a filter template and let Audio Toolkit figure out a FIR filter from this template.

Remez/Parks & McClellan algorithm

The most famous algorithm is the Remez/Parks & McClellan algorithm. In Matlab, it’s called remez, but Remez is actually a more generic algorithm than just FIR determination.

The algorithm starts by selecting a few random points on the template in the bands where the user set non-zero weights. The zero-weight regions are usually the transition zones, where the filter response can roam free; you usually don't want to make them too big, especially in band-pass filters. Since the resulting filter has ripples, the weight assigned to each band in the template controls how large those ripples are: use a big weight where the ripples should be small, and a small weight where they don't matter.

Then, the Remez algorithm is all about moving these points to the extrema of the difference between the template and the actual filter. At the end, the result is an optimal filter for the given template and a given order.

The quality of the result often rests on the selection of the starting points. If all starting points fall in only one band, the determination of the filter goes wrong. As such, Audio Toolkit selects equidistant points so that all bands are covered. Of course, if one band is too small, the determination will still fail.
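To make the weight idea concrete, here is a small example using SciPy's implementation of the same algorithm (scipy.signal.remez, not Audio ToolKit's RemezBasedCoefficients, so treat the numbers as an illustration of the template/weight mechanism only):

import numpy as np
from scipy import signal

# Low-pass template in normalised frequencies (Nyquist = 0.5): pass band up
# to 0.10, stop band from 0.15; the 0.10-0.15 gap is the transition zone
# where the filter is left free.
taps = signal.remez(numtaps=101,
                    bands=[0.0, 0.10, 0.15, 0.5],
                    desired=[1.0, 0.0],
                    weight=[10.0, 1.0])  # big weight = small ripples in that band

w, h = signal.freqz(taps, worN=2048)
response_db = 20 * np.log10(np.abs(h) + 1e-12)  # inspect ripples against the template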

Demo

There are many good papers on the Remez algorithm for FIR determination, so I won't take the time to rehash something that lots of people have done far better than I could. Instead, I'll try to explain how it goes on a simple example, with the Python script that was used as the reference test case during the development of the plugin.

Instead of using the equidistant start, I used the set of starting indices from the reference paper (and the same template). The starting indices are:

[51 101 341 361 531 671 701 851]

After the optimization, we get the following error function:
Remez Iteration 1

The maximum error is 0.0325 in that case. The algorithm then selects new indices for the next iteration, at the local minima and maxima of the current error function:

[  0 151 296 409 512 595 744 877]

From these indices, we compute the optimal parameters again and then get a new error function (notice that the highlighted points correspond to the previous min/max):

Remez Iteration 2

The maximum error is now 0.162. And we start the selection process again:

[  0 163 320 409 512 579 718 874]

Once again, we get a new error function:

Remez Iteration 3

The max error is a little bit bigger and is now 0.169. We select new indices:

[  0 163 320 409 512 579 718 874]

The indices are identical to the previous ones, so the exchange has converged and the search stops.

The resulting filter has the following transfer function (the template is in red):

Estimated filter against template

Conclusion

There is finally a way of designing FIR filters in Audio Toolkit that doesn't require you to go to Matlab or Python. This can be quite handy for designing a linear-phase filter on the fly in a plugin. There is probably more work to be done on optimizing the design step, but the processing part itself is already fully optimized.


by Matt at April 11, 2017 07:51 AM