Planet SciPy 2020-04-02 05:29:54

6 GAN Architectures You Really Should Know

Generative Adversarial Networks (GANs) were first introduced in 2014 by Ian Goodfellow et. al. and since then this topic itself opened up a new area of research. Within a few years, the research community came up with plenty of papers on this topic some of which have very interesting names :). You have CycleGAN, followed […]

The post 6 GAN Architectures You Really Should Know appeared first on 2020-04-01 11:21:22

What Is The Machine Learning Engineer Salary?

According to the O’Reilly study The State of Machine Learning Adoption in the Enterprise, 50% of their respondents claim to have adopted machine learning to a different extent.  The demand for machine learning engineers is growing with the development of technology and new discoveries in the world of data science. More and more organizations decide […]

The post What Is The Machine Learning Engineer Salary? appeared first on 2020-03-31 18:22:22

Interview with a Head of AI: Vladimir Rybakov

Interview with a Head of AI: Vladimir Rybakov Some time ago we sat down with Vladimir Rybakov, Head of Data Science at WaveAccess and a master communicator of business to ML teams and ML problems to business people. We talked about: How to solve problems fast under extreme pressure How taking responsibility can help you […]

The post Interview with a Head of AI: Vladimir Rybakov appeared first on

Anaconda 2020-03-31 15:42:54

Securing Pangeo with Dask Gateway

This post is also available on the Pangeo blog. Over the past few weeks, we have made some exciting changes to Pangeo’s cloud deployments. These changes will make using Pangeo’s clusters easier for users while…

The post Securing Pangeo with Dask Gateway appeared first on Anaconda.

Anaconda 2020-03-30 20:51:06

Why Understanding CVEs Is Critical for Data Scientists

CVEs are Common Vulnerabilities and Exposures found in software components. Because modern software is complex with its many layers, interdependencies, data input, and libraries, vulnerabilities tend to emerge over time. Ignoring a high CVE score…

The post Why Understanding CVEs Is Critical for Data Scientists appeared first on Anaconda. 2020-03-30 09:52:44

This Week in Machine Learning: Our Podcast, ML Tips, and Neural Networks

If you’re stuck at home due to the coronavirus pandemic, you can dedicate your free time to deepening your knowledge on machine learning. If you’ve already read all the books and available resources, check out this weekly roundup. Here goes a dose of the latest news, discoveries, and inspiring stories. There is something for everyone. […]

The post This Week in Machine Learning: Our Podcast, ML Tips, and Neural Networks appeared first on 2020-03-27 08:21:15

The Best MLflow Alternatives (2020 Update)

MLflow is an open-source platform that helps manage the whole machine learning lifecycle. This includes experimentation, but also reproducibility and deployment. Each of these three elements represented by one MLflow component: Tracking, Projects, and Models. That means a data scientist who works with MLflow is able to track an experiment, organize it, describe it for […]

The post The Best MLflow Alternatives (2020 Update) appeared first on 2020-03-26 11:19:24

Top Machine Learning Influencers – All The Names You Need to Know

Following the great minds of machine learning can help you discover new things and deepen your knowledge. It’s fascinating to learn from the best scientists. Among them, you will find influencers, teachers, business leaders, and even many more. Undeniably their expertise can help to change the world and make it a better place. On this […]

The post Top Machine Learning Influencers – All The Names You Need to Know appeared first on 2020-03-24 08:18:00

This Week in Machine Learning: Books, Quantum Computing & More

Machine learning is such a vast field of science that it’s impossible to wrap your head around it all. New information is generated every millisecond, so how to comprehend it all? It’s a difficult task but to help you find out what’s happening in machine learning, I gathered the best resources. Here goes a dose […]

The post This Week in Machine Learning: Books, Quantum Computing & More appeared first on 2020-03-23 18:04:33

Interview with a Head of AI: Pawel Godula

Interview with Head of AI: Pawel Godula A few weeks ago I had a chance to interview an amazing person and total rockstar when it comes to modeling and understanding customer data. We’ll talk about: How to build models that are robust to change How to become a leader in a technical organization How to […]

The post Interview with a Head of AI: Pawel Godula appeared first on 2020-03-23 09:43:24

The Best TensorBoard Alternatives (2020 Update)

TensorBoard is a visualization toolkit for TensorFlow that lets you analyse model training runs. It allows you to visualize various aspects of machine learning experiment, such as metrics, visualize model graphs, view tensors’ histograms and more. To that end, TensorBoard fits in the raising need for tools to track and visualize machine learning experiments. While […]

The post The Best TensorBoard Alternatives (2020 Update) appeared first on

Anaconda 2020-03-18 13:55:19

How We’re Responding to COVID-19

To our community of customers, partners, contributors, and friends: We are all facing a dynamic and difficult situation in the face of the COVID-19 pandemic. Our families, friends, customers, employees, and communities are all dramatically…

The post How We’re Responding to COVID-19 appeared first on Anaconda. 2020-03-16 18:51:25

This Week in Machine Learning: ML & Remote Work, Useful Tools, and Neural Networks

Every day interesting things happen in the world of Data Science. And if you’re staying at home due to the coronavirus pandemic, make sure to check the best picks from machine learning in breaks between work. Here goes a dose of the latest news, discoveries, and inspiring stories. There is something for everyone. Enjoy your […]

The post This Week in Machine Learning: ML & Remote Work, Useful Tools, and Neural Networks appeared first on

Quansight Labs 2020-03-14 12:25:55

Documentation as a way to build Community

As a long time user and participant in open source communities, I've always known that documentation is far from being a solved problem. At least, that's the impression we get from many developers: "writing docs is boring"; "it's a chore, nobody likes to do it". I have come to realize I'm one of those rare people who likes to write both code and documentation.

Nobody will argue against documentation. It is clear that for an open-source software project, documentation is the public face of the project. The docs influence how people interact with the software and with the community. It sets the tone about inclusiveness, how people communicate and what users and contributors can do. Looking at the results of a “NumPy Tutorial” search on any search engine also gives an idea of the demand for this kind of content - it is possible to find documentation about how to read the NumPy documentation!

I've started working at Quansight in January,

NumFOCUS 2020-03-13 15:02:49

PyData COVID-19 Response

The safety and well-being of our community are extremely important to us. We have therefore decided to postpone all PyData conferences scheduled to take place until the end of June: PyData Miami PyData London PyData Amsterdam We have been closely monitoring the situation and believe this is the best action to take based on the […]

The post PyData COVID-19 Response appeared first on NumFOCUS.

Quansight Labs 2020-03-12 12:39:00

uarray: GSoC Participation

I'm pleased to announce that uarray is participating in GSoC '20 as a sub-organization under the umbrella of the Python Software Foundation. Our ideas page is up here, go take a look and see if you (or someone you know) is interested in participating, either as a student or as a mentor.

Prasun Anand and Peter Bell and myself will be mentoring, and we plan to take a maximum of two students, unless more community mentors show up.

There have been quite a few pull requests already to qualify from prospective students, some even going as far as to begin the work described in the idea they plan to work on.

We're quite excited by the number of students who have shown an interest in participating, and we look forward to seeing excellent applications! What's more exciting, though, are some of the first contributions from people not currently at Quansight, in the true spirit of open-source software!

Anaconda 2020-03-11 22:12:24

Anaconda Individual Edition 2020.02: New Name, Exciting Features

  We are pleased to announce the release of Anaconda Individual Edition (formerly Anaconda Distribution) 2020.02! There are some exciting new features in this release, but first we’ll touch on the name change. Recently, we…

The post Anaconda Individual Edition 2020.02: New Name, Exciting Features appeared first on Anaconda.

Quansight Labs 2020-03-11 11:30:00

Planned architectural work for PyData/Sparse

What have we been doing so far? 🤔 Research 📚

A lot of behind the scenes work has been taking place on PyData/Sparse. Not so much in terms of code, more in terms of research and community/team building. I've more-or-less decided to use the structure and the research behind the Tensor Algebra Compiler, the work of Fredrik Kjolstad and his collaborators at MIT. 🙇🏻‍♂️ To this end, I've read/watched the following talks and papers:

Read more… (2 min remaining to read)

NumFOCUS 2020-03-10 19:13:20

Statement on Coronavirus

As you are aware, the Coronavirus (COVID-19) is a topic of frequent and ongoing discussions. We would like to provide an update on our status and policies as well as provide resources for additional information. As of today, our event schedule remains as posted on event sites. Any changes or updates will be immediately shared. […]

The post Statement on Coronavirus appeared first on NumFOCUS.

Filipe Saraiva's blog 2020-02-24 16:22:14

Akademy 2019

Past September the Italian city of Milan hosted the KDE contributors meeting called Akademy, the main KDE conference where contributors from different areas like translators, developers, artists, promoters and more stay together for some days thinking and building the future of KDE projects and community(ies). Firstly before Akademy I departed from Brazil to Portugal to… Continue a ler »Akademy 2019
Quansight Labs 2020-02-21 18:38:07

My Unexpected Dive into Open-Source Python

Header illustration by author, Mars Lee

I'm very happy to announce that I have joined Quansight as a front-end developer and designer! It was a happy coincidence how I joined- the intersection of my skills and the open source community's expanded vision.

Read more… (4 min remaining to read)

NumFOCUS 2020-02-20 18:08:46

Announcing JupyterCon 2020

NumFOCUS is excited to be a part of JupyterCon 2020. JupyterCon will be held August 10 – 14 in Berlin, Germany at the Berlin Conference Center. We invite you to participate in this exciting community event! Read the full announcement here. JupyterCon 2020 is an event brought to you in partnership by Project Jupyter and NumFOCUS.

The post Announcing JupyterCon 2020 appeared first on NumFOCUS.

NumFOCUS 2020-02-20 16:27:43

MDAnalysis joins NumFOCUS Sponsored Projects

NumFOCUS is pleased to announce the newest addition to our fiscally sponsored projects: MDAnalysis MDAnalysis is a Python library for the analysis of computer simulations of many-body systems at the molecular scale, spanning use cases from interactions of drugs with proteins to novel materials. It is widely used in the scientific community and is written […]

The post MDAnalysis joins NumFOCUS Sponsored Projects appeared first on NumFOCUS.

Living in an Ivory Basement 2020-02-16 23:00:00

Two talks at JGI in May: sourmash, spacegraphcats, and disease associations in the human microbiome.

Using k-mers and taxonomy to find contamination in metagenomes

Quansight Labs 2020-02-14 19:00:00

Creating the ultimate terminal experience in Spyder 4 with Spyder-Terminal

The Spyder-Terminal project is revitalized! The new 0.3.0 version adds numerous features that improves the user experience, and enhances compatibility with the latest Spyder 4 release, in part thanks to the improvements made in the xterm.js project.

Read more… (3 min remaining to read)

Anaconda 2020-02-11 03:05:50

6 Reasons Your Open-Source Data Science Pipeline Needs Attention Now

Your enterprise data scientists are almost certainly using Anaconda Distribution alongside 20 million other practitioners worldwide. The Anaconda Distribution is a package and environment manager designed for solo data scientists, and it is the most…

The post 6 Reasons Your Open-Source Data Science Pipeline Needs Attention Now appeared first on Anaconda.

Anaconda 2020-01-30 15:07:00

We’ve Reached a Milestone: pandas 1.0 Is Here

Today the pandas project announced the release of pandas 1.0.0. For more on what’s changed, read through the extensive release notes. We’re particularly excited about Numba-accelerated window operations and the new nullable boolean and string…

The post We’ve Reached a Milestone: pandas 1.0 Is Here appeared first on Anaconda.

Anaconda 2020-01-28 21:39:50

Introducing Anaconda Team Edition: Secure Open-Source Data Science for the Enterprise

I’m very excited to announce a new addition to Anaconda’s product line — Anaconda Team Edition! For the last few years, Anaconda has offered two products: our free Anaconda Distribution, meant for individual practitioners, and…

The post Introducing Anaconda Team Edition: Secure Open-Source Data Science for the Enterprise appeared first on Anaconda.

Leonardo Uieda 2020-01-23 12:00:00

Advancing research software in the UK through an SSI fellowship

I have selected as part of the 2020 cohort of Fellows of the Software Sustainability Institute!

The Institute cultivates world-class research with software. It's based at the universities of Edinburgh, Manchester, Southampton, and Oxford in the UK. Their motto says it all:

The SSI has a yearly fellowship program to fund the organization of communities around scientific software (creating of local user groups, workshops, hackathons, etc). Even more importantly, they organize several events to get current and past fellows in the same place doing awesome stuff. I'm really looking forward to this year's Collaborations Workshop (registration is open to all, not just fellows). I applied at the end of last year and was selected to join the 2020 cohort of fellows along with some truly amazing people.

My plan for the fellowship is to run

Peekaboo 2020-01-07 17:26:00

Don't fund Software that doesn't exist

I’ve been happy to see an increase in funding for open source software across research areas and across funding bodies. However, I observed that a majority of funding from, say, the NSF, goes to projects that do not exist yet, and where the funding is supposed to create a new project, or to extend projects that are developed and used within a single research lab. I think this top-down approach to creating software comes from a misunderstanding of the existing open source software that is used in science. This post collects thoughts on the effectiveness of current grant-based funding and how to improve it from the perspective of the grant-makers.
Instead of the current approach of funding new projects, I would recommend funding existing open source software, ideally software that is widely used, and underfunded. The story of the underfunded but critically important open source software (which I’ll refer to as infrastructure software) should be an old tale by now.
Living in an Ivory Basement 2020-01-01 23:00:00

sourmash-oddify: a workflow for exploring contamination in metagenome-assembled genomes

Using k-mers and taxonomy to find contamination in metagenomes

Quansight Labs 2019-12-24 11:00:00

metadsl PyData talk

metadsl PyData talk

PyData NYC just ended and I thought it would be good to collect my thoughts on metadsl based on the many conversations I had there surrounding it. This is a rather long post, so if you are just looking for some code here is a Binder link for my talk. Also, here is the talk I gave a month or so later on the same topic in Austin:

What is metadsl?
class Number(metadsl.Expression):
   def __add__(self, other: Number) -> Number:

   def from_int(cls, i: int) -> Number:

def add_zero(y: Number):
   yield Number.from_int(0) + y, y
   yield y + Number.from_int(0), y

Read more… (13 min remaining to read)

Anaconda 2019-12-17 17:50:45

2019: A Year in Review

Before we dive into a new decade, we’re looking back on all we’ve accomplished together as a company and as a community in 2019. We’re excited to be part of such a vibrant and growing…

The post 2019: A Year in Review appeared first on Anaconda.

Anaconda 2019-12-11 15:27:08

8 AI Predictions for 2020: Business Leaders & Researchers Weigh In

The first industrial revolution was powered by coal, the second by oil and gas, and the third by nuclear power. The fourth — AI — is fueled by an abundance of data and breakthroughs in…

The post 8 AI Predictions for 2020: Business Leaders & Researchers Weigh In appeared first on Anaconda.

Anaconda 2019-12-09 15:36:35

It’s Time to Upgrade to Python 3 – Time Is Running Out!

The Python community will sunset Python 2 on January 1, 2020.   How much time is left? Check out the handy clock at:  How does this impact the Anaconda Distribution and the conda packages Anaconda…

The post It’s Time to Upgrade to Python 3 – Time Is Running Out! appeared first on Anaconda.

Leonardo Uieda 2019-12-08 12:00:00

Two PhD studentships at the University of Liverpool

I have two open positions for funded studentships at the University of Liverpool. Applications are open until 10 January 2020.

Project descriptions

Follow the links for more detailed versions.

Bringing machine learning techniques to geophysical data processing

The goal of this project is to investigate the use of existing machine learning techniques to process gravity and magnetics data using the Equivalent Layer Method. The methods and software developed during this project can be applied to process large amounts of gravity and magnetics data, including airborne and satellite surveys, and produce data products that can enable further scientific investigations. Examples of such data products include global gravity gradient grids from GOCE satellite measurements, regional magnetic grids for the UK, gravity grids for the Moon and Mars, etc.

Large-scale mapping of the thickness of the

Anaconda 2019-12-05 16:32:30

How Machine Learning Will Generate up to $2 Trillion in Value for the Manufacturing Industry

Today manufacturers must create high-quality products as efficiently and consistently as possible to remain competitive. With the increasing price of raw materials, skilled labor shortages, and rising competition, this can be difficult. Fortunately, machine learning…

The post How Machine Learning Will Generate up to $2 Trillion in Value for the Manufacturing Industry appeared first on Anaconda.

NumFOCUS 2019-12-04 18:50:46

NumFOCUS Kicks Off Year-End Fundraising with $2,500 Gift

Giving Tuesday 2019 marked the start of NumFOCUS’s year-end fundraising campaign, and this year’s effort began with a major donation from the organization’s Board President, Andy Terrel. “Today I’m pledging a gift of $2,500 to help kick off the NumFOCUS end-of-year fundraising campaign,” Terrel wrote yesterday in an e-mail to the NumFOCUS community. “I believe […]

The post NumFOCUS Kicks Off Year-End Fundraising with $2,500 Gift appeared first on NumFOCUS.

NumFOCUS 2019-12-02 22:55:41

Stepping Onto a New Path…

The post Stepping Onto a New Path… appeared first on NumFOCUS.

Anaconda 2019-12-02 19:55:10

Get Python Package Download Statistics with Condastats

Hundreds of millions of Python packages are downloaded using Conda every month. That’s why we are excited to announce the release of condastats, a conda statistics API with Python interface and Command Line interface. Now anyone…

The post Get Python Package Download Statistics with Condastats appeared first on Anaconda.

Gaël Varoquaux - programming 2019-12-01 05:00:00

Getting a big scientific prize for open-source software


An important acknowledgement for a different view of doing science: open, collaborative, and more than a proof of concept.

A few days ago, Loïc Estève, Alexandre Gramfort, Olivier Grisel, Bertrand Thirion, and myself received the “Académie des Sciences Inria prize for transfer”, for our contributions to the scikit-learn project …

Quansight Labs 2019-11-29 01:00:00

Variable Explorer improvements in Spyder 4

Spyder 4 will be released very soon with lots of interesting new features that you'll want to check out, reflecting years of effort by the team to improve the user experience. In this post, we will be talking about the improvements made to the Variable Explorer.

These include the brand new Object Explorer for inspecting arbitrary Python variables, full support for MultiIndex dataframes with multiple dimensions, and the ability to filter and search for variables by name and type, and much more.

It is important to mention that several of the above improvements were made possible through integrating the work of two other projects. Code from gtabview was used to implement the multi-dimensional Pandas indexes, while objbrowser was the foundation of the new Object Explorer.

Read more… (7 min remaining to read)

Spyder Blog 2019-11-28 20:00:00

Variable Explorer improvements in Spyder 4

This blogpost was originally published on the Quansight Labs website.

Spyder 4 will be released very soon with lots of interesting new features that you'll want to check out, reflecting years of effort by the team to improve the user experience. In this post, we will be talking about the improvements made to the Variable Explorer.

These include the brand new Object Explorer for inspecting arbitrary Python variables, full support for MultiIndex dataframes with multiple dimensions, and the ability to filter and search for variables by name and type, and much more.

It is important to mention that several of the above improvements were made possible through integrating the work of two other projects. Code from gtabview was used to implement the multi-dimensional Pandas indexes, while objbrowser was the foundation of the new Object Explorer.

New viewer for arbitrary Python objects

For Spyder 4 we added a long-requested feature: full support for inspecting any kind of Python object through the Variable

NumFOCUS 2019-11-20 23:24:48

NumFOCUS Summit 2019

The post NumFOCUS Summit 2019 appeared first on NumFOCUS.

Quansight Labs 2019-11-14 20:00:00

A new grant for NumPy and OpenBLAS!

I'm very pleased to announce that NumPy and OpenBLAS just received a $195,000 grant from the Chan Zuckerberg Initiative, through its Essential Open Source Software for Science (EOSS) program! This is good news for both projects, and I'm particularly excited about the types of activities we'll be undertaking, what this will mean in terms of growing the community, and to be part of the first round of funded projects of this visionary program.

The program

The press release gives a high level overview of the program, and the grantee website lists the 32 successful applications. Other projects that got funded include SciPy and Matplotlib (it's the very first significant funding for both projects!), Pandas, Zarr, scikit-image, JupyterHub, and Bioconda - we're in good company!

Nicholas Sofroniew and Dario Taborelli, two of the people driving the EOSS program, wrote a blog post that's well worth reading about the motivations for starting this program and the 42 projects that applied and got funded: The Invisible Foundations of Biomedicine.

Read more… (5 min remaining to

Quansight Labs 2019-11-12 17:00:00

File management improvements in Spyder4

Version 4.0 of Spyder—a powerful Python IDE designed for scientists, engineers and data analysts—is almost ready! It has been in the making for well over two years, and it contains lots of interesting new features. We will focus on the Files pane in this post, where we've made several improvements to the interface and file management tools.

Simplified interface

In order to simplify the Files pane's interface, the columns corresponding to size and kind are hidden by default. To change which columns are shown, use the top-right pane menu or right-click the header directly.

Read more… (7 min remaining to read)

Spyder Blog 2019-11-12 00:00:00

File management improvements in Spyder 4

This blogpost was originally published on the Quansight Labs website.

Version 4.0 of Spyder is almost ready! It has been in the making for well over two years, and it contains lots of interesting new features. We will focus on the Files pane in this post, where we've made several improvements to the interface and file management tools.

Simplified interface

In order to simplify the Files pane's interface, the columns corresponding to size and kind are hidden by default. To change which columns are shown, use the top-right pane menu or right-click the header directly.

Custom file associations

First, we added the ability to associate different external applications with specific file extensions they can open. Under the File associations tab of the Files preferences pane, you can add file types and set the external program used to open each of them by default.

Once you've set this up, files will automatically launch in the associated application when opened from the Files pane in Spyder.

Quansight Labs 2019-11-10 05:28:00

uarray: Attempting to move the ecosystem forward

There comes a time in every project where most technological hurdles have been surpassed, and its adoption is a social problem. I believe uarray and unumpy had reached such a state, a month ago.

I then proceeded, along with Ralf Gommers and Peter Bell to write NumPy Enhancement Proposal 31 or NEP-31. This generated a lot of excellent feedback on the structure and the nuances of the proposal, which you can read both on the pull request and on the mailing list discussion, which led to a lot of restructuring in the contents and the structure of the NEP, but very little in the actual proposal. I take full responsibility for this: I have a bad tendency to assume everyone knows what I'm thinking. Thankfully, I'm not alone in this: It's a known psychological phenomenon.

Read more… (2 min remaining to read)

ListenData 2019-10-28 15:48:00

Loan Amortisation Schedule using R and Python

In this post, we will explain how you can calculate your monthly loan instalments the way bank calculates using R and Python. In financial world, analysts generally use MS Excel software for calculating principal and interest portion of instalment using PPMT, IPMT functions. As data science is growing and trending these days, it is important to know how you can do the same using popular data science programming languages such as R and Python.

When you take a loan from bank at x% annual interest rate for N number of years. Bank calculates monthly (or quarterly) instalments based on the following factors :

  • Loan Amount
  • Annual Interest Rate
  • Number of payments per year
  • Number of years for loan to be repaid in instalments
Loan Amortisation ScheduleIt refers to table of periodic loan payments explaining the breakup of principal and interest in each instalment/EMI until the loan is repaid at the end of its stipulated term. Monthly instalments are generally same every month
I Love Symposia! 2019-10-24 13:59:54

Introducing napari: a fast n-dimensional image viewer in Python

I'm really excited to finally, officially, share a new(ish) project called napari with the world. We have been developing napari in the open from the very first commit, but we didn't want to make any premature fanfare about it… Until now. It's still alpha software, but for months now, both the core napari team and a few collaborators/early adopters have been using napari in our daily work. I've found it life-changing.

The background

I've been looking for a great nD volume viewer in Python for the better part of a decade. In 2009, I joined Mitya Chklovskii's lab and the FlyEM team at the Janelia [Farm] Research Campus to work on the segmentation of 3D electron microscopy (EM) volumes. I started out in Matlab, but moved to Python pretty quickly and it was a very smooth transition (highly recommended! ;). Looking at my data was always annoying though. I was either looking at single 2D slices using matplotlib.pyplot.imshow, or saving the volumes in VTK format and loading them into ITK-SNAP — which worked ok

Filipe Saraiva's blog 2019-10-11 16:29:56

O SERPRO e a validação de documentos digitais

No rascunho do post anterior sobre os documentos digitais no Brasil acabei escrevendo bastante sobre o papel do SERPRO nesse processo – tanto que decidi separá-lo em um post próprio. Com o lançamento do e-Título foi necessário para o TSE criar uma maneira de validar o documento digital para evitar fraudes. A tecnologia adotada foi… Continue a ler »O SERPRO e a validação de documentos digitais
Filipe Saraiva's blog 2019-10-07 00:29:16

Os documentos digitais (no plural) do Brasil

Já faz algum tempo o Brasil está passando por um processo de digitalização dos documentos oficiais utilizados por pessoas físicas. Entretanto, o que antes se anunciava como uma possível convergência dos mais diferentes documentos para um documento único, que serviria para tudo, passou a acontecer a digitalização de cada documento específico através do desenvolvimento de… Continue a ler »Os documentos digitais (no plural) do Brasil 2019-09-26 22:00:00

How to Evaluate the Logistic Loss and not NaN trying

A naive implementation of the logistic regression loss can results in numerical indeterminacy even for moderate values. This post takes a closer look into the source of these instabilities and discusses more robust Python implementations.

hljs.initHighlightingOnLoad(); MathJax.Hub.Config({ extensions: ["tex2jax.js"], jax: ["input/TeX", "output/HTML-CSS"], tex2jax: { inlineMath …
Paul Ivanov’s Journal 2019-09-17 07:00:00

Uvas Gold 200

My poem about a rainy 200k was published in the Fall 2019 issue of American Randonneur (a quarterly magazine published by Randonneurs USA)

I've been doing samizdat poetry for as long as I've had a web presence (since 1999), but I am now officially a published poet! (I am deliberately not counting the embarrasing hackjob that was published in a youth anthology when I was in 8th grade.)

You can find "Uvas Gold 200" on page 26 - either directly on this skeuomorphic leafing viewer or the PDF, but I'm republishing both the exposition blurb and the poem below. If you prefer to listen, I recorded a reading of it that you can download in different flavors: a local audio only, a local video, or the embeded video version below.

Uvas Gold 200k starts and ends in Fremont, CA and was held on Saturday, December 1st, 2018. The ride frontloads the climbing by going nearly half-way up Mount

Filipe Saraiva's blog 2019-09-16 13:24:54

Grupo de Estudos do Laboratório Amazônico de Estudos Sociotécnicos – UFPA

Eu e o prof. Leonardo Cruz da Faculdade de Ciências Sociais estamos juntos trabalhando no desenvolvimento do Laboratório Amazônico de Estudos Sociotécnicos da UFPA. Nossa proposta é realizar leituras e debates críticos sobre o tema da sociologia da tecnologia, produzir pesquisas teóricas e empíricas na região amazônica sobre as relações entre tecnologia e sociedade, e… Continue a ler »Grupo de Estudos do Laboratório Amazônico de Estudos Sociotécnicos – UFPA
Filipe Saraiva's blog 2019-08-22 14:58:40

pelas ruas de Belém

as vezes pelas ruas de Belém não sei se sou eu ou minha mãe quem está ali
Spyder Blog 2019-08-16 00:00:00

Spyder 4.0: Kite integration is here

This blogpost was originally published on the Quansight Labs website.

Note: Kite is sponsoring the work discussed in this blog post, and in addition supports Spyder 4.0 development through a Quansight Labs Community Work Order.

As part of our next release, we are proud to announce an additional completion client for Spyder, Kite. Kite is a novel completion client that uses Machine Learning techniques to find and predict the best autocompletion for a given text. Additionally, it collects improved documentation for compiled packages, e.g. Matplotlib, NumPy and SciPy, that cannot be obtained easily by using traditional code analysis packages such as Jedi. Although Kite is not open source like Spyder, you can download it without charge at the Kite website.

By incorporating Kite into Spyder, we will improve and provide the ultimate autocompletion and signature retrieval experience for most of the scientific Python stack and beyond. For instance, let’s take a look at the following PyTorch completion. While

ListenData 2019-08-10 21:54:00

Object Oriented Programming in Python : Learn by Examples

This tutorial outlines object oriented programming (OOP) in Python with examples. It is a step by step guide which was designed for people who have no programming experience. Object Oriented Programming is popular and available in other programming languages besides Python which are Java, C++, PHP.
Table of Contents

What is Object Oriented Programming?In object-oriented programming (OOP), you have the flexibility to represent real-world objects like car, animal, person, ATM etc. in your code. In simple words, an object is something that possess some characteristics and can perform certain functions. For example, car is an object and can perform functions like start, stop, drive and brake. These are the function of a car. And the characteristics are color of car, mileage, maximum speed, model year etc.

In the above example, car is an object. Functions are called methods in OOP world. Characteristics are attributes (properties). Technically attributes are variables or values related to the state of the object whereas methods

ListenData 2019-07-29 20:20:00

Precision Recall Curve Simplified

This article outlines precision recall curve and how it is used in real-world data science application. It includes explanation of how it is different from ROC curve. It also highlights limitation of ROC curve and how it can be solved via area under precision-recall curve. This article also covers implementation of area under precision recall curve in Python, R and SAS.
Table of Contents

What is Precision Recall Curve?Before getting into technical details, we first need to understand precision and recall terms in layman's term. It is essential to understand the concepts in simple words so that you can recall it for future work when it is required. Both Precision and Recall are important metrics to check the performance of binary classification model. PrecisionPrecision is also called Positive Predictive Value. Suppose you are building a customer attrition model which has objective to identify customers who are likely to close relationship with the company. The use of this model is to
Living in an Ivory Basement 2019-07-22 22:00:00

Comparing two genome binnings quickly with sourmash

Comparing two sets of MAGs, for fun and profit!

ListenData 2019-07-22 09:20:00

Calculate KS Statistic with Python

Kolmogorov-Smirnov (KS) Statistics is one of the most important metrics used for validating predictive models. It is widely used in BFSI domain. If you are a part of risk or marketing analytics team working on project in banking, you must have heard of this metrics. What is KS Statistics?It stands for Kolmogorov–Smirnov which is named after Andrey Kolmogorov and Nikolai Smirnov. It compares the two cumulative distributions and returns the maximum difference between them. It is a non-parametric test which means you don't need to test any assumption related to the distribution of data. In KS Test, Null hypothesis states null both cumulative distributions are similar. Rejecting the null hypothesis means cumulative distributions are different.

In data science, it compares the cumulative distribution of events and non-events and KS is where there is a maximum difference between the two distributions. In simple words, it helps us to understand how well our predictive model is able to discriminate between events and

ListenData 2019-07-20 16:22:00

A Complete Guide to Python DateTime Functions

In this tutorial, we will cover python datetime module and how it is used to handle date, time and datetime formatted columns (variables). It includes various practical examples which would help you to gain confidence in dealing dates and times with python functions. In general, Date types columns are not easy to manipulate as it comes with a lot of challenges like dealing with leap years, different number of days in a month, different date and time formats or if date values are stored in string (character) format etc.
Table of Contents

Introduction : datetime moduleIt is a python module which provides several functions for dealing with dates and time. It has four classes as follows which are explained in the latter part of this article how these classes work.
  1. datetime
  2. date
  3. time
  4. timedelta

People who have no experience of working with real-world datasets might have not encountered date columns. They might be under impression that working with dates is rarely used and not so

ListenData 2019-07-17 17:32:00

What are *args and **kwargs and How to use them

This article explains the concepts of *args and **kwargs and how and when we use them in python program. Seasoned python developers embrace the flexibility it provides when creating functions. If you are beginner in python, you might not have heard it before. After completion of this tutorial, you will have confidence to use them in your live project.
Table of Contents

Introduction : *argsargs is a short form of arguments. With the use of *args python takes any number of arguments in user-defined function and converts user inputs to a tuple named args. In other words, *args means zero or more arguments which are stored in a tuple named args.

When you define function without *args, it has a fixed number of inputs which means it cannot accept more (or less) arguments than you defined in the function.

In the example code below, we are creating a very basic function which adds two numbers. At the same time, we created a

ListenData 2019-07-12 21:42:00

Python : 10 Ways to Filter Pandas DataFrame

In this article, we will cover various methods to filter pandas dataframe in Python. Data Filtering is one of the most frequent data manipulation operation. It is similar to WHERE clause in SQL or you must have used filter in MS Excel for selecting specific rows based on some conditions. In terms of speed, python has an efficient way to perform filtering and aggregation. It has an excellent package called pandas for data wrangling tasks. Pandas has been built on top of numpy package which was written in C language which is a low level language. Hence data manipulation using pandas package is fast and smart way to handle big sized datasets.
Examples of Data Filtering
It is one of the most initial step of data preparation for predictive modeling or any reporting project. It is also called 'Subsetting Data'. See some of the examples of data filtering below.
  • Select all the active customers whose accounts were opened
ListenData 2019-07-04 19:51:00

Python Dictionary Comprehension with Examples

In this tutorial, we will cover how dictionary comprehension works in Python. It includes various examples which would help you to learn the concept of dictionary comprehension and how it is used in real-world scenarios.
What is Dictionary?
Dictionary is a data structure in python which is used to store data such that values are connected to their related key. Roughly it works very similar to SQL tables or data stored in statistical softwares. It has two main components -
  1. Keys : Think about columns in tables. It must be unique (like column names cannot be duplicate)
  2. Values : It is similar to rows in tables. It can be duplicate.
It is defined in curly braces { }. Each key is followed by a colon (:) and then values.
Syntax of Dictionary

d = {'a': [1,2], 'b': [3,4], 'c': [5,6]}
To extract keys, values and structure of dictionary, you can submit the following commands.

d.keys() # 'a', 'b', 'c'
d.values() # [1, 2], [3, 4], [5,

HTML outputs in Jupyter


User interaction in data science projects can be improved by adding a small amount of visual deisgn.

To motivate effort around visual design we show several simple-yet-useful examples. The code behind these examples is small and accessible to most Python developers, even if they don’t have much HTML experience.

This post in particular focuses on Jupyter’s ability to add HTML output to any object. This can either be full-fledged interactive widgets, or just rich static outputs like tables or diagrams. We hope that by showing examples here we will inspire some throughts in other projects.

This post was supported by replies to this tweet. The rest of this post is just examples.


I originally decided to write this post after reading another blogpost from the UK Met office, where they included the HTML output of their library Iris in a a blogpost

(work by Peter Killick, post by Theo McCaie)

The fact that the output provided by an interactive session is the same output that you would provide in a published result helps everyone. The interactive

ListenData 2019-07-03 15:01:00

Python list comprehension with Examples

This tutorial covers how list comprehension works in Python. It includes many examples which would help you to familiarize the concept and you should be able to implement it in your live project at the end of this lesson.
Table of Contents

What is list comprehension?Python is an object oriented programming language. Almost everything in them is treated consistently as an object. Python also features functional programming which is very similar to mathematical way of approaching problem where you assign inputs in a function and you get the same output with same input value. Given a function f(x) = x2, f(x) will always return the same result with the same x value. The function has no "side-effect" which means an operation has no effect on a variable/object that is outside the intended usage. "Side-effect" refers to leaks in your code which can modify a mutable data structure or variable.

Functional programming is also good for parallel computing as there is no

Peekaboo 2019-07-02 16:11:00

Don't cite the No Free Lunch Theorem

Tldr; You probably shouldn’t be citing the "No Free Lunch" Theorem by Wolpert. If you’ve cited it somewhere, you might have used it to support the wrong conclusion. What it actually (vaguely) says is “You can’t learn from data without making assumptions”.

The paper on the “No Free Lunch Theorem”, actually called "The Lack of A Priori Distinctions Between Learning Algorithms" is one of these papers that are often cited and rarely read, and I hear many people in the ML community refer to it when supporting the claim that “one model can’t be the best at everything” or “one model won’t always be better than another model”. The point of this post is to convince you that this is not what the paper or theorem says (at least not the one usually cited by Wolpert), and you should not cite this theorem in this context; and also that common versions cited of the "No Free Lunch" Theorem (continued...)
ListenData 2019-06-28 22:46:00

15 ways to read CSV file with pandas

This tutorial explains how to read a CSV file in python using read_csv function of pandas package. Without use of read_csv function, it is not straightforward to import CSV file with python object-oriented programming. Pandas is an awesome powerful python package for data manipulation and supports various functions to load and import data from various formats. Here we are covering how to deal with common issues in importing CSV file.
Table of Contents

Install and Load Pandas Package
Make sure you have pandas package already installed on your system. If you set up python using Anaconda, it comes with pandas package so you don't need to install it again. Otherwise you can install it by using command pip install pandas. Next step is to load the package by running the following command. pd is an alias of pandas package. We will use it instead of full name "pandas".
import pandas as pd
Create Sample Data for Import
The program below creates a sample
ListenData 2019-06-25 11:31:00

Matplotlib Tutorial : Learn with Examples in 3 hours

This tutorial outlines how to perform plotting and data visualization in python using Matplotlib library. The objective of this post is to get you familiar with the basics and advanced plotting functions of the library. It contains several examples which will give you hands-on experience in generating plots in python.
Table of Contents

What is Matplotlib?It is a powerful python library for creating graphics or charts. It takes care of all of your basic and advanced plotting requirements in Python. It took inspiration from MATLAB programming language and provides a similar MATLAB like interface for graphics. The beauty of this library is that it integrates well with pandas package which is used for data manipulation. With the combination of these two libraries, you can easily perform data wrangling along with visualization and get valuable insights out of data. Like ggplot2 library in R, matplotlib library is the grammar of graphics in Python and most used library for charts in Python.

Write Short Blogposts

I encourage my colleagues to write blogposts more frequently. This is for a few reasons:

  1. It informs your broader community what you’re up to, and allows that community to communicate back to you quickly.

    You communicating to the community fosters a sense of collaboration, openness, and trust. You gain collaborators, build momentum behind your work, and curate a body of knowledge that early adopters can consume to become experts quickly.

    Getting feedback from your community helps you to course-correct early in your work, and stops you from wasting time in inefficient courses of action.

    You can only work for a long time without communicating if you are either entirely confident in what you’re doing, or reckless, or both.

  2. It increases your visibility, and so is good for your career.

    I have a great job. I find my work to be both

ListenData 2019-06-19 13:20:00

How to drop one or multiple columns in Pandas Dataframe

In this tutorial, we will cover how to drop or remove one or multiple columns from pandas dataframe.
What is pandas in Python?
pandas is a python package for data manipulation. It has several functions for the following data tasks:
  1. Drop or Keep rows and columns
  2. Aggregate data by one or more columns
  3. Sort or reorder data
  4. Merge or append multiple dataframes
  5. String Functions to handle text data
  6. DateTime Functions to handle date or time format columns
Import or Load Pandas library
To make use of any python library, we first need to load them up by using import command.
import pandas as pd
import numpy as np
Let's create a fake dataframe for illustration
The code below creates 4 columns named A through D.
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
          A         B         C         D
0 -1.236438 -1.656038
ListenData 2019-06-09 21:07:00

String Functions in Python with Examples

This tutorial outlines various string (character) functions used in Python. To manipulate strings and character values, python has several in-built functions. It means you don't need to import or have dependency on any external package to deal with string data type in Python. It's one of the advantage of using Python over other data science tools. Dealing with string values is very common in real-world. Suppose you have customers' full name and you were asked by your manager to extract first and last name of customer. Or you want to fetch information of all the products that have code starting with 'QT'.
Table of Contents

List of frequently used string functions The table below shows many common string functions along with description and its equivalent function in MS Excel. We all use MS Excel in our workplace and familiar with the functions used in MS Excel. The comparison of string functions in MS EXCEL and Python would help you to learn
Ralf Gommers | Reflections 2019-06-05 00:00:00

The cost of an open source contribution

Open source is massively successful. Some say it’s eating the world, although to my ears that phrasing doesn’t sound entirely like a good thing. Open source maintainers are always in need of help, and over the past years I’ve seen a lot of focus on ways open source projects can grow their communities and gain new contributors. Guidance on how to go about finding new contributors is easily found. E.
Spyder Blog 2019-06-02 00:00:00

TDK-Micronas partners with Quansight to sponsor Spyder

This blogpost was originally published on the Quansight Labs website

TDK-Micronas is sponsoring Spyder development efforts through Quansight Labs. This will enable the development of some features that have been requested by our users, as well as new features that will help TDK develop custom Spyder plugins in order to complement their Automatic Test Equipment (ATE’s) in the development of their Application Specific Integrated Circuits (ASIC’s).

At this point it may be useful to clarify the relationship the role of Quansight Labs in Spyder's development and the relationship with TDK. To quote Ralf Gommers (director of Quansight Labs):

"We're an R&D lab for open source development of core technologies around data science and scientific computing in Python. And focused on growing communities around those technologies. That's how I see it for Spyder as well: Quansight Labs enables developers to be employed to work on Spyder, and helps with connecting them to developers of other projects in similar situations. Labs should be an enabler to let the Spyder project, its community and individual developers grow.

I Love Symposia! 2019-05-28 08:41:54

Why citations are not enough for open source software

A few weeks ago I wrote about why you should cite open source tools. Although I think citations important, though, there are major problems in relying on them alone to support open source work.

The biggest problem is that papers describing a software library can only give credit to the contributors at the time that the paper was written. The preferred citation for the SciPy library is “Eric Jones, Travis Oliphant, Pearu Peterson, et al”, 2001. The “et al” is not an abbreviation here, but a fixed shorthand for all other contributors. Needless to say many, many people have contributed to the SciPy library since 2001 (GitHub counts 716 contributors as of this writing), and they are unable to get credit within the academic system for those contributions. (As an aside, Google counts about 1,200 citations to SciPy, which is a breathtaking undercounting of its value and influence, and reinforces my earlier point: cite open source software! Definitely don't use this post as an excuse not to cite it!!!)

Not surprisingly, we have had

Spyder Blog 2019-05-20 00:00:00

Spyder 4.0 takes a big step closer with the release of Beta 2!

This blogpost was originally published on the Quansight Labs website

It has been almost two months since I joined Quansight in April, to start working on Spyder maintenance and development. So far, it has been a very exciting and rewarding journey under the guidance of long time Spyder maintainer Carlos Córdoba. This is the first of a series of blog posts we will be writing to showcase updates on the development of Spyder, new planned features and news on the road to Spyder 4.0 and beyond.

First off, I would like to give a warm welcome to Edgar Margffoy, who recently joined Quansight and will be working with the Spyder team to take its development even further. Edgar has been a core Spyder developer for more than two years now, and we are very excited to have his (almost) full-time commitment to the project.

Spyder 4.0 Beta 2 released!

Since August 2018, when the first beta of the 4.x series was released, the Spyder development team has been


The Role of a Maintainer

What are the expectations and best practices for maintainers of open source software libraries? How can we do this better?

This post frames the discussion and then follows with best practices based on my personal experience and opinions. I make no claim that these are correct.

Let us Assume External Responsibility

First, the most common answer to this question is the following:

  • Q: What are expectations on OSS maintainers?
  • A: Nothing at all. They’re volunteers.

However, let’s assume for a moment that these maintainers are paid to maintain the project some modest amount, like 10 hours a week.

How can they best spend this time?

What is a Maintainer?

Next, let’s disambiguate the role of developer, reviewer, and maintainer

  1. Developers fix bugs and create features. They write code and docs and generally are agents of change in a software project. There are often many more developers than reviewers or maintainers.

  2. Reviewers are known

Living in an Ivory Basement 2019-05-14 22:00:00

Using GitHub for janky project reporting - some code

We scripted GitHub for lightweight project reporting

Paul Ivanov’s Journal 2019-05-13 07:00:00

My first DNF (Ft Bragg 600k)

It's been six years since my first ride with The San Francisco Randonneurs and four years since my first 200k. I've ridden 18 rides that are at least that distance since then (3x 300k, 2x 400k, 1x 600k), completing my first Super Randonneur Series (2-, 3-, 4-, and 600k in one year) last year after not riding much the year before that. And this weekend I had my first DNF result on the Fort Bragg 600k. I Did Not Finish.

The best response to my choice of abandoning the ride to enjoy the campground came from Peter Curley, who said "That was a very mature decision." A clear departure from typical randonneuring stubbornness and refusal to give up, I celebrated my decision to quit as a victory when I arrived at the campground and made my announcement to the volunteers. I think I was so energetic about it that they did not believe me. I was being kind to myself, to my body, and at peace with the decision by


Should I Resign from My Full Professor Job to Work Fulltime on Cocalc?

Nearly 3 years ago, I gave a talk at a Harvard mathematics conference announcing that “I am leaving academia to build a company”. What I really did is go on unpaid leave for three years from my tenured Full Professor position. No further extensions of that leave is possible, so I finally have to decide whether or not to go back to academia or resign.
How did I get here?
Nearly two decades ago, as a recently minted Berkeley math Ph.D., I was hired as a non-tenure-track faculty member in the mathematics department at Harvard. I spent five years at Harvard, then I applied for jobs, and accepted a tenured Associate Professor position in the mathematics department at UC San Diego. The mathematics community was very supportive of my number theory research; I skipped tenure track, and landed a tier-1 tenured position by the time I was 30 years old. In 2006, I moved from UCSD to a tenured Associate Professor position at the University
Paul Ivanov’s Journal 2019-05-03 07:00:00

PyCon2019 poem

I'm back in Cleveland for another Pycon. Yesterday was my first full day here. Along with Matt Seale, I was a helper at Matthias Bussonnier tutorial ("IPython and Jupyter in Depth: High productivity, interactive Python). The sticky system is efficient at signaling when someone in a classroom needs help, and a lot of folks don't know that this practice was popularized by Software Carpentry workshops and continues to be used at The Carpentries.

I stepped out for a coffee refill and bumped into a large contingent of Bloomberg folks I'd never met (Princeton office). I guess we have something like 90 people at the conference this year, and I made the usual and true remark about how I go to conferences to meet the other people who work at our company. Then after his tutorial concluded, Matthias and I bumped into Tracy Teal, exchanged some stickers, and chatted about The Carpentries, Jupyter, organizing conferences, governance and sponsorship models, and a bunch of other stuff.

Matthias was a

I Love Symposia! 2019-05-02 02:31:30

Why you should cite open source tools

Every now and then, a moment or a sentence in a conversation sticks out at you, and lodges itself in the back of your brain for months or even years. In this case, the sentence is a tweet, and I fear that the only way to dislodge it is to talk about it publicly.

Last year, I complained on Twitter that a very prominent paper that was getting lots of attention used scikit-image, but failed to cite our paper. (Or the papers corresponding to many other open source packages.) I continued that scientists developing open source software depend on these citations to continue their work. (More on this in another post...) One response was that surely the developers of the open source scientific Python stack were not scientists per se, and that citations were not a priority for them.

I still sigh internally when I think of it.

That tweet manifests a pervasive perception that open source scientific software is written by God-like figures. These massively experienced software developers have easy access to funds

ListenData 2019-04-27 13:52:00

Python Lambda Function with Examples

This article covers detailed explanation of lambda function of Python. You will learn how to use it in real-world data scenarios with examples.
Table of Contents

Introduction : Lambda FunctionIn non-technical language, lambda is an alternative way of defining function. You can define function inline using lambda. It means you can apply a function to some data using a single line of python code. It is called anonymous function as the function can be defined without its name. They are a part of functional programming style which focus on readability of code and avoids changing mutable data.
Syntax of Lambda Function
lambda arguments: expression
Lambda function can have more than one argument but expression cannot be more than 1. The expression is evaluated and returned. Example
addition = lambda x,y: x + y
addition(2,3) returns 5
In the above python code, x,y are the arguments and x + y is the expression that gets evaluated and returned.
ListenData 2019-04-20 21:01:00

Loops in Python explained with examples

This tutorial covers various ways to execute loops in python with several practical examples. After reading this tutorial, you will be familiar with the concept of loop and will be able to apply loops in real world data wrangling tasks.

Table of Contents

What is Loop?Loop is an important programming concept and exist in almost every programming language (Python, C, R, Visual Basic etc.). It is used to repeat a particular operation(s) several times until a specific condition is met. It is mainly used to automate repetitive tasks.

Real World Examples of Loop
  1. Software of the ATM machine is in a loop to process transaction after transaction until you acknowledge that you have no more to do.
  2. Software program in a mobile device allows user to unlock the mobile with 5 password attempts. After that it resets mobile device.
  3. You put your favorite song on a repeat mode. It is also a loop.
  4. You want to run a particular analysis on each column of your data
Living in an Ivory Basement 2019-04-15 22:00:00

Some questions and thoughts on journal peer review.

What's up with current peer review practice?

ListenData 2019-04-14 15:31:00

Create Dummy Data in Python

This article explains various ways to create dummy or random data in Python for practice. Like R, we can create dummy data frames using pandas and numpy packages. Most of the analysts prepare data in MS Excel. Later they import it into Python to hone their data wrangling skills in Python. This is not an efficient approach. The efficient approach is to prepare random data in Python and use it later for data manipulation.

Table of Contents

1. Enter Data Manually in Editor WindowThe first step is to load pandas package and use DataFrame function
import pandas as pd
data = pd.DataFrame({"A" : ["John","Deep","Julia","Kate","Sandy"],
"MonthSales" : [25,30,35,40,45]})
       A  MonthSales
0 John 25
1 Deep 30
Living in an Ivory Basement 2019-04-10 22:00:00

Things to think about when developing shotgun metagenome classifiers

Thoughts on goals and tradeoffs in classifying shotgun metagenome data.