SciPy

Planet SciPy

neptune.ai 2020-06-02 16:05:55

This Week in Machine Learning: Oceans, Marketers, and Wildlife Protection

Machine learning is fascinating. New things happen every second while we’re busy performing our daily tasks. If you want to know what big things have happened over the last week, make sure to check this weekly roundup! Here are the best picks from the last week from the world of machine learning. Enjoy the read! […]

The post This Week in Machine Learning: Oceans, Marketers, and Wildlife Protection appeared first on neptune.ai.

neptune.ai 2020-05-28 06:00:13

How to Do Data Exploration for Image Segmentation and Object Detection (Things I Had to Learn the Hard Way)

I’ve been working with object detection and image segmentation problems for many years. An important realization I made is that people don’t put the same amount of effort and emphasis on data exploration and results analysis as they would normally in any other non-image machine learning project. Why is it so? I believe there are two […]

The post How to Do Data Exploration for Image Segmentation and Object Detection (Things I Had to Learn the Hard Way) appeared first on neptune.ai.

Gaël Varoquaux - programming 2020-05-27 22:00:00

Technical discussions are hard; a few tips

Note

This post discusses the difficulties of communicating while developing open-source projects and tries to give some simple advice.

A large software project is above all a social exercise in which technical experts try to reach good decisions together, for instance on github pull requests. But communication is difficult, in …

neptune.ai 2020-05-25 17:17:05

This Week in Machine Learning: Bees, Sky Objects, & HMI

Every day brings new opportunities. A week brings even more. To make sure you’re getting the most out of it, we’ve gathered a list of trending articles of the week – everything about Data Science, AI, tech, and machine learning. Here are the best picks from the last week from the world of the machine […]

The post This Week in Machine Learning: Bees, Sky Objects, & HMI appeared first on neptune.ai.

neptune.ai 2020-05-25 12:35:41

The Best Comet.ml Alternatives

Comet is one of the most popular tools used by people working on machine learning experiments. It is a self-hosted and cloud-based meta machine learning platform allowing data scientists and teams to track, compare, explain, and optimize experiments and models. Comet provides an open-source Python library to allow data scientists to integrate their code with […]

The post The Best Comet.ml Alternatives appeared first on neptune.ai.

neptune.ai 2020-05-25 09:56:33

The Best Tools for Machine Learning Model Visualization

The phrase “Every model is wrong but some are useful” is especially true in Machine Learning. When developing machine learning models you should always understand where it works as expected and where it fails miserably. There are many methods that you can use to get that understanding: Look at evaluation metrics (also you should know […]

The post The Best Tools for Machine Learning Model Visualization appeared first on neptune.ai.

neptune.ai 2020-05-22 05:57:41

Random Forest Regression: When Does It Fail and Why?

In this article, we’ll look at a major problem with using Random Forest for regression: extrapolation. We’ll cover the following items: Random Forest Regression vs Linear Regression; the Random Forest Regression extrapolation problem; potential solutions; and whether you should use Random Forest for regression at all. Let’s dive in. Random Forest Regression vs Linear Regression: Random Forest Regression is […]

The post Random Forest Regression: When Does It Fail and Why? appeared first on neptune.ai.
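
To make the extrapolation problem concrete, here is a minimal sketch (mine, not code from the post; the data and model settings are arbitrary) comparing a Random Forest with a linear model on data generated from y ≈ 2x:

# Random Forest predictions are averages of training targets, so they cannot go
# beyond the range of y seen during training; a linear model extrapolates the trend.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * X_train.ravel() + rng.normal(scale=0.5, size=200)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

X_new = np.array([[15.0]])                         # outside the training range [0, 10]
print("Linear regression:", lin.predict(X_new))    # ~30, follows the trend
print("Random forest:    ", rf.predict(X_new))     # ~20, capped near the training maximum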

neptune.ai 2020-05-21 16:33:32

The Best Tools to Visualize Metrics and Hyperparameters of Machine Learning Experiments

Evaluating your model on the key metrics is a crucial first step in understanding your model quality. Keeping track of hyperparameters and corresponding evaluation metrics is important because small changes in hyperparameters can sometimes have a big impact on model quality. And so, understanding which hyperparameters have an impact and which do not affect evaluation […]

The post The Best Tools to Visualize Metrics and Hyperparameters of Machine Learning Experiments appeared first on neptune.ai.

Pierre de Buyl's homepage - scipy 2020-05-19 09:00:00

Tidynamics, what use?

In 2018 I published a small Python library, tidynamics. The scope was deliberately limited: compute the typical correlation functions for stochastic and molecular dynamics, namely the autocorrelation and the mean-square displacement. Two years later, I wonder about its usage.
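
As a quick orientation, here is a minimal usage sketch (mine, not from the post); it assumes the acf and msd routines exposed by tidynamics, so check the library's documentation for exact names and arguments:

# Autocorrelation of a scalar signal and mean-square displacement of a random walk.
import numpy as np
import tidynamics

rng = np.random.default_rng(42)
signal = rng.normal(size=10_000)                               # scalar stochastic signal
trajectory = np.cumsum(rng.normal(size=(10_000, 3)), axis=0)   # 3D random walk

acf = tidynamics.acf(signal)       # autocorrelation function; acf[0] ≈ variance of the signal
msd = tidynamics.msd(trajectory)   # mean-square displacement as a function of lag time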

NumFOCUS 2020-05-18 19:48:24

Moderna, IMC Renew NumFOCUS Corporate Sponsorships

Monday, May 18th, 2020. Two NumFOCUS corporate supporters recently made fresh commitments to our open source mission. Trading firm IMC and biotechnology company Moderna Therapeutics each renewed their corporate sponsorships earlier this month. Both companies have supported NumFOCUS since 2018 at our Silver and Bronze sponsorship levels, respectively. Asked about his company’s decision to partner […]

The post Moderna, IMC Renew NumFOCUS Corporate Sponsorships appeared first on NumFOCUS.

neptune.ai 2020-05-18 17:10:56

This Week in Machine Learning: Amazon Releases Kendra, Brain-Inspired Algorithms, Confused AI Models & More

Machine learning is fascinating. New things happen every second while we’re busy performing our daily tasks. If you want to know what big things have happened over the last week, make sure to check this weekly roundup! Here are the best picks from the last week from the world of machine learning. Enjoy the […]

The post This Week in Machine Learning: Amazon Releases Kendra, Brain-Inspired Algorithms, Confused AI Models & More appeared first on neptune.ai.

NumFOCUS 2020-05-18 14:58:09

NumFOCUS Projects helping combat the COVID-19 pandemic

Open source tools are uniquely positioned to help combat the ongoing COVID-19 pandemic through their adaptable and collaborative nature. NumFOCUS sponsored and affiliated projects are being used on a global scale to meet the needs of researchers and data scientists. Our projects are being used in groundbreaking scientific efforts to create response models, visualize and […]

The post NumFOCUS Projects helping combat the COVID-19 pandemic appeared first on NumFOCUS.

neptune.ai 2020-05-18 11:06:57

Top Open Source Tools and Libraries for Deep Learning – ICLR 2020 Experience

Where is cutting-edge deep learning created and discussed? One of the top places is ICLR, a leading deep learning conference that took place on April 27-30, 2020. As a fully virtual event with 5,600+ participants and almost 700 papers/posters, it could be called a great success. You can find comprehensive info about the conference […]

The post Top Open Source Tools and Libraries for Deep Learning – ICLR 2020 Experience appeared first on neptune.ai.

Paul Ivanov’s Journal 2020-05-17 07:00:00

Lazy River of Curious Content 0

This is the first post of what I'm calling a Lazy River of Curious Content. This is a way to review stuff that I've been doing, dealing with, or finding interesting during the week. (This was originally written two weeks ago, on May 3rd; my shoddy internet connectivity kept me from posting it.) I'm loosely following the format that Justin Sherrill uses with great effect over at https://dragonflydigest.com

Learn NixOS by turning a Raspberry Pi into a Wireless Router
Friend of the show, Anthony Scopatz, tried NixOS for the first time and provides a detailed report:

"While I had read the NixOS pamphlets, and listened politely when the faithful came knocking on my door at inconvenient times, I had never walked the path of functional Linux enlightenment myself"

Reading through that made me file away a todo of writing up how I use propellor (and why). But those todos sometimes just pile up for a while...

An interview with one of my long-time nerd-crushes, Rob Pike. The questions focus on the Go programming

(continued...)
neptune.ai 2020-05-15 10:36:27

Interview with a Chief AI Scientist: Arash Azhand

Some time ago I had a chance to interview a great artificial intelligence researcher and the Chief AI Scientist at Lindera, Arash Azhand. We talked about: the AI technology behind his work at Lindera; his career path; what it is like to be a research-centered scientist; how to become a good leader; why it is important to […]

The post Interview with a Chief AI Scientist: Arash Azhand appeared first on neptune.ai.

Living in an Ivory Basement 2020-05-06 22:00:00

sourmash databases as zip files, in sourmash v3.3.0

The feature that I'm most excited about in sourmash 3.3.0 is the ability to directly use compressed SBT search databases.

Previously, if you wanted to search (say) 100,000 genomes from GenBank, you'd have to download a several GB .tar.gz file, and then uncompress it out to ~20 GB before searching it. The time and disk space requirements for this were major barriers for teaching and use.

In v3.3.0, Luiz Irber fixed this by, first, releasing the niffler Rust library with Pierre Marijon, to read and write compressed files; second, replacing our old khmer Bloom filter nodegraph with a Rust implementation (sourmash PR #799); and, third, adding direct zip file storage (sourmash #648).

So, as of the latest release, you can do the following:

# install sourmash v3.3.0
conda create -y -n sourmash-demo \
    -c conda-forge -c bioconda sourmash=3.3.0

# activate environment
conda activate sourmash-demo

# download the 25k GTDB release89 guide database (~1.4 GB)
curl -L https://osf.io/5mb9k/download > gtdb-release89-k31.sbt.zip

# grab
(continued...)
Filipe Saraiva's blog 2020-05-05 18:29:16

LaKademy 2019

Last November, Latin American KDE contributors landed in Salvador, Brazil, to take part in one more edition of LaKademy – the Latin American Akademy. That was the seventh edition of the event (or the eighth, if you count Akademy-BR as the first LaKademy) and the second one with Salvador as the host city. No problem… Continue reading »LaKademy 2019
Filipe Saraiva's blog 2020-05-04 21:20:54

Akademy 2019

In September 2019, the Italian city of Milan hosted the main worldwide gathering of KDE contributors – Akademy, where members from different areas such as translators, developers, artists, promo people and more get together for a few days to think about and build the future of KDE's projects and community(ies). Before arriving… Continue reading »Akademy 2019
Quansight Labs 2020-05-02 03:30:00

Highlights of the Ibis 1.3 release

Ibis 1.3 was just released, after 8 months of development work, with 104 new commits from 16 unique contributors. What is new? In this blog post we will discuss some important features in this new version!

First, if you are new to the Ibis framework world, you can check this blog post I wrote last year, with some introductory information about it.

Some highlighted features of this new version are:

  • Addition of a PySpark backend
  • Improvement of geospatial support
  • Addition of JSON, JSONB and UUID data types
  • Initial support for Python 3.8 added and support for Python 3.5 dropped
  • Added new backends and geospatial methods to the documentation
  • Renamed the mapd backend to omniscidb

Read more… (9 min remaining to read)

NumFOCUS 2020-05-01 16:32:10

Yellowbrick Update – April 2020

Yellowbrick released Version 1.1 on February 25, 2020. If you haven’t yet upgraded, simply type pip install yellowbrick -U or conda install -c districtdatalabs yellowbrick into your terminal/command prompt to get it. The major improvement in v1.1 is the introduction of quick methods, or one-liners, to generate your favorite ML plots more quickly with Yellowbrick. Dr. Rebecca […]

The post Yellowbrick Update – April 2020 appeared first on NumFOCUS.

NumFOCUS 2020-04-29 18:34:15

2020 PyData Conferences Update [COVID-19]

We wanted to give an update to our community regarding the upcoming 2020 PyData conferences. We have been closely monitoring the situation and to help ensure the safety of our community given the threat of the COVID-19 virus, the following in-person events have been postponed to 2021: PyData Miami PyData Amsterdam PyData LA While disappointing, we […]

The post 2020 PyData Conferences Update [COVID-19] appeared first on NumFOCUS.

NumFOCUS 2020-04-28 18:16:15

Scientific Software Developer- Contract Basis [SunPy Project]

NumFOCUS is seeking a Scientific Software Developer to support the SunPy project. SunPy is a Python-based open source scientific software package supporting solar physics data analysis. This is a 1-year contract. The successful applicant will work to improve SunPy’s functionality. There are four main tasks: Report on […]

The post Scientific Software Developer- Contract Basis [SunPy Project] appeared first on NumFOCUS.

Quansight Labs 2020-04-28 06:00:00

Thanking the people behind Spyder 4

After more than three years in development and more than 5000 commits from 60 authors around the world, Spyder 4 finally saw the light on December 5, 2019! I decided to wait until now to write a blogpost about it because shortly after the initial release, we found several critical performance issues and some regressions with respect to Spyder 3, most of which are fixed now in version 4.1.2, released on April 3rd 2020.

Read more… (3 min remaining to read)

Spyder Blog 2020-04-22 17:00:00

Creating the ultimate terminal experience in Spyder 4 with Spyder-Terminal

This blogpost was originally published on the Quansight Labs website.

The Spyder-Terminal project is revitalized! The new 0.3.0 version adds numerous features that improve the user experience, and enhances compatibility with the latest Spyder 4 release, in part thanks to the improvements made in the xterm.js project.

Upgrade to ES6/JSX syntax

First, we were able to update all the old JavaScript files, as well as the tests for the client terminal, to use ES6/JSX syntax. This change simplified the code base and its maintenance, and allows us to easily extend the project to new functionalities that the xterm.js API offers. In order to compile this code and run it inside Spyder, we migrated our deployment to Webpack.

Multiple shells per operating system

In the new release, you now have the ability to configure which shell to use in the terminal. On Linux and UNIX systems, bash, sh, ksh, zsh, csh, pwsh, tcsh, screen, tmux, dash and rbash are supported, while cmd and powershell are the

(continued...)
Quansight Labs 2020-04-20 05:00:00

Introducing ndindex, a Python library for manipulating indices of ndarrays

One of the most important features of NumPy arrays is their indexing semantics. By "indexing" I mean anything that happens inside square brackets, for example, a[4::-1, 0, ..., [0, 1], np.newaxis]. NumPy's index semantics are very expressive and powerful, and this is one of the reasons the library is so popular.

Index objects can be represented and manipulated directly. For example, the above index is (slice(4, None, -1), 0, Ellipsis, [0, 1], None). If you are any author of a library that tries to replicate NumPy array semantics, you will have to work with these objects. However, they are often difficult to work with:

  • The different types that are valid indices for NumPy arrays do not have a uniform API. Most of the types are also standard Python types, such as tuple, list, int, and None, which are usually unrelated to indexing.

  • Those objects that are specific to indexes, such as slice and Ellipsis, do not make any assumptions about their underlying semantics. For example, Python lets you create slice(None, None, 0) or slice(0, 0.5)

(continued...)
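
A small sketch of the raw index objects the excerpt refers to (mine, not from the post); it uses plain NumPy to show that the bracket syntax and the explicit tuple of index objects are equivalent, and why that tuple is awkward to work with:

import numpy as np

a = np.arange(3 * 5 * 4 * 2).reshape(3, 5, 4, 2)

# The bracket index and its explicit tuple-of-objects form select the same elements:
idx = (slice(4, None, -1), 0, Ellipsis, [0, 1], None)
assert np.array_equal(a[4::-1, 0, ..., [0, 1], np.newaxis], a[idx])

# The tuple mixes unrelated built-in types, which is what makes it hard to manipulate
# without a dedicated library such as ndindex:
print([type(i).__name__ for i in idx])   # ['slice', 'int', 'ellipsis', 'list', 'NoneType']
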
Living in an Ivory Basement 2020-04-19 22:00:00

Software and workflow development practices (April 2020 update)

Over the last 10-15 years, I've blogged periodically about how my lab develops research software and builds scientific workflows. The last update talked a bit about how we've transitioned to snakemake and conda for automation, but I was spurred by an e-mail conversation into another update - because, y'all, it's going pretty well and I'm pretty happy!

Below, I talk through our current practice of building workflows and software. These procedures work pretty well for our (fairly small) lab of people who mostly work part-time on workflow and software development. By far the majority of our effort is usually spent trying to understand the results of our workflows; except in rare cases, I try to guide people to spend at most 20% of their time writing new analysis code - preferably less.

Nothing about these processes ensures that the scientific output is correct or useful, of course. While scientific correctness of computational workflows necessarily depends (often critically) on the correctness of the

(continued...)
Filipe Saraiva's blog 2020-04-16 15:12:29

LaKademy 2019

Last November (2019), KDE contributors from Latin America arrived in Salvador, Brazil, to attend one more edition of LaKademy – the Latin American Akademy. That was the 7th edition of the event (or the 8th, if you count Akademy-BR as the first LaKademy) and the second one with Salvador as the host city. No problem… Continue reading »LaKademy 2019
Quansight Labs 2020-04-13 15:39:56

PyTorch TensorIterator Internals

PyTorch is one of the leading frameworks for deep learning. Its core data structure is Tensor, a multi-dimensional array implementation with many advanced features like auto-differentiation. PyTorch is a massive codebase (approx. a million lines of C++, Python and CUDA code), and having a method for iterating over tensors in a very efficient manner that is independent of data type, dimension, striding and hardware is a critical feature that can lead to a massive simplification of the codebase and make distributed development much faster and smoother. The TensorIterator C++ class within PyTorch is a complex yet useful class that is used for iterating over the elements of a tensor over any dimension and implicitly parallelizing various operations in a device independent manner.

It does this through a C++ API that is independent of type and device of the tensor, freeing the programmer of having to worry about the datatype or device when writing iteration logic for PyTorch tensors. For those coming from the NumPy universe, NpyIter is a close cousin of TensorIterator.

This

(continued...)
Martin Fitzpatrick - python 2020-04-13 11:01:00

Is it getting better yet? An optimistic visual guide to the Coronavirus pandemic

As the apocalypse rumbles on, I found myself wondering "Is it getting any better?"

Daily updates of spiralling case numbers (and worse, deaths) do little to give a sense of whether we're getting to, or already past, the worst of it.

To answer that question for myself and you, I …

Living in an Ivory Basement 2020-04-12 22:00:00

How to give a bad online talk

Today at lab meeting, I wanted to brainstorm about how to give good online talks, because I'm giving a few remote talks in the next month. Tracy suggested that perhaps I should demonstrate a bad talk first, just to get everyone on the same page.

So I did!

Direct (YouTube link)

...enjoy? It's short, and not TOO painful if you show up with low expectations!


First, let me say that we were tremendously ...inspired by Greg Wilson's How to Teach Badly and How to Teach Badly (part 2)!

So here's what I did --

I put together a few slides on some stuff that I'd been working on recently, so it would look reasonable.

My initial screen opened with a private Twitter message up, to mimic inadvertent content sharing :).

I started out with "I didn't have a lot of time to prepare for this meeting so apologies for some of the slides."

My slide theme was very hard to read - bad fonts and colors.

A

(continued...)
fa.bianp.net 2020-04-06 22:00:00

On the Link Between Polynomials and Optimization

There's a fascinating link between minimization of quadratic functions and polynomials. A link that goes deep and allows us to phrase optimization problems in the language of polynomials and vice versa. Using this connection, we can tap into centuries of research in the theory of polynomials and shed new light on …
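
As a one-equation illustration of that link (my own sketch, not taken from the post): gradient descent on the quadratic f(x) = \tfrac{1}{2} x^T A x - b^T x with step size \alpha updates x_{t+1} = x_t - \alpha (A x_t - b), so the residual r_t = A x_t - b evolves as

r_t = (I - \alpha A)^t r_0 = p_t(A)\, r_0, \qquad p_t(\lambda) = (1 - \alpha \lambda)^t,

a degree-t polynomial satisfying p_t(0) = 1; designing a faster first-order method amounts to choosing a better polynomial p_t.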

Paul Ivanov’s Journal 2020-04-03 07:00:00

pheriday 3: infrastructure

(continued...)
Quansight Labs 2020-03-14 12:25:55

Documentation as a way to build Community

As a long time user and participant in open source communities, I've always known that documentation is far from being a solved problem. At least, that's the impression we get from many developers: "writing docs is boring"; "it's a chore, nobody likes to do it". I have come to realize I'm one of those rare people who likes to write both code and documentation.

Nobody will argue against documentation. It is clear that for an open-source software project, documentation is the public face of the project. The docs influence how people interact with the software and with the community. It sets the tone about inclusiveness, how people communicate and what users and contributors can do. Looking at the results of a “NumPy Tutorial” search on any search engine also gives an idea of the demand for this kind of content - it is possible to find documentation about how to read the NumPy documentation!

I started working at Quansight in January,

(continued...)
NumFOCUS 2020-03-13 15:02:49

PyData COVID-19 Response

The safety and well-being of our community are extremely important to us. We have therefore decided to postpone all PyData conferences scheduled to take place until the end of June: PyData Miami PyData London PyData Amsterdam We have been closely monitoring the situation and believe this is the best action to take based on the […]

The post PyData COVID-19 Response appeared first on NumFOCUS.

Quansight Labs 2020-03-12 12:39:00

uarray: GSoC Participation

I'm pleased to announce that uarray is participating in GSoC '20 as a sub-organization under the umbrella of the Python Software Foundation. Our ideas page is up here, go take a look and see if you (or someone you know) is interested in participating, either as a student or as a mentor.

Prasun Anand, Peter Bell, and I will be mentoring, and we plan to take a maximum of two students, unless more community mentors show up.

There have already been quite a few qualifying pull requests from prospective students, some even going as far as beginning the work described in the idea they plan to work on.

We're quite excited by the number of students who have shown an interest in participating, and we look forward to seeing excellent applications! What's more exciting, though, are some of the first contributions from people not currently at Quansight, in the true spirit of open-source software!

Quansight Labs 2020-03-11 11:30:00

Planned architectural work for PyData/Sparse

What have we been doing so far? 🤔 Research 📚

A lot of behind the scenes work has been taking place on PyData/Sparse. Not so much in terms of code, more in terms of research and community/team building. I've more-or-less decided to use the structure and the research behind the Tensor Algebra Compiler, the work of Fredrik Kjolstad and his collaborators at MIT. 🙇🏻‍♂️ To this end, I've read/watched the following talks and papers:

Read more… (2 min remaining to read)

NumFOCUS 2020-03-10 19:13:20

Statement on Coronavirus

As you are aware, the Coronavirus (COVID-19) is a topic of frequent and ongoing discussions. We would like to provide an update on our status and policies as well as provide resources for additional information. As of today, our event schedule remains as posted on event sites. Any changes or updates will be immediately shared. […]

The post Statement on Coronavirus appeared first on NumFOCUS.

Living in an Ivory Basement 2020-03-08 23:00:00

Some snakemake hacks for dealing with large collections of files

This winter quarter I taught my usual graduate-level introductory bioinformatics lab at UC Davis, GGG 201(b), for the fourth time. The course lectures are given by Megan Dennis and Fereydoun Hormozdiari, and I do a largely separate lab that aims to teach the basics of practical variant calling, de novo assembly, and RNAseq differential expression.

I also co-developed and co-taught a new course, GGG 298 / Tools for Data Intensive Research, with Shannon Joslin, a graduate student here in Genetics & Genomics who (among other things) took GGG 201(b) the first time I offered it. GGG 298 is a series of ten half-day workshops where we teach shell, conda, snakemake, git, RMarkdown, etc - you can see the syllabus for GGG 298 here.

This time around, I did a complete redesign of the GGG 201(b) lab (see syllabus) to focus on using snakemake workflows.

I'm 80% happy with how it went - there's some overall fine tuning to be done, and snakemake has some corners that need more explaining than other corners, but I think the

(continued...)
Filipe Saraiva's blog 2020-02-24 16:22:14

Akademy 2019

Last September, the Italian city of Milan hosted the KDE contributors' meeting called Akademy, the main KDE conference where contributors from different areas like translators, developers, artists, promoters and more stay together for some days thinking about and building the future of KDE projects and community(ies). Before Akademy, I departed from Brazil to Portugal to… Continue reading »Akademy 2019
Quansight Labs 2020-02-21 18:38:07

My Unexpected Dive into Open-Source Python

Header illustration by author, Mars Lee

I'm very happy to announce that I have joined Quansight as a front-end developer and designer! It was a happy coincidence how I joined: the intersection of my skills and the open source community's expanded vision.

Read more… (4 min remaining to read)

NumFOCUS 2020-02-20 18:08:46

Announcing JupyterCon 2020

NumFOCUS is excited to be a part of JupyterCon 2020. JupyterCon will be held August 10 – 14 in Berlin, Germany at the Berlin Conference Center. We invite you to participate in this exciting community event! Read the full announcement here. JupyterCon 2020 is an event brought to you in partnership by Project Jupyter and NumFOCUS.

The post Announcing JupyterCon 2020 appeared first on NumFOCUS.

NumFOCUS 2020-02-20 16:27:43

MDAnalysis joins NumFOCUS Sponsored Projects

NumFOCUS is pleased to announce the newest addition to our fiscally sponsored projects: MDAnalysis MDAnalysis is a Python library for the analysis of computer simulations of many-body systems at the molecular scale, spanning use cases from interactions of drugs with proteins to novel materials. It is widely used in the scientific community and is written […]

The post MDAnalysis joins NumFOCUS Sponsored Projects appeared first on NumFOCUS.

Living in an Ivory Basement 2020-02-16 23:00:00

Two talks at JGI in May: sourmash, spacegraphcats, and disease associations in the human microbiome.

Hello all! I'm giving two metagenomics talks - a tech talk and a bio talk - at the Joint Genome Institute on May 7, 2020. The abstracts are below.

The JGI just moved to a new building at LBNL, so these talks are much more accessible to the UC Berkeley and LBNL communities than they would have been a year ago. I hope interested people can make it!

The talks will be in the afternoon on May 7th at the Integrative Genomics Building, LBNL Bldg 91-310. I've put the tentative times down. I'll update this post with final times and contact information for security + parking passes closer to the day.

Bio talk: Novel approaches to metagenome analysis reveal microbial signatures of IBD

(This will be the Science and Technology seminar, 3-4pm on May 7.)

Inflammatory bowel disease (IBD) is a spectrum of diseases characterized by chronic inflammation of the intestines; it is likely caused by host-mediated inflammatory responses at least in part elicited by microorganisms. As of 2015, 1.3% of US adults have been diagnosed

(continued...)
Quansight Labs 2020-02-14 19:00:00

Creating the ultimate terminal experience in Spyder 4 with Spyder-Terminal

The Spyder-Terminal project is revitalized! The new 0.3.0 version adds numerous features that improve the user experience, and enhances compatibility with the latest Spyder 4 release, in part thanks to the improvements made in the xterm.js project.

Read more… (3 min remaining to read)

Leonardo Uieda 2020-01-23 12:00:00

Advancing research software in the UK through an SSI fellowship

I have been selected as part of the 2020 cohort of Fellows of the Software Sustainability Institute!

The Institute cultivates world-class research with software. It's based at the universities of Edinburgh, Manchester, Southampton, and Oxford in the UK. Their motto says it all:

The SSI has a yearly fellowship program to fund the organization of communities around scientific software (creation of local user groups, workshops, hackathons, etc.). Even more importantly, they organize several events to get current and past fellows in the same place doing awesome stuff. I'm really looking forward to this year's Collaborations Workshop (registration is open to all, not just fellows). I applied at the end of last year and was selected to join the 2020 cohort of fellows along with some truly amazing people.

My plan for the fellowship is to run

(continued...)
Peekaboo 2020-01-07 17:26:00

Don't fund Software that doesn't exist

I’ve been happy to see an increase in funding for open source software across research areas and across funding bodies. However, I observed that a majority of funding from, say, the NSF, goes to projects that do not exist yet, and where the funding is supposed to create a new project, or to extend projects that are developed and used within a single research lab. I think this top-down approach to creating software comes from a misunderstanding of the existing open source software that is used in science. This post collects thoughts on the effectiveness of current grant-based funding and how to improve it from the perspective of the grant-makers.
Instead of the current approach of funding new projects, I would recommend funding existing open source software, ideally software that is widely used, and underfunded. The story of the underfunded but critically important open source software (which I’ll refer to as infrastructure software) should be an old tale by now.
(continued...)
Living in an Ivory Basement 2020-01-01 23:00:00

sourmash-oddify: a workflow for exploring contamination in metagenome-assembled genomes

(Thanks to Erich Schwarz, Taylor Reiter, and Donovan Parks for brainstorming and feedback on this stuff. Thanks also to Luiz Irber and Phillip Brooks for their work on sourmash!)

Yesterday, I posted about using k-mers and taxonomy to investigate Genbank genomes for potential contamination.

The underlying idea is pretty simple: look for subsets of k-mers that don't match the inferred taxonomy of the genome bins they're from, then analyze.

What started me down this path over two years ago (!!) was the use of the same underlying Tara Oceans metagenomic data for two separate papers, Tully et al., 2018 and Delmont et al., 2018. Both groups released their data early along with bioRxiv preprints, and it proved to be a treasure trove for my bioinformatics methods development - all of the sourmash lca functionality, as well as a lot of other functionality, came from a series of about 14 blog posts examining these genomes.

I last left off with the

(continued...)
Living in an Ivory Basement 2019-12-31 23:00:00

Finding problematic bacterial/archaeal genomes using k-mers and taxonomy

(Happy New Year, everyone! Thanks on this blog post go out to Erich Schwarz and Taylor Reiter, for offering helpful suggestions and asking tough questions as I meandered through this work!)

Yesterday, I posted about using sourmash lca classify to taxonomically classify bacterial and archaeal genomes quickly, and compared the results to the full GTDB taxonomy. The tl;dr was that sourmash works pretty well and returns results consistent with GTDB and GTDB-Tk, but that it often doesn't classify as precisely as GTDB-Tk.

I was kind of expecting that at the species level, because there is a limit to the kind of precision that downsampled k-mers can achieve: the last 1-0.1% of nucleotide similarity can be a bit wobbly with sourmash (<- technical term).

But I was surprised to see the phylum and superkingdom level limits. sourmash lca classify couldn't classify 235 genomes beyond phylum level! What could be causing this?

Digging into a single case of imprecise classification by sourmash

I took a closer look at GCF_001477405, a genome tagged as Staphylococcus sciuri in Genbank. Using sourmash lca summarize,

(continued...)
Living in an Ivory Basement 2019-12-30 23:00:00

How does sourmash's lca classification routine compare with GTDB classifications?

Yesterday I posted about the GTDB taxonomy; we are now providing prepared databases that can be used with sourmash's taxonomy classification routines to classify genomes with GTDB.

The databases we posted are built from the dereplicated 25k GTDB genomes distributed as part of the GTDB-Tk classification toolkit, and not the full 145k classifications in GTDB. So they are smaller than they could be, and also potentially lower resolution. Moreover, sourmash uses k-mers instead of amino acids, which may lead to different classifications.

A good first question is, how well do classifications with sourmash lca classify & 25k genomes compare to the full 145k classifications in GTDB? This is basically a measure of generalizability - how reliably can we infer the classifications of the 145k genomes from the 25k?

Comparing sourmash lca classify on Genbank to GTDB

I classified all 420k Genbank genomes using sourmash lca classify with k=31, and I then wrote a script to compare the output to the GTDB taxonomy. This involved some rather nasty identifier conversion which sometimes failed, but we ended up with a good

(continued...)
Living in an Ivory Basement 2019-12-29 23:00:00

Sourmash LCA databases now available for the GTDB taxonomy

I am happy to announce that we have made available prepared sourmash taxonomy ("LCA") databases for release 89 of the GTDB taxonomy.

The databases are available for download from the Open Science Framework in this project. There are prepared databases available for k=21, k=31, and k=51.

What is the GTDB taxonomy?

GTDB is a revised bacterial and archaeal taxonomy based on phylogenetic relations between proteins from approximately 25k genomes. You can read more about it here.

GTDB is an alternative to the NCBI taxonomy. It is used by (among others) MGnify, the EBI metagenomics resource.

What is sourmash?

Sourmash is a research platform and bioinformatics tool for searching and analyzing genomes, based on a MinHash-inspired approach that allows genome similarity searches, genome containment searches, and compositional analysis of k-mers in large sequence data sets. You can read more about it here.

What do these databases let you do?

There are three immediate uses for these databases:

  • you can use the sourmash lca classify routine (and other LCA commands) to do taxonomic classification of genomes

(continued...)
Quansight Labs 2019-12-24 11:00:00

metadsl PyData talk

PyData NYC just ended and I thought it would be good to collect my thoughts on metadsl based on the many conversations I had there surrounding it. This is a rather long post, so if you are just looking for some code here is a Binder link for my talk. Also, here is the talk I gave a month or so later on the same topic in Austin:

What is metadsl?
from __future__ import annotations  # lets the class reference "Number" in its own annotations

import metadsl


class Number(metadsl.Expression):
    @metadsl.expression
    def __add__(self, other: Number) -> Number:
        ...

    @metadsl.expression
    @classmethod
    def from_int(cls, i: int) -> Number:
        ...


@metadsl.rule
def add_zero(y: Number):
    yield Number.from_int(0) + y, y
    yield y + Number.from_int(0), y

Read more… (13 min remaining to read)

Leonardo Uieda 2019-12-08 12:00:00

Two PhD studentships at the University of Liverpool

I have two open positions for funded studentships at the University of Liverpool. Applications are open until 10 January 2020.

Project descriptions

Follow the links for more detailed versions.

Bringing machine learning techniques to geophysical data processing

The goal of this project is to investigate the use of existing machine learning techniques to process gravity and magnetics data using the Equivalent Layer Method. The methods and software developed during this project can be applied to process large amounts of gravity and magnetics data, including airborne and satellite surveys, and produce data products that can enable further scientific investigations. Examples of such data products include global gravity gradient grids from GOCE satellite measurements, regional magnetic grids for the UK, gravity grids for the Moon and Mars, etc.

Large-scale mapping of the thickness of the

(continued...)
NumFOCUS 2019-12-04 18:50:46

NumFOCUS Kicks Off Year-End Fundraising with $2,500 Gift

Giving Tuesday 2019 marked the start of NumFOCUS’s year-end fundraising campaign, and this year’s effort began with a major donation from the organization’s Board President, Andy Terrel. “Today I’m pledging a gift of $2,500 to help kick off the NumFOCUS end-of-year fundraising campaign,” Terrel wrote yesterday in an e-mail to the NumFOCUS community. “I believe […]

The post NumFOCUS Kicks Off Year-End Fundraising with $2,500 Gift appeared first on NumFOCUS.

NumFOCUS 2019-12-02 22:55:41

Stepping Onto a New Path…

The post Stepping Onto a New Path… appeared first on NumFOCUS.

Gaël Varoquaux - programming 2019-12-01 05:00:00

Getting a big scientific prize for open-source software

Note

An important acknowledgement for a different view of doing science: open, collaborative, and more than a proof of concept.

A few days ago, Loïc Estève, Alexandre Gramfort, Olivier Grisel, Bertrand Thirion, and myself received the “Académie des Sciences Inria prize for transfer”, for our contributions to the scikit-learn project …

Spyder Blog 2019-11-28 20:00:00

Variable Explorer improvements in Spyder 4

This blogpost was originally published on the Quansight Labs website.

Spyder 4 will be released very soon with lots of interesting new features that you'll want to check out, reflecting years of effort by the team to improve the user experience. In this post, we will be talking about the improvements made to the Variable Explorer.

These include the brand new Object Explorer for inspecting arbitrary Python variables, full support for MultiIndex dataframes with multiple dimensions, the ability to filter and search for variables by name and type, and much more.

It is important to mention that several of the above improvements were made possible through integrating the work of two other projects. Code from gtabview was used to implement the multi-dimensional Pandas indexes, while objbrowser was the foundation of the new Object Explorer.

New viewer for arbitrary Python objects

For Spyder 4 we added a long-requested feature: full support for inspecting any kind of Python object through the Variable

(continued...)
Spyder Blog 2019-11-12 00:00:00

File management improvements in Spyder 4

This blogpost was originally published on the Quansight Labs website.

Version 4.0 of Spyder is almost ready! It has been in the making for well over two years, and it contains lots of interesting new features. We will focus on the Files pane in this post, where we've made several improvements to the interface and file management tools.

Simplified interface

In order to simplify the Files pane's interface, the columns corresponding to size and kind are hidden by default. To change which columns are shown, use the top-right pane menu or right-click the header directly.

Custom file associations

First, we added the ability to associate different external applications with specific file extensions they can open. Under the File associations tab of the Files preferences pane, you can add file types and set the external program used to open each of them by default.

Once you've set this up, files will automatically launch in the associated application when opened from the Files pane in Spyder.

(continued...)
ListenData 2019-10-28 15:48:00

Loan Amortisation Schedule using R and Python

In this post, we will explain how you can calculate your monthly loan instalments the way banks calculate them, using R and Python. In the financial world, analysts generally use MS Excel for calculating the principal and interest portions of an instalment, using the PPMT and IPMT functions. As data science is growing and trending these days, it is important to know how you can do the same using popular data science programming languages such as R and Python.

When you take a loan from a bank at an x% annual interest rate for N years, the bank calculates monthly (or quarterly) instalments based on the following factors:

  • Loan Amount
  • Annual Interest Rate
  • Number of payments per year
  • Number of years for loan to be repaid in instalments
Loan Amortisation Schedule: It refers to a table of periodic loan payments explaining the breakup of principal and interest in each instalment/EMI until the loan is repaid at the end of its stipulated term. Monthly instalments are generally the same every month
(continued...)
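
For readers who want the arithmetic right away, here is an illustrative Python sketch (mine, not the post's code) of the standard EMI formula and the principal/interest split that Excel's PPMT and IPMT functions produce:

def emi(principal, annual_rate, years, payments_per_year=12):
    # fixed instalment for a fully amortising loan
    r = annual_rate / payments_per_year          # periodic interest rate
    n = years * payments_per_year                # total number of payments
    return principal * r / (1 - (1 + r) ** -n)

def amortisation_schedule(principal, annual_rate, years, payments_per_year=12):
    # yields (period, interest, principal_repaid, remaining_balance) rows
    payment = emi(principal, annual_rate, years, payments_per_year)
    r = annual_rate / payments_per_year
    balance = principal
    for period in range(1, years * payments_per_year + 1):
        interest = balance * r                   # IPMT-like component
        principal_part = payment - interest      # PPMT-like component
        balance -= principal_part
        yield period, interest, principal_part, balance

# Example: 200,000 borrowed at 9% annual interest, repaid monthly over 10 years
print(round(emi(200_000, 0.09, 10), 2))          # ~2533.5 per month
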
I Love Symposia! 2019-10-24 13:59:54

Introducing napari: a fast n-dimensional image viewer in Python

I'm really excited to finally, officially, share a new(ish) project called napari with the world. We have been developing napari in the open from the very first commit, but we didn't want to make any premature fanfare about it… Until now. It's still alpha software, but for months now, both the core napari team and a few collaborators/early adopters have been using napari in our daily work. I've found it life-changing.

The background

I've been looking for a great nD volume viewer in Python for the better part of a decade. In 2009, I joined Mitya Chklovskii's lab and the FlyEM team at the Janelia [Farm] Research Campus to work on the segmentation of 3D electron microscopy (EM) volumes. I started out in Matlab, but moved to Python pretty quickly and it was a very smooth transition (highly recommended! ;). Looking at my data was always annoying though. I was either looking at single 2D slices using matplotlib.pyplot.imshow, or saving the volumes in VTK format and loading them into ITK-SNAP — which worked ok

(continued...)
Filipe Saraiva's blog 2019-10-11 16:29:56

SERPRO and the validation of digital documents

While drafting the previous post about digital documents in Brazil, I ended up writing quite a lot about SERPRO's role in that process – so much that I decided to split it off into its own post. With the launch of e-Título, the TSE needed to create a way to validate the digital document in order to prevent fraud. The technology adopted was… Continue reading »SERPRO and the validation of digital documents
Filipe Saraiva's blog 2019-10-07 00:29:16

Brazil's digital documents (in the plural)

For some time now, Brazil has been going through a process of digitizing the official documents used by individuals. However, what was once announced as a possible convergence of the most varied documents into a single document, which would serve for everything, has turned into the digitization of each specific document through the development of… Continue reading »Brazil's digital documents (in the plural)
fa.bianp.net 2019-09-26 22:00:00

How to Evaluate the Logistic Loss and not NaN trying

A naive implementation of the logistic regression loss can result in numerical indeterminacy even for moderate values. This post takes a closer look into the source of these instabilities and discusses more robust Python implementations.

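A minimal sketch of the kind of instability being discussed and its usual fix (my example, not the post's code):

import numpy as np

def logistic_loss_naive(z):
    # log(1 + exp(-z)); np.exp(-z) overflows for large negative z
    return np.log(1 + np.exp(-z))

def logistic_loss_stable(z):
    # same quantity computed as logaddexp(0, -z), stable for any z
    return np.logaddexp(0, -z)

z = np.array([-1000.0, 0.0, 1000.0])
print(logistic_loss_naive(z))    # [inf, 0.693..., 0.]  plus an overflow warning
print(logistic_loss_stable(z))   # [1000., 0.693..., 0.]
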
Paul Ivanov’s Journal 2019-09-17 07:00:00

Uvas Gold 200

My poem about a rainy 200k was published in the Fall 2019 issue of American Randonneur (a quarterly magazine published by Randonneurs USA)

I've been doing samizdat poetry for as long as I've had a web presence (since 1999), but I am now officially a published poet! (I am deliberately not counting the embarrassing hackjob that was published in a youth anthology when I was in 8th grade.)

You can find "Uvas Gold 200" on page 26 - either directly on this skeuomorphic leafing viewer or the PDF, but I'm republishing both the exposition blurb and the poem below. If you prefer to listen, I recorded a reading of it that you can download in different flavors: a local audio only, a local video, or the embedded video version below.

Uvas Gold 200k starts and ends in Fremont, CA and was held on Saturday, December 1st, 2018. The ride frontloads the climbing by going nearly half-way up Mount

(continued...)
Filipe Saraiva's blog 2019-09-16 13:24:54

Study group of the Laboratório Amazônico de Estudos Sociotécnicos – UFPA

Prof. Leonardo Cruz of the Faculty of Social Sciences and I are working together on the development of UFPA's Laboratório Amazônico de Estudos Sociotécnicos. Our proposal is to hold readings and critical debates on the sociology of technology, to produce theoretical and empirical research in the Amazon region on the relations between technology and society, and… Continue reading »Study group of the Laboratório Amazônico de Estudos Sociotécnicos – UFPA
Filipe Saraiva's blog 2019-08-22 14:58:40

through the streets of Belém

sometimes through the streets of Belém I don't know if it's me or my mother who is there
Spyder Blog 2019-08-16 00:00:00

Spyder 4.0: Kite integration is here

This blogpost was originally published on the Quansight Labs website.

Note: Kite is sponsoring the work discussed in this blog post, and in addition supports Spyder 4.0 development through a Quansight Labs Community Work Order.

As part of our next release, we are proud to announce an additional completion client for Spyder, Kite. Kite is a novel completion client that uses Machine Learning techniques to find and predict the best autocompletion for a given text. Additionally, it collects improved documentation for compiled packages, e.g. Matplotlib, NumPy and SciPy, that cannot be obtained easily by using traditional code analysis packages such as Jedi. Although Kite is not open source like Spyder, you can download it without charge at the Kite website.

By incorporating Kite into Spyder, we will improve and provide the ultimate autocompletion and signature retrieval experience for most of the scientific Python stack and beyond. For instance, let’s take a look at the following PyTorch completion. While

(continued...)
Living in an Ivory Basement 2019-08-14 22:00:00

An initial report on the Common Fund Data Ecosystem

For the past 6 months or so, I've been working with a team of people on a project called the Common Fund Data Ecosystem. This is a targeted effort within the NIH Common Fund (CF) to improve the Findability, Accessibility, Interoperability, and Reusability - a.k.a. "FAIRness" - of the data sets hosted by their Data Coordinating Centers.

(You can see Dr. Vivien Bonazzi's presentation if you're interested in more details on the background motivation of this project.)

I'm thrilled to announce that our first report is now available! This is the product of a tremendous data gathering effort (by many people), four interviews, and an ensuing distillation and writing effort with Owen White and Amanda Charbonneau. To quote,

This assessment was generated from a combination of systematic review of online materials, in-person site visits to the Genotype Tissue Expression (GTEx) DCC and Kids First, and online interviews with Library of Integrated Network-Based Cellular Signatures (LINCS) and Human Microbiome Project

(continued...)
ListenData 2019-08-10 21:54:00

Object Oriented Programming in Python : Learn by Examples

This tutorial outlines object-oriented programming (OOP) in Python with examples. It is a step-by-step guide designed for people who have no programming experience. Object-oriented programming is popular and available in other programming languages besides Python, such as Java, C++ and PHP.
Table of Contents

What is Object Oriented Programming? In object-oriented programming (OOP), you have the flexibility to represent real-world objects like a car, an animal, a person, an ATM, etc. in your code. In simple words, an object is something that possesses some characteristics and can perform certain functions. For example, a car is an object and can perform functions like start, stop, drive and brake. These are the functions of a car. And the characteristics are the color of the car, mileage, maximum speed, model year, etc.

In the above example, the car is an object. Functions are called methods in the OOP world. Characteristics are attributes (properties). Technically, attributes are variables or values related to the state of the object, whereas methods

(continued...)
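
A short sketch of the car example described above (illustrative only, not the tutorial's own code):

class Car:
    def __init__(self, color, model_year, max_speed):
        # attributes: the characteristics (state) of the object
        self.color = color
        self.model_year = model_year
        self.max_speed = max_speed
        self.running = False

    # methods: the functions the object can perform
    def start(self):
        self.running = True

    def stop(self):
        self.running = False

my_car = Car("red", 2019, 180)          # my_car is an object (an instance of Car)
my_car.start()
print(my_car.color, my_car.running)     # red True
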
ListenData 2019-07-29 20:20:00

Precision Recall Curve Simplified

This article outlines the precision-recall curve and how it is used in real-world data science applications. It includes an explanation of how it differs from the ROC curve. It also highlights a limitation of the ROC curve and how it can be addressed via the area under the precision-recall curve. This article also covers the implementation of the area under the precision-recall curve in Python, R and SAS.
Table of Contents

What is a Precision Recall Curve? Before getting into technical details, we first need to understand precision and recall in layman's terms. It is essential to understand the concepts in simple words so that you can recall them for future work when required. Both precision and recall are important metrics for checking the performance of a binary classification model. Precision: Precision is also called Positive Predictive Value. Suppose you are building a customer attrition model whose objective is to identify customers who are likely to close their relationship with the company. The use of this model is to
(continued...)
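
As a quick illustration (mine, not the article's code) of precision, recall and the area under the precision-recall curve using scikit-learn:

# precision = TP / (TP + FP); recall = TP / (TP + FN)
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                     # actual events (1) and non-events (0)
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7])   # predicted probabilities

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print("Area under the precision-recall curve:", auc(recall, precision))
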
Living in an Ivory Basement 2019-07-22 22:00:00

Comparing two genome binnings quickly with sourmash

tl;dr? Compare and cluster two collections of 1000+ metagenome-assembled genomes in a few minutes with sourmash!


A week ago, someone e-mailed me with an interesting question: how can we compare two collections of genome bins with sourmash?

Why would you want to do this? Well, there's lots of reasons! The main one that caught my attention is comparing genomes extracted from metagenomes via two different binning procedures - that's where I started almost two years ago, with two sets of bins extracted from the Tara ocean data. You might also want to merge bins that were similar to produce a (hopefully) more complete bin, or you could intersect bins that were similar to produce a consensus bin that might be higher quality, or you could identify bins that were in one collection and not in the other, to round out your collection.

I'm assuming this is done by lots of workflows - I note, for example, that the metaWRAP workflow includes a 'bin refinement' step that must do something like

(continued...)
ListenData 2019-07-22 09:20:00

Calculate KS Statistic with Python

The Kolmogorov-Smirnov (KS) statistic is one of the most important metrics used for validating predictive models. It is widely used in the BFSI domain. If you are part of a risk or marketing analytics team working on a project in banking, you must have heard of this metric. What is the KS statistic? It stands for Kolmogorov–Smirnov, named after Andrey Kolmogorov and Nikolai Smirnov. It compares the two cumulative distributions and returns the maximum difference between them. It is a non-parametric test, which means you don't need to test any assumption related to the distribution of data. In the KS test, the null hypothesis states that both cumulative distributions are similar. Rejecting the null hypothesis means the cumulative distributions are different.

In data science, it compares the cumulative distributions of events and non-events, and KS is the point where the difference between the two distributions is maximum. In simple words, it helps us to understand how well our predictive model is able to discriminate between events and

(continued...)
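
As a small illustration of the idea (mine, not the article's code), the KS statistic is the largest gap between the cumulative distributions of model scores for events and non-events; scipy computes it directly:

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
scores_events = rng.normal(loc=0.7, scale=0.15, size=1_000)       # scores of actual events
scores_nonevents = rng.normal(loc=0.4, scale=0.15, size=9_000)    # scores of non-events

result = ks_2samp(scores_events, scores_nonevents)
print("KS statistic:", result.statistic, "p-value:", result.pvalue)
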
ListenData 2019-07-20 16:22:00

A Complete Guide to Python DateTime Functions

In this tutorial, we will cover the Python datetime module and how it is used to handle date, time and datetime formatted columns (variables). It includes various practical examples which should help you gain confidence in dealing with dates and times using Python functions. In general, date columns are not easy to manipulate, as they come with a lot of challenges: leap years, different numbers of days in a month, different date and time formats, date values stored in string (character) format, etc.
Table of Contents

Introduction: datetime module. It is a Python module which provides several functions and classes for dealing with dates and times. It has the following four classes, and how these classes work is explained in the latter part of this article.
  1. datetime
  2. date
  3. time
  4. timedelta

People who have no experience of working with real-world datasets might not have encountered date columns. They might be under the impression that working with dates is rarely needed and not so

(continued...)
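
A few of the basics as a quick sketch (mine, not the tutorial's code):

from datetime import datetime, date, time, timedelta

now = datetime.now()                                               # current date and time
today = date.today()                                               # current date only
meeting = time(16, 22)                                             # a time of day, 16:22
parsed = datetime.strptime("2019-07-20 16:22", "%Y-%m-%d %H:%M")   # string -> datetime
formatted = now.strftime("%d %b %Y")                               # datetime -> string
next_week = today + timedelta(days=7)                              # date arithmetic

print(now, today, meeting, parsed, formatted, next_week)
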
ListenData 2019-07-17 17:32:00

What are *args and **kwargs and How to use them

This article explains the concepts of *args and **kwargs and how and when to use them in a Python program. Seasoned Python developers embrace the flexibility they provide when creating functions. If you are a beginner in Python, you might not have heard of them before. After completing this tutorial, you will have the confidence to use them in your live projects.
Table of Contents

Introduction: *args. "args" is short for "arguments". With the use of *args, Python takes any number of arguments in a user-defined function and converts the user inputs into a tuple named args. In other words, *args means zero or more arguments, which are stored in a tuple named args.

When you define a function without *args, it has a fixed number of inputs, which means it cannot accept more (or fewer) arguments than you defined in the function.

In the example code below, we are creating a very basic function which adds two numbers. At the same time, we created a

(continued...)
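
A minimal sketch of both constructs (mine, not the tutorial's code):

def add(*args):
    # args is a tuple holding all positional arguments passed in
    return sum(args)

def describe(**kwargs):
    # kwargs is a dict holding all keyword arguments passed in
    return ", ".join(f"{key}={value}" for key, value in kwargs.items())

print(add(1, 2, 3, 4))                     # 10
print(describe(color="red", year=2019))    # color=red, year=2019
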
ListenData 2019-07-12 21:42:00

Python : 10 Ways to Filter Pandas DataFrame

In this article, we will cover various methods to filter a pandas DataFrame in Python. Data filtering is one of the most frequent data manipulation operations. It is similar to the WHERE clause in SQL, or to the filter you may have used in MS Excel to select specific rows based on some conditions. In terms of speed, Python has an efficient way to perform filtering and aggregation: it has an excellent package called pandas for data wrangling tasks. pandas is built on top of the numpy package, which is written in C, a low-level language. Hence, data manipulation with pandas is a fast and smart way to handle big datasets.
Examples of Data Filtering
It is one of the most common initial steps of data preparation for predictive modeling or any reporting project. It is also called 'subsetting data'. See some examples of data filtering below.
  • Select all the active customers whose accounts were opened
(continued...)
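
A sketch of a few common filtering idioms (mine, and not necessarily the article's ten methods):

import pandas as pd

df = pd.DataFrame({
    "customer": ["a", "b", "c", "d"],
    "active": [True, True, False, True],
    "balance": [120, 3500, 40, 980],
})

high_balance = df[df["balance"] > 500]                      # boolean indexing
active_high = df[df["active"] & (df["balance"] > 500)]      # combining conditions with &
via_query = df.query("active and balance > 500")            # the query() syntax
print(via_query)
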
ListenData 2019-07-04 19:51:00

Python Dictionary Comprehension with Examples

In this tutorial, we will cover how dictionary comprehension works in Python. It includes various examples which would help you to learn the concept of dictionary comprehension and how it is used in real-world scenarios.
What is Dictionary?
A dictionary is a data structure in Python which is used to store data such that values are connected to their related keys. Roughly, it works very similarly to SQL tables or data stored in statistical software. It has two main components -
  1. Keys: Think of columns in tables. They must be unique (just as column names cannot be duplicated)
  2. Values: Similar to rows in tables. They can be duplicated.
It is defined in curly braces { }. Each key is followed by a colon (:) and then its value.
Syntax of Dictionary

d = {'a': [1,2], 'b': [3,4], 'c': [5,6]}
To extract the keys, values, and structure of a dictionary, you can run the following commands.

d.keys() # 'a', 'b', 'c'
d.values() # [1, 2], [3, 4], [5, 6]
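As a preview of the comprehension syntax the tutorial builds toward, a dictionary can also be constructed in a single expression (an illustrative example, not from this excerpt):

d = {'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}

# build a new dictionary in one expression: each key mapped to the sum of its values
totals = {k: sum(v) for k, v in d.items()}
print(totals)   # {'a': 3, 'b': 7, 'c': 11}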
(continued...)

HTML outputs in Jupyter

Summary

User interaction in data science projects can be improved by adding a small amount of visual design.

To motivate effort around visual design we show several simple-yet-useful examples. The code behind these examples is small and accessible to most Python developers, even if they don’t have much HTML experience.

This post in particular focuses on Jupyter’s ability to add HTML output to any object. This can be either full-fledged interactive widgets, or just rich static outputs like tables or diagrams. We hope that by showing examples here we will inspire some thoughts in other projects.
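The mechanism behind such outputs is Jupyter's rich display protocol: an object can define a _repr_html_ method, and the notebook renders the HTML string it returns. A minimal sketch with a hypothetical class (not an example from the post):

class Temperature:
    def __init__(self, celsius):
        self.celsius = celsius

    def _repr_html_(self):
        # Jupyter calls this method and renders the returned HTML string
        color = "red" if self.celsius > 30 else "steelblue"
        return '<b style="color:{}">{} &deg;C</b>'.format(color, self.celsius)

Temperature(35)   # as the last expression in a notebook cell, this renders as bold red HTML

The same idea scales from a one-line formatter like this to the full HTML summaries that libraries such as Iris or pandas produce.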

This post was supported by replies to this tweet. The rest of this post is just examples.

Iris

I originally decided to write this post after reading a blogpost from the UK Met Office, where they included the HTML output of their library Iris

(work by Peter Killick, post by Theo McCaie)

The fact that the output provided by an interactive session is the same output that you would provide in a published result helps everyone. The interactive

(continued...)
ListenData 2019-07-03 15:01:00

Python list comprehension with Examples

This tutorial covers how list comprehension works in Python. It includes many examples to help you become familiar with the concept, and by the end of this lesson you should be able to implement it in your live project.
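As a quick taste of the syntax before the details, a list comprehension condenses a loop into a single expression (an illustrative sketch, not the article's own example):

# squares of even numbers, written as a plain loop...
squares = []
for x in range(10):
    if x % 2 == 0:
        squares.append(x ** 2)

# ...and the equivalent one-line list comprehension
squares = [x ** 2 for x in range(10) if x % 2 == 0]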
Table of Contents

What is list comprehension?
Python is an object-oriented programming language: almost everything in it is treated consistently as an object. Python also supports functional programming, which is very similar to the mathematical way of approaching a problem, where a function always produces the same output for the same input. Given a function f(x) = x², f(x) will always return the same result for the same x value. Such a function has no "side effect", meaning the operation does not affect any variable or object outside its intended usage; a "side effect" refers to a leak in your code that modifies a mutable data structure or variable.

Functional programming is also good for parallel computing as there is no

(continued...)
Peekaboo 2019-07-02 16:11:00

Don't cite the No Free Lunch Theorem

Tldr; You probably shouldn’t be citing the "No Free Lunch" Theorem by Wolpert. If you’ve cited it somewhere, you might have used it to support the wrong conclusion. What it actually (vaguely) says is “You can’t learn from data without making assumptions”.

The paper on the “No Free Lunch Theorem”, actually called "The Lack of A Priori Distinctions Between Learning Algorithms" is one of these papers that are often cited and rarely read, and I hear many people in the ML community refer to it when supporting the claim that “one model can’t be the best at everything” or “one model won’t always be better than another model”. The point of this post is to convince you that this is not what the paper or theorem says (at least not the one usually cited by Wolpert), and you should not cite this theorem in this context; and also that common versions cited of the "No Free Lunch" Theorem (continued...)
ListenData 2019-06-28 22:46:00

15 ways to read CSV file with pandas

This tutorial explains how to read a CSV file in Python using the read_csv function of the pandas package. Without the read_csv function, importing a CSV file in plain Python is not straightforward. pandas is an awesome, powerful Python package for data manipulation and supports various functions to load and import data from many formats. Here we cover how to deal with common issues when importing a CSV file.
Table of Contents

Install and Load Pandas Package
Make sure you have the pandas package installed on your system. If you set up Python using Anaconda, pandas is included, so you don't need to install it again. Otherwise you can install it with the command pip install pandas. The next step is to load the package by running the following command. pd is an alias for the pandas package; we will use it instead of the full name "pandas".
import pandas as pd
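Once pandas is loaded, reading a CSV file is a one-liner, with optional arguments for the common issues covered later (a minimal sketch using a hypothetical file name):

df = pd.read_csv("sales.csv")                  # basic usage
df = pd.read_csv("sales.csv", sep=";",         # a few common options:
                 usecols=["id", "amount"],     # read only selected columns
                 nrows=1000)                   # read only the first 1000 rows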
Create Sample Data for Import
The program below creates a sample
(continued...)
ListenData 2019-06-25 11:31:00

Matplotlib Tutorial : Learn with Examples in 3 hours

This tutorial outlines how to perform plotting and data visualization in Python using the Matplotlib library. The objective of this post is to get you familiar with both the basic and advanced plotting functions of the library. It contains several examples which will give you hands-on experience in generating plots in Python.
Table of Contents

What is Matplotlib?
It is a powerful Python library for creating graphics and charts. It takes care of all of your basic and advanced plotting requirements in Python. It took inspiration from the MATLAB programming language and provides a similar MATLAB-like interface for graphics. The beauty of this library is that it integrates well with the pandas package used for data manipulation; with the combination of these two libraries, you can easily perform data wrangling along with visualization and get valuable insights out of data. Like the ggplot2 library in R, matplotlib is the grammar of graphics in Python and the most used charting library in Python.
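A minimal plotting sketch to show the flavour of the library (illustrative, not taken from the tutorial):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 6, 3]

plt.plot(x, y, marker="o")        # line chart with point markers
plt.xlabel("x")
plt.ylabel("y")
plt.title("A first matplotlib chart")
plt.show()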
Basics
(continued...)

Write Short Blogposts

I encourage my colleagues to write blogposts more frequently. This is for a few reasons:

  1. It informs your broader community what you’re up to, and allows that community to communicate back to you quickly.

    You communicating to the community fosters a sense of collaboration, openness, and trust. You gain collaborators, build momentum behind your work, and curate a body of knowledge that early adopters can consume to become experts quickly.

    Getting feedback from your community helps you to course-correct early in your work, and stops you from wasting time in inefficient courses of action.

    You can only work for a long time without communicating if you are either entirely confident in what you’re doing, or reckless, or both.

  2. It increases your visibility, and so is good for your career.

    I have a great job. I find my work to be both

(continued...)
Living in an Ivory Basement 2019-06-23 22:00:00

How to encourage participation in teleconferences

(and/or how to run effective teleconferences!)

I participate in a lot of teleconferences, and some of them aren't very participatory, for various reasons. Recently a good friend asked for suggestions on how to open up the phone calls, and I came up with the below ideas. What am I missing? What did I get wrong?


First, post a meeting agenda with a medium amount of detail, well in advance ( > 24 hours).

  • Posting an agenda in advance gives people time to think about things, if they are interested.
  • The medium amount of detail (up to a paragraph) lets people understand what it’s about, see what the major issues/questions are, and think of questions or comments they may have.
  • If the agenda is posted > 24 hours in advance, you can reasonably expect people to have read it, and if people want to add things to the agenda on the call you punt them to the next call instead.

Basically, if you spring a skeleton

(continued...)
ListenData 2019-06-19 13:20:00

How to drop one or multiple columns from Pandas Dataframe

In this tutorial, we will cover how to drop or remove one or multiple columns from pandas dataframe.
What is pandas in Python?
pandas is a python package for data manipulation. It has several functions for the following data tasks:
  1. Drop or Keep rows and columns
  2. Aggregate data by one or more columns
  3. Sort or reorder data
  4. Merge or append multiple dataframes
  5. String Functions to handle text data
  6. DateTime Functions to handle date or time format columns
Import or Load Pandas library
To make use of any Python library, we first need to load it using the import command.
import pandas as pd
import numpy as np
Let's create a fake dataframe for illustration
The code below creates 4 columns named A through D.
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
          A         B         C         D
0 -1.236438 -1.656038
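Dropping columns from that dataframe could then look like this (a minimal sketch of the general approach, assuming a recent pandas version; not the tutorial's exact code):

df.drop(columns=['A'])                 # drop a single column (returns a new dataframe)
df.drop(columns=['A', 'C'])            # drop multiple columns
df.drop(['B'], axis=1, inplace=True)   # drop in place instead of returning a copy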
(continued...)
ListenData 2019-06-09 21:07:00

String Functions in Python with Examples

This tutorial outlines various string (character) functions used in Python. To manipulate strings and character values, Python has several built-in functions, so you don't need to import or depend on any external package to work with the string data type. This is one of the advantages of using Python over other data science tools. Dealing with string values is very common in the real world. Suppose you have customers' full names and your manager asks you to extract each customer's first and last name. Or you want to fetch information on all the products whose code starts with 'QT'.
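For example, the name-splitting task might look like this with built-in string methods (an illustrative sketch, not the article's code):

full_name = "Jane Smith"
first_name, last_name = full_name.split(" ", 1)   # split on the first space only

product_code = "QT1002"
product_code.startswith("QT")    # True
product_code.lower()             # 'qt1002'
len(product_code)                # 6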
Table of Contents

List of frequently used string functions
The table below shows many common string functions along with a description and the equivalent function in MS Excel. Most of us use MS Excel in our workplace and are familiar with its functions, so comparing string functions in MS Excel and Python would help you to learn
(continued...)
Ralf Gommers | Reflections 2019-06-05 00:00:00

The cost of an open source contribution

Open source is massively successful. Some say it’s eating the world, although to my ears that phrasing doesn’t sound entirely like a good thing. Open source maintainers are always in need of help, and over the past years I’ve seen a lot of focus on ways open source projects can grow their communities and gain new contributors. Guidance on how to go about finding new contributors is easily found. E.
Spyder Blog 2019-06-02 00:00:00

TDK-Micronas partners with Quansight to sponsor Spyder

This blogpost was originally published on the Quansight Labs website

TDK-Micronas is sponsoring Spyder development efforts through Quansight Labs. This will enable the development of some features that have been requested by our users, as well as new features that will help TDK develop custom Spyder plugins in order to complement their Automatic Test Equipment (ATE’s) in the development of their Application Specific Integrated Circuits (ASIC’s).

At this point it may be useful to clarify the role of Quansight Labs in Spyder's development and its relationship with TDK. To quote Ralf Gommers (director of Quansight Labs):

"We're an R&D lab for open source development of core technologies around data science and scientific computing in Python. And focused on growing communities around those technologies. That's how I see it for Spyder as well: Quansight Labs enables developers to be employed to work on Spyder, and helps with connecting them to developers of other projects in similar situations. Labs should be an enabler to let the Spyder project, its community and individual developers grow.

(continued...)
I Love Symposia! 2019-05-28 08:41:54

Why citations are not enough for open source software

A few weeks ago I wrote about why you should cite open source tools. Although I think citations are important, there are major problems in relying on them alone to support open source work.

The biggest problem is that papers describing a software library can only give credit to the contributors at the time that the paper was written. The preferred citation for the SciPy library is “Eric Jones, Travis Oliphant, Pearu Peterson, et al”, 2001. The “et al” is not an abbreviation here, but a fixed shorthand for all other contributors. Needless to say many, many people have contributed to the SciPy library since 2001 (GitHub counts 716 contributors as of this writing), and they are unable to get credit within the academic system for those contributions. (As an aside, Google counts about 1,200 citations to SciPy, which is a breathtaking undercounting of its value and influence, and reinforces my earlier point: cite open source software! Definitely don't use this post as an excuse not to cite it!!!)

Not surprisingly, we have had

(continued...)
Spyder Blog 2019-05-20 00:00:00

Spyder 4.0 takes a big step closer with the release of Beta 2!

This blogpost was originally published on the Quansight Labs website

It has been almost two months since I joined Quansight in April, to start working on Spyder maintenance and development. So far, it has been a very exciting and rewarding journey under the guidance of long time Spyder maintainer Carlos Córdoba. This is the first of a series of blog posts we will be writing to showcase updates on the development of Spyder, new planned features and news on the road to Spyder 4.0 and beyond.

First off, I would like to give a warm welcome to Edgar Margffoy, who recently joined Quansight and will be working with the Spyder team to take its development even further. Edgar has been a core Spyder developer for more than two years now, and we are very excited to have his (almost) full-time commitment to the project.

Spyder 4.0 Beta 2 released!

Since August 2018, when the first beta of the 4.x series was released, the Spyder development team has been

(continued...)

The Role of a Maintainer

What are the expectations and best practices for maintainers of open source software libraries? How can we do this better?

This post frames the discussion and then follows with best practices based on my personal experience and opinions. I make no claim that these are correct.

Let us Assume External Responsibility

First, the most common answer to this question is the following:

  • Q: What are expectations on OSS maintainers?
  • A: Nothing at all. They’re volunteers.

However, let’s assume for a moment that these maintainers are paid to maintain the project some modest amount, like 10 hours a week.

How can they best spend this time?

What is a Maintainer?

Next, let’s disambiguate the role of developer, reviewer, and maintainer

  1. Developers fix bugs and create features. They write code and docs and generally are agents of change in a software project. There are often many more developers than reviewers or maintainers.

  2. Reviewers are known

(continued...)
Living in an Ivory Basement 2019-05-14 22:00:00

Using GitHub for janky project reporting - some code

For the NIH Data Commons, we needed a way for 10 distinct teams to do reporting at the level of about 50-100 milestones per team, on a monthly basis.

Each team was already using different project management software internally, and we didn't want to require them to switch to something new. We also didn't need a lot of innate functionality in the project reporting system - basically, for each milestone we needed two statuses, "started" and "finished".

So we decided to go with something lightweight and simple that would support programmatic update and automated reporting: GitHub!

We chose to use GitHub for project reporting for several reasons. We were already using GitHub for content stuff, and everyone had accounts. We were also using GitHub for authentication control on static Web sites via a Heroku app.

So what we did was use the PyGithub package to write a script to take the project milestones (which were all in a spreadsheet) and load them
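A rough sketch of what such a loader script might look like, using PyGithub's create_milestone call (hypothetical repository, token, and column names; not the authors' actual script):

import csv
from github import Github   # the PyGithub package

gh = Github("YOUR_TOKEN")                     # hypothetical access token
repo = gh.get_repo("example-org/reporting")   # hypothetical reporting repository

with open("milestones.csv") as f:
    for row in csv.DictReader(f):
        # create one GitHub milestone per spreadsheet row
        repo.create_milestone(title=row["milestone"],
                              description=row["description"])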

(continued...)
Paul Ivanov’s Journal 2019-05-13 07:00:00

My first DNF (Ft Bragg 600k)

It's been six years since my first ride with The San Francisco Randonneurs and four years since my first 200k. I've ridden 18 rides that are at least that distance since then (3x 300k, 2x 400k, 1x 600k), completing my first Super Randonneur Series (2-, 3-, 4-, and 600k in one year) last year after not riding much the year before that. And this weekend I had my first DNF result on the Fort Bragg 600k. I Did Not Finish.

The best response to my choice of abandoning the ride to enjoy the campground came from Peter Curley, who said "That was a very mature decision." A clear departure from typical randonneuring stubbornness and refusal to give up, I celebrated my decision to quit as a victory when I arrived at the campground and made my announcement to the volunteers. I think I was so energetic about it that they did not believe me. I was being kind to myself, to my body, and at peace with the decision by

(continued...)

Should I Resign from My Full Professor Job to Work Fulltime on Cocalc?

Nearly 3 years ago, I gave a talk at a Harvard mathematics conference announcing that “I am leaving academia to build a company”. What I really did was go on unpaid leave for three years from my tenured Full Professor position. No further extension of that leave is possible, so I finally have to decide whether to go back to academia or to resign.
How did I get here?
Nearly two decades ago, as a recently minted Berkeley math Ph.D., I was hired as a non-tenure-track faculty member in the mathematics department at Harvard. I spent five years at Harvard, then I applied for jobs, and accepted a tenured Associate Professor position in the mathematics department at UC San Diego. The mathematics community was very supportive of my number theory research; I skipped tenure track, and landed a tier-1 tenured position by the time I was 30 years old. In 2006, I moved from UCSD to a tenured Associate Professor position at the University
(continued...)
Paul Ivanov’s Journal 2019-05-03 07:00:00

PyCon2019 poem

I'm back in Cleveland for another PyCon. Yesterday was my first full day here. Along with Matt Seale, I was a helper at Matthias Bussonnier's tutorial ("IPython and Jupyter in Depth: High productivity, interactive Python"). The sticky system is efficient at signaling when someone in a classroom needs help, and a lot of folks don't know that this practice was popularized by Software Carpentry workshops and continues to be used at The Carpentries.

I stepped out for a coffee refill and bumped into a large contingent of Bloomberg folks I'd never met (Princeton office). I guess we have something like 90 people at the conference this year, and I made the usual and true remark about how I go to conferences to meet the other people who work at our company. Then after his tutorial concluded, Matthias and I bumped into Tracy Teal, exchanged some stickers, and chatted about The Carpentries, Jupyter, organizing conferences, governance and sponsorship models, and a bunch of other stuff.

Matthias was a

(continued...)
I Love Symposia! 2019-05-02 02:31:30

Why you should cite open source tools

Every now and then, a moment or a sentence in a conversation sticks out at you, and lodges itself in the back of your brain for months or even years. In this case, the sentence is a tweet, and I fear that the only way to dislodge it is to talk about it publicly.

Last year, I complained on Twitter that a very prominent paper that was getting lots of attention used scikit-image, but failed to cite our paper. (Or the papers corresponding to many other open source packages.) I continued that scientists developing open source software depend on these citations to continue their work. (More on this in another post...) One response was that surely the developers of the open source scientific Python stack were not scientists per se, and that citations were not a priority for them.

I still sigh internally when I think of it.

That tweet manifests a pervasive perception that open source scientific software is written by God-like figures. These massively experienced software developers have easy access to funds

(continued...)
ListenData 2019-04-27 13:52:00

Python Lambda Function with Examples

This article covers a detailed explanation of Python's lambda function. You will learn how to use it in real-world data scenarios with examples.
Table of Contents

Introduction : Lambda Function
In non-technical language, lambda is an alternative way of defining a function. You can define a function inline using lambda, which means you can apply a function to some data in a single line of Python code. It is called an anonymous function because the function can be defined without a name. Lambdas are part of the functional programming style, which focuses on readability of code and avoids changing mutable data.
Syntax of Lambda Function
lambda arguments: expression
A lambda function can have more than one argument, but only one expression. The expression is evaluated and returned. Example:
addition = lambda x,y: x + y
addition(2,3) returns 5
In the above python code, x,y are the arguments and x + y is the expression that gets evaluated and returned.
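In practice, lambdas usually appear as short throwaway functions passed to other functions, for example as a sort key (an illustrative sketch, not from the article):

sales = [("John", 25), ("Deep", 30), ("Julia", 35)]

# sort tuples by their second element, largest first
sales_sorted = sorted(sales, key=lambda pair: pair[1], reverse=True)

# apply a lambda to every element of a list
doubled = list(map(lambda x: x * 2, [1, 2, 3]))   # [2, 4, 6]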
READ MORE »
ListenData 2019-04-20 21:01:00

Loops in Python explained with examples

This tutorial covers various ways to execute loops in Python with several practical examples. After reading this tutorial, you will be familiar with the concept of a loop and will be able to apply loops to real-world data wrangling tasks.

Table of Contents

What is a Loop?
A loop is an important programming concept that exists in almost every programming language (Python, C, R, Visual Basic, etc.). It is used to repeat a particular operation several times until a specific condition is met, and is mainly used to automate repetitive tasks.
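A minimal sketch of the idea in Python (illustrative):

# repeat an operation for every item in a collection
columns = ["age", "income", "balance"]
for col in columns:
    print("analysing column:", col)

# repeat until a condition is met
attempts = 0
while attempts < 5:
    attempts += 1   # stops once five attempts have been made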

Real World Examples of Loop
  1. The software of an ATM machine runs in a loop, processing transaction after transaction until you acknowledge that you have nothing more to do.
  2. The software on a mobile device allows the user 5 password attempts to unlock it. After that, it resets the device.
  3. Putting your favorite song on repeat mode is also a loop.
  4. You want to run a particular analysis on each column of your data
(continued...)
Living in an Ivory Basement 2019-04-15 22:00:00

Some questions and thoughts on journal peer review.

Can I use comments from other people's prior reviews when reviewing a submission to a new journal?

I just had the dispiriting experience of receiving a paper to review from Journal B, that was unchanged from a prior submission to Journal A. The "dispiriting" part of the experience was that the paper was completely unchanged, despite a host of minor and major comments on the paper from all three reviewers for Journal A.

I ended up writing that I was disappointed that the authors had not seen fit to confront the bigger issues in any way, much less correct even the smallest and easiest errors; and then pasted in my previous review. What I wanted to do was paste in the expert reviews from the other two reviewers for Journal A, but I didn't feel like that was OK.

(If I get the paper back with some revisions, I'll reevaluate it in light of the Journal A reviews, too.)

I think the behavior of

(continued...)
ListenData 2019-04-14 15:31:00

Create Dummy Data in Python

This article explains various ways to create dummy or random data in Python for practice. As in R, we can create dummy data frames using the pandas and numpy packages. Many analysts prepare data in MS Excel and later import it into Python to hone their data wrangling skills; this is not an efficient approach. The efficient approach is to prepare random data in Python and use it later for data manipulation.
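For instance, a few lines of numpy and pandas can generate a practice dataset directly (an illustrative sketch; the article's own examples begin below):

import numpy as np
import pandas as pd

np.random.seed(42)   # make the random data reproducible
df = pd.DataFrame({
    "customer_id": np.arange(1, 101),
    "sales": np.random.randint(20, 50, size=100),
    "region": np.random.choice(["North", "South", "East", "West"], size=100),
})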

Table of Contents

1. Enter Data Manually in Editor Window
The first step is to load pandas package and use DataFrame function
import pandas as pd
data = pd.DataFrame({"A" : ["John","Deep","Julia","Kate","Sandy"],
"MonthSales" : [25,30,35,40,45]})
       A  MonthSales
0 John 25
1 Deep 30
2
(continued...)