One of the core concepts in DevOps that is now making its way to machine learning operations (MLOps) is CI/CD—Continuous Integration and Continuous Delivery or Continuous Deployment. CI/CD as a core DevOps practice embraces tools and methods to deliver software applications reliably by streamlining the building, testing, and deployment of your applications to production. Let’s […]
The post 5 Ways Machine Learning Teams Use CI/CD in Production appeared first on neptune.ai.
This is a companion post to the official release of IPython 8.0 that describes what we learned from this large new major IPython release. We hope it will help you apply best practices and have an easier time maintaining your projects or helping others. We'll focus on the patterns that made it easier for us to make IPython 8.0 what it is with minimal time involved.
Boosting algorithms have become some of the most powerful algorithms for training on structured (tabular) data. The three most famous boosting algorithm implementations that have provided various recipes for winning ML competitions are CatBoost, XGBoost, and LightGBM. In this article, we will primarily focus on CatBoost, how it fares against other algorithms and when you should choose it over […]
The post When to Choose CatBoost Over XGBoost or LightGBM [Practical Guide] appeared first on neptune.ai.
When an optimization problem has multiple global minima, different algorithms can find different solutions, a phenomenon often referred to as the implicit bias of optimization algorithms. In this post we'll characterize the implicit bias of gradient-based methods on a class of regression problems that includes linear least squares and Huber …
Assuming we subscribe to a linear understanding of time and causality, as Dr. Sheldon Cooper says, then representing historical events as a series of values and features observed over time provides the foundations for learning from the past. However, time series are somewhat different from other datasets, including sequential data like text or DNA sequences. […]
The post ARIMA vs Prophet vs LSTM for Time Series Prediction appeared first on neptune.ai.
Code and data are the foundations of an AI system. Both of these components play an important role in the development of a robust model, but which one should you focus on more? In this article, we’ll go through the data-centric and model-centric approaches and see which one is better. We will also talk about […]
The post Data-Centric Approach vs Model-Centric Approach in Machine Learning appeared first on neptune.ai.
Deploying machine learning models is hard! If you don’t believe me, ask any ML engineer or data team that has been asked to put their models into production. To further back up this claim, Algorithmia’s “2021 State of Enterprise ML” reports that the time required for organizations to deploy a machine learning model is increasing, […]
The post Model Deployment Challenges: 6 Lessons From 6 ML Engineers appeared first on neptune.ai.
Application containers can be created, deployed, and run using Docker. A container is just a packaged bundle of application code and the libraries and other dependencies it needs to run. Once executed, a Docker image turns into a container that holds all the components required to run an application. However, what’s the […]
The post Best Practices When Working With Docker for Machine Learning appeared first on neptune.ai.
This tutorial explores image classification in PyTorch using state-of-the-art computer vision models. The dataset used in this tutorial will have 3 classes that are very imbalanced. So, we will explore augmentation as a solution to the imbalance problem.
import os
import random

import numpy as np
import pandas as pd
from PIL import Image
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler
from torchvision import datasets, models
from torchvision import transforms

import matplotlib.pyplot as plt
Setting the device to make use of the GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device
Identifying the data paths.
data_dir = "images/"
labels_file = "images_labeled.csv"
Since the labels are in a CSV file, we use
We all want our models to generalize well so that they remain at their peak performance on any kind of dataset. To meet such demands, we often rely on cross-validation in our machine learning projects, a resampling procedure used to evaluate machine learning models on limited data samples. It could be a nightmare to realize […]
The post 7 Cross-Validation Mistakes That Can Cost You a Lot [Best Practices in ML] appeared first on neptune.ai.
This is the first of a series of blog posts on short and beautiful proofs in optimization (let me know what you think in the comments!). For this first post in the series I'll show that stochastic gradient descent (SGD) converges exponentially fast to a neighborhood of the solution.
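As a rough numerical illustration of that claim (not the proof from the post), the sketch below runs constant-step SGD on a scalar least-squares problem. The synthetic data, the step size, and the helper name `sgd_least_squares` are all assumptions made for this example. The distance to the minimizer shrinks quickly at first and then hovers at a small noise level instead of going to zero.

```python
import random


def sgd_least_squares(xs, ys, step, n_iters, seed=0):
    """Constant-step SGD on f(w) = mean_i 0.5 * (w * x_i - y_i)**2 for scalar w."""
    rng = random.Random(seed)
    w = 0.0
    history = [w]
    for _ in range(n_iters):
        i = rng.randrange(len(xs))          # sample one data point uniformly
        grad = (w * xs[i] - ys[i]) * xs[i]  # gradient of the sampled term
        w -= step * grad
        history.append(w)
    return history


# Hypothetical synthetic data: y = 3x + small Gaussian noise.
data_rng = random.Random(42)
xs = [data_rng.uniform(0.5, 1.5) for _ in range(200)]
ys = [3.0 * x + data_rng.gauss(0.0, 0.1) for x in xs]

# Closed-form least-squares minimizer, used only to measure the error.
w_star = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

history = sgd_least_squares(xs, ys, step=0.1, n_iters=500)
errors = [abs(w - w_star) for w in history]
```

Plotting `errors` on a log scale would show the two regimes: a fast, roughly linear drop followed by a plateau whose height depends on the step size and the noise.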
Working with time series data? Here’s a guide for you. In this article, you will learn how to compare and select time series models based on predictive performance. In the first part, you will be introduced to numerous models for time series. This part is divided into three sections: classical time series models, supervised models, […]
The post How to Select a Model For Your Time Series Prediction Task [Guide] appeared first on neptune.ai.
A framing for open source is that the software and code are kernels of community. The code, and its abstractions, unite developers and their patrons; a struggle for growing/evolving open communities is to make sure these groups remain connected. A lot of us showed up for the code, but hung around for the community. We'll continue this post talking about the monthly Jupyter community calls, and how they help all jovyans, Project Jupyter's pet name for their developers and users, stay connected.
Why do you have to know more about model registry? If you were once the only data scientist on your team you can probably relate to this: you start working on a machine learning project and perform a series of experiments that produce various models (and artifacts) that you “track” through non-standard naming conventions. Since […]
The post ML Model Registry: What It Is, Why It Matters, How to Implement It appeared first on neptune.ai.
Generative Adversarial Networks, or GANs, are a type of neural network that belongs to the class of unsupervised learning methods. They are used for deep generative modeling, in which deep neural networks learn a probability distribution over a given set of data points and generate similar data points. Since it […]
A vision for extensibility to GPU & distributed support for SciPy, scikit-learn, scikit-image and beyond
Over the years, array computing in Python has evolved to support distributed arrays, GPU arrays, and other various kinds of arrays that work with specialized hardware, or carry additional metadata, or use different internal memory representations. The foundational library for array computing in the PyData ecosystem is NumPy. But NumPy alone is a CPU-only library - and a single-threaded one at that - and in a world where it's possible to get a GPU or a CPU with a large core count in the cloud cheaply or even for free in a matter of seconds, that may not seem enough. For the past couple of years, a lot of thought and effort has been spent on devising mechanisms to tackle this problem, and evolve the ecosystem in a gradual way towards a state where PyData libraries can run on a GPU, as well as in distributed mode across multiple GPUs.
We feel like a shared vision has emerged, in bits and pieces. In this post, we aim to articulate that vision and
Careers in training!
In this blog post, I'll be talking about my journey at Quansight. I want to share everything I was involved in and accomplished, the issues I faced, and, most importantly, the awesome life hacks I learned during this period.
First of all, I'd like to express my gratitude to the whole team for allowing me to be a part of such a great team. My work was mainly focused on providing performance benchmarks for NumPy in realistic situations. The target was to show the world that NumPy is efficient in handling quasi-real-life situations too.
The primary technical outcome of my work is available in the NumPy documentation.
Join us to work on reinventing data-science practices and tools to produce robust analysis with less data curation.
It is well known that data cleaning and preparation are a heavy burden to the data scientist.
In the dirty data project, we have been conducting machine-learning research …
The TorchVision datasets subpackage is a convenient utility for accessing well-known public image and video datasets. You can use these tools to start training new computer vision models very quickly.
TorchVision Datasets Example
To get started, all you have to do is import one of the Dataset classes. Then, instantiate ...
The np.any() function tests whether any element in a NumPy array evaluates to true. The input can have any shape, and the data type does not have to be boolean (as long as it’s truthy). If none of the elements evaluate to true, the function returns False. Passing in a ...
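A minimal sketch of that behavior, using made-up arrays. The `axis` argument shown at the end is a standard `np.any` parameter, but including it here is an assumption about what a reader would want next, not something taken from the excerpt.

```python
import numpy as np

# Hypothetical integer array: non-boolean input is fine, nonzero counts as true.
flags = np.array([[0, 0], [0, 3]])

has_truthy = np.any(flags)            # at least one element is nonzero
all_zero = np.any(np.zeros((2, 3)))   # no element is truthy

# Reduce along axis 0 to get one answer per column instead of one overall.
per_column = np.any(flags, axis=0)
```

Without `axis`, the whole array collapses to a single boolean. With `axis`, the result is a boolean array with that dimension removed.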
This is Ismaël Koné from Côte d'Ivoire (Ivory Coast). I am a fan of open source software.
In the next lines, I'll try to capture my experience at Quansight Labs as an intern working on the cuDF implementation of the dataframe interchange protocol.
We'll continue by motivating this project through details about cuDF and the dataframe interchange protocol.
While there are ways to wrap C++ code for use from Python (see Appendix below), calling these wrappers from Numba-compiled functions is often not as straightforward or efficient as one would hope.
In this blog post I talk about the work that I was able to accomplish during my internship at Quansight Labs and the efforts being made towards making array libraries more interoperable.
Going ahead, I'll assume basic understanding of array and tensor libraries with their usage in the Python Scientific and Data Science software stack.
Master NumPy leading the young Tensor Turtles
In this blog post I talk about the projects and my work during my internship at Quansight Labs. My efforts were geared towards re-engineering CI/CD pipelines for SciPy to make them more efficient to use with GitHub Actions. I also talk about the milestones that I achieved, along with the associated learnings and improvements that I made.
This blog post assumes a basic understanding of CI/CD and GitHub Actions. I will also assume a basic understanding of Python and the SciPy ecosystem.
Re-Engineering CI/CD pipelines for SciPy
PyTorch comes with powerful data loading capabilities out of the box. But with great power comes great responsibility and that makes data loading in PyTorch a fairly advanced topic. One of the best ways to learn advanced topics is to start with the happy path. Then add complexity when you ...
Understanding the np.append() operation and when you might want to use it.
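A short sketch of what the operation does, with made-up arrays. `np.append` returns a new array and does not modify its input. Without `axis` it flattens both arguments first.

```python
import numpy as np

# Hypothetical 2x2 array used only for illustration.
a = np.array([[1, 2], [3, 4]])

# No axis given: both inputs are flattened before being joined.
flat = np.append(a, [5, 6])

# With axis=0 the appended values must match the other dimensions of `a`.
rows = np.append(a, [[5, 6]], axis=0)
```

Because every call copies the data, appending inside a loop gets expensive. Collecting values in a Python list and converting once at the end is usually the cheaper pattern.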
Over the summer, I've been interning at Quansight Labs to develop testing tools for the developers and users of the upcoming Array API standard. I contributed "strategies" to the testing library Hypothesis, which I'm excited to announce are now available in
Check out the primary pull request I made for more background.
This blog post is for anyone who develops array-consuming methods (think SciPy and scikit-learn) and is new to property-based testing. I demonstrate a typical workflow of testing with Hypothesis whilst writing an array-consuming function that works for all libraries adopting the Array API, catching bugs before your users do.
In this blog post I briefly describe my work implementing the dataframe interchange protocol in Vaex, which I carried out over a three-month period as a Quansight Labs intern.
Connection between dataframe libraries with dataframe protocol
About | What is all that?
Today there are quite a number of different dataframe libraries available in Python. Also, there are quite a number of, for example, plotting libraries. In most cases they accept only the general Pandas dataframe and so the user is quite often made to convert between dataframes in order to be able to use the functionalities of a specific plotting library. It would be extremely cool to be able to use plotting libraries on any kind of dataframe, would it not?
With the growth of scikit-learn and the wider PyData ecosystem, we want to recruit in the Inria scikit-learn team for a new role. Departing from our usual focus on excellence in algorithms, statistics, or code, we want to add to the team someone with some technical understanding, but an …
snakemake is awesome
A few years ago, Sebastian contacted me to help with simulations. Great, I like simulation studies, so we started discussing the details. The idea: use an established method, the Lees-Edwards boundary condition, to study colloids under shear.
Careers outside of universities!
Bigger and better!
When you’re building a production machine learning system, reproducibility is a proxy for the effectiveness of your development process. But without locking all your Python dependencies, your builds are not actually repeatable. If you work in a Python project without locking long enough, you will eventually get a broken build ...
The post Poetry for Package Management in Machine Learning Projects appeared first on Sparrow Computing.
If you’re building production ML systems, dev containers are the killer feature of VS Code. Dev containers give you full VS Code functionality inside a Docker container. This lets you unify your dev and production environments if production is a Docker container. But even if you’re not targeting a Docker ...
The post Development containers in VS Code: a quick start guide appeared first on Sparrow Computing.
Databases are now available for GTDB!
Lessons for Geoscientists from the book Real World AI: A Practical Guide for Responsible Machine Learning
In this blog article, Enthought Energy Solutions vice president Mason Dykstra looks at the recently published book “Real World AI: A Practical Guide for Responsible Machine Learning” in the context of both the technical challenges faced by geoscientists and how to scale.
Author: Mason Dykstra, Ph.D., Vice President, Energy Solutions
In the newly released …
CZI EOSS4 application for sourmash support
Searching all the things!
For the unaware reader, the Journal of Open Source Software (JOSS) is an open-access scientific journal founded in 2016 and aimed at publishing scientific software. A JOSS article in itself is short, and its publication helps recognize the work on the software. I share here my point of view on what makes some software tools more ready to be published in JOSS. I do not comment on the size or the relevance for research, which are both documented on JOSS' website.
I love fancy machine learning algorithms as much as anyone. But sometimes, you just need to count things. And Python’s built-in data structures make this really easy. Let’s say we have a list of strings. With a list like this, you might care about a few different counts. What’s the ...
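One way to get the usual counts is sketched below with the standard library's `collections.Counter`. The list of strings is made up for the example, and the excerpt does not say which data structure the original post uses, so treat `Counter` as an assumption.

```python
from collections import Counter

# Hypothetical list of strings.
words = ["cat", "dog", "cat", "bird", "dog", "cat"]

counts = Counter(words)               # maps each distinct string to its frequency

n_items = len(words)                  # total number of items
n_unique = len(counts)                # number of distinct items
most_common = counts.most_common(1)   # list of (item, count) pairs, most frequent first
```

`Counter` is a `dict` subclass, so `counts["cat"]` works like a normal lookup, and a missing key returns 0 instead of raising `KeyError`.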
The PyTorch sigmoid function is an element-wise operation that squishes any real number into a range between 0 and 1. This is a very common activation function to use as the last layer of binary classifiers (including logistic regression) because it lets you treat model predictions like probabilities that their ...
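To make the squashing concrete, here is the underlying formula, sigmoid(x) = 1 / (1 + exp(-x)), written in plain Python, not PyTorch. This is a stand-in sketch: in PyTorch the same element-wise operation is available as `torch.sigmoid`, and the input values below are made up.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-x)), arranged to avoid overflow in exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x is negative here, so exp(x) cannot overflow
    return z / (1.0 + z)


# Hypothetical raw model outputs (logits) for three examples.
logits = [-4.0, 0.0, 4.0]
probs = [sigmoid(v) for v in logits]
```

A logit of zero maps to exactly one half, and logits of equal magnitude but opposite sign map to values that sum to one, which is why the outputs can be read as probabilities of the positive class.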
While the most common accelerated methods like Polyak and Nesterov incorporate a momentum term, a little-known fact is that simple gradient descent (no momentum) can achieve the same rate using only a well-chosen sequence of step-sizes. In this post we'll derive this method and through simulations discuss its practical …
NumFOCUS is pleased to announce our new partnership with Tesco Technology. A long-time PyData event sponsor, Tesco Technology joined NumFOCUS as a Silver Corporate Sponsor in December 2020. “We are very excited to formalize our partnership with Tesco Technology,” said Leah Silen, NumFOCUS Executive Director. “Tesco Technology has partnered with NumFOCUS for the past several […]
The post NumFOCUS Welcomes Tesco Technology to Corporate Sponsors appeared first on NumFOCUS.
Job Title: Communications and Marketing Manager
Position Overview
The primary role of the Communications & Marketing Manager is to manage the NumFOCUS brand by overseeing all outgoing communications between NumFOCUS and our stakeholders. You will serve the project communities by playing a key role in their event marketing management and assist with project promotional and […]
The post Job Posting | Communications and Marketing Manager appeared first on NumFOCUS.
You can easily convert a NumPy array to a PyTorch tensor and a PyTorch tensor to a NumPy array. This post explains how it works.
TorchVision, a PyTorch computer vision package, has a great API for image pre-processing in its torchvision.transforms module. This post gives some basic usage examples, describes the API and shows you how to create and use custom image transforms.
The post TorchVision Transforms: Image Preprocessing in PyTorch appeared first on Sparrow Computing.
sourmash v4.0.0 is here!
I've seen things you people wouldn't believe.
Valleys sculpted by trigonometric functions.
Rates on fire off the shoulder of divergence.
Beams glitter in the dark near the Polyak gate.
All those landscapes will be lost in time, like tears in rain.
Time to halt.
A momentum optimizer *
After I left Quantopian in 2020, something interesting happened: various companies contacted me inquiring about consulting to help them with their PyMC3 models.
MicroPython is an implementation of the Python 3 programming language, optimized to run on microcontrollers. It's one of the options available for programming your Raspberry Pi Pico and a nice, friendly way to get started with microcontrollers.
MicroPython can be installed easily on your Pico, by following the instructions on the …
sourmash v4.0.0 is coming!
Job Title: Events and Digital Marketing Coordinator
Position Overview
The primary role of the Events and Digital Marketing Coordinator is to support and assist the Events Manager and the Community Communications and Marketing Manager to advance one of NumFOCUS’s primary missions of educating and building the community of users and developers of open source scientific […]
The post Job Posting | Events and Digital Marketing Coordinator appeared first on NumFOCUS.
Updating old Python packages, in this year of the PSF 2021!