Since its introduction in 2017 in the paper Attention Is All You Need (https://arxiv.org/abs/1706.03762), the Transformer has quickly become the most popular model in NLP. Its ability to process text non-sequentially (as opposed to RNNs) made it practical to train large models. The attention mechanism it is built on proved extremely useful for generalizing over text.
Following the paper, several popular transformers surfaced, the best known of which is GPT. GPT models are developed and trained by OpenAI, one of the leaders in AI research. The latest release of GPT is GPT-3, which has 175 billion parameters. The model was so advanced that OpenAI chose not to open-source it. People can access it through an API after a signup process and a long queue.
However, GPT-2, their previous release, is open source and available in many deep learning frameworks.
In this exercise, we use Hugging Face and PyTorch to fine-tune a
Generative Adversarial Networks, or GANs, are a type of neural network that belongs to the class of unsupervised learning. They are used for deep generative modeling, in which deep neural networks learn a probability distribution over a given set of data points and generate similar data points. Since it […]
Gradient Boosted Machines and their variants offered by multiple communities have gained a lot of traction in recent years. This has been primarily due to the improvement in performance offered by decision trees as compared to other machine learning algorithms both in products and machine learning competitions. Two of the most popular algorithms that are […]
Weights & Biases, also known as WandB, is an MLOps tool for performance visualization and experimental tracking of machine learning models. It helps with automation, tracking, training, and improvement of ML models. Weights & Biases is a cloud-based service that allows you to host your experiments in a single central repository and if you have […]
When you are working on a machine learning problem, adapting an existing solution and repurposing it can help you get to a solution much faster. Using existing models not only aids machine learning engineers and data scientists but also helps companies save computational costs, as it requires less training. There are many companies that […]
The post Hugging Face Pre-trained Models: Find the Best One for Your Task appeared first on neptune.ai.
ML software development is complex; building an ML model is one thing, improving and maintaining it is another. If you want your machine learning models to be robust, compliant, and give reproducible results, you must invest time and money in quality model management. Model governance, model provenance, and model lineage tools help you in doing […]
The post Best Tools for ML Model Governance, Provenance, and Lineage appeared first on neptune.ai.
Deep learning frameworks, libraries, and tools exist to reduce the large amount of manual computation that would otherwise be required. TensorFlow and PyTorch are currently two of the most popular frameworks for constructing neural network architectures. While TensorFlow was released a year before PyTorch, most developers are tending to shift towards […]
A vision for extensibility to GPU & distributed support for SciPy, scikit-learn, scikit-image and beyond
Over the years, array computing in Python has evolved to support distributed arrays, GPU arrays, and other various kinds of arrays that work with specialized hardware, or carry additional metadata, or use different internal memory representations. The foundational library for array computing in the PyData ecosystem is NumPy. But NumPy alone is a CPU-only library - and a single-threaded one at that - and in a world where it's possible to get a GPU or a CPU with a large core count in the cloud cheaply or even for free in a matter of seconds, that may not seem enough. For the past couple of years, a lot of thought and effort has been spent on devising mechanisms to tackle this problem, and evolve the ecosystem in a gradual way towards a state where PyData libraries can run on a GPU, as well as in distributed mode across multiple GPUs.
We feel like a shared vision has emerged, in bits and pieces. In this post, we aim to articulate that vision and
Careers in training!
Clustering was introduced in 1932 by H.E. Driver and A.L. Kroeber in their paper “Quantitative expression of cultural relationship”. Since then, this technique has taken a big leap and has been used to discover the unknown in a number of application areas, e.g., healthcare. Clustering is a type of unsupervised learning where the references need […]
In this blog post, I'll be talking about my journey at Quansight. I want to share everything I was involved in and accomplished: what issues I faced and, most importantly, what awesome life hacks I learned during this period.
First of all, I'd like to express my gratitude to the whole team for allowing me to be a part of such a great team. My work was mainly focused on providing performance benchmarks for NumPy in realistic situations. The goal was to show the world that NumPy is efficient in handling quasi-real-life situations too.
The primary technical outcome of my work is available in the NumPy documentation.
In recent times, Machine Learning has gained importance due to its ability to guide businesses in making precise and accurate decisions. Under the hood, Machine Learning is an iterative and repetitive process. A series of training jobs is run to optimize a model’s predictive performance. Without the right methods, it is easy to lose track of […]
Join us to work on reinventing data-science practices and tools to produce robust analysis with less data curation.
It is well known that data cleaning and preparation are a heavy burden to the data scientist.
In the dirty data project, we have been conducting machine-learning research …
Delivering the best machine learning model to production should be as easy as training, testing, and deploying — right? Not quite! Models are far from perfect as they move from research to production, and maintaining model performance once in production is even more challenging. Once out of the offline research environment, the data a model consumes […]
The post Arize AI & Neptune AI Partnership: Continuous Monitoring, Continuous Improvements for ML Models appeared first on neptune.ai.
Data forms the foundation of any machine learning algorithm; without it, data science cannot happen. Sometimes, data can contain a huge number of features, some of which are not even required. Such redundant information makes modeling complicated. Furthermore, interpreting and understanding the data through visualization becomes difficult because of the high dimensionality. This is […]
The TorchVision datasets subpackage is a convenient utility for accessing well-known public image and video datasets. You can use these tools to start training new computer vision models very quickly.
TorchVision Datasets Example
To get started, all you have to do is import one of the Dataset classes. Then, instantiate ... Read More
The np.any() function tests whether any element in a NumPy array evaluates to true. The input can have any shape, and the data type does not have to be boolean (as long as it’s truthy). If none of the elements evaluate to true, the function returns false. Passing in a ... Read More
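The behaviour described in this excerpt can be sketched in a few lines. The arrays below are made up for illustration, and the `axis` argument is shown as a standard NumPy feature rather than as something taken from the truncated excerpt.

```python
import numpy as np

# A non-boolean array with one truthy (non-zero) element.
mixed = np.array([[0, 0], [0, 3]])
print(np.any(mixed))                    # True

# No truthy elements at all, so the result is False.
zeros = np.zeros((2, 2))
print(np.any(zeros))                    # False

# With axis=0 the test is applied to each column separately.
print(np.any(mixed, axis=0).tolist())   # [False, True]
```

Without an `axis`, the whole array is reduced to a single boolean regardless of its shape.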
This is Ismaël Koné from Côte d'Ivoire (Ivory Coast). I am a fan of open source software.
In the next lines, I'll try to capture my experience at Quansight Labs as an intern working on the
cuDF implementation of the dataframe interchange protocol.
We'll continue by motivating this project through details about cuDF and the dataframe interchange protocol.
While there are ways to wrap C++ code for Python (see Appendix below), calling these wrappers from Numba-compiled functions is often not as straightforward and efficient as one would hope.
In this blog post I talk about the work that I was able to accomplish during my internship at Quansight Labs and the efforts being made towards making array libraries more interoperable.
Going ahead, I'll assume basic understanding of array and tensor libraries with their usage in the Python Scientific and Data Science software stack.
Master NumPy leading the young Tensor Turtles
In this blog post I talk about the projects and my work during my internship at Quansight Labs. My efforts were geared towards re-engineering CI/CD pipelines for SciPy to make them more efficient to use with GitHub Actions. I also talk about the milestones that I achieved, along with the associated learnings and improvements that I made.
This blog post assumes a basic understanding of CI/CD and GitHub Actions. I will also assume a basic understanding of Python and the SciPy ecosystem.
Re-Engineering CI/CD pipelines for SciPy
PyTorch comes with powerful data loading capabilities out of the box. But with great power comes great responsibility and that makes data loading in PyTorch a fairly advanced topic. One of the best ways to learn advanced topics is to start with the happy path. Then add complexity when you ... Read More
Understanding the np.append() operation and when you might want to use it.
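A minimal sketch of what `np.append()` does, using made-up arrays. The points it illustrates (a new array is returned, inputs are flattened when no `axis` is given) are standard NumPy behaviour and are not taken from the post itself.

```python
import numpy as np

a = np.array([1, 2, 3])

# np.append() returns a new array; the original is left untouched.
b = np.append(a, [4, 5])
print(b.tolist())     # [1, 2, 3, 4, 5]
print(a.tolist())     # [1, 2, 3]

# Without an axis, both inputs are flattened before being joined.
grid = np.array([[1, 2], [3, 4]])
flat = np.append(grid, [[5, 6]])
print(flat.tolist())  # [1, 2, 3, 4, 5, 6]

# With axis=0, the new values are added as an extra row.
rows = np.append(grid, [[5, 6]], axis=0)
print(rows.tolist())  # [[1, 2], [3, 4], [5, 6]]
```

Because every call copies the data, appending inside a long loop tends to be slow; collecting values in a Python list and converting once is often the simpler choice.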
Over the summer, I've been interning at Quansight Labs to develop testing tools for the developers and users of the upcoming Array API standard. I contributed "strategies" to the testing library Hypothesis, which I'm excited to announce are now available in
Check out the primary pull request I made for more background.
This blog post is for anyone who is developing array-consuming methods (think SciPy and scikit-learn) and is new to property-based testing. I demonstrate a typical workflow of testing with Hypothesis whilst writing an array-consuming function that works for all libraries adopting the Array API, catching bugs before your users do.
The work I briefly describe in this blog post is the implementation of the dataframe interchange protocol in Vaex, which I worked on during my three-month period as a Quansight Labs intern.
Connection between dataframe libraries with dataframe protocol
About | What is all that?
Today there are quite a number of different dataframe libraries available in Python. Also, there are quite a number of, for example, plotting libraries. In most cases they accept only the general Pandas dataframe and so the user is quite often made to convert between dataframes in order to be able to use the functionalities of a specific plotting library. It would be extremely cool to be able to use plotting libraries on any kind of dataframe, would it not?
Healthy, inclusive communities are critical to impactful open source projects. A challenge for established projects is that their history and implicit technical debt raise the barrier to contributing to significant portions of the code base. Literacy in a large code base develops over time through incremental contributions, and we'll discuss a format that can help people begin this journey.
At Quansight Labs, we are motivated to provide opportunities for new contributors to experience open source community work regardless of their software literacy. Community workshops are a common format for onboarding, but sometimes the outcome can be less than satisfactory for participants and organizers. In these workshops, there are implicit challenges that must be overcome before contributing to a project's revision history, such as using Git or setting up development environments.
Our goal with the following low-code workshop is to offer a way for folks to join a project's contributors list without the technical overhead. To achieve this we'll discuss a format that relies solely on the GitHub web interface.
In a pandemic, the template joke-starter “x and y walk into a bar” seems like a stretch from my reality. So let’s try this remote version:
Two community members with accessibility knowledge enter a virtual meeting room to talk about JupyterLab. They’ve both updated themselves on GitHub issues ahead of time. They’ve both identified major problems with the interface. They both get ready to express to the rest of the community what is indisputably, one hundred percent for-sure the biggest accessibility blocker in JupyterLab for users. Here it is, the moment of truth!
And they each say totally different things.
With the growth of scikit-learn and the wider PyData ecosystem, we want to recruit in the Inria scikit-learn team for a new role. Departing from our usual focus on excellence in algorithms, statistics, or code, we want to add to the team someone with some technical understanding, but an …
snakemake is awesome
A few years ago, Sebastian contacted me to help with simulations. Great, I like simulation studies, so we started discussing the details. The idea: use an established method, the Lees-Edwards boundary condition, to study colloids under shear.
Careers outside of universities!
Bigger and better!
When you’re building a production machine learning system, reproducibility is a proxy for the effectiveness of your development process. But without locking all your Python dependencies, your builds are not actually repeatable. If you work in a Python project without locking long enough, you will eventually get a broken build ... Read More
The post Poetry for Package Management in Machine Learning Projects appeared first on Sparrow Computing.
If you’re building production ML systems, dev containers are the killer feature of VS Code. Dev containers give you full VS Code functionality inside a Docker container. This lets you unify your dev and production environments if production is a Docker container. But even if you’re not targeting a Docker ... Read More
The post Development containers in VS Code: a quick start guide appeared first on Sparrow Computing.
Databases are now available for GTDB!
Lessons for Geoscientists from the book Real World AI: A Practical Guide for Responsible Machine Learning
In this blog article Enthought Energy Solutions vice president Mason Dykstra looks at the recently published book titled “Real World AI: A Practical Guide for Responsible Machine Learning” in the context of both the technical challenges faced by geoscientists and how to scale.
Author: Mason Dykstra, Ph.D., Vice President, Energy Solutions
In the newly released …
CZI EOSS4 application for sourmash support
Searching all the things!
For the unaware reader, the Journal of Open Source Software (JOSS) is an open-access scientific journal founded in 2016 and aimed at publishing scientific software. A JOSS article in itself is short, and its publication helps recognize the work that went into the software. I share here my point of view on what makes some software tools more ready to be published in JOSS. I do not comment on the size or the relevance for research, which are both documented on JOSS' website.
I love fancy machine learning algorithms as much as anyone. But sometimes, you just need to count things. And Python’s built-in data structures make this really easy. Let’s say we have a list of strings: With a list like this, you might care about a few different counts. What’s the ... Read More
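The kind of counting this excerpt hints at can be sketched with the standard library alone. The list of strings is a made-up stand-in, since the post's own example is not included in the excerpt, and `collections.Counter` is one reasonable choice rather than necessarily the approach the post takes.

```python
from collections import Counter

# Hypothetical sample data standing in for the post's list of strings.
words = ["cat", "dog", "cat", "bird", "dog", "cat"]

counts = Counter(words)

print(len(words))             # 6 items in total
print(len(counts))            # 3 distinct values
print(counts["cat"])          # 3 occurrences of "cat"
print(counts.most_common(1))  # [('cat', 3)]
```

A plain `dict` or `set` would answer some of the same questions; `Counter` simply bundles them in one object.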
The PyTorch sigmoid function is an element-wise operation that squishes any real number into a range between 0 and 1. This is a very common activation function to use as the last layer of binary classifiers (including logistic regression) because it lets you treat model predictions like probabilities that their ... Read More
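The function being applied to each element is the logistic function 1 / (1 + e^-x). The sketch below writes it in plain Python rather than PyTorch so that it runs without torch installed; the `sigmoid` helper and the sample `logits` are illustrative names, not part of the post. In PyTorch, `torch.sigmoid` applies the same function to every element of a tensor.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# Hypothetical raw model outputs (logits) for three examples.
logits = [-4.0, 0.0, 4.0]
probs = [sigmoid(v) for v in logits]

print(probs[1])  # 0.5
print(all(0.0 < p < 1.0 for p in probs))  # True
```

An input of zero maps to exactly 0.5, which is why 0.5 is the usual decision threshold for a binary classifier.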
While the most common accelerated methods like Polyak and Nesterov incorporate a momentum term, a little-known fact is that simple gradient descent (no momentum) can achieve the same rate through only a well-chosen sequence of step-sizes. In this post we'll derive this method and through simulations discuss its practical …
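As a rough illustration of the idea, the sketch below compares a constant step-size with a step-size sequence built from Chebyshev nodes on a small quadratic problem. Several things here are assumptions rather than content from the post: that the schedule it derives is the Chebyshev one, the restriction to a diagonal quadratic, and all the names and numbers used.

```python
import math

def chebyshev_steps(mu, L, n):
    """Step-sizes given by the inverses of the n Chebyshev nodes on [mu, L]."""
    center, radius = (L + mu) / 2, (L - mu) / 2
    return [
        1.0 / (center + radius * math.cos((2 * t + 1) * math.pi / (2 * n)))
        for t in range(n)
    ]

def gradient_descent(eigs, x0, steps):
    """Plain gradient descent on f(x) = 0.5 * sum(eig_i * x_i**2)."""
    x = list(x0)
    for h in steps:
        # The gradient of f is eig_i * x_i in each coordinate.
        x = [xi - h * lam * xi for lam, xi in zip(eigs, x)]
    return x

def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

# Hypothetical problem: curvatures between mu = 1 and L = 100.
eigs = [1.0, 10.0, 50.0, 100.0]
x0 = [1.0, 1.0, 1.0, 1.0]
n = 20
mu, L = min(eigs), max(eigs)

err_constant = norm(gradient_descent(eigs, x0, [2 / (L + mu)] * n))
err_chebyshev = norm(gradient_descent(eigs, x0, chebyshev_steps(mu, L, n)))

print(err_constant, err_chebyshev)
```

Both runs use the same number of gradient evaluations and no momentum term; only the step-sizes differ. The minimizer is the origin, so the norm of the final iterate is the error.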
NumFOCUS is pleased to announce our new partnership with Tesco Technology. A long-time PyData event sponsor, Tesco Technology joined NumFOCUS as a Silver Corporate Sponsor in December 2020. “We are very excited to formalize our partnership with Tesco Technology,” said Leah Silen, NumFOCUS Executive Director. “Tesco Technology has partnered with NumFOCUS for the past several […]
The post NumFOCUS Welcomes Tesco Technology to Corporate Sponsors appeared first on NumFOCUS.
Job Title: Communications and Marketing Manager Position Overview The primary role of the Communications & Marketing Manager is to manage the NumFOCUS brand by overseeing all outgoing communications between NumFOCUS and our stakeholders. You will serve the project communities by playing a key role in their event marketing management and assist with project promotional and […]
The post Job Posting | Communications and Marketing Manager appeared first on NumFOCUS.
You can easily convert a NumPy array to a PyTorch tensor and a PyTorch tensor to a NumPy array. This post explains how it works.
TorchVision, a PyTorch computer vision package, has a great API for image pre-processing in its torchvision.transforms module. This post gives some basic usage examples, describes the API and shows you how to create and use custom image transforms.
The post TorchVision Transforms: Image Preprocessing in PyTorch appeared first on Sparrow Computing.
sourmash v4.0.0 is here!
I've seen things you people wouldn't believe.
Valleys sculpted by trigonometric functions.
Rates on fire off the shoulder of divergence.
Beams glitter in the dark near the Polyak gate.
All those landscapes will be lost in time, like tears in rain.
Time to halt.
A momentum optimizer *
After I left Quantopian in 2020, something interesting happened: various companies contacted me inquiring about consulting to help them with their PyMC3 models.
MicroPython is an implementation of the Python 3 programming language, optimized to run on microcontrollers. It's one of the options available for programming your Raspberry Pi Pico and a nice friendly way to get started with microcontrollers.
MicroPython can be installed easily on your Pico, by following the instructions on the …
sourmash v4.0.0 is coming!
Job Title: Events and Digital Marketing Coordinator Position Overview The primary role of the Events and Digital Marketing Coordinator is to support and assist the Events Manager and the Community Communications and Marketing Manager to advance one of NumFOCUS’s primary missions of educating and building the community of users and developers of open source scientific […]
The post Job Posting | Events and Digital Marketing Coordinator appeared first on NumFOCUS.
Updating old Python packages, in this year of the PSF 2021!
The SAM Coupé was a British 8-bit home computer that was pitched as a successor to the ZX Spectrum, featuring improved graphics and sound and higher processor speed.
The SAM Coupé's high-color MODE4 could manage 256x192 resolution graphics, with 16 colors from a choice of 128. Each pixel can …