Array Libraries Interoperability
In this blog post I talk about the work that I was able to accomplish during my internship at Quansight Labs and the efforts being made towards making array libraries more interoperable.
Going ahead, I'll assume a basic understanding of array and tensor libraries and their usage in the Python scientific and data science software stack.
Figure: Master NumPy leading the young Tensor Turtles.
Array Provider & Consumer Libraries
As someone who's been involved in the Data Science domain for more than four years now, I can't think of any of my projects without using an array/tensor library. Over time, NumPy (more than 15 years old today) has become a blind import before starting any Python project involving manipulation of arrays.
With the recent advancements in and around the field of Machine Learning, Deep Learning techniques have gained a lot of popularity in the AI community. More and more libraries have been developed to extend NumPy's concept of N-dimensional arrays to new domains of numerical computing. These extensions come in the form of multi-device support (CPU, GPU, XLA, etc.) and autograd engines for the array/tensor. One such framework is PyTorch, and it has been my go-to for research in machine learning during the last couple of years, simply because of this extra saccharinity, along with the machine learning specific modules (torch.nn, torch.optim, etc.) it has to offer. A few other eminent array/tensor libraries include CuPy, TensorFlow, MXNet, and JAX.
Henceforth, let's just call all of them "array provider" libraries.
Libraries like SciPy, scikit-learn, einops, etc. are built on top of these "array providers", and they integrate specific features through their own methods, which conventionally ingest arrays from the array provider libraries as arguments. We'll refer to this set of libraries as "array consumer" libraries.
What's the catch?
The pain now is that an extremely stable and popular consumer library like SciPy is cut off from the sweet features that the new array/tensor frameworks have to offer. A user might prefer SciPy's domain-specific features and stable API for their scientific stack, but because their underlying project is built on top of, say, PyTorch, they are left alone at sea.
Say we wanted to use a provider library other than NumPy, but with SciPy (or a similar library). After some thought, one may end up choosing either of the following options:
- Somehow convert their arrays/tensors into NumPy arrays before feeding them into the SciPy function, which is where you lose the "extra sweetness".
- Find some other consumer library (if it exists) mimicking SciPy's API, but built on top of the "array provider" library that was used in the initial project (say PyTorch). This way the user enjoys the extra features, but this comes at the cost of learning and getting used to a new library API, which might mimic SciPy's functionality but not behave exactly the same.
Also, the APIs (function signatures) of these "array providers" can differ, which is a problem in itself. Imagine a scientist with around ten years of NumPy experience: when they move into the PyTorch ecosystem, they are still doing a lot of similar computations, but now they need to learn a new API.
ELI5 Caption: This is a page from the array libraries interoperability special edition chapter of comic array wonderland. SciPy only wants to play with NumPy and refuses to play with other libraries like PyTorch, CuPy, TensorFlow and JAX. All of them are sad, and then they decide to make their own new friends.
Serious Caption: Other than NumPy, array/tensor libraries like PyTorch, CuPy, TensorFlow and JAX aren't compatible with consumer libraries like SciPy. This led to developers creating an ecosystem of libraries around each array/tensor framework, conceptually the same but differing in the array/tensor framework they are built for.
Life is hard, ain't it! But as Rocky Balboa said:
"NumPy, CuPy, PyTorch or SciPy is not gonna hit as hard as all of them when used together. But it ain't about finding a way to use them individually; it's about making them work together. That's how Data Science is done." ~ (actual quote :P)
OK sorry, that's just me using Rocky's motivational lines to make a point. To define the issue more concretely, the question is: can we do something like the following?!
import torch
... # Do something with PyTorch
x = torch.rand(3000) # End up with some torch tensor
# Estimate power spectral density using Welch’s method.
# SciPy offers a method `welch` for doing exactly that.
from scipy.signal import welch
f, Pxx = welch(x, fs=10e3, nperseg=1024)
Our world would be a better place if we could pass any kind of "array provider" object to these consumer libraries and let them do their magic, finally spitting out the same "array provider" type as the input. Note that all the arguments should be of the same kind.
What if there were a way? What if there were existing ways to achieve this? Not kidding, it is possible, thanks to the recent efforts made in this direction.
Protocols for Interoperable Behaviour
Now that the problem and motivation is clear, let's dive into the technicalities involved in array libraries interoperability and understand the protocols making this a reality.
Enter NEP 18 (the __array_function__ protocol), which was one of the first efforts to address the issue of interoperability in NumPy. In a nutshell, NEP 18 allows arguments of NumPy functions to define how that function operates on them. This enables using NumPy as a high-level API for efficient multi-dimensional array operations, even with array implementations that differ greatly from numpy.ndarray.
I suggest reading the NEP 18 itself for a detailed understanding, but I'll try to expound the motivation with a simple example taken from an insightful talk by Ralf Gommers at PyData Amsterdam 2019.
import numpy as np
import cupy

def some_func(x):
    return np.log(np.sin(x) + 5)

x = np.random.random((100000, 100000))
some_func(x)  # Runs on CPU, might be slow

# Now we can use some_func with CuPy arrays, which are designed
# to work on NVIDIA GPUs and are faster at parallelized matrix
# operations.
x_cupy = cupy.array(x)

# NEP 18 enables this possibility
some_func(x_cupy)  # Runs on GPU, orders of magnitude faster
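To make the dispatch more concrete, here is a minimal sketch of the hook NEP 18 standardizes. The MyArray class below is made up purely for illustration; real libraries like CuPy implement __array_function__ across their whole API surface.

import numpy as np

class MyArray:
    # A toy "duck array" for illustration only; not a real library.
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this hook instead of its own implementation
        # whenever a non-ufunc NumPy function receives a MyArray.
        if func is np.concatenate:
            arrays = [a.data for a in args[0]]
            return MyArray(np.concatenate(arrays, **kwargs))
        return NotImplemented  # signal: function not supported here

a, b = MyArray([1, 2]), MyArray([3, 4])
c = np.concatenate([a, b])  # dispatches to MyArray, returns a MyArray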
Since NEP 18, there have been a few other protocols, like NEP 30, NEP 35, and NEP 37, endeavouring to address some of the issues and shortcomings with NEP 18. Note that most of these NEPs were never accepted or implemented.
For the sake of brevity in this blog, we'll limit our focus to NEP 31 (uarray) and NEP 47 (the Array API, __array_namespace__). These are some of the most recent protocols, with the goal of ameliorating interoperability shortfalls.
Array API or NEP 47
Before I start describing the Python Array API in the context of this blog, I urge you to read the two official blog posts from the Consortium for Python Data API Standards, which describe the API and give a high-level overview of its purpose.
Let's see how the use of this Array API might look in practice.
import torch as xp
# or
import numpy.array_api as xp
a = xp.arange(3)
b = xp.ones(3)
c = a + b
Probably the only changes needed for a NumPy end user to support PyTorch would be to update the import statements and refactor np.* to xp.*. Here xp represents any array provider library compliant with the Array API. Doing this is extremely easy compared to some other array interoperability protocols, which take a much more convoluted approach.
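For instance, a consumer-side function written purely against the xp namespace runs unchanged on any compliant library. Below is a minimal sketch; the standardize helper and the explicit xp argument are my own illustrative choices, relying only on mean and std from the spec's statistical functions.

# A library-agnostic helper: `xp` can be any Array API
# compliant namespace (numpy.array_api, torch, ...).
def standardize(x, xp):
    return (x - xp.mean(x)) / xp.std(x)

import numpy.array_api as xp  # experimental module, NumPy 1.22+ at the time of writing
print(standardize(xp.asarray([1.0, 2.0, 3.0]), xp))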
The Array API spec mentions a couple of concrete use cases:
- Use case 1: add hardware accelerator and distributed support to SciPy
- Use case 2: simplify einops by removing the backend system
With the introduction of the Array API, and compliance with it across all the major array/tensor libraries, consumer libraries will be able to support more than one "array provider" and become truly interoperable.
Since we started by talking about consumer libraries like SciPy, let's continue with the same example. We've built a demo around Array API showcasing the use of PyTorch Tensors with SciPy.
The Array API Demo walks you through the details and processes involved to make an array consumer library like SciPy more interoperable with array provider libraries. The demo is built keeping two different perspectives in mind: an end user, and an open-source developer/maintainer looking to incorporate the Array API within their array consumer library.
The demo showcases the 2017 Nobel-prize-winning work in Physics on LIGO-Virgo detector noise and the extraction of transient gravitational-wave signals. The original tutorial (Colab link) was built using NumPy, SciPy, and Matplotlib. But instead of using NumPy arrays, we switch to a PyTorch-based implementation with minimal changes to the codebase.
Let's dive into the exact formulation and Python code that allows this behaviour.
get_namespace
The demo shows the implementation of a dummy get_namespace method, which is the first function to be called inside any SciPy method. One can see how it works below: it simply returns the namespace, which can be used later for calling any Array API specified methods. See the csd toy example function below to understand get_namespace usage.
def get_namespace(*xs):
    # `xs` contains one or more arrays, or possibly Python scalars (accepting
    # those is a matter of taste, but doesn't seem unreasonable).
    namespaces = {
        x.__array_namespace__() if hasattr(x, '__array_namespace__') else None
        for x in xs if not isinstance(x, (bool, int, float, complex))
    }
    if not namespaces:
        # One could special-case np.ndarray above or use np.asarray here if
        # older NumPy versions need to be supported.
        # This can happen when lists are sent as input, e.g. some
        # SciPy functions coerce lists into ndarrays internally.
        raise ValueError("Unrecognized array input")
    if len(namespaces) != 1:
        raise ValueError(f"Multiple namespaces for array inputs: {namespaces}")
    xp, = namespaces
    if xp is None:
        raise ValueError("The input is not a supported array type")
    return xp
# Representative example of how to use get_namespace
def csd(x, y, fs=1.0, window='hann', nperseg=None, noverlap=None, nfft=None,
        detrend='constant', return_onesided=True, scaling='density', axis=-1,
        average='mean'):
    # Array API: retrieve the namespace of the input arrays
    xp = get_namespace(x, y)
    # Library-independent code follows, calling only
    # Array API specified methods
    some_internal_calculation = xp.some_func(x)
    ...
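As a quick illustration of get_namespace itself, here is how it behaves with NumPy's experimental numpy.array_api module, whose arrays already expose __array_namespace__ (the variable names are mine):

import numpy.array_api as np_api  # experimental module, NumPy 1.22+ at the time of writing

x = np_api.asarray([1.0, 2.0])
y = np_api.asarray([3.0, 4.0])
xp = get_namespace(x, y)  # returns the numpy.array_api namespace
z = xp.concat([x, y])     # call Array API functions through `xp`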
With the Array API, this should become possible for the end user in a future release of SciPy and PyTorch.
We made SciPy plus PyTorch work for this decidedly nontrivial use case, which demonstrates the feasibility and power of the Array API. I'd encourage you to read more details about the demo on the tutorial webpage itself.
Given that the Array API standard is still under development, there are some issues we ran into, namely:
- A lack of Array API standards on complex numbers in the upcoming 2021 version (v1.0) of the specification compelled us to special-case such instances within the SciPy codebase for PyTorch and NumPy separately.
- In its current state, PyTorch's lack of complete Array API compliance is another small issue. One example of PyTorch diverging from NumPy and the Array API is .numel() vs .size() (see the snippet below).
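A quick look at that divergence (the shapes below are just an example):

import numpy as np
import torch

a = np.ones((2, 3))
t = torch.ones(2, 3)

a.size     # 6 -- in NumPy, `size` is the total number of elements
t.size()   # torch.Size([2, 3]) -- in PyTorch, `size()` is the shape
t.numel()  # 6 -- PyTorch's element count, spelled differently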
All of this is of course addressable, with plans to add support for a complex numbers module in the second release of the Array API specification. We can expect updates and added submodules in a future version of the spec, to be released next year.
PyTorch developers have been working hard to improve PyTorch's Array API compliance, fixing divergent behaviour. During the development of this prototype demo, I was able to identify some gaps in the current state of Array API in PyTorch and started my journey as a PyTorch contributor.
Contributing to PyTorch
In an effort to make PyTorch more Array API compatible, and in the process to get the above demo working, I started contributing to PyTorch by raising PRs and relevant issues. My first PyTorch PR, #62560, started with adding an alias torch.concat for torch.cat, which is how concatenation is defined in the Array API spec, and ended with a bunch of other fixes related to concatenation in PyTorch. Later, I worked on improving compatibility with the Array API in the torch.linalg module. See pytorch/pytorch#63285 and pytorch/pytorch#63227.
As mentioned earlier, you can track the progress on PyTorch's GitHub repo with the label "module: python array api" to check out other interesting developments.
What's Missing?
This approach of tackling interoperability works for scipy.signal, where the code is pure Python + NumPy and can leverage the Array API standard. But for other modules like scipy.fft, scipy.ndimage, scipy.special, etc., where one encounters compiled code, an array-library-specific and hardware-specific implementation is needed, and hence from SciPy we need to be able to access and use those implementations. This is where uarray comes in. A more detailed explanation can be found in the section below highlighting the differences between the Array API and uarray.
uarray
uarray is a backend system for Python that allows you to separately define an API, along with backends that contain separate implementations of that API.
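To make that split concrete, here is a minimal, self-contained sketch of a uarray multimethod and backend. The double function, the "example" domain, and ListBackend are made-up names for illustration; real setups, such as scipy.fft's backend system, follow the same pattern.

import uarray as ua

def _x_replacer(args, kwargs, dispatchables):
    # Put the (possibly converted) array back in place of `x`.
    return (dispatchables[0],) + args[1:], kwargs

def _dispatch(func):
    return ua.generate_multimethod(func, _x_replacer, domain="example")

@_dispatch
def double(x):
    # API stub: only marks `x` as dispatchable; a registered
    # backend provides the actual implementation.
    return (ua.Dispatchable(x, "array"),)

class ListBackend:
    __ua_domain__ = "example"

    @staticmethod
    def __ua_function__(method, args, kwargs):
        # Dispatch on the multimethod's name.
        if method.__name__ == "double":
            return [2 * v for v in args[0]]
        return NotImplemented

with ua.set_backend(ListBackend()):
    print(double([1, 2, 3]))  # -> [2, 4, 6]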
I've been working on adding uarray backend support to more SciPy modules.
SciPy: uarray compatibility tracker
The uarray backend compatibility tracker issue linked above sums up the plan and current state of uarray in SciPy.
I've been working on adding uarray support in the scipy.ndimage module for the last couple of months. See scipy/scipy#14356 for more details on the discussions.
With ndimage supporting uarray backends soon, one will be able to achieve the following in the future:
# SciPy ndimage with a CuPy array
from scipy import ndimage
import cupy as cp

with ndimage.set_backend('cupy'):
    y_cupy = ndimage.correlate1d(cp.arange(10),
                                 cp.array([1, 2.5]))
The work on adding a uarray backend in ndimage is slightly complicated and involved a few other maintenance fixes in SciPy. In case you are interested, I'll leave a list of some of my notable SciPy contributions below, which are directly or indirectly connected to uarray.
All ☀️ & 🌈?
As they say, nothing is perfect on the human stage; both uarray and the Array API have their limitations.
***Wears interoperability hat again***
Let's highlight some Pros & Cons for these two protocols.
uarray
Pros
- Can handle compiled code (anything that is not Python but ends up with Python bindings).
- Supports the ability to coerce/convert inputs and wrap other arrays using the `__ua_convert__` protocol (a sketch follows after this list).
Cons
- Involves a lot of utility code (setup machinery) in both libraries, which may get tricky at times.
- The dispatching mechanism is not implicit: the user is required to register the backend before they can use the array library of their choice. See scipy/scipy#14266 about auto-registering backends.
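To sketch that conversion hook, a backend can define __ua_convert__, which is called on the marked dispatchables before __ua_function__. Continuing the toy "example" domain and double multimethod from the sketch above (NumpyBackend is again a made-up name):

import numpy as np
import uarray as ua

class NumpyBackend:
    __ua_domain__ = "example"

    @staticmethod
    def __ua_convert__(dispatchables, coerce):
        # Convert each marked input to this backend's native array
        # type before __ua_function__ sees it.
        converted = []
        for d in dispatchables:
            if d.type == "array" and coerce and d.coercible:
                converted.append(np.asarray(d.value))
            elif isinstance(d.value, np.ndarray):
                converted.append(d.value)
            else:
                return NotImplemented
        return converted

    @staticmethod
    def __ua_function__(method, args, kwargs):
        if method.__name__ == "double":
            return 2 * args[0]  # args[0] is now an ndarray
        return NotImplemented

# coerce=True lets plain lists be converted into ndarrays:
with ua.set_backend(NumpyBackend(), coerce=True):
    print(double([1, 2, 3]))  # -> [2 4 6]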
Array API
Pros
- Easy for consumer libraries to add support and compatibility with multiple "array providers".
- One can make use of the consumer library functions without thinking about any form of backend registration etc.
Cons
- Can't handle compiled code; works only for pure Python and array provider based code in consumer library methods.
- Not all functions are covered in the Array API spec, which may be a blocker if the consumer library method utilizes a function outside of Array API's scope.
These protocols may not be perfect, but they are a big step towards interoperability and bringing the array/tensor libraries ecosystem closer together. We'll see iterations on these and development of new NEPs in the future, which will probably make array libraries even more interoperable. In essence, with open-source communities like NumPy putting interoperability among the key goals on their roadmap, and the larger scientific community taking small steps in the right direction, we are making real progress towards the ideal world of array library interoperability. At Quansight and in the wider PyData community, we've gained a lot of momentum and interest towards improving interoperability and extensibility in SciPy and Scikits. Stay tuned for some interesting updates on this very soon.
What's Next?
On a more personal note, I've absolutely enjoyed the scientific Python open-source community, and I plan to continue volunteering on projects including SciPy and PyTorch going forward.
Specifically, I plan to work on improving PyTorch's interoperability with other libraries through Python Array API compliance, which is aimed for the 1.11 release, and also on improving NumPy support. There are a lot of interesting gaps to be filled in the OpInfo testing module, and in general I'll try to catch a few bugs through the improved testing framework. With the recent migration to structured kernels, I also plan to help out with porting some ops to structured.
PyTorch is a project that I really love and have used for a long time; it is always nice to contribute back to it and learn along the way. I'm open to contributing to other interesting areas that might arise in the future.
In SciPy, I aim to continue adding and improving uarray backend support for more modules. This also extends into libraries like CuPy, Dask, etc., where the uarray backends need to be enabled.
Apart from uarray, I'd also like to explore and contribute to a few more interesting tracks revolving around making SciPy more performant.
- One of the GSoC students (Xingyu Liu) recently made a lot of progress in accelerating some of the modules with the experimental Pythran support. It would be interesting to explore further possibilities with Pythran.
- A more personal goal is to learn and contribute more towards SciPy's Cython and CPython-API-using code, and in general the Python bindings code. I plan to pick up relevant issues and contribute to that part of the SciPy codebase.
I end my blog here and hope you learnt a few new things about array library interoperability. Feel free to check out my non-technical blog post titled "Why Quansight is Awesome!", where I talk about my experience as an Intern at Quansight.
Acknowledgements
Special thanks to my mentor, Ralf Gommers, who has been extremely helpful and positive every time I felt lost. Thank you for taking all my questions and doubts repeatedly, and still being so responsive and thorough with your answers. I'm grateful for your support and guidance.
I feel fortunate contributing back to impactful libraries like PyTorch and SciPy (having used them personally). Thanks to the community and the awesome team at Quansight Labs for an amazing summer.
References
- Python Data API Standards
- uarray
- NEP 18 — A dispatch mechanism for NumPy's high level array functions (__array_function__ protocol)
- NEP 31 — Context-local and global overrides of the NumPy API
- NEP 47 — Adopting the array API standard
- The evolution of array computing in Python | PyData Amsterdam 2019 | Ralf Gommers
- Gravitational Wave Open Science Center Tutorials
- A guide to LIGO–Virgo detector noise and extraction of transient gravitational-wave signals