## Measuring API usage for popular numerical and scientific libraries

Published May 27, 2019

costrouc

Chris Ostrouchov

Developers of open source software often have a difficult time understanding how others use their libraries. Better data on when and how functions are used has many benefits, including:

- better API design
- determining whether a feature can be deprecated or removed
- more instructive tutorials
- understanding the adoption of new features

# Python Namespace Inspection

We wrote a general tool, python-api-inspect, to analyze any function or attribute call within a given set of namespaces in a repository. This work was heavily inspired by a blog post on inspecting method usage with Google BigQuery for pandas, NumPy, and SciPy. That earlier work used regular expressions to search for method usage. The primary issue with this approach is that it cannot handle aliased imports such as `import numpy.random as rand; rand.random(...)` unless additional regular expressions are constructed for each case, and even then it will produce false positives. Additionally, BigQuery is not a free resource. Thus, this approach is not general enough and does not scale well with the number of libraries whose function and attribute usage we would like to inspect.

A more robust approach is to inspect the Python abstract syntax tree (AST). Python's standard library `ast` module provides a performant function, `ast.parse(...)`, for constructing a Python AST from source code. A node visitor is used to traverse the AST and record `import` statements and function/attribute calls. This allows us to catch any absolute namespace reference. The following are cases that python-api-inspect catches:

```python
import numpy
import numpy as np
import numpy.random as rnd
from numpy import random as rand

numpy.array([1, 2, 3])
numpy.random.random((2, 3))
np.array([1, 2, 3])
rnd.random((2, 3))
rand.random((2, 3))
```
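To make the node-visitor idea concrete, here is a minimal sketch of how import aliases can be resolved to absolute names with the standard `ast` module. It is not the code from python-api-inspect: the class `NamespaceVisitor` and the helper `count_calls` are hypothetical names chosen for illustration, and the sketch assumes module-level imports and ignores scoping and relative imports.

```python
import ast
from collections import Counter


class NamespaceVisitor(ast.NodeVisitor):
    """Count calls whose dotted name resolves to a tracked top-level namespace."""

    def __init__(self, namespaces):
        self.namespaces = set(namespaces)
        self.aliases = {}       # local name -> fully qualified name
        self.calls = Counter()  # fully qualified name -> number of calls

    def visit_Import(self, node):
        for alias in node.names:
            if alias.asname:
                # import numpy.random as rnd  ->  rnd means numpy.random
                self.aliases[alias.asname] = alias.name
            else:
                # import numpy.random  ->  only the name "numpy" is bound
                top = alias.name.split(".")[0]
                self.aliases[top] = top

    def visit_ImportFrom(self, node):
        if node.module is None or node.level:
            return  # relative imports are skipped in this sketch
        for alias in node.names:
            # from numpy import random as rand  ->  rand means numpy.random
            local = alias.asname or alias.name
            self.aliases[local] = f"{node.module}.{alias.name}"

    def visit_Call(self, node):
        # Walk the attribute chain (a.b.c) down to its root name.
        parts = []
        target = node.func
        while isinstance(target, ast.Attribute):
            parts.append(target.attr)
            target = target.value
        if isinstance(target, ast.Name) and target.id in self.aliases:
            parts.append(self.aliases[target.id])
            full_name = ".".join(reversed(parts))
            if full_name.split(".")[0] in self.namespaces:
                self.calls[full_name] += 1
        self.generic_visit(node)  # also visit calls nested in the arguments


def count_calls(source, namespaces):
    visitor = NamespaceVisitor(namespaces)
    visitor.visit(ast.parse(source))
    return visitor.calls


source = """
import numpy
import numpy as np
import numpy.random as rnd
from numpy import random as rand

numpy.array([1, 2, 3])
numpy.random.random((2, 3))
np.array([1, 2, 3])
rnd.random((2, 3))
rand.random((2, 3))
"""
counts = count_calls(source, {"numpy"})
# counts["numpy.array"] == 2 and counts["numpy.random.random"] == 3
```

All five calls resolve to two absolute names, which is exactly what a per-alias regular expression struggles to do. Because the root of each call must be an imported name, a call on a plain variable is never counted by this sketch.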

There are limitations to this approach since Python is a heavily duck-typed language. To understand this, consider the two calls at the end of the following example.

```python
def foobar(array):
    return array.transpose()

a = numpy.array(...)
a.transpose()
foobar(a)
```

How is one supposed to infer that `a.transpose()` is a `numpy.ndarray` method, or that `foobar` is a function that takes a `numpy.ndarray` as input? These are open questions; answering them would allow further inspection of how libraries use given functions and attributes. It should be noted that dynamically typed languages in general have this problem.

Now that the internals of the tool have been discussed, the usage is quite simple. The repository Quansight-Labs/python-api-inspect comes with two command line tools (Python scripts). The main tool, `inspect_api.py`, heavily caches downloaded repositories and source files that have already been analyzed, so inspecting a file a second time is a sqlite3 lookup. Currently, this repository inspects 17 libraries/namespaces and around 10,000 repositories (100 GB compressed). It has been designed to have no dependencies other than the Python stdlib and to run easily from the command line. Below is the command that is run when inspecting all the libraries that depend on numpy.

```shell
python inspect_api.py data/numpy-whitelist.ini \
    --exclude-dirs test,tests,site-packages \
    --extensions ipynb,py \
    --output data/inspect.sqlite
```
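The caching of analyzed source files mentioned above can be pictured with a small sketch. This is only an illustration under assumptions: the table name `file_cache`, the functions `open_cache` and `analyze_cached`, and the choice of keying on a SHA-256 hash of the file contents are all hypothetical, and the real schema used by `inspect_api.py` may look quite different.

```python
import hashlib
import json
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a sqlite3 database holding per-file analysis results."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_cache ("
        "  sha256 TEXT PRIMARY KEY,"
        "  result TEXT NOT NULL)"
    )
    return conn


def analyze_cached(conn, source, analyze):
    """Return analyze(source), reusing a stored result for identical content."""
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT result FROM file_cache WHERE sha256 = ?", (key,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # cache hit: the file is not analyzed again
    result = analyze(source)
    conn.execute(
        "INSERT INTO file_cache (sha256, result) VALUES (?, ?)",
        (key, json.dumps(result)),
    )
    conn.commit()
    return result


def line_count(source):
    # Stand-in for a real (and much more expensive) AST analysis.
    return {"lines": len(source.splitlines())}


conn = open_cache()
first = analyze_cached(conn, "import numpy\n", line_count)
second = analyze_cached(conn, "import numpy\n", line_count)  # served from sqlite
```

Keying on content rather than on path means a file that appears unchanged in several repositories would only be analyzed once, at the cost of hashing every file on each run.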

The `inspect_api.py` command comes with several options that are useful for filtering the results. `--exclude-dirs` excludes directories within a repository from the counts (e.g. the `tests` or `site-packages` directory). This option makes it possible to separate the use of a given namespace in tests from its use within the library itself. `--extensions` defaults to all Python files (`*.py`) but can also include Jupyter notebooks (`*.ipynb`), showing us how users use a namespace in an interactive context. Unsurprisingly, this work found that many Jupyter notebooks in repositories have syntax errors.
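As a rough sketch of what these two options involve, the snippet below filters paths by directory name and pulls the code cells out of a notebook before parsing. The helper names (`is_excluded`, `notebook_source`, `try_parse`) are hypothetical and not taken from the repository, and the notebook layout assumed is the nbformat 4 JSON structure. One likely source of syntax errors, although the post does not say which ones it encountered, is that IPython magics such as `%matplotlib inline` are not valid Python for `ast.parse`.

```python
import ast
import json
from pathlib import PurePosixPath

EXCLUDED_DIRS = {"test", "tests", "site-packages"}


def is_excluded(path, excluded=EXCLUDED_DIRS):
    """True if any directory component of a repository-relative path is excluded."""
    return any(part in excluded for part in PurePosixPath(path).parts[:-1])


def notebook_source(notebook_json):
    """Join the code cells of a notebook (nbformat 4 layout) into one string."""
    notebook = json.loads(notebook_json)
    cells = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # "source" may be a single string or a list of lines
        cells.append(source if isinstance(source, str) else "".join(source))
    return "\n".join(cells)


def try_parse(source):
    """Return the AST, or None when the source is not valid Python."""
    try:
        return ast.parse(source)
    except SyntaxError:
        return None


clean_nb = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["import numpy as np\n", "np.zeros(3)"]},
    ]
})
magic_nb = json.dumps({
    "cells": [{"cell_type": "code", "source": "%matplotlib inline"}]
})

parsed = try_parse(notebook_source(clean_nb))  # an ast.Module
failed = try_parse(notebook_source(magic_nb))  # None: the magic is not valid Python
```

Joining all cells into one string means a single bad cell makes the whole notebook unparseable; parsing cell by cell would be a more forgiving design.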

While not the focus of this post, an additional script, `dependant-packages.py`, is provided in the repository. This script is used to populate the `data/numpy-whitelist.ini` file with repositories that depend on numpy. This would not be possible without the libraries.io API. It is a remarkable project which deserves more attention.

# Results

The table below summarizes the findings of namespace usage within all `*.py` files, `*.py` files in test directories only, `*.py` files excluding those within test directories (`tests`, `test`), and Jupyter notebook `*.ipynb` files only. All of the results are provided as `csv` files. It is important to note that the `inspect_api.py` script collects much more detail than is included in the `csv` files, and there is plenty of additional work that could be done with this tool for general Python AST analysis.

| Library | Whitelist | Summary only `.py` | Summary only `.py` tests | Summary only `.py` without tests | Summary only `.ipynb` |
|---|---|---|---|---|---|
| astropy | ini | csv | csv | csv | csv |
| dask | ini | csv | csv | csv | csv |
| ipython | ini | csv | csv | csv | csv |
| ipywidgets | ini | csv | csv | csv | csv |
| matplotlib | ini | csv | csv | csv | csv |
| numpy | ini | csv | csv | csv | csv |
| pandas | ini | csv | csv | csv | csv |
| pyarrow | ini | csv | csv | csv | csv |
| pymapd | ini | csv | csv | csv | csv |
| pymc3 | ini | csv | csv | csv | csv |
| pytorch | ini | csv | csv | csv | csv |
| requests | ini | csv | csv | csv | csv |
| scikit-image | ini | csv | csv | csv | csv |
| scikit-learn | ini | csv | csv | csv | csv |
| scipy | ini | csv | csv | csv | csv |
| statsmodels | ini | csv | csv | csv | csv |
| sympy | ini | csv | csv | csv | csv |
| tensorflow | ini | csv | csv | csv | csv |

Since many namespaces were checked, we will highlight only some of the results. First, for NumPy, the most common function calls are unsurprising: `numpy.array`, `numpy.zeros`, `numpy.asarray`, `numpy.arange`, `numpy.sqrt`, `numpy.sum`, and `numpy.dot`. There are plans to deprecate `numpy.matrix`, and this seems possible since `numpy.matrix` is not in the top 150 function calls. The most used NumPy testing functions were the expected `testing.assert_allclose`, `testing.assert_almost_equal`, and `testing.assert_equal`.

SciPy acts as a glue for many algorithms needed for scientific and numerical work. The usage of `scipy` is surprising, and these are also possibly the most accurate results in this analysis. This is because `scipy` tends to consist of function wrappers over lower-level routines rather than class instance methods, which are harder to detect, as discussed above. The `sparse` methods are heavily used, along with several high-level wrappers such as `scipy.interpolate.interp1d` and `scipy.optimize.minimize`. I was surprised to find that one of my favorite SciPy methods, `scipy.signal.find_peaks`, is rarely used! Only a small fraction of the `scipy.signal` functions are used, and these include `scipy.signal.lfilter`, `scipy.signal.fftconvolve`, `scipy.signal.convolve2d`, `scipy.signal.lti`, and `scipy.signal.savgol_filter`.

scikit-learn is a popular library for data analysis and offers some of the traditional machine learning algorithms. Here are the most used models, in order:

1. `sklearn.linear_model.LogisticRegression`
2. `sklearn.decomposition.PCA`
3. `sklearn.ensemble.RandomForestClassifier`
4. `sklearn.cluster.KMeans`
5. `sklearn.svm.SVC`

pandas is another popular data analysis library for tabular data that helped drive the popularity of Python. One of the huge benefits of `pandas` is that it allows reading many file formats into a single in-memory `pandas.DataFrame` object. Unsurprisingly, the most popular `pandas` functions are `pandas.DataFrame` and `pandas.Series`. Here we rank the most popular `pandas.read_*` functions:

1. `pandas.read_csv`
2. `pandas.read_table`
3. `pandas.read_sql_query`
4. `pandas.read_json`
5. `pandas.read_pickle`

requests makes working with HTTP requests easier than the stdlib urllib.request and is one of the most downloaded packages. Looking at the data for usage of `requests`, three functions are primarily used (everything else is used 3-5x less): `requests.get`, `requests.post`, and `requests.Session`, with `headers` being the most common argument.
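Argument usage of this kind can also be read off the AST, since each `ast.Call` node carries its keyword arguments. The following is a minimal sketch of one way to do so, not the method used by `inspect_api.py`. The helpers `dotted_name` and `keyword_usage` are hypothetical, the sample source is made up for illustration rather than taken from the results, and for brevity the sketch matches literal dotted names without resolving import aliases.

```python
import ast
from collections import Counter


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def keyword_usage(source, function_names):
    """Count keyword argument names passed to calls of the given dotted names."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and dotted_name(node.func) in function_names:
            for keyword in node.keywords:
                # keyword.arg is None for a **kwargs expansion
                if keyword.arg is not None:
                    counts[keyword.arg] += 1
    return counts


# Made-up input for illustration only.
source = """
import requests
requests.get(url, headers=h, timeout=5)
requests.post(url, data=payload, headers=h)
"""
usage = keyword_usage(source, {"requests.get", "requests.post"})
# usage["headers"] == 2, usage["timeout"] == 1, usage["data"] == 1
```

Positional arguments are not counted here; attributing them to parameter names would require knowing each function's signature.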

Overall, it is clear that libraries are used differently within packages, tests, and notebooks. Notebooks tend to prefer high-level routines such as `scipy.optimize.minimize`, `numpy.linspace`, and `matplotlib.pyplot.plot`, which can be used for demos. Additionally, notebook function usage would be a good metric for material that is worthwhile to include in introduction and quick-start documentation. The same goes for testing and development documentation, which could equally be informed by which functions are used in tests and in packages. Further work is necessary to generalize this tool, as it could help the Python ecosystem better understand, through analytics, how the language is being used.