Measuring API usage for popular numerical and scientific libraries
Published May 27, 2019
costrouc
Chris Ostrouchov
Developers of open source software often have a difficult time understanding how others use their libraries. Having better data on when and how functions are used has many benefits. Some of these are:
- better API design
- determining whether or not a feature can be deprecated or removed
- more instructive tutorials
- understanding the adoption of new features
Python Namespace Inspection
We wrote a general tool, python-api-inspect, to analyze any
function/attribute call within a given set of namespaces in a repository.
This work was heavily inspired by a blog post on inspecting method usage
with Google BigQuery for pandas, NumPy, and SciPy. That earlier work used
regular expressions to search for method usage. The primary issue with
this approach is that it cannot handle aliased imports such as
import numpy.random as rand; rand.random(...) unless additional regular
expressions are constructed for each case. Regular expressions also
produce false positives. Additionally, BigQuery is not a free resource.
Thus, this approach is not general enough and does not scale well with
the number of libraries whose function and attribute usage we would like
to inspect.
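To make the limitation concrete, here is a minimal sketch of a regex-based search. The pattern and the helper name count_matches are hypothetical and are not taken from the earlier blog post; they only illustrate how an aliased call is missed while a mention inside a comment is counted.

```python
import re

# Hypothetical pattern in the spirit of a regex-based usage search.
PATTERN = re.compile(r"\bnumpy\.random\.random\(")


def count_matches(source):
    """Count textual occurrences of numpy.random.random( in the source."""
    return len(PATTERN.findall(source))


direct = "import numpy\nnumpy.random.random((2, 3))\n"
aliased = "import numpy.random as rand\nrand.random((2, 3))\n"
commented = "# numpy.random.random(...) is not called here\n"

print(count_matches(direct))     # the fully qualified call is found
print(count_matches(aliased))    # the aliased call is missed
print(count_matches(commented))  # a comment is counted as a call
```

Covering the aliased case would need a separate pattern for every alias a project happens to choose, which is why this approach does not generalize.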
A more robust approach is to inspect the Python abstract syntax tree
(AST). Python's ast module provides ast.parse(...), a performant function
for constructing a Python AST from source code. A node visitor is used to
traverse the AST and record import statements and function/attribute
calls. This allows us to catch any absolute namespace reference. The
following are cases that python-api-inspect catches:
    import numpy
    import numpy as np
    import numpy.random as rnd
    from numpy import random as rand

    numpy.array([1, 2, 3])
    numpy.random.random((2, 3))
    np.array([1, 2, 3])
    rnd.random((2, 3))
    rand.random((2, 3))
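The idea of recording imports and then resolving attribute chains can be sketched with the standard library alone. This is not the python-api-inspect implementation; the class NamespaceVisitor and the helper count_usage are hypothetical names used only for illustration, and the sketch ignores relative imports and bare names brought in with from ... import.

```python
import ast
from collections import Counter


class NamespaceVisitor(ast.NodeVisitor):
    """Count attribute references that resolve to a target namespace."""

    def __init__(self, namespace):
        self.namespace = namespace
        self.aliases = {}        # local name -> fully qualified path
        self.counts = Counter()  # fully qualified name -> occurrences

    def visit_Import(self, node):
        for alias in node.names:
            if alias.asname:
                # import numpy.random as rnd  binds  rnd -> numpy.random
                self.aliases[alias.asname] = alias.name
            else:
                # import numpy.random  binds only the top-level name numpy
                top = alias.name.split(".")[0]
                self.aliases[top] = top

    def visit_ImportFrom(self, node):
        if node.module is None or node.level:
            return  # relative imports are skipped in this sketch
        for alias in node.names:
            local = alias.asname or alias.name
            self.aliases[local] = f"{node.module}.{alias.name}"

    def visit_Attribute(self, node):
        # Walk a chain such as np.random.random down to its root name.
        parts = []
        current = node
        while isinstance(current, ast.Attribute):
            parts.append(current.attr)
            current = current.value
        if isinstance(current, ast.Name) and current.id in self.aliases:
            full = ".".join([self.aliases[current.id]] + parts[::-1])
            if full == self.namespace or full.startswith(self.namespace + "."):
                self.counts[full] += 1
            # Not descending here avoids also counting the inner
            # numpy.random of numpy.random.random.
        else:
            self.generic_visit(node)


def count_usage(source, namespace):
    """Parse source and return a Counter of resolved namespace references."""
    visitor = NamespaceVisitor(namespace)
    visitor.visit(ast.parse(source))
    return visitor.counts


example = """
import numpy
import numpy as np
import numpy.random as rnd
from numpy import random as rand

numpy.array([1, 2, 3])
numpy.random.random((2, 3))
np.array([1, 2, 3])
rnd.random((2, 3))
rand.random((2, 3))
"""

print(count_usage(example, "numpy"))
```

Run on the cases listed above, this sketch should attribute two references to numpy.array and three to numpy.random.random, regardless of which alias was used.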
There are limitations to this approach since Python is a heavily duck-typed language. To understand this, consider the following two cases.
    def foobar(array):
        return array.transpose()

    a = numpy.array(...)
    a.transpose()
    foobar(a)
How is one supposed to infer that a.transpose() is a numpy.ndarray
method, or that foobar is a function that takes a numpy.ndarray as input?
These are open questions; answering them would allow for further
inspection of how libraries use given functions and attributes. It should
be noted that dynamically typed languages in general have this problem.

Now that the internals of the tool have been discussed, the usage is quite
simple. The repository Quansight-Labs/python-api-inspect comes with two
command line tools (Python scripts). The main tool, inspect_api.py,
heavily caches downloaded repositories and source files that have already
been analyzed, so inspecting a file a second time is a sqlite3 lookup.
Currently, this repository inspects 17 libraries/namespaces and around
10,000 repositories (100 GB compressed). It has been designed to have no
dependencies other than the Python stdlib and to run easily from the
command line. Below is the command that is run when inspecting all the
libraries that depend on numpy.
    python inspect_api.py data/numpy-whitelist.ini \
      --exclude-dirs test,tests,site-packages \
      --extensions ipynb,py \
      --output data/inspect.sqlite
The command comes with several options that can be useful for filtering
the results. --exclude-dirs excludes directories within a repository from
the counts (e.g. the tests or site-packages directories). This option
makes it possible to distinguish the use of a given namespace in tests
from its use within the library. --extensions defaults to all Python
files (*.py) but can also include Jupyter notebooks (*.ipynb), showing us
how users use a namespace in an interactive context. Unsurprisingly, this
work found that many Jupyter notebooks in repositories have syntax errors.
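The filtering described above can be sketched as follows. This is an illustration, not the code in inspect_api.py: the function names iter_sources, notebook_to_source, and parse_or_none are hypothetical, and the notebook handling assumes the nbformat 4 JSON layout, where code lives in a top-level "cells" list.

```python
import ast
import json
import os


def notebook_to_source(text):
    """Concatenate the code cells of a notebook (nbformat 4 layout assumed)."""
    notebook = json.loads(text)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell.get("source", "")
            # "source" may be a single string or a list of lines.
            chunks.append(source if isinstance(source, str) else "".join(source))
    return "\n".join(chunks)


def iter_sources(root, exclude_dirs=("test", "tests", "site-packages"),
                 extensions=("py", "ipynb")):
    """Yield (path, source) pairs, skipping excluded directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in exclude_dirs]
        for filename in sorted(filenames):
            extension = os.path.splitext(filename)[1].lstrip(".")
            if extension not in extensions:
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8") as f:
                text = f.read()
            if extension == "ipynb":
                text = notebook_to_source(text)
            yield path, text


def parse_or_none(source):
    """Return the AST, or None when the source is not valid Python."""
    try:
        return ast.parse(source)
    except SyntaxError:
        return None
```

One plausible contributor to notebook syntax errors is IPython-specific syntax: a cell containing a magic such as %matplotlib inline is not valid Python, so ast.parse rejects it unless such lines are stripped first.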
While not the focus of this post, an additional script,
dependant-packages.py, is provided in the repository. This script is used
to populate the data/numpy-whitelist.ini file with repositories that
depend on numpy. This would not be possible without the libraries.io API.
It is a remarkable project which deserves more attention.
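The caching mentioned earlier for inspect_api.py, where inspecting a file a second time is a sqlite3 lookup, could look roughly like the sketch below. The table name file_cache, its columns, and the functions open_cache and cached_analysis are assumptions made for illustration; the actual schema used by the tool may differ.

```python
import hashlib
import json
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a cache database with a hypothetical schema."""
    connection = sqlite3.connect(path)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS file_cache "
        "(digest TEXT PRIMARY KEY, result TEXT)"
    )
    return connection


def cached_analysis(connection, source, analyze):
    """Return analyze(source), reusing a stored result for identical content."""
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    row = connection.execute(
        "SELECT result FROM file_cache WHERE digest = ?", (digest,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # cache hit: no re-analysis needed
    result = analyze(source)       # must be JSON-serializable
    connection.execute(
        "INSERT INTO file_cache (digest, result) VALUES (?, ?)",
        (digest, json.dumps(result)),
    )
    connection.commit()
    return result
```

Keying on a content hash rather than a file path means identical files vendored into many repositories would only be analyzed once.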
Results
The table below summarizes the findings of namespace usage within all
*.py files, *.py files in test directories only, *.py files excluding
those within test directories (tests, test), and Jupyter notebook
(*.ipynb) files only. All of the results are provided as csv files. It is
important to note that the inspect_api.py script gathers much more detail
than is included in the csv files, and there is plenty of additional work
that could be done with this tool for general Python AST analysis.
Library | Whitelist | Summary only .py | Summary only .py tests | Summary only .py without tests | Summary only .ipynb |
---|---|---|---|---|---|
astropy | ini | csv | csv | csv | csv |
dask | ini | csv | csv | csv | csv |
ipython | ini | csv | csv | csv | csv |
ipywidgets | ini | csv | csv | csv | csv |
matplotlib | ini | csv | csv | csv | csv |
numpy | ini | csv | csv | csv | csv |
pandas | ini | csv | csv | csv | csv |
pyarrow | ini | csv | csv | csv | csv |
pymapd | ini | csv | csv | csv | csv |
pymc3 | ini | csv | csv | csv | csv |
pytorch | ini | csv | csv | csv | csv |
requests | ini | csv | csv | csv | csv |
scikit-image | ini | csv | csv | csv | csv |
scikit-learn | ini | csv | csv | csv | csv |
scipy | ini | csv | csv | csv | csv |
statsmodels | ini | csv | csv | csv | csv |
sympy | ini | csv | csv | csv | csv |
tensorflow | ini | csv | csv | csv | csv |
Since many namespaces were checked, we will highlight only some of the
results. First, for NumPy, the most used function calls are unsurprising:
numpy.array, numpy.zeros, numpy.asarray, numpy.arange, numpy.sqrt,
numpy.sum, and numpy.dot. There are plans to deprecate numpy.matrix, and
this seems feasible since numpy.matrix is not in the top 150 function
calls. The most used NumPy testing functions were the expected
testing.assert_allclose, testing.assert_almost_equal, and
testing.assert_equal.
SciPy acts as a glue for many algorithms needed for scientific and
numerical work. The usage of scipy is surprising, and its results are
possibly the most accurate in this analysis. This is because scipy
consists mostly of function wrappers over lower-level routines rather than
class instance methods, which are harder to detect as discussed above.
The sparse methods are heavily used, along with several high-level
wrappers such as scipy.interpolate.interp1d and scipy.optimize.minimize. I
was surprised to find out that one of my favorite SciPy methods,
scipy.signal.find_peaks, is rarely used! Only a small fraction of the
scipy.signal functions are used, including scipy.signal.lfilter,
scipy.signal.fftconvolve, scipy.signal.convolve2d, scipy.signal.lti, and
scipy.signal.savgol_filter.
scikit-learn is a popular library for data analysis and offers some of the traditional machine learning algorithms. Here we rank the most used models.
1. sklearn.linear_model.LogisticRegression
2. sklearn.decomposition.PCA
3. sklearn.ensemble.RandomForestClassifier
4. sklearn.cluster.KMeans
5. sklearn.svm.SVC
pandas is another popular data analysis library for tabular data that
helped drive the popularity of Python. One of the huge benefits of pandas
is that it allows reading many file formats into a single in-memory
pandas.DataFrame object. Unsurprisingly, the most popular pandas
functions are pandas.DataFrame and pandas.Series. Here we rank the most
popular pandas.read_* functions.
1. pandas.read_csv
2. pandas.read_table
3. pandas.read_sql_query
4. pandas.read_json
5. pandas.read_pickle
requests makes working with HTTP requests easier than the stdlib
urllib.request and is one of the most downloaded packages. Looking at the
usage data for requests, three functions are primarily used (everything
else is used 3-5x less): requests.get, requests.post, and
requests.Session, with headers being the most common argument.
Overall, it is clear that libraries are used differently within packages,
tests, and notebooks. Notebooks tend to prefer high-level routines such as
scipy.optimize.minimize, numpy.linspace, and matplotlib.pyplot.plot, which
can be used for demos. Additionally, notebook function usage would be a
good metric for material that is worthwhile to include in introduction and
quick-start documentation. The same goes for testing and development
documentation, which could be informed by which functions are used in
tests and in packages. Further work is necessary to generalize this tool,
as it could help the Python ecosystem better understand, through
analytics, how the language is being used.