Considering Data Science Users in Your API Design

With the growth of data analytics and AI, API producers need to understand the needs of data scientists and machine learning engineers. This group is likely already using your APIs, and you can serve them better by understanding how they work.

How Data Scientists Use APIs

Data professionals use APIs in many ways. For simplicity, I’ll call them data scientists, but they’re just as likely to have job titles like data engineer, data analyst, or machine learning engineer.

Data scientists might first access your API while performing exploratory data analysis (EDA). The key distinction is that while a software developer might be happy grabbing a few hundred rows to understand the data format, data scientists often need far more data to get a representative sample.

As such, they’ll probably retrieve as much data as your API makes available, then perform some random sampling to explore the distribution of your data, along with measures of central tendency such as the mean and median. They’ll create scatterplots to look for relationships and potential correlations between fields and make box plots of a few key ones. (As one of my mentors told me: “Always start by plotting your data.”)
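
If you’d like to picture this first pass, here’s a minimal sketch using requests and pandas; the endpoint URL and the column names (unit_price, units_sold, category) are hypothetical placeholders, not anything your API is assumed to have.

```python
import matplotlib.pyplot as plt
import pandas as pd
import requests

# Placeholder endpoint; substitute your own API and field names.
resp = requests.get("https://api.example.com/v1/products", timeout=30)
resp.raise_for_status()
df = pd.DataFrame(resp.json())

# Random sample, then distributions and central tendency via describe()
print(df.sample(n=min(1000, len(df))).describe())

# Scatterplot and box plot on a few (hypothetical) key columns
df.plot.scatter(x="unit_price", y="units_sold")
df.boxplot(column="unit_price", by="category")
plt.show()
```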

They’ll examine your API’s contents across data quality measures such as completeness, consistency, and uniqueness, and they might even create a script that loops through your paginated endpoints to retrieve the full contents of the data, as sketched below.
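
Here’s what such a loop might look like, assuming a hypothetical endpoint that accepts page and per_page query parameters and returns a JSON array:

```python
import requests

def fetch_all(url: str, per_page: int = 500) -> list[dict]:
    """Page through a hypothetical paginated endpoint until it runs dry."""
    records, page = [], 1
    while True:
        resp = requests.get(
            url, params={"page": page, "per_page": per_page}, timeout=30
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:  # an empty page signals the end of the data
            break
        records.extend(batch)
        page += 1
    return records

rows = fetch_all("https://api.example.com/v1/products")
print(f"Retrieved {len(rows)} records")
```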

If they find the API to be useful and high-quality, they may use it for a variety of data products. They could create analytics reports or dashboards from it, possibly retrieving the data as users query it and filtering on useful columns.

They may also use it to train a machine learning model, an iterative process where they experiment with a variety of combinations of data columns and algorithms to perform tasks such as classification (for instance, categorizing email as spam or ham) or regression (making predictions on future numeric values based on patterns they find in past data). If they work out a reliable model, they’ll likely deploy it to production, which often includes creating an API of their own.
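
As a sketch of that experimentation loop, here’s a minimal classification attempt with scikit-learn; the endpoint, feature columns, and label are all hypothetical placeholders:

```python
import pandas as pd
import requests
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Placeholder endpoint; the feature and label columns are illustrative only.
df = pd.DataFrame(
    requests.get("https://api.example.com/v1/products", timeout=30).json()
)
X = df[["unit_price", "units_sold", "discount_pct"]]  # candidate features
y = df["is_returned"]                                 # binary label to classify

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```

In practice, they’d run this loop many times, swapping feature combinations and algorithms until the evaluation metrics justify deploying the model.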

If they commit to using your API going forward, they might include it in a data pipeline, which they use to regularly retrieve the latest updates and store them in some type of data store. Here’s where they will be counting on your API to be true to its definition, provide regular updates, and stay reliably available.

The newest way data scientists may be using your APIs is in a generative AI application. APIs can serve this type of application in a range of ways. One method is retrieval-augmented generation (RAG), where the program retrieves content from an API and then feeds it to a large language model (LLM) to produce a result. In other cases, the LLM is used to create an agent that is provided with a set of tools to use, which might include APIs. In that case, the agent decides when to call an API and which endpoints and parameters to use.
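
Here’s a minimal RAG-style sketch using the OpenAI Python client; the data endpoint is a placeholder, and OPENAI_API_KEY is assumed to be set in the environment:

```python
import requests
from openai import OpenAI

question = "Which regions saw the biggest sales growth last quarter?"

# Retrieval step: pull relevant content from a placeholder API endpoint
context = requests.get(
    "https://api.example.com/v1/sales/summary", timeout=30
).json()

# Generation step: hand the retrieved data to an LLM along with the question
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": f"Answer using only this data: {context}"},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```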

Designing APIs to Serve Their Needs

While a data scientist’s desire for quality data with reliable service is not unique, there are some special features that they’ll appreciate. If you provide these, you’ll likely draw data scientists to your offering and keep them coming back. Here are a few qualities that data science users will benefit from.

Add Standard External Identifiers to Your API

Your API most likely includes identifiers unique to your system, such as product identifiers or region codes. Data scientists will likely want to enrich the data you provide by joining it to third-party datasets. You can enable this by adding industry-standard identifiers to your API, even if you don’t currently store them or use them in your system. Examples include FIPS and ISO codes, which standardize locations across the U.S. and internationally, and GTIN codes for tracking retail products.
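
In a Pydantic response model (as used by frameworks like FastAPI), the enrichment might look like this sketch; the field names are illustrative:

```python
from pydantic import BaseModel

class Product(BaseModel):
    product_id: str                 # your internal identifier
    gtin: str | None = None         # Global Trade Item Number for retail joins
    region_code: str                # your internal region code
    fips_code: str | None = None    # U.S. FIPS code for joining census/geo data
    iso_country: str | None = None  # ISO 3166-1 alpha-2 country code
```

Those last three fields are what let a data scientist join your records to census data, geographic boundaries, or retail catalogs without fragile string matching.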

Provide a Software Development Kit (SDK)

In my experience, API providers tend not to offer an SDK until they have a larger API program. But you’ll find that data science users will be much quicker to explore your API if they can get started with a quick pip install command in Python or install.packages() in R.

A key insight is that data scientists don’t care about your API — they just want your data. And the quickest way to get it is to use a library in their language of choice. There are a variety of commercial and open-source SDK generators that support Python. Fewer seem to support auto-generating R clients, but the OpenAPI Generator lists support for R at the time of writing.
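
To picture the workflow you’d be enabling, here’s what access through a generated Python SDK might look like; the exampleapi package and its client methods are entirely hypothetical:

```python
# pip install exampleapi  (hypothetical package name)
import pandas as pd
from exampleapi import Client  # hypothetical generated SDK

client = Client(api_key="YOUR_KEY")
products = client.products.list(limit=1000)  # hypothetical client method
df = pd.DataFrame(products)  # straight into a dataframe, no HTTP plumbing
```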

Provide the Last Changed Date and Time on Each Endpoint

If data scientists use your API in a scheduled data pipeline, they’ll typically want to perform a one-time load of your full data contents (more on that later) and then regularly retrieve the deltas, which are the records that have changed since their last update.

If you include an accurate last-changed date and time in your results and expose it as a query parameter, they can make targeted queries that are easier to write and put less load on your service. This is especially important for REST and GraphQL APIs, where the consumer makes recurring calls to retrieve updates. Other architectural styles, such as webhooks, WebSockets, Kafka, and message queues, are often used when updates are pushed to consumers in real time or as events occur.
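
For the pull-based case, a delta query might look like this sketch; the changed_since parameter name is hypothetical, so use whatever your API defines:

```python
from datetime import datetime, timezone

import requests

# Watermark persisted from the previous pipeline run
last_run = "2025-01-01T00:00:00Z"

resp = requests.get(
    "https://api.example.com/v1/products",  # placeholder endpoint
    params={"changed_since": last_run},     # hypothetical parameter
    timeout=30,
)
resp.raise_for_status()
deltas = resp.json()

# ...upsert `deltas` into the data store, then save the new watermark
new_watermark = datetime.now(timezone.utc).isoformat()
```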

Provide a Bulk Download

This is the follow-up to the previous tip. For data pipelines, data scientists want to start with a full load and then retrieve only the updates. They’d also appreciate a bulk file when performing their initial EDA or training a model. If you provide a method of getting a full bulk load in CSV or Parquet format, they’ll be grateful.
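
On the consumer side, a bulk file drops straight into a dataframe; the export URL below is a placeholder, and reading Parquet requires the pyarrow or fastparquet package:

```python
import pandas as pd

# pandas can read Parquet and CSV directly from a URL
df = pd.read_parquet("https://api.example.com/exports/products.parquet")
# or: df = pd.read_csv("https://api.example.com/exports/products.csv")
print(df.shape)
```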

Understanding the Tools Data Scientists Use

If you’d like to walk a mile in a data scientist’s shoes, a great place to start is by using some standard open-source tools to access your own APIs. I’m partial to Python, so I would start by creating a Jupyter Notebook in VS Code and using the requests or httpx libraries to call your API. You could then use the pandas library to store the results of your API calls in dataframes, which are spreadsheet-like structures that are handy for cleaning and exploring tabular data.
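
A first notebook cell might look something like this; the endpoint is a placeholder, and json_normalize handles nested JSON:

```python
import httpx
import pandas as pd

resp = httpx.get("https://api.example.com/v1/orders", timeout=30)
resp.raise_for_status()

# Flatten nested JSON objects into dataframe columns
df = pd.json_normalize(resp.json())
df.head()
```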

You could create some basic charts and visualizations of your data with libraries such as matplotlib and plotly. Then, you could quickly create a web-based data app using Streamlit or Gradio. You’ll be surprised by how quickly these simple tools allow you to explore and understand the data in your APIs as a business user would. What you learn about your own data might surprise you.
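
As a taste of how little code a data app needs, here’s a minimal Streamlit sketch (save as app.py and launch with streamlit run app.py); the endpoint and column names are hypothetical:

```python
import pandas as pd
import requests
import streamlit as st

st.title("API Data Explorer")

# Placeholder endpoint and columns
df = pd.DataFrame(
    requests.get("https://api.example.com/v1/products", timeout=30).json()
)

category = st.selectbox("Category", sorted(df["category"].unique()))
st.dataframe(df[df["category"] == category])
st.bar_chart(df.groupby("category")["units_sold"].sum())
```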

If you’d like to explore the bleeding edge, you can experiment with a generative AI application in Python. Frameworks such as LangChain, AutoGen, LlamaIndex, and PydanticAI can implement the RAG-based or agent-based styles of application. However, the limitations currently present in all generative AI apps, such as hallucinations, still apply, and these frameworks are a definite step up in complexity from the examples above.

Embrace Your Data Science Users

If you’re an API provider and haven’t spent time talking to your data science users, now is a great time to start. Reach out to your existing users to find data scientists and schedule time for some usability testing with them. Watch them use your APIs with their native tools, and identify gaps and opportunities to serve them better. You can also recruit testers at local data science meetups. As you add features that meet their needs, you can create specific blog posts and user guides with a data science focus. You may be surprised by how many data scientists are already using your APIs and how many more are waiting in the wings if you make the effort to meet a few of their unique needs.

To get your hands dirty developing APIs using the tips above or using APIs as data scientists do, check out my new book Hands-On APIs for AI and Data Science: Python Development with FastAPI.