How to Design Robust Generative AI APIs

Artificial intelligence has recently dominated the online space, specifically the concept of large language models (LLMs). These technologies represent a sea change for technology, offering a new pathway for code generation, security assurance, and much more. They also depend on robust, well-designed APIs to function.

As generative AI tools have been developed, there has been a gold rush to productize them as API products. However, not all APIs are created equal. In the hurry to make generative AI APIs accessible to developers and businesses, it’s essential that such interfaces are designed appropriately. These APIs require quality approaches and implementations designed for the unique challenges and considerations inherent in the technology.

Below, we’ll look at these challenges and considerations and begin to form an approach toward designing robust APIs for generative AI. This list is not entirely exhaustive, but it will lay the groundwork for any organization to approach AI API work with a solid foundation.

Why AI?

AI is somewhat of a misnomer when discussing APIs, since it has become a catch-all term for a variety of technologies. One such technology is the large language model, or LLM: a machine learning model trained on existing data that can be used to generate content, analyze data, or even create code.

Setting aside copyright problems and certain security setbacks, these LLM tools have seen adoption across various industries and systems. LLMs represent a new type of system that can be interacted with through an API and, therefore, a new type of system for software providers to monetize.

However, there are a few caveats to these new AI-based systems. Notably, they require APIs that are carefully designed and appropriately formed.

API Design and Style

With AI and LLMs, the sky is the limit. Ironically, however, this fact makes some API paradigms more appropriate than others. Generative models have vast capabilities, allowing everything from image generation to code review, and as such, API design has to be open-ended while being intuitive. The more control you place on what the API can do and how it does what it does, the more limited the capabilities of the end system might be.

For this reason, approaches like REST and, notably, GraphQL have become commonplace in many AI/LLM implementations. Being able to state specifically what you want done and in what format makes the system that much easier to use, and the statelessness and standard HTTP verbs of RESTful development allow for user control of both the form and function of the resulting output. LLMs can then be used to make that output more human-readable or to transform it.

For example, let’s say you have a GraphQL endpoint that provides weather data based upon a query for a specific range of temperatures. You could ask the GraphQL endpoint to return weather data for a specific location within a specific temperature band, say 37 to 38 degrees centigrade.

This might generate an output of multiple coordinates reporting this temperature range. But these coordinates might represent ten points in the same city. Using an LLM, we can ask that these coordinates be combined, and the output itself transformed, which would result in an output as follows:

There are currently two cities in Portugal reporting a temperature between 37 and 38 degrees centigrade – Lisbon and Coimbra!
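
As a rough illustration of this flow, the Python sketch below collapses coordinate-level readings into unique cities and builds a prompt that could be handed to an LLM. The GraphQL field and argument names, the sample readings, and the helper functions are all hypothetical, and the call to an actual model is left out because it depends on the provider.

```python
# Hypothetical GraphQL query: field and argument names are assumptions,
# not taken from any real weather API.
WEATHER_QUERY = """
query Readings($country: String!, $minC: Float!, $maxC: Float!) {
  weatherReadings(country: $country, minTempC: $minC, maxTempC: $maxC) {
    city
    latitude
    longitude
    tempC
  }
}
"""

# Made-up response data: several points can belong to the same city.
SAMPLE_READINGS = [
    {"city": "Lisbon", "latitude": 38.72, "longitude": -9.14, "tempC": 37.4},
    {"city": "Lisbon", "latitude": 38.74, "longitude": -9.15, "tempC": 37.9},
    {"city": "Coimbra", "latitude": 40.20, "longitude": -8.41, "tempC": 37.1},
    {"city": "Lisbon", "latitude": 38.70, "longitude": -9.18, "tempC": 37.6},
]


def unique_cities(readings):
    """Collapse point-level readings into city names, keeping first-seen order."""
    return list(dict.fromkeys(r["city"] for r in readings))


def build_summary_prompt(readings, country, min_c, max_c):
    """Build the text an LLM would be asked to turn into a readable sentence."""
    cities = unique_cities(readings)
    return (
        f"Write one friendly sentence saying that {len(cities)} cities in {country} "
        f"currently report between {min_c} and {max_c} degrees centigrade: "
        + ", ".join(cities)
    )


if __name__ == "__main__":
    print(build_summary_prompt(SAMPLE_READINGS, "Portugal", 37, 38))
```

Doing the deduplication in ordinary code and leaving only the phrasing to the model is one possible split; the model could also be given the raw readings directly.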

Intuitiveness is the name of the game here. While most APIs can get away with a little bit of a learning curve, interacting with an AI/LLM system is closer to interacting with near-human intelligence. Accordingly, it is a good idea to provide some sort of guidebook that shows how to frame requests and what output a given request will produce. Breaking out functions into families of endpoints (for example, endpoints specifically for graphic generation, or endpoints just for code) can help to clarify this and streamline interactions.

Formats and Types

Generative models are very powerful but often require detailed and specific inputs to deliver on their promise. Accordingly, APIs built atop generative models must be designed to support complicated inputs and complex, multi-layered outputs. For instance, an API built on a generative model might need to support rich media, hypertext, iterative outputs that are contextually and temporally linked to previous outputs, and much more.

Accordingly, APIs built on generative models need to support a wide range of output formats and types. GraphQL and other such technologies are great for this type of interaction, as they can allow the user to state the form and function of the output. However, the specific libraries and frameworks under the hood will be just as important in defining what is actually supported.
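
One way to make the supported formats explicit is to validate them at the edge of the API. The sketch below is a minimal, hypothetical request model in Python: the format names, the field names, and the `previous_response_id` link to an earlier output are illustrative assumptions, not any provider's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OutputFormat(Enum):
    # Hypothetical set of formats this imaginary API supports.
    TEXT = "text"
    JSON = "json"
    IMAGE_PNG = "image/png"


@dataclass(frozen=True)
class GenerationRequest:
    prompt: str
    output_format: OutputFormat = OutputFormat.TEXT
    max_output_units: int = 256
    # Optional link to an earlier response, for iterative generation.
    previous_response_id: Optional[str] = None


def parse_request(payload: dict) -> GenerationRequest:
    """Validate a raw payload and return a typed request, or raise ValueError."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")

    raw_format = payload.get("output_format", OutputFormat.TEXT.value)
    try:
        output_format = OutputFormat(raw_format)
    except ValueError:
        supported = ", ".join(f.value for f in OutputFormat)
        raise ValueError(f"unsupported output_format {raw_format!r}; supported: {supported}")

    max_units = payload.get("max_output_units", 256)
    if not isinstance(max_units, int) or max_units < 1:
        raise ValueError("max_output_units must be a positive integer")

    return GenerationRequest(
        prompt=prompt,
        output_format=output_format,
        max_output_units=max_units,
        previous_response_id=payload.get("previous_response_id"),
    )
```

Rejecting an unknown format with a message that lists the supported ones keeps the interface discoverable, which matters when the range of outputs is this wide.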

Rate Limiting

Resource balancing is a big problem with generative models. Generative development is resource intensive, and as requests grow in complexity, so does the demand on the underlying systems. Accordingly, ensuring fair access, proper functionality, and long-term service health will be an important part of the process, and rate limiting can be a big part of that.

Keeping rate limits fair, and no more restrictive than necessary, is key to adoption and long-term use, but you must balance this against overall resource availability. Overly large or complex requests, especially paired with rich media output, can slow down processing or even break the underlying systems. Because of this, format and type limitations (such as optimizing file sizes, capping output, and rate-limiting requests for data-heavy output formats) must be implemented correctly to prevent runaway request scope, denial of service, and other business logic issues.
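
A common way to express this kind of limit is a token bucket in which heavier output formats cost more tokens per request. The Python sketch below shows that idea under stated assumptions: the relative costs in `FORMAT_COST` are invented for illustration, and a real deployment would track one bucket per client and likely keep it in shared storage.

```python
import time

# Hypothetical relative costs: rich media output is charged more than text.
FORMAT_COST = {"text": 1.0, "json": 1.0, "image": 10.0}


class TokenBucket:
    """Token bucket that refills continuously up to a fixed capacity."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self._clock = clock  # injectable so the bucket can be tested
        self._tokens = float(capacity)
        self._last = clock()

    def _refill(self):
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        self._last = now

    def try_consume(self, cost):
        """Spend `cost` tokens if available; return False without spending otherwise."""
        self._refill()
        if cost > self._tokens:
            return False
        self._tokens -= cost
        return True


def allow_request(bucket, output_format):
    """Charge the bucket according to the requested output format."""
    return bucket.try_consume(FORMAT_COST[output_format])
```

With a capacity of 12, for example, a client could make one image request and two text requests in a burst, then would need to wait for the bucket to refill before requesting another image.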


Caching

Generative systems have a unique relationship with caching. While traditional APIs benefit hugely from caching and serving repeat requests to users, generative APIs don’t benefit as much due to the custom nature of the generative request workflow. When each request is unique, the benefit of caching these responses diminishes somewhat.

Where this becomes interesting is in the idea of generative iteration. Some generative models require previous requests to be stored, at least for a time, to allow for iterative development. In such a case, caching can be accomplished simply by feeding the last request back into the model as a contextual reference for the new request. This is akin to caching, but is an inverse model — instead of feeding the existing request to the user, it feeds the existing request to itself for further iteration.

This is a pivot from the typical caching implementation but is quite important to get right in the context of iterative development.
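
The inverse caching described above can be sketched as a small per-session store that replays recent turns to the model. The Python example below is a minimal illustration: the class name, the limit on stored turns, and the plain-text layout of the combined input are assumptions, and real systems would also expire sessions and bound the total context size.

```python
from collections import deque


class IterationContext:
    """Holds the most recent turns of one session for replay to the model."""

    def __init__(self, max_turns=3):
        # Older turns are dropped automatically once max_turns is reached.
        self._turns = deque(maxlen=max_turns)

    def record(self, prompt, output):
        """Store a completed request/response pair."""
        self._turns.append((prompt, output))

    def build_input(self, new_prompt):
        """Combine stored turns with the new request into one model input."""
        lines = []
        for prompt, output in self._turns:
            lines.append(f"Previous request: {prompt}")
            lines.append(f"Previous output: {output}")
        lines.append(f"New request: {new_prompt}")
        return "\n".join(lines)


if __name__ == "__main__":
    ctx = IterationContext(max_turns=2)
    ctx.record("Draw a logo", "<logo v1>")
    print(ctx.build_input("Make it blue"))
```

Note that the stored data goes back into the model rather than out to the user, so the usual cache concerns (hit rate, staleness) matter less than how much history to keep.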

Resources and Processing

When developing an API on top of a generative model, some requirements typical for standard APIs are made much more critical. This is the case with resources and processing. While APIs require efficient use of computational ability and balance across the infrastructure, this requirement is much more critical in the context of generative models.

Generative models require incredible amounts of processing, and these costs can climb steeply with iteration. API providers should invest in robust infrastructure regardless, but with generative models, efficiencies in latency and availability are made that much more important. While proper load balancing can play into this, small efficiency gains that might otherwise not be worth the time and effort suddenly become more justified in generative systems.

Iteration and Versioning

Generative models improve over time, changing their form and function with each new evolution. Accordingly, the APIs built on top of these systems must have substantial iteration and versioning systems to ensure that updates are controlled. Different models may have different resource requirements or abilities, requiring more clarity in communication and better documentation about feature sets.

More specifically, different versions may also come with various revenue models and access requirements. If the third version of an API is more intense on resources but enables more power, then this should be priced accordingly for the end user, and may require different access control methods.
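
To make the link between versions, pricing, and access concrete, the Python sketch below uses a small version registry. The version names, prices, and tier numbers are invented for illustration and do not reflect any real provider's pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    name: str
    price_per_1k_units: float  # hypothetical price, in arbitrary currency
    min_tier: int              # lowest access tier allowed to use this version
    deprecated: bool = False


# Hypothetical registry: the newest version costs more and needs a higher tier.
VERSIONS = {
    "v1": ModelVersion("v1", 0.5, 0, deprecated=True),
    "v2": ModelVersion("v2", 1.0, 0),
    "v3": ModelVersion("v3", 4.0, 1),
}


def resolve_version(requested, user_tier):
    """Return the requested version if it exists and the user's tier allows it."""
    version = VERSIONS.get(requested)
    if version is None:
        raise LookupError(f"unknown version {requested!r}")
    if user_tier < version.min_tier:
        raise PermissionError(f"{requested} requires tier {version.min_tier} or higher")
    return version


def estimate_cost(version, units):
    """Estimate the charge for a request that produces `units` of output."""
    return version.price_per_1k_units * units / 1000
```

Keeping price and access rules next to the version definition means documentation and enforcement can both be generated from the same source.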

Real-World Examples

When it comes to generative APIs, there is a wealth of real-world examples that developers can look to for inspiration.

Perhaps the best current example is the OpenAI API. While OpenAI has seen massive coverage on its consumer application frontend, ChatGPT, the API has not gotten as much attention. Yet, many developers have begun to integrate with it.

OpenAI’s API is clear, understandable, and easy to use. It stands as a great example of what an effective LLM API implementation can look like. User access levels are clearly differentiated, and costs are communicated effectively through a multi-tier user system. OpenAI provides many resources to help developers get started, such as a testing playground.

Another great AI API implementation is GooseAI. This solution takes a different approach from something like OpenAI’s ChatGPT: GooseAI positions itself as NLP-as-a-service via API, offering to build out an AI infrastructure by simply tapping into a series of endpoints. There are price constructs and controls, of course, but the premise of deploying an at-scale NLP implementation with such a simple connection is alluring and serves as a great business use case.


Ultimately, building business logic around LLM and AI APIs has a lot of similarities with any other API approach. That said, certain considerations are unique to LLM and AI systems, and appropriately designing for them will make your product more attractive and user-friendly.

What do you think of these best practices? Did we leave anything out? Let us know how you’re designing web APIs for generative AI in the comments below!