# What Will Be the API Giving Voice to LLMs?

Posted in Design by Tsahi Levent-Levi, February 18, 2025

If you are reading this, then there's a high likelihood that you've had your fair share of chats with ChatGPT or similar platforms. The benefits of text prompting across many aspects of our work and personal lives are hard to overstate.

What has happened in recent months is that these generative AI platforms, these LLMs, are getting a voice. They now enable users to interact with them using voice in addition to text.

This raises a question: what API and interface will we use for this new voice of LLMs?

Below, I'll go through three different potential alternatives, two of which are already in use. These are WebSocket, WebRTC, and WebTransport. Let's start with who uses which interface.

## Who Uses What?

Speech-to-text services have traditionally used a WebSocket connection as their interface of choice. This made WebSocket the obvious choice as the primary voice interface for LLM services as well. We've seen OpenAI introduce its Realtime API and Google Gemini introduce a WebSocket interface.

These and other vendors also sometimes offer an HTTPS interface to handle simple upload and download of voice files, but those are the kind of interfaces I'd use for batch processing of offline work, not for something that should take place in real time or near real time.

When OpenAI introduced its Realtime API with its WebSocket voice interface, vendors jumped on the bandwagon to extend it with WebRTC support, especially the Programmable Voice and Programmable Video vendors. These include Twilio, Daily, LiveKit, and others. There's a reason why, and we will get to it later. In December 2024, OpenAI then introduced a beta version of its own WebRTC-based interface.

Which interface should developers pick for their applications? Which ones should LLM providers offer? Let's check each of the alternatives.

## What's in a WebSocket Interface?

While there aren't many examples of these voice-based bot APIs out there yet, it is safe to say that the WebSocket interface is the most popular one.

Here, the device connects to the AI vendor by opening a WebSocket connection to the server and keeping it open for the duration of the session. It is a bidirectional connection, allowing either the device or the AI server to send any data across the connection at will.

The mindset of this interface is turn-based in nature: the device sends its prompt, either as text or voice, and this gets processed on the AI server, which usually tries to figure out the end of the prompt by itself (when the person finishes speaking). It then passes the full voice recording through speech-to-text and from there to the LLM, or directly to a multimodal LLM capable of processing audio inputs. The LLM generates the response prompt, which is then converted to audio using text-to-speech. The generated audio is pushed through the WebSocket back to the device.

*Flow diagram of using WebSocket for real-time voice with LLMs.*

WebSockets are great since they are reliable in nature: you send data through them, and that data is guaranteed to arrive on the other end of the connection. If it doesn't arrive, that simply means the connection is broken and lost. On the other hand, this guarantee of reliability comes at the cost of latency. To deal with packet losses, WebSockets (which run over TCP) end up retransmitting lost packets.
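To make the turn-based flow above concrete, here is a minimal browser-side sketch of the WebSocket pattern. The endpoint URL, event names, and helper function are illustrative assumptions of mine, not any specific vendor's actual protocol.

```typescript
// Minimal sketch of the turn-based WebSocket flow described above.
// The endpoint URL and message shapes are illustrative assumptions,
// not any particular vendor's real protocol.

type ServerEvent =
  | { type: "audio.delta"; audio: string } // base64-encoded audio chunk
  | { type: "response.done" };

const ws = new WebSocket("wss://llm.example.com/v1/realtime"); // hypothetical endpoint

const playbackQueue: string[] = []; // chunks waiting to be decoded and played

ws.onopen = () => {
  // Stream the user's recorded prompt up as base64 chunks, then signal end of turn.
  for (const chunk of recordedPromptChunks()) {
    ws.send(JSON.stringify({ type: "input_audio.append", audio: chunk }));
  }
  ws.send(JSON.stringify({ type: "input_audio.commit" }));
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data) as ServerEvent;
  if (msg.type === "audio.delta") {
    // The whole reply tends to arrive much faster than it can be played,
    // so the client has to buffer and pace playback on its own.
    playbackQueue.push(msg.audio);
  } else if (msg.type === "response.done") {
    // End of the bot's turn; the user can speak again.
  }
};

// Placeholder helper, assumed to exist elsewhere in the app:
declare function recordedPromptChunks(): string[];
```

Notice that the client is left to buffer and pace the bot's audio by itself, which brings us to the next point.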
The other thing to consider is that the LLM generates its reply much faster than the reply gets spoken out. For the sake of the example, let's assume that it took the LLM one second to generate a 20-second-long reply, counting from receiving the voice prompt, converting it to text, writing a text reply, and then converting that text to voice. The whole 20 seconds of audio will be pushed down the WebSocket virtually immediately, a lot faster than the 20 seconds it will take to play it to the user.

This is great if we're trying to use the LLM kinda offline or in a turn-based setting. A bit less fun if what we're trying to do is live and interactive in nature.

## Why Did OpenAI Introduce a WebRTC Interface?

Let's get back to OpenAI's decision to add a WebRTC interface.

Adding voice to LLMs is powerful. It enables humans to interact with AI agents and converse with them naturally. This means that the interface of an LLM supporting voice ends up being consumed on the user's device. WebSocket isn't geared towards that, for the following reasons:

- Network issues: Users might be connected via poor networks with low bandwidth and high packet loss. This leads to higher latency, which negatively affects the interaction.
- Geographic constraints: Most WebSocket interfaces will be provided from a specific location. The farther away you are from that location, the poorer the connection and the higher the latency. Think of a WebSocket connecting a user in Australia to a server in an East US data center: the possibility of high latency is significant.
- Audio playback: The work of dealing with audio playout is handled on the client side without any assistance from the server. While doable, it is clunky, especially when you plan on introducing the ability to interrupt the LLM mid-reply; in that case, you won't know how much of the response has actually been heard.

For interactivity, we aim to reduce latency as much as possible, and for that, WebRTC ticks all the right boxes:

- Low latency by design and implementation, with a preference towards UDP transport. Retransmissions are only done if latency isn't going to increase in a way that renders interactivity useless; otherwise, packet losses are handled using packet loss concealment techniques.
- WebRTC "forces" the server to always be aware of the exact playback state on the user's end, knowing what was heard and what hasn't been played yet. This makes it easier to implement human interruptions and increases the natural flow of the conversation.
- Deployments of WebRTC infrastructure are usually global, aiming to get traffic off unmanaged internet connections as close to the user as possible and then route it internally between servers in a more controlled and managed fashion. Here, a user in Australia will connect to a local server in Australia, even if the LLM is located in East US. The route to the US will take place over a higher-quality network that we have more control over.
- WebRTC is available in all modern browsers.

I won't be explaining the intricacies of WebRTC here (read What is WebRTC for a deeper dive), but what I will say is this: WebRTC is a natural choice when one wants to interact using voice with other people or with a generative AI bot. The standard was designed and implemented specifically for low latency and high interactivity. The moment we want end users to strike up natural conversations with generative AI services, we will need to forgo WebSocket and move to WebRTC.
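To ground this, here is a minimal browser-side sketch of wiring a microphone to a voice bot over WebRTC. The signaling endpoint shown is a hypothetical placeholder; each provider defines its own way of exchanging the SDP offer and answer.

```typescript
// A minimal browser-side sketch of connecting a microphone to a voice bot over WebRTC.
// The signaling endpoint (how the SDP offer/answer is exchanged) is an assumption here.

async function connectVoiceBot(): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();

  // Send the user's microphone audio to the bot.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  mic.getTracks().forEach((track) => pc.addTrack(track, mic));

  // Play whatever audio the bot sends back. WebRTC paces this in real time,
  // so the server-side pipeline can track what the user has actually heard.
  pc.ontrack = (event) => {
    const audio = new Audio();
    audio.srcObject = event.streams[0];
    audio.play();
  };

  // Exchange SDP with the provider, here via a hypothetical HTTPS signaling endpoint.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  const response = await fetch("https://llm.example.com/v1/realtime/webrtc", {
    method: "POST",
    headers: { "Content-Type": "application/sdp" },
    body: offer.sdp,
  });
  await pc.setRemoteDescription({ type: "answer", sdp: await response.text() });

  return pc;
}
```

In practice, a provider SDK or one of the Programmable Voice and Video platforms mentioned above would typically wrap this setup for you.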
The only caveat with using WebRTC? It might be overkill, since it's geared towards solving human-to-human conversation challenges. Is there another alternative?

## Will WebTransport Be Used?

WebRTC comes with its own set of headaches. It must do many things, such as supporting peer-to-peer communication between two browsers, something that a conversation with a machine doesn't require. Using WebRTC necessitates more moving parts than are truly needed if all we want to do is give LLMs a voice.

Another theoretical alternative available to us is WebTransport. WebTransport is a new standard interface for browsers. It's not available in Safari just yet, which makes its coverage only partial. With Apple being Apple, and given their slow adoption of WebRTC, you can expect them to take their sweet time with the introduction of WebTransport.

*Browser support for WebTransport at the time of writing. Source: Caniuse.com.*

The great thing about WebTransport is that it requires far fewer moving parts and focuses on communications between a browser and a server: for us, a user and a bot. It also doesn't assume much about the data being sent; delivery can be reliable or unreliable. Future work around this technology also includes Media Over QUIC (MoQ).

In a way, WebTransport might be the perfect fit for offering voice to LLMs. At least theoretically, because it's just too early to deploy it in production (a short sketch of what it could look like in the browser appears at the end of this post).

## Feature Comparison: WebSocket, WebRTC, WebTransport

I'll finish with a quick comparison between these three alternatives from the point of view of an LLM voice implementation.

| Technology | Latency | Reliability | Uplink (user to bot) | Downlink (bot to user) | Best Used When | Availability |
| --- | --- | --- | --- | --- | --- | --- |
| WebSocket | High on poor networks | Reliable, due to the retransmission mechanism in TCP | Realtime | Batched | The user is a server in a controlled environment | Everywhere |
| WebRTC | Low | Unreliable, aiming to either receive data quickly or just "skip" it if it arrives too late | Realtime | Realtime | The user is a person on an unmanaged network | Everywhere |
| WebTransport | Low | Depends on your implementation | Realtime | Realtime | The user is a person on an unmanaged network | Not available on Safari yet; can be considered "beta" on other browsers |

## Final Thoughts

Planning to add voice to an LLM? Pick the right transport protocol for it (which will likely be WebRTC at this point).

If you're interested in understanding more about the differences between text and voice in the context of LLMs, then I invite you to read more about OpenAI, LLMs, and voice bots.
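As referenced in the WebTransport section above, here is a minimal browser-side sketch of what voice over WebTransport could look like, sending audio as unreliable datagrams. The endpoint URL and helper functions are assumptions for illustration only.

```typescript
// A rough sketch of an audio uplink and downlink over WebTransport datagrams.
// The endpoint URL and the framing of raw audio chunks as datagrams are
// assumptions for illustration, not a vendor's actual API.

async function connectOverWebTransport(): Promise<void> {
  const transport = new WebTransport("https://llm.example.com/voice"); // hypothetical endpoint
  await transport.ready;

  // Unreliable, low-latency uplink: a late audio chunk is simply dropped,
  // much like WebRTC's media path.
  const writer = transport.datagrams.writable.getWriter();
  for (const chunk of encodedAudioChunks()) {
    await writer.write(chunk);
  }
  writer.releaseLock();

  // Downlink: read the bot's audio datagrams as they arrive.
  const reader = transport.datagrams.readable.getReader();
  while (true) {
    const { value, done } = await reader.read();
    if (done || !value) break;
    playAudioChunk(value); // decode and schedule playback (app-specific)
  }
}

// Placeholder helpers, assumed to exist elsewhere in the app:
declare function encodedAudioChunks(): Uint8Array[];
declare function playAudioChunk(chunk: Uint8Array): void;
```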