Speech-to-text and audio transcription have long been a Holy Grail for machine learning and artificial intelligence researchers. The seemingly simple act of turning speech into text is, in fact, not simple at all.

That’s what makes AssemblyAI so exceptional. Dylan Fox created AssemblyAI after seeing firsthand how difficult it was to create open-ended transcription software while working at Cisco.

Not only is open-ended transcription prohibitively difficult to implement, but it’s also resource-intensive. This is why a SaaS like AssemblyAI is such an invaluable resource. The San Francisco-based startup has made their custom speech-to-text software available via an API, making transcription AI available for any developer.

For these reasons, our judges chose AssemblyAI as the winner of the Best Public API of 2020 competition. In celebration, here’s our review of the AssemblyAI API, giving you everything you need to know to try out the speech-to-text API for yourself.

Congratulations to AssemblyAI for winning Best Public API of 2020!

AssemblyAI Review

AssemblyAI Speech-to-Text API

Audio transcription is trickier than it looks. If you’ve ever attempted to transcribe recorded audio into written text, you’ll know that writing out someone’s speech is hard work. It’s especially tricky if there are many interruptions, the recording quality is poor, or if you’re unfamiliar with the speaker’s accent. Transcription can become next-to-impossible if there’s more than one speaker.

AssemblyAI is a reason to rejoice for all the audio transcriptionists out there! No more muddling your way through sub-par voice memos and phone recordings. AssemblyAI frees you and your employees up to do more meaningful work. AssemblyAI is open-ended, meaning it can transcribe any word, not just ones included in the training data.

AssemblyAI Features

AssemblyAI features the most accurate audio transcription in the industry using the latest in Deep Learning research. It also gives you a ‘confidence score’ for each word, rating the probability that the word was transcribed correctly.

Peeking under the hood is where you really begin to get a sense of not only what AssemblyAI can do but also some of its implications. Having speech transcribed into text makes it much easier to search and organize. AssemblyAI makes this even easier by time-stamping each word. It also annotates different speakers if there is more than one.

It’s also possible to customize the AssemblyAI API for multiple speakers. It features dual-channel audio support, returning each channel as a separate transcription. This alone is reason to investigate AssemblyAI for audio transcription, if you’ve ever attempted to transcribe a phone call or audio with multiple speakers!

Acoustic and Language Models

One of the most impressive features of AssemblyAI is the inclusion of several different libraries for different accents, recording quality, and recording environments, including the amount of background noise. This feature, alone, makes AssemblyAI worthy of investigation, as transcribing audio from speakers with unfamiliar accents can be highly challenging and time-consuming.

At the time of this review, AssemblyAI features language models for Australian English, South African English, and UK English. Indian English and South Asian English are on the docket and will be released soon.

The Language Models are used to differentiate between different media types, like phone calls or interviews versus broadcast news. This helps AssemblyAI differentiate between similar-sounding words, like ‘to,’ ‘too,’ and ‘two.’ The Media Language Model, as an example, is much better at picking out proper nouns and has a broad vocabulary.

Throttle Limits

There are no limits to how many files you can transcribe using AssemblyAI. The only limit is how many files you can process concurrently. Free plans can only upload one file at a time. Paid plans can process up to 32 files at the same time.
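If you are transcribing a batch of files, one way to stay within those limits is to cap the number of worker threads. The sketch below is only an illustration: the plan names and the transcribe_one callable are hypothetical stand-ins for your own code, and only the figures 1 and 32 come from the limits described above.

```python
from concurrent.futures import ThreadPoolExecutor

# Concurrency caps as described in this review: one file at a time on the
# free plan, up to 32 on paid plans. The plan names are our own labels.
MAX_CONCURRENT = {"free": 1, "paid": 32}


def transcribe_all(files, transcribe_one, plan="free"):
    """Run transcribe_one over files without exceeding the plan's cap.

    transcribe_one is a hypothetical callable supplied by the caller; it
    would upload one file, wait for the result, and return the transcript.
    """
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT[plan]) as pool:
        # pool.map returns results in the same order as the input files.
        return list(pool.map(transcribe_one, files))
```

Capping the pool on the client side means extra files simply wait their turn locally instead of being submitted all at once.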

Getting Started With AssemblyAI

AssemblyAI is very easy to get started with, once you get used to it. It’s about as close to plug-and-play as an API gets. This is largely thanks to the series of Quickstart Guides included in AssemblyAI’s documentation. Some of that documentation wasn’t quite as straightforward as one would hope, at first, but AssemblyAI’s tech support was able to quickly walk us through it. We’ll share what we learned here, as well, to save you all any potential confusion or headaches.

The Quickstart Guides begin with a simple test, the API equivalent of a “Hello World.” Depending on what language you’re using, it’s almost as simple as copying and pasting the source code into your editor of choice. You simply have to replace YOUR-API-TOKEN with the token you get when signing up for AssemblyAI. For this review, we were using Python, Notepad++ as an editor, and a Terminal command line on Windows 10.

Here’s where things get a little confusing, which is why we’re talking you through it, as we needed a little bit of help to get AssemblyAI up and running. Once you’re used to it, it’s as simple as cutting and pasting a few strings of numbers, making it almost as easy to use as software with a GUI.

Note: You’ll need at least basic familiarity with APIs, CRUD commands, and some form of programming language to be able to use AssemblyAI.

Uploading Audio

The tutorials show you how to upload your own audio.

Once you put the file you want to upload into the ‘filename’ path and then run the script, you’re given a new web address where your audio is stored. Then, you simply have to go back to the first Quickstart guide and replace the address in the “audio_url” JSON with the web address you just received.

Run this script again and you will see the status of your transcription, which will vary from ‘queued’ to ‘processing.’ Once it’s finished, you’ll be given an output of your transcription, along with a log of metadata which we’ll cover more in the review section.

AssemblyAI also supports nearly every programming language or environment you can think of, so you should be able to use it comfortably no matter what your preferred programming language is. They’ve also got an audio file pre-loaded on AWS, so you can quickly preview what the API is capable of without waiting for a file to be uploaded and analyzed.

After you’ve taken the Two Minute Quickstart examples for a spin, AssemblyAI’s docs show you how to upload your own audio files to be transcribed. You can upload files directly to the API, so you won’t need to run your own server, which is a very useful feature. The API also supports virtually any audio file format, so you won’t have to mess around with file conversions either.

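import requests

# Raw string so the backslashes in the Windows path are not treated as escapes.
filename = r"I:\Downloads\ligeia.mp3"

def read_file(filename, chunk_size=5242880):
    # Yield the file in 5 MB chunks so it is streamed rather than loaded whole.
    with open(filename, 'rb') as _file:
        while True:
            data = _file.read(chunk_size)
            if not data:
                break
            yield data

headers = {'authorization': "YOUR-API-KEY"}
# Passing a generator as data makes requests stream the upload.
response = requests.post('https://api.assemblyai.com/v2/upload',
                         headers=headers,
                         data=read_file(filename))

# The JSON response contains the web address of the uploaded audio.
print(response.json())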

Once you’ve got your file uploaded, you’re given a web address where it lives. Then you can simply input that address into the example code and… voila! Instant transcription!
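For readers who would rather script that step than paste the address by hand, here is a minimal sketch of the request. It assumes the transcription is started with a POST to the /v2/transcript endpoint, which we infer from the status URL used later in this review, and that the body is the “audio_url” JSON mentioned earlier. The function names are our own, and we use Python’s standard library here so the sketch has no dependencies.

```python
import json
import urllib.request

# Assumed endpoint, inferred from the status URL shown later in this review.
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(audio_url, api_key):
    """Build (but do not send) the POST request that starts a transcription."""
    body = json.dumps({"audio_url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        TRANSCRIPT_ENDPOINT,
        data=body,
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )


def submit_transcript(audio_url, api_key):
    """Send the request and return the decoded JSON response."""
    request = build_transcript_request(audio_url, api_key)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling submit_transcript with the address returned by the upload and your own key should give you back a response containing the transcript’s ‘id’ and ‘status’, like the ones shown in this review.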

Do keep in mind it may take a moment for your audio to be transcribed. The ‘status’ will be listed as ‘queued’ while your file is waiting to be processed. You may have to make several GET requests before your results are ready. A processing response may look like this:

{u'audio_end_at': None, u'acoustic_model': u'assemblyai_default', u'auto_highlights_result': None, u'text': None, u'audio_url': u'https://cdn.assemblyai.com/upload/05032247-7eb0-4b4e-bd0d-33040d15c2fc', u'speed_boost': False, u'language_model': u'assemblyai_default', u'redact_pii': False, u'confidence': None, u'webhook_status_code': None, u'id': u'zaxwd20tl-a5b2-401d-8b5b-a27d5fe536ed', u'status': u'processing', u'boost_param': None, u'words': None, u'format_text': True, u'dual_channel': None, u'punctuate': True, u'utterances': None, u'audio_duration': None, u'auto_highlights': False, u'word_boost': [], u'webhook_url': None, u'audio_start_from': None}
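Rather than re-running the script by hand until the status changes, you could wrap the check in a small polling loop. This is a sketch under our own assumptions: the function name, the three-second interval, and the attempt limit are arbitrary choices, and the loop simply treats any status other than ‘queued’ or ‘processing’ as final.

```python
import time


def wait_for_transcript(fetch, interval=3.0, max_attempts=100):
    """Call fetch() until the transcript is no longer queued or processing.

    fetch is any zero-argument callable that returns the decoded JSON
    response as a dict, for example a wrapper around a GET request.
    """
    for _ in range(max_attempts):
        result = fetch()
        if result.get("status") not in ("queued", "processing"):
            return result
        # Pause between checks so we do not hammer the API.
        time.sleep(interval)
    raise TimeoutError(
        "transcript was still pending after %d attempts" % max_attempts
    )
```

With the requests library used elsewhere in this review, the fetch argument could be something like lambda: requests.get(endpoint, headers=headers).json().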

Again, let’s check on the status of our operation:

import requests

endpoint = "https://api.assemblyai.com/v2/transcript/zv3c2fo1w-b226-4d67-81f1-bbae24e3a424"

headers = {
    "authorization": "YOUR-API-KEY",
}

response = requests.get(endpoint, headers=headers)

print(response.json())

Success! We uploaded an audio recording of an Edgar Allan Poe passage, and AssemblyAI returned the following text: “Eloquence of her low musical language made their way into my heart. Bypass so steadily and stealthily progressive that they have been unnoticed and unknown.” The only slip is “Bypass,” where Poe wrote “by paces.” You can see below what the response looks like, along with metadata, like the confidence score, for each word:

{u'audio_end_at': None, u'acoustic_model': u'assemblyai_default', u'auto_highlights_result': None, u'text': u'Eloquence of her low musical language made their way into my heart. Bypass so steadily and stealthily progressive that they have been unnoticed and unknown.', u'audio_url': u'https://cdn.assemblyai.com/upload/da62981c-b3a7-44a0-8631-ddb9e44386df', u'speed_boost': False, u'language_model': u'assemblyai_default', u'redact_pii': False, u'confidence': 0.9388, u'webhook_status_code': None, u'id': u'zv3c2fo1w-b226-4d67-81f1-bbae24e3a424', u'status': u'completed', u'boost_param': None, u'words': [{u'text': u'Eloquence', u'confidence': 0.9, u'end': 600, u'start': 0}, {u'text': u'of', u'confidence': 0.96, u'end': 750, u'start': 540}, {u'text': u'her', u'confidence': 0.92, u'end': 960, u'start': 720}, {u'text': u'low', u'confidence': 0.91, u'end': 1170, u'start': 900}, {u'text': u'musical', u'confidence': 0.96, u'end': 1710, u'start': 1230}, {u'text': u'language', u'confidence': 0.95, u'end': 2220, u'start': 1650}, {u'text': u'made', u'confidence': 0.99, u'end': 2850, u'start': 2190}, {u'text': u'their', u'confidence': 0.95, u'end': 3120, u'start': 2790}, {u'text': u'way', u'confidence': 0.88, u'end': 3330, u'start': 3090}, {u'text': u'into', u'confidence': 0.97, u'end': 3810, u'start': 3300}, {u'text': u'my', u'confidence': 0.88, u'end': 4020, u'start': 3810}, {u'text': u'heart.', u'confidence': 0.97, u'end': 4320, u'start': 3960}, {u'text': u'Bypass', u'confidence': 0.94, u'end': 5400, u'start': 4710}, {u'text': u'so', u'confidence': 0.89, u'end': 5850, u'start': 5430}, {u'text': u'steadily', u'confidence': 0.92, u'end': 6540, u'start': 5970}, {u'text': u'and', u'confidence': 0.84, u'end': 7110, u'start': 6870}, {u'text': u'stealthily', u'confidence': 0.97, u'end': 8010, u'start': 7200}, {u'text': u'progressive', u'confidence': 0.92, u'end': 8760, u'start': 8070}, {u'text': u'that', u'confidence': 0.99, u'end': 9330, u'start': 9060}, {u'text': u'they', 
u'confidence': 0.99, u'end': 9510, u'start': 9270}, {u'text': u'have', u'confidence': 0.88, u'end': 9720, u'start': 9450}, {u'text': u'been', u'confidence': 1.0, u'end': 9990, u'start': 9690}, {u'text': u'unnoticed', u'confidence': 0.96, u'end': 10620, u'start': 9960}, {u'text': u'and', u'confidence': 0.97, u'end': 11350, u'start': 10960}, {u'text': u'unknown.', u'confidence': 0.96, u'end': 11830, u'start': 11290}], u'format_text': True, u'dual_channel': None, u'punctuate': True, u'utterances': None, u'audio_duration': 12.617, u'auto_highlights': False, u'word_boost': [], u'webhook_url': None, u'audio_start_from': None}
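The per-word ‘start’ and ‘end’ values are what make a transcript searchable. As a rough sketch, the helper below looks up every occurrence of a word in the ‘words’ list and returns its timestamps. The function name is ours, and we are assuming the values are milliseconds, since the last word ends at 11830 and ‘audio_duration’ is 12.617.

```python
def find_word(words, target):
    """Return (start, end) pairs, in the API's time units, for each match."""
    needle = target.lower()
    return [
        (w["start"], w["end"])
        for w in words
        # Strip trailing punctuation such as the period in "heart."
        if w["text"].strip(".,!?;:").lower() == needle
    ]


# A few entries copied from the response above.
sample_words = [
    {"text": "heart.", "confidence": 0.97, "end": 4320, "start": 3960},
    {"text": "and", "confidence": 0.84, "end": 7110, "start": 6870},
    {"text": "stealthily", "confidence": 0.97, "end": 8010, "start": 7200},
    {"text": "and", "confidence": 0.97, "end": 11350, "start": 10960},
]

print(find_word(sample_words, "and"))    # [(6870, 7110), (10960, 11350)]
print(find_word(sample_words, "heart"))  # [(3960, 4320)]
```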

Performance Review

The AssemblyAI team was kind enough to give us access to their transcription API to test it out and share our thoughts.

Setup And Ease Of Use

AssemblyAI is about as close to a plug-and-play API as you’re likely to find, once you get used to it. There’s a bit of a learning curve, but it’s so worth it, especially if you’ve ever had to type out transcriptions by hand.

“A great and unique product developers want to put into use immediately.” – Nordic APIs Judge

AssemblyAI Reliability

AssemblyAI’s API boasted 100% uptime over the 90 days leading up to this review. That’s a strong track record, and a good sign that the API can be integrated into your workflow with little worry about interruptions!

“They’ve also focused on developer experience to make it easy to use their APIs. With copy-paste examples across five languages and tutorials focused on use cases, the company helps developers over the most common hurdle: making that first transcription.” – Nordic APIs Judge

AssemblyAI Accuracy

For the sake of this review, we wanted to go with something well-known so we could verify AssemblyAI’s accuracy. Considering that Halloween is nearly upon us, at the time of this review, we decided to use a recording of Edgar Allan Poe’s “Ligeia” read by Novella Sirena for the LibriVox project.

Once AssemblyAI was up and running and configured correctly, the 41-minute-long recording was transcribed in an impressively short time span. Before I knew it, I had a raw chunk of text and metadata bearing the familiar first line “And the will therein lieth, which dieth not.” Except, for part of it, it returned “And the will there in life. Which diet stage…,” which is fairly understandable, as their algorithm might not be trained to interpret archaic English. You’d likely notice this when looking through the metadata at the end, which is where the ‘confidence’ score comes in handy.
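One way to put the confidence score to work is to flag the words that fall below a cutoff, so a human can review just those. The sketch below does that with a handful of entries from the short sample response earlier in this review. The function name and the 0.9 threshold are our own arbitrary choices, not anything AssemblyAI recommends.

```python
def low_confidence_words(words, threshold=0.9):
    """Return the words whose confidence is below threshold, in order."""
    return [w for w in words if w["confidence"] < threshold]


# Entries copied from the sample response earlier in this review.
sample_words = [
    {"text": "way", "confidence": 0.88, "end": 3330, "start": 3090},
    {"text": "into", "confidence": 0.97, "end": 3810, "start": 3300},
    {"text": "my", "confidence": 0.88, "end": 4020, "start": 3810},
    {"text": "Bypass", "confidence": 0.94, "end": 5400, "start": 4710},
    {"text": "and", "confidence": 0.84, "end": 7110, "start": 6870},
]

# Print each flagged word with its score and start time.
for w in low_confidence_words(sample_words):
    print(w["text"], w["confidence"], w["start"])
```

Note that “Bypass,” the one word the API got wrong in that sample, scored 0.94 and would slip past this cutoff, so the score is best treated as a hint about where to look rather than a guarantee.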

AssemblyAI Security

AssemblyAI deletes your audio files from their AWS server at the end of the transaction, which reduces the risk of your data or information being compromised.

AssemblyAI: Final Thoughts

AssemblyAI’s transcription API is more than deserving of our API of the year award, given its usefulness and ease of use. It has the potential to be worked into a staggering array of different stacks and workflows. I could see it coming in handy for everything from podcasts to business meetings to creative work! It’s exciting when a piece of tech comes along that has the potential to unlock creativity and imagination. AssemblyAI has the potential to be just such an application, while also being practical, useful, and even fun!

J. Simpson

J. Simpson lives at the crossroads of logic and creativity. He writes and researches tech-related topics extensively for a wide variety of publications, including Forbes Finds. He is also a graphic designer, journalist, and academic writer, writing on the ways that technology is shaping our society while using the most cutting-edge tools and techniques to aid his path. He lives in Portland, Or.