5 Best Speech-to-Text APIs in 2021

The number of applications leveraging Speech-to-Text technology has exploded over the past few years. For example, 41% of adults use voice search, and 72% of smart speaker owners use voice-activated commands daily. This is largely due to significant advances in Speech-to-Text accuracy, accessibility, and affordability.

Today, applications of Speech-to-Text technology go way beyond common everyday uses. This powerful technology has cross-industry applications in fields such as virtual meetings, media monitoring, video platforms, telephony, conversational intelligence, and more. With the rise in applications of Speech-to-Text, there are now many Speech-to-Text APIs to choose from.

If you're shopping for a Speech-to-Text API, you might wonder what criteria are best to use to evaluate your options. Here are the top questions to consider when choosing a Speech-to-Text API:

How accurate is the Speech-to-Text API?

Accuracy is the most important factor to consider when choosing a Speech-to-Text API. Ideally, the API should provide a free benchmark test that runs your audio/video files through the API and presents the results to you. Even better, the API should compare results with human transcriptions to determine the Word Error Rate, or WER, on your audio/video files, so you can see how close the API's transcripts were to human-level accuracy.
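
If you'd like to check WER yourself, here is a minimal Python sketch. It uses the standard definition (substitutions + deletions + insertions, divided by the number of words in the human transcript) and computes it with a word-level edit distance. The function name and the simple normalization (lowercasing and splitting on whitespace) are our own assumptions for illustration; real benchmarks usually also strip punctuation and normalize numbers before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER of an API transcript (hypothesis) against a human one (reference)."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


human = "the quick brown fox"
api = "the quick brown box"
print(f"WER: {word_error_rate(human, api):.0%}")
```

A lower WER is better. Note that WER can exceed 100% when the API inserts many extra words, so treat it as an error ratio rather than a percentage of "wrong" words.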

Does it offer transparent pricing and documentation?

API pricing doesn't have to be complex! The best Speech-to-Text APIs offer free trials, transparent pricing, and volume discounts for high levels of usage. You should also be able to quickly access API documentation for increased transparency and get a sense of how easy it would be to integrate the API into your application.

What features does it offer?

Today's best Speech-to-Text APIs go beyond Automated Speech Recognition (ASR) and transcription, supporting additional features such as:

  • Speaker diarization (aka: speaker labels)
  • Paragraph detection
  • Automatic punctuation and casing
  • Topic detection
  • Personally Identifiable Information (PII) redaction
  • Sensitive Content detection
  • Profanity filtering
  • Custom vocabulary support
  • and others.

How secure is it?

When choosing a Speech-to-Text API, you should pay attention to how the API uses your data. Do they keep a copy of your audio/video files and transcriptions to improve their API? Do they give you the ability to permanently delete the audio/video files you send to the API and the transcript that is generated? These are some of the essential questions you should ask. The best Speech-to-Text APIs should never store or monetize your data. And it should be easy to purge your data from the API's systems.

How much support does the company offer?

An API doesn't have to be a complex one-off purchase. The best API providers offer a true partnership, helping you solve problems in real time and unlock the product's full potential for maximum ROI.

The 5 Best APIs for Speech-to-Text

Sounds simple enough, right? Unfortunately, finding a Speech-to-Text API that scores high in the above categories can be a challenge! That's why we're here to help.

In this article, we've taken the criteria above and applied them to determine the five best Speech-to-Text APIs for 2021. We'll examine the main features, costs, and pros and cons of each below.

1. Google Speech-to-Text

Since its inception in 2018, Google Speech-to-Text has remained a top name in the Speech-to-Text market. It boasts numerous key features, including speech adaptation, domain-specific models, and robust language support.

The API's high accuracy makes it a popular choice. Still, data privacy remains a concerning issue for many developers, as the company reserves the right to use all customer data as training data for its own models.

Its high cost can also be a limiting factor. While developers can get the first 60 minutes free, additional use costs $.006-$.009 per 15 seconds, depending on whether the standard or enhanced model is used. Since Google's Speech-to-Text API only supports transcribing audio/video files stored in Google Cloud, developers must also pay for Google Cloud hosting, making it one of the most expensive APIs on the market.
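
To see how per-15-second billing adds up, here is a rough Python sketch based only on the rates quoted above. The function name is hypothetical, and two details are our assumptions rather than confirmed billing rules: that usage is rounded up to the next 15-second increment, and that the 60 free minutes act as a single allowance. Google Cloud hosting costs are not included.

```python
import math

FREE_SECONDS = 60 * 60           # first 60 minutes free (from the article)
STANDARD_PER_15_SEC = 0.006      # standard model rate quoted above
ENHANCED_PER_15_SEC = 0.009      # enhanced model rate quoted above


def google_stt_cost(audio_seconds: float,
                    price_per_increment: float = STANDARD_PER_15_SEC) -> float:
    """Estimate the transcription cost in dollars (hosting excluded)."""
    billable = max(0.0, audio_seconds - FREE_SECONDS)
    # Assumption: partial increments are rounded up to a full 15 seconds.
    increments = math.ceil(billable / 15)
    return round(increments * price_per_increment, 4)


two_hours = 2 * 60 * 60
print(google_stt_cost(two_hours))                       # standard model
print(google_stt_cost(two_hours, ENHANCED_PER_15_SEC))  # enhanced model
```

Under these assumptions, each hour of audio beyond the free allowance works out to about $1.44 on the standard model and $2.16 on the enhanced model.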

Pros

  • Well-known name
  • Free trial
  • Good accuracy
  • Supports 62+ languages

Cons

  • Infrequently updated
  • Only supports transcribing files in a Google Cloud Bucket
  • Lack of customer and developer support
  • Expensive
  • Data privacy

2. AssemblyAI

AssemblyAI is a Speech-to-Text API startup quickly making a name for itself as a top competitor in the space. It offers real-time and batch transcription with high accuracy, features like speaker diarization and topic detection, and automated punctuation and casing. Developers can also add custom vocabulary to the API to tailor the results for their specific application.

At the time of writing, the company only supports English transcription (all accents), which limits its use to English-language audio. The company also has limited SDKs available, although integrating with its API is very simple.

AssemblyAI's API is much more affordable than some other APIs, with three hours free per month and $.00025 per second transcribed after that. Volume discounts are also available.
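
As a quick illustration of how a free tier plus per-second pricing plays out, here is a small Python sketch using the figures quoted above. The function name is hypothetical, volume discounts are ignored, and we assume the three free hours are simply subtracted from the month's total usage.

```python
FREE_HOURS_PER_MONTH = 3      # free tier quoted above
PRICE_PER_SECOND = 0.00025    # paid rate quoted above


def assemblyai_monthly_cost(audio_seconds: float) -> float:
    """Estimate one month's cost in dollars, ignoring volume discounts."""
    # Assumption: the free hours are deducted from total monthly usage.
    billable = max(0.0, audio_seconds - FREE_HOURS_PER_MONTH * 3600)
    return round(billable * PRICE_PER_SECOND, 2)


thirteen_hours = 13 * 3600
print(assemblyai_monthly_cost(thirteen_hours))
```

With these numbers, 13 hours of audio in a month would cost about $9.00, or $.90 per billable hour.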

Pros

  • Top-rated accuracy
  • Extensive feature list
  • 24/7/365 support via multiple channels
  • Supports virtually every audio/video file format
  • Free tier with affordable paid tier

Cons

  • Only supports English (as of today)
  • Limited SDKs available

3. AWS Transcribe

Amazon's Speech-to-Text API, AWS Transcribe, supports both streaming and batch transcription as well as automatic language detection. Its companion product, Amazon Transcribe Medical, is one of the only Speech-to-Text APIs that is HIPAA-eligible to facilitate provider-patient communication transcription, as well as other medical applications.

As with Google Speech-to-Text, AWS Transcribe is only one small component of Amazon's extensive product lineup, making innovation in this category a lesser priority for the company. It also has lower accuracy compared to alternative APIs and only supports the transcription of files in Amazon S3 buckets.

Its tiered pricing ranges from $.00049 down to $.00013 per second, depending on monthly volume.

Pros

  • Well-known name
  • Free trial
  • Transcribe Medical and HIPAA-eligible ASR

Cons

  • Only supports transcribing files in Amazon S3 buckets
  • Lower accuracy compared to alternative APIs
  • Lack of support
  • Infrequently updated

4. Speechmatics

Another top name in the Speech-to-Text API market is Speechmatics. Speechmatics supports real-time transcription and transcribes pre-recorded files with its cloud-based API. It also has on-premise options. The API supports 31 languages with good accuracy.

With an extensive feature list, the API is useful across various industries and applications, including media, compliance, and more. However, Speechmatics lacks a public API, making it challenging to integrate. The company also gates its pricing schemes, so no pricing comparison is offered.

Pros

  • Cross-industry applications in compliance, media broadcasts, contact centers, and more
  • Supports multiple languages
  • Good accuracy

Cons

  • Gated free trial and pricing
  • No public API — difficult to integrate

5. Azure Speech-to-Text

Azure is Microsoft's Speech-to-Text API, offering accurate Speech-to-Text transcription with the option to build your own speech models. Other key features include flexible deployment (in the cloud or at the edge) and multi-language support.

The company also provides extensive documentation and sample code, along with privacy and security certifications, for enhanced transparency and ease of development.

However, Azure Speech-to-Text is a generalized product and, like Google and Amazon, not a main priority for continued development and improvement. The API also lacks real-time support and has a complex pricing scheme that ranges from free (five audio hours per month) to $2.10 per hour for Speech-to-Text to $10 per 1,000 transactions for speaker recognition.

Pros

  • Well-known name
  • Customized speech models
  • Good documentation and sample code
  • Privacy and security certifications
  • Supports many languages

Cons

  • Generalized product
  • Lack of support
  • Complex pricing
  • Infrequently updated

The Future of Speech-to-Text

When searching for a Speech-to-Text API today, look for APIs that have impressive features now, like the five discussed here, and that also prioritize innovation and continuous improvement. That way, you ensure that your API will continue to support your product development with the latest, cutting-edge technology into the future!