Everything You Need To Know About Transformers

Imagine a scenario where you want to translate a sentence from English to another language. Without additional logic or processing, a simple phrase like “the cat is sitting on the mat” in English might translate to “Le chat est séance sur le tapis.” The actual translation should be “Le chat est assis sur le tapis,” as “séance” means “a session,” which a word-for-word translator would have no way of knowing.

That’s not too bad. Let’s consider a German sentence as a counterexample. A simple word-for-word translation of the phrase “I bought a book yesterday” might return “Ich gekauft ein Buch Gestern.” Pull that phrase out in Düsseldorf and you’ll quickly find out how wrong it really is, as the kindly bookseller looks at you with pity, immediately realizing you’re very far from home. The correct German is “Ich habe gestern ein Buch gekauft,” which maps back to English, word for word, as “I have yesterday a book bought.” That’s intelligible, but it’s unwieldy and far from natural. A direct 1:1 translator matching exact phrases between languages might be possible, but it would also be prohibitively large and obscenely slow.

That’s where transformers come in.

What Are Transformers?

The term “transformer” was introduced in the landmark 2017 research paper Attention Is All You Need, written by eight researchers working on problems like the translation examples in the introduction. The concept is actually fairly simple, at least a few years into an AI-driven universe: a transformer is a neural network that seeks to understand and emulate human language by analyzing patterns in very large datasets. Architecturally, it is a stack of layers, each of which takes an input, processes it in some way, and passes the result along toward the output.

How Transformers Work

Transformers use the concept of “attention” to find patterns in data that give individual words or tokens their context. The original design is built from two stacks of components, known as “encoders” and “decoders,” each with its own internal architecture. Before the input reaches them, it is transformed into mathematical representations called “embeddings”: each token receives a vector, a numerical representation that probability can be factored into, which is the backbone of machine learning. There’s some sophisticated math going on inside a transformer, like the formula that dictates how each sublayer hands its result to the next in the series, expressed mathematically as LayerNorm(x + f(x)), where f(x) is the function computed by the corresponding sublayer and adding x back in (a residual connection) preserves the original signal.
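
To make that formula concrete, here’s a minimal NumPy sketch of the LayerNorm(x + f(x)) pattern. The lambda standing in for f is a placeholder, not a real attention or feed-forward sublayer, and the dimensions are toy values.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x, f):
    # The LayerNorm(x + f(x)) pattern: run the sublayer function f,
    # add its output back onto the input (a residual connection),
    # then normalize the sum.
    return layer_norm(x + f(x))

# Toy example: 4 tokens, each represented by an 8-dimensional vector.
x = np.random.randn(4, 8)
out = sublayer(x, lambda v: 0.5 * v)  # stand-in for attention or feed-forward
print(out.shape)  # (4, 8)
```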

That brief description alone should start giving you an idea of what a miracle it truly is to have ready-made machine learning models and training datasets. What is completely commonplace in 2025 would have been extreme science fiction even ten years ago.

The Layers of Transformers

Each component of a transformer is made up of a number of sublayers. In an encoder, the transformer has two. The first is called the self-attention layer, which computes a mathematical representation of how much attention each word should pay to every other word. The second is the position-wise feed-forward layer, a small neural network applied independently to the vector at each position; a word’s place in the overall text is actually supplied separately, by positional encodings added to the embeddings.
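
Here’s a minimal NumPy sketch of that position-wise feed-forward sublayer, following the two-layer form described in Attention Is All You Need. The weights are random stand-ins for learned parameters, and the sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # toy model width and hidden width

# Randomly initialized weights; in a trained model these are learned.
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

def position_wise_ffn(x):
    # The same two-layer network is applied to every position independently:
    # FFN(x) = max(0, x @ W1 + b1) @ W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = rng.standard_normal((4, d_model))  # 4 token positions
print(position_wise_ffn(x).shape)  # (4, 8)
```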

Every token that passes through an attention layer is given three vectors: a query, a key, and a value. The query represents what the token is looking for, expressed in mathematical terms. The key is the vector each token offers up for matching, which is used to find suitable counterparts. Finally, the value is the content associated with a particular key, and it tells the attention layer what to return.

To put that in less abstract terms, searching for something on YouTube would be the query. The list of videos that gets returned would be the keys, with the most relevant results ranked toward the top. The video that plays when you make your selection would be the value.
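
This query-key-value dance is captured by the scaled dot-product attention formula from Attention Is All You Need: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Below is a minimal single-head NumPy sketch, with random toy vectors standing in for learned projections.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Score every query against every key, scale by sqrt(key size),
    # convert the scores to weights, and blend the values accordingly:
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))  # one query vector per token
K = rng.standard_normal((4, 8))  # one key vector per token
V = rng.standard_normal((4, 8))  # one value vector per token
print(attention(Q, K, V).shape)  # (4, 8)
```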

Transformers in Action

Let’s close out by illustrating some of these principles in action using APIs. Imagine you’re using the Google Cloud Translate API to translate a phrase from English into another language, with the ubiquitous “Hello, World!” as our case in point. When the API receives the query, the text is broken into tokens such as “Hello”, “,”, and “world!”, and each token receives its own embedding. The attention layers then work out how much each token has to do with every other token, which helps return the right result: the output words are not only correct but also in the right order. That’s how a transformer lets Google Cloud Translate API know that “Dia dhuit, domhan!” is as accurate a translation of “Hello, World!” into Irish (Gaelic) as “¡Hola, mundo!” is into Spanish.
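
For the curious, here’s roughly what such a request looks like in Python. This is a minimal sketch, assuming the google-cloud-translate client library is installed and application credentials are configured; the language codes “es” and “ga” are Spanish and Irish.

```python
# Minimal sketch: translate "Hello, World!" into Spanish and Irish.
# Assumes `pip install google-cloud-translate` and that
# GOOGLE_APPLICATION_CREDENTIALS points to valid credentials.
from google.cloud import translate_v2 as translate

client = translate.Client()

for lang in ("es", "ga"):  # Spanish and Irish
    result = client.translate("Hello, World!", target_language=lang)
    print(lang, result["translatedText"])
```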

Transformers are capable of a lot more than translation, however. The same techniques can be used to summarize or generate text. A user might submit a paper about the influence of the Industrial Revolution to a transformer like BART. The same embeddings and attention mechanism that allow transformers to translate text can also be used to create much shorter, simpler explanations of a supplied text.
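
A summarization call can be just as short in practice. The following is a brief sketch using the Hugging Face transformers library with the pretrained facebook/bart-large-cnn checkpoint; the sample text stands in for a full paper, and the length limits are illustrative values.

```python
# Brief sketch: summarizing text with a pretrained BART model via the
# Hugging Face transformers pipeline. Assumes `pip install transformers`
# (plus a backend such as PyTorch); the model downloads on first use.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = (
    "The Industrial Revolution transformed manufacturing, transportation, "
    "and daily life across Europe and North America during the 18th and "
    "19th centuries, shifting whole economies from agriculture to industry."
)
print(summarizer(text, max_length=40, min_length=10)[0]["summary_text"])
```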

Final Thoughts on Transformers

One of the central concerns about AI is that it can be a black box, obfuscating what it’s doing and how it’s coming to its conclusions. This opacity has the potential to cause issues ranging from replicating biases to hiding nefarious or unethical behavior to outright hallucinating answers. While AI-driven tools like large language models (LLMs) are capable of some miraculous things, with all manner of time-saving applications, they work best when users have at least some idea of how they work.

As we have seen, transformers are the basis of today’s LLM-based processing. They’re not as opaque as they once were, but they definitely require some advanced math and know-how. That makes us all that much more grateful for the proliferation of plug-and-play tools for machine learning, training neural networks, and creating your own language models, large and small.