The Frequency Bias of Stochastic Gradient Descent (SGD) and How Adam Fixes It

This article explains how Stochastic Gradient Descent (SGD) creates a frequency bias in language models, where common words are learned better than rare ones. It then shows how the Adam optimizer reduces this bias by taking relatively larger steps on the parameters of rare tokens.

Introduction

When computers learn to process human language, they go through a process called training. During training, the model reads a large number of text examples and adjusts itself to capture the patterns in them. A central part of this process is optimization, the procedure that decides how the model's parameters change after each batch of examples. Training language models runs into a surprising problem here, often called frequency bias. It is especially noticeable in how different optimizers, such as Stochastic Gradient Descent (SGD) and Adam, handle common versus rare words.

What is Frequency Bias?

Imagine you're learning a new language and you see the word "the" a million times in your textbook. You'll quickly remember it and understand its meaning. But what if you see the word "serendipity" only a few times? You might not remember it as well, even though it's a meaningful word. Something similar happens in language models.

In language models, tokens are the units of text the model reads: they can be words, parts of words, or even punctuation. Some tokens (like "the," "and," "is") appear very frequently, while others (like "serendipity," "ephemeral," "quixotic") are rare. Each token has parameters of its own (numbers that control how the model works, such as the token's embedding vector). During training, the parameters tied to common tokens receive a useful update on almost every step, while those tied to rare tokens may go hundreds or even thousands of steps without one. The result is that the model learns common tokens much better than rare ones, and this imbalance is known as frequency bias.
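To make the imbalance concrete, here is a minimal Python sketch that counts how many training steps would touch each token's parameters. The toy corpus, the batch size, and the `count_updates` helper are made up for illustration, and the sketch assumes that a token's parameters (for example its input embedding) only receive a gradient from batches that contain the token.

```python
from collections import Counter

# Hypothetical toy corpus: 50 copies of a 13-word sentence, then one rare word.
SENTENCE = "the cat sat on the mat and the dog sat on the rug "
corpus = (SENTENCE * 50 + "serendipity").split()

def count_updates(tokens, batch_size):
    """Count the number of batches in which each token appears at least once.

    Under the stated assumption, this is the number of training steps that
    would update the token's own parameters.
    """
    updates = Counter()
    for start in range(0, len(tokens), batch_size):
        batch = tokens[start:start + batch_size]
        updates.update(set(batch))  # one update per batch, however often it occurs
    return updates

updates = count_updates(corpus, batch_size=13)
print("the:", updates["the"])                  # the: 50
print("serendipity:", updates["serendipity"])  # serendipity: 1
```

In this toy setup the parameters for "the" are touched on 50 of the 51 steps, while those for "serendipity" are touched once.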

How Does Stochastic Gradient Descent (SGD) Work?

Stochastic Gradient Descent (SGD) is one of the most widely used methods for training machine learning models. Think of SGD like a person trying to find the lowest point in a hilly landscape. Each step they take is based on the slope they see right in front of them — they don't look at the whole landscape, just the immediate area.

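In the context of language models, SGD updates the model's parameters after each small batch of data. Every parameter moves by a step proportional to its own gradient, scaled by a single learning rate that all parameters share. A common token shows up in almost every batch, so its parameters get a gradient, and therefore a step, almost every time. A rare token shows up in very few batches, so its parameters mostly stay where they are, and the occasional step they do get is scaled by the same learning rate as everyone else's. Over a full training run, the parameters of rare tokens simply travel much less. This leads to frequency bias: the model becomes very good at handling frequent words but struggles with rare ones.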
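The effect can be sketched with a toy problem in Python. Assume each token owns a single number `w` that should end up at a target of 1.0, and that the token only produces a gradient on the steps where it appears. The loss, the learning rate, and the `train_sgd` helper are illustrative assumptions, not a real language model.

```python
def sgd_step(w, grad, lr):
    """Plain SGD: move against the gradient, scaled by a shared learning rate."""
    return w - lr * grad

def train_sgd(appear_every, steps=1000, lr=0.1, target=1.0):
    """Train one scalar parameter whose token appears every `appear_every` steps.

    The toy loss is 0.5 * (w - target) ** 2, so the gradient is (w - target).
    On steps where the token is absent there is no gradient and no update.
    """
    w = 0.0
    for step in range(steps):
        if step % appear_every == 0:
            grad = w - target
            w = sgd_step(w, grad, lr)
    return w

w_common = train_sgd(appear_every=1)    # token seen on every step
w_rare = train_sgd(appear_every=100)    # token seen on 10 of 1000 steps
print(f"common token parameter: {w_common:.4f}")
print(f"rare token parameter:   {w_rare:.4f}")
```

With the same learning rate and the same number of training steps, the common token's parameter reaches its target, while the rare token's parameter has only taken ten steps and is still well short of it.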

Why Does Adam Fix This Problem?

Adam is another optimization method that has become very popular for training modern AI models. Plain SGD bases each step only on the gradient of the current batch. Adam also keeps running averages of past gradients for every parameter. It's like a hiker who remembers the slopes they have crossed recently, not just the one under their feet.

Adam uses two key ideas:

Adaptive learning rates: Adam keeps a running average of the squared gradient for each parameter and divides that parameter's step by the square root of this average. Parameters that have seen small or infrequent gradients (like those of rare tokens) have a small average, so when a gradient does arrive, the step is large relative to it.
Momentum: Adam also keeps a running average of the gradients themselves and steps along that average instead of the raw gradient. This smooths out noise from batch to batch and makes learning more stable and consistent.
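The two ideas above can be sketched as a single update for one scalar parameter. This is a minimal Python sketch of the standard Adam rule; the function name `adam_step` and the learning rate of 0.01 are choices made here for illustration, while `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are the commonly used defaults.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    m and v are the running averages of the gradient and the squared
    gradient; t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad         # momentum: average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # average of squared gradients
    m_hat = m / (1 - beta1 ** t)               # undo the bias toward zero
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter step size
    return w, m, v

# First step from w = 0 with a tiny gradient and with a huge one.
w_small, _, _ = adam_step(0.0, grad=0.001, m=0.0, v=0.0, t=1)
w_large, _, _ = adam_step(0.0, grad=10.0, m=0.0, v=0.0, t=1)
print(w_small, w_large)
```

Both parameters move by almost exactly one learning rate (about 0.01), even though their gradients differ by a factor of 10,000. Under plain SGD the two steps would differ by that same factor, which is why dividing by the squared-gradient average matters for parameters whose gradients are small or rare.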

By using these features, Adam lets the parameters of rare tokens move further each time those tokens appear, which narrows the gap between common and rare tokens. This reduces the frequency bias that plagues SGD, although it does not necessarily remove it altogether.
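A rough way to see the narrowing is to compare how far a common and a rare parameter travel under each optimizer. The Python sketch below is a toy under stated assumptions: each parameter gets a gradient of 1.0 on the steps where its token appears and 0.0 otherwise, Adam is applied densely (its running averages keep decaying on zero-gradient steps), and `total_movement` is a hypothetical helper written for this illustration.

```python
import math

def total_movement(appear_every, optimizer, steps=1000, lr=0.01,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """Sum of |step| for one parameter whose gradient is 1.0 on the steps
    where its token appears and 0.0 on all other steps."""
    m = v = moved = 0.0
    for t in range(1, steps + 1):
        grad = 1.0 if (t - 1) % appear_every == 0 else 0.0
        if optimizer == "sgd":
            step = lr * grad
        else:  # "adam": momentum keeps moving the parameter between appearances
            m = beta1 * m + (1 - beta1) * grad
            v = beta2 * v + (1 - beta2) * grad * grad
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
            step = lr * m_hat / (math.sqrt(v_hat) + eps)
        moved += abs(step)
    return moved

# Ratio of distance travelled: token seen every step vs. every 100th step.
sgd_gap = total_movement(1, "sgd") / total_movement(100, "sgd")
adam_gap = total_movement(1, "adam") / total_movement(100, "adam")
print(f"common/rare movement under SGD:  {sgd_gap:.1f}")
print(f"common/rare movement under Adam: {adam_gap:.1f}")
```

Under SGD the common parameter travels 100 times as far as the rare one, matching the 100-to-1 difference in how often the tokens appear. Under Adam the gap in this toy setup is smaller but still above 1, which fits the picture of Adam reducing the bias without fully removing it.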

Why Does This Matter?

When a language model is trained using SGD, it may handle common words well but fail to understand rare or nuanced ones. This can lead to models that sound fluent but miss the deeper meaning of text. For example, a model trained with SGD might easily understand "the cat sat on the mat," but struggle with a sentence like "The ephemeral beauty of the sunset was a serendipitous gift."

By using Adam, researchers can often train models that handle rare tokens better and are more balanced overall. This is especially important in real-world applications where understanding rare or domain-specific vocabulary is essential, such as medical or legal documents, where precision matters.

Key Takeaways

Frequency bias is a problem where models trained with SGD learn common words better than rare ones.
SGD scales every step with one shared learning rate, so parameters tied to rare tokens, which seldom receive a gradient, move far less over a training run.
Adam reduces this by dividing each parameter's step by a running average of its squared gradients and by using momentum, so the parameters of rare tokens take relatively larger steps.
Using Adam can help models learn both common and rare words, which tends to give more accurate and nuanced language understanding.