Language Models Are Few-Shot Learners: The Introduction of GPT-3
AI, But Simple Issue #40

Hello from the AI, but simple team! If you enjoy our content, consider supporting us so we can keep doing what we do.
Our newsletter is no longer sustainable to run at no cost, so we’re relying on different measures to cover operational expenses. Thanks again for reading!
One of the most significant milestones in machine learning (ML) and AI research is the publication of OpenAI’s groundbreaking paper “Language Models are Few-Shot Learners” by Brown et al.
In that paper, the researchers presented GPT-3, a massive language model with 175 billion parameters, and showed that scaling up language models by a huge factor (in both model size and training data) allows them to perform new, unseen tasks with no task-specific training.

With no gradient updates, users give the model a few examples (or even zero examples) in a text prompt, and it infers what to do.
This is a technique popularly known as few-shot learning, an in-context learning method allowing the model to learn from context without fine-tuning on specific tasks.
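To make this concrete, a few-shot prompt is just plain text: a task description, a handful of worked examples, and the new query, all concatenated together. Here is a minimal sketch; the helper name, task, and example pairs are illustrative, not taken from the paper.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Concatenate a task description, worked examples, and a new query
    into one prompt string; the model infers the task from context alone."""
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The prompt ends mid-pattern, so the model's natural continuation
    # is the answer to the final query.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("house", "maison")],
    "cat",
)
print(prompt)
```

Note that nothing here updates the model's weights: the "learning" happens entirely inside the forward pass, conditioned on the prompt text.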
Through the researchers’ experiments, they showed that bigger models don’t just memorize—they generalize patterns from diverse text.
But before getting into the vastly popular few-shot learning method, let’s start from the very beginning. How did they train GPT-3?
How GPT-3 Works
GPT-3 is a transformer-based large language model (LLM), excelling at NLP tasks such as text generation or translation. Transformers are great at handling text sequences by weighing the importance of different words through their self-attention mechanism.
GPT-3 is an autoregressive model, meaning it uses past data to predict future outputs. Concretely, it predicts each next token in a sequence one at a time, conditioned on all the tokens that came before it. (Not all transformers work this way: encoder models like BERT see the whole sequence at once.)
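Autoregressive generation can be sketched with a toy model: each step looks only at the context so far, appends one token, and repeats. The tiny lookup table below is a made-up stand-in for GPT-3's 175B-parameter network, purely to show the loop structure.

```python
# Made-up bigram table standing in for a real language model:
# it maps the last token to a single "most likely" next token.
NEXT = {
    "the": "cat",
    "cat": "sat",
    "sat": "down",
}

def generate(prompt_tokens, steps):
    """Autoregressive loop: each prediction conditions on the
    sequence generated so far, then is appended to that sequence."""
    tokens = list(prompt_tokens)
    for _ in range(steps):
        nxt = NEXT.get(tokens[-1])  # condition on past context
        if nxt is None:
            break
        tokens.append(nxt)
    return tokens

print(generate(["the"], 3))  # → ['the', 'cat', 'sat', 'down']
```

A real LLM replaces the lookup table with a full probability distribution over its vocabulary at each step, but the sequential, left-to-right structure is the same.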
Want to learn more about transformers? Check out our past issues explaining the transformer architecture:
GPT-3 uses a decoder-only architecture (like GPT-2), meaning it doesn’t need an encoder like other transformer models used for translation (such as BART).
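The key mechanical difference in a decoder-only model is the causal mask in self-attention: position i may only attend to positions at or before i, so the model can never peek at future tokens during training. A minimal NumPy sketch of that masking (with dummy all-zero attention scores in place of real query-key dot products):

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to raw attention scores, then softmax each row.
    Entries above the diagonal (future positions) get -inf, so they
    receive exactly zero attention weight after the softmax."""
    T = scores.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above diagonal
    scores = np.where(mask, -np.inf, scores)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)  # row-wise softmax

# With uniform (zero) scores, each position splits attention equally
# over itself and all earlier positions.
w = causal_attention_weights(np.zeros((4, 4)))
```

Row 0 attends only to itself (weight 1.0), while row 3 spreads weight 0.25 over all four positions; an encoder, by contrast, would let every row attend to every column.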