Generative AI 3: Transformer Networks and Large Language Models (LLMs)
Transformers are the foundation of modern Large Language Models (LLMs) such as GPT-3, BERT, and T5. In this step, we will explore the transformer architecture, its components, and how it enables LLMs to process and generate human-like text. We will also learn about tokenizers, which are critical for preparing text data for transformer models.
3.1 Transformer Architecture
The transformer is a deep learning architecture designed to process sequences of data in parallel using a mechanism called self-attention. It has become the dominant model for NLP tasks due to its ability to handle long-range dependencies and scale to large datasets.
Key Components of Transformers:
- Self-Attention:
- The self-attention mechanism allows the model to focus on different parts of the input sequence to understand relationships between words, regardless of their position in the sequence.
- It computes a weighted sum of the value vectors of all input tokens, where the weights (or “attention scores”) represent how important each token is relative to the others; a minimal sketch appears after this list.
Formula:
\[\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V\]
Where:
- \(Q\) (Query), \(K\) (Key), and \(V\) (Value) are matrices derived from the input token embeddings through learned linear projections.
- \(d_k\) is the dimension of the key vectors; dividing by \(\sqrt{d_k}\) keeps the dot products at a stable scale before the softmax.
- Multi-Head Attention:
- Instead of a single attention function, the transformer uses multiple attention heads that focus on different parts of the input. Each head operates independently, and their outputs are concatenated and projected back to the model dimension.
- Position-Wise Feed-Forward Networks:
- After attention, each token passes through a feed-forward network. The same network is applied independently to each position, adding a non-linear transformation on top of the attention output.
- Positional Encoding:
- Transformers process input in parallel, so they need positional encodings to understand the order of tokens in a sequence. These encodings are added to the input embeddings.
- Encoder and Decoder:
- Encoder: The encoder reads the input sequence and processes it through self-attention and feed-forward layers.
- Decoder: The decoder generates the output sequence by attending to both the encoder’s output and the previously generated tokens.
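To make self-attention and positional encodings concrete, here is a minimal NumPy sketch (an illustration only, not a full implementation: it omits the learned Q/K/V projection matrices, masking, and multiple heads):
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) attention scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of the value vectors

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# Toy example: 3 tokens with 4-dimensional embeddings plus positional encodings
np.random.seed(0)
X = np.random.randn(3, 4) + sinusoidal_positional_encoding(3, 4)
print(scaled_dot_product_attention(X, X, X).shape)  # (3, 4)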
3.2 Tokenization
Before text can be fed into a transformer model, it must be converted into tokens, which are then mapped to numerical IDs. Tokenization is the process of splitting text into smaller units such as words, subwords, or characters.
Types of Tokenizers:
- Word Tokenizers:
- These split text into individual words.
- Example:
"I love cats"
→["I", "love", "cats"]
- Subword Tokenizers:
- These split rare words into smaller, meaningful subword units. Common techniques include:
- Byte Pair Encoding (BPE) (used in GPT-2, GPT-3).
- WordPiece (used in BERT).
- Example (the exact split depends on the learned vocabulary):
"unhappiness"
→ ["un", "##happiness"]
- Character Tokenizers:
- These split text into individual characters.
- Example:
"hello"
→["h", "e", "l", "l", "o"]
Why Tokenization is Important:
- Tokenization is crucial for transforming text into a format that LLMs can process. Each token is mapped to an embedding, which serves as the model’s input.
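As a tiny illustration of that mapping (a sketch assuming PyTorch is installed; the vocabulary size and embedding dimension below are made up for the example):
import torch
import torch.nn as nn

# Hypothetical vocabulary of 30,000 tokens, each mapped to a 768-dimensional vector
embedding = nn.Embedding(num_embeddings=30_000, embedding_dim=768)

token_ids = torch.tensor([[101, 1045, 2293, 102]])  # a batch with one tokenized sentence
token_embeddings = embedding(token_ids)             # shape: (1, 4, 768)
print(token_embeddings.shape)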
Practical Tokenization with Hugging Face:
We’ll use the Hugging Face Transformers library to load pre-trained tokenizers and process text for transformer models like GPT-2 and BERT.
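For example, a pre-trained GPT-2 (BPE) tokenizer can be loaded and applied like this (a short sketch; it requires the transformers package installed below, downloads the vocabulary on first use, and the exact subword splits depend on GPT-2’s learned vocabulary):
from transformers import GPT2Tokenizer

# Load the pre-trained GPT-2 byte-pair-encoding (BPE) tokenizer
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Subword tokens starting with 'Ġ' mark a preceding space in GPT-2's vocabulary
print(gpt2_tokenizer.tokenize("I love learning about transformers."))
print(gpt2_tokenizer.encode("I love learning about transformers."))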
3.3 Fine-Tuning Large Language Models (LLMs)
Pre-trained LLMs:
LLMs like GPT, BERT, and T5 are pre-trained on massive text datasets using unsupervised (self-supervised) learning objectives. These models can then be fine-tuned on specific downstream tasks such as text classification, text generation, or summarization; the short example after the list below illustrates the first two pre-training objectives.
- GPT (Generative Pre-trained Transformer): Uses causal language modeling to predict the next word in a sequence.
- BERT (Bidirectional Encoder Representations from Transformers): Uses masked language modeling to predict masked words in a sentence.
- T5 (Text-to-Text Transfer Transformer): Treats all NLP tasks as text-to-text problems, making it highly flexible.
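The first two objectives can be tried directly with the Hugging Face pipeline API (a quick sketch; it downloads the pre-trained weights on first run, and the exact outputs will vary):
from transformers import pipeline

# Masked language modeling (BERT): predict the [MASK] token
fill_mask = pipeline('fill-mask', model='bert-base-uncased')
print(fill_mask("Paris is the [MASK] of France.")[0])

# Causal language modeling (GPT-2): predict the next tokens
generator = pipeline('text-generation', model='gpt2')
print(generator("Transformers are", max_new_tokens=10)[0]['generated_text'])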
Tokenization and Fine-Tuning with Hugging Face
Installing accelerate, transformers, and datasets
pip install accelerate transformers datasets
Tokenization Example
Let’s tokenize some text using a pre-trained BERT tokenizer.
from transformers import BertTokenizer
# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Example sentence
sentence = "I love learning about transformers."
# Tokenize the sentence
tokens = tokenizer.tokenize(sentence)
print("Tokens:", tokens)
# Convert tokens to input IDs
input_ids = tokenizer.encode(sentence, add_special_tokens=True)
print("Input IDs:", input_ids)
Output:
Tokens: ['i', 'love', 'learning', 'about', 'transformers', '.']
Input IDs: [101, 1045, 2293, 4083, 2055, 19081, 1012, 102]
Here, the tokenizer automatically adds BERT’s special tokens: 101 is the ID of [CLS] (the start-of-sequence token) and 102 is the ID of [SEP] (the end-of-sequence separator).
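To see the special tokens by name, the IDs can be mapped back to tokens (using the tokenizer loaded above):
# Map the IDs back to tokens; [CLS] and [SEP] wrap the sentence
print(tokenizer.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'i', 'love', 'learning', 'about', 'transformers', '.', '[SEP]']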
Fine-Tuning Example
Next, let’s fine-tune a pre-trained BERT model on a text classification task (e.g., sentiment analysis).
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
# Load the pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# Load a dataset (for example, IMDb sentiment dataset)
dataset = load_dataset('imdb')
# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    evaluation_strategy="epoch",
    save_steps=10_000,
    save_total_limit=2,
    logging_dir='./logs',
)
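# To report accuracy during evaluation (as in the Accuracy column of the log below),
# the Trainer needs a metric function; a minimal sketch assuming NumPy is available:
import numpy as np

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair provided by the Trainer at evaluation time
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}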
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    compute_metrics=compute_metrics,  # reports accuracy at the end of each epoch
)
# Train the model
trainer.train()
Output:
***** Running training *****
Num examples = 25000
Num Epochs = 3
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 4688
Epoch Training Loss Validation Loss Accuracy
---------------------------------------------------
1/3 0.4132 0.3741 0.8327
2/3 0.2476 0.3238 0.8654
3/3 0.1567 0.3121 0.8750
***** Running Evaluation *****
Num examples = 25000
Batch size = 64
Final Accuracy on test set: 87.50%
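Once training finishes, the fine-tuned model can classify new text. A minimal sketch (assuming PyTorch, which the training setup above already relies on, and the tokenizer and model defined earlier; the example sentence is made up):
import torch

# Classify a new review with the fine-tuned model
text = "This movie was absolutely wonderful!"
inputs = tokenizer(text, return_tensors='pt', truncation=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}  # match the model's device

model.eval()
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(dim=-1).item()  # IMDb labels: 0 = negative, 1 = positive
print("Predicted class:", predicted_class)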
Summary:
- Transformers are the backbone of modern Large Language Models (LLMs) like GPT, BERT, and T5. They use self-attention mechanisms to process and generate text efficiently.
- Tokenizers are essential for converting text into numerical tokens that LLMs can understand. Pre-trained tokenizers like those from Hugging Face handle this process automatically.
- Fine-tuning allows you to adapt pre-trained LLMs to specific tasks (e.g., text classification, generation) using your own dataset.