DS Lab Course Week 5

Session 2:

HF Transformers - Hands-on Training GPT-2

This session focuses on the “how-to” of fine-tuning. The goal is to demystify the process and show students they can get a language model to generate text in a specific style with just a few key components.

What is GPT-2? It’s a “decoder-only” transformer trained to predict the next word in a sentence.
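
To make “predict the next word” concrete, here is a minimal sketch (assuming the transformers library is installed, as in the setup step below) that lets the stock gpt2 checkpoint, before any fine-tuning, continue a prompt:

from transformers import pipeline

# Load the off-the-shelf GPT-2 checkpoint as a text-generation pipeline
generator = pipeline("text-generation", model="gpt2")

# GPT-2 keeps predicting the next token until max_new_tokens is reached
result = generator("The weather in the mountains is", max_new_tokens=20)
print(result[0]["generated_text"])

The continuation will be generic web-style English; fine-tuning (below) is what nudges the model toward the style of your training data.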

Setup on Google Colab (5 mins)

Guide students to create a new Colab notebook and enable the GPU runtime (Runtime -> Change runtime type -> T4 GPU).

Install the necessary libraries with the following commands:

!pip install --upgrade transformers
!pip install datasets accelerate evaluate

Load a Dataset: Use the datasets library to load a simple text dataset. We will use Wikitext (cleaned Wikipedia articles): it is small enough to fine-tune on quickly, and the generated text is still interesting to read.

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
    AutoModelForCausalLM
)

What this does:

datasets is the Hugging Face library to load datasets like Wikitext easily.

transformers gives us:

Tokenizer — turns text into token IDs the model can understand.

Model — GPT-2 in our case.

DataCollator — handles batch formatting and padding.

TrainingArguments and Trainer — simplify training loops.

Why important: Without these, you’d have to write your own data loader, optimizer, evaluation loop — which is a lot of boilerplate.
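
To give a sense of that boilerplate, here is a rough sketch of the manual PyTorch loop the Trainer replaces (illustrative only: it reuses the model, train_tok and data_collator objects defined later in this notebook, and skips evaluation, checkpointing, warmup and mixed precision):

import torch
from torch.utils.data import DataLoader

# One bare-bones training epoch, written by hand
loader = DataLoader(train_tok, batch_size=4, shuffle=True, collate_fn=data_collator)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.train()

for batch in loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)       # the collator supplies "labels", so the model returns a loss
    outputs.loss.backward()        # backpropagation
    optimizer.step()               # update weights
    optimizer.zero_grad()

Everything in this loop (plus logging, evaluation, saving and resuming) is what Trainer automates for you.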

  1. Load and prepare the dataset
# Load 5000 examples from the training split
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:5000]")

# Remove empty lines (common in Wikitext)
dataset = dataset.filter(lambda ex: len(ex["text"]) > 0)

# Create train/validation split (90% train, 10% validation)
split = dataset.train_test_split(test_size=0.1, seed=42)
train_raw = split["train"]
val_raw = split["test"]

Explanation:

wikitext-2-raw-v1 is a Wikipedia-based dataset for language modeling.

Filtering removes empty strings — these waste computation.

Splitting creates a validation set to measure performance during training.

If skipped:

Without a validation set, you can’t monitor overfitting.

Without filtering, you train on garbage samples.
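
A quick sanity check after loading, filtering and splitting (a minimal sketch; the exact counts depend on how many empty lines were removed):

# Peek at the data before tokenizing: split sizes and one raw example
print(len(train_raw), "training examples,", len(val_raw), "validation examples")
print(train_raw[0]["text"][:200])   # first 200 characters of the first training example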

  2. Load tokenizer and model

Tokenization: Explain that models work with numbers, not text. A tokenizer converts text into a format the model understands (input IDs).

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# GPT-2 has no pad token; we add one for batching
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})

model = AutoModelForCausalLM.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))  # match new vocab size

Explanation:

Tokenizer maps words → integers using GPT-2’s vocab.

GPT-2 doesn’t have a padding token by default, but batching needs it, so we add one.

Model is GPT-2 with a language modeling head.

resize_token_embeddings ensures the model knows about our new pad token.

If skipped:

Without padding, batches of different lengths will crash.

Without resizing embeddings, you’ll get a size mismatch error.
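
A small sketch to see what the tokenizer and the newly added pad token actually do:

# Round-trip a sentence through the tokenizer
ids = tokenizer("Hello world, this is GPT-2.")["input_ids"]
print(ids)                    # a list of integer token IDs
print(tokenizer.decode(ids))  # back to (roughly) the original text

# The pad token we just added sits at the end of the enlarged vocabulary
print(tokenizer.pad_token, tokenizer.pad_token_id, len(tokenizer))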

  3. Tokenize the text
max_length = 512
def tokenize_fn(examples):
    return tokenizer(examples["text"], truncation=True, max_length=max_length)

train_tok = train_raw.map(tokenize_fn, batched=True, remove_columns=["text"])
val_tok = val_raw.map(tokenize_fn, batched=True, remove_columns=["text"])

# Make datasets return PyTorch tensors
train_tok.set_format(type="torch", columns=["input_ids", "attention_mask"])
val_tok.set_format(type="torch", columns=["input_ids", "attention_mask"])

Explanation:

truncation=True: cuts long texts at max_length tokens (GPT-2 limit is 1024, but we pick 512 for speed).

map applies our tokenizer to the whole dataset.

remove_columns drops the raw text after tokenization to save memory.

set_format ensures Trainer gets PyTorch tensors directly.

If skipped:

The model can’t understand raw strings.

Without truncation, you’ll get “sequence too long” errors.
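
To confirm the mapping worked, a quick look at one tokenized example (a sketch; the lengths vary from example to example):

# Each example is now a tensor of token IDs plus an attention mask (1 = real token)
example = train_tok[0]
print(example["input_ids"].shape, example["attention_mask"].shape)
print(example["input_ids"][:10])   # first 10 token IDs of the first example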

  4. Create the data collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

Explanation:

The collator batches tokenized data and pads sequences in the batch to the same length.

mlm=False means causal LM (predict next token), not masked LM like BERT.

If wrong:

Setting mlm=True would train GPT-2 in BERT-style — totally different objective.
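
To see the collator in action, here is a small sketch that batches two examples of different lengths and inspects the result:

# Collate two tokenized examples into one padded batch
batch = data_collator([train_tok[0], train_tok[1]])
print(batch["input_ids"].shape)   # (2, length of the longest sequence in the batch)
print(batch["labels"][0][:10])    # labels are a copy of input_ids; padded positions become -100 and are ignored by the loss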

Fine-Tuning with the Trainer API: This is the core of the session. The Trainer abstracts away the complex training loop.

  5. Define training arguments

Training Arguments: Define the training parameters. Keep them simple for the session.

Common and Useful Training Arguments

Here are some of the most important arguments you might want to add, grouped by function:

For Model Performance

learning_rate: The speed at which the model updates its weights. A smaller value like 5e-5 (which is 0.00005) is a common starting point.

weight_decay: A regularization technique to prevent the model from becoming too complex and overfitting to the training data. A common value is 0.01.

warmup_steps: The number of initial training steps where the learning rate gradually increases from 0 to its full value. This helps stabilize training. 500 is a reasonable number.

For Logging, Saving & Evaluation

eval_strategy (named evaluation_strategy in older transformers versions): When to perform evaluation. Set to "steps" or "epoch".

eval_steps: If using eval_strategy="steps", this sets how often to run evaluation (e.g., every 500 steps).

save_strategy: Same idea, but for saving model checkpoints. Set to "steps" or "epoch".

save_total_limit: Limits the total number of checkpoints saved to avoid filling up your disk.

load_best_model_at_end: A very useful argument. If set to True, the Trainer will load the best-performing model (based on the evaluation metric) at the end of training.

For Speed and Efficiency

fp16: Set to True to enable mixed-precision training. This can significantly speed up training on modern GPUs (like those in Colab) and reduce memory usage.

Finding All Possible Arguments

The TrainingArguments class has many more options. To see a complete list with detailed explanations, you can always check the official Hugging Face documentation. It’s the best resource for exploring everything you can control.

https://huggingface.co/docs/transformers/en/main_classes/trainer

training_args = TrainingArguments(
    output_dir="./gpt2-wikitext-finetuned",  # save model checkpoints
    num_train_epochs=10,
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    weight_decay=0.01,       # helps prevent overfitting
    warmup_steps=500,        # gradual LR increase
    eval_strategy="steps",
    eval_steps=500,          # evaluate every 500 steps
    save_strategy="steps",
    save_steps=500,          # save every 500 steps
    load_best_model_at_end=True,
    save_total_limit=3,      # only keep last 3 checkpoints
    fp16=True,               # mixed precision for speed
    report_to="none"  # disable W&B, TensorBoard, etc.
)

Teaching moment:

Warmup: starts with small learning rate → more stable.

Weight decay: L2 regularization to keep weights small.

Mixed precision (fp16): speeds up training, uses less GPU memory.
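
Whether eval_steps=500 and save_steps=500 are sensible depends on how many optimizer steps an epoch contains. A quick back-of-the-envelope sketch, using the batch size and epoch count above:

import math

# Rough step math: how often will evaluation and checkpointing actually fire?
batch_size = 4                                    # per_device_train_batch_size
steps_per_epoch = math.ceil(len(train_tok) / batch_size)
total_steps = steps_per_epoch * 10                # num_train_epochs
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
print(f"evaluating every 500 steps -> roughly {total_steps // 500} evaluations during training")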

  6. Create the Trainer

# Optional sanity check: print the installed transformers version,
# since argument names (e.g. eval_strategy vs. evaluation_strategy) differ between versions
import transformers
print(transformers.__version__)
from transformers import TrainingArguments
print(TrainingArguments.__module__)

Instantiate Trainer: Combine everything into the Trainer.

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=val_tok,        # Needed for evaluation_strategy
    data_collator=data_collator,
    processing_class=tokenizer,  # newer Trainer versions take the tokenizer via processing_class
)

Explanation:

Trainer automates the training loop, evaluation, saving, logging.

eval_dataset is essential because we set eval_strategy="steps".

If eval_dataset is missing:

The Trainer raises an error as soon as evaluation is triggered, because there is nothing to evaluate.

  7. Train the model

Launch Training: Start the fine-tuning process. Explain that the model’s weights (W) are being updated via backpropagation to minimize a loss function.

import time

# Start timer
start_time = time.perf_counter()

# Train
trainer.train()

# End timer
end_time = time.perf_counter()

# Calculate and format
elapsed_time = end_time - start_time
minutes, seconds = divmod(elapsed_time, 60)
print(f"Total training time: {int(minutes)} min {seconds:.2f} sec")

Generate Text! (The Fun Part): Use the fine-tuned model to generate text. Show how it has adopted the “style” of the training data.

prompt = "Himalaya mountains are "

# Tokenize with attention mask, send to GPU
encoding = tokenizer(prompt, return_tensors="pt", padding=True).to("cuda")

# Set pad token explicitly to avoid confusion
model.config.pad_token_id = tokenizer.pad_token_id

# Generate
outputs = model.generate(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],  
    max_length=100,
    num_return_sequences=1
)

# Decode and print
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
prompt = "photosynthesis is a function"

# Tell tokenizer to use EOS as pad token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

# Tokenize with attention mask, send to GPU
encoding = tokenizer(prompt, return_tensors="pt", padding=True).to("cuda")

# Generate
outputs = model.generate(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    max_length=100,
    num_return_sequences=1,
    do_sample=True,          # sampling must be enabled for the knobs below to have any effect
    temperature=0.7,         # for more variety
    top_k=50,                # sample from top 50 tokens
    repetition_penalty=1.2   # reduce loops
)

# Decode and print
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Make sure pad token is set (do this once)
tokenizer.pad_token = tokenizer.eos_token          # safe: reuse eos as pad
model.config.pad_token_id = tokenizer.pad_token_id

prompt = "Mahatma Gandhi is "
encoding = tokenizer(prompt, return_tensors="pt", padding=True).to("cuda")

outputs = model.generate(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    max_new_tokens=120,         # generate up to 120 new tokens
    do_sample=True,             # enable sampling so temperature/top_k/top_p take effect
    temperature=0.7,            # softness of sampling (0.7 is often good)
    top_k=50,                   # sample from top 50 tokens
    top_p=0.9,                  # or use nucleus sampling
    repetition_penalty=1.15,    # discourage immediate repetition
    no_repeat_ngram_size=3,     # avoid repeating the same 3-gram
    pad_token_id=tokenizer.pad_token_id,  # explicit
    eos_token_id=tokenizer.eos_token_id
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Make sure pad token is set (do this once)
tokenizer.pad_token = tokenizer.eos_token          # safe: reuse eos as pad
model.config.pad_token_id = tokenizer.pad_token_id

prompt = "The Amazon River is"
encoding = tokenizer(prompt, return_tensors="pt", padding=True).to("cuda")

outputs = model.generate(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    max_new_tokens=120,         # generate up to 120 new tokens
    do_sample=False,            # greedy decoding: always pick the most likely next token
                                # (temperature/top_k/top_p apply only when do_sample=True, so they are omitted here)
    repetition_penalty=1.15,    # discourage immediate repetition
    no_repeat_ngram_size=3,     # avoid repeating the same 3-gram
    pad_token_id=tokenizer.pad_token_id,  # explicit
    eos_token_id=tokenizer.eos_token_id
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Make sure pad token is set (do this once)
tokenizer.pad_token = tokenizer.eos_token          # safe: reuse eos as pad
model.config.pad_token_id = tokenizer.pad_token_id

prompt = "In physics, quantum mechanics is"
encoding = tokenizer(prompt, return_tensors="pt", padding=True).to("cuda")

outputs = model.generate(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    max_new_tokens=120,         # generate up to 120 new tokens
    do_sample=True,             # enable sampling so temperature/top_k/top_p take effect
    temperature=0.1,            # very low temperature: near-greedy, focused output
    top_k=50,                   # sample from top 50 tokens
    top_p=0.9,                  # or use nucleus sampling
    repetition_penalty=1.5,     # stronger penalty to discourage repetition
    no_repeat_ngram_size=3,     # avoid repeating the same 3-gram
    pad_token_id=tokenizer.pad_token_id,  # explicit
    eos_token_id=tokenizer.eos_token_id
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Right now, your prompt output is “rambling” because the model:

Was fine-tuned on only a small slice of data (5,000 Wikitext examples), which is far too little for a model of this size.

Doesn’t have domain-specific grounding (it’s just a generic GPT-2 fine-tune).

Uses a generation strategy that still leaves room for randomness (top-k, temperature).

To increase accuracy (more factually correct & coherent answers to prompts like “How does photosynthesis work?”), you can work at three levels:

1 Training Stage – Make the model smarter

More data: Train on larger, cleaner datasets about biology/science instead of generic Wikitext. E.g., SQuAD, a Wikipedia science subset, Khan Academy transcripts.

More epochs: 3–5 passes over the data, with early stopping if eval loss stops improving (see the callback sketch after this list).

Smaller learning rate: E.g., 2e-5 instead of 5e-5 to avoid overwriting pre-trained weights too aggressively.

Use evaluation dataset: So you can monitor overfitting and pick the best checkpoint.

Domain-specific fine-tuning: If your goal is biology Q&A, curate a biology text corpus.
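
One way to implement the early-stopping idea above is transformers' EarlyStoppingCallback. A hedged sketch (it relies on load_best_model_at_end=True and an evaluation strategy, both already set in training_args; the Trainer then tracks eval loss as its default metric):

from transformers import EarlyStoppingCallback, Trainer

# Stop training once eval loss has not improved for 3 consecutive evaluations
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=val_tok,
    data_collator=data_collator,
    processing_class=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()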

2 Generation Stage – Guide the answer

Lower randomness: temperature=0.3 (less creative, more precise), top_k=20, top_p=0.9.

Increase repetition_penalty to avoid loops: repetition_penalty=1.5.

Set max_new_tokens instead of max_length, so the prompt tokens do not eat into the generation budget.

A consolidated sketch with these settings follows below.
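
A hedged sketch that consolidates those settings into one helper (generate_answer is just an illustrative name; the parameter values are the suggestions above and are worth tuning):

def generate_answer(prompt, max_new_tokens=120):
    # Tokenize and move to the same device as the model
    encoding = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)
    outputs = model.generate(
        **encoding,
        max_new_tokens=max_new_tokens,   # budget for new tokens only; the prompt is not counted
        do_sample=True,
        temperature=0.3,                 # less creative, more precise
        top_k=20,
        top_p=0.9,
        repetition_penalty=1.5,          # reduce loops
        pad_token_id=tokenizer.pad_token_id,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_answer("How does photosynthesis work?"))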

3 Prompt Engineering – Ask better

Instead of: How does photosynthesis work?

Try: Explain photosynthesis step-by-step as a science teacher for 10th grade students.

Or: Explain photosynthesis in 5 clear bullet points.

The more context and instruction you give, the more structured and accurate the output.

Recap: why each step in this session matters.

Load dataset → teaches reproducibility and data cleaning.

Tokenizer → explains how models process text as numbers.

Train/val split → introduces concept of evaluation and avoiding overfitting.

Collator → shows how batch padding works.

Training arguments → gives intuition for hyperparameters and resource management.

Trainer → shows benefits of using high-level APIs.

Training loop → connects all components together.