Session 4: HF Transformers - Hands-on Sentiment Analysis
This session focuses on fine-tuning a pre-trained Transformer model for a classification task: sentiment analysis. The goal is to train a model to determine whether a piece of text expresses a positive or negative sentiment. We will use the popular imdb dataset for this task.
What is Sentiment Analysis?
Sentiment analysis is a natural language processing (NLP) technique used to determine the emotional tone behind a body of text. It’s commonly used to understand opinions and feedback in customer reviews, social media comments, and more. We will be fine-tuning an “encoder-based” transformer model, DistilBERT, which is excellent at understanding the context of an entire sentence to make a classification.
- Setup on Google Colab
First, let’s set up our environment. Open a new Google Colab notebook and ensure you have a GPU runtime enabled (Runtime -> Change runtime type -> T4 GPU).
Then, install the necessary libraries with the following commands.
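In Colab, a single cell along these lines is enough (versions are left unpinned here; pin them if you need reproducibility):
!pip install -q transformers datasets evaluate accelerate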
transformers: Provides the pre-trained models (like DistilBERT) and the Trainer API.
datasets: Allows us to easily load and process datasets from the Hugging Face Hub.
evaluate: Contains implementations of common evaluation metrics like accuracy and F1-score.
accelerate: Optimizes PyTorch training loops, enabling features like mixed-precision training (fp16) with minimal code changes.
Now, let’s import the modules we’ll need.
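A sketch of the imports, covering everything used in the snippets that follow:
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)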
What this does:
load_dataset: Fetches the dataset from the Hugging Face Hub.
AutoTokenizer: Loads the correct tokenizer for our chosen model. A tokenizer converts text into numbers (tokens) that the model can process.
AutoModelForSequenceClassification: Loads our pre-trained model with a sequence classification head on top. This head is a simple linear layer that we will fine-tune for our sentiment task.
TrainingArguments & Trainer: These are high-level classes that abstract away the manual training loop, making it easy to configure and run the fine-tuning process.
- Load and Prepare the Dataset
We will use the imdb dataset, which contains 50,000 movie reviews labeled as either positive (1) or negative (0).
# Load the dataset
raw_datasets = load_dataset("imdb")
# Create a smaller subset for faster training (optional, but recommended for a demo)
small_train_dataset = raw_datasets["train"].shuffle(seed=42).select(range(1000))
small_test_dataset = raw_datasets["test"].shuffle(seed=42).select(range(1000))
print("Training data sample:")
print(small_train_dataset[0])
Explanation:
load_dataset("imdb") downloads the dataset, which is already split into train and test sets.
We use shuffle(seed=42).select(range(1000)) to create a smaller, random subset of 1,000 samples for both training and testing. This makes the training process much faster for this hands-on session.
The label field is 0 for a negative review and 1 for a positive review.
If skipped:
Training on the full imdb dataset would take significantly longer, which might not be ideal for a short lab session.
- Load Tokenizer and Model
We need a tokenizer to preprocess our text and a pre-trained model to fine-tune. We’ll use distilbert-base-uncased, a smaller, faster version of BERT that maintains excellent performance.
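A minimal sketch of this step, assuming the variable names model_name, tokenizer, and model that the explanation and later snippets refer to:
model_name = "distilbert-base-uncased"
# The tokenizer and the model must come from the same checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)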
Explanation:
AutoTokenizer.from_pretrained(model_name) fetches the tokenizer that was used when distilbert-base-uncased was originally trained. It’s crucial to use the exact same tokenizer to ensure the model understands the input tokens correctly.
AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2) loads the DistilBERT architecture but replaces the final layer with a new, untrained classification head.
num_labels=2 tells the model that we have two possible output classes: positive and negative.
If skipped or wrong:
If you used a different tokenizer than the one the model was trained on, the token IDs would no longer match the vocabulary the model learned (a vocabulary mismatch), and the model’s performance would be very poor.
If you forgot num_labels=2, the model might load with a different number of output neurons, leading to errors during training when the loss is calculated against your two labels (0 and 1).
- Tokenize the Text
Next, we create a function to tokenize the dataset. This function will take our text reviews and convert them into input_ids and attention_mask.
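A sketch of the tokenization step. The "text" column name comes from the imdb dataset; tokenized_train and tokenized_test are assumed names that the Trainer snippet below reuses:
def tokenize_function(examples):
    # Convert raw review text into input_ids and attention_mask
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_train = small_train_dataset.map(tokenize_function, batched=True)
tokenized_test = small_test_dataset.map(tokenize_function, batched=True)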
Explanation:
padding="max_length": Pads all sentences to the model’s maximum sequence length. This ensures all inputs in a batch have the same size.
truncation=True: Truncates any review longer than the model’s maximum length (512 tokens for DistilBERT).
.map(tokenize_function, batched=True): Applies our tokenize_function to the entire dataset. batched=True processes multiple rows at once for a significant speed-up.
If skipped:
The model cannot process raw text. Without tokenization, you cannot feed the data into the model.
Without padding and truncation, you would get errors because sequences in a batch must have the same length.
- Define Metrics and Training Arguments
Before we train, we need to tell the Trainer how to evaluate our model’s performance and what hyperparameters to use for training.
# Define the metrics we want to compute
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
# Define the training arguments
training_args = TrainingArguments(
output_dir="sentiment_model",
eval_strategy="epoch",
num_train_epochs=2,
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
learning_rate=5e-5,
weight_decay=0.01,
fp16=True, # Enable mixed-precision for faster training on GPUs
report_to="none"
)
Explanation:
evaluate.load("accuracy") loads the accuracy metric from the evaluate library.
The compute_metrics function takes the model’s raw output logits and the true labels, calculates the predictions by finding the index with the highest logit (np.argmax), and returns the accuracy score.
TrainingArguments is a class that holds all the hyperparameters.
output_dir: Where to save the trained model.
eval_strategy="epoch": Run evaluation at the end of each epoch.
num_train_epochs=2: Train for two full passes over the dataset.
learning_rate=5e-5: A common starting learning rate for fine-tuning transformers.
fp16=True: Enables mixed-precision training, which significantly speeds up training on compatible GPUs (like in Colab) and reduces memory usage.
- Create and Run the Trainer
Now we combine everything into the Trainer and start the fine-tuning process.
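A sketch that wires everything together, reusing model, training_args, and compute_metrics from above, plus the tokenized_train and tokenized_test splits assumed in the tokenization step:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics,
)

# Start fine-tuning
trainer.train()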
Explanation:
The Trainer object neatly packages the model, arguments, datasets, and evaluation function.
trainer.train() kicks off the training loop. The Trainer will automatically handle:
Moving data to the GPU.
Performing the forward and backward passes.
Updating the model’s weights.
Evaluating the model on the test set at the end of each epoch.
Printing the training and validation loss and metrics.
- Generate Predictions with the Fine-Tuned Model!
The best part is using our new model. The pipeline function from transformers is the easiest way to do this.
from transformers import pipeline
# Create a sentiment analysis pipeline with our fine-tuned model
sentiment_pipeline = pipeline("sentiment-analysis", model=trainer.model, tokenizer=tokenizer, device=0)
# Test with some examples
print(sentiment_pipeline("This movie was fantastic, I really enjoyed it!"))
print(sentiment_pipeline("The plot was predictable and the acting was terrible."))Explanation:
pipeline("sentiment-analysis", ...) creates a high-level object that handles all the steps for inference: tokenization, passing inputs through the model, and converting the output logits into human-readable labels (LABEL_0 or LABEL_1) and a confidence score.
We pass device=0 to ensure the pipeline runs on the GPU for faster inference.
You should see that the model correctly classifies the positive and negative sentences with high confidence!