Introduction to the Hugging Face Ecosystem: Datasets and Tokenizers
Part 1: The Hugging Face Ecosystem
Hugging Face provides open-source tools for building modern machine learning models, and its ecosystem is a central part of the Natural Language Processing (NLP) community. Its main components are:
- The Hub: A central repository for thousands of pre-trained models and datasets.
- Transformers library: Provides the models themselves, such as BERT and GPT-2.
- Datasets library: For efficiently loading and processing very large datasets.
- Tokenizers library: For converting text into numerical inputs that a model can process.
This notebook focuses on the Datasets and Tokenizers libraries. We will load a dataset and train a new tokenizer from scratch.
Part 2: Hands-On with the Datasets Library
The Datasets library provides a standardized and memory-efficient way to work with data.
2.1 Loading a Dataset from the Hub
We can load any public dataset from the Hub with a single command. We will use wikitext, a large collection of text from Wikipedia.
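A minimal sketch of that command is shown below. Note that wikitext requires a configuration name; we assume the wikitext-103-raw-v1 configuration here (a smaller config such as wikitext-2-raw-v1 works the same way).
# Load the wikitext corpus from the Hub (configuration name assumed)
from datasets import load_dataset
raw_datasets = load_dataset("wikitext", "wikitext-103-raw-v1")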
2.2 Exploring the Dataset Object
The loaded object is a DatasetDict containing different data splits (e.g., train, validation).
# View the dataset structure
print(raw_datasets)
# Access the training split
train_split = raw_datasets['train']
print(f"\nTraining split info: {train_split}")
# View the features (columns) of the dataset
print(f"\nFeatures: {train_split.features}")
# View a single example from the training data
print("\nFirst example:")
print(raw_datasets['train'][1])Part 3: Understanding and Building Tokenizers
Models understand numbers, not text. A tokenizer is a translator that converts text into a sequence of numbers (IDs). Modern tokenizers use a subword strategy, breaking rare words into smaller, known pieces (e.g., “tokenization” -> “token”, “##ization”).
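As a quick illustration of subword splitting (separate from the tokenizer we train below), the snippet assumes the transformers library and the pre-trained bert-base-uncased checkpoint are available; the exact split depends on that model's vocabulary.
# Illustrative only: a pre-trained WordPiece tokenizer splitting a rarer word into subwords
from transformers import AutoTokenizer
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tokenizer.tokenize("tokenization"))  # e.g. ['token', '##ization']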
3.1 Training a Custom Tokenizer
We will train a new tokenizer on the wikitext corpus.
Step 1: Create a Text Corpus Iterator
To avoid loading all data into memory, we create a function that provides text to the tokenizer batch by batch.
# This function yields batches of text from our dataset
def get_training_corpus():
    batch_size = 1000
    for i in range(0, len(raw_datasets["train"]), batch_size):
        yield raw_datasets["train"][i : i + batch_size]["text"]
# Test the iterator
text_iterator = get_training_corpus()
print(next(text_iterator)[5:10]) # Print a few lines from the first batch
Step 2: Initialize and Train the Tokenizer
We will build a Byte-Pair Encoding (BPE) tokenizer, a common subword algorithm.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# 1. Initialize a blank tokenizer with a BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# 2. Set the pre-tokenizer, which splits text into words
tokenizer.pre_tokenizer = Whitespace()
# 3. Define the trainer
# vocab_size is the total number of subword units the tokenizer can have
# special_tokens are reserved tokens with specific meanings for the model
trainer = BpeTrainer(vocab_size=25000, special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"])
# 4. Train the tokenizer on our text corpus
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
print("Training complete.")Step 3: Save the Tokenizer
We save the trained tokenizer’s configuration to a file so we can reuse it later.
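A minimal sketch, matching the custom_tokenizer/tokenizer.json path used in the next step; we create the directory first, since the save call assumes it already exists.
# Save the trained tokenizer's configuration to disk
import os
os.makedirs("custom_tokenizer", exist_ok=True)
tokenizer.save("custom_tokenizer/tokenizer.json")
print("Tokenizer saved.")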
3.2 Using Our New Tokenizer
Let’s test our trained tokenizer by encoding a new sentence.
# Load the tokenizer from the saved file
loaded_tokenizer = Tokenizer.from_file("custom_tokenizer/tokenizer.json")
# Encode a sample sentence
sentence = "This is a test of our new tokenizer."
output = loaded_tokenizer.encode(sentence)
print(f"Sentence: {sentence}")
print(f"\nTokens (subwords): {output.tokens}")
print(f"Token IDs (numbers): {output.ids}")
# We can also decode the IDs back to text
decoded_sentence = loaded_tokenizer.decode(output.ids)
print(f"\nDecoded sentence: {decoded_sentence}")Part 4: Conclusion and Next Steps
We have successfully used the Datasets library to load a large text corpus. Using that data, we trained a custom subword tokenizer from scratch and saved it for future use.
This tokenizer is a critical component for the next stage in the NLP pipeline: training a Transformer model from scratch or fine-tuning a pre-trained model on a new domain.