Introduction to the Hugging Face Ecosystem: Datasets and Tokenizers

Part 1: The Hugging Face Ecosystem

Hugging Face provides open-source tools for building modern machine learning models, and its ecosystem has become a central part of the Natural Language Processing (NLP) community. Its main components are:

  • The Hub: A central repository for thousands of pre-trained models and datasets.
  • Transformers library: Provides the models themselves, such as BERT and GPT-2.
  • Datasets library: For efficiently loading and processing very large datasets.
  • Tokenizers library: For converting text into numerical inputs that a model can process.

This notebook focuses on the Datasets and Tokenizers libraries. We will load a dataset and train a new tokenizer from scratch.

# Install required libraries
!pip install -q datasets tokenizers transformers

Part 2: Hands-On with the Datasets Library

The Datasets library provides a standardized and memory-efficient way to work with data.

2.1 Loading a Dataset from the Hub

We can load any public dataset from the Hub with a single command. Here we use wikitext (specifically the wikitext-2-raw-v1 configuration), a corpus of text extracted from Wikipedia articles.

from datasets import load_dataset

# Load a raw text dataset
raw_datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
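
Under the hood, the loaded data is stored as Apache Arrow files on disk and memory-mapped, so even large corpora do not have to fit in RAM. As an optional check, the Dataset object exposes a couple of introspection attributes (the exact values depend on your library version and cache location):

# Where the on-disk Arrow cache lives, and how large the Arrow table is
print(raw_datasets['train'].cache_files)
print(f"Arrow table size in bytes: {raw_datasets['train'].dataset_size}")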

2.2 Exploring the Dataset Object

The loaded object is a DatasetDict containing different data splits (e.g., train, validation).

# View the dataset structure
print(raw_datasets)

# Access the training split
train_split = raw_datasets['train']
print(f"\nTraining split info: {train_split}")

# View the features (columns) of the dataset
print(f"\nFeatures: {train_split.features}")

# View a single example from the training data
# (wikitext contains many blank rows, including row 0, so we look at index 1)
print("\nExample at index 1:")
print(raw_datasets['train'][1])
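
Note that wikitext contains many empty rows (blank lines between articles), which is why we looked at index 1 above. If you want to drop them, Dataset methods such as filter process the data row by row without loading everything into memory. A minimal sketch (non_empty is just a throwaway variable name; the rest of the notebook keeps using the unfiltered data):

# Keep only rows whose text is not blank
non_empty = train_split.filter(lambda example: len(example["text"].strip()) > 0)
print(f"Rows before filtering: {len(train_split)}, after: {len(non_empty)}")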

Part 3: Understanding and Building Tokenizers

Models understand numbers, not text. A tokenizer is a translator that converts text into a sequence of numbers (IDs). Modern tokenizers use a subword strategy, breaking rare words into smaller, known pieces (e.g., “tokenization” -> “token”, “##ization”).
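
To see subword splitting in action before training our own tokenizer, here is a quick illustration using a pre-trained WordPiece tokenizer from the Transformers library (this downloads bert-base-uncased from the Hub, so it assumes network access; the split shown in the comment is typical but depends on the checkpoint):

from transformers import AutoTokenizer

# A pre-trained tokenizer splits a rare word into known subword pieces
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tokenizer.tokenize("tokenization"))  # e.g. ['token', '##ization']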

3.1 Training a Custom Tokenizer

We will train a new tokenizer on the wikitext corpus.

Step 1: Create a Text Corpus Iterator

To avoid loading all data into memory, we create a function that provides text to the tokenizer batch by batch.

# This function yields batches of text from our dataset
def get_training_corpus():
    batch_size = 1000
    for i in range(0, len(raw_datasets["train"]), batch_size):
        yield raw_datasets["train"][i : i + batch_size]["text"]

# Test the iterator
text_iterator = get_training_corpus()
print(next(text_iterator)[5:10]) # Print a few lines from the first batch

Step 2: Initialize and Train the Tokenizer

We will build a Byte-Pair Encoding (BPE) tokenizer, a common subword algorithm.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# 1. Initialize a blank tokenizer with a BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# 2. Set the pre-tokenizer, which splits text into words
tokenizer.pre_tokenizer = Whitespace()

# 3. Define the trainer
# vocab_size is the total number of subword units the tokenizer can have
# special_tokens are reserved tokens with specific meanings for the model
trainer = BpeTrainer(vocab_size=25000, special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"])

# 4. Train the tokenizer on our text corpus
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

print("Training complete.")
Step 3: Save the Tokenizer

We save the trained tokenizer’s configuration to a file so we can reuse it later.

# Create a directory to save the tokenizer
!mkdir -p custom_tokenizer

# Save the tokenizer
tokenizer.save("custom_tokenizer/tokenizer.json")

print("Tokenizer saved to custom_tokenizer/tokenizer.json")

3.2 Using Our New Tokenizer

Let’s test our trained tokenizer by encoding a new sentence.

# Load the tokenizer from the saved file
loaded_tokenizer = Tokenizer.from_file("custom_tokenizer/tokenizer.json")

# Encode a sample sentence
sentence = "This is a test of our new tokenizer."
output = loaded_tokenizer.encode(sentence)

print(f"Sentence: {sentence}")
print(f"\nTokens (subwords): {output.tokens}")
print(f"Token IDs (numbers): {output.ids}")

# We can also decode the IDs back into text
decoded_sentence = loaded_tokenizer.decode(output.ids)
print(f"\nDecoded sentence: {decoded_sentence}")

Part 4: Conclusion and Next Steps

We have used the Datasets library to load a text corpus from the Hub. Using that data, we trained a custom subword (BPE) tokenizer from scratch and saved it for future use.

This tokenizer is a critical component for the next stage in the NLP pipeline: training a Transformer model from scratch or fine-tuning a pre-trained model on a new domain.
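
If you plan to use this tokenizer with the Transformers library, a common next step is to wrap the saved tokenizer.json in a PreTrainedTokenizerFast. A minimal sketch, where the special-token assignments simply mirror the tokens we reserved during training:

from transformers import PreTrainedTokenizerFast

# Wrap the raw tokenizer so it exposes the standard Transformers tokenizer interface
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="custom_tokenizer/tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# The wrapped tokenizer can now be passed to Transformers training utilities
print(wrapped_tokenizer("This is a test of our new tokenizer."))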