🤗 Hugging Face Ecosystem

Learn about Transformers, Datasets, and Spaces - the essential tools for modern AI development


🚀 Transformers Library

📚 Getting Started
Hugging Face Transformers is a library that provides pre-trained models for NLP, computer vision, and audio tasks.
Example: Load and use a pre-trained model in just a few lines of code!
# Simple sentiment analysis
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I love Hugging Face!")
print(result)  # [{'label': 'POSITIVE', 'score': 0.99}]
Learn to use specific models and tokenizers, and customize pipelines for your needs.
# Custom model and tokenizer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer(
    "Hello, Hugging Face!",
    padding=True,
    truncation=True,
    return_tensors="pt"
)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
Fine-tune models, implement custom architectures, and optimize for production deployment.
# Fine-tuning with Trainer API
from transformers import (
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)
from datasets import load_dataset

# Load and preprocess dataset
dataset = load_dataset("imdb")

def preprocess_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=512
    )

tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
    hub_model_id="my-awesome-model"
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer)
)

# Train and push to hub
trainer.train()
trainer.push_to_hub()
🎯 Model Selection
Choose the right model for your task from thousands of pre-trained models available on the Hub.

  • Text Classification: BERT, RoBERTa, DistilBERT
  • Generation: GPT-2, T5, BART
  • Question Answering: BERT, ALBERT, Longformer
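
The Hub can also be searched programmatically rather than through the website. Below is a minimal sketch using huggingface_hub.list_models; the task filter and result limit are illustrative choices, not part of the guide above.

# Browse the Hub for candidate models (illustrative sketch)
from huggingface_hub import list_models

# Five most-downloaded models for a given task
for model in list_models(filter="text-classification", sort="downloads", direction=-1, limit=5):
    print(model.id)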

Understand model architectures, compare performance metrics, and select models based on your requirements.
Model         | Size        | Speed      | Accuracy
DistilBERT    | 66M params  | 60% faster | 97% of BERT
BERT-base     | 110M params | Baseline   | Baseline
RoBERTa-large | 355M params | Slower     | SOTA on many tasks
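
Published numbers like these are a guide; a quick local timing run shows how the trade-off looks on your own hardware. A minimal sketch, assuming two sentiment checkpoints purely for illustration:

# Rough latency comparison between two checkpoints (illustrative sketch)
import time
from transformers import pipeline

text = "I love Hugging Face!"
for name in [
    "distilbert-base-uncased-finetuned-sst-2-english",   # assumed smaller model
    "nlptown/bert-base-multilingual-uncased-sentiment",  # assumed larger model
]:
    clf = pipeline("sentiment-analysis", model=name)
    start = time.perf_counter()
    for _ in range(100):  # small CPU-friendly loop
        clf(text)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed / 100 * 1000:.1f} ms per call")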
Implement model ensembles, quantization, and optimization techniques for production deployment.
# Model optimization with ONNX
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Convert to ONNX
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True
)

# Quantization for faster inference
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False)
quantizer.quantize(save_dir="quantized_model", quantization_config=qconfig)
🔧 Pipelines
Pipelines are the easiest way to use models for inference on various tasks.
# Available pipelines
from transformers import pipeline

# Text generation
generator = pipeline("text-generation")

# Named entity recognition
ner = pipeline("ner")

# Summarization
summarizer = pipeline("summarization")

# Translation
translator = pipeline("translation_en_to_fr")
Customize pipelines with specific models, configure parameters, and handle batch processing.
# Custom pipeline configuration
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
    device=0  # Use GPU
)

# Batch processing
texts = [
    "This product is amazing!",
    "Terrible experience, would not recommend.",
    "It's okay, nothing special."
]

results = classifier(
    texts,
    batch_size=8,
    truncation=True,
    max_length=512
)
Build custom pipelines, implement streaming, and optimize for real-time applications.
# Custom pipeline class
from transformers import Pipeline
import torch

class CustomPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs = {}
        if "max_length" in kwargs:
            preprocess_kwargs["max_length"] = kwargs["max_length"]
        return preprocess_kwargs, {}, {}

    def preprocess(self, inputs, max_length=512):
        return self.tokenizer(
            inputs,
            return_tensors="pt",
            truncation=True,
            max_length=max_length
        )

    def _forward(self, model_inputs):
        with torch.no_grad():
            return self.model(**model_inputs)

    def postprocess(self, model_outputs):
        logits = model_outputs.logits
        probabilities = torch.nn.functional.softmax(logits, dim=-1)
        return probabilities.tolist()
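
For the streaming part, transformers provides streamer utilities that emit tokens as they are generated rather than waiting for the full sequence. A minimal sketch with TextStreamer; the gpt2 checkpoint and the prompt are illustrative assumptions:

# Stream generated tokens as they are produced (illustrative sketch)
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_name = "gpt2"  # assumed small model for the example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Hugging Face pipelines make it easy to", return_tensors="pt")
streamer = TextStreamer(tokenizer, skip_prompt=True)

# Tokens are printed to stdout as soon as they are decoded
model.generate(**inputs, streamer=streamer, max_new_tokens=40)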

📊 Datasets Library

📁 Loading Data
Load popular datasets with a single line of code from the Hugging Face Hub.
# Load a dataset
from datasets import load_dataset

# Load IMDB movie reviews
dataset = load_dataset("imdb")

# Access train and test splits
train_data = dataset["train"]
test_data = dataset["test"]

# View first example
print(train_data[0])
Work with custom datasets, stream large datasets, and process data efficiently.
# Load custom CSV/JSON files
from datasets import load_dataset, Dataset

# From CSV
dataset = load_dataset("csv", data_files="my_data.csv")

# From JSON
dataset = load_dataset("json", data_files="my_data.jsonl")

# Create from dictionary
data_dict = {
    "text": ["Example 1", "Example 2"],
    "label": [0, 1]
}
dataset = Dataset.from_dict(data_dict)

# Streaming for large datasets
dataset = load_dataset(
    "c4",
    "en",
    streaming=True,
    split="train"
)
Build data pipelines with advanced preprocessing, caching, and distributed processing.
# Advanced data processing pipeline
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer
import multiprocessing

# Load and split dataset
dataset = load_dataset("squad")

# Tokenizer used by the preprocessing function below
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Advanced preprocessing
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )
    return inputs

# Parallel processing
tokenized_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=multiprocessing.cpu_count(),
    remove_columns=dataset["train"].column_names,
    desc="Tokenizing"
)

# Save to disk for caching
tokenized_datasets.save_to_disk("tokenized_squad")

# Load from cache
cached_dataset = DatasetDict.load_from_disk("tokenized_squad")
🔄 Data Processing
Transform and prepare your data for model training with simple operations.
# Basic data operations
from datasets import load_dataset

dataset = load_dataset("imdb")

# Filter examples
positive_reviews = dataset["train"].filter(
    lambda x: x["label"] == 1
)

# Select subset
small_dataset = dataset["train"].select(range(1000))

# Shuffle data
shuffled = dataset["train"].shuffle(seed=42)
Apply complex transformations, handle multi-modal data, and optimize processing speed.
# Complex transformations
def lowercase_text(example):
    example["text"] = example["text"].lower()
    return example

# Apply transformation
dataset = dataset.map(lowercase_text)

# Batch processing for speed
def batch_tokenize(examples):
    return tokenizer(
        examples["text"],
        padding=True,
        truncation=True
    )

tokenized = dataset.map(
    batch_tokenize,
    batched=True,
    batch_size=1000
)

# Format the tokenized dataset for PyTorch
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])
Implement custom data collators, augmentation strategies, and distributed data loading.
# Custom data collator with augmentation
from dataclasses import dataclass
from typing import Dict, List
import random
import torch
from transformers import AutoTokenizer

@dataclass
class CustomDataCollator:
    tokenizer: AutoTokenizer
    max_length: int = 512
    augment: bool = True

    def __call__(self, features: List[Dict]) -> Dict:
        # Apply augmentation
        if self.augment:
            features = self.augment_batch(features)

        # Tokenize batch
        batch = self.tokenizer(
            [f["text"] for f in features],
            padding=True,
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt"
        )
        batch["labels"] = torch.tensor([f["label"] for f in features])
        return batch

    def augment_batch(self, features):
        # Custom augmentation logic: random word swap, deletion, etc.
        for feature in features:
            if random.random() > 0.5:
                words = feature["text"].split()
                random.shuffle(words)
                feature["text"] = " ".join(words[:len(words) // 2])
        return features
💾 Dataset Hub
Browse and use thousands of datasets from the Hugging Face Hub for various tasks.

  • Text Datasets: IMDB, SQuAD, GLUE, WikiText
  • Vision Datasets: ImageNet, COCO, CIFAR
  • Audio Datasets: LibriSpeech, Common Voice
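
Browsing can be scripted as well. A minimal sketch using huggingface_hub.list_datasets; the search keyword and result limit are illustrative assumptions:

# Search the Hub for datasets from code (illustrative sketch)
from huggingface_hub import list_datasets

for ds in list_datasets(search="sentiment", limit=5):
    print(ds.id)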

Upload your own datasets, manage versions, and collaborate with the community.
# Push dataset to Hub
from datasets import Dataset
from huggingface_hub import login

# Login to Hugging Face
login()

# Create dataset
data = {
    "text": ["Example 1", "Example 2"],
    "label": ["positive", "negative"]
}
dataset = Dataset.from_dict(data)

# Push to Hub
dataset.push_to_hub("my-awesome-dataset")
Create dataset cards, implement data validation, and manage large-scale datasets.

Dataset Card Template

  • Dataset Description
  • Languages and Task Categories
  • Dataset Structure
  • Data Fields and Statistics
  • Collection Process
  • Licensing Information
  • Citation
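
The card itself can also be generated from code rather than written by hand. A minimal sketch using huggingface_hub.DatasetCard with its default template; the metadata values and repository id are illustrative assumptions:

# Create and push a dataset card from code (illustrative sketch)
from huggingface_hub import DatasetCard, DatasetCardData

card_data = DatasetCardData(
    language="en",
    license="mit",
    task_categories=["text-classification"],
)
card = DatasetCard.from_template(
    card_data,
    pretty_name="My Awesome Dataset",  # extra template variables are passed as kwargs
)
card.push_to_hub("username/my-awesome-dataset")  # hypothetical repo id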

🎨 Spaces & Deployment

🚀 Gradio Apps
Create interactive ML demos with Gradio and deploy them on Hugging Face Spaces.
# Simple Gradio app
import gradio as gr
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

def analyze_sentiment(text):
    result = classifier(text)[0]
    return f"{result['label']}: {result['score']:.2f}"

iface = gr.Interface(
    fn=analyze_sentiment,
    inputs="text",
    outputs="text",
    title="Sentiment Analyzer"
)
iface.launch()
Build complex interfaces with multiple inputs/outputs and custom components.
# Advanced Gradio interface
import gradio as gr
from transformers import pipeline

generator = pipeline("text-generation")
summarizer = pipeline("summarization")

def process_text(text, task, max_length):
    max_length = int(max_length)  # slider values arrive as numbers; generation expects an int
    if task == "Generate":
        result = generator(text, max_length=max_length)[0]["generated_text"]
    else:
        result = summarizer(text, max_length=max_length)[0]["summary_text"]
    return result

with gr.Blocks() as demo:
    gr.Markdown("# Text Processing Demo")
    with gr.Row():
        with gr.Column():
            input_text = gr.Textbox(label="Input Text", lines=5)
            task = gr.Radio(["Generate", "Summarize"], label="Task")
            max_len = gr.Slider(50, 200, value=100, label="Max Length")
            submit_btn = gr.Button("Process")
        with gr.Column():
            output = gr.Textbox(label="Output", lines=5)

    submit_btn.click(process_text, [input_text, task, max_len], output)

demo.launch()
Deploy production-ready applications with authentication, caching, and scaling.

Production Deployment

  • Configure requirements.txt
  • Set up environment variables
  • Implement caching strategies
  • Add authentication
  • Monitor usage and performance
  • Scale with hardware upgrades
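
Several of these items can be handled inside the Gradio app itself. A minimal sketch of example caching and simple password auth; the credentials, environment variable names, and example inputs are illustrative assumptions (on Spaces, store secrets as environment variables rather than in code):

# Example caching and basic auth in a Gradio app (illustrative sketch)
import os
import gradio as gr
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

def analyze(text):
    result = classifier(text)[0]
    return f"{result['label']}: {result['score']:.2f}"

demo = gr.Interface(
    fn=analyze,
    inputs="text",
    outputs="text",
    examples=["I love Hugging Face!", "This is terrible."],
    cache_examples=True,  # pre-compute outputs for the example inputs
)

# Read credentials from environment variables (set as Space secrets)
demo.launch(auth=(os.environ.get("APP_USER", "admin"),
                  os.environ.get("APP_PASS", "change-me")))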
🎯 Model Hub
Share your models with the community and use models from thousands of contributors.
# Push model to Hub
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# After training...
model.push_to_hub("my-awesome-model")
tokenizer.push_to_hub("my-awesome-model")
Create model cards, manage versions, and collaborate on model development.

Model Card Sections

  • Model Description
  • Intended Use & Limitations
  • Training Data & Procedure
  • Evaluation Results
  • Environmental Impact
  • Citation & License
Implement model versioning, A/B testing, and continuous integration for models.
# Advanced model management
from huggingface_hub import HfApi, ModelCard, ModelCardData

api = HfApi()

# Create model card
card = ModelCard.from_template(
    card_data=ModelCardData(
        language="en",
        license="apache-2.0",
        tags=["text-classification", "sentiment-analysis"],
        datasets=["imdb"],
        metrics=["accuracy", "f1"]
    ),
    template_path="modelcard_template.md"
)
card.push_to_hub("username/model-name")

# Tag versions
api.create_tag(
    repo_id="username/model-name",
    tag="v1.0",
    tag_message="First production release"
)
🔗 Inference API
Use the Inference API to run models directly from the Hub without downloading them.
# Using Inference API
import requests

API_URL = "https://api-inference.huggingface.co/models/bert-base-uncased"
headers = {"Authorization": "Bearer YOUR_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

# bert-base-uncased is a fill-mask model, so the input needs a [MASK] token
output = query({"inputs": "I love [MASK] Face!"})
Configure inference endpoints, handle batching, and optimize for production use.

Inference Endpoints

  • Dedicated infrastructure
  • Auto-scaling capabilities
  • Custom container support
  • Private endpoints
  • Monitoring & logging
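
For batching against the serverless API, the same HTTP endpoint accepts a list of inputs in one request. A minimal sketch; the model id is an illustrative assumption:

# Batched requests against the Inference API (illustrative sketch)
import requests

API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": "Bearer YOUR_TOKEN"}

payload = {
    "inputs": [
        "I love Hugging Face!",
        "This movie was a disappointment.",
    ],
    "options": {"wait_for_model": True},  # wait if the model is still loading
}
response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())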
Deploy custom inference endpoints with optimized serving and monitoring.
# Custom inference handler
from typing import Dict, List, Any

class EndpointHandler:
    def __init__(self, path=""):
        from transformers import pipeline
        self.pipeline = pipeline("text-classification", model=path)

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        inputs = data.pop("inputs", data)
        parameters = data.pop("parameters", {})

        # Run inference
        predictions = self.pipeline(inputs, **parameters)

        # Post-process
        return predictions