🔍 Computer Vision Mastery

Learn image processing, CNNs, and vision AI with interactive tutorials and hands-on practice


What is Computer Vision?

Computer Vision is a field of artificial intelligence that enables computers to interpret and understand visual information from the world around us. It mimics human vision by teaching machines to identify objects, understand scenes, and extract meaningful information from images and videos.

Key Concepts

  • Image Processing: Manipulating and enhancing images to extract useful information
  • Feature Detection: Identifying important patterns, edges, corners, and textures in images
  • Pattern Recognition: Classifying objects and scenes based on learned patterns
  • Deep Learning: Using neural networks to automatically learn visual features

How Images Work in Computers

Digital images are represented as matrices of pixels, where each pixel contains color information:

import numpy as np
import cv2
from PIL import Image

# Load an image
image = cv2.imread('example.jpg')

# Image shape: (height, width, channels)
print(f"Image shape: {image.shape}")

# Color images have three channels
height, width, channels = image.shape
print(f"Height: {height}, Width: {width}, Channels: {channels}")

# Convert BGR to RGB (OpenCV uses BGR by default)
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

[Interactive demo: Image Pixel Visualization]

Common Applications

  • Image Classification: Categorizing images (e.g., cat vs dog)
  • Object Detection: Finding and localizing objects in images
  • Face Recognition: Identifying specific individuals
  • Medical Imaging: Analyzing X-rays, MRIs, and CT scans
  • Autonomous Vehicles: Understanding road scenes and obstacles
  • Quality Control: Detecting defects in manufacturing

Image Processing Fundamentals

Image processing involves manipulating images to enhance them or extract useful information. These techniques form the foundation of computer vision systems.

Basic Image Operations

import cv2
import numpy as np
from scipy import ndimage
import matplotlib.pyplot as plt

# Load and basic operations
image = cv2.imread('input.jpg')
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Resize image
resized = cv2.resize(image, (224, 224))

# Crop image
cropped = image[100:300, 150:350]  # [y1:y2, x1:x2]

# Rotate image
rows, cols = gray_image.shape[:2]
rotation_matrix = cv2.getRotationMatrix2D((cols/2, rows/2), 45, 1)
rotated = cv2.warpAffine(gray_image, rotation_matrix, (cols, rows))

Filtering and Enhancement

Filters help remove noise and enhance important features in images:

# Gaussian blur (noise reduction)
blurred = cv2.GaussianBlur(image, (15, 15), 0)

# Edge detection with Canny
edges = cv2.Canny(gray_image, 50, 150)

# Sharpening kernel
sharpening_kernel = np.array([[-1, -1, -1],
                              [-1,  9, -1],
                              [-1, -1, -1]])
sharpened = cv2.filter2D(image, -1, sharpening_kernel)

# Histogram equalization (contrast enhancement)
equalized = cv2.equalizeHist(gray_image)

[Interactive demo: Image Filter Demo]

Color Space Transformations

Different color spaces are useful for different computer vision tasks:

# RGB to HSV (Hue, Saturation, Value)
hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)

# RGB to LAB color space
lab = cv2.cvtColor(image, cv2.COLOR_BGR2LAB)

# Extract color channels
b, g, r = cv2.split(image)
h, s, v = cv2.split(hsv)

# Color thresholding in HSV
lower_blue = np.array([100, 50, 50])
upper_blue = np.array([130, 255, 255])
blue_mask = cv2.inRange(hsv, lower_blue, upper_blue)

🎯 Practice Exercise: Image Enhancement

Try implementing these image processing techniques:

  1. Load an image and convert it to grayscale
  2. Apply different filters (blur, sharpen, edge detection)
  3. Adjust brightness and contrast
  4. Create a color mask to isolate specific colors
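
A minimal starting point for steps 3 and 4, assuming the OpenCV setup from the snippets above ('input.jpg' is a placeholder path and the HSV thresholds are illustrative):

import cv2
import numpy as np

image = cv2.imread('input.jpg')                    # placeholder path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Step 3: brightness/contrast via a linear transform: output = alpha * pixel + beta
adjusted = cv2.convertScaleAbs(image, alpha=1.3, beta=40)  # alpha = contrast, beta = brightness

# Step 4: isolate red-ish pixels with an HSV range mask (example thresholds)
hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv, np.array([0, 120, 70]), np.array([10, 255, 255]))
red_only = cv2.bitwise_and(image, image, mask=mask)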

Convolutional Neural Networks (CNNs)

CNNs are the backbone of modern computer vision. They learn hierarchical visual features automatically through stacked convolutional layers, which makes them well suited to image-related tasks.

CNN Architecture

A typical CNN consists of several types of layers working together:

  • Convolutional Layers: Apply filters to detect features
  • Pooling Layers: Reduce spatial dimensions
  • Activation Functions: Introduce non-linearity (ReLU)
  • Fully Connected Layers: Make final classifications

import tensorflow as tf
from tensorflow.keras import layers, models

# Building a simple CNN
model = models.Sequential([
    # First convolutional block
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    layers.MaxPooling2D((2, 2)),

    # Second convolutional block
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),

    # Third convolutional block
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),

    # Classification layers
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')  # 10 classes
])

# Compile the model
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

model.summary()

[Interactive demo: CNN Layer Visualization]

Popular CNN Architectures

1. LeNet-5 (1998)

One of the first successful CNNs, designed for handwritten digit recognition.
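
For reference, a minimal Keras sketch of the LeNet-5 layout (conv-pool-conv-pool followed by three dense layers; layer sizes follow the original paper, which used 32×32 grayscale inputs and tanh activations):

from tensorflow.keras import layers, models

lenet5 = models.Sequential([
    layers.Conv2D(6, (5, 5), activation='tanh', input_shape=(32, 32, 1)),
    layers.AveragePooling2D((2, 2)),
    layers.Conv2D(16, (5, 5), activation='tanh'),
    layers.AveragePooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(120, activation='tanh'),
    layers.Dense(84, activation='tanh'),
    layers.Dense(10, activation='softmax')  # 10 digit classes
])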

2. AlexNet (2012)

Breakthrough architecture that won ImageNet 2012, proving the power of deep CNNs.

3. VGGNet (2014)

Demonstrated that depth matters - used very small (3×3) filters throughout.

# VGG-like architecture
def create_vgg_block(filters, num_convs):
    block = models.Sequential()
    for _ in range(num_convs):
        block.add(layers.Conv2D(filters, (3, 3), activation='relu', padding='same'))
    block.add(layers.MaxPooling2D((2, 2)))
    return block

# Build VGG-style model
vgg_model = models.Sequential([
    create_vgg_block(64, 2),    # Block 1
    create_vgg_block(128, 2),   # Block 2
    create_vgg_block(256, 3),   # Block 3
    create_vgg_block(512, 3),   # Block 4
    create_vgg_block(512, 3),   # Block 5
    layers.Flatten(),
    layers.Dense(4096, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(4096, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1000, activation='softmax')
])

4. ResNet (2015)

Introduced skip connections, enabling training of very deep networks (up to 152 layers).

# ResNet block with skip connection
def resnet_block(x, filters):
    # Main path
    y = layers.Conv2D(filters, (3, 3), padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, (3, 3), padding='same')(y)
    y = layers.BatchNormalization()(y)

    # Skip connection: match channel count with a 1x1 convolution if needed
    if x.shape[-1] != filters:
        x = layers.Conv2D(filters, (1, 1))(x)

    # Add skip connection
    out = layers.Add()([x, y])
    out = layers.ReLU()(out)
    return out

🎯 Practice Exercise: Build Your CNN

Create a CNN for image classification:

  1. Design a CNN with 3-4 convolutional layers
  2. Add appropriate pooling and dropout layers
  3. Train on a small dataset (CIFAR-10 or Fashion-MNIST)
  4. Visualize the learned filters and feature maps
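
A possible starting point for steps 3 and 4, sketched here with a small CNN sized for CIFAR-10's 32×32 images (hyperparameters are illustrative, not tuned):

import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras import layers, models

# Step 3: load CIFAR-10; pixel values scaled to [0, 1], labels kept as integers
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # integer labels
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=64,
          validation_data=(x_test, y_test))

# Step 4: visualize the first-layer filters
filters, _ = model.layers[0].get_weights()             # shape (3, 3, 3, 32)
filters = (filters - filters.min()) / (filters.max() - filters.min())
for i in range(8):
    plt.subplot(1, 8, i + 1)
    plt.imshow(filters[:, :, :, i])
    plt.axis('off')
plt.show()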

Object Detection

Object detection goes beyond classification - it finds and localizes multiple objects within an image, providing both "what" and "where" information.

Detection vs Classification vs Segmentation

  • Classification: What's in the image? (cat)
  • Localization: Where is the object? (bounding box)
  • Detection: What objects and where? (multiple bounding boxes)
  • Segmentation: Pixel-level object boundaries

YOLO (You Only Look Once)

YOLO is a popular real-time object detection system that treats detection as a regression problem.

import cv2
import numpy as np

# Load a Darknet-format YOLO model (e.g. YOLOv3/v4 weights + cfg;
# YOLOv5 is distributed as PyTorch weights and cannot be loaded this way)
def load_yolo():
    net = cv2.dnn.readNet("yolov3.weights", "yolov3.cfg")

    # Load class names
    with open("coco.names", "r") as f:
        classes = [line.strip() for line in f.readlines()]

    return net, classes

def detect_objects(image, net, classes):
    height, width = image.shape[:2]

    # Create blob from image
    blob = cv2.dnn.blobFromImage(image, 1/255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)

    # Run forward pass over all YOLO output layers
    outputs = net.forward(net.getUnconnectedOutLayersNames())

    boxes = []
    confidences = []
    class_ids = []

    for output in outputs:
        for detection in output:
            scores = detection[5:]
            class_id = np.argmax(scores)
            confidence = scores[class_id]

            if confidence > 0.5:  # Confidence threshold
                # Extract bounding box coordinates
                center_x = int(detection[0] * width)
                center_y = int(detection[1] * height)
                w = int(detection[2] * width)
                h = int(detection[3] * height)

                # Calculate top-left corner
                x = int(center_x - w/2)
                y = int(center_y - h/2)

                boxes.append([x, y, w, h])
                confidences.append(float(confidence))
                class_ids.append(class_id)

    return boxes, confidences, class_ids

[Interactive demo: Object Detection Demo]

R-CNN Family

Region-based CNNs use a two-stage approach: first generate region proposals, then classify each region.

Evolution of R-CNN:

  1. R-CNN (2014): Selective search + CNN classification
  2. Fast R-CNN (2015): Shared convolutional features
  3. Faster R-CNN (2015): Learned region proposals with RPN
  4. Mask R-CNN (2017): Added instance segmentation

# Using torchvision's pre-trained Faster R-CNN
import torch
import torchvision.transforms as transforms
from torchvision import models
from PIL import Image

# Load pre-trained model
model = models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

def detect_with_rcnn(image_path):
    # Load and preprocess image
    image = Image.open(image_path).convert('RGB')
    transform = transforms.Compose([transforms.ToTensor()])
    image_tensor = transform(image).unsqueeze(0)

    # Run detection
    with torch.no_grad():
        predictions = model(image_tensor)

    # Extract results
    boxes = predictions[0]['boxes']
    labels = predictions[0]['labels']
    scores = predictions[0]['scores']

    # Filter by confidence
    confident_detections = scores > 0.7
    final_boxes = boxes[confident_detections]
    final_labels = labels[confident_detections]
    final_scores = scores[confident_detections]

    return final_boxes, final_labels, final_scores

Evaluation Metrics

Common metrics for evaluating object detection models:

  • Intersection over Union (IoU): Overlap between predicted and ground truth boxes
  • Mean Average Precision (mAP): Average precision across all classes
  • Precision: TP / (TP + FP) - accuracy of positive predictions
  • Recall: TP / (TP + FN) - coverage of actual positives

def calculate_iou(box1, box2):
    """Calculate Intersection over Union of two bounding boxes"""
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    if x2 < x1 or y2 < y1:
        return 0.0

    intersection = (x2 - x1) * (y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    return intersection / union if union > 0 else 0.0

# Example usage
pred_box = [10, 10, 100, 100]  # [x1, y1, x2, y2]
true_box = [15, 15, 105, 105]
iou = calculate_iou(pred_box, true_box)
print(f"IoU: {iou:.2f}")
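
Precision and recall follow once predictions are matched to ground-truth boxes. A minimal sketch using the calculate_iou function above, with a single class and greedy matching at a 0.5 IoU threshold (illustrative only; full mAP computation also ranks detections by confidence):

def precision_recall(pred_boxes, true_boxes, iou_threshold=0.5):
    """Greedy matching of predictions to ground truth at a fixed IoU threshold."""
    matched = set()
    tp = 0
    for pred in pred_boxes:
        for i, truth in enumerate(true_boxes):
            if i not in matched and calculate_iou(pred, truth) >= iou_threshold:
                matched.add(i)
                tp += 1
                break
    fp = len(pred_boxes) - tp   # predictions with no matching ground truth
    fn = len(true_boxes) - tp   # ground-truth boxes that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

print(precision_recall([pred_box], [true_box]))  # both metrics are 1.0 here
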
🎯 Practice Exercise: Object Detection Project

Build an object detection application:

  1. Use a pre-trained YOLO or Faster R-CNN model
  2. Create a web interface for uploading images
  3. Display detection results with bounding boxes and labels
  4. Add confidence score filtering
  5. Evaluate performance on a test dataset

🧪 Practice Labs

Hands-on exercises to reinforce your computer vision skills with real projects and datasets.

Lab 1: Image Classifier

Build a Custom Image Classifier

Objective: Create a CNN to classify images from a custom dataset.

# Complete image classification pipeline
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt

# Data augmentation
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    zoom_range=0.2,
    validation_split=0.2
)

# Load data
train_generator = train_datagen.flow_from_directory(
    'dataset/train',
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical',
    subset='training'
)

validation_generator = train_datagen.flow_from_directory(
    'dataset/train',
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical',
    subset='validation'
)

# Build model
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(len(train_generator.class_indices), activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Train model
history = model.fit(
    train_generator,
    epochs=20,
    validation_data=validation_generator,
    verbose=1
)

# Evaluate results
def plot_training_history(history):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

    ax1.plot(history.history['loss'], label='Training Loss')
    ax1.plot(history.history['val_loss'], label='Validation Loss')
    ax1.set_title('Model Loss')
    ax1.legend()

    ax2.plot(history.history['accuracy'], label='Training Accuracy')
    ax2.plot(history.history['val_accuracy'], label='Validation Accuracy')
    ax2.set_title('Model Accuracy')
    ax2.legend()

    plt.show()

plot_training_history(history)

Tasks:

  1. Collect and organize your dataset (minimum 100 images per class)
  2. Implement data augmentation to increase dataset variety
  3. Train a CNN classifier and monitor training progress
  4. Evaluate model performance and identify areas for improvement
  5. Deploy your model for real-time predictions
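
For step 5, a minimal inference sketch, assuming the trained model was saved with model.save('classifier.h5') (a placeholder path) and that class names come from the training generator above:

import numpy as np
import tensorflow as tf

# After training: model.save('classifier.h5')
model = tf.keras.models.load_model('classifier.h5')            # placeholder path
class_names = list(train_generator.class_indices.keys())       # from the generator above

def predict_image(path):
    # Preprocess a single image the same way as the training data
    img = tf.keras.utils.load_img(path, target_size=(224, 224))  # recent TF utility path
    x = tf.keras.utils.img_to_array(img) / 255.0
    x = np.expand_dims(x, axis=0)

    probs = model.predict(x)[0]
    return class_names[int(np.argmax(probs))], float(np.max(probs))

label, confidence = predict_image('test.jpg')                   # placeholder path
print(f"{label}: {confidence:.2f}")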

Lab 2: Real-time Object Detection

Webcam Object Detection System

Objective: Build a real-time object detection system using your webcam.

import cv2
import numpy as np
import time

class RealTimeDetector:
    def __init__(self, model_path, config_path, classes_path):
        self.net = cv2.dnn.readNet(model_path, config_path)
        self.classes = self.load_classes(classes_path)
        self.colors = np.random.uniform(0, 255, size=(len(self.classes), 3))

    def load_classes(self, classes_path):
        with open(classes_path, 'r') as f:
            classes = [line.strip() for line in f.readlines()]
        return classes

    def detect_objects(self, frame):
        height, width = frame.shape[:2]

        # Create blob
        blob = cv2.dnn.blobFromImage(frame, 1/255.0, (416, 416), swapRB=True, crop=False)
        self.net.setInput(blob)

        # Run inference over all YOLO output layers
        outputs = self.net.forward(self.net.getUnconnectedOutLayersNames())

        boxes = []
        confidences = []
        class_ids = []

        for output in outputs:
            for detection in output:
                scores = detection[5:]
                class_id = np.argmax(scores)
                confidence = scores[class_id]

                if confidence > 0.5:
                    center_x = int(detection[0] * width)
                    center_y = int(detection[1] * height)
                    w = int(detection[2] * width)
                    h = int(detection[3] * height)

                    x = int(center_x - w/2)
                    y = int(center_y - h/2)

                    boxes.append([x, y, w, h])
                    confidences.append(float(confidence))
                    class_ids.append(class_id)

        # Non-maximum suppression
        indices = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)

        return boxes, confidences, class_ids, indices

    def draw_detections(self, frame, boxes, confidences, class_ids, indices):
        if len(indices) > 0:
            for i in np.array(indices).flatten():
                x, y, w, h = boxes[i]
                label = f"{self.classes[class_ids[i]]}: {confidences[i]:.2f}"
                # OpenCV expects an integer color tuple, not a float array
                color = [int(c) for c in self.colors[class_ids[i]]]

                cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
                cv2.putText(frame, label, (x, y - 10),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
        return frame

    def run_webcam_detection(self):
        cap = cv2.VideoCapture(0)
        fps_start_time = time.time()
        fps_counter = 0
        fps = 0.0  # initialized so the overlay is valid before the first measurement

        while True:
            ret, frame = cap.read()
            if not ret:
                break

            # Detect objects
            boxes, confidences, class_ids, indices = self.detect_objects(frame)
            frame = self.draw_detections(frame, boxes, confidences, class_ids, indices)

            # Calculate FPS every 30 frames
            fps_counter += 1
            if fps_counter % 30 == 0:
                fps = 30 / (time.time() - fps_start_time)
                fps_start_time = time.time()

            cv2.putText(frame, f"FPS: {fps:.1f}", (10, 30),
                        cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

            cv2.imshow('Real-time Object Detection', frame)

            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

        cap.release()
        cv2.destroyAllWindows()

# Usage (Darknet-format weights, e.g. YOLOv3/v4)
detector = RealTimeDetector('yolov3.weights', 'yolov3.cfg', 'coco.names')
detector.run_webcam_detection()

Tasks:

  1. Set up webcam capture with OpenCV
  2. Integrate pre-trained YOLO model for object detection
  3. Implement real-time bounding box visualization
  4. Add FPS counter and performance optimization
  5. Create object tracking across frames
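
For step 5, one lightweight option is to initialize an OpenCV tracker per confident detection and update it on later frames. A minimal sketch, assuming opencv-contrib-python is installed (boxes are (x, y, w, h) as produced by the detector above):

import cv2

def track_objects(video_source, initial_boxes):
    """Track previously detected boxes across frames with CSRT trackers."""
    cap = cv2.VideoCapture(video_source)
    ok, frame = cap.read()
    if not ok:
        return

    # One tracker per detection
    trackers = []
    for box in initial_boxes:
        tracker = cv2.TrackerCSRT_create()  # requires opencv-contrib-python
        tracker.init(frame, tuple(box))
        trackers.append(tracker)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        for tracker in trackers:
            success, box = tracker.update(frame)
            if success:
                x, y, w, h = map(int, box)
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imshow('Tracking', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()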

Lab 3: Image Segmentation

Semantic Segmentation with U-Net

Objective: Implement pixel-level image segmentation for medical or satellite imagery.

import tensorflow as tf
from tensorflow.keras import layers

def unet_model(input_size=(256, 256, 3), num_classes=1):
    inputs = layers.Input(input_size)

    # Encoder (Contracting Path)
    c1 = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(inputs)
    c1 = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(c1)
    p1 = layers.MaxPooling2D((2, 2))(c1)

    c2 = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(p1)
    c2 = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(c2)
    p2 = layers.MaxPooling2D((2, 2))(c2)

    c3 = layers.Conv2D(256, (3, 3), activation='relu', padding='same')(p2)
    c3 = layers.Conv2D(256, (3, 3), activation='relu', padding='same')(c3)
    p3 = layers.MaxPooling2D((2, 2))(c3)

    c4 = layers.Conv2D(512, (3, 3), activation='relu', padding='same')(p3)
    c4 = layers.Conv2D(512, (3, 3), activation='relu', padding='same')(c4)
    p4 = layers.MaxPooling2D((2, 2))(c4)

    # Bottleneck
    c5 = layers.Conv2D(1024, (3, 3), activation='relu', padding='same')(p4)
    c5 = layers.Conv2D(1024, (3, 3), activation='relu', padding='same')(c5)

    # Decoder (Expanding Path)
    u6 = layers.Conv2DTranspose(512, (2, 2), strides=(2, 2), padding='same')(c5)
    u6 = layers.concatenate([u6, c4])
    c6 = layers.Conv2D(512, (3, 3), activation='relu', padding='same')(u6)
    c6 = layers.Conv2D(512, (3, 3), activation='relu', padding='same')(c6)

    u7 = layers.Conv2DTranspose(256, (2, 2), strides=(2, 2), padding='same')(c6)
    u7 = layers.concatenate([u7, c3])
    c7 = layers.Conv2D(256, (3, 3), activation='relu', padding='same')(u7)
    c7 = layers.Conv2D(256, (3, 3), activation='relu', padding='same')(c7)

    u8 = layers.Conv2DTranspose(128, (2, 2), strides=(2, 2), padding='same')(c7)
    u8 = layers.concatenate([u8, c2])
    c8 = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(u8)
    c8 = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(c8)

    u9 = layers.Conv2DTranspose(64, (2, 2), strides=(2, 2), padding='same')(c8)
    u9 = layers.concatenate([u9, c1])
    c9 = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(u9)
    c9 = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(c9)

    outputs = layers.Conv2D(num_classes, (1, 1), activation='sigmoid')(c9)

    model = tf.keras.Model(inputs=[inputs], outputs=[outputs])
    return model

# Custom IoU metric (defined before compile so it can be passed as a metric function)
def iou(y_true, y_pred, smooth=1e-6):
    intersection = tf.reduce_sum(y_true * y_pred)
    union = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) - intersection
    return (intersection + smooth) / (union + smooth)

# Create and compile model
model = unet_model()
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy', iou]
)

Tasks:

  1. Implement U-Net architecture for segmentation
  2. Prepare pixel-level annotated training data
  3. Train model with appropriate loss functions (IoU, Dice)
  4. Visualize segmentation masks and compare with ground truth
  5. Apply to real-world problems (medical imaging, autonomous driving)
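
For step 3, a common choice alongside IoU is the soft Dice loss, optionally combined with binary cross-entropy. A minimal TensorFlow sketch:

import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1e-6):
    """Soft Dice loss for binary masks: 1 - 2*|A∩B| / (|A| + |B|)."""
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)
    return 1.0 - dice

def bce_dice_loss(y_true, y_pred):
    # Combine pixel-wise cross-entropy with region-overlap Dice
    bce = tf.keras.losses.binary_crossentropy(y_true, y_pred)
    return tf.reduce_mean(bce) + dice_loss(y_true, y_pred)

# model.compile(optimizer='adam', loss=bce_dice_loss, metrics=['accuracy', iou])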


🚀 Advanced Computer Vision

Cutting-edge techniques and emerging trends in computer vision research and applications.

Vision Transformers (ViTs)

Vision Transformers apply the transformer architecture from NLP to computer vision, treating images as sequences of patches.

import torch
import torch.nn as nn
from einops import rearrange
import math

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.img_size = img_size
        self.patch_size = patch_size
        self.n_patches = (img_size // patch_size) ** 2

        self.projection = nn.Conv2d(in_channels, embed_dim,
                                    kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch_size, channels, height, width)
        x = self.projection(x)  # (batch_size, embed_dim, n_patches_h, n_patches_w)
        x = rearrange(x, 'b e h w -> b (h w) e')  # (batch_size, n_patches, embed_dim)
        return x

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim=768, n_heads=12):
        super().__init__()
        self.embed_dim = embed_dim
        self.n_heads = n_heads
        self.head_dim = embed_dim // n_heads

        self.qkv = nn.Linear(embed_dim, embed_dim * 3)
        self.projection = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        batch_size, seq_len, embed_dim = x.shape

        # Generate Q, K, V
        qkv = self.qkv(x)
        qkv = rearrange(qkv, 'b s (three h d) -> three b h s d',
                        three=3, h=self.n_heads, d=self.head_dim)
        q, k, v = qkv[0], qkv[1], qkv[2]

        # Scaled dot-product attention
        attention = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attention = torch.softmax(attention, dim=-1)

        # Apply attention to values
        out = torch.matmul(attention, v)
        out = rearrange(out, 'b h s d -> b s (h d)')
        out = self.projection(out)

        return out

# Transformer encoder block (pre-norm attention + MLP with residual connections);
# not in the original listing, but required by VisionTransformer below
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim=768, n_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attention = MultiHeadAttention(embed_dim, n_heads)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(embed_dim * mlp_ratio, embed_dim),
        )

    def forward(self, x):
        x = x + self.attention(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

class VisionTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, n_classes=1000,
                 embed_dim=768, n_layers=12, n_heads=12, mlp_ratio=4):
        super().__init__()

        self.patch_embedding = PatchEmbedding(img_size, patch_size, in_channels, embed_dim)
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embedding = nn.Parameter(
            torch.randn(1, self.patch_embedding.n_patches + 1, embed_dim))

        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(embed_dim, n_heads, mlp_ratio)
            for _ in range(n_layers)
        ])

        self.layer_norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, x):
        batch_size = x.shape[0]

        # Patch embedding
        x = self.patch_embedding(x)

        # Add class token
        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        x = torch.cat([cls_tokens, x], dim=1)

        # Add position embedding
        x = x + self.pos_embedding

        # Transformer blocks
        for block in self.transformer_blocks:
            x = block(x)

        # Classification head
        x = self.layer_norm(x)
        x = x[:, 0]  # Use class token
        x = self.head(x)

        return x

Generative Adversarial Networks for Vision

GANs can generate realistic images and perform tasks such as style transfer and image-to-image translation.

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, img_channels=3, features_g=64):
        super(Generator, self).__init__()
        self.net = nn.Sequential(
            # Input: N x latent_dim x 1 x 1
            self._block(latent_dim, features_g * 16, 4, 1, 0),      # img: 4x4
            self._block(features_g * 16, features_g * 8, 4, 2, 1),  # img: 8x8
            self._block(features_g * 8, features_g * 4, 4, 2, 1),   # img: 16x16
            self._block(features_g * 4, features_g * 2, 4, 2, 1),   # img: 32x32
            nn.ConvTranspose2d(
                features_g * 2, img_channels, kernel_size=4, stride=2, padding=1
            ),
            nn.Tanh(),  # Output: N x img_channels x 64 x 64
        )

    def _block(self, in_channels, out_channels, kernel_size, stride, padding):
        return nn.Sequential(
            nn.ConvTranspose2d(
                in_channels, out_channels, kernel_size, stride, padding, bias=False,
            ),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    def __init__(self, img_channels=3, features_d=64):
        super(Discriminator, self).__init__()
        self.disc = nn.Sequential(
            # Input: N x img_channels x 64 x 64
            nn.Conv2d(img_channels, features_d, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            # State size: N x features_d x 32 x 32
            self._block(features_d, features_d * 2, 4, 2, 1),
            self._block(features_d * 2, features_d * 4, 4, 2, 1),
            self._block(features_d * 4, features_d * 8, 4, 2, 1),
            # State size: N x features_d*8 x 4 x 4
            nn.Conv2d(features_d * 8, 1, kernel_size=4, stride=2, padding=0),
            nn.Sigmoid(),
        )

    def _block(self, in_channels, out_channels, kernel_size, stride, padding):
        return nn.Sequential(
            nn.Conv2d(
                in_channels, out_channels, kernel_size, stride, padding, bias=False,
            ),
            nn.BatchNorm2d(out_channels),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x):
        return self.disc(x)

# Training loop example
def train_gan(generator, discriminator, dataloader, num_epochs, device):
    criterion = nn.BCELoss()
    optimizer_g = torch.optim.Adam(generator.parameters(), lr=0.0002, betas=(0.5, 0.999))
    optimizer_d = torch.optim.Adam(discriminator.parameters(), lr=0.0002, betas=(0.5, 0.999))

    for epoch in range(num_epochs):
        for batch_idx, (real_images, _) in enumerate(dataloader):
            real_images = real_images.to(device)
            batch_size = real_images.shape[0]

            # Train Discriminator: real images should score 1, fakes 0
            noise = torch.randn(batch_size, 100, 1, 1).to(device)
            fake_images = generator(noise)

            disc_real = discriminator(real_images).reshape(-1)
            loss_disc_real = criterion(disc_real, torch.ones_like(disc_real))

            disc_fake = discriminator(fake_images.detach()).reshape(-1)
            loss_disc_fake = criterion(disc_fake, torch.zeros_like(disc_fake))

            loss_disc = (loss_disc_real + loss_disc_fake) / 2

            discriminator.zero_grad()
            loss_disc.backward()
            optimizer_d.step()

            # Train Generator: try to make the discriminator output 1 for fakes
            output = discriminator(fake_images).reshape(-1)
            loss_gen = criterion(output, torch.ones_like(output))

            generator.zero_grad()
            loss_gen.backward()
            optimizer_g.step()

3D Computer Vision

Understanding depth, 3D reconstruction, and working with point clouds and 3D data.

import numpy as np
import cv2
import open3d as o3d
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

class PointCloudProcessor:
    def __init__(self):
        self.point_cloud = None

    def load_point_cloud(self, file_path):
        """Load point cloud from file"""
        self.point_cloud = o3d.io.read_point_cloud(file_path)
        return self.point_cloud

    def preprocess_point_cloud(self, voxel_size=0.05):
        """Downsample and remove outliers"""
        # Downsample
        downsampled = self.point_cloud.voxel_down_sample(voxel_size)

        # Remove outliers
        cleaned, _ = downsampled.remove_statistical_outlier(
            nb_neighbors=20, std_ratio=2.0
        )

        self.point_cloud = cleaned
        return cleaned

    def estimate_normals(self):
        """Estimate surface normals"""
        self.point_cloud.estimate_normals(
            search_param=o3d.geometry.KDTreeSearchParamHybrid(
                radius=0.1, max_nn=30
            )
        )

    def segment_plane(self, distance_threshold=0.01):
        """Segment largest plane using RANSAC"""
        plane_model, inliers = self.point_cloud.segment_plane(
            distance_threshold=distance_threshold,
            ransac_n=3,
            num_iterations=1000
        )

        inlier_cloud = self.point_cloud.select_by_index(inliers)
        outlier_cloud = self.point_cloud.select_by_index(inliers, invert=True)

        return plane_model, inlier_cloud, outlier_cloud

    def cluster_objects(self, eps=0.02, min_points=10):
        """Cluster point cloud into separate objects"""
        points = np.asarray(self.point_cloud.points)

        # DBSCAN clustering
        clustering = DBSCAN(eps=eps, min_samples=min_points).fit(points)
        labels = clustering.labels_

        # Create colored point cloud
        max_label = labels.max()
        colors = plt.get_cmap("tab20")(labels / (max_label if max_label > 0 else 1))
        colors[labels < 0] = 0  # Noise points

        self.point_cloud.colors = o3d.utility.Vector3dVector(colors[:, :3])

        return labels

# Stereo vision for depth estimation
def stereo_depth_estimation(left_image, right_image):
    """Estimate depth from stereo image pair"""
    # Convert to grayscale
    left_gray = cv2.cvtColor(left_image, cv2.COLOR_BGR2GRAY)
    right_gray = cv2.cvtColor(right_image, cv2.COLOR_BGR2GRAY)

    # Create stereo matcher
    stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)

    # Compute disparity map
    disparity = stereo.compute(left_gray, right_gray)

    # Normalize for visualization
    disparity_normalized = cv2.normalize(
        disparity, None, 0, 255, cv2.NORM_MINMAX, cv2.CV_8U
    )

    return disparity, disparity_normalized

# Structure from Motion (SfM) basics
class StructureFromMotion:
    def __init__(self):
        self.feature_detector = cv2.SIFT_create()
        self.matcher = cv2.BFMatcher()

    def extract_features(self, image):
        """Extract SIFT features from image"""
        keypoints, descriptors = self.feature_detector.detectAndCompute(image, None)
        return keypoints, descriptors

    def match_features(self, desc1, desc2):
        """Match features between two images"""
        matches = self.matcher.knnMatch(desc1, desc2, k=2)

        # Apply Lowe's ratio test
        good_matches = []
        for match_pair in matches:
            if len(match_pair) == 2:
                m, n = match_pair
                if m.distance < 0.7 * n.distance:
                    good_matches.append(m)

        return good_matches

    def estimate_pose(self, kp1, kp2, matches, K):
        """Estimate camera pose between two views"""
        # Extract matched points
        pts1 = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        pts2 = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

        # Find essential matrix
        E, mask = cv2.findEssentialMat(pts1, pts2, K)

        # Recover pose
        _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K)

        return R, t, mask

Real-time Applications

Optimizing computer vision models for deployment on mobile and edge devices.

# Model optimization techniques
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Enable mixed precision training
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)

# Model quantization
def quantize_model(model, representative_dataset):
    """Quantize model for faster inference"""
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    # Representative dataset for quantization
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.uint8

    quantized_model = converter.convert()
    return quantized_model

# Model pruning
def prune_model(model, target_sparsity=0.5):
    """Apply magnitude-based pruning"""
    import tensorflow_model_optimization as tfmot

    prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

    pruning_params = {
        'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
            initial_sparsity=0.0,
            final_sparsity=target_sparsity,
            begin_step=0,
            end_step=1000
        )
    }

    pruned_model = prune_low_magnitude(model, **pruning_params)
    return pruned_model

# Knowledge distillation
class DistillationTraining:
    def __init__(self, teacher_model, student_model, alpha=0.7, temperature=3):
        self.teacher_model = teacher_model
        self.student_model = student_model
        self.alpha = alpha
        self.temperature = temperature
        self.optimizer = tf.keras.optimizers.Adam()  # optimizer for the student model

    def distillation_loss(self, y_true, y_pred_student, y_pred_teacher):
        """Combine hard and soft targets"""
        # Hard target loss
        hard_loss = tf.keras.losses.categorical_crossentropy(y_true, y_pred_student)

        # Soft target loss
        soft_targets = tf.nn.softmax(y_pred_teacher / self.temperature)
        soft_pred = tf.nn.softmax(y_pred_student / self.temperature)
        soft_loss = tf.keras.losses.categorical_crossentropy(soft_targets, soft_pred)

        # Combined loss
        total_loss = (1 - self.alpha) * hard_loss + \
                     self.alpha * soft_loss * (self.temperature ** 2)
        return total_loss

    def train_student(self, train_data, epochs=10):
        """Train student model with teacher guidance"""
        for epoch in range(epochs):
            for x_batch, y_batch in train_data:
                with tf.GradientTape() as tape:
                    # Get predictions
                    teacher_pred = self.teacher_model(x_batch, training=False)
                    student_pred = self.student_model(x_batch, training=True)

                    # Calculate distillation loss
                    loss = self.distillation_loss(y_batch, student_pred, teacher_pred)

                # Update student model
                gradients = tape.gradient(loss, self.student_model.trainable_variables)
                self.optimizer.apply_gradients(
                    zip(gradients, self.student_model.trainable_variables))

🎯 Advanced Project Ideas

Challenge yourself with these cutting-edge projects:

  1. Vision Transformer from Scratch: Implement and train a ViT on a custom dataset
  2. 3D Object Detection: Build a system that detects objects in 3D point clouds
  3. Real-time Style Transfer: Create an app that applies artistic styles to live video
  4. Medical Image Analysis: Develop AI for detecting anomalies in medical scans
  5. Autonomous Drone Navigation: Computer vision for obstacle avoidance and path planning
