🔍 Computer Vision Mastery

Learn image processing, CNNs, and vision AI with interactive tutorials and hands-on practice


What is Computer Vision?

Computer Vision is a field of artificial intelligence that enables computers to interpret and understand visual information from the world around us. It mimics human vision by teaching machines to identify objects, understand scenes, and extract meaningful information from images and videos.

Key Concepts

  • Image Processing: Manipulating and enhancing images to extract useful information
  • Feature Detection: Identifying important patterns, edges, corners, and textures in images
  • Pattern Recognition: Classifying objects and scenes based on learned patterns
  • Deep Learning: Using neural networks to automatically learn visual features

How Images Work in Computers

Digital images are represented as matrices of pixels, where each pixel contains color information:

import numpy as np
import cv2
from PIL import Image

# Load an image
image = cv2.imread('example.jpg')

# Image shape: (height, width, channels)
print(f"Image shape: {image.shape}")

# Color images have three channels
height, width, channels = image.shape
print(f"Height: {height}, Width: {width}, Channels: {channels}")

# Convert BGR to RGB (OpenCV uses BGR by default)
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

[Interactive demo: Image Pixel Visualization]

Common Applications

  • Image Classification: Categorizing images (e.g., cat vs dog)
  • Object Detection: Finding and localizing objects in images
  • Face Recognition: Identifying specific individuals
  • Medical Imaging: Analyzing X-rays, MRIs, and CT scans
  • Autonomous Vehicles: Understanding road scenes and obstacles
  • Quality Control: Detecting defects in manufacturing

Image Processing Fundamentals

Image processing involves manipulating images to enhance them or extract useful information. These techniques form the foundation of computer vision systems.

Basic Image Operations

import cv2
import numpy as np
from scipy import ndimage
import matplotlib.pyplot as plt

# Load and basic operations
image = cv2.imread('input.jpg')
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Resize image
resized = cv2.resize(image, (224, 224))

# Crop image
cropped = image[100:300, 150:350]  # [y1:y2, x1:x2]

# Rotate image
rows, cols = gray_image.shape[:2]
rotation_matrix = cv2.getRotationMatrix2D((cols/2, rows/2), 45, 1)
rotated = cv2.warpAffine(gray_image, rotation_matrix, (cols, rows))

Filtering and Enhancement

Filters help remove noise and enhance important features in images:

# Gaussian blur (noise reduction)
blurred = cv2.GaussianBlur(image, (15, 15), 0)

# Edge detection with Canny
edges = cv2.Canny(gray_image, 50, 150)

# Sharpening kernel
sharpening_kernel = np.array([[-1, -1, -1],
                              [-1,  9, -1],
                              [-1, -1, -1]])
sharpened = cv2.filter2D(image, -1, sharpening_kernel)

# Histogram equalization (contrast enhancement)
equalized = cv2.equalizeHist(gray_image)

[Interactive demo: Image Filter Demo]

Color Space Transformations

Different color spaces are useful for different computer vision tasks:

# RGB to HSV (Hue, Saturation, Value)
hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)

# RGB to LAB color space
lab = cv2.cvtColor(image, cv2.COLOR_BGR2LAB)

# Extract color channels
b, g, r = cv2.split(image)
h, s, v = cv2.split(hsv)

# Color thresholding in HSV
lower_blue = np.array([100, 50, 50])
upper_blue = np.array([130, 255, 255])
blue_mask = cv2.inRange(hsv, lower_blue, upper_blue)

🎯 Practice Exercise: Image Enhancement

Try implementing these image processing techniques:

  1. Load an image and convert it to grayscale
  2. Apply different filters (blur, sharpen, edge detection)
  3. Adjust brightness and contrast
  4. Create a color mask to isolate specific colors
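
A minimal starting point for steps 3 and 4, assuming the OpenCV setup from the snippets above ('input.jpg' is a placeholder path and the HSV thresholds are illustrative):

import cv2
import numpy as np

image = cv2.imread('input.jpg')                    # placeholder path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Step 3: brightness/contrast via a linear transform: output = alpha * pixel + beta
adjusted = cv2.convertScaleAbs(image, alpha=1.3, beta=40)  # alpha = contrast, beta = brightness

# Step 4: isolate red-ish pixels with an HSV range mask (example thresholds)
hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv, np.array([0, 120, 70]), np.array([10, 255, 255]))
red_only = cv2.bitwise_and(image, image, mask=mask)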

Convolutional Neural Networks (CNNs)

CNNs are the backbone of modern computer vision. They learn hierarchical visual features automatically through stacked convolutional layers, which makes them well suited to image-related tasks.

CNN Architecture

A typical CNN consists of several types of layers working together:

  • Convolutional Layers: Apply filters to detect features
  • Pooling Layers: Reduce spatial dimensions
  • Activation Functions: Introduce non-linearity (ReLU)
  • Fully Connected Layers: Make final classifications

import tensorflow as tf
from tensorflow.keras import layers, models

# Building a simple CNN
model = models.Sequential([
    # First convolutional block
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    layers.MaxPooling2D((2, 2)),

    # Second convolutional block
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),

    # Third convolutional block
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),

    # Classification layers
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')  # 10 classes
])

# Compile the model
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

model.summary()

[Interactive demo: CNN Layer Visualization]

Popular CNN Architectures

1. LeNet-5 (1998)

One of the first successful CNNs, designed for handwritten digit recognition.
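
For reference, a minimal Keras sketch of the LeNet-5 layout (conv-pool-conv-pool followed by three dense layers; layer sizes follow the original paper, which used 32×32 grayscale inputs and tanh activations):

from tensorflow.keras import layers, models

lenet5 = models.Sequential([
    layers.Conv2D(6, (5, 5), activation='tanh', input_shape=(32, 32, 1)),
    layers.AveragePooling2D((2, 2)),
    layers.Conv2D(16, (5, 5), activation='tanh'),
    layers.AveragePooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(120, activation='tanh'),
    layers.Dense(84, activation='tanh'),
    layers.Dense(10, activation='softmax')  # 10 digit classes
])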

2. AlexNet (2012)

Breakthrough architecture that won ImageNet 2012, proving the power of deep CNNs.

3. VGGNet (2014)

Demonstrated that depth matters - used very small (3×3) filters throughout.

# VGG-like architecture
def create_vgg_block(filters, num_convs):
    block = models.Sequential()
    for _ in range(num_convs):
        block.add(layers.Conv2D(filters, (3, 3), activation='relu', padding='same'))
    block.add(layers.MaxPooling2D((2, 2)))
    return block

# Build VGG-style model
vgg_model = models.Sequential([
    create_vgg_block(64, 2),    # Block 1
    create_vgg_block(128, 2),   # Block 2
    create_vgg_block(256, 3),   # Block 3
    create_vgg_block(512, 3),   # Block 4
    create_vgg_block(512, 3),   # Block 5
    layers.Flatten(),
    layers.Dense(4096, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(4096, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1000, activation='softmax')
])

4. ResNet (2015)

Introduced skip connections, enabling training of very deep networks (up to 152 layers).

# ResNet block with skip connection
def resnet_block(x, filters):
    # Main path
    y = layers.Conv2D(filters, (3, 3), padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, (3, 3), padding='same')(y)
    y = layers.BatchNormalization()(y)

    # Skip connection: match channel count with a 1x1 convolution if needed
    if x.shape[-1] != filters:
        x = layers.Conv2D(filters, (1, 1))(x)

    # Add skip connection
    out = layers.Add()([x, y])
    out = layers.ReLU()(out)
    return out

🎯 Practice Exercise: Build Your CNN

Create a CNN for image classification:

  1. Design a CNN with 3-4 convolutional layers
  2. Add appropriate pooling and dropout layers
  3. Train on a small dataset (CIFAR-10 or Fashion-MNIST)
  4. Visualize the learned filters and feature maps
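
A possible starting point for steps 3 and 4, sketched here with a small CNN sized for CIFAR-10's 32×32 images (hyperparameters are illustrative, not tuned):

import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras import layers, models

# Step 3: load CIFAR-10; pixel values scaled to [0, 1], labels kept as integers
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # integer labels
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=64,
          validation_data=(x_test, y_test))

# Step 4: visualize the first-layer filters
filters, _ = model.layers[0].get_weights()             # shape (3, 3, 3, 32)
filters = (filters - filters.min()) / (filters.max() - filters.min())
for i in range(8):
    plt.subplot(1, 8, i + 1)
    plt.imshow(filters[:, :, :, i])
    plt.axis('off')
plt.show()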

Object Detection

Object detection goes beyond classification - it finds and localizes multiple objects within an image, providing both "what" and "where" information.

Detection vs Classification vs Segmentation

  • Classification: What's in the image? (cat)
  • Localization: Where is the object? (bounding box)
  • Detection: What objects and where? (multiple bounding boxes)
  • Segmentation: Pixel-level object boundaries

YOLO (You Only Look Once)

YOLO is a popular real-time object detection system that treats detection as a regression problem.

import cv2
import numpy as np

# Load a Darknet-format YOLO model (e.g. YOLOv3/v4 weights + cfg;
# YOLOv5 is distributed as PyTorch weights and cannot be loaded this way)
def load_yolo():
    net = cv2.dnn.readNet("yolov3.weights", "yolov3.cfg")

    # Load class names
    with open("coco.names", "r") as f:
        classes = [line.strip() for line in f.readlines()]

    return net, classes

def detect_objects(image, net, classes):
    height, width = image.shape[:2]

    # Create blob from image
    blob = cv2.dnn.blobFromImage(image, 1/255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)

    # Run forward pass over all YOLO output layers
    outputs = net.forward(net.getUnconnectedOutLayersNames())

    boxes = []
    confidences = []
    class_ids = []

    for output in outputs:
        for detection in output:
            scores = detection[5:]
            class_id = np.argmax(scores)
            confidence = scores[class_id]

            if confidence > 0.5:  # Confidence threshold
                # Extract bounding box coordinates
                center_x = int(detection[0] * width)
                center_y = int(detection[1] * height)
                w = int(detection[2] * width)
                h = int(detection[3] * height)

                # Calculate top-left corner
                x = int(center_x - w/2)
                y = int(center_y - h/2)

                boxes.append([x, y, w, h])
                confidences.append(float(confidence))
                class_ids.append(class_id)

    return boxes, confidences, class_ids

[Interactive demo: Object Detection Demo]

R-CNN Family

Region-based CNNs use a two-stage approach: first generate region proposals, then classify each region.

Evolution of R-CNN:

  1. R-CNN (2014): Selective search + CNN classification
  2. Fast R-CNN (2015): Shared convolutional features
  3. Faster R-CNN (2015): Learned region proposals with RPN
  4. Mask R-CNN (2017): Added instance segmentation

# Using torchvision's pre-trained Faster R-CNN
import torch
import torchvision.transforms as transforms
from torchvision import models
from PIL import Image

# Load pre-trained model
model = models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

def detect_with_rcnn(image_path):
    # Load and preprocess image
    image = Image.open(image_path).convert('RGB')
    transform = transforms.Compose([transforms.ToTensor()])
    image_tensor = transform(image).unsqueeze(0)

    # Run detection
    with torch.no_grad():
        predictions = model(image_tensor)

    # Extract results
    boxes = predictions[0]['boxes']
    labels = predictions[0]['labels']
    scores = predictions[0]['scores']

    # Filter by confidence
    confident_detections = scores > 0.7
    final_boxes = boxes[confident_detections]
    final_labels = labels[confident_detections]
    final_scores = scores[confident_detections]

    return final_boxes, final_labels, final_scores

Evaluation Metrics

Common metrics for evaluating object detection models:

  • Intersection over Union (IoU): Overlap between predicted and ground truth boxes
  • Mean Average Precision (mAP): Average precision across all classes
  • Precision: TP / (TP + FP) - accuracy of positive predictions
  • Recall: TP / (TP + FN) - coverage of actual positives

def calculate_iou(box1, box2):
    """Calculate Intersection over Union of two bounding boxes"""
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    if x2 < x1 or y2 < y1:
        return 0.0

    intersection = (x2 - x1) * (y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    return intersection / union if union > 0 else 0.0

# Example usage
pred_box = [10, 10, 100, 100]  # [x1, y1, x2, y2]
true_box = [15, 15, 105, 105]
iou = calculate_iou(pred_box, true_box)
print(f"IoU: {iou:.2f}")
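
Precision and recall follow once predictions are matched to ground-truth boxes. A minimal sketch using the calculate_iou function above, with a single class and greedy matching at a 0.5 IoU threshold (illustrative only; full mAP computation also ranks detections by confidence):

def precision_recall(pred_boxes, true_boxes, iou_threshold=0.5):
    """Greedy matching of predictions to ground truth at a fixed IoU threshold."""
    matched = set()
    tp = 0
    for pred in pred_boxes:
        for i, truth in enumerate(true_boxes):
            if i not in matched and calculate_iou(pred, truth) >= iou_threshold:
                matched.add(i)
                tp += 1
                break
    fp = len(pred_boxes) - tp   # predictions with no matching ground truth
    fn = len(true_boxes) - tp   # ground-truth boxes that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

print(precision_recall([pred_box], [true_box]))  # both metrics are 1.0 here
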
🎯 Practice Exercise: Object Detection Project

Build an object detection application:

  1. Use a pre-trained YOLO or Faster R-CNN model
  2. Create a web interface for uploading images
  3. Display detection results with bounding boxes and labels
  4. Add confidence score filtering
  5. Evaluate performance on a test dataset

🧪 Practice Labs

Hands-on exercises to reinforce your computer vision skills with real projects and datasets.

Lab 1: Image Classifier

Build a Custom Image Classifier

Objective: Create a CNN to classify images from a custom dataset.

# Complete image classification pipeline
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt

# Data augmentation
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    zoom_range=0.2,
    validation_split=0.2
)

# Load data
train_generator = train_datagen.flow_from_directory(
    'dataset/train',
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical',
    subset='training'
)

validation_generator = train_datagen.flow_from_directory(
    'dataset/train',
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical',
    subset='validation'
)

# Build model
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(len(train_generator.class_indices), activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Train model
history = model.fit(
    train_generator,
    epochs=20,
    validation_data=validation_generator,
    verbose=1
)

# Evaluate results
def plot_training_history(history):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

    ax1.plot(history.history['loss'], label='Training Loss')
    ax1.plot(history.history['val_loss'], label='Validation Loss')
    ax1.set_title('Model Loss')
    ax1.legend()

    ax2.plot(history.history['accuracy'], label='Training Accuracy')
    ax2.plot(history.history['val_accuracy'], label='Validation Accuracy')
    ax2.set_title('Model Accuracy')
    ax2.legend()

    plt.show()

plot_training_history(history)

Tasks:

  1. Collect and organize your dataset (minimum 100 images per class)
  2. Implement data augmentation to increase dataset variety
  3. Train a CNN classifier and monitor training progress
  4. Evaluate model performance and identify areas for improvement
  5. Deploy your model for real-time predictions
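
For step 5, a minimal inference sketch, assuming the trained model was saved with model.save('classifier.h5') (a placeholder path) and that class names come from the training generator above:

import numpy as np
import tensorflow as tf

# After training: model.save('classifier.h5')
model = tf.keras.models.load_model('classifier.h5')            # placeholder path
class_names = list(train_generator.class_indices.keys())       # from the generator above

def predict_image(path):
    # Preprocess a single image the same way as the training data
    img = tf.keras.utils.load_img(path, target_size=(224, 224))  # recent TF utility path
    x = tf.keras.utils.img_to_array(img) / 255.0
    x = np.expand_dims(x, axis=0)

    probs = model.predict(x)[0]
    return class_names[int(np.argmax(probs))], float(np.max(probs))

label, confidence = predict_image('test.jpg')                   # placeholder path
print(f"{label}: {confidence:.2f}")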

Lab 2: Real-time Object Detection

Webcam Object Detection System

Objective: Build a real-time object detection system using your webcam.

import cv2
import numpy as np
import time

class RealTimeDetector:
    def __init__(self, model_path, config_path, classes_path):
        self.net = cv2.dnn.readNet(model_path, config_path)
        self.classes = self.load_classes(classes_path)
        self.colors = np.random.uniform(0, 255, size=(len(self.classes), 3))

    def load_classes(self, classes_path):
        with open(classes_path, 'r') as f:
            classes = [line.strip() for line in f.readlines()]
        return classes

    def detect_objects(self, frame):
        height, width = frame.shape[:2]

        # Create blob
        blob = cv2.dnn.blobFromImage(frame, 1/255.0, (416, 416), swapRB=True, crop=False)
        self.net.setInput(blob)

        # Run inference over all YOLO output layers
        outputs = self.net.forward(self.net.getUnconnectedOutLayersNames())

        boxes = []
        confidences = []
        class_ids = []

        for output in outputs:
            for detection in output:
                scores = detection[5:]
                class_id = np.argmax(scores)
                confidence = scores[class_id]

                if confidence > 0.5:
                    center_x = int(detection[0] * width)
                    center_y = int(detection[1] * height)
                    w = int(detection[2] * width)
                    h = int(detection[3] * height)

                    x = int(center_x - w/2)
                    y = int(center_y - h/2)

                    boxes.append([x, y, w, h])
                    confidences.append(float(confidence))
                    class_ids.append(class_id)

        # Non-maximum suppression
        indices = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)

        return boxes, confidences, class_ids, indices

    def draw_detections(self, frame, boxes, confidences, class_ids, indices):
        if len(indices) > 0:
            for i in np.array(indices).flatten():
                x, y, w, h = boxes[i]
                label = f"{self.classes[class_ids[i]]}: {confidences[i]:.2f}"
                # OpenCV expects an integer color tuple, not a float array
                color = [int(c) for c in self.colors[class_ids[i]]]

                cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
                cv2.putText(frame, label, (x, y - 10),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
        return frame

    def run_webcam_detection(self):
        cap = cv2.VideoCapture(0)
        fps_start_time = time.time()
        fps_counter = 0
        fps = 0.0  # initialized so the overlay is valid before the first measurement

        while True:
            ret, frame = cap.read()
            if not ret:
                break

            # Detect objects
            boxes, confidences, class_ids, indices = self.detect_objects(frame)
            frame = self.draw_detections(frame, boxes, confidences, class_ids, indices)

            # Calculate FPS every 30 frames
            fps_counter += 1
            if fps_counter % 30 == 0:
                fps = 30 / (time.time() - fps_start_time)
                fps_start_time = time.time()

            cv2.putText(frame, f"FPS: {fps:.1f}", (10, 30),
                        cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

            cv2.imshow('Real-time Object Detection', frame)

            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

        cap.release()
        cv2.destroyAllWindows()

# Usage (Darknet-format weights, e.g. YOLOv3/v4)
detector = RealTimeDetector('yolov3.weights', 'yolov3.cfg', 'coco.names')
detector.run_webcam_detection()

Tasks:

  1. Set up webcam capture with OpenCV
  2. Integrate pre-trained YOLO model for object detection
  3. Implement real-time bounding box visualization
  4. Add FPS counter and performance optimization
  5. Create object tracking across frames
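
For step 5, one lightweight option is to initialize an OpenCV tracker per confident detection and update it on later frames. A minimal sketch, assuming opencv-contrib-python is installed (boxes are (x, y, w, h) as produced by the detector above):

import cv2

def track_objects(video_source, initial_boxes):
    """Track previously detected boxes across frames with CSRT trackers."""
    cap = cv2.VideoCapture(video_source)
    ok, frame = cap.read()
    if not ok:
        return

    # One tracker per detection
    trackers = []
    for box in initial_boxes:
        tracker = cv2.TrackerCSRT_create()  # requires opencv-contrib-python
        tracker.init(frame, tuple(box))
        trackers.append(tracker)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        for tracker in trackers:
            success, box = tracker.update(frame)
            if success:
                x, y, w, h = map(int, box)
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imshow('Tracking', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()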

Lab 3: Image Segmentation

Semantic Segmentation with U-Net

Objective: Implement pixel-level image segmentation for medical or satellite imagery.

import tensorflow as tf
from tensorflow.keras import layers

def unet_model(input_size=(256, 256, 3), num_classes=1):
    inputs = layers.Input(input_size)

    # Encoder (Contracting Path)
    c1 = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(inputs)
    c1 = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(c1)
    p1 = layers.MaxPooling2D((2, 2))(c1)

    c2 = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(p1)
    c2 = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(c2)
    p2 = layers.MaxPooling2D((2, 2))(c2)

    c3 = layers.Conv2D(256, (3, 3), activation='relu', padding='same')(p2)
    c3 = layers.Conv2D(256, (3, 3), activation='relu', padding='same')(c3)
    p3 = layers.MaxPooling2D((2, 2))(c3)

    c4 = layers.Conv2D(512, (3, 3), activation='relu', padding='same')(p3)
    c4 = layers.Conv2D(512, (3, 3), activation='relu', padding='same')(c4)
    p4 = layers.MaxPooling2D((2, 2))(c4)

    # Bottleneck
    c5 = layers.Conv2D(1024, (3, 3), activation='relu', padding='same')(p4)
    c5 = layers.Conv2D(1024, (3, 3), activation='relu', padding='same')(c5)

    # Decoder (Expanding Path)
    u6 = layers.Conv2DTranspose(512, (2, 2), strides=(2, 2), padding='same')(c5)
    u6 = layers.concatenate([u6, c4])
    c6 = layers.Conv2D(512, (3, 3), activation='relu', padding='same')(u6)
    c6 = layers.Conv2D(512, (3, 3), activation='relu', padding='same')(c6)

    u7 = layers.Conv2DTranspose(256, (2, 2), strides=(2, 2), padding='same')(c6)
    u7 = layers.concatenate([u7, c3])
    c7 = layers.Conv2D(256, (3, 3), activation='relu', padding='same')(u7)
    c7 = layers.Conv2D(256, (3, 3), activation='relu', padding='same')(c7)

    u8 = layers.Conv2DTranspose(128, (2, 2), strides=(2, 2), padding='same')(c7)
    u8 = layers.concatenate([u8, c2])
    c8 = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(u8)
    c8 = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(c8)

    u9 = layers.Conv2DTranspose(64, (2, 2), strides=(2, 2), padding='same')(c8)
    u9 = layers.concatenate([u9, c1])
    c9 = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(u9)
    c9 = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(c9)

    outputs = layers.Conv2D(num_classes, (1, 1), activation='sigmoid')(c9)

    model = tf.keras.Model(inputs=[inputs], outputs=[outputs])
    return model

# Custom IoU metric (defined before compile so it can be passed as a metric function)
def iou(y_true, y_pred, smooth=1e-6):
    intersection = tf.reduce_sum(y_true * y_pred)
    union = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) - intersection
    return (intersection + smooth) / (union + smooth)

# Create and compile model
model = unet_model()
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy', iou]
)

Tasks:

  1. Implement U-Net architecture for segmentation
  2. Prepare pixel-level annotated training data
  3. Train model with appropriate loss functions (IoU, Dice)
  4. Visualize segmentation masks and compare with ground truth
  5. Apply to real-world problems (medical imaging, autonomous driving)
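
For step 3, a common choice alongside IoU is the soft Dice loss, optionally combined with binary cross-entropy. A minimal TensorFlow sketch:

import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1e-6):
    """Soft Dice loss for binary masks: 1 - 2*|A∩B| / (|A| + |B|)."""
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)
    return 1.0 - dice

def bce_dice_loss(y_true, y_pred):
    # Combine pixel-wise cross-entropy with region-overlap Dice
    bce = tf.keras.losses.binary_crossentropy(y_true, y_pred)
    return tf.reduce_mean(bce) + dice_loss(y_true, y_pred)

# model.compile(optimizer='adam', loss=bce_dice_loss, metrics=['accuracy', iou])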


🚀 Advanced Computer Vision

Cutting-edge techniques and emerging trends in computer vision research and applications.

Vision Transformers (ViTs)

Vision Transformers apply the transformer architecture from NLP to computer vision, treating images as sequences of patches.

import torch
import torch.nn as nn
from einops import rearrange
import math

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.img_size = img_size
        self.patch_size = patch_size
        self.n_patches = (img_size // patch_size) ** 2

        self.projection = nn.Conv2d(in_channels, embed_dim,
                                    kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch_size, channels, height, width)
        x = self.projection(x)  # (batch_size, embed_dim, n_patches_h, n_patches_w)
        x = rearrange(x, 'b e h w -> b (h w) e')  # (batch_size, n_patches, embed_dim)
        return x

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim=768, n_heads=12):
        super().__init__()
        self.embed_dim = embed_dim
        self.n_heads = n_heads
        self.head_dim = embed_dim // n_heads

        self.qkv = nn.Linear(embed_dim, embed_dim * 3)
        self.projection = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        batch_size, seq_len, embed_dim = x.shape

        # Generate Q, K, V
        qkv = self.qkv(x)
        qkv = rearrange(qkv, 'b s (three h d) -> three b h s d',
                        three=3, h=self.n_heads, d=self.head_dim)
        q, k, v = qkv[0], qkv[1], qkv[2]

        # Scaled dot-product attention
        attention = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attention = torch.softmax(attention, dim=-1)

        # Apply attention to values
        out = torch.matmul(attention, v)
        out = rearrange(out, 'b h s d -> b s (h d)')
        out = self.projection(out)

        return out

# Transformer encoder block (pre-norm attention + MLP with residual connections);
# not in the original listing, but required by VisionTransformer below
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim=768, n_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attention = MultiHeadAttention(embed_dim, n_heads)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(embed_dim * mlp_ratio, embed_dim),
        )

    def forward(self, x):
        x = x + self.attention(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

class VisionTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, n_classes=1000,
                 embed_dim=768, n_layers=12, n_heads=12, mlp_ratio=4):
        super().__init__()

        self.patch_embedding = PatchEmbedding(img_size, patch_size, in_channels, embed_dim)
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embedding = nn.Parameter(
            torch.randn(1, self.patch_embedding.n_patches + 1, embed_dim))

        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(embed_dim, n_heads, mlp_ratio)
            for _ in range(n_layers)
        ])

        self.layer_norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, x):
        batch_size = x.shape[0]

        # Patch embedding
        x = self.patch_embedding(x)

        # Add class token
        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        x = torch.cat([cls_tokens, x], dim=1)

        # Add position embedding
        x = x + self.pos_embedding

        # Transformer blocks
        for block in self.transformer_blocks:
            x = block(x)

        # Classification head
        x = self.layer_norm(x)
        x = x[:, 0]  # Use class token
        x = self.head(x)

        return x

Generative Adversarial Networks for Vision

GANs can generate realistic images and perform tasks such as style transfer and image-to-image translation.

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, img_channels=3, features_g=64):
        super(Generator, self).__init__()
        self.net = nn.Sequential(
            # Input: N x latent_dim x 1 x 1
            self._block(latent_dim, features_g * 16, 4, 1, 0),      # img: 4x4
            self._block(features_g * 16, features_g * 8, 4, 2, 1),  # img: 8x8
            self._block(features_g * 8, features_g * 4, 4, 2, 1),   # img: 16x16
            self._block(features_g * 4, features_g * 2, 4, 2, 1),   # img: 32x32
            nn.ConvTranspose2d(
                features_g * 2, img_channels, kernel_size=4, stride=2, padding=1
            ),
            nn.Tanh(),  # Output: N x img_channels x 64 x 64
        )

    def _block(self, in_channels, out_channels, kernel_size, stride, padding):
        return nn.Sequential(
            nn.ConvTranspose2d(
                in_channels, out_channels, kernel_size, stride, padding, bias=False,
            ),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    def __init__(self, img_channels=3, features_d=64):
        super(Discriminator, self).__init__()
        self.disc = nn.Sequential(
            # Input: N x img_channels x 64 x 64
            nn.Conv2d(img_channels, features_d, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            # State size: N x features_d x 32 x 32
            self._block(features_d, features_d * 2, 4, 2, 1),
            self._block(features_d * 2, features_d * 4, 4, 2, 1),
            self._block(features_d * 4, features_d * 8, 4, 2, 1),
            # State size: N x features_d*8 x 4 x 4
            nn.Conv2d(features_d * 8, 1, kernel_size=4, stride=2, padding=0),
            nn.Sigmoid(),
        )

    def _block(self, in_channels, out_channels, kernel_size, stride, padding):
        return nn.Sequential(
            nn.Conv2d(
                in_channels, out_channels, kernel_size, stride, padding, bias=False,
            ),
            nn.BatchNorm2d(out_channels),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x):
        return self.disc(x)

# Training loop example
def train_gan(generator, discriminator, dataloader, num_epochs, device):
    criterion = nn.BCELoss()
    optimizer_g = torch.optim.Adam(generator.parameters(), lr=0.0002, betas=(0.5, 0.999))
    optimizer_d = torch.optim.Adam(discriminator.parameters(), lr=0.0002, betas=(0.5, 0.999))

    for epoch in range(num_epochs):
        for batch_idx, (real_images, _) in enumerate(dataloader):
            real_images = real_images.to(device)
            batch_size = real_images.shape[0]

            # Train Discriminator: real images should score 1, fakes 0
            noise = torch.randn(batch_size, 100, 1, 1).to(device)
            fake_images = generator(noise)

            disc_real = discriminator(real_images).reshape(-1)
            loss_disc_real = criterion(disc_real, torch.ones_like(disc_real))

            disc_fake = discriminator(fake_images.detach()).reshape(-1)
            loss_disc_fake = criterion(disc_fake, torch.zeros_like(disc_fake))

            loss_disc = (loss_disc_real + loss_disc_fake) / 2

            discriminator.zero_grad()
            loss_disc.backward()
            optimizer_d.step()

            # Train Generator: try to make the discriminator output 1 for fakes
            output = discriminator(fake_images).reshape(-1)
            loss_gen = criterion(output, torch.ones_like(output))

            generator.zero_grad()
            loss_gen.backward()
            optimizer_g.step()

3D Computer Vision

Understanding depth, 3D reconstruction, and working with point clouds and 3D data.

import numpy as np
import cv2
import open3d as o3d
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

class PointCloudProcessor:
    def __init__(self):
        self.point_cloud = None

    def load_point_cloud(self, file_path):
        """Load point cloud from file"""
        self.point_cloud = o3d.io.read_point_cloud(file_path)
        return self.point_cloud

    def preprocess_point_cloud(self, voxel_size=0.05):
        """Downsample and remove outliers"""
        # Downsample
        downsampled = self.point_cloud.voxel_down_sample(voxel_size)

        # Remove outliers
        cleaned, _ = downsampled.remove_statistical_outlier(
            nb_neighbors=20, std_ratio=2.0
        )

        self.point_cloud = cleaned
        return cleaned

    def estimate_normals(self):
        """Estimate surface normals"""
        self.point_cloud.estimate_normals(
            search_param=o3d.geometry.KDTreeSearchParamHybrid(
                radius=0.1, max_nn=30
            )
        )

    def segment_plane(self, distance_threshold=0.01):
        """Segment largest plane using RANSAC"""
        plane_model, inliers = self.point_cloud.segment_plane(
            distance_threshold=distance_threshold,
            ransac_n=3,
            num_iterations=1000
        )

        inlier_cloud = self.point_cloud.select_by_index(inliers)
        outlier_cloud = self.point_cloud.select_by_index(inliers, invert=True)

        return plane_model, inlier_cloud, outlier_cloud

    def cluster_objects(self, eps=0.02, min_points=10):
        """Cluster point cloud into separate objects"""
        points = np.asarray(self.point_cloud.points)

        # DBSCAN clustering
        clustering = DBSCAN(eps=eps, min_samples=min_points).fit(points)
        labels = clustering.labels_

        # Create colored point cloud
        max_label = labels.max()
        colors = plt.get_cmap("tab20")(labels / (max_label if max_label > 0 else 1))
        colors[labels < 0] = 0  # Noise points

        self.point_cloud.colors = o3d.utility.Vector3dVector(colors[:, :3])

        return labels

# Stereo vision for depth estimation
def stereo_depth_estimation(left_image, right_image):
    """Estimate depth from stereo image pair"""
    # Convert to grayscale
    left_gray = cv2.cvtColor(left_image, cv2.COLOR_BGR2GRAY)
    right_gray = cv2.cvtColor(right_image, cv2.COLOR_BGR2GRAY)

    # Create stereo matcher
    stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)

    # Compute disparity map
    disparity = stereo.compute(left_gray, right_gray)

    # Normalize for visualization
    disparity_normalized = cv2.normalize(
        disparity, None, 0, 255, cv2.NORM_MINMAX, cv2.CV_8U
    )

    return disparity, disparity_normalized

# Structure from Motion (SfM) basics
class StructureFromMotion:
    def __init__(self):
        self.feature_detector = cv2.SIFT_create()
        self.matcher = cv2.BFMatcher()

    def extract_features(self, image):
        """Extract SIFT features from image"""
        keypoints, descriptors = self.feature_detector.detectAndCompute(image, None)
        return keypoints, descriptors

    def match_features(self, desc1, desc2):
        """Match features between two images"""
        matches = self.matcher.knnMatch(desc1, desc2, k=2)

        # Apply Lowe's ratio test
        good_matches = []
        for match_pair in matches:
            if len(match_pair) == 2:
                m, n = match_pair
                if m.distance < 0.7 * n.distance:
                    good_matches.append(m)

        return good_matches

    def estimate_pose(self, kp1, kp2, matches, K):
        """Estimate camera pose between two views"""
        # Extract matched points
        pts1 = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        pts2 = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

        # Find essential matrix
        E, mask = cv2.findEssentialMat(pts1, pts2, K)

        # Recover pose
        _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K)

        return R, t, mask

Real-time Applications

Optimizing computer vision models for deployment on mobile and edge devices.

# Model optimization techniques
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Enable mixed precision training
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)

# Model quantization
def quantize_model(model, representative_dataset):
    """Quantize model for faster inference"""
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    # Representative dataset for quantization
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.uint8

    quantized_model = converter.convert()
    return quantized_model

# Model pruning
def prune_model(model, target_sparsity=0.5):
    """Apply magnitude-based pruning"""
    import tensorflow_model_optimization as tfmot

    prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

    pruning_params = {
        'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
            initial_sparsity=0.0,
            final_sparsity=target_sparsity,
            begin_step=0,
            end_step=1000
        )
    }

    pruned_model = prune_low_magnitude(model, **pruning_params)
    return pruned_model

# Knowledge distillation
class DistillationTraining:
    def __init__(self, teacher_model, student_model, alpha=0.7, temperature=3):
        self.teacher_model = teacher_model
        self.student_model = student_model
        self.alpha = alpha
        self.temperature = temperature
        self.optimizer = tf.keras.optimizers.Adam()  # optimizer for the student model

    def distillation_loss(self, y_true, y_pred_student, y_pred_teacher):
        """Combine hard and soft targets"""
        # Hard target loss
        hard_loss = tf.keras.losses.categorical_crossentropy(y_true, y_pred_student)

        # Soft target loss
        soft_targets = tf.nn.softmax(y_pred_teacher / self.temperature)
        soft_pred = tf.nn.softmax(y_pred_student / self.temperature)
        soft_loss = tf.keras.losses.categorical_crossentropy(soft_targets, soft_pred)

        # Combined loss
        total_loss = (1 - self.alpha) * hard_loss + \
                     self.alpha * soft_loss * (self.temperature ** 2)
        return total_loss

    def train_student(self, train_data, epochs=10):
        """Train student model with teacher guidance"""
        for epoch in range(epochs):
            for x_batch, y_batch in train_data:
                with tf.GradientTape() as tape:
                    # Get predictions
                    teacher_pred = self.teacher_model(x_batch, training=False)
                    student_pred = self.student_model(x_batch, training=True)

                    # Calculate distillation loss
                    loss = self.distillation_loss(y_batch, student_pred, teacher_pred)

                # Update student model
                gradients = tape.gradient(loss, self.student_model.trainable_variables)
                self.optimizer.apply_gradients(
                    zip(gradients, self.student_model.trainable_variables))

🎯 Advanced Project Ideas

Challenge yourself with these cutting-edge projects:

  1. Vision Transformer from Scratch: Implement and train a ViT on a custom dataset
  2. 3D Object Detection: Build a system that detects objects in 3D point clouds
  3. Real-time Style Transfer: Create an app that applies artistic styles to live video
  4. Medical Image Analysis: Develop AI for detecting anomalies in medical scans
  5. Autonomous Drone Navigation: Computer vision for obstacle avoidance and path planning
