Multimodal LLMs

Part of Module 6: Current AI Market Trends

Multimodal large language models understand and generate content across text, images, audio, and video. Combining modalities in a single model enables more natural and versatile AI interactions, from answering questions about an image to generating an image from a description.

Understanding Multimodal AI

What Makes AI Multimodal?

Multimodal AI systems can process and generate content across multiple input and output types, understanding the relationships between different modalities.

Key Characteristics:

  • Cross-modal Understanding: Relating concepts across text, vision, and audio
  • Unified Representations: Single model architecture for multiple modalities
  • Contextual Awareness: Understanding how modalities complement each other
  • Flexible I/O: Accept any combination of inputs, generate appropriate outputs
  • Zero-shot Transfer: Apply learning from one modality to another

# Multimodal processing example with different models
import torch
from PIL import Image
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    BlipForQuestionAnswering,
    CLIPProcessor, CLIPModel
)

# BLIP for image captioning and VQA, CLIP for image-text similarity
class MultimodalProcessor:
    def __init__(self):
        # Initialize BLIP captioning model for free-form image descriptions
        self.blip_processor = BlipProcessor.from_pretrained(
            "Salesforce/blip-image-captioning-large"
        )
        self.blip_model = BlipForConditionalGeneration.from_pretrained(
            "Salesforce/blip-image-captioning-large"
        )
        
        # Initialize BLIP VQA model for answering questions about images
        self.vqa_processor = BlipProcessor.from_pretrained(
            "Salesforce/blip-vqa-base"
        )
        self.vqa_model = BlipForQuestionAnswering.from_pretrained(
            "Salesforce/blip-vqa-base"
        )
        
        # Initialize CLIP for image-text similarity
        self.clip_processor = CLIPProcessor.from_pretrained(
            "openai/clip-vit-large-patch14"
        )
        self.clip_model = CLIPModel.from_pretrained(
            "openai/clip-vit-large-patch14"
        )
    
    def generate_caption(self, image_path: str) -> str:
        """Generate natural language caption for image"""
        image = Image.open(image_path)
        inputs = self.blip_processor(image, return_tensors="pt")
        
        outputs = self.blip_model.generate(**inputs, max_length=50)
        caption = self.blip_processor.decode(outputs[0], skip_special_tokens=True)
        return caption
    
    def visual_question_answering(self, image_path: str, question: str) -> str:
        """Answer a natural-language question about an image"""
        image = Image.open(image_path).convert("RGB")
        inputs = self.vqa_processor(
            image, 
            question, 
            return_tensors="pt"
        )
        
        outputs = self.vqa_model.generate(**inputs, max_length=30)
        answer = self.vqa_processor.decode(outputs[0], skip_special_tokens=True)
        return answer
    
    def compute_similarity(self, image_path: str, text: str) -> float:
        """Compute cosine similarity between image and text embeddings"""
        image = Image.open(image_path).convert("RGB")
        inputs = self.clip_processor(
            text=[text], 
            images=image, 
            return_tensors="pt", 
            padding=True
        )
        
        with torch.no_grad():
            outputs = self.clip_model(**inputs)
        
        # CLIP projects both inputs into a shared space; the dot product of the
        # normalized embeddings is their cosine similarity (roughly -1 to 1)
        image_embeds = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
        text_embeds = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
        return (image_embeds @ text_embeds.T)[0, 0].item()

Common Multimodal Tasks

  • Image Captioning: Generate text descriptions of images
  • Visual Question Answering: Answer questions about visual content
  • Text-to-Image Generation: Create images from text descriptions
  • Video Understanding: Analyze and describe video content
  • Audio-Visual Speech Recognition: Combine lip reading with audio
  • Document Understanding: Process layouts, tables, and figures

Leading Multimodal Models

GPT-4 Vision (GPT-4V)

OpenAI's multimodal extension of GPT-4, capable of understanding images alongside text.

# GPT-4 Vision implementation
from openai import OpenAI
import base64

client = OpenAI()

def analyze_image_with_gpt4v(image_path: str, prompt: str):
    """Analyze image using GPT-4 Vision"""
    
    with open(image_path, "rb") as image_file:
        base64_image = base64.b64encode(image_file.read()).decode('utf-8')
    
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}",
                        "detail": "high"  # high or low resolution
                    }
                }
            ]
        }],
        max_tokens=1000
    )
    
    return response.choices[0].message.content

# Advanced use cases
def analyze_multiple_images(image_paths: list, analysis_prompt: str):
    """Analyze multiple images in single context"""
    
    content = [{"type": "text", "text": analysis_prompt}]
    
    for path in image_paths:
        with open(path, "rb") as img:
            base64_img = base64.b64encode(img.read()).decode('utf-8')
            content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{base64_img}"}
            })
    
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": content}],
        max_tokens=2000
    )
    
    return response.choices[0].message.content

Google Gemini - Native Multimodal

Designed from the ground up to be multimodal, processing text, images, audio, and video natively.

# Gemini multimodal capabilities
import google.generativeai as genai
import PIL.Image
import cv2

genai.configure(api_key="your-api-key")

class GeminiMultimodal:
    def __init__(self):
        self.model = genai.GenerativeModel('gemini-pro-vision')
        self.text_model = genai.GenerativeModel('gemini-pro')
    
    def analyze_image_and_text(self, image_path: str, text_prompt: str):
        """Combine image and text understanding"""
        image = PIL.Image.open(image_path)
        
        response = self.model.generate_content([text_prompt, image])
        return response.text
    
    def process_video_frames(self, video_path: str, question: str):
        """Analyze video by processing key frames"""
        cap = cv2.VideoCapture(video_path)
        frames = []
        frame_count = 0
        
        while cap.isOpened() and len(frames) < 10:  # Collect up to 10 sampled frames
            ret, frame = cap.read()
            if not ret:
                break
            
            if frame_count % 30 == 0:  # Every 30th frame
                frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                pil_image = PIL.Image.fromarray(frame_rgb)
                frames.append(pil_image)
            
            frame_count += 1
        
        cap.release()
        
        # Analyze frames with Gemini
        prompt = f"{question}\nThese are frames from a video:"
        response = self.model.generate_content([prompt] + frames)
        return response.text
    
    def multimodal_chat(self):
        """Interactive multimodal chat session"""
        chat = self.model.start_chat(history=[])
        
        # Can mix text and images in conversation
        image = PIL.Image.open("example.jpg")
        response = chat.send_message(["What's in this image?", image])
        print(response.text)
        
        # Follow-up with text only
        response = chat.send_message("What colors are prominent?")
        print(response.text)
        
        return chat.history

Claude 3 Vision

Anthropic's Claude 3 family includes vision capabilities across all model sizes.

# Claude 3 multimodal implementation
import anthropic
import base64

client = anthropic.Anthropic()

def claude_vision_analysis(image_path: str, instructions: str):
    """Analyze images with Claude 3"""
    
    with open(image_path, "rb") as image_file:
        image_data = base64.b64encode(image_file.read()).decode("utf-8")
    
    message = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": instructions
                }
            ]
        }]
    )
    
    return message.content[0].text

# Document analysis with Claude
def analyze_document_layout(pdf_images: list):
    """Analyze document structure and extract information"""
    
    content = []
    for img_path in pdf_images:
        with open(img_path, "rb") as img:
            img_data = base64.b64encode(img.read()).decode("utf-8")
            content.append({
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": img_data
                }
            })
    
    content.append({
        "type": "text",
        "text": """Analyze these document pages and:
        1. Extract all tables and their data
        2. Identify key sections and headings
        3. Summarize the main content
        4. Note any charts or figures"""
    })
    
    message = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=4000,
        messages=[{"role": "user", "content": content}]
    )
    
    return message.content[0].text

Specialized Multimodal Models

CLIP - Contrastive Language-Image Pretraining

OpenAI's CLIP learns visual concepts from natural language descriptions, enabling zero-shot image classification.

# CLIP for zero-shot classification
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def zero_shot_classify(image_path: str, labels: list):
    """Classify image without training on specific classes"""
    
    image = Image.open(image_path)
    
    # Prepare text descriptions
    texts = [f"a photo of a {label}" for label in labels]
    
    inputs = processor(
        text=texts,
        images=image,
        return_tensors="pt",
        padding=True
    )
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)
    
    # Get top prediction
    top_prob, top_idx = probs[0].max(dim=0)
    predicted_label = labels[top_idx]
    
    return {
        "predicted": predicted_label,
        "confidence": top_prob.item(),
        "all_probs": {
            label: prob.item() 
            for label, prob in zip(labels, probs[0])
        }
    }

Flamingo - Few-shot Visual Learning

DeepMind's Flamingo performs few-shot learning on vision-language tasks by interleaving images and text in a single prompt; the prompting pattern is sketched after the feature list below.

Key Features:

  • Interleaved image-text inputs
  • Few-shot adaptation without fine-tuning
  • Cross-attention between vision and language
  • Supports multiple images in context
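
Flamingo itself has not been released publicly, but the interleaved few-shot prompting pattern it popularized works with any model that accepts mixed image-text sequences. Below is a minimal sketch using the Gemini client shown earlier; the file names and labels are placeholder assumptions.

# Flamingo-style few-shot prompting: interleave (image, label) demonstrations,
# then ask the model to label a new image in the same format.
# Assumes genai is configured as in the Gemini example; file names are placeholders.
import PIL.Image
import google.generativeai as genai

def few_shot_image_labeling(example_pairs, query_image_path):
    """example_pairs: list of (image_path, label) demonstrations."""
    model = genai.GenerativeModel('gemini-pro-vision')
    
    parts = ["Label each image, following the examples."]
    for image_path, label in example_pairs:
        parts.append(PIL.Image.open(image_path))    # demonstration image
        parts.append(f"Label: {label}")             # demonstration answer
    
    parts.append(PIL.Image.open(query_image_path))  # image to label
    parts.append("Label:")                          # let the model complete the pattern
    
    response = model.generate_content(parts)
    return response.text.strip()

# Example (placeholder file names):
# few_shot_image_labeling([("cat1.jpg", "cat"), ("dog1.jpg", "dog")], "mystery.jpg")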

DALL-E 3 - Text to Image

Advanced text-to-image generation with improved prompt following and safety.

# DALL-E 3 image generation
from openai import OpenAI

client = OpenAI()

def generate_image(prompt: str, quality: str = "standard"):
    """Generate images with DALL-E 3"""
    
    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1024",  # or 1792x1024, 1024x1792
        quality=quality,  # "standard" or "hd"
        n=1,
        style="natural"  # or "vivid"
    )
    
    return response.data[0].url

# Image variations (this endpoint currently supports DALL-E 2 only)
def create_variations(image_path: str, n: int = 4):
    """Create variations of an existing image with DALL-E 2"""
    
    with open(image_path, "rb") as image:
        response = client.images.create_variation(
            model="dall-e-2",
            image=image,
            n=n,
            size="1024x1024"
        )
    
    return [img.url for img in response.data]

Emerging Trends in Multimodal AI

Any-to-Any Models

Next-generation models that can translate between any modalities.

Examples:

  • ImageBind (Meta): Binds 6 modalities (images, text, audio, depth, thermal, IMU) in a shared embedding space; a two-modality sketch of the idea follows this list
  • Gato (DeepMind): Single model for vision, language, and robotics
  • Unified-IO: Single architecture for diverse I/O tasks
  • PaLM-E (Google): Embodied multimodal language model
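
The shared-embedding idea behind ImageBind can be illustrated in miniature with the two modalities CLIP covers: embed images and text into the same vector space, then retrieve across modalities by cosine similarity. A minimal sketch, reusing the openai/clip-vit-large-patch14 checkpoint from earlier:

# Cross-modal retrieval in a shared embedding space (two-modality illustration
# of what ImageBind extends to six modalities)
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def retrieve_images(query_text: str, image_paths: list, top_k: int = 3):
    """Rank images by similarity to a text query in the shared embedding space"""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    
    with torch.no_grad():
        image_inputs = processor(images=images, return_tensors="pt")
        image_embeds = model.get_image_features(**image_inputs)
        
        text_inputs = processor(text=[query_text], return_tensors="pt", padding=True)
        text_embeds = model.get_text_features(**text_inputs)
    
    # Normalize, then dot product = cosine similarity between query and each image
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    scores = (image_embeds @ text_embeds.T).squeeze(-1)
    
    ranked = scores.argsort(descending=True)[:top_k].tolist()
    return [(image_paths[i], scores[i].item()) for i in ranked]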

Agentic Multimodal AI

AI agents that use multimodal understanding to interact with the world.

Applications:

  • Web Agents: Navigate and interact with websites using vision + text (a minimal step is sketched after this list)
  • Robotics: Combine vision, language, and action planning
  • Virtual Assistants: Screen understanding + natural language
  • Autonomous Vehicles: Sensor fusion with language instructions
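
A minimal sketch of the web-agent loop mentioned above: capture a screenshot, ask a vision model for the next UI action, and hand the result to your browser automation layer. The JSON action schema is an assumption, and screenshot capture and action execution are left to tools such as Playwright or Selenium.

# Single web-agent step: screenshot -> vision model -> proposed action (sketch).
# Executing the action is left to a browser automation layer (e.g. Playwright).
import base64
import json
from openai import OpenAI

client = OpenAI()

def propose_next_action(screenshot_path: str, goal: str) -> dict:
    """Ask a vision model to propose the next UI action as JSON"""
    with open(screenshot_path, "rb") as f:
        screenshot_b64 = base64.b64encode(f.read()).decode("utf-8")
    
    instructions = (
        f"Goal: {goal}\n"
        "Look at this screenshot and reply with JSON only, e.g. "
        '{"action": "click", "target": "search button", "text": ""} '
        'where action is one of "click", "type", or "scroll".'
    )
    
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": instructions},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}}
            ]
        }],
        max_tokens=200
    )
    
    # In practice the reply may need cleanup (e.g. stripping code fences) before parsing
    return json.loads(response.choices[0].message.content)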

Autonomous Workflows

Multimodal models enabling end-to-end automation of complex tasks.

# Example: Autonomous content creation workflow (high-level sketch; the helper
# methods called below, e.g. generate_outline and write_section, are placeholders
# for your own prompt and model calls)
class ContentCreationAgent:
    def __init__(self):
        self.llm = OpenAI()
        self.image_gen = "dall-e-3"
        self.vision = "gpt-4-vision-preview"
    
    async def create_blog_post(self, topic: str):
        """Autonomously create complete blog post with images"""
        
        # Generate article structure
        outline = await self.generate_outline(topic)
        
        # Write content sections
        content = []
        for section in outline['sections']:
            text = await self.write_section(section)
            
            # Determine if image needed
            needs_image = await self.should_add_image(text)
            
            if needs_image:
                # Generate appropriate image
                image_prompt = await self.create_image_prompt(text)
                image_url = await self.generate_image(image_prompt)
                
                # Verify image quality
                approved = await self.verify_image(image_url, text)
                
                if approved:
                    content.append({'text': text, 'image': image_url})
                else:
                    # Regenerate if needed
                    image_url = await self.regenerate_image(image_prompt)
                    content.append({'text': text, 'image': image_url})
            else:
                content.append({'text': text})
        
        # Final review and formatting
        final_post = await self.format_and_review(content)
        return final_post

Multimodal Model Comparison

| Model        | Modalities                | Key Strength                    | Context Size | Availability |
|--------------|---------------------------|---------------------------------|--------------|--------------|
| GPT-4V       | Text, Image               | General purpose vision-language | 128K tokens  | API          |
| Gemini Ultra | Text, Image, Audio, Video | Native multimodal               | 32K tokens   | API          |
| Claude 3     | Text, Image               | Document understanding          | 200K tokens  | API          |
| CLIP         | Text, Image               | Zero-shot classification        | 77 tokens    | Open Source  |
| LLaVA        | Text, Image               | Open source vision-language     | 4K tokens    | Open Source  |
| ImageBind    | 6 modalities              | Unified embedding space         | N/A          | Research     |

Challenges in Multimodal AI

  • Hallucination: Models may describe visual elements that are not actually in the image (a lightweight cross-check is sketched after this list)
  • Alignment: Ensuring consistency across modalities
  • Computational Cost: Processing multiple modalities is resource-intensive
  • Data Requirements: Need paired multimodal training data
  • Evaluation: Difficult to benchmark multimodal performance
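
One lightweight mitigation for visual hallucination is to cross-check generated text against the image with a second model. A sketch using the MultimodalProcessor class from the start of this section; the 0.2 threshold is an illustrative value to tune on your own data.

# Cross-check a generated caption against the image before trusting it.
# Assumes the MultimodalProcessor defined earlier; the threshold is illustrative.
processor = MultimodalProcessor()

def caption_with_check(image_path: str, threshold: float = 0.2) -> dict:
    caption = processor.generate_caption(image_path)
    score = processor.compute_similarity(image_path, caption)  # CLIP cosine similarity
    
    # Low image-text agreement suggests the caption may describe things that are
    # not in the image, so flag it for review instead of returning it as-is
    return {"caption": caption, "score": score, "flagged": score < threshold}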

Best Practices for Multimodal AI

  • Choose the right model: Match capabilities to your specific needs
  • Optimize inputs: Preprocess images/audio for better results
  • Handle failures gracefully: Multimodal models can fail in unexpected ways
  • Verify outputs: Cross-check multimodal understanding
  • Consider latency: Multimodal processing takes longer
  • Monitor costs: Multiple modalities increase API expenses
  • Implement caching: Store processed multimodal embeddings so repeated inputs skip re-encoding (see the sketch after this list)
  • Test edge cases: Unusual input combinations may cause issues
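
For the caching point above, here is a minimal sketch that keys image embeddings by a hash of the file bytes, reusing the CLIP model and processor loaded in the zero-shot example; swap the in-memory dict for a persistent store (disk, Redis) in production.

# Cache image embeddings keyed by a hash of the file contents so repeated
# requests for the same image skip the encoder pass.
# Assumes the CLIP `model` and `processor` from the zero-shot classification example.
import hashlib
import torch
from PIL import Image

_embedding_cache = {}

def cached_image_embedding(image_path: str) -> torch.Tensor:
    with open(image_path, "rb") as f:
        key = hashlib.sha256(f.read()).hexdigest()
    
    if key not in _embedding_cache:
        image = Image.open(image_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            _embedding_cache[key] = model.get_image_features(**inputs)[0]
    
    return _embedding_cache[key]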