Day 51: 3D Vision & Depth

Phase IV: Vision (ViT, 3D, Video) | Week 8 | 2.5 hours

"A robot doesn't just see pixels: it perceives a 3D world. Depth is the bridge between images and physical interaction." (Ranftl et al., 2021)


Theory (45 min)

Why Depth Matters for Robotics

A 2D image loses the third dimension. For manipulation and navigation, robots need to know:

- How far away objects are
- Which surfaces are graspable
- Where obstacles lie in 3D space

RGB Image          Depth Map              3D Point Cloud
┌──────────┐      ┌──────────┐           .  .  .  .
│  🤖  📦  │  →   │ ░░░ ▓▓▓ │    →      .   .   .
│   table  │      │ ░░░░░░░ │          ..........
│  floor   │      │ ░░░░░░░ │          ...........
└──────────┘      └──────────┘

near=dark        far=light             x, y, z per pixel

Approaches to Depth Estimation

Method	Input	How	Accuracy	Use Case
Stereo vision	2 cameras	Triangulation from disparity	High	Industrial robots
Structured light	Projector + camera	Decode projected patterns	Very high	RGB-D sensors (Kinect v1, RealSense SR300)
LiDAR	Laser scanner	Time-of-flight	Very high	Autonomous vehicles
Monocular depth	1 camera	Neural network	Medium	Any camera → depth
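The "triangulation from disparity" entry in the stereo row reduces, for a rectified camera pair, to Z = f · B / d, with focal length f in pixels, baseline B in meters, and disparity d in pixels. The following is a minimal sketch of that relation; the helper name and the sample numbers are illustrative assumptions, not values from a real rig.

```python
import numpy as np


def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified stereo pair (hypothetical helper)."""
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    # Zero or negative disparity has no finite depth; mark it as infinitely far
    depth_m = np.full(disparity_px.shape, np.inf)
    valid = disparity_px > 0
    depth_m[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth_m


# Illustrative numbers only: 500 px focal length, 10 cm baseline
depths = disparity_to_depth([50.0, 25.0, 0.0], focal_px=500.0, baseline_m=0.1)
```

Halving the disparity doubles the depth, which is why stereo depth resolution degrades for distant objects.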

Monocular Depth Estimation with ViTs

The key insight: ViTs with global attention can reason about depth cues that span the entire image — vanishing points, relative sizes, occlusions, atmospheric perspective.

MiDaS (2020): Multi-dataset training for robust monocular depth:

$$d = f_\theta(\text{image}) \quad \text{where } d \in \mathbb{R}^{H \times W}$$

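MiDaS is trained on a mixture of datasets with different depth representations (metric, relative, stereo) using a scale-and-shift-invariant loss, where $d$ is the prediction, $d^*$ the ground truth, and MAD the mean absolute deviation from the median: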

$$\mathcal{L} = \frac{1}{n} \sum_i \left( \frac{d_i - \text{median}(d)}{\text{MAD}(d)} - \frac{d_i^* - \text{median}(d^*)}{\text{MAD}(d^*)} \right)^2$$
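The loss above can be sketched in a few lines of NumPy. This is an illustration, not the MiDaS training code: it reads MAD as the mean absolute deviation from the median (an assumption about the notation), and the function names are hypothetical.

```python
import numpy as np


def normalize_depth(d):
    """Subtract the median, then divide by the mean absolute deviation from it."""
    d = np.asarray(d, dtype=np.float64)
    shift = np.median(d)
    scale = np.mean(np.abs(d - shift))
    return (d - shift) / max(scale, 1e-8)  # guard against constant maps


def ssi_loss(pred, target):
    """Mean squared error between the normalized prediction and target."""
    diff = normalize_depth(pred) - normalize_depth(target)
    return float(np.mean(diff ** 2))


target = np.array([1.0, 2.0, 4.0, 8.0])
pred = 3.0 * target + 5.0          # same structure, different scale and shift
loss_same = ssi_loss(pred, target)
loss_diff = ssi_loss(target[::-1] ** 2, target)  # genuinely different structure
```

Because both inputs are normalized before comparison, `loss_same` is zero up to rounding while `loss_diff` is not. This is what lets datasets with incompatible depth units share one objective.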

Depth Anything (2024): Scales monocular depth to 62M unlabeled images via self-training:

1. Train teacher on labeled data
2. Generate pseudo-labels for unlabeled data
3. Train student on labeled + pseudo-labeled data
4. Student surpasses teacher
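The data flow of the first three steps can be shown with a toy example. Here a least-squares line fit stands in for the depth network, and all names and numbers are illustrative assumptions. With such a simple model the student merely reproduces the teacher; the gains reported for Depth Anything depend on ingredients this sketch does not model, such as strong augmentation of the student's inputs.

```python
import numpy as np

rng = np.random.default_rng(0)


def fit(x, y):
    """Least-squares line fit; a stand-in for training a depth network."""
    return np.polyfit(x, y, deg=1)


def predict(params, x):
    return np.polyval(params, x)


# Step 1: train the teacher on a small labeled set
x_labeled = rng.uniform(0.0, 1.0, size=20)
y_labeled = 2.0 * x_labeled + 1.0 + rng.normal(0.0, 0.05, size=20)
teacher = fit(x_labeled, y_labeled)

# Step 2: pseudo-label a much larger unlabeled set with the teacher
x_unlabeled = rng.uniform(0.0, 1.0, size=2000)
y_pseudo = predict(teacher, x_unlabeled)

# Step 3: train the student on labeled + pseudo-labeled data
x_all = np.concatenate([x_labeled, x_unlabeled])
y_all = np.concatenate([y_labeled, y_pseudo])
student = fit(x_all, y_all)
```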

DPT: Dense Prediction Transformer

DPT adapts ViT for dense prediction (depth, segmentation) by reassembling multi-scale features:

ViT Encoder (layer outputs at L/4, L/2, 3L/4, L)
       │
       ▼
Reassemble: project tokens back to spatial maps
       │
       ▼
Fusion: progressive upsampling with skip connections
       │
       ▼
Head: per-pixel depth prediction at full resolution
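The Reassemble step is, at its core, a reshape from a token sequence back to a 2D grid. The sketch below shows only that reshape, assuming token 0 is a readout (class) token that is simply dropped and that patches are stored in row-major order. The actual DPT module also applies learned projections and resampling to several scales, which are omitted here, and the function name is hypothetical.

```python
import numpy as np


def reassemble_tokens(tokens, image_hw, patch_size):
    """Turn (1 + N, D) ViT tokens into a (D, H/p, W/p) feature map."""
    height, width = image_hw
    grid_h, grid_w = height // patch_size, width // patch_size
    patch_tokens = tokens[1:]  # drop the readout token (assumed at index 0)
    if patch_tokens.shape[0] != grid_h * grid_w:
        raise ValueError("token count does not match the patch grid")
    # Row-major patch order: token k sits at (k // grid_w, k % grid_w)
    feature_map = patch_tokens.reshape(grid_h, grid_w, -1)
    return feature_map.transpose(2, 0, 1)  # channels first: (D, grid_h, grid_w)


# A 384x384 image with 16x16 patches gives a 24x24 grid of 8-dim tokens
tokens = np.arange((1 + 24 * 24) * 8, dtype=np.float32).reshape(1 + 24 * 24, 8)
fmap = reassemble_tokens(tokens, image_hw=(384, 384), patch_size=16)
```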

Implementation (60 min)

Using MiDaS / Depth Anything

import torch
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt


def estimate_depth_midas(image_path):
    """Monocular depth estimation with MiDaS v3."""
    # Load model
    model = torch.hub.load('intel-isl/MiDaS', 'DPT_Large')
    model.eval()

    # Load transforms
    midas_transforms = torch.hub.load('intel-isl/MiDaS', 'transforms')
    transform = midas_transforms.dpt_transform

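    img = Image.open(image_path).convert('RGB')
    # The hub transform already returns a batched (1, 3, H, W) tensor
    input_batch = transform(np.array(img))

    with torch.no_grad():
        prediction = model(input_batch)
        # Resize to the original resolution; PIL size is (W, H), so reverse it
        prediction = torch.nn.functional.interpolate(
            prediction.unsqueeze(1),
            size=img.size[::-1],
            mode='bicubic',
            align_corners=False,
        ).squeeze()

    # MiDaS predicts relative inverse depth: larger values are closer
    return prediction.cpu().numpy()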


def estimate_depth_anything(image_path):
    """Monocular depth with Depth Anything V2."""
    from transformers import pipeline

    pipe = pipeline(
        task="depth-estimation",
        model="depth-anything/Depth-Anything-V2-Small-hf",
    )

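    image = Image.open(image_path).convert('RGB')
    result = pipe(image)
    # result["depth"] is a PIL image rescaled for display;
    # result["predicted_depth"] holds the raw model output as a tensor
    depth = np.array(result["depth"])

    return depth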


def visualize_depth(image_path, depth, title="Depth Estimation"):
    """Side-by-side visualization of RGB and depth."""
    img = Image.open(image_path)

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    axes[0].imshow(img)
    axes[0].set_title('RGB Image')
    axes[0].axis('off')

    im = axes[1].imshow(depth, cmap='inferno')
    axes[1].set_title(title)
    axes[1].axis('off')
    plt.colorbar(im, ax=axes[1], fraction=0.046)

    plt.tight_layout()
    plt.savefig('depth_estimation.png', dpi=150)
    plt.close(fig)  # release the figure when called repeatedly

RGB-D to Point Cloud

def depth_to_pointcloud(rgb, depth, fx=500, fy=500, cx=None, cy=None):
    """Convert RGB + depth map to 3D point cloud.

    Args:
        rgb: (H, W, 3) uint8 image
        depth: (H, W) metric depth in meters; relative model output must be converted first
        fx, fy: focal lengths in pixels
        cx, cy: principal point in pixels (defaults to image center)

    Returns:
        points: (N, 3) xyz coordinates
        colors: (N, 3) rgb colors normalized to [0, 1]
    """
    H, W = depth.shape
    if cx is None:
        cx = W / 2
    if cy is None:
        cy = H / 2

    # Create pixel coordinate grid
    u, v = np.meshgrid(np.arange(W), np.arange(H))

    # Back-project to 3D
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    # Filter invalid depths (non-finite, non-positive, or beyond 10 m)
    valid = np.isfinite(z) & (z > 0) & (z < 10.0)

    points = np.stack([x[valid], y[valid], z[valid]], axis=-1)
    colors = rgb[valid].astype(np.float32) / 255.0

    return points, colors


def save_pointcloud_ply(points, colors, filename='scene.ply'):
    """Save point cloud as PLY file."""
    N = points.shape[0]
    header = (
        f"ply\nformat ascii 1.0\n"
        f"element vertex {N}\n"
        f"property float x\nproperty float y\nproperty float z\n"
        f"property uchar red\nproperty uchar green\nproperty uchar blue\n"
        f"end_header\n"
    )

    # Round before casting so float error does not truncate 254.9999 to 254
    colors_uint8 = np.clip(np.rint(colors * 255), 0, 255).astype(np.uint8)
    with open(filename, 'w') as f:
        f.write(header)
        # One row per vertex: x y z r g b (vectorized instead of a Python loop)
        np.savetxt(f, np.hstack([points, colors_uint8]),
                   fmt='%.4f %.4f %.4f %d %d %d')
    print(f"Saved {N} points to {filename}")
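Note that back-projection expects metric depth, whereas MiDaS and the relative Depth Anything models return inverse depth with unknown scale and shift. One possible bridge, sketched below, fits that scale and shift by least squares against a few pixels whose metric depth is known (for example from a tape measure or a depth sensor). The function name and the synthetic data are assumptions for illustration only.

```python
import numpy as np


def align_inverse_depth(pred_inv, ref_depth_m, ref_mask):
    """Fit s, t so that s * pred_inv + t ~= 1 / depth on reference pixels,
    then convert the whole map to metric depth (hypothetical helper)."""
    A = np.stack([pred_inv[ref_mask], np.ones(ref_mask.sum())], axis=1)
    b = 1.0 / ref_depth_m[ref_mask]
    (s, t), *_ = np.linalg.lstsq(A, b, rcond=None)

    aligned_inv = s * pred_inv + t
    depth_m = np.full(pred_inv.shape, np.nan)
    positive = aligned_inv > 1e-6  # non-positive inverse depth has no valid depth
    depth_m[positive] = 1.0 / aligned_inv[positive]
    return depth_m


# Synthetic check: a network-style output with arbitrary scale and shift
true_depth = np.linspace(1.0, 5.0, 12).reshape(3, 4)
pred_inv = 40.0 / true_depth + 7.0
ref_mask = np.zeros(true_depth.shape, dtype=bool)
ref_mask[0, 0] = ref_mask[1, 2] = ref_mask[2, 3] = True  # three known pixels
metric = align_inverse_depth(pred_inv, true_depth, ref_mask)
```

With real predictions the fit is only approximate, so the resulting point cloud should be treated as a rough reconstruction rather than a measurement.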

Exercise (45 min)

Depth estimation comparison: Run both MiDaS and Depth Anything on the same 5 images. Compare depth maps visually. Which handles edges better? Which is faster?
Indoor scene reconstruction: Take an RGB image of your room. Estimate depth → create point cloud → save as PLY → view in MeshLab or Open3D. How accurate does the 3D reconstruction look?
Depth for obstacle avoidance: Given a depth map from a robot's camera, write a function that returns the closest obstacle distance and its (x, y) position in the image. This is the foundation for reactive navigation.

Key Takeaways

Monocular depth from ViTs. Global attention enables reasoning about depth cues across the whole image
Scale-invariant training. Multi-dataset training with normalized loss enables robust generalization
Depth → 3D. With camera intrinsics, depth maps become point clouds for robotic reasoning
Depth Anything scales. Self-training on 62M unlabeled images yields a student that surpasses its teacher
Robotics pipeline. RGB → depth → point cloud → 3D reasoning is a core robot perception stack

Connection to the Thread

Depth estimation adds the third dimension to our vision pipeline. Tomorrow: processing 3D data directly with point cloud transformers — the representation robots actually use for manipulation.