Phase IV — Vision: ViT, 3D, Video | Week 8 | 2.5 hours

"A robot doesn't just see pixels — it perceives a 3D world. Depth is the bridge between images and physical interaction." — Ranftl et al., 2021
A 2D image loses the third dimension. For manipulation and navigation, robots need to know:

- How far away objects are
- Which surfaces are graspable
- Where obstacles lie in 3D space
```
  RGB Image          Depth Map           3D Point Cloud
┌──────────┐       ┌──────────┐            . . . .
│ 🤖  📦   │   →   │ ░░░ ▓▓▓  │   →         . . .
│  table   │       │ ░░░░░░░  │         ..........
│  floor   │       │ ░░░░░░░  │         ...........
└──────────┘       └──────────┘
                near=dark far=light    x, y, z per pixel
```
| Method | Input | How | Accuracy | Use Case |
|---|---|---|---|---|
| Stereo vision | 2 cameras | Triangulation from disparity | High | Industrial robots |
| Structured light | Projector + camera | Decode projected patterns | Very high | RGB-D sensors (e.g., Kinect v1, older RealSense models) |
| LiDAR | Laser scanner | Time-of-flight | Very high | Autonomous vehicles |
| Monocular depth | 1 camera | Neural network | Medium | Any camera → depth |
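To make the "triangulation from disparity" entry concrete: for a rectified stereo pair, depth follows $Z = f B / d$, where $f$ is the focal length in pixels, $B$ the baseline between the cameras in meters, and $d$ the disparity in pixels. The sketch below is a minimal illustration; the function name and the example numbers are assumptions chosen for readability, not values from a specific sensor.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth in meters from stereo disparity: Z = f * B / d.

    Pixels with zero or negative disparity have no valid match and get inf.
    """
    d = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(d.shape, np.inf)
    valid = d > 0
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth

# Illustrative numbers (assumptions): f = 500 px, B = 0.1 m
disparity = np.array([50.0, 25.0, 5.0, 0.0])
print(disparity_to_depth(disparity, focal_px=500.0, baseline_m=0.1))
```

With these numbers, disparities of 50, 25 and 5 pixels map to 1 m, 2 m and 10 m. Halving the disparity doubles the depth, so a fixed disparity error costs more accuracy the farther away the surface is.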
The key insight: ViTs with global attention can reason about depth cues that span the entire image — vanishing points, relative sizes, occlusions, atmospheric perspective.
MiDaS (2020): Multi-dataset training for robust monocular depth:
$$d = f_\theta(\text{image}) \quad \text{where } d \in \mathbb{R}^{H \times W}$$
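Trained on a mixture of datasets with different depth representations (metric, relative, stereo) using a scale-and-shift-invariant loss:
$$\mathcal{L} = \frac{1}{n} \sum_i \left( \frac{d_i - \text{median}(d)}{\text{MAD}(d)} - \frac{d_i^* - \text{median}(d^*)}{\text{MAD}(d^*)} \right)^2$$
Here $d$ is the prediction and $d^*$ the ground truth (MiDaS works in inverse-depth space), and $\text{MAD}(d) = \frac{1}{n} \sum_i |d_i - \text{median}(d)|$ is the mean absolute deviation about the median. Subtracting the median and dividing by the MAD removes each dataset's unknown shift and scale before the two maps are compared. The squared error is a simplified form: the final loss in the MiDaS paper applies a trimmed absolute error to the same normalized values.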
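The loss can be checked numerically. The sketch below implements the squared form written above in NumPy, taking MAD to be the mean absolute deviation about the median (an assumption about the notation; the function names are hypothetical). It then confirms that rescaling and shifting the prediction leaves the loss unchanged.

```python
import numpy as np

def normalize_scale_shift(d):
    """Subtract the median and divide by the mean absolute deviation about it."""
    d = np.asarray(d, dtype=np.float64).ravel()
    shift = np.median(d)
    scale = np.mean(np.abs(d - shift))
    return (d - shift) / max(scale, 1e-8)  # guard against constant maps

def ssi_loss(pred, target):
    """Mean squared difference between the normalized prediction and target."""
    diff = normalize_scale_shift(pred) - normalize_scale_shift(target)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
target = rng.uniform(0.5, 5.0, size=(4, 6))               # synthetic ground truth
pred = target + rng.normal(0.0, 0.1, size=target.shape)   # noisy prediction

base = ssi_loss(pred, target)
rescaled = ssi_loss(3.0 * pred + 7.0, target)  # same prediction, new scale and shift
print(base, rescaled)
```

The two printed values agree up to floating-point error. This is what lets one network learn from datasets whose depth units and offsets are unknown or inconsistent.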
Depth Anything (2024): Scales monocular depth to 62M unlabeled images via self-training:

1. Train teacher on labeled data
2. Generate pseudo-labels for unlabeled data
3. Train student on labeled + pseudo-labeled data
4. Student surpasses teacher
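The data flow of the self-training steps above can be sketched with a toy problem. Everything here is an assumption made for illustration: a 1D linear regressor stands in for the depth network, and the data are synthetic. It shows only how labels and pseudo-labels are combined, not the accuracy gain.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit of y ≈ w * x + b; returns (w, b)."""
    A = np.stack([x, np.ones_like(x)], axis=1)
    (w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return w, b

def predict(params, x):
    w, b = params
    return w * x + b

rng = np.random.default_rng(0)
x_labeled = rng.uniform(0.0, 1.0, 20)
y_labeled = 2.0 * x_labeled + 1.0 + rng.normal(0.0, 0.05, 20)
x_unlabeled = rng.uniform(0.0, 1.0, 200)

# Step 1: train the teacher on labeled data only
teacher = fit_linear(x_labeled, y_labeled)

# Step 2: the teacher generates pseudo-labels for the unlabeled data
y_pseudo = predict(teacher, x_unlabeled)

# Step 3: train the student on labeled + pseudo-labeled data
x_all = np.concatenate([x_labeled, x_unlabeled])
y_all = np.concatenate([y_labeled, y_pseudo])
student = fit_linear(x_all, y_all)

print("teacher:", teacher)
print("student:", student)
```

In this toy setting the student reproduces the teacher, because the pseudo-labels contain nothing the teacher did not already know. That matches what the Depth Anything paper reports for naive self-training. Its gains come from making the student's task harder, with strong perturbations applied to the unlabeled inputs.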
DPT (Dense Prediction Transformer, Ranftl et al., 2021) adapts ViT for dense prediction (depth, segmentation) by reassembling multi-scale features:
```
ViT Encoder (layer outputs at L/4, L/2, 3L/4, L)
        │
        ▼
Reassemble: project tokens back to spatial maps
        │
        ▼
Fusion: progressive upsampling with skip connections
        │
        ▼
Head: per-pixel depth prediction at full resolution
```
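The Reassemble step is mostly bookkeeping: drop (or merge) the readout token, reshape the patch tokens back onto their 2D grid, and resample to the target resolution. The NumPy sketch below shows that bookkeeping under simplifying assumptions: the readout token is ignored, and nearest-neighbor repetition stands in for DPT's learned convolutional resampling. The function name and parameters are hypothetical.

```python
import numpy as np

def reassemble(tokens, image_hw, patch=16, upsample=1):
    """Turn a ViT token sequence into an image-like feature map.

    tokens: (1 + N, D) array, readout (class) token first, then N patch tokens
    image_hw: (H, W) of the network input, both divisible by `patch`
    Returns a (D, H/patch * upsample, W/patch * upsample) array.
    """
    H, W = image_hw
    gh, gw = H // patch, W // patch
    patch_tokens = tokens[1:]  # simplest "read" option: ignore the readout token
    if patch_tokens.shape[0] != gh * gw:
        raise ValueError("token count does not match the patch grid")
    # Patch tokens are in row-major order, so a reshape restores the 2D grid
    fmap = patch_tokens.reshape(gh, gw, -1).transpose(2, 0, 1)
    if upsample > 1:
        # Nearest-neighbor upsampling stands in for DPT's learned resampling
        fmap = fmap.repeat(upsample, axis=1).repeat(upsample, axis=2)
    return fmap

# A 384x384 input with 16x16 patches gives a 24x24 grid of tokens
tokens = np.random.default_rng(0).normal(size=(1 + 24 * 24, 8))
coarse = reassemble(tokens, (384, 384))            # (8, 24, 24)
fine = reassemble(tokens, (384, 384), upsample=4)  # (8, 96, 96)
print(coarse.shape, fine.shape)
```

Applying this to several encoder layers at different `upsample` factors yields the feature pyramid that the fusion stage then merges from coarse to fine.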
```python
import torch
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt


def estimate_depth_midas(image_path):
    """Monocular depth estimation with MiDaS v3 (DPT-Large).

    Returns an (H, W) float array of relative inverse depth:
    larger values are closer, and the scale is not metric.
    """
    # Load the pretrained model from torch.hub (downloads weights on first use)
    model = torch.hub.load('intel-isl/MiDaS', 'DPT_Large')
    model.eval()

    # Load the preprocessing transform that matches the DPT models
    midas_transforms = torch.hub.load('intel-isl/MiDaS', 'transforms')
    transform = midas_transforms.dpt_transform

    img = Image.open(image_path).convert('RGB')
    # dpt_transform already adds the batch dimension, so no unsqueeze here
    input_tensor = transform(np.array(img))

    with torch.no_grad():
        depth = model(input_tensor)  # (1, H', W')
        # Resize the prediction back to the original image resolution
        depth = torch.nn.functional.interpolate(
            depth.unsqueeze(1),
            size=img.size[::-1],  # PIL size is (W, H); interpolate expects (H, W)
            mode='bicubic',
            align_corners=False,
        ).squeeze()

    return depth.numpy()


def estimate_depth_anything(image_path):
    """Monocular depth with Depth Anything V2."""
    from transformers import pipeline

    pipe = pipeline(
        task="depth-estimation",
        model="depth-anything/Depth-Anything-V2-Small-hf",
    )
    image = Image.open(image_path).convert('RGB')
    result = pipe(image)
    # result["depth"] is a PIL image rescaled for viewing;
    # result["predicted_depth"] holds the raw prediction tensor
    depth = np.array(result["depth"])
    return depth


def visualize_depth(image_path, depth, title="Depth Estimation"):
    """Side-by-side visualization of RGB and depth."""
    img = Image.open(image_path)

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    axes[0].imshow(img)
    axes[0].set_title('RGB Image')
    axes[0].axis('off')

    im = axes[1].imshow(depth, cmap='inferno')
    axes[1].set_title(title)
    axes[1].axis('off')
    plt.colorbar(im, ax=axes[1], fraction=0.046)

    plt.tight_layout()
    plt.savefig('depth_estimation.png', dpi=150)
    plt.close(fig)  # release the figure when called in a loop
```
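One caveat before building point clouds: models like MiDaS predict relative inverse depth (larger means closer, with unknown scale and shift), while back-projection needs depth in meters. Without a metric model or a calibration target there is no correct conversion. The sketch below is a rough heuristic for visualization only. It assumes the scene spans a guessed range `[near, far]` in meters and interpolates linearly in inverse-depth space. The function name and the default range are assumptions.

```python
import numpy as np

def relative_inverse_to_depth(inv_depth, near=0.5, far=5.0):
    """Map relative inverse depth to rough depth in meters within [near, far].

    Heuristic only: `near` and `far` are guesses about the scene, so the
    result has plausible shape but no guaranteed metric accuracy.
    """
    inv = np.asarray(inv_depth, dtype=np.float64)
    lo, hi = inv.min(), inv.max()
    if hi - lo < 1e-12:
        return np.full(inv.shape, far)  # flat prediction: no structure to map
    t = (inv - lo) / (hi - lo)                # 0 = farthest pixel, 1 = nearest
    inv_metric = (1.0 - t) / far + t / near   # interpolate between 1/far and 1/near
    return 1.0 / inv_metric

# Tiny synthetic map: larger values are closer
inv = np.array([[0.0, 5.0],
                [10.0, 10.0]])
print(relative_inverse_to_depth(inv))
```

The farthest pixel lands at `far`, the nearest at `near`, and values in between follow the 1/z relationship. For measurements a robot will act on, prefer a metric depth model or align the prediction to a few known distances.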
```python
import numpy as np


def depth_to_pointcloud(rgb, depth, fx=500, fy=500, cx=None, cy=None):
    """Convert RGB + depth map to 3D point cloud.

    Args:
        rgb: (H, W, 3) uint8 image
        depth: (H, W) depth map in meters (metric depth, not the relative
            inverse depth that MiDaS returns)
        fx, fy: focal lengths in pixels
        cx, cy: principal point in pixels (defaults to image center)

    Returns:
        points: (N, 3) xyz coordinates
        colors: (N, 3) rgb colors normalized to [0, 1]
    """
    H, W = depth.shape
    if cx is None:
        cx = W / 2
    if cy is None:
        cy = H / 2

    # Create pixel coordinate grid: u is the column index, v the row index
    u, v = np.meshgrid(np.arange(W), np.arange(H))

    # Back-project to 3D with the pinhole camera model
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    # Filter invalid depths
    valid = (z > 0) & (z < 10.0)  # reasonable range

    points = np.stack([x[valid], y[valid], z[valid]], axis=-1)
    colors = rgb[valid].astype(np.float32) / 255.0
    return points, colors


def save_pointcloud_ply(points, colors, filename='scene.ply'):
    """Save point cloud as PLY file."""
    N = points.shape[0]
    header = (
        f"ply\nformat ascii 1.0\n"
        f"element vertex {N}\n"
        f"property float x\nproperty float y\nproperty float z\n"
        f"property uchar red\nproperty uchar green\nproperty uchar blue\n"
        f"end_header\n"
    )
    # Round before casting so that float error cannot turn 255 into 254
    colors_uint8 = np.clip(np.rint(colors * 255), 0, 255).astype(np.uint8)

    with open(filename, 'w') as f:
        f.write(header)
        for i in range(N):
            f.write(f"{points[i,0]:.4f} {points[i,1]:.4f} {points[i,2]:.4f} "
                    f"{colors_uint8[i,0]} {colors_uint8[i,1]} {colors_uint8[i,2]}\n")
    print(f"Saved {N} points to {filename}")
```
Depth estimation comparison: Run both MiDaS and Depth Anything on the same 5 images. Compare depth maps visually. Which handles edges better? Which is faster?
Indoor scene reconstruction: Take an RGB image of your room. Estimate depth → create point cloud → save as PLY → view in MeshLab or Open3D. How accurate does the 3D reconstruction look?
Depth for obstacle avoidance: Given a depth map from a robot's camera, write a function that returns the closest obstacle distance and its (x, y) position in the image. This is the foundation for reactive navigation.
Depth estimation adds the third dimension to our vision pipeline. Tomorrow: processing 3D data directly with point cloud transformers — the representation robots actually use for manipulation.