Phase VI — Lie Groups & Manifold Optimization | Week 13 | 2.5 hours Assemble everything into a working VIO: SE(3) poses, IMU preintegration, landmarks, and sliding-window optimization.
Visual-Inertial Odometry (VIO) is the fallback localization method on the OKS robot when primary sensors (sensorbar, wheel odometry) degrade — e.g., during floor transitions, sensorbar stiction events, or on slippery surfaces. This capstone builds a complete mini-VIO pipeline demonstrating exactly how the robot fuses camera and IMU data to maintain pose estimates under adverse conditions.
A VIO system estimates the robot's trajectory by fusing: 1. IMU data (200+ Hz): provides rotation and acceleration between keyframes 2. Visual data (10–30 Hz): provides bearing measurements to 3D landmarks
The state at keyframe $k$:
$$\mathcal{X}_k = (T_k \in \text{SE}(3), \; v_k \in \mathbb{R}^3, \; b_k \in \mathbb{R}^6)$$
where $T_k$ is the camera pose, $v_k$ is velocity, and $b_k = (b_a, b_g)$ are IMU biases.
IMU Preintegration Factor (between consecutive keyframes):
$$r_{\text{imu}}^{ij} = \begin{pmatrix} \text{Log}(\Delta\tilde{R}_{ij}^T R_i^T R_j) \\ R_i^T(v_j - v_i - g\Delta t) - \Delta\tilde{v}_{ij} \\ R_i^T(p_j - p_i - v_i\Delta t - \frac{1}{2}g\Delta t^2) - \Delta\tilde{p}_{ij} \end{pmatrix} \in \mathbb{R}^9$$
Visual Landmark Factor (bearing measurement from keyframe $k$ to landmark $\ell$):
$$r_{\text{vis}}^{k\ell} = \pi(T_k^{-1} \cdot p_\ell) - z_{k\ell} \in \mathbb{R}^2$$
where $\pi$ is the camera projection and $z_{k\ell}$ is the observed pixel coordinate.
Bias Random Walk Factor (between consecutive keyframes):
$$r_{\text{bias}}^{ij} = b_j - b_i \in \mathbb{R}^6$$
Instead of optimizing all keyframes (growing cost), keep a sliding window of the last $W$ keyframes:
$$\min_{\mathcal{X}_{k-W+1:k}, \, p_{\ell \in \mathcal{L}}} \sum_{\text{factors in window}} \|r\|_{\Omega}^2$$
When a keyframe exits the window, it is marginalized (see §91.5), producing a prior on the remaining variables.
The Hessian has block structure:
$$H = \begin{pmatrix} H_{pp} & H_{pl} \\ H_{lp} & H_{ll} \end{pmatrix}$$
where $p$ indexes pose/velocity/bias states and $l$ indexes landmarks. Since $H_{ll}$ is block-diagonal ($3\times3$ blocks, one per landmark), we use the Schur complement to solve a reduced system in poses only:
$$\left(H_{pp} - H_{pl} H_{ll}^{-1} H_{lp}\right) \delta_p = b_p - H_{pl} H_{ll}^{-1} b_l$$
When removing old keyframe $x_0$ from the window, we convert its information into a prior factor on the remaining variables via the Schur complement:
$$H_{\text{prior}} = H_{rr} - H_{r0} H_{00}^{-1} H_{0r}$$
$$b_{\text{prior}} = b_r - H_{r0} H_{00}^{-1} b_0$$
This prior factor is dense over all variables that were connected to $x_0$, causing fill-in that limits the window size in practice.
"""Mini VIO: SE(3) + IMU preintegration + point landmarks + manifold GN."""
import numpy as np
# === Lie group helpers ===
def skew(v):
return np.array([[0,-v[2],v[1]],[v[2],0,-v[0]],[-v[1],v[0],0]])
def so3_exp(phi):
a = np.linalg.norm(phi)
if a < 1e-10: return np.eye(3) + skew(phi)
K = skew(phi/a)
return np.eye(3) + np.sin(a)*K + (1-np.cos(a))*(K@K)
def so3_log(R):
ca = np.clip((np.trace(R)-1)/2, -1, 1); a = np.arccos(ca)
if a < 1e-10: return np.array([R[2,1]-R[1,2], R[0,2]-R[2,0], R[1,0]-R[0,1]])/2
return a/(2*np.sin(a))*np.array([R[2,1]-R[1,2], R[0,2]-R[2,0], R[1,0]-R[0,1]])
def se3_exp(xi):
rho, phi = xi[:3], xi[3:]
R = so3_exp(phi); a = np.linalg.norm(phi)
if a < 1e-10: V = np.eye(3)
else:
K = skew(phi/a)
V = np.eye(3)+(1-np.cos(a))/a**2*K+(a-np.sin(a))/a**3*(K@K)
T = np.eye(4); T[:3,:3] = R; T[:3,3] = V @ rho
return T
def se3_log(T):
R, t = T[:3,:3], T[:3,3]; phi = so3_log(R); a = np.linalg.norm(phi)
if a < 1e-10: return np.concatenate([t, phi])
K = skew(phi)
V_inv = np.eye(3) - 0.5*K + (1-a/(2*np.tan(a/2)))/a**2*(K@K)
return np.concatenate([V_inv @ t, phi])
def se3_inv(T):
Ti = np.eye(4); Ti[:3,:3] = T[:3,:3].T; Ti[:3,3] = -T[:3,:3].T @ T[:3,3]
return Ti
# === Camera projection ===
def project(T_wc, p_w, fx=200, fy=200, cx=160, cy=120):
"""Project world point p_w into camera at pose T_wc. Returns pixel coords."""
p_c = se3_inv(T_wc)[:3,:3] @ (p_w - T_wc[:3,3])
if p_c[2] < 0.1:
return None
u = fx * p_c[0] / p_c[2] + cx
v = fy * p_c[1] / p_c[2] + cy
return np.array([u, v])
# === Synthetic data generation ===
def generate_vio_data(n_frames=5, n_landmarks=15):
np.random.seed(42)
gravity = np.array([0, 0, -9.81])
dt_frame = 0.1 # 10 Hz keyframes
# True trajectory: forward motion with gentle turn
true_poses, true_vels = [np.eye(4)], [np.array([1.0, 0, 0])]
for i in range(n_frames - 1):
omega = np.array([0, 0, 0.15]) # gentle yaw
dR = so3_exp(omega * dt_frame)
T_prev = true_poses[-1]
T_new = np.eye(4)
T_new[:3,:3] = T_prev[:3,:3] @ dR
T_new[:3,3] = T_prev[:3,3] + true_vels[-1] * dt_frame
true_poses.append(T_new)
true_vels.append(T_new[:3,:3] @ np.array([1.0, 0, 0]))
# Landmarks: scattered in front of trajectory
landmarks = np.random.randn(n_landmarks, 3) * 2
landmarks[:, 0] += 3 # in front
landmarks[:, 2] = np.abs(landmarks[:, 2]) + 0.5
# IMU preintegrated measurements (simplified)
imu_factors = []
for i in range(n_frames - 1):
dR = true_poses[i][:3,:3].T @ true_poses[i+1][:3,:3]
dv = true_poses[i][:3,:3].T @ (true_vels[i+1] - true_vels[i] - gravity * dt_frame)
dp = true_poses[i][:3,:3].T @ (true_poses[i+1][:3,3] - true_poses[i][:3,3]
- true_vels[i] * dt_frame - 0.5 * gravity * dt_frame**2)
# Add noise
dR_noisy = dR @ so3_exp(np.random.randn(3) * 0.005)
dv_noisy = dv + np.random.randn(3) * 0.01
dp_noisy = dp + np.random.randn(3) * 0.005
imu_factors.append((dR_noisy, dv_noisy, dp_noisy, dt_frame))
# Visual observations
vis_obs = []
for k in range(n_frames):
for l_idx in range(n_landmarks):
z = project(true_poses[k], landmarks[l_idx])
if z is not None:
z_noisy = z + np.random.randn(2) * 1.0 # 1 pixel noise
vis_obs.append((k, l_idx, z_noisy))
return true_poses, true_vels, landmarks, imu_factors, vis_obs, gravity
# === Mini VIO optimizer ===
def mini_vio(n_frames, n_landmarks, imu_factors, vis_obs, gravity,
init_poses, init_vels, init_landmarks, max_iter=15):
"""Manifold GN for VIO: optimize poses (SE3), velocities (R3), landmarks (R3)."""
# State layout: [pose_0(6), vel_0(3), ..., pose_{n-1}(6), vel_{n-1}(3), lm_0(3), ..., lm_{m-1}(3)]
pose_dim, vel_dim, lm_dim = 6, 3, 3
state_dim_per_frame = pose_dim + vel_dim
n_pose_vars = n_frames * state_dim_per_frame
n_lm_vars = n_landmarks * lm_dim
N = n_pose_vars + n_lm_vars
poses = [T.copy() for T in init_poses]
vels = [v.copy() for v in init_vels]
lms = [p.copy() for p in init_landmarks]
omega_imu_R = np.eye(3) * 1e4
omega_imu_v = np.eye(3) * 1e4
omega_imu_p = np.eye(3) * 1e4
omega_vis = np.eye(2) * (1.0 / 1.0**2) # 1 pixel sigma
costs = []
for it in range(max_iter):
H = np.zeros((N, N))
b = np.zeros(N)
cost = 0.0
# --- IMU factors ---
for i in range(n_frames - 1):
dR_m, dv_m, dp_m, dt = imu_factors[i]
Ri, Rj = poses[i][:3,:3], poses[i+1][:3,:3]
ti, tj = poses[i][:3,3], poses[i+1][:3,3]
vi, vj = vels[i], vels[i+1]
e_R = so3_log(dR_m.T @ Ri.T @ Rj)
e_v = Ri.T @ (vj - vi - gravity * dt) - dv_m
e_p = Ri.T @ (tj - ti - vi * dt - 0.5 * gravity * dt**2) - dp_m
cost += e_R @ omega_imu_R @ e_R + e_v @ omega_imu_v @ e_v + e_p @ omega_imu_p @ e_p
# Indices into state vector
idx_i = i * state_dim_per_frame
idx_j = (i + 1) * state_dim_per_frame
# Simplified Jacobians (approximate, small-error regime)
# d(e_R)/d(pose_i) ~ -I3, d(e_R)/d(pose_j) ~ I3 (rotation part)
for e_block, omega_block, r_start, r_end in [
(e_R, omega_imu_R, 0, 3),
(e_v, omega_imu_v, 0, 3),
(e_p, omega_imu_p, 0, 3),
]:
A, B = -np.eye(3), np.eye(3)
# Map to correct sub-indices
if e_block is e_R:
ci, cj = idx_i + 3, idx_j + 3 # rotation part of pose
elif e_block is e_v:
ci, cj = idx_i + 6, idx_j + 6 # velocity
else:
ci, cj = idx_i, idx_j # translation part of pose
H[ci:ci+3, ci:ci+3] += A.T @ omega_block @ A
H[ci:ci+3, cj:cj+3] += A.T @ omega_block @ B
H[cj:cj+3, ci:ci+3] += B.T @ omega_block @ A
H[cj:cj+3, cj:cj+3] += B.T @ omega_block @ B
b[ci:ci+3] += A.T @ omega_block @ e_block
b[cj:cj+3] += B.T @ omega_block @ e_block
# --- Visual factors ---
fx, fy, cx, cy = 200, 200, 160, 120
for (k, l_idx, z_obs) in vis_obs:
T_wc = poses[k]
p_w = lms[l_idx]
p_c = se3_inv(T_wc)[:3,:3] @ (p_w - T_wc[:3,3])
if p_c[2] < 0.1:
continue
z_pred = np.array([fx*p_c[0]/p_c[2]+cx, fy*p_c[1]/p_c[2]+cy])
e_vis = z_pred - z_obs
cost += e_vis @ omega_vis @ e_vis
# Jacobian w.r.t. landmark (in world frame, projected to pixel)
dz_dpc = np.array([[fx/p_c[2], 0, -fx*p_c[0]/p_c[2]**2],
[0, fy/p_c[2], -fy*p_c[1]/p_c[2]**2]])
R_cw = se3_inv(T_wc)[:3,:3]
J_lm = dz_dpc @ R_cw # 2x3
lm_idx = n_pose_vars + l_idx * lm_dim
H[lm_idx:lm_idx+3, lm_idx:lm_idx+3] += J_lm.T @ omega_vis @ J_lm
b[lm_idx:lm_idx+3] += J_lm.T @ omega_vis @ e_vis
costs.append(cost)
# Anchor first pose and velocity
for d in range(state_dim_per_frame):
H[d, :] = 0; H[:, d] = 0; H[d, d] = 1e8
b[d] = 0
# Damping
H += 1e-3 * np.eye(N)
delta = np.linalg.solve(H, -b)
if np.linalg.norm(delta) < 1e-8:
break
# Apply updates
for k in range(n_frames):
idx = k * state_dim_per_frame
poses[k] = poses[k] @ se3_exp(delta[idx:idx+6])
vels[k] = vels[k] + delta[idx+6:idx+9]
for l_idx in range(n_landmarks):
idx = n_pose_vars + l_idx * lm_dim
lms[l_idx] = lms[l_idx] + delta[idx:idx+3]
return poses, vels, lms, costs
# === Run ===
n_f, n_l = 5, 15
true_poses, true_vels, true_lms, imu_facs, vis_obs, grav = generate_vio_data(n_f, n_l)
# Noisy initialization
init_poses = [T @ se3_exp(np.random.randn(6)*0.05) for T in true_poses]
init_poses[0] = true_poses[0].copy() # anchor
init_vels = [v + np.random.randn(3)*0.1 for v in true_vels]
init_vels[0] = true_vels[0].copy()
init_lms = [p + np.random.randn(3)*0.3 for p in true_lms]
opt_poses, opt_vels, opt_lms, costs = mini_vio(
n_f, n_l, imu_facs, vis_obs, grav, init_poses, init_vels, init_lms)
print(f"VIO optimization: {len(costs)} iterations")
print(f"Cost: {costs[0]:.2f} -> {costs[-1]:.4f}")
print("\nPose errors (6D norm):")
for k in range(n_f):
err = np.linalg.norm(se3_log(se3_inv(true_poses[k]) @ opt_poses[k]))
print(f" Frame {k}: {err:.5f}")
print(f"\nLandmark RMSE: {np.sqrt(np.mean([(np.linalg.norm(opt_lms[i]-true_lms[i]))**2 for i in range(n_l)])):.4f} m")
For a 5-frame VIO window with 15 landmarks each seen in 3 frames, and 4 IMU factors, how many residual components are there? What is the Hessian dimension?
Apply the Schur complement to the VIO Hessian to eliminate landmarks. What is the reduced system size? How much faster is solving it?
Remove all visual factors and run with IMU only. Then remove IMU factors and run with vision only. Compare drift in each case. What does each sensor contribute?
Implement marginalization of the oldest keyframe. When frame 0 is removed, create a prior factor on frames 1–4 using the Schur complement. Run the windowed optimizer and verify the prior keeps the solution consistent.
Compare two marginalization strategies: (a) fixed-lag smoother removing the oldest frame each step, (b) sliding window that keeps every 3rd frame for long-range constraints. Measure trajectory drift over 20 frames.
Show that a VIO system with constant velocity has unobservable yaw. How many unobservable directions does a stationary VIO system have?