Phase III — LLMs: Training & Alignment | Week 6 | 2.5 hours "Language is the interface between human intent and robot action." — Google Brain
LLMs understand language. Robots execute actions. The challenge: bridge the gap.
Human intent: "Pick up the red cup on the table"
│
▼ (language understanding)
Semantic understanding: object=red_cup, location=table, action=grasp
│
▼ (grounding)
Physical grounding: position=(0.3, 0.5, 0.1), grasp_type=top_pinch
│
▼ (planning)
Action sequence: approach → pre-grasp → close_gripper → lift → verify
│
▼ (control)
Motor commands: joint velocities, torques, trajectories
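The first three stages of the diagram can be read as typed hand-offs between functions. The sketch below is a minimal illustration under stated assumptions: the names `Intent`, `GroundedTarget`, `SCENE`, `understand`, `ground`, and `plan` are hypothetical, the language stage is a hard-coded stand-in for an LLM call, and the scene table stands in for a perception system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """Output of the language-understanding stage."""
    action: str
    obj: str
    location: str


@dataclass(frozen=True)
class GroundedTarget:
    """Output of the grounding stage: where the object is and how to grasp it."""
    position: tuple[float, float, float]
    grasp_type: str


# Hypothetical scene table standing in for a perception system.
SCENE = {
    ("red_cup", "table"): GroundedTarget(position=(0.3, 0.5, 0.1), grasp_type="top_pinch"),
}


def understand(instruction: str) -> Intent:
    """Stand-in for an LLM call; it only recognises the example sentence."""
    text = instruction.lower()
    if "pick up" in text and "red cup" in text and "table" in text:
        return Intent(action="grasp", obj="red_cup", location="table")
    raise ValueError(f"cannot interpret instruction: {instruction!r}")


def ground(intent: Intent) -> GroundedTarget:
    """Map symbols to coordinates; fail loudly if the object is not in the scene."""
    try:
        return SCENE[(intent.obj, intent.location)]
    except KeyError:
        raise LookupError(f"{intent.obj} not found at {intent.location}") from None


def plan(intent: Intent, target: GroundedTarget) -> list[str]:
    """Expand a grounded intent into the primitive sequence from the diagram.

    The target is unused in this sketch; a real planner would derive
    waypoints from target.position and target.grasp_type.
    """
    if intent.action != "grasp":
        raise ValueError(f"no plan template for action {intent.action!r}")
    return ["approach", "pre-grasp", "close_gripper", "lift", "verify"]
```

Each stage fails with its own exception type, which makes it easier to tell a language failure (the instruction was not understood) from a grounding failure (the object is not in the scene). The control stage is left out because it runs at a much higher rate than the other three.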
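Three paradigms have emerged. This section covers two of them, SayCan and Code as Policies; Vision-Language-Action models (VLAs), discussed in the closing note, are left to Phase VII.
SayCan (Google, 2022): the LLM scores how useful each candidate action is for the instruction, and the robot's affordance model scores how likely that action is to succeed from the current state. The action with the highest product is selected.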
$$ \text{action}^* = \arg\max_a \underbrace{P_{\text{LLM}}(a \mid \text{instruction})}_{\text{language score}} \cdot \underbrace{P_{\text{affordance}}(a \mid \text{state})}_{\text{feasibility score}} $$
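The argmax above selects a single step. SayCan builds a multi-step plan by repeating it: the chosen skill is appended to the plan, the candidates are re-scored, and the loop ends when a terminating skill is selected. The sketch below shows that loop under stated assumptions. The function `saycan_plan` and both toy scorers are hypothetical; in SayCan the language score comes from an LLM and the affordance score from learned value functions over the robot state, whereas here the plan so far is used as a crude proxy for state and all scores are invented for illustration.

```python
from typing import Callable

# A scorer takes a candidate skill and the plan so far, and returns a score in [0, 1].
Scorer = Callable[[str, list[str]], float]


def saycan_plan(
    skills: list[str],
    language_score: Scorer,
    affordance_score: Scorer,
    max_steps: int = 10,
) -> list[str]:
    """Greedily pick the skill with the highest combined score until 'done' wins."""
    plan: list[str] = []
    for _ in range(max_steps):
        best = max(
            skills,
            key=lambda s: language_score(s, plan) * affordance_score(s, plan),
        )
        if best == "done":
            break
        plan.append(best)
    return plan


def toy_language_score(skill: str, plan: list[str]) -> float:
    """Toy stand-in for the LLM: prefers the next step of a fixed cleanup script."""
    script = ["find sponge", "pick up sponge", "wipe spill", "done"]
    expected = script[min(len(plan), len(script) - 1)]
    return 0.9 if skill == expected else 0.05


def toy_affordance_score(skill: str, plan: list[str]) -> float:
    """Toy stand-in for the value function: the sponge must be found before it is picked."""
    if skill == "pick up sponge" and "find sponge" not in plan:
        return 0.0
    return 0.8
```

The `max_steps` bound is a design choice of this sketch: it guarantees termination even if the scorers never rank `done` highest.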
Instruction: "I spilled my drink, can you help?"
LLM scoring (what makes sense):
"Pick up sponge" → 0.8
"Go to kitchen" → 0.6
"Pick up apple" → 0.1
"Dance" → 0.0
Affordance scoring (what's possible):
"Pick up sponge" → 0.9 (sponge is reachable)
"Go to kitchen" → 0.7 (path exists)
"Pick up apple" → 0.3 (apple not nearby)
"Dance" → 0.1 (no dance skill)
Combined:
"Pick up sponge" → 0.8 × 0.9 = 0.72 ← selected!
"Go to kitchen" → 0.6 × 0.7 = 0.42
"Pick up apple" → 0.1 × 0.3 = 0.03
"Dance" → 0.0 × 0.1 = 0.00
Code as Policies (Google, 2023): Instead of selecting from predefined actions, the LLM generates executable code.
# User: "Stack the blocks in a pyramid"
# LLM generates:
def stack_pyramid():
    blocks = detect_objects("block")
    sorted_blocks = sorted(blocks, key=lambda b: b.size, reverse=True)
    # Bottom row
    base_y = 0.3
    for i, block in enumerate(sorted_blocks[:3]):
        pick(block)
        place(x=0.2 + i * 0.1, y=base_y, z=0.05)
    # Middle row
    for i, block in enumerate(sorted_blocks[3:5]):
        pick(block)
        place(x=0.25 + i * 0.1, y=base_y, z=0.1)
    # Top
    pick(sorted_blocks[5])
    place(x=0.3, y=base_y, z=0.15)
Advantages over SayCan:
- Not limited to predefined action primitives
- Can express complex logic (loops, conditionals)
- More generalizable: new tasks don't need new training
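Because the generated program is free-form, it is usually checked before it runs. The sketch below is one possible pre-execution check, not the method used by Code as Policies: it parses the generated source with Python's `ast` module, rejects imports, dunder attribute access, and calls to anything outside an allowlist, then executes the code with only the robot primitives in scope. The names `ALLOWED_CALLS`, `SAFE_BUILTINS`, `check_generated_code`, and `run_policy` are hypothetical, and a static check like this is not a security sandbox.

```python
import ast

# Hypothetical allowlist: robot primitives plus a few harmless builtins.
ALLOWED_CALLS = {"detect_objects", "pick", "place", "sorted", "enumerate", "len", "range"}
SAFE_BUILTINS = {"sorted": sorted, "enumerate": enumerate, "len": len, "range": range}


def check_generated_code(source: str) -> list[str]:
    """Return a list of violations; an empty list means the code passed this check."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    # Functions that the generated code defines for itself may also be called.
    local_defs = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    violations = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            violations.append("imports are not allowed")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            violations.append(f"dunder attribute access: {node.attr}")
        elif isinstance(node, ast.Call):
            if not isinstance(node.func, ast.Name):
                # Method calls such as obj.method() are rejected in this strict sketch.
                violations.append("only direct calls to named functions are allowed")
            elif node.func.id not in ALLOWED_CALLS | local_defs:
                violations.append(f"call to non-allowlisted function: {node.func.id}")
    return violations


def run_policy(source: str, primitives: dict) -> None:
    """Execute generated code with only the allowlisted names in scope."""
    violations = check_generated_code(source)
    if violations:
        raise PermissionError("; ".join(violations))
    # An empty __builtins__ hides open(), __import__(), and the other default builtins.
    namespace = {"__builtins__": {}, **SAFE_BUILTINS, **primitives}
    exec(compile(source, "<llm-policy>", "exec"), namespace)
```

Rejecting method calls is deliberately strict: it blocks `os.system(...)` style calls at the cost of also blocking harmless ones such as `blocks.sort()`. Physical safety still has to be enforced inside the primitives themselves, since an allowlisted `place` can be called with unsafe coordinates.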
Directly applicable to OKS robot fleet management:
LLM-Powered Robot Diagnostics:
Input: Robot error logs + telemetry
"[14:23:01] WARN: sensorbar SPI timeout (3 consecutive)
[14:23:05] ERROR: NAV_ESTIMATED_STATE_NOT_FINITE
[14:23:05] INFO: Guardian triggering safety stop
[14:23:06] ERROR: Robot stopped — manual recovery needed"
LLM Analysis:
"Root cause: SPI communication failure between sensorbar and
main controller. The 3 consecutive timeouts caused the navigation
estimator to receive stale data, resulting in NaN propagation.
Recommended actions:
1. Check SPI wiring and connectors
2. Verify sensorbar firmware version (expect v1.2.4+)
3. Inspect for debris on sensor surface
4. Review IMU calibration date"
LLMs can decompose high-level warehouse goals into multi-robot task plans:
Goal: "Process all incoming shipments in Dock B"
LLM task decomposition:
1. Robot-1: Navigate to Dock B, scan incoming pallets
2. Robot-2: Transport pallet #1 to Sorting Zone A
3. Robot-3: Transport pallet #2 to Sorting Zone B
4. Robot-1: Verify Dock B is clear, report to WMS
5. Robot-2: Return to staging area
6. Robot-3: Return to staging area
Constraints extracted:
- Robot-1 must complete step 1 before steps 2-3 can begin
- Steps 2 and 3 can execute in parallel
- Step 4 depends on steps 2-3 completion
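The extracted constraints form a dependency graph, and steps with no unmet dependencies can be dispatched together. The sketch below groups the six steps into parallel waves using `graphlib.TopologicalSorter` from the Python standard library (3.9+). The dependencies for steps 5 and 6 are an assumption added here (each robot returns after its own transport step); the plan above only states the constraints for steps 1 to 4. The names `DEPENDENCIES` and `parallel_waves` are hypothetical.

```python
from graphlib import TopologicalSorter

# step -> set of steps that must finish first
DEPENDENCIES = {
    1: set(),    # Robot-1 scans Dock B
    2: {1},      # Robot-2 transports pallet #1
    3: {1},      # Robot-3 transports pallet #2
    4: {2, 3},   # Robot-1 verifies Dock B is clear
    5: {2},      # Robot-2 returns (assumed: after its own transport)
    6: {3},      # Robot-3 returns (assumed: after its own transport)
}


def parallel_waves(dependencies: dict[int, set[int]]) -> list[list[int]]:
    """Group steps into waves; every step in a wave can run concurrently."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the plan is circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(parallel_waves(DEPENDENCIES), start=1):
        print(f"Wave {number}: steps {wave}")
```

With these dependencies the plan collapses to three waves: `[1]`, `[2, 3]`, and `[4, 5, 6]`. Running `prepare()` on an LLM-generated plan is also a cheap sanity check, since a hallucinated circular dependency raises `CycleError` before any robot moves.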
⚠️ LLM hallucination → dangerous in robotics
"Place the hot coffee on the baby"
→ LLM might generate valid code for this harmful action!
Safety layers required:
1. Action validation against physics constraints
2. Safety zone checking before execution
3. Human confirmation for irreversible actions
4. Affordance checking (can the robot actually do this?)
5. Semantic safety filter (is this action safe?)
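The layers above can be composed as a chain of independent checks in which the first failure blocks execution. The sketch below is a minimal illustration covering three of the five layers. Every name and threshold in it is hypothetical (`ProposedAction`, `FORBIDDEN_TARGETS`, `IRREVERSIBLE_ACTIONS`, `WORKSPACE_Z`, and the check functions), and a keyword list is only a placeholder for a real semantic safety filter.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ProposedAction:
    name: str
    target: str
    params: dict = field(default_factory=dict)


Check = Callable[[ProposedAction], tuple[bool, str]]

# Hypothetical values for illustration only.
FORBIDDEN_TARGETS = {"baby", "person", "human"}
IRREVERSIBLE_ACTIONS = {"cut", "pour", "discard"}
WORKSPACE_Z = (0.0, 1.2)  # assumed reachable height range in metres


def physics_check(action: ProposedAction) -> tuple[bool, str]:
    """Layer 1: reject targets outside the assumed reachable height range."""
    low, high = WORKSPACE_Z
    z = action.params.get("z", 0.0)
    if not low <= z <= high:
        return False, f"z={z} is outside the workspace [{low}, {high}]"
    return True, "ok"


def semantic_check(action: ProposedAction) -> tuple[bool, str]:
    """Layer 5: reject actions aimed at a forbidden target."""
    if action.target.lower() in FORBIDDEN_TARGETS:
        return False, f"target '{action.target}' is never a valid action target"
    return True, "ok"


def confirmation_check(confirmed: bool) -> Check:
    """Layer 3: build a check that blocks irreversible actions without human sign-off."""
    def check(action: ProposedAction) -> tuple[bool, str]:
        if action.name in IRREVERSIBLE_ACTIONS and not confirmed:
            return False, f"'{action.name}' is irreversible and needs human confirmation"
        return True, "ok"
    return check


def run_checks(action: ProposedAction, checks: list[Check]) -> tuple[bool, str]:
    """Run checks in order and stop at the first failure."""
    for check in checks:
        ok, reason = check(action)
        if not ok:
            return False, reason
    return True, "all checks passed"
```

Keeping each layer as a separate function means a failure reason always names the rule that fired, and new layers (safety zones, affordance checks) can be appended to the list without touching the existing ones.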
Reality: LLMs are planners, not controllers.
✅ Use LLMs for: task decomposition, error diagnosis, NL interface
❌ Don't use for: real-time control, safety-critical decisions
"""
Day 41 Implementation: LLM-powered robot command translator.
Translates natural language to structured robot commands.
"""
import json
import re
from dataclasses import dataclass, field
from enum import Enum
class CommandType(Enum):
    NAVIGATE = "navigate"
    PICK = "pick"
    PLACE = "place"
    STOP = "stop"
    STATUS = "status"
    SCAN = "scan"
    CHARGE = "charge"
@dataclass
class RobotCommand:
    command_type: CommandType
    parameters: dict = field(default_factory=dict)
    safety_check: bool = True
    confidence: float = 0.0

    def to_ros_message(self) -> dict:
        """Convert to ROS-style message format."""
        if self.command_type == CommandType.NAVIGATE:
            # A named destination must be resolved to a pose before this
            # message is useful; unresolved fields default to empty dicts.
            return {
                "topic": "/move_base/goal",
                "msg_type": "MoveBaseGoal",
                "data": {
                    "target_pose": {
                        "position": self.parameters.get("position", {}),
                        "orientation": self.parameters.get("orientation", {}),
                    }
                },
            }
        elif self.command_type == CommandType.STOP:
            return {
                "topic": "/cmd_vel",
                "msg_type": "Twist",
                "data": {"linear": {"x": 0}, "angular": {"z": 0}},
            }
        elif self.command_type == CommandType.PICK:
            # The rule-based parser stores a free-text phrase under "object",
            # while LLM output uses "object_id"; accept either key.
            object_id = self.parameters.get("object_id") or self.parameters.get("object", "")
            return {
                "topic": "/pick_action/goal",
                "msg_type": "PickGoal",
                "data": {
                    "object_id": object_id,
                    "grasp_type": self.parameters.get("grasp_type", "auto"),
                },
            }
        return {"topic": "unknown", "data": self.parameters}
class CommandParser:
    """Rule-based fallback parser for robot commands."""

    # Dict order is match priority. STOP comes first so a stop request is
    # never shadowed by another verb, and CHARGE precedes STATUS so that
    # "battery low" is not swallowed by the generic "battery" status pattern.
    # \b anchors prevent matches inside longer words (e.g. "get" in "target").
    PATTERNS = {
        CommandType.STOP: [
            r"\b(?:stop|halt|freeze|emergency)\b",
            r"\be-?stop\b",
        ],
        CommandType.NAVIGATE: [
            r"\b(?:go|move|navigate|drive)\s+to\s+(.+)",
            r"\b(?:head|travel)\s+(?:to|towards)\s+(.+)",
        ],
        CommandType.PICK: [
            r"\b(?:pick|grab|grasp|get)\s+(?:up\s+)?(.+)",
            r"\b(?:take|collect)\s+(.+)",
        ],
        CommandType.PLACE: [
            r"\b(?:place|put|drop|set)\s+(.+?)(?:\s+(?:on|at|in)\s+(.+))?$",
        ],
        CommandType.CHARGE: [
            r"\b(?:charge|recharge|go\s+charge)\b",
            r"\b(?:battery\s+low|need\s+charging)\b",
        ],
        CommandType.STATUS: [
            r"\b(?:status|battery|health|diagnostics)\b",
            r"\b(?:what|how)\s+(?:is|are)\s+(?:your|the)\s+(.+)",
        ],
    }

    def parse(self, text: str) -> RobotCommand | None:
        # Match case-insensitively on the original text so identifiers such
        # as "B3" keep their case in the extracted parameters.
        text = text.strip()
        for cmd_type, patterns in self.PATTERNS.items():
            for pattern in patterns:
                match = re.search(pattern, text, re.IGNORECASE)
                if not match:
                    continue
                params = {}
                if cmd_type == CommandType.NAVIGATE:
                    params["destination"] = match.group(1).strip()
                elif cmd_type == CommandType.PICK:
                    params["object"] = match.group(1).strip()
                elif cmd_type == CommandType.PLACE:
                    params["object"] = match.group(1).strip()
                    # Group 2 only participates when a location was given
                    if match.lastindex and match.lastindex >= 2:
                        params["location"] = match.group(2).strip()
                return RobotCommand(
                    command_type=cmd_type,
                    parameters=params,
                    confidence=0.8,
                )
        return None
class LLMCommandTranslator:
    """Translates NL instructions to robot commands using LLM prompting."""

    SYSTEM_PROMPT = """You are a warehouse robot command interpreter.
Convert natural language instructions into structured JSON commands.
Available command types:
- navigate: Move to a location {destination, speed}
- pick: Pick up an object {object_id, grasp_type}
- place: Place an object {object_id, location}
- stop: Emergency stop {}
- status: Report status {subsystem}
- scan: Scan area {area, scan_type}
- charge: Go to charging station {}
Respond ONLY with valid JSON:
{"command": "type", "params": {...}, "safety_notes": "..."}"""

    def create_prompt(self, instruction: str) -> str:
        return (
            f"{self.SYSTEM_PROMPT}\n\n"
            f"Instruction: {instruction}\n"
            f"Command JSON:"
        )

    def parse_llm_output(self, output: str) -> RobotCommand | None:
        """Parse LLM JSON output into a RobotCommand, or None if it is malformed."""
        # Extract the outermost {...} span; models often wrap JSON in extra text
        json_match = re.search(r"\{.*\}", output, re.DOTALL)
        if not json_match:
            return None
        try:
            data = json.loads(json_match.group())
            # A missing or unknown command is rejected instead of silently
            # defaulting to a status request.
            cmd_type = CommandType(data["command"])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            return None
        params = data.get("params", {})
        if not isinstance(params, dict):
            return None
        return RobotCommand(
            command_type=cmd_type,
            parameters=params,
            confidence=0.9,
        )
class SafetyValidator:
    """Validate robot commands before execution."""

    RESTRICTED_ZONES = {"zone_x", "maintenance_area", "human_zone"}
    MAX_SPEED = 1.5  # m/s

    def validate(self, command: RobotCommand) -> tuple[bool, str]:
        """Return (is_safe, reason)."""
        # Emergency stop is checked first so no other rule can block it
        if command.command_type == CommandType.STOP:
            return True, "Emergency stop: always allowed"

        # Check restricted zones; drop a leading article so that
        # "the human zone" normalizes to "human_zone".
        dest = str(command.parameters.get("destination", ""))
        zone = re.sub(r"^(?:the|a|an)\s+", "", dest.lower().strip()).replace(" ", "_")
        if zone in self.RESTRICTED_ZONES:
            return False, f"Destination '{dest}' is a restricted zone"

        # Check speed limits; LLM output may carry a non-numeric speed
        speed = command.parameters.get("speed", 0.5)
        if not isinstance(speed, (int, float)):
            return False, f"Speed {speed!r} is not a number"
        if speed > self.MAX_SPEED:
            return False, f"Speed {speed} exceeds maximum {self.MAX_SPEED} m/s"

        return True, "Command validated"
# --- SayCan-style scoring ---
def saycan_score(
    language_scores: dict[str, float],
    affordance_scores: dict[str, float],
) -> list[tuple[str, float]]:
    """Combine language and affordance scores (SayCan style)."""
    combined = {}
    for action in language_scores:
        lang = language_scores.get(action, 0.0)
        aff = affordance_scores.get(action, 0.0)
        combined[action] = lang * aff
    return sorted(combined.items(), key=lambda x: -x[1])
# --- Demo ---
if __name__ == "__main__":
    parser = CommandParser()
    translator = LLMCommandTranslator()
    validator = SafetyValidator()

    test_commands = [
        "Go to the charging station",
        "Pick up the package from shelf B3",
        "Stop immediately!",
        "Navigate to zone_x",  # restricted zone
        "What's your battery level?",
        "Place the box on conveyor belt 2",
    ]

    print("=" * 60)
    print("Robot Command Translation")
    print("=" * 60)
    for cmd_text in test_commands:
        print(f"\nInput: \"{cmd_text}\"")
        # Rule-based parsing
        cmd = parser.parse(cmd_text)
        if cmd:
            is_safe, reason = validator.validate(cmd)
            print(f" Parsed: {cmd.command_type.value} | {cmd.parameters}")
            print(f" Safety: {'✅' if is_safe else '❌'} {reason}")
            if is_safe:
                ros_msg = cmd.to_ros_message()
                print(f" ROS topic: {ros_msg['topic']}")
        else:
            print(" ⚠️ Could not parse; would fall back to LLM")
            prompt = translator.create_prompt(cmd_text)
            print(f" LLM prompt ready ({len(prompt)} chars)")

    # SayCan demo
    print("\n" + "=" * 60)
    print("SayCan Scoring Demo")
    print("=" * 60)
    lang_scores = {
        "Pick up sponge": 0.8,
        "Navigate to kitchen": 0.6,
        "Pick up apple": 0.1,
        "Dance": 0.0,
    }
    aff_scores = {
        "Pick up sponge": 0.9,
        "Navigate to kitchen": 0.7,
        "Pick up apple": 0.3,
        "Dance": 0.1,
    }
    ranked = saycan_score(lang_scores, aff_scores)
    print("Instruction: 'I spilled my drink, can you help?'")
    for action, score in ranked:
        print(f" {score:.2f} | {action}")
Build a simple task planner that decomposes complex instructions:
1. Input: "Sort all packages by destination and deliver to zones A, B, C"
2. Generate a dependency graph of sub-tasks
3. Identify which steps can run in parallel across multiple robots
4. Output a Gantt-chart-style execution plan
Create an error diagnosis pipeline:
1. Input: a sequence of robot log entries (5-10 lines)
2. Use rule-based pattern matching to identify known error codes
3. Generate a structured diagnosis: root cause, confidence, recommended actions
4. Compare: how does adding context (past incidents) improve diagnosis quality?
This day directly connects our LLM knowledge to the robotics domain. SayCan and Code as Policies are precursors to VLAs (Vision-Language-Action models) — Phase VII of our curriculum. The difference: VLAs skip the code/action-selection intermediate step and directly output motor commands from vision + language. But understanding the language→action pipeline here is essential for understanding what VLAs replace.