Reinforcement Learning in UrbanVerse#
UrbanVerse offers a comprehensive reinforcement learning framework designed specifically for training navigation policies in photorealistic urban environments. Built on Isaac Lab’s robust simulation infrastructure, UrbanVerse enables efficient parallel training across diverse real-to-sim city scenes, supporting everything from simple point navigation to complex multi-agent interactions.
This section walks through the essential building blocks for configuring and training RL agents in UrbanVerse’s rich urban simulation environments.
Documentation Overview#
This documentation covers the complete reinforcement learning workflow:
Configuring RL environments with scenes, observations, actions, rewards, and termination conditions
Understanding the seven key components that define the learning task
Setting up curriculum learning to gradually increase task difficulty
Using UrbanVerse’s RL APIs for environment creation and training
Best practices for effective reinforcement learning in urban navigation tasks
The RL Environment Architecture#
Training a navigation policy in UrbanVerse involves configuring seven key components that work together to define the learning task:
🌍 Scene Configuration Choose from UrbanVerse-160’s diverse city layouts, CraftBench’s artist-crafted test scenes, or your own custom environments generated with UrbanVerse-Gen. Configure how scenes are distributed across parallel training environments. → Working with Urban Scenes
🎮 Action Space Define how your robot moves. UrbanVerse automatically adapts the action interface based on your robot type—from simple velocity commands for wheeled robots to complex joint controls for humanoids. → Defining Robot Actions
👁️ Observations Specify what your policy sees. Combine visual inputs from onboard cameras with goal-relative position vectors and proprioceptive state information. → What Your Policy Sees: Observations
🎯 Rewards Design the learning signal. Balance task completion rewards, safety penalties, and navigation quality metrics to guide your policy toward desired behaviors. → Designing the Reward Function
🏁 Episode Termination Control when episodes end. Define success conditions, failure modes, and time limits that shape the learning dynamics. → When Episodes End: Termination Conditions
📈 Curriculum Learning Gradually increase task difficulty. Start with nearby goals and simple scenes, then progressively expand to long-range navigation across diverse urban layouts (a simple scheduling sketch follows this list). → Progressive Learning: Curriculum Strategies
⚡ Events Inject variability and randomization. Customize robot initialization, dynamic agent spawning, and environmental variations to improve policy robustness. → Environment Events and Initialization
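To make the curriculum idea above concrete, the sketch below shows one way a goal-distance schedule could expand with training progress. It is a minimal, framework-agnostic illustration; the function name and distance ranges are assumptions, not part of the UrbanVerse API.

import numpy as np

def goal_distance_range(progress: float,
                        start=(2.0, 5.0),
                        end=(10.0, 50.0)):
    """Linearly expand the goal sampling distance (in meters) with training progress.

    progress: fraction of training completed, in [0, 1].
    Returns (min_distance, max_distance) for sampling navigation goals.
    """
    p = float(np.clip(progress, 0.0, 1.0))
    lo = start[0] + p * (end[0] - start[0])
    hi = start[1] + p * (end[1] - start[1])
    return lo, hi

# Early in training, goals are sampled close to the robot...
print(goal_distance_range(0.0))   # (2.0, 5.0)
# ...and by the end they span much longer routes.
print(goal_distance_range(1.0))   # (10.0, 50.0)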
Basic Usage#
The following example demonstrates how to create a complete RL environment for training a COCO wheeled robot:
import urbanverse as uv
from urbanverse.navigation.config import (
    EnvCfg, SceneCfg, ObservationCfg, ActionCfg,
    RewardCfg, TerminationCfg, CurriculumCfg
)

# Define your training configuration
cfg = EnvCfg(
    scenes=SceneCfg(
        scene_paths=["/path/to/UrbanVerse-160/Tokyo_0001/scene.usd", ...],
        async_sim=True,
    ),
    robot_type="coco_wheeled",
    observations=ObservationCfg(rgb_size=(135, 240), include_goal_vector=True),
    actions=ActionCfg(),
    rewards=RewardCfg(),
    terminations=TerminationCfg(max_episode_steps=300),
    curriculum=CurriculumCfg(enable_goal_distance_curriculum=True),
)

# Create and train
env = uv.navigation.rl.create_env(cfg)

# training_cfg is assumed to be defined separately with your RL training
# hyperparameters (algorithm, learning rate, total steps, etc.)
uv.navigation.rl.train(
    env=env,
    training_cfg=training_cfg,
    output_dir="outputs/my_navigation_policy"
)
Each component can be customized independently, allowing you to experiment with different configurations, robot types, and training strategies. The following pages dive deep into each component, providing detailed explanations, examples, and best practices for training effective navigation policies in UrbanVerse.
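For instance, lengthening episodes or reweighting reward terms only requires changing the corresponding sub-configuration before creating the environment. The snippet below is a hedged sketch of that pattern: max_episode_steps appears in the example above, but the reward weight field names are hypothetical placeholders rather than confirmed UrbanVerse parameters.

# Tweak individual components on the same EnvCfg, then rebuild the environment.
cfg.terminations = TerminationCfg(max_episode_steps=600)   # allow longer episodes
cfg.rewards = RewardCfg(
    goal_reached_weight=10.0,        # hypothetical: bonus for reaching the goal
    collision_penalty_weight=-5.0,   # hypothetical: penalty for collisions
)
env = uv.navigation.rl.create_env(cfg)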
Benefits of Reinforcement Learning#
Reinforcement learning provides several advantages for training navigation policies:
Learn from experience: Policies improve through trial and error, discovering robust navigation strategies
Exceed expert performance: RL can learn behaviors that go beyond what’s demonstrated in expert data
Handle complex scenarios: Effective for long-horizon tasks requiring sophisticated planning and decision-making
Combine with imitation learning: Use BC policies as warm starts, then fine-tune with RL for best results
UrbanVerse’s reinforcement learning framework is designed to work seamlessly with the same scenes, robots, and observation/action spaces used in imitation learning, making it easy to combine both approaches.
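A common version of this pattern is to initialize the RL policy from a behavior-cloning checkpoint and then fine-tune it with the same environment configuration. The sketch below only illustrates the warm-start idea; the policy_checkpoint argument is a hypothetical name, and the actual mechanism for loading a pretrained policy may differ.

# Hedged sketch: warm-start RL fine-tuning from a BC checkpoint.
# policy_checkpoint is a hypothetical argument, not a confirmed API parameter.
env = uv.navigation.rl.create_env(cfg)
uv.navigation.rl.train(
    env=env,
    training_cfg=training_cfg,
    policy_checkpoint="outputs/my_bc_policy/checkpoint.pt",  # hypothetical argument
    output_dir="outputs/my_finetuned_policy",
)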