In this section you’ll train a reinforcement learning (RL) policy for the Unitree H1 humanoid robot to walk over rough terrain. The training workflow uses Isaac Lab’s integration with the RSL-RL library, which implements the Proximal Policy Optimization (PPO) algorithm. This integration connects Isaac Sim’s physics simulation with an efficient RL training pipeline. By the end of this section you’ll understand the key stages of the RL training pipeline: launching a training job, tuning the PPO hyperparameters, interpreting the training logs, and evaluating the trained policy.
RSL-RL (Robotic Systems Lab Reinforcement Learning) is a lightweight RL library developed at ETH Zurich specifically for locomotion tasks. It implements PPO with features tailored to robotics, such as the adaptive KL-based learning-rate adjustment described later in this section.
Isaac Lab includes ready-to-use training scripts for RSL-RL under scripts/reinforcement_learning/rsl_rl/.
In this section you’ll train the Isaac-Velocity-Rough-H1-v0 environment. This is a locomotion task where the Unitree H1 humanoid robot must track a velocity command while navigating rough terrain.
The task details are:
| Property | Value |
|---|---|
| Environment ID | Isaac-Velocity-Rough-H1-v0 |
| Robot | Unitree H1 (19 actuated joints, bipedal humanoid) |
| Terrain | Procedurally generated rough terrain with slopes, stairs, and obstacles |
| Workflow | Manager-Based |
| Objective | Track a commanded forward velocity, lateral velocity, and yaw rate |
| Observation space | Joint positions, joint velocities, gravity projection, velocity commands, and previous actions |
| Action space | Target joint positions for all actuated joints |
| RL library | RSL-RL (PPO) |
The robot receives a velocity command (for example, “walk forward at 1.0 m/s”) and must learn to coordinate all 19 joints to achieve that velocity while maintaining balance on uneven ground.
This setup provides a high-dimensional control problem ideal for testing locomotion learning under challenging terrain.
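As a rough sanity check on the problem size, the observation dimension can be estimated by summing per-term sizes. The terms and dimensions below are illustrative assumptions based on the observation list above, not the environment's actual configuration:

```python
# Illustrative back-of-the-envelope observation size for the H1 task.
# The exact terms, ordering, and dimensions come from the environment's
# observation manager config; these values are assumptions.
NUM_JOINTS = 19

obs_terms = {
    "joint_pos": NUM_JOINTS,    # joint positions
    "joint_vel": NUM_JOINTS,    # joint velocities
    "projected_gravity": 3,     # gravity vector in the base frame
    "velocity_commands": 3,     # (vx, vy, yaw rate)
    "last_action": NUM_JOINTS,  # previous policy output
}

obs_dim = sum(obs_terms.values())
print(obs_dim)  # 63 under these assumptions
```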
Navigate to the Isaac Lab directory and start the training job. Running in headless mode disables visualization so that more GPU resources can be used for physics simulation and neural network computation:
```bash
cd ~/IsaacLab
export LD_PRELOAD="$LD_PRELOAD:/lib/aarch64-linux-gnu/libgomp.so.1"
./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/train.py \
    --task=Isaac-Velocity-Rough-H1-v0 \
    --headless
```
Once training begins, the terminal displays iteration progress, reward statistics, and performance metrics.
Example output:
```text
Learning iteration 15/3000

Computation: 65955 steps/s (collection: 1.256s, learning 0.235s)
Mean action noise std: 1.04
Mean value_function loss: 0.0911
Mean surrogate loss: 0.0003
Mean entropy loss: 27.6371
Mean reward: -5.35
Mean episode length: 61.42
Episode_Reward/track_lin_vel_xy_exp: 0.0179
Episode_Reward/track_ang_vel_z_exp: 0.0058
Episode_Reward/ang_vel_xy_l2: -0.0220
Episode_Reward/dof_torques_l2: 0.0000
Episode_Reward/dof_acc_l2: -0.0081
Episode_Reward/action_rate_l2: -0.0127
Episode_Reward/feet_air_time: 0.0002
Episode_Reward/flat_orientation_l2: -0.0170
Episode_Reward/dof_pos_limits: -0.0012
Episode_Reward/termination_penalty: -0.2000
Episode_Reward/feet_slide: -0.0119
Episode_Reward/joint_deviation_hip: -0.0083
Episode_Reward/joint_deviation_arms: -0.0079
Episode_Reward/joint_deviation_torso: -0.0013
Curriculum/terrain_levels: 0.1577
Metrics/base_velocity/error_vel_xy: 0.1221
Metrics/base_velocity/error_vel_yaw: 0.4705
Episode_Termination/time_out: 0.0000
Episode_Termination/base_contact: 1.0000
--------------------------------------------------------------------------------
Total timesteps: 1572864
Iteration time: 1.49s
Time elapsed: 00:00:26
ETA: 01:21:49
```
During training, each PPO iteration collects experience from all parallel environments, then updates the policy network using the gathered trajectories.
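The collect-then-update structure of one iteration can be sketched with a toy example (placeholder logic for counting transitions and updates, not the actual RSL-RL implementation):

```python
import random

# Toy illustration of the PPO data flow per iteration: collect
# num_steps_per_env steps from every parallel env into a buffer, then
# consume the buffer in epochs of shuffled mini-batches.
num_envs, num_steps_per_env = 8, 24
num_epochs, num_mini_batches = 5, 4

# 1. Collection phase: one transition per env per simulation step.
buffer = [(env_id, step) for step in range(num_steps_per_env)
          for env_id in range(num_envs)]
assert len(buffer) == num_envs * num_steps_per_env  # 192 transitions

# 2. Learning phase: shuffle once per epoch, split into mini-batches,
#    and perform one gradient update per mini-batch.
updates = 0
mb_size = len(buffer) // num_mini_batches
for _ in range(num_epochs):
    random.shuffle(buffer)
    for i in range(num_mini_batches):
        mini_batch = buffer[i * mb_size:(i + 1) * mb_size]
        updates += 1  # stand-in for an optimizer step

print(updates)  # 20 gradient updates per iteration
```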
Known issue: NVRTC GPU architecture error on DGX Spark
When running RL training on the Blackwell GPU (GB10, compute capability 12.1), you may encounter an error similar to:
```text
RuntimeError: nvrtc: error: invalid value for --gpu-architecture (-arch)
```
This error occurs because the NVRTC runtime compiler inside PyTorch does not yet fully support the sm_121 architecture. It is a known compatibility issue tracked in Isaac Lab Discussion #2406 and PyTorch Issue #87595.
Workaround: Make sure you are using the Isaac Sim build from source (as described in the setup section) rather than a pip-installed version. The source build includes the correct CUDA 13 runtime for Blackwell. If the error persists, try running with --headless mode, which avoids some NVRTC code paths used by the renderer. Also ensure your NVIDIA driver is up to date (nvidia-smi should show driver 580.x or later).
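To check the driver requirement programmatically, a small helper (hypothetical, not part of Isaac Lab) can parse the version out of the nvidia-smi header text:

```python
import re

def driver_at_least(smi_output: str, major_required: int = 580) -> bool:
    """Illustrative helper (not part of Isaac Lab): extract the driver
    version from `nvidia-smi` header text and compare the major number."""
    match = re.search(r"Driver Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return False
    return int(match.group(1)) >= major_required

header = "| NVIDIA-SMI 580.65.06    Driver Version: 580.65.06    CUDA Version: 13.0 |"
print(driver_at_least(header))  # True
```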
Support for Blackwell GPUs is expected to improve in upcoming PyTorch and Isaac Sim releases.
You can also override default parameters from the command line.
For example:
```bash
./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/train.py \
    --task=Isaac-Velocity-Rough-H1-v0 \
    --headless \
    --num_envs=2048 \
    --max_iterations=1500 \
    --seed=42
```
The following table explains the key command-line arguments:
| Argument | Default | Description |
|---|---|---|
| --task | (required) | The Isaac Lab environment ID. Use Isaac-Velocity-Rough-H1-v0 for rough-terrain humanoid locomotion |
| --headless | off | Disables visualization for faster training. All GPU resources go to simulation and training |
| --num_envs | 4096 | Number of parallel environments. Each environment runs an independent simulation of the H1 robot |
| --max_iterations | 1500 | Total number of PPO training iterations. Each iteration collects a batch of experience and updates the policy |
| --seed | 0 | Random seed for reproducibility. Set this to get deterministic results across runs |
Each environment runs an independent simulation of the robot, allowing the RL algorithm to collect experience efficiently.
On DGX Spark, 2048 to 4096 parallel environments typically work well for locomotion tasks. Higher values increase sample throughput but require more GPU memory. Start with 2048 if you want faster iteration cycles during development.
This section explains the core PPO training parameters used by RSL-RL and how they influence learning quality and stability.
PPO (Proximal Policy Optimization) is the RL algorithm used by RSL-RL. Understanding each hyperparameter helps you tune training for different tasks. The following table describes the key hyperparameters and their roles:
| Hyperparameter | Typical value | Description |
|---|---|---|
| policy_class_name | ActorCritic | The neural network architecture. ActorCritic uses separate networks for the policy (actor) and value function (critic) |
| actor_hidden_dims | [512, 256, 128] | Hidden layer sizes for the actor (policy) network. Larger networks can represent more complex behaviors but train more slowly |
| critic_hidden_dims | [512, 256, 128] | Hidden layer sizes for the critic (value) network. The critic estimates how good each state is |
| activation | elu | Activation function between hidden layers. ELU (Exponential Linear Unit) provides smooth gradients and avoids dead neurons |
| init_noise_std | 1.0 | Initial standard deviation of the exploration noise. Higher values encourage more exploration early in training |
| Hyperparameter | Typical value | Description |
|---|---|---|
| num_learning_epochs | 5 | Number of times the policy is updated using each batch of collected experience. Higher values extract more learning from each batch but risk overfitting |
| num_mini_batches | 4 | Number of mini-batches the experience buffer is split into for each epoch. More mini-batches mean smaller gradient updates |
| learning_rate | 1e-3 | Step size for the Adam optimizer. Controls how much the network weights change per update. Too high causes instability; too low slows convergence |
| discount_factor (gamma) | 0.99 | How much the agent values future rewards vs. immediate rewards. A value of 0.99 means the agent considers rewards ~100 steps into the future |
| gae_lambda (lambda) | 0.95 | Generalized Advantage Estimation smoothing parameter. Balances bias (low lambda) vs. variance (high lambda) in advantage estimates |
| clip_param | 0.2 | PPO clipping range. Prevents the policy from changing too much in a single update. Keeps training stable |
| value_loss_coef | 1.0 | Weight of the value function loss relative to the policy loss. Ensures the critic learns at an appropriate rate |
| entropy_coef | 0.01 | Weight of the entropy bonus. Encourages exploration by penalizing overly deterministic policies. Reduce this as training converges |
| desired_kl | 0.01 | Target KL divergence between old and new policies. If KL exceeds this value, the learning rate is reduced adaptively |
| max_grad_norm | 1.0 | Maximum gradient norm for gradient clipping. Prevents exploding gradients during training |
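To make the clip_param row concrete, here is the per-sample PPO clipped surrogate loss in plain Python: a minimal sketch of the standard PPO objective, not the RSL-RL source.

```python
import math

def ppo_clipped_loss(logp_new, logp_old, advantage, clip_param=0.2):
    """Per-sample PPO clipped surrogate loss (to be minimized)."""
    ratio = math.exp(logp_new - logp_old)          # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + clip_param), 1 - clip_param) * advantage
    return -min(unclipped, clipped)

# With a positive advantage and a ratio above the clip range, the gain is
# capped at (1 + clip_param) * advantage = 1.2 * 2.0, so the loss is -2.4.
print(ppo_clipped_loss(logp_new=0.5, logp_old=0.0, advantage=2.0))  # -2.4
```

The clipping is what keeps a single update from moving the policy too far: once the probability ratio leaves the [1 - clip_param, 1 + clip_param] band, increasing it further yields no additional objective gain.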
| Hyperparameter | Typical value | Description |
|---|---|---|
| num_steps_per_env | 24 | Number of simulation steps collected per environment per iteration. Together with num_envs, this determines the total batch size: batch_size = num_envs × num_steps_per_env |
| save_interval | 50 | Save a model checkpoint every N iterations. Useful for resuming training or evaluating intermediate policies |
During training, each iteration collects experience from all parallel environments. The total batch size per iteration is:
batch_size = num_envs × num_steps_per_env
For example, with num_envs=4096 and num_steps_per_env=24, the batch size per iteration is:
batch_size = 4096 × 24 = 98,304 environment steps per iteration
This batch is then split into num_mini_batches (4) mini-batches of ~24,576 steps each. The policy is updated num_learning_epochs (5) times per iteration, meaning each batch of experience is used for 5 × 4 = 20 gradient updates.
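The arithmetic above can be verified in a few lines of Python:

```python
# Recomputing the per-iteration batch arithmetic from the defaults above.
num_envs = 4096
num_steps_per_env = 24
num_mini_batches = 4
num_learning_epochs = 5

batch_size = num_envs * num_steps_per_env                  # env steps per iteration
mini_batch_size = batch_size // num_mini_batches           # steps per mini-batch
gradient_updates = num_learning_epochs * num_mini_batches  # updates per iteration

print(batch_size, mini_batch_size, gradient_updates)  # 98304 24576 20
```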
During training, RSL-RL periodically prints progress statistics to the terminal. A typical log output looks like:
```text
Learning iteration 100/1500
mean reward: 12.45
mean episode length: 234.5
value function loss: 0.032
surrogate loss: -0.0156
mean std: 0.42
learning rate: 0.001
fps: 48523
```
These metrics help you track learning progress and detect issues such as unstable gradients or stagnating policies.
The following table explains each metric:
| Metric | What it means | What to look for |
|---|---|---|
| mean reward | Average cumulative reward across all environments per episode | Should increase over time. Higher values mean the robot is walking better |
| mean episode length | Average number of steps before the episode ends | Should increase as the robot learns to stay upright longer |
| value function loss | How well the critic predicts future rewards | Should decrease and stabilize |
| surrogate loss | The PPO policy loss (negative because PPO maximizes the objective) | Should be small and negative |
| mean std | Average exploration noise standard deviation | Should decrease as the policy becomes more confident |
| learning rate | Current learning rate (may be adjusted by the adaptive KL mechanism) | Stays at the initial value unless KL divergence exceeds desired_kl |
| fps | Frames (environment steps) per second | Indicates training throughput. On DGX Spark, expect 40,000-60,000+ fps for locomotion tasks |
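The adaptive learning-rate behavior tied to desired_kl can be sketched as follows. The thresholds and scale factor here are assumptions for illustration, not the precise RSL-RL implementation:

```python
def adapt_learning_rate(lr, kl, desired_kl=0.01, min_lr=1e-5, max_lr=1e-2):
    """Sketch of an adaptive-KL learning-rate rule (the 2x/0.5x thresholds
    and 1.5 scale factor are illustrative assumptions)."""
    if kl > desired_kl * 2.0:
        lr = max(min_lr, lr / 1.5)   # policy moved too far: slow down
    elif 0.0 < kl < desired_kl / 2.0:
        lr = min(max_lr, lr * 1.5)   # policy barely moved: speed up
    return lr

print(adapt_learning_rate(1e-3, kl=0.05))   # reduced
print(adapt_learning_rate(1e-3, kl=0.001))  # increased
print(adapt_learning_rate(1e-3, kl=0.01))   # unchanged
```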
Training checkpoints are saved to the logs/rsl_rl/ directory. Each run creates a timestamped folder containing the model weights, configuration, and training logs.
After training completes, evaluate the policy by running inference with visualization. Use the play.py script with the trained checkpoint:
```bash
./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/play.py \
    --task=Isaac-Velocity-Rough-H1-Play-v0 \
    --num_envs=512
```
For evaluation, use the inference task name Isaac-Velocity-Rough-H1-Play-v0 instead of the training task name. The play variant disables runtime perturbations used during training and loads the checkpoint automatically.
The play script loads the most recent checkpoint and runs the policy in real time. You will observe the Unitree H1 humanoid walking over procedurally generated rough terrain, responding to live velocity commands.
You can also run inference with a specific checkpoint. This is useful for comparing policy performance at different stages of training.
```bash
./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/play.py \
    --task=Isaac-Velocity-Rough-H1-Play-v0 \
    --num_envs=512 \
    --checkpoint=logs/rsl_rl/h1_rough/<timestamp>/model_1500.pt
```
This command loads the specified checkpoint and runs the policy using the same simulation environment.
During evaluation, you can observe how the robot’s behavior evolves as training progresses.
Typical training behavior follows three stages: an early exploration phase in which the robots flail and fall almost immediately, an intermediate phase in which they learn to stay upright and take tentative steps, and a late phase in which they walk stably while tracking the commanded velocity. This progression, from falling to stable walking, demonstrates how PPO gradually improves the policy through trial and error across thousands of parallel environments.
The following visualizations compare two training stages using num_envs=512, showcasing the benefit of large-scale parallel training on DGX Spark.
Iteration 50 (Early Stage, num_envs=512)
At iteration 50, the policy is still in its exploration phase. Most robots exhibit noisy joint actions, lack coordination, and frequently fall. There is no observable response to the velocity command, and no stable gait has emerged.
Early Stage
Iteration 1350 (Late Stage, num_envs=512)
By iteration 1350, the policy has matured. Most robots demonstrate coordinated walking behavior, balance maintenance, and accurate velocity tracking, even on rough terrain. The improvement in foot placement and heading stability is clearly visible.
Late Stage
In this section, you’ve launched PPO training for the Isaac-Velocity-Rough-H1-v0 task with RSL-RL, learned what the key hyperparameters control, monitored the training logs, and evaluated the trained policy with the play script.
You’ve now completed the end-to-end workflow of training and validating a reinforcement learning policy for humanoid locomotion on DGX Spark.