Learning Agile Intruder Interception using Differentiable Quadrotor Dynamics

Carnegie Mellon University
Proposed control policy rollout: The interceptor (blue) aggressively pursues the intruder (red) flying at 5.8 m/s, culminating in a forced collision at the yellow star marker. Shading along trajectories represents acceleration magnitude (from low blue to high yellow).

Abstract

This paper presents a methodology for learning a control policy that intercepts an intruder using only the 3D direction vector to the intruder and the interceptor state. Prior deep reinforcement learning approaches assume that either the relative position or the distance to the intruder is available, but this information is not readily accessible in real-world applications that employ passive, monocular camera sensors. Instead, we propose a solution that leverages an analytical policy gradient method using differentiable quadrotor dynamics to learn agile interception at speeds up to 10 m/s. The proposed approach achieves an average success rate 30% higher than baseline methods that use simplified point mass dynamics.

Methodology & Sensing

[Figure: Guidance sensing vector mathematics]

Parallel Navigation Guidance

Inspired by classical homing guidance, the training objective decomposes parallel navigation into two terms:

  • Line-of-Sight Alignment: Penalizes the angular drift between relative position and relative velocity, forcing the relative velocity to point along the line-of-sight.
  • Closing Velocity: Maximizes the closing speed along the line-of-sight vector, driving the agent to aggressively close the gap rather than simply shadowing the target.

Privileged Information: Intruder state details are privileged and used only in the loss function during training, never at inference.
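The two terms above can be sketched for a single time step as follows. This is a minimal illustration, not the paper's loss: the function names, the `1 - cos` form of the alignment penalty, the sign conventions, and the weights `w_align` and `w_close` are all assumptions made for the example.

```python
import numpy as np

def parallel_navigation_terms(p_int, v_int, p_tgt, v_tgt, eps=1e-8):
    """Return (los_misalignment, closing_speed) for one time step.

    Uses the privileged intruder state (p_tgt, v_tgt), so it is only
    meaningful inside a training loss, never at inference.
    """
    r = p_tgt - p_int                      # relative position (line-of-sight vector)
    v_rel = v_tgt - v_int                  # velocity of the target relative to the interceptor
    los = r / (np.linalg.norm(r) + eps)    # unit line-of-sight direction

    # Closing speed: positive when the range is shrinking.
    closing_speed = -np.dot(v_rel, los)

    # Alignment: under ideal parallel navigation, -v_rel points along the
    # line of sight, so the cosine below is 1 and the penalty is 0.
    cos_angle = np.dot(-v_rel, los) / (np.linalg.norm(v_rel) + eps)
    los_misalignment = 1.0 - cos_angle
    return los_misalignment, closing_speed

def guidance_loss(p_int, v_int, p_tgt, v_tgt, w_align=1.0, w_close=0.1):
    """Hypothetical weighted sum: penalize misalignment, reward closing speed."""
    mis, closing = parallel_navigation_terms(p_int, v_int, p_tgt, v_tgt)
    return w_align * mis - w_close * closing
```

Flying straight at a stationary target gives zero misalignment and a closing speed equal to the interceptor speed, while flying perpendicular to the line of sight gives a misalignment of 1 and zero closing speed.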

Policy Network Architecture

The policy network architecture is customized to extract target motion parameters from temporal sequences of unit directions:

  • State Inputs: Receives linear velocity \(\mathbf{v}_t\), rotation matrix \(\mathbf{R}_t\), and 3D unit direction vector \(\hat{\mathbf{d}}_t\) to the intruder.
  • Recurrent Estimation: Proprioception and target directions are encoded via separate MLPs and combined in a Gated Recurrent Unit (GRU). The GRU recurrent state implicitly estimates target velocity and acceleration over time.
  • Action Commands: Outputs the collective mass-normalized thrust vector \(\mathbf{f}_{\mathrm{cmd},t}\) and desired yaw angle \(\psi_t\), executed via an onboard PD attitude controller.

[Figure: Policy network architecture showing MLP encoders, GRU, and command output heads]
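To make the action interface concrete, the sketch below shows one common way a thrust-vector and yaw command can be converted into a desired attitude and a collective thrust magnitude for a downstream attitude controller. It is the standard construction from the quadrotor control literature, assumed here for illustration; the page does not state how the onboard PD controller consumes the command, and the function name `desired_attitude` is hypothetical.

```python
import numpy as np

def desired_attitude(f_cmd, yaw, eps=1e-8):
    """Map a mass-normalized thrust vector and a yaw angle to (R_des, thrust).

    Assumes a world frame with z pointing up and a body frame whose z axis
    is the thrust direction. The construction is singular when the thrust
    direction is parallel to the desired heading vector.
    """
    f_cmd = np.asarray(f_cmd, dtype=float)
    thrust = np.linalg.norm(f_cmd)          # collective mass-normalized thrust [m/s^2]
    z_b = f_cmd / (thrust + eps)            # body z axis aligned with the thrust vector

    # Heading direction in the horizontal plane implied by the yaw command.
    x_c = np.array([np.cos(yaw), np.sin(yaw), 0.0])

    y_b = np.cross(z_b, x_c)
    y_b /= (np.linalg.norm(y_b) + eps)      # body y axis, orthogonal to z_b and x_c
    x_b = np.cross(y_b, z_b)                # completes the right-handed frame

    R_des = np.column_stack([x_b, y_b, z_b])
    return R_des, thrust
```

A hover command, `desired_attitude([0, 0, 9.81], 0.0)`, yields the identity rotation and a thrust of 9.81 m/s². An attitude controller would then drive the measured rotation \(\mathbf{R}_t\) toward `R_des`.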

Interactive Rollout Showcase

Ellipse Trajectories

Used for both training and evaluation

Ellipse trajectories subject the interceptor to standard curved flights with randomly sampled major axes, aspect ratios, orientations, and target speeds.
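A randomized ellipse generator of this kind could look like the sketch below. The sampling ranges, the restriction to rotation about the vertical axis, the fixed center, and both function names are assumptions chosen for illustration; the page does not give the actual distributions.

```python
import numpy as np

def sample_ellipse_params(rng, max_speed=10.0):
    """Draw hypothetical ellipse parameters; all ranges are illustrative."""
    a = rng.uniform(5.0, 20.0)            # semi-major axis [m]
    ratio = rng.uniform(0.3, 1.0)         # aspect ratio b / a
    yaw = rng.uniform(-np.pi, np.pi)      # orientation about the vertical axis [rad]
    speed = rng.uniform(1.0, max_speed)   # target speed along the path [m/s]
    return a, a * ratio, yaw, speed

def ellipse_trajectory(a, b, yaw, speed, dt=0.01, steps=1000, center=(0.0, 0.0, 2.0)):
    """Return target positions along a horizontal ellipse at roughly constant speed."""
    center = np.asarray(center, dtype=float)
    c, s = np.cos(yaw), np.sin(yaw)
    Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    theta = 0.0
    pts = np.empty((steps, 3))
    for k in range(steps):
        local = np.array([a * np.cos(theta), b * np.sin(theta), 0.0])
        pts[k] = center + Rz @ local
        # |d(position)/d(theta)| converts the desired speed into an angular rate.
        ds_dtheta = np.hypot(a * np.sin(theta), b * np.cos(theta))
        theta += speed * dt / ds_dtheta   # explicit Euler step in the path parameter
    return pts

rng = np.random.default_rng(0)
a, b, yaw, speed = sample_ellipse_params(rng)
target_path = ellipse_trajectory(a, b, yaw, speed)
```

Advancing the path parameter by `speed * dt / ds_dtheta` keeps the target's speed approximately constant around the ellipse, instead of letting it speed up along the flatter sides as a uniform sweep of `theta` would.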

Max Target Speed: 10 m/s
Success Rate: > 95%

Flight Behavior: The interceptor learns to cut off the target curve (lead pursuit) rather than chasing behind, minimizing flight distance and ensuring rapid collision.

Simulation Evaluation Results

1. Training Convergence Performance

APG vs. PPO Training curves

Training success rate versus environment steps. PMD APG (solid grey) converges in significantly fewer environment steps and trains more stably than the model-free PMD PPO (solid red) baseline.
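The idea behind an analytical policy gradient is that, when the dynamics are differentiable, the gradient of the rollout loss with respect to the policy parameters can be computed exactly by backpropagating through the simulated trajectory, instead of being estimated from sampled returns as in PPO. The toy sketch below shows this on a 1D point mass with a linear feedback policy. The dynamics, the quadratic loss, the gains `K`, and all function names are assumptions made for illustration and are far simpler than the paper's setup.

```python
import numpy as np

DT = 0.05
A = np.array([[1.0, DT], [0.0, 1.0]])   # state x = [position, velocity]
B = np.array([0.0, DT])                 # acceleration input

def rollout(K, x0, T):
    """Simulate x_{t+1} = A x_t + B u_t with the policy u_t = -K . x_t."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(T):
        x = xs[-1]
        u = -K @ x
        xs.append(A @ x + B * u)
    return np.array(xs)

def loss(K, x0, T):
    """Sum of squared positions over steps 1..T."""
    xs = rollout(K, x0, T)
    return float(np.sum(xs[1:, 0] ** 2))

def analytic_grad(K, x0, T):
    """Exact dLoss/dK via backpropagation through the rollout (adjoint method)."""
    xs = rollout(K, x0, T)
    A_cl = A - np.outer(B, K)            # closed-loop transition matrix
    grad = np.zeros_like(K)
    lam = np.zeros(2)                    # adjoint: total dLoss/dx_t
    for t in range(T, 0, -1):
        # Stage-cost gradient at step t plus the gradient flowing back from t+1.
        lam = np.array([2.0 * xs[t, 0], 0.0]) + A_cl.T @ lam
        # x_t depends on K through u_{t-1} = -K . x_{t-1}.
        grad += -(B @ lam) * xs[t - 1]
    return grad

K = np.array([1.0, 1.0])
x0 = [1.0, 0.0]
g = analytic_grad(K, x0, T=50)
K_new = K - 1e-3 * g                     # one gradient-descent step on the gains
```

The analytic gradient matches a finite-difference estimate of the same loss, and a single small descent step lowers the loss. Model-free methods have to estimate this gradient direction from many sampled rollouts.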

Quadrotor vs. Point Mass Dynamics

Training success rate versus environment steps. APG trained directly through high-fidelity nonlinear Quadrotor Dynamics (solid blue) converges at a rate similar to APG trained with the simplified Point Mass Dynamics (solid grey).

2. Algorithm Generalization (Simulation Evaluation Success Rates)

We evaluate the generalizability of PMD APG (solid grey with circle marker) against PMD PPO (solid red with square marker) across Ellipse (in-distribution), Spiral (out-of-distribution), and Lemniscate (out-of-distribution) trajectories under varying target intruder speeds up to 10 m/s.

[Figures: Algorithm evaluation success rates versus intruder speed on Ellipse, Spiral, and Lemniscate trajectories]

3. Dynamics Model Fidelity (Simulation Evaluation Success Rates)

We compare the evaluation success rates of the policy trained with high-fidelity nonlinear Quadrotor Dynamics / Quad APG (solid blue with square marker) against the policy trained with simplified Point Mass Dynamics / PMD APG (solid grey with circle marker).

[Figures: Dynamics model evaluation success rates on Ellipse, Spiral, and Lemniscate trajectories]

Simulation Evaluation Takeaways

  • 30%+: Higher average success rate compared to point mass approximations by exploiting nonlinear multi-rotor states.
  • 10 m/s: Capable of executing high-speed aerial forced collisions in open three-dimensional flight spaces.
  • 264K: Tiny policy parameter footprint with only 264,580 weights.

BibTeX

@misc{anoruo2026learning,
   title={Learning Agile Intruder Interception using Differentiable Quadrotor Dynamics},
   author={Michael Anoruo and Xiaoyu Tian and Abhishek Rathod and Timothy Naudet and Thomas Canchola and Eric Sturzinger and Kshitij Goel and Wennie Tabib},
   year={2026},
   eprint={2607.02472},
   archivePrefix={arXiv},
   primaryClass={cs.RO},
   url={https://arxiv.org/abs/2607.02472}
}