Is Your Trajectory Displacement Safe in Long-tail?
Summary
FluidTest is a new evaluation pipeline designed to address safety bottlenecks in autonomous driving planning, particularly for long-tail scenarios. Published on 2026-06-15, it formulates planning evaluation as "additional-threat detection," assessing whether a planner's trajectory introduces new unsafe driving behaviors compared to an expert reference. The pipeline comprises three key components: a pairwise WebUI protocol for reliable human annotation, a comprehensive taxonomy of 32 semantic threats supported by evidence-grounded decision graphs, and a three-agent verification system incorporating reflection for enhanced precision and auditability. Experiments conducted on the WOD-E2E dataset demonstrated FluidTest's ability to produce consistent labels among trained annotators. Crucially, it identified additional threats in 65% of Poutine trajectories and 51% of RAP trajectories, revealing substantial safety-relevant failures in state-of-the-art planners despite their high Rater Feedback Scores (RFS) and low Average Displacement Error (ADE).
Key takeaway
For autonomous driving engineers evaluating planner safety, relying solely on metrics like Rater Feedback Scores (RFS) or Average Displacement Error (ADE) is insufficient. Integrate a human-aligned, verifiable evaluation pipeline such as FluidTest to uncover safety-relevant failures in long-tail scenarios that these metrics miss. On WOD-E2E, FluidTest found additional threats in 65% of Poutine trajectories and 51% of RAP trajectories, so checking for them before deployment is worthwhile.
Key insights
FluidTest identifies hidden safety threats in autonomous driving trajectories by comparing planner output to expert references using human-aligned verification.
Principles
- Evaluation must be human-aligned and verifiable.
- A high RFS and a low ADE can mask safety issues.
Method
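FluidTest uses a pairwise WebUI protocol for human annotation, a taxonomy of 32 semantic threats with evidence-grounded decision graphs, and a three-agent verification system with reflection to detect additional threats that a planner's trajectory introduces relative to the expert reference.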
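The "additional-threat" framing can be read as a set difference: a threat counts against the planner only if it appears for the planner's trajectory and not for the expert reference. The Python sketch below illustrates that reading only. The threat names, the `additional_threats` and `additional_threat_rate` helpers, and the sample data are hypothetical and are not taken from FluidTest's taxonomy or code.

```python
from typing import Iterable


def additional_threats(planner: Iterable[str], expert: Iterable[str]) -> set[str]:
    """Threats labeled for the planner trajectory but not for the expert reference."""
    return set(planner) - set(expert)


def additional_threat_rate(pairs: list[tuple[set[str], set[str]]]) -> float:
    """Fraction of (planner, expert) pairs where the planner adds at least one threat."""
    if not pairs:
        return 0.0
    flagged = sum(1 for planner, expert in pairs if additional_threats(planner, expert))
    return flagged / len(pairs)


# Hypothetical labels for three scenarios (placeholder threat names).
pairs = [
    ({"runs_red_light", "tailgating"}, {"tailgating"}),  # planner adds a threat
    ({"tailgating"}, {"tailgating"}),                    # same threats as the expert
    (set(), {"hard_braking"}),                           # planner has fewer threats
]
rate = additional_threat_rate(pairs)
print(f"additional-threat rate: {rate:.2f}")
```

Under this reading, a planner is not penalized for a threat the expert trajectory also exhibits, which is what separates the comparison from an absolute safety score.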
In practice
- Use a pairwise WebUI protocol for human annotation.
- Use the 32-threat taxonomy and its decision graphs for analysis.
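As a rough illustration of how pairwise annotations might be stored and checked for consistency, the Python sketch below keeps each annotator's marked threats per scenario and computes a simple percent agreement on the binary "any additional threat?" verdict. The `PairwiseAnnotation` record, its field names, and the agreement measure are assumptions for illustration; the summary does not describe FluidTest's data schema or the statistic it uses.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class PairwiseAnnotation:
    """Hypothetical record for one planner-vs-expert comparison."""
    scenario_id: str
    # annotator name -> additional threats that annotator marked for the planner
    labels: dict[str, frozenset[str]] = field(default_factory=dict)

    def has_additional_threat(self, annotator: str) -> bool:
        return bool(self.labels[annotator])


def pairwise_agreement(records: list[PairwiseAnnotation]) -> float:
    """Share of annotator pairs that agree on the binary verdict per scenario."""
    agree = total = 0
    for rec in records:
        for a, b in combinations(sorted(rec.labels), 2):
            total += 1
            agree += rec.has_additional_threat(a) == rec.has_additional_threat(b)
    return agree / total if total else 0.0


# Hypothetical data: annotators agree on s1 and disagree on s2.
records = [
    PairwiseAnnotation("s1", {"ann_a": frozenset({"cut_in"}), "ann_b": frozenset({"cut_in"})}),
    PairwiseAnnotation("s2", {"ann_a": frozenset(), "ann_b": frozenset({"cut_in"})}),
]
agreement = pairwise_agreement(records)
print(f"agreement: {agreement:.2f}")
```

Percent agreement is used here only because it is easy to read; a chance-corrected statistic may be preferable when one verdict dominates.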
Topics
- Autonomous Driving Evaluation
- Long-tail Scenarios
- Trajectory Planning
- FluidTest
- Safety Assessment
- Threat Detection
Best for: Computer Vision Engineer, Research Scientist, Robotics Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.