[P] PhAIL (phail.ai) – an open benchmark for robot AI on real hardware. Best model: 5% of human throughput, needs help every 4 minutes.
I spent the last year trying to answer a simple question: how good are VLA models on real commercial tasks? Not demos, not simulation, not success rates on 10 tries. Actual production metrics on real hardware.
I couldn't find honest numbers anywhere, so I built a benchmark.
Setup: DROID platform, bin-to-bin order picking – one of the most common warehouse and industrial operations. Four models fine-tuned on the same real-robot dataset, evaluated blind (the operator doesn't know which model is running). We measure Units Per Hour (UPH) and Mean Time Between Failures (MTBF) – the metrics operations people actually use.
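For concreteness, both metrics reduce to simple arithmetic over episode logs. A minimal sketch below, assuming a toy per-episode record (the `Episode` fields and values are illustrative, not PhAIL's actual telemetry schema):

```python
from dataclasses import dataclass

@dataclass
class Episode:
    units_picked: int   # successful bin-to-bin transfers
    duration_s: float   # wall-clock episode length, seconds
    interventions: int  # operator take-overs during the episode

def uph(episodes):
    """Units Per Hour: total units over total runtime."""
    total_units = sum(e.units_picked for e in episodes)
    total_hours = sum(e.duration_s for e in episodes) / 3600
    return total_units / total_hours

def mtbf_min(episodes):
    """Mean Time Between Failures: minutes of runtime per intervention."""
    total_min = sum(e.duration_s for e in episodes) / 60
    total_failures = sum(e.interventions for e in episodes)
    return total_min / total_failures

# Toy data: two episodes totaling 22 minutes, 21 units, 5 interventions
runs = [Episode(9, 600, 2), Episode(12, 720, 3)]
print(round(uph(runs)), round(mtbf_min(runs), 1))  # 57 4.4
```

Note that counting every operator take-over as a "failure" is one possible convention; how interventions are attributed (grasp failure vs. perception failure vs. hardware fault) matters for comparing numbers across benchmarks.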
Results (full data with video and telemetry for every run at phail.ai):
| Model | UPH | MTBF |
|---|---|---|
| OpenPI (pi0.5) | 65 | 4.0 min |
| GR00T | 60 | 3.5 min |
| ACT | 44 | 2.8 min |
| SmolVLA | 18 | 1.2 min |
| Teleop / Finetuning (human controlling same robot) | 330 | – |
| Human hands | 1,331 | – |
The gap between OpenPI and GR00T is not statistically significant at current episode counts – we're collecting more runs.
The teleop baseline is the fairer comparison: same hardware, human in the loop. That's a 5x gap, and it's almost entirely policy quality – the robot can physically move much faster than any model commands it to. The human-hands number is what warehouse operators compare against when deciding whether to deploy.
The MTBF numbers are arguably more telling than UPH. At 4 minutes between failures, "autonomous operation" means a full-time babysitter. Reliability needs to cross a threshold before autonomy has economic value.
Every run is public with synced video and telemetry. Fine-tuning dataset, training scripts, and submission pathway are all open. If you think your model or fine-tuning recipe can do better, submit a checkpoint.
What models are we missing? We're adding NVIDIA DreamZero next. If you have a checkpoint that works on DROID hardware, submit it – or tell us what you'd want to see evaluated. What tasks beyond pick-and-place would be the real test for general-purpose manipulation?
More:
- Leaderboard + full episode data: phail.ai
- White paper: phail.ai/whitepaper.pdf
- Open-source toolkit: github.com/Positronic-Robotics/positronic
- Detailed findings: positronic.ro/introducing-phail