[P] PhAIL (phail.ai) – an open benchmark for robot AI on real hardware. Best model: 5% of human throughput, needs help every 4 minutes.
I spent the last year trying to answer a simple question: how good are VLA models on real commercial tasks? Not demos, not simulation, not success rates on 10 tries. Actual production metrics on real hardware.
I couldn't find honest numbers anywhere, so I built a benchmark.
Setup: DROID platform, bin-to-bin order picking – one of the most common warehouse and industrial operations. Four models fine-tuned on the same real-robot dataset, evaluated blind (the operator doesn't know which model is running). We measure Units Per Hour (UPH) and Mean Time Between Failures (MTBF) – the metrics operations people actually use.
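For concreteness, both metrics reduce to simple arithmetic over episode logs. A minimal sketch below, assuming a toy per-episode record (the `Episode` fields and values are illustrative, not PhAIL's actual telemetry schema):

```python
from dataclasses import dataclass

@dataclass
class Episode:
    units_picked: int   # successful bin-to-bin transfers
    duration_s: float   # wall-clock episode length, seconds
    interventions: int  # operator take-overs during the episode

def uph(episodes):
    """Units Per Hour: total units over total runtime."""
    total_units = sum(e.units_picked for e in episodes)
    total_hours = sum(e.duration_s for e in episodes) / 3600
    return total_units / total_hours

def mtbf_min(episodes):
    """Mean Time Between Failures: minutes of runtime per intervention."""
    total_min = sum(e.duration_s for e in episodes) / 60
    total_failures = sum(e.interventions for e in episodes)
    return total_min / total_failures

# Toy data: two episodes totaling 22 minutes, 21 units, 5 interventions
runs = [Episode(9, 600, 2), Episode(12, 720, 3)]
print(round(uph(runs)), round(mtbf_min(runs), 1))  # 57 4.4
```

Note that counting every operator take-over as a "failure" is one possible convention; how interventions are attributed (grasp failure vs. perception failure vs. hardware fault) matters for comparing numbers across benchmarks.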
Results (full data with video and telemetry for every run at phail.ai):
| Model | UPH | MTBF |
|---|---|---|
| OpenPI (pi0.5) | 65 | 4.0 min |
| GR00T | 60 | 3.5 min |
| ACT | 44 | 2.8 min |
| SmolVLA | 18 | 1.2 min |
| Teleop / Finetuning (human controlling same robot) | 330 | – |
| Human hands | 1,331 | – |
The gap between OpenPI and GR00T is not statistically significant at current episode counts – we're collecting more runs.
The teleop baseline is the fairer comparison: same hardware, human in the loop. That's a 5x gap, and it's almost entirely policy quality – the robot can physically move much faster than any model commands it to. The human-hands number is what warehouse operators compare against when deciding whether to deploy.
The MTBF numbers are arguably more telling than UPH. At 4 minutes between failures, "autonomous operation" means a full-time babysitter. Reliability needs to cross a threshold before autonomy has economic value.
Every run is public with synced video and telemetry. Fine-tuning dataset, training scripts, and submission pathway are all open. If you think your model or fine-tuning recipe can do better, submit a checkpoint.
What models are we missing? We're adding NVIDIA DreamZero next. If you have a checkpoint that works on DROID hardware, submit it – or tell us what you'd want to see evaluated. What tasks beyond pick-and-place would be the real test for general-purpose manipulation?
More:
- Leaderboard + full episode data: phail.ai
- White paper: phail.ai/whitepaper.pdf
- Open-source toolkit: github.com/Positronic-Robotics/positronic
- Detailed findings: positronic.ro/introducing-phail