
[P] PhAIL (phail.ai) – an open benchmark for robot AI on real hardware. Best model: 5% of human throughput, needs help every 4 minutes.

I spent the last year trying to answer a simple question: how good are VLA models on real commercial tasks? Not demos, not simulation, not success rates on 10 tries. Actual production metrics on real hardware.

I couldn't find honest numbers anywhere, so I built a benchmark.

Setup: DROID platform, bin-to-bin order picking – one of the most common warehouse and industrial operations. Four models fine-tuned on the same real-robot dataset, evaluated blind (the operator doesn't know which model is running). We measure Units Per Hour (UPH) and Mean Time Between Failures (MTBF) – the metrics operations people actually use.
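As a rough sketch of how these two metrics fall out of per-run logs (hypothetical field names, not the actual phail.ai telemetry schema):

```python
from dataclasses import dataclass

@dataclass
class Run:
    units_picked: int      # successful bin-to-bin transfers in this run
    duration_hours: float  # wall-clock time of the run
    failures: int          # operator interventions required

def uph(runs):
    # Units Per Hour: total units over total wall-clock time
    total_units = sum(r.units_picked for r in runs)
    total_hours = sum(r.duration_hours for r in runs)
    return total_units / total_hours

def mtbf_minutes(runs):
    # Mean Time Between Failures, in minutes
    total_minutes = sum(r.duration_hours for r in runs) * 60
    total_failures = sum(r.failures for r in runs)
    return total_minutes / total_failures

runs = [Run(units_picked=13, duration_hours=0.20, failures=3),
        Run(units_picked=12, duration_hours=0.18, failures=2)]
print(round(uph(runs), 1), round(mtbf_minutes(runs), 1))  # 65.8 4.6
```

Note both metrics pool across runs rather than averaging per-run values, so long and short runs are weighted by their actual duration.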

Results (full data with video and telemetry for every run at phail.ai):

| Model | UPH | MTBF |
|---|---|---|
| OpenPI (pi0.5) | 65 | 4.0 min |
| GR00T | 60 | 3.5 min |
| ACT | 44 | 2.8 min |
| SmolVLA | 18 | 1.2 min |
| Teleop / Finetuning (human controlling same robot) | 330 | – |
| Human hands | 1,331 | – |

The gap between OpenPI and GR00T is not statistically significant at current episode counts – we're collecting more runs.
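One way to check a claim like this (a sketch, not the benchmark's actual analysis; the per-episode UPH samples below are made up) is a bootstrap confidence interval on the difference of per-episode means:

```python
import random

def bootstrap_diff_ci(a, b, n_boot=2000, seed=0):
    # 95% percentile CI for mean(a) - mean(b) via bootstrap resampling
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        ra = [rng.choice(a) for _ in a]
        rb = [rng.choice(b) for _ in b]
        diffs.append(sum(ra) / len(ra) - sum(rb) / len(rb))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

# hypothetical per-episode UPH samples for two models
model_a = [60, 72, 55, 70, 68]  # mean 65
model_b = [52, 68, 58, 63, 59]  # mean 60
lo, hi = bootstrap_diff_ci(model_a, model_b)
print(lo <= 0 <= hi)  # CI contains 0 → difference not significant
```

With only a handful of episodes per model, a 5 UPH gap is easily swamped by run-to-run variance, which is why more runs are needed before ranking the top two.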

The teleop baseline is the fairer comparison: same hardware, human in the loop. That's a 5x gap, and it's almost entirely policy quality – the robot can physically move much faster than any model commands it to. The human-hands number is what warehouse operators compare against when deciding whether to deploy.

The MTBF numbers are arguably more telling than UPH. At 4 minutes between failures, "autonomous operation" means a full-time babysitter. Reliability needs to cross a threshold before autonomy has economic value.

Every run is public with synced video and telemetry. Fine-tuning dataset, training scripts, and submission pathway are all open. If you think your model or fine-tuning recipe can do better, submit a checkpoint.

What models are we missing? We're adding NVIDIA DreamZero next. If you have a checkpoint that works on DROID hardware, submit it – or tell us what you'd want to see evaluated. What tasks beyond pick-and-place would be the real test for general-purpose manipulation?


submitted by /u/svertix

