
[P] GPU-friendly lossless 12-bit BF16 format with 0.03% escape rate and 1-integer-ADD decode, works on AMD & NVIDIA


Hi everyone, I'm from Australia : ) I just released a new research prototype.

It’s a lossless BF16 compression format that stores weights in 12 bits by replacing the 8-bit exponent with a 4-bit group code.
For 99.97% of weights, decoding is just one integer ADD.
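To make the encode/decode idea concrete, here is a minimal sketch. It assumes (my assumption, not confirmed by the repo) that the 4-bit group code is the exponent's offset from a per-tensor base exponent, so decoding the exponent really is a single integer ADD; weights whose exponent falls outside the 16-value window would take the escape path.

```python
def encode_bf16(bits16, base_exp):
    """Split a BF16 bit pattern into a sign+mantissa byte and a 4-bit group code.
    Hypothetical scheme: group = exponent - per-tensor base exponent."""
    sign = (bits16 >> 15) & 0x1
    exp = (bits16 >> 7) & 0xFF
    man = bits16 & 0x7F
    group = exp - base_exp
    if not 0 <= group <= 15:
        return None  # escape: store the full 16-bit value out of band
    sm_byte = (sign << 7) | man
    return sm_byte, group

def decode_bf16(sm_byte, group, base_exp):
    """Reconstruct the BF16 bit pattern; the exponent comes back with one
    integer ADD, the rest is plain bit repacking."""
    exp = base_exp + group          # the single integer ADD
    sign = (sm_byte >> 7) & 0x1
    man = sm_byte & 0x7F
    return (sign << 15) | (exp << 7) | man
```

Round-tripping is bit-exact for any weight inside the exponent window, which matches the lossless claim.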

Byte-aligned split storage: a true 12 bits per weight, no 16-bit padding waste, and zero HBM read amplification.

Yes, 12 bits, not 11! The main idea was not just to "compress weights more", but to make the format GPU-friendly enough to use directly during inference:

  • sign + mantissa: exactly 1 byte per element
  • group code: two 4-bit nibbles packed into exactly 1 byte
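A rough sketch of the split layout (nibble order is my guess, not taken from the repo): N weights need N sign+mantissa bytes plus ceil(N/2) group bytes, i.e. 12 bits per weight on average, with byte-aligned loads only.

```python
def pack_groups(groups):
    """Pack 4-bit group codes two per byte (low nibble first, an assumption)."""
    out = bytearray()
    for i in range(0, len(groups), 2):
        lo = groups[i] & 0xF
        hi = (groups[i + 1] & 0xF) if i + 1 < len(groups) else 0
        out.append(lo | (hi << 4))
    return bytes(out)

def unpack_group(packed, idx):
    """Fetch the group code of element idx: one byte load plus a shift/mask,
    no bitstream parsing."""
    b = packed[idx // 2]
    return (b >> 4) & 0xF if idx % 2 else b & 0xF
```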

https://preview.redd.it/qbx94xeeo2tg1.png?width=1536&format=png&auto=webp&s=831da49f6b1729bd0a0e2d1f075786274e5a7398

  • 1.33x smaller than BF16
  • Fixed-rate 12-bit per weight, no entropy coding
  • Zero precision loss: bit-perfect reconstruction
  • Fused decode + matmul, so there is effectively no separate decompression stage
  • Byte-aligned storage, no LUT, no bitstream parsing
  • Works on both NVIDIA and AMD
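The fused decode + matmul point can be illustrated with a toy matvec that decodes weights inline and never materializes a BF16 buffer. This is only a NumPy sketch under the same per-tensor-base assumption as above, not the repo's actual tensor-core kernel; it exploits the fact that a BF16 bit pattern shifted left 16 places is a valid float32.

```python
import numpy as np

def fused_decode_matvec(sm, groups, base_exp, x):
    """Decode 12-bit weights and multiply in one pass. `sm` holds
    sign+mantissa bytes, `groups` the (already unpacked) 4-bit codes,
    `base_exp` a hypothetical per-tensor exponent base."""
    exp = base_exp + groups.astype(np.uint32)          # the single integer ADD
    sign = (sm.astype(np.uint32) >> 7) & 0x1
    man = sm.astype(np.uint32) & 0x7F
    bits32 = (sign << 31) | (exp << 23) | (man << 16)  # bf16 bits << 16 == f32
    w = bits32.view(np.float32)                        # reinterpret, no copy
    return float(w @ x)
```

On a GPU the same idea would run inside the GEMM tile loop, so the decompressed weights only ever live in registers.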

Some results so far:

Single-user (B=1), RTX 5070 Ti

  • Llama 2 7B: 64.7 tok/s (1.47x vs vLLM)
  • Mistral 7B: 60.0 tok/s (1.10x vs vLLM)
  • Llama 3.1 8B: 57.0 tok/s (vLLM OOM on 16 GB)

Multi-user (B=256), total tok/s

  • Llama 2 7B: 2931 vs 1086 in vLLM (2.70x)
  • Mistral 7B: 2554 vs 872 in vLLM (2.93x)

It also seems surprisingly stable across model types:

  • Llama 3.1 405B: 0.034% escape rate
  • Mixtral 8x7B: 0.050%
  • SDXL UNet: 0.233%
  • CogVideoX 2B: 0.128%
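For anyone who wants to check these numbers on their own checkpoints, here is one plausible way to measure an escape rate: find the 16-wide exponent window covering the most weights and count what falls outside. This is my reconstruction of the metric, not the repo's tool (which presumably reads .safetensors tensors directly).

```python
import numpy as np

def escape_rate(bits16):
    """Given BF16 bit patterns as a uint16 array, pick the best 16-exponent
    window and return (fraction of weights outside it, window base)."""
    exps = ((bits16 >> 7) & 0xFF).astype(np.int64)
    counts = np.bincount(exps, minlength=256)
    # window[b] = number of weights with exponent in [b, b+15]
    window = np.convolve(counts, np.ones(16, dtype=np.int64), mode="valid")
    base = int(window.argmax())
    covered = int(window[base])
    return 1.0 - covered / bits16.size, base
```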

So far this is tested on BF16 safetensors only.

Repo: https://github.com/cenconq25/Turbo-Lossless

Also worth noting: the V3 fused decode+GEMM kernel uses tensor-core patterns inspired by ZipServ / ZipGEMM (Fan et al., ASPLOS 2026).

Happy to hear criticism, edge cases, or reasons this idea won’t scale.

Thanks for your time : )

submitted by /u/Embarrassed_Will_120

