What I learned analysing Kaggle's Deep Past Challenge
I fell into a rabbit hole looking at Kaggle's Deep Past Challenge and ended up reading a bunch of winning solution writeups. Here's what I learned.
At first glance it looks like a machine translation competition: translate Old Assyrian transliterations into English.
But after reading the top solutions, I don’t think that’s really what it was.
It was more like a data construction / data cleaning competition with a translation model at the end.
Why:
- the official train set was tiny: 1,561 pairs
- train and test were not really the same shape: train was mostly document-level, test was sentence-level
- the main extra resource was a massive OCR dump of academic PDFs
- so the real work was turning messy historical material into usable parallel data
- and the public leaderboard was noisy enough that chasing it was dangerous
What the top teams mostly did:
- mined and reconstructed sentence pairs from PDFs
- cleaned and normalized a lot of weird text variation
- used ByT5 because byte-level modeling handled the strange orthography better
- used fairly conservative decoding, often MBR
- used LLMs mostly for segmentation, alignment, filtering, repair, and synthetic data, rather than as the final translator
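The MBR (minimum Bayes risk) decoding mentioned above is simple to sketch: sample several candidate translations, then pick the one most similar on average to all the others, instead of trusting the single highest-probability beam. A minimal stdlib-only illustration, using a crude character-trigram F1 as a stand-in for a real metric like chrF (the sample sentences are made up for the demo):

```python
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def overlap_f1(a: str, b: str, n: int = 3) -> float:
    # Crude chrF-like similarity: F1 over character trigram counts.
    ca, cb = char_ngrams(a, n), char_ngrams(b, n)
    match = sum((ca & cb).values())  # multiset intersection
    if not match:
        return 0.0
    p = match / sum(ca.values())
    r = match / sum(cb.values())
    return 2 * p * r / (p + r)

def mbr_select(candidates: list[str]) -> str:
    # Pick the candidate with the highest average similarity to the
    # other candidates: the minimum-risk choice under the metric.
    def expected_utility(c: str) -> float:
        others = [o for o in candidates if o is not c]
        return sum(overlap_f1(c, o) for o in others) / max(len(others), 1)
    return max(candidates, key=expected_utility)

samples = [
    "the merchant sent ten shekels of silver",
    "the merchant sent 10 shekels of silver",
    "the merchant sends ten shekels silver",
    "a trader dispatched tin to the city",
]
print(mbr_select(samples))  # one of the mutually-similar "merchant" variants wins
```

The appeal in a noisy low-resource setting is that an outlier hallucination (like the "trader" sentence here) agrees with nothing else and gets voted down.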
Winners' edges:
- 1st place went very hard on rebuilding the corpus and iterating on extraction quality
- 2nd place was almost a proof that you could get near the top with a simpler setup if your data pipeline was good enough. No heavy ensembling.
- 3rd place had the most interesting synthetic data strategy: not just more text, but synthetic examples designed to teach structure
- 5th place made back-translation work even in this weird low-resource ancient language setting
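Back-translation here means generating synthetic (transliteration, English) pairs by running monolingual English through a reverse model. A minimal sketch of that loop, with basic filtering of degenerate outputs; `reverse_translate` is a hypothetical stand-in for a trained English-to-transliteration model (the toy callable below just reverses word order, purely for illustration):

```python
def build_synthetic_pairs(english_sentences, reverse_translate, min_len=3):
    """Turn monolingual English into synthetic parallel training pairs.

    reverse_translate is a stand-in for a trained reverse model;
    any callable str -> str works for the sketch.
    """
    pairs = []
    for en in english_sentences:
        src = reverse_translate(en)
        # Basic filtering: drop empty or suspiciously short outputs,
        # since bad synthetic pairs hurt more than they help.
        if src and len(src.split()) >= min_len:
            pairs.append((src, en))
    return pairs

# Toy stand-in "model" (illustration only, not a real reverse model).
toy_reverse = lambda s: " ".join(reversed(s.split()))
print(build_synthetic_pairs(["silver was sent to the city"], toy_reverse))
```

In the real low-resource setting the filtering step is where most of the leverage is: the synthetic source side is noisy, so aggressive pruning matters more than volume.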
Main takeaway for me: good data beat clever modeling.
Honestly it felt closer to real ML work than a lot of competitions do. Small dataset, messy weakly-structured sources, OCR issues, normalization problems, validation that lies to you a bit… pretty familiar pattern.
I wrote a longer breakdown of the top solutions and what each one did differently. I didn't want to just drop a link with no context, so this is the short, useful version first. Full writeup in the comments.