Teaching SerendAI to read a Sri Lankan pharmacy sign at 9pm.
Sign translation looks easy until the sign is in Sinhala, the photo is blurred, the light is dying, and a visitor in Nuwara Eliya wants to know whether the shop is selling paracetamol or ayurvedic cough syrup. This is the story of how SerendAI went from useless on real travel photos to something a visitor can actually trust — and the preprocessing trick that finally moved the numbers.
Every applied ML team eventually runs into the gap between the benchmarks their models were built for and the photos the world actually hands them. Our gap announced itself in week three of the SerendAI beta, in the form of a polite message from a visitor in Nuwara Eliya. It went, approximately: "I photographed a pharmacy sign to check if they were still open. SerendAI said it was a tailor's shop. I am now walking around in the rain looking for medicine. Please help."
The sign did not belong to a tailor's shop. It belonged to a pharmacy, and it was written in a warm, curling Sinhala script that SerendAI had, with admirable confidence, read as something else entirely. The image was dark and shot slightly from below, the letters were printed on a pale blue hoarding that glared under a streetlight, and about a third of one character was obscured by a tree. An off-the-shelf OCR service had, understandably, failed on it.
This post is the long version of how we got from there to a system that handles real Sri Lankan travel photos well enough that we can honestly put our product name in front of them. End-to-end sign accuracy moved from 43% to 86% on a benchmark we built ourselves, and along the way we learned several uncomfortable things about the state of multilingual OCR in the wild.
What SerendAI has to do, exactly
SerendAI is the AI travel companion that lives inside oneceylon.space. It answers the kind of questions a visitor does not want to wait on a community for — what is the weather at Adam's Peak, when does the train to Ella leave, how much should this tuk-tuk fare really be, and — the subject of this post — what does this sign say.
Photo translation, specifically, is the pipeline that takes a traveller's phone camera photo and produces a readable answer in their own language. It is four steps: detect where the text is in the image, read it, figure out what language it is in, and translate the whole thing. Get any one of those wrong and every step after it compounds the error. Off-the-shelf tools are built for the clean, front-lit, right-angle photography of major global languages. That is not what a visitor's camera roll looks like at 9pm in Nuwara Eliya.
Why our baseline failed
We started where everyone starts: a well-known commercial OCR service with reported support for Sinhala and Tamil. It was not bad on a flat, well-lit scan. It was, however, quite bad at the thing our users actually do, which is photograph a sign on a wall at night, from across a wet road, one-handed, mid-conversation.
Running a proper evaluation took longer than the modelling work itself. There is no public benchmark for travel-photo text recognition in Sri Lanka. So we built one. We collected 1,000 photographs of real signs from a cross-section of locations — Colombo shopfronts, Ella village stalls, Kandy temple boards, Jaffna restaurants, rural transit stops — with human translations from bilingual annotators in Colombo. We call it SerendSigns-1k internally, and we plan to release a public subset of it later this year.
Three distinct failure modes showed up once we had numbers to look at:
- Low-light catastrophe. The baseline OCR dropped from 71% accuracy on well-lit signs to 28% on photos taken after 6pm. A travel app that fails after sunset fails at the exact moment travellers are most uncertain.
- Perspective and curvature. Real signs are not flat. They hang at angles. They wrap around corners. They flap in the wind on cloth banners. Perspective distortion of more than about twenty degrees caused the character-level recognition to collapse.
- Script confusion on code-mixed signs. Most Sri Lankan signs are not in one language. A pharmacy board will say "Pharmacy" in English, the shop name in Sinhala, and the licence number in digits. The model guessed at the dominant script for the whole image and applied it to every text region, which was wrong about a third of the time.
The OCR was trained on a world in which each photograph is a clean document. Our photographs are not documents. They are the reality of being a traveller holding up a phone in a hurry.
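The script-confusion failure above comes from deciding the script once per image. A minimal sketch of the alternative, deciding per text region, is below. It assumes the reader already returns Unicode text for each region; the function names and the sample strings are illustrative, not SerendAI's code. The block ranges are the Sinhala and Tamil blocks of the Unicode Standard.

```python
from collections import Counter
from typing import Optional

# Unicode block ranges for the two local scripts on a typical sign.
SCRIPT_RANGES = {
    "sinhala": (0x0D80, 0x0DFF),
    "tamil": (0x0B80, 0x0BFF),
}


def char_script(ch: str) -> Optional[str]:
    """Script of one character, or None for digits, spaces and punctuation."""
    code = ord(ch)
    for name, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= code <= hi:
            return name
    if ch.isascii() and ch.isalpha():
        return "latin"
    return None


def region_script(text: str) -> str:
    """Majority script of one text region, or 'unknown' if it has no letters."""
    counts = Counter(s for s in map(char_script, text) if s is not None)
    return counts.most_common(1)[0][0] if counts else "unknown"


# Hypothetical reader output for three regions of one code-mixed board:
# an English word, a Sinhala-script string, and a run of digits.
regions = ["Pharmacy", "\u0dc6\u0dcf\u0db8\u0dc3\u0dd2", "12345"]
print([region_script(r) for r in regions])  # ['latin', 'sinhala', 'unknown']
```

Tagging each region separately means a Sinhala shop name no longer drags the English word next to it into the wrong recogniser, and digit-only regions can be routed without any script decision at all.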
What we tried, in order
Attempt 1 — A better multilingual OCR
We swapped the baseline for PaddleOCR with its multilingual detection and recognition models, which have surprisingly decent Sinhala support for a tool that is not widely advertised for it. Accuracy rose from 0.43 to 0.55. A real, honest improvement — but it did not close the gap, and it did not touch the low-light or perspective problems. Clean signs improved more than messy ones.
The lesson we took from this: most of what a better OCR engine gains, it gains on clean input. If your input is dirty, a better engine alone will not rescue you.
Attempt 2 — Direct vision-language models, and a cautionary tale
The obvious next move in 2026 is to skip the OCR step entirely and ask a multimodal vision-language model to read the sign and translate it in one go. We tried two. Accuracy jumped to 0.68, which is a real improvement, and the output was often phrased more naturally because the model could disambiguate using visual context — the shape of a mortar-and-pestle icon, the green cross of a pharmacy, the red curve of a Coca-Cola sign.
But VLMs have a specific, well-known, and slightly embarrassing failure mode: when they cannot read the text, they do not say so. They confabulate. A blurred Sinhala shopfront would come back as a plausible-sounding but entirely invented translation, delivered with exactly the same confidence as a correct one. For a traveller about to walk into a pharmacy and ask for aspirin, a confidently wrong answer is strictly worse than an honest "don't know".
We parked the VLM-only approach and went looking for something that would know when it was unsure.
Attempt 3 — Preprocessing: dewarp, denoise, enhance (the thing that worked)
The change that moved the needle was not a new model. It was a preprocessing pipeline that ran before any reading model saw the image. Three stages, each small and cheap:
    def prepare_travel_photo(img: Image) -> list[Crop]:
        # 1. detect all text regions using a small detector
        regions = detect_text_regions(img)
        # 2. for each region, dewarp it into a flat rectangle
        flat = [dewarp_perspective(r) for r in regions]
        # 3. in low light, run a lightweight enhancement pass
        if estimate_luminance(img) < LUMINANCE_THRESHOLD:
            flat = [enhance_lowlight(r) for r in flat]
        return flat  # feed these to the reader, not the original photo
Most of the lift came from dewarping. Travel photos are taken from angles, not straight on, and text-recognition models hate anything that is not a clean rectangle. A small Python function that fits a quadrilateral to each detected text region and projects it into a flat crop recovered most of the recognition loss on angled signs.
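The projection half of that dewarping step can be sketched in plain Python. This is an illustration under assumptions, not the production function: it takes the four corners of the quadrilateral as already known (top-left, top-right, bottom-right, bottom-left), treats the image as rows of greyscale values, and uses nearest-neighbour sampling. The function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate quadrilateral")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src, dst):
    """3x3 matrix (h33 fixed to 1) mapping each src point to its dst point."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_h(h, x, y):
    """Apply the homography to one point, including the perspective divide."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def dewarp(img, quad, out_w, out_h):
    """Flatten quad (TL, TR, BR, BL) into an out_w x out_h crop (both >= 2)."""
    rect = [(0, 0), (out_w - 1, 0), (out_w - 1, out_h - 1), (0, out_h - 1)]
    h = homography(rect, quad)  # maps each output pixel back into the photo
    rows, cols = len(img), len(img[0])
    out = []
    for v in range(out_h):
        row = []
        for u in range(out_w):
            x, y = apply_h(h, u, v)
            xi = min(max(int(round(x)), 0), cols - 1)
            yi = min(max(int(round(y)), 0), rows - 1)
            row.append(img[yi][xi])
        out.append(row)
    return out


# Toy 5x6 "photo" and a trapezoid standing in for a sign seen from below.
photo = [[r * 10 + c for c in range(6)] for r in range(5)]
crop = dewarp(photo, [(1, 1), (4, 0), (5, 4), (0, 3)], out_w=4, out_h=3)
print(len(crop), len(crop[0]))  # 3 4
```

The mapping runs from the output rectangle back into the photo so that every output pixel gets a value. On real images the same two steps are usually done with OpenCV's `cv2.getPerspectiveTransform` and `cv2.warpPerspective`, which add proper interpolation.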
The low-light enhancement was the second gift. A simple camera-noise-aware contrast boost, applied only when the global luminance is low, pulled evening photos much closer to the accuracy of daytime ones. It is not clever. It is not interesting. But it pushed end-to-end accuracy from 0.68 to 0.79, and the gains were concentrated on exactly the photos where the baseline had been worst — the after-dark ones, the angled ones, the ones a real traveller would actually take.
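A minimal sketch of a luminance-gated enhancement pass follows. It reuses the names `estimate_luminance`, `enhance_lowlight` and `LUMINANCE_THRESHOLD` from the snippet above, but the bodies are assumptions: plain gamma correction stands in for the camera-noise-aware boost described here, the threshold value is a placeholder, and images are rows of 0-255 greyscale values.

```python
LUMINANCE_THRESHOLD = 60  # assumed cut-off; would be tuned on real evening photos


def estimate_luminance(img):
    """Mean pixel value of a greyscale image given as rows of 0-255 ints."""
    total = sum(sum(row) for row in img)
    return total / (len(img) * len(img[0]))


def enhance_lowlight(img, gamma=0.5):
    """Gamma correction: gamma < 1 lifts dark values more than bright ones."""
    lut = [round(255 * (v / 255) ** gamma) for v in range(256)]
    return [[lut[v] for v in row] for row in img]


def maybe_enhance(img):
    """Enhance only when the whole image is dark; leave daytime photos alone."""
    if estimate_luminance(img) < LUMINANCE_THRESHOLD:
        return enhance_lowlight(img)
    return img


evening = [[16, 64], [0, 36]]       # toy dark crop, mean 29
daytime = [[180, 200], [220, 160]]  # toy bright crop, mean 190
print(maybe_enhance(evening))
print(maybe_enhance(daytime) is daytime)  # True
```

Gating on global luminance matters as much as the boost itself: applying the same curve to a well-lit photo would wash out contrast that the reader was already using.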
A principle we keep relearning: when the model is strong but the input is wrong, fix the input.
Attempt 4 — Script-aware reranking with abstention
The last step was the least surprising, and also the one that solved the hallucination problem. For each dewarped text region, we now run two readers in parallel — a specialist OCR that outputs a confidence score per character, and a VLM that outputs a full translation. A small reranker, trained on our benchmark, decides which to trust, and when to abstain.
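The decision the reranker makes can be illustrated with a fixed threshold rule. This is a stand-in, not the trained reranker: it assumes each reader's output has been reduced to a single confidence score between 0 and 1, with the OCR score taken as its weakest character, and the `Reading` type, function names and `min_conf` value are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

ABSTAIN_MESSAGE = "I'm not sure what this sign says."


@dataclass
class Reading:
    text: str
    confidence: float  # 0.0 to 1.0


def ocr_confidence(char_confidences: List[float]) -> float:
    """A region is only as trustworthy as its least certain character."""
    return min(char_confidences) if char_confidences else 0.0


def choose_reading(ocr: Reading, vlm: Reading,
                   min_conf: float = 0.8) -> Optional[Reading]:
    """Return the more confident reading, or None to abstain."""
    best = max(ocr, vlm, key=lambda r: r.confidence)
    return best if best.confidence >= min_conf else None


def answer(ocr: Reading, vlm: Reading) -> str:
    """Text shown to the traveller: a reading, or an explicit abstention."""
    chosen = choose_reading(ocr, vlm)
    return chosen.text if chosen is not None else ABSTAIN_MESSAGE


clear = answer(Reading("Pharmacy", ocr_confidence([0.97, 0.93, 0.95])),
               Reading("Pharmacy", 0.70))
blurred = answer(Reading("Tailor", ocr_confidence([0.91, 0.32, 0.88])),
                 Reading("Tailor's shop", 0.55))
print(clear)    # Pharmacy
print(blurred)  # I'm not sure what this sign says.
```

A learned reranker replaces the single threshold with a model over both readers' signals, but the contract stays the same: the function is allowed to return nothing.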
"Abstain" is the critical word. If neither reader is confident, SerendAI now says so. The traveller sees something like: "I'm not sure what this sign says — try getting a bit closer, or turn on your flash." This is not as satisfying as a confident wrong answer, but it is dramatically more useful.
Accuracy climbed from 0.79 to 0.86, and — more importantly — our rate of confident hallucinations dropped from 14% to under 2%. The tailor-shop-in-Nuwara-Eliya case, when we replayed it against the new pipeline, produced an honest "not sure, please try again" rather than a wrong answer. Which, for the visitor looking for paracetamol in the rain, is what good software looks like.
| Configuration | Notes | End-to-end accuracy |
|---|---|---|
| Commercial OCR baseline | Off-the-shelf cloud API | 0.43 |
| PaddleOCR multilingual | Better engine, same input | 0.55 |
| Direct VLM translation | Hallucinates when uncertain | 0.68 |
| + dewarp + low-light enhance | Preprocessing pipeline | 0.79 |
| + script-aware rerank with abstention | Production configuration | 0.86 |
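The table can be read two ways: as absolute gains, or as the share of remaining errors each step removed. A short sketch of that arithmetic, using only the accuracy figures in the table (the `error_reduction` helper is ours, added for illustration):

```python
stages = [
    ("Commercial OCR baseline", 0.43),
    ("PaddleOCR multilingual", 0.55),
    ("Direct VLM translation", 0.68),
    ("+ dewarp + low-light enhance", 0.79),
    ("+ script-aware rerank with abstention", 0.86),
]


def error_reduction(before: float, after: float) -> float:
    """Fraction of the previous step's errors that the new step removed."""
    return (after - before) / (1.0 - before)


for (_, prev), (name, acc) in zip(stages, stages[1:]):
    print(f"{name}: +{acc - prev:.2f} absolute, "
          f"{error_reduction(prev, acc):.0%} of remaining errors removed")
```

In absolute terms the four steps are close together, between 0.07 and 0.13. Measured against the errors still left at each point, the two model swaps remove roughly a fifth and three tenths of them, while preprocessing and reranking each remove about a third.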
Eight things I would tell myself in January
In the order they occurred to us, with minimal editing:
- Build your own benchmark early. Public OCR benchmarks do not measure the thing travellers care about. A 1,000-photo hand-labelled set taken by actual visitors is small, imperfect, and worth more than any number you can pull from a leaderboard.
- Evaluate per lighting condition, per angle, per script. Averaged accuracy hides the failures your users will actually notice. Break the evaluation down until the ugly cells are visible.
- Visual context matters more than you think. A green cross means pharmacy. A temple bell icon means something ayurvedic. A VLM that sees the whole scene outperforms an OCR that only reads the letters — on the messy photos travellers send.
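- Preprocessing is unfashionable and underrated. Dewarping, denoising, and low-light enhancement are not glamorous research directions, but their 0.11 gain was on par with either model swap (0.12 and 0.13), cost far less, and landed on exactly the after-dark and angled photos where the baseline had been worst.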
- Abstention is a feature, not an admission of defeat. A travel assistant that says "I'm not sure" is more trusted than one that is confidently wrong. Train it to know when to stay quiet.
- Annotate with locals, not with vendors. Our benchmark was labelled by bilingual annotators who sit ten feet away from us. Edge cases get resolved over coffee, not over email.
- The worst photos are where the product value lives. The visitor who is confident in good light does not need us. The one squinting at a dimly-lit sign at 9pm does. Optimise for their case.
- Write the post before you forget the surprise. The best time to write about a research result is the week you first believed it. Six months from now, all of this will seem obvious.
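The per-condition evaluation in the second item of the list above needs very little code. A minimal sketch, assuming each benchmark record carries its slice labels and a boolean `correct` field; the field names and the five toy records are made up for illustration and are not SerendSigns-1k data:

```python
from collections import defaultdict


def accuracy_by_slice(records, keys):
    """Accuracy per combination of the given label keys."""
    buckets = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for rec in records:
        cell = tuple(rec[k] for k in keys)
        buckets[cell][0] += rec["correct"]
        buckets[cell][1] += 1
    return {cell: c / n for cell, (c, n) in buckets.items()}


toy_records = [
    {"lighting": "day", "script": "sinhala", "correct": True},
    {"lighting": "day", "script": "sinhala", "correct": True},
    {"lighting": "night", "script": "sinhala", "correct": False},
    {"lighting": "night", "script": "sinhala", "correct": True},
    {"lighting": "night", "script": "tamil", "correct": False},
]

overall = sum(r["correct"] for r in toy_records) / len(toy_records)
print(f"overall: {overall:.2f}")
for cell, acc in sorted(accuracy_by_slice(toy_records, ["lighting", "script"]).items()):
    print(cell, f"{acc:.2f}")
```

Even on toy data the point shows: the overall figure looks passable while one cell sits at zero. Adding a key such as angle to the `keys` list splits the cells further without changing the function.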
What is next
Two directions. First, we are working on an on-device version of the whole pipeline. A traveller on a spotty 3G signal in the hill country should not have to wait on our servers to know what a sign says, and mobile silicon in 2026 is quietly capable of running a dewarping pass and a small reader locally. We will write about that when it works.
Second, we are opening a six-month paid ML research internship on exactly this problem space. The intern will work directly on SerendAI, co-own part of the SerendSigns benchmark extension, and publish their own piece here under their own name. If the evaluation and vision-language questions in this post are the kind of thing you want to spend half a year inside, we would like to hear from you.
