Putting 5,000 travel questions on a map.
A traveller standing in Galle Fort wants to know what people have asked about the streets around her. Stack Overflow has spent fifteen years answering this kind of question for code. For travel, the answer has to come from a different shape of database — and the first version of ours took fourteen seconds to return.
OneCeylon is, in shape, a Q&A community. People ask, locals and experienced travellers answer, the good answers float to the top, the rotten ones sink. We borrowed this shape from Stack Overflow, which has the best version of it in the world, and adjusted it for the only thing that matters about a travel question: where you are when you ask.
That last clause hides a great deal of work. A code question is not located anywhere — it is "about" a tag. A travel question is located somewhere, often very precisely. Is there a working ATM near the Galle bus stand? is a useless question without the words "near the Galle bus stand". And a useful answer to that question becomes, six weeks later, the answer to the same question asked by the next visitor standing in the same place.
So we built map-based discovery into the platform from day one. Every question has a location, every answer inherits it, every map view is a query, and the experience the visitor sees is: open the map, see what people have asked here, scroll out to see what people have asked nearby, tap a cluster to drill in. This post is the engineering story of how we made that work — at acceptable speed — for our 5,000-question, 200-place corpus, and how we built it to keep working at ten and a hundred times that scale.
The short version is that p95 latency for "show me the questions in this viewport" went from 14,200ms to 90ms across five iterations, and the lessons were not the ones I would have predicted in November.
The shape of the problem
A user opens a map of Sri Lanka. They pinch-zoom into Galle Fort. The map dispatches a query that says, in effect: find me every question whose location falls inside this rectangular viewport, ordered by some notion of relevance, and return enough of them to draw markers on the map without overwhelming the screen with thousands of pins.
The "without overwhelming" clause is critical. At country zoom, there are roughly 5,000 questions visible. At Galle Fort zoom, there might be a dozen. Showing all 5,000 as individual pins on a country-level map is a usability disaster — and a rendering disaster, because the browser will choke on that many markers. The solution is clustering: at low zoom, group nearby questions into a single marker showing a count; at high zoom, show individual questions.
So the real query is two-headed. Find the questions in this viewport, and group them into clusters appropriate for the current zoom level. Both parts have to come back fast enough that the map feels responsive — under 200ms for the round trip, ideally well under, because the user is panning and zooming and we are firing a query on every change.
Iteration 1 — the embarrassing baseline
The first version, which we shipped in beta because we needed something working before we could
iterate, was almost comically naïve. Every question had a latitude and
longitude column. The map sent four numbers — the bounding box of the current
viewport — and we wrote SQL that did exactly what those four numbers asked for:
SELECT id, title, lat, lng, score
FROM questions
WHERE lat BETWEEN $south AND $north
AND lng BETWEEN $west AND $east
ORDER BY score DESC
LIMIT 500;
No spatial index. No clustering — we just shipped 500 markers to the client and let the map library deal with it. p95 latency at country zoom: 14.2 seconds. The query planner was doing a sequential scan of the entire questions table on every map pan, applying four range filters, then sorting the survivors by score. On a corpus of 5,000 questions this was already painful. It would have been catastrophic at 50,000.
The map, on the client, was no better. Five hundred individual marker components rendered into a Leaflet layer, with React reconciling each one on every pan. The Chrome flame graph for a single zoom-out gesture was a horror show.
Every system has a version where you trust the database to be smart and trust the framework to be fast. That version is also the one you keep around as a humility exercise.
Iteration 2 — the obvious thing first
The first improvement was the boring, correct one: add an index. A B-tree index on a
single-column scalar is straightforward, but a bounding-box query against two correlated
columns (latitude and longitude) is not what B-trees are for. We added a simple compound index
on (lat, lng) just to see what would happen. Latency dropped to about 2.8 seconds
— better, still useless. The planner could now skip large parts of the table, but the index
was still doing more work than it should have because it does not understand that latitude and
longitude describe a two-dimensional space.
This is exactly the moment in every project where someone says "we should be using a spatial database." They are right. We were already on PostgreSQL. PostGIS — the spatial extension — has been quietly being the right answer to this problem for two decades.
Iteration 3 — PostGIS, and a 4× speedup almost for free
The PostGIS migration was, gratifyingly, a one-day project. We added a geom
column of type geography(Point, 4326), populated it from the existing
latitude/longitude pairs, dropped a GiST spatial index on it, and rewrote
the query to use the proper geometry operator:
-- the geom column, indexed for spatial queries
ALTER TABLE questions
ADD COLUMN geom geography(Point, 4326)
GENERATED ALWAYS AS (
ST_SetSRID(ST_MakePoint(lng, lat), 4326)::geography
) STORED;
CREATE INDEX questions_geom_idx ON questions USING GIST (geom);
-- the query, using ST_MakeEnvelope for the viewport
SELECT id, title, ST_X(geom::geometry) AS lng, ST_Y(geom::geometry) AS lat, score
FROM questions
WHERE geom && ST_MakeEnvelope($west, $south, $east, $north, 4326)::geography
ORDER BY score DESC
LIMIT 500;
The && operator is the spatial intersection operator. Combined with the
GiST index, PostGIS organises the question points into an R-tree-like structure that can
answer "which points fall inside this box" without scanning the whole table. p95 latency
dropped to 760ms — a 4× win, with about thirty lines of SQL changed.
Worth highlighting that we used geography, not geometry. Geography
treats coordinates as points on a sphere; geometry treats them as points on a plane. For
Sri Lanka, which is small enough to be approximately flat, the difference is academic for
visual queries — but having the spherical model means we get correct distance calculations
out of the box for the "nearby" features, and we never have to think about projections again.
It costs us a few percent of query speed and saves us a class of bugs we would otherwise
eventually ship.
Iteration 4 — vector tiles, or, stop sending the same data twice
760ms was better than fourteen seconds, but still slow enough that pinch-zoom felt sluggish. The next observation came from looking at what we were actually returning. A single map pan sometimes covered a region we had just queried two seconds earlier. We were re-fetching the same questions, re-serialising them, re-shipping them to the browser, and re-rendering them. The user's behaviour was rectangular and incremental; our backend treated every query as independent and absolute.
The fix is a technique borrowed from how digital maps themselves are served:
vector tiles. Instead of returning "the questions in this exact viewport,"
we cut the world into a fixed grid of tiles at every zoom level and serve "the questions in
this tile." A tile is identified by three integers — z/x/y — and is therefore
cacheable. The browser asks for the four or six tiles that cover its current viewport, and
most of those tiles are already in the user's cache from a moment ago.
We use Mapbox Vector Tile format, which PostGIS produces directly via the
ST_AsMVT family of functions. Tile generation runs on demand, gets cached at
the edge for ten minutes, and is invalidated by question writes through a small
pubsub-driven purge. The migration was a Wednesday's work and the result was
p95 of 240ms, with the cached path returning in under 30ms — the difference
between a map that drags and a map that flicks.
Iteration 5 — clustering, the choice that mattered
The last problem was the rendering one. Even with fast tile delivery, a tile at country zoom might contain hundreds of questions. Drawing each as a marker is wasteful when the user cannot meaningfully tell them apart at that zoom. We had to cluster — the question was just where in the pipeline.
There are two reasonable answers, and we evaluated both seriously.
Option A — supercluster on the client
Supercluster is Mapbox's tiny JavaScript library that takes a flat array of points and produces clustered output for any zoom level using a hierarchical grid algorithm. You ship all your points to the client once, hand them to supercluster, and ask it for clusters at whatever zoom you need. It is fast — single-digit milliseconds even for tens of thousands of points.
Option B — h3 hexagons, server-side
H3 is Uber's hexagonal hierarchical grid system. Every
point on Earth gets a unique hex address at fifteen resolution levels. You can, at write
time, compute the h3 index of every question at every resolution, store it as an integer
column, and then "cluster" by simply GROUP BY h3_index at the appropriate
resolution. The clustering happens in PostgreSQL, the wire payload is tiny — just one row
per cluster — and the client does almost no work.
What we picked, and why
We picked h3, server-side. The decision came down to three factors:
- Mobile data. Supercluster requires shipping every point to the client. At 5,000 questions today that is fine. At 50,000 it is half a megabyte per page load, on connections that are sometimes asthmatic. H3 lets us ship only the clusters the user can actually see.
- Cache friendliness. H3 cluster results, keyed by
(viewport, resolution), are cacheable. Supercluster output is computed fresh on every interaction. - The hex shape. Hexagons tile evenly, have uniform neighbour distances, and look correct on a map. Square-grid clustering produces ugly artefacts at every zoom boundary. This is a small thing visually, but the kind of small thing that compounds.
The implementation is satisfyingly simple. Each question gets fifteen integer columns, one per h3 resolution, computed at write time:
-- at write time, fill the h3 index columns
UPDATE questions
SET h3_r6 = h3_lat_lng_to_cell(POINT(lng, lat), 6),
h3_r7 = h3_lat_lng_to_cell(POINT(lng, lat), 7),
-- ... resolutions 4 through 11
WHERE id = $1;
-- at read time, cluster at whichever resolution suits this zoom
SELECT h3_r7 AS cluster_id,
COUNT(*) AS n,
AVG(lat) AS lat,
AVG(lng) AS lng
FROM questions
WHERE geom && ST_MakeEnvelope($west, $south, $east, $north, 4326)::geography
GROUP BY h3_r7;
A small lookup table maps each map zoom level to the right h3 resolution, with a slight bias toward fewer, larger clusters at low zoom. The result was p95 of 90ms, end-to-end — fast enough that the cluster updates feel like part of the pan gesture, not a response to it.
| Iteration | Notes | p95 latency |
|---|---|---|
| 1 — Naïve SQL | Bounding box, no index | 14,200 ms |
| 2 — B-tree index | Compound (lat, lng) | 2,800 ms |
| 3 — PostGIS + GiST | Proper spatial index | 760 ms |
| 4 — + vector tiles | Cacheable z/x/y delivery | 240 ms |
| 5 — + h3 clustering | Production configuration | 90 ms |
The two things that almost worked
Two ideas earned a few days of investigation each before we abandoned them. Both are worth mentioning, because they are exactly the kind of thing a senior engineer might propose, and we want to save the next person the detour.
Materialised cluster tables. An obvious optimisation is to precompute the
clusters for common zoom levels into a set of tables, refreshed periodically. We tried it.
It works, in the sense that it is fast. But the cache invalidation when a question is added,
moved, or removed turns out to be exactly the kind of distributed-cache problem that has
consumed careers. With the h3 columns indexed, the live GROUP BY is fast enough
that the materialised tables are not worth their operational cost. Maybe at 100× our current
scale. Not now.
Approximate counts with HyperLogLog. Some Q&A platforms render cluster sizes using approximate counts — close enough is close enough when the marker just shows "37 questions". We considered it. PostGIS-plus-h3 returned exact counts fast enough that we did not need the approximation, and exact counts are easier to debug. A useful tool to keep in reserve, not a tool we needed.
Eight things I would tell myself in November
In the order they occurred to us, with minimal editing:
- Use the spatial database. If your data has a "where" in it and you are still on plain SQL, you are leaving an order-of-magnitude speedup unclaimed. PostGIS is mature, free, and well-documented.
- Geography over geometry, unless you have a reason. The few-percent performance cost buys you correct distance maths and one-fewer-projection-bug per year.
- Vector tiles are not just for basemaps. Any data you serve onto a slippy map benefits from being tiled — the cache properties alone earn it.
- Cluster on the server, not on the client, once you cross a few thousand points. The bandwidth savings are bigger than the compute savings, and bandwidth is what costs travellers on patchy 4G.
- H3 over square grids. Hexagons compose, neighbour, and look correct. Squares do none of those things especially well.
- Index at write time. Computing h3 indices on read is wasteful and slow. Computing them on write is essentially free, especially when wrapped in a generated column.
- Resist the temptation to materialise too early. Materialised views are a powerful tool for the right problem. Most of the time, a properly-indexed live query is fast enough, and it does not need invalidating.
- Measure on a real device, on real bandwidth. 90ms at the database is meaningless if the client is rendering 500 markers on a Redmi Note. Optimise the whole path, not just the part you find interesting.
What is next
Two directions. First, we are extending the same pipeline to the local services marketplace — guides, drivers, hosts, activity providers — which has its own geo-discovery requirements but with one extra dimension: the provider's availability changes by the hour, not by the year. Caching strategies are going to need a rethink.
Second, we are about to bring all of this onto a phone. OneCeylon does not have a mobile app yet, and the geospatial pipeline above was designed with a browser as the client. We are hiring a senior mobile engineer to build that app from scratch — and the first interesting design problem will be how the map experience feels on a phone in a moving tuk-tuk, on 3G, with the screen at half brightness.
A senior mobile engineer to build OneCeylon's first phone app — and a paid ML research intern to work on SerendAI. Both based in Colombo.
See the two roles →
