Meteorological meanderings, part 3: geocoding and reverse geocoding

Strictly speaking, this isn’t meteorology. However, in order to display meteorological data for user-requested locations, we must be able to translate our users’ requests into coordinates.

If a user searches for the name of a city, we need to be able to get a set of coordinates for that city, since our weather information is all based on coordinates. The process for doing such a search is called geocoding. Conversely, if we happen to have an interactive map and we let the user click on any point on it, it’d be nice to display a friendly name for the location they clicked. Translating coordinates to the name (or names) of a location is called reverse geocoding.

The easy solution for handling geocoding is to use a third-party service like HERE. For low-volume traffic, OpenStreetMap’s own hosted Nominatim geocoder instance could be an option as well. Writing our own system from scratch is impractical, for a few reasons. First, capable open source applications already exist. Second, doing fuzzy text matching against potentially gigantic datasets is quite a challenge. And third, correctly importing data in various formats from multiple data sources would be a lot of work.

If we simply want to host our own geocoding system, however, we have a few options. There’s Nominatim itself, which, as mentioned above, is used by OpenStreetMap. Alternatively, applications like Pelias and Gisgraphy can import data from a variety of sources. We’ll choose Nominatim, if for no other reason than to have the ability to compare our locally installed version’s results to those of the OSM-hosted one. It’s a great way to do a sanity check.

Before installing and configuring Nominatim, however, we must figure out how to get source data for it. Since Nominatim works natively with OpenStreetMap data, we can look at downloading it from Geofabrik. This Germany-based company generously provides free downloads of various OSM data extracts. To download US data, there is a North America sub-region file available, currently sitting at 7.1GB for the .osm.pbf (Protobuf) version. But we don’t want to import the entire thing. Assuming we’re not trying to provide hyper-local weather information, there’s no need for exact building or even street data, which can take up a lot of space. What we want to do is filter out the data we don’t need, and leave in just the data we do need.

To filter the source data, we can use osmfilter from osm-c-tools. The osmctools package is available in Debian/Ubuntu and other Linux distros; pre-built Windows binaries exist as well. osmfilter takes a wide range of parameters that let you specify exactly how the filtering should be performed. It even has provisions for modifying data, not just removing it. An example of how we can filter out unwanted data might look like this:

$ osmconvert source-data.osm.pbf --out-o5m > source-data.o5m
$ osmfilter source-data.o5m --keep="name= and ( admin_level= or place= or boundary=administrative )" --drop-version --out-o5m -o=filtered-data.o5m
$ osmconvert filtered-data.o5m --out-pbf > filtered-data.osm.pbf

Because osmfilter doesn’t work with Protobuf files directly, we first have to convert our source data to the o5m format. Osmconvert, another application from osm-c-tools, can easily do that for us, as seen above. Then we perform the filtering itself, and after that, we convert the filtered data back to the Protobuf format. With the current version of US data, the 7.1GB file becomes a mere 80MB after filtering and conversion.

While the filtering example above is relatively short, filters can get much more complicated. The --keep parameter, predictably, tells osmfilter which tags to keep. Tags are nothing more than key/value pairs, and they form the basis for how OSM data is categorized. For example, boundary=administrative will give us many (but not all) governmental areas, from ones as large as countries down to ones as small as neighborhoods. Since OSM data is crowdsourced and continually evolving, there are times when a tag is used incorrectly, or the wrong tag is used altogether. Also, tagging standards for OSM data have not remained fixed, so obsolete tags can still be found in use in some locations. That’s why the data filtering process isn’t always straightforward. For city-level data, the filtering in the above example should get us pretty close to a good state. With some manual verification and filter tweaking, it should be fairly easy to get the results to a “good enough” state.
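
One way to do that verification, sketched roughly here, is with osmfilter’s own statistics output. Assuming the --out-key option behaves as documented (instead of emitting OSM data, it prints the values of a given key together with their occurrence counts), we can see at a glance what survived the filter:

$ osmfilter filtered-data.o5m --out-key=place
$ osmfilter filtered-data.o5m --out-key=admin_level

If expected values (say, place=city, or the admin_level used for states) are missing, or the output is dominated by values we don’t care about, that’s a hint the --keep expression needs tweaking.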

Once we have the filtered data, we can get to work on installing and configuring Nominatim. An easy way to do this is to use Docker. There’s a user-friendly GitHub repo simply called “nominatim-docker” that contains all the necessary components to set this up. Following the repository’s instructions for version 3.5 of Nominatim, our setup steps might look like this:

$ mv filtered-data.osm.pbf /home/me/nominatimdata/
$ git clone https://github.com/mediagis/nominatim-docker.git
$ cd nominatim-docker/3.5
$ docker build --pull --rm -t nominatim .
$ docker run -t -v /home/me/nominatimdata:/data nominatim sh /app/init.sh /data/filtered-data.osm.pbf postgresdata 4
$ docker run --restart=always -p 7070:8080 -d --name nominatim -v /home/me/nominatimdata/postgresdata:/var/lib/postgresql/12/main nominatim bash /app/start.sh

Building the Docker image locally allows for easy modifications to what’s included in the image. It also, conveniently, means we don’t have to download a pre-built image, which can run to multiple gigabytes. The first call to docker run imports our filtered data into a Postgres database running inside the Docker container. The second call launches the actual web application.
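
Before querying anything, it’s worth a quick check that the import finished and the service is actually up. A minimal sketch, assuming the container name and port mapping from the commands above and Nominatim’s standard status endpoint:

$ docker logs nominatim
$ curl http://localhost:7070/status.php

A healthy instance answers the status call with a short OK message; if it doesn’t, the container logs are the first place to look.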

Now that we have Nominatim running in a Docker container and serving HTTP requests on port 7070, we can use its API to perform geocoding and reverse geocoding queries. We can also access port 7070 via a web browser and use the provided web interface to make test queries. (Note: as of Nominatim 3.6, the web interface is in a separate project.)
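
As a rough sketch of what those queries look like (using Nominatim’s standard /search and /reverse endpoints; the coordinates below are just an illustrative point in downtown Austin):

$ curl "http://localhost:7070/search?q=Austin&format=json"
$ curl "http://localhost:7070/reverse?lat=30.2672&lon=-97.7431&format=json"

The first call geocodes a free-form query into a list of candidate places, each with coordinates; the second translates a coordinate pair back into a named location.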

An important addition to make to this setup involves search ranking. By default, roughly speaking, larger areas rank higher in search results than smaller areas. That means cities will be ahead of similarly named neighborhoods, and US states will be ahead of cities. This isn’t always desirable. If we search for “Austin”, for example, the top result will be Austin County, Texas. That wouldn’t necessarily be a major issue for weather data if the city of Austin resided in that county. But it doesn’t. Austin is in Travis County, Texas. Chances are, when a user searches for “Austin”, they’re looking for the city. We want some amount of intelligence in our search rankings to fix issues like this.
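
To see the problem for ourselves, we can ask the local instance for several candidates at once and look at the order in which they come back. A small sketch, assuming jq is installed for readability (display_name is part of Nominatim’s standard JSON output):

$ curl -s "http://localhost:7070/search?q=Austin&format=json&limit=5" | jq '.[].display_name'

With the default rankings, this is where the county-before-city ordering described above shows up.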

Enter: Wikipedia, the venerable source of information on virtually anything. We can use a subset of its data to improve our internal search rankings. By taking into account how many other Wikipedia articles link to the entries for our OSM locations, we can measure their relative popularity. In other words, with our Austin example above, there should be more Wikipedia articles linking to the city of Austin than to Austin County, and we can use that information to place the city higher in the rankings. In fact, Nominatim has provisions for doing just this, and once we perform the import process, the first search result for “Austin” will be the city. With this change, our geocoding setup should be ready.
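
The exact mechanics depend on the Nominatim version and on how the Docker image is built, but as a rough sketch of the approach documented for the 3.x series: Nominatim looks for a precomputed wikimedia-importance.sql.gz file in its data directory at import time, and that file is published on nominatim.org. With the setup above, that would mean downloading the file, getting it into the image’s Nominatim data directory, rebuilding, and re-running the init step:

$ wget https://www.nominatim.org/data/wikimedia-importance.sql.gz
$ docker build --pull --rm -t nominatim .
$ docker run -t -v /home/me/nominatimdata:/data nominatim sh /app/init.sh /data/filtered-data.osm.pbf postgresdata 4

This glosses over the detail of copying the downloaded file into the image (a small Dockerfile edit); where exactly it needs to go depends on the image layout, so treat it as a sketch rather than a recipe.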

In the next post, we’ll look at generating map tiles from OSM data for the purposes of displaying them in an interactive pan/zoom map.