Building a Robust Geocoder


Tutorial Information


  • Module 1 in Spatial Data Science.

Methods


Geocoding

Tools


Python

Geocoding is a process whereby an address is converted into a coordinate pair. While this may seem fairly straightforward—the prevalence of Google Maps, etc. tells us this—it is a very difficult technical problem, especially when many addresses in many localities in multiple languages are involved.

There are a huge number of geocoding services available and they change all the time—my current favorite is one called OpenCage. There are a number of reasons I like it:

  1. OpenCage understands complete addresses; while many expect you to parse your address (e.g., into street number, street name, city, state, country), OpenCage does this automatically in the background. This makes life so much easier.
  2. OpenCage is built on top of several entirely open-source geocoders.
  3. It’s free plan is fairly generous: it allows you to geocoder 2,500 addresses per day.
  4. Maybe most importantly: your data remains yours after you use the service. This is not the case if you use, for example, Google’s Geocoding service—Google will then claim that they own your data. OpenCage is built to follow the fairly stringent guidelines of the EU’s General Data Protection Regulation (GDPR), which try to coerce firms to let you own your data.

So we like OpenCage. Go to the OpenCage Website and sign-up for an account. Once you have your account set-up you should see something called an ‘API key’ on your Dashboard. You will need this key to geocode your addresses. Copy your API key and store it somewhere. You will need it later!

Create a New Virtual Environment

We begin by creating a new virtual environment for our geocoding work. As a reminder, a virtual environment is a self-contained Python environment, generally built to accommodate a specific project. We use these so that we’re not endlessly installing pacakges into a single bloated environment; instead, we work only with those packages we need. In a terminal window type…

# Tells python 3 to run the venv module...
# ...and creating a new virtual environment...
# ...in our current directory.
python3 -m venv venv/

# On Windows...
python -m venv venv\

We can activate this virtual environment by running the following at the terminal:

# On macOS or Linux...
. venv/bin/activate
# On Windows...
venv\Scripts\activate.bat

Once we’ve activated our virtual environment, any packages we install will be specific to this environment—this is very useful for managing packages on a project-by-project basis. We’re going to install the opencage package, which is a fairly thin wrapper for the OpenCage API (by ‘thin wrapper’, I just mean that it provides convenience functions without contributing much in the way of added functionality).

pip install opencage jupyter

Import Modules

Create a new Python file; at the top, we need to import the relevant OpenCageGeocode object from the library. We’ll also import pprint (pretty print) which will print, for example, JSON responses nicely.

from opencage.geocoder import OpenCageGeocode
from pprint import pprint

Set up the Geocoder

We’ll then create a new instance of the OpenCageGeocode object; in order to do this, we need to use the API key that we were given when we created a new OpenCage account. But we want to do this in a way that prevents others from finding our key. If you’re going to be working on a public git repository (e.g., Github, Gitlab), your goal is to NEVER COMMIT YOUR KEY to your public repository. This means writing your code such that the key is read in from an environment variable.

Store Your API key in an Environment Variable

The easiest way to store an environment variable is to create a new file called simply .env in your project directory and use a nifty package called python-dotenv to load your variables. If you’re using a git repository, make sure you add .env to your .gitignore file! You’ll have to install the package like so (assuming that you’ve activated your virtual environment…)

pip install python-dotenv

In your .env file, include this line:

OPENCAGE=yourapikey

Once you’ve set the environment variable, reading it in a Python script becomes fairly trivial. We’ll now do this and create a new OpenCageGeocode instance.

import os
from dotenv import load_dotenv, find_dotenv
load_dotenv()

key = os.getenv('OPENCAGE')
geocoder = OpenCageGeocode(key)

Geocoder Directionality

Generally speaking, there are two primary ‘directions’ for a geocoding operation.

  1. Forwards: Forward geocoding refers to converting an address (pseudo-spatial data) to a coordinate pair (spatial data), often alongside other supplementary infromation.
  2. Reverse: Backwards geocoding refers to converting a coordinate pair (spatial data) to an address, often with additional information.

The former is the more common research task, but for completeness’ sake, let’s demonstrate a reverse geocode. The instance of the OpenCage geocoder you created above has a method called reverse_geocode. Let’s query our geocoder using a latitute/longitude pair in the fine city of Somerville, Massachusetts using pprint() to print the (pretty extensive!) results readably.

results = geocoder.reverse_geocode(42.385776, -71.094382)
pprint(results)
[{'annotations': {'DMS': {'lat': "42° 23' 8.83716'' N",
                          'lng': "71° 5' 39.85944'' W"},
                  'FIPS': {'county': '25017', 'state': '25'},
                  'MGRS': '19TCG2759494735',
                  'Maidenhead': 'FN42kj82qo',
                  'Mercator': {'x': -7914193.002, 'y': 5190132.398},
                  'OSM': {'edit_url': 'https://www.openstreetmap.org/edit?way=29794889#map=17/42.38579/-71.09441',
                          'note_url': 'https://www.openstreetmap.org/note/new#map=17/42.38579/-71.09441&layers=N',
                          'url': 'https://www.openstreetmap.org/?mlat=42.38579&mlon=-71.09441#map=17/42.38579/-71.09441'},
                  'UN_M49': {'regions': {'AMERICAS': '019',
                                         'NORTHERN_AMERICA': '021',
                                         'US': '840',
                                         'WORLD': '001'},
                             'statistical_groupings': ['MEDC']},
                  'callingcode': 1,
                  'currency': {'alternate_symbols': ['US$'],
                               'decimal_mark': '.',
                               'disambiguate_symbol': 'US$',
                               'html_entity': '$',
                               'iso_code': 'USD',
                               'iso_numeric': '840',
                               'name': 'United States Dollar',
                               'smallest_denomination': 1,
                               'subunit': 'Cent',
                               'subunit_to_unit': 100,
                               'symbol': '$',
                               'symbol_first': 1,
                               'thousands_separator': ','},
                  'flag': '🇺🇸',
                  'geohash': 'drt3nhneh794edc7fzzj',
                  'qibla': 60.41,
                  'roadinfo': {'drive_on': 'right',
                               'road': 'Highland Avenue',
                               'speed_in': 'mph'},
                  'sun': {'rise': {'apparent': 1583320500,
                                   'astronomical': 1583314920,
                                   'civil': 1583318820,
                                   'nautical': 1583316840},
                          'set': {'apparent': 1583361480,
                                  'astronomical': 1583280660,
                                  'civil': 1583363160,
                                  'nautical': 1583365080}},
                  'timezone': {'name': 'America/New_York',
                               'now_in_dst': 0,
                               'offset_sec': -18000,
                               'offset_string': '-0500',
                               'short_name': 'EST'},
                  'what3words': {'words': 'beans.anyway.valve'}},
  'bounds': {'northeast': {'lat': 42.3859639, 'lng': -71.0941313},
             'southwest': {'lat': 42.3856123, 'lng': -71.0946794}},
  'components': {'ISO_3166-1_alpha-2': 'US',
                 'ISO_3166-1_alpha-3': 'USA',
                 '_category': 'building',
                 '_type': 'building',
                 'continent': 'North America',
                 'country': 'United States of America',
                 'country_code': 'us',
                 'county': 'Middlesex County',
                 'house_number': '79',
                 'neighbourhood': 'Spring Hill',
                 'postcode': '02143',
                 'road': 'Highland Avenue',
                 'state': 'Massachusetts',
                 'state_code': 'MA',
                 'town': 'Somerville',
                 'unknown': 'Somerville Central Library'},
  'confidence': 10,
  'formatted': 'Somerville Central Library, 79 Highland Avenue, Somerville, MA '
               '02143, United States of America',
  'geometry': {'lat': 42.3857881, 'lng': -71.0944054}}]

We see that the results are extensive! Not only does it return an address, but it returns a placename—Somerville Central Library—administrative boundaries, current sunset and sunrise information, and local currency! The data is returned formatted in JavaScript Object Notation—or JSON, This is very, very common when working with web service APIs. JSON has become something like the lingua franca of application development.

Parsing Reverse Geocoding Result

We can parse it easily! Let’s say we wanted to retrieve only the formatted placename, including the address. We want the first result (i.e., the 0th index position in Python-ese), and we want the value of the formatted key. So…

formatted = results[0]['formatted']
print(formatted)
Somerville Central Library, 79 Highland Avenue, Somerville, MA 02143, United States of America

Similarly, if we wanted to get just the street name, the town and the state, we could concatenate those values like this…

street = results[0]['components']['road'] + ', ' + results[0]['components']['town'] + ', ' + results[0]['components']['state_code']
print(street)

# Or, using join...

street = ', '.join([results[0]['components']['road'], results[0]['components']['town'], results[0]['components']['state_code']])
print(street)

# Or, slightly cleaner yet...

comp = results[0]['components']
street = ', '.join([comp['road'], comp['town'], comp['state_code']])
print(street)
Highland Avenue, Somerville, MA
Highland Avenue, Somerville, MA
Highland Avenue, Somerville, MA

Parsing JSON like this is a very common task so it’s worth getting comfortable with it!

Forward Geocoding

The more common task for a geocoder is to take an address, or a long list of addresses, and turn it into a mappable latitude/longitude pair. This, again, is called forward geocoding. The process is very similar to the above. Note that one of the great things about OpenCage is that your dataset does not need to be pre-parsed! No more meticulously combing your address fields for street numbers, street names, town names, state names, postal codes, etc. The geocoder will handle that for you.

Let’s do the reverse of the above: send the geocoder the address of the Somerville Public Library Central Branch in Winter Hill.

query = u'79 Highland Avenue, Somerville, MA 02143'
results = geocoder.geocode(query)

pprint(results)
[{'annotations': {'DMS': {'lat': "42° 23' 8.83716'' N",
                          'lng': "71° 5' 39.85944'' W"},
                  'FIPS': {'county': '25017', 'state': '25'},
                  'MGRS': '19TCG2759494735',
                  'Maidenhead': 'FN42kj82qo',
                  'Mercator': {'x': -7914193.002, 'y': 5190132.398},
                  'OSM': {'edit_url': 'https://www.openstreetmap.org/edit?way=29794889#map=17/42.38579/-71.09441',
                          'note_url': 'https://www.openstreetmap.org/note/new#map=17/42.38579/-71.09441&layers=N',
                          'url': 'https://www.openstreetmap.org/?mlat=42.38579&mlon=-71.09441#map=17/42.38579/-71.09441'},
                  'UN_M49': {'regions': {'AMERICAS': '019',
                                         'NORTHERN_AMERICA': '021',
                                         'US': '840',
                                         'WORLD': '001'},
                             'statistical_groupings': ['MEDC']},
                  'callingcode': 1,
                  'currency': {'alternate_symbols': ['US$'],
                               'decimal_mark': '.',
                               'disambiguate_symbol': 'US$',
                               'html_entity': '$',
                               'iso_code': 'USD',
                               'iso_numeric': '840',
                               'name': 'United States Dollar',
                               'smallest_denomination': 1,
                               'subunit': 'Cent',
                               'subunit_to_unit': 100,
                               'symbol': '$',
                               'symbol_first': 1,
                               'thousands_separator': ','},
                  'flag': '🇺🇸',
                  'geohash': 'drt3nhneh794edc7fzzj',
                  'qibla': 60.41,
                  'roadinfo': {'drive_on': 'right',
                               'road': 'Highland Avenue',
                               'speed_in': 'mph'},
                  'sun': {'rise': {'apparent': 1583320500,
                                   'astronomical': 1583314920,
                                   'civil': 1583318820,
                                   'nautical': 1583316840},
                          'set': {'apparent': 1583361480,
                                  'astronomical': 1583280660,
                                  'civil': 1583363160,
                                  'nautical': 1583365080}},
                  'timezone': {'name': 'America/New_York',
                               'now_in_dst': 0,
                               'offset_sec': -18000,
                               'offset_string': '-0500',
                               'short_name': 'EST'},
                  'what3words': {'words': 'beans.anyway.valve'}},
  'bounds': {'northeast': {'lat': 42.3859639, 'lng': -71.0941313},
             'southwest': {'lat': 42.3856123, 'lng': -71.0946794}},
  'components': {'ISO_3166-1_alpha-2': 'US',
                 'ISO_3166-1_alpha-3': 'USA',
                 '_category': 'education',
                 '_type': 'library',
                 'continent': 'North America',
                 'country': 'United States of America',
                 'country_code': 'us',
                 'county': 'Middlesex County',
                 'house_number': '79',
                 'library': 'Somerville Central Library',
                 'neighbourhood': 'Spring Hill',
                 'postcode': '02143',
                 'road': 'Highland Avenue',
                 'state': 'Massachusetts',
                 'state_code': 'MA',
                 'town': 'Somerville'},
  'confidence': 9,
  'formatted': 'Somerville Central Library, 79 Highland Avenue, Somerville, MA '
               '02143, United States of America',
  'geometry': {'lat': 42.3857881, 'lng': -71.0944054}},
 {'annotations': {'DMS': {'lat': "42° 23' 11.51160'' N",
                          'lng': "71° 5' 52.17000'' W"},
                  'FIPS': {'state': '25'},
                  'MGRS': '19TCG2731494825',
                  'Maidenhead': 'FN42kj82gs',
                  'Mercator': {'x': -7914573.676, 'y': 5190243.952},
                  'OSM': {'note_url': 'https://www.openstreetmap.org/note/new#map=17/42.38653/-71.09783&layers=N',
                          'url': 'https://www.openstreetmap.org/?mlat=42.38653&mlon=-71.09783#map=17/42.38653/-71.09783'},
                  'UN_M49': {'regions': {'AMERICAS': '019',
                                         'NORTHERN_AMERICA': '021',
                                         'US': '840',
                                         'WORLD': '001'},
                             'statistical_groupings': ['MEDC']},
                  'callingcode': 1,
                  'currency': {'alternate_symbols': ['US$'],
                               'decimal_mark': '.',
                               'disambiguate_symbol': 'US$',
                               'html_entity': '$',
                               'iso_code': 'USD',
                               'iso_numeric': '840',
                               'name': 'United States Dollar',
                               'smallest_denomination': 1,
                               'subunit': 'Cent',
                               'subunit_to_unit': 100,
                               'symbol': '$',
                               'symbol_first': 1,
                               'thousands_separator': ','},
                  'flag': '🇺🇸',
                  'geohash': 'drt3nhhpkw54pt5hzfm9',
                  'qibla': 60.4,
                  'roadinfo': {'drive_on': 'right',
                               'road': 'Highland Ave',
                               'speed_in': 'mph'},
                  'sun': {'rise': {'apparent': 1583320500,
                                   'astronomical': 1583314920,
                                   'civil': 1583318820,
                                   'nautical': 1583316840},
                          'set': {'apparent': 1583361480,
                                  'astronomical': 1583280660,
                                  'civil': 1583363160,
                                  'nautical': 1583365080}},
                  'timezone': {'name': 'America/New_York',
                               'now_in_dst': 0,
                               'offset_sec': -18000,
                               'offset_string': '-0500',
                               'short_name': 'EST'},
                  'what3words': {'words': 'zones.pardon.rainy'}},
  'components': {'ISO_3166-1_alpha-2': 'US',
                 'ISO_3166-1_alpha-3': 'USA',
                 '_category': 'building',
                 '_type': 'building',
                 'continent': 'North America',
                 'country': 'United States of America',
                 'country_code': 'us',
                 'house_number': '79',
                 'road': 'Highland Ave',
                 'state': 'Massachusetts',
                 'state_code': 'MA',
                 'town': 'Somerville'},
  'confidence': 10,
  'formatted': '79 Highland Ave, Somerville, MA, United States of America',
  'geometry': {'lat': 42.386531, 'lng': -71.097825}},
 {'annotations': {'DMS': {'lat': "42° 22' 58.44000'' N",
                          'lng': "71° 6' 10.08000'' W"},
                  'FIPS': {'county': '25017', 'state': '25'},
                  'MGRS': '19TCG2689594432',
                  'Maidenhead': 'FN42kj71pv',
                  'Mercator': {'x': -7915127.49, 'y': 5189698.73},
                  'OSM': {'note_url': 'https://www.openstreetmap.org/note/new#map=17/42.38290/-71.10280&layers=N',
                          'url': 'https://www.openstreetmap.org/?mlat=42.38290&mlon=-71.10280#map=17/42.38290/-71.10280'},
                  'UN_M49': {'regions': {'AMERICAS': '019',
                                         'NORTHERN_AMERICA': '021',
                                         'US': '840',
                                         'WORLD': '001'},
                             'statistical_groupings': ['MEDC']},
                  'callingcode': 1,
                  'currency': {'alternate_symbols': ['US$'],
                               'decimal_mark': '.',
                               'disambiguate_symbol': 'US$',
                               'html_entity': '$',
                               'iso_code': 'USD',
                               'iso_numeric': '840',
                               'name': 'United States Dollar',
                               'smallest_denomination': 1,
                               'subunit': 'Cent',
                               'subunit_to_unit': 100,
                               'symbol': '$',
                               'symbol_first': 1,
                               'thousands_separator': ','},
                  'flag': '🇺🇸',
                  'geohash': 'drt3n58d29jw7kz7cjf8',
                  'qibla': 60.4,
                  'roadinfo': {'drive_on': 'right', 'speed_in': 'mph'},
                  'sun': {'rise': {'apparent': 1583320500,
                                   'astronomical': 1583314920,
                                   'civil': 1583318820,
                                   'nautical': 1583316840},
                          'set': {'apparent': 1583361480,
                                  'astronomical': 1583280660,
                                  'civil': 1583363160,
                                  'nautical': 1583365080}},
                  'timezone': {'name': 'America/New_York',
                               'now_in_dst': 0,
                               'offset_sec': -18000,
                               'offset_string': '-0500',
                               'short_name': 'EST'},
                  'what3words': {'words': 'line.desks.plant'}},
  'bounds': {'northeast': {'lat': 42.392238, 'lng': -71.072728},
             'southwest': {'lat': 42.372724, 'lng': -71.118193}},
  'components': {'ISO_3166-1_alpha-2': 'US',
                 'ISO_3166-1_alpha-3': 'USA',
                 '_category': 'place',
                 '_type': 'county',
                 'continent': 'North America',
                 'country': 'United States of America',
                 'country_code': 'us',
                 'county': 'Middlesex County',
                 'postcode': '02143',
                 'state': 'Massachusetts',
                 'state_code': 'MA'},
  'confidence': 7,
  'formatted': 'Middlesex County, MA 02143, United States of America',
  'geometry': {'lat': 42.3829, 'lng': -71.1028}}]

Note that in this case, the geocoder has returned three results. Each slightly more general than the previous. The first matches the address to a specific place—the Somerville Central Library—the second matches it to an address, and the third simply places the address in Middlesex County with a zip code. In general, the first result is going to be the most precise location (and the one with the highest confidence score), so it’ll be the most useful for data analysis.

Parse Forward Geocoding Result

Just like above, we can parse the result! We could get the formatted address, the location’s identifying what3words words, or the latitude and longitude corresponding to the address.

formatted = results[0]['formatted']
print(formatted)
what3words = results[0]['annotations']['what3words']['words']
print(what3words)
lat = results[0]['geometry']['lat']
lng = results[0]['geometry']['lng']
print([lat, lng])
Somerville Central Library, 79 Highland Avenue, Somerville, MA 02143, United States of America
beans.anyway.valve
[42.3857881, -71.0944054]

Batch Geocoding

“This has all been well and good,” you might say, “but I’m here because I have 2,304 addresses in a spreadsheet. I don’t really have to hand-code 2,304 queries and parse 2,304 geocoder results, do I?”

OF COURSE NOT!

We can write a few (relatively) simple lines of code to open a delimited text file and iterate over its addresses, parsing and returning the results returned for each row.

We’ll first need to write a few lines of code that open the CSV file and iterate over its rows. There numerous ways to do this. We’re going to use the leanest possible method that uses very little beyond base Python.

# Import the csv library.
import csv

# Provide the location of the som_lib.csv file.
addressfile = 'som_lib.csv'
# Open the file...
with open(addressfile, 'r') as file:
    # ...and read its contents as a dictionary.
    reader = csv.DictReader(file)
    # Iterate over the CSV rows...
    for line in reader:
        # ...and print the value of the 'name' column.
        print(line['name'])
Somerville Public Library East Branch
Somerville Public Library West Branch
Somerville Public Library Central Branch

But what if the file can’t be found? Let’s intentionally introduce a typo into the file name.

# Import the csv library.
import csv

# Provide the location of the som_lib.csv file.
addressfile = 'som_libs.csv'
# Open the file...
with open(addressfile, 'r') as file:
    # ...and read its contents as a dictionary.
    reader = csv.DictReader(file)
    # Iterate over the CSV rows...
    for line in reader:
        # ...and print the value of the 'name' column.
        print(line['name'])
---------------------------------------------------------------------------

FileNotFoundError                         Traceback (most recent call last)

<ipython-input-33-c5a8f5894e0e> in <module>
      5 addressfile = 'som_libs.csv'
      6 # Open the file...
----> 7 with open(addressfile, 'r') as file:
      8     # ...and read its contents as a dictionary.
      9     reader = csv.DictReader(file)


FileNotFoundError: [Errno 2] No such file or directory: 'som_libs.csv'

Python returns an error that is not especially interpretable, though it is nice enough to inform us that the error is of the FileNotFoundError type! Instead of throwing an error, let’s add a try/except condition…

import csv

addressfile = 'som_libs.csv'

# Try to load the file...
try:
    with open(addressfile, 'r') as file:
        reader = csv.DictReader(file)
        for line in reader:
            print(line['name'])
# ...and if reading it throws a FileNotFoundError,
# print a simple, legible message.
except FileNotFoundError:
    print(f'{addressfile} does not exist.')
som_libs.csv does not exist.

Let’s add one more of these conditions—we’re using the OpenCage free service, which has a rate limit of 2,500 requests per day. If we exceed the rate limit, the service will return a RateLimitExceededError. Let’s accommodate this congingency by adding another except conditional.

import csv

addressfile = 'som_lib.csv'

# Try to load the file...
try:
    with open(addressfile, 'r') as file:
        reader = csv.DictReader(file)
        for line in reader:
            print(line['name'])
# ...and if reading it throws a FileNotFoundError,
# print a simple, legible message.
except FileNotFoundError:
    print(f'{addressfile} does not exist.')
except RateLimitExceededError as ex:
    print(ex)
Somerville Public Library East Branch
Somerville Public Library West Branch
Somerville Public Library Central Branch

Now that we’re set up to deal with errors, we can write geocoder queries into our script. We want to keep rows that OpenCage cannot successfully geocode, so we insert an if/else statement that creates null latitude and longitude values if no result is returned.

import csv

addressfile = 'som_lib.csv'

try:
    with open(addressfile, 'r') as incsv:
        reader = csv.DictReader(incsv)
        for line in reader:
            # clean up address, stripping leading and trailing charaters
            address = line['address'].strip()
            # send query to geocoder
            result = geocoder.geocode(address)
            # check if there are results...
            if result and len(result):
                # attach lat and long fields to the iterator line.
                line['lat'] = result[0]['geometry']['lat']
                line['lng'] = result[0]['geometry']['lng']
            else:
                line['lat'] = None
                line['lng'] = None
            print(line)
            
except FileNotFoundError:
    print(f'{addressfile} does not exist.')
except RateLimitExceededError as ex:
    print(ex)
OrderedDict([('name', 'Somerville Public Library East Branch'), ('address', '115 Broadway, Somerville, MA 02145'), ('lat', 42.3879326), ('lng', -71.0835948)])
OrderedDict([('name', 'Somerville Public Library West Branch'), ('address', '40 College Ave, Somerville, MA 02144'), ('lat', 42.3981483), ('lng', -71.1217138)])
OrderedDict([('name', 'Somerville Public Library Central Branch'), ('address', '79 Highland Avenue, Somerville, MA 02143'), ('lat', 42.3857881), ('lng', -71.0944054)])

Not too bad!

Write to a File

The only thing we’ll want our basic geocoder to do is write this same information out to a file. Let’s add this functionality. We’ll do this by naming a new CSV, opening it, and instantiating it as a DictWriter. We’ll also need to specify field names for the output CSV; we do that by simply adding ‘lat’ and ‘lng’ to the list of fieldnames that our input file containts (accessed using reader.fieldnames).

Note that we also included two ways to name the output file: manually, or by using os.path.splitext() which takes a path to a file and splits it into the file path and the file extension.

import csv

addressfile = 'som_lib.csv'
outputfile = 'som_lib_geocode.csv'
# Or, if you want to be fancy...
outputfile = os.path.splitext(addressfile)[0] + '_geocode.csv'

try:
    with open(addressfile, 'r') as incsv:
        # Open outputfile in write mode ('w')...
        with open(outputfile, 'w') as outcsv:
            reader = csv.DictReader(incsv)
            # Add lat and lng to existing fieldnames
            field_names = reader.fieldnames + ['lat', 'lng']
            # Create the writer DictWriter - we'll use this to write rows.
            writer = csv.DictWriter(outcsv, field_names)
            for line in reader:
                address = line['address'].strip()
                result = geocoder.geocode(address)
                if result and len(result):
                    line['lat'] = result[0]['geometry']['lat']
                    line['lng'] = result[0]['geometry']['lng']
                else:
                    line['lat'] = None
                    line['lng'] = None
                print(line)
                # Write geocoded row to output file.
                writer.writerow(line)
except FileNotFoundError:
    print(f'{addressfile} does not exist.')
except RateLimitExceededError as ex:
    print(ex)
OrderedDict([('name', 'Somerville Public Library East Branch'), ('address', '115 Broadway, Somerville, MA 02145'), ('lat', 42.3879326), ('lng', -71.0835948)])
OrderedDict([('name', 'Somerville Public Library West Branch'), ('address', '40 College Ave, Somerville, MA 02144'), ('lat', 42.3981483), ('lng', -71.1217138)])
OrderedDict([('name', 'Somerville Public Library Central Branch'), ('address', '79 Highland Avenue, Somerville, MA 02143'), ('lat', 42.3857881), ('lng', -71.0944054)])

Excellent! We’ve written a geocoder that will take a CSV as input (with any number of columns), and output a CSV with additional columns storing latitude and longitude in WGS 84 coordinates. This can be easily imported into a desktop GIS. For simple purposes, this script is all you’ll need. There are certainly additions you could make! Foe example: how would you add a field to the output CSV that stores the what3words code?

But beyond these fairly simple ‘add more fields’ questions, we can also:

  1. Make the code more reusable by wrapping it inside of a function.
  2. Place the geocoding operation within more contemporary data science workflow by writing it for the Pandas library.
  3. Use the spatial data science library GeoPandas to write the result to a GeoJSON file.

These possibilities are included below, though we will not explore them in depth.

Extension 1: Wrap into a Function

One of the first ways you’ll want to extend any useful set of instructions is to wrap them in a re-usable function. We use standard python syntax here, providing parameters for the input file location and name, the name of the field containing addresses, and a path to an output file. Note that if the last isn’t provided, the function automatically appends _geocode.csv to end of the input filename and uses that as the output.

def geocode(in_file, address_field, out_file = False):
    if out_file:
        out_file = out_file
    else:
        out_file = os.path.splitext(in_file)[0] + '_geocode.csv'

    try:
        with open(in_file, 'r') as incsv:
            # Open outputfile in write mode ('w')...
            with open(out_file, 'w') as outcsv:
                reader = csv.DictReader(incsv)
                # Add lat and lng to existing fieldnames
                field_names = reader.fieldnames + ['lat', 'lng']
                # Create the writer DictWriter - we'll use this to write rows.
                writer = csv.DictWriter(outcsv, field_names)
                for line in reader:
                    address = line['address'].strip()
                    result = geocoder.geocode(address)
                    if result and len(result):
                        line['lat'] = result[0]['geometry']['lat']
                        line['lng'] = result[0]['geometry']['lng']
                    else:
                        line['lat'] = None
                        line['lng'] = None
                    print(line)
                    # Write geocoded row to output file.
                    writer.writerow(line)
    except FileNotFoundError:
        print(f'{in_file} does not exist.')
    except RateLimitExceededError as ex:
        print(ex)

geocode(in_file='som_lib.csv', address_field='address')
OrderedDict([('name', 'Somerville Public Library East Branch'), ('address', '115 Broadway, Somerville, MA 02145'), ('lat', 42.3879326), ('lng', -71.0835948)])
OrderedDict([('name', 'Somerville Public Library West Branch'), ('address', '40 College Ave, Somerville, MA 02144'), ('lat', 42.3981483), ('lng', -71.1217138)])
OrderedDict([('name', 'Somerville Public Library Central Branch'), ('address', '79 Highland Avenue, Somerville, MA 02143'), ('lat', 42.3857881), ('lng', -71.0944054)])

Extension 2: Rewrite Using Pandas

Many contemporary data science workflows use a Python library called pandas; essentially, the role of pandas is to make Python behave more like R. It gives you access to dataframes object classes and a whole range of vectorized data processing operations (a fancier form of iteration that moves more quickly over rows or columns).

Note the line that reads df.apply(...)—this is a very classic pandas syntax. It’s also a very class R syntax! This is a vectorized for loop. All it is doing is applying an operation (in this case, the geocoding function) to each row (axis=1).

import pandas as pd

def geocode(row, query_field):
    try:
        address = row[query_field].strip()
        result = geocoder.geocode(address)
        if result and len(result):
            row['lat'] = result[0]['geometry']['lat']
            row['lng'] = result[0]['geometry']['lng']
        else:
            row['lat'] = None
            row['lng'] = None
        return(row)
    except RateLimitExceededError as ex:
        print(ex)
        quit
        
def run(in_file, query_field, out_file=False):
    if out_file:
        out_file = out_file
    else:
        out_file = os.path.splitext(in_file)[0] + '_geocode.csv'
    try:
        df = pd.read_csv(in_file)
        df = df.apply(lambda x: geocode(x, query_field), axis=1)
        print(df)
        df.to_csv(out_file)
    except FileNotFoundError:
        print(f'{in_file} does not exist.')

run(in_file='som_lib.csv', query_field='address')
name       Somerville Public Library East Branch
address       115 Broadway, Somerville, MA 02145
lat                                      42.3879
lng                                     -71.0836
Name: 0, dtype: object
name       Somerville Public Library East Branch
address       115 Broadway, Somerville, MA 02145
lat                                      42.3879
lng                                     -71.0836
Name: 0, dtype: object
name       Somerville Public Library West Branch
address     40 College Ave, Somerville, MA 02144
lat                                      42.3981
lng                                     -71.1217
Name: 1, dtype: object
name       Somerville Public Library Central Branch
address    79 Highland Avenue, Somerville, MA 02143
lat                                         42.3858
lng                                        -71.0944
Name: 2, dtype: object

Extension 3: Write GeoJSON Using GeoPandas

Finally, we can extend the geocoder to automatically write the file to a GeoJSON, removing the intermediate step in which we interpret the latitudes and longitudes using a desktop GIS. The key row here is row['geometry'] = Point(...)—we’re creating a Point object, which Geopandas can interpret (the block beginning `gpd.GeoDataFrame()..

import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

def geocode(row, query_field):
    try:
        address = row[query_field].strip()
        result = geocoder.geocode(address)
        if result and len(result):
            row['geometry'] = Point(result[0]['geometry']['lng'], result[0]['geometry']['lat'])
        else:
            row['geometry'] = None
        return(row)
    except RateLimitExceededError as ex:
        print(ex)
        quit
        
def run(in_file, query_field, out_file=False):
    if out_file:
        out_file = out_file
    else:
        out_file = os.path.splitext(in_file)[0] + '_geocode.geojson'
    try:
        df = pd.read_csv(in_file)
        gdf = gpd.GeoDataFrame(
            df.apply(lambda x: geocode(x, query_field), axis=1),
            geometry='geometry'
        )
        print(gdf)
        gdf.to_file(out_file, driver='GeoJSON')
    except FileNotFoundError:
        print(f'{in_file} does not exist.')

run(in_file='som_lib.csv', query_field='address')
                                       name  \
0     Somerville Public Library East Branch   
1     Somerville Public Library West Branch   
2  Somerville Public Library Central Branch

                                    address                    geometry  
0        115 Broadway, Somerville, MA 02145  POINT (-71.08359 42.38793)  
1      40 College Ave, Somerville, MA 02144  POINT (-71.12171 42.39815)  
2  79 Highland Avenue, Somerville, MA 02143  POINT (-71.09441 42.38579)

References