ALGAE Protocol: Preparing Geocode Data

An automated protocol for assigning early life exposures to longitudinal cohort studies

Data loading part 2: Prepare the geocode data

by Kevin Garwood

Overview Previous Next

Purpose

This table contains geographical covariate identifiers and a data quality indicator for each geocoded residential address. The covariate identifiers will be codes that identify administrative areas that contain the residential address. For example, 22 Pine Street will be within a super output area, which will be within an output area, which will be within a district, which could be within a region. These codes provide 'hooks' in the data for epidemiologists, so they can link exposure locations to area-based information such as socio-economic status, smoking habits and educational level.

The table also contains a simple flag has_valid_geocode, which may hide various data quality indicators that are specific to a cohort, or to software they use to match residential address histories to geographical map coordinates.

Location of Original Data File

You will need to create a file that has this name
original_geocode_data.csv
which must be located in:
early_life/input_data
or
later_life/input_data

Example Original Data File

See here.

Suggested Approach

The typical steps used to produce original_geocode_data are shown in the diagram and described in the following steps. We would recommend that you create the table using a succession of temporary tables rather than changing a single table. This is because you will need to re-use field values you make in this process to create the other original data tables

The geocode field is used to link original_geocode_data, original_address_history_data and original_exposure_data tables together. The map coordinates that are generated through the geocoding activity will be needed to produce exposure values for each geocode.

Preparing the geocode data.

Step 1: Gather all cohort residential addresses

We assume that you will have or will be able to produce an administrative data table that contains a chronology of residential addresses for each study member. Create unique identifiers for each address record. The identifiers will be used to
person_id time_stamp addr_line1 addr_line2 city post_code phone
12344 04-05-1995 22 Pine St Garwoodville N2 7HK 065 175 83945
12344 22-08-1996 Flat 2 32 Walnut Drive Fecht Bay N03 5XY 065 175 83945
12344 04-05-1995 22 PINE STREET GARWOODVILLE N2 065 301 90937
55556 06-06-1995 Flat 5 32 Walnut Dr. Fecht Bay N03 5XY 065 301 90937
55556 06-06-1995 335 Hoover Pt. St. Gulliver's Way P3 4WJ 065 301 90937
58803 05-06-1995 16 quire-Ave. Hansell-upon-Gretal U8 3WR 065 767 56532
58803 10-10-1997 Main St. Pearsontown X21 2HT 065 767 56532
Table Geocoding-1. A typical table in an administrative Contacts Database used to audit current address details of cohort members

First, we try to isolate just the postal address data. In the example table, we would ignore phone number because ALGAE does not need it. For the purpose of geocoding, we need neither person_id nor time_stamp fields. However, we will want to preserve the link between geocodes and address periods.

Our geocoding scheme must accommodate cases where the same geocode is used by multiple people at different times, and when the same location may be represented in several slightly different addresses. In order to minimise the amount of extraneous information used for geocoding, we can add an identifier field for each address period.

person_id time_stamp location_id
12344 04-05-1995 1
12344 22-08-1996 2
12344 04-05-1995 3
55556 06-06-1995 4
55556 06-06-1995 5
58803 05-06-1995 6
58803 10-10-1997 7
Table Geocoding-2: Preserving fields that would link a geocode to a given person for a given time.

location_id addr_line1 addr_line2 city post_code
1 22 Pine St Garwoodville N2 7HK
2 Flat 2 32 Walnut Drive Fecht Bay N03 5XY
3 22 PINE STREET GARWOODVILLE N2
4 Flat 5 32 Walnut Dr. Fecht Bay N03 5XY
5 335 Hoover Pt. St. Gulliver's Way P3 4WJ
6 16 quire-Ave. Hansell-upon-Gretal U8 3WR
7 Main St. Pearsontown X21 2HT
Table-Geocoding3: This table preserves the information we want to use for geocoding. It also contains the field location_id, which allows us to link geocodes we produce back to the address periods.

Step 2: Express Addresses in a Canonical Format

In this step, we standardise the format of residential addresses by combining all address fields into a single field value. We also standardise how we represent punctuation marks and common phrases such as avenue, street, place, point, road etc.
location_id canonical_address
1 22 PINE ST, GARWOODVILLE, N2 7HK
2 FLAT 2, 32 WALNUT DR, FECHT BAY, N03 5XY
3 22 PINE ST, GARWOODVILLE, N2
4 FLAT 5, 32 WALNUT DR, FECHT BAY, N03 5XY
5 335 HOOVER PT, ST GULLIVERS WAY, P3 4WJ
6 16 QUIRE-AVE., HANSELLUPONGRETAL, U8 3WR
7 MAIN ST, PEARSONTOWN, X21 2HT
Table Geocoding4. The individual components that make up a postal address are combined into a single field and formatted to make them amenable to be matched by geocoding software.

Step 3: Use geocoding software to produce map coordinates

In this step, you will create the geocode and version fields that are expected by the original_geocode_data table.

We expect that projects will use software to help automatically match residential addresses with map coordinates. Because the behaviour of geocoding applications can vary, we strongly recommend that you populate the required version field with the name and version of your software.

The software will generate x and y coordinates for each address. For the purposes of these examples, we will assume that you will generate geocode values by concatenating map coordinates together. However, some cohorts may be concerned about including sensitive location data in this field (See General Advice).

The software will likely produce data fields that indicate the quality of a match or reasons why the software had trouble processing it. Initially, you may want to preserve these fields in order to help trouble shoot cases where the software fails to produce a match.

location_id canonical_address x y geocode version match match_cause
1 22 PINE ST, GARWOODVILLE, N2 7HK 37.422036 122.084124 37.422036-122.084124 geoMapperv1.2 100%
2 FLAT 2, 32 WALNUT DR, FECHT BAY, N03 5XY 37.424130 122.081221 37.424130-122.081221 geoMapperv1.2 100%
3 22 PINE ST, GARWOODVILLE, N2 37.422036 122.084124 37.422036-122.084124 geoMapperv1.2 80% Incomplete post code
4 FLAT 5, 32 WALNUT DR, FECHT BAY, N03 5XY 37.424130 122.081221 37.424130-122.081221 geoMapperv1.2 100%
5 335 HOOVER PT, ST GULLIVERS WAY, P3 4WJ 37.424132 122.081333 37.424132-122.081333 geoMapperv1.2 100%
6 16 QUIRE-AVE., HANSELLUPONGRETAL, U8 3WR geoMapperv1.2 0% Unknown street
7 MAIN ST, PEARSONTOWN, X21 2HT 37.424133 122.081440 37.424133-122.081440 geoMapperv1.2 30% Missing street number
Table Geocoding5. The geocoding software will generate map coordinates and data quality fields.

Step 4: Allow for manual geocoding efforts

Undoubtedly the geocoding software will fail to match some geocodes (see General Advice section). We expect that you will do geocoding as an iterative activity that may involve manually matching canonical addresses with map coordinates. It may also involve editing canonical addresses and running the geocoding software again. Your version field should include enough information to describe which method and iteration was used to geocode failed matches.

ALGAE expects that a comments field will be present in the original_geocode_data table. Although you can leave this field blank, you may want to include other useful information in it. For example, you could record how you manually fixed a canonical address that resulted in a failed match. You may also want to preserve diagnostic information about a geocode that was generated by the geocoding software you used. How you populate this field will depend on what you think is important to record or what information about locations you're comfortable to retain in the final version of the table.

location_id canonical_address x y geocode version comments match match_cause
1 22 PINE ST, GARWOODVILLE, N2 7HK 37.422036 122.084124 37.422036-122.084124 geoMapperv1.2 100%
2 FLAT 2, 32 WALNUT DR, FECHT BAY, N03 5XY 37.424130 122.081221 37.424130-122.081221 geoMapperv1.2 100%
3 22 PINE ST, GARWOODVILLE, N2 37.422036 122.084124 37.422036-122.084124 geoMapperv1.2 80% Incomplete post code
4 FLAT 5, 32 WALNUT DR, FECHT BAY, N03 5XY 37.424130 122.081221 37.424130-122.081221 geoMapperv1.2 100%
5 335 HOOVER PT, ST GULLIVERS WAY, P3 4WJ 37.424132 122.081333 37.424132-122.081333 geoMapperv1.2 100%
6 16 CHOIR AVE, HANSELLUPONGRETAL, U8 3WR 37.424132 122.081310 37.424132-122.081310 klg_28122015 removed hyphen 100% Unknown street
7 MAIN ST, PEARSONTOWN, X21 2HT 37.424133 122.081440 37.424133-122.081440 geoMapperv1.2 30% Missing street number
Table Geocoding6. The version and comments fields can contain information about how matches were fixed so they could yield map coordinates.

Step 5: Decide how to derive values for has_valid_geocode field

In this step you will derive the has_valid_geocode field that must appear in the original_geocode_data table. You may have to decide how a collection of data quality fields generated by the geocoding software would map to a "Yes/No" result.

For example, in the example data set, you may decide that if an address was manually geocoded or if the quality of an automated match is above 50%, then the result is "Y". Otherwise the value would be "N".

location_id canonical_address x y geocode version comments has_valid_geocode
1 22 PINE ST, GARWOODVILLE, N2 7HK 37.422036 122.084124 37.422036-122.084124 geoMapperv1.2 Y
2 FLAT 2, 32 WALNUT DR, FECHT BAY, N03 5XY 37.424130 122.081221 37.424130-122.081221 geoMapperv1.2 Y
3 22 PINE ST, GARWOODVILLE, N2 37.422036 122.084124 37.422036-122.084124 geoMapperv1.2 Y
4 FLAT 5, 32 WALNUT DR, FECHT BAY, N03 5XY 37.424130 122.081221 37.424130-122.081221 geoMapperv1.2 Y
5 335 HOOVER PT, ST GULLIVERS WAY, P3 4WJ 37.424132 122.081333 37.424132-122.081333 geoMapperv1.2 Y
6 16 CHOIR AVE, HANSELLUPONGRETAL, U8 3WR 37.424132 122.081310 37.424132-122.081310 klg_28122015 removed hyphen Y
7 MAIN ST, PEARSONTOWN, X21 2HT 37.424133 122.081440 37.424133-122.081440 geoMapperv1.2 N
Table Geocoding6. The table shows how data quality fields produced by geocoding passes have been combined to produce a "Yes/No" value for has_valid_geocode.

Step 6: Add geographical covariate data fields

You will need a means of using map coordinates to identify administrative areas that contain each geocoded address. The default version of ALGAE uses ed91, oa2001, and coa2011 to represent administrative areas. If you want to use different kinds of administrative fields, please see the section Change covariates.

Step 7: Link geocodes back to address period data

In order to ensure that the geocode values in the original_geocode_data correspond with geocode values in the original_address_history_data, you need to link the geocodes we have created between the two tables. In the example, you would use location_id as a linking field.
person_id time_stamp geocode
12344 04-05-1995 37.422036-122.084124
12344 22-08-1996 37.424130-122.081221
12344 04-05-1995 37.422036-122.084124
55556 06-06-1995 37.424130-122.081221
55556 06-06-1995 37.424132-122.081333
58803 05-06-1995 37.424132-122.081310
58803 10-10-1997 37.424133-122.081440
Table Geocoding7. The location_id field is used to ensure that the geocodes used in both original_geocode_data and original_address_history_data can be linked.

Step 8: Eliminate extraneous fields

geocode version comments ed91 oa2001 coa2011 has_valid_geocode
37.422036-122.084124 geoMapperv1.2 ED91-1 OA2001-1 COA2011-1 Y
37.424130-122.081221 geoMapperv1.2 ED91-1 OA2001-2 COA2011-2 Y
37.422036-122.084124 geoMapperv1.2 ED91-1 OA2001-1 COA2011-1 Y
37.424130-122.081221 geoMapperv1.2 ED91-1 OA2001-2 COA2011-2 Y
37.424132-122.081333 geoMapperv1.2 ED91-2 OA2001-3 COA2011-3 Y
37.424132-122.081310 klg_28122015 removed hyphen ED91-2 OA2001-4 COA2011-4 Y
37.424133-122.081440 geoMapperv1.2 ED91-2 OA2001-5 COA2011-5 N
Table Geocoding8. The final version of original_geocode_data. The final table will contain only the fields which are necessary and sufficient for ALGAE to produce results.

Some projects may want to retain location information in the geocodes, whereas others may want to anonymise them to help minimise risks of identifying study members in data sets that may be maintained off-site from the cohort's facilities. Although anonymising geocodes can help minimise identifiability, it can also remove information that can be used to determine

General Advice

Decide whether you want to anonymise geocodes

Although ALGAE uses geocode values to link the input tables together, it does not include geocodes in any of the result tables it produces. The only location data that are included in results are the covariate identifiers that describe administrative areas that contain each geocode. However, we assume that an administrative area will be large enough to minimise the likelihood of identifying a specific residential address used by a given study member.

Depending on how cohort facilities and off-site research partners divide the work of producing the original data tables, the project members may want the geocode values to be anonymised to minimise the identifiability of location data. Should you decide to anonymise geocode data, take special care to ensure that geocode values in corresponding tables are also anonymised in the same way.

Expect to do geocoding in multiple passes

The residential addresses that you use will likely have been manually entered into some administrative software system. It is likely that geocoding software will fail to match some of them with map coordinates. If such an address is used by study members for a significant part of their exposure periods, then a failed match will mean they may be excluded from calculations that produce exposure results.

In order to ensure that the most study members have exposure results, you may spend a lot of effort doing geocoding. Typically your work will either mean you correct addresses and re-run the software or that you interpret incorrect addresses and use your judgement to match them with appropriate map coordinates. We suggest you assume you will need to finish geocoding in multiple passes, and that in each pass you will make use of the version and comments fields to record how coordinates were derived.

Understand why geocoding may fail for some addressses

There are many reasons why an address may result in an invalid geocode. Some of these include:

typographical errors. If residential addresses are typed manually, they may contain mistakes that result in a non-existent address.

incomplete addresses. If an address is missing a street number, the software may either fail to provide a match, or provide a guess that is of such poor quality that it is not appropriate to use in a given study.

deprecated addresses. Over time, some street names may change and post codes may be retired, or reflect a change in the scope of area they cover. If the software is unable to recognise outdated addresses, it may fail to produce a match.

new addresses. The software may be referencing an outdated collection of addresses that may not include addresses from newly built residential areas.

incomplete address database. The geocoding software may be using a database that simply doesn't contain all the addresses it should have.

parsing errors. The software may interpret characters which it uses as delimiters. For example, the presence of a comma may cause the software to break an address into an incorrect number of address components. As another example, it may interpret the forward slash in 'c/o' as a signal to perform some kind of pre-processing.

abbreviations. An address may be correctly specified but contain abbreviations that are not understood by the software. For example, if the software is designed to suit US addresses, it may understand "ct." means court but may not understand the Canadian version which uses "crt".

centroid errors. The software, in its attempts to find the centre for a postal code area, may provide coordinates that would not make sense on a map. For example, the software could determine that the centre of a postal code area that covers a cove may actually be on water. Other calculations could place the centre of a post code in at the top of a mountain or in the middle of an industrial area that has no residents.

There are certain classes of errors that neither automated nor manual geocoding approaches would likely identify. For example, an address may contain a topographical error that still results in a valid geocode. Suppose that an address is written as 22 Pine Street when it should be 222 Pine Street, and the street is long enough to accommodate both addresses.

Table Properties

You need to produce a CSV file called original_geocode_data. It must have the following fields:
Field Description Required Properties Examples
geocode Represents the location of a residential address. For ALGAE, the geocode is treated as just an identifier and the protocol attaches no meaning to the code. Yes Any text 37.422036-122.084124
x4353bi838 (anonymised)
version Describes some aspect of an attempt to geocode the residential address histories. Version may refer to the name and version of a software application that was used to associate residential addresses with map coordinates.

It may also describe the institution that did the work, a specific technique for mapping addresses to coordinates, or whether the geocoding was done manually or through software.

The field is intended to help scientists decide whether geocodes produced in a certain way should be filtered or not.

No Any text my_geocoding_program-v1.1
Imperial_Manual_pass4
your_cohort-02122015
comments Any other information that exposure scientists want to provide about a geocode No Any text
ed91 UK district, based on the 1991 census. This is an administrative area that scientists can use to link a geographic location to covariates that describe properties of the area that could account for associations between pollutants and health outcomes. No Any text, but for future data linking activities you will want to ensure codes follow their appropriate naming conventions
oa2001 UK output area, based on the 1991 census. An output area is the lowest geographical level at which census estimates are provided. It is used in the same way as ED91. No Any text, but for future data linking activities you will want to ensure codes follow their appropriate naming conventions
coa2011 UK census output area, based on the 2011 census. An output area is the lowest geographical level at which census estimates are provided. It is used in the same way as ED91. No Any text, but for future data linking activities you will want to ensure codes follow their appropriate naming conventions
has_valid_geocode Indicates whether a geocode should be considered valid or not. This flag would typically be derived from various data quality flags that a geocoding application would generate. An application may either completely fail to generate a map location or the value it does produce is a vague guess based on partial addresses.

ALGAE doesn't care why a geocode is or is not valid, but it uses the flag to help determine whether it can generate exposure values for a study member.

Yes One of the following ways to represent no or yes:


N
n
no
NO
FALSE
false
No
Yes