ALGAE Testing Part 2: Geocode Features

An automated protocol for assigning early life exposures to longitudinal cohort studies

by Kevin Garwood

Background

These features relate to the spatial aspects of cleaning address histories. For various reasons, some address periods will have a 'bad geocode': one which is blank, is considered out of bounds, or is marked with has_valid_geocode=N in the staging_geocode_data table. If a study member has an address period with a bad geocode that falls within their exposure time frame and cannot be fixed, then that person will be excluded from any exposure assessment.

Geocodes are processed identically in early life and later life analyses; therefore, we limit testing to early life data. The tests cover variables that appear in the finished address period file res_early_cleaned_addr.csv and in the sensitivity variable file res_early_sens_variables.

If all the tests in the geocode testing area pass, then test cases in the remaining test areas can be designed to use only valid geocodes that have exposure values. Having this test area simplifies test design by separating spatial data cleaning concerns from temporal ones.

Coverage

Input Fields Covered by Test Cases

Table Field
original_geocode_data geocode
original_geocode_data has_valid_geocode
original_address_history_data geocode
original_address_history_data start_date
original_address_history_data end_date

Output Fields Covered by Test Cases

Table Field
early_cleaned_addr start_date
early_cleaned_addr end_date
early_cleaned_addr is_fixed_invalid_geocode
early_cleaned_addr ith_residence_type
early_cleaned_addr fin_adjusted_start_date
early_cleaned_addr fin_adjusted_end_date
early_sens_variables total_addr_periods
early_sens_variables out_of_bounds_geocodes
early_sens_variables invalid_geocodes
early_sens_variables fixed_geocodes
early_sens_variables has_bad_geocode_within_time_frame

Test Design

Ignore the fields version, ed91, oa2001, coa2011 in staging_geocode_data

The protocol does not change any of these fields and does not use them to compute exposures. The version field can help scientists determine whether bad geocodes are related to a particular attempt to convert postal addresses into geospatial coordinates. The fields ed91, oa2001 and coa2011 have no meaning on their own; they are only useful when linked to covariate data sets.

Identify an address period which has a blank geocode

Testing should consider the case where an address period has no value for geocode. This corresponds to a residential address that was so poorly specified that the geocoding software was unable to suggest any geospatial coordinates.

Identify an address period which has a geocode that is marked has_valid_geocode=N in the staging_geocode_data table

Geocoding software applications may attempt to provide coordinates for a partially specified address, but flag the result as a low-quality match. The result may be a geocode which is not blank (see above) but which is still too poor to use in the study.

Identify an address period which is out of bounds

An out-of-bounds geocode is one which is valid, but which is not associated with any exposure values. These would correspond to residential addresses which are located outside the exposure area of interest.
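
The three kinds of bad geocode described above can be sketched as a small classifier, which may help when deciding which category a test address period is meant to exercise. This is only an illustration, not the protocol's own code: the names GeocodeStatus, classify_geocode and geocodes_with_exposures are hypothetical, and the order in which the checks are applied is an assumption.

```python
from enum import Enum
from typing import Optional, Set


class GeocodeStatus(Enum):
    """Hypothetical labels for the categories described in this section."""
    BLANK = "blank"
    INVALID = "invalid"
    OUT_OF_BOUNDS = "out_of_bounds"
    OK = "ok"


def classify_geocode(
    geocode: Optional[str],
    has_valid_geocode: str,
    geocodes_with_exposures: Set[str],
) -> GeocodeStatus:
    """Classify one geocode record.

    Assumption: blank is checked first, then the has_valid_geocode flag,
    then whether any exposure values exist for the geocode.
    """
    # No coordinates were produced for the address at all.
    if geocode is None or geocode.strip() == "":
        return GeocodeStatus.BLANK
    # Coordinates exist but were flagged as a low-quality match.
    if has_valid_geocode.strip().upper() == "N":
        return GeocodeStatus.INVALID
    # Coordinates are valid but lie outside the exposure area.
    if geocode not in geocodes_with_exposures:
        return GeocodeStatus.OUT_OF_BOUNDS
    return GeocodeStatus.OK


# Made-up geocode identifiers, used purely for illustration.
known = {"a1", "a2"}
print(classify_geocode("", "Y", known).value)
print(classify_geocode("a1", "N", known).value)
print(classify_geocode("zz", "Y", known).value)
print(classify_geocode("a2", "Y", known).value)
```

A geocode test suite would typically need at least one address period for each of the three non-OK outcomes.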

Test criteria used to fix a bad address period

Develop one test case that satisfies all three criteria for a fixable address period, and others that each fail exactly one criterion. For example, have test cases where an address period overlaps with less than 25%, exactly 25%, and more than 25% of some life stage.
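
When choosing dates for the boundary cases around 25%, it can help to compute the overlap directly. The sketch below is a hypothetical helper, not part of the protocol. It assumes that start and end dates are inclusive and that the percentage is measured against the duration of the life stage; both assumptions should be checked against the protocol's actual definition before relying on the numbers.

```python
from datetime import date


def overlap_days(start_a: date, end_a: date, start_b: date, end_b: date) -> int:
    """Number of days shared by two periods (inclusive end dates assumed)."""
    latest_start = max(start_a, start_b)
    earliest_end = min(end_a, end_b)
    return max(0, (earliest_end - latest_start).days + 1)


def percent_overlap_with_life_stage(
    addr_start: date, addr_end: date, stage_start: date, stage_end: date
) -> float:
    """Share of the life stage covered by the address period, in percent.

    Assumption: the denominator is the length of the life stage.
    """
    stage_days = (stage_end - stage_start).days + 1
    shared = overlap_days(addr_start, addr_end, stage_start, stage_end)
    return 100.0 * shared / stage_days


# A made-up life stage of exactly 100 days keeps the arithmetic easy to read.
stage_start, stage_end = date(2000, 1, 1), date(2000, 4, 9)

# Three address periods that end one day apart, straddling the threshold.
for addr_end in (date(2000, 1, 24), date(2000, 1, 25), date(2000, 1, 26)):
    pct = percent_overlap_with_life_stage(
        date(1999, 12, 1), addr_end, stage_start, stage_end
    )
    print(addr_end.isoformat(), pct)
```

Using a life stage whose length is a round number of days makes it straightforward to construct address periods that fall just below, exactly on, and just above the threshold.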

Identify a bad address period which occurs after the exposure time frame

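Create an address period which has a bad geocode that cannot be fixed, but which occurs after the end of the study member's exposure time frame. Because the period falls outside the exposure time frame, it should not by itself cause the study member to be excluded from exposure assessment.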
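
The expected behaviour for this test case can be expressed as a small check. The sketch below is an assumed reading of the rule stated in the Background section (an unfixed bad geocode only matters if it falls within the exposure time frame); the AddressPeriod class and the function are hypothetical and do not reproduce how the protocol computes has_bad_geocode_within_time_frame.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class AddressPeriod:
    """Hypothetical, simplified address period used only for this sketch."""
    start_date: date
    end_date: date
    has_bad_geocode: bool
    is_fixed: bool = False


def has_bad_geocode_within_time_frame(
    periods: List[AddressPeriod], frame_start: date, frame_end: date
) -> bool:
    """True if any unfixed bad address period overlaps the exposure time frame."""
    for period in periods:
        overlaps_frame = (
            period.start_date <= frame_end and period.end_date >= frame_start
        )
        if period.has_bad_geocode and not period.is_fixed and overlaps_frame:
            return True
    return False


# Made-up exposure time frame and a bad address period that starts after it ends.
frame_start, frame_end = date(2000, 1, 1), date(2001, 1, 1)
late_bad = AddressPeriod(date(2001, 6, 1), date(2002, 1, 1), has_bad_geocode=True)
print(has_bad_geocode_within_time_frame([late_bad], frame_start, frame_end))
```

Under these assumptions, the study member in this test case would not be flagged, whereas the same bad address period moved inside the time frame would be.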