Methodology
by Kevin Garwood and Daniela Fecht
The ALGAE Protocol, in its current form, supports two types of analysis: an early life analysis and a later life analysis. In the early life analysis, daily exposure estimates are aggregated over the first (T1), second (T2) and third (T3) pregnancy trimesters and over infancy, from the date of birth to the end of the first year of life (EL). In the later life analysis, annual exposure estimates are assigned to life years 1 to 15 (YR1…YR15). The analyses are the same except for the temporal resolution of the exposure data they use (i.e. daily versus annual averages).
1 Pre-process original data tables
The ALGAE Protocol pre-processes the input tables in order to make the software more generic in terms of the field values that different cohort studies may create. ALGAE, for example, standardises the representations of null (e.g. null, #NULLIF, NULL), yes (e.g. Yes, yes, y, Y, true) or no (e.g. No, no, n, N, false). The pre-processed versions of the input tables are renamed and begin with staging_.
For example, pre-processing alters a copy of original_geocode_data to create a table called staging_geocode_data.
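The standardisation step can be illustrated with a minimal sketch. The token sets below are taken from the examples in the text; treating an empty string as null, the function name standardise_flag and the 'Y'/'N' output codes are assumptions made for illustration and are not taken from the ALGAE code base.

```python
from typing import Optional

# Token sets based on the examples in the text; the empty string is an
# assumed extra representation of null.
NULL_TOKENS = {"", "null", "#nullif"}
YES_TOKENS = {"yes", "y", "true"}
NO_TOKENS = {"no", "n", "false"}


def standardise_flag(raw: Optional[str]) -> Optional[str]:
    """Map a raw yes/no field value to 'Y', 'N' or None (null)."""
    if raw is None:
        return None
    token = raw.strip().lower()
    if token in NULL_TOKENS:
        return None
    if token in YES_TOKENS:
        return "Y"
    if token in NO_TOKENS:
        return "N"
    # Fail loudly so unexpected cohort values are noticed early.
    raise ValueError(f"Unrecognised yes/no value: {raw!r}")
```

Raising an error for unrecognised values, rather than silently mapping them to null, is a design choice of this sketch; it keeps unexpected field values visible.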
2 Perform preliminary system checks
ALGAE tries to make assertions that certain table fields are null or unique. These checks help detect any errors that may exist in the cohort data sets as early as possible.
3 Ensure daily exposure records are ready to use
ALGAE processes daily exposure estimates for the early life analysis. In the later life analysis, an exposure record has the same fields but uses annual average exposures. In later life analyses, the date is always expected to be January 1st of a calendar year. The annual exposure records are converted to daily exposure records to harmonise the further analysis.
4 Calculate life stages
Life stages are calculated for each cohort member. In the early life analysis, the temporal boundaries for life stages T1, T2, T3 and EL are calculated. In the later life analysis, temporal boundaries are calculated for life stages YR1, YR2, YR3, etc. The early life analysis includes additional code to account for premature births. When cohort members are born prematurely, the protocol corrects the affected life stages: it changes the end date of T2 or T3 to reflect the premature date of birth, or records that T3 is missing altogether. The life stage calculations ensure that trimesters will not overlap: in other words, each relevant day of a person's life will belong to exactly one life stage.
As well as the temporal boundaries of each life stage, the temporal boundaries for the overall exposure time frame are calculated. For example, in the early life analysis, the time frame would be from the day of conception until the end of the first year of life, calculated as: [conception date, birth date + 1 year - 1 day].
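The early life time frame formula can be sketched as follows. The function names are hypothetical, and the handling of a 29 February birth date (falling back to 28 February in a non-leap year) is an assumption of this sketch, not something the text specifies.

```python
from datetime import date, timedelta
from typing import Tuple


def add_years(d: date, years: int) -> date:
    """Return the same calendar day `years` later.

    Assumption: 29 February falls back to 28 February when the
    target year is not a leap year.
    """
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def early_life_time_frame(conception: date, birth: date) -> Tuple[date, date]:
    """[conception date, birth date + 1 year - 1 day], both ends inclusive."""
    return conception, add_years(birth, 1) - timedelta(days=1)


def el_life_stage(birth: date) -> Tuple[date, date]:
    """EL: from the date of birth to the end of the first year of life."""
    return birth, add_years(birth, 1) - timedelta(days=1)
```

For a cohort member conceived on 10 January 2000 and born on 5 October 2000, the time frame under these assumptions runs from 10 January 2000 to 4 October 2001.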
5 Clean address periods
5.1 Impute blank start and end dates. First, ALGAE ensures that each address period has non-blank values for its start date and end date. When a start date is missing, the protocol imputes the value with the cohort member's date of conception. When the end date is missing, the value is set to the current date.
5.2 Order address periods. Once any blank values have been imputed, the address periods are ordered first by the study member ID, then by the start date, then by the duration of the period at the address. Once sorted, the order is maintained for the rest of the protocol.
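The imputation and ordering rules can be sketched in a few lines. The AddressPeriod class, the impute_and_order function and the conception_dates lookup are hypothetical names introduced for illustration; only the imputation rules and the sort order come from the text.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AddressPeriod:
    person_id: str
    geocode: str
    start_date: Optional[date]
    end_date: Optional[date]


def impute_and_order(
    periods: List[AddressPeriod],
    conception_dates: Dict[str, date],
    today: date,
) -> List[AddressPeriod]:
    """Fill blank dates, then sort by person, start date and duration."""
    imputed = []
    for p in periods:
        # Blank start date -> the cohort member's conception date.
        start = conception_dates[p.person_id] if p.start_date is None else p.start_date
        # Blank end date -> the current date.
        end = today if p.end_date is None else p.end_date
        imputed.append(replace(p, start_date=start, end_date=end))
    return sorted(
        imputed,
        key=lambda p: (p.person_id, p.start_date, p.end_date - p.start_date),
    )
```

Passing the current date in as a parameter, rather than reading the system clock inside the function, keeps the sketch reproducible when it is re-run later.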
At this point in the protocol ALGAE begins to track data cleaning changes using sensitivity variables.
5.3 Identify and try to fix bad geocodes. Before ALGAE attempts to clean the temporal boundaries of successive address periods, it tries to identify and fix bad geocodes. A bad geocode is one which is either invalid or out-of-bounds. If study members have an address period which is within their exposure time frame and has a bad geocode, it means that their exposure will not be calculated - they are left out of exposure analysis altogether.
In order to reduce the impact of bad geocodes, ALGAE tries to fix those which it believes represent an incorrect address which is fixed in the following address record. The diagram below illustrates the criteria for identifying an address period which has a bad geocode that can be fixed.
If an address period, a_n, is identified as having a geocode that is "fixable" then:
- a_(n+1).start_date = a_n.start_date
- a_n is marked with
5.4 Identify and fix temporal gaps and overlaps.
Once the address periods have been ordered by person_id, start_date and duration, ALGAE proceeds to identify any gaps or overlaps that may appear in the residential address histories. The protocol scans the address periods and flags each one depending on the fit that exists between successive address periods.
Once address periods have been marked for gaps and overlaps, ALGAE begins to fix them so that they are temporally contiguous. In any address period, the start_date values are assumed to provide a much stronger and more reliable signal of location than end_date values. The assumption is based on the idea that in an administrative system, start dates will likely correspond to time stamps but end dates will likely be computed in relation to start dates.
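One way to picture the flagging and fixing steps is the sketch below. It is a simplified illustration for a single study member, not ALGAE's implementation. It assumes that trusting start dates means keeping every start_date and moving each end_date to the day before the next period begins, and that the periods are already sorted with strictly increasing start dates. The flag labels are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Tuple

Period = Tuple[date, date]  # (start_date, end_date), both inclusive

ONE_DAY = timedelta(days=1)


def classify_fit(current: Period, following: Period) -> str:
    """Flag the fit between two successive address periods."""
    expected_next_start = current[1] + ONE_DAY
    if following[0] > expected_next_start:
        return "gap"
    if following[0] < expected_next_start:
        return "overlap"
    return "contiguous"


def make_contiguous(periods: List[Period]) -> List[Period]:
    """Keep each start date; end each period the day before the next starts.

    Assumes periods are sorted with strictly increasing start dates.
    """
    fixed = [
        (current[0], following[0] - ONE_DAY)
        for current, following in zip(periods, periods[1:])
    ]
    if periods:
        fixed.append(periods[-1])  # the last period has no successor
    return fixed
```

Under these assumptions, a gap is closed by extending the earlier period forward and an overlap is removed by truncating it, so the start dates are never altered.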
5.5 Ensure that every day of exposure time frame is covered by an address period. It is likely that there will be a gap between the date of conception and the date of enrolment in the cohort study. It may also be possible, in the early or later life analysis, that the end date of the last available address period fails to cover days at the end of the exposure time frame. In order to ensure that all days of an exposure time frame are covered, ALGAE will adjust the boundaries of the first and last relevant address periods.
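A minimal sketch of this boundary adjustment follows, assuming that "adjust" means stretching the first period back to the start of the time frame and the last period forward to its end. The function name and the tuple representation of an address period are hypothetical.

```python
from datetime import date
from typing import List, Tuple

Period = Tuple[date, date]  # (start_date, end_date), both inclusive


def cover_time_frame(
    periods: List[Period], frame_start: date, frame_end: date
) -> List[Period]:
    """Stretch the first and last periods so the whole time frame is covered."""
    if not periods:
        return []
    covered = list(periods)
    first_start, first_end = covered[0]
    if first_start > frame_start:
        # Days between conception and the first known address.
        covered[0] = (frame_start, first_end)
    last_start, last_end = covered[-1]
    if last_end < frame_end:
        # Days after the last known address period ends.
        covered[-1] = (last_start, frame_end)
    return covered
```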
6 Calculate Exposures
Once ALGAE has geocoded residential addresses and cleaned the address periods, it then matches the appropriate daily exposure estimates based on the locations that a study member occupied on each day covered by their cleaned residential address histories. Each exposure record has a person_id, a geocode, a date_of_year and daily pollution estimates for various pollutants.
The exposure records are used in different ways to assess exposures for early and later life analyses.
6.1 Assess the Data Quality of Each Daily Exposure Value
ALGAE assesses exposures based on aggregating daily exposure values, some of which may be part of bad address periods that the protocol cannot fix. For example, if study members live at an invalid residential address for their entire gestation period, we cannot ignore this problem when we assess their trimester exposures.
The protocol's frequent encounters with bad address periods led us to design it so that it could report data quality indicators showing the extent to which life stage assessments were affected by various kinds of bad address periods. Before we discuss the different ways ALGAE aggregates daily exposure values, we need to cover how it grades exposure values.
ALGAE classifies all daily exposure values based on five mutually exclusive categories which are evaluated in the context of each pollutant. They are described in the diagram below:
ALGAE determines the category by answering three questions about each relevant address period:
- Is the geocode for the address period associated with at least one non-null exposure value for a given pollutant?
- Is the geocode for the address period valid?
- For a given day within the address period coverage, is there an associated exposure value?
The answer to the first question tells ALGAE if there is any exposure data associated with the geocode. If there are no exposure values available, then we can draw one of two conclusions about the location for an address period:
- it is an invalid geocode and this would explain why no exposure values exist
- it is a valid geocode but it isn't in the study area that was used for exposure modelling - the study member moved outside the study area
The protocol looks up the geocode in staging_geocode_data and determines the value of its has_valid_geocode field. If the value is N, then this describes the invalid address scenario. If it is Y, then the circumstances describe the out of bounds scenario.
If an address period's geocode is associated with at least some exposure values, it doesn't necessarily mean they are good values. Some projects may try to make a guess about the coordinates that should be associated with a poorly specified residential address. For example, suppose a person's address is simply "Pine St", but the street is fifteen blocks long. For some projects, exposure scientists may try to pick the coordinates for the middle of the street. The coordinates could be used to generate exposures, even though the coordinates may not be accurate. Human or software-based geocoding agents may flag a geocode as being invalid, even if it is used to generate exposures. These circumstances describe the poor address scenario.
The protocol is left with two remaining categories to consider. An address period may have a valid geocode that is associated with some exposures for a given pollutant. However, for specific days, there may not be exposure values available. This may occur if the temporal coverage of the exposure modelling does not cover all days that are in the exposure time frames. Should this happen, then we have the missing exposure scenario. If a day can actually be associated with a non-empty exposure value, then this describes the good address scenario.
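The decision logic for the five categories can be summarised in a short sketch. The function and parameter names are hypothetical. The order of the last two checks is an assumption: the text does not say how a day with no exposure value is graded when the geocode is also flagged as invalid, and here the geocode flag takes precedence.

```python
from typing import Optional


def classify_exposure_day(
    geocode_has_exposures: bool,
    has_valid_geocode: bool,
    daily_value: Optional[float],
) -> str:
    """Grade one daily exposure value for one pollutant."""
    if not geocode_has_exposures:
        # No exposure data at all for this geocode and pollutant.
        return "out_of_bounds" if has_valid_geocode else "invalid_address"
    if not has_valid_geocode:
        # Exposures exist, but the geocode was flagged as unreliable.
        return "poor_address"
    if daily_value is None:
        # Valid geocode with exposures, but none for this particular day.
        return "missing_exposure"
    return "good_address"
```

Because the grading is done per pollutant, the same day at the same address can fall into different categories for different pollutants.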
When the address periods are linked with daily exposure values, ALGAE assigns one of these categories for each kind of pollutant. There is an assumption that the pollutant values should be treated independently of one another. For example, a study may have pm10_rd values for a given month but not have nox_rd values. This may mean that the count of missing exposure days in T1 is 30 out of 92 for nox_rd exposures but 0 out of 92 for pm10_rd.
When ALGAE aggregates daily exposure values, it also aggregates counts of the days that can be labelled for each category. In the exposure results, researchers can then use the counts of invalid, out of bounds, poor match, missing, and good match days, together with the life stage durations, to establish thresholds of data quality for life stage exposures.
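As an illustration of how such counts might be used, the sketch below tallies category labels for one life stage and one pollutant, then applies a data quality threshold. The label strings, the function names and the threshold value are all hypothetical; the choice of threshold is left to researchers.

```python
from collections import Counter
from typing import Iterable


def count_day_categories(categories: Iterable[str]) -> Counter:
    """Count how many days of a life stage fall into each category."""
    return Counter(categories)


def meets_quality_threshold(
    counts: Counter, life_stage_days: int, min_good_fraction: float
) -> bool:
    """True if enough days of the life stage were graded as good matches."""
    return counts["good_address"] / life_stage_days >= min_good_fraction


# A 92-day T1 in which 30 days have no exposure value for one pollutant.
t1_counts = count_day_categories(
    ["good_address"] * 62 + ["missing_exposure"] * 30
)
```

With a hypothetical threshold of 0.8, this T1 assessment would be rejected, because only 62 of its 92 days are good matches.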
6.2 Aggregate Daily Exposure Values Using Different Methods
Now that we've discussed how daily exposure values are rated for data quality, we can discuss how different methods can aggregate them differently. ALGAE runs multiple kinds of assessment to let researchers assess the extent to which data cleaning of address histories would affect their analyses. Some of the other methods also provide a kind of bridge that allows the study to generate analyses that are compatible with those found in other research papers. The various assessment methods are described in the following diagram:
The following diagrams illustrate how the exposure assessments choose daily exposure values, and how they may use the locations for certain exposure days to represent longer exposure periods.