ALGAE Protocol: Methodology

An automated protocol for assigning early life exposures to longitudinal cohort studies


by Kevin Garwood and Daniela Fecht

The ALGAE Protocol, in its current form, supports two types of analysis: an early life analysis and a later life analysis. In the early life analysis, daily exposure estimates are aggregated over the first (T1), second (T2) and third (T3) pregnancy trimesters and over infancy, from the date of birth to the end of the first year of life (EL). In the later life analysis, annual exposure estimates are assigned to life years 1 to 15 (YR1…YR15). The analyses are the same except for the temporal resolution of the exposure data they use (i.e. daily versus annual averages).

1 Pre-process original data tables

The ALGAE Protocol pre-processes the input tables so that the software can cope with the different field values that cohort studies may use. For example, ALGAE standardises the representations of null (e.g. null, #NULLIF, NULL), yes (e.g. Yes, yes, y, Y, true) and no (e.g. No, no, n, N, false). The pre-processed versions of the input tables are renamed so that they begin with staging_. For example, pre-processing alters a copy of original_geocode_data to create a table called staging_geocode_data.
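The standardisation step can be illustrated with a minimal Python sketch. It is not ALGAE's own code: the function name standardise_flag, the token sets, the treatment of an empty string as null and the choice of Y/N as the canonical values are assumptions made for illustration, based on the example spellings listed above.

```python
from typing import Optional

# Hypothetical token sets, built from the example spellings in the text.
# Treating an empty string as null is an additional assumption.
NULL_TOKENS = {"", "null", "#nullif"}
YES_TOKENS = {"yes", "y", "true"}
NO_TOKENS = {"no", "n", "false"}


def standardise_flag(raw: Optional[str]) -> Optional[str]:
    """Map a raw yes/no/null field value to 'Y', 'N' or None."""
    if raw is None:
        return None
    token = raw.strip().lower()
    if token in NULL_TOKENS:
        return None
    if token in YES_TOKENS:
        return "Y"
    if token in NO_TOKENS:
        return "N"
    # Fail loudly rather than guess at an unknown spelling.
    raise ValueError(f"Unrecognised flag value: {raw!r}")
```

Raising an error on an unknown spelling is a design choice of this sketch; it surfaces unexpected cohort conventions early instead of silently treating them as null.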

2 Perform preliminary system checks

ALGAE tries to make assertions that certain table fields are null or unique. These checks help detect any errors that may exist in the cohort data sets as early as possible.

3 Ensure daily exposure records are ready to use

ALGAE processes daily exposure estimates for the early life analysis. In the later life analysis, an exposure record has the same fields but uses annual average exposures. In later life analyses, the date is always expected to be January 1st of a calendar year. The annual exposure records are converted to daily exposure records to harmonise the further analysis.
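As a rough illustration of the conversion from annual to daily records, the sketch below repeats one annual record for every day of its calendar year. The function name expand_annual_record and the record layout (geocode, date, dictionary of pollutant values) are assumptions for this example and do not describe ALGAE's actual tables.

```python
from datetime import date, timedelta
from typing import Dict, Iterator, Tuple

DailyRecord = Tuple[str, date, Dict[str, float]]


def expand_annual_record(
    geocode: str, year_start: date, values: Dict[str, float]
) -> Iterator[DailyRecord]:
    """Yield one daily record per day of the year, each carrying the annual average."""
    if (year_start.month, year_start.day) != (1, 1):
        raise ValueError("annual records are expected to be dated January 1st")
    day = year_start
    while day.year == year_start.year:
        # Copy the values so each daily record can be changed independently.
        yield geocode, day, dict(values)
        day += timedelta(days=1)
```

Once annual records are expanded this way, the same daily aggregation logic can serve both the early life and the later life analysis.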

4 Calculate life stages

Life stages are calculated for each cohort member. In the early life analysis, the temporal boundaries for life stages T1, T2, T3 and EL are calculated. In the later life analysis, temporal boundaries are calculated for life stages YR1, YR2, YR3, etc. The early life analysis includes additional code to account for premature births. When cohort members are born prematurely, the protocol changes the end date of T2 or T3 to reflect the premature date of birth, or records that T3 is missing altogether. The life stage calculations ensure that trimesters do not overlap: in other words, each relevant day of a person's life belongs to exactly one life stage.

As well as the temporal boundaries of each life stage, the temporal boundaries of the overall exposure time frame are calculated. For example, in the early life analysis, the time frame runs from the day of conception until the end of the first year of life, calculated as: [conception date, birth date + 1 year - 1 day].
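The early life boundaries can be sketched as follows. This is an illustration under stated assumptions, not ALGAE's implementation: the trimester lengths of 92 days, the convention that the last day of pregnancy is the day before birth, and the handling of a 29 February birth date are all assumptions, as are the function names. The sketch also assumes that birth falls after the start of T2.

```python
from datetime import date, timedelta
from typing import Dict, Optional, Tuple

Period = Tuple[date, date]  # inclusive start and end dates

# Assumed trimester lengths in days; the text does not specify them.
T1_DAYS = 92
T2_DAYS = 92
ONE_DAY = timedelta(days=1)


def add_one_year(d: date) -> date:
    """Same calendar day next year; 29 February maps to 28 February (assumption)."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def early_life_stages(conception: date, birth: date) -> Dict[str, Optional[Period]]:
    """Return non-overlapping T1, T2, T3 and EL periods for one cohort member."""
    last_pregnancy_day = birth - ONE_DAY  # assumption: EL starts on the birth date
    t1 = (conception, conception + timedelta(days=T1_DAYS - 1))
    t2_start = t1[1] + ONE_DAY
    # A premature birth truncates T2 at the last day of pregnancy.
    t2_end = min(t2_start + timedelta(days=T2_DAYS - 1), last_pregnancy_day)
    t3_start = t2_end + ONE_DAY
    # T3 is missing altogether when birth happens during T2.
    t3 = (t3_start, last_pregnancy_day) if t3_start <= last_pregnancy_day else None
    el = (birth, add_one_year(birth) - ONE_DAY)
    return {"T1": t1, "T2": (t2_start, t2_end), "T3": t3, "EL": el}


def exposure_time_frame(conception: date, birth: date) -> Period:
    """[conception date, birth date + 1 year - 1 day]."""
    return conception, add_one_year(birth) - ONE_DAY
```

Because each stage starts on the day after the previous one ends, every day in the time frame belongs to exactly one life stage.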

5 Clean address periods

5.1 Impute blank start and end dates. First, ALGAE ensures that each address period has non-blank values for its start date and end date. When a start date is missing, the protocol imputes the value with the cohort member's date of conception. When the end date is missing, the value is set to the current date.
Imputing blank dates. Blank start dates are filled with conception dates and blank end dates are filled with the current date.
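The imputation rule can be expressed as a short sketch. The AddressPeriod class, its field names and the function impute_blank_dates are hypothetical; the optional today parameter is included only so that the example is reproducible.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class AddressPeriod:
    """Hypothetical address period record; None stands for a blank date."""
    person_id: str
    geocode: str
    start_date: Optional[date]
    end_date: Optional[date]


def impute_blank_dates(
    period: AddressPeriod, conception_date: date, today: Optional[date] = None
) -> AddressPeriod:
    """Fill a blank start date with the conception date and a blank end date with today."""
    current_date = today if today is not None else date.today()
    start = period.start_date if period.start_date is not None else conception_date
    end = period.end_date if period.end_date is not None else current_date
    return replace(period, start_date=start, end_date=end)
```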

5.2 Order address periods. Once any blank values have been imputed, the address periods are ordered first by the study member ID, then by the start date, then by the duration of the period at the address. Once sorted, the order is maintained for the rest of the protocol.

Ordering the address periods.
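The ordering can be sketched as a three-part sort key. The AddressPeriod type is hypothetical, and sorting durations in ascending order is an assumption, since the text does not state the direction.

```python
from datetime import date
from typing import List, NamedTuple


class AddressPeriod(NamedTuple):
    """Hypothetical address period with non-blank, inclusive dates."""
    person_id: str
    geocode: str
    start_date: date
    end_date: date


def order_address_periods(periods: List[AddressPeriod]) -> List[AddressPeriod]:
    """Sort by person_id, then start_date, then duration (ascending, by assumption)."""
    return sorted(
        periods,
        key=lambda p: (p.person_id, p.start_date, p.end_date - p.start_date),
    )
```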

At this point in the protocol ALGAE begins to track data cleaning changes using sensitivity variables.

5.3 Identify and try to fix bad geocodes. Before ALGAE attempts to clean the temporal boundaries of successive address periods, it tries to identify and fix bad geocodes. A bad geocode is one which is either invalid or out-of-bounds. If study members have an address period which is within their exposure time frame and has a bad geocode, it means that their exposure will not be calculated - they are left out of exposure analysis altogether.

In order to reduce the impact of bad geocodes, ALGAE tries to fix those which it believes represent an incorrect address which is fixed in the following address record. The diagram below illustrates the criteria for identifying an address period which has a bad geocode that can be fixed.

Identifying bad geocodes that could be fixed.

If an address period, a_n, is identified as having a geocode that is "fixable" then:

  1. a_{n+1}.start_date = a_n.start_date
  2. a_n is marked with is_fixed_invalid_geocode=Y. This marks the period so that it is ignored in all future cleaning, as if it didn't exist. Note that this is not the same as marking a period for deletion, which can happen later in the methodology.

Address periods whose geocodes are "fixable" are flagged to be ignored in future data cleaning.

Fixing a bad geocode.
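The two steps of the fix can be sketched as below. The criteria that decide whether a geocode is "fixable" come from the diagram and are not reproduced here, so the sketch assumes the caller has already made that decision. The AddressPeriod class and the function names are hypothetical; only the field is_fixed_invalid_geocode is taken from the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class AddressPeriod:
    """Hypothetical address period record."""
    person_id: str
    geocode: str
    start_date: date
    end_date: date
    is_fixed_invalid_geocode: str = "N"


def apply_geocode_fix(current: AddressPeriod, following: AddressPeriod) -> None:
    """Apply the fix to a period already judged fixable and to its successor."""
    # Step 1: the following period takes over the start date of the bad period.
    following.start_date = current.start_date
    # Step 2: the bad period is flagged so later cleaning skips it.
    current.is_fixed_invalid_geocode = "Y"


def periods_for_cleaning(periods: List[AddressPeriod]) -> List[AddressPeriod]:
    """Return only the periods that later cleaning steps should consider."""
    return [p for p in periods if p.is_fixed_invalid_geocode != "Y"]
```

The flagged period stays in the data and is only filtered out of cleaning, which reflects the distinction between ignoring a period and deleting it.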

5.4 Identify and fix temporal gaps and overlaps. Once the address periods have been ordered by person_id, start_date and duration, ALGAE proceeds to identify any gaps or overlaps that may appear in the residential address histories. The protocol scans the address periods and flags each one depending on the fit that exists between successive address periods.

Identify gaps and overlaps. The numerical sign of the difference between a_{n+1}.start_date and a_n.end_date is used to determine whether two successive address periods a_n and a_{n+1} are temporally contiguous, show a gap or show an overlap.
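A sign-based classification might look like the sketch below. It assumes that end dates are inclusive, so two periods count as contiguous when the next one starts on the day after the current one ends; the difference is therefore taken against that day. The text does not state this convention, and the function name and labels are hypothetical.

```python
from datetime import date, timedelta


def classify_fit(current_end: date, next_start: date) -> str:
    """Classify the fit between two successive address periods.

    Assumes inclusive end dates: the expected start of the next period
    is the day after the current period ends.
    """
    expected_start = current_end + timedelta(days=1)
    difference = (next_start - expected_start).days
    if difference > 0:
        return "gap"
    if difference < 0:
        return "overlap"
    return "contiguous"
```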

Once address periods have been marked for gaps and overlaps, ALGAE begins to fix them so that they are temporally contiguous. In any address period, the start_date values are assumed to provide a much stronger and more reliable signal of location than end_date values. The assumption is based on the idea that in an administrative system, start dates will likely correspond to time stamps but end dates will likely be computed in relation to start dates.


Fix gaps and overlaps. Fixing gaps and overlaps always favours preserving the start_date of address periods.
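One way to make periods contiguous while preserving every start_date is to reset each end date to the day before the next period starts. The sketch below does this for the ordered periods of one person. It is an assumption about how the principle could be applied, not a description of ALGAE's exact rules, and it does not handle successive periods that share the same start date.

```python
from datetime import date, timedelta
from typing import List, NamedTuple


class AddressPeriod(NamedTuple):
    """Hypothetical address period with inclusive dates."""
    person_id: str
    geocode: str
    start_date: date
    end_date: date


def make_contiguous(periods: List[AddressPeriod]) -> List[AddressPeriod]:
    """Close gaps and overlaps for one person's periods, already ordered by start_date."""
    one_day = timedelta(days=1)
    fixed = []
    for current, following in zip(periods, periods[1:]):
        # Only end dates move; start dates are treated as the reliable signal.
        fixed.append(current._replace(end_date=following.start_date - one_day))
    if periods:
        fixed.append(periods[-1])  # the last period has no successor to fit against
    return fixed
```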

5.5 Ensure that every day of the exposure time frame is covered by an address period. It is likely that there will be a gap between the date of conception and the date of enrolment in the cohort study. In either the early or the later life analysis, the end date of the last available address period may also fail to cover days at the end of the exposure time frame. To ensure that all days of an exposure time frame are covered, ALGAE adjusts the boundaries of the first and last relevant address periods.

Filling in any remaining unaccounted exposure days with address periods.
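The boundary adjustment can be sketched as stretching the first period back to the start of the time frame and the last period forward to its end. The AddressPeriod type and the function name cover_time_frame are hypothetical, and the sketch assumes the periods belong to one person and are already ordered and contiguous.

```python
from datetime import date
from typing import List, NamedTuple


class AddressPeriod(NamedTuple):
    """Hypothetical address period with inclusive dates."""
    person_id: str
    geocode: str
    start_date: date
    end_date: date


def cover_time_frame(
    periods: List[AddressPeriod], frame_start: date, frame_end: date
) -> List[AddressPeriod]:
    """Extend the first and last periods so the whole time frame is covered."""
    if not periods:
        return []
    covered = list(periods)
    if covered[0].start_date > frame_start:
        covered[0] = covered[0]._replace(start_date=frame_start)
    # Re-read the last element so a single period can be extended at both ends.
    if covered[-1].end_date < frame_end:
        covered[-1] = covered[-1]._replace(end_date=frame_end)
    return covered
```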

6 Calculate Exposures

Once ALGAE has geocoded residential addresses and cleaned the address periods, it then matches the appropriate daily exposure estimates based on the locations that a study member occupied on each day covered by their cleaned residential address histories. Each exposure record has a person_id, a geocode, a date_of_year and daily pollution estimates for various pollutants.

The exposure records are used in different ways to assess exposures for early and later life analyses.
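The matching of days to exposure values can be sketched as a lookup keyed on geocode and date. The data layout used here, a dictionary from (geocode, date) to pollutant values, and the names AddressPeriod and daily_exposures are assumptions for illustration rather than ALGAE's actual table structure.

```python
from datetime import date, timedelta
from typing import Dict, Iterator, List, NamedTuple, Optional, Tuple


class AddressPeriod(NamedTuple):
    """Hypothetical cleaned address period with inclusive dates."""
    person_id: str
    geocode: str
    start_date: date
    end_date: date


ExposureTable = Dict[Tuple[str, date], Dict[str, float]]


def daily_exposures(
    periods: List[AddressPeriod], exposures: ExposureTable, pollutant: str
) -> Iterator[Tuple[date, str, Optional[float]]]:
    """Yield (day, geocode, value) for every day covered by the address periods."""
    for period in periods:
        day = period.start_date
        while day <= period.end_date:
            record = exposures.get((period.geocode, day), {})
            # None marks a day with no exposure value for this pollutant.
            yield day, period.geocode, record.get(pollutant)
            day += timedelta(days=1)
```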

6.1 Assess the Data Quality of Each Daily Exposure Value

ALGAE assesses exposures based on aggregating daily exposure values, some of which may be part of bad address periods that the protocol cannot fix. For example, if study members live at an invalid residential address for their entire gestation period, we cannot ignore this problem when we assess their trimester exposures.

The protocol's frequent encounters with bad address periods led us to design it so that it could report data quality indicators showing the extent to which life stage assessments were affected by various kinds of bad address periods. Before we discuss the different ways ALGAE aggregates daily exposure values, we need to cover how it grades exposure values.

ALGAE classifies all daily exposure values based on five mutually exclusive categories which are evaluated in the context of each pollutant. They are described in the diagram below:

Categories used to describe the quality of each exposure value.

ALGAE determines the category by answering three questions about each relevant address period:

  1. Is the geocode for the address period associated with at least one non-null exposure value for a given pollutant?
  2. Is the geocode for the address period valid?
  3. For a given day within the address period coverage, is there an associated exposure value?

The answer to the first question tells ALGAE if there is any exposure data associated with the geocode. If there are no exposure values available, then we can draw one of two conclusions about the location for an address period:

  • it is an invalid geocode and this would explain why no exposure values exist
  • it is a valid geocode but it isn't in the study area that was used for exposure modelling - the study member moved outside the study area

The protocol looks up the geocode in the staging_geocode_data and determines the value of its has_valid_geocode field. If the value is N, then this describes the invalid address scenario. If it is Y, then the circumstances describe the out of bounds scenario.

If an address period's geocode is associated with at least some exposure values, it doesn't necessarily mean they are good values. Some projects may try to make a guess about the coordinates that should be associated with a poorly specified residential address. For example, suppose a person's address is simply "Pine St", but the street is fifteen blocks long. For some projects, exposure scientists may try to pick the coordinates for the middle of the street. The coordinates could be used to generate exposures, even though the coordinates may not be accurate. Human or software-based geocoding agents may flag a geocode as being invalid, even if it is used to generate exposures. These circumstances describe the poor address scenario.

The protocol is left with two remaining categories to consider. An address period may have a valid geocode that is associated with some exposures for a given pollutant. However, for specific days, there may not be exposure values available. This may occur if the temporal coverage of the exposure modelling does not cover all days that are in the exposure time frames. Should this happen, then we have the missing exposure scenario. If a day can actually be associated with a non-empty exposure value, then this describes the good address scenario.
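The decision logic described above can be summarised in a small sketch. The function name and category labels are hypothetical, and the inputs are assumed to be precomputed answers to the three questions. Classifying every day at a flagged-but-modelled geocode as a poor address, whether or not that day has a value, is also an assumption.

```python
def classify_exposure_day(
    geocode_has_any_exposure: bool,
    has_valid_geocode: str,
    day_has_exposure: bool,
) -> str:
    """Assign one of five mutually exclusive quality categories to a day.

    Evaluated per pollutant: the two boolean answers refer to one pollutant.
    """
    if not geocode_has_any_exposure:
        # No exposure data at all for this geocode and pollutant.
        if has_valid_geocode == "N":
            return "invalid_address"
        return "out_of_bounds"
    if has_valid_geocode == "N":
        # Exposures exist, but the geocode was flagged as unreliable.
        return "poor_address"
    if day_has_exposure:
        return "good_address"
    return "missing_exposure"
```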

When the address periods are linked with daily exposure values, ALGAE assigns one of these categories for each kind of pollutant. The assumption is that pollutant values should be treated independently of one another. For example, a study may have pm10_rd values for a given month but no nox_rd values. This may mean that the count of missing exposure days in T1 is 30 out of 92 for nox_rd exposures but 0 out of 92 for pm10_rd.

When ALGAE aggregates daily exposure values, it also aggregates counts of the days that can be labelled for each category. In the exposure results, researchers can then use the counts of invalid, out of bounds, poor match, missing, and good match days, together with the life stage durations, to establish thresholds of data quality for life stage exposures.
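A researcher might apply such a threshold along the lines of the sketch below. The category labels, function names and the idea of thresholding on the fraction of good days are assumptions for illustration; ALGAE reports the counts and leaves the choice of threshold to the researcher.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical labels for the five quality categories.
CATEGORIES = (
    "invalid_address",
    "out_of_bounds",
    "poor_address",
    "missing_exposure",
    "good_address",
)


def count_quality_days(daily_categories: Iterable[str]) -> Dict[str, int]:
    """Count the days of one life stage that fall into each category."""
    counts = Counter(daily_categories)
    return {category: counts.get(category, 0) for category in CATEGORIES}


def meets_quality_threshold(
    counts: Dict[str, int], life_stage_days: int, min_good_fraction: float
) -> bool:
    """True when the share of good days reaches the chosen threshold."""
    if life_stage_days <= 0:
        raise ValueError("life stage duration must be positive")
    return counts["good_address"] / life_stage_days >= min_good_fraction
```

For a T1 of 92 days with 30 missing exposure days for one pollutant, the good fraction would be 62 out of 92, so the life stage would pass a threshold of 0.5 but fail one of 0.9.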

6.2 Aggregate Daily Exposure Values Using Different Methods

Now that we've discussed how daily exposure values are rated for data quality, we can discuss how different methods can aggregate them differently. ALGAE runs multiple kinds of assessment to let researchers assess the extent that data cleaning of address histories would affect their analyses. Some of the other methods also provide a kind of bridge that allows the study to generate analyses that are compatible with those found in other research papers. The various assessment methods are described in the following diagram:

ALGAE's exposure methods. ALGAE matches daily exposure records with locations that study members occupied for each day of their cleaned residential address histories. The daily records are then aggregated by life stage in different ways.

The following diagrams illustrate how the exposure assessments choose daily exposure values, and how they may use the locations for certain exposure days to represent longer exposure periods.

Cleaned mobility assessment.

Uncleaned mobility assessment.

Life stage mobility assessment.

Early life birth address assessment.