Adapting the ALGAE Protocol for Other Studies

by Kevin Garwood

The ALGAE protocol was originally designed to support a specific environmental health study that has been part of a collaborative activity between Imperial College and ALSPAC. However, early on the interest in reusing the data sets the protocol produced led us to consider ways in which the protocol could be generalised or adaptable to suit other studies or be applied to different cohorts.

Before we consider how ALGAE could be adapted, it is worth reviewing the original use case that the protocol was designed to support. Please see Section Main Use Case Theme.

If your study bears strong resemblance to the original use case study, then ALGAE may be adaptable for your own activities. If it isn't, then I would suggest the most reusable part of the code for other projects may be the code that cleans address histories. I suspect it would find a great many other applications in cases where it was necessary to produce a temporally contiguous address history for people.

View the Code

ALGAE has been open sourced through the LGPL v3.0 license and you can view the code repository on GitHub.

Changing the Types of Exposures

Other studies may want to analyse different exposures, such as ozone, radon, or pollen. They may also want to express the same exposure type but through a different method. For example, ozone_meth1, ozone_meth2, ozone_meth3 may represent the pollutant concentration that has been assessed through different types of instruments or monitoring sources.

ALGAE attaches no meaning and gives no special treatment to any of its default exposure types. Although changing the exposure types is not a configurable option in the code, it is easy to modify the code. ALGAE's consistent use of naming conventions should mean that substituting one of its existing exposure types for another is literally using a search and replace operation. For example, PM10 RD is always called either pm10_rd in the code or PM10 RD in the documentation. I would suggest the fastest way of adapting the code to support a new exposure type is to just search and replace an existing exposure type such as nox_rd.

In your original data tables, you could just leave all the other pollutant values blank. ALGAE should work correctly and you would just ignore the empty contributions of the pollutants that were not relevant to your activity. This is an inefficient way of running the protocol because the protocol would be processing many fields that would be considered irrelevant. However, from the perspective of minimising the effort needed to change the code, this is probably the best approach.

If you have more than the number of exposure types that ALGAE supports, then I would choose one exposure type (e.g.: pm10_tot) and find every occurrence of it in the code. Immediately below it, add an extra line (e.g.: ozone_height_10m).

Changing the Default Gestation Age at Birth Value

By default, ALGAE will fill in a blank gestation age at birth value with 38 weeks. If this does not suit your needs, you can change it in the method setup_scripts(...) in the file Common_InitialiseGlobalConstants.sql. You can also find it by searching for the phrase "#CHANGE_DEFAULT_GESTATION_AGE" in the codebase.

Changing Conception Date Calculations

ALGAE has a default way of calculating conception date. If you want to change that, then you can change it by altering the method comm_cln_gest_age_at_birth() in the code file Common_CalculateLifeStages.sql. You can search for #CHANGE_CONCEPTION_CALCULATION in the code base.

Changing the Life Stages

Studies may want to use different life stages that cover a different exposure time frame and they may also choose to divide the same exposure time frame that ALGAE uses in different ways. The best examples of this are in the early life analysis, where researchers may have different opinions about the temporal boundaries of trimesters. They may also choose to consider the first year of life or divide that up into smaller life stages.

In general, ALGAE attaches little meaning to any of the life stages. Its main assumption is that they do not exhibit gaps or overlaps. It is possible for you to rename the life stages to something other than YR1, YR2.

ALGAE does attach significant to whatever life stage starts when the study member is born. By default this is EL but in another one of our study runs it was named differently. It uses the meaning of this life stage to help it correct calculations of life stage boundaries in cases of very premature births. If a study member is born prematurely enough, they will not have a T3.

If you need to change the life stages, then you need to alter code in one of two places:

the method early_calc_life_stages(), which is found in EarlyLife_CalculateLifeStages.sql or
the method late_calc_life_stages(), which is found in LaterLife_CalculateLifeStages.sql

You can also find more advice about how to change these methods by doing a search for #CHANGE_LIFE_STAGES in the code base.

Turning Off the Feature for Fixing Bad Address Periods

Before it cleans address histories, ALGAE tries to identify and fix certain kinds of bad address periods, where it is not clear where a study member was living. In its development history, this fixing feature was created in an attempt to help prevent study members who had bad address periods in their exposure time frames from being excluded from further analysis. It used a set of three criteria in order to identify scenarios where an incorrect change of address record was immediately followed by a corrected version of it. By default, a bad address period is fixable if:

it has an invalid geocode
it is immediately followed by an address period which does have a valid geocode
it does not overlap with any life stage by more than 25%

We developed these criteria but in response to the nature of how address history data were originally gathered and by our own assessment of what bad address periods should be fixed. However, our fixing feature has two weaknesses. First, they may be viewed as capturing our view of what would identify a certain type of data entry error. Second, whereas we use 25%, others may have a different tolerance for overlap.

Later on, ALGAE was modified so that rather than excluding study members who had bad address periods in their exposure time frames, the protocol would simply keep track of how many days in a life stage could be associated with various categories used to describe the quality of daily exposure values.

Given the advent of that mechanism, some may conclude that fixing bad address periods introduces an element of judgement that would be better left out of assessments. If you want to disable the feature for fixing bad address periods, all of the relevant code is contained in the method comm_id_and_fix_bad_geocodes(), which is found within the file Common_CleanAddressHistories.sql. The code you will need to change is well documented and can be found by searching for #DISABLE_FIXING_BAD_ADDRESS_PERIODS in the code base.

Changing the Assumptions for Fixing Gaps and Overlaps

ALGAE has facilities for fixing gaps and overlaps that may occur in the residential address histories. The way we fix them is influenced by the characteristics of the database application that was originally developed to retain the most recent postal addresses of study members.

In the original use case, we assumed that for any address period, a start date provided a stronger signal for location than end date. When we developed the algorithm for fixing gaps and overlaps, we used the policy that we would wherever possible conserve the start date. Our assumption was that study members were likely to have already been living at a location by the start date for the new location.

These assumptions have influenced our way of fixing gaps and overlaps (See Section Identify and fix temporal gaps and overlaps). However, in other studies, the residential address histories could warrant different assumptions.

For example, the histories could be managed by an application that was actually designed to track someone's past addresses rather than audit their current addresses. In this case, start date and end dates may provide an equally good signal for data quality.

You would expect then that in such applications there would be no gaps or overlaps. However, if there are, then there are at least two alternative approaches you may consider. The first is to impute the exposures for days that have been involved with gaps or overlaps. For example, for a gap day you could consider using an average of two exposures from the neighbouring locations. Perhaps the values would be imputed with some default exposure value for the exposure area.

I would recommend you don't do this because I'm not sure how this would affect the code. Specifically, I'm not sure how such a scheme would work with calculations of exposure misclassification error (i.e. you thought they were at location a₁ but they were at location a₂).

An alternative approach that could be supported by ALGAE would be to favour conserving end dates. I have trouble imagining an administrative system that would be more definite about end dates than start dates. However, if you want to change how ALGAE fixes gaps and overlaps, you will need to change code in the method comm_fix_gaps_and_overlaps() in the code file Common_CleanAddressHistories.sql. The code you would need to change can be found by searching for #CHANGE_GAP_OVERLAP_BEHAVIOUR in the code base.

Disabling the Feature for Cleaning Gaps and Overlaps

In the original use case, the address histories represented residential address histories. One of the key assumptions we made was that in an address history, a person would have to live somewhere. If they always had a location where they were living, then gaps and overlaps could be fixed to show a continual pattern of living locations.

However, for some studies, the address histories might expect gaps to exist between address periods. For example, if address histories represented the locations of main work places, then it is possible that study members would not be working for a period of time. Therefore, it would not make sense to fix gaps in the record.

If, for whatever reason, you don't want ALGAE to fix gaps and overlaps, then I would suggest it would be easiest if you ensured that the original address periods showed none of these problems. Where you want to preserve gaps between successive address periods, I would inject bad address periods into the histories that would be picked up in the end results. For example, I would suggest using either a geocode called not_applicable or some out of bounds coordinate like the North Pole to help you gauge when study members weren't at an address that would contribute exposure to the study.

In order for this to work, you would then have to disable our feature for fixing bad address periods. Following our example, if in a year of life someone had a break of a week between jobs, then for that seven days they would not be at a location that would contribute to workplace pollution. However, 7 days relative to a whole year of life is small, and ALGAE would try to remove the gap by assuming that it should really belong to the following work place.

As we have already discussed in this section, the code for disabling the fixing feature can be found by searching for #DISABLE_FIXING_BAD_ADDRESS_PERIODS in the code base.

The ALGAE Protocol

ALorithms for Generating address histories and Exposures

An automated protocol for assigning early life exposures to longitudinal cohort studies