ALGAE Protocol: ALGAE Data Dictionary: Later life address period changes (algae2200-algae2233)

An automated protocol for assigning early life exposures to longitudinal cohort studies

ALGAE Data Dictionary: Later life address period changes (algae2200-algae2233)

by Kevin Garwood

Context of Variables

These variables describe the original address periods used in the later life analysis and all the changes that were made to them so they could be used in an exposure assessment. Although most of the variables would have the same value as the address period variables used in the early life analysis, there are some key differences. The variable algae2233_is_within_exp indicates whether or not the address period overlaps with a study member's exposure period. There is a similar is_within_exp variable for the early life analysis.

Both of these variables depend on the duration of the study member's exposure period. In early life, this could be [conception date, last day of first year], whereas in later life analyses, this could be [yr1.start_date, year15.end_date]. You should expect that these field values may appear different for the same address period, depending on whether you are running an early life or later life analysis.

Note that some of these variables may be considered too sensitive to take off-site. Check your information governance policies and see: Assessing Sensitive Data in the Data Dictionary.

Location of Result File

You will find these variables in a file having a name that fits the form:
res_later_cleaned_addr[Date stamp].csv
which will be found in the directory:
later_life/results/cleaned_address_history
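
As an illustration of the naming pattern above, the following minimal Python sketch locates and reads the result file. The results_root argument and both function names are hypothetical, and the sketch assumes that the CSV file has a header row containing the variable names. The form of the date stamp is not specified here, so a wildcard is used to match it.

```python
import csv
import glob
import os


def find_cleaned_address_files(results_root):
    """Return paths matching res_later_cleaned_addr*.csv, sorted by name.

    results_root is a hypothetical directory that contains later_life/.
    """
    pattern = os.path.join(
        results_root, "later_life", "results", "cleaned_address_history",
        "res_later_cleaned_addr*.csv")
    return sorted(glob.glob(pattern))


def read_rows(path):
    """Read one result file into a list of dicts keyed by column name.

    Assumes the first row of the file holds the variable names.
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))
```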

Example Result File

See here.

Variable Naming Conventions

The basic format of variables in this section follows this pattern:
	algae22[00-33]_[base variable name]

In this pattern, algae22 indicates that the variables relate to the address periods, as they were used within the later life assessment. Two other common phrases that appear within variable names here are:

  • _adj_: adjusted
  • _within_exp: within exposure period.

Variable Dictionary

Variable Description
algae2200_original_row_number The row number refers to the position the record had in the file that was loaded into the original_addr_history_data.
algae2201_person_id An anonymised or pseudonymised identifier which represents a study member. ALGAE uses this variable to link data together for a given study member.
algae2202_ith_residence Describes the sequence of address periods. Note that a study member could have two address periods which have the same location but a different ith_residence value. This would indicate that a person is moving back and forth between places.
algae2203_geocode Represents the location. Normally this would be some concatenation of map coordinates but ALGAE attaches no meaning to the contents of this field. It simply uses geocode as an identifier that is used to link tables.
algae2204_date_state Indicates what actions were taken to ensure that both the start date and end date fields have values. The states include:
  • imputed_both_dates: a blank start date was imputed with the study member's conception date and a blank end date was imputed with the current date.
  • imputed_start_date: a blank start date was imputed with the study member's conception date.
  • imputed_end_date: a blank end date was imputed with the current date.
  • no imputation: no action was taken because the start date and the end date have non-blank values.
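
The four states can be read as a simple decision rule. The sketch below is an illustration only, not ALGAE's own code: the function name is hypothetical and blank dates are assumed to be represented as None.

```python
from datetime import date


def impute_dates(start_date, end_date, conception_date, current_date):
    """Return (start_date, end_date, date_state) after filling blank dates.

    Blank dates are assumed to arrive as None.
    """
    if start_date is None and end_date is None:
        return conception_date, current_date, "imputed_both_dates"
    if start_date is None:
        return conception_date, end_date, "imputed_start_date"
    if end_date is None:
        return start_date, current_date, "imputed_end_date"
    return start_date, end_date, "no imputation"
```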
algae2205_start_date The original start date, before any data cleaning was done.
algae2206_end_date The original end date, before any data cleaning was done.
algae2207_duration The number of days represented by the time frame [start_date, end_date]. The total includes the boundary dates as well.
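
Because both boundary dates are counted, the duration is one day more than a plain date difference. A minimal sketch, with a hypothetical function name:

```python
from datetime import date


def duration_in_days(start_date, end_date):
    """Number of days in [start_date, end_date], counting both boundaries."""
    return (end_date - start_date).days + 1
```

For instance, an address period that starts and ends on the same day has a duration of 1 under this reading.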
algae2208_ith_residence_type Describes the relative position of an address period with respect to all a study member's address periods. The variable has the following states:
  • only: This is the only address period that exists for the study member. It would indicate that the person has never moved. This is an address period whose start date would be changed to match the study member's conception date. The end date may also be changed so that it has a value that is at least the last day of the exposure time frame (e.g. in early life, it could be at least the last day of the first year of life).
  • first: The study member has more than one address period and this address is the first one. This is an address period whose start date would be changed to match the study member's conception date.
  • middle: The study member has more than one address period and this is neither the first nor the last one.
  • last: The study member has more than one address period and this is the last one. Its end date may be adjusted in order to be at least the last day of the exposure time frame.
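
The states above could be derived from the position of an address period within a study member's history. This sketch assumes that ith_residence is numbered from 1 without breaks, which is not stated in this dictionary. The function name is hypothetical.

```python
def ith_residence_type(ith_residence, total_residences):
    """Classify an address period as only, first, middle or last.

    Assumes ith_residence runs from 1 to total_residences without gaps.
    """
    if total_residences == 1:
        return "only"
    if ith_residence == 1:
        return "first"
    if ith_residence == total_residences:
        return "last"
    return "middle"
```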
algae2209_has_valid_geocode 'Y' if an address period has a valid geocode and 'N' if it does not have a valid geocode. For this field, a geocode is valid if it has a non-blank value and the 'has_valid_geocode' field in the original geocode data table is 'Y'.
algae2210_has_name_exposures 'Y' if the geocode is associated with at least one non-null NAME value in the exposure records found in the staging_exp_data table. 'N' if the geocode has no NAME values at all.
algae2211_has_nox_rd_exposures 'Y' if the geocode is associated with at least one non-null NOX RD value in the exposure records found in the staging_exp_data table. 'N' if the geocode has no NOX RD values at all.
algae2212_has_pm10_gr_exposures 'Y' if the geocode is associated with at least one non-null PM10 GR value in the exposure records found in the staging_exp_data table. 'N' if the geocode has no PM10 GR values at all.
algae2213_has_pm10_rd_exposures 'Y' if the geocode is associated with at least one non-null PM10 RD value in the exposure records found in the staging_exp_data table. 'N' if the geocode has no PM10 RD values at all.
algae2214_has_pm10_tot_exposures 'Y' if the geocode is associated with at least one non-null PM10 TOT value in the exposure records found in the staging_exp_data table. 'N' if the geocode has no PM10 TOT values at all.
algae2215_max_life_stage_overlap The maximum number of days that an address period will overlap with any life stage. The value is used to help assess whether address periods having bad geocodes should be cleaned or not.
algae2216_is_fixed_inv_geocode Indicates 'Y' or 'N' for whether an address period has a fixed invalid geocode. An address period with a bad geocode can be fixed if it meets three criteria:
  1. It has an invalid geocode (one that is either blank or otherwise marked with has_valid_geocode='N' in the original geocode data table)
  2. It is immediately followed by an address period which has a valid geocode
  3. The address period overlaps at most 25% with any of the study member's life stages.

If an address period a(n) meets these conditions, then a(n+1).start_date is set to a(n).start_date. The address period a(n) is then said to be entirely subsumed by a(n+1) and is excluded from any further processing.
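
The three criteria and the subsuming step can be sketched as follows. This is an illustration under stated assumptions rather than ALGAE's implementation: the function name and dictionary keys are hypothetical, and the overlap with life stages is assumed to be available as a fraction between 0 and 1, since the way the 25% is measured is not described here.

```python
from datetime import date


def fix_invalid_geocodes(periods, max_overlap_fraction=0.25):
    """Flag address periods whose invalid geocode can be fixed.

    periods: list of dicts ordered by ith_residence, each with the
    hypothetical keys 'start_date', 'has_valid_geocode' ('Y' or 'N') and
    'life_stage_overlap' (assumed fraction between 0 and 1).
    Returns new dicts with 'is_fixed_inv_geocode' added; input is unchanged.
    """
    result = [dict(p, is_fixed_inv_geocode="N") for p in periods]
    for current, following in zip(result, result[1:]):
        if (current["has_valid_geocode"] == "N"
                and following["has_valid_geocode"] == "Y"
                and current["life_stage_overlap"] <= max_overlap_fraction):
            # a(n+1) takes over the start date of a(n), subsuming it.
            following["start_date"] = current["start_date"]
            current["is_fixed_inv_geocode"] = "Y"
    return result
```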

algae2217_fit_extent A number whose sign indicates how this address period a(n) fits with the previous one a(n - 1). It has the following meanings:
  • zero: the previous and current address period have a temporally contiguous fit. There are no gaps or overlaps.
  • positive number: there is a temporal gap between the previous and current address periods.
  • negative number: there is a temporal overlap between the previous and current address periods.
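
One way to obtain a number with these signs is shown below. The sketch assumes that two address periods are contiguous when the current one starts on the day after the previous one ends. The exact calculation used by ALGAE is not given in this dictionary, and the function name is hypothetical.

```python
from datetime import date


def fit_extent(previous_end_date, current_start_date):
    """Days of gap (positive) or overlap (negative); zero means contiguous.

    Assumes contiguous means current_start_date is the day after
    previous_end_date.
    """
    return (current_start_date - previous_end_date).days - 1
```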
algae2218_adj_start_date The start date of the address period, correcting for any gaps which may have occurred between the current address period and the previous one.
algae2219_adj_end_date The end date of the address period, correcting for any overlaps which may have occurred between the current address period and the next one.
algae2220_days_changed The total number of days that were changed after gaps and overlaps were fixed. It is calculated as follows:

|start_date - adj_start_date| + |end_date - adj_end_date|.

Note that the days_changed values of two successive address periods may not necessarily describe different days. For example, if a(n) significantly overlaps with a(n+1) and a(n+1) significantly overlaps with a(n+2), then it is likely that the days_changed values for a(n) and a(n+1) will cover some of the same calendar days.
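
The formula can be written directly with date arithmetic. The function name in this sketch is hypothetical.

```python
from datetime import date


def days_changed(start_date, adj_start_date, end_date, adj_end_date):
    """|start_date - adj_start_date| + |end_date - adj_end_date| in days."""
    return (abs((start_date - adj_start_date).days)
            + abs((end_date - adj_end_date).days))
```

For example, moving a start date from 05-05-1996 back to 01-05-1996 and an end date from 15-11-1996 back to 10-11-1996 gives 4 + 5 = 9 days changed.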

algae2221_fit_type Describes how the temporal boundaries of an address period were changed as a result of correcting for gaps and overlaps. It can have the following values:
  • C: contiguous fit. Neither the start date nor the end date had to be adjusted in response to a gap or overlap problem with another address period.
  • D: the entire address period is subsumed by the address period that follows it. "D" marks the address period as a duplicate that has been deleted.
  • O: the end date was changed in response to an overlap between the current address period and the next address period.
  • G: the start date was changed in response to a gap between the current address period and the previous address period.
  • B: an address period had to be adjusted to fix both a gap and an overlap.
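
The five codes can be summarised as a small decision function. The function and argument names are hypothetical, and the choice to test for a subsumed period first is an assumption made for this sketch.

```python
def fit_type(is_subsumed, start_changed_for_gap, end_changed_for_overlap):
    """Return the fit type code for an address period.

    is_subsumed: the period was entirely subsumed by the one that follows.
    start_changed_for_gap: the start date was moved to close a gap.
    end_changed_for_overlap: the end date was moved to remove an overlap.
    """
    if is_subsumed:
        return "D"
    if start_changed_for_gap and end_changed_for_overlap:
        return "B"
    if start_changed_for_gap:
        return "G"
    if end_changed_for_overlap:
        return "O"
    return "C"
```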
algae2222_start_date_delta1 The lower limit of the period which describes the change in the start date. For example, suppose the start date for an address period a(n) was originally 05-05-1996 but was changed to 01-05-1996 to help close a gap with the previous address period a(n-1). Then the change (delta) in start date would be characterised by the period [01-05-1996, 04-05-1996] inclusively and 01-05-1996 would be algae2222_start_date_delta1 and 04-05-1996 would be algae2223_start_date_delta2.

Note that if these fields are null they indicate that the start date of the address period did not have to be changed in response to a gap with the previous address period.

algae2223_start_date_delta2 The upper limit of the period which describes the change in the start date. See the example for algae2222_start_date_delta1. Notice this value will either be null or the day before the original start date.
algae2224_end_date_delta1 The lower limit of the period which describes the change in the end date. For example, suppose the end date for an address period a(n) was originally 15-11-1994 but was changed to 10-11-1994 so that it would not overlap with the start date of the following address period a(n+1). Then the change (delta) in end date would be characterised by the period [11-11-1994, 15-11-1994] inclusively. 11-11-1994 would be algae2224_end_date_delta1 and 15-11-1994 would be algae2225_end_date_delta2. Notice that the value for algae2224_end_date_delta1 will either be null or the start date of the following address period that was being overlapped.

Note that if these fields are null they indicate that the end date of the address period did not have to be changed in response to an overlap with the next address period.

algae2225_end_date_delta2 The upper limit of the period which describes the change in end date. See the example for algae2224_end_date_delta1.
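
The two worked examples for the delta fields can be reproduced with the sketch below. The function names are hypothetical, and null values are represented as None.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def start_date_delta(start_date, adj_start_date):
    """Return (delta1, delta2) for a start date moved earlier to close a gap.

    Returns (None, None) when the start date was not moved earlier.
    """
    if adj_start_date >= start_date:
        return None, None
    return adj_start_date, start_date - ONE_DAY


def end_date_delta(end_date, adj_end_date):
    """Return (delta1, delta2) for an end date moved earlier to fix an overlap.

    Returns (None, None) when the end date was not moved earlier.
    """
    if adj_end_date >= end_date:
        return None, None
    return adj_end_date + ONE_DAY, end_date
```

With the dates from the examples, start_date_delta(date(1996, 5, 5), date(1996, 5, 1)) describes the period [01-05-1996, 04-05-1996] and end_date_delta(date(1994, 11, 15), date(1994, 11, 10)) describes the period [11-11-1994, 15-11-1994].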
algae2226_previous_geocode The geocode appearing in the previous address period. Note that determining the previous geocode ignores all address periods that have had bad geocodes 'fixed' (See entry for algae2216_is_fixed_inv_geocode). previous_geocode and next_geocode are used to assess the difference between assigned and opportunity cost exposures.
algae2227_next_geocode The geocode appearing in the next address period. Determining the next geocode ignores address periods which have bad geocodes that have been fixed (See entry for algae2226_previous_geocode).
algae2228_fin_adj_start_date This is the final value of the start date. Whereas algae2218_adj_start_date will be adjusted in response to fixing gaps, this field may consider other changes that are not considered part of a correction. Specifically, the start date may be altered if it is a study member's first address period and the start date needs to be changed so that it covers all the time from the date of conception through to the date when he or she was first enrolled in the study.

Moving the start date back to the conception date is not treated as a change that warrants assessing the kind of exposure measurement error that is associated with gaps and overlaps. Instead, we rely on sensitivity variables to indicate how certain we can be that the study members occupied their first address when they were being conceived.

algae2229_imputed_first_start Indicates 'Y' or 'N' for whether this is a first address period that has been altered to cover the period from conception until study member enrolment. In the case of a birth cohort that recruited already pregnant mothers, we would expect the answer to be 'Y'.
algae2230_fin_adj_end_date This is the final value of the end date. The value may differ from algae2219_adj_end_date if this is the last address period for a study member and the end date had to be moved so that it covered up until the last day in the exposure period. The context of this variable is similar to algae2228_fin_adj_start_date.
algae2231_imputed_last_end 'Y' if this is the last address period for a study member and the end date of that period had to be changed to be at least the last day in his or her exposure period. Otherwise 'N' for No.
algae2232_start_date_days_from_concep Measures the total number of days between the algae2228_fin_adj_start_date and the person's date of conception. The value is used to construct a data set that captures geographical covariates for every move made by a study member during their exposure period. Moves can be ordered not based on an explicit date but on a date relative to conception.
algae2233_is_within_exp Indicates 'Y' or 'N' whether this address period falls within the study member's exposure period. For example, if their last day in their exposure period is on 01-05-1994, an address period covering the dates [11-10-2003, 22-12-2005] would not fall within the person's exposure period.
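
The last two variables can be illustrated with a short sketch. It treats "within" as any overlap between the address period and the exposure period, following the description given under Context of Variables. The function names are hypothetical, and the exposure start date used in the usage example is invented for illustration.

```python
from datetime import date


def is_within_exposure_period(addr_start, addr_end, exp_start, exp_end):
    """Return 'Y' if [addr_start, addr_end] overlaps [exp_start, exp_end]."""
    overlaps = addr_start <= exp_end and addr_end >= exp_start
    return "Y" if overlaps else "N"


def start_date_days_from_conception(fin_adj_start_date, conception_date):
    """Days between the final adjusted start date and the conception date."""
    return (fin_adj_start_date - conception_date).days


# Hypothetical exposure period ending on 01-05-1994, as in the example above.
flag = is_within_exposure_period(
    date(2003, 10, 11), date(2005, 12, 22),
    date(1992, 8, 1), date(1994, 5, 1))
```

Here flag is 'N' because the address period begins after the exposure period has ended.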