ALGAE Protocol: An automated protocol for assigning early life exposures to longitudinal cohort studies

Summary of Design Decisions

by Kevin Garwood

ALGAE is the product of over 100 separate design decisions that were drawn from four main areas of requirements:
  • domain science decisions
  • business decisions
  • data science decisions
  • software engineering decisions.

The design decisions are summarised in the tables below. Additional information including a detailed discussion of these decisions is provided in the document.

Domain Science Decisions

Decision Description
Domain Science-1 The protocol will use residential address histories, historically modelled pollutant concentrations at residential addresses and information about life stages to assess life stage exposures in members of the ALSPAC cohort. The life stage exposures will then be linked with respiratory health outcomes that are measured at years 8 and 15.
Domain Science-2 The protocol will support both an early life and a later life analysis. The early life analysis will assess exposures in: Trimester 1 (T1), Trimester 2 (T2), Trimester 3 (T3), and an early life period that spans the first year of life (EL). The later life analysis will assess exposures for years of life beginning in year 1 and ending in year 15 (YR1...YR15).
Domain Science-3 Date of Conception will be calculated as: date of birth - (7 x gestation age at birth measured in weeks) - 1 day.
Domain Science-4 The temporal boundaries of early life stages will be calculated as follows: T1 = [conception date, conception date + 91 days], T2 = [conception date + 92 days, conception date + 183 days], T3=[conception date + 184 days, birth date - 1 day], EL=[birth date, birth date + 1 year - 1 day].
Domain Science-5 In the early life analysis, the protocol will need a facility for correcting overlaps in life stage boundaries that may occur for prematurely born cohort members.
Domain Science-6 The boundaries of life stages in the later life analysis will be calculated as follows: YRn = [date of birth + n years, date of birth + (n+1) years - 1 day].
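As an illustrative sketch only (the protocol itself operates on database tables, not Python code), the boundary rules in Domain Science-3, 4 and 6 can be expressed as follows; the function names are hypothetical:

```python
from datetime import date, timedelta

def conception_date(birth_date, gestation_weeks):
    # Domain Science-3: conception = birth - (7 x gestation age in weeks) - 1 day
    return birth_date - timedelta(days=7 * gestation_weeks + 1)

def add_years(d, n):
    # Naive year arithmetic; a February 29th birth date would need special handling.
    return d.replace(year=d.year + n)

def early_life_stages(birth_date, gestation_weeks):
    # Domain Science-4: temporal boundaries of T1, T2, T3 and EL
    c = conception_date(birth_date, gestation_weeks)
    return {
        "T1": (c, c + timedelta(days=91)),
        "T2": (c + timedelta(days=92), c + timedelta(days=183)),
        "T3": (c + timedelta(days=184), birth_date - timedelta(days=1)),
        "EL": (birth_date, add_years(birth_date, 1) - timedelta(days=1)),
    }

def later_life_stage(birth_date, n):
    # Domain Science-6: YRn = [birth + n years, birth + (n+1) years - 1 day]
    return (add_years(birth_date, n), add_years(birth_date, n + 1) - timedelta(days=1))
```

Note that, as Domain Science-5 observes, T3 can overlap with earlier trimesters for prematurely born cohort members, which is why the protocol needs a correction facility.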
Domain Science-7 The protocol will assess exposures for various pollutants.
Domain Science-8 Assume that the exposure will exhibit high spatial and temporal variation during the exposure time frame.
Domain Science-9 Assume that the susceptibility of cohort members to the spatial and temporal exposure variability may depend on the life stages being used in assessment.
Domain Science-10 The early life analysis will use daily exposure values and the later life analysis will use annual average exposure values.
Domain Science-11 The protocol will aggregate exposures by life-stage, and express the results using cumulative, average and median exposures.

Business Decisions

Decision Description
Business-1 The protocol must be able to omit from analysis any cohort members who have withdrawn consent to participate in the activity, at any time during the activity.
Business-2 The intermediate data sets that are made from linking cohort data and the results produced by analysis can only be created on-site at the cohort's premises, using a secure and limited computing environment.
Business-3 Scientists could only take aggregated exposure results off-site. They would not be allowed to take away results which contained birth dates or which linked residential addresses to specific individuals.
Business-4 The protocol assumes that exposure data is generated on the premises of the collaborating research institution and not those of the cohort facilities.
Business-5 Assume that exposure data for both early and later life analyses will be generated based on a set of all residential addresses that have ever been occupied by any study member for any period that falls within either their early life or later life analysis.
Business-6 An exposure record will comprise a geocode, a date-of-year field, and concentration values for each pollutant. In the early life analysis, the date-of-year field will identify a daily exposure record. In the later life analysis, the same field will represent an annual exposure value and will be January 1st of a year involved in the later life analysis.
Business-7 The protocol must be fully automated.
Business-8 The protocol must support being tested by a suite of automated test cases that can demonstrate that it is behaving correctly.
Business-9 The protocol would generate a reusable exposure data set that could be linked to the individual and health information stored by the cohort study.
Business-10 Because it produces a data set that will support multiple environmental health studies, the need to support data science and software engineering requirements is more important than it would be if the protocol supported only a single study.
Business-11 Accept that different cohort projects will express different levels of interest in the protocol. Rather than expecting the utility of a generic protocol to be measured by the number of projects that download and use it, anticipate tiered levels of buy-in from other groups. These are: people who want to reuse its design to make their own software; people who want to reuse the software tool with a different cohort; people who want to adapt the code to suit slightly different use cases; and people who want to borrow parts of the code for other projects.
Business-12 The protocol documentation must minimally include three things for prospective users in other projects: an explanation of properties for the input data sets; instructions for installing and running the protocol; and a data dictionary that describes the meaning and context of result variables.
Business-13 Accommodate input values that may have cohort-specific representations but will have the same meaning.
Business-14 Accommodate different ways of representing input values for yes-no fields. For example, allow ALGAE to understand that "Yes", "yes", "true", and "1" are all ways of representing "Y" for Yes.
Business-15 Accommodate different ways of representing an empty field value. They may include the empty string "", a null value, some form of "NULL", and "NULLIF#".
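A minimal sketch of the standardisation implied by Business-14 and Business-15; the value sets and the function name are illustrative, not the protocol's actual implementation, which performs this work in the database:

```python
# Hypothetical cohort-specific representations mapped onto standard forms.
YES_VALUES = {"y", "yes", "true", "1"}
NO_VALUES = {"n", "no", "false", "0"}
EMPTY_VALUES = {"", "null", "nullif#"}

def standardise_yes_no(raw):
    """Map a cohort-specific yes/no representation onto 'Y'/'N', or None if empty."""
    if raw is None:
        return None
    value = raw.strip().lower()
    if value in EMPTY_VALUES:
        return None
    if value in YES_VALUES:
        return "Y"
    if value in NO_VALUES:
        return "N"
    raise ValueError(f"Unrecognised yes/no value: {raw!r}")
```

Consolidating this logic in one place reflects Software Engineering-3: once the standardisation functions are well tested, the rest of the protocol and its test cases can rely on the standard forms alone.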
Business-16 Document how code could be changed to support different life stages, or different temporal boundaries for existing life stages.
Business-17 Document any code that may depend on the name of a specific life stage (eg: EL, which normally contains the birth date as a start date).
Business-18 Document how other projects could change the pollutants that are used for the study.
Business-19 The protocol will not support features for anonymising or pseudonymising fields. Projects that want to de-identify these variables should do so as part of the process they follow for creating the input data files or they should de-identify variables after the results have been generated.
Business-20 The protocol will not attach any meaning to person_id or geocode fields, and will just treat them as identifiers that link tables together.
Business-21 In order to support data destruction provisions in information governance policies, the protocol will write all of its result tables to CSV files so that projects can safely delete the entire database which produced them. Isolating results in this way allows projects to minimise the risk of holding sensitive data in the intermediate data sets produced by the protocol.
Business-22 The decision about which fields the protocol will produce is based on which fields would be useful for the scientific use case, not on the information governance policies of any one cohort. It is possible that in a given project, ALGAE produces more data than what is necessary and sufficient for the research purpose. It is also possible that it will generate more data than a given cohort will allow to be taken off-site.
Business-23 The ALGAE protocol should be open sourced through a standard OSI license.

Data Science Decisions

Decision Description
Data Science-1 The input CSV file that holds data about geocodes will have a yes/no flag called "is_valid". If the flag value is Y, then the geocode is considered valid and may be used in exposure calculations. If the flag value is N, then the geocode is considered not valid, and may cause some study members to be excluded from exposure assessments. The criteria for validity are defined by individual cohorts.
Data Science-2 The input CSV file that holds data about geocodes will have a free-text field called "version". Cohorts can use the field to describe an iteration of geocoding that uses a particular version of software, was done by a particular method or was done by particular people.
Data Science-3 The input CSV file that holds study member data will have a yes/no field called "at_1st_address_conception". If the flag value is "Y", then the mother of the cohort member lived at this address at the date of conception.
Data Science-4 The input CSV file that holds cohort member data will have a yes/no field called "absent_during_exp_period". If the flag value is "Y", then researchers can be confident that a cohort member spent a significant amount of their exposure time frame living at a location that was not specified in his/her residential address history.
Data Science-5 Cohort studies will need to comply with naming conventions for both the paths of input files and for the variable names within those files.
Data Science-6 Preserve the provenance of data in the original data files by storing unaltered data in original data tables that are not touched by the rest of the protocol. Let the protocol apply its data transformations on staging tables, which will contain versions of the original data that have standardised the representation of field values.
Data Science-7 In order to make it easier for researchers to identify errors, favour expressing complex operations through a sequence of simple temporary tables rather than by using a single, complex monolithic query to do the same work.
Data Science-8 Save changes using modified copies of table fields rather than changing original ones.
Data Science-9 Capture the original row number of records from input files in corresponding staging tables. Where it is possible to do so, promote the original row number fields through temporary tables that may use them.
Data Science-10 Wherever possible, do not delete table rows. Instead, flag them as being deleted. This approach allows protocol users to inspect the nature of deleted data identified for each study member.
Data Science-11 Promote ignored rows through successive temporary tables, but use deletion flags to exclude them from consideration in calculations.
Data Science-12 Assume that the date that a study member's current address was updated in the contacts database is the date when he or she began living at a new address. This value is the start date of an address period.
Data Science-13 Ensure that each address period has non-blank values for its start date and end date fields. Impute missing start dates with the conception date of the study member. Impute missing end dates with the current date.
Data Science-14 Order address periods in ascending order, first by person_id, second by start_date and third by duration.
Data Science-15 Assume that in the residential address histories, start dates are stronger signals than end dates. Assume that study members were likely already living at the geocoded location in an address period at or before its start date.
Data Science-16 In the process of fixing temporal gaps and overlaps between successive address periods, act to preserve the start date of the current address period over the end date of the previous one.
Data Science-17 Let a_n and a_(n+1) be two successive address periods. If a gap exists between them, then let a_(n+1).start_date = a_n.end_date + 1 day. If an overlap exists between them, let a_n.end_date = a_(n+1).start_date - 1 day.
Data Science-18 Let a_n and a_(n+1) be two successive address periods. a_n is considered a duplicate if it is temporally subsumed by a_(n+1). If this is true, then a_n will be flagged as a deleted record.
Data Science-19 Ensure the address history for any study member spans the entire exposure time frame (eg: [conception date, birth date + 1 yr - 1 day]). Ensure that the start date of the earliest address period is adjusted to be no later than the conception date. Ensure that the end date of the latest address period is adjusted to be no earlier than the last day of the exposure time frame.
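The gap and overlap corrections of Data Science-17 can be sketched as follows; the data representation is hypothetical, since the protocol performs these fixes on database tables:

```python
from datetime import date, timedelta

def fix_gaps_and_overlaps(periods):
    """Sketch of Data Science-17. 'periods' is a list of dicts with 'start'
    and 'end' dates, already sorted as in Data Science-14. In overlaps, the
    start date of the current period is preserved over the end date of the
    previous one (Data Science-16)."""
    for prev, curr in zip(periods, periods[1:]):
        if prev["end"] + timedelta(days=1) < curr["start"]:
            # Gap: let a_(n+1).start_date = a_n.end_date + 1 day
            curr["start"] = prev["end"] + timedelta(days=1)
        elif prev["end"] >= curr["start"]:
            # Overlap: let a_n.end_date = a_(n+1).start_date - 1 day
            prev["end"] = curr["start"] - timedelta(days=1)
    return periods
```

Each day touched by such a correction is a contention day in the sense of Data Science-32, because the study member could plausibly be placed at either of two locations on that day.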
Data Science-20 If a study member has at least one address period which has a bad geocode and which spans part of their exposure time frame, he or she will be excluded from further exposure assessments.
Data Science-21 An address period is fixable if it meets the following three criteria:
  • It has a bad geocode
  • It is immediately followed by an address period which has a valid geocode
  • It has less than 25% overlap with any life stage
Data Science-22 Let a_n be a fixable address period. The protocol will attempt to exclude a_n from further calculations by subsuming it within a_(n+1): a_(n+1).start_date = a_n.start_date.
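The fixable-period test and the subsumption step can be sketched together; this assumes the 25% criterion is measured against each life stage's duration, which Data Science-21 does not state explicitly, and all field names are hypothetical:

```python
from datetime import date

def try_fix_bad_geocode(a_n, a_next, life_stages):
    """Sketch of Data Science-21/22. a_n and a_next are successive address
    periods (dicts with 'start', 'end' and 'is_valid_geocode'); life_stages
    maps stage names to (start, end) date tuples."""
    def overlap_days(period, stage):
        start = max(period["start"], stage[0])
        end = min(period["end"], stage[1])
        return max(0, (end - start).days + 1)

    fixable = (
        not a_n["is_valid_geocode"]                  # criterion 1: bad geocode
        and a_next["is_valid_geocode"]               # criterion 2: next is valid
        and all(                                     # criterion 3: < 25% overlap
            overlap_days(a_n, stage) < 0.25 * ((stage[1] - stage[0]).days + 1)
            for stage in life_stages.values()
        )
    )
    if fixable:
        # Data Science-22: subsume a_n within a_(n+1)
        a_next["start"] = a_n["start"]
    return fixable
```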
Data Science-23 Assess the data quality of exposure values for each pollutant independently of one another.
Data Science-24 For each pollutant, exposure results will be accompanied by day count variables that describe various mutually exclusive kinds of daily exposure values that were used in aggregation. These mutually exclusive categories include: invalid address days, out of bounds days, poor match days, missing exposure days and good match days. The sum of these counts will be the same value as the life stage duration.
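The invariant in Data Science-24 lends itself to a simple consistency check; the count names below are paraphrased from the decision and are not necessarily the protocol's exact variable names:

```python
def check_day_counts(counts, life_stage_duration_days):
    """Sketch of Data Science-24: the mutually exclusive day-count categories
    for a pollutant must sum exactly to the life stage duration."""
    total = (counts["invalid_address_days"]
             + counts["out_of_bounds_days"]
             + counts["poor_match_days"]
             + counts["missing_exposure_days"]
             + counts["good_match_days"])
    return total == life_stage_duration_days
```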
Data Science-25 The protocol will support a set of sensitivity variables that will allow researchers to quantify the influence of data quality attributes and data cleaning actions on subsets of exposure results. These variables will provide meta data about data transformation activities and will be provided as a result file for users.
Data Science-26 The sensitivity variables will include variables that allow researchers to isolate exposure results based on how confident they can be that the residential address histories can cover the exposure time frame of interest. These variables will borrow the at_1st_addr_conception and absent_during_exp_period variables from the original_study_member_data table.
Data Science-27 The sensitivity variables will include ones that can help researchers determine whether low exposure values for Trimester 3 are due to premature birth or to areas that exhibit low levels of pollution. These variables will borrow the estimated_gestation_age field from the original_study_member_data table, and include a yes/no flag is_gestation_age_imputed.
Data Science-28 The sensitivity variables will include ones that can help researchers identify study members who were affected by bad geocodes. These variables will include the following totals measured throughout the whole exposure time frame: invalid_geocodes, fixed_geocodes, out_of_bounds_geocodes, and has_bad_geocodes_within_time_frame.
Data Science-29 The sensitivity variables will include ones that can help researchers assess the extent to which the field values of address periods were imputed for each study member. These variables will include the following totals measured across the whole exposure time frame: imp_blank_start_dates, imp_blank_end_dates, imp_blank_both_dates.
Data Science-30 The sensitivity variables will include ones that can help researchers assess the kinds of data cleaning changes that were made in order for the protocol to make a temporally contiguous record of movements that cover the exposure period. These variables will include the following totals, measured across the whole exposure time frame: total_addr_periods, gaps, gap_and_overlap_same_period, over_laps, deletions.
Data Science-31 The sensitivity variables will include days_changed, which is the total number of days that were adjusted across all address periods overlapping with the exposure time frame. The total counts the number of days that each address period was shifted in data cleaning. In cases of successive overlaps, some days may be counted more than once.
Data Science-32 The sensitivity variables will include total_contention_days, which measures the total number of unique days in an exposure time frame that were involved with a gap or an overlap. It is called a contention day because on such a day, a study member could be placed at either of two locations: the location that data cleaning assigned in a cleaned set of address periods, or the location that may have been allowed in uncleaned address periods.
Data Science-33 The sensitivity variables will include missing_exposure_days, which gives the total number of days in the exposure time frame that are not associated with exposure values.
Data Science-34 Assume that on a given day, placing a study member at one location when they actually lived at another could present an exposure misclassification error that was significant for studies that used the life stage exposure results.
Data Science-35 Suppose the protocol cleans two successive address periods an and an+1, where a gap or overlap exists between them. Let the exposure at the location assigned by data cleaning to correct a given contention day be called the assigned exposure. Let the exposure at the other location the study member could have occupied on that day be called the opportunity cost exposure. The exposure classification error for that day of contention will be calculated as |assigned exposure - opportunity cost exposure|.
Data Science-36 Daily values for exposure measurement error will be aggregated in the same way as exposure values. For example, error values will be expressed for each pollutant, for each life stage, for each study member. Error values will be reported using average, sum and median aggregation operations.
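The error calculation of Data Science-35 and the aggregation of Data Science-36 can be sketched as follows; the function names are illustrative:

```python
from statistics import median

def daily_exposure_error(assigned_exposure, opportunity_cost_exposure):
    # Data Science-35: |assigned exposure - opportunity cost exposure|
    return abs(assigned_exposure - opportunity_cost_exposure)

def aggregate_errors(daily_errors):
    # Data Science-36: aggregate error values the same way exposure values
    # are aggregated - cumulative, average and median
    return {
        "cum": sum(daily_errors),
        "avg": sum(daily_errors) / len(daily_errors),
        "med": median(daily_errors),
    }
```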
Data Science-37 The cleaned mobility assessment considers exposure contributions from all cleaned address periods that overlap with the exposure time frame. Daily exposure measurement errors are calculated to correspond with daily exposures and they are aggregated in the same ways.
Data Science-38 The uncleaned mobility assessment considers exposure contributions from all cleaned address periods that overlap with the exposure time frame. However, it omits any days that spanned a gap or overlap error in a study member's residential address history. No exposure measurement errors are assessed in this approach.
Data Science-39 The life stage mobility assessment uses the location each study member occupied on the first day of a life stage to represent their location for that whole life stage. It ignores contributions from all other locations. No exposure measurement errors are assessed in this approach.
Data Science-40 The birth address assessment uses the birth addresses of study members to represent their locations for the entire early life exposure time frame (eg: conception until the last day of the first year of life). No exposure measurement error is assessed in this approach.
Data Science-41 The protocol will compare pairs of exposure assessment methods and calculate percent error values for corresponding exposure results.
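One plausible form of the Data Science-41 comparison is sketched below; the decision does not specify the exact formula or which assessment method serves as the reference, so this is an assumption:

```python
def percent_error(reference_result, comparison_result):
    # Hypothetical percent-error calculation between corresponding exposure
    # results from two assessment methods (Data Science-41).
    return 100.0 * abs(comparison_result - reference_result) / reference_result
```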
Data Science-42 The protocol will produce CSV result files whose names follow standard naming conventions. These files will be generated within a directory that has predictable structure.
Data Science-43 Variables that are produced by the protocol will have names which are prefixed by "algae[NNNN]_". The four digit number is designed to help uniquely identify the context which produced variables.
Data Science-44 The protocol will come with a comprehensive data dictionary that will define the meaning of result variables.

Software Engineering Decisions

Decision Description
Software Engineering-1 The protocol will come with clear, concise documentation that project staff can use to transform their data into the input tables that are expected by the software.
Software Engineering-2 The protocol will create a staging table for each table of original input data. The staging table will use a standardised way for representing yes/no values and empty field values. The rest of the protocol will operate on the staging tables and not the input tables.
Software Engineering-3 Consolidate code for standardising input fields within well-tested database functions. Later on in testing, assume standardisation functions will work and then rely on just using standard forms to express input data for test cases.
Software Engineering-4 When it is appropriate for the scientific use case, impute blank values. For example, blank gestation age values will be imputed with a default value for gestation age in weeks. Imputing field values helps simplify code that processes them.
Software Engineering-5 Some missing value errors will be important to users but will have no effect on the protocol's algorithms (eg: absent_during_exp_period and at_1st_addr_conception). Reject rather than ignore these errors in order to help users fix their input data as soon as possible.
Software Engineering-6 Design the protocol to fail fast: if it is going to fail because of an error, then make it fail as soon as possible.
Software Engineering-7 Identify duplicate key errors by applying a database constraint which tries to use a field as a primary key.
Software Engineering-8 Identify required field values by applying a database constraint which tries to forbid a field from having empty field values.
Software Engineering-9 It is more important that the program performs correctly than it is that the program performs quickly.
Software Engineering-10 Favour expressing complex operations through a sequence of simple temporary tables rather than by using a single, complex monolithic query to do the same work.
Software Engineering-11 Where it is important to preserve rows from one temporary table to the next, rely on LEFT JOIN operations to link the data applicable to all study members with the new data that may only be applicable to some.
Software Engineering-12 Consider a sequence of successive temporary tables T1 ... Tn that are meant to build up a growing collection of field values describing the way cohort data have been processed for each study member. The protocol will compare the keys of T1 and Tn to verify that both tables have the same number of keys and that all the keys in one are found in the other.
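The key-preservation check of Software Engineering-12 amounts to a set comparison; a sketch with a hypothetical function name:

```python
def check_key_preservation(first_table_keys, last_table_keys):
    """Sketch of Software Engineering-12: verify that the first and last
    temporary tables in a sequence hold exactly the same set of keys."""
    first, last = set(first_table_keys), set(last_table_keys)
    missing = first - last
    extra = last - first
    if missing or extra:
        raise AssertionError(
            f"Key mismatch: {len(missing)} keys lost, {len(extra)} keys gained")
```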
Software Engineering-13 The protocol will support a means of automatically comparing the actual results generated by the program with expected results that are derived through manual calculation.
Software Engineering-14 Each test case should only test one feature.
Software Engineering-15 Within a test suite, test data related to each study member should remain the same, except for data that relate to the features being tested.
Software Engineering-16 In order to make test data easier to understand, use the names of test scenarios as study member identifiers.
Software Engineering-18 Each test case should use the minimal amount of input data it needs to exercise a feature. The values should be simple enough to make them amenable to being used in the manual calculations that produce expected test results.
Software Engineering-19 For each pollutant at each geocode, assign stepped pollution values that are constant over time but which vary significantly between geocodes and between pollutants.
Software Engineering-20 For test cases that need to demonstrate variation between average and median life stage exposures, induce differences by adjusting the boundaries of address periods spent in different locations.
Software Engineering-21 Design stepped exposure generation functions so that the differences between successive geocodes and the differences between successive pollutants are unique.
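A hypothetical generator in the spirit of Software Engineering-19 and 21; the step sizes below are illustrative, chosen so that the geocode step and the pollutant step are distinct:

```python
def stepped_exposure(geocode_index, pollutant_index, base=1.0):
    """Stepped test-exposure value: constant over time, but each
    (geocode, pollutant) pair gets a distinct value, and the step between
    successive geocodes differs from the step between successive pollutants."""
    return base + 10.0 * geocode_index + 0.1 * pollutant_index
```

Because the values are constant over time, any variation between average and median life stage exposures in a test case must come from adjusting the boundaries of address periods, as Software Engineering-20 prescribes.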
Software Engineering-22 Testing will assume that ALGAE treats all pollutants in the same manner. Testing multiple pollutants will be limited to inspecting by eye that aggregated exposure values are in fact different from pollutant to pollutant.
Software Engineering-23 Limit the scope of testing by focusing test cases on only one pollutant.