Domain Science-1
|
The protocol will use residential address histories, historically modelled
pollutant concentrations at residential addresses, and information about life
stages to assess life stage exposures for members of the ALSPAC cohort. The
life stage exposures will then be linked with respiratory health outcomes
that are measured at years 8 and 15.
|
Domain Science-2
|
The protocol will support both an early life and a later life analysis. The
early life analysis will assess exposures in: Trimester 1 (T1),
Trimester 2 (T2), Trimester 3 (T3), and an early life
period that spans the first year of life (EL).
The later life analysis will assess exposures for years of life beginning in
year 1 and ending in year 15 (YR1...YR15).
|
Domain Science-3
|
Date of Conception will be calculated as: date of birth - (7 x gestation
age at birth measured in weeks) - 1 day.
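As a minimal sketch of this calculation (in Python, assuming gestation age is available as a whole number of weeks; the function name is illustrative, not part of the protocol):

```python
from datetime import date, timedelta

def conception_date(birth_date: date, gestation_weeks: int) -> date:
    """Date of conception = date of birth - (7 x gestation age in weeks) - 1 day."""
    return birth_date - timedelta(days=7 * gestation_weeks + 1)
```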
|
Domain Science-4
|
The temporal boundaries of early life stages will be calculated as follows:
T1 = [conception date, conception date + 91 days],
T2 = [conception date + 92 days, conception date + 183 days],
T3 = [conception date + 184 days, birth date - 1 day],
EL = [birth date, birth date + 1 year - 1 day].
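A sketch of these boundary calculations (Python; `add_years` is a helper introduced here to handle the "+ 1 year" arithmetic, including a February 29 birth date — the protocol may resolve that case differently):

```python
from datetime import date, timedelta

def add_years(d: date, n: int) -> date:
    try:
        return d.replace(year=d.year + n)
    except ValueError:  # Feb 29 in a non-leap target year; assumption: clamp to Feb 28
        return d.replace(year=d.year + n, day=28)

def early_life_stages(conception: date, birth: date) -> dict:
    """Inclusive [start, end] boundaries as defined in Domain Science-4."""
    day = timedelta(days=1)
    return {
        "T1": (conception, conception + timedelta(days=91)),
        "T2": (conception + timedelta(days=92), conception + timedelta(days=183)),
        "T3": (conception + timedelta(days=184), birth - day),
        "EL": (birth, add_years(birth, 1) - day),
    }
```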
|
Domain Science-5
|
In the early life analysis, the protocol will need a facility for
correcting overlaps in life stage boundaries that may occur for
prematurely born cohort members.
|
Domain Science-6
|
The boundaries of life stages in the later life analysis will be
calculated as follows: YRn = [date of birth + n years, date of
birth + (n+1) years - 1 day].
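A sketch of this boundary calculation (Python; `add_years` handles a February 29 birth date, with the clamp-to-Feb-28 behaviour being an assumption):

```python
from datetime import date, timedelta

def add_years(d: date, n: int) -> date:
    try:
        return d.replace(year=d.year + n)
    except ValueError:  # Feb 29 in a non-leap target year; assumption: clamp to Feb 28
        return d.replace(year=d.year + n, day=28)

def later_life_stage(birth: date, n: int):
    """YRn = [date of birth + n years, date of birth + (n+1) years - 1 day]."""
    return add_years(birth, n), add_years(birth, n + 1) - timedelta(days=1)
```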
|
Domain Science-7
|
The protocol will assess exposures for various pollutants.
|
Domain Science-8
|
Assume that the exposure will exhibit high spatial and temporal
variation during the exposure time frame.
|
Domain Science-9
|
Assume that the susceptibility of cohort members to spatial and temporal
variability in exposure may depend on the life stages being used in the
assessment.
|
Domain Science-10
|
The early life analysis will use daily exposure values and the later
life analysis will use annual average exposure values.
|
Domain Science-11
|
The protocol will aggregate exposures by life stage, and express the
results using cumulative, average, and median exposures.
|
Business-1
|
The protocol must be able to omit from analysis any cohort members who have
withdrawn consent to participate in the activity, at any time during the
activity.
|
Business-2
|
The intermediate data sets that are made from linking cohort data and the
results produced by analysis can only be created on-site at the cohort's
premises, using a secure and limited computing environment.
|
Business-3
|
Scientists may only take aggregated exposure results off-site. They will
not be allowed to take away results that contain birth dates or that
link residential addresses to specific individuals.
|
Business-4
|
The protocol assumes that exposure data is generated on the premises of
the collaborating research institution and not those of the cohort facilities.
|
Business-5
|
Assume that exposure data for both early and later life analyses will be generated
based on a set of all residential addresses that have ever been occupied by any
study member for any period that falls within either their early life or later life
analysis.
|
Business-6
|
An exposure record will comprise a geocode, a date of year, and concentration values
for each pollutant. In the early life analysis, the date of year will
identify a daily exposure record. In the later life analysis, the same field will
represent an annual exposure value and will be January 1st of a year that
falls within the later life analysis.
|
Business-7
|
The protocol must be fully automated.
|
Business-8
|
The protocol must support being tested by a suite of automated test cases that
can demonstrate that it is behaving correctly.
|
Business-9
|
The protocol would generate a reusable exposure data set that could be
linked to the individual and health information stored by the cohort study.
|
Business-10
|
Because the protocol produces a data set that will support multiple environmental
health studies, its data science and software engineering requirements carry
more weight than they would if the protocol supported only a single study.
|
Business-11
|
Accept that different cohort projects will express different levels of
interest in the protocol. Rather than expecting the utility of a generic
protocol to be measured by the number of projects that download and use it,
anticipate tiered levels of buy-in from other groups. These are: people
who want to reuse its design to make their own software; people who want to
reuse the software tool with a different cohort; people who want to adapt
the code to suit slightly different use cases; and people who
want to borrow parts of the code for other projects.
|
Business-12
|
The protocol documentation must minimally include three things for
prospective users in other projects: an explanation of properties for the
input data sets; instructions for installing and running the protocol; and
a data dictionary that describes the meaning and context of result variables.
|
Business-13
|
Accommodate input values that may have cohort-specific representations but will
have the same meaning.
|
Business-14
|
Accommodate different ways of representing input values for yes-no fields. For
example, allow ALGAE to understand that "Yes", "yes", "true", and "1" are all
ways of representing "Y" for Yes.
|
Business-15
|
Accommodate different ways of representing an empty field value. These may include
the empty string "", a null value, some form of "NULL", and "NULLIF#".
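The standardisation described in Business-14 and Business-15 might be sketched as follows (the function name and the exact sets of recognised spellings are assumptions for illustration, not part of the protocol):

```python
YES = {"y", "yes", "true", "1"}
NO = {"n", "no", "false", "0"}
EMPTY = {"", "null", "nullif#"}  # assumed spellings of "empty" field values

def standardise_yes_no(value):
    """Return canonical 'Y' or 'N', None for an empty value,
    and raise on anything unrecognised (fail fast)."""
    if value is None:
        return None
    v = str(value).strip().lower()
    if v in EMPTY:
        return None
    if v in YES:
        return "Y"
    if v in NO:
        return "N"
    raise ValueError(f"unrecognised yes/no value: {value!r}")
```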
|
Business-16
|
Document how code could be changed to support different life stages, or different temporal
boundaries for existing life stages.
|
Business-17
|
Document any code that may depend on the name of a specific life stage (eg: EL,
which normally has the birth date as its start date).
|
Business-18
|
Document how other projects could change the pollutants that are used for the study.
|
Business-19
|
The protocol will not support features for anonymising or pseudonymising fields.
Projects that want to de-identify these variables should do so as part of the
process they follow for creating the input data files or they should de-identify
variables after the results have been generated.
|
Business-20
|
The protocol will not attach any meaning to person_id or geocode
fields, and will just treat them as identifiers that link tables together.
|
Business-21
|
In order to support data destruction provisions in information governance policies, the protocol
will write all of its result tables to CSV files so that projects can safely delete the entire
database which produced them. Isolating results in this way allows projects to minimise the risk
of holding sensitive data in the intermediate data sets produced by the protocol.
|
Business-22
|
The decision about which fields the protocol will produce are based on what fields would be useful
for the scientific use case and are not based on the information governance policies of any one
cohort. It is possible that in a given project, ALGAE produces more data than what is necessary
and sufficient for the research purpose. It is also possible that it will generate more data
than a given cohort will allow to be taken off-site.
|
Business-23
|
The ALGAE protocol should be open sourced through a standard OSI license.
|
Data Science-1
|
The input CSV file that holds data about geocodes will have a yes/no flag
called "is_valid". If the flag value is Y, then the geocode is considered
valid and may be used in exposure calculations. If the flag value is N,
then the geocode is considered not valid, and may cause some study members
to be excluded from exposure assessments. The criteria for validity are
defined by individual cohorts.
|
Data Science-2
|
The input CSV file that holds data about geocodes will have a free-text field called "version".
Cohorts can use the field to describe an iteration of geocoding that uses a particular version
of software, was done by a particular method or was done by particular people.
|
Data Science-3
|
The input CSV file that holds study member data will have a yes/no field called
"at_1st_address_conception". If the flag value is "Y", then the
mother of the cohort member lived at this address at the date of conception.
|
Data Science-4
|
The input CSV file that holds cohort member data will have a yes/no field called
"absent_during_exp_period". If the flag value is "Y", then researchers can be confident
that a cohort member spent a significant amount of their exposure time frame living at
a location that was not specified in his/her residential address history.
|
Data Science-5
|
Cohort studies will need to comply with naming conventions for both the paths of
input files and for the variable names within those files.
|
Data Science-6
|
Preserve the provenance of data in the original data files by storing unaltered data in original
data tables that are not touched by the rest of the protocol. Let the protocol apply its data
transformations on staging tables, which will contain versions of the original data that have
standardised the representation of field values.
|
Data Science-7
|
In order to make it easier for researchers to identify errors, favour expressing complex operations
through a sequence of simple temporary tables rather than by using a single, complex monolithic
query to do the same work.
|
Data Science-8
|
Save changes using modified copies of table fields rather than changing original ones.
|
Data Science-9
|
Capture the original row number of records from input files in corresponding staging tables.
Where it is possible to do so, promote the original row number fields through temporary
tables that may use them.
|
Data Science-10
|
Wherever possible, do not delete table rows. Instead, flag them as being deleted. This
approach allows protocol users to inspect the nature of deleted data identified for
each study member.
|
Data Science-11
|
Promote ignored rows through successive temporary tables but use deletion flags to
exclude them from calculations.
|
Data Science-12
|
Assume that the date that a study member's current address was updated in the contacts
database is the date when he or she began living at a new address. This value is the
start date of an address period.
|
Data Science-13
|
Ensure that each address period has non-blank values for its start date and end date
fields. Impute missing start dates with the conception date of the study member.
Impute missing end dates with the current date.
|
Data Science-14
|
Order address periods in ascending order, first by person_id,
second by start_date, and third by duration.
|
Data Science-15
|
Assume that in the residential address histories, start dates are stronger signals
than end dates. Assume that study members were likely already living at the geocoded
location in an address period at or before its start date.
|
Data Science-16
|
In the process of fixing temporal gaps and overlaps between successive address periods,
act to preserve the start date of the current address period over the end date of the
previous one.
|
Data Science-17
|
Let a(n) and a(n+1) be two successive address periods. If a gap exists between them, then
let a(n+1).start_date = a(n).end_date + 1 day. If an overlap exists between them, let
a(n).end_date = a(n+1).start_date - 1 day.
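Assuming an address period is represented as a dict with inclusive "start" and "end" dates (a hypothetical representation for illustration), the Data Science-17 rule could be sketched as:

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

def fix_gap_or_overlap(a_n: dict, a_next: dict) -> None:
    """Apply the gap/overlap rules to successive address periods a(n), a(n+1),
    preserving the start date of the current period over the end date of the
    previous one (Data Science-16)."""
    if a_next["start"] > a_n["end"] + ONE_DAY:      # gap: pull a(n+1) back
        a_next["start"] = a_n["end"] + ONE_DAY
    elif a_next["start"] <= a_n["end"]:             # overlap: trim a(n)
        a_n["end"] = a_next["start"] - ONE_DAY
```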
|
Data Science-18
|
Let a(n) and a(n+1) be two successive address periods. a(n) is considered
a duplicate if it is temporally subsumed by a(n+1). If this is true, then a(n) will be flagged
as a deleted record.
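With the same assumed dict representation of an address period, the subsumption test is a pair of inclusive comparisons:

```python
from datetime import date

def is_subsumed_duplicate(a_n: dict, a_next: dict) -> bool:
    """True if a(n+1) temporally subsumes a(n), in which case a(n) should be
    flagged as deleted rather than removed (Data Science-10)."""
    return a_next["start"] <= a_n["start"] and a_n["end"] <= a_next["end"]
```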
|
Data Science-19
|
Ensure the address history for any study member spans the entire exposure time frame (eg:
[conception_date, birth date + 1 yr - 1 day]). Ensure that the start date of
the earliest address period is adjusted to be no later than the conception date. Ensure
that the end date of the latest address period is adjusted to be no earlier than the
last day of the exposure time frame.
|
Data Science-20
|
If a study member has at least one address period which has a bad geocode and which spans part of
their exposure time frame, he or she will be excluded from further exposure assessments.
|
Data Science-21
|
An address period is fixable if it meets the following three criteria:
- It has a bad geocode
- It is immediately followed by an address period which has a valid geocode
- It has less than 25% overlap with any life stage
|
Data Science-22
|
Let a(n) be a fixable address period. The protocol will attempt to exclude
a(n) from further calculations by subsuming it within a(n+1): a(n+1).start_date = a(n).start_date.
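Combining Data Science-21 and -22, the fix might be sketched as follows (again an illustrative dict representation; the life stage overlap fraction is assumed to be computed elsewhere):

```python
def try_fix_bad_geocode(a_n: dict, a_next: dict,
                        max_life_stage_overlap: float) -> bool:
    """If a(n) meets the fixability criteria (bad geocode, valid successor,
    less than 25% overlap with any life stage), subsume it within a(n+1)
    and flag it as deleted rather than removing it (Data Science-10).
    max_life_stage_overlap is the largest fraction of any life stage that
    a(n) covers."""
    fixable = (not a_n["geocode_valid"]
               and a_next["geocode_valid"]
               and max_life_stage_overlap < 0.25)
    if fixable:
        a_next["start"] = a_n["start"]
        a_n["deleted"] = True
    return fixable
```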
|
Data Science-23
|
Assess the data quality of exposure values for each pollutant independently of one
another.
|
Data Science-24
|
For each pollutant, exposure results will be accompanied by day count variables
that describe various mutually exclusive kinds of daily exposure values that were
used in aggregation. These mutually exclusive categories include: invalid address
days, out of bounds days, poor match days, missing exposure days and good match days.
The sum of these counts will be the same value as the life stage duration.
|
Data Science-25
|
The protocol will support a set of sensitivity variables that will allow researchers to
quantify the influence of data quality attributes and data cleaning actions on subsets
of exposure results. These variables will provide metadata about data transformation
activities and will be delivered to users as a result file.
|
Data Science-26
|
The sensitivity variables will include variables that allow researchers to isolate
exposure results based on how confident they can be that the residential address
histories can cover the exposure time frame of interest. These variables will borrow
the at_1st_addr_conception and absent_during_exp_period
variables from the original_study_member_data table.
|
Data Science-27
|
The sensitivity variables will include ones that can help researchers determine whether low exposure
values for Trimester 3 are due to premature birth or to areas that exhibit low levels of pollution.
They will borrow the estimated_gestation_age field from the original_study_member_data table,
and include a yes/no flag, is_gestation_age_imputed.
|
Data Science-28
|
The sensitivity variables will include ones that can help researchers identify study members who
were affected by bad geocodes. These variables will include the following totals measured
throughout the whole exposure time frame: invalid_geocodes, fixed_geocodes,
out_of_bounds_geocodes, and has_bad_geocodes_within_time_frame.
|
Data Science-29
|
The sensitivity variables will include ones that can help researchers assess the extent to which
the field values of address periods were imputed for each study member. These variables will
include the following totals measured across the whole exposure time frame: imp_blank_start_dates,
imp_blank_end_dates, and imp_blank_both_dates.
|
Data Science-30
|
The sensitivity variables will include ones that can help researchers assess the kinds
of data cleaning changes that were made in order for the protocol to make a temporally
contiguous record of movements that cover the exposure period. These variables will
include the following totals, measured across the whole exposure time frame:
total_addr_periods, gaps, gap_and_overlap_same_period, over_laps, and deletions.
|
Data Science-31
|
The sensitivity variables will include days_changed, which is the total number of
days that were adjusted across all address periods that overlapped with the exposure
time frame. The total counts the number of days that each address period
was shifted in data cleaning. In cases of successive overlaps, some days may be
counted more than once.
|
Data Science-32
|
The sensitivity variables will include total_contention_days, which measures
the total number of unique days in an exposure time frame that were involved
with a gap or an overlap. It is called a contention day because on such a day,
a study member could be placed at either of two locations: the location that
data cleaning assigned in a cleaned set of address periods, or the location
that may have been allowed in uncleaned address periods.
|
Data Science-33
|
The sensitivity variables will include missing_exposure_days, which gives the total number
of days in the exposure time frame that are not associated with exposure values.
|
Data Science-34
|
Assume that on a given day, placing a study member at one location when they actually lived at another
could present an exposure misclassification error that was significant for studies that used the
life stage exposure results.
|
Data Science-35
|
Suppose the protocol cleans two successive address periods an and an+1, where a gap or overlap exists
between them. Let the exposure at the location assigned by data cleaning to
correct a given contention day be called the assigned exposure. Let the
exposure at the other location the study member could have occupied on that day be
called the opportunity cost exposure. The exposure misclassification error for that day
of contention will be calculated as
|assigned exposure - opportunity cost exposure|.
|
Data Science-36
|
Daily values for exposure measurement error will be aggregated in the same way as exposure values are aggregated.
For example, error values will be expressed for each pollutant for each life stage for
each study member. Error values will be reported using average, sum,
and median aggregation operations.
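The per-day error of Data Science-35 and its life-stage aggregation could be sketched as follows (assuming the assigned/opportunity-cost exposure pairs for the contention days have been assembled elsewhere):

```python
from statistics import median

def misclassification_errors(day_pairs):
    """day_pairs: (assigned exposure, opportunity cost exposure) for each
    contention day of one pollutant in one life stage. Returns the sum,
    average, and median aggregates."""
    errors = [abs(assigned - opportunity) for assigned, opportunity in day_pairs]
    return {
        "cum": sum(errors),
        "avg": sum(errors) / len(errors),
        "med": median(errors),
    }
```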
|
Data Science-37
|
The cleaned mobility assessment considers exposure contributions from all cleaned address periods that
overlap with the exposure time frame. Daily exposure measurement errors are calculated to
correspond with daily exposures and they are aggregated in the same ways.
|
Data Science-38
|
The uncleaned mobility assessment considers exposure contributions from all cleaned
address periods that overlap with the exposure time frame. However, it omits any days
that spanned a gap or overlap error in a study member's residential address history. No
exposure measurement errors are assessed in this approach.
|
Data Science-39
|
The life stage mobility assessment uses the location occupied by a study member on the first day
of each of their life stages to represent the location for that whole life stage. It
ignores contributions from all other locations. No exposure measurement errors are
assessed in this approach.
|
Data Science-40
|
The birth address assessment uses the birth addresses of study members to represent their
locations for the entire early life exposure time frame (eg: conception until the
last day of the first year of life). No exposure measurement error is assessed in
this approach.
|
Data Science-41
|
The protocol will compare pairs of exposure assessment methods and calculate
percent error values for corresponding exposure results.
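The protocol does not spell out the percent error formula here; a conventional definition, offered only as an assumption, would be:

```python
def percent_error(value: float, reference: float) -> float:
    """Assumed definition: 100 * |value - reference| / reference, where
    reference is the result from the baseline assessment method."""
    return 100.0 * abs(value - reference) / reference
```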
|
Data Science-42
|
The protocol will produce CSV result files whose names follow standard naming
conventions. These files will be generated within a directory that has
predictable structure.
|
Data Science-43
|
Variables that are produced by the protocol will have names which are
prefixed by "algae[NNNN]_". The four digit number is designed
to help uniquely identify the context that produced the variables.
|
Data Science-44
|
The protocol will come with a comprehensive data dictionary that
will define the meaning of result variables.
|
Software Engineering-1
|
The protocol will come with clear, concise documentation that project staff
can use to transform their data into the input tables that are expected
by the software.
|
Software Engineering-2
|
The protocol will create a staging table for each table of original
input data. The staging table will use a standardised way for representing
yes/no values and empty field values. The rest of the protocol will
operate on the staging tables and not the input tables.
|
Software Engineering-3
|
Consolidate code for standardising input fields within well-tested database
functions. In later testing, assume that the standardisation functions
work and express the input data for test cases using only the standard
forms.
|
Software Engineering-4
|
When it is appropriate for the scientific use case, impute blank values.
For example, blank gestation age values will be imputed with a default
value for gestation age in weeks. Imputing field values helps simplify
code that processes them.
|
Software Engineering-5
|
Some missing value errors will be important to users but will have
no effect on the protocol's algorithms (eg: absent_during_exp_period
and at_1st_addr_conception). Reject rather than ignore these errors
in order to help users fix their input data as soon as possible.
|
Software Engineering-6
|
Design the protocol to fail fast: if it is going to fail because of an error,
then make it fail as soon as possible.
|
Software Engineering-7
|
Identify duplicate key errors by applying a database constraint which tries
to use a field as a primary key.
|
Software Engineering-8
|
Identify required field values by applying a database constraint which tries
to forbid a field from having empty field values.
|
Software Engineering-9
|
It is more important that the program performs correctly than it is that
the program performs quickly.
|
Software Engineering-10
|
Favour expressing complex operations through a sequence of simple temporary
tables rather than by using a single, complex monolithic query to do the same work.
|
Software Engineering-11
|
Where it is important to preserve rows from one temporary table to the next, rely on
LEFT JOIN operations to link the data applicable to all study members with the
new data that may only be applicable to some.
|
Software Engineering-12
|
Consider a sequence of successive temporary tables marked by T1 ... Tn that are meant to
build up a growing collection of field values describing the way cohort data have
been processed for each study member. The protocol will compare the keys of
T1 and Tn to check that the two tables have the same number of keys and that
all the keys in one are found in the other.
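Outside the database, the same invariant can be checked with a simple set comparison (a sketch; the function name is illustrative):

```python
def check_keys_preserved(t1_keys, tn_keys) -> None:
    """Raise if the first and last temporary tables do not hold exactly the
    same set of study member keys (Software Engineering-12)."""
    t1, tn = set(t1_keys), set(tn_keys)
    if t1 != tn:
        raise AssertionError(f"lost: {sorted(t1 - tn)}, gained: {sorted(tn - t1)}")
```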
|
Software Engineering-13
|
The protocol will support a means of automatically comparing the actual results
generated by the program with expected results that are derived through manual calculation.
|
Software Engineering-14
|
Each test case should only test one feature.
|
Software Engineering-15
|
Within a test suite, test data related to each study member should remain the same,
except for data that relate to the features being tested.
|
Software Engineering-16
|
In order to make test data easier to understand, use the names of test
scenarios as study member identifiers.
|
Software Engineering-18
|
Each test case should use the minimal amount of input data it needs to
exercise a feature. The values should be simple enough to make them
amenable to being used in the manual calculations that produce expected
test results.
|
Software Engineering-19
|
For each pollutant at each geocode, assign stepped pollution values that are constant
over time but which vary significantly between geocodes and between pollutants.
|
Software Engineering-20
|
For test cases that need to demonstrate variation between average and
median life stage exposures, induce differences by adjusting the boundaries of
address periods spent in different locations.
|
Software Engineering-21
|
Design stepped exposure generation functions so that the differences between
successive geocodes and the differences between successive pollutants are unique.
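One scheme that satisfies Software Engineering-19 and -21 uses quadratic steps; the base value and step sizes below are illustrative assumptions, not the protocol's actual test values:

```python
def stepped_exposure(geocode_index: int, pollutant_index: int) -> float:
    """Constant over time for a given geocode/pollutant pair. Quadratic steps
    make the differences between successive geocodes (10, 30, 50, ...) and
    between successive pollutants (1, 3, 5, ...) unique and non-overlapping."""
    return 100.0 + 10.0 * geocode_index ** 2 + 1.0 * pollutant_index ** 2
```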
|
Software Engineering-22
|
Testing will assume that ALGAE treats all pollutants in the same manner.
Testing multiple pollutants will be limited to inspecting by eye that
aggregated exposure values are in fact different from pollutant to pollutant.
|
Software Engineering-23
|
Limit the scope of testing by focusing test cases on only one pollutant.
|