ALGAE Protocol: Creating the original_study_member_data table

An automated protocol for assigning early life exposures to longitudinal cohort studies

Data loading part 1: Prepare study member data

by Kevin Garwood


Purpose

The data in this table are used for two main tasks:
  • to calculate the time boundaries for life stages
  • to use the responses from cohort questionnaire variables to determine how certain we can be about assumptions we make about the residential address histories.

Please see Life stage calculations to understand how birth_date and estimated_gestation_age are used to calculate life stage data.

We need to have a field absent_during_exp_period which indicates whether study members spent significant amounts of time located at residential addresses that are not listed in their address histories. We also need a field at_1st_addr_conception to tell us whether study members were definitely at their first address during conception.

Location of Original Data File

You will need to create a file that has this name
original_study_member_data.csv
which must be located in:
early_life/input_data
or
later_life/input_data
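
As a small illustration of the naming rule above, the sketch below builds the expected file path. The helper name study_member_file and the base_dir argument (wherever your ALGAE working directory lives) are assumptions made for this example and are not part of the protocol.

```python
from pathlib import Path

# Directory and file names taken from the text above.
ANALYSIS_DIRS = ("early_life", "later_life")
FILE_NAME = "original_study_member_data.csv"


def study_member_file(base_dir, analysis):
    """Return the expected location of the study member CSV file.

    base_dir is a hypothetical root folder; analysis must be
    'early_life' or 'later_life'.
    """
    if analysis not in ANALYSIS_DIRS:
        raise ValueError(f"analysis must be one of {ANALYSIS_DIRS}")
    return Path(base_dir) / analysis / "input_data" / FILE_NAME
```

For example, study_member_file("my_cohort", "early_life") points at my_cohort/early_life/input_data/original_study_member_data.csv.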

Example Original Data File

See here.

Suggested Approach

Preparing study member data.

Step 1: Obtain person_id, birth date and gestation age at birth fields

The first task in preparing to use the ALGAE protocol is to assign each study member an identifier, which may be anonymised or pseudonymised depending on the preferences of the cohort. Birth date and gestation age at birth data are likely to come from birth records and should be part of any birth cohort's variables.

Step 2: Find questionnaire data related to establishing an address at conception

Many birth cohorts recruit pregnant mothers to enrol their child. The first address that is recorded is usually the address the mother specified at enrolment, which may not have been the location she lived at when the child was conceived. In many cases the address at enrolment will be the address at conception, but we want a flag to help us assess how confident we can be about this assumption.

The flag at_1st_addr_conception is 'Y' if the study member was definitely at their first recorded address at conception. Otherwise, the value must be 'N'.

Step 3: Find questionnaire data to detect locations used that are not in residential address histories

The ALGAE Protocol assumes that study members only occupied addresses that are listed in their residential address histories. However, we want a data quality flag to help us determine whether they spent any significant part of their exposure period living at addresses that are not part of their address histories.

Look for variables related to homelessness, hospitalisation, prison or visits outside the exposure area you would assume are not listed in the residential address histories. Focus specifically on variables that would be relevant during the exposure time frame.
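
One way to combine such variables into a single flag is sketched below. The indicator names are hypothetical stand-ins for whatever your cohort's questionnaires provide, and treating a missing response as "not absent" is an assumption you should review against your own data.

```python
def absent_during_exp_period(indicators):
    """Return 'Y' if any indicator is definitely True, otherwise 'N'.

    indicators maps a (hypothetical) variable name to True, False or None,
    where None stands for a missing response. Missing responses are
    treated as 'not absent' here, which is an assumption of this sketch.
    """
    return "Y" if any(value is True for value in indicators.values()) else "N"


# Hypothetical responses for one study member.
example = {
    "homeless_during_pregnancy": False,
    "long_hospital_stay": True,
    "lived_outside_exposure_area": None,
}
flag = absent_during_exp_period(example)
```

Keeping the indicators in a named mapping makes it easy to record which questionnaire variables contributed to the flag.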

General Advice

You may have to re-purpose a collection of questionnaire variables so that taken together, the responses can provide a value for the at_1st_addr_conception and absent_during_exp_period flags.

For example, we encountered three questions in the ALSPAC questionnaires that could help us be confident in determining whether study members were definitely at their addresses of enrolment when they were conceived.

Consider the following questions:

Variable a003: years since last move?
Responses:
	'YE short'
	'Missing'
	0
	1
	2
	3
	4
	5+
Variable a004: weeks since last move?
Responses:
	'Missing'
	0...49 weeks
Variable c470: Are you living in the same home that you were in at the start of your pregnancy?
Responses:
	'Missing' -2 or -1
	'y' 1
	'n' 2

First, we had to divide the responses for each question so that they could provide a "Yes" or "No" answer to the question: "Was the study member definitely at their first listed address at conception?"

You may find it helpful to construct a table like the one that follows to help organise your questionnaire data:
Variable | Relevant Question | Yes/No | Response Values
a003 | Definitely at first address? | Yes | Any number greater than or equal to 1
a003 | Definitely at first address? | No | Missing values or 0
a004 | Definitely at first address? | Yes | Any integer > 32 weeks
a004 | Definitely at first address? | No | Missing values or integer <= 32 weeks
c470 | Definitely moved? | Yes | Response of 2
c470 | Definitely moved? | No | Any other response besides 2

Once the values were converted to Yes or No responses, we ranked the questions by how strongly their responses would support a definite answer. We decided that if either a003 or a004 indicated that the study member had been at the first address for a period that would cover conception, then we would accept the answer "Yes". Failing that, we would check whether c470 returned a definite "No" (that is, the respondent reported having moved since the start of the pregnancy). If it did not, then by default we would return "Yes", meaning that the study member was assumed to have been at their enrolment address at the time of conception.

Other scientists may have divided the responses differently and ranked the questions with a different order of importance. However, our assumptions have at least been explicitly recorded.
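
The ranking described above can be written down as a short sketch. The function names are ours for illustration, responses are assumed to arrive as the coded values listed earlier (integers or text), and treating 'YE short' like a missing value is an assumption, since the text does not say how that response was handled.

```python
def a003_says_yes(value):
    """a003 (years since last move): Yes for any number >= 1, including '5+'."""
    text = str(value).strip()
    if text == "5+":
        return True
    # 'Missing', 'YE short' (assumed) and 0 all count as No.
    return text.isdigit() and int(text) >= 1


def a004_says_yes(value):
    """a004 (weeks since last move): Yes for any integer > 32."""
    text = str(value).strip()
    return text.isdigit() and int(text) > 32


def c470_says_moved(value):
    """c470: a response of 2 means the respondent definitely moved."""
    return str(value).strip() == "2"


def at_1st_addr_conception(a003, a004, c470):
    """Apply the ranking: a003/a004 first, then c470, then default to 'Y'."""
    if a003_says_yes(a003) or a004_says_yes(a004):
        return "Y"
    if c470_says_moved(c470):
        return "N"
    return "Y"
```

For instance, at_1st_addr_conception(0, 10, 2) gives 'N' because neither duration covers conception and c470 reports a move, while at_1st_addr_conception("Missing", "Missing", -1) falls through to the default 'Y'.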

Example Table

See here.

Table Properties

You need to produce a CSV file called original_study_member_data.csv. It must have the following fields:
Field | Description | Required | Properties | Examples
person_id | Anonymised unique identifier representing a study member | Yes | Any text | 1001XYZ
comments | Any information the cohort wants to include in the description of a study member | No | Any text |
birth_date | Birth date of a cohort member. | Yes | Date of the format dd/MM/yyyy | 17/05/1995
estimated_gestation_age | Estimated gestation age of a study member at birth, expressed in the number of weeks. This estimate is often obtained either from estimating a pregnant mother's last menstrual period, scans of the foetus or a combination of both. When this value is missing, ALGAE imputes it with a default value defined in default_gestation_age in the global_script_constants table. | No | A positive integer | 37
absent_during_exp_period | Indicates whether study members spent a significant part of their exposure period at a location that does not appear in their residential address histories | Yes | One of the following ways to represent no or yes: N, n, no, NO, FALSE, false, No |
at_1st_addr_conception | Indicates whether study members were definitely at their first registered residential addresses when they were being conceived. | | One of the following ways to represent no or yes: N, n, no, NO, FALSE, false, No, 0, Y, y, yes, YES, TRUE, true, Yes, 1 |
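
Before loading the file, it can help to check each row against the properties listed above. The following is a minimal sketch, not part of ALGAE itself: the function name check_row is ours, and it assumes that both flag fields are required and accept the full set of yes/no spellings listed for at_1st_addr_conception.

```python
from datetime import datetime

# Accepted spellings, taken from the at_1st_addr_conception row above.
# Applying the same set to absent_during_exp_period is an assumption.
YES_VALUES = {"Y", "y", "yes", "YES", "TRUE", "true", "Yes", "1"}
NO_VALUES = {"N", "n", "no", "NO", "FALSE", "false", "No", "0"}
FLAG_FIELDS = ("absent_during_exp_period", "at_1st_addr_conception")


def check_row(row):
    """Return a list of problems found in one CSV row (a dict of strings)."""
    problems = []
    if not (row.get("person_id") or "").strip():
        problems.append("person_id is empty")
    try:
        # dd/MM/yyyy corresponds to %d/%m/%Y in Python.
        datetime.strptime((row.get("birth_date") or "").strip(), "%d/%m/%Y")
    except ValueError:
        problems.append("birth_date is not a valid dd/MM/yyyy date")
    gestation = (row.get("estimated_gestation_age") or "").strip()
    # The field is optional, so only check it when a value is present.
    if gestation and not (gestation.isdigit() and int(gestation) > 0):
        problems.append("estimated_gestation_age is not a positive integer")
    for field in FLAG_FIELDS:
        if (row.get(field) or "").strip() not in YES_VALUES | NO_VALUES:
            problems.append(f"{field} is not a recognised yes/no value")
    return problems
```

Rows read with csv.DictReader can be passed to check_row directly; an empty list means that no problems were found in that row.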