Data loading part 1: Prepare study member data
by Kevin Garwood
Purpose
The data in this table are used for two main tasks:
- to calculate the time boundaries for life stages
- to use the responses from cohort questionnaire variables to determine how certain we can be about assumptions we make about the residential address histories.
Please see Life stage calculations to understand how birth_date and estimated_gestation_age are used to calculate life stage data.
We need to have a field absent_during_exp_period, which indicates whether study members spent significant amounts of time located at residential addresses that are not listed in their address histories. We also need a field at_1st_addr_conception to tell us whether study members were definitely at their first address during conception.
Location of Original Data File
You will need to create a file named original_study_member_data.csv, which must be located in early_life/input_data or later_life/input_data.
Example Original Data File
See here.
Suggested Approach
Step 1: Obtain person_id, birth date and gestation age at birth fields
The first task in preparing to use the ALGAE protocol is to assign each study member an identifier, which may be anonymised or pseudonymised depending on the preferences of the cohort. Birth date and gestation age at birth data are likely to come from birth records and should be part of any birth cohort's variables.
Step 2: Find questionnaire data related to establishing an address at conception
Many birth cohorts recruit pregnant mothers to enrol their child. The first address that is recorded is usually the address the mother specified at enrolment, which may not have been the location she lived at when the child was conceived. In many cases the address at enrolment will be the address at conception, but we want a flag to help us assess how confident we can be about this assumption.
The flag at_1st_addr_conception is 'Y' if the study member was definitely at their first recorded address at conception. Otherwise, the value must be 'N'.
Step 3: Find questionnaire data to detect locations used that are not in residential address histories
The ALGAE Protocol assumes that study members only occupied addresses that are listed in their residential address histories. However, we want a data quality flag to help us determine whether they spent any significant part of their exposure period living at addresses that are not part of their address histories.
Look for variables related to homelessness, hospitalisation, prison or visits outside the exposure area that you would assume are not listed in the residential address histories. Focus specifically on variables that would be relevant during the exposure time frame.
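As a rough illustration of Step 3, the sketch below combines several questionnaire-derived indicators into a single absent_during_exp_period value. It is only one possible approach, not part of the ALGAE protocol itself: the function name and the indicator names (homeless, hospitalised, in_prison, long_visit_outside_area) are hypothetical, and it assumes each cohort has already decided which responses count as a significant absence during the exposure period.

```python
def absent_during_exp_period(indicators: dict) -> str:
    """Return 'Y' if any indicator suggests a significant absence, else 'N'.

    `indicators` maps a hypothetical indicator name to True/False. Deciding
    what counts as "significant" is assumed to happen before this call,
    when each questionnaire response is turned into True or False.
    """
    return "Y" if any(bool(value) for value in indicators.values()) else "N"


# Hypothetical indicators for one study member.
example_indicators = {
    "homeless": False,
    "hospitalised": True,
    "in_prison": False,
    "long_visit_outside_area": False,
}
example_flag = absent_during_exp_period(example_indicators)
```

Treating any single positive indicator as enough to set the flag is a cautious choice for a data quality flag; a cohort could equally require stronger evidence before returning 'Y'.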
General Advice
You may have to re-purpose a collection of questionnaire variables so that, taken together, the responses can provide a value for the at_1st_addr_conception and absent_during_exp_period flags.
For example, we encountered three questions in the ALSPAC questionnaires that could help us be confident in determining whether study members were definitely at their addresses of enrolment when they were conceived.
Consider the following questions:
Variable a003: years since last move? Responses: 'YE short' 'Missing' 0 1 2 3 4 5+
Variable a004: weeks since last move? Responses: 'Missing' 0...49 weeks
Variable c470: Are you living in the same home that you were in at the start of your pregnancy? Responses: 'Missing' -2 or -1 'y' 1 'n' 2
First, we had to divide the responses for each question so they could provide a "Yes" or "No" answer to the question: "Was the study member definitely at their first listed address at conception?"
You may find it helpful to construct a table like the one that follows to help organise your questionnaire data:
Variable | Relevant Question | Yes/No Response | Values |
---|---|---|---|
a003 | Definitely at first address? | Yes | Any number greater than or equal to 1 |
a003 | Definitely at first address? | No | Missing values or 0 |
a004 | Definitely at first address? | Yes | Any integer > 32 weeks |
a004 | Definitely at first address? | No | Missing values or integer <= 32 weeks |
c470 | Definitely moved? | Yes | Response of 2 |
c470 | Definitely moved? | No | Any other response besides 2 |
Once the values were converted to Yes or No responses, we had to rank the questions based on how strongly their responses would support a definite answer. We decided that if either a003 or a004 indicated that the study member had been at the first address for a period that would cover conception, then we would accept the answer "Yes". Barring that, we would determine whether c470 would return a definite "No". If this didn't happen, then by default we would return a "Yes", meaning that the study member had been at his/her enrolment address at the time of conception.
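The ranking described above can be sketched in Python as follows. This is an illustrative reading of the table and the paragraph above, not code from ALGAE: the function and helper names are hypothetical, and the sketch assumes that non-numeric responses such as 'Missing' or 'YE short' are treated as missing and that '5+' is read as 5.

```python
from typing import Optional


def _to_number(response) -> Optional[int]:
    """Convert a questionnaire response to an int, or None if it is missing."""
    text = str(response).strip().rstrip("+")  # assumption: '5+' is read as 5
    try:
        return int(float(text))
    except (ValueError, OverflowError):
        return None  # 'Missing', 'YE short', None, NaN and similar


def at_1st_addr_conception(a003, a004, c470) -> str:
    """Apply the ranking: a003 or a004 'Yes' first, then c470, else default 'Y'."""
    years_since_move = _to_number(a003)
    weeks_since_move = _to_number(a004)
    same_home_code = _to_number(c470)

    # a003: one or more years since the last move covers conception.
    if years_since_move is not None and years_since_move >= 1:
        return "Y"
    # a004: more than 32 weeks since the last move covers conception.
    if weeks_since_move is not None and weeks_since_move > 32:
        return "Y"
    # c470: a response of 2 means the mother definitely moved.
    if same_home_code == 2:
        return "N"
    # Default: assume the enrolment address was the address at conception.
    return "Y"
```

For example, a mother who moved 10 weeks ago (a003 of 0, a004 of 10) and answered 2 to c470 would be given 'N', while the same mother with 'Missing' for c470 would fall through to the default 'Y'.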
Other scientists may have divided the responses differently and ranked the questions with a different order of importance. However, our assumptions have at least been explicitly recorded.
Example Table
See here.
Table Properties
You need to produce a CSV file called original_study_member_data. It must have the following fields:
Field | Description | Required | Properties | Examples |
---|---|---|---|---|
person_id | Anonymised unique identifier representing a study member | Yes | Any Text | 1001XYZ |
comments | Any information the cohort wants to include in the description of a study member | No | Any text | |
birth_date | Birth date of a cohort member. | Yes | Date of the format dd/MM/yyyy | 17/05/1995 |
estimated_gestation_age | Estimated gestation age of a study member at birth, expressed in the number of weeks. This estimate is often obtained either from estimating a pregnant mother's last menstrual period, scans of the foetus or a combination of both. When this value is missing, ALGAE imputes it with a default value defined in | No | A positive integer | 37 |
absent_during_exp_period | Indicates whether study members spent a significant part of their exposure period at a location that does not appear in their residential address histories | Yes | One of the following ways to represent no or yes: N, n, no, NO, FALSE, false, No | |
at_1st_addr_conception | Indicates whether study members were definitely at their first registered residential addresses when they were being conceived. | Yes | One of the following ways to represent no or yes: N, n, no, NO, FALSE, false, No, 0, Y, y, yes, YES, TRUE, true, Yes, 1 | |
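
Before handing the file to ALGAE, it may help to check each row against the properties in the table above. The following is a minimal sketch of such a check and is not part of ALGAE: check_row and the sample data are hypothetical, and it assumes that both flag fields accept the full list of yes/no values given for at_1st_addr_conception.

```python
import csv
import io
from datetime import datetime

# Assumption: both flag fields accept the values listed for at_1st_addr_conception.
YES_NO_VALUES = {
    "N", "n", "no", "NO", "FALSE", "false", "No", "0",
    "Y", "y", "yes", "YES", "TRUE", "true", "Yes", "1",
}
FLAG_FIELDS = ("absent_during_exp_period", "at_1st_addr_conception")
REQUIRED_FIELDS = ("person_id", "birth_date") + FLAG_FIELDS


def check_row(row: dict) -> list:
    """Return a list of problems found in one row of the study member file."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not (row.get(field) or "").strip():
            problems.append(f"{field}: required value is missing")

    birth_date = (row.get("birth_date") or "").strip()
    if birth_date:
        try:
            datetime.strptime(birth_date, "%d/%m/%Y")  # dd/MM/yyyy
        except ValueError:
            problems.append(f"birth_date: '{birth_date}' is not dd/MM/yyyy")

    # estimated_gestation_age is optional, but must be a positive integer if present.
    gestation = (row.get("estimated_gestation_age") or "").strip()
    if gestation and not (gestation.isdecimal() and int(gestation) > 0):
        problems.append(f"estimated_gestation_age: '{gestation}' is not a positive integer")

    for field in FLAG_FIELDS:
        value = (row.get(field) or "").strip()
        if value and value not in YES_NO_VALUES:
            problems.append(f"{field}: '{value}' is not a recognised yes/no value")
    return problems


# Hypothetical sample data: the first row is valid, the second is not.
sample_csv = (
    "person_id,comments,birth_date,estimated_gestation_age,"
    "absent_during_exp_period,at_1st_addr_conception\n"
    "1001XYZ,,17/05/1995,37,N,Y\n"
    "1002ABC,,1995-05-17,-3,maybe,\n"
)
sample_rows = list(csv.DictReader(io.StringIO(sample_csv)))
```

Calling check_row on each row of csv.DictReader output for the real file would give a per-row list of problems to resolve before loading.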