ALGAE Protocol: Creating the original_exposure_data table

An automated protocol for assigning early life exposures to longitudinal cohort studies

Data loading part 4: Prepare the exposure data

by Kevin Garwood


Purpose

This table contains exposure values that are linked to locations that study members have occupied during their exposure time frame. Exposures for various types of pollutants are associated with a given geocode location for a given date.

In the early life analysis, ALGAE assumes the records represent daily exposure values for each location. In the later life analysis, ALGAE assumes the records are annual average exposure values. For the later life analysis, the date is expected to be the first day of the calendar year that an annual exposure record describes for a given location. ALGAE later converts the annual exposure values into daily exposure records.

Location of Original Data File

You will need to create a file that has this name
original_exposure_data.csv
which must be located in:
early_life/input_data
or
later_life/input_data

Example Original Data File

See here.

Suggested Approach

The suggested approach is best viewed as building a matrix in which every geocoded location is set against every date in the time span covered by the cohort members in the study, with exposure values recorded for each location and date.

Step 1: Identify earliest exposure start date

We assume that you will generate exposure values by using an exposure model to process the addresses of all cohort members, and that the model will cover the exposure periods of all cohort members who are part of your study.

To know when your exposure model should begin generating historical values, find the earliest start date of any exposure period for any person enrolled in your study. In the early life analysis, this is the earliest conception date (see life stage calculations for how the conception date is calculated). In the later life analysis, this is the earliest value of birth_date + 1 year for any cohort member in the study. So if your life stages are year 1 (YR1), year 2 (YR2), year 3 (YR3) and so on, you would be looking for the earliest start date of YR1.
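For the later life case, the earliest start date can be sketched in a few lines of Python. This is only an illustration: the function names and sample birth dates are hypothetical, and mapping a 29 February birth date to 1 March of the following year is an assumption, not something this protocol specifies.

```python
from datetime import date


def add_one_year(d):
    """Return the same calendar day one year later.

    Assumption: a 29 February date maps to 1 March of the next year.
    """
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Only reached for 29 February when the next year is not a leap year.
        return date(d.year + 1, 3, 1)


def earliest_yr1_start(birth_dates):
    """Earliest value of birth_date + 1 year across all study members."""
    return min(add_one_year(d) for d in birth_dates)


# Hypothetical birth dates for three study members.
birth_dates = [date(1996, 3, 23), date(1995, 11, 2), date(1996, 2, 29)]
earliest_start = earliest_yr1_start(birth_dates)
```

For the early life analysis, the same `min` call would be applied to conception dates that have already been calculated.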

Step 2: Identify latest exposure end date

Next, you will want to know the latest end date of the last life stage that is covered by an analysis. In the early life analysis, that life stage will typically be called EL, and the key date will be the latest last day of the first year of life for any study member. For exposure modelling, you would then want to create daily exposure values for:

[earliest conception date, latest last day of EL]

In the later life analysis, you will want to identify the last day of the last life stage. For example, if your study includes the fifteenth year of life (e.g. "YR15"), then you will want to identify the latest last day of YR15 for any study member. The overall time frame for your exposure model would then be:

[earliest YR1 start date, latest YR15 end date]
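Steps 1 and 2 together amount to taking the minimum of all start dates and the maximum of all end dates. The sketch below assumes you already have one (start, end) pair per study member; the variable names and dates are hypothetical.

```python
from datetime import date

# Hypothetical (start, end) exposure periods, one per study member.
exposure_periods = [
    (date(1996, 3, 23), date(1997, 12, 29)),
    (date(1995, 11, 2), date(1997, 8, 9)),
    (date(1996, 7, 14), date(1998, 4, 20)),
]


def overall_time_frame(periods):
    """Return (earliest start, latest end) across all members' periods."""
    earliest_start = min(start for start, _ in periods)
    latest_end = max(end for _, end in periods)
    return earliest_start, latest_end


frame_start, frame_end = overall_time_frame(exposure_periods)
# Number of days the exposure model has to cover, both ends inclusive.
days_in_frame = (frame_end - frame_start).days + 1
```

The day count gives a first impression of how many daily records each geocoded location will need.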

Step 3: Identify addresses used by study members during overall exposure time frame

The most economical approach would be to use only the addresses that study members occupied during their exposure periods. However, that would require the address periods to have been cleaned first, and cleaning the address periods presumes that the geocodes already exist.

In the ALGAE protocol, we assume that you will geocode residential addresses without knowing who they may belong to or when they were occupied. That information is contained in the original_address_history_data table.
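Because geocodes are treated independently of who occupied them and when, the exposure model has to cover every geocode on every day of the overall time frame. A minimal sketch of that location-by-date skeleton follows; the function names and geocodes are hypothetical, and the exposure values themselves would come from your exposure model.

```python
from datetime import date, timedelta
from itertools import product


def daily_dates(start, end):
    """Yield every date from start to end, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


def exposure_skeleton(geocodes, start, end):
    """Return one (geocode, date) pair per location per day."""
    return list(product(geocodes, daily_dates(start, end)))


# Hypothetical anonymised geocodes and a three-day time frame.
geocodes = ["x4353bi838", "x9981ab201"]
skeleton = exposure_skeleton(geocodes, date(1996, 3, 23), date(1996, 3, 25))
```

The number of pairs is the number of geocodes multiplied by the number of days, which is why the full job can become large for a real cohort.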

Step 4: Know the meaning of date_of_year depending on your analysis

The field date_of_year has a different meaning, depending on whether you are doing an early life analysis or a later life analysis. In the early life analysis, this value refers to the date of daily exposure values for various pollutants. In the later life analysis, ALGAE expects that the records will be annual exposure values and that the date will be the first day of the calendar year of exposure coverage.
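One way to catch mistakes before loading the file is to check each date_of_year value against the analysis you intend to run. The sketch below is not part of ALGAE; the function name and the "early_life" and "later_life" labels are hypothetical, and the date format is the dd/MM/yyyy format listed under Table Properties.

```python
from datetime import datetime


def is_valid_date_of_year(text, analysis):
    """Check a date_of_year value for the given analysis.

    analysis is "early_life" or "later_life" (hypothetical labels).
    """
    try:
        parsed = datetime.strptime(text, "%d/%m/%Y").date()
    except ValueError:
        return False  # not a dd/MM/yyyy date
    if analysis == "later_life":
        # Annual records are expected to be dated 1 January.
        return parsed.month == 1 and parsed.day == 1
    return True  # early life: any valid daily date is acceptable


early_ok = is_valid_date_of_year("23/03/1996", "early_life")
later_bad = is_valid_date_of_year("23/03/1996", "later_life")
later_ok = is_valid_date_of_year("01/01/1996", "later_life")
```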

General Advice

Using an exposure model to generate exposures for multiple pollutants, for all geocoded locations, across the overall exposure time frame will likely be computationally intensive. We suggest that, early in your project, you invest in acquiring access to a secure, high-performance computing environment.

Example Table

See here.

Table Properties

You need to produce a CSV file called original_exposure_data.csv. It must have the following fields:
| Field | Description | Required | Properties | Examples |
| --- | --- | --- | --- | --- |
| geocode | Represents the location of a residential address. For ALGAE, the geocode is treated as just an identifier and the protocol attaches no meaning to the code. | Yes | Any text | 37.422036-122.084124; x4353bi838 (anonymised) |
| comments | Any other information the cohort researchers want to provide about the geocode. | No | Any text | |
| date_of_year | Date of an exposure record. If you are running the early life analysis, ALGAE assumes this is the date of a daily exposure record. If you are running the later life analysis, this date is January 1st of a calendar year and the exposure values represent annual average values. | Yes | Date format: dd/MM/yyyy | 23/03/1996 |
| pm10_rd | PM10 from local road sources. | No | A number with at most 15 decimal digits of precision | |
| nox_rd | NOx from local road sources. | No | A number with at most 15 decimal digits of precision | |
| pm10_gr | PM10 from sources other than roads. | No | A number with at most 15 decimal digits of precision | |
| pm10_tot | PM10 from all sources. | No | A number with at most 15 decimal digits of precision | |
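To make the field list concrete, the following sketch writes one record in the expected layout using Python's standard csv module. The column order, the helper name and the pollutant values are all assumptions made for illustration; only the field names and the dd/MM/yyyy date format come from the table above.

```python
import csv
import io
from datetime import date

# Field names from the table above; the order is an assumption.
FIELDS = ["geocode", "comments", "date_of_year",
          "pm10_rd", "nox_rd", "pm10_gr", "pm10_tot"]


def to_csv_text(records):
    """Serialise exposure records, formatting dates as dd/MM/yyyy."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        row = dict(record)
        row["date_of_year"] = record["date_of_year"].strftime("%d/%m/%Y")
        writer.writerow(row)
    return buffer.getvalue()


# One hypothetical daily record; the pollutant values are made up.
records = [{
    "geocode": "x4353bi838",
    "comments": "",
    "date_of_year": date(1996, 3, 23),
    "pm10_rd": 1.25,
    "nox_rd": 30.5,
    "pm10_gr": 18.75,
    "pm10_tot": 20.0,
}]
csv_text = to_csv_text(records)
```

In practice the text would be written to original_exposure_data.csv in the input_data directory of the analysis you are running.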