ALGAE Testing Part 1: Pre-processing Features

An automated protocol for assigning early life exposures to longitudinal cohort studies

Testing Part 1: Pre-processing features

by Kevin Garwood

Testing Overview Next

Background

ALGAE alters copies of original data set files to produce corresponding staging tables that standardise certain field values and check that various fields are either not empty or are unique. The rest of the protocol is driven from these staging tables rather than the original data files.

Tests for the pre-processing features need not be concerned with the outcomes of any other features such as cleaning address histories or calculating exposures. Most of the tests focus on correctly standardising the values of "Yes/No" fields or fields with empty values in the original data sets.

If the tests for the pre-processing features all pass, then test cases in all subsequent test areas can rely on using standardised field values. For example, if we show that the protocol will replace Yes for Y in the has_valid_geocode field of the original_geocode_data file, we don't have to write a similar test for the Geocode Feature area. If the pre-processing tests complete, we can be guaranteed that any table field "person_id" will not be blank.

The main reason for developing this test area is to help constrain the variation in input values in order to minimise the proliferation of test cases we would have to consider.

Coverage

The coverage of this test area is shown in the following diagram:

Most of the testing relates to checking whether database tables exist, or if certain fields have values which are empty or duplicated. Automated test cases exist for the Yes/No fields absent_during_exp_period, at_1st_address_conception and has_valid_geocode.

There are no result output values to check. All the tests examine values in the staging tables to confirm that values were standardised correctly.

Input Fields Covered by Test Cases

Table Field
staging_study_member_data person_id
staging_study_member_data birth_date
staging_study_member_data absent_during_exp_period
staging_study_member_data at_1st_addr_conception
staging_geocode_data geocode
staging_geocode_data has_valid_geocode
staging_address_history_data person_id
staging_address_history_data geocode
staging_exp_data geocode
staging_exp_data date_of_year

Output Fields Covered by Test Cases

There are no result output values to check. All the tests examine values in the staging tables to confirm that values were standardised correctly.

Test Case Design

Do not test built-in database function

Testing does not include any effort to check that PostgreSQL's built-in database functions work - we assume they do. Therefore, no automated test cases are used to check whether the staging tables exist, or whether the database is able to add table field constraints such as requiring that they are unique or non-empty. Instead, visual inspection of the code is used to verify that built-in constraint checks are being applied to tables and fields. If any of the following assumptions do not hold, the protocol will throws an exception and stop working:
  • the tables staging_study_member_data, staging_geocode_data, staging_exposure_data and staging_address_history_data exist
  • staging_study_member_data.person_id is both unique and non-empty
  • staging_study_member_data.birth_date is not empty
  • staging_study_member_data.absent_during_exp_period and at_1st_addr_conception are not empty.
  • staging_geocode_data.geocode and has_valid_geocode are not empty
  • staging_exp_data.geocode and date_of_year are not null and together must be unique.

Create test cases that show that the protocol standardises yes/no field values correctly.

There are three fields in the original data tables that provide yes/no values. These are:
  • original_study_member_data.absent_during_pregnancy
  • original_study_member_data.at_1st_address_conception
  • original_geocode_data.has_valid_geocode

Tests are made to ensure that in each field, the following data cleaning operations can be successfully applied:

  • {n, No, NO, FALSE, false, 0} are cleaned to become N
  • {y, Yes, YES, TRUE, true, 1} are cleaned to become Y

Create test cases that show protocol standardises empty values for geocodes

Tests exist to check that the protocol can correctly transform field values of NULL, #NULLIF and the empty string '' to empty_geocode. All of the work needed to standardise empty values is contained with the function call: standardise_geocode_value(original_geocode_value, is_null_allowed)

In order to minimise testing costs, tests are developed to exercise this function rather than the set of input fields that may be transformed by them. In a production settings, these tests will appear commented out. The following inputs for the function should provide these results:

  • an original geocode of 123456-234543 should remain unchanged
  • original geocode values of NULL, #NULLIF, '' should result in empty_geocode.