ALGAE Protocol: Methodology

An automated protocol for assigning early life exposures to longitudinal cohort studies

Our Advice: Improving administrative systems used to capture address histories

by Kevin Garwood

There are three main kinds of improvements that can be made to administrative systems so that their residential address histories will better support retrospective exposure studies:
  • form comments
  • guided data entry features
  • validation features
Form comments provide a form of end-user training that will increase the likelihood that users enter data in the way that the electronic forms expect. For example, having the comment "dd-MM-yyyy" next to a date text field will encourage users to enter data in the correct format. Guided data entry features help prevent certain types of errors from occurring and validation features help address them when they happen.

We assume that there would be scarce developer resources available for cohorts to create or improve administrative systems they use to maintain contact details of their study members. For small cohorts, it is likely that such systems would begin as simple spread sheets, and would evolve into tables that are managed through a database. Eventually cohort members would be able to add, edit and delete current addresses using electronic forms that support guided data entry and validation features. Large cohorts may invest in web-based applications as part of delegating the task of updating changes of address to the study members themselves.

Apart from evolving through different technologies, the administrative systems may evolve new features as it goes from being a means of auditing current addresses to a means of tracking past addresses. Initially, the start date of an address period would likely be the time stamp that is generated when the application creates a new record. The end date of the address period may be calculated as the day before the creation time stamp of the next record.

In the following sections, we explain a sequence of incremental improvements that can improve the data quality of residential address histories that would later be taken from the contact database systems. Cohorts may decide to make different improvements, depending on their available developer resources and on the needs of any legacy contacts applications they may already be using.

General Advice

Use alphanumeric study member identifiers instead of auto-incremented numbers

The study member identifier is the most important data field in ALGAE because it links together different sources of information about the same person. The way the identifiers are generated can influence how error-prone it is to link related records together and how easy it will later be to migrate data from one information management system to the next.

Using auto-numbering to generate new study identifiers is appealing because it is simple. In spreadsheets, it may be tempting to borrow the line number that appears next to the first field. In database programs such as MS Access, it is easy to create a primary key that uses an autonumbering feature.

The problem with using auto-generated identifiers is that when they are used in different tables, there is a risk that accidentally linking the wrong fields will still produce matches that obscure the mistake.

Consider the following example of where we make an error by accidentally linking study member and current address tables using instead of

The resulting database query may return addresses for most study members and create the impression that it has correctly created address histories for each person. There is a possibility that during some important life stage, Amanda Garner's exposure will be assessed at 13 Orchard Vale instead of 5 Tanningware Rd. It is also possible that exposure scientists would never detect the error and that epidemiologists might then observe incorrect relationships between early life exposure and later life health outcomes.

Now consider the same scenario, but where the identifier field in the study_member_data has a value that is not assigned a unique value through an auto-numbering feature. When the query is run, it will produce no results, thereby indicating an obvious problem that needs to be fixed.

As technologies change, cohorts may feel compelled to migrate administrative data from one kind of information management system to another. Auto-generated keys are usually unique with respect to a particular table but may not be unique when they are used within a new system. For example, if administrators use a new database to recruit a new group of cohort members and then import cohort member data from another database, they may find that multiple people are assigned the same study identifier.

Design study member identifiers to resist typographical mistakes

If your staff are going to be manually typing in study member identifiers, then it is a good idea to generate ones that will be guaranteed to be at least two characters different than any other identifier.

For example, consider the study member identifiers 101D assigned to one person and 151D assigned to another. If cohort staff were to accidentally type in 151D but actually mean 101D, then a record could end up being assigned to another person. It is much easier to spot data linkage errors if the typographical mistakes produce identifiers that don't belong to anyone.

One way to reduce the risk of typographical mistakes is to alternate number and letter characters. The alternation between letters and number helps slow down data entry so that key strokes become more deliberate. Another way to reduce the likelihood of undetected typographical errors is to choose identifiers that have many characters and produce far more possible identifiers than are needed by the size of the cohort.

Include an audit trail feature to retain old current addresses

However you record the current address of your cohort members, you should ensure that your system audits past current address records rather than just maintaining only the most recent one. We have learned of cohorts whose administrative systems did not do this, and they are left with the challenge of relying on a single postal address to represent the entire exposure time frame that is used within a particular study.

When having the one address is no longer sufficient to support these studies, the cohorts are usually left with no choice other than to commission a data collection sweep that elicits past addresses from study members. This approach is prone to problems related to memory recall and response rates.

Having an audit trail in your system provides at least two benefits. First, it can help establish which addresses a cohort used when it mailed out sensitive data to its members. An audit trail can help support various reviews that may be part of information governance procedures. Second, the audit trail provides a way of recording movement patterns that could later be used as well or in place of an expensive questionnaire effort to accomplish the same thing.

Improving Spreadsheets

If your project is recording changes of addresses using a spreadsheet, then you may want to change the date format so that the dates are spelled rather than numbered. Entering dates using formats such as dd/MM/yyyy or MM/dd/yyyy are prone to problems of transposing numbers and having the data interpretted differently between American and European formats. In data entry activities, it is much easier to understand the meaning of 09 JUN 2002 than 09/06/2002

A spreadsheet may be an acceptable means of recording a few address changes at the beginning of recruitment for a very small cohort. However, spreadsheets have a number of drawbacks:

  • manual data entry is prone to typographical errors and the field errors can affect data linking and filtering activities
  • tables may include additional description fields to help data enterers know whose records they are editing
  • the repetitious nature of data entry may make staff more at risk of creating errors from copy-and-paste operations.
  • auditing manual data corrections is laborious

I would expect that almost any cohort project having more than a few cohort members will soon invest in using a database that is edited through electronic forms.

Suggestions for databases that are edited through electronic forms

Record the time stamp of when a new change of address is created

Many cohorts may not initially try to record the start and end dates of each study member's address periods. One of the advantages of this feature is that the time will be set automatically by the database rather than be set manually by a user. However, it has the drawback of creating artificial start dates that correspond to when a study member's change of address is updated rather than when they have actually moved.

Derive end dates in relation to start dates

Each time a new address record is created, the database should set the end date of the previous current address with the creation time stamp minus one day. This improvement has the drawback of making end dates seem more synthetic but it has the benefit of simplifying data cleaning activities that are used to fix gaps and overlaps in the address histories.

Include both an Add and an Edit feature for address change records

Some administrative systems may create a new change of address each time the current one is changed. The problem with this kind of auditing is that a new record may reflect corrections of an old address rather than represent a new location. The benefit of applying this improvement is that it begins to turn the database from an auditing tool into a tracking tool. The scientists doing the study can have more confidence that address changes actually represent moves from one place to another.

Add format comments to help describe the input formats of dates.

Adding the format for a date field can help make it more likely that users enter dates in the formats that the electronic forms expect.

Include an explicit form field to capture the start and end dates

Up until now, we have used record creation dates to represent start dates of address periods. However, the start dates will be more meaningful if data enterers have to explicitly enter start and end dates.

Break date text fields into three separate and smaller form fields

Representing the date through separate fields for day, month and year can make it easier for data enterers to know what to type. The size of the field can also help indicate what is expected to be typed. For example, if year has a text field that looks large enough to hold two characters, then the user may conclude that the system wants a two digit year to be provided.

Prefer drop down fields to text fields for representing date fields

Changing from using text fields to drop-down fields can help eliminate typographical errors. The former way may prove the quicker way to create a new date field but the latter will likely mean that a date value would be less error prone.

Prefer a date picker to drop down list fields for representing date fields.

Many modern electronic forms will let users specify a date value through a graphical pop-up calendar. Using the calendars can help eliminate typographical errors, problems specifying years with only two digits, problems specifying leap year days and other problems of choosing inappropriate days for certain months (eg: day 31 for February). By default, date pickers will show the calendar month that contains the current day. Therefore, date pickers work best when users have to select days that are recent. They may be less useful when old dates are being specified.

Make sure that end dates are not less than start dates

Preventing this kind of error will ensure that data cleaning will not have to fix it later.

Prompt users to enter a previous end date when they create a new address change

When users create a new address period for a study member, they may forget to specify an end date for that person's previous address period. To ensure that all past current addresses have both a start and an end date, the application should detect blank required dates.

Develop an end-user tutorial to help users specify addresses in a standard way

Users who create new address period records should receive some training about how standard addresses should be entered. For example, projects may suggest that they use standard ways to represent "street", "avenue", and "road". By promoting a standard way for specifying addresses, it is likely that more addresses will later be geocoded successfully.

Support key stroke filtering for postal code fields

Another way to improve the data quality of addresses is to make a list of valid post codes that changes when users type more characters of it. Even if the street addresses are of poor quality, the postal code can often provide important information for geocoding software applications.

Use address look-up service

The best way to specify an address would be to look it up using a service that filters choices based on key strokes. By specifying addresses in this way, the application would be more easily able to check that the next address change did indeed represent a new location.