Frequency and determinants of missing data in clinical and prognostic variables recently added to SEER

J Registry Manag. 2011 Autumn;38(3):120-31.

Abstract

Background: The objectives of the present study were to examine and quantify the frequency of missing information for the collaborative stage (CS) site-specific factors (SSF) added to the Surveillance, Epidemiology and End Results (SEER) data collection in 2004, to evaluate patient-, disease-, and registry-related factors associated with incomplete data, and to quantify the time and effort required to collect information for each variable of interest.

Methods: The study included 2 parts: 1) an analysis of existing nationwide SEER data; and 2) an evaluation of time and effort as reported by hospital registrars in the Metropolitan Atlanta and Rural Georgia (MARGA) SEER Registry catchment area. The first analysis examined all SSF for all types of cancers reported to the SEER Program between 2004 and 2007 from all 17 SEER registries. The data for the second analysis were limited to 5 cancer sites: breast, prostate, colon/rectum, testes, and lymphoma. Information for each cancer site was collected from 40 cancer registrars who were asked to estimate the amount of time and effort spent on abstracting each variable of interest.

Results: We analyzed 825,952 cases pertaining to 18 different cancer sites and 45 different variables. Of the 45 SSF variables examined in this study, 12 had at least 50% of cases with missing data. Conversely, a total of 21 variables were at least 80% complete. Our analysis of determinants of missing SSF data showed an improvement in reporting since 2004 for most variables. Older patients (80+ years of age) tended to have a higher proportion of missing data compared to 40- to 59-year-olds (reference category). For the specific cancers presented in this paper, patients diagnosed in non-metropolitan areas tended to have a slightly higher proportion of missing data compared to those diagnosed in metropolitan areas. We found no discernible patterns of association between the probability of having missing data and patients' race, sex, or registry. According to the registrars' reports, data collection for CS SSF requires a median of 2-3 minutes with a range of 1-15 minutes. There was great variability in the perceived level of difficulty associated with finding the necessary data.

Conclusions: Data completeness for CS SSF ranges widely and is largely site- and variable-specific. The main barrier to data completeness appears to be the availability of information in the medical records. Our results indicate that for a number of SSF the proportion of missing data is so high that these variables can be of little, if any, use for population-based research. The practical implications of our findings with respect to existing and future SSF need to be explored.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Adult
  • Aged
  • Aged, 80 and over
  • Humans
  • Middle Aged
  • Neoplasm Staging
  • Neoplasms / epidemiology*
  • Population Surveillance / methods*
  • Prognosis
  • Registries / statistics & numerical data
  • Residence Characteristics
  • SEER Program / statistics & numerical data*
  • Time Factors
  • United States / epidemiology