Using supervised machine learning to identify efficient blocking schemes for record linkage

Stat J IAOS. 2021 Jun 3;37(2):673-680. doi: 10.3233/sji-200779.

Abstract

Record linkage enables survey data to be integrated with other data sources, expanding the analytic potential of both. However, depending on the number of records being linked, the processing time can be prohibitive. This paper describes a case study in which a supervised machine learning algorithm, the Sequential Coverage Algorithm (SCA), was used to develop the join strategy for linking two data sources: the National Center for Health Statistics' (NCHS) 2016 National Hospital Care Survey (NHCS) and the Centers for Medicare & Medicaid Services (CMS) Enrollment Database (EDB). Because of the size of the CMS data, common record joining methods (i.e., blocking) were used to reduce the number of record pairs that must be evaluated while still identifying the vast majority of matches. The case study, conducted by NCHS, examined how the SCA improved the efficiency of blocking, and this paper describes how the SCA was used to design the blocking scheme for this linkage.

Keywords: Centers for Medicare & Medicaid Services; National Center for Health Statistics; National Hospital Care Survey; blocking; machine learning; record linkage.
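
To make the idea of blocking concrete, the following is a minimal sketch in Python and not the authors' implementation. It assumes a small set of labeled true matches is available for training and uses a simple greedy covering heuristic: at each step it picks the blocking key with the highest share of still-uncovered true matches among the pairs it generates. The records, field names, and the functions `block_pairs` and `sequential_cover` are all hypothetical and invented for illustration.

```python
from itertools import product

# Toy records (entirely made up): id -> (last_name, birth_year, zip_code)
survey = {
    "s1": ("smith", 1950, "20001"),
    "s2": ("jones", 1962, "20002"),
    "s3": ("brown", 1971, "20003"),
}
admin = {
    "a1": ("smith", 1950, "20009"),  # same person as s1, different ZIP
    "a2": ("jonas", 1963, "20002"),  # same person as s2, name and year typos
    "a3": ("brown", 1971, "20003"),  # same person as s3
    "a4": ("smith", 1988, "20002"),  # no counterpart in the survey
}

# Labeled training pairs known to be true matches (assumed to be available).
true_matches = {("s1", "a1"), ("s2", "a2"), ("s3", "a3")}

# Candidate blocking keys: two records are compared only if the key agrees.
candidate_keys = {
    "last_name": lambda r: r[0],
    "birth_year": lambda r: r[1],
    "zip_code": lambda r: r[2],
}


def block_pairs(key):
    """Return every (survey_id, admin_id) pair whose blocking key agrees."""
    return {
        (s, a)
        for (s, rs), (a, ra) in product(survey.items(), admin.items())
        if key(rs) == key(ra)
    }


def sequential_cover(keys, matches, target=1.0):
    """Greedily choose blocking keys until `target` share of matches is covered."""
    uncovered = set(matches)
    remaining = dict(keys)
    chosen = []
    while remaining and len(uncovered) > (1 - target) * len(matches):

        def score(name):
            # Share of generated pairs that are still-uncovered true matches.
            pairs = block_pairs(remaining[name])
            return len(pairs & uncovered) / len(pairs) if pairs else 0.0

        best = max(remaining, key=score)
        newly_covered = block_pairs(remaining.pop(best)) & uncovered
        if not newly_covered:
            break  # no remaining key finds any further matches
        chosen.append(best)
        uncovered -= newly_covered
    return chosen, uncovered


chosen, uncovered = sequential_cover(candidate_keys, true_matches)

# Pairs that would actually be compared: the union over the chosen keys.
all_pairs = set().union(*(block_pairs(candidate_keys[k]) for k in chosen))
full_comparison = len(survey) * len(admin)

print(chosen, len(all_pairs), full_comparison)
```

In this toy setup, the selected keys still capture every labeled match while requiring fewer comparisons than the full cross product of the two files. A real linkage would evaluate far more candidate keys, including combinations of fields, and would typically accept slightly less than complete coverage in exchange for fewer comparisons.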