Procode: A Machine-Learning Tool to Support (Re-)coding of Free-Texts of Occupations and Industries

Ann Work Expo Health. 2022 Jan 7;66(1):113-118. doi: 10.1093/annweh/wxab037.

Abstract

Procode is a free-of-charge web tool for automatic coding of occupational data (free-texts) that implements Complement Naïve Bayes (CNB) as its machine-learning technique. The paper describes the algorithm, its performance evaluation, and future goals for the tool's development. Almost 30 000 free-texts with manually assigned codes from the French classification of occupations (PCS) and the French classification of activities (NAF) were used to train CNB. A 5-fold cross-validation found that Procode predicts the correct classification code in 57-81% of cases for PCS and 63-83% of cases for NAF. Procode also integrates recoding between two classifications. In the first version of Procode, however, this operation is only a simple search for recoding links in existing crosswalks. Future work will focus on collecting data to support automatic coding into other classifications and on establishing a more advanced method for recoding.
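
To illustrate the general idea of CNB applied to free-text coding, the following is a minimal sketch and not Procode's actual implementation. The abstract does not specify tokenization, smoothing, or feature weighting, so the whitespace tokenizer, the smoothing parameter alpha, the function names, the toy free-texts, and the codes X1 to X3 (placeholders, not real PCS or NAF codes) are all assumptions made for illustration. CNB estimates token statistics for each code from the texts that do not carry that code, then picks the code whose complement fits the new text worst.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train_cnb(texts, codes, alpha=1.0):
    """Fit CNB weights: for each code, token statistics are taken from
    all training texts that do NOT carry that code (the complement)."""
    per_code = defaultdict(Counter)
    total = Counter()
    for text, code in zip(texts, codes):
        tokens = tokenize(text)
        per_code[code].update(tokens)
        total.update(tokens)
    vocab = sorted(total)
    grand_total = sum(total.values())
    weights = {}
    for code, counts in per_code.items():
        # Smoothed token count in the complement of this code.
        denom = grand_total - sum(counts.values()) + alpha * len(vocab)
        weights[code] = {
            tok: math.log((total[tok] - counts[tok] + alpha) / denom)
            for tok in vocab
        }
    return weights


def predict_cnb(weights, text):
    """Return the code whose complement explains the text worst."""
    tokens = Counter(tokenize(text))

    def score(code):
        w = weights[code]
        # Tokens never seen in training are ignored.
        return sum(n * w[t] for t, n in tokens.items() if t in w)

    return min(sorted(weights), key=score)


# Hypothetical toy data; X1..X3 are placeholder codes, not PCS/NAF codes.
texts = [
    "hospital nurse", "night nurse hospital",
    "truck driver", "delivery truck driver",
    "primary school teacher", "school teacher",
]
codes = ["X1", "X1", "X2", "X2", "X3", "X3"]

weights = train_cnb(texts, codes)
print(predict_cnb(weights, "nurse"))
print(predict_cnb(weights, "truck driver"))
```

The reported accuracy figures come from a 5-fold cross-validation on almost 30 000 manually coded free-texts; a toy set like this one is far too small to reproduce that kind of evaluation and only shows the mechanics of training and prediction.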

Keywords: Naïve Bayes; cross-validation; epidemiology; machine learning; occupational classifications.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Bayes Theorem
  • Humans
  • Industry
  • Machine Learning
  • Occupational Exposure*
  • Occupations