Efficient Feature Selection for Static Analysis Vulnerability Prediction

Sensors (Basel). 2021 Feb 6;21(4):1133. doi: 10.3390/s21041133.

Abstract

Common software vulnerabilities can result in severe security breaches, financial losses, and reputation deterioration and require research effort to improve software security. The acceleration of the software production cycle, limited testing resources, and the lack of security expertise among programmers require the identification of efficient software vulnerability predictors to highlight the system components on which testing should be focused. Although static code analyzers are often used to improve software quality together with machine learning and data mining for software vulnerability prediction, the work regarding the selection and evaluation of different types of relevant vulnerability features is still limited. Thus, in this paper, we examine features generated by SonarQube and CCCC tools, to identify those that can be used for software vulnerability prediction. We investigate the suitability of thirty-three different features to train thirteen distinct machine learning algorithms to design vulnerability predictors and identify the most relevant features that should be used for training. Our evaluation is based on a comprehensive feature selection process based on the correlation analysis of the features, together with four well-known feature selection techniques. Our experiments, using a large publicly available dataset, facilitate the evaluation and result in the identification of small, but efficient sets of features for software vulnerability prediction.

Keywords: feature selection; machine learning; software vulnerability prediction; static analysis.

Grants and funding