Textual data transformations using natural language processing for risk assessment

Mohammad Zaid Kamil; Mohammed Taleb-Berrouane; Faisal Khan; Paul Amyotte; Salim Ahmed

doi:10.1111/risa.14100

Risk Anal. 2023 Oct;43(10):2033-2052. doi: 10.1111/risa.14100. Epub 2023 Jan 22.

Authors

Mohammad Zaid Kamil¹, Mohammed Taleb-Berrouane¹, Faisal Khan¹ ², Paul Amyotte³, Salim Ahmed¹

Affiliations

¹ Centre for Risk, Integrity and Safety Engineering (C-RISE), Faculty of Engineering & Applied Science, Memorial University, St John's, Newfoundland, Canada.
² Mary Kay O'Connor Process Safety Center, Artie McFerrin Department of Chemical Engineering, Texas A&M University, College Station, Texas, USA.
³ Department of Process Engineering and Applied Science, Dalhousie University, Halifax, Nova Scotia, Canada.

PMID: 36682740
DOI: 10.1111/risa.14100

Abstract

Underlying information about failure, including observations made in free text, can be a good source for understanding, analyzing, and extracting meaningful information for determining causation. The unstructured nature of natural language expression demands an advanced methodology to identify its underlying features. There is no available solution for utilizing unstructured data for risk assessment purposes. Because relevant data are scarce, textual data can be a vital learning source for developing a risk assessment methodology. This work addresses the knowledge gap in extracting relevant features from textual data to develop cause-effect scenarios with minimal manual interpretation. This study applies natural language processing and text-mining techniques to extract features from past accident reports. The extracted features are transformed into parametric form with the help of fuzzy set theory and utilized in Bayesian networks as prior probabilities for risk assessment. The proposed methodology is applied to microbiologically influenced corrosion-related incident reports available from the Pipeline and Hazardous Materials Safety Administration database. In addition, the trained named entity recognition (NER) model is verified on eight incidents, showing a promising preliminary result for identifying all relevant features from textual data and demonstrating the robustness and applicability of the NER method. The proposed methodology can be used in domain-specific risk assessment to analyze, predict, and prevent future mishaps, improving overall process safety.

Keywords: Bayesian network (BN); data; microbiologically influenced corrosion (MIC); named entity recognition (NER); natural language processing (NLP); process safety; risk assessment; text mining; unstructured.
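
The three stages named in the abstract (feature extraction from text, fuzzy transformation into parametric form, and use as Bayesian network priors) can be illustrated with a minimal sketch. It is not the authors' implementation, and every specific in it is an assumption: a keyword lexicon stands in for the trained NER model, the four narratives are invented for illustration and do not come from any database, the linguistic scale and its cut-offs are arbitrary, centroid defuzzification is one of several possible choices, and the effect node uses a noisy-OR gate with made-up link strengths.

```python
from itertools import product

# Hypothetical keyword lexicon standing in for a trained NER model.
LEXICON = {
    "stagnant_water": ("stagnant water", "water accumulation"),
    "bacteria": ("bacteria", "microbial"),
}

# Invented narratives for illustration only; not taken from any database.
REPORTS = [
    "Internal pitting found near a low spot with stagnant water.",
    "Lab analysis confirmed sulfate-reducing bacteria in the deposit.",
    "Water accumulation and microbial activity were noted at the leak site.",
    "External coating damage caused by third-party excavation.",
]

# Assumed linguistic scale: term -> triangular fuzzy number (a, m, b).
SCALE = {
    "low": (0.0, 0.1, 0.3),
    "medium": (0.2, 0.5, 0.8),
    "high": (0.7, 0.9, 1.0),
}


def extract_features(text):
    """Return the set of lexicon labels whose phrases occur in the text."""
    lowered = text.lower()
    return {label for label, phrases in LEXICON.items()
            if any(phrase in lowered for phrase in phrases)}


def linguistic_term(fraction):
    """Map a relative frequency to a linguistic term (assumed cut-offs)."""
    if fraction < 0.25:
        return "low"
    if fraction < 0.6:
        return "medium"
    return "high"


def defuzzify(tfn):
    """Centroid of a triangular fuzzy number (a, m, b)."""
    a, m, b = tfn
    return (a + m + b) / 3.0


def priors_from_reports(reports):
    """Count feature mentions per report and turn them into crisp priors."""
    counts = {label: 0 for label in LEXICON}
    for report in reports:
        for label in extract_features(report):
            counts[label] += 1
    return {label: defuzzify(SCALE[linguistic_term(n / len(reports))])
            for label, n in counts.items()}


def marginal_effect(priors, strengths, leak=0.0):
    """P(effect) for a noisy-OR node, by enumerating all parent states."""
    labels = list(priors)
    total = 0.0
    for states in product([True, False], repeat=len(labels)):
        p_states = 1.0
        p_no_effect = 1.0 - leak
        for label, on in zip(labels, states):
            p_states *= priors[label] if on else 1.0 - priors[label]
            if on:
                p_no_effect *= 1.0 - strengths[label]
        total += p_states * (1.0 - p_no_effect)
    return total


priors = priors_from_reports(REPORTS)
strengths = {"stagnant_water": 0.6, "bacteria": 0.8}  # assumed link strengths
p_mic = marginal_effect(priors, strengths)
print(priors, round(p_mic, 3))
```

In practice the lexicon lookup would be replaced by a trained NER model and the enumeration by a Bayesian network library; the sketch only shows how text-derived counts could flow through a fuzzy scale into prior probabilities.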
