Big Data Confidentiality: An Approach Toward Corporate Compliance Using a Rule-Based System

Big Data. 2023 Oct 31. doi: 10.1089/big.2022.0201. Online ahead of print.

Abstract

Organizations have been investing in analytics relying on internal and external data to gain a competitive advantage. However, legal and regulatory acts imposed nationally and internationally have become a challenge, especially for highly regulated sectors such as health or finance/banking. Data handlers such as Facebook and Amazon have already sustained considerable fines or are under investigation due to violations of data governance. The era of big data has further intensified the challenge of minimizing the risk of data loss by introducing the dimensions of Volume, Velocity, and Variety into confidentiality. Although Volume and Velocity have been extensively researched, Variety, "the ugly duckling" of big data, is often neglected and difficult to solve, thus increasing the risk of data exposure and data loss. To mitigate this risk, this article proposes a framework that uses algorithmic classification and workflow capabilities to provide a consistent approach toward data evaluations across the organization. A rule-based system, implementing the corporate data classification policy, will minimize the risk of exposure by helping users identify the approved guidelines and enforce them quickly. The framework includes an exception handling process with appropriate approval for extenuating circumstances. The system was implemented as a proof-of-concept working prototype to showcase the capabilities and provide hands-on experience. The information system was evaluated and accredited by a diverse audience of academics and senior business executives in the fields of security and data management. The audience had an average experience of ∼25 years and amassed a total experience of almost three centuries (294 years). The results confirmed that the 3Vs are of concern and that Variety, according to a 90% majority of the commentators, is the most troubling. In addition, with an approximate average of 60%, it was confirmed that appropriate policies, procedures, and prerequisites for classification are in place while implementation tools are lagging.

Keywords: anonymization; big data; data confidentiality; data exposure; data governance; data loss prevention; de-anonymization; depersonalization; reidentification; variety.