Geographically distributed data management to support large-scale data analysis

Tamer Z Emara; Thanh Trinh; Joshua Zhexue Huang

doi:10.1038/s41598-023-44789-x

Sci Rep. 2023 Oct 18;13(1):17783. doi: 10.1038/s41598-023-44789-x.

Authors

Tamer Z Emara¹, Thanh Trinh²,³, Joshua Zhexue Huang⁴,⁵

Affiliations

¹ Faculty of Computers and Artificial Intelligence, Damietta University, New Damietta, 34519, Egypt. temara@du.edu.eg.
² Faculty of Computer Science, Phenikaa University, Ha Dong, 12116, Hanoi, Vietnam.
³ Phenikaa Research and Technology Institute (PRATI), A&A Green Phoenix Group JSC, Cau Giay, 11313, Hanoi, Vietnam.
⁴ National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, 518060, China.
⁵ Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060, China.

Abstract

Many companies now prefer to store their data in multiple data centers with replication, for several reasons. Data that spans several data centers ensures the fastest possible response time for customers and staff who are geographically separated. It also protects the information from loss if a single data center experiences a disaster. However, the amount of data is increasing rapidly, which creates challenges in storage, analysis, and other processing tasks. In this paper, we propose and design a geographically distributed data management framework to manage massive data stored and distributed among geo-distributed data centers. The goal of the proposed framework is to enable efficient use of the distributed data blocks for various data analysis tasks. The architecture of the proposed framework consists of a grid of geo-distributed data centers connected to a data controller (DCtrl). The DCtrl is responsible for organizing and managing the block replicas across the geo-distributed data centers. We use BDMS as the system installed on the distributed data centers. BDMS stores a big data file as a set of random sample data blocks, each being a random sample of the whole data file. The DCtrl then distributes these data blocks to multiple data centers with replication. To analyze a big data file distributed under the proposed framework, we randomly select, on any one data center, a sample of the data blocks replicated from other data centers. We use simulation results to demonstrate the performance of the proposed framework in big data analysis across geo-distributed data centers.
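
The workflow described in the abstract (random sample data blocks, replicated placement by a controller, and block-level sampling on one data center) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the shuffle-and-split partitioning, the uniformly random replica placement, and the data center labels are all assumptions made only for illustration, and the abstract does not specify how BDMS or the DCtrl actually perform these steps.

```python
import random


def make_random_sample_blocks(records, num_blocks, seed=0):
    """Hypothetical stand-in for BDMS block generation: shuffle the records,
    then deal them round-robin so each block is a random sample of the file."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def distribute_with_replication(num_blocks, centers, replication, seed=0):
    """Hypothetical stand-in for the DCtrl: place each block on `replication`
    distinct data centers chosen uniformly at random (an assumed policy)."""
    rng = random.Random(seed)
    placement = {center: [] for center in centers}
    for block_id in range(num_blocks):
        for center in rng.sample(centers, replication):
            placement[center].append(block_id)
    return placement


def sample_local_blocks(placement, center, k, seed=0):
    """Randomly pick up to k of the block ids stored on one data center."""
    rng = random.Random(seed)
    local = placement[center]
    return rng.sample(local, min(k, len(local)))


# Usage: estimate the mean of a data set from a few blocks on one center.
records = range(1000)
blocks = make_random_sample_blocks(records, num_blocks=10)
placement = distribute_with_replication(10, ["dc-a", "dc-b", "dc-c"], replication=2)
chosen = sample_local_blocks(placement, "dc-a", k=3)
values = [x for block_id in chosen for x in blocks[block_id]]
if values:
    print("blocks used:", chosen, "estimated mean:", sum(values) / len(values))
```

Because every block is itself a random sample of the whole file, a statistic computed from a few locally available blocks approximates the statistic over the full data set, which is why the analysis in this sketch never has to fetch blocks from another data center.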