A synthetic dataset of different chart types for advancements in chart identification and visualization

Data Brief. 2024 Feb 21;53:110233. doi: 10.1016/j.dib.2024.110233. eCollection 2024 Apr.

Abstract

We introduce a curated synthetic chart dataset intended to support algorithm development in data visualization and interpretation. The dataset, designed for training and testing, covers a range of chart types, including but not limited to Area, Bar, Box, Donut, Line, Pie, and Scatter. Data collection relies on a fully automatic low-level algorithm focused on extracting graphical elements. For efficiency, the algorithm requires that input images not feature three-dimensional representations or 3D effects and that each image contain a single chart. The dataset is divided into training and testing subsets, which are further subdivided by resolution and chart type. The dataset has substantial reuse potential: researchers can use it to train and test deep models for chart classification and interpretation and to improve the adaptability of their algorithms. It also provides a benchmark for evaluating how systems handle diverse chart visualizations, allowing direct comparisons. Because it covers multiple chart types and resolutions, it offers a standardized platform for assessing and comparing how effectively different systems understand and decompose visualizations [1,2,3].
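As an illustration of how the split / resolution / chart-type organization described above might be consumed, the following is a minimal Python sketch. The directory layout `root/<split>/<resolution>/<chart_type>/<file>`, the folder names, the image extensions, and the helper names `index_charts` and `count_by_split_and_type` are all assumptions made for this example; the abstract does not specify them, so the paths should be adapted to the actual release.

```python
from collections import Counter
from pathlib import Path

# Assumed image extensions; the actual dataset may use others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def index_charts(root):
    """Return (split, resolution, chart_type, path) tuples for images under root.

    Assumes a hypothetical layout root/<split>/<resolution>/<chart_type>/<file>.
    Files that do not sit at exactly that depth are skipped.
    """
    root = Path(root)
    records = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        parts = path.relative_to(root).parts
        if len(parts) != 4:
            continue  # does not match the assumed depth
        split, resolution, chart_type, _ = parts
        records.append((split.lower(), resolution, chart_type.lower(), path))
    return records


def count_by_split_and_type(records):
    """Count images per (split, chart_type) pair."""
    return Counter((split, chart_type) for split, _, chart_type, _ in records)
```

Calling `count_by_split_and_type(index_charts("path/to/dataset"))` would then give per-class image counts for each subset, which is a quick way to check class balance before training a classifier.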

Keywords: Chart analysis; Chart classification; Chart recognition; Document analysis; Graphics recognition; Text recognition and classification.