Equalden.HD: An R Package for testing the equality of a high dimensional set of densities

Comput Methods Programs Biomed. 2022 Apr:217:106694. doi: 10.1016/j.cmpb.2022.106694. Epub 2022 Feb 16.

Abstract

Background and objective: The "low sample size, large dimension" scenario is often encountered in genetics and in the omic sciences, where microarray data typically consist of a large number of small, possibly dependent samples. Standard methods for the k-sample problem are of limited applicability in such a setting due to a lack of theoretical validation for large k, lengthy computational times, missing software solutions, or an inability to deal with statistical dependence among the samples. This paper presents the R package Equalden.HD, which is designed to overcome these limitations.

Methods: The package implements several tests for the null hypothesis that a large number of samples follow a common density. These methods are particularly well suited to the "low sample size, large dimension" setting. The implemented procedures allow for dependent samples. For each method Equalden.HD reports, among other things, the standardized value of the test statistic and the corresponding p-value. The package also includes two high-dimensional genetic data sets, Hedenfalk and Rat, which are used in this paper for illustration purposes.
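
As a rough illustration of how a standardized test statistic relates to its p-value, the following Python sketch assumes that the statistic is approximately standard normal under the null hypothesis and that large positive values count as evidence against it. Both points are assumptions made for this example, and the function names are hypothetical; this is not the Equalden.HD interface.

```python
import math


def upper_tail_p_value(z: float) -> float:
    """One-sided p-value P(Z >= z) for a standard normal Z.

    Assumption: the standardized statistic is approximately N(0, 1)
    under the null hypothesis of a common density.
    """
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def reject_null(z: float, alpha: float = 0.05) -> bool:
    """Reject the null when the one-sided p-value falls below alpha."""
    return upper_tail_p_value(z) < alpha


if __name__ == "__main__":
    # Hypothetical standardized values, for illustration only.
    for z in (0.5, 1.0, 3.0):
        print(z, upper_tail_p_value(z), reject_null(z))
```

Under these assumptions, a standardized value near zero gives a p-value near 0.5, while larger values give smaller p-values.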

Results: The usage of Equalden.HD is illustrated through the analysis of the Hedenfalk and Rat genetic data sets. Statistical dependence among the samples was found in both data sets. Applying an appropriate k-sample test from Equalden.HD led to rejection of the null hypothesis of inter-sample homogeneity. The methods were also used to test for within-group homogeneity in cluster analysis, which is usually performed when the k samples are found to be significantly different. Equalden.HD helped to identify the individuals that are responsible for the lack of homogeneity among the samples. The limitations of the standard Kruskal-Wallis test for the identification of homogeneous clusters are highlighted.

Conclusions: The methods implemented in Equalden.HD are the only omnibus nonparametric k-sample tests that have been validated as k grows. Furthermore, the package provides suitable corrections for possibly dependent samples, which is another distinctive feature. Thus, the package opens new doors for the statistical analysis of omic data. Limitations of standard methods (e.g. Anderson-Darling and Kruskal-Wallis) and of existing software solutions in the large-k setting are emphasized.

Keywords: k-sample problem; Dependent samples; High dimensional data; Kernel smoothing.

MeSH terms

  • Animals
  • Cluster Analysis
  • Rats
  • Sample Size
  • Software*