# Unraveling the Complexities of Genomics and High-Dimensional Data
## Understanding the Challenges of Big Data in Genomics
In this installment of my series on Mathematical Statistics and Machine Learning for Life Sciences, I delve into some complex analytical techniques prevalent in Computational Biology. The advent of genome-wide genotyping and whole-genome sequencing (WGS) has dramatically expanded the scale of genetic studies in the Life Sciences. However, it has also produced an explosion of high-dimensional genetic data that suffers from the Curse of Dimensionality. In the following sections, I provide a theoretical framework for this issue and explain why meaningful and reliable analyses are particularly difficult in modern Genetics and Genomics.
## The Evolution of Genomic Analysis
Anyone involved in Genomics research has likely heard seasoned colleagues reminisce about simpler times: "I recall when I analyzed only 10 genetic variants; now we handle millions thanks to WGS technologies." Each time I hear this, I am reminded of how manageable a mere 10 variants were compared to the overwhelming complexity of analyzing millions of them today. Because I tend to think of data in terms of matrix dimensions, this shift in scale is exactly what fuels my concerns about the reliability of modern genomic research.
The effectiveness of an analysis method often hinges on the ratio of N (the number of observations or samples) to P (the number of variables, measurements, or features). This ratio differs sharply between genetic variation data and single-cell RNA sequencing (scRNAseq) data, even though the two matrices can contain a comparable total number of elements. In a typical WGS project, we might sequence approximately 1,000 samples (as in the 1000 Genomes Project) while uncovering millions of mutational differences, so the feature space becomes million-dimensional. Conversely, single-cell sequencing sees sample sizes soaring to millions of cells, but the number of genes detected per cell often caps at around 1,000 for high-throughput technologies like 10X Genomics. Consequently, even with similar data volumes, the data structures for genetic variation and scRNAseq are fundamentally different. I classify scRNAseq as Big Data, offering flexibility and statistical power, while I view genetic variation as Little Data due to the numerous challenges posed by its high-dimensional nature.
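To make this contrast concrete, here is a minimal Python sketch comparing the two matrix shapes; the sample and feature counts are assumed orders of magnitude for illustration, not figures from any specific study.

```python
# Illustrative shapes: genetic variation (WGS) vs. single-cell RNAseq.
# All counts below are assumed orders of magnitude, not real study figures.
wgs_samples, wgs_variants = 1_000, 10_000_000    # N samples x P variants
sc_cells, sc_genes = 1_000_000, 1_000            # N cells x P genes

for name, n, p in [("WGS genetic variation", wgs_samples, wgs_variants),
                   ("scRNAseq", sc_cells, sc_genes)]:
    print(f"{name:22s} N = {n:>9,}, P = {p:>10,}, "
          f"elements = {n * p:,}, N/P = {n / p:g}")
```

Both matrices hold the same 10^10 elements, yet their N/P ratios differ by seven orders of magnitude, which is exactly the difference between Big Data and Little Data in the sense above.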
## Challenges of Sparse Genomics Data
One striking effect of the Curse of Dimensionality in Genomics is the sparsity of the data. Sparsity refers to insufficient statistical observations for certain variable ranges, often due to inadequate sampling. In Genomics, this arises from the difficulty of identifying carriers of rare alleles. For instance, if we assess three single-nucleotide polymorphisms (SNPs) with a minor allele frequency (MAF) of 10% in 100 individuals, we expect only about 10 to 20 carriers of the minor allele per SNP (roughly 2 × 100 × 0.1 = 20 minor alleles under Hardy-Weinberg equilibrium), and far fewer individuals carrying combinations of minor alleles across SNPs.
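Here is a minimal simulation sketch of this sparsity effect, assuming Hardy-Weinberg genotype frequencies and using the sample size and MAF from the example above:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_snps, maf = 100, 3, 0.10

# Diploid genotypes (0, 1 or 2 copies of the minor allele) sampled
# under Hardy-Weinberg equilibrium: genotype ~ Binomial(2, MAF).
genotypes = rng.binomial(2, maf, size=(n_samples, n_snps))

carriers = genotypes > 0   # carries at least one minor allele
print("Carriers per SNP:", carriers.sum(axis=0))
print("Carriers of SNP1 & SNP2:", (carriers[:, 0] & carriers[:, 1]).sum())
print("Carriers of all three SNPs:", carriers.all(axis=1).sum())
```

In a typical run there are roughly 15 to 20 carriers per SNP, only a handful of individuals carrying two minor alleles simultaneously, and often none carrying all three.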
Consequently, we cannot effectively link traits in our sample of 100 individuals to all three SNPs simultaneously. Even when we restrict the analysis to phenotypes versus just two SNPs, we may end up with only a single individual carrying both minor alleles, which makes any robust statistical computation impossible.
## Illustrating the Missing Heritability Problem
In practice, our best option often involves analyzing one SNP at a time, akin to Genome-Wide Association Studies (GWAS). This singular focus means we might miss the interactions between variants, as a phenotype could result from the co-occurrence of minor alleles across several variants rather than from individual SNP effects. To account for the cumulative impacts of multiple SNPs, researchers often create Polygenic Risk Scores (PRS), which can be validated against phenotypes in independent samples.
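As a rough sketch of the idea, an additive PRS is simply a weighted sum of minor-allele counts, with per-SNP weights taken from GWAS effect sizes; everything below is simulated placeholder data rather than real summary statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_snps = 500, 1_000

genotypes = rng.binomial(2, 0.2, size=(n_samples, n_snps))  # minor-allele counts
betas = rng.normal(0, 0.05, size=n_snps)                    # per-SNP GWAS effect sizes

# Additive polygenic risk score: weighted sum of minor-allele counts.
# Note that there are no interaction (epistasis) terms in this construction.
prs = genotypes @ betas
print("PRS for the first 5 individuals:", np.round(prs[:5], 3))
```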
However, this leads to the Missing Heritability problem, characterized by a failure to reliably predict phenotypes from genetic variation data. In simpler terms, while individual genetic variants may show associations with specific traits, they collectively fail to explain a significant amount of phenotypic variation when aggregated into a PRS.
Despite various proposed explanations for the Missing Heritability problem, a fundamental issue is that GWAS typically identify causal variants one by one, lacking the power to assess multiple SNPs simultaneously. As a result, these variants are combined in an additive manner in PRS, overlooking potential epistatic interactions that contribute to phenotypic variation.
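The following toy example, built on purely simulated genotypes with an assumed two-SNP interaction, shows how a signal carried entirely by epistasis can be invisible to one-SNP-at-a-time testing:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000
g1 = rng.binomial(2, 0.3, n).astype(float)  # minor-allele counts, SNP1
g2 = rng.binomial(2, 0.3, n).astype(float)  # minor-allele counts, SNP2

# Phenotype driven purely by a SNP-SNP interaction: centering the genotypes
# makes the marginal (single-SNP) effects vanish exactly.
interaction = (g1 - g1.mean()) * (g2 - g2.mean())
phenotype = interaction + rng.normal(0, 1.0, n)

for name, x in [("SNP1 alone", g1), ("SNP2 alone", g2),
                ("SNP1 x SNP2 interaction", interaction)]:
    r = np.corrcoef(x, phenotype)[0, 1]
    print(f"{name:25s} correlation with phenotype: {r:5.2f}")
```

Each SNP on its own shows essentially zero correlation with the phenotype, while the interaction term carries a clear signal that an additive PRS would never capture.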
Moreover, the Curse of Dimensionality necessitates adjustments for multiple testing to mitigate false-positive findings, but this does not entirely resolve the issue. As highlighted by Naomi Altman et al. in a Nature Methods article, even after controlling the false-discovery rate (FDR) at 5%, the absolute number of false positives can remain large when millions of tests are performed.
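A small simulation, with all numbers assumed purely for illustration, makes the point: even after Benjamini-Hochberg control of the FDR at 5%, the discoveries still contain a substantial number of false positives in absolute terms.

```python
import numpy as np

rng = np.random.default_rng(2)
m, m_real = 100_000, 1_000   # tests in total; tests with a real effect

# Null p-values are uniform; p-values of real effects are skewed toward zero.
pvals = np.concatenate([rng.uniform(size=m - m_real),
                        rng.beta(0.05, 1.0, size=m_real)])
is_real = np.arange(m) >= m - m_real

# Benjamini-Hochberg at FDR q = 5%: reject the k smallest p-values,
# where k is the largest i with p_(i) <= (i / m) * q.
q = 0.05
order = np.argsort(pvals)
below = pvals[order] <= (np.arange(1, m + 1) / m) * q
k = below.nonzero()[0].max() + 1 if below.any() else 0
reject = np.zeros(m, dtype=bool)
reject[order[:k]] = True

print("Discoveries:", reject.sum())
print("False positives among them:", (reject & ~is_real).sum())
```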
Thus, even well-powered GWAS sometimes yield implausibly strong genetic signals for traits such as "number of vehicles in household" or "time spent watching television," which may be confounded by factors like population structure.
## The Limitations of Genetic Predictions
If we leverage genetic variants identified in GWAS for predicting common diseases, such as Type 2 Diabetes or Abdominal Aortic Aneurysm, we might find limited success compared to predictions based solely on clinical factors.
Additionally, recent research indicates that the predictive power of microbiome data surpasses that of GWAS studies in differentiating complex human diseases.
The video titled "CSHL Keynote, Dr. Karen Miga, UCSC Genomics Institute" offers insights into the latest advancements in genomics, emphasizing the importance of comprehensive data analysis.
## Understanding the Curse of Dimensionality
To better grasp the Curse of Dimensionality, let's visualize high-dimensional genetic variation data. If we simulate data uniformly distributed within a p-dimensional ball for increasing p, we observe that the data points concentrate ever closer to the surface of the ball, leaving its center hollow.
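A minimal sketch of that simulation: a point uniform in a unit p-ball can be generated by drawing a Gaussian direction and scaling the radius by U^(1/p).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Uniform sampling inside a unit p-ball: a normalized Gaussian vector gives
# the direction, and radius U**(1/p) fills the volume uniformly.
for p in [2, 10, 100, 1000]:
    direction = rng.normal(size=(n, p))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    radius = rng.uniform(size=(n, 1)) ** (1.0 / p)
    r = np.linalg.norm(direction * radius, axis=1)
    print(f"p = {p:>4}: median radius = {np.median(r):.3f}, "
          f"5th percentile = {np.percentile(r, 5):.3f}")
```

Already at p = 1000, 95% of the points lie at a radius greater than about 0.997: the ball is effectively hollow.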
Additionally, a histogram of pairwise Euclidean distances reveals that as dimensionality rises, points become increasingly equidistant. The mean pairwise Euclidean distance grows with the square root of dimensionality, indicating that points drift further apart in high-dimensional spaces.
Empirical demonstrations of this phenomenon show that as we compute distances between data points, the disparity between the closest and farthest neighbors diminishes, complicating the grouping of "good," "bad," and "ugly" data points based on similarity metrics. For genetic variation data, this results in challenges when attempting to distinguish between "sick" and "healthy" individuals based on genetic information, leading to poor trait predictions and exacerbating the Missing Heritability Problem.
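Both effects are easy to reproduce for points drawn uniformly from the unit hypercube, where the mean pairwise Euclidean distance is approximately sqrt(p/6); the sample size below is an arbitrary choice.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
n = 500

for p in [2, 10, 100, 1000]:
    x = rng.uniform(size=(n, p))  # points in the unit hypercube
    d = pdist(x)                  # all pairwise Euclidean distances
    contrast = (d.max() - d.min()) / d.min()
    print(f"p = {p:>4}: mean distance = {d.mean():6.2f} "
          f"(sqrt(p/6) = {np.sqrt(p / 6):5.2f}), "
          f"relative contrast = {contrast:9.3f}")
```

As p grows, the mean distance tracks sqrt(p/6) while the relative contrast between the nearest and farthest pairs collapses toward zero, which is precisely why similarity-based grouping degrades.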
## Addressing the Curse of Dimensionality
To mitigate the negative impacts of the Curse of Dimensionality in genomic data, I propose four strategies:
- Increase Sample Size: While this brute-force method isn't always feasible, it remains the most straightforward approach.
- Regularization: Techniques like LASSO, Ridge, and Elastic Net should be employed more frequently in genetic research (see the sketch after this list).
- Dimensionality Reduction: Implementing methods such as PCA and UMAP can effectively address the Curse of Dimensionality.
- Utilize Bayesian Statistics: The incorporation of priors can offer regularization benefits, reducing overfitting risks in high-dimensional genetic data.
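As a minimal sketch of the regularization strategy, here is LASSO (scikit-learn's LassoCV) applied to simulated genotypes where P is much larger than N and only a handful of SNPs are truly causal; all parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
n_samples, n_snps, n_causal = 300, 5_000, 10   # P >> N, as in genetic data

genotypes = rng.binomial(2, 0.2, size=(n_samples, n_snps)).astype(float)
betas = np.zeros(n_snps)
betas[:n_causal] = rng.normal(0, 1.0, n_causal)  # a few truly causal SNPs
phenotype = genotypes @ betas + rng.normal(0, 1.0, n_samples)

# L1 regularization shrinks most coefficients to exactly zero,
# selecting a sparse subset of SNPs despite P >> N.
model = LassoCV(cv=5).fit(genotypes, phenotype)
selected = np.flatnonzero(model.coef_)
print(f"SNPs selected: {len(selected)} of {n_snps}")
print(f"Causal SNPs recovered: {np.isin(np.arange(n_causal), selected).sum()} of {n_causal}")
```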
## In Summary
This article illustrates the significant challenges faced in conducting robust analyses of genetic variation data due to its inherently high-dimensional nature, which is affected by the Curse of Dimensionality. Low sample sizes and data sparsity often culminate in the Missing Heritability problem, where genetic variants collectively explain only a small fraction of the phenotypic variation in common traits.
High-dimensional data behaves counterintuitively: data points become nearly equidistant and correlations weaken, which creates substantial obstacles for effective clustering and analysis.
As always, feel free to share any topics in Life Sciences and Computational Biology that intrigue you in the comments below, and I will strive to explore them in future columns. You can access the complete notebook on my GitHub, and follow me on Medium (Nikolay Oskolkov), Twitter (@NikolayOskolkov), and LinkedIn. In the next post, we will delve into clustering within UMAP space—stay tuned!