The Misleading Nature of PCA: A Closer Look at Its Use in Science
Principal Component Analysis (PCA) is an established tool in machine learning, used mainly to reduce the dimensionality of data sets. However, recent work suggests that its results can be inconsistent and unreliable, which may contribute to the ongoing reproducibility crisis in scientific research.
The Reproducibility Crisis and PCA in Genetics
The reproducibility crisis in science has gained increased attention in recent years. Various factors contribute to this issue, and the misuse of machine learning techniques is frequently cited as one of them.
One significant challenge is the often uncurated nature of the data. In cancer research, for instance, data access can be limited, and much of the information remains undocumented. Furthermore, not all research teams make their data available, often reserving it for future publications.
Another concern is that data science, now widely employed in biological and genetic studies, often involves researchers who lack formal training in these areas. Moreover, published code is frequently left unmaintained, which makes reproducing results a daunting task. In many cases the documentation is inadequate, and contacting the original authors is not feasible because they have left the lab.
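Additionally, systematic biases can arise when algorithms are applied incorrectly. One notable issue, frequently cited in the literature, is data leakage, in which information from outside the training data (for example, the test set) influences how a model is built.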
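Because PCA is often used as a preprocessing step, it is one place where leakage can creep in. The following is a minimal sketch, assuming a plain NumPy implementation and randomly generated placeholder data; the helper names `fit_pca` and `transform` are hypothetical. The point it illustrates is that the mean and the principal axes are estimated from the training split only and then reused for the test split.

```python
import numpy as np

def fit_pca(X_train, k):
    """Estimate the centering vector and the first k principal axes from training data only."""
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:k]

def transform(X, mean, axes):
    """Project data onto axes that were fitted elsewhere (no re-fitting here)."""
    return (X - mean) @ axes.T

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 6))          # placeholder data: 100 samples, 6 features
X_train, X_test = X[:80], X[80:]       # split BEFORE fitting anything

mean, axes = fit_pca(X_train, k=2)     # the test rows never influence the fit
Z_train = transform(X_train, mean, axes)
Z_test = transform(X_test, mean, axes)

print(Z_train.shape, Z_test.shape)
```

Fitting PCA on the full matrix `X` before splitting would let the test samples shape the projection, which is the kind of leakage described above.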
Biological data are intricate and often suffer from high dimensionality, missing values, and confounding variables. Genetic datasets face similar issues, along with small sample sizes and poor study designs.
PCA is commonly employed in genetic studies because it simplifies clustering and is believed to reveal clusters in a plot of the first two principal components. Its use dates back to the early 1960s, and it is now a staple in numerous articles.
PCA's appeal for population geneticists lies in the assumption that distances between clusters reflect actual genetic and geographical distances.
Technically, PCA can be applied to numerical datasets of any size and always produces a result. However, it provides no significance measures or error estimates, which makes the quality of that result hard to assess.
> The only commonly accepted metric for evaluating PCA's quality is the proportion of explained variance, and there is no consensus on the optimal number of principal components to analyze.
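
To make the explained-variance metric concrete, here is a minimal sketch that computes it with NumPy. The toy data set is an assumption (random data built from two latent directions plus noise), and `explained_variance_ratio` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of the total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)                     # center every feature
    s = np.linalg.svd(Xc, compute_uv=False)     # singular values, largest first
    variances = s ** 2 / (X.shape[0] - 1)       # variance along each component
    return variances / variances.sum()

rng = np.random.default_rng(0)
# Toy data (assumption): 200 samples, 5 features, driven mostly by 2 latent directions
latent = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
X = latent + 0.1 * rng.normal(size=(200, 5))

ratio = explained_variance_ratio(X)
print("per component:", np.round(ratio, 3))
print("first two components:", round(float(ratio[:2].sum()), 3))
```

The ratio says how much variance a projection keeps, but nothing about whether the distances seen in the resulting plot are trustworthy, and it does not settle how many components to keep.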
Given these concerns, researchers from Lund University questioned whether PCA might contribute to the reproducibility crisis in scientific research.
Is PCA a Red Flag?
> PCA meets many risk criteria regarding reproducibility, and its prevalent use as a preliminary hypothesis generator in population genetics warrants a thorough evaluation of its reliability, robustness, and reproducibility. Testing PCA's accuracy necessitates a convincing model with clear truth.
The authors acknowledge that assessing PCA's reliability is difficult because it requires a known ground truth. To make the evaluation possible, they devised a simplified model in which each individual expresses three genes and is assigned a color according to its gene vector. The model stands in for population variation (SNPs) and provides a reference against which PCA's accuracy can be checked.
> When applied to this data, PCA reduces it to two dimensions, capturing most of the variation. This allows for visualization of true colors in PCA's 2D scatterplot while measuring distances and comparing them to actual 3D distances.
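
The idea can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' code: the four populations (red, green, blue, black), the ten identical individuals per population, and the helper `pairwise_distances` are all choices made here for simplicity.

```python
import numpy as np

# Hypothetical populations: each individual is an RGB vector standing in for three genes.
COLORS = {"red": (1, 0, 0), "green": (0, 1, 0), "blue": (0, 0, 1), "black": (0, 0, 0)}
names = list(COLORS)

# Assumption: ten identical individuals per population, no within-population noise.
X = np.array([COLORS[n] for n in names for _ in range(10)], dtype=float)

# PCA via SVD of the centered data; keep the first two principal axes.
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
axes = Vt[:2]

def pairwise_distances(P):
    """Euclidean distance between every pair of rows in P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

pops_3d = np.array([COLORS[n] for n in names], dtype=float)
pops_2d = (pops_3d - mean) @ axes.T          # where each population lands in the 2D plot

d3 = pairwise_distances(pops_3d)             # ground-truth distances
d2 = pairwise_distances(pops_2d)             # distances seen in the PCA scatterplot

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]:>5}-{names[j]:<5}  3D: {d3[i, j]:.3f}  2D: {d2[i, j]:.3f}")
```

In this balanced setup the three primary colors keep their true separation, while black is pulled toward the middle of the plot and appears closer to each of them than it is in 3D.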
The researchers also used three real human genotype datasets to carry out twelve common tasks in population genetics. They altered the population proportions, reran PCA, and visualized how the outcomes changed.
The findings revealed that distances varied based on the number of individuals in each cluster, indicating that excluding certain populations could skew results.
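The dependence on sample sizes can be reproduced with a toy color model. The sketch below is an assumption-laden illustration rather than the study's pipeline: the four colors, the sample counts, and the helper `projected_distances` are all hypothetical. It runs PCA twice on the same four populations, once with equal sample sizes and once with blue heavily oversampled, and compares the distances in the 2D projection.

```python
import numpy as np

# Hypothetical populations encoded as RGB "gene" vectors.
COLORS = {"red": (1, 0, 0), "green": (0, 1, 0), "blue": (0, 0, 1), "black": (0, 0, 0)}
NAMES = list(COLORS)

def projected_distances(sizes):
    """Pairwise population distances in the first two PCs, given samples per population."""
    X = np.array([COLORS[n] for n in NAMES for _ in range(sizes[n])], dtype=float)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    # Project one representative per population onto the first two principal axes.
    reps = (np.array([COLORS[n] for n in NAMES], dtype=float) - mean) @ Vt[:2].T
    diff = reps[:, None, :] - reps[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

balanced = {n: 10 for n in NAMES}
skewed = dict(balanced, blue=200)            # same populations, blue oversampled

d_balanced = projected_distances(balanced)
d_skewed = projected_distances(skewed)

r, b, k = NAMES.index("red"), NAMES.index("blue"), NAMES.index("black")
for label, d in (("balanced", d_balanced), ("skewed", d_skewed)):
    print(f"{label:>8}: red-black {d[r, k]:.3f}   blue-black {d[b, k]:.3f}")
```

In the original 3D space black is exactly one unit from both red and blue. With equal sample sizes the projection keeps those two distances equal to each other; once blue is oversampled, black appears closer to red than to blue even though no individual's genes have changed.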
For instance, the authors attempted to replicate a 2009 Nature study claiming that Indian populations are genetically distinct from European, Asian, and African groups. However, modifying data proportions resulted in inconsistent PCA outcomes, undermining its reliability.
> Contrary to the claims of some researchers, our examples demonstrate how PCA can yield conflicting and nonsensical scenarios—mathematically valid yet biologically implausible—highlighting the importance of prior knowledge in PCA interpretation. Presenting a single or few PCA plots without acknowledging alternative solutions or the proportion of explained variance is misleading.
In essence, altering data inputs can lead to significantly different conclusions drawn from PCA analyses.
Concluding Remarks
> The reproducibility crisis has prompted a rigorous examination of scientific tools and methodologies. Given PCA's significance in population genetics and its unproven reliability, we assessed its robustness and reproducibility across twelve test cases using a simplified color-based model with known population structures. PCA failed to meet all three criteria.
Understanding the reliability of genetic studies is crucial for clinical and biomedical research. This investigation reveals that a widely used method, PCA, may not be as robust or reproducible as assumed. Researchers should refrain from drawing definitive conclusions based solely on PCA outputs.
Moreover, if PCA itself lacks reliability, the clusters derived from it may also be flawed, potentially leading to erroneous or even absurd conclusions. The authors warn that manipulating populations, sample sizes, and markers can yield numerous conflicting scenarios, which is especially concerning in light of the prevalence of cherry-picking in research.
While PCA remains a valuable tool for preliminary data exploration, making inferences based on PCA alone is ill-advised. The authors liken PCA scatterplots to Rorschach tests, suggesting that interpretations vary based on individual biases.