Learning predictive vine copula models for complex plant traits

Project Description

Feeding a growing global population requires crop varieties that are resilient, high-yielding, and adapted to diverse environments. Predicting plant performance from genomic data is central to modern breeding but remains challenging due to the high dimensionality of genomic information and the complex, nonlinear relationships across traits, environments, and plant populations. This project introduces predictive vine copula models as a new framework for genomic prediction, one capable of flexibly modeling dependence structures beyond the reach of standard linear approaches. We develop scalable methods for high-dimensional vine copula (quantile) regression and demonstrate their usefulness for identifying influential single-nucleotide polymorphisms (SNPs) and improving prediction accuracy for multiple maize traits. Our approach represents the first application of vine copulas in genomic prediction and provides plant breeders with more powerful tools to unlock complex trait architectures.

Results

Developed two new high-dimensional sparse vine copula regression methods, vineregRes and vineregParCor, that scale with computational complexity O(p²), improving substantially over existing O(p³) approaches.
Introduced definitions of relevant, redundant, and irrelevant variables for quantile regression, with illustrative examples.
Showed through simulation studies that our methods perform strongly in variable selection, prediction accuracy, and computational speed in sparse high-dimensional settings.
Demonstrated that existing methods degrade when redundant but relevant variables accumulate, while one of our approaches maintains the best accuracy as measured by pinball loss.
Applied methods to a large maize genomic dataset with 501,124 SNPs, identifying key predictors for four agronomic traits (PH V4/V6, FF, MF).
Achieved superior prediction and feature-selection performance compared to linear and conventional genomic prediction models.
Implemented the R package sparsevinereg.
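
For readers unfamiliar with the pinball loss used above to compare quantile predictions, the following is a minimal Python sketch of its standard textbook definition. It is not code from the sparsevinereg package; the function name pinball_loss and the toy numbers are assumptions made purely for illustration.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Average pinball (quantile) loss at quantile level alpha.

    Illustrative helper only (hypothetical name, not part of sparsevinereg).
    y_true: observed responses; y_pred: predicted alpha-quantiles.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction (diff > 0) is weighted by alpha,
        # over-prediction (diff < 0) by 1 - alpha.
        total += max(alpha * diff, (alpha - 1.0) * diff)
    return total / len(y_true)


# Toy values, invented for this example (not data from the study).
observed = [1.0, 2.0, 3.0, 4.0]
predicted_median = [1.5, 2.0, 2.0, 5.0]
loss = pinball_loss(observed, predicted_median, alpha=0.5)
print(loss)
```

At alpha = 0.5 the loss reduces to half the mean absolute error. For other levels it penalizes errors asymmetrically, which is why lower values indicate better-calibrated conditional quantile predictions.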

Follow-up

Future work may focus on refining feature-extraction strategies to further improve genomic prediction. This could include selecting appropriate SNP weights for estimating their latent variables, choosing the SNP group size G via cross-validation, adapting the P-value screening threshold to consider all possible extracted features, developing post-processing features for additional feature extraction, and extending the variable selection to support more flexible vine tree structures.

Özge Sahin, Claudia Czado, High-dimensional sparse vine copula regression with application to genomic prediction, Biometrics, Volume 80, Issue 1, March 2024, ujad042, https://doi.org/10.1093/biomtc/ujad042