Effectiveness of PRS Models Across Populations for Cardiovascular Diseases

Intro

A polygenic risk score (PRS) estimates a person's genetic risk for a disease by summing the small effects of many variants across their genome. The approach was first applied to schizophrenia (SCZ), based on the hypothesis that the disorder is linked to many common genetic variants, and it has since become widely used for predicting various diseases. PRS works best for people of European ancestry because most genetic research data comes from European populations; its accuracy drops significantly in other populations.
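
To make the "adding up small effects" idea concrete, here is a minimal sketch of a PRS as a weighted sum of allele dosages. The function name, the toy dosages, and the effect sizes are all hypothetical and chosen only for illustration; they are not taken from our data.

```python
import numpy as np

def polygenic_risk_score(dosages, effect_sizes):
    """Return one score per individual: the weighted sum of allele dosages."""
    dosages = np.asarray(dosages, dtype=float)            # shape (n_individuals, n_snps)
    effect_sizes = np.asarray(effect_sizes, dtype=float)  # shape (n_snps,)
    return dosages @ effect_sizes

# Hypothetical toy data: 3 individuals, 4 SNPs, dosages coded as 0, 1, or 2 copies.
toy_dosages = [[0, 1, 2, 0],
               [1, 1, 0, 2],
               [2, 0, 1, 1]]
toy_effects = [0.10, -0.05, 0.20, 0.03]  # assumed per-allele effect sizes

toy_scores = polygenic_risk_score(toy_dosages, toy_effects)
print(toy_scores)
```

In practice the effect sizes come from a GWAS, which is why a score built from European effect sizes can transfer poorly to other populations.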

While scientists have made progress in improving polygenic risk scores (PRS) for different populations, most studies rely on simulated data rather than real-world genetic information, because researchers have limited access to large genetic datasets that include diverse ancestry groups. Additionally, most studies focus on overall PRS accuracy across the entire genome rather than looking at specific genes that may behave differently in different populations.

Our study aims to provide gene-level insights into ancestry-specific gene-disease associations, with a focus on heart failure. Unlike widely studied conditions like diabetes or schizophrenia, heart failure is understudied. It is complex and varies across populations, making it a good disease for testing PRS accuracy. To understand why PRS accuracy varies across populations, we computed gene expression prediction weights using publicly available European-ancestry gene expression data. We then applied these prediction weights to GWAS summary statistics for heart failure across different populations. This allowed us to analyze gene-disease associations using FUSION, a transcriptome-wide association study (TWAS) framework.

By comparing gene-disease associations across ancestry populations, we were able to identify specific genes where PRS predictions for heart failure may be biased due to population-specific genetic architecture. Our findings provide functional insights into PRS transferability, helping to explain how genetic regulation of disease-related genes differs across ancestries.

Data

Our analysis is based on two main types of genetic data:

European SNP Training Data

  • Source: 1000 Genomes Project

  • Includes:
    • LDREF genotype data
    • Corresponding gene expression data

  • Data Descriptions

    • LDREF genotype data
      • SNPs are filtered through pruning and thresholding.

      • Before filtering:
        • 489 individuals
        • 1,190,321 SNPs across 22 chromosomes
      • After filtering:
        • 343 individuals remained
        • 145,335 significant SNPs

    • Gene expression data
      • Covers 23,722 genes
      • 343 individuals had both genetic and expression data


  • In addition to the filtering above, principal component analysis (PCA) was performed.
PCA by Population

To reduce the impact of potential multicollinearity among SNPs, principal component analysis is applied to reduce the dimensionality of the genotype data.
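
As an illustration of this step, the sketch below runs PCA on a genotype matrix through a singular value decomposition. It assumes genotypes are coded as allele counts (0, 1, 2); the function name and the random toy matrix are hypothetical, and this is not the exact pipeline we ran.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """Project individuals onto the top principal components of a genotype matrix."""
    g = np.asarray(genotypes, dtype=float)      # shape (n_individuals, n_snps)
    g = g - g.mean(axis=0)                      # center each SNP
    sd = g.std(axis=0)
    g = g[:, sd > 0] / sd[sd > 0]               # drop monomorphic SNPs, scale the rest
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    pcs = u[:, :n_components] * s[:n_components]    # individual coordinates on each PC
    explained = (s ** 2) / np.sum(s ** 2)           # share of variance per component
    return pcs, explained[:n_components]

# Hypothetical toy data: 20 individuals, 50 SNPs with random allele counts.
rng = np.random.default_rng(0)
toy_genotypes = rng.integers(0, 3, size=(20, 50))
toy_pcs, toy_explained = genotype_pca(toy_genotypes, n_components=3)
print(toy_pcs.shape, toy_explained)
```

The resulting components are uncorrelated with each other by construction, which is what makes them useful when the original SNP columns are collinear.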

GWAS Data for Cardiovascular Disease

This dataset focuses on heart failure and includes genetic information from different population groups. Each SNP has a p-value, showing how strongly it is linked to heart failure. The total number of SNPs varies based on the size of each population group.

  • Populations and corresponding number of SNPs:

    Population    Number of SNPs
    American           5,761,787
    East Asian         8,121,472
    African           16,745,089
    European          21,705,455

  • Source: Global Biobank Meta-analysis Initiative (GBMI)

  • Purpose:
    • To identify genetic risk factors for heart failure
    • To improve statistical power to detect meaningful genetic associations

Methods

Method Flowchart

Compute SNP Weights


Model weights (i.e. SNP weights) for each of the 23,722 genes are then computed from the standardized gene expression levels using FUSION.compute_weights.R. Three models are fitted: Top1, LASSO, and Elastic Net.
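
For intuition about what these models produce, the sketch below computes weights for a single gene on toy data. It is a simplified stand-in written with NumPy, not the code inside FUSION.compute_weights.R: the Top1 version keeps only the SNP with the largest absolute marginal effect, and the elastic net is fitted by plain coordinate descent (setting alpha=1.0 gives LASSO). All function names, the penalty values, and the simulated data are assumptions.

```python
import numpy as np

def standardize(a):
    """Center to mean 0 and scale to unit variance along the first axis."""
    a = np.asarray(a, dtype=float)
    return (a - a.mean(axis=0)) / a.std(axis=0)

def top1_weights(x, y):
    """Simplified Top1: keep only the SNP with the largest absolute marginal effect."""
    marginal = x.T @ y / len(y)        # per-SNP correlation for standardized inputs
    w = np.zeros(x.shape[1])
    best = int(np.argmax(np.abs(marginal)))
    w[best] = marginal[best]
    return w

def elastic_net_weights(x, y, lam=0.1, alpha=0.5, n_iter=200):
    """Coordinate descent for
    (1/2n)*||y - xw||^2 + lam*(alpha*||w||_1 + (1 - alpha)/2*||w||^2).
    Assumes the columns of x are standardized."""
    n, p = x.shape
    w = np.zeros(p)
    residual = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            residual += x[:, j] * w[j]                  # take SNP j out of the fit
            rho = x[:, j] @ residual / n
            shrunk = np.sign(rho) * max(abs(rho) - lam * alpha, 0.0)  # soft threshold
            w[j] = shrunk / (1.0 + lam * (1.0 - alpha))
            residual -= x[:, j] * w[j]                  # put the updated SNP j back
    return w

# Hypothetical toy gene: 100 individuals, 10 SNPs, expression driven by SNP index 2.
rng = np.random.default_rng(1)
snps = standardize(rng.integers(0, 3, size=(100, 10)))
expression = standardize(0.8 * snps[:, 2] + rng.normal(scale=0.5, size=100))

w_top1 = top1_weights(snps, expression)
w_lasso = elastic_net_weights(snps, expression, alpha=1.0)
w_enet = elastic_net_weights(snps, expression, alpha=0.5)
print(w_top1, w_lasso, w_enet, sep="\n")
```

The sparse models set many SNP weights exactly to zero, which helps when a gene's expression is driven by only a few nearby variants.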

Association Testing


Using the computed European weights, association tests are conducted between these weights and GWAS summary statistics for each population to assess the relationship between genetically predicted gene expression and the trait of interest. This analysis is performed using FUSION.assoc_test.R. The key outputs are the TWAS Z-scores and their corresponding p-values. A Z-score with a large magnitude indicates a strong association between predicted gene expression and the trait (heart failure), while the p-value reflects the statistical significance of this association.
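
The summary-statistic TWAS test is usually described as Z_TWAS = w'Z / sqrt(w'Σw), where w holds the SNP weights for a gene, Z holds the GWAS Z-scores for the same SNPs, and Σ is their LD (correlation) matrix. The sketch below is a minimal version of that formula with a two-sided normal p-value. It is not the FUSION.assoc_test.R implementation, and the function name and toy numbers are hypothetical.

```python
import math
import numpy as np

def twas_z(weights, gwas_z, ld):
    """Return (Z_TWAS, two-sided p-value) for one gene."""
    w = np.asarray(weights, dtype=float)   # SNP weights for the gene
    z = np.asarray(gwas_z, dtype=float)    # GWAS Z-scores for the same SNPs
    ld = np.asarray(ld, dtype=float)       # SNP-by-SNP correlation (LD) matrix
    z_twas = float(w @ z) / math.sqrt(float(w @ ld @ w))
    p_value = math.erfc(abs(z_twas) / math.sqrt(2.0))  # two-sided standard normal tail
    return z_twas, p_value

# Hypothetical toy gene with three SNPs.
toy_weights = [0.5, 0.0, -0.3]
toy_gwas_z = [2.0, 0.5, -1.5]
toy_ld = [[1.0, 0.2, 0.0],
          [0.2, 1.0, 0.1],
          [0.0, 0.1, 1.0]]

toy_z, toy_p = twas_z(toy_weights, toy_gwas_z, toy_ld)
print(toy_z, toy_p)
```

Because the weights and the LD reference are European while the GWAS Z-scores come from each target population, any mismatch in LD or regulatory architecture feeds directly into this statistic.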

Cross-Population Gene Association Analysis

With the computed Z-scores and p-values, several downstream analyses are performed to further characterize the genetic architecture of heart failure. These include:

  • Gene locus mapping: Identifies the genomic regions associated with significant genes (p ≤ 0.05).
  • Cross-population gene overlap analysis: Assesses shared and population-specific genetic associations.
  • Human genetic evidence scoring: Scores the relationship between genes and phenotypes based on their level of support from independent genetic datasets. Higher scores indicate stronger associations.
  • Gene Ontology (GO) analysis: Identifies biological pathways and molecular functions enriched among the significant genes.
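
The cross-population overlap step can be sketched with plain set operations: threshold each population's TWAS p-values, then compare the resulting gene sets. The gene labels and p-values below are made up for illustration, and the helper names are hypothetical.

```python
def significant_genes(p_values, threshold=0.05):
    """Return the set of genes whose TWAS p-value is at or below the threshold."""
    return {gene for gene, p in p_values.items() if p <= threshold}

def overlap_summary(sig_by_population):
    """Split significant genes into those shared by all populations and
    those found in exactly one population."""
    shared = set.intersection(*sig_by_population.values())
    specific = {}
    for pop, genes in sig_by_population.items():
        others = set().union(
            *(g for other, g in sig_by_population.items() if other != pop)
        )
        specific[pop] = genes - others
    return shared, specific

# Hypothetical TWAS p-values per population (placeholder gene labels).
toy_results = {
    "European":   {"GENE_A": 0.001, "GENE_B": 0.03, "GENE_C": 0.20},
    "African":    {"GENE_A": 0.04,  "GENE_B": 0.50, "GENE_D": 0.01},
    "East Asian": {"GENE_A": 0.02,  "GENE_C": 0.01, "GENE_D": 0.30},
}

toy_sig = {pop: significant_genes(res) for pop, res in toy_results.items()}
toy_shared, toy_specific = overlap_summary(toy_sig)
print(toy_shared, toy_specific)
```

The shared set corresponds to the center of a Venn diagram and the population-specific sets to its outer regions.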

Results

Figures: Manhattan plot, Venn diagram, gene loci, Miami plot, human genetic evidence (HuGE) scores, and GO enrichment.

Gene Ontology Enrichment Analysis (European)

Gene Ontology Enrichment Analysis (American)

Conclusion

Our study found that genetic models trained only on European data produced results that varied greatly across ancestry groups. The genes linked to heart failure in non-European populations were very different from those in Europeans; in fact, the most significant genes for heart failure in European populations were not significant in the other groups. Additionally, when we analyzed how these genes function in the body, the genes identified in Europeans showed strong connections to heart failure, while the genes identified in non-European groups had weaker associations. These findings suggest that relying only on European genetic data to predict heart failure risk in diverse populations can lead to inaccurate risk assessments and worsen health disparities.

Meet Our Team

Garvey Li

Researcher

Jason Hauk

Researcher

Yiting Bu

Researcher

Yosen Lin

Researcher

Dr. Amariuta

Mentor