An Overview of Machine Learning Applications in Polygenic Risk Prediction Towards Personalized Medicine

  • Trinh Thi Xuan, Hanoi Open University
  • Ta Van Nhan
  • Hoang Do Thanh Tung
  • Truong Nam Hai
  • Tran Dang Hung
Keywords: Common diseases, Polygenic risk score, GWAS, SNPs, SNP array, Machine learning


Recently, the polygenic risk score (PRS) has come to be regarded as a promising tool for precision medicine. It is built on common genetic variants that each make a small to moderate contribution to the risk of a heritable disease, but whose aggregate can improve disease prediction in the population. Many machine learning methods have been proposed to improve the predictive ability of PRS, alongside efforts to bring PRS into clinical use. Nevertheless, how to select a method systematically, and which applications PRS is suited to, remain unclear. In this review, we therefore provide an overview of polygenic risk scores and of machine-learning-based improvements aimed at enhancing the clinical applicability of PRS.
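
To make the idea of aggregating many small-effect variants concrete, the following is a minimal sketch of the standard additive form of a PRS: a weighted sum of risk-allele counts. It is not taken from any specific method covered in the review. The function name, the effect sizes, and the genotypes are all hypothetical values chosen for illustration.

```python
def polygenic_risk_score(dosages, weights):
    """Return the weighted sum of risk-allele counts for one individual.

    dosages: number of risk alleles (0, 1 or 2) carried at each variant.
    weights: per-variant effect sizes, e.g. log odds ratios from a GWAS.
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical effect sizes for three variants (illustrative values only).
weights = [0.10, 0.05, 0.20]

# Hypothetical genotypes for two individuals (illustrative values only).
individuals = {
    "person_a": [0, 1, 2],
    "person_b": [2, 2, 0],
}

# One score per individual; a higher score means a higher estimated genetic risk.
scores = {name: polygenic_risk_score(g, weights) for name, g in individuals.items()}

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Each variant adds little on its own, so the score only becomes informative once many variants are summed. The machine learning methods mentioned above largely differ in how they select the variants and estimate the weights.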

