Thesis Information

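Title (Chinese):	全基因组关联分析中荟萃回归方法的软件实现 (Software Implementation of the Meta-Regression Method in Genome-Wide Association Analysis)
Name:	Zheng Suqin (郑苏秦)
Student ID:	17011210468
Confidentiality level:	Public
Thesis language:	chi
Discipline code:	085208
Discipline:	Engineering - Engineering - Electronics and Communication Engineering
Student type:	Master's
Degree:	Master of Engineering
University:	Xidian University
School:	School of Telecommunications Engineering
Major:	Information and Communication Engineering
Research areas:	Communication signal processing, genomic big data, and radar signal processing
First supervisor:	Shi Gang (史罡)
First supervisor's affiliation:	Xidian University
Second supervisor:	Zhou Yang (周洋)
Completion date:	2020-04-02
Defense date:	2020-05-23
Title (English):	Software Implementation of Meta-Regression for Genome-Wide Association Studies
Keywords (translated from Chinese):	genome-wide association analysis ; gene-environment interaction ; meta-regression method ; single nucleotide polymorphism ; C++ ; software design and implementation
Keywords (English):	GWAS ; gene-environment interaction ; meta-regression method ; SNP ; C++ ; software design and implementation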
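Abstract (translated from Chinese):	︿Genome-wide association study (GWAS) is an important method in genetic research that searches the whole human genome for variants associated with disease. In recent years it has made broad progress in research on complex diseases and has become one of the main strategies for studying complex human diseases. Meta-analysis is one of the important analysis methods in GWAS: it collects the GWAS results of multiple studies and pools them for a secondary analysis, which yields a larger effective sample size, raises the probability of finding new associations, and overcomes the small sample size of a single study. The meta-regression (MR) method is a meta-analysis method for testing gene-environment interaction, and it is the first use of the meta-regression technique in gene-environment interaction analysis in GWAS. The method usually has two steps. Step 1: according to the statistical distribution of the environmental variable, the subjects of each study are divided into several groups, and in each group the point estimate of the main effect of each single nucleotide polymorphism (SNP) on the complex disease or trait is computed together with the corresponding variance. Step 2: the results from all studies and all groups are analyzed by meta-regression to obtain the regression coefficients and covariance matrix of the SNP and SNP-environment interaction effects, on which the statistical test of SNP-environment interaction is based. Studies have shown that, when interaction is present, the method has higher statistical power than a meta-analysis of SNP main effects alone; that, when the interaction is linear, its power is comparable to that of the joint meta-analysis (JMA) method; and that, when confounding factors are present, it is more robust than JMA. Software based on the JMA algorithm has already been developed and applied to the analysis of SNP-environment interaction. The MR method, however, has no software implementation, which hinders its adoption and application in genetic research. This project implements the MR method as software in C++ under the Linux operating system. The software provides the following basic functions: 1. reading each study's analysis result file and SNP quantity index file as requested by the user; 2. screening the SNPs in each result file according to quality control indices such as missing rate, Hardy-Weinberg equilibrium, minor allele frequency, and minor allele count; 3. performing meta-regression analysis of the SNPs, including the interaction test, the joint test of interaction and main effect, and the main-effect test, and finally generating a result file that contains basic SNP information, analysis results, and sample size information. While providing these functions, the software has very low memory consumption and high running efficiency. The project also carries out extensive functional and performance tests of the meta-regression software. The functional tests use test data and erroneous data, together with different options and parameters, to compare and verify the intermediate and final results of the basic functional modules, and thereby test the reliability, robustness, and scalability of the software. The performance tests use the GWAS result data of 12 groups from three studies, about 30 million SNPs in total, to test all functions of the software. The analysis results are further compared with results computed in the general-purpose statistical software SAS, and the tests show that the implementation is efficient and accurate. Keywords: genome-wide association analysis, gene-environment interaction, meta-regression method, single nucleotide polymorphism, C++, software design and implementation﹀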
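Step 2 of the procedure described in the abstract can be sketched as a weighted least-squares fit. This is a minimal illustration, not the thesis software: it assumes a fixed-effects model with inverse-variance weights, a single linear term in each group's mean of the environmental variable, and Wald chi-square statistics. The names `GroupResult`, `MetaRegFit`, and `metaRegress` are hypothetical.

```cpp
#include <vector>

// Step 1 output for one SNP in one group (hypothetical layout).
struct GroupResult {
    double beta;  // estimated SNP main effect in this group
    double var;   // variance of that estimate
    double env;   // group mean of the environmental variable (assumed covariate)
};

struct MetaRegFit {
    double b0, b1;           // SNP effect and SNP-environment interaction
    double v00, v01, v11;    // covariance matrix of (b0, b1)
    double chi2Interaction;  // 1-df Wald statistic for b1 = 0
    double chi2Main;         // 1-df Wald statistic for b0 = 0
    double chi2Joint;        // 2-df Wald statistic for b0 = b1 = 0
    bool ok;                 // false when the model is not identifiable
};

// Fit beta_i = b0 + b1 * env_i with weights w_i = 1 / var_i.
MetaRegFit metaRegress(const std::vector<GroupResult>& groups) {
    MetaRegFit fit{};  // zero-initialised, so ok starts as false
    double s = 0.0, sx = 0.0, sxx = 0.0, sy = 0.0, sxy = 0.0;
    for (const GroupResult& g : groups) {
        if (!(g.var > 0.0)) continue;  // skip groups without a usable variance
        const double w = 1.0 / g.var;
        s   += w;
        sx  += w * g.env;
        sxx += w * g.env * g.env;
        sy  += w * g.beta;
        sxy += w * g.env * g.beta;
    }
    // Determinant of X'WX; it vanishes when all groups share one env value.
    const double det = s * sxx - sx * sx;
    if (!(det > 1e-12 * s * sxx)) return fit;

    fit.b0  = (sxx * sy - sx * sxy) / det;
    fit.b1  = (s * sxy - sx * sy) / det;
    fit.v00 = sxx / det;   // (X'WX)^{-1}, valid under the fixed-effects assumption
    fit.v01 = -sx / det;
    fit.v11 = s / det;
    fit.chi2Interaction = fit.b1 * fit.b1 / fit.v11;
    fit.chi2Main        = fit.b0 * fit.b0 / fit.v00;
    // b' (X'WX) b, which equals b' V^{-1} b for V = (X'WX)^{-1}.
    fit.chi2Joint = s * fit.b0 * fit.b0 + 2.0 * sx * fit.b0 * fit.b1
                  + sxx * fit.b1 * fit.b1;
    fit.ok = true;
    return fit;
}
```

In such a sketch the function would be called once per SNP, with one `GroupResult` per study and group, and each statistic would then be compared against a chi-square distribution with the stated degrees of freedom.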
Abstract (English):	︿ Genome-wide association study (GWAS) is an important method in genetic research that aims to identify disease-related variants across the human genome. In recent years GWAS has made extensive progress in the study of complex diseases and has become one of the main strategies for studying complex human diseases. Meta-analysis is one of the most important analytical tools in GWAS. By collecting the GWAS results of multiple studies and combining them in a secondary analysis, it achieves a much larger effective sample size and increases the probability of discovering new associations, which addresses the problem that the sample size of a single study is usually too small. Meta-regression (MR) is a meta-analysis approach for testing gene-environment interactions, and it is the first application of the meta-regression technique to gene-environment interaction analysis in GWAS. The method usually consists of two steps. Step 1: the subjects in each study are divided into groups according to the distribution of the environmental variable, and within each group the main effect of each single nucleotide polymorphism (SNP) on the complex disease or trait is estimated together with its variance. Step 2: the results from all studies and groups are combined in a meta-regression analysis, which yields the regression coefficients and covariance matrix of the SNP and SNP-environment interaction effects, and the SNP-environment interaction is then tested statistically. The method has been shown to have higher statistical power in the presence of interaction than a meta-analysis of SNP main effects only, power comparable to that of the joint meta-analysis (JMA) approach when the interaction is linear, and greater robustness than JMA in the presence of confounding factors. Software based on the JMA algorithm has already been developed and applied in GWAS of SNP-environment interaction. The MR method, however, has not been implemented as software, which hinders its adoption and application in genetic research. 
This thesis implements the MR method as software in C++ under the Linux operating system. The basic functions of the software are as follows: 1. reading each study's analysis result file and SNP quantity index file as requested by the user; 2. screening the SNPs in each result file according to quality control indices such as missing rate, Hardy-Weinberg equilibrium, minor allele frequency, and minor allele count; 3. carrying out MR analysis of the SNPs, including the interaction test, the joint test of the interaction and the SNP main effect, and the SNP main-effect test, and finally producing a result file that contains basic SNP information, analysis results, and sample size information. While providing these functions, the software has very low memory consumption and high running efficiency. The thesis also carries out extensive functional and performance tests of the MR software. The functional tests use test data and erroneous data, together with different options and parameters, to check the intermediate and final results of the basic modules and thereby verify the reliability, robustness, and scalability of the software. The performance tests use the GWAS results of 12 groups from three studies, about 30 million SNPs in total, to exercise all functions of the software. The analysis results are further compared with those computed in the general-purpose statistical software SAS, and the tests show that the implementation is efficient and accurate. Keywords: GWAS, gene-environment interaction, meta-regression method, SNP, C++, software design and implementation ﹀
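The SNP screening listed as function 2 in the abstract can be illustrated with a small filter. This is a sketch only: the record names the quality control indices but not the file layout, option names, or cut-offs, so the structure `SnpQc`, the structure `QcThresholds`, and every default threshold below are assumptions chosen for illustration.

```cpp
#include <string>
#include <vector>

// Per-SNP quality metrics for one study (hypothetical layout).
struct SnpQc {
    std::string id;      // SNP identifier, e.g. an rs number
    double missingRate;  // fraction of subjects with a missing genotype
    double hwePValue;    // p-value of the Hardy-Weinberg equilibrium test
    double maf;          // minor allele frequency
    double mac;          // minor allele count
};

// Placeholder cut-offs; real values would come from user options.
struct QcThresholds {
    double maxMissingRate = 0.05;
    double minHweP = 1e-6;
    double minMaf = 0.01;
    double minMac = 10.0;
};

// A SNP is kept only if it satisfies every criterion.
bool passesQc(const SnpQc& snp, const QcThresholds& t) {
    return snp.missingRate <= t.maxMissingRate
        && snp.hwePValue >= t.minHweP
        && snp.maf >= t.minMaf
        && snp.mac >= t.minMac;
}

// Return the SNPs of one result file that survive screening.
std::vector<SnpQc> filterSnps(const std::vector<SnpQc>& snps,
                              const QcThresholds& t) {
    std::vector<SnpQc> kept;
    for (const SnpQc& snp : snps) {
        if (passesQc(snp, t)) kept.push_back(snp);
    }
    return kept;
}
```

For input on the scale mentioned in the abstract (about 30 million SNPs), a streaming variant that tests each record as it is read, instead of collecting a vector, would keep memory use low. Whether the thesis software works this way is not stated in the record.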
Chinese Library Classification (CLC) number:	Q34
Holding number:	47175
Release date:	2020-12-29
