2010 Genome-wide association studies of 14 agronomic traits in rice landraces supplement
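Sampling Strategies for Genome-Wide Association Studies (GWAS)
Material selection is a critical step in doing GWAS well.
So we went through more than a hundred GWAS papers and carefully put together a set of sampling strategies for GWAS. Thoughtful of us, isn't it? Read on!
1. Sample selection for common economic crops
Economic crops generally have hundreds or thousands of accessions, including wild species, local landraces, domesticated varieties, and commercial cultivars.
Multiple accessions are usually selected to ensure the genetic diversity of the population.
In the literature, samples of common economic crops are typically collected from across a country or from around the world.
Table 1 Sample collection for common economic crops
2. Sample selection for common mammals
For mammals, male individuals are generally chosen as study subjects (except when studying traits such as milk production or litter size), and the animals studied should be of similar age.
The table below summarizes some published examples of mammalian sampling for reference.
Table 2 Sample collection for common mammals
3. Sample selection for common poultry
For poultry, family-based populations (full-sib or half-sib families) are generally chosen.
To broaden the analyses, several family populations can be constructed.
In addition, keep the rearing environment and nutrition of all individuals in the population as uniform as possible, and keep the birds' ages as similar as possible. This greatly improves the accuracy of phenotyping.
Table 3 Sample collection for common poultry
4. Sample selection for forest trees
For forest trees, multiple samples of the same species are generally chosen, and together they should cover a wide range of phenotypes.
Table 4 Sample collection for forest trees
5. Sample selection for other species
For sampling strategies in protists, insects, and similar organisms, see the published studies listed in Table 5.
Table 5 Sample collection for other species
With this much literature behind it, is it now clear how to sample for GWAS? One last friendly reminder: based on our literature survey and project experience, a GWAS sample size of no fewer than 300 is generally ideal.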
References
[1] Jia G, Huang X, Zhi H, et al. A haplotype map of genomic variations and genome-wide association studies of agronomic traits in foxtail millet (Setaria italica)[J]. Nature Genetics, 2013, 45(8): 957-61.
[2] Zhou L, Wang S B, Jian J, et al. Identification of domestication-related loci associated with flowering time and seed size in soybean with the RAD-seq genotyping method[J]. Scientific Reports, 2015, 5.
[3] Zhou Z, Jiang Y, Wang Z, et al. Resequencing 302 wild and cultivated accessions identifies genes related to domestication and improvement in soybean[J]. Nature Biotechnology, 2015, 33(4): 408-414.
[4] Morris G P, Ramu P, Deshpande S P, et al. Population genomic and genome-wide association studies of agroclimatic traits in sorghum[J]. Proceedings of the National Academy of Sciences, 2013, 110(2): 453-458.
[5] Yano K, Yamamoto E, Aya K, et al. Genome-wide association study using whole-genome sequencing rapidly identifies new genes influencing agronomic traits in rice[J]. Nature Genetics, 2016, 48(8).
[6] Wang X, Wang H, Liu S, et al. Genetic variation in ZmVPP1 contributes to drought tolerance in maize seedlings[J]. Nature Genetics, 2016.
[7] Pryce J E, Bolormaa S, Chamberlain A J, et al. A validated genome-wide association study in 2 dairy cattle breeds for milk production and fertility traits using variable length haplotypes[J]. Journal of Dairy Science, 2010, 93(7): 3331-3345.
[8] Hayes B J, Pryce J, Chamberlain A J, et al. Genetic architecture of complex traits and accuracy of genomic prediction: coat colour, milk-fat percentage, and type in Holstein cattle as contrasting model traits[J]. PLoS Genet, 2010, 6(9): e1001139.
[9] Heaton M P, Clawson M L, Chitko-Mckown C G, et al. Reduced lentivirus susceptibility in sheep with TMEM154 mutations[J]. PLoS Genet, 2012, 8(1): e1002467.
[10] Tsai K L, Noorai R E, Starr-Moss A N, et al. Genome-wide association studies for multiple diseases of the German Shepherd Dog[J]. Mammalian Genome, 2012, 23(1-2): 203-211.
[11] Petersen J L, Mickelson J R, Rendahl A K, et al. Genome-wide analysis reveals selection for important traits in domestic horse breeds[J]. PLoS Genet, 2013, 9(1): e1003211.
[12] Do D N, Strathe A B, Ostersen T, et al. Genome-wide association study reveals genetic architecture of eating behavior in pigs and its implications for humans obesity by comparative mapping[J]. PLoS One, 2013, 8(8).
[13] Daetwyler H D, Capitan A, Pausch H, et al. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle[J]. Nature Genetics, 2014, 46(8): 858-865.
[14] Wu Y, Fan H, Wang Y, et al. Genome-Wide Association Studies Using Haplotypes and Individual SNPs in Simmental Cattle[J]. PLoS One, 2014, 9(10): e109330.
[15] Parker C C, Gopalakrishnan S, Carbonetto P, et al. Genome-wide association study of behavioral, physiological and gene expression traits in outbred CFW mice[J]. Nature Genetics, 2016.
[16] Gu X, Feng C, Ma L, et al. Genome-wide association study of body weight in chicken F2 resource population[J]. PLoS One, 2011, 6(7): e21872.
[17] Xie L, Luo C, Zhang C, et al. Genome-wide association study identified a narrow chromosome 1 region associated with chicken growth traits[J]. PLoS One, 2012, 7(2): e30910.
[18] Liu R, Sun Y, Zhao G, et al. Genome-Wide Association Study Identifies Loci and Candidate Genes for Body Composition and Meat Quality Traits in Beijing-You Chickens[J]. PLoS One, 2012, 8(4):-.
[19] Evans L M, Slavov G T, Rodgers-Melnick E, et al. Population genomics of Populus trichocarpa identifies signatures of selection and adaptive trait associations[J]. Nature Genetics, 2014.
[20] Porth I, Klapšte J, Skyba O, et al. Genome-wide association mapping for wood characteristics in Populus identifies an array of candidate single nucleotide polymorphisms[J]. New Phytologist, 2013, 200(3): 710-726.
[21] Van Tyne D, Park D J, Schaffner S F, et al. Identification and functional validation of the novel antimalarial resistance locus PF10_0355 in Plasmodium falciparum[J]. PLoS Genet, 2011, 7(4): e1001383.
[22] Ke C, Zhou Z, Qi W, et al. Genome-wide association study of 12 agronomic traits in peach[J]. Nature Communications, 2016, 7: 13246.
[23] Miotto O, Amato R, Ashley E A, et al. Genetic architecture of artemisinin-resistant Plasmodium falciparum[J]. Nature Genetics, 2015, 47(3): 226-234.
[24] Spötter A, Gupta P, Nürnberg G, et al. Development of a 44K SNP assay focussing on the analysis of a varroa-specific defence behaviour in honey bees (Apis mellifera carnica)[J]. Molecular Ecology Resources, 2012, 12(2): 323-332.
Resequencing product line. Text: Jin Jiaojiao (靳姣姣). Editor: Wu Bifei (武苾菲).
Genome-wide association study on complex diseases: genetic statistical issues
YAN Wei-Li
School of Public Health, Xinjiang Medical University, Urumqi 830054, China
HEREDITAS (Beijing), ISSN 0253-9772
DOI: 10.3724/SP.J.1005.2008.00543
Review
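1.1 Association analysis based on unrelated individuals
Study designs based on unrelated individuals fall into two types: the case-control study and population-based association analysis in a random population sample. The former is mainly used to study qualitative traits (disease status), whereas the latter is mainly used to study quantitative traits. The statistical method depends on the study design and on the phenotype. In a case-control design (qualitative trait), the difference in the allele frequency of each SNP between cases and controls can be tested with a chi-square test on a 2×2 table, and the odds ratio (OR) with its 95% confidence interval can be calculated, from which the attributable fraction (AF) and attributable risk (AR) can be derived. When major confounders such as age and sex need to be adjusted for, logistic regression is used, with disease status as the dependent variable and genotype and the confounders as independent variables. When the design is based on a random population sample (quantitative trait), for example when studying the association between a SNP and a quantitative disease-related phenotype such as BMI, BMI levels are compared among carriers of the three genotypes at the locus (one-way analysis of variance); when confounders need to be adjusted for, analysis of covariance or linear regression is used.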
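The case-control calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the review's own procedure: the function name `allelic_test` and the allele counts in the usage note are hypothetical, the chi-square statistic is the uncorrected Pearson form, and the confidence interval uses the common log-based (Woolf) approximation.

```python
import math

def allelic_test(case_a, case_b, ctrl_a, ctrl_b):
    """Chi-square test and odds ratio for a 2x2 allele-count table.

    case_a / case_b: counts of allele A / allele a in cases
    ctrl_a / ctrl_b: counts of allele A / allele a in controls
    All four counts must be non-zero for the OR and its CI to be defined.
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    # Pearson chi-square for a 2x2 table (1 degree of freedom, no continuity correction)
    num = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2
    den = ((case_a + case_b) * (ctrl_a + ctrl_b)
           * (case_a + ctrl_a) * (case_b + ctrl_b))
    chi2 = num / den
    # With 1 df, the upper-tail chi-square probability equals erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (case_a * ctrl_b) / (case_b * ctrl_a)
    # Woolf (log-based) 95% confidence interval for the odds ratio
    se_log_or = math.sqrt(1 / case_a + 1 / case_b + 1 / ctrl_a + 1 / ctrl_b)
    ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    return chi2, p_value, odds_ratio, (ci_low, ci_high)
```

For example, `allelic_test(60, 40, 40, 60)` compares 60 A and 40 a alleles in cases against 40 A and 60 a alleles in controls. Adjusting for confounders such as age and sex would require logistic regression instead, which this sketch does not attempt.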
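Basic concepts of GWAS research 1
GWAS (Genome-Wide Association Studies) is a genetic research method used to find genetic variants in the genome that are associated with a particular trait or disease.
Its basic concepts are as follows:
1. Genome coverage: A GWAS needs to cover genetic variation across the whole genome so that variants associated with the disease or trait are not missed.
2. Association analysis: A GWAS tests for association between the genomic data and the phenotypic data of the study subjects in order to discover genetic variants related to a disease or trait. Association analysis usually uses single nucleotide polymorphisms (SNPs) as markers of genetic variation and determines association by comparing the frequencies of the different genotypes and their correlation with the phenotype.
3. Candidate genes and associated regions: A GWAS often identifies candidate genes or associated regions related to a disease or trait. Candidate genes may be linked to known biological processes or to the functions of related genes, and further research can clarify their role in disease development.
4. Polymorphism and complexity: GWAS research has revealed the role of genetic polymorphism and complexity in the development of diseases and traits. A given trait or disease is usually associated with multiple genes, each of which may have a relatively small effect.
5. Limitations of GWAS: GWAS results usually need further validation and functional studies to confirm the candidate genes or associated regions and to understand their mechanisms of action. In addition, GWAS mainly focuses on the effects of common variants on disease, so rarer variants may be overlooked.
In short, GWAS is a method that uses association analysis to find genetic variants in the genome that are related to traits or diseases, and it provides important information for studying the genetic basis of complex diseases.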
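For a quantitative trait, the association analysis in point 2 is often illustrated as a regression of the phenotype on genotype, one SNP at a time. The sketch below is only an assumption-laden illustration: it assumes additive coding (0, 1, or 2 copies of one allele), the function name `snp_regression` is hypothetical, and no covariates or population-structure correction are included.

```python
import math

def snp_regression(dosages, phenotypes):
    """Simple linear regression of a phenotype on allele dosage (0, 1, 2).

    Returns the estimated per-allele effect (slope), its standard error,
    and the t statistic with n - 2 degrees of freedom. Requires at least
    three individuals, more than one distinct dosage, and a fit that is
    not perfect (otherwise the standard error is zero).
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    beta = sxy / sxx                      # per-allele effect estimate
    intercept = mean_y - beta * mean_x
    # Residual sum of squares around the fitted line
    rss = sum((y - intercept - beta * x) ** 2
              for x, y in zip(dosages, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of the slope
    return beta, se, beta / se
```

In a genome-wide scan this test would be repeated for every SNP, and the resulting P values compared against a threshold corrected for multiple testing.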
S1 Foreign-language literature
For Peer Review
Genome-Wide Association Study of Eight Carcass Traits in Jinghai Yellow Chicken using SLAF-seq Technology
Journal: Poultry Science
Manuscript ID: Draft
Manuscript Type: Full-Length Article
Key Words: Carcass traits, chicken, GWAS, SLAF-seq
ABSTRACT
Carcass traits are the most important yield components in chicken (Gallus gallus). Investigating these traits will help develop high-yield chicken varieties. To identify the single-nucleotide polymorphisms (SNPs) and candidate genes affecting carcass traits, a genome-wide association study of eight carcass traits was performed in four hundred 43-week-old Jinghai Yellow chickens. SNPs significantly associated with the phenotypic traits were identified using a simple general linear model (GLM) and a compressed mixed linear model (MLM). A total of fifteen SNPs were found to be significantly associated with the eight traits and 12 functional genes at a threshold of P < 1.87E-6, including SNPs in the 75.5–76.1 Mb region of chromosome 4. This region had the most significant effects on carcass weight, foot weight, and wing weight. Another 84-kb region on GGA3 was identified for eviscerated weight and semi-eviscerated weight. In summary, the identified genes and SNPs will offer essential information for cloning yield-related genes in chicken.
Introduction
The increasing world population has resulted in a growing demand for meat products. To meet this demand, the poultry industry has improved productivity mainly through genetic improvement.
Carcass traits, as the most important factors in the poultry industry, determine the profitability of industrial production.
Remarkable advances on carcass traits have been achieved, and many related genes and quantitative trait loci (QTL) have been found (Abasht et al. 2007; Atzmon et al. 2008; Fang et al. 2010; Tang et al. 2010). Numerous QTL have been mapped for performance and carcass traits on different chromosomes (Hu et al. 2013). In a previous study, Ambo et al. (2009) mapped a QTL for chicken body weight at 35 and 42 days on chromosome 4 using microsatellite markers. In another study with the same population, Baron et al. (2011) mapped a QTL for percentage of thighs and drumsticks in the same region of chromosome 4. However, applying these QTL results in broiler breeding remains impractical because of low mapping precision. These approaches are also labor intensive and somewhat undirected. Genome-wide association studies (GWAS) are now used to search for single nucleotide polymorphisms (SNPs) and functional genes that affect quantitative traits. A GWAS need not assume in advance that particular genes or QTL are associated with specific traits (Hardy et al. 2009). Moreover, it is used to examine associations between traits and genetic markers (Cho et al. 2009; Liu et al. 2008; McCarthy et al. 2008). However, population size greatly affects GWAS accuracy.
High-throughput sequencing technologies can provide new strategies for sequence-based SNP genotyping. Whole-genome resequencing strategies can be used to genotype large numbers of variations among samples (Lam et al. 2010; Rubin et al. 2010; Xia et al. 2009), but they remain cost prohibitive in large populations.
SLAF-seq is a new reduced-representation sequencing technology that uses bioinformatics methods to design a tag development plan and to screen for fragments of a specific length, achieving mass labeling with high-throughput technologies that can adequately capture genome-wide information for the target species.
This study aimed to identify potential loci and candidate genes affecting carcass traits in 43-week-old Jinghai Yellow chickens using specific-locus amplified fragment sequencing (SLAF-seq) (Sun et al. 2013). In this strategy, repetitive sequences can be avoided by using predesigned schemes, and the number of selected fragments can be chosen to suit the research purpose so as to balance marker density against population size. Reference genome sequences and polymorphism information are not needed when this strategy is used. In the present work, 400 chickens from a conservation population of a Chinese local breed (Jinghai Yellow chicken) were used in the GWAS. A total of eight carcass traits were measured.
Material and method
Experimental Animals
The animals used in this study were obtained from the Jinghai Yellow Chicken Breeding Station. Four hundred females of the same batch and the same generation were randomly chosen. All had complete genealogical records and were reared in stair-step cages under the same recommended nutritional and environmental conditions. A total of eight carcass quality traits were measured for the GWAS: carcass weight (CW), foot weight (FW), single wing weight (WW), single breast muscle weight (BMW), single leg muscle weight (LMW), abdominal fat weight (AW), eviscerated weight (EW), and semi-eviscerated weight (SEW). After a 12-h fast, the chickens were weighed and slaughtered at d 300 by standard commercial procedures, and the CW, FW, and WW values were recorded.
The adipose tissues surrounding the proventriculus and gizzard, along with those located around the cloaca, were weighed as AW (Ain et al. 1996; Zhao et al. 2007). EW and SEW were collected. The carcasses were dissected into deboned, skinless thighs and breasts for the assessment of BMW and LMW.
SLAF-seq Technology Scheme Design
SLAF-seq was used to genotype a total of 400 individuals, as previously described (Qi et al. 2014), with a few modifications. Genomic DNA (≥600 ng) from Jinghai Yellow chickens was extracted from blood samples using Dzup (Blood) Genomic DNA Isolation Reagent (Sangon Biotech) and diluted to 50–100 µg/µL. DNA was incubated at 37°C with T4 DNA ligase (NEB), 0.6 U MseI (New England Biolabs, Hitchin, Herts, UK), ATP (NEB), and MseI adapters. Restriction-ligation reactions were heat-inactivated at 65°C and then digested in an additional reaction with the restriction enzyme HaeIII at 37°C. The PCR reactions contained the diluted restriction-ligation samples, dNTP, Taq DNA polymerase (NEB), and an MseI primer containing a barcode. The PCR products were purified using the TaKaRa DNA Fragment Purification Kit Ver. 2.0 and then pooled. The pooled sample was incubated at 37°C with MseI, T4 DNA ligase, ATP, and Solexa adapters. The samples were purified using a Quick Spin column (Qiagen) and then separated on a 2% agarose gel to isolate the 500–800-bp fragments using a Gel Extraction Kit (Qiagen). These fragments were used in a PCR amplification with Phusion Master Mix (NEB) and Solexa. The Phusion PCR settings followed the Illumina sample preparation guide. The samples were gel-purified, and the products of appropriate sizes (300–500 bp) were excised and diluted for sequencing using an Illumina HiSeq 2000.
Sequencing on the Illumina HiSeq 2000 produced raw paired-end reads, which we evaluated and mapped using SOAP 2.20 software (Li et al. 2009) to the newly released reference genome (Ensembl: ftp:///pub/release-75/fasta/gallus_gallus/dna/) to ensure that the original sequencing data were effectively obtained. Paired-end sequences that mapped to a unique locus of the genome were used to define SLAF tags. Following alignment and error correction, groups with an average sequencing depth of at least 4 were defined as SLAF tags.
Genotyping and Statistical Analysis
Plink (v1.07) (Purcell et al. 2007) was used for quality control of the data. SNPs with a low call rate (<85%) or a low minor allele frequency (<5%) were rejected. Finally, 400 samples and 90,030 SNPs distributed across 30 autosomes and the Z chromosome were retained for GWAS analysis.
Based on the SNPs, we used ADMIXTURE 1.22 software (Alexander et al. 2009) to calculate the population structure of the samples. For the cluster analysis, we assumed that the number of groups (Q value) in the 400 samples was 1–15 and determined the number of subgroups from the position of the ∆Q peak.
SNPs significantly associated with the phenotypic traits were identified using the TASSEL 3.0 general linear model (GLM, model 1) and compressed mixed linear model (MLM, model 2) (Zhang et al. 2010):
Y = µ + Xα + Qβ + e (1)
Y = µ + Xα + Qβ + Kµ′ + e (2)
where Y is the phenotypic value, µ is the fixed-effect vector, X is the genotype matrix, α is the weight vector of each marker, Q is the population structure matrix calculated by the ADMIXTURE program (the proportion of each group was fitted as a covariate), β is the weight vector of each group, K is the relative kinship matrix, and e is the random error.
The relative kinship matrix (K) was constructed from 15,719 independent SNPs using the software SPAGeDi 1.3a (Ou et al. 2009). P values were corrected by the Bonferroni method (Nicodemus et al. 2005). With 90,030 SNPs in total, the Bonferroni threshold was based on the estimated number of independent SNP markers and linkage disequilibrium (LD) blocks. Independent SNPs and LD blocks were determined in Plink v1.07 across all autosomal SNPs, pruning with the indep-pairwise option using a window size of 25 SNPs, a step of five SNPs, and an r² threshold of 0.4. This left a total of 26,767 SNPs, so the Bonferroni P-value threshold for suggestive significance was 3.73E-5 (1/26767), and the threshold for genome-wide significance was 1.87E-6 (0.05/26767). Quantile-quantile plots for each trait and Manhattan plots of the genome-wide association analyses were produced using TASSEL 3.0 (/).
Results
Analysis of SLAF-seq data and SLAF markers
After SLAF library construction and sequencing, a total of 52.70 Gb of raw data consisting of paired-end reads was obtained, with each read being ~80 bp in length after preprocessing. Of these bases, 86.1% were of high quality, with quality scores of at least 20 (a quality score of 20 indicates a 1% chance of an error, and thus 99% confidence). In total, 236.07 M reads were accurately mapped as paired ends to the chicken reference genome, giving a paired-end mapping ratio of 71.66%. A total of 103,680 SLAFs were identified in chicken, with an average sequencing depth of 5.46. Of these, 88,135 were polymorphic, giving a polymorphism rate of 85.01%. The number of SLAF markers per chromosome ranged from 1 to 19,722.
The distribution of SLAFs across the genome was even (Figure 1), and we then detected SNPs within the defined SLAF fragments. After quality control, 90,961 SNPs were distributed among 29 chromosomes (including the Z chromosome) and the mitochondrial genome (Table 1). The average physical distance between two neighboring SNPs was approximately 10 kb.
Association between polymorphisms and traits
Descriptive statistics of the phenotypic measurements of body composition traits in the 400 Jinghai Yellow chickens used for the present GWAS are given in Table 2. All non-normal phenotypic data were normalized by Box-Cox or Johnson transformation. The number of subgroups with the minimum ∆Q peak value was taken as the best. The results indicated that a Q value of 10 gave the lowest peak value (Figure 2A, B). Based on this result, the samples were divided into 10 subgroups.
Two statistical methods, the compressed mixed linear model (MLM) and the general linear model (GLM), were implemented to analyze associations between SNPs and phenotypes. All SNPs that reached genome-wide significance (P < 1.87E-6) in GLM had P values below the suggestive significance threshold (P < 3.73E-5) in MLM. The MLM analysis considers more factors and is stricter than the GLM. Emphasis is placed on the associations revealed by the compressed MLM analyses because the effect of population structure could be controlled and false positives could be reduced with this approach, as shown in the Q-Q plots (Figure S1). Together, GLM and MLM help to locate loci that are useful for breeding.
Loci and Genes for Body Composition Traits
Carcass weight (CW). One SNP was genome-wide significantly associated with CW in GLM; it was located on GGA4, 1.6 kb downstream from the Gallus gallus fibroblast growth factor binding protein 2 (FGFBP2) gene.
The protein encoded by the FGFBP2 gene recognizes DNA promoter regions and induces transcription of FGFs. FGFs, which induce myoblast proliferation and myocyte differentiation, make important contributions to skeletal muscle development in chickens (Gibby et al. 2009; Felicio et al. 2013). A previous study identified QTL for body weight, leg length, and leg diameter on GGA2, 4, and 26 in an F2 population of chickens (Ankra-Badu et al. 2010). Another similar study identified a QTL on GGA4 associated with body weight, carcass weight, breast weight, leg weight, and wing weight (Nassar et al. 2013). Only the FGFBP2 locus reached suggestive significance in the MLM analysis.
Foot weight (FW). For FW, one interesting region at 75.54–75.67 Mb was identified. Five SNPs reached genome-wide significance in the GLM analysis. All five were clustered on GGA4 within a 0.1-Mb region (75,548,514–75,679,707 bp) and were located within, or 2.1–7.7 kb away from, three genes: family with sequence similarity 184 member B (FAM184B), Gallus gallus quinoid dihydropteridine reductase (QDPR), and LIM-domain binding factor 2 (LDB2). In the MLM analysis, three of the five SNPs on GGA4 had a genome-wide significant association with FW: rs75548810, rs75641139, and rs75679707. Rs75548810 is 2.4 kb upstream from the FAM184B gene, rs75641139 is 5.7 kb upstream from the QDPR gene, and rs75679707 is 7.7 kb downstream from the LDB2 gene. Sun et al. (2013) also reported that the two genes were important candidates influencing shank circumference. Some studies have indicated that the FAM184B gene can influence daily gain, carcass weight, and feed intake in cattle (Lindholm-Perry et al. 2011). Similarly, this gene may have an important influence on FW in chicken.
Single wing weight (WW).
Four SNPs located on GGA4, GGA18, GGA20, and GGAZ with genome-wide significance for WW were identified by GLM analysis (Table 3). On GGA18, the SNP was located in Gallus gallus zinc finger protein 302 (ZNF302). The second SNP was located on GGA20, 1,276 bp upstream from an uncharacterized protein (PGO2). The third SNP was located on GGAZ, 66.9 kb downstream from the SMARCA2 gene. The last SNP, on GGA4, was 5.7 kb upstream from QDPR, which also had a genome-wide significant association with FW. All four SNPs reached genome-wide significance in the MLM analysis, and the only difference between the two models was that the P value from MLM was slightly lower than that from GLM.
Single breast muscle weight (BMW) and single leg muscle weight (LMW). For these two traits, some SNPs reached suggestive significance in GLM, but no SNP reached genome-wide significance in either GLM or MLM.
Eviscerated weight (EW). Three genome-wide significant SNPs associated with EW were identified by GLM, all clustered within an 84-kb region on GGA3. They were located in TULP4, 29 kb upstream from the transmembrane protein 181 gene (TMEM181). In MLM, the three SNPs were of suggestive significance. The TULP4 gene belongs to the tubby protein family, whose members seem to serve as bipartite bridges through their phosphoinositide-binding tubby domain (Mukhopadhyay et al. 2011). This family has unique amino-terminal functional domains that coordinate multiple signaling pathways, including ciliary G-protein-coupled receptor trafficking and Shh signaling (Mukhopadhyay et al. 2011). Another study also found statistical evidence that TULP4 was a new candidate gene for cleft palate (Vieira et al. 2013). TMEM181 belongs to the TMEM family of genes, which encode transmembrane proteins.
However, there are no related reports on the TMEM181 gene.
Semi-eviscerated weight (SEW). No SNP reached genome-wide significance in GLM or MLM, yet the three SNPs associated with EW reached genome-wide significance in GLM. This may be because the two traits are related to some degree. Another SNP on GGA4 reached the suggestive significance level in MLM and was 1,614 bp downstream from FGFBP2. The same SNP was also associated with CW.
Abdominal fat weight (AW). Three SNPs on GGA2 and GGA14 were of genome-wide significance in GLM. One of the two SNPs on GGA2 was 16 kb downstream from the Gallus gallus tRNA aspartic acid methyltransferase 1 (TRDMT1) gene, while the other SNP on GGA2 had no annotated genes nearby. The SNP on GGA14 was in an uncharacterized protein (novel gene). In MLM, one SNP on GGA14 and another SNP on GGA5 reached suggestive significance. The SNP on GGA5 was in the cell growth regulator with ring finger domain 1 (CGRRF1) gene. The Manhattan plots for all traits with significant SNPs are shown in Figure 3.
A heatmap (Figure S2) of the eight traits in this analysis shows high collinearity among the traits except AW. These results show that several SNPs were simultaneously associated with several traits. The SNPs with genome-wide and suggestive significance for the eight carcass traits in the GLM and MLM analyses are shown in Table 2.
Discussion
Genome-wide Association Analysis
GWAS have commonly been used to identify loci for economically important production traits in animal studies (Fan et al. 2011; Jiang et al. 2010; Shen et al. 2012). For quantitative traits, SLAF-seq technology develops large numbers of specific markers with a high success rate and low cost (Sun et al. 2013). In the present study, a GWAS between SLAF-SNPs and quantitative traits was carried out.
Here we identified potential loci and candidate genes using SLAF-seq in a conservation population of Jinghai Yellow chicken, the first new chicken variety authorized by the national commission on livestock and poultry resources in China (Zhao et al. 2011; Gu et al. 2011). Regarding statistical methods, we used the TASSEL compressed MLM and GLM to analyze the association between SNPs and phenotypes. The MLM analysis considers more factors and is stricter than GLM. However, MLM analysis may produce a certain number of false negatives, which results in missing some useful SNPs. The two models contrast with and complement each other, helping us find loci that are really useful for breeding. Most of the traits tested showed considerable ranges between maximal and minimal values (Table 2), as would be expected in a population maintained for the conservation of genetic diversity. This variability could increase the power of the GWAS.
Loci and Genes for Traits Related to Body Composition
For CW, one SNP was associated with the FGFBP2 gene, which plays an important role in embryogenesis, cellular differentiation, and proliferation in chickens. The SNP g.651G>A in FGFBP2 was associated with thawing loss and meat redness (P < 0.05) (Felicio et al. 2013). The present study identified SNPs in the FGFBP2 gene located in a QTL region, which corroborates former reports (Ankra-Badu et al. 2010; Nassar et al. 2013). The FGFBP2 gene in this region may influence carcass quality and muscle development.
For FW, the LDB2 and QDPR genes are significantly correlated with Beijing You chicken growth traits (Gu et al. 2011).
FW also has a positive correlation with BW, and shank circumference has a direct influence on FW, further suggesting that these two genes are very important candidate genes influencing the FW trait in Jinghai Yellow chicken.
For WW, there are four genes: ZNF302, PGO2, QDPR, and SMARCA2. ZNF302 belongs to the zinc finger protein family, which has been linked to genital malformations (hypospadias) in human males (Gana et al. 2012). QDPR has shown significant correlations with growth, shank circumference, and FW traits, as described above. As such, these three genes can serve as new candidate genes for further WW research. SMARCA2 encodes complex ORFs by yielding multiple mRNA variants. The complex alternative splicing of this gene suggests that its functions may be very complex, not simply inhibiting cell proliferation (Yang et al. 2011). To our knowledge, there have been no reports of PGO2 gene function until now.
No SNP was significantly associated with the BMW and LMW traits in both GLM and MLM in Jinghai Yellow chicken. These results support the notion that the genetic basis underlying BMW and LMW is complex and might be influenced by epigenetic factors. SEW shares a consistent region with EW, with a P value slightly higher than 3.73E-5. This result supports the correlation between the two traits and the validity of our results.
As for AW, the two genes are TRDMT1 and CGRRF1. Some studies have indicated that TRDMT1, the smallest mammalian DNA methyltransferase, participates in the recognition of DNA damage, DNA recombination, and mutation repair. It also catalyzes DNA methylation at the 5-position of cytosine, which is the predominant epigenetic modification in mammals (Subramaniam et al. 2014). In rats, CGRRF1 was associated with obesity and obesity-associated endometrial cancer.
CGRRF1 represents a novel, reproducible tissue marker of metformin response in the obese endometrium. Furthermore, CGRRF1 expression may prove clinically useful in the prevention or treatment of endometrial cancer (Zhang et al. 2014).
Relevant SNP association identified by cluster analysis
A heatmap can be useful for performing relevant SNP association analysis. Seven traits in this analysis show obvious association with one another. The significant results observed in this population showed that 13 genes were located in QTL regions. These results may represent possible candidate genes for poultry breeding programs and can support marker-assisted selection for carcass weight, foot weight, single wing weight, single breast muscle weight, single leg muscle weight, abdominal fat weight, eviscerated weight, and semi-eviscerated weight, traits that are of paramount importance to the poultry industry.
Reference
Abasht, B., and Lamont, S.J. 2007. Genome-wide association analysis reveals cryptic alleles as an important factor in heterosis for fatness in chicken F2 population. Anim Genet. 38(5): 491-498.
Ain Baziz, H., Geraert, P.A., Padilha, J.C.F., and Guillaumin, S. 1996. Chronic heat exposure enhances fat deposition and modifies muscle and fat partition in broiler carcasses. Poult Sci. 75(4): 505-13.
Alexander, D.H., Novembre, J., and Lange, K. 2009. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19(9): 1655-64.
Ambo, M., Moura, A.S., Ledur, M.C., Pinto, L.F., Baron, E.E., Ruy, D.C., Nones, K., Campos, R.L., Boschiero, C., Burt, D.W., and Coutinho, L.L. 2009. Quantitative trait loci for performance traits in a broiler x layer cross. Anim Genet. 40(2): 200-208.
Ankra-Badu, G.A., Shriner, D., Le Bihan-Duval, E., Mignon-Grasteau, S., Pitel, F., Beaumont, C., Duclos, M.J., Simon, J., Porter, T.E., Vignal, A., Cogburn, L.A., Allison, D.B., Yi, N., and Aggrey, S.E. 2010. Mapping main, epistatic and sex-specific QTL for body composition in a chicken population divergently selected for low or high growth rate. BMC Genomics. 11: 107.
Atzmon, G., Blum, S., Feldman, M., Cahaner, A., Lavi, U., and Hillel, J. 2008. QTLs detected in a multigenerational resource chicken population. J Hered. 99(5): 528-38.
Baron, E.E., Moura, A.S., Ledur, M.C., Pinto, L.F., Boschiero, C., Ruy, D.C., Nones, K., Zanella, E.L., Rosário, M.F., Burt, D.W., and Coutinho, L.L. 2011. QTL for percentage of carcass and carcass parts in a broiler x layer cross. Anim Genet. 42(2): 117-24.
Cho, Y.S., et al. 2009. A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits. Nat Genet. 41(5): 527-34.
Fan, B., Go, M.J., Kim, Y.J., et al. 2011. Genome-wide association study identifies loci for body composition and structural soundness traits in pigs. PLoS One. 6(2): e14726.
Fang, M., Nie, Q., Luo, C., Zhang, D., and Zhang, X. 2010. Associations of GHSR gene polymorphisms with chicken growth and carcass traits. Mol Biol Rep. 37(1): 423-428.
Felicio, A.M., Boschiero, C., Balieiro, J.C., Ledur, M.C., Ferraz, J.B., Moura, A.S., and Coutinho, L.L. 2013. Polymorphisms in FGFBP1 and FGFBP2 genes associated with carcass and meat quality traits in chickens. Genet Mol Res. 12(1): 208-222.
Gana, S., Veggiotti, P., Sciacca, G., Fedeli, C., Bersano, A., Micieli, G., Maghnie, M., Ciccone, R., Rossi, E., Plunkett, K., Bi, W., Sutton, V.R., and Zuffardi, O. 2012. 19q13.11 cryptic deletion: description of two new cases and indication for a role of WTIP haploinsufficiency in hypospadias. Eur J Hum Genet. 20(8): 852-856.
Gibby, K.A., McDonnell, K., Schmidt, M.O., and Wellstein, A. 2009. A distinct role for secreted fibroblast growth factor-binding proteins in development. Proc Natl Acad Sci U S A. 106(21): 8585-90.
Gu, X., Feng, C., Ma, L., Song, C., Wang, Y., Da, Y., Li, H., Chen, K., Ye, S., Ge, C., Hu, X., and Li, N. 2011. Genome-wide association study of body weight in chicken F2 resource population. PLoS One. 6(7): e21872.
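Expression of the ANLN and FBXO5 Genes in Lung Squamous Cell Carcinoma and Their Effect on Prognosis
Liu Weimei

[Abstract] Objective: Lung squamous cell carcinoma is currently one of the most common malignant tumors, and its incidence rises every year. The key genes identified in this study can predict the prognosis of patients with lung squamous cell carcinoma and help explain its pathogenesis. Methods: Transcriptome sequencing (RNA-Seq) was performed on 4 pairs of lung squamous cell carcinoma and matched normal tissues to screen for differentially expressed genes, followed by gene function analysis and prognostic evaluation. Results: A total of 821 differentially expressed genes were identified, with 12 significantly enriched functional categories, including Cell adhesion, Regulation of transcription, and Biological adhesion. Combining these results with the lung cancer data in The Cancer Genome Atlas (TCGA) database revealed two important genes (FBXO5 and ANLN): low ANLN expression or high FBXO5 expression significantly reduced the survival of patients with lung squamous cell carcinoma, and differential expression of both genes was confirmed by PCR in 20 independent sample pairs. Conclusion: The key genes identified here improve our understanding of how lung squamous cell carcinoma develops and may serve as biomarkers for prognosis.
[Journal] Journal of Clinical Pulmonary Medicine (《临床肺科杂志》)
[Year (volume), issue] 2018 (023) 008
[Total pages] 4 (P1468-1471)
[Keywords] lung squamous cell carcinoma; RNA-Seq; biomarker
[Author] Liu Weimei
[Affiliation] Department of Pathology, Jiaozuo Second People’s Hospital (First Affiliated Hospital of Henan Polytechnic University), Jiaozuo, Henan 454000
[Language of main text] Chinese

Lung squamous cell carcinoma (LSCC) is one of the most common malignant tumors, accounting for 40%-51% of primary lung cancers. It causes about 15,592 deaths per year, and the number of new cases, about 74,780, is rising year by year [1].
Surgery is the most common treatment for lung cancer, but the prognosis remains poor [2].
Earlier studies of gene mutations and gene expression profiles in lung cancer have identified important pathways related to the disease [3].
In addition, certain genes have been reported to be associated with lung cancer prognosis and drug response [4].
To find meaningful biomarkers for predicting lung cancer prognosis, this study used next-generation sequencing to identify genes differentially expressed in LSCC, analyzed their functions, and predicted their effect on the prognosis of LSCC patients.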
Agricultural Sciences in China 2010, 9(9): 1251-1262
September 2010
Received 30 October, 2009; Accepted 16 April, 2010

Analysis of Genetic Diversity and Population Structure of Maize Landraces from the South Maize Region of China

LIU Zhi-zhai 1, 2, GUO Rong-hua 2, 3, ZHAO Jiu-ran 4, CAI Yi-lin 1, WANG Feng-ge 4, CAO Mo-ju 3, WANG Rong-huan 2, 4, SHI Yun-su 2, SONG Yan-chun 2, WANG Tian-yu 2 and LI Yu 2

1 Maize Research Institute, Southwest University, Chongqing 400716, P.R.China
2 Institute of Crop Sciences/National Key Facility for Gene Resources and Genetic Improvement, Chinese Academy of Agricultural Sciences, Beijing 100081, P.R.China
3 Maize Research Institute, Sichuan Agricultural University, Ya’an 625014, P.R.China
4 Maize Research Center, Beijing Academy of Agricultural and Forestry Sciences, Beijing 100089, P.R.China

Abstract
Understanding the genetic diversity and population structure of landraces is important for using this germplasm in breeding programs. In the present study, a total of 143 core maize landraces from the South Maize Region (SR) of China, which represent the general profile of genetic diversity in the landrace germplasm of SR, were genotyped with 54 DNA microsatellite markers. In total, 517 alleles (ranging from 4 to 22 per locus) were detected among these landraces, with an average of 9.57 alleles per locus. The total gene diversity of these core landraces was 0.61, suggesting a rather high level of genetic diversity. Analysis of population structure based on a Bayesian method gave a result similar to that of the neighbor-joining (NJ) phylogenetic method. The results indicated that the whole set of 143 core landraces could be clustered into two distinct groups. All landraces from Guangdong and Hainan, and 15 landraces from Jiangxi, were clustered into group 1, while those from the other regions of SR formed group 2.
The analysis of genetic diversity showed that both groups possessed similar gene diversity, but group 1 had a lower mean number of alleles per locus (6.63) and fewer distinct alleles (91) than group 2 (7.94 and 110, respectively). The relatively high richness of total alleles and distinct alleles preserved in the core landraces from SR suggested that all this germplasm could be a useful resource for germplasm enhancement and maize breeding in China.

Key words: maize, core landraces, genetic diversity, population structure

INTRODUCTION

Maize has been grown in China for nearly 500 years since its first introduction into what is now the second biggest maize production country in the world. Currently, there are six different maize growing regions throughout the country according to ecological conditions and farming systems, including three major production regions, i.e., the North Spring Maize Region, the Huang-Huai-Hai Summer Maize Region, and the Southwest Maize Region, and three minor regions, i.e., the South Maize Region, the Northwest Maize Region, and the Qingzang Plateau Maize Region. The South Maize Region (SR) is special because of its importance in the origin of Chinese maize. It is hypothesized that maize was introduced into China mainly by two routes. One is the land route, by which maize was first brought to Tibet from India and then to Sichuan Province in southwestern China. The other is the sea route, by which maize was first shipped to the coastal areas of southeast China and then spread all around the country (Xu 2001; Zhou 2000). SR contains all of the coastal provinces and regions lying in southeastern China.
In the long cultivation history of maize in southern China, numerous landraces have been formed, in which a great amount of genetic variation has been observed (Li 1998). Similar to the hybrid swapping in Europe (Reif et al. 2005a), maize landraces in China have been almost entirely replaced by hybrids since the 1950s (Li 1998). However, some landraces with good adaptation and yield performance are still grown in a few mountainous areas of this region (Liu et al. 1999). Through a great collection effort since the 1950s, 13521 accessions of maize landraces are currently preserved in the China National Genebank (CNG), and a core collection of these landraces has been established (Li et al. 2004). In this core collection, a total of 143 maize landrace accessions were collected from the South Maize Region (SR) (Table 1).
Since simple sequence repeat (SSR) markers were first used in human genetics (Litt and Luty 1989), they have become one of the most widely used marker types in crop research (Melchinger et al. 1998; Enoki et al. 2005), especially in the molecular characterization of genetic resources, e.g., soybean [Glycine max (L.) Merr] (Xie et al. 2005), rice (Oryza sativa L.) (Garris et al. 2005), and wheat (Triticum aestivum) (Chao et al. 2007). In maize (Zea mays L.), numerous studies on the genetic diversity and population structure of landraces and inbred lines in many countries and regions worldwide have been published (Liu et al. 2003; Vigouroux et al. 2005; Reif et al. 2006; Wang et al. 2008). This documentation of genetic diversity and population structure in maize genetic resources has facilitated the understanding of the genetic bases of maize landraces, the utilization of these resources, and the mining of favorable alleles from landraces. Although some studies on the genetic diversity of Chinese maize inbred lines have been conducted (Yu et al. 2007; Wang et al. 2008), information on the general profile of genetic diversity in Chinese maize landraces is scarce. In particular, there are no reports on the genetic diversity of the maize landraces collected from SR, possibly the earliest maize growing area in China.
In this paper, a total of 143 landraces from SR listed in the core collection of CNG were genotyped by using SSR markers, with the aim of revealing the genetic diversity of the landraces from SR (Table 2) of China and examining the genetic relationships and population structure of these landraces.

MATERIALS AND METHODS

Plant materials and DNA extraction
In total, 143 landraces from SR listed in the core collection of CNG, which was established by a sequential stratification method (Liu et al. 2004), were used in the present study. Detailed information on all these landrace accessions is listed in Table 1. For each landrace, DNA was extracted by a CTAB method (Saghai-Maroof et al. 1984) from a bulk pool constructed from equal amounts of leaf material sampled from 15 randomly chosen plants, according to the procedure of Reif et al. (2005b).

SSR genotyping
A total of 54 simple sequence repeat (SSR) markers covering the entire maize genome were used to fingerprint all of the 143 core landrace accessions (Table 3). The 5´ end of the left primer of each locus was tailed with an M13 sequence of 5´-CACGACGTTGTAAAACGAC-3´. PCR amplification was performed in a 15 µL reaction containing 80 ng of template DNA, 7.5 mmol L-1 of each of the four dNTPs, 1×Taq polymerase buffer, 1.5 mmol L-1 MgCl2, 1 U Taq polymerase (Tiangen Biotech Co. Ltd., Beijing, China), 1.2 µmol L-1 of forward primer and universal fluorescent labeled M13 primer, and 0.3 µmol L-1 of M13 sequence tailed reverse primer (Schuelke 2000). The amplification was carried out in a 96-well DNA thermal cycler (GeneAmp PCR System 9700, Applied Biosystem, USA). PCR products were size-separated on an ABI Prism 3730XL DNA sequencer (Hitachi High-Technologies Corporation, Tokyo, Japan) via the software packages GENEMAPPER and GeneMarker ver.
6 (SoftGenetics, USA).

Data analyses
Average number of alleles per locus and average number of group-specific alleles per locus were identified by the software Excel MicroSatellite toolkit (Park 2001). Average number of alleles per locus was calculated by the formula $A = \sum_{j=1}^{r} A_j / r$, with the standard deviation $\sigma(A) = \sqrt{\sum_{j=1}^{r} (A_j - A)^2 / (r - 1)}$, where $A_j$ was the number of distinct alleles at locus $j$, and $r$ was the number of loci (Park 2001).
Unbiased gene diversity (also known as expected heterozygosity) and observed heterozygosity for each locus, and average gene diversity across the 54 SSR loci, as well as model-based groupings inferred by Structure ver. 2.2, were calculated by the software PowerMarker ver. 3.25 (Liu et al. 2005). Unbiased gene diversity for each locus was calculated by $\hat{h} = \frac{2n}{2n-1}\left(1 - \sum \hat{x}_i^2\right)$, where $\hat{x}_i = \left(2\hat{X}_{ii} + \sum_{j \neq i} \hat{X}_{ij}\right)/2$, $\hat{X}_{ij}$ was the frequency of genotype $A_iA_j$ in the sample, and $n$ was the number of individuals sampled.
The average gene diversity across the 54 loci was calculated as described by Nei (1987) as follows: $\hat{H} = \sum_{j=1}^{r} \hat{h}_j / r$, with the variance as given by Nei (1987).
The average observed heterozygosity across all loci, together with its standard deviation, was calculated as described by Hedrick (1983).

Table 1 The detailed information about the landraces used in the present study
Accession name | Entry code | Analyzing code | Origin (county/city) | Province/Region | PGS revealed by Structure 1): Group 1 | Group 2 | NJ dendrogram revealed by phylogenetic analysis
140-150tian | 00120005 | AnH-06 | Jingde | Anhui | 0.006 | 0.994 | Group 2
170tian | 00120006 | AnH-07 | Jingde | Anhui | 0.005 | 0.995 | Group 2
Zixihuangyumi | 00120007 | AnH-08 | Zixi | Anhui | 0.002 | 0.998 | Group 2
Zixibaihuangzayumi | 00120008 | AnH-09 | Zixi | Anhui | 0.003 | 0.997 | Group 2
Baiyulu | 00120020 | AnH-10 | Yuexi | Anhui | 0.006 | 0.994 | Group 2
Wuhuazi | 00120021 | AnH-11 | Yuexi | Anhui | 0.003 | 0.997 | Group 2
Tongbai | 00120035 | AnH-12 | Tongling | Anhui | 0.006 | 0.994 | Group 2
Yangyulu | 00120036 | AnH-13 | Yuexi | Anhui | 0.004 | 0.996 | Group 2
Huangli | 00120037 | AnH-14 | Tunxi | Anhui | 0.041 | 0.959 | Group 2
Baiyumi | 00120038 | AnH-15 | Tunxi | Anhui | 0.003 | 0.997 | Group 2
Dapigu | 00120039 | AnH-16 | Tunxi | Anhui | 0.035 | 0.965 | Group 2
150tianbaiyumi | 00120040 | AnH-17 | Xiuning | Anhui | 0.002 | 0.998 | Group 2
Xiuning60tian | 00120042 | AnH-18 | Xiuning | Anhui | 0.004 | 0.996 | Group 2
Wubaogu | 00120044 | AnH-19 | Shitai | Anhui | 0.002 | 0.998 | Group 2
Kuyumi | 00130001 | FuJ-01 | Shanghang | Fujian | 0.005 | 0.995 | Group 2
Zhongdouyumi | 00130003 | FuJ-02 | Shanghang | Fujian | 0.038 | 0.962 | Group 2
Baixinyumi | 00130004 | FuJ-03 | Liancheng | Fujian | 0.004 | 0.996 | Group 2
Hongxinyumi | 00130005 | FuJ-04 | Liancheng | Fujian | 0.034 | 0.966 | Group 2
Baibaogu | 00130008 | FuJ-05 | Changding | Fujian | 0.003 | 0.997 | Group 2
Huangyumi | 00130011 | FuJ-06 | Jiangyang | Fujian | 0.002 | 0.998 | Group 2
Huabaomi | 00130013 | FuJ-07 | Shaowu | Fujian | 0.002 | 0.998 | Group 2
Huangbaomi | 00130014 | FuJ-08 | Songxi | Fujian | 0.002 | 0.998 | Group 2
Huangyumi | 00130016 | FuJ-09 | Wuyishan | Fujian | 0.046 | 0.954 | Group 2
Huabaogu | 00130019 | FuJ-10 | Jian’ou | Fujian | 0.006 | 0.994 | Group 2
Huangyumi | 00130024 | FuJ-11 | Guangze | Fujian | 0.001 | 0.999 | Group 2
Huayumi | 00130025 | FuJ-12 | Nanping | Fujian | 0.004 | 0.996 | Group 2
Huangyumi | 00130026 | FuJ-13 | Nanping | Fujian | 0.011 | 0.989 | Group 2
Hongbaosu | 00130027 | FuJ-14 | Longyan | Fujian | 0.016 | 0.984 | Group 2
Huangfansu | 00130029 | FuJ-15 | Longyan | Fujian | 0.002 | 0.998 | Group 2
Huangbaosu | 00130031 | FuJ-16 | Zhangping | Fujian | 0.006 | 0.994 | Group 2
Huangfansu | 00130033 | FuJ-17 | Zhangping | Fujian | 0.004 | 0.996 | Group 2
Baolieyumi | 00190001 | GuangD-01 | Guangzhou | Guangdong | 0.989 | 0.011 | Group 1
Nuomibao (I) | 00190005 | GuangD-02 | Shixing | Guangdong | 0.974 | 0.026 | Group 1
Nuomibao (II) | 00190006 | GuangD-03 | Shixing | Guangdong | 0.979 | 0.021 | Group 1
Zasehuabao | 00190010 | GuangD-04 | Lechang | Guangdong | 0.997 | 0.003 | Group 1
Zihongmi | 00190013 | GuangD-05 | Lechang | Guangdong | 0.988 | 0.012 | Group 1
Jiufengyumi | 00190015 | GuangD-06 | Lechang | Guangdong | 0.995 | 0.005 | Group 1
Huangbaosu | 00190029 | GuangD-07 | Mei | Guangdong | 0.997 | 0.003 | Group 1
Bailibao | 00190032 | GuangD-08 | Xingning | Guangdong | 0.998 | 0.002 | Group 1
Nuobao | 00190038 | GuangD-09 | Xingning | Guangdong | 0.998 | 0.002 | Group 1
Jinlanghuang | 00190048 | GuangD-10 | Jiangcheng | Guangdong | 0.996 | 0.004 | Group 1
Baimizhenzhusu | 00190050 | GuangD-11 | Yangdong | Guangdong | 0.994 | 0.006 | Group 1
Huangmizhenzhusu | 00190052 | GuangD-12 | Yangdong | Guangdong | 0.993 | 0.007 | Group 1
Baizhenzhu | 00190061 | GuangD-13 | Yangdong | Guangdong | 0.997 | 0.003 | Group 1
Baiyumi | 00190066 | GuangD-14 | Wuchuan | Guangdong | 0.988 | 0.012 | Group 1
Bendibai | 00190067 | GuangD-15 | Suixi | Guangdong | 0.998 | 0.002 | Group 1
Shigubaisu | 00190068 | GuangD-16 | Gaozhou | Guangdong | 0.996 | 0.004 | Group 1
Zhenzhusu | 00190069 | GuangD-17 | Xinyi | Guangdong | 0.996 | 0.004 | Group 1
Nianyaxixinbai | 00190070 | GuangD-18 | Huazhou | Guangdong | 0.996 | 0.004 | Group 1
Huangbaosu | 00190074 | GuangD-19 | Xinxing | Guangdong | 0.995 | 0.005 | Group 1
Huangmisu | 00190076 | GuangD-20 | Luoding | Guangdong | 0.94 | 0.060 | Group 1
Huangmi’ai | 00190078 | GuangD-21 | Luoding | Guangdong | 0.998 | 0.002 | Group 1
Bayuemai | 00190084 | GuangD-22 | Liannan | Guangdong | 0.991 | 0.009 | Group 1
Baiyumi | 00300001 | HaiN-01 | Haikou | Hainan | 0.996 | 0.004 | Group 1
Baiyumi | 00300003 | HaiN-02 | Sanya | Hainan | 0.997 | 0.003 | Group 1
Hongyumi | 00300004 | HaiN-03 | Sanya | Hainan | 0.998 | 0.002 | Group 1
Baiyumi | 00300011 | HaiN-04 | Tongshi | Hainan | 0.999 | 0.001 | Group 1
Zhenzhuyumi | 00300013 | HaiN-05 | Tongshi | Hainan | 0.998 | 0.002 | Group 1
Zhenzhuyumi | 00300015 | HaiN-06 | Qiongshan | Hainan | 0.996 | 0.004 | Group 1
Aiyumi | 00300016 | HaiN-07 | Qiongshan | Hainan | 0.996 | 0.004 | Group 1
Huangyumi | 00300021 | HaiN-08 | Qionghai | Hainan | 0.997 | 0.003 | Group 1
Yumi | 00300025 | HaiN-09 | Qionghai | Hainan | 0.987 | 0.013 | Group 1
Baiyumi | 00300032 | HaiN-10 | Tunchang | Hainan | 0.996 | 0.004 | Group 1
Huangyumi | 00300051 | HaiN-11 | Baisha | Hainan | 0.998 | 0.002 | Group 1
Baihuangyumi | 00300055 | HaiN-12 | Baisha | Hainan | 0.997 | 0.003 | Group 1
Machihuangyumi | 00300069 | HaiN-13 | Changjiang | Hainan | 0.990 | 0.010 | Group 1
Hongyumi | 00300073 | HaiN-14 | Dongfang | Hainan | 0.998 | 0.002 | Group 1
Xiaohonghuayumi | 00300087 | HaiN-15 | Lingshui | Hainan | 0.998 | 0.002 | Group 1
Baiyumi | 00300095 | HaiN-16 | Qiongzhong | Hainan | 0.995 | 0.005 | Group 1
Yumi (Baimai) | 00300101 | HaiN-17 | Qiongzhong | Hainan | 0.998 | 0.002 | Group 1
Yumi (Xuemai) | 00300103 | HaiN-18 | Qiongzhong | Hainan | 0.999 | 0.001 | Group 1
Huangmaya | 00100008 | JiangS-10 | Rugao | Jiangsu | 0.004 | 0.996 | Group 2
Bainian | 00100012 | JiangS-11 | Rugao | Jiangsu | 0.008 | 0.992 | Group 2
Bayebaiyumi | 00100016 | JiangS-12 | Rudong | Jiangsu | 0.004 | 0.996 | Group 2
Chengtuohuang | 00100021 | JiangS-13 | Qidong | Jiangsu | 0.005 | 0.995 | Group 2
Xuehuanuo | 00100024 | JiangS-14 | Qidong | Jiangsu | 0.002 | 0.998 | Group 2
Laobaiyumi | 00100032 | JiangS-15 | Qidong | Jiangsu | 0.005 | 0.995 | Group 2
Laobaiyumi | 00100033 | JiangS-16 | Qidong | Jiangsu | 0.001 | 0.999 | Group 2
Huangwuye’er | 00100035 | JiangS-17 | Hai’an | Jiangsu | 0.003 | 0.997 | Group 2
Xiangchuanhuang | 00100047 | JiangS-18 | Nantong | Jiangsu | 0.006 | 0.994 | Group 2
Huangyingzi | 00100094 | JiangS-19 | Xinghua | Jiangsu | 0.004 | 0.996 | Group 2
Xiaojinhuang | 00100096 | JiangS-20 | Yangzhou | Jiangsu | 0.001 | 0.999 | Group 2
Liushizi | 00100106 | JiangS-21 | Dongtai | Jiangsu | 0.003 | 0.997 | Group 2
Kangnandabaizi | 00100108 | JiangS-22 | Dongtai | Jiangsu | 0.002 | 0.998 | Group 2
Shanyumi | 00140020 | JiangX-01 | Dexing | Jiangxi | 0.997 | 0.003 | Group 1
Yumi | 00140024 | JiangX-02 | Dexing | Jiangxi | 0.997 | 0.003 | Group 1
Tianhongyumi | 00140027 | JiangX-03 | Yushan | Jiangxi | 0.991 | 0.009 | Group 1
Hongganshanyumi | 00140028 | JiangX-04 | Yushan | Jiangxi | 0.998 | 0.002 | Group 1
Zaoshuyumi | 00140032 | JiangX-05 | Qianshan | Jiangxi | 0.997 | 0.003 | Group 1
Yumi | 00140034 | JiangX-06 | Wannian | Jiangxi | 0.997 | 0.003 | Group 1
Yumi | 00140038 | JiangX-07 | De’an | Jiangxi | 0.994 | 0.006 | Group 1
Yumi | 00140045 | JiangX-08 | Wuning | Jiangxi | 0.974 | 0.026 | Group 1
Chihongyumi | 00140049 | JiangX-09 | Wanzai | Jiangxi | 0.992 | 0.008 | Group 1
Yumi | 00140052 | JiangX-10 | Wanzai | Jiangxi | 0.993 | 0.007 | Group 1
Huayumi | 00140060 | JiangX-11 | Jing’an | Jiangxi | 0.997 | 0.003 | Group 1
Baiyumi | 00140065 | JiangX-12 | Pingxiang | Jiangxi | 0.994 | 0.006 | Group 1
Huangyumi | 00140066 | JiangX-13 | Pingxiang | Jiangxi | 0.968 | 0.032 | Group 1
Nuobaosuhuang | 00140068 | JiangX-14 | Ruijin | Jiangxi | 0.995 | 0.005 | Group 1
Huangyumi | 00140072 | JiangX-15 | Xinfeng | Jiangxi | 0.996 | 0.004 | Group 1
Wuningyumi | 00140002 | JiangX-16 | Jiujiang | Jiangxi | 0.059 | 0.941 | Group 2
Tianyumi | 00140005 | JiangX-17 | Shangrao | Jiangxi | 0.002 | 0.998 | Group 2
Yumi | 00140006 | JiangX-18 | Shangrao | Jiangxi | 0.031 | 0.969 | Group 2
Baiyiumi | 00140012 | JiangX-19 | Maoyuan | Jiangxi | 0.006 | 0.994 | Group 2
60riyumi | 00140016 | JiangX-20 | Maoyuan | Jiangxi | 0.002 | 0.998 | Group 2
Shanyumi | 00140019 | JiangX-21 | Dexing | Jiangxi | 0.005 | 0.995 | Group 2
Laorenya | 00090002 | ShangH-01 | Chongming | Shanghai | 0.005 | 0.995 | Group 2
Jinmeihuang | 00090004 | ShangH-02 | Chongming | Shanghai | 0.002 | 0.998 | Group 2
Zaobaiyumi | 00090006 | ShangH-03 | Chongming | Shanghai | 0.002 | 0.998 | Group 2
Chengtuohuang | 00090007 | ShangH-04 | Chongming | Shanghai | 0.078 | 0.922 | Group 2
Benyumi (Huang) | 00090008 | ShangH-05 | Shangshi | Shanghai | 0.002 | 0.998 | Group 2
Bendiyumi | 00090010 | ShangH-06 | Shangshi | Shanghai | 0.004 | 0.996 | Group 2
Baigengyumi | 00090011 | ShangH-07 | Jiading | Shanghai | 0.002 | 0.998 | Group 2
Huangnuoyumi | 00090012 | ShangH-08 | Jiading | Shanghai | 0.004 | 0.996 | Group 2
Huangdubaiyumi | 00090013 | ShangH-09 | Jiading | Shanghai | 0.044 | 0.956 | Group 2
Bainuoyumi | 00090014 | ShangH-10 | Chuansha | Shanghai | 0.001 | 0.999 | Group 2
Laorenya | 00090015 | ShangH-11 | Shangshi | Shanghai | 0.010 | 0.990 | Group 2
Xiaojinhuang | 00090016 | ShangH-12 | Shangshi | Shanghai | 0.005 | 0.995 | Group 2
Gengbaidayumi | 00090017 | ShangH-13 | Shangshi | Shanghai | 0.002 | 0.998 | Group 2
Nongmeiyihao | 00090018 | ShangH-14 | Shangshi | Shanghai | 0.054 | 0.946 | Group 2
Chuanshazinuo | 00090020 | ShangH-15 | Chuansha | Shanghai | 0.055 | 0.945 | Group 2
Baoanshanyumi | 00110004 | ZheJ-01 | Jiangshan | Zhejiang | 0.013 | 0.987 | Group 2
Changtaixizi | 00110005 | ZheJ-02 | Jiangshan | Zhejiang | 0.002 | 0.998 | Group 2
Shanyumibaizi | 00110007 | ZheJ-03 | Jiangshan | Zhejiang | 0.002 | 0.998 | Group 2
Kaihuajinyinbao | 00110017 | ZheJ-04 | Kaihua | Zhejiang | 0.010 | 0.990 | Group 2
Liputianzi | 00110038 | ZheJ-05 | Jinhua | Zhejiang | 0.002 | 0.998 | Group 2
Jinhuaqiuyumi | 00110040 | ZheJ-06 | Jinhua | Zhejiang | 0.005 | 0.995 | Group 2
Pujiang80ri | 00110069 | ZheJ-07 | Pujiang | Zhejiang | 0.021 | 0.979 | Group 2
Dalihuang | 00110076 | ZheJ-08 | Yongkang | Zhejiang | 0.014 | 0.986 | Group 2
Ziyumi | 00110077 | ZheJ-09 | Yongkang | Zhejiang | 0.002 | 0.998 | Group 2
Baiyanhandipinzhong | 00110078 | ZheJ-10 | Yongkang | Zhejiang | 0.003 | 0.997 | Group 2
Duosuiyumi | 00110081 | ZheJ-11 | Wuyi | Zhejiang | 0.002 | 0.998 | Group 2
Chun’an80huang | 00110084 | ZheJ-12 | Chun’an | Zhejiang | 0.002 | 0.998 | Group 2
120ribaiyumi | 00110090 | ZheJ-13 | Chun’an | Zhejiang | 0.002 | 0.998 | Group 2
Lin’anliugu | 00110111 | ZheJ-14 | Lin’an | Zhejiang | 0.003 | 0.997 | Group 2
Qianhuangyumi | 00110114 | ZheJ-15 | Lin’an | Zhejiang | 0.003 | 0.997 | Group 2
Fenshuishuitianyumi | 00110118 | ZheJ-16 | Tonglu | Zhejiang | 0.041 | 0.959 | Group 2
Kuihualiugu | 00110119 | ZheJ-17 | Tonglu | Zhejiang | 0.003 | 0.997 | Group 2
Danbaihuang | 00110122 | ZheJ-18 | Tonglu | Zhejiang | 0.002 | 0.998 | Group 2
Hongxinma | 00110124 | ZheJ-19 | Jiande | Zhejiang | 0.003 | 0.997 | Group 2
Shanyumi | 00110136 | ZheJ-20 | Suichang | Zhejiang | 0.003 | 0.997 | Group 2
Bai60ri | 00110143 | ZheJ-21 | Lishui | Zhejiang | 0.005 | 0.995 | Group 2
Zeibutou | 00110195 | ZheJ-22 | Xianju | Zhejiang | 0.002 | 0.998 | Group 2
Kelilao | 00110197 | ZheJ-23 | Pan’an | Zhejiang | 0.060 | 0.940 | Group 2
1) The figures refer to the proportion of membership that each landrace possessed.

Table 2 Construction of two phylogenetic groups (SSR-clustered groups) and their correlation with geographical locations
Geographical location | Group 1 | Group 2 | Total
Guangdong | 22 | 0 | 22
Hainan | 18 | 0 | 18
Jiangxi | 15 | 6 | 21
Anhui | 0 | 14 | 14
Fujian | 0 | 17 | 17
Jiangsu | 0 | 13 | 13
Shanghai | 0 | 15 | 15
Zhejiang | 0 | 23 | 23
Total | 55 | 88 | 143
Chi-square test: χ2 = 124.89, P < 0.0001

Table 3 The PIC of each locus and the number of alleles detected by 54 SSRs
Locus | Bin | Repeat motif | PIC | No. of alleles | Description 2)
bnlg1007y5 1) | 1.02 | AG | 0.78 | 15 | Probe site
umc1122 | 1.06 | GGT | 0.63 | 9 | Probe site
umc1147y4 1) | 1.07 | CA | 0.26 | 15 | Probe site
phi96100 1) | 2.00 | ACCT | 0.29 | 8 | Probe site
umc1185 | 2.03 | GC | 0.72 | 15 | ole1 (oleosin 1)
phi127 | 2.08 | AGAC | 0.57 | 7 | Probe site
umc1736y2 1) | 2.09 | GCAT | 0.67 | 7 | Probe site
phi453121 | 3.01 | ACC | 0.71 | 11 | Probe site
phi374118 | 3.03 | ACC | 0.47 | 7 | Probe site
phi053k2 1) | 3.05 | ATAC | 0.79 | 10 | Probe site
nc004 | 4.03 | AG | 0.48 | 12 | adh2 (alcohol dehydrogenase 2)
bnlg490y4 1) | 4.04 | TA | 0.52 | 17 | Probe site
phi079 | 4.05 | AGATG | 0.49 | 5 | gpc1 (glyceraldehyde-3-phosphate dehydrogenase 1)
bnlg1784 | 4.07 | AG | 0.62 | 10 | Probe site
umc1574 | 4.09 | GCC | 0.71 | 9 | sbp2 (SBP-domain protein 2)
umc1940y5 1) | 4.09 | GCA | 0.47 | 13 | Probe site
umc1050 | 4.11 | AAT | 0.78 | 10 | cat3 (catalase 3)
nc130 | 5.00 | AGC | 0.56 | 10 | Probe site
umc2112y3 1) | 5.02 | GA | 0.70 | 14 | Probe site
phi109188 | 5.03 | AAAG | 0.71 | 9 | Probe site
umc1860 | 5.04 | AT | 0.32 | 5 | Probe site
phi085 | 5.07 | AACGC | 0.53 | 7 | gln4 (glutamine synthetase 4)
phi331888 | 5.07 | AAG | 0.58 | 11 | Probe site
umc1153 | 5.09 | TCA | 0.73 | 10 | Probe site
phi075 | 6.00 | CT | 0.75 | 8 | fdx1 (ferredoxin 1)
bnlg249k2 1) | 6.01 | AG | 0.73 | 14 | Probe site
phi389203 | 6.03 | AGC | 0.41 | 6 | Probe site
phi299852y2 1) | 6.07 | AGC | 0.71 | 12 | Probe site
umc1545y2 1) | 7.00 | AAGA | 0.76 | 10 | hsp3 (heat shock protein 3)
phi112 | 7.01 | AG | 0.53 | 10 | o2 (opaque endosperm 2)
phi420701 | 8.00 | CCG | 0.46 | 9 | Probe site
umc1359 | 8.00 | TC | 0.78 | 14 | Probe site
umc1139 | 8.01 | GAC | 0.47 | 9 | Probe site
umc1304 | 8.02 | TCGA | 0.33 | 5 | Probe site
phi115 | 8.03 | ATAC | 0.46 | 5 | act1 (actin 1)
umc2212 | 8.05 | ACG | 0.45 | 5 | Probe site
umc1121 | 8.05 | AGAT | 0.48 | 4 | Probe site
phi080 | 8.08 | AGGAG | 0.64 | 6 | gst1 (glutathione-S-transferase 1)
phi233376y1 1) | 8.09 | CCG | 0.59 | 8 | Probe site
bnlg1272 | 9.00 | AG | 0.89 | 22 | Probe site
umc2084 | 9.01 | CTAG | 0.49 | 8 | Probe site
bnlg1520k1 1) | 9.01 | AG | 0.59 | 13 | Probe site
phi065 | 9.03 | CACCT | 0.51 | 9 | pep1 (phosphoenolpyruvate carboxylase 1)
umc1492y13 1) | 9.04 | GCT | 0.25 | 14 | Probe site
umc1231k4 1) | 9.05 | GA | 0.22 | 10 | Probe site
phi108411 | 9.06 | AGCT | 0.49 | 5 | Probe site
phi448880 | 9.06 | AAG | 0.76 | 10 | Probe site
umc1675 | 9.07 | CGCC | 0.67 | 7 | Probe site
phi041y6 1) | 10.00 | AGCC | 0.41 | 7 | Probe site
umc1432y6 1) | 10.02 | AG | 0.75 | 12 | Probe site
umc1367 | 10.03 | CGA | 0.64 | 10 | Probe site
umc2016 | 10.03 | ACAT | 0.51 | 7 | pao1 (polyamine oxidase 1)
phi062 | 10.04 | ACG | 0.33 | 7 | mgs1 (male-gametophyte specific 1)
phi071 | 10.04 | GGA | 0.51 | 5 | hsp90 (heat shock protein, 90 kDa)
1) These primers were provided by Beijing Academy of Agricultural and Forestry Sciences (Beijing, China).
2) Searched from

Phylogenetic analysis and population genetic structure
Relationships among all of the 143 accessions collected from SR were evaluated by using the unweighted pair group method with neighbor-joining (NJ) based on the log transformation of the proportion of shared alleles distance (InSPAD) via PowerMarker ver. 3.25 (Fukunaga et al. 2005). The unrooted phylogenetic tree was finally drawn with the software MEGA (molecular evolutionary genetics analysis) ver. 3.1 (Kumar et al. 2004). Additionally, a chi-square test was used to reveal the correlation between the geographical origins and the SSR-clustered groups through the FREQ procedure implemented in SAS ver. 9.0 (2002, SAS Institute, Inc.).
In order to reveal the population genetic structure (PGS) of the 143 landrace accessions, a Bayesian approach was first applied to determine the number of groups (K) to which these materials should be assigned, using the software BAPS (Bayesian Analysis of Population Structure) ver. 5.1. In BAPS, a fixed-K clustering procedure was applied; for each separate K, the number of runs was set to 100, and the value of log (mL) was averaged to determine the appropriate K value (Corander et al. 2003; Corander and Tang 2007). Once the number of groups was determined, a model-based clustering analysis was used to assign all of the accessions to the corresponding groups with an admixture model and correlated allele frequencies via the software Structure ver. 2.2 (Pritchard et al. 2000; Falush et al. 2007). For the K value determined by BAPS, three independent runs were carried out with both the burn-in period and the replication number set to 100000. The threshold probability for assigning individuals to groups was set at 0.8 (Liu et al. 2003). The PGS result obtained with Structure was visualized via the Distruct program ver.
1.1 (Rosenberg 2004).

RESULTS

Genetic diversity
A total of 517 alleles were detected across all 143 maize landraces by the whole set of 54 SSRs covering the entire maize genome, with an average of 9.57 alleles per locus and a range from 4 (umc1121) to 22 (bnlg1272) (Table 3). Among all the alleles detected, 132 (25.53%) were distinct alleles, an average of 2.44 per locus. The number of distinct alleles differed significantly among the landraces from different provinces/regions: the landraces from Guangdong, Fujian, Zhejiang, and Shanghai possessed more distinct alleles than those from the other provinces/regions, while those from southern Anhui possessed the fewest, accounting for only 3.28% of the total (Table 4).

Table 4 The genetic diversity within eight provinces/regions and groups revealed by 54 SSRs
Province/Region | Sample size | Allele no. 1) | Distinct allele no. | Gene diversity (expected heterozygosity) | Observed heterozygosity
Anhui | 14 | 4.28 (4.19) | 69 (72.4) | 0.51 (0.54) | 0.58 (0.58)
Fujian | 17 | 4.93 (4.58) | 80 (79.3) | 0.56 (0.60) | 0.63 (0.62)
Guangdong | 22 | 5.48 (4.67) | 88 (80.4) | 0.57 (0.59) | 0.59 (0.58)
Hainan | 18 | 4.65 (4.26) | 79 (75.9) | 0.53 (0.57) | 0.55 (0.59)
Jiangsu | 13 | 4.24 | 70 | 0.50 | 0.55
Jiangxi | 21 | 4.96 (4.35) | 72 (68.7) | 0.56 (0.60) | 0.68 (0.68)
Shanghai | 15 | 5.07 (4.89) | 90 (91.4) | 0.55 (0.60) | 0.55 (0.55)
Zhejiang | 23 | 5.04 (4.24) | 85 (74) | 0.53 (0.55) | 0.60 (0.61)
Total/average | 143 | 9.57 | 132 | 0.61 | 0.60
Group
Group 1 | 55 | 6.63 (6.40) | 91 (89.5) | 0.57 (0.58) | 0.62 (0.62)
Group 2 | 88 | 7.94 (6.72) | 110 (104.3) | 0.57 (0.57) | 0.59 (0.58)
Total/Average | 143 | 9.57 | 132 | 0.61 | 0.60
Provinces/Regions within a group
Group 1, Total | 55 | 6.69 (6.40) | 91 | 0.57 (0.58) | 0.62 (0.62)
Guangdong | 22 | 5.48 (4.99) | 86 (90.1) | 0.57 (0.60) | 0.59 (0.58)
Hainan | 18 | 4.65 (4.38) | 79 (73.9) | 0.53 (0.56) | 0.55 (0.59)
Jiangxi | 15 | 4.30 | 68 | 0.54 | 0.69
Group 2, Total | 88 | 7.97 (6.72) | 110 (104.3) | 0.57 (0.57) | 0.59 (0.58)
Anhui | 14 | 4.28 (3.22) | 69 (63.2) | 0.51 (0.54) | 0.58 (0.57)
Fujian | 17 | 4.93 (3.58) | 78 (76.6) | 0.56 (0.60) | 0.63 (0.61)
Jiangsu | 13 | 4.24 (3.22) | 71 (64.3) | 0.50 (0.54) | 0.55 (0.54)
Jiangxi | 6 | 3.07 | 52 | 0.46 | 0.65
Shanghai | 15 | 5.07 (3.20) | 91 (84.1) | 0.55 (0.60) | 0.55 (0.54)
Zhejiang | 23 | 5.04 (3.20) | 83 (61.7) | 0.53 (0.54) | 0.60 (0.58)

Among the 54 loci used in the study, 16 (or 29.63%) were dinucleotide repeat SSRs, which were defined as class I-I; the other 38 loci, comprising SSRs with longer repeat motifs and two with unknown repeat motifs, were defined as class I-II. In addition, 15 loci were located within certain functional genes (defined as class II-I) and the rest were defined as class II-II. The comparison indicated that the average numbers of alleles per locus captured by class I-I and II-II were 12.88 and 10.05, respectively, which were significantly higher than those captured by class I-II and II-I (8.18 and 8.38, respectively). The gene diversity revealed by class I-I (0.63) and II-I (0.63) was somewhat higher than that revealed by class I-II (0.60) and II-II (0.60) (Table 5).

Genetic relationships of the core landraces
Overall, the 143 landraces were clustered into two groups by the neighbor-joining (NJ) method based on InSPAD. All the landraces from Guangdong and Hainan provinces and 15 of the 21 from Jiangxi clustered together to form group 1, and the remaining 88 landraces from the other provinces/regions formed group 2 (Fig.-B). The geographical origins of all 143 landraces, together with the clustering results, are shown schematically in Fig.-D.
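The association between SSR-clustered group and province of origin can be checked by hand with a Pearson chi-square test on the 8 × 2 table of counts in Table 2. The following is a minimal sketch in plain Python rather than the SAS FREQ procedure named in the Methods; the function name `pearson_chi_square` is illustrative only, and empty cells in Table 2 are taken to be zero.

```python
# Counts per province from Table 2: (landraces in Group 1, landraces in Group 2).
# Blank cells in the published table are assumed to be 0.
table2 = {
    "Guangdong": (22, 0),
    "Hainan": (18, 0),
    "Jiangxi": (15, 6),
    "Anhui": (0, 14),
    "Fujian": (0, 17),
    "Jiangsu": (0, 13),
    "Shanghai": (0, 15),
    "Zhejiang": (0, 23),
}


def pearson_chi_square(rows):
    """Return (chi-square statistic, degrees of freedom) for an r x c count table."""
    n_cols = len(rows[0])
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(row[c] for row in rows) for c in range(n_cols)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for r, row in enumerate(rows):
        for c, observed in enumerate(row):
            # Expected count under independence of group and province.
            expected = row_totals[r] * col_totals[c] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (n_cols - 1)
    return chi2, df


chi2, df = pearson_chi_square(list(table2.values()))
print(f"chi-square = {chi2:.2f}, df = {df}")  # chi-square = 124.89, df = 7
```

With 7 degrees of freedom, a statistic of this size is far beyond any conventional significance threshold, in line with the reported P < 0.0001.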
The chi-square test showed that the phylogenetic grouping (SSR-clustered groups) of the 143 landraces was significantly correlated with their geographical origin (χ2 = 124.89, P < 0.0001, Table 2).
In the phylogenetic analysis based on InSPAD, the minimum distance, 0.1671, was observed between two landraces collected from Jiangxi Province, Tianhongyumi (JiangX-03) and Hongganshanyumi (JiangX-04), and the maximum, 1.3863, between Huangbaosu (FuJ-16) from Fujian and Hongyumi (HaiN-14) from Hainan (data not shown). Two landraces with the same name, Shanyumi, collected from the same location, Dexing County (JiangX-01 and JiangX-21; Table 1), were assigned to different groups: JiangX-01 to group 1 and JiangX-21 to group 2 (Table 1). Moreover, the distance between JiangX-01 and JiangX-21 was rather large, 0.9808 (data not shown). These results indicated that JiangX-01 and JiangX-21 possibly had different ancestral origins.

Population structure
A Bayesian method was used to detect the number of groups (K value) in the whole set of landraces from SR, with a fixed-K clustering procedure implemented in the BAPS software ver. 5.1. The result showed that all of the 143 landraces could also be assigned to two groups (Fig.-A). Then, a model-based clustering method was applied to determine the PGS of all the landraces via Structure ver. 2.2 with K=2. This method assigns individuals to groups based on membership probability, and a threshold probability of 0.80 was set for the assignment of individuals (Liu et al. 2003). Accordingly, all of the 143 landraces were divided into two distinct model-based groups (Fig.-C).
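The assignment rule can be illustrated with a short sketch. It assumes the membership proportions from Structure are already available as pairs of numbers, as listed in Table 1, and applies the 0.80 threshold; the function name `assign_group` and the "mixed" label for accessions below the threshold in both groups are assumptions for illustration, not part of the paper.

```python
def assign_group(q_group1, q_group2, threshold=0.80):
    """Assign an accession to a group when its membership proportion reaches the threshold."""
    if q_group1 >= threshold:
        return "Group 1"
    if q_group2 >= threshold:
        return "Group 2"
    # Hypothetical label: no accession in Table 1 actually falls in this category.
    return "mixed"


# Membership proportions (Group 1, Group 2) for a few accessions from Table 1,
# chosen because they lie closest to the threshold.
memberships = {
    "GuangD-20": (0.940, 0.060),
    "JiangX-13": (0.968, 0.032),
    "ShangH-04": (0.078, 0.922),
    "ZheJ-23": (0.060, 0.940),
}

assignments = {code: assign_group(q1, q2) for code, (q1, q2) in memberships.items()}
for code, group in assignments.items():
    print(code, group)
```

Even the least clear-cut accessions in Table 1 have a membership proportion above 0.92 in one group, so every accession is assigned unambiguously under this rule.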
The landraces from Guangdong and Hainan and 15 landraces from Jiangxi formed one group, while the remaining 6 landraces from the marginal counties of northern Jiangxi and those from the other provinces formed another group (Table 1, Fig.-D). The PGS revealed by the model-based approach via Structure was perfectly consistent with the relationships resulting from the phylogenetic analysis via PowerMarker (Table 1).

DISCUSSION

The SR includes eight provinces, i.e., southern Jiangsu and Anhui, Shanghai, Zhejiang, Fujian, Jiangxi, Guangdong, and Hainan (Fig.-C), with the annual maize growing area of about 1 million ha (less than 5% of the

Table 5 The genetic diversity detected with different types of SSR markers
Type of locus | No. of alleles | Gene diversity | Expected heterozygosity | PIC
Class I-I | 12.88 | 0.63 | 0.65 | 0.60
Class I-II | 8.18 | 0.60 | 0.58 | 0.55
Class II-I | 8.33 | 0.63 | 0.63 | 0.58
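The per-locus statistics summarized in these tables (number of alleles, unbiased gene diversity after Nei 1987, and observed heterozygosity) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PowerMarker implementation: it assumes diploid genotypes recorded as pairs of allele sizes for individual plants, whereas the study genotyped bulked DNA from 15 plants per landrace, and the example genotypes and function names are hypothetical.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Allele frequencies at one locus from a list of diploid genotypes (a, b)."""
    counts = Counter()
    for a, b in genotypes:
        counts[a] += 1
        counts[b] += 1
    total = 2 * len(genotypes)
    return {allele: count / total for allele, count in counts.items()}


def unbiased_gene_diversity(genotypes):
    """Nei's unbiased gene diversity: 2n / (2n - 1) * (1 - sum of squared allele frequencies)."""
    n = len(genotypes)
    freqs = allele_frequencies(genotypes)
    return 2 * n / (2 * n - 1) * (1 - sum(p * p for p in freqs.values()))


def observed_heterozygosity(genotypes):
    """Proportion of genotypes carrying two different alleles."""
    return sum(1 for a, b in genotypes if a != b) / len(genotypes)


# Hypothetical SSR genotypes (allele sizes in bp) for four plants at one locus.
genotypes = [(120, 120), (120, 124), (124, 124), (120, 124)]

n_alleles = len(allele_frequencies(genotypes))
h = unbiased_gene_diversity(genotypes)
h_obs = observed_heterozygosity(genotypes)
print(n_alleles, round(h, 4), h_obs)  # 2 0.5714 0.5
```

Averaging `n_alleles` and `h` over all loci gives the kind of summary reported per province, group, or marker class.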
rrBLUP: reference manual for ridge-regression-based genomic prediction software
Package 'rrBLUP'
December 10, 2023

Title: Ridge Regression and Other Kernels for Genomic Selection
Version: 4.6.3
Author: Jeffrey Endelman
Maintainer: Jeffrey Endelman <*****************>
Depends: R (>= 4.0)
Imports: stats, graphics, grDevices, parallel
Description: Software for genomic prediction with the RR-BLUP mixed model (Endelman 2011, <doi:10.3835/plantgenome2011.08.0024>). One application is to estimate marker effects by ridge regression; alternatively, BLUPs can be calculated based on an additive relationship matrix or a Gaussian kernel.
License: GPL-3
URL: <https:///software/>
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2023-12-10 17:10:06 UTC

R topics documented:
    rrBLUP-package
    A.mat
    GWAS
    kin.blup
    kinship.BLUP
    mixed.solve


rrBLUP-package    Ridge regression and other kernels for genomic selection

Description

This package has been developed primarily for genomic prediction with mixed models (but it can also do genome-wide association mapping with GWAS). The heart of the package is the function mixed.solve, which is a general-purpose solver for mixed models with a single variance component other than the error. Genomic predictions can be made by estimating marker effects (RR-BLUP) or by estimating line effects (G-BLUP).

In Endelman (2011) I made the poor choice of using the letter G to denote the genotype or marker data. To be consistent with Endelman (2011) I have retained this notation in kinship.BLUP. However, that function has now been superseded by kin.blup and A.mat, the latter being a utility for estimating the additive relationship matrix (A) from markers. In these newer functions I adopt the usual convention that G is the genetic covariance (not the marker data), which is also consistent with the notation in Endelman and Jannink (2012).

Vignettes illustrating some of the features of this package can be found at https://potatobreeding./software/.

References

Endelman, J.B. 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4:250-255. <doi:10.3835/plantgenome2011.08.0024>
Endelman, J.B., and J.-L. Jannink. 2012. Shrinkage estimation of the realized relationship matrix. G3: Genes, Genomes, Genetics 2:1405-1413. <doi:10.1534/g3.112.004259>


A.mat    Additive relationship matrix

Description

Calculates the realized additive relationship matrix.

Usage

    A.mat(X, min.MAF = NULL, max.missing = NULL, impute.method = "mean",
          tol = 0.02, n.core = 1, shrink = FALSE, return.imputed = FALSE)

Arguments

X                Matrix (n × m) of unphased genotypes for n lines and m biallelic markers, coded as {-1, 0, 1}. Fractional (imputed) and missing values (NA) are allowed.
min.MAF          Minimum minor allele frequency. The A matrix is not sensitive to rare alleles, so by default only monomorphic markers are removed.
max.missing      Maximum proportion of missing data; default removes completely missing markers.
impute.method    There are two options. The default is "mean", which imputes with the mean for each marker. The "EM" option imputes with an EM algorithm (see Details).
tol              Specifies the convergence criterion for the EM algorithm (see Details).
n.core           Specifies the number of cores to use for parallel execution of the EM algorithm.
shrink           Set shrink=FALSE to disable shrinkage estimation. See Details for how to enable shrinkage estimation.
return.imputed   When TRUE, the imputed marker matrix is returned.

Details

At high marker density, the relationship matrix is estimated as \( A = W W' / c \), where \( W_{ik} = X_{ik} + 1 - 2 p_k \) and \( p_k \) is the frequency of the 1 allele at marker k. By using a normalization constant of \( c = 2 \sum_k p_k (1 - p_k) \), the mean of the diagonal elements is \( 1 + f \) (Endelman and Jannink 2012).

The EM imputation algorithm is based on the multivariate normal distribution and was designed for use with GBS (genotyping-by-sequencing) markers, which tend to be high density but with lots of missing data. Details are given in Poland et al. (2012). The EM algorithm stops at iteration t when the RMS error \( = n^{-1} \lVert A_t - A_{t-1} \rVert_2 < \) tol.

Shrinkage estimation can improve the accuracy of genome-wide marker-assisted selection, particularly at low marker density (Endelman and Jannink 2012). The shrinkage intensity ranges from 0 (no shrinkage) to 1 (\( A = (1 + f) I \)). Two algorithms for estimating the shrinkage intensity are available. The first is the method described in Endelman and Jannink (2012) and is specified by shrink=list(method="EJ"). The second involves designating a random sample of the markers as simulated QTL and then regressing the A matrix based on the QTL against the A matrix based on the remaining markers (Yang et al. 2010; Mueller et al. 2015). The regression method is specified by shrink=list(method="REG", n.qtl=100, n.iter=5), where the parameters n.qtl and n.iter can be varied to adjust the number of simulated QTL and number of iterations, respectively.

The shrinkage and EM-imputation options are designed for opposite scenarios (low vs. high density) and cannot be used simultaneously.

When the EM algorithm is used, the imputed alleles can lie outside the interval [-1, 1]. Polymorphic markers that do not meet the min.MAF and max.missing criteria are not imputed.

Value

If return.imputed = FALSE, the n × n additive relationship matrix is returned.
If return.imputed = TRUE, the function returns a list containing

$A         the A matrix
$imputed   the imputed marker matrix

References

Endelman, J.B., and J.-L. Jannink. 2012. Shrinkage estimation of the realized relationship matrix. G3: Genes, Genomes, Genetics 2:1405-1413. <doi:10.1534/g3.112.004259>
Mueller et al. 2015. Shrinkage estimation of the genomic relationship matrix can improve genomic estimated breeding values in the training set. Theor. Appl. Genet. 128:693-703. <doi:10.1007/s00122-015-2464-6>
Poland, J., J. Endelman et al. 2012. Genomic selection in wheat breeding using genotyping-by-sequencing. Plant Genome 5:103-113. <doi:10.3835/plantgenome2012.06.0006>
Yang et al. 2010. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42:565-569. <doi:10.1038/ng.608>


GWAS    Genome-wide association analysis

Description

Performs genome-wide association analysis based on the mixed model (Yu et al. 2006):

\[ y = X\beta + Zg + S\tau + \varepsilon \]

where \( \beta \) is a vector of fixed effects that can model both environmental factors and population structure. The variable \( g \) models the genetic background of each line as a random effect with \( \mathrm{Var}[g] = K\sigma^2 \). The variable \( \tau \) models the additive SNP effect as a fixed effect. The residual variance is \( \mathrm{Var}[\varepsilon] = I\sigma_e^2 \).

Usage

    GWAS(pheno, geno, fixed = NULL, K = NULL, n.PC = 0, min.MAF = 0.05,
         n.core = 1, P3D = TRUE, plot = TRUE)

Arguments

pheno     Data frame where the first column is the line name (gid). The remaining columns can be either a phenotype or the levels of a fixed effect. Any column not designated as a fixed effect is assumed to be a phenotype.
geno      Data frame with the marker names in the first column. The second and third columns contain the chromosome and map position (either bp or cM), respectively, which are used only when plot=TRUE to make Manhattan plots. If the markers are unmapped, just use a placeholder for those two columns. Columns 4 and higher contain the marker scores for each line, coded as {-1, 0, 1} = {aa, Aa, AA}. Fractional (imputed) and missing (NA) values are allowed. The column names must match the line names in the "pheno" data frame.
fixed     An array of strings containing the names of the columns that should be included as (categorical) fixed effects in the mixed model.
K         Kinship matrix for the covariance between lines due to a polygenic effect. If not passed, it is calculated from the markers using A.mat.
n.PC      Number of principal components to include as fixed effects. Default is 0 (equals K model).
min.MAF   Specifies the minimum minor allele frequency (MAF). If a marker has a MAF less than min.MAF, it is assigned a zero score.
n.core    Setting n.core > 1 will enable parallel execution on a machine with multiple cores (use only at UNIX command line).
P3D       When P3D=TRUE, variance components are estimated by REML only once, without any markers in the model. When P3D=FALSE, variance components are estimated by REML for each marker separately.
plot      When plot=TRUE, qq and Manhattan plots are generated.

Details

For unbalanced designs where phenotypes come from different environments, the environment mean can be modeled using the fixed option (e.g., fixed="env" if the column in the pheno data.frame is called "env"). When principal components are included (P+K model), the loadings are determined from an eigenvalue decomposition of the K matrix.

The terminology "P3D" (population parameters previously determined) was introduced by Zhang et al. (2010). When P3D=FALSE, this function is equivalent to EMMA with REML (Kang et al. 2008). When P3D=TRUE, it is equivalent to EMMAX (Kang et al. 2010). The P3D=TRUE option is faster but can underestimate significance compared to P3D=FALSE.

The dashed line in the Manhattan plots corresponds to an FDR of 0.05 and is calculated using the qvalue package (Storey and Tibshirani 2003). The p-value corresponding to a q-value of 0.05 is determined by interpolation. When there are no q-values less than 0.05, the dashed line is omitted.

Value

Returns a data frame where the first three columns are the marker name, chromosome, and position, and subsequent columns are the marker scores \( (-\log_{10} p) \) for the traits.

References

Kang et al. 2008. Efficient control of population structure in model organism association mapping. Genetics 178:1709-1723.
Kang et al. 2010. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42:348-354.
Storey and Tibshirani. 2003. Statistical significance for genome-wide studies. PNAS 100:9440-9445.
Yu et al. 2006. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38:203-208.
Zhang et al. 2010. Mixed linear model approach adapted for genome-wide association studies. Nat. Genet. 42:355-360.

Examples

    # random population of 200 lines with 1000 markers
    # (GWAS expects markers in rows and lines in columns)
    M <- matrix(rep(0, 200 * 1000), 1000, 200)
    for (i in 1:200) {
      M[, i] <- ifelse(runif(1000) < 0.5, -1, 1)
    }
    colnames(M) <- 1:200
    geno <- data.frame(marker = 1:1000, chrom = rep(1, 1000), pos = 1:1000,
                       M, check.names = FALSE)
    QTL <- 100 * (1:5)   # pick 5 QTL
    u <- rep(0, 1000)    # marker effects
    u[QTL] <- 1
    g <- as.vector(crossprod(M, u))
    h2 <- 0.5
    y <- g + rnorm(200, mean = 0, sd = sqrt((1 - h2) / h2 * var(g)))
    pheno <- data.frame(line = 1:200, y = y)
    scores <- GWAS(pheno, geno, plot = FALSE)


kin.blup    Genotypic value prediction based on kinship

Description

Genotypic value prediction by G-BLUP, where the genotypic covariance G can be additive or based on a Gaussian kernel.

Usage

    kin.blup(data, geno, pheno, GAUSS = FALSE, K = NULL, fixed = NULL,
             covariate = NULL, PEV = FALSE, n.core = 1, theta.seq = NULL)

Arguments

data        Data frame with columns for the phenotype, the genotype identifier, and any environmental variables.
geno        Character string for the name of the column in the data frame that contains the genotype identifier.
pheno       Character string for the name of the column in the data frame that contains the phenotype.
GAUSS       To model genetic covariance with a Gaussian kernel, set GAUSS=TRUE and pass the Euclidean distance for K (see below).
K           There are three options for specifying kinship: (1) If K=NULL, genotypes are assumed to be independent (\( G = I V_g \)). (2) For breeding value prediction, set GAUSS=FALSE and use an additive relationship matrix for K to create the model (\( G = K V_g \)). (3) For the Gaussian kernel, set GAUSS=TRUE and pass the Euclidean distance matrix for K to create the model \( G_{ij} = e^{-(K_{ij}/\theta)^2} V_g \).
fixed       An array of strings containing the names of columns that should be included as (categorical) fixed effects in the mixed model.
covariate   An array of strings containing the names of columns that should be included as covariates in the mixed model.
PEV         When PEV=TRUE, the function returns the prediction error variance for the genotypic values (\( PEV_i = \mathrm{Var}[g_i^* - g_i] \)).
n.core      Specifies the number of cores to use for parallel execution of the Gaussian kernel method (use only at UNIX command line).
theta.seq   The scale parameter for the Gaussian kernel is set by maximizing the restricted log-likelihood over a grid of values. By default, the grid is constructed by dividing the interval (0, max(K)] into 10 points. Passing a numeric array to this variable (theta.seq = "theta sequence") will specify a different set of grid points (e.g., for large problems you might want fewer than 10).

Details

This function is a wrapper for mixed.solve and thus solves mixed models of the form:

\[ y = X\beta + [Z \;\; 0]\, g + \varepsilon \]

where \( \beta \) is a vector of fixed effects, \( g \) is a vector of random genotypic values with covariance \( G = \mathrm{Var}[g] \), and the residuals follow \( \mathrm{Var}[\varepsilon_i] = R_i \sigma_e^2 \), with \( R_i = 1 \) by default. The design matrix for the genetic values has been partitioned to illustrate that not all lines need phenotypes (i.e., for genomic selection). Unlike mixed.solve, this function does not return estimates of the fixed effects, only the BLUP solution for the genotypic values. It was designed to replace kinship.BLUP and to relieve the user of having to explicitly construct design matrices. Variance components are estimated by REML and BLUP values are returned for every entry in K, regardless of whether it has been phenotyped. The rownames of K must match the genotype labels in the data frame for phenotyped lines; missing phenotypes (NA) are simply omitted.

Unlike its predecessor, this function does not handle marker data directly. For breeding value prediction, the user must supply a relationship matrix, which can be calculated from markers with A.mat. For Gaussian kernel predictions, pass the Euclidean distance matrix for K, which can be calculated with dist.

In the terminology of mixed models, both the "fixed" and "covariate" variables are fixed effects (\( \beta \) in the above equation): the former are treated as factors with distinct levels while the latter are continuous with one coefficient per variable. The population mean is automatically included as a fixed effect.

The prediction error variance (PEV) is the square of the SE of the BLUPs (see mixed.solve) and can be used to estimate the expected accuracy of BLUP predictions according to

\[ r_i^2 = 1 - \frac{PEV_i}{V_g K_{ii}} \]

Value

The function always returns

$Vg      REML estimate of the genetic variance
$Ve      REML estimate of the error variance
$g       BLUP solution for the genetic values
$resid   residuals
$pred    predicted genetic values, averaged over the fixed effects

If PEV = TRUE, the list also includes

$PEV     Prediction error variance for the genetic values

If GAUSS = TRUE, the list also includes

$profile the log-likelihood profile for the scale parameter in the Gaussian kernel

References

Endelman, J.B. 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4:250-255. <doi:10.3835/plantgenome2011.08.0024>

Examples

    # random population of 200 lines with 1000 markers
    M <- matrix(rep(0, 200 * 1000), 200, 1000)
    for (i in 1:200) {
      M[i, ] <- ifelse(runif(1000) < 0.5, -1, 1)
    }
    rownames(M) <- 1:200
    A <- A.mat(M)

    # random phenotypes
    u <- rnorm(1000)
    g <- as.vector(crossprod(t(M), u))
    h2 <- 0.5   # heritability
    y <- g + rnorm(200, mean = 0, sd = sqrt((1 - h2) / h2 * var(g)))
    data <- data.frame(y = y, gid = 1:200)

    # predict breeding values
    ans <- kin.blup(data = data, geno = "gid", pheno = "y", K = A)
    accuracy <- cor(g, ans$g)


kinship.BLUP    Genomic prediction by kinship-BLUP (deprecated)

Description

*** This function has been superseded by kin.blup; please refer to its help page.

Usage

    kinship.BLUP(y, G.train, G.pred = NULL, X = NULL, Z.train = NULL,
                 K.method = "RR", n.profile = 10, mixed.method = "REML", n.core = 1)

Arguments

y              Vector (n.obs × 1) of observations. Missing values (NA) are omitted.
G.train        Matrix (n.train × m) of unphased genotypes for the training population: n.train lines with m bi-allelic markers. Genotypes should be coded as {-1, 0, 1}; fractional (imputed) and missing (NA) alleles are allowed.
G.pred         Matrix (n.pred × m) of unphased genotypes for the prediction population: n.pred lines with m bi-allelic markers. Genotypes should be coded as {-1, 0, 1}; fractional (imputed) and missing (NA) alleles are allowed.
X              Design matrix (n.obs × p) of fixed effects. If not passed, a vector of 1's is used to model the intercept.
Z.train        0-1 matrix (n.obs × n.train) relating observations to lines in the training set. If not passed the identity matrix is used.
K.method       "RR" (default) is ridge regression, for which K is the realized additive relationship matrix computed with A.mat. The option "GAUSS" is a Gaussian kernel (\( K = e^{-D^2/\theta^2} \)) and "EXP" is an exponential kernel (\( K = e^{-D/\theta} \)), where Euclidean distances D are computed with dist.
n.profile      For K.method = "GAUSS" or "EXP", the number of points to use in the log-likelihood profile for the scale parameter \( \theta \).
mixed.method   Either "REML" (default) or "ML".
n.core         Setting n.core > 1 will enable parallel execution of the Gaussian kernel computation (use only at UNIX command line).

Value

$g.train   BLUP solution for the training set
$g.pred    BLUP solution for the prediction set (when G.pred != NULL)
$beta      ML estimate of fixed effects

For GAUSS or EXP, function also returns

$profile   log-likelihood profile for the scale parameter

References

Endelman, J.B. 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4:250-255.

Examples

    # random population of 200 lines with 1000 markers
    G <- matrix(rep(0, 200 * 1000), 200, 1000)
    for (i in 1:200) {
      G[i, ] <- ifelse(runif(1000) < 0.5, -1, 1)
    }

    # random phenotypes
    g <- as.vector(crossprod(t(G), rnorm(1000)))
    h2 <- 0.5
    y <- g + rnorm(200, mean = 0, sd = sqrt((1 - h2) / h2 * var(g)))

    # split in half for training and prediction
    train <- 1:100
    pred <- 101:200
    ans <- kinship.BLUP(y = y[train], G.train = G[train, ], G.pred = G[pred, ],
                        K.method = "GAUSS")

    # correlation accuracy
    r.gy <- cor(ans$g.pred, y[pred])


mixed.solve    Mixed-model solver

Description

Calculates maximum-likelihood (ML/REML) solutions for mixed models of the form

\[ y = X\beta + Zu + \varepsilon \]

where \( \beta \) is a vector of fixed effects and \( u \) is a vector of random effects with \( \mathrm{Var}[u] = K\sigma_u^2 \). The residual variance is \( \mathrm{Var}[\varepsilon] = I\sigma_e^2 \). This class of mixed models, in which there is a single variance component other than the residual error, has a close relationship with ridge regression (ridge parameter \( \lambda = \sigma_e^2 / \sigma_u^2 \)).

Usage

    mixed.solve(y, Z = NULL, K = NULL, X = NULL, method = "REML",
                bounds = c(1e-09, 1e+09), SE = FALSE, return.Hinv = FALSE)

Arguments

y             Vector (n × 1) of observations. Missing values (NA) are omitted, along with the corresponding rows of X and Z.
Z             Design matrix (n × m) for the random effects. If not passed, assumed to be the identity matrix.
K             Covariance matrix (m × m) for random effects; must be positive semi-definite. If not passed, assumed to be the identity matrix.
X             Design matrix (n × p) for the fixed effects. If not passed, a vector of 1's is used to model the intercept. X must be full column rank (implies \( \beta \) is estimable).
method        Specifies whether the full ("ML") or restricted ("REML") maximum-likelihood method is used.
bounds        Array with two elements specifying the lower and upper bound for the ridge parameter.
SE            If TRUE, standard errors are calculated.
return.Hinv   If TRUE, the function returns the inverse of \( H = ZKZ' + \lambda I \). This is useful for GWAS.

Details

This function can be used to predict marker effects or breeding values (see examples). The numerical method is based on the spectral decomposition of \( ZKZ' \) and \( SZKZ'S \), where \( S = I - X(X'X)^{-1}X' \) is the projection operator for the nullspace of X (Kang et al., 2008). This algorithm generates the inverse phenotypic covariance matrix \( V^{-1} \), which can then be used to calculate the BLUE and BLUP solutions for the fixed and random effects, respectively, using standard formulas (Searle et al. 1992):

\[ \mathrm{BLUE}(\beta) = \beta^* = (X'V^{-1}X)^{-1} X'V^{-1} y \]
\[ \mathrm{BLUP}(u) = u^* = \sigma_u^2 K Z' V^{-1} (y - X\beta^*) \]

The standard errors are calculated as the square root of the diagonal elements of the following matrices (Searle et al. 1992):

\[ \mathrm{Var}[\beta^*] = (X'V^{-1}X)^{-1} \]
\[ \mathrm{Var}[u^* - u] = K\sigma_u^2 - \sigma_u^4 K Z' V^{-1} Z K + \sigma_u^4 K Z' V^{-1} X \,\mathrm{Var}[\beta^*]\, X' V^{-1} Z K \]

For marker effects where K = I, the function will run faster if K is not passed than if the user passes the identity matrix.

Value

If SE = FALSE, the function returns a list containing

$Vu     estimator for \( \sigma_u^2 \)
$Ve     estimator for \( \sigma_e^2 \)
$beta   BLUE(\( \beta \))
$u      BLUP(\( u \))
$LL     maximized log-likelihood (full or restricted, depending on method)

If SE = TRUE, the list also contains

$beta.SE   standard error for \( \beta \)
$u.SE      standard error for \( u^* - u \)

If return.Hinv = TRUE, the list also contains

$Hinv   the inverse of H

References

Kang et al. 2008. Efficient control of population structure in model organism association mapping. Genetics 178:1709-1723.
Endelman, J.B. 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4:250-255.
Searle, S.R., G. Casella and C.E. McCulloch. 1992. Variance Components. John Wiley, Hoboken.

Examples

    # random population of 200 lines with 1000 markers
    M <- matrix(rep(0, 200 * 1000), 200, 1000)
    for (i in 1:200) {
      M[i, ] <- ifelse(runif(1000) < 0.5, -1, 1)
    }

    # random phenotypes
    u <- rnorm(1000)
    g <- as.vector(crossprod(t(M), u))
    h2 <- 0.5   # heritability
    y <- g + rnorm(200, mean = 0, sd = sqrt((1 - h2) / h2 * var(g)))

    # predict marker effects
    ans <- mixed.solve(y, Z = M)   # by default K = I
    accuracy <- cor(u, ans$u)

    # predict breeding values
    ans <- mixed.solve(y, K = A.mat(M))
    accuracy <- cor(g, ans$u)
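The mixed.solve description says that this class of mixed models is closely related to ridge regression, with ridge parameter λ = σ²e/σ²u. The following is a minimal sketch of that relationship in plain Python (not R, and not the rrBLUP implementation). It assumes K = I, a known λ rather than a REML estimate, and no fixed effects, so the term Xβ is dropped. The toy marker matrix, the phenotypes and every helper name are made up for illustration. Under those assumptions the BLUP formula reduces to u* = Z'(ZZ' + λI)⁻¹y, which should agree with the ridge solution (Z'Z + λI)⁻¹Z'y.

```python
# Sketch: ridge regression and BLUP (K = I, no fixed effects) give the same
# marker effects. All data and helper names here are hypothetical.

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def add_ridge(A, lam):
    """Return A + lam * I for a square matrix A."""
    n = len(A)
    return [[A[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

# Toy data: 4 lines x 3 markers coded {-1, 1}, and phenotypes with mean zero.
Z = [[1, -1, 1], [-1, 1, 1], [1, 1, -1], [-1, -1, 1]]
y = [1.2, -0.4, 0.9, -1.7]
lam = 2.0  # assumed ridge parameter, standing in for sigma_e^2 / sigma_u^2

Zt = transpose(Z)
# Ridge form: (Z'Z + lam I)^-1 Z'y, an m x m system
u_ridge = solve(add_ridge(matmul(Zt, Z), lam), matvec(Zt, y))
# BLUP form with K = I: Z'(ZZ' + lam I)^-1 y, an n x n system
u_blup = matvec(Zt, solve(add_ridge(matmul(Z, Zt), lam), y))

max_diff = max(abs(a - b) for a, b in zip(u_ridge, u_blup))
print(u_ridge, u_blup, max_diff)
```

The two forms differ only in which linear system is solved: m × m for the ridge form and n × n for the BLUP form. The n × n form is the cheaper one when markers far outnumber lines.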
Characteristics and Computing Program of the Restricted Two-Stage Multi-Locus Genome-Wide Association Analysis Method
Acta Agronomica Sinica 2018, 44(9): 1274-1289
ISSN 0496-3490; CN 11-1809/S; CODEN TSHPA9
E-mail: *********************.cn

This study was supported by the National Natural Science Foundation of China (31701447, 31671718), the National Key R&D Program for Crop Breeding in China (2017YFD0101500), the MOE 111 Project (B08025), the MOE Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT_17R55), the China Agriculture Research System (CARS-04), the Jiangsu Higher Education PAPD Program, the Fundamental Research Funds for the Central Universities and the Jiangsu JCIC-MCP.

* Corresponding author: Gai Junyi, E-mail: ************.cn
First author contact: E-mail: ****************
Received: 2018-03-19; Accepted: 2018-06-12; Published online: 2018-06-29.
URL: /kcms/detail/11.1809.S.20180629.1035.002.html
DOI: 10.3724/SP.J.1006.2018.01274

Characteristics and Computing Program of the Restricted Two-Stage Multi-Locus Genome-Wide Association Analysis Method

He Jianbo, Liu Fangdong, Xing Guangnan, Wang Wubin, Zhao Tuanjie, Guan Rongzhan, Gai Junyi*

Soybean Research Institute, Nanjing Agricultural University / Key Laboratory of Soybean Biology and Genetic Breeding, Ministry of Agriculture / National Center for Soybean Improvement / State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing 210095, Jiangsu, China

Abstract: The theory and application of genome-wide association studies (GWAS) have been a focus of quantitative trait research in China and abroad for more than a decade, but earlier GWAS methods concentrated on detecting and mining a few major QTL/genes.
English Medical Literature 2: Book 2, Unit 2, Text C
Unit 2 Gene and Its Application

Text C
Gene Study Helps Unravel Biology of Alcoholism

1 Genomic association studies can help scientists pick out target genes and biological pathways for further investigation, but they are not the end-all tools to explain disease mechanisms.

2 A new genomewide association study (GWAS①) has found several mutations linked to increased susceptibility for developing alcohol dependence, bringing scientists a step closer to understanding the complex biological mechanisms of alcohol use disorders.

3 Two single nucleotide polymorphisms (SNPs②), both located on the chromosomal region 2p35, had the highest degree of association with alcohol dependence in a relatively homogenous patient population. These SNPs are located near the PECR gene, which encodes an enzyme (peroxisomal trans-2-enoyl-coA reductase) involved in fatty-acid metabolism, particularly when the body's energy supply is switched from glucose to fat.

4 A third SNP with a slightly lower association with alcohol dependence is located within the PECR gene. These three variants are "in strong linkage disequilibrium, or LD, meaning that the same variants at different loci almost always appear together," Marcella Rietschel, M.D.③, the senior author of the study, told Psychiatric News④. "So it is very likely, but still not certain, that the PECR gene is involved [in alcohol dependence]."

5 Rietschel is a professor of genetic epidemiology in psychiatry at the Central Institute of Mental Health Mannheim at the University of Heidelberg⑤, Germany.

6 This chromosomal region has been implicated in alcohol dependence in previous research. The PECR gene is expressed most heavily in the liver, but very little in the brain. "As alcohol does not act only on the brain, alcohol dependence can be modulated by many factors whose primary target is not the brain," Rietschel said. "The best known genetic variants modulating alcohol addiction are variants in the genes metabolizing alcohol, like variants in the alcohol dehydrogenase gene clusters."

7 Indeed, this study confirmed several other SNPs associated with alcohol dependence, including those located in the ADH1C gene coding for one of the alcohol-metabolizing alcohol dehydrogenases, as well as the CDH13 gene coding for a cell-adhesion protein known as T-cadherin. Both genes have been implicated in alcohol dependence in previous studies.

8 Despite the strong evidence, these variants and their impact on alcohol dependence need to be replicated by additional studies. "The uncertainty is a problem encountered in many GWA studies: A SNP is found to be associated but is not functional itself, so one cannot be sure if this SNP is itself involved in the regulation of genes or if it is only in LD with other causal SNPs, which can be quite far away or even in other genes," said Rietschel. Since the three top SNPs discovered in this study are close to and in the PECR gene, "this gene definitely merits further investigation."

9 The researchers first performed a genomic scan for more than 500,000 SNPs, using a sample of 487 alcohol-dependent patients and 1,358 controls. The GWAS identified 121 SNPs that were likely candidates for genetic association.

10 The patients selected for the GWAS were German men with a DSM-IV diagnosis of alcohol dependence, whose condition was severe enough to require hospitalization for treatment or prevention of alcohol withdrawal. In addition, the subjects all had an onset of alcohol dependence before age 28; early-onset alcohol dependence has been shown to have a stronger hereditary component. Because alcohol dependence is a multifactorial disorder with multiple phenotypes and genotypes, the researchers narrowed their sampling to clinically similar patients to reduce the heterogeneity.

11 Because of the vast number of mutations existing in every person and the large number scanned in a GWAS, scientists face the challenge of weeding out too many potential false-positive "hits," or variants that appear to be significantly associated with a disease when they are, in fact, random coincidences. To minimize false hits and maximize true disease-associated mutations, strategies such as stringent statistical criteria and replication studies are often used in genetic association studies.

12 Here, the researchers used a method called convergent functional genomics. This approach combines gene-expression data from animal models, evidence from human genetic association studies, and findings from human tissues such as brain tissue in autopsies to help prioritize investigation on the most promising candidate genes or the most likely biological pathways. In this study, 19 SNPs were identified after the human GWAS findings were compared with homologous, over-expressed genes in rats that were "alcoholic" strains.

13 Armed with 121 candidate SNPs from the GWAS and 19 SNPs derived from convergent functional genomics analysis, the researchers performed a replication study of 1,024 male patients with alcohol dependence and 996 age-matched controls. The replication study confirmed that 15 SNPs have a significant association with alcohol dependence.

14 This study was funded by grants from the German government and the European Commission.

Source: /content/44/16/25.1.full
Words: 776

Notes
① Genomewide Association Study 全基因组关联研究
② Single Nucleotide Polymorphisms 单核苷酸多态性
③ abbr. Medicinae Doctor ([拉丁语] 医学博士 (= Doctor of Medicine))
④ Psychiatric News is the newspaper of the American Psychiatric Association (APA). It is published on the first and third Fridays of each month. 《精神病学新闻》
⑤ 德国海德堡大学曼海姆中央精神卫生研究所

Words and Expressions

genomic [ʤi:'nəʊmɪk] adj. of or relating to genome 基因组的;染色体的
target gene 目标基因
dependence [dɪ'pendəns] n. being abnormally tolerant to and dependent on something that is psychologically or physically habit-forming (毒)瘾,吸毒癖好;药瘾
alcohol dependence 酒精瘾
alcohol use disorders 因酒精使用而导致的障碍
nucleotide ['nju:klɪətaɪd] n. a phosphoric ester of a nucleoside; the basic structural unit of nucleic acids (DNA or RNA) 核苷酸
polymorphism [,pɒlɪ'mɔ:fɪzm] n. (biology) the existence of two or more forms of individuals within the same animal species (independent of sex differences) 多形性(现象);多态性
degree of association 相关度,关联度
disequilibrium [,dɪsɪkwɪ'lɪbrɪəm] n. loss of equilibrium attributable to an unstable situation in which some forces outweigh others (尤指经济上的)不平衡,失去平衡,失调,不稳定
linkage disequilibrium 连锁不平衡
express [ɪks'pres] vt. manifest the effects of (a gene or genetic trait) 使(某一基因)在表型中产生有关性状;在表型中表现某一基因的性状(或效应等),使(基因)示性;使(基因)合成特种蛋白
dehydrogenase [di:'haɪdrəʤəneɪs] n. 脱氢酶
alcohol dehydrogenase 醇脱氢酶
gene cluster 基因簇;一组相同或者相似的基因
adhesion protein 粘着蛋白
cadherin 钙黏蛋白
genomic scan 基因组扫描
hospitalization [,hɒspɪtəlaɪ'zeɪʃən] n. the condition of being treated as a patient in a hospital; a period of time when you are confined to a hospital 送进医院治疗;住院(治疗);住院期间
alcohol withdrawal 戒酒;酒精戒断
hereditary [hɪ'redɪtərɪ] adj. tending to occur among members of a family usually by heredity 遗传的;遗传性的
multifactorial [,mʌltɪfæk'tɔ:rɪəl] adj. involving or depending on several factors or causes (especially pertaining to a condition or disease resulting from the interaction of many genes) 多遗传因子的;多种因素的
phenotype ['fi:nə,taɪp] n. what an organism looks like as a consequence of the interaction of its genotype and the environment 表现型;表型;显型
genotype ['ʤi:nəʊtaɪp] n. a group of organisms sharing a specific genetic constitution 基因型,遗传型
heterogeneity [,hetərəʊdʒɪ'niːɪtɪ] n. the quality of being diverse and not comparable in kind 多样性,不均一性;异质性
false-positive 假阳性的
functional genomics 功能基因组学
homologous [hɒ'mɒləgəs] adj. having the same evolutionary origin but serving different functions 同种异体的;相应的,同源的;同系列的,同属列的,同周期的;(细胞、抗血清等)同种的

Comprehension Exercises

Exercise 1 Multiple Choices
Directions: Choose the best answer to each of the following questions.

1) According to GWAS, how many SNPs are found to have the highest degree of association with alcohol dependence?
A) One.
B) Two.
C) Three.
D) Four.

2) Which of the following statements is true, according to the text?
A) Two single nucleotide polymorphisms (SNPs), both located on the chromosomal region 2p33, had a slightly lower degree of association with alcohol dependence.
B) Because of the strong evidence, these SNPs and their impact on alcohol dependence need no additional studies.
C) Because alcohol does not act only on the brain, alcohol dependence can be modulated by many factors whose primary target is not the brain.
D) Armed with 121 candidate SNPs from the GWAS and 19 SNPs derived from convergent functional genomics analysis, the researchers performed a replication study of 1358 male patients with alcohol dependence and 487 age-matched controls.

3) Which gene is helpful to further gene studies, according to a new genomewide association study (GWAS)?
A) SNPs
B) CDH13
C) ADH1C
D) PECR

4) Which of the following strategies or methods is not the one adopted in order to minimize false hits and maximize true disease-associated mutations?
A) Stringent statistical criteria.
B) Replication studies.
C) Convergent functional genomics.
D) Genomic scan.

Key: B-C-D-D

Exercise 2 True or False Statements
Directions: Read the following statements and decide whether they are true (T) or false (F).

( ) 1) Genomic association studies can explain disease mechanisms because they help scientists pick out target genes and biological pathways for further investigation.
( ) 2) A new genomewide association study (GWAS) has found several mutations related to alcohol dependence.
( ) 3) SNPs which are located near the PECR gene encode an enzyme (peroxisomal trans-2-enoyl-coA reductase) involved in fatty-acid metabolism.
( ) 4) The three SNPs which are proved to have a degree of association with alcohol dependence are near or within the PECR gene.
( ) 5) Marcella Rietschel believed that it is very likely that the PECR gene is involved in alcohol dependence.
( ) 6) The PECR gene is expressed most heavily in the liver, but very little in the brain.
( ) 7) The best known genetic variants modulating alcohol addiction are variants in the genes metabolizing alcohol, like variants in the dehydrogenase gene clusters.
( ) 8) The ADH1C gene is the gene that codes for one of the alcohol-metabolizing alcohol dehydrogenases known as T-cadherin.

Key: F-T-F-T-T; T-F-F

Exercise 3 Word-detecting
Directions: Find a word in the designated paragraph to complete the sentence.

1) ADH1C gene and CDH13 gene have been i______ in alcohol dependence in previous studies. (Para. 7)
2) The u______ is a problem encountered in many GWA studies. (Para. 8)
3) The patients selected for the GWAS were German men with a DSM-IV diagnosis of alcohol d______. (Para. 10)
4) Alcohol dependence is a m______ disorder with multiple phenotypes and genotypes. (Para. 10)

Key:
1) implicated
2) uncertainty
3) dependence
4) multifactorial

Vocabulary Exercises
Enhance your command of medical words

Exercise 1 Word-matching
Directions: Choose the definitions in the right column to match the words in the left column.

1) nucleotide        a) (基因)座位(locus 的复数)
2) polymorphism      b) 核苷酸
3) loci              c) 表现型;表型;显型
4) express           d) 多形性(现象);多态性
5) dehydrogenase     e) 使(基因)示性;使(基因)合成特种蛋白
6) cadherin          f) 脱氢酶
7) phenotype         g) 钙黏蛋白
8) genotype          h) 基因型,遗传型

Key: b-d-a-e-f-g-c-h

Exercise 2 Translation
Directions: Translate the following terms into Chinese.

1) target genes
2) alcohol dependence
3) use disorders
4) degree of association
5) linkage disequilibrium
6) alcohol dehydrogenase
7) gene cluster
8) adhesion protein
9) genomic scan
10) alcohol withdrawal
11) false-positive
12) functional genomics

Key:
1) 目标基因
2) 酒精瘾
3) 使用障碍
4) 相关度,关联度
5) 连锁不平衡
6) 醇脱氢酶
7) 基因簇;一组相同或者相似的基因
8) 粘着蛋白
9) 基因组扫描
10) 戒酒;酒精戒断
11) 假阳性的
12) 功能基因组学

Exercise 3 Word-detecting
Directions: Find a word in the designated paragraph to complete the sentence.

1) The physician should also order ______ and hormonal studies. (Para. 3)
2) People with diabetes have too much ______, or sugar, in their blood. (Para. 3)
3) In the absence of an accurate ______, no basis exists for selecting a treatment. (Para. 10)
4) Doctor: I can do nothing about your condition. I'm afraid it's ______. (Para. 10)
5) The two loci all have sequence ______. (Para. 10)
6) Translocation occurs when a fragment of one chromosome becomes attached to a non-______ chromosome. (Para. 12)

Key:
1) chromosomal
2) glucose
3) diagnosis
4) hereditary
5) heterogeneity
6) homologous

Translation of the sentences
1) 还应该进行染色体和激素水平的检查。
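The Development of GWAS Analysis
A genome-wide association study (GWAS) is a genetic research method that aims to identify variants across the genome that are associated with a particular trait or disease.
The development of GWAS can be divided into the following stages:
1. Early candidate-gene analysis: Before GWAS emerged, researchers mainly used candidate-gene approaches to study genes related to a particular trait or disease, looking only at specific genes or functional regions.
This approach was limited by prior knowledge and hypotheses, and it covered only a narrow part of the genome.
2. The emergence of GWAS: In 2005, GWAS was first applied in a large-scale study. By analyzing as many as a million sites in the genomes of hundreds to thousands of individuals, such studies search for genetic variants associated with disease.
GWAS relies on high-throughput genotyping technologies such as SNP arrays, which make it possible to survey genetic variation across the whole genome.
3. Discovery of many disease-associated genes: The widespread use of GWAS has allowed researchers to find many genes associated with complex diseases such as cardiovascular disease, diabetes, and psychiatric disorders.
By comparing genotype differences between affected individuals and controls, single nucleotide polymorphisms (SNPs) associated with disease risk can be identified.
4. Studies of genetic structure: GWAS has also been used to explore human genetic structure and population history.
By studying genotype differences between populations, researchers can learn how allele frequencies are distributed across populations and how populations migrated and evolved.
5. Functional interpretation and mechanistic studies: Beyond finding disease-associated genes, GWAS provides leads for functional interpretation and for studying disease mechanisms.
Further study of the function and expression regulation of disease-associated genes can deepen our understanding of how a disease arises and suggest new targets and strategies for treatment and prevention.
In short, GWAS has made great progress over the past two decades.
It has expanded our knowledge of the human genome and has provided important tools and insights for studying the mechanisms of complex diseases.
As technologies and methods continue to improve, GWAS will keep playing an important role in revealing more genes related to health and disease.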
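The case-control comparison described in stage 3 above can be sketched for a single SNP as follows. This is a minimal illustration, not the method of any particular study: the function name `allelic_association` and the allele counts are hypothetical, the test shown is a plain 1-degree-of-freedom chi-square test on allele counts, and the threshold of 5e-8 is the conventional genome-wide significance level, assumed here rather than taken from the text.

```python
import math


def allelic_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square test (1 d.f.) and odds ratio for a 2x2 table of allele counts.

    Rows are cases and controls; columns are alternate and reference alleles.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    return chi2, p_value, odds_ratio


# Hypothetical counts: 1,000 cases and 1,000 controls, two alleles per person.
chi2, p_value, odds_ratio = allelic_association(600, 1400, 500, 1500)
GENOME_WIDE_THRESHOLD = 5e-8  # conventional cutoff, assumed for illustration
is_genome_wide_significant = p_value < GENOME_WIDE_THRESHOLD
```

Because a GWAS repeats such a test at a very large number of sites, a p-value that looks small for one SNP can still fall well short of the genome-wide threshold.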
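Interactions of genes and diet in type 2 diabetes mellitus

Review. DOI: 10.3760/cma.j.issn.1000-6699.2010.10.026
Chinese Journal of Endocrinology and Metabolism, October 2010, Vol. 26, No. 10 (Chin J Endocrinol Metab, 2010, 26: 910-912)

TENG Fei, ZOU Cai-yan, SONG Huai-dong, QI Lu, LIANG Jun

Affiliations: Medical College of Southeast University, Nanjing 210009, China (TENG Fei); Department of Endocrinology, Xuzhou Central Hospital of Jiangsu Province, Affiliated Xuzhou Hospital of the Medical College of Southeast University (ZOU Cai-yan, LIANG Jun); Department of Endocrinology and Metabolism, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai Institute of Endocrine and Metabolic Diseases (SONG Huai-dong); Department of Nutrition and Department of Epidemiology, Harvard School of Public Health, USA (QI Lu)
Corresponding author: LIANG Jun, Email: mwlj521@

Summary: The interactions between genetic variations and dietary factors in type 2 diabetes mellitus have attracted attention. Several studies have shown that the quality and quantity of dietary carbohydrate, and increased dietary fat intake, may interact with genetic variants related to type 2 diabetes mellitus and increase the risk of the disease. Genome-wide association (GWA) studies suggest that genetic variation may itself modulate the association between dietary pattern and type 2 diabetes mellitus.

Key words: Gene; Diet; Interactions; Diabetes mellitus, type 2; TCF7L2

Type 2 diabetes has become a major problem for human health worldwide. Classical genetic studies hold that type 2 diabetes is related to hereditary factors. In the past two years, genotyping with the genome-wide association (GWA) approach has identified a new set of genetic variants associated with diabetes risk [1,2]. The growing epidemic of type 2 diabetes is linked to the large lifestyle changes of recent decades: a shift from a "traditional" lifestyle to a "Westernized" or "obesogenic" one characterized by high intake of palatable, energy-dense food and by sedentary behavior. Long-term observation shows that lifestyle does not affect all individuals equally: some people respond more sensitively to adverse factors than others [3]. This heterogeneity reflects a complex interplay between genes and environmental factors, a point underlined by the marked geographic differences in diabetes incidence among migrant populations [4]. However, few studies have directly demonstrated gene-environment interactions. This review summarizes recent studies on interactions between type 2 diabetes susceptibility genes and diet.

1. Dietary and genetic risk factors for type 2 diabetes

Available data indicate that certain dietary patterns may increase the risk of type 2 diabetes. One line of evidence is the correlation between dietary pattern and diabetes risk. The Mediterranean diet is rich in nuts, olive oil, fruit and vegetables, legumes, and fish, includes moderate alcohol, and is low in red meat, processed meat, refined grains, and full-fat dairy products; this pattern can help prevent type 2 diabetes [5]. By contrast, the Western dietary pattern, characterized by high intake of red meat, processed meat, and refined grains, increases the risk of type 2 diabetes [6]. Some individual dietary factors may have specific effects on diabetes risk. A meta-analysis of 6 cohorts found that an increase of two servings per day of whole grains (grains that have not been refined, so that their natural structure and nutrients are retained) lowered the risk of type 2 diabetes by 21% [7]. Another meta-analysis suggested that moderate alcohol intake improves insulin sensitivity and lowers the risk of type 2 diabetes by 30% compared with both heavy drinking and abstention [8]. In addition, high intake of total fat, trans fatty acids, saturated fat, and heme iron raises the risk; sucrose intake has no significant effect on the risk of type 2 diabetes; and intake of dietary fiber, ω-3 fatty acids, foods with a low glycemic index (GI), and coffee may lower the risk [9]. Most views on the role of diet remain controversial, however, and the pathogenic mechanisms have not been clarified.

The incidence of diabetes shows marked geographic differences among migrant populations. Because people from the same country share a similar genetic makeup, a rise in type 2 diabetes incidence that follows changes in dietary pattern and habits after migration illustrates the effect of dietary risk factors particularly well. Migration studies of Chinese people who moved to developed countries show that migrants eat more sweets, snacks, and foods high in saturated fat and low in fiber, and less fruit and vegetables, and that their incidence of chronic diseases such as diabetes and coronary heart disease is significantly higher than in China [4].

GWA studies opened a new chapter in the comprehensive and systematic study of the genetics of type 2 diabetes, strongly promoted the discovery of related genes, and provided more clues to its pathogenesis. After the first GWA study, conducted in a French population and reported in 2007, confirmed HHEX/IDE and SLC30A8 as type 2 diabetes susceptibility loci [10], more than ten subsequent GWA studies identified a total of 19 new variant loci, most of them in case-control studies of Europeans [2,11]. The transcription factor 7-like 2 (TCF7L2) gene has the largest effect on diabetes risk found so far; nearly one fifth of patients with type 2 diabetes carry TCF7L2 variants [12], and the genetic effect of this gene has been replicated in GWA studies [1,2]. Other replicated genes are PPARG, KCNJ11, WFS1, and HNF1B [11]. Two GWA studies in Asian populations confirmed that CDKAL1 and IGF2BP2 are associated with type 2 diabetes and identified KCNQ1 as a new susceptibility locus [13,14]. Most of the susceptibility genes found so far affect insulin secretion; only FTO and PPARG are related to insulin action. HHEX, TCF7L2, IDE, DKK3, KIF11, and other genes take part in regulating the Wnt signaling system. Wnt signaling is related to the proliferation of pancreatic β cells, and impaired signaling along this pathway is an important factor in the development of type 2 diabetes [11,12]. Most newly discovered susceptibility genes, such as PPARG, KCNJ11, and CPN10, increase the risk of type 2 diabetes by only 10% to 20% [12]. Combining them with environmental risk factors and family history of diabetes predicts type 2 diabetes risk better [15].

2. Effects of TCF7L2 variants and dietary factors on type 2 diabetes risk

TCF7L2 is a transcription factor required for the normal embryonic development of the pancreas and islet cells [16]. TCF7L2 variants are mainly associated with impaired β-cell function, reduced secretion of insulin and glucagon-like peptide-1 (GLP-1), increased glucagon secretion, and increased endogenous glucose production [17]. TCF7L2 is the most widely validated gene with the strongest association with type 2 diabetes to date [11,18]. TCF7L2 expression in patients with diabetes is about twice the normal level, and the rs7903146 variant is associated with overexpression of TCF7L2 mRNA in pancreatic β cells and with reduced insulin secretion; common TCF7L2 variants can serve as predictors of type 2 diabetes [17]. At baseline in the US Diabetes Prevention Program (DPP), the TCF7L2 TT genotype was associated with reduced insulin secretion but not with increased insulin resistance [18]. The T allele is associated with a reduced incretin effect and a higher rate of hepatic glucose production [17]. T-allele carriers have an OR of about 1.4 for type 2 diabetes, with a population attributable risk of 16.9% [12]. Compared with CC homozygotes, TT homozygotes at rs7903146 were more likely to progress from impaired glucose tolerance to diabetes (OR 1.55, 95% CI 1.20-2.01). Heterozygous carriers of the TCF7L2 risk allele have a 1.45-fold risk of type 2 diabetes, and homozygous carriers a 2.41-fold risk [19].

The interaction between TCF7L2 variants and dietary factors in type 2 diabetes has drawn attention. The DPP enrolled 3,548 participants of diverse backgrounds (American, Hispanic, Asian, and others) [18], who were assigned to a metformin group (850 mg, b.i.d.), a placebo group, or a lifestyle-intervention group (reduced intake of total fat and saturated fatty acids, increased fiber intake, and at least 30 min of moderate exercise per day). The study showed that the TT genotype at rs12255372 in TCF7L2 significantly increased the risk of type 2 diabetes in the placebo group (OR=1.81, 95% CI 1.19-2.75). Results for rs7903146 were similar. In the lifestyle-intervention group (OR=1.24, 95% CI 0.73-2.21) and the metformin group (OR=1.45, 95% CI 0.90-2.35), the risk was not significantly increased. This indicates that dietary intervention and increased exercise can modify the effect of genetic factors on type 2 diabetes.

Whole grains are rich in fiber. Abundant fiber stimulates the release of gastrointestinal hormones, especially gastric inhibitory polypeptide, which slows gastric emptying, delays intestinal glucose absorption, and reduces the amount of glucose absorbed; postprandial blood glucose and insulin secretion therefore fall [20]. Different TCF7L2 genotypes differ in their sensitivity to whole grains. Fisher et al. [20] used a prospective cohort study to examine whether the T allele at rs7903146 in TCF7L2 modifies the protective effect of a whole-grain diet against diabetes. The subjects were 2,318 healthy individuals and 724 patients with diabetes, randomly selected from the Potsdam cohort of the European Prospective Investigation into Cancer and Nutrition (EPIC). Whole-grain intake was assessed with a validated food frequency questionnaire (FFQ). Among CC homozygotes at rs7903146, each 50 g/day of whole-grain intake was associated with a significant 14% (1%-25%) reduction in the risk of type 2 diabetes; among T-allele carriers, whole-grain intake was not associated with a lower risk. The difference in effect by genotype was statistically significant (P=0.016). The results suggest that the T allele may weaken the beneficial effect of whole grains on type 2 diabetes risk. In addition, TT homozygotes have about a twofold risk of type 2 diabetes compared with CC homozygotes [12].

TCF7L2 variants reduce insulin secretion. Insulin secretion depends on the degree to which blood glucose rises, and dietary carbohydrate strongly affects postprandial glucose levels and insulin demand. It was therefore hypothesized that the quality and quantity of carbohydrate may modify the relationship between TCF7L2 variants and type 2 diabetes. This hypothesis was tested in the Nurses' Health Study (NHS) [21]. The NHS genotyped rs12255372 in TCF7L2 in 1,140 patients with diabetes and 1,915 controls. Dietary intake was assessed with a semi-quantitative FFQ, and glycemic load (GL), glycemic index, cereal fiber, and total carbohydrate were evaluated (glycemic load and glycemic index reflect the quality and quantity of carbohydrate). The association between the TCF7L2 variant and type 2 diabetes risk was stronger among individuals in the highest tertile of glycemic load or glycemic index than among those in the lowest tertile, and the difference was statistically significant (glycemic load, P=0.03; glycemic index, P=0.05). Thus, a diet with high glycemic load and glycemic index acts synergistically with TCF7L2 variants on the risk of type 2 diabetes.

TCF7L2 variants may also increase type 2 diabetes risk by affecting factors related to the metabolic syndrome. The central feature of the metabolic syndrome is insulin resistance, which results from a series of abnormalities in glucose and lipid metabolism [22]. TCF7L2 is related to adipogenesis and adipocyte differentiation [23]. The Genetics of Lipid Lowering Drugs and Diet Network Study, conducted in European Americans, showed that dietary fat can interact with TCF7L2 and thereby influence type 2 diabetes risk. The study measured the response of 1,083 participants to a high-fat meal [23]. The interaction between the T allele at rs7903146 and intake of polyunsaturated fatty acids (PUFA) significantly affected the concentrations of fasting plasma very-low-density lipoprotein (VLDL, P=0.016) and of postprandial triglycerides (P=0.028), chylomicrons (P=0.025), and VLDL (P=0.026). Compared with CC homozygotes, T-allele carriers with a higher PUFA intake (at least 7.36% of energy intake) showed higher fasting and postprandial triglyceride and VLDL levels.

3. Other genes identified by GWA studies that relate to dietary factors and type 2 diabetes

A recent prospective cohort study of US men [24] found that the interaction between genetic predisposition and dietary pattern was associated with diabetes risk. The study selected single nucleotide polymorphisms (SNPs) in 9 diabetes susceptibility genes from GWA results and calculated a genetic risk score (GRS) with a simple method [15]. Participants were grouped by genetic risk into a low-score group (GRS<10), a medium-score group (GRS 10-11), and a high-score group (GRS≥12). Factor analysis of baseline dietary data yielded two major dietary patterns [6]: one with high intake of vegetables, legumes, and whole grains, and the Western pattern. The interaction between GRS and the Western dietary pattern was significantly associated with diabetes risk (P=0.02). Among men with a high GRS, the ORs for type 2 diabetes across quartiles of the Western dietary pattern were 1.00, 1.23 (95% CI 0.88-1.73), 1.49 (95% CI 1.06-2.09), and 2.06 (95% CI 1.48-2.88). Further analysis indicated that red meat, processed meat, and heme iron may be the main foods and nutrients responsible for this effect. The findings suggest that a Western diet increases diabetes risk, especially in people at high genetic risk.

Florez et al. [25] studied the effect of three loci in the WFS1 (Wolfram syndrome 1) gene (rs10010131, rs752854, rs734312) on the incidence of type 2 diabetes in the 3,548 DPP participants, and examined how the interaction between WFS1 variants and lifestyle intervention relates to type 2 diabetes. The three loci were not associated with diabetes incidence in the study population as a whole. However, the protective WFS1 variants tended to increase insulin secretion, compensating for reduced insulin sensitivity. Moreover, white participants homozygous for the protective allele were somewhat less likely to develop diabetes in the lifestyle-intervention group (targets: weight loss ≥7% and physical activity ≥150 min per week).

A GWA study of 14,639 Indian Asians showed that common variants near the MC4R (melanocortin-4 receptor) gene are associated with obesity risk and insulin resistance [26]. The study found that rs12970134 near MC4R was associated with increased waist circumference (P<0.001) and independently affected insulin resistance. MC4R is expressed at multiple sites in the brain and mediates the effects of melanocortins on food intake and energy expenditure. Another prospective study, of 5,724 women, examined MC4R variants in relation to diet, weight change, and type 2 diabetes risk, and found that the rs17782313 polymorphism near MC4R was associated with higher intake of total energy (P=0.028), fat (P=0.008), and protein (P=0.003). The C allele at rs17782313 increased the risk of type 2 diabetes by 14% (2%-32%), and carriers had a 0.2 kg/m2 greater increase in body mass index after ten years of follow-up (P=0.028). However, these dietary factors did not modify the genetic effect on the risk of obesity or diabetes [27].

References
[1] Sladek R, Rocheleau G, Rung J, et al. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature, 2007, 445: 881-885.
[2] Scott LJ, Mohlke KL, Bonnycastle LL, et al. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science, 2007, 316: 1341-1345.
[3] Qi L, Hu FB, Hu G. Genes, environment, and interactions in prevention of type 2 diabetes: a focus on physical activity and lifestyle changes. Curr Mol Med, 2008, 8: 519-532.
[4] Misra A, Ganda OP. Migration and its impact on adiposity and type 2 diabetes. Nutrition, 2007, 23: 696-708.
[5] Babio N, Bullo M, Salas-Salvado J. Mediterranean diet and metabolic syndrome: the evidence. Public Health Nutr, 2009, 12: 1607-1617.
[6] van Dam RM, Rimm EB, Willett WC, et al. Dietary patterns and risk for type 2 diabetes mellitus in US men. Ann Intern Med, 2002, 136: 201-209.
[7] de Munter JS, Hu FB, Spiegelman D, et al. Whole grain, bran, and germ intake and risk of type 2 diabetes: a prospective cohort study and systematic review. PLoS Med, 2007, 4: e261.
[8] Koppes LLJ, Bouter LM, Dekker JM, et al. Moderate alcohol consumption lowers the risk of type 2 diabetes. Diabetes Care, 2005, 28: 719-725.
[9] Parillo M, Riccardi G. Diet composition and the risk of type 2 diabetes: epidemiological and clinical evidence. Br J Nutr, 2004, 92: 7-19.
[10] Sladek R, Rocheleau G, Rung J, et al. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature, 2007, 445: 828-830.
[11] McCarthy MI, Zeggini E. Genome-wide association studies in type 2 diabetes. Curr Diab Rep, 2009, 9: 164-171.
[12] Tong Y, Lin Y, Zhang Y, et al. Association between TCF7L2 gene polymorphisms and susceptibility to type 2 diabetes mellitus: a large Human Genome Epidemiology (HuGE) review and meta-analysis. BMC Med Genet, 2009, 10: 15.
[13] Unoki H, Takahashi A, Kawaguchi T, et al. SNPs in KCNQ1 are associated with susceptibility to type 2 diabetes in East Asian and European populations. Nat Genet, 2008, 40: 1098-1102.
[14] Yasuda K, Miyake K, Horikawa Y, et al. Variants in KCNQ1 are associated with susceptibility to type 2 diabetes mellitus. Nat Genet, 2008, 40: 1092-1097.
[15] Cornelis MC, Qi L, Zhang L, et al. Joint effects of common genetic variants on the risk for type 2 diabetes in US men and women of European ancestry. Ann Intern Med, 2009, 150: 541-550.
[16] Papadopoulou S, Edlund H. Attenuated Wnt signaling perturbs pancreatic growth but not pancreatic function. Diabetes, 2005, 54: 2844-2851.
[17] Lyssenko V, Lupi R, Marchetti P, et al. Mechanisms by which common variants in the TCF7L2 gene increase risk of type 2 diabetes. J Clin Invest, 2007, 117: 2155-2163.
[18] Florez JC, Jablonski KA, Bayley N, et al. TCF7L2 polymorphisms and progression to diabetes in the Diabetes Prevention Program. N Engl J Med, 2006, 355: 241-250.
[19] Grant SFA, Thorleifsson G, Reynisdottir I, et al. Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat Genet, 2006, 38: 320-323.
[20] Fisher E, Boeing H, Fritsche A, et al. Whole-grain consumption and transcription factor-7-like 2 (TCF7L2) rs7903146: gene-diet interaction in modulating type 2 diabetes risk. Br J Nutr, 2009, 101: 478-481.
[21] Cornelis MC, Qi L, Kraft P, et al. TCF7L2, dietary carbohydrate, and risk of type 2 diabetes in US women. Am J Clin Nutr, 2009, 89: 1256-1262.
[22] Scott LJ, Bonnycastle LL, Willer CJ, et al. Association of transcription factor 7-like 2 (TCF7L2) variants with type 2 diabetes in a Finnish sample. Diabetes, 2006, 55: 2649-2653.
[23] Warodomwichit D, Arnett DK, Kabagambe EK, et al. Polyunsaturated fatty acids modulate the effect of TCF7L2 gene variants on postprandial lipemia. J Nutr, 2009, 139: 439-446.
[24] Qi L, Cornelis MC, Zhang C, et al. Genetic predisposition, Western dietary pattern, and the risk of type 2 diabetes in men. Am J Clin Nutr, 2009, 89: 1453-1458.
[25] Florez JC, Jablonski KA, McAteer J, et al. Testing of diabetes-associated WFS1 polymorphisms in the Diabetes Prevention Program. Diabetologia, 2008, 51: 451-457.
[26] Chambers JC, Elliott P, Zabaneh D, et al. Common genetic variation near MC4R is associated with waist circumference and insulin resistance. Nat Genet, 2008, 40: 716-718.
[27] Qi L, Kraft P, Hunter DJ, et al. The common obesity variant near MC4R gene is associated with higher intakes of total energy and dietary fat, weight change and diabetes risk in women. Hum Mol Genet, 2008, 17: 3502-3508.

(Received: 2010-01-07) (Editor: 朱鋐达)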
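The genetic risk score (GRS) used in the cohort study of US men discussed in the review above can be sketched as a simple count of risk alleles. This is only an illustration under stated assumptions: the review says the score was calculated with "a simple method" without giving details, so the unweighted count below, the function names, and the example genotypes are all hypothetical. The high-score group is taken here as 12 or more so that every integer score falls into exactly one group.

```python
def genetic_risk_score(risk_allele_counts):
    """Unweighted GRS: the total number of risk alleles across the selected SNPs.

    Each entry is the number of risk alleles (0, 1, or 2) a person carries at one SNP.
    """
    for count in risk_allele_counts:
        if count not in (0, 1, 2):
            raise ValueError("each SNP contributes 0, 1, or 2 risk alleles")
    return sum(risk_allele_counts)


def grs_group(score):
    """Assign a score to the low (<10), medium (10-11), or high (>=12) group."""
    if score < 10:
        return "low"
    if score <= 11:
        return "medium"
    return "high"


# Hypothetical participant genotyped at 9 susceptibility SNPs.
participant = [2, 1, 1, 2, 0, 1, 2, 1, 2]
score = genetic_risk_score(participant)
group = grs_group(score)
```

A weighted score, in which each SNP is multiplied by its estimated effect size, is a common alternative; the unweighted count is shown only because it is the simplest choice.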
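A Brief Life of Mendel
Great Geneticists: Mendel, by 罗静初
On July 22, 1822, Johann Mendel was born in the village of Heinzendorf in Moravia, Austria (now part of the Czech Republic).
Mendel was born into a farming family; he was the second child and the only boy.
As a child he worked in the orchard, and life was hard.
At the age of 6 he entered the village primary school, where he studied basic subjects such as language and arithmetic as well as practical skills such as beekeeping and fruit-tree grafting.
From an early age Mendel was diligent, eager to learn, and an outstanding pupil.
On his teacher's advice, his father let him continue on to secondary school.
The family was poor and could barely afford the fees, so his secondary-school years were already very straitened.
After finishing secondary school, Mendel planned to take the two-year university preparatory course at the institute in Olmütz (Olomouc).
Unfortunately, his father's poor health left him unable to support further study.
At this most difficult moment, Mendel's younger sister gave up her dowry to help him complete his studies.
He never forgot this; gratitude spurred him on, and he remained among the best students.
Even so, the family's poverty put a university education out of reach.
Mendel realized that he first had to find an occupation that would provide a living before he could continue his studies.
He accepted his teacher's advice and in September 1843 entered the monastery in Brünn (now Brno) as a monk, taking the religious name Gregor.
Brünn was the capital of Moravia and a center of industry, agriculture, and commerce in the Austrian Empire.
The monastery was quite wealthy. Its library held 20,000 volumes, rivaling university libraries of the time, and it was the religious and cultural center of the city.
The clergy were highly educated, and many of the priests held formal degrees in horticulture, music, philosophy, and other fields.
At the monastery Mendel received systematic religious training and excelled, completing the four-year course in only 3 years.
When his studies ended, Mendel was appointed a parish priest, but he did not like the work.
The abbot noticed Mendel's inclinations and arranged for him to teach temporarily at a local secondary school.
His excellent teaching quickly earned him a reputation among the students.
Under the regulations, secondary-school teachers normally needed a university education and had to pass a qualifying examination.
Considering Mendel's circumstances, the examining board decided to let him keep his teaching position and recommended that he complete his university studies at the University of Vienna.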
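GWAS Genetic Testing
GWAS-based genetic testing screens the whole genome at high throughput and on a large scale, so it can detect genetic variants that single-gene tests are unlikely to find. The aim is more accurate results and genuine disease prevention: early detection, early prevention, and early treatment.
The latest GWAS whole-genome testing technology selects multiple loci closely associated with each disease and genotypes the SNPs comprehensively, with the aim of making the assessment of disease susceptibility more sensitive and more accurate.
The GWAS whole-genome testing procedure is as follows:
(1) Collect samples (tissue, blood, etc.) from controls and patients, and extract DNA.
(2) Scan the samples on a whole-genome single nucleotide polymorphism (SNP) chip to obtain genotype data.
(3) Use association analysis to screen for sequence variants associated with the disease.
By studying disease-related gene variants and SNPs, this approach seeks to identify susceptibility regions and related genes, find disease markers, and support early diagnosis, effective individualized treatment, and targeted prevention.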
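Steps (2) and (3) of the procedure above can be sketched as a small scan over several SNPs. This is a hedged illustration, not the platform's actual pipeline: the function names, the SNP identifiers `snp_A` and `snp_B`, and the genotype data are hypothetical; genotypes are coded as 0, 1, or 2 copies of the alternate allele; the test is a 1-degree-of-freedom chi-square test on allele counts; and a Bonferroni correction stands in for whatever multiple-testing procedure a real analysis would use.

```python
import math


def allele_counts(genotypes):
    """Turn per-person genotypes (0/1/2 alternate-allele copies) into allele counts."""
    alt = sum(genotypes)
    ref = 2 * len(genotypes) - alt
    return alt, ref


def chi2_p_value(a, b, c, d):
    """P-value of the 1-d.f. chi-square test on the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 1.0  # a monomorphic SNP or an empty group carries no information
    chi2 = n * (a * d - b * c) ** 2 / denominator
    return math.erfc(math.sqrt(chi2 / 2.0))


def association_scan(cases, controls, alpha=0.05):
    """Test every SNP and flag those passing a Bonferroni-corrected threshold.

    cases and controls map a SNP identifier to a list of genotypes.
    Returns (snp, p_value, significant) tuples sorted by p-value.
    """
    threshold = alpha / len(cases)
    results = []
    for snp in cases:
        case_alt, case_ref = allele_counts(cases[snp])
        ctrl_alt, ctrl_ref = allele_counts(controls[snp])
        p_value = chi2_p_value(case_alt, case_ref, ctrl_alt, ctrl_ref)
        results.append((snp, p_value, p_value < threshold))
    return sorted(results, key=lambda item: item[1])


# Hypothetical genotype data for 100 cases and 100 controls at two SNPs.
cases = {
    "snp_A": [2] * 30 + [1] * 50 + [0] * 20,
    "snp_B": [2] * 10 + [1] * 40 + [0] * 50,
}
controls = {
    "snp_A": [2] * 10 + [1] * 40 + [0] * 50,
    "snp_B": [2] * 10 + [1] * 40 + [0] * 50,
}
scan_results = association_scan(cases, controls)
```

In a real study the scan would also control for population structure and genotyping quality, which this sketch leaves out.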
Differences between GWAS and earlier tests, and its advantages:
The tumor panel offered on our platform is described below. From the GWAS database we extracted the tumor-related literature on Asian populations (mainly Japanese, Korean, Chinese, and Malaysian) and selected the disease-associated loci for each tumor type.
The loci for each tumor type are as follows:

(1) Loci for gastric cancer testing:
References:
1. Zhang H, Jin G, Li H, et al. Genetic variants at 1q22 and 10q23 reproducibly associated with gastric cancer susceptibility in a Chinese population. Carcinogenesis. 2011; 32(6): 848-52.
2. Wang N, Zhou R, Wang C, et al. A polymorphism of the interleukin-8 gene and cancer risk: a HuGE review and meta-analysis based on 42 case-control studies. Mol Biol Rep. 2012; 39(3): 2831-41.
3. Lu Y, Chen J, Ding Y, et al. Genetic variation of PSCA gene is associated with the risk of both diffuse- and intestinal-type gastric cancer in a Chinese population. Int J Cancer. 2010; 127(9): 2183-9.
4. Shi Y, Hu Z, Wu C, et al. A genome-wide association study identifies new susceptibility loci for non-cardia gastric cancer at 3q13.31 and 5p13.1. Nat Genet. 2011; 43(12): 1215-8.

(2) Loci for nasopharyngeal carcinoma testing:

(3) Loci for lung cancer testing:
References:
1. Hu Z, Wu C, Shi Y, et al. A genome-wide association study identifies two new lung cancer susceptibility loci at 13q12.12 and 22q12.2 in Han Chinese. Nat Genet. 2011; 43(8): 792-6.
2. Ahn MJ, Won HH, Lee J, et al. The 18p11.22 locus is associated with never smoker non-small cell lung cancer susceptibility in Korean populations. Hum Genet. 2012; 131(3): 365-72.
3. Yoon KA, Park JH, Han J, et al. A genome-wide association study reveals susceptibility variants for non-small cell lung cancer in the Korean population. Hum Mol Genet. 2010; 19(24): 4948-54.

(4) Loci for thyroid cancer testing:
References:
1. Matsuse M, Takahashi M, Mitsutake N, et al. The FOXE1 and NKX2-1 loci are associated with susceptibility to papillary thyroid carcinoma in the Japanese population. J Med Genet. 2011; 48(9): 645-8.

(5) Loci for prostate cancer testing:
References:
1. Takata R, Akamatsu S, Kubo M, Takahashi A, et al. Genome-wide association study identifies five new susceptibility loci for prostate cancer in the Japanese population. Nat Genet. 2010; 42(9): 751-4.

(6) Loci for endometrial cancer testing:
References:
1. Xu WH, Long JR, Zheng W, et al. Association of the progesterone receptor gene with endometrial cancer risk in a Chinese population. Cancer. 2009; 115(12): 2693-700.
2. Long J, Zheng W, Xiang YB, et al. Genome-wide association study identifies a possible susceptibility locus for endometrial cancer. Cancer Epidemiol Biomarkers Prev. 2012; 21(6): 980-7.

(7) Loci for breast cancer testing:
References:
1. Long J, Shu XO, Cai Q, et al. Evaluation of breast cancer susceptibility loci in Chinese women. Cancer Epidemiol Biomarkers Prev. 2010; 19(9): 2357-65.
2. Kim HC, Lee JY, Sung H, et al. A genome-wide association study identifies a breast cancer risk variant in ERBB4 at 2q34: results from the Seoul Breast Cancer Study. Breast Cancer Res. 2012; 14(2): R56.
3. Long J, Cai Q, Sung H, Shi J, et al. Genome-wide association study in east Asians identifies novel susceptibility loci for breast cancer. PLoS Genet. 2012; 8(2): e1002532.
4. Shu XO, Long J, Lu W, et al. Novel genetic markers of breast cancer survival identified by a genome-wide association study. Cancer Res. 2012; 72(5): 1182-9.
5. Cai Q, Long J, Lu W, et al. Genome-wide association study identifies breast cancer risk variant at 10q21.2: results from the Asia Breast Cancer Consortium. Hum Mol Genet. 2011; 20(24): 4991-9.

(8) Loci for colorectal cancer testing:
References:
1. Xiong F, Wu C, Bi X, et al. Risk of genome-wide association study-identified genetic variants for colorectal cancer in a Chinese population. Cancer Epidemiol Biomarkers Prev. 2010; 19(7): 1855-61.
2. Ho JW, Choi SC, Lee YF, et al. Replication study of SNP associations for colorectal cancer in Hong Kong Chinese. Br J Cancer. 2011; 104: 369-375.

The above is the locus information for the tumor-panel genetic tests offered on our platform.
A Genome-Wide Association Study of the Human
Because metabolites are hypothesized to play key roles as markers and effectors of cardio-metabolic diseases, recent studies have sought to annotate the genetic determinants of circulating metabolite levels. We report a genome-wide association study (GWAS) of 217 plasma metabolites, including >100 not measured in prior GWAS, in 2,076 participants of the Framingham Heart Study. For the majority of analytes, we find that estimated heritability explains >20% of inter-individual variation, and that variation attributable to heritable factors is greater than that attributable to clinical factors. Further, we identify 31 genetic loci associated with plasma metabolites, including 23 that have not previously been reported. Importantly, we include GWAS results for all surveyed metabolites, and demonstrate how this information highlights a role for AGXT2 in cholesterol ester and triacylglycerol metabolism. Thus, our study outlines the relative contributions of inherited and clinical factors on the plasma metabolome and provides a resource for metabolism research.

INTRODUCTION

Recent studies have begun to integrate genomic and metabolomic data in human cohorts (Demirkan et al., 2012; Gieger et al., 2008; Hicks et al., 2009; Illig et al., 2010; Kettunen et al., 2012; Suhre et al., 2011a; Suhre et al., 2011b; Tukiainen et al., 2012). Because metabolites are hypothesized to play key roles as markers and effectors of cardiometabolic diseases, these efforts seek to both refine and expand our understanding of the causal determinants of circulating metabolite levels. Studies to date have been notable for the identification of loci at enzymes or transport proteins directly involved with a given metabolite's disposition (Suhre and Gieger, 2012).
In turn, many of these loci have shown relatively large effect sizes on metabolite levels, as compared to findings in genome wide association studies (GWAS) for common diseases.

Whereas genetically informative deoxyribonucleic acid sequence is limited to four distinct chemical motifs, endogenous metabolites span a variety of compound classes with significant differences in size and polarity, across a wide range of concentrations. As a consequence, no single analytical method is able to accommodate the chemical diversity of the entire metabolome. Thus, GWAS of metabolite traits to date have employed various methodologies including nuclear magnetic resonance spectroscopy and mass spectrometry (MS), with the latter coupled to both gas and liquid chromatography (LC) (Rhee and Gerszten, 2012). Even with a given analytical tool, distinct methods have been required to survey polar versus lipid analytes. We have developed a LC-MS based metabolomics platform that measures a total of 217 analytes (113 polar analytes and 104 lipid analytes), including >100 not measured in prior GWAS.

In the current study, we performed metabolite profiling on plasma obtained from 2,076 individuals in the Framingham Heart Study (FHS). The family-based structure of this cohort, as well as its rich cardiometabolic phenotyping, presents a unique opportunity to study the relative contributions of heritable, environmental, and clinical factors influencing the plasma metabolome. For many metabolites, we confirm that a substantial fraction of metabolite variability is heritable (Shah et al., 2009; Kettunen et al., 2012), often exceeding the influence of measured clinical factors. Using GWAS, we also identify numerous locus-metabolite associations and demonstrate how these findings complement and extend prior association studies of complex human traits.
Finally, we include a proof-of-principle demonstration of how the breadth of metabolite, genotype, and phenotype data we present in FHS can motivate functional studies to provide biological insight.

RESULTS

A total of 2,076 participants of the FHS Offspring Cohort, including 873 sibships, underwent metabolic profiling and genome-wide genotyping. The mean age was 55 years and 51% of participants were women (Table 1).

Relative contributions of heritable and clinical factors to metabolite levels

The estimated proportion of inter-individual metabolite variation attributable to heritable factors (including genome-wide significant loci) compared with clinical factors (age, sex, systolic blood pressure, anti-hypertensive medication use, body-mass index, diabetes, smoking status, and prevalent cardiovascular disease) is displayed in Figure 1A and Figure S1. Metabolites most influenced by clinical variables include the nicotine metabolite cotinine (70% of variation explained by clinical factors, 66% by smoking alone), and fructose/glucose/galactose (45%). Adjustment for renal function in secondary analyses did not appreciably change results (Table S1).

For the majority of metabolites, the proportion of variation attributable to heritable factors was greater than that of clinical factors: for 93% of metabolites assayed, measured clinical factors accounted for 20% or less of inter-individual variation. By contrast, estimated heritability explained greater than 20% of the inter-individual variation for 66% of metabolites.
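To make concrete what "proportion of inter-individual variation explained by clinical factors" means, the sketch below computes R² for a least-squares fit of a metabolite level on a single clinical covariate. It is a simplified illustration only: the study's clinical model uses several covariates at once and its heritability estimates come from family data, neither of which is reproduced here, and the function name and the numbers are hypothetical.

```python
def r_squared(covariate, metabolite):
    """R^2 of a simple least-squares regression of metabolite on one covariate.

    R^2 = 1 - SS_residual / SS_total, the share of variance the fit explains.
    """
    n = len(covariate)
    mean_x = sum(covariate) / n
    mean_y = sum(metabolite) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, metabolite))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_residual = sum(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(covariate, metabolite)
    )
    ss_total = sum((y - mean_y) ** 2 for y in metabolite)
    return 1.0 - ss_residual / ss_total


# Hypothetical data: a clinical score and a metabolite level for four people.
clinical_score = [1.0, 2.0, 3.0, 4.0]
metabolite_level = [2.0, 1.0, 4.0, 3.0]
explained = r_squared(clinical_score, metabolite_level)
```

An R² of 0.20 from such a model would correspond to the statement that the covariate accounts for 20% of inter-individual variation.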
Amino acids and other polar analytes had the highest heritability estimates, including carnosine (h²=0.86, P=6.8×10−4), anthranilic acid (h²=0.84, P=3.2×10−14), and glutamate (h²=0.82, P=9.1×10−13), whereas heritability estimates for lipid analytes were lower, with the highest estimate for lysophosphatidylcholine (LPC) 22:6 (h²=0.46, P=2.0×10−7).

Heritability estimates for essential amino acids were lower than for non-essential amino acids: mean h²=0.29, range 0.14–0.43, versus mean h²=0.53, range 0.15–0.82, respectively; P=0.01 (Figure 1B). Similarly, none of the essential amino acids were associated with genetic loci at a genome-wide significant level, whereas 5 of the 10 non-essential amino acids monitored by our platform had genome-wide significant findings in our study. Conversely, clinical factors explained a greater proportion of variation for essential versus non-essential amino acids (mean R² for clinical model=0.17, range 0.04–0.34, versus mean R²=0.09, range 0.03–0.16, respectively). These findings align with the relative contributions of endogenous (inherited) versus dietary (environmental) factors for these small molecules, and provide internal validation for the observed distribution of metabolite heritability.

GWAS identifies 31 genetic loci associated with plasma metabolite levels

Genome-wide associations are displayed in Table 2, and quantile-quantile and linkage disequilibrium-plots for these associations are displayed in Figures S2 and S3. Of 217 metabolites analyzed, 64 had at least one genome-wide significant locus. Conversely, 31 discrete loci were associated with at least one metabolite trait, and a number of loci were associated with multiple metabolites (Table 2).
Our data both replicate previously identified associations, as well as identify numerous novel locus-metabolite associations (Figure 2). These novel findings include loci that span genes encoding proteins with a direct biochemical relationship with a given metabolite, loci previously associated with complex human disease traits, and loci with no prior significant associations in GWAS.

Confirmation of previously established locus-metabolite associations

Eight of the locus-metabolite associations identified in our study have been previously reported, and 7 of these 8 associations involve genes directly related to the transport or synthesis of a given metabolite (Figure 2). For example, we replicate prior associations between loci at SLC16A9, which encodes a carnitine efflux transporter, and carnitine; PRODH (proline dehydrogenase), which encodes the enzyme that catalyzes the first step of proline catabolism, and proline; and PHGDH (phosphoglycerate dehydrogenase), which encodes the enzyme that catalyzes the first and rate-limiting step of serine biosynthesis, and serine (Suhre et al., 2011a). A locus at SLC16A10, which encodes a tyrosine and phenylalanine transporter, has previously been associated with the ratio of isoleucine/tyrosine (Suhre et al., 2011a) and the ratio of alanine/tyrosine (Kettunen et al., 2012); we show that this association holds true for tyrosine alone. We also identify an association between a locus at AGXT2 (alanine-glyoxylate aminotransferase-2) and its enzyme substrate β-aminoisobutyric acid. Prior work has shown an association between this locus and urinary levels of β-aminoisobutyric acid (Suhre et al., 2011b).

Further, we confirm the association between glycine and a variant at CPS1 (carbamoyl-phosphate synthase 1), which encodes the enzyme that catalyzes the first committed step of the urea cycle.
Although not a urea cycle intermediate, glycine can react with arginine (a urea cycle intermediate) to yield ornithine (a urea cycle intermediate) and guanidinoacetic acid. Methylation of guanidinoacetic acid yields creatine, which is ultimately metabolized to creatinine. Notably, we identify a novel association between the CPS1 locus and creatine, whereas others have identified an association between CPS1 common variants and creatinine (Kottgen et al., 2010). Thus, complementary metabolomic data sets are able to extend the network of locus-metabolite associations along defined biochemical pathways.

Finally, we replicate the association between the FADS1-3 (fatty acid desaturase 1–3) locus and phosphatidylcholines (PCs) 36:4 and 38:4 (Gieger et al., 2008), as well as between a variant within GCKR (glucokinase regulator) and alanine (the GCKR variant was previously associated with alanine/glutamine) (Kettunen et al., 2012). Novel associations at these loci with triacylglycerol (TAG) traits are discussed further below.

Novel associations in directly related biological pathways

Among the numerous novel findings in our study, we first describe eight locus-metabolite associations with strong biological plausibility. In each case, the locus of interest includes a gene encoding a protein directly responsible for the metabolism or transport of the given metabolite. For three of the loci, mutations have been identified as the cause of human disease. For example, we identify an association between a variant at UMPS (uridine monophosphate synthase) and orotic acid. UMPS encodes an enzyme that combines orotic acid and ribose-5-phosphate to form uridine monophosphate, and mutations in this gene have been identified as the cause of hereditary orotic aciduria (OMIM: 258900). Similarly, we identify an association between a common variant at AGA (aspartylglucosaminidase) and asparagine.
AGA encodes an enzyme that cleaves asparagine from N-acetylglucosamines as a final step in the lysosomal breakdown of glycoproteins, and mutations in this gene result in the lysosomal storage disease aspartylglycosaminuria (OMIM: 613228). Finally, we find an association between the SERPINA7 locus and thyroxine. SERPINA7 encodes thyroxine-binding globulin, and mutations in this gene result in various degrees of thyroxine-binding globulin deficiency (OMIM: 314200).

In addition to these findings with established human disease correlates, we identify five other locus-metabolite associations with strong biochemical underpinnings. We find an association between a variant at DMGDH (dimethylglycine dehydrogenase) and its enzyme substrate dimethylglycine; at GMPR (guanosine monophosphate reductase) and the purine nucleoside xanthosine; at SLC6A13, which encodes a transporter with known specificity for γ-aminobutyrate (GABA), and the GABA-isomer β-aminoisobutyric acid; at APOA1/C3/A4/A5 (apolipoprotein A1/C3/A4/A5) and various TAGs and diacylglycerols (DAGs); and at DDAH1 (dimethylarginine dimethylaminohydrolase 1) and NG-monomethylarginine (NMMA). Prior studies have shown that DDAH1 is responsible for the degradation of the dimethylarginines NMMA and asymmetric dimethylarginine (ADMA), but not symmetric dimethylarginine (SDMA) (Hu et al., 2011), and DDAH1 polymorphisms have previously been associated with ADMA levels (Abhary et al., 2010). In our data, the top SNP (rs18582) at DDAH1 had a modest association with ADMA (P=5.6×10−6), but no association with SDMA (P=0.15).

Other novel locus-metabolite associations

Several of the novel locus-metabolite associations identified in our study include loci without a clear biochemical relationship with the given metabolite. In several cases, however, these loci have been associated with human disease or complex disease phenotypes.
For example, mutations in SLC7A9 cause cystinuria type B (OMIM: 220100) (Feliubadalo et al., 1999) and common variants in SLC7A9 have been associated with chronic kidney disease (CKD) (Kottgen et al., 2010). We report an association between the SLC7A9 locus and NMMA, with the minor allele at this locus associated with a lower risk of CKD and lower plasma levels of NMMA, raising the question of whether NMMA is a potential biomarker or effector of CKD progression. Among the 2,076 individuals in the current study, 123 with normal kidney function at the time of metabolite profiling developed new-onset CKD in the subsequent 8 years; interestingly, higher plasma levels of NMMA were significantly associated with the risk of developing future CKD (OR per SD 1.32, P=0.003) (Rhee et al., 2013).

Further examples of locus-metabolite associations identified in our study with potential links to human disease include an association between the HPS1 locus (Hermansky-Pudlak syndrome 1, OMIM: 203300) and ADMA.
Similarly, the locus at SYNE2 (spectrin repeat containing, nuclear envelope 2), associated with sphingomyelin (SM) 14:0 in our data, has been associated with atrial fibrillation (Ellinor et al., 2012); the locus at DGKB (diacylglycerol kinase), associated with indole propionate, has been linked to fasting glucose (Dupuis et al., 2010); the locus at NTAN1 (N-terminal asparagine amidase), associated with cholesterol ester (CE) 20:3, has been associated with bone mineral density (Estrada et al., 2012); the locus at LIPC, associated with lysophosphatidylethanolamine (LPE) 16:0, has been associated with macular degeneration and the metabolic syndrome (Neale et al., 2010; Kristiansson et al., 2012); the locus at SLCO1B1 (solute carrier organic anion transporter family member 1B1), associated with LPE 20:4, has been associated with statin-induced myopathy (Link et al., 2008); and the locus at PDE4D (phosphodiesterase 4D), associated with SM 24:1, has previously been linked to stroke (Song et al., 2006), although this association was not validated in a larger study (Bellenguez et al., 2012).

Although we catalog each of the loci that have been associated with human disease and have at least one genome-wide significant metabolite association in our study, metabolite associations that do not reach this threshold may also provide pathophysiologic insights. As an example, we examined metabolite associations with variants spanning KCNQ1 (potassium voltage-gated channel, KQT-like subfamily, member 1); common variants in KCNQ1 have previously been associated with type 2 diabetes, with the hypothesis that the encoded channel may modulate pancreatic insulin secretion (Unoki et al., 2008; Yasuda et al., 2008). In our study, we note an association between rs384037 in KCNQ1 and triiodothyronine levels (P=5.3×10−5).
Furthermore, across the 2,076 individuals in the current study, triiodothyronine levels are strongly associated with metabolic traits, including plasma insulin levels (P=1.5×10^-20), plasma triglyceride levels (P=5.3×10^-12), and body-mass index (P=4.9×10^-8). Thus, our data in FHS raise the possibility that the association between common variants in KCNQ1 and type 2 diabetes may also be mediated by the gene’s role in modulating plasma triiodothyronine levels.

NIH-PA Author Manuscript

Lipid profiling demonstrates heterogeneous effects of loci associated with total triglycerides

Although prior GWAS have identified numerous loci associated with total triglyceride levels, our study is the first to incorporate comprehensive TAG profiling. As various combinations of acyl chains may be esterified to a glycerol backbone, bulk triglycerides are actually composed of dozens of distinct TAG molecules. In the current study, we identify significant TAG associations for three loci previously associated with total triglycerides: GCKR and the FADS1-3 and APOA1/C3/A4/A5 gene clusters. A fourth region at chromosome 7p12.1 (rs6593086) found to be associated with TAG traits has not been previously associated with total triglyceride levels. To test the hypothesis that the higher resolution phenotyping enabled by our platform sheds insight on these associations, we examined each locus’s association with all of the TAGs monitored by our platform.

For leading SNPs at GCKR, FADS1-3, APOA1/C3/A4/A5, and rs6593086, Figure 3 depicts the beta coefficient and P-value for association across the 46 TAGs measured in FHS. As suggested by the four significant findings (TAGs 48:2, 48:3, 50:3, 50:4), the GCKR locus demonstrated a stronger association with TAGs of relatively lower carbon content (Figure 3A). A comprehensive view of PCs shows a similar pattern of association (including significant associations with PC 32:2 and PC 34:3) (Figure 4A).
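The per-TAG scan summarized in Figure 3 (one beta coefficient and P-value per lipid for a given SNP) can be illustrated with a minimal sketch. It assumes unrelated individuals and an additive genotype coding (0, 1, or 2 copies of the minor allele), fits an ordinary least-squares slope for each trait, and reports the slope, its standard error, and the t statistic. The function name and the toy values are hypothetical; the models actually used in FHS, which would have to handle family structure and covariates, are not reproduced here.

```python
import math

def snp_association(dosages, levels):
    """Regress one trait on genotype dosage; return (beta, standard error, t)."""
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_y = sum(levels) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosages, levels))
    beta = sxy / sxx                       # change in trait per extra allele copy
    intercept = mean_y - beta * mean_g
    rss = sum((y - intercept - beta * g) ** 2 for g, y in zip(dosages, levels))
    se = math.sqrt(rss / (n - 2) / sxx)    # standard error of the slope
    return beta, se, beta / se

# Toy data (invented for illustration): six individuals, two lipid traits.
dosages = [0, 0, 1, 1, 2, 2]
toy_traits = {
    "TAG 48:2": [1.0, 1.2, 2.0, 2.2, 3.0, 3.2],   # rises with allele count
    "TAG 58:10": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],  # unrelated to allele count
}
results = {name: snp_association(dosages, y) for name, y in toy_traits.items()}
for name, (beta, se, t) in results.items():
    print(f"{name}: beta={beta:.3f} se={se:.3f} t={t:.2f}")
```

Repeating the same fit over every measured TAG for one SNP gives the kind of profile that lets loci be compared by carbon and double bond content.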
Notably, the top SNP associated with these metabolic traits was rs1260326, a missense variant (L446P) that in functional studies has been established as the likely causal variant explaining the association with fasting bulk triglyceride and glucose levels (Beer et al., 2009; Orho-Melander et al., 2008).

In contrast to the GCKR locus, the FADS1-3 locus had stronger associations with TAGs of relatively higher carbon and double bond content (Figure 3B), including significant associations with TAGs 54:4, 58:10, and 58:11. These data extend prior work that has demonstrated a similar pattern of association between this locus and plasma phospholipid carbon and double bond content (Gieger et al., 2008), and are corroborated by our own PC data (Figure 4B). The leading SNP in the APOA1/C3/A4/A5 gene cluster was associated with an intermediate TAG phenotype relative to GCKR and FADS1-3, demonstrating stronger associations for TAGs with intermediate carbon content (Figure 3C, i.e., TAGs with 50 to 56 carbons). Figure 3D demonstrates a striking pattern of TAG associations for rs6593086, a SNP that is located > 50 kb from the closest coding gene (POM121L12).
As with the FADS1-3 locus, this SNP had stronger associations with TAGs of relatively greater carbon and double bond content, although the significant associations were non-overlapping. Further, rs6593086 had a consistent direction of association across the majority of TAGs, whereas the direction of effect for the FADS1-3 locus differed at the extremes of TAG carbon content.

Genome wide association data across all surveyed metabolites as a resource for metabolism research

Although the novelty of our TAG data set motivates interest in select TAG association patterns, further interrogation of the breadth of our data will provide other insights as well. To that end, we include GWAS data for each of the 217 metabolites surveyed by our platform, including all loci with P<1.0×10^-3 (Table S2), as well as comprehensive metabolite data for each locus with at least one genome-wide significant association (Table S3). With these resources, we believe that independent investigators will be able to rapidly interrogate the genetic underpinnings of metabolites of interest, including biologically meaningful associations that do not meet the genome-wide significance threshold (but potentially of high statistical significance in a focused interrogation). Conversely, investigators focused on a particular gene highlighted by our study will be able to query which metabolites are downstream of common genetic variation in that gene. Finally, access to the breadth of our metabolite GWAS data will complement the publicly available metabolite data that we have already uploaded on the database of Genotypes and Phenotypes (/gap) for all 2,076 individuals in FHS profiled in the current study.

To provide proof of principle of this approach, we examined our data in the FHS on β-aminoisobutyric acid.
In cross-sectional analyses, plasma β-aminoisobutyric acid levels have a negative correlation with serum triglyceride levels in the FHS (a 1-SD decrement in log β-aminoisobutyric acid is associated with a 1.07 mg/dL increase in serum triglycerides, P=2.3×10^-21). In the current study, we identify a striking association between the AGXT2 locus and plasma β-aminoisobutyric acid levels (P=5.8×10^-83), with the top SNP (rs37370) in AGXT2 accounting for 36% of its estimated heritability. In light of the cross-sectional association between β-aminoisobutyric acid levels and total triglycerides, we were interested to note that rs37370 also had many nominal associations with plasma TAGs and CEs (Table S4), with the direction of association opposite for TAGs versus CEs, suggesting that genes responsible for β-aminoisobutyric acid metabolism may have a causal and opposing impact on plasma TAG and CE levels.

In order to test this potential link between β-aminoisobutyric acid and lipid homeostasis, we knocked down agxt2 in zebrafish using morpholino antisense oligonucleotides (Figures S4A and S4B). In some fish, this knockdown resulted in no overt phenotype, whereas in others it resulted in a prominent defect in yolk sac extension and mild pericardial edema (Figure 5A). When compared to control fish, agxt2 knockdown fish with normal phenotype (AGXT2 I) and abnormal phenotype (AGXT2 II) had lower agxt2 mRNA (Figure 5B) and β-aminoisobutyric acid levels (Figure 5C); further, AGXT2 II fish had a trend for lower agxt2 mRNA and β-aminoisobutyric acid levels compared to AGXT2 I fish.
Lipid profiling demonstrated a broad, opposing, and dose-dependent impact of agxt2 knockdown on fish CE and TAG levels (Figures 5D and 5E) that aligns with the opposing directionality of association between the AGXT2 locus and TAG and CE levels in humans (Figure 5F).

To test whether the association between β-aminoisobutyric acid and lipid metabolism is specific to agxt2, we used morpholino antisense oligonucleotides to knock down abat (Figure S4C), which encodes an enzyme that catalyzes an alternative pathway for β-aminoisobutyric acid metabolism (Figure S4A). As with agxt2 knockdown, abat knockdown resulted in an abnormal phenotype notable for a defect in yolk sac extension and pericardial edema (Figure S4D). Furthermore, abat knockdown in fish resulted in decreased β-aminoisobutyric acid levels compared to control fish (Figure S4E), and recapitulated the decrease in CE levels and the increase in TAG levels seen in agxt2 knockdown fish (Figures S4F and S4G).

Because cholesterol esterification is confined to a defined set of enzymes, encoded by lcat, soat1, and soat2, we next assessed the effect of agxt2 and abat knockdown in zebrafish on the expression of these genes (Figure S5). We found that both agxt2 and abat knockdown resulted in decreased expression of lcat and soat2, consistent with the lower CE levels identified by LC-MS. Notably, humans with inherited LCAT deficiency develop both marked reductions in circulating CE levels and hypertriglyceridemia (Frohlich J et al., 1988). Lcat ablation in mice similarly results in hypertriglyceridemia, whereas transgenic Lcat overexpression results in lower plasma triglyceride levels (Ng D et al., 1997; Francone OL et al., 1995). Although the exact molecular pathways linking CE and TAG metabolism in these contexts have not been fully elucidated, Ng et al. have shown that Lcat deficiency in mice results in (1) increased triglyceride production, with increased expression of Srebp-1, Fas, and Acc-1; (2) decreased triglyceride catabolism, with impaired lipase activity; and (3) increased expression of Hmgcr and decreased expression of Soat2 (Ng D et al., 2004; Song H et al., 2006). In addition to lowering the expression of lcat and soat2, we found that both agxt2 and abat knockdown in zebrafish resulted in increased expression of srebp-1 and hmgcr, as well as decreased expression of lipc (Figure S5). Taken together, these data extend the results of gene-metabolite-phenotype data in FHS and highlight a functional link between β-aminoisobutyric acid, CE, and TAG metabolism in zebrafish.

DISCUSSION

Our platform surveys >100 metabolites not screened in prior GWAS, and extends recent efforts to annotate the common genetic determinants of circulating metabolite levels. Previously unmeasured metabolites include several distinct classes of lipid analytes for which we report numerous locus-metabolite associations, many in loci previously associated with human disease. Furthermore, using the rigorous characterization of clinical factors and family-based relationships of FHS participants, we delineate the relative contributions of inherited, environmental, and clinical factors on the metabolome. For select loci, we show that a broad view of metabolite associations provides insight on gene function, in some cases confirming known biochemical functions of the gene product (e.g., FADS1-3) and in others highlighting unanticipated metabolic roles (e.g., AGXT2).

For the majority of analytes, variation attributable to heritable factors is greater than that attributable to clinical factors, with the notable exception of the tobacco metabolite cotinine. In fact, heritability estimates for many metabolites are considerably higher than for traditional biomarkers, such as B-type natriuretic peptide (h^2=0.35) (Wang et al., 2003) or C-reactive protein (h^2=0.30) (Schnabel et al., 2009).
In some cases, this highlights metabolites that serve as proximal reporters of underlying gene function. For example, the top SNP (rs37370) in AGXT2 accounts for approximately a third of the estimated heritability for its enzyme substrate β-aminoisobutyric acid. The top SNPs for glycine (rs7422339, CPS1) and PCs 36:4 and 38:4 (rs102275, FADS1-3) account for nearly all of their heritability (Figure 1). For most metabolites, however, either no genome-wide significant association was identified or the top genome-wide significant SNP explained only a small fraction of overall heritability. To what extent the unexplained heritability for these metabolites is attributable to common polymorphisms with sub-genome-wide associations, the effect of rare variants or copy number variants not captured by SNPs in GWAS arrays, or other factors (including shared environmental factors) remains undetermined.

For select loci associated with human disease, e.g., UMPS and hereditary orotic aciduria, the locus-metabolite association identified in our study reflects the gene product’s enzymatic function. By contrast, several loci with previously established disease associations have no enzymatic or transport function directly related to the associated metabolite. In these cases, the locus-metabolite association identified in our study may provide information on the pathophysiologic link between a given locus and disease (Adamski, 2012; Suhre and Gieger, 2012). For example, the SLC7A9 locus, associated with NMMA in our study, encodes an amino acid transporter in the kidney with specificity for dibasic amino acids including cystine and arginine (Mora et al., 1996). Common variants in SLC7A9 have been associated with CKD (Kottgen et al., 2010). However, CKD is not characterized by cystinuria or cystine stones, as with the Mendelian disorder attributable to SLC7A9 mutations.
Our data highlight plasma NMMA, a methylarginine that inhibits NO synthase (Vallance et al., 1992), as a potential intermediary between common variation at this locus and renal disease. Indeed, we find that elevated plasma levels of NMMA are associated with an increased risk of future CKD among individuals with normal kidney function at baseline. Thus, our data raise the hypothesis that NMMA could be both a biomarker and effector of CKD risk.
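To make the reported effect size concrete (OR per SD 1.32 for NMMA and future CKD), the sketch below rescales an odds ratio to a different exposure contrast. It assumes the usual logistic-regression form in which the log-odds change linearly with the standardized metabolite level; that assumption and the function name are illustrative and are not taken from the study's methods.

```python
import math

def odds_ratio_for_shift(or_per_sd, sd_units):
    """Odds ratio implied for a shift of `sd_units` standard deviations,
    assuming log-odds are linear in the standardized exposure."""
    return math.exp(math.log(or_per_sd) * sd_units)

# Reported estimate: OR per SD = 1.32. Under the linearity assumption,
# a 2-SD difference in plasma NMMA corresponds to an OR of 1.32 squared.
two_sd_or = odds_ratio_for_shift(1.32, 2)
print(f"OR for a 2-SD difference: {two_sd_or:.2f}")
```

Under that assumption, individuals two standard deviations apart in plasma NMMA would differ in their odds of incident CKD by a factor of about 1.74.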
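Genome-Wide Association Studies
Genome-wide association studies (GWAS). The human genome contains millions of sequence variants, which directly or indirectly affect disease development and patients' responses to drugs. A genome-wide association study surveys the sequence variants present across the entire human genome, namely single-nucleotide polymorphisms (SNPs), and screens them for those associated with disease.
The technique gives an overview of a disease in a single pass: association studies between genes and disease are carried out at the whole-genome level, across multiple centers, with large samples and repeated validation, to reveal the genetic factors involved in disease onset, progression, and treatment.
With the rapid progress of human genomics and of gene sequencing, this newest research approach has begun to be applied on a large scale to screen for sequence variants associated with complex diseases in populations and with drug-specific responses.
In a GWAS, DNA is collected from two groups of people, patients with a given disease and unaffected individuals. The sequence variants in the DNA are read out on a genotyping chip and then analyzed and compared using bioengineering techniques.
If certain variants are markedly more common among patients than among unaffected individuals, those sequence variants are said to be 'associated' with the disease.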
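The case-control comparison described above can be sketched for a single SNP as an allelic chi-square test on a 2×2 table of allele counts (minor versus major allele, patients versus unaffected individuals). This is a minimal illustration under simple assumptions: unrelated individuals, no covariates, and no correction for testing many SNPs. The function name and the counts are hypothetical. For one degree of freedom the chi-square tail probability equals erfc(sqrt(x/2)), so only the standard library is needed.

```python
import math

def allelic_chi_square(case_minor, case_major, control_minor, control_major):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    Returns (chi2, p_value).
    """
    table = [[case_minor, case_major], [control_minor, control_major]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: 100 patients and 100 controls (200 alleles each).
chi2, p = allelic_chi_square(60, 140, 40, 160)
print(f"chi2={chi2:.3f}, p={p:.4f}")
```

In a real study the same test would be repeated for every genotyped SNP, which is why a much stricter significance threshold than 0.05 is applied genome-wide.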
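With GWAS, future research on disease diagnosis and on patients' responses to drugs can concentrate on these disease-'associated' sequence variants, which significantly shortens research time and improves research efficiency.
GWAS is a major breakthrough in the study of complex human diseases. Its advantages are: (1) high throughput: a single reaction assays hundreds to thousands of sequence variants; (2) it is not restricted to "candidate genes", and the genes involved may be "unknown"; (3) no hypothesis needs to be constructed before the study.
In 2005, Science reported the first GWAS, a study of age-related macular degeneration; reports on coronary heart disease, obesity, type II diabetes, triglycerides, schizophrenia, and related phenotypes followed.
GWAS, as a new approach to disease research, has thus been widely accepted and applied by the medical community since large-scale human genome sequencing began.
As of December 2010, more than 1,200 GWAS covering more than 200 diseases had been carried out worldwide, identifying more than 4,000 'associated' sequence variants.
SNP genotyping arrays (SNP arrays) play a very important role in GWAS.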
Natural Variations
Natural Variations and Genome-Wide Association Studies in Crop Plants
Xuehui Huang and Bin Han
National Center for Gene Research, Shanghai Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200233, China; email: xhhuang@, bhan@
Annu. Rev. Plant Biol. 2014. 65:531–51
First published online as a Review in Advance on November 20, 2013
The Annual Review of Plant Biology is online at
This article’s doi: 10.1146/annurev-arplant-050213-035715
Copyright © 2014 by Annual Reviews. All rights reserved

Keywords
crop diversity, genotyping, GWAS, domestication, breeding

Abstract
Natural variants of crops are generated from wild progenitor plants under both natural and human selection. Diverse crops that are able to adapt to various environmental conditions are valuable resources for crop improvements to meet the food demands of the increasing human population. With the completion of reference genome sequences, the advent of high-throughput sequencing technology now enables rapid and accurate resequencing of a large number of crop genomes to detect the genetic basis of phenotypic variations in crops. Comprehensive maps of genome variations facilitate genome-wide association studies of complex traits and functional investigations of evolutionary changes in crops. These advances will greatly accelerate studies on crop designs via genomics-assisted breeding. Here, we first discuss crop genome studies and describe the development of sequencing-based genotyping and genome-wide association studies in crops. We then review sequencing-based crop domestication studies and offer a perspective on genomics-driven crop designs.

Annu. Rev. Plant Biol. 2014.65:531-551. Downloaded from www.annualreviews.org. Access provided by Huazhong Agricultural University on 12/11/14.
For personal use only.

Contents
INTRODUCTION
ADVANCES IN CROP GENOME SEQUENCING
HIGH-THROUGHPUT GENOTYPING
LINKAGE MAPPING IN CROPS
GENOME-WIDE ASSOCIATION STUDIES IN CROPS
GENOME VARIATION INSIGHTS INTO CROP DOMESTICATION
CROP DESIGNS BY GENOME-WIDE SELECTION
SYNTHESIS AND CONCLUSION

INTRODUCTION
Natural variants in crop plants resulted mainly from spontaneous mutations in their wild progenitors. Since the beginnings of agriculture 10,000 years ago, a huge number of diverse crops adapted to various environmental conditions have been cultivated. Crop domestication and breeding have had a profound influence on the genetic diversity present in modern crops. Understanding the genetic basis of phenotypic variation and the domestication processes in crops can help us efficiently utilize these diverse genetic resources for crop improvement.

To meet the food demands of billions of people, it is critical to improve crop productivity through efficient breeding (27, 108, 133). The use of naturally occurring alleles has greatly increased grain yield. Through the use of huge germplasm resources and genetic tools such as genome sequences, genetic populations, haplotype map data sets, genome-wide association studies (GWAS), and transformation techniques, crop researchers are now able to extensively and rapidly mine natural variation and associate phenotypic variation with the underlying sequence variants. Recently, the advent of second-generation sequencing has facilitated the discovery
and use of natural variation in crop design and genome-wide selection (8, 30).

ADVANCES IN CROP GENOME SEQUENCING
The reference genome sequences of crops are the basis of crop genetic studies. They are also important for rapidly investigating genetic variations in natural variants of crops. Since the rice genome was completely sequenced approximately 10 years ago (22, 28, 46, 89, 94, 130), the reference genome sequences of several other major crops, including barley, millet, maize, sorghum, potato, tomato, and Brassica rapa, have been reported (7, 37, 48, 70, 83, 87, 96, 97, 110, 111, 116, 132; for a review, see 8). The reference sequences can be determined through several strategies: the clone-by-clone (e.g., bacterial artificial chromosome) approach (as was done, e.g., in rice and maize), the whole-genome shotgun approach (as was done, e.g., in sorghum and foxtail millet), or a combination of the techniques (as was done, e.g., in tomato and barley). Clone-by-clone sequencing provides a way to achieve high-quality sequence assemblies for genomes of great importance. Because de novo assembly from whole-genome shotgun sequencing often results in a large number of sequence gaps, especially when using second-generation sequencing technology, it is necessary to supplement it with other information to construct long superscaffolds. These supplements may include the generation of long-insert paired-end reads, physical maps from bacterial artificial chromosome-end sequences (or fingerprinting), and high-density genetic maps. The quality of the reference sequence, which greatly affects subsequent research, depends on both
the sequencing strategy and the genome of the crop to be sequenced: different crop species have large variations in genome size, proportion of repetitive sequences, and ploidy level (24). Generating a high-quality assembly for the large, complex crop genomes of species such as bread wheat (Triticum aestivum) remains a significant challenge (10, 51, 65).

Resequencing a large number of diverse varieties, facilitated by second-generation sequencing technologies and new computational biological approaches, is feasible after the release of a crop’s reference genome, which is then used to characterize genome-wide variation for genetic mapping and evolutionary studies (15, 29, 41, 52, 59, 60, 127). High-throughput resequencing has rapidly expanded our knowledge of genetic variations in crops. Sequencing of six elite maize inbred lines identified more than 1.2 million single-nucleotide polymorphisms (SNPs) and more than 30,000 insertions or deletions (indels) and also uncovered hundreds of presence/absence variations of intact expressed genes (59). Among various types of sequence variants, SNPs are the most abundant (an order of magnitude greater in number than all other polymorphisms) and are also easy to identify technically (Figure 1), which is why only SNP markers have been widely used in most high-throughput genotyping analyses (21, 72). In second-generation sequencing, small indels (typically <6 base pairs) are usually discovered through direct alignment, and large structural variations are usually discovered by read depth, often with a relatively high level of false negatives. To capture a full catalog of sequence variants in a crop, it is best to deep sequence several representative varieties and then perform whole-genome de novo assembly and comparative genome analysis (25).

HIGH-THROUGHPUT GENOTYPING
Genome variations in crops can be defined by genotyping each individual cultivar. The genotype is an individual’s full hereditary information, often defined to be
the allele pattern at multiple molecular markers. An individual’s phenotype is the observable characteristics that are influenced by both the genotype and the environment. Classical genetics identifies the genotype (generally adjacent to the causal variant) that is responsible for a phenotype, through linkage mapping in a recombinant population or association mapping in a natural population.

There are many types of molecular markers corresponding to different genotyping approaches. The use of polymerase chain reaction (PCR)-based markers followed by allele scoring on agarose gel laid the foundation for quantitative trait locus (QTL) mapping and gene cloning in recent years. However, the genotyping processes that are based on PCR markers are quite laborious, expensive, and time consuming when high-density genotyping is needed for a large number of individuals. The advent of high-throughput genotyping technology, coupled with the availability of reference genome sequences for multiple major crops, has greatly accelerated and facilitated genotyping procedures. Here, we summarize five high-throughput genotyping approaches (Table 1).

The first use of high-throughput genotyping was a microarray technology that detects SNPs by hybridizing DNA to oligonucleotides spotted on the chips. This method, known as microarray-based genotyping, enables direct scanning of allelic variation across the genome, covering hundreds to thousands of SNPs in a short time (17, 102, 121). Once a comprehensive SNP data set is available for a species, a well-designed microarray can be produced; generally, the technology is then cost efficient and the process relatively convenient. Nearly all GWAS in human genetics have used microarray technology, which can scan the human genome at 0.5–1 million SNPs. Microarray-based genotyping has been applied in crops such as rice and maize (26, 61, 72, 136). Diversity arrays technology (DArT) mapping, which is also based on microarray hybridizations, has been used in many crops, such as barley
(119).

Figure 1
Genomic variation in a natural population, using real data from foxtail millet. (a) The proportion of three types of sequence variants in the genomes (SNPs, small insertions or deletions, and structural variants). (b) The proportion of sequence variants with different functional effects (variants in coding regions and variants in noncoding regions). (c) Allele frequency spectrum of all variants (proportion, %, against minor allele frequency). (d) Haplotype display in a local genomic region in 400 foxtail millet varieties. Each column represents a haplotype from an inbred foxtail millet line, and each row is a single-nucleotide polymorphism (SNP) site. Dark and light red represent the major and minor alleles, respectively, at each SNP site.

The advent of second-generation sequencing technology hastened a methodological leap forward: the high-throughput sequencing-based genotyping method (99). For genotyping of recombinant populations, a method that utilizes low-coverage whole-genome resequencing data was developed (38, 137). The method was first applied in a recombinant inbred line (RIL) population that was generated from a cross between the Oryza sativa ssp. japonica Nipponbare and Oryza sativa ssp. indica 93-11 varieties (38). The genomic DNA of each line in the population was sequenced with 0.02× coverage. SNPs were identified between two parental lines, and genotype calling was based on identification of adjacent SNPs using the sliding window approach, which resulted in an ultradense genotype map. A total of 49 QTLs for 14 agronomic traits were identified at high resolution using the high-density genotype map (114). Among them, 5 QTLs of large effect were located in small genomic
regions, and strong candidate genes were found in the intervals.

Table 1 Five high-throughput genotyping methods
Feature | Microarray-based genotyping | Sequencing-based genotyping | Genotyping by sequencing | RNA-seq-based genotyping | Exon-sequencing-based genotyping
Preliminary requirement | Comprehensive SNPs available | None | A suitable restriction enzyme | None | Exon array developed
Density | Alterable | Alterable | Modest | Modest | Modest
Cost | Alterable | Alterable | Low | High | High
Experimental workload | Low | Medium | Medium | High | High
Marker distribution | Well distributed | Well distributed | Not well distributed | Not well distributed | Not well distributed
Application | Most species | Most species | Species with a large genome size | Species with a large genome size | Species with a large genome size
Additional uses | None | Identifying novel mutation variants | None | Identifying novel mutation variants and eQTL analysis | Identifying novel mutation variants
Abbreviations: eQTL, expression quantitative trait locus; SNP, single-nucleotide polymorphism.

This method now extends to different types of mapping populations in crops, including nearly isogenic lines, chromosome segment substitution lines, and F2 populations (114, 124, 126, 129, 141). Here, we refer to this whole-genome sequencing-based genotyping (via low-pass multiplex resequencing) as sequencing-based genotyping.

The low-coverage whole-genome resequencing approach was also used in the genotyping of natural populations through completely different computation methods. In the construction of the haplotype maps in rice and foxtail millet, each variety was sequenced with low genome coverage, resulting in numerous missing genotype calls owing to low-fold sequencing. Data imputation (the process of replacing missing data with inferred values according
to local linkage disequilibrium, LD) can be used to deal with such population-scale genotype data sets. For this, a k-nearest-neighbor-based algorithm that explores local haplotype similarity to infer the missing calls was developed, and applications in both rice and foxtail millet showed high accuracy (41, 50). Many imputation pipelines are now available with different features, and researchers need to make adjustments in sequencing coverage according to the population size, diversity level, and LD decay rate of the study sample in the initial experimental designs. Some imputation methods (typically using hidden Markov model-based algorithms) are also suitable for outcrossing species whose genomes contain numerous heterozygous genotypes (11). The use and simulations of sequencing data on human populations showed that, after effective imputations, even extremely low-coverage sequencing could increase the power of GWAS when compared with the microarray method (82). Computation simulations are needed to determine the sequence coverage with different population sizes for different crop species (63).

Some crops (e.g., maize, barley, and wheat) have large genomes, and whole-genome resequencing for these crops is still expensive. There are now several alternatives to sequencing the whole genome. One method is to generate restriction-site-associated DNA tags (e.g., using SbfI or EcoRI) and sequence them to identify polymorphic markers (6). A similar approach is to cut the genomic DNA and ligate the genomic fragments to bar-code adapters to prepare a multiplex sequencing library (19). Through the use of methylation-sensitive restriction enzymes, the regions with
transposable elements can be reduced, and the relatively low-copy regions flanking the particular restriction enzymes are enriched in the sequencing library. Genotyping by sequencing has been used in maize, sorghum, and barley, and these applications showed that this method is efficient for large-scale, low-cost genotyping despite the limitations in SNP number and distribution (19, 76, 91).

Another approach, known as RNA sequencing (RNA-seq)-based genotyping, is to retrieve genotype from RNA-seq data. Despite the great variation in genome size for different crops, there are no significant changes in number of genes or total gene sizes. Because most repetitive regions are ignored, performing transcriptome sequencing for SNP calling rather than whole-genome resequencing is quite cost efficient. For example, a recent study genotyped a large number of SNPs from 368 maize transcriptomes and then used the RNA-seq data for expression profile analysis and expression QTL (eQTL) mapping simultaneously (61). It would be much more expensive to genotype 368 maize genomes, because the repeat region occupies more than 80% of the total maize genome. RNA-seq-based sequencing has several weak points. Because the SNP density in genic regions is much lower than that in intergenic regions, the number of SNPs called from RNA-seq data may not be large enough for GWAS, especially for high-LD crops. The existence of a strong bias in SNP distribution raises another problem: in a particular tissue at a particular time point, many genes have very low or even no expression and thus cannot be used in genotyping, but RNA preparation of multiple tissues at multiple time points for a large population would greatly increase the workload. Exon-sequencing-based genotyping, facilitated by exon capture, can also be applied to the mapping of some large, complex crop genomes (69, 78, 113) (Table 1).

LINKAGE MAPPING IN CROPS
To unravel the genetic basis of complex QTLs like grain yield and stress tolerance, a
large sample size is key; typically thousands of individuals are needed. It is now feasible to genotype thousands of genomes using high-throughput [...]. Compared with genotyping, most fieldwork is still laborious, involving the measurement of multiple traits at several time points in large-scale experiments across diverse environments. Some sensor-based platforms have been developed for measuring biomass traits, including near-infrared spectroscopy on agricultural harvesters and spectral reflectance of plant canopies (75). Future progress in phenotyping technologies will accelerate genetic mapping and gene discovery in crops.

Genetic mapping in crops is usually undertaken in segregating mapping populations, such as F2 populations, RILs, and backcross inbred lines. Further fine mapping and gene cloning then follow, often using advanced backcross-derived populations. A mapping population derived from a cross between the O. sativa ssp. indica Kasalath and O. sativa ssp. japonica Nipponbare varieties has enabled the identification and cloning of tens of QTLs underlying a wide range of traits (32). Although this strategy has been used successfully in functional genomics studies in crops, there are two major limitations to QTL mapping in conventional recombinant populations. First, there are only a few recombination events in the mapping population; for example, typically one or two recombinations occur per chromosome in rice segregating populations, which would lead to poor mapping resolution unless very large populations are used. Second, because the sequence divergence between the selected parents in a particular segregating population represents only a small fraction of all genetic variation within a species, in a single segregating mapping population, only QTLs at which the two parents differ can be detected.

To overcome these disadvantages, some new types of populations have been constructed and used. In maize, nested association mapping (NAM) was developed to enable high power and high
P l a n tB i o l . 2014.65:531-551. D o w n l o a d e d f r o m w w w .a n n u a l r e v i e w s .o r g A c c e s s p r o v i d e d b y H u a z h o n g A g r i c u l t u r a l U n i v e r s i t y o n 12/11/14. F o r p e r s o n a l u s e o n l y .resolution through joint linkage-association analysis (71).The NAM population was created by crossing 25diverse inbred maize lines to the B73reference line,where the NAM population in total includes ∼5,000RILs.The NAM population has been used in large-scale genetic mapping for several important traits (12,58,86,109).In the model plant Arabidopsis ,the Multiparent Ad-vanced Generation Inter-Cross (MAGIC)population was generated,including hundreds of RILs descended from a heterogeneous stock of 19intermated Arabidopsis accessions (57).Computa-tional simulation demonstrated that the QTLs explaining 10%of the phenotypic variance can be detected with an average mapping error of approximately 300kb when using the MAGIC popula-tion.Another team crossed eight Arabidopsis accessions and produced a set of six RIL populations called the Arabidopsis multiparent RIL (AMPRIL)population (40).QTL analysis in the AMPRIL population showed that this genetic resource was able to detect QTLs explaining 2%or more of a trait’s variation.As an alternative to conventional linkage analysis,bulk segregation analysis coupled with multiple-sample pooling sequencing can be used in the genetic mapping of simple qualitative traits or mutant mapping.Several methods have been reported for this application,includ-ing SHOREmap (98),next-generation mapping (5),MutMap (1),and MutMap-Gap (105).In Arabidopsis ,a mutant in the Columbia (Col-0)reference background was crossed to the Landsberg erecta (L er -1)accession,and a pool of 500mutant F 2plants was sequenced to detect the causal mutation sites (98).To avoid potential interference from different genetic backgrounds,James et al.(49)recommended using populations derived from backcrossing the mutant line to the 
non-mutagenized parent for mapping by sequencing. In rice, a recessive mutant was crossed to the parental line used for the mutagenesis, and the genomes from multiple lines of mutant F2 progeny were pooled and sequenced (1). The strategy can be further improved for application to quantitative traits in crops (e.g., grading the RILs into several sequencing pools according to a particular trait).

GENOME-WIDE ASSOCIATION STUDIES IN CROPS

GWAS take full advantage of ancient recombination events to identify the genetic loci underlying traits at a relatively high resolution. The GWAS methodology became well established in human genetics during a decade of great effort. Through global collaborations, millions of common SNPs were identified in human populations to construct a high-density haplotype map of the human genome (44, 45). Several commercial microarrays were designed for large-scale genotyping and analysis of GWAS panels, with many accompanying tool kits developed. GWAS approaches have been widely used in genetic research to identify the genes involved in human disease (2, 118). The contribution of GWAS work to understanding of the genetic basis and molecular mechanisms of common disease is evident.

With the rapid development of sequencing technologies and computational methods, GWAS are now becoming a powerful tool for detecting natural variation underlying complex traits in crops (88). Different from GWAS in humans, GWAS in crops usually use a permanent resource, namely a population of diverse (and preferably homozygous) varieties that can be rephenotyped for many traits and only needs to be genotyped once, and one can subsequently generate specific mapping populations for specific traits or QTLs in crops (4). Human GWAS usually adopt a case-control design: the identification of susceptibility loci for a particular disease through a population-scale genome-wide comparison between large groups of patients and healthy controls. Moreover, human GWAS have been influenced by the missing heritability problem, where most loci that they identify have a very low rate of phenotypic contribution. To detect more QTLs, the population size (up to tens of thousands of individuals) and number of markers (up to millions of SNPs or whole genomes) must be increased continually. GWAS in crops are therefore much less costly than those in humans.

GWAS have now been carried out successfully in many crops, including maize, rice, sorghum, and foxtail millet (41, 42, 50, 58, 61, 76, 109, 136). Based on the magnitude of resources already developed and published, rice and maize are the two major models for crop GWAS, and both have panels of thousands of genotyped inbreds and multiple environment trials conducted for several traits. In rice, 1,083 cultivated O. sativa ssp. indica and O. sativa ssp. japonica varieties and 446 wild rice accessions (Oryza rufipogon) were collected and sequenced with low genome coverage (39). A high-density haplotype map of the rice genome was constructed using data imputation, and a GWAS was then conducted to characterize the alleles associated with 10 grain-related traits and flowering time using the comprehensive data set of ∼1.3 million SNPs. Several association signals were tied to previously well-characterized genes. A GWAS was also carried out in 446 O. rufipogon accessions for leaf sheath color and tiller angle, which would have stronger mapping power owing to higher levels of genetic diversity in the wild species. Moreover, a rice GWAS was performed with a microarray-based genotyping approach. In total, 413 diverse accessions of O. sativa were genotyped at 44,100 SNP variants and phenotyped for 34 traits, and the result showed the complex genetic architecture of the traits in rice. In maize, the genetic architectures of flowering
time, leaf angle, leaf size, and disease resistance traits were dissected by conducting linkage mapping and GWAS jointly in the NAM panel, and multiple related candidate genes were identified (12, 58, 86, 109). The GWAS result demonstrated that the genetic architecture of these traits is dominated by many QTLs with small effects. Maize oil is an important food and energy source, and a GWAS in maize was recently performed for maize kernel oil composition (61). A total of 368 maize lines were analyzed at ∼1 million SNPs genome-wide, and 74 loci were found to be associated with maize kernel oil concentration and fatty acid composition.

There are some slight differences between rice GWAS and maize GWAS. The differences are reflected in the trade-offs in power and resolution in the mapping of selfing versus outcrossing species. In the rice genome, LD generally decays at ∼100 kb, which might be a result of self-fertilization and the small effective population size. Some other self-pollinated crops (e.g., foxtail millet and soybean) show similar (or slower) LD decay rates as well. Owing to extended LD, the low-coverage sequencing approach coupled with missing data imputation, which has been applied in rice GWAS, is quite efficient and powerful. The low rate of LD decay, however, also means that the resolution of GWAS in the selfing species cannot resolve a single gene. Maize, in contrast, is a standard outcrossing species because the plant has separate male and female inflorescences that differ in flowering time. Owing to its outcrossing nature, maize has rapid LD decay (within ∼2 kb) and great genetic diversity, which makes it a promising model with greater power in GWAS. In most cases, the resolution of maize GWAS may reach the single-gene level. Accordingly, a typical GWAS in maize may need tens of millions of SNPs to be accurately typed genome-wide for numerous varieties, which is still challenging and costly because of the large size, abundant repeats, and paralog sequences in the maize genome.

GWAS have been successfully extended to genetic studies in other crops. A total of 916 diverse foxtail millet varieties, including both traditional landraces and modern cultivars, were genotyped through whole-genome low-coverage sequencing (50). A GWAS in foxtail millet identified multiple loci for tens of agronomic traits, which were measured in five different environments. In sorghum, 917 worldwide diverse accessions were collected, and ∼0.2 million SNPs were identified through genotyping by sequencing (76). To identify the loci underlying variation in agronomic traits, a GWAS was carried out on plant height components and inflorescence architecture, and several known loci were mapped to these traits. For a barley GWAS, a team collected an
Autoimmune Hepatitis (AIH): Study Notes
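1. The pathogenesis of AIH has not yet been fully elucidated; it is generally considered to result from the interaction of genetic and environmental factors. Genome-wide association studies (GWAS) [2] have confirmed that the genetic susceptibility loci associated with AIH mainly include human leukocyte antigen (HLA) alleles (especially HLA-DRB1) and two non-HLA alleles (SH2B3 and CARD10).

2. Magnetic resonance elastography also performs well in evaluating advanced liver fibrosis and cirrhosis in patients with AIH.

3. The simplified diagnostic criteria for AIH are based mainly on autoantibody positivity, elevated serum IgG, negative serum viral markers, and histopathological features. (No pathological feature is specific to AIH, so the diagnosis depends on compatible biochemical, immunological, and histological features together with the exclusion of other liver diseases.)

4. Although the IAIHG established widely accepted diagnostic criteria for AIH in 1992 [102], revised them in 1999, and more recently proposed simplified criteria [104, 105], the diagnosis is sometimes not straightforward and requires considerable clinical expertise.

5. The typical serum biochemical abnormality in AIH is a hepatocellular injury pattern: serum aspartate aminotransferase (AST) and alanine aminotransferase (ALT) activities are elevated, whereas serum alkaline phosphatase (ALP) and γ-glutamyl transpeptidase (GGT) levels are normal or only slightly elevated. Normal or mildly abnormal serum aminotransferase levels do not necessarily indicate mild or inactive intrahepatic disease, nor do they completely rule out a diagnosis of AIH.

Lipid-lowering drugs can also induce AIH. A possible mechanism is that, in certain idiosyncratic individuals, metabolites of the drug produced directly in the liver or via the enterohepatic circulation act as haptens and bind endogenous proteins to form immunogenic drug-protein adducts. After transport to the hepatocyte membrane, these adducts form antigenic targets that can induce anti-hepatocyte antibodies and ultimately lead to an autoimmune reaction.

The exact pathogenesis of primary biliary cholangitis (PBC) is still unclear; it is mainly related to genetic, environmental, and autoimmune factors. Anti-mitochondrial antibody (AMA) has high diagnostic value for PBC, with sensitivity and specificity as high as 90% to 95%. AMA positivity is the serological hallmark of PBC: it is found in 90% to 95% of patients and can appear years before clinical symptoms and abnormal liver function tests.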
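The Long Noncoding RNA ANRIL and Disorders of Glucose and Lipid Metabolism

Authors: Liang Weiwei; Li Naishi
Journal: 《协和医学杂志》 (Medical Journal of Peking Union Medical College Hospital)
Year (volume), issue: 2015 (6), 6
Pages: 4 (pp. 458-461)
Keywords: ANRIL; glucose and lipid metabolism
Affiliation (both authors): Key Laboratory of Endocrinology of the Ministry of Health, Department of Endocrinology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100730, China
Language of text: Chinese
Chinese Library Classification: Q756

In recent years, as genome-wide association studies (GWAS) have advanced, understanding of the chromosome 9p21 region has grown considerably [1-3]. This region was first found in 2007 to be closely associated with coronary heart disease (CHD) [1,3]. Later studies showed that the 9p21 region is closely associated not only with type 2 diabetes and myocardial infarction but also, significantly, with abdominal aortic aneurysm, peripheral vascular disease, ovarian cancer, endometriosis, intracranial aneurysm, and other diseases [4]. Notably, almost all of the single nucleotide polymorphism (SNP) sites in the 9p21 region that are associated with CHD and type 2 diabetes lie in a "gene desert" that does not encode protein [3]. More in-depth recent studies found that most of these SNP sites lie within a newly discovered long noncoding RNA (lncRNA), ANRIL (antisense noncoding RNA in the INK4 locus) [5]. This article reviews ANRIL and the disorders of glucose and lipid metabolism related to it.

In the 9p21 region, the protein-coding genes nearest to the ANRIL coding region belong to the INK4b-ARF-INK4a locus, several kilobases (kb) away [6]. The INK4b-ARF-INK4a locus is about 35 kb long and contains three known tumor suppressor genes [7]: INK4b (also called CDKN2B), INK4a, and ARF (the latter two together called CDKN2A). They encode the cyclin-dependent kinase inhibitors p15INK4b and p16INK4a and another protein, p14ARF. These three proteins take part in cell-cycle regulation, and overexpression of their genes is associated with cell proliferation, apoptosis, and tumor regulation.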
A Detailed Guide to the Genome-Wide Association Study (GWAS) Analysis Process in Bioinformatics

Introduction:
Genome-wide association studies (GWAS) are a powerful tool in bioinformatics, used to identify genetic variants associated with a particular trait or disease. This article gives a detailed overview of the GWAS analysis process.

1. Data Collection and Preparation:
The first step is to collect genetic data from a large population of individuals, along with phenotypic information on the trait or disease of interest. The data are then cleaned and preprocessed to remove noise and artefacts.

2. Genome-Wide Genotyping:
Genetic markers, usually single nucleotide polymorphisms (SNPs), are identified across the entire genome. This relies on high-throughput genotyping techniques that provide data on millions of SNPs.

3. Statistical Analysis:
Statistical methods are applied to test for an association between each SNP and the trait or disease. This is usually done with regression models that include factors such as age, gender, and population structure as covariates.

4. Quality Control:
To ensure the accuracy of the results, quality control measures are applied. These include removing SNPs with low call rates, SNPs that violate Hardy-Weinberg equilibrium, and individuals with high missingness or outlying values. These filters are normally applied before the association tests of step 3 are run, so that unreliable markers and samples do not distort the test statistics.

5. Result Interpretation:
SNPs that pass a significance threshold (e.g., a p-value cutoff) are considered potential genetic markers for the trait or disease. These results are then interpreted in the context of known genetic variation and its potential functional roles.

6. Replication and Validation:
To confirm the findings, the identified SNPs are typically replicated in independent studies or tested in functional validation experiments.

Conclusion:
GWAS analysis is a complex but powerful approach in bioinformatics that provides insight into the genetic architecture of traits and diseases. Understanding the detailed process, from data collection to validation, is crucial for accurate and reliable genetic association studies.
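The quality control, association testing, and thresholding steps described in this guide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production pipeline: the function names, the filter cutoffs (call rate 0.95, Hardy-Weinberg p-value 1e-4), the Bonferroni correction, and the toy data are all hypothetical choices made for the example, and the association test is a covariate-free score test (n times the squared correlation, compared with a 1-df chi-square) rather than the covariate-adjusted regression the guide mentions.

```python
import math

def chi2_1df_pvalue(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def call_rate(genotypes):
    """Fraction of non-missing calls; genotypes are 0/1/2 allele counts or None."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def hwe_pvalue(genotypes):
    """Chi-square goodness-of-fit test against Hardy-Weinberg proportions."""
    obs = [sum(g == k for g in genotypes) for k in (0, 1, 2)]
    n = sum(obs)
    p = (2 * obs[2] + obs[1]) / (2 * n)  # frequency of the counted allele
    exp = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    stat = sum((o - e) ** 2 / e for o, e in zip(obs, exp) if e > 0)
    return chi2_1df_pvalue(stat)

def association_pvalue(genotypes, phenotypes):
    """Score test for a linear trend: n * r^2 is ~chi-square(1) under the null."""
    pairs = [(g, y) for g, y in zip(genotypes, phenotypes) if g is not None]
    n = len(pairs)
    mg = sum(g for g, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sgg = sum((g - mg) ** 2 for g, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sgy = sum((g - mg) * (y - my) for g, y in pairs)
    if sgg == 0 or syy == 0:
        return 1.0  # monomorphic marker or constant trait: nothing to test
    return chi2_1df_pvalue(n * sgy ** 2 / (sgg * syy))

def run_gwas(snps, phenotypes, min_call_rate=0.95, min_hwe_p=1e-4, alpha=0.05):
    """Filter SNPs first, then test each survivor against a Bonferroni cutoff."""
    kept = {name: g for name, g in snps.items()
            if call_rate(g) >= min_call_rate and hwe_pvalue(g) >= min_hwe_p}
    pvals = {name: association_pvalue(g, phenotypes) for name, g in kept.items()}
    threshold = alpha / max(len(pvals), 1)
    hits = [name for name, p in pvals.items() if p < threshold]
    return pvals, hits

# Toy data (hypothetical): 20 individuals, trait driven by the first SNP.
causal = [0] * 5 + [1] * 10 + [2] * 5
noise = [0.1, -0.1] * 10
phenotypes = [10 + 2 * g + e for g, e in zip(causal, noise)]
snps = {
    "rs_assoc": causal,
    "rs_null": [0, 1, 1, 2] * 5,                 # unrelated to the trait
    "rs_lowcall": [None] * 4 + [0, 1, 1, 2] * 4,  # fails the call-rate filter
    "rs_hwe": [1] * 20,                           # all heterozygotes, fails HWE
}
pvals, hits = run_gwas(snps, phenotypes)
print(sorted(pvals), hits)
```

Filtering before testing keeps the Bonferroni denominator equal to the number of SNPs actually tested. In panels of inbred lines, where almost no heterozygotes are expected, the Hardy-Weinberg filter would not be meaningful and would be dropped.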
Supplementary Information for

Genome-wide association studies of 14 agronomic traits in rice landraces

Xuehui Huang, Xinghua Wei, Tao Sang, Qiang Zhao, Qi Feng, Yan Zhao, Canyang Li, Chuanrang Zhu, Tingting Lu, Zhiwu Zhang, Meng Li, Danlin Fan, Yunli Guo, Ahong Wang, Lu Wang, Liuwei Deng, Wenjun Li, Yiqi Lu, Qijun Weng, Kunyan Liu, Tao Huang, Taoying Zhou, Yufeng Jing, Wei Li, Zhang Lin, Edward S. Buckler, Qian Qian, Qi-Fa Zhang, Jiayang Li & Bin Han

This PDF file includes:
Supplementary Note
Supplementary Tables 2-4 and 7-8
Supplementary Figures 1-25
(Supplementary Tables 1, 5, 6, and 9 are provided in the separate Excel files)

Supplementary Note SNP identification and annotation & Estimation of specificity and sensitivity & Data imputation algorithm & Association signals on known loci
Supplementary Table 1 The list of 517 landrace accessions sampled in this study.
Supplementary Table 2 The accuracy of the low-coverage consensus sequences estimated against four sets of sequencing data.
Supplementary Table 3 Specificity and missing data rate of the genotype dataset before and after missing genotypes are inferred.
Supplementary Table 4 The detailed list of predicted effects of annotated SNPs.
Supplementary Table 5 The list of genes over-represented for large-effect changes.
Supplementary Table 6 The list of genes that contained large-effect complete-differentiation SNPs.
Supplementary Table 7 Genome-wide significant association signals of fourteen agronomic traits using the simple model.
Supplementary Table 8 Genome-wide significant association signals of fourteen agronomic traits from both the simple model and the compressed MLM.
Supplementary Table 9 The genotype dataset of indica landraces on the causal polymorphic sites of three known genes.
Supplementary Figure 1 Geographic origins of 517 Chinese rice landraces sampled in this study.
Supplementary Figure 2 Number and accuracy of SNPs called for 517 landraces plus 3 varieties as internal accuracy controls under various stringency rules.
Supplementary Figure 3 SNP density and distribution across the genome.
Supplementary Figure 4 Functional and evolutionary analyses of SNPs.
Supplementary Figure 5 Divergence and geographic origins of 517 rice landraces of China.
Supplementary Figure 6 Sequence diversity and population genetic differentiation along chromosomes in indica.
Supplementary Figure 7 Sequence diversity and population genetic differentiation along chromosomes in japonica.
Supplementary Figure 8 LD decay rate across the genome.
Supplementary Figure 9 An example of missing genotype imputation.
Supplementary Figure 10 Missing data rate and sequencing accuracy as the function of sequencing coverage.
Supplementary Figure 11 Frequency distribution of variation of fourteen traits in 373 indica landraces.
Supplementary Figure 12 Genome-wide association analysis of tiller number.
Supplementary Figure 13 Genome-wide association analysis of flag leaf angle.
Supplementary Figure 14 Genome-wide association analysis of grain length.
Supplementary Figure 15 Genome-wide association analysis of grain weight.
Supplementary Figure 16 Genome-wide association analysis of spikelet number per panicle.
Supplementary Figure 17 Genome-wide association analysis of gelatinization temperature.
Supplementary Figure 18 Genome-wide association analysis of amylose content.
Supplementary Figure 19 Genome-wide association analysis of apiculus color.
Supplementary Figure 20 Genome-wide association analysis of pericarp color.
Supplementary Figure 21 Genome-wide association analysis of hull color.
Supplementary Figure 22 Genome-wide association analysis of drought tolerance.
Supplementary Figure 23 Genome-wide association analysis of degree of seed shattering.
Supplementary Figure 24 Regions of the genome showing association signals around known genes controlling heading date.
Supplementary Figure 25 Steps of missing genotype imputation.

Supplementary Note

SNP identification and annotation

We integrated single-base-pair genotypes of 520 individuals to screen for SNPs across the genome. Discrepancies with the rice reference genome were called as candidate SNPs. Unreliable sites were then filtered according to the following criteria: 1) candidate SNP loci must be bi-allelic; 2) candidate SNP loci must be more than 10 bp away from each other; and 3) all the singleton SNPs were excluded. Sites passing these criteria were retained and called as common SNPs.

SNPs in coding regions were called coding SNPs on the basis of the gene models in the Rice Annotation Project Database (release 2), and only gene models with full-length cDNA or EST support were used (http://rapdb.dna.affrc.go.jp/). The coding SNPs were then annotated as synonymous or non-synonymous SNPs, which was used to calculate the nonsynonymous-to-synonymous ratio for each gene. SNPs with large-effect changes were annotated and partitioned into SNPs that introduced stop codons, SNPs that disrupt stop codons, SNPs that disrupt initiation codons, and SNPs that disrupt splice sites.

We used three steps to find gene families under relaxed selection: 1) Genes with the same Pfam domain were grouped into a gene family. We then calculated the number of coding SNPs within the genes of each family. Only those families with 300 or more SNPs, which permitted sufficient power for statistical tests, were kept. 2) To avoid the impact of potential pseudogenes, genes with large-effect SNPs were removed from the family before further statistical testing. 3) A chi-square test was then used for each family.
The observed class is the nonsynonymous-to-synonymous ratio of each family, while the expected one is the nonsynonymous-to-synonymous ratio of all families combined.

Estimation of specificity and sensitivity

We used four sets of sequencing data to assess genotyping accuracy. For indica cv. Guangluai-4, the 21 Mb BAC-based sequences [36] and whole-genome sequences generated from 20-fold coverage Illumina sequencing were used. The BAC-based sequences of Guangluai-4 were composed of 273 BACs covering 65.7% of chromosome 4, of which the accession numbers were listed at our website (/english/edatabasei.htm). The 20-fold Illumina sequences of Guangluai-4 are available in the EBI European Nucleotide Archive (ftp://) with accession number ERP000235. For japonica, we used the BAC-based sequences of japonica cv. Nipponbare [8] and genome sequences of japonica cv. Nongken-58 generated from 14-fold coverage Illumina GA sequencing. The BAC-based sequences of japonica cv. Nipponbare were the rice reference genome (IRGSP 4.0, http://rgp.dna.affrc.go.jp/IRGSP/Build4/build4.html). The 14-fold Illumina sequences of Nongken-58 are available in the EBI European Nucleotide Archive (ftp://) with accession number ERP000236.

Genotype calls of consensus sequences at positions covered by reads were then compared with each of the above four sets of sequencing data. The number of total sites for comparison and the number of concordant sites were calculated, which gave the estimates of specificity (also called "accuracy" in the RESULTS; Supplementary Table 2). Referenced to the four independent standards, the estimated specificities were all above 99.9%, indicating that the quality control procedures were effective.

The estimate of sensitivity of the genotype calling (also referred to as "recall rate" in the RESULTS) was based on BAC-based sequences of indica cv. Guangluai-4.
Direct comparison of the BAC-based sequences of Guangluai-4 with their corresponding regions of the reference genome resulted in a total of 103,965 SNPs, of which 20,888 SNPs were detected by one-fold coverage Illumina sequencing of Guangluai-4, giving a recall rate of 20.1%.

Data imputation algorithm

Beginning with the first window at the top of the chromosome, the local haplotype similarity is exploited. A similarity score is calculated between all pairs of individuals in the studied population of N individuals:

$$S_{ij} = \sum_{z=1}^{w} s_{ij}(z)$$

where $s_{ij}(z)$ is the similarity score between individuals i and j at the z-th SNP. The overall similarity score for the window is the sum of the similarity scores of all SNPs in the window of size w (Supplementary Fig. 25a).

At each SNP, the major and minor alleles are denoted by "0" and "1" respectively, while the missing genotype is denoted by "?". The single-SNP similarity scores are: $s_{ij}(z) = 1$ if the genotypes of the two individuals are identical (0 vs. 0 or 1 vs. 1); $s_{ij}(z) = 0.5$ if one or both genotypes are missing (0 vs. ?, 1 vs. ?, or ? vs. ?); $s_{ij}(z) = p$ if the genotypes are different (0 vs. 1). The value p is allowed to vary and normally takes negative values (Supplementary Fig. 25a). This penalty is set to avoid recognizing different haplotypes as nearest neighbors, while allowing for differences that might be caused by sequencing errors.

For the population with N individuals, a matrix of $S_{ij}$ is obtained (Supplementary Fig. 25b). To infer the missing genotype at a SNP site of individual i, the nearest neighbors of this individual are identified. The nearest neighbors of individual i are defined as those individuals with the highest to k-th highest $S_{ij}$ scores (Supplementary Fig. 25c). For the nearest neighbors, the major allele frequency at a SNP is calculated.
At a SNP site where the genotype is missing from individual i, the genotype of this individual is determined to be the same as the major allele of the nearest neighbors if the major allele frequency is higher than a threshold, f. This filling procedure is conducted for the first SNP site in a window for all individuals with a missing genotype and continues as the window slides along the chromosome. All SNP sites of the last window of a chromosome are filled at once.

One further step is to determine the values of the four variables, w, p, k, and f, that give the best imputation results. Because testing a large number of values of these variables for individual windows is computationally impractical, we tested three values of each variable and obtained the optimized value of each variable for the entire genome. Based on the genomic and populational properties of indica landraces and through extensive cross validations, the following testing values were set for these variables: w (50, 65, 80), p (-3, -5, -7), k (3, 5, 7), and f (0.7, 0.75, 0.8), which gave 81 combinations of values to test for the 373 indica landraces plus the internal control, Guangluai-4.

To optimize the variables for the indica landraces, 1,200 chromosomal regions each containing 300 consecutive SNPs were randomly selected from the genome for cross validation. In these regions, we randomly masked 1% of the genotypes to turn them into missing data. Two criteria were adopted for evaluating the performance of missing data imputation. One is the imputation accuracy, A, defined as the percentage of correctly inferred genotypes among the 1% masked genotypes. The other is the filling rate, F, defined as the percentage of missing genotypes that are inferred and filled. For each of the 81 combinations of the four variables, A and F were calculated for each of the 1,200 chromosome regions, and the means of these 1,200 A and F values were calculated.
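The window-based filling rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and toy data are hypothetical, the defaults are the optimized values reported in this note (w = 80, p = -7, k = 5, f = 0.7), and two details the text leaves open are filled in by assumption: similarity scores are always computed from the original (unfilled) genotypes, and the major allele frequency is taken over those nearest neighbors that have a call at the site.

```python
def snp_score(a, b, penalty):
    """Per-SNP similarity: 1 if identical, 0.5 if either is missing, penalty if different."""
    if a is None or b is None:
        return 0.5
    return 1.0 if a == b else penalty

def impute(genos, w=80, penalty=-7.0, k=5, f=0.7):
    """Fill missing genotypes (None) from the k most similar individuals per window.

    genos: one list per individual, holding 0 (major), 1 (minor) or None per SNP.
    Returns a filled copy; the input is left unchanged.
    """
    n, m = len(genos), len(genos[0])
    w = min(w, m)
    filled = [row[:] for row in genos]
    last_start = m - w
    for start in range(last_start + 1):
        window = range(start, start + w)
        # Fill only the first SNP of each window; the last window is filled at once.
        targets = list(window) if start == last_start else [start]
        for i in range(n):
            if all(genos[i][t] is not None for t in targets):
                continue
            # Window similarity S_ij to every other individual, best first.
            scores = sorted(
                ((sum(snp_score(genos[i][z], genos[j][z], penalty) for z in window), j)
                 for j in range(n) if j != i),
                reverse=True)
            neighbours = [j for _, j in scores[:k]]
            for t in targets:
                if genos[i][t] is not None:
                    continue
                calls = [genos[j][t] for j in neighbours if genos[j][t] is not None]
                if not calls:
                    continue  # no neighbour has a call here: leave it missing
                ones = sum(calls)
                major = 1 if 2 * ones > len(calls) else 0
                freq = max(ones, len(calls) - ones) / len(calls)
                if freq > f:
                    filled[i][t] = major
    return filled

# Toy example (hypothetical): two haplotype groups, one missing call in each.
demo = [
    [0, 0, None, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, None],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
result = impute(demo, w=4, penalty=-3.0, k=2, f=0.7)
print(result[0][2], result[3][7])
```

Scoring a missing pair as 0.5 lets sparsely covered individuals still find neighbors, while the negative penalty keeps individuals from the other haplotype group out of the top k even when they share a few sites.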
A total of 81 means for A and for F were obtained and plotted (Supplementary Fig. 25d). According to the distribution of A and F, the combination of w = 80, p = -7, k = 5, and f = 0.7 was judged to be the best because it yielded the highest F = 98.1% and the nearly highest A = 98.0%. After the missing genotypes were inferred, the imputation accuracy was calculated by comparing the inferred genotypes of Guangluai-4 with its accurate genome sequences.

Association signals on known loci

Among the association signals we identified, several were in close proximity to known loci, which were identified previously via mutants or crosses. The strongest signal for apiculus color was located at one known locus. The peak SNP was ~20 kb away from OsC1, the gene previously identified to control the coloration (Fig. 5a) [23]. For pericarp color, the strongest signal was ~26 kb away from Rc, the gene known to underlie red pericarp of rice (Fig. 5b) [24]. For hull color, the strongest signal was near the ibf locus, but the causal gene has not been identified and confirmed to date [25].

For grain quality, we studied two traits related to cooking and eating properties: gelatinization temperature and amylose content. The strongest association signal for gelatinization temperature was ~21 kb away from ALK, the previously identified starch synthase gene that modifies cooking properties (Fig. 5c) [26]. For amylose content, the strongest signal had its peak SNP ~1 kb away from waxy, the starch synthase gene known to control amylose content in rice (Fig. 5d) [27,28].

Grain width and length are two major characteristics of grain size. For grain width, the peak SNP with the strongest signal was ~6 kb away from qSW5, the gene previously identified for grain width variation (Fig. 5e) [29]. For grain length, one strong signal was <1 kb away from GS3, the gene known to control grain length of rice (Fig.
5f) [30].

Primers for amplification and sequencing across the regions of the causal polymorphisms of three known genes were designed using Primer 3 (v 0.2). The sequences of the PCR primers were provided in Supplementary Table 9.

Supplementary Table 2 The accuracy of the low-coverage consensus sequences estimated against four sets of sequencing data.

| Sequences used for validation | Number of total bases | Number of concordance | Accuracy |
|---|---|---|---|
| Guangluai-4 20x Illumina data | 111,725,302 | 111,678,146 | 99.96% |
| Guangluai-4 BAC sequences | 7,324,676 | 7,318,437 | 99.91% |
| Nongken-58 14x Illumina data | 137,144,714 | 137,108,004 | 99.97% |
| Nipponbare reference genome | 104,871,563 | 104,837,452 | 99.97% |

Supplementary Table 3 Specificity and missing data rate of the genotype dataset before and after missing genotypes are inferred.

| Type | Before imputing | After imputing |
|---|---|---|
| Specificity | | |
| Guangluai-4 20x Illumina data | 98.7% | 98.7% |
| Guangluai-4 BAC sequences | 98.4% | 98.5% |
| Nongken-58 14x Illumina data | 99.7% | 98.6% |
| Nipponbare reference genome | 99.6% | 98.7% |
| Missing data rate | | |
| The entire set of 517 landraces | 61.7% | 2.9% |
| The subset of 373 indica landraces | 61.4% | 2.6% |
| The subset of 131 japonica landraces | 62.5% | 3.8% |

Supplementary Table 4 The detailed list of predicted effects of annotated SNPs.

| Type of predicted effects | Number of SNPs | Number of Genes |
|---|---|---|
| Large-effect | 3,625 | 3,039 |
| SNPs that introduced stop codons | 2,005 | 1,709 |
| SNPs that disrupt stop codons | 200 | 200 |
| SNPs that disrupt initiation codons | 374 | 374 |
| SNPs that disrupt splice sites | 1,046 | 994 |
| Synonymous | 74,849 | 21,019 |
| Non-Synonymous | 92,665 | 22,342 |

Supplementary Table 7 Genome-wide significant association signals of fourteen agronomic traits using the simple model.

| Trait | Chr. | Position (IRGSP4) | Major allele | Minor allele | Minor allele freq | -log10 P (simple model) |
|---|---|---|---|---|---|---|
| Amylose content | 3 | 18,705,982 | C | T | 0.06 | 16.81 |
| Amylose content | 6 | 1,757,040 a | C | G | 0.14 | 67.75 |
| Amylose content | 6 | 6,189,558 | A | T | 0.11 | 34.76 |
| Amylose content | 6 | 6,709,537 | C | T | 0.19 | 35.45 |
| Amylose content | 12 | 10,993,688 | G | T | 0.06 | 15.93 |
| Apiculus color | 6 | 5,335,519 a | A | G | 0.33 | 48.44 |
| Apiculus color | 6 | 7,681,502 | G | A | 0.32 | 17.31 |
| Apiculus color | 12 | 460,120 | T | C | 0.22 | 9.92 |
| Drought tolerance | 1 | 5,536,395 | G | T | 0.11 | 9.97 |
| Drought tolerance | 2 | 1,489,158 | T | C | 0.12 | 9.32 |
| Drought tolerance | 5 | 2,275,357 | A | C | 0.06 | 14.24 |
| Drought tolerance | 6 | 28,243,628 | C | T | 0.09 | 11.98 |
| Drought tolerance | 11 | 21,161,361 | G | C | 0.08 | 15.28 |
| Gelatinization temperature | 2 | 31,298,733 | T | C | 0.24 | 8.08 |
| Gelatinization temperature | 5 | 28,906,320 | A | T | 0.47 | 8.48 |
| Gelatinization temperature | 6 | 6,726,587 a | C | T | 0.19 | 14.68 |
| Gelatinization temperature | 8 | 19,334,399 | C | A | 0.27 | 8.33 |
| Gelatinization temperature | 11 | 22,564,279 | A | C | 0.24 | 9.01 |
| Grain length | 1 | 5,966,086 | C | T | 0.08 | 12.31 |
| Grain length | 3 | 17,379,260 a | A | C | 0.07 | 17.64 |
| Grain length | 3 | 17,637,475 | C | A | 0.08 | 18.10 |
| Grain length | 3 | 23,349,781 | A | C | 0.13 | 14.47 |
| Grain length | 4 | 1,135,241 | T | G | 0.11 | 11.60 |
| Spikelet number | 1 | 26,933,074 | A | G | 0.36 | 8.53 |
| Spikelet number | 3 | 436,658 | G | A | 0.10 | 9.08 |
| Grain weight | 1 | 23,687,541 | A | G | 0.23 | 13.15 |
| Grain weight | 4 | 3,580,191 | A | C | 0.21 | 12.66 |
| Grain weight | 7 | 15,905,023 | G | A | 0.11 | 13.06 |
| Grain weight | 8 | 6,288,077 | T | G | 0.09 | 11.85 |
| Grain weight | 8 | 20,985,668 | T | C | 0.18 | 11.99 |
| Grain width | 5 | 4,942,020 | A | T | 0.15 | 16.14 |
| Grain width | 5 | 5,341,575 a | G | A | 0.17 | 23.38 |
| Grain width | 7 | 14,364,359 | C | T | 0.42 | 10.98 |
| Grain width | 8 | 6,191,511 | A | G | 0.05 | 9.07 |
| Grain width | 12 | 13,725,521 | T | G | 0.09 | 9.69 |
| Heading date | 1 | 23,908,112 | T | C | 0.23 | 44.14 |
| Heading date | 4 | 4,142,083 | T | A | 0.21 | 54.45 |
| Heading date | 4 | 17,728,075 | A | G | 0.06 | 46.56 |
| Heading date | 6 | 28,818,321 | A | G | 0.29 | 38.66 |
| Heading date | 7 | 24,952,910 | C | T | 0.33 | 37.49 |
| Hull color | 6 | 10,378,142 | T | C | 0.06 | 9.81 |
| Hull color | 8 | 20,947,967 | T | C | 0.18 | 8.39 |
| Hull color | 9 | 7,366,211 | T | C | 0.20 | 19.77 |
| Leaf angle | 1 | 5,605,379 | A | T | 0.08 | 11.39 |
| Leaf angle | 1 | 25,367,568 | C | T | 0.22 | 11.51 |
| Leaf angle | 1 | 35,465,422 | G | A | 0.09 | 12.52 |
| Leaf angle | 4 | 24,729,628 | A | G | 0.08 | 8.43 |
| Leaf angle | 5 | 21,842,625 | A | G | 0.40 | 11.27 |
| Pericarp color | 2 | 34,824,975 | A | C | 0.07 | 9.59 |
| Pericarp color | 3 | 31,599,995 | T | A | 0.18 | 9.54 |
| Pericarp color | 7 | 6,127,089 a | G | T | 0.33 | 60.96 |
| Pericarp color | 7 | 17,543,383 | G | T | 0.23 | 8.13 |
| Pericarp color | 8 | 12,483,076 | T | G | 0.21 | 16.79 |
| Shattering degree | 3 | 31,392,925 | A | G | 0.09 | 8.54 |
| Shattering degree | 5 | 898,443 | C | T | 0.05 | 10.24 |
| Shattering degree | 8 | 25,498,378 | A | T | 0.23 | 8.98 |
| Shattering degree | 11 | 2,231,172 | C | T | 0.05 | 9.18 |
| Shattering degree | 12 | 16,521,488 | C | G | 0.05 | 8.93 |
| Tiller number | 1 | 4,759,534 | G | A | 0.21 | 8.92 |
| Tiller number | 1 | 6,840,011 | T | C | 0.25 | 8.51 |
| Tiller number | 2 | 5,541,400 | C | T | 0.12 | 8.80 |
| Tiller number | 2 | 25,083,473 | A | T | 0.13 | 8.41 |
| Tiller number | 8 | 25,453,518 | T | C | 0.11 | 8.31 |

Chr., chromosome.
a Known loci with identified genes, which are reported in the Supplementary Note.

Nature Genetics: doi:10.1038/ng.695
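The accuracy values in Supplementary Table 2 and the recall rate quoted in the Supplementary Note are simple ratios, so they can be re-derived from the reported counts. The short check below is only a worked illustration: the helper name `percent` and the dictionary layout are choices made for this example, while every number is copied from the table and note.

```python
def percent(part, whole, digits):
    """Return part/whole as a percentage rounded to the given number of digits."""
    return round(100.0 * part / whole, digits)

# (total bases compared, concordant bases), as listed in Supplementary Table 2.
validation_sets = {
    "Guangluai-4 20x Illumina data": (111_725_302, 111_678_146),
    "Guangluai-4 BAC sequences": (7_324_676, 7_318_437),
    "Nongken-58 14x Illumina data": (137_144_714, 137_108_004),
    "Nipponbare reference genome": (104_871_563, 104_837_452),
}

# Specificity ("accuracy") = concordant sites / total compared sites.
accuracy = {name: percent(concordant, total, 2)
            for name, (total, concordant) in validation_sets.items()}

# Sensitivity ("recall rate") = SNPs recovered at one-fold coverage / SNPs in the BAC comparison.
recall = percent(20_888, 103_965, 1)

for name, value in accuracy.items():
    print(f"{name}: {value}%")
print(f"Recall rate: {recall}%")
```

All four ratios lie above 99.9%, in line with the statement in the note, and the recall ratio rounds to the reported 20.1%.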