428 Genome Informatics 14 428–429 (2003) Identifying Potential Regulatory Sequences of Alt
Analysis of Genetic Diversity and Population Structure
Agricultural Sciences in China 2010, 9(9): 1251-1262, September 2010
Received 30 October, 2009; Accepted 16 April, 2010

Analysis of Genetic Diversity and Population Structure of Maize Landraces from the South Maize Region of China

LIU Zhi-zhai 1, 2, GUO Rong-hua 2, 3, ZHAO Jiu-ran 4, CAI Yi-lin 1, WANG Feng-ge 4, CAO Mo-ju 3, WANG Rong-huan 2, 4, SHI Yun-su 2, SONG Yan-chun 2, WANG Tian-yu 2 and LI Yu 2

1 Maize Research Institute, Southwest University, Chongqing 400716, P.R.China
2 Institute of Crop Sciences/National Key Facility for Gene Resources and Genetic Improvement, Chinese Academy of Agricultural Sciences, Beijing 100081, P.R.China
3 Maize Research Institute, Sichuan Agricultural University, Ya'an 625014, P.R.China
4 Maize Research Center, Beijing Academy of Agricultural and Forestry Sciences, Beijing 100089, P.R.China

Abstract

Understanding the genetic diversity and population structure of landraces is important for using this germplasm in breeding programs. In the present study, a total of 143 core maize landraces from the South Maize Region (SR) of China, which can represent the general profile of the genetic diversity in the landrace germplasm of SR, were genotyped with 54 DNA microsatellite markers. In total, 517 alleles were detected among these landraces, ranging from 4 to 22 per locus, with an average of 9.57 alleles per locus. The total gene diversity of these core landraces was 0.61, suggesting a relatively high level of genetic diversity. Analysis of population structure based on a Bayesian method gave a result similar to that of the neighbor-joining (NJ) phylogenetic method. The results indicated that the whole set of 143 core landraces could be clustered into two distinct groups. All landraces from Guangdong and Hainan and 15 landraces from Jiangxi were clustered into group 1, while those from the other regions of SR formed group 2.
Analysis of genetic diversity showed that the two groups had similar gene diversity, but group 1 had fewer alleles per locus on average (6.63) and fewer distinct alleles (91) than group 2 (7.94 and 110, respectively). The relatively high richness of total and distinct alleles preserved in the core landraces from SR suggests that this germplasm could be a useful resource for germplasm enhancement and maize breeding in China.

Key words: maize, core landraces, genetic diversity, population structure

INTRODUCTION

Maize has been grown in China for nearly 500 years since its first introduction into the country, which is now the second-largest maize producer in the world. Currently, six maize growing regions are recognized across the country according to ecological conditions and farming systems. These include three major production regions, i.e., the North Spring Maize Region, the Huang-Huai-Hai Summer Maize Region, and the Southwest Maize Region, and three minor regions, i.e., the South Maize Region, the Northwest Maize Region, and the Qingzang Plateau Maize Region. The South Maize Region (SR) is of particular interest because of its importance in the origin of Chinese maize. Chinese maize is hypothesized to have been introduced mainly by two routes. One is the land route, by which maize was first brought from India to Tibet and then to Sichuan Province in southwestern China. The other is the sea route, by which maize was first shipped to the coastal areas of southeastern China and then spread throughout the country (Xu 2001; Zhou 2000). SR contains all of the coastal provinces and regions of southeastern China.

Over the long history of maize cultivation in southern China, numerous landraces have been formed, in which a great amount of genetic variation has been observed (Li 1998). As happened with the switch to hybrids in Europe (Reif et al. 2005a), maize landraces in China have been almost entirely replaced by hybrids since the 1950s (Li 1998). However, some landraces with good adaptation and yield performance are still grown in a few mountainous areas of this region (Liu et al. 1999). Through a great collection effort since the 1950s, 13521 accessions of maize landraces are currently preserved in the China National Genebank (CNG), and a core collection of these landraces has been established (Li et al. 2004). In this core collection, a total of 143 maize landrace accessions were collected from the South Maize Region (SR) (Table 1).

Since simple sequence repeat (SSR) markers were first used in human genetics (Litt and Luty 1989), they have become one of the most widely used marker types in crop research (Melchinger et al. 1998; Enoki et al. 2005), especially in the molecular characterization of genetic resources, e.g., soybean [Glycine max (L.) Merr] (Xie et al. 2005), rice (Oryza sativa L.) (Garris et al. 2005), and wheat (Triticum aestivum) (Chao et al. 2007). In maize (Zea mays L.), numerous studies on the genetic diversity and population structure of landraces and inbred lines in many countries and regions worldwide have been published (Liu et al. 2003; Vigouroux et al. 2005; Reif et al. 2006; Wang et al. 2008). This documentation of genetic diversity and population structure in maize genetic resources has facilitated the understanding of the genetic bases of maize landraces, the utilization of these resources, and the mining of favorable alleles from landraces. Although some studies on the genetic diversity of Chinese maize inbred lines have been conducted (Yu et al. 2007; Wang et al. 2008), information on the general profile of genetic diversity in Chinese maize landraces is scarce. In particular, there are no reports on the genetic diversity of the maize landraces collected from SR, possibly the earliest maize-growing area in China.
In this paper, a total of 143 landraces from SR listed in the core collection of CNG were genotyped using SSR markers, with the aim of revealing the genetic diversity of the landraces from SR (Table 2) of China and examining the genetic relationships and population structure of these landraces.

MATERIALS AND METHODS

Plant materials and DNA extraction

In total, 143 landraces from SR listed in the core collection of CNG, which was established by a sequential stratification method (Liu et al. 2004), were used in the present study. Detailed information on all of these landrace accessions is listed in Table 1. For each landrace, DNA was extracted by a CTAB method (Saghai-Maroof et al. 1984) from a bulk pool made up of equal amounts of leaf material sampled from 15 randomly chosen plants, following the procedure of Reif et al. (2005b).

SSR genotyping

A total of 54 simple sequence repeat (SSR) markers covering the entire maize genome were used to fingerprint all of the 143 core landrace accessions (Table 3). The 5´ end of the left primer of each locus was tailed with an M13 sequence, 5´-CACGACGTTGTAAAACGAC-3´. PCR amplification was performed in a 15 µL reaction containing 80 ng of template DNA, 7.5 mmol L-1 of each of the four dNTPs, 1× Taq polymerase buffer, 1.5 mmol L-1 MgCl2, 1 U Taq polymerase (Tiangen Biotech Co. Ltd., Beijing, China), 1.2 µmol L-1 of forward primer and universal fluorescent labeled M13 primer, and 0.3 µmol L-1 of M13 sequence tailed reverse primer (Schuelke 2000). The amplification was carried out in a 96-well DNA thermal cycler (GeneAmp PCR System 9700, Applied Biosystems, USA). PCR products were size-separated on an ABI Prism 3730XL DNA sequencer (Hitachi High-Technologies Corporation, Tokyo, Japan) and analyzed with the software packages GENEMAPPER and GeneMarker ver.
6 (SoftGenetics, USA).Data analysesAverage number of alleles per locus and average num-ber of group-specific alleles per locus were identifiedAnalysis of Genetic Diversity and Population Structure of Maize Landraces from the South Maize Region of China 1253Table 1 The detailed information about the landraces used in the present studyPGS revealed by Structure1) NJ dendragram revealed Group 1 Group 2 by phylogenetic analysis140-150tian 00120005AnH-06Jingde Anhui 0.0060.994Group 2170tian00120006AnH-07Jingde Anhui 0.0050.995Group 2Zixihuangyumi00120007AnH-08Zixi Anhui 0.0020.998Group 2Zixibaihuangzayumi 00120008AnH-09Zixi Anhui 0.0030.997Group 2Baiyulu 00120020AnH-10Yuexi Anhui 0.0060.994Group 2Wuhuazi 00120021AnH-11Yuexi Anhui 0.0030.997Group 2Tongbai 00120035AnH-12Tongling Anhui 0.0060.994Group 2Yangyulu 00120036AnH-13Yuexi Anhui 0.0040.996Group 2Huangli 00120037AnH-14Tunxi Anhui 0.0410.959Group 2Baiyumi 00120038AnH-15Tunxi Anhui 0.0030.997Group 2Dapigu00120039AnH-16Tunxi Anhui 0.0350.965Group 2150tianbaiyumi 00120040AnH-17Xiuning Anhui 0.0020.998Group 2Xiuning60tian 00120042AnH-18Xiuning Anhui 0.0040.996Group 2Wubaogu 00120044AnH-19ShitaiAnhui 0.0020.998Group 2Kuyumi00130001FuJ-01Shanghang Fujian 0.0050.995Group 2Zhongdouyumi 00130003FuJ-02Shanghang Fujian 0.0380.962Group 2Baixinyumi 00130004FuJ-03Liancheng Fujian 0.0040.996Group 2Hongxinyumi 00130005FuJ-04Liancheng Fujian 0.0340.966Group 2Baibaogu 00130008FuJ-05Changding Fujian 0.0030.997Group 2Huangyumi 00130011FuJ-06Jiangyang Fujian 0.0020.998Group 2Huabaomi 00130013FuJ-07Shaowu Fujian 0.0020.998Group 2Huangbaomi 00130014FuJ-08Songxi Fujian 0.0020.998Group 2Huangyumi 00130016FuJ-09Wuyishan Fujian 0.0460.954Group 2Huabaogu 00130019FuJ-10Jian’ou Fujian 0.0060.994Group 2Huangyumi 00130024FuJ-11Guangze Fujian 0.0010.999Group 2Huayumi 00130025FuJ-12Nanping Fujian 0.0040.996Group 2Huangyumi 00130026FuJ-13Nanping Fujian 0.0110.989Group 2Hongbaosu 00130027FuJ-14Longyan Fujian 0.0160.984Group 2Huangfansu 
00130029FuJ-15Loangyan Fujian 0.0020.998Group 2Huangbaosu 00130031FuJ-16Zhangping Fujian 0.0060.994Group 2Huangfansu 00130033FuJ-17Zhangping Fujian0.0040.996Group 2Baolieyumi 00190001GuangD-01Guangzhou Guangdong 0.9890.011Group 1Nuomibao (I)00190005GuangD-02Shixing Guangdong 0.9740.026Group 1Nuomibao (II)00190006GuangD-03Shixing Guangdong 0.9790.021Group 1Zasehuabao 00190010GuangD-04Lechang Guangdong 0.9970.003Group 1Zihongmi 00190013GuangD-05Lechang Guangdong 0.9880.012Group 1Jiufengyumi 00190015GuangD-06Lechang Guangdong 0.9950.005Group 1Huangbaosu 00190029GuangD-07MeiGuangdong 0.9970.003Group 1Bailibao 00190032GuangD-08Xingning Guangdong 0.9980.002Group 1Nuobao00190038GuangD-09Xingning Guangdong 0.9980.002Group 1Jinlanghuang 00190048GuangD-10Jiangcheng Guangdong 0.9960.004Group 1Baimizhenzhusu 00190050GuangD-11Yangdong Guangdong 0.9940.006Group 1Huangmizhenzhusu 00190052GuangD-12Yangdong Guangdong 0.9930.007Group 1Baizhenzhu 00190061GuangD-13Yangdong Guangdong 0.9970.003Group 1Baiyumi 00190066GuangD-14Wuchuan Guangdong 0.9880.012Group 1Bendibai 00190067GuangD-15Suixi Guangdong 0.9980.002Group 1Shigubaisu 00190068GuangD-16Gaozhou Guangdong 0.9960.004Group 1Zhenzhusu 00190069GuangD-17Xinyi Guangdong 0.9960.004Group 1Nianyaxixinbai 00190070GuangD-18Huazhou Guangdong 0.9960.004Group 1Huangbaosu 00190074GuangD-19Xinxing Guangdong 0.9950.005Group 1Huangmisu 00190076GuangD-20Luoding Guangdong 0.940.060Group 1Huangmi’ai 00190078GuangD-21Luoding Guangdong 0.9980.002Group 1Bayuemai 00190084GuangD-22Liannan Guangdong 0.9910.009Group 1Baiyumi 00300001HaiN-01Haikou Hainan 0.9960.004Group 1Baiyumi 00300003HaiN-02Sanya Hainan 0.9970.003Group 1Hongyumi 00300004HaiN-03Sanya Hainan 0.9980.002Group 1Baiyumi00300011HaiN-04Tongshi Hainan 0.9990.001Group 1Zhenzhuyumi 00300013HaiN-05Tongshi Hainan 0.9980.002Group 1Zhenzhuyumi 00300015HaiN-06Qiongshan Hainan 0.9960.004Group 1Aiyumi 00300016HaiN-07Qiongshan Hainan 0.9960.004Group 1Huangyumi 00300021HaiN-08Qionghai Hainan 0.9970.003Group 
1Y umi 00300025HaiN-09Qionghai Hainan 0.9870.013Group 1Accession name Entry code Analyzing code Origin (county/city)Province/Region1254LIU Zhi-zhai et al .Baiyumi00300032HaiN-10Tunchang Hainan 0.9960.004Group 1Huangyumi 00300051HaiN-11Baisha Hainan 0.9980.002Group 1Baihuangyumi 00300055HaiN-12BaishaHainan 0.9970.003Group 1Machihuangyumi 00300069HaiN-13Changjiang Hainan 0.9900.010Group 1Hongyumi00300073HaiN-14Dongfang Hainan 0.9980.002Group 1Xiaohonghuayumi 00300087HaiN-15Lingshui Hainan 0.9980.002Group 1Baiyumi00300095HaiN-16Qiongzhong Hainan 0.9950.005Group 1Y umi (Baimai)00300101HaiN-17Qiongzhong Hainan 0.9980.002Group 1Y umi (Xuemai)00300103HaiN-18Qiongzhong Hainan 0.9990.001Group 1Huangmaya 00100008JiangS-10Rugao Jiangsu 0.0040.996Group 2Bainian00100012JiangS-11Rugao Jiangsu 0.0080.992Group 2Bayebaiyumi 00100016JiangS-12Rudong Jiangsu 0.0040.996Group 2Chengtuohuang 00100021JiangS-13Qidong Jiangsu 0.0050.995Group 2Xuehuanuo 00100024JiangS-14Qidong Jiangsu 0.0020.998Group 2Laobaiyumi 00100032JiangS-15Qidong Jiangsu 0.0050.995Group 2Laobaiyumi 00100033JiangS-16Qidong Jiangsu 0.0010.999Group 2Huangwuye’er 00100035JiangS-17Hai’an Jiangsu 0.0030.997Group 2Xiangchuanhuang 00100047JiangS-18Nantong Jiangsu 0.0060.994Group 2Huangyingzi 00100094JiangS-19Xinghua Jiangsu 0.0040.996Group 2Xiaojinhuang 00100096JiangS-20Yangzhou Jiangsu 0.0010.999Group 2Liushizi00100106JiangS-21Dongtai Jiangsu 0.0030.997Group 2Kangnandabaizi 00100108JiangS-22Dongtai Jiangsu 0.0020.998Group 2Shanyumi 00140020JiangX-01Dexing Jiangxi 0.9970.003Group 1Y umi00140024JiangX-02Dexing Jiangxi 0.9970.003Group 1Tianhongyumi 00140027JiangX-03Yushan Jiangxi 0.9910.009Group 1Hongganshanyumi 00140028JiangX-04Yushan Jiangxi 0.9980.002Group 1Zaoshuyumi 00140032JiangX-05Qianshan Jiangxi 0.9970.003Group 1Y umi 00140034JiangX-06Wannian Jiangxi 0.9970.003Group 1Y umi 00140038JiangX-07De’an Jiangxi 0.9940.006Group 1Y umi00140045JiangX-08Wuning Jiangxi 0.9740.026Group 1Chihongyumi 00140049JiangX-09Wanzai Jiangxi 
0.9920.008Group 1Y umi 00140052JiangX-10Wanzai Jiangxi 0.9930.007Group 1Huayumi 00140060JiangX-11Jing’an Jiangxi 0.9970.003Group 1Baiyumi 00140065JiangX-12Pingxiang Jiangxi 0.9940.006Group 1Huangyumi00140066JiangX-13Pingxiang Jiangxi 0.9680.032Group 1Nuobaosuhuang 00140068JiangX-14Ruijin Jiangxi 0.9950.005Group 1Huangyumi 00140072JiangX-15Xinfeng Jiangxi 0.9960.004Group 1Wuningyumi 00140002JiangX-16Jiujiang Jiangxi 0.0590.941Group 2Tianyumi 00140005JiangX-17Shangrao Jiangxi 0.0020.998Group 2Y umi 00140006JiangX-18Shangrao Jiangxi 0.0310.969Group 2Baiyiumi 00140012JiangX-19Maoyuan Jiangxi 0.0060.994Group 260riyumi 00140016JiangX-20Maoyuan Jiangxi 0.0020.998Group 2Shanyumi 00140019JiangX-21Dexing Jiangxi 0.0050.995Group 2Laorenya 00090002ShangH-01Chongming Shanghai 0.0050.995Group 2Jinmeihuang 00090004ShangH-02Chongming Shanghai 0.0020.998Group 2Zaobaiyumi 00090006ShangH-03Chongming Shanghai 0.0020.998Group 2Chengtuohuang 00090007ShangH-04Chongming Shanghai 0.0780.922Group 2Benyumi (Huang)00090008ShangH-05Shangshi Shanghai 0.0020.998Group 2Bendiyumi 00090010ShangH-06Shangshi Shanghai 0.0040.996Group 2Baigengyumi 00090011ShangH-07Jiading Shanghai 0.0020.998Group 2Huangnuoyumi 00090012ShangH-08Jiading Shanghai 0.0040.996Group 2Huangdubaiyumi 00090013ShangH-09Jiading Shanghai 0.0440.956Group 2Bainuoyumi 00090014ShangH-10Chuansha Shanghai 0.0010.999Group 2Laorenya 00090015ShangH-11Shangshi Shanghai 0.0100.990Group 2Xiaojinhuang 00090016ShangH-12Shangshi Shanghai 0.0050.995Group 2Gengbaidayumi 00090017ShangH-13Shangshi Shanghai 0.0020.998Group 2Nongmeiyihao 00090018ShangH-14Shangshi Shanghai 0.0540.946Group 2Chuanshazinuo 00090020ShangH-15Chuansha Shanghai 0.0550.945Group 2Baoanshanyumi 00110004ZheJ-01Jiangshan Zhejiang 0.0130.987Group 2Changtaixizi 00110005ZheJ-02Jiangshan Zhejiang 0.0020.998Group 2Shanyumibaizi 00110007ZheJ-03Jiangshan Zhejiang 0.0020.998Group 2Kaihuajinyinbao 00110017ZheJ-04Kaihua Zhejiang 0.0100.990Group 2Table 1 (Continued from the preceding page)PGS 
revealed by Structure 1) NJ dendragram revealed Group1 Group2 by phylogenetic analysisAccession name Entry code Analyzing code Origin (county/city)Province/RegoinAnalysis of Genetic Diversity and Population Structure of Maize Landraces from the South Maize Region of China 1255Liputianzi00110038ZheJ-05Jinhua Zhejiang 0.0020.998Group 2Jinhuaqiuyumi 00110040ZheJ-06Jinhua Zhejiang 0.0050.995Group 2Pujiang80ri 00110069ZheJ-07Pujiang Zhejiang 0.0210.979Group 2Dalihuang 00110076ZheJ-08Yongkang Zhejiang 0.0140.986Group 2Ziyumi00110077ZheJ-09Yongkang Zhejiang 0.0020.998Group 2Baiyanhandipinzhong 00110078ZheJ-10Yongkang Zhejiang 0.0030.997Group 2Duosuiyumi00110081ZheJ-11Wuyi Zhejiang 0.0020.998Group 2Chun’an80huang 00110084ZheJ-12Chun’an Zhejiang 0.0020.998Group 2120ribaiyumi 00110090ZheJ-13Chun’an Zhejiang 0.0020.998Group 2Lin’anliugu 00110111ZheJ-14Lin’an Zhejiang 0.0030.997Group 2Qianhuangyumi00110114ZheJ-15Lin’an Zhejiang 0.0030.997Group 2Fenshuishuitianyumi 00110118ZheJ-16Tonglu Zhejiang 0.0410.959Group 2Kuihualiugu 00110119ZheJ-17Tonglu Zhejiang 0.0030.997Group 2Danbaihuang 00110122ZheJ-18Tonglu Zhejiang 0.0020.998Group 2Hongxinma 00110124ZheJ-19Jiande Zhejiang 0.0030.997Group 2Shanyumi 00110136ZheJ-20Suichang Zhejiang 0.0030.997Group 2Bai60ri 00110143ZheJ-21Lishui Zhejiang 0.0050.995Group 2Zeibutou 00110195ZheJ-22Xianju Zhejiang 0.0020.998Group 2Kelilao00110197ZheJ-23Pan’an Zhejiang 0.0600.940Group 21)The figures refered to the proportion of membership that each landrace possessed.Table 1 (Continued from the preceding page)PGS revealed by Structure 1) NJ dendragram revealed Group 1 Group 2 by phylogenetic analysisAccession name Entry code Analyzing code Origin (county/city)Province/Regoin Table 2 Construction of two phylogenetic groups (SSR-clustered groups) and their correlation with geographical locationsGeographical location SSR-clustered groupChi-square testGroup 1Group 2Total Guangdong 2222 χ2 = 124.89Hainan 1818P < 0.0001Jiangxi 15621Anhui 1414Fujian 1717Jiangsu 
13 13
Shanghai 15 15
Zhejiang 23 23
Total 55 88 143

by the software Excel MicroSatellite Toolkit (Park 2001). The average number of alleles per locus was calculated as

$$A = \frac{1}{r}\sum_{j=1}^{r} A_j$$

with the standard deviation

$$\sigma(A) = \sqrt{\frac{\sum_{j=1}^{r}\left(A_j - A\right)^2}{r - 1}}$$

where $A_j$ was the number of distinct alleles at locus $j$, and $r$ was the number of loci (Park 2001).

Unbiased gene diversity (also known as expected heterozygosity), observed heterozygosity for each locus, and average gene diversity across the 54 SSR loci, as well as the model-based groupings inferred by Structure ver. 2.2, were calculated with the software PowerMarker ver. 3.25 (Liu et al. 2005). Unbiased gene diversity for each locus was calculated by

$$\hat{h} = \frac{2n}{2n - 1}\left(1 - \sum_{i}\hat{x}_i^{2}\right), \qquad \hat{x}_i = \frac{2\hat{X}_{ii} + \sum_{j \neq i}\hat{X}_{ij}}{2}$$

where $\hat{X}_{ij}$ was the frequency of genotype $A_iA_j$ in the sample, and $n$ was the number of individuals sampled.

The average gene diversity across the 54 loci and its variance were calculated as described by Nei (1987):

$$\hat{H} = \frac{1}{r}\sum_{j=1}^{r}\hat{h}_j$$

The average observed heterozygosity across all loci was calculated as described by Hedrick (1983):

$$h_{obs} = \frac{1}{rn}\sum_{j=1}^{r} h_{obs,j}$$

with the standard deviation

$$\sigma_{obs} = \sqrt{\frac{h_{obs}\left(1 - h_{obs}\right)}{rn}}$$

Phylogenetic analysis and population genetic structure

Relationships among all of the 143 accessions collected from SR were evaluated by using the unweighted pair group method with neighbor-joining (NJ) based on the log transformation of the proportion of shared alleles distance (InSPAD) via PowerMarker ver. 3.25 (Fukunaga

Table 3 The PIC of each locus and the number of alleles detected by 54 SSRs
Locus Bin Repeat motif PIC No.
of alleles Description 2)
bnlg1007y5 1)    1.02   AG      0.78  15  Probe site
umc1122          1.06   GGT     0.63   9  Probe site
umc1147y4 1)     1.07   CA      0.26  15  Probe site
phi96100 1)      2.00   ACCT    0.29   8  Probe site
umc1185          2.03   GC      0.72  15  ole1 (oleosin 1)
phi127           2.08   AGAC    0.57   7  Probe site
umc1736y2 1)     2.09   GCAT    0.67   7  Probe site
phi453121        3.01   ACC     0.71  11  Probe site
phi374118        3.03   ACC     0.47   7  Probe site
phi053k2 1)      3.05   ATAC    0.79  10  Probe site
nc004            4.03   AG      0.48  12  adh2 (alcohol dehydrogenase 2)
bnlg490y4 1)     4.04   TA      0.52  17  Probe site
phi079           4.05   AGATG   0.49   5  gpc1 (glyceraldehyde-3-phosphate dehydrogenase 1)
bnlg1784         4.07   AG      0.62  10  Probe site
umc1574          4.09   GCC     0.71   9  sbp2 (SBP-domain protein 2)
umc1940y5 1)     4.09   GCA     0.47  13  Probe site
umc1050          4.11   AAT     0.78  10  cat3 (catalase 3)
nc130            5.00   AGC     0.56  10  Probe site
umc2112y3 1)     5.02   GA      0.70  14  Probe site
phi109188        5.03   AAAG    0.71   9  Probe site
umc1860          5.04   AT      0.32   5  Probe site
phi085           5.07   AACGC   0.53   7  gln4 (glutamine synthetase 4)
phi331888        5.07   AAG     0.58  11  Probe site
umc1153          5.09   TCA     0.73  10  Probe site
phi075           6.00   CT      0.75   8  fdx1 (ferredoxin 1)
bnlg249k2 1)     6.01   AG      0.73  14  Probe site
phi389203        6.03   AGC     0.41   6  Probe site
phi299852y2 1)   6.07   AGC     0.71  12  Probe site
umc1545y2 1)     7.00   AAGA    0.76  10  hsp3 (heat shock protein 3)
phi112           7.01   AG      0.53  10  o2 (opaque endosperm 2)
phi420701        8.00   CCG     0.46   9  Probe site
umc1359          8.00   TC      0.78  14  Probe site
umc1139          8.01   GAC     0.47   9  Probe site
umc1304          8.02   TCGA    0.33   5  Probe site
phi115           8.03   ATAC    0.46   5  act1 (actin1)
umc2212          8.05   ACG     0.45   5  Probe site
umc1121          8.05   AGAT    0.48   4  Probe site
phi080           8.08   AGGAG   0.64   6  gst1 (glutathione-S-transferase 1)
phi233376y1 1)   8.09   CCG     0.59   8  Probe site
bnlg1272         9.00   AG      0.89  22  Probe site
umc2084          9.01   CTAG    0.49   8  Probe site
bnlg1520k1 1)    9.01   AG      0.59  13  Probe site
phi065           9.03   CACCT   0.51   9  pep1 (phosphoenolpyruvate carboxylase 1)
umc1492y13 1)    9.04   GCT     0.25  14  Probe site
umc1231k4 1)     9.05   GA      0.22  10  Probe site
phi108411        9.06   AGCT    0.49   5  Probe site
phi448880        9.06   AAG     0.76  10  Probe site
umc1675          9.07   CGCC    0.67   7  Probe site
phi041y6 1)     10.00   AGCC    0.41   7  Probe site
umc1432y6 1)    10.02   AG      0.75  12  Probe site
umc1367         10.03   CGA     0.64  10  Probe site
umc2016         10.03   ACAT    0.51   7  pao1 (polyamine oxidase 1)
phi062          10.04   ACG     0.33   7  mgs1 (male-gametophyte specific 1)
phi071          10.04   GGA     0.51   5  hsp90 (heat shock protein, 90
kDa)
1) These primers were provided by Beijing Academy of Agricultural and Forestry Sciences (Beijing, China).
2) Searched from

et al. 2005). The unrooted phylogenetic tree was finally drawn with the software MEGA (molecular evolutionary genetics analysis) ver. 3.1 (Kumar et al. 2004). Additionally, a chi-square test was used to reveal the correlation between the geographical origins and the SSR-clustered groups through the FREQ procedure implemented in SAS ver. 9.0 (2002, SAS Institute, Inc.).

To reveal the population genetic structure (PGS) of the 143 landrace accessions, a Bayesian approach was first applied to determine the number of groups (K) to which these materials should be assigned, using the software BAPS (Bayesian Analysis of Population Structure) ver. 5.1. In BAPS, a fixed-K clustering procedure was applied; for each K, the number of runs was set to 100, and the value of log (mL) was averaged to determine the appropriate K value (Corander et al. 2003; Corander and Tang 2007). Once the number of groups was determined, a model-based clustering analysis was used to assign all of the accessions to the corresponding groups under an admixture model with correlated allele frequencies in the software Structure ver. 2.2 (Pritchard et al. 2000; Falush et al. 2007). For the K value determined by BAPS, three independent runs were carried out, with both the burn-in period and the number of replications set to 100000. The threshold probability for assigning individuals to groups was set to 0.8 (Liu et al. 2003). The PGS result from Structure was visualized with the Distruct program ver.
1.1 (Rosenberg 2004).

RESULTS

Genetic diversity

A total of 517 alleles were detected across the 143 maize landraces by the whole set of 54 SSRs covering the entire maize genome, with an average of 9.57 alleles per locus and a range from 4 (umc1121) to 22 (bnlg1272) (Table 3). Among all the alleles detected, 132 (25.53%) were distinct alleles, an average of 2.44 per locus. The number of distinct alleles differed significantly among the landraces from different provinces/regions. The landraces from Guangdong, Fujian, Zhejiang, and Shanghai possessed more distinct alleles than those from the other provinces/regions, while those from southern Anhui possessed the fewest, accounting for only 3.28% of the total (Table 4).

Table 4 The genetic diversity within eight provinces/regions and groups revealed by 54 SSRs
Province/Region   Sample size   Allele no. 1)   Distinct allele no.   Gene diversity (expected heterozygosity)   Observed heterozygosity
Anhui             14            4.28 (4.19)     69 (72.4)             0.51 (0.54)                                0.58 (0.58)
Fujian            17            4.93 (4.58)     80 (79.3)             0.56 (0.60)                                0.63 (0.62)
Guangdong         22            5.48 (4.67)     88 (80.4)             0.57 (0.59)                                0.59 (0.58)
Hainan            18            4.65 (4.26)     79 (75.9)             0.53 (0.57)                                0.55 (0.59)
Jiangsu           13            4.24            70                    0.50                                       0.55
Jiangxi           21            4.96 (4.35)     72 (68.7)             0.56 (0.60)                                0.68 (0.68)
Shanghai          15            5.07 (4.89)     90 (91.4)             0.55 (0.60)                                0.55 (0.55)
Zhejiang          23            5.04 (4.24)     85 (74)               0.53 (0.55)                                0.60 (0.61)
Total/average     143           9.57            132                   0.61                                       0.60
Group
Group 1           55            6.63 (6.40)     91 (89.5)             0.57 (0.58)                                0.62 (0.62)
Group 2           88            7.94 (6.72)     110 (104.3)           0.57 (0.57)                                0.59 (0.58)
Total/Average     143           9.57            132                   0.61                                       0.60
Provinces/Regions within a group
Group 1
Total             55            6.69 (6.40)     91                    0.57 (0.58)                                0.62 (0.62)
Guangdong         22            5.48 (4.99)     86 (90.1)             0.57 (0.60)                                0.59 (0.58)
Hainan            18            4.65 (4.38)     79 (73.9)             0.53 (0.56)                                0.55 (0.59)
Jiangxi           15            4.30            68                    0.54                                       0.69
Group 2
Total             88            7.97 (6.72)     110 (104.3)           0.57 (0.57)                                0.59 (0.58)
Anhui             14            4.28 (3.22)     69 (63.2)             0.51 (0.54)                                0.58 (0.57)
Fujian            17            4.93 (3.58)     78 (76.6)             0.56 (0.60)                                0.63 (0.61)
Jiangsu           13            4.24 (3.22)     71 (64.3)             0.50 (0.54)                                0.55 (0.54)
Jiangxi           6             3.07            52                    0.46                                       0.65
Shanghai          15            5.07 (3.20)     91 (84.1)             0.55 (0.60)                                0.55 (0.54)
Zhejiang          23            5.04 (3.20)     83 (61.7)             0.53 (0.54)                                0.60 (0.58)

Among the 54 loci used in the study, 16 (or 29.63%) were dinucleotide repeat SSRs, defined as class I-I. The other 38 loci, which were SSRs with longer repeat motifs together with two loci with unknown repeat motifs, were defined as class I-II. In addition, 15 loci were located within certain functional genes (defined as class II-I) and the rest were defined as class II-II. The comparison indicated that the average numbers of alleles per locus captured by class I-I and II-II were 12.88 and 10.05, respectively, significantly higher than those captured by class I-II and II-I (8.18 and 8.38, respectively). The gene diversity revealed by class I-I (0.63) and II-I (0.63) was somewhat higher than that revealed by class I-II (0.60) and II-II (0.60) (Table 5).

Genetic relationships of the core landraces

Overall, the 143 landraces were clustered into two groups by the neighbor-joining (NJ) method based on InSPAD. All the landraces from Guangdong and Hainan and 15 of the 21 from Jiangxi were clustered together to form group 1, and the other 88 landraces from the other provinces/regions formed group 2 (Fig.-B). The geographical origins of all 143 landraces, together with the clustering results, are shown in Fig.-D.
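The province-by-group counts summarized in Table 2 lend themselves to a Pearson chi-square test of independence. The following is a minimal Python sketch of that calculation, not the SAS FREQ procedure used in the study; the function name `pearson_chi_square` and the variable `TABLE_2` are hypothetical, and the counts are copied from Table 2.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    `table` is a list of rows; each row lists the observed counts per column.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Rows: province/region; columns: (group 1, group 2). Counts from Table 2.
TABLE_2 = [
    (22, 0),   # Guangdong
    (18, 0),   # Hainan
    (15, 6),   # Jiangxi
    (0, 14),   # Anhui
    (0, 17),   # Fujian
    (0, 13),   # Jiangsu
    (0, 15),   # Shanghai
    (0, 23),   # Zhejiang
]

chi2, dof = pearson_chi_square(TABLE_2)
print(f"chi2 = {chi2:.2f}, dof = {dof}")  # chi2 = 124.89, dof = 7
```

The sketch reports only the statistic and its degrees of freedom; converting it to a P value would need a chi-square distribution function from a statistics library, which is left out here to keep the example dependency-free.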
The chi-square test showed that the SSR-clustered groups of the 143 landraces were significantly associated with their geographical origins (χ2 = 124.89, P < 0.0001, Table 2).

In the phylogenetic analysis based on InSPAD, the minimum distance (0.1671) was observed between two landraces collected from Jiangxi Province, Tianhongyumi (JiangX-03) and Hongganshanyumi (JiangX-04). The maximum distance (1.3863) was between Huangbaosu (FuJ-16) from Fujian and Hongyumi (HaiN-14) from Hainan (data not shown). Two landraces with the same name, Shanyumi (JiangX-01 and JiangX-21), both collected from Dexing County (Table 1), were assigned to different groups: JiangX-01 to group 1 and JiangX-21 to group 2 (Table 1). The distance between JiangX-01 and JiangX-21 was also large, at 0.9808 (data not shown). These results indicate that JiangX-01 and JiangX-21 possibly had different ancestral origins.

Population structure

A Bayesian method was used to detect the number of groups (K value) in the whole set of landraces from SR, with a fixed-K clustering procedure implemented in BAPS ver. 5.1. The result showed that all of the 143 landraces could also be assigned to two groups (Fig.-A). A model-based clustering method was then applied to infer the PGS of all the landraces via Structure ver. 2.2 with K = 2. This method assigns individuals to groups based on membership probability, and a threshold probability of 0.80 was set for the assignment of individuals (Liu et al. 2003). Accordingly, all of the 143 landraces were divided into two distinct model-based groups (Fig.-C).
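The threshold rule described above can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than part of Structure: the function name `assign_group` is hypothetical, and the "Mixed" label for accessions below the threshold is an assumed convention that the text does not specify. The membership proportions in the usage lines are taken from Table 1.

```python
def assign_group(memberships, threshold=0.80):
    """Assign an accession to the group with the largest membership proportion.

    `memberships` holds one proportion per group (group 1 first).
    Accessions whose largest proportion is below `threshold` are
    labelled "Mixed" (an assumed label, not taken from the paper).
    """
    best = max(range(len(memberships)), key=lambda k: memberships[k])
    if memberships[best] >= threshold:
        return f"Group {best + 1}"
    return "Mixed"


# Membership proportions (group 1, group 2) as listed in Table 1.
print(assign_group((0.940, 0.060)))  # GuangD-20 -> Group 1
print(assign_group((0.078, 0.922)))  # ShangH-04 -> Group 2
```

In Table 1 every accession has a membership proportion of at least 0.92 in one of the two groups, so the 0.80 threshold leaves no accession unassigned in this data set.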
The landraces from Guangdong and Hainan and 15 landraces from Jiangxi formed one group, while the remaining 6 landraces from the marginal counties of northern Jiangxi and those from the other provinces formed another group (Table 1, Fig.-D). The PGS revealed by the model-based approach via Structure was fully consistent with the relationships obtained from the phylogenetic analysis via PowerMarker (Table 1).

DISCUSSION

The SR includes eight provinces, i.e., southern Jiangsu and Anhui, Shanghai, Zhejiang, Fujian, Jiangxi, Guangdong, and Hainan (Fig.-C), with an annual maize growing area of about 1 million ha (less than 5% of the

Table 5 The genetic diversity detected with different types of SSR markers
Type of locus   No. of alleles   Gene diversity   Expected heterozygosity   PIC
Class I-I       12.88            0.63             0.65                      0.60
Class I-II      8.18             0.60             0.58                      0.55
Class II-I      8.33             0.63             0.63                      0.58
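Gene diversity and PIC, as reported in Tables 3 to 5, are both functions of the allele frequencies at a locus. The sketch below shows one way to compute them in Python. It assumes the standard Botstein et al. (1980) definition of PIC, which the text does not spell out, and the 2n/(2n - 1) small-sample correction used for unbiased gene diversity in the Methods; the function names and the example frequencies are hypothetical.

```python
from itertools import combinations


def gene_diversity(freqs, n=None):
    """Gene diversity (expected heterozygosity) at one locus.

    `freqs` are allele frequencies summing to 1. If the number of
    sampled individuals `n` is given, the 2n/(2n - 1) correction for
    small samples is applied.
    """
    h = 1.0 - sum(p * p for p in freqs)
    if n is not None:
        h *= 2 * n / (2 * n - 1)
    return h


def pic(freqs):
    """Polymorphism information content, assuming the Botstein et al. (1980) formula."""
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross_term


# Hypothetical locus with two equally frequent alleles.
example = [0.5, 0.5]
print(gene_diversity(example))        # 0.5
print(pic(example))                   # 0.375
print(gene_diversity(example, n=10))  # 0.5 * 20 / 19
```

Because PIC subtracts an extra cross term, it is never larger than the uncorrected gene diversity at the same locus, which matches the pattern of the class means in Table 5.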
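Materials Genome Database Construction and Data-Driven Innovation in New Materials

In recent years, the materials genome has become a hot topic in materials science. The term "materials genome" was inspired to a large extent by the success of the Human Genome Project. Traditionally, the discovery and development of new materials and new processes have relied on scientific intuition and a lengthy trial-and-error process. For many years, materials scientists have hoped to find basic building blocks of materials analogous to biological genes, whose arrangement and defect structure might determine a material's properties or functions. By understanding these building blocks, they hope to design materials on demand, accelerating the discovery and development of materials and reducing costs.

Since the United States launched the Materials Genome Initiative in 2011 [1,2], other major economies such as the European Union [3,4], Japan [5] and China have set up similar science programs at the national level. However, there has been no agreement on what a "materials genome" actually is. The consensus reached recently is that the term serves only as shorthand for a design-and-prediction mode of materials research and development [6]. Materials genome engineering (MGE) means integrating high-throughput computation, high-throughput experiments and materials informatics to establish the relationships among composition, microstructure, processing and properties faster, more efficiently and at lower cost. These relationships are precisely the foundation of materials design.

The working modes of materials genome engineering can be roughly divided into three types: experiment-driven, computation-driven and data-driven [7]. The experiment-driven mode is based on high-throughput synthesis and characterization experiments and optimizes and screens materials directly and rapidly. A typical example of this mode is the high-throughput combinatorial materials chip technique [8]. The computation-driven mode uses computational simulation to predict promising candidate materials, which are then verified experimentally [9], greatly narrowing the scope of experiments. The data-driven mode is based on large amounts of data and builds models with materials informatics methods. That is, it uses artificial intelligence (AI) methods such as machine learning to resolve the complex relationships among many parameters and to predict candidate materials [10].

Viewed in terms of how humans have come to understand nature, scientific exploration over thousands of years has passed through the stages of experimental observation, theoretical deduction, and computational simulation. Today, drawing on unprecedented computing power and large-scale data collection, modern science is entering the "fourth paradigm" [11], namely data-intensive science combined with artificial intelligence. The data-driven mode of materials genome engineering is an embodiment of this fourth paradigm. It should be recognized that the experiment-driven and computation-driven modes are in essence judgments based on facts or deductions based on physical laws, and they have not fundamentally changed the established ways of thinking and working in materials science.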
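Development of Genic-SSR Markers and Population Genetics of the Rare and Endangered Plant Tetraena mongolica: Sample Essay

Part One

I. Introduction

Tetraena mongolica is a rare and endangered plant, and its conservation and genetic study are important for maintaining biodiversity and the balance of ecosystems. With the continuing development of molecular biology techniques, Genic-SSR (simple sequence repeat) markers have become an effective molecular marker technique that is widely used in plant genetics and population genetics. This paper aims to develop Genic-SSR markers for Tetraena mongolica and to study its population genetics in depth, in order to provide a theoretical basis for the conservation and utilization of the species.

II. Materials and Methods

1. Experimental materials
Different geographical populations of Tetraena mongolica were selected as experimental materials, and fresh leaves were collected for the extraction of genomic DNA.

2. Development of Genic-SSR markers
Bioinformatic methods were used to analyze sequences from the Tetraena mongolica genome and to design and screen Genic-SSR primers with high polymorphism and good reproducibility.

3. Population genetics
PCR was used to amplify the Genic-SSR markers in each geographical population of Tetraena mongolica, and statistical analysis of the data was used to reveal the population genetic structure, genetic diversity and genetic variation of the species.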
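The bioinformatic step in the Methods (scanning sequences for simple sequence repeats before primers are designed) can be sketched in Python with a regular expression. This is an illustrative sketch only, not the pipeline used in the essay: the function name `find_ssrs`, the motif length range of 2 to 6 nt and the minimum of 4 tandem repeats are all assumptions chosen for the example.

```python
import re

# A motif of 2-6 nucleotides followed by at least 3 further copies of itself,
# i.e. at least 4 tandem repeats in total (assumed thresholds).
SSR_PATTERN = re.compile(r"(([ACGT]{2,6}?)\2{3,})")


def find_ssrs(sequence):
    """Return (start, motif, repeat_count) for each SSR found in `sequence`.

    Positions are 0-based. Homopolymer runs such as AAAAAAAA are skipped.
    """
    hits = []
    for match in SSR_PATTERN.finditer(sequence.upper()):
        motif = match.group(2)
        if len(set(motif)) == 1:
            continue  # skip runs of a single nucleotide
        repeat_count = len(match.group(1)) // len(motif)
        hits.append((match.start(), motif, repeat_count))
    return hits


# Hypothetical sequence: an (AG)6 repeat flanked by short unique sequence.
example = "TTGAC" + "AG" * 6 + "TTC"
print(find_ssrs(example))  # [(5, 'AG', 6)]
```

In practice the flanking sequence around each hit would then be passed to a primer design tool, and candidate primers would be screened for polymorphism as described above.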
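III. Results and Analysis

1. Development of Genic-SSR markers
Through bioinformatic analysis, more than XX pairs of Genic-SSR primers with high polymorphism and good reproducibility were successfully designed and screened. These primers showed high polymorphism in the Tetraena mongolica genome and are suitable for the subsequent population genetic study.

2. Population genetics
(1) Genetic structure: Amplification of the Genic-SSR markers revealed the genetic structure of the different geographical populations of Tetraena mongolica. There were genetic differences among the geographical populations, indicating that the species has a relatively complex population genetic structure.
(2) Genetic diversity: The genetic diversity of Tetraena mongolica was relatively high, as shown by the presence of multiple alleles. Genetic diversity differed among the geographical populations, which may be related to factors such as geographical location and ecological environment.
(3) Genetic variation: The Genic-SSR marker data showed a certain degree of genetic variation within populations of Tetraena mongolica. This variation may be influenced by natural selection, gene flow, mutation and other factors.

3. Data analysis and discussion
Statistical analysis of the Genic-SSR marker data showed that the population genetic structure, genetic diversity and genetic variation of Tetraena mongolica are closely related to factors such as geographical location and ecological environment.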
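Research Progress on the Influence of Myelin-Related Genes on Schizophrenia and Related Behaviors

Review

Wang Jiayin (王家银)*, Gao Shuzhan (高舒展)*, Wei Qinling (魏钦令)△, Shi Yun (石云)※, Xu Xijia (徐西嘉)

doi: 10.3969/j.issn.1002-0152.2020.10.001
Funding: National Natural Science Foundation of China, General Program (No. 81771444); National Natural Science Foundation of China, Major Research Plan Cultivation Project (No. 91849112); Jiangsu Province Key Research and Development Program, Clinical Frontier Technology Project (No. BE2019707); Jiangsu Province Six Talent Peaks Project (No. WSN-166)
* Department of Psychiatry, Affiliated Brain Hospital of Nanjing Medical University (Nanjing 210029)
△ Department of Psychiatry, Third Affiliated Hospital of Sun Yat-sen University
※ Model Animal Research Institute, Medical School of Nanjing University
Corresponding author (E-mail: *****************)

Key words: schizophrenia, myelin, gene polymorphism, epigenetics, oligodendrocyte

Schizophrenia is a polygenic hereditary brain disease whose main clinical features are positive symptoms, negative symptoms and cognitive impairment [1]. Its onset is usually in late adolescence or early adulthood, which overlaps with the period of oligodendrocyte (OL) and myelin development [2]. Myelin is a multilayered structure formed by OLs wrapping around neuronal axons. Studies have shown that patients with schizophrenia have oligodendrocyte dysfunction, myelin damage and white matter abnormalities [3]. In addition, abnormal function of myelin-related genes is involved in the onset and development of schizophrenia, and mutations in these genes increase the genetic risk of the disease. This review summarizes research progress on the influence of myelin-related genes on schizophrenia and related behaviors.

1 Myelin-related genes that affect the occurrence of schizophrenia

According to their specific functions, myelin-related genes can be roughly divided into four classes:
(1) genes encoding myelin proteins, mainly myelin-associated glycoprotein (MAG), myelin basic protein (MBP), myelin proteolipid protein (MPLP), myelin oligodendrocyte glycoprotein (MOG) and 2',3'-cyclic nucleotide 3'-phosphodiesterase (CNP);
(2) genes affecting myelin lipid synthesis, such as sterol regulatory element binding factor (SREBF);
(3) genes encoding OL-related transcription factors, mainly oligodendrocyte transcription factor 1/2 (OLIG1/2) and transcription factor 4 (TCF4);
(4) other myelin-related genes, including disrupted in schizophrenia 1 (DISC1), fasciculation and elongation protein zeta-1 (FEZ1) and Quaking (QKI) [7].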
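Genomics and Proteomics
… moreover, the same protein may undergo post-translational modification in many different forms. A proteome is therefore not the direct product of a genome, and the number of proteins in a proteome can sometimes exceed the number of genes in the genome.
2021/6/16
Proteomics refers to the techniques used to study the proteome and to the results those studies produce.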
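By object of study, proteomics can be divided into:
Profiling proteomics: all the proteins expressed by an organism; the number is too large to be practical.
Functional study of gene products (proteins), using recombinant DNA technology and proteome research.
Study of protein-protein interactions, using the yeast two-hybrid system as well as one-hybrid, three-hybrid, and reverse hybrid systems.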
Proteomics research covers:
Protein identification. Post-translational modification. Determination of protein function.
The core of proteome research: two-dimensional electrophoresis for separation:
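Genome research covers two areas:
Structural genomics aims at whole-genome sequencing. It represents the early stage of genome analysis and focuses on building high-resolution genetic, physical, and transcript maps of an organism.
Functional genomics aims at identifying gene function. It is also called postgenome research and represents the new stage of gene analysis.
Differential proteomics (differentiation proteomics): compares proteins that are differentially expressed between physiological and pathological states, or with and without drug treatment, in order to study disease, identify drug targets, and screen drugs. It is currently the main research approach and the most promising area of application for the proteome.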
Development and Applications of CRISPR-Cas9 for Genome Engineering
Leading Edge
Review
Patrick D. Hsu,1,2,3 Eric S. Lander,1 and Feng Zhang1,2,*
1Broad Institute of MIT and Harvard, 7 Cambridge Center, Cambridge, MA 02141, USA
2McGovern Institute for Brain Research, Department of Brain and Cognitive Sciences, Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
3Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA
*Correspondence: zhang@
doi:10.1016/j.cell.2014.05.010
Recent advances in genome engineering technologies based on the CRISPR-associated RNA-guided endonuclease Cas9 are enabling the systematic interrogation of mammalian genome function. Analogous to the search function in modern word processors, Cas9 can be guided to specific locations within complex genomes by a short RNA search string. Using this system, DNA sequences within the endogenous genome and their functional outputs are now easily edited or modulated in virtually any organism of choice. Cas9-mediated genetic perturbation is simple and scalable, empowering researchers to elucidate the functional organization of the genome at the systems level and establish causal linkages between genetic variations and biological phenotypes.
In this Review, we describe the development and applications of Cas9 for a variety of research or translational applications while highlighting challenges as well as future directions. Derived from a remarkable microbial defense system, Cas9 is driving innovative applications from basic biology to biotechnology and medicine.

Introduction

The development of recombinant DNA technology in the 1970s marked the beginning of a new era for biology. For the first time, molecular biologists gained the ability to manipulate DNA molecules, making it possible to study genes and harness them to develop novel medicine and biotechnology. Recent advances in genome engineering technologies are sparking a new revolution in biological research. Rather than studying DNA taken out of the context of the genome, researchers can now directly edit or modulate the function of DNA sequences in their endogenous context in virtually any organism of choice, enabling them to elucidate the functional organization of the genome at the systems level, as well as identify causal genetic variations.

Broadly speaking, genome engineering refers to the process of making targeted modifications to the genome, its contexts (e.g., epigenetic marks), or its outputs (e.g., transcripts). The ability to do so easily and efficiently in eukaryotic and especially mammalian cells holds immense promise to transform basic science, biotechnology, and medicine (Figure 1).

For life sciences research, technologies that can delete, insert, and modify the DNA sequences of cells or organisms enable dissecting the function of specific genes and regulatory elements.
Multiplexed editing could further allow the interrogation of gene or protein networks at a larger scale. Similarly, manipulating transcriptional regulation or chromatin states at particular loci can reveal how genetic material is organized and utilized within a cell, illuminating relationships between the architecture of the genome and its functions. In biotechnology, precise manipulation of genetic building blocks and regulatory machinery also facilitates the reverse engineering or reconstruction of useful biological systems, for example, by enhancing biofuel production pathways in industrially relevant organisms or by creating infection-resistant crops. Additionally, genome engineering is stimulating a new generation of drug development processes and medical therapeutics. Perturbation of multiple genes simultaneously could model the additive effects that underlie complex polygenic disorders, leading to new drug targets, while genome editing could directly correct harmful mutations in the context of human gene therapy (Tebas et al., 2014).
Eukaryotic genomes contain billions of DNA bases and are difficult to manipulate. One of the breakthroughs in genome manipulation has been the development of gene targeting by homologous recombination (HR), which integrates exogenous repair templates that contain sequence homology to the donor site (Figure 2A) (Capecchi, 1989). HR-mediated targeting has facilitated the generation of knockin and knockout animal models via manipulation of germline competent stem cells, dramatically advancing many areas of biological research. However, although HR-mediated gene targeting produces highly precise alterations, the desired recombination events occur extremely infrequently (1 in 10^6–10^9 cells) (Capecchi, 1989), presenting enormous challenges for large-scale applications of gene-targeting experiments.

To overcome these challenges, a series of programmable nuclease-based genome editing technologies have been developed in recent years, enabling targeted and efficient modification of a variety of eukaryotic and particularly mammalian species. Of the current generation of genome editing technologies, the most rapidly developing is the class of RNA-guided endonucleases known as Cas9 from the microbial adaptive immune system CRISPR (clustered regularly interspaced short palindromic repeats), which can be easily targeted to virtually any genomic location of choice by a short RNA guide. Here, we review the development and applications of the CRISPR-associated endonuclease Cas9 as a platform technology for achieving targeted perturbation of endogenous genomic elements and also discuss challenges and future avenues for innovation.

1262 Cell 157, June 5, 2014 ©2014 Elsevier Inc.

Programmable Nucleases as Tools for Efficient and Precise Genome Editing

A series of studies by Haber and Jasin (Rudin et al., 1989; Plessis et al., 1992; Rouet et al., 1994; Choulika et al., 1995; Bibikova et al., 2001; Bibikova et al., 2003) led to the realization that targeted DNA double-strand breaks (DSBs) could greatly stimulate genome editing
through HR-mediated recombination events. Subsequently, Carroll and Chandrasegaran demonstrated the potential of designer nucleases based on zinc finger proteins for efficient, locus-specific HR (Bibikova et al., 2001, 2003). Moreover, it was shown in the absence of an exogenous homology repair template that localized DSBs can induce insertions or deletion mutations (indels) via the error-prone nonhomologous end-joining (NHEJ) repair pathway (Figure 2A) (Bibikova et al., 2002). These early genome editing studies established DSB-induced HR and NHEJ as powerful pathways for the versatile and precise modification of eukaryotic genomes.

To achieve effective genome editing via introduction of site-specific DNA DSBs, four major classes of customizable DNA-binding proteins have been engineered so far: meganucleases derived from microbial mobile genetic elements (Smith et al., 2006), zinc finger (ZF) nucleases based on eukaryotic transcription factors (Urnov et al., 2005; Miller et al., 2007), transcription activator-like effectors (TALEs) from Xanthomonas bacteria (Christian et al., 2010; Miller et al., 2011; Boch et al., 2009; Moscou and Bogdanove, 2009), and most recently the RNA-guided DNA endonuclease Cas9 from the type II bacterial adaptive immune system CRISPR (Cong et al., 2013; Mali et al., 2013a).

Meganuclease, ZF, and TALE proteins all recognize specific DNA sequences through protein-DNA interactions. Although meganucleases integrate their nuclease and DNA-binding domains, ZF and TALE proteins consist of individual modules targeting 3 or 1 nucleotides (nt) of DNA, respectively (Figure 2B). ZFs and TALEs can be assembled in desired combinations and attached to the nuclease domain of FokI to direct nucleolytic activity toward specific genomic loci. Each of these platforms, however, has unique limitations.

Meganucleases have not been widely adopted as a genome engineering platform due to lack of clear correspondence between meganuclease protein residues and their target DNA sequence specificity. ZF domains, on
the other hand, exhibit context-dependent binding preference due to crosstalk between adjacent modules when assembled into a larger array (Maeder et al., 2008). Although multiple strategies have been developed to account for these limitations (Gonzalez et al., 2010; Sander et al., 2011), assembly of functional ZFPs with the desired DNA binding specificity remains a major challenge that requires an extensive screening process. Similarly, although TALE DNA-binding monomers are for the most part modular, they can still suffer from context-dependent specificity (Juillerat et al., 2014), and their repetitive sequences render construction of novel TALE arrays labor intensive and costly.

Given the challenges associated with engineering of modular DNA-binding proteins, new modes of recognition would significantly simplify the development of custom nucleases. The CRISPR nuclease Cas9 is targeted by a short guide RNA that recognizes the target DNA via Watson-Crick base pairing (Figure 2C). The guide sequence within these CRISPR RNAs typically corresponds to phage sequences, constituting the natural mechanism for CRISPR antiviral defense, but can be easily replaced by a sequence of interest to retarget the Cas9 nuclease. Multiplexed targeting by Cas9 can now be achieved at unprecedented scale by introducing a battery of short guide RNAs rather than a library of large, bulky proteins. The ease of Cas9 targeting, its high efficiency as a site-specific nuclease, and the possibility for highly multiplexed modifications have opened up a broad range of biological applications across basic research to biotechnology and medicine.

Figure 1. Applications of Genome Engineering
Genetic and epigenetic control of cells with genome engineering technologies is enabling a broad range of applications from basic biology to biotechnology and medicine. (Clockwise from top) Causal genetic mutations or epigenetic variants associated with altered biological function or disease phenotypes can now be rapidly and efficiently recapitulated in animal or cellular models (Animal models, Genetic variation). Manipulating biological circuits could also facilitate the generation of useful synthetic materials, such as algae-derived, silica-based diatoms for oral drug delivery (Materials). Additionally, precise genetic engineering of important agricultural crops could confer resistance to environmental deprivation or pathogenic infection, improving food security while avoiding the introduction of foreign DNA (Food). Sustainable and cost-effective biofuels are attractive sources for renewable energy, which could be achieved by creating efficient metabolic pathways for ethanol production in algae or corn (Fuel). Direct in vivo correction of genetic or epigenetic defects in somatic tissue would be permanent genetic solutions that address the root cause of genetically encoded disorders (Gene surgery). Finally, engineering cells to optimize high yield generation of drug precursors in bacterial factories could significantly reduce the cost and improve the accessibility of useful therapeutics (Drug development).

The utility of customizable DNA-binding domains extends far beyond genome editing with site-specific endonucleases. Fusing them to modular, sequence-agnostic functional effector domains allows flexible recruitment of desired perturbations, such as transcriptional activation, to a locus of interest (Xu and Bestor, 1997; Beerli et al., 2000a; Konermann et al., 2013; Maeder et al., 2013a; Mendenhall et al., 2013). In fact, any modular enzymatic component can, in principle, be substituted, allowing facile additions to the genome engineering toolbox. Integration of genome- and epigenome-modifying enzymes with inducible protein regulation further allows precise temporal control of dynamic processes (Beerli et al., 2000b; Konermann et al., 2013).

CRISPR-Cas9: From Yogurt to Genome Editing

The recent development of the Cas9 endonuclease for genome editing draws upon more than a decade of basic research into
understanding the biological function of the mysterious repetitive elements now known as CRISPR (Figure 3), which are found throughout the bacterial and archaeal diversity. CRISPR loci typically consist of a clustered set of CRISPR-associated (Cas) genes and the signature CRISPR array, a series of repeat sequences (direct repeats) interspaced by variable sequences (spacers) corresponding to sequences within foreign genetic elements (protospacers) (Figure 4). Whereas Cas genes are translated into proteins, most CRISPR arrays are first transcribed as a single RNA before subsequent processing into shorter CRISPR RNAs (crRNAs), which direct the nucleolytic activity of certain Cas enzymes to degrade target nucleic acids.

The CRISPR story began in 1987. While studying the iap enzyme involved in isozyme conversion of alkaline phosphatase in E. coli, Nakata and colleagues reported a curious set of 29 nt repeats downstream of the iap gene (Ishino et al., 1987). Unlike most repetitive elements, which typically take the form of tandem repeats like TALE repeat monomers, these 29 nt repeats were interspaced by five intervening 32 nt nonrepetitive sequences. Over the next 10 years, as more microbial genomes were sequenced, additional repeat elements were reported from genomes of different bacterial and archaeal strains. Mojica and colleagues eventually classified interspaced repeat sequences as a unique family of clustered repeat elements present in >40% of sequenced bacteria and 90% of archaea (Mojica et al., 2000).

These early findings began to stimulate interest in such microbial repeat elements. By 2002, Jansen and Mojica coined the acronym CRISPR to unify the description of microbial genomic loci consisting of an interspaced repeat array (Jansen et al., 2002; Barrangou and van der Oost, 2013). At the same time, several clusters of signature CRISPR-associated (cas) genes were identified to be well conserved and typically adjacent to the repeat elements (Jansen et al., 2002), serving as a basis for the eventual
classification of three different types of CRISPR systems (types I–III) (Haft et al., 2005; Makarova et al., 2011b).

Figure 2. Genome Editing Technologies Exploit Endogenous DNA Repair Machinery
(A) DNA double-strand breaks (DSBs) are typically repaired by nonhomologous end-joining (NHEJ) or homology-directed repair (HDR). In the error-prone NHEJ pathway, Ku heterodimers bind to DSB ends and serve as a molecular scaffold for associated repair proteins. Indels are introduced when the complementary strands undergo end resection and misaligned repair due to microhomology, eventually leading to frameshift mutations and gene knockout. Alternatively, Rad51 proteins may bind DSB ends during the initial phase of HDR, recruiting accessory factors that direct genomic recombination with homology arms on an exogenous repair template. Bypassing the matching sister chromatid facilitates the introduction of precise gene modifications.
(B) Zinc finger (ZF) proteins and transcription activator-like effectors (TALEs) are naturally occurring DNA-binding domains that can be modularly assembled to target specific sequences. ZF and TALE domains each recognize 3 and 1 bp of DNA, respectively. Such DNA-binding proteins can be fused to the FokI endonuclease to generate programmable site-specific nucleases.
(C) The Cas9 nuclease from the microbial CRISPR adaptive immune system is localized to specific DNA sequences via the guide sequence on its guide RNA (red), directly base-pairing with the DNA target. Binding of a protospacer-adjacent motif (PAM, blue) downstream of the target locus helps to direct Cas9-mediated DSBs.

Types I and III CRISPR loci contain multiple Cas proteins, now known to form complexes with crRNA (CASCADE complex for type I; Cmr or Csm RAMP complexes for type III) to facilitate the recognition and destruction of target nucleic acids (Brouns et al., 2008; Hale et al., 2009) (Figure 4). In contrast, the type II system has a significantly reduced number of Cas
proteins. However, despite increasingly detailed mapping and annotation of CRISPR loci across many microbial species, their biological significance remained elusive.

A key turning point came in 2005, when systematic analysis of the spacer sequences separating the individual direct repeats suggested their extrachromosomal and phage-associated origins (Mojica et al., 2005; Pourcel et al., 2005; Bolotin et al., 2005). This insight was tremendously exciting, especially given previous studies showing that CRISPR loci are transcribed (Tang et al., 2002) and that viruses are unable to infect archaeal cells carrying spacers corresponding to their own genomes (Mojica et al., 2005). Together, these findings led to the speculation that CRISPR arrays serve as an immune memory and defense mechanism, and individual spacers facilitate defense against bacteriophage infection by exploiting Watson-Crick base-pairing between nucleic acids (Mojica et al., 2005; Pourcel et al., 2005). Despite these compelling realizations that CRISPR loci might be involved in microbial immunity, the specific mechanism of how the spacers act to mediate viral defense remained a challenging puzzle. Several hypotheses were raised, including thoughts that CRISPR spacers act as small RNA guides to degrade viral transcripts in a RNAi-like mechanism (Makarova et al., 2006) or that CRISPR spacers direct Cas enzymes to cleave viral DNA at spacer-matching regions (Bolotin et al., 2005).

Working with the dairy production bacterial strain Streptococcus thermophilus at the food ingredient company Danisco, Horvath and colleagues uncovered the first experimental evidence for the natural role of a type II CRISPR system as an adaptive immunity system, demonstrating a nucleic-acid-based immune system in which CRISPR spacers dictate target specificity while Cas enzymes control spacer acquisition and phage defense (Barrangou et al., 2007). A rapid series of studies illuminating the mechanisms of CRISPR defense followed shortly and helped to
establish the mechanism as well as function of all three types of CRISPR loci in adaptive immunity. By studying the type I CRISPR locus of Escherichia coli, van der Oost and colleagues showed that CRISPR arrays are transcribed and converted into small crRNAs containing individual spacers to guide Cas nuclease activity (Brouns et al., 2008). In the same year, CRISPR-mediated defense by a type III-A CRISPR system from Staphylococcus epidermidis was demonstrated to block plasmid conjugation, establishing the target of Cas enzyme activity as DNA rather than RNA (Marraffini and Sontheimer, 2008), although later investigation of a different type III-B system from Pyrococcus furiosus also revealed crRNA-directed RNA cleavage activity (Hale et al., 2009, 2012).

Figure 3. Key Studies Characterizing and Engineering CRISPR Systems
Cas9 has also been referred to as Cas5, Csx12, and Csn1 in literature prior to 2012. For clarity, we exclusively adopt the Cas9 nomenclature throughout this Review. CRISPR, clustered regularly interspaced short palindromic repeats; Cas, CRISPR-associated; crRNA, CRISPR RNA; DSB, double-strand break; tracrRNA, trans-activating CRISPR RNA.

As the pace of CRISPR research accelerated, researchers quickly unraveled many details of each type of CRISPR system (Figure 4). Building on an earlier speculation that protospacer-adjacent motifs (PAMs) may direct the type II Cas9 nuclease to cleave DNA (Bolotin et al., 2005), Moineau and colleagues highlighted the importance of PAM sequences by demonstrating that PAM mutations in phage genomes circumvented CRISPR interference (Deveau et al., 2008). Additionally, for types I and II, the lack of PAM within the direct repeat sequence within the CRISPR array prevents self-targeting by the CRISPR system. In type III systems, however, mismatches between the 5′ end of the crRNA and the DNA target are required for plasmid interference (Marraffini and Sontheimer, 2010).

By 2010, just 3 years after the first experimental evidence for
CRISPR in bacterial immunity, the basic function and mechanisms of CRISPR systems were becoming clear. A variety of groups had begun to harness the natural CRISPR system for various biotechnological applications, including the generation of phage-resistant dairy cultures (Quiberoni et al., 2010) and phylogenetic classification of bacterial strains (Horvath et al., 2008, 2009). However, genome editing applications had not yet been explored.

Around this time, two studies characterizing the functional mechanisms of the native type II CRISPR system elucidated the basic components that proved vital for engineering a simple RNA-programmable DNA endonuclease for genome editing. First, Moineau and colleagues used genetic studies in Streptococcus thermophilus to reveal that Cas9 (formerly called Cas5, Csn1, or Csx12) is the only enzyme within the cas gene cluster that mediates target DNA cleavage (Garneau et al., 2010). Next, Charpentier and colleagues revealed a key component in the biogenesis and processing of crRNA in type II CRISPR systems: a noncoding trans-activating crRNA (tracrRNA) that hybridizes with crRNA to facilitate RNA-guided targeting of Cas9 (Deltcheva et al., 2011). This dual RNA hybrid, together with Cas9 and endogenous RNase III, is required for processing the CRISPR array transcript into mature crRNAs (Deltcheva et al., 2011). These two studies suggested that there are at least three components (Cas9, the mature crRNA, and tracrRNA) that are essential for reconstituting the type II CRISPR nuclease system. Given the increasing importance of programmable site-specific nucleases based on ZFs and TALEs for enhancing eukaryotic genome editing, it was tantalizing to think that perhaps Cas9 could be developed into an RNA-guided genome editing system.
From this point, the race to harness Cas9 for genome editing was on.

Figure 4. Natural Mechanisms of Microbial CRISPR Systems in Adaptive Immunity
Following invasion of the cell by foreign genetic elements from bacteriophages or plasmids (step 1: phage infection), certain CRISPR-associated (Cas) enzymes acquire spacers from the exogenous protospacer sequences and install them into the CRISPR locus within the prokaryotic genome (step 2: spacer acquisition). These spacers are segregated between direct repeats that allow the CRISPR system to mediate self and nonself recognition. The CRISPR array is a noncoding RNA transcript that is enzymatically maturated through distinct pathways that are unique to each type of CRISPR system (step 3: crRNA biogenesis and processing).
In types I and III CRISPR, the pre-crRNA transcript is cleaved within the repeats by CRISPR-associated ribonucleases, releasing multiple small crRNAs. Type III crRNA intermediates are further processed at the 3′ end by yet-to-be-identified RNases to produce the fully mature transcript. In type II CRISPR, an associated trans-activating CRISPR RNA (tracrRNA) hybridizes with the direct repeats, forming an RNA duplex that is cleaved and processed by endogenous RNase III and other unknown nucleases. Maturated crRNAs from type I and III CRISPR systems are then loaded onto effector protein complexes for target recognition and degradation. In type II systems, crRNA-tracrRNA hybrids complex with Cas9 to mediate interference.
Both type I and III CRISPR systems use multiprotein interference modules to facilitate target recognition. In type I CRISPR, the Cascade complex is loaded with a crRNA molecule, constituting a catalytically inert surveillance complex that recognizes target DNA. The Cas3 nuclease is then recruited to the Cascade-bound R loop, mediating target degradation. In type III CRISPR, crRNAs associate either with Csm or Cmr complexes that bind and cleave DNA and RNA substrates, respectively. In contrast, the type II system requires only the Cas9 nuclease to degrade DNA matching its dual guide RNA consisting of a crRNA-tracrRNA hybrid.

In 2011, Siksnys and colleagues first demonstrated that the type II CRISPR system is transferrable, in that transplantation of the type II CRISPR locus from Streptococcus thermophilus into Escherichia coli is able to reconstitute CRISPR interference in a different bacterial strain (Sapranauskas et al., 2011). By 2012, biochemical characterizations by the groups of Charpentier, Doudna, and Siksnys showed that purified Cas9 from Streptococcus thermophilus or Streptococcus pyogenes can be guided by crRNAs to cleave target DNA in vitro (Jinek et al., 2012; Gasiunas et al., 2012), in agreement with previous bacterial studies (Garneau et al., 2010; Deltcheva et al., 2011; Sapranauskas et al., 2011). Furthermore, a single guide RNA (sgRNA) can be constructed by fusing a crRNA containing the targeting guide sequence to a tracrRNA that facilitates DNA cleavage by Cas9 in vitro (Jinek et al., 2012).

In 2013, a pair of studies simultaneously showed how to successfully engineer type II CRISPR systems from Streptococcus thermophilus (Cong et al., 2013) and Streptococcus pyogenes (Cong et al., 2013; Mali et al., 2013a) to accomplish genome editing in mammalian cells. Heterologous expression of mature crRNA-tracrRNA hybrids (Cong et al., 2013) as well as sgRNAs (Cong et al., 2013; Mali et al., 2013a) directs Cas9 cleavage within the mammalian cellular genome to stimulate NHEJ or HDR-mediated genome editing. Multiple guide RNAs can also be used to target several genes at once. Since these initial studies, Cas9 has been used by thousands of laboratories for genome editing applications in a variety of experimental model systems (Sander and Joung, 2014). The rapid adoption of the Cas9 technology was also greatly accelerated through a combination of open-source distributors such as Addgene, as well as a number of online user forums such as http://www. and .
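The targeting rule described above (a guide sequence that base-pairs with the target, plus a PAM immediately downstream) can be sketched as a simple sequence scan. The Review text here does not state the PAM sequence or guide length, so the 5′-NGG-3′ PAM and 20 nt guide length commonly used for S. pyogenes Cas9 are assumptions of this sketch, as are the function names `revcomp` and `find_targets`. It ignores off-target effects and every other design consideration.

```python
def revcomp(seq):
    """Reverse complement of a DNA string over A, C, G, T, N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))

def find_targets(seq, guide_len=20):
    """List candidate sites as (strand, start, protospacer, pam).

    Assumes an NGG PAM directly 3' of the protospacer. For the "-"
    strand, start is a coordinate on the reverse-complemented sequence.
    """
    seq = seq.upper()
    sites = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        # Last valid start leaves room for the guide plus a 3 nt PAM.
        for i in range(len(s) - guide_len - 2):
            pam = s[i + guide_len : i + guide_len + 3]
            if pam[1:] == "GG":
                sites.append((strand, i, s[i : i + guide_len], pam))
    return sites

# Made-up sequence: a 20 nt protospacer followed by a TGG PAM.
demo = "ACGTACGTACGTACGTACGT" + "TGG" + "AAAA"
for site in find_targets(demo):
    print(site)
```

Scanning both strands matters because a PAM on either strand defines a usable site; a practical tool would also map minus-strand coordinates back to the forward sequence.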
Structural Organization and Domain Architecture of Cas9

The family of Cas9 proteins is characterized by two signature nuclease domains, RuvC and HNH, each named based on homology to known nuclease domain structures (Figure 2C). Though HNH is a single nuclease domain, the full RuvC domain is divided into three subdomains across the linear protein sequence, with RuvC I near the N-terminal region of Cas9 and RuvC II/III flanking the HNH domain near the middle of the protein. Recently, a pair of structural studies shed light on the structural mechanism of RNA-guided DNA cleavage by Cas9.

First, single-particle EM reconstructions of the Streptococcus pyogenes Cas9 (SpCas9) revealed a large structural rearrangement between apo-Cas9 unbound to nucleic acid and Cas9 in complex with crRNA and tracrRNA, forming a central channel to accommodate the RNA-DNA heteroduplex (Jinek et al., 2014). Second, a high-resolution structure of SpCas9 in complex with sgRNA and the complementary strand of target DNA further revealed the domain organization to comprise an α-helical recognition (REC) lobe and a nuclease (NUC) lobe consisting of the HNH domain, assembled RuvC subdomains, and a PAM-interacting (PI) C-terminal region (Nishimasu et al., 2014) (Figure 5A and Movie S1).

Together, these two studies support the model that SpCas9 unbound to target DNA or guide RNA exhibits an autoinhibited conformation in which the HNH domain active site is blocked by the RuvC domain and is positioned away from the REC lobe (Jinek et al., 2014). Binding of the RNA-DNA heteroduplex would additionally be sterically inhibited by the orientation of the C-terminal domain. As a result, apo-Cas9 likely cannot bind or cleave target DNA. Like many ribonucleoprotein complexes, the guide RNA serves as a scaffold around which Cas9 can fold and organize its various domains (Nishimasu et al., 2014).

The crystal structure of SpCas9 in complex with an sgRNA and target DNA also revealed how the REC lobe facilitates target binding. An arginine-rich bridge
helix (BH) within the REC lobe is responsible for contacting the 3′ 8–12 nt of the RNA-DNA heteroduplex (Nishimasu et al., 2014), which correspond with the seed sequence identified through guide sequence mutation experiments (Jinek et al., 2012; Cong et al., 2013; Fu et al., 2013; Hsu et al., 2013; Pattanayak et al., 2013; Mali et al., 2013b).

The SpCas9 structure also provides a useful scaffold for engineering or refactoring of Cas9 and sgRNA. Because the REC2 domain of SpCas9 is poorly conserved in shorter orthologs, domain recombination or truncation is a promising approach for minimizing Cas9 size. SpCas9 mutants lacking REC2 retain roughly 50% of wild-type cleavage activity, which could be partly attributed to their weaker expression levels (Nishimasu et al., 2014). Introducing combinations of orthologous domain recombination, truncation, and peptide linkers could facilitate the generation of a suite of Cas9 mutant variants optimized for different parameters such as DNA binding, DNA cleavage, or overall protein size.

Metagenomic, Structural, and Functional Diversity of Cas9

Cas9 is exclusively associated with the type II CRISPR locus and serves as the signature type II gene. Based on the diversity of associated Cas genes, type II CRISPR loci are further subdivided into three subtypes (IIA–IIC) (Figure 5B) (Makarova et al., 2011a; Chylinski et al., 2013). Type II CRISPR loci mostly consist of the cas9, cas1, and cas2 genes, as well as a CRISPR array and tracrRNA. Type IIC CRISPR systems contain only this minimal set of cas genes, whereas types IIA and IIB have an additional signature csn2 or cas4 gene, respectively (Chylinski et al., 2013).
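The subtype rule in the preceding paragraph is simple enough to write down directly. The following is a minimal sketch of that rule only: the function name `classify_type_ii` and the choice to return `None` for loci that do not fit are assumptions for illustration, and real annotation pipelines weigh far more evidence than gene presence.

```python
def classify_type_ii(cas_genes):
    """Assign a type II subtype from the set of cas genes at a locus.

    Rule from the text: cas9, cas1, and cas2 alone give IIC;
    an additional csn2 gives IIA; an additional cas4 gives IIB.
    Returns None when the locus does not fit this simple rule.
    """
    genes = {g.lower() for g in cas_genes}
    core = {"cas9", "cas1", "cas2"}
    if not core <= genes:
        return None  # not a complete type II core gene set
    has_csn2 = "csn2" in genes
    has_cas4 = "cas4" in genes
    if has_csn2 and has_cas4:
        return None  # both signature genes: ambiguous under this rule
    if has_csn2:
        return "IIA"
    if has_cas4:
        return "IIB"
    return "IIC"

print(classify_type_ii(["cas9", "cas1", "cas2", "csn2"]))
print(classify_type_ii(["cas9", "cas1", "cas2"]))
```

As the following text notes, this locus-level classification says nothing about the length or structure of the Cas9 protein itself.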
Subtype classification of type II CRISPR loci is based on the architecture and organization of each CRISPR locus. For example, type IIA and IIB loci usually consist of four cas genes, whereas type IIC loci only contain three cas genes. However, this classification does not reflect the structural diversity of Cas9 proteins, which exhibit sequence homology and length variability irrespective of the subtype classification of their parental CRISPR locus. Of >1,000 Cas9 nucleases identified from sequence databases (UniProt) based on homology, protein length is rather heterogeneous, roughly ranging from 900 to 1,600 amino acids (Figure 5C). The length distribution of most Cas9 proteins can be divided into two populations centered around 1,100 and 1,350 amino acids in length. It is worth noting that a third population of large Cas9 proteins belonging to subtype IIA, formerly called Csx12, typically contain around 1,500 amino acids.

Despite the apparent diversity of protein length, all Cas9 proteins share similar domain architecture (Makarova et al., 2011a;
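Comprehensive Assessment of Gene Expression Regulatory Networks with NGS
NGS (next-generation sequencing) is a high-throughput sequencing technology that determines DNA or RNA sequences quickly and accurately.
In recent years, NGS has been widely applied in genomics, transcriptomics, epigenomics, and related fields.
A gene expression regulatory network is the complex network formed by interactions among genes; it governs the expression and function of genes in the cell.
This article discusses how NGS can be used to assess gene expression regulatory networks comprehensively.
The high throughput of NGS allows gene expression regulatory networks to be studied more broadly and in greater depth.
First, RNA sequencing (RNA-seq) provides transcriptome-wide information on the expression of all genes in a cell under a given condition.
By comparing multiple samples, we can identify differentially expressed genes and then analyze their roles in the regulatory network.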
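The comparison between samples mentioned above can be illustrated with a deliberately simplified sketch: a log2 fold change between group means, with a pseudocount to avoid division by zero. The function names, the pseudocount, the threshold, and the counts are all assumptions for illustration. Real differential expression analysis also requires library-size normalization and a statistical test, typically done with dedicated packages; none of that is shown here.

```python
import math

def log2_fold_changes(control, treated, pseudocount=1.0):
    """Per-gene log2((mean treated + pc) / (mean control + pc))."""
    result = {}
    for gene in control.keys() & treated.keys():
        c = sum(control[gene]) / len(control[gene])
        t = sum(treated[gene]) / len(treated[gene])
        result[gene] = math.log2((t + pseudocount) / (c + pseudocount))
    return result

def differential_genes(lfc, threshold=1.0):
    """Genes whose absolute log2 fold change reaches the threshold."""
    return sorted(g for g, v in lfc.items() if abs(v) >= threshold)

# Hypothetical normalized counts for two replicates per condition.
control = {"geneA": [7, 7], "geneB": [100, 98], "geneC": [31, 31]}
treated = {"geneA": [31, 31], "geneB": [101, 99], "geneC": [7, 7]}

lfc = log2_fold_changes(control, treated)
print(differential_genes(lfc))
```

The genes that pass the threshold are the candidates whose position in the regulatory network is then examined.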
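Second, NGS can reveal interactions between transcription factors and their target genes within the cell.
Conventional ChIP assays detect transcription factor binding at one specific locus at a time, whereas ChIP-seq, which couples ChIP with NGS, profiles transcription factor binding across the whole genome.
Combined with transcriptome-wide RNA-seq data, these binding profiles make it possible to predict which genes are likely targets of a transcription factor, and thereby to reveal the structure of the regulatory network.
In addition, epigenetics plays an important role in gene expression regulatory networks.
NGS makes it possible to measure, at high throughput, the distribution of epigenetic marks such as DNA methylation and histone modifications.
By comparing multiple samples, we can detect changes in epigenetic marks across conditions and thereby reveal how the regulatory network is remodeled in different physiological or pathological states.
The development of NGS has also brought a new generation of computational tools and algorithms for processing and analyzing large-scale sequencing data.
With bioinformatics methods, gene expression data, transcription factor binding sites, and epigenetic modification information can all be derived from NGS data.
Network analysis methods can then be used to construct and analyze gene regulatory networks and to identify the important regulatory modules and key genes within them.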
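As a minimal sketch of the network analysis step, the following builds a transcription factor to target map from edge pairs and ranks regulators by how many targets they have (out-degree), one simple proxy for a "key" regulator. The edge list and the names `TF1`, `geneA`, and so on are hypothetical, and degree is only one of many possible importance measures.

```python
from collections import defaultdict

def build_network(edges):
    """Map each transcription factor to the set of its target genes."""
    targets = defaultdict(set)
    for tf, gene in edges:
        targets[tf].add(gene)
    return targets

def rank_regulators(network):
    """Sort regulators by number of targets, ties broken by name."""
    return sorted(network, key=lambda tf: (-len(network[tf]), tf))

# Hypothetical edges, e.g. genes that are both bound by a factor
# (from ChIP-seq) and differentially expressed (from RNA-seq).
edges = [
    ("TF1", "geneA"), ("TF1", "geneB"), ("TF1", "geneC"),
    ("TF2", "geneB"),
    ("TF3", "geneA"), ("TF3", "geneD"),
]

network = build_network(edges)
print(rank_regulators(network))
```

Using sets means a repeated edge does not inflate a regulator's degree.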
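A comprehensive NGS-based assessment gives a more accurate picture of the structure and function of gene expression regulatory networks and supports deeper study of their roles in different biological processes.
This is important for understanding disease mechanisms and developing new therapeutic strategies.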
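Sample Essay (2024): Mining Probiotic Genomic Molecular Markers with Machine Learning and Building a Visual Screening and Prediction Platform
Mining Probiotic Genomic Molecular Markers with Machine Learning and Building a Visual Screening and Prediction Platform, Part 1
I. Introduction
With the rapid development of modern biotechnology, probiotics, microorganisms with important physiological functions, are increasingly used in human health.
To study and develop probiotics more effectively, mining and screening their genomic molecular markers is especially important.
This article introduces a machine-learning-based approach to mining probiotic genomic molecular markers and to building a visual screening and prediction platform, with the aim of offering new ideas and methods for probiotic research and development.
II. Mining Probiotic Genomic Molecular Markers
2.1 Data collection and preprocessing
First, we collect a large amount of probiotic genomic data, including gene sequences, expression levels, and functional annotations.
These data are then preprocessed through data cleaning, quality control, and standardization to ensure their reliability and accuracy.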
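A minimal sketch of the cleaning and standardization step follows. It assumes a hypothetical feature table in which missing values are stored as None; features with missing or constant values are dropped and the rest are z-scored. The article does not specify its actual procedure, so this is only one plausible reading.

```python
from statistics import mean, pstdev


def clean_and_standardize(table):
    """Drop incomplete or constant features, then z-score the remainder.

    table: {feature_name: [value or None, ...]}
    """
    standardized = {}
    for name, values in table.items():
        if any(v is None for v in values):
            continue  # data cleaning: discard features with missing values
        mu, sd = mean(values), pstdev(values)
        if sd == 0:
            continue  # a constant feature carries no information
        standardized[name] = [(v - mu) / sd for v in values]
    return standardized


# Hypothetical expression values for three genes across three strains.
raw = {
    "gene1": [2.0, 4.0, 6.0],
    "gene2": [1.0, None, 3.0],
    "gene3": [5.0, 5.0, 5.0],
}
z = clean_and_standardize(raw)
print(z)
```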
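2.2 Feature extraction and selection
From the preprocessed data, machine learning algorithms extract features related to probiotic function, such as conserved regions in gene sequences and differences in expression levels.
Feature selection algorithms then retain the features that matter most for predicting probiotic function.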
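Feature selection can be sketched with a simple filter method. The example below ranks features by the absolute Welch t-statistic between two classes and keeps the top k; the feature names, values, and the choice of the t-statistic are assumptions made for illustration, not the article's stated algorithm.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples (each needs nonzero variance)."""
    return (mean(a) - mean(b)) / math.sqrt(
        variance(a) / len(a) + variance(b) / len(b)
    )


def select_top_k(features, labels, k):
    """Rank features by |t| between label 1 and label 0; keep the best k."""
    scores = {}
    for name, values in features.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        scores[name] = abs(welch_t(pos, neg))
    return sorted(scores, key=lambda n: (-scores[n], n))[:k]


# Hypothetical data: label 1 = strain has the function of interest.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "marker_a": [5.1, 4.9, 5.0, 1.0, 1.2, 0.8],
    "marker_b": [2.0, 3.0, 2.5, 2.4, 2.6, 2.5],
    "marker_c": [0.9, 1.1, 1.0, 2.0, 2.2, 1.8],
}
selected = select_top_k(features, labels, k=2)
print(selected)
```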
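2.3 Building and optimizing the machine learning model
We build the machine learning models mainly with supervised learning.
Candidate models include classification and regression, with clustering (an unsupervised method) available for exploratory grouping; the appropriate model is chosen according to the task.
During model building, cross-validation and hyperparameter tuning are used to optimize model performance and improve prediction accuracy.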
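The cross-validation idea can be sketched as follows. This example uses contiguous k folds and a one-dimensional nearest-centroid classifier purely as a stand-in model; the data are hypothetical, and a real workflow would typically shuffle or stratify the folds and use an established library.

```python
from statistics import mean


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) for k contiguous folds over range(n)."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def nearest_centroid_accuracy(x, y, train, test):
    """Fit per-class means on the training fold; score on the test fold."""
    centroids = {}
    for label in set(y[i] for i in train):
        centroids[label] = mean(x[i] for i in train if y[i] == label)
    correct = 0
    for i in test:
        predicted = min(centroids, key=lambda c: abs(x[i] - centroids[c]))
        if predicted == y[i]:
            correct += 1
    return correct / len(test)


def cross_validate(x, y, k):
    """Average test accuracy over the k folds."""
    return mean(
        nearest_centroid_accuracy(x, y, train, test)
        for train, test in k_fold_indices(len(x), k)
    )


# Hypothetical one-feature dataset with two well-separated classes.
x = [0.1, 2.9, 0.3, 3.1, 0.2, 3.0]
y = [0, 1, 0, 1, 0, 1]
score = cross_validate(x, y, k=3)
print(score)
```

Hyperparameter tuning would repeat `cross_validate` for each candidate setting and keep the one with the best average score.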
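III. Building the Visual Screening and Prediction Platform
3.1 Platform architecture
The visual screening and prediction platform is modular, comprising a data management module, a feature extraction and selection module, a machine learning model module, and a visualization module.
The modules exchange data and interact through interfaces.
3.2 Data management and interaction
The platform provides data upload, download, query, and deletion so that users can manage and use their data easily.
It also supports user-defined features, models, and parameters to meet different research needs.
3.3 Visualization
The platform presents complex analysis results through charts, heat maps, scatter plots, and other views.
Users can inspect and analyze the data with simple mouse operations, which improves working efficiency.
IV. Platform Application and Evaluation
4.1 Platform application
We used the platform to screen and make predictions for a large number of probiotic strains.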
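Graph Attention Network with Local and Global Attention Mechanism to Learn Single-Sample Omic Data Representation

Journal of Jilin University (Science Edition), Vol. 61, No. 6, Nov 2023
doi: 10.13413/j.cnki.jdxblxb.2023047

ZHOU Fengfeng 1,2, ZHANG Jinkai 1
(1. College of Computer Science and Technology, Jilin University, Changchun 130012, China; 2. Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China)

Abstract: Aiming at the high-dimensional "large p, small n" problem, in which the number of genes in omics data (denoted p) far exceeds the number of samples (denoted n), we propose GATOr, a graph attention network with local and global attention mechanisms. The model first uses the Pearson correlation coefficient to calculate the correlations between features of the omics data and constructs a single-sample network for each sample. It then applies a graph attention network that combines local and global attention mechanisms to learn graph-based omics feature representations from the single-sample network, transforming the high-dimensional omics data into low-dimensional representations. Experimental results show that, compared with traditional classification algorithms, GATOr achieves better accuracy and other metrics on classification tasks.

Keywords: omic data; single-sample network; attention mechanism; graph attention network

CLC number: TP391  Document code: A  Article ID: 1671-5489(2023)06-1351-07

Received: 2023-02-14.
First author: ZHOU Fengfeng (1977- ), male, Han nationality, Ph.D., professor and doctoral supervisor; research interest: health big data; E-mail: FengfengZhou@gmail.com.
Funding: National Natural Science Foundation of China (Nos. 62072212, U19A2061); Jilin Province Outstanding Young and Middle-aged Talents (Team) Project for Science and Technology Innovation and Entrepreneurship (Innovation Category) (No. 20210509055RQ); Jilin Provincial Big Data Intelligent Computing Laboratory Project (No. 20180622002JC).

In recent years, modern high-throughput biomedical technologies have advanced rapidly, and biological data are accumulating ever faster [1]. These data have greatly advanced research on the mechanisms underlying many biological processes, including aging and the pathogenesis of complex diseases [2]. However, most omics datasets are noisy, high-dimensional, and heterogeneous. In addition, many omics features are unrelated to the phenotype, and features are redundant with one another. Most omics data produced by high-throughput technologies suffer from the "large p, small n" curse of dimensionality, where p is the number of features and n is the number of samples. Omics data are therefore high-dimensional.

Feature selection is an effective way to overcome the curse of dimensionality in high-dimensional omics data, and it is widely used in bioinformatics for biomarker identification and dimensionality reduction. However, existing feature selection algorithms applied to omics data almost all rely on traditional classification learners; deep feature selection algorithms are rarely considered in research on omics dimensionality reduction, which leads to low classification accuracy on omics data.

Many current deep-learning-based methods adopt a graph-based hybrid strategy: before analysis, each omics dataset is modeled as a separate graph, and graph embedding methods learn low-dimensional representations of the nodes and their surroundings from each network. The new graph-based features are then combined and fed into other machine learning models for prediction, classification, and similar tasks. Common approaches to building networks on omics data include protein interaction networks, correlation-based networks, and ratio networks. Jonsson et al. [3] analyzed protein interaction networks and found that protein features related to carcinogenesis tend to be more closely correlated, suggesting that functionally similar features or feature sets usually reflect the phenotype of organismal function in a modular form. Wang et al. [4] analyzed protein interaction networks to reveal the interaction characteristics among liver-specific proteins, liver disease proteins, and important signaling pathway molecules. Liu et al. [5] used the Pearson correlation coefficient to calculate correlations between features and build the corresponding biological network, and constructed individual sample networks from the change in feature correlations with and without a specific sample, which supports personalized treatment of disease. Netzer et al. [6] used paired biomarker identifiers (PBI) as an index to measure changes in feature ratios between groups and built the corresponding biological network. This approach, which takes single features as network nodes and builds the network from changes in the ratios between features, was first applied to metabolomics data and later extended to genomics data [7]. Choosing a suitable network and finding an efficient classification learner is therefore especially important.

Building on the idea of learning useful information from single-sample networks of omics data, this paper proposes a graph attention network with local and global attention mechanisms (GATOr). First, a graph is constructed from the omics data of each sample, with each omics feature as a node and the correlation between each pair of features as the edge weight; because graph construction has quadratic time complexity, irrelevant features are pre-screened to reduce the number of omics features in the single-sample graph. Second, the useful information extracted by the local and global attention modules integrated in the graph attention network serves as engineered features, and the features learned from the single-sample network are used for the class prediction task. Experimental results show that GATOr achieves better classification performance than existing omics classification methods.

1 Algorithm Design

1.1 Single-sample network

A single-sample network is a biomolecular network built from the data of a single sample against a reference dataset. It applies the theory and methods of complex networks to disease research and drug development, and can identify, from a systems perspective, the interactions or dysfunctions involved in an individual's disease [8]. Liu et al. [5] proposed a single-sample network based on Pearson correlation, obtaining individual-specific or sample-specific networks in the context of disease-characterizing gene regulatory networks. Constructing a node network requires multiple samples, which are usually unavailable in clinical practice, so it is necessary to characterize or infer the node network on the basis of a single sample [9]. The advantage of this approach is that the network depends only on the graph-based variables learned from each model, and these variables can serve as inputs to other machine learning models for clustering, subtype discovery, or survival prediction.

1.2 Graph attention network

A graph neural network updates the embedded representation of a node by aggregating the influence of its multi-layer neighbor nodes, and then uses the updated embeddings for downstream tasks such as node classification and link prediction [10]. Bruna et al. [11] proposed a spectral graph convolutional neural network (GCN); spectral convolution requires an eigendecomposition of the Laplacian matrix and node aggregation at every step, which is computationally expensive. Defferrard et al. [12] approximated the convolution kernel and proposed the Chebyshev network, which avoids the eigendecomposition of the Laplacian matrix and reduces computational complexity. Kipf et al. [13] optimized this further and proposed the original graph convolutional network model, with which spectral graph convolution can work at its most effective.

The graph attention network (GAT) first aggregates neighbor nodes by message passing between nodes [14] and then updates each node's own information. By learning attention weights, it amplifies the weights of the more important nodes and edges, and it uses the attention mechanism to define the aggregation function, computing and updating node feature information to obtain new features of each node's local structure. GAT is built by stacking simple graph attention layers. For a node pair (i, j), each attention layer computes the attention coefficient as

$$a_{ij} = \frac{\exp\{\mathrm{LeakyReLU}(\mathbf{a}^{\top}[W h_i \,\|\, W h_j])\}}{\sum_{k \in N_i} \exp\{\mathrm{LeakyReLU}(\mathbf{a}^{\top}[W h_i \,\|\, W h_k])\}}, \tag{1}$$

where $a_{ij}$ is the attention coefficient from node j to node i and $N_i$ is the set of neighbors of node i. The input node features are $h = \{h_1, h_2, \ldots, h_N\}$, $h_i \in \mathbb{R}^{F}$, and the output node features are $h' = \{h'_1, h'_2, \ldots, h'_N\}$, $h'_i \in \mathbb{R}^{F'}$, where N and F are the number of nodes and the input feature dimension; $W \in \mathbb{R}^{F' \times F}$ is the weight matrix of the linear transformation applied to every node; and $\mathbf{a} \in \mathbb{R}^{2F'}$ is a weight vector that maps its input to $\mathbb{R}$. The coefficients are normalized with Softmax, and LeakyReLU provides the nonlinearity. The final output feature of a node is

$$h'_i = \sigma\Big(\sum_{j \in N_i} a_{ij} W h_j\Big), \tag{2}$$

where $\sigma$ is a nonlinear activation function such as Sigmoid or ReLU.

1.3 GATOr network

The overall structure of the proposed graph attention network for learning omics feature representations with local and global attention mechanisms (GATOr) is shown in Fig. 1. It has two main parts: 1) the single-sample network, in which each omics sample is modeled as a single-sample network with features as nodes and the correlation between each pair of features as an edge; 2) the graph attention network with local and global attention mechanisms, which learns representative feature vectors from the single-sample network for the classification task.

Fig. 1 Graph attention network based on local and global attention mechanism to learn omics feature representation

The optimization goal of the GATOr network is to evaluate the importance of each neighbor of a central node so that different weights can be assigned to its neighbors. The local attention of a central node attends only to its first-order neighbors, whereas global attention attends to all nodes in the graph; fusing the two improves feature extraction and therefore downstream classification performance. GATOr introduces the attention mechanism to address the limitation that GCN treats all neighbor nodes equally. By assigning different weights to different neighbors, it gives the model stronger representational power, transforms the original graph data into a low-dimensional space while preserving key information, produces low-dimensional vectors that retain important information from the original graph, and improves performance on downstream tasks such as node classification.

1.3.1 Constructing the single-sample network

This paper uses each sample of the omics data as a single-sample network to train the graph-based GATOr model. In practice, it is crucial to detect the critical state and early-warning signals of malignant transition in complex diseases from the data of only one sample. Although expression or sequencing data provide information about molecular profiles on a per-sample basis, each patient in the dataset has only one sample, so traditional methods cannot compute a gene similarity network. A sufficient number of reference samples is therefore needed to characterize the correlations between genes in the normal state, and the single-sample features are reflected by the difference between the single sample and the reference samples [5,15].

First, a reference network is built on the basis of a gene co-expression network. It is usually represented as an undirected graph whose nodes are features and whose edges are the correlations between features. Given n reference samples, the correlation between any pair of features x and y in the reference data can be computed with the Pearson correlation coefficient (PCC):

$$\mathrm{PCC}_n(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}, \tag{3}$$

where $x_i$ and $y_i$ are the values of features x and y in the i-th reference sample, and $\bar{x}$ and $\bar{y}$ are the means of features x and y over the reference samples. Yu et al. [16] retrieved the variation of each omics feature in the query sample relative to a subset of the reference samples and computed the PCC between the variation vectors of two omics features in the query sample. This PCC value between two omics features in the query sample is defined as the reference-based variation PCC (rvPCC). rvPCC ranges from -1 to 1; when rvPCC is close to -1 or 1, the two query features are defined as negatively or positively correlated, respectively [17].

Omics datasets typically have thousands of features or more, which makes the quadratic time complexity of building a single-sample network impractical. This paper uses the t-test to measure the association between each feature and the class label, and selects the top-ranked k features (k = 800 in this paper) [18] for further analysis. PCC is used to measure the redundancy between features. This completes the single-sample network, a weighted undirected graph that can be used with a variety of graph-based deep neural networks and that opens a new avenue for characterizing personalized features and analyzing biological systems at the network level.

1.3.2 Local and global attention mechanisms

GAT uses the feature vector $\mathbf{a}$ to learn the relative importance of a node and its neighbors, which may fail to capture information useful for the classification task. Assuming that neighbors similar to the node itself are likely to be more important, the relative importance of nodes can be obtained by directly computing the similarity between two connected nodes [19]. The local attention of a node attends only to its neighbors, whereas the global attention of a node extracts information from all nodes in the graph. Networks based on a dual attention mechanism obtain high-quality, distinctive, and discriminative features by attending to low-level detail and high-level semantic information [20]. The local attention coefficient is computed as

$$a^{(L)}_{ij} = \frac{\exp\{\beta \cdot \cos(W h_i, W h_j)\}}{\sum_{k \in N_i} \exp\{\beta \cdot \cos(W h_i, W h_k)\}}, \tag{4}$$

where $\beta$ denotes the standard deviation and $\cos(\cdot)$ computes the cosine similarity. To aggregate information from the neighborhood of a node, Eq. (2) becomes

$$h'^{(L)}_i = \sigma\Big(\sum_{j \in N_i} a^{(L)}_{ij} W h_j\Big). \tag{5}$$

The local attention module differs from the graph attention module as follows: this paper explicitly uses $\cos(\cdot)$ to compute the similarity between nodes as the relative importance weight, whereas the traditional method learns the relative importance between nodes through the learnable parameter $\mathbf{a}$. Local attention is computed over the neighbors of a node on the graph, whereas this paper constructs local attention over the set of all entities.

This paper also implements a global attention mechanism, in which a node can selectively aggregate information from any other node in the graph; the graph attention layer is extended to operate globally. Mostafa et al. [21] proposed an attention coefficient based on Euclidean distance. The global attention coefficient can be written as

$$a^{(G)}_{ij} = \frac{\exp\{-\lambda \|\Phi W h_i - \Phi W h_j\|_2\}}{\sum_{k=1}^{N} \exp\{-\lambda \|\Phi W h_i - \Phi W h_k\|_2\}}, \tag{6}$$

where $\Phi \in \mathbb{R}^{D \times F'}$ is an embedding matrix that maps node features into a D-dimensional node similarity space, $\lambda$ is the inverse of the standard deviation, and $\|\cdot\|_2$ denotes the 2-norm. The globally weighted attention output of node i is

$$h'^{(G)}_i = \sigma\Big(\sum_{j=1}^{N} a^{(G)}_{ij} W h_j\Big). \tag{7}$$

Concatenating the locally aggregated and globally aggregated feature vectors gives the final feature vector $h'_i$:

$$h'_i = \big(h'^{(L)}_i \,\|\, h'^{(G)}_i\big), \tag{8}$$

where $\|$ is the concatenation operator. Eq. (8) can also be viewed as concatenating the outputs of different attention heads. With A attention heads, the output feature vector has dimension $2AF'$, and the final feature vector can also be written as

$$h'_i = \big\Vert_{a=1}^{A}\, g(G, [h_1, \ldots, h_N]; W_a, \Phi_a), \tag{9}$$

where G is the undirected graph of the single-sample network, the number of attention heads is A = 2, and g denotes the attention operation.

2 Experiments

2.1 Datasets

Four datasets are used to evaluate the GATOr feature engineering algorithm, all taken from the omics datasets curated in Ref. [22]: ROSMAP provides omics data of Alzheimer's disease (AD) patients and normal controls (NC); LGG is used for grade classification of low-grade glioma (LGG); KIPAN is used for kidney cancer type classification; and BRCA is used for PAM50 subtype classification of breast cancer. Preprocessing of each dataset includes excluding features with missing values and randomly selecting reference samples. Table 1 lists the datasets: the fourth column gives the details of the two or more classes in each dataset, and the last column gives the number of features for the three omics types, namely mRNA expression (mRNA), DNA methylation (Methy), and miRNA expression (miRNA). Features with missing data are excluded from further analysis. Because this paper does not address multi-omics integration, the three omics types are pooled for computation.

Table 1 Information of each dataset
Dataset  Samples  Features  Classes (sample count)                                                                                   Features (mRNA, Methy, miRNA)
ROSMAP   351      79986     NC (169), AD (182)                                                                                       55889, 23788, 309
LGG      510      41193     Grade 2 (246), Grade 3 (264)                                                                             20531, 20114, 548
KIPAN    658      41087     KICH (66), KIRC (318), KIRP (274)                                                                        20531, 20111, 445
BRCA     875      41140     Normal-like (115), Basal-like (131), HER2-enriched (46), Luminal A (436), Luminal B (147)                20531, 20106, 503

2.2 Evaluation metrics

Before tasks with high time complexity, such as building the single-sample network, feature pre-screening is used to reduce the feature dimension. As Table 1 shows, the number of features in all four datasets far exceeds the number of samples. Given the quadratic time complexity of building a single-sample network, GATOr features are engineered from only a limited number of the original omics (OMIC) features. Each dataset is split at random, with stratification, into 80% training data and 20% test data, so that the training and test sets keep the same class distribution. Binary classification tasks are evaluated by classification accuracy (ACC) and the area under the ROC curve (AUC). For multi-class tasks, only ACC is computed.

2.3 Experimental results and analysis

2.3.1 Comparison experiments

The classification performance of GATOr is compared with the following seven baseline methods for omics data.
1) k-nearest neighbor classifier (KNN): a voting strategy based on the classes of the k nearest neighbors of the query sample.
2) Support vector machine classifier (SVM): a popular classifier based on the maximum-margin separating hyperplane.
3) Linear regression trained with L1 regularization (Lasso): Lasso regression is a shrinkage and variable selection method for linear regression models that finds the subset of predictors minimizing the prediction error for a quantitative response variable.
4) Random forest classifier (RF): combines the decisions of multiple random trees.
5) Naive Bayes classifier (NB): a classifier based on Bayes' theorem and the assumption of conditional independence between features.
6) Extreme gradient boosting (XGBoost): a scalable, fast gradient boosting classification system.
7) Fully connected neural network classifier (NN): a fully connected neural network with cross-entropy loss, used as the baseline neural network classifier.

Table 2 lists the performance of GATOr and the seven baseline methods on the classification tasks of the four datasets. The GATOr framework outperforms the other baseline classifiers on both ACC and AUC on all four datasets. Compared with traditional omics classification methods, GATOr also obtains relatively small standard deviations and better classification performance.

Table 2 Performance evaluation of GATOr algorithm and 7 baseline methods for classification tasks on 4 datasets
Method    ACC ROSMAP    ACC LGG       ACC KIPAN     ACC BRCA      AUC ROSMAP    AUC LGG
KNN       0.657±0.036   0.729±0.034   0.967±0.011   0.742±0.024   0.709±0.045   0.799±0.038
SVM       0.770±0.024   0.754±0.046   0.995±0.003   0.729±0.018   0.770±0.026   0.754±0.046
Lasso     0.694±0.037   0.761±0.018   0.974±0.002   0.732±0.012   0.770±0.035   0.823±0.010
RF        0.726±0.029   0.748±0.012   0.981±0.006   0.754±0.009   0.811±0.019   0.823±0.023
NB        0.742±0.031   0.753±0.028   0.993±0.006   0.765±0.011   0.817±0.023   0.829±0.028
XGBoost   0.760±0.046   0.756±0.040   0.993±0.008   0.781±0.008   0.837±0.030   0.840±0.037
NN        0.755±0.021   0.737±0.023   0.991±0.005   0.745±0.028   0.827±0.025   0.810±0.044
GATOr     0.838±0.019   0.831±0.015   0.999±0.002   0.835±0.010   0.884±0.024   0.848±0.021

2.3.2 Ablation experiments

First, the experiments evaluate the contribution of the embedded features learned by the single-sample network (SSN). The GATOr procedure without the SSN module is denoted GATOr-SSN; that is, the preprocessed features are loaded directly into the next module without the SSN module. The results are listed in Table 3. The complete GATOr procedure outperforms the GATOr-SSN version on both ACC and AUC on all four datasets, so it is necessary to introduce the single-sample network into the feature engineering of OMIC data.

Table 3 Classification contribution of features embedded by single-sample network (SSN)
Network     ACC ROSMAP    ACC LGG       ACC KIPAN     ACC BRCA      AUC ROSMAP    AUC LGG
GATOr-SSN   0.713±0.015   0.728±0.018   0.970±0.009   0.742±0.013   0.723±0.015   0.798±0.011
GATOr       0.838±0.019   0.831±0.015   0.999±0.002   0.835±0.010   0.884±0.024   0.848±0.015

Second, ablation experiments evaluate the contribution of the main GATOr modules. The baseline model is the graph attention network GAT. The GATOr networks without the local and without the global attention mechanism are denoted GATOr-Local and GATOr-Global, respectively. These three graph networks are compared with the complete GATOr network by the classification performance of the features they extract. Table 4 lists the classification contribution of the main modules. Removing either module lowers the ACC and AUC values. Removing the local attention mechanism causes the largest drop, which suggests that including only global attention in the GAT network may degrade the classification performance of the extracted features. Introducing both the global and the local attention mechanisms contributes positively to the baseline GAT network, and even the baseline GAT network extracts more useful information than the GATOr-SSN procedure in Table 3, yielding better classification performance.

Table 4 Classification contribution of major modules of GATOr graph attention network
Network        ACC ROSMAP    ACC LGG       ACC KIPAN     ACC BRCA      AUC ROSMAP    AUC LGG
GATOr-Local    0.735±0.010   0.760±0.018   0.965±0.010   0.748±0.009   0.818±0.022   0.791±0.024
GATOr-Global   0.752±0.012   0.782±0.014   0.997±0.002   0.756±0.012   0.865±0.012   0.838±0.012
GAT            0.757±0.013   0.782±0.022   0.997±0.002   0.752±0.015   0.840±0.018   0.825±0.023
GATOr          0.838±0.019   0.831±0.015   0.999±0.002   0.835±0.010   0.884±0.024   0.848±0.021

In summary, this paper proposes a graph attention network that combines local and global attention mechanisms to learn useful information from single-sample networks of omics data. A single-sample network is built for every sample of the omics data, and the graph attention network with local and global attention learns graph-based omics feature representations from it for the class prediction task. Experimental results show that even the baseline graph attention network outperforms the original omics features on classification tasks, and that fusing local and global attention further improves classification performance.

References
[1] MISRA B B, LANGEFELD C, OLIVIER M, et al. Integrated Omics: Tools, Advances and Future Approaches [J]. Journal of Molecular Endocrinology, 2019, 62(1): R21-R45.
[2] ZHANG Y, SUN H, MANDAVA A, et al. FastMix: A Versatile Data Integration Pipeline for Cell Type-Specific Biomarker Inference [J]. Bioinformatics, 2022, 38(20): 4735-4744.
[3] JONSSON P F, BATES P A. Global Topological Features of Cancer Proteins in the Human Interactome [J]. Bioinformatics, 2006, 22(18): 2291-2297.
[4] WANG J, HUO K K, MA L X, et al. Toward an Understanding of the Protein Interaction Network of the Human Liver [J]. Molecular Systems Biology, 2011, 7(1): 536-1-536-10.
[5] LIU X P, WANG Y T, JI H B, et al. Personalized Characterization of Diseases Using Sample-Specific Networks [J]. Nucleic Acids Research, 2016, 44(22): e164-1-e164-18.
[6] NETZER M, WEINBERGER K M, HANDLER M, et al. Profiling the Human Response to Physical Exercise: A Computational Strategy for the Identification and Kinetic Analysis of Metabolic Biomarkers [J]. Journal of Clinical Bioinformatics, 2011, 1: 1-6.
[7] FANG X C, NETZER M, BAUMGARTNER C, et al. Genetic Network and Gene Set Enrichment Analysis to Identify Biomarkers Related to Cigarette Smoking and Lung Cancer [J]. Cancer Treatment Reviews, 2013, 39(1): 77-88.
[8] ZHOU Y Y, ZHOU B, PACHE L, et al. Metascape Provides a Biologist-Oriented Resource for the Analysis of Systems-Level Datasets [J]. Nature Communications, 2019, 10(1): 1523-1-1523-10.
[9] ZENG T, ZHANG W W, YU X T, et al. Big-Data-Based Edge Biomarkers: Study on Dynamical Drug Sensitivity and Resistance in Individuals [J]. Briefings in Bioinformatics, 2016, 17(4): 576-592.
[10] HAMILTON W L, YING R, LESKOVEC J. Representation Learning on Graphs: Methods and Applications [EB/OL]. (2017-09-17)[2022-10-10]. https://arxiv.org/abs/1709.05584.
[11] BRUNA J, ZAREMBA W, SZLAM A, et al. Spectral Networks and Locally Connected Networks on Graphs [EB/OL]. (2013-12-21)[2022-11-08]. https://arxiv.org/abs/1312.6203.
[12] DEFFERRARD M, BRESSON X, VANDERGHEYNST P. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering [J]. Advances in Neural Information Processing Systems, 2016, 29: 3844-3852.
[13] KIPF T N, WELLING M. Semi-supervised Classification with Graph Convolutional Networks [EB/OL]. (2016-09-09)[2022-11-12]. https://arxiv.org/abs/1609.02907.
[14] VELIČKOVIĆ P, CUCURULL G, CASANOVA A, et al. Graph Attention Networks [EB/OL]. (2017-10-30)[2022-08-21]. https://arxiv.org/abs/1710.10903.
[15] LIU X P, CHANG X, LIU R, et al. Quantifying Critical States of Complex Diseases Using Single-Sample Dynamic Network Biomarkers [J]. PLoS Computational Biology, 2017, 13(7): e1005633-1-e1005633-10.
[16] YU X T, ZHANG J S, SUN S Y, et al. Individual-Specific Edge-Network Analysis for Disease Prediction [J]. Nucleic Acids Research, 2017, 45(20): e170-1-e170-11.
[17] WALDMANN P. On the Use of the Pearson Correlation Coefficient for Model Evaluation in Genome-Wide Prediction [J]. Frontiers in Genetics, 2019, 10: 899-1-899-4.
[18] HARIDAS V, NI J, MEAGER A, et al. Cutting Edge: TRANK, a Novel Cytokine That Activates NF-κB and c-Jun N-Terminal Kinase [J]. The Journal of Immunology, 1998, 161(1): 1-6.
[19] THEKUMPARAMPIL K K, WANG C, OH S, et al. Attention-Based Graph Neural Network for Semi-supervised Learning [EB/OL]. (2018-05-10)[2022-11-28]. https://arxiv.org/abs/1803.03735.
[20] SUN J, CAI H, ZHU X L, et al. Deep Face Representation Algorithm Based on Dual Attention Mechanism [J]. Journal of Jilin University (Science Edition), 2021, 59(4): 883-890. (in Chinese)
[21] MOSTAFA H, NASSAR M. Permutohedral-gcn: Graph Convolutional Networks with Global Attention [EB/OL]. (2020-05-02)[2022-12-03]. https://arxiv.org/abs/2003.00635.
[22] WANG T X, SHAO W, HUANG Z, et al. MOGONET Integrates Multi-omics Data Using Graph Convolutional Networks Allowing Patient Classification and Biomarker Identification [J]. Nature Communications, 2021, 12(1): 3445-1-3445-13.
(Responsible editor: HAN Xiao)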
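The local and global attention coefficients defined in Eqs. (4) and (6) of the GATOr paper above can be illustrated with a small numerical sketch. It is not the authors' implementation: it assumes the projected features W h_i and the embedded features Φ W h_i are already available as plain lists, and the values of beta, lam, and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def local_attention(projected, neighbors, i, beta=1.0):
    """Eq. (4): softmax over the neighbors of i of beta * cos(Wh_i, Wh_j)."""
    scores = [beta * cosine(projected[i], projected[j]) for j in neighbors[i]]
    return dict(zip(neighbors[i], softmax(scores)))


def global_attention(embedded, i, lam=1.0):
    """Eq. (6): softmax over all nodes of -lam * ||PhiWh_i - PhiWh_j||_2."""
    scores = [-lam * math.dist(embedded[i], e_j) for e_j in embedded]
    return softmax(scores)


# Toy graph with three nodes; node 0 is connected to nodes 1 and 2.
projected = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # stands in for W h_i
embedded = projected                               # stands in for Phi W h_i
neighbors = {0: [1, 2]}

local = local_attention(projected, neighbors, 0)
glob = global_attention(embedded, 0)
print(local, glob)
```

In both cases the node most similar to node 0 receives the largest weight: cosine similarity drives the local weights over neighbors only, while Euclidean distance in the embedding space drives the global weights over every node, including node 0 itself.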
44(10):1483–1490,2001.BIBLIOGRAPHY181 [105]Yehezkel Lamdan and Haim J.Wolfson.Geometric hashing:A general and efficient model-based recognition scheme.In Second International Conference on Computer Vision,pages238–249.IEEE Computer Society Press,1988.[106]Giuseppe Lancia,Robert D.Carr,Brian Walenz,and Sorin Istrail.101optimal PDB structure alignments:A branch-and-cut algorithm for the maximum contact map overlap problem.In RECOMB,pages193–202,2001.[107]Andrew R.Leach.Molecular Modelling:Principles and Applications.Prentice Hall, 2001.[108]Steve Leicester,John Finney,and Robert Bywater.A quantitative representation of molecular surface shape.II:Protein classification using Fourier shape descriptors and classical scaling.Journal of Mathematical Chemistry,16(1):343–365,1994.[109]Christian Lemmen,Claus Hiller,and Thomas Lengauer.RigFit:A new approach to superimposing ligand molecules.Journal of Computer-Aided Molecular Design, 12(5):491–502,1998.[110]Christian Lemmen and Thomas Lengauer.Time-efficientflexible superposition of medium-sized molecules.Journal of Computer-Aided Molecular Design,11(4):357–368,1997.[111]Christian Lemmen and Thomas Lengauer.FLEXS:a method for fastflexible ligand superposition.Journal of Medicinal Chemistry,41(23):4502–4520,1998.[112]Christian Lemmen and Thomas putational methods for the structural alignment of molecules.Journal of Computer-Aided Molecular Design,14:215–232, 2000.[113]Stuart P.Lloyd.Least squares quantization in PCM.IEEE Transactions on Infor-mation Theory,28(2):129–136,1982.[114]Harald Martens and Tormod Naes.Multivariate calibration.Wiley,Chichester,1991.[115]Yvonne C.Martin,Mark G.Bures,Elisabeth Danaher,Jerry DeLazzer,and Is-abella Lico.A fast new approach to pharmacophore mapping and its application to dopaminergic and benzodiazepine agonists.Journal of Computer-Aided Molecular Design,7:83–102,1993.[116]Yvonne C.Martin,Mark G.Bures,and Peter Willett.Searching databases of three-dimensional structures.In Kenny 
B.Lipkowitz,editor,Reviews in Computational Chemistry,volume1,pages213–263.Elsevier Science Publishers B.V.,1990. [117]Brian B.Masek,Arshad Merchant,and James B.Matthew.Molecular shape com-parison of angiotensin II receptor antagonists.Journal of Medicinal Chemistry, 36(9):1230–1238,1993.182BIBLIOGRAPHY [118]Brian B.Masek,Arshad Merchant,and James B.Matthew.Molecular skins:A new concept for the quantitative shape matching of a protein with its small molecule mimics.Proteins,17:193–202,1993.[119]Brian B.Masek,Arshad Merchant,and James B.Matthew.Molecular surface com-parisons.In Philip M.Dean,editor,Molecular Similarity in Drug Discovery.Blackey Academic and Professional,New York,1995.[120]Nelson L.Max and Elizabeth D.Getzoff.Spherical harmonic molecular surfaces.IEEE Computer Graphics and Applications,8(4):42–50,1988.[121]Michael McCool and Eugene Fiume.Hierarchical Poisson disk sampling distribu-tions.Graphics Interface,pages94–105,1992.[122]James J.McGregor.Backtrack search algorithms and the maximal common sub-graph problem.Software-Practice and Experience,12(1):23–34,1982.[123]Alan J.McMahon and Paul M.King.Optimization of Carb´o molecular similarity index using gradient methods.Journal of Computational Chemistry,18(2):151–158, 1997.[124]Colin McMartin and Regine S.Bohacek.Flexible matching of test ligands to a 3d pharmacophore using a molecular superposition forcefield:Comparison of pre-dicted and experimental conformations of inhibitors of three enzymes.Journal of Computer-Aided Molecular Design,9(3):237–250,1995.[125]Holger Meyer,Frank Cordes,and Marcus Weber.ConFlow:A new space-based application for complete conformational analysis of molecules.Unpublished manuscript.[126]Paul G.Mezey.Shape in Chemistry.VCH,1993.[127]Michael ler,Robert P.Sheridan,and Simon K.Kearsley.SQ:A program for rapidly producing pharmacophorically relevant molecular superpositions.Journal of Medicinal Chemistry,42(9):1505–1514,1999.[128]Donald R.Morrison.Patricia-practical algorithm to 
retrieve information coded in alphanumeric.Journal of the Association of Computing Machinery,15(4):514–534, 1968.[129]Richard M.Murray,Zexiang Li,and S.Shankar Sastry.A Mathematical Introduction to Robotic Manipulation.CRC Press,1994.[130]J.Willem M.Nissink,Marcel L.Verdonk,Jan Kroon,Thomas Mietzner,and Ger-hard Klebe.Superposition of molecules:Electron densityfitting by application of Fourier transforms.Journal of Computational Chemistry,18(5):638–645,1997.BIBLIOGRAPHY183 [131]Ruth Nussinov and Haim-J.Wolfson.Efficient detection of three-dimensional struc-tural motifs in biological macromolecules by computer vision techniques.Proceedings of the National Academy of Sciences USA,88:10495–10499,1991.[132]International Union of Biochemistry and Molecular Biology.Enzyme Nomenclature.Recommendations(1992)of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology.Academic Press,Inc.,London,1992.[133]Yumiko Ohta,Yasuyuki Ogura,and Akiyoshi Wada.Thermostable protease from thermophilic bacteria.I.Thermostability,physicochemical properties,and amino acid composition.Journal of Biological Chemistry,241(24):5919–5925,1966.[134]Panos M.Pardalos and Jue Xue.The maximum clique problem.Journal of Global Optimization,4:301–328,1994.[135]Martin F.Parretti,Romano T.Kroemer,Jeffrey H.Rothman,and W.Graham Richards.Alignment of molecules by the Monte Carlo optimization of molecular similarity indices.Journal of Computational Chemistry,18(11):1344–1353,1997. 
[136]M.Pastor,G.Cruciani,I.McLay,S.Pickett,and S.Clementi.GRid-INdependent descriptors(GRIND):A novel class of alignment-independent three-dimensional molecular descriptors.Journal of Medicinal Chemistry,43(17):3233–3243,2000.[137]Daniela Pelz.Functional characterization of Drosophila melanogaster Olfactory Re-ceptor Neurons.Doctoral thesis,Freie Universit¨a t Berlin,2005.[138]Xavier Pennec.Multiple registration and mean rigid shape-Application to the3D case.In K.V.Mardia,C.A.Gill,and Dryden I.L.,editors,Image Fusion and Shape Variability Techniques(16th Leeds Annual Statistical Workshop),pages178–185.University of Leeds,UK,July1996.[139]Catherine A.Pepperrell,Peter Willett,and Robin Taylor.Implementation and use of an atom-mapping procedure for similarity searching in databases of3-d.Tetrahedron Computer Methodology,3:575,1990.[140]Tim D.J.Perkins,ls,and Philip M.Dean.Molecular surface-volume and property matching to superposeflexible dissimilar molecules.Journal of Computer-Aided Molecular Design,9(6):479–490,1995.[141]Andrew R.Poirrette,Peter J.Artymiuk,David W.Rice,and Peter -parison of protein surfaces using a genetic algorithm.Journal of Computer-Aided Molecular Design,11(6):557–569,1997.[142]Matthias Rarey and J.Scott Dixon.Feature trees:A new molecular similarity measure based on tree matching.Journal of Computer-Aided Molecular Design, 12(5):471–490,1998.184BIBLIOGRAPHY [143]John W.Raymond,Eleanor J.Gardiner,and Peter Willett.Rascal:Calculation of graph similarity using maximum common edge subgraphs.The Computer Journal, 45(6):631–644,2002.[144]John W.Raymond and Peter Willett.Effectiveness of graph-based andfingerprint-based similarity measures for virtual screening of2D chemical structure databases.Journal of Computer-Aided Molecular Design,16(1):59–71,2002.[145]John W.Raymond and Peter Willett.Maximum common subgraph isomorphism algorithms for the matching of chemical structures.Journal of Computer-Aided Molecular 
Design,16(7):521–533,2002.[146]Jacqueline D.Reeves and Robert W.Doms.Human immunodeficiency virus type2.The Journal of General Virology,83(6):1253–1265,2002.[147]Penny Rheingans and Shrikant Joshi.Visualization of molecules with positional uncertainty.In E.Gr¨o ller,H.L¨offelmann,and W.Ribarsky,editors,Data Visual-ization’99,Proceedings of the Joint EUROGRAPHICS-IEEE TCVG Symposium on Visualization,pages299–306.Springer,Vienna,1999.[148]F.M.Richards.Areas,volumes,packing and protein structure.Ann.Rev.Biophys.Bioeng.,6:151–176,1977.[149]Isidore Rigoutsos,Daniel E.Platt,Andrea Califano,and David Silverman.Represen-tation and matching of smallflexible molecules in large databases of3d molecular information.In Pattern Discovery in Biomolecular Data,pages111–129.Oxford University Press,1999.[150]David W.Ritchie and Graham J.L.Kemp.Fast computation,rotation,and com-parison of low resolution spherical harmonic molecular surfaces.Journal of Compu-tational Chemistry,20(4):383–395,1999.[151]S.Kashif Sadiq,Stefan J.Zasada,and Peter V.Coveney.Grid assisted ensemble molecular dynamics simulations of HIV-1proteases reveal novel conformations of the inhibitor saquinavir.In Computational Life Sciences:Second International Sympo-sium,CompLife2006,volume4216of Lecture Notes on Computer Science,pages 151–161,Cambridge,UK,2006.Springer.[152]Hanan Samet.An overview of quadtrees,octrees,and related hierarchical data struc-tures.In Rae A.Earnshaw,editor,Theoretical Foundations of Computer Graphics and CAD,pages51–68.Springer,Berlin,Heidelberg,1988.[153]Martin Saunders.Stochastic exploration of molecular mechanics energy sur-faces.hunting for the global minimum.Journal of American Chemical Society, 109(10):3150–3152,1987.[154]Kristina Sch¨a dler and Fritz Wysotzki.A connectionist approach to structural simi-larity determination as a basis of clustering,classification and feature detection.In。
GMAP: a genomic mapping and alignment program for mRNA and EST sequences
BIOINFORMATICS ORIGINAL PAPER, Vol. 21 no. 9 2005, pages 1859–1875, doi:10.1093/bioinformatics/bti310

Sequence analysis

Thomas D. Wu 1,∗ and Colin K. Watanabe 2

1 Department of Bioinformatics and 2 Department of Corporate Information Technology, Genentech, Inc., South San Francisco, CA 94080, USA

Received on February 24, 2004; revised on January 27, 2005; accepted on February 4, 2005
Advance Access publication February 22, 2005

ABSTRACT

Motivation: We introduce gmap, a standalone program for mapping and aligning cDNA sequences to a genome. The program maps and aligns a single sequence with minimal startup time and memory requirements, and provides fast batch processing of large sequence sets. The program generates accurate gene structures, even in the presence of substantial polymorphisms and sequence errors, without using probabilistic splice site models. Methodology underlying the program includes a minimal sampling strategy for genomic mapping, oligomer chaining for approximate alignment, sandwich DP for splice site detection, and microexon identification with statistical significance testing.

Results: On a set of human messenger RNAs with random mutations at a 1 and 3% rate, gmap identified all splice sites accurately in over 99.3% of the sequences, which was one-tenth the error rate of existing programs. On a large set of human expressed sequence tags, gmap provided higher-quality alignments more often than blat did. On a set of Arabidopsis cDNAs, gmap performed comparably with GeneSeqer.
In these experiments, gmap demonstrated a several-fold increase in speed over existing programs.

Availability: Source code for gmap and associated programs is available at /share/gmap

Contact: twu@

Supplementary information: /share/gmap

∗To whom correspondence should be addressed.

INTRODUCTION

Mapping and alignment of cDNA sequences, both messenger RNAs (mRNAs) and expressed sequence tags (ESTs), onto the genome has become a central procedure in genome research. The resulting cDNA–genomic alignments not only reveal the intron–exon structure of genes, but also facilitate the study of splicing mechanics and such transcript-based phenomena as alternative splicing, single nucleotide polymorphisms, and cDNA insertions and deletions (Jiang and Jacob, 1998; Irizarry et al., 2000; Kan et al., 2001, 2002; Zavolan et al., 2002; Modrek and Lee, 2002; Clamp et al., 2003; Wheeler et al., 2003; Drabenstot et al., 2003; Kim et al., 2004; Florea et al., 2005).

To address these needs, programs such as ssaha (Ning et al., 2001) have been introduced to map cDNA sequences to a genome. Other programs have been developed to align a cDNA to a given genomic segment, including est_genome (Mott, 1997), dds/gap2 (Huang, 1996), sim4 (Florea et al., 1998), Spidey (Wheelan et al., 2001), GeneSeqer (Usuka et al., 2000; Schlueter et al., 2003) and MGAlign (Lee et al., 2003; Ranganathan et al., 2003). Finally, some recent integrated programs, such as blat (Kent, 2002) and squall (Ogasawara and Morishita, 2002), perform both genomic mapping and alignment.

Despite the availability of these programs, achieving perfection in cDNA–genomic alignment has been surprisingly elusive. Studies of existing programs have revealed various types of errors in identifying gene structures and splice sites (Haas et al., 2002). In compiling a database of EST-based splice sites, researchers have reportedly had to resort to manual curation of alignments to obtain the correct results (Burset et al., 2001). Difficulties generally arise when a cDNA sequence differs from its corresponding genomic
exons, due to polymorphisms, mutations or sequencing errors. Sequencing errors are especially prevalent in ESTs, where error rates are estimated to be 1.5% for high-quality sequences (Zhuo et al., 2003) and 3–4% overall (Richterich, 1998). Such sequencing errors, especially near exon–exon junctions, can complicate the detection of splice sites. One approach to this situation has been to combine information across various alignments (Birney et al., 2004; Haas et al., 2003; Brendel et al., 2004) or even multiple sources of evidence (Allen et al., 2004) to arrive at a consensus answer. However, since such programs depend ultimately upon the original solutions generated by cDNA–genomic alignment programs, advances in the underlying alignment methodology are still important.

In this paper, we introduce an integrated genomic mapping and alignment program called gmap (Genomic Mapping and Alignment Program). In contrast to programs designed primarily to run in client/server mode, such as blat and squall, our program operates as a traditional standalone program. Gmap provides not only improved performance over existing programs in terms of speed and accuracy, but also enhanced functionality. The functionality provided by gmap allows a user to: (1) map and align a single cDNA interactively against a large genome in about a second, without the startup time of several minutes typically needed by existing mapping programs; (2) switch arbitrarily among different genomes, without the need for a pre-loaded server dedicated to each genome; (3) run the program on computers with as little as 128 MB of RAM (random access memory); (4) perform high-throughput batch processing of cDNAs by using memory mapping and multithreading when appropriate memory and hardware are available; (5) generate accurate gene models, even in the presence of substantial polymorphisms and sequence errors; (6) locate splice sites accurately without the use of probabilistic splice site models, allowing generalized use of the program across species; (7) detect statistically significant microexons and incorporate them into the alignment; and (8) handle mapping and alignment tasks on genomes having alternate assemblies, linkage groups or strains.

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@

In the remainder of the paper, we review existing work on cDNA–genomic mapping and alignment, and describe the methods underlying gmap. Next we provide examples of how these methods in gmap lead to improved splice site and gene structure prediction. Then we compare the performance of gmap with existing programs in three large-scale experiments. In experiment 1, we test for robustness to sequence error by using test sets of human mRNAs with computationally simulated sequence errors. In experiment 2, we examine mapping and alignment quality for human ESTs with naturally occurring sequence errors. In experiment 3, we evaluate the performance of gmap on another species, namely, the plant Arabidopsis thaliana.
Finally, we describe the implementation of gmap and additional features provided by the program.

RELATED WORK

One approach to cDNA–genomic alignment has been to use general sequence alignment programs, such as blast (Altschul et al., 1990), and then to assemble the resulting hits into gene structures (Gelfand et al., 1996; Wiehe et al., 2001; Milanesi and Rogozin, 2003; Zhang, 2003; Yeo et al., 2004). However, the cDNA–genomic alignment problem is important enough to warrant programs specialized for the task. The particular problem that arises in cDNA–genomic alignment is the presence of introns, which appear as large genomic gaps of up to hundreds of thousands of nucleotides in length. Introns have characteristic patterns at their splice sites, which cDNA–genomic alignment programs must take into account. About 99% of introns are bounded on their ends by the canonical dinucleotide pair GT–AG; the remainder have a semi-canonical dinucleotide pair GC–AG or AT–AC, or another, non-canonical dinucleotide pair (Burset et al., 2000). Probabilistic patterns of conservation are also seen at positions further away from the intron–exon boundary (Mount, 1981; Senapathy et al., 1990; Solovyev, 2002).

Existing programs for cDNA–genomic mapping and alignment, cited in the Introduction, provide a foundation for further advances. In particular, gmap draws upon three fundamental concepts introduced by earlier programs. First, gmap uses an oligomer index table for genomic mapping. Second, gmap takes a hierarchical approach to genomic alignment, by first computing an approximate alignment and then filling in the details. Finally, like almost all existing alignment programs, gmap applies specific methods tailored for detecting splice sites and for incorporating them into the alignment.
Although essentially all cDNA–genomic mapping and alignment programs share these fundamental building blocks, they differ in their particular methods for implementing them; it is these methodological choices that largely account for differences in their performance. In the Algorithm section, we provide a detailed description of the specific methods underlying gmap; in the rest of this section, we summarize the basic similarities and differences of our methods relative to existing ones.

Genomic mapping

Genomic mapping can be accomplished rapidly because of the near-identity between a cDNA sequence and its corresponding genomic exons, which manifests as regions of exact matches. Existing programs exploit this fact either by finding clusters of relatively short oligomers, such as 11-mers (blat) or 14-mers (ssaha and squall), or by using fewer long oligomers. The long oligomer approach is exemplified by MGAlign (Ranganathan et al., 2003): although it does not perform mapping on a genomic scale, it initially aligns a cDNA to a given genomic segment by scanning 20-mers from the ends of the cDNA. Similarly, rapid mapping is provided by MUMmer (Delcher et al., 1999, 2002), which uses suffix trees (Manber and Myers, 1993) to find long unique matches between genomes, and MegaBlast (Zhang et al., 2000), which uses 28-mers to identify sequence matches.

Existing cDNA–genomic mapping programs that use an oligomer index on a genomic scale begin by pre-loading the index into memory, which means that these programs not only have a long startup time, but also require computers with large amounts of dedicated RAM. For example, squall requires 12 GB of RAM, and the standalone version of blat requires 8 GB of RAM in order to map a cDNA sequence onto the entire human genome. The startup time for the standalone version of blat is several minutes, which makes it inconvenient for a researcher who wishes to map a single cDNA sequence to a genome, or who wishes to switch quickly among different genomes or versions of a
genome. Therefore, blat typically runs in a client–server mode, in which a dedicated server for a particular genome keeps its genomic oligomer files resident in RAM. A blat server, which also requires several minutes of startup time, needs 1.2 GB of RAM to process the human genome, and must be kept running continuously to answer queries from a client computer.

In contrast, gmap is a standalone program that has been designed to handle individual queries rapidly, with essentially no startup time. Instead of pre-loading the entire oligomer index file into memory, gmap looks up oligomers as needed directly from the file. Because access to files is much slower than to memory, our file-based strategy is enabled by a minimal sampling strategy that attempts to perform as few oligomer lookups as possible, while still mapping reliably to an entire genome. Our sampling strategy involves more than scanning long oligomers from the ends of a cDNA to find a matching pair. Because our mapping universe is an entire genome, we must safeguard against false mapping results from the initial matching pair, which can arise due to paralogs, pseudogenes and segmental duplications in the genome (Wheelan et al., 2001; Bailey et al., 2002; Zhang and Gerstein, 2004). Therefore, reliable matching on a genomic scale requires additional steps, such as accumulating additional oligomer evidence beyond the first matching pair; monitoring when the number of candidate locations has been limited adequately; and sampling adaptively to extract information from different parts of the cDNA sequence, including the middle when necessary.

Approximate alignment

An approximate alignment step is necessitated by the large size of genomic segments, which makes a nucleotide-level alignment prohibitively time-consuming, and is therefore used in some form by virtually every cDNA–genomic alignment program. In est_genome, approximate alignments are computed by using local Smith and Waterman (1981) alignments and the resulting segments are then recomputed
with a global Needleman and Wunsch (1970) alignment. Spidey computes an alignment with increasing detail by performing successive blast runs at decreasing stringency levels.

In other programs, the predominant strategy has been a 'seed-and-extend' strategy, in which the program first finds significant oligomer matches between the cDNA and genomic segment, then extends these seeds to form longer matching fragments, and finally assembles a selection of these fragments into a collinear chain. The seed-and-extend strategy is found in a variety of programs, including those for genome–genome alignment (Chain et al., 2003; Morgenstern, 1999; Batzoglou et al., 2000; Kent and Zahler, 2000; Schwartz et al., 2000; Ma et al., 2002; Brudno et al., 2003a,b; Bray et al., 2003; Kalafus et al., 2004), and constitutes the approach in several cDNA–genomic alignment programs. Sim4 finds matching seeds of 12-mers in the genomic segment, extends these seeds by nucleotide-level scoring of matches and mismatches, and then assembles the resulting 'exon cores' through dynamic programming (DP). MGAlign also applies DP, both to extend its fragments and to combine local alignments into longer ones. Blat breaks the cDNA into 500-bp chunks, uses these chunks to create alignment fragments through a recursive seed-and-extend method, and then uses DP to stitch together these subalignments.

In contrast, gmap uses an oligomer chaining method that involves neither seeds nor extensions. Rather, this method finds all matching 8-mers between the cDNA and genomic sequence, and then uses DP to find an optimal global chain of 8-mers. In this process, exons are not created explicitly, but instead emerge implicitly from the globally optimal distribution of 8-mer matches between the cDNA and genomic segment. Although exon–exon boundaries are defined only approximately by this method, their location is determined by both distant alignment information and local information. Oligomer chaining may extend an exon alignment that
otherwise looks locally unfavorable, or terminate an exon alignment that otherwise looks locally favorable, when such decisions contribute toward a better global alignment. We have found that the use of global information is particularly important in the presence of sequence polymorphisms or errors, which can adversely affect local decision-making for extending fragments.

Splice site identification

Approximate alignment using 8-mers or other fragments generally does not have the resolution needed at the nucleotide level to detect splice sites accurately. To recognize splice sites correctly in the presence of sequence errors, a program must often introduce substitutions or gaps, shift nucleotides from one end of the intron to the other, or explore alternate locations for the splice sites.

Existing approaches to splice site identification are based upon two ideas. The first idea is to apply various heuristics to fix or adjust the approximate alignment to incorporate a splice site. For example, Spidey and MGAlign search for splice sites in the overlap between adjacent exons, and then trim the exons at the highest-scoring splice site, whereas sim4 has an intron shifting procedure that adjusts the exon–exon junction to find the best pair of splice sites. The other idea is to use splice site models, such as scoring matrices (Salzberg, 1997; Brendel and Kleffe, 1998), which model the observed frequency of nucleotides near the 5′ and 3′ splice sites (Nakata et al., 1985; Gelfand, 1989) and thereby provide clues about the presence and location of splice sites.

In contrast, gmap handles this problem by using a formal DP procedure that we call sandwich DP. Sandwich DP involves two DP matrices, one for each end of an intron, and attempts to find the best alignment path across the diagonals of both matrices. Rather than attempting to fix an existing approximate alignment, the method computes the whole subalignment in the region surrounding an intron. This approach guarantees that all possible combinations of
substitutions, gaps and intron shifts are considered, and permits the use of various DP techniques. These techniques include specialized gap penalties that favor insertions or deletions of trinucleotides (Gotoh, 1999) and band-limited alignment (Sankoff and Kruskal, 1999), which enables efficient consideration of substitutions or gaps at a large distance away from the splice site.

Microexons

In addition to the above features, gmap has an explicit procedure for detecting microexons and incorporating them into the alignment. Microexons as short as 1 nucleotide in length have found apparent experimental support (McAllister et al., 1992; Sterner and Berget, 1993; Simpson et al., 2000; Carlo et al., 2000), and a computational study suggests that between 0.5 and 1.6% of mRNA sequences in various species contain microexons (Volfovsky et al., 2003). Such short exons pose an acknowledged problem for cDNA–genomic alignment programs (Florea et al., 1998). A procedure for identifying microexons has been developed by Volfovsky et al. (2003), and applied in a large-scale study. We further this work by integrating the detection procedure into the framework of a cDNA–genomic alignment program, and by adding a probabilistic extension that ensures that incorporated microexons are statistically significant.
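
The oligomer chaining idea summarized under Approximate alignment above (collect all matching 8-mers, then use DP to select a globally optimal collinear chain) can be illustrated with a small sketch. This is not gmap's implementation: the function names `kmer_matches` and `best_chain`, the scoring (one point per chained 8-mer) and the fixed `diagonal_change_penalty` are assumptions chosen only to make the idea concrete, and the quadratic DP is written for clarity rather than speed.

```python
from collections import defaultdict


def kmer_matches(cdna, genomic, k=8):
    """Return sorted (cdna_pos, genomic_pos) pairs whose k-mers are identical."""
    index = defaultdict(list)
    for g in range(len(genomic) - k + 1):
        index[genomic[g:g + k]].append(g)
    matches = []
    for c in range(len(cdna) - k + 1):
        for g in index.get(cdna[c:c + k], ()):
            matches.append((c, g))
    matches.sort()
    return matches


def best_chain(matches, diagonal_change_penalty=0.5):
    """Pick the best-scoring collinear chain of matches by quadratic DP.

    Assumed scoring (hypothetical, not taken from the paper): each chained
    match adds 1; moving to a different diagonal (genomic_pos - cdna_pos),
    as happens across an intron or an indel, costs a fixed penalty.
    """
    n = len(matches)
    if n == 0:
        return []
    score = [1.0] * n   # best chain score ending at match j
    prev = [-1] * n     # back-pointer to the preceding match in that chain
    for j in range(n):
        cj, gj = matches[j]
        for i in range(j):
            ci, gi = matches[i]
            if ci < cj and gi < gj:  # collinearity in both sequences
                same_diagonal = (gj - cj) == (gi - ci)
                penalty = 0.0 if same_diagonal else diagonal_change_penalty
                candidate = score[i] + 1.0 - penalty
                if candidate > score[j]:
                    score[j] = candidate
                    prev[j] = i
    j = max(range(n), key=score.__getitem__)
    chain = []
    while j != -1:
        chain.append(matches[j])
        j = prev[j]
    chain.reverse()
    return chain
```

In such a chain, runs of matches that share a diagonal play the role of approximate exons, and a jump to a larger diagonal marks a candidate intron, so exons are never constructed explicitly. The exact splice site is left unresolved at this stage, which is the gap that the splice site identification step is meant to close.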
ALGORITHM

In this section, we discuss the methods used by gmap in the context of each of the major components needed for cDNA–genomic mapping and alignment. Specifically, we describe: (1) a minimal sampling strategy for genomic mapping, (2) oligomer chaining for generating approximate gene structures, (3) sandwich DP for identifying splice sites, and (4) microexon identification with statistical significance testing.

Minimal sampling strategy

For genomic mapping, gmap uses a sampling strategy designed to minimize the number of oligomer lookups needed to map a cDNA reliably to the genome. Our minimal sampling strategy is based upon the use of long oligomers to achieve high specificity, combined with an adaptive sampling scheme to utilize mapping evidence from different parts of the cDNA sequence.

As discussed previously, the rationale for using long oligomers is their exponentially greater specificity in the genome, which means that mapping can be performed with few oligomer matches. Our choice of 24 as an oligomer length is guided by our own study of oligomer uniqueness in the human genome, as shown in Figure 1.
This graph, based on the unmasked portion of the NCBI human genome (build 29), shows the percentage of the observed oligomers of various lengths that are unique in the genome. For example, among all 11-mers in the genome, only 0.1% of them have a unique position in the genome. Likewise, among all 14-mers, only 22.5% specify a unique position in the genome. On the other hand, when the oligomer length is 20 or more, the percentage of oligomers with a unique genomic location reaches an asymptotic level of 96–97%.

Fig. 1. Distribution of oligomers of various lengths in the masked region of the human genome (NCBI build 29). The horizontal axis represents various oligomer sizes from 11 to 25. The total space of possible oligomers increases exponentially, as shown by the exponentially increasing line. For each oligomer size, counts of all overlapping oligomers in the masked part of the human genome are shown by the top line, and the counts of distinct oligomers are shown by the topmost sigmoid line. Distinct oligomers can be divided into unique oligomers, which occur once (shown by the sigmoid line with percentages), and repeated oligomers, which occur more than once (shown by the bottom line).

Our implementation of 24-mer lookups on a genomic scale requires some adaptation of the index table scheme of ssaha (Ning et al., 2001). In that scheme, a position file contains the observed positions of oligomers in the genome, and an offset file contains pointers to the position file to indicate where a block of positions begins and ends for a given oligomer. Because this offset file contains an entry for each possible oligomer, its size grows exponentially with the oligomer length. In fact, 14-mers represent the current practical limit for the ssaha data structure, because the corresponding index file occupies 1.1 GB. Extending this indexing scheme to 24-mers would yield a sparse offset file of 4^24 = 281 trillion 32-bit entries, which would be prohibitively large to store.

Therefore, in our initial
implementation of gmap, we tried a hashing scheme instead, where the space of 24-mers is mapped onto a space of 12-mers using a hash function. If a given 24-mer has a match somewhere in the genome, an entry for the 24-mer can be found in the expected hash bin. This entry then provides the appropriate offset into the position file.

Although this hashing scheme worked reasonably well, we subsequently found a more efficient solution by using a double lookup scheme, which breaks up the problem of finding a 24-mer into the problem of finding two 12-mers. In other words, we implement the ssaha data structure for 12-mers, with the requirement that entries in the position table be pre-sorted in ascending numeric order within each oligomer. To find the positions for a given 24-mer, we look up two lists of genomic positions, one for the initial 12-mer and one for the terminal 12-mer. The desired set of 24-mer genomic locations is obtained by finding pairs of entries in these two lists that are separated by 12 nucleotides. The reason for pre-sorting the genomic positions within each oligomer is to make this procedure run in linear time with the number of genomic positions, rather than quadratic.

The size of the position file is determined by the genome size and by how often oligomers are sampled in the genome. Although minimal coverage of the genome can be achieved by sampling all non-overlapping 12-mers in the genome, an overlapping sampling interval provides increased resolution, but at the cost of a larger position file. An additional advantage of overlapping sampling intervals in our scheme is that it permits lookups of oligomers other than 12-mers and 24-mers. For example, if we store 12-mers at an overlapping interval of 6 (which is our default), we can determine the genomic location of oligomers of length 12, 18, 24, and so on. These intermediate-length oligomers can be useful in genomic mapping. The use of 18-mers can give additional sensitivity for divergent sequences, such as in the
cross-species genomic mapping of mouse cDNAs onto the human genome, and vice versa. In addition, short cDNA sequences often have too few 24-mers for reliable genomic mapping. In these cases, the program uses smaller oligomers: 18-mers if the cDNA is between 40 and 80 nt, and 12-mers if it is less than 40 nt.

In addition to using highly specific 24-mers, gmap employs an adaptive sampling scheme designed to utilize mapping information from different parts of the cDNA sequence. The sampling process begins by scanning both ends of the cDNA sequence, and monitoring the results until a pair of 24-mers match to approximately the same location in the genome. The definition of 'same location' depends upon the length of the query cDNA, with an allowed genomic expansion of 1000 times the query length, subject to a default upper limit of 1 million nucleotides. Therefore, the program will not attempt to predict a long intron for a very short EST.

To avoid false localizations from a fortuitous pair of matches to the genome, the program continues to sample beyond the first pair of successful hits, in order to accumulate evidence of other possible localizations in the genome. This amount of further sampling is determined both by a minimum distance (default 48 nt) and by a minimum number of additional successful matches (default 3) required. If this process yields a limited number of genomic locations, the mapping process terminates.

On the other hand, if there are a large number of candidate genomic locations, then gmap begins a sampling process that uses information from the middle of the cDNA sequence. This sampling process is performed iteratively, with the sampling interval halved in each round. At each sampling interval, the program looks for clusters on the genome with a high concentration of matches, with the provision that genomic positions be collinear with the cDNA positions. Sampling terminates when the correct genome location is resolved to a limited number of good candidates. This determination is made
by setting a threshold at 70% of the number of matches in the best cluster, and requiring that only a limited number of clusters (currently defined as 10 or fewer) are above this threshold.

For each candidate cluster of 24-mers, the program extracts the corresponding segment from the genome, with the correct strand of the genome determined by the orientation of the matching 24-mers. To extend the genomic segment to regions that may be relevant for further alignment at the oligomer and nucleotide level, the program looks up the genomic positions of the nearest 12-mers that match to the ends of the cDNA sequence.

Oligomer chaining

For approximate alignment, oligomer chaining attempts to find a path of 8-mers that match between the cDNA sequence and each genomic segment found in the mapping step. The procedure is illustrated in the top part of Figure 2. Instead of the standard DP paradigm, which uses a matrix to align two sequences, oligomer chaining uses an equivalent but more efficient representation in the form of an array of linked lists. Each position in the array corresponds to an overlapping 8-mer in the cDNA sequence, and each 8-mer has a linked list of positions in the genomic segment where that 8-mer is found. These linked lists are represented in the figure as a vertical stack of cells at each cDNA position. Each cell also contains placeholders for the optimal subscore to that point and for a pointer to the best previous cell that produced the optimal subscore.

Fig. 2. Oligomer chaining and nucleotide-level alignment. The top part of the figure shows oligomer chaining. The horizontal axis represents positions on the cDNA sequence. Each cDNA position may have one or more matches of 8-mers to the genomic segment, represented by a vertical stack of cells. For each cell, the DP procedure looks for an optimal previous cell, as represented by thin diagonal lines between cells. The highest-scoring chain of 8-mer matches, represented by a thick line, describes the optimal approximate alignment. This alignment may contain jumps in cDNA or genomic coordinates, due to introns, cDNA insertions or sequence differences. These jumps are resolved by various nucleotide-level alignment procedures, represented in the bottom of the figure by various DP matrices. Sandwich alignments bridge large coordinate jumps across introns (horizontal dashed line) or long cDNA insertions (vertical dashed line). The existence of short exons is resolved by an exon testing procedure that compares alignments with and without the short exon.

The array of linked lists is generated by first pre-scanning the cDNA for overlapping 8-mers and noting which 8-mers are present, and hence relevant. This pre-scan prevents unnecessary work later, because most of the 8-mers in the longer genomic sequence are irrelevant. Then the algorithm scans the genomic segment for relevant 8-mers and adds their genomic positions to a list maintained for each relevant 8-mer. Finally, the algorithm scans the cDNA again, making a copy of the appropriate position list for each element of the array.

After building this data structure, oligomer chaining proceeds with a DP procedure that assigns a subscore and pointer to each cell, starting from the beginning of the cDNA sequence. For each cell, the algorithm looks backward to cells at previous cDNA positions to identify the cell that both is consistent and generates a maximal score to the given cell. A previous cell is consistent if its genomic position is lower than that of the given cell, which enforces collinearity of the cDNA and genomic sequences. The score for the cell is the score of the previous cell plus 1 to indicate the length of the chain. Because introns will cause 8-mers in the cDNA not to match, the algorithm compensates for such cases by adding enough points to ensure that local extension does not gain an unwarranted advantage over an intron.

One cost of our approach is greater computational complexity than one based on larger fragments. As
described so far, oligomer chaining is O(m^2 g^2), where m is the length of the cDNA and g is the average number of cells per linked list, which is generally proportional to the length of the genomic segment. (A total of mg cells must be processed, and at each cell the algorithm must look back at the previous set of cells processed.) In order to reduce the complexity to O(m g^2), we impose a sufficiency limit on the look backward. Note that this limit applies only to the cDNA sequence coordinates; there is no limitation on the look backward in genomic sequence coordinates.

The sufficiency limit has a default value of 60, which expresses our calculated expectation that we should find at least one matching 8-mer between the cDNA and genome within that distance, even accounting for extremely low sequence quality. By using probability calculations based on finite-state automata (Atteson, 1998), we estimate that if the sequence error rate is 5%, then the chance of failing to have an error-free stretch of 8 nucleotides out of 60 total nucleotides is 3.8 × 10^−6. The pointer and optimal subscore for a given cell are based on the best solution found within this sufficiency limit.

However, a cDNA sequence may have a local concentration of mismatches or gaps that precludes 8-mers from being identified in a particular stretch. Therefore, if no matching 8-mer is found within the sufficiency limit, the algorithm will continue looking backward as far as needed to find a match. This provision allows the algorithm to
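The probability estimate behind the sufficiency limit can be approximated with a small finite-state calculation. The sketch below is an assumption-laden illustration, not the cited Atteson (1998) method: it treats each nucleotide as independently erroneous with a fixed probability and tracks the length of the current error-free run as the automaton state. The function name `prob_no_clean_run` is hypothetical.

```python
def prob_no_clean_run(n: int, k: int, err: float) -> float:
    """Probability that n positions contain no error-free run of length k.

    Assumes independent errors with probability `err` per position.
    state[r] = probability that the current error-free run has length r
    and no run of length k has occurred so far.
    """
    state = [0.0] * k
    state[0] = 1.0
    for _ in range(n):
        new = [0.0] * k
        for run, p in enumerate(state):
            new[0] += p * err                 # an error resets the run
            if run + 1 < k:
                new[run + 1] += p * (1 - err)  # run grows but stays below k
            # otherwise the run reaches k and the mass leaves the failure set
        state = new
    return sum(state)


if __name__ == "__main__":
    # Defaults taken from the text: 8-mers, look-back of 60 nt, 5% error rate.
    print(prob_no_clean_run(60, 8, 0.05))
```

Under these assumptions the result can be compared with the figure quoted above; a different error model (for example, clustered errors) would give a different value.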
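Steps for Gene Expression Profile Analysis Using the Biological Big Data Center Database
The Biological Big Data Center database is a powerful tool that can be used to analyze gene expression profiles.
Before carrying out a gene expression profile analysis, we need to be clear about several steps.
This article describes in detail how to use the Biological Big Data Center database for gene expression profile analysis.
The first step is to register an account with the Biological Big Data Center database and log in.
An account is required before any analysis can be run.
You can register on the database's official website.
After filling in your personal information, user name, and password, you will receive an account.
Once logged in, you can access the database's functions and tools.
The second step is to select a suitable gene expression dataset.
The database holds many gene expression datasets, and you can choose one according to your research needs.
Datasets are usually classified by species, tissue type, and disease state.
For example, if your research focuses on the gene expression profile of human heart tissue, you can choose a dataset that contains heart tissue samples.
The third step is to import and preprocess the gene expression data.
Once you have selected an appropriate dataset, you can download its raw data as needed.
Raw data are usually provided as text files or Excel files.
Before importing the data, you may need to carry out some preprocessing steps, such as removing noise, normalizing, or filtering out genes that are not of interest.
These preprocessing steps can be carried out with the tools provided in the database.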
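The preprocessing described in the third step can be sketched locally as well. The example below is a minimal illustration, not the database's own tooling: it assumes a raw count table held as a dictionary of gene name to per-sample counts, scales each sample to counts per million, applies log2(x + 1), and drops genes that are low in most samples. The function names `cpm_log2` and `filter_low` and the thresholds are hypothetical.

```python
import math


def cpm_log2(counts: dict[str, list[float]]) -> dict[str, list[float]]:
    """Scale each sample to counts per million, then apply log2(x + 1)."""
    n_samples = len(next(iter(counts.values())))
    # Library size of each sample; fall back to 1 to avoid division by zero.
    totals = [sum(row[j] for row in counts.values()) or 1 for j in range(n_samples)]
    return {
        gene: [math.log2(v / totals[j] * 1e6 + 1) for j, v in enumerate(row)]
        for gene, row in counts.items()
    }


def filter_low(expr: dict[str, list[float]], min_value: float, min_samples: int):
    """Keep genes whose value reaches `min_value` in at least `min_samples` samples."""
    return {
        gene: row
        for gene, row in expr.items()
        if sum(v >= min_value for v in row) >= min_samples
    }


if __name__ == "__main__":
    raw = {"geneA": [500000, 0], "geneB": [500000, 1000000]}  # toy counts
    normalized = cpm_log2(raw)
    print(filter_low(normalized, min_value=1.0, min_samples=2))
```

Counts per million is only one of several normalization choices; the appropriate method depends on the platform that produced the dataset.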
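The fourth step is to carry out the gene expression profile analysis.
The Biological Big Data Center database provides a range of analysis tools that can help you understand gene expression profiles.
These include differential expression analysis, gene co-expression network analysis, and functional enrichment analysis.
Differential expression analysis helps you identify genes whose expression levels differ significantly between samples.
Gene co-expression network analysis helps you discover modules of genes that are expressed together in similar tissues or conditions.
Functional enrichment analysis helps you understand which biological processes and signaling pathways are involved in the regulation of these genes.
These tools can be combined and adjusted flexibly according to your research needs.
The fifth step is to interpret and present the analysis results.
Once the gene expression profile analysis is complete, you will have a large number of results, including lists of differentially expressed genes, co-expressed gene modules, and functional enrichment results.
Interpreting and presenting these results is essential for reaching meaningful conclusions.
The Biological Big Data Center database usually provides functions for data visualization and for exporting analysis results.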
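Using Biological Big Data Technology to Analyze the Topology and Modularity of Gene Regulatory Networks
Biological big data technology has developed rapidly in recent years. By collecting, organizing, and analyzing large-scale biological data, it allows an in-depth understanding of the modes and mechanisms of gene regulation in organisms.
The topology and modularity of gene regulatory networks are among the key research topics in this field.
This article uses biological big data technology as the basis for analyzing the topology and modularity of gene regulatory networks.
A gene regulatory network is a complex network built from the interactions between genes and the genes that regulate them.
By studying the topology of a gene regulatory network, we can explore the relationships and modes of interaction among the genes in the regulatory system.
The modularity of a gene regulatory network refers to the presence of relatively independent functional modules within the regulatory system.
Using biological big data technology, we can collect large amounts of gene expression data and gene regulation information.
By organizing and analyzing these data, the topology of the gene regulatory network can be constructed.
Researchers can use these data to build a network graph of nodes and edges based on the interactions between genes.
Each gene is represented as a node, and the regulatory relationships between genes are represented as edges.
Analyzing this network graph reveals the overall structure and characteristics of the gene regulatory network.
The topology of gene regulatory networks usually shows certain characteristic properties, such as the small-world property, the scale-free property, and modularity.
The small-world property means that the nodes in a gene regulatory network can reach one another in a small number of steps.
This property gives gene regulatory networks a high efficiency of information transfer.
The scale-free property means that the network contains a small number of highly connected nodes, which play a key regulatory role.
Modularity means that a gene regulatory network can be decomposed into relatively independent functional modules, each responsible for a specific biological function.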
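The node-and-edge representation and two of the topological measures mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions: edges are treated as undirected for simplicity (real regulatory edges are directed), the gene names are invented, and the hub threshold is arbitrary. `build_graph`, `hubs`, and `average_path_length` are hypothetical helper names.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Adjacency sets for an undirected graph given (gene, gene) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def hubs(graph, min_degree):
    """Nodes whose degree is at least `min_degree` (candidate hub genes)."""
    return sorted(node for node, nbrs in graph.items() if len(nbrs) >= min_degree)


def average_path_length(graph):
    """Mean shortest-path length over all reachable ordered node pairs (BFS)."""
    total = pairs = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    toy_edges = [("TF1", "g1"), ("TF1", "g2"), ("TF1", "g3"), ("g3", "g4")]
    network = build_graph(toy_edges)
    print(hubs(network, min_degree=3))
    print(average_path_length(network))
```

A short average path length relative to network size is what the small-world description refers to, and a few nodes with very high degree is the signature of the scale-free description.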
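Analysis with biological big data technology shows that identifying and studying these topological and modular properties is important for understanding gene regulatory mechanisms.
First, the topology can help us find important control nodes, and thereby uncover potential targets for targeted therapy and support more precise drug treatment.
Second, modularity can help us identify functionally related groups of genes, which aids the understanding of gene function and biological processes.
Using biological big data technology to analyze the topology and modularity of gene regulatory networks also faces some challenges.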
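Breakthroughs and Applications of Fourth-Paradigm Bioinformatics
Bioinformatics is an interdisciplinary field that uses methods from computer science, mathematics, and statistics to study problems in biology.
It has made breakthrough progress in many areas, one of which is fourth-paradigm bioinformatics.
Fourth-paradigm bioinformatics is a new approach that uses big data analysis, machine learning, and artificial intelligence to process and interpret biological data efficiently.
In the history of bioinformatics, the emergence of the fourth paradigm can be regarded as a revolutionary breakthrough that has given biological research new ideas and tools.
The basic principle of fourth-paradigm bioinformatics is to integrate large-scale biological data (such as genomic, transcriptomic, and proteomic data) and to analyze and mine them with machine learning and artificial intelligence algorithms in order to extract useful information and knowledge.
This approach is widely applied in many areas of biological research, including genomics, transcriptomics, proteomics, and drug design.
In genomics, fourth-paradigm bioinformatics can help researchers better understand and study genomic data.
Analysis of large-scale genomic data can reveal interactions between genes, predict gene function, and identify marker genes.
In addition, fourth-paradigm bioinformatics can help process and interpret genome sequencing data efficiently, enabling more in-depth genomic research.
In transcriptomics, fourth-paradigm bioinformatics can be used to study the regulation of gene expression.
Analysis of large-scale transcriptomic data can reveal gene expression patterns and predict the regulatory networks of transcription factors.
This is important for studying developmental processes and the mechanisms by which diseases arise.
With the methods of fourth-paradigm bioinformatics, researchers can gain a more complete view of the information in transcriptomic data and deepen their understanding of biological systems.
In proteomics, fourth-paradigm bioinformatics can be used for protein structure prediction and functional annotation.
Analysis of large-scale proteomic data can predict the three-dimensional structure of proteins and identify functional domains and protein interactions.
This is important for studying protein function and mechanisms, and it can provide strong support for drug design and disease treatment.
Fourth-paradigm bioinformatics can also play an important role in drug design.
Analysis of large-scale molecular structure data and drug activity data can predict drug activity, discover new drug targets, and design more effective drug molecules.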
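The Role of bZIP Transcription Factors in Plant Stress Responses and in Growth and Development

Contents
I. Overview
   1. Research background and significance of bZIP transcription factors
   2. Distribution and classification of bZIP transcription factors in plants
II. Basic properties of bZIP transcription factors
   1. Structure and function of bZIP proteins
   2. Regulation of the stability and activity of bZIP transcription factors
III. The role of bZIP transcription factors in plant stress responses
   1. Response of bZIP transcription factors to drought stress
      Water-saving mechanisms
      Regulation of drought-resistance gene expression
   2. Response of bZIP transcription factors to saline-alkali stress
      Saline-alkali adaptation mechanisms
      Regulation of saline-alkali resistance gene expression
   3. Response of bZIP transcription factors to low-temperature stress
      Low-temperature adaptation mechanisms
      Regulation of low-temperature resistance gene expression
   4. Response of bZIP transcription factors to pest and disease stress
      Pest and disease defense mechanisms
      Regulation of pest and disease resistance gene expression
IV. The role of bZIP transcription factors in plant growth and development
   1. Regulation of plant growth and development by bZIP transcription factors
      Synthesis and signal transduction of growth hormones
      Regulation of photosynthesis and respiration
   2. Effect of bZIP transcription factors on plant stress resistance
      Expression and regulation of stress-resistance genes
      Genetics and epigenetics of stress resistance
V. Applications and prospects of bZIP transcription factors
   1. Applications of bZIP transcription factors in breeding
      Creation of breeding materials
      Improvement of breeding traits
   2. Applications of bZIP transcription factors in genetic engineering
      Cloning and expression of stress-resistance genes
      Application of gene-editing technology
   3. Future trends and challenges in bZIP transcription factor research
VI. Conclusions
   1. The important role of bZIP transcription factors in plant stress responses and in growth and development
   2. Outlook for future research on bZIP transcription factors

I. Overview
This study aims to explore the role of bZIP transcription factors in plant stress responses and in growth and development.
bZIP (basic leucine zipper) proteins are an important class of DNA-binding proteins that take part in many biological processes, including the regulation of gene transcription and the maintenance of protein stability.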
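How Biological Big Data Technology Mines Non-Coding Regulatory Elements in the Genome
As biological research has advanced, scientists have found that, besides the protein-coding portion of the genome, the non-coding portion also performs important regulatory functions in organisms.
Non-coding regulatory elements are DNA sequences that lie outside the coding region of a gene and regulate gene expression.
These regulatory elements can affect the activity and expression level of genes, and thereby influence development, physiology, and disease.
However, identifying non-coding regulatory elements and working out their functions is a challenging task.
Biological big data technology has become an indispensable tool for mining non-coding regulatory elements in the genome.
By integrating, analyzing, and interpreting large amounts of genomic data, it provides researchers with key information and insight.
First, biological big data technology can identify non-coding regulatory elements through genome sequence alignment and annotation.
Genome sequence alignment compares a newly obtained sequence with a known reference genome sequence in order to find the similarities and differences between them.
By aligning sequences against known non-coding regulatory elements, new regulatory elements can be identified and their functions analyzed further.
Second, biological big data technology can use transcriptomic data to identify non-coding regulatory elements.
Transcriptomics studies the expression of RNA transcripts, such as mRNA, in cells or tissues and how that expression changes.
By measuring the transcripts produced from the genome, we can learn about the expression level and regulation of genes.
Biological big data technology can analyze transcriptomic data to identify non-coding regulatory elements associated with gene expression.
For example, RNA sequencing yields read counts for each transcript, from which gene expression levels can be inferred and candidate non-coding regulatory elements that may control gene expression can be identified.
In addition, biological big data technology can use epigenetic data to mine non-coding regulatory elements.
Epigenetics studies chemical modifications of DNA and the state of chromatin in the genome.
DNA methylation is a common epigenetic modification that regulates gene activity by chemically modifying bases without changing the underlying DNA sequence.
Biological big data technology can use methylation profiling sequencing data to locate methylation sites in the genome and further identify non-coding regulatory elements associated with gene expression.
Beyond the methods above, biological big data technology can also use chromatin conformation and interaction data to mine non-coding regulatory elements.
Chromatin conformation refers to the spatial structure of chromatin in three dimensions.
Chromosome interaction data can be used to determine the interactions between different regions of the genome.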
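A Study of Drug Risk Signals Related to Interstitial Lung Disease Based on the Openvigil Pharmacovigilance Data Analysis Website
In recent years, adverse drug reactions (ADRs) and drug safety have received a great deal of attention.
Many drug monitoring systems and databases have been established and are widely used to assess the side effects and risks of drugs.
However, analyzing large-scale, global drug monitoring data remains a challenging task.
The Openvigil pharmacovigilance data analysis website provides a powerful tool for studying drug risk signals.
Openvigil is an online database developed by the German bioinformatics research network (BiGiNet) that integrates data from drug monitoring systems and adverse drug reaction databases around the world.
By integrating these data, the Openvigil website provides a global, anonymized platform for monitoring adverse drug reactions.
This study aims to use the Openvigil pharmacovigilance data analysis website to investigate risk signals for drugs related to interstitial lung disease (ILD).
Interstitial lung disease is a rare but serious lung disease characterized by fibrosis and inflammation of the interstitium between the alveoli and the pulmonary blood vessels.
Drug-induced interstitial lung disease is a rare but recognized adverse reaction that is challenging to diagnose and treat.
First, we used the Openvigil website to find drugs related to interstitial lung disease, including drugs carrying the ILDSPT (ILD-specific) label and drugs previously reported to be associated with ILD.
Next, we extracted the adverse reaction reports related to these drugs from the Openvigil database.
These reports were given an initial screening to remove reports unrelated to ILD.
Finally, we used statistical methods such as frequent itemset mining and association rule analysis to study the association between these drugs and ILD and the associated risk.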
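The association rule analysis named in the final step can be illustrated on toy data. The sketch below is not the study's actual pipeline and does not use any Openvigil interface: it assumes each report is a dictionary holding a set of drugs and a set of events, and computes the standard support, confidence, and lift of the rule "drug implies event". The report contents and the names `rule_metrics`, `drug_a`, and `drug_b` are hypothetical.

```python
def rule_metrics(reports, drug, event):
    """Support, confidence and lift of the rule drug -> event over `reports`."""
    n = len(reports)
    if n == 0:
        return 0.0, 0.0, 0.0
    with_drug = sum(1 for r in reports if drug in r["drugs"])
    with_event = sum(1 for r in reports if event in r["events"])
    both = sum(1 for r in reports if drug in r["drugs"] and event in r["events"])
    support = both / n                                  # P(drug and event)
    confidence = both / with_drug if with_drug else 0.0  # P(event | drug)
    # Lift > 1 means the event is reported more often with the drug than overall.
    lift = confidence / (with_event / n) if with_event else 0.0
    return support, confidence, lift


if __name__ == "__main__":
    toy_reports = [  # invented reports for illustration only
        {"drugs": {"drug_a"}, "events": {"ILD"}},
        {"drugs": {"drug_a"}, "events": {"ILD"}},
        {"drugs": {"drug_a"}, "events": {"rash"}},
        {"drugs": {"drug_b"}, "events": {"rash"}},
    ]
    print(rule_metrics(toy_reports, "drug_a", "ILD"))
```

A lift above 1 only indicates that the pair is reported together more often than expected; spontaneous-report data cannot by themselves establish causation.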
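Through this study with the Openvigil pharmacovigilance data analysis website, we found that certain drugs show an association with, and a risk of, interstitial lung disease.
These drugs include certain immunosuppressants, non-steroidal anti-inflammatory drugs, and antibacterial drugs.
Specifically, the use of certain immunosuppressants is associated with the occurrence and severity of ILD.
Long-term use of certain non-steroidal anti-inflammatory drugs is also associated with the risk of ILD.
In addition, certain antibacterial drugs, such as vancomycin, may also increase the risk of ILD.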
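An Introduction to the Applications of the International Cancer Genome Consortium Database
Authors: 蒿花; 王馨笛; 耿辉; 袁炜; 陈新欢; 王军; 马茂
Journal: Chinese Journal of Evidence-Based Cardiovascular Medicine (中国循证心血管医学杂志), 2024, 16(2)
Abstract: With the rapid development of molecular biology techniques in the 21st century, high-throughput sequencing has identified a large number of cancer mutation genes and molecular markers.
Faced with the interactions among a vast number of genes, and with the many non-coding RNAs and their complex regulatory functions, modern medicine finds it difficult to describe the molecular mechanism of a given cancer accurately when trying to reveal, from a molecular perspective, how that cancer acts and develops. There is therefore an urgent need to combine high-throughput gene expression, epigenomic, proteomic, and transcriptomic data for analyses in molecular genetics, molecular pharmacology, and etiology and pathology, in order to identify potential cancer risks and subtypes, to characterize oncogenes at the level of somatic mutation sites, base changes, and functional impact, and to reveal in depth the occurrence and development of cancer, drug targets, and the cancer subtypes relevant to prognosis and treatment.
To address these problems, the International Cancer Genome Consortium (ICGC) collected cancers of 50 different types or subtypes and built a database that lets researchers carry out large-scale studies of cancer at the gene level. The database systematically studies more than 25,000 cancer genomes at the genomic, epigenomic, and transcriptomic levels, analyzes cancer-causing mutated genes and the possible effects of mutagenesis, identifies clinically relevant subtypes for cancer prognosis and treatment management, and promotes the development of new cancer drug therapies.
Pages: 5 (pp. 144-148)
Affiliations: Department of Health Medicine, First Affiliated Hospital of Xi'an Jiaotong University; Xi'an Jiaotong University Health Science Center; Department of Cardiology, First Affiliated Hospital of Xi'an Jiaotong University
Language of text: Chinese
CLC classification: R4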
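TBtools Gene Structure
TBTools (TubercuList Bioinformatics Tools) is a suite of genomics software tools developed by the TB database team at Paris-Sud University in France. It is designed specifically for analyzing and integrating genomic information on Mycobacterium tuberculosis, a human pathogen.
Its main purpose is to give tuberculosis research a comprehensive set of tools for analyzing and annotating the Mycobacterium tuberculosis genome sequence.
TBTools provides tools for the structural analysis of gene sequences, with modules for gene function prediction, exon prediction, gene structure analysis, homologous gene analysis, and molecular evolution analysis.
TBTools also supports user-defined sequence formats and sequence alignment tools based on algorithms such as BLAST and HMMER.
Gene structure analysis is an important module in TBTools. It predicts the coding regions of genes and the positions of introns and exons, and annotates them in detail.
The accuracy of gene structure prediction matters for further study of gene function and of the promoters and regulatory elements of coding regions.
The gene structure prediction algorithms in TBTools include models based on neural networks and on the Glimmer software.
These models can predict the coding regions of known and unknown gene sequences and identify the positions of exons and introns.
TBTools can also detect erroneous gene predictions and incomplete sequences, which improves the precision of gene prediction.
TBTools also provides gene structure annotation. For each gene it supplies a set of information that includes the position of the coding sequence, exon positions, intron positions, promoters, and regulatory elements.
This annotation is important for studies of gene function and for large-scale analysis of genome structure.
Besides gene structure prediction and annotation, TBTools provides other analysis and annotation tools, such as the homologous gene analysis module and the molecular evolution analysis module.
These modules allow users to compare and analyze the genome sequences of different species and to assess the evolutionary relationships of homologous genes, and so to study how gene function relates to the evolution of species.
In summary, TBTools provides a comprehensive set of genome analysis and annotation tools that support the analysis and study of Mycobacterium tuberculosis gene sequences.
The gene structure analysis module gives researchers gene predictions, exon predictions, and annotation that can guide follow-up research.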
Genome Informatics 14: 428–429 (2003)

Identifying Potential Regulatory Sequences of Alternative Splicing

Hitomi Itoh (1, 2), t00514hi@sfc.keio.ac.jp
Takanori Washio (1), washy@sfc.keio.ac.jp
Masaru Tomita (1, 2), mt@sfc.keio.ac.jp

(1) Institute for Advanced Biosciences, Keio University, Tsuruoka, Yamagata 997-0035, Japan
(2) Department of Environmental Information, Keio University, Fujisawa, Kanagawa 252-8520, Japan

Keywords: alternative splicing, exonic splicing enhancer (ESE), phylogenetic footprinting

1 Introduction

Alternative splicing is an important mechanism that contributes to expanding protein diversity by generating multiple protein isoforms from a single gene. We have previously reported a computational approach to infer alternative splicing patterns from Mus musculus full-length cDNA clones and microarray data [4]. Although we predicted a large number of unreported splice variants, the general mechanisms that regulate this alternative splicing were yet to be understood. In the present study, we constructed datasets of putative alternatively spliced genes by use of full-length cDNA clones and genomic DNA sequences. Our putative datasets were then evaluated by comparing the information contents of constitutive splice-sites and those of alternative splice-sites. Using a phylogenetic footprinting approach, we identified the potential regulatory sequences of alternative splicing.
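The comparison of splice-site information contents mentioned above can be sketched as follows. The paper does not give its formula, so this sketch assumes the common per-position Shannon measure for DNA (2 bits minus the positional entropy, with no small-sample correction) applied to equal-length, pre-aligned splice-site sequences. The function name `information_content` and the example sequences are hypothetical.

```python
import math
from collections import Counter


def information_content(sites):
    """Per-position information in bits (2 - entropy) for aligned DNA sites."""
    length = len(sites[0])
    n = len(sites)
    bits = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        # Shannon entropy of the base distribution at this position.
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        bits.append(2.0 - entropy)
    return bits


if __name__ == "__main__":
    # Invented donor-site fragments; a strong site set is more conserved.
    constitutive = ["GTAAGT", "GTAAGT", "GTGAGT", "GTAAGT"]
    alternative = ["GTAAGC", "GTCTGA", "GTGAGT", "GTTCAT"]
    print(sum(information_content(constitutive)))
    print(sum(information_content(alternative)))
```

A lower total for one set of sites would be consistent with weaker, less conserved splice signals, which is the kind of contrast the evaluation above relies on.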
2 Method and Results

2.1 Detection of Putative Alternative Splicing Patterns

We detected putative splice variants using full-length cDNA clones and complete genomic sequences. To construct datasets of putative splice variants and infer complete gene structure, we mapped each full-length cDNA clone to genomic sequences by use of BLAST [1] and SIM4 [3]. Clones were then grouped into clusters if all the internal regions were correctly mapped. Mapping and clustering results are summarized in Table 1.

Table 1: Mapping and clustering results.
H. sapiens     15,506    9,146   58,349   6.38   2,818     985
M. musculus    49,162   19,200   87,183   4.54   8,879   1,424

2.2 Identification of Potential Alternative Splicing Regulatory Sequences

As a first step toward identifying potential regulatory sequences of alternative splicing, we compared the orthologous gene clusters among mammals including mouse, rat, human, cattle, pig, and dog, as shown in Fig. 1. We then extracted evolutionarily conserved alternative cassette exons and constitutive exons, in order to apply a phylogenetic footprinting approach for identifying regulatory sequences. Phylogenetic footprinting is one of the well-known approaches for the discovery of regulatory sequences in a set of orthologous gene clusters [2]. Using this approach, we identified 7 potential regulatory sequences of alternative splicing. Although experimental verification is lacking, the fact that our approach has identified motifs that have high sequence similarity with known ESE motifs may suggest that these unknown motifs are also functional regulatory elements and thus are good novel candidates for functional regulatory sequences.

Figure 1: Distribution of mutation rate in evolutionarily conserved exons.

3 Discussion

Our computational approach identified potential regulatory sequences of alternative splicing. Based on our findings, it may be reasonable to suggest that alternative exons tend to have more regulatory sequences, so that weak splice-sites can be regulated alternatively. The fact that known regulatory sequence motifs were found among our predicted candidate motifs may validate this computational approach and thus indicate that it is applicable for identifying potential regulatory sequences of alternative splicing.
References

[1] Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J., Basic local alignment search tool, J. Mol. Biol., 215:403–10, 1990.
[2] Blanchette, M. and Tompa, M., Discovery of regulatory elements by a computational method for phylogenetic footprinting, Genome Res., 12:739–48, 2002.
[3] Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., and Miller, W., A computer program for aligning a cDNA sequence with a genomic DNA sequence, Genome Res., 8:967–74, 1998.
[4] Kochiwa, H., Suzuki, R., Washio, T., Saito, R., Bono, H., Carninci, P., Okazaki, Y., Miki, R., Hayashizaki, Y., and Tomita, M., Inferring alternative splicing patterns in mouse from a full-length cDNA library and microarray data, Genome Res., 12:1286–93, 2002.