Localization of an Absorbing Inhomogeneity in a Scattering Medium in a Statistical Framework
A hypothesis for the origin and maintenance of within-community species diversity
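Chinese Biodiversity 1997, 5(3): 161~167
A hypothesis for the origin and maintenance of within-community species diversity
Zhang Dayong, Jiang Xinhua (Department of Biology and State Key Laboratory of Arid Agroecology, Lanzhou University, Lanzhou 730000)
Abstract: Based on the authors' study of the competitive exclusion principle, this paper proposes a new hypothesis of community diversity.
In the authors' view, species occupying the same niche can coexist stably; species diversity within a community is therefore controlled by four basic factors.
These are: (Ⅰ) the number of niches; (Ⅱ) the size of the regional species pool; (Ⅲ) the species immigration rate; and (Ⅳ) the species extinction rate.
The hypothesis emphasizes that regional biogeographic processes and local ecological processes jointly determine the magnitude and distribution pattern of species diversity within communities.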
A hypothesis for the origin and maintenance of within-community species diversity / Zhang Dayong, Jiang Xinhua // CHINESE BIODIVERSITY. 1997, 5(3): 161~167
This paper formulates a novel hypothesis of community diversity on the basis of rejecting the competitive exclusion principle. Since we accept the view that many species could occupy the same niche, local species diversity is considered to be controlled by four fundamental factors, which are, respectively, (Ⅰ) the number of niches in the community, (Ⅱ) the size of the regional species pool, (Ⅲ) species immigration rate, and (Ⅳ) species extinction rate. The hypothesis suggests that both regional biogeographic processes and local ecological processes will play an important role in determining the magnitude and pattern of community diversity.
Key words: local species diversity, speciation, regional species pool, niche, competitive exclusion principle
Author's address: Department of Biology & State Key Laboratory of Arid Agroecology, Lanzhou University, Lanzhou 730000
1 Introduction
Owing to human activities such as environmental pollution and habitat destruction, large-scale species extinction has become a focus of close public concern.
Bubble nucleation, growth and coalescence
Bubble nucleation, growth and coalescence during the 1997 Vulcanian explosions of Soufrière Hills Volcano, Montserrat
T. Giachetti a,b,c,⁎, T.H. Druitt a,b,c, A. Burgisser d, L. Arbaret d, C. Galven e
a Clermont Université, Université Blaise Pascal, Laboratoire Magmas et Volcans, BP 10448, F-63000 Clermont-Ferrand, France
b CNRS, UMR 6524, LMV, F-63038 Clermont-Ferrand, France
c IRD, R 163, LMV, F-63038 Clermont-Ferrand, France
d Institut des Sciences de la Terre d'Orléans, Université d'Orléans, 1A, rue de la Férollerie, 45071 Orléans Cedex 2, France
e Laboratoire des Oxydes et Fluorures, Faculté des Sciences et Techniques, Université du Maine, Avenue Olivier Messiaen, 72085 Le Mans Cedex 9, France
Article history: Received 29 July 2009; Accepted 5 April 2010; Available online 13 April 2010
Keywords: Vulcanian explosions; Soufrière Hills; vesiculation; bubble nucleation; bubble growth; coalescence; amphibole boudinage
Abstract
Soufrière Hills Volcano had two periods of repetitive Vulcanian activity in 1997. Each explosion discharged the contents of the upper 0.5–2 km of the conduit as pyroclastic flows and fallout: frothy pumices from a deep, gas-rich zone, lava and breadcrust bombs from a degassed lava plug, and dense pumices from a transition zone. Vesicles constitute 1–66 vol.% of breadcrust bombs and 24–79% of pumices, all those larger than a few tens of µm being interconnected. Small vesicles (< few tens of µm) in all pyroclasts are interpreted as having formed syn-explosively, as shown by their presence in breadcrust bombs formed from originally non-vesicular magma. Most large vesicles (> few hundreds of µm) in pumices are interpreted as pre-dating explosion, implying pre-explosive conduit porosities up to 55%. About a sixth of large vesicles in pumices, and all those in breadcrust bombs, are angular voids formed by syn-explosive fracturing of amphibole phenocrysts. An intermediate-sized vesicle population formed by coalescence of the small syn-explosive bubbles. Bubble nucleation took place heterogeneously on
titanomagnetite, number densities of which greatly exceed those of vesicles, and growth took place mainly by decompression. Development of pyroclast vesicle textures was controlled by the time interval between the onset of explosion–decompression and surface quench in contact with air. Lava-plug fragments entered the air quickly after fragmentation (∼10 s), so the interiors continued to vesiculate once the rinds had quenched, forming breadcrust bombs. Deeper, gas-rich magma took longer (∼50 s) to reach the surface, and vesiculation of resulting pumice clasts was essentially complete prior to surface quench. This accounts for the absence of breadcrusting on pumice clasts, and for the textural similarity between pyroclastic flow and fallout pumices, despite different thermal histories after leaving the vent. It also allowed syn-explosive coalescence to proceed further in the pumices than in the breadcrust bombs. Uniaxial boudinage of amphibole phenocrysts in pumices implies significant syn-explosive vesiculation even prior to magma fragmentation, probably in a zone of steep pressure gradient beneath the descending fragmentation front. Syn-explosive decompression rates estimated from vesicle number densities (> 0.3–6.5 MPa s⁻¹) are consistent with those predicted by previously published numerical models.
© 2010 Elsevier B.V. All rights reserved.
1. Introduction
Explosive volcanic eruptions are driven by the nucleation, growth and coalescence of gas bubbles, followed by fragmentation of the magmatic foam into a suspension of pyroclasts and gas that is discharged at high velocities into the atmosphere. Studies of pyroclast textures, coupled with experimental and numerical approaches, have advanced understanding of these processes (Lensky et al., 2004; Spieler et al., 2004b; Adams et al., 2006; Toramaru, 2006; Gardner, 2007; Cluzel et al., 2008; Koyaguchi et al., 2008 and references therein), but many questions remain. One concerns the relative importance of homogeneous versus heterogeneous nucleation. Homogeneous nucleation
requires gas supersaturations of at least several tens of MPa (Mangan and Sisson, 2000; Mourtada-Bonnefoi and Laporte, 2002, 2004; Mangan et al., 2004), whereas heterogeneous nucleation requires lower supersaturations (Hurwitz and Navon, 1994; Gardner, 2007; Cluzel et al., 2008). The degree of equilibrium between gas and melt during bubble growth also has an effect. Equilibrium degassing requires efficient volatile diffusion coupled with melt viscosity low enough to allow free gas expansion (Lyakhovsky et al., 1996; Liu and Zhang, 2000; Lensky et al., 2004). High degrees of disequilibrium favour short-lived eruptions, whereas equilibrium allows more sustained fragmentation (Melnik and Sparks, 2002; Mason et al., 2006). Another issue concerns the timing of bubble growth and coalescence relative to fragmentation and eruption. Some authors postulate little growth following fragmentation (Klug and Cashman, 1991) whereas others envisage significant post-fragmentation growth (Thomas et al., 1994; Kaminski and Jaupart, 1997). Post-fragmentation bubble growth is largely controlled by melt viscosity, being important in mafic melts and less so in silicic melts with viscosities > 10⁸–10⁹ Pa s (Thomas et al., 1994; Gardner et al., 1996; Kaminski and Jaupart, 1997). Bubble coalescence and connections control permeability acquisition and the ability of magma to outgas during ascent.
Journal of Volcanology and Geothermal Research 193 (2010) 215–231. doi:10.1016/j.jvolgeores.2010.04.001. 0377-0273/$ – see front matter © 2010 Elsevier B.V. All rights reserved. Journal homepage: www.elsevier.com/locate/jvolgeores
⁎ Corresponding author. Clermont Université, Université Blaise Pascal, Laboratoire Magmas et Volcans, BP 10448, F-63000 Clermont-Ferrand, France. E-mail address: giachettithomas@club-internet.fr (T. Giachetti).
Vesicle size distributions provide information on magma vesiculation history. Pumices commonly contain multiple
vesicle populations covering a large range of sizes (Klug and Cashman, 1996; Klug et al., 2002; Adams et al., 2006) that may result from coalescence following a single nucleation event (Orsi et al., 1992; Klug and Cashman, 1994, 1996; Klug et al., 2002; Burgisser and Gardner, 2005). Alternatively, each population may represent a distinct nucleation event, consistent with some ascent models which predict multiple events for viscous magma (Witham and Sparks, 1986; Proussevitch and Sahagian, 1996; Blower et al., 2001; Massol and Koyaguchi, 2005). Small vesicles are commonly attributed to syn-explosive vesiculation that generates an exponential size distribution (Mangan et al., 1993; Klug and Cashman, 1996; Klug et al., 2002; Adams et al., 2006). Size distributions of larger populations typically obey power laws usually attributed to coalescence (Klug et al., 2002; Houghton et al., 2003; Gurioli et al., 2005; Adams et al., 2006; Klug and Cashman, 1996), although multiple nucleation events also generate power-law distributions (Blower et al., 2001). Magma decompression rates can be estimated from vesicle number densities assuming a unique and brief nucleation event (Toramaru, 2006; Cluzel et al., 2008).
Detailed studies of eruptive products are required to address these questions and provide ground truth for models. Most vesiculation studies to date have concerned Plinian eruptions. In this paper we study vesiculation during a sequence of well documented Vulcanian explosions at Soufrière Hills Volcano in 1997. The explosions have been previously described (Druitt et al., 2002; Cole et al., 2002) and modelled (Melnik and Sparks, 2002; Clarke et al., 2002; Formenti et al., 2003; Diller et al., 2006; Mason et al., 2006), and their products studied texturally (Formenti and Druitt, 2003; Clarke et al., 2007) and chemically (Harford et al., 2003). A key feature was the eruption of pyroclasts of a wide range of types, including dense lava fragments, breadcrust bombs and pumices of different densities. Textural analysis, including a set of high-resolution
vesicle-size distributions, enables us to recognize populations of vesicles formed by explosion decompression, quantify bubble nucleation mechanisms and decompression rates, and constrain the timing of bubble nucleation, growth and coalescence during, and immediately following, a typical explosion. In a companion paper we present measurements of groundmass water contents and reconstruct the state of the pre-explosion conduit (Burgisser et al., in press).
2. The 1997 Vulcanian explosions of the Soufrière Hills Volcano
The eruption of Soufrière Hills Volcano (Fig. 1) began phreatically in July 1995; extrusion of lava began in November of the same year and continued intermittently until the time of writing. The explosions in 1997 occurred every 3–63 h (mean of ∼10 h) in two periods: thirteen between 4 and 12 August, and seventy-five between 22 September and 21 October (Druitt et al., 2002). Each consisted of an initial high-intensity phase lasting a few tens of seconds, followed by a waning phase lasting 1–3 h. Multiple jets were ejected at 40–140 m s⁻¹ during the first 10–20 s of each explosion, then collapsed back to form pumiceous pyroclastic flows that travelled up to 6 km from the crater (Formenti et al., 2003). Fallout of pumice and ash occurred from high (3–15 km) buoyant plumes that developed above the collapsing fountains. Fallout and flow took place at the same time from individual explosions. Each explosion discharged on average 8×10⁸ kg of magma, about two-thirds as pyroclastic flows and one-third as fallout, representing a conduit drawdown of 0.5–2 km (Druitt et al., 2002).
Studies of quench pressures using microlite contents and glass water contents support a maximum drawdown of ∼2 km (Clarke et al., 2007; Burgisser et al., in press). Each explosion started when magma overpressure exceeded the strength of an overlying degassed plug and a fragmentation front propagated down the conduit at a few tens of m s⁻¹ (Druitt et al., 2002; Clarke et al., 2002; Melnik and Sparks, 2002; Spieler et al., 2004a; Diller et al., 2006; Mason et al., 2006). After each explosion, magma rose up the conduit before the onset of a new explosion.
The Soufrière Hills andesite contains phenocrysts of plagioclase, hornblende, orthopyroxene, magnetite, ilmenite and quartz set in rhyolitic glass. The pre-eruptive temperature was ∼850 °C (Devine et al., 2003).
3. Methodology
Field work was carried out in 2006 and 2008 at three sites (Fig. 1): sites 1 and 2 are situated on the fans of overlapping pyroclastic flow lobes from the explosions, and site 3 is a composite layer of fallout pumice from many explosions. Fallout pumices were also collected at a fourth site (site 4; Fig. 1) during an explosion in August 1997. Field descriptions were made using a rock saw to cut perpendicular to any flow banding and parallel to any crystal fabric, and over 100 representative pyroclasts were taken for laboratory study.
Abundances of isolated and connected vesicles were measured on 2–5 cm cubes cut from 30 breadcrust bombs and 34 flow and fall pumices using a Multivolume 1305 Helium Pycnometer and the method of Formenti and Druitt (2003), which is explained in the Supplementary electronic material. Separate measurements were made on the rims and cores of 22 breadcrust bombs. Twenty-six of the pumice clasts ranged from lapilli to block size, all being <20 cm in diameter. Measurements were also made on multiple core-to-rim samples from eight pumices >30 cm in diameter.
Texturally or compositionally banded pyroclasts were not included.
Microscopic observations were made on the broken surfaces of pyroclast fragments using a Jeol JSM-591LV Scanning Electron Microscope (SEM) at an acceleration of 15 kV, and on polished epoxy-impregnated thin sections using the SEM and a stereomicroscope. Six samples representative (in terms of vesicularity and texture) of the pyroclast assemblage were chosen for high-resolution analysis of vesicle and crystal size distributions. Banded clasts, and those with a significant fraction of non-spherical vesicles, were excluded, thereby justifying use of a single, randomly oriented thin section for each sample.
Vesicle and crystal size distributions were measured by image analysis in two dimensions (Toramaru, 1990; Mangan et al., 1993; Klug and Cashman, 1994, 1996; Klug et al., 2002; Adams et al., 2006; Shea et al., 2010). The technique, described fully in Appendix A, allowed objects as small as ∼1 µm to be measured. Differential epoxy penetration enabled us to distinguish interconnected from isolated vesicles. To represent the state of the magma immediately prior to the last discernible stage of coalescence, we manually 'decoalesced' neighbouring vesicles separated by a partially retracted wall. Volume distributions were assumed to equal area distributions (Klug et al., 2002). Volumetric number densities (N_v) were calculated from area number densities (N_a) using both the methods of Cheng and Lemlich (1983) and of Sahagian and Proussevitch (1998), which yield very similar values (Table 1). Values of N_v presented in this paper are those obtained using the first method, for reasons discussed in Appendix A.
4. Field descriptions
The pyroclast assemblage consists predominantly of pumices of different colours, vesicularities and textures, with less than a few percent of breadcrust bombs and dense glassy lava clasts. Pyroclasts of all types were present in the pyroclastic flow deposits, although the relative proportions varied from lobe to lobe, while dense lava and breadcrust bombs were absent in the fallout. The samples described below come from several different explosions, and cannot be assigned to specific dates/times owing to the complex superposition of flow and fallout lobes from the many events. They represent the products of an 'average' explosion, as justified by (1) the first-order similarity of all the explosions (Druitt et al., 2002), and (2) the presence of the entire pyroclast spectrum in all pyroclastic flow lobes we examined.
Pumices in the pyroclastic-flow deposits occur as lapilli and blocks up to >1 m in diameter with subangular-to-rounded shapes due to abrasion during transport. They range from beige, well vesiculated varieties, to grey, brown or black denser varieties (Fig. 2a–b). A pink colouration affects the surfaces of many blocks, but rarely pervades the interiors. While the majority of pumices are texturally homogeneous in hand specimen, some denser ones are flow banded with phenocryst alignment in the plane of banding. Rare compositional banding defined by trails of disintegrated mafic inclusions also occurs. All pumice clasts (as distinguished from breadcrust bombs) lack surface breadcrusting. This probably cannot be explained by abrasion, because breadcrust fragments are not observed in the flow matrices.
All pumices smaller than ∼30 cm lack radial gradients in vesicle abundance or size. However, some blocks larger than this exhibit visibly obvious radial gradients in vesicle size, with an outer 3–7-cm-thick rind with vesicles up to several mm, and a more coarsely vesicular interior containing vesicles up to an order of magnitude larger (Fig. 2c). In some cases a crude cm-scale radial jointing affects the rind. The rind is inferred to represent the initial textural state of the pumice, while the interior records vesicle coarsening that took place during or after emplacement. The possibility that the interior represents the initial
state, and that the rind developed by compaction during rolling in the pyroclastic flow, is not favoured because (1) the rinds texturally resemble the majority of smaller pumice blocks and lapilli, whereas vesicles in the interiors are abnormally coarse, and (2) no circumferential flattening of rind vesicles is observed. Many blocks also contain large voids up to several cm across, including anastomosing vesicle pipes and channels, ductile tears in the plane of flow banding, and curviplanar tears and cracks subparallel to clast margins (Fig. 2d), which together account for <10% of the total vesicularity.
Fallout pumices are up to several cm in size and most preserve their original eruption–fragmentation shapes, unmodified by abrasion in pyroclastic flows or breakage on ground impact. They range in colour from white to brown and in shape from spheroidal to tabular, the latter comprising about three-quarters of the sample suite. Again, no surface breadcrusting is observed.
Breadcrust bombs occur from a few cm to over a metre in diameter. They have vesicular interiors surrounded by darker, less vesicular <10 mm glassy rinds. A continuous range of textural varieties are observed between two endmembers. Coarsely breadcrusted bombs are relatively dense, with well defined, dark-grey-to-black, poorly-to-non-vesicular rinds, broad, deep surface fractures defining large polygons, and grey-to-brown vesicular interiors (Fig. 2e–f). Finely breadcrusted bombs are less dense, with diffuse, pale vesicular rinds, finer polygonal networks of narrower, shallower surface fractures, and paler, commonly flow-banded interiors (Fig. 2g–h). Some bombs that broke during eruption exhibit two generations of breadcrusting, the breakage surface being more finely breadcrusted than the original, outer surface of the bomb. Breakage is inferred to have exposed the already vesicular interior, which then developed a second generation of finer breadcrusting. Bombs were abraded during transport in the pyroclastic flows; most lack
completely preserved breadcrust surfaces with sharp edges and corners, and partial rind removal, rounding of polygon edges, and abrasion of vesicular interiors are common.
Clasts of black, essentially nonvesicular lava resembling the glassy rinds of the coarsely breadcrusted bombs are interpreted as an integral component of the explosion-pyroclast suite. On the other hand, grey-to-brown holocrystalline lava and cinderblock clasts resembling typical dome rock are probably derived either from the crater walls or from earlier block-and-ash flow deposits traversed by the explosion pyroclastic flows.
Fig. 1. Map of Montserrat showing Soufrière Hills Volcano and sampling locations. Grey: pyroclastic flow deposits of the 1997 Vulcanian explosions.
Fig. 2. Pumices and breadcrust bombs from the explosions. a) Dark brown pumice with 60% vesicularity, b) Pale pumice with 76% vesicularity, c) Large grey pumice exhibiting a radial gradient in vesicularity, line marks the outer surface of the clast, d) Curviplanar tears and cracks subparallel to clast margins in a dense pumice, e–f) Exterior and cross section of a coarsely breadcrusted bomb, g–h) Exterior and cross section of a finely breadcrusted bomb.
5. Pyroclast vesicularities
Vesicularities of texturally homogeneous pumice lapilli and blocks range from 24 to 79 vol.% (Fig. 3) and correlate with colour, being lowest in darker pumices and higher in paler ones. The fraction of isolated vesicles (isolated divided by total vesicularity) is universally low (<0.25, with 85% <0.1). Flow pumices cover the entire vesicularity range and have isolated fractions of 0–0.13, whereas fallout pumices have vesicularities of 43–72 vol.% and isolated fractions of 0.04–0.14, a single sample having 0.25 (Fig. 3). No variation of either vesicularity or isolated vesicle fraction with clast size is observed.
Vesicularity profiles across eight
>30 cm pumice blocks are shown in Fig. 4. Four of these appeared homogeneous in the field, and four had visually obvious radial gradients in vesicle size. The four homogeneous blocks (SHV4–12–13–22) lack significant gradients in vesicularity from core to rim, as anticipated from inspection. The four vesicle-size-graded blocks (SHV2–14–23–25), on the other hand, exhibit vesicularity gradients, but these vary from sample to sample and no systematic decrease in vesicularity from core to rim is evident. The coarse interiors of these pumices are no more vesicular than the more finely vesicular rims. Textural coarsening in the interiors therefore took place without inflation, consistent with the absence of surface breadcrusting.
Breadcrust bombs differ from pumices in that (1) their vesicularity range (20–66 vol.%, most lying between 35 and 55%, Fig. 3) is smaller, and highly vesicular (>66%) samples are not observed; and (2) the fraction of isolated pores (0.05–0.33, >80% being 0.1–0.2) is higher than in pumices of similar vesicularity. Bomb rinds contain 1–25 vol.% vesicles, most of which are isolated. Rind and interior vesicularities are broadly correlated (Fig. 5). Coarsely breadcrusted bombs have the lowest vesicularities, both in rinds and interiors, and finely breadcrusted bombs are more vesicular. It is the existence of vesicular rinds on finely breadcrusted bombs that gives these bombs their pale colours and makes distinction between rind and interior less clear than in the coarsely breadcrusted bombs.
Full tables of vesicularity data are provided as Supplementary electronic material.
6. Microscopic vesicle textures
The pyroclasts contain vesicles with a broad range of sizes set in microlite-bearing groundmass. In this section we focus on vesicles less than a few mm in diameter present in hand specimens, and distinguish three populations: small (less than a few tens of µm), intermediate (few tens to a few hundreds of µm) and large (few hundreds of µm to a few mm). It is shown later that these three populations also have
genetic significance.
Vesicle textures in fallout and flow pumices are very similar and are described together. The large vesicles form interconnected networks with curved, scalloped walls indicative of coalescence. Large vesicles in the more vesicular pumices are quasi-spherical to elliptical in shape. Those in dense pumices commonly have more ragged, fissure-like shapes, suggesting that perhaps they already existed prior to explosion. About 15% of the large vesicles are angular voids associated with fractured amphibole phenocrysts (Fig. 6a). Intermediate-sized vesicles in all pumices have variably rounded to ragged shapes and, like the large ones, form interconnected networks in three dimensions. In contrast, small vesicles are commonly spherical and many are isolated; they either form a 'matrix' in which the intermediate vesicles are dispersed (Fig. 6b), or are situated in the walls separating the latter. In some samples the smallest isolated vesicles form sub-spherical clusters several tens of microns in diameter that protrude with bulbous, cauliform shapes into larger vesicles (Fig. 6c; Formenti and Druitt, 2003). There is textural evidence that many vesicles of intermediate size formed by coalescence of the small vesicles (rather than pre-existing them), the process commonly being preserved quenched in progress (Fig. 6d). The sizes of some intermediate vesicles appear to be inherited from the clusters of small vesicles when the latter coalesced while preserving the overall sub-spherical form of the cluster.
Vesicles in pumices are commonly observed in spatial association with crystals. Large, angular voids are associated with fractured amphiboles, and have two endmember types: (1) voids in amphiboles boudinaged uniaxially in the plane of flow foliation, with well defined length-perpendicular fractures (Fig. 7); (2) voids in amphiboles that are fractured both perpendicular and parallel to length, and the fragments dispersed around the vesicle margins in a manner suggestive of more isotropic expansion. In both types, crystal
fragments are commonly connected by thin, delicate threads of glass generated either by the bursting of melt inclusions, or by the pulling-out of thin, pre-existing melt films in incipient cracks. A single type of amphibole-associated void is commonly dominant within a given pumice block. Type 1 is observed in ∼45% of pumices and type 2 in ∼35%, the remaining ∼20% of pumices lacking voids associated with amphibole.
Fig. 3. Plot of connected versus total vesicularity for all the samples of this study. Lava samples from dome collapses at Soufrière Hills Volcano are also shown (Formenti and Druitt, 2003). As connected vesicularities could not be determined for breadcrust bomb rinds, we just show the range of bulk vesicularities obtained (thick black line).
Fig. 4. Vesicularity as a function of relative position inside large pumices (>30 cm). Filled diamonds with solid lines are those pumices that were judged in the field to be texturally homogeneous; squares with dashed lines are those that had larger vesicles in the interior than in the rind.
Fig. 5. Relationships between rind and interior vesicularities of breadcrust bombs, including both coarsely and finely breadcrusted types.
Another common texture involves radial arrangements of stretched vesicles around phenocrysts of plagioclase or amphibole (Fig. 6e). This is attributed to expansion of a magmatic foam around a rigid crystal; it cannot be due to heterogeneous bubble nucleation because in each case the vesicles are separated from the crystal by a thin glass film, showing that the crystal was not wetted by gas. Only in the case of titanomagnetite is it common to see vesicles in direct contact with crystals without intervening glass, suggesting that titanomagnetite provided nucleation sites for bubbles (Fig. 8).
There is abundant evidence that bubble coalescence was ongoing at all scales larger than a few µm at the time of sample quench: ovoid, neck-like connections with partially
retracted walls between neighbouring vesicles (Fig. 6d), wrinkling of thin vesicle walls (Fig. 6f), the occurrence of thin glass fibres, and the interconnection of all but a fraction of the smallest vesicles. Minimum observed vesicle wall thicknesses are <1 µm.
Breadcrust bomb rinds contain small, mostly isolated, vesicles that are irregularly distributed, being most abundant near rind-penetrating surface fractures and around phenocrysts (Fig. 9a, c). Areas of vesicle-free groundmass occur in the rinds of coarsely breadcrusted bombs, but not in those of the finely breadcrusted bombs. The lower limit of the rind is commonly marked by string-like networks of small vesicles, which then merge to form the more uniformly distributed vesicle population of the interior. The interiors of all bombs contain distinct large and small vesicle populations. Large vesicles are invariably associated with fractured amphiboles, like those in the pumices. However, well developed uniaxial boudinage is never observed in breadcrust bombs, and the voids are mostly of the more isotropic type 2. Small vesicles are uniformly distributed throughout the bomb interiors (Fig. 9b, d); they are mostly isolated, with quasi-spherical forms, and commonly occur in strings and clusters around crystals and large vesicles. Evidence for vesicle coalescence is abundant in bomb interiors, although less so than in pumices.
7. Size distributions of vesicles and crystals
The six samples chosen for analysis of vesicle and crystal size distributions were a coarsely breadcrusted bomb (BCP1), a finely breadcrusted bomb (BCP43), three pyroclastic-flow pumices (AMO29, AMO36 and PV3), and a fallout pumice (R2). Separate measurements
Fig. 6. SEM images of broken surfaces (a–d) and thin sections (e–f) of pumices. a) Angular void in a fractured amphibole phenocryst, the fragments being connected by thin glass fibres (white arrows), b) Visual evidence for three different size populations (large, intermediate and small) of vesicles in pumices, c) Cauliform-shaped clusters of small vesicles protruding into intermediate ones, d) Evidence for coalescence of small vesicles to form intermediate-sized ones, e) Microphenocryst of plagioclase surrounded by radiating, elongated vesicles, f) Wrinkling of vesicle wall indicative of the onset of rupture (white arrow).
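The excerpt defines the isolated-vesicle fraction as isolated divided by total vesicularity, and Fig. 3 plots connected against total vesicularity. A minimal sketch of that arithmetic follows, assuming isolated vesicularity is simply total minus connected (both in vol.%). The function name and the sample numbers are hypothetical and are not measurements from the paper.

```python
def isolated_fraction(total_vesicularity: float, connected_vesicularity: float) -> float:
    """Fraction of the vesicularity that is isolated (dimensionless, 0 to 1).

    Assumption: isolated vesicularity = total - connected, both in vol.%.
    """
    if total_vesicularity <= 0:
        raise ValueError("total vesicularity must be positive")
    if not 0 <= connected_vesicularity <= total_vesicularity:
        raise ValueError("connected vesicularity must lie between 0 and the total")
    isolated = total_vesicularity - connected_vesicularity
    return isolated / total_vesicularity


# Hypothetical clast: 60 vol.% total vesicularity, 54 vol.% connected.
fraction = isolated_fraction(60.0, 54.0)
print(f"isolated fraction = {fraction:.2f}")
```

A value computed this way can be compared directly with the ranges quoted in the text, for example 0–0.13 for flow pumices.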
Applications of hydrogen bonds
how the standard perception of halogen substituents, which assumes an isotropic negative electron density around the halogen, was replaced by a description that takes the σ-hole into account. Halogen bonds have been found to occur in a multitude of inorganic, organic, and biological systems.4,5 In an early study from the 1950s, Hassel and Hvoslef solved the crystal structure of the equimolar Br2:dioxane adduct and found Br···O contacts featuring distances substantially below the sum of the van der Waals radii of both atoms, indicating a strong attractive interaction between both atoms.6,7 In 1984, a search of the Cambridge crystallographic data files for short iodine···N/O/S contacts revealed that these interactions are also formed in biologically relevant systems, being employed by nature for the molecular recognition of thyroid hormones at their target proteins such as transthyretin.8 In protein−ligand environments, halogen bonds can be formed between a halogenated ligand and any accessible Lewis base in the binding pocket.9 Probably because of its presence in every amino acid, the backbone carbonyl oxygen function is the most prominent Lewis base involved in halogen bonds in protein binding sites, as found from an analysis of the Protein Data Bank (PDB).10,11 Additionally, halogen bonds can be formed involving side chain groups, such as hydroxyls in serine, threonine, and tyrosine, carboxylate groups in aspartate and glutamate, sulfurs in cysteine and methionine, nitrogens in histidine, and the π surfaces of phenylalanine, tyrosine, histidine, and tryptophan. Several examples for these contacts are given in Figure 2.
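The distance criterion mentioned above (a contact shorter than the sum of the van der Waals radii of both atoms) can be sketched as a simple check. The radii below are Bondi values in ångström. The function names and the example distances are hypothetical. A fuller halogen-bond analysis would also test the directionality of the contact toward the σ-hole, which this sketch leaves out.

```python
# Bondi van der Waals radii in angstrom for the atoms named in the text.
BONDI_VDW_RADII = {
    "N": 1.55,
    "O": 1.52,
    "S": 1.80,
    "Cl": 1.75,
    "Br": 1.85,
    "I": 1.98,
}


def vdw_sum(halogen: str, acceptor: str) -> float:
    """Sum of the van der Waals radii of the halogen and the Lewis-base atom."""
    return BONDI_VDW_RADII[halogen] + BONDI_VDW_RADII[acceptor]


def is_short_contact(halogen: str, acceptor: str, distance: float) -> bool:
    """True if the contact distance (angstrom) is below the vdW radii sum."""
    return distance < vdw_sum(halogen, acceptor)


# Hypothetical distances, for illustration only.
print(is_short_contact("Br", "O", 2.9))  # below the Br + O sum of 3.37
print(is_short_contact("Br", "O", 3.5))  # above the sum
```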
Gene Ontology Analysis
GO tree example
GO tree: A child can have more than one parent
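Because a child term can have more than one parent, the GO "tree" is really a directed acyclic graph. Collecting all ancestors of a term therefore means following every parent link while remembering what has already been visited. The sketch below is a toy illustration: the `PARENTS` mapping and its term labels are hypothetical stand-ins, not data read from the GO database.

```python
def ancestors(term, parents):
    """Return every ancestor of `term` in a DAG given a child -> parents mapping."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:  # a shared ancestor is visited only once
                seen.add(parent)
                stack.append(parent)
    return seen


# Toy graph (hypothetical labels): "dna_repair" has two parents.
PARENTS = {
    "dna_repair": ["dna_metabolic_process", "response_to_dna_damage"],
    "dna_metabolic_process": ["biological_process"],
    "response_to_dna_damage": ["biological_process"],
}

print(sorted(ancestors("dna_repair", PARENTS)))
```

The same traversal is why a gene annotated to one term is implicitly a member of every ancestor category as well.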
⎯ Standard assignment of genes into functional categories
⎯ Controlled vocabulary for describing biological meanings
• Gene Ontology or GO project at NCBI
2) Define controlled terms (ontologies) for description of gene products from 3 aspects:
• Biological process (DNA repair, mitosis)
• Molecular function (protein serine/threonine kinase activity, transcription factor
Gene Ontology - Cellular Component
/GO_nature_genetics_2000.pdf
Any one gene can be a member of more than one GO classification
Temporal snapshots of GO terms and mappings are available in BioC (~700, April 2014)
Use of Public Human Genetic Variant
Use of Public Human Genetic Variant 1Databases to Support Clinical Validity 2for Next Generation Sequencing3(NGS)-Based In Vitro Diagnostics 456Draft Guidance for Stakeholders and 7Food and Drug Administration Staff 89DRAFT GUIDANCE1011This draft guidance document is being distributed for comment purposes only. 1213Document issued on July 8, 2016.141516You should submit comments and suggestions regarding this draft document within 90 days of 17publication in the Federal Register of the notice announcing the availability of the draft18guidance. Submit electronic comments to . Submit written19comments to the Division of Dockets Management (HFA-305), Food and Drug Administration, 205630 Fishers Lane, rm. 1061, Rockville, MD 20852. Identify all comments with the docket21number listed in the notice of availability that publishes in the Federal Register.2223For questions about this document concerning devices regulated by CDRH, contact Personalized 24Medicine Staff at 301-796-6206 or PMI@. For questions regarding this document as 25applied to devices regulated by CBER, contact the Office of Communication, Outreach and26Development in CBER at 1-800-835-4709 or 240-402-8010 or by email at ocod@. 27282930U.S. Department of Health and Human Services31Food and Drug Administration32Center for Devices and Radiological Health33Office of In Vitro Diagnostics and Radiological Health3435Center for Biologics Evaluation and Research36Preface3738Additional Copies394041CDRH42Additional copies are available from the Internet. You may also send an e-mail request to CDRH-43Guidance@ to receive a copy of the guidance. Please use the document number 16008 to identify the guidance you are requesting.444546CBER4748Additional copies are available from the Center for Biologics Evaluation and Research (CBER), by 49written request, Office of Communication, Outreach, and Development (OCOD), 10903 New50Hampshire Ave., Bldg. 
71, Room 3128, Silver Spring, MD 20993-0002, or by calling 1-800-835-4709 or 240-402-8010, by email, ocod@ or from the Internet at /BiologicsBloodVaccines/GuidanceComplianceRegulatoryInformation/Guidances/default.htm.

Table of Contents

I. Introduction
II. Background
III. Scope
IV. Recommendations to Support Recognition of Publicly Accessible Genetic Variant Databases of Human Genetic Variants as Sources of Valid Scientific Evidence Supporting Clinical Validity of NGS Tests
    A. Database Procedures and Operations
    B. Data Quality
    C. Curation, Variant Interpretation and Assertions
    D. Professional Training and Conflicts of Interest
V. FDA’s Genetic Variant Database Recognition Process
    A. Recognition Process for Genetic Variant Databases
        1. Submission for Recognition
        2. FDA Review of Genetic Variant Database Policies and Procedures
        3. Maintenance of FDA Recognition
    Use of Third Parties
    Use of Data and Assertions from Recognized Genetic Variant Databases

Contains Nonbinding Recommendations
Draft – Not For Implementation

requirements are cited. The use of the word should in Agency guidance means that something is suggested or recommended, but not required.

II. Background

NGS can enable rapid, broad, and deep sequencing of a portion of a gene, an entire exome(s), or a whole genome and may be used clinically for a variety of diagnostic purposes, including risk prediction, diagnosis, and treatment selection for a disease or condition. The rapid adoption of NGS-based tests in both research and clinical practice is leading to identification of an increasing number of genetic variants, including rare variants that may be unique to a single individual or family.
Understanding the clinical significance of these genetic variants holds great promise for the future of personalized medicine.

Although the importance of genetic variant data aggregation is widely recognized, today much of the data that would be useful to support clinical validity of NGS-based tests is generally stored in a manner in which it is not publicly accessible. Aggregation of clinical genotype-phenotype associations and evaluation of the level of evidence underlying these associations under a well-defined process will continue to promote more rapid translation of genetic information into useful clinical evidence.

For the purposes of this draft guidance document, a “genetic variant database” is a publicly accessible database of human genetic variants that aggregates and curates reports of human phenotype-genotype relationships to a disease or condition with publicly available documentation of evidence supporting those linkages. Genetic variant databases may also include assertions[1] about specific genotype-phenotype correlations.

FDA believes that the aggregation,[2] curation,[3] and interpretation[4] of clinical genotype-phenotype associations in genetic variant databases could support the clinical validity of claims made about a variant detected by an NGS-based test and a disease or condition. In relying on assertions in
In relying on assertions in 133genetic variant databases that follow the recommendations in this guidance, FDA hopes to134encourage the deposition of variant information in such databases, reduce regulatory burden on 135test developers, and spur advancements in the interpretation and implementation of precision 136medicine.137138Publicly Accessible Databases of Human Genetic Variants as Sources of Valid Scientific139Evidence Supporting Clinical Validity140141To determine whether an NGS-based test has a reasonable assurance of safety and effectiveness, 142the Agency relies upon the review of valid scientific evidence to support the analytical and143clinical performance of the test. Valid scientific evidence is defined as evidence from well-144controlled investigations, partially controlled studies, studies and objective trials without145matched controls, well-documented case histories conducted by qualified experts, and reports of 146significant human experience with a marketed device, from which it can fairly and responsibly1 For the purposes of this guidance, an assertion is the informed assessment of a genotype-phenotype correlation (orlack thereof) given the current state of knowledge for a particular variant. An assertion is generally noted in thegenetic variant database entry for a particular variant (e.g., benign, drug resistant, etc.).2 For the purposes of this guidance, the term aggregation refers to the process by which variant data aresystematically input into a genetic variant database. 
This process may require that data conform to specified formats.3 For the purposes of this guidance, curation refers to the process by which data regarding a specific variant arecollected from various sources, annotated, and maintained over time.4 For the purposes of this guidance, the term interpretation refers to the process by which genetic variant databasepersonnel evaluate the evidence regarding a linkage between a genetic variant and a disease or condition and make an assertion about that linkage (or lack thereof).Contains Nonbinding RecommendationsDraft – Not For Implementation147be concluded by qualified experts that there is a reasonable assurance of safety and148effectiveness.5In determining whether a particular NGS test has a reasonable assurance of safety 149and effectiveness, FDA must determine, based on valid scientific evidence that “in a significant 150portion of the target population, the use of the device for its intended uses and conditions of use, 151when accompanied by adequate directions for use and warnings against unsafe use, will provide 152clinically significant results.”6153The evidence residing in many genetic variant databases has been collected from multiple154155sources that can meet the valid scientific evidence definition, such as evidence from well-156controlled clinical investigations, clinical evidence generated in CLIA (Clinical Laboratory157Improvement Amendments of 1988)-certified laboratories, published peer-reviewed literature, 158and certain case study reports. Some organizations that are currently developing genetic variant 159databases have adopted protocols and methodologies (e.g., quality measures) and/or external 160guidelines (e.g., from professional societies or standards development organizations) for161evidence aggregation, curation, and interpretation practices. 
While interpretation processes may vary across databases and organizations, they typically involve the use of qualified experts who make informed conclusions about the presence or absence of a genetic variant and its meaning for a particular disease or clinical decision.

Further, there are several parallels between the standards set forth by well-recognized professional guidelines for variant interpretation and FDA review of clinical validity. Personnel interpreting variants use a range of evidence, including the types and positions of variants, inheritance, prevalence, well-established functional studies, and prior knowledge of gene-disease relationships. Generally, the standards for use of evidence appear to parallel the types of evidence appropriate to support an FDA premarket submission. Under 21 CFR 860.7(c)(2), isolated case reports, random experience, reports lacking sufficient details to permit scientific evaluation, and unsubstantiated opinions are not regarded as valid scientific evidence. Accordingly, FDA believes that summary literature is inferior in this respect to data available for independent evaluation. FDA assesses clinical validity based on the totality of available evidence provided in a given submission. Similarly, well-recognized professional guidelines dictate that database personnel interpreting variants integrate multiple lines of evidence to make an assertion of clinical validity.

The Agency believes such practices help assure the quality of data and assertions within genetic variant databases and has built upon these approaches in developing the recommendations in this guidance.

FDA has long believed that public access to data is important so that all interested persons (e.g., healthcare providers and patients) can make the best medical treatment decisions.
To that end, for all IVDs that have received clearance or de novo classification from FDA since November 2003, FDA has published a Decision Summary containing a review of the analytical and clinical validity data and other information submitted by the applicant to support the submission and FDA’s justification for clearing or classifying the IVD; FDA is also required to publish Summaries of Safety and Effectiveness Data for approved PMAs under section 520(h) of the Federal Food, Drug and Cosmetic Act (FD&C Act) (21 U.S.C. 360j(h)).[7] FDA believes that similar public availability and access to data contained in genetic variant databases is important to patients and healthcare providers in order to make fully informed medical decisions.

[5] 21 CFR 860.7(c)(2).
[6] 21 CFR 860.7(e)(1).

FDA believes that if genetic variant databases follow the recommendations in this document, including transparency regarding evidence evaluation, and obtain FDA recognition as described below, the data and assertions within would generally constitute valid scientific evidence that can be used to support clinical validity.

III. Scope

This draft guidance document describes FDA’s considerations in determining whether a genetic variant database is a source of valid scientific evidence that could support the clinical validity of an NGS-based test in a premarket submission.
This draft guidance further outlines the process by which administrators[8] of publicly accessible genetic variant databases could voluntarily apply to FDA for recognition, and how FDA would review such applications and periodically reevaluate recognized databases.

The genetic variant databases discussed in this draft guidance only include those that contain human genetic variants, and do not include databases used for microbial genome identification and detection of antimicrobial resistance and virulence markers. This draft guidance does not apply to software used to classify and interpret genetic variants, but instead, only regards use of curated databases using expert human interpretation.

[7] No Decision Summaries or Summaries of Safety and Effectiveness Data are posted for those devices for which the applicant failed to demonstrate substantial equivalence or a reasonable assurance of safety and effectiveness.
[8] FDA acknowledges that many databases may not use the term “administrator” or may have a committee of individuals that oversee the database. Therefore, for the purposes of this guidance, a genetic variant database administrator is the entity or entities that oversee database operations.

IV. Recommendations to Support Recognition of Publicly Accessible Genetic Variant Databases of Human Genetic Variants as Sources of Valid Scientific Evidence Supporting Clinical Validity of NGS Tests

FDA believes that evidence contained in a genetic variant database that conforms to the recommendations described below would generally constitute valid scientific evidence that can be used to support the clinical validity of an NGS-based test.

FDA believes that such a genetic variant database would: (1) operate in a manner that provides sufficient information and assurances regarding the quality of source data and its evidence review and variant assertions; (2) provide transparency regarding its data sources and its operations, particularly around how variant evidence is evaluated and interpreted; (3) collect, store, and report data and conclusions in compliance with all applicable requirements regarding protected health information, patient privacy, research subject protections, and data security; and (4) house sequence information generated by validated methods.

In the subsections below, FDA discusses recommendations for the operation of a genetic variant database, and the aggregation, curation, and interpretation of data therein, so that such data would generally constitute valid scientific evidence supportive of clinical validity. FDA acknowledges that individual genetic variant databases may have different, but equally scientifically valid, approaches to assuring data quality, clinical relevance, data security, patient privacy, and transparency.
Additionally, FDA recognizes that several professional societies have developed or are developing guidelines for genetic variant curation and interpretation that may differ depending upon discipline, but may each be appropriate in the context of the intended use. Genetic variant database administrators should focus on ensuring that their procedures and quality requirements are sufficiently robust to provide a high degree of confidence in their conclusions regarding genotype-phenotype associations.

A. Database Procedures and Operations

Transparency and Public Accessibility: FDA recommends that genetic variant database administrators make publicly available sufficient information regarding data sources and standard operating procedures (SOPs) for evaluation and interpretation of evidence to allow FDA and the public to understand the criteria and processes used to collect and interpret evidence about variants and enable patients and healthcare providers to make fully informed medical decisions.

SOP Version Control: SOPs should define how variant information is aggregated, curated, and interpreted. These SOPs should be documented and versioned. Changes to SOPs should be clearly documented with sufficiently detailed information regarding the change accompanied by any necessary explanation to ensure all stakeholders understand any limitations created by or implications of the change in procedure. To maintain quality variant assertions and ensure that genetic variant database operations keep pace with advances in technology and scientific knowledge, operations and SOPs should be reviewed at least on an annual basis.

Data Preservation: FDA recommends that genetic variant database administrators have processes in place for assessing overall database stability and architecture and for ensuring that data linkages are properly maintained.
When a genetic variant database contains linkages to secondary databases, the genetic variant database administrator should have predefined processes in place to recognize changes to the secondary databases and account for them in version control of the primary database. FDA recommends that genetic variant database administrators back up the database on a regular basis so that it can be reinstated as necessary.

Genetic variant database administrators should have a plan in place to ensure database content and processes are preserved in the event a genetic variant database ceases operations permanently or temporarily (e.g., a database loses funding, infrastructure upgrades). A location to deposit data, including versioning information and supporting SOPs and documentation, in the event that the genetic variant database ceases operation should be identified.

Security and Privacy: Genetic variant database operations must be in compliance with all applicable federal laws and regulations (e.g., the Health Insurance Portability and Accountability Act, the Genetic Information Nondiscrimination Act, the Privacy Act, the Federal Policy for the Protection of Human Subjects (“Common Rule”), etc.) regarding protected health information, patient privacy, research involving human subjects, and data security, as applicable. It is the responsibility of the genetic variant database administrator to identify the applicable laws and regulations and to assure that any requirements are addressed.
Genetic variant database administrators should also put in place adequate security measures to ensure the protection and privacy of patient and protected health information and provide training for database staff on security and privacy protection.

Data formats: To facilitate genetic variant database use for regulatory purposes and to help assure the accuracy and quality of variant assertions, genetic variant database administrators should employ commonly accepted data formats and identify which format is in use by the genetic database. This standardization will help minimize ambiguity regarding variants and better enable comparisons of variant assertions between different databases or other entities.

B. Data Quality

It is essential that the data and information regarding genotypes and phenotypes or clinical information placed into the genetic variant database are of sufficient quality, and based on current scientific knowledge, in order for there to be a reasonable assurance that the assertions made linking specific genetic variants to diseases or conditions are accurate.

Nomenclature: To aid in the accurate interpretation of genetic variants, genetic variant databases should use consistent nomenclature that is widely accepted by the genomics community for gene names and/or symbols, genomic coordinates, variants, described clinical and functional characteristics, and classifications.
The genetic variant database administrator should also make available a detailed description of which nomenclature is used to allow FDA and external users to accurately interpret the information presented.

Metadata: Variant data in the genetic variant database should be accompanied by metadata, including the number of independent laboratories and/or studies reporting the variant classification, name of the laboratory(ies) that reported the variant, the name of the test used to detect the variant, and, to the extent possible, details of the technical characteristics of the test that was used (e.g., reference sequence version or build, instrument, software, bioinformatics tools, etc.) and variant characteristics (e.g., zygosity, phasing, and segregation). Genetic variant databases should clearly and transparently document evidence source(s) used to support variant interpretation (e.g., literature, well-documented case histories, etc.).

Data Uniqueness: Genetic variant database operations should also include methods to ensure that individual data points (e.g., a variant from one individual for a particular phenotype) are not represented more than once in the database.

C. Curation, Variant Interpretation and Assertions

The processes that genetic variant database personnel use for curation and variant interpretation should be based on well-defined SOPs and carried out by qualified professionals.

Curation and Variant Interpretation: Written SOPs for curation and variant interpretation, including evaluation of data from clinical practice guidelines, peer-reviewed literature, and pre-curated knowledge bases, should be available to the public for review. SOPs should generally include validated decision matrices, such as those based on well-recognized professional guidelines.
All genetic variant database curation and interpretation rules, and future modifications of those rules, should be explained and made available to the public. Furthermore, if curated data or variant interpretations from other sources are to be integrated into the genetic variant database, then the curation and interpretation processes and data quality of those outside sources should be audited by the database administrator on a regular basis. Each interpretation should be performed independently by at least two qualified and trained professionals, as discussed below, and genetic variant databases should have SOPs for resolving differences in interpretation. Providing SOPs publicly for each of these activities will allow outside users to evaluate the evidence used in variant interpretation and thereby promote the consistency of interpretation.

FDA believes that use of publicly available decision matrices[9] for variant interpretation that are based on rigorous professional guidelines is central to assuring that assertions from genetic variant databases constitute valid scientific evidence supporting the clinical validity of a test. FDA reviewers must evaluate evidence in the context of a test’s intended use and conditions of use, incorporating specific facts about genes or diseases under consideration (e.g., population incidence of a disease, variant incidence) into their review. See 21 CFR 860.7(e)(1). Similarly, such factors should be incorporated into a finalized decision matrix.

Assertions: The types of evidence that personnel interpreting variants may use for an interpretation, and their corresponding strengths, should be defined, and combined into a scoring system. Assertions within an FDA-recognized genetic variant database should be appropriate to the level of certainty and the nature of the genotype-phenotype relationship and be adequately supported.
Assertions should be versioned, such that changes in assertions over time are recorded and maintained. Assertions and the evidence underlying them should be truthful and not misleading and be made in language that is clear and understandable. In order to be FDA-recognized, a genetic variant database should not include any recommendations regarding clinical treatment or diagnosis.

[9] For the purposes of this guidance, a decision matrix is an evidence-based tool used to guide the interpretation of the genotype-phenotype relationship between variants and diseases or conditions.

For example, it is appropriate for an assertion to include descriptive language about a variant such as responder, non-responder, pathogenic, benign, likely pathogenic, likely benign, variant of unknown significance, etc. as long as such language is truthful, not misleading, and supported by adequate evidence detailed within the genetic variant database. FDA believes that it is generally not scientifically appropriate to make a definitive assertion (e.g., pathogenic) about the clinical validity of a variant based on a single piece of evidence, or on only weak evidence. Assertions that a particular genotype-phenotype association is clinically valid should generally involve multiple lines of evidence and, at a minimum, should identify a primary source of scientific evidence and other supporting evidence.
Further, wherever appropriate to avoid any potential misunderstanding regarding the strength of the evidence supporting an assertion, the assertion should include a clear description of the evidence associated with it.

D. Professional Training and Conflicts of Interest

Professional Training: FDA recognizes that many different types of genetics professionals may be involved in the curatorial and interpretive process as part of a team (e.g., genetic counselors, Ph.D.-level scientists, physicians). Adequate training and expertise of personnel interpreting variants plays an important role in the quality of variant review and interpretation. FDA believes that interpretation should be performed by qualified professionals with appropriate levels of oversight in place (e.g., multiple levels of review). Personnel interpreting variants should have received adequate training and there should be methodologies in place, such as proficiency testing, to ensure that such personnel meet and maintain high quality standards over time.

Finally, curation procedures should ensure that all data has been collected in compliance with all applicable requirements for protecting patient health information and research involving human subjects.

Conflicts of Interest: Conflicts of interest, especially financial ones, could introduce bias and undermine the quality of variant interpretations in genetic variant databases, as well as the confidence in such interpretations, if not adequately mitigated.
To be considered for recognition by FDA, efforts should be made to minimize, and make transparent, any potential conflicts of interest pertaining to a genetic variant database or its personnel.

V. FDA’s Genetic Variant Database Recognition Process

FDA believes that data and assertions from genetic variant databases that follow the recommendations discussed in this document would generally constitute valid scientific evidence supportive of clinical validity in a premarket submission. Therefore, FDA intends to implement a recognition process[10] for publicly accessible genetic variant databases and their assertions to streamline premarket review of NGS tests. Specific variant assertions and underlying data from a recognized genetic variant database could generally be submitted by NGS-test developers as part of their premarket review submission, if applicable, in some cases without submission of additional clinical data regarding that variant.

[10] The genetic variant database recognition process discussed in this document may be viewed as analogous to the standards recognition process under section 514 of the FD&C Act (21 U.S.C. 360d), but would not be conducted under this provision.
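The data quality and assertion recommendations in Sections IV.B and IV.C (metadata accompanying each variant, data uniqueness, versioned assertions, and multiple lines of evidence for definitive assertions) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not a format or API that the guidance prescribes: the class names, field names, and the choice of which labels count as "definitive" are all hypothetical and introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumption for illustration: which labels count as "definitive".
# The guidance gives "pathogenic" as an example; "benign" is added here.
DEFINITIVE_LABELS = {"pathogenic", "benign"}


@dataclass(frozen=True)
class Observation:
    """One data point: a variant from one individual for one phenotype."""
    variant_id: str       # identifier in the database's chosen nomenclature
    individual_id: str    # de-identified subject key (hypothetical field)
    phenotype: str
    laboratory: str       # metadata: reporting laboratory
    test_name: str        # metadata: test used to detect the variant
    reference_build: str  # metadata: reference sequence version or build

    def key(self) -> Tuple[str, str, str]:
        # Uniqueness key: same variant, individual, and phenotype.
        return (self.variant_id, self.individual_id, self.phenotype)


@dataclass(frozen=True)
class Assertion:
    label: str                         # e.g. "likely pathogenic"
    evidence_sources: Tuple[str, ...]  # documented evidence sources
    sop_version: str                   # SOP version used for interpretation


class VariantDatabase:
    def __init__(self) -> None:
        self._observations: Dict[Tuple[str, str, str], Observation] = {}
        self._assertions: Dict[str, List[Assertion]] = {}

    def add_observation(self, obs: Observation) -> bool:
        """Store a data point once; return False for a duplicate."""
        if obs.key() in self._observations:
            return False
        self._observations[obs.key()] = obs
        return True

    def independent_laboratories(self, variant_id: str) -> int:
        """Count distinct laboratories that reported the variant."""
        return len({o.laboratory for o in self._observations.values()
                    if o.variant_id == variant_id})

    def record_assertion(self, variant_id: str, assertion: Assertion) -> int:
        """Append a new assertion version and return its version number."""
        distinct_sources = set(assertion.evidence_sources)
        if assertion.label in DEFINITIVE_LABELS and len(distinct_sources) < 2:
            raise ValueError("definitive assertion needs multiple lines of evidence")
        history = self._assertions.setdefault(variant_id, [])
        history.append(assertion)  # earlier versions are kept, never overwritten
        return len(history)

    def assertion_history(self, variant_id: str) -> List[Assertion]:
        """Return all assertion versions for a variant, oldest first."""
        return list(self._assertions.get(variant_id, []))
```

Keeping the full assertion history, instead of overwriting the latest label, is one simple way to keep changes in assertions over time recorded and maintained. A production database would also need persistence, access control, and SOP-driven review, none of which this sketch attempts.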
English-Language Paper for the Food Quality Testing Major
Pictures of Appetizing Foods Activate Gustatory Cortices for Taste and Reward

W. Kyle Simmons 1, Alex Martin 2 and Lawrence W. Barsalou 1

1 Department of Psychology, Emory University, Atlanta, GA 30322, USA and 2 Cognitive Neuropsychology Section, Laboratory of Brain and Cognition, National Institute of Mental Health, Bethesda, MD, USA

Increasing research indicates that concepts are represented as distributed circuits of property information across the brain’s modality-specific areas. The current study examines the distributed representation of an important but under-explored category, foods. Participants viewed pictures of appetizing foods (along with pictures of locations for comparison) during event-related fMRI. Compared to location pictures, food pictures activated the right insula/operculum and the left orbitofrontal cortex, both gustatory processing areas. Food pictures also activated regions of visual cortex that represent object shape. Together these areas contribute to a distributed neural circuit that represents food knowledge. Not only does this circuit become active during the tasting of actual foods, it also becomes active while viewing food pictures. Via the process of pattern completion, food pictures activate gustatory regions of the circuit to produce conceptual inferences about taste. Consistent with theories that ground knowledge in the modalities, these inferences arise as reenactments of modality-specific processing.

Keywords: concepts, fMRI, insula/operculum, knowledge, orbitofrontal cortex

Introduction

How are concepts for everyday objects represented in the brain? Based on accumulating lesion and neuroimaging evidence, an object concept is represented as a distributed circuit of property representations across the brain’s modality-specific areas (Martin, 2001; Martin and Chao, 2001; Thompson-Schill, 2003). On encountering a physical object, relevant modalities represent it during perception and action. As the object is processed, association areas partially capture property information on these
modalities, so that this information can later be reactivated during conceptual processing, when the object is absent (Damasio and Damasio, 1994; Simmons and Barsalou, 2003). Although these conceptual reenactments share important commonalities with mental imagery, there are also important differences. Mental imagery typically results from deliberate attempts to construct conscious vivid images in working memory. In contrast, the perceptual reenactments that underlie conceptual processing often appear to lie outside awareness, resulting instead from relatively automatic and implicit processes. Of primary interest, these reenactments occur in responses to words and other symbols, and play central roles in the representation of conceptual knowledge (Barsalou, 1999, 2003a,b; Barsalou et al., 2003a,b).

The category of tools illustrates the distribution of property representations across modality-specific brain areas. When people use a hammer, a distributed set of brain areas becomes active to represent the hammer’s properties, including its visual form (ventral occipitotemporal cortex), the physical actions used to manipulate it (ventral premotor cortex and intraparietal sulcus), and the visual motion that results (middle temporal gyrus) (Beauchamp et al., 2002; Chao et al., 1999; Chao and Martin, 2000; Damasio et al., 2001; Grafton et al., 1997; Handy et al., 2003; Johnson-Frey, 2004; Martin et al., 1995; Perani et al., 1995). As just described, the brain’s association areas capture this distributed set of modality-specific states for later conceptual use. On subsequent occasions, when no hammers are present, reenactments of these states represent hammers conceptually (e.g. during language comprehension and thought).

In the experiment reported here, we explored the distributed property account for the category of foods. Foods constitute a central category for humans, not only in perception and action, but in higher cognition (Ross and Murphy, 1999). Previous research on food concepts has addressed the visual properties
of fruits and vegetables, relative to the visual properties of other object categories (McRae and Cree, 2002). Here, we focus instead on the tastes of high-caloric, high-fat processed foods, such as cheeseburgers and cookies (see Fig. 1). We focus on taste properties because the tastes of foods are at least as important as their visual appearances. We focus on processed foods because they are central to the modern diet and because they are associated with strong gustatory and appetitive responses that underlie how people select and consume them.

If a distributed circuit of property information represents food knowledge, then viewing a food picture should not only activate brain areas that represent visual properties of the pictured food, but should also activate brain areas that represent how the food is likely to taste and how rewarding it would be to eat. Once one part of the distributed circuit becomes active by viewing the picture, the remainder should become active via the conceptual inference process of pattern completion across the circuit. Given the central role that such inferences play in normal food selection and consumption, it is essential to understand their bases in the brain. Furthermore, given the extensiveness of eating disorders, obesity and other food-related problems, it is important to understand how people generate taste and reward inferences to the broad array of food representations available in modern culture.

We presented pictures of food and non-food entities (location pictures) to subjects undergoing event-related fMRI and predicted that a distributed circuit of brain areas would become active to represent the visual and gustatory properties of the pictured foods. Regarding the visual properties of foods, a large literature demonstrates that ventral temporal regions underlie the representation of objects’ visual form properties (Ishai et al., 1999, 2000). Thus, we expected regions of the inferior temporal and fusiform gyri to respond to the distinctive visual
properties of the pictured foods. Analogously, location pictures should activate parahippocampal gyrus, given that this region responds to the visual-spatial properties characteristic of buildings and landmarks (Aguirre et al., 1998; Epstein and Kanwisher, 1998; Epstein et al., 1999).

© The Author 2005. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@
Cerebral Cortex October 2005;15:1602–1608. doi:10.1093/cercor/bhi038. Advance Access publication February 9, 2005.

Most importantly, the current study attempted to demonstrate that pictures of visual objects, in this case foods, can produce taste inferences. If the distributed account of concept representation is correct, then multiple modality-specific regions should become active when people represent foods conceptually. Not only should visual areas become active to represent a food’s unique visual properties, gustatory areas should become active to represent how the food tastes. Once people access knowledge for a pictured food, an inference is produced about how it tastes. Even though people are not actually tasting the food, their gustatory system becomes active to represent this inference.
Specifically, we predicted that simply viewing pictures of appetizing foods (relative to locations) should activate two brain regions that commonly respond to actual taste stimuli in psychophysical neuroimaging studies (Francis et al., 1999; de Araujo et al., 2003a,b; O’Doherty et al., 2001b). The first area, a region in the insula/operculum, is known to represent how foods actually taste (Rolls et al., 1988; Rolls and Scott, 2003; Scott et al., 1986). The second area, a region in orbitofrontal cortex (OFC), is known to represent the reward values of tastes (Gottfried et al., 2003; Rolls et al., 1989). Here we demonstrate that simply viewing pictures of processed foods activates both brain regions in much the same way that taste stimulants do in psychophysical studies.

Materials and Methods

Subjects

Nine right-handed, native-English-speaking volunteers from the Emory University community participated in the scanning study (six female and three male; age range, 18–45 years). All participants completed a health questionnaire prior to scanning and none of the participants indicated a history of neurological problems. In accordance with protocols prescribed by Emory University’s Institutional Review Board, all participants read and signed an informed consent document describing the procedures and possible risks.

Sixteen native-English-speaking volunteers from the Emory community participated in the stimulus selection study (ten female and six male; age 19–46 years). None of these volunteers participated in the later brain imaging experiment. As with the imaging participants, all participants read and signed an informed consent document describing the procedures and possible risks in accordance with protocols prescribed by Emory University’s Institutional Review Board.
Experimental Design

Before beginning the brain imaging phase of the study, 32 types of foods and 35 types of locations were selected as candidate materials. The foods (e.g. cheeseburger, spaghetti, cookie, etc.) in the list were chosen because they are all encountered frequently in American society. In addition, only processed foods that are relatively high in fat and calories were used. No fruits or vegetables were included. The locations (e.g. house, mall, school, etc.) in the list were chosen because they are all types of places that participants in the study might visit frequently.

The foods and locations were equated for familiarity by having volunteers (none of whom participated in the brain imaging experiment) provide familiarity ratings for the 35 types of locations and 32 types of foods. Ratings were made on a 1–7 scale, with 1 indicating that a type of food or location was completely unfamiliar and 7 indicating that it was extremely familiar. Based on these ratings, 15 food and 15 location types were selected such that no reliable familiarity differences existed between the two groups of stimuli.

Between six and ten pictures for each type of food and location were then collected. A group of 16 participants viewed all 259 pictures and rated each for how typical it was of its respective food or location type. Ratings were made on a 1–7 scale, with 1 indicating that a picture was not at all typical of its food or location type and 7 indicating that it was very typical. For each type of food or location, the three most typical pictures were selected for use in the imaging study, thus yielding a total of 90 picture stimuli (45 foods, 45 locations) equated for typicality. All of the food and location pictures depicted non-unique entities that would not be individually recognizable to the participants. Finally, 23 location pictures and 22 food pictures were randomly selected to create phase-scrambled images that were presented during scanning as filler items (see Fig. 1).

During scanning, participants viewed food, location and
scrambled pictures. For each picture, participants used a response pad to provide yes/no judgments as to whether it was the same as or different from the preceding picture. The pictures were presented in the center of the screen for 2 s each. Interspersed among picture presentations were variable (‘jittered’) interstimulus intervals (mean = 5.7 s, range = 2–20 s) that were included to optimize estimation of the event-related fMRI response. During these interstimulus intervals, participants saw a fixation cross presented in the center of the screen. Participants were instructed that when they saw the fixation cross they should continue attending to the screen and prepare for the next picture presentation. Prior to beginning data collection, participants performed an abbreviated practice run to ensure that they understood the task instructions.

Functional data were collected in three scanning runs. The trial lists for the three runs were counterbalanced across participants. During each run, participants saw 16 food and 16 location pictures. Fifteen picture presentations from each category were novel pictures, while one picture was repeated to maintain the participants’ attention to the picture repetition detection task. In other words, one location picture and one food picture were repeated in each scanning run. Across the three scanning runs, each subject saw three food picture repetitions and three location picture repetitions. Subjects were told in advance that repeated stimuli would occur in each run. Knowing this, and given that the repeated stimuli occurred infrequently, subjects had to pay close attention to each picture presentation to ensure that they did not miss a repetition trial. The data from the repetition trials in each run were not analyzed, given that they were only included to ensure that participants remained attentive to the task. Subjects were highly accurate at repetition detection (mean correct = 98.8%, SD = 0.94). Each 5 min 8 s run consisted of 4 min 48 s of the repetition detection
task, followed by an additional 20 s rest period.

Figure 1. Examples of location, food, and scrambled image stimuli.

Image Acquisition and Analysis

Pictures were back-projected onto a screen located at the head of the scanner and were viewed through a mirror mounted on the head coil. Stimulus presentation and response collection were controlled using Presentation software (v. 0.70, ). In each of the three imaging runs, 154 gradient echo recalled MR volumes depicting BOLD contrast were collected with a 3 T Siemens Trio scanner. Each volume consisted of 34 contiguous, 2 mm thick slices in the axial plane (TE = 30 ms, TR = 2000 ms, flip angle = 90°, FOV = 192 mm², 64 × 64 matrix). Voxel size at acquisition was 3 × 3 × 2 mm, but was 3 × 3 × 3 mm after spatial normalization.

Prior to statistical analyses, image preprocessing was conducted in SPM99 (Wellcome Department of Neurology, UK, http://www.fil.ion. ). To reduce motion-related signal changes between volumes, each participant’s scans were realigned and resliced using sinc interpolation. Volumes were then normalized to a template EPI scan and finally smoothed in the axial plane using a 6 mm isotropic Gaussian kernel.

Subsequent statistical analyses were also conducted using SPM99. First, individual subjects’ data were analyzed using multiple regression.
For each subject, event-related changes in neural activity were modeled using a finite impulse response model corresponding to picture stimuli presentation and convolved with the standard SPM hemodynamic response function. Interstimulus fixation periods having variable durations served as the signal baseline. Global effects were removed by proportional scaling and the data were low-pass filtered. Condition effects at the subject level were then assessed with orthogonal contrasts comparing neural activity for food and location pictures. These contrast images, one for each participant, were then analyzed in a second-level random effects analysis of the foods–locations and locations–foods contrasts using one-sample t-tests. A statistical significance threshold of P < 0.005 (uncorrected for multiple comparisons) and a spatial extent threshold of at least seven contiguous voxels (corresponding to P < 0.05 uncorrected) were used in the random effects analyses.

There are at least two reasons why the use of uncorrected P-values in the present study is warranted. First, the activations reported here were identified using random effects analyses, which take into account both within- and between-subjects variance. Not only does this allow the results to be generalized to the population from which subjects were drawn, but it also makes the analyses inherently robust statistically. Secondly, based on much previous research reported in the literature (see Introduction and Discussion), we started with a priori hypotheses that the insula/operculum and OFC would be active in the foods–locations contrasts. Additionally, given that both food and location pictures depicted common objects, both conditions should activate regions in the ventral temporal cortex known to represent objects’ visual form properties. More specifically, however, we predicted that the fusiform/parahippocampal gyrus would be active in the locations–foods contrasts.
To be reported here as significant, any other areas of activity would need to be active at the P < 0.05 level with correction for multiple comparisons. No other areas reached this level of statistical significance.

Results

Viewing food pictures for 2 s in a simple picture-matching task activated gustatory cortex. Specifically, food pictures, relative to location pictures, activated a region of the right insula/operculum, an area that psychophysical research has shown represents the tastes of foods (extent threshold, P = 0.004; see Table 1 and Fig. 2). Importantly, this region was not only significantly more active for food pictures than for location pictures, but it was also reliably activated relative to the fixation baseline (one-tailed, P = 0.033).

In addition, food pictures, relative to location pictures, activated two regions in the left OFC that psychophysical research has shown represent the reward values of tastes. One of these regions was located in the lateral portion of the OFC (extent threshold, P = 0.05; Fig. 3); the other, located more superiorly, stretched into the anterior aspect of the cingulate cortex (extent threshold, P = 0.01). While the lateral OFC region was reliably activated relative to the fixation baseline (P < 0.001), the more superior OFC/anterior cingulate region was not (one-tailed, P = 0.155).

Viewing food pictures, relative to location pictures, also produced robust activity in ventral occipitotemporal cortex, bilaterally. Two of these areas were located in the right hemisphere: one extending from the inferior occipital gyrus forward into the inferior temporal gyrus (extent threshold, P = 0.02) and the other located more anteriorly in the inferior temporal gyrus (extent threshold, P = 0.035). Additional activity was observed in the left hemisphere, stretching from inferior occipital gyrus into the fusiform and inferior temporal gyri (extent threshold, P = 0.001). In addition to producing significantly more activity than location pictures, food pictures reliably activated each of the ventral
temporal areas above the signal baseline (P < 0.0001).

In contrast, and consistent with previous reports (Aguirre et al., 1998; Epstein and Kanwisher, 1998; Epstein et al., 1999), location pictures, relative to food pictures, produced bilateral activity extending from the medial portion of the fusiform gyrus into parahippocampal gyrus (see Table 1). Activity in these regions was not only greater for locations than foods, but was also reliably activated relative to the signal baseline (P < 0.001 for both hemispheres).

Table 1. Regions showing differential responses to food and location pictures

Contrast             Side/location                   MNI coordinates (x, y, z)   Peak T   P
Foods > locations    R insula                         36,  −6,   9                5.92    <0.001
                     L OFC                           −21,  33, −18                6.60    <0.001
                     L OFC/anterior cingulate        −18,  45,  −6                5.08    <0.001 (a)
                     R inferior temporal gyrus        48, −45, −12                5.05    <0.001
                     R inferior temporal gyrus        48, −66,  −9                5.99    <0.001
                     L fusiform                      −48, −60, −18                4.69     0.001
Locations > foods    L fusiform                      −21, −39, −12               14.50    <0.001
                     R fusiform                       27, −42, −15                9.61    <0.001

L, left; R, right.
(a) While this region was significantly active for food pictures relative to location pictures, it was not reliably active relative to the fixation baseline.

Discussion

These findings support the hypothesis that a distributed circuit of brain regions represents conceptual knowledge about foods. As Figure 4a,b illustrates, viewing food pictures activated two brain regions that lie in close proximity to gustatory regions active during psychophysical studies of taste perception (Francis et al., 1999; de Araujo et al., 2003a,b; O’Doherty et al., 2001b). As Figure 4a illustrates, food pictures activated the insula/operculum very near regions that become active when people actually taste glucose, sucrose, salt, or umami. As Figure 4b similarly illustrates, food pictures also activate OFC very near regions that become active when people experience taste stimuli directly. The close proximity of the regions active for food pictures to well-established gustatory areas suggests that food pictures automatically activate gustatory areas to produce conceptual inferences about taste properties.

The two taste areas observed here are associated with different functions in the gustatory system. The insula/operculum receives projections from the ventroposterior medial nucleus of the thalamus (Rolls and Scott, 2003), the main subcortical processing area for gustatory input, and has been associated with taste per se. The OFC, in contrast, receives projections from the insula/operculum (Rolls and Scott, 2003) and has been associated with the reward values of specific tastes. Specifically, electrophysiological studies in monkeys show that the firing rates of neurons in insula/operculum are not modulated by hunger and satiety, suggesting that they represent taste independent of reward (Rolls et al., 1988). Conversely, the firing rates of neurons in OFC are modulated by hunger and satiety, suggesting that they represent the current reward value of tastes (Rolls et al., 1988). Thus, when a monkey is hungry, the firing rate of OFC neurons is high, given that the reward value of food is high. Similarly in humans, greater activation occurs in gustatory OFC before participants are satiated than after (Gottfried et al., 2003).

Taste reward areas are located in a different OFC region than the reward areas for other stimuli (Elliot et al., 2000; O’Doherty et al., 2001a; Rolls, 2000). For example, the caudal OFC responds to olfactory rewards (de Araujo et al., 2003b; Zald and Pardo, 2000; Öngür et al., 2003), whereas the inferior medial OFC responds to abstract rewards (e.g. money) (O’Doherty et al., 2001a). Interestingly, the inferior medial OFC has a markedly different cytoarchitectonic structure than the more lateral aspect of the OFC where taste activations occur (Öngür et al., 2003). Thus, the OFC areas active in the present study appear to represent the reward value of tastes, rather than reward in general. As
Rolls (2000, p. 285) notes, ‘it is important to realize that it is not just some general ‘‘reward’’ that is represented in the orbitofrontal cortex, but instead a very detailed and information-rich representation of which particular reward or punisher is present’.

Figure 2. Viewing food pictures elicits activity in insula/operculum. A high-resolution anatomical scan showing activity in right insula/operculum associated with viewing pictures of food items. The bar graph displays the average percent signal change in the right insula/operculum cluster for all nine subjects during a period between 4 and 14 s post-stimulus. The y-axis indicates percent signal change relative to signal baseline, with error bars representing ±1 SEM of the subjects. The data shown in the bar graph were obtained in the random effects contrast of foods > locations with P < 0.005.

Laterality of the Taste Activations

Food pictures activated the right insula/operculum and the left OFC. Our a priori prediction was that food pictures would activate both regions bilaterally. Examination of the psychophysical taste literature, however, clarifies the laterality of our results.

First, consider the insula/operculum. Although many psychophysical taste studies observe bilateral activity in this area, the response is typically stronger and more spatially extensive on the right (Small et al., 1999). This may explain why we only found right insula/operculum activation for food pictures. Indeed, lowering the cluster size threshold in our random effects analysis (but not the P-value threshold) revealed significant activity in a region of the left frontal operculum (−48, 21, 12) that is commonly activated in psychophysical taste studies (Small et al., 1999). Although this cluster of activity was smaller in magnitude and size relative to the activation seen on the right, it suggests that our findings
are consistent with the general trend in the psychophysical taste literature for greater insula/operculum activation in the right hemisphere than in the left.

With respect to the OFC, we found significant activations only on the left. It is noteworthy that studies in the psychophysical taste literature are inconsistent with regard to laterality, with bilateral activity reported only in approximately half of the studies. Again, lowering the cluster size threshold (but not the P-value threshold) on the random effects analysis revealed significant activity in the right OFC (15, 45, −3) in nearly the identical location as seen on the left (−18, 45, −6). Perhaps the best explanation, however, for why we observe activity in the left OFC comes from a recent finding by Kringelbach et al. (2003). These researchers identified an area in the left OFC where activity was correlated with subjects’ ratings of taste pleasantness. Interestingly, the area they identified is approximately one centimeter from the activity we observed in the lateral OFC. Given that we only showed pictures of highly appetizing foods, it makes sense that we would observe activity very near the left OFC region that tracks taste pleasantness.

Conclusion

The findings reported here indicate that the gustatory system produces taste responses to pictures of foods, not just to actual foods. Other studies have reported similar results. A previous neuroimaging study on pictures of foods found activation in areas near those observed here (insula and OFC), but using a blocked design with fixed-effects analyses (Killgore et al., 2003). Indeed, still other research has found that even words for tastes activate taste areas (Simmons, W.K., Pecher, D., Hamann, S.B., Zeelenberg, R. and Barsalou, L.W., under review; see Fig. 4b). In general, pictures and words appear to activate property inferences for food tastes and rewards, thus grounding conceptual knowledge in modality-specific brain areas.

Figure 3. Viewing pictures of foods elicits activity in left OFC. A
high-resolution anatomical scan showing activity in left OFC associated with viewing pictures of food items. The bar graph on the left displays the average percent signal change in the left OFC for all nine subjects during a period between 4 and 14 s post-stimulus. The bar graph on the right displays the average percent signal change in the left OFC/anterior cingulate cluster for all nine subjects during a period between 4 and 14 s post-stimulus. The y-axis indicates percent signal change relative to signal baseline, with error bars representing ±1 SEM of the subjects. The data shown in the bar graphs were obtained in the random effects contrast of foods > locations with P < 0.005.

In the experiment reported here, taste inferences arose even when subjects performed fast superficial processing of food stimuli. Subjects were required only to assess whether the current picture exactly matched the previous picture, each presented for only 2 s. No categorization or other form of conceptual processing was required. Furthermore, the large majority of trials required the subject to note that the current picture differed from the previous picture, a judgment that could have potentially interfered with making conceptual inferences. In general, the fact that taste inferences were produced under this particular set of task conditions attests to their strength and ubiquity.

Consistent with previous findings, the experiment here indicated that conceptual representations are distributed across the brain areas that underlie their processing in perception and action. Because different categories are associated with different distributions of multimodal properties (McRae and Cree, 2002), different categories rely on different configurations of brain areas for conceptual representation. As reviewed earlier, much work has shown that thinking
about tools activates brain areas that process visual form, visual motion, and object manipulation. Analogously, we have shown here that thinking about food activates brain areas that process taste, taste reward and food shape. Thus our findings support the view that the brain areas representing knowledge for a particular category are those typically used to process its physical instances.

Besides having implications for theories of distributed conceptual representation, these findings have implications for various societal issues related to food, such as eating disorders, obesity and advertising. Taste inferences in the gustatory system, as observed here, arise in response to a wide variety of food stimuli in the environment and in the media. In eating disorders and obesity, the perception of foods and food pictures, as well as thoughts of food, may be associated with dysfunctional inferences about taste and reward. Conversely, behavioral, cognitive and pharmacological interventions may, in part, restore the gustatory activity underlying inferences about taste and reward to more normal forms.

Notes

This work was supported by NIMH grant 1F31MH070152-01 to K.S. and National Science Foundation grants SBR-9905024 and BCS-0212134 and Emory University research funds to L.W.B. We are grateful to Melissa Armstrong and Christine Wilson for their assistance in stimulus preparation. Correspondence should be addressed to Lawrence W. Barsalou, Department of Psychology, Emory University, 532 North Kilgo Circle, Atlanta, GA 30322, USA. Email: barsalou@emory.ed.

Figure 4. (a) Locations of peak right hemisphere insula/operculum activations reported in taste perception studies. (b) Locations of peak left OFC activations across various tasks. The squares in the insula/operculum at Z = 20 and Z = −9 represent peak activations observed when participants taste sucrose, whereas the square in the lateral OFC at Z = −10 is the peak activation in the area observed to respond to the combination of gustatory and olfactory stimuli, and thus
is a likely candidate for being the center of flavor representation (de Araujo et al., 2003). The square in the insula/operculum at Z = 13 indicates an area of common activation when participants tasted either glucose or salt (O’Doherty et al., 2001). The squares in the insula/operculum at Z = 10 and in the OFC at Z = −6 indicate the peak activations observed when participants taste umami (de Araujo et al., 2003). The squares in the insula/operculum at Z = 5 and in the OFC at Z = −18 represent peak activations when participants tasted glucose (Francis et al., 1999). Diamonds in the inferior medial OFC represent peak activations observed when participants receive abstract rewards (O’Doherty et al., 2001). The circle in the OFC at Z = −10 represents peak activation observed when participants verify the taste properties of concepts using strictly linguistic stimuli (Simmons, Pecher, Hamann, Zeelenberg, and Barsalou, under review). Finally, the circles in the insula/operculum at Z = 9 and in the OFC at Z = −18 and Z = −6 indicate the activation peaks observed in the present study when participants viewed food pictures. When necessary, coordinates reported in other studies were converted from Talairach to MNI space.
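As a supplementary illustration of the second-level random effects analysis described under Image Acquisition and Analysis, the sketch below runs a one-sample t-test across subjects on a single contrast value per subject. It is a minimal sketch, not the SPM99 procedure itself: the function name `one_sample_t` and the nine contrast values are hypothetical and are not data from the study.

```python
import math


def one_sample_t(values):
    """Return (t statistic, degrees of freedom) for H0: population mean == 0."""
    n = len(values)
    mean = sum(values) / n
    # Unbiased sample variance (divide by n - 1).
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    standard_error = math.sqrt(variance / n)
    return mean / standard_error, n - 1


# Hypothetical foods-minus-locations contrast values at one voxel,
# one value per subject (nine subjects, as in the scanning study).
contrasts = [0.21, 0.35, 0.10, 0.28, 0.40, 0.15, 0.33, 0.25, 0.18]

t_stat, df = one_sample_t(contrasts)
print(f"t({df}) = {t_stat:.2f}")
```

Because each subject contributes one value, the test treats subjects as the random factor, which is what allows inference to the population from which the subjects were drawn.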
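Cotton AP2

Acta Agronomica Sinica 2024, 50(1): 126–137 / ISSN 0496-3490; CN 11-1809/S; CODEN TSHPA9
E-mail: ***************
DOI: 10.3724/SP.J.1006.2024.34045

Functional analysis of the cotton AP2/ERF transcription factor GhTINY2 in negatively regulating plant salt tolerance

XIAO Sheng-Hua 1,2,**,*, LU Yan 1,**, LI An-Zi 1, QIN Yao-Bin 1, LIAO Ming-Jing 1, BI Zhao-Fu 1, ZHUO Gan-Feng 1, ZHU Yong-Hong 2, and ZHU Long-Fu 2,*

1 College of Agriculture, Guangxi University / State Key Laboratory of Conservation and Utilization of Agro-Biological Resources in Subtropical Region, Nanning 530000, Guangxi, China; 2 Huazhong Agricultural University / State Key Laboratory of Crop Genetic Improvement, Wuhan 430000, Hubei, China

Abstract (translated from the Chinese): Cotton is a relatively salt-tolerant crop, but high salt stress still causes substantial declines in cotton yield and fiber quality. Mining salt-tolerance genes and dissecting the molecular mechanisms by which cotton responds to salt stress are of great importance for accelerating the genetic improvement and breeding of salt-tolerant cotton. In this study, we identified GhTINY2, an AP2/ERF transcription factor whose expression is highly significantly down-regulated by salt, from transcriptome data of cotton responding to salt stress, and analyzed the salt-tolerance phenotype and physiological indicators of GhTINY2-overexpressing Arabidopsis. The results showed that, under salt stress, the seed germination rate of GhTINY2-overexpressing plants decreased significantly; proline, soluble sugar, and chlorophyll contents were all significantly reduced; and several salt-stress-responsive genes were significantly down-regulated. The plants consequently showed more severe leaf wilting and yellowing. Analysis of RNA-seq data from GhTINY2-overexpressing Arabidopsis showed that differentially expressed genes (DEGs) were enriched in biological processes such as chlorophyll metabolism and response to stimulus, and that the DEGs all tended to be down-regulated. In addition, after GhTINY2 was silenced in cotton by virus-induced gene silencing (VIGS), TRV:GhTINY2 plants showed significantly increased chlorophyll and proline contents under salt stress, which enhanced the salt tolerance of cotton. In summary, GhTINY2 is an important gene that negatively regulates salt-stress tolerance in cotton, and it may in future be used, through modern genetic engineering, to create salt-tolerant cotton materials.

Keywords: cotton; GhTINY2; salt stress; transcription factor; transgenic

Function analysis of an AP2/ERF transcription factor GhTINY2 in cotton negatively regulating salt tolerance

XIAO Sheng-Hua 1,2,**,*, LU Yan 1,**, LI An-Zi 1, QIN Yao-Bin 1, LIAO Ming-Jing 1, BI Zhao-Fu 1, ZHUO Gan-Feng 1, ZHU Yong-Hong 2, and ZHU Long-Fu 2,*

1 State Key Laboratory of Conservation and Utilization of Agro-Biological Resources in Subtropical Region / College of Agriculture, Guangxi University, Nanning 530000, Guangxi, China; 2 State Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430000, Hubei, China

Abstract: Cotton is a relatively salt-tolerant crop, but high salt stress leads to a significant decline in cotton yield and fiber quality. Mining the genes involved in salt tolerance and illuminating the molecular mechanisms that underlie this resistance is of great importance in cotton breeding programs. Here, we identified an AP2/ERF transcription factor GhTINY2 in the transcriptome database from cotton treated with salt, and the relative expression level of GhTINY2 was reduced by salt. Subsequently, the salt-resistant phenotype and physiological indicators of the GhTINY2-overexpression Arabidopsis were analyzed. The results revealed that the GhTINY2-overexpression Arabidopsis had a significant decrease in seed germination rate and in the content of proline, soluble sugar, and chlorophyll under salt stress, leading to more severe leaf wilting compared with WT. RNA-seq data from GhTINY2-transgenic Arabidopsis revealed that differentially expressed genes (DEGs) were enriched in a series of biological processes, including chlorophyll metabolism and response to stimulus, and the relative expression level of these DEGs

This study was supported by the Guangxi University High-level Talent Research Start-up Fund (A3310051044) and the Research Development Fund of the College of Agriculture, Guangxi University (EE101711).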
colocalization analysis
Colocalization analysis is a powerful technique used in various fields such as cell biology, immunology, neuroscience, and ecology. It involves the quantification and analysis of the spatial overlap or association between two or more molecules or structures within a biological sample. This can provide insights into the functional relationships and interactions between these molecules, as well as their subcellular localization and distribution patterns.

There are several methods and approaches that can be used for colocalization analysis, depending on the nature of the data and the specific research question. One commonly used approach is the calculation of colocalization coefficients, such as Pearson's correlation coefficient or Manders' overlap coefficient. These coefficients measure the degree of colocalization between two molecules. Pearson's coefficient ranges from -1 to 1: values close to 1 indicate high colocalization, values close to 0 indicate no correlation, and negative values indicate that the two signals tend to exclude each other. Manders' overlap coefficient ranges from 0 to 1, where 0 indicates no overlap and 1 indicates complete overlap.

In addition to colocalization coefficients, other statistical methods can be employed to assess the significance of colocalization. These include permutation tests, Monte Carlo simulations, and statistical hypothesis testing using appropriate thresholds. These approaches help determine whether the observed colocalization is statistically significant or could have occurred by chance.

There are various software tools available for colocalization analysis, such as ImageJ, Fiji, CellProfiler, and Imaris. These tools provide a range of image processing and analysis functions, including algorithms for colocalization analysis. They allow researchers to analyze and quantify colocalization patterns in their images, and often include visualization options to display colocalization as scatter plots, heatmaps, or intensity overlays.

It is important to note that colocalization does not necessarily imply direct interaction or functional association between the molecules being analyzed. It merely indicates that the two molecules are present in the same spatial location within the sample, at the resolution of the imaging method. To determine functional interactions, additional experiments or techniques such as co-immunoprecipitation or proximity ligation assays may be needed.

Some recent studies have used colocalization analysis to investigate the localization and interaction of specific molecules within cells or tissues. For example, a study published in the journal Nature Communications used colocalization analysis to examine the spatial relationship between mitochondria and lipid droplets in liver cells. The researchers found that these organelles were closely associated and proposed that this colocalization is essential for lipid metabolism.

Another study, published in the journal PLOS Genetics, used colocalization analysis to study the relationship between gene expression and DNA methylation in human blood cells. The researchers found that certain genomic regions showed high colocalization between these two molecular markers, suggesting a functional association between gene regulation and DNA methylation.

In conclusion, colocalization analysis is a valuable technique for studying the spatial relationships and functional associations between molecules in biological samples. By quantifying and analyzing colocalization patterns, researchers can gain insights into the subcellular localization, interactions, and functional relationships of molecules within cells or tissues. These analyses can be performed using various software tools and statistical methods, enabling researchers to extract meaningful information from their imaging data.
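The two coefficients named above can be sketched in a few lines. This is a minimal illustration, assuming each channel has already been flattened into a list of non-negative pixel intensities of equal length and that no background subtraction or thresholding is needed; the function names and the toy intensity values are hypothetical and are not taken from ImageJ, Fiji, CellProfiler, or Imaris.

```python
import math


def pearson_coefficient(ch1, ch2):
    """Pearson's correlation between two intensity lists (range -1 to 1)."""
    n = len(ch1)
    mean1 = sum(ch1) / n
    mean2 = sum(ch2) / n
    covariance = sum((a - mean1) * (b - mean2) for a, b in zip(ch1, ch2))
    spread1 = sum((a - mean1) ** 2 for a in ch1)
    spread2 = sum((b - mean2) ** 2 for b in ch2)
    return covariance / math.sqrt(spread1 * spread2)


def manders_overlap(ch1, ch2):
    """Manders' overlap coefficient (range 0 to 1 for non-negative data)."""
    numerator = sum(a * b for a, b in zip(ch1, ch2))
    denominator = math.sqrt(sum(a * a for a in ch1) * sum(b * b for b in ch2))
    return numerator / denominator


# Toy data: channel 2 is proportional to channel 1 (perfect colocalization).
red = [10, 20, 30, 40]
green = [20, 40, 60, 80]
print(pearson_coefficient(red, green), manders_overlap(red, green))

# Toy data: the two signals occupy different pixels (mutual exclusion).
red_only = [0, 0, 50, 50]
green_only = [50, 50, 0, 0]
print(pearson_coefficient(red_only, green_only), manders_overlap(red_only, green_only))
```

Pearson's coefficient subtracts the channel means, so it is insensitive to a constant offset but can go negative; the overlap coefficient uses raw intensities, so it stays between 0 and 1 but is sensitive to background.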
Selected Molecular Biology Terms Explained

Hypochromic effect
In biochemistry, the decrease in UV absorbance at 260 nm that occurs when denatured (single-stranded) DNA renatures into a double-helical structure.

Hyperchromic effect
Definition 1: the increase in UV absorbance (generally measured at 260 nm) that occurs when nucleic acid (DNA or RNA) molecules are denatured or their strands are broken. Subject area: biochemistry and molecular biology (first-level discipline); nucleic acids and genes (second-level discipline).
Definition 2: the increase in UV absorbance shown by DNA or RNA in solution after treatment with heat or alkali. Subject area: genetics (first-level discipline); molecular genetics (second-level discipline).

Semi-discontinuous replication
Definition 1: during DNA replication, one strand (the leading strand) is synthesized continuously while the other (the lagging strand) is synthesized discontinuously. Subject area: biochemistry and molecular biology (first-level discipline); nucleic acids and genes (second-level discipline).
Definition 2: in double-stranded DNA synthesis, the new strand that grows 5' to 3' in the direction of fork movement is synthesized continuously, whereas the strand whose overall growth would run 3' to 5' is synthesized discontinuously. Subject area: genetics (first-level discipline); molecular genetics (second-level discipline).

Semiconservative replication
A model of replication for double-stranded deoxyribonucleic acid (DNA) in which, after the parental strands separate, each single strand serves as a template for the synthesis of a new strand. When replication is complete there are therefore two daughter DNA molecules, each containing one parental strand and one new strand, and each with the same nucleotide sequence as the parental molecule.

Leading strand
The new DNA strand that is synthesized continuously in the 5' to 3' direction, the same direction in which the replication fork moves.

Lagging strand
During DNA replication, one template strand runs 3' to 5', and the DNA made on it can be synthesized continuously 5' to 3'; this is the leading strand. The other template strand runs 5' to 3'. Synthesis on it also proceeds 5' to 3', but in the direction opposite to the movement of the replication fork, so as the fork advances a number of discontinuous fragments are formed. These fragments are then joined into a complete DNA strand. This strand is called the lagging strand.

Replication fork
The Y-shaped structure formed during DNA replication when the DNA duplex is opened by the combined action of unwinding enzymes (such as helicase) and single-strand binding (SSB) proteins. At the replication fork, the double-stranded template DNA is unwound and new DNA strands are synthesized.

Silent mutation
A synonymous mutation: a base is substituted, but the amino acid sequence is unchanged and wild-type function is retained.

DNA damage
A permanent change in the DNA nucleotide sequence that arises during replication and leads to a change in genetic characteristics. Forms include substitution and insertion, including changes within exons.

Frameshift mutation
The deletion or insertion, in a normal DNA molecule, of a number of bases that is not a multiple of 3, which changes the reading of every codon downstream of that position.

Missense mutation
A base substitution that turns the codon for one amino acid into the codon for a different amino acid, changing the amino acid composition and sequence of the polypeptide chain.

Transcription
The process of transferring genetic information from a gene to RNA. RNA polymerase, acting as a dynamic complex with a series of accessory components, uses the gene sequence as the template to catalyze the synthesis of a complementary RNA; the process includes initiation, elongation and termination. In short, RNA polymerase synthesizes a complementary single-stranded RNA molecule using the DNA base sequence as a template.

PCR
Polymerase chain reaction (PCR) is a method for the enzymatic synthesis of specific DNA fragments in vitro. One cycle consists of high-temperature denaturation, low-temperature annealing and extension at the optimal temperature; repeating the cycle amplifies the target DNA rapidly. The method is highly specific, highly sensitive, simple to perform and time-saving. It can be used not only in basic research such as gene isolation, cloning and nucleic acid sequence analysis, but also for disease diagnosis and in any other setting that involves DNA or RNA. PCR is also known as cell-free molecular cloning, or primer-directed in vitro enzymatic amplification of specific DNA sequences.

Promoter
The region of a DNA molecule to which RNA polymerase binds to form a transcription initiation complex. In many cases it also includes binding sites for the regulatory proteins that facilitate this process. It is the DNA sequence that determines where RNA polymerase initiates transcription.

Enhancer
A sequence that increases the efficiency of a gene's promoter; it can act in either orientation and at any position (upstream or downstream) relative to the promoter.

Operon
A collective term for a promoter, an operator and a series of tightly linked structural genes.

Exon
In eukaryotic genes, a DNA sequence that corresponds to part of the mature mRNA, rRNA or tRNA molecule; the coding sequence.

Intron
A sequence within a eukaryotic gene that has no coding function and is excised. Introns are sequences that interrupt the linear expression of a gene.

DNA cloning
Using enzymatic methods, genetic material from various sources (homologous or heterologous, prokaryotic or eukaryotic, natural or synthetic) is joined in vitro to vector DNA to form a DNA molecule capable of self-replication (a replicon). This molecule is introduced into host cells by transformation or transfection, the cells that carry the target gene are selected, and the DNA is then amplified and extracted to obtain large quantities of the same DNA molecule, that is, a DNA clone.

Gene library
The genomic DNA of an organism is partially digested with restriction enzymes and the resulting fragments are inserted into vector DNA molecules. The collection of all vector molecules carrying these genomic fragments contains the organism's entire genome and constitutes its gene library: a collection of cloned fragments of a single genome's DNA.

DNA denaturation
The breaking of the hydrogen bonds between the base pairs of a nucleic acid double helix, so that the double strand becomes single-stranded and the native conformation and properties of the nucleic acid change.

Subcloning
A technique in which restriction enzymes or PCR are used to obtain part of an already cloned DNA, which is then cloned into another, new vector.

Attenuator
A DNA sequence that acts as a transcription termination signal, at which RNA synthesis stops.

Recombinant DNA
Recombination of genetic information that occurs within a DNA molecule or between DNA molecules, including homologous recombination, site-specific recombination and transpositional recombination. Artificial DNA recombination is a key step in genetic engineering.

Satellite DNA
DNA in eukaryotic cells that consists of highly repetitive nucleotide sequences. It accounts for more than 10% of total DNA, lies mainly in the centromeric regions of chromosomes, and is usually not transcribed. Because its base composition has a low GC content, its buoyant density differs from that of the bulk DNA; in cesium chloride density-gradient centrifugation it separates from most of the DNA as a "satellite" band, which gives it its name.

-10 sequence
Also called the Pribnow box (prokaryotes). The corresponding sequence in eukaryotes, located at about -25 to -35 bp, is known as the TATA box or Goldberg-Hogness box and is a binding site for RNA polymerase II.
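The definitions of silent, missense and frameshift mutations above can be made concrete with a small classifier. This is a minimal sketch under stated assumptions: it uses the standard genetic code, reads from the first base, and compares a reference coding sequence with a mutated one. The function names are hypothetical, and the extra "nonsense" and "in-frame indel" labels are not glossary entries above; they are added only so every input receives a label.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a coding DNA sequence codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def classify_mutation(reference, mutated):
    """Label a mutation by its effect on the encoded protein."""
    if (len(mutated) - len(reference)) % 3 != 0:
        return "frameshift"          # inserted or deleted bases are not a multiple of 3
    if len(mutated) != len(reference):
        return "in-frame indel"      # whole codons gained or lost
    ref_protein, mut_protein = translate(reference), translate(mutated)
    if ref_protein == mut_protein:
        return "silent"              # amino acid sequence unchanged
    if mut_protein.count("*") > ref_protein.count("*"):
        return "nonsense"            # a new stop codon was created
    return "missense"                # some amino acid was replaced


print(classify_mutation("ATGGAA", "ATGGAG"))   # GAA and GAG both encode Glu
print(classify_mutation("ATGGAA", "ATGGAT"))   # Glu replaced by Asp
print(classify_mutation("ATGGAA", "ATGGGAA"))  # one base inserted
```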
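Exposomics Terms Explained

1. Introduction
Exposomics is the study of how individuals interact with their environment. It combines techniques and methods from several areas of bioinformatics, including genomics, epigenomics, transcriptomics and proteomics, to explore how individuals respond to the external environment and how that response affects health and disease.
This article explains some important terms in exposomics to help readers understand the field.

2. Terms

2.1 Genomics
Genomics is the science that studies the structure, function and evolution of an organism's genome.
It covers the analysis of DNA sequences, the annotation of genes, and the study of the relationship between genes and phenotypes.
In exposomics, genomics is used to analyze the interaction between an individual's genotype and exposures, and the effect of that interaction on health and disease risk.

2.2 Epigenomics
Epigenomics is the science that studies how chemical modifications of the genome that do not change the DNA sequence regulate gene expression.
It examines, at the genome-wide level, the mechanisms by which gene expression is regulated through DNA methylation, histone modification and similar processes.
In exposomics, epigenomics is used to study how environmental exposures affect an individual's epigenetic modifications, and so to reveal how the environment regulates gene expression.

2.3 Transcriptomics
Transcriptomics is the global analysis of all transcripts (RNA) in a given species or cell population.
By measuring and analyzing RNA expression levels, it reveals the expression patterns of genes under particular conditions.
In exposomics, transcriptomics is used to study how environmental exposures affect an individual's gene expression, and so to identify environment-related biomarkers and potential health risks.

2.4 Proteomics
Proteomics is the global analysis of all proteins in a given species or cell population.
By measuring and analyzing protein expression levels, modifications and interactions, it reveals the functions and regulatory mechanisms of proteins within the cell.
In exposomics, proteomics is used to study how environmental exposures affect protein composition and function, giving a deeper understanding of how the environment affects cells and biological systems.

2.5 Metabolomics
Metabolomics is the global analysis of all metabolites in a given species or cell population.
It measures and analyzes changes in the levels of metabolites (such as small organic molecules and the products of metabolic enzymes), revealing the metabolic state of a biological system under different conditions.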
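Common Molecular Biology Terms Explained (Chinese-English Glossary)

Key Concepts in Molecular Biology

A

Abundance (of an mRNA): the number of mRNA molecules per cell.
Abundant mRNA: a small number of different mRNA species, each present in a large number of copies per cell.
Acceptor splicing site: the boundary between the right end of an intron and the left end of the adjacent exon.
Acentric fragment: a chromosome fragment (produced by breakage) that lacks a centromere and is therefore lost at cell division.
Active site: the restricted region of a protein to which a substrate binds.
Allele: one of the alternative forms of a gene occupying a given locus on a chromosome.
Allelic exclusion: describes the expression, in any particular lymphocyte, of only one allele coding for the expressed immunoglobulin.
Allosteric control: the ability of an interaction at one site of a protein to influence the activity of another site.
Alu-equivalent family: a set of sequences in a mammalian genome that is related to the human Alu family.
Alu family: a set of dispersed, related sequences in the human genome, each about 300 bp long.
Each member has an Alu cleavage site at both ends (hence the name).
α-Amanitin: a bicyclic octapeptide from the poisonous mushroom Amanita phalloides that inhibits transcription by eukaryotic RNA polymerases, especially RNA polymerase II.
Amber codon: the nucleotide triplet UAG, one of the three codons that cause termination of protein synthesis.
Amber mutation: any change in DNA that creates an amber codon at a site previously occupied by a codon representing an amino acid in a protein.
Amber suppressors: mutant genes coding for tRNAs whose anticodons have been altered so that they recognize the UAG codon as well as, or instead of, the codons they previously recognized.
Aminoacyl-tRNA: a transfer RNA carrying an amino acid; the covalent linkage is between the carboxyl group of the amino acid and the 3'- or 2'-OH group of the terminal nucleotide of the tRNA.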
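The amber codon entries above can be illustrated with a short scan of an mRNA reading frame for the first termination codon. This is a minimal sketch: the function name is hypothetical, and the names "ochre" (UAA) and "opal" (UGA) for the other two stop codons are standard nomenclature that this part of the glossary does not list.

```python
# The three termination codons; only "amber" (UAG) is defined in the glossary above.
STOP_CODONS = {"UAG": "amber", "UAA": "ochre", "UGA": "opal"}


def first_stop_codon(mrna, start=0):
    """Return (position, name) of the first in-frame stop codon, or None if there is none."""
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            return i, STOP_CODONS[codon]
    return None


print(first_stop_codon("AUGGCUUAGGCA"))  # reading frame: AUG GCU UAG GCA
print(first_stop_codon("AUGGCUUGG"))     # no stop codon in this frame
```

An amber mutation in this picture is any DNA change that makes `first_stop_codon` report an "amber" hit at a position that previously encoded an amino acid.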
Local Rademacher complexities
arXiv:math/0508275v1 [math.ST] 16 Aug 2005

The Annals of Statistics 2005, Vol. 33, No. 4, 1497–1537. DOI: 10.1214/009053605000000282. © Institute of Mathematical Statistics, 2005

LOCAL RADEMACHER COMPLEXITIES

By Peter L. Bartlett, Olivier Bousquet and Shahar Mendelson

University of California at Berkeley, Max Planck Institute for Biological Cybernetics and Australian National University

We propose new bounds on the error of learning algorithms in terms of a data-dependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a subset of functions with small empirical error. We present some applications to classification and prediction with convex function classes, and with kernel classes in particular.

1. Introduction. Estimating the performance of statistical procedures is useful for providing a better understanding of the factors that influence their behavior, as well as for suggesting ways to improve them. Although asymptotic analysis is a crucial first step toward understanding the behavior, finite sample error bounds are of more value as they allow the design of model selection (or parameter tuning) procedures. These error bounds typically have the following form: with high probability, the error of the estimator (typically a function in a certain class) is bounded by an empirical estimate of error plus a penalty term depending on the complexity of the class of functions that can be chosen by the algorithm. The differences between the true and empirical errors of functions in that class can be viewed as an empirical process. Many tools have been developed for understanding the behavior of such objects, and especially for evaluating their suprema, which can be thought of as a measure of how hard it is to estimate functions in the class at hand. The goal is thus to obtain the sharpest possible estimates on the complexity of function classes. A
problem arises since the notion of complexity might depend on the (unknown) underlying probability measure according to which the data is produced. Distribution-free notions of the complexity, such as the Vapnik–Chervonenkis dimension [35] or the metric entropy [28], typically give conservative estimates. Distribution-dependent estimates, based for example on entropy numbers in the $L_2(P)$ distance, where $P$ is the underlying distribution, are not useful when $P$ is unknown. Thus, it is desirable to obtain data-dependent estimates which can readily be computed from the sample.

One of the most interesting data-dependent complexity estimates is the so-called Rademacher average associated with the class. Although known for a long time to be related to the expected supremum of the empirical process (thanks to symmetrization inequalities), it was first proposed as an effective complexity measure by Koltchinskii [15], Bartlett, Boucheron and Lugosi [1] and Mendelson [25] and then further studied in [3]. Unfortunately, one of the shortcomings of the Rademacher averages is that they provide global estimates of the complexity of the function class, that is, they do not reflect the fact that the algorithm will likely pick functions that have a small error, and in particular, only a small subset of the function class will be used. As a result, the best error rate that can be obtained via the global Rademacher averages is at least of the order of $1/\sqrt{n}$ [...] general, power type inequalities. Their results, like those of van de Geer, are asymptotic.

In order to exploit this key property and have finite sample bounds, rather than considering the Rademacher averages of the entire class as the complexity measure, it is possible to consider the Rademacher averages of a small subset of the class, usually the intersection of the class with a ball centered at a function of interest. These local Rademacher averages can serve as a complexity measure; clearly, they are
always smaller than the corresponding global averages. Several authors have considered the use of local estimates of the complexity of the function class in order to obtain better bounds. Before presenting their results, we introduce some notation which is used throughout the paper.

Let $(\mathcal{X},P)$ be a probability space. Denote by $F$ a class of measurable functions from $\mathcal{X}$ to $\mathbb{R}$, and set $X_1,\dots,X_n$ to be independent random variables distributed according to $P$. Let $\sigma_1,\dots,\sigma_n$ be $n$ independent Rademacher random variables, that is, independent random variables for which $\Pr(\sigma_i=1)=\Pr(\sigma_i=-1)=1/2$. For a function $f:\mathcal{X}\to\mathbb{R}$, define
\[ P_n f=\frac{1}{n}\sum_{i=1}^{n} f(X_i),\qquad Pf=\mathbb{E}f(X),\qquad R_n f=\frac{1}{n}\sum_{i=1}^{n}\sigma_i f(X_i). \]
For a class $F$, set
\[ R_n F=\sup_{f\in F} R_n f. \]
Define $\mathbb{E}_\sigma$ to be the expectation with respect to the random variables $\sigma_1,\dots,\sigma_n$, conditioned on all of the other random variables. The Rademacher average of $F$ is $\mathbb{E}R_nF$, and the empirical (or conditional) Rademacher averages of $F$ are
\[ \mathbb{E}_\sigma R_n F=\frac{1}{n}\,\mathbb{E}\Bigl(\sup_{f\in F}\sum_{i=1}^{n}\sigma_i f(X_i)\Bigm| X_1,\dots,X_n\Bigr). \]
[...] $\sqrt{rx/n}+c_3/n$, which can be computed from the data. For $\hat r_N$ defined by $\hat r_0=1$, $\hat r_{k+1}=\phi_n(\hat r_k)$, they show that with probability at least $1-2Ne^{-x}$,
\[ P\hat f\le \hat r_N+\frac{2x}{n} \]
[...]
\[ \psi_n(\sqrt{r})\ge \mathbb{E}_\sigma R_n\{f\in F: P_n f\le r\}, \]
and if the number of iterations $N$ is at least $1+\lceil \log_2\log_2 n/x\rceil$, then with probability at least $1-Ne^{-x}$,
\[ \hat r_N\le c\Bigl(\hat r^*+\frac{x}{n}\Bigr) \]
[...] Combining the above results, one has a procedure to obtain data-dependent error bounds that are of the order of the fixed point of the modulus of continuity at 0 of the empirical Rademacher averages. One limitation of this result is that it assumes that there is a function $f^*$ in the class with $Pf^*=0$. In contrast, we are interested in prediction problems where $Pf$ is the error of an estimator, and in the presence of noise there may not be any perfect estimator (even the best in the class can have nonzero error).

More recently, Bousquet, Koltchinskii and Panchenko [9] have obtained a more general result avoiding the iterative procedure. Their result is that for functions with values in $[0,1]$, with probability at least $1-e^{-x}$,
\[ \forall f\in F\qquad Pf\le c\Bigl(P_n f+\hat r^*+\frac{t+\log\log n}{n}\Bigr) \]
[...]
\[ \psi_n(\sqrt{r})\ge \mathbb{E}_\sigma R_n\{f\in F: P_n f\le r\}. \]
The main
difference between this and the results of [16] is that there is no requirement that the class contain a perfect function. However, the local Rademacher averages are centered around the zero function instead of the one that minimizes $Pf$. As a consequence, the fixed point $\hat r^*$ cannot be expected to converge to zero when $\inf_{f\in F}Pf>0$.

In order to remove this limitation, Lugosi and Wegkamp [19] use localized Rademacher averages of a small ball around the minimizer $\hat f$ of $P_n$. However, their result is restricted to nonnegative functions, and in particular functions with values in $\{0,1\}$. Moreover, their bounds also involve some global information, in the form of the shatter coefficients $S_F(X_1^n)$ of the function class (i.e., the cardinality of the coordinate projections of the class $F$ on the data $X_1^n$). They show that there are constants $c_1,c_2$ such that, with probability at least $1-8/n$, the empirical minimizer $\hat f$ satisfies
\[ P\hat f\le \inf_{f\in F}Pf+2\hat\psi_n(\hat r_n), \]
where
\[ \hat\psi_n(r)=c_1\Bigl(\mathbb{E}_\sigma R_n\{f\in F: P_n f\le 16P_n\hat f+15r\}+\frac{\log n}{n}+\sqrt{\frac{\log n}{n}}\sqrt{P_n\hat f+r}\Bigr) \]
and $\hat r_n=c_2(\log S_F(X_1^n)+\log n)/n$. The limitation of this result is that $\hat r_n$ has to be chosen according to the (empirically measured) complexity of the whole class, which may not be as sharp as the Rademacher averages, and in general, is not a fixed point of $\hat\psi_n$. Moreover, the balls over which the Rademacher averages are computed in $\hat\psi_n$ contain a factor of 16 in front of $P_n\hat f$. As we explain later, this induces a lower bound on $\hat\psi_n$ when there is no function with $Pf=0$ in the class.

It seems that the only way to capture the right behavior in the general, noisy case is to analyze the increments of the empirical process, in other words, to directly consider the functions $f-f^*$. This approach was first proposed by Massart [22]; see also [26]. Massart introduces the assumption
\[ \operatorname{Var}[\ell_f(X)-\ell_{f^*}(X)]\le d^2(f,f^*)\le B(P\ell_f-P\ell_{f^*}), \]
where $\ell_f$ is the loss associated with the function $f$ [in other words, $\ell_f(X,Y)=\ell(f(X),Y)$, which measures the discrepancy in the prediction made by $f$], $d$ is a pseudometric and $f^*$ minimizes the
expected loss. (The previous results could also be stated in terms of loss functions, but we omitted this in order to simplify exposition. However, the extra notation is necessary to properly state Massart's result.) This is a more refined version of the assumption we mentioned earlier on the relationship between the variance and expectation of the increments of the empirical process. It is only satisfied for some loss functions $\ell$ and function classes $F$. Under this assumption, Massart considers a nondecreasing function $\psi$ satisfying
\[ \psi(r)\ge \mathbb{E}\sup_{f\in F,\ d^2(f,f^*)\le r}|Pf-Pf^*-P_nf+P_nf^*|+c\,\frac{x}{n} \]
such that $\psi(r)/\sqrt{r}$ is nonincreasing (we refer to this property as the sub-root property later in the paper). Then, with probability at least $1-e^{-x}$,
\[ \forall f\in F\qquad P\ell_f-P\ell_{f^*}\le c\Bigl(r^*+\frac{x}{n}\Bigr) \]
[...] situations of interest, this bound suffices to prove minimax rates of convergence for penalized M-estimators. (Massart considers examples where the complexity term can be bounded using a priori global information about the function class.) However, the main limitation of this result is that it does not involve quantities that can be computed from the data.

Finally, as we mentioned earlier, Mendelson [26] gives an analysis similar to that of Massart, in a slightly less general case (with no noise in the target values, i.e., the conditional distribution of $Y$ given $X$ is concentrated at one point). Mendelson introduces the notion of the star-hull of a class of functions (see the next section for a definition) and considers Rademacher averages of this star-hull as a localized measure of complexity. His results also involve a priori knowledge of the class, such as the rate of growth of covering numbers.

We can now spell out our goal in more detail: in this paper we combine the increment-based approach of Massart and Mendelson (dealing with differences of functions, or more generally with bounded real-valued functions) with the empirical local Rademacher approach of Koltchinskii and Panchenko and of Lugosi and Wegkamp, in order to obtain
data-dependent bounds which depend on a fixed point of the modulus of continuity of Rademacher averages computed around the empirically best function.

Our first main result (Theorem 3.3) is a distribution-dependent result involving the fixed point $r^*$ of a local Rademacher average of the star-hull of the class $F$. This shows that functions with the sub-root property can readily be obtained from Rademacher averages, while in previous work the appropriate functions were obtained only via global information about the class.

The second main result (Theorems 4.1 and 4.2) is an empirical counterpart of the first one, where the complexity is the fixed point of an empirical local Rademacher average. We also show that this fixed point is within a constant factor of the nonempirical one.

Equipped with this result, we can then prove (Theorem 5.4) a fully data-dependent analogue of Massart's result, where the Rademacher averages are localized around the minimizer of the empirical loss.

We also show (Theorem 6.3) that in the context of classification, the local Rademacher averages of star-hulls can be approximated by solving a weighted empirical error minimization problem.

Our final result (Corollary 6.7) concerns regression with kernel classes, that is, classes of functions that are generated by a positive definite kernel. These classes are widely used in interpolation and estimation problems as they yield computationally efficient algorithms. Our result gives a data-dependent complexity term that can be computed directly from the eigenvalues of the Gram matrix (the matrix whose entries are values of the kernel on the data).

The sharpness of our results is demonstrated from the fact that we recover, in the distribution-dependent case (treated in Section 4), similar results to those of Massart [22], which, in the situations where they apply, give the minimax optimal rates or the best known results. Moreover, the data-dependent bounds that we obtain as counterparts of these results have the same
rate of convergence (see Theorem 4.2).

The paper is organized as follows. In Section 2 we present some preliminary results obtained from concentration inequalities, which we use throughout. Section 3 establishes error bounds using local Rademacher averages and explains how to compute their fixed points from "global information" (e.g., estimates of the metric entropy or of the combinatorial dimensions of the indexing class), in which case the optimal estimates can be recovered. In Section 4 we give a data-dependent error bound using empirical and local Rademacher averages, and show the connection between the fixed points of the empirical and nonempirical Rademacher averages. In Section 5 we apply our results to loss classes. We give estimates that generalize the results of Koltchinskii and Panchenko by eliminating the requirement that some function in the class have zero loss, and are more general than those of Lugosi and Wegkamp, since there is no need in our case to estimate global shatter coefficients of the class. We also give a data-dependent extension of Massart's result where the local averages are computed around the minimizer of the empirical loss. Finally, Section 6 shows that the problem of estimating these local Rademacher averages in classification reduces to weighted empirical risk minimization. It also shows that the local averages for kernel classes can be sharply bounded in terms of the eigenvalues of the Gram matrix.

2. Preliminary results. Recall that the star-hull of $F$ around $f_0$ is defined by
\[ \operatorname{star}(F,f_0)=\{f_0+\alpha(f-f_0): f\in F,\ \alpha\in[0,1]\}. \]
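To make the difference between global and local Rademacher averages concrete, the following sketch computes the empirical Rademacher average $\mathbb{E}_\sigma R_n F$ exactly, by enumerating all $2^n$ sign vectors, for a small finite class represented by its values on the sample. It is an illustration under simplifying assumptions and not the paper's procedure: the class is finite, $n$ is tiny, and the localization keeps the functions with $P_n f^2\le r$ from the class itself rather than from its star-hull. All names are hypothetical.

```python
from itertools import product


def empirical_rademacher_average(function_values):
    """Exact E_sigma sup_f (1/n) sum_i sigma_i f(X_i) for a finite class.

    function_values: list of tuples; each tuple holds (f(X_1), ..., f(X_n)) for one f.
    """
    if not function_values:
        raise ValueError("the class must contain at least one function")
    n = len(function_values[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):      # all 2^n Rademacher sign vectors
        total += max(
            sum(s * v for s, v in zip(signs, values)) for values in function_values
        ) / n
    return total / 2 ** n


def localize(function_values, radius):
    """Keep the functions whose empirical second moment P_n f^2 is at most radius."""
    return [v for v in function_values if sum(x * x for x in v) / len(v) <= radius]


# Toy class on a sample of size n = 2: two "large" and two "small" functions.
F = [(1, 1), (-1, -1), (0.1, -0.1), (-0.1, 0.1)]
print(empirical_rademacher_average(F))                  # global average
print(empirical_rademacher_average(localize(F, 0.05)))  # local average, much smaller
```

In this toy class the local average is smaller than the global one by roughly a factor of ten, because localization discards the two large functions that dominate the supremum.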
Throughout this paper, we will manipulate suprema of empirical processes, that is, quantities of the form $\sup_{f\in F}(Pf-P_nf)$. We will always assume they are measurable without explicitly mentioning it. In other words, we assume that the class $F$ and the distribution $P$ satisfy appropriate (mild) conditions for measurability of this supremum (we refer to [11, 28] for a detailed account of such issues).

The following theorem is the main result of this section and is at the core of all the proofs presented later. It shows that if the functions in a class have small variance, the maximal deviation between empirical means and true means is controlled by the Rademacher averages of $F$. In particular, the bound improves as the largest variance of a class member decreases.

Theorem 2.1. Let $F$ be a class of functions that map $\mathcal{X}$ into $[a,b]$. Assume that there is some $r>0$ such that for every $f\in F$, $\operatorname{Var}[f(X_i)]\le r$. Then, for every $x>0$, with probability at least $1-e^{-x}$,
\[ \sup_{f\in F}(Pf-P_nf)\le \inf_{\alpha>0}\Bigl(2(1+\alpha)\,\mathbb{E}R_nF+\sqrt{\frac{2rx}{n}}+(b-a)\Bigl(\frac13+\frac1\alpha\Bigr)\frac{x}{n}\Bigr), \]
and with probability at least $1-2e^{-x}$,
\[ \sup_{f\in F}(Pf-P_nf)\le \inf_{\alpha\in(0,1)}\Bigl(2\,\frac{1+\alpha}{1-\alpha}\,\mathbb{E}_\sigma R_nF+\sqrt{\frac{2rx}{n}}+(b-a)\Bigl(\frac13+\frac1\alpha+\frac{1+\alpha}{2\alpha(1-\alpha)}\Bigr)\frac{x}{n}\Bigr). \]
Moreover, the same results hold for the quantity $\sup_{f\in F}(P_nf-Pf)$.

This theorem, which is proved in Appendix A.2, is a more or less direct consequence of Talagrand's inequality for empirical processes [30]. However, the actual statement presented here is new in the sense that it displays the best known constants. Indeed, compared to the previous result of Koltchinskii and Panchenko [16] which was based on Massart's version of Talagrand's inequality [21], we have used the most refined concentration inequalities available: that of Bousquet [7] for the supremum of the empirical process and that of Boucheron, Lugosi and Massart [5] for the Rademacher averages. This last inequality is a powerful tool to obtain data-dependent bounds, since it allows one to replace the Rademacher average (which measures the complexity of the class of functions) by its empirical version, which can be efficiently computed in some cases. Details about these inequalities are
given in Appendix A.1.

When applied to the full function class $F$, the above theorem is not useful. Indeed, with only a trivial bound on the maximal variance, better results can be obtained via simpler concentration inequalities, such as the bounded difference inequality [23], which would allow $\sqrt{rx/n}$ to be replaced by $\sqrt{x/n}$. However, by applying Theorem 2.1 to subsets of $F$ or to modified classes obtained from $F$, much better results can be obtained. Hence, the presence of an upper bound on the variance in the square root term is the key ingredient of this result.

A last preliminary result that we will require is the following consequence of Theorem 2.1, which shows that if the local Rademacher averages are small, then balls in $L_2(P)$ are probably contained in the corresponding empirical balls [i.e., in $L_2(P_n)$] with a slightly larger radius.

Corollary 2.2. Let $F$ be a class of functions that map $\mathcal{X}$ into $[-b,b]$ with $b>0$. If $x>0$ and $r$ satisfy
\[ r\ge 10b\,\mathbb{E}R_n\{f: f\in F,\ Pf^2\le r\}+\frac{11b^2x}{n}, \]
then with probability at least $1-e^{-x}$,
\[ \{f\in F: Pf^2\le r\}\subseteq\{f\in F: P_nf^2\le 2r\}. \]

Proof. Since the range of any function in the set $F_r=\{f^2: f\in F,\ Pf^2\le r\}$ is contained in $[0,b^2]$, it follows that $\operatorname{Var}[f^2(X_i)]\le Pf^4\le b^2Pf^2\le b^2r$. Thus, by the first part of Theorem 2.1 (with $\alpha=1/4$), with probability at least $1-e^{-x}$, every $f\in F_r$ satisfies
\[ P_nf\le r+\frac52\,\mathbb{E}R_nF_r+\sqrt{\frac{2b^2rx}{n}}+\frac{13b^2x}{3n} \]
[...] $r\mapsto\psi(r)/\sqrt{r}$ is nonincreasing for $r>0$.

We only consider nontrivial sub-root functions, that is, sub-root functions that are not the constant function $\psi\equiv 0$.

Lemma 3.2. If $\psi:[0,\infty)\to[0,\infty)$ is a nontrivial sub-root function, then it is continuous on $[0,\infty)$ and the equation $\psi(r)=r$ has a unique positive solution. Moreover, if we denote the solution by $r^*$, then for all $r>0$, $r\ge\psi(r)$ if and only if $r^*\le r$.

The proof of this lemma is in Appendix A.2. In view of the lemma, we will simply refer to the quantity $r^*$ as the unique positive solution of $\psi(r)=r$, or as the fixed point of $\psi$.

3.1. Error bounds. We can now state and discuss the main result of this section. It is composed of two parts: in the first part, one
requires a sub-root upper bound on the local Rademacher averages, and in the second part, it is shown that better results can be obtained when the class over which the averages are computed is enlarged slightly.

Theorem 3.3. Let $F$ be a class of functions with ranges in $[a,b]$ and assume that there are some functional $T:F\to\mathbb{R}^+$ and some constant $B$ such that for every $f\in F$, $\operatorname{Var}[f]\le T(f)\le BPf$. Let $\psi$ be a sub-root function and let $r^*$ be the fixed point of $\psi$.

1. Assume that $\psi$ satisfies, for any $r\ge r^*$,
\[ \psi(r)\ge B\,\mathbb{E}R_n\{f\in F: T(f)\le r\}. \]
Then, with $c_1=704$ and $c_2=26$, for any $K>1$ and every $x>0$, with probability at least $1-e^{-x}$,
\[ \forall f\in F\qquad Pf\le\frac{K}{K-1}P_nf+\frac{c_1K}{B}r^*+\frac{x(11(b-a)+c_2BK)}{n}. \]
Also, with probability at least $1-e^{-x}$,
\[ \forall f\in F\qquad P_nf\le\frac{K+1}{K}Pf+\frac{c_1K}{B}r^*+\frac{x(11(b-a)+c_2BK)}{n}. \]

2. If, in addition, for $f\in F$ and $\alpha\in[0,1]$, $T(\alpha f)\le\alpha^2T(f)$, and if $\psi$ satisfies, for any $r\ge r^*$,
\[ \psi(r)\ge B\,\mathbb{E}R_n\{f\in\operatorname{star}(F,0): T(f)\le r\}, \]
then the same results hold true with $c_1=6$ and $c_2=5$.

The proof of this theorem is given in Section 3.2.

We can compare the results to our starting point (Theorem 2.1). The improvement comes from the fact that the complexity term, which was essentially $\sup_r\psi(r)$ in Theorem 2.1 (if we had applied it to the class $F$ directly) is now reduced to $r^*$, the fixed point of $\psi$. So the complexity term is always smaller (later, we show how to estimate $r^*$). On the other hand, there is some loss since the constant in front of $P_nf$ is strictly larger than 1.
Section 5.2 will show that this is not an issue in the applications we have in mind.

In Sections 5.1 and 5.2 we investigate conditions that ensure the assumptions of this theorem are satisfied, and we provide applications of this result to prediction problems. The condition that the variance is upper bounded by the expectation turns out to be crucial to obtain these results.

The idea behind Theorem 3.3 originates in the work of Massart [22], who proves a slightly different version of the first part. The difference is that we use local Rademacher averages instead of the expectation of the supremum of the empirical process on a ball. Moreover, we give smaller constants. As far as we know, the second part of Theorem 3.3 is new.

3.1.1. Choosing the function $\psi$. Notice that the function $\psi$ cannot be chosen arbitrarily and has to satisfy the sub-root property. One possible approach is to use classical upper bounds on the Rademacher averages, such as Dudley's entropy integral. This can give a sub-root upper bound and was used, for example, in [16] and in [22].

However, the second part of Theorem 3.3 indicates a possible choice for $\psi$, namely, one can take $\psi$ as the local Rademacher averages of the star-hull of $F$ around 0. The reason for this comes from the following lemma, which shows that if the class is star-shaped and $T(f)$ behaves as a quadratic function, the Rademacher averages are sub-root.

Lemma 3.4. If the class $F$ is star-shaped around $\hat f$ (which may depend on the data), and $T:F\to\mathbb{R}^+$ is a (possibly random) function that satisfies $T(\alpha f)\le\alpha^2T(f)$ for any $f\in F$ and any $\alpha\in[0,1]$, then the (random) function $\psi$ defined for $r\ge 0$ by
\[ \psi(r)=\mathbb{E}_\sigma R_n\{f\in F: T(f-\hat f)\le r\} \]
is sub-root and $r\mapsto\mathbb{E}\psi(r)$ is also sub-root.

This lemma is proved in Appendix A.2.

Notice that making a class star-shaped only increases it, so that
\[ \mathbb{E}R_n\{f\in\operatorname{star}(F,f_0): T(f)\le r\}\ge\mathbb{E}R_n\{f\in F: T(f)\le r\}. \]
However, this increase in size is moderate as can be seen, for example, if one compares covering numbers of a class and its star-hull (see, e.g., [26], Lemma 4.5).

3.1.2. Some consequences. As a consequence of Theorem 3.3, we obtain an error bound when $F$ consists of uniformly bounded nonnegative functions. Notice that in this case the variance is trivially bounded by a constant times the expectation and one can directly use $T(f)=Pf$.

Corollary 3.5. Let $F$ be a class of functions with ranges in $[0,1]$. Let $\psi$ be a sub-root function, such that for all $r\ge 0$,
\[ \mathbb{E}R_n\{f\in F: Pf\le r\}\le\psi(r), \]
and let $r^*$ be the fixed point of $\psi$. Then, for any $K>1$ and every $x>0$, with probability at least $1-e^{-x}$, every $f\in F$ satisfies
\[ Pf\le\frac{K}{K-1}P_nf+704Kr^*+\frac{x(11+26K)}{n}. \]
Also, with probability at least $1-e^{-x}$, every $f\in F$ satisfies
\[ P_nf\le\frac{K+1}{K}Pf+704Kr^*+\frac{x(11+26K)}{n}. \]

Proof. When $f\in[0,1]$, we have $\operatorname{Var}[f]\le Pf$ so that the result follows from applying Theorem 3.3 with $T(f)=Pf$.

We also note that the same idea as in the proof of Theorem 3.3 gives a converse of Corollary 2.2, namely, that with high probability the intersection of $F$ with an empirical ball of a fixed radius is contained in the intersection of $F$ with an $L_2(P)$ ball with a slightly larger radius.

Lemma 3.6. Let $F$ be a class of functions that map $\mathcal{X}$ into $[-1,1]$. Fix $x>0$. If
\[ r\ge 20\,\mathbb{E}R_n\{f: f\in\operatorname{star}(F,0),\ Pf^2\le r\}+\frac{26x}{n}, \]
[...]

Corollary 3.7. Let $F$ be a class of $\{0,1\}$-valued functions with VC-dimension $d<\infty$. Then for all $K>1$ and every $x>0$, with probability at least $1-e^{-x}$, every $f\in F$ satisfies [...]

(b) Upper bound the Rademacher averages of this weighted class, by "peeling off" subclasses of $F$ according to the variance of their elements, and bounding the Rademacher averages of these subclasses using $\psi$.
(c) Use the sub-root property of $\psi$, so that its fixed point gives a common upper bound on the complexity of all the subclasses (up to some scaling).
(d) Finally, convert the upper bound for functions in the weighted class into a bound for functions in the initial class.

The idea of peeling, that
is, of partitioning the class $F$ into slices where functions have variance within a certain range, is at the core of the proof of the first part of Theorem 3.3 [see, e.g., (3.1)]. However, it does not appear explicitly in the proof of the second part. One explanation is that when one considers the star-hull of the class, it is enough to consider two subclasses: the functions with $T(f)\le r$ and the ones with $T(f)>r$, and this is done by introducing the weighting factor $T(f)\vee r$. This idea was exploited in the work of Mendelson [26] and, more recently, in [4]. Moreover, when one considers the set $F_r=\operatorname{star}(F,0)\cap\{T(f)\le r\}$, any function $f'\in F$ with $T(f')>r$ will have a scaled down representative in that set. So even though it seems that we look at the class $\operatorname{star}(F,0)$ only locally, we still take into account all of the functions in $F$ (with appropriate scaling).

3.2. Proofs. Before presenting the proof, let us first introduce some additional notation. Given a class $F$, $\lambda>1$ and $r>0$, let $w(f)=\min\{r\lambda^k: k\in\mathbb{N},\ r\lambda^k\ge T(f)\}$ and set
\[ G_r=\Bigl\{\frac{r}{w(f)}f: f\in F\Bigr\} \]
[...]
\[ \tilde G_r=\Bigl\{\frac{rf}{T(f)\vee r}: f\in F\Bigr\}, \]
and define
\[ \tilde V_r^+=\sup_{g\in\tilde G_r}(Pg-P_ng)\qquad\text{and}\qquad\tilde V_r^-=\sup_{g\in\tilde G_r}(P_ng-Pg). \]

Lemma 3.8. With the above notation, assume that there is a constant $B>0$ such that for every $f\in F$, $T(f)\le BPf$. Fix $K>1$, $\lambda>0$ and $r>0$. If $V_r^+\le r/(\lambda BK)$, then
\[ \forall f\in F\qquad Pf\le\frac{K}{K-1}P_nf+\frac{r}{\lambda BK}. \]
Also, if $V_r^-\le r/(\lambda BK)$, then
\[ \forall f\in F\qquad P_nf\le\frac{K+1}{K}Pf+\frac{r}{\lambda BK}. \]
Similarly, if $K>1$ and $r>0$ are such that $\tilde V_r^+\le r/(BK)$, then
\[ \forall f\in F\qquad Pf\le\frac{K}{K-1}P_nf+\frac{r}{BK}. \]
Also, if $\tilde V_r^-\le r/(BK)$, then
\[ \forall f\in F\qquad P_nf\le\frac{K+1}{K}Pf+\frac{r}{BK}. \]

Proof. Notice that for all $g\in G_r$, $Pg\le P_ng+V_r^+$. Fix $f\in F$ and define $g=rf/w(f)$. When $T(f)\le r$, $w(f)=r$, so that $g=f$. Thus, the fact that $Pg\le P_ng+V_r^+$ implies that $Pf\le P_nf+V_r^+\le P_nf+r/(\lambda BK)$. On the other hand, if $T(f)>r$, then $w(f)=r\lambda^k$ with $k>0$ and $T(f)\in(r\lambda^{k-1},r\lambda^k]$. Moreover, $g=f/\lambda^k$, $Pg\le P_ng+V_r^+$, and thus
\[ \frac{Pf}{\lambda^k}\le\frac{P_nf}{\lambda^k}+V_r^+. \]
Using the fact that $T(f)>r\lambda^{k-1}$, it follows that
\[ Pf\le P_nf+\lambda^kV_r^+<P_nf+\lambda T(f)V_r^+/r\le P_nf+Pf/K. \]
Rearranging,P f≤KK−1P n f+r2rx3+1n.16P.L.BARTLETT,O.BOUSQUET AND S.MENDELSONLet F(x,y):={f∈F:x≤T(f)≤y}and define k to be the smallest integer such that rλk+1≥Bb.ThenE R n G r≤E R n F(0,r)+E supf∈F(r,Bb)rw(f)R n f(3.1)=E R n F(0,r)+kj=0λ−j E supf∈F(rλj,rλj+1)R n f≤ψ(r)Bkj=0λ−jψ(rλj+1).By our assumption it follows that forβ≥1,ψ(βr)≤√Bψ(r) 1+√r/r∗ψ(r∗)=√B √2rx3+1n.Set A=10(1+α)√2x/n and C=(b−a)(1/3+1/α)x/n,and note that V+r≤A√r+C=r/(λBK).It satisfies r0≥λ2A2B2K2/2≥r∗and r0≤(λBK)2A2+2λBKC,so that applying Lemma3.8, it follows that every f∈F satisfiesP f≤KK−1P n f+λBK 100(1+α)2r∗/B2+20(1+α)2xr∗n+(b−a) 1α x2xr∗/n≤Bx/(5n)+ 5r∗/(2B)completes the proof of thefirst statement.The second statement is proved in the same way,by considering V−r instead of V+r.LOCAL RADEMACHER COMPLEXITIES17 Proof of Theorem3.3,second part.The proof of this result uses the same argument as for thefirst part.However,we consider the class˜G rdefined above.One can easily check that˜G r⊂{f∈star(F,0):T(f)≤r}, and thus E R n˜G r≤ψ(r)/B.Applying Theorem2.1to˜G r,it follows that,for all x>0,with probability1−e−x,˜V+ r≤2(1+α)2rx3+1n.The reasoning is then the same as for thefirst part,and we use in the very last step thatn .(3.2)Clearly,if f∈F,then f2maps to[0,1]and Var[f2]≤P f2.Thus,Theo-rem2.1can be applied to the class G r={rf2/(P f2∨r):f∈F},whose functions have range in[0,1]and variance bounded by r.Therefore,with probability at least1−e−x,every f∈F satisfiesr P f2−P n f22rx3+1n.Selectα=1/4and notice thatP f2∨r≤52+19xr 54+19x18P.L.BARTLETT,O.BOUSQUET AND S.MENDELSON4.Data-dependent error bounds.The results presented thus far use distribution-dependent measures of complexity of the class at hand.In-deed,the sub-root functionψof Theorem3.3is bounded in terms of theRademacher averages of the star-hull of F,but these averages can only becomputed if one knows the distribution P.Otherwise,we have seen that it is possible to compute an upper bound on the Rademacher averages using apriori global or 
distribution-free knowledge about the complexity of the class at hand (such as the VC-dimension). In this section we present error bounds that can be computed directly from the data, without a priori information. Instead of computing $\psi$, we compute an estimate, $\hat\psi_n$, of it. The function $\hat\psi_n$ is defined using the data and is an upper bound on $\psi$ with high probability.

To simplify the exposition we restrict ourselves to the case where the functions have a range which is symmetric around zero, say $[-1,1]$. Moreover, we can only treat the special case where $T(f) = Pf^2$, but this is a minor restriction, as in most applications this is the function of interest [i.e., for which one can show $T(f) \le B\,Pf$].

4.1. Results. We now present the main result of this section, which gives an analogue of the second part of Theorem 3.3, with a completely empirical bound (i.e., the bound can be computed from the data only).

Theorem 4.1. Let $F$ be a class of functions with ranges in $[-1,1]$ and assume that there is some constant $B$ such that for every $f \in F$, $Pf^2 \le B\,Pf$. Let $\hat\psi_n$ be a sub-root function and let $\hat r^*$ be the fixed point of $\hat\psi_n$. Fix $x > 0$ and assume that $\hat\psi_n$ satisfies, for any $r \ge \hat r^*$, $\hat\psi_n(r) \ge c_1 \mathbb{E}_\sigma R_n\{f \in \mathrm{star}(F,0) : P_n f^2 \le 2r\} + c_2 x$ K−1 P n f + 6K n. Also, with probability at least $1 - 3e^{-x}$, $\forall f \in F$ P n f ≤ K+1 B r̂∗ + x(11+5BK).
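The bounds above are stated in terms of the fixed point of a sub-root function. As a rough numerical illustration, the sketch below locates such a fixed point by bisection. It assumes the usual definition of sub-root (nonnegative, nondecreasing, with $\psi(r)/\sqrt{r}$ nonincreasing), under which $r \ge \psi(r)$ holds exactly when $r \ge r^*$. The function names `fixed_point` and `psi_example`, and the choice $\psi(r) = \sqrt{r} + 2$, are hypothetical and are not taken from the paper.

```python
import math


def fixed_point(psi, tol=1e-10, max_steps=200):
    """Approximate the positive fixed point r* of a sub-root function psi.

    Assumption: psi is sub-root, so r >= psi(r) if and only if r >= r*.
    """
    # Grow an upper bracket until it lies at or above the fixed point.
    hi = 1.0
    for _ in range(max_steps):
        if hi >= psi(hi):
            break
        hi *= 2.0
    else:
        raise ValueError("no r with r >= psi(r) found; is psi sub-root?")

    # Bisect on [0, hi]: points below r* satisfy r < psi(r).
    lo = 0.0
    for _ in range(max_steps):
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if mid >= psi(mid):
            hi = mid
        else:
            lo = mid
    return hi


def psi_example(r):
    """Hypothetical sub-root function sqrt(r) + 2, whose fixed point is 4."""
    return math.sqrt(r) + 2.0
```

For `psi_example` the fixed point can be checked by hand: $r = \sqrt{r} + 2$ gives $\sqrt{r} = 2$, so $r^* = 4$. Bisection is used here only because the monotone characterization makes it simple; it is not a claim about how the authors compute $r^*$ or $\hat r^*$.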
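A spatial random forest interpolation method with semi-variogram
DOI: 10.12357/cjea.20210628
WANG M X, FAN C, GAO B B, REN Z P, LI F D. A spatial random forest interpolation method with semi-variogram[J]. Chinese Journal of Eco-Agriculture, 2022, 30(3): 451−457
WANG Mingxin1, FAN Chao2, GAO Bingbo1**, REN Zhoupeng3, LI Fadong3
(1. College of Land Science and Technology, China Agricultural University, Beijing 100193; 2. School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079; 3. Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100011)
Abstract: Soil environmental variables show strong spatial heterogeneity, which makes it difficult to improve spatial interpolation accuracy; interpolation methods based only on spatial correlation and spatial heterogeneity struggle to reach high accuracy.
Machine learning methods can fuse information from multi-dimensional auxiliary variables and so improve the interpolation accuracy of soil properties, but they cannot effectively incorporate spatial location relationships to improve accuracy further.
Building on the random forest spatial prediction framework, this paper integrates the spatial semi-variogram with the random forest algorithm and proposes a spatial random forest interpolation method with semi-variogram.
The proposed method was applied to the spatial interpolation of soil heavy-metal data from Xiangtan County, Hunan Province, and its accuracy was tested against the random forest method, the distance-based random forest spatial prediction method, ordinary kriging and regression kriging.
The results show that the proposed method improves accuracy by more than 10% over the traditional kriging methods and by more than 5% over the newer machine-learning spatial interpolation methods; its interpolated maps also show a more reasonable spatial distribution and richer detail.
The spatial random forest interpolation method with semi-variogram can therefore effectively combine auxiliary-variable information with spatial location information and improve the interpolation accuracy of soil environmental variables.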
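The abstract does not say how the semi-variogram is estimated or how it is passed to the random forest, so the sketch below covers only the first, standard ingredient: the classical (Matheron) empirical semi-variogram, $\hat\gamma(h) = \frac{1}{2N(h)} \sum (z_i - z_j)^2$ over the $N(h)$ point pairs whose separation falls in the lag bin around $h$. The function name `empirical_semivariogram` and its parameters are hypothetical, and the equal-width binning is an assumption rather than the authors' procedure.

```python
import math


def empirical_semivariogram(coords, values, bin_width, n_bins):
    """Return [(lag_centre, gamma, n_pairs), ...] for n_bins distance bins.

    gamma is None for bins that contain no point pairs.
    """
    sums = [0.0] * n_bins    # sum of squared value differences per bin
    counts = [0] * n_bins    # number of point pairs per bin
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            # Bin index of the Euclidean distance between points i and j.
            k = int(math.dist(coords[i], coords[j]) // bin_width)
            if k < n_bins:
                sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    return [
        ((k + 0.5) * bin_width,
         sums[k] / (2 * counts[k]) if counts[k] else None,
         counts[k])
        for k in range(n_bins)
    ]
```

A fitted variogram model (for example spherical or exponential) would normally be derived from these binned values before any use in prediction; that step is omitted here because the abstract gives no detail on it.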
SVM classification of human intergenic and gene sequences

Y.H. Qiao a,1, J.L. Liu b,1, C.G. Zhang c, X.H. Xu d, Y.J. Zeng a,d,*

a Biomechanics and Medical Information Institute, Beijing University of Technology, Beijing 100022, China
b College of Computer Science, Beijing University of Technology, Beijing 100022, China
c Beijing Institute of Radiation, Beijing 100850, China
d Medical School, Shan Tou University, Guangdong 515031, China

Received 29 March 2004; received in revised form 31 January 2005; accepted 14 March 2005

Abstract

Despite constant improvement in prediction accuracy, gene-finding programs are still unable to provide automatic gene discovery with the desired correctness. This paper presents an analysis of gene and intergenic sequences from the point of view of language analysis, where gene and intergenic regions are regarded as two different subjects written in the four-letter alphabet {A, C, G, T}, and high-frequency simple sequences are taken as keywords. A measurement $\alpha(l(s))$ was introduced to describe the relative repeat ratio of simple sequences. Threshold values were found for keyword selection. After eliminating 'noise', 178 short sequences were selected as keywords. DNA sequences are mapped to 178-dimensional Euclidean space, and SVM was used for prediction of gene regions. We showed by cross-validation that the program we developed could predict 93% of gene sequences with 7% false positives. When tested on a long genomic multi-gene sequence, our method improved nucleotide-level specificity by 21%, and over 60% of predicted genes corresponded to actual genes.
© 2005 Elsevier Inc. All rights reserved.

Keywords: SVM; SMO algorithm; Intergenic region; Gene region

Mathematical Biosciences 195 (2005) 168–178
0025-5564/$ - see front matter. doi:10.1016/j.mbs.2005.03.005
* Corresponding author. Address: Biomechanics and Medical Information Institute, Beijing University of Technology, Beijing 100022, China. Tel.: +86 10 67391809; fax: +86 10 67391610. E-mail address: yjzeng@ (Y.J. Zeng).
1 Y.H. Qiao and J.L. Liu contributed equally to this paper and are both first authors.

1. Introduction

As completion of the human genome sequencing is imminent, millions of nucleotides of genomic DNA are sequenced daily, and tools for interpreting the contents of these genomes are more important than ever. The first step in deciphering the DNA sequence information is finding all the genes contained in a sequence and elucidating their structure. Although many gene-finding programs have been developed in the past 10 years and their prediction accuracy is constantly improving, we are still far away from completely automatic gene discovery with 100% accuracy. Current programs, although very good at discovering the majority of coding nucleotides and moderately good at discovering exact exon boundaries, are still weak when it comes to predicting complete gene structures: less than 50% of predicted genes correspond exactly to actual genes [1]. This provides strong motivation for developing various kinds of computational methods to predict complete gene structures.

In recent years, stochastic processes and statistical learning theory have been widely used in gene-finding programs. The popular programs Genscan [2] and HMMgene [3] model the structure of a genomic sequence as an explicit state duration HMM, which is also known as a generalized HMM. In this type of probabilistic model each state of the model has an associated arbitrary length distribution. In the case of DNA sequences, the states of the HMM simulate functional elements of the genes or genomic regions. Neural networks and discriminant functions are also used to predict elements of genes.

In this paper, we have developed a novel approach to detect complete gene regions for a given DNA sequence. We analyze gene and intergenic regions using linguistic theory and regard these two regions as two subjects written in the same language, so the keywords used in the two regions are different. We extract short feature sequences
for the regions, and feature expressions are represented by turning the DNA sequences into vectors. Using a Support Vector Machine (SVM) with an RBF kernel function, we developed a classifier to predict the gene regions of a given sequence.

2. Methods

2.1. Finding keywords

From the point of view of linguistic theory, a DNA sequence can be read as a sentence composed of the four-letter alphabet {A, C, G, T}. Intergenic and gene regions are two different regions in DNA sequences (Fig. 1), which can be regarded as two different subjects written in the four-letter alphabet whose lexical features should differ from each other. In principle, short sequences occurring at high frequency in intergenic sequences provide sufficient information about intergenic regions, and short sequences occurring at high frequency in gene sequences represent the features of gene regions, because the two regions have different short key sequences, which we define as keywords characterizing them.

Here we try to find these keywords based on their discrepancy between the two regions. Keywords are in fact nucleotide strings composed of {A, C, G, T} of various lengths. Generally speaking, very long strings occur at low frequency in any DNA sequence, so we consider only simple short sequences of length less than 7. Sliding windows of length 2, 3, 4, 5 and 6 are used. We denote the length of the sliding window as $l$. The window is initially located at the start of a DNA sequence, and $l$ letters are read one by one in order. After a nucleotide string of length $l$ is obtained, the window slides forward by one nucleotide and a new string is obtained, and so on, until the end of the sequence is reached. With this method a series of window sequences deduced from different sliding windows is obtained. We denote by $l(s)$ the short sequences of length $l$ obtained by sliding windows, where $s$ is a nucleotide string such as ACG, GTAGG, etc.

Definition 1. Let $F_L(l(s))$ be the measurement of repeats of $l(s)$ for a DNA sequence, satisfying

$$F_L(l(s)) = \frac{T_L(l(s))}{W_L(l)},$$

where $T_L(l(s))$ is the number of occurrences of $l(s)$ in sequence $L$, and $W_L(l)$ is the number of window sequences of length $l$ in the DNA sequence, $W_L(l) = L - l + 1$.

Here $F_L(l(s))$ measures the frequency with which a certain nucleotide substring occurs in a DNA sequence. For a training sample dataset $\{L_i \mid i = 1, 2, \ldots, n\}$ consisting of $n$ sequences, if $s$ occurs at least once in $L_i$, we say $s$ occurred in sequence $L_i$; here $L_i$ denotes the $i$-th sequence. $S(l(s))$ denotes the number of sequences in the training sample dataset in which $s$ occurs. Therefore $S(l(s))/n$ is the ratio at which $s$ occurs in the sample dataset. For intergenic datasets we denote this ratio as $S_I(l(s))$, and for gene datasets as $S_G(l(s))$. In the process of finding keywords, the ratio is important because it measures the occurrence of $l(s)$ in a dataset. If the ratio is close to 1, then $l(s)$ is abundant in the dataset, and if it is close to zero, then $l(s)$ is rarely present in the dataset. To measure the frequency of a simple sequence in a dataset, we have the following definition:

Definition 2. Let $F(l(s))$ be the measurement of repeated occurrences of $l(s)$ for a training dataset $\{L_i \mid i = 1, 2, \ldots, n\}$, satisfying

$$F(l(s)) = \frac{T(l(s))}{W(l)} \cdot \frac{S(l(s))}{n},$$

where $T(l(s))$ is the total number of occurrences of $l(s)$ in the training dataset and $W(l)$ is the total number of window sequences of length $l$ in the dataset:

$$T(l(s)) = \sum_{i=1}^{n} T_{L_i}(l(s)), \qquad W(l) = \sum_{i=1}^{n} W_{L_i}(l).$$

From a statistical point of view, this quantity measures the abundance of a simple sequence in a training dataset, and can be interpreted as an estimate of a joint probability.

To measure the discrepancy in the occurrence of a simple sequence between two training datasets, we have the following definition:

Definition 3. Let $F_I(l(s))$ and $F_G(l(s))$ denote the measurement of $l(s)$ occurrences in intergenic and gene region datasets respectively. We define the relative discrepancy of $l(s)$ in the two datasets as follows:

$$\alpha(l(s)) = \frac{F_I(l(s)) - F_G(l(s))}{F_I(l(s)) + F_G(l(s))}.$$

In the process of finding keywords, the absolute value of $\alpha(l(s))$ is important. If the absolute value of $\alpha(l(s))$ is large, and the ratio of occurrence of $s$ in a training dataset is over 60%, we take $s$ as a keyword; otherwise, we regard it as 'noise'. There is therefore a cutoff value, which we denote as $\omega_0$, dividing the $s$ into keywords and 'noise'. We define the keywords for intergenic and gene regions according to the positive or negative sign of $\alpha(l(s))$.

As mentioned above, because the selection of $\omega_0$ is equivalent to the choice of keywords, we focus on the choice of the ideal value of $\omega_0$. For this purpose, a high-quality dataset is important. We built an intergenic and gene region dataset for testing this algorithm. A set of 613 human 5′-UTR sequences was first obtained from the 5′-end-enriched cDNA library [4]; the 613 5′-UTR sequences located on human chromosome 22 were mapped to their corresponding genomic sequences using the BLAT program [5] (/cgi-bin/hgBlat). The downstream of the 5′-UTR of those genes from chromosome 22 was analyzed, and the annotated neighboring exons and introns downstream of the 5′-UTR were lined out using the SIM4 program [6] (http://biom3.univ-lyon1.fr/sim4.php) up to the transcription end site (usually indicated by TA). The entire sequence from the 5′-UTR to the transcription end site was extracted as a gene region. Hence, a total of 613 gene sequences were obtained as the dataset for analysis (some of them are only segments of a complete gene structure, especially those with a 5′-UTR not found in the 5′-end-enriched cDNA library). The downstream sequence from the former transcription end site to the start of the 5′-UTR of the next gene is naturally considered an intergenic sequence.

Two
programs were developed for this study. One computes $\alpha(l(s))$ based on the definitions, and the other sorts the $l(s)$ by their $\alpha(l(s))$ values.

Looking from the y-axis of Figs. 2–6, the magnitude of the discrepancy of $\alpha(l(s))$ among $l(s)$ of different lengths is from 2 to 4; therefore, a different $\omega_0$ is considered for each length of $l(s)$. The detailed values of $\alpha(l(s))$ for nucleotide strings of length two are listed in Table 1. As shown in Table 1, because the absolute values of $\alpha(l(s))$ for GC, CA, TG, CG, GT and AC are much larger than those of the other sequences in the table, the value 0.01 can be taken as the cutoff value for defining keywords of length 2. The data in Fig. 2 also display the discrepancy among the 16 short sequences.

For nucleotide strings of length 3, 4, 5 and 6, we sort them in ascending order of the absolute value of $\alpha(l(s))$. The differences between the absolute values of $\alpha(l(s))$ of each string and its predecessor are computed one by one, and the largest among them is selected. The absolute value of $\alpha(l(s))$ of the latter sequence corresponding to the largest difference is used as the cutoff value $\omega_0$. Those short sequences with an absolute value of $\alpha(l(s))$ over $\omega_0$ are collected as keywords. Therefore we chose $\omega_0 = 0.01$ as the cutoff value for short sequences of length 2. The corresponding tables of $\alpha(l(s))$ for short sequences of length 3, 4, 5 and 6 are too large to be listed in the paper. The trend lines of short sequences of length 3, 4, 5 and 6 with the change of $\alpha(l(s))$ are illustrated in Figs. 3–6. The cutoff values are given in Table 2. Using the $\omega_0$'s, the keywords for intergenic and gene regions, which are listed in Tables 3 and 4 respectively, are selected, and the keywords are ordered by the absolute value of $\alpha(l(s))$.

Table 1
The value of α(l(s)) of the base sequences of length 2

Nucleotide base sequence   α(l(s))        |α(l(s))|
CC                          0.00257165    0.00257165
GG                          0.00261753    0.00261753
TA                          0.0045099     0.0045099
GA                          0.00726288    0.00726288
TC                          0.0074633     0.0074633
AC                          0.01257825    0.01257825
GT                          0.01364047    0.01364047
CG                          0.02033975    0.02033975
TT                         −0.00029786    0.00029786
AA                         −0.00497144    0.00497144
CT                         −0.00510368    0.00510368
AG                         −0.00627386    0.00627386
AT                         −0.00719976    0.00719976
TG                         −0.01076279    0.01076279
CA                         −0.0124671     0.01246710
GC                         −0.01743213    0.01743213

Table 2
The threshold values for short strings of length 2, 3, 4, 5, 6

Window length   2      3        4       5         6
ω0              0.01   0.0048   0.002   0.00072   0.0003

Based on the work of finding keywords, we use the SVM technique to assign gene and intergenic regions to one of two alternative groups, G1 (gene regions) and G2 (intergenic regions). For this purpose, a feature expression of each DNA sequence is given as a vector. The 178 keywords selected were used to map DNA sequences to 178-dimensional Euclidean space, and the 178-dimensional vectors $v$ of DNA sequences were given as follows:

$$\text{the } i\text{th component of } v = F_L(l(\text{keyword})) \times M.$$

The components of $v$ are ordered following the sequences in Table 3. $M$ is a factor that increases the components if they are too small. For keywords of length 2, $M$ is specified as 1; for keywords of length 3, 4, 5, $M$ is specified as 10; and for keywords of length 6, $M$ is specified as 100. Each DNA sequence is expressed by the 178 keywords. SVM classifications are processed using the vectors introduced for segment sequences of DNA.

Table 3
The keywords selected

Length 2: GC, CA, TG, CG, GT, AC
Length 3: TTT, AAA, TTA, ATT, TAA, CCC, ATA, AAT, CCG, TAT, CGG, GCC, GAC, GGA
Length 4: CGAC, GGAC, CAGC, GTCG, GTCC, GACC, GGTC, GACG, CGTC, GCTG, CGGA, AGGC, TCCG, GCCT, CCGA, ACCG, TCGG, CCAG, CAGG, ACGA, AGCC, GCAG, GCCA, ACGT, CGGT, ACGG, CTGC, CCGT, TCGT, TGCA, TCGA, CCTG, GGCT, AGCA
Length 5: CCAGC, GACCC, GGACC, GGGTC, CAGCC, GCTGG, CAGGC, GGTCC, GCCTG, CGACC, GGCTG, CCCAG, CCGAC, GTCGG, GGTCG, CGGAC, CTGGG, GTCCG, AGGCT, AGCCA, CAGCT, AGCCT, CAGCA, GAGGC, AGGCA, CTGCA, CCAGG, GGACG, AGCTG, GACGG, AAAAA, CGGAG, ACGTC, TCGGA, GCCAG, CCGTC, TGCAG, GCCTC, ACGAC, TGCCT, ACGGA, CTCCG, AGGAC, CCTGG, CGTCC, GACGT, GGCAG, TGCTG, TGGCT, TCGAC, TCCGT, GACCG, GTCCT, GCAGG, CGGTC, GTCGT
Length 6: AGGCTG, CAGCCT, CCCAGC, AAAAAA, GCTGGG, GCCTGG, CCAGCC, CCAGGC, GGCTGG, AGCCTG, CAGGCT, GCCAGG, GGAGGC, CCAGCT, AGGCAG, GGGTCC, GGACCC, GCTGAG, GAGGAC, GCAGTG, CTCAGC, CACTGC, CCTGGC, AGCTGG, CGACCC, GCCCAG, GCCTCC, CAGGCA, CTGGGC, GTCGGA, CCAGCA, CTGCCT, GGCAGG, TCCCAG, CAGGAG, CCTGGG, TCCGAC, ACTGCA, CCCAGG, CCGACC, CGGACC, CGGAGG, TGCCTG, TGCAGT, AGCCAG, CCTGCC, AGGGTC, GACCCT, GAGGCT, GGTCGG, GGTCCG, TGGCCA, CTGGGA, GACGGA, GGCTGA, GGACGG, CTCCTG, CCTCCG, TGAGCC, TCCAGC, AGCCTC, TGCTGG, GGCCAG, TCAGCC, TGGGAG, GACCCG, GGACCG, GGCTCA

Table 4
Test results of cross-validation

Grn     Ninterreg   Ngenereg   TP   FN   FP   TN
1       3           5          2    1    0    5
2       7           1          5    2    0    1
3       6           2          6    0    1    1
4       4           4          4    0    0    4
5       8           0          5    3    0    0
6       4           4          4    0    0    4
7       3           5          3    0    1    4
8       5           3          4    1    0    3
9       5           3          5    0    1    2
10      2           6          2    0    1    5
Total   47          33         40   7    4    29

Grn: group number; Ninterreg: number of intergenic regions; Ngenereg: number of gene regions.

2.2. SVM

The SVM is a new machine learning method that has developed rapidly and has been widely used in many kinds of pattern recognition problems. The basic method of SVM is to transform the samples into a high-dimensional Hilbert space and to seek a separating hyperplane in this space. The separating hyperplane, which is called the optimal separating hyperplane [7], is chosen in such a way as to maximize its distance from the closest training samples. As a supervised machine learning technology, SVM is well founded theoretically on statistical learning theory, and it has been successfully applied to many fields of pattern recognition, including object recognition [8], speaker identification [9], and text categorization [10]. The SVM usually outperforms other machine learning technologies, including neural networks and the K-nearest neighbor classifier. In recent years, the SVM has been used in bioinformatics, including gene expression profile classification, prediction of signal peptide cleavage sites and recognition of translation initiation sites. Yoonkyung Lee [11] used a binary SVM to predict multiple cancer types with excellent prediction
results. More details about SVM can be found in Vapnik's publication [7].

2.3. Classifier learning

SVM requires the solution of an optimization problem ($\min_{\alpha} \frac{1}{2}\alpha^T Q \alpha - e^T \alpha$ subject to $\sum_{i=1}^{l} \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$, $i = 1, \ldots, l$). For classification and simplicity purposes, the optimization problem needs to solve many quadratic programming (QP) problems. For our classification problem, we selected RBF as the kernel function because it has fewer parameters than a polynomial or sigmoid kernel function, and the linear kernel is a special case of RBF. Solving the QP problem involves two parameters: the penalty parameter $C$ and the kernel width $\sigma$. Severe underfitting [12] occurs in the following cases: $\sigma^2$ is fixed and $C \to 0$; $\sigma^2 \to 0$ and $C$ is fixed to a sufficiently small value; or $\sigma^2 \to \infty$ and $C$ is fixed. Sufficiently large $C$ and $\sigma^2 \to 0$ will result in severe overfitting [12]. The case in which $\sigma^2$ is fixed and $C \to \infty$ does not fit the problem under consideration because it has noise. If $\sigma^2 \to \infty$ and $C = C_1 \sigma^2$ where $C_1$ is fixed, then the SVM classifier converges to the linear SVM classifier with penalty parameter $C_1$. So selecting a good $(C, \sigma^2)$ is important for unknown data prediction. A two-line model selection [13] approach was used for finding good $C$ and $\sigma^2$. The procedure is as follows: (1) search for the best $C$ of the linear SVM using a subset of the training data and call it $\tilde C$, and (2) fix $\tilde C$ and search for the best $(C, \sigma^2)$ satisfying $\log \sigma^2 = \log C - \log \tilde C$ using RBF. Before we selected RBF, we used the linear kernel for some clues to the best $C$, and a seeding technique [14] was used for finding $\tilde C$. Four hundred intergenic sequences and 400 gene sequences were extracted for SVM training from the dataset from which we selected keywords.

The Sequential Minimal Optimization (SMO) [13] algorithm was used to solve the SVM QP problem. SMO decomposes the overall QP problem into QP sub-problems which can be solved analytically. For the standard SVM QP problem, the smallest possible optimization problem involves two Lagrange multipliers, because the Lagrange multipliers must obey a linear equality constraint ($\sum_{i=1}^{l} \alpha_i y_i = 0$). The constraint $\sum_{i=1}^{l} \alpha_i y_i = 0$ is maintained by a recursive algorithm: whenever a multiplier is updated, at least one other must be adjusted. At every step, SMO chooses two Lagrange multipliers to jointly optimize, finds the optimal values for these multipliers, and updates the SVM to reflect the new optimal values. The details of solving for two Lagrange multipliers of the above optimization problem are in Ref. [13].

We divided the samples into two groups, one for classifier learning and the other for testing. In order to evaluate the ability of the method we introduced to predict intergenic and gene sequences, we introduced four parameters that are commonly used to measure the performance of a program in the field of bioinformatics:

$$\text{Sensitivity (SN)} = \frac{TP}{TP + FN}, \qquad (1)$$

$$\text{Specificity (SP)} = \frac{TP}{TP + FP}, \qquad (2)$$

$$\text{Accuracy (AC)} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad (3)$$

$$\text{Correlation coefficient} = \frac{(TP)(TN) - (FP)(FN)}{\sqrt{(PP)(PN)(AP)(AN)}}, \qquad (4)$$

where TP are true positives, FP are false positives, FN are false negatives, TN are true negatives, PP are predicted positives, PN are predicted negatives, AP are actual positives and AN are actual negatives.

3. Results

We tested the classifier in two ways. First we performed a systematic cross-validation analysis, using the data in the test dataset (those we collected as keywords). Second we ran the program on human chromosome 21 and other chromosomes. For cross-validation, we divided the samples into 10 groups; for each group about 90% of the samples were used to train the algorithm and the other 10% were used for testing. We then estimated the sensitivity, specificity, accuracy and correlation coefficient. The detailed results of cross-validation are listed in Table 4. (The data in each group are available from
our email address). We found that the classifier could predict about 85% of intergenic sequences with around 10% false positives, and about 88% of gene regions with about 21% false negatives, with the accuracy of the classifier at 86%. For chromosome 21, we first scanned the full length of the DNA sequences. The annotated genes and the sequences between two neighboring genes were extracted to form the dataset for testing the classifier. The results are shown in Table 5, the accuracy of the prediction being only 73%. SN, SP and CC are also lower than for the other test sets. The reason is that the test sets from chromosome 21 are based on the annotated gene sequences in GenBank. It is possible that there are certain gene regions in the genome which are not predicted or annotated precisely, so the predictions for chromosome 21 are not as good as we expected.

As another test, a set of 211 gene and intergenic region sequences from other chromosomes was collected by the methods we used for preparing the training samples. The sequences which we used for the calculation of the value of $\alpha(l(s))$ are excluded, and the results are also shown in Table 5. We found that the prediction program performed better than it did on chromosome 21.

Table 5
Results of the performance of the classifier

Type of the test dataset   SN     SP     AC     CC
Training set               0.96   0.98   0.97   0.93
Test set                   0.85   0.91   0.86   0.84
The chromosome 21          0.72   0.73   0.73   0.63
Other chromosomes          0.82   0.85   0.84   0.84

4. Discussion

In this paper we have described an SVM-based methodology for two-class prediction using DNA feature expression data based on keywords. A new method for finding keywords for certain genome regions is introduced. This method, which complements usual motif-finding techniques such as multiple sequence alignment, is based on language analysis and statistics. A measurement $\alpha(l(s))$ is introduced to describe the relative repeat ratio of sub
sequences. Threshold values were found for keyword selection. A program was developed to predict whether a new sequence comes from gene or intergenic regions.

In the analysis of the results, we found that the accuracy is comparable to that of other annotation tools for DNA sequences, such as GRALL2's and FGENEH's [15] prediction for coding regions; more specific than GRALL2 and FGENEH (see Table 6); and more sensitive than GRALL2 at the nucleotide level. The results also show that the sensitivity is always larger than the specificity, which indicates that the program's prediction for gene sequences is better than for intergenic sequences.

These results demonstrate that a language analysis method is feasible and valid for the annotation of DNA sequences. Language information, together with SVM classification, nicely complements the functional region annotation tools, which are almost all based on hidden Markov models and discriminant analysis. The program will perform better if it integrates other annotation tools such as GeneScan, PromoterInspector [16] and FirstEF [17]. It is a pity that the accuracy of the prediction cannot be compared with that of other methods, because the prediction of the entire gene structure is still in the exploration stage and seldom used. Moreover, the existing gene-finding programs usually model only some elements of the entire gene structure, so that this program defies comparison with other existing gene-finding programs. But it could have innovative value as a method for predicting the entire gene structure.

Finally, in our analysis, the subsequences from the gene and intergenic regions are based on the training data, which were extracted by mapping 5′-UTR sequences to their genome sequences. It is possible that there are certain 5′-UTR sequences in the genome which are not predicted or annotated precisely, and that the length of the related intergenic samples is random, so that the samples of the training set or test set are to some extent unreliable, with the keywords within gene and
intergenic regions being tentative. Further research should focus on the length distribution of the intergenic regions and entire gene regions. In the future, when the human gene regions are fully available and the gene boundaries are more exactly determined, more exact information may be found and the accuracy of the prediction will be improved.

Table 6
Test summary

Program   SN     SP     AC     CC
GRALL2    0.79   0.85   *      0.80
FGENEH    0.88   0.80   *      0.83
SVM       0.85   0.91   0.86   0.84

* Does not appear in the reference.

Acknowledgments

The work was supported by Beijing Municipal Key Laboratory from College of Computer Science and the Project from Beijing Municipal Commission of Education No. KM200510005015. The authors are thankful to C.N. Liu for his support and technical assistance.

References

[1] I. Dunham, A.R. Hunt, J.E. Collins, et al., The DNA sequence of human chromosome 22, Nature 402 (6761) (1999) 489.
[2] C. Burge, S. Karlin, Prediction of complete gene structure in human genomic DNA, J. Mol. Biol. 268 (1) (1997) 78.
[3] A. Krogh, Two methods for improving performance of an HMM and their application for gene-finding, in: Proceedings of the Fifth International Conference on Intelligent System for Molecular Biology, AAAI, Menlo Park, CA, 1997, p. 179.
[4] Y. Suzuki, D. Ishihara, M. Sasaki, et al., Statistical analysis of the 5′ untranslated region of human mRNA using "Oligo-Capped" cDNA libraries, Genomics 64 (3) (2000) 286.
[5] W.J. Kent, BLAT: the BLAST-like alignment tool, Genome Res. 12 (4) (2002) 656.
[6] L. Florea, G. Hartzell, Z. Zhang, G.M. Rubin, ler, A computer program for aligning a cDNA sequence with a genomic DNA sequence, Genome Res. 8 (9) (1998) 967.
[7] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[8] D. Roobaert, M.M. Hulle, View based 3D object recognition with support vector machines, in: Proceedings of the IEEE International Workshop on Neural Networks for Signal Processing, IEEE, Wisconsin, 1999, p. 77.
[9] M. Schmidt, H. Grish, Speaker identification via support vector classifiers, in: Proceeding of the International Conference on Acoustics, Speech and Signal Processing, IEEE, Long Beach, CA, 1996, p. 105.
[10] H. Drucker, D. Wu, V. Vapnik, Support vector machines for spam categorization, IEEE Trans. Neural Netw. 10 (3) (1999) 1048.
[11] L. Yoonkyung, L. Cheol-Koo, Classification of multiple cancer types by multicategory support vector machines using gene expression data, Bioinformatics 19 (9) (2003) 1132.
[12] S.S. Keerthi, C.J. Lin, Asymptotic behavior of support vector machines with Gaussian kernel, Neural Comput. 15 (7) (2003) 1667.
[13] B.E. Boser, I.M. Guyon, V. Vapnik, A training algorithm for optimal margin classifiers, Fifth Annual Workshop on Computational Learning Theory, ACM, 1992.
[14] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, A fast iterative nearest point algorithm for support vector machine classifier design, IEEE Trans. Neural Netw. 11 (1) (2000) 124.
[15] M.Q. Zhang, Identification of protein coding regions in the human genome by quadratic discrimination analysis, Proc. Natl. Acad. Sci. 94 (2) (1997) 565.
[16] M. Scherf, A. Klingenhoff, T. Werner, Highly specific localization of promoter regions in large genomic sequences by Promoter Inspector: a novel context analysis approach, J. Mol. Biol. 297 (3) (2000) 599.
[17] V.D. Ramana, G. Ivo, M.Q. Zhang, Computational identification of promoters and first exons in the human genome, Nature 29 (4) (2001) 412.
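The keyword statistics of Definitions 1 to 3 in Section 2.1 of the paper above can be sketched in a few lines. This is only an illustration of the stated formulas, not the authors' program: the function names (`window_counts`, `f_sequence`, `f_dataset`, `alpha`) are hypothetical, and returning 0.0 when a denominator is zero is an added assumption, since the paper does not say how that case is handled.

```python
from collections import Counter


def window_counts(seq, l):
    """Count every length-l window obtained by sliding one base at a time."""
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))


def n_windows(seq, l):
    """W_L(l) = L - l + 1, clipped at zero for sequences shorter than l."""
    return max(len(seq) - l + 1, 0)


def f_sequence(seq, s):
    """Definition 1: F_L(l(s)) = T_L(l(s)) / W_L(l) for a single sequence."""
    w = n_windows(seq, len(s))
    return window_counts(seq, len(s))[s] / w if w else 0.0


def f_dataset(seqs, s):
    """Definition 2: F(l(s)) = T(l(s)) / W(l) * S(l(s)) / n for a dataset."""
    l = len(s)
    per_seq = [window_counts(seq, l)[s] for seq in seqs]
    total_hits = sum(per_seq)                            # T(l(s))
    total_windows = sum(n_windows(q, l) for q in seqs)   # W(l)
    seqs_with_s = sum(1 for c in per_seq if c > 0)       # S(l(s))
    if not seqs or total_windows == 0:
        return 0.0  # assumption: undefined ratio treated as zero
    return total_hits / total_windows * seqs_with_s / len(seqs)


def alpha(intergenic, gene, s):
    """Definition 3: (F_I - F_G) / (F_I + F_G); 0.0 if s never occurs."""
    f_i, f_g = f_dataset(intergenic, s), f_dataset(gene, s)
    return (f_i - f_g) / (f_i + f_g) if f_i + f_g else 0.0
```

With these helpers, a string would be kept as a keyword when the absolute value of `alpha` exceeds the cutoff for its length (Table 2) and it occurs in over 60% of the training sequences; the sign of `alpha` then indicates whether it is more abundant in the intergenic set (positive) or the gene set (negative).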
非等位基因
非等位基因概述非等位基因是指同一基因座上的不同等位基因。
等位基因是指在某个给定的基因座上,可以存在多种不同的变体。
每个个体继承了一对等位基因,一对等位基因可能会导致不同的表型表达。
非等位基因的存在使得遗传学研究更加复杂,因为不同的等位基因会对个体的表型产生不同的影响。
背景在生物学中,基因座是指染色体上一个特定的位置,该位置上的基因决定了某个特征的表达方式。
每个基因座上可以有多种不同的等位基因。
等位基因是指在某个特定基因座上的不同基因变体。
每个个体都会继承一对等位基因,通过这对等位基因的不同组合,决定了个体的表型。
然而,并非所有基因座上的等位基因都具有相同的表现型。
非等位基因的影响非等位基因的存在导致不同等位基因会对个体表型产生不同的影响。
有些非等位基因会表现出显性效应,也就是说,当个体继承了一个突变的等位基因时,即使同时继承了一个正常的等位基因,但显性效应会使得突变的等位基因的表型表达得到体现。
相反,有些非等位基因会表现出隐性效应,当个体继承了两个突变的等位基因时,才会表现出突变的表型。
除了显性和隐性效应之外,非等位基因还可能发生两种其他类型的表型效应。
一种是共显效应,当个体继承了两个不同的突变等位基因时,在表型表达上会表现出一种新的特征,这个特征并不是单个突变等位基因所能导致的。
另一种是部分显性效应,当个体继承了两个不同的突变等位基因时,表型表达将介于两个单独突变等位基因的表型之间。
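Recombination and non-allelic genes
Recombination is the process in which two different chromosomes exchange segments of their gene sequences.
During recombination, non-allelic genes may be altered, so that new allele combinations are formed.
This makes the phenotypic effects of non-allelic genes more complex, because the new combinations may bring together the effects of different loci.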
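Importance of non-allelic genes
Non-allelic genes play an important role in the adaptability and diversity of organisms.
By studying the various combinations of alleles, researchers can better understand the relationship between genes and phenotype and show how important genetic variation is for the adaptation of species to their environment.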
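Summary
Non-allelic genes are genes located at different loci, whereas alleles are the alternative variants found at a single locus.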
Orthogonal genetic systems in synthetic biology
Received: 2014-01-01
Funding: Scientific Research Foundation for Returned Overseas Scholars, Ministry of Education (WF2070000010)
*Corresponding author. E-mail: dmwang09@

Orthogonal genetic system in synthetic biology
GE Yongbin 1, HONG Jiong 2, WANG Dongmei 2*
(1 Department of Physics and Chemistry, Bozhou Normal College, Bozhou 236800, China; 2 School of Life Science, University of Science and Technology of China, Hefei 230026, China)

Abstract: Parallel, independent orthogonal systems are one of the foundations of synthetic biology. Such a system has little or no crosstalk with natural biological systems and their components. Its components include unnatural base pairs, frameshift codons, unnatural amino acids, orthogonal aminoacyl-tRNA synthetases, orthogonal RNA polymerases and promoters, and orthogonal ribosomes. These components can work together as a system or be applied individually in an organism; they program cells with new functions and give researchers new methods for biological research.
Key words: orthogonal system; synthetic biology; unnatural amino acid; unnatural base pair; orthogonal aaRS/tRNA pair

In current biological research, and in synthetic biology in particular, researchers no longer limit themselves to modifying existing biological systems. They increasingly design and engineer systems artificially, synthesizing large numbers of biological parts, devices and systems that do not exist, or have not yet been found, in nature.
...provides humans with their main source of food. About 77 hundred...
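Section 4. Examples of genome analysis: rice genome analysis
Zhejiang University. "Notes on Bioinformatics", Fan Longjiang

Drawing on some of our research results from recent years, this section focuses on genome research and analysis for rice, the first crop to have its genome sequenced.
Rice is the first crop to have had its whole genome sequenced.
Asian cultivated rice (Oryza sativa) has two subspecies (indica and japonica). The japonica cultivar Nipponbare was sequenced both by whole-genome shotgun (Goff et al., 2002) and by a clone-by-clone approach (Sasaki et al., 2002; Feng et al., 2002; The Rice Chromosome 10 Sequencing Consortium, 2003; The Rice Genome Sequencing Project, 2005), and the indica cultivar 9311 was sequenced by whole-genome shotgun (Yu et al., 2002; Yu et al., 2005).
In addition to the nuclear genome, the rice chloroplast genome was sequenced as early as 15 years ago (Hiratsuka et al., 1989), and the mitochondrial genome has also recently been completed (Notsu et al., 2002).
With the genome sequence in hand, a formidable task is to mine the enormous volume of rice sequence for hidden genetic events, evolutionary mechanisms and other important biological information.
To this end, drawing on some of our own work, this article highlights several of the latest findings from rice genome sequence analysis in recent years.

1 A modern diploid, an ancient polyploid
An important advance in rice genome research in 2004 was clear evidence that the rice genome has undergone a whole-genome duplication.
The results of Paterson et al. (2004), Guyot et al. (2004) and our own work (Fan et al., 2004; Zhang et al., 2005a) consistently indicate that a whole-genome duplication occurred before the divergence of the grass (Poaceae) crops.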
affinity chromatography
Review

Structural analysis and classification of native proteins from E. coli commonly co-purified by immobilised metal affinity chromatography

Victor Martin Bolanos-Garcia ⁎, Owen Richard Davies

Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge, CB2 1GA, England

Received 25 January 2006; received in revised form 23 March 2006; accepted 24 March 2006
Available online 26 April 2006

Abstract

Immobilised metal affinity chromatography (IMAC) is the most widely used technique for single-step purification of recombinant proteins. However, despite its use in the purification of heterologous proteins in the eubacterium Escherichia coli for decades, the presence of native E. coli proteins that exhibit a high affinity for divalent cations such as nickel, cobalt or copper has remained problematic. This is of particular relevance when recombinant molecules are not expressed at high levels or when their overexpression induces that of native bacterial proteins due to pleiotropism and/or in response to stress conditions. Identification of such contaminating proteins is clearly relevant to those involved in the purification of histidine-tagged proteins either at small/medium scale or in high-throughput processes. The work presented here reviews the native proteins from E. coli most commonly co-purified by IMAC, including Fur, Crp, ArgE, SlyD, GlmS, GlgA, ODO1, ODO2, YadF and YfbG. The binding of these proteins to metal-chelating resins can mostly be explained by their native metal-binding functions or their possession of surface clusters of histidine residues. However, some proteins fall outside these categories, implying that a further class of interactions may account for their ability to co-purify with histidine-tagged proteins. We propose a classification of these E. coli native proteins based on their physicochemical, structural and functional properties.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Affinity chromatography; E. coli contaminant protein; Metal-binding
classification; Histidine-tagged protein; Protein purification

1. Introduction

The use of immobilised metal affinity chromatography (IMAC) has revolutionised protein biochemistry by allowing the production of a pure protein sample through a single purification step. However, the concomitant expression of native bacterial proteins that exhibit a relatively high affinity for divalent cations during the expression of heterologous protein domains, full-length proteins or macromolecular complexes in E. coli frequently results in their co-purification during IMAC [1]. Most of these metal binding proteins are present in E. coli strains of different genetic backgrounds, such as BL21, BL21(DE3), BL21(DE3)pLysS, C41, C43, Rosetta (DE3) and (DE3)pLysS as well as Origami (DE3) and (DE3)pLysS. These strains contain a lambda-lysogen DE3 bacteriophage that encodes T7 RNA polymerase under the control of the lacUV5 operator; the expression of T7 promoter and lacUV5 operator controlled genes on pET-based vectors is thus permitted upon induction with isopropyl-β-D-thio-galactopyranoside (IPTG) [2,3].

Since the 1970s, IMAC has remained the most important technique for single-step protein purification [4]. The expression of a recombinant protein containing a histidine-tag (usually six consecutive histidine residues) allows it to be specifically bound by chelated divalent metal ions, and then eluted through competition by the addition of imidazole, or through the protonation of histidine residues by a reduction in pH. This often has dramatic results in the purification of target proteins to near homogeneity from bacterial cell lysate [5,6]. Further advantages of IMAC include ligand stability, high protein loading capacity, mild or denaturing elution conditions, column regeneration, low cost and scalability [7,8]. This has meant that it is now in widespread use in both low and high throughput environments [9,10].

There are many different metal-chelator systems for IMAC, although the most common are the tridentate ligand IDA, Ni
2+ bound to tetradentate ligand NTA (Ni-NTA; Qiagen Ltd.) and Co2+ bound to tetradentate ligand CM-Asp (TALON™; BD Biosciences Clontech). The interaction between Ni-NTA resin and a histidine-tagged protein is illustrated in Fig. 1. NTA and TALON™ have higher affinities for metal ions than IDA, but they exhibit lower protein binding due to the loss of one coordination site. Depending on the proximity, orientation and spatial accessibility of histidine residues as well as the density of the chelating groups and metal ions, multipoint binding of different histidine residues can be achieved. Usually, one histidine is enough for weak binding to IDA-Cu2+, while more proximal histidine residues are needed for efficient binding to Zn2+ and Co2+ [8].

Biochimica et Biophysica Acta 1760 (2006) 1304–1313
⁎ Corresponding author. Tel.: +44 1223 766029; fax: +44 1223 766002.
E-mail address: victor@ (V.M. Bolanos-Garcia).
0304-4165/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.bbagen.2006.03.027

The reported capacities of these commercial sorbents are usually in the range of 5–10 mg/ml or even higher. These values commonly refer to isolated pure proteins or synthetic mixtures, so the capacities for isolation of recombinant proteins from complex sources are often lower. In successful cases, over 80% of histidine-tagged proteins can be recovered from E. coli homogenates [11]. However, the use of IMAC for some recombinant proteins may be limited by their low binding affinity to metal-chelating sorbents despite optimisation and/or the use of fresh resin. This situation is often due to the histidine-tag being partially hidden from the protein surface, which can in turn be due to inter- or intra-molecular interactions. Although it is occasionally possible to fully expose the histidine tag by adding detergents, glycerol, polyethylene glycol of low molecular weight and/or chaotropic agents at low concentration, in many cases, a relatively low affinity for the sorbent persists in the presence of these and other
additives. The length and position of the histidine tag can also affect other fundamental properties of a recombinant protein such as its expression level, stability, oligomerisation state, and ability to constitute suitable samples (for example, protein samples that allow the formation of single crystals for X-ray crystallography) [7].

In addition to histidine-tagged recombinant proteins, some native proteins also show affinity for metal chelating resins commonly used in IMAC. Binding of native proteins is determined by many factors, including the accessibility of surface histidine residues to the metal ions present in chelating resins, the microenvironment of binding residues, cooperation between neighbouring amino acid side groups, and local conformations. Interestingly, the intrinsic metal-binding properties of several non-histidine tagged proteins have been exploited for simple single-step purification by Ni-NTA-sepharose. Examples are untagged HIV-1 integrase expressed in E. coli [12], the alpha subunit of the human transcription factor A (TFIIA) [13] and a proteomic-wide analysis of copper-binding proteins in plants [14].

The co-purification of contaminant proteins that bind to metal-chelating sorbents is particularly problematic when one or more E. coli native proteins are expressed at high levels and/or when they exhibit a size similar to that of the recombinant protein [15]. This situation might ultimately result in the purification of an undesired bacterial protein rather than the recombinant protein of interest. One example of medical relevance is the purification of TNF-α, which allowed efficient host-cell endotoxin and DNA removal but resulted in the simultaneous recovery of some residual E. coli proteins during the final elution step [8].

Several innovations have been described in the literature to improve the yield and purity of recombinant histidine-tagged proteins purified by single-step chromatography, such as the use of isopropanol during washing steps for the removal of
contaminants, including bacterial endotoxins [16]. However, this treatment may result in protein unfolding, thus making the purification of properly folded recombinant proteins a difficult task. Additionally, there are several other aspects of IMAC that must be further improved, including low dynamic capacity and efficiency of cleaning procedures for eliminating contaminants.

The co-purification of contaminant proteins exhibiting affinity for divalent cations has been observed not only with single-tagged proteins but also with double-tagged molecules, including the combination of histidine and thioredoxin (TRX), Nut, glutathione-S-transferase (GST), maltose binding protein (MBP) and GroEL tags. Although the use of double tags seems to be adequate in removing most of the native bacterial contaminants, this strategy is not exempt from other problems such as excessive sample manipulation and the occasional partial proteolysis and/or aggregation and/or insolubility of the recombinant protein after tag cleavage and removal. Moreover, the use of double tags is generally undesirable in high-throughput processes.

Fig. 1. The mechanism of binding between a histidine-tag and Ni-NTA resin (Qiagen Ltd.). Nickel ions are immobilised on a nitrilotriacetic acid (NTA) sepharose resin through co-ordination sites with three oxygen atoms and one nitrogen atom. This leaves two co-ordination sites, which may be taken up by nitrogen atoms of two adjacent histidine residues of a histidine-tagged recombinant protein, thus offering a mechanism for affinity chromatography. The interaction between the nickel ion and the histidine-tagged protein may be disrupted and protein eluted through competition by the addition of imidazole or through the protonation of histidine residues by lowering the pH.

Although it is well known that certain E. coli native proteins are problematic contaminants during IMAC [1,15], to our knowledge, a formal
account of the specific proteins and their mechanisms of binding to metal chelating resins has not yet been reported. In recognition of this, we reviewed the physicochemical, functional and structural properties of native proteins from E. coli that in our experience are the most commonly co-purified by IMAC and proposed a classification based on their relative affinity for metal chelating resins. Our analysis shows that most of the contaminants are stress-response proteins that tightly bind to metal chelating sorbents through surface clusters of histidine residues or other metal-binding residues that are physiologically important. We conclude that the identification of contaminant proteins aids the design of a purification strategy for recombinant proteins expressed in intact E. coli cells as well as in E. coli lysates used in cell-free translation systems.

1.1. The majority of E. coli contaminants are stress-responsive proteins

The response of E. coli to stress conditions such as nutrient starvation, heat shock and oxidative damage results in a transcriptional shutdown of protein synthesis and in the induction of genes encoding diverse stress proteins. According to our experience, which is based on the purification of more than 80 different histidine-tagged proteins, discussions in open forums such as CCP4 as well as with colleagues of this and other research centres, we conclude that these stress responsive proteins are the main native proteins from E. coli that co-purify with recombinant proteins during IMAC under conventional purification conditions. Very importantly, our experience shows that the relative level of expression of a particular contaminant protein from E. coli is quite variable and appears to depend upon numerous factors, including culture conditions, media composition and the genetic background of the expression strain. From our experimental observations, these stress responsive proteins have been classified into three groups on the basis of the concentration of
imidazole that is required for their elution from IMAC columns, thus providing an indication of the strength of binding to metal-chelating resins. Class I proteins require ≥80 mM imidazole for elution (Fur, Crp, SlyD, ArgE, Cu/Zn-SODM and YodA), Class II proteins require 55 to 80 mM imidazole (GlmS, ODO2, YadF, CAT, GlgA, YfbG and G6-PD) and Class III comprises proteins that bind weakly, requiring only 30 to 50 mM (Hsp60 and ODO1). A constant volume of fresh Ni-NTA sorbent, identical chromatography protocols and buffer solutions of similar composition were used to allow meaningful comparisons. All proteins described in this work were identified by mass spectrometry (MALDI-TOF) and N-terminal sequence analysis by Edman degradation at the Protein and Nucleic Acid Chemistry (PNAC) facility (Department of Biochemistry, University of Cambridge, UK). Full details of the identified proteins, their metal-chelating binding strengths and physicochemical properties are given in Tables 1 and 2.

Upon analysis of these contaminant proteins, two distinct mechanisms of binding to metal-chelating resins can be recognised: (1) the possession of native metal-binding sites that can bind to Ni2+ or Co2+ metal ions and (2) the presence of surface clusters of histidine residues that bind to the chelated metal in the same way as the tandem residues of a histidine-tag. As discussed below, different binding mechanism(s) may account for the resin binding capacity of the other contaminants that do not satisfy these two conditions.

Our analysis also shows that native E. coli proteins that are co-purified by IMAC exhibit a wide diversity of folds, oligomerisation states and hierarchical organisation. A high percentage of these contaminant proteins correspond to those with more than one domain. In terms of frequency, most of these proteins belong to the α+β class, closely followed by the all-β class. Interestingly, only a marginal number of contaminant proteins correspond to the all-α class. With the possible exception of ODO2 and Hfq, there seems to be
no correlation between the in vitro protein oligomerisation state and the relative affinity for metal chelating sorbents.

1.2. Native metal-binding proteins

The presence of physiologically important metal-binding sites is the most significant mechanism by which native E. coli proteins co-purify with the heterologous protein during IMAC. Common contaminants that bind through metal-binding sites include Fur, YodA, Cu/Zn-SODM and ArgE (Class I), YadF and GlgA (Class II). It is noteworthy that most of these contaminants belong to Class I, and so such proteins are most likely to be found as problematic contaminants during IMAC purification of heterologous histidine-tagged proteins.

1.2.1. Ferric uptake regulator (Fur)

The DNA-binding protein Fur (ferric uptake regulator) tightly controls the quantity of intracellular iron in E. coli through repressing the transcription of iron-starvation genes upon binding to Fe2+ [17]. The structure of P. aeruginosa Fur is 40% helical and 18% β-sheet, encompassing an N-terminal DNA-binding domain and C-terminal dimerisation domain [18: pdb entry 1MZB]. This fold classifies the protein in the DNA-binding domain superfamily [SCOP, 19]. In addition to a functionally important regulatory iron-binding site in the dimerisation domain, Fur also contains a structural zinc-binding site that is crucial for its function in vivo. The calcium-regulatory functions of Fur mean that it is involved in several cellular processes, including chemotaxis, protection against oxidative damage and acid-shock response [20]. Thus, Fur may be overexpressed in E. coli upon the induction of foreign genes through pleiotropism and in response to acid-shock stress. The presence of two physiological metal-binding sites confers on Fur a high affinity for metal-chelating resins and explains why it is commonly co-purified during IMAC.

1.2.2. Metal-binding lipocalin (YodA)

The E. coli protein YodA plays an important role in the bacterial response to cadmium. Cadmium is readily taken up by bacteria but is highly toxic; its high redox potential allows it
to block the functions of metalloproteins and zinc finger proteins, thus leading to oxidative stress [21]. YodA was the first naturally occurring metal-binding lipocalin described; sequence similarities with putative proteins from other bacterial species have suggested a family of proteins that are expressed in response to cadmium stress [22]. The transcription of yodA is also dependent on fur and soxS, both of which mediate the protection against reactive oxygen species. These findings and more recent genomic studies, where an increase in YodA levels in E. coli after acid-induction in the culture medium from pH 7 to 5.8 was noticed [23], indicate that YodA is over-expressed in response to oxidative and acid stress.

Structural studies have revealed that YodA is composed of two domains: a major calyx domain (lipocalin/calycin-like domain) and a helical domain [22]. Crystal structures of YodA bound to both cadmium and zinc (pdb accessions 1OEE and 1OEK, respectively) show a common metal-binding site made up of histidine residues that lie along the side of the calyx domain in a manner that resembles the metal binding sites of several proteases and oxidoreductases. The same authors also reported the structure of YodA crystallised in the absence of metal ions (pdb accession 1OEJ, shown in Fig. 2B). Interestingly, they found that nickel ions had bound to the central metal-binding site during purification by Ni-NTA, thus highlighting the high affinity of YodA for metal ions, and illustrating the mechanism by which metal-binding proteins can bind to metal-chelating resins. A similar three-histidine motif is found in several metalloproteins, lyases and oxygenases, raising the possibility that E. coli proteins from these classes may also exhibit high affinity for metal-chelating sorbents. Indeed, as shown in this work, some of the common contaminants we have identified belong to these classes.

1.2.3. Cu/Zn-superoxide dismutase
(Cu/Zn-SODM)

Superoxide dismutases (EC 1.15.1.1) are metalloenzymes that protect against oxygen toxicity by catalysing the dismutation of superoxide into molecular oxygen and hydrogen peroxide [24]. They are classified into three groups on the basis of their Fe2+, Mn2+, Cu2+/Zn2+ catalytic centres. Unlike other superoxide dismutases, the Cu2+/Zn2+-superoxide dismutase (Cu/Zn-SODM) is a monomeric protein of 17 kDa [25]. The E. coli protein consists of the Greek-key β-barrel topology, formed by eight antiparallel β-strands [pdb entry 1ESO: 26]. Similar to some of the proteins previously described, Cu/Zn-SODM is expressed under stress conditions and likely binds metal-chelating resins through its Cu2+ and Zn2+ binding sites, which consist of conserved histidine residues (shown in Fig. 2C).

1.2.4. Acetylornithinase (ArgE)

Prokaryotic arginine synthesis usually involves the transfer of an acetyl group to glutamate by ornithine acetyltransferase in order to form ornithine. However, in E. coli acetylornithine deacetylase (acetylornithinase, ArgE) (EC 3.5.1.16) catalyses the deacylation of N2-acetyl-L-ornithine to yield ornithine and acetate [27]. Phylogenetic evidence suggests that the clustering of the arg genes in one continuous sequence pattern arose in an ancestor common to Enterobacteriaceae and Vibrionaceae, where ornithine acetyltransferase was lost and replaced by a deacylase.

Table 1
Native proteins from E. coli commonly co-purified during IMAC

Protein      SwissProt access code   Molecular mass (kDa)   % Histidine residues   Isoelectric point (pI)   Metal requirement
Fur          P06975                  16.7                   8.1                    5.6                      Fe2+, Zn2+ (a)
YodA         P76344                  22.3                   5.2                    5.6                      Cd2+, Zn2+ (a)
Cu-Zn-SODM   AAC74718                17.6                   4.0                    5.9                      Cu2+ (a), Zn2+ (a)
ArgE         P23908                  42.3                   4.4                    5.5                      Fe2+, Ni2+
YadF         P36857                  25.0                   5.5                    6.1                      Zn2+, Hg2+ (a,b)
GlgA         P08323                  51.7                   3.4                    6.0                      Mg2+ (a,c)
GlmS         P17169                  66.8                   3.9                    5.5
CAT          AAA57080                25.5                   5.5                    5.9                      Co2+
Crp          P03020                  23.6                   2.9                    8.3
Hfq          P25521                  11.1                   4.9                    6.9
SlyD         P30856                  20.8                   10.2                   4.8                      Zn2+, Ni2+
S15          P02371                  10.2                   5.6                    10.4
YfbG         P77398                  74.2                   4.1                    6.3
Hsp60        AAC77103                57.0                   0.2                    4.8
ODO1         P07015                  10.5                   3.6                    6.0
ODO2         P07016                  44.0                   1.7                    5.5
G6PD         P22992                  55.7                   1.2                    5.5

The table summarises the physicochemical properties of these proteins.
a Metal ions reported to be present in the crystallization solution.
b As seen in the structure of its human counterpart (PDB: 1CRM).
c Observed in its human counterpart (PDB 1PYX).

Table 2
Relative resin affinity of native proteins from E. coli co-purified during IMAC

Relative affinity for the metal chelating sorbent
Class I (a)   Class II (b)   Class III (c)
Fur           YadF           Hsp60
YodA          GlgA           ODO1
Cu-Zn-SODM    GlmS
ArgE          CAT
Crp           YfbG
Hfq           ODO2
SlyD          G6-PD
S15

The table summarises the basis of the protein classification used in this work. The relative affinity is estimated as the millimolar concentration of imidazole required for their elution from a Ni-NTA sepharose column. a = >80; b = 55–80; c = 30–50.

The 383 amino acid ArgE protein is made up of two domains: an N-terminal Zn2+-dependent exopeptidase domain and a C-terminal exopeptidase dimerisation domain. It forms a homodimer in solution and requires cobalt and glutathione as cofactors. ArgE contains the Co2+/Zn2+ binding motifs, and shows a high degree of sequence and structural identity with other metalloenzymes, explaining its high affinity for metal-chelating resins.
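The imidazole thresholds used for the three classes in Table 2 can be written as a small lookup, sketched here under stated assumptions. The function name is hypothetical, a value of exactly 80 mM is assigned to Class I following the "≥80 mM" wording of Section 1.1 (the Table 2 footnote reads ">80"), and concentrations that fall outside the three published ranges return None rather than being forced into a class.

```python
def imac_affinity_class(elution_imidazole_mm):
    """Classify a contaminant by the imidazole concentration (mM) needed to elute it.

    Thresholds follow the ranges quoted in the review:
      Class I   >= 80 mM
      Class II  55 to 80 mM
      Class III 30 to 50 mM
    Values in the unassigned gaps (below 30, or between 50 and 55) return None.
    """
    if elution_imidazole_mm >= 80:
        return "Class I"
    if elution_imidazole_mm >= 55:
        return "Class II"
    if 30 <= elution_imidazole_mm <= 50:
        return "Class III"
    return None


# Hypothetical elution concentrations, for illustration only.
for conc in (100, 60, 40, 52):
    print(conc, imac_affinity_class(conc))
```

In practice such a lookup is only meaningful when the resin volume, protocol and buffers are held constant, as the authors did for their comparison.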
1.2.5. Carbonic dehydratase (YadF)

Carbonic anhydrases (carbonic dehydratases) (EC 4.2.1.1) are enzymes that catalyse the interconversion of carbon dioxide and bicarbonate, utilising Zn2+ as a cofactor. This reaction is crucial to cellular growth; the low atmospheric CO2 concentration and its rapid diffusion from cells mean that spontaneously produced bicarbonate is insufficient to meet the metabolic requirements of growing cells. YadF from E. coli is essential for growth in the absence of another carbonic anhydrase, CynT [28]. Although the transcription of yadF is neither regulated by CO2 nor subject to self-regulation, YadF expression is dependent upon bacterial growth rate: its expression is maximal in slow-growing cultures, at high cellular densities and during starvation or heat stress conditions [29].

1.2.6. Glycogen synthase (GlgA)

The glgA gene encodes glycogen synthase (E.C. 2.4.1.21), an enzyme of 477 amino acid residues in length. This enzyme participates in the biosynthesis of glycogen and contains the UDP-glycosyltransferase/glycogen phosphorylase domain [30]. Glycogen is accumulated when there is a shortage of nutrients (such as nitrogen) even in the presence of excess carbon [31], so glycogen synthesis is induced when cells enter stationary phase, making E. coli glgA an example of an inducible bacterial gene [32]. Even though a large number of bacterial genes are induced during transition into stationary phase, only a minority have been characterised to date. GlgA is also an example of a gene cluster containing the genes that encode for both catabolic and anabolic proteins [33], which ensures the tight in vivo regulation of these metabolic pathways.

Fig. 2. Three-dimensional structures of proteins that show native metal-binding properties. (A) The crystal structure of Fur with zinc ions bound at sites 1 and 2; in the native state these metal-binding sites bind to iron and zinc, respectively. (B) Cu/Zn SODM contains metal binding sites for both copper and zinc ions, offering two sites for binding to metal chelating resins. Structural analysis and images were produced using Pymol [56]. (C) YodA crystallised with nickel ions bound to the metal-binding site; nickel ions had bound to the structure during purification by IMAC, demonstrating the strength of binding to Ni-NTA by native metal-binding proteins.

Our analysis indicates that YadF and GlgA are the only natural metal binding proteins that correspond to Class II; they do not have strong interactions with metal-chelating sorbents and can be eluted by imidazole concentrations in the range 55–80 mM. Indeed, GlgA is known to bind Mg2+ rather than Ni2+. The relative affinity of YadF for metal-chelating sorbents is lower than expected, taking into account the content of histidine residues (5%). This behaviour can be explained considering that the biologically relevant metal binding residues and most of the histidine residues are not surface exposed, as shown in the crystal structure of its human counterpart (PDB 1CRM).

1.3. Surface histidine clusters

In considering E. coli proteins that bind to metal-chelating resins along with histidine-tagged proteins, one of the most obvious mechanisms of binding is through the possession of a native histidine tag or surface cluster of histidine residues that can bind to the resin in the same way as the recombinant protein. This is a relatively common scenario, as judged by the number of contaminating E. coli proteins that bind to metal chelating sorbents via surface-exposed histidine residues, including CRP, SlyD, Hfq and S15 (Class I), GlmS and CAT (Class II).

1.3.1. Glucosamine-6-phosphate synthase (GlmS)

GlmS is an enzyme (E.C. 2.6.1.16) that catalyses the formation of D-glucosamine 6-phosphate from D-fructose 6-phosphate using L-glutamine as the ammonia source. N-acetylglucosamine is an essential building block of both bacterial cell walls and fungal cell wall chitin. Thus, GlmS is a potential target for antibacterial and antifungal agents. In fact, potent
carbohydrate-based inhibitors of GlmS have already been reported, including 2-amino-2-deoxy-D-glucitol 6-phosphate, an analogue of the putative cis-enolamine intermediate that is formed during catalysis [34].

GlmS from E. coli is a 67-kDa protein, organised in two domains: the N-terminal glutamine amidohydrolase domain is responsible for the hydrolysis of L-glutamine, and the C-terminal glucosamine-6-phosphate synthase domain catalyses the ketose to aldose isomerisation of fructose 6-phosphate [34,35]. The isomerase domain comprises two topologically identical subdomains, each of which is dominated by a nucleotide-binding motif of a flavodoxin type. The isomerase catalytic site of GlmS is assembled by the association of two monomers, implying that this protein has evolved by gene duplication and subsequent dimerisation [36]. The crystal structures of both domains have been reported [37] and show the presence of four surface clusters of at least three histidine residues in close proximity, explaining its high affinity for metal-chelating resins. One particularly interesting histidine cluster occurs at the dimerisation interface at which three histidine residues from each protomer come together to form a six-residue cluster (Fig. 3A).

1.3.2. Chloramphenicol-O-acetyl transferase (CAT)

This is a 25-kDa enzyme that belongs to the superfamily of CoA-dependent acyltransferases (E.C. 2.3.1.28). It catalyses the formation of chloramphenicol 3-acetate from acetyl-CoA and chloramphenicol. The cat gene has been used in molecular biology for decades to confer chloramphenicol-resistance to chloramphenicol-sensitive bacterial strains, allowing the positive selection of recombinant clones. The cat gene has also been used in the transcriptional mapping of extrachromosomal elements [38]. The crystal structure of this enzyme (pdb entry 4CLA) shows that CAT is an α/β protein of a two-layer sandwich architecture.
Although cobalt ions are present in the crystal structure, these likely stabilise the crystal lattice rather than representing physiologically important metal binding sites. Binding to metal-chelating resins is more likely to occur through a surface cluster of three histidine residues in close proximity. The co-purification of this protein during IMAC can clearly only be a problem if the gene of interest has been cloned in a bacterial expression vector containing this selection marker or if the recombinant gene is expressed in Rosetta™ competent cells, which contain a chloramphenicol-resistant plasmid that encodes for “rare” codon tRNAs.

1.3.3. cAMP-regulatory protein (CRP)

Many cellular signalling pathways operate through the production of cAMP, which can regulate DNA transcription by binding to cAMP-regulatory protein (CRP; also known as catabolite gene activator protein, CAP). Homodimeric CRP forms a complex with cAMP, which is able to bind DNA at specific sites near the promoter region. This binding induces a dramatic conformational change in the DNA molecule, thus controlling the transcription of catabolite-sensitive operons [39]. In addition to this role, CRP also regulates gene expression in response to osmotic changes [40]. The crystal structure of CRP from E. coli shows that it is 40% helical and 29% β-sheet, constituting an N-terminal cAMP-binding domain and a C-terminal “winged helix” DNA-binding domain. Although this protein has low histidine content (2.9%) and no metal-binding sites, the flexible N-terminal chain of each protomer contains three surface-exposed histidine residues that may sequester metal ions. Thus, these residues may exhibit a cooperative effect on the binding of the dimeric protein to metal chelating resins.

1.3.4. Host factor-I protein (Hfq)

Hfq is an RNA-binding protein that is required for phage Qβ RNA genome replication. It also binds tightly to poly(A) RNA, oxyS RNA and the untranslated RNA dsrA, targeting several mRNAs for degradation possibly by increasing polyadenylation
or by interfering with ribosome binding [41–43]. Novel proteomic tools have allowed the identification of new mRNA targets of Hfq, including Fur and SodB, thus demonstrating roles in controlling iron uptake and scavenging [44]. Hfq has also been implicated in negative post-transcriptional regulation by affecting the stability of the E. coli mutS, miaA, hfq [41] and ompA mRNAs [42]. The functional hexamer of Hfq presents a central canal lined by six surface histidine residues (Fig. 3C), and each monomer also contains a C-terminal run of histidine residues, thus offering multiple sites for interactions with metal-chelating resins.
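The Role of Transposable Elements in Plant Growth and Development

Wang Shanshan

[Abstract] Transposable elements are generally divided into two classes according to their transposition mechanism: retrotransposons and DNA transposons. This review summarizes research on the effects of retrotransposons and DNA transposons on gene expression and on plant growth and development, and finds that transposable elements affect gene expression mainly in four ways. First, by inserting within a gene and disrupting its intact structure, thereby inactivating it, for example by insertion into an exon, an intron, or the 5′ UTR. Second, by inserting into the regulatory region of a gene and altering its expression level, including insertion into the promoter, enhancer, or attenuator region, or by supplying a new promoter or a cis-acting site that serves as an enhancer. Third, through a series of epigenetic mechanisms, such as DNA methylation and siRNA. Fourth, through chromosomal recombination, gene duplication, gene loss, and related mechanisms that change gene copy number and expression level and can even give rise to new genes.

[Journal] 现代农业科技 (Modern Agricultural Science and Technology)
[Year (Volume), Issue] 2017(000)009
[Pages] 5 (P142-146)
[Keywords] transposable elements; gene expression; phenotype; plant growth and development
[Author] Wang Shanshan
[Affiliation] Institute of Agriculture, Tibet Academy of Agricultural and Animal Husbandry Sciences, Lhasa, Tibet 850000
[Language of text] Chinese
[CLC number] Q943.2

Transposable elements, which are classified by transposition mechanism mainly into retrotransposons (Class Ⅰ) and DNA transposons (Class Ⅱ), are mobile DNA sequences that are widespread and abundant in most eukaryotes. They were once regarded as "junk DNA", but modern research shows that transposable elements have an important influence on gene and genome evolution and are an important driving force of species evolution [1-2].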
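As an important component of the genome, transposable elements are also the largest mobile fraction of plant genomes identified to date. Fluctuations in their number, through expansion or contraction, can produce marked differences in genome composition even between closely related species. Activation of transposable elements can likewise cause a series of changes in gene expression and function; the affected genes are related either to plant reproductive growth or to plant stress responses [3].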
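Retrotransposons transpose through an RNA-mediated "copy-and-paste" mechanism: the element is transcribed into mRNA by RNA polymerase Ⅱ, reverse-transcribed into cDNA by reverse transcriptase, and finally inserted at a new site in the genome by integrase (INT), producing a new copy [4].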
Localization of an Absorbing Inhomogeneity in a Scattering Medium in a Statistical Framework

Guangzhi Cao, Vaibhav Gaind, Charles A. Bouman, and Kevin J. Webb

School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana 47907-2314, USA
webb@

An approach for the fast localization and detection of an absorbing inhomogeneity in a tissue-like scattering medium is presented. The probability of detection as a function of the size, location, and absorptive properties of the inhomogeneity is investigated. The detection sensitivity in relation to the source and detector location serves as a basis for instrument design. © 2007 Optical Society of America

OCIS codes: 170.3010, 290.7050, 100.3010, 100.3190

Optical imaging in scattering media provides important opportunities for clinical imaging and environmental sensing, among others [1]. In the near-infrared wavelength range, soft tissue has both high scatter and low absorption, allowing use of a diffusion equation model for photon transport [2,3], which with $\exp(j\omega t)$ time dependence is

$\nabla \cdot D(r)\nabla - \mu_a(r) + j\omega$ […]

[…] localization of a fluorophore in a mouse tongue [7]. Milstein et al. developed a statistical approach based on maximum likelihood (ML) estimation for localization and a binary hypothesis test to detect a fluorescent source [8]. In this Letter, we extend Milstein's work for fluorescence detection to study issues related to the three-dimensional localization and detection of an intrinsic absorbing inhomogeneity in a scattering medium such as tissue. The probability of detection is used to characterize the diagnostic capability of such a measurement system, and the detection sensitivity presented can be used to optimize the source-detector (SD) geometry, thereby providing a path to instrument design. We also investigate how factors such as SD geometry, and the physical and optical properties of the inhomogeneity, affect detection and localization.

We use ML estimation to estimate the location of an absorbing inhomogeneity in a homogeneous background having
parameters $\mu_{a0}$ and $D_0$, which are assumed known. A model is needed to parameterize the unknown inhomogeneity, which could have varying size and optical contrast (defined as $\Delta\mu_a = \mu_a - \mu_{a0}$). We use the point inhomogeneity model suggested by Milstein et al. [8], which proved effective in localizing fluorescence, and account for the contrast through a weighting factor for this point absorber, given by $\delta u(r)$. A measurement vector $y$ of length $M$, for example, the optical intensity at a series of points on the surface at a particular modulation frequency for the light, is compared with a predicted measurement $f(r)$, based on (1), assuming there exists a point inhomogeneity at position $r$. Let $y_0$ represent the expected measurement in the absence of an inhomogeneity and $f'(r)$ be the Fréchet derivative which relates perturbations in $\mu_a(r)$ to the predicted measurement $f(r)$, i.e., $f(r) \approx y_0 + f'(r)\,\delta u(r)$. The ML localization can thus be formulated as

$$\hat{r} = \arg\min_{r} C(r, \delta u(r)), \qquad C(r, \delta u(r)) = \left\| y - y_0 - f'(r)\,\delta u(r) \right\|_{\Lambda}^{2}, \tag{2}$$

where $C(r, \delta u(r))$ is the negative log likelihood and is treated as a cost function, $\Lambda^{-1}$ is the noise covariance matrix, for which we use a shot noise model [4], and $\|v\|_W^2 = v^H W v$, with $H$ being the Hermitian transpose. This optimization can be implemented as a two-step procedure in which, for each discretized position $r$ over the region of interest, $C(r, \delta u(r))$ is minimized with respect to $\delta u(r)$, giving the unique (because $C(r, \delta u(r))$ is quadratic) closed form estimate

$$\delta\hat{u}(r) = \arg\min_{\delta u} C(r, \delta u(r)) = \frac{\mathrm{Re}\left[(y - y_0)^H \Lambda f'(r)\right]}{\left\| f'(r) \right\|_{\Lambda}^{2}}$$

[…] assume an average signal-to-noise ratio (SNR) of approximately 40 dB and a modulation frequency of $\omega = 2\pi \times 10^6$ rad/s. An analytic solution of (1), with an extrapolated $\phi = 0$ boundary condition to represent the interface between the scattering medium and free space [8], leads to an expression for $f'(r)$. Fig. 2(a) gives a plot of the negative log likelihood, and the estimated centroid of the inhomogeneity is within 2.5 mm of the true point. This is a promising result, given the simple measurement geometry. Fig. 2(b) shows the reconstruction of $\mu_a$ using the same data
set. Note that the reconstructed $\mu_a$ is not accurate, which is due to the limited data set. We have previously shown that the reconstruction can be made quantitative with more SD pairs and through use of nonlinear optimization methods [4]. The localization approach we present is thus a computationally efficient way of obtaining the position of an inhomogeneity, and anecdotally success appears possible with very limited measurement data.

Determination of the inhomogeneity's presence, or lack thereof, is a detection problem for which we employ binary hypothesis testing. Let the hypothesis $H_0$ correspond to the absence of an inhomogeneity and $H_{1,r}$ to the presence of an inhomogeneity at position $r$. The probability densities for $y$ under the two hypotheses are

$$p(y \mid H_{1,r}) \propto |\Lambda| \exp\left( -\left\| y - f(r) \right\|_{\Lambda}^{2} \right) \tag{5}$$

$$p(y \mid H_0) \propto |\Lambda| \exp\left( -\left\| y - y_0 \right\|_{\Lambda}^{2} \right). \tag{6}$$

The likelihood ratio test (LRT) is

$$L(y, r) = \log \frac{p(y \mid H_{1,r})}{p(y \mid H_0)}$$ […] $\sigma_q)$. (8)

Consider now the influence of physical (size, depth) and optical (contrast) properties of the inhomogeneity on $P_D$, assuming that these properties are known. In practice, the parameters describing the inhomogeneity are unknown and must be estimated. Therefore, the results of our simulation, with $P_D$ computed using (8), give an upper bound for the $P_D$ of a measurement system. The measurement geometry of Fig. 1 is used. Fig. 3(a) plots $P_D$ as a function of the inhomogeneity depth for the case of Fig. 2. $P_D$ decreases as the inhomogeneity depth increases, and the reliable detection depth is about 2 cm. Fig. 3(b) gives $P_D$ as a function of inhomogeneity size and contrast for a fixed depth of 1.5 cm. Notice that detection becomes more reliable as both the size and contrast increase, with a fixed source power (SNR). Fig. 3(c) shows $P_D$ as a function of inhomogeneity depth and contrast. The achievable detection depth increases with the contrast but finally saturates at about 3 cm. This saturation is dictated by the detector noise, i.e., by the SNR. Fig. 3(d) gives $P_D$ as a function of inhomogeneity depth and size. The achievable detection depth increases with the
inhomogeneity size, but saturates also due to the noise floor.

The placement and number of sources and detectors amounts to instrument design. Our strategy is to maximize the detection sensitivity $S = |y_i - y_{0i}|^2 / y_{0i}$ for each SD pair, where $y_i$ is the element of $y$ with $S_i$-$D_i$ and the inhomogeneity present, and $y_{0i}$ that without the inhomogeneity. An increase in $S$ corresponds to an increase in $P_D$, as (8) indicates. The analytical result for the sensitivity is plotted in Fig. 4 for inhomogeneity depths of 2 cm and 3 cm. The optimal SD separations are about 2.3 cm and 3.5 cm, respectively. The optimal SD separation increases as the inhomogeneity depth increases, ultimately being limited by the detector noise floor. By obtaining such information, one can optimize the design of a measurement system. A convenient approximation for the closed form semi-infinite medium solution for $S$ can be found under the assumption that $d \gg l^*$ and $\gamma \gg l^*$, where $l^* = 3D$ is the mean free path and $\gamma$ is the distance between the SD pair, which we find to be

S≈A·γ2exp{−4k(γ2 (γ2 […]

[…]

4. A. B. Milstein, S. Oh, J. S. Reynolds, K. J. Webb, C. A. Bouman, and R. P. Millane, "Three-dimensional Bayesian optical diffusion tomography with experimental data," Opt. Lett. 27, 95–97 (2002).
5. M. G. Erickson, J. S. Reynolds, and K. J. Webb, "Comparison of sensitivity for single and dual interfering source configurations in optical diffusion imaging," J. Opt. Soc. Am. A 14, 3080–3092 (1997).
6. Y. Chen, G. Zheng, Z. H. Zhang, D. Blessington, M. Zhang, H. Li, Q. Liu, L. Zhou, X. Intes, S. Achilefu, and B. Chance, "Metabolism-enhanced tumor localization by fluorescence imaging: in vivo animal studies," Opt. Lett. 28, 2070–2072 (2003).
7. I. Gannot, A. Garashi, G. Gannot, V. Chernomordik, and A. Gandjbakhche, "In vivo quantitative three-dimensional localization of tumor labeled with exogenous specific fluorescence markers," Appl. Opt. 42, 3073–3080 (2003).
8. A. B. Milstein, M. D. Kennedy, P. S. Low, C. A. Bouman, and K. J. Webb, "Statistical approach for detection and localization of a fluorescing mouse tumor in a turbid medium," Appl. Opt. 44, 2300–2310 (2005).

Fig. 1. Measurement geometry
for localization. A spherical absorber at depth $d$ is assumed in the simulation. The background optical parameters are: $\mu_{a0} = 0.02$ cm$^{-1}$, $D_0 = 0.03$ cm, and the modulation frequency is $\omega = 2\pi \times 10^6$ rad/s.

Fig. 2. Localization versus reconstruction: (a) Negative log likelihood: ◦ denotes the true inhomogeneity location and × the estimated location. (b) Optical diffusion tomography reconstruction of $\mu_a$. Parameters: 5 sources and 5 detectors and background parameters as in Fig. 1; inhomogeneity $\mu_a = 0.12$ cm$^{-1}$, $D = 0.03$ cm; average SNR is 40 dB; spherical inhomogeneity diameter of 0.625 cm.

Fig. 3. Influence of inhomogeneity depth, size and optical contrast ($\Delta\mu_a$) on $P_D$ for the geometry and parameters shown in Fig. 1, with $P_F = 0.03$ and an average SNR of 40 dB. (a) $P_D$ as a function of depth, for an inhomogeneity having: diameter 0.625 cm, $\mu_a = 0.12$ cm$^{-1}$, and $D = 0.03$ cm. (b) $P_D$ as a function of size and $\Delta\mu_a$, with $d = 1.5$ cm. (c) $P_D$ as a function of depth and $\Delta\mu_a$, with a 0.625 cm diameter inhomogeneity. (d) $P_D$ as a function of depth and size, with $\Delta\mu_a = 0.1$ cm$^{-1}$.

Fig. 4. Detection sensitivity as a function of S-D distance for two inhomogeneity depths. The background optical parameters are: $\mu_{a0} = 0.1$ cm$^{-1}$, $D_0 = 0.03$ cm, which give $k = 0.9$ cm$^{-1}$. The sensitivity for inhomogeneity depth 3 cm is magnified 20 times. The points are the approximate solution from (9).
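
The two-step ML localization described in the Letter (a closed-form weight at each candidate position, followed by a search over positions for the minimum cost) can be illustrated with a minimal Python sketch. Everything model-specific here is an assumption made for illustration and is not taken from the Letter: the infinite-medium Green's function stands in for the semi-infinite solution with an extrapolated boundary, the Born-type product `-G(s, r) G(r, d)` stands in for $f'(r)$, the diagonal of $\Lambda$ is set to `1/|y0|` as a simple shot-noise weighting, the speed of light in the medium `C_MEDIUM` is a guessed value, and the source, detector, and grid coordinates are hypothetical. The data are synthetic and noise-free.

```python
import cmath
import math

# Background parameters quoted in the Fig. 1 caption; C_MEDIUM is an assumed value.
MU_A0 = 0.02                  # cm^-1
D0 = 0.03                     # cm
OMEGA = 2 * math.pi * 1e6     # rad/s
C_MEDIUM = 2.25e10            # cm/s, assumption (refractive index of about 1.33)
K = cmath.sqrt((MU_A0 + 1j * OMEGA / C_MEDIUM) / D0)  # complex wavenumber

def green(a, b):
    """Infinite-medium diffusion Green's function between points a and b (assumed model)."""
    dist = math.dist(a, b)
    return cmath.exp(-K * dist) / (4 * math.pi * D0 * dist)

def frechet_column(r, pairs):
    """Illustrative Born-type sensitivity f'(r), one entry per source-detector pair."""
    return [-green(s, r) * green(r, d) for s, d in pairs]

def weight_and_cost(resid, fp, lam):
    """Step 1: closed-form real weight and the resulting cost at one position.

    resid = y - y0, fp = f'(r), lam = diagonal of Lambda (inverse noise variances).
    """
    num = sum((e.conjugate() * l * f).real for e, l, f in zip(resid, lam, fp))
    den = sum(l * abs(f) ** 2 for l, f in zip(lam, fp))
    du = num / den
    cost = sum(l * abs(e - f * du) ** 2 for e, l, f in zip(resid, lam, fp))
    return du, cost

def localize(y, y0, lam, grid, pairs):
    """Step 2: search the discretized positions for the minimum cost."""
    resid = [a - b for a, b in zip(y, y0)]
    best = None
    for r in grid:
        du, cost = weight_and_cost(resid, frechet_column(r, pairs), lam)
        if best is None or cost < best[2]:
            best = (r, du, cost)
    return best

# Hypothetical geometry: 5 sources and 5 detectors on the plane z = 0 (units: cm).
sources = [(float(i), 0.0, 0.0) for i in range(5)]
detectors = [(float(i), 1.0, 0.0) for i in range(5)]
pairs = [(s, d) for s in sources for d in detectors]
grid = [(0.5 * ix, 0.5, 0.5 * iz) for ix in range(9) for iz in range(1, 7)]

# Synthetic noise-free measurement with a point absorber at r_true.
r_true, du_true = (1.5, 0.5, 1.5), 0.01
y0 = [green(s, d) for s, d in pairs]
y = [b + f * du_true for b, f in zip(y0, frechet_column(r_true, pairs))]
lam = [1.0 / abs(b) for b in y0]  # assumed shot-noise weighting

r_hat, du_hat, cost_hat = localize(y, y0, lam, grid, pairs)
print(r_hat, du_hat, cost_hat)
```

Because the synthetic data are generated from the same linear model that is fitted, the search should recover the true grid point and weight; with real or noisy data the minimum cost would be nonzero and the estimate would depend on the SD geometry, as the Letter discusses.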