Strong replication in the GlobData middleware
University molecular biology classic bilingual courseware, Chapter 3: DNA replication
3.1 The principle of DNA replication
3.2 DNA replication models
3.3 Enzymes and proteins needed in DNA replication
3.4 The process of DNA replication
3.5 Telomeres and telomerase
As the replication fork advances, the parental duplex is unwound. On the lagging strand, a stretch of single-stranded parental DNA must be exposed, and a segment is then synthesized in the reverse direction (relative to fork movement). A series of these fragments is synthesized and subsequently joined together to create an intact lagging strand.
3.3 Enzymes and proteins needed in DNA replication

Table 3-1 Comparison of the E. coli DNA polymerases
Property                              Pol I      Pol II    Pol III
Structural gene                       polA       polB      polC
Number of subunits                    1          4         10
Relative molecular mass               103,000    88,000    830,000
5'→3' polymerase activity             yes        yes       yes
3'→5' exonuclease (proofreading)      yes        yes       yes
5'→3' exonuclease activity            yes        no        no
Polymerization rate (nt/s)            16–20      40        250–1,000
Processivity (nt)                     3–200      1,500     500,000
Main function                         primer removal, repair    repair    replication
On the lagging strand, new DNA therefore enters the newly synthesized strand in the form of short Okazaki fragments, which are later joined.
ChatGPT: a condensed literature review
A literature review is a very important part of scientific research; by organizing, synthesizing, and summarizing the existing academic literature, it provides an important reference and guide for subsequent research.
In computer science, natural language processing is a research direction that attracts wide attention, and ChatGPT, as a representative artificial intelligence model, has drawn particular interest for its applications in natural language processing.
ChatGPT is a deep-learning-based dialogue generation model developed by OpenAI that can generate natural-language dialogue approaching human level.
The following condensed literature review organizes and summarizes the research related to ChatGPT in natural language processing.
1. The basic principle of ChatGPT. ChatGPT is an improved model built on the Transformer architecture. Its core principle is to pre-train on a large text corpus, learning the language patterns and semantic information in the text, so that it can generate fluent, coherent dialogue.
Training uses a self-supervised objective: the model parameters are optimized by maximizing the joint probability of text sequences, so that the model learns to understand and generate natural language (a minimal sketch of this objective is given at the end of this subsection).
In practice, ChatGPT can be applied to many natural language processing tasks, including dialogue generation, text summarization, and question answering.
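To make the training objective above concrete, here is a minimal, hedged PyTorch sketch of next-token prediction trained with a cross-entropy loss. It is an illustration only: the tiny recurrent model stands in for a Transformer decoder, and every name (TinyLM, the toy batch, the dimensions) is invented for this example rather than taken from any GPT implementation.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)   # stand-in for a Transformer decoder
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)                     # logits: (batch, seq_len, vocab)

tokens = torch.randint(0, 100, (4, 16))        # toy batch of token ids
model = TinyLM()
logits = model(tokens[:, :-1])                 # predict token t+1 from tokens <= t
loss = nn.functional.cross_entropy(            # -log P(x_{t+1} | x_{<=t}), averaged
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
loss.backward()                                # an optimizer step would follow
```

Minimizing this per-token cross-entropy is equivalent to maximizing the joint probability of the sequence under the model, which is the self-supervised objective described above.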
2. The development of ChatGPT. ChatGPT has gone through several iterations, from the initial GPT-1 to the more mature GPT-3; both model scale and performance have improved markedly.
As model size and training data have continued to grow, ChatGPT's performance on natural language processing tasks has gradually approached, and on some tasks arguably matched or exceeded, human level, making it one of the most closely watched artificial intelligence models today.
3. Applications of ChatGPT in dialogue generation. ChatGPT is widely used for dialogue generation, including intelligent customer service, chatbots, and virtual assistants.
Through conversational interaction with users, ChatGPT can support question answering, sentiment analysis, task guidance, and other functions, greatly enriching human-computer interaction and changing how people communicate in daily life and at work.
4. Applications of ChatGPT in text summarization. Text summarization is an important natural language processing task that aims to extract the most important information from a text and produce a concise, distilled summary.
By understanding and condensing the input text, ChatGPT can automatically generate summaries that read naturally, greatly improving text-processing efficiency and user experience.
Impact of Relevance Measures on the Robustness and Accuracy of Collaborative Filtering
Impact of Relevance Measures on the Robustness and Accuracy of Collaborative Filtering

J. J. Sandvig, Bamshad Mobasher, and Robin Burke
Center for Web Intelligence, School of Computer Science, Telecommunications and Information Systems, DePaul University, Chicago, Illinois, USA
{jsandvig, mobasher, rburke}@

Abstract. The open nature of collaborative recommender systems presents a security problem. Attackers that cannot be readily distinguished from ordinary users may inject biased profiles, degrading the objectivity and accuracy of the system over time. The standard user-based collaborative filtering algorithm has been shown quite vulnerable to such attacks. In this paper, we examine relevance measures that complement neighbor similarity and their influence on algorithm robustness. In particular, we consider two techniques, significance weighting and trust weighting, that attempt to calculate the utility of a neighbor with respect to rating prediction. Such techniques have been used to improve prediction accuracy in collaborative filtering. We show that significance weighting, in particular, also results in improved robustness under profile injection attacks.

1 Introduction

An adaptive system dependent on anonymous, unauthenticated user profiles is subject to manipulation. The standard collaborative filtering algorithm builds a recommendation for a target user by combining the stored preferences of peers with similar interests. If a malicious user injects the profile database with a number of fictitious identities, they may be considered peers to a genuine user and bias the recommendation. We call such attacks profile injection attacks (also known as shilling [1]). Recent research has shown that surprisingly modest attacks are sufficient to manipulate the most common CF algorithms [2, 1, 3]. Such attacks degrade the objectivity and accuracy of a recommender system, causing frustration for its users.

In this paper we explore the robustness of certain variants of user-based recommendation. In particular, we examine variants that combine similarity metrics with other measures to determine neighbor utility. Such relevance weighting techniques apply a weight to each neighbor's similarity score, based on some value reflecting the expected relevance of that neighbor to the prediction task.
We focus on two types of relevance measures: significance weighting and trust-based weighting. Significance weighting [4] takes the size of profile overlap between neighbors into account. This prevents neighbors with only a few commonly rated items from dominating prediction. Trust-based weighting [5] estimates the utility of a neighbor as a rating predictor based on the historical accuracy of recommendations given by the neighbor.

Traditional user-based collaborative filtering algorithms focus exclusively on the degree of similarity between the target user and its neighbors in order to generate predicted ratings. However, the "reliability" of the neighbor profiles is generally not considered. For example, due to the sparsity of the data, the similarities may have been obtained based on very few co-rated items between the neighbor and the target user, resulting in sub-optimal predictions. Similarly, unreliable neighbors that have made poor predictions in the past may have a negative impact on prediction accuracy for the current item. Both of the approaches to relevance weighting mentioned above were, therefore, initially introduced in order to improve the prediction accuracy in user-based collaborative filtering.

In the trust-based model [5] an explicit trust value is computed for each user, reflecting the "reputation" of that user for making accurate recommendations. Trust is not limited to the macro profile level, and can be calculated as the reputation a user has for the recommendation of a particular item. The trust values, in turn, can be used as relevance weights when generating predictions. In [5], O'Donovan and Smyth further studied the impact of the trust weighting approach on the robustness of collaborative recommendation and showed that trust-based models are still vulnerable to attacks. On the other hand, the significance weighting approach, introduced initially in [4], does not focus on trust, but rather on the number of co-rated items between the target user and the neighbors as a measure of the degree of reliability of the neighbor profiles. This approach has been shown to have a significant impact on the accuracy of predictions, particularly in sparse data sets.

Although these and other similar approaches have been used to improve the prediction accuracy of recommender systems, the impact of neighbor significance weighting on algorithm robustness in the face of malicious attacks has been largely ignored. The primary contribution of this paper is to demonstrate that relevance weighting is an important factor in determining the robustness of a collaborative filtering algorithm. Choosing an optimal relevance measure can yield a large improvement in recommender stability. Our results show that significance weighting, in particular, is not only more accurate; it also improves algorithm robustness under profile injection attacks that have compact profile signatures.

2 Attacks in Collaborative Recommenders

We assume that an attacker intends to bias a recommender system for some economic advantage. This may be in the form of an increased number of recommendations for the attacker's product, or fewer recommendations for a competitor's product.

A collaborative recommender database consists of many user profiles, each with assigned ratings to a number of products that represent the user's preferences. User-based collaborative filtering algorithms attempt to discover a neighborhood of user profiles that are similar to a target user. A rating value is predicted for all missing items in the target user's profile, based on ratings given to the item within
the neighborhood. A ranked list is produced, and typically the top 20 or 50 predictions are returned as recommendations.

The standard k-nearest neighbor algorithm is widely used and reasonably accurate [4]. Similarity is computed using Pearson's correlation coefficient, and the k most similar users that have rated the target item are selected as the neighborhood. This implies a target user may have a different neighborhood for each target item. It is also common to filter neighbors with similarity below a specified threshold. This prevents predictions being based on very distant or negative correlations. After identifying a neighborhood, we use Resnick's algorithm to compute the prediction for a target item i and target user u:

p_{u,i} = \bar{r}_u + \frac{\sum_{v \in V} sim_{u,v} (r_{v,i} - \bar{r}_v)}{\sum_{v \in V} |sim_{u,v}|}     (1)

where V is the neighborhood of the target user and \bar{r}_u, \bar{r}_v are the average ratings of u and v. More similar neighbors have a larger impact on the final prediction. However, this type of similarity weighting alone may not be sufficient to guarantee accurate predictions. It is also necessary to ensure the reliability of the neighbor profiles. A common reason for the lack of reliability of predictions may be that similarities between the target user and the neighbors are based on a very small number of co-rated items.

3 Relevance Measures

In the following section we consider two approaches that have been used to address the "reliability" problem mentioned above. These approaches have been used primarily to increase prediction accuracy. Our focus, however, will be on their impact on system robustness in the face of attacks. We conjecture that an optimal relevance weight may provide an algorithmic approach to securing recommender systems against attacks.

The basic goal of a relevance measure is to estimate the utility of a neighbor as a rating predictor for the target user. The standard technique is to calculate similarity as the degree of "closeness" in Euclidean space. This is often accomplished via Pearson's correlation coefficient or the vector cosine coefficient. Additional extensions to similarity are well known, including significance weighting [4], variance weighting [4], case amplification [7], inverse user frequency [7], default voting [7], and profile trust [5]. In this paper, we focus on the effects of significance weighting and profile trust because they are widely accepted techniques with very different properties.

3.1 Significance Weighting

The significance weighting approach proposed by Herlocker et al. [4] adjusts similarity weights by devaluing relationships with a small number of commonly rated items. It uses a linear drop-off for neighbors with fewer than N co-rated items (i.e., the similarity is scaled down proportionally to the number of co-rated items); neighbors with more than N co-rated items are not devalued at all. The significance weight of a target user u for a neighbor v is computed as

w_{u,v} = sim_{u,v} \cdot \frac{n}{\lg m}

where n is the number of co-rated items and m is the total number of ratings in the target user's profile. Using such a local measure prevents unduly penalizing the closest neighbors when the target user has only a minimal number of ratings.

Significance weighting prefers neighbors having many commonly rated items with the target user. Neighbors with fewer commonly rated items may be pushed out of the neighborhood, even if there is a higher degree of similarity to the target user. It follows that users who have rated a large number of items will belong to more neighborhoods than those users who have rated few items. This is a potential security risk in the context of profile injection attacks. An attack profile with a very large number of filler items will necessarily be included in more neighborhoods, regardless of the rating value.
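As a concrete illustration of both the prediction step (Resnick's formula above) and significance weighting, the following sketch combines Pearson correlation with a Herlocker-style significance weight that linearly devalues neighbors with fewer than N co-rated items. The min(n, N)/N scaling is an assumed concrete form of the linear drop-off described in the text, and the function names and dictionary-of-dictionaries data layout are invented for this example, not taken from the authors' code.

```python
import math

def pearson(ru, rv):
    # Pearson correlation over co-rated items; also returns the overlap size.
    common = set(ru) & set(rv)
    if len(common) < 2:
        return 0.0, len(common)
    mu = sum(ru[i] for i in common) / len(common)
    mv = sum(rv[i] for i in common) / len(common)
    num = sum((ru[i] - mu) * (rv[i] - mv) for i in common)
    den = math.sqrt(sum((ru[i] - mu) ** 2 for i in common) *
                    sum((rv[i] - mv) ** 2 for i in common))
    return (num / den if den else 0.0), len(common)

def predict(ratings, u, item, k=30, N=50):
    # ratings: {user: {item: rating}}; predicts u's rating for item.
    ru = ratings[u]
    mean_u = sum(ru.values()) / len(ru)
    neighbors = []
    for v, rv in ratings.items():
        if v == u or item not in rv:
            continue
        sim, n = pearson(ru, rv)
        sim *= min(n, N) / N            # significance weighting: linear drop-off below N
        if sim > 0:
            neighbors.append((sim, rv))
    neighbors.sort(reverse=True, key=lambda t: t[0])
    top = neighbors[:k]
    if not top:
        return mean_u
    mean = lambda rv: sum(rv.values()) / len(rv)
    num = sum(sim * (rv[item] - mean(rv)) for sim, rv in top)
    den = sum(abs(sim) for sim, rv in top)
    return mean_u + num / den           # Resnick's formula
```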
As we will show, this risk is minimized precisely because a large filler-size threshold is required to make the attack successful. In most cases, genuine users rate only a small portion of all recommendable items; therefore, an attack profile with a very large filler size is easier to detect [8].

3.2 Trust Weighting

The vulnerabilities of collaborative recommender systems to attacks have led to a number of recent studies focusing on the notion of "trust" in recommendation. O'Donovan and Smyth [5, 9] propose trust models as a means to improve accuracy in collaborative filtering. The basic assumption is that users with a history of being good predictors will provide accurate predictions in the future. By explicitly calculating a trust value, the reputation of a user can be used as insight into the user's relevance to recommendation. Trust is not limited to the macro profile level, and can be calculated as the reputation a user has for the recommendation of a particular item.

The trust building process generates a trust value for every user in the training set by examining the predictive accuracy of the corresponding profile. By cross-validation, each user in turn is designated as the sole neighbor v for all remaining users. The system then computes the prediction set P_v as all possible predictions p_{u,i} that can be made for user u ∈ U and item i ∈ I using the neighborhood V = {v}. For each prediction p_{u,i}, recommend_{v,u,i} = 1 if p_{u,i} ∈ P_v, and correct_{v,u,i} = 1 if |p_{u,i} − r_{u,i}| < ε, where ε is a constant threshold and r_{u,i} is the rating of user u for item i. Item-level trust values are then computed as the proportion of correct recommendations:

trust_{v,i} = \frac{\sum_{u \in U} correct_{v,u,i}}{\sum_{u \in U} recommend_{v,u,i}}     (4)

The trust value is combined with similarity to form the relevance weight w_{u,v,i} = sim_{u,v} + trust_{v,i}, where sim_{u,v} is Pearson's correlation coefficient. A prediction for the target user is computed using (1), replacing sim_{u,v} with w_{u,v,i}.

Trust-based collaborative filtering algorithms can be very susceptible to profile injection attacks, because mutual opinions are reinforced during the trust building process [9]. Attack profiles that contain biased ratings for a target item result in mutual reinforcement of the item's preference. The larger the attack, the more reinforcement of the target item. Furthermore, if the target item is always given the maximum value, an attack profile could have higher trust scores than a genuine profile, because correct_{v,u,i} will always be 1 if v and u are both attacks on item i.

In a recent study, O'Donovan and Smyth [9] propose several solutions to the reinforcement problem that utilize pseudo-random subsets of the training data during the trust building phase. Sampling the population of profiles used in trust calculation effectively smoothes the noise inherent in the entire dataset.
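The sketch below outlines the item-level trust-building pass described above, reusing the predict() helper from the previous sketch. It treats each profile v in turn as the sole neighbor, counts how often its predictions fall within ε of the actual rating, and takes trust as the fraction of correct recommendations; ε = 1.8 follows the experimental setting, while the code structure (a brute-force leave-one-rating-out loop) is only an illustrative assumption, not the authors' implementation.

```python
from collections import defaultdict

def build_item_trust(ratings, eps=1.8):
    # ratings: {user: {item: rating}}; requires predict() from the previous sketch.
    trust = defaultdict(dict)
    for v, rv in ratings.items():
        correct, recommended = defaultdict(int), defaultdict(int)
        for u, ru in ratings.items():
            if u == v or len(ru) < 2:
                continue
            for i, actual in ru.items():
                if i not in rv:
                    continue                         # v has no opinion on item i
                held_out = {j: r for j, r in ru.items() if j != i}
                pred = predict({v: rv, u: held_out}, u, i, k=1)   # v as sole neighbor
                recommended[i] += 1                  # a prediction could be made
                if abs(pred - actual) < eps:
                    correct[i] += 1
        for i, n in recommended.items():
            trust[v][i] = correct[i] / n             # fraction of correct recommendations
    return trust
```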
The strategy raises an interesting research question with respect to robustness: how does a non-deterministic neighborhood formation task affect the impact of a profile injection attack? Although promising, we did not evaluate sampling the training set. For this set of experiments, we are interested only in the effect of relevance weighting.

4 Experimental Evaluation

Dataset. In our experiments, we have used the publicly available MovieLens 100K dataset. This dataset consists of 100,000 ratings on 1,682 movies by 943 users. All ratings are integer values between one and five, where one is the lowest (disliked) and five is the highest (liked). Our data includes all users who have rated at least 20 movies.

To conduct attack experiments, the full dataset is split into training and test sets. Generally, the test set contains a sample of 50 user profiles that mirror the overall distribution of users in terms of number of movies seen and ratings provided. The remaining user profiles are designated as the training set. All attack profiles are built from the training set, in isolation from the test set.

The set of attacked items consists of 50 movies whose ratings distribution matches the overall ratings distribution of all movies. Each movie is attacked as a separate test, and the results are aggregated. In each case, a number of attack profiles are generated and inserted into the training set, and any existing rating for the attacked movie in the test set is temporarily removed.

For every profile injection attack, we track attack size and filler size. Attack size is the number of injected attack profiles, and is measured as a percentage of the pre-attack training set. There are approximately 1,000 users in the database, so an attack size of 1% corresponds to about 10 attack profiles added to the system. Filler size is the number of filler ratings given to a specific attack profile, and is measured as a percentage of the total number of movies. There are approximately 1,700 movies in the database, so a filler size of 10% corresponds to about 170 filler ratings in each attack profile. The results reported below represent averages over all combinations of test users and attacked movies.
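As a concrete picture of this setup, the sketch below injects a generic push attack at a given attack size and filler size: each attack profile rates the target movie with the maximum rating and a random sample of filler movies with ratings drawn around the global mean. The exact attack model (how filler ratings are chosen) is not specified in this excerpt, so that choice is an assumption made purely for illustration.

```python
import random

def inject_push_attack(ratings, target_item, all_items, attack_size=0.01,
                       filler_size=0.05, r_max=5, r_mean=3.6):
    # ratings: {user: {item: rating}}, modified in place.
    n_profiles = int(attack_size * len(ratings))      # e.g. 1% of ~1000 users -> ~10 profiles
    n_filler = int(filler_size * len(all_items))      # e.g. 5% of ~1700 movies -> ~85 fillers
    candidates = [i for i in all_items if i != target_item]
    for k in range(n_profiles):
        filler = random.sample(candidates, n_filler)
        profile = {i: min(r_max, max(1, round(random.gauss(r_mean, 1.0)))) for i in filler}
        profile[target_item] = r_max                  # push the target item
        ratings[f"attack_{k}"] = profile
    return ratings
```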
Fig. 1. Comparison of MAE (figure not reproduced).

Evaluation Metrics. There has been considerable research in the area of recommender system evaluation focused on accuracy and performance [10]. We use the mean absolute error (MAE) accuracy metric, a statistical measure for comparing predicted values to actual user ratings [4]. However, our overall goal is to measure the effectiveness of an attack; the "win" for the attacker. In the experiments reported below, we follow the lead of [2] in measuring stability via prediction shift.

Prediction shift measures the change in an item's predicted rating after being attacked. Let U and I be the sets of test users and attacked items, respectively. For each user-item pair (u, i), the prediction shift, denoted by ∆_{u,i}, can be measured as ∆_{u,i} = p′_{u,i} − p_{u,i}, where p and p′ represent the prediction before and after the attack, respectively. A positive value means that the attack has succeeded in raising the predicted rating for the item. The average prediction shift for an item i over all users in the test set can be computed as ∆_i = \sum_{u \in U} ∆_{u,i} / |U|. The average prediction shift is then computed by averaging over the individual prediction shifts for all attacked items. Note that a strong prediction shift does not guarantee an item will be recommended; it is possible that other items' scores are also affected by an attack, or that the item's score is so low that even a prodigious shift does not promote it to "recommended" status.
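A small sketch of the two measures just defined, assuming predictions are stored in dictionaries keyed by (user, item) pairs; the helper names are illustrative only.

```python
def mae(pred, actual):
    # pred, actual: dicts keyed by (user, item) -> rating
    return sum(abs(pred[k] - actual[k]) for k in actual) / len(actual)

def avg_prediction_shift(before, after, test_users, attacked_items):
    # delta_{u,i} = p'_{u,i} - p_{u,i}, averaged over users, then over attacked items
    per_item = []
    for i in attacked_items:
        shifts = [after[(u, i)] - before[(u, i)] for u in test_users]
        per_item.append(sum(shifts) / len(shifts))
    return sum(per_item) / len(per_item)
```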
Accuracy Analysis. We first compare the accuracy of k-nearest neighbor using different relevance metrics. In our experiments we examined the standard Pearson's correlation, standard significance weighting, local significance weighting, and item-trust weighting. For significance weighting, we have followed the lead of [4] in using N = 50. For trust weighting, we have followed the lead of [5] in using ε = 1.8. In all cases, 10-fold cross-validation is performed on the entire dataset and no attack profiles are injected.

As shown in Figure 1, we achieved good results using a neighborhood size of k = 30 users for all relevance metrics; therefore, we applied k = 30 to all neighborhood formation tasks in the attack results discussed below. Overall, it is clear that some form of relevance weighting, in addition to similarity, can improve prediction accuracy. Standard and local significance weighting are particularly beneficial, although trust is also helpful when considering small neighborhoods.

There are several interesting observations about the MAE results. At k = 5, item-trust is more accurate than the other relevance measures. At k = 15 and greater, item-trust is the least accurate of the measures. It appears that the trust building process overfits the data, because trust is built on the assumption that the user for whom a trust value is computed is the only neighbor in any given neighborhood. The trust model does not take into account that a large neighborhood depends on reinforcement. For example, the closest neighbor to a target user may predict a negative rating for item i. But, when the closest three neighbors are taken into account, the second and third neighbors may predict a positive rating for item i. This effectively cancels out the prediction of the closest neighbor. In fact, a positive rating prediction may be more accurate for item i because the trend of the closest neighbors is a positive rating.

Robustness Analysis. To evaluate the robustness of relevance weighting, we compare the results of push attacks using the four relevance weighting schemes described in the previous section. Figure 2(A) depicts prediction shift results at different attack sizes, using a 5% filler. Clearly, significance weighting is much more robust than the standard Pearson's correlation. For all attack sizes, the prediction shift of significance weighting is about half that of standard correlation. Although not completely immune to attack, it is certainly a large improvement. Even at a 15% attack, significance weighting may be the difference between recommending an attacked item or not.

Local significance weighting also performs well against profile injection attacks, although not to the same degree of robustness as standard significance weighting. This can be explained by the fact that target users with fewer than 50 ratings do not scale their neighbors linearly. An attack profile in the neighborhood that is highly correlated to the target user is not devalued enough. As a result, a genuine user with less correlation to the target user, but more overlap in rated items, may be removed from the neighborhood.

Item-trust weighting appears slightly more robust than standard correlation. The mutual-reinforcement effect is not as pronounced for attack profiles at smaller filler sizes, because the attacks don't have enough similarity to the target user; the trust value is outweighed. In addition, the reinforcement from genuine users is enough to gain insight into the true relevance for making predictions. The combination of trust and similarity of genuine users to a target user is sufficient to remove some attack profiles from the neighborhood.

To evaluate the sensitivity of filler size, we have tested a full range of filler items. The 100% filler is included as a benchmark for the potential influence of an attack. However, it is not likely to be practical from an attacker's point of view. Collaborative filtering rating databases are often extremely sparse, so attack profiles that have rated every product are quite conspicuous. Of particular interest are smaller filler sizes. An attack that performs well with few filler items is less likely to be detected. Thus, an attacker will have a better chance of actually impacting a system's recommendation, even if the performance of the attack is not optimal.

Fig. 2. (A) Average attack prediction shift at 5% filler; (B) Average attack filler size comparison.

Figure 2(B) depicts prediction shift at different filler sizes with a 2% attack size. Surprisingly, as filler size is increased, prediction shift for standard correlation goes down. This is because an attack profile with many filler items has a greater probability of being dissimilar to the active user. On the contrary, prediction shift for significance weighting goes up. As stated previously, an attack profile with a very large number of filler items will have a better chance of being included in more neighborhoods, because it isn't devalued by significance weighting. The counter-intuitive observation is that standard correlation is actually more robust than any of the other relevance measures at very large filler sizes.
To account for this, recall that the size of profile overlap is not addressed with standard correlation. A genuine user that is very similar to the target user, but does not have many co-rated items, is not penalized. However, with significance weighting the same user would be devalued, potentially removing the user from the neighborhood in favor of an attack profile.

As shown, a 25% filler size is the point where prediction shift for standard correlation surpasses the other relevance measures. Overall, this does not affect the general improvement in robustness of relevance weighting. Using the modest MovieLens 100K dataset, a user would have to rate 420 movies to have a profile with 25% filler. It is simply not feasible for a genuine user to rate 25% of the items in a large commercial recommender with millions of different products. From a practical perspective, the threat of large filler attacks is minimal because they should be easily detectable [8].

5 Conclusion

The standard user-based collaborative filtering algorithm has been shown quite vulnerable to profile injection attacks. An attacker is able to bias recommendation by building a number of profiles associated with fictitious identities. In this paper, we have demonstrated the relative robustness and stability of supplementing the similarity weighting of neighbors with significance weighting and item-trust values. Significance weighting, in particular, results in increased recommendation accuracy and improved robustness under attack, versus the standard k-nearest neighbor approach. Future work will examine other relevance measures with respect to attack, including case amplification, inverse user frequency, and default voting.

References

1. Lam, S., Riedl, J.: Shilling recommender systems for fun and profit. In: Proceedings of the 13th International WWW Conference, New York (May 2004)
2. O'Mahony, M., Hurley, N., Kushmerick, N., Silvestre, G.: Collaborative recommendation: A robustness analysis. ACM Transactions on Internet Technology 4(4) (2004) 344–377
3. Mobasher, B., Burke, R., Bhaumik, R., Williams, C.: Towards trustworthy recommender systems: An analysis of attack models and algorithm robustness. ACM Transactions on Internet Technology 7(4) (2007)
4. Herlocker, J., Konstan, J., Borchers, A., Riedl, J.: An algorithmic framework for performing collaborative filtering. In: Proceedings of the 22nd ACM Conference on Research and Development in Information Retrieval (SIGIR'99), Berkeley, CA (August 1999)
5. O'Donovan, J., Smyth, B.: Trust in recommender systems. In: Proceedings of the 10th International Conference on Intelligent User Interfaces (IUI'05), ACM Press (2005) 167–174
6. Mobasher, B., Burke, R., Sandvig, J.J.: Model-based collaborative filtering as a defense against profile injection attacks. In: Proceedings of the 21st National Conference on Artificial Intelligence, AAAI (July 2006) 1388–1393
7. Breese, J., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: Uncertainty in Artificial Intelligence: Proceedings of the Fourteenth Conference, New Orleans, LA, Morgan Kaufmann (1998) 43–53
8. Williams, C., Bhaumik, R., Burke, R., Mobasher, B.: The impact of attack profile classification on the robustness of collaborative recommendation. In: Proceedings of the 2006 WebKDD Workshop, held at the ACM SIGKDD Conference on Data Mining and Knowledge Discovery (KDD'06), Philadelphia (August 2006)
9. O'Donovan, J., Smyth, B.: Is trust robust?: An analysis of trust-based recommendation. In: Proceedings of the 5th ACM Conference on Electronic Commerce, ACM Press (2006) 101–108
10. Herlocker, J., Konstan, J., Terveen, L.G., Riedl, J.: Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems 22(1) (2004) 5–53
A survey on sentiment detection of reviews
A survey on sentiment detection of reviews

Huifeng Tang, Songbo Tan, Xueqi Cheng
Information Security Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, PR China

Keywords: Sentiment detection; Opinion extraction; Sentiment classification

Abstract. The sentiment detection of texts has witnessed a booming interest in recent years, due to the increased availability of online reviews in digital form and the ensuing need to organize them. Up to now, four different problems have predominated in this research community, namely subjectivity classification, word sentiment classification, document sentiment classification, and opinion extraction. In fact, there are inherent relations between them. Subjectivity classification can prevent the sentiment classifier from considering irrelevant or even potentially misleading text. Document sentiment classification and opinion extraction have often involved word sentiment classification techniques. This survey discusses related issues and main approaches to these problems. © 2009 Published by Elsevier Ltd.

1. Introduction

Today, a very large number of reviews are available on the web, and weblogs are fast-growing in the blogosphere. Product reviews exist in a variety of forms on the web: sites dedicated to a specific type of product (such as digital cameras), sites for newspapers and magazines that may feature reviews (like Rolling Stone or Consumer Reports), sites that couple reviews with commerce (like Amazon), and sites that specialize in collecting professional or user reviews in a variety of areas. Less formal reviews are available on discussion boards and mailing list archives, as well as in Usenet via Google Groups. Users also comment on products in their personal web sites and blogs, which are then aggregated by other sites.

The information mentioned above is a rich and useful source for marketing intelligence, social psychologists, and others interested in extracting and mining opinions, views, moods, and attitudes: for example, whether a product review is positive or negative; what the moods among bloggers are at a given time; how the public reacts to a political affair; and so on. To achieve this goal, a core and essential job is to detect the subjective information contained in texts, including viewpoints, preferences, attitudes, sensibilities, etc. This is so-called sentiment detection.

A challenging aspect of this task that seems to distinguish it from traditional topic-based detection (classification) is that while topics are often identifiable by keywords alone, sentiment can be expressed in a much more subtle manner. For example, the sentence "What a bad picture quality that digital camera has! ... Oh, this new type of camera has a good picture, long battery life and beautiful appearance!" compares a negative experience of one product with a positive experience of another product. It is difficult to separate out the core assessment that should actually be correlated with the document. Thus, sentiment seems to require more understanding than the usual topic-based classification.

Sentiment detection dates back to the late 1990s (Argamon, Koppel, & Avneri, 1998; Kessler, Nunberg, & Schütze, 1997; Spertus, 1997), but only in the early 2000s did it become a major subfield of the information management discipline (Chaovalit & Zhou, 2005; Dimitrova, Finn, Kushmerick, & Smyth, 2002; Durbin, Neal
Richter, & Warner, 2003; Efron, 2004; Gamon, 2004; Glance, Hurst, & Tomokiyo, 2004; Grefenstette, Qu, Shanahan, & Evans, 2004; Hillard, Ostendorf, & Shriberg, 2003; Inkpen, Feiguina, & Hirst, 2004; Kobayashi, Inui, & Inui, 2001; Liu, Lieberman, & Selker, 2003; Raubern & Muller-Kogler, 2001; Riloff and Wiebe, 2003; Subasic & Huettner, 2001; Tong, 2001; Vegnaduzzo, 2004; Wiebe & Riloff, 2005; Wilson, Wiebe, & Hoffmann, 2005). Until the early 2000s, the two main popular approaches to sentiment detection, especially in real-world applications, were based on machine learning techniques and on semantic analysis techniques. After that, shallow natural language processing techniques were widely used in this area, especially in document sentiment detection. Current-day sentiment detection is thus a discipline at the crossroads of NLP and IR, and as such it shares a number of characteristics with other tasks such as information extraction and text mining.

Although several international conferences have devoted special issues to this topic, such as ACL, AAAI, WWW, EMNLP, CIKM, etc., there are no systematic treatments of the subject: there are neither textbooks nor journals entirely devoted to sentiment detection yet.

This paper first introduces the definitions of several problems that pertain to sentiment detection. Then we present some applications of sentiment detection. Section 4 discusses the subjectivity classification problem. Section 5 introduces the semantic orientation method. The sixth section examines the effectiveness of applying machine learning techniques to document sentiment classification.
The seventh section discusses the opinion extraction problem. The eighth part talks about the evaluation of sentiment detection. The last section concludes with challenges and a discussion of future work.

2. Sentiment detection

2.1. Subjectivity classification

Subjectivity in natural language refers to aspects of language used to express opinions and evaluations (Wiebe, 1994). Subjectivity classification is stated as follows: Let S = {s_1, ..., s_n} be a set of sentences in document D. The problem of subjectivity classification is to distinguish sentences used to present opinions and other forms of subjectivity (the subjective sentence set S_s) from sentences used to objectively present factual information (the objective sentence set S_o), where S_s ∪ S_o = S. This task is especially relevant for news reporting and Internet forums, in which opinions of various agents are expressed.

2.2. Sentiment classification

Sentiment classification includes two kinds of classification forms, i.e., binary sentiment classification and multi-class sentiment classification. Given a document set D = {d_1, ..., d_n} and a pre-defined category set C = {positive, negative}, binary sentiment classification is to classify each d_i in D with a label from C. If we set C* = {strong positive, positive, neutral, negative, strong negative} and classify each d_i in D with a label from C*, the problem changes to multi-class sentiment classification.

Most prior work on learning to identify sentiment has focused on the binary distinction of positive vs. negative. But it is often helpful to have more information than this binary distinction provides, especially if one is ranking items by recommendation or comparing several reviewers' opinions. Koppel and Schler (2005a, 2005b) show that it is crucial to use neutral examples in learning polarity for a variety of reasons. Learning from negative and positive examples alone will not permit accurate classification of neutral examples. Moreover, the use of neutral training examples in learning facilitates better distinction between positive and negative examples.

3. Applications of sentiment detection

In this section, we will expound some rising applications of sentiment detection.

3.1. Products comparison

It is a common practice for online merchants to ask their customers to review the products that they have purchased. With more and more people using the Web to express opinions, the number of reviews that a product receives grows rapidly. Most of the research about these reviews has focused on automatically classifying the products into "recommended" or "not recommended" (Pang, Lee, & Vaithyanathan, 2002; Ranjan Das & Chen, 2001; Terveen, Hill, Amento, McDonald, & Creter, 1997). But every product has several features, of which perhaps only some are of interest to people. Moreover, a product that has shortcomings in one aspect probably has merits in another (Morinaga, Yamanishi, Tateishi, & Fukushima, 2002; Taboada, Gillies, & McFetridge, 2006).

The goal is to analyze the online reviews and provide a visual way to compare consumers' opinions of different products, i.e., with a single glance the user can clearly see the advantages and weaknesses of each product in the minds of consumers. A potential customer can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her decide which product to buy. For a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information.

Liu, Hu, and Cheng (2005) proposed a novel framework for analyzing and comparing consumer
opinions of competing products. A prototype system called Opinion Observer was implemented. To enable the visualization, two tasks were performed: (1) identifying product features that customers have expressed their opinions on, based on language pattern mining techniques (such features form the basis for the comparison); and (2) for each feature, identifying whether the opinion from each reviewer is positive or negative, if any. Different users can visualize and compare opinions of different products using a user interface. The user simply chooses the products that he/she wishes to compare, and the system then retrieves the analyzed results of these products and displays them in the interface.

3.2. Opinion summarization

The number of online reviews that a product receives grows rapidly, especially for popular products. Furthermore, many reviews are long and have only a few sentences containing opinions on the product. This makes it hard for a potential customer to read them to make an informed decision on whether to purchase the product. The large number of reviews also makes it hard for product manufacturers to keep track of customer opinions of their products, because many merchant sites may sell their products and the manufacturer may produce many kinds of products.

Opinion summarization (Ku, Lee, Wu, & Chen, 2005; Philip et al., 2004) summarizes the opinions of articles by reporting sentiment polarities, degrees, and the correlated events. With opinion summarization, a customer can easily see how the existing customers feel about a product, and the product manufacturer can learn why people of different standpoints like it or what they complain about.

Hu and Liu (2004a, 2004b) conducted such a study: given a set of customer reviews of a particular product, the task involves three subtasks: (1) identifying features of the product that customers have expressed their opinions on (called product features); (2) for each feature, identifying review sentences that give positive or negative opinions; and (3) producing a summary using the discovered information.

Ku, Liang, and Chen (2006) investigated both news and web blog articles. In their research, TREC, NTCIR, and articles collected from web blogs serve as the information sources for opinion extraction. Documents related to the issue of animal cloning are selected as the experimental materials. Algorithms for opinion extraction at the word, sentence, and document level are proposed.
The issue of relevant sentence selection is discussed, and then topical and opinionated information is summarized. Opinion summarizations are visualized by representative sentences. Finally, an opinionated curve showing supportive and non-supportive degree along the timeline is illustrated by an opinion tracking system.

3.3. Opinion reason mining

In the opinion analysis area, finding the polarity of opinions, or aggregating and quantifying degree assessments of opinions scattered throughout web pages, is not enough. We can do a more critical part of in-depth opinion assessment, such as finding reasons in opinion-bearing texts. For example, in film reviews, information such as "found 200 positive reviews and 150 negative reviews" may not fully satisfy the information needs of different people. More useful information would be "This film is great for its novel originality" or "Poor acting, which makes the film awful".

Opinion reason mining tries to identify one of the critical elements of online reviews to answer the question, "What are the reasons that the author of this review likes or dislikes the product?" To answer this question, we should extract not only sentences that contain opinion-bearing expressions, but also sentences with reasons why the author of a review writes the review (Cardie, Wiebe, Wilson, & Litman, 2003; Clarke & Terra, 2003; Li & Yamanishi, 2001; Stoyanov, Cardie, Litman, & Wiebe, 2004).

Kim and Hovy (2005) proposed a method for detecting opinion-bearing expressions. In their subsequent work (Kim & Hovy, 2006), they collected a large set of ⟨review text, pros, cons⟩ triplets from a review site where the pros and cons phrases are explicitly stated in their respective categories by each review's author along with the review text. Their automatic labeling system first collects phrases in the pro and con fields and then searches the main review text in order to collect sentences corresponding to those phrases. The system then annotates each such sentence with the appropriate "pro" or "con" label. All remaining sentences with neither label are marked as "neither". After labeling all the data, they use it to train their pro and con sentence recognition system.

3.4. Other applications

Thomas, Pang, and Lee (2006) try to determine from the transcripts of US Congressional floor debates whether the speeches represent support of or opposition to proposed legislation. Mullen and Malouf (2006) describe a statistical sentiment analysis method on political discussion group postings to judge whether there is an opposing political viewpoint to the original post. Moreover, there are further potential applications of sentiment detection, such as online message sentiment filtering, e-mail sentiment classification, web-blog author attitude analysis, sentiment web search engines, etc.

4. Subjectivity classification

Subjectivity classification is the task of investigating whether a paragraph presents the opinion of its author or reports facts. In fact, most of the research shows that there is a very tight relation between subjectivity classification and document sentiment classification (Pang & Lee, 2004; Wiebe, 2000; Wiebe, Bruce, & O'Hara, 1999; Wiebe, Wilson, Bruce, Bell, & Martin, 2002; Yu & Hatzivassiloglou, 2003). Subjectivity classification can prevent the polarity classifier from considering irrelevant or even potentially misleading text.
Pang and Lee (2004) find that subjectivity detection can compress reviews into much shorter extracts that still retain polarity information at a level comparable to that of the full review. Much of the research in automated opinion detection has been performed and proposed for discriminating between subjective and objective text at the document and sentence levels (Bruce & Wiebe, 1999; Finn, Kushmerick, & Smyth, 2002; Hatzivassiloglou & Wiebe, 2000; Wiebe, 2000; Wiebe et al., 1999; Wiebe et al., 2002; Yu & Hatzivassiloglou, 2003). In this section, we discuss some approaches used to automatically classify a document as objective or subjective.

4.1. Similarity approach

The similarity approach to classifying sentences as opinions or facts explores the hypothesis that, within a given topic, opinion sentences will be more similar to other opinion sentences than to factual sentences (Yu & Hatzivassiloglou, 2003). The similarity approach measures sentence similarity based on shared words, phrases, and WordNet synsets (Dagan, Shaul, & Markovitch, 1993; Dagan, Pereira, & Lee, 1994; Leacock & Chodorow, 1998; Miller & Charles, 1991; Resnik, 1995; Zhang, Xu, & Callan, 2002).

To measure the overall similarity of a sentence to the opinion or fact documents, we need to go through three steps. First, use an IR method to acquire the documents that are on the same topic as the sentence in question. Second, calculate its similarity scores with each sentence in those documents and take the average value. Third, assign the sentence to the category (opinion or fact) for which the average value is highest. Alternatively, for the frequency variant, we can count how many of the similarity scores for each category exceed a predetermined threshold.

4.2. Naive Bayes classifier

The Naive Bayes classifier is a commonly used supervised machine learning algorithm. This approach presupposes that all sentences in opinion or factual articles are opinion or fact sentences, respectively. Naive Bayes uses the sentences in opinion and fact documents as the examples of the two categories. The features include words, bigrams, and trigrams, as well as the parts of speech in each sentence. In addition, the presence of semantically oriented (positive and negative) words in a sentence is an indicator that the sentence is subjective. Therefore, the features can include the counts of positive and negative words in the sentence, as well as counts of the polarities of sequences of semantically oriented words (e.g., "++" for two consecutive positively oriented words). They can also include the counts of parts of speech combined with polarity information (e.g., "JJ+" for positive adjectives), as well as features encoding the polarity (if any) of the head verb, the main subject, and their immediate modifiers.

Generally speaking, Naive Bayes assigns a document d_j (represented by a vector \vec{d}_j) to the class c_i that maximizes P(c_i | \vec{d}_j) by applying Bayes' rule as follows:

P(c_i | \vec{d}_j) = \frac{P(c_i)\, P(\vec{d}_j | c_i)}{P(\vec{d}_j)}     (1)

where P(\vec{d}_j) is the probability that a randomly picked document d has vector \vec{d}_j as its representation, and P(c) is the probability that a randomly picked document belongs to class c. To estimate the term P(\vec{d}_j | c), Naive Bayes decomposes it by assuming all the features in \vec{d}_j (represented by f_i, i = 1 to m) are conditionally independent, i.e.,

P(c_i | \vec{d}_j) = \frac{P(c_i) \prod_{i=1}^{m} P(f_i | c_i)}{P(\vec{d}_j)}     (2)
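As a minimal illustration of Eqs. (1)–(2), the sketch below trains a multinomial Naive Bayes classifier over word and bigram counts on a toy set of sentences whose opinion/fact labels are inherited from their documents. scikit-learn is used for brevity, and the toy data are invented for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

sentences = ["this camera is wonderful", "the battery lasts ten hours",
             "what an awful picture", "the sensor has twelve megapixels"]
labels = ["opinion", "fact", "opinion", "fact"]    # inherited from document labels

vec = CountVectorizer(ngram_range=(1, 2))          # words and bigrams as features
X = vec.fit_transform(sentences)
clf = MultinomialNB().fit(X, labels)               # estimates P(c) and P(f_i | c)
print(clf.predict(vec.transform(["a truly bad battery"])))
```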
4.3. Multiple Naive Bayes classifier

The hypothesis that all sentences in opinion or factual articles are opinion or fact sentences is an approximation. To address this, the multiple Naive Bayes classifier approach applies an algorithm using multiple classifiers, each relying on a different subset of features. The goal is to reduce the training set to the sentences that are most likely to be correctly labeled, thus boosting classification accuracy.

Given separate sets of features F_1, F_2, ..., F_m, it trains separate Naive Bayes classifiers C_1, C_2, ..., C_m corresponding to each feature set. Assuming as ground truth the information provided by the document labels, and that all sentences inherit the status of their document as opinions or facts, it first trains C_1 on the entire training set, then uses the resulting classifier to predict labels for the training set. The sentences that receive a label different from the assumed truth are then removed, and C_2 is trained on the remaining sentences. This process is repeated iteratively until no more sentences can be removed. Yu and Hatzivassiloglou (2003) report results using five feature sets, starting from words alone and adding in bigrams, trigrams, part-of-speech, and polarity.

4.4. Cut-based classifier

The cut-based classifier approach puts forward the hypothesis that text spans (items) occurring near each other (within discourse boundaries) may share the same subjectivity status (Pang & Lee, 2004). Based on this hypothesis, Pang supplied his algorithm with pair-wise interaction information, e.g., to specify that two particular sentences should ideally receive the same subjectivity label. The algorithm uses an efficient and intuitive graph-based formulation relying on finding minimum cuts.

Suppose there are n items x_1, x_2, ..., x_n to divide into two classes C_1 and C_2, with access to two types of information:

ind_j(x_i): Individual scores. The non-negative estimate of each x_i's preference for being in C_j based on the features of x_i alone.

assoc(x_i, x_k): Association scores. The non-negative estimate of how important it is that x_i and x_k be in the same class.

The problem then becomes maximizing each item's net score for the class it is assigned to (its individual score for that class, minus its individual score for the other class), while penalizing the assignment of associated items to different classes. Thus, after some algebra, it arrives at the following optimization problem: assign the x_i to C_1 and C_2 so as to minimize the partition cost

\sum_{x \in C_1} ind_2(x) + \sum_{x \in C_2} ind_1(x) + \sum_{x_i \in C_1,\, x_k \in C_2} assoc(x_i, x_k)     (3)

This situation can be represented in the following manner. Build an undirected graph G with vertices {v_1, ..., v_n, s, t}; the last two are, respectively, the source and sink. Add n edges (s, v_i), each with weight ind_1(x_i), and n edges (v_i, t), each with weight ind_2(x_i). Finally, add \binom{n}{2} edges (v_i, v_k), each with weight assoc(x_i, x_k). A cut (S, T) of G is a partition of its nodes into sets S = {s} ∪ S′ and T = {t} ∪ T′, where s ∉ S′ and t ∉ T′. Its cost cost(S, T) is the sum of the weights of all edges crossing from S to T. A minimum cut of G is one of minimum cost. Finding a solution to the optimization problem thereby reduces to looking for a minimum cut of G.
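The graph construction just described maps directly onto a standard max-flow / min-cut computation. The sketch below builds the graph with networkx and reads off the two classes from the minimum s-t cut; the ind1/ind2/assoc inputs and the toy example at the end are assumptions for illustration, and a directed graph with symmetric capacities is used so that the standard max-flow routine applies.

```python
import networkx as nx

def min_cut_labels(ind1, ind2, assoc):
    # ind1[i], ind2[i]: non-negative preference of item i for class C1 / C2
    # assoc[(i, k)]: non-negative weight for placing i and k in the same class
    G = nx.DiGraph()
    for i in ind1:
        G.add_edge("s", i, capacity=ind1[i])   # source side corresponds to C1
        G.add_edge(i, "t", capacity=ind2[i])   # sink side corresponds to C2
    for (i, k), w in assoc.items():
        G.add_edge(i, k, capacity=w)           # association edges in both directions
        G.add_edge(k, i, capacity=w)
    cut_value, (S, T) = nx.minimum_cut(G, "s", "t")
    return S - {"s"}, T - {"t"}                # items assigned to C1 and to C2

# toy usage: x1 leans subjective, x2 leans objective, weak association between them
c1, c2 = min_cut_labels({"x1": 2.0, "x2": 0.5}, {"x1": 0.5, "x2": 2.0},
                        {("x1", "x2"): 0.3})
```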
5. Word sentiment classification

The task of document sentiment classification has usually involved the manual or semi-manual construction of semantic orientation word lexicons (Hatzivassiloglou & McKeown, 1997; Hatzivassiloglou & Wiebe, 2000; Lin, 1998; Pereira, Tishby, & Lee, 1993; Riloff, Wiebe, & Wilson, 2003; Turney & Littman, 2002; Wiebe, 2000), which are built by word sentiment classification techniques. For instance, Das and Chen (2001) used a classifier on investor bulletin boards to see if apparently positive postings were correlated with stock price, in which several scoring methods were employed in conjunction with a manually crafted lexicon. Classifying the semantic orientation of individual words or phrases, such as whether it is positive or negative or has different intensities, generally uses a pre-selected set of seed words and sometimes linguistic heuristics (for example, Lin (1998) and Pereira et al. (1993) used linguistic co-locations to group words with similar uses or meanings).

Some studies showed that restricting features to adjectives for word sentiment classification would improve performance (Andreevskaia & Bergler, 2006; Turney & Littman, 2002; Wiebe, 2000). However, more research showed that most adjectives and adverbs, and a small group of nouns and verbs, possess semantic orientation (Andreevskaia & Bergler, 2006; Esuli & Sebastiani, 2005; Gamon & Aue, 2005; Takamura, Inui, & Okumura, 2005; Turney & Littman, 2003).

Automatic methods of sentiment annotation at the word level can be grouped into two major categories: (1) corpus-based approaches and (2) dictionary-based approaches. The first group includes methods that rely on syntactic or co-occurrence patterns of words in large texts to determine their sentiment (e.g., Hatzivassiloglou & McKeown, 1997; Turney & Littman, 2002; Yu & Hatzivassiloglou, 2003; and others). The second group uses WordNet information, especially synsets and hierarchies, to acquire sentiment-marked words (Hu & Liu, 2004a; Kim & Hovy, 2004) or to measure the similarity between candidate words and sentiment-bearing words such as good and bad (Kamps, Marx, Mokken, & de Rijke, 2004).

5.1. Analysis by conjunctions between adjectives

This method attempts to predict the orientation of subjective adjectives by analyzing pairs of adjectives (conjoined by and, or, but, either-or, or neither-nor) which are extracted from a large unlabelled document set. The underlying intuition is that the act of conjoining adjectives is subject to linguistic constraints on the orientation of the adjectives involved (e.g., and usually conjoins two adjectives of the same orientation, while but conjoins two adjectives of opposite orientation). This is shown in the following three sentences (where the first two are perceived as correct and the third is perceived as incorrect), taken from Hatzivassiloglou and McKeown (1997):

"The tax proposal was simple and well received by the public."
"The tax proposal was simplistic but well received by the public."
"The tax proposal was simplistic and well received by the public."

To infer the orientation of adjectives from the analysis of conjunctions, a supervised learning algorithm can be performed in the following steps:

1. All conjunctions of adjectives are extracted from a set of documents.
2. Train a log-linear regression classifier and then classify pairs of adjectives either as having the same or as having different orientation. The hypothesized same-orientation or different-orientation links between all pairs form a graph.
3. A clustering algorithm partitions the
graph produced in step 2 into two clusters. Using the intuition that positive adjectives tend to be used more frequently than negative ones, the cluster containing the terms of higher average frequency in the document set is deemed to contain the positive terms.

The log-linear model offers an estimate of how good each prediction is, since it produces a value y between 0 and 1, in which 1 corresponds to same-orientation, and one minus the produced value y corresponds to dissimilarity. Same- and different-orientation links between adjectives form a graph. To partition the graph nodes into subsets of the same orientation, the clustering algorithm calculates an objective function Φ scoring each possible partition P of the adjectives into two subgroups C_1 and C_2 as

\Phi(P) = \sum_{i=1}^{2} \frac{1}{|C_i|} \sum_{x, y \in C_i,\, x \neq y} d(x, y)     (4)

where |C_i| is the cardinality of cluster i, and d(x, y) is the dissimilarity between adjectives x and y.

In general, because the model was unsupervised, it required an immense word corpus to function.

5.2. Analysis by lexical relations

This method presents a strategy for inferring semantic orientation from the semantic association between words and phrases. It follows the hypothesis that two words tend to have the same semantic orientation if they have strong semantic association. Therefore, it focuses on the use of lexical relations defined in WordNet to calculate the distance between adjectives.

Generally speaking, we can define a graph on the adjectives contained in the intersection between a term set (for example, the TL term set (Turney & Littman, 2003)) and WordNet, adding a link between two adjectives whenever WordNet indicates the presence of a synonymy relation between them, and defining a distance measure using elementary notions from graph theory. In more detail, this approach can be realized in the following steps:

1. Construct relations at the level of words. The simplest approach here is just to collect all words in WordNet, and relate words that can be synonymous (i.e., that occur in the same synset).
2. Define a distance measure d(t_1, t_2) between terms t_1 and t_2 on this graph, which amounts to the length of the shortest path that connects t_1 and t_2 (with d(t_1, t_2) = +∞ if t_1 and t_2 are not connected).
3. Calculate the orientation of a term by its relative distance (Kamps et al., 2004) from the two seed terms good and bad, i.e.,

SO(t) = \frac{d(t, bad) - d(t, good)}{d(good, bad)}     (5)

4. Get the result by the following rule: the adjective t is deemed to belong to positive if SO(t) > 0, and the absolute value of SO(t) determines, as usual, the strength of this orientation (the constant denominator d(good, bad) is a normalization factor that constrains all values of SO to the [−1, 1] range).

5.3. Analysis by glosses

The characteristic of this method lies in the fact that it exploits the glosses (i.e., textual definitions) that a term has in an online "glossary" or dictionary. Its basic assumption is that if a word is semantically oriented in one direction, then the words in its gloss tend to be oriented in the same direction (Esuli & Sebastiani, 2005; Esuli & Sebastiani, 2006a, 2006b). For instance, the glosses of good and excellent will both contain appreciative expressions, while the glosses of bad and awful will both contain derogative expressions.

Generally, this method can determine the orientation of a term based on the classification of its glosses. The process is composed of the following steps:

1. A seed set (S_p, S_n), representative of the two categories positive and negative, is provided as input.
2. Search for new terms
to enrich S_p and S_n. Use lexical relations (e.g., synonymy) with the terms contained in S_p and S_n, from a thesaurus or online dictionary, to find these new terms, and then append them to S_p or S_n.
3. For each term t_i in S′_p ∪ S′_n or in the test set (i.e., the set of terms to be classified), a textual representation of t_i is generated by collating all the glosses of t_i as found in a machine-readable dictionary. Each such representation is converted into a vector by standard text indexing techniques.
4. A binary text classifier is trained on the terms in S′_p ∪ S′_n and then applied to the terms in the test set.

5.4. Analysis by both lexical relations and glosses

This method determines the sentiment of words and phrases relying both on lexical relations (synonymy, antonymy and hyponymy) and on the glosses provided in WordNet.

Andreevskaia and Bergler (2006) proposed an algorithm named "STEP" (Semantic Tag Extraction Program). This algorithm starts with a small set of seed words of known sentiment value (positive or negative) and implements the following steps:

1. Extend the small set of seed words by adding synonyms, antonyms and hyponyms of the seed words supplied in WordNet. This step brings on average a 5-fold increase in the size of the original list, with the accuracy of the resulting list comparable to manual annotations.
2. Go through all WordNet glosses, identify the entries that contain in their definitions the sentiment-bearing words from the extended seed list, and add these head words to the corresponding category: positive, negative or neutral.
3. Disambiguate the glosses with a part-of-speech tagger, and eliminate errors of some words acquired in step 1 and from the seed list. At this step, it also filters out all those words that have been assigned contradicting labels.

In this algorithm, for each word we need to compute a Net Overlap Score by subtracting the total number of runs assigning this word a negative sentiment from the total of the runs that consider it positive. In order to make the Net Overlap Score measure usable in sentiment tagging of texts and phrases, the absolute values of this score should be normalized and mapped onto a standard [0, 1] interval. STEP accomplishes this normalization by using the value of the Net Overlap Score as a parameter in the standard fuzzy membership S-function (Zadeh, 1987). This function maps the absolute values of the Net Overlap Score onto the interval from 0 to 1, where 0 corresponds to the absence of membership in the category of sentiment (in this case, these will be the neutral words) and 1 reflects the highest degree of membership in this category. The function can be defined as follows:

S(u; a, b, c) = \begin{cases} 0 & \text{if } u \le a \\ 2\left(\frac{u-a}{c-a}\right)^2 & \text{if } a \le u \le b \\ 1 - 2\left(\frac{u-c}{c-a}\right)^2 & \text{if } b \le u \le c \\ 1 & \text{if } u \ge c \end{cases}     (6)

where u is the Net Overlap Score for the word and a, b, c are the three adjustable parameters: a is set to 1, c is set to 15, and b, which represents a crossover point, is defined as b = (a + c)/2 = 8. Defined this way, the S-function assigns the highest degree of membership (= 1) to words that have a Net Overlap Score u ≥ 15.

The Net Overlap Score can be used as a measure of a word's degree of membership in the fuzzy category of sentiment: the core adjectives, which had the highest Net Overlap Score, were identified most accurately both by STEP and by human annotators, while the words on the periphery of the category had the lowest scores and were associated with low rates of inter-annotator agreement.

5.5. Analysis by pointwise mutual information

The general strategy of this method is to infer semantic orientation from semantic association. The underlying assumption is that a
5.5. Analysis by pointwise mutual information

The general strategy of this method is to infer semantic orientation from semantic association. The underlying assumption is that a phrase has a positive semantic orientation when it has good associations (e.g., "romantic ambience") and a negative semantic orientation when it has bad associations (e.g., "horrific events") (Turney, 2002).
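A minimal sketch of the association idea, assuming co-occurrence counts of a candidate phrase with positive and negative reference words (such as "excellent" and "poor") are already available; Turney (2002) estimated such statistics from search-engine hit counts, which is not reproduced here, and the counts below are invented.

import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information from raw co-occurrence counts."""
    return math.log2((count_xy / total) / ((count_x / total) * (count_y / total)))

def so_pmi(counts, phrase, pos_ref="excellent", neg_ref="poor"):
    """Semantic orientation as PMI with a positive reference minus PMI with a negative one."""
    total = counts["_total"]
    return (pmi(counts[(phrase, pos_ref)], counts[phrase], counts[pos_ref], total)
            - pmi(counts[(phrase, neg_ref)], counts[phrase], counts[neg_ref], total))

# Hypothetical corpus statistics
counts = {"_total": 1_000_000, "romantic ambience": 500, "excellent": 20_000, "poor": 15_000,
          ("romantic ambience", "excellent"): 120, ("romantic ambience", "poor"): 10}
print(so_pmi(counts, "romantic ambience"))  # positive value -> positive orientation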
Synchronized lesson-preparation courseware for the new textbook, Spring 2024, senior high school biology, Chapter 3 "The Nature of the Gene", Section 3.3 "DNA Replication" (new People's Education Press edition, Compulsory 2)
(2) Note whether the unit for bases in the question is "pairs" or "individual bases". (3) Remember that during DNA replication, no matter how many rounds of replication occur, only two DNA molecules contain a parental deoxynucleotide strand. (4) Read the question carefully for key words such as "number of DNA molecules" versus "number of strands", and "contains" versus "contains only", so as not to fall into a trap.
II. Replication of the DNA molecule
Example 1. A DNA molecule contains 1,000 base pairs (labelled with 32P), of which 400 are thymine. If this DNA molecule is placed in a culture medium containing only 31P-labelled deoxynucleotides and allowed to replicate twice, the average relative molecular mass of the daughter DNA molecules is lower than that of the original molecule by 1,500.
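A brief worked calculation behind the stated answer, treating each 32P -> 31P substitution as a decrease of 1 in relative molecular mass:

\text{parental DNA: } 2 \times 1000 = 2000 \text{ nucleotides, all labelled with } ^{32}\mathrm{P}
\text{after two replications: 4 daughter molecules containing } 2000\ ^{32}\mathrm{P} + 6000\ ^{31}\mathrm{P} \text{ atoms in total}
\text{average per daughter molecule: } 500\ ^{32}\mathrm{P} + 1500\ ^{31}\mathrm{P}
\Delta M_{\text{average}} = 1500 \times (32 - 31) = 1500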
I. Inferring how DNA replicates: the hypothesis-deduction method
1. Pose the question  2. Propose hypotheses ((1) deductive reasoning; (3) dispersive replication)  3. Test the hypothesis
[Diagram of the Meselson-Stahl experiment: parental E. coli grown on 15N (P: 15N/15N DNA) are transferred to a culture medium containing 14NH4Cl; DNA is extracted and centrifuged after one cell division (F1: 15N/14N DNA) and again after a second division (F2); the positions of the high-density and low-density bands are compared at each stage.]
II. Replication of the DNA molecule
Example 3. Suppose that, after mutagenesis of a parental DNA molecule, a normal base at a certain site has been replaced by 5-bromouracil (BU). The mutated DNA molecule then undergoes 2 successive rounds of replication, producing the 4 daughter DNA molecules shown in the figure. The base replaced by BU could be ( C )
A. adenine    B. thymine or adenine    C. cytosine    D. guanine or cytosine
II. Replication of the DNA molecule
Example 4. 5-BrU (5-bromouracil) can pair with A and can also pair with C. A normal cell capable of division is inoculated into a suitable culture medium containing the five nucleotides A, G, C, T and 5-BrU. At least how many rounds of replication are needed before the base pair at a certain site of a DNA molecule in the cell is changed from T-A to G-C? ( B )
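The answer options are not reproduced above; assuming the marked answer corresponds to three rounds, the reasoning can be written as the succession of pairings at the affected site, using only the pairing rules stated in the question:

T\text{--}A \;\xrightarrow{\text{1st replication}}\; A\text{--BrU} \;\xrightarrow{\text{2nd replication}}\; \mathrm{BrU}\text{--}C \;\xrightarrow{\text{3rd replication}}\; G\text{--}C

so a G-C pair at that site can first appear only after at least three rounds of replication.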
The 21st century is the century of the life sciences; the breakthrough achievements of molecular biology in the late 20th century...
Chapter 1 Introduction. I. Short-answer questions. 1. The 21st century is the century of the life sciences.
The breakthrough achievements of molecular biology in the late 20th century have revolutionized the position of the life sciences within the natural sciences.
Describe the three basic principles of the field of molecular biology research, its three supporting disciplines, and its three main areas of research. Answer: (1) The three basic principles of the field: the monomers that make up biological macromolecules are the same across organisms; the central dogma governing the expression of genetic information is the same; and it is the arrangement of the monomers of biological macromolecules (nucleotides, amino acids) that gives rise to biological specificity.
(2) The three supporting disciplines: cytology, genetics and biochemistry.
(3) The three main areas of research: chiefly the relationship between the structure and function of biological macromolecules, including the interactions between DNA and proteins, between hormones and receptors, and between enzymes and their substrates.
2. What is the definition of molecular biology? Answer: Some define it very broadly: the study of biological phenomena in molecular terms.
However, this definition makes molecular biology difficult to distinguish from biochemistry.
Another definition is stricter and therefore more useful: the study of gene structure and function at the molecular level.
Explaining the structure and activity of genes from a molecular perspective is the main subject of this book.
3. What are the new hot topics and fields of biology in the twenty-first century? Answer: Structural biology is an important frontier discipline within current molecular biology. It studies and elucidates, at the molecular level and from a structural angle (especially that of three-dimensional structure), the key problems of the frontier areas of biology. It is a multidisciplinary frontier field spanning biology, physics, chemistry and computational mathematics, taking structure determination (particularly of three-dimensional structures) as its method, the relationship between structure and function as its content, and the elucidation of the mechanisms of biological function as its goal.
The core content of this discipline is the three-dimensional structure, motion and interactions of proteins and their complexes and assemblies, and of the various cellular components built from them, together with their relationship to normal biological function and to abnormal pathological phenomena.
Molecular developmental biology is likewise an important frontier discipline of current molecular biology.
The Human Genome Project has been called "the key that opens the door to 21st-century life science".
The full roll-out of the Human Genome Project and of the post-genome projects will usher in a brilliant era in which the nature of life activities is elucidated at the molecular level.
Bioinformatics, now developing rapidly, has been called "the driving force behind the rapid development of 21st-century life science".
It should be emphasized, in particular, that the biopharmaceutical industry built on biological information will gradually become one of the most important emerging industries of the 21st century. Judging from the current state of research on monogenic and polygenic diseases, the diagnosis and treatment of both kinds of disease will make major advances, to differing degrees, in the 21st century. The view that "the evolution of genetic information will become the central topic of molecular biology" holds that, with the sequencing of the human genome and of the genomes of many model organisms, comparative studies will allow us to read the history of biological evolution in the genome, carrying our understanding of evolution from surface appearances to its essence. Finally, the time is now ripe for studying developmental biology.
Survey of clustering data mining techniques
A Survey of Clustering Data Mining Techniques
Pavel Berkhin
Yahoo!, Inc.
pberkhin@

Summary. Clustering is the division of data into groups of similar objects. It disregards some details in exchange for data simplification. Formally, clustering can be viewed as data modeling that concisely summarizes the data, and it therefore relates to many disciplines, from statistics to numerical analysis. Clustering plays an important role in a broad range of applications, from information retrieval to CRM. Such applications usually deal with large datasets and many attributes. Exploration of such data is a subject of data mining. This survey concentrates on clustering algorithms from a data mining perspective.

1 Introduction

The goal of this survey is to provide a comprehensive review of different clustering techniques in data mining. Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. Representing data by fewer clusters necessarily loses certain fine details (akin to lossy data compression), but achieves simplification: it represents many data objects by few clusters, and hence models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. Therefore, clustering is unsupervised learning of a hidden data concept. Data mining applications add three complications to this general picture: (a) large databases, (b) many attributes, (c) attributes of different types. This imposes severe computational requirements on data analysis. Data mining applications include scientific data exploration, information retrieval, text mining, spatial databases, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. They present real challenges to classic clustering algorithms.
These challenges led to the emergence of powerful, broadly applicable data mining clustering methods developed on the foundation of classic techniques. They are the subject of this survey.

1.1 Notations

To fix the context and clarify terminology, consider a dataset X consisting of data points (i.e., objects, instances, cases, patterns, tuples, transactions) x_i = (x_{i1}, ..., x_{id}), i = 1:N, in attribute space A, where each component x_{il} \in A_l, l = 1:d, is a numerical or nominal categorical attribute (i.e., feature, variable, dimension, component, field). For a discussion of attribute data types see [106]. Such point-by-attribute data format conceptually corresponds to an N x d matrix and is used by a majority of the algorithms reviewed below. However, data of other formats, such as variable-length sequences and heterogeneous data, are not uncommon.

The simplest subset in an attribute space is a direct Cartesian product of sub-ranges C = \prod C_l \subset A, C_l \subset A_l, called a segment (i.e., cube, cell, region). A unit is an elementary segment whose sub-ranges consist of a single category value, or of a small numerical bin. Describing the numbers of data points per every unit represents an extreme case of clustering, a histogram. This is a very expensive representation, and not a very revealing one. User-driven segmentation is another commonly used practice in data exploration that utilizes expert knowledge regarding the importance of certain sub-domains. Unlike segmentation, clustering is assumed to be automatic, and so it is a machine learning technique.

The ultimate goal of clustering is to assign points to a finite system of k subsets (clusters). Usually (but not always) the subsets do not intersect, and their union is equal to the full dataset with the possible exception of outliers:

X = C_1 \cup \dots \cup C_k \cup C_{\text{outliers}}, \qquad C_i \cap C_j = \emptyset, \ i \neq j.

1.2 Clustering Bibliography at a Glance

General references regarding clustering include [110], [205], [116], [131], [63], [72], [165], [119], [75], [141], [107], [91]. A very good introduction to contemporary data mining clustering techniques can be found in the textbook [106].

There is a close relationship between clustering and many other fields. Clustering has always been used in statistics [10] and science [158]. The classic introduction to the pattern recognition framework is given in [64]. Typical applications include speech and character recognition. Machine learning clustering algorithms were applied to image segmentation and computer vision [117]. For statistical approaches to pattern recognition see [56] and [85]. Clustering can be viewed as a density estimation problem. This is the subject of traditional multivariate statistical estimation [197]. Clustering is also widely used for data compression in image processing, which is also known as vector quantization [89]. Data fitting in numerical analysis provides still another venue in data modeling [53]. This survey's emphasis is on clustering in data mining. Such clustering is characterized by large datasets with many attributes of different types.
Though we do not even try to review particular applications, many important ideas are related to the specific fields. Clustering in data mining was brought to life by intense developments in information retrieval and text mining [52], [206], [58], spatial database applications, for example, GIS or astronomical data [223], [189], [68], sequence and heterogeneous data analysis [43], Web applications [48], [111], [81], DNA analysis in computational biology [23], and many others. They resulted in a large amount of application-specific developments, but also in some general techniques. These techniques and the classic clustering algorithms that relate to them are surveyed below.

1.3 Plan of Further Presentation

Classification of clustering algorithms is neither straightforward nor canonical. In reality, different classes of algorithms overlap. Traditionally, clustering techniques are broadly divided into hierarchical and partitioning. Hierarchical clustering is further subdivided into agglomerative and divisive. The basics of hierarchical clustering include the Lance-Williams formula, the idea of conceptual clustering, the now classic algorithms SLINK and COBWEB, as well as the newer algorithms CURE and CHAMELEON. We survey these algorithms in the section Hierarchical Clustering.

While hierarchical algorithms gradually (dis)assemble points into clusters (as crystals grow), partitioning algorithms learn clusters directly. In doing so they try to discover clusters either by iteratively relocating points between subsets, or by identifying areas heavily populated with data.

Algorithms of the first kind are called Partitioning Relocation Clustering. They are further classified into probabilistic clustering (EM framework, algorithms SNOB, AUTOCLASS, MCLUST), k-medoids methods (algorithms PAM, CLARA, CLARANS, and its extension), and k-means methods (different schemes, initialization, optimization, harmonic means, extensions). Such methods concentrate on how well points fit into their clusters and tend to build clusters of proper convex shapes.

Partitioning algorithms of the second type are surveyed in the section Density-Based Partitioning. They attempt to discover dense connected components of data, which are flexible in terms of their shape. Density-based connectivity is used in the algorithms DBSCAN, OPTICS, and DBCLASD, while the algorithm DENCLUE exploits space density functions. These algorithms are less sensitive to outliers and can discover clusters of irregular shape. They usually work with low-dimensional numerical data, known as spatial data.
Spatial objects could include not only points, but also geometrically extended objects (algorithm GDBSCAN).

Some algorithms work with data indirectly by constructing summaries of data over the attribute space subsets. They perform space segmentation and then aggregate appropriate segments. We discuss them in the section Grid-Based Methods. They frequently use hierarchical agglomeration as one phase of processing. The algorithms BANG, STING, WaveCluster, and FC are discussed in this section. Grid-based methods are fast and handle outliers well. Grid-based methodology is also used as an intermediate step in many other algorithms (for example, CLIQUE, MAFIA).

Categorical data is intimately connected with transactional databases. The concept of similarity alone is not sufficient for clustering such data. The idea of categorical data co-occurrence comes to the rescue. The algorithms ROCK, SNN, and CACTUS are surveyed in the section Co-Occurrence of Categorical Data. The situation gets even more aggravated with the growth of the number of items involved. To help with this problem, the effort is shifted from data clustering to pre-clustering of items or categorical attribute values. Developments based on hyper-graph partitioning and the algorithm STIRR exemplify this approach.

Many other clustering techniques have been developed, primarily in machine learning, that either have theoretical significance, are used traditionally outside the data mining community, or do not fit in previously outlined categories. The boundary is blurred. In the section Other Developments we discuss the emerging direction of constraint-based clustering, the important research field of graph partitioning, and the relationship of clustering to supervised learning, gradient descent, artificial neural networks, and evolutionary methods.

Data mining primarily works with large databases. Clustering large datasets presents scalability problems reviewed in the section Scalability and VLDB Extensions. Here we talk about algorithms like DIGNET, about BIRCH and other data squashing techniques, and about Hoeffding or Chernoff bounds.

Another trait of real-life data is high dimensionality. Corresponding developments are surveyed in the section Clustering High Dimensional Data.
The trouble comes from a decrease in metric separation when the dimension grows. One approach to dimensionality reduction uses attribute transformations (DFT, PCA, wavelets). Another way to address the problem is through subspace clustering (algorithms CLIQUE, MAFIA, ENCLUS, OPTIGRID, PROCLUS, ORCLUS). Still another approach clusters attributes in groups and uses their derived proxies to cluster objects. This double clustering is known as co-clustering.

Issues common to different clustering methods are overviewed in the section General Algorithmic Issues. We talk about assessment of results, determination of the appropriate number of clusters to build, data preprocessing, proximity measures, and handling of outliers.

For the reader's convenience we provide a classification of clustering algorithms closely followed by this survey:

• Hierarchical Methods
  Agglomerative Algorithms
  Divisive Algorithms
• Partitioning Relocation Methods
  Probabilistic Clustering
  K-medoids Methods
  K-means Methods
• Density-Based Partitioning Methods
  Density-Based Connectivity Clustering
  Density Functions Clustering
• Grid-Based Methods
• Methods Based on Co-Occurrence of Categorical Data
• Other Clustering Techniques
  Constraint-Based Clustering
  Graph Partitioning
  Clustering Algorithms and Supervised Learning
  Clustering Algorithms in Machine Learning
• Scalable Clustering Algorithms
• Algorithms For High Dimensional Data
  Subspace Clustering
  Co-Clustering Techniques

1.4 Important Issues

The properties of clustering algorithms we are primarily concerned with in data mining include:

• Type of attributes the algorithm can handle
• Scalability to large datasets
• Ability to work with high-dimensional data
• Ability to find clusters of irregular shape
• Handling outliers
• Time complexity (we frequently simply use the term complexity)
• Data order dependency
• Labeling or assignment (hard or strict vs. soft or fuzzy)
• Reliance on a priori knowledge and user-defined parameters
• Interpretability of results

Realistically, with every algorithm we discuss only some of these properties.
The list is in no way exhaustive. For example, as appropriate, we also discuss an algorithm's ability to work in a pre-defined memory buffer, to restart, and to provide an intermediate solution.

2 Hierarchical Clustering

Hierarchical clustering builds a cluster hierarchy or a tree of clusters, also known as a dendrogram. Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent. Such an approach allows exploring data on different levels of granularity. Hierarchical clustering methods are categorized into agglomerative (bottom-up) and divisive (top-down) [116], [131]. An agglomerative clustering starts with one-point (singleton) clusters and recursively merges two or more of the most similar clusters. A divisive clustering starts with a single cluster containing all data points and recursively splits the most appropriate cluster. The process continues until a stopping criterion (frequently, the requested number k of clusters) is achieved.

Advantages of hierarchical clustering include:
• Flexibility regarding the level of granularity
• Ease of handling any form of similarity or distance
• Applicability to any attribute types

Disadvantages of hierarchical clustering are related to:
• Vagueness of termination criteria
• The fact that most hierarchical algorithms do not revisit (intermediate) clusters once constructed.

The classic approaches to hierarchical clustering are presented in the subsection Linkage Metrics. Hierarchical clustering based on linkage metrics results in clusters of proper (convex) shapes. Active contemporary efforts to build cluster systems that incorporate our intuitive concept of clusters as connected components of arbitrary shape, including the algorithms CURE and CHAMELEON, are surveyed in the subsection Hierarchical Clusters of Arbitrary Shapes. Divisive techniques based on binary taxonomies are presented in the subsection Binary Divisive Partitioning. The subsection Other Developments contains information related to incremental learning, model-based clustering, and cluster refinement.

In hierarchical clustering our regular point-by-attribute data representation frequently is of secondary importance. Instead, hierarchical clustering frequently deals with the N x N matrix of distances (dissimilarities) or similarities between training points, sometimes called a connectivity matrix. So-called linkage metrics are constructed from elements of this matrix. The requirement of keeping a connectivity matrix in memory is unrealistic. To relax this limitation, different techniques are used to sparsify (introduce zeros into) the connectivity matrix. This can be done by omitting entries smaller than a certain threshold, by using only a certain subset of data representatives, or by keeping with each point only a certain number of its nearest neighbors (for nearest neighbor chains see [177]). Notice that the way we process the original (dis)similarity matrix and construct a linkage metric reflects our a priori ideas about the data model.

With the (sparsified) connectivity matrix we can associate the weighted connectivity graph G(X, E) whose vertices X are data points, and whose edges E and their weights are defined by the connectivity matrix. This establishes a connection between hierarchical clustering and graph partitioning. One of the most striking developments in hierarchical clustering is the algorithm BIRCH. It is discussed in the section Scalable VLDB Extensions.
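As a concrete illustration of the agglomerative case, the following minimal sketch (not from the survey itself) runs bottom-up clustering on an invented two-dimensional dataset with SciPy and cuts the resulting dendrogram into two flat clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points: two visually separated groups
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])

# Agglomerative merge sequence under single-link (nearest-neighbour) distance
Z = linkage(X, method="single")   # each row: (cluster_i, cluster_j, distance, new cluster size)

# Cut the dendrogram to obtain a flat partition into k = 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)                     # e.g. [1 1 1 2 2 2]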
Hierarchical clustering initializes a cluster system as a set of singleton clusters (agglomerative case) or a single cluster of all points (divisive case) and proceeds by iteratively merging or splitting the most appropriate cluster(s) until the stopping criterion is achieved. The appropriateness of cluster(s) for merging or splitting depends on the (dis)similarity of cluster elements. This reflects a general presumption that clusters consist of similar points. An important example of dissimilarity between two points is the distance between them.

To merge or split subsets of points rather than individual points, the distance between individual points has to be generalized to the distance between subsets. Such a derived proximity measure is called a linkage metric. The type of linkage metric significantly affects hierarchical algorithms, because it reflects a particular concept of closeness and connectivity. Major inter-cluster linkage metrics [171], [177] include single link, average link, and complete link. The underlying dissimilarity measure (usually, distance) is computed for every pair of nodes with one node in the first set and another node in the second set. A specific operation such as minimum (single link), average (average link), or maximum (complete link) is applied to the pair-wise dissimilarity measures:

d(C_1, C_2) = \mathrm{Op}\{ d(x, y) : x \in C_1, y \in C_2 \}

Early examples include the algorithm SLINK [199], which implements single link (Op = min), Voorhees' method [215], which implements average link (Op = Avr), and the algorithm CLINK [55], which implements complete link (Op = max). Single link is related to the problem of finding the Euclidean minimal spanning tree [224] and has O(N^2) complexity. The methods using inter-cluster distances defined in terms of pairs of nodes (one in each respective cluster) are called graph methods. They do not use any cluster representation other than a set of points. This name naturally relates to the connectivity graph G(X, E) introduced above, because every data partition corresponds to a graph partition.

Such methods can be augmented by so-called geometric methods, in which a cluster is represented by its central point. Under the assumption of numerical attributes, the center point is defined as a centroid or an average of two cluster centroids subject to agglomeration. This results in centroid, median, and minimum variance linkage metrics. All of the above linkage metrics can be derived from the Lance-Williams updating formula [145],

d(C_i \cup C_j, C_k) = a(i)\, d(C_i, C_k) + a(j)\, d(C_j, C_k) + b\, d(C_i, C_j) + c\, |d(C_i, C_k) - d(C_j, C_k)|.

Here a, b, c are coefficients corresponding to a particular linkage. This formula expresses a linkage metric between a union of the two clusters and a third cluster in terms of the underlying nodes. The Lance-Williams formula is crucial to making the (dis)similarity computations feasible. Surveys of linkage metrics can be found in [170], [54]. When distance is used as a base measure, linkage metrics capture inter-cluster proximity. However, a similarity-based view that results in intra-cluster connectivity considerations is also used, for example, in the original average link agglomeration (Group-Average Method) [116].

Under reasonable assumptions, such as the reducibility condition (graph methods satisfy this condition), linkage metric methods suffer from O(N^2) time complexity [177]. Despite the unfavorable time complexity, these algorithms are widely used. As an example, the algorithm AGNES (AGglomerative NESting) [131] is used in S-Plus.
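To make the two formulas above concrete, here is a small illustrative sketch (not part of the survey): the first function applies the Op-based pair-wise definition, and the second applies the Lance-Williams update; the coefficient values in the last line are the standard single-link choices.

def linkage_distance(c1, c2, d, op=min):
    """d(C1, C2) = Op{ d(x, y) : x in C1, y in C2 }; op=min gives single link,
    op=max gives complete link, and an averaging op gives average link."""
    return op(d(x, y) for x in c1 for y in c2)

def lance_williams(d_ik, d_jk, d_ij, a_i, a_j, b, c):
    """Distance between the merged cluster Ci U Cj and a third cluster Ck."""
    return a_i * d_ik + a_j * d_jk + b * d_ij + c * abs(d_ik - d_jk)

# Toy 1-D points and the induced distance function
points = {"a": 0.0, "b": 1.0, "c": 5.0}
d = lambda x, y: abs(points[x] - points[y])
print(linkage_distance({"a", "b"}, {"c"}, d, op=min))  # single link: 4.0

# Single link via Lance-Williams: a_i = a_j = 1/2, b = 0, c = -1/2
print(lance_williams(d_ik=2.0, d_jk=5.0, d_ij=3.0, a_i=0.5, a_j=0.5, b=0.0, c=-0.5))  # 2.0 = min(2, 5)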
When the connectivity N x N matrix is sparsified, graph methods directly dealing with the connectivity graph G can be used. In particular, the hierarchical divisive MST (Minimum Spanning Tree) algorithm is based on graph partitioning [116].

2.1 Hierarchical Clusters of Arbitrary Shapes

For spatial data, linkage metrics based on Euclidean distance naturally generate clusters of convex shapes. Meanwhile, visual inspection of spatial images frequently discovers clusters with a curvy appearance.

Guha et al. [99] introduced the hierarchical agglomerative clustering algorithm CURE (Clustering Using REpresentatives). This algorithm has a number of novel features of general importance. It takes special steps to handle outliers and to provide labeling in the assignment stage. It also uses two techniques to achieve scalability: data sampling (section 8) and data partitioning. CURE creates p partitions, so that fine-granularity clusters are constructed in partitions first. A major feature of CURE is that it represents a cluster by a fixed number, c, of points scattered around it. The distance between two clusters used in the agglomerative process is the minimum of distances between two scattered representatives. Therefore, CURE takes a middle approach between the graph (all-points) methods and the geometric (one centroid) methods. Single and average link closeness are replaced by the representatives' aggregate closeness. Selecting representatives scattered around a cluster makes it possible to cover non-spherical shapes. As before, agglomeration continues until the requested number k of clusters is achieved. CURE employs one additional trick: the originally selected scattered points are shrunk towards the geometric centroid of the cluster by a user-specified factor alpha. Shrinkage suppresses the effect of outliers; outliers happen to be located further from the cluster centroid than the other scattered representatives. CURE is capable of finding clusters of different shapes and sizes, and it is insensitive to outliers. Because CURE uses sampling, estimation of its complexity is not straightforward. For low-dimensional data the authors provide a complexity estimate of O(N^2_sample) defined in terms of the sample size. More exact bounds depend on input parameters: the shrink factor alpha, the number of representative points c, the number of partitions p, and the sample size. Figure 1(a) illustrates agglomeration in CURE. Three clusters, each with three representatives, are shown before and after the merge and shrinkage. The two closest representatives are connected.

While the algorithm CURE works with numerical attributes (particularly low-dimensional spatial data), the algorithm ROCK developed by the same researchers [100] targets hierarchical agglomerative clustering for categorical attributes. It is reviewed in the section Co-Occurrence of Categorical Data.

The hierarchical agglomerative algorithm CHAMELEON [127] uses the connectivity graph G corresponding to the K-nearest-neighbor model sparsification of the connectivity matrix: the edges of the K most similar points to any given point are preserved, the rest are pruned. CHAMELEON has two stages. In the first stage small tight clusters are built to ignite the second stage. This involves a graph partitioning [129]. In the second stage an agglomerative process is performed. It utilizes measures of relative inter-connectivity RI(C_i, C_j) and relative closeness RC(C_i, C_j); both are locally normalized by the internal inter-connectivity and closeness of clusters C_i and C_j. In this sense the modeling is dynamic: it depends on the data locally. Normalization involves certain non-obvious graph operations [129]. CHAMELEON relies heavily on graph partitioning implemented in the library HMETIS (see section 6).
The agglomerative process depends on user-provided thresholds. A decision to merge is made based on the combination

RI(C_i, C_j) \cdot RC(C_i, C_j)^{\alpha}

of local measures. The algorithm does not depend on assumptions about the data model. It has been proven to find clusters of different shapes, densities, and sizes in 2D (two-dimensional) space. It has a complexity of O(Nm + N log(N) + m^2 log(m)), where m is the number of sub-clusters built during the first initialization phase. Figure 1(b) (analogous to the one in [127]) clarifies the difference from CURE. It presents a choice of four clusters (a)-(d) for a merge. While CURE would merge clusters (a) and (b), CHAMELEON makes the intuitively better choice of merging (c) and (d).

[Fig. 1. Agglomeration in clusters of arbitrary shapes: (a) algorithm CURE; (b) algorithm CHAMELEON.]

2.2 Binary Divisive Partitioning

In linguistics, information retrieval, and document clustering applications, binary taxonomies are very useful. Linear algebra methods based on singular value decomposition (SVD) are used for this purpose in collaborative filtering and information retrieval [26]. Application of SVD to hierarchical divisive clustering of document collections resulted in the PDDP (Principal Direction Divisive Partitioning) algorithm [31]. In our notation, an object x is a document, the l-th attribute corresponds to a word (index term), and a matrix entry x_{il} is a measure (e.g. TF-IDF) of the frequency of term l in document x. PDDP constructs the SVD decomposition of the matrix

(X - e \bar{x}), \qquad \bar{x} = \frac{1}{N} \sum_{i=1:N} x_i, \qquad e = (1, \dots, 1)^T.

This algorithm bisects data in Euclidean space by a hyperplane that passes through the data centroid orthogonal to the eigenvector with the largest singular value. A k-way split is also possible if the k largest singular values are considered. Bisecting is a good way to categorize documents and it yields a binary tree. When k-means (2-means) is used for bisecting, the dividing hyperplane is orthogonal to the line connecting the two centroids. The comparative study of SVD vs. k-means approaches [191] can be used for further references. Hierarchical divisive bisecting k-means was proven [206] to be preferable to PDDP for document clustering.

While PDDP or 2-means are concerned with how to split a cluster, the problem of which cluster to split is also important. Simple strategies are: (1) split each node at a given level, (2) split the cluster with the highest cardinality, and (3) split the cluster with the largest intra-cluster variance. All three strategies have problems. For a more detailed analysis of this subject and better strategies, see [192].
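A minimal sketch of the PDDP splitting step, assuming a small dense term-frequency-style matrix held in NumPy; the data are invented, and the recursion and stopping rules of the full algorithm are omitted.

import numpy as np

def pddp_split(X):
    """One PDDP bisection: project the centred rows of X onto the leading
    principal direction and split by the sign of the projection."""
    centred = X - X.mean(axis=0)              # corresponds to X - e * x_bar
    # Leading right singular vector of the centred matrix = principal direction
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    projection = centred @ vt[0]
    return projection >= 0                    # boolean mask: True -> one child, False -> the other

# Toy "documents x terms" matrix (counts are invented)
X = np.array([[3, 0, 1, 0],
              [4, 1, 0, 0],
              [0, 0, 3, 4],
              [1, 0, 4, 3]], dtype=float)
print(pddp_split(X))                          # e.g. [ True  True False False]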
2.3 Other Developments

One of the early agglomerative clustering algorithms, Ward's method [222], is based not on a linkage metric but on the objective function used in k-means. The merger decision is viewed in terms of its effect on the objective function.

The popular hierarchical clustering algorithm for categorical data, COBWEB [77], has two very important qualities. First, it utilizes incremental learning. Instead of following divisive or agglomerative approaches, it dynamically builds a dendrogram by processing one data point at a time. Second, COBWEB is an example of conceptual or model-based learning. This means that each cluster is considered as a model that can be described intrinsically, rather than as a collection of points assigned to it. COBWEB's dendrogram is called a classification tree. Each tree node (cluster) C is associated with the conditional probabilities for categorical attribute-value pairs,

Pr(x_l = \nu_{lp} \mid C), \quad l = 1:d, \ p = 1:|A_l|.

This can easily be recognized as a C-specific Naive Bayes classifier. During the classification tree construction, every new point is descended along the tree and the tree is potentially updated (by an insert/split/merge/create operation). Decisions are based on the category utility [49]

CU(\{C_1, \dots, C_k\}) = \frac{1}{k} \sum_{j=1:k} CU(C_j),
CU(C_j) = \sum_{l,p} \left( Pr(x_l = \nu_{lp} \mid C_j)^2 - Pr(x_l = \nu_{lp})^2 \right).

Category utility is similar to the GINI index. It rewards clusters C_j for increases in predictability of the categorical attribute values nu_{lp}. Being incremental, COBWEB is fast, with a complexity of O(tN), though it depends non-linearly on tree characteristics packed into a constant t. There is a similar incremental hierarchical algorithm for all numerical attributes called CLASSIT [88]. CLASSIT associates normal distributions with cluster nodes. Both algorithms can result in highly unbalanced trees.

Chiu et al. [47] proposed another conceptual or model-based approach to hierarchical clustering. This development contains several different useful features, such as the extension of scalability preprocessing to categorical attributes, outlier handling, and a two-step strategy for monitoring the number of clusters including BIC (defined below). A model associated with a cluster covers both numerical and categorical attributes and constitutes a blend of Gaussian and multinomial models. Denote the corresponding multivariate parameters by theta. With every cluster C we associate the logarithm of its (classification) likelihood

l_C = \sum_{x_i \in C} \log(p(x_i \mid \theta)).

The algorithm uses maximum likelihood estimates for the parameter theta. The distance between two clusters is defined (instead of a linkage metric) as the decrease in log-likelihood

d(C_1, C_2) = l_{C_1} + l_{C_2} - l_{C_1 \cup C_2}

caused by merging the two clusters under consideration. The agglomerative process continues until the stopping criterion is satisfied. As such, determination of the best k is automatic. This algorithm has a commercial implementation (in SPSS Clementine). The complexity of the algorithm is linear in N for the summarization phase.

Traditional hierarchical clustering does not change point membership in once-assigned clusters due to its greedy approach: after a merge or a split is selected it is not refined. Though COBWEB does reconsider its decisions, its
Mammalian microRNAs predominantly act to decrease target mRNA levels
ARTICLES

Mammalian microRNAs predominantly act to decrease target mRNA levels

Huili Guo1,2, Nicholas T. Ingolia3,4, Jonathan S. Weissman3,4 & David P. Bartel1,2

1 Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142, USA. 2 Howard Hughes Medical Institute and Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA. 3 Howard Hughes Medical Institute and Department of Cellular and Molecular Pharmacology, University of California, San Francisco, California 94158, USA. 4 California Institute for Quantitative Biosciences, San Francisco, California 94158, USA.

MicroRNAs (miRNAs) are endogenous ~22-nucleotide RNAs that mediate important gene-regulatory events by pairing to the mRNAs of protein-coding genes to direct their repression. Repression of these regulatory targets leads to decreased translational efficiency and/or decreased mRNA levels, but the relative contributions of these two outcomes have been largely unknown, particularly for endogenous targets expressed at low-to-moderate levels. Here, we use ribosome profiling to measure the overall effects on protein production and compare these to simultaneously measured effects on mRNA levels. For both ectopic and endogenous miRNA regulatory interactions, lowered mRNA levels account for most (≥84%) of the decreased protein production. These results show that changes in mRNA levels closely reflect the impact of miRNAs on gene expression and indicate that destabilization of target mRNAs is the predominant reason for reduced protein output.

Each highly conserved mammalian miRNA typically targets the mRNAs of hundreds of distinct genes, such that as a class these small regulatory RNAs dampen the expression of most protein-coding genes to optimize their expression patterns1,2. When pairing to a target is extensive, a miRNA can direct destruction of the targeted mRNA through Argonaute-catalysed mRNA cleavage3,4. This mode of repression dominates in plants5, but in animals all but a few targets lack the extensive pairing required for cleavage2.

The molecular consequences of the repression mode that dominates in animals are less clear. Initially, miRNAs were thought to repress protein output with little or no influence on mRNA levels6,7. Then mRNA-array experiments showed that miRNAs decrease the levels of many targeted mRNAs8-11. A revisit of the initially identified targets of Caenorhabditis elegans miRNAs showed that these transcripts also decrease in the presence of their cognate miRNAs12.
The mRNA decreases are associated with poly(A)-tail shortening, leading to a model in which miRNAs cause mRNA de-adenylation, which promotes de-capping and more rapid degradation through standard mRNA-turnover processes10,13-15. The magnitude of this destabilization, however, is usually quite modest, which has bolstered the lingering notion that, with some exceptions (for example, Drosophila miR-12 regulation of CG10011, ref. 14), most repression occurs through translational repression, and that monitoring mRNA destabilization might miss many targets that are downregulated without detectable mRNA changes.

Challenging this view are the results of high-throughput analyses comparing protein and mRNA changes after introducing or deleting individual miRNAs16,17. An interpretation of these results is that the modest mRNA destabilization imparted by each miRNA-target interaction represents most of the miRNA-mediated repression16. We call this the 'mRNA-destabilization' scenario and contrast it to the original 'translational-repression' scenario, which posited decreased translation with relatively little mRNA change. In the mRNA-destabilization scenario, differences between protein and mRNA changes are mostly attributed to either measurement noise or complications arising from pre-steady-state comparisons of mRNA-array data, which measure differences at one moment in time, and proteomic data, which measure differences integrated over an extended period of protein synthesis. If either mRNA levels or miRNA activities change over the period of protein synthesis (or the period of metabolic labelling), correspondence between mRNA destabilization and protein decreases could become distorted. Another complication of proteomic data sets is that they preferentially examine more highly expressed proteins, whose repression might differ from that of more modestly expressed proteins.

A recent study used mRNA arrays to monitor effects on both mRNA levels and mRNA ribosome density and occupancy, thereby providing a more sensitive analysis of changes in mRNA utilization and bypassing the need to compare protein and mRNA18. This array study supports the mRNA-destabilization scenario but examines the response to an ectopically introduced miRNA, leaving open the question of whether endogenous miRNA-target interactions might impart additional translational repression.

Ribosome profiling, a method that determines the positions of ribosomes on cellular mRNAs with sub-codon resolution19, is based on deep sequencing of ribosome-protected mRNA fragments (RPFs) and thereby provides quantitative data on thousands of genes not detected by general proteomics methods. Moreover, ribosome profiling reports on the status of the cell at a particular time point, and thus generates results more directly comparable to mRNA-profiling results than does proteomics. We extended this method to human and mouse cells, thereby enabling a fresh look at the molecular consequences of miRNA repression.

Ribosome profiling in mammalian cells

Ribosome profiling generates short sequence tags that each mark the mRNA coordinates of one bound ribosome19. The outline of our protocol for mammalian cells paralleled that used for yeast (Fig. 1a). Cells were treated with cycloheximide to arrest translating ribosomes.
Extracts from these cells were then treated with RNase I to degrade regions of mRNAs not protected by ribosomes. The resulting 80S monosomes, many of which contained a ~30-nucleotide RPF, were purified on sucrose gradients and then treated to release the RPFs, which were processed for Illumina high-throughput sequencing.

We started with HeLa cells, performing ribosome profiling on miRNA- and mock-transfected cells. In parallel, poly(A)-selected mRNA from each sample was randomly fragmented, and the resulting mRNA fragments were processed for sequencing (mRNA-Seq) using the same protocol as that used for the RPFs. Sequencing generated 11-18 million raw reads per sample, of which 4-8 million were used for subsequent analyses because they each mapped to a single location in a database of annotated pre-mRNAs and mRNA splice junctions (Supplementary Table 1). Combining RPFs from HeLa-expressed mRNAs into one composite mRNA showed that ribosome profiling captured fundamental features of translation (Fig. 1b, c and Supplementary Fig. 1c). Although a few RPFs mapped to annotated 5' untranslated regions (5' UTRs), which indicated the presence of ribosomes at upstream open reading frames (ORFs)19, the vast majority mapped to annotated ORFs. RPF density was highest at the start and stop codons, reflecting known pauses at these positions20. mRNA-Seq tags, in contrast, mapped uniformly across the length of the mRNA, as expected for randomly fragmented mRNA. The most striking feature in the composite-mRNA analysis was the 3-nucleotide periodicity of the RPFs. In sharp contrast to the 5' termini of the mRNA-Seq tags, which mapped to all three codon nucleotides equally, the RPF 5' termini mostly mapped to the first nucleotide of the codon (Fig. 1d). This pattern, analogous to that observed in yeast19, is attributable to the RPFs capturing the movement of ribosomes along mRNAs, three nucleotides at a time. The protocol applied to mouse neutrophils generated ~30-nucleotide RPFs with the same pattern (Supplementary Fig. 1d, e). Thus, ribosome profiling mapped, at sub-codon resolution, the positions of translating ribosomes in human and mouse cells.

Similar repression regardless of target expression level

General features of translation and translational efficiency in mammalian cells will be presented elsewhere. Here, we focus on miRNA-dependent changes in protein production. Our HeLa-cell experiments examined the impact of introducing miR-1 or miR-155, both of which are not normally expressed in HeLa cells, and our mouse-neutrophil experiments examined the impact of knocking out mir-223, which encodes a miRNA highly and preferentially expressed in neutrophils21. These cell types and miRNAs were chosen because proteomics experiments using either the SILAC (stable isotope labelling with amino acids in cell culture) or pSILAC (a pulsed-labelled version of SILAC) methods had already reported the impact of each of these miRNAs on the output of thousands of proteins16,17. Pairing to the miRNA seed (nucleotides 2-7) is important for target recognition, and several types of seed-matched
sites, ranging in length from 6 to 8 nucleotides, mediate repression2. Ribosome-profiling and mRNA-Seq results showed the expected correlation between site length and site efficacy2 (Supplementary Fig. 2). Because the response of mRNAs with single 6-nucleotide sites was marginal and observed only in the miR-1 experiment, subsequent analyses focused on mRNAs with at least one canonical 7-8-nucleotide site.

In the miR-155 experiment, mRNAs from 5,103 distinct genes passed our read threshold for single-gene quantification (≥100 RPFs and ≥100 mRNA-Seq tags in the mock-transfection control). Genes with at least one 3' UTR site tended to be repressed following addition of miR-155, yielding fewer mRNA-Seq tags and fewer RPFs in the presence of the miRNA (Fig. 2a; P < 10^-48 and < 10^-37, respectively, one-tailed Kolmogorov-Smirnov (K-S) test, comparing to genes with no site in the entire message). Proteins from 2,597 of the 5,103 genes were quantified in the analogous pSILAC experiment17. The mRNA and RPF changes for the pSILAC-detected subset were no less pronounced than those of the larger set of analysed genes (Fig. 2a; P = 0.70 and 0.62 for mRNA and RPF data, respectively, K-S test), which implied that the response of mRNAs of proteins detected by high-throughput quantitative proteomics accurately represented the response of all mRNAs. Analogous results were obtained in the miR-1 and miR-223 experiments (Fig. 2b, c; P < 10^-10 for each comparison to genes with no site, and P > 0.56 for each comparison to the proteomics-detected subset).

[Figure 1 | Ribosome profiling in human cells captured features of translation. a, Schematic diagram of ribosome profiling. Sequencing reproducibility and evidence for mapping to the correct mRNA isoforms are illustrated (Supplementary Fig. 1a, b). b, RPF density near the ends of ORFs, combining data from all quantified genes. Plotted are RPF 5' termini, as reads per million reads mapping to genes (rpM). Illustrated below the graph are the inferred ribosome positions corresponding to peak RPF densities, at which the start codon was in the P site (left) and the stop codon was in the A site (right). The offset between the 5' terminus of an RPF and the first nucleotide in the human ribosome A site was typically 15 nucleotides (nt). c, Density of RPFs and mRNA-Seq tags near the ends of ORFs in HeLa cells. RPF density is plotted as in panel b, except positions are shifted +15 nucleotides to reflect the position of the first nucleotide in the ribosome A site. Composite data are shown for ≥600-nucleotide ORFs that passed our threshold for quantification (≥100 RPFs and ≥100 mRNA-Seq tags). d, Fraction of RPFs and mRNA-Seq tags mapping to each of the three codon nucleotides in panel c.]

Furthermore, analyses of genes binned by expression level, which enabled inclusion of data from 11,000 distinct genes that ranged broadly in expression (more than 1,000-fold difference between the first and last bins), confirmed that miRNAs do not repress their lowly expressed targets more potently than they do their more highly expressed targets (Supplementary Fig. 3).
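A minimal sketch of the kind of distribution comparison described above (not the authors' code), assuming per-gene log2 fold changes for site-containing and no-site genes are already in hand; the arrays below are simulated stand-ins, and a two-sided test is used here although the paper reports one-tailed P values.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Stand-in log2 fold changes: site-containing genes shifted slightly downward
fc_with_site = rng.normal(loc=-0.4, scale=0.5, size=700)
fc_no_site = rng.normal(loc=0.0, scale=0.5, size=3000)

# Compare the two fold-change distributions with a Kolmogorov-Smirnov test
stat, p_value = ks_2samp(fc_with_site, fc_no_site)
print(stat, p_value)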
As these results indicated that restricting analyses to mRNAs with higher expression, by requiring either a minimal read count or a proteomics-detected protein, did not somehow distort the picture of miRNA targeting and repression, we focused on the mRNAs with at least one 3' UTR site and for which the proteomics detected a substantial change at the protein level. These sets of mRNAs were called 'proteomics-supported targets' because they were expected to be highly enriched in direct targets of the miRNAs. Indeed, they responded more robustly to the introduction or ablation of cognate miRNAs (Fig. 2a-c; P < 10^-5 for each comparison to proteomics-detected genes with sites). Because some 7-8-nucleotide seed-matched sites do not confer repression by the corresponding miRNA2,22, the proteomics-supported targets, which excluded most messages with non-functional sites, were the most informative for subsequent analyses.

Modest influence on translational efficiency

We next examined whether our results supported the translational-repression scenario, in which translation is repressed without a substantial mRNA decrease. In the characterized examples in which miRNAs direct translation inhibition, repression is reported to occur through either reduced translation initiation23-25 or increased ribosome drop-off26. Both of these mechanisms would lead to fewer ribosomes on target mRNAs and thus fewer RPFs from these mRNAs after accounting for changes in mRNA levels. To detect this effect, we accounted for changes in mRNA levels by incorporating the mRNA-Seq results. For example, for each quantified gene in the miR-155 experiment, we divided the change in RPFs by the change in mRNA-Seq tags (that is, we subtracted the log2-fold changes). This calculation removed the component of the RPF change attributable to miRNA-dependent changes in poly(A) mRNA, leaving the residual change as the component attributable to a change in ribosome density, which we interpret as a change in 'translational efficiency19'.

We observed a statistically significant decrease in translational efficiency for messages with miR-155 sites compared to those without, indicating that miRNA targeting leads to fewer ribosomes on target mRNAs that have not yet lost their poly(A) tail and become destabilized (Fig. 2d, P = 0.003, K-S test). This decrease, however, was very modest. Even these proteomics-supported targets underwent only a 7% decrease in translational efficiency (-0.11 log2-fold change, Fig. 2d, inset), compared to a 33% decrease in polyadenylated mRNA (-0.59 log2-fold change, Fig. 2a). Analogous results were obtained for the miR-1 and miR-223 experiments (Fig. 2e, f; P = 0.001 and P = 0.05, respectively). Thus, for both ectopic and endogenous regulatory interactions, only a small fraction of the repression observed by ribosome profiling (11-16%) was attributable to reduced translational efficiency. At least 84% of the repression was attributable instead to decreased mRNA levels, a percentage somewhat greater than the ~75% reported from array analyses of ectopic interactions18.
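A minimal sketch of the normalization described above (not the authors' pipeline), assuming per-gene RPF and mRNA-Seq read counts are available for the control and miRNA-transfected samples; the counts below are invented.

import numpy as np

# Hypothetical per-gene read counts (control vs. miRNA-transfected)
rpf_ctrl   = np.array([1200.0,  450.0, 3000.0])
rpf_mirna  = np.array([ 700.0,  430.0, 1500.0])
mrna_ctrl  = np.array([ 900.0,  380.0, 2500.0])
mrna_mirna = np.array([ 600.0,  370.0, 1400.0])

rpf_change  = np.log2(rpf_mirna / rpf_ctrl)    # log2 fold change in ribosome-protected fragments
mrna_change = np.log2(mrna_mirna / mrna_ctrl)  # log2 fold change in mRNA-Seq tags

# Translational efficiency change: the RPF change normalized by the mRNA change
te_change = rpf_change - mrna_change
print(te_change)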
[Figure 2 | MicroRNAs downregulated gene expression mostly through mRNA destabilization, with a small effect on translational efficiency. a, Cumulative distributions of mRNA-Seq changes (left) and RPF changes (right) after introducing miR-155. Plotted are distributions for the genes with ≥1 miR-155 3' UTR site (blue), the subset of these genes detected in the pSILAC experiment (proteomics-detected, red), the subset of the proteomics-detected genes with proteins responding with log2-fold change ≤ -0.3 (proteomics-supported, green), and the control genes, which lacked miR-155 sites throughout their mRNAs (no site, black). The number of genes in each category is indicated in parentheses. b, Cumulative distributions of mRNA-Seq changes (left) and RPF changes (right) after introducing miR-1. Otherwise, as in panel a. c, Cumulative distributions of mRNA-Seq changes (left) and RPF changes (right) after deleting mir-223. Otherwise, as in panel a, with proteomics-supported genes referring to genes with proteins that responded with log2-fold change ≥ 0.3 in the SILAC experiment. d, Cumulative distributions of translational efficiency changes for the polyadenylated mRNA that remained after introducing miR-155. For each gene, the translational efficiency change was calculated by normalizing the RPF change by the mRNA-Seq change. For each distribution, the mean log2-fold change (± standard error) is shown (inset). e, Cumulative distributions of translational efficiency changes for the polyadenylated mRNA that remained after introducing miR-1. Otherwise, as in panel d. f, Cumulative distributions of translational efficiency changes for the polyadenylated mRNA that remained after deleting mir-223. Otherwise, as in panel d.]

Analyses described thus far focused on messages with at least one 3' UTR site to the cognate miRNA, without considering whether the site was conserved in orthologous UTRs of other animals. When we focused on evolutionarily conserved sites1, the results were similar but noisier because the conserved sites, although more efficacious, were 3-13-fold less abundant (Supplementary Fig. 4). When changing the focus to messages with sites only in the ORFs, the results were also similar but again noisier because sites in the open reading frames are less efficacious16,17,22, which led to ~70% fewer genes classified as proteomics-supported targets (Supplementary Fig. 5).

mRNA reduction consistently mirrored RPF reduction

Analyses of fold-change distributions (Fig. 2) supported the mRNA-destabilization scenario for most targets, but still allowed for the possibility that the translational-repression scenario might apply to a small subset of targets. To search for evidence for a set of unusual targets undergoing translational repression without substantial mRNA destabilization, we compared the mRNA and ribosome-profiling changes for the 5,103 quantifiable
genes from the miR-155 experiment. Correlation between the two types of responses was strong for the messages with miR-155 sites, and particularly for those that were proteomics-supported targets (Fig. 3a, R^2 = 0.49 and 0.63, respectively). A strong correlation was also observed for genes considered only after relaxing the expression cut-offs (Supplementary Fig. 6a). Any scatter that might have indicated that a few genes undergo translational repression without substantial mRNA destabilization strongly resembled the scatter observed in a parallel analysis of genes without sites (Fig. 3b). The same was observed for the miR-1 experiment, but in this case the correlations were even stronger (R^2 = 0.72 and 0.80, respectively), presumably because the increased response to the miRNA led to a correspondingly reduced contribution of experimental noise (Fig. 3c, d; Supplementary Fig. 6b). The same was also observed for the miR-223 experiment, with weaker correlations (R^2 = 0.26 and 0.40, respectively) attributable to the reduced response to the miRNA and a correspondingly increased contribution of experimental noise (Fig. 3e, f). Supporting this interpretation, systematically increasing expression cut-offs, which retained data with progressively lower noise from stochastic counting fluctuations, progressively increased the correlation between RPF and mRNA-Seq changes (Supplementary Fig. 6c). We also examined messages with multiple sites to the cognate miRNA and found that they behaved no differently with regard to the relationship between mRNA-Seq and RPF changes (Supplementary Fig. 7). In summary, we found no evidence that countered the conclusion that miRNAs act predominantly to reduce mRNA levels of nearly all, if not all, targets.

Uniform changes along the ORF length

If miRNA targeting causes ribosomes to drop off the message after translating a substantial fraction of the ORF, then the RPF changes summed over the length of the ORF might underestimate the reduced production of full-length protein. Therefore, we re-examined the ribosome-profiling data, which determine the location of ribosomes along the length of the mRNAs, thereby providing transcriptome-wide information that could detect ribosome drop-off. For highly expressed genes targeted in their 3' UTRs (e.g., TAGLN2 in the miR-1 experiment; Supplementary Fig. 8a), downregulation at the mRNA and ribosome levels was observed along the length of the ORF. In order to extend this analysis to genes with more moderate expression, we examined composite ORFs representing proteomics-supported targets and compared these to composite ORFs representing genes without sites.

When miR-155 targets were compared to genes without sites, fewer mRNA-Seq tags were observed across the length of the composite ORF (Fig. 4a). RPFs tended to be further reduced (P = 0.007, one-tailed Mann-Whitney test), but without a systematic change in the magnitude of this additional reduction across the length of the ORF (P = 0.95, two-tailed analysis of covariance (ANCOVA) test). Because ribosome drop-off would decrease the ribosome occupancy less at the beginning of the ORF than at the end, whereas inhibiting translation initiation would not, the observed uniform reduction supported mechanisms in which initiation was inhibited. Analogous results were observed in the miR-1 experiment (Fig. 4b; P = 0.002 for further reduction in RPFs; P = 0.85 for systematic change across the ORF). Evidence for drop-off was also not observed in the miR-223 experiment, although a change in translational efficiency was difficult to detect in this analysis, presumably because
the miRNA-mediated changes were lower in magnitude (Fig. 4c). The same conclusions were drawn from analyses in which we first normalized for ORF length (Supplementary Fig. 9).

Implications for the mechanism of repression

For both ectopic and endogenous miRNA targeting interactions, the molecular consequences of miRNA regulation were most consistent with the mRNA-destabilization scenario. Although acquiring similar data on cell types beyond the two examined here will be important, we have no reason to doubt that our conclusion will apply broadly to the vast majority of miRNA targeting interactions. If indeed general, this conclusion will be welcome news to biologists wanting to measure the ultimate impact of miRNAs on their direct regulatory targets. Because the quantitative effects on translating ribosomes so closely mirrored the decreases in polyadenylated mRNA, the impact on protein production can be closely approximated using mRNA arrays or mRNA-Seq.

[Figure 3 | Ribosome changes from miRNA targeting corresponded to mRNA changes. a, Correspondence between ribosome (RPF) and mRNA (mRNA-Seq) changes after introducing miR-155, plotting data for the 707 quantified genes with at least one miR-155 3' UTR site (blue circles). Proteomics-detected targets and proteomics-supported targets are highlighted (pink diamonds and green crosses, respectively). Expected standard deviations (error bars) were calculated based on the number of reads obtained per gene and assuming random counting statistics. The R^2 derived from Pearson's correlation of all data is indicated. b, Correspondence between ribosome and mRNA changes after introducing miR-155, plotting data for 707 genes randomly selected from the 3,186 quantified genes lacking a miR-155 site anywhere in the mRNA. Otherwise, as in panel a. c, d, As in panels a and b, but plotting results for the miR-1 experiment. e, f, As in panels a and b, but plotting results for the miR-223 experiment.]

Our results might also provide insight into the question of why some targets are more responsive to miRNAs than others; in the destabilization scenario, otherwise long-lived messages might undergo comparatively more destabilization than would constitutively short-lived ones.

Translational repression and mRNA destabilization are sometimes coupled27, which raises the possibility that the miRNA-mediated mRNA destabilization might be a consequence of translational repression. If so, a greater fraction of the repression might be attributable to decreased translational efficiency if the effects were analysed sooner after introducing a miRNA. However, the fraction attributable to decreased translational efficiency remained small when repeating the analysis using samples from 12 h (rather than 32 h) after introducing miR-155 or miR-1 (Supplementary Fig. 10 and Supplementary Table 2). Although these results at earlier time points cannot rule out rapid destabilization as a consequence of translational repression, our results revealing such small decreases in translational efficiency for target mRNAs strongly imply that even if destabilization were secondary to translational repression, it would be this destabilization (that is, the reduced availability of mRNA for subsequent rounds of translation) that would exert the greatest impact on protein production. Moreover, miRNA-mediated mRNA
de-adenylation, which is the best-characterized mechanism of miRNA-mediated mRNA destabilization, can occur with or without translation of an ORF10,13,15,28, which suggests that the miRNA-mediated destabilization does not result from translational repression and indicates that translational repression could occur after the initial de-adenylation signal. Perhaps the miRNA-induced poly(A)-tail interactions that eventually trigger de-adenylation also cause the closed circular form of the mRNA to open up, thereby inhibiting translation initiation. This inhibition would occur before de-adenylation is complete, as polyadenylated mRNAs seem to be translationally repressed (Fig. 2d–f).

Another consideration is that, as done previously16–18, we equated mRNA destabilization to the loss of polyadenylated mRNA. Thus, transcripts that have lost their poly(A) tails might still be present but underrepresented in our mRNA-Seq of poly(A)-selected mRNA. In certain cell types, most notably oocytes, such transcripts can be stable and eventually be tailed by a cytoplasmic polyadenylation complex to become translationally competent29. In the typical somatic cell, however, de-adenylated transcripts are not translated and are instead rapidly de-capped and/or degraded. Thus, our consideration of de-adenylated transcripts as operational and functional equivalents of degraded transcripts seems appropriate. One possibility, though, is that mRNAs that were de-adenylated while being translated will yield some RPFs from ribosomes that initiated when the poly(A) tails were intact but will not yield mRNA-Seq tags. However, a narrowing of the differences between changes in RPFs and mRNA-Seq tags through this process is expected to have been very small, since the vast majority of RPFs should derive from mRNAs with poly(A) tails.

A way that our results might still be reconciled with the translation-repression scenario would be if ribosome profiling missed the bulk of translation repression because translation was repressed without reducing the density of ribosomes on the targeted messages, that is, if reduced initiation was coupled with correspondingly slower elongation. However, direct evidence for slower elongation has not been reported in any miRNA studies, and it seems unlikely that decreases in initiation and elongation rates would so frequently be so closely matched so as to yield such minor differences in apparent translational efficiency for so many messages. Moreover, translational repression without changes in ribosome density would cause the changes measured by proteomics to exceed those measured by ribosome profiling. The same would hold for cotranslational degradation of nascent polypeptides, another proposed mechanism for miRNA-mediated repression7,30. Arguing strongly against both of these possibilities, we found that changes measured by proteomics were not greater than those measured by ribosome profiling (Supplementary Fig. 11).

Although the changes we observed in translational efficiency were consistent with slightly reduced translation of the targeted messages, such changes could also occur without any miRNA-mediated translational repression. If some fraction of the polyadenylated mRNA was in a cellular compartment sequestered away from the compartment containing both miRNAs and ribosomes, then preferential destabilization of the mRNA in the miRNA/ribosome compartment would lead to an observed decrease in translational efficiency without a need to invoke translational repression. For example, to the extent that mature mRNAs awaiting transport
to the cytoplasm reside in the nucleus, where they presumably would not be subject to either miRNA-mediated destabilization or translation, the reduction of mRNA-Seq tags would not match the reduction of RPFs, and the more pronounced RPF reduction would indicate decreased ribosome density even in the absence of translational repression. Heterologous reporter mRNAs, some of which have lent support to the translational-repression scenario, might be particularly prone to nuclear accumulation. With this consideration in mind, the observed miRNA-dependent reductions in translational efficiency might be considered upper limits on the magnitude of translational repression.

Although we cannot determine the precise amount of miRNA-mediated translational repression, we can reliably say that the pervasive and dominant miRNA-mediated translational repression with persistence of repressed mRNAs, which had been widely anticipated, has not materialized. Instead, the outcome of regulation is predominantly mRNA destabilization, as first suggested by analyses of

Figure 4 | Ribosome and mRNA changes were uniform along the length of the ORFs. a, Ribosome and mRNA changes along the length of ORFs after introducing miR-155. mRNA segments of quantified genes were binned based on their distance from the first nucleotide of the start codon, with the boundaries of the segments chosen such that each bin contained the same number of nucleotides (Supplementary Fig. 8b). Binning was done separately for mRNAs with no miR-155 site and proteomics-supported miR-155 targets. Fold changes in RPFs and mRNA-Seq tags mapping to each bin were then plotted with respect to the median distance of the central nucleotide of each segment from the first nucleotide of the start codon. Changes in RPFs and mRNA-Seq tags for mRNAs with no site (grey and black, respectively) and for proteomics-supported targets (light and dark green, respectively) are shown. Only bins with read contribution from ≥20 genes are shown (see Supplementary Fig. 8b). The ANCOVA test for systematic change across the ORF length was performed by first calculating the differences between RPF changes and mRNA-Seq changes for each group of genes, fitting lines through these changes in translational efficiency, then testing for a difference between the resulting slopes. b, As in panel a, but plotting results for the miR-1 experiment. c, As in panel a, but plotting results for the miR-223 experiment. (Axes in all panels: read density fold change (log2) versus distance from start of ORF (nucleotides).)
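The correspondence summarized in Fig. 3 comes down to comparing per-gene log2 fold changes measured by the two assays and reporting the Pearson R². The snippet below is a minimal sketch of that comparison under simplified assumptions (illustrative count arrays, total-count normalization, and a pseudocount); it is not the authors' actual processing pipeline.

```python
# Minimal sketch (not the authors' pipeline): log2 fold changes for RPF and
# mRNA-Seq counts, and their Pearson R^2. Counts and pseudocount are illustrative.
import numpy as np
from scipy import stats

def log2_fold_change(treated, control, pseudocount=1.0):
    """log2 ratio of library-normalized read counts after adding a pseudocount."""
    treated = np.asarray(treated, dtype=float) + pseudocount
    control = np.asarray(control, dtype=float) + pseudocount
    # normalize each library to its total so fold changes are comparable
    return np.log2((treated / treated.sum()) / (control / control.sum()))

# Hypothetical per-gene read counts (miRNA-transfected vs. mock) for a few genes.
rpf_mirna,  rpf_mock  = [120,  80,  400,  60], [200, 150,  520, 100]
mrna_mirna, mrna_mock = [900, 600, 3000, 450], [1500, 1100, 3900, 700]

rpf_fc  = log2_fold_change(rpf_mirna, rpf_mock)
mrna_fc = log2_fold_change(mrna_mirna, mrna_mock)

r, _ = stats.pearsonr(mrna_fc, rpf_fc)
print(f"R^2 between mRNA-Seq and RPF fold changes: {r**2:.2f}")
```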
Molecular Cloning Technical Guide

Table of Contents
3–5 Cloning Workflow Descriptions
5 Recombinational Cloning
8 cDNA Synthesis
9 Restriction Enzyme Digestion (9 Protocol, 9 Tips for Optimization, 10–15 Performance Chart)
16–17 PCR (16 Protocol, 16 Tips for Optimization, 17 Product Selection)
22–23 Vector and Insert Joining
22–23 DNA Ligation (22 Protocol, 22 Tips for Optimization, 23 Product Selection)
24 Transformation

Molecular Cloning Overview
Molecular cloning refers to the process by which recombinant DNA molecules are produced and transformed into a host organism, where they are replicated. A molecular cloning reaction is usually comprised of two components: 1. The DNA fragment of interest to be replicated 2. A vector/plasmid backbone that contains all the components for replication in the host DNA of interest, such as a gene, regulatory element(s), operon, etc., is prepared for cloning by either excising it out of the source DNA using restriction enzymes, copying it using PCR, or assembling it from individual oligonucleotides. At the same time, a plasmid vector is prepared in a linear form using restriction enzymes (REs) or Polymerase Chain Reaction (PCR). The plasmid is a small, circular piece of DNA that is replicated within the host and exists separately from the host’s chromosomal or genomic DNA. By physically joining the DNA of interest to the plasmid vector through phosphodiester bonds, the DNA of interest becomes part of the new recombinant plasmid and is replicated by the host. Plasmid vectors allow the DNA of interest to be copied easily in large amounts, and often provide the necessary control elements to be used to direct transcription and translation of the cloned DNA. As such, they have become the workhorse for many molecular methods such as protein expression, gene expression studies, and functional analysis of biomolecules. During the cloning process, the ends of the DNA of interest and the vector have to be modified to make them compatible for joining through the action of a DNA ligase, recombinase, or an in vivo DNA repair mechanism. These steps typically utilize enzymes such as nucleases, phosphatases, kinases and/or ligases. Many cloning methodologies and, more recently kits have been developed to simplify and standardize these processes. This technical guide will clarify the differences between the various cloning methods, identify NEB® products available for each method, and provide expert-tested protocols and FAQs to help you troubleshoot your experiments.
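As a purely in-silico illustration of the workflow just described (cut the DNA of interest out of a source sequence with a restriction enzyme, then join it to a linearized vector), the sketch below simulates an EcoRI digest and a naive join. The sequences are hypothetical, sticky-end chemistry is not modeled, and real cloning design would rely on dedicated tools rather than this toy example.

```python
# Toy in-silico digest-and-join, mirroring the cut/join workflow described above.
# Sequences and the single enzyme choice are assumptions for the example only.

ECORI_SITE = "GAATTC"   # EcoRI recognition sequence; cuts between G and AATTC (G^AATTC)

def ecori_fragments(seq: str):
    """Fragments produced by cutting seq at every EcoRI site (blunt-ended
    simplification: the sticky-end overhangs are not modeled here)."""
    fragments, start = [], 0
    pos = seq.find(ECORI_SITE)
    while pos != -1:
        cut = pos + 1                      # cut position just after the first G
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(ECORI_SITE, pos + 1)
    fragments.append(seq[start:])
    return fragments

source_dna = "ATGAAAGAATTCTTGGGCCCGAATTCTTAA"   # hypothetical source carrying the insert
vector     = "GGGAATTCCC"                       # hypothetical vector stub with one site

insert = ecori_fragments(source_dna)[1]         # fragment between the two EcoRI sites
v_left, v_right = ecori_fragments(vector)       # vector opened at its single site
recombinant = v_left + insert + v_right         # naive "ligation" of insert into vector
print(insert, recombinant)
```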
Tsinghua University Graduate Dissertation Format Template and Requirements
The fonts and sizes recommended for editing and typesetting a Tsinghua University doctoral dissertation should match the size of the body-text paragraphs; when editing mathematical formulas in Word 2000, the size definitions below are recommended.

Sample Tsinghua doctoral dissertation format (Chinese cover): High-Temperature Hydrolysis Characteristics of Aromatic Heterocyclic Polymers and Quantum-Chemical Studies (dissertation submitted in application for the degree of Doctor of Natural Science, Tsinghua University). Training unit: Department of Chemistry, Tsinghua University. Major: Physical Chemistry. Candidate: Yi Moumou. Supervisors: Prof. Mou Jiajia and Prof. Mou Yiyi (placeholder names). Associate supervisor: (blank). The dissertation title and author name on the Chinese cover are set in size-3 Fangsong or STFangsong.

Sample English title page: Experimental and Theoretical Investigations of Hydrolytic Stability of Aromatic Heterocyclic Polymers in High Temperature. Dissertation Submitted to Tsinghua University in partial fulfillment of the requirement for the degree of Doctor of Natural Science by Dong-ming YI (Physical Chemistry). Dissertation Supervisor: Professor Yong-chang TANG. Associate Supervisor: Professor Da-long WU. April, 2001.

Abstract (Chinese)

The dissertation reports an experimental study of the Rydberg states of alkaline-earth-metal monohalides using two detection techniques: resonance-enhanced multiphoton ionization and ion-dip spectroscopy.
The main results are as follows. (1) Predissociating Rydberg states of CaCl with intermediate effective principal quantum numbers were observed for the first time: in the region n* = 5–7, five core-penetrating ²Σ⁺ Rydberg states not previously reported in the literature were identified, filling a gap in the study of CaCl Rydberg states in this region and providing an important step toward a complete analysis of the CaCl Rydberg-state structure and a complete picture of its electronic states. (2) Theoretical analysis showed that these states are strongly predissociated because of their interaction with a ²Σ⁺ continuum state. The potential-energy curve of this ²Σ⁺ continuum state in the 45,000–47,500 cm⁻¹ range was fitted from the experimentally measured predissociation linewidths, and it explains the predissociation behaviour of these Rydberg states well. (3) Several Rydberg states with anomalously small rotational constants were also observed; they may be fragments of core-nonpenetrating Rydberg states.
State of art: fedratinib for treatment of patients with myelofibrosis

Int J Blood Transfus Hematol, September 2020, Vol. 43, No. 5 (Special Forum)

Zhang Yudi, Xiao Zhijian
State Key Laboratory of Experimental Hematology, National Clinical Research Center for Blood Diseases, Institute of Hematology and Blood Diseases Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Tianjin 300020, China
Corresponding author: Xiao Zhijian, Email: zjxiao@ihcams.ac.cn

【Abstract】Myelofibrosis (MF) is one of the Ph-negative myeloproliferative neoplasms (MPN). It can present as primary myelofibrosis (PMF), post-polycythemia vera (post-PV) MF and post-essential thrombocythemia (post-ET) MF. The JAK2V617F mutation is the main driver gene mutation in MF patients. Fedratinib, a second-generation inhibitor of the Janus kinase (JAK)/signal transducer and activator of transcription (STAT) signaling pathway, received approval from the United States Food and Drug Administration (FDA) in August 2019 for the treatment of adult patients with International Prognostic Scoring System (IPSS) intermediate-2 or high-risk MF. It can be used both as first-line therapy and as second-line therapy for patients who are resistant or intolerant to the first-generation JAK/STAT signaling pathway inhibitor ruxolitinib. This article reviews the latest research progress on the pharmacological properties of fedratinib, its mechanism of action in the treatment of MF, its clinical efficacy, treatment-related adverse reactions and mechanisms of drug resistance.
【Key words】Primary myelofibrosis; Janus kinases; Protein kinase inhibitors; Drug resistance; Fedratinib
Fund programs: National Natural Science Foundation of China (81530008, 81770129); CAMS Initiative Fund for Medical Sciences (2016-I2M-1-001); Tianjin Key Natural Science Funds (15ZXLCSY00010)
DOI: 10.3760/cma.j.cn511693-20200511-00099

Myelofibrosis (MF) belongs to the myeloproliferative neoplasms (MPN) and comprises primary myelofibrosis (PMF), post-polycythemia vera (post-PV) MF and post-essential thrombocythemia (post-ET) MF.
A Survey of Content Based 3D Shape Retrieval Methods
A Survey of Content Based3D Shape Retrieval MethodsJohan W.H.Tangelder and Remco C.VeltkampInstitute of Information and Computing Sciences,Utrecht University hanst@cs.uu.nl,Remco.Veltkamp@cs.uu.nlAbstractRecent developments in techniques for modeling,digitiz-ing and visualizing3D shapes has led to an explosion in the number of available3D models on the Internet and in domain-specific databases.This has led to the development of3D shape retrieval systems that,given a query object, retrieve similar3D objects.For visualization,3D shapes are often represented as a surface,in particular polygo-nal meshes,for example in VRML format.Often these mod-els contain holes,intersecting polygons,are not manifold, and do not enclose a volume unambiguously.On the con-trary,3D volume models,such as solid models produced by CAD systems,or voxels models,enclose a volume prop-erly.This paper surveys the literature on methods for con-tent based3D retrieval,taking into account the applicabil-ity to surface models as well as to volume models.The meth-ods are evaluated with respect to several requirements of content based3D shape retrieval,such as:(1)shape repre-sentation requirements,(2)properties of dissimilarity mea-sures,(3)efficiency,(4)discrimination abilities,(5)ability to perform partial matching,(6)robustness,and(7)neces-sity of pose normalization.Finally,the advantages and lim-its of the several approaches in content based3D shape re-trieval are discussed.1.IntroductionThe advancement of modeling,digitizing and visualizing techniques for3D shapes has led to an increasing amount of3D models,both on the Internet and in domain-specific databases.This has led to the development of thefirst exper-imental search engines for3D shapes,such as the3D model search engine at Princeton university[2,57],the3D model retrieval system at the National Taiwan University[1,17], the Ogden IV system at the National Institute of Multimedia Education,Japan[62,77],the3D retrieval engine at Utrecht University[4,78],and the3D model similarity search en-gine at the University of Konstanz[3,84].Laser scanning has been applied to obtain archives recording cultural heritage like the Digital Michelan-gelo Project[25,48],and the Stanford Digital Formae Urbis Romae Project[75].Furthermore,archives contain-ing domain-specific shape models are now accessible by the Internet.Examples are the National Design Repos-itory,an online repository of CAD models[59,68], and the Protein Data Bank,an online archive of struc-tural data of biological macromolecules[10,80].Unlike text documents,3D models are not easily re-trieved.Attempting tofind a3D model using textual an-notation and a conventional text-based search engine would not work in many cases.The annotations added by human beings depend on language,culture,age,sex,and other fac-tors.They may be too limited or ambiguous.In contrast, content based3D shape retrieval methods,that use shape properties of the3D models to search for similar models, work better than text based methods[58].Matching is the process of determining how similar two shapes are.This is often done by computing a distance.A complementary process is indexing.In this paper,indexing is understood as the process of building a datastructure to speed up the search.Note that the term indexing is also of-ten used for the identification of features in models,or mul-timedia documents in general.Retrieval is the process of searching and delivering the query results.Matching and in-dexing are often part of the retrieval process.Recently,a 
lot of researchers have investigated the spe-cific problem of content based3D shape retrieval.Also,an extensive amount of literature can be found in the related fields of computer vision,object recognition and geomet-ric modelling.Survey papers to this literature have been provided by Besl and Jain[11],Loncaric[50]and Camp-bell and Flynn[16].For an overview of2D shape match-ing methods we refer the reader to the paper by Veltkamp [82].Unfortunately,most2D methods do not generalize di-rectly to3D model matching.Work in progress by Iyer et al.[40]provides an extensive overview of3D shape search-ing techniques.Atmosukarto and Naval[6]describe a num-ber of3D model retrieval systems and methods,but do not provide a categorization and evaluation.In contrast,this paper evaluates3D shape retrieval meth-ods with respect to several requirements on content based 3D shape retrieval,such as:(1)shape representation re-quirements,(2)properties of dissimilarity measures,(3)ef-ficiency,(4)discrimination abilities,(5)ability to perform partial matching,(6)robustness,and(7)necessity of posenormalization.In section2we discuss several aspects of3D shape retrieval.The literature on3D shape matching meth-ods is discussed in section3and evaluated in section4. 2.3D shape retrieval aspectsIn this section we discuss several issues related to3D shape retrieval.2.1.3D shape retrieval frameworkAt a conceptual level,a typical3D shape retrieval frame-work as illustrated byfig.1consists of a database with an index structure created offline and an online query engine. Each3D model has to be identified with a shape descrip-tor,providing a compact overall description of the shape. To efficiently search a large collection online,an indexing data structure and searching algorithm should be available. The online query engine computes the query descriptor,and models similar to the query model are retrieved by match-ing descriptors to the query descriptor from the index struc-ture of the database.The similarity between two descriptors is quantified by a dissimilarity measure.Three approaches can be distinguished to provide a query object:(1)browsing to select a new query object from the obtained results,(2) a direct query by providing a query descriptor,(3)query by example by providing an existing3D model or by creating a3D shape query from scratch using a3D tool or sketch-ing2D projections of the3D model.Finally,the retrieved models can be visualized.2.2.Shape representationsAn important issue is the type of shape representation(s) that a shape retrieval system accepts.Most of the3D models found on the World Wide Web are meshes defined in afile format supporting visual appearance.Currently,the most common format used for this purpose is the Virtual Real-ity Modeling Language(VRML)format.Since these mod-els have been designed for visualization,they often contain only geometry and appearance attributes.In particular,they are represented by“polygon soups”,consisting of unorga-nized sets of polygons.Also,in general these models are not“watertight”meshes,i.e.they do not enclose a volume. 
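Whether a mesh "encloses a volume" can be screened directly from its connectivity: in a closed manifold triangle mesh, every edge is shared by exactly two faces, whereas a polygon soup typically violates this. The following is a minimal sketch of such a check under that simplified criterion (it ignores face orientation and self-intersections); the tetrahedron used as input is only an illustrative example.

```python
from collections import Counter

def is_watertight(triangles):
    """True if every undirected edge is shared by exactly two triangles.
    Simplified closedness test: face orientation and self-intersections
    are not checked."""
    edge_count = Counter()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            edge_count[frozenset((u, v))] += 1
    return all(n == 2 for n in edge_count.values())

# A tetrahedron over vertices 0..3 is closed...
tetra = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
# ...while removing one face leaves an open, "polygon soup"-like surface.
open_surface = tetra[:3]

print(is_watertight(tetra))         # True
print(is_watertight(open_surface))  # False
```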
By contrast, for volume models, retrieval methods depending on a properly defined volume can be applied.

2.3. Measuring similarity

In order to measure how similar two objects are, it is necessary to compute distances between pairs of descriptors using a dissimilarity measure. Although the term similarity is often used, dissimilarity corresponds to the notion of distance: small distances mean small dissimilarity, and thus large similarity. A dissimilarity measure can be formalized by a function defined on pairs of descriptors indicating the degree of their resemblance. Formally speaking, a dissimilarity measure d on a set S is a non-negative valued function d: S × S → R⁺ ∪ {0}. Function d may have some of the following properties:

i. Identity: For all x ∈ S, d(x, x) = 0.
ii. Positivity: For all x ≠ y in S, d(x, y) > 0.
iii. Symmetry: For all x, y ∈ S, d(x, y) = d(y, x).
iv. Triangle inequality: For all x, y, z ∈ S, d(x, z) ≤ d(x, y) + d(y, z).
v. Transformation invariance: For a chosen transformation group G, for all x, y ∈ S, g ∈ G, d(g(x), g(y)) = d(x, y).

The identity property says that a shape is completely similar to itself, while the positivity property claims that different shapes are never completely similar. This property is very strong for a high-level shape descriptor, and is often not satisfied. However, this is not a severe drawback if the loss of uniqueness depends on negligible details. Symmetry is not always wanted. Indeed, human perception does not always find that shape x is equally similar to shape y, as y is to x. In particular, a variant x of prototype y is often found more similar to y than vice versa [81]. Dissimilarity measures for partial matching, giving a small distance d(x, y) if a part of x matches a part of y, do not obey the triangle inequality. Transformation invariance has to be satisfied if the comparison and the extraction process of shape descriptors have to be independent of the place, orientation and scale of the object in its Cartesian coordinate system. If we want a dissimilarity measure that is not affected by any transformation on x, then we may use as an alternative formulation for (v): Transformation invariance: For a chosen transformation group G, for all x, y ∈ S, g ∈ G, d(g(x), y) = d(x, y). When all the properties (i)-(iv) hold, the dissimilarity measure is called a metric. Other combinations are possible: a pseudo-metric is a dissimilarity measure that obeys (i), (iii) and (iv), while a semi-metric obeys only (i), (ii) and (iii). If a dissimilarity measure is a pseudo-metric, the triangle inequality can be applied to make retrieval more efficient [7, 83].
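The efficiency gain available when the dissimilarity measure is a (pseudo-)metric can be illustrated with simple pivot-based pruning: precomputed distances to a pivot give the lower bound |d(q, p) − d(x, p)| ≤ d(q, x), so candidates whose bound already exceeds the best distance found so far can be skipped. This is a generic sketch of that idea, not an algorithm from the cited references; the one-dimensional descriptors and the Euclidean metric are assumptions made for the example.

```python
import numpy as np

def pivot_filtered_nn(query, database, pivot, dist):
    """Nearest neighbour under a metric `dist`, skipping candidates ruled out by
    the triangle inequality via a single precomputed pivot."""
    d_qp = dist(query, pivot)
    d_xp = [dist(x, pivot) for x in database]      # typically precomputed offline
    best, best_d, evaluated = None, float("inf"), 0
    for x, dxp in zip(database, d_xp):
        if abs(d_qp - dxp) >= best_d:              # lower bound on dist(query, x)
            continue                               # pruned without a distance call
        evaluated += 1
        d = dist(query, x)
        if d < best_d:
            best, best_d = x, d
    return best, best_d, evaluated

euclid = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
db = [[float(i)] for i in range(1000)]             # toy 1-D descriptors
nn, d, evaluated = pivot_filtered_nn([412.3], db, pivot=[0.0], dist=euclid)
print(nn, d, evaluated)                            # fewer evaluations than a full scan
```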
2.4. Efficiency

For large shape collections, it is inefficient to sequentially match all objects in the database with the query object. Because retrieval should be fast, efficient indexing search structures are needed to support efficient retrieval. Since for query by example the shape descriptor is computed online, it is reasonable to require that the shape descriptor computation is fast enough for interactive querying.

2.5. Discriminative power

A shape descriptor should capture properties that discriminate objects well. However, the judgement of the similarity of the shapes of two 3D objects is somewhat subjective, depending on the user preference or the application at hand. For example, for solid modeling applications topology properties such as the number of holes in a model are often more important than minor differences in shape. On the contrary, if a user searches for models that look visually similar, the existence of a small hole in the model may be of no importance to the user.

2.6. Partial matching

In contrast to global shape matching, partial matching finds a shape of which a part is similar to a part of another shape. Partial matching can be applied if 3D shape models are not complete, e.g. for objects obtained by laser scanning from one or two directions only. Another application is the search for "3D scenes" containing an instance of the query object. Also, this feature can potentially give the user flexibility towards the matching problem, if parts of interest of an object can be selected or weighted by the user.

2.7. Robustness

It is often desirable that a shape descriptor is insensitive to noise and small extra features, and robust against arbitrary topological degeneracies, e.g. if it is obtained by laser scanning. Also, if a model is given in multiple levels of detail, representations of different levels should not differ significantly from the original model.

2.8. Pose normalization

In the absence of prior knowledge, 3D models have arbitrary scale, orientation and position in 3D space. Because not all dissimilarity measures are invariant under rotation and translation, it may be necessary to place the 3D models into a canonical coordinate system. This should be the same for a translated, rotated or scaled copy of the model. A natural choice is to first translate the center to the origin. For volume models it is natural to translate the center of mass to the origin. But for meshes this is in general not possible, because they do not have to enclose a volume. For meshes an alternative is to translate the center of mass of all the faces to the origin. For example, the Principal Component Analysis (PCA) method computes for each model the principal axes of inertia e1, e2 and e3 and their eigenvalues λ1, λ2 and λ3, and applies the conditions necessary to obtain a right-handed coordinate system. These principal axes define an orthogonal coordinate system (e1, e2, e3), with λ1 ≥ λ2 ≥ λ3. Next, the polyhedral model is rotated around the origin such that the coordinate system (ex, ey, ez) coincides with the coordinate system (e1, e2, e3). The PCA algorithm for pose estimation is fairly simple and efficient. However, if the eigenvalues are equal, principal axes may switch without affecting the eigenvalues. Similar eigenvalues may imply an almost symmetrical mass distribution around an axis (e.g. nearly cylindrical shapes) or around the center of mass (e.g. nearly spherical shapes). Fig. 2 illustrates the problem.
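The PCA-based pose normalization described above can be sketched in a few lines: translate the centroid to the origin, diagonalize the covariance of the point set, and rotate so that the principal axes coincide with the coordinate axes. The sketch below uses a plain vertex-based PCA on a point cloud; it illustrates the idea rather than the exact procedure of any cited method (no area weighting, and the ambiguity for nearly equal eigenvalues noted above is not resolved).

```python
import numpy as np

def pca_pose_normalize(points):
    """Translate the centroid to the origin and rotate so the principal axes of the
    point set align with x, y, z (largest variance first). Returns the normalized
    points and the rotation matrix used."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    cov = np.cov(centered.T)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]               # largest variance first
    axes = eigvecs[:, order]
    if np.linalg.det(axes) < 0:                     # keep a right-handed frame
        axes[:, 2] = -axes[:, 2]
    return centered @ axes, axes

# Toy example: an elongated, tilted, shifted cloud gets its long axis mapped onto x.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(500, 3)) * [5.0, 1.0, 0.2]                 # elongated along x
tilt = np.array([[0.8, -0.6, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]])  # rotation about z
normalized, rotation = pca_pose_normalize(cloud @ tilt.T + [10.0, -3.0, 7.0])
print(normalized.var(axis=0))    # variances now sorted from largest (x) to smallest (z)
```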
3. Shape matching methods

In this section we discuss 3D shape matching methods. We divide shape matching methods into three broad categories: (1) feature based methods, (2) graph based methods and (3) other methods. Fig. 3 illustrates a more detailed categorization of shape matching methods. Note that the classes of these methods are not completely disjoint. For instance, a graph-based shape descriptor, in some way, also describes the global feature distribution. From this point of view the taxonomy should be a graph.

3.1. Feature based methods

In the context of 3D shape matching, features denote geometric and topological properties of 3D shapes. So 3D shapes can be discriminated by measuring and comparing their features. Feature based methods can be divided into four categories according to the type of shape features used: (1) global features, (2) global feature distributions, (3) spatial maps, and (4) local features. Feature based methods from the first three categories represent features of a shape using a single descriptor consisting of a d-dimensional vector of values, where the dimension d is fixed for all shapes. The value of d can easily be a few hundred. The descriptor of a shape is a point in a high-dimensional space, and two shapes are considered to be similar if they are close in this space. Retrieving the k best matches for a 3D query model is equivalent to solving the k nearest neighbors problem. Using the Euclidean distance, matching feature descriptors can be done efficiently in practice by searching in multiple 1D spaces to solve the approximate k nearest neighbor problem, as shown by Indyk and Motwani [36]. In contrast with the feature based methods from the first three categories, local feature based methods describe, for a number of surface points, the 3D shape around the point. For this purpose, a descriptor is used for each surface point instead of a single descriptor.

3.1.1. Global feature based similarity

Global features characterize the global shape of a 3D model.
Examples of these features are the statistical moments of the boundary or the volume of the model, volume-to-surface ratio, or the Fourier transform of the volume or the boundary of the shape. Zhang and Chen [88] describe methods to compute global features such as volume, area, statistical moments, and Fourier transform coefficients efficiently. Paquet et al. [67] apply bounding boxes, cords-based, moments-based and wavelets-based descriptors for 3D shape matching. Corney et al. [21] introduce convex-hull based indices like hull crumpliness (the ratio of the object surface area and the surface area of its convex hull), hull packing (the percentage of the convex hull volume not occupied by the object), and hull compactness (the ratio of the cubed surface area of the hull and the squared volume of the convex hull). Kazhdan et al. [42] describe a reflective symmetry descriptor as a 2D function associating a measure of reflective symmetry to every plane (specified by 2 parameters) through the model's centroid. Every function value provides a measure of global shape, where peaks correspond to the planes near reflective symmetry, and valleys correspond to the planes of near anti-symmetry. Their experimental results show that the combination of the reflective symmetry descriptor with existing methods provides better results. Since only global features are used to characterize the overall shape of the objects, these methods are not very discriminative about object details, but their implementation is straightforward. Therefore, these methods can be used as an active filter, after which more detailed comparisons can be made, or they can be used in combination with other methods to improve results. Global feature methods are able to support user feedback as illustrated by the following research. Zhang and Chen [89] applied features such as volume-surface ratio, moment invariants and Fourier transform coefficients for 3D shape retrieval. They improve the retrieval performance by an active learning phase in which a human annotator assigns attributes such as airplane, car, body, and so on to a number of sample models. Elad et al. [28] use a moments-based classifier and a weighted Euclidean distance measure. Their method supports iterative and interactive database searching where the user can improve the weights of the distance measure by marking relevant search results.

3.1.2. Global feature distribution based similarity

The concept of global feature based similarity has been refined recently by comparing distributions of global features instead of the global features directly. Osada et al. [66] introduce and compare shape distributions, which measure properties based on distance, angle, area and volume measurements between random surface points. They evaluate the similarity between the objects using a pseudo-metric that measures distances between distributions. In their experiments the D2 shape distribution, measuring distances between random surface points, is most effective. Ohbuchi et al. [64] investigate shape histograms that are discretely parameterized along the principal axes of inertia of the model. The shape descriptor consists of three shape histograms: (1) the moment of inertia about the axis, (2) the average distance from the surface to the axis, and (3) the variance of the distance from the surface to the axis.
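The D2 shape distribution mentioned above reduces to sampling pairs of random points on the surface and histogramming their pairwise distances. The following minimal sketch does this for a triangle mesh; the area-weighted sampling routine, the number of point pairs, and the bin count are ordinary choices made for the example, not parameters taken from Osada et al.

```python
import numpy as np

def sample_surface_points(vertices, faces, n, rng):
    """Sample n points uniformly on a triangle mesh (area-weighted triangle choice,
    then uniform barycentric coordinates within the chosen triangle)."""
    v = np.asarray(vertices, dtype=float)
    tri = v[np.asarray(faces)]                              # (F, 3, 3)
    areas = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    idx = rng.choice(len(tri), size=n, p=areas / areas.sum())
    r1, r2 = rng.random(n), rng.random(n)
    s = np.sqrt(r1)
    a, b, c = (1 - s), s * (1 - r2), s * r2                 # uniform barycentric coords
    t = tri[idx]
    return a[:, None] * t[:, 0] + b[:, None] * t[:, 1] + c[:, None] * t[:, 2]

def d2_distribution(vertices, faces, n_pairs=10000, bins=64, seed=0):
    """Histogram of distances between random surface point pairs (the D2 idea)."""
    rng = np.random.default_rng(seed)
    p = sample_surface_points(vertices, faces, n_pairs, rng)
    q = sample_surface_points(vertices, faces, n_pairs, rng)
    d = np.linalg.norm(p - q, axis=1)
    hist, edges = np.histogram(d, bins=bins, range=(0.0, d.max()), density=True)
    return hist, edges

# Toy mesh: a unit tetrahedron (illustrative input only).
verts = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
hist, edges = d2_distribution(verts, faces)
print(hist[:5])
```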
Their experiments show that the axis-parameterized shape features work only well for shapes having some form of ro-tational symmetry.Ip et al.[37]investigate the application of shape distri-butions in the context of CAD and solid modeling.They re-fined Osada’s D2shape distribution function by classifying2random points as1)IN distances if the line segment con-necting the points lies complete inside the model,2)OUT distances if the line segment connecting the points lies com-plete outside the model,3)MIXED distances if the line seg-ment connecting the points lies passes both inside and out-side the model.Their dissimilarity measure is a weighted distance measure comparing D2,IN,OUT and MIXED dis-tributions.Since their method requires that a line segment can be classified as lying inside or outside the model it is required that the model defines a volume properly.There-fore it can be applied to volume models,but not to polyg-onal soups.Recently,Ip et al.[38]extend this approach with a technique to automatically categorize a large model database,given a categorization on a number of training ex-amples from the database.Ohbuchi et al.[63],investigate another extension of the D2shape distribution function,called the Absolute Angle-Distance histogram,parameterized by a parameter denot-ing the distance between two random points and by a pa-rameter denoting the angle between the surfaces on which two random points are located.The latter parameter is ac-tually computed as an inner product of the surface normal vectors.In their evaluation experiment this shape distribu-tion function outperformed the D2distribution function at about1.5times higher computational costs.Ohbuchi et al.[65]improved this method further by a multi-resolution ap-proach computing a number of alpha-shapes at different scales,and computing for each alpha-shape their Absolute Angle-Distance descriptor.Their experimental results show that this approach outperforms the Angle-Distance descrip-tor at the cost of high processing time needed to compute the alpha-shapes.Shape distributions distinguish models in broad cate-gories very well:aircraft,boats,people,animals,etc.How-ever,they perform often poorly when having to discrimi-nate between shapes that have similar gross shape proper-ties but vastly different detailed shape properties.3.1.3.Spatial map based similaritySpatial maps are representations that capture the spatial lo-cation of an object.The map entries correspond to physi-cal locations or sections of the object,and are arranged in a manner that preserves the relative positions of the features in an object.Spatial maps are in general not invariant to ro-tations,except for specially designed maps.Therefore,typ-ically a pose normalization is donefirst.Ankerst et al.[5]use shape histograms as a means of an-alyzing the similarity of3D molecular surfaces.The his-tograms are not built from volume elements but from uni-formly distributed surface points taken from the molecular surfaces.The shape histograms are defined on concentric shells and sectors around a model’s centroid and compare shapes using a quadratic form distance measure to compare the histograms taking into account the distances between the shape histogram bins.Vrani´c et al.[85]describe a surface by associating to each ray from the origin,the value equal to the distance to the last point of intersection of the model with the ray and compute spherical harmonics for this spherical extent func-tion.Spherical harmonics form a Fourier basis on a sphere much like the 
familiar sine and cosine do on a line or a cir-cle.Their method requires pose normalization to provide rotational invariance.Also,Yu et al.[86]propose a descrip-tor similar to a spherical extent function and a descriptor counting the number of intersections of a ray from the ori-gin with the model.In both cases the dissimilarity between two shapes is computed by the Euclidean distance of the Fourier transforms of the descriptors of the shapes.Their method requires pose normalization to provide rotational in-variance.Kazhdan et al.[43]present a general approach based on spherical harmonics to transform rotation dependent shape descriptors into rotation independent ones.Their method is applicable to a shape descriptor which is defined as either a collection of spherical functions or as a function on a voxel grid.In the latter case a collection of spherical functions is obtained from the function on the voxel grid by restricting the grid to concentric spheres.From the collection of spher-ical functions they compute a rotation invariant descriptor by(1)decomposing the function into its spherical harmon-ics,(2)summing the harmonics within each frequency,and computing the L2-norm for each frequency component.The resulting shape descriptor is a2D histogram indexed by ra-dius and frequency,which is invariant to rotations about the center of the mass.This approach offers an alternative for pose normalization,because their method obtains rotation invariant shape descriptors.Their experimental results show indeed that in general the performance of the obtained ro-tation independent shape descriptors is better than the cor-responding normalized descriptors.Their experiments in-clude the ray-based spherical harmonic descriptor proposed by Vrani´c et al.[85].Finally,note that their approach gen-eralizes the method to compute voxel-based spherical har-monics shape descriptor,described by Funkhouser et al.[30],which is defined as a binary function on the voxel grid, where the value at each voxel is given by the negatively ex-ponentiated Euclidean Distance Transform of the surface of a3D model.Novotni and Klein[61]present a method to compute 3D Zernike descriptors from voxelized models as natural extensions of spherical harmonics based descriptors.3D Zernike descriptors capture object coherence in the radial direction as well as in the direction along a sphere.Both 3D Zernike descriptors and spherical harmonics based de-scriptors achieve rotation invariance.However,by sampling the space only in radial direction the latter descriptors donot capture object coherence in the radial direction,as illus-trated byfig.4.The limited experiments comparing spherical harmonics and3D Zernike moments performed by Novotni and Klein show similar results for a class of planes,but better results for the3D Zernike descriptor for a class of chairs.Vrani´c[84]expects that voxelization is not a good idea, because manyfine details are lost in the voxel grid.There-fore,he compares his ray-based spherical harmonic method [85]and a variation of it using functions defined on concen-tric shells with the voxel-based spherical harmonics shape descriptor proposed by Funkhouser et al.[30].Also,Vrani´c et al.[85]accomplish pose normalization using the so-called continuous PCA algorithm.In the paper it is claimed that the continuous PCA is better as the conventional PCA and better as the weighted PCA,which takes into account the differing sizes of the triangles of a mesh.In contrast with Kazhdan’s experiments[43]the experiments by Vrani´c 
show that for ray-based spherical harmonics using the con-tinuous PCA without voxelization is better than using rota-tion invariant shape descriptors obtained using voxelization. Perhaps,these results are opposite to Kazhdan results,be-cause of the use of different methods to compute the PCA or the use of different databases or both.Kriegel et al.[46,47]investigate similarity for voxelized models.They obtain a spatial map by partitioning a voxel grid into disjoint cells which correspond to the histograms bins.They investigate three different spatial features asso-ciated with the grid cells:(1)volume features recording the fraction of voxels from the volume in each cell,(2) solid-angle features measuring the convexity of the volume boundary in each cell,(3)eigenvalue features estimating the eigenvalues obtained by the PCA applied to the voxels of the model in each cell[47],and a fourth method,using in-stead of grid cells,a moreflexible partition of the voxels by cover sequence features,which approximate the model by unions and differences of cuboids,each containing a number of voxels[46].Their experimental results show that the eigenvalue method and the cover sequence method out-perform the volume and solid-angle feature method.Their method requires pose normalization to provide rotational in-variance.Instead of representing a cover sequence with a single feature vector,Kriegel et al.[46]represent a cover sequence by a set of feature vectors.This approach allows an efficient comparison of two cover sequences,by compar-ing the two sets of feature vectors using a minimal match-ing distance.The spatial map based approaches show good retrieval results.But a drawback of these methods is that partial matching is not supported,because they do not encode the relation between the features and parts of an object.Fur-ther,these methods provide no feedback to the user about why shapes match.3.1.4.Local feature based similarityLocal feature based methods provide various approaches to take into account the surface shape in the neighbourhood of points on the boundary of the shape.Shum et al.[74]use a spherical coordinate system to map the surface curvature of3D objects to the unit sphere. 
By searching over a spherical rotation space a distance be-tween two curvature distributions is computed and used as a measure for the similarity of two objects.Unfortunately, the method is limited to objects which contain no holes, i.e.have genus zero.Zaharia and Prˆe teux[87]describe the 3D Shape Spectrum Descriptor,which is defined as the histogram of shape index values,calculated over an en-tire mesh.The shape index,first introduced by Koenderink [44],is defined as a function of the two principal curvatures on continuous surfaces.They present a method to compute these shape indices for meshes,byfitting a quadric surface through the centroids of the faces of a mesh.Unfortunately, their method requires a non-trivial preprocessing phase for meshes that are not topologically correct or not orientable.Chua and Jarvis[18]compute point signatures that accu-mulate surface information along a3D curve in the neigh-bourhood of a point.Johnson and Herbert[41]apply spin images that are2D histograms of the surface locations around a point.They apply spin images to recognize models in a cluttered3D scene.Due to the complexity of their rep-resentation[18,41]these methods are very difficult to ap-ply to3D shape matching.Also,it is not clear how to define a dissimilarity function that satisfies the triangle inequality.K¨o rtgen et al.[45]apply3D shape contexts for3D shape retrieval and matching.3D shape contexts are semi-local descriptions of object shape centered at points on the sur-face of the object,and are a natural extension of2D shape contexts introduced by Belongie et al.[9]for recognition in2D images.The shape context of a point p,is defined as a coarse histogram of the relative coordinates of the re-maining surface points.The bins of the histogram are de-。
FRAX 101 Sweep Frequency Response Analyzer User Manual
FRAX 101Sweep Frequency Response AnalyzerISmallest and most rugged FRA instrument in the industryIHighest possible repeatability by using reliable cable practice and high-performance instrumentation IFulfills all international standards for SFRA measurementsIHighest dynamic range and accuracy in the industryIWireless communication and battery operatedIAdvanced analysis and decision support built into the softwareIImports data from other FRA test setsFRAX 101Sweep Frequency Response AnalyzerDESCRIPTIONPower transformers are some of the most vital components in today’s transmission and distribution infrastructure.Transformer failures cost enormous amounts of money in unexpected outages and unscheduled maintenance. It is important to avoid these failures and make testing and diagnostics reliable and efficient.The FRAX 101 Sweep Frequency Response Analyzer (SFRA) detects potential mechanical and electricalproblems that other methods are unable to detect. Major utilities and service companies have used the FRA method for more than a decade. The measurement is easy to perform and will capture a unique “fingerprint” of the transformer. The measurement is compared to a reference “fingerprint” and gives a direct answer if the mechanical parts of the transformer are unchanged or not. Deviations indicate geometrical and/or electrical changes within the transformer.FRAX 101 detects problems such as:I Winding deformations and displacements I Shorted turns and open windings I Loosened clamping structures I Broken clamping structures I Core connection problems I Partial winding collapse I Faulty core grounds I Core movements IHoop bucklingAPPLICATIONPower transformers are specified to withstand mechanical forces from both transportation and in-service events, such as faults and lightning. However, mechanical forces may exceed specified limits during severe incidents or when the insulation’s mechanical strength has weakened due to aging. A relatively quick test where the fingerprintresponse is compared to a post event response allows for a reliable decision on whether the transformer safely can be put back into service or if further diagnostics is required.Collecting fingerprint data using Frequency Response Analysis (FRA) is an easy way to detect electro-mechanical problems in power transformers and an investment that will save time and money.1981Method BasicsA transformer consists of multiple capacitances,inductances and resistors, a very complex circuit that generates a unique fingerprint or signature when test signals are injected at discrete frequencies and responses are plotted as a curve.Capacitance is affected by the distance betweenconductors. Movements in the winding will consequently affect capacitances and change the shape of the curve.The SFRA method is based on comparisons between measured curves where variations are detected. One SFRA test consists of multiple sweeps and reveals if the transformer’s mechanical or electrical integrity has been jeopardized.Practical Application In its standard application, a “finger print” reference curvefor each winding is captured when the transformer is new or when it is in a known good condition. These curves can later be used as reference during maintenance tests or when there is reason to suspect a problem.The most reliable method is the time based comparison where curves are compared over time on measurements from the same transformer. Another method utilizes type based comparisons between “sister transformers” with the same design. 
Lastly, a construction based comparison can,under certain conditions, be used when comparingmeasurements between windings in the same transformer.These comparative tests can be performed 1) before and after transportation, 2) after severe through faults 3) before and after overhaul and 4) as diagnostic test if you suspect potential problems. One SFRA test can detect windingproblems that requires multiple tests with different kinds of test equipment or problems that cannot be detected with other techniques at all. The SFRA test presents a quick and cost effective way to assess if damages have occurred or if the transformer can safely be energized again. If there is a problem, the test result provides valuable information that can be used as decision support when determining further action.Having a reference measurement on a mission critical transformer when an incident has occurred is, therefore, a valuable investment as it will allow for an easier and more reliable analysis.Analysis and SoftwareAs a general guideline, shorted turns, magnetization and other problems related to the core alter the shape of the curve in the lowest frequencies. Medium frequencies represent axial or radial movements in the windings and high frequencies indicate problems involving the cables from the windings, to bushings and tap changers.FRAX 101Sweep Frequency Response AnalyzerAn example of low,medium and high frequenciesThe figure above shows a single phase transformer after a serviceoverhaul where, by mistake, the core ground never got connected (red),and after the core ground was properly connected (green). This potential problem clearly showed up at frequencies between 1 kHz and 10 kHz and a noticeable change is also visible in the 10 kHz - 200 kHz range.The FRAX Software provides numerous features to allow for efficient data analysis. Unlimited tests can be open at the same time and the user has full control on which sweeps to compare. The response can be viewed in traditional Magnitude vs. Frequency and/or Phase vs.Frequency view. The user can also choose to present the data in an Impedance or Admittance vs. Frequency view for powerful analysis on certain transformer types.FRAX 101Sweep Frequency Response AnalyzerTest Object Browser —Unlimited number of tests and sweeps. Full user control.Quick Select Tabs —Quickly change presentation view for differentperspectives and analysis tools.Quick Graph Buttons —Programmablegraph setting lets you change views quickly and easily.Sweep/Curve Settings —Every sweep can be individually turned on or off,change color,thickness and position.Dynamic Zoom —Zoom in and move your focus to any part of the curve.Operation Buttons —All essentialfunctions at your fingertips; select with mouse, function keys or touch screen.Automated analysis compares two curves using an algorithm that compare amplitude as well asfrequency shift and lets you know if the difference is severe, obvious, or light.Built-in-decision support is provided by using a built-inanalysis tool based on the international standard DL/T 911-2004.FRAX 101Sweep Frequency Response AnalyzerConsiderations When Performing SFRA MeasurementsSFRA measurements are compared over time or between different test objects. This accentuates the need to perform the test with the highest repeatability and eliminates the influence from external parameters such as cables,connections and instrument performance. 
FRAX offers all the necessary tools to ensure that the measured curve represents the internal condition of the transformer.Good Connections Bad connections cancompromise the test results which is why FRAX offers a rugged test clamp thatensures good connection to the bushings and solid connections to the instrument.Shortest Braid ConceptThe connection from the cable shield to ground has to be the same for every measurement on a given transformer.Traditional ground connections techniques have issues when it comes to providing repeatable conditions. This causes unwanted variations in the measured response for the highest frequencies that makes analysis difficult. The FRAX braid drops down from the connection clamp next to the insulating discs to the ground connection at the base of the bushing. This creates near identicalconditions every time you connect to a bushing whether it is tall or short.The Power of WirelessFRAX 101 uses class 1 Bluetooth ®wireless communication.Class 1 Bluetooth ®has up to 100 m range and is designed for industrial applications. An optional internal battery pack is available for full wireless flexibility. Shorter and more light-weight cables can be used when the user is liberated from cable communication and power supply cables.A standard USB interface (galvanically isolated) is included for users who prefer a direct connection to their PC. IMPORT AND EXPORTThe FRAX software can import data files from other FRA instruments making it possible to compare data obtained using another FRA unit. FRAX can import and export data according to the international XFRA standard format as well as standard CSV and TXT formats.Optimized Sweep SettingThe software offers the user an unmatched feature that allows for fast and efficient testing. Traditional SFRAsystems use a logarithmic spacing of measurement points.This results in as many test points between 20Hz and200Hz as between 200KHz and 2MHz and a relatively long measurement time.The frequency response from the transformer contains a few resonances in the low frequency range but a lot of resonances at higher frequencies. FRAX allows the user to specify less measurement points at lower frequencies and high measurement point density at higher frequencies.The result is a much faster sweep with greater detail where it is needed.Variable VoltageThe applied test voltage may affect the response at lower frequencies. Some FRA instruments do not use the 10 V peak-to-peak used by major manufacturers and this may complicate comparisons between tests. FRAX standard voltage is 10 V peak-to-peak but FRAX also allows the user to adjust the applied voltage to match the voltage used in a different test.FTB 101Several international FRA guides recommends to verify the integrity of cables and instrument before and after a test using a test circuit with a known FRA response supplied by the equipment manufacturer. FRAX comes with a field test box FTB101 as a standard accessory and allows the user to perform this important validation in the field at any time and secure measurement quality.The laptop can be operated by touch screen and the communication is wireless via Bluetooth. 
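A band-wise comparison of two sweeps, in the spirit of the automated amplitude-correlation analysis and the low/medium/high frequency interpretation described earlier in this datasheet, can be sketched as follows. The band limits, correlation thresholds, and verdict labels are illustrative assumptions for the example; they are not the actual criteria of DL/T 911-2004 or of the FRAX software.

```python
import numpy as np

# Illustrative band limits in Hz (assumed for the example, not the standard's values).
BANDS = {"low": (20, 1e3), "medium": (1e3, 1e5), "high": (1e5, 1e6)}

def band_correlation(freq, mag_ref, mag_new, f_lo, f_hi):
    """Pearson correlation of two magnitude curves (in dB) within one frequency band."""
    freq = np.asarray(freq, dtype=float)
    m = (freq >= f_lo) & (freq < f_hi)
    a, b = np.asarray(mag_ref)[m], np.asarray(mag_new)[m]
    if a.size < 2 or a.std() == 0 or b.std() == 0:
        return float("nan")
    return float(np.corrcoef(a, b)[0, 1])

def compare_sweeps(freq, mag_ref, mag_new):
    """Correlation per band plus a coarse, illustrative verdict."""
    result = {}
    for name, (lo, hi) in BANDS.items():
        rho = band_correlation(freq, mag_ref, mag_new, lo, hi)
        verdict = "ok" if rho > 0.998 else ("investigate" if rho > 0.99 else "deviation")
        result[name] = (rho, verdict)
    return result

# Hypothetical data: identical sweeps except a deviation injected above 100 kHz.
f = np.logspace(np.log10(20), 6, 400)
ref = -40 + 10 * np.sin(np.log10(f))
new = ref + np.where(f > 1e5, 1.5 * np.sin(f / 2e4), 0.0)
print(compare_sweeps(f, ref, new))
```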
Measurement ground braids connect close to the connection clamps and run next to the bushing to the flange connectionto avoid cable loops that otherwise affect the measurement.Contacts made with the C-clamp guarantee good connectionsFTB 101 Field Test BoxFRAX 101Sweep Frequency Response AnalyzerDYNAMIC RANGEMaking accurate measurements in a wide frequency range with high dynamics puts great demands on test equipment,test leads, and test set up. FRAX 101 is designed with these requirements in mind. It is rugged, able to filter induced interference and has the highest dynamic range andaccuracy in the industry. FRAX 101 dynamic range or noise floor is shown in red below with a normal transformer measurement in black. A wide dynamic range, low noise floor, allows for accurate measurements in everytransformer. A margin of about 20 dB from the lowest response to the instruments noise floor must be maintained to obtain ±1 dB accuracy.SPECIFICATIONSGeneral FRA Method: Sweep frequency (SFRA)Frequency Range:0.1 Hz - 25 MHz, user selectable Number of Points:Default 1046,User selectable up to 32,000Measurement time:Default 64 s, fast setting,37 s (20 Hz - 2 MHz)Points Spacing:Log., linear or both Dynamic Range/Noise Floor:>130dB Accuracy:±0.3 dB down to -105 dB(10 Hz - 10 MHz)IF Bandwidth/Integration Time:User selectable (10% default) Software:FRAX for Windows 2000/ XP/Vista PC Communication:Bluetooth and USB(galvanically isolated)Calibration Interval:Max 3 yearsStandards/guides:Fulfill requirements in CigréBrochure 342, 2008Mechanical condition assessment of transformer windings using FRA and Chinese standard DL/T 911-2004, FRA on winding deformation of powertransformers, as well as other international standards and recommendations Analog Output Channels:1Compliance Voltage:0.2 - 20 V peak-to-peak Measurement Voltage at 50 Ω:0.1 - 10 V peak-to-peak Output Impedance:50 ΩProtection:Short-circuit protected Analog Input Channels: 2Sampling:Simultaneously Input Impedance:50 ΩSampling Rate:100 MS/sPhysicalInstrument Weight:1.4 kg/3.1 lbs Case and Accessories Weight:15 kg/33 lbsDimensions:250 x 169 x 52 mm 9.84 x 6.65 x 2.05 in Dimensions with Case:520 x 460 x 220 mm 20.5 x 18.1 x 8.7 in.Input Voltage:11 - 16 V dc or 90 - 135 V ac and 170 - 264V ac, 47-63 Hz EnvironmentalOperating Ambient Temp: -20°C to +50°C /-4°F to +122°F Operating Relative Humidity:< 90% non-condensingStorage Ambient Temp:-20°C to 70°C / -4°F to +158°F Storage Relative Humidity:< 90% non-condensingCE Standards:IEC61010 (LVD) EN61326 (EMC)PC Requirements (PC not included)Operating System:Windows 2000/ XP / Vista Processor:Pentium 500 MHz Memory:256 Mb RAM or more Hard Drive:Minimum 30 Mb free Interface:Wireless or USB (client)An example of FRAX 101’s dynamic limit (red) and transformer measurement (black)FEATURES AND BENEFITSI Smallest and most rugged FRA instrument in the industry.IGuaranteed repeatability by using superior cablingtechnology, thus avoiding the introduction of error due to cable connection and positioning (which is common in other FRA manufacturers’ equipment).IFulfills all international standards for Sweep Frequency Response Analysis (SFRA) measurements.IHighest dynamic range and accuracy in the industry allowing even the most subtle electro-mechanical changes within the transformer to be detected.IWireless communication allows easy operation without the inconvenience of cable hook up to a PC.IBattery input capability allows for easy operation without the need for mains voltage supply.IAdvanced analysis and support 
software tools allows for sound decision making with regard to further diagnostics analysis and/or transformer disposition.FRAX 101Sweep Frequency Response AnalyzerUKArchcliffe Road, Dover CT17 9EN EnglandT +44 (0) 1 304 502101 F +44 (0) 1 304 207342******************UNITED STATES 4271 Bronze WayDallas, TX 75237-1019 USA T 1800 723 2861 (USA only) T +1 214 333 3201 F +1 214 331 7399******************Registered to ISO 9001:2000 Cert. no. 10006.01FRAX101_DS_en_V01Megger is a registered trademark Specifications are subject to change without notice.OTHER TECHNICAL SALES OFFICES Täby SWEDEN, Norristown USA,Sydney AUSTRALIA, Toronto CANADA,Trappes FRANCE, Kingdom of BAHRAIN,Mumbai INDIA, Johannesburg SOUTH AFRICA, and Chonburi THAILANDFRAX cable set consists of double shielded high quality cables, braid for easy and reliable ground connection, and clamp for solid connections to the test object.OPTIONAL ACCESSORIESI The built-in battery pack offers flexibility when performing tests on or off the transformer.IThe Active Impedance Probe AIP 101 should be used when measuring grounded connections such as to the transformer tank or a bushing connected to thetransformer tank. AIP 101 ensures safe, accurate and easy measurements to ground.IThe Active Voltage Probe AVP 101 is designed formeasurements when higher input impedance is needed.AVP 101 can be used for measurements where up to 1M Ωinput impedance is required.Item (Qty)Cat. No.Optional Accessories Battery option, 4.8 Ah AC-90010Calibration setAC-90020 Active impedance probe AIP 101AC-90030Active voltage probe AVP 101AC-90040FRAX Demo box FDB 101AC-90050Field Demo Box FTB 101AC-90060Ground braid set, 4 x 3 m including clamps GC-30031FRAX Generator cable, 2xBNC, 9 m (30 ft)GC-30040FRAX Generator cable, 2xBNC, 18 m (59 ft)GC-30042FRAX Measure cable, 1xBNC, 9 m (30 ft)GC-30050FRAX Measure cable, 2xBNC, 18 m (59 ft)GC-30052FRAX C-clamp GC-80010FRAX for WindowsSA-AC101Item (Qty)Cat. No.FRAX 101complete with: ac/dc adapter,mains cable,ground cable 5 m (16 ft), transport case,USB cable, Bluetooth adapter, Windows software, 4x 3m (10 ft) ground braid set, 2 x C-clamp, field test box, generator cable 18 m (59 ft), measure cable 18 m (59 ft), manual AC-19090FRAX 101, incl. battery,complete with: ac/dc adapter,mains cable,ground cable 5 m (16 ft), transport case,USB cable, Bluetooth adapter,Windows software, 4x 3m (10 ft) ground braid set, 2 x C-clamp, field test box, generator cable 18 m (59 ft), measure cable 18 m (59 ft), battery pack, manual AC-19091ORDERING INFORMATION。
Yang Rongwu, Molecular Biology
Common sources
Contaminated solutions/buffers
1:1 phenol : chloroform or
25:24:1 phenol : chloroform : isoamyl alcohol
Phenol: denatures proteins, precipitates form at interface between aqueous and organic layer
III. DNA purification • Phenol extraction • Ethanol precipitation
IV. RNA work
What do we need DNA for?
• Detect, enumerate, clone genes
• Detect, enumerate species
• Detect/sequence specific DNA regions
• Create new DNA "constructs" (recombinant DNA)
steps
Making and using mRNA (1)
Top 10 sources of RNase contamination (Ambion Scientific website)
1) Ungloved hands
2) Tips and tubes
3) Water and buffers
4) Lab surfaces
5) Endogenous cellular RNases
6) RNA samples
7) Plasmid preps
8) RNA storage (slow action of small amounts of RNase)
9) Chemical nucleases (Mg2+, Ca2+ at 80°C for 5' or more)
10) Enzyme preparations
The Chinese Room: Just Say "No!"
To appear in the Proceedings of the 22nd Annual Cognitive Science Society Conference, (2000), NJ: LEA

Robert M. French
Quantitative Psychology and Cognitive Science
University of Liège
4000 Liège, Belgium
email: rfrench@ulg.ac.be

Abstract
It is time to view John Searle's Chinese Room thought experiment in a new light. The main focus of attention has always been on showing what is wrong (or right) with the argument, with the tacit assumption being that somehow there could be such a Room. In this article I argue that the debate should not focus on the question "If a person in the Room answered all the questions in perfect Chinese, while not understanding a word of Chinese, what would the implications of this be for strong AI?" Rather, the question should be, "Does the very idea of such a Room and a person in the Room who is able to answer questions in perfect Chinese while not understanding any Chinese make any sense at all?" And I believe that the answer, in parallel with recent arguments that claim that it would be impossible for a machine to pass the Turing Test unless it had experienced the world as we humans have, is no.

Introduction
Alan Turing's (1950) classic article on the Imitation Game provided an elegant operational definition of intelligence. His article is now exactly fifty years old and ranks, without question, as one of the most important scientific/philosophical papers of the twentieth century. The essence of the test proposed by Turing was that the ability to perfectly simulate unrestricted human conversation would constitute a sufficient criterion for intelligence. This way of defining intelligence, for better or for worse, was largely adopted as of the mid-1950's, implicitly if not explicitly, as the overarching goal of the nascent field of artificial intelligence (AI).

Thirty years after Turing's article appeared, John Searle (1980) put a new spin on Turing's original arguments. He developed a thought experiment, now called "The Chinese Room," which was a reformulation of Turing's original test and, in so doing, produced what is undoubtedly the second most widely read and hotly discussed paper in artificial intelligence. While Turing was optimistic about the possibility of creating intelligent programs in the foreseeable future, Searle concluded his article on precisely the opposite note: "...no [computer] program, by itself, is sufficient for intentionality." In short, Searle purported to have shown that real (human-like) intelligence was impossible for any program implemented on a computer.

In the present article I will begin by briefly presenting Searle's well-known transformation of the Turing Test. Unlike other critics of the Chinese Room argument, however, I will not take issue with Searle's argument per se. Rather, I will focus on the argument's central premise and will argue that the correct approach to the whole argument is simply to refuse to go beyond this premise, for it is, as I hope to show, untenable.

The Chinese Room
Instead of Turing's Imitation Game in which a computer in one room and a person in a separate room both attempt to convince an interrogator that they are human, Searle asks us to begin by imagining a closed room in which there is an English-speaker who knows no Chinese whatsoever. This room is full of symbolic rules specifying inputs and outputs, but, importantly, there are no translations in English to indicate to the person in the room the meaning of any Chinese symbol or string of symbols.
A native Chinese person outside the room writes questions — any questions — in Chinese on a piece of paper and sends them into the room. The English-speaker receives each question inside the Room then matches the symbols in the question with symbols in the rule-base. (This does not have to be a direct table matching of the string of symbols in the question with symbols in the rule base, but can include any type of look-up program, regardless of its structural complexity.) The English-speaker is blindly led through the maze of rules to a string of symbols that constitutes an answer to the question. He copies this answer on a piece of paper and sends it out of the room. The Chinese person on the outside of the room would see a perfect response, even though the English-speaker understood no Chinese whatsoever. The Chinese person would therefore be fooled into believing that the person inside the room understood perfect Chinese.Searle then compares the person in the room to a computer program and the symbolic rules that fill the room to the knowledge databases used by the computer program. In Searle’s thought experiment the person who is answering the questions in perfect written Chinese still has no knowledge of Chinese. Searle then applies the conclusion of his thought experiment to the general question of machine intelligence. He concludes that a computer program, however perfectly it managed to communicate in writing, thereby fooling all humanquestioners, would still not understand what it was writing, any more than the person in the Chinese Room understood any Chinese. Ergo, computer programs capable of true understanding are impossible.Searle’s Central PremiseBut this reasoning is based on a central premise that needs close scrutiny.Let us begin with a simple example. If someone began a line of reasoning thus: “Just for the sake of argument, let’s assume that cows are as big as the moon,” you would most likely reply, “Stop right there, I’m not interested in hearing the rest of your argument because cows are demonstrably NOT as big as the moon.” You would be justified in not allowing the person to continue to his conclusions because, as logical as any of his subsequent reasoning might be, any conclusion arising from his absurd premise would be unjustified.Now let us consider the central premise on which Searle’s argument hangs — namely, that there could be such a thing as a “Chinese Room” in which an English-only person could actually fool a native-Chinese questioner. I hope to show that this premise is no more plausible than the existence of lunar-sized cows and, as a result, we have no business allowing ourselves to be drawn into the rest of Searle’s argument, any more than when we were asked to accept that all cows were the size of the moon.Ironically, the arguments in the present paper support Searle’s point that symbolic AI is not sufficient to produce human-like intelligence, but do so not by comparing the person in the Chinese Room to a computer program, but rather by showing that the Chinese Room itself would be an impossibility for a symbol-based AI paradigm.Subcognitive Questioning andthe Turing TestTo understand why such a Room would be impossible, which would mean that the person in the Room could never fool the outside-the-Room questioner, we must look at an argument concerning the Turing Test first put forward by French (1988, 1990, 2000). French’s claim is that no machine that had not experienced life as we humans had could ever hope to pass the Turing Test. 
His demonstration involves showing just how hard it would be for a computer to consistently reply in a human-like manner to what he called "subcognitive" questions. Since Searle's Chinese Room argument is simply a reformulation of the Turing Test, we would expect to be able to apply these arguments to the Chinese Room as well, something which we will do later in this paper.

It is important to spend a moment reviewing the nature and the power of "subcognitive" questions. These are questions that are explicitly designed to provide a window on low-level (i.e., unconscious) cognitive or physical structure. By "low-level cognitive structure", we mean the subconscious associative network in human minds that consists of highly overlapping activatable representations of experience (French, 1990). Creating these questions and, especially, gathering the answers to them require a bit of preparation on the part of the Interrogator who will be administering the Turing Test.

The Interrogator in the Turing Test (or the Questioner in the Chinese Room) begins by preparing a long list of these questions — the Subcognitive Question List. To get answers to these questions, she ventures out into an English-language population and selects a representative sample of individuals from that population. She asks each person surveyed all the questions on her Subcognitive Question List and records their answers. The questions, along with the statistical range of answers to these questions, will be the basis for her Human Subcognitive Profile. Here are some of the questions on her list (French, 1988, 1990).

Questions using neologisms:
"On a scale of 0 (completely implausible) to 10 (completely plausible):
- Rate Flugblogs as a name Kellogg's would give to a new breakfast cereal.
- Rate Flugblogs as the name of a start-up computer company.
- Rate Flugblogs as the name of big, air-filled bags worn on the feet and used to walk across swamps.
- Rate Flugly as the name a child might give to a favorite teddy bear.
- Rate Flugly as the surname of a bank accountant in a W. C. Fields movie.
- Rate Flugly as the surname of a glamorous female movie star."
"Would you like it if someone called you a trubhead? (0 = not at all, ..., 10 = very much)"
"Which word do you find prettier: blutch or farfaletta?"

Note that the words flugblogs, flugly, trubhead, blutch and farfaletta are made-up. They will not be found in any dictionary and, yet, because of the uncountable influences, experiences and associations of a lifetime of hearing and using English, we are able to make judgments about these neologisms. And, most importantly, while these judgments may vary between individuals, their variation is not random. For example, the average rating of Flugly as the surname of a glamorous actress will most certainly fall below the average rating of Flugly as the name for a child's teddy bear. Why? Because English speakers, all of us, have grown up surrounded by roughly the same sea of sounds and associations that have gradually formed our impressions of the prettiness (or ugliness) of particular words or sounds. And while not all of these associations are identical, of course, they are similar enough to be able to make predictions about how, on average, English-speaking people will react to certain words and sounds.
This is precisely why Hollywood movie moguls gave the name "Cary Grant" to a suave and handsome actor born "Archibald Alexander Leach" and why "Henry Deutschendorf, Jr." was re-baptised "John Denver."

Questions using categories:
- Rate banana splits as medicine.
- Rate purses as weapons.
- Rate pens as weapons.
- Rate dry leaves as hiding places.

No dictionary definition of "dry leaves" will include in its definition "hiding place," and, yet, everyone who was ever a child where trees shed their leaves in the fall knows that piles of dry leaves make wonderful hiding places. But how could this information, and an infinite amount of information just like it that is based on our having experienced the world in a particular way, ever be explicitly programmed into a computer?

Questions relying on human physical sensations:
- Does holding a gulp of Coca-Cola in your mouth feel more like having pins-and-needles in your foot or having cold water poured on your head?
- Put your palms together, fingers outstretched and pressed together. Fold down your two middle fingers till the middle knuckles touch. Move the other four pairs of fingers. What happens to your other fingers? (Try it!)

We can imagine many more questions that would be designed to test not only for subcognitive associations, but for internal physical structure. These would include questions whose answers would arise, for example, from the spacing of a human's eyes, would be the results of little self-experiments involving tactile sensations on their bodies or sensations after running in place, and so on.

People's answers to subcognitive questions are the product of a lifetime of experiencing the world with our human bodies, our human behaviors (whether culturally or genetically engendered), our human desires and needs, etc. (See Harnad (1989) for a discussion of the closely related symbol grounding problem.)

I have asked people the question about Coca-Cola and pins-and-needles many times and they overwhelmingly respond that holding a soft-drink in their mouth feels more like having pins and needles in their foot than having cold water poured on them. Answering this question is dead easy for people who have a head and mouth, have drunk soft-drinks, have had cold water poured on their head, and have feet that occasionally fall asleep. But think of what it would take for a machine that had none of these to answer this question. How could the answer to this question be explicitly programmed into the machine? Perhaps (after reading this article) a programmer could put the question explicitly into the machine's database, but there are literally infinitely many questions of this sort and to program them all in would be impossible. A program that could answer questions like these in a human-like enough manner to pass a Turing Test would have had to have experienced the world in a way that was very similar to the way in which we had experienced the world. This would mean, among many other things, that it would have to have a body very much like ours, with hands like ours, with eyes where we had eyes, etc.
For example, if an otherwise perfectly intelligent robot had its eyes on its knees, this would result in detectably non-human associations for such activities as, say, praying in church, falling when riding a bicycle, playing soccer, or wearing pants.The moral of the story is that it doesn’t matter if we humans are confronted with made-up words or conceptual juxtapositions that never normally occur (e.g., dry leaves and hiding place), we can still respond and, moreover, our responses will show statistical regularities over the population. Thus, by surveying the population at large with an extensive set of these questions, we draw up a Human Subcognitive Profile for the population. It is precisely this subcognitive profile that could not be reproduced by a machine that had not experienced the world as the members of the sampled human population had. The Subcognitive Question List that was used to produce the Human Subcognitive Profile gives the well-prepared Interrogator a sure-fire tool for eliminating machines from a Turing test in which humans are also participating. The Interrogator would come to the Turing Test and ask both candidates the questions on her Subcognitive Question List. The candidate most closely matching the average answer profile from the human population will be the human.The English RoomNow let us see how this technique can be gainfully applied to Searle’s Chinese Room thought experiment. We will start by modifying Searle’s original Gedankenexperiment by switching the languages around. This, of course, has no real bearing on the argument itself, but it will make our argument easier to follow. We will assume that inside the Room there is a Chinese person (let’s call him Wu) who understands not a word of written English and outside the Room is a native speaker/writer of English (Sue). Sue sends into the Room questions written in English and Wu must produce the answers to these questions in English.Now, it turns out that Sue is not your average naive questioner, but has read many articles on the Turing Test, knows about subcognitive questions and is thoroughly familiar with John Searle’s argument. She also suspects that the person inside the (English) Room might not actually be able to read English and she sets out to prove her hunch.Sue will not only send into the Room questions like, “What is the capital of Cambodia?”, “Who painted The Mona Lisa?” or “Can fleas fly?” but will also ask a large number of “subcognitive questions.” Because the Room, like the computer in the Turing Test, had not experienced the world as we had and because it would be impossible to explicitly write down all of the rules necessary to answer subcognitive questions in general, the answers to the full range of subcognitive questions could not be contained in the lists of symbolic rules in the Room. Consequently, the person in the Room would be revealed not to speak English for exactly the same reason that the machine in the Turing Test would be revealed not to be a person.Take the simple example of non existent words like blutch or trubhead. These words are neologisms and would certainly be nowhere to be found in the symbolic rules in the English Room. Somehow, the Room would have to contain, in some symbolic form, information not only about all words, but also non-words as well. But the Room, if it is to be compared with a real computer, cannot be infinitely large, nor can we assume infinite fast search of the rule base (see Hofstadter & Dennett, 1981, for a discussion of this point). 
So, we have two closely related problems: First, and most crucially, how could the rules have gotten into the Room in the first place (a point that Searle simply ignores)? And secondly, the number of explicit symbolic rules would require essentially an infinite amount of space. And while rooms in thought experiments can perhaps be infinitely large, the computers that they are compared to cannot be.In other words, the moral of the story here, as it was for the machine trying to pass the Turing Test, is that no matter how many symbolic rules were in the English Room they would not be sufficient for someone who did not understand written English to fool a determined English questioner. And this is where the story should rightfully end. Searle has no business taking his argument any further — and, ironically, he doesn’t need to, since the necessary inadequacy of an such a Room, regardless of how many symbolic rules it contains, proves his point about the impossibility of achieving artificial intelligence in a traditional symbol-based framework. So, when Searle asks us to accept that the English-only human in his Chinese Room could reply in perfect written Chinese to questions written in Chinese, we must say, “That’s strictly impossible, so stop right there.”Shift in Perception of the Turing Test Let us once again return to the Turing Test to better understand the present argument.It is easy to forget just how high the optimism once ran for the rapid achievement of artificial intelligence. In 1958 when computers were still in their infancy and even high-level programming languages had only just been invented, Simon and Newell, two of the founders of the field of artificial intelligence, wrote, “...there are now in the world machines that think, that learn and that create. Moreover, their ability to do these things is going to increase rapidly until – in a visible future – the range of problems they can handle will be coextensive with the range to which the human mind has been applied.” (Simon & Newell, 1958). Marvin Minsky, head of the MIT AI Laboratory, wrote in 1967, “Within a generation the problem of creating ‘artificial intelligence’ will be substantially solved” (Minsky, 1967).During this period of initial optimism, the vast majority of the authors writing about the Turing Test tacitly accepted Turing’s premise that a machine might actually be able to be built that could pass the Test in the foreseeable future. The debate in the early days of AI, therefore, centered almost exclusively around the validity of Turing’s operational definition of intelligence — namely, did passing the Turing Test constitute a sufficient condition for intelligence or did it not? But researchers’ views on the possibility of achieving artificial intelligence shifted radically between the mid-1960’s and the early 1980’s. By 1982, for example, Minsky’s position regarding achieving artificial intelligence had undergone a radical shift from one of unbounded optimism 15 years earlier to a far more sober assessment of the situation: “The AI problem is one of the hardest ever undertaken by science” (Kolata, 1982). The perception of the Turing Test underwent a parallel shift. At least in part because of the great difficulties being experienced by AI, there was a growing realization of just how hard it would be for a machine to ever pass the Turing Test. 
Thus, instead of discussing whether or not a machine that had passed the Turing Test was really intelligent, the discussion shifted to the question of whether it would even be possible for any machine to pass such a test (Dennett, 1985; French, 1988, 1990; Crockett 1994; Harnad, 1989; for a review, see French, 2000).The Need for a Corresponding Shift in the Perception of the Chinese RoomA shift in emphasis identical to the one that has occurred for the Turing Test is now needed for Searle’s Chinese Room thought experiment. Searle’s article was published in pre-connectionist 1980, when traditional symbolic AI was still the dominant paradigm in the field. Many of the major difficulties facing symbolic AI had come to light, but in 1980 there was still little emphasis on the “sub-symbolic” side of things.But the growing difficulties that symbolic AI had in dealing with “sub-symbolic cognition” were responsible, at least in part, for the widespread appeal of the connectionist movement of the mid-1980’s. While several of the commentaries of Searle’s original article (Searle, 1980) briefly touch on the difficulties involved in actually creating a Chinese Room, none of them focus outright on the impossibility of the Chinese Room as described by Searle and reject the rest of the argument because of its impossible premise. But this rejection corresponds precisely to rejecting the idea that a machine (that had not experienced the world as we humans have) could ever pass the Turing Test, an idea that many people now accept. We are arguing for a parallel shift in emphasis for the Chinese Room Gedankenexperiment.Can the “Robot Reply” Help?It is necessary to explore for a moment the possibility that one could somehow fill the Chinese Room with all of the appropriate rules that would allow the non-Chinese-reading person to fool a non-holds-barred Chinese questioner. Where could rules come from that would allow the person in the Chinese Room to answer all of the in-coming questions in Chinese perfectly? One possible reply is a version of the Robot Reply (Searle, 1980). Since the rules couldn’t have been symbolic and couldn’t have been explicitly programmed in for the reasons outlined above (also see French, 1988, 1990), perhaps they could have been the product of a Robot that had experienced and interacted with the world as we humans would have, all the while generating rules that would be put in the Chinese Room.This is much closer to what would be required to have the appropriate “rules,” but still leaves open the question of how you could ever come up with such a Robot. The Robot would have to be able to interact seamlessly with the world, exactly as a Chinese person would, in order to have been able to produce all the “rules” (high-level and subcognitive) that would later allow the person in the Room to fool the Well-Prepared Questioner. But then we are back to square one, for creating such a robot amounts to creating a robot that would pass the Turing Test.The Chinese Room: a Simple Refutation It must be reiterated that when Searle is attacking the “strong AI” claim that machines processing strings of symbols are capable of doing what we humans call thinking, he is explicitly talking about programs implemented on computers. 
It is important not to ignore the fact, as some authors unfortunately have (e.g., Block, 1981), that computers are real machines of finite size and speed; they have neither infinite storage capacity nor infinite processing speed.Now consider the standard Chinese Room, i.e., the one in which the person inside the Room has no knowledge of Chinese and the Questioner outside the Room is Chinese. Now assume that the last character of the following question is distorted in an extremely phallic way, but in a way that nonetheless leaves the character completely readable to any reader of Chinese:“Would the last character of this sentence embarrass a very shy young woman?” In order to answer this question correctly — a trivially easy task for anyone who actually reads Chinese — the Chinese Room would have to contain rules that would not only allow the person to respond perfectly to all strings of Chinese characters that formed comprehensible questions, but also to the infinitely many possible legible distortions of those strings of characters. Combinatorial explosion brings the house down around the Chinese Room. (Remember, we are talking about real computers that can store a finite amount information and must retrieve it in a finite amount of time.)One might be tempted to reply, “The solution is to eliminate all distortions. Only standard fonts of Chinese characters are permitted.” But, of course, there are hundreds, probably thousands, of different fonts of characters in Chinese (Hofstadter, 1985) and it is completely unclear what would constitute “standard fonts.” In any event, one can sidestep even this problem.Consider an equivalent situation in English. It makes perfect sense to ask, “Which letter could be most easily distorted to look like a cloud: an ‘O’ or an ‘X’?”An overwhelming majority of people would, of course, reply “O”, even though clouds, superficially and theoretically, have virtually nothing in common with the letter “O”. But how could the symbolic rules in Searle’s Room possibly serve to answer this perfectly legitimate question? A theory of clouds contained in the rules certainly wouldn’t be of any help, because that would be about storms, wind, rain and meteorology. A theory or database of cloud forms would be of scant help either, since clouds are anything but two dimensional, much less round. Perhaps only if the machine/Room had grown up scrawling vaguely circular shapes on paper and calling them clouds in kindergarten and elementary school, then maybe it would be able to answer this question. But short of having had that experience, I see little hope of an a priori theory of correspondence between clouds and letters that would be of any help.ConclusionThe time has come to view John Searle’s Chinese Room thought experiment in a new light. Up until now, the main focus of attention has been on showing what is wrong (or right) with the argument, with the tacit assumption being that somehow there could be such a Room. This parallels the first forty years of discussionson the Turing Test, where virtually all discussion centered on the sufficiency of the Test as a criterion for machine intelligence, rather than whether any machine could ever actually pass it. However, as the overwhelming difficulties of AI gradually became apparent, the debate on the Turing Test shifted to whether or not any machine that had not experience the world as we had could ever actually pass the Turing Test. It is time for an equivalent shift in attention for Searle’s Chinese Room. 
The question should not be, "If a person in the Room answered all the questions in perfect Chinese, while not understanding a word of Chinese, what would the implications of this be for strong AI?" Rather, the question should be, "Does the very idea of such a Room and a person in the Room who is able to answer questions in perfect Chinese while not understanding any Chinese make any sense at all?" And I believe that the answer, in parallel with the impossibility of a machine passing the Turing Test, is no.

Acknowledgments
The present paper was supported in part by research grant IUAP P4/19 from the Belgian government.

References
Block, N. (1981) Psychologism and behaviourism. Philosophical Review, 90, 5-43.
Crockett, L. (1994) The Turing Test and the Frame Problem: AI's Mistaken Understanding of Intelligence. Ablex.
Davidson, D. (1990) Turing's test. In Karim A. Said et al. (eds.), Modelling the Mind. Oxford University Press, 1-11.
Dennett, D. (1985) Can machines think? In How We Know. (ed.) M. Shafto. Harper & Row.
French, R. M. (1988). Subcognitive Probing: Hard Questions for the Turing Test. Proceedings of the Tenth Annual Cognitive Science Society Conference, Hillsdale, NJ: LEA. 361-367.
French, R. M. (1990). Subcognition and the Limits of the Turing Test. Mind, 99(393), 53-65. Reprinted in: P. Millican & A. Clark (eds.). Machines and Thought: The Legacy of Alan Turing. Oxford, UK: Clarendon Press, 1996.
French, R. M. (2000). Peeking Behind the Screen: The Unsuspected Power of the Standard Turing Test. Journal of Experimental and Theoretical Artificial Intelligence. (in press).
French, R. M. (2000). The Turing Test: The First Fifty Years. Trends in Cognitive Sciences, 4(3), 115-122.
Harnad, S. (1989) Minds, machines and Searle. Journal of Experimental and Theoretical Artificial Intelligence, 1, 5-25.
Hofstadter, D. (1985). Variations on a Theme as the Crux of Creativity. In Metamagical Themas. New York, NY: Basic Books. p. 244.
Hofstadter, D. & Dennett, D. (1981). The Mind's I. New York, NY: Basic Books.
Kolata, G. (1982) How can computers get common sense? Science, 217, p. 1237.
Minsky, M. (1967) Computation: Finite and Infinite Machines. Prentice-Hall, p. 2.
Searle, J. R. (1980). Minds, brains, and programs. Behavioral and Brain Sciences, 3, 414-424.
Simon, H. and Newell, A. (1958) Heuristic problem solving: The next advance in operations research. Operations Research, 6.
Chapter 5, Section 5: Gene Cloning Techniques (teaching material)
5’ RACE
Nested PCR: a technique in which two sets of PCR primers (nested primers) are used to carry out two rounds of PCR amplification. In the first round, the outer primers generate an amplification product; this product is then amplified in a second round in the presence of the inner primers. Because nested PCR involves two successive amplifications, it reduces the chance of amplifying multiple target sites (since very few sequences are complementary to both sets of primers) and thus increases the sensitivity of detection; and because two pairs of primers must anneal to the template, it also increases the reliability of detection. It is generally applied to animal material, e.g. viruses, Treponema pallidum, HIV, tumor genes, etc.
Any gene associated with a particular phenotype can, in principle, be cloned by this method.
By constructing a genetic linkage map, the target gene is mapped to a specific locus on a chromosome, and closely linked RFLP or RAPD molecular markers are identified on both sides of it.
By analysing different ecotypes with various restriction enzymes and hybridization probes, the molecular markers closest to the target gene are identified; the DNA fragment lying between these two markers is then cloned and isolated by chromosome walking, and the target gene is identified on the basis of functional complementation.
Schematic diagram of gene cloning by chromosome walking
In RFLP mapping, linkage distance is calculated from the recombination frequency: 1 cM (centimorgan) corresponds to a recombination frequency of 1%. In the human genome, 1 cM ≈ 1000 kb; in Arabidopsis, 1 cM ≈ 290 kb; in wheat, 1 cM ≈ 3500 kb.
The rice brittle culm gene BC1 was obtained by map-based cloning.
A. BC1 was mapped on rice chromosome 3 (Chr3) between molecular markers C524a and RM16;
RACE is a technique invented by Frohman et al. (1988) that uses PCR to obtain cDNA end sequences. Starting from a known cDNA fragment, RACE extends the amplification towards both ends to obtain the complete 3' and 5' ends.
It is generally divided into 5' RACE and 3' RACE. 3' RACE is simpler: mRNA or total RNA is first reverse-transcribed with a poly(T) primer; since most genes carry a poly(A) tail, PCR with a specific primer (designed from the known sequence) and the poly(T) primer is then sufficient. 5' RACE is relatively more difficult, and several variants are currently in use. One is the (traditional) adapter-ligation approach, in which PCR is performed with an adapter primer and a user-designed specific primer; a nested PCR can be designed for a second round of amplification. Alternatively, inverse PCR can be used, in which the template is circularized by ligation before PCR.
Molecular Biology Question Bank (3)
School of Life Sciences, 2xxx-2xxx academic year, first semester examination, Paper A answers.

2. What's alternative mRNA processing? List the four types of alternative mRNA processing.
The same pre-mRNA can give rise to different mature mRNAs because it can be processed in different ways.
The main types of alternative mRNA processing are: use of different poly(A) sites, retention of some introns, removal of some exons, and RNA editing.
3. How did Meselson and Stahl prove that DNA replication is semi-conservative?
DNA was labelled with 15N (i.e. bacteria were grown in 15N medium until all of their DNA contained 15N) and the cells were then transferred to 14N medium. DNA was extracted at each generation and analysed by cesium chloride density-gradient centrifugation to determine its buoyant density; the observed pattern could only be explained by a semi-conservative replication mechanism.
4. What's DNA cloning? What are the major steps of DNA cloning?
DNA cloning is the process (or technique) of inserting a specific DNA fragment into a vector capable of self-replication and then transforming it into a host that can be propagated in large numbers.

5. How to screen a cDNA expression library to find target genes? List all the possible methods.
Immunological screening, DNA (probe) screening, colony PCR screening, EST sequencing, etc.

6. How does transcription in prokaryotes initiate?
The RNA polymerase holoenzyme correctly recognizes the promoter on the DNA template, forming a closed ternary initiation complex of enzyme, DNA and nucleoside triphosphates (NTPs); this converts to an open initiation complex, goes through several rounds of abortive synthesis, and once RNA synthesis succeeds, transcription proceeds from that point.
Strong replication in the GLOBDATA middleware

Luís Rodrigues, Hugo Miranda, Ricardo Almeida, João Martins, Pedro Vicente
Universidade de Lisboa
Faculdade de Ciências
Departamento de Informática
{ler, hmiranda, ralmeida, jmartins, pedrofrv}@di.fc.ul.pt

Abstract

GLOBDATA is a project that aims to design and implement a middleware tool offering the abstraction of a global object database repository. This tool, called COPLA, supports transactional access to geographically distributed persistent objects independent of their location. Additionally, it supports replication of data according to different consistency criteria. For this purpose, COPLA implements a number of consistency protocols offering different tradeoffs between performance and fault-tolerance. This paper presents the work on strong consistency protocols for the GLOBDATA system. Two protocols are presented: a voting protocol and a non-voting protocol. Both these protocols rely on the use of atomic broadcast as a building block to serialize conflicting transactions. The paper also introduces the total order protocol being developed to support large-scale replication.

1 Introduction

GLOBDATA [1] is a European IST project, started in November 2000, that aims to design and implement a middleware tool offering the abstraction of a global object database repository. The tool, called COPLA, supports transactional access to geographically distributed persistent objects independent of their location. Application programmers have an object-oriented view of the data repository and do not need to be concerned with how the objects are stored, distributed or replicated. The COPLA middleware supports the replication of data according to different consistency criteria. Each consistency criterion is implemented by one or more consistency protocols that offer different tradeoffs between performance and fault-tolerance.

This paper reports the work on strong consistency replication protocols for the GLOBDATA system that is being performed by the Distributed ALgorithms and Network Protocols (DIALNP) group at Universidade de Lisboa. Based on the previous work of [17, 16], two protocols are being implemented: a voting protocol and a non-voting protocol. Each of these protocols supports two variants, namely eager updates and deferred updates. The protocols are executed on top of an off-the-shelf relational database that is used to store the state of persistent objects and protocol control information. All protocols rely on the use of atomic broadcast as a building block to help serialize conflicting transactions. A specialized total order protocol is being implemented in the Appia system [14] to support replication in large scale. The atomic protocol inherits ideas from the hybrid protocol of [19].

The paper introduces the GLOBDATA architecture, and summarizes both the consistency protocols and the atomic multicast primitive that supports them. This paper is organized as follows: Section 2 describes the general COPLA architecture. Section 3 presents the consistency protocols. Section 4 presents the atomic multicast primitive that supports the protocols.
Section 5 presents some optimizations to the basic protocols. Section 6 discusses related work that studies database replication based on atomic broadcast primitives. Section 7 concludes this paper.

2 COPLA System Architecture

COPLA is a middleware tool that provides transparent access to a replicated repository of persistent objects. Replicas can be located on different nodes of a cluster, of a local area network, or spread across a wide area network spanning different geographic locations. To support a diversity of environments and workloads, COPLA provides a number of replica consistency protocols.

The main components of the COPLA architecture are depicted in Figure 1. The upper layer is a "client interface" module that provides the functionality used by the COPLA application programmer. The programmer has an object-oriented view of the persistent and distributed data: it uses a subset of Object Query Language [7] to obtain references to distributed objects. Objects can be concurrently accessed by different clients in the context of distributed transactions. For fault-tolerance, and to improve locality of read-only transactions, an object database may be replicated at different locations.

Several consistency protocols are supported by COPLA; the choice of the best protocol depends on the topology of the network and on the application's workload. To maintain the user interface code independent of the actual protocol being used, all protocols adhere to a common protocol interface (labeled CP-API in the figure). This allows COPLA to be configured according to the characteristics of the environment where it runs.

[Figure 1. COPLA architecture: client application, client interface, consistency protocols, Uniform Data Store (UDS) and communications module (atomic broadcast), connected through the CP-API, UDS-API and PER-API interfaces.]

The uniform data store (UDS) module (developed by the Universidad Pública de Navarra) is responsible for storing the state of the persistent objects in an off-the-shelf relational database management system (RDBMS). To perform this task, the UDS exports an interface, the UDS-API, through which objects can be stored and retrieved. It also converts all the queries posed by the application into normalized SQL queries.
Finally, the UDS is used to store, in a persistent way, the control information required by the consistency protocols. This control information is stored and accessed through a dedicated interface, the PER-API.

Architectural challenges. The GLOBDATA project is characterized by a unique combination of different requirements that makes the design of the consistency protocols a challenging task. Namely, GLOBDATA aims to satisfy the following requirements:

- Large-scale: the consistency protocols must support replication of objects in a geographically dispersed system, in which the nodes communicate through the Internet. This prevents the use of protocols that make use of specific network properties (such as the low-latency or network-order preservation properties of local-area networks [18]).
- RDBMS independence: a variety of commercial databases should be supported as the underlying data storage technology. This prevents the use of solutions that require adaptations to the database kernel.
- Protocol interchangeability: COPLA must be flexible enough to adapt to changing environment conditions, like the scale of the system, the availability of different communication facilities, and changes in the application's workload. Therefore it should allow the use of distinct consistency protocols, which can perform differently in different scenarios.
- Object-orientation: even if COPLA maps objects into a relational model, this operation must be isolated from the consistency protocols. In this way, the consistency algorithms are not tied to any specific object representation.

3 Strong Consistency Protocols

In GLOBDATA, the application programmer may trade fault-tolerance for performance. Therefore, a suite of protocols with different behavior in the presence of faults is being developed by different teams. Another project partner, the ITI, is developing a suite of protocols based on the notion of object ownership [15]: each node is the manager of the objects created in it, and is responsible for managing concurrent accesses to those objects. On the other hand, the DIALNP team at Universidade de Lisboa is developing two protocols that enforce strong consistency even in the presence of faults. In fact, the two protocols reported here can also be configured to trade reliability for performance, by implementing a deferred updates scheme. The strong consistency protocols rely extensively on the availability of a uniform atomic broadcast primitive. The implementation of this primitive will be addressed later in the paper.

3.1 Interaction Among Components

We now describe the strong consistency protocols designed for COPLA. Both protocols cooperate with the Uniform Data Store to obtain information about which objects are read or updated by each transaction.
This information, in the form of a list of unique object identifiers (OIDs), allows the protocols to have fine-grained information about which transactions conflict with each other. Since the consistency protocols only manipulate OIDs, they remain independent from the representation of objects in the database.

The COPLA transactional model. In COPLA, the execution of a transaction includes the following steps:

1. The programmer signals the system that a transaction is about to start.
2. The programmer makes a query to the database, using a subset of OQL. This query returns a collection of objects.
3. The returned objects are manipulated by the programmer using the functions exported by the client interface. These functions allow the application to update the values of object attributes, and to read new objects through object relations (object attributes that are references to other objects).
4. Steps 2-3 are repeated until the transaction is completed.
5. The programmer requests the system to commit the transaction.

Interaction with the consistency protocols. The common protocol interface basically exports two functions: a function that must be called by the application every time new objects are read by a transaction, and a function that must be called in order to commit the transaction.

The first function, which we call UDSAccess(), serves two main purposes: to make sure that the local copies of the objects are up-to-date (when using deferred updates, the most recent version may not be available locally); and to extract the state of the objects by calling the UDS (the access to the underlying database is not performed by the consistency protocol itself; it is a function of the UDS component). It should be noted that in the actual implementation this function is unfolded into a collection of similar functions covering different requests (attribute read, relationship read, query, etc.). For clarity of exposition, we make no distinction among these functions in the paper.

The second function, called commit(), is used by the application to commit the transaction. In response to this request the consistency protocol module has to coordinate with its remote peers to serialize conflicting transactions and to decide whether it is safe to commit the transaction or if it has to be aborted due to some conflict. In order to execute this phase, the consistency protocol requests the UDS module to provide the list of all objects updated by the current transaction. Additionally, the UDS also provides the consistency protocols with an opaque structure containing the state of the updated objects. It is the responsibility of the consistency protocol to propagate these updates to the remote nodes.
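To make the shape of this interface concrete, the following is a minimal sketch of the two CP-API entry points as they are described above. The Java names used here (ConsistencyProtocol, TransactionId, udsAccess, commit) are illustrative assumptions for this sketch, not the actual COPLA classes.

```java
import java.util.List;

/**
 * Hypothetical sketch of the common consistency-protocol interface (CP-API):
 * one call to report objects read from the UDS and one call to request the
 * commit of a transaction. Names and types are illustrative only.
 */
public interface ConsistencyProtocol {

    /**
     * Called whenever a transaction reads new objects from the UDS.
     * Ensures local copies are up to date (relevant in deferred-updates mode)
     * and records the read set of the transaction.
     *
     * @param tx   identifier of the local transaction
     * @param oids unique object identifiers (OIDs) of the objects just read
     */
    void udsAccess(TransactionId tx, List<String> oids);

    /**
     * Called when the application requests a commit. The protocol coordinates
     * with the remote replicas (via atomic broadcast) and returns the outcome.
     *
     * @return true if the transaction committed, false if it was aborted
     */
    boolean commit(TransactionId tx) throws InterruptedException;

    /** Opaque transaction handle used by this sketch. */
    final class TransactionId {
        private final long id;
        public TransactionId(long id) { this.id = id; }
        public long value() { return id; }
    }
}
```

Keeping the protocols behind such a narrow interface is what allows COPLA to exchange one consistency protocol for another without changing the client interface code.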
Replication strategies. Using the classification of database replication strategies introduced in [20], the strong consistency protocols of COPLA can be classified as belonging to the "update everywhere, constant interaction" class. They are "update everywhere" because they perform the updates to the data items in all replicas of the system. This approach was chosen because it is easier to deal with failures (since all nodes maintain their own copy of the data) and it does not create bottleneck points like the primary copy approach. They are "constant interaction" because the number of messages exchanged per transaction is fixed, independently of the number of operations in the transaction. Given that the cost of communication in most GLOBDATA configurations is expected to be high, this approach is much more efficient than a linear interaction approach. The protocols described below explore the third degree of freedom: the way transactions terminate (voting or non-voting).

Interaction with the atomic broadcast primitive. An atomic broadcast primitive broadcasts messages among a group of servers, guaranteeing atomic and ordered delivery of messages. Specifically, let m1 and m2 be two messages sent by atomic broadcast to a group of servers G. Atomic delivery guarantees that if a member of G delivers m1 (resp. m2), then all correct members of G deliver m1 (resp. m2). Ordered delivery guarantees that if any two members of G deliver m1 and m2, they deliver them in the same order. These two properties are used by both consistency protocols: the order property is used by the conflict resolution mechanism, and atomic delivery is used to simplify atomic commitment of transactions.

3.2 The Non-Voting Protocol

This protocol is a modification of the one described in [17], altered to use a version scheme for concurrency control [6], and adapted to the COPLA transactional model. The protocol uses the following control information for each object: a version number and a flag that states whether or not the local copy of this object is up-to-date. If an object is out-of-date, the identifier of the node that has the latest version of the object is also kept. Note that in the basic protocol, all replicas are up-to-date when a transaction commits. Only in the deferred updates mode is it possible that some replicas remain temporarily out of date. All this information is maintained in a consistency table, which is stored in persistent storage and is updated in the context of the same transaction that alters data (i.e., the consistency information is updated only if the transaction commits).

When an object is created, its version number is set to zero. Each time a transaction updates an object, and that transaction commits, the object's version number is incremented by one. This mechanism keeps version numbers synchronized across replicas, since the total order ensured by atomic broadcast causes all replicas to process transactions in the same order.

When enforcing serializability, two kinds of conflicts must be considered by the protocol: read/write conflicts and write/write conflicts. Read/write conflicts occur when one transaction reads an object and another concurrent transaction writes that same object. Write/write conflicts occur when two concurrent transactions write the same object. In GLOBDATA, all objects are read before they are written (as shown above in the COPLA transactional model), so a write/write conflict is also a read/write conflict. Considering these definitions, in the version-number concurrency control scheme, conflicting transactions are defined as follows:

Two transactions t1 and t2 conflict if t1 has read an object o with version v and, when t1 is about to commit, o's version number in the local database, v', is higher than v. That means that t1 has read data that was later modified (by a transaction t2 that modified o and committed before t1, thus increasing o's version number), and therefore t1 should be aborted.

The general outline of the non-voting algorithm is now presented:

1. All the transaction's operations are executed locally on the node where the transaction was initiated (this node is called the delegate node).
2. When the application requests a commit, the set of read objects and their version numbers, and the set of written objects, are sent to all nodes using the atomic broadcast primitive.
3. When a transaction is delivered by the atomic broadcast protocol, all servers verify that the received transaction does not conflict with other local running transactions. There is no conflict if the versions of the objects read by the arriving transaction are greater than or equal to the versions of those objects present in the local database. If no conflict is detected, then the transaction is committed; otherwise it is aborted. Since this procedure is deterministic, and all nodes, including the delegate node, receive transactions in the same order, all nodes reach the same decision about the outcome of the transaction. The delegate node can now inform the client application about the final outcome of the transaction.

Note that the last step is executed by all nodes, including the one that initiated the transaction.

Depicted in Figure 2 is a more detailed description of the algorithm. It is divided into two functions, corresponding to the interface previously described. Both functions accept the parameter t, the transaction to act upon. UDSAccess() also accepts a parameter, l, which is the list of objects that t has read from the UDS. Note that step 3 of the commit() function is executed by all nodes, including the delegate node.

UDSAccess(t, l):
1. Add the list of objects l to the list of objects read by transaction t.

commit(t):
1. Obtain from the UDS the list of objects read (with their version numbers) and the list of objects written by this transaction.
2. Send them through the atomic broadcast primitive.
3. When the message containing t is delivered by the atomic broadcast:
   (a) If t conflicts with some other transaction:
       i. Abort t.
   (b) else (consistent transaction):
       i. Abort all local transactions conflicting with t.
       ii. Commit the transaction.

Figure 2. Non-voting protocol

The algorithm uses the order given by atomic broadcast for serializing conflicting transactions: if a transaction is delivered and is consistent, it has priority over other running transactions. This implies that if there are two conflicting transactions, t1 and t2, and t1 is delivered before t2, then t1 will proceed and t2 will be marked as conflicting (in step 3(a)), because it has read stale data. The decision is taken in each node independently, but all nodes will reach the same decision, since it depends solely on the order of message delivery (which is guaranteed to be consistent at all replicas by the atomic broadcast protocol). When a commit is decided, the version numbers of the objects written by this transaction are incremented, and the UDS transaction is committed.

Note that, to improve performance, local running transactions that conflict with a consistent transaction are aborted in step 3(b). There is a conflict when the running transaction has read objects that the arriving transaction has written. This would cause the running transaction to carry old versions of read objects in its read set, which would cause it to be aborted later on, in step 3(a). This way an atomic broadcast message is spared.
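The conflict test of step 3 can be illustrated with a small sketch. The Java class below is a hypothetical, single-replica illustration of the version-number rule, assuming that transaction messages are handed to onDeliver() in the total order fixed by atomic broadcast; persistence, the UDS and the aborting of local running transactions are left out, and all names are invented for this sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * Minimal sketch of the non-voting commit rule at one replica. Messages are
 * assumed to arrive already in the order fixed by atomic broadcast, so the
 * (deterministic) decision below is the same at every replica.
 */
public class NonVotingReplica {

    /** Consistency table: current committed version of each object (by OID). */
    private final Map<String, Long> versionTable = new HashMap<>();

    /** A delivered transaction: the versions it read and the OIDs it wrote. */
    public static class TxMessage {
        final Map<String, Long> readVersions;  // OID -> version observed at read time
        final Set<String> writeSet;            // OIDs updated by the transaction
        public TxMessage(Map<String, Long> reads, Set<String> writes) {
            this.readVersions = reads;
            this.writeSet = writes;
        }
    }

    /**
     * Executed at every replica, in delivery order.
     * @return true if the transaction commits, false if it must abort.
     */
    public synchronized boolean onDeliver(TxMessage tx) {
        // Step 3(a): abort if any object read by tx has meanwhile been updated,
        // i.e. its committed version is now higher than the version that was read.
        for (Map.Entry<String, Long> read : tx.readVersions.entrySet()) {
            long committed = versionTable.getOrDefault(read.getKey(), 0L);
            if (committed > read.getValue()) {
                return false; // stale read: abort
            }
        }
        // Step 3(b): commit, incrementing the version of every written object.
        for (String oid : tx.writeSet) {
            versionTable.merge(oid, 1L, Long::sum);
        }
        // In the full protocol the updates would also be applied through the UDS
        // here, and conflicting local running transactions would be aborted.
        return true;
    }
}
```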
It consists in two phases,a write set broadcast phase,and a voting phase.The general outline of the algorithm is as follows:1.All the transaction’s operations are executed locally on the delegate node,obtaining(local)read lockson read objects(note that,in order to be written,an object must be previously read).2.When the application requests a commit,the set of written objects is sent to all nodes using atomicbroadcast.3.When the write set of a transaction is delivered by atomic broadcast,all nodes try to obtain localwrite locks on all objects in the set.If there is a transaction that holds a write lock on any object of the write set of,is placed on hold until that write lock is relinquished.Transactions holding read locks on any object of the write set of are aborted(sending an abort message through atomic broadcast).When the delegate node has obtained all write locks,sends a commit message to all servers,through atomic broadcast.4.Upon the reception of a confirmation message,a node applies the transaction’s writes to the localdatabase and subsequently releases all locks held on behalf of that transaction.Upon the reception of an abort message,the delegate node aborts the transaction an releases all its locks(other nodes ignore that message).A detailed description of the algorithm is shown in Figure3.The algorithm uses the order given by atomic broadcast to serialize conflicting transactions.Thefinal transaction order is given by the order of the messages.Conflict detection is done using locks.Write/write conflicts,that occur when two concurrent transactions try to write over the same object, are detected by the lock system(two transactions try to obtain a write lock on the same object).Since write locks are obtained upon reception of,the order of these messages determines the lock acquisition order.As seen in Figure3,if a transaction obtains a write lock,it will force a later transaction to wait when it tries to obtain its lock.If commits it will force to abort.UDSAccess(,):1.For each object in the list obtain a read lock.If any of those objects is write-locked,is place onhold until that object’s write lock is released.commit():1.Obtain from the UDS the list of objects written()by.2.Send through the atomic broadcast primitive.3.When the message containing is delivered by atomic broadcast:(a)For each object in,try to obtain a write lock on it,executing the following steps atomi-cally:i.If there is one or more read locks on,every that has that read lock is aborted(by sendingan message using atomic broadcast),and the write lock on is granted to.ii.If there is a write lock on,or all the read locks on are from transactions whose message has already been delivered,will be placed on hold until thosewrite locks are released.iii.If there is no other lock on,grant the lock to.(b)If this node is the delegate node for,send by atomic broadcast.4.When a message is delivered:commit,writing all its updates in the database and releasing alllocks held by.All transactions waiting to obtain write locks on an object written by are aborted(a message is sent through atomic broadcast).5.When a is delivered:If is a local transaction the message is ignored,otherwise abort,releasingall its locks.Figure3.Voting protocol9Read/write conflicts,that occur when two concurrent transactions access the same object,one for reading and the other for writing,are solved by giving priority to writing transactions.When a message is delivered,write locks are obtained,causing transactions that have read locks on objects in to 
abort. This rule does not apply to transactions whose write set has already been delivered(step3(a)ii):in this case will be placed on hold until the decision is taken regarding the transaction(s)that own the read lock.All nodes obtain the same write locks in the same order,because the order of the messages is the same in all nodes,and the lock procedure is deterministic.As such,all nodes will be able to respect the decision issued by the delegate node.Optimization This protocol can be further improved,to avoid aborting unnecessary number of trans-actions.In the lock acquisition phase(after is delivered),instead of immediately aborting transactions that hold read locks on objects in,they can be placed on an alternative state,called exe-cutingabort state can proceed executing,but cannot commit.If they attempt to, they will be placed on hold.If commits,then all transactions in executingabort will return to normal execution state(if there is no other transaction that is placing in executingabort state do not need to be put on hold -they can commit immediately.Thefinal serialization order is as these transactions executed before the transaction that placed them in executingof a crashed process to be inconsistent.Therefore,in C OPLA,one needs an uniform total order protocol,i.e.a protocol that ensures that if two messages are delivered by a given order to a process(even if this process crashes),they are delivered in that order to all correct processes.Several alternatives to augment the hybrid protocol with uniform delivery have been implemented and are currently under evaluation.Thefirst alternative consists in adding an additional stability phase to the original hybrid protocol.The second alternative is to change the underlying reliable broadcast protocol to provide terminating uniform delivery of every message.The third alternative is to use two protocols in parallel:the original hybrid protocol to establish a tentative order and another consensus based protocol to establish a definitive order.These alternatives are being implemented using the Appia[14]framework and their performance is being studied.Early analysis shows that the protocol that performs best is a combination of the previous alternatives. 
Passive nodes select a sequencer just as in the original hybrid protocol.Sequencers are responsible for providing an uniform total order for the messages sent by passive nodes and for their own messages.They do so by applying a symmetric total order protocol based on an underlying terminating uniform reliable broadcast layer.The protocol also supports the optimistic delivery of(tentative)total order indications[8,18].Given that the order established by the(non-uniform)total order protocol is the same as thefinal uniform total order in most cases(these two orders only differ when crashes occur at particular points in the protocol execution), this order can be provided to the consistency layer as a tentative ordering information.The consistency protocols may optimistically perform some tasks that are later committed when thefinal order is delivered.5Optimizations to the Basic ProtocolsThe basic protocols described in Section3can be optimized in two different ways.One consists in delaying the propagation of updates,the deferred updates mode.Other consists in exploiting the optimistic delivery of the atomic multicast algorithm.5.1Deferred UpdatesBoth algorithms presented before can be configured to operate on a mode called deferred updates.This mode consists in postponing the transfer of updates until such data is required by a remote transaction,trad-ing fault-tolerance for performance.Note that,when using deferred updates,the outcome of a transaction is no longer immediately propagated to all replicas:it is stored only at the delegate node.If this node crashes, transactions that access this data must wait for the delegate node to recover.On the other hand,network communication is saved because updates are only propagated when needed.The changes to the protocol required to implement the deferred updates mode are encapsulated in the getNewVersions(t,l)function,which is depicted in Figure4.In both protocols,the function is called after11For each OID in:1.Check if the object’s copy in the local database is up-to-date.2.If the object is out-of-date,get the latest version from the node that holds it.Figure4.getNewVersions()step one,i.e.,it becomes step two of UDSAccess().Associated with each OID,there is afield,called owner,that contains the identifier of the node holding the latest version of that object’s data.If thatfield is empty,then the current node holds the latest version.When deferred updates mode is not used,modified data is written to the database at the end of the commit procedure.This step is modified to implement deferred updates:only the delegate node writes the altered data on its database,setting the ownerfield to empty.All the other nodes write the identifier of the delegate node in their databases.The only information that is sent across the network is merely a list of changed OIDs(instead of that list plus the data itself).5.2Exploiting Optimistic Atomic DeliveryAs described above,the atomic broadcast primitive developed in the project has the possibility of deliver-ing a message optimistically(opt-deliver),i.e.,the message is delivered in a tentative order,which is likely to be the same as thefinal order(u-deliver).This can be exploited by both consistency protocols.The ten-tative order allows the protocols to send the transaction’s updates to the database earlier.Instead of waiting for thefinal uniform order to perform the writes,they are sent to the database as soon as the tentative order is know.When thefinal order arrives,all that is required is to commit the 
transaction.This hides the cost of witting data behind the cost of uniform delivery,effectively doing both things in parallel.Non-voting protocol Upon reception of an opt-deliver message,all steps in the commit()function are executed,with the following modifications:in step3(a),conflicting transactions are not aborted,but placed on hold(transactions on hold can execute normally,but are suspended when they request a commit,and can only proceed when they return to normal state);in step3(b-ii),the data is sent to the UDS,but the transaction is not committed.When the message is u-delivered,and its order is the same as the tentative one,all transactions marked on hold on behalf of the current one are aborted,and the transaction is committed.If the order is not the same, then the open UDS transaction is aborted,all transactions placed on hold on behalf of this one are returned to normal state,and the message is reprocessed as if it arrived at that moment.12。