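1. Basic principles of ChatGPT. ChatGPT builds on the Transformer model. Its core principle is to pre-train on large-scale text corpora, learning the language patterns and semantic information in the text so that it can generate fluent, coherent dialogue.
2. Development history of ChatGPT. The underlying GPT models have gone through several iterations, from the original GPT-1 to the comparatively mature GPT-3, with marked gains in both model scale and performance.
3. Applications of ChatGPT in dialogue generation. ChatGPT has very broad applications in dialogue generation, including intelligent customer service, chatbots, and virtual assistants.
4. Applications of ChatGPT in text summarization. Text summarization is an important task in natural language processing. It aims to extract the most important information from a text and produce a concise, refined summary.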
Impact of Relevance Measures on the Robustness and Accuracy of Collaborative Filtering
JJ Sandvig, Bamshad Mobasher, and Robin Burke
Center for Web Intelligence
School of Computer Science, Telecommunications and Information Systems
DePaul University, Chicago, Illinois, USA
{jsandvig,mobasher,rburke}@

Abstract. The open nature of collaborative recommender systems presents a security problem. Attackers that cannot be readily distinguished from ordinary users may inject biased profiles, degrading the objectivity and accuracy of the system over time. The standard user-based collaborative filtering algorithm has been shown to be quite vulnerable to such attacks. In this paper, we examine relevance measures that complement neighbor similarity and their influence on algorithm robustness. In particular, we consider two techniques, significance weighting and trust weighting, that attempt to calculate the utility of a neighbor with respect to rating prediction. Such techniques have been used to improve prediction accuracy in collaborative filtering. We show that significance weighting, in particular, also results in improved robustness under profile injection attacks.

1 Introduction

An adaptive system dependent on anonymous, unauthenticated user profiles is subject to manipulation. The standard collaborative filtering algorithm builds a recommendation for a target user by combining the stored preferences of peers with similar interests. If a malicious user injects the profile database with a number of fictitious identities, they may be considered peers to a genuine user and bias the recommendation. We call such attacks profile injection attacks (also known as shilling [1]). Recent research has shown that surprisingly modest attacks are sufficient to manipulate the most common CF algorithms [2,1,3]. Such attacks degrade the objectivity and accuracy of a recommender system, causing frustration for its users.

In this paper we explore the robustness of certain variants of user-based recommendation. In particular, we examine variants
that combine similarity metrics with other measures to determine neighbor utility. Such relevance weighting techniques apply a weight to each neighbor's similarity score, based on some value reflecting the expected relevance of that neighbor to the prediction task. We focus on two types of relevance measures: significance weighting and trust-based weighting. Significance weighting [4] takes the size of profile overlap between neighbors into account. This prevents neighbors with only a few commonly rated items from dominating prediction. Trust-based weighting [5] estimates the utility of a neighbor as a rating predictor based on the historical accuracy of recommendations given by the neighbor.

Traditional user-based collaborative filtering algorithms focus exclusively on the degree of similarity between the target user and its neighbors in order to generate predicted ratings. However, the "reliability" of the neighbor profiles is generally not considered. For example, due to the sparsity of the data, the similarities may have been obtained based on very few co-rated items between the neighbor and the target user, resulting in sub-optimal predictions. Similarly, unreliable neighbors that have made poor predictions in the past may have a negative impact on prediction accuracy for the current item. Both of the approaches to relevance weighting mentioned above were, therefore, initially introduced in order to improve the prediction accuracy in user-based collaborative filtering.

In the trust-based model [5] an explicit trust value is computed for each user, reflecting the "reputation" of that user for making accurate recommendations.
Trust is not limited to the macro profile level, and can be calculated as the reputation a user has for the recommendation of a particular item. The trust values, in turn, can be used as relevance weights when generating predictions. In [5], O'Donovan and Smyth further studied the impact of the trust weighting approach on the robustness of collaborative recommendation and showed that trust-based models are still vulnerable to attacks. On the other hand, the significance weighting approach, introduced initially in [4], does not focus on trust, but rather on the number of co-rated items between the target user and the neighbors as a measure of the reliability of the neighbor profiles. This approach has been shown to have a significant impact on the accuracy of predictions, particularly in sparse data sets.

Although these and other similar approaches have been used to improve the prediction accuracy of recommender systems, the impact of neighbor significance weighting on algorithm robustness in the face of malicious attacks has been largely ignored. The primary contribution of this paper is to demonstrate that relevance weighting is an important factor in determining the robustness of a collaborative filtering algorithm. Choosing an optimal relevance measure can yield a large improvement in recommender stability. Our results show that significance weighting, in particular, is not only more accurate; it also improves algorithm robustness under profile injection attacks that have compact profile signatures.

2 Attacks in Collaborative Recommenders

We assume that an attacker intends to bias a recommender system for some economic advantage. This may be in the form of an increased number of recommendations for the attacker's product, or fewer recommendations for a competitor's product.

A collaborative recommender database consists of many user profiles, each with assigned ratings to a number of products that represent the user's preferences. User-based collaborative filtering algorithms attempt to discover
a neighborhood of user profiles that are similar to a target user. A rating value is predicted for all missing items in the target user's profile, based on ratings given to the item within the neighborhood. A ranked list is produced, and typically the top 20 or 50 predictions are returned as recommendations.

The standard k-nearest neighbor algorithm is widely used and reasonably accurate [4]. Similarity is computed using Pearson's correlation coefficient, and the k most similar users that have rated the target item are selected as the neighborhood. This implies a target user may have a different neighborhood for each target item. It is also common to filter neighbors with similarity below a specified threshold. This prevents predictions being based on very distant or negative correlations. After identifying a neighborhood, we use Resnick's algorithm to compute the prediction for a target item i and target user u:

$$p_{u,i} = \bar{r}_u + \frac{\sum_{v \in V} sim_{u,v}\,(r_{v,i} - \bar{r}_v)}{\sum_{v \in V} |sim_{u,v}|} \qquad (1)$$

where $V$ is the neighborhood, $r_{v,i}$ is the rating of neighbor $v$ for item $i$, $\bar{r}_u$ and $\bar{r}_v$ are the mean ratings of $u$ and $v$, and $sim_{u,v}$ is the similarity between them. Because each neighbor's deviation from its mean is weighted by similarity, more similar neighbors have a larger impact on the final prediction. However, this type of similarity weighting alone may not be sufficient to guarantee accurate predictions. It is also necessary to ensure the reliability of the neighbor profiles. A common reason for the lack of reliability of predictions may be that similarities between the target user and the neighbors are based on a very small number of co-rated items. In the following section we consider two approaches that have been used to address the "reliability" problem mentioned above. These approaches have been used primarily to increase prediction accuracy. Our focus, however, will be on their impact on system robustness in the face of attacks.
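The k-nearest neighbor prediction step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the profile layout (`dict` of user to `{item: rating}`), the function names `pearson` and `resnick_predict`, the choice to compute Pearson means over co-rated items only, and the normalization by the sum of absolute similarities (the usual form of Resnick's formula) are all assumptions made for the example.

```python
from math import sqrt


def pearson(ratings_u, ratings_v):
    """Pearson correlation over the items both users rated (0.0 if undefined)."""
    common = set(ratings_u) & set(ratings_v)
    if len(common) < 2:
        return 0.0
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den_u = sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common))
    den_v = sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    den = den_u * den_v
    return num / den if den else 0.0


def resnick_predict(profiles, target, item, k=30, min_sim=0.0):
    """Predict the target user's rating for `item` from the k most similar
    users who rated it, keeping only similarities above `min_sim`."""
    r_u = profiles[target]
    mean_u = sum(r_u.values()) / len(r_u)

    # Candidate neighbors: other users who rated the item and pass the threshold.
    candidates = []
    for v, r_v in profiles.items():
        if v == target or item not in r_v:
            continue
        s = pearson(r_u, r_v)
        if s > min_sim:
            candidates.append((s, v))
    neighbors = sorted(candidates, reverse=True)[:k]

    norm = sum(abs(s) for s, _ in neighbors)
    if norm == 0:
        return mean_u  # no usable neighbors: fall back to the user's mean

    # Similarity-weighted sum of each neighbor's deviation from its own mean.
    weighted = 0.0
    for s, v in neighbors:
        mean_v = sum(profiles[v].values()) / len(profiles[v])
        weighted += s * (profiles[v][item] - mean_v)
    return mean_u + weighted / norm
```

Filtering on `min_sim` before taking the top k mirrors the text's remark that neighbors below a similarity threshold are commonly dropped, so negatively correlated profiles never enter the neighborhood in this sketch.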
We conjecture that an optimal relevance weight may provide an algorithmic approach to securing recommender systems against attacks.

The basic goal of a relevance measure is to estimate the utility of a neighbor as a rating predictor for the target user. The standard technique is to calculate similarity as the degree of "closeness" in Euclidean space. This is often accomplished via Pearson's correlation coefficient or vector cosine coefficient. Additional extensions to similarity are well known, including significance weighting [4], variance weighting [4], case amplification [7], inverse user frequency [7], default voting [7], and profile trust [5]. In this paper, we focus on the effects of significance weighting and profile trust because they are widely accepted techniques with very different properties.

3.1 Significance Weighting

The significance weighting approach proposed by Herlocker et al. [4] adjusts similarity weights by devaluing relationships with a small number of commonly rated items. It uses a linear drop-off for neighbors with fewer than N co-rated items. Neighbors with more than N co-rated items are not devalued at all. The significance weight of a target user u for a neighbor v is computed as: $w_{u,v} = sim_{u,v} \cdot \frac{n}{\lg m}$, where n is the number of co-rated items, and m is the total number of ratings in the target user's profile. Using a local measure prevents unduly penalizing the closest neighbors when the target user has only a minimal number of ratings.

Significance weighting prefers neighbors having many commonly rated items with the target user. Neighbors with fewer commonly rated items may be pushed out of the neighborhood, even if there is a higher degree of similarity to the target user. It follows that users who have rated a large number of items will belong to more neighborhoods than those users who have rated few items.

This is a potential security risk in the context of profile injection attacks. An attack profile with a very large number of filler items will necessarily be included
in more neighborhoods, regardless of the rating value. As we will show, the risk is minimized precisely because a large filler size threshold is required to make the attack successful. In most cases, genuine users rate only a small portion of all recommendable items; therefore, an attack profile with a very large filler size is easier to detect [8].

3.2 Trust Weighting

The vulnerabilities of collaborative recommender systems to attacks have led to a number of recent studies focusing on the notion of "trust" in recommendation. O'Donovan and Smyth [5,9] propose trust models as a means to improve accuracy in collaborative filtering. The basic assumption is that users with a history of being good predictors will provide accurate predictions in the future. By explicitly calculating a trust value, the reputation of a user can be used as insight into the user's relevance to recommendation. Trust is not limited to the macro profile level, and can be calculated as the reputation a user has for the recommendation of a particular item.

The trust building process generates a trust value for every user in the training set by examining the predictive accuracy of the corresponding profile. By cross-validation, each user in turn is designated as the sole neighbor v for all remaining users. The system then computes the prediction set $P_v$ as all possible predictions $p_{u,i}$ that can be made for user $u \in U$ and item $i \in I$ using the neighborhood $V = \{v\}$. For each prediction $p_{u,i}$, $recommend_{v,u,i} = 1$ if $p_{u,i} \in P_v$ and $correct_{v,u,i} = 1$ if $|p_{u,i} - r_{u,i}| < \varepsilon$, where $\varepsilon$ is a constant threshold and $r_{u,i}$ is the rating of user u for item i. Item-trust values are then computed as:

$$trust_{v,i} = \frac{\sum_{u \in U} correct_{v,u,i}}{\sum_{u \in U} recommend_{v,u,i}} \qquad (3)$$

Trust and similarity are combined into a single relevance weight:

$$w_{u,v,i} = \frac{2 \cdot sim_{u,v} \cdot trust_{v,i}}{sim_{u,v} + trust_{v,i}} \qquad (4)$$

where $sim_{u,v}$ is Pearson's correlation coefficient. A prediction for the target user is computed using (1), replacing $sim_{u,v}$ with $w_{u,v,i}$.

Trust-based collaborative filtering algorithms can be very susceptible to profile injection attacks, because mutual opinions are reinforced during the trust building process [9]. Attack
profiles that contain biased ratings for a target item result in mutual reinforcement of the item's preference. The larger the attack, the more reinforcement of the target item. Furthermore, if the target item is always given the maximum value, an attack profile could have higher trust scores than a genuine profile, because $correct_{v,u,i}$ will always be 1 if v and u are both attacks on item i.

In a recent study, O'Donovan and Smyth [9] propose several solutions to the reinforcement problem that utilize pseudo-random subsets of the training data during the trust building phase. Sampling the population of profiles used in trust calculation effectively smooths the noise inherent in the entire dataset. The strategy raises an interesting research question with respect to robustness: how does a non-deterministic neighborhood formation task affect the impact of a profile injection attack? Although promising, we did not evaluate sampling the training set. For this set of experiments, we are interested only in the effect of relevance weighting.

4 Experimental Evaluation

Dataset. In our experiments, we have used the publicly available MovieLens 100K dataset. This dataset consists of 100,000 ratings on 1682 movies by 943 users. All ratings are integer values between one and five, where one is the lowest (disliked) and five is the highest (liked). Our data includes all users who have rated at least 20 movies.

To conduct attack experiments, the full dataset is split into training and test sets. Generally, the test set contains a sample of 50 user profiles that mirror the overall distribution of users in terms of number of movies seen and ratings provided. The remaining user profiles are designated as the training set. All attack profiles are built from the training set, in isolation from the test set.

The set of attacked items consists of 50 movies whose ratings distribution matches the overall ratings distribution of all movies. Each movie is attacked as a separate test, and the results are aggregated. In each case, a number
of attack profiles are generated and inserted into the training set, and any existing rating for the attacked movie in the test set is temporarily removed.

For every profile injection attack, we track attack size and filler size. Attack size is the number of injected attack profiles, and is measured as a percentage of the pre-attack training set. There are approximately 1000 users in the database, so an attack size of 1% corresponds to about 10 attack profiles added to the system. Filler size is the number of filler ratings given to a specific attack profile, and is measured as a percentage of the total number of movies. There are approximately 1700 movies in the database, so a filler size of 10% corresponds to about 170 filler ratings in each attack profile. The results reported below represent averages over all combinations of test users and attacked movies.

Fig. 1. Comparison of MAE

Evaluation Metrics. There has been considerable research in the area of recommender system evaluation focused on accuracy and performance [10]. We use the mean absolute error (MAE) accuracy metric, a statistical measure for comparing predicted values to actual user ratings [4]. However, our overall goal is to measure the effectiveness of an attack; the "win" for the attacker. In the experiments reported below, we follow the lead of [2] in measuring stability via prediction shift.

Prediction shift measures the change in an item's predicted rating after being attacked. Let U and I be the sets of test users and attacked items, respectively.
For each user-item pair (u, i) the prediction shift, denoted by $\Delta_{u,i}$, can be measured as $\Delta_{u,i} = p'_{u,i} - p_{u,i}$, where $p$ and $p'$ represent the prediction before and after attack, respectively. A positive value means that the attack has succeeded in raising the predicted rating for the item. The average prediction shift for an item i over all users in the test set can be computed as: $\Delta_i = \sum_{u \in U} \Delta_{u,i} / |U|$. The average prediction shift is then computed by averaging over individual prediction shifts for all attacked items.

Note that a strong prediction shift does not guarantee an item will be recommended: it is possible that other items' scores are also affected by an attack, or that the item score is so low that even a prodigious shift does not promote it to "recommended" status.

Accuracy Analysis. We first compare the accuracy of k-nearest neighbor using different relevance metrics. In our experiments we examined the standard Pearson's correlation, standard significance weighting, local significance weighting, and item-trust weighting. For significance weighting, we have followed the lead of [4] in using N = 50. For trust weighting, we have followed the lead of [5] in using $\varepsilon = 1.8$. In all cases, 10-fold cross-validation is performed on the entire dataset and no attack profiles are injected.

As shown in Figure 1, we achieved good results using a neighborhood size of k = 30 users for all relevance metrics; therefore, we applied k = 30 to all neighborhood formation tasks in the attack results discussed below. Overall, it is clear that some form of relevance weighting, in addition to similarity, can improve prediction accuracy. Standard and local significance weighting are particularly beneficial, although trust is also helpful when considering small neighborhoods.

There are several interesting observations about the MAE results. At k = 5, item-trust is more accurate than the other relevance measures. At k = 15 and greater, item-trust is the least accurate of the measures. It appears that the trust building process overfits the data, because trust
is built on the assumption that the user for whom a trust value is computed is the only neighbor in any given neighborhood. The trust model does not take into account that a large neighborhood depends on reinforcement. For example, the closest neighbor to a target user may predict a negative rating for item i. But, when the closest three neighbors are taken into account, the second and third neighbors may predict a positive rating for item i. This effectively cancels out the prediction of the closest neighbor. In fact, a positive rating prediction may be more accurate for item i because the trend of the closest neighbors is a positive rating.

Robustness Analysis. To evaluate the robustness of relevance weighting, we compare the results of push attacks using the four relevance weighting schemes described in the previous section. Figure 2(A) depicts prediction shift results at different attack sizes, using a 5% filler. Clearly, significance weighting is much more robust than the standard Pearson's correlation. For all attack sizes, the prediction shift of significance weighting is about half that of standard correlation. Although not completely immune to attack, it is certainly a large improvement.
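The prediction shift values behind these robustness results follow the definition given under Evaluation Metrics: a per-pair difference, averaged over test users for each item, then averaged over attacked items. A minimal Python sketch, assuming predictions are held in dictionaries keyed by hypothetical `(user, item)` pairs (the function name and data layout are illustrative, not taken from the paper):

```python
def prediction_shift(before, after):
    """Return (per-item average shift, overall average shift).

    `before` and `after` map (user, item) -> predicted rating, computed
    without and with the injected attack profiles respectively.
    """
    per_item = {}
    for (user, item), p in before.items():
        # Shift for one user-item pair: post-attack minus pre-attack prediction.
        per_item.setdefault(item, []).append(after[(user, item)] - p)

    # Average over test users for each attacked item.
    item_shift = {item: sum(d) / len(d) for item, d in per_item.items()}

    # Average over all attacked items.
    overall = sum(item_shift.values()) / len(item_shift)
    return item_shift, overall
```

A positive overall value indicates that, on average, the push attack raised predicted ratings for the attacked items.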
Even at a 15% attack, significance weighting may be the difference between recommending an attacked item or not.

Local significance weighting also performs well against profile injection attack, although not to the same degree of robustness as standard significance weighting. This can be explained by the fact that target users with fewer than 50 ratings do not scale their neighbors linearly. An attack profile in the neighborhood that is highly correlated to the target user is not devalued enough. As a result, a genuine user with less correlation to the target user, but more overlap in rated items, may be removed from the neighborhood.

Item-trust weighting appears slightly more robust than standard correlation. The mutual-reinforcement effect is not as pronounced for attack profiles at smaller filler sizes, because the attacks don't have enough similarity to the target user; the trust value is outweighed. In addition, the reinforcement from genuine users is enough to gain insight into the true relevance for making predictions. The combination of trust and similarity of genuine users to a target user is sufficient to remove some attack profiles from the neighborhood.

To evaluate the sensitivity of filler size, we have tested a full range of filler items. The 100% filler is included as a benchmark for the potential influence of an attack. However, it is not likely to be practical from an attacker's point of view. Collaborative filtering rating databases are often extremely sparse, so attack profiles that have rated every product are quite conspicuous. Of particular interest are smaller filler sizes. An attack that performs well with few filler items is less likely to be detected. Thus, an attacker will have a better chance of actually impacting a system's recommendation, even if the performance of the attack is not optimal.

Fig. 2. (A) Average attack prediction shift at 5% filler; (B) Average attack filler size comparison

Figure 2(B) depicts prediction shift at different filler sizes with 2% attack size.
Surprisingly, as filler size is increased, prediction shift for standard correlation goes down. This is because an attack profile with many filler items has greater probability of being dissimilar to the active user. In contrast, prediction shift for significance weighting goes up. As stated previously, an attack profile with a very large number of filler items will have a better chance of being included in more neighborhoods, because it isn't devalued by significance weighting.

The counter-intuitive observation is that standard correlation is actually more robust than any of the other relevance measures at very large filler sizes. To account for this, recall that the size of profile overlap is not addressed with standard correlation. A genuine user that is very similar to the target user, but does not have many co-rated items, is not penalized. However, with significance weighting the same user would be devalued, potentially removing the user from the neighborhood in favor of an attack profile.

As shown, a 25% filler size is the point where prediction shift for standard correlation surpasses the other relevance measures. Overall, this does not affect the general improvement in robustness of relevance weighting. Using the modest MovieLens 100K dataset, a user would have to rate 420 movies to have a profile with 25% filler. It is simply not feasible for a genuine user to rate 25% of the items in a commercial recommender with millions of different products. From a practical perspective, the threat of large filler attacks is minimal because they should be easily detectable [8].

5 Conclusion

The standard user-based collaborative filtering algorithm has been shown to be quite vulnerable to profile injection attacks. An attacker is able to bias recommendation by building a number of profiles associated with fictitious identities. In this paper, we have demonstrated the relative robustness and stability of supplementing the similarity weighting of neighbors with significance weighting and item-trust values. Significance
weighting, in particular, results in increased recommendation accuracy and improved robustness under attack, versus the standard k-nearest neighbor approach. Future work will examine other relevance measures with respect to attack, including case amplification, inverse user frequency, and default voting.

References

1. Lam, S., Riedl, J.: Shilling recommender systems for fun and profit. In: Proceedings of the 13th International WWW Conference, New York (May 2004)
2. O'Mahony, M., Hurley, N., Kushmerick, N., Silvestre, G.: Collaborative recommendation: A robustness analysis. ACM Transactions on Internet Technology 4(4) (2004) 344–377
3. Mobasher, B., Burke, R., Bhaumik, R., Williams, C.: Towards trustworthy recommender systems: An analysis of attack models and algorithm robustness. ACM Transactions on Internet Technology 7(4) (2007)
4. Herlocker, J., Konstan, J., Borchers, A., Riedl, J.: An algorithmic framework for performing collaborative filtering. In: Proceedings of the 22nd ACM Conference on Research and Development in Information Retrieval (SIGIR'99), Berkeley, CA (August 1999)
5. O'Donovan, J., Smyth, B.: Trust in recommender systems. In: Proceedings of the 10th International Conference on Intelligent User Interfaces (EC'04), ACM Press (2005) 167–174
6. Mobasher, B., Burke, R., Sandvig, J.J.: Model-based collaborative filtering as a defense against profile injection attacks. In: Proceedings of the 21st National Conference on Artificial Intelligence, AAAI (July 2006) 1388–1393
7. Breese, J., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: Uncertainty in Artificial Intelligence. Proceedings of the Fourteenth Conference, New Orleans, LA, Morgan Kaufmann (1998) 43–53
8. Williams, C., Bhaumik, R., Burke, R., Mobasher, B.: The impact of attack profile classification on the robustness of collaborative recommendation. In: Proceedings of the 2006 WebKDD Workshop, held at ACM SIGKDD Conference on Data Mining and Knowledge Discovery (KDD'06), Philadelphia (August 2006)
9. O'Donovan, J., Smyth, B.: Is trust robust?: An analysis of trust-based
recommendation. In: Proceedings of the 5th ACM Conference on Electronic Commerce (EC'04), ACM Press (2006) 101–108
10. Herlocker, J., Konstan, J., Terveen, L.G., Riedl, J.: Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems 22(1) (2004) 5–53
A survey on sentiment detection of reviews

Huifeng Tang, Songbo Tan*, Xueqi Cheng
Information Security Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, PR China

Keywords: Sentiment detection; Opinion extraction; Sentiment classification

Abstract

The sentiment detection of texts has attracted booming interest in recent years, due to the increased availability of online reviews in digital form and the ensuing need to organize them. To date, four problems predominate in this research community: subjectivity classification, word sentiment classification, document sentiment classification, and opinion extraction. In fact, there are inherent relations between them. Subjectivity classification can prevent the sentiment classifier from considering irrelevant or even potentially misleading text. Document sentiment classification and opinion extraction have often involved word sentiment classification techniques. This survey discusses related issues and main approaches to these problems.
© 2009 Published by Elsevier Ltd.

1. Introduction

Today, a very large number of reviews are available on the web, and weblogs are growing fast in the blogosphere. Product reviews exist in a variety of forms on the web: sites dedicated to a specific type of product (such as digital cameras), sites for newspapers and magazines that may feature reviews (like Rolling Stone or Consumer Reports), sites that couple reviews with commerce (like Amazon), and sites that specialize in collecting professional or user reviews in a variety of areas. Less formal reviews are available on discussion boards and mailing list archives, as well as in Usenet via Google. Users also comment on products in their personal web sites and blogs, which are then aggregated by other sites.

The information mentioned above is a rich and useful source for marketing intelligence, social psychologists, and others interested in
extracting and mining opinions, views, moods, and attitudes. For example: whether a product review is positive or negative, what the mood among bloggers is at a given time, or how the public reacts to a political affair.

To achieve this goal, a core and essential job is to detect subjective information contained in texts, including viewpoints, preferences, attitudes, and sensibilities. This is so-called sentiment detection.

A challenging aspect of this task, which distinguishes it from traditional topic-based detection (classification), is that while topics are often identifiable by keywords alone, sentiment can be expressed in a much subtler manner. For example, the sentence "What a bad picture quality that digital camera has! ... Oh, this new type camera has a good picture, long battery life and beautiful appearance!" compares a negative experience of one product with a positive experience of another product. It is difficult to separate out the core assessment that should actually be correlated with the document. Thus, sentiment seems to require more understanding than the usual topic-based classification.

Sentiment detection dates back to the late 1990s (Argamon, Koppel, & Avneri, 1998; Kessler, Nunberg, & Schütze, 1997; Spertus, 1997), but only in the early 2000s did it become a major subfield of the information management discipline (Chaovalit & Zhou, 2005; Dimitrova, Finn, Kushmerick, & Smyth, 2002; Durbin, Neal Richter, & Warner, 2003; Efron, 2004; Gamon, 2004; Glance, Hurst, & Tomokiyo, 2004; Grefenstette, Qu, Shanahan, & Evans, 2004; Hillard, Ostendorf, & Shriberg, 2003; Inkpen, Feiguina, & Hirst, 2004; Kobayashi, Inui, & Inui, 2001; Liu, Lieberman, & Selker, 2003; Raubern & Muller-Kogler, 2001; Riloff and Wiebe, 2003; Subasic & Huettner, 2001; Tong, 2001; Vegnaduzzo, 2004; Wiebe & Riloff, 2005; Wilson, Wiebe, & Hoffmann, 2005). Until the early 2000s, the two main popular approaches to sentiment detection, especially in real-world applications, were based on machine learning techniques and on semantic analysis techniques. After that, the shallow
natural language processing techniques were widely used in this area, especially in document sentiment detection. Current-day sentiment detection is thus a discipline at the crossroads of NLP and IR, and as such it shares a number of characteristics with other tasks such as information extraction and text mining.

Although several international conferences have devoted special issues to this topic, such as ACL, AAAI, WWW, EMNLP, and CIKM, there are no systematic treatments of the subject: there are neither textbooks nor journals entirely devoted to sentiment detection yet.

Expert Systems with Applications 36 (2009) 10760–10773. doi:10.1016/j.eswa.2009.02.063
*Corresponding author. E-mail addresses: tanghuifeng@ (H. Tang), tansongbo@ (S. Tan), cxq@ (X. Cheng).

This paper first introduces the definitions of several problems that pertain to sentiment detection. Then we present some applications of sentiment detection. Section 4 discusses the subjectivity classification problem. Section 5 introduces the semantic orientation method. The sixth section examines the effectiveness of applying machine learning techniques to document sentiment classification.
The seventh section discusses the opinion extraction problem. The eighth part talks about evaluation of sentiment detection. The last section concludes with challenges and discussion of future work.

2. Sentiment detection

2.1. Subjectivity classification

Subjectivity in natural language refers to aspects of language used to express opinions and evaluations (Wiebe, 1994). Subjectivity classification is stated as follows: Let $S = \{s_1, \ldots, s_n\}$ be a set of sentences in document D. The problem of subjectivity classification is to distinguish sentences used to present opinions and other forms of subjectivity (subjective sentences set $S_s$) from sentences used to objectively present factual information (objective sentences set $S_o$), where $S_s \cup S_o = S$. This task is especially relevant for news reporting and Internet forums, in which opinions of various agents are expressed.

2.2. Sentiment classification

Sentiment classification includes two kinds of classification forms, i.e., binary sentiment classification and multi-class sentiment classification. Given a document set $D = \{d_1, \ldots, d_n\}$ and a pre-defined categories set C = {positive, negative}, binary sentiment classification is to classify each $d_i$ in D with a label expressed in C. If we set C* = {strong positive, positive, neutral, negative, strong negative} and classify each $d_i$ in D with a label in C*, the problem changes to multi-class sentiment classification.

Most prior work on learning to identify sentiment has focused on the binary distinction of positive vs. negative. But it is often helpful to have more information than this binary distinction provides, especially if one is ranking items by recommendation or comparing several reviewers' opinions. Koppel and Schler (2005a, 2005b) show that it is crucial to use neutral examples in learning polarity for a variety of reasons. Learning from negative and positive examples alone will not permit accurate classification of neutral examples. Moreover, the use of neutral training examples in learning facilitates better distinction between
positive and negative examples.

3. Applications of sentiment detection

In this section, we describe some emerging applications of sentiment detection.

3.1. Products comparison

It is a common practice for online merchants to ask their customers to review the products that they have purchased. With more and more people using the Web to express opinions, the number of reviews that a product receives grows rapidly. Most research on these reviews has focused on automatically classifying the products into "recommended" or "not recommended" (Pang, Lee, & Vaithyanathan, 2002; Ranjan Das & Chen, 2001; Terveen, Hill, Amento, McDonald, & Creter, 1997). However, every product has several features, and people may be interested in only some of them. Moreover, a product with shortcomings in one aspect may have merits in another (Morinaga, Yamanishi, Tateishi, & Fukushima, 2002; Taboada, Gillies, & McFetridge, 2006).

It is therefore useful to analyze online reviews and present consumers' opinions of different products visually, so that with a single glance the user can clearly see the advantages and weaknesses of each product in the minds of consumers. A potential customer can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. For a product manufacturer, the comparison makes it easy to gather marketing intelligence and product benchmarking information.

Liu, Hu, and Cheng (2005) proposed a novel framework for analyzing and comparing consumer opinions of competing products. A prototype system called Opinion Observer is implemented. To enable the visualization, two tasks were performed: (1) identifying product features that customers have expressed their opinions on, based on language pattern mining techniques; such features form the basis for the comparison; and (2) for each feature, identifying whether the opinion from each reviewer is positive or negative, if any. Different users can
visualize and compare opinions of different products using a user interface. The user simply chooses the products that he/she wishes to compare, and the system then retrieves the analyzed results of these products and displays them in the interface.

3.2. Opinion summarization

The number of online reviews that a product receives grows rapidly, especially for some popular products. Furthermore, many reviews are long and have only a few sentences containing opinions on the product. This makes it hard for a potential customer to read them to make an informed decision on whether to purchase the product. The large number of reviews also makes it hard for product manufacturers to keep track of customer opinions of their products, because many merchant sites may sell their products and the manufacturer may produce many kinds of products.

Opinion summarization (Ku, Lee, Wu, & Chen, 2005; Philip et al., 2004) summarizes the opinions of articles by reporting sentiment polarities, their degree, and the correlated events. With opinion summarization, a customer can easily see how existing customers feel about a product, and the product manufacturer can learn why different groups of people like it or what they complain about.

Hu and Liu (2004a, 2004b) carry out work of this kind. Given a set of customer reviews of a particular product, the task involves three subtasks: (1) identifying features of the product that customers have expressed their opinions on (called product features); (2) for each feature, identifying review sentences that give positive or negative opinions; and (3) producing a summary using the discovered information.

Ku, Liang, and Chen (2006) investigated both news and web blog articles. In their research, TREC, NTCIR and articles collected from web blogs serve as the information sources for opinion extraction. Documents related to the issue of animal cloning are selected as the experimental materials. Algorithms for opinion extraction at word, sentence and document level are proposed. 
The issue of relevant sentence selection is discussed, and then topical and opinionated information are summarized. Opinion summarizations are visualized by representative sentences. Finally, an opinionated curve showing the degree of support and non-support along the timeline is illustrated by an opinion tracking system.

H. Tang et al. / Expert Systems with Applications 36 (2009) 10760–10773

3.3. Opinion reason mining

In the opinion analysis area, finding the polarity of opinions, or aggregating and quantifying the degree assessment of opinions scattered throughout web pages, is not enough. We can go further into in-depth opinion assessment, such as finding reasons in opinion-bearing texts. For example, in film reviews, information such as “found 200 positive reviews and 150 negative reviews” may not fully satisfy the information needs of different people. More useful information would be “This film is great for its novel originality” or “Poor acting, which makes the film awful”.

Opinion reason mining tries to identify one of the critical elements of online reviews to answer the question, “What are the reasons that the author of this review likes or dislikes the product?” To answer this question, we should extract not only sentences that contain opinion-bearing expressions, but also sentences with reasons why an author of a review writes the review (Cardie, Wiebe, Wilson, & Litman, 2003; Clarke & Terra, 2003; Li & Yamanishi, 2001; Stoyanov, Cardie, Litman, & Wiebe, 2004).

Kim and Hovy (2005) proposed a method for detecting opinion-bearing expressions. In their subsequent work (Kim & Hovy, 2006), they collected a large set of ⟨review text, pros, cons⟩ triplets from online review sites, in which each review's author explicitly states pros and cons phrases in their respective categories along with the review text. Their automatic labeling system first collects phrases in the pro and con fields and then searches the main review text in order to collect sentences corresponding to those phrases. Then the system annotates this sentence with the
appropriate “pro” or “con” label. All remaining sentences with neither label are marked as “neither”. After labeling all the data, they use it to train their pro and con sentence recognition system.

3.4. Other applications

Thomas, Pang, and Lee (2006) try to determine from the transcripts of US Congressional floor debates whether the speeches represent support of or opposition to proposed legislation. Mullen and Malouf (2006) describe a statistical sentiment analysis method on political discussion group postings to judge whether a reply expresses a political viewpoint opposing the original post. Moreover, there are some potential applications of sentiment detection, such as online message sentiment filtering, e-mail sentiment classification, web-blog author's attitude analysis, sentiment web search engines, etc.

4. Subjectivity classification

Subjectivity classification is the task of investigating whether a paragraph presents the opinion of its author or reports facts. In fact, most research has shown a very tight relation between subjectivity classification and document sentiment classification (Pang & Lee, 2004; Wiebe, 2000; Wiebe, Bruce, & O'Hara, 1999; Wiebe, Wilson, Bruce, Bell, & Martin, 2002; Yu & Hatzivassiloglou, 2003). Subjectivity classification can prevent the polarity classifier from considering irrelevant or even potentially misleading text. 
Pang and Lee (2004) find that subjectivity detection can compress reviews into much shorter extracts that still retain polarity information at a level comparable to that of the full review.

Much of the research in automated opinion detection has addressed discriminating between subjective and objective text at the document and sentence levels (Bruce & Wiebe, 1999; Finn, Kushmerick, & Smyth, 2002; Hatzivassiloglou & Wiebe, 2000; Wiebe, 2000; Wiebe et al., 1999; Wiebe et al., 2002; Yu & Hatzivassiloglou, 2003). In this section, we discuss some approaches used to automatically label a document as objective or subjective.

4.1. Similarity approach

The similarity approach to classifying sentences as opinions or facts explores the hypothesis that, within a given topic, opinion sentences will be more similar to other opinion sentences than to factual sentences (Yu & Hatzivassiloglou, 2003). The similarity approach measures sentence similarity based on shared words, phrases, and WordNet synsets (Dagan, Shaul, & Markovitch, 1993; Dagan, Pereira, & Lee, 1994; Leacock & Chodorow, 1998; Miller & Charles, 1991; Resnik, 1995; Zhang, Xu, & Callan, 2002).

To measure the overall similarity of a sentence to the opinion or fact documents, we need to go through three steps. First, use an IR method to acquire the documents that are on the same topic as the sentence in question. Second, calculate its similarity scores with each sentence in those documents and compute the average value. Third, assign the sentence to the category (opinion or fact) for which the average value is the highest. Alternatively, for the frequency variant, we can use the similarity scores, or count how many of them fall in each category, and then compare the result with a predetermined threshold.

4.2. Naive Bayes classifier

The Naive Bayes classifier is a commonly used supervised machine learning algorithm. This approach presupposes that all sentences in opinion or factual articles are opinion or fact sentences, respectively. Naive Bayes uses the sentences in opinion and fact documents as the examples of the
two categories. The features include words, bigrams, and trigrams, as well as the parts of speech in each sentence. In addition, the presence of semantically oriented (positive and negative) words in a sentence is an indicator that the sentence is subjective. Therefore, the feature set can include the counts of positive and negative words in the sentence, as well as counts of the polarities of sequences of semantically oriented words (e.g., “++” for two consecutive positively oriented words). It also includes the counts of parts of speech combined with polarity information (e.g., “JJ+” for positive adjectives), as well as features encoding the polarity (if any) of the head verb, the main subject, and their immediate modifiers.

Generally speaking, Naive Bayes assigns a document $d_j$ (represented by a vector $\vec{d}_j$) to the class $c_i$ that maximizes $P(c_i \mid \vec{d}_j)$ by applying Bayes' rule as follows,

$$P(c_i \mid \vec{d}_j) = \frac{P(c_i)\, P(\vec{d}_j \mid c_i)}{P(\vec{d}_j)} \qquad (1)$$

where $P(\vec{d}_j)$ is the probability that a randomly picked document $d$ has vector $\vec{d}_j$ as its representation, and $P(c)$ is the probability that a randomly picked document belongs to class $c$.

To estimate the term $P(\vec{d}_j \mid c)$, Naive Bayes decomposes it by assuming all the features in $\vec{d}_j$ (represented by $f_k$, $k = 1$ to $m$) are conditionally independent, i.e.,

$$P(c_i \mid \vec{d}_j) = \frac{P(c_i) \left( \prod_{k=1}^{m} P(f_k \mid c_i) \right)}{P(\vec{d}_j)} \qquad (2)$$

4.3. Multiple Naive Bayes classifier

The hypothesis that all sentences in opinion or factual articles are opinion or fact sentences is an approximation. To address this, the multiple Naive Bayes classifier approach applies an algorithm using multiple classifiers, each relying on a different subset of features. The goal is to reduce the training set to the sentences that are most likely to be correctly labeled, thus boosting classification accuracy.

Given separate sets of features $F_1, F_2, \ldots, F_m$, it trains separate Naive Bayes classifiers $C_1, C_2, \ldots, C_m$ corresponding to each feature set. 
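As a concrete reference point for the Naive Bayes classifiers just mentioned, the decision rule of Eq. (2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the surveyed work: the function names (`train_nb`, `log_posterior`, `classify`) and the toy sentences are hypothetical, add-one (Laplace) smoothing is an assumption the survey does not specify, and the denominator $P(\vec{d}_j)$ is dropped because it is the same for every class.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """Collect counts from (features, label) pairs; features is a list of strings."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)  # label -> feature -> count
    vocab = set()
    for features, label in examples:
        feature_counts[label].update(features)
        vocab.update(features)
    return class_counts, feature_counts, vocab


def log_posterior(features, label, model):
    """log P(c) + sum_k log P(f_k | c), i.e. the numerator of Eq. (2) in log space."""
    class_counts, feature_counts, vocab = model
    total_docs = sum(class_counts.values())
    score = math.log(class_counts[label] / total_docs)
    # Add-one smoothing (an assumption of this sketch) avoids log(0) for unseen features.
    denom = sum(feature_counts[label].values()) + len(vocab)
    for f in features:
        score += math.log((feature_counts[label][f] + 1) / denom)
    return score


def classify(features, model):
    """Return the class with the highest (unnormalized) posterior."""
    class_counts = model[0]
    return max(class_counts, key=lambda c: log_posterior(features, c, model))


# Hypothetical toy data: word features only.
training = [
    (["great", "love"], "opinion"),
    (["terrible", "hate"], "opinion"),
    (["released", "2009"], "fact"),
    (["costs", "dollars"], "fact"),
]
model = train_nb(training)
print(classify(["love", "great"], model))
print(classify(["released", "dollars"], model))
```

Working in log space keeps the product over features from underflowing when sentences carry many features; richer feature sets (bigrams, part-of-speech tags, polarity counts) would simply be additional strings in each feature list.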
Assuming as ground truth the information provided by the document labels and that all sentences inherit the status of their document as opinions or facts, it first trains $C_1$ on the entire training set, then uses the resulting classifier to predict labels for the training set. The sentences that receive a label different from the assumed truth are then removed, and $C_2$ is trained on the remaining sentences. This process is repeated iteratively until no more sentences can be removed. Yu and Hatzivassiloglou (2003) report results using five feature sets, starting from words alone and adding in bigrams, trigrams, part-of-speech, and polarity.

4.4. Cut-based classifier

The cut-based classifier approach puts forward the hypothesis that text spans (items) occurring near each other (within discourse boundaries) may share the same subjectivity status (Pang & Lee, 2004). Based on this hypothesis, Pang and Lee supply their algorithm with pair-wise interaction information, e.g., to specify that two particular sentences should ideally receive the same subjectivity label. This algorithm uses an efficient and intuitive graph-based formulation relying on finding minimum cuts.

Suppose there are $n$ items $x_1, x_2, \ldots, x_n$ to divide into two classes $C_1$ and $C_2$, with access to two types of information:

$ind_j(x_i)$: Individual scores. These are non-negative estimates of each $x_i$'s preference for being in $C_j$, based on just the features of $x_i$ alone;

$assoc(x_i, x_k)$: Association scores. These are non-negative estimates of how important it is that $x_i$ and $x_k$ be in the same class.

The problem then becomes maximizing each item's net score: its individual score for the class it is assigned to, minus its individual score for the other class, with a penalty for putting tightly associated items into different classes. 
Thus, after some algebra, it arrives at the following optimization problem: assign the $x_i$ to $C_1$ and $C_2$ so as to minimize the partition cost:

$$\sum_{x \in C_1} ind_2(x) + \sum_{x \in C_2} ind_1(x) + \sum_{x_i \in C_1,\; x_k \in C_2} assoc(x_i, x_k) \qquad (3)$$

This situation can be represented in the following manner. Build an undirected graph $G$ with vertices $\{v_1, \ldots, v_n, s, t\}$; the last two are, respectively, the source and sink. Add $n$ edges $(s, v_i)$, each with weight $ind_1(x_i)$, and $n$ edges $(v_i, t)$, each with weight $ind_2(x_i)$. Finally, add $\binom{n}{2}$ edges $(v_i, v_k)$, each with weight $assoc(x_i, x_k)$. A cut $(S, T)$ of $G$ is a partition of its nodes into sets $S = \{s\} \cup S'$ and $T = \{t\} \cup T'$, where $s \notin S'$, $t \notin T'$. Its cost $cost(S, T)$ is the sum of the weights of all edges crossing from $S$ to $T$. A minimum cut of $G$ is one of minimum cost. Finding a solution of this problem is thus reduced to finding a minimum cut of $G$.

5. Word sentiment classification

The task of document sentiment classification has usually involved the manual or semi-manual construction of semantic orientation word lexicons (Hatzivassiloglou & McKeown, 1997; Hatzivassiloglou & Wiebe, 2000; Lin, 1998; Pereira, Tishby, & Lee, 1993; Riloff, Wiebe, & Wilson, 2003; Turney & Littman, 2002; Wiebe, 2000), which are built by word sentiment classification techniques. For instance, Das and Chen (2001) used a classifier on investor bulletin boards to see if apparently positive postings were correlated with stock price, in which several scoring methods were employed in conjunction with a manually crafted lexicon. Word sentiment classification determines the semantic orientation of individual words or phrases (whether positive or negative, and with what intensity), generally using a pre-selected set of seed words and sometimes linguistic heuristics (for example, Lin (1998) and Pereira et al. (1993) used linguistic co-locations to group words with similar uses or meanings).

Some studies showed that restricting features to adjectives for word sentiment classification would improve performance (Andreevskaia & Bergler, 2006; Turney & Littman, 2002; Wiebe, 2000). However, more 
research showed that most adjectives and adverbs, and a small group of nouns and verbs, possess semantic orientation (Andreevskaia & Bergler, 2006; Esuli & Sebastiani, 2005; Gamon & Aue, 2005; Takamura, Inui, & Okumura, 2005; Turney & Littman, 2003).

Automatic methods of sentiment annotation at the word level can be grouped into two major categories: (1) corpus-based approaches and (2) dictionary-based approaches. The first group includes methods that rely on syntactic or co-occurrence patterns of words in large texts to determine their sentiment (e.g., Hatzivassiloglou & McKeown, 1997; Turney & Littman, 2002; Yu & Hatzivassiloglou, 2003 and others). The second group uses WordNet information, especially synsets and hierarchies, to acquire sentiment-marked words (Hu & Liu, 2004a; Kim & Hovy, 2004) or to measure the similarity between candidate words and sentiment-bearing words such as good and bad (Kamps, Marx, Mokken, & de Rijke, 2004).

5.1. Analysis by conjunctions between adjectives

This method attempts to predict the orientation of subjective adjectives by analyzing pairs of adjectives (conjoined by and, or, but, either-or, or neither-nor) which are extracted from a large unlabelled document set. The underlying intuition is that the act of conjoining adjectives is subject to linguistic constraints on the orientation of the adjectives involved (e.g., and usually conjoins two adjectives of the same orientation, while but conjoins two adjectives of opposite orientation). This is shown in the following three sentences (where the first two are perceived as correct and the third is perceived as incorrect) taken from Hatzivassiloglou and McKeown (1997):

“The tax proposal was simple and well received by the public”.
“The tax proposal was simplistic but well received by the public”.
“The tax proposal was simplistic and well received by the public”.

To infer the orientation of adjectives from analysis of conjunctions, a supervised learning algorithm can be performed in the following steps:

1. All conjunctions of adjectives are extracted from a set 
of documents.
2. Train a log-linear regression classifier and then classify pairs of adjectives either as having the same or as having different orientation. The hypothesized same-orientation or different-orientation links between all pairs form a graph.
3. A clustering algorithm partitions the graph produced in step 2 into two clusters. By using the intuition that positive adjectives tend to be used more frequently than negative ones, the cluster containing the terms of higher average frequency in the document set is deemed to contain the positive terms.

The log-linear model offers an estimate of how good each prediction is, since it produces a value $y$ between 0 and 1, in which 1 corresponds to same-orientation, and one minus the produced value $y$ corresponds to dissimilarity. Same- and different-orientation links between adjectives form a graph. To partition the graph nodes into subsets of the same orientation, the clustering algorithm calculates an objective function $\Phi$ scoring each possible partition $P$ of the adjectives into two subgroups $C_1$ and $C_2$ as,

$$\Phi(P) = \sum_{i=1}^{2} \left( \frac{1}{|C_i|} \sum_{x, y \in C_i,\; x \neq y} d(x, y) \right) \qquad (4)$$

where $|C_i|$ is the cardinality of cluster $i$, and $d(x, y)$ is the dissimilarity between adjectives $x$ and $y$.

In general, because the model was unsupervised, it required an immense word corpus to function.

5.2. Analysis by lexical relations

This method presents a strategy for inferring semantic orientation from semantic association between words and phrases. It follows the hypothesis that two words tend to have the same semantic orientation if they have a strong semantic association. Therefore, it focuses on the use of lexical relations defined in WordNet to calculate the distance between adjectives.

Generally speaking, we can define a graph on the adjectives contained in the intersection between a term set (for example, the TL term set (Turney & Littman, 2003)) and WordNet, adding a link between two adjectives whenever WordNet indicates the presence of 
a synonymy relation between them, and defining a distance measure using elementary notions from graph theory. In more detail, this approach can be realized in the following steps:

1. Construct relations at the level of words. The simplest approach here is just to collect all words in WordNet, and relate words that can be synonymous (i.e., they occur in the same synset).
2. Define a distance measure $d(t_1, t_2)$ between terms $t_1$ and $t_2$ on this graph, which amounts to the length of the shortest path that connects $t_1$ and $t_2$ (with $d(t_1, t_2) = +\infty$ if $t_1$ and $t_2$ are not connected).
3. Calculate the orientation of a term by its relative distance (Kamps et al., 2004) from the two seed terms good and bad, i.e.,

$$SO(t) = \frac{d(t, \text{bad}) - d(t, \text{good})}{d(\text{good}, \text{bad})} \qquad (5)$$

4. Apply the following rule: the adjective $t$ is deemed to be positive if $SO(t) > 0$, and the absolute value of $SO(t)$ determines, as usual, the strength of this orientation (the constant denominator $d(\text{good}, \text{bad})$ is a normalization factor that constrains all values of $SO$ to belong to the $[-1, 1]$ range).

5.3. Analysis by glosses

The characteristic of this method is that it exploits the glosses (i.e., textual definitions) that a term has in an online “glossary”, or dictionary. Its basic assumption is that if a word is semantically oriented in one direction, then the words in its gloss tend to be oriented in the same direction (Esuli & Sebastiani, 2005; Esuli & Sebastiani, 2006a, 2006b). For instance, the glosses of good and excellent will both contain appreciative expressions, while the glosses of bad and awful will both contain derogative expressions.

Generally, this method can determine the orientation of a term based on the classification of its glosses. The process is composed of the following steps:

1. A seed set $(S_p, S_n)$, representative of the two categories positive and negative, is provided as input.
2. Search for new terms to enrich $S_p$ and $S_n$. Use lexical relations (e.g., synonymy) with the terms contained in $S_p$ and $S_n$ from a thesaurus, or online dictionary, to find these new 
terms, and then append them to $S_p$ or $S_n$.
3. For each term $t_i$ in the enriched sets $S'_p \cup S'_n$ or in the test set (i.e., the set of terms to be classified), a textual representation of $t_i$ is generated by collating all the glosses of $t_i$ as found in a machine-readable dictionary. Each such representation is converted into a vector by standard text indexing techniques.
4. A binary text classifier is trained on the terms in $S'_p \cup S'_n$ and then applied to the terms in the test set.

5.4. Analysis by both lexical relations and glosses

This method determines the sentiment of words and phrases by relying on both lexical relations (synonymy, antonymy and hyponymy) and glosses provided in WordNet.

Andreevskaia and Bergler (2006) proposed an algorithm named “STEP” (Semantic Tag Extraction Program). This algorithm starts with a small set of seed words of known sentiment value (positive or negative) and implements the following steps:

1. Extend the small set of seed words by adding synonyms, antonyms and hyponyms of the seed words supplied in WordNet. This step brings on average a 5-fold increase in the size of the original list, with the accuracy of the resulting list comparable to manual annotations.
2. Go through all WordNet glosses, identify the entries that contain in their definitions the sentiment-bearing words from the extended seed list, and add these head words to the corresponding category: positive, negative or neutral.
3. Disambiguate the glosses with a part-of-speech tagger, and eliminate errors of some words acquired in step 1 and from the seed list. At this step, it also filters out all those words that have been assigned contradicting sentiment values.

In this algorithm, for each word we need to compute a Net Overlap Score by subtracting the total number of runs assigning this word a negative sentiment from the total number of runs that consider it positive. In order to make the Net Overlap Score measure usable in sentiment tagging of texts and phrases, the absolute values of this score should be normalized and mapped onto a standard $[0, 1]$ 
interval. STEP accomplishes this normalization by using the value of the Net Overlap Score as a parameter in the standard fuzzy membership S-function (Zadeh, 1987). This function maps the absolute values of the Net Overlap Score onto the interval from 0 to 1, where 0 corresponds to the absence of membership in the category of sentiment (in this case, these will be the neutral words) and 1 reflects the highest degree of membership in this category. The function can be defined as follows,

$$S(u; a, b, c) = \begin{cases} 0 & \text{if } u \le a \\ 2\left(\dfrac{u - a}{c - a}\right)^2 & \text{if } a \le u \le b \\ 1 - 2\left(\dfrac{u - c}{c - a}\right)^2 & \text{if } b \le u \le c \\ 1 & \text{if } u \ge c \end{cases} \qquad (6)$$

where $u$ is the Net Overlap Score for the word and $a$, $b$, $c$ are the three adjustable parameters: $a$ is set to 1, $c$ is set to 15 and $b$, which represents a crossover point, is defined as $b = (a + c)/2 = 8$. Defined this way, the S-function assigns the highest degree of membership (= 1) to words that have a Net Overlap Score $u \ge 15$.

The Net Overlap Score can be used as a measure of a word's degree of membership in the fuzzy category of sentiment: the core adjectives, which had the highest Net Overlap Score, were identified most accurately both by STEP and by human annotators, while the words on the periphery of the category had the lowest scores and were associated with low rates of inter-annotator agreement.

5.5. Analysis by pointwise mutual information

The general strategy of this method is to infer semantic orientation from semantic association. The underlying assumption is that a phrase has a positive semantic orientation when it has good associations (e.g., “romantic ambience”) and a negative semantic orientation when it has bad associations (e.g., “horrific events”) (Turney, 2002).
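Returning briefly to the fuzzy membership function of Eq. (6) in Section 5.4, the mapping from a Net Overlap Score to a degree of membership can be sketched as below. This is an illustrative sketch rather than the STEP implementation: the function name `s_function` is hypothetical, the upper branch is written in the standard form of Zadeh's S-function, $1 - 2\left((u - c)/(c - a)\right)^2$, and the defaults $a = 1$, $c = 15$ (hence $b = 8$) are the parameter values quoted in the text.

```python
def s_function(u, a=1.0, c=15.0):
    """Zadeh's S-function mapping a score u onto [0, 1].

    a is the point below which membership is 0, c the point above which it is 1,
    and b = (a + c) / 2 is the crossover point where membership equals 0.5.
    """
    b = (a + c) / 2.0
    if u <= a:
        return 0.0
    if u <= b:
        return 2.0 * ((u - a) / (c - a)) ** 2
    if u < c:
        return 1.0 - 2.0 * ((u - c) / (c - a)) ** 2
    return 1.0


# Membership grows smoothly from 0 at u = a, through 0.5 at u = b, to 1 at u = c.
for score in (0, 1, 4, 8, 12, 15, 20):
    print(score, round(s_function(abs(score)), 3))
```

Because the two quadratic branches meet at $u = b$ with the same value and slope, small changes in the Net Overlap Score near the crossover point produce small changes in membership, which is the practical reason for preferring this curve to a hard threshold.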
Survey of clustering data mining techniques
A Survey of Clustering Data Mining Techniques

Pavel Berkhin
Yahoo!, Inc.
pberkhin@

Summary. Clustering is the division of data into groups of similar objects. It disregards some details in exchange for data simplification. Informally, clustering can be viewed as data modeling concisely summarizing the data, and, therefore, it relates to many disciplines from statistics to numerical analysis. Clustering plays an important role in a broad range of applications, from information retrieval to CRM. Such applications usually deal with large datasets and many attributes. Exploration of such data is a subject of data mining. This survey concentrates on clustering algorithms from a data mining perspective.

1 Introduction

The goal of this survey is to provide a comprehensive review of different clustering techniques in data mining. Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. Representing data with fewer clusters necessarily loses certain fine details (akin to lossy data compression), but achieves simplification. It represents many data objects by few clusters, and hence, it models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. Therefore, clustering is unsupervised learning of a hidden data concept. Data mining applications add to a general picture three complications: (a) large databases, (b) many attributes, (c) attributes of different types. This imposes severe computational requirements on data analysis. Data mining applications include scientific data exploration, information retrieval, text mining, spatial databases, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. They 
present real challenges to classic clustering algorithms. These challenges led to the emergence of powerful broadly applicable data mining clustering methods developed on the foundation of classic techniques. They are the subject of this survey.

1.1 Notations

To fix the context and clarify terminology, consider a dataset $X$ consisting of data points (i.e., objects, instances, cases, patterns, tuples, transactions) $x_i = (x_{i1}, \cdots, x_{id})$, $i = 1:N$, in attribute space $A$, where each component $x_{il} \in A_l$, $l = 1:d$, is a numerical or nominal categorical attribute (i.e., feature, variable, dimension, component, field). For a discussion of attribute data types see [106]. Such point-by-attribute data format conceptually corresponds to an $N \times d$ matrix and is used by a majority of algorithms reviewed below. However, data of other formats, such as variable length sequences and heterogeneous data, are not uncommon.

The simplest subset in an attribute space is a direct Cartesian product of sub-ranges $C = \prod C_l \subset A$, $C_l \subset A_l$, called a segment (i.e., cube, cell, region). A unit is an elementary segment whose sub-ranges consist of a single category value, or of a small numerical bin. Describing the numbers of data points per every unit represents an extreme case of clustering, a histogram. This is a very expensive representation, and not a very revealing one. User driven segmentation is another commonly used practice in data exploration that utilizes expert knowledge regarding the importance of certain sub-domains. Unlike segmentation, clustering is assumed to be automatic, and so it is a machine learning technique.

The ultimate goal of clustering is to assign points to a finite system of $k$ subsets (clusters). Usually (but not always) subsets do not intersect, and their union is equal to the full dataset with the possible exception of outliers

$$X = C_1 \cup \cdots \cup C_k \cup C_{outliers}, \qquad C_i \cap C_j = \emptyset, \quad i \neq j.$$

1.2 Clustering Bibliography at Glance

General references regarding clustering include 
[110], [205], [116], [131], [63], [72], [165], [119], [75], [141], [107], [91]. A very good introduction to contemporary data mining clustering techniques can be found in the textbook [106].

There is a close relationship between clustering and many other fields. Clustering has always been used in statistics [10] and science [158]. The classic introduction into pattern recognition framework is given in [64]. Typical applications include speech and character recognition. Machine learning clustering algorithms were applied to image segmentation and computer vision [117]. For statistical approaches to pattern recognition see [56] and [85]. Clustering can be viewed as a density estimation problem. This is the subject of traditional multivariate statistical estimation [197]. Clustering is also widely used for data compression in image processing, which is also known as vector quantization [89]. Data fitting in numerical analysis provides still another venue in data modeling [53].

This survey's emphasis is on clustering in data mining. Such clustering is characterized by large datasets with many attributes of different types. 
Though we do not even try to review particular applications, many important ideas are related to the specific fields. Clustering in data mining was brought to life by intense developments in information retrieval and text mining [52], [206], [58], spatial database applications, for example, GIS or astronomical data, [223], [189], [68], sequence and heterogeneous data analysis [43], Web applications [48], [111], [81], DNA analysis in computational biology [23], and many others. They resulted in a large amount of application-specific developments, but also in some general techniques. These techniques and classic clustering algorithms that relate to them are surveyed below.

1.3 Plan of Further Presentation

Classification of clustering algorithms is neither straightforward, nor canonical. In reality, different classes of algorithms overlap. Traditionally clustering techniques are broadly divided into hierarchical and partitioning. Hierarchical clustering is further subdivided into agglomerative and divisive. The basics of hierarchical clustering include the Lance-Williams formula, the idea of conceptual clustering, the now classic algorithms SLINK and COBWEB, as well as the newer algorithms CURE and CHAMELEON. We survey these algorithms in the section Hierarchical Clustering.

While hierarchical algorithms gradually (dis)assemble points into clusters (as crystals grow), partitioning algorithms learn clusters directly. In doing so they try to discover clusters either by iteratively relocating points between subsets, or by identifying areas heavily populated with data.

Algorithms of the first kind are called Partitioning Relocation Clustering. 
They are further classified into probabilistic clustering (EM framework, algorithms SNOB, AUTOCLASS, MCLUST), k-medoids methods (algorithms PAM, CLARA, CLARANS, and its extension), and k-means methods (different schemes, initialization, optimization, harmonic means, extensions). Such methods concentrate on how well points fit into their clusters and tend to build clusters of proper convex shapes.

Partitioning algorithms of the second type are surveyed in the section Density-Based Partitioning. They attempt to discover dense connected components of data, which are flexible in terms of their shape. Density-based connectivity is used in the algorithms DBSCAN, OPTICS, DBCLASD, while the algorithm DENCLUE exploits space density functions. These algorithms are less sensitive to outliers and can discover clusters of irregular shape. They usually work with low-dimensional numerical data, known as spatial data. Spatial objects could include not only points, but also geometrically extended objects (algorithm GDBSCAN).

Some algorithms work with data indirectly by constructing summaries of data over the attribute space subsets. They perform space segmentation and then aggregate appropriate segments. We discuss them in the section Grid-Based Methods. They frequently use hierarchical agglomeration as one phase of processing. Algorithms BANG, STING, WaveCluster, and FC are discussed in this section. Grid-based methods are fast and handle outliers well. Grid-based methodology is also used as an intermediate step in many other algorithms (for example, CLIQUE, MAFIA).

Categorical data is intimately connected with transactional databases. The concept of a similarity alone is not sufficient for clustering such data. The idea of categorical data co-occurrence comes to the rescue. The algorithms ROCK, SNN, and CACTUS are surveyed in the section Co-Occurrence of Categorical Data. The situation gets even more aggravated with the growth of the number of items involved. To help with this problem the effort is shifted from 
data clustering to pre-clustering of items or categorical attribute values. Development based on hyper-graph partitioning and the algorithm STIRR exemplify this approach.

Many other clustering techniques are developed, primarily in machine learning, that either have theoretical significance, are used traditionally outside the data mining community, or do not fit in previously outlined categories. The boundary is blurred. In the section Other Developments we discuss the emerging direction of constraint-based clustering, the important research field of graph partitioning, and the relationship of clustering to supervised learning, gradient descent, artificial neural networks, and evolutionary methods.

Data mining primarily works with large databases. Clustering large datasets presents scalability problems reviewed in the section Scalability and VLDB Extensions. Here we talk about algorithms like DIGNET, about BIRCH and other data squashing techniques, and about Hoeffding or Chernoff bounds.

Another trait of real-life data is high dimensionality. Corresponding developments are surveyed in the section Clustering High Dimensional Data. 
The trouble comes from a decrease in metric separation when the dimension grows. One approach to dimensionality reduction uses attribute transformations (DFT, PCA, wavelets). Another way to address the problem is through subspace clustering (algorithms CLIQUE, MAFIA, ENCLUS, OPTIGRID, PROCLUS, ORCLUS). Still another approach clusters attributes in groups and uses their derived proxies to cluster objects. This double clustering is known as co-clustering.
Issues common to different clustering methods are overviewed in the section General Algorithmic Issues. We talk about assessment of results, determination of the appropriate number of clusters to build, data preprocessing, proximity measures, and handling of outliers.
A Survey of Clustering Data Mining Techniques
For the reader's convenience we provide a classification of clustering algorithms closely followed by this survey:
• Hierarchical Methods
    Agglomerative Algorithms
    Divisive Algorithms
• Partitioning Relocation Methods
    Probabilistic Clustering
    K-medoids Methods
    K-means Methods
• Density-Based Partitioning Methods
    Density-Based Connectivity Clustering
    Density Functions Clustering
• Grid-Based Methods
• Methods Based on Co-Occurrence of Categorical Data
• Other Clustering Techniques
    Constraint-Based Clustering
    Graph Partitioning
    Clustering Algorithms and Supervised Learning
    Clustering Algorithms in Machine Learning
• Scalable Clustering Algorithms
• Algorithms For High Dimensional Data
    Subspace Clustering
    Co-Clustering Techniques
1.4 Important Issues
The properties of clustering algorithms we are primarily concerned with in data mining include:
• Type of attributes algorithm can handle
• Scalability to large datasets
• Ability to work with high dimensional data
• Ability to find clusters of irregular shape
• Handling outliers
• Time complexity (we frequently simply use the term complexity)
• Data order dependency
• Labeling or assignment (hard or strict vs. soft or fuzzy)
• Reliance on a priori knowledge and user defined parameters
• Interpretability of results
Realistically, with every
algorithm we discuss only some of these properties. The list is in no way exhaustive. For example, as appropriate, we also discuss an algorithm's ability to work in a pre-defined memory buffer, to restart, and to provide an intermediate solution.
2 Hierarchical Clustering
Hierarchical clustering builds a cluster hierarchy or a tree of clusters, also known as a dendrogram. Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent. Such an approach allows exploring data on different levels of granularity. Hierarchical clustering methods are categorized into agglomerative (bottom-up) and divisive (top-down) [116], [131]. An agglomerative clustering starts with one-point (singleton) clusters and recursively merges two or more of the most similar clusters. A divisive clustering starts with a single cluster containing all data points and recursively splits the most appropriate cluster. The process continues until a stopping criterion (frequently, the requested number k of clusters) is achieved. Advantages of hierarchical clustering include:
• Flexibility regarding the level of granularity
• Ease of handling any form of similarity or distance
• Applicability to any attribute types
Disadvantages of hierarchical clustering are related to:
• Vagueness of termination criteria
• Most hierarchical algorithms do not revisit (intermediate) clusters once constructed.
The classic approaches to hierarchical clustering are presented in the subsection Linkage Metrics. Hierarchical clustering based on linkage metrics results in clusters of proper (convex) shapes. Active contemporary efforts to build cluster systems that incorporate our intuitive concept of clusters as connected components of arbitrary shape, including the algorithms CURE and CHAMELEON, are surveyed in the subsection Hierarchical Clusters of Arbitrary Shapes. Divisive techniques based on binary taxonomies are presented in the subsection Binary Divisive Partitioning. The subsection Other
Developments contains information related to incremental learning, model-based clustering, and cluster refinement.
In hierarchical clustering our regular point-by-attribute data representation frequently is of secondary importance. Instead, hierarchical clustering frequently deals with the N×N matrix of distances (dissimilarities) or similarities between training points, sometimes called a connectivity matrix. So-called linkage metrics are constructed from elements of this matrix. Keeping a full connectivity matrix in memory is unrealistic for large datasets. To relax this limitation different techniques are used to sparsify (introduce zeros into) the connectivity matrix. This can be done by omitting entries smaller than a certain threshold, by using only a certain subset of data representatives, or by keeping with each point only a certain number of its nearest neighbors (for nearest neighbor chains see [177]). Notice that the way we process the original (dis)similarity matrix and construct a linkage metric reflects our a priori ideas about the data model.
With the (sparsified) connectivity matrix we can associate the weighted connectivity graph G(X, E) whose vertices X are data points, and edges E and their weights are defined by the connectivity matrix. This establishes a connection between hierarchical clustering and graph partitioning.
One of the most striking developments in hierarchical clustering is the algorithm BIRCH. It is discussed in the section Scalable VLDB Extensions.
Hierarchical clustering initializes a cluster system as a set of singleton clusters (agglomerative case) or a single cluster of all points (divisive case) and proceeds iteratively merging or splitting the most appropriate cluster(s) until the stopping criterion is achieved. The appropriateness of a cluster(s) for merging or splitting depends on the (dis)similarity of cluster(s) elements.
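The agglomerative case of the iterative procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an algorithm taken from the survey: the function names `cluster_dissimilarity` and `agglomerate`, the use of Euclidean distance between points, and the default choice of the minimum as the aggregate over pairwise distances are all hypothetical choices made for the example. The exhaustive pair search is deliberately naive and much slower than practical implementations.

```python
import math
from itertools import combinations


def cluster_dissimilarity(a, b, points, op=min):
    """Aggregate the pairwise point distances between clusters a and b.

    a and b are lists of indices into points; op is an assumed aggregate
    such as min or max applied to all cross-cluster distances.
    """
    return op(math.dist(points[i], points[j]) for i in a for j in b)


def agglomerate(points, k, op=min):
    """Start from singleton clusters and merge the closest pair until k remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        # Exhaustively search for the pair of clusters with the smallest
        # dissimilarity (illustrative only; this is not efficient).
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_dissimilarity(
                clusters[pair[0]], clusters[pair[1]], points, op
            ),
        )
        # combinations() yields a < b, so deleting index b keeps index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(agglomerate(points, k=3))
```

Here the stopping criterion is the requested number k of clusters, as in the text. Passing `op=max` changes only how cross-cluster distances are aggregated; the merge loop itself stays the same.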
This dependence on (dis)similarity reflects a general presumption that clusters consist of similar points. An important example of dissimilarity between two points is the distance between them.
To merge or split subsets of points rather than individual points, the distance between individual points has to be generalized to the distance between subsets. Such a derived proximity measure is called a linkage metric. The type of a linkage metric significantly affects hierarchical algorithms, because it reflects a particular concept of closeness and connectivity. Major inter-cluster linkage metrics [171], [177] include single link, average link, and complete link. The underlying dissimilarity measure (usually, distance) is computed for every pair of nodes with one node in the first set and another node in the second set. A specific operation such as minimum (single link), average (average link), or maximum (complete link) is applied to pair-wise dissimilarity measures:
$d(C_1, C_2) = \mathrm{Op}\{\, d(x, y) : x \in C_1,\ y \in C_2 \,\}$
Early examples include the algorithm SLINK [199], which implements single link (Op = min), Voorhees' method [215], which implements average link (Op = Avr), and the algorithm CLINK [55], which implements complete link (Op = max). Of these, SLINK is related to the problem of finding the Euclidean minimal spanning tree [224] and has $O(N^2)$ complexity.
The methods using inter-cluster distances defined in terms of pairs of nodes (one in each respective cluster) are called graph methods. They do not use any cluster representation other than a set of points. This name naturally relates to the connectivity graph G(X, E) introduced above, because every data partition corresponds to a graph partition. Such methods can be augmented by so-called geometric methods in which a cluster is represented by its central point. Under the assumption of numerical attributes, the center point is defined as a centroid or an average of two cluster centroids subject to agglomeration. It results in centroid, median, and minimum variance linkage metrics.
All of the above linkage metrics can be derived from
the Lance-Williams updating formula [145],
$d(C_i \cup C_j, C_k) = a(i)\, d(C_i, C_k) + a(j)\, d(C_j, C_k) + b \cdot d(C_i, C_j) + c\, |d(C_i, C_k) - d(C_j, C_k)|$.
Here a, b, c are coefficients corresponding to a particular linkage. This formula expresses a linkage metric between a union of the two clusters and the third cluster in terms of underlying nodes. The Lance-Williams formula is crucial to making the (dis)similarity computations feasible. Surveys of linkage metrics can be found in [170], [54]. When distance is used as a base measure, linkage metrics capture inter-cluster proximity. However, a similarity-based view that results in intra-cluster connectivity considerations is also used, for example, in the original average link agglomeration (Group-Average Method) [116].
Under reasonable assumptions, such as the reducibility condition (graph methods satisfy this condition), linkage metrics methods suffer from $O(N^2)$ time complexity [177]. Despite the unfavorable time complexity, these algorithms are widely used. As an example, the algorithm AGNES (AGglomerative NESting) [131] is used in S-Plus.
When the connectivity N×N matrix is sparsified, graph methods directly dealing with the connectivity graph G can be used. In particular, the hierarchical divisive MST (Minimum Spanning Tree) algorithm is based on graph partitioning [116].
2.1 Hierarchical Clusters of Arbitrary Shapes
For spatial data, linkage metrics based on Euclidean distance naturally generate clusters of convex shapes. Meanwhile, visual inspection of spatial images frequently discovers clusters with curvy appearance.
Guha et al. [99] introduced the hierarchical agglomerative clustering algorithm CURE (Clustering Using REpresentatives). This algorithm has a number of novel features of general importance. It takes special steps to handle outliers and to provide labeling in the assignment stage. It also uses two techniques to achieve scalability: data sampling (section 8), and data partitioning. CURE creates p partitions, so that fine granularity clusters are constructed in
partitions first. A major feature of CURE is that it represents a cluster by a fixed number, c, of points scattered around it. The distance between two clusters used in the agglomerative process is the minimum of distances between two scattered representatives. Therefore, CURE takes a middle approach between the graph (all-points) methods and the geometric (one centroid) methods. Single and average link closeness are replaced by representatives' aggregate closeness. Selecting representatives scattered around a cluster makes it possible to cover non-spherical shapes. As before, agglomeration continues until the requested number k of clusters is achieved. CURE employs one additional trick: originally selected scattered points are shrunk toward the geometric centroid of the cluster by a user-specified factor α. Shrinkage suppresses the effect of outliers; outliers happen to be located further from the cluster centroid than the other scattered representatives. CURE is capable of finding clusters of different shapes and sizes, and it is insensitive to outliers. Because CURE uses sampling, estimation of its complexity is not straightforward. For low-dimensional data the authors provide a complexity estimate of $O(N_{\mathrm{sample}}^2)$ defined in terms of a sample size. More exact bounds depend on input parameters: shrink factor α, number of representative points c, number of partitions p, and a sample size. Figure 1(a) illustrates agglomeration in CURE. Three clusters, each with three representatives, are shown before and after the merge and shrinkage. Two closest representatives are connected.
While the algorithm CURE works with numerical attributes (particularly low dimensional spatial data), the algorithm ROCK developed by the same researchers [100] targets hierarchical agglomerative clustering for categorical attributes. It is reviewed in the section Co-Occurrence of Categorical Data.
The hierarchical agglomerative algorithm CHAMELEON [127] uses the connectivity graph G
corresponding to the K-nearest neighbor model sparsification of the connectivity matrix: the edges of K most similar points to any given point are preserved, the rest are pruned. CHAMELEON has two stages. In the first stage small tight clusters are built to ignite the second stage. This involves a graph partitioning [129]. In the second stage an agglomerative process is performed. It utilizes measures of relative inter-connectivity $RI(C_i, C_j)$ and relative closeness $RC(C_i, C_j)$; both are locally normalized by internal interconnectivity and closeness of clusters $C_i$ and $C_j$. In this sense the modeling is dynamic: it depends on data locally. Normalization involves certain non-obvious graph operations [129]. CHAMELEON relies heavily on graph partitioning implemented in the library HMETIS (see section 6). The agglomerative process depends on user provided thresholds. A decision to merge is made based on the combination
$RI(C_i, C_j) \cdot RC(C_i, C_j)^{\alpha}$
of local measures. The algorithm does not depend on assumptions about the data model. It has been proven to find clusters of different shapes, densities, and sizes in 2D (two-dimensional) space. It has a complexity of $O(Nm + N\log(N) + m^2\log(m))$, where m is the number of sub-clusters built during the first initialization phase. Figure 1(b) (analogous to the one in [127]) clarifies the difference with CURE. It presents a choice of four clusters (a)-(d) for a merge. While CURE would merge clusters (a) and (b), CHAMELEON makes the intuitively better choice of merging (c) and (d).
2.2 Binary Divisive Partitioning
In linguistics, information retrieval, and document clustering applications binary taxonomies are very useful. Linear algebra methods based on singular value decomposition (SVD) are used for this purpose in collaborative filtering and information retrieval [26]. Application of SVD to hierarchical divisive clustering of document collections resulted in the PDDP (Principal Direction Divisive Partitioning) algorithm [31]. In our notations, object x is a document, the l-th attribute corresponds to a
word (index term), and a matrix X entry $x_{il}$ is a measure (e.g. TF-IDF) of l-term frequency in a document x. PDDP constructs the SVD decomposition of the matrix
$(X - e\bar{x}),\quad \bar{x} = \frac{1}{N}\sum_{i=1:N} x_i,\quad e = (1, \ldots, 1)^T$.
This algorithm bisects data in Euclidean space by a hyperplane that passes through the data centroid orthogonal to the eigenvector with the largest singular value. A k-way split is also possible if the k largest singular values are considered. Bisecting is a good way to categorize documents and it yields a binary tree. When k-means (2-means) is used for bisecting, the dividing hyperplane is orthogonal to the line connecting the two centroids. The comparative study of SVD vs. k-means approaches [191] can be used for further references. Hierarchical divisive bisecting k-means was proven [206] to be preferable to PDDP for document clustering.
Fig. 1. Agglomeration in Clusters of Arbitrary Shapes: (a) Algorithm CURE, (b) Algorithm CHAMELEON
While PDDP or 2-means are concerned with how to split a cluster, the problem of which cluster to split is also important. Simple strategies are: (1) split each node at a given level, (2) split the cluster with highest cardinality, and (3) split the cluster with the largest intra-cluster variance. All three strategies have problems. For a more detailed analysis of this subject and better strategies, see [192].
2.3 Other Developments
One of the early agglomerative clustering algorithms, Ward's method [222], is based not on a linkage metric, but on an objective function used in k-means. The merger decision is viewed in terms of its effect on the objective function.
The popular hierarchical clustering algorithm for categorical data COBWEB [77] has two very important qualities. First, it utilizes incremental learning. Instead of following divisive or agglomerative approaches, it dynamically builds a dendrogram by processing one data point at a time. Second, COBWEB is an example of conceptual or model-based learning. This means that each cluster is considered as a model that can
be described intrinsically, rather than as a collection of points assigned to it. COBWEB's dendrogram is called a classification tree. Each tree node (cluster) C is associated with the conditional probabilities for categorical attribute-values pairs,
$Pr(x_l = \nu_{lp} \mid C),\quad l = 1:d,\quad p = 1:|A_l|$.
This easily can be recognized as a C-specific Naïve Bayes classifier. During the classification tree construction, every new point is descended along the tree and the tree is potentially updated (by an insert/split/merge/create operation). Decisions are based on the category utility [49]
$CU\{C_1, \ldots, C_k\} = \frac{1}{k}\sum_{j=1:k} CU(C_j)$
$CU(C_j) = \sum_{l,p}\left( Pr(x_l = \nu_{lp} \mid C_j)^2 - Pr(x_l = \nu_{lp})^2 \right)$.
Category utility is similar to the GINI index. It rewards clusters $C_j$ for increases in predictability of the categorical attribute values $\nu_{lp}$. Being incremental, COBWEB is fast with a complexity of O(tN), though it depends non-linearly on tree characteristics packed into a constant t. There is a similar incremental hierarchical algorithm for all numerical attributes called CLASSIT [88]. CLASSIT associates normal distributions with cluster nodes. Both algorithms can result in highly unbalanced trees.
Chiu et al. [47] proposed another conceptual or model-based approach to hierarchical clustering. This development contains several different useful features, such as the extension of scalability preprocessing to categorical attributes, outliers handling, and a two-step strategy for monitoring the number of clusters including BIC (defined below). A model associated with a cluster covers both numerical and categorical attributes and constitutes a blend of Gaussian and multinomial models. Denote corresponding multivariate parameters by θ. With every cluster C we associate a logarithm of its (classification) likelihood
$l_C = \sum_{x_i \in C} \log p(x_i \mid \theta)$
The algorithm uses maximum likelihood estimates for parameter θ. The distance between two clusters is defined (instead of linkage metric) as a decrease in log-likelihood
$d(C_1, C_2) = l_{C_1} + l_{C_2} - l_{C_1 \cup C_2}$
caused by merging of the two clusters
under consideration. The agglomerative process continues until the stopping criterion is satisfied. As such, determination of the best k is automatic. This algorithm has a commercial implementation (in SPSS Clementine). The complexity of the algorithm is linear in N for the summarization phase.
Traditional hierarchical clustering does not change the membership of points in once-assigned clusters due to its greedy approach: after a merge or a split is selected it is not refined. Though COBWEB does reconsider its decisions, its
Mammalian microRNAs predominantly act to decrease target mRNA levels
Huili Guo^1,2, Nicholas T. Ingolia^3,4, Jonathan S. Weissman^3,4 & David P. Bartel^1,2
MicroRNAs (miRNAs) are endogenous ~22-nucleotide RNAs that mediate important gene-regulatory events by pairing to the mRNAs of protein-coding genes to direct their repression. Repression of these regulatory targets leads to decreased translational efficiency and/or decreased mRNA levels, but the relative contributions of these two outcomes have been largely unknown, particularly for endogenous targets expressed at low-to-moderate levels. Here, we use ribosome profiling to measure the overall effects on protein production and compare these to simultaneously measured effects on mRNA levels. For both ectopic and endogenous miRNA regulatory interactions, lowered mRNA levels account for most (≥84%) of the decreased protein production. These results show that changes in mRNA levels closely reflect the impact of miRNAs on gene expression and indicate that destabilization of target mRNAs is the predominant reason for reduced protein output.
Each highly conserved mammalian miRNA typically targets mRNAs of hundreds of distinct genes, such that as a class these small regulatory RNAs dampen the expression of most protein-coding genes to optimize their expression patterns [1,2]. When pairing to a target is extensive, a miRNA can direct destruction of the targeted mRNA through Argonaute-catalysed mRNA cleavage [3,4]. This mode of repression dominates in plants [5], but in animals all but a few targets lack the extensive pairing required for cleavage [2].
The molecular consequences of the repression mode that dominates in animals are less clear. Initially miRNAs were thought to repress protein output with little or no influence on mRNA levels [6,7].
Then mRNA-array experiments showed that miRNAs decrease the levels of many targeted mRNAs [8–11]. A revisit of the initially identified targets of Caenorhabditis elegans miRNAs showed that these transcripts also decrease in the presence of their cognate miRNAs [12]. The mRNA decreases are associated with poly(A)-tail shortening, leading to a model in which miRNAs cause mRNA de-adenylation, which promotes de-capping and more rapid degradation through standard mRNA-turnover processes [10,13–15]. The magnitude of this destabilization, however, is usually quite modest, which has bolstered the lingering notion that with some exceptions (for example, Drosophila miR-12 regulation of CG10011, ref. 14) most repression occurs through translational repression, and that monitoring mRNA destabilization might miss many targets that are downregulated without detectable mRNA changes. Challenging this view are results of high-throughput analyses comparing protein and mRNA changes after introducing or deleting individual miRNAs [16,17]. An interpretation of these results is that the modest mRNA destabilization imparted by each miRNA–target interaction represents most of the miRNA-mediated repression [16]. We call this the 'mRNA-destabilization' scenario and contrast it to the original 'translational-repression' scenario, which posited decreased translation with relatively little mRNA change.
In the mRNA-destabilization scenario differences between protein and mRNA changes are mostly attributed to either measurement noise or complications arising from pre-steady-state comparisons of mRNA-array data, which measure differences at one moment in time, and proteomic data, which measure differences integrated over an extended period of protein synthesis. If either mRNA levels or miRNA activities change over the period of protein synthesis (or the period of metabolic labelling), correspondence between mRNA destabilization and protein decreases could become distorted.
Another complication of proteomic data sets is that they preferentially examine more highly expressed proteins, whose repression might differ from more modestly expressed proteins.
A recent study used mRNA arrays to monitor effects on both mRNA levels and mRNA ribosome density and occupancy, thereby providing a more sensitive analysis of changes in mRNA utilization and bypassing the need to compare protein and mRNA [18]. This array study supports the mRNA-destabilization scenario but examines the response to an ectopically introduced miRNA, leaving open the question of whether endogenous miRNA–target interactions might impart additional translational repression.
Ribosome profiling, a method that determines the positions of ribosomes on cellular mRNAs with sub-codon resolution [19], is based on deep sequencing of ribosome-protected mRNA fragments (RPFs) and thereby provides quantitative data on thousands of genes not detected by general proteomics methods. Moreover, ribosome profiling reports on the status of the cell at a particular time point, and thus generates results more directly comparable to mRNA-profiling results than does proteomics. We extended this method to human and mouse cells, thereby enabling a fresh look at the molecular consequences of miRNA repression.
Ribosome profiling in mammalian cells
Ribosome profiling generates short sequence tags that each mark the mRNA coordinates of one bound ribosome [19]. The outline of our protocol for mammalian cells paralleled that used for yeast (Fig. 1a). Cells were treated with cycloheximide to arrest translating ribosomes. Extracts from these cells were then treated with RNase I to degrade regions of mRNAs not protected by ribosomes. The resulting 80S monosomes, many of which contained a ~30-nucleotide RPF, were purified on sucrose gradients and then treated to release the RPFs, which were processed for Illumina high-throughput sequencing.
We started with HeLa cells, performing ribosome profiling on miRNA- and mock-transfected cells. In parallel, poly(A)-selected mRNA from each sample was randomly fragmented, and the resulting mRNA fragments were processed for sequencing (mRNA-Seq) using the same protocol as that used for the RPFs. Sequencing generated 11–18 million raw reads per sample, of which 4–8 million were used for subsequent analyses because they each mapped to a single location in a database of annotated pre-mRNAs and mRNA splice junctions (Supplementary Table 1).
^1 Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142, USA. ^2 Howard Hughes Medical Institute and Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA. ^3 Howard Hughes Medical Institute and Department of Cellular and Molecular Pharmacology, University of California, San Francisco, California 94158, USA. ^4 California Institute for Quantitative Biosciences, San Francisco, California 94158, USA.
Vol 466 | 12 August 2010 | doi:10.1038/nature09267
Combining RPFs from HeLa-expressed mRNAs into one composite mRNA showed that ribosome profiling captured fundamental features of translation (Fig. 1b, c and Supplementary Fig. 1c). Although a few RPFs mapped to annotated 5′-untranslated regions (5′UTRs), which indicated the presence of ribosomes at upstream open reading frames (ORFs) [19], the vast majority mapped to annotated ORFs. RPF density was highest at the start and stop codons, reflecting known pauses at these positions [20]. mRNA-Seq tags, in contrast, mapped uniformly across the length of the mRNA, as expected for randomly fragmented mRNA.
The most striking feature in the composite-mRNA analysis was the 3-nucleotide periodicity of the RPFs. In sharp contrast to the 5′ termini of the mRNA-Seq tags, which mapped to all three codon nucleotides equally, the RPF 5′ termini mostly mapped to the first nucleotide of the codon (Fig. 1d). This pattern, analogous to that observed in yeast [19], is attributable to the RPFs capturing the movement
of ribosomes along mRNAs, three nucleotides at a time. The protocol applied to mouse neutrophils generated ~30-nucleotide RPFs with the same pattern (Supplementary Fig. 1d, e). Thus, ribosome profiling mapped, at sub-codon resolution, the positions of translating ribosomes in human and mouse cells.
Similar repression regardless of target expression level
General features of translation and translational efficiency in mammalian cells will be presented elsewhere. Here, we focus on miRNA-dependent changes in protein production. Our HeLa-cell experiments examined the impact of introducing miR-1 or miR-155, both of which are not normally expressed in HeLa cells, and our mouse-neutrophil experiments examined the impact of knocking out mir-223, which encodes a miRNA highly and preferentially expressed in neutrophils [21]. These cell types and miRNAs were chosen because proteomics experiments using either the SILAC (stable isotope labelling with amino acids in cell culture) or pSILAC (a pulsed-labelled version of SILAC) methods had already reported the impact of each of these miRNAs on the output of thousands of proteins [16,17].
Pairing to the miRNA seed (nucleotides 2–7) is important for target recognition, and several types of seed-matched sites, ranging in length from 6 to 8 nucleotides, mediate repression [2]. Ribosome-profiling and mRNA-Seq results showed the expected correlation between site length and site efficacy [2] (Supplementary Fig. 2). Because the response of mRNAs with single 6-nucleotide sites was marginal and observed only in the miR-1 experiment, subsequent analyses focused on mRNAs with at least one canonical 7–8-nucleotide site.
In the miR-155 experiment, mRNAs from 5,103 distinct genes passed our read threshold for single-gene quantification (≥100 RPFs and ≥100 mRNA-Seq tags in the mock-transfection control). Genes with at least one 3′UTR site tended to be repressed following addition of miR-155, yielding fewer mRNA-Seq tags and fewer RPFs in the presence of the miRNA (Fig. 2a; $P < 10^{-48}$ and
$10^{-37}$, respectively, one-tailed Kolmogorov–Smirnov (K–S) test, comparing to genes with no site in the entire message). Proteins from 2,597 of the 5,103 genes were quantified in the analogous pSILAC experiment [17]. The mRNA and RPF changes for the pSILAC-detected subset were no less pronounced than those of the larger set of analysed genes (Fig. 2a; P = 0.70 and 0.62 for mRNA and RPF data, respectively, K–S test), which implied that the response of mRNAs of proteins detected by high-throughput quantitative proteomics accurately represented the response of all mRNAs. Analogous results were obtained in the miR-1 and miR-223 experiments (Fig. 2b, c; $P < 10^{-10}$ for each comparison to genes with no site, and P > 0.56 for each comparison to the proteomics-detected subset). Furthermore, analyses of genes binned by expression level, which enabled inclusion of data from 11,000 distinct genes that ranged broadly in expression (more than 1,000-fold difference between the first and last bins), confirmed that miRNAs do not repress their lowly expressed targets more potently than they do their more highly expressed targets (Supplementary Fig. 3).
[Figure 1 panels: a, schematic (add cycloheximide, lyse cells, high-throughput sequencing of RPFs); b, c, read density (rpM) versus distance from the first base of the start codon or stop codon (nt); d, fraction of reads.]
Figure 1 | Ribosome profiling in human cells captured features of translation. a, Schematic diagram of ribosome profiling. Sequencing reproducibility and evidence for mapping to the correct mRNA isoforms are illustrated (Supplementary Fig. 1a, b). b, RPF density near the ends of ORFs, combining data from all quantified genes. Plotted are RPF 5′ termini, as reads per million reads mapping to genes (rpM). Illustrated below the graph are the inferred ribosome positions corresponding to peak RPF densities, at which the start codon was in the P site (left) and the stop codon was in the A site (right). The offset between the 5′ terminus of an RPF and the first nucleotide in the human ribosome A site was typically 15 nucleotides (nt). c, Density of RPFs and mRNA-Seq tags near the ends of ORFs in HeLa cells. RPF density is plotted as in panel b, except positions are shifted +15 nucleotides to reflect the position of the first nucleotide in the ribosome A site. Composite data are shown for ≥600-nucleotide ORFs that passed our threshold for quantification (≥100 RPFs and ≥100 mRNA-Seq tags). d, Fraction of RPFs and mRNA-Seq tags mapping to each of the three codon nucleotides in panel c.
As these results indicated that restricting analyses to mRNAs with higher expression, by requiring either a minimal read count or a proteomics-detected protein, did not somehow distort the picture of miRNA targeting and repression, we focused on the mRNAs with at least one 3′UTR site and for which the proteomics detected a substantial change at the protein level. These sets of mRNAs were called 'proteomics-supported targets' because they were expected to be highly enriched
in direct targets of the miRNAs. Indeed, they responded more robustly to the introduction or ablation of cognate miRNAs (Fig. 2a–c; $P < 10^{-5}$ for each comparison to proteomics-detected genes with sites). Because some 7–8-nucleotide seed-matched sites do not confer repression by the corresponding miRNA [2,22], the proteomics-supported targets, which excluded most messages with non-functional sites, were the most informative for subsequent analyses.
Modest influence on translational efficiency
We next examined whether our results supported the translation-repression scenario, in which translation is repressed without a substantial mRNA decrease. In the characterized examples in which miRNAs direct translation inhibition, repression is reported to occur through either reduced translation initiation [23–25] or increased ribosome drop-off [26]. Both of these mechanisms would lead to fewer ribosomes on target mRNAs and thus fewer RPFs from these mRNAs after accounting for changes in mRNA levels. To detect this effect, we accounted for changes in mRNA levels by incorporating the mRNA-Seq results. For example, for each quantified gene in the miR-155 experiment, we divided the change in RPFs by the change in mRNA-Seq tags (that is, we subtracted the log2-fold changes). This calculation removed the component of the RPF change attributable to miRNA-dependent changes in poly(A) mRNA, leaving the residual change as the component attributable to a change in ribosome density, which we interpret as a change in 'translational efficiency' [19].
We observed a statistically significant decrease in translational efficiency for messages with miR-155 sites compared to those without, indicating that miRNA targeting leads to fewer ribosomes on target mRNAs that have not yet lost their poly(A)-tail and become destabilized (Fig. 2d, P = 0.003, K–S test). This decrease, however, was very modest. Even these proteomics-supported targets underwent only a 7% decrease in translational efficiency (−0.11 log2-fold
change, Fig. 2d, inset), compared to a 33% decrease in polyadenylated mRNA (–0.59 log2-fold change, Fig. 2a). Analogous results were obtained for the miR-1 and miR-223 experiments (Fig. 2e, f; P = 0.001, P = 0.05, respectively). Thus, for both ectopic and endogenous regulatory interactions, only a small fraction of repression observed by ribosome profiling (11–16%) was attributable to reduced translational efficiency. At least 84% of the repression was attributable instead to decreased mRNA levels, a percentage somewhat greater than the ~75% reported from array analyses of ectopic interactions 18.

[Figure 2 plots, panels a–f: cumulative fraction versus mRNA-Seq fold change (log2), RPF fold change (log2) and translational efficiency fold change (log2) for the miR-155, miR-1 and miR-223 experiments. Legend gene counts, one group per experiment: ≥1 site (707), no site (3,186), ≥1 site proteomics-detected (299), ≥1 site proteomics-supported (121); ≥1 site (853), no site (2,378), ≥1 site proteomics-detected (386), ≥1 site proteomics-supported (99); ≥1 site (768), no site (2,916), ≥1 site proteomics-detected (337), ≥1 site proteomics-supported (77).]

Figure 2 | MicroRNAs downregulated gene expression mostly through mRNA destabilization, with a small effect on translational efficiency. a, Cumulative distributions of mRNA-Seq changes (left) and RPF changes (right) after introducing miR-155. Plotted are distributions for the genes with ≥1 miR-155 3′ UTR site (blue), the subset of these genes detected in the pSILAC experiment (proteomics-detected, red), the subset of the proteomics-detected genes with proteins responding with log2-fold change ≤ –0.3 (proteomics-supported, green), and the control genes, which lacked miR-155 sites throughout their mRNAs (no site, black). The number of genes in each category is indicated in parentheses. b, Cumulative distributions of mRNA-Seq changes (left) and RPF changes (right) after introducing miR-1. Otherwise, as in panel a. c, Cumulative distributions of mRNA-Seq changes (left) and RPF changes (right) after deleting mir-223. Otherwise, as in panel a, with proteomics-supported genes referring to genes with proteins that responded with log2-fold change ≥ 0.3 in the SILAC experiment. d, Cumulative distributions of translational efficiency changes for the polyadenylated mRNA that remained after introducing miR-155. For each gene, the translational efficiency change was calculated by normalizing the RPF change by the mRNA-Seq change. For each distribution, the mean log2-fold
change (± standard error) is shown (inset). e, Cumulative distributions of translational efficiency changes for the polyadenylated mRNA that remained after introducing miR-1. Otherwise, as in panel d. f, Cumulative distributions of translational efficiency changes for the polyadenylated mRNA that remained after deleting mir-223. Otherwise, as in panel d.

Analyses described thus far focused on messages with at least one 3′ UTR site to the cognate miRNA, without considering whether the site was conserved in orthologous UTRs of other animals. When we focused on evolutionarily conserved sites 1, the results were similar but noisier because the conserved sites, although more efficacious, were 3–13-fold less abundant (Supplementary Fig. 4). When changing the focus to messages with sites only in the ORFs, the results were also similar but again noisier because sites in the open reading frames are less efficacious 16,17,22, which led to ~70% fewer genes classified as proteomics-supported targets (Supplementary Fig. 5).

mRNA reduction consistently mirrored RPF reduction

Analyses of fold-change distributions (Fig. 2) supported the mRNA-destabilization scenario for most targets, but still allowed for the possibility that the translational-repression scenario might apply to a small subset of targets. To search for evidence for a set of unusual targets undergoing translational repression without substantial mRNA destabilization, we compared the mRNA and ribosome-profiling changes for the 5,103 quantifiable genes from the miR-155 experiment. Correlation between the two types of responses was strong for the messages with miR-155 sites, and particularly for those that were proteomics-supported targets (Fig. 3a, R² = 0.49 and 0.63, respectively). A strong correlation was also observed for genes considered only after relaxing the expression cut-offs (Supplementary Fig. 6a). Any scatter that might have indicated that a few genes undergo translational repression without substantial
mRNA destabilization strongly resembled the scatter observed in parallel analysis of genes without sites (Fig. 3b). The same was observed for the miR-1 experiment, but in this case the correlations were even stronger (R² = 0.72 and 0.80, respectively), presumably because the increased response to the miRNA led to a correspondingly reduced contribution of experimental noise (Fig. 3c, d; Supplementary Fig. 6b). The same was also observed for the miR-223 experiment, with weaker correlations (R² = 0.26 and 0.40, respectively) attributable to the reduced response to the miRNA and a correspondingly increased contribution of experimental noise (Fig. 3e, f). Supporting this interpretation, systematically increasing expression cut-offs, which retained data with progressively lower noise from stochastic counting fluctuations, progressively increased the correlation between RPF and mRNA-Seq changes (Supplementary Fig. 6c). We also examined messages with multiple sites to the cognate miRNA and found that they behaved no differently with regard to the relationship between mRNA-Seq and RPF changes (Supplementary Fig. 7). In summary, we found no evidence that countered the conclusion that miRNAs act predominantly to reduce mRNA levels of nearly all, if not all, targets.

Uniform changes along the ORF length

If miRNA targeting causes ribosomes to drop off the message after translating a substantial fraction of the ORF, then the RPF changes summed over the length of the ORF might underestimate the reduced production of full-length protein. Therefore, we re-examined the ribosome profiling data, which determine the location of ribosomes along the length of the mRNAs, thereby providing transcriptome-wide information that could detect ribosome drop-off. For highly expressed genes targeted in their 3′ UTRs (e.g., TAGLN2 in the miR-1 experiment; Supplementary Fig. 8a), downregulation at the mRNA and ribosome levels was observed along the length of the ORF. In order to extend this analysis to genes with more moderate expression, we
examined composite ORFs representing proteomics-supported targets and compared these to composite ORFs representing genes without sites. When miR-155 targets were compared to genes without sites, fewer mRNA-Seq tags were observed across the length of the composite ORF (Fig. 4a). RPFs tended to be further reduced (P = 0.007, one-tailed Mann–Whitney test), but without a systematic change in the magnitude of this additional reduction across the length of the ORF (P = 0.95, two-tailed analysis of covariance (ANCOVA) test). Because ribosome drop-off would decrease the ribosome occupancy less at the beginning of the ORF than at the end, whereas inhibiting translation initiation would not, the observed uniform reduction supported mechanisms in which initiation was inhibited. Analogous results were observed in the miR-1 experiment (Fig. 4b; P = 0.002 for further reduction in RPFs; P = 0.85 for systematic change across the ORF). Evidence for drop-off was also not observed in the miR-223 experiment, although a change in translational efficiency was difficult to detect in this analysis, presumably because the miRNA-mediated changes were lower in magnitude (Fig. 4c). The same conclusions were drawn from analyses in which we first normalized for ORF length (Supplementary Fig. 9).

Implications for the mechanism of repression

For both ectopic and endogenous miRNA targeting interactions, the molecular consequences of miRNA regulation were most consistent with the mRNA-destabilization scenario. Although acquiring similar data on cell types beyond the two examined here will be important, we have no reason to doubt that our conclusion will apply broadly to the vast majority of miRNA targeting interactions. If indeed general, this conclusion will be welcome news to biologists wanting to measure the ultimate impact of miRNAs on their direct regulatory targets. Because the quantitative effects on translating ribosomes so closely mirrored the decreases in polyadenylated mRNA, the impact on protein production can be
closely approximated using mRNA arrays or mRNA-Seq. Our results might also provide insight into the question of why some targets are more responsive to miRNAs than others; in the destabilization scenario, otherwise long-lived messages might undergo comparatively more destabilization than would constitutively short-lived ones.

[Figure 3 plots, panels a–f: RPF fold change (log2) versus mRNA-Seq fold change (log2).]

Figure 3 | Ribosome changes from miRNA targeting corresponded to mRNA changes. a, Correspondence between ribosome (RPF) and mRNA (mRNA-Seq) changes after introducing miR-155, plotting data for the 707 quantified genes with at least one miR-155 3′ UTR site (blue circles). Proteomics-detected targets and proteomics-supported targets are highlighted (pink diamonds and green crosses, respectively). Expected standard deviations (error bars) were calculated based on the number of reads obtained per gene and assuming random counting statistics. The R² values derived from Pearson's correlation of all data are indicated. b, Correspondence between ribosome and mRNA changes after introducing miR-155, plotting data for 707 genes randomly selected from the 3,186 quantified genes lacking a miR-155 site anywhere in the mRNA. Otherwise, as in panel a. c, d, As in panels a and b, but plotting results for the miR-1 experiment. e, f, As in panels a and b, but plotting results for the miR-223 experiment.

Translation repression and mRNA destabilization are sometimes coupled 27, which raises the possibility that the miRNA-mediated mRNA destabilization might be a consequence of translational repression. If so, a greater fraction of the repression might be attributable to decreased translational efficiency if the effects were analysed sooner after introducing a miRNA. However, the fraction attributable to decreased translational efficiency remained small when repeating the
analysis using samples from 12 h (rather than 32 h) after introducing miR-155 or miR-1 (Supplementary Fig. 10 and Supplementary Table 2). Although these results at earlier time points cannot rule out rapid destabilization as a consequence of translational repression, our results revealing such small decreases in translational efficiency for target mRNAs strongly imply that even if destabilization were secondary to translational repression, it would be this destabilization (that is, the reduced availability of mRNA for subsequent rounds of translation) that would exert the greatest impact on protein production. Moreover, miRNA-mediated mRNA de-adenylation, which is the best-characterized mechanism of miRNA-mediated mRNA destabilization, can occur with or without translation of an ORF 10,13,15,28, which suggests that the miRNA-mediated destabilization does not result from translational repression and indicates that translational repression could occur after the initial de-adenylation signal. Perhaps the miRNA-induced poly(A)-tail interactions that eventually trigger de-adenylation also cause the closed circular form of the mRNA to open up, thereby inhibiting translation initiation. This inhibition would occur before de-adenylation is complete, as polyadenylated mRNAs seem to be translationally repressed (Fig. 2d–f).

Another consideration is that, as done previously 16–18, we equated mRNA destabilization to the loss of polyadenylated mRNA. Thus, transcripts that have lost their poly(A) tails might still be present but underrepresented in our mRNA-Seq of poly(A)-selected mRNA. In certain cell types, most notably oocytes, such transcripts can be stable and eventually be tailed by a cytoplasmic polyadenylation complex to become translationally competent 29. In the typical somatic cell, however, de-adenylated transcripts are not translated and are instead rapidly de-capped and/or degraded. Thus, our consideration of de-adenylated transcripts as operational and functional equivalents of degraded
transcripts seems appropriate. One possibility, though, is that mRNAs that were de-adenylated while being translated will yield some RPFs from ribosomes that initiated when the poly(A) tails were intact but will not yield mRNA-Seq tags. However, a narrowing of the differences between changes in RPFs and mRNA-Seq tags through this process is expected to have been very small, since the vast majority of RPFs should derive from mRNAs with poly(A) tails.

A way that our results might still be reconciled with the translation-repression scenario would be if ribosome profiling missed the bulk of translation repression because translation was repressed without reducing the density of ribosomes on the targeted messages, that is, if reduced initiation was coupled with correspondingly slower elongation. However, direct evidence for slower elongation has not been reported in any miRNA studies, and it seems unlikely that decreases in initiation and elongation rates would so frequently be so closely matched so as to yield such minor differences in apparent translational efficiency for so many messages. Moreover, translational repression without changes in ribosome density would cause the changes measured by proteomics to exceed those measured by ribosome profiling. The same would hold for cotranslational degradation of nascent polypeptides, another proposed mechanism for miRNA-mediated repression 7,30. Arguing strongly against both of these possibilities, we found that changes measured by proteomics were not greater than those measured by ribosome profiling (Supplementary Fig. 11).

Although the changes we observed in translational efficiency were consistent with slightly reduced translation of the targeted messages, such changes could also occur without any miRNA-mediated translational repression. If some fraction of the polyadenylated mRNA was in a cellular compartment sequestered away from the compartment containing both miRNAs and ribosomes, then preferential destabilization of the mRNA in the
miRNA/ribosome compartment would lead to an observed decrease in translational efficiency without a need to invoke translational repression. For example, to the extent that mature mRNAs awaiting transport to the cytoplasm reside in the nucleus where they presumably would not be subject to either miRNA-mediated destabilization or translation, the reduction of mRNA-Seq tags would not match the reduction of RPFs, and the more pronounced RPF reduction would indicate decreased ribosome density even in the absence of translational repression. Heterologous reporter mRNAs, some of which have lent support to the translational-repression scenario, might be particularly prone to nuclear accumulation. With this consideration in mind, the observed miRNA-dependent reductions in translational efficiency might be considered upper limits on the magnitude of translational repression.

Although we cannot determine the precise amount of miRNA-mediated translational repression, we can reliably say that the pervasive and dominant miRNA-mediated translational repression with persistence of repressed mRNAs, which had been widely anticipated, has not materialized. Instead, the outcome of regulation is predominantly mRNA destabilization, as first suggested by analyses of

[Figure 4 plots, panels a–c: read density fold change (log2) versus distance from start of ORF (nucleotides). Series: no site, mRNA-Seq; no site, RPF; ≥1 site, proteomics-supported, mRNA-Seq; ≥1 site, proteomics-supported, RPF.]

Figure 4 | Ribosome and mRNA changes were uniform along the length of the ORFs. a, Ribosome and mRNA changes along the length of ORFs after introducing miR-155. mRNA segments of quantified genes were binned based on their distance from the first nucleotide of the start
codon, with the boundaries of the segments chosen such that each bin contained the same number of nucleotides (Supplementary Fig. 8b). Binning was done separately for mRNAs with no miR-155 site and proteomics-supported miR-155 targets. Fold changes in RPFs and mRNA-Seq tags mapping to each bin were then plotted with respect to the median distance of the central nucleotide of each segment from the first nucleotide of the start codon. Changes in RPFs and mRNA-Seq tags for mRNAs with no site (grey and black, respectively) and for proteomics-supported targets (light and dark green, respectively) are shown. Only bins with read contribution from ≥20 genes are shown (see Supplementary Fig. 8b). The ANCOVA test for systematic change across the ORF length was performed by first calculating the differences between RPF changes and mRNA-Seq changes for each group of genes, fitting lines through these changes in translational efficiency, then testing for a difference between the resulting slopes. b, As in panel a, but plotting results for the miR-1 experiment. c, As in panel a, but plotting results for the miR-223 experiment.
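The translational-efficiency calculation used throughout the analysis (subtracting the log2 mRNA-Seq fold change from the log2 RPF fold change) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the read counts for the example gene are hypothetical, and the "fraction attributable" function is only one plausible reading of how the reported 11–16% figure could be derived. Only the mean log2 changes (–0.11 and –0.59) are taken from the text.

```python
import math


def log2_fold_change(treated: float, control: float) -> float:
    """Log2 ratio of a normalized read density with vs. without the miRNA."""
    return math.log2(treated / control)


def translational_efficiency_change(rpf_log2fc: float, mrna_log2fc: float) -> float:
    """Dividing the RPF change by the mRNA change is a subtraction in log2 space."""
    return rpf_log2fc - mrna_log2fc


def percent_decrease(log2fc: float) -> float:
    """Convert a (negative) log2 fold change into a percentage decrease."""
    return (1.0 - 2.0 ** log2fc) * 100.0


def fraction_from_efficiency(te_log2fc: float, mrna_log2fc: float) -> float:
    """Assumed decomposition: share of the total RPF change (mRNA + TE) due to TE."""
    return te_log2fc / (te_log2fc + mrna_log2fc)


# Hypothetical gene: RPF density falls to 60% and mRNA-Seq density to 66% of control.
rpf_fc = log2_fold_change(60.0, 100.0)
mrna_fc = log2_fold_change(66.0, 100.0)
te_fc = translational_efficiency_change(rpf_fc, mrna_fc)

# Mean values reported for the miR-155 proteomics-supported targets.
te_pct = percent_decrease(-0.11)     # roughly 7%
mrna_pct = percent_decrease(-0.59)   # roughly 33-34%
te_share = fraction_from_efficiency(-0.11, -0.59)

print(f"TE change for example gene: {te_fc:.3f} log2")
print(f"TE decrease: {te_pct:.1f}%, mRNA decrease: {mrna_pct:.1f}%")
print(f"Share of repression from TE (assumed decomposition): {te_share:.1%}")
```

Working in log2 space keeps the two components additive, which is why a small translational-efficiency term can be compared directly with the larger mRNA term.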
Strong replication in the GlobData middleware

Luís Rodrigues, Hugo Miranda, Ricardo Almeida, João Martins, Pedro Vicente
Universidade de Lisboa
Faculdade de Ciências
Departamento de Informática
{ler,hmiranda,ralmeida,jmartins,pedrofrv}@di.fc.ul.pt

Abstract

GlobData is a project that aims to design and implement a middleware tool offering the abstraction of a global object database repository. This tool, called COPLA, supports transactional access to geographically distributed persistent objects independent of their location. Additionally, it supports replication of data according to different consistency criteria. For this purpose, COPLA implements a number of consistency protocols offering different tradeoffs between performance and fault-tolerance. This paper presents the work on strong consistency protocols for the GlobData system. Two protocols are presented: a voting protocol and a non-voting protocol. Both these protocols rely on the use of atomic broadcast as a building block to serialize conflicting transactions. The paper also introduces the total order protocol being developed to support large-scale replication.

1 Introduction

GlobData [1] is a European IST project started in November 2000 that aims to design and implement a middleware tool offering the abstraction of a global object database repository. The tool, called COPLA, supports transactional access to geographically distributed persistent objects independent of their location. Application programmers have an object-oriented view of the data repository and do not need to be concerned with how the objects are stored, distributed or replicated. The COPLA middleware supports the replication of data according to different consistency criteria. Each consistency criterion is implemented by one or more consistency protocols that offer different tradeoffs between performance and fault-tolerance.

This paper reports the work on strong consistency replication protocols for the GlobData system that is being performed by the
Distributed ALgorithms and Network Protocols (DIALNP) group at Universidade de Lisboa. Based on the previous work of [17,16], two protocols are being implemented: a voting protocol and a non-voting protocol. Each of these protocols supports two variants, namely eager updates and deferred updates. The protocols are executed on top of an off-the-shelf relational database that is used to store the state of persistent objects and protocol control information. All protocols rely on the use of atomic broadcast as a building block to help serialize conflicting transactions. A specialized total order protocol is being implemented in the Appia system [14] to support large-scale replication. The atomic protocol inherits ideas from the hybrid protocol of [19]. The paper introduces the GlobData architecture and summarizes both the consistency protocols and the atomic multicast primitive that supports them.

This paper is organized as follows: Section 2 describes the general COPLA architecture. Section 3 presents the consistency protocols. Section 4 presents the atomic multicast primitive that supports the protocols.
Section 5 presents some optimizations to the basic protocols. Section 6 discusses related work that studies database replication based on atomic broadcast primitives. Section 7 concludes this paper.

2 COPLA System Architecture

COPLA is a middleware tool that provides transparent access to a replicated repository of persistent objects. Replicas can be located on different nodes of a cluster, of a local area network, or spread across a wide area network spanning different geographic locations. To support a diversity of environments and workloads, COPLA provides a number of replica consistency protocols.

The main components of the COPLA architecture are depicted in Figure 1. The upper layer is a "client interface" module that provides the functionality used by the COPLA application programmer. The programmer has an object-oriented view of the persistent and distributed data: it uses a subset of Object Query Language [7] to obtain references to distributed objects. Objects can be concurrently accessed by different clients in the context of distributed transactions.

For fault-tolerance, and to improve locality of read-only transactions, an object database may be replicated at different locations. Several consistency protocols are supported by COPLA; the choice of the best protocol depends on the topology of the network and on the application's workload. To keep the user interface code independent of the actual protocol being used, all protocols adhere to a common protocol interface (labeled CP-API in the figure). This allows COPLA to be configured according to the characteristics of the environment where it runs.

[Figure 1. COPLA architecture: client application; client interface; CP-API; consistency protocols; communications module (atomic broadcast); UDS-API and PER-API; Uniform Data Store (UDS).]

The uniform data store (UDS) module (developed by the Universidad Pública de Navarra) is responsible for storing the state of the persistent objects in an off-the-shelf relational database management system (RDBMS). To
perform this task, the UDS exports an interface, the UDS-API, through which objects can be stored and retrieved. It also converts all the queries posed by the application into normalized SQL queries. Finally, the UDS is used to store persistently the control information required by the consistency protocols. This control information is stored and accessed through a dedicated interface, the PER-API.

Architectural challenges. The GlobData project is characterized by a unique combination of different requirements that make the design of the consistency protocols a challenging task. Namely, GlobData aims to satisfy the following requirements:

Large-scale: the consistency protocols must support replication of objects in a geographically dispersed system, in which the nodes communicate through the Internet. This prevents the use of protocols that make use of specific network properties (such as the low-latency or network-order preservation properties of local-area networks [18]).

RDBMS independence: a variety of commercial databases should be supported as the underlying data storage technology. This prevents the use of solutions that require adaptations to the database kernel.

Protocol interchangeability: COPLA must be flexible enough to adapt to changing environment conditions, like the scale of the system, availability of different communication facilities, and changes in the application's workload. Therefore it should allow the use of distinct consistency protocols that can perform differently in several scenarios.

Object-orientation: even if COPLA maps objects into a relational model, this operation must be isolated from the consistency protocols. In this way, the consistency algorithms are not tied to any specific object representation.

3 Strong Consistency Protocols

In GlobData, the application programmer may trade fault-tolerance for performance. Therefore, a suite of protocols with different behavior in the presence of faults is being developed by different teams. Another project's
partner, the ITI, is developing a suite of protocols based on the notion of object ownership [15]: each node is the manager for the objects created in it, and is responsible for managing concurrent accesses to those objects. On the other hand, the DIALNP team at Universidade de Lisboa is developing two protocols that enforce strong consistency even in the presence of faults. In fact, the two protocols reported here can also be configured to trade reliability for performance, by implementing a deferred updates scheme. The strong consistency protocols rely extensively on the availability of a uniform atomic broadcast primitive. The implementation of this primitive will be addressed later in the paper.

3.1 Interaction Among Components

We now describe the strong consistency protocols designed for COPLA. Both protocols cooperate with the Uniform Data Store to obtain information about which objects are read or updated by each transaction. This information, in the form of a list of unique object identifiers (OIDs), allows the protocols to have fine-grained information about which transactions conflict with each other. Since the consistency protocols only manipulate OIDs, they remain independent from the representation of objects in the database.

The COPLA transactional model. In COPLA, the execution of a transaction includes the following steps:

1. The programmer signals the system that a transaction is about to start.
2. The programmer makes a query to the database, using a subset of OQL. This query returns a collection of objects.
3. The returned objects are manipulated by the programmer using the functions exported by the client interface. These functions allow the application to update the values of object attributes, and to read new objects through object relations (object attributes that are references to other objects).
4. Steps 2–3 are repeated until the transaction is completed.
5. The programmer requests the system to commit the transaction.

Interaction with the consistency protocols. The common
protocol interface basically exports two functions: a function that must be called by the application every time new objects are read by a transaction, and a function that must be called in order to commit the transaction.

The first function, which we call UDSAccess(), serves two main purposes: to make sure that the local copies of the objects are up-to-date (when using deferred updates, the most recent version may not be available locally); and to extract the state of the objects by calling the UDS (the access to the underlying database is not performed by the consistency protocol itself; it is a function of the UDS component). It should be noted that in the actual implementation this function is unfolded into a collection of similar functions covering different requests (attribute read, relationship read, query, etc.). For clarity of exposition, we make no distinction among these functions in the paper.

The second function, called commit(), is used by the application to commit the transaction. In response to this request the consistency protocols module has to coordinate with its remote peers to serialize conflicting transactions and to decide whether it is safe to commit the transaction or if it has to be aborted due to some conflict. In order to execute this phase, the consistency protocol requests the UDS module to provide the list of all objects updated by the current transaction. Additionally, the UDS also provides the consistency protocols with an opaque structure containing the state of the updated objects. It is the responsibility of the consistency protocol to propagate these updates to the remote nodes.

Replication strategies. Using the classification of database replication strategies introduced in [20], the strong consistency protocols of COPLA can be classified as belonging to the "update everywhere constant interaction" class. They are "update everywhere" because they perform the updates to the data items in all replicas of the system. This approach was chosen because it is easier to deal
with failures (since all nodes maintain their own copy of the data) and it does not create bottleneck points like the primary copy approach. They are "constant interaction" because the number of messages exchanged per transaction is fixed, independently of the number of operations in the transaction. Given that the cost of communication in most GlobData configurations is expected to be high, this approach is much more efficient than a linear interaction approach. The protocols described below explore the third degree of freedom: the way transactions terminate (voting or non-voting).

Interaction with the atomic broadcast primitive. An atomic broadcast primitive broadcasts messages among a group of servers, guaranteeing atomic and ordered delivery of messages. Specifically, let m1 and m2 be two messages sent by atomic broadcast to a group of servers g. Atomic delivery guarantees that if a member of g delivers m1 (resp. m2), then all correct members of g deliver m1 (resp. m2). Ordered delivery guarantees that if any two members of g deliver m1 and m2, they deliver them in the same order. These two properties are used by both consistency protocols: the order property is used by the conflict resolution mechanism, and atomic delivery is used to simplify atomic commitment of transactions.

3.2 The Non-Voting Protocol

This protocol is a modification of the one described in [17], altered to use a version scheme for concurrency control [6], and adapted to the COPLA transactional model. The protocol uses the following control information for each object: a version number and a flag that states whether or not the local copy of this object is up-to-date. If an object is out-of-date, the identifier of the node that has the latest version of the object is also kept. Note that in the basic protocol, all replicas are up-to-date when a transaction commits. Only in the deferred updates mode is it possible that some replicas remain temporarily out of date. All this information is maintained in a consistency table, which is stored in persistent
storage, and is updated in the context of the same transaction that alters data (i.e., the consistency information is updated only if the transaction commits).

When an object is created, its version number is set to zero. Each time a transaction updates an object, and that transaction commits, the object's version number is incremented by one. This mechanism keeps version numbers synchronized across replicas, since the total order ensured by atomic broadcast causes all replicas to process transactions in the same order.

When enforcing serializability, two kinds of conflicts must be considered by the protocol: read/write conflicts and write/write conflicts. Read/write conflicts occur when one transaction reads an object, and another concurrent transaction writes on that same object. Write/write conflicts occur when two concurrent transactions write on the same object. In GlobData, all objects are read before they are written (as shown above in the COPLA transactional model), so a write/write conflict is also a read/write conflict. Considering these definitions, in the version number concurrency control scheme, conflicting transactions are defined as follows:

Two transactions t1 and t2 conflict if t1 has read an object o with version v1 and, when t1 is about to commit, o's version number in the local database, v2, is higher than v1. That means that t1 has read data that was later modified (by a transaction t2 that modified o and committed before t1, thus increasing o's version number), and therefore t1 should be aborted.

The general outline of the non-voting algorithm is now presented:

1. All the transaction's operations are executed locally on the node where the transaction was initiated (this node is called the delegate node).
2. When the application requests a commit, the set of read objects and their version numbers, and the set of written objects, are sent to all nodes using the atomic broadcast primitive.
3. When a transaction is delivered by the atomic broadcast protocol, all servers verify that the received transaction does not conflict
with other local running transactions. There is no conflict if the versions of the objects read by the arriving transaction are greater than or equal to the versions of those objects present in the local database. If no conflict is detected, then the transaction is committed; otherwise it is aborted. Since this procedure is deterministic and all nodes, including the delegate node, receive transactions in the same order, all nodes reach the same decision about the outcome of the transaction. The delegate node can now inform the client application about the final outcome of the transaction.

Note that the last step is executed by all nodes, including the one that initiated the transaction.

Depicted in Figure 2 is a more detailed description of the algorithm. It is divided into two functions, corresponding to the interface previously described. Both functions accept the parameter t, the transaction to act upon. UDSAccess() also accepts a parameter, l, which is the list of objects that t has read from the UDS. Note that step 3 of the commit() function is executed by all nodes, including the delegate node.

UDSAccess(t, l):
1. Add the list of objects l to the list of objects read by transaction t.
commit(t):
1. Obtain from the UDS the list of objects read by t (its read set) and their version numbers, and the list of objects written by t (its write set).
2. Send t, with its read set and write set, through the atomic broadcast primitive.
3. When the message containing t is delivered by the atomic broadcast:
   (a) If t conflicts with some other transaction:
       i. Abort t.
   (b) else (consistent transaction):
       i. Abort all transactions conflicting with t.
       ii. Commit the transaction.
Figure 2. Non-voting protocol

The algorithm uses the order given by atomic broadcast for serializing conflicting transactions: if a transaction is delivered and is consistent, it has priority over other running transactions. This implies that if there are two conflicting transactions, t1 and t2, and t1 is delivered before t2, then t1 will proceed, and t2 will be marked as conflicting (in step 3(a)), because it has read stale data. The decision is taken in each node
independently, but all nodes will reach the same decision, since it depends solely on the order of message delivery (which is guaranteed to be consistent at all replicas by the atomic broadcast protocol).

When a commit is decided, the version numbers of the objects written by this transaction are incremented, and the UDS transaction is committed.

Note that, to improve performance, local running transactions that conflict with a consistent transaction are aborted in step 3(b). There is a conflict when the running transaction has read objects that the arriving transaction has written. This would cause the running transaction to carry old versions of read objects in its read set, which would cause it to be aborted later on in step 3(a). This way an atomic broadcast message is spared.

Aborting a transaction does not involve any special step. In this case, the commit() function is never called, and all that has to be done is to release the local resources associated with that transaction.

3.3 The Voting Protocol

This protocol is an adaptation of the protocol described in [13] to the COPLA transactional model.
It consists of two phases: a write set broadcast phase and a voting phase. The general outline of the algorithm is as follows:

1. All the transaction’s operations are executed locally on the delegate node, obtaining (local) read locks on read objects (note that, in order to be written, an object must be previously read).
2. When the application requests a commit, the set of written objects is sent to all nodes using atomic broadcast.
3. When the write set of a transaction t is delivered by atomic broadcast, all nodes try to obtain local write locks on all objects in the set. If there is a transaction that holds a write lock on any object of the write set of t, t is placed on hold until that write lock is relinquished. Transactions holding read locks on any object of the write set of t are aborted (sending an abort message through atomic broadcast). When the delegate node has obtained all write locks, it sends a commit message to all servers through atomic broadcast.
4. Upon the reception of a confirmation message, a node applies the transaction’s writes to the local database and subsequently releases all locks held on behalf of that transaction. Upon the reception of an abort message, the delegate node aborts the transaction and releases all its locks (other nodes ignore that message).

A detailed description of the algorithm is shown in Figure 3.

The algorithm uses the order given by atomic broadcast to serialize conflicting transactions. The final transaction order is given by the order of the messages. Conflict detection is done using locks. Write/write conflicts, which occur when two concurrent transactions try to write the same object, are detected by the lock system (two transactions try to obtain a write lock on the same object). Since write locks are obtained upon reception of the write set, the order of these messages determines the lock acquisition order. As seen in Figure 3, if a transaction t1 obtains a write lock, it will force a later transaction t2 to wait when t2 tries to obtain its lock. If t1 commits, it will force t2 to
abort.

UDSAccess(t, l):
1. For each object in the list l, obtain a read lock. If any of those objects is write-locked, t is placed on hold until that object’s write lock is released.
commit(t):
1. Obtain from the UDS the list of objects written by t (its write set).
2. Send the write set of t through the atomic broadcast primitive.
3. When the message containing the write set of t is delivered by atomic broadcast:
   (a) For each object o in the write set, try to obtain a write lock on it, executing the following steps atomically:
       i. If there is one or more read locks on o, every transaction that has that read lock is aborted (by sending an abort message using atomic broadcast), and the write lock on o is granted to t.
       ii. If there is a write lock on o, or all the read locks on o are from transactions whose write set message has already been delivered, t will be placed on hold until those write locks are released.
       iii. If there is no other lock on o, grant the lock to t.
   (b) If this node is the delegate node for t, send a commit message for t by atomic broadcast.
4. When a commit message for t is delivered: commit t, writing all its updates in the database and releasing all locks held by t. All transactions waiting to obtain write locks on an object written by t are aborted (an abort message is sent through atomic broadcast).
5. When an abort message for t is delivered: if t is a local transaction the message is ignored; otherwise abort t, releasing all its locks.
Figure 3. Voting protocol

Read/write conflicts, which occur when two concurrent transactions access the same object, one for reading and the other for writing, are solved by giving priority to writing transactions. When a write set message is delivered, write locks are obtained, causing transactions that have read locks on objects in the write set to abort.
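The lock-acquisition decision of step 3(a) in Figure 3 can be sketched for a single object as follows. This is a minimal illustration, not the authors' implementation: the names `ObjectLocks` and `request_write_lock` are hypothetical, the requesting transaction's own read lock is assumed to be upgraded rather than treated as a conflict, and rule ii is assumed to be checked before rule i (the figure does not state the order).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ObjectLocks:
    readers: set = field(default_factory=set)    # ids of read-lock holders
    writer: Optional[str] = None                 # id of the write-lock holder
    waiting: list = field(default_factory=list)  # ids waiting for the write lock


def request_write_lock(locks, txn, delivered):
    """Apply step 3(a) to one object; return (outcome, aborted_readers).

    `delivered` is the set of transaction ids whose write set has already
    been delivered by atomic broadcast (hypothetical bookkeeping).
    """
    # Assumption: txn read the object before writing it, so its own
    # read lock is upgraded and does not count as a conflict.
    others = locks.readers - {txn}
    # Rule ii: wait behind a writer, or behind readers already delivered.
    if locks.writer is not None or (others and others <= delivered):
        locks.waiting.append(txn)
        return "hold", set()
    # Rules i and iii: abort any remaining readers, then grant the lock.
    locks.readers -= others
    locks.writer = txn
    return "granted", others
```

Because the function depends only on the lock state and on the delivery order, every node that runs it on the same sequence of messages reaches the same outcome. The first branch is the exception for readers whose write set was already delivered; the second branch is the default rule that gives priority to writers by aborting read-lock holders.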
This rule does not apply to transactions whose write set has already been delivered (step 3(a)ii): in this case t will be placed on hold until the decision is taken regarding the transaction(s) that own the read lock.

All nodes obtain the same write locks in the same order, because the order of the messages is the same in all nodes, and the lock procedure is deterministic. As such, all nodes will be able to respect the decision issued by the delegate node.

Optimization
This protocol can be further improved, to avoid aborting an unnecessary number of transactions. In the lock acquisition phase (after the write set of t is delivered), instead of immediately aborting transactions that hold read locks on objects in the write set, they can be placed in an alternative state, called executingabort. Transactions in the executingabort state can proceed executing, but cannot commit. If they attempt to, they will be placed on hold. If t commits, then all transactions in executingabort are aborted; if t aborts, they will return to normal execution state (if there is no other transaction that is placing them in executingabort). […] transactions in the executingabort state do not need to be put on hold; they can commit immediately. The final serialization order is as if these transactions executed before the transaction that placed them in executingabort.

[…] of a crashed process to be inconsistent. Therefore, in COPLA, one needs a uniform total order protocol, i.e. a protocol that ensures that if two messages are delivered in a given order to a process (even if this process crashes), they are delivered in that order to all correct processes.

Several alternatives to augment the hybrid protocol with uniform delivery have been implemented and are currently under evaluation. The first alternative consists in adding an additional stability phase to the original hybrid protocol. The second alternative is to change the underlying reliable broadcast protocol to provide terminating uniform delivery of every message. The third alternative is to use two protocols in parallel: the original hybrid protocol to establish a tentative order and another consensus based protocol to establish
a definitive order. These alternatives are being implemented using the Appia [14] framework and their performance is being studied.

Early analysis shows that the protocol that performs best is a combination of the previous alternatives. Passive nodes select a sequencer just as in the original hybrid protocol. Sequencers are responsible for providing a uniform total order for the messages sent by passive nodes and for their own messages. They do so by applying a symmetric total order protocol based on an underlying terminating uniform reliable broadcast layer.

The protocol also supports the optimistic delivery of (tentative) total order indications [8, 18]. Given that the order established by the (non-uniform) total order protocol is the same as the final uniform total order in most cases (these two orders only differ when crashes occur at particular points in the protocol execution), this order can be provided to the consistency layer as tentative ordering information. The consistency protocols may optimistically perform some tasks that are later committed when the final order is delivered.

5 Optimizations to the Basic Protocols

The basic protocols described in Section 3 can be optimized in two different ways. One consists in delaying the propagation of updates, the deferred updates mode. The other consists in exploiting the optimistic delivery of the atomic multicast algorithm.

5.1 Deferred Updates

Both algorithms presented before can be configured to operate in a mode called deferred updates. This mode consists in postponing the transfer of updates until such data is required by a remote transaction, trading fault-tolerance for performance. Note that, when using deferred updates, the outcome of a transaction is no longer immediately propagated to all replicas: it is stored only at the delegate node. If this node crashes, transactions that access this data must wait for the delegate node to recover. On the other hand, network communication is saved because updates are only propagated when needed.

The
changes to the protocol required to implement the deferred updates mode are encapsulated in the getNewVersions(t,l) function, which is depicted in Figure 4. In both protocols, the function is called after step one, i.e., it becomes step two of UDSAccess().

For each OID in l:
1. Check if the object’s copy in the local database is up-to-date.
2. If the object is out-of-date, get the latest version from the node that holds it.
Figure 4. getNewVersions()

Associated with each OID, there is a field, called owner, that contains the identifier of the node holding the latest version of that object’s data. If that field is empty, then the current node holds the latest version.

When deferred updates mode is not used, modified data is written to the database at the end of the commit procedure. This step is modified to implement deferred updates: only the delegate node writes the altered data to its database, setting the owner field to empty. All the other nodes write the identifier of the delegate node in their databases. The only information that is sent across the network is a list of changed OIDs (instead of that list plus the data itself).

5.2 Exploiting Optimistic Atomic Delivery

As described above, the atomic broadcast primitive developed in the project can deliver a message optimistically (opt-deliver), i.e., the message is delivered in a tentative order, which is likely to be the same as the final order (u-deliver). This can be exploited by both consistency protocols. The tentative order allows the protocols to send the transaction’s updates to the database earlier. Instead of waiting for the final uniform order to perform the writes, they are sent to the database as soon as the tentative order is known. When the final order arrives, all that is required is to commit the transaction. This hides the cost of writing data behind the cost of uniform delivery, effectively doing both things in parallel.

Non-voting protocol
Upon reception of an opt-deliver message, all steps in the commit() function
are executed, with the following modifications: in step 3(a), conflicting transactions are not aborted, but placed on hold (transactions on hold can execute normally, but are suspended when they request a commit, and can only proceed when they return to normal state); in step 3(b-ii), the data is sent to the UDS, but the transaction is not committed.

When the message is u-delivered, and its order is the same as the tentative one, all transactions marked on hold on behalf of the current one are aborted, and the transaction is committed. If the order is not the same, then the open UDS transaction is aborted, all transactions placed on hold on behalf of this one are returned to normal state, and the message is reprocessed as if it arrived at that moment.
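The interplay between the version check of the non-voting protocol and the two delivery events can be sketched as below. This is a simplified single-replica model under stated assumptions, not the GlobData code: the class and method names (`Txn`, `Replica`, `opt_deliver`, `u_deliver`) are hypothetical, the UDS write is omitted, and whether the final order matches the tentative one is passed in as a flag (`same_order`) instead of being derived from the broadcast layer.

```python
from dataclasses import dataclass, field


@dataclass
class Txn:
    tid: str
    read_versions: dict                   # object id -> version number read
    write_set: set = field(default_factory=set)
    state: str = "running"                # running | on_hold | aborted


class Replica:
    def __init__(self, versions=None):
        self.versions = dict(versions or {})  # committed version per object
        self.local = {}                       # tid -> local running Txn
        self.tentative = {}                   # tid -> (consistent?, held tids)

    def is_consistent(self, txn):
        # Version check: every version read must still be current locally.
        return all(v >= self.versions.get(oid, 0)
                   for oid, v in txn.read_versions.items())

    def opt_deliver(self, txn):
        ok = self.is_consistent(txn)
        held = []
        if ok:
            for other in self.local.values():
                reads_written = set(other.read_versions) & txn.write_set
                if (other.tid != txn.tid and other.state == "running"
                        and reads_written):
                    other.state = "on_hold"   # suspended, not yet aborted
                    held.append(other.tid)
        self.tentative[txn.tid] = (ok, held)

    def u_deliver(self, txn, same_order):
        ok, held = self.tentative.pop(txn.tid)
        if not same_order:
            # Undo the tentative work and reprocess as a fresh arrival.
            for tid in held:
                self.local[tid].state = "running"
            self.opt_deliver(txn)
            ok, held = self.tentative.pop(txn.tid)
        if not ok:
            return "aborted"
        for tid in held:
            self.local[tid].state = "aborted"
        for oid in txn.write_set:
            self.versions[oid] = self.versions.get(oid, 0) + 1
        return "committed"
```

In this sketch the tentative delivery only marks conflicting local transactions, so nothing irreversible happens before the final order is known; the version numbers are incremented only in `u_deliver`, which mirrors the rule that the consistency information changes only when the transaction commits.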