Foreign Literature Translations on Uncertainty Data Mining


Foreign literature translation: Segmenting uncertain demand in online group-buying (Chinese-English bilingual)


Topic: The development status and problems of online group-buying in China

I. Title of the original: Segmenting uncertain demand in group-buying auctions

Original text:

Demand uncertainty is a key factor in a seller's decision-making process for products sold through online auctions. We explore demand uncertainty in group-buying auctions in terms of the extent of low-valuation demand and high-valuation demand. We focus on the analysis of a monopolistic group-buying retailer that sells products to consumers who express different product valuations. We also examine the performance of a group-buying seller who faces competitive posted-price sellers in a market for the sale of the same products, under similar assumptions about uncertain demand. Based on a Nash equilibrium analysis of bidder strategies for both of these seller-side competition structures, we are able to characterize the group-buying auction bidders' dominant strategies. We obtained a number of interesting findings. Group-buying is likely to be more effective in settings where there is larger low-valuation demand than high-valuation demand. The structure of demand matters. This finding has relevance to the marketplace for new cameras, next-generation microprocessors and computers, and other high-valuation goods, which are unlikely to be as effectively sold in group-buying markets. We obtained additional results for the case of continuous demand, and find that there is a basis for the seller to improve revenues via effective group-buying auction price curve design.

Keywords: Consumer behavior, bidding strategy, demand uncertainty, economic analysis, electronic markets, group-buying auctions, market mechanism, posted-price mechanism, simulation, uncertainty risk.

The development of advanced IT makes it possible to use novel business models to handle business problems in new and innovative ways. With the growth of the Internet, a number of new electronic auction mechanisms have emerged, and auctions are generally known to create higher expected seller revenue than posted prices when the cost of running an auction is minimal or costless (Wang 1993). Some of the new mechanisms we have seen include the online Yankee and Dutch auctions, and the "name-your-own-price" and "buy-it-now" mechanisms. An example is eBay's Dutch auction for the sale of multiple items of the same description. Another of these new electronic market mechanisms that we have observed is the group-buying auction, a homogeneous multi-unit auction (Mitchell 2002, Li et al. 2004). Internet-based sellers and digital intermediaries have adopted this market mechanism on sites such as () and (). These sites offer transaction-making mechanisms that are different from traditional auctions. In traditional auctions, bidders compete against one another to be the winner. In group-buying auctions, however, bidders have an incentive to aggregate their bids so that the seller or digital intermediary offers a lower price at which they all can buy the desired goods (Horn et al. 2000). McCabe et al. (1991) have explored multi-unit Vickrey auctions in experimental research; however, they did not consider the possibility of stochastic bidder arrival or demand uncertainty. This paper is the first to examine the impacts of demand uncertainty on the performance of online group-buying auctions. Based on a Nash equilibrium analysis of bidder strategies for a monopolist seller and a competitive seller, we are able to characterize the group-buying auction bidders' symmetric and dominant strategies.
We find that group-buying is likely to be more effective in settings where there is larger low-valuation demand than high-valuation demand. Thus, the structure of demand at different levels of consumer willingness-to-pay matters. This has relevance to the marketplace for new cameras, next-generation microprocessors and computers, and other high-valuation goods. We obtained additional results for the case of continuous demand valuations, and found that there is a basis for the seller to improve revenues through effective design of the group-buying auction price curve.

THEORY

The model for the group-buying auction mechanism with uncertain bidder arrival that we will develop spans three streams of literature: demand uncertainty, consumer behavior and related mechanism design issues; auction economics and mechanism design theory; and current theoretical knowledge about the operation of group-buying auctions from the IS and electronic commerce literature.

Demand Uncertainty, Consumer Behavior and Mechanism Design

Demand uncertainties typically are composed of consumer demand environment uncertainty (or uncertainty about the aggregate level of consumer demand) and randomness of demand in the marketplace (reflected in brief temporal changes and demand shocks that are not expected to persist). Consumer uncertainty about demand in the marketplace can occur based on the valuation of products, and whether consumers are willing to pay higher or lower prices. It may also occur on the basis of demand levels, especially the number of consumers in the market. Finally, there are temporal considerations, which involve whether a consumer wishes to buy now, or whether they may be sampling quality and pricing with the intention of buying later. We distinguish between different demand level environments. In addition, it is possible that these consumer demand environments may co-exist, as is often the case when firms make strategies for price discrimination. This prompts a seller to consider setting more than one price level, as we often see in real-world retailing, as well as in group-buying auctions.

Dana (2001) pointed out that when a monopoly seller faces uncertainty about the consumer demand environment, it usually will not be in his best interest to set uniform prices for all consumers. The author studied a scenario in which there were more buyers associated with high demand and fewer buyers associated with low demand. In the author's proposed price mechanism, the seller sets a price curve instead of a single price, so as to be able to offer different prices depending on the different demand conditions that appear to obtain in the marketplace. It may be useful in such settings to employ an automated price-searching mechanism, which is demonstrated to be more robust to uncertain demand than a uniform price mechanism, relative to expected profits. Unlike Dana's (2001) work though, we will study settings in which there are fewer buyers who exhibit demand at higher prices and more buyers who exhibit demand at lower prices. This is a useful way to characterize group-buying, since most participating consumers truly are price-sensitive, and this is what makes group-buying auctions interesting to them.

Nocke and Peitz (2007) have studied rationing as a tool that a monopolist can use to optimize its sales policy in the presence of uncertain demand.
The authors examined three different selling policies that they argue are potentially optimal in their environment: uniform pricing, clearance sales, and introductory offers. A uniform pricing policy involves no seller price discrimination, though consumers are likely to exhibit different levels of willingness-to-pay when they are permitted to express themselves through purchases at different price levels. A current example of a uniform pricing policy is iTunes (), which has been offering 99¢ per song pricing. The consumer has to deal with very little uncertainty in the process, and this may be a good approach when the seller wants to "train" consumers to develop specific buying habits (as seems to have been the case with the online purchase of digital music in the past few years). Nocke and Peitz (2007) characterized a clearance sales policy as charging a high price initially, but then lowering the price and offering the remaining goods to low-value consumers, as is often seen in department store sales policy. Consumers with a high valuation for the sale goods may decide to buy at the high price, since the endogenous probability of rationing by the seller is higher at the lower price. Apropos to this, consumers who buy late at low prices typically find that it is difficult to find the styles, colors and sizes that they want, and they may have more difficulty coordinating the purchase of matching items (e.g., matching colors and styles of clothing). Introductory offers consist of selling a limited quantity of items at a low price initially in the market, and then raising the price. A variant occurs when the seller offers a lower price for the first purchase of goods or services that typically involve multiple purchases by the consumer (e.g., book club memberships and cell phone services). Consumers who place a high valuation on a sale item rationed initially at the lower price may find it optimal to buy the goods at the higher price. Introductory offers may dominate uniform pricing, but are never optimal if the seller uses clearance sales.

In uncertain markets, buyers will have private information. Che and Gale (2000) pointed out that when consumers have private information about their budget constraints and their valuation of sales items, a monopolist's optimal pricing strategy is to offer a menu of lotteries on the likelihood of consumer purchases of its products at different prices. Another approach is intertemporal price discrimination. By offering different prices with different probabilities for the consumer to obtain the good, the monopolist can profitably segment consumers even though valuation segments alone are not profitable.

Even when the seller can effectively identify the consumer demand level in the marketplace, due to stochastic factors in the market environment, it still may be difficult for the seller to effectively predict demand. As a result, the seller may try to improve its demand forecast by utilizing market signals that may be observed when sales occur. However, there are likely to be some stochastic differences between the demand predicted by the seller and the realized demand in the marketplace (Kauffman and Mohtadi 2004). Lo and Wu (2003) pointed out that a typical seller faces different types of risks, and among these, a key factor is forecast error, the difference between the forecast and the actual levels of demand.
Dirim and Roundy (2002) quantified forecast errors based on a scheme that estimates the variance and correlation of forecast errors and models the evolution of forecasts over time.

2.2. Some Properties of the Group-Buying Auction Mechanism

Some of the key characteristics associated with group-buying auction mechanism design are present in the literature. The group-buying auction mechanism is fundamentally different from the typical quantity discount mechanism (Dolan 1987, Weng 1995, Corbett and DeGroote 2000) that is often used in consumer and business-to-business procurement settings. First, group-buying closing prices typically decline monotonically in the total purchase quantities of participating buyers, and not just based on an individual buyer's purchase quantities. So a group-buying auction does not lead to price discrimination among different buyers, and every buyer will be charged the same closing price. Second, in group-buying auctions, imperfect information may have an impact on performance and make the final auction price uncertain. Group-buying is not the same as what happens with corporate shopping clubs or affinity group-based buying though. With these other mechanisms, consumers will be associated with one another in some way, and be able to obtain quantity discounts as a result. Another variant of the quantity discount mechanism occurs on the Internet with shopping clubs and "power-buying" Web sites. (), Buyer's Advantage (www.buyersadvan ), and Online Choice () are examples that we have recently observed in the marketplace. With uncertainty about the ultimate number of bidders who will participate, interested consumers may not know whether they can get the products, or what the closing price will be when they make a bid. This may even occur when they bid the lowest price on the group-buying price curve. Third, in the quantity discount mechanism, to achieve a discount the buyer must order more than the threshold number of items required. In group-buying, the buyer can get the discount by ordering more herself or persuading other bidders to order more, as we saw with the "Tell-a-Friend" link at Lets-Buy for co-buying (and at the active group-donation site, , ). A final consideration in some group-buying auctions is that a buyer may be able to choose her own bidding price, which makes this kind of auction similar to an open outcry auction. In practice, many buyers will only be willing to state a low bid price, unless they can rely on the design of the mechanism to faithfully handle information about their actual reservation price. Group-buying auctions have a key, but paradoxical feature: to reach a lower price and higher sale quantity bucket, the consumer may need to enter the auction at a higher price and lower sales quantity bucket (Chen et al. 2009).

Source: J. Chen, R. J. Kauffman, Y. Liu, X. Song. Segmenting uncertain demand in group-buying auctions. Electronic Commerce Research and Applications, 2009, 3(001).

II. Title of the translated article: Segmenting uncertain demand in online group-buying

Translation: Demand uncertainty is a key factor in a seller's decision-making process for products sold through online auctions.
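As an illustration of the price-curve idea described in the excerpt, the following is a minimal sketch of a group-buying price schedule in which the closing price declines monotonically as the total committed quantity grows. The quantity thresholds and prices are invented for illustration and are not taken from the paper.

```python
# Minimal sketch of a group-buying price curve: the closing price is a step
# function that declines monotonically in the total committed quantity.
# The thresholds and prices below are hypothetical.

PRICE_CURVE = [
    (1, 100.0),   # 1-9 committed units   -> unit price 100
    (10, 90.0),   # 10-29 committed units -> unit price 90
    (30, 80.0),   # 30-99 committed units -> unit price 80
    (100, 70.0),  # 100+ committed units  -> unit price 70
]

def closing_price(total_quantity: int) -> float:
    """Return the per-unit closing price for a given total committed quantity."""
    if total_quantity < PRICE_CURVE[0][0]:
        raise ValueError("no sale below the first quantity threshold")
    price = PRICE_CURVE[0][1]
    for threshold, bucket_price in PRICE_CURVE:
        if total_quantity >= threshold:
            price = bucket_price   # every buyer pays the same, lower price
    return price

if __name__ == "__main__":
    for q in (5, 12, 45, 150):
        print(q, "units ->", closing_price(q))
```

Because all winning bidders pay the same closing price, a bidder who enters at a high-price bucket still benefits if later arrivals push the total quantity into a lower-price bucket, which is the paradoxical feature noted at the end of the excerpt.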

A Review of Translated Foreign Literature on Big Data


A review of translated foreign literature on big data (the document contains the English original and the Chinese translation)

Original text: Data Mining and Data Publishing

Data mining is the extraction of interesting patterns or knowledge from huge amounts of data. The initial idea of privacy-preserving data mining (PPDM) was to extend traditional data mining techniques to work with data modified to mask sensitive information. The key issues were how to modify the data and how to recover the data mining result from the modified data. Privacy-preserving data mining considers the problem of running data mining algorithms on confidential data that is not supposed to be revealed even to the party running the algorithm. In contrast, privacy-preserving data publishing (PPDP) may not necessarily be tied to a specific data mining task, and the data mining task may be unknown at the time of data publishing. PPDP studies how to transform raw data into a version that is immunized against privacy attacks but that still supports effective data mining tasks. Privacy preservation for both data mining (PPDM) and data publishing (PPDP) has become increasingly popular because it allows the sharing of privacy-sensitive data for analysis purposes. One well-studied approach is the k-anonymity model [1], which in turn led to other models such as confidence bounding, l-diversity, t-closeness, (α,k)-anonymity, etc. In particular, all known mechanisms try to minimize information loss, and such an attempt provides a loophole for attacks. The aim of this paper is to present a survey of most of the common attack techniques for anonymization-based PPDM and PPDP and explain their effects on data privacy.

Although data mining is potentially useful, many data holders are reluctant to provide their data for data mining for fear of violating individual privacy. In recent years, studies have been made to ensure that the sensitive information of individuals cannot be identified easily.

Anonymity models and k-anonymization techniques have been the focus of intense research in the last few years. In order to ensure anonymization of data while at the same time minimizing the information loss resulting from data modifications, several extending models have been proposed, which are discussed as follows.

1. k-Anonymity

k-anonymity is one of the most classic models. It is a technique that prevents joining attacks by generalizing and/or suppressing portions of the released microdata so that no individual can be uniquely distinguished from a group of size k. In a k-anonymous table, a data set is k-anonymous (k ≥ 1) if each record in the data set is indistinguishable from at least (k − 1) other records within the same data set. The larger the value of k, the better the privacy is protected. k-anonymity can ensure that individuals cannot be uniquely identified by linking attacks.

2. Extending Models

Since k-anonymity does not provide sufficient protection against attribute disclosure, the notion of l-diversity attempts to solve this problem by requiring that each equivalence class has at least l well-represented values for each sensitive attribute. l-diversity has some advantages over k-anonymity, because a k-anonymous dataset permits strong attacks due to lack of diversity in the sensitive attributes. In this model, an equivalence class is said to have l-diversity if there are at least l well-represented values for the sensitive attribute. This matters because there are semantic relationships among the attribute values, and different values have very different levels of sensitivity.
Under the (α,k)-anonymity model, after anonymization, in any equivalence class the frequency (in fraction) of a sensitive value is no more than α.

3. Related Research Areas

Several polls show that the public has an increased sense of privacy loss. Since data mining is often a key component of information systems, homeland security systems, and monitoring and surveillance systems, it gives the wrong impression that data mining is a technique for privacy intrusion. This lack of trust has become an obstacle to the benefits of the technology. For example, the potentially beneficial data mining research project Terrorism Information Awareness (TIA) was terminated by the US Congress due to its controversial procedures of collecting, sharing, and analyzing the trails left by individuals. Motivated by the privacy concerns about data mining tools, a research area called privacy-preserving data mining (PPDM) emerged in 2000. The initial idea of PPDM was to extend traditional data mining techniques to work with data modified to mask sensitive information. The key issues were how to modify the data and how to recover the data mining result from the modified data. The solutions were often tightly coupled with the data mining algorithms under consideration. In contrast, privacy-preserving data publishing (PPDP) may not necessarily be tied to a specific data mining task, and the data mining task is sometimes unknown at the time of data publishing. Furthermore, some PPDP solutions emphasize preserving the data truthfulness at the record level, but PPDM solutions often do not preserve such a property. PPDP differs from PPDM in several major ways, as follows:

1) PPDP focuses on techniques for publishing data, not techniques for data mining. In fact, it is expected that standard data mining techniques are applied on the published data. In contrast, the data holder in PPDM needs to randomize the data in such a way that data mining results can be recovered from the randomized data. To do so, the data holder must understand the data mining tasks and algorithms involved. This level of involvement is not expected of the data holder in PPDP, who usually is not an expert in data mining.

2) Both randomization and encryption do not preserve the truthfulness of values at the record level; therefore, the released data are basically meaningless to the recipients. In such a case, the data holder in PPDM may consider releasing the data mining results rather than the scrambled data.

3) PPDP primarily "anonymizes" the data by hiding the identity of record owners, whereas PPDM seeks to directly hide the sensitive data. Excellent surveys and books on randomization and cryptographic techniques for PPDM can be found in the existing literature.

A family of research work called privacy-preserving distributed data mining (PPDDM) aims at performing some data mining task on a set of private databases owned by different parties. It follows the principle of Secure Multiparty Computation (SMC), and prohibits any data sharing other than the final data mining result. Clifton et al. present a suite of SMC operations, like secure sum, secure set union, secure size of set intersection, and scalar product, that are useful for many data mining tasks. In contrast, PPDP does not perform the actual data mining task, but is concerned with how to publish the data so that the anonymous data are useful for data mining. We can say that PPDP protects privacy at the data level while PPDDM protects privacy at the process level. They address different privacy models and data mining scenarios.
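To make the k-anonymity and l-diversity conditions above concrete, here is a minimal sketch (not taken from the surveyed papers) that checks whether a released table satisfies them with respect to a chosen set of quasi-identifiers and one sensitive attribute. The column names and records are hypothetical.

```python
from collections import defaultdict

# Hypothetical released records: quasi-identifiers (zip, age_range) plus
# one sensitive attribute (disease).
RECORDS = [
    {"zip": "476**", "age_range": "20-29", "disease": "flu"},
    {"zip": "476**", "age_range": "20-29", "disease": "cancer"},
    {"zip": "476**", "age_range": "20-29", "disease": "flu"},
    {"zip": "479**", "age_range": "30-39", "disease": "hepatitis"},
    {"zip": "479**", "age_range": "30-39", "disease": "flu"},
    {"zip": "479**", "age_range": "30-39", "disease": "cancer"},
]

QUASI_IDENTIFIERS = ("zip", "age_range")
SENSITIVE = "disease"

def equivalence_classes(records):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in QUASI_IDENTIFIERS)].append(r)
    return groups

def is_k_anonymous(records, k):
    """Every equivalence class must contain at least k records."""
    return all(len(g) >= k for g in equivalence_classes(records).values())

def is_l_diverse(records, l):
    """Every equivalence class must contain at least l distinct sensitive values."""
    return all(len({r[SENSITIVE] for r in g}) >= l
               for g in equivalence_classes(records).values())

if __name__ == "__main__":
    print("3-anonymous:", is_k_anonymous(RECORDS, 3))   # True
    print("2-diverse:  ", is_l_diverse(RECORDS, 2))     # True
```

This check uses the simple "distinct values" reading of l-diversity; the survey's "well-represented" wording also covers stronger variants such as entropy l-diversity.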
In the field of statistical disclosure control (SDC), the research works focus on privacy-preserving publishing methods for statistical tables. SDC focuses on three types of disclosures, namely identity disclosure, attribute disclosure, and inferential disclosure. Identity disclosure occurs if an adversary can identify a respondent from the published data. Revealing that an individual is a respondent of a data collection may or may not violate confidentiality requirements. Attribute disclosure occurs when confidential information about a respondent is revealed and can be attributed to the respondent. Attribute disclosure is the primary concern of most statistical agencies in deciding whether to publish tabular data. Inferential disclosure occurs when individual information can be inferred with high confidence from statistical information in the published data.

Some other works in SDC focus on the study of the non-interactive query model, in which the data recipients can submit one query to the system. This type of non-interactive query model may not fully address the information needs of data recipients because, in some cases, it is very difficult for a data recipient to accurately construct a query for a data mining task in one shot. Consequently, there is a series of studies on the interactive query model, in which the data recipients, including adversaries, can submit a sequence of queries based on previously received query results. The database server is responsible for keeping track of all queries of each user and determining whether or not the currently received query has violated the privacy requirement with respect to all previous queries. One limitation of any interactive privacy-preserving query system is that it can only answer a sublinear number of queries in total; otherwise, an adversary (or a group of corrupted data recipients) will be able to reconstruct all but a 1 − o(1) fraction of the original data, which is a very strong violation of privacy. When the maximum number of queries is reached, the query service must be closed to avoid privacy leaks. In the case of the non-interactive query model, the adversary can issue only one query and, therefore, the non-interactive query model cannot achieve the same degree of privacy as defined by the interactive model. One may consider that privacy-preserving data publishing is a special case of the non-interactive query model.

This paper presents a survey of most of the common attack techniques for anonymization-based PPDM and PPDP and explains their effects on data privacy. k-anonymity is used to protect respondents' identities and reduces linking attacks; however, in the case of a homogeneity attack a simple k-anonymity model fails, and we need a concept that prevents this attack: the solution is l-diversity. All tuples are arranged in a well-represented form, so that the adversary is diverted across l values of the sensitive attribute. l-diversity is limited in the case of a background-knowledge attack, because no one can predict the knowledge level of an adversary. It is observed that, when using generalization and suppression, we also apply these techniques to attributes that do not need this extent of privacy, and this reduces the precision of the published table.
e-NSTAM (extended Sensitive Tuples Anonymity Method) is applied to sensitive tuples only and reduces information loss; however, this method also fails in the case of multiple sensitive tuples. Generalization with suppression is also a cause of data loss, because suppression withholds values that do not fit the k requirement. Future work on this front can include defining a new privacy measure, along with l-diversity, for multiple sensitive attributes, and focusing on generalizing attributes without suppression by using other techniques to achieve k-anonymity, because suppression reduces the precision of the published table.

Translation: Data Mining and Data Publishing. Data mining is the extraction of interesting patterns or knowledge from large amounts of data.
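The conclusion above turns on the precision lost to generalization and suppression. As a concrete, hypothetical illustration (not from the surveyed papers), the sketch below generalizes an exact age into a range and suppresses the trailing digits of a zip code; the bucket width and masking rule are invented.

```python
# Minimal sketch of generalization and suppression of quasi-identifiers.
# The bucket width and masking rules are hypothetical.

def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with the decade-wide range that contains it."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def suppress_zip(zip_code: str, keep: int = 3) -> str:
    """Suppress the trailing digits of a zip code, keeping only a prefix."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

if __name__ == "__main__":
    raw = [("47677", 29), ("47602", 22), ("47678", 27)]
    released = [(suppress_zip(z), generalize_age(a)) for z, a in raw]
    print(released)   # [('476**', '20-29'), ('476**', '20-29'), ('476**', '20-29')]
```

The wider the buckets and the more digits suppressed, the larger the equivalence classes become, which is exactly the precision-versus-privacy trade-off the survey describes.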

Foreign literature translation: Uncertainty Data Mining: A New Research Direction


Graduation project (thesis) foreign-language material translation. Department: Department of Computer Science and Technology. Major: Computer Science and Technology. Name: (blank). Student number: (blank). Foreign source: Proceeding of Workshop on the (...) of Artificial, Hualien, Taiwan, 2005.

Uncertainty Data Mining: A New Research Direction

Michael Chau (1), Reynold Cheng (2), and Ben Kao (3)
1: School of Business, The University of Hong Kong, Pokfulam, Hong Kong
2: Department of Computing, Hong Kong Polytechnic University, Kowloon, Hong Kong
3: Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong

Abstract

Data uncertainty often arises in real-world applications due to reasons such as imprecise measurement, outdated sources, or sampling errors.

Recently, much research has been published in the area of managing data uncertainty in databases. We argue that when data mining is performed on uncertain data, data uncertainty has to be taken into account in order to obtain high-quality data mining results. We call this the "uncertain data mining" problem. In this paper, we present a framework of possible research directions in this area. We also use the UK-means clustering algorithm as an example to illustrate how the traditional K-means algorithm can be modified to handle data uncertainty in data mining.

1. Introduction

Data is often associated with uncertainty because of measurement inaccuracy, sampling errors, outdated data sources, or other reasons. This is especially true for applications that require interaction with the physical environment, such as location-based services [15] and sensor monitoring [3]. For example, in the scenario of tracking moving objects (such as vehicles or people), it is impossible for the database to track the exact locations of all objects at all times. Therefore, the location of each object changes over time with an associated uncertainty. In order to provide accurate query and mining results, these various sources of data uncertainty have to be considered.

In recent years, there has been a great deal of research on the management of uncertain data in databases, such as the representation of uncertainty in databases and the querying of uncertain data. However, few research results address the problem of mining uncertain data. We note that uncertainty causes data values to no longer be atomic. To apply traditional data mining techniques, uncertain data has to be summarized into atomic values. Taking the moving-object tracking application as an example again, an object's location can be summarized either by its last recorded location or by an expected location (if the probability distribution of the object's location is considered).
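Since the excerpt cites UK-means as its running example, the sketch below (my own illustration, not the authors' code) shows the key change relative to K-means: each uncertain object is represented here by a set of sample points drawn from its probability distribution, and the assignment step uses the expected distance to each cluster representative instead of the distance to a single atomic value. The sample-based representation and the toy data are assumptions made for illustration.

```python
import random

def expected_distance(samples, center):
    """Approximate E[dist(X, c)] for an uncertain object given samples of X."""
    return sum(((x - center[0]) ** 2 + (y - center[1]) ** 2) ** 0.5
               for x, y in samples) / len(samples)

def uk_means(objects, k, iters=20):
    """objects: list of uncertain objects, each given as a list of (x, y) samples."""
    centers = [random.choice(obj) for obj in random.sample(objects, k)]
    for _ in range(iters):
        # Assignment step: each uncertain object joins the cluster whose
        # representative has the smallest *expected* distance to it.
        clusters = [[] for _ in range(k)]
        for obj in objects:
            j = min(range(k), key=lambda c: expected_distance(obj, centers[c]))
            clusters[j].append(obj)
        # Update step: a cluster representative is the mean of the expected
        # positions (sample means) of its member objects.
        for j, members in enumerate(clusters):
            if members:
                means = [(sum(x for x, _ in m) / len(m),
                          sum(y for _, y in m) / len(m)) for m in members]
                centers[j] = (sum(x for x, _ in means) / len(means),
                              sum(y for _, y in means) / len(means))
    return centers, clusters

if __name__ == "__main__":
    random.seed(0)
    # Two groups of uncertain objects; each object is 30 noisy samples of a location.
    objects = [[(random.gauss(cx, 0.5), random.gauss(cy, 0.5)) for _ in range(30)]
               for cx, cy in [(0, 0)] * 10 + [(5, 5)] * 10]
    centers, clusters = uk_means(objects, k=2)
    print("centers:", centers)
```

Replacing each object by a single expected location and running ordinary K-means is exactly the "summarize into atomic values" approach the excerpt cautions about; the sketch instead keeps the whole distribution in the assignment step.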

Classic Data Mining Literature


Data Mining Based on the Apriori Algorithm

1. An outline of data mining: discovering and analyzing valuable hidden patterns in huge databases

With the development of information technology, represented by computers and networks, more and more enterprises, government agencies, educational institutions and scientific research units have digitized their information. The continuously growing volume of information in databases places higher demands on data storage, management and analysis. On one side, progress in data collection tools has given humanity enormous quantities of data; facing this explosive growth of data, people need new tools that can automatically transform data into valuable information and knowledge. Data mining has therefore become a new research hotspot. On the other hand, with the rapid development of database technology and the widespread adoption of data management systems, the data that people accumulate grows day by day. This rapidly increasing data may also hide much important information, and people hope to analyze the information they hold at a higher level in order to make better use of it.

Data mining is a new technology that extracts concealed, previously unknown and potentially valuable knowledge for decision-making from massive data. It is a technology devoted to data analysis, to understanding data, and to revealing the knowledge implied within data, and it is likely to become one of the most profitable targets of future information technology applications. Like the development of other new technologies, data mining must pass through the stages of concept proposal, concept acceptance, widespread research and exploration, gradual application, and finally large-scale application.

Data mining technology has become more and more closely related to daily life. We face targeted advertisements every day, and the commercial sector reduces costs and improves efficiency through data mining. Opponents of data mining worry that the information it obtains comes at the price of threatening people's privacy, since data mining can uncover demographic information that was previously unknown and hidden in customer data. Interest in applying data mining in certain domains is growing day by day, for example in fraud detection, suspect identification, and the prediction of potential terrorists.

Data mining can help people extract interesting knowledge, rules or higher-level information from the relevant data in a database, and can help people analyze it from different perspectives, so that the data in the database can be used effectively.
Data mining can be used not only to describe the development of past data, but also to forecast future trends.

There are many ways to classify data mining methods. By data mining task, they can be divided into association rule mining, classification rule mining, clustering rule mining, dependency analysis and dependency model discovery, as well as concept description, deviation analysis, trend analysis and pattern analysis. By the type of database being mined, they can be divided into methods for relational databases, object-oriented databases, spatial databases, temporal databases, multimedia databases and heterogeneous databases. By the technique used, they can be divided into artificial neural networks, decision trees, genetic algorithms, nearest-neighbor methods, visualization, and so on.

The data mining process generally consists of the main stages of determining the mining objective, data preparation, model building, data mining, result analysis and presentation, and applying the mining results; data mining can be described as an iterative repetition of these stages. The problem data mining has to handle is to obtain meaningful information and induce useful structures from data as a basis for enterprise decision-making. Its applications are extremely broad: as long as an industry has databases with analysis value and demand, mining tools can be used for targeted mining analysis. Common application cases occur mostly in retail, manufacturing, finance and insurance, telecommunications, and medical services.

As noted above, data mining can help people extract interesting knowledge, rules or higher-level information from the relevant data in a database, analyze it from different perspectives, and thereby use the data effectively; it can describe past data and forecast future trends. In view of this, studying data mining is worthwhile. But data mining is only a tool, not a cure-all. It may discover some potential customers, but it cannot tell you why, nor can it guarantee that these potential customers become real ones. Successful data mining requires a deep understanding of the problem domain to be solved, an understanding of the data and of its process, and the ability to find reasonable explanations for the data mining results.

2. Association rules

Association rules refer to interesting associations or correlations among item sets in massive data. With the accumulation of data, people in many fields have become more and more interested in mining association rules from their databases. Discovering interesting associations from massive business transaction records can help the formulation of many business decisions.

The initial form of association rule discovery was the retailer's market basket analysis. Market basket analysis analyzes customers' buying habits by discovering associations between the different items that customers place in their shopping baskets.
By understanding which items are frequently purchased together by customers, the analysis reveals associations between items, and discovering such associations can help retailers formulate marketing strategies. One application of the market basket analysis model is to help managers design different store layouts. One strategy is to place items that are frequently purchased together close to one another, in order to further stimulate the joint sale of these items. For example, if customers who purchase computers also tend to purchase financial-management software at the same time, then placing the hardware display close to the software may help increase the sales of both. Another strategy is to place the hardware and the software at opposite ends of the store, which may induce the group of customers purchasing these items to choose other items along the way. Of course, market basket analysis is only the initial and rather simple form of association rule discovery; research on and application of association rules are developing continuously. For example, if a food store learns through market basket analysis that "most customers purchase bread and milk together in one shopping trip", then the store may be able to increase the sales of both bread and milk by discounting and promoting bread. Again, if a children's goods store learns through market basket analysis that "most customers purchase milk powder and diapers together in one shopping trip", then by placing the milk powder and the diapers in locations far from each other and placing other commonly used children's items in between, the store may induce customers who come to buy milk powder and diapers to purchase other items as well.

Mining association rules from a set of transactions mainly includes two steps: (1) discover all frequent itemsets; because a frequent itemset is an itemset whose support is greater than or equal to the minimum support threshold, any association rule composed of the items of a frequent itemset also has support greater than or equal to the minimum support threshold; (2) generate strong association rules; among all rules whose support is greater than or equal to the minimum support threshold, discover those whose confidence is greater than or equal to the minimum confidence threshold.
Of the above two steps, the key is the first step, whose efficiency determines the efficiency of the entire mining algorithm.

3. The Apriori algorithm

Frequent itemset: if an itemset satisfies minimum support, that is, if the frequency of appearance of the itemset in the transaction database D is greater than or equal to min_sup (the minimum support threshold) times the total number of transactions, then the itemset is called a frequent itemset, or frequent set for short. The set of frequent k-itemsets is usually denoted Lk.

The Apriori algorithm, also called the breadth-first or level-wise algorithm, was proposed by Rakesh Agrawal and Ramakrishnan Srikant in 1994 and is the core of current frequent-set discovery algorithms. The Apriori algorithm uses an iterative method known as level-wise search, in which frequent k-itemsets are used to search for frequent (k+1)-itemsets. First, the set of frequent 1-itemsets is found and denoted L1; L1 is used to find the set of frequent 2-itemsets L2, L2 is used to find L3, and so on, until no frequent k-itemset can be found. Finding each Lk requires one scan of the database.

The association rule mining algorithm is thus decomposed into two sub-problems: (1) extract from D all frequent sets satisfying the minimum support min_sup; (2) use the frequent sets to generate all association rules satisfying the minimum confidence min_conf. The first sub-problem is the key of this algorithm, and the Apriori algorithm solves it with a recursive method based on frequent-set theory.
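As a concrete companion to the description above, here is a minimal Apriori sketch: it finds frequent itemsets level by level (L1, L2, ...) and then generates association rules that meet the minimum confidence. The toy transactions and thresholds are invented for illustration.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise search: frequent k-itemsets (Lk) are used to build candidate
    (k+1)-itemsets, and each level requires one pass over the transactions."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                      # one scan for L1
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(current)
    k = 2
    while current:
        # Candidate generation with the Apriori property: every (k-1)-subset
        # of a candidate must itself be frequent.
        items = sorted(set().union(*current))
        candidates = {frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in current for sub in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(current)
        k += 1
    return frequent

def rules(frequent, min_conf):
    """Generate strong rules A -> B with confidence = sup(A ∪ B) / sup(A)."""
    out = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[antecedent]
                if conf >= min_conf:
                    out.append((set(antecedent), set(itemset - antecedent), conf))
    return out

if __name__ == "__main__":
    baskets = [{"bread", "milk"}, {"bread", "diapers", "beer"},
               {"milk", "diapers", "beer"}, {"bread", "milk", "diapers"},
               {"bread", "milk", "beer"}]
    freq = apriori(baskets, min_sup=3)
    for a, b, conf in rules(freq, min_conf=0.7):
        print(a, "->", b, f"(confidence {conf:.2f})")
```

With min_sup set to 3 transactions and min_conf to 0.7, this toy run reports the bread-and-milk rules, mirroring the bread and milk example in the text above.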

Data Mining Paper (English Version)


Jilin Province's Population Growth and Energy Consumption Analysis

Major: Statistics
Student No.: 0401083710
Name: Niu Fukuan

[Summary] Since the third technological revolution, energy has become the lifeline of the national economy, while the energy on Earth is limited, which has led to a number of oil-related conflicts, or outright wars for oil, between the major powers. Competition for control of the world's resources and energy contributed to the outbreak of the two world wars. China is currently entering a period of high energy consumption; CNPC, Sinopec and CNOOC, the three state-owned oil giants, have been "going out" to develop international markets, and Jilin Province, as a province that both produces and consumes energy, is also active in the corresponding energy diplomacy. Under economic globalization and increasingly fierce competition in the energy environment, China's energy policy still has many imperfections, which to a certain extent affect the energy and population development of Jilin Province and of China; to some extent it can even be said that the existing population crisis is an energy crisis.

[Keywords] Energy consumption; Population; Growth; Analysis

Data source

I select from the China Statistical Yearbook 2009 the comprehensive annual data for 1995-2007 (Table 1). The annual total population (year-end) series is denoted {Xt}, and the annual residential energy consumption (kg of standard coal) series is denoted {Yt}.

Table 1: Annual total population and residential energy consumption, with their logarithms

Year   Total population (year-end)   Energy consumption (kg of standard coal)   lnX           lnY
2001   127627                        16629798.1                                 11.75686723   16.62670671
2002   128453                        17585215.7                                 11.76331836   16.68256909
2003   129227                        19888035.3                                 11.76932583   16.80562887
2004   129988                        21344029.6                                 11.77519742   16.87628261
2005   130756                        23523004.4                                 11.78108827   16.97348941
2006   131448                        25592925.6                                 11.78636662   17.05782653
2007   132129                        26861825.7                                 11.791534     17.10621672

1. Time series plots

First, time series plots of the annual total population (year-end) series {Xt} and the annual residential energy consumption (kg of standard coal) series {Yt} are drawn, in order to observe whether the two series are stationary. The EVIEWS output is shown below.

Figure 1: Time series plot of the total population (year-end) series
Figure 2: Time series plot of the residential energy consumption (kg of standard coal) series

Figure 1 is the time series plot of {Xt}, and Figure 2 is the time series plot of {Yt}. Both figures show a rising trend, in total population as well as in residential energy consumption; the annual total population series {Xt} and the annual energy consumption series {Yt} are not stationary, and the two may have a long-run cointegration relationship.

2. Data smoothing

(1) Taking logarithms

Figures 1 and 2 show intuitively that the data series {Xt} and {Yt} have clear growth trends and are clearly non-stationary. Therefore, the total population series {Xt} and the residential energy consumption (kg of standard coal) series {Yt} are first log-transformed to eliminate heteroscedasticity.
That is, logx = lnXt and logy = lnYt, with the aim of turning the exponentially trending series into linearly trending ones. Using EVIEWS, the time series plots of the log series are obtained: the population series {logx} is shown in Figure 3, and the energy consumption series {logy} is shown in Figure 4.

Figure 3: Time series plot of {logx}
Figure 4: Time series plot of {logy}

Figures 3 and 4 show that the exponential trends of the total population series {logx} and the residential energy consumption (kg of standard coal) series {logy} have been basically eliminated, and the two have an obvious long-run cointegration relationship, which is an important prerequisite for transfer function modeling. However, the log series are still non-stationary. ADF unit root tests are therefore performed on {logx} and {logy} (Tables 2 and 3); the test results are shown below.

(2) Unit root tests

Here we perform unit root tests on the province's total population series and the residential energy consumption (kg of standard coal) series. The results obtained with EVIEWS are as follows.

Table 2: ADF unit root test of the total population series {logx}

From Table 2, the ADF statistic of the total population series is -0.784587, clearly greater than the critical value -4.3260 at the 1% level, greater than the critical value -3.2195 at the 5% level, and also greater than the critical value -2.7557 at the 10% level, so the total population series {logx} is non-stationary.

Table 3: ADF unit root test of the residential energy consumption (kg of standard coal) series {logy}

From Table 3, the ADF statistic of the residential energy consumption series is 0.489677, clearly greater than the critical value -4.3260 at the 1% level, greater than the critical value -3.2195 at the 5% level, and also greater than the critical value -2.7557 at the 10% level, so the residential energy consumption series {logy} is non-stationary.

(3) Differencing

Because the log-transformed series are still not stationary, the log total population series {logx} and the log residential energy consumption (kg of standard coal) series {logy} need to be differenced further; the differenced series are denoted {▽logx} and {▽logy}. ADF unit root tests are then performed on the second-order difference of the total population series {▽logx} and the second-order difference of the residential energy consumption (kg of standard coal) series {▽logy} (Tables 4 and 5); the test results are shown in the tables below.

Table 4: ADF unit root test of {▽logx}

Table 4 shows that the ADF statistic of the second-order difference of the total population series {▽logx} is -10.6278, clearly less than the critical value -6.292057 at the 1% level, less than the critical value -4.450425 at the 5% level, and also less than the critical value -3.701534 at the 10% level, so the second-order difference of the total population series {▽logx} is stationary.

Table 5: ADF unit root test of {▽logy}

Table 5 shows that the ADF statistic of the second-order difference of the residential energy consumption (kg of standard coal) series {▽logy} is -6.395029, clearly less than the critical value -4.4613 at the 1% level, less than the critical value -3.2695 at the 5% level, and also less than the critical value -2.7822 at the 10% level, so the second-order difference of the residential energy consumption series {▽logy} is stationary.
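The log-and-difference unit root workflow above can be reproduced with standard tools. The sketch below uses the adfuller function from statsmodels on a log-transformed, twice-differenced series; the synthetic data stand in for the yearbook series, so the statistics will not match Tables 2-5.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Synthetic stand-in for the yearbook series: an exponential trend plus noise.
# It only demonstrates the testing steps.
np.random.seed(0)
t = np.arange(30)
energy = 1.5e7 * np.exp(0.05 * t) * (1 + 0.01 * np.random.randn(t.size))

log_y = np.log(energy)           # logy = ln(Yt): turns the exponential trend into a linear one
d2_log_y = np.diff(log_y, n=2)   # second-order difference, as in the text

for name, series in [("logy", log_y), ("D(logy,2)", d2_log_y)]:
    adf, pvalue, usedlag, nobs, crit, icbest = adfuller(series, maxlag=1)
    print(name, "ADF =", round(adf, 4), "p-value =", round(pvalue, 4),
          "critical values:", {k: round(v, 4) for k, v in crit.items()})
```

The decision rule matches the text: a series is judged non-stationary when its ADF statistic is greater than the critical values, and stationary when it falls below them.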
3. Cointegration

(1) Cointegration regression

Cointegration theory was put forward by Engle and Granger in the 1980s. Starting from the analysis of non-stationary time series, it explores the long-run equilibrium relationships contained among non-stationary variables and provides a new approach to modeling non-stationary time series.

Since the log total population series {logx} and the log residential energy consumption series {logy} are both integrated of order two, a cointegration relationship may exist between them. The results obtained with EVIEWS are as follows.

Table 6: Cointegrating regression

From Table 6:

D(LNE,2) = -0.054819 - 101.8623 D(LOGX,2)
t = (-1.069855) (-1.120827)
R2 = 0.122487, DW = 1.593055

(2) Testing the stationarity of the residual series

The residual series analysis obtained from EVIEWS is as follows.

Table 7: Unit root test of the residual series

From Table 7, the ADF statistic of the second-order difference of the residuals is -5.977460, clearly less than the critical value -4.6405 at the 1% level, less than the critical value -3.3350 at the 5% level, and also less than the critical value -2.8169 at the 10% level. Therefore, the second-order difference of the residual series et is stationary. It can be expressed as:

D(ET,2) = -0.042260 - 1.707007 D(ET(-1),2)
t = (-0.783744) (-5.977460)
DW = 1.603022, EG = -5.977460

Since EG = -5.977460, consulting the Engle-Granger cointegration critical value table (N = 2, α = 0.05, T = 16) shows that the EG value is less than the critical value, so the hypothesis that the residual series et is stationary is accepted. We can therefore conclude that the total population and residential energy consumption have a long-run cointegration relationship.

4. Establishing the ECM model

From the analysis above, the second-order difference of the log total population series {▽logx} and the second-order difference of the log residential energy consumption series {▽logy} are stationary, and the second-order difference of the residuals et is also stationary. Taking the second-order difference of the log residential energy consumption series as the dependent variable, and the second-order difference of the log total population series together with the second-order difference of the residuals et as explanatory variables, the regression is estimated with EVIEWS, with the following results.

Table 8: ECM model results

From Table 8, the standard ECM regression model can be written as:

D(LOGY,2) = -0.047266 - 154.4568 D(LNP,2) + 0.171676 D(ET,2)
t = (-1.469685) (-2.528562) (1.755694)
R2 = 0.579628, DW = 1.760658

The regression coefficients of the ECM equation pass the significance test, and the error-correction coefficient is positive, consistent with a forward correction mechanism. The estimation results show that changes in the province's residential energy consumption depend not only on changes in the total population, but also on the previous year's deviation of the total population from the equilibrium level. In addition, the regression results show that short-term changes in the total population have a positive impact on residential energy consumption. Because the short-term adjustment coefficient is significant, deviations of Jilin Province's annual residential energy consumption from its long-run equilibrium value can be corrected well.
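The Engle-Granger two-step procedure and the error-correction model above can be sketched with statsmodels as follows. The series are synthetic stand-ins and the variable names (log_pop, log_energy) are mine, so the coefficients will differ from Tables 6-8.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

# Synthetic stand-ins: two trending series driven by a common stochastic trend,
# so they are cointegrated by construction.
np.random.seed(1)
n = 40
trend = np.cumsum(0.02 + 0.01 * np.random.randn(n))
log_pop = 11.7 + 0.3 * trend + 0.005 * np.random.randn(n)
log_energy = 16.6 + 1.0 * trend + 0.02 * np.random.randn(n)

# Step 1 (Engle-Granger): cointegrating regression, then unit root test on residuals.
step1 = sm.OLS(log_energy, sm.add_constant(log_pop)).fit()
resid = step1.resid
eg_stat = adfuller(resid, maxlag=1)[0]
print("residual ADF statistic:", round(eg_stat, 4))
# Note: the proper Engle-Granger critical values differ from the plain ADF ones,
# which is why the text consults a cointegration critical value table.

# Step 2: error-correction model: short-run changes plus the lagged equilibrium error.
d_energy = np.diff(log_energy)
d_pop = np.diff(log_pop)
ecm_X = sm.add_constant(np.column_stack([d_pop, resid[:-1]]))
ecm = sm.OLS(d_energy, ecm_X).fit()
print(ecm.params)   # constant, short-run population effect, error-correction coefficient
```

statsmodels also provides statsmodels.tsa.stattools.coint, which performs this residual-based cointegration test directly and reports the appropriate critical values.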
5. ARMA model

(1) Model identification

After differencing, the non-stationary series have been turned into stationary time series, so an ARMA model can be used. To choose the model, the autocorrelation and partial autocorrelation of the stationarized residential energy consumption series {logy} are computed first; the results are as follows.

Table 9: Autocorrelation and partial autocorrelation of {logy}

From Table 9, the autocorrelation falls within the random interval after lag K = 1, and the partial autocorrelation can also be seen to fall within the random interval after K = 1. Therefore, an ARMA(1,1) model can be established for the residential energy consumption series {logy}. The parameters of the ARMA(1,1) model are then estimated, with the results in the following table.

Table 10: ARMA(1,1) model parameter estimates

From Table 10, the estimated ARMA(1,1) model is given by:

D(LNE,2) = 0.014184 + 0.008803 D(LNE,2)t-1 - 0.858461 u(t-1)

(2) Model testing

The residuals of the model are tested for white noise. If the residuals are not a white noise sequence, then the model needs further improvement, for example with an ARMA(1,2) model; if they are a white noise process, the original model is accepted. The residual test results are as follows.

Table 11: Residual test of the ARMA(1,1) model

Table 11 shows that the P values of the Q statistics are greater than 0.05, so the residual series of the ARMA(1,1) model is a white noise sequence and the ARMA(1,1) model is accepted. The model is then used to predict changes in residential energy consumption, with the following results.

Figure 5: Forecast of residential energy consumption

From the forecast of Jilin Province's residential energy consumption, we can see that residential energy consumption is rising every year, which shows that for many years to come, Jilin Province's residential energy consumption will continue to show an upward trend. And because total population and residential energy consumption change in the same direction, the total population will also continue to increase over the coming years.

6. Problems

Based on the cointegration analysis of the province's total population and residential energy consumption, the relationship obtained between population and energy consumption in Jilin Province is a stable, mutually reinforcing long-run equilibrium relationship. The above analysis gives a more accurate understanding of the energy consumption of Jilin Province and supports better proposals on energy conservation for the province. At present, Jilin Province faces the following energy problems:
(1) Heavy industry still accounts for a large proportion of the economy;
(2) Energy-intensive industries are large in scale and their output is growing rapidly, which offsets the effect of energy saving;
(3) Energy consumption is still coal-based.

7. Recommendations

(1) Control population growth and actively cooperate with the national family-planning policy, to ease the pressure of per-capita consumption.
(2) Raise awareness of the importance of energy saving, implement the energy-saving target responsibility system, and put energy-efficiency measures into practice. Conscientiously implement the three systems of energy-saving statistics, monitoring and evaluation issued by the State Council, with strict accountability.
(3) Speed up industrial restructuring and the transformation of the mode of economic development.
Speeding up industrial restructuring and the transformation of the mode of economic development helps overcome resource, energy and other bottlenecks, and follow a new industrialization path with high technological content, good economic returns, low resource consumption, little environmental pollution, and full use of human resources.
(4) Pay attention to quality improvement and optimization of the structure, so that restructuring ultimately improves the overall quality of industry and raises the quality and efficiency of economic growth.
(5) Strengthen the development and promotion of energy-saving technologies, strengthen energy security, and promote renewable and clean energy. Adhere to the combination of technical progress with deepening reform and opening up. Take enhancing independent innovation capability as the central link in adjusting the industrial structure and changing the growth mode, speed up the building of the innovation system, and work to resolve the major scientific and technological problems constraining the city's development. Vigorously promote circular-economy demonstration pilot enterprises, actively carry out comprehensive utilization of resources and the recycling of renewable resources, and actively promote the construction of solar, wind, biogas, biodiesel and other renewable energy.

References
[1] Wang Yan. Applied Time Series Analysis. China Renmin University Press, 2008.12
[2] Pang Hao. Econometrics. Science Press, 2006.1
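To round out the ARMA(1,1) modeling of Section 5, here is a minimal sketch of fitting such a model to an already-differenced series and forecasting from it, using the ARIMA class and the Ljung-Box test from statsmodels. The input is a synthetic stand-in, so the estimates will not reproduce Tables 10-11.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

# Synthetic stand-in for the differenced series D(LNE,2).
np.random.seed(2)
d2_log_energy = 0.01 + 0.02 * np.random.randn(40)

# An ARMA(1,1) with a constant on an already-differenced series is ARIMA(1, 0, 1).
model = ARIMA(d2_log_energy, order=(1, 0, 1), trend="c").fit()
print(model.params)        # constant, AR(1), MA(1), sigma2

# White-noise check on the residuals via the Ljung-Box Q statistic (cf. Table 11):
# p-values above 0.05 mean the residuals look like white noise, so the model is accepted.
print(acorr_ljungbox(model.resid, lags=[5]))

# Forecast the next three values of the differenced series.
print(model.forecast(steps=3))
```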

Foreign Literature Translation on Big Data Mining


Document information:
Title: A Study of Data Mining with Big Data
Foreign authors: VH Shastri, V Sreeprada
Source: International Journal of Emerging Trends and Technology in Computer Science, 2016, 38(2): 99-103
Word count: 2,291 English words, 12,196 characters; 3,868 Chinese characters

Original text: A Study of Data Mining with Big Data

Abstract

Data has become an important part of every economy, industry, organization, business, function and individual. Big Data is a term used to identify large data sets, typically those whose size is larger than that of a typical database. Big data introduces unique computational and statistical challenges. Big Data is at present expanding in most of the domains of engineering and science. Data mining helps to extract useful information from huge data sets characterized by their volume, variability and velocity. This article presents a HACE theorem that characterizes the features of the Big Data revolution, and proposes a Big Data processing model from the data mining perspective.

Keywords: Big Data, Data Mining, HACE theorem, structured and unstructured.

I. Introduction

Big Data refers to an enormous amount of structured data and unstructured data that overflows the organization. If this data is properly used, it can lead to meaningful information. Big data includes a large amount of data which requires a lot of processing in real time. It provides room to discover new values, to understand in-depth knowledge from hidden values, and provides a space to manage the data effectively. A database is an organized collection of logically related data which can be easily managed, updated and accessed. Data mining is a process of discovering interesting knowledge such as associations, patterns, changes, anomalies and significant structures from large amounts of data stored in databases or other repositories.

Big Data includes 3 V's as its characteristics: volume, velocity and variety. Volume means the amount of data generated every second; the data is in a state of rest, and it is also known for its scale characteristics. Velocity is the speed with which the data is generated; it should handle high-speed data, and the data generated from social media is an example. Variety means different types of data can be taken, such as audio, video or documents; it can be numerals, images, time series, arrays, etc.

Data mining analyses the data from different perspectives and summarizes it into useful information that can be used for business solutions and for predicting future trends. Data mining (DM), also called Knowledge Discovery in Databases (KDD) or Knowledge Discovery and Data Mining, is the process of searching large volumes of data automatically for patterns such as association rules. It applies many computational techniques from statistics, information retrieval, machine learning and pattern recognition. Data mining extracts only the required patterns from the database in a short time span. Based on the type of patterns to be mined, data mining tasks can be classified into summarization, classification, clustering, association and trend analysis.

Big Data is expanding in all domains, including the science and engineering fields and the physical, biological and biomedical sciences.

II. BIG DATA with DATA MINING

Generally, big data refers to a collection of large volumes of data, and these data are generated from various sources like the internet, social media, business organizations, sensors, etc. We can extract some useful information with the help of data mining.
It is a technique for discovering patterns as well as descriptive, understandable models from a large scale of data.

Volume is the size of the data, which is larger than petabytes and terabytes. The scale and rise of size make it difficult to store and analyse using traditional tools. Big Data should be used to mine large amounts of data within a predefined period of time. Traditional database systems were designed to address small amounts of data which were structured and consistent, whereas Big Data includes a wide variety of data such as geospatial data, audio, video, unstructured text and so on.

Big Data mining refers to the activity of going through big data sets to look for relevant information. To process large volumes of data from different sources quickly, Hadoop is used. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Its distributed file system supports fast data transfer rates among nodes and allows the system to continue operating uninterrupted at times of node failure. It runs MapReduce for distributed data processing and works with structured and unstructured data.

III. BIG DATA Characteristics - HACE Theorem

We have a large volume of heterogeneous data. There exists a complex relationship among the data. We need to discover useful information from this voluminous data.

Let us imagine a scenario in which blind people are asked to draw an elephant. The information collected by each blind person may lead him to think of the trunk as a wall, a leg as a tree, the body as a wall and the tail as a rope. The blind men can exchange information with each other.

Figure 1: Blind men and the giant elephant

Some of the characteristics include:

i. Vast data with heterogeneous and diverse sources: One of the fundamental characteristics of big data is the large volume of data represented by heterogeneous and diverse dimensions. For example, in the biomedical world, a single human being is represented by name, age, gender, family history, etc., while X-ray and CT scan images and videos are also used. Heterogeneity refers to the different types of representations of the same individual, and diversity refers to the variety of features used to represent single information.

ii. Autonomous with distributed and decentralized control: the sources are autonomous, i.e., automatically generated; they generate information without any centralized control. We can compare it with the World Wide Web (WWW), where each server provides a certain amount of information without depending on other servers.

iii. Complex and evolving relationships: As the size of the data becomes infinitely large, the relationships that exist are also large. In the early stages, when data is small, there is no complexity in the relationships among the data. Data generated from social media and other sources have complex relationships.

IV. TOOLS: OPEN SOURCE REVOLUTION

Large companies such as Facebook, Yahoo, Twitter and LinkedIn benefit from and contribute work to open source projects. In Big Data mining, there are many open source initiatives. The most popular of them are:

Apache Mahout: Scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.

R: an open source programming language and software environment designed for statistical computing and visualization.
R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand beginning in 1993 and is used for statistical analysis of very large data sets.MOA: Stream data mining open source software to perform data mining in real time. It has implementations of classification, regression; clustering and frequent item set mining and frequent graph mining. It started as a project of the Machine Learning group of University of Waikato, New Zealand, famous for the WEKA software. The streams framework provides an environment for defining and running stream processes using simple XML based definitions and is able to use MOA, Android and Storm.SAMOA: It is a new upcoming software project for distributed stream mining that will combine S4 and Storm with MOA.Vow pal Wabbit: open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable, useful learning algorithm. VW is able to learn from terafeature datasets. It can exceed the throughput of any single machine networkinterface when doing linear learning, via parallel learning.V.DATA MINING for BIG DATAData mining is the process by which data is analysed coming from different sources discovers useful information. Data Mining contains several algorithms which fall into 4 categories. They are:1.Association Rule2.Clustering3.Classification4.RegressionAssociation is used to search relationship between variables. It is applied in searching for frequently visited items. In short it establishes relationship among objects. Clustering discovers groups and structures in the data.Classification deals with associating an unknown structure to a known structure. Regression finds a function to model the data.The different data mining algorithms are:Table 1. Classification of AlgorithmsData Mining algorithms can be converted into big map reduce algorithm based on parallel computing basis.Table 2. Differences between Data Mining and Big DataVI.Challenges in BIG DATAMeeting the challenges with BIG Data is difficult. The volume is increasing every day. The velocity is increasing by the internet connected devices. The variety is also expanding and the organizations’ capability to capture and process the data is limited.The following are the challenges in area of Big Data when it is handled:1.Data capture and storage2.Data transmission3.Data curation4.Data analysis5.Data visualizationAccording to, challenges of big data mining are divided into 3 tiers.The first tier is the setup of data mining algorithms. The second tier includesrmation sharing and Data Privacy.2.Domain and Application Knowledge.The third one includes local learning and model fusion for multiple information sources.3.Mining from sparse, uncertain and incomplete data.4.Mining complex and dynamic data.Figure 2: Phases of Big Data ChallengesGenerally mining of data from different data sources is tedious as size of data is larger. Big data is stored at different places and collecting those data will be a tedious task and applying basic data mining algorithms will be an obstacle for it. Next we need to consider the privacy of data. The third case is mining algorithms. 
When we are applying data mining algorithms to these subsets of data the result may not be that much accurate.VII.Forecast of the futureThere are some challenges that researchers and practitioners will have to deal during the next years:Analytics Architecture:It is not clear yet how an optimal architecture of analytics systems should be to deal with historic data and with real-time data at the same time. An interesting proposal is the Lambda architecture of Nathan Marz. The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, theserving layer, and the speed layer. It combines in the same system Hadoop for the batch layer, and Storm for the speed layer. The properties of the system are: robust and fault tolerant, scalable, general, and extensible, allows ad hoc queries, minimal maintenance, and debuggable.Statistical significance: It is important to achieve significant statistical results, and not be fooled by randomness. As Efron explains in his book about Large Scale Inference, it is easy to go wrong with huge data sets and thousands of questions to answer at once.Distributed mining: Many data mining techniques are not trivial to paralyze. To have distributed versions of some methods, a lot of research is needed with practical and theoretical analysis to provide new methods.Time evolving data: Data may be evolving over time, so it is important that the Big Data mining techniques should be able to adapt and in some cases to detect change first. For example, the data stream mining field has very powerful techniques for this task.Compression: Dealing with Big Data, the quantity of space needed to store it is very relevant. There are two main approaches: compression where we don’t loose anything, or sampling where we choose what is thedata that is more representative. Using compression, we may take more time and less space, so we can consider it as a transformation from time to space. Using sampling, we are loosing information, but the gains inspace may be in orders of magnitude. For example Feldman et al use core sets to reduce the complexity of Big Data problems. Core sets are small sets that provably approximate the original data for a given problem. Using merge- reduce the small sets can then be used for solving hard machine learning problems in parallel.Visualization: A main task of Big Data analysis is how to visualize the results. As the data is so big, it is very difficult to find user-friendly visualizations. New techniques, and frameworks to tell and show stories will be needed, as for examplethe photographs, infographics and essays in the beautiful book ”The Human Face of Big Data”.Hidden Big Data: Large quantities of useful data are getting lost since new data is largely untagged and unstructured data. The 2012 IDC studyon Big Data explains that in 2012, 23% (643 exabytes) of the digital universe would be useful for Big Data if tagged and analyzed. However, currently only 3% of the potentially useful data is tagged, and even less is analyzed.VIII.CONCLUSIONThe amounts of data is growing exponentially due to social networking sites, search and retrieval engines, media sharing sites, stock trading sites, news sources and so on. Big Data is becoming the new area for scientific data research and for business applications.Data mining techniques can be applied on big data to acquire some useful information from large datasets. 
They can be used together to obtain a useful picture of the data. Big Data analysis tools such as MapReduce over Hadoop and HDFS help organizations.

中文译文：大数据挖掘研究
摘要　数据已经成为各个经济、行业、组织、企业、职能和个人的重要组成部分。
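The article repeatedly points to Hadoop's MapReduce model as the way large, distributed data sets are processed before mining. As a purely illustrative aside (a toy simulation in plain Python, not the Hadoop API; the transaction data and function names are invented), the following sketch shows the map-emit / reduce-aggregate flow the article describes:

from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Map step: emit a (key, 1) pair for every item in one transaction.
    return [(item, 1) for item in record.split()]

def reduce_phase(pairs):
    # Reduce step: sum the counts emitted for each key, as a per-key reducer would.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

if __name__ == "__main__":
    transactions = ["milk bread", "bread butter", "milk bread butter"]
    emitted = chain.from_iterable(map_phase(t) for t in transactions)
    print(reduce_phase(emitted))  # {'milk': 2, 'bread': 3, 'butter': 2}

In a real Hadoop Streaming job the map and reduce functions would run on different nodes, and the framework, not this script, would group the emitted pairs by key.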

数据分析外文文献+翻译

文献1：《数据分析在企业决策中的应用》该文献探讨了数据分析在企业决策中的重要性和应用。

研究发现,通过数据分析可以获取准确的商业情报,帮助企业更好地理解市场趋势和消费者需求。

通过对大量数据的分析,企业可以发现隐藏的模式和关联,从而制定出更具竞争力的产品和服务策略。

数据分析还可以提供决策支持,帮助企业在不确定的环境下做出明智的决策。

因此,数据分析已成为现代企业成功的关键要素之一。

文献2：《机器学习在数据分析中的应用》该文献探讨了机器学习在数据分析中的应用。

研究发现，机器学习可以帮助企业更高效地分析大量的数据，并从中发现有价值的信息。

机器学习算法可以自动学习和改进，从而帮助企业发现数据中的模式和趋势。

通过机器学习的应用，企业可以更准确地预测市场需求、优化业务流程，并制定更具策略性的决策。

因此，机器学习在数据分析中的应用正逐渐受到企业的关注和采用。

文献3:《数据可视化在数据分析中的应用》该文献探讨了数据可视化在数据分析中的重要性和应用。

研究发现,通过数据可视化可以更直观地呈现复杂的数据关系和趋势。

可视化可以帮助企业更好地理解数据,发现数据中的模式和规律。

数据可视化还可以帮助企业进行数据交互和决策共享,提升决策的效率和准确性。

因此,数据可视化在数据分析中扮演着非常重要的角色。

翻译
文献1标题：The Application of Data Analysis in Business Decision-making
文献2标题：The Application of Machine Learning in Data Analysis
文献3标题：The Application of Data Visualization in Data Analysis
翻译摘要：本文献研究了数据分析在企业决策中的应用，以及机器学习和数据可视化在数据分析中的作用。
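下面是一段极简的 Python 示意代码（仅作说明：销售数据为虚构，假设环境中已安装 numpy、scikit-learn 与 matplotlib），把上述三篇文献的主题串在一起——先对数据做简单分析和机器学习预测，再把结果可视化：

import numpy as np
import matplotlib
matplotlib.use("Agg")  # 仅保存图片，不弹出窗口
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

months = np.arange(1, 13).reshape(-1, 1)                             # 月份 1~12
sales = np.array([12, 14, 15, 18, 21, 24, 26, 30, 33, 35, 38, 42])   # 虚构的月销售额

model = LinearRegression().fit(months, sales)       # 机器学习：拟合销售趋势
forecast = model.predict(np.array([[13], [14]]))    # 预测未来两个月的需求

plt.scatter(months, sales, label="history")         # 数据可视化
plt.plot(months, model.predict(months), label="fitted trend")
plt.xlabel("month"); plt.ylabel("sales")
plt.legend()
plt.savefig("sales_trend.png")
print("Forecast for months 13-14:", forecast.round(1))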

数据挖掘外文文献翻译中英文

数据挖掘外文文献翻译(含:英文原文及中文译文)英文原文What is Data Mining?Simply stated, data mining refers to extracting or “mining” knowledge from large amounts of data. The term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, “data mining” should have been more appropriately named “knowledge mining from data”, which is unfortunately somewhat long. “Knowledge mining”, a shorter term, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets from a great deal of raw material. Thus, such a misnomer which carries both “data” and “mining” became a popular choice . There are many other terms carrying a similar or slightly different meaning to data mining, such as knowledge mining from databases, knowledge extraction, data / pattern analysis, data archaeology, and data dredging.Many people treat data mining as a synonym for another popularly used term, “Knowledge Discovery in Databases”, or KDD. Alternatively, others view data mining as simply an essential step in the process ofknowledge discovery in databases. Knowledge discovery consists of an iterative sequence of the following steps:· data cleaning: to remove noise or irrelevant data, · data integration: where multiple data sources may be combined,·data selection : where data relevant to the analysis task are retrieved from the database,· data transformati on : where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance,·data mining: an essential process where intelligent methods are applied in order to extract data patterns, · pattern evaluation: to identify the truly interesting patterns representing knowledge based on some interestingness measures, and·knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user .The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user, and may be stored as new knowledge in the knowledge base. Note that according to this view, data mining is only one step in the entire process, albeit an essential one since it uncovers hidden patterns for evaluation.We agree that data mining is a knowledge discovery process.However, in industry, in media, and in the database research milieu, the term “data mining” is becoming mo re popular than the longer term of “knowledge discovery in databases”. Therefore, in this book, we choose to use the term “data mining”. We adopt a broad view of data mining functionality: data mining is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories.Based on this view, the architecture of a typical data mining system may have the following major components:1. Database, data warehouse, or other information repository. This is one or a set of databases, data warehouses, spread sheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.2. Database or data warehouse server. The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.3. Knowledge base. This is the domain knowledge that is used to guide the search, or evaluate the interestingness of resulting patterns. 
Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included. Otherexamples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).4. Data mining engine. This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association analysis, classification, evolution and deviation analysis.5. Pattern evaluation module. This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It may access interestingness thresholds stored in the knowledge base. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting patterns.6. Graphical user interface. This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or datastructures, evaluate mined patterns, and visualize the patterns in different forms.From a data warehouse perspective, data mining can be viewed as an advanced stage of on-1ine analytical processing (OLAP). However, data mining goes far beyond the narrow scope of summarization-style analytical processing of data warehouse systems by incorporating more advanced techniques for data understanding.While there may be many “data mining systems” on the market, not all of them can perform true data mining. A data analysis system that does not handle large amounts of data can at most be categorized as a machine learning system, a statistical data analysis tool, or an experimental system prototype. A system that can only perform data or information retrieval, including finding aggregate values, or that performs deductive query answering in large databases should be more appropriately categorized as either a database system, an information retrieval system, or a deductive database system.Data mining involves an integration of techniques from mult1ple disciplines such as database technology, statistics, machine learning, high performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis. We adopt a database perspective in our presentation of data mining in this book. That is, emphasis is placed on efficient andscalable data mining techniques for large databases. By performing data mining, interesting knowledge, regularities, or high-level information can be extracted from databases and viewed or browsed from different angles. The discovered knowledge can be applied to decision making, process control, information management, query processing, and so on. 
Therefore, data mining is considered as one of the most important frontiers in database systems and one of the most promising, new database applications in the information industry.A classification of data mining systemsData mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and or rough set theory, knowledge representation, inductive logic programming, or high performance computing. Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, Information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, or psychology.Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of datamining systems. Therefore, it is necessary to provide a clear classification of data mining systems. Such a classification may help potential users distinguish data mining systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows.1) Classification according to the kinds of databases mined. A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly.For instance, if classifying according to data models, we may have a relational, transactional, object-oriented, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time -series, text, or multimedia data mining system , or a World-Wide Web mining system . Other system types include heterogeneous data mining systems, and legacy data mining systems.2) Classification according to the kinds of knowledge mined. Data mining systems can be categorized according to the kinds of knowledge they mine, i.e., based on data mining functionalities, such as characterization, discrimination, association, classification, clustering,trend and evolution analysis, deviation analysis , similarity analysis, etc.A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities.Moreover, data mining systems can also be distinguished based on the granularity or levels of abstraction of the knowledge mined, including generalized knowledge(at a high level of abstraction), primitive-level knowledge(at a raw data level), or knowledge at multiple levels (considering several levels of abstraction). An advanced data mining system should facilitate the discovery of knowledge at multiple levels of abstraction.3) Classification according to the kinds of techniques utilized.Data mining systems can also be categorized according to the underlying data mining techniques employed. 
These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems), or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system will often adopt multiple data mining techniques or work out an effective, integrated technique which combines the merits of a few individual approaches.

中文译文
什么是数据挖掘？
简而言之，数据挖掘是指从大量数据中提取或“挖掘”知识。
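To make the knowledge-discovery steps above more concrete, here is a compact, hedged sketch in Python (pandas and scikit-learn assumed to be installed; the two toy tables and all column names are invented and simplified — this illustrates the step sequence, not the chapter's own system architecture):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Data cleaning and data integration: combine two small sources, drop incomplete rows.
purchases = pd.DataFrame({"customer": [1, 2, 3, 4], "amount": [120.0, None, 75.0, 300.0]})
profiles = pd.DataFrame({"customer": [1, 2, 3, 4], "visits": [5, 2, 8, 1]})
data = purchases.merge(profiles, on="customer").dropna()

# Data selection and transformation: keep the analysis-relevant columns and normalize them.
features = StandardScaler().fit_transform(data[["amount", "visits"]])

# Data mining: discover groups of similar customers.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Pattern evaluation and knowledge presentation: score the clustering and report it.
print("clusters:", labels.tolist())
print("silhouette:", round(silhouette_score(features, labels), 3))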

不确定性数据挖掘在IDS中的应用

《电脑开发与应用》 第26卷 第6期（总0439）
文章编号：1003-5850(2013)06-0037-03
中图分类号：TP31　文献标识码：A

不确定性数据挖掘在IDS中的应用
刘志明
（湖北第二师范学院计算机学院，武汉 430205）

Application of Uncertainty Data Mining to IDS
LIU Zhi-ming
(College of Computer Science, Hubei University of Education, Wuhan 430205, China)
Keywords: uncertainty, data mining, rough set, intrusion detection, IDS

引言
……分的重视和研究。在所有这些方法和技术中，大多是基于先验规则的，希望实现一个具有感知、识别、理解、自学习和自适应能力的系统，也就是一个智能的……后者则具有侦测速度快、隐蔽性好、视野更宽、较少的专用检测设备及占用资源较少。
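以下给出一个与本节关键词中“粗糙集（rough set）”对应的极简 Python 示意（仅为概念演示，数据与属性名均为虚构，并非原文的具体方法）：对被标记为“入侵”的记录集合，在所选属性导出的等价关系下计算其下近似与上近似，两者之差即边界域，也就是标题中“不确定性”所指的部分。

from collections import defaultdict

records = {
    1: {"proto": "tcp", "flag": "SYN"},
    2: {"proto": "tcp", "flag": "SYN"},
    3: {"proto": "udp", "flag": "ACK"},
    4: {"proto": "udp", "flag": "ACK"},
}
intrusions = {1, 3}  # 虚构的“入侵”记录编号

def equivalence_classes(items, attrs):
    # 将在所选属性上不可分辨的记录归入同一等价类
    classes = defaultdict(set)
    for rid, row in items.items():
        classes[tuple(row[a] for a in attrs)].add(rid)
    return classes.values()

def approximations(items, attrs, target):
    lower, upper = set(), set()
    for cls in equivalence_classes(items, attrs):
        if cls <= target:      # 完全落入目标集合：确定为入侵
            lower |= cls
        if cls & target:       # 与目标集合相交：可能为入侵
            upper |= cls
    return lower, upper

print(approximations(records, ["proto", "flag"], intrusions))
# 输出 (set(), {1, 2, 3, 4})：下近似为空、上近似为全集，说明仅凭这两个属性无法消除不确定性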

数据挖掘技术毕业论文中英文资料对照外文翻译文献综述

数据挖掘技术毕业论文中英文资料对照外文翻译文献综述数据挖掘技术简介中英文资料对照外文翻译文献综述英文原文Introduction to Data MiningAbstract:Microsoft® SQL Server™ 2005 provides an integrated environment for creating and working with data mining models. This tutorial uses four scenarios, targeted mailing, forecasting, market basket, and sequence clustering, to demonstrate how to use the mining model algorithms, mining model viewers, and data mining tools that are included in this release of SQL Server.IntroductionThe data mining tutorial is designed to walk you through the process of creating data mining models in Microsoft SQL Server 2005. The data mining algorithms and tools in SQL Server 2005 make it easy to build a comprehensive solution for a variety of projects, including market basket analysis, forecasting analysis, and targeted mailing analysis. The scenarios for these solutions are explained in greater detail later in the tutorial.The most visible components in SQL Server 2005 are the workspaces that you use to create and work with data mining models. The online analytical processing (OLAP) and data mining tools are consolidated into two working environments: Business Intelligence Development Studio and SQL Server Management Studio. Using Business Intelligence Development Studio, you can develop an Analysis Services project disconnected from the server. When the project is ready, you can deploy it to the server. You can also work directly against the server. The main function of SQL Server Management Studio is to manage the server. Each environment is described in more detail later in this introduction. For more information on choosing between the two environments, see "Choosing Between SQL Server Management Studio and Business Intelligence Development Studio" in SQL Server Books Online.All of the data mining tools exist in the data mining editor. Using the editor you can manage mining models, create new models, view models, compare models, and create predictions basedon existing models.After you build a mining model, you will want to explore it, looking for interesting patterns and rules. Each mining model viewer in the editor is customized to explore models built with a specific algorithm. For more information about the viewers, see "Viewing a Data Mining Model" in SQL Server Books Online.Often your project will contain several mining models, so before you can use a model to create predictions, you need to be able to determine which model is the most accurate. For this reason, the editor contains a model comparison tool called the Mining Accuracy Chart tab. Using this tool you can compare the predictive accuracy of your models and determine the best model.To create predictions, you will use the Data Mining Extensions (DMX) language. DMX extends SQL, containing commands to create, modify, and predict against mining models. For more information about DMX, see "Data Mining Extensions (DMX) Reference" in SQL Server Books Online. Because creating a prediction can be complicated, the data mining editor contains a tool called Prediction Query Builder, which allows you to build queries using a graphical interface. You can also view the DMX code that is generated by the query builder.Just as important as the tools that you use to work with and create data mining models are the mechanics by which they are created. The key to creating a mining model is the data mining algorithm. 
The algorithm finds patterns in the data that you pass it, and it translates them into a mining model — it is the engine behind the process.Some of the most important steps in creating a data mining solution are consolidating, cleaning, and preparing the data to be used to create the mining models. SQL Server 2005 includes the Data Transformation Services (DTS) working environment, which contains tools that you can use to clean, validate, and prepare your data. For more information on using DTS in conjunction with a data mining solution, see "DTS Data Mining Tasks and Transformations" in SQL Server Books Online.In order to demonstrate the SQL Server data mining features, this tutorial uses a new sample database called AdventureWorksDW. The database is included with SQL Server 2005, and it supports OLAP and data mining functionality. In order to make the sample database available, you need to select the sample database at the installation time in the “Advanced” dialog for component selection.Adventure WorksAdventureWorksDW is based on a fictional bicycle manufacturing company named Adventure Works Cycles. Adventure Works produces and distributes metal and composite bicycles to North American, European, and Asian commercial markets. The base of operations is located in Bothell, Washington with 500 employees, and several regional sales teams are located throughout their market base.Adventure Works sells products wholesale to specialty shops and to individuals through theInternet. For the data mining exercises, you will work with the AdventureWorksDW Internet sales tables, which contain realistic patterns that work well for data mining exercises.For more information on Adventure Works Cycles see "Sample Databases and Business Scenarios" in SQL Server Books Online.Database DetailsThe Internet sales schema contains information about 9,242 customers. These customers live in six countries, which are combined into three regions:North America (83%)Europe (12%)Australia (7%)The database contains data for three fiscal years: 2002, 2003, and 2004.The products in the database are broken down by subcategory, model, and product.Business Intelligence Development StudioBusiness Intelligence Development Studio is a set of tools designed for creating business intelligence projects. Because Business Intelligence Development Studio was created as an IDE environment in which you can create a complete solution, you work disconnected from the server. You can change your data mining objects as much as you want, but the changes are not reflected on the server until after you deploy the project.Working in an IDE is beneficial for the following reasons:The Analysis Services project is the entry point for a business intelligence solution. An Analysis Services project encapsulates mining models and OLAP cubes, along with supplemental objects that make up the Analysis Services database. From Business Intelligence Development Studio, you can create and edit Analysis Services objects within a project and deploy the project to the appropriate Analysis Services server or servers.If you are working with an existing Analysis Services project, you can also use Business Intelligence Development Studio to work connected the server. In this way, changes are reflected directly on the server without having to deploy the solution.SQL Server Management StudioSQL Server Management Studio is a collection of administrative and scripting tools for working with Microsoft SQL Server components. 
This workspace differs from Business Intelligence Development Studio in that you are working in a connected environment where actions are propagated to the server as soon as you save your work.After the data has been cleaned and prepared for data mining, most of the tasks associated with creating a data mining solution are performed within Business Intelligence Development Studio. Using the Business Intelligence Development Studio tools, you develop and test the datamining solution, using an iterative process to determine which models work best for a given situation. When the developer is satisfied with the solution, it is deployed to an Analysis Services server. From this point, the focus shifts from development to maintenance and use, and thus SQL Server Management Studio. Using SQL Server Management Studio, you can administer your database and perform some of the same functions as in Business Intelligence Development Studio, such as viewing, and creating predictions from mining models.Data Transformation ServicesData Transformation Services (DTS) comprises the Extract, Transform, and Load (ETL) tools in SQL Server 2005. These tools can be used to perform some of the most important tasks in data mining: cleaning and preparing the data for model creation. In data mining, you typically perform repetitive data transformations to clean the data before using the data to train a mining model. Using the tasks and transformations in DTS, you can combine data preparation and model creation into a single DTS package.DTS also provides DTS Designer to help you easily build and run packages containing all of the tasks and transformations. Using DTS Designer, you can deploy the packages to a server and run them on a regularly scheduled basis. This is useful if, for example, you collect data weekly data and want to perform the same cleaning transformations each time in an automated fashion.You can work with a Data Transformation project and an Analysis Services project together as part of a business intelligence solution, by adding each project to a solution in Business Intelligence Development Studio.Mining Model AlgorithmsData mining algorithms are the foundation from which mining models are created. The variety of algorithms included in SQL Server 2005 allows you to perform many types of analysis. For more specific information about the algorithms and how they can be adjusted using parameters, see "Data Mining Algorithms" in SQL Server Books Online.Microsoft Decision TreesThe Microsoft Decision Trees algorithm supports both classification and regression and it works well for predictive modeling. Using the algorithm, you can predict both discrete and continuous attributes.In building a model, the algorithm examines how each input attribute in the dataset affects the result of the predicted attribute, and then it uses the input attributes with the strongest relationship to create a series of splits, called nodes. As new nodes are added to the model, a tree structure begins to form. The top node of the tree describes the breakdown of the predicted attribute over the overall population. Each additional node is created based on the distribution of states of the predicted attribute as compared to the input attributes. If an input attribute is seen tocause the predicted attribute to favor one state over another, a new node is added to the model. The model continues to grow until none of the remaining attributes create a split that provides an improved prediction over the existing node. 
The model seeks to find a combination of attributes and their states that creates a disproportionate distribution of states in the predicted attribute, therefore allowing you to predict the outcome of the predicted attribute.Microsoft ClusteringThe Microsoft Clustering algorithm uses iterative techniques to group records from a dataset into clusters containing similar characteristics. Using these clusters, you can explore the data, learning more about the relationships that exist, which may not be easy to derive logically through casual observation. Additionally, you can create predictions from the clustering model created by the algorithm. For example, consider a group of people who live in the same neighborhood, drive the same kind of car, eat the same kind of food, and buy a similar version of a product. This is a cluster of data. Another cluster may include people who go to the same restaurants, have similar salaries, and vacation twice a year outside the country. Observing how these clusters are distributed, you can better understand how the records in a dataset interact, as well as how that interaction affects the outcome of a predicted attribute.Microsoft Naïve BayesThe Microsoft Naïve Bayes algorithm quickly builds mining models that can be used for classification and prediction. It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute, which can later be used to predict an outcome of the predicted attribute based on the known input attributes. The probabilities used to generate the model are calculated and stored during the processing of the cube. The algorithm supports only discrete or discretized attributes, and it considers all input attributes to be independent. The Microsoft Naïve Bayes algorithm produces a simple mining model that can be considered a starting point in the data mining process. Because most of the calculations used in creating the model are generated during cube processing, results are returned quickly. This makes the model a good option for exploring the data and for discovering how various input attributes are distributed in the different states of the predicted attribute.Microsoft Time SeriesThe Microsoft Time Series algorithm creates models that can be used to predict continuous variables over time from both OLAP and relational data sources. For example, you can use the Microsoft Time Series algorithm to predict sales and profits based on the historical data in a cube.Using the algorithm, you can choose one or more variables to predict, but they must be continuous. You can have only one case series for each model. The case series identifies the location in a series, such as the date when looking at sales over a length of several months or years.A case may contain a set of variables (for example, sales at different stores). The Microsoft Time Series algorithm can use cross-variable correlations in its predictions. For example, prior sales at one store may be useful in predicting current sales at another store.Microsoft Neural NetworkIn Microsoft SQL Server 2005 Analysis Services, the Microsoft Neural Network algorithm creates classification and regression mining models by constructing a multilayer perceptron network of neurons. Similar to the Microsoft Decision Trees algorithm provider, given each state of the predictable attribute, the algorithm calculates probabilities for each possible state of the input attribute. 
The algorithm provider processes the entire set of cases, iteratively comparing the predicted classification of the cases with their known actual classification. The errors from the initial classification in the first iteration over the entire set of cases are fed back into the network and used to modify the network's performance for the next iteration, and so on. You can later use these probabilities to predict an outcome of the predicted attribute, based on the input attributes. One of the primary differences between this algorithm and the Microsoft Decision Trees algorithm, however, is that its learning process optimizes the network parameters to minimize the error, while the Microsoft Decision Trees algorithm splits rules in order to maximize information gain. The algorithm supports the prediction of both discrete and continuous attributes.

Microsoft Linear Regression
The Microsoft Linear Regression algorithm is a particular configuration of the Microsoft Decision Trees algorithm, obtained by disabling splits (the whole regression formula is built in a single root node). The algorithm supports the prediction of continuous attributes.

Microsoft Logistic Regression
The Microsoft Logistic Regression algorithm is a particular configuration of the Microsoft Neural Network algorithm, obtained by eliminating the hidden layer. The algorithm supports the prediction of both discrete and continuous attributes.

中文译文
数据挖掘技术简介
摘要：微软® SQL Server™ 2005 提供了用于创建和使用数据挖掘模型的集成环境。
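The models in this tutorial are defined and queried with DMX inside Analysis Services; as a language-neutral illustration of the underlying idea only (a decision tree that predicts a discrete attribute from input attributes using entropy-based splits), here is a hedged scikit-learn sketch — the toy rows and column meanings are invented and are not the AdventureWorksDW schema:

from sklearn.tree import DecisionTreeClassifier, export_text

# Input attributes per customer: [age, yearly income in thousands, number of cars]
X = [[25, 40, 0], [32, 60, 1], [45, 90, 2], [23, 35, 0], [52, 120, 2], [38, 70, 1]]
y = [1, 1, 0, 1, 0, 0]  # predicted attribute: 1 = bought a bike, 0 = did not

# criterion="entropy" mirrors the information-gain splitting described above
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "income", "cars"]))
print("prediction for age 30, income 55, 1 car:", tree.predict([[30, 55, 1]]))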

探索不确定性与遥感数据论文 英译汉

Exploring uncertainty in remotely sensed data withparallel coordinate plotsYong Ge , Sanping Li , V. Chris Lakhan , Arko LucieerAbstract The existence of uncertainty in classified remotely sensed data necessitates the application of enhanced techniques for identifying and visualizing the various degrees of uncertainty. This paper, therefore, applies the multidimensional graphical data analysis technique of parallel coordinate plots (PCP) to visualize the uncertainty in Landsat Thematic Mapper (TM) data classified by the Maximum Likelihood Classifier (MLC) and Fuzzy C-Means (FCM). The Landsat TM data are from the Yellow River Delta, Shandong Province, China. Image classification with MLC and FCM provides the probability vector and fuzzy membership vector of each pixel. Based on these vectors, the Shannon’s entropy (S.E.) of each pixel is calculated. PCPs are then produced for each classification output. The PCP axes denote the posterior probability vector and fuzzy membership vector and two additional axes represent S.E. and the associated degree of uncertainty. The PCPs highlight the distribution of probability values of different land cover types for each pixel, and also reflect the status of pixels with different degrees of uncertainty. Brushing functionality is then added to PCP visualization in order to highlight selected pixels of interest. This not only reduces the visualization uncertainty, but also provides invaluable information on the positional and spectral characteristics of targeted pixels.1. IntroductionA major problem that needs to be addressed in remote sensing is the analysis,identification and visualization of the uncertainties arising from the classification ofremotely sensed data with classifiers such as the Maximum Likelihood Classifier (MLC)and Fuzzy C-Means (FCM). While the estimation and mapping of uncertainty has beendiscussed by several authors (for example, Shi and Ehlers, 1996; van der Wel et al., 1998;Dungan et al., 2002; Foody and Atkinson, 2002; Lucieer and Kraak, 2004; Ibrahim et al.,2005; Ge and Li, 2008a), very little research has been done on identifying, targeting andvisualizing pixels with different degrees of uncertainty. This paper, therefore, appliesparallel coordinate plots (PCP) (Inselberg, 1985, 2009; Inselberg and Dimsdale, 1990) tovisualize the uncertainty in sample data and classified data with MLC and FuzzyC-Means. A PCP is a multivariate visualization tool that plots multiple attributes on theX-axis against their values on the Y-axis and has been widely applied to data mining andvisualization (Inselberg and Dimsdale, 1990; Guo, 2003; Edsall, 2003; Gahegan et al., 2002; Andrienko and Andrienko, 2004; Guo et al., 2005; Inselberg, 2009). The PCP is useful for providing a representation of high dimensional objects for visualizing uncertainty in remotely sensed data compared with two-dimensional, three-dimensional, animation and other visualization techniques (Ge and Li, 2008b). Several advantages of the PCP technique for visualizing multidimensional data have been outlined by Siirtoia and Ra¨iha¨ (2006).Data for the PCPs are from a 1999 Landsat Thematic Mapper (TM) image acquired over the Yellow River Delta, Shandong Province, China. After classifying the data the paper emphasizes the uncertainties arising from the classification process. The probability vector and fuzzy membership vector of each pixel, obtained with classifiers MLC and FCM, are then used in the calculation of Shannon’s entropy (S.E.), a measure of the degree of uncertainty. 
Axes on the PCP illustrate S.E. and the degrees of uncertainty. Brushing is then added to PCP visualization in order to highlight selected pixels. As demonstrated by Siirtoia and Ra¨iha¨(2006) the brushing operation enhances PCP visualization by allowing interaction with PCPs whereby polylines can be selected and highlighted.2. Remarks on parallel coordinate plots and brushingParallel coordinate plots, a data analysis tool, can be applied to a diverse set of multidimensional problems (Inselberg and Dimsdale, 1990). The basis of PCP is to place coordinate axes in parallel, and to present a data point as a connection line between coordinate values. A given row of data could be represented by drawing a line that connects the value of that row to each corresponding axis (Kreuseler, 2000; Shneiderman, 2004). A single set of connected line segments representing one multidimensional data item is called a polyline (Siirtoia and Ra¨iha¨, 2006). An entire dimensional dataset can be viewed whereby all observations are plotted on the same graph. All attributes in two-dimensional feature space could, therefore, be distinguished thereby allowing for the recognition of uncertainties and outliers in the dataset.The PCP can be used to visualize not only multidimensional data but alsonon-numerical multidimensional data. Jang and Yang (1996) discussed applications of PCP especially its usefulness as a dynamic graphics tool for multivariate data analysis. In this paper the PCP is applied to Landsat TM multidimensional data. This paper follows the procedures of Lucieer (2004) and Lucieer and Kraak (2004) who adopted the PCP from Hauser et al. (2002) and Ledermann (2003) to represent the uncertainties in the spectral characteristics of pixels and their corresponding fuzzy membership. To enhance the visualization of the PCP tinteractive brushing functionality is employed. In the brushing operation, pixels of interest are selected and then highlighted. Brushing permits not only highlighting but also masking and deletion of specific pixels and spatial areas. 3. Use of PCP to explore uncertainty in sample dataIn the supervised classification process of remotely sensed imagery, the quantity of samples is a major factor affecting the accuracy of the image classification. In this section, the uncertainty in the sample data is, therefore, first explored with PCP.3.1. Data acquiredIn this paper, the Landsat TM image representing the study area was acquired on August 28, 1999, and covers an area of the Yellow River delta. The image area is at the intersection of the terrain between Dongying and Binzhou, Shandong Province. The upper left latitude and longitude coordinates of the image are 11880034.0700E and 37822024.0000N, respectively. The lower right latitude and longitude coordinates are 118810052.8300E and 37813058.1300N, respectively. The image size is 515 by 515 pixels, with each pixel having a resolution of 30 m. Fig. 1 is a pseudo-color composite image of bands 5, 4 and 3.3.2. Exploring the uncertainty in sample dataThe image includes six land cover types, namely Water, Agriculture_1, Agriculture_2 (Agriculture_1 and Agriculture_2 are different crops), Urban, Bottomland (the channel of the Yellow River), and Bareground. Sample data are selected from the region of interest (see Fig. 2), and represents a total of 26,639 pixels.The Parbat software developed by Lucieer (2004) and Lucieer and Kraak (2004) is used to produce the PCP. 
The PCP depicts the multidimensional characteristics of pixels in the remote sensing image througha set of parallel axes and polylines in a twodimensional plane (Fig. 3). It is noticeable that there is a clear representation of sample data from different land cover types, as shown by clustering of spectral signatures, and the dispersion and overlapping of spectral signatures from different land cover types. The digital numbers (DNs) of all pixels in the land cover type,Fig. 1. Pseudo-color composition image of the study area.Fig. 2. Sample data in the region of interestBottomland, are very concentrated within a narrow band. The range of DNs of pixels in the Water class is also narrow, except for band 3. The land cover types of Water and Bottomland in Fig. 3 can be easily distinguished from the other land covers in bands 5 and 7. Further differentiation is provided in band 3. The radiation responses for Agriculture_1 and Agriculture_2 have close similarity in bands 1, 2, 3, 6 and 7, with a degree of overlap in bands 4 and 5. There is an almost perfect positive correlation between bands 1 and 2 for all categories. This occurrence presents difficulties in clearly differentiating pixels for Agriculture_1 and Agriculture_2. Hence, it is evident that there is uncertainty in differentiating pixels for the land cover types Agriculture_1 and Agriculture_2.4. Classification and measurement of uncertainty in classified remote sensing images 4.1. Uncertainties arising from the classification processThe supervised classifiers, MLC and FCM, are applied to the image, with the condition that no pixel is assigned to a null class. The classified images are shown in Fig. 4a and b.Comparison of Fig. 4a and b reveals that the classified results from MLC and FCM are not identical for the data pertaining to the same region of interest (ROI).Fig. 3. PCP of sample data.The difference images between these two classified results are presented in Fig. 5a and b, with Fig. 5a showing the classified results for MLC. The classified results of FCMfor the difference pixels are illustrated in Fig. 5b. The number of difference pixels total 16,416. These difference pixels are distributed mainly on the banks of the river, and in mixed areas of Bareground, Agriculture_1, and Agriculture_2. For the MLC classified result, 57.1% of the difference pixels are classified as Agriculture_1, and 36.7% are classified as Agriculture_2. The FCM classified results, however, demonstrate that 90.5% of difference pixels are Bareground while 9.3% are Agriculture_2.The number of pixels in the ROI and in each of the classification categories from MLC and FCM are illustrated in Fig. 6. Evidently, there is a significant difference in the number of pixels for Agriculture_1, Agriculture_2, and Bareground, while the number of pixels for Water and Bottomland are very similar. This is also demonstrated in Fig.Fig. 4. (a) Classification result from MLC; (b) classification result from fuzzy CmeansFig. 5. Spatial distribution of difference pixels between MLC and FCM: (a) difference pixels in the classified result from MLC; (b) difference pixels in the classified result from FCM.Fig. 6. Comparison of numbers of pixels in the ROI; each category from MLC and FCM.5a and b. Based on the classified results from MLC and FCM it is possible to claim that there are relatively high uncertainties in identifying Agriculture_1, Agriculture_2 and Bareground. There are, however, lower uncertainties in the identification of Water and Bareground.4.2. Measurement4.2.1. 
Probability/fuzzy membershipFCM produces a fuzzy membership vector for each pixel. This fuzzy membership can be taken as the area proportion within a pixel (Bastin et al., 2002). It is possible for pixels to have the same class type, but their posterior probabilities or fuzzy memberships could be different. Hence, the probability vector or fuzzy membership is normally used as a measure for uncertainty on a pixel scale. The posterior probability and fuzzy membership of Water and Agriculture_2 are illustrated in Figs. 7 and 8, respectively. The uncertainty in spatial distributions can be clearly observed in Figs. 7 and 8. In Fig. 7, pixels belonging to the Water class have larger posterior probability or fuzzy membership, thereby indicating that these pixels have smaller uncertainties. While there are insignificant variations between Fig. 7a and b there are, however, noticeable differences between Fig.8a and b, especially for the class,Agriculture_2.Fig. 7. Water class: (a) posterior probability from MLC; (b) fuzzy membership fromFCM. 4.2.2. Shannon entropy [S.E.]Entropy is a measure of uncertainty and information formulated in terms of probability theory, which expresses the relative support associated with mutually exclusive alternative classes (Foody, 1996; Brown et al., 2009). Shannon ’s entropy (Shannon, 1948), applied to the measurement of the quality of a remote sensing image, is defined as the required amount of information that determines a pixel completely belonging to a category and expresses the overall uncertainty information of the probability or membership vector, and all elements in the probability vector are used in the calculation (Maselli et al., 1994). Therefore, it is an appropriate method to measure the uncertainty for each pixel in classified remotely sensed images (van der Wel et al., 1998). The use of S.E. in this paper is represented by considering the following. Given U as the universe of discourse in the remotely sensed imagery, U contains all pixels in this image and is partitioned by {X1, X2, . . ., Xn} where n is the number of classes. The probability of each partition is denoted as pi = P(Xi) giving the S.E. as()1log 2n i ii H X p p ==∑ (1) where H(X) is the information entropy of the information source. When pi = 0, theequation becomes 0 log 0 = 0 (Zhang et al., 2001; Liang and Li, 2005). It is accepted that: 0 _ H(X) _ log n.On the basis of Bayesian decision rules, MLC determines the class type of every pixel according to its maximum posterior probability in a probability vector. The classification process could, therefore, be associated with uncertainty. S.E. derived from a probability vector or fuzzy membership vector can represent the variation in the posterior probability vector and can be taken as one of the measures on a pixel scale (van der Wel et al., 1998). Similarly, applying FCM to remotely sensed data can produce fuzzy membership values in all land cover classes. Hence, fuzzy membership can also be considered as a measure of uncertainty on a pixel scale. On the basis of the fuzzy membership of each pixel the corresponding S.E. can be calculated. To compare the MLC and FCM methods it is necessary to normalize the computed S.E. values.Fig. 8. Agriculture_2 class: (a) posterior Fig. 9.probability from MLC; (b) fuzzymembership from FCM.From Eq. (1) S.E. can be calculated through posterior probability and fuzzy memberships. Fig. 9a and b displays normalized S.E. values of classified pixels from MLC and FCM, respectively. 
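Equation (1) appears garbled in this copy of the text; reconstructed from the surrounding description (the entropy of the probability or membership vector p = (p1, ..., pn), with 0·log 0 taken as 0), the intended formula is the standard Shannon entropy, written here in LaTeX:

H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i \qquad (1)

with 0 \le H(X) \le \log_2 n, the maximum being reached when all classes are equally probable.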
When the grey value of a pixel is zero then the uncertainty is zero, and when the grey value is 255 then the uncertainty will be at the maximum value of 1. From Fig. 9a and b it can be observed that the classes of Water and Bottomland have lower uncertainties while the classes of Bareground, Agriculture_1 and Agriculture_2 have higher uncertainties. These results emphasize that S.E. values calculated from MLC are comparatively higher than those obtained from FCM. Obviously, FCM produces more information about the end members within a pixel than MLC.4.2.3. Degree of uncertaintyWhile S.E. provides information on the uncertainties of pixels it is, however, known that when there is a large range of greyscale values representing brightness values [0,255] the subtle differences in greyscales are not easily discernible to humans. Hence, there are difficulties in differentiating the degree of uncertainty. To overcome this problem the S.E. is, therefore, discretized equidistantly. For instance, the S.E. is discretized into the following intervals: 0.00, (0.00, 0.20], (0.20, 0.40], (0.40, 0.60], (0.60, 0.80], (0.80, 1.00]. Measurements falling into the same interval have the same degree of uncertainty. By assigning a color to each degree of uncertainty, a pixel-based uncertainty visualization is produced (see Fig. 10a and b). This discretization clearly highlights the degrees of uncertainty in the classified remotely sensing image. In Fig. 10a, representing MLC, the degrees of uncertainty for most of the pixels are 0, 1 and 2 while in Fig. 10b, associated with FCM, the degrees of uncertainty for most of the pixels are 1, 2 and 3 and occasionally 4. It is worthwhile to note that the interval map of S.E. permits a comparison of the degrees of uncertainty in classified results from different classifiers.5. PCP and brushingThe PCP is useful for visually exploring the degree of dispersion or aggregation of the DN values of pixels in each band, and can be conveniently used to investigate the reasons contributing to uncertainty. With the brushing operation, the pixels of interest could be selected and highlighted.5.1. PCPFrom the MLC results two sets of pixels have been, respectively, randomly selected from the classes of Water and Agriculture_2 tobe represented on a PCP. The Parbat software (Lucieer, 2004;Fig. 10. Degree of uncertainty: (a) derived from MLC; (b) derived from FCM.Lucieer and Kraak, 2004) is used to produce a PCP of Water (see Fig. 11) and Agriculture_2 (see Fig. 12) classified by MLC. The first six axes are posterior probabilities of the six land cover types and the last two axes are the S.E. and degree of uncertainty. For instance, Fig. 12 illustrates the uncertainties of the pixels selected from the class type of Agriculture_2 and their distributions. The posterior probability of Agriculture_2 to each pixel is relatively higher than other categories. Of significance, there is negative correlation between Agriculture_1 and Agriculture_2. In this example, the S.E. is equally divided into five intervals to obtain the degree of uncertainty. The uncertainties of the pixels selected from Water and Agriculture_2, respectively, and their distributions are illustrated by Figs. 11 and 12.Figs. 13 and 14 are obtained by placing all the DNs of these pixels, fuzzy memberships, S.E., and associated degree of uncertainty as attribute dimensions on the horizontal axis of the PCP. Similarly, pixels are randomly selected to be represented on the PCP, and the S.E. 
is divided into five equal intervals to obtain the degree of uncertainty. Fig. 13 highlights the uncertainty characteristics associated with the spectralsignatures of pixels from the Water class while Fig. 14 shows the spectral signatures of pixels from the Agriculture_2 class and their related uncertainty characteristics. B.1–B.7 and C.1–C.6 denote the seven bands of the TM image and their fuzzy memberships in the six land cover types, which are Water, Agriculture_1, Agriculture_2, Urban, Bottomland, and Bareground, respectively. The S.E. from the FCM classifier and the degree of uncertainty within the range 0–5 are denoted by ShE and UnL. The red line within bands 1–7 emphasize pixels with a degree of uncertainty of zero. Different colors denote different degrees of uncertainty in the PCP. From the PCP it is possible to discern the distribution of fuzzy memberships and Shannon entropies of pixels with different degrees of uncertainty.A comparison of Figs. 11 and 13 reveals that the MLC posterior probabilities results on classified pixels in the Water class have a non-zero value which are concentrated in the Bareground class. For the FCM (see Fig. 13) most of the classified pixels in the Water class have a fuzzy membership closer to 1. The fuzzy memberships of some pixels in the Bareground and Agriculture_1 and 2 classes are away from zero. It is, therefore, apparent that pixels with a highdegree of uncertainty are mixed pixels in the Water and Bareground classes, namely the boundary pixels between Water and Bareground. As expected, the spectral response for these pixels contains characteristics of both land cover types.Fig. 13 provides additional information on the spectral characteristics of Water and its uncertainty distribution. For instance, the degrees of dispersion in bands 2, 3, 4 and 5 are high, and the distance to the red line in these four bands is also relatively high. As such, their uncertainties are high. Pixels with a degree of uncertainty of 3, represented by the blue line, are relatively far The fuzzy memberships of pixels for Bareground are in the range 0.2–0.4 thereby giving rise to a high degree of uncertainty.from MLC.Fig. 13. Spectral features of the Water class and their uncertainty characteristics.Pixels with a degree of uncertainty of 4, represented by the pink line, are relatively far from those pixels with a degree of uncertainty of zero, as shown by the red line in band 4. There are two independent distribution ranges in band 3. In the case of the Water class pixels with a high degree of uncertainty, assuming their fuzzy membership on Bottomland is high, then the DNs in band 3 will be greater than the DNs of pixels with a degree ofuncertainty of zero.Fig. 14. Spectral features of pixels from the Agriculture_2 class and their uncertainty characteristicsFig. 15. (a) Class of Water: pixels with a low degree of uncertainty in the PCP; (b) class of Water: pixels with a high degree of uncertainty degree in the PCP.A comparison of Figs. 12 and 14 reveals that the classes of Agriculture_1 and Bareground have the highest uncertainty in Agriculture_2. The difference between the PCPs from MLC and FCMis because the influence of Agriculture_1 on Agriculture_2 in MLC is greater than the influence of Bareground on Agriculture_ 2. For FCM the two influences on Agriculture_2 are almost similar. When the fuzzy memberships of pixels for Agriculture_1 are greater, their DNs in bands 4 and 5 are less than that of pixels with a degree of uncertainty of zero. 
The DNs in all bands are dispersed for pixels with large fuzzy membership for Bareground.5.2. BrushingFrom Figs. 11 and 14 it is demonstrated that when different colors are used to represent different degrees of uncertainty, a certain amount of overlap develops between color lines, especially on the axes where the distribution of polylines is concentrated. The superposition of polylines definitely increases the difficulty in visual uncertainty analysis.A new ‘‘visual’’uncertainty is, thereby, introduced. To improve on this visualization the brushing operation is introduced to the PCP (Hauser et al., 2002; Ledermann, 2003). The difference from that of a conventional approach is that the user selects pixels of interest, and then highlights them with a brush, instead of using colored polylines. This is a suitable and convenient method for conducting targeted analysis on the spectral characteristics of pixels and their associated uncertainty.Brushing is applied to a set of pixels with a low degree of uncertainty (see Fig. 15a) and a set of pixels with a high degree of uncertainty (see Fig. 15b). The pixels targeted for investigation belong to the class type,Water.Agreen polyline denotes the pixels being brushed, while a grey polyline represents the distribution characteristics of all pixels belonging to the class type, Water. The red line for bands 1–7 represents pixels with a zero degree of uncertainty. In Fig. 15a, there is a strong negative correlation between C.5 (Bottomland) and C.6 (Bareground). This means thatthe larger the memberships of pixels in the class of Bottomland, the smaller the memberships of pixels in the class of Bareground. To further investigate the class type, Agriculture_2, PCP and brushing are used to visualize pixels with low uncertainty (seeFig. 16a) and pixels with high uncertainty (see Fig. 16b). From the Figures it becomes noticeable that when PCP is combined with brushing the user could focus on the spectral characteristics and the uncertainty distribution of pixels of interest. This effectively reduces the uncertainty introduced by the visualization of the remotely sensed data.Fig. 16. (a) Class of Agriculture_2: pixels with a low degree of uncertainty in the PCP; (b) class of Agriculture_2: pixels with a high degree of uncertainty in the PCP.6. Discussion and conclusionA major unresolved problem in image processing is how to identify and visualize the uncertainty arising out of the classification of remotely sensed data. Without doubt, targeting uncertainties will not only permit better visualization of features in geographical space but also enhance the capabilities of policymakers who have to make reliabledecisions on a broad range of geospatial issues. This paper has demonstrated the effectiveness of the combined PCP and brushing operation to explore and visualize the uncertainties in remotely sensed data classified with MLC and FCM.The MLC and FCM results demonstrate that water class pixels, with a high degree of uncertainty, have high posterior probability or fuzzy membership for the type Bottomland. Furthermore, some Agriculture_2 class pixels, with a high degree of uncertainty, have high fuzzy membership for the Bareground class type. Therefore, it is possible to compare the pixel distribution for Water and Bottomland or Agriculture_2 and Bareground, and further analyze the possible reasons causing the uncertainty by investigating the spectral characteristics for all bands. 
6. Discussion and conclusion

A major unresolved problem in image processing is how to identify and visualize the uncertainty arising out of the classification of remotely sensed data. Without doubt, targeting uncertainties will not only permit better visualization of features in geographical space but also enhance the capabilities of policymakers who have to make reliable decisions on a broad range of geospatial issues. This paper has demonstrated the effectiveness of the combined PCP and brushing operation to explore and visualize the uncertainties in remotely sensed data classified with MLC and FCM.

The MLC and FCM results demonstrate that Water class pixels with a high degree of uncertainty have a high posterior probability or fuzzy membership for the type Bottomland. Furthermore, some Agriculture_2 class pixels with a high degree of uncertainty have a high fuzzy membership for the Bareground class type. Therefore, it is possible to compare the pixel distribution for Water and Bottomland or Agriculture_2 and Bareground, and further analyze the possible causes of the uncertainty by investigating the spectral characteristics of all bands.

Essentially, it is necessary to use the probability vector and fuzzy membership vector of each pixel to compute the S.E. The degree of uncertainty of each pixel can then be represented on a PCP. As illustrated in this paper, two axes on the PCP represent the Shannon entropy and the degree of uncertainty. The PCP technique is also advantageous for highlighting the distribution of probability values of the different land covers of each pixel, and it also reflects the status of pixels with different degrees of uncertainty. Moreover, a PCP can be produced for the spectral characteristics of sample data and the uncertainty attributes of classified data. The class type of the sample data can be included in the PCP to evaluate the quality of the data. In addition, the sample data can then be compared to the classified data to evaluate whether the sample data are a reasonable reflection of the spectral characteristics of all bands. The identification of any dissimilarities or uncertainties is a definite indication of improvement in the visualization process.

This paper demonstrates that there could be enhancements in PCP visualization with the addition of the brushing operation. Instead of using colored polylines, as done with previous approaches, brushing permits the user to select pixels of interest and highlight them. Evidently, brushing facilitates targeted analysis of the spectral characteristics of pixels and any associated uncertainty. It could, therefore, be concluded that the integration of the PCP with the brushing operation is beneficial not only for visualizing uncertainty but also for gaining insights into the spectral characteristics and attribute information of pixels of interest. By interacting with the PCP through the brushing operation it is possible to conduct an exploration of uncertainty, even at the sub-pixel level.

Acknowledgements

This research received partial support from the National Natural Science Foundation of China (Grant No. 40671136) and the National High Technology Research and Development Program of China (Grant No. 2006AA120106).

References

Andrienko, G., Andrienko, N., 2004. Parallel coordinates for exploring properties of subsets CMV. In: Roberts, J. (Ed.), Proceedings International Conference on Coordinated & Multiple Views in Exploratory Visualization, London, England, July 13, 2004, pp. 93–104.
Bastin, L., Fisher, P.F., Wood, J., 2002. Visualizing uncertainty in multi-spectral remotely sensed imagery. Computers & Geosciences 28, 337–350.
Brown, K.M., Foody, G.M., Atkinson, P.M., 2009. Estimating per-pixel thematic uncertainty in remote sensing classifications. International Journal of Remote Sensing 30, 209–229.
Dungan, J.L., Kao, D., Pang, A., 2002. The uncertainty visualization problem in remote sensing analysis. In: Proceedings of IEEE International Geoscience and Remote Sensing Symposium, Toronto, Canada, June 2, 2002, pp. 729–731.
Edsall, R.M., 2003. The parallel coordinate plot in action: design and use for geographic visualization. Computational Statistics and Data Analysis 43, 605–619.
Foody, G.M., Atkinson, P.M., 2002. Uncertainty in Remote Sensing and GIS. Wiley Blackwell, London.
Foody, G.M., 1996. Approaches for the production and evaluation of fuzzy land cover classifications from remotely-sensed data. International Journal of Remote Sensing 17, 1317–1340.
Gahegan, M., Takatsuka, M., Wheeler, M., and Hardisty, F., 2002. Introducing GeoVISTA Studio: an integrated suite of visualization and computational methods for

电子商务毕业论文外文翻译---构建数据挖掘在客户关系管理中的应用


英文翻译
原文题目：Building Data Mining Applications for CRM
出处：New York: McGraw-Hill Professional, 2000. Berson, Alex; Smith, Stephen; Thearling, Kurt
译文题目：构建数据挖掘在客户关系管理中的应用

介绍

在过去的几年里，公司和他们的客户之间的接触发生了戏剧性的变化。

顾客不再有过去那么高的忠诚度。

结果是,公司发现他们必须更好地了解和理解他们的客户,对于客户的要求和需求也必须更快地响应。

另外，响应的时间必须大大缩短，不能等到让你的客户等得不耐烦的时候才采取措施，那样就太晚了！为了取得成功，公司必须具有前瞻性，及早了解到你的客户需要的到底是什么。

如果现在说店主能够毫不费力地明白他们消费者的需求而且加以快速的响应,那无疑是陈词滥调。

过去的店主能够仅仅凭借自己的记忆记住他们的客户,而且当客人进来的时候知道该怎么做。

不过现在的店主无疑面临着更为严峻的情况：越来越多的消费者、越来越多的产品、越来越多的竞争对手，而且必须在比过去少得多的时间内了解消费者的需求，这无疑更为困难。

企业做了许多努力来加强与客户之间的联系。

举个例子来说:压缩市场周期。

企业对于客户的统计分析显示,客户的忠诚度在不断地下降。

而对于客户而言,忠诚两个字仿佛是很遥远的事情了。

一个成功的企业必须加强对他们客户的影响力,提供给他们持续的影响力。

另外,需求是随着时间不断变化的,你必须满足不断变化的需求。

如果你不能快速对客户的需求加以反应,你的客户会转向那些能够帮助他们的公司。

市场的成本越来越大,每一样东西的成本都似乎越来越大。

打印、邮资、特别的服务（如果你不提供这些特别的服务，你的竞争者会提供的）。消费者希望货物能够满足他们的要求，每一项都符合。

这意味着他们提供的产品数量和供货方式会急剧地增加。

建立数据挖掘应用程序

我们必须要意识到重要的一点：数据挖掘只是整个过程的一部分。

人工智能以机器学习数据挖掘深度学习为主题的参考文献


以下是人工智能以机器学习、数据挖掘、深度学习为主题的一些参考文献：
1. Alpaydin, E. (2010). Introduction to machine learning. Cambridge, MA: MIT Press.
2. Bishop, C. M. (2006). Pattern recognition and machine learning. New York, NY: Springer.
3. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. New York, NY: Springer.
4. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA: MIT Press.
5. Murphy, K. P. (2012). Machine learning: A probabilistic perspective. Cambridge, MA: MIT Press.
6. Ng, A. (n.d.). CS229: Machine learning course notes. Retrieved from /notes/
7. Russell, S. J., & Norvig, P. (2010). Artificial intelligence: A modern approach. Upper Saddle River, NJ: Prentice Hall.
8. Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. New York, NY: Cambridge University Press.
9. Witten, I. H., Frank, E., & Hall, M. A. (2016). Data mining: Practical machine learning tools and techniques. San Francisco, CA: Morgan Kaufmann.
10. Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. (2016). Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10), 1499-1503.

大数据挖掘外文翻译文献


大数据挖掘是一种通过分析和解释大规模数据集来发现实用信息和模式的过程。

它涉及到从结构化和非结构化数据中提取知识和洞察力,以支持决策制定和业务发展。

随着互联网的迅猛发展和技术的进步,大数据挖掘已经成为许多领域的关键技术,包括商业、医疗、金融和社交媒体等。

在大数据挖掘中,外文翻译文献起着重要的作用。

外文翻译文献可以提供最新的研究成果和技术发展，帮助我们了解和应用最先进的大数据挖掘算法和方法。

本文将介绍一篇与大数据挖掘相关的外文翻译文献，以帮助读者深入了解这一领域的最新发展。

标题:"A Survey of Big Data Mining Techniques for Knowledge Discovery"这篇文献是由Xiaojuan Zhu等人于2022年发表在《Expert Systems with Applications》杂志上的一篇综述文章。

该文献对大数据挖掘技术在知识发现方面的应用进行了全面的调研和总结。

以下是该文献的主要内容和贡献：

1. 引言

本文首先介绍了大数据挖掘的背景和意义。

随着互联网和传感器技术的快速发展,我们每天都会产生大量的数据。

这些数据包含了珍贵的信息和洞察力,可以用于改进业务决策和发现新的商机。

然而,由于数据量庞大和复杂性高,传统的数据挖掘技术已经无法处理这些数据。

因此,大数据挖掘成为了一种重要的技术。

2. 大数据挖掘的挑战

本文接着介绍了大数据挖掘面临的挑战。

由于数据量庞大,传统的数据挖掘算法无法有效处理大规模数据。

此外，大数据通常是非结构化的，包含各种类型的数据，如文本、图像和视频等。

因此,如何有效地从这些非结构化数据中提取实用的信息和模式也是一个挑战。

3. 大数据挖掘技术

接下来，本文介绍了一些常用的大数据挖掘技术。

这些技术包括数据预处理、特征选择、分类和聚类等。

数据预处理是指对原始数据进行清洗和转换,以提高数据质量和可用性。

特征选择是指从大量的特征中选择最实用的特征,以减少数据维度和提高模型性能。
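下面给出一个简短的示意性代码（仅作说明，假设使用scikit-learn；数据集、参数和变量名均为虚构），演示上文提到的"数据预处理"与"特征选择"两个步骤的基本做法：

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

# 构造一个虚构的数据集：1000个样本、50个特征，其中只有10个特征真正携带信息
X, y = make_classification(n_samples=1000, n_features=50,
                           n_informative=10, random_state=0)

# 数据预处理：标准化，使各特征均值为0、方差为1
X_scaled = StandardScaler().fit_transform(X)

# 特征选择：按ANOVA F值选出得分最高的10个特征，以降低数据维度
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_scaled, y)

print(X.shape, "->", X_selected.shape)   # (1000, 50) -> (1000, 10)
```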

数据挖掘论文中英文翻译


数据挖掘(Data Mining)是一种从大量数据中提取出实用信息的过程，它结合了统计学、人工智能和机器学习等领域的技术和方法。

在数据挖掘领域,研究人员通常会撰写论文来介绍新的算法、技术和应用。

这些论文通常需要进行中英文翻译,以便让更多的人能够了解和使用这些研究成果。

在进行数据挖掘论文的翻译时,需要注意以下几个方面:1. 专业术语的翻译:数据挖掘领域有不少专业术语,如聚类(Clustering)、分类(Classification)、关联规则(Association Rules)等。

在翻译时,需要确保这些术语的准确性和一致性。

可以参考相关的研究文献、术语词典或者咨询领域专家,以确保翻译的准确性。

2. 句子结构和语法的转换:中英文的句子结构和语法有所不同,因此在翻译时需要进行适当的转换。

例如,中文通常是主谓宾的结构,而英文则更注重主语和谓语的一致性。

此外,还需要注意词序、时态和语态等方面的转换。

3. 表达方式的转换:中英文的表达方式也有所不同。

在翻译时,需要根据目标读者的背景和理解能力来选择适当的表达方式。

例如,在描述算法步骤时,可以使用英文中常见的动词短语,如"take into account"、"calculate"等。

4. 文化差异的处理:中英文的文化差异也需要在翻译中予以考虑。

某些词语或者表达在中文中可能很常见,但在英文中可能不太常用或者没有对应的翻译。

在这种情况下,可以使用解释性的方式来进行翻译,或者提供相关的背景信息。

5. 校对和修改:翻译完成后,需要进行校对和修改,以确保翻译的准确性和流畅性。

可以请专业的校对人员或者其他领域专家对翻译进行审查,提出修改意见和建议。

总之,数据挖掘论文的中英文翻译需要综合考虑专业术语、句子结构、表达方式、文化差异等方面的因素。

通过准确翻译和流畅表达,可以让更多的人理解和应用这些研究成果,推动数据挖掘领域的发展。

外文文献及译文


本科毕业设计外文文献及译文
文献、资料题目：Food Handling Using Computer
文献、资料来源：
文献、资料发表（出版）日期：
院（部）：机电工程学院
专业：机械工程及自动化
班级：机械054
姓名：刘翠芹
学号：22
指导教师：董明晓教授
翻译日期：

外文文献：Food Handling and Packaging using Computer Vision and Robot

Abstract

Even though the use of robot vision systems in manufacturing sectors is now commonplace, the technology embodied in these devices is poorly matched to the industrial needs of food processors. In particular, food processing imposes special demands upon machinery. For instance, the vision sensor must be programmed to detect the position of single and isolated objects as well as overlapping or occluding objects. Special grippers have to be designed for the handling of food articles such that they have minimum contact with the food items and hence cause minimum damage to them. In this project, started over a year ago, a vision guidance system is being developed to meet this objective. The system integrates a modified version of the Hough transform algorithm as the main recognition engine. The methods and procedures were tested on commercially produced beef burgers.

1. Introduction

From the incoming lines down to the packaging lines, locating, recognizing and handling food objects are very important in the food processing industry. These tasks are performed routinely in the food industry, mainly for quality evaluation and product classification. Such tasks are very labour-intensive and tend to rely heavily on the role of the human operator [1]. Hands of workers using raw materials of animal origin can be heavily contaminated with faecal and other micro-pathogenic organisms [2]. The study by Trickett [3] has shown a strong link between food poisoning and the hygiene standards of food processors. Complete automation of food handling and packaging by means of a robotic arm is the most effective means to eliminate the influence of manual handling on the microbiological quality of foods.

Robots have successfully been applied in a wide range of food industries, primarily dealing with well-defined processes and products, not only because they are relatively clean and hygienic, but also because of their flexibility, ruggedness and repeatability. This trend will continue to grow with the increasing scrutiny and regulatory enforcements such as Hazard Analysis and Critical Control Points (HACCP), together with companies that are looking for ways to decrease or eliminate worker exposure to repetitive motion tasks and harsh environments. However, there are problems and challenges associated with the use of robots in the food industry [4]. Firstly, food products, despite being of the same type, differ in size, shape and other physical variables. This imposes special demands on machinery to handle them, requiring multiple sensory, manipulation and environmental capabilities beyond those available in robots designed to automate manufacturing tasks. Secondly, the success of applying robots as food handlers hinges upon the success of detecting, locating, recognizing and handling severely overlapping and occluding cases of similar food objects. Thirdly, food objects are often delicate and usually covered with either slippery or viscous substances, making high-speed handling of such targets a very challenging task. The existing contact-based mechanisms such as vacuum suctioning and clamp gripping are not applicable because they can potentially cause injuries and bruising to food products. Hence further research is needed in order to solve these problems.
This paper addresses some of the problems, focusing on the methods used to control the robot directlyfrom the vision sensor, attempting to simulate the way that humans use their eyes to naturally control the motion of their arms.2. Materials and MethodsSample PreparationThe chosen food for this study is a locally produced beef-burger. It possesses all the important characteristics which are unique to food products, such that they are very fragile and easily deformed. The average size of the beef-burgers is mm in thickness and mm radius and gm in weight. Surface images of test samples were acquired using 8-bit robot vision system with uniform white background. The white background provides excellent contrast between the burger and the background. The chosen exposure was adjusted sothat the image intensity histograms were approximately centered at mid-way of the full-scale range. The focal distance was selected to allow single as well as multiple samples to fit in the image frame.Robot visionThe robot vision systems used in this study is the Adept Cobra 600 4-DOF articulated scara robot, manufactured by Adept Tech., USA and equipped with Adept Vision Interface, MV-5 Adept controller and TM1001 CCD monochrome camera manufactured by Pulnix Inc., Canada. The camera was mounted onto link 2 of robot arm and illuminated using the warm white deluxe (WWX) fluorescent lighting. The camera is fitted with a C-mount adapter to permit the use of Tamron f/ 8-mm lens. The TM1001 camera is connected to the AVI card via a 12-pin Hirose type camera connector of Hirose Inc., Japan. The robot vision system was operated using Adept’s AIMS and programming libraries, running onGHz and 255 MB RAM Pentium IV PC. Figure 1 shows the set-up of robot vision system.Image ProcessingThe objective of image processing in robot vision applications is mainly to extract meaningful and accurate information from the images, endowing the robots with more sophisticated position control capabilities through the use of vision feedback. The use of a simple geometric method such as introducing specially designed cues into the image scene will not work in this application since the burger images are generally complex, difficult to model andpartially or extensively occluded depending on the viewing angle. Figure 2 shows the typical beef-burger image.In order to accurately translate burger positions to robot movements, the former geometric features must firstly be extracted and secondly matched to the robot's workspace. In this application one of the useful features which uniquely characterize the pose of a burger in arbitrary locations is its centroid. This geometric descriptor is applicable since the shape of a burger is approximately circular. Furthermore this feature preserves variance to translation, rotation and scaling. Before computing the centroid of the realburger images, several preprocessing operations need to be performed on each image.Edge detection operation is carried out to detect the contour of the connected and isolated components, thereby, effectively transforming the original data into a form suitable for further processing. The edge results of Figure 2 computed using well-known Sobel and Robert operators [5] are shown in Figures 3(a),(b),(c)&(d). From these figures it can be seen that the edges determined by these operators comprised of many false edges, discontinuities and spurious spots resulting from uneven and irregular surface of the burger, non-uniform light reflection and shadows. 
These drawbacks are not acceptable for application described in this paper.A more sophisticated method is needed in order to obtain acceptable results. The method used to solve these problems was based on Canny edge detection operator [6]. Interested readers are referred to this publicationfor detailed mathematical explanation of this relatively new edge detector. Here only the important principles are presented in order to facilitate discussion on robot vision applications on food handling. Canny method for edge detection is principally based on some general ideas.Firstly Canny was the first to demonstrate that convolving an image with a symmetric 2-D Gaussian filter and then, secondly, differentiating in the direction of the gradient form the edge magnitude image. The presence of edges in the original image gives rise to ridges in gradient magnitude image. The objective is to detect the true edge in the right place. This can be done using method known as non-maximal suppression technique. Essentially this method works by tracking along the top of the ridges, retaining only those points at the top of the ridge, whilst suppressing all others. The tracking process exhibits hysteresis controlled by two important parameters. They are the lower threshold value Tlow, and the upper threshold value Thigh. If the edge response is above Thigh, then this pixel definitely constitutes an edge and hence retained. Pixels less than Thigh but greater than greater Tlow are considered as weak edges. Finally tracking was done to bridge all discontinued edges as well as to eliminate the false edges are retained only if they are connected to the strong edge. The result of these operations is an image with thin lines of edge points with improved edge-to-noise ratio. Even though this method reduces the effect of noise, however, the overall quality of edges depends largely on the optimal selection of the standard deviation ?. which defines the Gaussian mask for Canny’s edge detection. Experimentally the optimum value was set to 3. This corresponds to a 25 X 25 kernel. This valueis fixed for given set of background illumination and image gain. Change in any of these external factors such as illumination, image gain, background colour will also affect the optimum value of ?. Figures 4(a)&(b) show results for canny edge detection with ? set to 1 and 3.Comparing Figure 3 and Figure 4, it can be seen clearly that the edges determined by Canny's operator are less corrupted compared to edges detected either by Sobel or Robert operator. The burger edges are more complete in Figure 4 whereas in Figure 3 they are only partially visible and more obscured. Furthermore the retention of major detail by the Canny operator is very evident. The presence of overlapping and partially occluding burgers are visually recognizable. Canny operator therefore has the ability to detect major features of interest in the burger image, allowing the geometric feature of the pick-and-place species of a burger be accurately determined. The algorithm for determining the pick-and-place specie is given in the following section.Centroid Detection AlgorithmOnce the edges of the burgers have been detected, the next step in image analysis is to retrieve and extract the geometric feature which uniquely defines the shape of a burger. One important criterion of this type of shape analysis and retrieval problems is that the method must be invariant totranslation, scaling and translation of images or objects. 
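As a small illustration of the Canny stage described above (this is not the authors' implementation; the image file name and the hysteresis threshold values are assumptions), scikit-image exposes the Gaussian standard deviation and the two hysteresis thresholds directly:

```python
import matplotlib.pyplot as plt
from skimage import color, feature, io

# Load a burger image and convert it to greyscale (file name is illustrative).
image = color.rgb2gray(io.imread("burger.png"))

# Canny edge detection: a large sigma (3, as reported above) smooths the
# irregular surface texture, while the low/high thresholds control hysteresis.
edges = feature.canny(image, sigma=3.0, low_threshold=0.05, high_threshold=0.2)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].imshow(image, cmap="gray"); axes[0].set_title("greyscale input")
axes[1].imshow(edges, cmap="gray"); axes[1].set_title("Canny edges (sigma = 3)")
plt.show()
```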
The use of the Hough transform seems adequate since this method achieves translation, scaling and rotation invariance by converting a global detection problem in the image space into a more easily solved local peak detection problem in the parameter space [7]. More importantly, the Hough transform allows segmentation of overlapping or semi-occluded objects, which is critical for processing burger images. However, the original Hough transform works well only if the analytic equations of the object borderlines are known and invariant. In the present context these conditions are very difficult to fulfil because the shape of a burger is not a perfect circle. This imperfection is mainly due to the non-rigid properties of the burgers, causing them to be easily deformed when pressed or when they come into contact with any rigid surface such as the conveyor belts. A straightforward application of the Hough transform will yield multiple sets of accumulated votes in the parameter space, corresponding to different shapes and sizes of the objects [7]. This may result in many false alarms. Furthermore, the ambiguity of indexing and the insufficient description of point features may result in false solutions for the recognition of overlapping or semi-occluding objects. In this work we propose a method to solve some of these problems by modifying the Hough transform, employing a recent object recognition technique based on the centroid-contour distance (CCD) curve. The CCD method is given by Wang [8].

Figure 5. CCD curve of a circle.

The basic idea of this technique can be explained using the illustration in Figure 5. It shows a point Q lying on the contour of a circle which is characterized by a centroid C and a radius R. The angle between the point Q and the centroid is given by θ. Tracing a burger contour can be considered as circling around its centroid. The tracing path in a clockwise or anticlockwise direction from a fixed starting point represents a shape contour uniquely. In other words, a contour point sequence corresponds to a shape uniquely if the starting point is fixed. Hence, for a given C, R and θ, a point Q on the contour will accurately satisfy the following criterion if the contour belongs to a perfect circle:

Q = (R cos θ, R sin θ)   (1)

Since in this case a perfect match is impossible to obtain for the reasons stated previously, a point Q is treated as a point belonging to the edge of a burger if it is bounded by maximum and minimum R values. Mathematically,

(Rmin cos θ, Rmin sin θ) ≤ Q ≤ (Rmax cos θ, Rmax sin θ)   (2)

where Rmin < R < Rmax is the range of the burger radius, as shown in Figure 6. This method works by firstly treating all edge pixels in the binarized image resulting from Canny edge detection as probable centroids of the objects. Secondly, for each centroid location, the CCD curves are traced using Eq. 2. For tracing, θ is varied between 0° < θ ≤ 360°, thereby searching for all pixels which are bounded between these two contours. If the increment value for θ in Eq. 2 is kept as 1, then the maximum possible number of pixels that would satisfy Eq. 2 as circumference points for a centroid is 360. Generally, smaller θ values above a certain limit improve the computation time of the algorithm. Every pixel in the binarised image being considered as a centroid is assumed to be the center of a circle. Next, the number of instances in which the circumference points of that circle are also edges in the binarised image is determined. If this number of matches is greater than a threshold, it implies that the centroid pixel being considered is the centroid of a burger.
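A rough sketch of this centroid-voting procedure is given below; it is an illustration only, not the authors' code, and the radii, the angular step and the vote threshold are assumed values. Every edge pixel is treated as a candidate centroid, and a vote is counted for each traced direction in which an edge pixel falls inside the annulus Rmin ≤ R ≤ Rmax.

```python
import numpy as np

def detect_centroids(edges, r_min, r_max, angle_step_deg=10, min_matches=20):
    """For every candidate centroid (an edge pixel), count the directions in
    which the annulus r_min..r_max also contains an edge pixel."""
    h, w = edges.shape
    ys, xs = np.nonzero(edges)                 # candidate centroids = edge pixels
    thetas = np.deg2rad(np.arange(0, 360, angle_step_deg))
    radii = np.arange(r_min, r_max + 1)
    candidates = []
    for cy, cx in zip(ys, xs):                 # deliberately simple (and slow)
        matches = 0
        for t in thetas:
            py = np.round(cy + radii * np.sin(t)).astype(int)
            px = np.round(cx + radii * np.cos(t)).astype(int)
            inside = (py >= 0) & (py < h) & (px >= 0) & (px < w)
            if np.any(edges[py[inside], px[inside]]):
                matches += 1                   # this direction hits the contour
        if matches >= min_matches:
            candidates.append((cy, cx, matches))
    # Prioritise detections from top-left to bottom-right, as described above.
    return sorted(candidates, key=lambda c: (c[0], c[1]))
```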
Practically, when the above algorithm is applied to burger images, the total number of matches would never be the maximum even for a correct centroid. This is because of inevitable noise, irregular light reflection and burger surface shadows. Thus the threshold value for the number of matches has to be fixed below the maximum value 36. To determine the correct value of the threshold, the algorithm was applied to a sequence of 19 burger images. The criteria for correctly identifying which species of the burger is most likely to be lifted are that that burger should be minimally overlapped or maximally exposed. As seen in Figure 2, the burger that lies on the top of the heap as well as on the side of the main pile could fulfill the above criterion, and hence also contributes as a pick-and-place species. The algorithm is applied to the centroid locations from top-left to bottom-right pixels. Thus multiple burgers that satisfy the criteria to be lifted will be prioritized from top-left to bottom-right. By following this criterion the robot will be led to pick and place only those species, thereby reducing the likelihood of damaging the overlapped specie. Figure 7 shows the number of matches of each burger centroid in a sequence of images using Eq. 2.

3. Experimental Tests and Results

The methods and procedures described in the previous sections were tested using a sequence of burger images. The objective of this experiment is to sort the burgers individually by pick-and-place operations. In so doing the robot must first examine the presence of burgers in the heap and, second, detect which species of burger is most likely to be lifted up. Prior to the experiment, the camera was calibrated for a given mounting position, enabling the robot pose with respect to the position and orientation of the burger to be accurately mapped.

Figures 8(b)-(f) show the sequence of centroids of minimally overlapped burgers revealed using the modified Hough transform, starting with detection of the 1st burger and ending with detection of the 7th burger, respectively. Only the first seven centroid locations are shown here even though the locations of a total of 19 burgers were successfully located. Each centroid location was fed into a controller which kinematically positioned and orientated the robot's end-effecter in 3D space. In each detection round the pick-and-place species of burger was removed from the heap manually. Clearly, from Figure 8, the location of a minimally overlapped or maximally exposed burger was accurately revealed in every pick-and-place cycle. No partially overlapped or occluded burgers were detected.

It can therefore be concluded that the proposed method works well for detecting minimally overlapping burgers, which is important in ensuring a correct pick-and-place sequence of the robot. However, one drawback of this technique is that it is a very computationally intensive method, requiring approximately 3-4 seconds for every result. A time-consuming yet accurate position detection algorithm may limit its applications in the food industry. Hence, special hardware for fast position detection is now being developed using a Field Programmable Gate Array (FPGA) chip. Moreover, a specially designed end-effecter is needed to meet the need for robotic handling of beef-burgers.
Clearly the use of conventional grippers is not suitable since they do not address the task of handling non-rigid materials and they can increase contamination problems of beef-burgers. In order to solve these problems a novel non-contact end effecter employing pneumatic levitation technique [9] is now being investigated in our laboratory.4. ConclusionTechnological development in robotics has the potential to minimize the contamination risk in food handling and packaging. The successful application of this relatively new technology in food industry, however, requires compliant with several processing parameters, namely recognition ofoverlapping and touching objects. In this paper we have implemented arelatively simple but effective recognition of overlapping and touchingobjects for use in robot positioning and guidance. By using global image descriptor together with the advanced image processing, the typicalproblematic steps of extraction and matching of geometric features are eliminated, making it possible to accurately position the robot arm even under severely overlapping cases.The algorithm resulted form this study in modified Hough transform. This algorithm was tested for detection of beef-burgers, and it was discovered that, the system is particularly robust, converging to the desired pose corresponding to a minimally overlapped burger or maximally exposed burgerfrom initial pose over the entire work space. The algorithm has a very good accuracy in detecting the food objects with more than 10% overlapping or occlusion. Further extensions of this work include improvement of robot end-effecter and the kinematics of control scheme to provide continuous motion for applications requiring dynamic tracking.5. AcknowledgementsThis work is supported by Malaysia Intensified Research in Priority Areas grant IRPA 6012602.6. References[1] KHODABANDEHLOO, K., CLARKE, . 1993. Robotics in meat. Fish and Poultry Processing. Chapman and Hall, London.[2] DE-WIT, . 1995. The importance of hand hygiene in contamination of foods. Antonie Van Leeuwenhuek 51, 523-527.[3] TRICKETT, J. 1992. Food hygiene for food handlers. Macmillan, Basingstoke, UK.[4] LEGG, B. 1993. Hi-tech agricultural engineering - a contradiction in terms of the way forward. Mechanical Incorporated Engineer, August, 86-90.[5] GONZALEZ, . and WOODS, . 2002. Digital image processing. Prentice Hall, USA.[6] CANNY, . 1986. A computational approach to edge detection. IEEE Transactions Pattern Recognition and Machine Intelligence, 8(6), 679-698. [7] ILLINGWORTH, J. and KITTLER, J. 1988. A survey of the Hough transform. Computer Vis., Graphics and Image Process., 44, 87-116.[8] WANG, Z., CHI, Z. and FENG, D. 2003. Shape based leaf retrieval. IEE Proceedings Vis. Image Signal Process., 150(1), 34-42.[9] ERZINCANLI, F. SHARP, J. and ERHAL, S. 1997. Design and Operational Considerations of a Non-contact Robotic Handling System for Non-rigid Materials. Int. J. Mech. Tools Manufact., 38, 353-361.中文译文:利用计算机视觉和机器人进行食品处理和包装1 摘要虽然在制造业使用机器人视觉系统已经相当普及,然而这一技术在机械设备的应用很难符合食品加工的工业需要.特别是,食品加工对机械有特殊要求. 例如视觉传感器必须程序化的检测单一孤立的对象以及重叠或遮挡物体的位置。

文献翻译


SO MAD:SensOr Mining for AnomalyDetection in railway dataJulien Rabatel12,Sandra Bringay13,and Pascal Poncelet11LIRMM,Universit´e Montpellier2,CNRS161rue Ada,34392Montpellier Cedex5,France2Fatronik France Tecnalia,Cap Omega,Rond-point Benjamin Franklin-CS3952134960Montpellier,France3Dpt MIAp,Universit´e Montpellier3,Route de Mende 34199Montpellier Cedex5,France{rabatel,bringay,poncelet}@lirmm.frAbstract.Today,many industrial companies must face problems raisedby maintenance.In particular,the anomaly detection problem is prob-ably one of the most challenging.In this paper we focus on the railwaymaintenance task and propose to automatically detect anomalies in orderto predict in advance potential failures.Wefirst address the problem ofcharacterizing normal behavior.In order to extract interesting patterns,we have developed a method to take into account the contextual criteriaassociated to railway data(itinerary,weather conditions,etc.).We thenmeasure the compliance of new data,according to extracted knowledge,and provide information about the seriousness and possible causes of adetected anomaly.1IntroductionToday,many industrial companies must face problems raised by maintenance. Among them,the anomaly detection problem is probably one of the most chal-lenging.In this paper we focus on the railway maintenance problem and propose to automatically detect anomalies in order to predict in advance potential ually data is available through sensors and provides us with important information such as temperatures,accelerations,velocity,etc.Nevertheless,data collected by sensors are difficult to exploit for several reasons.First because,a very large amount of data usually available at a rapid rate must be managed. Second,they contain a very large amount of data to provide a relevant descrip-tion of the observed behaviors.Furthermore,they contain many errors:sensor data are very noisy and sensors themselves can become defective.Finally,when considering data transmission,very often lots of information are missing.Recently,the problem of extracting knowledge from sensor data have been addressed by the data mining community.Different approaches focusing either on the data representation(e.g.,sensors clustering[1],discretization[2])or knowl-edge extraction(e.g.,association rules[2],[3],[4],[5],sequential patterns[6],[7], [8])were proposed.Nevertheless,they usually do not consider that contextual2information could improve the quality of the extracted knowledge.The develop-ment of new algorithms and softwares is required to go beyond the limitations. 
We propose a new method involving data mining techniques to help the de-tection of breakdowns in the context of railway maintenance.First,we extract from sensor data useful information about the behaviors of trains and then we characterize normal behaviors.Second,we use the previous characterization to determine if a new behavior of a train is normal or not.We are thus able to automatically trigger some alarms when predicting that a problem may occur.Normal behavior strongly depends on the context.For example,a very low ambient temperature will affect a train behavior.Similarly,each itinerary with its own characteristics(slopes,turns,etc..)influences a journey.Consequently it is essential,in order to characterize the behavior of trains as well as to detect anomalies,to consider the surrounding context.We have combined these ele-ments with data mining techniques.Moreover,our goal is not only to design a system for detecting anomalies in train behavior,but also to provide information on the seriousness and possible causes of a deviation.This paper is organized as follows.Section2describes the data representation in the context of train maintenance.Section3shows the characterization of normal behaviors by discovering sequential patterns.Experiments conducted with a real dataset are described in Section5.Finally,we conclude in Section6. 2Data RepresentationIn this section,we address the problem of representing data.From raw data collected by sensors,we design a representation suitable for data mining tasks.2.1Sensor Data for Train MaintenanceThe data resulting from sensors for train maintenance is complex for the two following reasons:(i)very often errors and noisy values pervades the experimen-tal data;(ii)multisource information must be handled at the same time.For instance,in train maintenance following data must be considered.Sensors.Each sensor describes one property of the global behavior of a train which can correspond to different information(e.g.,temperature,velocity, acceleration).Measurements.They stand for numerical values recorded by the sensors and could be very noisy for different reasons such as failures,data transfer,etc.Readings.They are defined as the set of values measured by all the sensors at a given date.The information carried out by a reading could be considered as the state of the global behavior observed at the given moment.Due to the data transfer some errors may occur and then readings can become incomplete or even missing.We consider that the handled data are such as those described in Table1, where a reading for a given date(first column)is described by sensor mea-sures(cells of other columns).3TIME Sensor 1Sensor 2Sensor 3...2008/03/2706:36:3901616...2008/03/2706:41:3982.51616...2008/03/2706:46:38135.61921...2008/03/2706:51:381052225...Table 1.Extract from raw data resulting from sensors.2.2Granularity in Railway DataData collected from a train constitutes a list of readings describing its behavior over time.As such a representation is not appropriate to extract useful knowl-edge,we decompose the list of readings at different levels of granularity and then we consider the three following concepts journeys,episodes and episode fragments which are defined as follows.Journey.The definition of a journey is linked to the railway context.For a train,a journey stands for the list of readings collected during the time interval between the departure and the ually,a journey is several hours long and has some interruptions when the train stops in railway stations.We 
consider the decomposition into journeys as the coarsest granularity of railway data.Let minDuration a minimum duration threshold,maxStop a maximum stop duration,and J be a list of readings (r m ,...,r i ,...r n ),where r i is the reading collected at time i .J is a journey if:1.(n −m )>minDuration ,2. (r u ,...,r v ,...r w )⊆J | (w −u )>maxStop,and ∀v ∈[u,w ],velocity (v )=0.Episode.The main issue for characterizing train behavior is to compare ele-ments which are similar.However,as trains can have different routes the notion of journey is not sufficient (for instance,between two different journeys,we could have different number of stops as well as a different delay between two railway stations).That is the reason why we segment the journeys into episodes to get a finer level of granularity.To obtain the episodes,we rely on the stops of a train (easily recognizable considering the train velocity).An episode is defined as a list of readings (r m ,...r i ,...,r n )such as:–velocity (m )=0and velocity (n )=01,–if m <i <n ,velocity (i )=0.1Here,the velocity of the train at time t is denoted as velocity (t ).4Fig.1.Segmentation of a journey into episodes.Figure1describes a segmentation of a journey into episodes by considering the velocity changes.This level of granularity is considered as the most relevant because it provides us with a set of homogeneous data.However,we can segment episodes in order to obtain a more detailed representation and afiner granularity level.Episode Fragment.The level of granularity corresponding to the fragments is based on the fact that the behavior of a train during an episode can easily be divided in three chronological steps.First,the train is stationary(i.e.,velocity0) then an acceleration begins.We call this step the starting step.More formally, let E=(r m,...,r n)be an episode.The startingfragment E starting=(r m,...r k) of this episode is a list of readings such as:∀i,j∈[m,k],i<j⇔velocity(i)<velocity(j).At the end of an episode,the train begins a deceleration ending with a stop. 
This is the ending step.More formally,let E=(r m,...,r n)be an episode.The endingfragment E ending=(r k,...r n)of this episode is a list of readings such as:∀i,j∈[k,n],i<j⇔velocity(i)>velocity(j).The traveling fragment is defined as the sublist of a given episode between the starting fragment and the ending fragment.During this fragment,there are accelerations or decelerations,but no stop.More formally,let E be an episode, E starting its starting fragment,and E ending its ending fragment.Then,the trav-eling fragment of E,denoted as E traveling,is a list of readings defined as:E traveling=E−E starting−E ending.Figure1shows the segmentation of an episode into three fragments:the starting fragment,the traveling fragment and the ending fragment.5 From now we thus consider that all the sensor data are stored in a database, containing all information about the different granularity levels.For example, all the sensor readings composing the fragment shown in Figure1are indexed and we know that a particular fragment f is an ending fragment included in an episode e,belonging to the journey J.J is associated with the itinerary I and the index of e in I is2(i.e.,the second portion of this route).3Normal Behavior CharacterizationIn this section,we focus on the data mining step in the knowledge discovery process and more precisely on the extraction of patterns characterizing normal behaviors.3.1How to Extract Normal Behavior?The objective of the behavior characterization is,from a database of sensor measurements,to provide a list of patterns depicting normal behavior.We want to answer the following question:which patterns often appear in the data?Such a problem,also known as pattern mining,has been extensively addressed by the data mining community in the last decade.Among all data mining methods,we can cite the sequential patterns mining problem.The sequential patterns were introduced in[9]and can be considered as an extension of the concept of association rule[10]by handling timestamps associated to items.The research for sequential patterns is to extract sets of items commonly associated over time.In the“basket market”concern,a se-quential pattern can be for example:“40%of the customers buy a television, then buy later on a DVD player”.In the following we give an overview of the sequential pattern mining problem.Given a set of distinct attributes,an item,denoted as i,is an attribute.An itemset,denoted as I,is an unordered collection of items(i1i2...i m).A sequence, denoted as s,is an ordered list of itemsets I1I2...I k .A sequence database, denoted as DB,is generally a large set of sequences.Given two sequences s= I1I2...I m and s = I 1I 2...I n ,if there exist integers1≤i1<i2<...<im≤nsuch that I1⊆I i1,I2⊆I i2,...,I m⊆I im,then the sequence s is a subsequenceof the sequence s ,denoted as s s .The support of a sequence is defined as the fraction of total sequences in DB that support this sequence.If a sequence s is not a subsequence of any other sequences,then we say that s is maximal.A sequence is said to be frequent if its support is at greater than or equal to a threshold minimum support(minSupp)specified by the user.The sequential patterns mining problem is,for a given threshold minSupp and a sequence database DB,tofind all maximum frequent sequences.6Sequential Patterns and Sensor Data.The discovery of sequential patterns in sensor data in the context of train maintenance requires choosing a data format adapted to the concepts of sequence,itemsets and items defined earlier.So from now we consider 
a sequence as a list of readings,an itemset as a reading,and an item as the state of a sensor.The order of itemsets in a sequence is given by the timestamps associated to each reading.Items are Si vt,where Si is a sensor and vt is the value measured by the sensor at time t.For example,data described in Table1are translated into the following sequence:(S10S216S316)(S182.5S216S316)(S1135.6S219S321)(S1105S222S325) .In addition,we use generalized sequences and time constraints([11],[12]). More precisely,a time constraint called maxGap is set in order to limit the time between two consecutive itemsets in a frequent sequence.For instance,if maxGap is set to15minutes,the sequence (S1low)(S2low,S3high) means that the state described by the second itemset occurs at most15minutes after thefirst one.A sequence corresponds to a list of sensor readings.So,a sequence database can be created,where a sequence is a journey,an episode or an episode fragment, depending on the chosen level of granularity(see Section2).3.2Contextualized CharacterizationEnvironmental Dimensions Structural Dimensionsid Duration Exterior Temperature Route Indexe1low high J1E1e2low low J1E2e3high high J2E1e4low low J1E1e5high low J1E2Table2.Episodes and contextual information.Structural and Environmental CriteriaWith the data mining techniques described in the previous section we are able to extract patterns describing a set of episodes which currently occurs together. However,they are not sufficient to accurately characterize train behaviors.In-deed,the behavior of a train during a trip depends on contextual2criteria. Among these criteria,we can distinguish the two following categories:2The notion of context stands for the information describing the circumstances in which a train is traveling.This information is different from behavioral data that describe the state of a train.7–Structural criteria,providing information on the journey structure of the studied episode(route3,episode index in the route).–Environmental criteria,providing information on the contextual environ-ment(weather conditions,travel characteristics,etc.).Example1Table2presents a set of episodes,identified by the id column.In this example,the sensor data were segmented by selecting the level of granularity corresponding to episodes(see Section2).Each episode is associated with envi-ronmental criteria(the duration of the episode and the exterior temperature)and structural criteria(the global route,and the index of the episode in this route). 
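To make the preprocessing described in Sections 2.2 and 3.1 concrete, the sketch below segments a list of readings into episodes using the zero-velocity criterion and then encodes each reading as an itemset of symbolic items in the spirit of the Si_vt items above. It is an illustration under assumed data structures, not the authors' implementation; the sensor names and the discretisation thresholds are invented for the example.

```python
def split_into_episodes(readings):
    """An episode is a maximal run of readings with non-zero velocity,
    delimited by readings where the train is stopped (velocity == 0)."""
    episodes, current = [], []
    for r in readings:   # r: dict such as {"velocity": 82.5, "S1": 135.6, "S2": 19, "S3": 21}
        if r["velocity"] == 0:
            if current:
                episodes.append(current)
                current = []
        else:
            current.append(r)
    if current:
        episodes.append(current)
    return episodes

def discretise(value, low=20.0, high=60.0):
    """Toy discretisation of a sensor measurement into a symbolic state."""
    return "low" if value < low else ("medium" if value < high else "high")

def reading_to_itemset(reading, sensors=("S1", "S2", "S3")):
    """Encode one reading as an itemset of 'Si_state' items."""
    return {f"{s}_{discretise(reading[s])}" for s in sensors}

def episode_to_sequence(episode):
    """A sequence is the ordered list of itemsets of an episode."""
    return [reading_to_itemset(r) for r in episode]
```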
For example,the duration of the episode e2is short,and this episode was done with a low exterior temperature.In addition,e2is part of the itinerary denoted by J1,and is the second portion of J1.Let us consider a more formal description of the context of an episode.Each episode is described in a set of n dimensions,denoted by D.There are two subsets D E and D S of D,such as:–D E is the set of environmental dimensions.In Table2,there are two envi-ronmental dimensions:Duration and Exterior Temperature.–D S is the set of structural dimensions,i.e.,the Route dimension whose value is the route of the episode,and the dimension Index for the index episode in the overall route.Data Mining and classesNow we present how the influence of these criteria on railway behaviors are handled in order to extract knowledge.The general principle is the following: (i)we divide the data into classes according to criteria listed above and(ii) we extract frequent sequences in these classes in order to get contextualized patterns.Based on the general principle of contextualization,let us see how classes are constructed.Let c be a class,defined in a subset of D,denoted by D C.A classc is denoted by[c D1,...,c Di,...,c Dk],where c Diis the value of c for the dimen-sion D i,D i∈D C.We use a joker value,denoted by∗,which can substitute any value on each dimension in D C.In other words,∀A∈D,∀a∈Dim(A),{a}⊂∗.Thus,an episode e belongs to a class c if the restriction of e on D C4is included in c:∀A∈D C,e A⊆c A.3Here,a route is different from a journey.The route of a journey is the itinerary fol-lowed by the train during this journey.Therefore,several journeys may be associated with a single route(e.g.,Paris-Montpellier).4The restriction of e in DCis the description of e,limited to the dimensions of D C.8As the environmental and structural criteria are semantically very different, we have identified two structures to represent them:a lattice for environmental classes and a tree for structural classes.We now explain how these structures are defined.Environmental Lattice.The set of environmental classes can be represented in a multidimensional space containing all the combinations of different environ-mental criteria as well as their possible values.Environmental classes are defined in the set of environmental dimensions denoted by D E.Class Duration Exterior Temperature[*,*]**[low,*]low*[high,*]high*[low,high]low high[*,high]*high.........Fig.2.Extract from environmentalclasses.Fig.3.Environmental Lattice. Example2Figure2shows some of the environmental classes corresponding to the dataset presented in Table2.A class c is denoted by[c ExtT,c Dur],where ExtT stands for the dimension Exterior Temperature and Dur for the dimension Duration.For example,the class denoted by[low,∗]is equivalent to the context where the temperature is low (i.e.,c ExtT=low),for any duration(i.e.,c Dur=∗).Using the dataset of Table2,we can see that the set of episodes belonging to the class[low,∗]is{e1,e2,e4}.Similarly,all the episodes belonging to the class [low,high]is{e1}.Environmental classes and their relationships can be represented as a lattice. Nevertheless wefirst have to define a generalization/specialization order on the set of environmental classes.Definition1Let c,c be two classes.c≥c ⇔∀A∈D,v A⊂u A.If c≥c ,then c is said to be more general than c .In order to construct classes,we provide a sum operator(denoted by+)and a product operator(denoted by•).The sum of two classes gives us the most specific class generalizing them. 
The sum operator is defined as follows.Definition2Let c,c be two classes.9t =c +c ⇔∀A ∈D ,t A = c A if c A =c A ,∗elsewhere.The product of two classes gives the most general class specializing them.The product operator is defined as follows.Definition 3Let c ,c be two classes.Class z is defined as follows:∀A ∈D,z A =c A ∩c A .Then,t =c •c ⇔ t =z ifA ∈D |z A =∅<∅,...,∅>elsewhere.We can now define a lattice,by using the generalization/specialization order between classes and the operators defined above.The ordered set CS,≥ is a lattice denoted as CL ,in which Meet ( )and Join ( )elements are given by:1.∀T ⊂CL, T =+t ∈T t2.∀T ⊂CL, T =•t ∈T tFigure 3illustrates the lattice of environmental classes of the dataset provided in Table 2.Structural Hierarchy.Structural hierarchy is used to take into account infor-mation that could be lost by manipulating episodes.Indeed,it is important to consider the total journey including an episode,and the place of this episode in the journey.Some classes are presented in Figure 4.id Route Index Fragment[*]***[J1]J1**[J2]J2**[J1,E1]J1E1*[J1,E1,begin]J1E1begin[J1,E1,middle]J1E1middle............Fig.4.Extract from structural classes.Fig.5.Structural Hierarchy.Example 3Using the dataset of Table 2,we can see that the set of episodes be-longing to the class [J 1]is {e 1,e 2,e 4,e 5}.Similarly,the set of episodes belonging to the class [J 1,E 1]is {e 1,e 4}.Therefore,we create the hierarchy described in Figure 5,such as the higher the depth of a node is,the more specific is the symbolized class.10Structural classes,defined in D S,are represented with a tree.The branches of the tree are symbolizing a relationship“is included”.A tree is particularly appropriated here because it represents the different granularity levels defined earlier,i.e.,from the most general to thefiner level:journeys,episodes and fragments.Let C be the set of all structural classes,and H C the set of classes relation-ships.H C⊆C×C,and(c1,c2)∈H C means that c2is a subclass of c1.A lattice can not represent a structural hierarchy as elements are included in only one element of higher granularity:a fragment belongs to a single episode and an episode belongs to a single route.–The root of this hierarchy(denoted as∗)is the most general class,i.e.,it contains all the episodes stored in the database.–Nodes of depth1correspond to the various routes made by trains.Thus, the class represented by the node[J1]contains all the episodes made in the route denoted by J1.–The next level takes into account the place of the episode in the path rep-resented by the father node.The class[J1,E1]contains all thefirst episodes of the journey J1,i.e.,episodes of the journey J1whose index is1.–So far,the classes we have defined contain episodes.However,we have seen that it is possible to use afiner granularity in the data(see Section2).In-deed,an episode can be divided into three fragments:a starting fragment,a traveling fragment and an ending fragment.To make the most of this level of granularity and obtain more detailed knowledge,we consider classes of frag-ments in the leaves of the hierarchy.Thus,class[T1,E1,starting]contains starting fragments of the episodes contained in[T1,E1].The extraction of frequent sequences in this class will provide knowledge about the behavior at the start of an episode,under the conditions of the class[T1,E1].With the environmental lattice and structural tree presented in this section, we can index knowledge from normal behavior according to their context.Thus, when new data are tested to 
detect potential problems,we can precisely evaluate the similarity of each part of these new data with comparable elements stored in our normal behavior database.3.3Behavior SpecificityIn the previous section,we showed how to extract frequent behaviors in classes. To go further,we want to know the specific behavior of each class(i.e.,of each particular context).We therefore distinguish the specific patterns for a class(i.e., not found in class brothers in a hierarchy),and the general patterns appearing in several brother classes.General patterns possibly may be specific to a higher level of a class hierarchy.11 Using this data representation,we can define the following concepts.Let D be a sequence database,s a sequence,c a class in the structural hierarchy or the environmental hierarchy described in Section3.2.Notation1The number of sequences(i.e.,episodes or fragments according to the chosen level of granularity)contained in c is denoted as nbSeq(c).Notation2The number of sequences contained in c supporting s is denoted as suppSeq c(s).Definition4The support of s in c is denoted as supp c(s)and is defined by:supp c(s)=suppSeq c(s) nbSeq(c).Let minSupp be a minimum support threshold,s is said to be frequent in c if supp c(s)≥minSupp.Definition5s is a specific pattern in c if:–s is frequent in c,–s is non-frequent in the classes c (where c is a brother of c in the classes hierarchy).The notion of specificity provides additional information to the experts.A pattern specific to a class describes a behavior that is linked to a specific context. In particular,this information can be used in order to detect anomalies.For example,if we test the episodes being in[J1,E1,starting]and that we meet a lot of behaviors specific to another class,then we can think that these behaviors are not normal in this context and we can investigate the causes of this anomaly more efficiently.4Anomaly DetectionIn this section,we present how anomaly detection is performed.We consider that we are provided with both one database containing normal behavior on which knowledge have been extracted(see Section3)and data corresponding to one journey.The main idea is organized as follows.First,we define a measure to evaluate the compliance of a new journey in a given contextual class.In case of any detected anomaly,we make use of the class hierarchy to provide more detailed information about the problem.4.1Detection of Abnormal BehaviorsThis involves processing a score to quantify if the sequence corresponding to new data to be tested is consistent with its associated class.We consider that the consistency of an episode with a class c depends on the number of patterns of c12such as the episode is included in.The most the episode is included in patterns of c,the most this episode is consistent with c.We thus introduce a similarity score called the Conformity Score which is defined as follows.Definition6Let s be a sequence to be evaluated in a class c.We denoted by P the set of patterns of c,and P incl the set of patterns of c being included in s.So, the Conformity Score of s,denoted by conformity(s),is such as:conformity(s)=|P incl| |P|.We will thereafter denote by score c(e)the score of an episode e in class c.4.2Anomaly DiagnosisTo provide further information on the causes of detected anomalies,we use hierarchies developed in Section3.2.The detection is performed as follows.First,we measure the conformity score of a new episode e in the most precise level of each of the hierarchies(i.e.,structural and environmental).If the 
score is high enough in the two classes then the behavior of the train during e is considered as normal.Otherwise,it is possible to distinguish more clearly what is the cause of the anomaly.For example,the anomaly may have structural or environmental reasons.To obtain more information about the detected anomaly,it is possible to go up in the hierarchy and test the consistency score of e in“parent”classes of problematic classes.For example,if the episode e has a poor score in the class[low,high],then we evaluate its score in the classes c1=[low,∗]andc2=[∗,high].If score c1(e)is inadequate and score c2(e)is sufficient,then theprovided information is that e is in compliance with the normal behaviors related to the exterior temperature,but not with those related to travel duration.By defining a minimum conformity score minCo we can also determine whether the emission of an alarm is necessary or not.Thus,if score score c(e), then the episode is seen as problematic,and an alarm is emitted.However,we can balance an alarm in relation to the seriousness of the detected problem.For example,a score of0.1probably corresponds to a more important issue than a score of0.45.5ExperimentsIn order to evaluate our proposal,several experiments were conducted on real datasets.They correspond to the railway data collected on12trains where each train has249sensors.Each value is collected everyfive minutes.232temperature sensors and16acceleration sensors are distributed on the different components (e.g.,wheels,motors,etc..)and a sensor measures the overall speed of the train.13 5.1Experimental ProtocolThe experimental protocol follows the organization of the proposals:1.Characterization of Normal Behavior.(see Section3)We have studiedthe impact of contextualization on the characterization of normal behavior, and the benefit of the search for specific patterns in various contexts.2.Anomaly Detection.(see Section4)We have evaluated the conformityscore by applying it on normal and abnormal behavior.5.2Normal Behavior CharacterizationThe discovery of sequential patterns has been performed with the PSP algorithm, described in[13].We have used a C++implementation,which can manage time constraints.Class Frequent Sequences Specific Patterns[*,*]387387[low,*]876634[high,*]616411[low,high]64305859Table3.Number of frequent sequences and specific patterns,according to the envi-ronmental class.Table3shows,for an extract from the hierarchy presented in Section3.2,the number of frequent sequences found in each class,and the corresponding number of specific sequences,extracted with a minimum support set to0.3.We can note thatfiltering specific patterns reduces the amount of stored results.Moreover, the fact that each class,including most restrictive ones(i.e.,the leaves of the hierarchy),contain specific patterns shows both the importance of the context in railway behavior,and the usefulness of our approach.Note that for the most general class,denoted as[∗,∗],the number of specific patterns and frequent sequences is unchanged.This is because this class does not have a brother in the environmental hierarchy.Moreover,we notice in these results that the more a class is specific,the more it contains frequent sequences and specific patterns.Indeed,we look for frequent behaviors.Train behavior heavily depend on surrounding context.Therefore,the more a class is general, the less behaviors are frequent,as they vary much more from one journey to another.5.3Anomaly DetectionWe have noted in the previous section that we can extract very 
precise knowledge about normal train behavior. This knowledge is now used in an anomaly detection process, through the methods described in Section 4.
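As a small illustration of the Conformity Score defined in Section 4.1 (a sketch under assumed data structures, not the authors' implementation), the score of a new sequence in a class is the fraction of the class's patterns included in the sequence, and an alarm can be raised when the score falls below a minimum conformity threshold; the subsequence test below ignores the maxGap constraint for brevity.

```python
def is_subsequence(pattern, sequence):
    """True if `pattern` (a list of itemsets) is included in `sequence`,
    i.e. its itemsets appear in order as subsets of the sequence's itemsets."""
    i = 0
    for itemset in sequence:
        if i < len(pattern) and pattern[i] <= itemset:    # subset test on sets
            i += 1
    return i == len(pattern)

def conformity(sequence, class_patterns):
    """Conformity score: |patterns of the class included in the sequence| / |patterns|."""
    if not class_patterns:
        return 1.0                    # assumption: no extracted knowledge, no alarm
    included = sum(is_subsequence(p, sequence) for p in class_patterns)
    return included / len(class_patterns)

def check_episode(sequence, class_patterns, min_conformity=0.5):
    """Return the score and whether an alarm should be emitted."""
    score = conformity(sequence, class_patterns)
    return score, score < min_conformity

# Tiny example with symbolic items
patterns = [[{"S1_low"}, {"S2_high"}], [{"S3_low"}]]
episode = [{"S1_low", "S3_low"}, {"S2_high"}]
print(check_episode(episode, patterns))        # (1.0, False)
```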


Uncertain Data Mining: Translated Foreign Literature (this document contains the English original together with its Chinese translation)

Translation: Uncertain Data Mining: A New Research Direction

Abstract

Data uncertainty is often found in real-world applications due to reasons such as imprecise measurement, outdated sources, or sampling errors.

Recently, much research has been published in the area of managing data uncertainty in databases.

We argue that when data mining is performed on uncertain data, the uncertainty has to be taken into account in order to obtain high-quality mining results.

We call this the "uncertain data mining" problem.

In this paper, we present a framework for possible research directions in this area.

We also use the UK-means clustering algorithm as an example to illustrate how the traditional K-means algorithm can be modified to handle data uncertainty in data mining.

1. Introduction

Data is often associated with uncertainty because of measurement inaccuracy, sampling errors, outdated data sources, or other reasons.

This is especially true for applications that interact with the physical environment, such as location-based services [15] and sensor monitoring [3].

For example, in the scenario of tracking moving objects (such as vehicles or people), it is impossible for the database to track the exact location of every object at every instant.

Therefore, the location of each object changes over time and is inherently associated with uncertainty.

In order to provide accurate query and mining results, these various sources of data uncertainty have to be taken into account.

In recent years, there has been a large amount of research on managing uncertain data in databases, such as representing uncertainty in databases and querying uncertain data.

However, few research results address the problem of mining uncertain data.

We note that uncertainty causes data values to no longer be atomic.

To apply traditional data mining techniques, uncertain data has to be summarized into atomic values.

Taking the moving-object tracking application as an example again, the location of an object can be summarized either by its last recorded location or by an expected location (if the probability distribution of the object's location is taken into account).

Unfortunately, the error between the summarized records and the real records can seriously affect the mining results.

Figure 1 illustrates the problem that arises when a clustering algorithm is applied to moving objects with uncertain locations.

Figure 1(a) shows the real data of a set of objects, while Figure 1(b) shows the recorded, already outdated, locations of these objects.

If the actual locations were available, they would yield clusters that differ significantly from those obtained from the outdated values.

If we rely only on the recorded values, many objects may be placed in the wrong clusters.

Even worse, each member assigned to a cluster can shift the cluster centroid, thus leading to more errors.

Figure 1. (a) The real data, partitioned into three clusters a, b and c. (b) The recorded locations of some of the objects (shaded) differ from their real locations, forming clusters a', b', c' and c''; note that cluster a' has one fewer object than cluster a, while cluster b' has one more object than cluster b, and c is mistakenly split into c' and c''. (c) The directional uncertainty of the objects is taken into account, producing clusters a', b' and c; this clustering is much closer to (a) than the one in (b).

We propose incorporating uncertainty information, such as the probability density functions of the uncertain data, into existing data mining methods, so that the mining results come closer to the results that would be obtained if the real data were available for mining.

This paper studies how uncertainty can be incorporated into data mining by using data clustering as a motivating example.

We call this the uncertain data mining problem.

In this paper, we present a framework for possible research directions in this area.

The rest of the paper is organized as follows.

Section 2 is a review of related work.

In Section 3, we define the problem of clustering uncertain data and introduce our proposed algorithm.

Section 4 presents the application of our algorithm to a moving-object database.

Detailed experimental results are explained in Section 5.

Finally, Section 6 concludes the paper and proposes possible research directions.

2. Background

In recent years, there has been considerable research interest in the management of data uncertainty.

Data uncertainty can be classified into two types, namely existential uncertainty and value uncertainty.

In the first type, it is uncertain whether the object or data tuple exists at all.

For example, a tuple in a relational database could be associated with a probability value that indicates the confidence of its presence [1,2].

In value uncertainty, a data item is modeled as a closed region which, together with a probability density function (PDF) of its value, bounds its possible values [3,4,12,15].

This model can be used to quantify the imprecision of location and sensor data in a constantly changing environment.

In this area, a large amount of work has been devoted to imprecise queries.

For example, indexing solutions for range queries over uncertain data have been proposed in [5].

In [4], the same authors proposed solutions for nearest-neighbor queries.

Notice that all these works apply the results of uncertain data management to simple database queries, rather than to the relatively complex problems of data analysis and mining.

In data mining research, the clustering problem has been well studied.

A standard clustering process consists of five main steps: pattern representation, definition of a pattern proximity measure, clustering or grouping, data abstraction, and assessment of output [10].

Only a small number of studies on data mining or clustering of uncertain data have been published.

Hamdan and Govaert have addressed the problem of fitting mixture densities to uncertain data for clustering using the EM algorithm [8].

However, the model cannot be readily applied to other clustering algorithms, since it is essentially tailored to EM.

Clustering on interval data has also been studied.

Different distance measures, such as the city-block distance and the Minkowski distance, have been used to measure the similarity between two intervals.

In most of these measures, the probability density function of the interval is not taken into account.

Another related area of research is fuzzy clustering.

Fuzzy clustering has long been studied in fuzzy logic [13].

In fuzzy clustering, a cluster is represented by a fuzzy subset of a set of objects.

Each object has a "degree of belongingness" to each cluster.

In other words, an object can belong to more than one cluster, each with a certain degree.

The fuzzy c-means clustering algorithm is one of the most widely used fuzzy clustering methods [2,7].

Different fuzzy clustering methods have been applied to ordinary data or to fuzzy data to produce fuzzy clusters.

Their work is based on a fuzzy data model, while our work is developed on the uncertainty model of moving objects.

3. Classification of Uncertain Data

In Figure 2, we propose a taxonomy that illustrates how data mining methods can be classified according to whether data imprecision is considered.

There are many general data mining techniques, such as association rule mining, data classification, and data clustering.

Naturally, these techniques need to be modified before they can handle uncertain data.

In addition, we distinguish between two types of data clustering: hard clustering and fuzzy clustering.

Hard clustering aims at improving the accuracy of clustering by taking the expected data values into account.

Fuzzy clustering, on the other hand, presents the clustering result in a "fuzzy" form.

An example of fuzzy clustering is that each data item is assigned a probability of its being a member of each cluster.
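As a minimal illustration of this distinction (the objects and membership degrees below are made up, not taken from the paper), a fuzzy result keeps one membership degree per cluster, while the corresponding hard result keeps only the single best cluster for each item:

# Fuzzy result: one membership degree per cluster for each data item.
fuzzy_memberships = {
    "o1": {"C1": 0.70, "C2": 0.25, "C3": 0.05},
    "o2": {"C1": 0.10, "C2": 0.60, "C3": 0.30},
}

# Hard result: each item is assigned to exactly one cluster,
# here simply the cluster with the highest membership degree.
hard_assignment = {obj: max(degrees, key=degrees.get)
                   for obj, degrees in fuzzy_memberships.items()}
print(hard_assignment)   # {'o1': 'C1', 'o2': 'C2'}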

Figure 2. A classification of uncertain data mining.

For example, when uncertainty is considered, an interesting question arises as to how each tuple and its associated uncertainty should be represented in the dataset.

Moreover, well-known association rule mining algorithms (such as Apriori) have to be modified, since the notions of support and other metrics need to be redefined.
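One concrete way the notion of support can be redefined, drawn from later work on frequent-itemset mining over uncertain data rather than from this paper, is expected support: each transaction contributes the probability that it actually contains the itemset. A rough sketch with made-up existence probabilities:

# Each transaction maps an item to the probability that the item really occurs in it.
transactions = [
    {"bread": 0.9, "milk": 0.8},
    {"bread": 0.6, "milk": 0.4, "eggs": 1.0},
    {"milk": 1.0, "eggs": 0.7},
]

def expected_support(itemset, transactions):
    # Assuming independent item existence, a transaction contains the itemset
    # with probability equal to the product of its items' probabilities.
    total = 0.0
    for t in transactions:
        p = 1.0
        for item in itemset:
            p *= t.get(item, 0.0)
        total += p
    return total

print(expected_support({"bread", "milk"}, transactions))   # 0.9*0.8 + 0.6*0.4 = 0.96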

Similarly, in data classification and data clustering, traditional algorithms fail to work because data uncertainty is not taken into account.

Important measures, such as the cluster centroid, the distance between two objects, and the distance between an object and a centroid, have to be redefined and studied further.

4. An Example of Clustering Uncertain Data

In this section, we present our work on clustering uncertain data as an example of uncertain data mining.

It illustrates our ideas on how traditional data mining algorithms can be modified to handle data uncertainty.

4.1 Problem Definition

Let S be a set of V-dimensional vectors x_i, where i = 1 to n, which represent the attribute values of all the records considered in the clustering application.

Each record o_i is associated with a probability density function f_i(x), which is the probability density function of o_i's attribute value x at time t.

We make no assumption about how this uncertainty function varies with time, or about what the probability density function of a record looks like.

A uniform density function is one example of such a probability density function; it describes the worst case in a scenario of high uncertainty [3].

Another commonly used one is the Gaussian distribution, which can be used to describe measurement errors [12,15].
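To make this data model concrete, the sketch below encodes a record as a bounded uncertainty region together with a sampling routine for its PDF, covering the two examples just mentioned: a uniform PDF over the region (the worst case) and a Gaussian error model around a reported value. The class and its interface are illustrative assumptions, not a representation prescribed by the paper.

import random

class UncertainRecord:
    """A record whose true value lies in a bounded region, described by a PDF."""

    def __init__(self, low, high, reported=None, sigma=None):
        self.low, self.high = low, high            # bounding box of the uncertainty region
        self.reported, self.sigma = reported, sigma

    def sample(self):
        # Gaussian error model around the reported value (truncated to the region),
        # otherwise a uniform PDF over the region, i.e. the worst case mentioned above.
        if self.reported is not None and self.sigma is not None:
            return [min(max(random.gauss(m, self.sigma), lo), hi)
                    for m, lo, hi in zip(self.reported, self.low, self.high)]
        return [random.uniform(lo, hi) for lo, hi in zip(self.low, self.high)]

# A 2-D location known only to lie in [0,10] x [0,10], reported at (4, 6) with error sigma = 0.5.
o1 = UncertainRecord(low=[0, 0], high=[10, 10], reported=[4, 6], sigma=0.5)
print(o1.sample())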

The clustering problem is to find a set C of clusters C_j (with j from 1 to K), where each cluster C_j is formed around a mean c_j, based on similarity.

Different clustering algorithms have different objective functions, but the general idea is to minimize the distance between objects within the same cluster while maximizing the distance between objects in different clusters.

Minimizing the intra-cluster distance can also be viewed as minimizing the distance between each data point x_i and the mean c_j of the cluster C_j to which x_i is assigned.

In this paper, we only consider hard clustering, i.e., each object is assigned to exactly one cluster.

4.2 K-means Clustering for Precise Data

The traditional K-means clustering algorithm aims at finding a set C of K clusters C_j, each with a cluster mean c_j, that minimizes the sum of squared errors (SSE).

The SSE is usually computed as follows:

SSE = \sum_{j=1}^{K} \sum_{x_i \in C_j} \| c_j - x_i \|^2    (1)

where || . || denotes a distance metric between a data point x_i and a cluster mean c_j.

For example, the Euclidean distance is defined as:

\| x - y \| = \sqrt{ \sum_{i=1}^{V} (x_i - y_i)^2 }    (2)

The mean (centroid) of a cluster C_j is defined by the following vector:

c_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i    (3)

The K-means algorithm is as follows:

1. Assign initial values for cluster means c_1 to c_K
2. repeat
3.   for i = 1 to n do
4.     Assign each data point x_i to cluster C_j where || c_j - x_i || is the minimum.
5.   end for
6.   for j = 1 to K do
7.     Recalculate cluster mean c_j of cluster C_j
8.   end for
9. until convergence
10. return C

Convergence can be determined based on different criteria.

Examples of convergence criteria include: (1) when the sum of squared errors is smaller than a user-specified threshold; (2) when no object is reassigned to a different cluster in an iteration; and (3) when the number of iterations reaches a predefined maximum.
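For reference, a compact Python version of the listing above; it follows the pseudocode directly (initial means chosen from the data, reassignment by minimum Euclidean distance, recomputation of the means) and stops when no point changes cluster, i.e. criterion (2). This is a sketch under those assumptions, not tuned production code.

import math, random

def kmeans(points, k, max_iter=100):
    means = random.sample(points, k)                      # step 1: initial cluster means
    assign = [None] * len(points)
    for _ in range(max_iter):                             # steps 2-9
        changed = False
        for i, x in enumerate(points):                    # step 4: nearest mean by Euclidean distance
            j = min(range(k), key=lambda j: math.dist(means[j], x))
            if assign[i] != j:
                assign[i], changed = j, True
        for j in range(k):                                # step 7: recompute cluster mean c_j
            members = [points[i] for i in range(len(points)) if assign[i] == j]
            if members:
                means[j] = [sum(dim) / len(members) for dim in zip(*members)]
        if not changed:                                   # convergence criterion (2)
            break
    return means, assign

points = [(1, 1), (1.2, 0.8), (5, 5), (5.1, 4.9), (9, 1)]
print(kmeans(points, k=2))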

4.3 K-means Clustering for Uncertain Data

To take data uncertainty into account in the clustering process, we propose an algorithm whose goal is to minimize the expected sum of squared errors E(SSE).

Note that a data object x_i is specified by an uncertainty region with an uncertainty probability density function f(x_i).

Given a set of clusters, the expected SSE can be computed as follows:

E(SSE) = E( \sum_{j=1}^{K} \sum_{x_i \in C_j} \| c_j - x_i \|^2 )
       = \sum_{j=1}^{K} \sum_{x_i \in C_j} E( \| c_j - x_i \|^2 )
       = \sum_{j=1}^{K} \sum_{x_i \in C_j} \int \| c_j - x_i \|^2 \, f_i(x_i) \, dx_i    (4)

The cluster means are given by:

c_j = E( \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i )
    = \frac{1}{|C_j|} \sum_{x_i \in C_j} E(x_i)
    = \frac{1}{|C_j|} \sum_{x_i \in C_j} \int x_i \, f_i(x_i) \, dx_i    (5)

We now present a new K-means algorithm, called UK-means, for clustering uncertain data:

1. Assign initial values for cluster means c_1 to c_K
2. repeat
3.   for i = 1 to n do
4.     Assign each data point x_i to cluster C_j where E(|| c_j - x_i ||) is the minimum.
5.   end for
6.   for j = 1 to K do
7.     Recalculate cluster mean c_j of cluster C_j
8.   end for
9. until convergence
10. return C

The major difference between the UK-means algorithm and the K-means algorithm lies in the computation of distances and cluster means.
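The sketch below shows one way the expected distance in step 4 could be evaluated when the integrals in (4) and (5) have no closed form: approximate E(||c_j - x_i||^2) and E(x_i) with Monte Carlo samples drawn from each object's PDF. The sampling interface and the use of the squared expected distance are assumptions made for illustration; the paper leaves the PDF and its evaluation unspecified.

import random

def uk_means(objects, k, n_samples=200, max_iter=50):
    # objects: list of callables, each returning one sample drawn from that object's PDF.
    samples = [[obj() for _ in range(n_samples)] for obj in objects]
    exp_val = [[sum(dim) / n_samples for dim in zip(*s)] for s in samples]   # E(x_i), as in (5)
    means = random.sample(exp_val, k)                                        # initial c_1 .. c_K

    def exp_sq_dist(s, c):
        # Monte Carlo estimate of E(||c - x_i||^2) over the samples s of one object.
        return sum(sum((ci - xi) ** 2 for ci, xi in zip(c, x)) for x in s) / len(s)

    assign = [None] * len(objects)
    for _ in range(max_iter):
        changed = False
        for i, s in enumerate(samples):                  # step 4: minimum expected distance
            j = min(range(k), key=lambda j: exp_sq_dist(s, means[j]))
            if assign[i] != j:
                assign[i], changed = j, True
        for j in range(k):                               # step 7: cluster mean of the E(x_i)
            members = [exp_val[i] for i in range(len(objects)) if assign[i] == j]
            if members:
                means[j] = [sum(dim) / len(members) for dim in zip(*members)]
        if not changed:
            break
    return means, assign

# Three uncertain 2-D objects described by uniform boxes; two overlap near the origin.
objs = [lambda: (random.uniform(0, 2), random.uniform(0, 2)),
        lambda: (random.uniform(0.5, 2.5), random.uniform(0.5, 2.5)),
        lambda: (random.uniform(5, 7), random.uniform(4, 6))]
print(uk_means(objs, k=2))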
