Data mining An overview from a database perspective (1996)
DataMining分析方法
如有你有帮助,请购买下载,谢谢!数据挖掘Data Mining第一部 Data Mining的觀念............... 错误!未定义书签。
第一章何謂Data Mining ..................................................... 错误!未定义书签。
第二章Data Mining運用的理論與實際應用功能............. 错误!未定义书签。
第三章Data Mining與統計分析有何不同......................... 错误!未定义书签。
第四章完整的Data Mining有哪些步驟............................ 错误!未定义书签。
第五章CRISP-DM ............................................................... 错误!未定义书签。
第六章Data Mining、Data Warehousing、OLAP三者關係為何. 错误!未定义书签。
第七章Data Mining在CRM中扮演的角色為何.............. 错误!未定义书签。
第八章Data Mining 與Web Mining有何不同................. 错误!未定义书签。
第九章Data Mining 的功能................................................ 错误!未定义书签。
第十章Data Mining應用於各領域的情形......................... 错误!未定义书签。
第十一章Data Mining的分析工具..................................... 错误!未定义书签。
第二部多變量分析....................... 错误!未定义书签。
第一章主成分分析(Principal Component Analysis) ........... 错误!未定义书签。
数据挖掘导论英文版
数据挖掘导论英文版Data Mining IntroductionData mining is the process of extracting valuable insights and patterns from large datasets. It involves the application of various techniques and algorithms to uncover hidden relationships, trends, and anomalies that can be used to inform decision-making and drive business success. In today's data-driven world, the ability to effectively harness the power of data has become a critical competitive advantage for organizations across a wide range of industries.One of the key strengths of data mining is its versatility. It can be applied to a wide range of domains, from marketing and finance to healthcare and scientific research. In the marketing realm, for example, data mining can be used to analyze customer behavior, identify target segments, and develop personalized marketing strategies. In the financial sector, data mining can be leveraged to detect fraud, assess credit risk, and optimize investment portfolios.At the heart of data mining lies a diverse set of techniques and algorithms. These include supervised learning methods, such asregression and classification, which can be used to predict outcomes based on known patterns in the data. Unsupervised learning techniques, such as clustering and association rule mining, can be employed to uncover hidden structures and relationships within datasets. Additionally, advanced algorithms like neural networks and decision trees have proven to be highly effective in tackling complex, non-linear problems.The process of data mining typically involves several key steps, each of which plays a crucial role in extracting meaningful insights from the data. The first step is data preparation, which involves cleaning, transforming, and integrating the raw data into a format that can be effectively analyzed. This step is particularly important, as the quality and accuracy of the input data can significantly impact the reliability of the final results.Once the data is prepared, the next step is to select the appropriate data mining techniques and algorithms to apply. This requires a deep understanding of the problem at hand, as well as the strengths and limitations of the available tools. Depending on the specific goals of the analysis, the data mining practitioner may choose to employ a combination of techniques, each of which can provide unique insights and perspectives.The next phase is the actual data mining process, where the selectedalgorithms are applied to the prepared data. This can involve complex mathematical and statistical calculations, as well as the use of specialized software and computing resources. The results of this process may include the identification of patterns, trends, and relationships within the data, as well as the development of predictive models and other data-driven insights.Once the data mining process is complete, the final step is to interpret and communicate the findings. This involves translating the technical results into actionable insights that can be easily understood by stakeholders, such as business leaders, policymakers, or scientific researchers. Effective communication of data mining results is crucial, as it enables decision-makers to make informed choices and take appropriate actions based on the insights gained.One of the most exciting aspects of data mining is its continuous evolution and the emergence of new techniques and technologies. As the volume and complexity of data continue to grow, the need for more sophisticated and powerful data mining tools and algorithms has become increasingly pressing. Advances in areas such as machine learning, deep learning, and big data processing have opened up new frontiers in data mining, enabling practitioners to tackle increasingly complex problems and extract even more valuable insights from the data.In conclusion, data mining is a powerful and versatile tool that has the potential to transform the way we approach a wide range of challenges and opportunities. By leveraging the power of data and the latest analytical techniques, organizations can gain a deeper understanding of their operations, customers, and markets, and make more informed, data-driven decisions that drive sustainable growth and success. As the field of data mining continues to evolve, it is clear that it will play an increasingly crucial role in shaping the future of business, science, and society as a whole.。
Data mining techniques for customer
Technology in Society 24 (2002) 483–502 /locate/techsoc
Data mining techniques for customer relationship management
Chris Rygielski a, Jyun-Cheng Wang b, David C. Yen a,∗
∗
Corresponding author. Tel.: +1-513-529-4826; fax: +1-513-529-9689. E-mail address: yendc@ (D.C. Yen).
0160-791X/02/$ - see front matter 2002 Elsevier Science Ltd. All rights reserved. PII: S 0 1 6 0 - 7 9 1 X ( 0 2 ) 0 0 0 3 8 - 6
484
C. Rygielski et al. / Technology in Society 24 (2002) 483–502
1. Introduction A new business culture is developing today. Within it, the economics of customer relationships are changing in fundamental ways, and companies are facing the need to implement new solutions and strategies that address these changes. The concepts of mass production and mass marketing, first created during the Industrial Revolution, are being supplanted by new ideas in which customer relationships are the central business issue. Firms today areue through analysis of the customer lifecycle. The tools and technologies of data warehousing, data mining, and other customer relationship management (CRM) techniques afford new opportunities for businesses to act on the concepts of relationship marketing. The old model of “design-build-sell” (a product-oriented view) is being replaced by “sell-build-redesign” (a customer-oriented view). The traditional process of massmarketing is being challenged by the new approach of one-to-one marketing. In the traditional process, the marketing goal is to reach more customers and expand the customer base. But given the high cost of acquiring new customers, it makes better sense to conduct business with current customers. In so doing, the marketing focus shifts away from the breadth of customer base to the depth of each customer’s needs. The performance metric changes from market share to so-called “wallet share”. Businesses do not just deal with customers in order to make transactions; they turn the opportunity to sell products into a service experience and endeavor to establish a long-term relationship with each customer. The advent of the Internet has undoubtedly contributed to the shift of marketing focus. As on-line information becomes more accessible and abundant, consumers become more informed and sophisticated. They are aware of all that is being offered, and they demand the best. To cope with this condition, businesses have to distinguish their products or services in a way that avoids the undesired result of becoming mere commodities. One effective way to distinguish themselves is with systems that can interact precisely and consistently with customers. Collecting customer demographics and behavior data makes precision targeting possible. This kind of targeting also helps when devising an effective promotion plan to meet tough competition or identifying prospective customers when new products appear. Interacting with customers consistently means businesses must store transaction records and responses in an online system that is available to knowledgeable staff members who know how to interact with it. The importance of establishing close customer relationships is recognized, and CRM is called for. It may seem that CRM is applicable only for managing relationships between businesses and consumers. A closer examination reveals that it is even more crucial for business customers. In business-to-business (B2B) environments, a tremendous amount of information is exchanged on a regular basis. For example, transactions are more numerous, custom contracts are more diverse, and pricing schemes are more complicated. CRM helps smooth the process when various representatives of seller and buyer companies communicate and collaborate. Customized catalogues, personalized business portals, and targeted product offers can simplify the procurement process and improve efficiencies for both companies. E-mail alerts and new
Introduction to Data Mining
Introduction to Data MiningData mining is a process of extracting useful information from large datasets by using various statistical and machine learning techniques. It is a crucial part of the field of data science and plays a key role in helping businesses make informed decisions based on data-driven insights.One of the main goals of data mining is to discover patterns and relationships within data that can be used to make predictions or identify trends. This can help businesses improve their marketing strategies, optimize their operations, and better understand their customers. By analyzing large amounts of data, data mining algorithms can uncover hidden patterns that may not be immediately apparent to human analysts.There are several different techniques that are commonly used in data mining, including classification, clustering, association rule mining, and anomaly detection. Classification involves categorizing data points into different classes based on their attributes, while clustering groups similar data points together. Association rule mining identifies relationships between different variables, and anomaly detection detects outliers or unusual patterns in the data.In order to apply data mining techniques effectively, it is important to have a solid understanding of statistics, machine learning, and data analytics. Data mining professionals must be able to preprocess data, select appropriate algorithms, and interpret the results of their analyses. They must also be able to communicate their findings effectively to stakeholders in order to drive business decisions.Data mining is used in a wide range of industries, including finance, healthcare, retail, and telecommunications. In finance, data mining is used to detect fraudulent transactions and predict market trends. In healthcare, it is used to analyze patient data and improve treatment outcomes. In retail, it is used to optimize inventory management and personalize marketing campaigns. In telecommunications, it is used to analyze network performance and customer behavior.Overall, data mining is a powerful tool that can help businesses gain valuable insights from their data and make more informed decisions. By leveraging the latest advances in machine learning and data analytics, organizations can stay competitive in today's data-driven world. Whether you are a data scientist, analyst, or business leader, understanding the principles of data mining can help you unlock the potential of your data and drive success in your organization.。
Improving tax administration with data mining
Daniele Micci-Barreca, PhD, and Satheesh Ramachandran, PhD Elite Analytics, LLCExecutive reportImproving tax administration with data miningTable of contentsIntroduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2Why data mining? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3What is data mining? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3Case study: An audit selection strategy for the State of Texas . . . . . . . . . . . . . . . . . . . . . . .5Other data mining applications in tax administration . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12About Clementine ® . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12About SPSS Inc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12About Elite Analytics, LLC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13IntroductionBoth federal and state tax administration agencies must use their limited resources to achieve maximal taxpayer compliance. The purpose of this white paper is to show how data mining helps tax agencies achieve compliance goals and improve operating efficiency using their existing resources. The paper begins with an overview of data mining and then details an actual tax compliance application implemented by Elite Analytics, LLC, a data mining consulting services firm and SPSS Inc. partner, for the Audit Division of the Texas Comptroller of Public Accounts(CPA). The paper concludes with an overview of additional applications of data mining technology in the area of tax compliance.Tax agencies primarily use audits to ensure compliance with tax laws and maintain associated revenue streams. Audits indirectly drive voluntary compliance and directly generate additional tax collections, both of which help tax agencies reduce the “tax gap” between the tax owed and the amount collected. Audits, therefore, are critical to enforcing tax laws and helping tax agencies achieve revenue objectives, ensuring the fiscal health of the country and individual states.Managing an effective auditing organization involves many decisions. What is the best audit selection strategy or combination of strategies? Should it be based on reported tax amounts or on the industry type? How should agencies allocate audit resources among different tax types? Some tax types may yield greater per-audit adjustments. Others may be associated with a higher incidence of noncompliance. An audit is a process with many progressive stages, from audit selection and assignment to hearings, adjudication, and negotiation, to collection and, in some cases, enforcement. Each stage involves decisions that can increase or reduce the efficiency of the overall auditing program.Audit selection methods range from simple random selection to more complex rule-based selection based on “audit flags,”to sophisticated statistical and data mining selection techniques. Selection strategies can vary by tax type, and even within a single type. Certain selection strategies, for example, segment taxpayers within a specific tax type, and then apply different selection rules to each segment. Texas categorizes taxpayers that account for the top 65 percent of the state’s sales tax collections as Priority One accounts, and audits these accounts every four years. Texas also audits Prior Productive taxpayers—accounts that yielded tax adjustments greater than $10,000 in previous audits.As in many states, Texas’ taxpayer population has risen steadily over the last decade, without any proportionate rise in auditing resources. As a result, Texas and several other states, as well as tax agencies in the United Kingdom and Australia, rely on data mining to help find delinquent taxpayers and make effective resource allocation decisions. Data mining leverages specialized data warehousing systems that integrate internal and external data sources to enable a variety of applications,from trend analysis to non-compliance detection and revenue forecasting, that help agencies answer questions such as:■How should we split auditing resources among tax types?■Which taxpayers are higher audit priorities?■What is the expected yield from a particular audit type?■Which SIC codes are associated with higher rates of noncompliance?Why data mining?Tax agencies have access to enormous amounts of taxpayer data. Most auditing agencies, in fact, draw information from these data sources to support auditing functions. Audit selectors, for example, search data sources for taxpayers with specific profiles. These profiles, developed by experts, may be based on a single attribute, such the taxpayer industry code (SIC), or on a complex combination of attributes (for example, taxpayers in a specific retail sector that have a specific sales-to-reported-tax ratio). Data mining technologies do the same thing, but on a much bigger scale. Using data mining techniques, tax agencies can analyze data from hundreds of thousands of taxpayers to identify common attributes and then create profiles that represent different types of activity. Agencies, for example, can create profiles of high-yield returns, so auditors can concentrate resources on new returns with similar attributes. Data mining enables organizations to leverage their data to understand, analyze, and predict noncompliant behavior.What is data mining?Data mining is the exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules [1]. Organizations use this information to detect existing fraud and noncompliance,and to prevent future occurrences.Data mining also enables data exploration and analysis without any specific hypothesis in mind, as opposed to traditional statistical analysis, in which experiments are designed around a particular hypothesis. While this openness adds a strong exploratory aspect to data mining projects, it also requires that organizations use a systematic approach in order to achieve usable results. The CRoss-Industry Standard Process for Data Mining (CRISP-DM) (see Figure 1) was developed in 1996 by a consortium of data mining consultants and technology specialists that included SPSS, Daimler-Benz (now DaimlerChrysler) and NCR [2]. CRISP-DM’s developers relied on their real-world experience to develop a six-phase process that incorporates an organization’s business goals and knowledge. CRISP-DM is considered the de facto standard for the data mining industry.The six phases of CRISP-DM Business understandingThe first phase ensures that all participants understand the project goals from a business or organizational perspective. These business goals are then incorporated in a data mining problem definition and detailed project plan. For a tax auditing agency, this would involve understanding the audit management process,the role and functions performed, the information that is collected and managed, and the specific challenges to improving audit efficiency. This information would be incorporated into the data mining problemdefinition and project plan.Figure 1: The CRoss-Industry Standard Process for Data Mining (CRISP-DM)Data understandingThe second phase is designed to assess the sources, quality, and characteristics of the data. This initial exploration can also provide insights that help to focus the project. The result is a detailed understanding of the key data elements that will be used to build models. This phase can be time-consuming for tax agencies that have many data sources, but it is critically important to the project.Data preparationThe next phase involves placing the data in a format suitable for building models. The analyst uses the business objectives determined in the business understanding step to determine which data types and data mining algorithms to use. This phase also resolves data issues uncovered in the data understanding phase, such as missing data.ModelingThe modeling phase involves building the data mining algorithms that extract the knowledge from the data. There are a variety of data mining techniques; each is suitable for discovering a specific type of knowledge. A tax agency would use classification or regression models, for example, to discover the characteristics of more productive tax audits. Each technique requires specific types of data, which may require a return to the data preparation phase. The modeling phase produces a model or a set of models containing the discovered knowledge in an appropriate format.EvaluationThis phase focuses on evaluating the quality of the model or models. Data mining algorithms can uncover an unlimited number of patterns; many of these, however, may be meaningless. This phase helps determine which models are useful in terms of achieving the project’s business objectives. In the context of audit selection, a predictive model for audit outcome would be assessed against a benchmark set of historical audits for which the outcome is known.DeploymentIn the deployment phase, the organization incorporates the data mining results into the day-to-day decision-making process. Depending on the significance of the results, this may require only minor modifications, or it may necessitate a major reengineering of processes and decision-support systems. The deployment phase also involves creating a repeatable process for model enhancements or recalibrations. Tax laws, for example, are likely to change over time. Analysts need a standard process for updating the models accordingly and deploying new results.The appropriate presentation of results ensures that decision makers actually use the information. This can be as simple as creating a report or as complex as implementing a repeatable data mining process across the enterprise. It is important that project managers understand from the beginning what actions they will need to take in order to make use of the final models.The six phases described in this paper are integral to every data mining project. Though each phase is important, the sequence is not rigid; certain projects may require you to move back and forth between phases. The next phase or the next task in a phase depends on the outcome of each of the previous phases. The inner arrows in Figure 1indicate the most important and frequent dependencies between phases. The outer circle symbolizes the cyclical nature of data mining projects, namely that lessons learned during a data mining project and after deployment can trigger new, more focused business questions. Subsequent data mining projects, therefore, benefit from experience gained in previous ones.Eighty to 90 percent of a typical data mining project is spent on phases other than modeling, and the success of the models depends heavily on work performed in these phases. To create the models, the analyst typically uses a collection of techniques and tools. Data mining techniques come from a variety of disciplines, including machine learning, statistical analysis, pattern recognition, signal processing, evolutionary computation, and pattern visualization. A detailed discussion of these methods is beyond the scope of this paper. The next sections, however, discuss data mining methods relevant to tax audit applications.Case study: An audit selection strategy for the State of TexasAs previously mentioned, there are many audit selection methods, each of which results in different levels of productivity. Measured in terms of dollars recovered per audit hour, the relationship between productivity and the audit selection strategy is quantifiable. This section focuses on the Audit Select scoring system implemented by Elite Analytics for the Audit Division of the Texas Comptroller of Public Accounts, and how this approach compares to traditional audit selection strategies used by the division.Already the data mining tool of choice for the Audit Division, Clementine was used throughout the implementation of the new audit selection strategy. According to Elite Analytics, “Not only did we find the variety of data preparation, modeling, and deployment features unmatched in any other product, but the Clementine workbench made working and categorizing completed work within the CRISP-DM model very intuitive.”Predictive modelingPredictive modeling is one of the most frequently used forms of data mining. As its name suggests, predictive modeling enables organizations to predict the outcome of a given process and use this insight to achieve a desired outcome. In terms of audit selection, the goal is to predict which audit leads are more likely to yield greater tax adjustments.Predictive models typically produce a numeric score that indicates the likely outcome of an audit. A high score, for instance, indicates that the tax adjustment from that audit is likely to be high or above average, while a low score indicates a low likelihood of a large tax adjustment. A real-world example is the Discriminant Index Function (DIF) score developed by the IRS to identify returns with a high probability of unreported income. Use of the DIF score results in significantly higher tax assessments than do purely random audits[3].Scoring models have become more popular, due in large part to their ability to manage hundreds of variables and large populations. Among other applications, organizations use scoring models to:■Assess credit risk■Identify fraudulent credit card transactions■Determine the response potential of mailing lists■Rank prospects in terms of buying potentialUsing models to predict sales tax audit outcomesThe Audit Division’s Advanced Database System (ADS) is a data warehouse designed to support tax compliance applications such as the Audit Select system, which uses predictive models to identify sales tax audit leads. Figure 2depicts the process of building and using predictive models for audit selection.The first step is to calibrate or train the model using a training set of historical audit data with a known outcome. This enables the model to “learn” the relationship between taxpayer attributes and the audit outcomes. The Audit Select scoring model uses the following five sources of data to create a taxpayer profile (see Figure 3):■Business information, such as taxpayer SIC code, business type (corporation, partnership, etc.), and location■Sales tax filings from the most recent four years of sales tax reports■Other tax filings, primarily for franchise tax■Employee and wage information reported to another state agency■Prior audit outcomesOnce trained, the model can be applied to the entire taxpayer population in a process known as generalization. The model generalizes what it learns from historical audits to analyze new returns and assign Audit Select scores(see examples in Figure 4). Human audit selectors then use the Audit Select scores to determine which businesses to audit.The model relies on extensive data preparation and transformations that map the raw data into more informative indicators. For example, while the actual and relative (compared to gross sales) amount of deduction is a good raw indicator, it is useful to compare the figures to those of similar businesses. This peer group comparison is used throughout the model to develop additional indicators.Figure 4Figure 4: Training and scoring examplesIn Figure 4, the first table represents the training set of historical audits used to calibrate the model. The second table shows the model’s predictive scores for the current population of taxpayers.Results from the Audit Select scoring systemThe most effective way to assess a scoring system is to compare the scores it produces to actual outcomes. In the context of audit selection, the goal is to measure the difference between the final tax adjustment and the score assigned by the model.The chart in Figure 5shows the average tax adjustment for Audit Select scores between one and 1000. The average tax adjustment, as expected, increases in proportion to the scores. The chart shows, for example, that audits performed on taxpayers with scores below 100 resulted in an average tax adjustment of$3,300. In contrast, audits performed on taxpayers with scores above 900 resulted in an average tax adjustment of$78,000. While the percentage of taxpayers with scores higher than 900 is smaller than the percentage with scores below 100, additional research in this study revealed that only 57 percent of the current population scoring 900 or greater has been audited at least once. This taxpayer segment clearly represents a significant pool of audit candidates.Comparison with other selection strategiesAs demonstrated in the examples above, the data mining and predictive modeling techniques used by the Audit Select scoring system add a new level of sophistication to the audit selection process. The most accurate measure of a new technology, however, is not technical complexity or theoretical soundness, but the degree to which it improves an existing process. This measurement is of particular importance when less sophisticated, yet well-tested techniques are already in use.Other selection strategiesBefore evaluating the performance of the Audit Select scoring system, therefore, it is important to review the selection strategies that the State of Texas used for many years and continues to use alongside the new system.■Priority One: The State of Texas classifies all businesses that contribute to the top 65 percent of tax dollars collected as Priority One taxpayers. The state audits these taxpayers approximately every four years. The Priority One selection strategy uses a very simple scoring and ranking algorithm. The score is the relative percentage of sales tax dollars contributed to the state’s total collection. The state ranks taxpayers by score and then applies a threshold to produce the target list. There are similarities between the Priority One and Audit Select selection strategies, but the main difference is the logic behind the score. The Priority One score is one-dimensional; it considers only the reported tax amount. The Audit Select score, on the other hand, takes into account a number of taxpayer attributes(as depicted in Figure 3).■Prior Productive: This strategy automatically selects taxpayers with prior audit adjustments of more than $10,000. Aswith Priority One, the Prior Productive strategy uses only one selection criterion, which is one of several used by theAudit Select system.The average tax adjustment for Priority One audits is approximately$76,000 (median $9,600). This outcome is significantly higher than the $12,000 (median $1,300) average for non-Priority One audits and the $18,000 (median $1,600) average for all sales tax audits. Priority One audits also represent approximately nine percent of all Texas sales tax audits.Audit Select versus Priority OneWhen comparing selection methods, it is important to consider not only the average tax adjustment of each method, but also the volume of audits. This is particularly important when assessing a score-based solution such as the Audit Select system. In fact, while the score enables ranking of potential targets, each score range produces a different number of candidates. When comparing the Priority One criterion with the score criterion, it is therefore important to use a score range that produces an equal number of audits. In this case, we compare the average outcome of the top nine percent of all audits, based on the score, with the average outcome of Priority One audits(which represent nine percent of the audits).Based on scores produced by the Audit Select model, the average tax adjustment for the top nine percent of audits is approximately$88,000 (median $16,000) (see Figure 6). This is a 16 percent improvement over the Priority One average adjustment(65 percent improvement over the median adjustment) on an equal number of audits. Note that36 percent ofthe top nine percent of the Audit Select audits is made up of Priority One audits, while the remaining 64 percent would not have been selected by the Priority One strategy. This demonstrates that the Audit Select strategy is not necessarily orthogonal(or in contrast) to Priority One. The Audit Select score can, in fact, be used to improve on the Priority One strategyby further segmenting the Priority One population and eliminating audits likely to result in small tax adjustments.Audit Select versus Prior ProductiveAs with the previous comparison, there are significant differences in the results obtained by the Audit Select and Prior Productive selection strategies (see Figure 7). Approximately 20 percent of the audits performed fall within the Prior Productive selection criteria. Five percent of these, however, would also be selected by the Priority One strategy. Excluding those cases, the average tax adjustment for Prior Productive audits was $17,700 (median $4,100). In comparison, the top 20 percent of the audits, based on the Audit Select score (and excluding potential Priority One audits), shows an average adjustment of$27,000 (median $6,500).Model evolutionData mining has the most impact when it becomes an integral part of any system or application. Data mining models are rarely static. As more data sources are available and processes change, the models evolve. User feedback is also a critical component in model evolution.Since its implementation, the Audit Select scoring model has evolved significantly, and additional enhancements are currently in progress. Several factors contribute to the evolution of a system of this kind. First and foremost is the evolution of the data sources. The usefulness and accuracy of any model is determined primarily by the quality, quantity, and richness of the data available. The second factor is domain knowledge, which itself evolves to reflect emerging trends and changes in the data generation process. The final factor is technological improvements: More sophisticated modeling algorithms, data transformation strategies, and model segmentation techniques enable significantly improved end results.Other data mining applications in tax administrationAudit selection is one of many possible data mining applications in tax administration. In the area of tax compliance alone, tax collectors face diverse challenges, from underreporting to nonreporting to enforcement. Data mining offers many valuable techniques for increasing the efficiency and success rate of tax collection. Data mining also leverages the variety and quantity of internal and external data sources available in modern data warehousing systems. Here are just a few of the many potential uses of data mining in the context of tax compliance:Outlier-based selectionThis is an alternative approach to audit selection that uses specific data mining algorithms to build profiles of typical behaviors and then select taxpayers that do not match the profiles. This modeling process involves creating valid taxpayer segments, characterizing the normative taxpayer profile for each segment, and creating the rules for outlier detection. Though this method is similar to that applied by many audit selectors on a daily basis, the added benefit of the data mining approach is its ability to process large amounts of data and analyze multiple taxpayer characteristics simultaneously.Lead prioritizationMany tax agencies flag potential nonfilers by, for example, cross-matching external data with internal lists of current filers. This type of application often produces many leads, each of which must be manually verified and pursued. Given the limited staff available to follow up on leads, it is important to develop criteria for prioritizing the leads. Data mining, however, can predict the tax dollars owed by an organization or individual taxpayer by modeling the relationship between the attributes and reported tax for known taxpayers.Profiling for cross-tax affinityAgencies can use data mining to analyze existing taxpayers for associations between the types of businesses and the tax types(more than 30 in Texas) for which the taxpayers file. Co-occurrence of certain tax types may infer liability for another tax type for which the taxpayer did not file.Workflow analysisFollowing up on leads can be a lengthy process consisting of multiple mail exchanges between the tax agency and the organizations classified as potential nonfilers. Tracking this process, however, generates data that is useful for modeling the process. The process models can then be used to predict the value of current pipelines and to optimize resource allocation.Anomaly detectionTaxpayers self-report several important attributes(SIC, organization type, etc.). Due to unavoidable data entry errors, however, some taxpayers are categorized incorrectly or even uncategorized. Since these attributes and categories drive audit selection and other processes, it is always a good idea to apply data mining’s rule induction techniques to detect errors and anomalies.Economic models for optimal targetingIn most traditional targeting applications, such as target marketing or e-commerce fraud detection, there is a tradeoff between the number of audits to target and the cost per audit. When audit cost information is available, agencies can use data mining to develop targeting strategies that maximize overall collections.ConclusionAs outlined in this paper, data mining has many existing and potential applications in tax administration. In the case of the Audit Division of the Texas Comptroller of Public Accounts, predictive modeling enables the agency to identify noncompliant taxpayers more efficiently and effectively, and to focus auditing resources on the accounts most likely to produce positive tax adjustments. This helps the agency make better use of its human and other resources, and minimizes the financial burden on compliant taxpayers. Data mining also helps the agency refine its traditional audit selection strategies to produce more accurate results.About ClementineClementine is a data mining workbench that enables organizations to quickly develop predictive models using business expertise, and then deploy them to improve decision making. Clementine is widely regarded as the leading data mining workbench because it delivers the maximum return on data investment in the minimum amount of time. Unlike other data mining workbenches, which focus merely on models for enhancing performance, Clementine supports the entire data mining process to reduce time-to-solution. And Clementine is designed around the de facto industry standard for data mining—CRISP-DM. CRISP-DM makes data mining a business process by focusing data mining technology on solving specific business problems.About SPSS Inc.SPSS Inc. [NASDAQ: SPSS] is the world's leading provider of predictive analytics software and services. The company’s predictive analytics technology connects data to effective action by drawing reliable conclusions about current conditions and future events. More than 250,000 commercial, academic,and public sector organizations rely on SPSS technology to help increase revenue, reduce costs, improve processes, and detect and prevent fraud. Founded in 1968, SPSS is headquartered in Chicago, Illinois. To learn more, please visit . For SPSS office locations and telephone numbers, go to /worldwide.。
我所知道的一点DataMining-电子邮件系统
◎我所知道的一點Data Mining1.前言2.定義3.方法4.工具5.應用6.結論◎以上內容提供者:趙民德中央研究院統計科學研究所◎◎資料採礦(Data Mining)連載之一‧何謂DATA MINING‧DATA MINING和統計分析的不同‧為什麼需要DATA MINING何謂DATA MINING?資料採礦的工作(Data Mining)是近年來資料庫應用領域中,相當熱門的議題。
它是個神奇又時髦的技術,但卻也不是什麼新東西,因為Data Mining使用的分析方法,如預測模型(迴歸、時間數列)、資料庫分割(Database Segmentation)、連接分析(Link Analysis)、偏差偵測(Deviation Detection)等;美國政府從第二次世界大戰前,就在人口普查以及軍事方面使用這些技術,但是資訊科技的進展超乎想像,新工具的出現,例如關連式資料庫、物件導向資料庫、柔性計算理論(包括Neural network、Fuzzy theory、Genetic Algorithms、Rough Set等)、人工智慧的應用(如知識工程、專家系統),以及網路通訊技術的發展,使從資料堆中挖掘寶藏,常常能超越歸納範圍的關係;使Data Mining成為企業智慧的一部份。
Data Mining是一個浮現中的新領域。
在範圍和定義上、推理和期望上有一些不同。
挖掘的資訊和知識從巨大的資料庫而來,它被許多研究者在資料庫系統和機器學習(Machine learning)當作關鍵研究議題,而且也被企業體當作主要利基的重要所在。
有許多不同領域的專家,對Data Mining展現出極大興趣,例如在資訊服務業中,浮現一些應用,如在Internet之資料倉儲和線上服務,並且增加企業的許多生機。
隨著資訊科技的進步以及電子化時代的來臨,現今企業所面對的是一個與以往截然不同的競爭環境。
在資訊科技的推波助瀾下,不僅企業競爭的強度與速度倍數於以往,激增的市場交易也使得各企業所需儲存與處理的資料量越來越龐大。
Data-Mining-OverviewPPT课件
Extract interesting and useful knowledge from the data Find rules, regularities, irregularities, patterns, constraints hopefully, this will help us better compete in business, do research, learn
Data collection and data availability
Automated data collection tools, database systems, Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, … Science: Remote sensing, bioinformatics, scientific simulation, … Society and everyone: news, images, video, documents, ….
Knowledge
Information read, heard or seen and understood and
integrated
Wisdom
Distilled knowledge and understanding which can
lead to decisions
4
What is Data Mining
which products in the catalog often sell together market segmentation (find groups of customers/users with similar
数据挖掘(Data Mining)
数据挖掘(Data Mining)DM:数据挖掘(Data Mining)KDD:知识发现(Knowledge Discovery in Databases)一、背景1、目前的数据库系统虽然可以高效地实现数据的录入、查询、统计等功能,但无法发现数据中存在的关系和规则2、数据十分丰富,而信息相当贫乏。
3、数据坟墓二、数据挖掘的定义1、数据挖掘是从大量数据中提取或“挖掘”知识2、数据挖掘(Data Mining)就是从大量的、不完全的、有噪声的、模糊的、随机的实际应用数据中,提取隐含在其中的、人们事先不知道的、但又是潜在有用的信息和知识的过程n所谓基于数据库的知识发现3、所谓基于数据库的知识发现(KDD)是指从大量数据中提取有效的、新颖的、潜在有用的、最终可被理解的模式的非平凡过程。
OLAP【联机分析处理】面向主题的,主要面向公司领导者;OLTP【联机事务处理】面向应用的,主要面向公司职员。
OLAP是验证型的,建立在数据仓库的基础上;数据挖掘是挖掘型的,建立在各种数据源的基础上三、数据挖掘工具:DBMiner、Admocs、Predictive-CRM、SAS/EM(Enterprise Miner)、Weka目前,世界上比较有影响的典型数据挖掘系统包括:•SAS公司的Enterprise Miner•IBM公司的Intelligent Miner•SGI公司的SetMiner•SPSS公司的Clementine•Sybase公司的Warehouse Studio•RuleQuest Research公司的See5•还有CoverStory、EXPLORA、Knowledge Discovery Workbench、DBMiner、Quest等。
四、KDD过程在上述步骤中,数据挖掘占据非常重要的地位,它主要是利用某些特定的知识发现算法,在一定的运算效率范围内,从数据中发现出有关知识,决定了整个KDD过程的效果与效率。
Data Mining: What is Data Mining?
Data Mining: What is Data Mining? OverviewGenerally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cuts costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.【一般来说,数据挖掘(有时也被称为数据或知识发现)是从不同的角度进行分析和总结数据转化为有用信息的信息的过程。
可以用来增加收入,降低成本,或两者兼而有之。
数据挖掘软件是众多数据分析工具之一。
它允许用户分析来自许多不同的层面或角度的数据,归类,发现和总结的关系。
从技术上讲,数据挖掘是在多个领域的大型关系数据库中发现相互关系或模式的过程。
】Continuous InnovationAlthough data mining is a relatively new term, the technology is not. Companies have used powerful computers to sift through volumes of supermarket scanner data and analyze market research reports for years. However, continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost.【不断创新虽然数据挖掘是一个相对较新的术语,但其技术不是。
6-data mining(1)
Part II Data MiningOutlineThe Concept of Data Mining(数据挖掘概念) Architecture of a Typical Data Mining System (数据挖掘系统结构)What can be Mined? (能挖掘什么?)Major Issues(主要问题)in Data MiningData Cleaning(数据清理)3What Is Data Mining?Data mining is the process of discovering interesting knowledge from large amounts of data. (数据挖掘是从大量数据中发现有趣知识的过程) The main difference that separates information retrieval apart from data mining is their goals. (数据挖掘和信息检索的主要差别在于他们的目标) Information retrieval is to help users search for documents or data that satisfy their information needs(信息检索帮用户寻找他们需要的文档/数据)e.g. Find customers who have purchased more than $10,000 in the last month .(查找上个月购物量超过1万美元的客户)Data mining discovers useful knowledge by analyzing data correlations using sophisticated data mining techniques(数据挖掘用复杂技术分析…)e.g. Find all items which are frequently purchased with milk .(查找经常和牛奶被购买的商品)A KDD Process (1) Some people view data mining as synonymous5A KDD Process (2)Learning the application domain (学习应用领域相关知识):Relevant knowledge & goals of application (相关知识和目标) Creating a target data set (建立目标数据集) Data selection, Data cleaning and preprocessing (预处理)Choosing functions of data mining (选择数据挖掘功能)Summarization, classification, association, clustering , etc.Choosing the mining algorithm(s) (选择挖掘算法)Data mining (进行数据挖掘): search for patterns of interest Pattern evaluation and knowledge presentation (模式评估和知识表示)Removing redundant patterns, visualization, transformation, etc.Present results to user in meaningful manner.Use of discovered knowledge (使用所发现的知识)7Concept/class description (概念/类描述)Characterization(特征): provide a summarization of the given data set Comparison(区分): mine distinguishing characteristics(挖掘区别特征)that differentiate a target class from comparable contrasting classes. Association rules (correlation and causality)(关联规则)Association rules are of the form(这种形式的规则): X ⇒Y,Examples: contains(T, “computer”) ⇒contains(T, “software”)[support = 1%, confidence = 50%]age(X, “20..29”) ∧income(X, “20..29K ”) ⇒buys(X, “PC ”)[support = 2%, confidence = 60%]Classification and Prediction (分类和预测)Find models that describe and distinguish classes for future prediction.What kinds of patterns can be mined?(1)What kinds of patterns can be mined?(2)Cluster(聚类)Group data to form some classes(将数据聚合成一些类)Principle: maximizing the intra-class similarity and minimizing the interclass similarity (原则: 最大化类内相似度,最小化类间相似度) Outlier analysis: objects that do not comply with the general behavior / data model. (局外者分析: 发现与一般行为或数据模型不一致的对象) Trend and evolution analysis (趋势和演变分析)Sequential pattern mining(序列模式挖掘)Regression analysis(回归分析)Periodicity analysis(周期分析)Similarity-based analysis(基于相似度分析)What kinds of patterns can be mined?(3)In the context of text and Web mining, the knowledge also includes: (在文本挖掘或web挖掘中还可以发现)Word association (术语关联)Web resource discovery (WEB资源发现)News Event (新闻事件)Browsing behavior (浏览行为)Online communities (网上社团)Mining Web link structures to identify authoritative Web pages finding spam sites (发现垃圾网站)Opinion Mining (观点挖掘)…10Major Issues in Data Mining (1)Mining methodology(挖掘方法)and user interactionMining different kinds of knowledge in DBs (从DB 挖掘不同类型知识) Interactive mining of knowledge at multiple levels of abstraction (在多个抽象层上交互挖掘知识)Incorporation of background knowledge (结合背景知识)Data mining query languages (数据挖掘查询语言)Presentation and visualization of data mining results(结果可视化表示) Handling noise and incomplete data (处理噪音和不完全数据) Pattern evaluation (模式评估)Performance and scalability (性能和可伸缩性) Efficiency(有效性)and scalability(可伸缩性)of data mining algorithmsParallel(并行), distributed(分布) & incremental(增量)mining methods©Wu Yangyang 11Major Issues in Data Mining (2)Issues relating to the diversity of data types (数据多样性相关问题)Handling relational and complex types of data (关系和复杂类型数据) Mining information from heterogeneous databases and www(异质异构) Issues related to applications (应用相关的问题) Application of discovered knowledge (所发现知识的应用)Domain-specific data mining tools (面向特定领域的挖掘工具)Intelligent query answering (智能问答) Process control(过程控制)and decision making(决策制定)Integration of the discovered knowledge with existing knowledge:A knowledge fusion problem (知识融合)Protection of data security(数据安全), integrity(完整性), and privacy12CulturesDatabases: concentrate on large-scale (non-main-memory) data.(数据库:关注大规模数据)To a database person, data-mining is an extreme form of analytic processing. Result is the data that answers the query.(对数据库工作者而言数据挖掘是一种分析处理, 其结果就是问题答案) AI (machine-learning): concentrate on complex methods, small data.(人工智能(机器学习):关注复杂方法,小数据)Statistics: concentrate on models. (统计:关注模型.)To a statistician, data-mining is the inference of models. Result is the parameters of the model (数据挖掘是模型推论, 其结果是一些模型参数)e.g. Given a billion numbers, a statistician might fit the billion points to the best Gaussian distribution and report the mean and standard deviation.©Wu Yangyang 13Data Cleaning (1)Data Preprocessing (数据预处理):Cleaning, integration, transformation, reduction, discretization (离散化) Why data cleaning? (为什么要清理数据?)--No quality data, no quality mining results! Garbage in, Garbage out! Measure of data quality (数据质量的度量标准)Accuracy (正确性)Completeness (完整性)Consistency(一致)Timeliness(适时)Believability(可信)Interpretability(可解释性) Accessibility(可存取性)14Data Cleaning (2)Data in the real world is dirtyIncomplete (不完全):Lacking some attribute values (缺少一些属性值)Lacking certain interest attributes /containing only aggregate data(缺少某些有用属性或只包含聚集数据)Noisy(有噪音): containing errors or outliers(包含错误或异常) Inconsistent: containing discrepancies in codes or names(不一致: 编码或名称存在差异)Major tasks in data cleaning (数据清理的主要任务)Fill in missing values (补上缺少的值)Identify outliers(识别出异常值)and smooth out noisy data(消除噪音)Correct inconsistent data(校正不一致数据) Resolve redundancy caused by data integration (消除集成产生的冗余)15Data Cleaning (3)Handle missing values (处理缺值问题) Ignore the tuple (忽略该元组) Fill in the missing value manually (人工填补) Use a global constant to fill in the missing value (用全局常量填补) Use the attribute mean to fill in the missing value (该属性平均值填补) Use the attribute mean for all samples belonging to the same class to fill in the missing value (用同类的属性平均值填补) Use the most probable value(最大可能的值)to fill in the missing value Identify outliers and smooth out noisy data(识别异常值和消除噪音)Binning method (分箱方法):First sort data and partition into bins (先排序、分箱)Then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.(然后用平均值、中值、边界值平滑)©Wu Yangyang 16Data Cleaning (4)Example: Sorted data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 Partition into (equi-depth) bins (分成等深的箱):-Bin 1: 4, 8, 9, 15-Bin 2: 21, 21, 24, 25-Bin 3: 26, 28, 29, 34Smoothing by bin means (用平均值平滑):-Bin 1: 9, 9, 9, 9-Bin 2: 23, 23, 23, 23-Bin 3: 29, 29, 29, 29Smoothing by bin boundaries (用边界值平滑):-Bin 1: 4, 4, 4, 15-Bin 2: 21, 21, 25, 25-Bin 3: 26, 26, 26, 34Clustering (。
Data Mining 兴起的原因
一‧Data Mining 興起的原因資料大量產生:電腦的使用率日漸普及,所以各個行業都普遍使用電腦來收集資料,然而在資料庫的設計上,收集的欄位可能達上百個,資料筆數更是無法計算,新的資料不斷的進來,所以時間愈長資料量就愈大,龐大資料庫的形成是可想而知的。
資料倉儲形成:如果我們將一筆筆的資料,按資料庫設計者設計的型態分門別類的依序存放於資料庫中,一段時間之後形成了一個大型的資料庫,我們便可從這些資料當中找尋出可被利用的資訊,而這個經過分門別類所設計出來的資料庫,就成了資料倉儲(data warehouse)。
資料倉儲就是一種將資料聚集成資訊來源的場所。
電腦軟體配合發展:雖然資料挖掘的這些定義有點不可觸摸,但在目前它已經成爲一種商業事業。
如同在過去的歷次淘金熱中一樣,目標是`開發礦工`。
利潤最大的是賣工具給礦工,而不是幹實際的開發。
資料挖掘這個概念被用作一種裝備來出售電腦軟體。
以下是一些當前的資料採礦産品:資料採礦是利用統計與人工智慧的演算法,從龐大的企業歷史資料中,找出隱藏的規律及建立精準的模型,用以預測未來,提供有效的市場行銷以及顧客管理所需。
利用分析工具,在大型資料庫中發現資料的特殊型式以及相互關係的過程,稱為資料採礦。
近來線上的公司行號開始試著分析網頁伺服器裡頭大量的使用者紀錄及訂單資料,因此資料採礦在全球資訊網上的功能也日益顯著。
我們著手解決一個網路上資料分類的問題,利用的主要工具是Support vector machine。
了解到前置處理在資料採礦中的重要性。
依據麻省理工學院(MIT)2000年1月出版的Technology Review雜誌,所選出可改變未來世界的10大科技創新中,資料採礦(Data Mining)技術為企業提煉商業智慧的最佳工具。
資料採礦(Data Mining)意指從大量的資料中去尋找新的資訊或獲取新的知識,也就是所謂的Knowledge Discovery in Databases (KDD) ,例如針對消費者交易與特徵作資料採礦。
数据仓库发掘隐藏财富英文版
An Introduction to Data MiningDiscovering hidden value in your data warehouseOverviewData mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"This white paper provides an introduction to the basic technologies of data mining. Examples of profitable applications illustrate its relevance to today’s business environment as well as a basic description of how data warehouse architectures can evolve to deliver the value of data mining to end users.The Foundations of Data MiningData mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective andproactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:•Massive data collection•Powerful multiprocessor computers•Data mining algorithmsCommercial databases are growing at unprecedented rates. A recent META Group survey of data warehouse projects found that 19% of respondents are beyond the 50 gigabyte level, while 59% expect to be there by second quarter of 1996.1 In some industries, such as retail, these numbers can be much larger. The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.In the evolution from business data to business information, each new step has built upon the previous one. For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical to data mining. From the user’s point of view, the four steps listed in Table 1 were revolutionary because they allowed new business questions to be answered accurately and quickly.Table 1. Steps in the Evolution of Data Mining.The core components of data mining technology have been under development for decades, in research areas such as statistics, artificial intelligence, and machine learning. Today, the maturity of these techniques, coupled with high-performance relational database engines and broad data integration efforts, make these technologies practical for current data warehouse environments.The Scope of Data MiningData mining derives its name from the similarities between searching for valuable business information in a large database —for example, finding linked products in gigabytes of store scanner data —and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:•Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensivehands-on analysis can now be answered directly from the data —quickly. A typical example of a predictive problem is targetedmarketing. Data mining uses data on past promotional mailings toidentify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments ofa population likely to respond similarly to given events.•Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hiddenpatterns in one step. An example of pattern discovery is theanalysis of retail sales data to identify seemingly unrelatedproducts that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.Data mining techniques can yield the benefits of automation on existing software and hardware platforms, and can be implemented on new systems as existing platforms are upgraded and new products developed. When data mining tools are implemented on high performance parallel processing systems, they can analyze massive databases in minutes. Faster processing means that users can automatically experiment with more models to understand complex data. High speed makes it practical for users to analyze huge quantities of data. Larger databases, in turn, yield improved predictions.Databases can be larger in both depth and breadth:•More columns. Analysts must often limit the number of variables they examine when doing hands-on analysis due to time constraints. Yet variables that are discarded because they seem unimportant maycarry information about unknown patterns. High performance data mining allows users to explore the full depth of a database, without preselecting a subset of variables.•More rows. Larger samples yield lower estimation errors and variance, and allow users to make inferences about small butimportant segments of a population.A recent Gartner Group Advanced Technology Research Note listed data mining and artificial intelligence at the top of the five key technology areas that "will clearly have a major impact across a wide range of industries within the next 3 to 5 years."2 Gartner also listed parallel architectures and data mining as two of the top 10 new technologies in which companies will invest during the next 5 years. According to a recent Gartner HPC Research Note, "With the rapid advance in data capture, transmission and storage, large-systems users will increasingly need toimplement new and innovative ways to mine the after-market value of their vast stores of detail data, employing MPP [massively parallel processing] systems to create new sources of business advantage (0.9 probability)."3The most commonly used techniques in data mining are:•Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.•Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID) .•Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.•Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ³1). Sometimes called the k-nearest neighbor technique.•Rule induction: The extraction of useful if-then rules from data based on statistical significance.Many of these technologies have been in use for more than a decade in specialized analysis tools that work with relatively small volumes of data. These capabilities are now evolving to integrate directly withindustry-standard data warehouse and OLAP platforms. The appendix to this white paper provides a glossary of data mining terms.How Data Mining WorksHow exactly is data mining able to tell you important things that you didn't know or what is going to happen next? The technique that is used to perform these feats in data mining is called modeling. Modeling is simply the act of building a model in one situation where you know the answer and then applying it to another situation that you don't. For instance, if you were looking for a sunken Spanish galleon on the high seas the first thing you might do is to research the times when Spanish treasure had been found by others in the past. You might note that theseships often tend to be found off the coast of Bermuda and that there are certain characteristics to the ocean currents, and certain routes that have likely been taken by the ship’s captains in that era. You note these similarities and build a model that includes the characteristics that are common to the locations of these sunken treasures. With these models in hand you sail off looking for treasure where your model indicates it most likely might be given a similar situation in the past. Hopefully, if you've got a good model, you find your treasure.This act of model building is thus something that people have been doing for a long time, certainly before the advent of computers or data mining technology. What happens on computers, however, is not much different than the way people build models. Computers are loaded up with lots of information about a variety of situations where an answer is known and then the data mining software on the computer must run through that data and distill the characteristics of the data that should go into the model. Once the model is built it can then be used in similar situations where you don't know the answer. For example, say that you are the director of marketing for a telecommunications company and you'd like to acquire some new long distance phone customers. You could just randomly go out and mail coupons to the general population - just as you could randomly sail the seas looking for sunken treasure. In neither case would you achieve the results you desired and of course you have the opportunity to do much better than random - you could use your business experience stored in your database to build a model.As the marketing director you have access to a lot of information about all of your customers: their age, sex, credit history and long distance calling usage. The good news is that you also have a lot of information about your prospective customers: their age, sex, credit history etc. Your problem is that you don't know the long distance calling usage of these prospects (since they are most likely now customers of your competition). You'd like to concentrate on those prospects who have large amounts of long distance usage. You can accomplish this by building a model. Table 2 illustrates the data used for building a model for new customer prospecting in a data warehouse.Table 2 - Data Mining for ProspectingThe goal in prospecting is to make some calculated guesses about the information in the lower right hand quadrant based on the model that we build going from Customer General Information to Customer Proprietary Information. For instance, a simple model for a telecommunications company might be:98% of my customers who make more than $60,000/year spend more than$80/month on long distanceThis model could then be applied to the prospect data to try to tell something about the proprietary information that this telecommunications company does not currently have access to. With this model in hand new customers can be selectively targeted.Test marketing is an excellent source of data for this kind of modeling. Mining the results of a test market representing a broad but relatively small sample of prospects can provide a foundation for identifying good prospects in the overall market. Table 3 shows another common scenario for building models: predict what is going to happen in the future.Table 3 - Data Mining for PredictionsIf someone told you that he had a model that could predict customer usage how would you know if he really had a good model? The first thing you might try would be to ask him to apply his model to your customer base - where you already knew the answer. With data mining, the best way to accomplish this is by setting aside some of your data in a vault to isolate it from the mining process. Once the mining is complete, the results can be tested against the data held in the vault to confirm the model’s validity. If the model works, its observations should hold for the vaulted data.An Architecture for Data MiningTo best apply these advanced techniques, they must be fully integrated with a data warehouse as well as flexible interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on. Figure 1 illustrates an architecture for advanced analysis in a large data warehouse.Figure 1 - Integrated Data Mining ArchitectureThe ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact coupled with external market data about competitor activity. Background information on potential customers also provides an excellent basis for prospecting. This warehouse can be implemented in a variety of relational database systems:Sybase, Oracle, Redbrick, and so on, and should be optimized for flexible and fast data access.An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. The multidimensional structures allow the user to analyze the data as they want to view their business – summarizing by product line, region, and other key perspectives of their business. The Data Mining Server must be integrated with the data warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure. An advanced, process-centric metadata template defines the data mining objectives for specific business issues like campaign management, prospecting, and promotion optimization. Integration with the data warehouse enables operational decisions to be directly implemented and tracked. As the warehouse grows with new decisions and results, the organization can continually mine the best practices and apply them to future decisions.This design represents a fundamental shift from conventional decision support systems. Rather than simply delivering data to the end user through query and reporting software, the Advanced Analysis Server applies users’ business models directly to the warehouse and returns a proactive analysis of the most relevant information. These results enhance the metadata in the OLAP Server by providing a dynamic metadata layer that represents a distilled view of the data. Reporting, visualization, and other analysis tools can then be applied to plan future actions and confirm the impact of those plans.Profitable ApplicationsA wide range of companies have deployed successful applications of data mining. While early adopters of this technology have tended to be in information-intensive industries such as financial services and direct mail marketing, the technology is applicable to any company looking to leverage a large data warehouse to better manage their customer relationships. Two critical factors for success with data mining are: a large, well-integrated data warehouse and a well-defined understanding of the business process within which data mining is to be applied (such as customer prospecting, retention, campaign management, and so on).Some successful application areas include:• A pharmaceutical company can analyze its recent sales force activity and their results to improve targeting of high-valuephysicians and determine which marketing activities will have the greatest impact in the next few months. The data needs to include competitor market activity as well as information about the local health care systems. The results can be distributed to the sales force via a wide-area network that enables the representatives to review the recommendations from the perspective of the keyattributes in the decision process. The ongoing, dynamic analysis of the data warehouse allows best practices from throughout the organization to be applied in specific sales situations.• A credit card company can leverage its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product. Using a small test mailing, the attributes of customers with an affinity for the product can be identified.Recent projects have indicated more than a 20-fold decrease in costs for targeted mailing campaigns over conventional approaches.• A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services. Using data mining to analyze its own customer experience, this company can build a unique segmentation identifying theattributes of high-value prospects. Applying this segmentation toa general business database such as those provided by Dun &Bradstreet can yield a prioritized list of prospects by region.• A large consumer package goods company can apply data mining to improve its sales process to retailers. Data from consumer panels, shipments, and competitor activity can be applied to understand the reasons for brand and store switching. Through this analysis, the manufacturer can select promotional strategies that best reachtheir target customer segments.Each of these examples have a clear common ground. They leverage the knowledge about customers implicit in a data warehouse to reduce costs and improve the value of customer relationships. These organizations can now focus their efforts on the most important (profitable) customers and prospects, and design targeted marketing strategies to best reach them.ConclusionComprehensive data warehouses that integrate operational data with customer, supplier, and market information have resulted in an explosion of information. Competition requires timely and sophisticated analysis on an integrated view of the data. However, there is a growing gap between more powerful storage and retrieval systems and the users’ ability to effectively analyze and act on the information they contain. Both relational and OLAP technologies have tremendous capabilities fornavigating massive data warehouses, but brute force navigation of data is not enough. A new technological leap is needed to structure and prioritize information for specific end-user problems. The data mining tools can make this leap. Quantifiable business benefits have been proven through the integration of data mining with current information systems, and new products are on the horizon that will bring this integration to an even wider audience of users.1 META Group Application Development Strategies: "Data Mining for Data Warehouses: Uncovering Hidden Patterns.", 7/13/95 .2 Gartner Group Advanced Technologies and Applications Research Note, 2/1/95.3 Gartner Group High Performance Computing Research Note, 1/31/95.Glossary of Data Mining Terms。
From Data Mining to Knowledge Discovery in Databases
s Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media atten-tion of late. What is all the excitement about?This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges in-volved in real-world applications of knowledge discovery, and current and future research direc-tions in the field.A cross a wide variety of fields, data arebeing collected and accumulated at adramatic pace. There is an urgent need for a new generation of computational theo-ries and tools to assist humans in extracting useful information (knowledge) from the rapidly growing volumes of digital data. These theories and tools are the subject of the emerging field of knowledge discovery in databases (KDD).At an abstract level, the KDD field is con-cerned with the development of methods and techniques for making sense of data. The basic problem addressed by the KDD process is one of mapping low-level data (which are typically too voluminous to understand and digest easi-ly) into other forms that might be more com-pact (for example, a short report), more ab-stract (for example, a descriptive approximation or model of the process that generated the data), or more useful (for exam-ple, a predictive model for estimating the val-ue of future cases). At the core of the process is the application of specific data-mining meth-ods for pattern discovery and extraction.1This article begins by discussing the histori-cal context of KDD and data mining and theirintersection with other related fields. A briefsummary of recent KDD real-world applica-tions is provided. Definitions of KDD and da-ta mining are provided, and the general mul-tistep KDD process is outlined. This multistepprocess has the application of data-mining al-gorithms as one particular step in the process.The data-mining step is discussed in more de-tail in the context of specific data-mining al-gorithms and their application. Real-worldpractical application issues are also outlined.Finally, the article enumerates challenges forfuture research and development and in par-ticular discusses potential opportunities for AItechnology in KDD systems.Why Do We Need KDD?The traditional method of turning data intoknowledge relies on manual analysis and in-terpretation. For example, in the health-careindustry, it is common for specialists to peri-odically analyze current trends and changesin health-care data, say, on a quarterly basis.The specialists then provide a report detailingthe analysis to the sponsoring health-care or-ganization; this report becomes the basis forfuture decision making and planning forhealth-care management. In a totally differ-ent type of application, planetary geologistssift through remotely sensed images of plan-ets and asteroids, carefully locating and cata-loging such geologic objects of interest as im-pact craters. Be it science, marketing, finance,health care, retail, or any other field, the clas-sical approach to data analysis relies funda-mentally on one or more analysts becomingArticlesFALL 1996 37From Data Mining to Knowledge Discovery inDatabasesUsama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth Copyright © 1996, American Association for Artificial Intelligence. All rights reserved. 0738-4602-1996 / $2.00areas is astronomy. Here, a notable success was achieved by SKICAT ,a system used by as-tronomers to perform image analysis,classification, and cataloging of sky objects from sky-survey images (Fayyad, Djorgovski,and Weir 1996). In its first application, the system was used to process the 3 terabytes (1012bytes) of image data resulting from the Second Palomar Observatory Sky Survey,where it is estimated that on the order of 109sky objects are detectable. SKICAT can outper-form humans and traditional computational techniques in classifying faint sky objects. See Fayyad, Haussler, and Stolorz (1996) for a sur-vey of scientific applications.In business, main KDD application areas includes marketing, finance (especially in-vestment), fraud detection, manufacturing,telecommunications, and Internet agents.Marketing:In marketing, the primary ap-plication is database marketing systems,which analyze customer databases to identify different customer groups and forecast their behavior. Business Week (Berry 1994) estimat-ed that over half of all retailers are using or planning to use database marketing, and those who do use it have good results; for ex-ample, American Express reports a 10- to 15-percent increase in credit-card use. Another notable marketing application is market-bas-ket analysis (Agrawal et al. 1996) systems,which find patterns such as, “If customer bought X, he/she is also likely to buy Y and Z.” Such patterns are valuable to retailers.Investment: Numerous companies use da-ta mining for investment, but most do not describe their systems. One exception is LBS Capital Management. Its system uses expert systems, neural nets, and genetic algorithms to manage portfolios totaling $600 million;since its start in 1993, the system has outper-formed the broad stock market (Hall, Mani,and Barr 1996).Fraud detection: HNC Falcon and Nestor PRISM systems are used for monitoring credit-card fraud, watching over millions of ac-counts. The FAIS system (Senator et al. 1995),from the U.S. Treasury Financial Crimes En-forcement Network, is used to identify finan-cial transactions that might indicate money-laundering activity.Manufacturing: The CASSIOPEE trou-bleshooting system, developed as part of a joint venture between General Electric and SNECMA, was applied by three major Euro-pean airlines to diagnose and predict prob-lems for the Boeing 737. To derive families of faults, clustering methods are used. CASSIOPEE received the European first prize for innova-intimately familiar with the data and serving as an interface between the data and the users and products.For these (and many other) applications,this form of manual probing of a data set is slow, expensive, and highly subjective. In fact, as data volumes grow dramatically, this type of manual data analysis is becoming completely impractical in many domains.Databases are increasing in size in two ways:(1) the number N of records or objects in the database and (2) the number d of fields or at-tributes to an object. Databases containing on the order of N = 109objects are becoming in-creasingly common, for example, in the as-tronomical sciences. Similarly, the number of fields d can easily be on the order of 102or even 103, for example, in medical diagnostic applications. Who could be expected to di-gest millions of records, each having tens or hundreds of fields? We believe that this job is certainly not one for humans; hence, analysis work needs to be automated, at least partially.The need to scale up human analysis capa-bilities to handling the large number of bytes that we can collect is both economic and sci-entific. Businesses use data to gain competi-tive advantage, increase efficiency, and pro-vide more valuable services to customers.Data we capture about our environment are the basic evidence we use to build theories and models of the universe we live in. Be-cause computers have enabled humans to gather more data than we can digest, it is on-ly natural to turn to computational tech-niques to help us unearth meaningful pat-terns and structures from the massive volumes of data. Hence, KDD is an attempt to address a problem that the digital informa-tion era made a fact of life for all of us: data overload.Data Mining and Knowledge Discovery in the Real WorldA large degree of the current interest in KDD is the result of the media interest surrounding successful KDD applications, for example, the focus articles within the last two years in Business Week , Newsweek , Byte , PC Week , and other large-circulation periodicals. Unfortu-nately, it is not always easy to separate fact from media hype. Nonetheless, several well-documented examples of successful systems can rightly be referred to as KDD applications and have been deployed in operational use on large-scale real-world problems in science and in business.In science, one of the primary applicationThere is an urgent need for a new generation of computation-al theories and tools toassist humans in extractinguseful information (knowledge)from the rapidly growing volumes ofdigital data.Articles38AI MAGAZINEtive applications (Manago and Auriol 1996).Telecommunications: The telecommuni-cations alarm-sequence analyzer (TASA) wasbuilt in cooperation with a manufacturer oftelecommunications equipment and threetelephone networks (Mannila, Toivonen, andVerkamo 1995). The system uses a novelframework for locating frequently occurringalarm episodes from the alarm stream andpresenting them as rules. Large sets of discov-ered rules can be explored with flexible infor-mation-retrieval tools supporting interactivityand iteration. In this way, TASA offers pruning,grouping, and ordering tools to refine the re-sults of a basic brute-force search for rules.Data cleaning: The MERGE-PURGE systemwas applied to the identification of duplicatewelfare claims (Hernandez and Stolfo 1995).It was used successfully on data from the Wel-fare Department of the State of Washington.In other areas, a well-publicized system isIBM’s ADVANCED SCOUT,a specialized data-min-ing system that helps National Basketball As-sociation (NBA) coaches organize and inter-pret data from NBA games (U.S. News 1995). ADVANCED SCOUT was used by several of the NBA teams in 1996, including the Seattle Su-personics, which reached the NBA finals.Finally, a novel and increasingly importanttype of discovery is one based on the use of in-telligent agents to navigate through an infor-mation-rich environment. Although the ideaof active triggers has long been analyzed in thedatabase field, really successful applications ofthis idea appeared only with the advent of theInternet. These systems ask the user to specifya profile of interest and search for related in-formation among a wide variety of public-do-main and proprietary sources. For example, FIREFLY is a personal music-recommendation agent: It asks a user his/her opinion of several music pieces and then suggests other music that the user might like (<http:// www.ffl/>). CRAYON(/>) allows users to create their own free newspaper (supported by ads); NEWSHOUND(<http://www. /hound/>) from the San Jose Mercury News and FARCAST(</> automatically search information from a wide variety of sources, including newspapers and wire services, and e-mail rele-vant documents directly to the user.These are just a few of the numerous suchsystems that use KDD techniques to automat-ically produce useful information from largemasses of raw data. See Piatetsky-Shapiro etal. (1996) for an overview of issues in devel-oping industrial KDD applications.Data Mining and KDDHistorically, the notion of finding useful pat-terns in data has been given a variety ofnames, including data mining, knowledge ex-traction, information discovery, informationharvesting, data archaeology, and data patternprocessing. The term data mining has mostlybeen used by statisticians, data analysts, andthe management information systems (MIS)communities. It has also gained popularity inthe database field. The phrase knowledge dis-covery in databases was coined at the first KDDworkshop in 1989 (Piatetsky-Shapiro 1991) toemphasize that knowledge is the end productof a data-driven discovery. It has been popular-ized in the AI and machine-learning fields.In our view, KDD refers to the overall pro-cess of discovering useful knowledge from da-ta, and data mining refers to a particular stepin this process. Data mining is the applicationof specific algorithms for extracting patternsfrom data. The distinction between the KDDprocess and the data-mining step (within theprocess) is a central point of this article. Theadditional steps in the KDD process, such asdata preparation, data selection, data cleaning,incorporation of appropriate prior knowledge,and proper interpretation of the results ofmining, are essential to ensure that usefulknowledge is derived from the data. Blind ap-plication of data-mining methods (rightly crit-icized as data dredging in the statistical litera-ture) can be a dangerous activity, easilyleading to the discovery of meaningless andinvalid patterns.The Interdisciplinary Nature of KDDKDD has evolved, and continues to evolve,from the intersection of research fields such asmachine learning, pattern recognition,databases, statistics, AI, knowledge acquisitionfor expert systems, data visualization, andhigh-performance computing. The unifyinggoal is extracting high-level knowledge fromlow-level data in the context of large data sets.The data-mining component of KDD cur-rently relies heavily on known techniquesfrom machine learning, pattern recognition,and statistics to find patterns from data in thedata-mining step of the KDD process. A natu-ral question is, How is KDD different from pat-tern recognition or machine learning (and re-lated fields)? The answer is that these fieldsprovide some of the data-mining methodsthat are used in the data-mining step of theKDD process. KDD focuses on the overall pro-cess of knowledge discovery from data, includ-ing how the data are stored and accessed, howalgorithms can be scaled to massive data setsThe basicproblemaddressed bythe KDDprocess isone ofmappinglow-leveldata intoother formsthat might bemorecompact,moreabstract,or moreuseful.ArticlesFALL 1996 39A driving force behind KDD is the database field (the second D in KDD). Indeed, the problem of effective data manipulation when data cannot fit in the main memory is of fun-damental importance to KDD. Database tech-niques for gaining efficient data access,grouping and ordering operations when ac-cessing data, and optimizing queries consti-tute the basics for scaling algorithms to larger data sets. Most data-mining algorithms from statistics, pattern recognition, and machine learning assume data are in the main memo-ry and pay no attention to how the algorithm breaks down if only limited views of the data are possible.A related field evolving from databases is data warehousing,which refers to the popular business trend of collecting and cleaning transactional data to make them available for online analysis and decision support. Data warehousing helps set the stage for KDD in two important ways: (1) data cleaning and (2)data access.Data cleaning: As organizations are forced to think about a unified logical view of the wide variety of data and databases they pos-sess, they have to address the issues of map-ping data to a single naming convention,uniformly representing and handling missing data, and handling noise and errors when possible.Data access: Uniform and well-defined methods must be created for accessing the da-ta and providing access paths to data that were historically difficult to get to (for exam-ple, stored offline).Once organizations and individuals have solved the problem of how to store and ac-cess their data, the natural next step is the question, What else do we do with all the da-ta? This is where opportunities for KDD natu-rally arise.A popular approach for analysis of data warehouses is called online analytical processing (OLAP), named for a set of principles pro-posed by Codd (1993). OLAP tools focus on providing multidimensional data analysis,which is superior to SQL in computing sum-maries and breakdowns along many dimen-sions. OLAP tools are targeted toward simpli-fying and supporting interactive data analysis,but the goal of KDD tools is to automate as much of the process as possible. Thus, KDD is a step beyond what is currently supported by most standard database systems.Basic DefinitionsKDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimate-and still run efficiently, how results can be in-terpreted and visualized, and how the overall man-machine interaction can usefully be modeled and supported. The KDD process can be viewed as a multidisciplinary activity that encompasses techniques beyond the scope of any one particular discipline such as machine learning. In this context, there are clear opportunities for other fields of AI (be-sides machine learning) to contribute to KDD. KDD places a special emphasis on find-ing understandable patterns that can be inter-preted as useful or interesting knowledge.Thus, for example, neural networks, although a powerful modeling tool, are relatively difficult to understand compared to decision trees. KDD also emphasizes scaling and ro-bustness properties of modeling algorithms for large noisy data sets.Related AI research fields include machine discovery, which targets the discovery of em-pirical laws from observation and experimen-tation (Shrager and Langley 1990) (see Kloes-gen and Zytkow [1996] for a glossary of terms common to KDD and machine discovery),and causal modeling for the inference of causal models from data (Spirtes, Glymour,and Scheines 1993). Statistics in particular has much in common with KDD (see Elder and Pregibon [1996] and Glymour et al.[1996] for a more detailed discussion of this synergy). Knowledge discovery from data is fundamentally a statistical endeavor. Statistics provides a language and framework for quan-tifying the uncertainty that results when one tries to infer general patterns from a particu-lar sample of an overall population. As men-tioned earlier, the term data mining has had negative connotations in statistics since the 1960s when computer-based data analysis techniques were first introduced. The concern arose because if one searches long enough in any data set (even randomly generated data),one can find patterns that appear to be statis-tically significant but, in fact, are not. Clearly,this issue is of fundamental importance to KDD. Substantial progress has been made in recent years in understanding such issues in statistics. Much of this work is of direct rele-vance to KDD. Thus, data mining is a legiti-mate activity as long as one understands how to do it correctly; data mining carried out poorly (without regard to the statistical as-pects of the problem) is to be avoided. KDD can also be viewed as encompassing a broader view of modeling than statistics. KDD aims to provide tools to automate (to the degree pos-sible) the entire process of data analysis and the statistician’s “art” of hypothesis selection.Data mining is a step in the KDD process that consists of ap-plying data analysis and discovery al-gorithms that produce a par-ticular enu-meration ofpatterns (or models)over the data.Articles40AI MAGAZINEly understandable patterns in data (Fayyad, Piatetsky-Shapiro, and Smyth 1996).Here, data are a set of facts (for example, cases in a database), and pattern is an expres-sion in some language describing a subset of the data or a model applicable to the subset. Hence, in our usage here, extracting a pattern also designates fitting a model to data; find-ing structure from data; or, in general, mak-ing any high-level description of a set of data. The term process implies that KDD comprises many steps, which involve data preparation, search for patterns, knowledge evaluation, and refinement, all repeated in multiple itera-tions. By nontrivial, we mean that some search or inference is involved; that is, it is not a straightforward computation of predefined quantities like computing the av-erage value of a set of numbers.The discovered patterns should be valid on new data with some degree of certainty. We also want patterns to be novel (at least to the system and preferably to the user) and poten-tially useful, that is, lead to some benefit to the user or task. Finally, the patterns should be understandable, if not immediately then after some postprocessing.The previous discussion implies that we can define quantitative measures for evaluating extracted patterns. In many cases, it is possi-ble to define measures of certainty (for exam-ple, estimated prediction accuracy on new data) or utility (for example, gain, perhaps indollars saved because of better predictions orspeedup in response time of a system). No-tions such as novelty and understandabilityare much more subjective. In certain contexts,understandability can be estimated by sim-plicity (for example, the number of bits to de-scribe a pattern). An important notion, calledinterestingness(for example, see Silberschatzand Tuzhilin [1995] and Piatetsky-Shapiro andMatheus [1994]), is usually taken as an overallmeasure of pattern value, combining validity,novelty, usefulness, and simplicity. Interest-ingness functions can be defined explicitly orcan be manifested implicitly through an or-dering placed by the KDD system on the dis-covered patterns or models.Given these notions, we can consider apattern to be knowledge if it exceeds some in-terestingness threshold, which is by nomeans an attempt to define knowledge in thephilosophical or even the popular view. As amatter of fact, knowledge in this definition ispurely user oriented and domain specific andis determined by whatever functions andthresholds the user chooses.Data mining is a step in the KDD processthat consists of applying data analysis anddiscovery algorithms that, under acceptablecomputational efficiency limitations, pro-duce a particular enumeration of patterns (ormodels) over the data. Note that the space ofArticlesFALL 1996 41Figure 1. An Overview of the Steps That Compose the KDD Process.methods, the effective number of variables under consideration can be reduced, or in-variant representations for the data can be found.Fifth is matching the goals of the KDD pro-cess (step 1) to a particular data-mining method. For example, summarization, clas-sification, regression, clustering, and so on,are described later as well as in Fayyad, Piatet-sky-Shapiro, and Smyth (1996).Sixth is exploratory analysis and model and hypothesis selection: choosing the data-mining algorithm(s) and selecting method(s)to be used for searching for data patterns.This process includes deciding which models and parameters might be appropriate (for ex-ample, models of categorical data are differ-ent than models of vectors over the reals) and matching a particular data-mining method with the overall criteria of the KDD process (for example, the end user might be more in-terested in understanding the model than its predictive capabilities).Seventh is data mining: searching for pat-terns of interest in a particular representa-tional form or a set of such representations,including classification rules or trees, regres-sion, and clustering. The user can significant-ly aid the data-mining method by correctly performing the preceding steps.Eighth is interpreting mined patterns, pos-sibly returning to any of steps 1 through 7 for further iteration. This step can also involve visualization of the extracted patterns and models or visualization of the data given the extracted models.Ninth is acting on the discovered knowl-edge: using the knowledge directly, incorpo-rating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties. This process also includes checking for and resolving po-tential conflicts with previously believed (or extracted) knowledge.The KDD process can involve significant iteration and can contain loops between any two steps. The basic flow of steps (al-though not the potential multitude of itera-tions and loops) is illustrated in figure 1.Most previous work on KDD has focused on step 7, the data mining. However, the other steps are as important (and probably more so) for the successful application of KDD in practice. Having defined the basic notions and introduced the KDD process, we now focus on the data-mining component,which has, by far, received the most atten-tion in the literature.patterns is often infinite, and the enumera-tion of patterns involves some form of search in this space. Practical computational constraints place severe limits on the sub-space that can be explored by a data-mining algorithm.The KDD process involves using the database along with any required selection,preprocessing, subsampling, and transforma-tions of it; applying data-mining methods (algorithms) to enumerate patterns from it;and evaluating the products of data mining to identify the subset of the enumerated pat-terns deemed knowledge. The data-mining component of the KDD process is concerned with the algorithmic means by which pat-terns are extracted and enumerated from da-ta. The overall KDD process (figure 1) in-cludes the evaluation and possible interpretation of the mined patterns to de-termine which patterns can be considered new knowledge. The KDD process also in-cludes all the additional steps described in the next section.The notion of an overall user-driven pro-cess is not unique to KDD: analogous propos-als have been put forward both in statistics (Hand 1994) and in machine learning (Brod-ley and Smyth 1996).The KDD ProcessThe KDD process is interactive and iterative,involving numerous steps with many deci-sions made by the user. Brachman and Anand (1996) give a practical view of the KDD pro-cess, emphasizing the interactive nature of the process. Here, we broadly outline some of its basic steps:First is developing an understanding of the application domain and the relevant prior knowledge and identifying the goal of the KDD process from the customer’s viewpoint.Second is creating a target data set: select-ing a data set, or focusing on a subset of vari-ables or data samples, on which discovery is to be performed.Third is data cleaning and preprocessing.Basic operations include removing noise if appropriate, collecting the necessary informa-tion to model or account for noise, deciding on strategies for handling missing data fields,and accounting for time-sequence informa-tion and known changes.Fourth is data reduction and projection:finding useful features to represent the data depending on the goal of the task. With di-mensionality reduction or transformationArticles42AI MAGAZINEThe Data-Mining Stepof the KDD ProcessThe data-mining component of the KDD pro-cess often involves repeated iterative applica-tion of particular data-mining methods. This section presents an overview of the primary goals of data mining, a description of the methods used to address these goals, and a brief description of the data-mining algo-rithms that incorporate these methods.The knowledge discovery goals are defined by the intended use of the system. We can distinguish two types of goals: (1) verification and (2) discovery. With verification,the sys-tem is limited to verifying the user’s hypothe-sis. With discovery,the system autonomously finds new patterns. We further subdivide the discovery goal into prediction,where the sys-tem finds patterns for predicting the future behavior of some entities, and description, where the system finds patterns for presenta-tion to a user in a human-understandableform. In this article, we are primarily con-cerned with discovery-oriented data mining.Data mining involves fitting models to, or determining patterns from, observed data. The fitted models play the role of inferred knowledge: Whether the models reflect useful or interesting knowledge is part of the over-all, interactive KDD process where subjective human judgment is typically required. Two primary mathematical formalisms are used in model fitting: (1) statistical and (2) logical. The statistical approach allows for nondeter-ministic effects in the model, whereas a logi-cal model is purely deterministic. We focus primarily on the statistical approach to data mining, which tends to be the most widely used basis for practical data-mining applica-tions given the typical presence of uncertain-ty in real-world data-generating processes.Most data-mining methods are based on tried and tested techniques from machine learning, pattern recognition, and statistics: classification, clustering, regression, and so on. The array of different algorithms under each of these headings can often be bewilder-ing to both the novice and the experienced data analyst. It should be emphasized that of the many data-mining methods advertised in the literature, there are really only a few fun-damental techniques. The actual underlying model representation being used by a particu-lar method typically comes from a composi-tion of a small number of well-known op-tions: polynomials, splines, kernel and basis functions, threshold-Boolean functions, and so on. Thus, algorithms tend to differ primar-ily in the goodness-of-fit criterion used toevaluate model fit or in the search methodused to find a good fit.In our brief overview of data-mining meth-ods, we try in particular to convey the notionthat most (if not all) methods can be viewedas extensions or hybrids of a few basic tech-niques and principles. We first discuss the pri-mary methods of data mining and then showthat the data- mining methods can be viewedas consisting of three primary algorithmiccomponents: (1) model representation, (2)model evaluation, and (3) search. In the dis-cussion of KDD and data-mining methods,we use a simple example to make some of thenotions more concrete. Figure 2 shows a sim-ple two-dimensional artificial data set consist-ing of 23 cases. Each point on the graph rep-resents a person who has been given a loanby a particular bank at some time in the past.The horizontal axis represents the income ofthe person; the vertical axis represents the to-tal personal debt of the person (mortgage, carpayments, and so on). The data have beenclassified into two classes: (1) the x’s repre-sent persons who have defaulted on theirloans and (2) the o’s represent persons whoseloans are in good status with the bank. Thus,this simple artificial data set could represent ahistorical data set that can contain usefulknowledge from the point of view of thebank making the loans. Note that in actualKDD applications, there are typically manymore dimensions (as many as several hun-dreds) and many more data points (manythousands or even millions).ArticlesFALL 1996 43Figure 2. A Simple Data Set with Two Classes Used for Illustrative Purposes.。
数据挖掘Introduction to Data Mining
Course Overview
• • • •
Introduction to Knowledge Discovery in Databases and Data Mining
•
Why Data Mining? What is Data Mining? On What Kind of Data?
Application Domains and Examples
•
Webster: A pattern is (a) a natural or chance configuration, (b) a reliable sample of traits, acts, tendencies, or other observable characteristics of a person, group, or institution
(Latitude, Longitude)2
QUERY RESULT
(Latitude, Longitude)1
alg | Automated Learning Group
What is It?
Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
Data Mining Tools
•
•
Data Mining Methods
• • •
•
Summary
alg | Automated Learning Group
Data Mining分析方法
数据挖掘Data Mining第一部Data Mining的觀念 ............................. 错误!未定义书签。
第一章何謂Data Mining ..................................................... 错误!未定义书签。
第二章Data Mining運用的理論與實際應用功能............. 错误!未定义书签。
第三章Data Mining與統計分析有何不同......................... 错误!未定义书签。
第四章完整的Data Mining有哪些步驟............................ 错误!未定义书签。
第五章CRISP-DM ............................................................... 错误!未定义书签。
第六章Data Mining、Data Warehousing、OLAP三者關係為何. 错误!未定义书签。
第七章Data Mining在CRM中扮演的角色為何.............. 错误!未定义书签。
第八章Data Mining 與Web Mining有何不同................. 错误!未定义书签。
第九章Data Mining 的功能................................................ 错误!未定义书签。
第十章Data Mining應用於各領域的情形......................... 错误!未定义书签。
第十一章Data Mining的分析工具..................................... 错误!未定义书签。
第二部多變量分析.......................................... 错误!未定义书签。
数据挖掘单词
Data Mining 数据挖掘技术例句:We hope to answer all of your initial questions about data mining.我们希望能够回答您所有关于数据挖掘的初级问题。
subject-oriented 面向主题的例句:A data warehouse is a, by definition, subject-oriented, integrated, non-volatile and time-variant repository of data to support decision for manager.数据仓库是一个面向主题的、集成的、相对稳固的且随时光不断变化的数据聚集,用来支持管理人员的决议计划。
integrated 集成的例句:Does not test for integration issues: The security of a system requires that all of the system components be secure when integrated with each other. 不能测试整合问题:一个系统的安全性需要在与其它控件整合时保证这个系统的所有控件都是安全的。
time-variant 实时的例句:Classic PID control method which is based on precise mathematical model has poor adaptivity and is not adaptive to nonlinear and time-variant plants.经典的基于对象精确数学模型的PID控制方法的自适应性较差,难以适应具有非线性、时变不确定性的被控对象。
nonvolatile 非已失的例句:Valuable data assets reside in nonvolatile memory so they are protected against the loss of electrical power —regardless of whether the loss is deliberate.有价值的数据驻留在非易失性存储器中,以免因为断电而丢失—不管这种断电是有意还是无意。
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
techniques is also provided, based on the kinds of databases to be mined, the kinds of knowledge to be discovered, and the kinds of techniques to be adopted. This survey is organized according to one classi cation scheme: the kinds of knowledge to be mined.
J. Han was supported in part by the research grant NSERC-A3723 from the Natural Sciences and Engineering Research Council of Canada, the research grant NCE:IRIS/Precarn-HMI5 from the Networks of Centres of Excellence of Canada, and research grants from MPR Teltech Ltd. and Hughes ResearBM T.J. Watson Res. Ctr.
P.O.Box 704 Yorktown, NY 10598, U.S.A.
Abstract
Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many di erent elds have shown great interest in data mining. Several emerging applications in information providing services, such as data warehousing and on-line services over the Internet, also call for various data mining techniques to better understand user behavior, to improve the service provided, and to increase the business opportunities. In response to such a demand, this article is to provide a survey, from a database researcher's point of view, on the data mining techniques developed recently. A classi cation of the available data mining techniques is provided and a comparative study of such techniques is presented.
In response to such a demand, this article is to provide a survey on the data mining techniques developed in several research communities, with an emphasis on database-oriented techniques and those implemented in applicative data mining systems. A classi cation of the available data mining
Data Mining: An Overview from Database Perspective
Ming-Syan Chen Elect. Eng. Department National Taiwan Univ. Taipei, Taiwan, ROC
Jiawei Han School of Computing Sci. Simon Fraser University B.C. V5A 1S6, Canada
Data mining, which is also referred to as knowledge discovery in databases, means a process of nontrivial extraction of implicit, previously unknown and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases 70]. There are also many other terms, appearing in some articles and documents, carrying a similar or slightly di erent meaning, such as knowledge mining from databases, knowledge extraction, data archaeology, data dredging, data analysis, and so on. By knowledge discovery in databases, interesting knowledge, regularities, or high-level information can be extracted from the relevant sets of data in databases and be investigated from di erent angles, and large databases thereby serve as rich and reliable sources for knowledge generation and veri cation. Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. The discovered knowledge can be applied to information management, query processing, decision making, process control, and many other applications. Researchers in many di erent elds, including database systems, knowledge-base systems, arti cial intelligence, machine learning, knowledge acquisition, statistics, spatial databases, and data visualization, have shown great interest in data mining. Furthermore, several emerging applications in information providing services, such as on-line services and World Wide Web, also call for various data mining techniques to better understand user behavior, to meliorate the service provided, and to increase the business opportunities.
Index Terms | Data mining, knowledge discovery, association rules, classi cation, data clustering, pattern matching algorithms, data generalization and characterization, data cubes, multiple-dimensional databases.
1 Introduction
Recently, our capabilities of both generating and collecting data have been increasing rapidly. The widespread use of bar codes for most commercial products, the computerization of many business and government transactions, and the advances in data collection tools have provided us with huge amounts of data. Millions of databases have been used in business management, government administration, scienti c and engineering data management, and many other applications. It is noted that the number of such databases keeps growing rapidly because of the availability of powerful and a ordable database systems. This explosive growth in data and databases has generated an urgent need for new techniques and tools that can intelligently and automatically transform the processed data into useful information and knowledge. Consequently, data mining has become a research area with increasing importance 30, 70, 76].