data mining 练习题
数据挖掘思考和练习题汇总
数据挖掘思考和练习题第一章1.1 什么是数据挖掘?什么是知识发现?简述KDD的主要过程。
答:(1)数据挖掘(Data Mining)是指从大量结构化和非结构化的数据中提取有用的信息和知识的过程,它是知识发现的有效手段。
(2)知识发现是从大量数据中提取有效的、新颖的、潜在的有用的,以及最终可理解的模式的非平凡过程。
(3)KDD的过程主要包括:KDD的过程主要由数据整理、数据挖掘、结果的解释评论三部分组成。
可以由模型表示出来:1.确定挖掘目标:了解应用领域及相关的经验知识,从用户的观点出发确定数据挖掘的目标。
这一步是实现数据挖掘的重要因素,相当于系统分析,需要系统分析员和用户的共同参与。
2.建立目标数据集:从现有的数据中,确定哪些数据是与本次数据分析任务相关的。
根据挖掘目标,从原始数据中选择相关数据集,并将不同数据源中的数据集中起来。
在这一阶段需要解决数据挖掘平台、操作系统和数据源数据类型等不同所产生的数据格式差异。
3.数据清洗和预处理:这一阶段即是将数据转变成“干净”的数据。
目标数据集中不可避免地存在着不完整、不一致、不精确和冗余地数据。
数据抽取之后必须利用专业领域地知识对“脏数据”进行清洗。
然后再对它们实施相应的方法,神经网络方法和模糊匹配技术分析多数据源之间联系,然后再对它们实施相应的处理。
4.数据降维和转换:在对数据库和数据子集进行预处理之后,考虑了数据的不变表示或发现了数据的不变的表示情况下,减少变量的实际数目,设法将数据转换到一个更易找到了解的空间上。
5.选择挖掘算法使用合适的数据挖掘算法完成数据分析。
确定实现挖掘目标的数据挖掘功能,这些功能方法包括概念描述、分类、聚类、关联规则。
其次选择合适的模式搜索算法,包括模型和参数的确定。
6.模式评价和解释根据最终用户的决策目的对数据挖掘发现的模式进行评价,将有用的模式或描述有用模式的数据以可视化技术和知识表示技术展示给用户,让用户能够对模型结果作出解释,评价模式的有效性。
数据挖掘考试题库
一、填空题1.Web挖掘可分为、和3大类。
2.数据仓库需要统一数据源,包括统一、统一、统一和统一数据特征4个方面。
3.数据分割通常按时间、、、以及组合方法进行。
4.噪声数据处理的方法主要有、和。
5.数值归约的常用方法有、、、和对数模型等。
6.评价关联规则的2个主要指标是和。
7.多维数据集通常采用或雪花型架构,以表为中心,连接多个表。
8.决策树是用作为结点,用作为分支的树结构。
9.关联可分为简单关联、和。
10.B P神经网络的作用函数通常为区间的。
11.数据挖掘的过程主要包括确定业务对象、、、及知识同化等几个步骤。
12.数据挖掘技术主要涉及、和3个技术领域。
13.数据挖掘的主要功能包括、、、、趋势分析、孤立点分析和偏差分析7个方面。
14.人工神经网络具有和等特点,其结构模型包括、和自组织网络3种。
15.数据仓库数据的4个基本特征是、、非易失、随时间变化。
16.数据仓库的数据通常划分为、、和等几个级别。
17.数据预处理的主要内容(方法)包括、、和数据归约等。
18.平滑分箱数据的方法主要有、和。
19.数据挖掘发现知识的类型主要有广义知识、、、和偏差型知识五种。
20.O LAP的数据组织方式主要有和两种。
21.常见的OLAP多维数据分析包括、、和旋转等操作。
22.传统的决策支持系统是以和驱动,而新决策支持系统则是以、建立在和技术之上。
23.O LAP的数据组织方式主要有和2种。
24.S QL Server2000的OLAP组件叫,OLAP操作窗口叫。
25.B P神经网络由、以及一或多个结点组成。
26.遗传算法包括、、3个基本算子。
27.聚类分析的数据通常可分为区间标度变量、、、、序数型以及混合类型等。
28.聚类分析中最常用的距离计算公式有、、等。
29.基于划分的聚类算法有和。
30.C lementine的工作流通常由、和等节点连接而成。
31.简单地说,数据挖掘就是从中挖掘的过程。
32.数据挖掘相关的名称还有、、等。
数据挖掘试题参考答案
大学课程《数据挖掘》试题参考答案范围:∙ 1.什么是数据挖掘?它与传统数据分析有什么区别?定义:数据挖掘(Data Mining,DM)又称数据库中的知识发现(Knowledge Discover in Database,KDD),是目前人工智能和数据库领域研究的热点问题,所谓数据挖掘是指从数据库的大量数据中揭示出隐含的、先前未知的并有潜在价值的信息的非平凡过程。
数据挖掘是一种决策支持过程,它主要基于人工智能、机器学习、模式识别、统计学、数据库、可视化技术等,高度自动化地分析企业的数据,做出归纳性的推理,从中挖掘出潜在的模式,帮助决策者调整市场策略,减少风险,做出正确的决策。
区别:(1)数据挖掘的数据源与以前相比有了显著的改变;数据是海量的;数据有噪声;数据可能是非结构化的;(2)传统的数据分析方法一般都是先给出一个假设然后通过数据验证,在一定意义上是假设驱动的;与之相反,数据挖掘在一定意义上是发现驱动的,模式都是通过大量的搜索工作从数据中自动提取出来。
即数据挖掘是要发现那些不能靠直觉发现的信息或知识,甚至是违背直觉的信息或知识,挖掘出的信息越是出乎意料,就可能越有价值。
在缺乏强有力的数据分析工具而不能分析这些资源的情况下,历史数据库也就变成了“数据坟墓”-里面的数据几乎不再被访问。
也就是说,极有价值的信息被“淹没”在海量数据堆中,领导者决策时还只能凭自己的经验和直觉。
因此改进原有的数据分析方法,使之能够智能地处理海量数据,即演化为数据挖掘。
∙ 2.请根据CRISP-DM(Cross Industry Standard Process for Data Mining)模型,描述数据挖掘包含哪些步骤?CRISP-DM 模型为一个KDD工程提供了一个完整的过程描述.该模型将一个KDD工程分为6个不同的,但顺序并非完全不变的阶段.1: business understanding: 即商业理解. 在第一个阶段我们必须从商业的角度上面了解项目的要求和最终目的是什么. 并将这些目的与数据挖掘的定义以及结果结合起来.2.data understanding: 数据的理解以及收集,对可用的数据进行评估.3: data preparation: 数据的准备,对可用的原始数据进行一系列的组织以及清洗,使之达到建模需求.4:modeling: 即应用数据挖掘工具建立模型.5:evaluation: 对建立的模型进行评估,重点具体考虑得出的结果是否符合第一步的商业目的.6: deployment: 部署,即将其发现的结果以及过程组织成为可读文本形式.(数据挖掘报告)∙ 3.请描述未来多媒体挖掘的趋势随着多媒体技术的发展,人们接触的数据形式不断地丰富,多媒体数据库的日益增多,原有的数据库技术已满足不了应用的需要,人们希望从这些媒体数据中得到一些高层的概念和模式,找出蕴涵于其中的有价值的知识。
数据挖掘与数据分析,数据可视化试题
数据挖掘与数据分析,数据可视化试题1. Data Mining is also referred to as ……………………..data analysisdata discovery(正确答案)data recoveryData visualization2. Data Mining is a method and technique inclusive of …………………………. data analysis.(正确答案)data discoveryData visualizationdata recovery3. In which step of Data Science consume Almost 80% of the work period of the procedure.Accumulating the dataAnalyzing the dataWrangling the data(正确答案)Recapitulation of the Data4. Which Step of Data Science allows the model to consistently improve and provide punctual performance and deliverapproximate results.Wrangling the dataAccumulating the dataRecapitulation of the Data(正确答案)Analyzing the data5. Which tool of Data Science is robust machine learning library, which allows the implementation of deep learning ?algorithms. STableauD3.jsApache SparkTensorFlow(正确答案)6. What is the main aim of Data Mining ?to obtain data from a less number of sources and to transform it into a more useful version of itself.to obtain data from a less number of sources and to transform it into a less useful version of itself.to obtain data from a great number of sources and to transform it into a less useful version of itself.to obtain data from a great number of sources and to transform it into a more useful version of itself.(正确答案)7. In which step of data mining the irrelevant patterns are eliminated to avoid cluttering ? Cleaning the data(正确答案)Evaluating the dataConversion of the dataIntegration of data8. Data Science t is mainly used for ………………. purposes. Data mining is mainly used for ……………………. purposes.scientific,business(正确答案)business,scientificscientific,scientificNone9. Pandas ………………... is a one dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.).Series(正确答案)FramePanelNone10. How many principal components Pandas DataFrame consists of ?4213(正确答案)11. Important data structure of pandas is/are ___________SeriesData FrameBoth(正确答案)None of the above12. Which of the following command is used to install pandas?pip install pandas(正确答案)install pandaspip pandasNone of the above13. Which of the following function/method help to create Series? series()Series()(正确答案)createSeries()None of the above14. NumPY stands for?Numbering PythonNumber In PythonNumerical Python(正确答案)None Of the above15. Which of the following is not correct sub-packages of SciPy? scipy.integratescipy.source(正确答案)scipy.interpolatescipy.signal16. How to import Constants Package in SciPy?import scipy.constantsfrom scipy.constants(正确答案)import scipy.constants.packagefrom scipy.constants.package17. ………………….. involveslooking at and describing the data set from different angles and then summarizing it ?Data FrameData VisualizationEDA(正确答案)All of the above18. what involves the preparation of data sets for analysis by removing irregularities in the data so that these irregularities do not affect further steps in the process of data analysis and machine learning model building ?Data AnalysisEDA(正确答案)Data FrameNone of the above19. What is not Utility of EDA ?Maximize the insight in the data setDetect outliers and anomaliesVisualization of dataTest underlying assumptions(正确答案)20. what can hamper the further steps in the machine learning model building process If not performed properly ?Recapitulation of the DataAccumulating the dataEDA(正确答案)None of the above21. Which plot for EDA to check the dependency between two variables ? HistogramsScatter plots(正确答案)MapsTime series plots22. What function will tell you the top records in the data set?shapehead(正确答案)showall of the aboce23. what type of data is useful for internal policymaking and business strategy building for an organization ?public dataprivate data(正确答案)bothNone of the above24. The ………… function can “fill in” NA valueswith non-null data ?headfillna(正确答案)shapeall of the above25. If you want to simply exclude the missing values, then what function along with the axis argument will be use?fillnareplacedropna(正确答案)isnull26. Which of the following attribute of DataFrame is used to display data type of each column in DataFrame?DtypesDTypesdtypes(正确答案)datatypes27. Which of the following function is used to load the data from the CSV file into a DataFrame?read.csv()readcsv()read_csv()(正确答案)Read_csv()28. how to Display first row of dataframe ‘DF’ ?print(DF.head(1))print(DF[0 : 1])print(DF.iloc[0 : 1])All of the above(正确答案)29. Spread function is known as ................ in spreadsheets ?pivotunpivot(正确答案)castorder30. ................. extract a subset of rows from a data fram based on logical conditions ? renamefilter(正确答案)setsubset31. We can shift the DataFrame’s index by a certain number of periods usingthe …………. Method ?melt()merge()tail()shift()(正确答案)32. We can join melted DataFrames into one Analytical Base Table using the ……….. function.join()append()merge()(正确答案)truncate()33. What methos is used to concatenate datasets along an axis ?concatenate()concat()(正确答案)add()merge()34. Rows can be …………….. if the number of missing values is insignificant, as thiswould not impact the overall analysis results.deleted(正确答案)updatedaddedall35. There is a specific reason behind the missing value.What stands for Missing not at randomMCARMARMNAR(正确答案)None of the above36. While plotting data, some values of one variable may not lie beyond the expectedrange, but when you plot the data with some other variable, these values may lie far from the expected value.Identify the type of outliers?Univariate outliersMultivariate outliers(正确答案)ManyVariate outlinersNone of the above37. if numeric values are stored as strings, then it would not be possible to calculatemetrics such as mean, median, etc.Then what type of data cleaning exercises you will perform ?Convert incorrect data types:(正确答案)Correct the values that lie beyond the rangeCorrect the values not belonging in the listFix incorrect structure:38. Rows that are not required in the analysis. E.g ifobservations before or after a particular date only are required for analysis.What steps we will do when perform data filering ?Deduplicate Data/Remove duplicateddataFilter rows tokeep only therelevant data.(正确答案)Filter columns Pick columnsrelevant toanalysisBring the datatogether, Groupby required keys,aggregate therest39. you need to…………... the data in order to get what you need for your analysis. searchlengthorderfilter(正确答案)40. Write the output of the following ?>>> import pandas as pd >>> series1 =pd.Series([10,20,30])>>> print(series1)0 101 202 30dtype: int64(正确答案)102030dtype: int640 1 2 dtype: int64None of the above41. What will be output for the following code?import numpy as np a = np.array([1, 2, 3], dtype = complex) print a[[ 1.+0.j, 2.+0.j, 3.+0.j]][ 1.+0.j]Error[ 1.+0.j, 2.+0.j, 3.+0.j](正确答案)42. What will be output for the following code?import numpy as np a =np.array([1,2,3]) print a[[1, 2, 3]][1][1, 2, 3](正确答案)Error43. What will be output for the following code?import numpy as np dt = dt =np.dtype('i4') print dtint32(正确答案)int64int128int1644. What will be output for the following code?import numpy as np dt =np.dtype([('age',np.int8)]) a = np.array([(10,),(20,),(30,)], dtype = dt)print a['age'][[10 20 30]][10 20 30](正确答案)[10]Error45. We can add a new row to a DataFrame using the _____________ methodrloc[ ]iloc[ ]loc[ ](正确答案)None of the above46. Function _____ can be used to drop missing values.fillna()isnull()dropna()(正确答案)delna()47. The function to perform pivoting with dataframes having duplicate values is _____ ? pivot(unique = True)pivot()pivot_table(unique = True)pivot_table()(正确答案)48. A technique, which when performed on a dataframe, rearranges the data from rows and columns in a report form, is called _____ ?summarisingreportinggroupingpivoting(正确答案)49. Normal Distribution is symmetric is about ___________ ?VarianceMean(正确答案)Standard deviationCovariance50. Write a statement to display “Amount” as x-axis label. (consider plt as an alias name of matplotlib.pyplot)bel(“Amount”)plt.xlabel(“Amount”)(正确答案)plt.xlabel(Amount)None of the above51. Fill in the blank in the given code, if we want to plot a line chart for values of list ‘a’ vs values of list ‘b’.a = [1, 2, 3, 4, 5]b = [10, 20, 30, 40, 50]import matplotlib.pyplot as pltplt.plot __________(a, b)(正确答案)(b, a)[a, b]None of the above52. #Loading the datasetimport seaborn as snstips =sns.load_dataset("tips")tips.head()In this code what is tips ?plotdataset name(正确答案)paletteNone of the above53. Visualization can make sense of information by helping to find relationships in the data and support (or disproving) ideas about the dataAnalyzeRelationShip(正确答案)AccessiblePrecise54. In which option provides A detailed data analysis tool that has an easy-to-use tool interface and graphical designoptions for visuals.Jupyter NotebookSisenseTableau DesktopMATLAB(正确答案)55. Consider a bank having thousands of ATMs across China. In every transaction, Many variables are recorded.Which among the following are not fact variables.Transaction charge amountWithdrawal amountAccount balance after withdrawalATM ID(正确答案)56. Which module of matplotlib library is required for plotting of graph?plotmatplotpyplot(正确答案)None of the above57. Write a statement to display “Amount” as x-axis label. (consider plt as an alias name of matplotlib.pyplot)bel(“Amount”)plt.xlabel(“Amount”)(正确答案)plt.xlabel(Amount)None of the above58. What will happen when you pass ‘h’ as as a value to orient parameter of the barplot function?It will make the orientation vertical.It will make the orientation horizontal.(正确答案)It will make line graphNone of the above59. what is the name of the function to display Parameters available are viewed .set_style()axes_style()(正确答案)despine()show_style()60. In stacked barplot, subgroups are displayed as bars on top of each other. How many parameters barplot() functionhave to draw stacked bars?OneTwoNone(正确答案)three61. In Line Chart or Line Plot which parameter is an object determining how to draw the markers for differentlevels of the style variable.?x.yhuemarkers(正确答案)legend62. …………………..similar to Box Plot but with a rotated plot on each side, giving more information about the density estimate on the y axis.Pie ChartLine ChartViolin Chart(正确答案)None63. By default plot() function plots a ________________HistogramBar graphLine chart(正确答案)Pie chart64. ____________ are column-charts, where each column represents a range of values, and the height of a column corresponds to how many values are in that range.Bar graphHistograms(正确答案)Line chartpie chart65. The ________ project builds on top of pandas and matplotlib to provide easy plotting of data.yhatSeaborn(正确答案)VincentPychart66. A palette means a ________.. surface on which a painter arranges and mixed paints. circlerectangularflat(正确答案)all67. The default theme of the plotwill be ________?Darkgrid(正确答案)WhitegridDarkTicks68. Outliers should be treated after investigating data and drawing insights from a dataset.在调查数据并从数据集中得出见解后,应对异常值进行处理。
数据分析英语试题及答案
数据分析英语试题及答案一、选择题(每题2分,共10分)1. Which of the following is not a common data type in data analysis?A. NumericalB. CategoricalC. TextualD. Binary2. What is the process of transforming raw data into an understandable format called?A. Data cleaningB. Data transformationC. Data miningD. Data visualization3. In data analysis, what does the term "variance" refer to?A. The average of the data pointsB. The spread of the data points around the meanC. The sum of the data pointsD. The highest value in the data set4. Which statistical measure is used to determine the central tendency of a data set?A. ModeB. MedianC. MeanD. All of the above5. What is the purpose of using a correlation coefficient in data analysis?A. To measure the strength and direction of a linear relationship between two variablesB. To calculate the mean of the data pointsC. To identify outliers in the data setD. To predict future data points二、填空题(每题2分,共10分)6. The process of identifying and correcting (or removing) errors and inconsistencies in data is known as ________.7. A type of data that can be ordered or ranked is called________ data.8. The ________ is a statistical measure that shows the average of a data set.9. A ________ is a graphical representation of data that uses bars to show comparisons among categories.10. When two variables move in opposite directions, the correlation between them is ________.三、简答题(每题5分,共20分)11. Explain the difference between descriptive andinferential statistics.12. What is the significance of a p-value in hypothesis testing?13. Describe the concept of data normalization and its importance in data analysis.14. How can data visualization help in understanding complex data sets?四、计算题(每题10分,共20分)15. Given a data set with the following values: 10, 12, 15, 18, 20, calculate the mean and standard deviation.16. If a data analyst wants to compare the performance of two different marketing campaigns, what type of statistical test might they use and why?五、案例分析题(每题15分,共30分)17. A company wants to analyze the sales data of its products over the last year. What steps should the data analyst take to prepare the data for analysis?18. Discuss the ethical considerations a data analyst should keep in mind when handling sensitive customer data.答案:一、选择题1. D2. B3. B4. D5. A二、填空题6. Data cleaning7. Ordinal8. Mean9. Bar chart10. Negative三、简答题11. Descriptive statistics summarize and describe thefeatures of a data set, while inferential statistics make predictions or inferences about a population based on a sample.12. A p-value indicates the probability of observing the data, or something more extreme, if the null hypothesis is true. A small p-value suggests that the observed data is unlikely under the null hypothesis, leading to its rejection.13. Data normalization is the process of scaling data to a common scale. It is important because it allows formeaningful comparisons between variables and can improve the performance of certain algorithms.14. Data visualization can help in understanding complex data sets by providing a visual representation of the data, making it easier to identify patterns, trends, and outliers.四、计算题15. Mean = (10 + 12 + 15 + 18 + 20) / 5 = 14, Standard Deviation = √[(Σ(xi - mean)^2) / N] = √[(10 + 4 + 1 + 16 + 36) / 5] = √52 / 5 ≈ 3.816. A t-test or ANOVA might be used to compare the means ofthe two campaigns, as these tests can determine if there is a statistically significant difference between the groups.五、案例分析题17. The data analyst should first clean the data by removing any errors or inconsistencies. Then, they should transformthe data into a suitable format for analysis, such ascreating a time series for monthly sales. They might also normalize the data if necessary and perform exploratory data analysis to identify any patterns or trends.18. A data analyst should ensure the confidentiality andprivacy of customer data, comply with relevant data protection laws, and obtain consent where required. They should also be transparent about how the data will be used and take steps to prevent any potential misuse of the data.。
数据挖掘习题及解答-完美版
Data Mining Take Home Exam学号: xxxx 姓名: xxx 1. (20分)考虑下表的数据集。
(1)计算整个数据集的Gini 指标值。
(2)计算属性性别的Gini 指标值(3)计算使用多路划分属性车型的Gini 指标值 (4)计算使用多路划分属性衬衣尺码的Gini 指标值(5)下面哪个属性更好,性别、车型还是衬衣尺码?为什么? 解:(1) Gini=1-(10/20)^2-(10/20)^2=0.5 (2)Gini=[{1-(6/10)^2-(4/10)^2}*1/2]*2=0.48 (3)Gini={1-(1/4)^2-(3/4)^2}*4/20+{1-(8/8)^2-(0/8)^2}*8/20+{1-(1/8)^2-(7/8)^2}*8/2 0=26/160=0.1625(4)Gini={1-(3/5)^2-(2/5)^2}*5/20+{1-(3/7)^2-(4/7)^2}*7/20+[{1-(2/4)^2-(2/4)^2}*4/ 20]*2=8/25+6/35=0.4914(5)比较上面各属性的Gini值大小可知,车型划分Gini值0.1625最小,即使用车型属性更好。
2. (20分)考虑下表中的购物篮事务数据集。
(1) 将每个事务ID视为一个购物篮,计算项集{e},{b,d} 和{b,d,e}的支持度。
(2)使用(1)的计算结果,计算关联规则{b,d}→{e}和{e}→{b,d}的置信度。
(3)将每个顾客ID作为一个购物篮,重复(1)。
应当将每个项看作一个二元变量(如果一个项在顾客的购买事务中至少出现一次,则为1,否则,为0)。
(4)使用(3)的计算结果,计算关联规则{b,d}→{e}和{e}→{b,d}的置信度。
答:(1)由上表计数可得{e}的支持度为8/10=0.8;{b,d}的支持度为2/10=0.2;{b,d,e}的支持度为2/10=0.2。
(2)c[{b,d}→{e}]=2/8=0.25; c[{e}→{b,d}]=8/2=4。
数据挖掘试题
数据挖掘试题1. 解释什么是数据挖掘(Data Mining)。
答:数据挖掘是通过应用统计学、机器学习和模式识别等技术,从大量数据中发现隐藏在其中的模式、关联和规律的过程。
它可以帮助人们从原始数据中提取有价值的信息,以支持决策、预测和优化等任务。
2. 请说明数据挖掘的主要任务。
答:数据挖掘的主要任务包括以下几个方面:- 分类:根据已有的数据标签和特征构建分类模型,将新的数据实例分到预定义的类别中。
- 聚类:根据数据的相似性将其分组,以发现隐藏的数据群体和类别。
- 关联规则挖掘:发现数据项之间的关联和依赖关系,如购物篮分析中发现常一起购买的商品。
- 预测分析:通过已有的数据建立预测模型,用于预测未来的趋势、结果或行为。
- 回归分析:根据数据的特征和标签之间的关系建立回归模型,用于预测连续值的结果。
- 异常检测:发现与正常模式不符的异常数据点,如欺诈检测。
- 文本挖掘:从大量的文本数据中提取有意义的信息和知识,如情感分析、主题提取等。
- 图像和视频挖掘:从图片和视频数据中提取有价值的信息和特征。
3. 请列举常用的数据挖掘算法。
答:常用的数据挖掘算法包括:- 决策树算法(Decision Tree)- 支持向量机算法(Support Vector Machine)- 贝叶斯分类算法(Naive Bayes)- 逻辑回归算法(Logistic Regression)- 人工神经网络算法(Artificial Neural Networks)- 随机森林算法(Random Forest)- 聚类算法(K-means,DBSCAN等)- 关联规则挖掘算法(Apriori,FP-Growth等)- 主成分分析算法(Principal Component Analysis)- 线性回归算法(Linear Regression)4. 数据预处理在数据挖掘中的作用是什么?答:数据预处理是数据挖掘的一个重要步骤,其作用主要有以下几个方面:- 数据清洗:处理缺失值、异常值和噪声,以确保数据的完整性和质量。
数据挖掘第二次作业
数据挖掘第二次作业第一题:1.a) Compute the Information Gain for Gender, Car Type and Shirt Size.b) Construct a decision tree with Information Gain.答案:a)因为class分为两类:C0和C1,其中C0的频数为10个,C1的频数为10,所以class元组的信息增益为Info(D)==11.按照Gender进行分类:Info gender(D)==0.971Gain(Gender)=1-0.971=0.0292.按照Car Type进行分类Info carType(D)==0.314 Gain(Car Type)=1-0.314=0.6863.按照Shirt Size进行分类:Info shirtSize(D)==0.988Gain(Shirt Size)=1-0.988=0.012b)由a中的信息增益结果可以看出采用Car Type进行分类得到的信息增益最大,所以决策树为:第二题:2. (a) Design a multilayer feed-forward neural network (one hidden layer) for thedata set in Q1. Label the nodes in the input and output layers.(b) Using the neural network obtained above, show the weight values after oneiteration of the back propagation algorithm, given the training instance “(M, Family, Small)". Indicate your initial weight values and biases and the learning rate used.a)x 11x 12x 21x 22x 23x 31x 32x 33x 34输入层隐藏层输出层b) 由a 可以设每个输入单元代表的属性和初始赋值由于初始的权重和偏倚值是随机生成的所以在此定义初始值为:净输入和输出:每个节点的误差表:10 0.0089 11 0.0030 12 -0.12权重和偏倚的更新: W 1,10 W 1,11 W 2,10 W 2,11 W 3,10 W 3,11 W 4,10 W 4,11 W 5,10 W 5,11 0.201 0.198 -0.211 -0.099 0.4 0.308 -0.202 -0.098 0.101 -0.100 W 6,10 W 6,11 W 7,10 W 7,11 W 8,10 W 8,11 W 9,10 W 9,11 W 10,12 W 11,12 0.092 -0.211 -0.400 0.198 0.201 0.190 -0.110 0.300 -0.304 -0.099 θ10 θ11 θ12 -0.287 0.1790.344第三题:3.a) Suppose the fraction of undergraduate students who smoke is 15% and thefraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student?b) Given the information in part (a), is a randomly chosen college student morelikely to be a graduate or undergraduate student?c) Suppose 30% of the graduate students live in a dorm but only 10% of theundergraduate students live in a dorm. If a student smokes and lives in the dorm, is he or she more likely to be a graduate or undergraduate student? You can assume independence between students who live in a dorm and those who smoke.答:a) 定义:A={A 1 ,A 2}其中A 1表示没有毕业的学生,A 2表示毕业的学生,B 表示抽烟则由题意而知:P(B|A 1)=15% P(B|A 2)=23% P(A 1)= P(A 2)=则问题则是求P(A 2|B)由()166.0)()|B ()()|B (B 2211=+=A P A p A P A P P则()277.0166.02.023.0)()()|(|222=⨯=⨯=B P A P A B P B APb) 由a 可以看出随机抽取一个抽烟的大学生,是毕业生的概率是0.277,未毕业的学生是0.723,所以有很大的可能性是未毕业的学生。
数据挖掘一些简答题
一简答题4x5二专业翻译3x15三计算1x15四算法描述Why Data Mining?there are some reasons below:1.The explosive growth of data:from terabytes to petabytes2.We are drowning in data but starving for knowledge3.“Necessity is the mother of invention”,data mining can meet the need that automated analysis of massive data sets.What Is Data Mining?Data Mining is the process of discovering interesting patterns from massive amounts of data. as a knowledge discovery process, it typically involves :data cleaning,data integration,data selection,data transformation,data mining,pattern evaluation,knowledge presentation.A Multi-Dimensional View of Data Miningthe major dimensions are data ,knowledge,technology,and application.What Kinds of Data Can Be Mined?as a general technology,data mining can be applied to any kind of data as long as the data are meaningful for a target application.the most basic forms of data for mining application are database data,data warehouse data,transaction data .advanced data can be mined: time-related or sequence data,data streams,spatial data and spatiotemporal data,text data,multimedia data , graph or network data,and web data.What Kinds of Patterns Can Be Mined????a pattern is interesting if it is valid on test data with some degree of certainty,novel,potentially useful,and easily understood by humans,Interesting patterns represent knowledge.Measures of pattern interestingness,either objective or subjective ,can be used to guide the discovery process.What Kinds of Technologies Are Used?statics;machine learning; pattern recognition; visualization; algorithms; high-performance computing; application; information retrieval; data warehouse; database systems.What Kinds of Applications Are Targeted?Data mining has many successful applications, such as business intelligence, Web search, bioinformatics,Major Issues in Data MiningThere are many challenging issues in data mining research. Areas include mining methodology, user interaction,efficiency and scalability, and dealing with diverse data types. Data mining research has strongly impactedsociety and will continue to do so in the future.。
北京大学《数据仓库与数据挖掘》试题答案整理
《数据仓库与数据挖掘》试题与答案整理2013级智能系高飙1.名词解释5x4(1)主题主题(Subject):宏观分析领域所涉及的分析对象。
是在较高层次上将企业信息系统中的数据进行综合、归类和分析利用的一个抽象概念,每一个主题基本对应一个宏观的分析领域。
面向主题的数据组织方式:在较高的层次上对分析对象的数据的一个完整、一致的描述。
(2)事实(P联机分析)事实是数值度量的;存储一个多维数据,表达期望分析的主题(目的、感兴趣的事情、事件或者指标等);具有一定的粒度,粒度的大小与维层次相关;一个事实中通常包含一个或者多个度量一个事实的两个组件:数字型指标、聚集函数(3)数据归约(P数据预处理)在可能获得相同或相似结果的前提下,对数据的容量进行有效的缩减数据归约的方法:1数据立方体聚集:聚集操作作用于立方体中的数据2减少数据维度(维归约):可以检测并删除不相关、弱相关或者冗余的属性或维3数据压缩:使用编码机制压缩数据集4数值压缩:用替代的、较小的数据表示替换或估计数据5数据离散化以及概念层次的建立:属性的原始值用区间值或较高层的概念予以替换(4)兴趣度(P数据挖掘)一个数据挖掘系统的挖掘结果可能会产生成千上万个模式,但是并不是所有的模式都有意义。
兴趣度度量用于将不感兴趣的模式从知识中分开。
他们可以用于指导挖掘过程,或在挖掘之后,评估发现的模式。
不同类型的数据需要不同的兴趣度量。
兴趣度的度量:一个模式是否感兴趣,取决于它是否容易被用户所理解,是否有效可信,是否潜在有用,是否新颖等兴趣度的度量:客观的度量: 从模式的角度出发,基于模式结构的某些统计的结果,如:支持度(support)、置信度(confidence)等。
主观的度量:从用户的角度出发,对模式的信任程度,如:新颖性、可操作性等。
(5)数据分区(片)(P数据仓库设计)把逻辑上统一的数据分割成较小的、可以独立管理的物理单元(分片)进行存储。
可按时间、按地区、按业务类型进行数据分片(6)数据挖掘数据挖掘是识别数据中有效的、新颖的、潜在有用的和最终可被理解的模式(Pattern)的非平凡过程。
数据挖掘岗面试题目(3篇)
第1篇一、基础知识1. 请简述数据挖掘的基本概念和目的。
2. 请列举数据挖掘的主要应用领域。
3. 请说明数据挖掘的流程和步骤。
4. 请解释什么是数据预处理,其重要性是什么?5. 请列举数据预处理的主要方法。
6. 请解释什么是特征工程,其重要性是什么?7. 请列举特征工程的主要方法。
8. 请解释什么是机器学习,请列举几种常见的机器学习算法。
9. 请解释什么是监督学习、无监督学习和半监督学习。
10. 请解释什么是分类、回归和聚类。
11. 请解释什么是模型评估,请列举几种常见的模型评估指标。
12. 请解释什么是决策树,请列举决策树的分类方法。
13. 请解释什么是随机森林,请列举随机森林的优点。
14. 请解释什么是支持向量机(SVM),请列举SVM的分类方法。
15. 请解释什么是神经网络,请列举神经网络的分类方法。
16. 请解释什么是深度学习,请列举深度学习的应用领域。
17. 请解释什么是K-means算法,请列举K-means算法的优缺点。
18. 请解释什么是层次聚类,请列举层次聚类的分类方法。
19. 请解释什么是关联规则挖掘,请列举关联规则挖掘的算法。
20. 请解释什么是时间序列分析,请列举时间序列分析的方法。
二、编程能力1. 请用Python实现以下功能:(1)读取CSV文件,提取其中指定列的数据;(2)对提取的数据进行排序;(3)将排序后的数据写入新的CSV文件。
2. 请用Python实现以下功能:(1)使用Pandas库对数据集进行数据预处理;(2)使用NumPy库对数据进行特征工程;(3)使用Scikit-learn库对数据进行分类。
3. 请用Python实现以下功能:(1)使用TensorFlow库实现一个简单的神经网络模型;(2)使用PyTorch库实现一个简单的神经网络模型;(3)对模型进行训练和评估。
4. 请用Python实现以下功能:(1)使用Scikit-learn库实现一个SVM分类器;(2)对分类器进行训练和评估;(3)调整SVM分类器的参数,以提高分类效果。
数据挖掘习题及解答-完美版
Data Mining Take Home Exam学号: xxxx 姓名: xxx(1)计算整个数据集的Gini指标值。
(2)计算属性性别的Gini指标值(3)计算使用多路划分属性车型的Gini指标值(4)计算使用多路划分属性衬衣尺码的Gini指标值(5)下面哪个属性更好,性别、车型还是衬衣尺码?为什么?(3)=26/160=0.1625]*2=8/25+6/35=0.4914(5)比较上面各属性的Gini值大小可知,车型划分Gini值0.1625最小,即使用车型属性更好。
2. ((1) 将每个事务ID视为一个购物篮,计算项集{e},{b,d} 和{b,d,e}的支持度。
(2)使用(1)的计算结果,计算关联规则{b,d}→{e}和{e}→{b,d}的置信度。
(3)将每个顾客ID作为一个购物篮,重复(1)。
应当将每个项看作一个二元变量(如果一个项在顾客的购买事务中至少出现一次,则为1,否则,为0)。
(4)使用(3)的计算结果,计算关联规则{b,d}→{e}和{e}→{b,d}的置信度。
答:(1)由上表计数可得{e}的支持度为8/10=0.8;{b,d}的支持度为2/10=0.2;{b,d,e}的支持度为2/10=0.2。
(2)c[{b,d}→{e}]=2/8=0.25; c[{e}→{b,d}]=8/2=4。
(3)同理可得:{e}的支持度为4/5=0.8,{b,d}的支持度为5/5=1,{b,d,e}的支持度为4/5=0.8。
(4)c[{b,d}→{e}]=5/4=1.25,c[{e}→{b,d}]=4/5=0.8。
3. (20分)以下是多元回归分析的部分R输出结果。
> ls1=lm(y~x1+x2)> anova(ls1)Df Sum Sq Mean Sq F value Pr(>F)x1 1 10021.2 10021.2 62.038 0.0001007 ***x2 1 4030.9 4030.9 24.954 0.0015735 **Residuals 7 1130.7 161.5> ls2<-lm(y~x2+x1)> anova(ls2)Df Sum Sq Mean Sq F value Pr(>F)x2 1 3363.4 3363.4 20.822 0.002595 **x1 1 10688.7 10688.7 66.170 8.193e-05 ***Residuals 7 1130.7 161.5(1)用F检验来检验以下假设(α = 0.05)H0: β1 = 0H a: β1≠ 0计算检验统计量;是否拒绝零假设,为什么?(2)用F检验来检验以下假设(α = 0.05)H0: β2 = 0H a: β2≠ 0计算检验统计量;是否拒绝零假设,为什么?(3)用F检验来检验以下假设(α = 0.05)H0: β1 = β2 = 0H a: β1和β2 并不都等于零计算检验统计量;是否拒绝零假设,为什么?解:(1)根据第一个输出结果F=62.083>F(2,7)=4.74,p<0.05,所以可以拒绝原假设,即得到不等于0。
6-data mining(1)
Part II Data MiningOutlineThe Concept of Data Mining(数据挖掘概念) Architecture of a Typical Data Mining System (数据挖掘系统结构)What can be Mined? (能挖掘什么?)Major Issues(主要问题)in Data MiningData Cleaning(数据清理)3What Is Data Mining?Data mining is the process of discovering interesting knowledge from large amounts of data. (数据挖掘是从大量数据中发现有趣知识的过程) The main difference that separates information retrieval apart from data mining is their goals. (数据挖掘和信息检索的主要差别在于他们的目标) Information retrieval is to help users search for documents or data that satisfy their information needs(信息检索帮用户寻找他们需要的文档/数据)e.g. Find customers who have purchased more than $10,000 in the last month .(查找上个月购物量超过1万美元的客户)Data mining discovers useful knowledge by analyzing data correlations using sophisticated data mining techniques(数据挖掘用复杂技术分析…)e.g. Find all items which are frequently purchased with milk .(查找经常和牛奶被购买的商品)A KDD Process (1) Some people view data mining as synonymous5A KDD Process (2)Learning the application domain (学习应用领域相关知识):Relevant knowledge & goals of application (相关知识和目标) Creating a target data set (建立目标数据集) Data selection, Data cleaning and preprocessing (预处理)Choosing functions of data mining (选择数据挖掘功能)Summarization, classification, association, clustering , etc.Choosing the mining algorithm(s) (选择挖掘算法)Data mining (进行数据挖掘): search for patterns of interest Pattern evaluation and knowledge presentation (模式评估和知识表示)Removing redundant patterns, visualization, transformation, etc.Present results to user in meaningful manner.Use of discovered knowledge (使用所发现的知识)7Concept/class description (概念/类描述)Characterization(特征): provide a summarization of the given data set Comparison(区分): mine distinguishing characteristics(挖掘区别特征)that differentiate a target class from comparable contrasting classes. Association rules (correlation and causality)(关联规则)Association rules are of the form(这种形式的规则): X ⇒Y,Examples: contains(T, “computer”) ⇒contains(T, “software”)[support = 1%, confidence = 50%]age(X, “20..29”) ∧income(X, “20..29K ”) ⇒buys(X, “PC ”)[support = 2%, confidence = 60%]Classification and Prediction (分类和预测)Find models that describe and distinguish classes for future prediction.What kinds of patterns can be mined?(1)What kinds of patterns can be mined?(2)Cluster(聚类)Group data to form some classes(将数据聚合成一些类)Principle: maximizing the intra-class similarity and minimizing the interclass similarity (原则: 最大化类内相似度,最小化类间相似度) Outlier analysis: objects that do not comply with the general behavior / data model. (局外者分析: 发现与一般行为或数据模型不一致的对象) Trend and evolution analysis (趋势和演变分析)Sequential pattern mining(序列模式挖掘)Regression analysis(回归分析)Periodicity analysis(周期分析)Similarity-based analysis(基于相似度分析)What kinds of patterns can be mined?(3)In the context of text and Web mining, the knowledge also includes: (在文本挖掘或web挖掘中还可以发现)Word association (术语关联)Web resource discovery (WEB资源发现)News Event (新闻事件)Browsing behavior (浏览行为)Online communities (网上社团)Mining Web link structures to identify authoritative Web pages finding spam sites (发现垃圾网站)Opinion Mining (观点挖掘)…10Major Issues in Data Mining (1)Mining methodology(挖掘方法)and user interactionMining different kinds of knowledge in DBs (从DB 挖掘不同类型知识) Interactive mining of knowledge at multiple levels of abstraction (在多个抽象层上交互挖掘知识)Incorporation of background knowledge (结合背景知识)Data mining query languages (数据挖掘查询语言)Presentation and visualization of data mining results(结果可视化表示) Handling noise and incomplete data (处理噪音和不完全数据) Pattern evaluation (模式评估)Performance and scalability (性能和可伸缩性) Efficiency(有效性)and scalability(可伸缩性)of data mining algorithmsParallel(并行), distributed(分布) & incremental(增量)mining methods©Wu Yangyang 11Major Issues in Data Mining (2)Issues relating to the diversity of data types (数据多样性相关问题)Handling relational and complex types of data (关系和复杂类型数据) Mining information from heterogeneous databases and www(异质异构) Issues related to applications (应用相关的问题) Application of discovered knowledge (所发现知识的应用)Domain-specific data mining tools (面向特定领域的挖掘工具)Intelligent query answering (智能问答) Process control(过程控制)and decision making(决策制定)Integration of the discovered knowledge with existing knowledge:A knowledge fusion problem (知识融合)Protection of data security(数据安全), integrity(完整性), and privacy12CulturesDatabases: concentrate on large-scale (non-main-memory) data.(数据库:关注大规模数据)To a database person, data-mining is an extreme form of analytic processing. Result is the data that answers the query.(对数据库工作者而言数据挖掘是一种分析处理, 其结果就是问题答案) AI (machine-learning): concentrate on complex methods, small data.(人工智能(机器学习):关注复杂方法,小数据)Statistics: concentrate on models. (统计:关注模型.)To a statistician, data-mining is the inference of models. Result is the parameters of the model (数据挖掘是模型推论, 其结果是一些模型参数)e.g. Given a billion numbers, a statistician might fit the billion points to the best Gaussian distribution and report the mean and standard deviation.©Wu Yangyang 13Data Cleaning (1)Data Preprocessing (数据预处理):Cleaning, integration, transformation, reduction, discretization (离散化) Why data cleaning? (为什么要清理数据?)--No quality data, no quality mining results! Garbage in, Garbage out! Measure of data quality (数据质量的度量标准)Accuracy (正确性)Completeness (完整性)Consistency(一致)Timeliness(适时)Believability(可信)Interpretability(可解释性) Accessibility(可存取性)14Data Cleaning (2)Data in the real world is dirtyIncomplete (不完全):Lacking some attribute values (缺少一些属性值)Lacking certain interest attributes /containing only aggregate data(缺少某些有用属性或只包含聚集数据)Noisy(有噪音): containing errors or outliers(包含错误或异常) Inconsistent: containing discrepancies in codes or names(不一致: 编码或名称存在差异)Major tasks in data cleaning (数据清理的主要任务)Fill in missing values (补上缺少的值)Identify outliers(识别出异常值)and smooth out noisy data(消除噪音)Correct inconsistent data(校正不一致数据) Resolve redundancy caused by data integration (消除集成产生的冗余)15Data Cleaning (3)Handle missing values (处理缺值问题) Ignore the tuple (忽略该元组) Fill in the missing value manually (人工填补) Use a global constant to fill in the missing value (用全局常量填补) Use the attribute mean to fill in the missing value (该属性平均值填补) Use the attribute mean for all samples belonging to the same class to fill in the missing value (用同类的属性平均值填补) Use the most probable value(最大可能的值)to fill in the missing value Identify outliers and smooth out noisy data(识别异常值和消除噪音)Binning method (分箱方法):First sort data and partition into bins (先排序、分箱)Then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.(然后用平均值、中值、边界值平滑)©Wu Yangyang 16Data Cleaning (4)Example: Sorted data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 Partition into (equi-depth) bins (分成等深的箱):-Bin 1: 4, 8, 9, 15-Bin 2: 21, 21, 24, 25-Bin 3: 26, 28, 29, 34Smoothing by bin means (用平均值平滑):-Bin 1: 9, 9, 9, 9-Bin 2: 23, 23, 23, 23-Bin 3: 29, 29, 29, 29Smoothing by bin boundaries (用边界值平滑):-Bin 1: 4, 4, 4, 15-Bin 2: 21, 21, 25, 25-Bin 3: 26, 26, 26, 34Clustering (。
data mining 8
5
Dissimilarity matrix
one-mode
dissimilarity between the 2nd and the 3rd instance
1 0 b d b+d sum a+b c+d n 1 0 sum a c a+c
object i
if the attribute is symmetric, i.e. both of the possible values are of the same importance: b+c simple matching coefficient d(i, j) = a +b+c +d if the attribute is asymmetric, i.e. “1” is more important than “0”: e.g. when diagnosing diseases b+c Jaccard coefficient d(i, j) = a +b+c
v′ =
v mean A mean _ abs _ dev A
1 n mi =1
why use mean absolute deviation instead of standard deviation?
the former is more robust to outliers than the latter the outliers remain detectable when using the former
data mining 练习题
1.Below is a table representing eight transactions and five items: Beer, Coke, Pepsi,Milk, and Juice. The items are represented by their first letters; e.g., "M" = milk.which of the pairs are frequent itemsets?2.Here is a table with seven transactions and six items, A through F. An "x"indicates that the item is in the transaction.A B C D E Fx x x x xx x x x xx x x xx x xx x x x xx x xx x xAssume that the support threshold is 2. Find all the closed frequent itemsets. Answer: There are many ways to find the frequent itemsets, but the amount of data is small, so we'll just list the results.Among the pairs, all but AF are frequent. The counts are:AC, CE: 5 AE, CD: 4 AD, BC, BD, BE, BF, DE: 3 AB, CF, DF, EF: 2Here are the counts of the frequent triples:ACE: 4 ACD, BCE, CDE: 3 ABC, ABE, ADE, BCD, BDE, BDF, BCF, BEF, CEF: 2There are four quadruples that are frequent, all with counts of 2: BCEF, BCDE, ACDE, and ABCE. There are no frequent sets of five items.To be closed, the itemset must have a larger count than all of its immediate supersets. Thus, all four of the listed quadruples are closed. A triple with a count of 2 cannot be closed unless it is contained in none of the four frequent quadruples. Among these, only BDF qualifies as closed. However, each of the triples with a count of 3 or 4 is closed, since there are no quadruples with counts this high.Among the pairs, only AC, CD, BD, and BF are closed. Among the singletons, only A and F are not closed. A, which appears 5 times, is contained in AC, which also occurs 5 times, and F, which occurs 3 times, is not closed because BF also appears 3 times.3. Find the set of 2-shingles for the "document":ABRACADABRAand also for the "document":BRICABRACAnswer the following questions:1.How many 2-shingles does ABRACADABRA have?2.How many 2-shingles does BRICABRAC have?3.How many 2-shingles do they have in common?4.What is the Jaccard similarity between the two documents"?Answer: The 2-shingles for ABRACADABRA: AB, BR, RA, AC, CA, AD, DA. The 2-shingles for BRICABRAC: BR, RI, IC, CA, AB, RA, AC.There are 5 shingles in common:AB, BR, RA, AC, CA.As there are 9 different shingles in all, the Jaccard similarity is 5/9.State the correct minhash value of each column.Answer: Look at the rows in the stated order R4, R6, R1, R3, R5, R2, and for each row, make that row be the minhash value of a column if the column has not yet beenassigned a minhash value. We sart with R4, which only has 1 in column C3, so the minhash value for C3 is R4.Next, we consider R6, which has 1 in C2 only. Since C2 does not yet have a minhash value, R6 becomes its value.Next is R1, with 1's in C2 and C3. However, both these columns already have minhash values, so we do nothing.Next, consider R3. It has 1's in C2 and C4. C2 already has a minhash value, but C4 does not. Thus, the minhash value of C4 is R3.When we consider R5 next, we see it has 1's in C1 and C3. The latter already has a minhash value, but R5 becomes the minhash value for C1. Since all columns now have minhash values, we are done.5. Perform a hierarchical clustering of the following six points:using the centroid proximity measure (distance between two clusters is the distance between their centroids). If you do this task correctly, you will find that there is a stage at which there is a tie for which pair of clusters is closest. Follow both choices. You will find that some sets of points are clusters in both cases, some sets are clusters in only one, and some are not clusters regardless of which choice you make. Answer: First, A and B, being the closest pair of points gets merged. The centroid for this pair is at (5,5). The next closest pair of centroids is C and F, so these are merged and their centroid is at (24.5, 13.5). At this time, there is a tie for closest centroids. AB and CF have centroids at distance sqrt(452.5), and so do D and CF. Thus, there are two possible third merges:1.Merge AB and CF, giving three clusters ABCF, D, and E. The centroid of ABCF is(14.75, 9.25). In this case, the next merge is ABCD with E.2.Merge CF with D, giving three clusters CDF, AB, and E. The centroid of CDF is at(27.33, 20). In this case, the next merge is E with AB.As a result, the two sequences of clusters created are:1.AB, CF, ABCF, ABCEF, ABCDEF.2.AB, CF, CDF, ABE, ABCDEF.6.Consider three Web pages with the following links:Suppose we compute PageRank with a β of 0.7, and we introduce the additional constraint that the sum of the PageRanks of the three pages must be 3, to handle the problem that otherwise any multiple of a solution will also be a solution. Compute the PageRanks a, b, and c of the three pages A, B, and C, respectively.Answer: The rules for computing the next value of a, b, or c as we iterate are:a <- .3 b <- .7(a/2) + .3 c <- .7(a/2+b+c) + .3The reason is that a splits its PageRank between b and c, while b gives all of its to c, and c keeps all its own. However, all PageRank is multiplied by .7 before distribution (the "tax"), and .3 is then added to each new PageRank.In the limit, the assignments become equalities. That immediately tells us a = .3. We can then use the second equation to discover b = .7*.3/2 + .3 = .405. Finally, the third equation simplifies to c = .7(.555 + c) + .3, or .3c = .6885. From this equation we get c = 2.295. It is now a simple matter to compute the subs of each two of the variables: a+b = .705, a+c = 2.595, and b+c = 2.7.7. 分析我们提到的各种算法的优缺点。
数据挖掘工程师招聘笔试题及解答(某大型央企)
招聘数据挖掘工程师笔试题及解答(某大型央企)(答案在后面)一、单项选择题(本大题有10小题,每小题2分,共20分)1、数据挖掘中,以下哪种算法属于监督学习算法?A、K-Means聚类算法B、决策树算法C、Apriori算法D、神经网络算法2、在数据挖掘过程中,以下哪个阶段不是数据预处理的一部分?A、数据清洗B、数据集成C、数据规约D、数据增强3、在数据挖掘中,以下哪种算法通常用于分类任务?A、K均值聚类算法B、K最近邻算法C、决策树算法D、Apriori算法4、在处理大规模数据集时,以下哪种技术通常用于提高数据挖掘的性能?A、数据抽样B、特征选择C、并行计算D、数据预处理5、某大型央企在进行客户满意度调查时,收集到了以下数据:客户满意度评分(1-10分),购买产品的数量,客户性别(男/女)。
为了分析不同性别客户对产品的满意度差异,以下哪种统计方法最为合适?A. 相关性分析B. 描述性统计C. 聚类分析D. 逻辑回归6、在进行数据挖掘项目时,发现数据集中存在大量缺失值。
以下哪种策略最有利于提高模型的质量?A. 直接删除含有缺失值的样本B. 使用均值、中位数或众数填充缺失值C. 使用模型预测缺失值D. 忽略缺失值,继续进行数据挖掘7、以下哪项不是数据挖掘过程中的预处理步骤?A. 数据清洗B. 数据集成C. 数据挖掘D. 数据变换8、在数据挖掘任务中,以下哪种算法通常用于分类问题?A. 聚类算法B. 关联规则算法C. 回归算法D. 决策树算法9、在数据挖掘过程中,以下哪项不是特征选择的方法?A. 相关性分析B. 主成分分析C. 决策树D. 支持向量机 10、下列关于K-means聚类算法的描述,错误的是:A. K-means算法是一种基于距离的聚类方法B. K-means算法需要预先指定聚类数量C. K-means算法在迭代过程中可能会陷入局部最优解D. K-means算法适用于高维数据二、多项选择题(本大题有10小题,每小题4分,共40分)1、关于数据挖掘技术,以下说法正确的是:A、数据挖掘是一种通过分析大量数据来发现有价值信息的过程。
data mining 2
clerks, IT professionals day to day operations application-oriented current, up-to-date detailed, flat relational isolated read/write short, simple transaction tens thousands 100MB-GB transaction throughput
2010-5-17 4
Data warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DBMS
Build wrappers/mediators on top of heterogeneous databases Query driven approach
Data warehouse
On-line analytical processing (OLAP) Data analysis and decision making
2010-5-17
6
OLTP vs. OLAP
OLTP users function DB design data access unit of work number of accessed records number of users DB size metric
when a query is posed, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set require complex information filtering, compete for local resources
大数据建模练习(习题卷2)
大数据建模练习(习题卷2)说明:答案和解析在试卷最后第1部分:单项选择题,共134题,每题只有一个正确答案,多选或少选均不得分。
1.[单选题]出入境模型中,要过滤出来自东南亚地区的人员,可以对“来自国家地区代码”字段进行过滤,并按照国际设置多个过滤条件,请问多过滤条件之间是什么关系A)逻辑与B)逻辑或C)逻辑非2.[单选题]在pandas中以下哪个方法用于向csv文件中实现写入工作?A)to_csv()B)read_csv()C)to_excel()3.[单选题]为了提高测试的效率,应该A)集中对付那些错误群集的程序B)随机选取测试数据C)在完成编码以后制定软件的测试计划D)取一切可能的输入数据作为测试数据4.[单选题]进入要操作的数据库TEST用以下哪一项( )A)IN TESTB)SHOW TESTC)USER TESTD)USE TEST5.[单选题]下面代码的输出结果是:x = 12.34print(type(x))A)<class 'int'>B)<class 'float'>C)<class 'bool'>D)<class 'complex'>6.[单选题]数据对账主要是针对数据治理过程中,数据提供方与数据治理方的数据账单的一致性对账,以确保数据在治理过程中的完整性。
数据对账主要是对数据的多个维度进行校验,以下选项不属于校验维度的是?A)唯一性B)完整性C)实时性D)正确性7.[单选题]在SQL语言中的视图VIEW是数据库的( )C)模式D)内模式8.[单选题]近年来,随着新科技的不断普及,愈来愈多的个人数据被采集和存储了下来,个人信息网络化和透明化已经成为不可阻挡的趋势。
那么,最突出的大数据环境是?A)物联网B)互联网C)综合国力D)自然资源9.[单选题]计算机储存数据时都有对应的类型,比如字符型、整数型、日期型等等;针对不同的数据类型,都有不同的适用的计算公式和方法等。
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
1.Below is a table representing eight transactions and five items: Beer, Coke, Pepsi,Milk, and Juice. The items are represented by their first letters; e.g., "M" = milk.which of the pairs are frequent itemsets?2.Here is a table with seven transactions and six items, A through F. An "x"indicates that the item is in the transaction.A B C D E Fx x x x xx x x x xx x x xx x xx x x x xx x xx x xAssume that the support threshold is 2. Find all the closed frequent itemsets. Answer: There are many ways to find the frequent itemsets, but the amount of data is small, so we'll just list the results.Among the pairs, all but AF are frequent. The counts are:AC, CE: 5AE, CD: 4AD, BC, BD, BE, BF, DE: 3AB, CF, DF, EF: 2Here are the counts of the frequent triples:ACE: 4ACD, BCE, CDE: 3ABC, ABE, ADE, BCD, BDE, BDF, BCF, BEF, CEF: 2There are four quadruples that are frequent, all with counts of 2: BCEF, BCDE, ACDE, and ABCE. There are no frequent sets of five items.To be closed, the itemset must have a larger count than all of its immediate supersets. Thus, all four of the listed quadruples are closed. A triple with a count of 2 cannot be closed unless it is contained in none of the four frequent quadruples. Among these, only BDF qualifies as closed. However, each of the triples with a count of 3 or 4 is closed, since there are no quadruples with counts this high.Among the pairs, only AC, CD, BD, and BF are closed. Among the singletons, only A and F are not closed. A, which appears 5 times, is contained in AC, which also occurs 5 times, and F, which occurs 3 times, is not closed because BF also appears 3 times.3. Find the set of 2-shingles for the "document":ABRACADABRAand also for the "document":BRICABRACAnswer the following questions:1.How many 2-shingles does ABRACADABRA have?2.How many 2-shingles does BRICABRAC have?3.How many 2-shingles do they have in common?4.What is the Jaccard similarity between the two documents"?Answer: The 2-shingles for ABRACADABRA: AB, BR, RA, AC, CA, AD, DA. The 2-shingles for BRICABRAC: BR, RI, IC, CA, AB, RA, AC.There are 5 shingles in common:AB, BR, RA, AC, CA.As there are 9 different shingles in all, the Jaccard similarity is 5/9.State the correct minhash value of each column.Answer: Look at the rows in the stated order R4, R6, R1, R3, R5, R2, and for each row, make that row be the minhash value of a column if the column has not yet beenassigned a minhash value. We sart with R4, which only has 1 in column C3, so the minhash value for C3 is R4.Next, we consider R6, which has 1 in C2 only. Since C2 does not yet have a minhash value, R6 becomes its value.Next is R1, with 1's in C2 and C3. However, both these columns already have minhash values, so we do nothing.Next, consider R3. It has 1's in C2 and C4. C2 already has a minhash value, but C4 does not. Thus, the minhash value of C4 is R3.When we consider R5 next, we see it has 1's in C1 and C3. The latter already has a minhash value, but R5 becomes the minhash value for C1. Since all columns now have minhash values, we are done.5. Perform a hierarchical clustering of the following six points:using the centroid proximity measure (distance between two clusters is the distance between their centroids). If you do this task correctly, you will find that there is a stage at which there is a tie for which pair of clusters is closest. Follow both choices. You will find that some sets of points are clusters in both cases, some sets are clusters in only one, and some are not clusters regardless of which choice you make. Answer: First, A and B, being the closest pair of points gets merged. The centroid for this pair is at (5,5). The next closest pair of centroids is C and F, so these are merged and their centroid is at (24.5, 13.5). At this time, there is a tie for closest centroids. AB and CF have centroids at distance sqrt(452.5), and so do D and CF. Thus, there are two possible third merges:1.Merge AB and CF, giving three clusters ABCF, D, and E. The centroid of ABCF is(14.75, 9.25). In this case, the next merge is ABCD with E.2.Merge CF with D, giving three clusters CDF, AB, and E. The centroid of CDF is at(27.33, 20). In this case, the next merge is E with AB.As a result, the two sequences of clusters created are:1.AB, CF, ABCF, ABCEF, ABCDEF.2.AB, CF, CDF, ABE, ABCDEF.6.Consider three Web pages with the following links:Suppose we compute PageRank with a β of 0.7, and we introduce the additional constraint that the sum of the PageRanks of the three pages must be 3, to handle the problem that otherwise any multiple of a solution will also be a solution. Compute the PageRanks a, b, and c of the three pages A, B, and C, respectively.Answer: The rules for computing the next value of a, b, or c as we iterate are:a <- .3b <- .7(a/2) + .3c <- .7(a/2+b+c) + .3The reason is that a splits its PageRank between b and c, while b gives all of its to c, and c keeps all its own. However, all PageRank is multiplied by .7 before distribution (the "tax"), and .3 is then added to each new PageRank.In the limit, the assignments become equalities. That immediately tells us a = .3. We can then use the second equation to discover b = .7*.3/2 + .3 = .405. Finally, the third equation simplifies to c = .7(.555 + c) + .3, or .3c = .6885. From this equation we get c = 2.295. It is now a simple matter to compute the subs of each two of the variables: a+b = .705, a+c = 2.595, and b+c = 2.7.7. 分析我们提到的各种算法的优缺点。