
2016/11/13
Data Mining
Basic Statistical Descriptions of Data
To better understand the data: central tendency, variation and spread
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero-point
Knowledge Discovery
Also called samples, examples, instances, data points, objects, tuples.
Data objects are described by attributes.
Database rows → data objects; columns → attributes.
Types of Data Sets
Relational records
Data matrix, e.g., numerical matrix, crosstabs
Document data: text documents: term-frequency vector
Transaction data
[Figures: relational table (age, income, class); document term-frequency matrix; transaction table (TID, Items)]
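The term-frequency representation of document data can be sketched as follows. This is a minimal illustration, not the slides' own method: the function name `term_frequency_vectors`, the whitespace tokenization, and the three sample documents are all assumptions made for the example.

```python
from collections import Counter


def term_frequency_vectors(documents):
    """Return (vocabulary, vectors) with one term-count vector per document.

    Assumption: terms are lower-cased, whitespace-separated tokens.
    """
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary fixes the meaning of each vector position.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Made-up documents, for illustration only.
docs = ["team coach play ball", "ball score game ball", "game win game"]
vocabulary, vectors = term_frequency_vectors(docs)
for doc, vector in zip(docs, vectors):
    print(vector, "<-", doc)
```

Each document becomes one row of a data matrix whose columns are the vocabulary terms; most entries are zero, which is why such data is usually described as sparse.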
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population)
Dimensionality: curse of dimensionality
Sparsity: only presence counts
Resolution: patterns depend on the scale
Distribution: centrality and dispersion
Nominal
Binary
Numeric: quantitative
Interval-scaled
Ratio-scaled
Attribute Types
Nominal: categories, states, or “names of things”
Data Objects
Data sets are made up of data objects. A data object represents an entity. Examples:
sales database: customers, store items, sales
medical database: patients, treatments
university database: students, professors, courses
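Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Note: n is sample size and N is population size.
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: chopping extreme values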
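The three variants of the mean can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the helper names `weighted_mean` and `trimmed_mean`, the trimming convention (drop the same fraction from each end of the sorted data), and the sample values are not taken from the slides.

```python
from statistics import fmean


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from EACH end.

    Assumption: the number of values cut per side is rounded down.
    """
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # values removed per side
    return fmean(ordered[k:len(ordered) - k])


# Made-up sample with one extreme value (110), for illustration only.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
plain = fmean(data)                # ordinary arithmetic mean
trimmed = trimmed_mean(data, 0.1)  # drops the smallest and largest value
weighted = weighted_mean([1, 2, 3], [1, 1, 2])
print(plain, trimmed, weighted)
```

The trimmed mean is lower than the plain mean here because the single large value no longer pulls the average upward.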
Median: middle value if odd number of values, or average of the middle two values otherwise
Estimated by interpolation (for grouped data)
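The odd/even rule for the median can be written out directly. The sketch below covers only ungrouped data; it does not implement the interpolation estimate for grouped data, and the function name `median_of` is an assumption made for the example.

```python
import statistics


def median_of(values):
    """Middle value for an odd count, mean of the two middle values otherwise."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median_of([3, 1, 2])       # middle of 1, 2, 3
even_case = median_of([4, 1, 3, 2])   # average of 2 and 3
# Cross-check against the standard library implementation.
agrees = even_case == statistics.median([4, 1, 3, 2])
print(odd_case, even_case, agrees)
```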
Data dispersion: analyzed with multiple granularities of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words in a collection of documents
Ratio-scaled: inherent zero-point
We can speak of a value as being a multiple (or ratio) of another value (10 K is twice as high as 5 K).
E.g., temperature in Kelvin, length, counts, monetary quantities
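A short numeric check shows why ratio statements need a true zero-point. The sketch assumes the usual Celsius-to-Kelvin offset of 273.15; the function name and the two temperatures are chosen for illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin (offset of 273.15)."""
    return celsius + 273.15


warm_c, cool_c = 10.0, 5.0

# Differences are meaningful on an interval scale: they survive the conversion.
diff_celsius = warm_c - cool_c
diff_kelvin = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratios are not: "10 °C is twice 5 °C" is an artifact of the arbitrary zero.
ratio_celsius = warm_c / cool_c
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(diff_celsius, round(diff_kelvin, 6))
print(ratio_celsius, round(ratio_kelvin, 3))
```

On the Kelvin scale the same two temperatures differ by a factor of only about 1.018, so the factor of 2 computed in Celsius carries no physical meaning.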
Social or information networks
Types of Data Sets
Video data: sequence of images
Temporal data: time-series
Sequential data: transaction sequences
Genetic sequence data
Has real numbers as attribute values
E.g., temperature, height, or weight
Practically, real values can only be measured and represented using a finite number of digits.
Continuous attributes are typically represented as floating-point variables.
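The finite-digit representation has a practical consequence when comparing continuous values. A minimal sketch, using Python's built-in `float` and the standard `math.isclose` helper:

```python
import math

total = 0.1 + 0.2

# Binary floating point cannot store 0.1, 0.2 or 0.3 exactly,
# so an exact equality test fails.
exactly_equal = (total == 0.3)

# Comparing within a tolerance is the usual remedy.
close_enough = math.isclose(total, 0.3)

print(exactly_equal, close_enough)
```

For this reason, continuous attributes are normally compared with a tolerance, or binned, rather than tested for exact equality.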
Hair_color = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
Binary: nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important, e.g., gender
Asymmetric binary: outcomes not equally important, e.g., medical test (positive vs. negative)
Convention: assign 1 to the most important outcome (e.g., HIV positive)
Getting to Know Your Data
Data Objects and Attribute Types
Basic Statistical Descriptions of Data
Data Visualization
Measuring Data Similarity and Dissimilarity
Summary
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
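These dispersion measures can be computed with the standard library, as sketched below. Two assumptions are not from the slides: quartiles use the default ("exclusive") method of `statistics.quantiles`, and outliers are flagged with the common 1.5 × IQR convention. The sample values are made up.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return min(values), q1, q2, q3, max(values)


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3 (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up sample, for illustration only.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = five_number_summary(data)
outliers = iqr_outliers(data)
sample_variance = statistics.variance(data)  # divides by n - 1
print(summary, outliers, round(sample_variance, 2))
```

The five-number summary is exactly what a boxplot draws, so the same function can feed a plotting routine.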
Numerical dimensions correspond to sorted intervals
World Wide Web
Molecular structures
Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature of a data object.
E.g., customer_ID, name, address
Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known.
Size = {small, medium, large}, grades, army rankings
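Data matrix (m data objects, n attributes):
$$\begin{bmatrix} x_{11} & \cdots & x_{1j} & \cdots & x_{1n} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{ij} & \cdots & x_{in} \\ \vdots & & \vdots & & \vdots \\ x_{m1} & \cdots & x_{mj} & \cdots & x_{mn} \end{bmatrix}$$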
Graph and network
Sometimes represented as integer variables
Note: binary attributes are a special case of discrete attributes
Continuous Attribute
Mode: value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
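The mode and the empirical relation can be illustrated as follows. This is a rough sketch: the relation is only an approximation, normally quoted for unimodal, moderately skewed data, and the helper name `empirical_mode` and the sample values are assumptions made for the example.

```python
import statistics


def empirical_mode(values):
    """Estimate the mode from mean - mode = 3 * (mean - median)."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    return mean - 3 * (mean - median)


# Made-up, right-skewed unimodal sample.
skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
true_modes = statistics.multimode(skewed)  # every most-frequent value
estimate = empirical_mode(skewed)

# A bimodal sample has two most-frequent values.
bimodal_modes = statistics.multimode([1, 1, 2, 3, 3])

print(true_modes, round(estimate, 2), bimodal_modes)
```

On such a small sample the estimate (about 1.8) is visibly off from the observed mode of 3, which is a reminder that the formula is a rule of thumb rather than an identity.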
Spatial, image and multimedia
Spatial data: maps
Image data
Video data
Important Characteristics of Structured Data