Handling Missing Values


Missing values

1. is.na: locating missing values

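Note: Missing values are considered non-comparable, even with themselves. This means that comparison operators cannot be used to test for the presence of missing values. For example, the logical test myvar == NA never evaluates to TRUE. Instead, use the functions that handle missing values (such as those described in this section) to identify missing values in R data objects.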
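As a minimal illustration of the note above, the following base R snippet contrasts the comparison test with is.na(). The vector myvar and its values are made up for the example.

```r
# Illustrative vector with two missing entries (values are made up)
myvar <- c(1, NA, 3, NA)

# Comparing with NA yields NA for every element, never TRUE
cmp <- myvar == NA
print(cmp)       # NA NA NA NA

# is.na() returns a logical mask marking the missing positions
mask <- is.na(myvar)
print(mask)      # FALSE TRUE FALSE TRUE

# which() turns the mask into positions; sum() counts the missing values
positions <- which(mask)
n_missing <- sum(mask)
```

Because the result of myvar == NA is itself NA, wrapping it in if() raises an error, which is another reason to rely on is.na().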

2. na.omit(): removing incomplete observations
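A short sketch of how na.omit() behaves on a data frame, using a small made-up data frame df. complete.cases() is shown alongside because it exposes the same row-wise test as a logical vector.

```r
# Made-up data frame: only the first row has no missing values
df <- data.frame(
  x = c(1, NA, 3),
  y = c("a", "b", NA)
)

# na.omit() drops every row that contains at least one NA
complete <- na.omit(df)
print(nrow(complete))   # 1

# complete.cases() reports which rows are fully observed
ok <- complete.cases(df)
print(ok)               # TRUE FALSE FALSE
```

Dropping rows this way is simple but can discard a lot of data when missing values are spread across many rows.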

manyNAs

library(DMwR)

manyNAs(data, nORp = 0.2)

Arguments

data

A data frame with the data set.

nORp

A number controlling when a row is considered to have too many NA values (defaults to 0.2, i.e. 20% of the columns). If no rows satisfy the constraint indicated by the user, a warning is generated.

Rows are judged by their proportion of missing values.
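The proportion rule can be approximated in base R without the package. The sketch below is not the DMwR implementation: rows_with_many_nas is a hypothetical helper, and treating the threshold as strictly "greater than" is an assumption.

```r
# Hypothetical helper (not part of DMwR): return the indices of rows whose
# share of NA values exceeds `prop`. The strict ">" is an assumption.
rows_with_many_nas <- function(data, prop = 0.2) {
  na_share <- rowMeans(is.na(data))  # fraction of missing columns per row
  unname(which(na_share > prop))
}

# Made-up data: row 2 has 2 of 6 values missing, row 3 has 1 of 6
df <- data.frame(
  a = c(1, NA, 3),
  b = c(4, NA, 6),
  c = c(7, 8, NA),
  d = c(1, 2, 3),
  e = c(4, 5, 6),
  f = c(7, 8, 9)
)

flagged <- rows_with_many_nas(df)
print(flagged)   # 2
```

Only row 2 is flagged, since 2/6 exceeds 0.2 while 1/6 does not. When removing flagged rows with negative indexing, it is worth checking first that the result is not empty, because indexing with an empty negative vector selects no rows.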

3. knnImputation: k-nearest-neighbour imputation

library(DMwR)

knnImputation(data, k = 10, scale = T, meth = "weighAvg", distData = NULL)


Arguments

data

A data frame with the data set.

k

The number of nearest neighbours to use (defaults to 10).

scale

Boolean setting whether the data should be scaled before finding the nearest neighbours (defaults to T).

meth

String indicating the method used to calculate the value to fill in each NA. Available values are 'median' or 'weighAvg' (the default).

distData

Optionally, you may specify here a data frame containing the data set that should be used to find the neighbours. This is useful when filling in NA values on a test set, where you should use only information from the training set. This defaults to NULL, which means that the neighbours will be searched for in data.

Details

This function uses the k-nearest neighbours to fill in the unknown (NA) values in a data set. For each case with any NA value, it searches for its k most similar cases and uses the values of these cases to fill in the unknowns.

If meth='median', the function uses either the median (for numeric variables) or the most frequent value (for factors) of the neighbours to fill in the NAs. If meth='weighAvg', the function uses a weighted average of the values of the neighbours. The weights are given by exp(-dist(k,x)), where dist(k,x) is the Euclidean distance between the case with NAs (x) and the neighbour k.
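To make the two fill rules concrete, here is a small base R sketch on made-up data. It is an illustration of the description above, not the DMwR code: the variable names are hypothetical, the scaling step (scale = T) is omitted for brevity, and only one numeric column is imputed.

```r
# Toy setup: the target case is missing `z`; the candidates are complete.
target <- c(x = 1, y = 2)
candidates <- data.frame(
  x = c(1, 2, 5),
  y = c(2, 2, 6),
  z = c(10, 20, 30)
)

# Euclidean distance from the target to each candidate on the observed columns
diffs <- sweep(as.matrix(candidates[, c("x", "y")]), 2, target)
dists <- sqrt(rowSums(diffs^2))

# Keep the k nearest candidates
k <- 2
nearest <- order(dists)[seq_len(k)]

# Analogue of meth = 'weighAvg': weights exp(-distance), closer cases count more
w <- exp(-dists[nearest])
fill_weighted <- sum(w * candidates$z[nearest]) / sum(w)

# Analogue of meth = 'median' for a numeric variable
fill_median <- median(candidates$z[nearest])
```

Here the two nearest candidates lie at distances 0 and 1, so the weighted fill is pulled towards the closer neighbour's value of 10, while the median fill sits midway at 15.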

Example:

# Load the package and clean the data first

library(DMwR)

data(algae)

algae <- algae[-manyNAs(algae), ]

clean.algae <- knnImputation(algae[,1:12],k=10)


> head(clean.algae)

  season  size  speed mxPH mnO2     Cl    NO3     NH4    oPO4     PO4 Chla  a1
1 winter small medium 8.00  9.8 60.800  6.238 578.000 105.000 170.000 50.0 0.0
2 spring small medium 8.35  8.0 57.750  1.288 370.000 428.750 558.750  1.3 1.4
3 autumn small medium 8.10 11.4 40.020  5.330 346.667 125.667 187.057 15.6 3.3
4 spring small medium 8.07  4.8 77.364  2.302  98.182  61.182 138.700  1.4 3.1
5 autumn small medium 8.06  9.0 55.350 10.416 233.700  58.222  97.580 10.5 9.2
