k-Nearest Neighbor Algorithm Lab Report

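Title: Implementation of the k-Nearest Neighbor Algorithm
Student name
Student ID
Major and class
Supervisor
2015-1-2
Experiment 2: Implementation of the k-Nearest Neighbor Algorithm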
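I. Objectives
1. Deepen the understanding of the k-nearest neighbor algorithm;
2. Practice analyzing problems, solving them, and implementing the solution hands-on.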

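II. Requirements
Using a programming language you are familiar with, such as C++ or Java, and given the number of nearest neighbors k and the number of attributes n that describe each tuple, implement the k-nearest neighbor classification algorithm and compare its performance on at least two different data sets.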

III. Environment
Windows 7 Ultimate + Visual Studio 2010
Language: C++
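IV. Algorithm Description
KNN (k Nearest Neighbors) is also known as the k-nearest neighbor method.

Suppose each class contains multiple samples and every sample carries a unique class label stating which class it belongs to. KNN computes the distance from each of these labeled samples to the sample to be classified.

If the majority of the k most similar samples in feature space (that is, the k nearest neighbors in feature space) belong to a certain class, the sample is assigned to that class as well.

In making the class decision, the method relies only on the classes of the one or few nearest samples.

Although KNN also rests on limit theorems in principle, the class decision itself involves only a very small number of neighboring samples.

For this reason, the method can cope relatively well with imbalanced samples.

Moreover, because KNN determines the class mainly from a limited number of nearby samples rather than from discriminating class regions, it is better suited than other methods to sample sets whose class regions intersect or overlap heavily.

The drawback of the method is its computational cost: for every sample to be classified, the distance to all known samples must be computed before its K nearest neighbors can be found.

A common remedy is to edit the known samples beforehand, removing those that contribute little to classification.

The algorithm is better suited to the automatic classification of classes with many samples; classes with few samples are more likely to be misclassified.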

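1. Algorithm idea
k-nearest neighbor classification stores all training samples, builds no model before a new sample arrives, and performs classification only when a new (unlabeled) sample needs to be classified.

It is based on learning by analogy. The training samples are described by n numeric attributes, so each sample represents a point in n-dimensional space.

All training samples are thus stored in an n-dimensional pattern space.

Given an unknown sample, the classifier searches the pattern space for the K training samples closest to the unknown sample.

These K training samples are the K "nearest neighbors" of the unknown sample.

"Closeness", also called dissimilarity, is defined by the Euclidean distance. The Euclidean distance between two points X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) is: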
$$D(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$$
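The unknown sample is assigned to the most common class among its K nearest neighbors.

In the simplest case, K = 1, the unknown sample is assigned the class of the training sample closest to it in pattern space.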
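
To make the distance formula and the K = 1 case concrete, here is a minimal C++ sketch. The function names `euclidean_distance` and `nearest_index` are illustrative and do not come from the report's program; samples are assumed to be stored as `std::vector<double>`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean distance D(X, Y) between two points with the same number of attributes.
double euclidean_distance(const std::vector<double>& x, const std::vector<double>& y)
{
    assert(x.size() == y.size());  // both points must live in the same n-dimensional space
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double diff = x[i] - y[i];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

// K = 1 case: index of the training sample closest to `query`.
// Assumes `training` is not empty; on a tie the earlier sample wins.
std::size_t nearest_index(const std::vector<std::vector<double> >& training,
                          const std::vector<double>& query)
{
    assert(!training.empty());
    std::size_t best = 0;
    double best_dist = euclidean_distance(training[0], query);
    for (std::size_t i = 1; i < training.size(); ++i) {
        const double d = euclidean_distance(training[i], query);
        if (d < best_dist) {
            best_dist = d;
            best = i;
        }
    }
    return best;
}
```

For example, the distance between (0, 0) and (3, 4) is 5. With K = 1 the unknown sample simply takes the label stored at the index returned by `nearest_index`.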

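2. Algorithm steps
step.1---initialize the distance to the maximum value;
step.2---compute the distance dist between the unknown sample and each training sample;
step.3---obtain maxdist, the largest distance among the current K nearest samples;
step.4---if dist is less than maxdist, take this training sample as one of the K nearest neighbors, replacing the one at distance maxdist;
step.5---repeat steps 2, 3 and 4 until the distances between the unknown sample and all training samples have been computed;
step.6---count how many times each class label occurs among the K nearest neighbors;
step.7---choose the class label with the highest frequency as the class label of the unknown sample.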
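
The seven steps can be sketched in C++ as follows. This is an illustration under stated assumptions, not the report's program: it keeps the current K neighbors in a `std::priority_queue` (a max-heap) so that maxdist is always at the top, and the names `Point`, `point_distance` and `knn_classify` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <queue>
#include <utility>
#include <vector>

typedef std::vector<double> Point;  // hypothetical alias: one sample with n attributes

double point_distance(const Point& a, const Point& b)
{
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return std::sqrt(sum);
}

// Returns the majority label among the k training samples nearest to `query`.
char knn_classify(const std::vector<Point>& samples, const std::vector<char>& labels,
                  const Point& query, std::size_t k)
{
    assert(samples.size() == labels.size());
    assert(k >= 1 && k <= samples.size());

    // Max-heap of (distance, sample index); top() is the farthest current neighbor (maxdist).
    std::priority_queue<std::pair<double, std::size_t> > nearest;

    for (std::size_t i = 0; i < samples.size(); ++i) {       // step 5: visit every sample
        const double dist = point_distance(samples[i], query);  // step 2
        if (nearest.size() < k) {
            // Step 1: while fewer than k neighbors are held, maxdist acts as "infinite".
            nearest.push(std::make_pair(dist, i));
        } else if (dist < nearest.top().first) {              // steps 3 and 4
            nearest.pop();
            nearest.push(std::make_pair(dist, i));
        }
    }

    std::map<char, int> freq;                                 // step 6: count labels
    while (!nearest.empty()) {
        ++freq[labels[nearest.top().second]];
        nearest.pop();
    }

    char best_label = freq.begin()->first;                    // step 7: most frequent label
    int best_count = 0;
    for (std::map<char, int>::const_iterator it = freq.begin(); it != freq.end(); ++it) {
        if (it->second > best_count) {   // on a tie, the label that sorts first is kept
            best_count = it->second;
            best_label = it->first;
        }
    }
    return best_label;
}
```

The heap keeps each replacement at O(log K) instead of rescanning the K neighbors for the maximum. How ties between labels are broken is not specified in the steps; the choice here is only one option.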

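3. Pseudocode
Algorithm for searching the k nearest neighbors: kNN(A[n], k)
Input: A[n], the coordinates of the n training samples in the space (read from a file); x, the test sample; k, the number of neighbors
Output: the class of x
take A[1]~A[k] as the initial neighbors of x and compute the Euclidean distances d(x, A[i]), i = 1, 2, ..., k;
sort them by d(x, A[i]) in ascending order and compute the distance to the farthest neighbor, D <- max{ d(x, A[j]) | j = 1, 2, ..., k };
for (i = k+1; i <= n; i++)
    compute the distance d(x, A[i]) between A[i] and x;
    if d(x, A[i]) < D
    then replace the farthest neighbor with A[i];
         sort the k neighbors by distance in ascending order and recompute D as the largest distance among them;
compute the frequency (probability) of each class among the k neighbors; the class with the highest frequency is the class of x.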
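
The neighbor search in the pseudocode above keeps the current k neighbors sorted by distance, so the farthest one (at distance D) is always last. A possible C++ rendering is sketched below. It uses 0-based indices instead of the pseudocode's 1-based ones, and the names `Neighbor`, `sample_distance` and `k_nearest_sorted` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

typedef std::pair<double, std::size_t> Neighbor;  // (distance to x, index into A)

double sample_distance(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return std::sqrt(sum);
}

// Returns the k samples of A nearest to x, sorted by ascending distance.
std::vector<Neighbor> k_nearest_sorted(const std::vector<std::vector<double> >& A,
                                       const std::vector<double>& x, std::size_t k)
{
    assert(k >= 1 && k <= A.size());

    // Initial neighbors: the first k samples, sorted by distance.
    std::vector<Neighbor> neighbors;
    for (std::size_t i = 0; i < k; ++i) {
        neighbors.push_back(Neighbor(sample_distance(A[i], x), i));
    }
    std::sort(neighbors.begin(), neighbors.end());

    for (std::size_t i = k; i < A.size(); ++i) {
        const Neighbor candidate(sample_distance(A[i], x), i);
        if (candidate.first < neighbors.back().first) {  // back() holds the distance D
            neighbors.pop_back();                        // drop the farthest neighbor
            // Insert at the position that keeps the list sorted.
            neighbors.insert(
                std::upper_bound(neighbors.begin(), neighbors.end(), candidate),
                candidate);
        }
    }
    return neighbors;
}
```

Because the list stays sorted, no full re-sort is needed after a replacement; a single ordered insertion has the same effect as the "sort again" line of the pseudocode.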

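V. Data Structures
The code structure is shown in the figure; the methods are described below:
KNN: constructor of the KNN class, reads the data set;
get_all_distance: public member function of the KNN class, computes the distance from the point to be classified to every training point;
get_distance: private member function of the KNN class, computes the distance between two points;
get_max_freq_label: public member function of the KNN class, finds the most frequent label among the k nearest samples and assigns the test data to that class.

As shown in the class diagram above, the member variables of the KNN class are:
dataSet: two-dimensional array of tData, the data set used for training;
k: int, the number of nearest samples whose class labels are counted to pick the majority class;
labels: one-dimensional array of tLabel, the class labels;
map_index_dis: map<int,double>, records the distance from the test point to each training point;
map_label_freq: map<tLabel,int>, records the count of each class among the k neighbors.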
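
One consequence of storing distances in a `map<int,double>` is that the map is ordered by its key (the sample index), not by distance, so the k nearest samples cannot be read off it directly. A minimal sketch of one way to extract them follows; `IndexDist`, `ByDistance` and `nearest_k` are hypothetical names, and `std::partial_sort` is an alternative to fully sorting the copy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

typedef std::pair<int, double> IndexDist;  // (sample index, distance to the test point)

// Orders pairs by distance instead of by index.
struct ByDistance {
    bool operator()(const IndexDist& lhs, const IndexDist& rhs) const
    {
        return lhs.second < rhs.second;
    }
};

// Copies the map into a vector and returns its k entries with the smallest distance,
// nearest first. If k exceeds the number of entries, all entries are returned.
std::vector<IndexDist> nearest_k(const std::map<int, double>& index_dist, std::size_t k)
{
    std::vector<IndexDist> v(index_dist.begin(), index_dist.end());
    k = std::min(k, v.size());
    std::partial_sort(v.begin(), v.begin() + k, v.end(), ByDistance());
    v.resize(k);
    return v;
}
```

Only the first k positions are put in order here, which is enough for the label count that follows.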

VI. Program Screenshots
VII. Summary
VIII. Appendix
1. Program source code: kNN1.cpp
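#include<iostream>
#include<map>
#include<vector>
#include<algorithm>
#include<fstream>
#include<cmath>    // pow, sqrt
#include<cstdlib>  // exit, system
using namespace std;
typedef char tLabel;            // class label type
typedef double tData;           // attribute value type
typedef pair<int,double> PAIR;  // <sample index, distance to the test point>
const int colLen = 2;           // number of attributes per sample
const int rowLen = 10;          // number of training samples
ifstream fin;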
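class KNN
{
private:
    tData dataSet[rowLen][colLen];   // training data set
    tLabel labels[rowLen];           // class label of each training sample
    int k;                           // number of nearest neighbors
    map<int,double> map_index_dis;   // sample index -> distance to the test point
    map<tLabel,int> map_label_freq;  // label -> count among the k nearest samples
    double get_distance(tData *d1,tData *d2);
public:
    KNN(int k);
    void get_all_distance(tData * testData);
    void get_max_freq_label();
    // orders <index,distance> pairs by distance
    struct CmpByValue
    {
        bool operator() (const PAIR& lhs,const PAIR& rhs)
        {
            return lhs.second < rhs.second;
        }
    };
};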
KNN::KNN(int k)
{
    this->k = k;
    fin.open("data.txt");
    if(!fin)
    {
        cout<<"can not open the file data.txt"<<endl;
        exit(1);
    }
    /* input the dataSet: colLen attribute values followed by one label per row */
    for(int i=0;i<rowLen;i++)
    {
        for(int j=0;j<colLen;j++)
        {
            fin>>dataSet[i][j];
        }
        fin>>labels[i];
    }
}
/*
 * calculate the distance between test data and dataSet[i]
 */
double KNN:: get_distance(tData *d1,tData *d2)
{
    double sum = 0;
    for(int i=0;i<colLen;i++)
    {
        sum += pow( (d1[i]-d2[i]) , 2 );
    }
    //cout<<"the sum is = "<<sum<<endl;
    return sqrt(sum);
}
/*
 * calculate all the distance between test data and each training data
 */
void KNN:: get_all_distance(tData * testData)
{
    double distance;
    int i;
    for(i=0;i<rowLen;i++)
    {
        distance = get_distance(dataSet[i],testData);
        //<key,value> => <i,distance>
        map_index_dis[i] = distance;
    }
    //traverse the map to print the index and distance
    map<int,double>::const_iterator it = map_index_dis.begin();
    while(it!=map_index_dis.end())
    {
        cout<<"index = "<<it->first<<" distance = "<<it->second<<endl;
        it++;
    }
}
/*
 * find the most frequent label among the k nearest samples to classify the test data
 */
void KNN:: get_max_freq_label()
{
    //transform the map_index_dis to vec_index_dis
    vector<PAIR> vec_index_dis( map_index_dis.begin(),map_index_dis.end() );
    //sort the vec_index_dis by distance from low to high to get the nearest data
    sort(vec_index_dis.begin(),vec_index_dis.end(),CmpByValue());
    //never look at more neighbors than there are training samples
    int limit = min(k, (int)vec_index_dis.size());
    //reset the counts left over from a previous call
    map_label_freq.clear();
    for(int i=0;i<limit;i++)
    {
        cout<<"the index = "<<vec_index_dis[i].first
            <<" the distance = "<<vec_index_dis[i].second
            <<" the label = "<<labels[ vec_index_dis[i].first ]
            <<" the coordinate ( "<<dataSet[ vec_index_dis[i].first ][0]
            <<","<<dataSet[ vec_index_dis[i].first ][1]<<" )"<<endl;
        //calculate the count of each label
        map_label_freq[ labels[ vec_index_dis[i].first ] ]++;
    }
    map<tLabel,int>::const_iterator map_it = map_label_freq.begin();
    tLabel label = '?';  //placeholder printed only if no neighbor was counted (k <= 0)
    int max_freq = 0;
    //find the most frequent label
    while( map_it != map_label_freq.end() )
    {
        if( map_it->second > max_freq )
        {
            max_freq = map_it->second;
            label = map_it->first;
        }
        map_it++;
    }
    cout<<"The test data belongs to the "<<label<<" label"<<endl;
}
int main()
{
    tData testData[colLen];
    int k ;
    cout<<"please input the k value : "<<endl;
    cin>>k;
    KNN knn(k);
    cout<<"please input the test data :"<<endl;
    for(int i=0;i<colLen;i++)
        cin>>testData[i];
    knn.get_all_distance(testData);
    knn.get_max_freq_label();
    system("pause");  //Windows only: keeps the console window open
    return 0;
}
2. Data set: data.txt.