Text Categorization
Commonly used classification algorithms include: decision trees, Rocchio, naive Bayes, neural networks, support vector machines, linear least-squares fitting, kNN, genetic algorithms, maximum entropy, Generalized Instance Sets, and so on. Here we pick only a few of the most representative algorithms to take a look at.
Rocchio algorithm
The Rocchio algorithm is probably the first and most intuitive solution people think of for the text categorization problem. The basic idea is to average the feature vectors of all the sample documents in a class (for example, take the mean number of occurrences of the word "basketball" across all documents in the "sports" category, then the mean for "referee", and so on). The resulting vector is called the "centroid", and this centroid vector becomes the most representative description of the category. When a new document needs to be classified, we compare how similar the new document is to the centroid (formally, by measuring the distance between them) to decide whether the new document belongs to the class. A slightly modified Rocchio algorithm considers not only the documents that belong to the category (called positive samples) but also the documents that do not (called negative samples); the computed centroid is then made as close as possible to the positive samples while staying as far as possible from the negative samples. The Rocchio algorithm makes two fatal assumptions, which leave its performance surprisingly poor. First, it assumes that the documents of a category cluster around a single centroid, which is often not the case in practice (such data are called linearly inseparable). Second, it assumes that the training data are absolutely correct: because it has no quantitative mechanism for measuring whether a sample is noisy, it has no resistance to mislabeled data.
But the Rocchio classifier is very intuitive, easy to understand, and algorithmically simple, so it still has some practical value; in research it is often used as the baseline system (Base Line) for comparing different algorithms.
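As an illustration, here is a minimal sketch of a Rocchio classifier in Python (the function names and the beta parameter are hypothetical, and documents are assumed to have already been converted to term-weight vectors). Setting beta to 0 gives the plain centroid method; a positive beta implements the modified variant described above, which also pushes each centroid away from the negative samples.

```python
import numpy as np

def train_rocchio(X, y, beta=0.0):
    """Compute one centroid per class from document vectors.

    X: (n_docs, n_terms) NumPy array of document vectors.
    y: NumPy array of class labels, one per document.
    beta: weight of the negative samples (0 = plain Rocchio).
    """
    centroids = {}
    for c in np.unique(y):
        pos = X[y == c].mean(axis=0)      # mean of positive samples
        neg = X[y != c].mean(axis=0)      # mean of negative samples
        centroids[c] = pos - beta * neg   # modified Rocchio centroid
    return centroids

def classify_rocchio(doc_vec, centroids):
    """Assign the class whose centroid is most similar (cosine) to the document."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(centroids, key=lambda c: cos(doc_vec, centroids[c]))
```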
Naive Bayes algorithm (Naive Bayes)
The Bayes algorithm focuses on the probability that a document belongs to a class. The probability that a document belongs to a category is expressed in terms of the probabilities that the individual words in the document belong to that category. The probability that each word belongs to a class can, to some extent, be roughly estimated from how often the word appears in the training documents of that class (frequency information), which makes the whole computation feasible. When using the naive Bayes algorithm, the main task of the training phase is therefore to estimate these probabilities.
The naive Bayes algorithm involves more than one formula. First, a prior probability is estimated for each class from the training samples. Then, for a new sample, the probability of each class is computed, and the class with the largest probability is chosen. Concretely:
P(d|Ci) = P(w1|Ci) · P(w2|Ci) · ... · P(wm|Ci)   (Formula 1)

P(w|C) = (number of occurrences of word w in the training documents of class C) / (total number of words in the training documents of class C)   (Formula 2)
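To make Formulas 1 and 2 concrete, here is a minimal sketch in Python (the function names are hypothetical). Two practical details not mentioned above are added: the product of Formula 1 is computed in log space to avoid numeric underflow, and add-one (Laplace) smoothing keeps a single unseen word from forcing the whole product to zero.

```python
from collections import Counter, defaultdict
import math

def train_nb(docs, labels):
    """Estimate P(C) and the counts behind P(w|C) per Formulas 1 and 2.

    docs: list of token lists; labels: list of class names.
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)    # word_counts[c][w] = count of w in class c
    vocab = set()
    for tokens, c in zip(docs, labels):
        word_counts[c].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    return priors, word_counts, vocab

def classify_nb(tokens, priors, word_counts, vocab):
    """Pick argmax over classes of P(C) * prod_w P(w|C), in log space."""
    best, best_score = None, float("-inf")
    for c in priors:
        total = sum(word_counts[c].values())   # total words in class c
        score = math.log(priors[c])
        for w in tokens:
            # Formula 2, with add-one (Laplace) smoothing added here
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best
```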
These formulas embody the two biggest flaws of the naive Bayes algorithm.
First, expanding P(d|Ci) into the product form of (Formula 1) assumes that the words in an article are mutually independent, i.e., that the appearance of one word is not influenced by any other (recall the concept of mutually independent random variables from probability theory). This is clearly wrong: even without being linguists, we know that words have so-called "co-occurrence" relationships; under different topics the counts or frequencies of such co-occurrences may change, but the words are by no means independent of one another.
Second, using the number of times a word appears in the training documents of a class to estimate P(wi|Ci) is only reasonably accurate when the number of training samples is very large (think of tossing a coin: the conclusion that heads and tails each have probability 1/2 is reached through a large number of observations, and too few observations will likely yield a wrong answer). Requiring a large number of samples not only raises the demands (and the cost) of the manual classification work, but also places higher demands on storage and computing resources during later machine processing.
But any technician with a little common sense will understand that data mining spends a great deal of time on preparing the data. In the data-preparation stage, a dictionary can be generated from the vocabulary, redundant and meaningless tokens can be deleted, words and important phrases can be segmented out, and so on. This can avoid some of the problems of the naive Bayes algorithm; in fact, the real problem still lies in how the algorithm computes the entropy. In many cases, with optimization by professionals, the naive Bayes algorithm can achieve very good recognition results. The two multinational software companies we are most familiar with still use the naive Bayes algorithm as a tool algorithm in some of their natural language processing software.
kNN algorithm
The k-nearest-neighbor algorithm (kNN): given a new document, compute the similarity between the new document's feature vector and the feature vector of each document in the training set, find the k training documents nearest (i.e., most similar) to the new document, and decide the category of the new document from the categories of these k documents (note that this means the kNN algorithm has no real "training" phase). This method nicely overcomes the Rocchio algorithm's inability to handle linearly inseparable data, and it is also well suited to applications whose classification standards may change at any time (just delete the old training documents, add new training documents, and the classification criteria have changed).
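A minimal sketch of this procedure, assuming documents are already represented as vectors and using cosine similarity with a simple majority vote (both common choices; the text does not mandate a particular similarity measure or voting rule):

```python
import numpy as np
from collections import Counter

def knn_classify(doc_vec, train_X, train_y, k=5):
    """Label a new document by majority vote among its k most similar
    training documents. Note that every call scans the entire training
    set, which is exactly the cost problem discussed below."""
    norms = np.linalg.norm(train_X, axis=1) * np.linalg.norm(doc_vec) + 1e-12
    sims = train_X @ doc_vec / norms     # cosine similarity to every training doc
    top_k = np.argsort(sims)[-k:]        # indices of the k most similar docs
    return Counter(train_y[i] for i in top_k).most_common(1)[0][0]
```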
The most fatal flaw of kNN, one could say, is that judging the category of each new document requires comparing it against every existing training document, a computational cost that not every system can bear (for example, suppose I want to build a text classification system with tens of thousands of classes; even with only 20 training samples per class, determining the category of a single new document requires 200,000 vector comparisons). Some kNN-based improvements, such as Generalized Instance Sets, try to solve exactly this problem.
kNN also has another drawback: when the samples are imbalanced, for example when one class has a very large sample size while the other classes are very small, the k neighbors of a new input sample may be dominated by samples of the large-capacity class.
Support vector machines (Support Vector Machine)
SVM was first proposed by Cortes and Vapnik in 1995. It shows many unique advantages in solving small-sample, nonlinear, and high-dimensional pattern recognition problems, and it can also be extended to function fitting and other machine learning problems.
The support vector machine method is built on the VC-dimension theory of statistical learning theory and the structural risk minimization principle. Using limited sample information, it seeks the best compromise between model complexity (i.e., the learning accuracy on the given training samples, Accuracy) and learning ability (i.e., the ability to identify any sample without error), in order to obtain the best generalization ability (also known as Generalization Ability).
The SVM method has a solid theoretical foundation. The essence of SVM training is solving a quadratic programming problem (Quadratic Programming: an optimization problem whose objective function is quadratic and whose constraints are linear), which yields the global optimal solution; this is an advantage that other statistical learning techniques find hard to match. The SVM classifier performs very well on text classification and is one of the best classifiers available. At the same time, kernel functions can be used to map the original sample space into a high-dimensional space, solving the problem of the original samples being linearly inseparable. The disadvantages are that the choice of kernel function lacks guidance, so for a specific problem it is hard to pick the best kernel, and that SVM training speed is heavily affected by the size of the training set, with relatively large computational overhead. To speed up SVM training, researchers have proposed many improved methods, including the Chunking method, the Osuna algorithm, the SMO algorithm, and interactive SVM. The SVM classifier offers good generalization, high classification accuracy, fast classification, a classification speed that is independent of the number of training samples, and precision and recall superior to kNN and naive Bayes.
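As a sketch of what SVM-based text classification looks like in practice, the following uses scikit-learn's TfidfVectorizer and LinearSVC (the library choice and the tiny toy corpus are assumptions made here; the text above names no particular implementation):

```python
# A minimal SVM text-classification pipeline (a sketch, assuming scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training corpus; a real system would use thousands of labeled documents.
train_docs = ["the referee blew the whistle at the basketball game",
              "stocks fell sharply as the market closed lower"]
train_labels = ["sports", "finance"]

# TF-IDF maps raw text into a high-dimensional sparse vector space,
# in which the linear SVM then seeks a separating hyperplane.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_docs, train_labels)

print(model.predict(["the team won the basketball match"]))  # expected: ['sports']
```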