R Text Mining: Topic Models and Text Classification
Several R packages are needed first; if they are already installed, the installation step below can be skipped.
#install.packages("Rwordseg")
#install.packages("tm");
#install.packages("wordcloud");
#install.packages("topicmodels")
Data used in the example
The data come from the Sogou Labs corpus.
Data URL: /dl/sogoulabdown/SogouC.mini.20061102.tar.gz
File structure:
└─Sample
├─C000007 Auto
├─C000008 Finance
├─C000010 IT
├─C000013 Health
├─C000014 Sports
├─C000016 Travel
├─C000020 Education
├─C000022 Recruitment
├─C000023
└─C000024 Military
The data were preprocessed with Python into a train.csv file, with the text of each document collapsed onto a single line.
Preprocessing Python script: (forum attachment, 720 bytes)
Data used: (forum attachment, 130.2 KB)
You can also use R directly to convert the raw data into the same form as train.csv.
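A minimal sketch of that R route, under stated assumptions: the Sample directory layout shown above, GB-encoded source files, and a two-column output (type, text) matching what the later code expects. The paths, the encoding, and taking the category label from the folder code are all assumptions.

# Sketch: build train.csv from the Sample folder (paths and encoding are assumptions)
base <- "d://wb//Sample"
dirs <- list.dirs(base, recursive = FALSE)
train <- do.call(rbind, lapply(dirs, function(d) {
  files <- list.files(d, full.names = TRUE)
  text <- sapply(files, function(f) {
    con <- file(f, encoding = "GBK")   # SogouC texts are GB-encoded (assumption)
    on.exit(close(con))
    paste(readLines(con, warn = FALSE), collapse = " ")   # one document per line
  })
  # here the label is the folder code (C000007, ...); the original train.csv
  # apparently used names such as auto and culture, so remap as needed
  data.frame(type = basename(d), text = text, stringsAsFactors = FALSE)
}))
write.csv(train, "d://wb//train.csv", row.names = FALSE)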
Stopwords file used in this article: (forum attachment, 2.96 KB)
1. Read the corpus
csv <- read.csv("d://wb//train.csv", header = T, stringsAsFactors = F)
mystopwords <- unlist(read.table("d://wb//StopWords.txt", stringsAsFactors = F))
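The rest of the code assumes train.csv has a category column and a text column, referenced below as csv$type and csv$text (these column names are taken from the code that follows); a quick structural check makes that assumption explicit:

str(csv)          # expect a type column (category label) and a text column (one document per row)
head(csv$type)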
2. Data preprocessing (Chinese word segmentation and stopword removal)
library(tm)

# remove digits
removeNumbers <- function(x) gsub("[0-9]", "", x)
sample.words <- lapply(csv$text, removeNumbers)
# Chinese word segmentation, using the Rwordseg package
wordsegment <- function(x) {
  library(Rwordseg)
  segmentCN(x)
}

sample.words <- lapply(sample.words, wordsegment)
### stopword removal
### segment first, then remove stopwords, so that the global substitution does not lose information

removeStopWords = function(x, words) {
  ret = character(0)
  index <- 1
  it_max <- length(x)
  while (index <= it_max) {
    if (length(words[words == x[index]]) < 1) ret <- c(ret, x[index])   # keep tokens that are not stopwords
    index <- index + 1
  }
  ret
}

sample.words <- lapply(sample.words, removeStopWords, mystopwords)
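The while loop above is easy to follow but slow on a large corpus; a vectorized equivalent (a sketch that returns the same result for the token vectors produced by segmentCN; the function name is made up) is:

# vectorized equivalent: keep only tokens that are not in the stopword list
removeStopWordsFast <- function(x, words) x[!(x %in% words)]
# sample.words <- lapply(sample.words, removeStopWordsFast, mystopwords)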
3. Wordcloud visualization
# build the corpus
corpus = Corpus(VectorSource(sample.words))
meta(corpus, "cluster") <- csv$type
unique_type <- unique(csv$type)
# build the document-term matrix
(sample.dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(2, Inf))))
#install.packages("wordcloud")  ## the wordcloud package is required
library(wordcloud)
# comparison wordcloud across individual documents
sample.tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(2, Inf)))

tdm_matrix <- as.matrix(sample.tdm)

png(paste("d://wb//sample_comparison", ".png", sep = ""), width = 1500, height = 1500)
comparison.cloud(tdm_matrix, colors = rainbow(ncol(tdm_matrix)))  #### slightly modified because of a colour issue
title(main = "sample comparison")
dev.off()
# comparison wordcloud aggregated by category
library(RColorBrewer)   # brewer.pal() below comes from RColorBrewer
n <- nrow(csv)
zz1 = 1:n
cluster_matrix <- sapply(unique_type, function(type) apply(tdm_matrix[, zz1[csv$type == type]], 1, sum))
png(paste("d://wb//sample_cluster_comparison", ".png", sep = ""), width = 800, height = 800)
comparison.cloud(cluster_matrix, colors = brewer.pal(ncol(cluster_matrix), "Paired"))  ## slightly modified because the palette has few colour classes
title(main = "sample cluster comparison")
dev.off()
The comparison cloud shows that the data are unevenly distributed; categories such as culture and auto have very few documents.
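To quantify that imbalance directly, a quick tabulation of the category labels (a sketch; it assumes the csv$type column used above) is enough:

# number of documents per category, smallest first
sort(table(csv$type))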
# wordcloud for each category
sample.cloud <- function(cluster, maxwords = 100) {
  words <- sample.words[which(csv$type == cluster)]
  allwords <- unlist(words)

  wordsfreq <- sort(table(allwords), decreasing = T)
  wordsname <- names(wordsfreq)

  png(paste("d://wb//sample_", cluster, ".png", sep = ""), width = 600, height = 600)
  wordcloud(wordsname, wordsfreq, scale = c(6, 1.5), min.freq = 2, max.words = maxwords, colors = rainbow(100))
  title(main = paste("cluster:", cluster))
  dev.off()
}
lapply(unique_type, sample.cloud)  # one plot per category in unique(csv$type)
4. Topic model analysis
library(slam)
summary(col_sums(sample.dtm))
# mean tf-idf per term: average normalized frequency over the documents containing the term,
# times log2(number of documents / document frequency)
term_tfidf <- tapply(sample.dtm$v / row_sums(sample.dtm)[sample.dtm$i], sample.dtm$j, mean) *
  log2(nDocs(sample.dtm) / col_sums(sample.dtm > 0))
summary(term_tfidf)

# keep only terms with mean tf-idf >= 0.1, then drop any documents left empty
sample.dtm <- sample.dtm[, term_tfidf >= 0.1]
sample.dtm <- sample.dtm[row_sums(sample.dtm) > 0, ]
library(topicmodels)
k <- 30

SEED <- 2010
sample_TM <-
  list(
    VEM = LDA(sample.dtm, k = k, control = list(seed = SEED)),
    VEM_fixed = LDA(sample.dtm, k = k, control = list(estimate.alpha = FALSE, seed = SEED)),
    Gibbs = LDA(sample.dtm, k = k, method = "Gibbs", control = list(seed = SEED, burnin = 1000, thin = 100, iter = 1000)),
    CTM = CTM(sample.dtm, k = k, control = list(seed = SEED, var = list(tol = 10^-4), em = list(tol = 10^-3)))
  )
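A note on the tf-idf filter applied before fitting the models: for each term, the tapply() call averages the term's normalized frequency over the documents that contain it, and the log2(...) factor down-weights terms that occur in most documents; terms scoring below 0.1 are dropped so that very common, uninformative words do not dominate the topics. A small sketch with an invented 3x3 matrix (documents, terms, and counts are all made up) shows the same slam expressions at work:

library(slam)
# invented toy document-term matrix: 3 documents x 3 terms
m <- matrix(c(5, 2, 0,
              0, 1, 4,
              1, 3, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:3), c("car", "the", "stock")))
toy <- as.simple_triplet_matrix(m)
tapply(toy$v / row_sums(toy)[toy$i], toy$j, mean) *
  log2(nrow(toy) / col_sums(toy > 0))
# "the" occurs in every document, so its log2(3/3) factor is 0 and the 0.1 cut-off removes it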
sapply(sample_TM[1:2], slot, "alpha")

# mean entropy of the fitted per-document topic distributions, for each model
sapply(sample_TM, function(x) mean(apply(posterior(x)$topics, 1, function(z) -sum(z * log(z)))))
The estimated α is far below the default value, which indicates that the Dirichlet distribution concentrates its mass: each document covers only a subset of the topics.
Higher entropy values indicate that topics are spread more evenly within a document.
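A small illustration of that entropy measure (the two probability vectors are invented): a document whose probability mass is spread over many topics has high entropy, while one dominated by a single topic has entropy near zero.

entropy <- function(z) -sum(z * log(z))   # same formula as in the sapply() call above
entropy(rep(1/30, 30))                    # uniform over k = 30 topics: log(30) ≈ 3.4, the maximum
entropy(c(0.971, rep(0.001, 29)))         # one dominant topic: ≈ 0.23, close to 0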
# most likely topic for each document
Topic <- topics(sample_TM[["VEM"]], 1)
table(Topic)

# top 5 terms for each topic
Terms <- terms(sample_TM[["VEM"]], 5)

Terms[, 1:10]
######### most likely topic for each article in the auto category
(topics_auto <- topics(sample_TM[["VEM"]])[grep("auto", csv[[1]])])

most_frequent_auto <- which.max(tabulate(topics_auto))

######### the 10 terms most associated with the dominant auto topic
terms(sample_TM[["VEM"]], 10)[, most_frequent_auto]