[Original] Text mining in R: tf-idf, topic modeling, sentiment analysis, and n-gram modeling, a case study report (with code and data)

Consulting QQ: 3025393450

For questions, search Baidu for “大数据部落”

Official website: /datablog

Text mining Usenet bulletin board data in R

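We carry out a start-to-finish analysis of 20,000 messages sent to 20 Usenet bulletin boards in 1993. The bulletin boards in this dataset include newsgroups for topics such as politics, religion, cars, sports, and cryptography, and offer a rich body of text written by many users. The dataset is publicly available at /~jason/20Newsgroups/ (the 20news-bydate.tar.gz file) and has become popular for exercises in text analysis and machine learning.

1 Preprocessing

We start by reading in all the messages from the 20news-bydate folder, which are organized in subfolders with one file per message. Files like these can be read with a combination of read_lines(), map(), and unnest().

Note that this step may take several minutes to read all the documents.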

library(dplyr)

library(tidyr)

library(purrr)

library(readr)

training_folder <- "data/20news-bydate/20news-bydate-train/"

# Define a function to read all files from a folder into a data frame,
# one row per line of text, keyed by the file name (message id)
read_folder <- function(infolder) {
  tibble(file = dir(infolder, full.names = TRUE)) %>%
    mutate(text = map(file, read_lines)) %>%
    transmute(id = basename(file), text) %>%
    unnest(text)
}

# Use unnest() and map() to apply read_folder to each subfolder
raw_text <- tibble(folder = dir(training_folder, full.names = TRUE)) %>%
  mutate(folder_out = map(folder, read_folder)) %>%
  unnest(cols = c(folder_out)) %>%
  transmute(newsgroup = basename(folder), id, text)

raw_text

## # A tibble: 511,655 x 3
##    newsgroup   id    text
##  1 alt.atheism 49960 From: mathew
##  2 alt.atheism 49960 Subject: Alt.Atheism FAQ: Atheist Resources
##  3 alt.atheism 49960 Summary: Books, addresses, music -- anything related to atheism
##  4 alt.atheism 49960 Keywords: FAQ, atheism, books, music, fiction, addresses, contacts
##  5 alt.atheism 49960 Expires: Thu, 29 Apr 1993 11:57:19 GMT
##  6 alt.atheism 49960 Distribution: world
##  7 alt.atheism 49960 Organization: Mantis Consultants, Cambridge. UK.
##  8 alt.atheism 49960 Supersedes: <19930301143317@>
##  9 alt.atheism 49960 Lines: 290
## 10 alt.atheism 49960 ""
## # … with 511,645 more rows
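To see what the map() and unnest() steps do without touching the file system, here is a minimal sketch on made-up data. The toy_files list and its contents are hypothetical stand-ins for what read_lines() would return for two message files; they are not part of the dataset.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical in-memory "files": message id -> lines of text
# (a stand-in for calling read_lines() on each file)
toy_files <- list(
  "101" = c("From: a", "", "hello"),
  "202" = c("From: b", "", "world", "again")
)

# map() produces a list-column: one character vector per message
nested <- tibble(id = names(toy_files)) %>%
  mutate(text = map(id, ~ toy_files[[.x]]))

# unnest() expands the list-column into one row per line of text
flat <- nested %>%
  unnest(cols = c(text))

flat
```

The nested tibble has one row per message, while flat has one row per line, which is the same shape as raw_text.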

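Notice the newsgroup column, which describes which of the 20 newsgroups each message comes from, and the id column, which identifies a unique message within that newsgroup. Which newsgroups are included, and how many messages were posted in each (Figure 1)?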

library(ggplot2)

raw_text %>%
  group_by(newsgroup) %>%
  summarize(messages = n_distinct(id)) %>%
  ggplot(aes(newsgroup, messages)) +
  geom_col() +
  coord_flip()

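Figure 1: Number of messages from each newsgroup

We can see that Usenet newsgroup names are organized hierarchically, starting with a main topic such as “talk”, “sci”, or “rec”, followed by further specifications.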
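As a small illustration of that naming hierarchy, the sketch below pulls the top-level topic out of a few newsgroup names. The four names are an illustrative sample chosen for this example, and top_level is a hypothetical variable name.

```r
library(stringr)

# Illustrative sample of newsgroup names (not the full list of 20)
newsgroups <- c("talk.politics.guns", "sci.crypt", "rec.sport.hockey", "alt.atheism")

# The top-level topic is everything before the first dot
top_level <- str_extract(newsgroups, "^[^.]+")

top_level
```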

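1.1 Preprocessing text

Each message carries structure we do not want in the analysis: a header block (the From:, Subject:, and similar lines shown above) that ends at the first empty line, and, in some messages, a signature that follows a line starting with --. This kind of preprocessing can be done with dplyr's filter(), combining cumsum() (cumulative sum) with str_detect() from stringr.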

library(stringr)

# must occur after the first occurrence of an empty line,
# and before the first occurrence of a line starting with --
cleaned_text <- raw_text %>%
  group_by(newsgroup, id) %>%
  filter(cumsum(text == "") > 0,
         cumsum(str_detect(text, "^--")) == 0) %>%
  ungroup()
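The cumsum() trick is easier to follow on a single toy message. The sketch below applies the same two filter conditions to a made-up eight-line message; the toy tibble and its text are assumptions for illustration only.

```r
library(dplyr)
library(stringr)

# Hypothetical message: two header lines, a blank line, a body, then a signature
toy <- tibble(
  id = "1",
  text = c("From: someone", "Subject: demo", "",
           "First body line.", "", "Second body line.",
           "--", "Signature line")
)

toy_clean <- toy %>%
  group_by(id) %>%
  # cumsum(text == "") stays 0 until the first empty line is reached;
  # cumsum(str_detect(text, "^--")) becomes 1 at the first line starting with --
  filter(cumsum(text == "") > 0,
         cumsum(str_detect(text, "^--")) == 0) %>%
  ungroup()

toy_clean$text
```

Only the lines between the header and the signature survive. The first empty line itself is kept, because the running count is already 1 on that row.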

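Many lines also contain nested text representing quotes from other users, typically introduced by a line like “so-and-so writes...”. These can be removed with a few regular expressions.

We also choose to manually remove two messages, 9704 and 9985, which contain a large amount of non-text content.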

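cleaned_text <- cleaned_text %>%
  # roughly: keep empty lines, drop quoted lines (starting with >)
  # and lines with no letters or digits
  filter(str_detect(text, "^[^>]+[A-Za-z\\d]") | text == "",
         # drop attribution lines ending in "writes:" or "writes..."
         !str_detect(text, "writes(:|\\.\\.\\.)$"),
         !str_detect(text, "^In article <"),
         # drop the two messages that are mostly non-text content
         !id %in% c(9704, 9985))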
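To check what each condition removes, the same filter can be run on a handful of made-up lines. Everything in the toy tibble below (the addresses, the text, the assignment of lines to ids) is hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical lines: an attribution, a quote, an "In article" reference,
# ordinary text, an empty line, a separator, and a line from a removed message
toy <- tibble(
  id = c("1", "1", "1", "1", "1", "1", "9704"),
  text = c("someone@example.com writes:",
           "> quoted text",
           "In article <123@example.com>",
           "I agree with this.",
           "",
           "----",
           "Plain text in a removed message")
)

toy_clean <- toy %>%
  filter(str_detect(text, "^[^>]+[A-Za-z\\d]") | text == "",
         !str_detect(text, "writes(:|\\.\\.\\.)$"),
         !str_detect(text, "^In article <"),
         !id %in% c(9704, 9985))

toy_clean$text
```

Only the ordinary sentence and the empty line remain. The id comparison works even though id is a character column, because %in% converts the numbers to strings before matching.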

At this point, we are ready to use unnest_tokens() to split the dataset into tokens, while removing stop words.

library(tidytext)