[Original] Twitter Text Mining and Semantic Analysis in R (with Code and Data)
library(dplyr)
library(purrr)
library(twitteR)
library(ggplot2)
Read the Twitter data
load("E:/service/2017/3 19 guoyufei17 smelllikeme@/trump_tweets_df.rda")
Clean up the data
library(tidyr)
Keep only the tweets posted from an iPhone or an Android phone, and drop tweets from all other sources
tweets <- trump_tweets_df %>%
  select(id, statusSource, text, created) %>%
  extract(statusSource, "source", "Twitter for (.*?)<") %>%
  filter(source %in% c("iPhone", "Android"))
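As a small illustration of what the extract() step does, the sketch below applies the same pattern, "Twitter for (.*?)<", to three made-up statusSource strings using base R only. The sample strings and the names status_source, matches and device are assumptions for illustration; they are not taken from the data set.

```r
# Hypothetical statusSource values (made up for illustration).
status_source <- c(
  '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'
)

# Same pattern as in the extract() call: capture the text between
# "Twitter for " and the next "<".
pattern <- "Twitter for (.*?)<"
matches <- regmatches(status_source, regexec(pattern, status_source, perl = TRUE))

# Element 1 of each match is the full match, element 2 the captured group;
# strings that do not match the pattern give NA.
device <- vapply(
  matches,
  function(m) if (length(m) >= 2) m[2] else NA_character_,
  character(1)
)

# Keep only the two devices of interest (NA values are dropped by %in%).
device[device %in% c("iPhone", "Android")]
```

Sources that do not contain "Twitter for ..." end up as NA and are removed by the filter, which mirrors how the pipeline discards tweets from other clients.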
Plot the share of tweets posted at each hour of the day, and compare the posting patterns of the Android phone and the iPhone.
library(lubridate)
library(scales)
tweets %>%
  count(source, hour = hour(with_tz(created, "EST"))) %>%
  group_by(source) %>%  # explicit grouping, so each platform's shares sum to 100%
  mutate(percent = n / sum(n)) %>%
  ungroup() %>%
  ggplot(aes(hour, percent, color = source)) +
  geom_line() +
  scale_y_continuous(labels = percent_format()) +
  labs(x = "Hour of day (EST)",
       y = "% of tweets",
       color = "")
The chart shows a clear difference in posting times: Android tweets are concentrated between about 5:00 and 10:00, while iPhone tweets mostly appear between about 10:00 and 20:00. The Android phone also accounts for more tweets than the iPhone.
Next, check whether each tweet begins with a quotation mark (that is, whether it quotes someone else), and compare the counts on the two platforms
library(stringr)
tweets %>%
  count(source,
        quoted = ifelse(str_detect(text, '^"'), "Quoted", "Not quoted")) %>%
  ggplot(aes(source, n, fill = quoted)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "", y = "Number of tweets", fill = "") +
  ggtitle('Whether tweets start with a quotation mark (")')
The comparison shows that the share of non-quoted tweets is clearly lower on Android than on the iPhone, while Android has far more quoted tweets. This suggests that iPhone tweets are mostly original content, whereas most of the quoted tweets come from Android.
Next, check whether tweets contain a link or a picture, and compare the two platforms
tweet_picture_counts <- tweets %>%
  filter(!str_detect(text, '^"')) %>%  # ignore quoted tweets
  count(source,
        # fixed() matches "t.co" literally; as a regex the dot would match any character
        picture = ifelse(str_detect(text, fixed("t.co")),
                         "Picture/link", "No picture/link"))
ggplot(tweet_picture_counts, aes(source, n, fill = picture)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "", y = "Number of tweets", fill = "")
The chart shows that tweets without pictures or links come mostly from Android, whereas tweets from the iPhone usually include a photo or a link.
In other words, the Android user generally tweets plain text, and the iPhone user does just the opposite.
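spr <- tweet_picture_counts %>%
  spread(source, n) %>%
  mutate(across(c(Android, iPhone), ~ .x / sum(.x)))  # counts -> within-platform shares
# Row 2 is "Picture/link": ratio of the iPhone share to the Android share
rr <- spr$iPhone[2] / spr$Android[2]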
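To make the meaning of rr concrete, here is a minimal base R sketch of the same calculation on made-up counts. The numbers and the name toy_counts are assumptions for illustration only; they are not results from the tweet data.

```r
# Made-up counts (not from the tweet data): rows are ordered as in the
# spread() result, "No picture/link" first and "Picture/link" second.
toy_counts <- data.frame(
  picture = c("No picture/link", "Picture/link"),
  Android = c(80, 20),
  iPhone  = c(40, 60)
)

# Convert each platform's counts to shares of that platform's tweets.
android_share <- toy_counts$Android / sum(toy_counts$Android)
iphone_share  <- toy_counts$iPhone / sum(toy_counts$iPhone)

# Ratio of the "Picture/link" shares (row 2): how many times more likely
# an iPhone tweet is to contain a picture or link than an Android tweet.
rr_toy <- iphone_share[2] / android_share[2]
rr_toy
```

With these toy numbers, 60% of iPhone tweets and 20% of Android tweets contain a picture or link, so the ratio is about 3.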
Next, strip links and stray characters from the tweets, split the text into words, and rank the words by how often they occur
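library(tidytext)
# Delimiters: any character other than a letter, digit, #, @ or ',
# plus any ' that is not followed by one of those characters
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
tweet_words <- tweets %>%
  filter(!str_detect(text, '^"')) %>%  # drop quoted tweets
  # remove links and HTML-escaped ampersands
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
tweet_words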
## # A tibble: 8,753 × 4
##                    id source             created                   word
##                 <chr>  <chr>              <dttm>                  <chr>
## 1  676494179216805888 iPhone 2015-12-14 20:09:15                 record
## 2  676494179216805888 iPhone 2015-12-14 20:09:15                 health
## 3  676494179216805888 iPhone 2015-12-14 20:09:15 #makeamericagreatagain
## 4  676494179216805888 iPhone 2015-12-14 20:09:15             #trump2016
## 5  676509769562251264 iPhone 2015-12-14 21:11:12               accolade
## 6  676509769562251264 iPhone 2015-12-14 21:11:12             @trumpgolf
## 7  676509769562251264 iPhone 2015-12-14 21:11:12                 highly
## 8  676509769562251264 iPhone 2015-12-14 21:11:12              respected
## 9  676509769562251264 iPhone 2015-12-14 21:11:12                   golf
## 10 676509769562251264 iPhone 2015-12-14 21:11:12                odyssey
## # ... with 8,743 more rows
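To see what the regular expression in reg treats as a word boundary, the sketch below splits one made-up, already lower-cased sentence with base R's strsplit(). It only approximates the unnest_tokens() call and is not a substitute for it; the sample sentence and the names pieces and tokens are assumptions for illustration.

```r
# Same pattern as in the tokenization step above.
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"

# Made-up example text, already in lower case.
text <- "it's #maga time, folks'"

# Split on every delimiter; perl = TRUE is needed for the lookahead (?!...).
pieces <- strsplit(text, reg, perl = TRUE)[[1]]

# Consecutive delimiters (", ") leave empty strings behind, so drop them.
tokens <- pieces[nzchar(pieces)]
tokens
```

The apostrophe inside "it's" is kept because a letter follows it, the hashtag keeps its "#", and the trailing apostrophe after "folks" is treated as a delimiter.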
tweet_words %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_bar(stat = "identity") +
  ylab("Occurrences") +
  coord_flip()
The chart shows that "hillary" is the most frequent word, followed by "#trump2016". Further down the list we also see words such as "trump" and "clinton".
Next, run a sentiment analysis of the data, starting with how much more likely each word is to appear in Android tweets than in iPhone tweets
The sentiment balance of each platform is then derived from the sentiment of its characteristic words and visualized
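android_iphone_ratios <- tweet_words %>%
  count(word, source) %>%
  group_by(word) %>%  # explicit grouping, so sum(n) is the total per word
  filter(sum(n) >= 5) %>%  # keep words used at least 5 times overall
  spread(source, n, fill = 0) %>%
  ungroup() %>%
  # add-one smoothing, then convert each platform's counts to shares
  mutate(across(-word, ~ (.x + 1) / sum(.x + 1))) %>%
  mutate(logratio = log2(Android / iPhone)) %>%
  arrange(desc(logratio))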
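The smoothing and log-ratio step can be checked by hand on a toy table. The sketch below uses made-up counts for two hypothetical words, word_a and word_b, and base R only; none of the numbers come from the tweet data.

```r
# Made-up counts for two hypothetical words (not from the tweet data).
counts <- data.frame(
  word    = c("word_a", "word_b"),
  Android = c(9, 1),
  iPhone  = c(1, 9)
)

# Add 1 to every count so that a word absent from one platform does not
# produce a zero, then convert each column to shares.
android_share <- (counts$Android + 1) / sum(counts$Android + 1)
iphone_share  <- (counts$iPhone + 1) / sum(counts$iPhone + 1)

# Positive values: relatively more common on Android;
# negative values: relatively more common on the iPhone.
counts$logratio <- log2(android_share / iphone_share)
counts
```

Here the smoothed shares are 10/12 and 2/12, so word_a gets log2(5) and word_b gets -log2(5). A log ratio of 1 means a word is twice as common on Android, and -1 means it is twice as common on the iPhone.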
nrc <- sentiments %>%