DATS 6101: Amazon Movie Data Scraping and Recommendation System Analysis, Final Project

Prepared by: Pseudo_yuan

December 16, 2015

Introduction

Big data provides useful information for recommendation systems, and a good recommendation system depends on efficient algorithms. Three popular recommendation algorithms are the user-based algorithm, the item-based algorithm and collaborative filtering. For a given movie, Amazon recommends other movies that customers who watched this movie also watched. That is, the recommendation is driven by user behavior. Such a system is limited, however, because a movie that few people have watched may never be recommended. To address this problem, I will analyze the attributes of the recommended movies and discuss their similarity to see whether recommendations can be made from the attributes of the items themselves. In detail, with the help of the R package “rvest”, I will scrape data from Amazon web pages and analyze the relationship between one movie and the movies that customers who watched it also watched. Based on these relationships, customers’ preferences could be predicted and less popular movies could be recommended.
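As a rough illustration of what "similarity of attributes" could mean, the sketch below compares two movies by the share of attribute values they have in common (the Jaccard index). The choice of Jaccard similarity, the function name jaccard and the attribute values are assumptions made for this example only; they are not taken from the project data or from Amazon's system.

```r
# Jaccard similarity: size of the shared attribute set divided by
# the size of the combined attribute set (an assumed measure for illustration).
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Placeholder attribute values, invented for this sketch.
movie_a <- c("Drama", "Thriller", "Director X", "Studio P")
movie_b <- c("Drama", "War", "Director X", "Studio Q")

# Two shared values out of six distinct values overall.
similarity <- jaccard(movie_a, movie_b)
print(similarity)
```

A value close to 1 would mean the two movies share most of their attributes, which is the kind of evidence an attribute-based recommendation would rely on.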

Description and Quality of Data

An Amazon movie page contains many data items, such as the name, the genres, the director, the stars and the ratings, that provide useful information about the movie. Amazon also gives links to recommended movies. The collected information for a single movie can be treated as a sub-dataset. A movie is always associated with more than 6 recommended movies, and each recommended movie can produce a new sub-dataset. In my database, one dataset includes the information for one movie (the basic movie), 6 movies recommended for it (the sub-movies) and the movies recommended for the sub-movies. This gives 43 movies per dataset, which is consistent with 6 recommendations per sub-movie (1 + 6 + 36 = 43). For these 43 movies, each dataset records the attributes name, year, mins, IMDb rate, BoxOffice, genre 1, genre 2, director, star 1, star 2 and studio. The data come from web pages and are spread across text, graphics and even images. They are unstructured and sometimes missing, so they need to be cleaned before analysis.
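The two-level structure of one dataset can be sketched as follows. The function get_recommended is a hypothetical stand-in that only generates made-up identifiers so that the sketch runs without network access; in the real project it would be replaced by code that reads the recommendation links from an Amazon page. The assumption of exactly 6 recommendations per movie follows the description above.

```r
# Hypothetical stand-in for a scraper: returns n invented child IDs for a movie.
get_recommended <- function(id, n = 6) {
  paste0(id, "-", seq_len(n))
}

# Collect the basic movie, its sub-movies and the movies recommended for each sub-movie.
collect_group <- function(basic_id, n = 6) {
  sub_movies <- get_recommended(basic_id, n)                          # level 1
  sub_sub    <- unlist(lapply(sub_movies, get_recommended, n = n))    # level 2
  unique(c(basic_id, sub_movies, sub_sub))
}

group <- collect_group("basic")
print(length(group))  # 1 + 6 + 36 = 43 identifiers
```

With real pages, the same movie may be recommended more than once, so the unique() call matters and the count could fall below 43.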

Data Acquisition and Cleaning

The R package “rvest” helps to scrape data from HTML web pages. The function “read_html” reads an HTML page and the function “html_nodes” selects nodes from an HTML document. The functions “html_text”, “html_name”, “html_children” and “html_attrs” extract the text, tag name, child nodes and attributes from the selected nodes. With these functions, we can scrape the wanted data from a web page. For example, we can use the following code to fetch the movie name from the given address.

library(rvest)

movie <- read_html(address)  # address: URL of the movie page

Name <- movie %>% html_nodes("#aiv-content-title") %>% html_text()  # text of the title node

In this example, we get the movie name. However, the result contains unwanted blank space. We can use the following code to remove it and clean the data.

name <- trimws(strsplit(Name,"\n")[[1]][2])
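The extraction and cleaning steps can be tried without network access on a small inline HTML fragment, as sketched below. The fragment and the title "Example Movie Title" are invented for illustration; only the node id "#aiv-content-title" comes from the code above, and the real Amazon page layout may differ.

```r
library(rvest)

# Invented HTML fragment that mimics a title node surrounded by blank space.
html <- "<html><body><h1 id='aiv-content-title'>
        Example Movie Title
    </h1></body></html>"

page <- read_html(html)

# Raw text still contains the line breaks and indentation.
Name <- page %>% html_nodes("#aiv-content-title") %>% html_text()

# Same cleaning step as above: take the second line and trim it.
name <- trimws(strsplit(Name, "\n")[[1]][2])

# Alternative: let html_text() trim the leading and trailing blank space.
name_trimmed <- page %>% html_nodes("#aiv-content-title") %>% html_text(trim = TRUE)

print(name)
print(name_trimmed)
```

The strsplit approach assumes the title sits on the second line of the node text. The trim = TRUE variant does not depend on that layout, which may make it more robust when the page structure varies.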

The full code used for scraping and cleaning the data is shown in Appendix 1 and the result is shown in Appendix 2.

The Amazon Movie Data

In this project, I build four data sets based on the movies “A Most Wanted Man”, “Big Hero 6”, “Saving Christmas” and “Schindler’s List” and name them “group 1”, “group 2”, “group 3” and “group 4” respectively. Each data set includes the information for one movie and the movies recommended based on it, so the movies in one data set are related by recommendation. The full data sets are shown in the Excel document named “ShuyuanZhao_FinalProjectData_Amazon Movie.xlsx”.

To look for insights, I will visualize the data with the R package “ggplot2”. First, I will present the year and IMDb rating of the movies in the four data sets with the following code:

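library(ggplot2)

# AmazonMovie is assumed to be a data frame with the columns YEAR, IMDBRATE and GROUP
p <- ggplot(data = AmazonMovie, mapping = aes(x = YEAR, y = IMDBRATE))

p + geom_point(aes(color = GROUP))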

The result is presented in Figure 1.
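For readers who want to try the plotting step without the Excel file, the following is a minimal sketch on a placeholder data frame. The values are invented and are not the scraped data; only the column names YEAR, IMDBRATE and GROUP follow the code above. The sketch also assumes that GROUP may be stored as a number, in which case converting it to a factor gives each group its own discrete color instead of a continuous gradient.

```r
library(ggplot2)

# Placeholder values for illustration only; these are NOT the scraped data.
AmazonMovie <- data.frame(
  YEAR     = c(2001, 2005, 2010, 2014),
  IMDBRATE = c(6.5, 7.0, 7.5, 8.0),
  GROUP    = c(1, 2, 3, 4)
)

# Treat the group label as categorical so that colors are discrete.
AmazonMovie$GROUP <- factor(AmazonMovie$GROUP, labels = paste("group", 1:4))

p <- ggplot(data = AmazonMovie, mapping = aes(x = YEAR, y = IMDBRATE)) +
  geom_point(aes(color = GROUP)) +
  labs(x = "Year", y = "IMDb rating", color = "Group")

print(p)
```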