Big Data: PageRank Algorithm
Adapted from CS345 Data Mining
Link Analysis Algorithms Page Rank
Anand Rajaraman, Jeffrey D. Ullman
Link Analysis Algorithms
Page Rank
Topic-Specific Page Rank
Hubs and Authorities
No unique solution: all solutions are equivalent modulo a scale factor
Additional constraint forces uniqueness: y + a + m = 1
Solution: y = 2/5, a = 2/5, m = 1/5
Gaussian elimination method works for small examples, but we need a better method for large graphs
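As a small worked check of the solution y = 2/5, a = 2/5, m = 1/5, the sketch below solves the flow equations (y = y/2 + a/2, a = y/2 + m, m = a/2) together with y + a + m = 1 by Gaussian elimination over exact fractions. The helper name `solve_linear` is hypothetical, and replacing the redundant third flow equation with the normalization constraint is an illustrative choice, not something the slides prescribe.

```python
from fractions import Fraction as F

def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination (square, non-singular)."""
    n = len(matrix)
    # Augmented matrix [matrix | rhs] with exact rational entries.
    aug = [[F(v) for v in row] + [F(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a non-zero pivot in this column and swap it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

half = F(1, 2)
coeffs = [
    [half - 1, half, 0],  # y/2 + a/2 - y = 0
    [half, -1, 1],        # y/2 + m - a = 0
    [1, 1, 1],            # y + a + m = 1 (replaces the redundant third equation)
]
y, a, m = solve_linear(coeffs, [0, 0, 1])
print(y, a, m)
```

This is only practical for tiny systems; elimination costs grow cubically with the number of pages, which is why an iterative method is needed for large graphs.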
Matrix formulation
Suppose there are N pages
A page j has a set of outlinks O(j)
The random teleport taxes each page a fraction (1-β) of its score and redistributes it evenly:
reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)|
adding a teleport link from j to every page with probability (1-β)/N
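The tax-and-redistribute view can be sketched as one update step computed directly from the link lists, without building an N by N matrix. The function name `teleport_step` and the dictionary representation are hypothetical; the graph is the slides' spider-trap example (M'soft links only to itself) with β = 0.8, and the sketch assumes every page has at least one outlink.

```python
from fractions import Fraction as F

def teleport_step(outlinks, r, beta):
    """One rank update with teleports, computed from the outlink lists."""
    n = len(r)
    total = sum(r.values())
    # Each page is taxed (1 - beta) of its score; the pool is shared evenly.
    r_next = {page: (1 - beta) * total / n for page in r}
    for j, targets in outlinks.items():
        # Page j sends beta / |O(j)| of its score along each outlink.
        share = beta * r[j] / len(targets)
        for i in targets:
            r_next[i] += share
    return r_next

outlinks = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}
r0 = {"y": F(1), "a": F(1), "m": F(1)}
r1 = teleport_step(outlinks, r0, F(4, 5))
print({page: str(score) for page, score in r1.items()})
```

The step does work proportional to the number of links rather than to N squared, which is the practical reason for phrasing the teleport as a tax.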
Matrix formulation
Matrix M has one row and one column for each web page
Suppose page j has n outlinks
If j → i, then $M_{ij} = 1/n$; else $M_{ij} = 0$
M is a column stochastic matrix
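A minimal sketch of this construction, using the three-page Yahoo (y), Amazon (a), M'soft (m) graph from the slides. The function name `build_link_matrix` is hypothetical, and the sketch assumes every page has at least one outlink and no repeated targets.

```python
from fractions import Fraction

def build_link_matrix(outlinks, pages):
    """Return M as a list of rows, with M[i][j] = 1/n if page j links to page i."""
    index = {page: k for k, page in enumerate(pages)}
    size = len(pages)
    m = [[Fraction(0)] * size for _ in range(size)]
    for source, targets in outlinks.items():
        j = index[source]
        for target in targets:
            # Page j splits its vote evenly over its n outlinks.
            m[index[target]][j] = Fraction(1, len(targets))
    return m

pages = ["y", "a", "m"]
outlinks = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = build_link_matrix(outlinks, pages)
# Column stochastic: the entries of every column add up to 1.
column_sums = [sum(M[i][j] for i in range(len(pages))) for j in range(len(pages))]
print(column_sums)
```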
Spider traps
A group of pages is a spider trap if there are no links from within the group to outside the group
Random surfer gets trapped
Spider traps violate the conditions needed for the random walk theorem
$|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm
Can use any other vector norm, e.g., Euclidean
Power Iteration Example
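[Figure: web graph of Yahoo (y), Amazon (a), M'soft (m)]
$$
M = \begin{array}{c|ccc} & y & a & m \\ \hline y & 1/2 & 1/2 & 0 \\ a & 1/2 & 0 & 1 \\ m & 0 & 1/2 & 0 \end{array}
$$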
Simple “flow” model
The web in 1839
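[Figure: three-page web graph. Yahoo (y) links to itself and to Amazon, each with flow y/2; Amazon (a) links to Yahoo and to M'soft, each with flow a/2; M'soft (m) links to Amazon with flow m.]
y = y/2 + a/2
a = y/2 + m
m = a/2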
Solving the flow equations
3 equations, 3 unknowns, no constants
Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t
p(t) is a probability distribution on pages
The stationary distribution
Suppose page j links to 3 pages, including i
[Figure: the product M r = r; column j of M has the entry 1/3 in row i.]
Verify that a column stochastic M preserves |r|.
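One way to carry out this verification, sketched under the assumption that all entries of r are non-negative, so that |r| is simply the sum of its entries:

```latex
\sum_{i} (Mr)_i
  = \sum_{i} \sum_{j} M_{ij} r_j
  = \sum_{j} r_j \sum_{i} M_{ij}
  = \sum_{j} r_j \cdot 1
  = \sum_{j} r_j
```

The third step uses only the fact that each column of M sums to 1.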
Eigenvector formulation
The flow equations can be written r = Mr
So the rank vector is an eigenvector of the stochastic web matrix
Columns sum to 1
Suppose r is a vector with one entry per web page
$r_i$ is the importance score of page i
Call it the rank vector
|r| = 1
Example
Then p(t) is called a stationary distribution for the random walk
Our rank vector r satisfies r = Mr
So it is a stationary distribution for the random surfer
Page Rank
Construct N by N matrix A as follows
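$A_{ij} = \beta M_{ij} + (1-\beta)/N$
Example with β = 0.8 and N = 3 (rows and columns ordered y, a, m):
$$
A = 0.8 \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{bmatrix} + 0.2 \begin{bmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{bmatrix}
$$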
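The construction can be reproduced with a short sketch, assuming the formula A_ij = β M_ij + (1-β)/N with β = 0.8, which matches the 0.8 and 0.2 factors in the example. The function name `add_teleports` is hypothetical; M is the spider-trap matrix from the slides, in which M'soft links only to itself.

```python
from fractions import Fraction as F

def add_teleports(m, beta):
    """Return A with A[i][j] = beta * M[i][j] + (1 - beta) / N."""
    n = len(m)
    return [[beta * m[i][j] + (1 - beta) / n for j in range(n)] for i in range(n)]

# Rows and columns are ordered y, a, m.
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(0)],
    [F(0),    F(1, 2), F(1)],
]
A = add_teleports(M, F(4, 5))  # beta = 0.8
for row in A:
    print([str(v) for v in row])
```

Every column of A still sums to 1, so A is again column stochastic.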
The Google solution for spider traps
At each time step, the random surfer has two options:
With probability β, follow a link at random
With probability 1-β, jump to some page uniformly at random
Common values for β are in the range 0.8 to 0.9
$$
A = \begin{array}{c|ccc} & y & a & m \\ \hline y & 7/15 & 7/15 & 1/15 \\ a & 7/15 & 1/15 & 1/15 \\ m & 1/15 & 7/15 & 13/15 \end{array}
$$
Random teleports (β = 0.8)
In fact, the rank vector is the first or principal eigenvector of M, with corresponding eigenvalue 1
Example
y = y/2 + a/2
a = y/2 + m
m = a/2
$$
M = \begin{array}{c|ccc} & y & a & m \\ \hline y & 1/2 & 1/2 & 0 \\ a & 1/2 & 0 & 1 \\ m & 0 & 1/2 & 0 \end{array}
$$
Are all inlinks equal?
Recursive question!
Simple recursive formulation
Each link's vote is proportional to the importance of its source page
If page P with importance x has n outlinks, each link gets x/n votes
Page P's own importance is the sum of the votes on its inlinks
Where is the surfer at time t+1?
Follows a link uniformly at random
p(t+1) = Mp(t)
Suppose the random walk reaches a state such that p(t+1) = Mp(t) = p(t)
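A small check of the stationarity condition on the three-page example, as a sketch. The function name `step` is hypothetical, and the candidate vector (2/5, 2/5, 1/5) is the solution of the flow equations given in the slides.

```python
from fractions import Fraction as F

def step(m, p):
    """One step of the random walk: p(t+1) = M p(t)."""
    return [sum(m[i][j] * p[j] for j in range(len(p))) for i in range(len(m))]

# Rows and columns are ordered y, a, m.
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(1)],
    [F(0),    F(1, 2), F(0)],
]
uniform = [F(1, 3), F(1, 3), F(1, 3)]
candidate = [F(2, 5), F(2, 5), F(1, 5)]

uniform_is_stationary = step(M, uniform) == uniform
candidate_is_stationary = step(M, candidate) == candidate
print(uniform_is_stationary, candidate_is_stationary)
```

The uniform distribution moves after one step, while (2/5, 2/5, 1/5) is left unchanged by M.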
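Power iteration with teleports (β = 0.8), starting from (1, 1, 1):
$$
\begin{bmatrix} y \\ a \\ m \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1.00 \\ 0.60 \\ 1.40 \end{bmatrix}, \begin{bmatrix} 0.84 \\ 0.60 \\ 1.56 \end{bmatrix}, \begin{bmatrix} 0.776 \\ 0.536 \\ 1.688 \end{bmatrix}, \ldots, \begin{bmatrix} 7/11 \\ 5/11 \\ 21/11 \end{bmatrix}
$$
Note: these vectors sum to 3 rather than 1, because the starting vector was not divided by 3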
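The iteration with teleports can be replayed with a few lines of Python, as a sketch. It assumes the matrix A = 0.8 M + 0.2 [1/3] for the spider-trap graph (entries 7/15, 1/15 and 13/15) and keeps the un-normalized starting vector (1, 1, 1); the helper name `mat_vec` and the fixed count of 100 iterations are arbitrary illustrative choices.

```python
def mat_vec(a, x):
    """Multiply matrix a (list of rows) by vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

# Rows and columns are ordered y, a, m.
A = [
    [7 / 15, 7 / 15, 1 / 15],
    [7 / 15, 1 / 15, 1 / 15],
    [1 / 15, 7 / 15, 13 / 15],
]
r = [1.0, 1.0, 1.0]  # entries sum to 3, not 1
history = [r]
for _ in range(100):
    r = mat_vec(A, r)
    history.append(r)

for vec in history[:4]:
    print([round(v, 3) for v in vec])
print([round(v, 4) for v in r])
```

Dividing the final vector by 3 gives a proper probability distribution, (7/33, 5/33, 21/33); M'soft still ranks highest but no longer absorbs all of the importance.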
Surfer will teleport out of spider trap within a few time steps
[Figure: the Yahoo, Amazon, M'soft graph with random teleports (β = 0.8). Each link followed out of Yahoo has probability 0.8*1/2; each teleport edge has probability 0.2*1/3. Dashed lines: random teleports. Red lines: follow a link.]
Column y of A: $0.8 \, [1/2, 1/2, 0]^T + 0.2 \, [1/3, 1/3, 1/3]^T = [7/15, 7/15, 1/15]^T$
Existence and Uniqueness
A central result from the theory of random walks (aka Markov processes):
For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0.
Microsoft becomes a spider trap
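$$
M = \begin{array}{c|ccc} & y & a & m \\ \hline y & 1/2 & 1/2 & 0 \\ a & 1/2 & 0 & 0 \\ m & 0 & 1/2 & 1 \end{array}
$$
Power iteration starting from (1, 1, 1):
$$
\begin{bmatrix} y \\ a \\ m \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 1/2 \\ 3/2 \end{bmatrix}, \begin{bmatrix} 3/4 \\ 1/2 \\ 7/4 \end{bmatrix}, \begin{bmatrix} 5/8 \\ 3/8 \\ 2 \end{bmatrix}, \ldots, \begin{bmatrix} 0 \\ 0 \\ 3 \end{bmatrix}
$$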
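The drift of all importance into the trap can be replayed exactly, as a sketch. It assumes the spider-trap matrix in which M'soft links only to itself and starts from (1, 1, 1); the helper name `mat_vec` and the count of 50 iterations are arbitrary illustrative choices.

```python
from fractions import Fraction as F

# Rows and columns are ordered y, a, m; the m column is (0, 0, 1).
M_TRAP = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(0)],
    [F(0),    F(1, 2), F(1)],
]

def mat_vec(m, r):
    """Multiply matrix m (list of rows) by vector r."""
    return [sum(m_ij * r_j for m_ij, r_j in zip(row, r)) for row in m]

r = [F(1), F(1), F(1)]
trace = [r]
for _ in range(50):
    r = mat_vec(M_TRAP, r)
    trace.append(r)

for vec in trace[:4]:
    print([str(v) for v in vec])
print([float(v) for v in trace[-1]])
```

The total stays at 3 on every step, but the share held by Yahoo and Amazon shrinks geometrically, so M'soft's score approaches 3.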
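Power iteration on the example without teleports, starting from (1/3, 1/3, 1/3):
$$
\begin{bmatrix} y \\ a \\ m \end{bmatrix} = \begin{bmatrix} 1/3 \\ 1/3 \\ 1/3 \end{bmatrix}, \begin{bmatrix} 1/3 \\ 1/2 \\ 1/6 \end{bmatrix}, \begin{bmatrix} 5/12 \\ 1/3 \\ 1/4 \end{bmatrix}, \begin{bmatrix} 3/8 \\ 11/24 \\ 1/6 \end{bmatrix}, \ldots, \begin{bmatrix} 2/5 \\ 2/5 \\ 1/5 \end{bmatrix}
$$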
Random Walk Interpretation
Imagine a random web surfer
At any time t, surfer is on some page P
At time t+1, the surfer follows an outlink from P uniformly at random
Ends up on some page Q linked from P
Process repeats indefinitely
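The random surfer can be simulated directly, as a rough sketch. The function name `simulate_surfer`, the fixed seed and the step count are hypothetical choices; the graph is the three-page Yahoo (y), Amazon (a), M'soft (m) example from the slides.

```python
import random

def simulate_surfer(outlinks, start, steps, seed=0):
    """Follow random outlinks for `steps` steps and return visit frequencies."""
    rng = random.Random(seed)
    visits = {page: 0 for page in outlinks}
    page = start
    for _ in range(steps):
        # Pick one outlink of the current page uniformly at random.
        page = rng.choice(outlinks[page])
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

outlinks = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
freq = simulate_surfer(outlinks, start="y", steps=200_000)
print(freq)
```

Over a long run the visit frequencies should settle near the rank vector (2/5, 2/5, 1/5) for this graph, up to sampling noise.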
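r = Mr
$$
\begin{bmatrix} y \\ a \\ m \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} y \\ a \\ m \end{bmatrix}
$$
y = y/2 + a/2
a = y/2 + m
m = a/2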
Power Iteration method
Simple iterative scheme (aka relaxation)
Suppose there are N web pages
Initialize: $r_0 = [1/N, \ldots, 1/N]^T$
Iterate: $r_{k+1} = M r_k$
Stop when $|r_{k+1} - r_k|_1 < \epsilon$
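The scheme can be sketched in plain Python as follows. The function name `power_iteration`, the default tolerance of 1e-10 and the `max_iter` safety cap are assumptions for illustration; the slides do not specify a tolerance value. M is the three-page example matrix.

```python
def power_iteration(m, epsilon=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below epsilon."""
    n = len(m)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(m[i][j] * r[j] for j in range(n)) for i in range(n)]
        # L1 norm of the difference between successive iterates.
        if sum(abs(x - y) for x, y in zip(r_next, r)) < epsilon:
            return r_next
        r = r_next
    return r  # cap reached without meeting the tolerance

# Rows and columns are ordered y, a, m.
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
rank = power_iteration(M)
print([round(v, 4) for v in rank])
```

Each iteration is one matrix-vector product, so with a sparse representation of M the cost per step is proportional to the number of links.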
Ranking web pages
Web pages are not equally “important”
www.joe-schmoe.com vs. www.stanford.edu
Inlinks as votes
www.stanford.edu has 23,400 inlinks
www.joe-schmoe.com has 1 inlink