
Chapter 122. PageRank and Network Analysis

PageRank is a method for assigning importance scores to nodes in a directed network.

It was developed for ranking web pages, but the underlying idea applies to many networks: citation graphs, social graphs, dependency graphs, recommendation graphs, knowledge graphs, and communication networks.

The central idea is recursive importance. A page is important if important pages link to it. This statement is circular, but linear algebra turns it into an eigenvector problem.

PageRank models a random surfer moving through a directed graph built from the hyperlink structure of the web. The PageRank vector is the stationary distribution of this Markov chain. Equivalently, it is a left eigenvector of a stochastic matrix with eigenvalue 1, and this stationary vector gives the ranking scores.

122.1 Directed Networks

A directed network consists of nodes and directed edges.

For the web, nodes are pages and directed edges are hyperlinks. If page i links to page j, then there is a directed edge

i \to j.

Directed edges have orientation. A link from i to j does not imply a link from j to i.

This direction matters. PageRank uses incoming links as evidence of importance. A page with many incoming links may be important, but the quality of those incoming links also matters.

122.2 The Web Graph

The web graph is a directed graph

G=(V,E),

where V is the set of web pages and E is the set of hyperlinks.

If

|V|=n,

then the graph has n pages.

The web graph is enormous, sparse, directed, and irregular. Most pages link to only a small number of other pages compared with the total number of pages. Therefore its adjacency matrix is sparse.

PageRank uses this sparse link structure to construct a transition matrix.

122.3 Adjacency Matrix

Let A be the adjacency matrix of a directed graph.

Define

a_{ij} = \begin{cases} 1, & \text{if page } i \text{ links to page } j, \\ 0, & \text{otherwise.} \end{cases}

Here row i stores links going out from page i.

For example, suppose there are four pages and the links are

1\to 2,\quad 1\to 3,\quad 2\to 3,\quad 3\to 1,\quad 4\to 3.

Then

A= \begin{bmatrix} 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}.

The row sums give the number of outgoing links.
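As a minimal sketch of this construction, the adjacency matrix of the example can be built directly from its edge list (pages 1 to 4 in the text map to indices 0 to 3 here):

```python
# Build the 4-page adjacency matrix from the example's edge list.
n = 4
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)]

A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i - 1][j - 1] = 1  # row i stores links going out from page i

row_sums = [sum(row) for row in A]  # out-degrees of the four pages
```

The row sums come out as [2, 1, 1, 1], matching the out-degrees computed in the next section.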

122.4 Out-Degree

The out-degree of node i is

d_i^{\text{out}} = \sum_{j=1}^n a_{ij}.

It counts how many links leave node i.

For the example matrix,

d_1^{\text{out}}=2,\quad d_2^{\text{out}}=1,\quad d_3^{\text{out}}=1,\quad d_4^{\text{out}}=1.

PageRank assumes that if a surfer is on page i, and page i has outgoing links, the surfer chooses one of those links with equal probability.

Thus each outgoing link from page i receives probability

\frac{1}{d_i^{\text{out}}}.

122.5 Transition Matrix

From the adjacency matrix, define a transition matrix P by

p_{ij} = \begin{cases} \frac{1}{d_i^{\text{out}}}, & \text{if } i \to j, \\ 0, & \text{otherwise.} \end{cases}

Each row sums to 1, provided every node has at least one outgoing edge.

Thus P is a row-stochastic matrix.

For the example,

P= \begin{bmatrix} 0 & \frac12 & \frac12 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}.

The entry p_{ij} is the probability of moving from page i to page j in one click.
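A sketch of this row normalization, starting from the example adjacency matrix (every row here has at least one link, so no dangling-node fix is needed yet):

```python
# Divide each row of the adjacency matrix by its out-degree to get
# the row-stochastic transition matrix P.
A = [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 0],
     [0, 0, 1, 0]]

P = []
for row in A:
    d = sum(row)                     # out-degree of this page
    P.append([a / d for a in row])   # each out-link gets probability 1/d

# Every row of P now sums to 1.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)
```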

122.6 Random Surfer Model

The random surfer model imagines a user who moves from page to page by clicking links.

If the surfer is on page i, the next page is chosen from the outgoing links of i.

The probability distribution after one step is

\mu_{k+1}=\mu_k P,

where μ_k is a row probability vector.

After k steps,

\mu_k=\mu_0 P^k.

The PageRank score of a page is the long-run probability that the surfer is on that page.

Thus PageRank converts ranking into a Markov chain equilibrium problem.
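The one-step update can be sketched as a plain row-vector-times-matrix product (using the 4-page transition matrix from earlier; the function name `step` is illustrative):

```python
# One step of the random surfer: mu_{k+1} = mu_k P.
def step(mu, P):
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

P = [[0, 0.5, 0.5, 0],
     [0, 0,   1,   0],
     [1, 0,   0,   0],
     [0, 0,   1,   0]]

mu = [0.25] * 4          # start from the uniform distribution
for _ in range(3):
    mu = step(mu, P)     # remains a probability vector at every step

assert abs(sum(mu) - 1.0) < 1e-12
```

Note that page 4 has no incoming links, so its probability drops to zero after the first step and stays there.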

122.7 Stationary Distribution

A stationary distribution is a probability vector π satisfying

\pi P=\pi.

This says that applying the transition matrix does not change the distribution.

In eigenvalue language, π is a left eigenvector of P with eigenvalue 1, normalized so that

\sum_{i=1}^n \pi_i=1.

The entry π_i is the PageRank score of page i.

A larger π_i means the random surfer spends more long-run time on page i.

Every finite Markov chain has a left eigenvector of its transition matrix with eigenvalue 1; under suitable irreducibility and aperiodicity conditions, the stationary distribution is unique and the chain converges to it from any starting distribution.

122.8 Recursive Importance

The stationary equation can be expanded.

For row-stochastic P,

\pi_j = \sum_{i=1}^n \pi_i p_{ij}.

Thus the score of page j is the sum of scores received from pages that link to it.

If page i links to j, then it contributes

\frac{\pi_i}{d_i^{\text{out}}}

to j.

Therefore

\pi_j = \sum_{i\to j} \frac{\pi_i}{d_i^{\text{out}}}.

This formula expresses the recursive nature of PageRank. A page receives rank from incoming pages, but each incoming page divides its rank among its outgoing links.
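To make the recursion concrete, here is a small sketch on a 3-page graph with links 1→2, 1→3, 2→3, 3→1: repeated multiplication by P settles on a vector that satisfies the formula above.

```python
# This 3-page chain is irreducible and aperiodic, so iteration converges.
P = [[0, 0.5, 0.5],
     [0, 0,   1  ],
     [1, 0,   0  ]]

pi = [1 / 3] * 3
for _ in range(500):
    pi = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]

# Check pi_j = sum over links i -> j of pi_i / d_i^out:
# page 1 is reached only from page 3 (out-degree 1), so pi_1 = pi_3,
# and page 2 is reached only from page 1 (out-degree 2), so pi_2 = pi_1 / 2.
assert abs(pi[0] - pi[2]) < 1e-10
assert abs(pi[1] - 0.5 * pi[0]) < 1e-10
```

The limit here is π = (0.4, 0.2, 0.4), which indeed satisfies both recursive identities.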

122.9 Dangling Nodes

A dangling node is a node with no outgoing links.

For such a node,

d_i^{\text{out}}=0.

The random surfer has no link to follow.

If we leave the row empty, the transition matrix no longer has row sum 1. This breaks the Markov chain interpretation.

A common fix is to replace the dangling row by a uniform distribution:

p_{ij}=\frac{1}{n}

for all j.

This means that from a dangling page, the surfer jumps to a random page.

This modification keeps the transition matrix stochastic.
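A sketch of the fix, written as a hypothetical helper `fix_dangling` that builds the stochastic transition matrix row by row:

```python
# Convert an adjacency matrix to a transition matrix, replacing each
# dangling (all-zero) row with the uniform distribution 1/n.
def fix_dangling(A):
    n = len(A)
    P = []
    for row in A:
        d = sum(row)
        if d == 0:
            P.append([1.0 / n] * n)          # dangling page: jump anywhere
        else:
            P.append([a / d for a in row])   # usual 1/out-degree split
    return P

# Small example: page 3 (index 2) has no outgoing links.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 0, 0]]
P = fix_dangling(A)

assert P[2] == [1 / 3] * 3                          # dangling row is uniform
assert all(abs(sum(r) - 1.0) < 1e-12 for r in P)    # stochastic again
```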

122.10 Rank Sinks

A rank sink is a set of pages that traps rank.

For example, suppose pages in a subset link to one another but do not link out. Once the random surfer enters this subset, the surfer cannot leave.

Then PageRank mass may accumulate inside the sink.

This is undesirable for web ranking because a small isolated group could absorb too much score.

The standard solution is teleportation.

122.11 Teleportation

Teleportation means that at each step, the surfer either follows a link or jumps to a random page.

Let

0<\alpha<1

be the damping factor.

With probability α, the surfer follows a link.

With probability 1-α, the surfer teleports to a page chosen from a distribution v.

Usually,

v= \frac{1}{n}\mathbf{1}^T

is uniform.

The Google matrix is

G=\alpha P+(1-\alpha)\mathbf{1}v.

Here 1v is the matrix whose every row equals v.

The PageRank vector satisfies

\pi G=\pi.

The damping factor is commonly set to 0.85 in PageRank expositions, and the teleportation term eliminates rank sinks while making the random-surfer chain irreducible and aperiodic.
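As a sketch, the Google matrix can be assembled entry by entry; with 0 < α < 1 and uniform v, every entry of G is strictly positive, which is exactly what rules out sinks:

```python
# G = alpha * P + (1 - alpha) * 1v, with uniform teleportation v = 1/n.
def google_matrix(P, alpha=0.85):
    n = len(P)
    return [[alpha * P[i][j] + (1 - alpha) / n for j in range(n)]
            for i in range(n)]

P = [[0, 0.5, 0.5],
     [0, 0,   1  ],
     [1, 0,   0  ]]
G = google_matrix(P)

assert all(abs(sum(row) - 1.0) < 1e-12 for row in G)  # still row-stochastic
assert all(g > 0 for row in G for g in row)           # every entry positive
```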

122.12 The PageRank Equation

Using a row-stochastic convention, the PageRank equation is

\pi=\alpha\pi P+(1-\alpha)v.

Equivalently,

\pi(I-\alpha P)=(1-\alpha)v.

Since the spectral radius of αP is α < 1 for row-stochastic P, the matrix I - αP is invertible, and

\pi=(1-\alpha)v(I-\alpha P)^{-1}.

In practice, the inverse is not formed for large graphs. The matrix is too large, and the graph is sparse.

Instead, PageRank is usually computed iteratively.

122.13 Column-Stochastic Convention

Some texts use column-stochastic matrices instead.

Let M be a matrix whose columns sum to 1. Then the PageRank vector r is a column vector satisfying

r=\alpha Mr+(1-\alpha)v.

This is equivalent to the row-stochastic form after transposition.

The choice is conventional. The mathematics is the same.

Convention          Distribution       Equation
Row-stochastic      Row vector π       π = πG
Column-stochastic   Column vector r    r = Gr

When reading PageRank formulas, one must check which convention is used.

122.14 Power Iteration

The power method computes PageRank by repeated multiplication.

Start with an initial probability vector

\pi_0.

Then iterate

\pi_{k+1}=\alpha\pi_k P+(1-\alpha)v.

Under the usual PageRank modification, this sequence converges to the PageRank vector.

The method is simple and scalable because each iteration uses sparse matrix-vector multiplication.

The original PageRank literature and later analyses describe the PageRank vector as obtainable by power iteration on the modified stochastic matrix.

122.15 Example

Consider three pages with links

1\to 2,\quad 1\to 3,\quad 2\to 3,\quad 3\to 1.

The transition matrix is

P= \begin{bmatrix} 0 & \frac12 & \frac12 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}.

Let

\alpha=0.85, \qquad v= \begin{bmatrix} \frac13 & \frac13 & \frac13 \end{bmatrix}.

Then

\pi_{k+1}=0.85\,\pi_k P+0.15\,v.

Start with

\pi_0= \begin{bmatrix} \frac13 & \frac13 & \frac13 \end{bmatrix}.

After repeated iteration, the values approach a fixed probability vector.

That vector is the PageRank score vector.
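A minimal power-iteration sketch for this example (the stopping tolerance and iteration cap are arbitrary choices, not part of the method itself):

```python
# Iterate pi_{k+1} = 0.85 * pi_k P + 0.15 * v for the 3-page example.
P = [[0, 0.5, 0.5],
     [0, 0,   1  ],
     [1, 0,   0  ]]
alpha = 0.85
n = 3
v = [1 / n] * n

pi = v[:]                        # start from the uniform vector
for _ in range(1000):
    new = [alpha * sum(pi[i] * P[i][j] for i in range(n)) + (1 - alpha) * v[j]
           for j in range(n)]
    done = sum(abs(new[j] - pi[j]) for j in range(n)) < 1e-12
    pi = new
    if done:
        break

# The fixed point is a probability vector; page 3 collects the most rank.
assert abs(sum(pi) - 1.0) < 1e-9
assert pi[2] > pi[0] > pi[1]
```

The iteration settles near π ≈ (0.388, 0.215, 0.397), so page 3 ranks highest, with page 1 close behind.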

122.16 Linear System Form

The PageRank equation

\pi=\alpha\pi P+(1-\alpha)v

can be rearranged:

\pi(I-\alpha P)=(1-\alpha)v.

Transposing gives

(I-\alpha P^T)\pi^T=(1-\alpha)v^T.

Thus PageRank can be solved as a linear system.

For small graphs, direct methods are possible.

For large graphs, iterative methods are preferred because the graph matrix is sparse and very large.

This is a typical numerical linear algebra tradeoff: direct solution is conceptually clean, but iterative sparse methods scale better.
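As a sketch of the direct route, the same 3-page PageRank vector can be recovered by solving the transposed system with a small Gaussian elimination (sensible only at toy scale; the solver below is a generic textbook routine, not a PageRank-specific algorithm):

```python
# Solve (I - alpha P^T) x = (1 - alpha) v^T by Gaussian elimination.
def solve(M, b):
    n = len(M)
    M = [row[:] + [b[i]] for i, row in enumerate(M)]      # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

alpha, n = 0.85, 3
P = [[0, 0.5, 0.5], [0, 0, 1], [1, 0, 0]]
I_minus = [[(1.0 if i == j else 0.0) - alpha * P[j][i] for j in range(n)]
           for i in range(n)]                              # I - alpha * P^T
pi = solve(I_minus, [(1 - alpha) / n] * n)

# The solution satisfies pi = alpha pi P + (1 - alpha) v exactly.
for j in range(n):
    rhs = alpha * sum(pi[i] * P[i][j] for i in range(n)) + (1 - alpha) / n
    assert abs(pi[j] - rhs) < 1e-12
```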

122.17 Eigenvector Centrality

PageRank is related to eigenvector centrality.

Eigenvector centrality assigns high score to a node if it is connected to other high-score nodes.

For an adjacency matrix A, the centrality vector x may satisfy

Ax=\lambda x.

PageRank modifies this idea for directed graphs with normalization, damping, and teleportation.

The modifications are important because the web graph is not regular, not symmetric, and not necessarily strongly connected.

PageRank is therefore a robust eigenvector centrality for directed networks.

122.18 Network Analysis

Network analysis studies graphs through structural quantities.

Common node-level measures include:

Measure                  Meaning
In-degree                Number of incoming links
Out-degree               Number of outgoing links
PageRank                 Long-run random-surfer probability
Eigenvector centrality   Connection to important nodes
Betweenness              Frequency on shortest paths
Closeness                Average distance to others
HITS authority           Linked by hubs
HITS hub                 Links to authorities

PageRank differs from simple in-degree. An incoming link from an important page contributes more than an incoming link from an unimportant page.

122.19 Personalized PageRank

The teleportation distribution need not be uniform.

If

v

concentrates probability on selected nodes, the result is personalized PageRank.

The equation is still

\pi=\alpha\pi P+(1-\alpha)v.

But now the ranking is biased toward pages near the preferred set.

Personalized PageRank is used in recommendation, local graph clustering, semantic search, and graph-based ranking.
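A sketch of the biased iteration on the 3-page example: compared with uniform teleportation, concentrating v on page 1 pulls rank toward page 1 (the function name and iteration count are illustrative choices):

```python
# Personalized PageRank: same update, but v concentrates on chosen nodes.
def personalized_pagerank(P, v, alpha=0.85, iters=500):
    n = len(P)
    pi = v[:]
    for _ in range(iters):
        pi = [alpha * sum(pi[i] * P[i][j] for i in range(n))
              + (1 - alpha) * v[j] for j in range(n)]
    return pi

P = [[0, 0.5, 0.5],
     [0, 0,   1  ],
     [1, 0,   0  ]]

uniform = personalized_pagerank(P, [1 / 3] * 3)
seeded = personalized_pagerank(P, [1.0, 0.0, 0.0])  # teleport only to page 1

assert abs(sum(seeded) - 1.0) < 1e-9
assert seeded[0] > uniform[0]   # rank is biased toward the preferred page
```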

It answers a different question:

Which nodes are important relative to this starting preference?

122.20 Topic-Sensitive PageRank

Topic-sensitive PageRank uses different teleportation vectors for different topics.

For example, one vector may emphasize sports pages, another science pages, another finance pages.

Each topic has its own PageRank vector.

At query time, the ranking system can combine topic-specific scores according to the query.

This is a way to adapt global link analysis to user intent or content category.

The linear algebra stays the same. Only the teleportation vector changes.

122.21 Sparse Matrix Computation

A real web graph has billions of pages and many billions of links.

The transition matrix cannot be stored as a dense matrix.

Instead, it is stored as adjacency lists or sparse matrix formats.

Each PageRank iteration distributes rank mass along outgoing edges:

\text{contribution from } i \text{ to each out-neighbor} = \frac{\alpha\pi_i}{d_i^{\text{out}}}.

Then the teleportation term is added.

The cost of one iteration is proportional to the number of edges, not to n^2.

This sparsity is essential.
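A sketch of one such edge-driven pass over adjacency lists (assuming dangling nodes were already fixed, so every out-link list is nonempty):

```python
# One PageRank iteration over adjacency lists: page i sends
# alpha * pi[i] / d_i^out to each out-neighbor, then every page
# receives the teleportation mass (1 - alpha) / n.
# The work is proportional to the number of edges, not n^2.
def sparse_step(out_links, pi, alpha=0.85):
    n = len(out_links)
    new = [(1 - alpha) / n] * n
    for i, nbrs in enumerate(out_links):
        share = alpha * pi[i] / len(nbrs)
        for j in nbrs:
            new[j] += share
    return new

out_links = [[1, 2], [2], [0]]   # the 3-page example as adjacency lists
pi = [1 / 3] * 3
for _ in range(500):
    pi = sparse_step(out_links, pi)

assert abs(sum(pi) - 1.0) < 1e-9
assert pi[2] > pi[0] > pi[1]     # same ranking as the dense iteration
```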

122.22 Convergence

The damping factor affects convergence.

When α is close to 1, the ranking follows the link graph more strongly, but convergence may be slower.

When α is smaller, teleportation is stronger and convergence is usually faster, but the result is closer to the teleportation distribution.

The convergence rate of power iteration depends on the spectral gap of the modified transition matrix. Analyses of PageRank show that the power method converges to the stationary vector of the primitive Google matrix, with the rate governed by the subdominant eigenvalue, whose magnitude is at most α.

122.23 Local PageRank

Sometimes one does not need a global ranking over all nodes.

Local PageRank methods approximate the personalized PageRank vector near a seed set.

Instead of exploring the whole graph, the algorithm pushes probability mass through nearby edges until the residual is small.

This is useful when the graph is huge and the query is local.

Applications include local clustering, recommendation, fraud detection, and graph search.

122.24 PageRank on Undirected Graphs

On an undirected connected graph without teleportation, the random walk stationary distribution is proportional to degree:

\pi_i=\frac{d_i}{\sum_j d_j}.

Thus PageRank on undirected graphs is closely related to degree.

Teleportation changes the distribution, but degree remains influential.

The most distinctive use of PageRank is on directed graphs, where incoming and outgoing links carry asymmetric information.
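A quick numerical check of the degree formula on a small undirected graph (a triangle with one pendant node; the odd cycle makes the walk aperiodic, so the iteration converges):

```python
# Undirected graph: triangle 0-1-2 plus pendant edge 2-3.
# Degrees are 2, 2, 3, 1, so the stationary vector should be
# proportional to (2, 2, 3, 1), i.e. each degree divided by 8.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
n = 4
nbrs = [[] for _ in range(n)]
for a, b in edges:
    nbrs[a].append(b)
    nbrs[b].append(a)

pi = [1 / n] * n
for _ in range(2000):
    new = [0.0] * n
    for i in range(n):
        for j in nbrs[i]:
            new[j] += pi[i] / len(nbrs[i])   # uniform step to a neighbor
    pi = new

deg_total = sum(len(x) for x in nbrs)        # sum of degrees = 8
assert all(abs(pi[i] - len(nbrs[i]) / deg_total) < 1e-9 for i in range(n))
```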

122.25 PageRank and Search

In web search, PageRank is one signal among many.

A search engine must also consider query relevance, document content, freshness, spam, location, user intent, language, quality signals, and many other factors.

PageRank supplies a query-independent authority score based on link structure.

It does not by itself determine whether a page answers a given query.

In linear algebra terms, PageRank gives a global vector over nodes. Search ranking combines that vector with query-dependent scoring features.

122.26 Limitations

PageRank is powerful but limited.

Limitation          Explanation
Link dependence     It needs meaningful edges
Manipulation        Link farms can try to inflate rank
Global bias         Popular old nodes may dominate
Topic ambiguity     One global score ignores context
Sparse new nodes    New pages may lack incoming links
Dynamic graph       Web links change over time
Computation cost    Large graphs require distributed iteration

Teleportation, personalization, spam detection, temporal models, and content features address some of these limitations.

122.27 PageRank and Linear Algebra

The main dictionary is direct.

PageRank concept        Linear algebra object
Web pages               Basis states or graph nodes
Links                   Directed edges
Link graph              Adjacency matrix
Random surfer           Markov chain
Click probabilities     Stochastic matrix
Long-run importance     Stationary distribution
PageRank vector         Eigenvector with eigenvalue 1
Dangling-node fix       Row replacement
Teleportation           Rank-one correction
Damping factor          Convex combination parameter
Iterative computation   Power method
Personalized ranking    Different teleportation vector

PageRank is one of the clearest examples of a large-scale eigenvector computation.

122.28 Summary

PageRank assigns importance scores to nodes in a directed graph.

It begins with a link graph, converts outgoing links into transition probabilities, fixes dangling nodes, and adds teleportation. The resulting Markov chain has a stationary distribution. That stationary distribution is the PageRank vector.

The basic equation is

\pi=\alpha\pi P+(1-\alpha)v.

Equivalently, PageRank is an eigenvector problem or a linear system. In practice, it is computed by power iteration using sparse matrix operations.

The central principle is recursive importance. A node is important when it is reached often by a random process that prefers links from important nodes. Linear algebra makes this circular definition precise.