
Chapter 122. PageRank and Network Analysis

PageRank is a method for assigning importance scores to nodes in a directed network.

It was developed for ranking web pages, but the underlying idea applies to many networks: citation graphs, social graphs, dependency graphs, recommendation graphs, knowledge graphs, and communication networks.

The central idea is recursive importance. A page is important if important pages link to it. This statement is circular, but linear algebra turns it into an eigenvector problem.

PageRank models a random surfer moving through a directed graph built from the hyperlink structure of the web. The PageRank vector is the stationary distribution of this Markov chain. Equivalently, it is a left eigenvector of a stochastic matrix with eigenvalue 1, and this stationary vector gives the ranking scores.

122.1 Directed Networks

A directed network consists of nodes and directed edges.

For the web, nodes are pages and directed edges are hyperlinks. If page i links to page j, then there is a directed edge

i \to j.

Directed edges have orientation. A link from i to j does not imply a link from j to i.

This direction matters. PageRank uses incoming links as evidence of importance. A page with many incoming links may be important, but the quality of those incoming links also matters.

122.2 The Web Graph

The web graph is a directed graph

G=(V,E),

where V is the set of web pages and E is the set of hyperlinks.

If

|V|=n,

then the graph has n pages.

The web graph is enormous, sparse, directed, and irregular. Most pages link to only a small number of other pages compared with the total number of pages. Therefore its adjacency matrix is sparse.

PageRank uses this sparse link structure to construct a transition matrix.

122.3 Adjacency Matrix

Let A be the adjacency matrix of a directed graph.

Define

a_{ij} = \begin{cases} 1, & \text{if page } i \text{ links to page } j, \\ 0, & \text{otherwise.} \end{cases}

Here row i stores links going out from page i.

For example, suppose there are four pages and the links are

1\to 2,\quad 1\to 3,\quad 2\to 3,\quad 3\to 1,\quad 4\to 3.

Then

A= \begin{bmatrix} 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}.

The row sums give the number of outgoing links.
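As a minimal sketch of this construction, the adjacency matrix of the example can be built directly from its edge list (pages 1 to 4 in the text map to indices 0 to 3 here):

```python
# Build the 4-page adjacency matrix from the example's edge list.
n = 4
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)]

A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i - 1][j - 1] = 1  # row i stores links going out from page i

row_sums = [sum(row) for row in A]  # out-degrees of the four pages
```

The row sums come out as [2, 1, 1, 1], matching the out-degrees computed in the next section.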

122.4 Out-Degree

The out-degree of node i is

d_i^{\text{out}} = \sum_{j=1}^n a_{ij}.

It counts how many links leave node i.

For the example matrix,

d_1^{\text{out}}=2,\quad d_2^{\text{out}}=1,\quad d_3^{\text{out}}=1,\quad d_4^{\text{out}}=1.

PageRank assumes that if a surfer is on page i, and page i has outgoing links, the surfer chooses one of those links with equal probability.

Thus each outgoing link from page i receives probability

\frac{1}{d_i^{\text{out}}}.

122.5 Transition Matrix

From the adjacency matrix, define a transition matrix P by

p_{ij} = \begin{cases} \frac{1}{d_i^{\text{out}}}, & \text{if } i \to j, \\ 0, & \text{otherwise.} \end{cases}

Each row sums to 1, provided every node has at least one outgoing edge.

Thus P is a row-stochastic matrix.

For the example,

P= \begin{bmatrix} 0 & \frac12 & \frac12 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}.

The entry p_{ij} is the probability of moving from page i to page j in one click.
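A sketch of this row normalization, starting from the example adjacency matrix (every row here has at least one link, so no dangling-node fix is needed yet):

```python
# Divide each row of the adjacency matrix by its out-degree to get
# the row-stochastic transition matrix P.
A = [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 0],
     [0, 0, 1, 0]]

P = []
for row in A:
    d = sum(row)                     # out-degree of this page
    P.append([a / d for a in row])   # each out-link gets probability 1/d

# Every row of P now sums to 1.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)
```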

122.6 Random Surfer Model

The random surfer model imagines a user who moves from page to page by clicking links.

If the surfer is on page i, the next page is chosen from the outgoing links of i.

The probability distribution after one step is

\mu_{k+1}=\mu_k P,

where μ_k is a row probability vector.

After k steps,

\mu_k=\mu_0 P^k.

The PageRank score of a page is the long-run probability that the surfer is on that page.

Thus PageRank converts ranking into a Markov chain equilibrium problem.
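The one-step update can be sketched as a plain row-vector-times-matrix product (using the 4-page transition matrix from earlier; the function name `step` is illustrative):

```python
# One step of the random surfer: mu_{k+1} = mu_k P.
def step(mu, P):
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

P = [[0, 0.5, 0.5, 0],
     [0, 0,   1,   0],
     [1, 0,   0,   0],
     [0, 0,   1,   0]]

mu = [0.25] * 4          # start from the uniform distribution
for _ in range(3):
    mu = step(mu, P)     # remains a probability vector at every step

assert abs(sum(mu) - 1.0) < 1e-12
```

Note that page 4 has no incoming links, so its probability drops to zero after the first step and stays there.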

122.7 Stationary Distribution

A stationary distribution is a probability vector π satisfying

\pi P=\pi.

This says that applying the transition matrix does not change the distribution.

In eigenvalue language, π is a left eigenvector of P with eigenvalue 1, normalized so that

\sum_{i=1}^n \pi_i=1.

The entry π_i is the PageRank score of page i.

A larger π_i means the random surfer spends more long-run time on page i.

Every finite Markov chain has a left eigenvector of its transition matrix with eigenvalue 1; under suitable irreducibility and aperiodicity conditions, the stationary distribution is unique and the chain converges to it from any starting distribution.

122.8 Recursive Importance

The stationary equation can be expanded.

For row-stochastic P,

\pi_j = \sum_{i=1}^n \pi_i p_{ij}.

Thus the score of page j is the sum of scores received from pages that link to it.

If page i links to j, then it contributes

\frac{\pi_i}{d_i^{\text{out}}}

to j.

Therefore

\pi_j = \sum_{i\to j} \frac{\pi_i}{d_i^{\text{out}}}.

This formula expresses the recursive nature of PageRank. A page receives rank from incoming pages, but each incoming page divides its rank among its outgoing links.
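To make the recursion concrete, here is a small sketch on a 3-page graph with links 1→2, 1→3, 2→3, 3→1: repeated multiplication by P settles on a vector that satisfies the formula above.

```python
# This 3-page chain is irreducible and aperiodic, so iteration converges.
P = [[0, 0.5, 0.5],
     [0, 0,   1  ],
     [1, 0,   0  ]]

pi = [1 / 3] * 3
for _ in range(500):
    pi = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]

# Check pi_j = sum over links i -> j of pi_i / d_i^out:
# page 1 is reached only from page 3 (out-degree 1), so pi_1 = pi_3,
# and page 2 is reached only from page 1 (out-degree 2), so pi_2 = pi_1 / 2.
assert abs(pi[0] - pi[2]) < 1e-10
assert abs(pi[1] - 0.5 * pi[0]) < 1e-10
```

The limit here is π = (0.4, 0.2, 0.4), which indeed satisfies both recursive identities.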

122.9 Dangling Nodes

A dangling node is a node with no outgoing links.

For such a node,

d_i^{\text{out}}=0.

The random surfer has no link to follow.

If we leave the row empty, the transition matrix no longer has row sum 1. This breaks the Markov chain interpretation.

A common fix is to replace the dangling row by a uniform distribution:

p_{ij}=\frac{1}{n}

for all j.

This means that from a dangling page, the surfer jumps to a random page.

This modification keeps the transition matrix stochastic.
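A sketch of the fix, written as a hypothetical helper `fix_dangling` that builds the stochastic transition matrix row by row:

```python
# Convert an adjacency matrix to a transition matrix, replacing each
# dangling (all-zero) row with the uniform distribution 1/n.
def fix_dangling(A):
    n = len(A)
    P = []
    for row in A:
        d = sum(row)
        if d == 0:
            P.append([1.0 / n] * n)          # dangling page: jump anywhere
        else:
            P.append([a / d for a in row])   # usual 1/out-degree split
    return P

# Small example: page 3 (index 2) has no outgoing links.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 0, 0]]
P = fix_dangling(A)

assert P[2] == [1 / 3] * 3                          # dangling row is uniform
assert all(abs(sum(r) - 1.0) < 1e-12 for r in P)    # stochastic again
```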

122.10 Rank Sinks

A rank sink is a set of pages that traps rank.

For example, suppose pages in a subset link to one another but do not link out. Once the random surfer enters this subset, the surfer cannot leave.

Then PageRank mass may accumulate inside the sink.

This is undesirable for web ranking because a small isolated group could absorb too much score.

The standard solution is teleportation.

122.11 Teleportation

Teleportation means that at each step, the surfer either follows a link or jumps to a random page.

Let

0<\alpha<1

be the damping factor.

With probability α, the surfer follows a link.

With probability 1-α, the surfer teleports to a page chosen from a distribution v.

Usually,

v= \frac{1}{n}\mathbf{1}^T

is uniform.

The Google matrix is

G=\alpha P+(1-\alpha)\mathbf{1}v.

Here 1v is the matrix whose every row equals v.

The PageRank vector satisfies

\pi G=\pi.

The damping factor is commonly set to 0.85 in PageRank expositions, and the teleportation term eliminates rank sinks while making the random-surfer chain irreducible and aperiodic.
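As a sketch, the Google matrix can be assembled entry by entry; with 0 < α < 1 and uniform v, every entry of G is strictly positive, which is exactly what rules out sinks:

```python
# G = alpha * P + (1 - alpha) * 1v, with uniform teleportation v = 1/n.
def google_matrix(P, alpha=0.85):
    n = len(P)
    return [[alpha * P[i][j] + (1 - alpha) / n for j in range(n)]
            for i in range(n)]

P = [[0, 0.5, 0.5],
     [0, 0,   1  ],
     [1, 0,   0  ]]
G = google_matrix(P)

assert all(abs(sum(row) - 1.0) < 1e-12 for row in G)  # still row-stochastic
assert all(g > 0 for row in G for g in row)           # every entry positive
```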

122.12 The PageRank Equation

Using a row-stochastic convention, the PageRank equation is

\pi=\alpha\pi P+(1-\alpha)v.

Equivalently,

\pi(I-\alpha P)=(1-\alpha)v.

Since the spectral radius of αP is α < 1 for row-stochastic P, the matrix I - αP is invertible, and

\pi=(1-\alpha)v(I-\alpha P)^{-1}.

In practice, the inverse is not formed for large graphs. The matrix is too large, and the graph is sparse.

Instead, PageRank is usually computed iteratively.

122.13 Column-Stochastic Convention

Some texts use column-stochastic matrices instead.

Let M be a matrix whose columns sum to 1. Then the PageRank vector r is a column vector satisfying

r=\alpha Mr+(1-\alpha)v.

This is equivalent to the row-stochastic form after transposition.

The choice is conventional. The mathematics is the same.

Convention          Distribution       Equation
Row-stochastic      Row vector π       π = πG
Column-stochastic   Column vector r    r = Gr

When reading PageRank formulas, one must check which convention is used.

122.14 Power Iteration

The power method computes PageRank by repeated multiplication.

Start with an initial probability vector

\pi_0.

Then iterate

\pi_{k+1}=\alpha\pi_k P+(1-\alpha)v.

Under the usual PageRank modification, this sequence converges to the PageRank vector.

The method is simple and scalable because each iteration uses sparse matrix-vector multiplication.

The original PageRank literature and later analyses describe the PageRank vector as obtainable by power iteration on the modified stochastic matrix.

122.15 Example

Consider three pages with links

1\to 2,\quad 1\to 3,\quad 2\to 3,\quad 3\to 1.

The transition matrix is

P= \begin{bmatrix} 0 & \frac12 & \frac12 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}.

Let

\alpha=0.85, \qquad v= \begin{bmatrix} \frac13 & \frac13 & \frac13 \end{bmatrix}.

Then

\pi_{k+1}=0.85\,\pi_k P+0.15\,v.

Start with

\pi_0= \begin{bmatrix} \frac13 & \frac13 & \frac13 \end{bmatrix}.

After repeated iteration, the values approach a fixed probability vector.

That vector is the PageRank score vector.
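A minimal power-iteration sketch for this example (the stopping tolerance and iteration cap are arbitrary choices, not part of the method itself):

```python
# Iterate pi_{k+1} = 0.85 * pi_k P + 0.15 * v for the 3-page example.
P = [[0, 0.5, 0.5],
     [0, 0,   1  ],
     [1, 0,   0  ]]
alpha = 0.85
n = 3
v = [1 / n] * n

pi = v[:]                        # start from the uniform vector
for _ in range(1000):
    new = [alpha * sum(pi[i] * P[i][j] for i in range(n)) + (1 - alpha) * v[j]
           for j in range(n)]
    done = sum(abs(new[j] - pi[j]) for j in range(n)) < 1e-12
    pi = new
    if done:
        break

# The fixed point is a probability vector; page 3 collects the most rank.
assert abs(sum(pi) - 1.0) < 1e-9
assert pi[2] > pi[0] > pi[1]
```

The iteration settles near π ≈ (0.388, 0.215, 0.397), so page 3 ranks highest, with page 1 close behind.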

122.16 Linear System Form

The PageRank equation

\pi=\alpha\pi P+(1-\alpha)v

can be rearranged:

\pi(I-\alpha P)=(1-\alpha)v.

Transposing gives

(I-\alpha P^T)\pi^T=(1-\alpha)v^T.

Thus PageRank can be solved as a linear system.

For small graphs, direct methods are possible.

For large graphs, iterative methods are preferred because the graph matrix is sparse and very large.

This is a typical numerical linear algebra tradeoff: direct solution is conceptually clean, but iterative sparse methods scale better.
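As a sketch of the direct route, the same 3-page PageRank vector can be recovered by solving the transposed system with a small Gaussian elimination (sensible only at toy scale; the solver below is a generic textbook routine, not a PageRank-specific algorithm):

```python
# Solve (I - alpha P^T) x = (1 - alpha) v^T by Gaussian elimination.
def solve(M, b):
    n = len(M)
    M = [row[:] + [b[i]] for i, row in enumerate(M)]      # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

alpha, n = 0.85, 3
P = [[0, 0.5, 0.5], [0, 0, 1], [1, 0, 0]]
I_minus = [[(1.0 if i == j else 0.0) - alpha * P[j][i] for j in range(n)]
           for i in range(n)]                              # I - alpha * P^T
pi = solve(I_minus, [(1 - alpha) / n] * n)

# The solution satisfies pi = alpha pi P + (1 - alpha) v exactly.
for j in range(n):
    rhs = alpha * sum(pi[i] * P[i][j] for i in range(n)) + (1 - alpha) / n
    assert abs(pi[j] - rhs) < 1e-12
```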

122.17 Eigenvector Centrality

PageRank is related to eigenvector centrality.

Eigenvector centrality assigns high score to a node if it is connected to other high-score nodes.

For an adjacency matrix A, the centrality vector x may satisfy

Ax=\lambda x.

PageRank modifies this idea for directed graphs with normalization, damping, and teleportation.

The modifications are important because the web graph is not regular, not symmetric, and not necessarily strongly connected.

PageRank is therefore a robust eigenvector centrality for directed networks.

122.18 Network Analysis

Network analysis studies graphs through structural quantities.

Common node-level measures include:

Measure                  Meaning
In-degree                Number of incoming links
Out-degree               Number of outgoing links
PageRank                 Long-run random-surfer probability
Eigenvector centrality   Connection to important nodes
Betweenness              Frequency on shortest paths
Closeness                Average distance to others
HITS authority           Linked by hubs
HITS hub                 Links to authorities

PageRank differs from simple in-degree. An incoming link from an important page contributes more than an incoming link from an unimportant page.

122.19 Personalized PageRank

The teleportation distribution need not be uniform.

If

v

concentrates probability on selected nodes, the result is personalized PageRank.

The equation is still

\pi=\alpha\pi P+(1-\alpha)v.

But now the ranking is biased toward pages near the preferred set.

Personalized PageRank is used in recommendation, local graph clustering, semantic search, and graph-based ranking.
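A sketch of the biased iteration on the 3-page example: compared with uniform teleportation, concentrating v on page 1 pulls rank toward page 1 (the function name and iteration count are illustrative choices):

```python
# Personalized PageRank: same update, but v concentrates on chosen nodes.
def personalized_pagerank(P, v, alpha=0.85, iters=500):
    n = len(P)
    pi = v[:]
    for _ in range(iters):
        pi = [alpha * sum(pi[i] * P[i][j] for i in range(n))
              + (1 - alpha) * v[j] for j in range(n)]
    return pi

P = [[0, 0.5, 0.5],
     [0, 0,   1  ],
     [1, 0,   0  ]]

uniform = personalized_pagerank(P, [1 / 3] * 3)
seeded = personalized_pagerank(P, [1.0, 0.0, 0.0])  # teleport only to page 1

assert abs(sum(seeded) - 1.0) < 1e-9
assert seeded[0] > uniform[0]   # rank is biased toward the preferred page
```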

It answers a different question:

Which nodes are important relative to this starting preference?

122.20 Topic-Sensitive PageRank

Topic-sensitive PageRank uses different teleportation vectors for different topics.

For example, one vector may emphasize sports pages, another science pages, another finance pages.

Each topic has its own PageRank vector.

At query time, the ranking system can combine topic-specific scores according to the query.

This is a way to adapt global link analysis to user intent or content category.

The linear algebra stays the same. Only the teleportation vector changes.

122.21 Sparse Matrix Computation

A real web graph has billions of pages and many billions of links.

The transition matrix cannot be stored as a dense matrix.

Instead, it is stored as adjacency lists or sparse matrix formats.

Each PageRank iteration distributes rank mass along outgoing edges:

\text{contribution from } i \text{ to each out-neighbor} = \frac{\alpha\pi_i}{d_i^{\text{out}}}.

Then the teleportation term is added.

The cost of one iteration is proportional to the number of edges, not to n^2.

This sparsity is essential.
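A sketch of one such edge-driven pass over adjacency lists (assuming dangling nodes were already fixed, so every out-link list is nonempty):

```python
# One PageRank iteration over adjacency lists: page i sends
# alpha * pi[i] / d_i^out to each out-neighbor, then every page
# receives the teleportation mass (1 - alpha) / n.
# The work is proportional to the number of edges, not n^2.
def sparse_step(out_links, pi, alpha=0.85):
    n = len(out_links)
    new = [(1 - alpha) / n] * n
    for i, nbrs in enumerate(out_links):
        share = alpha * pi[i] / len(nbrs)
        for j in nbrs:
            new[j] += share
    return new

out_links = [[1, 2], [2], [0]]   # the 3-page example as adjacency lists
pi = [1 / 3] * 3
for _ in range(500):
    pi = sparse_step(out_links, pi)

assert abs(sum(pi) - 1.0) < 1e-9
assert pi[2] > pi[0] > pi[1]     # same ranking as the dense iteration
```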

122.22 Convergence

The damping factor affects convergence.

When α is close to 1, the ranking follows the link graph more strongly, but convergence may be slower.

When α is smaller, teleportation is stronger and convergence is usually faster, but the result is closer to the teleportation distribution.

The convergence rate of power iteration depends on the spectral gap of the modified transition matrix. Analyses of PageRank show that the power method converges to the stationary vector of the primitive Google matrix, with the rate governed by the subdominant eigenvalue, whose magnitude is at most α.

122.23 Local PageRank

Sometimes one does not need a global ranking over all nodes.

Local PageRank methods approximate the personalized PageRank vector near a seed set.

Instead of exploring the whole graph, the algorithm pushes probability mass through nearby edges until the residual is small.

This is useful when the graph is huge and the query is local.

Applications include local clustering, recommendation, fraud detection, and graph search.

122.24 PageRank on Undirected Graphs

On an undirected connected graph without teleportation, the random walk stationary distribution is proportional to degree:

\pi_i=\frac{d_i}{\sum_j d_j}.

Thus PageRank on undirected graphs is closely related to degree.

Teleportation changes the distribution, but degree remains influential.

The most distinctive use of PageRank is on directed graphs, where incoming and outgoing links carry asymmetric information.
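A quick numerical check of the degree formula on a small undirected graph (a triangle with one pendant node; the odd cycle makes the walk aperiodic, so the iteration converges):

```python
# Undirected graph: triangle 0-1-2 plus pendant edge 2-3.
# Degrees are 2, 2, 3, 1, so the stationary vector should be
# proportional to (2, 2, 3, 1), i.e. each degree divided by 8.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
n = 4
nbrs = [[] for _ in range(n)]
for a, b in edges:
    nbrs[a].append(b)
    nbrs[b].append(a)

pi = [1 / n] * n
for _ in range(2000):
    new = [0.0] * n
    for i in range(n):
        for j in nbrs[i]:
            new[j] += pi[i] / len(nbrs[i])   # uniform step to a neighbor
    pi = new

deg_total = sum(len(x) for x in nbrs)        # sum of degrees = 8
assert all(abs(pi[i] - len(nbrs[i]) / deg_total) < 1e-9 for i in range(n))
```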

122.25 PageRank and Search

In web search, PageRank is one signal among many.

A search engine must also consider query relevance, document content, freshness, spam, location, user intent, language, quality signals, and many other factors.

PageRank supplies a query-independent authority score based on link structure.

It does not by itself determine whether a page answers a given query.

In linear algebra terms, PageRank gives a global vector over nodes. Search ranking combines that vector with query-dependent scoring features.

122.26 Limitations

PageRank is powerful but limited.

Limitation          Explanation
Link dependence     It needs meaningful edges
Manipulation        Link farms can try to inflate rank
Global bias         Popular old nodes may dominate
Topic ambiguity     One global score ignores context
Sparse new nodes    New pages may lack incoming links
Dynamic graph       Web links change over time
Computation cost    Large graphs require distributed iteration

Teleportation, personalization, spam detection, temporal models, and content features address some of these limitations.

122.27 PageRank and Linear Algebra

The main dictionary is direct.

PageRank concept        Linear algebra object
Web pages               Basis states or graph nodes
Links                   Directed edges
Link graph              Adjacency matrix
Random surfer           Markov chain
Click probabilities     Stochastic matrix
Long-run importance     Stationary distribution
PageRank vector         Eigenvector with eigenvalue 1
Dangling-node fix       Row replacement
Teleportation           Rank-one correction
Damping factor          Convex combination parameter
Iterative computation   Power method
Personalized ranking    Different teleportation vector

PageRank is one of the clearest examples of a large-scale eigenvector computation.

122.28 Summary

PageRank assigns importance scores to nodes in a directed graph.

It begins with a link graph, converts outgoing links into transition probabilities, fixes dangling nodes, and adds teleportation. The resulting Markov chain has a stationary distribution. That stationary distribution is the PageRank vector.

The basic equation is

\pi=\alpha\pi P+(1-\alpha)v.

Equivalently, PageRank is an eigenvector problem or a linear system. In practice, it is computed by power iteration using sparse matrix operations.

The central principle is recursive importance. A node is important when it is reached often by a random process that prefers links from important nodes. Linear algebra makes this circular definition precise.