Google’s PageRank Algorithm#
About the project#
Duration: 20-50 hours (including typesetting final report in LaTeX)
Prerequisites: Linear algebra (matrices, eigenvalues, eigenvectors), basic graph theory, probability concepts, basic Python programming
Python packages: NumPy, matplotlib, networkx, scipy
Learning objectives: Understand Google’s PageRank algorithm, work with adjacency matrices, compute dominant eigenvalues, analyze network structures
Important
This project was created by Jesper Bækholm Rom Poulsen and Jonas Amtoft.
Based on the work of David Brander. With guidance from Jakob Lemvig.
Introduction#
When you search on Google, you are presented with a list of websites with the most relevant at the top, disregarding advertisements. How does Google’s search algorithm know how to rank these websites? One way to look at it is that a website that many others link to is a relevant website. This is the basic idea behind the PageRank algorithm, named after Google co-founder Larry Page, and exactly what this project is about. At Google, handling the internet’s enormous amount of information is a major interplay between several systems. The web crawlers, which are distributed across many machines, fetch and store pages in a large repository. These pages undergo a process where important words and links are extracted and organized, forming the basis for building a large network structure of websites connected via links. This linking, which among other things enables the computation of a website’s PageRank, helps ensure that relevant and significant pages can easily be found (http://infolab.stanford.edu/~backrub/google.html).
In this project you will investigate one of the mathematical models behind the PageRank algorithm and develop different methods to compute PageRank in Python.
Preparation#
The eigenvalue problem and diagonalization, chapter 11 (Mathematics 1a)
Diagonalizable matrices, section 2.7 (Mathematics 1b)
Project goals#
The main goal of the project is to create four different implementations of the PageRank algorithm, which for a given network can compute the PageRank values.
Note
You must write a coherent report without referring to the exercise numbers below. It is not expected that all exercises are answered. In Exercise 22 you should for example prove that all matrices of a certain form are so-called Markov matrices - if you cannot get through the general proof, you should instead illustrate the statement with some examples. You should let your final report be guided by your interests and ambitions.
Exercises marked with (*) can be omitted in the final report. Exercises with (**) are particularly challenging and are intended for interested students.
Graph-Theoretic Modeling of Networks#
Websites and their links to other websites can be represented as a network \(W\), which can be viewed as a directed graph \(W = (V, E)\). The set \(V\) consists of nodes, each representing a website, and the set \(E\) contains edges \((p_i, p_j)\), which represent a link from website \(p_i\) to website \(p_j\). We assume that websites cannot link to themselves, and therefore loops \((p_i,p_i)\) are not allowed. In this model each edge is directed, meaning that a link from \(p_i\) to \(p_j\) does not imply a reverse link. The in-degree, \(\deg^-(p_j)\), corresponds to the number of links arriving at node \(p_j\), and the out-degree, \(\deg^+(p_i)\), represents the number of links leaving node \(p_i\).
Note
Nodes are also called vertices or points.
Edges are also called links or lines.
A directed graph is also called a digraph.
A closed path is called a cycle (defined below).
Exercise 1 (*)#
Given \(V_1=\{p_1,p_2,p_3,p_4,p_5\}\) and \(E_1=\{(p_1,p_3),(p_1,p_5),(p_2,p_4),(p_2,p_5),(p_3,p_1),(p_4,p_1),(p_4,p_2),(p_4,p_5)\}\), draw the graph for the network \(W_1=(V_1,E_1)\) and state \(\deg^-(p_i)\) for \(i=1,2,\dots,5\).
Exercise 2 (*)#
Given \(V_2=\{P1,P2,P3,P4,P5,P6\}\) and \(E_2=\{(P1,P2),(P2,P3),(P3,P1),(P4,P5),(P5,P6),(P6,P4)\}\), draw the graph for the network \(W_2=(V_2,E_2)\).
The graph for the network \(W_2\) consists of two components, both of which are 3-cycles. A component in a graph is a subset of the graph’s nodes and edges, where there is a path between any two nodes in the subset, and there are no edges connecting nodes in the subset with nodes outside the subset. A cycle is a path that starts and ends at the same node, and does not repeat any edges or nodes in between except the start and end node.
Exercise 3#
Discuss how the number of components in a network affects the websites’ visibility and accessibility for a user.
In the project we will use dictionaries in Python to represent the network \(W\) of websites. This data structure makes it possible to assign each page a set of the pages it links to. The dictionary’s keys represent the pages in \(W\), and the value for a given key \(k\) is the set of pages that page \(p_k\) links to. A dictionary in Python thus gives us an efficient way to assign each website a set of other websites it points to. For example, we can define a simple directed graph as follows:
W = {'P1': {'P2', 'P3'}, 'P2': {'P3'}, 'P3': {'P1'}}
This model makes it easy to access and manipulate data, since we can quickly find which pages receive links from a given page, which is essential for the PageRank algorithm, because the link structure forms the basis for determining the pages’ relative importance.
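To see how this representation is used in practice, here is a small, self-contained sketch (the variable names are our own) that reads off out-links and computes in-degrees from such a dictionary:

```python
W = {'P1': {'P2', 'P3'}, 'P2': {'P3'}, 'P3': {'P1'}}

# The out-links of a page are read directly from the dictionary.
out_degree = len(W['P1'])          # deg^+(P1): P1 links to two pages

# In-degrees require a scan over all value sets.
in_degree = {p: 0 for p in W}
for links in W.values():
    for q in links:
        in_degree[q] += 1

print(out_degree)   # 2
print(in_degree)    # {'P1': 1, 'P2': 1, 'P3': 2}
```

Note that looking up out-links is a constant-time dictionary access, while finding in-links requires a full scan over all values.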
Exercise 4 (*)#
Write the networks \(W_1\) and \(W_2\) as Python code using a dictionary. How do you correctly specify that a website does not link to other websites?
Hint
Consider what datatype the values for the keys are.
In the project we need to test different implementations of the PageRank algorithm on different networks, and therefore we are interested in being able to generate arbitrary networks of varying size.
Exercise 5#
Write a Python function that creates a dictionary of `n` websites, where each website links to a random number of websites. This number should lie between `kmin` and `k`.
import numpy as np

def make_web(n, k, kmin=0):
    # Input: n and k are non-negative integers
    # Output: web is a dictionary with n keys.
    # The value of each key is a set that is a subset of the keys.
    assert k < n, "k must be less than n (since one cannot link to oneself)"
    assert kmin <= k, "kmin must be less than or equal to k"
    keys = ...  # INSERT CODE HERE - define n keys from 0 to n-1
    web = dict()
    for j in keys:
        numlinks = ...  # INSERT CODE HERE - generate a random number between kmin and k
        web[j] = set()  # INSERT CODE HERE - choose numlinks links from the other pages;
                        # avoid the current page (j) and avoid duplicate links
# ADD YOUR CODE HERE
Hint
Use np.random.choice() to select random elements from an array.
Exercise 6 (*)#
Write a Python function that creates a graphical representation of a network, where links are shown with arrows.
import networkx as nx
import matplotlib.pyplot as plt

def visualize_graph(web):
    # Input: network as dictionary
    # Output: network visualized as a graph
    # INSERT CODE HERE
    plt.show()
# ADD YOUR CODE HERE
Hint
Use networkx and matplotlib.pyplot to visualize the network.
The Link Matrix and the Random Surfer Model#
Instead of using a graph or a dictionary to represent the network, we can also use an adjacency matrix to represent the graph. An adjacency matrix \(\pmb{B}\) is a square matrix, where each row and column corresponds to a website. If there is an edge from node \(p_j\) to node \(p_i\), then the element \(B_{ij}\) in row \(i\) and column \(j\) of the matrix will be 1, otherwise it will be 0:
\[\begin{equation*} B_{ij} = \begin{cases} 1 & \text{if } (p_j, p_i) \in E, \\ 0 & \text{otherwise.} \end{cases} \end{equation*}\]
The adjacency matrix provides an efficient way to represent directed graphs, since we can quickly check whether there is a link between two pages by checking the corresponding element in the matrix.
Representing the network as a matrix can be especially advantageous when the network is large. The matrix form enables fast computations using matrix operations, which is important for efficient analysis of the network structure.
Exercise 7 (*)#
Find the adjacency matrix for the network \(W_1\).
If column \(j\) is the zero vector in the adjacency matrix, website \(j\) is called a sink. A sink is a website that does not link to other websites.
We are now ready to introduce the idea behind PageRank using the random surfer model. In the random surfer model, we imagine a user who starts on an arbitrary website and then randomly chooses to follow one of the outgoing links from this page (you click a random link). The model describes how a surfer who keeps choosing links at random will end up on a certain page with a probability that depends on how the links between the websites in the network are structured. This probability, which the page eventually “gets” as the number of clicks goes to infinity, is called the page’s PageRank. In what follows, this idea is formalized by viewing the time evolution as a discrete Markov chain.
A Markov chain is a model that describes a system where future states depend only on the current state and not on how the system got there. This means that the probability of reaching the next website depends solely on the current website.
In a Markov chain, the system is divided into a set of states, and transitions between these states occur with a probability determined by a transition matrix. Since the adjacency matrix is a binary matrix that only indicates whether there is a link from one website to another, we define a link matrix \(\pmb{L}\), which describes the probability of choosing to surf from one website to another:
\[\begin{equation*} L_{ij} = \begin{cases} \dfrac{1}{N_j} & \text{if } (p_j, p_i) \in E, \\[1ex] \dfrac{1}{N} & \text{if } p_j \text{ is a sink (i.e. } N_j = 0\text{)}, \\[1ex] 0 & \text{otherwise.} \end{cases} \end{equation*}\]
Here \(N_j=\deg^+(p_j)\) is the number of outgoing links from page \(p_j\), and \(N=|V|\) is the total number of pages in the network.
The rule \(L_{ij}=\frac{1}{N}\) when \(p_j\) is a sink may look complicated at first glance, but its practical meaning is simple: when a website has no outgoing links, the surfer jumps to every page in the network with equal probability. This ensures that the Markov chain does not “get stuck” (in fancier terms: every column of the link matrix remains a probability vector).
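As an illustration of this definition, here is a minimal sketch (the helper name `link_matrix` and the fixed page ordering are our own choices) that builds \(\pmb{L}\) column by column from a dictionary network, including the sink rule:

```python
import numpy as np

def link_matrix(web):
    # Column j of L is the probability distribution for one click from page j:
    # 1/N_j on each page that j links to, or 1/N everywhere if j is a sink.
    pages = sorted(web)
    N = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    L = np.zeros((N, N))
    for j, p in enumerate(pages):
        links = web[p]
        if links:                  # page p links to N_j = len(links) pages
            for q in links:
                L[index[q], j] = 1 / len(links)
        else:                      # sink: equal probability on all pages
            L[:, j] = 1 / N
    return L, pages

# Small example with a sink: P3 links to nothing.
L, pages = link_matrix({'P1': {'P2', 'P3'}, 'P2': {'P3'}, 'P3': set()})
print(L)
```

Every column sums to 1, so each column is a probability vector, as required.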
Exercise 8 (*)#
Determine the link matrix for \(W_1\). Compare with the adjacency matrix in Exercise 7 (*).
- Definition 1: Probability vector #
A real row or column vector is a probability vector (or probability distribution) if its elements are non-negative and sum to 1.
For example, the constant column vector \(\pmb{x}_0 = [1/N, 1/N, \dots, 1/N]^T \in \mathbb{R}^N\) is a probability vector. In the random surfer model we unsurprisingly deal with probabilities. Thus the probability vector \(\pmb{x} = [1,0,0, \dots, 0]^T \in \mathbb{R}^N\) models that you are at website \(p_1\) with \(100\%\) certainty. It gets more complicated with e.g. \(\pmb{x} = [1/2,0,1/2, 0,0, \dots, 0]^T \in \mathbb{R}^N\), which models that you are at website \(p_1\) with \(50\%\) probability and at \(p_3\) with \(50\%\) probability, while all other websites have probability zero.
Let us now try to understand the definition of the link matrix \(\pmb{L}\). Assume that \(\pmb{x} = [1,0,0, \dots, 0]^T \in \mathbb{R}^N\) is the initial state - so we start on page \(p_1\). Multiplying by \(\pmb{L}\) from the left, i.e. forming \(\pmb{L} \pmb{x}\), gives the distribution after one click. If \(p_1\) links to \(p_2\) and \(p_3\), then \(\pmb{L} \pmb{x} = [0,1/2,1/2,0, \dots, 0]^T\) (feel free to check!). The vector \(\pmb{L} \pmb{x}\) thus models one click from the initial state \(\pmb{x}\). Similarly, \(\pmb{L}(\pmb{L} \pmb{x}) = \pmb{L}^2 \pmb{x}\) models two clicks from the initial state \(\pmb{x}\). In general, \(\pmb{L}^t \pmb{x}\) will be the state after \(t \in \mathbb{N}\) clicks.
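This click dynamics can be tried numerically. Below, \(\pmb{L}\) is the hand-built link matrix of a small 3-page example of our own (P1 links to P2 and P3, P2 links to P3, P3 links to P1):

```python
import numpy as np

# Link matrix: column j is the click distribution from page j.
L = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])

x = np.array([1.0, 0.0, 0.0])   # start on P1 with certainty
print(L @ x)                    # one click: [0, 0.5, 0.5]

# After many clicks the state approaches a fixed distribution.
for _ in range(1000):
    x = L @ x
print(x)                        # approximately [0.4, 0.2, 0.4]
```

For this particular network the iteration converges; as the text notes below, that is not guaranteed for every network.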
- Definition 2: PageRank #
For a network of websites, the PageRank \(PR(p_i)\) of a page \(p_i\) is precisely the probability of being at \(p_i\) as time goes to infinity. Time is viewed as a discrete quantity \(t=1,2,3,\dots\) and \(PR(p_i)\) is the \(i\)’th element of \(\lim_{t \to \infty} \pmb{L}^t \pmb{x}_0\), where \(\pmb{x}_0\) is an initial state (a probability vector, often the constant vector).
Caution
There is a “big” problem with this definition: For some networks, \(\pmb{L}^t \pmb{x}_0\) does not converge as \(t \to \infty\). We will later introduce damping, which solves this problem.
We can compute the PageRank of a website from the random surfer model in the following way:
- Random surfer model#
Consider a network \(W=(V,E)\) with \(N\) websites (\(|V|=N\)). Assume that website \(p_k\) links to \(N_k\) pages \(\{p_{l_1}, \dots, p_{l_{N_k}}\}\subseteq V\). Now perform the following simulation \(j\) times:
Choose an arbitrary page from \(V\) with equal probability \(\frac{1}{N}\).
At all subsequent steps: Assume that the surfer is on page \(p\):
If website \(p\) is a sink, then choose an arbitrary page from \(V\).
If page \(p\) has outgoing links to \(\{p_{l_1}, \dots, p_{l_{N_k}}\}\), then choose a random website among \(\{p_{l_1}, \dots, p_{l_{N_k}}\}\).
Each time a page is visited this is counted, so that \(M_k\) is the number of times \(p_k\) is visited by the surfer. If the sample size \(j\) is large enough, then \(\frac{M_k}{j}\) will represent the probability of randomly ending up on page \(p_k\), and thus \(\frac{M_k}{j}\) approximates the PageRank for \(p_k\) for sufficiently large \(j\).
Note
This method of finding a PageRank can be viewed as a Markov chain Monte Carlo simulation.
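The simulation above can be sketched in a few lines (without damping; the function name `surf_simulation` is our own, and this is not the decomposition into `surf_step`/`random_surf` asked for in the exercises below):

```python
import random

def surf_simulation(web, steps):
    # Counts visits during one long random surf and normalizes the counts.
    pages = list(web)
    visits = {p: 0 for p in pages}
    page = random.choice(pages)              # start on an arbitrary page
    for _ in range(steps):
        visits[page] += 1
        links = web[page]
        # A sink sends the surfer to an arbitrary page; otherwise follow a random link.
        page = random.choice(list(links)) if links else random.choice(pages)
    return {p: visits[p] / steps for p in pages}

random.seed(0)
pr = surf_simulation({'P1': {'P2', 'P3'}, 'P2': {'P3'}, 'P3': {'P1'}}, 10_000)
print(pr)
```

The returned values sum to 1 and approximate the PageRanks for this small network.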
Exercise 9#
Write a Python function that simulates one step of the random surfer model. The function should take a network of pages (represented as a dictionary) and a starting page as input and return a probability distribution for which pages the surfer’s next click will lead to.
def surf_step(web, page):
    # Input: a network as a dictionary and a start page
    # Output: probability distribution as a dictionary for the next website
    distribution = dict()
    # INSERT CODE HERE
    return distribution
# ADD YOUR CODE HERE
Exercise 10#
Write a Python function that computes PageRank values for each page by simulating a random surfer. The function should call `surf_step` to perform the simulation.
def random_surf(web, n):
    # Input: a network as a dictionary and the number of steps in the random surf simulation
    # Output: PageRank values for each page as a dictionary
    ranking = dict()
    # INSERT CODE HERE
    return ranking
# ADD YOUR CODE HERE
Exercise 11#
Use the `random_surf` function with 100, 1000 and 10000 iterations to determine the PageRank for the networks \(W_1\) and \(W_2\). Assess whether the function gives a reliable measurement of PageRank for the given networks. It may be useful here to print the output of `random_surf(web, n)` for e.g. \(n=100, 101, 102, 103, 104, 105\) and \(n=10000, 10001, 10002, 10003, 10004, 10005\).
for n in range(1000, 1010, 1):
    print(random_surf(W1, n))

for n in range(1000, 1010, 1):
    print(random_surf(W2, n))
# ADD YOUR CODE HERE
To avoid the surfer being trapped in a component, and the PageRank thus becoming 0 for all websites in other components, we add a damping factor \(d\in [0,1]\) to the random surfer model. The damping factor is a parameter in the random surfer model that simulates the probability that a surfer will choose a random page instead of following a link. We will later in the project investigate the choice of \(d\). Until then we set \(d=0.85\).
- Random surfer model with damping#
Consider a network \(W=(V,E)\) with \(N\) websites (\(|V|=N\)). Assume that website \(p_k\) links to \(N_k\) pages \(\{p_{l_1}, \dots, p_{l_{N_k}}\}\subseteq V\). Now perform the following simulation \(j\) times:
Choose an arbitrary page from \(V\) with equal probability \(\frac{1}{N}\).
At all subsequent steps: Assume that the surfer is on page \(p\): If website \(p\) is a sink, then choose an arbitrary page from \(V\). If page \(p\) has outgoing links to \(\{p_{l_1}, \dots, p_{l_{N_k}}\}\), then with probability \(d\) choose a random website among \(\{p_{l_1}, \dots, p_{l_{N_k}}\}\), and with probability \(1-d\) choose a random page from \(V\).
Each time a page is visited this is counted, so that \(M_k\) is the number of times \(p_k\) is visited by the surfer. If the sample size \(j\) is large enough, then \(\frac{M_k}{j}\) will represent the probability of randomly ending up on page \(p_k\), and thus \(\frac{M_k}{j}\) approximates the PageRank for \(p_k\) for sufficiently large \(j\).
Exercise 12#
Modify `surf_step` and `random_surf` so that they follow the random surfer model with damping.
def surf_step_damp(web, page, d):
    # Input: a network as a dictionary, a start page and a damping factor
    # Output: probability distribution as a dictionary for the next website
    distribution = dict()
    # INSERT CODE HERE
    return distribution

def random_surf_damp(web, n, d):
    # Input: a network as a dictionary, the number of steps in the random surf simulation and a damping factor
    # Output: PageRank values for each page as a dictionary
    ranking = dict()
    # INSERT CODE HERE
    return ranking
# ADD YOUR CODE HERE
Exercise 13#
Use the `random_surf_damp` function with 100, 1000 and 10000 iterations to determine the PageRank for the networks \(W_1\) and \(W_2\). Compare the result with Exercise 11.
for n in range(1000, 1010, 1):
    print(random_surf_damp(W1, n, 0.85))

for n in range(1000, 1010, 1):
    print(random_surf_damp(W2, n, 0.85))
# ADD YOUR CODE HERE
Recursive Model and Matrix Formulation#
From the previous exercises, we can now see that the random surfer model with damping actually gives reasonable PageRanks for a set of websites. However, there are two significant problems with it: the randomness element means that you do not necessarily get the same PageRank every time you run it, and it is computationally heavy in part because many iterations may be required. To avoid these problems, one can look at a recursive model. First, all PageRanks of the websites are set to be \(PR_1(p) = \frac{1}{N}\), where \(N\) is the number of pages. For any page \(p\), we can get to \(p\) in two ways:
randomly, i.e. with probability \((1-d)\frac{1}{N}\)
through a link from another page \(q\). Then we must first be at \(q\), and then choose \(p\) among all the pages that \(q\) links to. If a page does not link to \(p\), the probability of going to \(p\) via that link is of course \(0\). With these considerations, the PageRank of \(p\) can be updated recursively by
\[\begin{equation*} PR_{n+1}(p) = \frac{1-d}{N} + d \sum_{q \to p} \frac{PR_n(q)}{N_q}, \end{equation*}\]
where the sum runs over all pages \(q\) that link to \(p\), and \(N_q\) is the number of outgoing links from \(q\) (sinks are treated as if they link to all pages).
Exercise 14#
Explain what each term in the above formula for the recursive version means.
Hint
Examine the addition principle and the multiplication principle.
Exercise 15#
Implement `rank_update` and `recursive_PageRank` and run on the networks \(W_1\) and \(W_2\) with `d=0.85`.
def rank_update(web, PageRanks, page, d):
    """
    Updates the value of PageRank for a page based on the recursive formula.
    Pages without outgoing links (sinks) are treated as if they link to all pages on the web.

    Input:
        web and PageRanks are dictionaries as in the output from "make_web" and "random_surf",
        page is the key of the page whose rank we want to update, and
        d is the damping factor.
    Output:
        PageRanks is updated according to the above formula,
        and this function returns a float "increment", the (absolute) difference
        between the previous value and the updated value of PR(p).
    """
    return increment

def recursive_PageRank(web, stopvalue=0.0001, max_iterations=200, d=0.85):
    """
    Implements the recursive version of the PageRank algorithm by first assigning
    a PageRank of 1/N to all pages (where N is the total number of pages)
    and then applying "rank_update" repeatedly until one of two stopping conditions
    is met:
        stopping condition 1: the maximum change from step n to step (n+1) across all
        PageRanks is less than the stop value,
        stopping condition 2: the number of iterations has reached "max_iterations".

    Input: web is a dictionary as in the output of "make_web", d is the damping factor,
    stopvalue is a positive float, max_iterations is a positive integer.
    """
    PageRanks = dict()
    return PageRanks, iteration
# ADD YOUR CODE HERE
In the recursive model, the recursive formula can be written compactly using the already defined link matrix \(\pmb{L}\). Let
\[\begin{equation*} \pmb{x}_n = [PR_n(p_1), PR_n(p_2), \dots, PR_n(p_N)]^T \end{equation*}\]
and
\[\begin{equation*} \pmb{e} = [1, 1, \dots, 1]^T \in \mathbb{R}^N, \end{equation*}\]
then the recursive formula
\[\begin{equation*} PR_{n+1}(p_i) = \frac{1-d}{N} + d \sum_{j=1}^{N} L_{ij}\, PR_n(p_j) \tag{1} \end{equation*}\]
can be written in matrix form as
\[\begin{equation*} \pmb{x}_{n+1} = \frac{1-d}{N}\,\pmb{e} + d\, \pmb{L}\, \pmb{x}_n. \tag{2} \end{equation*}\]
Here \(\pmb{e}\) is an \(N\times 1\) vector of ones, \(d\) is the damping factor, and \(\pmb{L}\) is the link matrix that describes the probabilities of following links from one page to another.
Exercise 16#
Argue that (1) can be rewritten as (2).
Assume that \(\pmb{x}\) is a probability vector, i.e. that its elements sum to 1. Show that (2) can be written as
\[\begin{equation*} \pmb{x}_{n+1} = \pmb{M}_d \, \pmb{x}_n, \end{equation*}\]where
\[\begin{equation*} \pmb{M}_d = \frac{1 - d}{N} \, \pmb{E_N} + d \, \pmb{L}, \end{equation*}\]and \(\pmb{E_N}\) is an \(N \times N\) matrix consisting solely of ones.
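A quick numerical sanity check of this construction (on a small hand-built link matrix of our own): every column of \(\pmb{M}_d\) should again sum to 1, since each column of \(\pmb{L}\) does:

```python
import numpy as np

d, N = 0.85, 3
L = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])   # a valid link matrix: columns sum to 1

M_d = (1 - d) / N * np.ones((N, N)) + d * L
print(M_d.sum(axis=0))           # each column sums to 1
```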
The matrix \(\pmb{M}_d\) is called the modified link matrix. If we assume that everything goes well, i.e. that \(\pmb{x}_n\) at some point tends to something stable for larger and larger \(n\), we can conclude that the PageRank we are looking for is precisely the \(\pmb{x}\) that is a probability vector and satisfies
\[\begin{equation*} \pmb{M}_d\, \pmb{x} = \pmb{x}. \end{equation*}\]
The above equation expresses that the PageRank vector \(\pmb{x}\) is an eigenvector for \(\pmb{M}_d\) with eigenvalue 1. If the damping factor \(d\) is not zero, there is (up to rescaling) exactly one eigenvector for this matrix with eigenvalue 1.
In the project we will show that this is always true when \(0<d<1\), and that there therefore exists a unique solution for the (damped) PageRank algorithm. The matrix-based formulation thus gives us yet another way to compute PageRank:
Find an eigenvector corresponding to the eigenvalue one, and scale the vector so that its elements sum to one.
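On a small example this recipe can be carried out directly with `np.linalg.eig` (the 3-page link matrix below is our own illustration, not the general implementation asked for in Exercise 24):

```python
import numpy as np

d, N = 0.85, 3
L = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
M_d = (1 - d) / N * np.ones((N, N)) + d * L

eigvals, eigvecs = np.linalg.eig(M_d)
i = np.argmin(np.abs(eigvals - 1))   # index of the eigenvalue closest to 1
v = np.real(eigvecs[:, i])
x = v / v.sum()                      # scale so that the elements sum to 1
print(x)
```

The normalization `v / v.sum()` also fixes the overall sign, so the result is a probability vector with strictly positive entries.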
Exercise 17#
Implement `modified_link_matrix` and test on the networks \(W_1\) and \(W_2\) with `d=0.85`.
def modified_link_matrix(web, pagelist, d=0.85):
    # Input: web (dictionary), pagelist (list of keys), d (damping factor)
    # Output: d*A^T + (1-d)*E/N
    # A: NxN numpy array, where row j has non-zero elements in the columns that page j links to.
    # If page j does not link to anything, all elements in row j get the value 1/N.
    # E: np.ones([N,N])
    # INSERT CODE HERE
# ADD YOUR CODE HERE
Exercise 18#
Show that the sum of each column in \(\pmb{M}_d\) is equal to 1.
The Markov Matrix: Properties and Damping in PageRank#
We want to show that the damped PageRank algorithm converges to a unique probability distribution, as we claimed in the previous section. To show this we will need to introduce the Markov matrix:
- Definition 3: Markov matrix#
A Markov matrix \(\pmb{A} = [a_{ij}] \in \mathbb{R}^{n\times n}\) is an \(n \times n\) real matrix with non-negative elements, whose rows sum to 1:
\[\begin{equation*} a_{ij} \ge 0 \quad \forall i, j \end{equation*}\]and
\[\begin{equation*} \sum_{j=1}^n a_{ij} = 1 \quad \forall i. \end{equation*}\]
Note
A Markov matrix can be seen as a “click matrix”, since row \(i\) tells how the probability is distributed to the next step if you are in state \(i\). Therefore all entries are \(\ge 0\), and each row sums to 1.
For example,
\[\begin{equation*} \begin{bmatrix} 0.3 & 0.7 \\ 0.5 & 0.5 \end{bmatrix} \end{equation*}\]
is a Markov matrix, but
\[\begin{equation*} \begin{bmatrix} 0.3 & 0.8 \\ 0.5 & 0.5 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} -0.2 & 1.2 \\ 0.5 & 0.5 \end{bmatrix} \end{equation*}\]
are not (the first has a row that does not sum to 1, the second has a negative element).
If \(\pmb{A}\) is a Markov matrix, then combining the above two conditions we must have that
\[\begin{equation*} 0 \le a_{ij} \le 1 \end{equation*}\]
for all \(i,j\).
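The "rows sum to 1" property can be checked numerically; note in particular that \(\pmb{A}\pmb{e} = \pmb{e}\) for \(\pmb{e}=[1,1,\dots,1]^T\), since each entry of \(\pmb{A}\pmb{e}\) is exactly a row sum (the matrix below also appears in Exercise 26 (*)):

```python
import numpy as np

A = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2],
              [0.1, 0.7, 0.2]])   # non-negative, each row sums to 1

e = np.ones(3)
print(A @ e)                      # [1. 1. 1.]
```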
Exercise 19#
Argue that \(\pmb{L}^T\) is a Markov matrix.
Exercise 20#
Show that if \(\pmb{A}\) is a Markov matrix then \(\pmb{A e} = \pmb{e}\), that is, that \((1,\pmb{e})\) is an eigenpair for \(\pmb{A}\).
Exercise 21#
Prove that a matrix \(\pmb{A}\) and its transposed matrix \(\pmb{A}^T\) have the same eigenvalues.
Then show that they do not necessarily have the same eigenvectors (that is: give an example of a matrix \(\pmb{A}\) where \(\pmb{A}\) and \(\pmb{A}^T\) have different eigenspaces).
- Definition 4: Stationary distribution#
If \(\pmb{A}\) is a Markov matrix with strictly positive elements, then the unique probability vector \(\pmb{x}\) that satisfies
\[\begin{equation*} \pmb{A}^T \pmb{x} = \pmb{x}, \end{equation*} \]is called the stationary distribution for \(\pmb{A}\).
Note that the Markov matrix can have elements with value 0. We will later in the project show how we can guarantee that \(\pmb{A}\) has strictly positive elements. Since \(\pmb{L}=\pmb{A}^T\) and \(\pmb{A}\) have the same eigenvalues and not necessarily the same eigenvectors, we want to determine the PageRank by finding an eigenvector \(\pmb{x}\) that satisfies the equation
\[\begin{equation*} \pmb{L}\, \pmb{x} = \pmb{x}. \end{equation*}\]
However, we are interested in working with the modified link matrix when determining PageRank, as described earlier. We claimed that there exists a probability vector \(\pmb{x}\) that satisfies \(\pmb{M}_d\pmb{x}=\pmb{x}\). To be able to show this, we must first construct the corresponding modified Markov matrix:
Let \(0 < d < 1\). For \(\pmb{A} \in \mathbb{R}^{n \times n}\), set
\[\begin{equation*} \pmb{A}_d = \frac{1 - d}{n}\, \pmb{E}_n + d\, \pmb{A}, \end{equation*}\]
where \(\pmb{E}_n\) is the \(n \times n\) matrix consisting solely of ones.
Exercise 22#
Let \(\pmb{A}\) be a Markov matrix, and let \(0 < d < 1\). Prove that \(\pmb{A}_d\) is also a Markov matrix, and that \(\pmb{A}_d\) has strictly positive elements, i.e. (\(\pmb{A}_d)_{ij} > 0\) for all \(i\) and \(j\).
Exercise 23#
We previously assumed that the modified link matrix \(\pmb{M}_d\) had eigenvalue 1 for \(0<d<1\). Show this.
Since we have shown that \(\pmb{M}_d\) has eigenvalue 1 for \(0<d<1\), we know that there exists an eigenvector \(\pmb{x}\) that satisfies \(\pmb{M}_d\pmb{x}=\pmb{x}\). We will later show that this is the unique solution to the damped PageRank algorithm.
Exercise 24#
Implement a Python function that, given a network, calls `modified_link_matrix`, finds an eigenvector corresponding to the eigenvalue 1, and turns it into a probability vector.
def eigenvector_PageRank(web, d=0.85):
    # Input: web is a dictionary of web pages and links.
    # d is the damping factor.
    # Output: a dictionary with the same keys as web, whose values are the PageRanks of the keys
    ranking = dict()
    # INSERT CODE HERE
    return ranking
# ADD YOUR CODE HERE
Exercise 25#
Computing eigenvectors for large matrices can be time-consuming. Create a network with the `make_web` function, where `n=5000` and `k=10`, and find the PageRank using `eigenvector_PageRank`. Investigate and comment on the runtime.
Hint
Use the time package to determine the runtime.
web = make_web(5000,10,0)
eigenvector_PageRank(web)
# ADD YOUR CODE HERE
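A sketch of the timing pattern suggested by the hint, with a stand-in computation in place of the `eigenvector_PageRank` call:

```python
import time

import numpy as np

start = time.perf_counter()
np.linalg.eig(np.random.rand(500, 500))   # stand-in for the expensive call
elapsed = time.perf_counter() - start
print(f"runtime: {elapsed:.3f} s")
```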
The PageRank Algorithm: Iterative Convergence and Damping in Markov Matrices#
Since it is computationally heavy to find eigenvalues and eigenvectors for large matrices, we would like to find another way to find eigenvectors with corresponding eigenvalue 1. Fortunately, there is a method to do this with Markov matrices, and that is what this section builds toward. We want to arrive at the fact that for a special class of Markov matrices we can raise it to a power, and as the exponent goes to infinity each column in the matrix becomes precisely a reasonable PageRank for the network of websites that the Markov matrix represents.
Exercise 26 (*)#
Compute the product of the following Markov matrices. Verify that the resulting matrix is also a Markov matrix.
\[\begin{equation*} A = \begin{bmatrix} 0.2 & 0.5 & 0.3 \\ 0.4 & 0.4 & 0.2 \\ 0.1 & 0.7 & 0.2 \end{bmatrix}, \quad B = \begin{bmatrix} 0.5 & 0.3 & 0.2 \\ 0.2 & 0.5 & 0.3 \\ 0.3 & 0.2 & 0.5 \end{bmatrix} \end{equation*}\]
That the product of two Markov matrices is a Markov matrix is a useful property that we will need later.
Exercise 27#
Show that the product of two Markov matrices is a Markov matrix.
Exercise 28#
Conclude that if \(\pmb{A}\) is a Markov matrix, then \(\pmb{A}^k\) is also a Markov matrix for all \(k\in \mathbb{N}\).
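Numerically, powers of a Markov matrix can be checked to stay Markov (using the matrix \(A\) from Exercise 26 (*)):

```python
import numpy as np

A = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2],
              [0.1, 0.7, 0.2]])

for k in range(1, 6):
    Ak = np.linalg.matrix_power(A, k)
    print(k, Ak.sum(axis=1))       # row sums stay [1, 1, 1]
    assert (Ak >= 0).all()
```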
We will also need that no eigenvalue of a Markov matrix has absolute value greater than 1. Another way to say this is that the spectral radius \(\operatorname{rad}_{\pmb{A}}\) of a Markov matrix \(\pmb{A}\) is equal to 1:
\[\begin{equation*} \operatorname{rad}_{\pmb{A}} = \max\{\, |\lambda| : \lambda \text{ is an eigenvalue of } \pmb{A} \,\} = 1. \end{equation*}\]
We will later see how this helps ensure that powers of the special class of Markov matrices converge.
Exercise 29#
Let \(\pmb{A}\) be an \(n\times n\) Markov matrix. Show that \(\operatorname{rad}_{\pmb{A}} = 1\).
Hint
Assume that \(\pmb{A}\) has an eigenvalue \(\lambda\) with \(|\lambda|>1\). Let \(\pmb{v}\) be the corresponding eigenvector. For a Markov matrix, the sum of each row is exactly 1, so \(\pmb{A}\pmb{v} = \lambda \pmb{v}\) can be seen as \(n\) (different) weighted averages of \(\pmb{v}_1, \pmb{v}_2, \ldots, \pmb{v}_n\). How large can the absolute value of a weighted average of \(n\) elements be at most?
Exercise 30 (*)#
Consider the following matrix:
Is it a Markov matrix? Find its eigenvalues and the modulus of the eigenvalues. Find \(\pmb{A}^2, \pmb{A}^3\). Does the matrix \(\pmb{A}^k\) converge for \(k\to \infty\)?
Now we consider the special class of Markov matrices mentioned before, namely Markov matrices with strictly positive elements - yet another reason why the damping factor is an important player in the PageRank algorithm. To ensure convergence of powers of Markov matrices with strictly positive elements, we must ensure that \(1\) is the only eigenvalue with modulus \(1\). For this we will need the triangle inequality:
- The triangle inequality#
For all complex numbers \(z\) and \(w\) we have
\[\begin{equation*} |z+w| \leq |z| + |w|, \end{equation*}\]with equality if and only if \(z\) and \(w\) are of the form \(z = kw, k\in \mathbb{R}_{\geq 0}\)
More generally it follows by induction that for a collection of \(n\) complex numbers \(z_1, \dots, z_n\) we have:
\(\left| \sum_{i=1}^n z_i \right| \leq \sum_{i=1}^n |z_i|\)
Equality occurs only if all the non-zero numbers \(z_i\) have the same principal argument.
Exercise 31#
If \(\pmb{A}\) is a Markov matrix with strictly positive elements (\(a_{ij} > 0\) \(\forall\) \(i,j\)), prove that there is exactly one eigenvalue \(\lambda\) with \(|\lambda| = 1\), namely \(\lambda = 1\), and moreover that the corresponding eigenspace is \(\mathcal{E}_1 = \hbox{span}\{ \pmb{e} \}\), where \(\pmb{e} = [1,1,\dots,1]^t\). All other eigenvalues have modulus less than 1.
Hint: The max norm of a vector \(\pmb{x} \in \mathbb{R}^n\) is defined as
\[\begin{equation*} \Vert \pmb{x} \Vert_{\infty} = \max_{1 \leq i \leq n} |x_i|, \end{equation*}\]
i.e. the largest element measured in absolute value, cf. Example 2.1.2 in the textbook.
Assume that \(\lambda\) is an eigenvalue with \(|\lambda|=1\), and that \(\pmb{v}\) is a corresponding eigenvector. Let \(k\) be an index such that \(\Vert\pmb{v}\Vert_{\infty} = |v_k|\). Then we can write:
\[\begin{equation*} \Vert\pmb{v}\Vert_{\infty} = |v_k| = |\lambda v_k| = \Big| \sum_{j=1}^n a_{kj} v_j \Big| \le \sum_{j=1}^n a_{kj} |v_j| \le \sum_{j=1}^n a_{kj} \Vert\pmb{v}\Vert_{\infty} = \Vert\pmb{v}\Vert_{\infty}. \end{equation*}\]
Provide justification for each step above.
Since the first and last terms in this chain are equal, there must be equality in each inequality. Use the first of these inequalities together with point (2) in the triangle inequality above, as well as the second of these inequalities (and the condition \(a_{ij}>0\)), to show that we must have \(\pmb{v} = C \pmb{e}\), where \(|C| = \Vert\pmb{v}\Vert_{\infty}\).
Exercise 32 (*)#
Take the matrix \(\pmb{A}\) from Exercise 30 (*). Dampen it with factor \(0.85\), call this matrix \(\pmb{A}_d\). Then find its eigenvalues, their modulus, and comment on the result. Does \(\pmb{A}_d^k\) converge for \(k \to \infty\)?
Hint
Diagonalize the matrix.
- Theorem 1#
Let \(n\geq 2\) and let \(\pmb{A}\) be an \(n \times n\) Markov matrix with strictly positive elements. Then there is a unique vector \(\pmb{x} \in \mathbb{R}^n\) such that \(\pmb{A}^T\pmb{x} = \pmb{x}\), all elements in \(\pmb{x}\) are strictly positive and sum to 1, and:
\[\begin{equation*} \lim_{k\to \infty} (\pmb{A}^T)^k = \pmb{x}\,\pmb{e}^T = [\pmb{x}, \pmb{x}, \ldots, \pmb{x}]. \end{equation*}\]
Exercise 33#
Explain why Theorem 1 is relevant for the PageRank algorithm.
Exercise 34#
Let \(\pmb{A}\) be an \(n \times n\) Markov matrix with strictly positive elements. Prove Theorem 1 under the assumption that \(\pmb{A}\) is diagonalizable.
Hint
If \(\pmb{A}\) is diagonalizable, then \(\pmb{A} = \pmb{V}\pmb{D}\pmb{V}^{-1}\) for a matrix \(\pmb{V}\) and a diagonal matrix \(\pmb{D}\). What is \(\pmb{A}^k\)?
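The hint can be tested numerically: if \(\pmb{A} = \pmb{V}\pmb{D}\pmb{V}^{-1}\), then \(\pmb{A}^k = \pmb{V}\pmb{D}^k\pmb{V}^{-1}\), so only the diagonal entries need to be raised to the \(k\)'th power (the matrix below is our own illustration):

```python
import numpy as np

A = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2],
              [0.1, 0.7, 0.2]])

# Diagonalize: A = V D V^{-1}, so A^k = V D^k V^{-1}.
eigvals, V = np.linalg.eig(A)

k = 20
Ak_diag = V @ np.diag(eigvals**k) @ np.linalg.inv(V)
Ak_direct = np.linalg.matrix_power(A, k)
print(np.allclose(Ak_diag, Ak_direct))   # True (up to numerical round-off)
```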
Exercise 35#
Write a Python function that, given a network, an exponent `power` and a damping factor `d`, calls `modified_link_matrix`, multiplies it by itself `power` times, and returns a PageRank for the network.
def matrix_PageRank(web, power, d=0.85):
    # Input: web is a dictionary with web pages and links.
    #        power is a positive integer, the number of matrix multiplications.
    #        d is a positive float, the damping constant.
    # Output: A dictionary with the same keys as web, whose values are the PageRank of each key.
    ranking = dict()
    # INSERT CODE HERE
    return ranking
# ADD YOUR CODE HERE
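One possible shape of such a function is sketched below. To keep the sketch self-contained, the damped link matrix is built inline; this stands in for the project's `modified_link_matrix` helper, which is assumed rather than shown here.

```python
import numpy as np

def matrix_PageRank(web, power, d=0.85):
    # web: dict mapping each page to the list of pages it links to.
    pages = list(web)
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}

    # Link matrix: each page distributes its rank evenly over its links.
    A = np.zeros((n, n))
    for p, links in web.items():
        if links:
            for q in links:
                A[idx[p], idx[q]] += 1.0 / len(links)
        else:
            A[idx[p], :] = 1.0 / n    # dangling page: treat as linking to all

    # Damped matrix A_d = d*A + (1-d)/n * E_n, then repeated multiplication.
    A_d = d * A + (1.0 - d) / n
    M = np.linalg.matrix_power(A_d.T, power)
    x = M @ (np.ones(n) / n)          # start from the uniform distribution
    return {p: float(x[idx[p]]) for p in pages}

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranking = matrix_PageRank(web, power=50)
```

Applying \((\pmb{A}_d^T)^{\texttt{power}}\) to the uniform start vector corresponds to `power` steps of the random-surfer iteration.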
Exercise 36 (**)#
We want to show that the convergence of the `matrix_PageRank` algorithm depends on the choice of the damping factor \(d\).
Show that if \(\pmb{A}\) has eigenvalues 1, \(\lambda_1\), \(\lambda_2\),…,\(\lambda_{n-1}\), then \(\pmb{A}_d\) has eigenvalues 1, \(d\lambda_1\), \(d\lambda_2\),…,\(d\lambda_{n-1}\).
Hint
First show that \(1\) is an eigenvalue for \(\pmb{A}_d\). Then consider \(\pmb{A}^T\). What is \(\pmb{e}^T\pmb{A}^T\)? Show that if \(\pmb{v}\) is an eigenvector for \(\pmb{A}^T\) with eigenvalue different from \(1\), then \(\pmb{v}\perp \pmb{e}\). What eigenvalues does \(\pmb{E}_n\) have? How do their eigenspaces relate to each other?
Then show that any eigenvector for \(\pmb{A}^T\) is also an eigenvector for \(\pmb{A}_d^T\). What are their corresponding eigenvalues?
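The claim can also be verified numerically on a small example matrix (chosen here purely for illustration):

```python
import numpy as np

A = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])     # Markov matrix: eigenvalues 1, (-1 ± i)/2
d = 0.85
n = A.shape[0]
A_d = d * A + (1.0 - d) / n * np.ones((n, n))

mods_A = np.sort(np.abs(np.linalg.eigvals(A)))
mods_Ad = np.sort(np.abs(np.linalg.eigvals(A_d)))
# The largest modulus stays 1; every other modulus is scaled by d.
```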
Exercise 37#
Discuss the choice of damping in the PageRank algorithm in terms of convergence and correctness. Consider the convergence if \(\pmb{L}\) is an \(n \times n\) matrix with eigenvalues \(\lambda_1=1\), \(\lambda_2=-0.999\), and \(|\lambda_k|<0.5\) for \(k=3,\ldots,n\). Include a geometric visualization and compare the convergence for \(\pmb{L}\) and \(\pmb{M}_d\). What happens if we replace \(\lambda_2=-0.999\) with \(\lambda_2=0.999\cdot i\)?
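To build intuition for this discussion: the error in the power iteration decays roughly like \(|\lambda_2|^k\), so the second-largest eigenvalue modulus governs the iteration count. A minimal sketch, using diagonal matrices as stand-ins (not Markov matrices) so the eigenvalues are explicit:

```python
import numpy as np

def iterations_to_converge(A, tol=1e-6, max_iter=100000):
    # Count power-iteration steps until successive iterates agree within tol.
    x = np.ones(A.shape[0]) / A.shape[0]
    for k in range(1, max_iter + 1):
        x_new = A @ x
        if np.linalg.norm(x_new - x, np.inf) < tol:
            return k
        x = x_new
    return max_iter

d = 0.85
L = np.diag([1.0, -0.999, 0.3])            # |lambda_2| = 0.999: very slow
M_d = np.diag([1.0, d * -0.999, d * 0.3])  # damping shrinks it to about 0.849

slow = iterations_to_converge(L)
fast = iterations_to_converge(M_d)
```

The damped matrix needs orders of magnitude fewer iterations, because \(0.849^k\) decays far faster than \(0.999^k\).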
Exercise 38#
Argue why we can allow ourselves to use the `matrix_PageRank` function to determine PageRank, even though it is an approximate algorithm.
Analysis of PageRank Models and Investigation of Damping#
Exercise 39#
Write a Python function that given a network and a ranking creates a graphical representation of the network, where arrows illustrate links between websites, and the sizes of the websites indicate the size of the PageRank. Test the function on \(W_1\) and \(W_2\).
import matplotlib.pyplot as plt

def plot_ranking(web, ranking, d=0.85):
    # Input: web and ranking are dictionaries, for example as output from the functions "make_web" and "random_surf".
    # Output: Graphical representation of the web structure, with links shown as arrows and PageRank visualized by the size of the websites.
    # INSERT CODE HERE
    plt.show()
# ADD YOUR CODE HERE
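A possible implementation using networkx (one of the project's listed packages); the scaling constant `20000` is an arbitrary choice made here for readable node sizes:

```python
import matplotlib.pyplot as plt
import networkx as nx

def plot_ranking(web, ranking, d=0.85):
    # Build a directed graph whose edges are the links of the network.
    G = nx.DiGraph()
    G.add_nodes_from(web)
    for page, links in web.items():
        for target in links:
            G.add_edge(page, target)
    # Node size proportional to PageRank, so important pages appear larger.
    sizes = [20000 * ranking[node] for node in G.nodes]
    pos = nx.spring_layout(G, seed=1)
    nx.draw(G, pos, with_labels=True, node_size=sizes, arrows=True)
    plt.show()
```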
Exercise 40#
Test the four PageRank algorithms on large networks created using the `make_web` function. To compare two different implementations you must compute PageRank for the same network to the same degree of precision and compare the times. Do the relative times change if higher or lower precision is required?
So far we have used `d=0.85`. Experiment with different values of the damping factor (e.g. 0.5, 0.75, 0.9) in `matrix_PageRank`. Run the function on the same network (use `make_web`) and measure how many iterations are required to achieve convergence, as well as how the final ranking of the pages changes. Discuss the choice of damping.
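For the timing comparison, a small helper like the following can be used (a generic sketch; the PageRank implementations and `make_web` networks to pass in are the ones developed earlier in the project):

```python
import time

def best_time(f, *args, repeats=5, **kwargs):
    # Best wall-clock time over several runs, to reduce measurement noise.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        f(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best
```

For example, `best_time(matrix_PageRank, web, 50)` versus the corresponding call for another implementation, on the same `web`.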
Jordan Normal Form: Generalization of Diagonalizability#
In Exercise 34 we showed the validity of the PageRank algorithm (`matrix_PageRank`) under the assumption that \(\pmb{A}\) is diagonalizable. This is not always the case, however, and we are therefore interested in showing that the PageRank algorithm converges even if \(\pmb{A}\) cannot be diagonalized. For this we will need the Jordan normal form:
Jordan normal form is a normal form for square matrices that generalizes diagonalizability. Let \(\pmb{A} \in \mathbb{C}^{n \times n}\) have the distinct eigenvalues \(\lambda_1, \lambda_2, \dots, \lambda_m\). Then there exists an invertible matrix \(\pmb{P}\) such that
\[\begin{equation*} \pmb{A} = \pmb{P}\pmb{J}\pmb{P}^{-1}, \end{equation*}\]
where \(\pmb{J}\), the Jordan normal form of \(\pmb{A}\), is a block diagonal matrix consisting of Jordan blocks. The full \(\pmb{J}\) can be written as
\[\begin{equation*} \pmb{J} = \begin{bmatrix} \pmb{J}_{k_1}(\lambda_1) & & & \\ & \pmb{J}_{k_2}(\lambda_2) & & \\ & & \ddots & \\ & & & \pmb{J}_{k_m}(\lambda_m) \end{bmatrix}, \end{equation*}\]
where \(\sum_{i=1}^m k_i = n\). Each Jordan block \(\pmb{J}_{k_i}(\lambda_i)\) has dimension \(k_i \times k_i\) and has the form
\[\begin{equation*} \pmb{J}_{k_i}(\lambda_i) = \begin{bmatrix} \lambda_i & 1 & & \\ & \lambda_i & \ddots & \\ & & \ddots & 1 \\ & & & \lambda_i \end{bmatrix}. \end{equation*}\]
Exercise 41 (**)#
Prove Theorem 1 using the Jordan normal form. You may therefore not assume that \(\pmb{A}\) is diagonalizable.
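The key computational fact for this exercise: powers of a Jordan block \(\pmb{J}_k(\lambda)\) with \(|\lambda|<1\) still tend to the zero matrix, even though the block is not diagonalizable. A quick numerical check, with an example block chosen here:

```python
import numpy as np

lam = 0.85
J = np.array([[lam, 1.0, 0.0],
              [0.0, lam, 1.0],
              [0.0, 0.0, lam]])        # the 3x3 Jordan block J_3(0.85)

Jk = np.linalg.matrix_power(J, 500)
# Entries of J^k grow like binom(k, j) * lam^(k-j); geometric decay wins.
```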