Google’s PageRank Algorithm#
About the project#
Duration: 20-50 hours (including typesetting final report in LaTeX)
Prerequisites: Linear algebra (matrices, eigenvalues, eigenvectors), basic graph theory, probability concepts, basic Python programming
Python packages: NumPy, matplotlib, networkx, scipy
Learning objectives: Understand Google’s PageRank algorithm, work with adjacency matrices, compute dominant eigenvalues, analyze network structures
Important
This project was created by Jesper Bækholm Rom Poulsen and Jonas Amtoft.
Based on the work of David Brander. With guidance from Jakob Lemvig.
Introduction#
When you search on Google, you are presented with a list of websites with the most relevant at the top, disregarding advertisements. How does Google’s search algorithm know how to rank these websites? One way to look at it is that a website that many others link to is a relevant website. This is the basic idea behind the PageRank algorithm, named after Google co-founder Larry Page, and exactly what this project is about. At Google, handling the internet’s enormous amount of information is a major interplay between several systems. The web crawlers, which are distributed across many machines, fetch and store pages in a large repository. These pages undergo a process where important words and links are extracted and organized, forming the basis for building a large network structure of websites connected via links. This linking, which among other things enables the computation of a website’s PageRank, helps ensure that relevant and significant pages can easily be found (http://infolab.stanford.edu/~backrub/google.html).
In this project you will investigate one of the mathematical models behind the PageRank algorithm and develop different methods to compute PageRank in Python.
Preparation#
The eigenvalue problem and diagonalization, chapter 11 (Mathematics 1a)
Diagonalizable matrices, section 2.7 (Mathematics 1b)
Project goals#
The main goal of the project is to create four different implementations of the PageRank algorithm, which for a given network can compute the PageRank values.
Note
You must write a coherent report without referring to the exercise numbers below. It is not expected that all exercises are answered. In Exercise 22 you should for example prove that all matrices of a certain form are so-called Markov matrices - if you cannot get through the general proof, you should instead illustrate the statement with some examples. You should let your final report be guided by your interests and ambitions.
Exercises marked with (*) can be omitted in the final report. Exercises with (**) are particularly challenging and are intended for interested students.
Graph-Theoretic Modeling of Networks#
Websites and their links to other websites can be represented as a network \(W\), which can be viewed as a directed graph \(W = (V, E)\). The set \(V\) consists of nodes, each representing a website, and the set \(E\) contains edges \((p_i, p_j)\), which represent a link from website \(p_i\) to website \(p_j\). We assume that websites cannot link to themselves, and therefore loops \((p_i,p_i)\) are not allowed. In this model each edge is directed, meaning that a link from \(p_i\) to \(p_j\) does not imply a reverse link. The in-degree, \(\deg^-(p_j)\), corresponds to the number of links arriving at node \(p_j\), and the out-degree, \(\deg^+(p_i)\), represents the number of links leaving node \(p_i\).
Note
Nodes are also called vertices or points.
Edges are also called links or lines.
A directed graph is also called a digraph.
A closed path is called a cycle (defined below).
Exercise 1 (*)#
Given \(V_1=\{p_1,p_2,p_3,p_4,p_5\}\) and \(E_1=\{(p_1,p_3),(p_1,p_5),(p_2,p_4),(p_2,p_5),(p_3,p_1),(p_4,p_1),(p_4,p_2),(p_4,p_5)\}\), draw the graph for the network \(W_1=(V_1,E_1)\) and state \(\deg^-(p_i)\) for \(i=1,2,\dots,5\).
Exercise 2 (*)#
Given \(V_2=\{P1,P2,P3,P4,P5,P6\}\) and \(E_2=\{(P1,P2),(P2,P3),(P3,P1),(P4,P5),(P5,P6),(P6,P4)\}\), draw the graph for the network \(W_2=(V_2,E_2)\).
The graph for the network \(W_2\) consists of two components, both of which are 3-cycles. A component in a graph is a subset of the graph’s nodes and edges, where there is a path between any two nodes in the subset, and there are no edges connecting nodes in the subset with nodes outside the subset. A cycle is a path that starts and ends at the same node, and does not repeat any edges or nodes in between except the start and end node.
Exercise 3#
Discuss how the number of components in a network affects the websites’ visibility and accessibility for a user.
In the project we will use dictionaries in Python to represent the network \(W\) of websites. This data structure makes it possible to assign each page a set of the pages it links to. The dictionary’s keys represent the pages in \(W\), and the value for a given key \(k\) is the set of pages that page \(p_k\) links to. A dictionary in Python thus gives us an efficient way to assign each website a set of other websites it points to. For example, we can define a simple directed graph as follows:
W = {'P1': {'P2', 'P3'}, 'P2': {'P3'}, 'P3': {'P1'}}
This model makes it easy to access and manipulate data, since we can quickly find which pages receive links from a given page, which is essential for the PageRank algorithm, because the link structure forms the basis for determining the pages’ relative importance.
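To see how this representation is used in practice, here is a small, self-contained sketch (the variable names are our own) that reads off out-links and computes in-degrees from such a dictionary:

```python
W = {'P1': {'P2', 'P3'}, 'P2': {'P3'}, 'P3': {'P1'}}

# The out-links of a page are read directly from the dictionary.
out_degree = len(W['P1'])          # deg^+(P1): P1 links to two pages

# In-degrees require a scan over all value sets.
in_degree = {p: 0 for p in W}
for links in W.values():
    for q in links:
        in_degree[q] += 1

print(out_degree)   # 2
print(in_degree)    # {'P1': 1, 'P2': 1, 'P3': 2}
```

Note that looking up out-links is a constant-time dictionary access, while finding in-links requires a full scan over all values.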
Exercise 4 (*)#
Write the networks \(W_1\) and \(W_2\) as Python code using a dictionary. How do you correctly specify that a website does not link to other websites?
Hint
Consider what datatype the values for the keys are.
In the project we need to test different implementations of the PageRank algorithm on different networks, and therefore we are interested in being able to generate arbitrary networks of varying size.
Exercise 5#
Write a Python function that creates a dictionary of `n` websites, where each website links to a random number of websites. This number should lie between `kmin` and `k`.
import numpy as np

def make_web(n, k, kmin=0):
    # Input: n and k are non-negative integers
    # Output: web is a dictionary with n keys.
    # The value of each key is a set that is a subset of the keys.
    assert k < n, "k must be less than n (since one cannot link to oneself)"
    assert kmin <= k, "kmin must be less than or equal to k"
    keys = ...  # INSERT CODE HERE - define n keys from 0 to n-1
    web = dict()
    for j in keys:
        numlinks = ...  # INSERT CODE HERE - generate a random number between kmin and k
        web[j] = set()  # INSERT CODE HERE - choose numlinks links from the other pages;
                        # avoid the current page (j) and avoid duplicate links
# ADD YOUR CODE HERE
Hint
Use np.random.choice() to select random elements from an array.
Exercise 6 (*)#
Write a Python function that creates a graphical representation of a network, where links are shown with arrows.
import networkx as nx
import matplotlib.pyplot as plt

def visualize_graph(web):
    # Input: network as dictionary
    # Output: network visualized as a graph
    # INSERT CODE HERE
    plt.show()
# ADD YOUR CODE HERE
Hint
Use networkx and matplotlib.pyplot to visualize the network.
The Link Matrix and the Random Surfer Model#
Instead of using a graph or a dictionary to represent the network, we can also use an adjacency matrix to represent the graph. An adjacency matrix \(\pmb{B}\) is a square matrix, where each row and column corresponds to a website. If there is an edge from node \(p_j\) to node \(p_i\), then the element \(B_{ij}\) in row \(i\) and column \(j\) of the matrix will be 1, otherwise it will be 0:
\[\begin{equation*} B_{ij} = \begin{cases} 1 & \text{if } (p_j, p_i) \in E, \\ 0 & \text{otherwise.} \end{cases} \end{equation*}\]
The adjacency matrix provides an efficient way to represent directed graphs, since we can quickly check whether there is a link between two pages by checking the corresponding element in the matrix.
Representing the network as a matrix can be especially advantageous when the network is large. The matrix form enables fast computations using matrix operations, which is important for efficient analysis of the network structure.
Exercise 7 (*)#
Find the adjacency matrix for the network \(W_1\).
If column \(j\) is the zero vector in the adjacency matrix, website \(j\) is called a sink. A sink is a website that does not link to other websites.
We are now ready to introduce the idea behind PageRank using the random surfer model. In the random surfer model, we imagine a user who starts on an arbitrary website and then randomly chooses to follow one of the outgoing links from this page (you click a random link). The model describes how a surfer who keeps choosing links at random will end up on a certain page with a probability that depends on how the links between the websites in the network are structured. This probability, which the page eventually “gets” as the number of clicks goes to infinity, is called the page’s PageRank. In what follows, this idea is formalized by viewing the time evolution as a discrete Markov chain.
A Markov chain is a model that describes a system where future states depend only on the current state and not on how the system got there. This means that the probability of reaching the next website depends solely on the current website.
In a Markov chain, the system is divided into a set of states, and transitions between these states occur with a probability determined by a transition matrix. Since the adjacency matrix is a binary matrix that only indicates whether there is a link from one website to another, we define a link matrix \(\pmb{L}\), which describes the probability of choosing to surf from one website to another:
\[\begin{equation*} L_{ij} = \begin{cases} \dfrac{1}{N_j} & \text{if } (p_j, p_i) \in E, \\[1ex] \dfrac{1}{N} & \text{if } p_j \text{ is a sink (i.e. } N_j = 0\text{)}, \\[1ex] 0 & \text{otherwise.} \end{cases} \end{equation*}\]
Here \(N_j=\deg^+(p_j)\) is the number of outgoing links from page \(p_j\), and \(N=|V|\) is the total number of pages in the network.
The rule \(L_{ij}=\frac{1}{N}\) when \(p_j\) is a sink may look complicated at first glance, but its practical meaning is simple: when a website has no outgoing links, the surfer jumps to every page in the network with equal probability. This ensures that the Markov chain does not “get stuck” (in fancier terms: every column of the link matrix remains a probability vector).
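As an illustration of this definition, here is a minimal sketch (the helper name `link_matrix` and the fixed page ordering are our own choices) that builds \(\pmb{L}\) column by column from a dictionary network, including the sink rule:

```python
import numpy as np

def link_matrix(web):
    # Column j of L is the probability distribution for one click from page j:
    # 1/N_j on each page that j links to, or 1/N everywhere if j is a sink.
    pages = sorted(web)
    N = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    L = np.zeros((N, N))
    for j, p in enumerate(pages):
        links = web[p]
        if links:                  # page p links to N_j = len(links) pages
            for q in links:
                L[index[q], j] = 1 / len(links)
        else:                      # sink: equal probability on all pages
            L[:, j] = 1 / N
    return L, pages

# Small example with a sink: P3 links to nothing.
L, pages = link_matrix({'P1': {'P2', 'P3'}, 'P2': {'P3'}, 'P3': set()})
print(L)
```

Every column sums to 1, so each column is a probability vector, as required.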
Exercise 8 (*)#
Determine the link matrix for \(W_1\). Compare with the adjacency matrix in Exercise 7 (*).
- Definition 1: Probability vector #
A real row or column vector is a probability vector (or probability distribution) if its elements are non-negative and sum to 1.
For example, the constant column vector \(\pmb{x}_0 = [1/N, 1/N, \dots, 1/N]^T \in \mathbb{R}^N\) is a probability vector. In the random surfer model we unsurprisingly deal with probabilities. Thus the probability vector \(\pmb{x} = [1,0,0, \dots, 0]^T \in \mathbb{R}^N\) models that you are at website \(p_1\) with \(100\%\) certainty. It gets more complicated with e.g. \(\pmb{x} = [1/2,0,1/2, 0,0, \dots, 0]^T \in \mathbb{R}^N\), which models that you are at website \(p_1\) with \(50\%\) probability and at \(p_3\) with \(50\%\) probability, while all other websites have probability zero.
Let us now try to understand the definition of the link matrix \(\pmb{L}\). Assume that \(\pmb{x} = [1,0,0, \dots, 0]^T \in \mathbb{R}^N\) is the initial state - so we start on page \(p_1\). Multiplying by \(\pmb{L}\) from the left, i.e. forming \(\pmb{L} \pmb{x}\), gives the distribution after one click. If \(p_1\) links to \(p_2\) and \(p_3\), then \(\pmb{L} \pmb{x} = [0,1/2,1/2,0, \dots, 0]^T\) (feel free to check!). The vector \(\pmb{L} \pmb{x}\) thus models one click from the initial state \(\pmb{x}\). Similarly, \(\pmb{L}(\pmb{L} \pmb{x}) = \pmb{L}^2 \pmb{x}\) models two clicks from the initial state \(\pmb{x}\). In general, \(\pmb{L}^t \pmb{x}\) will be the state after \(t \in \mathbb{N}\) clicks.
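This click dynamics can be tried numerically. Below, \(\pmb{L}\) is the hand-built link matrix of a small 3-page example of our own (P1 links to P2 and P3, P2 links to P3, P3 links to P1):

```python
import numpy as np

# Link matrix: column j is the click distribution from page j.
L = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])

x = np.array([1.0, 0.0, 0.0])   # start on P1 with certainty
print(L @ x)                    # one click: [0, 0.5, 0.5]

# After many clicks the state approaches a fixed distribution.
for _ in range(1000):
    x = L @ x
print(x)                        # approximately [0.4, 0.2, 0.4]
```

For this particular network the iteration converges; as the text notes below, that is not guaranteed for every network.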
- Definition 2: PageRank #
For a network of websites, the PageRank \(PR(p_i)\) of a page \(p_i\) is precisely the probability of being at \(p_i\) as time goes to infinity. Time is viewed as a discrete quantity \(t=1,2,3,\dots\) and \(PR(p_i)\) is the \(i\)’th element of \(\lim_{t \to \infty} \pmb{L}^t \pmb{x}_0\), where \(\pmb{x}_0\) is an initial state (a probability vector, often the constant vector).
Caution
There is a “big” problem with this definition: For some networks, \(\pmb{L}^t \pmb{x}_0\) does not converge as \(t \to \infty\). We will later introduce damping, which solves this problem.
We can compute the PageRank of a website from the random surfer model in the following way:
- Random surfer model#
Consider a network \(W=(V,E)\) with \(N\) websites (\(|V|=N\)). Assume that website \(p_k\) links to \(N_k\) pages \(\{p_{l_1}, \dots, p_{l_{N_k}}\}\subseteq V\). Now perform the following simulation \(j\) times:
Choose an arbitrary page from \(V\) with equal probability \(\frac{1}{N}\).
At all subsequent steps: Assume that the surfer is on page \(p\):
If website \(p\) is a sink, then choose an arbitrary page from \(V\).
If page \(p\) has outgoing links to \(\{p_{l_1}, \dots, p_{l_{N_k}}\}\), then choose a random website among \(\{p_{l_1}, \dots, p_{l_{N_k}}\}\).
Each time a page is visited this is counted, so that \(M_k\) is the number of times \(p_k\) is visited by the surfer. If the sample size \(j\) is large enough, then \(\frac{M_k}{j}\) will represent the probability of randomly ending up on page \(p_k\), and thus \(\frac{M_k}{j}\) approximates the PageRank for \(p_k\) for sufficiently large \(j\).
Note
This method of finding a PageRank can be viewed as a Markov chain Monte Carlo simulation.
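The simulation above can be sketched in a few lines (without damping; the function name `surf_simulation` is our own, and this is not the decomposition into `surf_step`/`random_surf` asked for in the exercises below):

```python
import random

def surf_simulation(web, steps):
    # Counts visits during one long random surf and normalizes the counts.
    pages = list(web)
    visits = {p: 0 for p in pages}
    page = random.choice(pages)              # start on an arbitrary page
    for _ in range(steps):
        visits[page] += 1
        links = web[page]
        # A sink sends the surfer to an arbitrary page; otherwise follow a random link.
        page = random.choice(list(links)) if links else random.choice(pages)
    return {p: visits[p] / steps for p in pages}

random.seed(0)
pr = surf_simulation({'P1': {'P2', 'P3'}, 'P2': {'P3'}, 'P3': {'P1'}}, 10_000)
print(pr)
```

The returned values sum to 1 and approximate the PageRanks for this small network.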
Exercise 9#
Write a Python function that simulates one step of the random surfer model. The function should take a network of pages (represented as a dictionary) and a starting page as input and return a probability distribution for which pages the surfer’s next click will lead to.
def surf_step(web, page):
    # Input: a network as a dictionary and a start page
    # Output: probability distribution as a dictionary for the next website
    distribution = dict()
    # INSERT CODE HERE
    return distribution
# ADD YOUR CODE HERE
Exercise 10#
Write a Python function that computes PageRank values for each page by simulating a random surfer. The function should call `surf_step` to perform the simulation.
def random_surf(web, n):
    # Input: a network as a dictionary and the number of steps in the random surf simulation
    # Output: PageRank values for each page as a dictionary
    ranking = dict()
    # INSERT CODE HERE
    return ranking
# ADD YOUR CODE HERE
Exercise 11#
Use the `random_surf` function with 100, 1000 and 10000 iterations to determine the PageRank for the networks \(W_1\) and \(W_2\). Assess whether the function gives a reliable measurement of PageRank for the given networks. It may be useful here to print the output of `random_surf(web, n)` for e.g. \(n=100, 101, 102, 103, 104, 105\) and \(n=10000, 10001, 10002, 10003, 10004, 10005\).
for n in range(1000, 1010, 1):
    print(random_surf(W1, n))

for n in range(1000, 1010, 1):
    print(random_surf(W2, n))
# ADD YOUR CODE HERE
To avoid the surfer being trapped in a component, and the PageRank thus becoming 0 for all websites in other components, we add a damping factor \(d\in [0,1]\) to the random surfer model. The damping factor is a parameter in the random surfer model that simulates the probability that a surfer will choose a random page instead of following a link. We will later in the project investigate the choice of \(d\). Until then we set \(d=0.85\).
- Random surfer model with damping#
Consider a network \(W=(V,E)\) with \(N\) websites (\(|V|=N\)). Assume that website \(p_k\) links to \(N_k\) pages \(\{p_{l_1}, \dots, p_{l_{N_k}}\}\subseteq V\). Now perform the following simulation \(j\) times:
Choose an arbitrary page from \(V\) with equal probability \(\frac{1}{N}\).
At all subsequent steps: Assume that the surfer is on page \(p\): If website \(p\) is a sink, then choose an arbitrary page from \(V\). If page \(p\) has outgoing links to \(\{p_{l_1}, \dots, p_{l_{N_k}}\}\), then with probability \(d\) choose a random website among \(\{p_{l_1}, \dots, p_{l_{N_k}}\}\), and with probability \(1-d\) choose a random page from \(V\).
Each time a page is visited this is counted, so that \(M_k\) is the number of times \(p_k\) is visited by the surfer. If the sample size \(j\) is large enough, then \(\frac{M_k}{j}\) will represent the probability of randomly ending up on page \(p_k\), and thus \(\frac{M_k}{j}\) approximates the PageRank for \(p_k\) for sufficiently large \(j\).
Exercise 12#
Modify `surf_step` and `random_surf` so that they follow the random surfer model with damping.
def surf_step_damp(web, page, d):
    # Input: a network as a dictionary, a start page and a damping factor
    # Output: probability distribution as a dictionary for the next website
    distribution = dict()
    # INSERT CODE HERE
    return distribution

def random_surf_damp(web, n, d):
    # Input: a network as a dictionary, the number of steps in the random surf simulation and a damping factor
    # Output: PageRank values for each page as a dictionary
    ranking = dict()
    # INSERT CODE HERE
    return ranking
# ADD YOUR CODE HERE
Exercise 13#
Use the `random_surf_damp` function with 100, 1000 and 10000 iterations to determine the PageRank for the networks \(W_1\) and \(W_2\). Compare the result with Exercise 11.
for n in range(1000, 1010, 1):
    print(random_surf_damp(W1, n, 0.85))

for n in range(1000, 1010, 1):
    print(random_surf_damp(W2, n, 0.85))
# ADD YOUR CODE HERE
Recursive Model and Matrix Formulation#
From the previous exercises, we can now see that the random surfer model with damping actually gives reasonable PageRanks for a set of websites. However, there are two significant problems with it: the randomness element means that you do not necessarily get the same PageRank every time you run it, and it is computationally heavy in part because many iterations may be required. To avoid these problems, one can look at a recursive model. First, all PageRanks of the websites are set to be \(PR_1(p) = \frac{1}{N}\), where \(N\) is the number of pages. For any page \(p\), we can get to \(p\) in two ways:
randomly, i.e. with probability \((1-d)\frac{1}{N}\)
through a link from another page \(q\). Then we must first be at \(q\), and then choose \(p\) among all the pages that \(q\) links to. If a page does not link to \(p\), the probability of going to \(p\) via that link is of course \(0\). With these considerations, the PageRank of \(p\) can be updated recursively by
\[\begin{equation*} PR_{n+1}(p) = \frac{1-d}{N} + d \sum_{q \to p} \frac{PR_n(q)}{N_q}, \end{equation*}\]
where the sum runs over all pages \(q\) that link to \(p\), and \(N_q\) is the number of outgoing links from \(q\) (sinks are treated as if they link to all pages).
Exercise 14#
Explain what each term in the above formula for the recursive version means.
Hint
Examine the addition principle and the multiplication principle.
Exercise 15#
Implement `rank_update` and `recursive_PageRank` and run on the networks \(W_1\) and \(W_2\) with `d=0.85`.
def rank_update(web, PageRanks, page, d):
    """
    Updates the value of PageRank for a page based on the recursive formula.
    Pages without outgoing links (sinks) are treated as if they link to all pages on the web.

    Input:
        web and PageRanks are dictionaries as in the output from "make_web" and "random_surf",
        page is the key of the page whose rank we want to update, and
        d is the damping factor.
    Output:
        PageRanks is updated according to the above formula,
        and this function returns a float "increment", the (absolute) difference
        between the previous value and the updated value of PR(p).
    """
    return increment

def recursive_PageRank(web, stopvalue=0.0001, max_iterations=200, d=0.85):
    """
    Implements the recursive version of the PageRank algorithm by first assigning
    a PageRank of 1/N to all pages (where N is the total number of pages)
    and then applying "rank_update" repeatedly until one of two stopping conditions
    is met:
        stopping condition 1: the maximum change from step n to step (n+1) across all
        PageRanks is less than the stop value,
        stopping condition 2: the number of iterations has reached "max_iterations".

    Input: web is a dictionary as in the output of "make_web", d is the damping factor,
    stopvalue is a positive float, max_iterations is a positive integer.
    """
    PageRanks = dict()
    return PageRanks, iteration
# ADD YOUR CODE HERE
In the recursive model, the recursive formula can be written compactly using the already defined link matrix \(\pmb{L}\). Let
\[\begin{equation*} \pmb{x}_n = [PR_n(p_1), PR_n(p_2), \dots, PR_n(p_N)]^T \end{equation*}\]
and
\[\begin{equation*} \pmb{e} = [1, 1, \dots, 1]^T \in \mathbb{R}^N, \end{equation*}\]
then the recursive formula
\[\begin{equation*} PR_{n+1}(p_i) = \frac{1-d}{N} + d \sum_{j=1}^{N} L_{ij}\, PR_n(p_j) \tag{1} \end{equation*}\]
can be written in matrix form as
\[\begin{equation*} \pmb{x}_{n+1} = \frac{1-d}{N}\,\pmb{e} + d\, \pmb{L}\, \pmb{x}_n. \tag{2} \end{equation*}\]
Here \(\pmb{e}\) is an \(N\times 1\) vector of ones, \(d\) is the damping factor, and \(\pmb{L}\) is the link matrix that describes the probabilities of following links from one page to another.
Exercise 16#
Argue that (1) can be rewritten as (2).
Assume that \(\pmb{x}\) is a probability vector, i.e. that its elements sum to 1. Show that (2) can be written as
\[\begin{equation*} \pmb{x}_{n+1} = \pmb{M}_d \, \pmb{x}_n, \end{equation*}\]where
\[\begin{equation*} \pmb{M}_d = \frac{1 - d}{N} \, \pmb{E_N} + d \, \pmb{L}, \end{equation*}\]and \(\pmb{E_N}\) is an \(N \times N\) matrix consisting solely of ones.
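A quick numerical sanity check of this construction (on a small hand-built link matrix of our own): every column of \(\pmb{M}_d\) should again sum to 1, since each column of \(\pmb{L}\) does:

```python
import numpy as np

d, N = 0.85, 3
L = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])   # a valid link matrix: columns sum to 1

M_d = (1 - d) / N * np.ones((N, N)) + d * L
print(M_d.sum(axis=0))           # each column sums to 1
```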
The matrix \(\pmb{M}_d\) is called the modified link matrix. If we assume that everything goes well, i.e. that \(\pmb{x}_n\) at some point tends to something stable for larger and larger \(n\), we can conclude that the PageRank we are looking for is precisely the \(\pmb{x}\) that is a probability vector and satisfies
\[\begin{equation*} \pmb{M}_d\, \pmb{x} = \pmb{x}. \end{equation*}\]
The above equation expresses that the PageRank vector \(\pmb{x}\) is an eigenvector for \(\pmb{M}_d\) with eigenvalue 1. If the damping factor \(d\) is not zero, there is (up to rescaling) exactly one eigenvector for this matrix with eigenvalue 1.
In the project we will show that this is always true when \(0<d<1\), and that there therefore exists a unique solution for the (damped) PageRank algorithm. The matrix-based formulation thus gives us yet another way to compute PageRank:
Find an eigenvector corresponding to the eigenvalue one, and scale the vector so that its elements sum to one.
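On a small example this recipe can be carried out directly with `np.linalg.eig` (the 3-page link matrix below is our own illustration, not the general implementation asked for in Exercise 24):

```python
import numpy as np

d, N = 0.85, 3
L = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
M_d = (1 - d) / N * np.ones((N, N)) + d * L

eigvals, eigvecs = np.linalg.eig(M_d)
i = np.argmin(np.abs(eigvals - 1))   # index of the eigenvalue closest to 1
v = np.real(eigvecs[:, i])
x = v / v.sum()                      # scale so that the elements sum to 1
print(x)
```

The normalization `v / v.sum()` also fixes the overall sign, so the result is a probability vector with strictly positive entries.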
Exercise 17#
Implement `modified_link_matrix` and test on the networks \(W_1\) and \(W_2\) with `d=0.85`.
def modified_link_matrix(web, pagelist, d=0.85):
    # Input: web (dictionary), pagelist (list of keys), d (damping factor)
    # Output: d*A^T + (1-d)*E/N
    # A: NxN numpy array, where row j has non-zero elements in the columns that page j links to.
    # If page j does not link to anything, all elements in row j get the value 1/N.
    # E: np.ones([N,N])
    # INSERT CODE HERE
# ADD YOUR CODE HERE
Exercise 18#
Show that the sum of each column in \(\pmb{M}_d\) is equal to 1.
The Markov Matrix: Properties and Damping in PageRank#
We want to show that the damped PageRank algorithm converges to a unique probability distribution, as we claimed in the previous section. To show this we will need to introduce the Markov matrix:
- Definition 3: Markov matrix#
A Markov matrix \(\pmb{A} = [a_{ij}] \in \mathbb{R}^{n\times n}\) is an \(n \times n\) real matrix with non-negative elements, whose rows sum to 1:
\[\begin{equation*} a_{ij} \ge 0 \quad \forall i, j \end{equation*}\]and
\[\begin{equation*} \sum_{j=1}^n a_{ij} = 1 \quad \forall i. \end{equation*}\]
Note
A Markov matrix can be seen as a “click matrix”, since row \(i\) tells how the probability is distributed to the next step if you are in state \(i\). Therefore all entries are \(\ge 0\), and each row sums to 1.
For example,
\[\begin{equation*} \begin{bmatrix} 0.3 & 0.7 \\ 0.5 & 0.5 \end{bmatrix} \end{equation*}\]
is a Markov matrix, but
\[\begin{equation*} \begin{bmatrix} 0.3 & 0.8 \\ 0.5 & 0.5 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} -0.2 & 1.2 \\ 0.5 & 0.5 \end{bmatrix} \end{equation*}\]
are not (the first has a row that does not sum to 1, the second has a negative element).
If \(\pmb{A}\) is a Markov matrix, then combining the above two conditions we must have that
\[\begin{equation*} 0 \le a_{ij} \le 1 \end{equation*}\]
for all \(i,j\).
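The "rows sum to 1" property can be checked numerically; note in particular that \(\pmb{A}\pmb{e} = \pmb{e}\) for \(\pmb{e}=[1,1,\dots,1]^T\), since each entry of \(\pmb{A}\pmb{e}\) is exactly a row sum (the matrix below also appears in Exercise 26 (*)):

```python
import numpy as np

A = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2],
              [0.1, 0.7, 0.2]])   # non-negative, each row sums to 1

e = np.ones(3)
print(A @ e)                      # [1. 1. 1.]
```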
Exercise 19#
Argue that \(\pmb{L}^T\) is a Markov matrix.
Exercise 20#
Show that if \(\pmb{A}\) is a Markov matrix then \(\pmb{A e} = \pmb{e}\), that is, that \((1,\pmb{e})\) is an eigenpair for \(\pmb{A}\).
Exercise 21#
Prove that a matrix \(\pmb{A}\) and its transposed matrix \(\pmb{A}^T\) have the same eigenvalues.
Then show that they do not necessarily have the same eigenvectors (that is: give an example of a matrix \(\pmb{A}\) where \(\pmb{A}\) and \(\pmb{A}^T\) have different eigenspaces).
- Definition 4: Stationary distribution#
If \(\pmb{A}\) is a Markov matrix with strictly positive elements, then the unique probability vector \(\pmb{x}\) that satisfies
\[\begin{equation*} \pmb{A}^T \pmb{x} = \pmb{x}, \end{equation*} \]is called the stationary distribution for \(\pmb{A}\).
Note that the Markov matrix can have elements with value 0. We will later in the project show how we can guarantee that \(\pmb{A}\) has strictly positive elements. Since \(\pmb{L}=\pmb{A}^T\) and \(\pmb{A}\) have the same eigenvalues and not necessarily the same eigenvectors, we want to determine the PageRank by finding an eigenvector \(\pmb{x}\) that satisfies the equation
\[\begin{equation*} \pmb{L}\, \pmb{x} = \pmb{x}. \end{equation*}\]
However, we are interested in working with the modified link matrix when determining PageRank, as described earlier. We claimed that there exists a probability vector \(\pmb{x}\) that satisfies \(\pmb{M}_d\pmb{x}=\pmb{x}\). To be able to show this, we must first construct the corresponding modified Markov matrix:
Let \(0 < d < 1\). For \(\pmb{A} \in \mathbb{R}^{n \times n}\), set
\[\begin{equation*} \pmb{A}_d = \frac{1 - d}{n}\, \pmb{E}_n + d\, \pmb{A}, \end{equation*}\]
where \(\pmb{E}_n\) is the \(n \times n\) matrix consisting solely of ones.
Exercise 22#
Let \(\pmb{A}\) be a Markov matrix, and let \(0 < d < 1\). Prove that \(\pmb{A}_d\) is also a Markov matrix, and that \(\pmb{A}_d\) has strictly positive elements, i.e. (\(\pmb{A}_d)_{ij} > 0\) for all \(i\) and \(j\).
Exercise 23#
We previously assumed that the modified link matrix \(\pmb{M}_d\) had eigenvalue 1 for \(0<d<1\). Show this.
Since we have shown that \(\pmb{M}_d\) has eigenvalue 1 for \(0<d<1\), we know that there exists an eigenvector \(\pmb{x}\) that satisfies \(\pmb{M}_d\pmb{x}=\pmb{x}\). We will later show that this is the unique solution to the damped PageRank algorithm.
Exercise 24#
Implement a Python function that, given a network, calls `modified_link_matrix`, finds an eigenvector corresponding to the eigenvalue 1, and turns it into a probability vector.
def eigenvector_PageRank(web, d=0.85):
    # Input: web is a dictionary of web pages and links.
    # d is the damping factor.
    # Output: a dictionary with the same keys as web, whose values are the PageRanks of the keys
    ranking = dict()
    # INSERT CODE HERE
    return ranking
# ADD YOUR CODE HERE
Exercise 25#
Computing eigenvectors for large matrices can be time-consuming. Create a network with the `make_web` function, where `n=5000` and `k=10`, and find the PageRank using `eigenvector_PageRank`. Investigate and comment on the runtime.
Hint
Use the time package to determine the runtime.
web = make_web(5000,10,0)
eigenvector_PageRank(web)
# ADD YOUR CODE HERE
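A sketch of the timing pattern suggested by the hint, with a stand-in computation in place of the `eigenvector_PageRank` call:

```python
import time

import numpy as np

start = time.perf_counter()
np.linalg.eig(np.random.rand(500, 500))   # stand-in for the expensive call
elapsed = time.perf_counter() - start
print(f"runtime: {elapsed:.3f} s")
```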
The PageRank Algorithm: Iterative Convergence and Damping in Markov Matrices#
Since it is computationally heavy to find eigenvalues and eigenvectors for large matrices, we would like to find another way to find eigenvectors with corresponding eigenvalue 1. Fortunately, there is a method to do this with Markov matrices, and that is what this section builds toward. We want to arrive at the fact that for a special class of Markov matrices we can raise it to a power, and as the exponent goes to infinity each column in the matrix becomes precisely a reasonable PageRank for the network of websites that the Markov matrix represents.
Exercise 26 (*)#
Compute the product of the following Markov matrices. Verify that the resulting matrix is also a Markov matrix.
\[\begin{equation*} A = \begin{bmatrix} 0.2 & 0.5 & 0.3 \\ 0.4 & 0.4 & 0.2 \\ 0.1 & 0.7 & 0.2 \end{bmatrix}, \quad B = \begin{bmatrix} 0.5 & 0.3 & 0.2 \\ 0.2 & 0.5 & 0.3 \\ 0.3 & 0.2 & 0.5 \end{bmatrix} \end{equation*}\]
That the product of two Markov matrices is a Markov matrix is a useful property that we will need later.
Exercise 27#
Show that the product of two Markov matrices is a Markov matrix.
Exercise 28#
Conclude that if \(\pmb{A}\) is a Markov matrix, then \(\pmb{A}^k\) is also a Markov matrix for all \(k\in \mathbb{N}\).
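Numerically, powers of a Markov matrix can be checked to stay Markov (using the matrix \(A\) from Exercise 26 (*)):

```python
import numpy as np

A = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2],
              [0.1, 0.7, 0.2]])

for k in range(1, 6):
    Ak = np.linalg.matrix_power(A, k)
    print(k, Ak.sum(axis=1))       # row sums stay [1, 1, 1]
    assert (Ak >= 0).all()
```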
We will also need that no eigenvalue of a Markov matrix has absolute value greater than 1. Another way to say this is that the spectral radius \(\operatorname{rad}_{\pmb{A}}\) of a Markov matrix \(\pmb{A}\) is equal to 1:
\[\begin{equation*} \operatorname{rad}_{\pmb{A}} = \max\{\, |\lambda| : \lambda \text{ is an eigenvalue of } \pmb{A} \,\} = 1. \end{equation*}\]
We will later see how this helps ensure that powers of the special class of Markov matrices converge.
Exercise 29#
Let \(\pmb{A}\) be an \(n\times n\) Markov matrix. Show that \(\operatorname{rad}_{\pmb{A}} = 1\).
Hint
Assume that \(\pmb{A}\) has an eigenvalue \(\lambda\) with \(|\lambda|>1\). Let \(\pmb{v}\) be the corresponding eigenvector. For a Markov matrix, the sum of each row is exactly 1, so \(\pmb{A}\pmb{v} = \lambda \pmb{v}\) can be seen as \(n\) (different) weighted averages of \(\pmb{v}_1, \pmb{v}_2, \ldots, \pmb{v}_n\). How large can the absolute value of a weighted average of \(n\) elements be at most?
Exercise 30 (*)#
Consider the following matrix:
Is it a Markov matrix? Find its eigenvalues and the modulus of the eigenvalues. Find \(\pmb{A}^2, \pmb{A}^3\). Does the matrix \(\pmb{A}^k\) converge for \(k\to \infty\)?
Now we consider the special class of Markov matrices mentioned before, namely Markov matrices with strictly positive elements - yet another reason why the damping factor is an important player in the PageRank algorithm. To ensure convergence of powers of Markov matrices with strictly positive elements, we must ensure that \(1\) is the only eigenvalue with modulus \(1\). For this we will need the triangle inequality:
- The triangle inequality#
For all complex numbers \(z\) and \(w\) we have
\[\begin{equation*} |z+w| \leq |z| + |w|, \end{equation*}\]with equality if and only if \(z\) and \(w\) are of the form \(z = kw, k\in \mathbb{R}_{\geq 0}\)
More generally it follows by induction that for a collection of \(n\) complex numbers \(z_1, \dots, z_n\) we have:
\(\left| \sum_{i=1}^n z_i \right| \leq \sum_{i=1}^n |z_i|\)
Equality occurs only if all the non-zero numbers \(z_i\) have the same principal argument.
Exercise 31#
If \(\pmb{A}\) is a Markov matrix with strictly positive elements (\(a_{ij} > 0\) \(\forall\) \(i,j\)), prove that there is exactly one eigenvalue \(\lambda\) with \(|\lambda| = 1\), namely \(\lambda = 1\), and moreover that the corresponding eigenspace is \(\mathcal{E}_1 = \hbox{span}\{ \pmb{e} \}\), where \(\pmb{e} = [1,1,\dots,1]^t\). All other eigenvalues have modulus less than 1.
Hint: The max norm of a vector \(\pmb{x} \in \mathbb{R}^n\) is defined as
\[\begin{equation*} \Vert \pmb{x} \Vert_{\infty} = \max_{1 \leq i \leq n} |x_i|, \end{equation*}\]
i.e. the largest element measured in absolute value, cf. Example 2.1.2 in the textbook.
Assume that \(\lambda\) is an eigenvalue with \(|\lambda|=1\), and that \(\pmb{v}\) is a corresponding eigenvector. Let \(k\) be an index such that \(\Vert\pmb{v}\Vert_{\infty} = |v_k|\). Then we can write:
\[\begin{equation*} \Vert\pmb{v}\Vert_{\infty} = |v_k| = |\lambda v_k| = \Big| \sum_{j=1}^n a_{kj} v_j \Big| \le \sum_{j=1}^n a_{kj} |v_j| \le \sum_{j=1}^n a_{kj} \Vert\pmb{v}\Vert_{\infty} = \Vert\pmb{v}\Vert_{\infty}. \end{equation*}\]
Provide justification for each step above.
Since the first and last terms in this chain are equal, there must be equality in each inequality. Use the first of these inequalities together with point (2) in the triangle inequality above, as well as the second of these inequalities (and the condition \(a_{ij}>0\)), to show that we must have \(\pmb{v} = C \pmb{e}\), where \(|C| = \Vert\pmb{v}\Vert_{\infty}\).
Exercise 32 (*)#
Take the matrix \(\pmb{A}\) from Exercise 30 (*). Dampen it with factor \(0.85\), call this matrix \(\pmb{A}_d\). Then find its eigenvalues, their modulus, and comment on the result. Does \(\pmb{A}_d^k\) converge for \(k \to \infty\)?
Hint
Diagonalize the matrix.
- Theorem 1#
Let \(n\geq 2\) and let \(\pmb{A}\) be an \(n \times n\) Markov matrix with strictly positive elements. Then there is a unique vector \(\pmb{x} \in \mathbb{R}^n\) such that \(\pmb{A}^T\pmb{x} = \pmb{x}\), all elements in \(\pmb{x}\) are strictly positive and sum to 1, and:
\[\begin{equation*} \lim_{k\to \infty} (\pmb{A}^T)^k = \pmb{x}\,\pmb{e}^T = [\pmb{x}, \pmb{x}, \ldots, \pmb{x}]. \end{equation*}\]
Exercise 33#
Explain why Theorem 1 is relevant for the PageRank algorithm.
Exercise 34#
Let \(\pmb{A}\) be an \(n \times n\) Markov matrix with strictly positive elements. Prove Theorem 1 under the assumption that \(\pmb{A}\) is diagonalizable.
Hint
If \(\pmb{A}\) is diagonalizable, then \(\pmb{A} = \pmb{V}\pmb{D}\pmb{V}^{-1}\) for a matrix \(\pmb{V}\) and a diagonal matrix \(\pmb{D}\). What is \(\pmb{A}^k\)?
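The hint can be tested numerically: if \(\pmb{A} = \pmb{V}\pmb{D}\pmb{V}^{-1}\), then \(\pmb{A}^k = \pmb{V}\pmb{D}^k\pmb{V}^{-1}\), so only the diagonal entries need to be raised to the \(k\)'th power (the matrix below is our own illustration):

```python
import numpy as np

A = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2],
              [0.1, 0.7, 0.2]])

# Diagonalize: A = V D V^{-1}, so A^k = V D^k V^{-1}.
eigvals, V = np.linalg.eig(A)

k = 20
Ak_diag = V @ np.diag(eigvals**k) @ np.linalg.inv(V)
Ak_direct = np.linalg.matrix_power(A, k)
print(np.allclose(Ak_diag, Ak_direct))   # True (up to numerical round-off)
```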
Exercise 35#
Write a Python function that, given a network, an exponent `power` and a damping factor `d`, calls `modified_link_matrix`, multiplies it by itself `power` times, and returns a PageRank for the network.
def matrix_PageRank(web, power, d=0.85):
    # Input: web is a dictionary with web pages and links.
    #        power is a positive integer, the number of matrix multiplications.
    #        d is a positive float, the damping constant.
    # Output: A dictionary with the same keys as web, whose values are the PageRank of each key.
    ranking = dict()
    # INSERT CODE HERE
    return ranking
# ADD YOUR CODE HERE
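One possible shape of such a function is sketched below. To keep the sketch self-contained, the damped link matrix is built inline; this stands in for the project's `modified_link_matrix` helper, which is assumed rather than shown here.

```python
import numpy as np

def matrix_PageRank(web, power, d=0.85):
    # web: dict mapping each page to the list of pages it links to.
    pages = list(web)
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}

    # Link matrix: each page distributes its rank evenly over its links.
    A = np.zeros((n, n))
    for p, links in web.items():
        if links:
            for q in links:
                A[idx[p], idx[q]] += 1.0 / len(links)
        else:
            A[idx[p], :] = 1.0 / n    # dangling page: treat as linking to all

    # Damped matrix A_d = d*A + (1-d)/n * E_n, then repeated multiplication.
    A_d = d * A + (1.0 - d) / n
    M = np.linalg.matrix_power(A_d.T, power)
    x = M @ (np.ones(n) / n)          # start from the uniform distribution
    return {p: float(x[idx[p]]) for p in pages}

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranking = matrix_PageRank(web, power=50)
```

Applying \((\pmb{A}_d^T)^{\texttt{power}}\) to the uniform start vector corresponds to `power` steps of the random-surfer iteration.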
Exercise 36 (**)#
We want to show that the convergence of the `matrix_PageRank` algorithm depends on the choice of the damping factor \(d\).
Show that if \(\pmb{A}\) has eigenvalues 1, \(\lambda_1\), \(\lambda_2\),…,\(\lambda_{n-1}\), then \(\pmb{A}_d\) has eigenvalues 1, \(d\lambda_1\), \(d\lambda_2\),…,\(d\lambda_{n-1}\).
Hint
First show that \(1\) is an eigenvalue for \(\pmb{A}_d\). Then consider \(\pmb{A}^T\). What is \(\pmb{e}^T\pmb{A}^T\)? Show that if \(\pmb{v}\) is an eigenvector for \(\pmb{A}^T\) with eigenvalue different from \(1\), then \(\pmb{v}\perp \pmb{e}\). What eigenvalues does \(\pmb{E}_n\) have? How do their eigenspaces relate to each other?
Then show that any eigenvector for \(\pmb{A}^T\) is also an eigenvector for \(\pmb{A}_d^T\). What are their corresponding eigenvalues?
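The claim can also be verified numerically on a small example matrix (chosen here purely for illustration):

```python
import numpy as np

A = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])     # Markov matrix: eigenvalues 1, (-1 ± i)/2
d = 0.85
n = A.shape[0]
A_d = d * A + (1.0 - d) / n * np.ones((n, n))

mods_A = np.sort(np.abs(np.linalg.eigvals(A)))
mods_Ad = np.sort(np.abs(np.linalg.eigvals(A_d)))
# The largest modulus stays 1; every other modulus is scaled by d.
```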
Exercise 37#
Discuss the choice of damping in the PageRank algorithm in terms of convergence and correctness. Consider the convergence if \(\pmb{L}\) is an \(n \times n\) matrix with eigenvalues \(\lambda_1=1\), \(\lambda_2=-0.999\), and \(|\lambda_k|<0.5\) for \(k=3,\ldots,n\). Include a geometric visualization and compare the convergence for \(\pmb{L}\) and \(\pmb{M}_d\). What happens if we replace \(\lambda_2=-0.999\) with \(\lambda_2=0.999\cdot i\)?
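To build intuition for this discussion: the error in the power iteration decays roughly like \(|\lambda_2|^k\), so the second-largest eigenvalue modulus governs the iteration count. A minimal sketch, using diagonal matrices as stand-ins (not Markov matrices) so the eigenvalues are explicit:

```python
import numpy as np

def iterations_to_converge(A, tol=1e-6, max_iter=100000):
    # Count power-iteration steps until successive iterates agree within tol.
    x = np.ones(A.shape[0]) / A.shape[0]
    for k in range(1, max_iter + 1):
        x_new = A @ x
        if np.linalg.norm(x_new - x, np.inf) < tol:
            return k
        x = x_new
    return max_iter

d = 0.85
L = np.diag([1.0, -0.999, 0.3])            # |lambda_2| = 0.999: very slow
M_d = np.diag([1.0, d * -0.999, d * 0.3])  # damping shrinks it to about 0.849

slow = iterations_to_converge(L)
fast = iterations_to_converge(M_d)
```

The damped matrix needs orders of magnitude fewer iterations, because \(0.849^k\) decays far faster than \(0.999^k\).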
Exercise 38#
Argue why we can allow ourselves to use the `matrix_PageRank` function to determine PageRank, even though it is an approximate algorithm.
Analysis of PageRank Models and Investigation of Damping#
Exercise 39#
Write a Python function that given a network and a ranking creates a graphical representation of the network, where arrows illustrate links between websites, and the sizes of the websites indicate the size of the PageRank. Test the function on \(W_1\) and \(W_2\).
import matplotlib.pyplot as plt

def plot_ranking(web, ranking, d=0.85):
    # Input: web and ranking are dictionaries, for example as output from the functions "make_web" and "random_surf".
    # Output: Graphical representation of the web structure, with links shown as arrows and PageRank visualized by the size of the websites.
    # INSERT CODE HERE
    plt.show()
# ADD YOUR CODE HERE
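A possible implementation using networkx (one of the project's listed packages); the scaling constant `20000` is an arbitrary choice made here for readable node sizes:

```python
import matplotlib.pyplot as plt
import networkx as nx

def plot_ranking(web, ranking, d=0.85):
    # Build a directed graph whose edges are the links of the network.
    G = nx.DiGraph()
    G.add_nodes_from(web)
    for page, links in web.items():
        for target in links:
            G.add_edge(page, target)
    # Node size proportional to PageRank, so important pages appear larger.
    sizes = [20000 * ranking[node] for node in G.nodes]
    pos = nx.spring_layout(G, seed=1)
    nx.draw(G, pos, with_labels=True, node_size=sizes, arrows=True)
    plt.show()
```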
Exercise 40#
Test the four PageRank algorithms on large networks created using the `make_web` function. To compare two different implementations you must compute PageRank for the same network to the same degree of precision and compare the times. Do the relative times change if higher or lower precision is required?
So far we have used `d=0.85`. Experiment with different values of the damping factor (e.g. 0.5, 0.75, 0.9) in `matrix_PageRank`. Run the function on the same network (use `make_web`) and measure how many iterations are required to achieve convergence, as well as how the final ranking of the pages changes. Discuss the choice of damping.
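For the timing comparison, a small helper like the following can be used (a generic sketch; the PageRank implementations and `make_web` networks to pass in are the ones developed earlier in the project):

```python
import time

def best_time(f, *args, repeats=5, **kwargs):
    # Best wall-clock time over several runs, to reduce measurement noise.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        f(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best
```

For example, `best_time(matrix_PageRank, web, 50)` versus the corresponding call for another implementation, on the same `web`.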
Jordan Normal Form: Generalization of Diagonalizability#
In Exercise 34 we showed the validity of the PageRank algorithm (`matrix_PageRank`) under the assumption that \(\pmb{A}\) is diagonalizable. This is not always the case, however, and we are therefore interested in showing that the PageRank algorithm converges even if \(\pmb{A}\) cannot be diagonalized. For this we will need the Jordan normal form:
Jordan normal form is a normal form for square matrices that generalizes diagonalizability. Let \(\pmb{A} \in \mathbb{C}^{n \times n}\) have the distinct eigenvalues \(\lambda_1, \lambda_2, \dots, \lambda_m\). Then there exists an invertible matrix \(\pmb{P}\) such that
\[\begin{equation*} \pmb{A} = \pmb{P}\pmb{J}\pmb{P}^{-1}, \end{equation*}\]
where \(\pmb{J}\), the Jordan normal form of \(\pmb{A}\), is a block diagonal matrix consisting of Jordan blocks. The full \(\pmb{J}\) can be written as
\[\begin{equation*} \pmb{J} = \begin{bmatrix} \pmb{J}_{k_1}(\lambda_1) & & & \\ & \pmb{J}_{k_2}(\lambda_2) & & \\ & & \ddots & \\ & & & \pmb{J}_{k_m}(\lambda_m) \end{bmatrix}, \end{equation*}\]
where \(\sum_{i=1}^m k_i = n\). Each Jordan block \(\pmb{J}_{k_i}(\lambda_i)\) has dimension \(k_i \times k_i\) and has the form
\[\begin{equation*} \pmb{J}_{k_i}(\lambda_i) = \begin{bmatrix} \lambda_i & 1 & & \\ & \lambda_i & \ddots & \\ & & \ddots & 1 \\ & & & \lambda_i \end{bmatrix}. \end{equation*}\]
Exercise 41 (**)#
Prove Theorem 1 using the Jordan normal form. You may therefore not assume that \(\pmb{A}\) is diagonalizable.
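The key computational fact for this exercise: powers of a Jordan block \(\pmb{J}_k(\lambda)\) with \(|\lambda|<1\) still tend to the zero matrix, even though the block is not diagonalizable. A quick numerical check, with an example block chosen here:

```python
import numpy as np

lam = 0.85
J = np.array([[lam, 1.0, 0.0],
              [0.0, lam, 1.0],
              [0.0, 0.0, lam]])        # the 3x3 Jordan block J_3(0.85)

Jk = np.linalg.matrix_power(J, 500)
# Entries of J^k grow like binom(k, j) * lam^(k-j); geometric decay wins.
```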