Florida International University
School of Computing and Information Sciences
Survey of
Web Page Ranking Algorithms
COP 6727
Advanced Database System
Spring 2008
Professor: Vagelis Hristidis
April, 2008
Roberto
Espinoza
Narendran
Bakthavatsalam
On
On On the web courtesy of:
CP CPC Computer Consultants, Inc. www.cpccci.com
Table of Contents
ABSTRACT__________________________________________________________________ 4
INTRODUCTION_____________________________________________________________ 5
Page Quality: In Search of
an Unbiased Web Ranking________________________________ 6
PageRank and Popularity______________________________________________________ 6
Quality and PageRank________________________________________________________ 6
Measuring Page Quality_______________________________________________________ 7
Web User Model_____________________________________________________________ 7
Popularity Evolution______________________________________________________________________ 8
Quality Estimator_________________________________________________________________________ 8
Time Evolution of Page
Popularity_____________________________________________ 10
Popularity Evolution
when Quality Drops at t=30__________________________________ 11
Topic Sensitive PageRank______________________________________________________ 12
ODP Biasing_______________________________________________________________ 12
Query-Time Importance
Score_________________________________________________ 12
Experimental Results_________________________________________________________ 13
Similarity Measure for
Induced Rankings_____________________________________________________ 13
Effect of the
ODP-Biasing_________________________________________________________________ 13
Query-Sensitive Scoring___________________________________________________________________ 13
Context-Sensitive
Scoring_________________________________________________________________ 14
Sources of Search
Context____________________________________________________ 14
Different search
contexts for query �blues�_______________________________________ 15
Results for query
�blues� with content sensitive____________________________________ 15
Pairwise comparison of
topically-biased rankings__________________________________ 16
Precision at ten results
for our test queries. The average precision over ten queries is also shown 17
Web Page Scoring Systems
for Horizontal and Vertical Search________________________ 18
PageRank and Random
Walks_________________________________________________ 18
Uniform Jump
Probabilities________________________________________________________________ 19
Multiple State Model_____________________________________________________________________ 19
Horizontal WPSS___________________________________________________________ 19
PageRank______________________________________________________________________________ 19
The HITS Ranking System________________________________________________________________ 20
The PageRank-HITS Model________________________________________________________________ 20
Vertical WPSS______________________________________________________________ 20
Focused PageRank_______________________________________________________________________ 21
Double Focused PageRank_________________________________________________________________ 21
Top figure is showing
the distribution of page scores for the topic �Linux�. The one below is showing
the distribution for the topic �cooking recipes�.______________________________________ 22
Percentage of positive
judgedments assessed by a set of ten users______________________ 23
Ranking Systems: The
PageRank Axioms_________________________________________ 24
Page Ranking______________________________________________________________ 24
The Axioms________________________________________________________________ 25
Isomorphism____________________________________________________________________________ 25
Self Edge______________________________________________________________________________ 25
Vote by Committee______________________________________________________________________ 25
Collapsing_____________________________________________________________________________ 25
Proxy_________________________________________________________________________________ 26
Sketch of Delete_____________________________________________________________ 27
Sketch of Duplicate__________________________________________________________ 27
Example run of the
completeness algorithm_______________________________________ 28
Sketch of Several Axioms_____________________________________________________ 29
CONCLUSION______________________________________________________________ 30
REFERENCES______________________________________________________________ 31
ENDNOTES_________________________________________________________________ 34
This paper represents a survey of several Web page ranking algorithms. We intend to analyze and compare each of these algorithms, which introduce a novel approach to Internet searching and the ranking Search Engines, more in particular, Google, give to the search results of a particular query.
After we have introduced and surveyed the 4 different proposed algorithms, we shall convey our conclusions where we summarize our own opinions and recommendations for the implementation of that algorithm that promises the highest beneficial impact toward the search of relevant information in the Internet.
The purpose of this paper is to compare four different algorithms of Web page ranking when performing Internet searching. These algorithms are the following:
1. Page Quality: In Search of an Unbiased Web Ranking
2. Topic-Sensitive PageRank
3. Web Page Scoring Systems for Horizontal and Vertical Search
4. Ranking Systems: The PageRank Axioms
We will begin the survey of each algorithm by providing a brief introduction, followed by an analysis with corresponding graphs or figures. We shall finalize this paper by incorporating our own conclusions of this survey with our opinions and recommendations.
Popular pages tend to get more popular, while unpopular pages tend to get more ignored, this decreases the overall quality of search results in the long run. This phenomenon presents a problem for newly published web sites that do not yet have a following. This algorithm attempts to alleviate this situation where popular pages become even more popular while the others continue to get ignored. The proposed solution is that of introducing a �quality estimator�, which measures the intrinsic quality of a web page. Furthermore, this algorithm does compare itself with Google�s current PageRank, and presents an analysis of cases where PageRank is accurate and where it is not.
The current PageRank metric defines the importance of a particular page by the sum of the importance of the pages that do point to it. There is a dumping factor also taken into account, which represents the randomness of the user clicking on an unrelated link after he has been clicking on relevant topics for a while. Therefore, PageRank implies that many users are indeed interested in a particular page, and that more users are likely to visit the page as compared to other low ranking pages.
The key feature of PageRank is that it mesures the popularity of a page, which means that a user looked at a page and liked it, then it is reasonable to assume that subsequent visitors also will like the page after visiting it for the first time. Thus, the algorithm in effect increases the probability that users will like the first few results.
The bias that PageRank has against unpopular pages accentuates even more with freshly created web pages which have not yet had the opportunity to be seen. Since it doesn�t appear in the top search results then very few people get to see it, and becomes even more unpopular, despite the fact that it may be a very high quality page with much valuable information.
Therefore, Page Quality suggests using the �probability� that a page will be liked when seen for the first time, instead of using the �popularity� of a web page to display search results. Thus, the quality of a page is defined as:
�Page Quality: Define the quality of a page p, Q(p), as the conditional probability that an average user will like the page when the user sees the page for the first time�[1]
Q(p) = P(Lp | Ap) where Ap the event that the user will become aware of p by visiting the page for the first time and Lp represents the event that the user likes the page.
The first assumption in developing this idea is that the creation of links pointing to a particular page means such page is well liked. The second assumption is that a high quality page will be liked more quickly overall. The challenge here becomes how to use popularity increase in measuring the quality of the page. In order to address this, a new model is created, that of the �web-user� model
Definition: The popularity of page p at time t, P(p, t), is the fraction of web users who like the page.
Definition: The visit popularity of a page p at time t, V(p, t) is the number of �visits� or �page views� the page gets within a unit time interval at time t.
Definition: The user awareness of page p at time t, A(p, t), is the fraction of web users who are aware of p at time t.
The first hypothesis is based on the random-surfer interpretation of PageRank, which helps us generate the following propositions: 1. The number of visits to page p within a unit time interval at time t is proportional to how many people like the page. 2. All web users will visit a particular page with equal probability.
We can estimate the current popularity of the page p, by estimating how many new visitors will visit p. Then we can estimate how many will like the page in their first visit, and thus how much its popularity will increase.
Based on the results of estimations using sound theorems and proofs, it can be concluded that a websiste goes through three stages after it�s first launched: infant, expansion and maturity stages. In the first infant stage, the page is barely noticed by web users and has practically zero popularity. At some point the page enters the second expansion stage where the popularity of the page suddenly increases. Finally, the popularity of the page stabilizes at a certain value entering the maturity stage.
The quality of a page is proportional to its popularity increase and inversely proportional to its current popularity. It is also inversely proportional to the fraction of the users who are unaware of the page. Quality of a page does change over time, and the quality estimator is able to capture such change.


In Topic Sensitive PageRank, several scores are computed: multiple importance scores for each page under several topics that form a composite PageRank score for those pages matching the query. During the offline crawling process, 16 topic-sensitive PageRank vectors are generated, using as a guideline the top-level category from Open Directory Project (ODP). At query time, the similarity of the query is compared to each of these vectors or topics; and subsequently, instead of using a single global ranking vector, the linear combination of the topic-sensitive vectors is weighed using the similarity of the query to the topics. This method yields a very accurate set of results relevant to the context of the particular query.
The first step in this approach is performed once during the offline preprocessing of the Web crawl. It generates a set of biased PageRank vectors using a set of �basis� topics. The specific methodology of creating the biased PageRank vectors is to use the URLs listed below the 16 top-level categories in ODP. These then, constitute the personalization vectors. A single, non-biased PageRank vector is also computed for comparison purposes. Third, a set of 16 class term-vectors is also computed, which consists of the terms in the documents below each of the 16 top-level categories.
The second step is performed at query time. If a query q was issued by highlighting the term q in some web page u, then q� (the context of q) consists of the terms in u. For ordinary queries not done in context, let q� = q. using a unigram language model, with parameters set to their maximum-likelihood estimates, the class probabilities for each of the 16 top-level classes in ODP conditioned on q� are computed. Using a text index, the URLs for all documents containing the original query terms q are retrieved. Finally, the query-sensitive importance score of each of these retrieved URLs is computed. The results are then ranked according to the composite score.
Uses two measures when comparing rankings. The first indicates the degree of overlap between the top URLs of two rankings. This measure however, does not give a complete picture of the similarity between the two rankings since it does not indicate the degree to which the relative ordering of the top URLs are in agreement. The second uses a variant of the distance measure between Web pages. Based on these, the similarity measure is defined as the probability that two rankings agree on the relative ordering of a randomly selected pair of distinct Web pages.
This measures the effect of topically biasing the PageRank computation. It should be noted that the choice of the bias factor does affect the degree to which the resultant vector is biased towards the topic vector. For lower vales of the bias factor, α, where 0 � α � 1, the URLs in the bias set becomes irrelevant to the final score assignment.
This tells us how effectively the ranking precision gained by the use of multiple PageRank vectors can be utilized. Given a query, the first task is to determine which of the rank vectors can best rank the results for the query. For most test queries, the ODP categories are intuitively the most relevant categories for the query.
If the search is done in context, then the context can be used instead of the query to determine the topics. Using the context can help disambiguate the query term and yield results that more closely reflect the original intent of the user.
There are a variety of other sources of context that may be used in this algorithm scheme, for example, the history of queries issued leading up to the current query is another form of query context. In addition to these types of context associated with the query itself, another source that can be potentially