Florida International University

School of Computing and Information Sciences

 

 

Survey of

Web Page Ranking Algorithms

 

 

COP 6727 Advanced Database System

Spring 2008

Professor: Vagelis Hristidis

April, 2008

 

 

Roberto Espinoza

Narendran Bakthavatsalam

 

On

On   On the web courtesy of:

CP   CPC Computer Consultants, Inc. www.cpccci.com

 


 

Table of Contents

 

ABSTRACT__________________________________________________________________ 4

INTRODUCTION_____________________________________________________________ 5

Page Quality: In Search of an Unbiased Web Ranking________________________________ 6

PageRank and Popularity______________________________________________________ 6

Quality and PageRank________________________________________________________ 6

Measuring Page Quality_______________________________________________________ 7

Web User Model_____________________________________________________________ 7

Popularity Evolution______________________________________________________________________ 8

Quality Estimator_________________________________________________________________________ 8

Time Evolution of Page Popularity_____________________________________________ 10

Popularity Evolution when Quality Drops at t=30__________________________________ 11

Topic Sensitive PageRank______________________________________________________ 12

ODP Biasing_______________________________________________________________ 12

Query-Time Importance Score_________________________________________________ 12

Experimental Results_________________________________________________________ 13

Similarity Measure for Induced Rankings_____________________________________________________ 13

Effect of the ODP-Biasing_________________________________________________________________ 13

Query-Sensitive Scoring___________________________________________________________________ 13

Context-Sensitive Scoring_________________________________________________________________ 14

Sources of Search Context____________________________________________________ 14

Different search contexts for query �blues�_______________________________________ 15

Results for query �blues� with content sensitive____________________________________ 15

Pairwise comparison of topically-biased rankings__________________________________ 16

Precision at ten results for our test queries. The average precision over ten queries is also shown          17

Web Page Scoring Systems for Horizontal and Vertical Search________________________ 18

PageRank and Random Walks_________________________________________________ 18

Uniform Jump Probabilities________________________________________________________________ 19

Multiple State Model_____________________________________________________________________ 19

Horizontal WPSS___________________________________________________________ 19

PageRank______________________________________________________________________________ 19

The HITS Ranking System________________________________________________________________ 20

The PageRank-HITS Model________________________________________________________________ 20

Vertical WPSS______________________________________________________________ 20

Focused PageRank_______________________________________________________________________ 21

Double Focused PageRank_________________________________________________________________ 21

Top figure is showing the distribution of page scores for the topic �Linux�. The one below is showing the distribution for the topic �cooking recipes�.______________________________________ 22

Percentage of positive judgedments assessed by a set of ten users______________________ 23

Ranking Systems: The PageRank Axioms_________________________________________ 24

Page Ranking______________________________________________________________ 24

The Axioms________________________________________________________________ 25

Isomorphism____________________________________________________________________________ 25

Self Edge______________________________________________________________________________ 25

Vote by Committee______________________________________________________________________ 25

Collapsing_____________________________________________________________________________ 25

Proxy_________________________________________________________________________________ 26

Sketch of Delete_____________________________________________________________ 27

Sketch of Duplicate__________________________________________________________ 27

Example run of the completeness algorithm_______________________________________ 28

Sketch of Several Axioms_____________________________________________________ 29

CONCLUSION______________________________________________________________ 30

REFERENCES______________________________________________________________ 31

ENDNOTES_________________________________________________________________ 34

 


 

ABSTRACT

 

This paper represents a survey of several Web page ranking algorithms. We intend to analyze and compare each of these algorithms, which introduce a novel approach to Internet searching and the ranking Search Engines, more in particular, Google, give to the search results of a particular query.

 

After we have introduced and surveyed the 4 different proposed algorithms, we shall convey our conclusions where we summarize our own opinions and recommendations for the implementation of that algorithm that promises the highest beneficial impact toward the search of relevant information in the Internet.

 

 


 

INTRODUCTION

 

The purpose of this paper is to compare four different algorithms of Web page ranking when performing Internet searching.  These algorithms are the following:

 

1.    Page Quality: In Search of an Unbiased Web Ranking

2.    Topic-Sensitive PageRank

3.    Web Page Scoring Systems for Horizontal and Vertical Search

4.    Ranking Systems: The PageRank Axioms

 

We will begin the survey of each algorithm by providing a brief introduction, followed by an analysis with corresponding graphs or figures. We shall finalize this paper by incorporating our own conclusions of this survey with our opinions and recommendations.

 


 

 

Page Quality: In Search of an Unbiased Web Ranking

 

Popular pages tend to get more popular, while unpopular pages tend to get more ignored, this decreases the overall quality of search results in the long run. This phenomenon presents a problem for newly published web sites that do not yet have a following. This algorithm attempts to alleviate this situation where popular pages become even more popular while the others continue to get ignored.  The proposed solution is that of introducing a �quality estimator�, which measures the intrinsic quality of a web page.  Furthermore, this algorithm does compare itself with Google�s current PageRank, and presents an analysis of cases where PageRank is accurate and where it is not.

 

PageRank and Popularity

 

The current PageRank metric defines the importance of a particular page by the sum of the importance of the pages that do point to it.  There is a dumping factor also taken into account, which represents the randomness of the user clicking on an unrelated link after he has been clicking on relevant topics for a while.  Therefore, PageRank implies that many users are indeed interested in a particular page, and that more users are likely to visit the page as compared to other low ranking pages.

 

Quality and PageRank

 

The key feature of PageRank is that it mesures the popularity of a page, which means that a user looked at a page and liked it, then it is reasonable to assume that subsequent visitors also will like the page after visiting it for the first time.  Thus, the algorithm in effect increases the probability that users will like the first few results.

 

The bias that PageRank has against unpopular pages accentuates even more with freshly created web pages which have not yet had the opportunity to be seen.  Since it doesn�t appear in the top search results then very few people get to see it, and becomes even more unpopular, despite the fact that it may be a very high quality page with much valuable information.

 

Therefore, Page Quality suggests using the �probability� that a page will be liked when seen for the first time, instead of using the �popularity� of a web page to display search results. Thus, the quality of a page is defined as:

 

�Page Quality: Define the quality of a page p, Q(p), as the conditional probability that an average user will like the page when the user sees the page for the first time�[1]

 

Q(p) = P(Lp | Ap) where Ap the event that the user will become aware of p by visiting the page for the first time and Lp represents the event that the user likes the page.

 

Measuring Page Quality

 

The first assumption in developing this idea is that the creation of links pointing to a particular page means such page is well liked. The second assumption is that a high quality page will be liked more quickly overall.  The challenge here becomes how to use popularity increase in measuring the quality of the page.  In order to address this, a new model is created, that of the �web-user� model

 

Web User Model

 

Definition:  The popularity of page p at time t, P(p, t), is the fraction of web users who like the page.

 

Definition: The visit popularity of a page p at time t, V(p, t) is the number of �visits� or �page views� the page gets within a unit time interval at time t.

 

Definition: The user awareness of page p at time t, A(p, t), is the fraction of web users who are aware of p at time t.

 

The first hypothesis is based on the random-surfer interpretation of PageRank, which helps us generate the following propositions: 1. The number of visits to page p within a unit time interval at time t is proportional to how many people like the page. 2. All web users will visit a particular page with equal probability.

 

Popularity Evolution

 

We can estimate the current popularity of the page p, by estimating how many new visitors will visit p. Then we can estimate how many will like the page in their first visit, and thus how much its popularity will increase.

 

Based on the results of estimations using sound theorems and proofs, it can be concluded that a websiste goes through three stages after it�s first launched: infant, expansion and maturity stages.  In the first infant stage, the page is barely noticed by web users and has practically zero popularity. At some point the page enters the second expansion stage where the popularity of the page suddenly increases. Finally, the popularity of the page stabilizes at a certain value entering the maturity stage.

 

Quality Estimator

 

The quality of a page is proportional to its popularity increase and inversely proportional to its current popularity. It is also inversely proportional to the fraction of the users who are unaware of the page.  Quality of a page does change over time, and the quality estimator is able to capture such change.

 

 


 

Time Evolution of Page Popularity

 

 

 


 

Popularity Evolution when Quality Drops at t=30

 

 

 


 

Topic Sensitive PageRank

 

In Topic Sensitive PageRank, several scores are computed:  multiple importance scores for each page under several topics that form a composite PageRank score for those pages matching the query.  During the offline crawling process, 16 topic-sensitive PageRank vectors are generated, using as a guideline the top-level category from Open Directory Project (ODP).  At query time, the similarity of the query is compared to each of these vectors or topics; and subsequently, instead of using a single global ranking vector, the linear combination of the topic-sensitive vectors is weighed using the similarity of the query to the topics.  This method yields a very accurate set of results relevant to the context of the particular query.

 

ODP Biasing

 

The first step in this approach is performed once during the offline preprocessing of the Web crawl. It generates a set of biased PageRank vectors using a set of �basis� topics.  The specific methodology of creating the biased PageRank vectors is to use the URLs listed below the 16 top-level categories in ODP.  These then, constitute the personalization vectors.  A single, non-biased PageRank vector is also computed for comparison purposes. Third, a set of 16 class term-vectors is also computed, which consists of the terms in the documents below each of the 16 top-level categories.

 

Query-Time Importance Score

 

The second step is performed at query time. If a query q was issued by highlighting the term q in some web page u, then q� (the context of q) consists of the terms in u.  For ordinary queries not done in context, let q� = q. using a unigram language model, with parameters set to their maximum-likelihood estimates, the class probabilities for each of the 16 top-level classes in ODP conditioned on q� are computed.  Using a text index, the URLs for all documents containing the original query terms q are retrieved.  Finally, the query-sensitive importance score of each of these retrieved URLs is computed. The results are then ranked according to the composite score. 

 

Experimental Results

Similarity Measure for Induced Rankings

 

Uses two measures when comparing rankings. The first indicates the degree of overlap between the top URLs of two rankings.  This measure however, does not give a complete picture of the similarity between the two rankings since it does not indicate the degree to which the relative ordering of the top URLs are in agreement.  The second uses a variant of the distance measure between Web pages.  Based on these, the similarity measure is defined as the probability that two rankings agree on the relative ordering of a randomly selected pair of distinct Web pages.

 

Effect of the ODP-Biasing

 

This measures the effect of topically biasing the PageRank computation.  It should be noted that the choice of the bias factor does affect the degree to which the resultant vector is biased towards the topic vector. For lower vales of the bias factor, α, where 0 � α � 1, the URLs in the bias set becomes irrelevant to the final score assignment.

 

Query-Sensitive Scoring

 

This tells us how effectively the ranking precision gained by the use of multiple PageRank vectors can be utilized. Given a query, the first task is to determine which of the rank vectors can best rank the results for the query. For most test queries, the ODP categories are intuitively the most relevant categories for the query.

 

Context-Sensitive Scoring

 

If the search is done in context, then the context can be used instead of the query to determine the topics. Using the context can help disambiguate the query term and yield results that more closely reflect the original intent of the user.

 

Sources of Search Context

 

There are a variety of other sources of context that may be used in this algorithm scheme, for example, the history of queries issued leading up to the current query is another form of query context.  In addition to these types of context associated with the query itself, another source that can be potentially