Florida International University
School of Computing and Information Sciences
Web Page Ranking Algorithms
COP 6727 Advanced Database System
Professor: Vagelis Hristidis
On the web courtesy of:
CPC Computer Consultants, Inc. www.cpccci.com
Table of Contents
Page Quality: In Search of an Unbiased Web Ranking
    PageRank and Popularity
    Quality and PageRank
    Measuring Page Quality
    Web User Model
        Popularity Evolution
        Quality Estimator
    Time Evolution of Page Popularity
    Popularity Evolution when Quality Drops at t=30
Topic-Sensitive PageRank
    ODP Biasing
    Query-Time Importance Score
    Experimental Results
        Similarity Measure for Induced Rankings
        Effect of the ODP Biasing
        Query-Sensitive Scoring
        Context-Sensitive Scoring
    Sources of Search Context
    Different search contexts for the query "blues"
    Results for the query "blues" with content-sensitive scoring
    Pairwise comparison of topically-biased rankings
    Precision at ten results for our test queries (the average precision over the ten queries is also shown)
Web Page Scoring Systems for Horizontal and Vertical Search
    PageRank and Random Walks
        Uniform Jump Probabilities
        Multiple State Model
    Horizontal WPSS
        The HITS Ranking System
        The PageRank-HITS Model
    Vertical WPSS
        Focused PageRank
        Double Focused PageRank
    Distribution of page scores for the topics "Linux" (top) and "cooking recipes" (bottom)
    Percentage of positive judgments assessed by a set of ten users
Ranking Systems: The PageRank Axioms
    Page Ranking
    The Axioms
        Self Edge
        Vote by Committee
    Sketch of Delete
    Sketch of Duplicate
    Example run of the completeness algorithm
    Sketch of Several Axioms
This paper presents a survey of several Web page ranking algorithms. We analyze and compare algorithms that introduce novel approaches to Internet searching and to the rankings that Search Engines, Google in particular, assign to the results of a query.
After introducing and surveying the four proposed algorithms, we convey our conclusions, summarizing our opinions and recommending the algorithm that promises the greatest beneficial impact on the search for relevant information on the Internet.
The purpose of this paper is to compare four different Web page ranking algorithms for Internet searching. These algorithms are the following:
1. Page Quality: In Search of an Unbiased Web Ranking
2. Topic-Sensitive PageRank
3. Web Page Scoring Systems for Horizontal and Vertical Search
4. Ranking Systems: The PageRank Axioms
We will begin the survey of each algorithm with a brief introduction, followed by an analysis with corresponding graphs or figures. We shall finalize this paper with our own conclusions, opinions, and recommendations.
Popular pages tend to get more popular, while unpopular pages tend to be ignored; in the long run this decreases the overall quality of search results. This phenomenon presents a problem for newly published Web sites that do not yet have a following. This algorithm attempts to alleviate the situation in which popular pages become even more popular while the others continue to be ignored. The proposed solution is to introduce a "quality estimator", which measures the intrinsic quality of a Web page. Furthermore, the algorithm compares itself with Google's current PageRank, presenting an analysis of cases where PageRank is accurate and where it is not.
The current PageRank metric defines the importance of a particular page as the sum of the importance of the pages that point to it. A damping factor is also taken into account, representing the chance that the user randomly jumps to an unrelated link after following relevant links for a while. A high PageRank therefore implies that many users are interested in a particular page, and that more users are likely to visit it than other, lower-ranking pages.
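The damped link-counting idea just described can be sketched with a small power-iteration routine (a minimal illustration on a made-up four-page graph, not Google's actual implementation):

```python
# Minimal PageRank sketch: each page's score is the damped sum of the scores
# of the pages linking to it, each divided by the linking page's out-degree.
def pagerank(links, damping=0.85, iterations=100):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outlinks in links.items():
            if not outlinks:                 # dangling page: spread evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                for q in outlinks:
                    new_rank[q] += damping * rank[p] / len(outlinks)
        rank = new_rank
    return rank

# "a" is pointed to by three pages, so it accumulates the highest score
ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"], "d": ["a"]})
```

Since every page redistributes its whole damped score, the ranks always sum to one, which is what makes the scores comparable across pages.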
The key feature of PageRank is that it measures the popularity of a page: if a user looked at a page and liked it, it is reasonable to assume that subsequent visitors will also like the page when they see it for the first time. Thus, the algorithm in effect increases the probability that users will like the first few results.
The bias that PageRank has against unpopular pages is accentuated even more for freshly created Web pages, which have not yet had the opportunity to be seen. Since such a page does not appear in the top search results, very few people get to see it, and it becomes even more unpopular, despite the fact that it may be a very high quality page with much valuable information.
Therefore, Page Quality suggests using the "probability" that a page will be liked when seen for the first time, instead of the "popularity" of a Web page, to rank search results. Thus, the quality of a page is defined as:
"Page Quality: Define the quality of a page p, Q(p), as the conditional probability that an average user will like the page when the user sees the page for the first time."
Q(p) = P(Lp | Ap), where Ap represents the event that the user becomes aware of p by visiting the page for the first time, and Lp represents the event that the user likes the page.
The first assumption in developing this idea is that the creation of links pointing to a particular page means the page is well liked. The second assumption is that a high quality page will be liked more quickly overall. The challenge then becomes how to use the popularity increase to measure the quality of the page. To address this, a new model is created: the "web-user" model.
Definition: The popularity of page p at time t, P(p, t), is the fraction of web users who like the page.
Definition: The visit popularity of a page p at time t, V(p, t), is the number of "visits" or "page views" the page gets within a unit time interval at time t.
Definition: The user awareness of page p at time t, A(p, t), is the fraction of web users who are aware of p at time t.
The first hypothesis is based on the random-surfer interpretation of PageRank, which helps us generate the following propositions: 1. The number of visits to page p within a unit time interval at time t is proportional to how many people like the page. 2. All web users will visit a particular page with equal probability.
We can estimate the current popularity of the page p, by estimating how many new visitors will visit p. Then we can estimate how many will like the page in their first visit, and thus how much its popularity will increase.
Based on the results of estimations using sound theorems and proofs, it can be concluded that a Web site goes through three stages after it is first launched: the infant, expansion, and maturity stages. In the infant stage, the page is barely noticed by Web users and has practically zero popularity. At some point the page enters the expansion stage, where its popularity suddenly increases. Finally, the popularity of the page stabilizes at a certain value, entering the maturity stage.
The quality of a page is proportional to its popularity increase and inversely proportional to its current popularity. It is also inversely proportional to the fraction of users who are unaware of the page. The quality of a page does change over time, and the quality estimator is able to capture such change.
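The three stages and the estimator described above can be reproduced in a toy simulation (our own sketch with made-up rates, not the paper's exact equations): visits are proportional to current popularity, each visit may reach a user who was unaware of the page, and a fraction equal to the page's intrinsic quality ends up liking it.

```python
# Toy web-user model: popularity = fraction of users who like the page,
# awareness = fraction of users who have seen it at least once.
def simulate(quality, steps=50, visit_rate=0.5, seed=0.001):
    aware, liked = seed, seed * quality
    pop_history, aware_history = [], []
    for _ in range(steps):
        # new first-time visitors: proportional to popularity and to the
        # fraction of users who are still unaware of the page
        new_aware = visit_rate * liked * (1.0 - aware)
        aware += new_aware
        liked += quality * new_aware       # the fraction who like it
        pop_history.append(liked)
        aware_history.append(aware)
    return pop_history, aware_history

def estimate_quality(pop, aware, visit_rate=0.5):
    # quality ~ popularity increase / (popularity * fraction still unaware)
    delta = pop[-1] - pop[-2]
    return delta / (pop[-2] * (1.0 - aware[-2]) * visit_rate)

pop_hi, aware_hi = simulate(quality=0.8)
pop_lo, _ = simulate(quality=0.2)
```

Plotting pop_hi exhibits the three stages (a flat infant phase, a sharp expansion, and a plateau at maturity), and the estimator recovers the intrinsic quality from the observable popularity and awareness histories.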
In Topic-Sensitive PageRank, multiple importance scores are computed for each page, one per topic, and these form a composite PageRank score for the pages matching a query. During the offline crawling process, 16 topic-sensitive PageRank vectors are generated, guided by the top-level categories of the Open Directory Project (ODP). At query time, the similarity of the query to each of these topics is computed; then, instead of using a single global ranking vector, a linear combination of the topic-sensitive vectors, weighted by the similarity of the query to the topics, is used. This method yields a very accurate set of results relevant to the context of the particular query.
The first step in this approach is performed once, during offline preprocessing of the Web crawl. It generates a set of biased PageRank vectors using a set of "basis" topics. Specifically, the biased PageRank vectors are created from the URLs listed below the 16 top-level categories of the ODP; these constitute the personalization vectors. A single, non-biased PageRank vector is also computed for comparison purposes. Finally, a set of 16 class term-vectors is computed, consisting of the terms in the documents below each of the 16 top-level categories.
The second step is performed at query time. If a query q was issued by highlighting the term q in some Web page u, then q' (the context of q) consists of the terms in u. For ordinary queries not done in context, let q' = q. Using a unigram language model, with parameters set to their maximum-likelihood estimates, the class probabilities for each of the 16 top-level ODP classes, conditioned on q', are computed. Using a text index, the URLs of all documents containing the original query terms q are retrieved. Finally, the query-sensitive importance score of each retrieved URL is computed, and the results are ranked according to the composite score.
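The query-time step can be sketched as follows (a toy illustration with two stand-in classes and hypothetical per-topic ranks, rather than the actual 16 ODP categories):

```python
# Hypothetical class term-vectors and per-topic PageRank values, standing in
# for the ODP data described in the text.
from collections import Counter

topic_terms = {
    "Sports":  ["golf", "score", "league", "blues", "team"],
    "Arts":    ["music", "blues", "jazz", "guitar", "album"],
}
topic_rank = {
    "Sports": {"espn.com": 0.9, "bluesmusic.com": 0.1},
    "Arts":   {"espn.com": 0.1, "bluesmusic.com": 0.9},
}

def class_probs(query_terms):
    """Unigram maximum-likelihood P(class | q'), with a uniform class prior."""
    scores = {}
    for topic, terms in topic_terms.items():
        counts = Counter(terms)
        likelihood = 1.0
        for t in query_terms:
            likelihood *= counts[t] / len(terms)   # P(term | class)
        scores[topic] = likelihood
    total = sum(scores.values()) or 1.0
    return {topic: s / total for topic, s in scores.items()}

def composite_score(url, query_terms):
    """Topic-sensitive ranks linearly combined by the class probabilities."""
    probs = class_probs(query_terms)
    return sum(probs[c] * topic_rank[c][url] for c in probs)

# "blues" issued in a musical context is pulled towards the Arts vector
probs = class_probs(["blues", "guitar"])
```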
The evaluation uses two measures when comparing rankings. The first indicates the degree of overlap between the top URLs of two rankings. This measure, however, does not give a complete picture of the similarity between the two rankings, since it does not indicate the degree to which the relative orderings of the top URLs agree. The second uses a variant of a distance measure between Web pages. Based on these, the similarity measure is defined as the probability that two rankings agree on the relative ordering of a randomly selected pair of distinct Web pages.
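The two measures can be sketched as follows (our own minimal implementations of the top-k overlap and the pairwise order-agreement probability):

```python
def overlap_similarity(rank_a, rank_b, k):
    """Fraction of the top-k URLs the two rankings have in common."""
    return len(set(rank_a[:k]) & set(rank_b[:k])) / k

def order_agreement(rank_a, rank_b):
    """Probability that the rankings agree on the relative order of a
    randomly chosen pair of distinct pages appearing in both rankings."""
    pos_a = {u: i for i, u in enumerate(rank_a)}
    pos_b = {u: i for i, u in enumerate(rank_b)}
    common = [u for u in rank_a if u in pos_b]
    agree = total = 0
    for i in range(len(common)):
        for j in range(i + 1, len(common)):
            u, v = common[i], common[j]
            total += 1
            if (pos_a[u] < pos_a[v]) == (pos_b[u] < pos_b[v]):
                agree += 1
    return agree / total if total else 1.0

rank_x = ["a", "b", "c", "d"]
rank_y = ["d", "c", "b", "a"]   # same pages, fully reversed order
```

Note how the two measures disagree on purpose: rank_x and rank_y have perfect top-4 overlap yet zero order agreement, which is exactly the gap the second measure is meant to close.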
This measures the effect of topically biasing the PageRank computation. It should be noted that the choice of the bias factor affects the degree to which the resultant vector is biased towards the topic vector. For lower values of the bias factor α, where 0 ≤ α ≤ 1, the URLs in the bias set become irrelevant to the final score assignment.
This tells us how effectively the ranking precision gained by the use of multiple PageRank vectors can be utilized. Given a query, the first task is to determine which of the rank vectors can best rank the results for the query. For most test queries, the ODP categories are intuitively the most relevant categories for the query.
If the search is done in context, then the context can be used instead of the query to determine the topics. Using the context can help disambiguate the query term and yield results that more closely reflect the original intent of the user.
There are a variety of other sources of context that may be used in this scheme; for example, the history of queries issued leading up to the current query is another form of query context. In addition to context associated with the query itself, another source that can potentially be utilized is query-independent user context, such as the user's browsing patterns, bookmarks, and email archives.
The general Web page scoring model proposed extends both PageRank and the HITS scheme. Many new features are explored which greatly expedite focused (vertical) searches. The content of the pages is combined with the graph structure of the Web, giving rise to scoring mechanisms focused on a specific topic. Furthermore, when doing vertical searches, the algorithm can take into account the mutual relationships among different topics, so that the discovery of pages with a high score for a given topic affects the score of pages with related topics.
The Web is represented as a graph G, where each Web page is a node and each hyperlink between pages is an edge between the corresponding nodes. It is generally assumed that the relevance of a page is represented by the probability of ending up at that page during a walk on this graph.
Within this framework, it is assumed that a user surfing the Internet may undertake any of the following atomic actions at any time:
1. Jump to a node of the graph
2. Follow a hyperlink from the current page
3. Follow a back-link
4. Stay in the same node
Furthermore, it is assumed that the actions he decides to take depend on the page contents and the links it contains. Therefore, if the user likes the page, he will likely follow a link contained there; otherwise, he will jump to a new page, perhaps an unrelated one.
This model investigates the jump probabilities from one page to the next, which are independent of the starting page. It is assumed that a user makes random jumps from a given page to any other page with uniform probability.
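One step of such a surfer can be sketched as follows (the action probabilities are made-up values, and the model is our own illustration of the four atomic actions with a uniform random jump):

```python
# One-step transition distribution combining the four atomic actions:
# random jump, follow a hyperlink, follow a back-link, or stay put.
def step_distribution(links, current, p_jump=0.1, p_follow=0.7,
                      p_back=0.1, p_stay=0.1):
    pages = list(links)
    n = len(pages)
    dist = {p: p_jump / n for p in pages}            # uniform random jump
    dist[current] += p_stay                          # stay on the same node
    outs = links[current]
    for q in outs:                                   # follow a hyperlink
        dist[q] += p_follow / len(outs)
    if not outs:                                     # dangling: jump instead
        for p in pages:
            dist[p] += p_follow / n
    backs = [p for p, out in links.items() if current in out]
    for q in backs:                                  # follow a back-link
        dist[q] += p_back / len(backs)
    if not backs:                                    # no in-links: jump instead
        for p in pages:
            dist[p] += p_back / n
    return dist

# From "b", page "a" is reachable as both a forward link and a back-link
dist = step_distribution({"a": ["b"], "b": ["a", "c"], "c": []}, "b")
```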
In order to model the activity of different users, a set of state variables is used that represents the probability of each user visiting a particular page at a given time. The interaction between surfers is modeled by a set of parameters defining the probability that a surfer accepts the suggestion of another surfer, thus jumping from the page he was visiting to the one visited by the other. This interaction happens before the choice among the actions described previously.
Horizontal Web Page Scoring Systems do not consider any information on the page contents and produce the rank vector using just the topological characteristics of the Web graph.
The current ranking methodology is based on a random walk model defined by a single state variable. Only two actions are considered: the surfer jumps to a new random page or he follows one link from the current page.
A new algorithm, the HITS algorithm, was proposed to model documents relying only on the information hidden in the connections among them, arising from co-citation through Web hyperlinks. In this context, Web pages are divided into two categories: pages that are informational in nature, the "authorities"; and pages that link to information sources, the "hubs". The HITS algorithm assigns two numbers to each page, an authority score and a hub score, in order to model the importance of the page. Therefore, since HITS uses two state variables, the hubness and the authority of a page, the corresponding random walk model is a multiple-state scheme based on the activity of two surfers.
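The two coupled scores can be sketched with the standard HITS iteration (a textbook formulation on a made-up four-page graph, rather than the paper's two-surfer random walk):

```python
# HITS sketch: a page's authority is the sum of the hub scores of the pages
# linking to it; its hub score is the sum of the authorities it links to.
def hits(links, iterations=50):
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # normalize so the scores do not blow up across iterations
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# "c" is cited by both linking pages, so it gets the top authority score,
# while "b" links to both information sources and becomes the top hub
hub, auth = hits({"a": ["c"], "b": ["c", "d"], "c": [], "d": []})
```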
PageRank is a stable algorithm; it works well in most cases because of its probabilistic component, and it has widespread application without detriment to the influence of smaller Web portals or communities. PageRank, however, is too simple to take into account the complex relationships of Web page citations. The HITS algorithm, on the other hand, is not as stable; however, its hub and authority model can capture the relationships among Web pages better than the PageRank algorithm can.
A Vertical Web Page Scoring System aims at computing a relative ranking of pages when focusing on a specific topic. A vertical WPSS relies on representing the page content with a set of features, and on a classifier used to assess the degree of relevance of the page with respect to the topic of interest. Vertical WPSSs can produce more accurate rankings of topic-specific pages.
Instead of the random surfer model, in the focused domain one may consider the more realistic case of a surfer who follows links according to the suggestions provided by a page classifier. This scoring system removes the assumption of complete randomness of the underlying Web surfer in this specific scenario.
A more realistic model must account for the fact that the decision about the action very likely depends on the contents of the current page. This behavior can be modeled by adapting the action probabilities using the contents of the current page, providing a model of the focused choice of the surfer's actions.
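Such a focused surfer can be sketched by weighting each out-link by a classifier score for the target page (the relevance values here are hypothetical stand-ins for a real topic classifier):

```python
# Focused PageRank sketch: the surfer follows each out-link in proportion to
# the classifier's relevance score for the target page, instead of uniformly.
def focused_pagerank(links, relevance, damping=0.85, iterations=100):
    """relevance: classifier score in (0, 1] of each page for the topic."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            weights = {q: relevance[q] for q in outs}
            total = sum(weights.values())
            if total == 0:                 # no relevant out-links: random jump
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                for q, w in weights.items():
                    new_rank[q] += damping * rank[p] * w / total
        rank = new_rank
    return rank

# "b" and "c" have the same link structure; the classifier breaks the tie
r = focused_pagerank({"a": ["b", "c"], "b": ["a"], "c": ["a"]},
                     relevance={"a": 0.5, "b": 0.9, "c": 0.1})
```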
Figure: distributions of page scores for the topics "Linux" and "Golf"; in both cases the PageRank distribution is much smoother than the HITS one.
The importance of Search Engine ranking systems, Google's in particular, is paramount in understanding the influence on many of today's Internet sites. One important such ranking system is Google's PageRank algorithm. To study the rationale of using a particular page ranking algorithm, and to address the question of what makes this particular algorithm different from any other, an axiomatic approach is introduced, similar to that of the mathematical theory of social choice.
If the Internet is treated as a graph where the nodes (pages) are agents, and the links originating from a node define the preferences of the corresponding agent (a page the node links to is preferable to a page it does not link to), then the page ranking problem becomes the problem of aggregating individual rankings into a global (social) ranking.
Today's practice by the Search Engines of ranking Web pages in response to a particular query is based on computing the stationary probability distribution of a surfer's random path along the Internet graph, where the pages are nodes and the links are edges.
Google's PageRank algorithm can be thought of as a matrix capturing the random path created by the PageRank procedure. In this process one begins at a random page and iteratively moves to one of the pages linked to by the current page, assigning equal probability to each such page.
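This stationary-distribution view can be illustrated on a toy strongly connected graph (our own example): iterate the uniform out-link walk until the page probabilities settle.

```python
# Simple random-walk ranking: move from a page to one of its out-links with
# equal probability; the stationary probabilities are the page ranks.
def stationary_rank(links, iterations=1000):
    pages = list(links)
    prob = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for p, outs in links.items():
            for q in outs:
                nxt[q] += prob[p] / len(outs)
        prob = nxt
    return prob

# "a" receives links from both other pages, so it ends up ranked highest
prob = stationary_rank({"a": ["b", "c"], "b": ["a", "c"], "c": ["a"]})
```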
The isomorphism axiom tells us that the ranking procedure should be independent of the names we choose for the vertices.
If a web page a is ranked at least as high as a web page b, where a does not link to itself, then a should be ranked higher than b if all we do is add a link from a to itself.
If page a links to pages b and c, then the relative ranking of all pages should be the same as in the case where the direct links from a to b and c are replaced by links from a to a new set of pages, which link only to b and c.
If there is a pair of pages, say a and b, where both link to the same set of pages but the sets of pages that link to a and to b are disjoint, then if we collapse a and b into a single page, say a, where all links to b now become links to a, the relative ranking of all pages (excluding a and b, of course) should remain as before.
If there is a set of k pages, all having the same importance, which link to a, where a itself links to k pages, then if we drop a and connect directly, and in a 1-1 fashion, the pages which linked to a to the pages that a linked to, then the relative ranking of all pages (excluding a) should remain the same.
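The delete transformation can be checked on a toy graph (our own illustration with k = 1, using the simple random-walk ranking the axioms are stated for): x links to y only through a, and dropping a while linking x directly to y should leave the relative ranking of x, y, and z unchanged.

```python
def stationary_rank(links, iterations=2000):
    """Cesaro-averaged power iteration for the simple random-walk ranking;
    averaging the iterates also handles walks with periodic structure."""
    pages = list(links)
    prob = {p: 1.0 / len(pages) for p in pages}
    avg = {p: 0.0 for p in pages}
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for p, outs in links.items():
            for q in outs:
                nxt[q] += prob[p] / len(outs)
        prob = nxt
        for p in pages:
            avg[p] += prob[p] / iterations
    return avg

before = stationary_rank({"x": ["a"], "a": ["y"], "y": ["x", "z"], "z": ["y"]})
after = stationary_rank({"x": ["y"], "y": ["x", "z"], "z": ["y"]})
# y outranks x and z both before and after a is deleted, and x and z stay tied
```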
We believe that all four algorithms surveyed in this paper present viable alternatives to enhance the current PageRank algorithm used by Google, the most influential Search Engine on the Internet, whose Web page ranking method is used by hundreds of other, less prominent search utilities.
We do, however, favor one algorithm in particular: Topic-Sensitive PageRank. We believe it will be the most effective, and the one promising the most beneficial impact on the way users search for information on the Internet. Implementing the Topic-Sensitive PageRank algorithm would require changing some standards in the way Web pages are configured, such as adding a new set of HTML tags that include the Category and Sub-Category (and perhaps one or more additional sub-categories) of that specific Web page. This way, when Search Engines crawl the Internet, they would index the Web pages, placing them in their corresponding Categories. Likewise, the Search Engines would need to modify the query request, incorporating a field for the user to enter the Category or Sub-Category relevant to his query at that given time.
 The Google Search Engine: Commercial search engine founded by the originators of PageRank. http://www.google.com/.
 The Open Directory Project: Web directory for over 2.5 million URLs. http://www.dmoz.org/.
 "More Evil Than Dr. Evil?" http://searchenginewatch.com/sereport/99/11-google.html.
 Krishna Bharat and Monika R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proceedings of the ACM-SIGIR, 1998.
 Krishna Bharat and George A. Mihaila. When experts agree: Using non-affiliated experts to rank popular topics. In Proceedings of the Tenth International World Wide Web Conference, 2001.
 Sergey Brin, Rajeev Motwani, Larry Page, and Terry Winograd. What can you do with a web in your pocket. In Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 1998.
 Sergey Brin and Larry Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International World Wide Web Conference, 1998.
 S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the Seventh International World Wide Web Conference, 1998.
 Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proceedings of the Tenth International World Wide Web Conference, 2001.
 Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing search in context: the concept revisited. In Proceedings of the Tenth International World Wide Web Conference, 2001.
 Taher H. Haveliwala. Efficient computation of PageRank. Stanford University Technical Report, 1999.
 J. Hirai, S. Raghavan, H. Garcia-Molina, and A. Paepcke. Webbase: A repository of web pages. In Proceedings of the Ninth International World Wide Web Conference, 2000.
 Glen Jeh and Jennifer Widom. Scaling personalized web search. Stanford University Technical Report, 2002.
 Jon Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1998.
 Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge University Press, United Kingdom, 1995.
 Larry Page. PageRank: Bringing order to the web. Stanford Digital Libraries Working Paper, 1997.
 Davood Rafiei and Alberto O. Mendelzon. What is this page known for? Computing web page reputations. In Proceedings of the Ninth International World Wide Web Conference, 2000.
 Matthew Richardson and Pedro Domingos. The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank, volume 14. MIT Press, Cambridge, MA, 2002 (to appear).
 J. Kleinberg, "Authoritative sources in a hyperlinked environment." Report RJ 10076, IBM, May 1997.
 K. Bharat and M. Henzinger, "Improved algorithms for topic distillation in a hyperlinked environment," in Proceedings of the 21st ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 104–111, 1998.
 R. Lempel and S. Moran, "The stochastic approach for link-structure analysis (SALSA) and the TKC effect," in Proceedings of the 9th World Wide Web Conference, 2000.
 R. Lempel and S. Moran, "SALSA: The stochastic approach for link-structure analysis," ACM Transactions on Information Systems, vol. 19, pp. 131–160, April 2001.
 D. Cohn and H. Chang, "Learning to probabilistically identify authoritative documents," in Proceedings of the 17th International Conference on Machine Learning, pp. 167–174, Morgan Kaufmann, San Francisco, CA, 2000.
 D. Cohn and T. Hofmann, "The missing link: a probabilistic model of document content and hypertext connectivity," in Neural Information Processing Systems, vol. 13, 2001.
 E. Seneta, Non-negative Matrices and Markov Chains. Springer-Verlag, 1981.
 M. Joshi, V. Tawde, and S. Chakrabarti, "Enhanced topic distillation using text, markup tags, and hyperlinks," in International ACM Conference on Research and Development in Information Retrieval (SIGIR), August 2001.
 M. Diligenti, F. Coetzee, S. Lawrence, L. Giles, and M. Gori, "Focused crawling using context graphs," in Proceedings of the International Conference on Very Large Data Bases, Cairo, Egypt, pp. 527–534, September 2000.
 B. Amento, L. Terveen, and W. Hill, "Does authority mean quality? Predicting expert quality ratings of Web documents," in Proceedings of the 23rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 296–303, 2000.
 Page Quality: In Search of an Unbiased Web Ranking, p. 553
 Web Page Scoring Systems for Horizontal and Vertical Search, p. 511
 Ranking Pages: The PageRank Axioms, p. 3