As more and more people rely on search engines as the starting point for finding
information, it has become essential for a page to rank among the top few
results of popular search engines. Most search engines use, among other things, variants
of the classic PageRank algorithm, which relies on the link structure of the web to rank
pages. To make their pages rank higher than they deserve, some web designers
resort to all sorts of tricks to mislead search engines, manipulating linkage (link-spam)
and content (term-spam) on their pages and on the web, and in the process giving rise to
what has come to be called web-spam. There is a continuing clash between search engine
algorithm-designers and web-spammers leading to this battleground of the Adversarial
Our main focus in this report is link-spam. We take a look at the different methods of
combating link-spam. We also look at optimal link-spam structures and test them using
Java code. We implement popular ranking algorithms and test their efficacy
on a web-graph made available by Webaroo.
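As a rough illustration of how a link-based ranking responds to link-spam, the following is a minimal Java sketch of PageRank computed by power iteration. It is not the implementation used in this report: the class name PageRankSketch, the damping factor of 0.85, the uniform redistribution of rank from pages without outlinks, and the tiny hand-made graphs are all assumptions chosen for illustration.

```java
import java.util.Arrays;

/** Illustrative PageRank power iteration on a tiny hand-made graph. */
final class PageRankSketch {

    /**
     * @param outLinks   outLinks[i] lists the pages that page i links to
     * @param damping    damping factor (0.85 is an assumed, commonly used value)
     * @param iterations number of power-iteration steps
     * @return rank vector that sums to 1
     */
    static double[] pageRank(int[][] outLinks, double damping, int iterations) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // rank held by pages with no outlinks
            for (int i = 0; i < n; i++) {
                if (outLinks[i].length == 0) {
                    dangling += rank[i];
                } else {
                    double share = rank[i] / outLinks[i].length;
                    for (int j : outLinks[i]) {
                        next[j] += share;
                    }
                }
            }
            for (int j = 0; j < n; j++) {
                // Dangling rank is spread uniformly (an assumption of this sketch).
                next[j] = (1 - damping) / n + damping * (next[j] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Pages 0..2 form a small cycle; page 3 is a hypothetical target page.
        int[][] honest = { {1}, {2}, {0}, {0} };
        // Same graph plus three hypothetical boosting pages (4, 5, 6) that link to the target.
        int[][] spammed = { {1}, {2}, {0}, {0}, {3}, {3}, {3} };

        System.out.println("target rank without boosting: " + pageRank(honest, 0.85, 50)[3]);
        System.out.println("target rank with boosting:    " + pageRank(spammed, 0.85, 50)[3]);
    }
}
```

In this toy setting the target page has no inlinks in the first graph, so adding boosting pages that point to it raises its rank. Real web graphs are far larger and the effect depends on the structure of the surrounding links.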
Web Spam Taxonomy:
As a first step in preparing countermeasures, it is prudent to understand the
spammers' 'arsenal'. This section describes attempts to organize web-spamming
techniques into a taxonomy. It also briefly touches on published statistics about web-spam.
There have been discussions in literature and on the web, but we draw heavily from.
We use two terms: importance, the ranking of a page in general, and relevance, the
ranking of a page with respect to a specific query.
Link Spamming Techniques:
To delve into link spamming let’s categorize pages according to the way they can be
manipulated by spammers to influence results:
a. Inaccessible pages: Spammers cannot modify these pages. However, they can
point to them.
b. Accessible pages: These pages don't belong to the spammer, but the spammer can modify
their content in a limited manner. Typical examples are wikis and
blog comments.
c. Own pages: The spammer controls these pages and wants to boost the ranking of one or
more of them, the target pages, t. The number of own pages is capped by a budget (e.g. web
hosting costs).
The target algorithms: HITS, PageRank, TrustRank, etc.
HITS ranks hub and authority pages [11]. For HITS, the spammer can easily obtain high
hub scores by adding outlinks to popular websites. Some spammers even pay users of
high-ranked .edu authority pages to point to their spammy pages. The spammer can
obtain high authority scores by having his unscrupulous hub pages point to a target page,
which can then become an authority page.
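The hub-score manipulation described above can be sketched with a minimal Java version of the HITS iteration. This is an illustrative sketch under assumptions, not the code used in this report: the class name HitsSketch, the L2 normalization, the fixed iteration count, and the six-page toy graphs are all hypothetical.

```java
import java.util.Arrays;

/** Illustrative HITS iteration on a tiny hand-made graph. */
final class HitsSketch {

    /**
     * @param outLinks   outLinks[i] lists the pages that page i links to
     * @param iterations number of update rounds
     * @return a two-row array: row 0 holds hub scores, row 1 holds authority scores
     */
    static double[][] hits(int[][] outLinks, int iterations) {
        int n = outLinks.length;
        double[] hub = new double[n];
        double[] auth = new double[n];
        Arrays.fill(hub, 1.0);
        for (int it = 0; it < iterations; it++) {
            // Authority of a page = sum of hub scores of the pages linking to it.
            double[] newAuth = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j : outLinks[i]) {
                    newAuth[j] += hub[i];
                }
            }
            normalize(newAuth);
            // Hub score of a page = sum of authority scores of the pages it links to.
            double[] newHub = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j : outLinks[i]) {
                    newHub[i] += newAuth[j];
                }
            }
            normalize(newHub);
            hub = newHub;
            auth = newAuth;
        }
        return new double[][] { hub, auth };
    }

    /** Scales v to unit Euclidean length; leaves an all-zero vector unchanged. */
    static void normalize(double[] v) {
        double sumSq = 0.0;
        for (double x : v) {
            sumSq += x * x;
        }
        double norm = Math.sqrt(sumSq);
        if (norm > 0) {
            for (int i = 0; i < v.length; i++) {
                v[i] /= norm;
            }
        }
    }

    public static void main(String[] args) {
        // Pages 0 and 1: popular pages. Pages 2 and 3: honest hubs linking to them.
        // Page 4: the spammer's hub. Page 5: the spammer's target page.
        int[][] withPopularLinks = { {}, {}, {0, 1}, {0, 1}, {0, 1, 5}, {} };
        int[][] withoutPopularLinks = { {}, {}, {0, 1}, {0, 1}, {5}, {} };

        double[][] a = hits(withPopularLinks, 50);
        double[][] b = hits(withoutPopularLinks, 50);
        System.out.println("spam hub score, linking to popular pages: " + a[0][4]);
        System.out.println("spam hub score, linking only to target:   " + b[0][4]);
        System.out.println("target authority, with popular links:     " + a[1][5]);
    }
}
```

In the first toy graph the spammer's hub copies the outlinks of the honest hubs and adds one link to the target, so its hub score is at least as high as theirs and the target inherits some authority. In the second graph the hub links only to the target and its hub score stays below that of the honest hubs.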
It is important for spammers to conceal their intent from a human visitor. Two techniques
used here are:
a. Content hiding: This involves making the spam invisible on the page. This can be done by
matching the spam's color to the background color or by using a 1x1-pixel image.
b. Cloaking: Spammers can serve one version of a page to humans and a different one to
crawlers. This is done by keeping track of the IP addresses of crawlers and serving them
different content.
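The IP-based selection just described, often called cloaking, can be illustrated with a few lines of Java. This is only a sketch of the decision logic: the class name CloakingSketch, the method selectContent, the two content strings, and the addresses (taken from ranges reserved for documentation) are hypothetical, and a real server would make this choice inside its request handler.

```java
import java.util.Set;

/** Illustrative sketch of choosing page content by client IP address. */
final class CloakingSketch {

    // Placeholder page bodies, invented for this example.
    static final String CRAWLER_VERSION = "page version served to crawlers";
    static final String HUMAN_VERSION = "page version served to human visitors";

    /**
     * Returns the crawler version when the client address is in the tracked
     * list of crawler addresses, and the human version otherwise.
     */
    static String selectContent(String clientIp, Set<String> knownCrawlerIps) {
        return knownCrawlerIps.contains(clientIp) ? CRAWLER_VERSION : HUMAN_VERSION;
    }

    public static void main(String[] args) {
        // Hypothetical crawler addresses from a documentation-only range.
        Set<String> knownCrawlerIps = Set.of("192.0.2.10", "192.0.2.11");

        System.out.println(selectContent("192.0.2.10", knownCrawlerIps));
        System.out.println(selectContent("203.0.113.5", knownCrawlerIps));
    }
}
```

The sketch also shows why the technique is fragile for the spammer: a crawler that fetches the page from an address outside the tracked list receives the human version, and the two versions can then be compared.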