Lies, Damn Lies and Search Engine Stats: War Diary of a Con Job
(rt) Ever since Yahoo and MSN joined the fray with newly consolidated and/or fresh resources, the notorious Search Engine Wars have flared up full scale again.
With the contenders being competing public corporations, constantly in the limelight of shareholders, investment analysts and consultants, it is hardly surprising to see them flaunt their index size figures all over the place – size, after all, is a concept much easier to grasp for most people (including those working in the media) than “quality” or “relevancy”. Being traded on the stock exchange equates with access to funding by banks and investors – a crucial entrepreneurial factor in an industry requiring vast amounts of resources for research and development in an ever more sophisticated technological environment.
However, the more frantic the competition gets, the more some hard-nosed analysis of their fundamental claims seems called for. What is verifiable data, what are merely wild claims … and what are investors and search marketers supposed to make of it all?
Admittedly, the material presented below is old hat for many of the more Google-critical and number-crunching SEOs out there. However, for those (presumably the vast majority) who haven’t seen it yet, it seems more than justified to point it out. The main reason is that – rather than constituting yet another contribution towards Google bashing if not trashing – the research outlined impacts all your search analytics, ranging from the actual size of Google’s index (in blatant contrast to their on-site claims, which effectively inflate their figures by 66%!) via their loss of millions of web pages to the perennial “my index rocks, your index sucks” race between the major search operators, broken search algos, and much, much more.
So let’s introduce a French-and-English blog maintained by Jean Véronis of Aix-en-Provence, France, professor of Linguistics and Information Technology and director of → CILSH Centre Informatique pour les Lettres et Sciences Humaines (approx. Computing Center for Letters and Humanities) at the → Université de Provence in Aix. Véronis also doubles as director of → DELIC DEscription Linguistique Informatisée sur Corpus (approx. Computerized Corpus-Based Linguistic Description), a major research center for Corpus Linguistics.
The blog is named Technologie du langage and, as the title makes clear, is focused on → Language Technology.
Some of Véronis’ published findings are quite spectacular and have set in motion a flurry of corrective activities at the search engines’ headquarters. Nor is this a mere frivolous claim on the author’s part, as this (French only) piece shows: → Web: Le futur selon Yahoo. Here, he mentions Jan Pedersen, Chief Scientist at Yahoo!, citing Véronis at length and even displaying his graphs in the course of a presentation at the → 10th Search Engine Meeting in Boston, Massachusetts, which took place only last month.
For the following overview we have only listed those of Véronis’ postings available in English. We may present a digest of his French material at some later point. The titles are fairly self-explanatory, so we will restrict ourselves to short quotes – highly recommended reading!
The entries are sorted in chronological order.
January 26, 2005
→ Web: Google’s counts faked?
Quote:
In any case, I would not recommend professional uses of Google’s counts (such as → “Google linguistics”). Yahoo! seems more reliable — or are they simply cleverer?
February 08, 2005
→ Web: Google’s missing pages: mystery solved?
Here’s his take on Google’s botched Boolean search algo:
In all likelihood, the Google engineers simply forgot to plug the extrapolation routine at the end of the boolean module! Therefore, if you want to know the real index count for any word, simply type it twice:
Word                     Count
stuttering               749,000
stuttering stuttering    452,000
The second line is likely to be the real count…
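To see how the two counts quoted above relate to the 66% inflation figure, here is a minimal sketch of the arithmetic. The function name inflation_percent and the variable names are our own illustration, not anything taken from Véronis’ postings, and it assumes (as he suggests) that the doubled-word count is the real one.

```python
def inflation_percent(reported: int, baseline: int) -> float:
    """Return by how many percent `reported` exceeds `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (reported - baseline) / baseline * 100.0


# Counts quoted above for the query word "stuttering".
single_word = 749_000   # count shown when the word is typed once
doubled_word = 452_000  # count shown when the word is typed twice (assumed real)

pct = inflation_percent(single_word, doubled_word)
print(f"{pct:.1f}% inflation")  # prints "65.7% inflation"
```

On this single example the displayed count comes out roughly 66% above the doubled-word count, which matches the figure cited for Google. One query proves little by itself, of course; it merely shows how the percentage is derived.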
February 28, 2005
→ Web: MSN cheating too?
Quote:
Google : 66% inflation ; MSN : 33% inflation. About half. Coincidental ?
In any case, so far only Yahoo’s results seem coherent (should I say sincere ?). The irony is that Google probably inflated its count because of MSN’s pressure, when MSN announced 5 billion pages, but it seems that MSN is playing a trick too!
March 09, 2005
→ Web: Yahoo doubles its counts!
Quote:
Yahoo has clearly caught up on Google in terms of size and quality (relevance, freshness, etc.), and is beginning to gain more and more respect among professional users, experts, academics (a good step was the release of a very nice API a few days ago).
March 13, 2005
→ Web: Google adjusts its counts
Quote:
The Googlers must have been slightly embarrassed, and since the study was published (Feb. 8th), they have been adjusting the counts in a major way to correct the situation.
March 23, 2005
→ Google: 5 billion “the” have disappeared overnight
Quote:
Interestingly enough, the new results reveal very clearly that Yahoo indexes more pages than Google
March 25, 2005
→ Google: A snapshot of the update
Google is currently undergoing major modifications, in which the problem is no more a simple index update, but an in-depth correction of extrapolation routines and boolean logic, in order to fix the count aberrations
In other words: this is no mere “slightly tweaking the algo to improve search results” – it’s a full-fledged upscaling problem, an issue regular readers of fantomNews have been familiar with for years.
And the show goes on …
[Keywords: SEO, search engine optimization, search engine research, SEO/SEM resources, web analytics, web stats ]