I want to add a page (I wish I could say “an article”) with a provocative title to the more than 66’000’000 pages (yes, sixty-six million) returned by Google for the query ‘search sucks’. Individually, Google, Yahoo and Bing score 2’700’000, 11’500’000 (!) and 643’000 respectively, according to their own evaluations. Should we trust these figures? Well, yes and no. If search results are not always relevant, you cannot be sure that the texts on all those pages are about what you mean by your query. But since all search engines try to make results relevant, some portion of the results should be meaningful. Let’s explore how search works now, how trustworthy the results are, and what to do about it.
How Search Engines Work
I am not associated in any way with Google or the other big search service providers, so this article contains speculation based on available information, observations, experience and a couple of attempts to improve things. Thus what I am talking about here might be wrong, but I will be surprised if it turns out to be completely wrong.
Google Founders are Geniuses, or It All Is About Trust and Reputation
The idea of the PageRank algorithm is really brilliant: in addition to evaluating how closely a page’s content matches the query, links from other pages to your page also count. If a reputable page links to your page, your reputation (read: rank) increases: the author of that page trusts that when people navigate from her page to your page, it won’t harm her reputation (have you noticed we’re talking about relationships between people, not pages?).
I highly recommend reading an extremely interesting article by Ian Rogers about how PageRank works, and after that (you might prefer reading them in the opposite order, or even in parallel) an early paper on web search by Sergey Brin and Larry Page, or at least skimming through these. Reading this article will be more enjoyable if you understand what the damping factor is about.
That early paper, in addition to the page ranking algorithm, pays attention to relevance between the anchor text of the link pointing to your page, and the text on your page.
Surprisingly, that paper says nothing about content uniqueness and duplication: it looks like these became more important as things began to unfold and the spammers came, and nowadays the algorithm has become slightly more complicated.
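To make the reputation idea and the damping factor a bit more tangible, here is a minimal sketch of a PageRank-style iteration. It is my own toy version, not Google’s code: it uses the normalized variant in which all ranks sum to 1, the function name `pagerank`, the tiny four-page graph and the way dangling pages are handled are assumptions made for illustration, and only the damping factor of 0.85 comes from the Brin and Page paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. `links` maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal reputation

    for _ in range(iterations):
        # Every page keeps a small base share: the (1 - damping) part.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes the damped part of its rank to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page without outbound links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical graph: "d" links out, but nobody links to "d".
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, score in sorted(pagerank(toy_links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph the page that nobody links to ends up with only the base share, while the page that three others link to comes out on top, which is the “reputation through neighbors” effect in miniature.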
The Content, and The Neighborhood
Google tries to explain how they value links without disclosing how it really works. That article is still a good starting point for moving forward. It fits well into Ian Rogers’s explanation.
What is valued is you and your neighbors (and according to the original paper by Brin and Page, you’re worth only 15%, and your neighbors get the remaining 85%, but let’s not be greedy at the moment).
What counts is:
- Your page’s content
- Inbound links
- Content on the page, which contains a link to your page
- Anchor text on those pages
- Outbound links: to far lesser extent, but the principle is the same
- These rules are transitive (i.e., neighbors of neighbors still are neighbors) of course, to the extent reasonable
Each page is ranked independently: it matters little whether a page is under the umbrella of a big site, or is an individual blog. And this is one of the good things about Google.
Neighbors Know Better, or 1’000’000 Lemmings Can’t Be Wrong
It takes time to climb the social ladder and to gain reputation. So those who are right and at the same time are good neighbors get all the attention. If you’re a sociopathic genius, forget about the Top 10 at Google: you won’t be seen or heard. These are Google’s rules of survival.
But what if you’re searching for information, and you care more about the content than about the author’s reputation? What if you’re looking for an opinion opposite to what the majority thinks? What if you’re looking for fat tails or black swans? Or what if you want a fresh and new and current (think Twitter) opinion of a person who has just started writing? What if you want it now, rather than after the months it takes to build an “online reputation”?
Do you remember the time when Microsoft was kind of buying Yahoo? The whole web was flooded with articles about this process, and finding the companies that Yahoo had bought was very hard. Now Google returns much more relevant results for a ‘yahoo ~buys’, or even a ‘“yahoo ~buy *”’ query, but back then… (But I still wonder why results for ‘“yahoo ~buy *”’ and ‘“yahoo ~buys *”’ differ so greatly.)
Content Uniqueness and Keyword Density
It took a while to find a good article about long-tailed queries: not the most relevant from Google’s perspective, but from mine. Finally I took the link from The Art of SEO book: http://guides.seomoz.org/chapter-5-keyword-research. When working on the semantic search engine, we got an inquiry whether we could help find insightful articles. Frankly, we didn’t know how to do it. Short articles have high keyword density, but is high keyword density what you’re looking for, or do you need good coverage of the subject matter? Google relies upon people here, and it’s a good thing. If only it were combined with a relevant content matching algorithm…
Though there are attempts to reverse engineer Google’s algorithm (and by the way, Yahoo returns much better results for this query), it looks like keyword density scores big, and short articles relevant to short queries score higher than long articles (like this one). On the other hand, longer articles should score better on related content match, and this is where statistical approaches can start to work.
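Here is a small sketch of why short texts win on keyword density. It is a naive measure of my own (the share of words that are query terms), not a description of how any search engine actually scores pages; the function name `keyword_density` and both sample texts are made up for illustration.

```python
import re


def keyword_density(text, query):
    """Share of the words in `text` that are query terms (naive, case-insensitive)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    terms = set(re.findall(r"[a-z0-9']+", query.lower()))
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in terms)
    return hits / len(words)


# Hypothetical texts: a one-line note and a longer piece that starts the same way.
short_text = "Yahoo buys a startup."
long_text = (
    "Yahoo buys a startup. The deal follows months of talks, and analysts "
    "expect more acquisitions in search, mail and advertising this year."
)

print(keyword_density(short_text, "yahoo buys"))  # 2 of 4 words
print(keyword_density(long_text, "yahoo buys"))   # 2 of 22 words
```

The longer text says strictly more about the subject, yet its density is several times lower. A measure of coverage would have to look at something other than this ratio.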
Have I Said “Search Services Providers” Before?
What Google and company are selling is not search services to you: you get these for free. What they do is market page views to prove they can sell ad clicks to advertisers. So they just want to keep us happy enough. We are not their target market. Understand this, and relax. Don’t expect to be served well: just good enough is the best you can count on. You’re just a tool.
Who pays? Advertisers. What is sold? Clicks on ads on those pages. The more pages you are shown, i.e., the worse the relevance, the more ads are shown, and the higher the probability that ads will be clicked. Google wants to broaden your search, trying to show more pages, rather than narrow it by making the results as relevant to the query as possible.
Related Content, Or Will They Dare To Ask?
Would you try to answer if you did not understand what you were asked? I think you would try to understand the person, at least in the cases when you care. Google and company make timid attempts to offer related searches. These seem to be aimed at helping people build long-tailed queries. It is a good trend that the length of a query increases, but the search engines provide really too few tools to help people build queries efficiently. On the one hand, better tools would help people find information faster; on the other hand, they would mean fewer page views for Google and their customers.
Page Structure Analysis
For Google, a page looks like text with a title and nested headings. Keywords in the title and in headings have higher weights.
But has anyone noticed that Google cares about whether the keywords are part of a consistent text? I haven’t. Thus you end up finding your keywords scattered all over the page, in the best case standing next to an unrelated entity.
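One simple way to tell “keywords in a consistent text” from “keywords scattered all over the page” is to measure the shortest run of words that contains every query term. The sketch below is only my illustration of such a proximity measure, under the assumption that a smaller span means a more coherent match; I am not claiming any search engine does or does not compute it, and the function name `smallest_span` and the sample texts are hypothetical.

```python
import re


def smallest_span(text, query):
    """Length in words of the shortest window containing every query term, or None."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    if not terms:
        return None

    counts = {}  # query terms currently inside the window -> occurrences
    best = None
    left = 0
    for right, word in enumerate(words):
        if word in terms:
            counts[word] = counts.get(word, 0) + 1
        # While the window holds all terms, record it and shrink it from the left.
        while len(counts) == len(terms):
            width = right - left + 1
            if best is None or width < best:
                best = width
            leaving = words[left]
            if leaving in counts:
                counts[leaving] -= 1
                if counts[leaving] == 0:
                    del counts[leaving]
            left += 1
    return best


coherent = "Yahoo buys a search startup"
scattered = "Yahoo news archive. Links, RSS, terms of use. Nobody buys this."

print(smallest_span(coherent, "yahoo buys"))   # terms are adjacent
print(smallest_span(scattered, "yahoo buys"))  # terms are far apart
```

Both texts contain both keywords, so a plain keyword match treats them alike; the span is what separates them.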
What makes things worse is that in the modern world a web page contains side content and navigation parts. Google does not make any distinction between the article itself and all the wrappings around it. In the worst case your keywords appear somewhere within related links and terms of use. Have you ever tried to search for an event involving some entities within a specific year or month? Most frequently you’ll have to go through dated blog indices, copyright lists, and comments posted on that specific date. The same happens any time you search for words that are likely to appear in navigation sections, such as “link”, “RSS”, “post”, even “Google”.
For the Curious: Why Did Yahoo Score So High In the “I Suck” Test?
Articles from Yahoo! News have the word Yahoo in the title, while neither Bing nor Google place their company names in page titles; that may be why.
That’s of course not only Google’s problem. Before HTML5, there was no legitimate way to describe page structure and to separate articles from navigation and side content. The question is whether Google is going to support HTML5 markup. So far too little has been done in this direction. I wonder if Yahoo/Bing see HTML5 support as a competitive advantage for them.
Text Analysis
No attempt is made to analyze the text: extracting information about participants and events, and establishing relationships between these, which is the essence of semantic analysis. So, let’s admit it: Google is a keyword search engine, a very good one, but not more than that. Ok, it even lets you enter a sequence of keywords in quotation marks, but not much more. Nelson Mattos mentioned in an interview that semantics is hard, and Google is only at the beginning of the path. No doubt, but so far nothing in this direction has been shown to the public.
Neighborhood Analysis
Sometimes Google decides to return pages that do not contain your keywords.
A simple example. When looking for the <article> tag support by Google, the following page http://apiblog.youtube.com/2010/06/flash-and-html5-tag.html is ranked #1. It’s a good article, but about the <video> tag. If you look at the text-only version in the cache, you can see that Google confesses “These search terms are highlighted: html5 google support tag. These terms only appear in links pointing to this page: article”. ‘Nuf said.
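The cache note quoted above suggests that anchor text of inbound links is matched against the query along with the page’s own text. Here is a rough sketch of how such a split between “on the page” and “only in links” could be computed. It is my guess at the logic, not Google’s implementation; the function names and the sample page text and anchor text are invented, and only the query words are taken from the cache note.

```python
import re


def tokenize(text):
    """Lowercased set of alphanumeric words in `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def explain_match(query, page_text, inbound_anchor_texts):
    """Split query terms by where they were found (hypothetical helper)."""
    query_terms = tokenize(query)
    on_page = tokenize(page_text)
    in_anchors = set()
    for anchor in inbound_anchor_texts:
        in_anchors |= tokenize(anchor)
    return {
        "on_page": sorted(query_terms & on_page),
        "only_in_links": sorted((query_terms & in_anchors) - on_page),
        "missing": sorted(query_terms - on_page - in_anchors),
    }


# Made-up page and anchor texts that mimic the situation described above.
result = explain_match(
    query="html5 google support article tag",
    page_text="Google support for the HTML5 video tag",
    inbound_anchor_texts=["article about html5 video"],
)
print(result)
```

With data like this, the page matches every query term even though the word “article” never appears on it, which is the behavior the cache note confesses to.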
The Good and The Bad
The Good
- Ranking each page independently is the first step to indexing articles, not pages
- Counting inbound links really helps take into consideration independent opinions
- Taking into consideration the content of the neighborhood helps disambiguate meanings
- Results are returned quickly
The Bad
- Page structure is almost completely ignored and no attempt is made to understand the query and the content on the pages. This is the reason why you have to look through lots of pages that are completely irrelevant to what you’re looking for.
- Ranking inbound links high has built the whole market for SEO companies that write “related” articles. Do you think these are always insightful and fun to read?
- The UI does not let you conveniently refine the query, let alone give you the means to explain to Google what you’re looking for by giving hints, specifying synonyms, pointing out irrelevant pages etc.
- Despite returning results quickly, users waste time clicking through lots of irrelevant results.
Does This All Matter? Is Google Good Enough? Or Does It Suck?
Google is moving from a generic search tool, which everyone can use, to a somewhat skewed ad-showing engine, and this should be taken into account. Google is great if you want to find a page containing keywords in its title. But research shows that more than 80% of queries are informational rather than navigational or transactional. Doesn’t it look like a big under-served market?
Or maybe I just want to use Google in a way it’s not meant to be used, and should use other applications and services instead?
What We All Can Do
Let’s stop criticizing Google for a while, and ask ourselves: have we, as content providers, done enough for Google to understand our articles? Since we know (ok, we try to guess) how it works, and its limitations, what can we do to help Google, and the rest of the herd, to index our pages better?
Things start looking promising now, with HTML5 becoming a real thing. Simply marking up articles and navigation can already help filter out junk. (How to properly mark up different types of pages, articles, comments, related content, lists of article abstracts, posting dates etc. deserves a separate discussion.)
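To show how little it takes to filter out the wrappings once a page is marked up, here is a minimal sketch that keeps only the text inside <article> and skips <nav>, <aside> and <footer>. It uses Python’s standard html.parser module; the class name, the set of skipped tags and the sample page are my assumptions, and a real indexer would surely need to handle far messier markup.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collects text found inside <article>, ignoring navigation and side content."""

    SKIP = {"nav", "aside", "footer", "script", "style"}  # assumed "junk" containers

    def __init__(self):
        super().__init__()
        self.article_depth = 0
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth:
            self.article_depth -= 1
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside an article and outside any skipped container.
        if self.article_depth and not self.skip_depth:
            text = data.strip()
            if text:
                self.chunks.append(text)


def article_text(html):
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: one article wrapped in navigation, related links and a footer.
page = """
<html><body>
<nav><a href="/rss">RSS</a> <a href="/links">Links</a></nav>
<article><h1>Yahoo buys a startup</h1><p>The deal was announced today.</p></article>
<aside>Related links: Google, Bing</aside>
<footer>Terms of use</footer>
</body></html>
"""
print(article_text(page))
```

Words such as “RSS”, “Links” and “Google” from the wrappings never reach the index in this sketch, so a query for them would no longer hit this page by accident.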
Search engines might start giving better rankings to properly marked content. This will naturally promote creating clean and clear content, as well as promote good behavior. I completely agree with Sam Langdon that commercial forces will dictate what the web content will be.
Being able to specify page structure is only a first baby step in improving search results. But before talking about semantic search, why not implement simpler things?
Wrapping Up and Around
Do you remember what we started with? For the “I suck” query, Yahoo returned only 28 pages, i.e., 280 results. Google has sincerely reported that it does not return more than 1000 results. Bing stopped at page 100, i.e., the same as Google. Big Oligopoly knows better?
Can it be that the best answer to your question is on page 101? Experiments using the Yahoo and Bing search APIs have shown that in some cases the answer is “definitely yes”, and the most relevant articles have been in the eighth hundred, or in the second or third thousand.
