Search Sucks

I want to add a page (I wish I could say “an article”) with a provocative title to the more than 66’000’000 pages (yes, sixty-six million) returned by Google for the query ‘search sucks’. Individually, Google, Yahoo and Bing score 2’700’000, 11’500’000 (!) and 643’000 respectively, according to their own evaluations. Should we trust these figures? Well, yes and no. If search results are not always relevant, you cannot be sure that the texts on all those pages are about what you mean by your query. But since all search engines try to make results relevant, some portion of the results should be meaningful. Let’s explore how search works now, how trustworthy the results are, and what to do about it.

How Search Engines Work

I am not associated in any way with Google or the other big search service providers, so this article contains speculation based on available information, observations, experience and a couple of attempts to improve things. Thus what I say here might be wrong, but I will be surprised if it is completely wrong.

Google Founders are Geniuses, or It All Is About Trust and Reputation

The idea of the PageRank algorithm is really brilliant: in addition to evaluating how closely a page’s content matches the query, links from other pages to your page also count. If a reputable page links to your page, your reputation (read: rank) increases: the author of that page trusts that when people navigate from her page to yours, it won’t harm her reputation (have you noticed we’re talking about relationships between people, not pages?).

I highly recommend reading an extremely interesting article by Ian Rogers about how PageRank works, and after that (you might prefer reading them in the opposite order, or even in parallel) an early paper on web search by Sergey Brin and Larry Page, or at least skimming through them. Reading this article will be more of a joy if you understand what the damping factor is about.
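For readers who prefer code to prose, here is a minimal sketch of the simplified formula from the Brin and Page paper, PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where d is the damping factor and C(T) is the number of links going out of page T. It is an illustration only, not how Google ranks pages today; the function name and the three-page toy web are made up for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Every page starts with the same rank; repeated passes let it settle.
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each page that links here passes on an equal share of its own rank.
            inbound = sum(
                rank[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_rank[page] = (1 - damping) + damping * inbound
        rank = new_rank
    return rank


# Hypothetical toy web: A and B link to each other, C links to A.
toy_web = {"A": ["B"], "B": ["A"], "C": ["A"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C, to which nobody links, is left with only the (1 - d) share, while A, which collects links from both of its neighbors, ends up on top.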

That early paper, in addition to the page ranking algorithm, pays attention to relevance between the anchor text of the link pointing to your page, and the text on your page.

Surprisingly, that paper says nothing about content uniqueness and duplication: it looks like these became more important as things began to unfold and the spammers came, and nowadays the algorithm has become slightly more complicated.

The Content, and The Neighborhood

Google tries to explain how they value links without disclosing how it really works. That article is still a good starting point for moving forward. It fits well into Ian Rogers’s explanation.
What is valued is you and your neighbors (and according to the original paper by Brin and Page, you’re worth only 15%, and your neighbors get the remaining 85%, but let’s not be greedy at the moment).
What counts is:

  • Your page’s content
  • Inbound links
    • Content on the page, which contains a link to your page
    • Anchor text on those pages
  • Outbound links: to far lesser extent, but the principle is the same
  • These rules are transitive (i.e., neighbors of neighbors still are neighbors) of course, to the extent reasonable

Each page is ranked independently: it matters little whether a page is under the umbrella of a big site or belongs to an individual blog. And this is one of the good things about Google.

Neighbors Know Better, or 1’000’000 Lemmings Can’t Be Wrong

It takes time to climb the social ladder and to gain reputation. So those who are right and at the same time are good neighbors get all the attention. If you’re a sociopathic genius, forget about the Top 10 at Google: you won’t be seen or heard. These are Google’s rules of survival.
But what if you’re searching for information, and you care more about the content than about the author’s reputation? What if you’re looking for an opinion opposite to what the majority thinks? What if you’re looking for fat tails or black swans? Or what if you want a fresh, new and current (think Twitter) opinion from a person who has just started writing? What if you want it now, rather than in the months it takes to build an “online reputation”?

Do you remember the time when Microsoft was kind of buying Yahoo? The whole web was flooded with articles about this process, and finding the companies that Yahoo had bought was very hard. Now Google returns much more relevant results for a ‘yahoo ~buys’ query, or even a ‘“yahoo ~buy *”’ query, but back then… (But I still wonder why the results for ‘“yahoo ~buy *”’ and ‘“yahoo ~buys *”’ differ so greatly.)

Content Uniqueness and Keyword Density

It took a while to find a good article about long-tailed queries: not the most relevant from Google’s perspective, but from mine. Finally I took the link from The Art of SEO book: http://guides.seomoz.org/chapter-5-keyword-research. When we were working on the semantic search engine, we got an inquiry whether we could help find insightful articles. Frankly, we didn’t know how to do it. Short articles have high keyword density, but is high keyword density what you’re looking for, or do you need good coverage of the subject matter? Google relies upon people here, and it’s a good thing. If only it were combined with a relevant content matching algorithm…
Though there are attempts to reverse engineer Google’s algorithm (and by the way, Yahoo returns much better results for this query), it looks like keyword density scores big, and short articles relevant to short queries score higher than long articles (like this one). On the other hand, longer articles should score better on related content match, and that’s where statistical approaches can start to work.
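To see why short texts win on this measure, here is a minimal sketch of keyword density as the plain share of words in a text that equal the keyword. This definition is an assumption made for illustration; nobody outside the search engines knows what they actually compute. The function name and both sample texts are made up.

```python
import re


def keyword_density(text, keyword):
    """Share of the words in text that equal keyword (case-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)


# Two hypothetical posts about the same event: a one-liner and a longer write-up.
short_post = "Yahoo buys a startup."
long_post = short_post + " The deal took months of talks between both boards." * 3

print(keyword_density(short_post, "yahoo"))
print(keyword_density(long_post, "yahoo"))
```

The longer post says more about the event, yet its density for “yahoo” is several times lower, simply because the same single mention is divided by more words.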

Have I Said “Search Services Providers” Before?

What Google and company sell is not search services to you: you get those for free. What they do is market page views to prove they can sell ad clicks to advertisers. So they just want to keep us happy enough. We are not their target market. Understand this, and relax. Don’t expect to be served well; just good enough is the best you can count on. You’re just a tool.

Who pays? Advertisers. What is sold? Clicks on ads on those pages. The more pages you are shown, i.e., the worse the relevance, the more ads are shown, and the higher the probability that ads will be clicked. Google wants to broaden your search, trying to show more pages, rather than narrow it by making the results as relevant to the query as possible.

Related Content, Or Will They Dare To Ask?

Would you try to answer if you did not understand what you were being asked? I think you would try to understand the person, at least in the cases when you care. Google and company make timid attempts to offer related searches. These seem to be aimed at helping people build long-tailed queries. It is a good trend that the length of queries is increasing, but the search engines provide far too few tools to help people build queries efficiently. On the one hand, better tools will help people find information faster; on the other hand, that will mean fewer page views for Google and their customers.

Page Structure Analysis

For Google, a page looks like text with a title and nested headings. Keywords in the title and in headings have higher weights.
But has anyone noticed that Google cares about whether the keywords are part of a consistent text? I haven’t. Thus you end up finding your keywords scattered all over the page, in the best case standing next to an unrelated entity.

What makes things worse is that in the modern world a web page contains side content and navigation parts. Google does not distinguish between the article itself and all the wrappings around it. In the worst case your keywords appear somewhere within related links and terms of use. Have you ever tried to search for an event that involved some entities within a specific year or month? Most frequently you’ll have to go through dated blog indices, copyright lists, and comments posted on that specific date. The same happens any time you search for words that are likely to appear in navigation sections, such as “link”, “RSS”, “post”, even “Google”.

For the Curious: Why Yahoo Scored So High In the “I Suck” Test?

Articles from Yahoo! News have the word Yahoo in the title, while neither Bing nor Google places its company name in page titles. That may be why.

That’s of course not only Google’s problem. Before HTML5, there was no legitimate way to describe page structure and separate articles from navigation and side content. The question is whether Google is going to support HTML5 markup. So far too little has been done in this direction. I wonder if Yahoo/Bing see HTML5 support as a competitive advantage for them.

Text Analysis

No attempt is made to analyze the text: to extract information about participants and events and to establish relationships between them, everything that is the essence of semantic analysis. So, let’s admit it: Google is a keyword search engine, a very good one, but not more than that. OK, it even lets you enter a sequence of keywords in quotation marks, but not much more. Nelson Mattos mentioned in an interview that semantics is hard, and that Google is only at the beginning of the path. No doubt, but so far nothing in this direction has been shown to the public.

Neighborhood Analysis

Sometimes Google decides to return pages that do not contain your keywords.

A simple example. When looking for support of the <article> tag by Google, the following page http://apiblog.youtube.com/2010/06/flash-and-html5-tag.html is ranked #1. It’s a good article, but it is about the <video> tag. If you look at the text-only version in the cache, you can see that Google confesses “These search terms are highlighted: html5 google support tag. These terms only appear in links pointing to this page: article”. ’Nuf said.
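How can a page match a word it does not contain? One plausible mechanism, in line with the anchor text idea from the early paper, is that the words of a link are indexed under the page the link points to. The sketch below is a guess at the principle, not Google’s implementation; the function names, URLs and texts are all hypothetical.

```python
from collections import defaultdict


def build_index(pages, links):
    """Index body words under their own page and anchor words under the link target.

    pages: url -> body text
    links: list of (source_url, target_url, anchor_text)
    """
    index = defaultdict(lambda: defaultdict(set))
    for url, text in pages.items():
        for term in text.lower().split():
            index[term][url].add("body")
    for _source, target, anchor in links:
        for term in anchor.lower().split():
            index[term][target].add("anchor")
    return index


def explain_match(index, query, url):
    """For each query term, report where it was seen for this page, if anywhere."""
    return {
        term: sorted(index.get(term, {}).get(url, ()))
        for term in query.lower().split()
    }


# Hypothetical data: a post about the video tag, linked to with the word "article".
pages = {"example.com/video-post": "html5 video tag support"}
links = [("example.org/roundup", "example.com/video-post", "article")]

index = build_index(pages, links)
print(explain_match(index, "html5 article tag", "example.com/video-post"))
```

For this toy data the word “article” is reported as coming from an anchor only, which mirrors the “these terms only appear in links pointing to this page” note in the cache.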

The Good and The Bad

The Good

  • Ranking each page independently is the first step to indexing articles, not pages
  • Counting inbound links really helps take into consideration independent opinions
  • Taking into consideration the content of the neighborhood helps disambiguate meanings
  • Results are returned quickly

The Bad

  • Page structure is almost completely ignored, and no attempt is made to understand the query and the content on the pages. This is the reason why you have to look through lots of pages that are completely irrelevant to what you’re looking for.
  • Ranking inbound links high has built the whole market for SEO companies that write “related” articles. Do you think these are always insightful and fun to read?
  • The UI does not let you conveniently refine the query, let alone give you the means to explain to Google what you’re looking for by giving hints, specifying synonyms, pointing out irrelevant pages, etc.
  • Despite returning results quickly, users waste time clicking through lots of irrelevant results.

Does This All Matter? Is Google Good Enough? Or It Sucks?

Google is moving from a generic search tool, which everyone can use, to a somewhat skewed ad-showing engine, and this should be taken into account. Google is great if you want to find a page containing keywords in its title. But research shows that more than 80% of queries are informational rather than navigational or transactional. Doesn’t it look like a big under-served market?

Or maybe I just want to use Google in a way it’s not meant to be used, and should use other applications and services instead?

What We All Can Do

Let’s stop criticizing Google for a while, and ask ourselves: have we, as content providers, done enough for Google to understand our articles? Since we know (ok, we try to guess) how it works, and its limitations, what can we do to help Google, and the rest of the herd, to index our pages better?

Things are starting to look promising now, with HTML5 becoming a real thing. Starting with simply marking up articles and navigation can help filter out junk. (How to properly mark up different types of pages, articles, comments, related content, lists of article abstracts, posting dates, etc. deserves a separate discussion.)
Search engines might start giving better rankings to properly marked content. This will naturally promote creating clean and clear content, as well as promote good behavior. I completely agree with Sam Langdon that commercial forces will dictate what the web content will be.
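To make the benefit of such markup concrete, here is a minimal sketch of what an indexer could do once articles and navigation are marked: keep only the text inside <article> and drop anything inside <nav>, <aside> or <footer>. It uses Python’s standard html.parser; the class name, the choice of tags to skip, and the sample page are assumptions for illustration, not a description of any real crawler.

```python
from html.parser import HTMLParser


class ArticleText(HTMLParser):
    """Collect text found inside <article>, ignoring navigation and side content."""

    SKIP = {"nav", "aside", "footer"}

    def __init__(self):
        super().__init__()
        self.article_depth = 0
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth:
            self.article_depth -= 1
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are inside an article and outside any skipped block.
        if self.article_depth and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def article_text(html):
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: site navigation, one article with its own prev/next links, a footer.
page = """
<nav><a href="/rss">RSS</a> <a href="/links">Links</a></nav>
<article>
  <h1>Yahoo buys a startup</h1>
  <p>The deal closed in August.</p>
  <nav><a href="/prev">Previous post</a></nav>
</article>
<footer>Copyright 2010</footer>
"""

print(article_text(page))
```

With the wrappings gone, words like “RSS”, “Links” and “Copyright” no longer count as part of the text, so a query for them would not hit this page.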

Being able to specify page structure is only a first baby step in improving search results. But before talking about semantic search, why not implement simpler things?

Wrapping Up and Around

Do you remember what we started with? For the “I suck” query, Yahoo returned only 28 pages, i.e., 280 results. Google has sincerely reported that it does not return more than 1000 results. Bing stopped at page 100, i.e., the same as Google. Big Oligopoly knows better?

Can it be that the best answer to your question is on page 101? Experiments using the Yahoo and Bing search APIs have shown that in some cases the answer is “definitely yes”: the most relevant articles were in the eighth hundred, or in the second or third thousand.

Posted in Uncategorized | Leave a comment

Here enters Buzzuvello, or Clear and Clean Content SEO

As of today, Google and the others know nothing about Buzzuvello.

So, his name is a good keyword to identify this blog on the web.

Waiting for the spiders…

…and the spiders came, and found 4 links. It’s interesting to see how Google ranks pages:

Buzzuvello at Google: How Google ranks blog pages

The winner, the list of all articles tagged “buzzuvello”, has the tag word in the title, in the URL, and (we’ve been lucky) in the short description. Almost perfect, nothing to add, though you would like this post to be ranked #1, not #3.

Why is it #3, btw? Let’s try to guess. Google’s Webmaster Tools won’t give you detailed statistics on keyword density of each page, so the following is pure speculation for now.

Result #3, this actual post, has our buzzuvello keyword in the title, as well as in the URL. What distinguishes it from #1? Probably, in #1 it was the only post in the category, and the content was cleaner, without the comment form, and thus the density of our buzzuvello keyword was higher in the list of all posts in the category than in the article itself.

An interesting case is result #2. It’s not an article about Buzzuvello at all, it’s the clear and clean content manifesto, my first post! What made it appear as #2 is a navigation link to the previous post. That’s the whole point we’re talking about here: search engines do not find texts, they find pages, and these things are slightly different. Search engines cannot tell different events apart, and that’s one of the things that distinguish them from humans. And this is one of the reasons why you should not always feed the spiders (shall I say ‘web spiders’?) people’s food :) Do you think they know better? I’ll blog on that later.

This leads to interesting conclusions, but let’s continue the experiment.

As of today, August 21, 2010, the site has a sitemap, which contains links to individual posts (which is how it should be) and the home page (a questionable decision, and I suspect it is wrong, but let’s prove it experimentally). Archives, tag and category indices are not included in the sitemap. Also, permalinks do not include any dates as of now (well, WordPress includes the date in image paths; let’s deal with this later). The posting date is not always a part of the message you want to convey in your text. You may write today about last year’s events. Why confuse Google with information it cannot interpret?
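A sitemap like the one described is simple to generate. The sketch below builds one in the sitemaps.org XML format from a list of paths, leaving out the tag, category and archive indices. The base URL, the paths and the URL prefixes used for filtering are assumptions about a typical blog layout, not this blog’s actual configuration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumed URL layout for index pages; adjust to the real permalink scheme.
EXCLUDED_PREFIXES = ("/tag/", "/category/", "/archives/")


def build_sitemap(base_url, paths):
    """Return sitemap XML listing every path that is not an index page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path in paths:
        if path.startswith(EXCLUDED_PREFIXES):
            continue  # archives, tag and category indices stay out
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = base_url + path
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical site: home page, one post, one tag index, one category index.
sitemap_xml = build_sitemap(
    "https://example.com",
    ["/", "/search-sucks/", "/tag/buzzuvello/", "/category/uncategorized/"],
)
print(sitemap_xml)
```

Only the home page and the post end up in the output; dropping "/" from the input list would be the way to test the suspicion that the home page should not be there either.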

To make things more interesting, I have installed a plugin which strips out a large part of the side content, leaving only the title, headers and the content to the spiders. (Disclosure: it also leaves the previous and next links, but they are also going to go with the next plugin update, since they are not part of the message in your text!)

Let’s see how all this affects the results (and please notice, I’ve used the word buzzuvello five or six times in this text; will it help it rank higher?).

I have also submitted this blog to Yahoo and Bing, and to Baidu. Submission to the so-called semantic search engines is also planned (semantic search was my previous start-up, and I hope it will be one of the next ones), but it’s not a priority right now :)

Stay tuned.

Next posts will be on text or discourse or article versus web pages (and where time enters into the game), microdata and microformats (and about time again), and of course about search user experience, which is the main topic of this blog, and probably other things.


Let’s make posts easy to find and a joy to read

Finding information on the internet is not easy. Search engines give you pages, but not information, let alone answers. That’s the sad state of the art that no one has succeeded in challenging yet. Well, some tried… but that’s another story.

When you’re looking where to buy something, internet search works, more or less – you’re looking for a thing by name, and keywords seem to be quite the same as names. The reality is a bit more complicated, and, if you start digging deeper, much more complicated… but in 80% of cases it works.

But when you look for information, i.e., text written by humans, usually about the state of things or events in the world, prepare to be patient. You’ll have to click through dozens of pages, reviewing them, and throwing away those that are irrelevant.

Why do search engines, from Google to Yahoo to Bing to many smaller ones, from niche to semantic, return so many wrong results? The answer (at least, my version of it) is astonishing, when you arrive at it: people don’t speak with keywords, they speak in a language.

Some search engines don’t even try to understand what a person is looking for, and what is on the web; Google is leading this herd. Some try, but can’t.

But this blog is not about complaining, it’s about improving things. And many things can be done not by the big boys, but by those who create information: bloggers, journalists and webmasters.

This blog will now be a playground to invent, explore and test different techniques aimed at improving the quality of the content presented to search engines. Will it help keyword-based search engines present people with better results? Let’s see, stay tuned. And… let’s do it together.
