
Google’s new search index: Caffeine

On June 9th Google announced the completion of a new web indexing system called Caffeine. Caffeine provides 50 percent fresher results for web searches than Google’s previous index, and it’s the largest collection of web content Google has ever offered. Whether it’s a news story, a blog or a forum post, you can now find links to relevant content much sooner after it is published than was ever possible before.

Some background for those of you who don’t build search engines for a living like us: when you search Google, you’re not searching the live web. Instead you’re searching Google’s index of the web which, like the index in the back of a book, helps you pinpoint exactly the information you need. (Here’s a good explanation of how it all works.)

So why did Google build a new search indexing system? Content on the web is blossoming. It’s growing not just in size and numbers; with the advent of video, images, news and real-time updates, the average webpage is richer and more complex. In addition, people’s expectations for search are higher than they used to be. Searchers want to find the latest relevant content and publishers expect to be found the instant they publish.

To keep up with the evolution of the web and to meet rising user expectations, Google built Caffeine. The image below illustrates how Google’s old indexing system worked compared to Caffeine:



Google’s old index had several layers, some of which were refreshed at a faster rate than others; the main layer would update every couple of weeks. To refresh a layer of the old index, Google would analyze the entire web, which meant there was a significant delay between when Google found a page and when it made that page available to you.

With Caffeine, Google analyzes the web in small portions and updates its search index on a continuous basis, globally. As Google finds new pages, or new information on existing pages, it adds them straight to the index. That means you can find fresher information than ever before, no matter when or where it was published.
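To make the difference concrete, here is a minimal sketch of an index that is updated one page at a time instead of being rebuilt from scratch. It is a toy illustration only, not a description of how Caffeine is actually implemented; the class name `IncrementalIndex`, its methods and the example URLs are all made up for this example.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that is updated one page at a time (hypothetical)."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of URLs containing it
        self.page_words = {}              # URL -> words currently indexed for it

    def add_or_update(self, url, text):
        """Index a new page, or re-index a changed one, without touching other pages."""
        new_words = set(text.lower().split())
        old_words = self.page_words.get(url, set())
        # Remove words that disappeared from the page.
        for word in old_words - new_words:
            self.postings[word].discard(url)
        # Add words that are new on the page.
        for word in new_words - old_words:
            self.postings[word].add(url)
        self.page_words[url] = new_words

    def search(self, word):
        """Return the URLs that currently contain the word, in sorted order."""
        return sorted(self.postings.get(word.lower(), set()))


index = IncrementalIndex()
index.add_or_update("example.com/news", "caffeine index announced")
index.add_or_update("example.com/blog", "fresh coffee and caffeine")
print(index.search("caffeine"))

# The news page changes; only that one page is re-indexed.
index.add_or_update("example.com/news", "index announced")
print(index.search("caffeine"))
```

The point of the sketch is that a changed page becomes searchable as soon as `add_or_update` returns, whereas a layered, batch-style index would have to wait for the next full refresh.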

Caffeine lets Google index web pages on an enormous scale. In fact, every second Caffeine processes hundreds of thousands of pages in parallel. If these pages were a pile of paper, it would grow three miles taller every second. Caffeine takes up nearly 100 million gigabytes of storage in one database and adds new information at a rate of hundreds of thousands of gigabytes per day. You would need 625,000 of the largest iPods to store that much information; if these were stacked end-to-end they would go for more than 40 miles.
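The iPod comparison is easy to sanity-check. The short calculation below uses the two figures quoted above; the device height of about 10.35 cm is an assumption added for this check and does not come from the announcement.

```python
TOTAL_STORAGE_GB = 100_000_000  # "nearly 100 million gigabytes" (from the text)
IPOD_COUNT = 625_000            # from the text

# Implied capacity of each "largest iPod"
capacity_per_ipod_gb = TOTAL_STORAGE_GB / IPOD_COUNT
print(capacity_per_ipod_gb)  # 160.0

# Assumption (not in the text): each device is about 10.35 cm tall.
IPOD_HEIGHT_M = 0.1035
METERS_PER_MILE = 1609.344

stack_miles = IPOD_COUNT * IPOD_HEIGHT_M / METERS_PER_MILE
print(round(stack_miles, 1))  # 40.2
```

So the figures imply 160 GB per device, and under the assumed height the end-to-end stack does come out at a little over 40 miles.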

Google built Caffeine with the future in mind. Not only is it fresher, it’s a robust foundation that makes it possible for Google to build an even faster and more comprehensive search engine that scales with the growth of information online and delivers even more relevant search results to you. So stay tuned, and look for more improvements in the months to come.

Why Google Indexing Requires A Complex Blend Of Skills

If it were easy, everybody would be doing it. Getting a company’s name and its products or services onto the first page of a genuine Google search isn’t a trivial piece of work. In fact, there are four distinct skills that a search engine optimizer needs to possess. Most people possess one or maybe two of these skills; very rarely does anyone possess all four. In truth, to get to all four, people who are good at two of them need to actively develop the others. Now, if you are running your own business, do you really have the time to do this? Is this the best use of your time?

Specifically the four skills needed for SEO work are:
Web design – producing a visually attractive page
HTML coding – developing search-engine-friendly code that sits behind the web design
Copywriting – producing the actual readable text on the page
Marketing – knowing which searches people actually use and which keywords actually bring more business to your company

Many website designers produce more and more eye-catching designs, with animations and clever rollover buttons, hoping to entice people onto their sites. This is the first big mistake; using designs like these will actually decrease your chances of a high Google ranking. Yes, that’s right; all that money you have paid for the website design could be wasted because no-one will ever find your site.

The reason for this is that before you get people to your site you need to get the spider bots to like your site. Spider bots are pieces of software that the search engine companies use to trawl the Internet and look at all the websites; once the sites have been reviewed, the companies use complex algorithms to rank them. Some of the complex techniques used by web designers cannot be trawled by spider bots. They come to your site, look at the HTML code and exit stage right, without even bothering to rank your site. So, you will not be found on any meaningful search.
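A rough way to see your site through a spider bot’s eyes is to throw away everything except the plain text in the HTML. The sketch below does this with Python’s standard `html.parser` module. It is a deliberately simplified model, not how any particular search engine works (real crawlers may handle scripts and other content differently), and the class name, helper function and sample pages are invented for the illustration.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collects the text a very simple bot could read, ignoring scripts and styles."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the readable text of a page as one space-separated string."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page whose content only exists inside a script and a plug-in.
script_heavy_page = """
<html><body>
<script>document.write("Accountants in Manchester");</script>
<object data="intro.swf"></object>
</body></html>
"""

# Hypothetical page with the same message written as ordinary HTML text.
plain_page = """
<html><body>
<h1>Accountants in Manchester</h1>
<p>We prepare tax returns for small businesses.</p>
</body></html>
"""

print(repr(visible_text(script_heavy_page)))
print(visible_text(plain_page))
```

In this simplified model the script-heavy page yields no text at all, while the plain page yields both the heading and the sentence below it.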

I am amazed how many times I look at websites and immediately know they are a waste of money. The trouble is that both the web designers and the company that paid the money really do not want to know this. In fact, I have stopped playing the messenger of bad news (too many shootings!); I now work round the problem. So, optimizing a website to be Google friendly is often a compromise between a visually attractive site and an easy-to-find site.

The second skill is that of optimizing the actual HTML code to be spider bot friendly. I treat this as separate from the web design because you really do need to be “down and dirty” in the code rather than using an editor like Dreamweaver, which is OK for website design. This skill takes lots of time and experience to develop, and just when you think you have cracked it, the search engine companies change the algorithms used to calculate how high your site will appear in the search results.

This is no place for even the most enthusiastic amateur. Results need to be constantly monitored, pieces of code added or removed, and a check kept on what the competition are doing. Many people who design their own website feel they will get searched because it looks good, and totally miss out this step. Without a strong technical understanding of how spider bots work, you will always struggle to get your company on the first results page in Google.

Thirdly, I suggested that copywriting is a skill in its own right. This is the writing of the actual text that people coming to your site will read. The Google bot and other spider bots, like Inktomi, love text, but only when it is written well in proper English. Some people try to stuff their site with keywords, while others put white writing on a white background (so spider bots can see it but humans cannot).

Spider bots are very sophisticated: not only will they not fall for these tricks, they may actively penalize your site. In Google terms, this is sandboxing. Google takes new sites and “naughty” sites and effectively sin-bins them for 3-6 months; you can still be found, but not until results page 14 (really useful!). As well as good English, the spider bots are also reading the HTML code, so the copywriter also needs an appreciation of the interplay between the two. My recommendation for anyone writing the copy for their own site is to write normal, well-constructed English sentences that can be read by machine and human alike.
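If you want a quick check on whether your own copy has drifted into keyword stuffing, you can measure how much of the text is taken up by its most frequent words. The sketch below is only a rough self-check under my own assumptions; it is not how Google or any other search engine detects stuffing, and the function name and sample sentences are made up for the example.

```python
import re
from collections import Counter


def keyword_density(text, top_n=3):
    """Return the top_n most frequent words with their share of all words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return []
    counts = Counter(words)
    return [(word, count / len(words)) for word, count in counts.most_common(top_n)]


# Hypothetical examples of stuffed and natural copy.
stuffed = "cheap accountant cheap accountant cheap accountant Manchester cheap accountant"
natural = "Our Manchester practice prepares tax returns for small businesses"

for word, share in keyword_density(stuffed, top_n=2):
    print(f"{word}: {share:.0%}")

for word, share in keyword_density(natural, top_n=2):
    print(f"{word}: {share:.0%}")
```

In the stuffed example two words account for almost all of the text, whereas in the natural sentence no single word stands out. Where exactly to draw the line is a judgment call; the numbers are only a prompt to re-read the copy as a human would.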

The final skill is marketing. After all, this is what we are doing: marketing your site, and hence your company and its products or services, on the Web. The key here is to set the site up to be accessible to the searches that will bring you the most business. I have seen many sites that can be found when you key in the company name, and others that can be found by keying in “Accountant Manchester North-West England”, which is great, except no-one ever actually does that search. So the marketing skill requires knowledge of a company’s business, what it is really trying to sell, and an understanding of which actual searches may pay dividends.

I hope you will see that professional Search Engine Optimization companies need more than a bit of web design to improve your business. Make sure anyone you choose for SEO work can cover all the bases.