Timing Google's Crawl

By Brad Hill

Google crawls the Web at varying depths and on more than one schedule. The so-called deep crawl occurs roughly once a month. This extensive reconnaissance of Web content requires more than a week to complete, plus an undisclosed length of time after completion to build the results into the index. For this reason, it can take up to six weeks for a new page to appear in Google. Brand-new sites at new domain addresses that have never been crawled before might not even be indexed at first.

If Google relied entirely on the deep crawl, its index would quickly become outdated in the rapidly shifting Web. To stay current, Google launches various supplemental fresh crawls that skim the Web more shallowly and frequently than the deep crawl. These supplementary spiders (automated software programs that travel from link to link on the Web, collecting content from online pages) do not update the entire index, but they freshen it by updating the content of some sites. Google does not divulge its fresh-crawling schedules or targets, but Webmasters can get an indication of the crawl’s frequency through close observation.
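One way a Webmaster might carry out that kind of observation is to count spider visits in the site's own server log. The following is a minimal sketch, not an official Google tool. It assumes the server writes the common "combined" access-log format and that Google's spider identifies itself with the word Googlebot in its user-agent string. The function name, sample addresses, and sample log lines are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Pulls the bracketed timestamp and the last quoted field (the user agent)
# out of one log line. The combined log format is an assumption here.
LOG_LINE = re.compile(r'\[(?P<ts>[^\]]+)\].*"(?P<agent>[^"]*)"\s*$')


def googlebot_hits_per_day(lines):
    """Count requests per calendar day whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        # Example timestamp: 10/Oct/2023:13:55:36 +0000
        stamp = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        hits[stamp.date().isoformat()] += 1
    return dict(hits)


# Hypothetical log lines, invented for this example.
BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
sample = [
    f'192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" "{BOT}"',
    f'192.0.2.10 - - [10/Oct/2023:13:55:40 +0000] "GET /about.html HTTP/1.1" 200 900 "-" "{BOT}"',
    f'192.0.2.77 - - [11/Oct/2023:09:12:01 +0000] "GET / HTTP/1.1" 200 512 "-" "{BROWSER}"',
    f'192.0.2.10 - - [12/Oct/2023:02:03:04 +0000] "GET / HTTP/1.1" 200 512 "-" "{BOT}"',
]

for day, count in sorted(googlebot_hits_per_day(sample).items()):
    print(day, count)
```

Days with no entries suggest that the spider stayed away. Keep in mind that a user-agent string can be faked, so counts like these are an indication, not proof, of Google's visits.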

Google has no obligation to touch any particular URL with a fresh crawl. Sites can increase their chance of being crawled often, however, by changing their content and adding pages frequently. Remember the shallowness aspect of the fresh crawl; Google might dip into the home page of your site (the front page, or index page) but not dive into a deep exploration of the site’s inner pages. (You may, for example, notice that a new index page of your site appears in Google within a day of your updates, while a new inner page added at the same time may be missing.) But Google’s spider can compare previous crawl results with the current crawl, and if it learns from the top navigation page that new content is added regularly, it might start crawling the entire site during its frequent visits.
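Google does not publish how its spider compares one crawl with the next, so the following is only a rough sketch of the general idea: keep a fingerprint of each page from the previous visit and check it against the current visit. The function names, URLs, and page text are hypothetical, and the hashing approach is an assumption chosen for simplicity.

```python
import hashlib


def fingerprint(content):
    """Return a short, stable signature for a page's text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def compare_crawls(previous, current):
    """Sort the URLs seen in the current crawl into new, changed, and unchanged.

    Both arguments map a URL to the page text collected on that visit.
    """
    report = {"new": [], "changed": [], "unchanged": []}
    for url in sorted(current):
        if url not in previous:
            report["new"].append(url)
        elif fingerprint(previous[url]) != fingerprint(current[url]):
            report["changed"].append(url)
        else:
            report["unchanged"].append(url)
    return report


# Hypothetical snapshots of one small site on two visits.
last_visit = {
    "example.com/": "Welcome. Latest article: Spring sale",
    "example.com/contact": "Write to us",
}
this_visit = {
    "example.com/": "Welcome. Latest article: Summer sale",
    "example.com/contact": "Write to us",
    "example.com/summer-sale": "Everything must go",
}

print(compare_crawls(last_visit, this_visit))
```

In this toy run the home page shows up as changed and the sale page as new, which is the kind of signal that could prompt a spider to look deeper into a site on later visits.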

The deep crawl is more automatic and mindlessly thorough than the fresh crawl. Chances are good that in a deep crawl cycle, any URL already in the main index will be reassessed down to its last page. However, Google does not necessarily include every page of a site. As usual, the reasons and formulas involved in excluding certain pages are not divulged. The main fact to remember is that Google applies PageRank considerations to every single page, not just to domains and top pages. If a specific page is important to you and is not appearing in Google search results, your task is to apply every networking and optimization tactic you can imagine to that page. You may also manually submit that specific page to Google.

The terms deep crawl and fresh crawl are widely used in the online marketing community to distinguish between the thorough spidering of the Web that Google launches approximately monthly and various intermediate crawls run at Google’s discretion. Google itself acknowledges both levels of spider activity, but is secretive about exact schedules, crawl depths, and formulas by which the company chooses crawl targets. To a large extent, targets are determined by automatic processes built into the spider’s programming, but humans at Google also direct the spider to specific destinations for various reasons.

Technically, the Google index remains static between crawls. Google matches keywords against the index, not against live Web content, so any pages put online (or modified) between visits from Google’s spider remain excluded from (or out of date in) the search results until they are crawled again. But two factors work against the index remaining unchanged for long. First, the frequency of fresh crawls keeps the index evolving in a state that Google-watchers call everflux. Second, some time is required to put crawl results into the index on Google’s thousands of servers. The irregular heaving and churning of the index that results from these two factors is called the Google dance.
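The gap between the live Web and the index can be pictured with a toy example. This sketch is not Google's indexing code. It builds a tiny keyword index from a snapshot of pages and shows that a page published after the snapshot cannot be found until the index is rebuilt. All names, URLs, and page text are invented for illustration.

```python
def build_index(pages):
    """Map each word to the set of URLs whose text contained it at crawl time."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index


def search(index, keyword):
    """Look the keyword up in the index only, never in the live pages."""
    return sorted(index.get(keyword.lower(), set()))


# The "live Web": a hypothetical site with one page.
live_web = {"example.com/": "welcome to our shop"}

# A crawl takes a snapshot and builds the index from it.
index = build_index(live_web)

# A new page goes online after the crawl.
live_web["example.com/news"] = "fresh shop news"

before_recrawl = search(index, "news")  # the index has not seen the new page

# The next crawl rebuilds the index, and the page becomes findable.
index = build_index(live_web)
after_recrawl = search(index, "news")

print(before_recrawl)
print(after_recrawl)
```

The first search comes back empty even though the page is already online, and the second search finds it. Everflux and the Google dance come from many such rebuilds happening at different times across many servers.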