Timing Google's Crawl

Google crawls the Web at varying depths and on more than one schedule. The so-called deep crawl occurs roughly once a month. This extensive reconnaissance of Web content requires more than a week to complete, plus an undisclosed length of time after completion to build the results into the index. For this reason, it can take up to six weeks for a new page to appear in Google. Brand-new sites at new domain addresses that have never been crawled before might not even be indexed at first.

If Google relied entirely on the deep crawl, its index would quickly become outdated in the rapidly shifting Web. To stay current, Google launches various supplemental fresh crawls that skim the Web more shallowly and more frequently than the deep crawl. These supplementary spiders (automated software programs that travel from link to link on the Web, collecting content from online pages) do not update the entire index, but they freshen it by updating the content of some sites. Google does not divulge its fresh-crawling schedules or targets, but Webmasters can get an indication of the crawl's frequency through close observation.
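One way to do that observing is to count spider visits in your own server logs. The following is a minimal sketch, not a definitive tool: it assumes your server writes the common "combined" access log format and that the spider identifies itself with "Googlebot" in its user-agent string (which other programs can imitate). The function name and the sample log lines are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Combined Log Format (assumed):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_hits_per_day(lines):
    """Count requests per calendar day whose user-agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        if "googlebot" not in match.group("agent").lower():
            continue  # skip ordinary visitors and other spiders
        # Timestamps look like 01/Mar/2024:04:12:09 +0000
        # (month abbreviations are parsed using the default English locale).
        stamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        hits[stamp.date().isoformat()] += 1
    return dict(hits)


# Hypothetical sample lines, for illustration only.
SAMPLE_LOG = [
    '66.249.66.1 - - [01/Mar/2024:04:12:09 +0000] "GET / HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [01/Mar/2024:09:30:00 +0000] "GET / HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '66.249.66.1 - - [01/Mar/2024:04:12:15 +0000] "GET /about.html HTTP/1.1" 200 2048 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [02/Mar/2024:22:01:44 +0000] "GET / HTTP/1.1" 200 5200 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    'this line is not a log entry',
]

if __name__ == "__main__":
    for day, count in sorted(googlebot_hits_per_day(SAMPLE_LOG).items()):
        print(day, count)
```

To try it on a real site, pass the lines of your own access log instead of SAMPLE_LOG. A day-by-day count that rises after you add pages suggests the fresh crawl is visiting more often.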

Google has no obligation to touch any particular URL with a fresh crawl. Sites can increase their chance of being crawled often, however, by changing their content and adding pages frequently. Remember that the fresh crawl is shallow: Google might dip into the home page of your site (the front page, or index page) but not dive into a deep exploration of the site's inner pages. (You may, for example, notice that a new index page of your site appears in Google within a day of your updates, while a new inner page added at the same time is still missing.) But Google's spider can compare previous crawl results with the current crawl, and if it learns from the top navigation page that new content is added regularly, it might start crawling the entire site during its frequent visits.
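Google does not publish how its spider compares one crawl with the next, so the following is only an illustrative sketch of the general idea: collect the links found on a home page in two successive crawls and report the ones that are new. The class and function names and the two sample pages are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def extract_links(html):
    """Return the set of link targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def new_links(previous_html, current_html):
    """Return links present in the current crawl but absent from the previous one."""
    return sorted(extract_links(current_html) - extract_links(previous_html))


# Hypothetical home page as seen in two successive crawls.
PREVIOUS_HOME = '<a href="/about.html">About</a> <a href="/news/1.html">News 1</a>'
CURRENT_HOME = (
    '<a href="/about.html">About</a> <a href="/news/1.html">News 1</a> '
    '<a href="/news/2.html">News 2</a>'
)

if __name__ == "__main__":
    print(new_links(PREVIOUS_HOME, CURRENT_HOME))  # prints ['/news/2.html']
```

A home page that shows new links on most visits is the kind of signal the paragraph above describes: evidence that the site changes often and may be worth crawling more deeply.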

The deep crawl is more automatic and mindlessly thorough than the fresh crawl. Chances are good that in a deep crawl cycle, any site already in the main index will be reassessed down to its last page. However, Google does not necessarily include every page of a site. As usual, the reasons and formulas involved in excluding certain pages are not divulged. The main fact to remember is that Google applies PageRank considerations to every single page, not just to domains and top pages. If a specific page is important to you and is not appearing in Google search results, your task is to apply every networking and optimization tactic you can imagine to that page. You may also manually submit that specific page to Google.

The terms deep crawl and fresh crawl are widely used in the online marketing community to distinguish between the thorough spidering of the Web that Google launches approximately monthly and various intermediate crawls run at Google's discretion. Google itself acknowledges both levels of spider activity, but is secretive about exact schedules, crawl depths, and formulas by which the company chooses crawl targets. To a large extent, targets are determined by automatic processes built into the spider's programming, but humans at Google also direct the spider to specific destinations for various reasons.

Technically, the Google index remains static between crawls. Google matches keywords against the index, not against live Web content, so any pages put online (or modified) between visits from Google's spider remain excluded from (or out of date in) the search results until they are crawled again. But two factors work against the index remaining unchanged for long. First, the frequency of fresh crawls keeps the index evolving in a state that Google-watchers call everflux. Second, some time is required to put crawl results into the index on Google's thousands of servers. The irregular heaving and churning of the index that results from these two factors is called the Google dance.
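The difference between searching an index and searching the live Web can be shown with a toy example. This is a deliberately simplified sketch, not a description of Google's actual index: the "Web" is a small dictionary of made-up pages, and the index is a snapshot built at crawl time that maps each word to the pages containing it.

```python
def build_index(pages):
    """Crawl: map each lowercase word to the set of URLs containing it right now."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index


def search(index, keyword):
    """Look the keyword up in the index snapshot, never in the live pages."""
    return sorted(index.get(keyword.lower(), set()))


# Hypothetical live Web content.
live_web = {"example.com/": "widgets for sale"}

index = build_index(live_web)  # first crawl

# A new page goes online after the crawl.
live_web["example.com/new.html"] = "gadgets for sale"

before = search(index, "gadgets")  # [] because the index is still the old snapshot

index = build_index(live_web)  # next crawl
after = search(index, "gadgets")  # ['example.com/new.html']

if __name__ == "__main__":
    print(before)
    print(after)
```

The new page exists the whole time, but a search finds it only after the next crawl rebuilds the snapshot.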

Comments (10)

  1. Posted by Ex Girlfiend
    After reading this article, I feel that I need more info. Could you suggest some resources?
  2. Posted by sam
    This article is useful for me... www.indotricky.co.cc
  3. Posted by Phil Parks
    This information is only partially correct. The statement in paragraph two - "Google does not divulge its fresh-crawling schedules or targets, but Webmasters can get an indication of the crawl's frequency through sharp observance" - is not completely true. Webmasters who use Google's own Webmaster Tools can see the actual crawl rate that Googlebot has recorded for the site for the previous 90 days. Though the tool used to tell you when the bot last crawled your page, that feature is now gone. So, using a great stats program like AWStats will tell you how often the Googlebot (and the other spiders) are visiting your site.
  4. Posted by Shams Pirani
    AWstats is NOT a "great tool" any more than webalizer or any other off the shelf stats product. The only way to monitor and accurately log all server activity is by writing your own tools and using them to regularly crawl your own logs and work out what's going on - eg my webalizer count has missed 90% of the search engine accesses to my site, and if I didn't have my own in-house stats software I wouldn't know that right now my site is experiencing about 10,000% growth. The article above is very helpful in one key way - it clarifies that Google's deeper crawl involves an "update" which seems to come around once a month - I think that around the 1st of each month I discover new pages have been added to google's index from my site... that's the only vital thing a serious webmaster needs to worry about in the stuff above. Fresh crawls and so on definitely confuse the lay webmaster when it comes to figuring out what's going on, so try to focus, I reckon, on the once-a-month incident and then you'll have a firmer grip on how your site is growing (or not growing).
  5. Posted by Kai van Husen
    Interesting information. Hope it works for my USAG Stuttgart Housing service. http://www.vanHusen-Immobilien.de/
  6. Posted by B Pereira
    How long does it take between Google crawling a page and adding it to the index? AWStats for my site www.mallukitchen.com shows that Googlebot crawls my site every 18 hours or so and the number of hits is around 500, but only around 50 pages from my site are indexed so far.
  7. Posted by learner
    I was wondering why my website, learn effectively was indexed in 3 days without submitting it directly to google... I guess it was a fresh crawl!
  8. Posted by Claris
    To allow Googlebot to sense your presence, the site needs to be maintained regularly so that the fresh crawl is able to find you... one word: patience, I guess :-) I am trying to build a website for my personal use and found this online - Free Website Preview - Anyone try it before?
  9. Posted by Vijay - Your Health Supplements Guide
    Excellent post! Some really great information on deep crawling and shallow crawling done by Google. Your Health Supplements Guide
  10. Posted by Nita
    This is a fantastic article. For the first time I have got an understanding of how google works and I have been reading up on this issue for weeks now, after I started my new site. I feel odd to leave the url here, but what the heck it's allowed, and everyone is doing it so here it is: http://palmistryforyou.com/ Although my site is indexed, new pages are not getting indexed and it wasn't making sense to me until I read this. I guess my site being new is not given much importance and initially it was crawled because it must have during the deep crawl. My site will never come up in the frequent fresh crawl results so now I know that I have to wait! Thanks a lot for this understanding.
