
What exactly is a bot like Googlebot

Googlebot and other Spiders

Spider, Bot, Robot, Crawler - Googlebot, Yahoo Slurp, MSNbot

Googlebot, Yahoo Slurp, MSNbot, and similar spiders, bots, and crawlers are the programs that harvest information for search engines.

For anyone tracking statistics on their website, Googlebot, MSNbot, and Yahoo Slurp are welcome guests. These three search engine bots gather (harvest) information about your pages for their respective search engines. Seeing these spiders more often is desirable because it means your site is being crawled more often, so new and changed pages are more likely to show up quickly in the SERPs (search engine results pages).

A spider is nothing more than a computer program that follows links on the web and gathers information as it goes. For example, Googlebot will follow the HREF and SRC attributes in a page's tags to find the pages and images associated with a site. Because these crawlers are merely computer programs, they aren't always the smartest of creatures and may get caught in endless loops built by dynamically created webpages.
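To make the idea concrete, here is a minimal sketch of a spider in Python. It is not how Googlebot is actually built. The class and function names are made up for illustration, and the "web" is a small hypothetical in-memory dictionary rather than real HTTP requests. It collects HREF and SRC targets and remembers which URLs it has already seen, so two pages that link to each other do not trap it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the targets of href and src attributes from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def crawl(pages, start_url):
    """Walk a dict of {url: html}, visiting each known URL at most once."""
    seen = set()
    queue = [start_url]
    order = []
    while queue:
        url = queue.pop(0)
        if url in seen or url not in pages:
            continue  # already visited, or not part of our pretend site
        seen.add(url)
        order.append(url)
        collector = LinkCollector(url)
        collector.feed(pages[url])
        queue.extend(collector.links)
    return order


# Hypothetical two-page site whose pages link to each other (a loop).
SITE = {
    "http://example.com/": '<a href="/about.html">About</a><img src="/logo.png">',
    "http://example.com/about.html": '<a href="/">Home</a>',
}

if __name__ == "__main__":
    print(crawl(SITE, "http://example.com/"))
```

A "seen" set only protects against loops between a fixed set of URLs. A dynamic page that invents a new URL on every visit would also need something like a page or depth limit.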

Robots.txt


While having Googlebot index your site more quickly is almost always a good thing, there are times when you don't want certain pages or images indexed. Most "reputable" spiders will obey the directives given in the robots.txt file. This file is a plain-text document, placed at the root of your site, that tells spiders which paths they may and may not crawl. You can also explicitly instruct a robot not to follow any of the links on a page with the following meta tag: <META NAME="Googlebot" CONTENT="nofollow">.
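As a rough illustration of how a well-behaved spider reads robots.txt, the sketch below uses Python's standard urllib.robotparser module. The rules and paths are invented for the example, and the helper name allowed is ours, not part of any search engine's code.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: every bot is kept out of /private/,
# and Googlebot has its own group that keeps it out of /images/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if the named bot may fetch the given path."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent in ("Googlebot", "SomeOtherBot"):
        for path in ("/images/logo.png", "/private/notes.html", "/index.php"):
            print(agent, path, allowed(agent, path))
```

One detail worth noticing: a bot that has its own group follows only that group, so in this example Googlebot is not bound by the /private/ rule listed under "*". If you want a rule to apply to every bot, repeat it in each group.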

Because of how these bots work and the importance they place on text links, many people have begun placing keyword-filled text links to their websites in their signatures on blogs and other comment sections. To reduce the impact of these links, you can instruct spiders not to follow one specific link by placing the following in the anchor tag: rel="nofollow". Spiders that honor the attribute will not follow or credit that link, which reduces the number of outgoing links counted on your page and may help you maintain your PageRank.
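Here is a small sketch of how a spider might apply both kinds of nofollow described above: the page-wide meta tag and the per-link rel attribute. The class name FollowableLinks and the sample HTML are hypothetical, and real search engines apply many more rules than this.

```python
from html.parser import HTMLParser


class FollowableLinks(HTMLParser):
    """Collect anchor links a polite bot would still follow on one page."""

    def __init__(self, bot_name="googlebot"):
        super().__init__()
        self.bot_name = bot_name
        self.page_nofollow = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # <meta name="robots" ...> applies to all bots,
            # <meta name="googlebot" ...> only to the named bot.
            name = (attrs.get("name") or "").lower()
            content = (attrs.get("content") or "").lower()
            if name in ("robots", self.bot_name) and "nofollow" in content:
                self.page_nofollow = True
        elif tag == "a" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            if "nofollow" not in rel:
                self.links.append(attrs["href"])

    def followable(self):
        """Return the links to follow, or nothing if the page opted out."""
        return [] if self.page_nofollow else self.links


if __name__ == "__main__":
    page = FollowableLinks()
    page.feed('<a href="/a">ours</a>'
              '<a rel="nofollow" href="http://spam.example/">signature</a>')
    print(page.followable())
```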

Bad SPAM bots


Just as in life, not all bots are good. There are "bad" bots that don't care about your robots.txt and are only out there to harvest your email address. To fight these "bad" spam bots, some people use JavaScript to "hide" their email addresses. However, anything that can be written to avoid a bad bot can be broken by an even worse bot. One company is fighting bots by giving them just what they want: email addresses, and lots of them. The catch is that they are all email addresses of known spammers. I found the site to be quite clever.

Hopefully this will clear up some confusion as to what a bot, crawler, or spider is and how it goes about collecting information. If you have any questions, post them below and we will try to answer as quickly as possible. If you need help with SEO (search engine optimization), we would love to show you ways to increase how often Googlebot, Yahoo Slurp, and MSNbot index your site.



Filed Under: Google, Bots, Spiders, Spam, PageRank, SEO

Comments

1

111

Is there any relation between these spider bots and the URL size?

 

2

AHFXStudios

The deeper in the file structure the bot has to dig (for example, multiple folders inside a folder, i.e. http://www.ahfx.net/folder1/folder2/folder3/file.php), the less likely the spider is to visit that page. Normally fewer links point down into deeper pages than to high-level main pages, so the spiders do not visit the deeper pages as much. There isn't a set limit on URL size, but each browser and operating system has its own limits. It is always better to keep the URL short so people can remember it, and to keep the directory structure short and fat rather than tall and skinny.
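If you want to spot the "tall and skinny" parts of a site, a quick sketch like the following counts how many folders sit between the domain and the file. The function name folder_depth is made up for this example, and treating any last segment that contains a dot as a file name is a simplifying assumption.

```python
from urllib.parse import urlparse


def folder_depth(url):
    """Count the folders between the domain and the final file name."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    # Assumption: a last segment containing a dot is a file, not a folder.
    if segments and "." in segments[-1]:
        segments = segments[:-1]
    return len(segments)


if __name__ == "__main__":
    print(folder_depth("http://www.ahfx.net/folder1/folder2/folder3/file.php"))
    print(folder_depth("http://www.ahfx.net/index.php"))
```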

 

3

tandrus

Hello Adam,

I would like the spiders to come to some pages located in my modules directory and index them, so that I can optimize these pages too. So far, the index.php file has been indexed, and maybe some of the main website pages have been indexed, but these other pages have yet to be found.

I'm worried, however, that the search engines may never find these pages, since they are only doorway pages that will help people in a specific location find my service. (None of the main pages point to these doorway links.)

The site will work for everyone, not just Idaho Falls and Pocatello, and the service is location-specific. Therefore, I believe that I will be using doorway pages in a legitimate way.

The problem is, these doorway pages are also location-specific pages. Until I create similar pages for every city, nationwide, it might seem too "tra-la-la" to have pages for Idaho Falls, and Pocatello on a "Here-is-Where-We-Are-Marketing" page, and nothing for any other cities. Doing so would show my clients less company development than I want to show.

A site map might get me through this problem (since, in my opinion, very few people use these as a navigation tool, yet the spider might access it and explore the rest of the site, including the doorway pages, which I could include on the site map). Do you think I am on the right track?

If so, is creating a site map as simple as creating a page (or a few pages) showing all of my links, or is there more to it than this?

Thanks,

T. J.

 

4

AHFXStudios

I would be wary of any "doorway" pages. Any time you try to show your business as being somewhere it isn't, the search engines will penalize your site. Adding a sitemap to your site will help get each of the pages you want indexed. However, realize that PageRank will not filter downstream through a sitemap as well as it does through links from one of the main pages, and just because a page is "indexed" doesn't mean that it will show up in the SERPs.

 

5

tandrus

Hello Adam,

I just got back from CES, and got to talk with some of the Google underlings. I asked them how I might monitor when their crawlers last indexed my page. They hinted that the date in the cache might be an indicator of this information.

I've used your site as a case study (since you are the only person I have personally talked with who has hit number 1 at any significant level in the search rankings).

It bothers me that when I look at ahfx.net's cached info on Google, I see a date of 12/26/2005. To me, this older cache date bucks against the theory that "content is king" (assuming the Googlings' hint was correct), since your content is dynamically refreshed every time someone opens the home page.

I know that you have created a monitoring software to track for you when your site has been spidered last.

This brings me to the following questions:

1) Does your software agree that 12/26/2005 was the last time Google spidered you?

2) If so, why do you think that you are not getting "the love" from Google, when you are always presenting them with "new content"?

Best Regards,


T. J.

 

6

AHFXStudios

Unfortunately, they might have led you astray to a point. I've been crawled 403 times this month (Jan 1 - Jan 10th), without counting any days after Dec 26th to the end of the year. However, just because they crawl my site doesn't mean that they re-"indexed" my site. You must further realize that Google uses many datacenters and each datacenter is updated on its own schedule. So I could have a cache date of Dec 26th on one datacenter and Jan 10th on another (as is the case on some of the datacenters right now). I would be interested in what keyword terms you are using to run your tests.

 

7

rich duplessis

I was in a forum and noticed a lot of Yahoo spiders looking at pages. How can I get the Yahoo spider to visit my pages?

 

8

AHFXStudios

All you need is a normal HTML link from a currently "indexed" page in Yahoo, Google, or MSN and their spiders will find you automatically. If you want to submit a new site and invite the search engines to spider your site, you can look at our e-list for information on how to submit your sites for free.

 

9

Derrick

I am new to all of this and I have a few questions.

1. How do I know when msnbot, googlebot and all the other bots out there crawl my site?
2. How do I get them to crawl my site more often?
3. How do I get a better PR? I believe the better my PR the better off I will be, but I'm not sure.

I have a google site map, but is there anything else I need in order to improve my site?

If anyone can answer these questions, it would be very helpful to me. Also, please visit my site and let me know what you think I need to do in order to improve it.

Thanks

 

10

AHFXStudios

Derrick, great questions. Here are some short answers:



  1. Check your server logs for the following user agents: Googlebot, MSNbot, and Yahoo! Slurp. That will tell you the last time those bots visited your site. If you don't have access to your logs, you can check the "cached" version of your site to see the last time they indexed your page (but not necessarily the last time they visited your page).

  2. Frequent updates (daily/weekly) to your site content will help bring back spiders more often. Increasing your PageRank will also help, because your page will be "more important".

  3. Read our blogs about PageRank for great ways to increase your PageRank.

  4. A Sitemap is a good start to get pages indexed, but you need to make sure that you have done your basic SEO to get the benefit out of your sitemap.
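As a rough sketch of the log check in answer 1, the Python below scans access-log lines for the three user agents and keeps the most recent timestamp for each. It assumes the common "combined" log format, where the user agent is the last quoted field, and that the log is in chronological order. The sample lines and IP addresses are invented, and user agents can be faked, so treat the result as a hint rather than proof.

```python
import re

BOT_NAMES = ("Googlebot", "msnbot", "Yahoo! Slurp")

# Captures the [timestamp] field and the final quoted field (the user agent).
LOG_LINE = re.compile(r'\[(?P<time>[^\]]+)\].*"(?P<agent>[^"]*)"$')


def last_bot_visits(lines):
    """Map each bot name to the timestamp of its last request in the log."""
    last_seen = {}
    for line in lines:
        match = LOG_LINE.search(line.strip())
        if not match:
            continue
        agent = match.group("agent").lower()
        for bot in BOT_NAMES:
            if bot.lower() in agent:
                last_seen[bot] = match.group("time")
    return last_seen


# Invented sample lines in combined log format.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Jan/2006:06:25:24 -0700] "GET /index.php HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '198.51.100.2 - - [10/Jan/2006:07:01:02 -0700] "GET /robots.txt HTTP/1.0" 200 120 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '203.0.113.7 - - [10/Jan/2006:08:15:00 -0700] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
    '192.0.2.1 - - [10/Jan/2006:09:00:00 -0700] "GET /blog.php HTTP/1.1" 200 8000 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

if __name__ == "__main__":
    for bot, when in last_bot_visits(SAMPLE_LOG).items():
        print(bot, "last seen", when)
```

To run it against a real log, pass an open file instead of SAMPLE_LOG, for example last_bot_visits(open("access.log")), where access.log stands in for wherever your host keeps its logs.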


 

11

jill gaylor

great information on bots and spiders, very useful

 

12

cully cangelosi

I am wondering if there is anything special I need to do to have my website available sooner on all the search engines. I was told it could take as long as 6 weeks.

 

13

AHFX Web Design

Cully, it really depends on how the spiders find your site. If you get a link from a site that the spiders visit on a frequent basis, they will find you quicker. However, that doesn't mean that you will be indexed that quickly. Once you are "found", they have to decide where to put your site. Based on previous experience, I tell people that they will normally need to wait 3 months for all of their pages (in a normal-sized website) to be fully indexed. But keep in mind that indexing and ranking well are totally different.

 

14

Michiel Malotaux

How do I quickly determine the page size of any website?

 

15

Chris

Great resource!

 

16

steve mac

Hi
Great information thanks

Regards
Steve.

 

17

yoyoyo1

Heh, this is good knowledge :)

 

18

Rick Teller

Can bots and other such programs get to web pages that are not referenced in a link such as an href or image? For example, if a page can only be reached via a link in an email, will a bot, spider, or crawler find it?

 

19

Selvam

My website is nearly 45 days old and still has not been crawled by Google. What can I do now? How do I make Googlebot crawl my website?

 

20

concord

Doorway pages are specially created to fool the search engines algorithm and
draw search engine visitors to a website. Doorway pages are Web pages
designed and built specifically to draw search engine visitors to your
website. They are standalone pages designed only to act as entry or door to
your websites. Usually these pages are theme based. They are also known as
portal pages, jump pages, gateway pages, and entry pages.

Doorway pages are considered to be black hat SEO and should not be used,
although many SEO companies use these pages to gain more traffic.

 

21

AHFX Web Design

concord, you need to read our post on natural versus bad doorways.

 

22

What Is A Spider?

Excellent post for webmasters! I did find this to be slightly too technical for those curious about web crawlers who are not computer people!

 

23

What Is A Spider?

Excellent post for webmasters and those who already understand search engine spiders. I found from some research that users are more often searching for a layman's explanation of what a spider is, so I decided to put it into layman's terms, and I hope the article helps further answer your visitors' questions about spiders.

 

24

What Is A Spider?

A spider is a web application or program that visits websites and reads the page information while searching for more pages on the website. This allows companies like Google (with the Googlebot crawler), MSN (with MSNbot), and Yahoo! (with Yahoo! Slurp) to add to their abundant source of information.

 

25

EM

Can someone use a Google bot to view my profile on MySpace if it's private? Meaning read my messages, look at my pictures, etc.?

 

26

Buddy Dixon

I have numerous web sites that I am either building or have built. Some are doing OK in Yahoo and some are not. I believe I have found what is wrong with the ones that aren't doing well; my problem is trying to get Yahoo to crawl them again. Any advice would be very much appreciated. Thank you in advance.

 

27

RRj

Thanks for your articles.

 

28

Si Gembala

Spider bots are eating a lot of bandwidth. On my website, Googlebot uses 1.3 gigabytes of bandwidth. How can I use less bandwidth?

 

29

sevgi

Very nice, thank you admin.

 

30

lee

My blog hasn't been indexed for two weeks, and my site has been indexed but never gets updated. Is there anyone who can tell me when Google or Bing updates their search engines, please? Thank you. Lee

 

31

Viv

How does a Yahoo Slurp spider manage to get into the shoutbox of a fansite, and seemingly answer people?

 

32

Mark Thomas

What is the difference between a spider and a bot?

 

33

Laurie

What is a Hound Spider Bot? This shows up as a content title; when clicked in analytics, it shows tmwebminetestconfiguration.php under content performance.
Is this a bad bot? Do you think it installed something on our site?

Thanks for any help you can offer.

 

34

Rajesh

I am new to all of this and I have a few questions.

1. How do I know when msnbot, googlebot and all the other bots out there crawl my site?
2. How do I get them to crawl my site more often?
3. How do I get a better PR? I believe the better my PR the better off I will be, but I'm not sure.

 

35

JS

Thank you for sharing the information. I wonder if you have some information about the MSN bot and Yahoo bot. It seems they hardly visit my sites.

 

36

Atacoplease

Thanks for the article; it was very quick and to the point. I must ask, however: why do they sometimes go over the same content over and over? I don't think they are stuck in a loop when they do that.

 

37

Lilly

I have a blog through Blogspot with Feedjit installed. It's a widget similar to the Google Analytics that is available to Blogspot users, only it gives me more information, such as the cities my visits are coming from. I've been getting visits from Mountain View, CA for a while and I assumed it was a Google bot or spider. Only recently, however, have the Mountain View visits been disappearing: if I'm watching my live traffic feed and see this city show up, the 'footprints' are then erased. If I refresh my live view, the visit doesn't show anymore. Can you solve this mystery? I can't seem to find any information about it anywhere. Is this a Google bot/spider or an actual visitor erasing their visit?

 

38

Murter Kallande

Please robot register my site

 

39

Andrew

Hi, Many thanks for your great article about bots and spiders.

I recently became more interested in how these spiders and bots work. My website on average has between 40 and 120 bots and spiders on it every time I visit it. One thing that I don't understand: I have a "who is online" page in my website admin, and when I click to check who is online, these bots instantly drop by half or even to a third. Did I scare these bots? I quite like these bots, because they get my site a higher traffic ranking in the search engines.

But now, since you mentioned bad bots, I'm a little bit concerned. If I make a page available with lots of email addresses (just unwanted spam email addresses), would these bad bots stay at that page and not go through my whole website like the good bots?

I hope you can help.
Cheers!

 
