
Googlebot, Yahoo Slurp, MSNbot, and similar spiders, bots, and crawlers are the programs that harvest information for search engines.
For anyone tracking statistics on their website, Googlebot, MSNbot, and Yahoo Slurp are welcome guests. These three search engine bots gather (harvest) information about your pages for their respective search engines. Seeing these spiders more often is desirable because it means your site is being indexed more often, so it is more likely to show up quickly in the SERPs (search engine results pages).
A spider is nothing more than a computer program that follows links on the web and gathers information as it goes. For example, Googlebot follows the HREF and SRC attributes in a page's tags to find the pages and images associated with a given site. Because these crawlers are merely computer programs, they aren't always the smartest of creatures and may get caught in endless loops built by dynamically created webpages.
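To make the idea concrete, here is a minimal sketch of the link-following step in Python. It is not how Googlebot itself is written. The names extract_links and crawl, and the in-memory pages dictionary that stands in for the web, are made up for this illustration. The "seen" set is one simple way a crawler can avoid looping forever over pages that link back to each other.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the URLs found in href and src attributes of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the URL of the page itself.
                self.found.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every href/src target in the page, as absolute URLs."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.found


def crawl(pages, start):
    """Breadth-first walk over a {url: html} dict (a stand-in for the web)."""
    seen = {start}          # URLs already queued, so loops end
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in extract_links(pages.get(url, ""), url):
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

In this sketch, a page that links back to the home page does not cause a second visit, because the home page is already in the "seen" set.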
Robots.txt
While having Googlebot index your site more quickly is almost always a good thing, there are times when you don't want certain pages or images indexed. Most "reputable" spiders will obey the directives given in the robots.txt file. This file is a plain-text document that tells spiders what they may and may not index. You can also explicitly instruct a robot not to follow any of the links on a page with the following meta tag: <META NAME="Googlebot" CONTENT="nofollow">.
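If you want to see how a well-behaved spider might read such a file, the sketch below uses Python's standard urllib.robotparser module. The robots.txt contents, the www.example.com URLs, and the helper name can_fetch are invented for the example; your own rules will differ.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: every bot is kept out of /private/,
# and Googlebot gets its own rule keeping it out of /images/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /images/
"""


def can_fetch(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the rules in robots_txt let this user agent request url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("Googlebot", "http://www.example.com/images/logo.png"),
        ("Googlebot", "http://www.example.com/index.html"),
        ("SomeOtherBot", "http://www.example.com/private/notes.html"),
    ]:
        print(agent, url, can_fetch(agent, url))
```

Keep in mind that this only models a bot that chooses to obey the file. Nothing in robots.txt can force a bot to stay out.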
Because of how these bots work and the importance they place on text links, many people have begun placing keyword-filled text links to their websites in their signatures on blogs and other comment sections. To reduce the impact of these links, you can instruct spiders not to follow one specific link by adding the following attribute to its anchor tag: rel="nofollow". This reduces the number of outgoing links that count against your page and helps you maintain your PageRank.
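From the spider's side, honoring that attribute could look something like the sketch below. This is only an illustration of the idea. The function name split_links and the sample markup are hypothetical, and real search engines do not publish exactly how they handle these links.

```python
from html.parser import HTMLParser


class FollowableLinks(HTMLParser):
    """Sorts anchor links into those a polite spider would follow or skip."""

    def __init__(self):
        super().__init__()
        self.follow = []
        self.skipped = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel can hold several space-separated values, e.g. rel="external nofollow".
        rel_values = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_values:
            self.skipped.append(href)
        else:
            self.follow.append(href)


def split_links(html):
    """Return (links to follow, links marked nofollow) for one page."""
    parser = FollowableLinks()
    parser.feed(html)
    return parser.follow, parser.skipped
```

Feeding it a comment section with one normal link and one signature link marked rel="nofollow" puts the first in the follow list and the second in the skipped list.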
Bad SPAM bots
Now, just as in life, not all bots are good. There are "bad" bots that don't care about your robots.txt and are only out there to harvest your email address. To fight these "bad" SPAM bots, some people use JavaScript to "hide" their email addresses. However, anything that can be written to avoid a bad bot can be broken by an even worse bot. One company is fighting bots by giving them just what they want: email addresses, and lots of them. However, they are all email addresses of known spammers. I found the site to be quite clever.
Hopefully this will clear up some confusion as to what a bot, crawler, or spider is and how it goes about collecting information. If you have any questions, post them below and we will try to answer as quickly as possible. If you need help with SEO (search engine optimization), we would love to show you ways to increase how often Googlebot, Yahoo Slurp, and MSNbot index your site.
Filed Under: Google, Bots, Spiders, Spam, PageRank, SEO
Please feel free to leave a comment or question about the article
"What exactly is a bot like Googlebot"
Comments
2
AHFXStudios
The deeper in the file structure the bot has to dig (for example, multiple folders inside a folder, i.e. http://www.ahfx.net/folder1/folder2/folder3/file.php), the less likely the spider is to visit that page. Normally, fewer links point down into deeper pages than to high-level main pages; thus the spiders do not visit the deeper pages as much. There isn't a set limit on URL size, but each browser and operating system has its own limits. It is always better to keep the URL short so people can remember it, and to keep the directory structure short and fat rather than tall and skinny.
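If you want a rough way to check how deep your own pages sit, here is a small Python sketch. The function name folder_depth is made up, and it simply counts folders in the URL path; it says nothing about how any particular spider actually weighs depth.

```python
from urllib.parse import urlparse


def folder_depth(url):
    """Count the folders between the site root and the file in a URL."""
    path = urlparse(url).path
    segments = [part for part in path.split("/") if part]
    # If the path does not end in "/", treat the last segment as the file name.
    if segments and not path.endswith("/"):
        segments = segments[:-1]
    return len(segments)


if __name__ == "__main__":
    print(folder_depth("http://www.ahfx.net/folder1/folder2/folder3/file.php"))
    print(folder_depth("http://www.ahfx.net/index.php"))
```

The first URL is three folders deep, and the second sits at the root with a depth of zero.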
3
tandrus
Hello Adam,
I would like the spiders to come to some pages located in my modules directory and index them, so that I can optimize these pages too. So far, the index.php file has been indexed, and maybe some of the main website pages have been indexed, but these other pages have yet to be found.
I'm worried, however, that the search engines may never find these pages since they are only doorway pages, which will help people in a specific location find my service. (None of the main pages point to these doorway links.)
The site will work for everyone, not just Idaho Falls and Pocatello, and the service is location-specific. Therefore, I believe that I will be using doorway pages in a legitimate way.
The problem is, these doorway pages are also location-specific pages. Until I create similar pages for every city, nationwide, it might seem too "tra-la-la" to have pages for Idaho Falls, and Pocatello on a "Here-is-Where-We-Are-Marketing" page, and nothing for any other cities. Doing so would show my clients less company development than I want to show.
A site map might get me through this problem (since very few people, in my opinion, use these as a navigation tool, yet the spider might access and explore the rest of the site, including the doorway pages, which I could include on the site map). Do you think I am on the right track?
If so, is creating a site map as simple as creating a page (or a few pages) showing all of my links, or is there more to it than this?
Thanks,
T. J.
4
AHFXStudios
I would be wary of any "doorway" pages. Any time you try to show your business as being somewhere it isn't, the search engines will penalize your site. Adding a sitemap to your site will help get each of the pages you want indexed. However, realize that PageRank will not filter downstream as well through a sitemap as it does through links from one of the main pages, and just because you have a page that is "indexed" doesn't mean that it will show up in the SERPs.
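A sitemap can be a plain HTML page of links, or an XML file in the sitemaps.org format that the search engines read directly. As a rough sketch of the XML kind, the Python below builds one from a list of page URLs. The function name build_sitemap and the example.com URLs are placeholders, and optional fields such as last-modified dates are left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal XML sitemap (as a string) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        "http://www.example.com/",
        "http://www.example.com/services.php",
    ]))
```

You would save the output as a file on your site and list every page you want found, including ones that no main page links to.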
5
tandrus
Hello Adam,
I just got back from CES, and got to talk with some of the Google underlings. I asked them how I might monitor when their crawlers last indexed my page. They hinted that the date in the Cache might be an indicator of this information.
I've used your site as a case study (since you are the only person I have personally talked with who has hit #1 at any significant level in the search rankings).
It bothers me that when I look at ahfx.net's cached info on Google, I see a date of 12/26/2005. To me, this older cache date bucks against the theory that "content is king" (assuming the Googlings' hint was correct), since your content is dynamically refreshed every time someone opens the home page.
I know that you have created a monitoring software to track for you when your site has been spidered last.
This brings me to the following questions:
1) Does your software agree that 12/26/2005 was the last time Google spidered you?
2) If so, why do you think that you are not getting "the love" from Google, when you are always presenting them with "new content"?
Best Regards,
T. J.
6
AHFXStudios
Unfortunately, they might have led you astray to a point. I've been crawled 403 times this month (Jan 1 - Jan 10th), and that is without counting any days from Dec 26th to the end of the year. However, just because they crawl my site doesn't mean that they re-"indexed" my site. You must further realize that Google uses many datacenters and each datacenter is updated on its own schedule. So I could have a cache date of Dec 26th on one datacenter and Jan 10th on another (as is the case on some of the datacenters right now). I would be interested in what keyword terms you are using to run your tests.
7
rich duplessis
I was in a forum and noticed a lot of yahoo spiders looking at pages. how can I get the yahoo spider to visit my pages?
8
AHFXStudios
All you need is a normal HTML link from a currently "indexed" page in Yahoo, Google, or MSN and their spiders will find you automatically. If you want to submit a new site and invite the search engines to spider your site, you can look at our e-list for information on how to submit your sites for free.
9
Derrick
I am new to all of this and I have a few questions.
1. How do I know when msnbot, googlebot and all the other bots out there crawl my site?
2. How do I get them to crawl my site more often?
3. How do I get a better PR? I believe the better my PR the better off I will be, but not sure.
I have a google site map, but is there anything else I need in order to improve my site?
If anyone can answer these questions, it would be very helpful to me. Also, please visit my site and let me know what you think I need to do in order to improve it.
Thanks
10
AHFXStudios
Derrick, great questions. Here are some short answers:
- Check your server logs for the following user agents: Googlebot, MSNbot, and Yahoo! Slurp. That will tell you the last time those bots visited your site. If you don't have access to your logs, you can check the "cached" version of your site to see the last time they indexed your page (but not necessarily the last time they visited your page).
- Frequent updates (daily/weekly) to your site content will help bring back spiders more often. Also increasing your pagerank will help bring back spiders more often because your page will be "more important".
- Read our blogs about pagerank for great ways to increase your pagerank.
- A Sitemap is a good start to get pages indexed, but you need to make sure that you have done your basic SEO to get the benefit out of your sitemap.
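For the first answer above, checking the logs can be scripted. The sketch below is one way to do it in Python, assuming your server writes the common Apache/Nginx "combined" log format. The function name bot_visits, the list of bot names, and the sample line are invented for the example. Remember that a user agent string can be faked, so treat the results as a hint rather than proof.

```python
import re
from collections import defaultdict

# Substrings to look for in the user agent field (matched case-insensitively).
BOT_NAMES = ("Googlebot", "msnbot", "Yahoo! Slurp")

# Assumes the "combined" log format:
# ip - - [time] "request" status bytes "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def bot_visits(lines):
    """Return {bot name: [timestamps]} for every log line sent by a known bot."""
    visits = defaultdict(list)
    for line in lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue  # not in the expected format, skip it
        agent = match.group("agent").lower()
        for bot in BOT_NAMES:
            if bot.lower() in agent:
                visits[bot].append(match.group("time"))
    return dict(visits)


if __name__ == "__main__":
    sample = [
        '192.0.2.1 - - [10/Jan/2006:06:25:14 -0700] "GET /index.php HTTP/1.1" '
        '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
        '+http://www.google.com/bot.html)"',
    ]
    print(bot_visits(sample))
```

The last timestamp in each list is the most recent visit, and the length of the list is the number of times that bot requested something.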
11
jill gaylor
great information on bots and spiders, very useful
12
cully cangelosi
I am wondering is there any thing special I need to do, to have my website available sooner on all the search engines. I was told it could take as long as 6 weeks.
13
AHFX Web Design
Cully, it really depends on how the spiders find your site. If you get a link from a site that the spiders visit on a frequent basis, they will find you quicker. However, that doesn't mean that you will be indexed that quickly. Once you are "found", they have to decide where to put your site. Based on previous experience, I tell people that they will normally need to wait 3 months for all of their pages (in a normal-sized website) to be fully indexed. But keep in mind that indexing and ranking well are totally different.
14
Michiel Malotaux
how do I quickly determine the page-size of any website?
15
Chris
great resource!
16
steve mac
Hi
Great information thanks
Regards
Steve.
17
yoyoyo1
Heh, this is good knowledge :)
18
Rick Teller
Can bots and other such programs get to web pages that are not referenced in a link such as an href or image? For example, if a page can only be reached via a link in an email, will a bot, spider, or crawler find it?
19
Selvam
My website is nearly 45 days old. It still has not been crawled by Google. What can I do now? How do I make Googlebot crawl my website?
20
concord
Doorway pages are specially created to fool the search engines algorithm and
draw search engine visitors to a website. Doorway pages are Web pages
designed and built specifically to draw search engine visitors to your
website. They are standalone pages designed only to act as entry or door to
your websites. Usually these pages are theme based. They are also known as
portal pages, jump pages, gateway pages, and entry pages
Doorway pages are considered to be part of black hat and should not be used,
although many of seo companies use these pages for gaining more traffic.
22
What Is A Spider?
Excellent post for webmasters! I did find this to be slightly too technical for those curious about web crawlers who are not computer people!
23
What Is A Spider?
Excellent post for webmasters and those who already understand search engine spiders. I did find from some research that users are more often searching for a layman's explanation of what a spider is, so I decided to put it into layman's terms and hope it further answers your visitors' questions about spiders.
24
What Is A Spider?
A spider is a web application or program that visits websites and reads the page information, while searching for more pages on the website. This allows companies like Google (with the Googlebot crawler), MSN (with MSNbot), and Yahoo! (with Yahoo! Slurp) to add to their abundant source of information.
25
EM
Can someone view my profile on MySpace with a Google bot if it's private? Meaning read my messages, look at my pictures, etc.?
26
Buddy Dixon
I have numerous web sites that I am either building or have built. Some are doing OK in YAHOO and some are not. I believe I have found what is wrong with the ones that aren't doing well; my problem is trying to get YAHOO to crawl them again. Any advice would be very much appreciated. Thank you in advance.
27
RRj
Thanks for your articles.
28
Si Gembala
Spider bots are eating a lot of bandwidth. On my website, Googlebot uses 1.3 gigabytes of bandwidth. How can I reduce the bandwidth they use?
29
sevgi
very nice, thank you admin
30
lee
My blog hasn't been indexed for two weeks, and my site has been indexed but never gets updated. Is there anyone who can tell me when Google or Bing update their search engines, please? Thank you. Lee
31
Viv
How does a yahoo slurp spider manage to get into the shoutbox of a fansite??? And seemingly answer people
32
Mark Thomas
What is the difference between a spider and a bot?
33
Laurie
What is a Hound Spider Bot? This shows up as a content title; when clicked in analytics, it shows tmwebminetestconfiguration.php under content performance.
Is this a bad bot? Do you think it installed something on our site?
Thanks for any help you can offer.
1
111
Is there any relation between these spider bots and the URL size?