Thursday, August 2, 2007

Robots.txt File Blocks Common Bad Bots Easily

If your site has any size to it, eventually robots from various servers will find it through links from other sites. They scour the internet day and night - one computer or hundreds - looking for information. There are good bots and bad bots.

I would consider Google a good bot. Google runs a few different bots, and I want all of them to index as much as possible on all of my web sites, because when they do, Google gives me representation in its index - search results shown when Google users query a term or phrase at Google.com.

I would consider twicelor a bad bot. I know nothing about it, except that it ravenously consumed 180,000 pages in one week at my blog. Robots act like web surfers, really: they come to your site, view a page, follow the links on it, and pull up all of those pages too. If you have many pages - I have about 1100 pages on one of my sites - a bot like twicelor can pull a HUGE number of them as it crawls through your site.

If you're unlucky enough to have a poor hosting account you may find your site shut down, just because this one bad bot ate up all of your bandwidth!

I was looking through my stats and was happy to see that - for about a week - they were climbing through the roof! I thought, ahh, Google is finally doing its work! The traffic appeared to be coming in steadily - no spikes at certain hours (or minutes) of the day, which usually signify a robot coming through.

I was excited until I checked the Browser stats - which show which browsers visitors used, and also the names of the bots behind them (in my Godaddy.com hosting account).

The stats showed a bot named twicelor was killing me with its requests for pages.

376,000 pages were pulled by the twicelor bot in just a few days! It kept pulling more every day - increasing almost exponentially as it fed off my site.

The bandwidth it ate was only a couple of gigabytes - I have a terabyte of bandwidth - so, no worries YET. But if that twicelor bot kept pulling pages, I was afraid I'd have to start paying for extra bandwidth.

Would be a great scam for Godaddy to have twicelor run through Godaddy sites and eat up everyone's bandwidth so they needed to purchase more, eh? I'm sure that's not what is going on...

So - I realized I needed to block twicelor from viewing my site. How to do this? A robots.txt file in the root directory of your site.

How is that done?

Make a robots.txt file and FTP it into your root directory, and into any subdomains if you have them. If you have a Blogger or WordPress.com blog you won't be able to access that directory. No worries - those services already block the bad bots for you, so you don't need to give it a second thought.

If your site or blog is hosted somewhere other than blogger.com, typepad.com, wordpress.com, or another canned system, you'd do well to make this change immediately and add a robots.txt file to your root directory.

The text below will keep the good bots coming and keep some of the bad ones away. Keep in mind, this is not a failsafe system: each bot chooses whether or not to ignore your robots.txt file. Most will comply with your wishes, but not all. The ones on this list are thought to always comply.

Your stats may take a hit after you implement this change, since you're getting rid of BAD traffic - fake traffic. Bad bots look like real traffic in your stats, but on closer inspection it's worthless traffic - and actually harmful, since they're just TAKING without giving you anything of benefit, the way Google gives back.

Below is a file like the one I use. You can copy and paste it into a plain-text file - just make sure there are no spaces before or after the text when you save it as "robots.txt".
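The twicelor entry is the one from my own stats; the other names here are just common examples of bots people block - swap in whatever misbehaving bots show up in your own logs. A "Disallow: /" line tells that bot to stay out of everything; the final "User-agent: *" block, with an empty Disallow, explicitly welcomes everyone else - Google included.

User-agent: twicelor
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: *
Disallow: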

Upload that file to the main directory of your domain - the same directory your index.htm or index.html file resides in. If you don't know how to do this, contact the webmaster who maintains your website. Do it before you need it - it saves the time and energy of resolving the matter later.

This is NOT a failsafe method... .htaccess files are better - but beyond the scope of this article.
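If you're curious, here's the general idea of an .htaccess block - a rough sketch that assumes your host runs Apache with mod_rewrite enabled, and that twicelor actually identifies itself in its user-agent string (check your logs to be sure):

RewriteEngine On
# Send a 403 Forbidden to any client whose user-agent contains "twicelor"
RewriteCond %{HTTP_USER_AGENT} twicelor [NC]
RewriteRule .* - [F,L]

Unlike robots.txt, this blocks the bot at the server, whether it chooses to behave or not.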

And, in the future you may need to block more robots than the ones listed here.

Best to read up on this subject at other sources and really understand as much as you can about robots and what they do.

:) Good luck!
