Wednesday, January 23, 2013

ADDING A ROBOTS.TXT FILE TO YOUR MAGENTO STORE



A web-robot, or simply a robot, is a program that performs a specific task automatically on the web, be it search engine indexing or HTML and link validation. Googlebot and Bingbot are two of the most common web-robots.
Bandwidth is a measure of the data sent across the Internet; each time a person visits your website, a portion of bandwidth is used. The same applies to web-robots: each time a web-robot visits your site, it uses a small portion of bandwidth.
Ordinarily the bandwidth that web-robots use is relatively small, but web-robots can sometimes consume gigabytes of bandwidth, which can be a problem if your hosting plan imposes bandwidth limits.
Being visited regularly by web-robots is by no means a bad thing; in fact it is perfectly normal if you regularly add new content to your website. Problems arise when web-robots become stuck in infinite loops on your site. These loops can be caused by custom scripts, but are most often caused when a session ID is appended to every URL that is indexed. The constant activity of a web-robot trapped in such a loop is what can cause heavy bandwidth usage.
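As a hypothetical illustration, a Magento store that appends session IDs exposes the same page under an endless variety of URLs, so a web-robot treats each one as new content and never stops crawling:

http://www.mywebsite.com/shoes.html?SID=1a2b3c4d5e
http://www.mywebsite.com/shoes.html?SID=6f7a8b9c0d

Both URLs above (the product and SID values are invented for this example) point to the same page. The Disallow: /*?SID= rule in the robots.txt file below is what stops well-behaved web-robots from following them.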
A robots.txt file is a simple text file that sits in your web root folder and controls which web-robots may visit your website and what they may view during that visit. By controlling the directories a web-robot is allowed to index, you can prevent duplicate content from being indexed, which in turn provides SEO benefits. Furthermore, a robots.txt file allows you to specify a Crawl-Delay to stop a web-robot from constantly indexing and crawling your website, helping to reduce the footprint it makes on your bandwidth allocation.
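As a minimal sketch of the syntax (the directory and delay value here are purely illustrative), a robots.txt that blocks one directory for all web-robots and asks them to wait 30 seconds between requests looks like this:

User-agent: *
Crawl-delay: 30
Disallow: /checkout/

Note that Crawl-delay is a non-standard extension to the robots.txt protocol; some crawlers such as Bingbot honour it, while Googlebot ignores it.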
For your convenience, we have included below a widely available robots.txt file for use with Magento, which is beneficial both for improving your SEO and for reducing bandwidth usage and server load.


# $Id: robots.txt,v magento-specific 2010/28/01 18:24:19 goba Exp $
#
# robots.txt
#
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
#
# This file will be ignored unless it is at the root of your host:
# Used:  http://example.com/robots.txt
# Ignored: http://example.com/site/robots.txt
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/wc/robots.html
#
# For syntax checking, see:
# http://www.sxw.org.uk/computing/robots/check.html
# Website Sitemap
# Sitemap: http://www.mywebsite.com/sitemap.xml

# Crawlers Setup
User-agent: *
Crawl-delay: 30
# Allowable Index
Allow: /*?p=
Allow: /index.php/blog/
Allow: /catalog/seo_sitemap/category/
Allow: /catalogsearch/result/
Allow: /media/
# Directories
Disallow: /404/
Disallow: /app/
Disallow: /cgi-bin/
Disallow: /downloader/
Disallow: /errors/
Disallow: /includes/
Disallow: /js/
Disallow: /lib/
Disallow: /magento/
Disallow: /pkginfo/
Disallow: /report/
Disallow: /scripts/
Disallow: /shell/
Disallow: /skin/
Disallow: /stats/
Disallow: /var/
# Paths (clean URLs)
Disallow: /index.php/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/
Disallow: /catalogsearch/
Disallow: /checkout/
Disallow: /control/
Disallow: /contacts/
Disallow: /customer/
Disallow: /customize/
Disallow: /newsletter/
Disallow: /poll/
Disallow: /review/
Disallow: /sendfriend/
Disallow: /tag/
Disallow: /wishlist/
# Files
Disallow: /cron.php
Disallow: /cron.sh
Disallow: /error_log
Disallow: /install.php
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /STATUS.txt
# Paths (no clean URLs)
Disallow: /*.js$
Disallow: /*.css$
Disallow: /*.php$
Disallow: /*?p=*&
Disallow: /*?SID=
Disallow: /*?limit=all

# Uncomment if you do not wish for Google to index your images
#User-agent: Googlebot-Image
#Disallow: /
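
For example, with those last two lines uncommented, the section blocking Google's image crawler would read:

User-agent: Googlebot-Image
Disallow: /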


Full credit for the above robots.txt file goes to its original creator. Discussion of the robots.txt file can be found in a thread on the Magento Community Forums.

To install the robots.txt file to your domain, you can follow these few simple steps:
Step 1: Save the robots.txt file above to your computer as a plain-text file named robots.txt.
Step 2: If your Magento is installed within a sub-directory, you will need to modify the robots.txt file accordingly by prefixing every path with the sub-directory. For example, 'Disallow: /404/' becomes 'Disallow: /your-sub-directory/404/', 'Disallow: /app/' becomes 'Disallow: /your-sub-directory/app/', and so on (see the example after these steps).
Step 3: If your domain has a sitemap.xml, uncomment the 'Sitemap:' line near the top of the robots.txt file and replace the example URL with the full URL of your own sitemap.xml (also shown in the example below).
Step 4: Upload the robots.txt file to your web root folder. This can be done by placing the file within your 'httpdocs/' directory, either by logging into your Plesk hosting control panel with your credentials or through your FTP client of choice.
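
As an illustration of Steps 2 and 3, assuming a hypothetical store installed at http://www.mywebsite.com/shop/ with a sitemap at http://www.mywebsite.com/sitemap.xml, the relevant lines of the edited robots.txt would read:

# Website Sitemap
Sitemap: http://www.mywebsite.com/sitemap.xml

# Directories
Disallow: /shop/404/
Disallow: /shop/app/
Disallow: /shop/cgi-bin/

The same '/shop/' prefix would be applied to every Allow and Disallow path in the file. Note that the Sitemap directive must use an absolute URL, and the robots.txt file itself must still live at the web root (http://www.mywebsite.com/robots.txt), not inside the sub-directory.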
If you have any feedback on the above robots.txt file, feel free to leave a post below. It would be good to discuss and improve the above robots.txt file for the benefit of everyone.



Original source: http://www.nublue.co.uk/forums/topic/318/adding-a-robotstxt-file-to-your-magento-store/
