Friday, July 19, 2013

What Are Sitemaps, Robots.txt and RSS?


Sitemap: 

A Sitemap is a file that lists the URLs of a site. The Sitemaps protocol allows a webmaster to inform search engines about the URLs on a website that are available for crawling. It also lets webmasters include additional information about each URL, such as when it was last updated and how important it is relative to other URLs on the site. Sitemap files are limited to 50,000 URLs and 10 megabytes each. Google first introduced Sitemaps 0.84 in June 2005. In November 2006, Google, MSN and Yahoo announced joint support for the Sitemaps protocol.
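
To make the description above concrete, here is a minimal sketch of how a Sitemap file could be generated with Python's standard library. The function name build_sitemap, the entry dictionaries and the example URLs are assumptions made for illustration; only the element names (urlset, url, loc, lastmod, priority) and the 50,000 URL limit come from the Sitemaps protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (version 0.9).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return Sitemap XML as a string for a list of URL entries.

    Each entry is a dict with a required "loc" and optional "lastmod"
    (a date string such as "2013-07-19") and "priority" (0.0 to 1.0).
    This helper is hypothetical, not part of any library.
    """
    if len(entries) > 50000:
        raise ValueError("a single Sitemap file may list at most 50,000 URLs")

    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        if "lastmod" in entry:
            # When the page was last updated.
            ET.SubElement(url, "lastmod").text = entry["lastmod"]
        if "priority" in entry:
            # Importance relative to other URLs on the same site.
            ET.SubElement(url, "priority").text = f"{entry['priority']:.1f}"
    return ET.tostring(urlset, encoding="unicode")


# Example URLs are placeholders.
sitemap_xml = build_sitemap([
    {"loc": "http://www.site.com/", "lastmod": "2013-07-19", "priority": 1.0},
    {"loc": "http://www.site.com/about", "priority": 0.5},
])
print(sitemap_xml)
```

The resulting file is normally saved as sitemap.xml at the root of the site so that search engines can find it.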

RSS: 

RSS (Rich Site Summary) is a format for delivering regularly changing web content such as the latest news headlines, blog entries, audio and video. An RSS document includes full or summarized text along with metadata such as publishing dates and authorship. It lets you stay informed easily by retrieving the latest content from the sites you follow. RSS can be read using software called an RSS reader, feed reader or aggregator, which collects RSS feeds from various sites and displays them for you to read and use.
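
As a rough illustration of what a feed reader does, the sketch below parses a small RSS 2.0 document and pulls out the title, link, publishing date and summary of each item. The sample feed, its URLs and the function name read_items are invented for this example; a real reader would download the feed from the site first.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 feed used only for illustration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://www.site.com/</link>
    <description>Hypothetical feed</description>
    <item>
      <title>Example post</title>
      <link>http://www.site.com/example-post</link>
      <pubDate>Fri, 19 Jul 2013 00:00:00 GMT</pubDate>
      <description>A short summary of the post.</description>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return a list of dicts, one per <item> in the RSS document."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "summary": item.findtext("description", default=""),
        })
    return items


for entry in read_items(SAMPLE_FEED):
    print(entry["published"], "|", entry["title"], "|", entry["link"])
```

An aggregator would run the same kind of loop over many feeds and merge the items into one list for reading.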

Robots.txt File:

Robots.txt is a convention for asking cooperating web crawlers and other web robots not to access all or part of a website that is otherwise publicly viewable. Website owners use the robots.txt file to give instructions about their site to web robots.
It works like this: if a robot wants to visit a website URL, say http://www.site.com, it first checks for http://www.site.com/robots.txt and finds:

User-agent: *
Disallow: /

The "User-agent: *" line means this section applies to all robots. The "Disallow: /" line tells the robot not to visit any pages on the site.
There are two important considerations when using robots.txt:
Robots can ignore your robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
The robots.txt file is publicly available. Anyone can see which sections of your server you do not want robots to visit.
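
The check a cooperating robot performs can be sketched with Python's standard urllib.robotparser module. The helper is_allowed, the robot name "ExampleBot" and the second rule set (which blocks only a /private/ folder) are assumptions added for illustration; the first rule set is the one shown above.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above: every robot is asked to stay out.
BLOCK_ALL = """User-agent: *
Disallow: /
"""

# A hypothetical variant that only blocks one folder.
BLOCK_PRIVATE = """User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(BLOCK_ALL, "ExampleBot", "http://www.site.com/index.html"))
print(is_allowed(BLOCK_PRIVATE, "ExampleBot", "http://www.site.com/index.html"))
print(is_allowed(BLOCK_PRIVATE, "ExampleBot", "http://www.site.com/private/data.html"))
```

A well-behaved robot runs a check like this before every request. Nothing forces it to, which is why robots.txt should not be relied on to hide sensitive pages.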
