CSSRockstars

Posts Tagged ‘Search Engine’

The robots.txt file is often overlooked. Yet it is an important factor in getting webpages indexed properly, and it is very easy to set up.

I know that robots.txt is nothing new. But I’ve been preparing an SEO sheet for a while and wanted to share this small & useful portion with you.

What is robots.txt?

Robots.txt is a plain text file, placed in the root folder of a website, that is used to exclude content from the crawling process of search engine spiders / bots. The convention it follows is called the Robots Exclusion Protocol.

Why use robots.txt?

In general, we prefer that our webpages are indexed by the search engines. But there may be some content that we don’t want to be crawled & indexed: a personal images folder, the website administration folder, a web developer’s test folder for a customer, folders with no search value like cgi-bin, and many more. The main idea is that we don’t want them to be indexed.

Is a robots.txt file a guaranteed solution?

No. Standards-based bots like Google’s, Yahoo’s or the other big search engines’ robots obey your robots.txt file, but only because they are programmed to. Any bot can be configured to ignore the robots.txt file. Result: there is no guarantee.
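To make the "no guarantee" point concrete, here is a minimal sketch of a crawler's decision logic using Python's standard urllib.robotparser module. The robots.txt content, the ExampleBot name and the respect_robots flag are all made up for illustration. The point is only that the check happens inside the crawler, so a crawler that skips it is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def may_fetch(path, user_agent="ExampleBot", respect_robots=True):
    """Return True if this (hypothetical) crawler would request `path`."""
    if not respect_robots:
        # A crawler that skips the check is never stopped by the file itself.
        return True
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


# A polite crawler versus one configured to ignore robots.txt.
print(may_fetch("/admin/login.html"))
print(may_fetch("/admin/login.html", respect_robots=False))
```

If a folder really must stay private, protect it on the server (for example with a password) rather than relying on robots.txt alone.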

How to use a robots.txt file?

A robots.txt file has a few simple directives that manage the bots. These are:

  • User-agent: this parameter defines which bots the following parameters apply to. Use * as a wildcard for all bots, or a specific name such as Googlebot for Google.
  • Disallow: defines which folders or files will be excluded. An empty value means nothing will be excluded, / means everything will be excluded, and /foldername/ or /filename can be used to exclude a specific folder or file. Values are matched as path prefixes: /foldername/ (with the trailing slash) excludes everything inside that folder, while /foldername (without it) excludes every path that starts with those characters, so /foldername.html and /foldername-old/ are excluded too.
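The prefix matching of Disallow is easy to get wrong, so here is a small sketch that tests a few values with Python's standard urllib.robotparser module. The folder names, the ExampleBot name and the allowed() helper are assumptions for illustration only, and other parsers may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, path):
    """Parse robots.txt text and report whether `user_agent` may crawl `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# Hypothetical rule sets, one per Disallow style.
WITH_SLASH = "User-agent: *\nDisallow: /private/\n"
WITHOUT_SLASH = "User-agent: *\nDisallow: /private\n"
EMPTY = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# One group for Googlebot only, and a catch-all group for everyone else.
GROUPS = "User-agent: Googlebot\nDisallow: /drafts/\n\nUser-agent: *\nDisallow:\n"

for path in ("/private/photo.jpg", "/private.html", "/public/index.html"):
    print(
        path,
        "| with slash:", allowed(WITH_SLASH, "ExampleBot", path),
        "| without slash:", allowed(WITHOUT_SLASH, "ExampleBot", path),
    )

print("Googlebot on /drafts/:", allowed(GROUPS, "Googlebot", "/drafts/post.html"))
print("OtherBot on /drafts/:", allowed(GROUPS, "OtherBot", "/drafts/post.html"))
```

With the trailing slash only the contents of the folder are blocked. Without it, /private.html is blocked as well, because it starts with the same characters.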

There are also some other parameters which are only supported by some bots. These are:

  • Allow: this parameter works just the opposite of Disallow. You can mention which content will be allowed to be crawled here. Some bots, such as Googlebot, also accept * as a wildcard in the path.
  • Request-rate: defines the pages/seconds ratio to be crawled. 1/20 would be 1 page every 20 seconds.
  • Crawl-delay: defines how many seconds to wait after each successful crawl.
  • Visit-time: you can define between which hours you want your pages to be crawled. Example usage: 0100-0330, which means that pages will be crawled between 01:00 and 03:30 GMT.
  • Sitemap: this is the parameter where you can show where your sitemap file is. You must use the complete URL address of the file.
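As a rough illustration of how a bot might read these optional parameters, the sketch below feeds a made-up robots.txt to Python's standard urllib.robotparser module (site_maps() needs Python 3.8 or newer). The example.com URL, the values and the ExampleBot name are assumptions. This particular parser understands Crawl-delay, Request-rate and Sitemap but silently skips Visit-time, which is a good reminder that support differs from bot to bot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt using the optional parameters described above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Crawl-delay: 10
Request-rate: 1/20
Visit-time: 0100-0330
Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Seconds to wait between requests, or None if the file sets no delay.
delay = parser.crawl_delay("ExampleBot")

# A named tuple with `requests` and `seconds` fields, or None if not set.
rate = parser.request_rate("ExampleBot")

# A list of sitemap URLs, or None if the file lists none.
sitemaps = parser.site_maps()

print("Crawl-delay:", delay)
print("Request-rate:", rate.requests, "page(s) per", rate.seconds, "seconds")
print("Sitemaps:", sitemaps)
```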

Robots.txt example:

User-agent: * #the rules below apply to all search engine spiders.
Disallow: /secretcontent/ #disallows them from crawling the secretcontent folder.

Resources:
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40360
http://www.robotstxt.org/
http://www.searchtools.com/robots/robots-txt.html
http://en.wikipedia.org/wiki/Robots.txt

To optimize a website for SEO, you may have to deal with many manual tasks like submitting to search engines, checking the keyword density of pages, searching for any missing alt attributes & more.

Web CEO, one of the most popular software packages in this area, lets you manage the SEO side of an unlimited number of websites through a very informative interface.

Web CEO comes in a feature-rich free edition & paid editions with even more features.

I’ve been using Web CEO for a long time, and one of its best features is the advice it gives after analyzing pages. Rather than just listing the problems, Web CEO explains in detail how to solve them.

Some Web CEO features:

  • Submit websites to search engines
  • Analyze a website’s position in search engines for any keyword
  • Keyword Mining from competitors’ pages
  • Advice for General Search Engine Compliance
  • Analyzing Link Popularity (Number of Links to Your Website)
  • Analyzing Backward Links vs. Competition
  • Reporting