  • Spider Search Engine

    A search engine spider is a program that automatically fetches web pages and delivers them to a search engine. Also known as a "web crawler", a spider is sent out by the search engine to retrieve (crawl) as many web pages as possible.

    The mechanism is similar to a browser downloading a web page, except that the spider reads only the HTML code.

    The crawler's duty is to index, rank, arrange, and organize pages in an index structure so they can be searched very quickly. The objects the crawler addresses are page content, files, folders, and web directories, while it is guided by robots.txt.
    The crawler interprets robots.txt as a guide for indexing a website, so it knows which pages should or should not be indexed. The more complete the robots.txt guide, the better the indexing of the site's contents.
    A web page usually contains several links to other pages; when the crawler sees a link, it follows it and retrieves that page as well.
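    The link-following step described above can be sketched in a few lines of Python. This is a minimal illustration, not a real crawler: it only extracts the href links a crawler would queue for its next visits, and the HTML snippet is a hypothetical example page.

```python
from html.parser import HTMLParser

# Collect the href of every <a> tag -- the links a crawler would follow next.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page content a spider has just downloaded.
page = '<html><body><a href="/about">About</a> <a href="/contact">Contact</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/about', '/contact']
```

    A real spider would fetch each discovered URL in turn, respecting robots.txt before every request.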
    It is highly recommended not to rely on JavaScript for the main menu; provide a fallback with the <noscript> tag, because search engine crawlers cannot execute JavaScript and will otherwise skip those links.
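    As a sketch of that advice, a JavaScript-driven menu can be paired with a plain HTML fallback inside <noscript>. The file names and links below are hypothetical.

```html
<!-- Hypothetical main menu: the script-driven version for browsers,
     plus a plain HTML fallback in <noscript> that crawlers can follow. -->
<script src="menu.js"></script>
<noscript>
  <ul>
    <li><a href="/products.html">Products</a></li>
    <li><a href="/support.html">Support</a></li>
  </ul>
</noscript>
```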

    Robots.txt is a text file (not HTML) placed at the root of a website to tell search engine robots not to visit specific pages. The easiest way to create one is with the robots.txt generator in Google's Webmaster Tools; you can also write your own and analyze how it works.
    If you use Google Sitemaps, robots.txt validation is now available there as well.
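    A minimal robots.txt looks like the following; the domain and paths are hypothetical. `User-agent: *` addresses all crawlers, and each `Disallow` line names a path they should not visit.

```
# Hypothetical file served at https://www.example.com/robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/

# Optionally point crawlers at the XML sitemap.
Sitemap: https://www.example.com/sitemap.xml
```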

    In general there are two types of sitemap: HTML and XML. An HTML sitemap lists your site's pages and helps users find the information they need, while an XML sitemap provides information about your website to the search engines.
    In essence, a sitemap's duty is to make sure Google knows everything on your site, including URLs it might not be able to find during normal crawling.
    Sitemaps help greatly if:

    1. Your site has dynamic content.
    2. Your site has pages that are not easily discovered by Googlebot during crawling, for example pages full of AJAX or Flash.
    3. Your site is new or has few inbound links, because Googlebot can have trouble crawling to all of its pages.
    4. Your site has content that is poorly linked internally.
    5. You want to tell search engines when content was last modified.
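    An XML sitemap covering the points above might look like this; the URL and dates are hypothetical, and `changefreq` and `priority` are optional hints in the sitemap protocol.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <!-- lastmod tells the search engine when the page last changed -->
    <lastmod>2011-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```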

    Source : PC+
