Project Summary
===============

Info
----
Created on 28/nov/2012
Revised on 05/jan/2013
Author: Marcello La Rocca

Description
-----------
PyCrawler is a breadth-first multithreaded crawler that keeps a queue of the URLs it encounters.

The main class is CrawlHandler. Once a CrawlHandler is created, crawling can be started on any page using start_crawling(url). Only one domain can be crawled at a time.

Four extra parameters can be passed to start_crawling (see the usage sketch under Usage below):

* The number of threads to start (each thread picks up links from a shared synchronized queue and processes them);
* A maximum depth for the crawl (i.e. the maximum distance of a crawled page from the starting point);
* A limit on the number of pages crawled;
* A delay between two consecutive requests by a single Crawler, to allow for polite crawling (default: 0.15 seconds).

The actual crawling is performed by the Crawler class: each Crawler instance runs in a separate thread. A list of pages already visited is kept to avoid circular redirection between pages.

Assumptions
-----------
Only static content is crawled:

* Anchors
* Images
* Scripts
* Stylesheets
* Forms (only those using the GET method)

Query parameters and #fragments are discarded.
URLs can't contain single or double quotes.
Relocation and form submission via JavaScript are ignored.
Tags are not checked for being well formed and well placed (e.g. whether A and FORM tags appear only in the body section).
Only URLs on the same domain are listed.
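
Usage
-----
A minimal usage sketch. Only CrawlHandler and start_crawling(url) are named above; the module name and the keyword-argument names below are illustrative assumptions, not the project's actual API::

    from pycrawler import CrawlHandler  # hypothetical module name

    # Crawl example.com with 4 worker threads, stopping at depth 3
    # or after 100 pages, pausing between requests per Crawler.
    handler = CrawlHandler()
    handler.start_crawling(
        "http://www.example.com",
        num_threads=4,   # threads picking links from the shared queue (assumed name)
        max_depth=3,     # max distance of a page from the starting point (assumed name)
        max_pages=100,   # limit on the number of pages crawled (assumed name)
        delay=0.15,      # polite-crawling delay between requests; 0.15 s is the stated default
    )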