PyCrawler is a breadth-first, multithreaded crawler that keeps a queue of the URLs it encounters. The main class is CrawlHandler. Once an instance is created, crawling can be started from any page with start_crawling(url). Only one domain can be crawled at a time.
The actual crawling is done by the Crawler class: each instance of Crawler runs in a separate thread. A list of already-visited pages is kept to avoid circular redirection between pages.
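The design described above (a shared queue, worker threads, and a record of visited pages) can be illustrated with a minimal, self-contained sketch. This is not PyCrawler's actual code: the names `crawl`, `get_links`, `num_workers`, and `FAKE_SITE` are assumptions made for illustration, and an in-memory dictionary stands in for real page fetching so the example runs without network access.

```python
import queue
import threading

# Hypothetical in-memory "site": page URL -> links found on that page.
# A real crawler would download and parse each page instead.
FAKE_SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/a"],
    "http://example.com/c": [],
}


def crawl(start_url, get_links, num_workers=4):
    """Crawl from start_url with worker threads sharing one FIFO queue."""
    pending = queue.Queue()          # FIFO order gives the breadth-first behaviour
    visited = {start_url}            # URLs already queued or processed
    visited_lock = threading.Lock()  # protects `visited` across threads
    pending.put(start_url)

    def worker():
        while True:
            url = pending.get()
            if url is None:          # sentinel: no more work for this thread
                pending.task_done()
                return
            try:
                for link in get_links(url):
                    with visited_lock:
                        if link in visited:
                            continue  # already seen: avoids circular links
                        visited.add(link)
                    pending.put(link)
            finally:
                pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    pending.join()                   # wait until every queued URL is processed
    for _ in threads:
        pending.put(None)            # one sentinel per worker
    for t in threads:
        t.join()
    return visited


pages = crawl("http://example.com/", lambda url: FAKE_SITE.get(url, []))
```

With several threads pulling from the same queue, the visiting order is only approximately breadth-first, but every reachable page is still processed exactly once because a URL is added to the visited set before it is queued.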
Notes and limitations:
- Query parameters and #fragments are discarded.
- URLs can’t contain single or double quotes.
- Page relocation and form submissions performed via JavaScript are ignored.
- Tags are not checked for being well formed or well placed (e.g. whether A and FORM appear only in the body section).
- Only URLs on the same domain are listed.
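As a rough illustration of two of the points above (dropping query parameters and fragments, and keeping only same-domain URLs), here is a small sketch built on the standard library's `urllib.parse`. The helper names `normalize` and `same_domain` are hypothetical and are not part of PyCrawler; the project may implement these steps differently.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize(base_url, href):
    """Resolve href against base_url, then drop the query string and fragment."""
    parts = urlsplit(urljoin(base_url, href))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def same_domain(url, start_url):
    """True when both URLs share the same network location (host and port)."""
    return urlsplit(url).netloc.lower() == urlsplit(start_url).netloc.lower()


link = normalize("http://example.com/docs/", "page.html?x=1#top")
# link == "http://example.com/docs/page.html"
keep = same_domain(link, "http://example.com/")  # True, so the link would be listed
```

Comparing the full network location treats subdomains such as www.example.com and example.com as different domains; whether PyCrawler does the same is not stated here.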