Project Summary

Info

Created on 28 Nov 2012. Revised on 05 Jan 2013.

Author: Marcello La Rocca

Description

PyCrawler is a breadth-first, multithreaded crawler that keeps a queue of the URLs it encounters. The main class is CrawlHandler; once a handler is created, crawling can be started on any page by calling start_crawling(url). Only one domain can be crawled at a time.

Four optional parameters can be passed to start_crawling (see the usage sketch after this list):
  • The number of threads to be started (each thread picks up links from a shared synchronized queue and processes them);
  • A maximum depth for the crawl (i.e. the maximum distance of a crawled page from the starting point);
  • A limit on the number of pages crawled;
  • A delay between two consecutive requests by a single Crawler (for polite crawling; the default is 0.15 seconds).
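As a rough illustration of how these parameters fit together, a call might look like the minimal sketch below. The keyword names (num_threads, max_depth, page_limit, delay) and the import path are assumptions made for illustration, not the confirmed API:

    # Hedged usage sketch: keyword names and import path are illustrative
    # assumptions, not the confirmed PyCrawler API.
    from pycrawler import CrawlHandler  # import path assumed

    handler = CrawlHandler()
    handler.start_crawling(
        "http://example.com/",
        num_threads=4,    # worker threads pulling URLs from the shared queue
        max_depth=3,      # max distance of a crawled page from the start page
        page_limit=500,   # stop after this many pages
        delay=0.15,       # seconds between two requests of a single Crawler
    )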

The actual crawling is performed by the Crawler class: each Crawler instance runs in a separate thread. A list of already-visited pages is kept to avoid following circular links between pages.
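The paragraph above describes a standard worker-pool pattern; the following is a minimal sketch of it using Python's queue and threading modules, not PyCrawler's own code. The fetch and extract_links helpers are placeholders standing in for the real Crawler's download and parsing steps:

    import queue
    import threading
    import time

    def fetch(url):
        # Placeholder: the real Crawler downloads the page at `url`.
        return ""

    def extract_links(html):
        # Placeholder: the real Crawler parses links out of static tags.
        return []

    url_queue = queue.Queue()        # shared synchronized queue of URLs
    visited = set()                  # pages already crawled
    visited_lock = threading.Lock()  # guards `visited` across threads

    def worker(delay=0.15):
        while True:
            url = url_queue.get()
            try:
                with visited_lock:
                    if url in visited:
                        continue   # already seen: avoids circular links
                    visited.add(url)
                for link in extract_links(fetch(url)):
                    url_queue.put(link)
                time.sleep(delay)  # polite crawling: pause between requests
            finally:
                url_queue.task_done()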

Assumptions

Only static content is crawled, extracted from the following tags (a sketch of this kind of extraction follows the list):
  • Anchors (a)
  • Images (img)
  • Scripts (script)
  • Stylesheets (link)
  • Forms (only those using the GET method)
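As a rough sketch of this kind of extraction (not PyCrawler's actual implementation), the standard html.parser module can collect URLs from these tags; the LinkExtractor class below is hypothetical:

    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collects URLs from the static tags listed above."""

        # Which attribute carries the URL for each tag of interest.
        URL_ATTRS = {"a": "href", "img": "src", "script": "src", "link": "href"}

        def __init__(self):
            super().__init__()
            self.urls = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag in self.URL_ATTRS:
                url = attrs.get(self.URL_ATTRS[tag])
                if url:
                    self.urls.append(url)
            elif tag == "form" and (attrs.get("method") or "get").lower() == "get":
                # Only GET forms are followed; the action is treated as a link.
                if attrs.get("action"):
                    self.urls.append(attrs["action"])

    parser = LinkExtractor()
    parser.feed('<a href="/about">About</a><img src="logo.png">')
    print(parser.urls)  # ['/about', 'logo.png']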

Further assumptions:
  • Query parameters and #fragments are discarded (see the normalization sketch after this list)
  • URLs must not contain single or double quotes
  • Relocation and form submission via JavaScript are ignored
  • Tags are not checked for being well formed or well placed (e.g. whether A and FORM appear only in the body section)
  • Only URLs on the same domain are listed
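A minimal sketch of the URL handling these assumptions imply, using the standard urllib.parse module; the helper names normalize and same_domain are illustrative, not PyCrawler's own:

    from urllib.parse import urljoin, urlsplit, urlunsplit

    def normalize(base_url, link):
        # Resolve the link against the current page, then drop the query
        # string and the #fragment, as described in the assumptions above.
        parts = urlsplit(urljoin(base_url, link))
        return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

    def same_domain(url, start_url):
        # Only URLs on the same domain as the start page are listed.
        return urlsplit(url).netloc == urlsplit(start_url).netloc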
