Auto Generated Documentation

class pycrawler.Crawler(threadID, handler)

A breadth-first crawler.

Each crawler object can be run in a separate thread; all crawlers access the same synchronized Queue, both to find the next page to crawl and to enqueue the URLs of newly discovered pages to be crawled.

Parameters:
  • threadID – The ID of the thread in which the crawler is going to be run.
  • handler – A reference to the CrawlerHandler object coordinating the crawling.
run()

Starts the crawler.

quit()

Forces the thread to exit.

Raises SystemExit:
 Always raises SystemExit to silently kill the thread; the exception must be caught by the caller, or the main program will exit as well.
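Crawler threads are normally created and started by CrawlerHandler.start_crawling (documented below), but the division of labour can be sketched as follows; constructing the handler directly and wrapping each crawler in a plain threading.Thread are assumptions made for illustration, not a description of the library's internals:

    import threading
    from pycrawler import Crawler, CrawlerHandler

    handler = CrawlerHandler()  # coordinates the crawl and owns the shared queue
    crawlers = [Crawler(i, handler) for i in range(4)]

    # Each crawler's run() consumes URLs from the shared synchronized queue.
    threads = [threading.Thread(target=c.run) for c in crawlers]
    for t in threads:
        t.start()

    # quit() always raises SystemExit; it must be caught, or the main program
    # will exit as well.
    try:
        crawlers[0].quit()
    except SystemExit:
        pass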
class pycrawler.CrawlerHandler

The main crawler: the object handling the whole high-level crawling process.

start_crawling(url, threads=1, max_page_depth=None, max_pages_to_crawl=None, crawler_delay=1.5)

Starts crawling a website beginning from a URL. Only pages within the same domain will be considered for crawling, while pages outside it will be listed among the references of the individual pages.

Parameters:
  • url (string) – The starting point for the crawling.
  • threads (integer) – Number of concurrent crawlers to be run in separate threads; must be a positive integer (if omitted or invalid, it will be set to 1 by default).
  • max_page_depth (integer or None) – The maximum depth that can be reached during crawling (if omitted, crawling will stop only when all the pages of the same domain reachable from the starting one, or max_pages_to_crawl pages if that limit is set, have been crawled).
  • max_pages_to_crawl (integer or None) – The maximum number of pages to retrieve during crawling (if omitted, crawling will stop only when all the pages of the same domain reachable from the starting one have been crawled).
  • crawler_delay (float) – To allow for polite crawling, a minimum delay between two page requests can be set (1.5 seconds by default).
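A minimal usage sketch, assuming CrawlerHandler takes no constructor arguments (as suggested by the class line above); the URL, thread count and limits are illustrative values, not library defaults except for crawler_delay:

    from pycrawler import CrawlerHandler

    handler = CrawlerHandler()
    handler.start_crawling("http://example.com/",
                           threads=4,               # four concurrent crawlers
                           max_page_depth=3,        # stop 3 links away from the starting page
                           max_pages_to_crawl=200,  # or after 200 pages, whichever comes first
                           crawler_delay=1.5)       # polite delay between requests, in seconds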
format_and_enqueue_url(page_url, current_path, current_depth)

Enqueues the URL of a page to be retrieved, if it hasn't been enqueued yet and if it is “valid”, meaning it belongs to the same domain as the main page.

Parameters:
  • page_url – the original URL to be enqueued
  • current_path – the path of the current page (for relative URLs)
  • current_depth – depth of the current page (distance from the starting page)
Returns:

  • None if the URL is located in a different domain;
  • the formatted absolute URL otherwise.
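This method is used internally while crawling, but the behaviour it describes, resolving a possibly relative URL against the current page and discarding links that leave the domain, can be sketched with the standard library; this is an illustration of the idea, not the library's implementation:

    from urllib.parse import urljoin, urlparse

    def format_and_filter(page_url, current_page_url, home_netloc):
        # Resolve relative URLs against the current page's absolute URL.
        absolute = urljoin(current_page_url, page_url)
        # Links pointing outside the crawled domain yield None.
        if urlparse(absolute).netloc != home_netloc:
            return None
        return absolute

    format_and_filter("/about", "http://example.com/index.html", "example.com")
    # -> 'http://example.com/about'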

check_page_by_content(html, url)

Checks whether a page has already been visited based on its content: if two different URLs lead to the same identical page, it is caught here, and no further processing of the page will be performed.

Parameters:
  • html (string) – The content of the page.
  • url (string) – The URL of the page.
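A common way to detect such duplicates is to hash the page body and keep a set of hashes already seen; the sketch below illustrates that idea and is not taken from the library's source:

    import hashlib

    seen_hashes = set()

    def already_visited_by_content(html):
        # Identical pages produce identical digests, regardless of their URLs.
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            return True
        seen_hashes.add(digest)
        return False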

list_resources(page_url=None)

Starting from the home page (or from the page provided), lists all the resources used in it and in all the pages on the same domain reachable from the home page.

Parameters: page_url (string or None) – It is possible to specify a page different from the one from which the crawling started, and to explore the connection graph from that page (it will, of course, be limited to the part of the domain actually crawled, if a page limit or a depth limit has been specified).
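For example, once a crawl has finished, the resources can be listed for the whole crawled site or from a specific crawled page; the URL below is illustrative, and this documentation does not state whether the listing is printed or returned, so the sketch only shows the calls:

    handler.list_resources()                            # start from the page crawling began at
    handler.list_resources("http://example.com/about")  # start from a specific crawled page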

page_graph(page_url=None)

Starting from the home page (or from the page provided), draws a graph of the website.

Parameters: page_url (string or None) – It is possible to specify a page different from the one from which the crawling started, and to explore the connection graph from that page (it will, of course, be limited to the part of the domain actually crawled, if a page limit or a depth limit has been specified).
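page_graph is called in the same way; the drawing backend is not described in this documentation, so the sketch only shows the calls:

    handler.page_graph()                            # graph of the whole crawled site
    handler.page_graph("http://example.com/about")  # graph rooted at a specific crawled page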

class pycrawler.Page(url, handler)

Data structure to represent a page.

Several different fields keep track of all the static content linked from the page (see the project description).

Parameters:
  • url – The URL of the page to be retrieved, parsed and stored.
  • handler – A reference to the CrawlerHandler object coordinating the crawling, used to access its format_and_enqueue_url method and thus add the links found on the current page to the crawler's queue.

Takes a link, properly formats it, and then keeps track of it for future crawling.

If the URL belongs to another domain, it is neglected (see the CrawlerHandler.format_and_enqueue_url specification). If it belongs to this domain, it may or may not have been added to the queue (that is not this page's responsibility); however, it will be linked from this page.

Parameters:url – The URL to be enqueued.
class pycrawler.PageParser(page, handler)

Parses html code into a structured page with separate lists of links to resources and referenced pages.

The constructor creates a parser object and connects it to the page.
Parameters:page – A reference to the page object that will hold the structured info for the parsed html page.
startParsing(url)

Retrieves and parses the page located at the specified URL.

Parameters:url – The URL of the page to be retrieved and parsed.
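These classes are normally driven internally by CrawlerHandler during a crawl, but their documented signatures can be exercised directly; a minimal sketch with an illustrative URL, assuming the library is importable as shown:

    from pycrawler import CrawlerHandler, Page, PageParser

    handler = CrawlerHandler()
    page = Page("http://example.com/", handler)   # per the docs, retrieves, parses and stores the page
    parser = PageParser(page, handler)            # connects a parser to the page object
    parser.startParsing("http://example.com/")    # retrieves and parses the page at that URL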
