A breadth-first crawler.
Each crawler object can run in a separate thread; all crawlers access the same synchronized Queue, both to find the next page to crawl and to enqueue the URLs of newly discovered pages.
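A minimal sketch of that pattern (the worker function and the `fetch`/`extract_links` callables are illustrative assumptions, not the module's actual code): each worker blocks on a shared `queue.Queue`, which handles all locking internally.

```python
import queue
import threading

url_queue = queue.Queue()  # synchronized: safe to share across crawler threads

def crawl_worker(fetch, extract_links):
    """One crawler: repeatedly takes the next URL and enqueues newly found ones."""
    while True:
        url = url_queue.get()          # blocks until a URL is available
        try:
            for link in extract_links(fetch(url)):
                url_queue.put(link)    # enqueue URLs of pages still to crawl
        finally:
            url_queue.task_done()      # lets url_queue.join() track completion

# One thread per crawler object; daemon threads end with the main program.
for _ in range(4):
    threading.Thread(target=crawl_worker,
                     args=(lambda u: "", lambda h: []),  # placeholder callables
                     daemon=True).start()
```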
Starts the crawler.
Forces the thread to exit.
Raises:
  * SystemExit – Always raised to be sure to silently kill the thread; it must be caught, or the main program will exit as well.
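That warning can be illustrated with a self-contained sketch (the class body below is an assumption, not the module's code): a SystemExit raised inside a child thread's run method silently terminates only that thread, while one raised by the kill method executes in the caller's thread and must be caught there.

```python
import threading
import time

class Crawler(threading.Thread):
    def __init__(self):
        super().__init__()
        self._alive = True

    def run(self):
        while self._alive:
            time.sleep(0.1)     # stand-in for processing one page
        raise SystemExit        # silently ends this thread only

    def kill(self):
        self._alive = False
        raise SystemExit        # raised in the *caller's* thread

crawler = Crawler()
crawler.start()
try:
    crawler.kill()
except SystemExit:
    pass                        # without this, the main program exits too
crawler.join()
```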
The main crawler: the object handling the whole high-level crawling process.
Starts crawling a website beginning from a URL. Only pages within the same domain will be considered for crawling, while pages outside it will be listed among the references of the individual pages.
Parameters:
  * max_page_depth (integer or None) – The maximum depth at which pages will be crawled (if omitted, crawling will stop only when all the pages (or max_pages_to_crawl pages, if it's set) of the same domain reachable from the starting one have been crawled).
  * max_pages_to_crawl (integer or None) – The maximum number of pages to retrieve during crawling (if omitted, crawling will stop only when all the pages of the same domain reachable from the starting one have been crawled).
  * crawler_delay (float) – To allow for polite crawling, a minimum delay between two page requests can be set (by default, DEFAULT_CRAWLER_DELAY, i.e. 1.5 seconds).
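The crawler_delay parameter amounts to rate limiting between consecutive requests. A minimal sketch of that idea (PoliteFetcher is a hypothetical name, not part of the module):

```python
import time

DEFAULT_CRAWLER_DELAY = 1.5  # seconds, mirroring the default described above

class PoliteFetcher:
    """Enforces a minimum delay between two consecutive page requests."""

    def __init__(self, crawler_delay=DEFAULT_CRAWLER_DELAY):
        self.delay = crawler_delay
        self._last_request = float("-inf")

    def wait(self):
        # Sleep just long enough to honour the minimum delay, then
        # record the time of this request.
        pause = self.delay - (time.monotonic() - self._last_request)
        if pause > 0:
            time.sleep(pause)
        self._last_request = time.monotonic()
```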
Enqueues the URL of a page to be retrieved, if it hasn't been enqueued yet and if it is "valid", meaning it's in the same domain as the main page.
Parameters:
  * url – The URL to be enqueued.
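The validity check can be pictured with the standard library (an illustration of the idea; the function below and its return convention are assumptions): resolve the link, compare its network location with the main page's, and enqueue it only once.

```python
from urllib.parse import urljoin, urlparse

def format_and_enqueue_url(url, base_url, home_netloc, seen, url_queue):
    """Sketch: resolve a link, validate its domain, and enqueue it once."""
    absolute = urljoin(base_url, url)              # resolve relative links
    if urlparse(absolute).netloc != home_netloc:
        return None                                # outside the main domain
    if absolute not in seen:                       # enqueue each URL only once
        seen.add(absolute)
        url_queue.put(absolute)
    return absolute
```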
Checks whether a page has already been visited, based on its content: if two different URLs lead to the same identical page, it will be caught here, and no further processing of the page will be needed.
Parameters:
  * html (string) – The content of the page.
  * url (string) – The URL of the page.
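Duplicate detection by content is commonly implemented by hashing the page body. A minimal sketch, assuming an in-memory set of digests (the names are illustrative):

```python
import hashlib

_seen_digests = set()

def page_already_visited(html):
    """Return True if a page with an identical body was seen before."""
    digest = hashlib.sha256(html.encode("utf-8")).digest()
    if digest in _seen_digests:
        return True
    _seen_digests.add(digest)
    return False
```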
Starting from the home page (or from the page provided), lists all the resources used in it and in all the pages on the same domain reachable from the home page.
Parameters:
  * page_url (string) – It is possible to specify a page different from the one from which the crawling started, and explore the connection graph from that page (it will, of course, be limited to the part of the domain actually crawled, if a page limit or a depth limit has been specified). Defaults to the home page URL (self.__home_page_url).
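Conceptually this is a breadth-first walk over the part of the graph that was actually crawled. A sketch, under the assumption that crawled pages are stored in a mapping from URL to (resources, linked pages):

```python
from collections import deque

def list_resources(start_url, pages):
    """Sketch: pages maps url -> (resource_urls, linked_page_urls)."""
    resources, frontier, seen = set(), deque([start_url]), {start_url}
    while frontier:
        url = frontier.popleft()
        page_resources, links = pages.get(url, ((), ()))
        resources.update(page_resources)
        for link in links:
            if link not in seen:    # only pages actually crawled are reached
                seen.add(link)
                frontier.append(link)
    return resources
```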
Starting from the home page (or from the page provided), draws a graph of the website.
Parameters:
  * page_url (string) – It is possible to specify a page different from the one from which the crawling started, and explore the connection graph from that page (it will, of course, be limited to the part of the domain actually crawled, if a page limit or a depth limit has been specified). Defaults to the home page URL (self.__home_page_url).
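The documentation does not say which drawing backend is used; as a hedged sketch, the website graph can be emitted in Graphviz DOT form and rendered externally:

```python
def website_graph_dot(pages):
    """Sketch: pages maps each URL to the page URLs it links to."""
    lines = ["digraph website {"]
    for url, links in pages.items():
        for target in links:
            lines.append(f'  "{url}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)   # render e.g. with: dot -Tpng site.dot -o site.png
```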
Data structure to represent a page.
Several different fields keep track of all the static content linked from the page (see the project description).
Parameters:
  * A reference to the crawler handler object: the page needs its format_and_enqueue_url method to add the links found on the current page to the crawler's queue.
Takes a link, properly formats it, and then keeps track of it for future crawling.
If the URL belongs to another domain, it is neglected (see CrawlerHandler.format_and_enqueue_url specifications). If it belongs to this domain, it may or may not have been added to the queue (that's not this page's responsibility); however, it will be linked from this page.
Parameters:
  * url – The URL to be enqueued.
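A condensed sketch of such a data structure (the field names, the link method, and the handler's return convention are assumptions based on the descriptions above):

```python
class Page:
    """Holds the structured content of one crawled page (sketch)."""

    def __init__(self, url, handler):
        self.url = url
        self.handler = handler   # used to enqueue newly discovered links
        self.links = set()       # same-domain pages linked from this page
        self.images = set()      # examples of static-content fields
        self.scripts = set()
        self.stylesheets = set()

    def add_link(self, url):
        # The handler decides whether the URL gets enqueued; enqueuing is
        # not this page's responsibility. A non-None result is assumed to
        # mean the URL belongs to this domain, so we record the reference.
        formatted = self.handler.format_and_enqueue_url(url)
        if formatted is not None:
            self.links.add(formatted)
```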
Parses HTML code into a structured page, with separate lists of links to resources and referenced pages.
The constructor creates a parser object and connects it to the page.
Parameters:
  * page – A reference to the page object that will hold the structured info for the parsed HTML page.
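A minimal sketch of a parser connected to a page, using the standard library's html.parser (the module's actual parsing backend is not documented here, and the tag handling is illustrative):

```python
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Sketch: routes anchors and resource URLs to the connected page."""

    def __init__(self, page):
        super().__init__()
        self.page = page  # will hold the structured info for the parsed page

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.page.add_link(attrs["href"])       # referenced page
        elif tag == "img" and "src" in attrs:
            self.page.images.add(attrs["src"])      # static resource
        elif tag == "script" and "src" in attrs:
            self.page.scripts.add(attrs["src"])
```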
Retrieves and parses the page located at the specified URL.
Parameters:
  * url – The URL of the page to be retrieved and parsed.
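Retrieval plus parsing can be sketched with urllib (assumed here; the HTTP client actually used by the module is not documented):

```python
from urllib.request import urlopen

def retrieve_and_parse(parser, url):
    """Sketch: download the page at url and feed its text to the parser."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        parser.feed(response.read().decode(charset, errors="replace"))
```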