---
title: My Python Web Crawler
author: Brett Langdon
date: 2012-09-09
template: article.jade
---
How to write a very simplistic Web Crawler in Python for fun.
---
Recently I decided to take on a new project, a Python based
<a href="http://en.wikipedia.org/wiki/Web_crawler" target="_blank">web crawler</a>
that I am dubbing Breakdown. Why? I have always been interested in web crawlers
and have written a few in the past, one in Python and another before that as a
class project in C++. So what makes this project different? For starters, I
want to try to store and expose different information about the web pages it
visits. Instead of trying to analyze web pages and develop a ranking system
(like
<a href="http://en.wikipedia.org/wiki/PageRank" target="_blank">PageRank</a>)
that allows people to easily search for pages based on keywords, I want to
just store the information that is used to make those decisions and allow
people to use it how they wish.

For example, I want to provide an API for people to be able to search for
specific web pages. If the page is found in the system, it will return an easy
to use data structure that contains the page's
<a href="http://en.wikipedia.org/wiki/Meta_element" target="_blank">meta data</a>,
keyword histogram, list of links to other pages and more.
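
To make that concrete, here is a hypothetical sketch of what such a response
might look like; the field names and values are illustrative, not Breakdown's
actual schema:

```python
# Hypothetical shape of the data returned for a single crawled page;
# these keys are my own illustration, not Breakdown's final API.
page = {
    'url': 'http://brett.is/',
    'meta': {'description': 'a personal blog', 'keywords': 'python, crawler'},
    'keyword_histogram': {'python': 12, 'crawler': 8},
    'links': ['http://github.com/', 'http://twitter.com/'],
}
```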
## Overview of Web Crawlers

What is a web crawler? We can start with the simplest definition: a web crawler
is a program that, starting from a single web page, moves from web page to web
page using only the urls given in each page, starting with those provided in
the original page. This is how search engines like
<a href="http://www.google.com/" target="_blank">Google</a>,
<a href="http://www.bing.com/" target="_blank">Bing</a> and
<a href="http://www.yahoo.com/" target="_blank">Yahoo</a>
obtain the content they need for their search sites.

But a web crawler is not just about moving from site to site (even though this
can be fun to watch). Most web crawlers have a higher purpose: search engines,
for example, rank the relevance of a web page based on the content and HTML
meta data provided within the page, making it easier for people to search for
content on the internet. Other web crawlers are used for more invasive
purposes, like harvesting e-mail addresses for marketing or spam.
So what goes into making a web crawler? A web crawler, again, is not just about
moving from place to place however it feels. Web sites can actually dictate how
web crawlers access the content on their sites and how they should move around
on their site. This information is provided in the
<a href="http://www.robotstxt.org/" target="_blank">robots.txt</a>
file that can be found on most websites
(<a href="http://en.wikipedia.org/robots.txt" target="_blank">here is wikipedia’s</a>).
A rookie mistake when building a web crawler is to ignore this file. These
robots.txt files are provided as a set of guidelines and rules that web
crawlers must adhere to for a given domain, otherwise you are liable to get
your IP and/or User Agent banned. Robots.txt files tell crawlers which pages
or directories to ignore or even which ones they should consider.
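
Python's standard library even ships a parser for these files, so there is no
excuse to skip them. A minimal sketch of checking a url against robots.txt
before fetching it (the module is `robotparser` in Python 2, moved to
`urllib.robotparser` in Python 3):

```python
import robotparser  # urllib.robotparser in Python 3

# fetch and parse the site's robots.txt once per domain
rp = robotparser.RobotFileParser()
rp.set_url('http://en.wikipedia.org/robots.txt')
rp.read()

# can_fetch takes the crawler's User Agent and the url it wants to visit
if rp.can_fetch('breakdown', 'http://en.wikipedia.org/wiki/Web_crawler'):
    print 'allowed to crawl'
```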
Along with following robots.txt, please be sure to provide a useful and unique
<a href="http://en.wikipedia.org/wiki/User_agent" target="_blank">User Agent</a>,
so that sites can identify that you are a robot and not a human. For example,
if you see a User Agent of *“breakdown”* on your website, hi, it’s me. Do not
use known User Agents like
*“Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.19 (KHTML, like Gecko) Ubuntu/12.04 Chromium/18.0.1025.168 Chrome/18.0.1025.168 Safari/535.19”*;
this is, again, an easy way to get your IP address banned on many sites.
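
With the Requests module (which the prototype below uses), setting a custom
User Agent is just a matter of passing a header; a quick sketch:

```python
import requests

# identify the crawler honestly instead of spoofing a browser
headers = {'User-Agent': 'breakdown'}
response = requests.get('http://brett.is/', headers=headers)
```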
Lastly, it is important to consider adding rate limiting to your crawler. It is
wonderful to be able to crawl websites and move between them very quickly (no
one likes to wait for results), but this is another surefire way of getting
your IP banned by websites. Network admins do not like bots tying up all of
their network’s resources, making it difficult for actual users to use their
site.
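
The prototype below just sleeps between requests, but a slightly smarter
sketch would track the time of the last request per domain; this `RateLimiter`
helper is hypothetical, not part of Breakdown:

```python
import time


class RateLimiter(object):
    """Hypothetical helper: enforce a minimum delay between requests
    to the same domain."""

    def __init__(self, delay=3.0):
        self.delay = delay
        self.last_request = {}  # domain -> timestamp of the last request

    def wait(self, domain):
        # sleep off whatever remains of the delay for this domain
        elapsed = time.time() - self.last_request.get(domain, 0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_request[domain] = time.time()
```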
## Prototype of Web Crawler

So this afternoon I decided to take around an hour or so and prototype out the
code to crawl from page to page, extracting links and storing them in the
database. All this code does at the moment is download the content of a url,
parse out all of the urls, find the new urls that it has not seen before,
append them to a queue for further processing and insert them into the
database. This process has two queues and two different thread types for
processing each link.
There are two different types of workers within this module. The first is the
Grabber, which takes a single url from a queue and downloads the text content
of that url using the
<a href="http://docs.python-requests.org/en/latest/index.html" target="_blank">Requests</a>
Python module. It then passes the content along to a queue that the Parser
uses to get new content to process. The Parser takes the content that the
Grabber retrieved off of that queue and simply parses out all the links
contained within the site's html content. It then checks MongoDB to see
whether each url has been retrieved already; if not, it appends the new url to
the queue that the Grabber uses to retrieve new content and also inserts the
url into the database.
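
The post does not show the MongoDB check itself, but with pymongo it might
look something like this; the database, collection and field names are
assumptions, not Breakdown's actual schema:

```python
import pymongo

client = pymongo.MongoClient()
db = client.breakdown  # database name is an assumption

def is_new_url(url):
    """Return True and record the url if it has not been seen before."""
    if db.urls.find_one({'url': url}) is not None:
        return False
    db.urls.insert_one({'url': url})
    return True
```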
Using multiple threads per worker type (X for Grabbers and Y for Parsers),
along with two different queues to share information between the two, allows
this crawler to be self sufficient once it is started with a single url. The
Grabbers feed the queue that the Parsers work off of, and the Parsers feed the
queue that the Grabbers work from.
For now, this is all that my prototype does: it only stores links and crawls
from site to site looking for more links. What I have left to do is expand
upon the Parser to parse out more information from the html, including things
like meta data, page title, keywords, etc., as well as to incorporate
<a href="http://www.robotstxt.org/" target="_blank">robots.txt</a> into the
processing (to keep from getting banned) and automated rate limiting
(right now I have a 3 second pause between each web request).
## How Did I Do It?

So I assume at this point you want to see some code? The code is not up on
GitHub just yet; I have it hosted on my own private git repo for now and will
gladly open source the code once I have a better prototype.

Let's just take a very quick look at how I am sharing data between the
different threads.
### Parser.py
```python
import threading


class Thread(threading.Thread):
    def __init__(self, content_queue, url_queue):
        self.c_queue = content_queue
        self.u_queue = url_queue
        super(Thread, self).__init__()

    def run(self):
        while True:
            # block until the Grabbers hand us page content to work on
            data = self.c_queue.get()
            # process data: extract the urls from the html (elided here)
            links = []
            for link in links:
                self.u_queue.put(link)
            self.c_queue.task_done()
```
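
The link extraction itself is elided above. As one possible sketch (not
Breakdown's actual parsing code), the standard library's HTMLParser can pull
the links out of a page:

```python
from HTMLParser import HTMLParser  # html.parser in Python 3


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        HTMLParser.__init__(self)  # old-style class in Python 2, no super()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


# usage: feed it the html and read back the urls
extractor = LinkExtractor()
extractor.feed('<a href="http://brett.is/">my blog</a>')
print extractor.links  # ['http://brett.is/']
```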
### Grabber.py
```python
import threading
import time

import requests


class Thread(threading.Thread):
    def __init__(self, url_queue, content_queue):
        self.c_queue = content_queue
        self.u_queue = url_queue
        super(Thread, self).__init__()

    def run(self):
        while True:
            # block until a url is available to fetch
            next_url = self.u_queue.get()
            data = requests.get(next_url).text
            # put() blocks while the queue is full, so no busy-wait is needed
            self.c_queue.put(data)
            self.u_queue.task_done()
            # crude rate limiting: 3 second pause between web requests
            time.sleep(3)
```
### Breakdown
```python
from breakdown import Parser, Grabber
from Queue import Queue

num_threads = 4
max_size = 1000

url_queue = Queue()
content_queue = Queue(maxsize=max_size)

parsers = [Parser.Thread(content_queue, url_queue) for i in xrange(num_threads)]
grabbers = [Grabber.Thread(url_queue, content_queue) for i in xrange(num_threads)]

for thread in parsers + grabbers:
    thread.daemon = True
    thread.start()

# seed the crawler with its starting point
url_queue.put('http://brett.is/')
# keep the main thread alive; the daemon threads die when it exits
url_queue.join()
```
Let's quickly talk through this process. The Breakdown code is provided as a
binary script to start the crawler. It creates “num_threads” threads of each
type (Grabber and Parser). It starts each thread and then appends the starting
point for the crawler, http://brett.is/. One of the Grabber threads will then
pick up the single url, make a web request to get the content of that url and
append it to “content_queue”. Then one of the Parser threads will pick up the
content data from “content_queue”, process the data from the web page html,
parse out all of the links and append those links onto “url_queue”. This will
then allow the other Grabber threads an opportunity to make new web requests
to get more content to pass to the Parser threads. This will continue on and
on until there are no links left (hopefully never).
## My Results

I ran this script for a few minutes, maybe 10-15, and I ended up with over
11,000 links, ranging from my domain to
<a href="http://www.pandora.com/" target="_blank">pandora</a>,
<a href="http://www.twitter.com/" target="_blank">twitter</a>,
<a href="http://www.linkedin.com/" target="_blank">linkedin</a>,
<a href="http://www.github.com/" target="_blank">github</a>,
<a href="http://www.sony.com/" target="_blank">sony</a>,
and many, many more. Now that I have a decent base prototype I can continue
forward and expand upon the processing and logic that goes into each web
request.

Look forward to more posts about this in the future.