---
title: My Python Web Crawler
author: Brett Langdon
date: 2012-09-09
template: article.jade
---
How to write a very simplistic Web Crawler in Python for fun.
---
Recently I decided to take on a new project, a Python-based
<a href="http://en.wikipedia.org/wiki/Web_crawler" target="_blank">web crawler</a>
that I am dubbing Breakdown. Why? I have always been interested in web crawlers
and have written a few in the past, one previously in Python and another before
that as a class project in C++. So what makes this project different?

For starters, I want to store and expose different information about the
web pages it visits. Instead of trying to analyze web pages and develop a
ranking system (like
<a href="http://en.wikipedia.org/wiki/PageRank" target="_blank">PageRank</a>)
that allows people to easily search for pages based on keywords, I want to
store just the information that is used to make those decisions and allow people
to use it however they wish.

For example, I want to provide an API for people to search for specific
web pages. If a page is found in the system, it will return an easy-to-use
data structure that contains the page's
<a href="http://en.wikipedia.org/wiki/Meta_element" target="_blank">meta data</a>,
keyword histogram, a list of links to other pages, and more.
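
As a purely hypothetical sketch of what that returned data structure might look
like (none of these field names or values are final, they are just illustrative):

```python
# hypothetical shape of the data returned by the API for a crawled page
page = {
    'url': 'http://brett.is/',
    'meta': {'description': '...', 'keywords': ['python', 'web', 'crawler']},
    'keyword_histogram': {'python': 12, 'crawler': 7},
    'links': ['http://github.com/', 'http://twitter.com/'],
}
```
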
## Overview of Web Crawlers
What is a web crawler? We can start with the simplest definition: it is a
program that, starting from a single web page, moves from web page to web
page using only the urls given in each page, beginning with those
provided in the original page. This is how search engines like
<a href="http://www.google.com/" target="_blank">Google</a>,
<a href="http://www.bing.com/" target="_blank">Bing</a> and
<a href="http://www.yahoo.com/" target="_blank">Yahoo</a>
obtain the content they need for their search sites.

But a web crawler is not just about moving from site to site (even though this
can be fun to watch). Most web crawlers have a higher purpose, like (in the case
of search engines) ranking the relevance of a web page based on its content
and HTML meta data so that people can more easily search for content on the
internet. Other web crawlers are used for more invasive purposes, like
harvesting e-mail addresses for marketing or spam.

So what goes into making a web crawler? A web crawler, again, is not just about
moving from place to place however it feels. Web sites can actually dictate how
web crawlers access the content on their sites and how they should move around on
them. This information is provided in the
<a href="http://www.robotstxt.org/" target="_blank">robots.txt</a>
file that can be found on most websites
(<a href="http://en.wikipedia.org/robots.txt" target="_blank">here is wikipedia's</a>).
A rookie mistake when building a web crawler is to ignore this file. These
robots.txt files are provided as a set of guidelines and rules that web crawlers
must adhere to for a given domain, otherwise you are liable to get your IP and/or
User Agent banned. Robots.txt files tell crawlers which pages or directories to
ignore or even which ones they should consider.
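
As a minimal sketch of what respecting robots.txt can look like (this is not in
Breakdown yet), Python's standard library ships a `robotparser` module that
reads a site's rules and answers whether a given User Agent may fetch a url:

```python
import robotparser

# fetch and parse a site's robots.txt rules
rules = robotparser.RobotFileParser()
rules.set_url('http://en.wikipedia.org/robots.txt')
rules.read()

# ask whether our crawler (User Agent "breakdown") may fetch a given page
if rules.can_fetch('breakdown', 'http://en.wikipedia.org/wiki/Web_crawler'):
    print 'allowed to crawl this page'
else:
    print 'robots.txt says to skip this page'
```
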
Along with following robots.txt, please be sure to provide a useful and unique
<a href="http://en.wikipedia.org/wiki/User_agent" target="_blank">User Agent</a>.
This is so that sites can identify that you are a robot and not a human.
For example, if you see a User Agent of *"breakdown"* on your website, hi, it's me.
Do not use known User Agents like
*"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.19 (KHTML, like Gecko) Ubuntu/12.04 Chromium/18.0.1025.168 Chrome/18.0.1025.168 Safari/535.19"*;
that is, again, an easy way to get your IP address banned on many sites.
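
With the Requests module that Breakdown already uses to fetch pages, sending a
custom User Agent is just a matter of passing a headers dict (a quick sketch):

```python
import requests

# identify the crawler honestly instead of pretending to be a browser
headers = {'User-Agent': 'breakdown'}
response = requests.get('http://brett.is/', headers=headers)
print response.status_code, len(response.text)
```
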
Lastly, it is important to consider adding rate limiting to your crawler. It is
wonderful to be able to crawl websites and move between them very quickly (no one
likes to wait for results), but this is another surefire way of getting your IP
banned by websites. Net admins do not like bots tying up all of their network's
resources, making it difficult for actual users to use their site.
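
A crude version of this is a fixed pause between requests. A slightly smarter
sketch (hypothetical, not what Breakdown does yet) tracks when each domain was
last hit and only sleeps when we come back to the same domain too soon:

```python
import time
from urlparse import urlparse

CRAWL_DELAY = 3  # seconds to wait between requests to the same domain
last_hit = {}    # domain -> timestamp of our last request to it

def wait_for_turn(url):
    domain = urlparse(url).netloc
    elapsed = time.time() - last_hit.get(domain, 0)
    if elapsed < CRAWL_DELAY:
        # we were here recently, so back off before requesting again
        time.sleep(CRAWL_DELAY - elapsed)
    last_hit[domain] = time.time()
```
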
## Prototype of Web Crawler
So this afternoon I decided to take around an hour or so and prototype out the
code to crawl from page to page, extracting links and storing them in the database.
All this code does at the moment is download the content of a url, parse out all
of the urls, find the new urls that it has not seen before, append them to a queue
for further processing, and insert them into the database. This process has
two queues and two different thread types for processing each link.

There are two different types of threads within this module. The first is a
Grabber, which takes a single url from a queue and downloads the text
content of that url using the
<a href="http://docs.python-requests.org/en/latest/index.html" target="_blank">Requests</a>
Python module. It then passes the content along to a queue that the Parser uses
to get new content to process. The Parser takes the content that the Grabber
retrieved from that queue and simply parses out all the links
contained within the site's html content. It then checks MongoDB to see whether
each url has been retrieved already; if not, it appends the new url to the
queue that the Grabber uses to retrieve new content and also inserts the url
into the database.
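
The MongoDB check is not shown in the snippets below, but a minimal sketch of
the idea, assuming pymongo and a hypothetical `urls` collection (the real
collection and client setup may differ), could look like this:

```python
import pymongo

# hypothetical storage for seen urls; names here are illustrative only
client = pymongo.MongoClient()
urls = client.breakdown.urls
urls.ensure_index('url', unique=True)

def is_new_url(url):
    # returns True (and records the url) only the first time we see it
    if urls.find_one({'url': url}) is not None:
        return False
    urls.insert({'url': url})
    return True
```
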
The unique thing about using multiple threads of each type (X for Grabbers and Y
for Parsers), along with two different queues to share information between
the two, is that the crawler is self-sufficient once it gets started with a
single url. The Grabbers feed the queue that the Parsers work off of, and the
Parsers feed the queue that the Grabbers work from.

For now, this is all that my prototype does: it only stores links and crawls from
site to site looking for more links. What I have left to do is expand the
Parser to pull more information from the html, including things like meta
data, page title, keywords, etc., as well as incorporate
<a href="http://www.robotstxt.org/" target="_blank">robots.txt</a> into the
processing (to keep from getting banned) and automated rate limiting
(right now I have a 3 second pause between each web request).

## How Did I Do It?
So I assume at this point you want to see some code? The code is not up on
GitHub just yet; I have it hosted on my own private git repo for now and will
gladly open source it once I have a better prototype.

Let's just take a very quick look at how I am sharing data between the different
threads.

### Parser.py
```python
import threading


class Thread(threading.Thread):
    def __init__(self, content_queue, url_queue):
        self.c_queue = content_queue
        self.u_queue = url_queue
        super(Thread, self).__init__()

    def run(self):
        while True:
            # block until a Grabber hands us the html for a page
            data = self.c_queue.get()
            # process data: link extraction is elided in this snippet
            links = []
            for link in links:
                self.u_queue.put(link)
            self.c_queue.task_done()
```
### Grabber.py
```python
import threading

import requests


class Thread(threading.Thread):
    def __init__(self, url_queue, content_queue):
        self.c_queue = content_queue
        self.u_queue = url_queue
        super(Thread, self).__init__()

    def run(self):
        while True:
            # block until a new url shows up to be downloaded
            next_url = self.u_queue.get()
            data = requests.get(next_url).text
            # wait for room in the bounded content queue before handing off
            while self.c_queue.full():
                pass
            self.c_queue.put(data)
            self.u_queue.task_done()
```
### Breakdown
```python
from breakdown import Parser, Grabber
from Queue import Queue

num_threads = 4
max_size = 1000

# urls waiting to be downloaded and page content waiting to be parsed
url_queue = Queue()
content_queue = Queue(maxsize=max_size)

# spin up the Parser and Grabber thread pools
parsers = [Parser.Thread(content_queue, url_queue) for i in xrange(num_threads)]
grabbers = [Grabber.Thread(url_queue, content_queue) for i in xrange(num_threads)]
for thread in parsers + grabbers:
    thread.daemon = True
    thread.start()

# seed the crawler with its starting point
url_queue.put('http://brett.is/')

# keep the main thread alive, otherwise the daemon threads are killed immediately
url_queue.join()
content_queue.join()
```
Let's talk about this process quickly. The Breakdown code is provided as an
executable script to start the crawler. It creates `num_threads` threads for each
thread type (Grabber and Parser). It starts each thread and then appends the
starting point for the crawler, http://brett.is/. One of the Grabber threads will
then pick up the single url, make a web request to get the content of that url
and append it to `content_queue`. Then one of the Parser threads will pick up the
content data from `content_queue` and process the web page html, parsing out all
of the links and then appending those links onto `url_queue`. This will then
allow the other Grabber threads an opportunity to make new web requests to get
more content to pass to the Parser threads. This will continue on and on
until there are no links left (hopefully never).

## My Results
I ran this script for a few minutes, maybe 10-15, and I ended up with over 11,000
links ranging from my domain to
<a href="http://www.pandora.com/" target="_blank">pandora</a>,
<a href="http://www.twitter.com/" target="_blank">twitter</a>,
<a href="http://www.linkedin.com/" target="_blank">linkedin</a>,
<a href="http://www.github.com/" target="_blank">github</a>,
<a href="http://www.sony.com/" target="_blank">sony</a>,
and many, many more. Now that I have a decent base prototype, I can continue
forward and expand upon the processing and logic that goes into each web request.

Look forward to more posts about this in the future.