From d7339218d175e212a6cc4d7a1371dc2d185db198 Mon Sep 17 00:00:00 2001
From: brettlangdon
Date: Tue, 19 Nov 2013 08:21:43 -0500
Subject: [PATCH] port over article
 http://brett.is/writing/about/my-pyton-web-crawler/

---
 .../about/my-python-web-crawler/index.md | 203 ++++++++++++++++++
 1 file changed, 203 insertions(+)
 create mode 100644 contents/writing/about/my-python-web-crawler/index.md

diff --git a/contents/writing/about/my-python-web-crawler/index.md b/contents/writing/about/my-python-web-crawler/index.md
new file mode 100644
index 0000000..48d5688
--- /dev/null
+++ b/contents/writing/about/my-python-web-crawler/index.md
@@ -0,0 +1,203 @@
---
title: My Python Web Crawler
author: Brett Langdon
date: 2012-09-09
template: article.jade
---

How to write a very simplistic web crawler in Python, for fun.

---

Recently I decided to take on a new project: a Python based web crawler that I am dubbing Breakdown. Why? I have always been interested in web crawlers and have written a few in the past, one previously in Python and another before that as a class project in C++. So what makes this project different? For starters, I want to store and expose different information about the web pages the crawler visits. Instead of trying to analyze web pages and develop a ranking system (like PageRank) that allows people to easily search for pages based on keywords, I want to store the information that those decisions are based on and let people use it however they wish.

For example, I want to provide an API for people to search for specific web pages. If a page is found in the system, the API will return an easy to use data structure that contains the page's meta data, keyword histogram, list of links to other pages, and more.

## Overview of Web Crawlers

What is a web crawler? We can start with the simplest definition: a program that, starting from a single web page, moves from web page to web page using only the urls found in each page, beginning with those in the original page. This is how search engines like Google, Bing and Yahoo obtain the content they need for their search sites.

But a web crawler is not just about moving from site to site (even though this can be fun to watch). Most web crawlers have a higher purpose; in the case of search engines it is to rank the relevance of a web page, based on the page's content and html meta data, so that people can search the internet more easily. Other web crawlers are used for more invasive purposes, such as harvesting e-mail addresses for marketing or spam.

So what goes into making a web crawler? A web crawler, again, does not simply move from place to place however it feels. Web sites can dictate how crawlers access the content on their sites and how they should move around. This information is provided in the robots.txt file found on most websites (wikipedia has one, for example). A rookie mistake when building a web crawler is to ignore this file. A robots.txt file is a set of guidelines and rules that web crawlers must adhere to for a given domain; ignore it and you are liable to get your IP and/or User Agent banned. Robots.txt files tell crawlers which pages or directories to ignore and even which ones they should consider.
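As a quick aside, Python's standard library already knows how to read these files. The snippet below is not part of the Breakdown prototype; it is a minimal sketch of how a crawler might honor robots.txt before fetching a page, assuming the crawler identifies itself with its own User Agent string.

```python
import robotparser  # urllib.robotparser on Python 3

USER_AGENT = 'breakdown'  # the crawler's own User Agent

rp = robotparser.RobotFileParser()
rp.set_url('http://en.wikipedia.org/robots.txt')
rp.read()

# True only if wikipedia's robots.txt allows our crawler to fetch this url
allowed = rp.can_fetch(USER_AGENT, 'http://en.wikipedia.org/wiki/Main_Page')
```

A real crawler would run a check like this (and cache the parsed rules per domain) before every request it makes.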
Along with ensuring that you follow robots.txt, please be sure to provide a useful and unique User Agent. This is so that sites can identify that you are a robot and not a human. For example, if you see a User Agent of *“breakdown”* on your website: hi, it’s me. Do not use known User Agents like *“Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.19 (KHTML, like Gecko) Ubuntu/12.04 Chromium/18.0.1025.168 Chrome/18.0.1025.168 Safari/535.19”*; this is, again, an easy way to get your IP address banned on many sites.

Lastly, it is important to add rate limiting to your crawler. It is wonderful to be able to crawl websites, and move between them, very quickly (no one likes to wait for results), but this is another surefire way of getting your IP banned. Network admins do not like bots tying up all of their network's resources and making it difficult for actual users to use their site.


## Prototype of Web Crawler

So this afternoon I decided to take around an hour or so and prototype the code to crawl from page to page, extracting links and storing them in the database. All this code does at the moment is download the content of a url, parse out all of the urls, find the urls it has not seen before, append them to a queue for further processing, and insert them into the database. This process uses two queues and two different thread types for processing each link.

There are two different types of processes within this module. The first is a Grabber, which takes a single url from a queue and downloads the text content of that url using the Requests Python module. It then passes the content along to a queue that the Parser uses to get new content to process. The Parser takes the content the Grabber retrieved off of that queue and simply parses out all the links contained within the site's html. It then checks MongoDB to see whether each url has been retrieved already; if not, it appends the new url to the queue that the Grabber works from and also inserts the url into the database.

Using multiple threads per process (X for Grabbers and Y for Parsers), along with two different queues to share information between them, allows this crawler to be self sufficient once it is started with a single url. The Grabbers feed the queue that the Parsers work off of, and the Parsers feed the queue that the Grabbers work from.

For now, this is all that my prototype does: it only stores links and crawls from site to site looking for more links. What I have left to do is expand the Parser to pull more information out of the html, including things like meta data, page title and keywords, as well as incorporate robots.txt into the processing (to keep from getting banned) and automated rate limiting (right now I have a 3 second pause between each web request).


## How Did I Do It?

So I assume at this point you want to see some code? The code is not up on GitHub just yet; I have it hosted on my own private git repo for now and will gladly open source it once I have a better prototype.

Let's just take a very quick look at how I am sharing data between the different threads.
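The thread classes below leave out the interesting pieces, namely the MongoDB "have we seen this url before" check and the 3 second pause between requests. Here is a rough sketch of what those could look like using pymongo and Requests; the database, collection and helper names are made up for illustration and are not the actual Breakdown code.

```python
import time

import requests
from pymongo import MongoClient

# hypothetical database/collection names, not the real Breakdown schema
urls = MongoClient().breakdown.urls


def is_new(url):
    """Return True if this url has not been stored in MongoDB yet."""
    return urls.find_one({'url': url}) is None


def remember(url):
    """Record a url so the other threads will skip it from now on."""
    urls.insert_one({'url': url})


def polite_get(url, pause=3):
    """Download a page, then sleep so we never hammer a site."""
    data = requests.get(url).text
    time.sleep(pause)
    return data
```

With those pieces in mind, here are the two thread classes and the startup script.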
### Parser.py
```python
import threading


class Thread(threading.Thread):
    def __init__(self, content_queue, url_queue):
        self.c_queue = content_queue
        self.u_queue = url_queue
        super(Thread, self).__init__()

    def run(self):
        while True:
            # block until a Grabber hands us the html for a page
            data = self.c_queue.get()
            # process data: the real link extraction is elided in this
            # excerpt; `links` holds every url pulled out of `data`
            links = []
            for link in links:
                self.u_queue.put(link)
            self.c_queue.task_done()
```

### Grabber.py
```python
import threading

import requests


class Thread(threading.Thread):
    def __init__(self, url_queue, content_queue):
        self.c_queue = content_queue
        self.u_queue = url_queue
        super(Thread, self).__init__()

    def run(self):
        while True:
            # block until a Parser (or the startup script) gives us a url
            next_url = self.u_queue.get()
            # download the page with the Requests module
            data = requests.get(next_url).text
            # wait for the Parsers to make room in the bounded queue
            while self.c_queue.full():
                pass
            self.c_queue.put(data)
            self.u_queue.task_done()
```

### Breakdown
```python
from breakdown import Parser, Grabber
from Queue import Queue

num_threads = 4
max_size = 1000
url_queue = Queue()
content_queue = Queue(maxsize=max_size)

parsers = [Parser.Thread(content_queue, url_queue) for i in xrange(num_threads)]
grabbers = [Grabber.Thread(url_queue, content_queue) for i in xrange(num_threads)]

for thread in parsers + grabbers:
    thread.daemon = True
    thread.start()

# seed the crawler, then block so the daemon threads keep running
url_queue.put('http://brett.is/')
url_queue.join()
```

Let's talk through this process quickly. The Breakdown code is provided as an executable script to start the crawler. It creates `num_threads` threads of each type (Grabber and Parser), starts each thread and then appends the starting point for the crawler, http://brett.is/. One of the Grabber threads picks up that single url, makes a web request to get the content of the url and appends it to `content_queue`. One of the Parser threads then picks the content up from `content_queue`, processes the page's html, parses out all of the links and appends those links onto `url_queue`. This gives the other Grabber threads an opportunity to make new web requests and get more content to pass to the Parser threads. The cycle continues on and on until there are no links left (hopefully never).


## My Results

I ran this script for a few minutes, maybe 10-15, and ended up with over 11,000 links ranging from my own domain to Pandora, Twitter, LinkedIn, GitHub, Sony and many, many more. Now that I have a decent base prototype I can continue forward and expand upon the processing and logic that goes into each web request.

Look forward to more posts about this in the future.