---
title: Generator Pipelines in Python
author: Brett Langdon
date: 2012-12-18
template: article.jade
---

A brief look into what a generator pipeline is and how to write one in Python.

---

Generator pipelines are a great way to break complex processing into
smaller pieces when working through a sequence of items (like lines in a file). For those
who are not familiar with <a href="http://www.python.org" target="_blank">Python</a>
generators or the concept behind generator pipelines, I strongly recommend
reading this article first:
<a href="http://www.dabeaz.com/generators-uk/index.html" target="_blank">Generator Tricks for Systems Programmers</a>
by <a href="http://www.dabeaz.com/" target="_blank">David M. Beazley</a>.
It goes into far more depth than I am going to here.

First, a brief introduction to generators. There are two types of generators:
generator expressions and generator functions. A
<a href="http://www.python.org/dev/peps/pep-0289/" target="_blank">generator expression</a>
looks similar to a
<a href="http://www.python.org/dev/peps/pep-0202/" target="_blank">list comprehension</a>,
but it uses parentheses instead of square brackets.
A <a href="http://www.python.org/dev/peps/pep-0255/" target="_blank">generator function</a>
is a function which contains the keyword
<a href="http://docs.python.org/2/reference/simple_stmts.html#grammar-token-yield_stmt" target="_blank">yield</a>;
`yield` passes a value from within the function to the caller and pauses the
function there instead of exiting it (unlike a return statement).

## Generator Expression

```python
nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(sum(num for num in nums))  # no extra parentheses needed inside a call
num_gen = (num for num in nums)  # creates the generator; nothing is iterated yet
for num in num_gen:
    print(num)
```

On line 2 of the above, the generator expression is passed directly into a function
call, so the extra parentheses are not needed. Otherwise you can create a standalone
generator, as on line 3; that expression simply creates the generator and does not
iterate over the list of numbers until it is passed into the for loop on line 4.

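
To make the "does not iterate until consumed" point concrete, here is a minimal sketch. The helper `noisy_square` and the `seen` list are made up for illustration; they only record when work actually happens. It also shows that a generator can be consumed only once.

```python
def noisy_square(n, seen):
    """Square n and record that it was computed (illustrative helper)."""
    seen.append(n)
    return n * n


seen = []
nums = [1, 2, 3]

# Creating the generator expression does not call noisy_square yet.
squares = (noisy_square(num, seen) for num in nums)
seen_before = list(seen)

# Consuming the generator is what triggers the work.
first_pass = list(squares)

# The generator is now exhausted, so a second pass produces nothing.
second_pass = list(squares)

print(seen_before)
print(first_pass)
print(second_pass)
```

If you need to walk the values more than once, build a list instead, or create a fresh generator for each pass.
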
## Generator Function

```python
def nums():
    values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    for num in values:
        yield num  # hand one value to the caller, then pause here
print(sum(nums()))
for num in nums():
    print(num)
```

This block of code does exactly the same thing as the example above but uses a generator
function instead of a generator expression. Calling the function `nums` returns a
generator; as that generator is iterated, it loops through the list of numbers and
passes them back one by one to either the call to `sum` or the for loop.

Generators (either expressions or functions) are not the same as returning a list
of items (let's say numbers). They do not wait for all possible items to be produced
before any are handed back. Each item is returned as it is yielded. For example,
with the generator function code above, the number 1 is printed on line 7
before the number 2 is yielded on line 4.

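
A small sketch can show that interleaving. The `events` list is an addition for illustration only; it records the order in which the generator yields and the loop prints.

```python
events = []


def nums():
    for num in [1, 2, 3]:
        events.append('yield %s' % num)  # recorded just before handing the value out
        yield num


for num in nums():
    events.append('print %s' % num)  # recorded when the loop body receives the value
    print(num)

print(events)
```

Each `yield` entry is followed immediately by its matching `print` entry, rather than all the yields happening first.
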
So, cool, generators are nice, but what about generator pipelines? A
generator pipeline takes these generators (expressions or functions) and
chains them together. Let's look at a case where they might be useful.

## Example: Without Generators

```python
def process(num):
    # filter out non-evens
    if num % 2 != 0:
        return None
    num = num * 3
    num = 'The Number: %s' % num
    return num

nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for num in nums:
    result = process(num)
    # odd numbers come back as None, so skip them
    if result is not None:
        print(result)
```

This code is fairly simple and may not seem like the best candidate for a
generator pipeline, but it is nice because we can break it down into small parts.
First we filter out any non-even numbers, then we multiply each remaining
number by 3, and finally we convert the number to a string. Let's see what this
looks like as a pipeline.

## Generator Pipeline

```python
def even_filter(nums):
    for num in nums:
        if num % 2 == 0:
            yield num
def multiply_by_three(nums):
    for num in nums:
        yield num * 3
def convert_to_string(nums):
    for num in nums:
        yield 'The Number: %s' % num
nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
pipeline = convert_to_string(multiply_by_three(even_filter(nums)))  # innermost stage runs first
for num in pipeline:
    print(num)
```

This code might look more complex than the previous example, but it
shows how (with generators) you can chain together a set of
very small and concise processes over a set of items. So, how does this example
really work? Each number in the list `nums` passes through each of the three
functions and is printed before the next item has its chance to make it through.

1. The number 1 is checked for even; it is not, so processing for that number stops
2. The number 2 is checked for even; it is, so it is yielded to `multiply_by_three`
3. The number 2 is multiplied by 3 and the result, 6, is yielded to `convert_to_string`
4. The number 6 is formatted into the string and yielded to the for loop at the end of the example
5. The string is printed as _"The Number: 6"_
6. The number 3 is checked for even; it is not, so processing for that number stops
7. The number 4 is checked for even; it is, so it is yielded to `multiply_by_three`
8. … etc…

This continues until all of the numbers have either been ignored (by `even_filter`)
or have been yielded. If you want to, you can change the order in which the
chain is created to change the order in which each process runs (try swapping
`even_filter` and `multiply_by_three`).

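
The step-by-step order above can be checked with a rough sketch like the following. The `traced` wrapper and the `log` list are hypothetical additions that record each item as it leaves a stage; the three stage functions are the same as in the example. The last line also tries the suggested swap of `even_filter` and `multiply_by_three`.

```python
def even_filter(nums):
    for num in nums:
        if num % 2 == 0:
            yield num


def multiply_by_three(nums):
    for num in nums:
        yield num * 3


def convert_to_string(nums):
    for num in nums:
        yield 'The Number: %s' % num


def traced(name, items, log):
    """Illustrative helper: note every item that leaves a stage."""
    for item in items:
        log.append((name, item))
        yield item


log = []
nums = [1, 2, 3, 4]

# Same chain as before, with a tracer wrapped around each stage.
evens = traced('even_filter', even_filter(nums), log)
tripled = traced('multiply_by_three', multiply_by_three(evens), log)
strings = traced('convert_to_string', convert_to_string(tripled), log)
results = list(strings)

# Swapped order: multiply first, then filter.
swapped = list(convert_to_string(even_filter(multiply_by_three(nums))))

for entry in log:
    print(entry)
```

In the log, the number 2 travels through all three stages before 4 enters the first one. For this particular pair of stages the swapped chain happens to produce the same strings, since tripling a number does not change whether it is even, but the multiplication now runs on every number instead of only the even ones.
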
So, how about a more practical example? What if we needed to process an
<a href="http://httpd.apache.org/" target="_blank">Apache</a> log file? We can use
a generator pipeline to break the processing into very small functions for
filtering and parsing. We will use the following example line format for our
processing:

```
127.0.0.1 [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
```

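
As a quick check of how a line in this format breaks apart, here is a small sketch that splits the sample line on spaces and strips the surrounding quotes and brackets from each piece. The names `sample` and `parts` are just for illustration, and the approach assumes the request path contains no spaces and the size field is always a number.

```python
sample = '127.0.0.1 [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'

# Split on single spaces, then trim quote and bracket characters from each piece.
parts = [part.strip('"[]') for part in sample.split(' ')]

for index, part in enumerate(parts):
    print(index, part)
```

That gives eight fields: IP address, timestamp, timezone, method, request, version, status and size.
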
## Processing Apache Logs

```python
class LogProcessor(object):
    def __init__(self, log_file):
        self._file = log_file
        self._filters = []

    def add_filter(self, new_filter):
        # only callables can be used as pipeline stages
        if callable(new_filter):
            self._filters.append(new_filter)

    def process(self):
        # this is the pattern for creating a generator
        # pipeline: we start with an iterable, then wrap
        # the pipeline so far in each consecutive generator
        pipeline = self._file
        for new_filter in self._filters:
            pipeline = new_filter(pipeline)
        return pipeline


def parser(lines):
    """Split each line based on spaces and
    yield the resulting list.
    """
    for line in lines:
        yield [part.strip('"[]') for part in line.split(' ')]


def mapper(lines):
    """Convert each line to a dict
    """
    for line in lines:
        tmp = {}
        tmp['ip_address'] = line[0]
        tmp['timestamp'] = line[1]
        tmp['timezone'] = line[2]
        tmp['method'] = line[3]
        tmp['request'] = line[4]
        tmp['version'] = line[5]
        tmp['status'] = int(line[6])
        tmp['size'] = int(line[7])
        yield tmp


def status_filter(lines):
    """Filter out lines whose status
    code is not 200
    """
    for line in lines:
        # if the status is not 200
        # then the line is ignored
        # and does not make it through
        # the pipeline to the end
        if line['status'] == 200:
            yield line


def method_filter(lines):
    """Filter out lines whose method
    is not 'GET'
    """
    for line in lines:
        # all lines with method not equal
        # to 'get' are dropped
        if line['method'].lower() == 'get':
            yield line


def size_converter(lines):
    """Convert the size (in bytes)
    into megabytes
    """
    mb = 9.53674e-7  # roughly 1 / (1024 * 1024)
    for line in lines:
        line['size'] = line['size'] * mb
        yield line


# set up the processor; the with block closes the file when we are done
with open('./sample.log') as log:
    processor = LogProcessor(log)
    # this is the order we want the functions to run
    processor.add_filter(parser)
    processor.add_filter(mapper)
    processor.add_filter(status_filter)
    processor.add_filter(method_filter)
    processor.add_filter(size_converter)
    # process() returns the generator pipeline
    for line in processor.process():
        # line will be a dict whose status is
        # 200 and method is 'GET' and whose
        # size is expressed in megabytes
        print(line)
```

So there you have it: a more practical example of how to use generator pipelines.
We have set up a simple class that iterates through a log file of a
specific format and performs a set of operations on each log line in a specified
order. With each operation written as a very small generator function, we now have modular
line processing, meaning we can rearrange our filters and converters. We can swap
the order of the method and status filters, or move the size converter before the
filters. The parser and mapper are the exception: the later stages expect the dicts
that `mapper` produces, and `mapper` expects the lists that `parser` produces, so
moving those two around would break things.

This generator pipeline will do the following:

1. Yield a single line from the log file
2. Split that line based on spaces and yield the resulting list
3. Yield a dict built from that list
4. Check the line’s status code; yield if 200, otherwise go to step 1
5. Check the line’s method; yield if ‘get’, otherwise go to step 1
6. Convert the line’s size to megabytes and yield the line
7. Print the line in the for loop, then go to step 1 (repeat for all other lines)

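
The steps above can be tried without a log file using a sketch along these lines. It restates the stage functions in compact form so that it runs on its own. The `build_pipeline` helper (built on `functools.reduce`) and every sample line after the first are made up for illustration; they are not part of the original example.

```python
from functools import reduce


def parser(lines):
    for line in lines:
        yield [part.strip('"[]') for part in line.split(' ')]


def mapper(lines):
    keys = ('ip_address', 'timestamp', 'timezone', 'method',
            'request', 'version', 'status', 'size')
    for parts in lines:
        record = dict(zip(keys, parts))
        record['status'] = int(record['status'])
        record['size'] = int(record['size'])
        yield record


def status_filter(records):
    for record in records:
        if record['status'] == 200:
            yield record


def method_filter(records):
    for record in records:
        if record['method'].lower() == 'get':
            yield record


def size_converter(records):
    for record in records:
        record['size'] = record['size'] / (1024.0 * 1024.0)
        yield record


def build_pipeline(source, stages):
    """Illustrative helper: wrap source in each stage, first stage innermost."""
    return reduce(lambda pipeline, stage: stage(pipeline), stages, source)


# Invented sample data in the article's line format.
sample_lines = [
    '127.0.0.1 [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326',
    '127.0.0.1 [10/Oct/2000:13:55:37 -0700] "POST /form HTTP/1.0" 200 512',
    '127.0.0.1 [10/Oct/2000:13:55:38 -0700] "GET /missing HTTP/1.0" 404 209',
    '127.0.0.1 [10/Oct/2000:13:55:39 -0700] "GET /big.bin HTTP/1.0" 200 1048576',
]

stages = [parser, mapper, status_filter, method_filter, size_converter]
records = list(build_pipeline(sample_lines, stages))

for record in records:
    print(record['request'], record['size'])
```

Only the two successful GET requests should make it to the end; the POST is dropped by `method_filter` and the 404 by `status_filter`. Because any iterable of strings can serve as the source, a plain list works here in place of an open file.
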
Do you use generators and generator pipelines differently in your Python code?
Please feel free to share any tips/tricks or anything I may have missed in
the above. Enjoy.