
Migrate content to hugo

Branch: dev/hugo
Brett Langdon, 9 years ago
Parent commit: 8749084ac3
GPG Key ID: A2ECAB73CE12147F (no known key found for this signature in database)
33 changed files with 3134 additions and 0 deletions
1. config.toml (+22, -0)
2. content/about/index.md (+2, -0)
3. content/writing/about/browser-fingerprinting/index.md (+107, -0)
4. content/writing/about/continuous-nodejs-module/index.md (+62, -0)
5. content/writing/about/cookieless-user-tracking/index.md (+167, -0)
6. content/writing/about/detect-flash-with-javascript/index.md (+62, -0)
7. content/writing/about/fail2ban-honeypot/index.md (+141, -0)
8. content/writing/about/fastest-python-json-library/index.md (+136, -0)
9. content/writing/about/forge-configuration-parser/index.md (+184, -0)
10. content/writing/about/generator-pipelines-in-python/index.md (+265, -0)
11. content/writing/about/goodbye-grunt-hello-tend/index.md (+173, -0)
12. content/writing/about/how-ads-are-delivered/index.md (+85, -0)
13. content/writing/about/javascript-documentation-generation/index.md (+86, -0)
14. content/writing/about/javascript-interview-questions/index.md (+43, -0)
15. content/writing/about/lets-make-a-metrics-beacon/index.md (+242, -0)
16. content/writing/about/managing-go-dependencies-with-git-subtree/index.md (+145, -0)
17. content/writing/about/my-new-website/index.md (+37, -0)
18. content/writing/about/my-python-web-crawler/index.md (+203, -0)
19. content/writing/about/os-x-battery-percentage-command-line/index.md (+31, -0)
20. content/writing/about/pharos-popup-on-osx-lion/index.md (+46, -0)
21. content/writing/about/php-stop-malicious-image-uploads/index.md (+77, -0)
22. content/writing/about/python-redis-queue-workers/index.md (+90, -0)
23. content/writing/about/sharing-data-from-php-to-javascript/index.md (+87, -0)
24. content/writing/about/the-battle-of-the-caches/index.md (+95, -0)
25. content/writing/about/third-party-tracking-pixels/index.md (+352, -0)
26. content/writing/about/what-i'm-up-to-these-days/index.md (+42, -0)
27. content/writing/about/why-benchmarking-tools-suck/index.md (+86, -0)
28. content/writing/about/write-code-every-day/index.md (+56, -0)
29. static/css/lato.css (+1, -0)
30. static/css/site.css (+9, -0)
31. static/images/avatar.png (BIN)
32. static/images/avatar@2x.png (BIN)
33. static/images/favicon.ico (BIN)

config.toml (+22, -0)

@@ -0,0 +1,22 @@
baseurl = "https://brett.is/"
title = "Brett.is"
languageCode = "en-us"
theme = "hugo-cactus-theme"
googleAnalytics = "UA-34513423-1"
disqusShortname = "brettlangdon"
[params]
customCSS = ["css/lato.css", "css/site.css"]
name = "Brett Langdon"
description = "A geek with a blog"
bio = "A geek with a blog"
aboutAuthor = "A geek with a blog"
twitter = "brett_langdon"
enableRSS = true
iconFont = "font-awesome"
[social]
twitter = "https://twitter.com/brett_langdon"
github = "https://github.com/brettlangdon"
linkedin = "https://www.linkedin.com/in/brettlangdon"
rss = "https://brett.is/index.xml"

content/about/index.md (+2, -0)

@@ -0,0 +1,2 @@
---
---

content/writing/about/browser-fingerprinting/index.md (+107, -0)

@@ -0,0 +1,107 @@
---
title: Browser Fingerprinting
author: Brett Langdon
date: 2013-06-05
template: article.jade
---
Ever want to know what browser fingerprinting is or how it is done?
---
## What is Browser Fingerprinting?
A browser or <a href="http://en.wikipedia.org/wiki/Device_fingerprint" target="_blank">device fingerprint</a>
is an identifier generated from information retrieved from
a single device that can be used to identify that device alone.
For example, as you will see below, browser fingerprinting can be used to generate
an identifier for the browser you are currently viewing this website with.
Even if you clear your cookies (which is how most third-party companies
track your browser), the identifier should be the same every time it is generated
for your specific device/browser. A browser fingerprint is usually generated from
the browser's <a href="https://en.wikipedia.org/wiki/User_agent" target="_blank">user agent</a>,
timezone offset, list of installed plugins, available fonts, screen resolution,
language and more. The <a href="https://www.eff.org/" target="_blank">EFF</a> did
a <a href="https://panopticlick.eff.org/browser-uniqueness.pdf" target="_blank">study</a>
on how unique a browser fingerprint for a given client can be and which browser
information provides the most entropy. To see how unique your browser is please
check out their demo application
<a href="https://panopticlick.eff.org/" target="_blank">Panopticlick</a>.
## What can it be used for?
Ok, so great, but who cares? How can browser fingerprinting be used? Right now
the majority of <a href="http://kb.mozillazine.org/User_tracking" target="_blank">user tracking</a>
is done by the use of cookies. For example, when you go to a website that has
[tracking pixels](http://brett.is/writing/about/third-party-tracking-pixels/)
(which are “invisible” scripts or images loaded in the background of the web page)
the third party company receiving these tracking calls will inject a cookie into
your browser which has a unique, usually randomly generated, identifier that is
used to associate stored data about you like collected
<a href="http://searchengineland.com/what-is-retargeting-160407" target="_blank">site or search retargeting</a>
data. This way when you visit them again with the same cookie they can lookup
previously associated data for you.
So, if this is how it is usually done, why do we care about browser fingerprints?
Well, the main problem with cookies is that they are volatile: if you manually delete
your cookies, then the company that put a cookie there loses all association with
you and any data they have on you is no longer useful. As well, if a client does
not allow third party cookies (or any cookies) in their browser, then the company
will be unable to track the client at all.
A browser fingerprint, on the other hand, is a more constant way to identify a given
client, as long as they have javascript enabled (which seems to be a thing
most websites cannot properly function without), and it allows the client to be
identified even if they do not allow cookies in their browser.
## How do we do it?
Like I mentioned before, to generate a browser fingerprint you must have javascript
enabled, as it is the easiest way to gather the most information about a browser.
Javascript gives us access to things like your screen size, language, installed
plugins, user agent, timezone offset, and other points of interest. This
information is basically smooshed together into a string and then hashed to generate
the identifier. The more information you can gather about a single browser, the more
unique of a fingerprint you can generate and the fewer collisions you will have.
Collisions? Yes: if you end up with two laptops of the same make, model, year,
OS version and browser version, with the exact same features and plugins enabled, then
the hashes will be exactly the same and anyone relying on the fingerprint will
treat both of those devices as the same. But, if you read the white paper by the EFF
listed above, you will see that their method for generating browser fingerprints
was usually unique across almost 3 million different devices. For some companies
that much uniqueness is more than enough to rely on fingerprints to identify
devices; others may have more than 3 million users.
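To make the idea concrete, here is a minimal sketch of the approach described above (this is not the fingerprintjs code; it uses a deliberately simple, non-cryptographic string hash in place of murmurhash3):
```javascript
// Illustrative only: combine a handful of browser attributes and hash
// them into a short identifier. Real libraries gather far more signals
// and use a better hash (e.g. murmurhash3).
function simpleHash(str){
    var hash = 0;
    for(var i = 0; i < str.length; i++){
        hash = ((hash << 5) - hash + str.charCodeAt(i)) | 0;
    }
    return (hash >>> 0).toString(16);
}

function toyFingerprint(){
    var parts = [
        navigator.userAgent,
        navigator.language,
        screen.width + "x" + screen.height,
        screen.colorDepth,
        new Date().getTimezoneOffset()
    ];
    // "smoosh" everything together into a string, then hash it
    return simpleHash(parts.join("###"));
}
```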
Where does this really come into play? Most websites have their users
create an account and log in before allowing them access to portions of the site or
the ability to look up stored information, maybe their credit card payment
information, home address, e-mail address, etc. Where browser fingerprints are
useful is for trying to identify anonymous visitors to a web application. For
example, [third party trackers](/writing/about/third-party-tracking-pixels/)
who are collecting search or other kinds of data.
## Some Code
There is a project on <a href="https://www.github.com/" target="_blank">github</a>
by user <a href="https://github.com/Valve" target="_blank">Valentin Vasilyev (Valve)</a>
called <a href="https://github.com/Valve/fingerprintjs" target="_blank">fingerprintjs</a>
which is a client side javascript library for generating browser fingerprints.
If you are interested in seeing some production-worthy code for generating
browser fingerprints, please take a look at that project. It uses information like
user agent, language, color depth, timezone offset, whether session or local storage
is available, and a listing of all installed plugins, and it hashes everything using
<a href="https://sites.google.com/site/murmurhash/" target="_blank">murmurhash3</a>.
## Your <a href="https://github.com/Valve/fingerprintjs" target="_blank">fingerprintjs</a> Fingerprint: *<span id="fingerprint">Could not generate fingerprint</span>*
<script type="text/javascript" src="/js/fingerprint.js"></script>
<script type="text/javascript">
var fingerprint = new Fingerprint().get();
document.getElementById("fingerprint").innerHTML = fingerprint;
</script>
**Resources:**
* <a href="http://panopticlick.eff.org/" target="_blank">panopticlick.eff.org</a> - find out how rare your browser fingerprint is.
* <a href="https://github.com/Valve/fingerprintjs" target="_blank">github.com/Valve/fingerprintjs</a> - client side browser fingerprinting library.

content/writing/about/continuous-nodejs-module/index.md (+62, -0)

@@ -0,0 +1,62 @@
---
title: Continuous NodeJS Module
author: Brett Langdon
date: 2012-04-28
template: article.jade
---
A look into my new NodeJS module called Continuous.
---
Greetings everyone. I wanted to take a moment to mention the new NodeJS module
that I just published called Continuous.
Continuous is a fairly simple module that is aimed to aid in running blocks of
code repeatedly; it is an event-based interface for setTimeout and setInterval.
With Continuous you can choose to run code at a set or random interval and
can also hook into events.
## Installation
```bash
npm install continuous
```
## Continuous Usage
```javascript
var continuous = require('continuous');

var run = new continuous({
    minTime: 1000,
    maxTime: 3000,
    random: true,
    callback: function(){
        return Math.round( new Date().getTime()/1000.0 );
    },
    limit: 5
});

run.on('complete', function(count, result){
    console.log('I have run ' + count + ' times');
    console.log('Results:');
    console.dir(result);
});

run.on('started', function(){
    console.log('I Started');
});

run.on('stopped', function(){
    console.log('I am Done');
});

run.start();

setTimeout( function(){
    run.stop();
}, 5000 );
```
For more information check out Continuous on
<a href="https://github.com/brettlangdon/continuous" target="_blank">GitHub</a>.

content/writing/about/cookieless-user-tracking/index.md (+167, -0)

@@ -0,0 +1,167 @@
---
title: Cookieless User Tracking
author: Brett Langdon
date: 2013-11-30
template: article.jade
---
A look into various methods of online user tracking without cookies.
---
Over the past few months, in my free time, I have been researching various
methods for cookieless user tracking. I have a previous article that talks
about how to write a
<a href="/writing/about/third-party-tracking-pixels/" target="_blank">tracking server</a>
which uses cookies to follow people between requests. However, recently
browsers are beginning to disallow third party cookies by default which means
developers have to come up with other ways of tracking users.
## Browser Fingerprinting
You can use client side javascript to generate a
<a href="/writing/about/browser-fingerprinting/" target="_blank">browser fingerprint</a>,
or, a unique identifier for a specific user's browser (since that is what cookies
are actually tracking). Once you have the browser's fingerprint you can then
send that id along with any other requests you make.
```javascript
var user_id = generateBrowserFingerprint();
document.write(
    '<script type="text/javascript" src="/track/user/' + user_id + '"></scr' + 'ipt>'
);
```
## Local Storage
Newer browsers come equipped with a feature called
<a href="http://diveintohtml5.info/storage.html" target="_blank">local storage</a>
, which is used as a simple key-value store accessible through javascript.
So instead of relying on cookies as your persistent storage, you can store the
user id in local storage instead.
```javascript
var user_id = localStorage.getItem("user_id");
if(user_id == null){
    user_id = generateNewId();
    localStorage.setItem("user_id", user_id);
}
document.write(
    '<script type="text/javascript" src="/track/user/' + user_id + '"></scr' + 'ipt>'
);
```
This can also be combined with a browser fingerprinting library for generating
the new id.
## ETag Header
There is a feature of HTTP requests called an
<a href="http://en.wikipedia.org/wiki/HTTP_ETag" target="_blank">ETag Header</a>
which can be exploited for the sake of user tracking. The way an ETag works is
that when a request is made the server responds with an ETag header with
a given value (usually an id for the requested document, or maybe a hash
of it); whenever the browser then makes another request for that document it will
send an _If-None-Match_ header with the value of the _ETag_ provided by the server
last time. The server can then make a decision as to whether or not new content
needs to be served based on the id/hash provided by the browser.
As you may have figured out, instead we can assign a unique user id as the ETag
header for a response, then when the browser makes a request for that page again
it will send us the user id.
This is useful, except for the fact that we can only provide a single id per
user per endpoint. For example, if I use the urls `/track/user` and
`/collect/data` there is no way for me to get the browser to send the same
_If-None-Match_ header for both urls.
### Example Server
```python
from uuid import uuid4
from wsgiref.simple_server import make_server


def tracking_server(environ, start_response):
    user_id = environ.get("HTTP_IF_NONE_MATCH")
    if not user_id:
        user_id = uuid4().hex
    start_response("200 Ok", [
        ("ETag", user_id),
    ])
    return [user_id]


if __name__ == "__main__":
    try:
        httpd = make_server("", 8000, tracking_server)
        print "Tracking Server Listening on Port 8000..."
        httpd.serve_forever()
    except KeyboardInterrupt:
        print "Exiting..."
```
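A minimal client-side sketch that could pair with the example server above (hypothetical; it assumes the tracking server is on the same origin and relies on the browser re-sending the cached ETag as _If-None-Match_ on later requests):
```javascript
// Hypothetical client for the example ETag server above: the response
// body (and ETag header) is the user id the server assigned or echoed back.
var xhr = new XMLHttpRequest();
xhr.open("GET", "/");
xhr.onload = function(){
    var userId = xhr.getResponseHeader("ETag") || xhr.responseText;
    console.log("user id:", userId);
};
xhr.send();
```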
## Redirect Caching
Redirect caching is similar in concept to the ETag tracking method in that
we rely on the browser cache to store the user id for us. With redirect caching
we have our tracking url `/track`; when someone goes there we perform a 301
redirect to `/<user_id>/track`. The user's browser will then cache that 301
redirect and the next time the user goes to `/track` it will just go to
`/<user_id>/track` instead.
Just like the ETag method, we run into an issue where this method really only
works for a single endpoint url. We cannot use it as an end-all-be-all for
tracking users across a site or multiple sites.
### Example Server
```python
from uuid import uuid4
from wsgiref.simple_server import make_server


def tracking_server(environ, start_response):
    if environ["PATH_INFO"] == "/track":
        start_response("301 Moved Permanently", [
            ("Location", "/%s/track" % uuid4().hex),
        ])
    else:
        start_response("200 Ok", [])
    return [""]


if __name__ == "__main__":
    try:
        httpd = make_server("", 8000, tracking_server)
        print "Tracking Server Listening on Port 8000..."
        httpd.serve_forever()
    except KeyboardInterrupt:
        print "Exiting..."
```
## Evercookie
A project worth noting is Samy Kamkar's
<a href="http://samy.pl/evercookie/" target="_blank">Evercookie</a>
which uses standard cookies, flash objects, silverlight isolated storage,
web history, etags, web cache, local storage, global storage... and more
all at the same time to track users. This library exercises every possible
method for storing a user id which makes it a reliable method for ensuring
that the id is stored, but at the cost of being very intrusive and persistent.
## Other Methods
I am sure there are other methods out there, these are just the few that I
decided to focus on. If anyone has any other methods or ideas please leave a comment.
## References
* <a href="http://ochronus.com/tracking-without-cookies/" target="_blank">http://ochronus.com/tracking-without-cookies/</a>
* <a href="http://ochronus.com/user-tracking-http-redirect/" target="_blank">http://ochronus.com/user-tracking-http-redirect/</a>
* <a href="http://samy.pl/evercookie/" target="_blank">http://samy.pl/evercookie/</a>

content/writing/about/detect-flash-with-javascript/index.md (+62, -0)

@@ -0,0 +1,62 @@
---
title: Detect Flash with JavaScript
author: Brett Langdon
date: 2013-06-05
template: article.jade
---
A quick, easy and lightweight way to detect Flash support in clients.
---
Recently I had to find a good way of detecting if <a href="http://www.adobe.com/products/flashplayer.html" target="_blank">Flash</a>
is enabled in the browser. There are two main libraries, the
<a href="http://solutionpartners.adobe.com/products/flashplayer/download/detection_kit/" target="_blank">Adobe Flash Detection Kit</a>
and <a href="https://code.google.com/p/swfobject/" target="_blank">SWFObject</a>,
which are both very good at detecting whether Flash is enabled, getting
the version of Flash installed, and dynamically embedding and manipulating
<a href="http://en.wikipedia.org/wiki/SWF" target="_blank">swf</a> files
in your web application. But all I needed was a **yes** or a **no** for whether
Flash was there or not, without the added overhead of unneeded code.
My goal was to write the least amount of JavaScript while still being able
to detect Flash across browsers.
```javascript
function detectflash(){
    if(navigator.plugins != null && navigator.plugins.length > 0){
        return navigator.plugins["Shockwave Flash"] && true;
    }
    if(~navigator.userAgent.toLowerCase().indexOf("webtv")){
        return true;
    }
    if(~navigator.appVersion.indexOf("MSIE") && !~navigator.userAgent.indexOf("Opera")){
        try{
            return new ActiveXObject("ShockwaveFlash.ShockwaveFlash") && true;
        } catch(e){}
    }
    return false;
}
```
For those unfamiliar with the tilde (~) operator in javascript, please read
<a href="http://dreaminginjavascript.wordpress.com/2008/07/04/28/" target="_blank">this article</a>,
but the short version is, used with indexOf these two lines are equivalent:
```javascript
~navigator.appVersion.indexOf("MSIE")
navigator.appVersion.indexOf("MSIE") != -1
```
To use the above function:
```javascript
if(detectflash()){
    alert("Flash is enabled");
} else {
    alert("Flash is not available");
}
```
And that is it. Pretty simple and a much shorter version than the alternatives;
compressed and mangled I have gotten this code to under 400 bytes.
I tested this code with IE 5.5+, Firefox and Chrome without any issues.

content/writing/about/fail2ban-honeypot/index.md (+141, -0)

@@ -0,0 +1,141 @@
---
title: Fail2Ban Honeypot
author: Brett Langdon
date: 2012-02-04
template: article.jade
---
How to use Python and Fail2Ban to write an auto-blocking honeypot.
---
I have been practicing for the upcoming NECCDC competition and playing
around with various security concepts, and one that I thought of trying was
creating a honeypot that automagically blocks IPs when they get trapped. What I have is
a honeypot script written in Python that logs intruders to a log file and a
<a href="http://fail2ban.org/" target="_blank">Fail2Ban</a>
definition that will block the IP address. Below I will show you the Fail2Ban
honeypot that I have thrown together.
## Installation
We first need to install
<a href="http://python.org/" target="_blank">python</a> and
<a href="http://fail2ban.org/" target="_blank">fail2ban</a>.
The installation process might differ depending on which Linux distribution
you are using.
```bash
sudo apt-get install python fail2ban
```
## Honeypot
Copy the following python script and create a file `honeypot.py`.
```python
import socket
import threading
import sys


class HoneyThread(threading.Thread):
    def __init__(self, logfile, port):
        self.logfile = logfile
        self.port = port
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock.bind(('', port))
        self.sock.listen(1)
        print 'Listening on: ', port
        super(HoneyThread, self).__init__()

    def run(self):
        while True:
            channel, details = self.sock.accept()
            logstr = (
                'Connection from %s:%s on port %s\r\n' %
                (details[0], details[1], self.port)
            )
            self.logfile.write('%s\r\n' % logstr)
            print logstr
            self.logfile.flush()
            channel.send('You Just Got Stuck In Some Honey')
            channel.close()


# each port given on the command line gets its own listening thread
ports = []
for arg in sys.argv[1:]:
    ports.append(int(arg))

threads = []
logfile = open('/var/log/honeypot.log', 'a')
for p in ports:
    threads.append(HoneyThread(logfile, p))

for thread in threads:
    thread.start()

print 'Bring it on!'
```
Some may notice a slight issue: this script is meant to run 24/7 and never be
stopped. There is no particular way of stopping the threads short of killing
the process or restarting the machine.
## Running Honeypot
To run the honeypot simply issue the following command:
```bash
python honeypot.py 22 25 80 443
```
Replace the ports shown with the ports that you want the honeypot to run on.
When someone tries to connect to one of the supplied ports this script will
display on the screen the ip address that connected, the port they connected from
and the port they were trying to reach. It will also log the incident to
the `/var/log/honeypot.log` file.
## Fail2Ban
Now to setup fail2ban to block the ip address when it is captured.
A new filter definition needs to be created in `/etc/fail2ban/filter.d/honeypot.conf`.
```ini
[Definition]
# one possible pattern; it matches the honeypot's log lines,
# e.g. "Connection from 1.2.3.4:54321 on port 80"
failregex = Connection from <HOST>:\d+ on port \d+
```
And the filter has to be set in `/etc/fail2ban/jail.conf`.
```ini
...
[honeypot]
enabled = true
filter = honeypot
logpath = /var/log/honeypot.log
action = iptables-allports[name=Honeypot, protocol=all]
maxretry = 1
...
```
Please make sure to read up on fail2ban's various actions; the `iptables-allports`
action is used here with `protocol: all`, meaning that the IP address is banned from
making any connection on any port using any protocol (tcp, udp, icmp, etc.). Also
change `maxretry` as you see fit: with it set to 1, any single access to the
honeypot will ban the IP for the configured amount of time (600 seconds by
default); if you want, this can be changed to 2 or 3 so that only someone persistent
about trying to access the false service gets banned.
And that is it, just start Fail2Ban and test by trying to access one of the
honeypot ports. This can be done from a second machine using telnet.
```bash
telnet 192.168.1.11 80
```
Replace `192.168.1.11` with the IP address of the machine running the honeypot
and `80` with the port you wish to test.
And there you have it, a Fail2Ban honeypot written in Python. Deploy and Enjoy.

content/writing/about/fastest-python-json-library/index.md (+136, -0)

@@ -0,0 +1,136 @@
---
title: The Fastest Python JSON Library
author: Brett Langdon
date: 2013-09-22
template: article.jade
---
My results from benchmarking a handful of Python JSON parsing libraries.
---
Most who know me well know that I am usually not one for benchmarks,
especially blindly posted benchmark results in blog posts (like this one is going to be).
So, rather than trying to say that “this library is better than that library” or to convince you that you are going to end up with the same results as me,
remember to take these results with a grain of salt.
You might end up with different results than me.
Take these results as interesting findings which help supplement your own experiments.
Ok, now that that diatribe is over with, LET'S GET TO THE COOL STUFF!
We use JSON for a bunch of stuff at work, whether it is a third party system that uses JSON to communicate or storing JSON blobs in the database.
We have done some naive benchmarking in the past and came to the conclusion that [jsonlib2](https://pypi.python.org/pypi/jsonlib2/) is the library for us.
Well, I started a personal project that also uses JSON and I decided to revisit benchmarking Python JSON libraries to see if there are any “better” ones out there.
I ended up with the following libraries to test:
[standard lib json](http://docs.python.org/2/library/json.html), [jsonlib2](https://pypi.python.org/pypi/jsonlib2/), [simplejson](https://pypi.python.org/pypi/simplejson/), [yajl](https://pypi.python.org/pypi/yajl) (yet another json library) and lastly [ujson](https://pypi.python.org/pypi/ujson) (ultrajson).
For the test, I wanted to test parsing and serializing a large json blob, in this case, I simply took a snapshot of data from the [Twitter API Console](https://dev.twitter.com/console).
Ok, enough with this context b.s., let's see some code and some results.
```python
import json
import timeit

# json data as a str
json_data = open("./fixture.json").read()
# json data parsed into python objects
data = json.loads(json_data)

number = 500
repeat = 4
print "Average run time over %s executions repeated %s times" % (number, repeat)

# we will store the fastest run times here
fastest_dumps = (None, -1)
fastest_loads = (None, -1)

for library in ("ujson", "simplejson", "jsonlib2", "json", "yajl"):
    print "-" * 20
    # thanks yajl for not setting __version__
    exec("""
try:
    from %s import __version__
except Exception:
    __version__ = None
""" % library)
    print "Library: %s" % library
    # for jsonlib2 this is a tuple... thanks guys
    print "Version: %s" % (__version__, )

    # time json.dumps
    timer = timeit.Timer(
        "json.dumps(data)",
        setup="""
import %s as json
data = %r
""" % (library, data)
    )
    total = sum(timer.repeat(repeat=repeat, number=number))
    per_call = total / (number * repeat)
    print "%s.dumps(data): %s (total) %s (per call)" % (library, total, per_call)
    if fastest_dumps[1] == -1 or total < fastest_dumps[1]:
        fastest_dumps = (library, total)

    # time json.loads
    timer = timeit.Timer(
        "json.loads(data)",
        setup="""
import %s as json
data = %r
""" % (library, json_data)
    )
    total = sum(timer.repeat(repeat=repeat, number=number))
    per_call = total / (number * repeat)
    print "%s.loads(data): %s (total) %s (per call)" % (library, total, per_call)
    if fastest_loads[1] == -1 or total < fastest_loads[1]:
        fastest_loads = (library, total)

print "-" * 20
print "Fastest dumps: %s %s (total)" % fastest_dumps
print "Fastest loads: %s %s (total)" % fastest_loads
```
Ok, we need to talk about this code for a second.
It really is not the cleanest code I have ever written.
We start off by loading the fixture JSON data as both the raw JSON text and as a parsed Python list of objects.
Then for each of the libraries we want to test, we try to get their version information and finally we use [timeit](http://docs.python.org/2/library/timeit.html) to test how long it takes to serialize the parsed fixture data into a JSON string and then we test parsing the JSON string of the fixture data into a list of objects.
And lastly, we store the name of the library with the fastest total run time for either “dumps” or “loads” and then at the end we print which was fastest.
Here are the results I got when running on my macbook pro:
```text
Average run time over 500 executions repeated 4 times
--------------------
Library: ujson
Version: 1.33
ujson.dumps(data): 1.97361302376 (total) 0.000986806511879 (per call)
ujson.loads(data): 2.05873394012 (total) 0.00102936697006 (per call)
--------------------
Library: simplejson
Version: 3.3.0
simplejson.dumps(data): 3.24183320999 (total) 0.001620916605 (per call)
simplejson.loads(data): 2.20791387558 (total) 0.00110395693779 (per call)
--------------------
Library: jsonlib2
Version: (1, 3, 10)
jsonlib2.dumps(data): 2.211810112 (total) 0.001105905056 (per call)
jsonlib2.loads(data): 2.55381131172 (total) 0.00127690565586 (per call)
--------------------
Library: json
Version: 2.0.9
json.dumps(data): 2.35674309731 (total) 0.00117837154865 (per call)
json.loads(data): 5.23104810715 (total) 0.00261552405357 (per call)
--------------------
Library: yajl
Version: None
yajl.dumps(data): 2.85826969147 (total) 0.00142913484573 (per call)
yajl.loads(data): 3.03867292404 (total) 0.00151933646202 (per call)
--------------------
Fastest dumps: ujson 1.97361302376 (total)
Fastest loads: ujson 2.05873394012 (total)
```
So there we have it.
My tests show that [ujson](https://pypi.python.org/pypi/ujson) is the fastest python json library (when running on my mbp and when parsing or serializing a “large” json dataset).
I have added the test scripts, fixture data and results in [this gist](https://gist.github.com/brettlangdon/6b007ef89fd7d2931a22) if anyone wants to run locally and post their results in the comments below.
I would be curious to see the results of others.

content/writing/about/forge-configuration-parser/index.md (+184, -0)

@@ -0,0 +1,184 @@
---
title: Forge configuration parser
author: Brett Langdon
date: 2015-06-27
template: article.jade
---
An overview of how I wrote a configuration file format and parser.
---
I have recently finished the initial work on a project,
[forge](https://github.com/brettlangdon/forge), which is a
configuration file syntax and parser written in Go. I was working
on a project where I was trying to determine what configuration
language I wanted to use, and whether I tested out
[YAML](https://en.wikipedia.org/wiki/YAML) or
[JSON](https://en.wikipedia.org/wiki/JSON) or
[ini](https://en.wikipedia.org/wiki/INI_file), nothing really felt
right. What I really wanted was a format similar to
[nginx](http://wiki.nginx.org/FullExample),
but I couldn't find any existing packages for Go which supported this
syntax. A-ha, I smell an opportunity.
I have always been interested in programming languages, in their
design and implementation. I have always wanted to write my own
programming language, but since I have never had any formal education
on the subject I have always gone about it on my own. I bring it
up because this project has some similarities. You have a defined
syntax that gets parsed into some sort of intermediate format. The
part that is missing is where the intermediate format is then
translated into machine or byte code and actually executed. Since this
is just a configuration language, that is not necessary.
## Project overview
You can see the repository for
[forge](https://github.com/brettlangdon/forge) for current usage and
documentation.
A forge configuration file is made up of _directives_. There are 3
kinds of _directives_:
* _settings_: Which are in the form `<KEY> = <VALUE>`
* _sections_: Which are used to group more _directives_ `<SECTION-NAME> { <DIRECTIVES> }`
* _includes_: Used to pull in settings from other forge config files `include <FILENAME/GLOB>`
Forge also supports various types of _setting_ values:
* _string_: `key = "some value";`
* _bool_: `key = true;`
* _integer_: `key = 5;`
* _float_: `key = 5.5;`
* _null_: `key = null;`
* _reference_: `key = some_section.key;`
Most of these setting types are probably fairly self explanatory
except for _reference_. A _reference_ in forge is a way to have the
value of one _setting_ be a pointer to another _setting_. For example:
```config
global = "value";

some_section {
    key = "some_section.value";
    global_ref = global;
    local_ref = .key;
    ref_key = ref_section.ref_key;
}

ref_section {
    ref_key = "hello";
}
```
In this example we see 3 examples of _references_. A _reference_ value
is an identifier (`global`), possibly multiple identifiers separated
by a period (`ref_section.ref_key`); a _reference_ can also begin
with a period (`.key`). Every _reference_ which is not prefixed with a period
is resolved from the global section (the outermost level). So in this
example a _reference_ to `global` will point to the value of
`"value"` and `ref_section.ref_key` will point to the value of
`"hello"`. A _local reference_ is one which is prefixed with a period;
those are resolved starting from the section that the
_setting_ is defined in. So in this case, `local_ref` will point to
the value of `"some_section.value"`.
That is a rough idea of how forge files are defined, so let's see a
quick example of how you can use it from Go.
```go
package main

import (
    "fmt"

    "github.com/brettlangdon/forge"
)

func main() {
    settings, _ := forge.ParseFile("example.cfg")

    if settings.Exists("global") {
        value, _ := settings.GetString("global")
        fmt.Println(value)
    }

    settings.SetString("new_key", "new_value")
    settingsMap := settings.ToMap()
    fmt.Println(settingsMap["new_key"])

    jsonBytes, _ := settings.ToJSON()
    fmt.Println(string(jsonBytes))
}
```
## How it works
Lets dive in and take a quick look at the parts that make forge
capable of working.
**Example config file:**
```config
# Top comment
global = "value";

section {
    a_float = 50.67;

    sub_section {
        a_null = null;
        a_bool = true;
        a_reference = section.a_float;  # Gets replaced with `50.67`
    }
}
```
Basically what forge does is take a configuration file in the format defined
above and parse it into what is essentially a `map[string]interface{}`.
The code itself is comprised of two main parts: the tokenizer (or scanner) and the
parser. The tokenizer turns the raw source code (like the above) into a stream of tokens. If
you printed the token representation of the code above, it could look like:
```
(COMMENT, "Top comment")
(IDENTIFIER, "global")
(EQUAL, "=")
(STRING, "value")
(SEMICOLON, ";")
(IDENTIFIER, "section")
(LBRACKET, "{")
(IDENTIFIER, "a_float")
(EQUAL, "=")
(FLOAT, "50.67")
(SEMICOLON, ";")
....
```
Then the parser takes in this stream of tokens and tries to parse them based on some known
grammar. For example, a directive is in the form
`<IDENTIFIER> <EQUAL> <VALUE> <SEMICOLON>` (where `<VALUE>` can be
`<STRING>`, `<BOOL>`, `<INTEGER>`, `<FLOAT>`, `<NULL>`,
`<REFERENCE>`). When the parser sees `<IDENTIFIER>` it'll look ahead
to the next tokens to try and match them to this rule; if they match then
it knows to add this setting to the internal `map[string]interface{}`
for that identifier. If it doesn't match anything then there is a syntax
error and the parser reports it.
The part that I think is interesting is that I opted to just write the
tokenizer and parser by hand rather than using a library that converts
a language grammar into a tokenizer (like flex/bison). I have done
this before and was inspired to do so after learning that that is how
the go programming language is written, you can see here
[parser.go](https://github.com/golang/go/blob/258bf65d8b157bfe311ce70c93dd854022a25c9d/src/go/parser/parser.go)
(not a light read at 2500 lines). The
[scanner.go](https://github.com/brettlangdon/forge/blob/1c8c6f315b078622b7264b702b76c6407ec0f264/scanner.go)
and
[parser.go](https://github.com/brettlangdon/forge/blob/1c8c6f315b078622b7264b702b76c6407ec0f264/parser.go)
might prove to be slightly easier reads for those who are interested.
## Conclusion
That is just a brief overview of the project and a slight dip
into the inner workings of it. I am extremely interested in continuing
to learn as much as I can about programming languages and
parsers/compilers. I am going to put together a series of blog posts
that walk through what I have learned so far and which might help
guide the reader through creating something similar to forge.
Enjoy.

content/writing/about/generator-pipelines-in-python/index.md (+265, -0)

@@ -0,0 +1,265 @@
---
title: Generator Pipelines in Python
author: Brett Langdon
date: 2012-12-18
template: article.jade
---
A brief look into what a generator pipeline is and how to write one in Python.
---
Generator pipelines are a great way to break apart complex processing into
smaller pieces when processing lists of items (like lines in a file). For those
who are not familiar with <a href="http://www.python.org" target="_blank">Python</a>
generators or the concept behind generator pipelines, I strongly recommend
reading this article first:
<a href="http://www.dabeaz.com/generators-uk/index.html" target="_blank">Generator Tricks for Systems Programmers</a>
by <a href="http://www.dabeaz.com/" target="_blank">David M. Beazley</a>.
It will surely take you more in-depth than I am going to go.
A brief introduction on generators. There are two types of generators,
generator expressions and generator functions. A
<a href="http://www.python.org/dev/peps/pep-0289/" target="_blank">generator expression</a>
looks similar to a
<a href="http://www.python.org/dev/peps/pep-0202/" target="_blank">list comprehension</a>
but the simple difference is that it uses parenthesis over square brackets.
A <a href="http://www.python.org/dev/peps/pep-0255/" target="_blank">generator function</a>
is a function which contains the keyword
<a href="http://docs.python.org/2/reference/simple_stmts.html#grammar-token-yield_stmt" target="_blank">yield</a>;
yield is used to pass a value from within the function to the calling expression
without exiting the function (unlike a return statement).
## Generator Expression
```python
nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print sum(num for num in nums)
num_gen = (num for num in nums)
for num in num_gen:
    print num
```
On line 2 of the above, when passing a generator into a function, the extra parentheses
are not needed. Otherwise you can create a standalone generator, like on line 3;
this expression simply creates the generator, it does not iterate over the list of
numbers until it is passed into the for loop on line 4.
## Generator Function
```python
def nums():
    nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    for num in nums:
        yield num
print sum(nums())
for num in nums():
    print num
```
This block of code does the exact same as the example above but uses a generator
function instead of a generator expression. When the function nums is called it
will loop through the list of numbers and one by one pass them back up to either
the call to sum or the for loop.
Generators (either expressions or functions) are not the same as returning a list
of items (let's say numbers). They do not wait for all possible items to be yielded
before the items are returned. Each item is returned as it is yielded. For example,
with the generator function code above, the number 1 is printed on line 7
before the number 2 is yielded on line 4.
So, cool, alright, generators are nice, but what about generator pipelines? A
generator pipeline takes these generators (expressions or functions) and
chains them together. Let's look at a case where they might be useful.
## Example: Without Generators
```python
def process(num):
    # filter out non-evens
    if num % 2 != 0:
        return
    num = num * 3
    num = 'The Number: %s' % num
    return num
nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for num in nums:
    print process(num)
```
This code is fairly simple and may not seem like the best example for creating a
generator pipeline, but it is nice because we can break it down into small parts.
For starters we need to filter out any non-even numbers, then we need to multiply
the number by 3, and finally we convert the number to a string. Let's see what this
looks like as a pipeline.
## Generator Pipeline
```python
def even_filter(nums):
    for num in nums:
        if num % 2 == 0:
            yield num
def multiply_by_three(nums):
    for num in nums:
        yield num * 3
def convert_to_string(nums):
    for num in nums:
        yield 'The Number: %s' % num
nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
pipeline = convert_to_string(multiply_by_three(even_filter(nums)))
for num in pipeline:
    print num
```
This code example might look more complex than the previous example, but it
provides a good example of how (with generators) you can chain together a set of
very small and concise processes over a set of items. So, how does this example
really work? Each number in the list nums passes through each of the three
functions and is printed before the next item has its chance to make it through.
1. The Number 1 is checked for even, it is not so processing for that number stops
2. The Number 2 is checked for even, it is so it is yielded to `multiply_by_three`
3. The Number 2 is multiplied by 3 and yielded to `convert_to_string`
4. The Number 2 is formatted into the string and yielded to the for loop on line 14
5. The Number 2 is printed as _“The Number: 2″_
6. The Number 3 is checked for even, it is not so processing for that number stops
7. The Number 4 is checked for even, it is so it is yielded to `multiply_by_three`
8. … etc…
This continues until all of the numbers have either been ignored (by even_filter)
or have been yielded. If you wanted to, you can change the order in which the
chain is created to change the order in which each process runs (try swapping
even_filter and multiply_by_three).
So, how about a more practical example? What if we needed to process an
<a href="http://httpd.apache.org/" target="_blank">Apache</a> log file? We can use
a generator pipeline to break the processing into very small functions for
filtering and parsing. We will use the following example line format for our
processing:
```
127.0.0.1 [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
```
## Processing Apache Logs
```python
class LogProcessor(object):
    def __init__(self, file):
        self._file = file
        self._filters = []

    def add_filter(self, new_filter):
        if callable(new_filter):
            self._filters.append(new_filter)

    def process(self):
        # this is the pattern for creating a generator
        # pipeline, we start with a generator then wrap
        # each consecutive generator with the pipeline itself
        pipeline = self._file
        for new_filter in self._filters:
            pipeline = new_filter(pipeline)
        return pipeline


def parser(lines):
    """Split each line based on spaces and
    yield the resulting list.
    """
    for line in lines:
        yield [part.strip('"[]') for part in line.split(' ')]


def mapper(lines):
    """Convert each line to a dict
    """
    for line in lines:
        tmp = {}
        tmp['ip_address'] = line[0]
        tmp['timestamp'] = line[1]
        tmp['timezone'] = line[2]
        tmp['method'] = line[3]
        tmp['request'] = line[4]
        tmp['version'] = line[5]
        tmp['status'] = int(line[6])
        tmp['size'] = int(line[7])
        yield tmp


def status_filter(lines):
    """Filter out lines whose status
    code is not 200
    """
    for line in lines:
        # if the status is not 200
        # then the line is ignored
        # and does not make it through
        # the pipeline to the end
        if line['status'] == 200:
            yield line


def method_filter(lines):
    """Filter out lines whose method
    is not 'GET'
    """
    for line in lines:
        # all lines with method not equal
        # to 'get' are dropped
        if line['method'].lower() == 'get':
            yield line


def size_converter(lines):
    """Convert the size (in bytes)
    into megabytes
    """
    mb = 9.53674e-7
    for line in lines:
        line['size'] = line['size'] * mb
        yield line


# setup the processor
log = open('./sample.log')
processor = LogProcessor(log)

# this is the order we want the functions to run
processor.add_filter(parser)
processor.add_filter(mapper)
processor.add_filter(status_filter)
processor.add_filter(method_filter)
processor.add_filter(size_converter)

# process() returns the generator pipeline
for line in processor.process():
    # line will be a dict whose status is
    # 200 and method is 'GET' and whose
    # size is expressed in megabytes
    print line

log.close()
```
So there you have it. A more practical example of how to use generator pipelines.
We have set up a simple class that is used to iterate through a log file of a
specific format and perform a set of operations on each log line in a specified
order. By making each operation a very small generator function we now have modular
line processing, meaning we can move our filters, parsers and converters around in
any order we want. We can swap the order of the method and status filters and move
the size converter before the filters. It would not make sense, but we could move
the parser and mapper functions around as well (this might break things).
This generator pipeline will do the following:
1. yield a single line from the log file
2. Split that line based on spaces and yield the resulting list
3. yield a dict from the single line list
4. check the line’s status code, yield if 200, goto step 1 otherwise
5. check the line’s method, yield if ‘get’, goto step 1 otherwise
6. convert the line’s size to megabytes, yield the line
7. the line is printed in the for loop, goto step 1 (repeat for all other lines)
Do you use generators and generator pipelines differently in your Python code?
Please feel free to share any tips/tricks or anything I may have missed in
the above. Enjoy.

content/writing/about/goodbye-grunt-hello-tend/index.md (+173, -0)

@@ -0,0 +1,173 @@
---
title: Goodbye Grunt, Hello Tend
author: Brett Langdon
date: 2014-06-09
template: article.jade
---
I recently decided to give Grunt a try, which caused me to write my
own node.js build system.
---
For the longest time I had refused to move away from [Makefiles](http://mrbook.org/tutorials/make/)
to [Grunt](http://gruntjs.com/) or some other [node.js](https://nodejs.org) build system.
But I finally gave in and decided to take an afternoon to give Grunt a go.
Initially it seemed promising: Grunt had a plugin for everything and it
supported watching files and directories (the one feature I really wanted
for my `make` build setup).
I tried to move over a fairly simplistic `Makefile` that I already had written
into a `Gruntfile`. However, after about an hour (or more) of trying to get `grunt`
set up with [grunt-cli](https://github.com/gruntjs/grunt-cli) and all the other
plugins installed and configured to do the right thing, I realized that `Grunt`
wasn't for me. I took a simple 10 (ish) line `Makefile` and turned it into a 40+
line `Gruntfile` and it still didn't seem to do exactly what I wanted. What I
had to reflect on was why I should spend all this time trying to learn how to
configure some convoluted plugins when I already know the correct commands to
execute. Then I realized what I really wanted wasn't a new build system but
simply `watch` for a `Makefile`.
I have attempted to get some form of watch working with a `Makefile` in the past,
but it usually involves using inotify and I've never gotten it working exactly
how I wanted. So, I decided to start writing my own system, because, why not
spend more time on perfecting my build system. My requirements were fairly simple:
I wanted a way to watch a directory/files for changes and, when they change, simply run
a single command (ultimately `make <target>`); I wanted the ability to also run
long-running processes like `node server.js` and restart them if certain files
have changed; and lastly, unlike other watch based systems I have seen, I wanted
a way to run a command as soon as I start up the watch program (so you don't have
to start the watching, then go open/save a newline change to a file to get it to
build for the first time).
What I came up with was [tend](https://github.com/brettlangdon/tend), which solves
nearly all of my needs, which were simply "watch for make". So how do you use it?
### Installation
```bash
npm install -g tend
```
### Usage
```
Usage:
tend
tend <action>
tend [--restart] [--start] [--ignoreHidden] [--filter <filter>] [<dir> <command>]
tend (--help | --version)
Options:
-h --help Show this help text
-v --version Show tend version information
-r --restart If <command> is still running when there is a change, stop and re-run it
-i --ignoreHidden Ignore changes to files which start with "."
-f --filter <filter> Use <filter> regular expression to filter which files trigger the command
-s --start Run <command> as soon as tend executes
```
### Example CLI Usage
The following will watch for changes to any `js` files in the directory `./src/`
when any of them change or are added it will run `uglifyjs` to combine them into
a single file.
```bash
tend --ignoreHidden --filter "*.js" ./src "uglifyjs -o ./public/main.min.js ./src/*.js"
```
The following will run a long-running process, starting it as soon as `tend` starts
and restarting the program whenever files in `./routes/` have changed.
```bash
tend --restart --start --filter "*.js" ./routes "node server.js"
```
### Config File
Instead of running `tend` commands one at a time from the command line, you can provide
`tend` with a `.tendrc` file of multiple directories/files to watch with commands
to run.
The following `.tendrc` file is set up to run the same commands as shown above.
```ini
; global settings
ignoreHidden=true
[js]
filter=*.js
directory=./src
command=uglifyjs -o ./public/main.min.js ./src/*.js
[app]
filter=*.js
directory=./routes
command=node ./app/server.js
restart=true
start=true
```
You can then simply run `tend` without any arguments to have `tend` watch for
all changes configured in your `.tendrc` file.
Running:
```bash
tend
```
Will basically execute:
```bash
tend --ignoreHidden --filter "*.js" ./src "uglifyjs -o ./public/main.min.js ./src/*.js" \
& tend --restart --start --filter "*.js" ./routes "node server.js"
```
Along with running multiple targets at once, you can run specific targets from
a `.tendrc` file as well, `tend <target>`.
```bash
tend js
```
Will run the `js` target once.
```bash
tend --ignoreHidden --filter "*.js" ./src "uglifyjs -o ./public/main.min.js ./src/*.js"
```
### With Make
If I haven't beaten a dead horse enough, I am a `Makefile` kind of person, and
that is exactly what I wanted to use this new tool with. So below is an example
of a `Makefile` and its corresponding `.tendrc` file.
```make
js:
	uglifyjs -o ./public/main.min.js ./src/*.js

app:
	node server.js

.PHONY: js app
```
```ini
ignoreHidden=true
[js]
filter=*.js
directory=./src
command=make js
[app]
filter=*.js
directory=./routes
command=make app
restart=true
start=true
```
### Conclusion
So that is mostly it. Nothing overly exciting and nothing really new here, just
another watch/build system written in node to add to the list. For the most part
this tool does exactly what I want for now, but if anyone has any ideas on how
to make this better, or knows of any other better/easier tools which do similar things,
please let me know; I am more than willing to continue maintaining this tool.

content/writing/about/how-ads-are-delivered/index.md (+85, -0)

@@ -0,0 +1,85 @@
---
title: How Ads are Delivered
author: Brett Langdon
date: 2012-09-02
template: article.jade
---
A really brief look into how online advertising works.
---
For the last 6 months or so I have been working in the ad tech industry for a
search re-targeting company,
<a href="http://www.magnetic.com" target="_blank">Magnetic</a>,
as a software engineer working on software to deliver ads online, and I wanted
to share some of the things I have learned.
When I started working for them I did not realize how online ads are delivered.
I thought that web sites offer up space to advertisers and then they show various
ads based on what the web site wants them to show. Well, this isn’t really wrong
but not quite right. There are a few more pieces to the puzzle.
### Advertiser
An advertiser is the person, or agency, that wishes to deliver ads to the internet.
### Publisher
A publisher is a person, or organization, that has online ad space that they
wish to fill.
### Ad Exchange
An ad exchange is a company that allows various advertisers to bid on available
ad space provided by publishers.
## How It Works
This is the part I never fully understood until I started working in the industry
(there are still parts I do not know). The magic for most online ads is in the ad
exchange. When a user goes to a website there are various iframes on the page which
the publisher has pointed to the ad exchange. This lets the exchange know that
there is space currently available.
So the exchange then compiles a bid request which contains as much information
about the ad space and user as possible. This information can contain simple
things like the size of the ad and the location of the ad (above or below the fold),
to various information about the user: geo location, window size, etc.
The bid request is sent out to all of the advertisers to let them know about the
potential ad space available. The advertisers must then decide whether or
not they want to bid on that ad space, based on the information provided. If they
have an ad that meets the criteria then they will return a bid response to the ad
exchange telling them their bid. The bid price for an ad is provided in micro
dollars, or one one-millionth of a dollar. Another common unit in ad tech is CPM,
or cost per mille, which denotes the price for every one thousand ad impressions.
Once the ad exchange has all the bids they take the ad with the highest bid to
deliver. The cost you pay is not the price you bid, but one bidding unit above
the next highest bid. Lastly, the ad is delivered to the user.
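As a rough sketch (my simplification, not any exchange's actual logic), that clearing step could look something like this, with prices in micro dollars:
```javascript
// Simplified second-price style clearing: the highest bid wins and pays
// one bidding unit more than the next highest bid.
function clearAuction(bids, unit){
    unit = unit || 1; // one micro dollar
    var sorted = bids.slice().sort(function(a, b){ return b.price - a.price; });
    if(sorted.length === 0){ return null; }
    var runnerUp = sorted[1] ? sorted[1].price : 0;
    return { advertiser: sorted[0].advertiser, pays: runnerUp + unit };
}

// e.g. bids of 2,500,000 and 2,000,000 micro dollars: the winner pays 2,000,001
console.log(clearAuction([
    { advertiser: "a", price: 2500000 },
    { advertiser: "b", price: 2000000 }
]));
```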
One thing to note, which I find very cool, is this all happens in real time for
every page request that a user makes. Next time you go to a website which contains
ads, stop to think about what had happened for that ad to become available to you.
## Why Is This Cool?
Some might not find this topic very interesting; others might hold a grudge about the
fact that ads are being shown on websites, or the fact that some companies are
maintaining search information about them on their systems (in order to make
future ad decisions based on available ad space for you specifically). To me this
is interesting because of the scale at which these systems need to operate. Our company
does not make a few hundred bids per day or even per hour; that many can happen in seconds.
We also do not make any “static” decisions based on the bids that we receive,
instead we are trying to make informed, real time decisions as to which ads we
want to show.
Our systems need to not only be scalable, for an increase in available bids, but
they also need to be fast. If we waited for a SQL query to finish we would lose
out on hundreds of bids before we got our response. Our system is based heavily
on caching and rebuilding useful information for bidding. The fact that our
company works under these constraints requires our developers (that includes me)
to think outside the box and about the bigger picture.

content/writing/about/javascript-documentation-generation/index.md (+86, -0)

@@ -0,0 +1,86 @@
---
title: Javascript Documentation Generation
author: Brett Langdon
date: 2015-02-03
template: article.jade
---
I have always been trying to find a good Javascript documentation generator and
I have never really been very happy with any that I have found. So I've decided
to just write my own, DocAST.
---
The problem I have always had with documentation generators is that they are
either hard to theme or are sometimes very strict about the way doc strings are
supposed to be written, making it potentially difficult to switch between
documentation generators if you have to. So, as a fun exercise, I've decided to
just try writing one myself, [DocAST](https://github.com/brettlangdon/docast).
What is different about DocAST? I've seen a few documentation parsers which use
regular expressions to parse out the comment blocks, which works perfectly well,
except I've decided to have some fun and use
[AST](http://en.wikipedia.org/wiki/Abstract_syntax_tree) parsing to grab the
code blocks from the scripts. As well, DocAST doesn't try to force you into any
specific theme or display; instead it is used simply to extract documentation
from scripts. Lastly, DocAST doesn't use any specific documentation format for
signifying parameters, returns or exceptions; it will traverse the AST of the
code block to find them for you, so most of the time you just need to add a
simple block comment describing the function above it.
Let's just get to an example:
```javascript
// script.js

/*
 * This is my super cool function that does all sorts of cool stuff
 **/
function superCool(arg1, arg2){
    if(arg1 === arg2){
        throw new Exception("arg1 and arg2 cant be the same");
    }
    var sum = arg1 + arg2;
    return sum;
}
```
```shell
$ docast extract ./script.js
$ cat out.json
```
```javascript
[
    {
        "name": "superCool",
        "params": [
            "arg1",
            "arg2"
        ],
        "returns": [
            "sum"
        ],
        "raises": [
            "Exception"
        ],
        "doc": " This is my super cool function that does all sorts of cool stuff\n"
    }
]
```
For more information check out the github page for
[DocAST](https://github.com/brettlangdon/docast).
The other benefit I have found with a documentation parser (something that just
extracts the documentation information as opposed to trying to build the final output)
is that you can get fun and creative with how you use the parsed information. For
example, someone suggested writing your doc strings as
[yaml](http://www.yaml.org/). When you extract the doc string you just parse the yaml
to get an object, which is then easy to pass on to [jade](http://jade-lang.com/)
or some other templating engine to generate your documentation. If you want to
see an example of this, check out the doc strings in DocAST's own source at
https://github.com/brettlangdon/docast/blob/master/lib/index.js#L127, the generated
documentation at http://brettlangdon.github.io/docast/, and the code used to generate
those docs at https://github.com/brettlangdon/docast/tree/master/docs.
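As a rough, hypothetical sketch of that idea (not part of DocAST itself), assuming the `js-yaml` and `jade` modules and a `docs.jade` template of your own, the extracted output could be turned into rendered docs roughly like so:

```javascript
// Hypothetical sketch: treat each extracted "doc" string as YAML and
// hand the parsed objects to a template. The file names and template
// are assumptions, not part of DocAST.
var yaml = require('js-yaml');
var jade = require('jade');

// out.json is the output of `docast extract` shown above.
var extracted = require('./out.json');
var docs = extracted.map(function(entry){
  entry.meta = yaml.safeLoad(entry.doc) || {};
  return entry;
});

console.log(jade.renderFile('./docs.jade', {docs: docs}));
```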

+ 43
- 0
content/writing/about/javascript-interview-questions/index.md View File

@ -0,0 +1,43 @@
---
title: JavaScript Interview Questions
author: Brett Langdon
date: 2012-09-01
template: article.jade
---
Preliminary review of the book "JavaScript Interview Questions".
---
A few weeks ago I pre-ordered a wonderful book,
<a href="http://o2js.com/interview-questions/" target="_blank">JavaScript Interview Questions</a>,
written by
<a href="http://o2js.com/volkan" target="_blank">Volkan Özçelik</a>.
So even though the book is not yet finished I thought I would take a moment
to give a brief overview of what I have read so far.
When I started reading the book it was a mere 20-30 pages long, and most of it was
empty chapters and sections outlining the structure of the eventual full copy. Now,
just a few weeks further along, Volkan has begun filling in the other sections nicely;
the book has passed 150 pages and there is still much more to come. This book will
cover every topic surrounding a professional JavaScript interview, from how to handle
technical JavaScript questions to how to apply for a job and respond to an offer.
This book provides an intimate and in-depth look into the heart of JavaScript and
the parts that make the language unique. As if years of professional, tried-and-true
experience were not enough, Volkan also offers various resources and links to related
material throughout the book, both to support the points he makes and to provide
alternative points of view on the topics he covers. So far the references section of
the book takes up 5 pages and includes over 100 unique links, with more to come.
For those who feel that this book is only for people looking to get a leg up in a
JavaScript interview, please reconsider: it will help you learn the hidden secrets
that make you a better JavaScript developer and an all around better interviewee.
If <a href="http://o2js.com/interview-questions/" target="_blank">JavaScript Interview Questions</a>
sounds interesting to you, then please check out the
<a href="http://o2js.com/assets/javascript-interview-questions.pdf" target="_blank">book teaser</a>
for a free lesson on JavaScript Closures and consider pre-ordering the book.

+ 242
- 0
content/writing/about/lets-make-a-metrics-beacon/index.md View File

@ -0,0 +1,242 @@
---
title: Lets Make a Metrics Beacon
author: Brett Langdon
date: 2014-06-22
template: article.jade
---
Recently I wrote a simple javascript metrics beacon
library. Let me show you what I came up with and how it works.
---
So, what do I mean by "javascript metrics beacon library"? Think
[RUM (Real User Monitoring)](http://en.wikipedia.org/wiki/Real_user_monitoring) or
[Google Analytics](http://www.google.com/analytics/),
it is a javascript library used to capture/aggregate metrics/data
from the client side and send that data to a server either in one
big batch or in small increments.
For those who do not like reading articles and just want the code you
can find the current state of my library on github: https://github.com/brettlangdon/sleuth
Before we get into anything technical, let's just take a quick look at an
example usage:
```html
<script type="text/javascript" src="//raw.githubusercontent.com/brettlangdon/sleuth/master/sleuth.min.js"></script>
<script type="text/javascript">
Sleuth.init({
url: "/track",
});
// static tags to identify the browser/user
// these are sent with each call to `url`
Sleuth.tag('uid', userId);
Sleuth.tag('productId', productId);
Sleuth.tag('lang', navigator.language);
// set some metrics to be sent with the next sync
Sleuth.track('clicks', buttonClicks);
Sleuth.track('images', imagesLoaded);
// manually sync all data
Sleuth.sendAllData();
</script>
```
Alright, so let's cover a few concepts from above: `tags`, `metrics`, and `syncing`.
### Tags
Tags are meant to be a way to uniquely identify the metrics that are being sent
to the server and are generally used to break apart metrics. For example, you might
have a metric to track whether or not someone clicks an "add to cart" button, using tags
you can then break out that metric to see how many times the button has been pressed
for each `productId` or browser or language or any other piece of data you find
applicable for segmenting your metrics. Tags can also be used when tracking data for
[A/B Tests](http://en.wikipedia.org/wiki/A/B_testing) where you want to segment your
data based on which part of the test the user was included in.
### Metrics
Metrics are simply data points to track for a given request. Good metrics to record
are things like load times, elements loaded on the page, time spent on the page,
number of times buttons are clicked or other user interactions with the page.
### Syncing
Syncing refers to sending the data from the client to the server. I call it
"syncing" since we want to aggregate as much data as possible on the client side and
send fewer, but larger, requests rather than making a request to the server for
each metric we mean to track. We do not want to overload the client if we mean to
track a lot of user interactions on the site.
## How To Do It
Alright, enough of the simple examples and explanations; let's dig into the source a bit
to find out how to aggregate the data on the client side and how to sync that data
to the server.
### Aggregating Data
Collecting the data we want to send to the server isn't too bad. We are just going
to take any calls to `Sleuth.track(key, value)` and store the value either in
[LocalStorage](http://diveintohtml5.info/storage.html) or in an object until we need
to sync. For example, this is the `track` method of `Sleuth`:
```javascript
Sleuth.prototype.track = function(key, value){
if(this.config.useLocalStorage && window.localStorage !== undefined){
window.localStorage.setItem('Sleuth:' + key, value);
} else {
this.data[key] = value;
}
};
```
The only thing of note above is that it will fall back to storing in `this.data`
if LocalStorage is not available. We also namespace all data stored in
LocalStorage with the prefix "Sleuth:" to ensure there are no name collisions with
anyone else using LocalStorage.
Also `Sleuth` will be kind enough to capture data from `window.performance` if it
is available and enabled (it is by default). And it simply grabs everything it can
to sync up to the server:
```javascript
Sleuth.prototype.captureWindowPerformance = function(){
if(this.config.performance && window.performance !== undefined){
if(window.performance.timing !== undefined){
this.data.timing = window.performance.timing;
}
if(window.performance.navigation !== undefined){
this.data.navigation = {
redirectCount: window.performance.navigation.redirectCount,
type: window.performance.navigation.type,
};
}
}
};
```
For an idea of what is stored in `window.performance.timing`, check out
[Navigation Timing](https://developer.mozilla.org/en-US/docs/Navigation_timing).
### Syncing Data
Ok, so this is really the important part of this library. Collecting the data isn't
hard. In fact, no one really needs a library to do that for them when you can
just as easily store a global object to aggregate the data. So why am I making a
"big deal" about syncing the data? It really isn't too hard when you can just
make a simple AJAX call using jQuery's `$.ajax(...)` to ship a JSON string up to some
server side listener.
The approach I wanted to take was a little different, yes, by default `Sleuth` will
try to send the data using AJAX to a server side url "/track", but what about when
the server which collects the data lives on another hostname?
[CORS](http://en.wikipedia.org/wiki/Cross-origin_resource_sharing) can be less than
fun to deal with, and rather than worrying about any domain security I just wanted
a method that can send the data from anywhere I want back to whatever server I want
regardless of where it lives. So, how? Simple, javascript pixels.
A javascript pixel is simply a `script` tag which is written to the page with
`document.write` whose `src` attribute points to the url that you want to make the
call to. The browser will then call that url without using AJAX just like it would
with a normal `script` tag loading javascript. For a more in-depth look at tracking
pixels you can read a previous article of mine:
[Third Party Tracking Pixels](http://brett.is/writing/about/third-party-tracking-pixels/).
The point of going with this method is that we get CORS-free GET requests from any
client to any server. But some people are probably thinking, "wait, a GET request
doesn't help us send data from the client to the server"? This is why we will encode
our JSON string of data for the url and simply send it along as a query string
parameter. Enough talk, let's see what this looks like:
```javascript
var encodeObject = function(data){
var query = [];
for(var key in data){
query.push(encodeURIComponent(key) + '=' + encodeURIComponent(data[key]));
};
return query.join('&');
};
var drop = function(url, data, tags){
// base64 encode( stringify(data) )
tags.d = window.btoa(JSON.stringify(data));
// these parameters are used for cache busting
tags.n = new Date().getTime();
tags.r = Math.random() * 99999999;
// make sure we url encode all parameters
url += '?' + encodeObject(tags);
document.write('<sc' + 'ript type="text/javascript" src="' + url + '"></scri' + 'pt>');
};
```
That is basically it. We simply base64 encode a JSON string version of the data and send
it as a query string parameter. There are a few odd things that might stand out above,
mainly the url length limitations on a base64 encoded JSON string, the "cache busting",
and the weird breaking up of the tag "script". A safe url length limit to live under is around
[2000](http://stackoverflow.com/questions/417142/what-is-the-maximum-length-of-a-url-in-different-browsers)
characters to accommodate Internet Explorer, which from some very crude testing means each request
can hold around 50 or so separate metrics, each containing a string value. Cache busting
can be read about more in-depth in my article about tracking pixels
(http://brett.is/writing/about/third-party-tracking-pixels/#cache-busting), but the short
version is: we add a random number and the current timestamp to the query string to ensure that
the browser, a cdn, or anyone else in between doesn't cache the request being made to the server;
this way you will not get any missed metrics calls. Lastly, breaking up the `script` tag
into "sc + ript" and "scri + pt" makes it harder for anyone blocking scripts from writing
`script` tags to detect that a script tag is being written to the DOM (an `img` or
`iframe` tag could also be used instead of a `script` tag).
### Unload
How do we know when to send the data? If you are trying to measure how much time
a user spends on each page, or want to make sure you collect as much data as possible
on the client side, then you want to wait until the last second before
syncing the data to the server. By using LocalStorage to store the data you can ensure
that you will be able to access that data the next time you see that user, but who wants
to wait? And what if the user never comes back? I want my data now dammit!
Simple, let's bind an event to `window.onunload`! Woot, done... wait... why isn't my data
being sent to me? Initially I was trying to use `window.onunload` to sync data back, but
found that it didn't always work with pixel dropping, while AJAX requests worked most of the time.
After some digging I found that with `window.onunload` I was hitting a race condition on
whether or not the DOM was still available, meaning I couldn't reliably use `document.write`
or even query the DOM for more metrics to sync on unload.
In comes `window.onbeforeunload` to the rescue! For those who don't know about it (I
didn't before this project), `window.onbeforeunload` is exactly what it sounds like:
an event that gets fired before `window.onunload`, which also happens before the DOM
gets unloaded. So you can reliably use it to write to the DOM (like the pixels) or
to query the DOM for any extra information you want to sync up.
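As a minimal sketch (not Sleuth's exact internals), wiring the final sync up to this event looks roughly like the following, using the `Sleuth.sendAllData()` call from the earlier example:

```javascript
// Sketch: flush any remaining metrics right before the page unloads,
// while the DOM is still available for document.write based pixels.
window.onbeforeunload = function(){
  Sleuth.sendAllData();
};
```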
## Conclusion
So what do you think? There really isn't too much to it is there? Especially since we
only covered the client side of the piece and haven't touched on how to collect and
interpret this data on the server (maybe that'll be a follow up post). Again this is mostly
a simple implementation of a RUM library, but hopefully it sparks an interest to build
one yourself or even just to give you some insight into how Google Analytics or other
RUM libraries collect/send data from the client.
I think this project was neat because I do not always do client side
javascript, and every time I do I tend to learn something pretty cool. In this case it was
learning the differences between `window.onunload` and `window.onbeforeunload`, as well
as some of the cool things that are tracked by default in `window.performance`. I
definitely urge people to check out the documentation on `window.performance`.
### TODO
What is next for [Sleuth](https://github.com/brettlangdon/sleuth)? I am not sure yet.
I am thinking of implementing more ways of tracking data, like adding counter support,
rate limiting, and automatic incremental data syncs. I am open to ideas of how other people
would use a library like this, so please leave a comment here or open an issue on the
project's github page with any thoughts you have.
## Links
* [Sleuth](https://github.com/brettlangdon/sleuth)
* [Third Party Tracking Pixels](http://brett.is/writing/about/third-party-tracking-pixels/)
* [LocalStorage](http://diveintohtml5.info/storage.html)
* [Navigation Timing](https://developer.mozilla.org/en-US/docs/Navigation_timing)
* [window.onbeforeunload](https://developer.mozilla.org/en-US/docs/Web/API/Window.onbeforeunload)
* [window.onunload](https://developer.mozilla.org/en-US/docs/Web/API/Window.onunload)
* [RUM](http://en.wikipedia.org/wiki/Real_user_monitoring)
* [Google Analytics](http://www.google.com/analytics/)
* [A/B Testing](http://en.wikipedia.org/wiki/A/B_testing)

+ 145
- 0
content/writing/about/managing-go-dependencies-with-git-subtree/index.md View File

@ -0,0 +1,145 @@
---
title: Managing Go dependencies with git-subtree
author: Brett Langdon
date: 2016-02-03
template: article.jade
---
Recently I have decided to make the switch to using `git-subtree` for managing dependencies of my Go projects.
---
For a while now I have been searching for a good way to manage dependencies for my [Go](https://golang.org/)
projects. I think I have finally found a work flow that I really like that uses
[git-subtree](http://git.kernel.org/cgit/git/git.git/plain/contrib/subtree/git-subtree.txt).
When I began investigating different ways to manage dependencies I had a few small goals or concepts I wanted to follow.
### Keep it simple
I have always been drawn to the simplicity of Go and the tools that surround it.
I didn't want to add a lot of overhead or complexity into my work flow when programming in Go.
### Vendor dependencies
I decided right away that I wanted to vendor my dependencies, that is, where all of my dependencies
live under a top level `vendor/` directory in each repository.
This also means that I wanted to use the `GO15VENDOREXPERIMENT="1"` flag.
### Maintain the full source code of each dependency in each repository
The idea here is that each project will maintain the source code for each of its dependencies
instead of having a dependency manifest file, like `package.json` or `Godeps.json`, to manage the dependencies.
This was more of an acceptance than a decision. It wasn't a hard requirement that
each repository maintains the full source code for each of its dependencies, but
I was willing to accept that as a by-product of a good work flow.
## In come git-subtree
When researching methods of managing dependencies with `git`, I came across a great article
from Atlassian, [The power of Git subtree](https://developer.atlassian.com/blog/2015/05/the-power-of-git-subtree/),
which outlined how to use `git-subtree` for managing repository dependencies... exactly what I was looking for!
The main idea with `git-subtree` is that it is able to fetch a full repository and place
it inside of your repository. However, it differs from `git-submodule` because it does not
create a link/reference to a remote repository; instead it fetches all the files from that
remote repository, places them under a directory in your repository, and then treats them as
though they are part of your repository (there is no additional `.git` directory).
If you pair `git-subtree` with its `--squash` option, it will squash the remote repository
down to a single commit before pulling it into your repository.
As well, `git-subtree` has the ability to issue a `pull` to update a child repository.
Let's just take a look at how using `git-subtree` would work.
### Adding a new dependency
We want to add a new dependency, [github.com/miekg/dns](https://github.com/miekg/dns)
to our project.
```
git subtree add --prefix vendor/github.com/miekg/dns https://github.com/miekg/dns.git master --squash
```
This command will pull in the full repository for `github.com/miekg/dns` at `master` to `vendor/github.com/miekg/dns`.
And that is it, `git-subtree` will have created two commits for you, one for the squash of `github.com/miekg/dns`
and another for adding it as a child repository.
### Updating an existing dependency
If you want to then update `github.com/miekg/dns` you can just run the following:
```
git subtree pull --prefix vendor/github.com/miekg/dns https://github.com/miekg/dns.git master --squash
```
This command will again pull down the latest version of `master` from `github.com/miekg/dns` (assuming it has changed)
and create two commits for you.
### Using tags/branches/commits
`git-subtree` also works with tags, branches, or commit hashes.
Say we want to pull in a specific version of `github.com/brettlangdon/forge` which uses tags to manage versions.
```
git subtree add --prefix vendor/github.com/brettlangdon/forge https://github.com/brettlangdon/forge.git v0.1.5 --squash
```
And then, if we want to update to a later version, `v0.1.7`, we can just run the following:
```
git subtree pull --prefix vendor/github.com/brettlangdon/forge https://github.com/brettlangdon/forge.git v0.1.7 --squash
```
## Making it all easier
I really like using `git-subtree`, a lot, but the syntax is a little cumbersome.
The Atlassian article I mentioned earlier ([here](https://developer.atlassian.com/blog/2015/05/the-power-of-git-subtree/))
suggests adding `git` aliases to make using `git-subtree` easier.
I decided to take this one step further and write a `git` command, [git-vendor](https://github.com/brettlangdon/git-vendor)
to help manage subtree dependencies.
I won't go into much detail here, since it is outlined in the repository as well as at https://brettlangdon.github.io/git-vendor/,
but the project's goal is to make working with `git-subtree` easier for managing Go dependencies:
mainly, to be able to add subtrees and give them a name, to list all current subtrees,
and to update a subtree by name rather than by repo + prefix path.
Here is a quick preview:
```
$ git vendor add forge https://github.com/brettlangdon/forge v0.1.5
$ git vendor list
forge@v0.1.5:
name: forge
dir: vendor/github.com/brettlangdon/forge
repo: https://github.com/brettlangdon/forge
ref: v0.1.5
commit: 4c620b835a2617f3af91474875fc7dc84a7ea820
$ git vendor update forge v0.1.7
$ git vendor list
forge@v0.1.7:
name: forge
dir: vendor/github.com/brettlangdon/forge
repo: https://github.com/brettlangdon/forge
ref: v0.1.7
commit: 0b2bf8e484ce01c15b87bbb170b0a18f25b446d9
```
## Why not...
### Godep/&lt;package manager here&gt;
I decided early on that I did not want to "deal" with a package manager unless I had to.
This is not to say that there is anything wrong with [godep](https://github.com/tools/godep)
or any of the other currently available package managers out there, I just wanted to keep
the work flow simple and as close to what Go supports with respect to vendored dependencies
as possible.
### git-submodule
I have been asked "why not `git-submodule`?", and I think anyone who has had to work
with `git-submodule` will agree that it isn't really the best option out there.
It isn't as though it cannot get the job done, but the extra work flow needed
when working with submodules is a bit of a pain, mostly when working on a project with
multiple contributors, or with contributors who are either not aware that the project
is using submodules or who have never worked with them before.
### Something else?
This isn't the end of my search; I will always be keeping an eye out for new and
different ways to manage my dependencies. However, this is by far my favorite to date.
If anyone has any suggestions, please feel free to leave a comment.

+ 37
- 0
content/writing/about/my-new-website/index.md View File

@ -0,0 +1,37 @@
---
title: My New Website
author: Brett Langdon
date: 2013-11-16
template: article.jade
---
Why did I redo my website?
What makes it any better?
Why are there old posts that are missing?
---
I just wanted to write a quick post about my new site.
Some of you who are not familiar with my site might not notice the difference,
but trust me... it is different and for the better.
So what has changed?
For starters, I think the new design is a little simpler than the previous one,
but more importantly it is no longer in [Wordpress](http://www.wordpress.org).
It is now maintained with [Wintersmith](https://github.com/jnordberg/wintersmith),
a static site generator built in [node.js](http://nodejs.org/) that
uses [Jade](http://jade-lang.com) templates and [markdown](http://daringfireball.net/projects/markdown/).
Why is this better?
Well, for starters, I think writing in markdown is a lot easier than using Wordpress.
It means I can use whatever text editor I want (emacs in this case) to write my
articles. As well, I no longer need to have PHP and MySQL set up just to
serve up silly static content like blog posts and a few images.
This also means I can keep my blog entirely in [GitHub](http://github.com/).
So far I am fairly happy with the move to Wintersmith, except for having to move all my
current blog posts over to markdown, but I will slowly keep porting them over until
I have them all converted. So, please bear with me during this transition,
as there may be a few posts missing when I initially publish this new site.
Check out my blog in GitHub, [brett.is](http://github.com/brettlangdon/brett.is.git).

+ 203
- 0
content/writing/about/my-python-web-crawler/index.md View File

@ -0,0 +1,203 @@
---
title: My Python Web Crawler
author: Brett Langdon
date: 2012-09-09
template: article.jade
---
How to write a very simplistic Web Crawler in Python for fun.
---
Recently I decided to take on a new project, a Python based
<a href="http://en.wikipedia.org/wiki/Web_crawler" target="_blank">web crawler</a>
that I am dubbing Breakdown. Why? I have always been interested in web crawlers
and have written a few in the past, one previously in Python and another before
that as a class project in C++. So what makes this project different?
For starters I want to try and store and expose different information about the
web pages it is visiting. Instead of trying to analyze web pages and develop a
ranking system (like
<a href="http://en.wikipedia.org/wiki/PageRank" target="_blank">PageRank</a>)
that allows people to easily search for pages based on keywords, I instead want to
just store the information that is used to make those decisions and allow people
to use them how they wish.
For example, I want to provide an API for people to be able to search for specific
web pages. If the page is found in the system, it will return back an easy to use
data structure that contain the pages
<a href="http://en.wikipedia.org/wiki/Meta_element" target="_blank">meta data</a>,
keyword histogram, list of links to other pages and more.
## Overview of Web Crawlers
What is a web crawler? We can start with the simplest definition: it is a program
that, starting from a single web page, moves from web page to web
page using only the urls found in each page, beginning with those
provided in the original page. This is how search engines like
<a href="http://www.google.com/" target="_blank">Google</a>,
<a href="http://www.bing.com/" target="_blank">Bing</a> and
<a href="http://www.yahoo.com/" target="_blank">Yahoo</a>
obtain the content they need for their search sites.
But a web crawler is not just about moving from site to site (even though this
can be fun to watch). Most web crawlers have a higher purpose, like (in the case
of search engines) ranking the relevance of a web page based on the content
provided within the page and its html meta data, to allow people to more easily
search for content on the internet. Other web crawlers are used for more
invasive purposes, like obtaining e-mail addresses to use for marketing or spam.
So what goes into making a web crawler? A web crawler, again, is not just about
moving from place to place however it feels. Web sites can actually dictate how
web crawlers access the content on their sites and how they should move around on
their site. This information is provided in the
<a href="http://www.robotstxt.org/" target="_blank">robots.txt</a>
file that can be found on most websites
(<a href="http://en.wikipedia.org/robots.txt" target="_blank">here is wikipedia’s</a>).
A rookie mistake when building a web crawler is to ignore this file. These
robots.txt files are provided as a set of guidelines and rules that web crawlers
must adhere to for a given domain, otherwise you are liable to get your IP and/or
User Agent banned. Robots.txt files tell crawlers which pages or directories to
ignore or even which ones they should consider; a quick sketch of checking them
from Python follows below.
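Here is a rough sketch (not part of the Breakdown prototype shown later) of how a Python 2 crawler might honor robots.txt using the standard library's robot parser:

```python
# Sketch only: check robots.txt before fetching a page.
# Uses Python 2's robotparser module (urllib.robotparser on Python 3).
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://en.wikipedia.org/robots.txt')
rp.read()

# Only fetch the page if the rules allow our crawler's User Agent.
if rp.can_fetch('breakdown', 'http://en.wikipedia.org/wiki/Web_crawler'):
    print 'allowed to crawl this page'
```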
Along with ensuring that you follow along with robots.txt please be sure to
provide a useful and unique
<a href="http://en.wikipedia.org/wiki/User_agent" target="_blank">User Agent</a>.
This is so that sites can identify that you are a robot and not a human.
For example, if you see a User Agent of *“breakdown”* on your website, hi, it’s me.
Do not use known User Agents like
*“Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.19 (KHTML, like Gecko) Ubuntu/12.04 Chromium/18.0.1025.168 Chrome/18.0.1025.168 Safari/535.19″*;
this is, again, an easy way to get your IP address banned on many sites.
Lastly, it is important to consider adding rate limiting to your crawler. It is
wonderful to be able to crawl websites, and move between them, very quickly (no one likes
to wait for results), but this is another sure fire way of getting your IP banned
by websites. Net admins do not like bots tying up all of their network's
resources and making it difficult for actual users to use their site.
## Prototype of Web Crawler
So this afternoon I decided to take around an hour or so and prototype out the
code to crawl from page to page extracting links and storing them in the database.
All this code does at the moment is download the content of a url, parse out all
of the urls, find the new urls that it has not seen before, append them to a queue
for further processing, and insert them into the database. This process has
2 queues and 2 different thread types for processing each link.
There are two different types of processes within this module. The first is the
Grabber, which takes a single url from a queue and downloads the text
content of that url using the
<a href="http://docs.python-requests.org/en/latest/index.html" target="_blank">Requests</a>
Python module. It then passes the content along to a queue that the Parser uses
to get new content to process. The Parser takes the content retrieved by the
Grabber process from that queue and simply parses out all the links
contained within the site's html content. It then checks MongoDB to see whether that
url has been retrieved already or not; if not, it will append the new url to the
queue that the Grabber uses to retrieve new content and also insert this url
into the database.
Using multiple threads per process (X for Grabbers and Y
for Parsers), along with two different queues to share information between
the two, allows this crawler to be self sufficient once it gets started with a
single url. The Grabbers help feed the queue that the Parsers work off of and the
Parsers feed the queue that the Grabbers work from.
For now, this is all that my prototype does, it only stores links and crawls from
site to site looking for more links. What I have left to do is expand upon the
Parser to parse out more information from the html including things like meta
data, page title, keywords, etc, as well as to incorporate
<a href="http://www.robotstxt.org/" target="_blank">robots.txt</a> into the
processing (to keep from getting banned) and automated rate limiting
(right now I have a 3 second pause between each web request).
## How Did I Do It?
So I assume at this point you want to see some code? The code is not up on
GitHub just yet, I have it hosted on my own private git repo for now and will
gladly open source the code once I have a better prototype.
Let's just take a very quick look at how I am sharing data between the different
threads.
### Parser.py
```python
import threading
class Thread(threading.Thread):
def __init__(self, content_queue, url_queue):
self.c_queue = content_queue
self.u_queue = url_queue
super(Thread, self).__init__()
def run(self):
while True:
data = self.c_queue.get()
#process data
for link in links:
self.u_queue.put(link)
self.c_queue.task_done()
```
### Grabber.py
```python
import threading
class Thread(threading.Thread):
def __init__(self, url_queue, content_queue):
self.c_queue = content_queue
self.u_queue = url_queue
super(Thread, self).__init__()
def run(self):
while True:
next_url = self.u_queue.get()
#data = requests.get(next_url)
while self.c_queue.full():
pass
self.c_queue.put(data)
self.u_queue.task_done()
```
### Breakdown
```python
from breakdown import Parser, Grabber
from Queue import Queue
num_threads = 4
max_size = 1000
url_queue = Queue()
content_queue = Queue(maxsize=max_size)
parsers = [Parser.Thread(content_queue, url_queue) for i in xrange(num_threads)]
grabbers = [Grabber.Thread(url_queue, content_queue) for i in xrange(num_threads)]
for thread in parsers+grabbers:
thread.daemon = True
thread.start()
url_queue.put('http://brett.is/')
```
Let's talk about this process quickly. The Breakdown code is provided as an executable
script to start the crawler. It creates “num_threads” threads for each process
(Grabber and Parser). It starts each thread and then appends the starting point
for the crawler, http://brett.is/. One of the Grabber threads will then pick up on
the single url, make a web request to get the content of that url and append it
to “content_queue”. Then one of the Parser threads will pick up on the content
data from “content_queue”, it will process the data from the web page html,
parsing out all of the links and then appending those links onto “url_queue”. This
will then allow the other Grabber threads an opportunity to make new web requests
to get more content to pass to the Parsers threads. This will continue on and on
until there are no links left (hopefully never).
## My Results
I ran this script for a few minutes, maybe 10-15, and I ended up with over 11,000
links ranging from my domain,
<a href="http://www.pandora.com/" target="_blank">pandora</a>,
<a href="http://www.twitter.com/" target="_blank">twitter</a>,
<a href="http://www.linkedin.com/" target="_blank">linkedin</a>,
<a href="http://www.github.com/" target="_blank">github</a>,
<a href="http://www.sony.com/" target="_blank">sony</a>,
and many many more. Now that I have a decent base prototype I can continue forward
and expand upon the processing and logic that goes into each web request.
Look forward to more posts about this in the future.

+ 31
- 0
content/writing/about/os-x-battery-percentage-command-line/index.md View File

@ -0,0 +1,31 @@
---
title: OS X Battery Percentage Command Line
author: Brett Langdon
date: 2012-03-18
template: article.jade
---
Quick and easy utility to get OS X battery usage from the command line.
---
Recently I learned how to enable full screen console mode for OS X, but the first
issue I ran into was trying to determine how much charge was left in my laptop's battery.
Yes, of course I could use the fancy little button on the side that lights up and
shows me, but that would be way too easy for a programmer, so instead I
wrote this script. The script gathers the battery's current and max capacity
and simply divides them to give you a percentage of battery life left.
Just create this script (I named mine “battery”), make it executable with
“chmod +x battery”, and move it somewhere on your PATH (I moved mine into “/usr/sbin/”).
Then to use it simply run the command “battery” and you'll get an output similar to “3.900%”
(yes, as of the writing of this my battery needs a charge).
```bash
#!/bin/bash
current=`ioreg -l | grep CurrentCapacity | awk '{print $5}'`
max=`ioreg -l | grep MaxCapacity | awk '{print $5}'`
echo `echo "scale=3;$current/$max*100" | bc -l`'%'
```
Enjoy!

+ 46
- 0
content/writing/about/pharos-popup-on-osx-lion/index.md View File

@ -0,0 +1,46 @@
---
title: Pharos Popup on OSX Lion
author: Brett Langdon
date: 2012-01-28
template: article.jade
---
Fixing Pharos Popup app on OS X Lion.
---
My University uses
<a href="http://www.pharos.com/" target="_blank">Pharos</a>
print servers to manage a few printers on campus and we were running into an
issue of the Pharos popup and notify applications not working properly with OSX
Lion. As I work for the Apple technician on campus I was tasked with finding out
why. The popup installation was setting up the applications to run on startup just
fine, the postflight script was invoking Popup.app, and the drivers we were using
worked perfectly when we mapped the printer by IP, so what was going on? On
further examination, the two applications were in fact not being properly
started either after install or on boot.
I managed to find a work around that caused the applications to run. I manually
ran each of them from the command line (running them through Finder resulted in failure) and
magically they worked as expected. Now whenever my machine starts up they start
on boot without my having to run them manually; even if I uninstall the applications
and reinstall them I no longer have to manually run them… but why?
```bash
voltaire:~ brett$ open /Library/Application\ Support/Pharos/Popup.app
voltaire:~ brett$ open /Library/Application\ Support/Pharos/Notify.app
voltaire:~ brett$ ps aux | grep Pharos
brett 600 0.0 0.1 655276 3984 ?? S 2:55PM 0:00.10 /Library/Application Support/Pharos/Popup.app/Contents/MacOS/Popup -psn_0_237626
brett 543 0.0 0.1 655156 3652 ?? S 2:45PM 0:00.08 /Library/Application Support/Pharos/Notify.app/Contents/MacOS/Notify -psn_0_233529
brett 608 0.0 0.0 2434892 436 s001 R+ 2:56PM 0:00.00 grep Pharos
```
I am still not 100% sure why this work around worked, especially since the
postflight script included with the Popup package is set to run Popup.app after
installation. The only explanation I can come up with is that OSX keeps a list of
“trusted” applications (you know, that popup that asks if you really want
to run a program that was downloaded from the internet), and Popup.app and
Notify.app are not being properly added to that list unless run manually.
I am still looking into a solution that can be packaged with the Popup package and
will post more information here when I find out more.

+ 77
- 0
content/writing/about/php-stop-malicious-image-uploads/index.md View File

@ -0,0 +1,77 @@
---
title: PHP - Stop Malicious Image Uploads
author: Brett Langdon
date: 2012-02-01
template: article.jade
---
Quick and easy trick for detecting and stopping malicious image uploads to PHP.
---
Recently I have been practicing for the upcoming NECCDC competition and have
come across a few issues that will need to be overcome, including how to stop
malicious image uploads.
I was reading
<a href="http://www.acunetix.com/websitesecurity/upload-forms-threat.htm" target="_blank">this</a>
article on
<a href="http://www.acunetix.com/" target="_blank">Acunetix.com</a>
about the threats of having upload forms in PHP.
The general idea behind this exploit against Apache and PHP is that a user
uploads an image whose content contains PHP code and whose extension includes
‘php’, for example an image ‘new-house.php.jpg’ that contains:
```
... (image contents)
<?php phpinfo(); ?>
... (image contents)
```
When the file is uploaded and then viewed, Apache, if improperly set up, will process the
image as PHP because of the ‘.php’ in the extension, and the malicious code will be
executed on your server when the image is accessed.
## My Solution
I was trying to find a good way to resolve this issue quickly without opening
more security holes. I have seen some solutions that use the function
<a href="http://us2.php.net/manual/en/function.getimagesize.php" target="_blank">getimagesize</a>
to try and determine if the file is an image, but if the malicious code is
injected into the middle of an actual image this function will still return
the actual image size and the file will validate as an image. The solution I
came up with is to explicitly convert each uploaded image to a jpeg using
<a href="http://us2.php.net/manual/en/function.imagecreatefromjpeg.php" target="_blank">imagecreatefromjpeg</a>
and
<a href="http://us2.php.net/manual/en/function.imagejpeg.php" target="_blank">imagejpeg</a>
functions.
```php
<?php
$image = imagecreatefromjpeg( './new-house.php.jpg' );
imagejpeg( $image, './new-house.php.jpg' );
```
If the original image contains malicious code, an error will be thrown and
`$image` will not contain an image; this is a way to try to sanitize the
upload. The code can also be extended so that if the image is invalid, a
placeholder image is created and saved instead.
```php
<?php
// @ is used to silence the possible error from this call.
$image = @imagecreatefromjpeg( './new-house.php.jpg' );
if( !$image ):
$image = imagecreate(100,20);
$greenish = imagecolorallocate( $image, 180,200,180 );
imagefill( $image, 0, 0, $greenish );
$black = imagecolorallocate( $image, 0,0,0 );
imagestring( $image, 1, 5, 5, 'No.. No..', $black );
endif;
imagejpeg( $image, './new-house.php.jpg' );
```
Enjoy.

+ 90
- 0
content/writing/about/python-redis-queue-workers/index.md View File

@ -0,0 +1,90 @@
---
title: Python Redis Queue Workers
author: Brett Langdon
date: 2014-10-14
template: article.jade
---
Learn an easy, distributed approach to processing jobs
from a Redis queue in Python.
---
Recently I started thinking about a new project. I want to write my own Continuous Integration (CI)
server. I know what you are thinking... "Why?!" and yes I agree, there are a bunch of good ones out
there now, I just want to do it. The first problem I came across was how to have distributed workers
to process the incoming builds for the CI server. I wanted something that was easy to start up on
multiple machines and that needed minimal configuration to get going.
The design is relatively simple: there is a main queue which jobs can be pulled from and a second queue
that each worker process pulls jobs into to denote processing. The main queue is meant as a list of things that
have to be processed, while the processing queue is a list of pending jobs which are being processed by the
workers. For this example we will be using [Redis lists](http://redis.io/commands#list) since they support
the short feature list we require.
### worker.py
Let's start with the worker process; the job of the worker is simply to grab a job from the queue and process it.
```python
import redis
def process(job_id, job_data):
print "Processing job id(%s) with data (%r)" % (job_id, job_data)
def main(client, processing_queue, all_queue):
while True:
# try to fetch a job id from "<all_queue>:jobs"
# and push it to "<processing_queue>:jobs"
job_id = client.brpoplpush(all_queue, processing_queue)
if not job_id:
continue
# fetch the job data
job_data = client.hgetall("job:%s" % (job_id, ))
# process the job
process(job_id, job_data)
# cleanup the job information from redis
client.delete("job:%s" % (job_id, ))
client.lrem(processing_queue, 1, job_id)
if __name__ == "__main__":
import socket
import os
client = redis.StrictRedis()
try:
main(client, "processing:jobs", "all:jobs")
except KeyboardInterrupt:
pass
```
The above script does the following:
1. Try to fetch a job from the queue `all:jobs` pushing it to `processing:jobs`
2. Fetch the job data from a [hash](http://redis.io/commands#hash) key with the name `job:<job_id>`
3. Process the job information
4. Remove the hash key `job:<job_id>`
5. Remove the job id from the queue `processing:jobs`
With this design we will always be able to determine how many jobs are currently queued for processing
by looking at the list `all:jobs`, and we will also know exactly how many jobs are being processed
by looking at the list `processing:jobs`, which contains the job ids that all workers are
working on.
Also, we are not tied down to running just 1 worker on 1 machine. With this design we can run multiple
worker processes on as many nodes as we want, as long as they all have access to the same Redis server.
There are a few limitations, which are all rooted in Redis' [limits on lists](http://redis.io/topics/data-types),
but this should be good enough to get started.
There are a few other approaches that can be taken here as well. Instead of using a single processing queue
we could use a separate queue for each worker. Then we can see which jobs are currently being processed
by each individual worker. This approach would also let the workers try to fetch
from their worker specific queue first before looking at `all:jobs`, so we can either assign jobs to specific
workers or let a worker recover from failed processing by starting with the last job it was working
on before failing; a rough sketch of that idea follows below.
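Here is a rough sketch of that per-worker queue idea. The queue naming and recovery logic are my own assumption, not part of qw or the worker above:

```python
# Sketch only: give each worker its own processing queue and recover
# any job left behind by a previous failure before pulling new work.
import socket

import redis

client = redis.StrictRedis()
worker_queue = "worker:%s:jobs" % socket.gethostname()

# If our queue still holds a job from a failed run, process it first.
job_id = client.lindex(worker_queue, 0)
if job_id is None:
    # Otherwise block until a new job arrives on the shared queue.
    job_id = client.brpoplpush("all:jobs", worker_queue)
```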
## qw
I have developed the library [qw](https://github.com/brettlangdon/qw) (or QueueWorker) to implement a similar
pattern to this, so if you are interested in playing around with it or want to see a more developed implementation,
please check out the project's [github page](https://github.com/brettlangdon/qw) for more information.

+ 87
- 0
content/writing/about/sharing-data-from-php-to-javascript/index.md View File

@ -0,0 +1,87 @@
---
title: Sharing Data from PHP to JavaScript
author: Brett Langdon
date: 2014-03-16
template: article.jade
---
A quick example of how I decided to share dynamic content from PHP with my JavaScript.
---
So the other day I was refactoring some of the client side code I was working on and
came across something like the following:
### page.php
```php
<html>
...
<script type="text/javascript">
var modelTitle = "<?=$myModel->getTitle()?>";
// do something with modelTitle
</script>
</html>
```
There isn't really anything wrong here; in fact this seems to be a fairly common practice
(from the little research I did). So... what's the big deal? Why write an article about it?
My issue with the above is: what if the JavaScript gets fairly large (as mine was)? The
ideal thing to do is to move the js into its own file, minify/compress it, and serve it
from a CDN so it doesn't affect page load time. But now we have content that needs to be
added dynamically from the PHP script in order for the js to run. How do we solve it? The
approach that I took, which probably isn't original at all but I think is neat enough to
share, was to let PHP make the data available to the script through `window.data`.
### page.php
```php
<html>
...
<?php
$pageData = array(
'modelTitle' => $myModel->getTitle(),
);
?>
<script type="text/javascript">
window.data = <?=json_encode($pageData)?>;
</script>
<script type="text/javascript" src="//my-cdn.com/scripts/page-script.min.js"></script>
</html>
```
### page-script.js
```javascript
// window.data.modelTitle is available for me to use
console.log("My Model Title: " + window.data.modelTitle);
```
Nothing really fancy, shocking, new, or different here, just passing data from PHP to js.
Something to note is that our PHP code has to set `window.data` before we load
our external script so that `window.data` is available when the script runs. This
shouldn't be too much of an issue, since most web developers are used to putting all
of their `script` tags at the end of the page.
Some might wonder why I decided to use `window.data`, why not just set
`var modelTitle = "<?=$myModel->getTitle()?>";`? I think it is better to try and have a
convention for where the data from the page will come from. Having to rely on a bunch of
global variables being set isn't really a safe way to write this. What if you overwrite
an existing variable or if some other script overwrites your data from the PHP script?
This is still a cause for concern with `window.data`, but at least you only have to keep
track of a single variable. As well, I think organizationally it is easier and more concise
to have `window.data = <?=json_encode($pageData)?>;` as opposed to:
```php
var modelTitle = "<?=$myModel->getTitle()?>";
var modelId = "<?=$myModel->getId()?>";
var username = "<?=getCurrentUser()?>";
...
```
I am sure there are other ways to do this sort of thing, like with AJAX or having an
initialization function that PHP calls with the correct variables it needs to pass, etc.
This was just what I came up with and the approach I decided to take.
If anyone has other methods of sharing dynamic content between PHP and js, please leave a
comment and let me know, I am curious as to what most other devs are doing to handle this.

+ 95
- 0
content/writing/about/the-battle-of-the-caches/index.md View File

@ -0,0 +1,95 @@
---
title: The Battle of the Caches
author: Brett Langdon
date: 2013-08-01
template: article.jade
---
A co-worker and I set out to each build our own http proxy cache.
One of them was written in Go and the other as a C++ plugin for
Kyoto Tycoon.
---
So, I know what most people are thinking: “Not another cache benchmark post,
with skewed or biased results.” But luckily that is not what this post is about;
there are no opinionated graphs showing that my favorite caching system happens
to be better than all the other ones. Instead, this post is about why at work we
decided to write our own API caching system rather than use <a href="http://www.varnish-cache.org/" target="_blank">Varnish</a>
(a tested, tried and true HTTP caching system).
Let us discuss the problem we have to solve. The system we have is a simple
request/response HTTP server that needs to have very low latency (a few
milliseconds, usually 2-3 on average) and we are adding a third-party HTTP API
call to almost every request that we see. I am sure some people see the issue
right away: any network call is going to add at least half a millisecond to a whole millisecond
to your processing time, and that is if the two servers are in the same datacenter,
more if they are not. That is just network traffic; on top of that we must rely on the
performance of the third-party API, hoping that they are able to maintain a
consistent response time under heavy load. If, in total, this third-party API call
is adding more than 2 milliseconds response time to each request that our system
is processing then that greatly reduces the capacity of our system.
THE SOLUTION! Let's use Varnish. This is the logical solution: put a caching
system in front of the API. The content we are requesting isn’t changing very often
(every few days, if that) and a cache can help hide the added latency from the API
call. So, we tried this but had very little luck; no matter what we tried we could
not get Varnish to respond in under 2 milliseconds per request (which was a main
requirement of the solution we were looking for). That means Varnish is out, and the next
solution is to write our own caching system.
Now, before people start flooding the comments calling me a troll or yelling at me
for not trying this or that or some other thing, let me try to really explain why
we decided to write our own cache rather than spend extra days investing time in
Varnish or some other known HTTP cache. We have a fairly specific requirement for
our cache: very low and consistent latency. “Consistent” is the key word that really
matters to us. We decided fairly early on that getting no response on a cache miss
is better for our application than blocking and waiting for a response from the
proxy call. This is a very odd requirement and most HTTP caching systems do not
support it since it almost defeats their purpose (be “slow” 1-2 times so you can be
fast all the other times). As well, HTTP is not a requirement for us, that is,
from the cache to the API server HTTP must be used, but it is not a requirement
that our application calls to the cache using HTTP. Headers add extra bandwidth
and processing that are not required for our application.
So we decided that our ideal cache would have 3 main requirements:
1. Must have a consistent response time, returning nothing early over waiting for a proper response
2. Support the <a href="https://github.com/memcached/memcached/blob/master/doc/protocol.txt" target="_blank">Memcached Protocol</a>
3. Support TTLs on the cached data
This behavior works basically like so: call the cache; if it is a cache miss,
return an empty response and queue the request for a background process to make the
call to the API server. Every identical request coming in (until the proxy call
returns a result) will receive an empty response but will not add the request to the
queue. As soon as the proxy call returns, the cache is updated and every identical call
coming in will yield the proper response. After a given TTL the data in
the cache is considered old and is re-fetched. A rough sketch of this flow is shown below.
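As a rough illustration only (this is not Ferrite's or CacheOrBust's actual code, and it leaves out the memcached protocol, Kyoto Cabinet, and TTL handling), the miss behavior could be sketched in Go like this:

```go
package cache

import "sync"

// Cache sketches the behavior described above: a miss returns nothing
// immediately and the real fetch happens once, in the background.
type Cache struct {
	mu      sync.Mutex
	store   map[string][]byte
	pending map[string]bool
	fetch   func(key string) []byte // the proxy call to the third-party API
}

// NewCache returns a Cache that uses fetch to fill misses in the background.
func NewCache(fetch func(key string) []byte) *Cache {
	return &Cache{
		store:   make(map[string][]byte),
		pending: make(map[string]bool),
		fetch:   fetch,
	}
}

// Get returns the cached value, or nil immediately on a miss while a
// single background goroutine performs the real fetch.
func (c *Cache) Get(key string) []byte {
	c.mu.Lock()
	defer c.mu.Unlock()
	if value, ok := c.store[key]; ok {
		return value
	}
	// Only the first miss for a key queues a fetch; identical requests
	// until it completes also get an empty response.
	if !c.pending[key] {
		c.pending[key] = true
		go func() {
			value := c.fetch(key)
			c.mu.Lock()
			c.store[key] = value
			delete(c.pending, key)
			c.mu.Unlock()
		}()
	}
	return nil
}
```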
This was then seen as a challenge between a co-worker,
<a href="http://late.am/" target="_blank">Dan Crosta</a>, and myself to see who
can write the better/faster caching system with these requirements. His solution,
entitled “CacheOrBust”, was a
<a href="http://fallabs.com/kyototycoon/" target="_blank">Kyoto Tycoon</a> plugin
written in C++ which simply used a subset of the memcached protocol as well as some
background workers and a request queue to perform the fetching. My solution,
<a href="https://github.com/brettlangdon/ferrite" target="_blank">Ferrite</a>, is a
custom server written in <a href="http://golang.org/" target="_blank">Go</a>
(originally written in C) that has the same functionality (except using
<a href="http://golang.org/doc/effective_go.html#goroutines" target="_blank">goroutines</a>
rather than background workers and a queue). Both servers used
<a href="http://fallabs.com/kyotocabinet/" target="_blank">Kyoto Cabinet</a>
as the underlying caching data structure.
So… results already! As with most fairly competitive competitions, it is always a
sad day when there is a tie. That's right: two similar solutions, written in two
different programming languages, yielded similar results (we probably have
Kyoto Cabinet to thank). Both of our caching systems were able to give us the
results we wanted, **consistent** sub-millisecond response times, averaging about
.5-.6 milliseconds per response (different physical servers, but the same datacenter),
regardless of whether the response was a cache hit or a cache miss. Usually the
moral of the story is: “do not re-invent the wheel, use something that already
exists that does what you want,” but realistically sometimes this isn’t an option.
Sometimes you have to bend the rules a little to get exactly what your application
needs, especially when dealing with low latency systems, where every millisecond counts.
Just be smart about the decisions you make and make sure you have sound
justification for them, especially if you decide to build it yourself.

+ 352
- 0
content/writing/about/third-party-tracking-pixels/index.md View File

@ -0,0 +1,352 @@
---
title: Third Party Tracking Pixels
author: Brett Langdon
date: 2013-05-03
template: article.jade
---
An overview of what a third party tracking pixel is and how to create/use them.
---
So, what exactly do we mean by “third party tracking pixel” anyways?
Lets try to break it down piece by piece:
### Tracking Pixel:
A pixel refers to a tag that is placed on a site that offers no merit other than
calling out to a web page or script that is not part of the current page you are visiting.
These pixels are usually an html script tag that points to a javascript file with
no content, or an img tag with an empty or transparent 1 pixel by 1 pixel gif image
(hence the term “pixel”). A tracking pixel is the term used to describe a pixel
that calls to another page or script in order to provide it information about the
user's visit to the page.
### Third Party:
Third party just means the pixel points to a website that is not the current
website. For example,
<a href="http://www.google.com/analytics/" target="_blank">Google Analytics</a>
is a third party tracking tool because you place scripts on your website
that calls and sends data to Google.
## What is the point?
Why do people do this? In the case of Google Analytics, site owners do not wish to track
and analyze their own analytics for their website; instead they want a third party
host to do it for them, but they need a way of sending their users' data to Google.
Using pixels and javascript to send the data to Google offers the company a few
benefits. For starters, they do not need any extra overhead on their own servers for
a service that sends data directly to Google; by using pixels and scripts they
get to offload this overhead onto their users (that's right, we are using our
personal computers' resources to send analytical data about ourselves to Google for
websites that use Google Analytics). Secondly, because a tracking
pixel runs client side (in the user's browser), we are able to gather more
information about the user. The information that is made available to us through
the use of javascript is far greater than what is given to our servers via
HTTP headers.
## How do we do it?
Next we will walk through the basics of how to create third party tracking pixels.
Code examples for the following discussion can be found
<a href="https://github.com/brettlangdon/tracking-server-examples" target="_blank">here</a>.
We will walk through four examples of tracking pixels accompanied by the server
code needed to serve and receive the pixels. The server is written in
<a href="http://python.org/" target="_blank">Python</a> and some basic
understanding of Python is required to follow along. The server examples are
written using only standard Python wsgi modules, so no extra installation is
needed. We will start off with a very simple example of using a tracking pixel and
then add features to the pixel with each example that follows.
## Simple Example
For this example all we want to accomplish is to have a web server that returns
HTML containing our tracking pixel as well as a handler to receive the call from
our tracking pixel. Our end goal is to serve this HTML content:
```html
<html>
<head></head>
<body>
<h2>Welcome</h2>
<script src="/track.js"></script>
</body>
</html>
```
As you can see, this is fairly simple HTML; the important part is the script tag
pointing to "/track.js", which is our tracking pixel. When the user's browser loads
the page this script will make a call to our server, and our server can then log
information about that user. So we start with a wsgi handler for the HTML code:
```python
def html_content(environ, respond):
headers = [('Content-Type', 'text/html')]
respond('200 OK', headers)
return [
"""
<html><head></head><body>
<h2>Welcome</h2><script src="/track.js"></script>
</body></html>
"""
]
```
Next we want to make sure that we have a handler for the calls to “/track.js”
from the script tag:
```python
def track_user(environ, respond):
    headers = [('Content-Type', 'application/javascript')]
    respond('200 OK', headers)
    # Log the interesting parts of the WSGI environ (path, headers, query).
    prefixes = ['PATH_', 'HTTP', 'REQUEST', 'QUERY']
    for key, value in environ.iteritems():
        if any(key.startswith(prefix) for prefix in prefixes):
            print '%s: %s' % (key, value)
    return ['']
```
In this handler we take various pieces of information about the user’s request
and simply print them to the screen. The endpoint “/track.js” is not meant to
serve actual JavaScript, so instead we return an empty string. When this
code runs you should see something like the following:
```
brett$ python tracking_server.py
Tracking Server Listening on Port 8000...
1.0.0.127.in-addr.arpa - - [24/Apr/2013 20:03:21] "GET / HTTP/1.1" 200 89
HTTP_REFERER: http://localhost:8000/
REQUEST_METHOD: GET
QUERY_STRING:
HTTP_ACCEPT_CHARSET: ISO-8859-1,utf-8;q=0.7,*;q=0.3
HTTP_CONNECTION: keep-alive
PATH_INFO: /track.js
HTTP_HOST: localhost:8000
HTTP_ACCEPT: */*
HTTP_USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31
HTTP_ACCEPT_LANGUAGE: en-US,en;q=0.8
HTTP_DNT: 1
HTTP_ACCEPT_ENCODING: gzip,deflate,sdch
1.0.0.127.in-addr.arpa - - [24/Apr/2013 20:03:21] "GET /track.js HTTP/1.1" 200 0
1.0.0.127.in-addr.arpa - - [24/Apr/2013 20:03:21] "GET /favicon.ico HTTP/1.1" 204 0
```
You can see in the above that first the browser makes the request “GET /” which
returns our HTML containing the tracking pixel, then directly afterwards makes a
request for “GET /track.js” which prints out various information about the incoming
request. This example is not very useful as is, but helps to illustrate the key
point of a tracking pixel. We are having the browser make a request on behalf of
the user without the user’s knowledge. In this case we are making a call back to
our own server, but our script tag could easily point to a third party server.
## Add Some Search Data
Our previous, simple, example does not really provide us with any particularly
useful information other than allowing us to track that a user’s browser made the
call to our server. For this next example we want to build upon the previous one by
sending some data along with the tracking pixel; in this case, some search data.
Let us assume that our web page allows users to make searches; searches
are given to the page through a URL query string parameter “search”. We want to
pass that value on to our tracking pixel, for which we will use the
query string parameter “s”. So our requests will look as follows:
* http://localhost:8000?search=my cool search
* http://localhost:8000/track.js?s=my cool search
To do this, we simply append the value of the “search” parameter onto our track.js
script tag in our HTML:
```python
# Both helpers come from the Python 2 standard library.
from urllib import quote
from urlparse import parse_qs


def html_content(environ, respond):
    # Pull the "search" parameter off the page request and forward it
    # to the tracking pixel as "s" (URL-encoded).
    query = parse_qs(environ['QUERY_STRING'])
    search = quote(query.get('search', [''])[0])
    headers = [('Content-Type', 'text/html')]
    respond('200 OK', headers)
    return [
        """
        <html><head></head><body>
        <h2>Welcome</h2><script src="/track.js?s=%s"></script>
        </body></html>
        """ % search
    ]
```
For our tracking pixel handler we will simply print the value of the query string
parameter “s” and again return an empty string.
```python
from urlparse import parse_qs


def track_user(environ, respond):
    # Read the search term forwarded by the pixel and log it.
    query = parse_qs(environ['QUERY_STRING'])
    search = query.get('s', [''])[0]
    print 'User Searched For: %s' % search
    headers = [('Content-Type', 'application/javascript')]
    respond('200 OK', headers)
    return ['']
```
When run the output will look similar to:
```
brett$ python tracking_server.py
Tracking Server Listening on Port 8000...
1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:24] "GET /?search=my%20cool%20search HTTP/1.1" 200 110
User Searched For: my cool search
1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:24] "GET /track.js?s=my%20cool%20search HTTP/1.1" 200 0
1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:24] "GET /favicon.ico HTTP/1.1" 204 0
1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:34] "GET /?search=another%20search HTTP/1.1" 200 108
User Searched For: another search
1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:34] "GET /track.js?s=another%20search HTTP/1.1" 200 0
1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:34] "GET /favicon.ico HTTP/1.1" 204 0
```
Here we can see the two search requests made to our web page and the corresponding
requests to track.js. Again, this example might not seem like much, but
it demonstrates a way of passing values from our web page along to the
tracking server. In this case we are passing search terms, but we could also pass
along any other information we needed, as the sketch below illustrates.
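For example, a hypothetical variation of the handler (not part of the original
examples) could also record which page the pixel was embedded on by reading the
Referer header:
```python
from urlparse import parse_qs


def track_user(environ, respond):
    query = parse_qs(environ['QUERY_STRING'])
    search = query.get('s', [''])[0]
    # The Referer header tells us which page embedded the pixel; any other
    # value could just as easily be forwarded as a query parameter like "s".
    page = environ.get('HTTP_REFERER', 'unknown')
    print 'User Searched For: %s (from page %s)' % (search, page)
    headers = [('Content-Type', 'application/javascript')]
    respond('200 OK', headers)
    return ['']
```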
## Track Users with Cookies
So now we are getting somewhere: our tracking server is able to receive some
search data about the requests made to our web page. The problem now is that we have
no way of associating this information with a specific user; how can we know when
a specific user searches for multiple things? Cookies to the rescue. In this
example we are going to add support for using cookies to assign each visiting
user a unique id, which will allow us to associate all the search data
we receive with “specific” users. Yes, I say “specific” with quotes because we can
only associate the data with a given cookie; if multiple people share a computer
then we will probably think they are a single person. As well, if someone clears
their browser’s cookies then we lose all association with that user and have
to start all over again with a new cookie. Lastly, if a user does not allow cookies
in their browser then we will be unable to associate any data with them, as every
time they visit our tracking server we will see them as a new user. So, how do we
do this? When we receive a request from a user we check whether we have already
given them a cookie with a user id; if so, we associate the incoming data
with that user id, and if not, we generate a new user id and give it to the user.
```python
# Standard library imports (Python 2).
from Cookie import SimpleCookie
from urlparse import parse_qs
from uuid import uuid4


def track_user(environ, respond):
    # Look for an existing "id" cookie; SimpleCookie returns a Morsel,
    # so we take its .value to get the actual user id string.
    cookies = SimpleCookie()
    cookies.load(environ.get('HTTP_COOKIE', ''))
    morsel = cookies.get('id')
    user_id = morsel.value if morsel else None
    if not user_id:
        user_id = uuid4()
        print 'User did not have id, giving: %s' % user_id
    query = parse_qs(environ['QUERY_STRING'])
    search = query.get('s', [''])[0]
    print 'User %s Searched For: %s' % (user_id, search)
    headers = [
        ('Content-Type', 'application/javascript'),
        ('Set-Cookie', 'id=%s' % user_id)
    ]
    respond('200 OK', headers)
    return ['']
```
This is great! Not only can we now obtain search data from a third party website,
but we can also do our best to associate that data with a given user. In this
instance a single user is anyone who shares the same user id in their
browser’s cookies.
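To make “associating data with a user” a little more concrete, here is a minimal
sketch of my own (not part of the original examples) that keeps an in-memory
dictionary of searches per user id; a real tracking server would persist this
data somewhere durable.
```python
from collections import defaultdict

# Maps user id -> list of search terms seen for that id. Purely
# illustrative: restarting the server wipes everything.
searches_by_user = defaultdict(list)


def record_search(user_id, search):
    # Could be called from inside track_user once the user id is known.
    searches_by_user[str(user_id)].append(search)
    print 'User %s has searched for: %s' % (
        user_id, ', '.join(searches_by_user[str(user_id)]))
```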
## Cache Busting
So what exactly is cache busting? Our browsers are smart: they know that we do not
like to wait a long time for a web page to load, and they have also learned that they
do not need to refetch content they have seen before if they cache it. For
example, an image on a web site might get cached by your web browser so that every time
you reload the page the image can be loaded locally as opposed to being fetched
from the remote server. Cache busting is a way to ensure that the browser does not
cache the content of our tracking pixel. We want the user’s browser to follow the
tracking pixel to our server for every page request they make, because we want to
follow everything that that user does. When the browser caches our tracking
pixel’s content (an empty string) we lose out on data. Cache busting is the
term used when we programmatically generate query string parameters to make calls
to our tracking pixel look unique and therefore ensure that the browser follows
the pixel rather than loading from its cache. To do this we need to add an extra
endpoint to our server. We need the HTML for the web page, along with a cache busting
script and finally our track.js handler. A cache busting script uses JavaScript
to add our track.js script tag to the web page. This means that after the web page
is loaded, JavaScript will run to manipulate the
<a href="http://en.wikipedia.org/wiki/Document_Object_Model" target="_blank">DOM</a>
and add our cache busted track.js script tag to the HTML. So, what does this
look like?
```javascript
var now = new Date().getTime();
var random = Math.random() * 99999999999;
document.write('<script type="text/javascript" src="/track.js?t=' + now + '&r=' + random + '"></script>');
```
This script adds two extra query string parameters: “r”, which is a random number,
and “t”, which is the current timestamp in milliseconds. This gives us a unique
enough request to trick the browser into ignoring anything it has in
its cache for track.js and forces it to make the request anyway. Using a cache
buster requires us to modify the HTML we serve slightly, to serve up the cache
busting JavaScript as opposed to our track.js pixel directly.
```html
<html>
<head></head>
<body>
<h2>Welcome</h2>
<script src="/buster.js"></script>
</body>
</html>
```
And we need the following to serve up the cache buster script buster.js:
```python
def cache_buster(environ, respond):
    headers = [('Content-Type', 'application/javascript')]
    respond('200 OK', headers)
    # JavaScript served to the page: it rebuilds the track.js URL with a
    # timestamp and random number so the browser never sees the same URL twice.
    cb_js = """
    function getParameterByName(name){
        name = name.replace(/[\[]/, "\\\[").replace(/[\]]/, "\\\]");
        var regexS = "[\\?&]" + name + "=([^&#]*)";
        var regex = new RegExp(regexS);
        var results = regex.exec(window.location.search);
        if(results == null){
            return "";
        }
        return decodeURIComponent(results[1].replace(/\+/g, " "));
    }
    var now = new Date().getTime();
    var random = Math.random() * 99999999999;
    var search = getParameterByName('search');
    document.write('<script src="/track.js?t=' + now + '&r=' + random + '&s=' + search + '"></script>');
    """
    return [cb_js]
```
We do not care very much whether the browser caches the cache buster script itself,
because it will always generate a new, unique track.js URL every time it runs.
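A complementary approach, not covered in the original examples but worth
mentioning, is to ask the browser outright not to cache the pixel by adding
cache-related response headers to the track.js handler. A sketch of what that
might look like:
```python
def track_user(environ, respond):
    # Same handler shape as before, but the extra headers tell the browser
    # (and most intermediate caches) not to store the response at all.
    headers = [
        ('Content-Type', 'application/javascript'),
        ('Cache-Control', 'no-cache, no-store, must-revalidate'),
        ('Pragma', 'no-cache'),
        ('Expires', '0'),
    ]
    respond('200 OK', headers)
    return ['']
```
In practice the two techniques are often combined, since older browsers and some
proxies do not reliably honor these headers.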
## Conclusion
There is a lot going on here and probably a lot to digest, so let’s quickly review
what we have learned. For starters, we learned that companies use tracking
pixels or tags on web pages whose sole purpose is to make your browser call out to
external third party sites in order to track information about your internet
usage (usually; they can be used for other things as well). We also looked into
some very simplistic ways of implementing a server whose job is to accept
tracking pixel calls in various forms.
We learned that these tracking servers can use cookies stored in your browser to
hold a unique id for you in order to help associate the collected data with you,
and that you can remove this association by clearing your cookies or by not allowing
them at all. Lastly, we learned that browsers can cause issues for our tracking
pixels and data collection, and that we can get around them using a cache busting
script.
As a reminder, the full working code examples can be located at
<a href="https://github.com/brettlangdon/tracking-server-examples" target="_blank">https://github.com/brettlangdon/tracking-server-examples</a>.

+ 42
- 0
content/writing/about/what-i'm-up-to-these-days/index.md View File

@ -0,0 +1,42 @@
---
title: What I'm up to these days
author: Brett Langdon
date: 2015-06-19
template: article.jade
---
It has been a while since I have written anything on my blog. Might as well get started
somewhere, with a brief summary of what I have been working on lately.
---
It has been far too long since I last wrote in this blog. I always have these aspirations
of writing all the time about all the things I am working on. The problem generally comes
back to me not feeling confident enough to write about anything I am working on. "Oh, a
post like that probably already exists", "There are smarter people than me out there
writing about this, why bother". It is an unfortunate feeling to try and get over.
So, here is where I am making an attempt. I will try to write more; it'll be healthy for
me. I always hear of people setting reminders in their calendars to block off time to
write blog posts, even if they end up only writing a few sentences, which seems like a
great idea that I intend to try.
Ok, enough with the "I haven't been feeling confident" drivel, on to what I actually have
been up to lately.
Since my last post I have a new job. I am now a Senior Software Engineer at
[underdog.io](https://underdog.io/). We are a small early stage startup (4 employees, just
over a year old) in the hiring space. For candidates our site basically acts like
a common application for what is now over 150 venture backed startups in New York City and San
Francisco. In the short time I have been working there I have been very impressed, and I am glad that
I took their offer. I work with some awesome and smart people and I am still learning a
lot, whether it is about coding or just trying to run a business.
I originally planned to end this post by talking about a programming project I have been
working on, but it ended up being four times longer than the text above, so I have decided
instead to write a separate post about it. Apparently, even though I have not been writing
lately, I have a lot to say.
Thanks for bearing with this "I have to write something" post. I am not going to make a
promise that I am going to write more, because it is something that could easily fall
through, like it usually does... but I shall give it my all!

+ 86
- 0
content/writing/about/why-benchmarking-tools-suck/index.md View File

@ -0,0 +1,86 @@
---
title: Why Benchmarking Tools Suck
author: Brett Langdon
date: 2012-10-22
template: article.jade
---
A brief aside into why I think no benchmarking tool is exactly correct
and why I wrote my own.
---
Benchmarking is (or should be) a fairly important part of most developers’ jobs:
determining the load that the systems they build can withstand. We are
currently at a point in our development lifecycle at work where load testing is a
fairly high priority. We need to be able to answer questions like: what kind of
load can our servers currently handle as a whole? What kind of load can a single
server handle? How much throughput can we gain by adding X more servers? What
happens when we overload our servers? What happens when our concurrency doubles?
These are all questions that most of us have probably been asked at some point in our
careers. Luckily there is a plethora of HTTP benchmarking tools to help try
to answer these questions. Tools like
<a href="http://httpd.apache.org/docs/2.2/programs/ab.html" target="_blank">ab</a>,
<a href="http://www.joedog.org/siege-home/" target="_blank">siege</a>,
<a href="https://github.com/newsapps/beeswithmachineguns" target="_blank">beeswithmachineguns</a>,
<a href="http://curl-loader.sourceforge.net/" target="_blank">curl-loader</a>
and one I wrote recently (today),
<a href="https://github.com/brettlangdon/tommygun" target="_blank">tommygun</a>.
Every single one of those tools sucks, including the one I wrote (and will
probably keep using/maintaining). Why? Don’t a lot of people use them? Yes,
almost everyone I know has used ab (most of you probably have) and I know a
decent handful of people who use siege, but that does not mean that they are
the most useful for all use cases. In fact they tend to only be useful for a
limited set of testing. Ab is great if you want to test a single web page, but
what if you need to test multiple pages at once, or in a sequence? I’ve also
personally experienced huge performance issues with running ab from a Mac. These
scope issues of ab make way for other tools such as siege and curl-loader, which
can test multiple pages at a time or in a sequence, but at what cost? Currently at
work we are having issues getting siege to properly parse and test a few hundred
thousand URLs, some of which contain binary POST data.
On top of only really having a limited set of use cases, each benchmarking tool
also introduces overhead on the machine that you are benchmarking from. Ab might
be able to test your servers faster and with more concurrency than curl-loader
can, but if only curl-loader can test your specific use case, which do you use?
Curl-loader can probably benchmark exactly what you’re trying to test, but if it
cannot generate the level of load you are looking for, then how useful a
tool is it? What if you need to scale your benchmarking tool? How do you scale
it? What if you are running the test from the same machine as
your development environment? What kind of effect will running the benchmarking
tool itself have on your application?
So, what is the solution then? I think instead of trying to develop these command
line tools to fit each scenario, we should try to develop a benchmarking framework
with all of the right pieces that we need. For example, develop a platform that
has the functionality to run a given task concurrently, but where you supply the
task for it to run. This way the benchmarking tool does not become obsolete and
useless as your application evolves. It also paves the way for the tool to
be protocol agnostic, allowing people to easily write tests for HTTP web
applications or even services that do not speak HTTP, such as message queues
or in-memory stores. This framework should also provide a way to scale the tool
to allow more throughput and overload on your system. Last, but not least, the
platform should be lightweight and introduce as little overhead as
possible, for those who do not have EC2 available to them for testing, or who do
not have spare servers lying around to test from. A rough sketch of the idea
follows below.
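Here is a minimal sketch of my own illustrating the core of such a framework in
Python (this is an illustration, not an existing tool): the caller supplies any
zero-argument callable as the task, and the harness only handles concurrency and
timing.
```python
import time
from threading import Thread


def run_benchmark(task, concurrency=10, tasks_per_worker=100):
    # `task` is any zero-argument callable: an HTTP request, a queue
    # publish, a cache write, etc. The harness only supplies concurrency
    # and measurement, so it stays protocol agnostic.
    durations = []

    def worker():
        for _ in range(tasks_per_worker):
            start = time.time()
            task()
            durations.append(time.time() - start)

    threads = [Thread(target=worker) for _ in range(concurrency)]
    started = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.time() - started

    total = concurrency * tasks_per_worker
    print 'completed %d tasks in %.2fs (%.1f tasks/sec)' % (
        total, elapsed, total / elapsed)
    print 'average task time: %.4fs' % (sum(durations) / len(durations))
```
An HTTP benchmark would pass a function that performs one request; a message
queue benchmark would pass a function that publishes one message, and so on.
Scaling out would then be a matter of running this harness from many machines at
once.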
I am not saying that up until now load testing has been nothing but a pain, or that
the tools we have available to us (for free) are the worst things out there
and should not be trusted. I just feel that they do not and cannot meet every use
case, and that I have been plagued by this issue in the past. How can you properly
load test your application if you do not have the right load testing tool for
the job?
So, I know what some might be thinking, “sounds neat, when will your framework
be ready for me to use?” That is a nice idea, but if the past few months are any
indication of how much free time I have, I might not be able to get anything done
right away (seeing how I was able to write my load testing tool while on vacation).
I am however, more than willing to contribute to anyone else’s attempt at this
framework and I am especially more than willing to help test anyone else’s
framework.
**Side Note:** If anyone knows of any tool or framework that currently tries to
achieve my “goal”, please let me know. I was unable to find any tools out there
that worked as I described or even got close, but I might not have searched for
the right thing or maybe skipped over the right link, etc.

+ 56
- 0
content/writing/about/write-code-every-day/index.md View File

@ -0,0 +1,56 @@
---
title: Write code every day
author: Brett Langdon
date: 2015-07-02
template: article.jade
---
Just like a poet or an athlete, practicing code every day will only make you better.
---
Lately I have been trying to get into blogging more, and every article I read says, "you need to write every day".
It doesn't matter if what I write down gets published; forming the habit of trying to write something every day
is what counts. The more I write the easier it will become, the more natural it will feel and the better I will get at it.
This isn't just true of writing or blogging; it can be said of almost anything: riding a bike,
playing basketball, reading, cooking. The more you do it, the easier it will become and
the better you will get.
As the title of this post alludes, the same is true of programming. If you want to be really good at programming
you have to write code every day. The more code you write, the easier it'll be to write and the better you will be at programming.
Just like any other task I've listed in this article, writing code every day, even if you are used to it, can be really
hard to do and a really hard habit to keep.
"What should I write?" The answer to this question is going to be different for everyone, but it is the hurdle which
you must first overcome to work your way towards writing code every day. Usually people write code to solve problems
that they have, but not everyone has problems to solve. There is usually a chicken and the egg problem. You need to
write code to have coding problems, and you need to have coding problems to have something to write. So, where should
you start?
For myself, one of the things I like doing is to rewrite things that already exist. Sometimes it can be hard to come up with a
new and different idea or even a new approach to an existing idea. However, there are millions of existing projects out
there to copy. The idea I go for is to try and replicate the overall goal of the project, but in my own way. That might
mean writing it in a different language, or changing the API for it or just taking some wacky new approach to solving the same issue.
More often than not the above exercise leads me to a problem that I can then go off and solve. For example, a few weeks ago
I sat down and decided I wanted to write a web server in `go` (think `nginx`/`apache`). I knew going into the project I wanted
a really nice and easy to use configuration file to define the settings. So, I did what most people do these days and
used `json`, but that didn't really feel right to me. I then tried `yaml`, but again it didn't feel like what I wanted. I
probably could have used the `ini` format and made custom rules for the keys and values, but that felt hacky too. This spawned
a new project to solve the problem I was having, which ended up being [forge](https://github.com/brettlangdon/forge),
a hand coded configuration file syntax and parser for `go` that is a neat mix between `json` and `nginx`
configuration file syntax.
Anywho, enough of me trying to self-promote projects. The main point is that by trying to replicate something that
already exists, without really trying to do anything new, I came up with an idea which spawned another project and
for at least a week (and continuing now) gave me a reason to write code every day. Not only did I write something
useful that I can now use in any future project of mine, I also learned something I did not know before: I learned
how to hand-code a syntax parser in `go`.
Ultimately, try to take "coding every day" not as a challenge to write something useful every day, but to learn
something new every day. Learn part of a new language, a new framework, learn how to take something apart or put
it back together. Write code every day and learn something new every day. The more you do this, the more you will
learn and the better you will become.
Go forth and happy coding. :)

+ 1
- 0
static/css/lato.css View File

@ -0,0 +1 @@
css

+ 9
- 0
static/css/site.css View File

@ -0,0 +1,9 @@
#wrapper,
.profile #wrapper,
#wrapper.home {
max-width: 900px;
}
a.symbol {
margin-right: 0.7rem;
}

BIN
static/images/avatar.png View File

Width: 128  |  Height: 128  |  Size: 25 KiB

BIN
static/images/avatar@2x.png View File

Width: 1024  |  Height: 1024  |  Size: 554 KiB

BIN
static/images/favicon.ico View File

