diff --git a/config.toml b/config.toml new file mode 100644 index 0000000..658160b --- /dev/null +++ b/config.toml @@ -0,0 +1,22 @@ +baseurl = "https://brett.is/" +title = "Brett.is" +languageCode = "en-us" +theme = "hugo-cactus-theme" +googleAnalytics = "UA-34513423-1" +disqusShortname = "brettlangdon" + +[params] + customCSS = ["css/lato.css", "css/site.css"] + name = "Brett Langdon" + description = "A geek with a blog" + bio = "A geek with a blog" + aboutAuthor = "A geek with a blog" + twitter = "brett_langdon" + enableRSS = true + iconFont = "font-awesome" + +[social] + twitter = "https://twitter.com/brett_langdon" + github = "https://github.com/brettlangdon" + linkedin = "https://www.linkedin.com/in/brettlangdon" + rss = "https://brett.is/index.xml" diff --git a/content/about/index.md b/content/about/index.md new file mode 100644 index 0000000..a845151 --- /dev/null +++ b/content/about/index.md @@ -0,0 +1,2 @@ +--- +--- diff --git a/content/writing/about/browser-fingerprinting/index.md b/content/writing/about/browser-fingerprinting/index.md new file mode 100644 index 0000000..4a107ee --- /dev/null +++ b/content/writing/about/browser-fingerprinting/index.md @@ -0,0 +1,107 @@ +--- +title: Browser Fingerprinting +author: Brett Langdon +date: 2013-06-05 +template: article.jade +--- + +Ever want to know what browser fingerprinting is or how it is done? + +--- + +## What is Browser Fingerprinting? + +A browser or device fingerprint +is a term used to describe an identifier generated from information retrieved from +a single given device that can be used to identify that single device only. +For example, as you will see below, browser fingerprinting can be used to generate +an identifier for the browser you are currently viewing this website with. +Regardless of you clearing your cookies (which is how most third party companies +track your browser) the identifier should be the same every time it is generated +for your specific device/browser. A browser fingerprint is usually generated from +the browsers user agent, +timezone offset, list of installed plugins, available fonts, screen resolution, +language and more. The EFF did +a study +on how unique a browser fingerprint for a given client can be and which browser +information provides the most entropy. To see how unique your browser is please +check out their demo application +Panopticlick. + +## What can it used for? + +Ok, so great, but who cares? How can browser fingerprinting be used? Right now +the majority of user tracking +is done by the use of cookies. For example, when you go to a website that has +[tracking pixels](http://brett.is/writing/about/third-party-tracking-pixels/) +(which are “invisible” scripts or images loaded in the background of the web page) +the third party company receiving these tracking calls will inject a cookie into +your browser which has a unique, usually randomly generated, identifier that is +used to associate stored data about you like collected +site or search retargeting +data. This way when you visit them again with the same cookie they can lookup +previously associated data for you. + +So, if this is how it is usually done why do we care about browser fingerprints? +Well, the main problem with cookies is they can be volatile, if you manually delete +your cookies then the company that put that cookie there loses all association with +you and any data they have on your is no longer useful. 
As well, if a client does +not allow third party cookies (or any cookies) on their browser then the company +will be unable to track the client at all. + +A browser fingerprint on the other hand is a more constant way to identify a given +client, as long as they have javascript enabled (which seems to be a thing which +most websites cannot properly function without), which allows the client to be +identified even if they do not allow cookies for their browser. + +##How do we do it? + +Like I mentioned before to generate a browser fingerprint you must have javascript +enabled as it is the easiest way to gather the most information about a browser. +Javascript gives us access to things like your screen size, language, installed +plugins, user agent, timezone offset, and other points of interest. This +information is basically smooshed together in a string and then hashed to generate +the identifier, the more information you can gather about a single browser the more +unique of a fingerprint you can generate and the less collision you will have. + +Collision? Yes, if you end up with two laptops each of the same make, model, year, +os version, browser version with the exact same features and plugins enabled then +the hashes will be the exact same and anyone relying on their fingerprint will +treat both of those devices as the same. But, if you read the white paper by EFF +listed above then you will see that their method for generating browser fingerprints +is usually unique for almost 3 million different devices. There may be some cases +for companies where that much uniqueness is more than enough to use and rely on +fingerprints to identify devices and others where they have more than 3 +million users. + +Where does this really come into play? Most websites usually have their users +create and account and log in before allowing them access to portions of the site or +to be able to lookup stored information, maybe their credit card payment +information, home address, e-mail address, etc. Where browser fingerprints are +useful is for trying to identify anonymous visitors to a web application. For +example, [third party trackers](/writing/about/third-party-tracking-pixels/) +who are collecting search or other kinds of data. + +## Some Code + +Their is a project on github +by user Valentin Vasilyev (Valve) +called fingerprintjs +which is a client side javascript library for generating browser fingerprints. +If you are interested in seeing some production worthy code of how to generate +browser fingerprints please take a look at that project, it uses information like +useragent, language, color depth, timezone offset, whether session or local storage +is available, a listing of all installed plugins and it hashes everything using +murmurhash3. + +## Your fingerprintjs Fingerprint: *Could not generate fingerprint* + + + + +**Resources:** +* panopticlick.eff.org - find out how rare your browser fingerprint is. +* github.com/Valve/fingerprintjs - client side browser fingerprinting library. diff --git a/content/writing/about/continuous-nodejs-module/index.md b/content/writing/about/continuous-nodejs-module/index.md new file mode 100644 index 0000000..0d1287b --- /dev/null +++ b/content/writing/about/continuous-nodejs-module/index.md @@ -0,0 +1,62 @@ +--- +title: Continuous NodeJS Module +author: Brett Langdon +date: 2012-04-28 +template: article.jade +--- + +A look into my new NodeJS module called Continuous. + +--- + +Greetings everyone. 
I wanted to take a moment to mention the new NodeJS module +that I just published called Continuous. + +Continuous is a fairly simply plugin that is aimed to aid in running blocks of +code consistently; it is an event based interface for setTimeout and setInterval. +With Continuous you can choose to run code at a set or random interval and +can also hook into events. + +## Installation +```bash +npm install continuous +``` + +## Continuous Usage + +```javascript +var continuous = require('continuous'); + +var run = new continuous({ + minTime: 1000, + maxTime: 3000, + random: true, + callback: function(){ + return Math.round( new Date().getTime()/1000.0 ); + }, + limit: 5 +}); + +run.on(‘complete’, function(count, result){ + console.log(‘I have run ‘ + count + ‘ times’); + console.log(‘Results:’); + console.dir(result); +}); + +run.on(‘started’, function(){ + console.log(‘I Started’); +}); + +run.on(‘stopped’, function(){ + console.log(‘I am Done’); +}); + +run.start(); + +setTimeout( function(){ + run.stop(); +}, 5000 ); +``` + +For more information check out Continuous on +GitHub. diff --git a/content/writing/about/cookieless-user-tracking/index.md b/content/writing/about/cookieless-user-tracking/index.md new file mode 100644 index 0000000..1715c94 --- /dev/null +++ b/content/writing/about/cookieless-user-tracking/index.md @@ -0,0 +1,167 @@ +--- +title: Cookieless User Tracking +author: Brett Langdon +date: 2013-11-30 +template: article.jade +--- + +A look into various methods of online user tracking without cookies. + +--- + +Over the past few months, in my free time, I have been researching various +methods for cookieless user tracking. I have a previous article that talks +on how to write a +tracking server +which uses cookies to follow people between requests. However, recently +browsers are beginning to disallow third party cookies by default which means +developers have to come up with other ways of tracking users. + + +## Browser Fingerprinting + +You can use client side javascript to generate a +browser fingerprint, +or, a unique identifier for a specific users browser (since that is what cookies +are actually tracking). Once you have the browser's fingerprint you can then +send that id along with any other requests you make. + +```javascript +var user_id = generateBrowserFingerprint(); +document.write( + ' + +``` + +Alright, so lets cover a few concepts from above, `tags`, `metrics` and `syncing`. + +### Tags +Tags are meant to be a way to uniquely identify the metrics that are being sent +to the server and are generally used to break apart metrics. For example, you might +have a metric to track whether or not someone clicks an "add to cart" button, using tags +you can then break out that metric to see how many times the button has been pressed +for each `productId` or browser or language or any other piece of data you find +applicable to segment your metrics. Tags can also be used when tracking data for +[A/B Tests](http://en.wikipedia.org/wiki/A/B_testing) where you want to segment your +data based on which part of the test the user was included. + +### Metrics +Metrics are simply data points to track for a given request. Good metrics to record +are things like load times, elements loaded on the page, time spent on the page, +number of times buttons are clicked or other user interactions with the page. + +### Syncing +Syncing refers to sending the data from the client to the server. 
I refer to it as +"syncing" since we want to try and aggregate as much data on the client side and send +fewer, but larger, requests rather than having to make a request to the server for +each metric we mean to track. We do not want to overload the Client if we mean to +track a lot of user interactions on the site. + +## How To Do It +Alright, enough of the simple examples/explanations, lets dig into the source a bit +to find out how to aggregate the data on the client side and how to sync that data +to the server. + +### Aggregating Data +Collecting the data we want to send to the server isn't too bad. We are just going +to take any specific calls to `Sleuth.track(key, value)` and store either in +[LocalStorage](http://diveintohtml5.info/storage.html) or in an object until we need +to sync. For example this is the `track` method of `Sleuth`: + +```javascript +Sleuth.prototype.track = function(key, value){ + if(this.config.useLocalStorage && window.localStorage !== undefined){ + window.localStorage.setItem('Sleuth:' + key, value); + } else { + this.data[key] = value; + } +}; +``` + +The only thing of note above is that it will fall back to storing in `this.data` +if LocalStorage is not available as well we are namespacing all data stored in +LocalStorage with the prefix "Sleuth:" to ensure there is no name collision with +anyone else using LocalStorage. + +Also `Sleuth` will be kind enough to capture data from `window.performance` if it +is available and enabled (it is by default). And it simply grabs everything it can +to sync up to the server: + +```javascript +Sleuth.prototype.captureWindowPerformance = function(){ + if(this.config.performance && window.performance !== undefined){ + if(window.performance.timing !== undefined){ + this.data.timing = window.performance.timing; + } + if(window.performance.navigation !== undefined){ + this.data.navigation = { + redirectCount: window.performance.navigation.redirectCount, + type: window.performance.navigation.type, + }; + } + } +}; +``` + +For an idea on what is store in `window.performance.timing` check out +[Navigation Timing](https://developer.mozilla.org/en-US/docs/Navigation_timing). + +### Syncing Data +Ok, so this is really the important part of this library. Collecting the data isn't +hard. In fact, no one probably really needs a library to do that for them, when you +just as easily store a global object to aggregate the data. But why am I making a +"big deal" about syncing the data either? It really isn't too hard when you can just +make a simple AJAX call using jQuery `$.ajax(...)` to ship up a JSON string to some +server side listener. + +The approach I wanted to take was a little different, yes, by default `Sleuth` will +try to send the data using AJAX to a server side url "/track", but what about when +the server which collects the data lives on another hostname? +[CORS](http://en.wikipedia.org/wiki/Cross-origin_resource_sharing) can be less than +fun to deal with, and rather than worrying about any domain security I just wanted +a method that can send the data from anywhere I want back to whatever server I want +regardless of where it lives. So, how? Simple, javascript pixels. + +A javascript pixel is simply a `script` tag which is written to the page with +`document.write` whose `src` attribute points to the url that you want to make the +call to. The browser will then call that url without using AJAX just like it would +with a normal `script` tag loading javascript. 
For a more in-depth look at tracking +pixels you can read a previous article of mine: +[Third Party Tracking Pixels](http://brett.is/writing/about/third-party-tracking-pixels/). + +The point of going with this method is that we get CORS-free GET requests from any +client to any server. But some people are probably thinking, "wait, a GET request +doesn't help us send data from the client to server"? This is why we will encode +our JSON string of data for the url and simply send in the url as a query string +parameter. Enough talk, lets see what this looks like: + +```javascript +var encodeObject = function(data){ + var query = []; + for(var key in data){ + query.push(encodeURIComponent(key) + '=' + encodeURIComponent(data[key])); + }; + + return query.join('&'); +}; + +var drop = function(url, data, tags){ + // base64 encode( stringify(data) ) + tags.d = window.btoa(JSON.stringify(data)); + + // these parameters are used for cache busting + tags.n = new Date().getTime(); + tags.r = Math.random() * 99999999; + + // make sure we url encode all parameters + url += '?' + encodeObject(tags); + document.write(''); +}; +``` + +That is basically it. We simply base64 encode a JSON string version of the data and send +as a query string parameter. There might be a few odd things that stand out above, mainly +url length limitations of base64 encoded JSON string, the "cache busting" and the weird +breaking up of the tag "script". A safe url length limit to live under is around +[2000](http://stackoverflow.com/questions/417142/what-is-the-maximum-length-of-a-url-in-different-browsers) +to accommodate internet explorer, which from some very crude testing means each reqyest +can hold around 50 or so separate metrics each containing a string value. Cache busting +can be read about more in-depth in my article again about tracking pixels +(http://brett.is/writing/about/third-party-tracking-pixels/#cache-busting), but the short +version is, we add random numbers and the current timestamp the query string to ensure that +the browser or cdn or anyone in between doesn't cache the request being made to the server, +this way you will not get any missed metrics calls. Lastly, breaking up the `script` tag +into "sc + ript" and "scri + pt" makes it harder for anyone blocking scripts from writing +`script` tags to detect that a script tag is being written to the DOM (also an `img` or +`iframe` tag could be used instead of a `script` tag). + +### Unload +How do we know when to send the data? If someone is trying to time and see how much time +someone is spending on each page or wants to make sure they are collecting as much data +as they want on the client side then you want to wait until the last second before +syncing the data to the server. By using LocalStorage to store the data you can ensure +that you will be able to access that data the next time you see that user, but who wants +to wait? And what if the user never comes back? I want my data now dammit! + +Simple, lets bind an event to `window.onunload`! Woot, done... wait... why isn't my data +being sent to me? Initially I was trying to use `window.onunload` to sync data back, but +found that it didn't always work with pixel dropping, AJAX requests worked most of the time. +After some digging I found that with `window.onunload` I was hitting a race condition on +whether or not the DOM was still available or not, meaning I couldn't use `document.write` +or even query the DOM on unload for more metrics to sync on `window.onunload`. 
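To make that race condition concrete, here is a minimal sketch of the kind of binding that ran into the problem (the `sleuth.sync()` call is a hypothetical stand-in for the pixel-drop `drop()` function shown earlier, not Sleuth's actual API):

```javascript
// Hypothetical sketch: syncing on window.onunload, the approach that proved unreliable.
// By the time this handler fires the DOM may already be torn down, so the document.write
// inside the pixel drop can silently do nothing and the metrics never reach the server.
window.onunload = function(){
    sleuth.sync(); // pixel drop via document.write -- racy during unload
};
```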
+ +In come `window.onbeforeunload` to the rescue! For those who don't know about it (I +didn't before this project), `window.onbeforeunload` is exactly what it sounds like +an event that gets called before `window.onunload` which also happens before the DOM +gets unloaded. So you can reliably use it to write to the DOM (like the pixels) or +to query the DOM for any extra information you want to sync up. + +## Conclusion +So what do you think? There really isn't too much to it is there? Especially since we +only covered the client side of the piece and haven't touched on how to collect and +interpret this data on the server (maybe that'll be a follow up post). Again this is mostly +a simple implementation of a RUM library, but hopefully it sparks an interest to build +one yourself or even just to give you some insight into how Google Analytics or other +RUM libraries collect/send data from the client. + +I think this project that I undertook was neat because I do not always do client side +javascript and every time I do I tend to learn something pretty cool. In this case +learning the differences between `window.onunload` and `window.onbeforeunload` as well +as some of the cool things that are tracked by default in `window.performance` I +definitely urge people to check out the documentation on `window.performance`. + +### TODO +What is next for [Sleuth](https://github.com/brettlangdon/sleuth)? I am not sure yet, +I am thinking of implementing more ways of tracking data, like adding counter support, +rate limiting, automatic incremental data syncs. I am open to ideas of how other people +would use a library like this, so please leave a comment here or open an issue on the +projects github page with any thoughts you have. + + +## Links +* [Sleuth](https://github.com/brettlangdon/sleuth) +* [Third Party Tracking Pixels](http://brett.is/writing/about/third-party-tracking-pixels/) +* [LocalStorage](http://diveintohtml5.info/storage.html) +* [Navigation Timing](https://developer.mozilla.org/en-US/docs/Navigation_timing) +* [window.onbeforeunload](https://developer.mozilla.org/en-US/docs/Web/API/Window.onbeforeunload) +* [window.onunload](https://developer.mozilla.org/en-US/docs/Web/API/Window.onunload) +* [RUM](http://en.wikipedia.org/wiki/Real_user_monitoring) +* [Google Analytics](http://www.google.com/analytics/) +* [A/B Testing](http://en.wikipedia.org/wiki/A/B_testing) diff --git a/content/writing/about/managing-go-dependencies-with-git-subtree/index.md b/content/writing/about/managing-go-dependencies-with-git-subtree/index.md new file mode 100644 index 0000000..89a7328 --- /dev/null +++ b/content/writing/about/managing-go-dependencies-with-git-subtree/index.md @@ -0,0 +1,145 @@ +--- +title: Managing Go dependencies with git-subtree +author: Brett Langdon +date: 2016-02-03 +template: article.jade +--- + +Recently I have decided to make the switch to using `git-subtree` for managing dependencies of my Go projects. + +--- + +For a while now I have been searching for a good way to manage dependencies for my [Go](https://golang.org/) +projects. I think I have finally found a work flow that I really like that uses +[git-subtree](http://git.kernel.org/cgit/git/git.git/plain/contrib/subtree/git-subtree.txt). + +When I began investigating different ways to manage dependencies I had a few small goals or concepts I wanted to follow. + +### Keep it simple +I have always been drawn to the simplicity of Go and the tools that surround it. 
+I didn't want to add a lot of overhead or complexity into my work flow when programming in Go. + +### Vendor dependencies +I decided right away that I wanted to vendor my dependencies, that is, where all of my dependencies +live under a top level `vendor/` directory in each repository. + +This also means that I wanted to use the `GO15VENDOREXPERIMENT="1"` flag. + +### Maintain the full source code of each dependency in each repository +The idea here is that each project will maintain the source code for each of its dependencies +instead of having a dependency manifest file, like `package.json` or `Godeps.json`, to manage the dependencies. + +This was more of an acceptance than a decision. It wasn't a hard requirement that +each repository maintains the full source code for each of its dependencies, but +I was willing to accept that as a by product of a good work flow. + +## In come git-subtree +When researching methods of managing dependencies with `git`, I came across a great article +from Atlassian, [The power of Git subtree](https://developer.atlassian.com/blog/2015/05/the-power-of-git-subtree/). +Which outlined how to use `git-subtree` for managing repository dependencies... exactly what I was looking for! + +The main idea with `git-subtree` is that it is able to fetch a full repository and place +it inside of your repository. However, it differs from `git-submodule` because it does not +create a link/reference to a remote repository, instead it will fetch all the files from that +remote repository and place them under a directory in your repository and then treats them as +though they are part of your repository (there is no additional `.git` directory). + +If you pair `git-subtree` with its `--squash` option, it will squash the remote repository +down to a single commit before pulling it into your repository. + +As well, `git-subtree` has ability to issue a `pull` to update a child repository. + +Lets just take a look at how using `git-subtree` would work. + +### Adding a new dependency +We want to add a new dependency, [github.com/miekg/dns](https://github.com/miekg/dns) +to our project. + +``` +git subtree add --prefix vendor/github.com/miekg/dns https://github.com/miekg/dns.git master --squash +``` + +This command will pull in the full repository for `github.com/miekg/dns` at `master` to `vendor/github.com/miekg/dns`. + +And that is it, `git-subtree` will have created two commits for you, one for the squash of `github.com/miekg/dns` +and another for adding it as a child repository. + +### Updating an existing dependency +If you want to then update `github.com/miekg/dns` you can just run the following: + +``` +git subtree pull --prefix vendor/github.com/miekg/dns https://github.com/miekg/dns.git master --squash +``` + +This command will again pull down the latest version of `master` from `github.com/miekg/dns` (assuming it has changed) +and create two commits for you. + +### Using tags/branches/commits +`git-subtree` also works with tags, branches, or commit hashes. + +Say we want to pull in a specific version of `github.com/brettlangdon/forge` which uses tags to manage versions. 
+ +``` +git subtree add --prefix vendor/github.com/brettlangdon/forge https://github.com/brettlangdon/forge.git v0.1.5 --squash +``` + +And then, if we want to update to a later version, `v0.1.7`, we can just run the following: + +``` +git subtree pull --prefix vendor/github.com/brettlangdon/forge https://github.com/brettlangdon/forge.git v0.1.7 --squash +``` + +## Making it all easier +I really like using `git-subtree`, a lot, but the syntax is a little cumbersome. +The previous article I mentioned from Atlassian ([here](ttps://developer.atlassian.com/blog/2015/05/the-power-of-git-subtree/)) +suggests adding in `git` aliases to make using `git-subtree` easier. + +I decided to take this one step further and write a `git` command, [git-vendor](https://github.com/brettlangdon/git-vendor) +to help manage subtree dependencies. + +I won't go into much details here since it is outlined in the repository as well as at https://brettlangdon.github.io/git-vendor/, +but the project's goal was to make working with `git-subtree` easier for managing Go dependencies. +Mainly, to be able to add subtrees and give them a name, to be able to list all current subtrees, +and to be able to update a subtree by name rather than repo + prefix path. + +Here is a quick preview: + +``` +$ git vendor add forge https://github.com/brettlangdon/forge v0.1.5 +$ git vendor list +forge@v0.1.5: + name: forge + dir: vendor/github.com/brettlangdon/forge + repo: https://github.com/brettlangdon/forge + ref: v0.1.5 + commit: 4c620b835a2617f3af91474875fc7dc84a7ea820 +$ git vendor update forge v0.1.7 +$ git vendor list +forge@v0.1.7: + name: forge + dir: vendor/github.com/brettlangdon/forge + repo: https://github.com/brettlangdon/forge + ref: v0.1.7 + commit: 0b2bf8e484ce01c15b87bbb170b0a18f25b446d9 +``` + +## Why not... +### Godep/<package manager here> +I decided early on that I did not want to "deal" with a package manager unless I had to. +This is not to say that there is anything wrong with [godep](https://github.com/tools/godep) +or any of the other currently available package managers out there, I just wanted to keep +the work flow simple and as close to what Go supports with respect to vendored dependencies +as possible. + +### git-submodule +I have been asked why not `git-submodule`, and I think anyone that has had to work +with `git-submodule` will agree that it isn't really the best option out there. +It isn't as though it cannot get the job done, but the extra work flow needed +when working with them is a bit of a pain. Mostly when working on a project with +multiple contributors, or with contributors who are either not aware that the project +is using submodules or who has never worked with them before. + +### Something else? +This isn't the end of my search, I will always be keeping a look out for new and +different ways to manage my dependencies. However, this is by far my favorite as of yet. +If anyone has any suggestions, please feel free to leave a comment. diff --git a/content/writing/about/my-new-website/index.md b/content/writing/about/my-new-website/index.md new file mode 100644 index 0000000..8b38a61 --- /dev/null +++ b/content/writing/about/my-new-website/index.md @@ -0,0 +1,37 @@ +--- +title: My New Website +author: Brett Langdon +date: 2013-11-16 +template: article.jade +--- + +Why did I redo my website? +What makes it any better? +Why are there old posts that are missing? + +--- + +I just wanted to write a quick post about my new site. 
+Some of you who are not familiar with my site might not notice the difference, +but trust me... it is different and for the better. + +So what has changed? +For starters, I think the new design is a little simpler than the previous, +but more importantly it is not longer in [Wordpress](http://www.wordpress.org). +It is now maintained with [Wintersmith](https://github.com/jnordberg/wintersmith), +which is a static site generator which is built in [node.js](http://nodejs.org/) and +uses[Jade](http://jade-lang.com) templates and [markdown](http://daringfireball.net/projects/markdown/). + +Why is this better? +Well for started I think writing in markdown is a lot easier than using Wordpress. +It means I can use whatever text editor I want (emacs in this case) to write my +articles. As well, I no longer need to have PHP and MySQL setup in order to just +serve up silly static content like blog posts and a few images. +This also means I can keep my blog entirely in [GitHub](http://github.com/). + +So far I am fairly happy with the move to Wintersmith, except having to move all my +current blog posts over to markdown, but I will slowly keep porting some over until +I have them all in markdown. So, please bear with me during the time of transition +as there may be a few posts missing when I initially publish this new site. + +Check out my blog in GitHub, [brett.is](http://github.com/brettlangdon/brett.is.git). diff --git a/content/writing/about/my-python-web-crawler/index.md b/content/writing/about/my-python-web-crawler/index.md new file mode 100644 index 0000000..48d5688 --- /dev/null +++ b/content/writing/about/my-python-web-crawler/index.md @@ -0,0 +1,203 @@ +--- +title: My Python Web Crawler +author: Brett Langdon +date: 2012-09-09 +template: article.jade +--- + +How to write a very simplistic Web Crawler in Python for fun. + +--- + +Recently I decided to take on a new project, a Python based +web crawler +that I am dubbing Breakdown. Why? I have always been interested in web crawlers +and have written a few in the past, one previously in Python and another before +that as a class project in C++. So what makes this project different? +For starters I want to try and store and expose different information about the +web pages it is visiting. Instead of trying to analyze web pages and develop a +ranking system (like +PageRank) +that allows people to easily search for pages based on keywords, I instead want to +just store the information that is used to make those decisions and allow people +to use them how they wish. + +For example, I want to provide an API for people to be able to search for specific +web pages. If the page is found in the system, it will return back an easy to use +data structure that contain the pages +meta data, +keyword histogram, list of links to other pages and more. + +## Overview of Web Crawlers + +What is a web crawler? We can start with the simplest definition of a web crawler. +It is a program that, starting from a single web page, moves from web page to web +page by only using urls that are given in each page, starting with only those +provided in the original page. This is how search engines like +Google, +Bing and +Yahoo +obtain the content they need for their search sites. + +But a web crawler is not just about moving from site to site (even though this +can be fun to watch). 
Most web crawlers have a higher purpose, like (in the case +of search engines) to rank the relativity of a web page based on the content +provided within the pages content and html meta data to allow people easier +searching of content on the internet. Other web crawlers are used for more +invasive purposes like to obtain e-mail addresses to use for marketing or spam. + +So what goes into making a web crawler? A web crawler, again, is not just about +moving from place to place how ever it feels. Web sites can actually dictate how +web crawlers access the content on their sites and how they should move around on +their site. This information is provided in the +robots.txt +file that can be found on most websites +(here is wikipedia’s). +A rookie mistaken when building a web crawler is to ignore this file. These +robots.txt files are provided as a set of guidelines and rules that web crawlers +must adhere by for a given domain, otherwise you are liable to get your IP and/or +User Agent banned. Robots.txt files tell crawlers which pages or directories to +ignore or even which ones they should consider. + +Along with ensuring that you follow along with robots.txt please be sure to +provide a useful and unique +User Agent. +This is so that sites can identify that you are a robot and not a human. +For example, if you see a User Agent of *“breakdown”* on your website, hi, it’s me. +Do not use know User Agents like: +*“Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.19 (KHTML, like Gecko) Ubuntu/12.04 Chromium/18.0.1025.168 Chrome/18.0.1025.168 Safari/535.19″*, +this is, again, an easy way for you to get your IP address banned on many sites. + +Lastly, it is important to consider adding in rate limiting to your crawler. It is +wonderful to be able to crawl websites and between them very quickly (no one likes +to wait for results), but this is another sure fire way of getting your IP banned +by websites. Net admins do not like bots to tie up all of their networks +resources, making it difficult for actual users to use their site. + + +## Prototype of Web Crawler + +So this afternoon I decided to take around an hour or so and prototype out the +code to crawl from page to page extracting links and storing them in the database. +All this code does at the moment is download the content of a url, parse out all +of the urls, find the new urls that it has not seen before, append them to a queue +for further processing and also inserting them into the database.This process has +2 queues and 2 different thread types for processing each link. + +There are two different types of processes within this module, the first is a +Grabber, which is used to take a single url from a queue and download the text +content of that url using the +Requests +Python module. It then passes the content along to a queue that the Parser uses +to get new content to process. The Parser takes the content from the queue that +has been retrieved from the Grabber process and simply parses out all the links +contained within the sites html content. It then checks MongoDB to see if that +url has been retrieved already or not, if not, it will append the new url to the +queue that the Grabber uses to retrieve new content and also inserts this url +into the database. + +The unique thing about using multiple threads per process (X for Grabbers and Y +for Parsers) as well as having two different queues to share information between +the two allows this crawler to be self sufficient once it gets started with a +single url. 
The Grabbers help feed the queue that the Parsers work off of and the +Parsers feed the queue that the Grabbers work from. + +For now, this is all that my prototype does, it only stores links and crawls from +site to site looking for more links. What I have left to do is expand upon the +Parser to parse out more information from the html including things like meta +data, page title, keywords, etc, as well as to incorporate +robots.txt into the +processing (to keep from getting banned) and automated rate limiting +(right now I have a 3 second pause between each web request). + + +## How Did I Do It? + +So I assume at this point you want to see some code? The code it not up on +GitHub just yet, I have it hosted on my own private git repo for now and will +gladly open source the code once I have a better prototype. + +Lets just take a very quick look at how I am sharing code between the different +threads. + +### Parser.py +```python +import threading +class Thread(threading.Thread): + def __init__(self, content_queue, url_queue): + self.c_queue = content_queue + self.u_queue = url_queue + super(Thread, self).__init__() + def run(self): + while True: + data = self.c_queue.get() + #process data + for link in links: + self.u_queue.put(link) + self.c_queue.task_done() +``` + +### Grabber.py +```python +import threading +class Thread(threading.Thread): + def __init__(self, url_queue, content_queue): + self.c_queue = content_queue + self.u_queue = url_queue + super(Thread, self).__init__() + def run(self): + while True: + next_url = self.u_queue.get() + #data = requests.get(next_url) + while self.c_queue.full(): + pass + self.c_queue.put(data) + self.u_queue.task_done() +``` + +### Breakdown +```python +from breakdown import Parser, Grabber +from Queue import Queue + +num_threads = 4 +max_size = 1000 +url_queue = Queue() +content_queue = Queue(maxsize=max_size) + +parsers = [Parser.Thread(content_queue, url_queue) for i in xrange(num_threads)] +grabbers = [Grabber.Thread(url_queue, content_queue) for i in xrange(num_threads)] + +for thread in parsers+grabbers: + thread.daemon = True + thread.start() + +url_queue.put('http://brett.is/') +``` + +Lets talk about this process quick. The Breakdown code is provided as a binary +script to start the crawler. It creates “num_threads” threads for each process +(Grabber and Parser). It starts each thread and then appends the starting point +for the crawler, http://brett.is/. One of the Grabber threads will then pick up on +the single url, make a web request to get the content of that url and append it +to “content_queue”. Then one of the Parser threads will pick up on the content +data from “content_queue”, it will process the data from the web page html, +parsing out all of the links and then appending those links onto “url_queue”. This +will then allow the other Grabber threads an opportunity to make new web requests +to get more content to pass to the Parsers threads. This will continue on and on +until there are no links left (hopefully never). + + +## My Results + +I ran this script for a few minutes, maybe 10-15, and I ended up with over 11,000 +links ranging from my domain, +pandora, +twitter, +linkedin, +github, +sony, +and many many more. Now that I have a decent base prototype I can continue forward +and expand upon the processing and logic that goes into each web request. + +Look forward to more posts about this in the future. 
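As a rough sketch of the robots.txt handling and rate limiting mentioned above, here is how a Grabber could check permissions with Python's standard `robotparser` module before fetching a url (a hypothetical helper, not part of the Breakdown code; a real crawler would also cache the parsed robots.txt per domain instead of re-fetching it every time):

```python
import time
import urlparse
import robotparser  # Python 2 standard library, matching the Queue/xrange usage above

import requests


def allowed(url, user_agent="breakdown"):
    """Fetch and parse the site's robots.txt to see if this url may be crawled."""
    rp = robotparser.RobotFileParser()
    rp.set_url(urlparse.urljoin(url, "/robots.txt"))
    rp.read()
    return rp.can_fetch(user_agent, url)


# Hypothetical use inside a Grabber thread's run loop:
#     if allowed(next_url):
#         data = requests.get(next_url, headers={"User-Agent": "breakdown"})
#     time.sleep(3)  # crude rate limiting: pause between requests
```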
diff --git a/content/writing/about/os-x-battery-percentage-command-line/index.md b/content/writing/about/os-x-battery-percentage-command-line/index.md new file mode 100644 index 0000000..d516313 --- /dev/null +++ b/content/writing/about/os-x-battery-percentage-command-line/index.md @@ -0,0 +1,31 @@ +--- +title: OS X Battery Percentage Command Line +author: Brett Langdon +date: 2012-03-18 +template: article.jade +--- + +Quick and easy utility to get OS X battery usage from the command line. + +--- + +Recently I learned how to enable full screen console mode for OS X but the first +issue I ran into was trying to determine how far gone the battery in my laptop was. +Yes of course I could use the fancy little button on the side that lights up and +shows me but that would be way too easy for a programmer, so of course instead I +wrote this scripts. The script will gather the battery current and max capacity +and simply divide them to give you a percentage of battery life left. + +Just create this script, I named mine “battery”, make sure to enable execution +“chmod +x battery” and I moved mine into “/usr/sbin/”. Then to use simply run the +command “battery” and you’ll get an output similar to “3.900%” +(yes as of the writing of this my battery needs a charging). + +```bash +#!/bin/bash +current=`ioreg -l | grep CurrentCapacity | awk ‘{print %5}’` +max=`ioreg -l | grep MaxCapacity| awk ‘{print %5}’` +echo `echo “scale=3;$current/$max*100″|bc -l`’%’ +``` + +Enjoy! diff --git a/content/writing/about/pharos-popup-on-osx-lion/index.md b/content/writing/about/pharos-popup-on-osx-lion/index.md new file mode 100644 index 0000000..2aaec2c --- /dev/null +++ b/content/writing/about/pharos-popup-on-osx-lion/index.md @@ -0,0 +1,46 @@ +--- +title: Pharos Popup on OSX Lion +author: Brett Langdon +date: 2012-01-28 +template: article.jade +--- + +Fixing Pharos Popup app on OS X Lion. + +--- + +My University uses +Pharos +print servers to manage a few printers on campus and we were running into an +issue of the Pharos popup and notify applications not working properly with OSX +Lion. As I work for the Apple technician on campus I was tasked with finding out +why. The popup installation was setting up the applications to run on startup just +fine, the postflight script was invoking the Popup.app, the drivers we were using +worked perfectly when we mapped the printer by IP but what was going on? Through +some further examination the two applications were in fact not being properly +started either after install or on boot. + +I managed to find a work around that caused the applications to run. I manually +ran each of them through command line (as through Finder resulted in failure) and +magically they worked as expected and now whenever my machine starts up they start +on boot without having to manually run them, even if I uninstall the applications +and reinstall them I not longer have to manually run them… but why? + +```bash +voltaire:~ brett$ open /Library/Application\ Support/Pharos/Popup.app +voltaire:~ brett$ open /Library/Application\ Support/Pharos/Notify.app +voltaire:~ brett$ ps aux | grep Pharos +brett 600 0.0 0.1 655276 3984 ?? S 2:55PM 0:00.10 /Library/Application Support/Pharos/Popup.app/Contents/MacOS/Popup -psn_0_237626 +brett 543 0.0 0.1 655156 3652 ?? 
S 2:45PM 0:00.08 /Library/Application Support/Pharos/Notify.app/Contents/MacOS/Notify -psn_0_233529 +brett 608 0.0 0.0 2434892 436 s001 R+ 2:56PM 0:00.00 grep Pharos +``` + +I am still not 100% sure why this work around worked, especially when the +postflight script included with the Popup package is set to run Popup.app after +installation. The only explanation I can come up with is OSX keeps a library of +all of the “trusted” applications, you know that popup that asks you if you want +to run a program that was downloaded from the internet, and the Popup.app and +Notify.app are not being properly added to the list, unless run manually. + +I am still looking into a solution that can be packaged with the Popup package and +will post more information here when I find out more. diff --git a/content/writing/about/php-stop-malicious-image-uploads/index.md b/content/writing/about/php-stop-malicious-image-uploads/index.md new file mode 100644 index 0000000..1a71056 --- /dev/null +++ b/content/writing/about/php-stop-malicious-image-uploads/index.md @@ -0,0 +1,77 @@ +--- +title: PHP - Stop Malicious Image Uploads +author: Brett Langdon +date: 2012-02-01 +template: article.jade +--- + +Quick and easy trick for detecting and stopping malicious image uploads to PHP. + +--- + +Recently I have been practicing for the upcoming NECCDC competition and have +come across a few issues that will need to be overcome, including how to stop +malicious image uploads. + +I was reading +this +article on +Acunetix.com +about the threats of having upload forms in PHP. + +The general idea behind this exploit for Apache and PHP is when a user can +upload an image whose content contains PHP code and the extension includes +‘php’ for example an image ‘new-house.php.jpg’ that contains: + +``` +... (image contents) + +... (image contents) +``` + +When uploaded and then viewed Apache, if improperly setup, will process the +image as PHP, because of the ‘.php’ in the extension and then when accessed +will execute malicious code on your server. + +## My Solution + +I was trying to find a good way to remove this issue quickly without opening +more security holes. I have seen some solutions that use the function +getimagesize +to try and determine if the file is an image, but if the malicious code is +injected into the middle of an actual image this function will still return +the actual image size and the file will validate as an image. The solution I +came up with is to explicitly convert each uploaded image to a jpeg using +imagecreatefromjpeg +and +imagejpeg +functions. + +```php +:jobs" + # and push it to ":jobs" + job_id = client.brpoplpush(all_queue, processing_queue) + if not job_id: + continue + # fetch the job data + job_data = client.hgetall("job:%s" % (job_id, )) + # process the job + process(job_id, job_data) + # cleanup the job information from redis + client.delete("job:%s" % (job_id, )) + client.lrem(process_queue, 1, job_id) + + +if __name__ == "__main__": + import socket + import os + + client = redis.StrictRedis() + try: + main(client, "processing:jobs", "all:jobs") + except KeyboardInterrupt: + pass +``` + +The above script does the following: +1. Try to fetch a job from the queue `all:jobs` pushing it to `processing:jobs` +2. Fetch the job data from a [hash](http://redis.io/commands#hash) key with the name `job:` +3. Process the job information +4. Remove the hash key `job:` +5. 
Remove the job id from the queue `processing:jobs` + +With this design we will always be able to determine how many jobs are currently queued for process +by looking at the list `all:jobs` and we will also know exactly how many jobs are being processed +by looking at the list `processing:jobs` which contains the list of job ids that all workers are +working on. + +Also we are not tied down to running just 1 worker on 1 machine. With this design we can run multiple +worker processes on as many nodes as we want. As long as they all have access to the same Redis server. +There are a few limitations which are all seeded in Redis' [limits on lists](http://redis.io/topics/data-types), +but this should be good enough to get started. + +There are a few other approaches that can be taken here as well. Instead of using a single processing queue +we could use a separate queue for each worker. Then we can look at which jobs are currently being processed +by each individual worker, this approach would also give us the opportunity to have the workers try to fetch +from the worker specific queue first before looking at `all:jobs` so we can either assign jobs to specific +workers or where the worker can recover from failed processing by starting with the last job it was working +on before failing. + +## qw +I have developed the library [qw](https://github.com/brettlangdon/qw) or (QueueWorker) to implement a similar +pattern to this, so if you are interested in playing around with this or to see a more developed implementation +please checkout the projects [github page](https://github.com/brettlangdon/qw) for more information. diff --git a/content/writing/about/sharing-data-from-php-to-javascript/index.md b/content/writing/about/sharing-data-from-php-to-javascript/index.md new file mode 100644 index 0000000..3a091e9 --- /dev/null +++ b/content/writing/about/sharing-data-from-php-to-javascript/index.md @@ -0,0 +1,87 @@ +--- +title: Sharing Data from PHP to JavaScript +author: Brett Langdon +date: 2014-03-16 +template: article.jade +--- + +A quick example of how I decided to share dynamic content from PHP with my JavaScript. + +--- + +So the other day I was refactoring some of the client side code I was working on and +came across something like the following: + +### page.php +```php + +... + + + +``` + +There isn't really anything wrong here, in fact this seems to be a fairly common practice +(from the little research I did). So... whats the big deal? Why write an article about it? + +My issue with the above is, what if the JavaScript gets fairly large (as mine was). The +ideal thing to do is to move the js into it's own file, minify/compress it and serve it +from a CDN so it doesn't effect page load time. But, now we have content that needs to be +added dynamically from the PHP script in order for the js to run. How do we solve it? The +approach that I took, which probably isn't original at all, but I think neat enough to +share, was to let PHP make the data available to the script through `window.data`. + +### page.php +```php + +... + $myModel->getTitle(), +); +?> + + + +``` + +### page-script.js +```javascript +// window.data.modelTitle is available for me to use +console.log("My Model Title: " + window.data.modelTitle); +``` + +Nothing really fancy, shocking, new or different here, just passing data from PHP to js. +Something to note is that we have to have our PHP code set `window.data` before we load +our external script so that `window.data` will be available when the script loads. 
Which +this shouldn't be too much of an issue since most web developers are used to putting all +of their `script` tags at the end of the page. + +Some might wonder why I decided to use `window.data`, why not just set +`var modelTitle = "getTitle()?>";`? I think it is better to try and have a +convention for where the data from the page will come from. Having to rely on a bunch of +global variables being set isn't really a safe way to write this. What if you overwrite +an existing variable or if some other script overwrites your data from the PHP script? +This is still a cause for concern with `window.data`, but at least you only have to keep +track of a single variable. As well, I think organizationally it is easier and more concise +to have `window.data = ;` as opposed to: + +```php +var modelTitle = "getTitle()?>"; +var modelId = "getId()?>"; +var username = ""; +... +``` + +I am sure there are other ways to do this sort of thing, like with AJAX or having an +initialization function that PHP calls with the correct variables it needs to pass, etc. +This was just what I came up with and the approach I decided to take. + +If anyone has other methods of sharing dynamic content between PHP and js, please leave a +comment and let me know, I am curious as to what most other devs are doing to handle this. diff --git a/content/writing/about/the-battle-of-the-caches/index.md b/content/writing/about/the-battle-of-the-caches/index.md new file mode 100644 index 0000000..d4fb1ad --- /dev/null +++ b/content/writing/about/the-battle-of-the-caches/index.md @@ -0,0 +1,95 @@ +--- +title: The Battle of the Caches +author: Brett Langdon +date: 2013-08-01 +template: article.jade +--- + +A co-worker and I set out to each build our own http proxy cache. +One of them was written in Go and the other as a C++ plugin for +Kyoto Tycoon. + +--- + +So, I know what most people are thinking: “Not another cache benchmark post, +with skewed or biased results.” But luckily that is not what this post is about; +there are no opinionated graphs showing that my favorite caching system happens +to be better than all the other ones. Instead, this post is about why at work we +decided to write our own API caching system rather than use Varnish +(a tested, tried and true HTTP caching system). + +Let us discuss the problem we have to solve. The system we have is a simple +request/response HTTP server that needs to have very low latency (a few +milliseconds, usually 2-3 on average) and we are adding a third-party HTTP API +call to almost every request that we see. I am sure some people see the issue +right away, any network call is going to add at least a half to a whole millisecond +to your processing time and that is if the two servers are in the same datacenter, +more if they are not. That is just network traffic, now we must rely on the +performance of the third-party API, hoping that they are able to maintain a +consistent response time under heavy load. If, in total, this third-party API call +is adding more than 2 milliseconds response time to each request that our system +is processing then that greatly reduces the capacity of our system. + +THE SOLUTION! Lets use Varnish. This is the logical solution, lets put a caching +system in front of the API. The content we are requesting isn’t changing very often +(every few days, if that) and it can help speed up the added latency from the API +call. 
So, we tried this but had very little luck; no matter what we tried we could +not get Varnish to respond in under 2 milliseconds per request (which is a main +requirement of solution we were looking for). That means Varnish is out, the next +solution is to write our own caching system. + +Now, before people start flooding the comments calling me a troll or yelling at me +for not trying this or that or some other thing, let me try to explain really why +we decided to write our own cache rather than spend extra days investing time into +Varnish or some other known HTTP cache. We have a fairly specific requirement from +our cache, very low and consistent latency. “Consistent” is the key word that really +matters to us. We decided fairly early on that getting no response on a cache miss +is better for our application than blocking and waiting for a response from the +proxy call. This is a very odd requirement and most HTTP caching systems do not +support it since it almost defeats their purpose (be “slow” 1-2 times so you can be +fast all the other times). As well, HTTP is not a requirement for us, that is, +from the cache to the API server HTTP must be used, but it is not a requirement +that our application calls to the cache using HTTP. Headers add extra bandwidth +and processing that are not required for our application. + +So we decided that our ideal cache would have 3 main requirements: +1. Must have a consistent response time, returning nothing early over waiting for a proper response +2. Support the Memcached Protocol +3. Support TTLs on the cached data + +This behavior works basically like so: Call to cache, if it is a cache miss, +return an empty response and queue the request to a background process to make the +call to the API server, every identical request coming in (until the proxy call +returns a result) will receive an empty response but not add the request to the +queue. As soon as the proxy call returns, update the cache and every identical call +coming in will yield the proper response. After a given TTL consider the data in +the cache to be old and re-fetch. + +This was then seen as a challenge between a co-worker, +Dan Crosta, and myself to see who +can write the better/faster caching system with these requirements. His solution, +entitled “CacheOrBust”, was a +Kyoto Tycoon plugin +written in C++ which simply used a subset of the memcached protocol as well as some +background workers and a request queue to perform the fetching. My solution, +Ferrite, is a +custom server written in Go +(originally written in C) that has the same functionality (except using +goroutines +rather than background workers and a queue). Both servers used +Kyoto Cabinet +as the underlying caching data structure. + +So… results already! As with most fairly competitive competitions it is always a +sad day when there is a tie. Thats right, two similar solutions, written in two +different programming languages yielded similar results (we probably have +Kyoto Cabinet to thank). Both of our caching systems were able to yield us the +results we wanted, **consistent** sub-millisecond response times, averaging about +.5-.6 millisecond responses (different physical servers, but same datacenter), +regardless of whether the response was a cache hit or a cache miss. Usually the +morale of the story is: “do not re-invent the wheel, use something that already +exists that does what you want,” but realistically sometimes this isn’t an option. 
+Sometimes you have to bend the rules a little to get exactly what your application +needs, especially when dealing with low latency systems, every millisecond counts. +Just be smart about the decisions you make and make sure you have sound +justification for them, especially if you decide to build it yourself. diff --git a/content/writing/about/third-party-tracking-pixels/index.md b/content/writing/about/third-party-tracking-pixels/index.md new file mode 100644 index 0000000..02ee0c9 --- /dev/null +++ b/content/writing/about/third-party-tracking-pixels/index.md @@ -0,0 +1,352 @@ +--- +title: Third Party Tracking Pixels +author: Brett Langdon +date: 2013-05-03 +template: article.jade +--- + +An overview of what a third party tracking pixel is and how to create/use them. + +--- + +So, what exactly do we mean by “third party tracking pixel” anyways? +Lets try to break it down piece by piece: + +### Tracking Pixel: +A pixel referes to a tag that is placed on a site that offers no merit other than +calling out to a web page or script that is not the current page you are visiting. +These pixels are usually an html script tag that point to a javascript file with +no content or an img tag with a empty or transparent 1 pixel by 1 pixel gif image +(hence the term “pixel”). A tracking pixel is the term used to describe a pixel +that calls to another page or script in order to provide it information about the +users visit to the page. + +### Third Party: +Third party just means the pixel points to a website that is not the current +website. For example, +Google Analytics +is a third party tracking tool because you place scripts on your website +that calls and sends data to Google. + + +## What is the point? + +Why do people do this? In the case of Google Analytics people do not wish to track +and follow their own analytics for their website, instead they want a third party +host to do it for them, but they need a way of sending their user’s data to Google. +Using pixels and javascript to send the data to Google offers the company a few +benefits. For starters, they do not require any more overhead on their servers for +a service to send data directly to Google, instead by using pixels and scripts they +get to off load this overhead onto their users (thats right, we are using our +personal computers resources to send analytical data about ourselves to Google for +websites that use Google analytics). Secondly, the benefit of using a tracking +pixel that runs client side (in the user’s browser) we are allowed to gather more +information about the user. The information that is made available to us through +the use of javascript is far greater than what is given to our servers via +HTTP Headers. + + +## How do we do it? + +Next we will walk through the basics of how to create third party tracking pixels. +Code examples for the following discussion can be found +here. +We will walk through four examples of tracking pixels accompanied by the server +code needed to serve and receive the pixels. The server is written in +Python and some basic +understanding of Python is required to follow along. The server examples are +written using only standard Python wsgi modules, so no extra installation is +needed. We will start off with a very simple example of using a tracking pixel and +then each example afterwards we will begin to add features to the pixel. 
## Simple Example

For this example all we want is a web server that returns HTML containing our
tracking pixel, along with a handler to receive the call from that pixel. Our end
goal is to serve this HTML content:

```html
<html>
  <head>
    <script type="text/javascript" src="/track.js"></script>
  </head>
  <body>
    Welcome
  </body>
</html>
```

As you can see, this is fairly simple HTML; the important part is the script tag
pointing to “/track.js”, which is our tracking pixel. When the user’s browser loads
the page, this script makes a call to our server, and our server can then log
information about that user. So we start with a WSGI handler for the HTML content:

```python
def html_content(environ, respond):
    headers = [('Content-Type', 'text/html')]
    respond('200 OK', headers)
    return [
        """
        <html>
          <head>
            <script type="text/javascript" src="/track.js"></script>
          </head>
          <body>
            Welcome
          </body>
        </html>
+ + """ + ] +``` + +Next we want to make sure that we have a handler for the calls to “/track.js” +from the script tag: + +```python +def track_user(environ, respond): + headers = [('Content-Type', 'application/javascript')] + respond('200 OK', headers) + prefixes = ['PATH_', 'HTTP', 'REQUEST', 'QUERY'] + for key, value in environ.iteritems(): + if any(key.startswith(prefix) for prefix in prefixes): + print '%s: %s' % (key, value) + return [''] +``` + +In this handler we are taking various information about the request from the user +and simply printing it to the screen. The end point “/track.js” is not meant to +point to actual javascript so instead we return back an empty string. When this +code runs you should see something like the following: + +``` +brett$ python tracking_server.py +Tracking Server Listening on Port 8000... +1.0.0.127.in-addr.arpa - - [24/Apr/2013 20:03:21] "GET / HTTP/1.1" 200 89 +HTTP_REFERER: http://localhost:8000/ +REQUEST_METHOD: GET +QUERY_STRING: +HTTP_ACCEPT_CHARSET: ISO-8859-1,utf-8;q=0.7,*;q=0.3 +HTTP_CONNECTION: keep-alive +PATH_INFO: /track.js +HTTP_HOST: localhost:8000 +HTTP_ACCEPT: */* +HTTP_USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31 +HTTP_ACCEPT_LANGUAGE: en-US,en;q=0.8 +HTTP_DNT: 1 +HTTP_ACCEPT_ENCODING: gzip,deflate,sdch +1.0.0.127.in-addr.arpa - - [24/Apr/2013 20:03:21] "GET /track.js HTTP/1.1" 200 0 +1.0.0.127.in-addr.arpa - - [24/Apr/2013 20:03:21] "GET /favicon.ico HTTP/1.1" 204 0 +``` + +You can see in the above that first the browser makes the request “GET /” which +returns our HTML containing the tracking pixel, then directly afterwards makes a +request for “GET /track.js” which prints out various information about the incoming +request. This example is not very useful as is, but helps to illustrate the key +point of a tracking pixel. We are having the browser make a request on behalf of +the user without the user’s knowledge. In this case we are making a call back to +our own server, but our script tag could easily point to a third party server. + + +## Add Some Search Data + +Our previous, simple, example does not really provide us with any particularly +useful information other than allow us to track that a user’s browser made the +call to our server. For this next example we want to build upon the previous by +sending some data along with the tracking pixel; in this case, some search data. +Let us make an assumption that our web page allows users to make searches; searches +are given to the page through a url query string parameter “search”. We want to +pass that query string parameter on to our tracking pixel, which we will use the +query string parameter “s”. So our requests will look as follows: + +* http://localhost:8000?search=my cool search +* http://localhost:8000/track.js?s=my cool search + +To do this, we simply append the query string parameter “search” onto our track.js +script tag in our HTML: + +```python +def html_content(environ, respond): + query = parse_qs(environ['QUERY_STRING']) + search = quote(query.get('search', [''])[0]) + headers = [('Content-Type', 'text/html')] + respond('200 OK', headers) + return [ + """ + +

        <html>
          <head>
            <script type="text/javascript" src="/track.js?s=%s"></script>
          </head>
          <body>
            Welcome
          </body>
        </html>

+ + """ % search + ] +``` + +For our tracking pixel handler we will simply print the value of the query string +parameter “s” and again return an empty string. + +```python +def track_user(environ, respond): + query = parse_qs(environ['QUERY_STRING']) + search = query.get('s', [''])[0] + print 'User Searched For: %s' % search + headers = [('Content-Type', 'application/javascript')] + respond('200 OK', headers) + return [''] +``` + +When run the output will look similar to: + +``` +brett$ python tracking_server.py +Tracking Server Listening on Port 8000... +1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:24] "GET /?search=my%20cool%20search HTTP/1.1" 200 110 +User Searched For: my cool search +1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:24] "GET /track.js?s=my%20cool%20search HTTP/1.1" 200 0 +1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:24] "GET /favicon.ico HTTP/1.1" 204 0 +1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:34] "GET /?search=another%20search HTTP/1.1" 200 108 +User Searched For: another search +1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:34] "GET /track.js?s=another%20search HTTP/1.1" 200 0 +1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:34] "GET /favicon.ico HTTP/1.1" 204 0 +``` + +Here we can see the two search requests made to our web page and the similar +resulting requests to track.js. Again, this example might not seem like much but +it proves a way of being able to pass values from our web page along with to the +tracking server. In this case we are passing search terms, but we could also pass +any other information along we needed. + + +## Track User’s with Cookies + +So now we are getting somewhere, our tracking server is able to receive some +search data about the requests made to our web page. The problem now is we have +no way of associating this information with a specific user; how can we know when +a specific user searches for multiple things. Cookies to the rescue. In this +example we are going to add the support of using cookies to assign each visiting +user a specific and unique id, this will allow us to associate all the search data +we receive with “specific” users. Yes, I say “specific” with quotes because we can +only associate the data with a given cookie, if multiple people share a computer +then we will probably think they are a single person. As well, if someone clears +the cookies for their browser then we lose all association with that user and have +to start all over again with a new cookie. Lastly, if a user does not allow cookies +for their browser then we will be unable to associate any data with them as every +time they visit our tracking server we will see them as a new user. So, how do we +do this? When receive a request from a user we want to look and see if we have +given them a cookie with a user id, if so then we will associate the incoming data +with that user id and if there is no user cookie then we will generate a new user +id and give it to the user. + +```python +def track_user(environ, respond): + cookies = SimpleCookie() + cookies.load(environ.get('HTTP_COOKIE', '')) + + user_id = cookies.get('id') + if not user_id: + user_id = uuid4() + print 'User did not have id, giving: %s' % user_id + + query = parse_qs(environ['QUERY_STRING']) + search = query.get('s', [''])[0] + print 'User %s Searched For: %s' % (user_id, search) + headers = [ + ('Content-Type', 'application/javascript'), + ('Set-Cookie', 'id=%s' % user_id) + ] + respond('200 OK', headers) + return [''] +``` + +This is great! 
Not only can we now obtain search data from a third party website, but we can also
do our best to associate that data with a given user. In this instance a single
user is anyone who shares the same user id in their browser’s cookies.


## Cache Busting

So what exactly is cache busting? Our browsers are smart: they know that we do not
like to wait a long time for a web page to load, and they have learned that they
do not need to refetch content they have seen before if they cache it. For example,
an image on a web site might get cached by your web browser so that every time you
reload the page the image can be loaded locally instead of being fetched from the
remote server. Cache busting is a way to ensure that the browser does not cache the
content of our tracking pixel. We want the user’s browser to follow the tracking
pixel to our server for every page request they make, because we want to follow
everything that that user does. When the browser caches our tracking pixel’s
content (an empty string) we lose out on data. Cache busting is the term used when
we programmatically generate query string parameters to make calls to our tracking
pixel look unique, which ensures that the browser follows the pixel rather than
loading it from its cache. To do this we need to add an extra end point to our
server. We need the HTML for the web page, along with a cache busting script and
finally our track.js handler. The cache busting script uses javascript to add our
track.js script tag to the web page. This means that after the web page is loaded,
javascript runs to manipulate the
DOM
and add our cache busted track.js script tag to the HTML. So, what does this
look like?

```javascript
var now = new Date().getTime();
var random = Math.random() * 99999999999;
document.write('<script type="text/javascript" src="/track.js?r=' + random + '&t=' + now + '"></script>');
```

This script adds the extra query string parameters “r”, which is a random number,
and “t”, which is the current timestamp in milliseconds. This gives us a unique
enough request to trick our browsers into ignoring anything they have in their
cache for track.js and forces them to make the request anyways. Using a cache
buster requires us to modify the html we serve slightly to serve up the cache
busting javascript as opposed to our track.js pixel.

```html
<html>
  <head>
    <script type="text/javascript" src="/buster.js"></script>
  </head>
  <body>
    Welcome
  </body>
</html>
```

And we need the following to serve up the cache buster script buster.js:

```python
def cache_buster(environ, respond):
    headers = [('Content-Type', 'application/javascript')]
    respond('200 OK', headers)
    cb_js = """
    function getParameterByName(name){
        name = name.replace(/[\[]/, "\\\[").replace(/[\]]/, "\\\]");
        var regexS = "[\\?&]" + name + "=([^&#]*)";
        var regex = new RegExp(regexS);
        var results = regex.exec(window.location.search);
        if(results == null){
            return "";
        }
        return decodeURIComponent(results[1].replace(/\+/g, " "));
    }

    var now = new Date().getTime();
    var random = Math.random() * 99999999999;
    var search = getParameterByName('search');
    document.write('<script type="text/javascript" src="/track.js?s=' + search + '&r=' + random + '&t=' + now + '"></script>');
    """
    return [cb_js]
```

We do not care very much if the browser caches our cache buster script because it
will always generate a new unique track.js url every time it is run.


## Conclusion

There is a lot going on here and probably a lot to digest, so let’s quickly review
what we have learned. For starters, we learned that companies use tracking pixels
or tags on web pages whose sole purpose is to make your browser call out to
external third party sites in order to track information about your internet usage
(usually; they can be used for other things as well). We also looked into some very
simplistic ways of implementing a server whose job is to accept tracking pixel
calls in various forms.

We learned that these tracking servers can use cookies stored in your browser to
hold a unique id for you in order to help associate the collected data with you,
and that you can remove this association by clearing your cookies or by not
allowing them at all. Lastly, we learned that browsers can cause issues for our
tracking pixels and data collection and that we can get around them using a cache
busting javascript.

As a reminder, the full working code examples can be found at
[github.com/brettlangdon/tracking-server-examples](https://github.com/brettlangdon/tracking-server-examples).
diff --git a/content/writing/about/what-i'm-up-to-these-days/index.md b/content/writing/about/what-i'm-up-to-these-days/index.md
new file mode 100644
index 0000000..5c94849
--- /dev/null
+++ b/content/writing/about/what-i'm-up-to-these-days/index.md
@@ -0,0 +1,42 @@
---
title: What I'm up to these days
author: Brett Langdon
date: 2015-06-19
template: article.jade
---

It has been a while since I have written anything on my blog. Might as well get
started somewhere, like a brief summary of what I have been working on lately.

---

It has been far too long since I last wrote in this blog. I always have these
aspirations of writing all the time about all the things I am working on. The
problem generally comes back to me not feeling confident enough to write about
anything I am working on. "Oh, a post like that probably already exists", "There
are smarter people than me out there writing about this, why bother". It is an
unfortunate feeling to try and get over.

So, here is where I am making an attempt. I will try to write more; it'll be
healthy for me. I always hear of people setting reminders in their calendars to
block off time to write blog posts, even if they end up only writing a few
sentences, which seems like a great idea that I intend to try.

Ok, enough with the "I haven't been feeling confident" drivel; on to what I
actually have been up to lately.

Since my last post I have a new job. I am now Senior Software Engineer at
[underdog.io](https://underdog.io/).
We are a small early stage startup (4 employees, just +over a year old) that is in the hiring space. For candidates our site basically acts like +a common application to now over 150 venture backed startups in New York City or San +Francisco. In the short time I have been working there, I am very impressed and glad that +I took their offer. I work with some awesome and smart people and I am still learning a +lot, whether it is about coding or just trying to run a business. + +I originally started to end this post by talking about a programming project I have been +working on, but it ended up being 4 times longer than the text above and have decided +instead to write a separate post about it. Apparently even though I have been writing +lately, I have a lot to say. + +Thanks for bearing with this "I have to write something" post. I am not going to make a +promise that I am going to write more, because it is something that could easily fall +through, like it usually does... but I shall give it my all! diff --git a/content/writing/about/why-benchmarking-tools-suck/index.md b/content/writing/about/why-benchmarking-tools-suck/index.md new file mode 100644 index 0000000..aa4413a --- /dev/null +++ b/content/writing/about/why-benchmarking-tools-suck/index.md @@ -0,0 +1,86 @@ +--- +title: Why Benchmarking Tools Suck +author: Brett Langdon +date: 2012-10-22 +template: article.jade +--- + +A brief aside into why I think no benchmarking tool is exactly correct +and why I wrote my own. + +--- + +Benchmarking is (or should be) a fairly important part of most developers job or +duty. To determine the load that the systems that they build can withstand. We are +currently at a point in our development lifecycle at work where load testing is a +fairly high priority. We need to be able to answer questions like, what kind of +load can our servers currently handle as a whole?, what kind of load can a single +server handle?, how much throughput can we gain by adding X more servers?, what +happens when we overload our servers?, what happens when our concurrency doubles? +These are all questions that most have probably been asked at some point in their +career. Luckily enough there is a plethora of HTTP benchmarking tools to help try +to answer these questions. Tools like, +ab, +siege, +beeswithmachineguns, +curl-loader +and one I wrote recently (today), +tommygun. + +Every single one of those tools suck, including the one I wrote (and will +probably keep using/maintaining). Why? Don’t a lot of people use them? Yes, +almost everyone I know has used ab (most of you probably have) and I know a +decent handful of people who use siege, but that does not mean that they are +the most useful for all use cases. In fact they tend to only be useful for a +limited set of testing. Ab is great if you want to test a single web page, but +what if you need to test multiple pages at once? or in a sequence? I’ve also +personally experienced huge performance issues with running ab from a mac. These +scope issues of ab make way for other tools such as siege and curl-loader which +can test multiple pages at a time or in a sequence, but at what cost? Currently at +work we are having issues getting siege to properly parse and test a few hundred +thousand urls, some of which contain binary post data. + +On top of only really having a limited set of use cases, each benchmarking tool +also introduces overhead to the machine that you are benchmarking from. 
Ab might +be able to test your servers faster and with more concurrency than curl-loader +can, but if curl-loader can test your specific use case, which do you use? +Curl-loader can probably benchmark exactly what your trying to test but if it +cannot supply the source load of what you are looking for, then how useful of a +tool is it? What if you need to scale your benchmarking tool? How do you scale +your benchmarking tool? What if you are running the test from the same machine as +your development environment? What kind of effect will running the benchmarking +tool itself have on your application? + +So, what is the solution then? I think instead of trying to develop these command +line tools to fit each scenario we should try to develop a benchmarking framework +with all of the right pieces that we need. For example, develop a platform that +has the functionality to run a given task concurrently but where you supply the +task for it to run. This way the benchmarking tool does not become obsolete and +useless as your application evolves. This will also pave the way for the tool to +be protocol agnostic. Allowing people to write tests easily for HTTP web +applications or even services that do not interpret HTTP, such as message queues +or in memory stores. This framework should also provide a way to scale the tool +to allow more throughput and overload on your system. Lastly, but not least, this +platform should be lightweight and try to introduce as little overhead as +possible, for those who do not have EC2 available to them for testing, or who do +not have spare servers lying around for them to test from. + +I am not saying that up until now load testing has been nothing but a pain and +the tools that we have available to us (for free) are the worst things out there +and should not be trusted. I just feel that they do not and cannot meet every use +case and that I have been plighted by this issue in the past. How can you properly +load test your application if you do not have the right load testing tool for +the job? + +So, I know what some might be thinking, “sounds neat, when will your framework +be ready for me to use?” That is a nice idea, but if the past few months are any +indication of how much free time I have, I might not be able to get anything done +right away (seeing how I was able to write my load testing tool while on vacation). +I am however, more than willing to contribute to anyone else’s attempt at this +framework and I am especially more than willing to help test anyone else’s +framework. + +**Side Note:** If anyone knows of any tool or framework currently that tries to +achieve my “goal” please let me know. I was unable to find any tools out there +that worked as I described or that even got close, but I might not of searched for +the right thing or maybe skipped over the right link, etc. diff --git a/content/writing/about/write-code-every-day/index.md b/content/writing/about/write-code-every-day/index.md new file mode 100644 index 0000000..50c6e51 --- /dev/null +++ b/content/writing/about/write-code-every-day/index.md @@ -0,0 +1,56 @@ +--- +title: Write code every day +author: Brett Langdon +date: 2015-07-02 +template: article.jade +--- + +Just like a poet or an athlete practicing code every day will only make you better. + +--- + +Lately I have been trying to get into blogging more and any article I read always says, "you need to write every day". 
It doesn't matter whether what I write down gets published; forming the habit of
trying to write something every day is what counts. The more I write, the easier
it will become, the more natural it will feel and the better I will get at it.

This isn't just true of writing or blogging, it is something that can be said of
almost anything: riding a bike, playing basketball, reading, cooking. The more you
do it, the easier it will become and the better you will get.

As the title of this post alludes to, this is also true of programming. If you want
to be really good at programming you have to write code every day. The more code
you write, the easier it will be to write and the better you will be at
programming. Just like the other activities I've listed in this article, trying to
write code every day, even if you are used to it, can be really hard to do and a
really hard habit to keep.

"What should I write?" The answer to this question is going to be different for
everyone, but it is the hurdle which you must first overcome to work your way
towards writing code every day. Usually people write code to solve problems that
they have, but not everyone has problems to solve. There is a bit of a
chicken-and-egg problem here: you need to write code to have coding problems, and
you need to have coding problems to have something to write. So, where should you
start?

For myself, one of the things I like doing is to rewrite things that already exist.
Sometimes it can be hard to come up with a new and different idea or even a new
approach to an existing idea. However, there are millions of existing projects out
there to copy. The idea I go for is to try and replicate the overall goal of the
project, but in my own way. That might mean writing it in a different language,
changing the API or just taking some wacky new approach to solving the same issue.

More often than not the above exercise leads me to a problem that I can then go off
and solve. For example, a few weeks ago I sat down and decided I wanted to write a
web server in `go` (think `nginx`/`apache`). I knew going into the project I wanted
a really nice and easy to use configuration file to define the settings. So, I did
what most people do these days and used `json`, but that didn't really feel right
to me. I then tried `yaml`, but that again didn't feel like what I wanted. I
probably could have used the `ini` format and made custom rules for the keys and
values, but again, this is hacky. This spawned a new project to solve the problem I
was having, which ended up being [forge](https://github.com/brettlangdon/forge), a
hand coded configuration file syntax and parser for `go` that is a neat mix between
`json` and `nginx` configuration file syntax.

Anywho, enough of me trying to self promote projects. The main point is that by
trying to replicate something that already exists, without really trying to do
anything new, I came up with an idea which spawned another project and for at least
a week (and continuing now) gave me a reason to write code every day. Not only did
I write something useful that I can now use in any future project of mine, I also
learned something I did not know before. I learned how to hand code a syntax parser
in `go`.

Ultimately, try to take "coding every day" not as a challenge to write something
useful every day, but to learn
Learn part of a new language, a new framework, learn how to take something apart or put +it back together. Write code every day and learn something new every day. The more you do this, the more you will +learn and the better you will become. + +Go forth and happy coding. :) diff --git a/static/css/lato.css b/static/css/lato.css new file mode 100644 index 0000000..493ec68 --- /dev/null +++ b/static/css/lato.css @@ -0,0 +1 @@ +css \ No newline at end of file diff --git a/static/css/site.css b/static/css/site.css new file mode 100644 index 0000000..50683a4 --- /dev/null +++ b/static/css/site.css @@ -0,0 +1,9 @@ +#wrapper, +.profile #wrapper, +#wrapper.home { + max-width: 900px; +} + +a.symbol { + margin-right: 0.7rem; +} diff --git a/static/images/avatar.png b/static/images/avatar.png new file mode 100644 index 0000000..05df014 Binary files /dev/null and b/static/images/avatar.png differ diff --git a/static/images/avatar@2x.png b/static/images/avatar@2x.png new file mode 100644 index 0000000..5766fc9 Binary files /dev/null and b/static/images/avatar@2x.png differ diff --git a/static/images/favicon.ico b/static/images/favicon.ico new file mode 100644 index 0000000..db9aba1 Binary files /dev/null and b/static/images/favicon.ico differ