diff --git a/contents/writing/about/third-party-tracking-pixels/index.md b/contents/writing/about/third-party-tracking-pixels/index.md new file mode 100644 index 0000000..02ee0c9 --- /dev/null +++ b/contents/writing/about/third-party-tracking-pixels/index.md @@ -0,0 +1,352 @@ +--- +title: Third Party Tracking Pixels +author: Brett Langdon +date: 2013-05-03 +template: article.jade +--- + +An overview of what a third party tracking pixel is and how to create and use one. + +--- + +So, what exactly do we mean by “third party tracking pixel” anyway? +Let’s try to break it down piece by piece: + +### Tracking Pixel: +A pixel refers to a tag that is placed on a site and serves no purpose other than +calling out to a web page or script that is not the current page you are visiting. +These pixels are usually an html script tag that points to a javascript file with +no content or an img tag with an empty or transparent 1 pixel by 1 pixel gif image +(hence the term “pixel”). A tracking pixel is the term used to describe a pixel +that calls to another page or script in order to provide it information about the +user’s visit to the page. + +### Third Party: +Third party just means the pixel points to a website that is not the current +website. For example, +Google Analytics +is a third party tracking tool because you place scripts on your website +that call and send data to Google. + + +## What is the point? + +Why do people do this? In the case of Google Analytics, people do not wish to collect +and manage their own analytics for their website; instead they want a third party +host to do it for them, but they need a way of sending their users’ data to Google. +Using pixels and javascript to send the data to Google offers the company a few +benefits.
For starters, they do not require any more overhead on their servers for +a service to send data directly to Google; instead, by using pixels and scripts they +get to offload this overhead onto their users (that’s right, we are using our +personal computers’ resources to send analytical data about ourselves to Google for +websites that use Google Analytics). Secondly, because a tracking +pixel runs client side (in the user’s browser), we are able to gather more +information about the user. The information that is made available to us through +the use of javascript is far greater than what is given to our servers via +HTTP Headers. + + +## How do we do it? + +Next we will walk through the basics of how to create third party tracking pixels. +Code examples for the following discussion can be found +here. +We will walk through four examples of tracking pixels accompanied by the server +code needed to serve and receive the pixels. The server is written in +Python and some basic +understanding of Python is required to follow along. The server examples are +written using only standard Python wsgi modules, so no extra installation is +needed. We will start off with a very simple example of using a tracking pixel and +then with each example afterwards we will add features to the pixel. + +## Simple Example + +For this example all we want to accomplish is to have a web server that returns +HTML containing our tracking pixel as well as a handler to receive the call from +our tracking pixel. Our end goal is to serve this HTML content: + +```html + + + +

Welcome

+ + + +``` + +As you can see, this is fairly simple HTML; the important part is the script tag +pointing to “/track.js”, this is our tracking pixel. When the user’s browser loads +the page this script will make a call to our server, our server can then log +information about that user. So we start with a wsgi handler for the HTML code: + +```python +def html_content(environ, respond): + headers = [('Content-Type', 'text/html')] + respond('200 OK', headers) + return [ + """ + +

Welcome

+ + """ + ] +``` + +Next we want to make sure that we have a handler for the calls to “/track.js” +from the script tag: + +```python +def track_user(environ, respond): + headers = [('Content-Type', 'application/javascript')] + respond('200 OK', headers) + prefixes = ['PATH_', 'HTTP', 'REQUEST', 'QUERY'] + for key, value in environ.iteritems(): + if any(key.startswith(prefix) for prefix in prefixes): + print '%s: %s' % (key, value) + return [''] +``` + +In this handler we are taking various information about the request from the user +and simply printing it to the screen. The end point “/track.js” is not meant to +point to actual javascript so instead we return back an empty string. When this +code runs you should see something like the following: + +``` +brett$ python tracking_server.py +Tracking Server Listening on Port 8000... +1.0.0.127.in-addr.arpa - - [24/Apr/2013 20:03:21] "GET / HTTP/1.1" 200 89 +HTTP_REFERER: http://localhost:8000/ +REQUEST_METHOD: GET +QUERY_STRING: +HTTP_ACCEPT_CHARSET: ISO-8859-1,utf-8;q=0.7,*;q=0.3 +HTTP_CONNECTION: keep-alive +PATH_INFO: /track.js +HTTP_HOST: localhost:8000 +HTTP_ACCEPT: */* +HTTP_USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31 +HTTP_ACCEPT_LANGUAGE: en-US,en;q=0.8 +HTTP_DNT: 1 +HTTP_ACCEPT_ENCODING: gzip,deflate,sdch +1.0.0.127.in-addr.arpa - - [24/Apr/2013 20:03:21] "GET /track.js HTTP/1.1" 200 0 +1.0.0.127.in-addr.arpa - - [24/Apr/2013 20:03:21] "GET /favicon.ico HTTP/1.1" 204 0 +``` + +You can see in the above that first the browser makes the request “GET /” which +returns our HTML containing the tracking pixel, then directly afterwards makes a +request for “GET /track.js” which prints out various information about the incoming +request. This example is not very useful as is, but helps to illustrate the key +point of a tracking pixel. We are having the browser make a request on behalf of +the user without the user’s knowledge. 
In this case we are making a call back to +our own server, but our script tag could easily point to a third party server. + + +## Add Some Search Data + +Our previous, simple example does not really provide us with any particularly +useful information other than allowing us to track that a user’s browser made the +call to our server. For this next example we want to build upon the previous one by +sending some data along with the tracking pixel; in this case, some search data. +Let us assume that our web page allows users to make searches; searches +are given to the page through a url query string parameter “search”. We want to +pass that value on to our tracking pixel, for which we will use the +query string parameter “s”. So our requests will look as follows: + +* http://localhost:8000?search=my cool search +* http://localhost:8000/track.js?s=my cool search + +To do this, we simply append the value of the query string parameter “search” onto our track.js +script tag in our HTML: + +```python +# Python 2 module locations +from urllib import quote +from urlparse import parse_qs + +def html_content(environ, respond): + query = parse_qs(environ['QUERY_STRING']) + # Re-encode the search term so it is safe to embed in the track.js url + search = quote(query.get('search', [''])[0]) + headers = [('Content-Type', 'text/html')] + respond('200 OK', headers) + return [ + """ + +

Welcome

+ + """ % search + ] +``` + +For our tracking pixel handler we will simply print the value of the query string +parameter “s” and again return an empty string. + +```python +def track_user(environ, respond): + query = parse_qs(environ['QUERY_STRING']) + search = query.get('s', [''])[0] + print 'User Searched For: %s' % search + headers = [('Content-Type', 'application/javascript')] + respond('200 OK', headers) + return [''] +``` + +When run, the output will look similar to: + +``` +brett$ python tracking_server.py +Tracking Server Listening on Port 8000... +1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:24] "GET /?search=my%20cool%20search HTTP/1.1" 200 110 +User Searched For: my cool search +1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:24] "GET /track.js?s=my%20cool%20search HTTP/1.1" 200 0 +1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:24] "GET /favicon.ico HTTP/1.1" 204 0 +1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:34] "GET /?search=another%20search HTTP/1.1" 200 108 +User Searched For: another search +1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:34] "GET /track.js?s=another%20search HTTP/1.1" 200 0 +1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:34] "GET /favicon.ico HTTP/1.1" 204 0 +``` + +Here we can see the two search requests made to our web page and the similar +resulting requests to track.js. Again, this example might not seem like much but +it provides a way to pass values from our web page along to the +tracking server. In this case we are passing search terms, but we could also pass +along any other information we needed. + + +## Track Users with Cookies + +So now we are getting somewhere; our tracking server is able to receive some +search data about the requests made to our web page. The problem now is we have +no way of associating this information with a specific user; how can we know when +a specific user searches for multiple things? Cookies to the rescue.
In this +example we are going to add support for cookies to assign each visiting +user a unique id; this will allow us to associate all the search data +we receive with “specific” users. Yes, I say “specific” with quotes because we can +only associate the data with a given cookie; if multiple people share a computer +then we will probably think they are a single person. As well, if someone clears +the cookies for their browser then we lose all association with that user and have +to start all over again with a new cookie. Lastly, if a user does not allow cookies +for their browser then we will be unable to associate any data with them, as every +time they visit our tracking server we will see them as a new user. So, how do we +do this? When we receive a request from a user we want to look and see if we have +given them a cookie with a user id; if so, we will associate the incoming data +with that user id, and if there is no user cookie then we will generate a new user +id and give it to the user. + +```python +# Python 2 module locations +from Cookie import SimpleCookie +from uuid import uuid4 + +def track_user(environ, respond): + cookies = SimpleCookie() + cookies.load(environ.get('HTTP_COOKIE', '')) + + # cookies.get() returns a Morsel object, so read its value explicitly + morsel = cookies.get('id') + if morsel is not None: + user_id = morsel.value + else: + user_id = uuid4() + print 'User did not have id, giving: %s' % user_id + + query = parse_qs(environ['QUERY_STRING']) + search = query.get('s', [''])[0] + print 'User %s Searched For: %s' % (user_id, search) + headers = [ + ('Content-Type', 'application/javascript'), + ('Set-Cookie', 'id=%s' % user_id) + ] + respond('200 OK', headers) + return [''] +``` + +This is great! Not only can we now obtain search data from a third party website +but we can also do our best to associate that data with a given user. In this +instance a single user is anyone who shares the same user id in their +browser’s cookies. + + +## Cache Busting + +So what exactly is cache busting?
Our browsers are smart; they know that we do not +like to wait a long time for a web page to load, and they have also learned that they +do not need to refetch content that they have seen before if they cache it. For +example, an image on a web site might get cached by your web browser so every time +you reload the page the image can be loaded locally as opposed to being fetched +from the remote server. Cache busting is a way to ensure that the browser does not +cache the content of our tracking pixel. We want the user’s browser to follow the +tracking pixel to our server for every page request they make because we want to +follow everything that user does. When the browser caches our tracking +pixel’s content (an empty string) then we lose out on data. Cache busting is the +term used when we programmatically generate query string parameters to make calls +to our tracking pixel look unique and therefore ensure that the browser follows +the pixel rather than loading it from its cache. To do this we need to add an extra end +point to our server. We need the HTML for the web page, along with a cache busting +script and finally our track.js handler. A cache busting script will use javascript +to add our track.js script tag to the web page. This means that after the web page +is loaded javascript will run to manipulate the +DOM +to add our cache busted track.js script tag to the HTML. So, what does this +look like? + +```javascript +var now = new Date().getTime(); +var random = Math.random() * 99999999999; +document.write('<script src="/track.js?t=' + now + '&r=' + random + '"></script>'); +``` + +This script adds the extra query string parameters “r”, which is a random number, +and “t”, which is the current timestamp in milliseconds. This will give us a unique +enough request that will trick our browsers into ignoring anything that it has in +its cache for track.js and force it to make the request anyway.
Using a cache +buster requires us to modify the html we serve slightly to serve up the cache +busting javascript as opposed to our track.js pixel. + +```html + + + +

Welcome

+ + + +``` + +And we need the following to serve up the cache buster script buster.js: + +```python +def cache_buster(environ, respond): + headers = [('Content-Type', 'application/javascript')] + respond('200 OK', headers) + cb_js = """ + function getParameterByName(name){ + name = name.replace(/[\[]/, "\\\[").replace(/[\]]/, "\\\]"); + var regexS = "[\\?&]" + name + "=([^&#]*)"; + var regex = new RegExp(regexS); + var results = regex.exec(window.location.search); + if(results == null){ + return ""; + } + return decodeURIComponent(results[1].replace(/\+/g, " ")); + } + + var now = new Date().getTime(); + var random = Math.random() * 99999999999; + var search = getParameterByName('search'); + document.write('<script src="/track.js?t=' + now + '&r=' + random + '&s=' + encodeURIComponent(search) + '"></script>'); + """ + return [cb_js] +``` + +We do not care very much if the browser caches our cache buster script because +it will always generate a new unique track.js url every time it is run. + + +## Conclusion + +There is a lot of stuff going on here and probably a lot to digest, so let’s quickly +review what we have learned. For starters we learned that companies use tracking +pixels or tags on web pages whose sole purpose is to make your browser call out to +external third party sites in order to track information about your internet +usage (usually; they can be used for other things as well). We also looked into +some very simplistic ways of implementing a server whose job it is to accept +tracking pixel calls in various forms. + +We learned that these tracking servers can use cookies stored in your browser to +store a unique id for you in order to help associate the data collected with you. +You can remove this association by clearing your cookies or by not allowing +them at all. Lastly, we learned that browser caching can cause issues for our tracking +pixels and data collection and that we can get around them using a cache busting +script. + +As a reminder the full working code examples can be located at +https://github.com/brettlangdon/tracking-server-examples.
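
The handlers in this article are shown on their own, without the code that routes each request to the right one. A minimal sketch of that wiring might look like the following. It is written for Python 3 (the snippets above use Python 2 syntax), and the names `make_app` and `serve` are hypothetical helpers for illustration, not taken from the example repository. The `track_user` here is a simplified placeholder; routing on `PATH_INFO` and answering unknown paths such as `/favicon.ico` with a 204 are assumptions based on the log output shown earlier.

```python
from wsgiref.simple_server import make_server


def make_app(routes):
    """Build a WSGI app that dispatches on PATH_INFO (hypothetical helper)."""
    def application(environ, respond):
        handler = routes.get(environ.get('PATH_INFO', '/'))
        if handler is None:
            # No handler registered (e.g. /favicon.ico): empty 204 response.
            respond('204 No Content', [])
            return [b'']
        return handler(environ, respond)
    return application


def track_user(environ, respond):
    # Placeholder tracking handler: log the query string, return an empty body.
    print('Query: %s' % environ.get('QUERY_STRING', ''))
    respond('200 OK', [('Content-Type', 'application/javascript')])
    return [b'']


# Map each path to its handler; html_content and cache_buster would go here too.
app = make_app({'/track.js': track_user})


def serve(port=8000):
    # Blocks forever; port 8000 matches the log output in the article.
    server = make_server('', port, app)
    print('Tracking Server Listening on Port %d...' % port)
    server.serve_forever()
```

Calling `serve()` starts the server. Keeping the routing table separate from the handlers means each new end point (for example `/buster.js`) is a one-line addition.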