| @ -0,0 +1,352 @@ | |||
| --- | |||
| title: Third Party Tracking Pixels | |||
| author: Brett Langdon | |||
| date: 2013-05-03 | |||
| template: article.jade | |||
| --- | |||
| An overview of what a third party tracking pixel is and how to create/use them. | |||
| --- | |||
| So, what exactly do we mean by “third party tracking pixel” anyways? | |||
| Lets try to break it down piece by piece: | |||
| ### Tracking Pixel: | |||
| A pixel referes to a tag that is placed on a site that offers no merit other than | |||
| calling out to a web page or script that is not the current page you are visiting. | |||
| These pixels are usually an html script tag that point to a javascript file with | |||
| no content or an img tag with a empty or transparent 1 pixel by 1 pixel gif image | |||
| (hence the term “pixel”). A tracking pixel is the term used to describe a pixel | |||
| that calls to another page or script in order to provide it information about the | |||
| users visit to the page. | |||
| ### Third Party: | |||
| Third party just means the pixel points to a website that is not the current | |||
| website. For example, | |||
| <a href="http://www.google.com/analytics/" target="_blank">Google Analytics</a> | |||
| is a third party tracking tool because you place scripts on your website | |||
| that calls and sends data to Google. | |||
| ## What is the point? | |||
| Why do people do this? In the case of Google Analytics people do not wish to track | |||
| and follow their own analytics for their website, instead they want a third party | |||
| host to do it for them, but they need a way of sending their user’s data to Google. | |||
| Using pixels and javascript to send the data to Google offers the company a few | |||
| benefits. For starters, they do not require any more overhead on their servers for | |||
| a service to send data directly to Google, instead by using pixels and scripts they | |||
| get to off load this overhead onto their users (thats right, we are using our | |||
| personal computers resources to send analytical data about ourselves to Google for | |||
| websites that use Google analytics). Secondly, the benefit of using a tracking | |||
| pixel that runs client side (in the user’s browser) we are allowed to gather more | |||
| information about the user. The information that is made available to us through | |||
| the use of javascript is far greater than what is given to our servers via | |||
| HTTP Headers. | |||
| ## How do we do it? | |||
| Next we will walk through the basics of how to create third party tracking pixels. | |||
| Code examples for the following discussion can be found | |||
| <a href="https://github.com/brettlangdon/tracking-server-examples" target="_blank">here</a>. | |||
| We will walk through four examples of tracking pixels accompanied by the server | |||
| code needed to serve and receive the pixels. The server is written in | |||
| <a href="http://python.org/" target="_blank">Python</a> and some basic | |||
| understanding of Python is required to follow along. The server examples are | |||
| written using only standard Python wsgi modules, so no extra installation is | |||
| needed. We will start off with a very simple example of using a tracking pixel and | |||
| then each example afterwards we will begin to add features to the pixel. | |||
| ## Simple Example | |||
| For this example all we want to accomplish is to have a web server that returns | |||
| HTML containing our tracking pixel as well as a handler to receive the call from | |||
| our tracking pixel. Our end goal is to serve this HTML content: | |||
| ```html | |||
| <html> | |||
| <head></head> | |||
| <body> | |||
| <h2>Welcome</h2> | |||
| <script src="/track.js"></script> | |||
| </body> | |||
| </html> | |||
| ``` | |||
| As you can see, this is fairly simple HTML; the important part is the script tag | |||
| pointing to “/track.js”, this is our tracking pixel. When the user’s browser loads | |||
| the page this script will make a call to our server, our server can then log | |||
| information about that user. So we start with a wsgi handler for the HTML code: | |||
| ```python | |||
| def html_content(environ, respond): | |||
| headers = [('Content-Type', 'text/html')] | |||
| respond('200 OK', headers) | |||
| return [ | |||
| """ | |||
| <html><head></head><body> | |||
| <h2>Welcome</h2><script src="/track.js"></script> | |||
| </body></html> | |||
| """ | |||
| ] | |||
| ``` | |||
| Next we want to make sure that we have a handler for the calls to “/track.js” | |||
| from the script tag: | |||
| ```python | |||
| def track_user(environ, respond): | |||
| headers = [('Content-Type', 'application/javascript')] | |||
| respond('200 OK', headers) | |||
| prefixes = ['PATH_', 'HTTP', 'REQUEST', 'QUERY'] | |||
| for key, value in environ.iteritems(): | |||
| if any(key.startswith(prefix) for prefix in prefixes): | |||
| print '%s: %s' % (key, value) | |||
| return [''] | |||
| ``` | |||
| In this handler we are taking various information about the request from the user | |||
| and simply printing it to the screen. The end point “/track.js” is not meant to | |||
| point to actual javascript so instead we return back an empty string. When this | |||
| code runs you should see something like the following: | |||
| ``` | |||
| brett$ python tracking_server.py | |||
| Tracking Server Listening on Port 8000... | |||
| 1.0.0.127.in-addr.arpa - - [24/Apr/2013 20:03:21] "GET / HTTP/1.1" 200 89 | |||
| HTTP_REFERER: http://localhost:8000/ | |||
| REQUEST_METHOD: GET | |||
| QUERY_STRING: | |||
| HTTP_ACCEPT_CHARSET: ISO-8859-1,utf-8;q=0.7,*;q=0.3 | |||
| HTTP_CONNECTION: keep-alive | |||
| PATH_INFO: /track.js | |||
| HTTP_HOST: localhost:8000 | |||
| HTTP_ACCEPT: */* | |||
| HTTP_USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31 | |||
| HTTP_ACCEPT_LANGUAGE: en-US,en;q=0.8 | |||
| HTTP_DNT: 1 | |||
| HTTP_ACCEPT_ENCODING: gzip,deflate,sdch | |||
| 1.0.0.127.in-addr.arpa - - [24/Apr/2013 20:03:21] "GET /track.js HTTP/1.1" 200 0 | |||
| 1.0.0.127.in-addr.arpa - - [24/Apr/2013 20:03:21] "GET /favicon.ico HTTP/1.1" 204 0 | |||
| ``` | |||
| You can see in the above that first the browser makes the request “GET /” which | |||
| returns our HTML containing the tracking pixel, then directly afterwards makes a | |||
| request for “GET /track.js” which prints out various information about the incoming | |||
| request. This example is not very useful as is, but helps to illustrate the key | |||
| point of a tracking pixel. We are having the browser make a request on behalf of | |||
| the user without the user’s knowledge. In this case we are making a call back to | |||
| our own server, but our script tag could easily point to a third party server. | |||
| ## Add Some Search Data | |||
| Our previous, simple, example does not really provide us with any particularly | |||
| useful information other than allow us to track that a user’s browser made the | |||
| call to our server. For this next example we want to build upon the previous by | |||
| sending some data along with the tracking pixel; in this case, some search data. | |||
| Let us make an assumption that our web page allows users to make searches; searches | |||
| are given to the page through a url query string parameter “search”. We want to | |||
| pass that query string parameter on to our tracking pixel, which we will use the | |||
| query string parameter “s”. So our requests will look as follows: | |||
| * http://localhost:8000?search=my cool search | |||
| * http://localhost:8000/track.js?s=my cool search | |||
| To do this, we simply append the query string parameter “search” onto our track.js | |||
| script tag in our HTML: | |||
| ```python | |||
| def html_content(environ, respond): | |||
| query = parse_qs(environ['QUERY_STRING']) | |||
| search = quote(query.get('search', [''])[0]) | |||
| headers = [('Content-Type', 'text/html')] | |||
| respond('200 OK', headers) | |||
| return [ | |||
| """ | |||
| <html><head></head><body> | |||
| <h2>Welcome</h2><script src="/track.js?s=%s"></script> | |||
| </body></html> | |||
| """ % search | |||
| ] | |||
| ``` | |||
| For our tracking pixel handler we will simply print the value of the query string | |||
| parameter “s” and again return an empty string. | |||
| ```python | |||
| def track_user(environ, respond): | |||
| query = parse_qs(environ['QUERY_STRING']) | |||
| search = query.get('s', [''])[0] | |||
| print 'User Searched For: %s' % search | |||
| headers = [('Content-Type', 'application/javascript')] | |||
| respond('200 OK', headers) | |||
| return [''] | |||
| ``` | |||
| When run the output will look similar to: | |||
| ``` | |||
| brett$ python tracking_server.py | |||
| Tracking Server Listening on Port 8000... | |||
| 1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:24] "GET /?search=my%20cool%20search HTTP/1.1" 200 110 | |||
| User Searched For: my cool search | |||
| 1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:24] "GET /track.js?s=my%20cool%20search HTTP/1.1" 200 0 | |||
| 1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:24] "GET /favicon.ico HTTP/1.1" 204 0 | |||
| 1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:34] "GET /?search=another%20search HTTP/1.1" 200 108 | |||
| User Searched For: another search | |||
| 1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:34] "GET /track.js?s=another%20search HTTP/1.1" 200 0 | |||
| 1.0.0.127.in-addr.arpa - - [24/Apr/2013 21:35:34] "GET /favicon.ico HTTP/1.1" 204 0 | |||
| ``` | |||
| Here we can see the two search requests made to our web page and the similar | |||
| resulting requests to track.js. Again, this example might not seem like much but | |||
| it proves a way of being able to pass values from our web page along with to the | |||
| tracking server. In this case we are passing search terms, but we could also pass | |||
| any other information along we needed. | |||
| ## Track User’s with Cookies | |||
| So now we are getting somewhere, our tracking server is able to receive some | |||
| search data about the requests made to our web page. The problem now is we have | |||
| no way of associating this information with a specific user; how can we know when | |||
| a specific user searches for multiple things. Cookies to the rescue. In this | |||
| example we are going to add the support of using cookies to assign each visiting | |||
| user a specific and unique id, this will allow us to associate all the search data | |||
| we receive with “specific” users. Yes, I say “specific” with quotes because we can | |||
| only associate the data with a given cookie, if multiple people share a computer | |||
| then we will probably think they are a single person. As well, if someone clears | |||
| the cookies for their browser then we lose all association with that user and have | |||
| to start all over again with a new cookie. Lastly, if a user does not allow cookies | |||
| for their browser then we will be unable to associate any data with them as every | |||
| time they visit our tracking server we will see them as a new user. So, how do we | |||
| do this? When receive a request from a user we want to look and see if we have | |||
| given them a cookie with a user id, if so then we will associate the incoming data | |||
| with that user id and if there is no user cookie then we will generate a new user | |||
| id and give it to the user. | |||
| ```python | |||
| def track_user(environ, respond): | |||
| cookies = SimpleCookie() | |||
| cookies.load(environ.get('HTTP_COOKIE', '')) | |||
| user_id = cookies.get('id') | |||
| if not user_id: | |||
| user_id = uuid4() | |||
| print 'User did not have id, giving: %s' % user_id | |||
| query = parse_qs(environ['QUERY_STRING']) | |||
| search = query.get('s', [''])[0] | |||
| print 'User %s Searched For: %s' % (user_id, search) | |||
| headers = [ | |||
| ('Content-Type', 'application/javascript'), | |||
| ('Set-Cookie', 'id=%s' % user_id) | |||
| ] | |||
| respond('200 OK', headers) | |||
| return [''] | |||
| ``` | |||
| This is great! Not only can we now obtain search data from a third party website | |||
| but we can also do our best to associate that data with a given user. In this | |||
| instance a single user is anyone who shares the same user id in their | |||
| browsers cookies. | |||
| ## Cache Busting | |||
| So what exactly is cache busting? Our browsers are smart, they know that we do not | |||
| like to wait a long time for a web page to load, they have also learned that they | |||
| do not need to refetch content that they have seen before if they cache it. For | |||
| example, an image on a web site might get cached by your web browser so every time | |||
| you reload the page the image can be loaded locally as opposed to being fetched | |||
| from the remote server. Cache busting is a way to ensure that the browser does not | |||
| cache the content of our tracking pixel. We want the user’s browser to follow the | |||
| tracking pixel to our server for every page request they make because we want to | |||
| follow everything that that user does. When the browser caches our tracking | |||
| pixel’s content (an empty string) then we lose out on data. Cache busting is the | |||
| term used when we programmatically generate query string parameters to make calls | |||
| to our tracking pixel look unique and therefore ensure that the browser follows | |||
| the pixel rather than load from it’s cache. To do this we need to add an extra end | |||
| point to our server. We need the HTML for the web page, along with a cache busting | |||
| script and finally our track.js handler. A cache busting script will use javascript | |||
| to add our track.js script tag to the web page. This means that after the web page | |||
| is loaded javascript will run to manipulate the | |||
| <a href="http://en.wikipedia.org/wiki/Document_Object_Model" target="_blank">DOM</a> | |||
| to add our cache busted track.js script tag to the HTML. So, what does this | |||
| look like? | |||
| ```javascript | |||
| var now = new Date().getTime(); | |||
| var random = Math.random() * 99999999999; | |||
| document.write('<script type="text/javascript" src="/track.js?t=' + now + '&r=' + random + '"></script> | |||
| ``` | |||
| This script adds the extra query string parameters ”r” which is a random number | |||
| and “t” which is the current timestamp in milliseconds. This will give us a unique | |||
| enough request that will trick our browsers into ignoring anything that is has in | |||
| it’s cache for track.js and forces it to make the request anyways. Using a cache | |||
| buster requires us to modify the html we server slightly to server up the cache | |||
| busting javascript as opposed to our track.js pixel. | |||
| ```html | |||
| <html> | |||
| <head></head> | |||
| <body> | |||
| <h2>Welcome</h2> | |||
| <script src="/buster.js"></script> | |||
| </body> | |||
| </html> | |||
| ``` | |||
| And we need the following to serve up the cache buster script buster.js: | |||
| ```python | |||
| def cache_buster(environ, respond): | |||
| headers = [('Content-Type', 'application/javascript')] | |||
| respond('200 OK', headers) | |||
| cb_js = """ | |||
| function getParameterByName(name){ | |||
| name = name.replace(/[\[]/, "\\\[").replace(/[\]]/, "\\\]"); | |||
| var regexS = "[\\?&]" + name + "=([^&#]*)"; | |||
| var regex = new RegExp(regexS); | |||
| var results = regex.exec(window.location.search); | |||
| if(results == null){ | |||
| return ""; | |||
| } | |||
| return decodeURIComponent(results[1].replace(/\+/g, " ")); | |||
| } | |||
| var now = new Date().getTime(); | |||
| var random = Math.random() * 99999999999; | |||
| var search = getParameterByName('search'); | |||
| document.write('<script src="/track.js?t=' + now + '&r=' + random + '&s=' + search + '"></script>'); | |||
| """ | |||
| return [cb_js] | |||
| ``` | |||
| We do not care very much if the browser caches our cache buster script because | |||
| it will always generate a new unique track.js url every time it is run. | |||
| ## Conclusion | |||
| There is a lot of stuff going on here and probably a lot to digest so lets review | |||
| quick what we have learned. For starters we learned that companies use tracking | |||
| pixels or tags on web pages whose sole purpose is to make your browser call our to | |||
| external third party sites in order to track information about your internet | |||
| usage (usually, they can be used for other things as well). We also looked into | |||
| some very simplistic ways of implementing a server whose job it is to accept | |||
| tracking pixels calls in various forms. | |||
| We learned that these tracking servers can use cookies stored on your browser to | |||
| store a unique id for you in order to help associate the data collected to you. | |||
| That you can remove this association by clearing your cookies or by not allowing | |||
| them at all. Lastly, we learned that browsers can cause issues for our tracking | |||
| pixels and data collection and that we can get around them using a cache busting | |||
| javascript. | |||
| As a reminder the full working code examples can be located at | |||
| <a href="https://github.com/brettlangdon/tracking-server-examples" target="_blank">"https://github.com/brettlangdon/tracking-server-examples</a>. | |||