Applying Natural Selection to Location Based Data
Posted in Blog
By Jonathan Lenaghan
Economy and efficiency are key to communication in the complex world of location-based data. There’s no need to use 100 words when 10 will do, just as there’s no need to analyze 1,000 mobile ad requests with poor location quality when 10 high-quality requests are at your disposal. While having more data produces better location data models and more reliable predictions, this is only true if that data is cleansed, verified and of the highest quality.
Having just enough quality data maximizes predictions, inferences and understanding. This is especially important when analyzing location data from ad request logs, which are notoriously loaded with noise, fraud and general misrepresentation. Darwin is an evolutionary pipeline that allows PlaceIQ to rigorously evaluate the quality of location data that is always changing.
Darwin embodies the broad skills and diverse backgrounds of our Data Science team. From a sociological perspective, a lot of time is spent thinking about what defines human behavior. On the physics side, there’s an obsession with measurement and quantification. Then there are computer scientists who build scalable machine learning algorithms to infer human behavior.
PlaceIQ ingests over 100 billion ad requests each month, so it’s imperative to know which of these requests are of high enough quality to be relied on as signals of human movement and behavior. The questions that guided the development of a high-scale pipeline to compute metrics across many terabytes of data were:
- What do human movement patterns look like?
- How do they change throughout the day?
- When are they sparse and where do they cluster?
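A first pass at questions like these can come from simple aggregate metrics. As a toy sketch (the data layout is illustrative, not PlaceIQ’s actual schema), one can bucket ad requests by hour of day to see how activity density shifts through the day:

```python
from collections import Counter
from datetime import datetime, timezone

def hourly_activity(pings):
    """pings: list of (unix_seconds, lat, lon) ad requests.
    Returns a 24-entry dict mapping hour of day (UTC) -> request count,
    a first look at how movement density changes throughout the day."""
    hours = Counter(
        datetime.fromtimestamp(t, tz=timezone.utc).hour for t, _, _ in pings
    )
    return {h: hours.get(h, 0) for h in range(24)}
```

At scale this kind of metric would be computed in a distributed pipeline rather than in-memory, but the shape of the question is the same: group, count, and compare against what normal human rhythms should look like.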
When it comes to data analytics: “garbage in” simply leads to “garbage out”. There’s no magical machine that can transform poor quality data into golden nuggets. Therefore it’s crucial for any location data provider to be able to measure the quality of location data.
In the world of location-based targeting, quality is defined by the hyperlocality of a data set.
In this case, “hyperlocal” refers to the distance between a reported location and the true location. By nature, GPS technology will always have some level of error, be it 100 meters or 10 meters. And the reality is that some publishers report location accurately only to a certain decimal digit, then append arbitrary digits to make their data sets appear hyperlocal. Darwin distinguishes these publishers from those who truly provide hyperlocal data by measuring the information gain each lat/long digit contributes about the associated human behavior. Yes, human behavior is random, but not as random as digits a computer generates.
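One way to make this concrete (a minimal sketch of the idea, not PlaceIQ’s actual method) is to compare the Shannon entropy of a given decimal digit across a partner’s reported coordinates. Digits drawn from genuine visits to real places retain structure, while fabricated trailing digits tend toward a uniform distribution, with entropy near log2(10) ≈ 3.32 bits:

```python
import math
from collections import Counter

def digit_entropy(values, decimal_place):
    """Shannon entropy (in bits) of the digit at `decimal_place`
    (1 = tenths, 2 = hundredths, ...) across a list of coordinates."""
    digits = [int(round(abs(v) * 10 ** decimal_place)) % 10 for v in values]
    counts = Counter(digits)
    total = len(digits)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Coordinates that cluster around real places keep low entropy deep
# into the decimals; padded digits look uniform (entropy near 3.32 bits).
```

A partner whose fourth or fifth decimal digits carry essentially maximal entropy, regardless of where or when the device was seen, is a candidate for the “arbitrary digits” bucket.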
Once it’s confirmed that a given partner satisfies hyperlocal needs, filtration is complete, right? Wrong. PlaceIQ is delivering location-based targeting, so “high quality” data must also come from users who display normal human behavior. Humans generally have dwells – locations we tend to be around consistently. Adults generally have home, work and leisure dwells. With that in mind, the origin of ad requests, based on user location, should cluster to some degree. PlaceIQ refers to this notion as “clusterability”. If the clusterability of a data set is low due to fraud – devices appearing around the world within short time spans – then those mobile devices must be removed from PlaceIQ’s analytics pipeline to achieve a higher-quality data set.
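The “around the world within short time spans” test can be sketched as an impossible-speed filter. The cutoff and the function names below are illustrative assumptions, not PlaceIQ’s production logic: compute the great-circle distance between consecutive pings and flag any device whose implied speed is physically implausible.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 1000.0  # illustrative cutoff, roughly airliner speed

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def is_teleporting(pings, max_kmh=MAX_PLAUSIBLE_KMH):
    """pings: time-sorted list of (unix_seconds, lat, lon) for one device.
    Returns True if any consecutive pair implies an impossible speed."""
    for (t1, la1, lo1), (t2, la2, lo2) in zip(pings, pings[1:]):
        hours = max((t2 - t1) / 3600.0, 1e-9)  # guard against zero gaps
        if haversine_km(la1, lo1, la2, lo2) / hours > max_kmh:
            return True
    return False
```

Devices that trip this check repeatedly are natural candidates for removal before clusterability is scored, since their pings can never form coherent home, work or leisure dwells.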
By reducing the sources of “garbage,” Darwin ensures that only the fittest data survives. This filtering stage has dramatically improved the quality of PlaceIQ’s data and sharpened our understanding of human movement and behavior.