I recently read this article on how to create a heatmap from Google Location History data. Testing it out myself, I got some amazing results:
The big red circles represent cities where I’ve spent a significant amount of time. The purple hue in other places marks locations I had traveled to or passed through during a train ride.
My old phone had some GPS issues that caused my location to show up in Arizona, USA. Surprisingly (or not?), the heatmap even gave proof of that!
This was all really cool to see, but I really wanted to dive in and learn more about my travel patterns over the years.
As with most data science problems, data preprocessing turned out to be the critical step. The data came in a JSON format where the meaning of the different attributes was not very clear.
data extraction
After some research, I found this article, which cleared up a lot of things. However, some questions still remain unanswered:
- What does the activity type tilting mean?
- I assumed that the confidence is the probability of each activity type. However, the values often do not add up to 100. If they do not represent probabilities, what do they represent?
- What is the difference between the activity types on_foot and walking?
- How can Google possibly distinguish between in_two_wheeler_vehicle and in_four_wheeler_vehicle?!
If anyone has been able to figure it out, please let me know in the comments.
Edit: there has been some discussion of these issues in this thread. You can find a paper on human activity recognition using smartphone data here.
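Setting those open questions aside, here is a minimal sketch of how the exported file can be loaded into a DataFrame. I'm assuming the classic Takeout layout (a `Location History.json` file with a `locations` array containing `latitudeE7`, `longitudeE7`, and `timestampMs`); your export may look slightly different.

```python
import json
import pandas as pd

# Load the Takeout export (file name and field names assumed from the classic format).
with open("Location History.json") as f:
    raw = json.load(f)

df = pd.DataFrame(raw["locations"])

# latitudeE7 / longitudeE7 are degrees * 1e7; timestampMs is epoch milliseconds.
df["latitude"] = df["latitudeE7"] / 1e7
df["longitude"] = df["longitudeE7"] / 1e7
df["timestamp"] = pd.to_datetime(df["timestampMs"].astype("int64"), unit="ms")
```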
assumptions
As I continued to structure my preprocessing pipeline, I realized that I would have to make a few assumptions to account for all the data attributes.
- GPS is always on (a strong assumption that is addressed later).
- The confidence value is the probability of the activity type (see the sketch after this list). This assumption lets us account for several possible activity types for a given instance without under- or over-representing any particular type.
- Each record has two timestamps: (i) one corresponding to the latitude/longitude fix, and (ii) one corresponding to the activity. Since the difference between the two was usually very small (< 30 seconds), I safely used the latitude/longitude timestamp for the analysis.
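To make assumptions #2 and #3 concrete, here's a rough sketch (field names taken from the raw record layout, everything else my own assumption) of how each record's confidences can be turned into probability weights:

```python
def activity_probabilities(record):
    """Return (activity_type, probability) pairs for one raw location record.

    Confidences are treated as probabilities (assumption #2) and rescaled so
    they sum to 1, since the reported values don't always add up to 100.
    """
    activity_blocks = record.get("activity", [])
    if not activity_blocks:
        return []
    # Take the first activity block; per assumption #3, its timestamp is
    # within ~30 seconds of the location fix anyway.
    entries = activity_blocks[0].get("activity", [])
    total = sum(a["confidence"] for a in entries) or 1
    return [(a["type"], a["confidence"] / total) for a in entries]
```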
data cleaning
Remember I told you my GPS was showing Arizona, USA as my location? I didn’t want those data points to skew the results, so, using India's latitude/longitude limits, I filtered the data down to points within India only.
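A simple bounding-box check does the job. The limits below are approximate values for India's latitude/longitude extent, not exact borders:

```python
# Approximate latitude/longitude bounds for India (rough values, not exact borders).
LAT_MIN, LAT_MAX = 6.5, 35.7
LON_MIN, LON_MAX = 68.0, 97.5

df = df[
    df["latitude"].between(LAT_MIN, LAT_MAX)
    & df["longitude"].between(LON_MIN, LON_MAX)
]
```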
cities for each data point
I wanted to get the corresponding city for each latitude/longitude pair. A simple Google search gave me the coordinates of the main cities I have lived in, i.e. Delhi, Goa, Trivandrum, and Bangalore.
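One simple way to do the mapping is to snap each point to whichever of those city centers is closest. The coordinates below are approximate city centers assumed here for illustration:

```python
# Approximate city-center coordinates (looked up manually; assumed for illustration).
CITY_CENTERS = {
    "delhi": (28.61, 77.21),
    "goa": (15.39, 73.88),
    "trivandrum": (8.52, 76.94),
    "bangalore": (12.97, 77.59),
}

def nearest_city(lat, lon):
    """Return the city whose center is closest (simple squared-degree distance)."""
    return min(
        CITY_CENTERS,
        key=lambda c: (CITY_CENTERS[c][0] - lat) ** 2 + (CITY_CENTERS[c][1] - lon) ** 2,
    )

df["city"] = [
    nearest_city(lat, lon) for lat, lon in zip(df["latitude"], df["longitude"])
]
```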
distance
Records consist of latitude and longitude. To calculate the distance traveled between records, these values have to be converted into a form suitable for distance calculations.
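A common choice is the haversine formula, which turns a pair of latitude/longitude coordinates into a great-circle distance. A minimal sketch for the distance between consecutive records (in miles, to match the numbers later in the post):

```python
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in miles (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 3958.8 * 2 * np.arcsin(np.sqrt(a))  # mean Earth radius ≈ 3958.8 miles

# Distance from each record to the previous one (the first record gets 0).
df = df.sort_values("timestamp")
df["distance"] = haversine_miles(
    df["latitude"].shift(), df["longitude"].shift(), df["latitude"], df["longitude"]
).fillna(0)
```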
normalized distance
Each record contains an activity. Each activity consists of one or more activity types, each with a confidence (treated here as a probability). To account for the confidence of the measurement, I devised a new metric called normalized distance, which is simply distance * confidence.
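As a one-line sketch, assuming the expanded frame has one row per (record, activity type) with the rescaled confidence in a `probability` column (as in the earlier sketch):

```python
# Normalized distance: distance traveled, weighted by the activity type's probability.
df["normalized_distance"] = df["distance"] * df["probability"]
```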
Now comes the interesting part! Before I dive into the insights, let me summarize some of the data attributes:
- accuracy: an estimate of how accurate the data point is. An accuracy below 800 is generally considered high, so records with an accuracy greater than 1000 were discarded (see the sketch after this list)
- day: represents the day of the month
- day_of_the_week: represents the day of the week
- month: represents the month
- year: represents the year
- distance: total distance traveled
- city: city corresponding to that data point
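Most of these attributes fall straight out of the timestamp and accuracy fields; a brief sketch using the cutoff mentioned above:

```python
# Discard low-accuracy fixes (the accuracy field is in meters; larger means less precise).
df = df[df["accuracy"] <= 1000]

# Calendar attributes derived from the record timestamp.
df["day"] = df["timestamp"].dt.day
df["day_of_the_week"] = df["timestamp"].dt.day_name()
df["month"] = df["timestamp"].dt.month
df["year"] = df["timestamp"].dt.year
```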
outlier detection
There are a total of 1,158,736 data points. 99% of them cover a distance of less than 1 mile; the remaining 1% are anomalies caused by poor reception or flight mode.
To prevent that 1% of the data from skewing our observations, we split the data into two sets based on the normalized distance.
This also ensures that we remove points that violate assumption #1 made earlier in the analysis.
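As a sketch, the split can be made with a simple quantile threshold on the normalized distance; the 99% figure above suggests a cutoff near the 99th percentile, though the exact value is a judgment call:

```python
# Split at the 99th percentile of normalized distance; the long tail holds the
# poor-reception / flight-mode artifacts described above.
cutoff = df["normalized_distance"].quantile(0.99)
regular_points = df[df["normalized_distance"] <= cutoff]
outlier_points = df[df["normalized_distance"] > cutoff]
```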
distance traveled with respect to the city
The 2018 data correctly reflects that most of my time was spent in Bangalore and Trivandrum.
I was wondering how the distance traveled in Delhi (my hometown) turned out to be greater than that in Goa, where I graduated. Then I realized: I didn’t have a mobile internet connection for most of my college life :).
travel patterns in bangalore and trivandrum
In June 2018, I completed my internship at my previous organization (in Trivandrum) and joined Nineleaps (in Bangalore). I wanted to know how my habits changed when I moved from one city to the other. I was particularly interested in these patterns for two reasons:
- Since I always had mobile internet while living in these cities, I expected the data to be an accurate representation of reality.
- I have spent roughly the same amount of time in the two cities, so the data will not be biased towards either city.
- Several friends and family members visited Bangalore in October, which resulted in a big increase in the distance traveled by vehicle.
- At first, I was out exploring Trivandrum. However, as my focus shifted to securing a full-time data science opportunity, the distance traveled dropped dramatically from January through February and March.
- Vehicle use is much higher in Bangalore between 20:00 and 00:00. I guess I leave later in Bangalore.
- I was walking a lot more in Trivandrum! The difference in walking distance from 10 am to 8 pm shows how I was living a healthier lifestyle, taking a walk every hour or two at the office.
There’s a lot more (like this and this) you can do with your location history. You can also explore your Twitter/Facebook/Chrome data. Some useful tips when exploring your own dataset:
- Spend a significant amount of time preprocessing your data. It’s painful, but worth it.
- When working with large volumes of data, preprocessing can be computationally heavy. Instead of re-executing the Jupyter cells every time, dump the preprocessed data into a pickle file and simply import it when you start again (see the sketch after this list).
- Initially you might fail miserably (like me) at finding any pattern. Make a list of your observations and keep exploring the dataset from different angles. If you ever reach a point where you wonder whether any patterns are present at all, ask yourself three questions: (i) do I have a thorough understanding of the various data attributes? (ii) is there anything I can do to improve my preprocessing? (iii) have I explored the relationships between all attributes using all possible statistical/visualization tools?
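For the pickle tip above, pandas makes it a one-liner in each direction:

```python
# Dump the preprocessed frame once...
df.to_pickle("preprocessed_locations.pkl")

# ...and reload it instantly in the next session instead of re-running the whole pipeline.
df = pd.read_pickle("preprocessed_locations.pkl")
```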
To get started, you can use my Jupyter notebook here.
If you have any questions/suggestions, feel free to post them in the comments.
You can connect with me on LinkedIn or email me at k.mathur68@gmail.com.