Google Analytics and UserId Backstitching
For a few years now I’ve been obsessing about how Google Analytics treats UserIDs. There are quite a few posts out there for how Google Analytics UserID feature allows you to “Measure Real Users Instead of Devices” or “Understand People’s Behavior“. I’ve always
complained argued that the UserId functionality in Google Analytics misses the mark in a big way because Google only does “Session Unification” but not true visitor back-stitching.
IDs IDs IDs OH MY!!
The main question at hand is how Google Analytics defines a “User”. By default, a “User” is a clientId ==> most often identified by the _ga cookie (i.e. a unique browser ID). However, within UserID Views, when you pass a value to the UserID field in GA, this will overwrite the default processing of what defines a user which will allow the View to show cross device reporting.
Google explains how they currently handle session unification here.
Session Unification not enabled
Session Unification Enabled
As you can see above, when Session Unification is enabled, the hits that were sent before a user logged in Session Two are auto-magically backfilled to include hits from the beginning of the session. But what about Session ONE? Or Session FIVE?
Truth be told, once there has been a match between a clientId and userId, by doing a LEFT JOIN on the data table the userId will back-stitch to all rows in the table.
In the sample table below (real data), this user was anonymous on the site until their 7th session.
In response to some discussion around this post on social media, I’d like to add a quick word here about User Privacy and what it means to “track an individual”.
When back-stitching User IDs to Client IDs, there is a spectrum of different use cases, and this spectrum has different implications with regards to User Privacy. As a number of my European colleagues have flagged, with the existence of GDPR there can be a range of legal implications when back-stitching User IDs. I feel grossly cliche for saying this, but check with your own legal team and company with regards to policy; I’m not a lawyer.
In the sample image above, User 35363822 was not logged in during sessions 1 through 6, but now I can create a click-stream history of all pages viewed by this person before they logged in. I feel this is the most “red-zone” when it comes to User Privacy and ethical practices within analytics. This practice is most common in large B2B enterprises who are interested in tying user behavioral data into their CRM. But the use of back-stitching to create individual user histories honestly is not what bothers me about the lack of back-stitching in GA. Google Analytics is not a CRM or CDP nor was it ever intended to be.
Rather, my interest in back-stitching is for the purposes of properly lining up the primary keys within the analytics data set for the purposes of segmentation and analysis. In that regards, I don’t believe that there is any difference ethically or with regards to User Privacy as it relates to back-stitching. Having the numerator and denominator more accurate when doing behavioral funnel analysis doesn’t violate the privacy of User 35363822. With the exception of the User Explorer report, if GA back-stitched User IDs, the visibility into an individual’s behavior would not be impacted in a material way. But your cross device attribution modelling would be much better.
About the data set.
I queried 8 different GA accounts where I had access to both a clientId and a userId as custom dimensions. There are a total of 146 million sessions in the data set, with the following distribution.
The first query is pure “session level” data. It will return one row for every session in the date range.
The second query will return one row for every session in the date range where the session has an associated User ID Custom Dimension. The actual UserID field in Google Analytics is not accessible via the Core Reporting API.
I concatenated the clientId with the countOfSessions to return a “sessionId”. This is an important point because I wanted to find all sessions that had a UserID available.
Once I had my “sessions table” and “userId table” lined up , I did two LEFT JOINs. The first was to join the userId onto the sessions table. This back-stitches the userId to every relevant clientId. The second join is the “sessionId” to identify which sessions had a userId available (and which did not).
It is important to keep in mind that it is possible for a single clientId to have multiple userIds. This is something that any analyst needs to keep in mind when back-stitching. For this analysis I took the MAX(userId) so that I had a single clientId <> userId relationship. Internally, we oftentimes take the most recent userId and keep an updated back-stitching table available for our ad-hoc analysis.
For this post, I broke down all the sessions that I analyzed into three groups:
- ClientId Only Sessions => Sessions where there is no matching UserId, even with back-stitching applied.
- UserId Sessions => Sessions where there is a UserId. These sessions would be processed by the GA UserID View
- Stitched Sessions => Sessions that would not be a part of the GA UserID View, but could be if GA did back-stitching
I also bucketed the Count Of Sessions into its own BINs so that I could have a dimension that was a bit easier to work with.
Being able to measure the impact on GA’s lack of visitor stitching is something that has eluded me for quite some time, until I finally buckled down and learned some SQL. Full transparency, I’ve spent hours upon hours initially trying to do this analysis using Excel+PowerQuery. But Excel is simply the wrong tool to use to do JOINs. Millions of rows / Gigabytes of files are not suited Excel. This type of data work is computationally expensive. (More on that later). And while I do feel a bit embarrassed to admit that I initially wasted gobs of time trying to do this in Excel, I think that it is important for me to share a bit about my personal experience because I hope that others will be inspired to push themselves and learn new things.
So I uploaded my data sources to BigQuery, did the necessary ETL there, and then connected my BigQuery tables directly to Tableau Online where it was honestly just a pleasure to interact with the data and produce the following visualizations. (Action Item for all those readers who don’t know SQL, you should learn SQL).
What is the impact?
As I begin drilling down into some of the key points that I want to share, I’d like to start by saying that this is anecdotal, albeit real data. In other words, there was quite a bit of discrepancy between the different sites that I analyzed; so your own mileage will vary. I only analyzed 146 million sessions, not 146 trillion.
Overall, about 5% of the total sessions in the entire data set could have been back-stitched.
Of sessions that could be related to a user, 11.35% of sessions could be back-stitched (almost 7 million in total).
Unsurprisingly, the count of sessions more “sensitive” to being back-stitched were the earlier ones on the customer life-cycle.
As I mentioned earlier, the impact of back-stitching differed significantly between sites.
Same chart as above, bucketed with % Totals of Sessions (
Finally, the B2B company has two ways that they can collect a User Identifier, either via a login (which they do have) or through Eloqua (which uses email link decoration). I am particularly intrigued by the different email service providers and marketing automation software packages which can integrate their email link decoration to identify a browser. Sites which have low rates of login or may not even have a login at all can have cross device visibility into user behavior when integrated properly.
The percentage of total sessions that have an identifiable userId continues to grow as more and more users within the buckets are a part of the email lists.
Takeaways and considerations
First and foremost, this data confirmed something that I’ve been blabbering about for years; namely, that the Google Analytics UserID View is not particularly useful from an analytics perspective unless you’re analyzing the behavior of logged in users. For example, if you’re a SaaS site and you’d like to analyze how customers use your product, the UserID View is okay.
If you are a marketer or analyst who cares about acquisition, then the UserID View falls flat. You’ll be missing user data about some of the most important visits to your site, the “customer acquisition” ones pre-login.
The biggest impact on an analytics platform that doesn’t back-stitch user identities is Segmentation. Creating a “User” based segment with flawed data-set is a flawed segment. A close “second” in terms of impact is Acquisition analysis. Just think about how different your Multi-Channel Funnels would look if you knew the different marketing touch points that a user interacted with across different devices.
So why doesn’t GA simply back-stitch UserIDs?
My unverified, non-validated, decently informed guesses are:
1). It is computationally expensive. When running the JOIN on the data set that we use to create a back-stitched userId for just one of our clients, the 427 million rows in the raw data set takes about 40 seconds to crank through in BigQuery. I don’t like the computationally expensive argument when it comes to Google, however, because their resources are computationally phenomenal, although I may be being a bit naive to dismiss it out of hand. That said, the computational expense fully explains why the UserID View exists in the first place. The data in a View is processed once before it’s put into place. And everybody knows that GA doesn’t reprocess any data, so it at least makes some sense why proper user identification is not a part of their approach.
Honestly, the concept of a UserId should not even be limited to its own “View”, the userId should serve as a primary key in the main data set when it exists, not cordoned off from other touch points and interactions on the site.
2). It is a privacy issue. Here, too, I don’t really embrace the concern. Google Analytics already has plenty of Terms Of Service limitations about why sorts of identifiers can be sent in any field (including Customer Dimensions and UserID). And it’s not impossible to take the data out and stitch it (I did). But because Google is under the microscope of every government and privacy agency out there about how they treat data, not wanting to touch a live privacy wire with a 10 foot plastic pole while wearing protective gloves makes sense.
This post was less about giving myself yet another chance to complain about how GA doesn’t do visitor back-stitching, but more about sharing some sort of quantitative data about why it makes a difference. I’ve literally had this blog post on the back-burner for a few years, and I’m really happy that with a little bit of upping my game with BigQuery I’m able to share this with you.
I’m including my full sql below. I’m happy to have anyone make suggestions for improvements. I’m still learning.
with cid_table as( SELECT * FROM `data-prep-234419.IT6.clientId_websitename` UNION ALL SELECT * FROM `data-prep-234419.IT6.clientId_websitename2` ), uid_table as ( SELECT * FROM `data-prep-234419.IT6.userId_websitename` UNION ALL SELECT * FROM `data-prep-234419.IT6.userId_websitename2` ), getClientIdUserIdMatch as ( SELECT user_clientId, MAX (userId) as userId FROM uid_table GROUP BY 1 ), userIdJoin as ( Select cid_table.*, getClientIdUserIdMatch.userId as userId, user_sessionId from cid_table LEFT JOIN getClientIdUserIdMatch ON clientId = user_clientId LEFT JOIN uid_table ON sessionId = user_sessionId ), booleans AS( SELECT userIdJoin.*, CASE WHEN userId is not null THEN 1 ELSE 0 END as userSessionsIncludingStitching, CASE WHEN REGEXP_CONTAINS(user_sessionId, r":") THEN 1 ELSE 0 END as userSessionsNoStitching FROM userIdJoin) SELECT *, CASE WHEN CAST(countOfSessions AS NUMERIC) > 200 THEN 200 WHEN CAST(countOfSessions AS NUMERIC) > 100 THEN 100 WHEN CAST(countOfSessions AS NUMERIC) > 50 THEN 50 WHEN CAST(countOfSessions AS NUMERIC) > 40 THEN 40 WHEN CAST(countOfSessions AS NUMERIC) > 30 THEN 30 WHEN CAST(countOfSessions AS NUMERIC) > 20 THEN 20 WHEN CAST(countOfSessions AS NUMERIC) > 10 THEN 10 WHEN CAST(countOfSessions AS NUMERIC) > 5 THEN 6 WHEN CAST(countOfSessions AS NUMERIC) > 4 THEN 5 WHEN CAST(countOfSessions AS NUMERIC) > 3 THEN 4 WHEN CAST(countOfSessions AS NUMERIC) > 2 THEN 3 WHEN CAST(countOfSessions AS NUMERIC) > 1 THEN 2 WHEN CAST(countOfSessions AS NUMERIC) = 1 THEN 1 end as sessionBucket, CASE WHEN userSessionsIncludingStitching = 1 AND userSessionsNoStitching = 0 THEN 1 ELSE 0 END as countSessionsCouldHaveBeenStitch, CASE WHEN userSessionsIncludingStitching = 1 AND userSessionsNoStitching = 0 THEN "Stitched Session" WHEN userSessionsNoStitching = 1 THEN "User ID Session" WHEN userSessionsIncludingStitching = 0 THEN "ClientId Only Session" END as isStitchedSession, FROM booleans
Hi, interesting blog post.
This comes right in time, as I’m searching what to do with the userId custom dimension that was implemented a couple of months back at me current new job.
I’d like to point out that yes, in order to make deeper analysis one need to lear SQL, R (or Python). I’ll try to replay the querying process to extract the data from GA using R, tomorrow monday.
It’s a boomer that the userId from GA is not accessible from the API, but as far as I know you can download custome dimensions, so if you have sent your custom userID, you should be able to get it through the API:
The second query will return one row for every session in the date range where they session has an associated User ID Custom Dimension. The actual UserID field in Google Analytics is not accessible via the Core Reporting API.
John Travolta says
You should blog more often