GA4 and HyperLogLog: the good, the bad and the ugly

This post explores Google Analytics 4 (GA4) and its relationship with HyperLogLog (HLL++), a probabilistic data structure used for counting unique users and sessions.

I watch way too much Seinfeld. In one episode (“The Strike”), Jerry and George quote a line from “The Good, the Bad and the Ugly”. That reference makes me think of the relationship between Google Analytics 4 (GA4) and something called HyperLogLog.

What is HyperLogLog?

HyperLogLog is a powerful tool for estimating the number of unique elements in a large set, even with duplicates (e.g., unique visitors). It uses a probabilistic approach, sacrificing some precision for significant efficiency gains.
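To make the idea concrete, here is a minimal Python sketch of the classic HyperLogLog algorithm. It is not HLL++ and it is not GA4's implementation; the class name TinyHLL, the SHA-256 hash and the default of 12 index bits are all assumptions chosen for illustration. Each item is hashed, the first few bits of the hash pick a register, and the register remembers the longest run of leading zero bits it has seen. Rare long runs imply many distinct items.

```python
import hashlib
import math


class TinyHLL:
    """Illustrative classic HyperLogLog (hypothetical class, not GA4's code)."""

    def __init__(self, p: int = 12):
        self.p = p                       # number of hash bits used as register index
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # Take 64 bits of a SHA-256 digest as the hash (an assumption of this sketch).
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)        # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = TinyHLL()
for i in range(50_000):
    uid = f"user-{i}"
    hll.add(uid)
    hll.add(uid)  # a repeat visit leaves the registers unchanged
estimate = hll.count()
```

The sketch holds 4,096 small registers no matter how many IDs are added, which is where the efficiency comes from. The estimate should land close to, but not exactly on, the true count of 50,000.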

HyperLogLog in GA4:

HLL++ estimations are ubiquitous in the GA4 product, impacting all reporting surfaces (standard reports, explorations and the Data API). HLL++ is currently used to report all user metrics (except “New users”) and all session metrics.

Accuracy and BigQuery:

While HLL++ has an error margin of only 1-2% for users and 3-4% for sessions, it can create discrepancies with BigQuery-based GA4 reporting. The mismatch arises because BigQuery queries run against the raw exported data, without HLL++ estimation. This difference can be confusing and can potentially undermine confidence in the accuracy of the GA4 product’s reporting. (See more details in Google’s developer post on Bridging the gap between Google Analytics UI and BigQuery export.)

Our own test:

Below is a table from our own test of the impact of HLL++ on GA4 product reporting vs. BigQuery GA4 reporting. We treat the GA4 standard reports as the baseline and show the % variance for the other reporting options. The test covers a recent 28-day period (ending over a week ago) for a GA4 property recording over 1.5M Total users and over 2M Sessions. Total users and Sessions in BigQuery GA4 reporting are close to 5% lower than in GA4 standard reports.

Could the difference be accounted for by bot traffic, for example traffic that was removed ahead of the GA4 BigQuery export but has not yet been reprocessed by GA4? That seems unlikely. The reporting identity for the GA4 property is “observed”, and we record User-ID in some scenarios, but that would lower Total users in GA4 standard reports, which does not seem to be happening here (we did not consider User-ID in our BigQuery GA4 reporting). We also see a thresholding alert in GA4, but it only references cards generically, so it does not seem to be affecting this reporting. We do not see sampling or “other” row instances in effect in GA4 product reporting for this test.

A recent 28-day period in GA4	Total users	Sessions	HLL++ in effect?
GA4 Standard Reports (baseline)	100.00%	100.00%	Yes
GA4 Explorations (variance vs. baseline)	0.00%	0.00%	Yes
GA4 Data API in Looker Studio (variance vs. baseline)	-1.24%	0.43%	Yes
GA4 Data in BigQuery (variance vs. baseline)	-4.83%	-4.69%	No

Test of the impact of HLL++ on GA4 product vs. BigQuery GA4 reporting.
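For anyone repeating this comparison on their own property, the variance figures are a plain relative difference against the standard-reports baseline. The sketch below uses hypothetical counts (the actual counts from our test are not published here) and a hypothetical helper named pct_variance.

```python
def pct_variance(value: float, baseline: float) -> float:
    """Percent difference of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


# Hypothetical counts, chosen only to illustrate the calculation.
baseline_users = 1_500_000   # e.g. Total users in GA4 standard reports
bigquery_users = 1_427_550   # e.g. distinct users counted from the BigQuery export
variance = round(pct_variance(bigquery_users, baseline_users), 2)
```

A negative value means the compared source reports fewer users than the baseline.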

GA4 uses a model based on your property’s data to determine if sampling will produce more accurate results. If the model determines that the grouping of results in the “other” row (due to cardinality limits) will result in less directionally accurate results than sampling, Analytics will automatically use sampling to provide you with more accurate results that won’t include an “other” row.

Precision vs. Direction:

We must remember that Google Analytics is directional, not precise. Due to inherent limitations in online tracking, achieving absolute accuracy is impossible. HyperLogLog, while not perfect, should provide estimates that are sufficient for making informed decisions without requiring massive computational resources.

Benefits of HLL++:

Space-efficient: HLL++’s algorithm significantly reduces the storage space needed for unique user/session counts, making it ideal for handling large datasets.
Computationally efficient: HLL++ drastically cuts down the processing power required for counting unique users, making it much faster than exact distinct counting.
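A rough sense of the trade-off behind these two benefits can be had from the classic HyperLogLog error formula, where the relative standard error is about 1.04 divided by the square root of the number of registers. The sketch below is illustrative only: the helper name hll_profile, the 6-bit register size and the precision value of 14 are assumptions, not GA4's documented settings.

```python
import math


def hll_profile(p: int):
    """Return (approximate bytes, relative standard error) for 2**p registers."""
    m = 1 << p
    size_bytes = m * 6 // 8           # assumes 6-bit registers (enough for a 64-bit hash)
    std_error = 1.04 / math.sqrt(m)   # classic HyperLogLog standard error
    return size_bytes, std_error


size_bytes, std_error = hll_profile(14)   # p = 14 is an assumed, illustrative precision

# Lower bound for storing 1.5M distinct 8-byte IDs exactly, ignoring set overhead.
exact_bytes = 1_500_000 * 8
```

Under these assumptions the sketch needs roughly 12 KB regardless of audience size, while an exact set of 1.5M IDs needs at least 12 MB. The price is a standard error a little under 1%.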

The “Ugly” Side:

Estimates, not exact counts: HLL++ provides estimates, which can be concerning for those seeking absolute precision.
BigQuery discrepancy: The difference between GA4 and BigQuery reporting due to HLL++ can be confusing and raise doubts about data accuracy.

NOTE: HLL++ was used in Universal Analytics starting in 2017, and is also leveraged in other vendor analytics products (e.g., Adobe Analytics).

What about the impact of HLL++ on A/B tests?

The accuracy of A/B test results can be significantly impacted by the use of HLL++ for user counts. This article suggests that if you have >12K users per variant in your A/B test, you can’t do reliable analysis using GA4 (nor any vendor analytics product employing HLL++). Raw GA4 data in BigQuery, on the other hand, can be used for such analysis without the impact of HLL++ (how to make that happen is beyond the scope of this post). However, even if you are using raw GA4 data in BigQuery to analyze your A/B tests, you are certainly missing data from users you were not able to track, and the same person may show up as multiple users given iOS / Safari ITP tracking disintegration.
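To see why an estimated user count matters here, consider a simplified sketch. It assumes conversions are counted exactly, treats the 2% user error margin mentioned earlier as a hard bound on the denominator, and ignores ordinary sampling noise. The numbers and the helper name rate_bounds are hypothetical.

```python
def rate_bounds(conversions: int, est_users: int, rel_err: float):
    """Conversion-rate range if the true user count is est_users within +/- rel_err."""
    low = conversions / (est_users * (1 + rel_err))
    high = conversions / (est_users * (1 - rel_err))
    return low, high


# Hypothetical A/B test: variant B looks about 3% better on the reported counts.
a_low, a_high = rate_bounds(conversions=1_000, est_users=20_000, rel_err=0.02)
b_low, b_high = rate_bounds(conversions=1_030, est_users=20_000, rel_err=0.02)

# If the ranges overlap, the estimation error alone could explain the apparent lift.
ranges_overlap = b_low <= a_high
```

In this made-up case the two ranges overlap, so a lift of a few percent cannot be separated from the uncertainty in the user counts. Larger effects, or exact counts from raw data, are less exposed to this problem.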

Conclusion:

Transparency regarding HLL++ can be a double-edged sword. While some appreciate understanding the inner workings of GA4, others become fixated on the lack of exact counts. It’s important to remember that accurate estimations, combined with the right analysis approach, can still lead to sound decision-making. The differences you are seeing between HLL++ estimates and raw BigQuery GA4 reporting do not seem great enough to meaningfully erode the directional usefulness of GA4 data as reported in GA4 reporting surfaces. Yes, be aware of how HLL++ can impact A/B testing. But if you are aware of the impact of such probabilistic estimation, you can still use the information to make better decisions.

iDimension can help you navigate this complexity and unlock the power of GA4.
