Wednesday, 22 March 2017

BigQuery Tip: The UNNEST Function

Todd Kerpleman
Todd Kerpelman
Developer Advocate
By now, you probably already know that you can export your Firebase Analytics data to BigQuery, which lets you run all sorts of sophisticated ad hoc queries against your analytics data.

At first, the data set in BigQuery might seem confusing to work with. If you've worked with any of our public BigQuery data sets in the past (like the Hacker News post data, or the recent San Francisco public data that our Developer Advocate Reto Meier had fun with), it probably looked a lot like a big ol' SQL table. Something like this:
The truth of the matter is that BigQuery can get much more sophisticated than that. The rows of a BigQuery table don't just have to be straightforward key-value pairs. They can look more like rows of JSON objects, containing some simple data (like strings, integers, and floats), but also more complex data like arrays, structs, or even arrays of structs. Something a little more like this:

Firebase Analytics takes advantage of this format to bundle all of your users' user properties together in the same row. Rather than have you perform some kind of join against a separate user_properties table, all of your user properties are included in the same BigQuery row as an array of structs.

A slightly simplified version of the user_properties struct in your BigQuery data 

The same thing holds true for your events. Your event parameters are included inside your events as an array of structs. And it turns out these events themselves are stored inside of an array. One single row of data in BigQuery will often contain 2 or 3 Firebase Analytics events all bundled together.
This means a single row in your BigQuery table can contain an array of user properties, as well as an array of events, which in turn also have their own arrays of event parameters. I know combining all of that information into a data structure like this seems confusing at first, but in the long run, it actually makes your life easier because there aren't any JOINs with other tables for you to worry about.

Important note: For all of these examples, I'm going to be using standard SQL, which is what all the cool kids are doing this days1. If you want to follow along, turn off Legacy SQL in your BigQuery options. Also, you'll need to follow this link to access the sample Firebase Analytics data we'll be using.

For example, I can see all of my event data at once just by calling

#standardSQL
SELECT event_dim
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`
LIMIT 50
and I'll get back all of my event data, along with all of the event parameters, in one nice little table
And then if we want to get a list of all of my "Round completed" events, I can just write some SQL like this…

#standardSQL
SELECT event_dim
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`
WHERE event_dim.name = "round_completed"
...which gives me a nice result of...

Error: Cannot access field name on a value with type ARRAY<STRUCT<date STRING, name STRING, params ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, ...>>>, ...>> at [2:17]

Oh. Oh dear. 

Okay, so this won't win any awards for "Best Error Message of 2017"2 , but if you think about it, the reason it's barfing makes sense. You're trying to compare a string value to "an element of a struct that's buried inside of an array". Sure, that element ends up being a string, but they're fairly different objects.

So to fix this, you can use the UNNEST function. The UNNEST function will take an array and break it out into each of its individual elements. Let's start with a simple example.

Calling:

#standardSQL
WITH data AS (
SELECT "primes under 15" AS description,
[1,2,3,5,7,11,13] AS primes_array)
SELECT *
FROM data
will give you back a single row consisting of a string, and that array of data.

Instead, try something like this:

#standardSQL
WITH data AS (
SELECT "primes under 15" AS description,
[1,2,3,5,7,11,13] AS primes_array)
SELECT description, prime
FROM data CROSS JOIN UNNEST (primes_array) as prime
What you're basically saying is, "Hey, BigQuery, please break up that primes_array into its individual members. Then join each of these members with a clone of the original row." So you end up with a data structure that looks more like this:
The results are similar as before, but now each prime is in its own row:
You'll notice that the original primes_array is still included in the data structure. In some cases (as you'll see below), this can be useful. In this particular case, I found it was a little confusing, which is why I only asked for the individual fields of description and prime instead of SELECT *.3

It's also common convention to replace that CROSS JOIN syntax with a comma, so you get a query that looks like this.

#standardSQL
WITH data AS (
SELECT "primes under 15" AS description,
[1,2,3,5,7,11,13] AS primes_array)
SELECT description, prime
FROM data, UNNEST (primes_array) as prime
It's the exact same query as the previous one; it's just a little more readable. Plus, I can now stand by my original statement that this data format means you don't have perform any JOINs. :)

And the nice thing here is that I now have one piece of "prime" data per column that I can interact with. So I can start to do comparisons like this:

#standardSQL
WITH data AS (
SELECT "primes under 15" AS description,
[1,2,3,5,7,11,13] AS primes_array)
SELECT description, prime
FROM data, UNNEST (primes_array) as prime
WHERE prime > 8
To get just that list of prime numbers between 8 and 15.
So going back to our Firebase Analytics data, I can now use the UNNEST function to look for events that have a specific name. 

#standardSQL
SELECT event.name, event.timestamp_micros
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`,
UNNEST(event_dim) as event
WHERE event.name = "round_completed"

As you'll recall, events have their own paramsarray, which contains all of the event parameters. If I were to UNNEST those as well, I'd be able to query for specific events that contain specific event parameter values:

#standardSQL
SELECT event, event.name, event.timestamp_micros
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`,
UNNEST(event_dim) as event,
UNNEST(event.params) as event_param
WHERE event.name = "round_completed"
AND event_param.key = "score"
AND event_param.value.int_value > 10000

Note that in this case, I am selecting "event" as one of the fields in my query, which gives me the original array of all my event parameters nicely grouped together in my table results.

Querying against user properties works in a similar manner. Let's say I'm curious as to what language my users prefer using for my app, something our app is tracking in a "language" user property. First, I'll use the UNNEST query to get just a list of each user and their preferred language.

#standardSQL
SELECT
user_dim.app_info.app_instance_id as unique_id,
MAX(user_prop.key) as keyname,
MAX(user_prop.value.value.string_value) as keyvalue
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`,
UNNEST(user_dim.user_properties) AS user_prop
WHERE user_prop.key = "language"
GROUP BY unique_id
And then I can use that as my inner selection to grab the total number of users4 that fits into that group.
#standardSQL
SELECT keyvalue, count(*) as count
FROM (
SELECT
user_dim.app_info.app_instance_id as unique_id,
MAX(user_prop.key) as keyname,
MAX(user_prop.value.value.string_value) as keyvalue
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`,
UNNEST(user_dim.user_properties) AS user_prop
WHERE user_prop.key = "language"
GROUP BY unique_id
)
GROUP BY keyvalue
ORDER BY count DESC
I can also UNNEST both my event parameters and my user properties if I want to create one great big query (no pun intended) where I want to look at events of a specific name where an event parameter matches a particular criteria, while also filtering by users who meet a certain criteria:

#standardSQL
SELECT user_dim, event, event.name, event.timestamp_micros
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`,
UNNEST(event_dim) as event,
UNNEST(event.params) as event_param,
UNNEST(user_dim.user_properties) as user_prop
WHERE event.name = "round_completed"
AND event_param.key = "squares_daubed"
AND event_param.value.int_value > 20
AND user_prop.key = "elite_powers"
AND (CAST(user_prop.value.value.string_value as int64)) > 1
Once you start playing around with the UNNESTfunction, you'll find that it's really powerful and it can make working with Firebase Analytics data a lot more fun. If you want to find out more, you can check out the Working with Arrays section of BigQuery's standard SQL documentation.

And don't forget, you get 1 terabyte of usage data for free every month with BigQuery, so don't be afraid to play around with it. Go crazy, you array expander, you!



1 The BigQuery team has asked me to inform you that this is really because standard SQL is the preferred SQL dialect for querying data stored in BigQuery. But I'm pretty sure they're just saying that so they get invited to all the good parties.

2 Yet another year the Messies have slipped from our grasp!

3   I could have also done this by saying "SELECT * EXCEPT (primes_array)", which can be pretty convenient sometimes.

4 Okay, technically, each "App Instance" -- a user interacting with my app from multiple devices would get counted multiple times here.









Share:

0 comments:

Post a Comment