At first, the data set in BigQuery might seem confusing to work with. If you've worked with any of our public BigQuery data sets in the past (like the Hacker News post data, or the recent San Francisco public data that our Developer Advocate Reto Meier had fun with), it probably looked a lot like a big ol' SQL table. Something like this:
Firebase Analytics takes advantage of this format to bundle all of your users' user properties together in the same row. Rather than have you perform some kind of join against a separate user_properties table, all of your user properties are included in the same BigQuery row as an array of structs.
A slightly simplified version of the user_properties struct in your BigQuery data
Important note: For all of these examples, I'm going to be using standard SQL, which is what all the cool kids are doing this days1. If you want to follow along, turn off Legacy SQL in your BigQuery options. Also, you'll need to follow this link to access the sample Firebase Analytics data we'll be using.
For example, I can see all of my event data at once just by calling
#standardSQLand I'll get back all of my event data, along with all of the event parameters, in one nice little table
SELECT event_dim
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`
LIMIT 50
#standardSQL...which gives me a nice result of...
SELECT event_dim
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`
WHERE event_dim.name = "round_completed"
Error: Cannot access field name on a value with type ARRAY<STRUCT<date STRING, name STRING, params ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, ...>>>, ...>> at [2:17]
- Oh. Oh dear.
Okay, so this won't win any awards for "Best Error Message of 2017"2 , but if you think about it, the reason it's barfing makes sense. You're trying to compare a string value to "an element of a struct that's buried inside of an array". Sure, that element ends up being a string, but they're fairly different objects.
So to fix this, you can use the
UNNEST
function. The UNNEST
function will take an array and break it out into each of its individual elements. Let's start with a simple example.Calling:
#standardSQLwill give you back a single row consisting of a string, and that array of data.
WITH data AS (
SELECT "primes under 15" AS description,
[1,2,3,5,7,11,13] AS primes_array)
SELECT *
FROM data
#standardSQLWhat you're basically saying is, "Hey, BigQuery, please break up that
WITH data AS (
SELECT "primes under 15" AS description,
[1,2,3,5,7,11,13] AS primes_array)
SELECT description, prime
FROM data CROSS JOIN UNNEST (primes_array) as prime
primes_array
into its individual members. Then join each of these members with a clone of the original row." So you end up with a data structure that looks more like this: primes_array
is still included in the data structure. In some cases (as you'll see below), this can be useful. In this particular case, I found it was a little confusing, which is why I only asked for the individual fields of description
and prime
instead of SELECT *.
3 It's also common convention to replace that
CROSS JOIN
syntax with a comma, so you get a query that looks like this. #standardSQLIt's the exact same query as the previous one; it's just a little more readable. Plus, I can now stand by my original statement that this data format means you don't have perform any JOINs. :)
WITH data AS (
SELECT "primes under 15" AS description,
[1,2,3,5,7,11,13] AS primes_array)
SELECT description, prime
FROM data, UNNEST (primes_array) as prime
And the nice thing here is that I now have one piece of "prime" data per column that I can interact with. So I can start to do comparisons like this:
#standardSQLTo get just that list of prime numbers between 8 and 15.
WITH data AS (
SELECT "primes under 15" AS description,
[1,2,3,5,7,11,13] AS primes_array)
SELECT description, prime
FROM data, UNNEST (primes_array) as prime
WHERE prime > 8
- So going back to our Firebase Analytics data, I can now use the
UNNEST
function to look for events that have a specific name. #standardSQL
SELECT event.name, event.timestamp_micros
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`,
UNNEST(event_dim) as event
WHERE event.name = "round_completed"
params
array, which contains all of the event parameters. If I were to UNNEST
those as well, I'd be able to query for specific events that contain specific event parameter values: #standardSQL
SELECT event, event.name, event.timestamp_micros
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`,
UNNEST(event_dim) as event,
UNNEST(event.params) as event_param
WHERE event.name = "round_completed"
AND event_param.key = "score"
AND event_param.value.int_value > 10000
Querying against user properties works in a similar manner. Let's say I'm curious as to what language my users prefer using for my app, something our app is tracking in a "language" user property. First, I'll use the
UNNEST
query to get just a list of each user and their preferred language. #standardSQL
SELECT
user_dim.app_info.app_instance_id as unique_id,
MAX(user_prop.key) as keyname,
MAX(user_prop.value.value.string_value) as keyvalue
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`,
UNNEST(user_dim.user_properties) AS user_prop
WHERE user_prop.key = "language"
GROUP BY unique_id
- And then I can use that as my inner selection to grab the total number of users4 that fits into that group.
#standardSQL
SELECT keyvalue, count(*) as count
FROM (
SELECT
user_dim.app_info.app_instance_id as unique_id,
MAX(user_prop.key) as keyname,
MAX(user_prop.value.value.string_value) as keyvalue
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`,
UNNEST(user_dim.user_properties) AS user_prop
WHERE user_prop.key = "language"
GROUP BY unique_id
)
GROUP BY keyvalue
ORDER BY count DESC
UNNEST
both my event parameters and my user properties if I want to create one great big query (no pun intended) where I want to look at events of a specific name where an event parameter matches a particular criteria, while also filtering by users who meet a certain criteria: #standardSQLOnce you start playing around with the
SELECT user_dim, event, event.name, event.timestamp_micros
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`,
UNNEST(event_dim) as event,
UNNEST(event.params) as event_param,
UNNEST(user_dim.user_properties) as user_prop
WHERE event.name = "round_completed"
AND event_param.key = "squares_daubed"
AND event_param.value.int_value > 20
AND user_prop.key = "elite_powers"
AND (CAST(user_prop.value.value.string_value as int64)) > 1
UNNEST
function, you'll find that it's really powerful and it can make working with Firebase Analytics data a lot more fun. If you want to find out more, you can check out the Working with Arrays section of BigQuery's standard SQL documentation. And don't forget, you get 1 terabyte of usage data for free every month with BigQuery, so don't be afraid to play around with it. Go crazy, you array expander, you!
1 ↩ The BigQuery team has asked me to inform you that this is really because standard SQL is the preferred SQL dialect for querying data stored in BigQuery. But I'm pretty sure they're just saying that so they get invited to all the good parties.
2 ↩ Yet another year the Messies have slipped from our grasp!
3 ↩ I could have also done this by saying "SELECT * EXCEPT (primes_array)", which can be pretty convenient sometimes.
4 ↩ Okay, technically, each "App Instance" -- a user interacting with my app from multiple devices would get counted multiple times here.
0 comments:
Post a Comment