Joinless fact tables in Google BigQuery (storing facts at the transaction header level)

  • timandrews1
  • May 25, 2022
  • 3 min read

Updated: May 26, 2022

There's information out there in the wild indicating that best practice for Google BigQuery is to use nested and repeated columns as opposed to traditional star schemas.

However, most of the information that I've found uses examples that are more akin to a snowflaked dimension (e.g., a "book" table should contain embedded information about the author(s) credited to the book).
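That style of nesting typically looks something like the following (an illustrative sketch only; the table and column names are mine, not taken from any particular article):


CREATE TABLE misc.dim_book
(
  book_key INT64,
  title STRING,
  -- Author attributes embedded directly in the "book" table
  authors ARRAY<STRUCT<author_key INT64, author_name STRING>>
);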


While I suppose a book could be thought of as a fact, I haven't seen much written about traditional, obvious fact tables.


After quite a bit of searching, I stumbled upon Rajesh Thallum's reference article on the topic. Rajesh's example uses a transaction-style fact table to illustrate the point of bringing the dimension information into the fact table and compressing detail fact rows into their respective header rows.


To that end, I modeled a somewhat similar fact table and dimensions, and compared query metrics against a more traditional approach.


The schema

  • dim_customer_small - 10,000 rows, 10 columns
  • dim_store - 6 rows, 4 columns
  • dim_product - 46 rows, 3 columns
  • fact_grocery_sales - 1 billion rows, with the following columns (a rough DDL sketch follows the list):
      • sale_id (transaction identifier - more than one line item is possible per transaction)
      • customer_key
      • store_key
      • product_key
      • quantity
      • extended_amount
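For reference, the fact table's DDL is roughly the following (the column types are my assumptions):


CREATE TABLE misc.fact_grocery_sales
(
  sale_id INT64,          -- transaction identifier, repeated across line items
  customer_key INT64,
  store_key INT64,
  product_key INT64,
  quantity INT64,
  extended_amount NUMERIC
);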

Business case

Our sample use case involved returning the top five customers, based on total sales per store, for the grocery product group. The SQL used is as follows:


SELECT store_name, full_name, SUM(a.extended_amount) AS total_sales
,ROW_NUMBER() OVER (PARTITION BY store_name ORDER BY SUM(a.extended_amount) DESC) AS Rnk
FROM misc.fact_grocery_sales a
INNER JOIN misc.dim_product p
  ON a.product_key = p.product_key
  AND p.product_group = 'Grocery'
INNER JOIN misc.dim_customer_small c
  ON a.customer_key = c.customer_key
INNER JOIN misc.dim_store s
  ON a.store_key = s.store_key
GROUP BY store_name, full_name
QUALIFY ROW_NUMBER() OVER (PARTITION BY store_name ORDER BY SUM(a.extended_amount) DESC) IN (1,2,3,4,5)
ORDER BY 1, 4;

  • Execution time (seconds): 1
  • Data billed (GB): 29.86
  • Slot time consumed: 3 min, 3 sec
  • Data shuffled (MB): 135.65

Refactoring the fact table as per best practices

In the next step, I used nested and repeated columns to compress the fact table down to the transaction header level, and embedded the dimension attributes within the fact table. Dimensional modeling teaches us to store data at the atomic level, which I would argue is still the case here: years ago, that meant one row at that lower level, but here I've embedded the repeating data within rows. The result is that the total row count in the fact table was reduced by about two-thirds. In a real-life scenario, where there are on average more than three line items per transaction, a further reduction could be expected.


The SQL to compress the fact table is straightforward:


CREATE OR REPLACE TABLE misc.total_grocery_sales
AS
  SELECT sale_id 
  ,STRUCT(c.customer_key, c.full_name) AS customer
  ,STRUCT(s.store_key, s.store_name, s.store_region, s.store_manager) AS store
  ,ARRAY_AGG(STRUCT(p.product_key, p.product_name, p.product_group, a.quantity, a.extended_amount)) AS products
  FROM misc.fact_grocery_sales a
  INNER JOIN misc.dim_product p
    ON a.product_key = p.product_key
  INNER JOIN misc.dim_customer_small c
    ON a.customer_key = c.customer_key
  INNER JOIN misc.dim_store s
    ON a.store_key = s.store_key
  GROUP BY sale_id, c.customer_key, c.full_name, s.store_key, s.store_name, s.store_region, s.store_manager;

The STRUCT function was used to combine dimension attributes into a single column for both customer and store. STRUCT was combined with ARRAY_AGG to group transaction lines into a single row with multiple measures. This approach could be reused to keep fact tables updated when dimension attributes change over time, as sketched below.
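For instance, something along these lines could refresh the embedded customer attributes after dim_customer_small changes (a rough sketch, assuming customer_key is a stable key and only full_name drifts):


MERGE misc.total_grocery_sales t
USING misc.dim_customer_small c
  ON t.customer.customer_key = c.customer_key
WHEN MATCHED AND t.customer.full_name != c.full_name THEN
  -- Rebuild the embedded struct from the current dimension row
  UPDATE SET customer = STRUCT(c.customer_key, c.full_name);


The store struct could be handled the same way; refreshing attributes inside the products array is more involved, since the array would need to be rebuilt.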


The new table's schema is shown below, and the subsequent query involves just this one table.
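Its schema is roughly equivalent to the following DDL (the column types are my assumptions):


CREATE TABLE misc.total_grocery_sales
(
  sale_id INT64,
  customer STRUCT<customer_key INT64, full_name STRING>,
  store STRUCT<store_key INT64, store_name STRING, store_region STRING, store_manager STRING>,
  products ARRAY<STRUCT<product_key INT64, product_name STRING, product_group STRING,
                        quantity INT64, extended_amount NUMERIC>>
);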


The new SQL query to answer our business question:


SELECT store.store_name, customer.full_name, SUM(p.extended_amount) AS total_sales
,ROW_NUMBER() OVER (PARTITION BY store.store_name ORDER BY SUM(p.extended_amount) DESC) AS Rnk
FROM misc.total_grocery_sales, UNNEST(products) p
WHERE p.product_group = 'Grocery'
GROUP BY store.store_name, customer.full_name
QUALIFY ROW_NUMBER() OVER (PARTITION BY store.store_name ORDER BY SUM(p.extended_amount) DESC) IN (1,2,3,4,5)
ORDER BY 1, 4;

  • Execution time (seconds): 1
  • Data billed (GB): 21.95
  • Slot time consumed: 5 min, 7 sec
  • Data shuffled (MB): 321.49

The net result in this scenario is a reduction in data billed of about 26%. Additionally, the SQL required to answer typical business questions is arguably simpler to write. Given that storage is cheaper than compute, it shouldn't be an issue to embed dimension attributes into multiple fact tables. However, keeping them all in sync does pose new challenges. A compromise could be to retain keys to the dimensions but compress line-level rows into header rows, storing the measures in arrays, as sketched below.
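A minimal sketch of that compromise, using a hypothetical table name, might look like this:


CREATE OR REPLACE TABLE misc.fact_grocery_sales_header
AS
  -- Keep the surrogate keys, but collapse the line items into the header row
  SELECT sale_id
  ,customer_key
  ,store_key
  ,ARRAY_AGG(STRUCT(product_key, quantity, extended_amount)) AS lines
  FROM misc.fact_grocery_sales
  GROUP BY sale_id, customer_key, store_key;


Queries would still join to the dimensions on customer_key, store_key, and the product_key inside each array element, but the dimension attributes wouldn't need to be kept in sync across fact tables.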


