Joinless fact tables in Google BigQuery (storing facts at the transaction header level)

  • timandrews1
  • May 25, 2022
  • 3 min read

Updated: May 26, 2022

There's information out there in the wild indicating that best practice for Google BigQuery is to use nested and repeated columns as opposed to traditional star schemas.

However, most of the information that I've found uses examples that are more akin to a snowflaked dimension (e.g., a "book" table should contain embedded information about the author(s) credited to the book).
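That style of nesting typically looks something like the following (an illustrative sketch only; the table and column names are mine, not taken from any particular article):


CREATE TABLE misc.dim_book
(
  book_key INT64,
  title STRING,
  -- Author attributes embedded directly in the "book" table
  authors ARRAY<STRUCT<author_key INT64, author_name STRING>>
);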


While I suppose a book could be thought of as a fact, I haven't seen much written about traditional, obvious fact tables.


After quite a bit of searching, I stumbled upon Rajesh Thallum's reference article on the topic. Rajesh's example uses a transaction-style fact table to illustrate the point of bringing the dimension information into the fact table and compressing detail fact rows into their respective header rows.


To that end, I modeled a somewhat similar fact table and dimensions, and compared query metrics against a more traditional approach.


The schema

  • dim_customer_small - 10,000 rows, 10 columns
  • dim_store - 6 rows, 4 columns
  • dim_product - 46 rows, 3 columns
  • fact_grocery_sales - 1 billion rows, with the following columns (a rough DDL sketch follows the list):
      • sale_id (transaction identifier - more than one line item is possible per transaction)
      • customer_key
      • store_key
      • product_key
      • quantity
      • extended_amount
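For reference, the fact table's DDL is roughly the following (the column types are my assumptions):


CREATE TABLE misc.fact_grocery_sales
(
  sale_id INT64,          -- transaction identifier, repeated across line items
  customer_key INT64,
  store_key INT64,
  product_key INT64,
  quantity INT64,
  extended_amount NUMERIC
);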

Business case

Our sample use case involved returning the top five customers, based on total sales per store, for the grocery product group. The SQL used is as follows:


SELECT store_name, full_name, SUM(a.extended_amount) AS total_sales
,ROW_NUMBER() OVER (PARTITION BY store_name ORDER BY SUM(a.extended_amount) DESC) AS Rnk
FROM misc.fact_grocery_sales a
INNER JOIN misc.dim_product p
  ON a.product_key = p.product_key
  AND p.product_group = 'Grocery'
INNER JOIN misc.dim_customer_small c
  ON a.customer_key = c.customer_key
INNER JOIN misc.dim_store s
  ON a.store_key = s.store_key
GROUP BY store_name, full_name
QUALIFY ROW_NUMBER() OVER (PARTITION BY store_name ORDER BY SUM(a.extended_amount) DESC) IN (1,2,3,4,5)
ORDER BY 1, 4;

  • Execution time (seconds): 1
  • Data billed (GB): 29.86
  • Slot time consumed: 3 min, 3 sec
  • Data shuffled (MB): 135.65

Refactoring the fact table as per best practices

In the next step, I used nested and repeated columns to compress the fact table down to the transaction header level, and embedded the dimension attributes within the fact table. Dimensional modeling teaches us to store data at the atomic level, which I would argue is still the case here: years ago, that meant one row at that lower level, but here I've embedded the repeating data within rows. The result is that the total row count in the fact table was reduced by about two-thirds. In a real-life scenario, where there are on average more than three line items per transaction, a further reduction could be expected.


The SQL to compress the fact table is straightforward:


CREATE OR REPLACE TABLE misc.total_grocery_sales
AS
  SELECT sale_id 
  ,STRUCT(c.customer_key, c.full_name) AS customer
  ,STRUCT(s.store_key, s.store_name, s.store_region, s.store_manager) AS store
  ,ARRAY_AGG(STRUCT(p.product_key, p.product_name, p.product_group, a.quantity, a.extended_amount)) AS products
  FROM misc.fact_grocery_sales a
  INNER JOIN misc.dim_product p
    ON a.product_key = p.product_key
  INNER JOIN misc.dim_customer_small c
    ON a.customer_key = c.customer_key
  INNER JOIN misc.dim_store s
    ON a.store_key = s.store_key
  GROUP BY sale_id, c.customer_key, c.full_name, s.store_key, s.store_name, s.store_region, s.store_manager;

The STRUCT function was used to combine dimension attributes into a single column for both customer and store. STRUCT was combined with ARRAY_AGG to group transaction lines into a single row with multiple measures. This approach could be reused to keep fact tables updated when dimension attributes change over time, as sketched below.
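For instance, something along these lines could refresh the embedded customer attributes after dim_customer_small changes (a rough sketch, assuming customer_key is a stable key and only full_name drifts):


MERGE misc.total_grocery_sales t
USING misc.dim_customer_small c
  ON t.customer.customer_key = c.customer_key
WHEN MATCHED AND t.customer.full_name != c.full_name THEN
  -- Rebuild the embedded struct from the current dimension row
  UPDATE SET customer = STRUCT(c.customer_key, c.full_name);


The store struct could be handled the same way; refreshing attributes inside the products array is more involved, since the array would need to be rebuilt.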


The new table's schema is shown below, and the subsequent query involves just this one table.
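Its schema is roughly equivalent to the following DDL (the column types are my assumptions):


CREATE TABLE misc.total_grocery_sales
(
  sale_id INT64,
  customer STRUCT<customer_key INT64, full_name STRING>,
  store STRUCT<store_key INT64, store_name STRING, store_region STRING, store_manager STRING>,
  products ARRAY<STRUCT<product_key INT64, product_name STRING, product_group STRING,
                        quantity INT64, extended_amount NUMERIC>>
);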


The new SQL query to answer our business question:


SELECT store.store_name, customer.full_name, SUM(p.extended_amount) AS total_sales
,ROW_NUMBER() OVER (PARTITION BY store.store_name ORDER BY SUM(p.extended_amount) DESC) AS Rnk
FROM misc.total_grocery_sales, UNNEST(products) p
WHERE p.product_group = 'Grocery'
GROUP BY store.store_name, customer.full_name
QUALIFY ROW_NUMBER() OVER (PARTITION BY store.store_name ORDER BY SUM(p.extended_amount) DESC) IN (1,2,3,4,5)
ORDER BY 1, 4;

  • Execution time (seconds): 1
  • Data billed (GB): 21.95
  • Slot time consumed: 5 min, 7 sec
  • Data shuffled (MB): 321.49

The net result in this scenario is a reduction in data billed of about 26%. Additionally, the SQL required to answer typical business questions is arguably simpler to write. Given that storage is cheaper than compute, it shouldn't be an issue to embed dimension attributes into multiple fact tables. However, keeping them all in sync does pose new challenges. A compromise could be to retain keys to the dimensions but compress line-level rows into header rows, storing the measures in arrays, as sketched below.
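A minimal sketch of that compromise, using a hypothetical table name, might look like this:


CREATE OR REPLACE TABLE misc.fact_grocery_sales_header
AS
  -- Keep the surrogate keys, but collapse the line items into the header row
  SELECT sale_id
  ,customer_key
  ,store_key
  ,ARRAY_AGG(STRUCT(product_key, quantity, extended_amount)) AS lines
  FROM misc.fact_grocery_sales
  GROUP BY sale_id, customer_key, store_key;


Queries would still join to the dimensions on customer_key, store_key, and the product_key inside each array element, but the dimension attributes wouldn't need to be kept in sync across fact tables.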


