
Simple data quality tests for uniqueness and foreign keys

  • timandrews1
  • May 29, 2022
  • 3 min read

As a column-store, cloud data warehouse, BigQuery does not support primary key, unique key, or foreign key constraints. It doesn't even support unenforced, "fake" primary keys the way Snowflake does.


While best practice may be to test for uniqueness at the data transformation layer (if you are using dbt, consider its tests feature), it can still be useful to run routine tests that report duplicates and orphaned rows. Furthermore, consistent naming conventions aid both dynamic data quality testing and your analysts' and end-users' understanding of the data model.
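As an aside, a dbt singular test is just a SELECT saved under tests/ that returns the rows which should not exist, so a uniqueness check is only a few lines of SQL. A minimal sketch for a hypothetical dim_customer model:


--tests/assert_dim_customer_business_key_unique.sql (hypothetical model and column names)
--dbt flags the test as failed if this query returns any rows
SELECT
  bk_customer_id,
  COUNT(1) AS ct
FROM {{ ref('dim_customer') }}
GROUP BY bk_customer_id
HAVING COUNT(1) > 1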


Naming conventions in the data model

Our sample data quality testing scheme involves a few rules for our data model:

  • Fact tables are to be prefixed with fact_

  • Dimension tables are to be prefixed with dim_

  • Primary keys which are surrogate keys in dimensions are to be prefixed with pksk_

  • Foreign keys which are surrogate keys in fact tables are to be prefixed with fksk_

  • Business keys in both fact tables and dimensions are to be prefixed with bk_

The actual naming conventions don't matter, as long as they are consistent and can be utilized to perform dynamic tests.
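To make the conventions concrete, here is what a customer dimension and an orders fact might look like (the table and column names are invented purely for illustration; the misc dataset is reused from the examples later in the post):


CREATE TABLE misc.dim_customer (
	pksk_customer STRING,    --surrogate key of the dimension
	bk_customer_id STRING,   --business key from the source system
	customer_name STRING
);

CREATE TABLE misc.fact_orders (
	fksk_customer STRING,    --surrogate foreign key pointing at dim_customer.pksk_customer
	bk_order_id STRING,      --business key of the order
	order_amount NUMERIC
);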


New tables to store test results

Next, we will create two tables to store our test results: one table for uniqueness tests, and one table for foreign key / orphaned row tests. Note that facts and dimensions will be tested for uniqueness, while only facts will be tested for orphaned rows.


See the DDL below. Note that we will be storing the number of violations for each test - storing the actual duplicate or orphaned values could use a lot of space, and it's easy enough to query for duplicates and orphans after test results are distributed.



CREATE TABLE misc.duplicate_keys (
	time_detected_UTC TIMESTAMP,
	table_schema STRING,
	table_name STRING,
	cols STRING,
	duplicates INT64
);


CREATE TABLE misc.orphaned_facts (
	time_detected_UTC TIMESTAMP,
	table_schema STRING,
	table_name STRING,
	column_name STRING,
	orphans INT64
);
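

For example, if a row lands in duplicate_keys reporting violations on a hypothetical misc.dim_customer table with cols of bk_customer_id, the offending values are one ad hoc query away:


SELECT bk_customer_id, COUNT(1) AS ct
FROM misc.dim_customer
GROUP BY bk_customer_id
HAVING COUNT(1) > 1;


The stored procedure below also assembles this style of query into a debug_statement column as it runs.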


Stored procedure and scheduled query to perform tests

The key to performing data quality tests dynamically, even as new tables and columns are added, is to utilize INFORMATION_SCHEMA along with EXECUTE IMMEDIATE.
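

Here is a stripped-down sketch of that mechanism before diving into the full procedure; the row-count check is purely illustrative, and only the misc dataset name is taken from the real code below:


--Illustrative only: build one statement that counts rows in every fact table, then run it dynamically
--(assumes at least one fact table exists; EXECUTE IMMEDIATE will not accept a NULL string)
DECLARE check_sql STRING;
SET check_sql = (
  SELECT STRING_AGG(
    CONCAT('SELECT "', table_name, '" AS table_name, COUNT(1) AS row_count FROM misc.', table_name),
    ' UNION ALL ')
  FROM misc.INFORMATION_SCHEMA.TABLES
  WHERE table_name LIKE 'fact%'
);
EXECUTE IMMEDIATE check_sql;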


For my testing, I've written a stored procedure, detect_uniques_and_orphans, which performs the following steps:

  • Queries INFORMATION_SCHEMA for all pksk_ and bk_ columns, storing the results and the testing SQL in a temp table. Compound business keys are supported.

  • Queries INFORMATION_SCHEMA for all fksk_ columns in fact tables, storing the results and the testing SQL in a temp table.

  • Loops through both temp tables, executing the stored SQL statements which perform the tests, and storing the data quality test results in duplicate_keys and orphaned_facts.

Finally, the procedure is called once per day with a scheduled query (shown after the procedure definition below).


CREATE PROCEDURE misc.detect_uniques_and_orphans() --created in the misc dataset, alongside the results tables
BEGIN

DECLARE stmt STRING;
SET stmt = '';


--Get a list of unique key columns (surrogate keys and business keys), along with the SQL to test each combination
CREATE OR REPLACE TEMPORARY TABLE temp_uniques
AS
SELECT 
table_schema
, table_name
, CASE
    WHEN COLUMN_NAME LIKE 'pksk%' THEN 'primary key / surrogate key'
    ELSE 'business key'
  END AS unique_type
, STRING_AGG(column_name) AS cols
, CONCAT('SELECT ', STRING_AGG(column_name), ' , COUNT(1) AS Ct FROM ', table_schema, '.', table_name, ' GROUP BY ', STRING_AGG(column_name), ' HAVING COUNT(1) > 1;' ) AS debug_statement
, CONCAT('''
        INSERT INTO misc.duplicate_keys
        (time_detected_utc, table_schema ,table_name,cols,duplicates)
        SELECT current_timestamp() AS time_detected,  "''',  table_schema, '","' ,table_name , '","', STRING_AGG(column_name), '", COUNT(1) AS duplicates FROM (SELECT ', STRING_AGG(column_name), ' , COUNT(1) AS Ct FROM ', table_schema, '.', table_name, ' GROUP BY ', STRING_AGG(column_name), ' HAVING COUNT(1) > 1) x ; ') AS statement
, 0 AS processed
FROM misc.INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME LIKE 'pksk_%' OR COLUMN_NAME LIKE 'bk_%'
GROUP BY 1, 2, 3;

--Get a list of foreign key columns to dimension surrogate keys that are being used in our fact tables
CREATE OR REPLACE TEMP TABLE temp_foreign_keys
AS
SELECT table_schema, table_name, column_name
FROM misc.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME LIKE 'fact%'
AND COLUMN_NAME LIKE 'fksk%';

--Determine which dimension table these go with
CREATE OR REPLACE TEMP TABLE temp_foreign_keys
AS
SELECT y.*
,CONCAT('INSERT INTO misc.orphaned_facts SELECT CURRENT_TIMESTAMP(), "', table_schema, '", "', table_name, '", "', column_name, '", (', statement_calc, ');') AS statement
FROM
(
  SELECT x.*
  ,CONCAT('SELECT COUNT(1) FROM ', table_schema, '.', table_name, ' WHERE ', column_name, ' NOT IN (SELECT ', dim_column ,' FROM ', dim_schema, '.', dim_table, ')') AS statement_calc
  , 0 AS processed
  FROM
  (
    SELECT a.*, b.table_schema AS dim_schema, b.table_name AS dim_table, b.cols AS dim_column
    FROM temp_foreign_keys a
    INNER JOIN temp_uniques b
      ON REPLACE(a.column_name, 'fksk', 'pksk') = b.cols
  ) x
) y;


--Process the possible duplicate keys
WHILE (SELECT COUNT(1) FROM temp_uniques WHERE Processed = 0) > 0 DO
  SET stmt = (SELECT MIN(Statement) FROM temp_uniques WHERE Processed = 0);
  EXECUTE IMMEDIATE (stmt);
  UPDATE temp_uniques SET Processed = 1 WHERE Statement = stmt;
END WHILE;

--Process the possible orphaned facts
WHILE (SELECT COUNT(1) FROM temp_foreign_keys WHERE Processed = 0) > 0 DO
  SET stmt = (SELECT MIN(Statement) FROM temp_foreign_keys WHERE Processed = 0);
  EXECUTE IMMEDIATE (stmt);
  UPDATE temp_foreign_keys SET Processed = 1 WHERE Statement = stmt;
END WHILE;

END;
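

The scheduled query body itself is nothing more than a CALL, assuming the procedure was created in the misc dataset as above:


--Run once per day via a BigQuery scheduled query
CALL misc.detect_uniques_and_orphans();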

Tying it all together

Once our data quality tests are recording information, it's trivial to set up a Data Studio report to list the current day's dupes and orphans, along with a schedule to email the results.
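

A sketch of the query behind such a report, pulling only the current day's findings from both results tables:


SELECT 'duplicate keys' AS test_type, table_schema, table_name, cols AS tested_columns, duplicates AS violations
FROM misc.duplicate_keys
WHERE DATE(time_detected_UTC) = CURRENT_DATE()
UNION ALL
SELECT 'orphaned facts', table_schema, table_name, column_name, orphans
FROM misc.orphaned_facts
WHERE DATE(time_detected_UTC) = CURRENT_DATE();


From there, Data Studio's scheduled email delivery takes care of distribution.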