I've seen countless situations in which a developer begins coding a data integration routine without planning and without performing the all-important data profiling stage. They receive vague requirements from the requester, go back to their desk, and start coding immediately, without any foreknowledge of the profile of the data in the tables they are mashing together. What is the range of expected values? Even though the source table defines all strings as VARCHAR(MAX), is there an upper limit to the size we should use in our target table? What size should the description column be in the target table? For calculations, should we use a float, or a decimal with a specific precision and scale?
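A few minutes with an ad-hoc query is usually enough to answer these questions up front. As a rough sketch (the table src.orders and its columns are hypothetical stand-ins for whatever sources you are integrating):

SELECT
COUNT(*) AS total_rows
,MIN(order_amount) AS min_amount -- range of expected values
,MAX(order_amount) AS max_amount
,MAX(LENGTH(description)) AS max_descr_length -- how wide the target column needs to be
,APPROX_COUNT_DISTINCT(customer_id) AS approx_customers
,COUNTIF(description IS NULL) AS null_descriptions
FROM src.orders;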
These are all important questions that can be investigated easily by taking a few minutes to write a query like the one above against the source tables. However, when many source tables are involved, multiplying those few minutes by many tables really adds up. Other, more statistically oriented development environments such as R have a built-in function for gathering a basic data profile of a table (or dataframe, in R's case). Below is the output of R's well-known summary function, run against the built-in mtcars dataframe. We can see that string columns are summarized differently than numerical columns, and that numerical columns get basic statistics describing the range and distribution of the values they contain.

BigQuery does not have a built-in equivalent to R's summary function, but we can make something similar with a stored procedure. Our aim will be to provide the following output values, some of which mirror equivalents in R's summary function:
Min value
1st quartile
Median
Mean
3rd quartile
Max value
Approximate count of distinct values (approximate due to performance reasons)
Count of NULL values for the column
Total count of values (useful to compute the ratio of NULLs to non-NULLs)
These extra attributes beyond what R provides are especially useful in a database-centered context. Additionally, for STRING columns we will compute the min, max, and so on of the string's length rather than of the string itself. The complete code can be found on GitHub, and a rough sketch of the BigQuery building blocks behind these measures follows below.
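BigQuery already has the building blocks for most of these measures. As a minimal sketch for a single numeric column (the column x and table misc.some_table are hypothetical), APPROX_QUANTILES with 4 buckets returns an array whose elements at offsets 1 through 3 are the quartiles:

SELECT
MIN(x) AS min_val
,APPROX_QUANTILES(x, 4)[OFFSET(1)] AS first_quartile
,APPROX_QUANTILES(x, 4)[OFFSET(2)] AS median
,AVG(x) AS mean_val
,APPROX_QUANTILES(x, 4)[OFFSET(3)] AS third_quartile
,MAX(x) AS max_val
,APPROX_COUNT_DISTINCT(x) AS distinct_vals -- approximate, for performance
,COUNTIF(x IS NULL) AS null_vals
,COUNT(*) AS total_count
FROM misc.some_table;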
Table Summary stored procedure
Our stored procedure takes two parameters: tbl_name, in dataset.table format, and display_bools_as_bools, a BOOL. display_bools_as_bools is a flag that determines whether, for example, the "average value of a boolean column" should only ever be reported as TRUE/FALSE, or whether 0.4 could be returned to indicate that 40% of the values are TRUE and 60% are FALSE.
CREATE OR REPLACE PROCEDURE misc.table_summary(tbl_name STRING, display_bools_as_bools BOOL)
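As an illustration only (not the procedure's exact code), the flag might be applied to a boolean column's mean along these lines, where mean_val stands in for AVG(CAST(bool_col AS INT64)):

DECLARE display_bools_as_bools BOOL DEFAULT FALSE;
DECLARE mean_val FLOAT64 DEFAULT 0.4; -- stand-in for AVG(CAST(bool_col AS INT64))
SELECT IF(display_bools_as_bools
,IF(mean_val >= 0.5, 'TRUE', 'FALSE') -- collapse back to a boolean label
,CAST(mean_val AS STRING)) AS mean_display; -- report the ratio of TRUE values: '0.4'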
Next, we declare the strings we will use to build up the SQL statement that queries the statistics for each column in the table. The main template string contains placeholders (denoted between <<>> brackets) that will be swapped out depending on the name and type of the column being queried.
DECLARE sql_select STRING;
DECLARE sql_insert STRING;
DECLARE base_date STRING;

SET base_date = '1900-01-01';

/* Create a SQL string which will interrogate the rows for each column,
   including placeholders to swap in column names and table names */
SET sql_select = '''
INSERT INTO temp_processed_rows
SELECT "<<alwayscolumn>>" AS col_name
...
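The template is truncated above (the complete version is in the GitHub repository), but given the output columns listed earlier, its expanded shape is roughly the following. This is a reconstruction, so the exact expressions may differ from the repository version; the <<>> placeholders remain in place until they are swapped out per column:

INSERT INTO temp_processed_rows
SELECT "<<alwayscolumn>>" AS col_name
,<<ordinalposition>> AS ordinal_position
,"<<datatype>>" AS data_type
,CAST(MIN(<<column>>) AS STRING) AS min_val
,CAST(APPROX_QUANTILES(<<column>>, 4)[OFFSET(1)] AS STRING) AS first_quartile
,CAST(APPROX_QUANTILES(<<column>>, 4)[OFFSET(2)] AS STRING) AS median
,CAST(AVG(<<column>>) AS STRING) AS mean_val
,CAST(APPROX_QUANTILES(<<column>>, 4)[OFFSET(3)] AS STRING) AS third_quartile
,CAST(MAX(<<column>>) AS STRING) AS max_val
,APPROX_COUNT_DISTINCT(<<column>>) AS distinct_vals
,COUNTIF(<<column>> IS NULL) AS null_vals
,COUNT(*) AS total_count
FROM <<table>>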
After this, we query INFORMATION_SCHEMA and manipulate the string template declared above to produce one SQL string per column, storing the resulting SQL strings in a temp table. Note that the column's data type determines how the template is manipulated to gather the statistics.
/* Use the SQL string to load a version of the string into a temp table, swapping out for row and column names.
This temp table will be looped over to process each column
*/
CREATE OR REPLACE TEMPORARY TABLE temp_data_to_process
AS
SELECT REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(sql_select, '<<column>>',
CASE
WHEN data_type = 'INT64' THEN column_name
WHEN data_type = 'FLOAT64' THEN column_name
WHEN data_type LIKE '%NUMERIC%' THEN column_name
WHEN data_type = 'STRING' THEN CONCAT('LENGTH(', column_name, ')')
WHEN data_type LIKE '%BYTE%' THEN CONCAT('LENGTH(', column_name, ')')
WHEN data_type LIKE '%ARRAY%' THEN CONCAT('ARRAY_LENGTH(', column_name, ')')
WHEN data_type LIKE '%TIMESTAMP%' THEN CONCAT('TIMESTAMP_DIFF(', column_name, ", TIMESTAMP '1900-01-01', MICROSECOND)")
WHEN data_type LIKE '%DATETIME%' THEN CONCAT('DATETIME_DIFF(', column_name, ", DATETIME '1900-01-01', MICROSECOND)")
WHEN data_type LIKE '%DATE%' THEN CONCAT('DATE_DIFF(', column_name, ", DATE '1900-01-01', DAY)")
WHEN data_type LIKE '%TIME%' THEN CONCAT('TIME_DIFF(', column_name, ", TIME '00:00:00', MICROSECOND)")
WHEN data_type LIKE '%BOOL%' THEN CONCAT('CAST(', column_name, ' AS INT64)')
END
), '<<table>>', tbl_name), '<<alwayscolumn>>', column_name), '<<datatype>>', data_type), '<<ordinalposition>>', CAST(ordinal_position AS STRING)) AS sql_to_run
,data_type
,0 AS processed
FROM misc.INFORMATION_SCHEMA.COLUMNS
WHERE CONCAT(table_schema, '.', table_name) = tbl_name
AND
(
data_type LIKE '%INT%'
OR data_type LIKE '%FLOAT%'
OR data_type LIKE '%NUMERIC%'
OR data_type LIKE '%STRING%'
OR data_type LIKE '%BYTE%'
OR data_type LIKE '%ARRAY%'
OR data_type LIKE '%TIME%'
OR data_type LIKE '%DATE%'
OR data_type LIKE '%BOOL%'
);
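The effect of the CASE expression is that, for example, a STRING column has its <<column>> placeholder replaced with LENGTH(column_name), a BOOL column with CAST(column_name AS INT64), and numeric columns pass through unchanged. If you run the procedure body interactively as a script, a quick way to sanity-check the generated statements before they execute is a query such as:

-- Debugging aid (not part of the procedure): inspect the generated per-column statements
SELECT data_type, sql_to_run
FROM temp_data_to_process
WHERE processed = 0
ORDER BY data_type, sql_to_run;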
Step three consists of looping over the temporary table that we built, executing each stored SQL string with EXECUTE IMMEDIATE to gather the actual statistics for each column, storing the results in yet another temp table.
/* Loop over the table of prepared SQL strings, executing them, and storing the statistical results in a temp table */
CREATE OR REPLACE TEMPORARY TABLE temp_processed_rows
(
col_name STRING
,ordinal_position INT64
,data_type STRING
,min_val STRING
,first_quartile STRING
,median STRING
,mean_val STRING
,third_quartile STRING
,max_val STRING
,distinct_vals INT64
,null_vals INT64
,total_count INT64
);

WHILE (SELECT COUNT(1) FROM temp_data_to_process WHERE processed = 0) > 0 DO
SET sql_insert = (SELECT MIN(sql_to_run) FROM temp_data_to_process WHERE processed = 0);
EXECUTE IMMEDIATE sql_insert;
UPDATE temp_data_to_process SET processed = 1 WHERE sql_to_run = sql_insert;
END WHILE;
Step four consists of cleaning up our output. Since TIMESTAMP columns and certain other types do not play nicely with aggregate functions like AVG, we compiled their statistics by converting values into differentials from a base date. Now we need to convert those differentials back into their proper types.
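The cleanup is essentially the inverse of the DIFF expressions used earlier. A minimal sketch for a TIMESTAMP column's mean, assuming mean_val holds the microsecond offset as a STRING (the DATE and TIME cases follow the same pattern with DATE_ADD and TIME_ADD):

SELECT TIMESTAMP_ADD(TIMESTAMP '1900-01-01'
,INTERVAL CAST(CAST(mean_val AS FLOAT64) AS INT64) MICROSECOND) AS mean_ts
FROM temp_processed_rows
WHERE data_type = 'TIMESTAMP';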
BEFORE

AFTER

If we zoom in, we can see that the "ts" column (ordinal 9) had its statistics computed against very large integer values, which were later transformed back into actual timestamp values in the final result.
A sample call to the stored procedure looks like this:
CALL misc.table_summary('misc.fact_sales_tiny', FALSE)