Refactoring BigQuery public datasets pt. 1: nested and repeated in FDA food

Data munging is one of the most time consuming exercises in the data analysis process. It's not unheard of to spend 90% of the effort in data cleansing with the remaining 10% dedicated to analysis or model building.

In today's exercise, we are diving into one of the BigQuery public datasets: fda_food.food_events. For those not aware, within BigQuery, there are many, many publicly available datasets pre-loaded with data to query against (as long as you have a GCP account). This provides an attractive option if you happen to find the same dataset pre-loaded in Google's cloud, whereas otherwise, you may end up downloading files from another site and uploading into your GCP account.

The problem

The food_events table contains a candidate key (report_number), six columns which contain comma-separated lists, and five other attribute columns. It appears as though the food_events table has header and detail information within the same table. This table is not the most readily queryable in its current format.

Our mission

Our mission is to wrap the food-events table into a view, converting the comma-separated columns into nested and repeated columns where appropriate, thereby facilitating better design and easier queries.

Task 1: Data profiling

Looking at a sample row, we can make some assumptions, and then dig in further:

report_number:
- 104776
reactions:
- VOMITING,MUSCULOSKELETAL CHEST PAIN,DIARRHOEA,DECREASED APPETITE,BALANCE DISORDER,ABDOMINAL PAIN
outcomes:
- Patient Visited Healthcare Provider,Patient Visited ER
products_brand_name:
- MARKET PANTRY WHOLE MILK,CEREAL
products_industry_code:
- 9,5
products_role:
- SUSPECT,CONCOMITANT
products_industry_name:
- Milk/Butter/Dried Milk Prod,Cereal Prep/Breakfast Food
date_created:
- 2008-07-16
date_started:
- 2008-07-01
consumer_gender: (NULL)
consumer_age: (NULL)
consumer_age_unit: (NULL)

There are five elements in reactions, two in outcomes, two in products_brand_name, two in products_industry_code, two in products_role, and two in products_industry_name. So far, it looks as though the number of elements in reactions are not correlated to the other comma-separated columns. Furthermore, since all "products_" columns have two elements, and begin with the same term, there is a good possibility that all "products" values go together in their own records.

The query below shows us how many elements are in each multi-valued column

SELECT SUM(ARRAY_LENGTH(SPLIT(reactions))) AS reactions_length ,SUM(ARRAY_LENGTH(SPLIT(outcomes))) AS outcomes_length ,SUM(ARRAY_LENGTH(SPLIT(products_brand_name))) AS brand_name_length ,SUM(ARRAY_LENGTH(SPLIT(products_industry_code))) AS industry_code_length ,SUM(ARRAY_LENGTH(SPLIT(products_role))) AS role_length ,SUM(ARRAY_LENGTH(SPLIT(products_industry_name))) AS industry_name_length FROM `bigquery-public-data.fda_food.food_events`

It is obvious that reactions are completely separate from outcomes and products - which makes sense. One event may yield multiple reactions. Additionally, while 121,000 outcomes is close to 131,000 products, outcomes are still probably somewhat separate from products. Two products may result in one outcome, and one product may result in two outcomes. More interesting is that product brand names contain many more elements than their respective industries and roles.

Task 3: Investigating products

The following query will give us some example data to examine, concerning records with more elements in brand_name vs. industry_code

SELECT products_brand_name ,products_industry_code FROM `bigquery-public-data.fda_food.food_events` WHERE report_number IN ( SELECT report_number FROM `bigquery-public-data.fda_food.food_events` GROUP BY report_number HAVING SUM(ARRAY_LENGTH(SPLIT(products_brand_name))) > SUM(ARRAY_LENGTH(SPLIT(products_industry_code))) )

Values such as "QUAKER OATS, OLD FASHIONED" look like a single product to me. I'm assuming that Old Fashioned is the particular style of Oats. So how do we separate true delimited values from terms which contain embedded commas? Looking at some of our prior queries, comma-separated values appear to contain no spaces on either side of the space. Replacing <comma / space> with another value before splitting may help us out.

SELECT SUM(ARRAY_LENGTH(SPLIT(REPLACE(reactions, ', ', '--')))) AS reactions_length ,SUM(ARRAY_LENGTH(SPLIT(REPLACE(outcomes, ', ', '--')))) AS outcomes_length ,SUM(ARRAY_LENGTH(SPLIT(REPLACE(products_brand_name, ', ', '--')))) AS brand_name_length ,SUM(ARRAY_LENGTH(SPLIT(REPLACE(products_industry_code, ', ', '--')))) AS industry_code_length ,SUM(ARRAY_LENGTH(SPLIT(REPLACE(products_role, ', ', '--')))) AS role_length ,SUM(ARRAY_LENGTH(SPLIT(REPLACE(products_industry_name, ', ', '--')))) AS industry_name_length FROM `bigquery-public-data.fda_food.food_events`

In this situation, I've replaced commas followed by spaces with two dashes. For example, "QUAKER OATS, OLD FASHIONED" becomes "QUAKER OATS--OLD FASHIONED." As seen below, this mod brings our data element counts into much closer alignment - but we still need to get brand_names to the same element count as the other product columns.

Upon further examination, there are data values with numbers containing commas.

Using some regex to substitute a space for the comma separating the thousands-place for numbers in brand_name results in a small reduction for the sum of brand_name_length. We are down to 131,587 for total length of elements in brand_name, vs. 131,261 for the other product columns. This is a difference of just 0.25%. There are simply some product brands with commas, without spaces between words, that are too difficult to systematically detect. We can live with this small data quality issue and allow them to remain as orphaned brands.

SELECT SUM(ARRAY_LENGTH(SPLIT(REPLACE(reactions, ', ', '--')))) AS reactions_length ,SUM(ARRAY_LENGTH(SPLIT(REPLACE(outcomes, ', ', '--')))) AS outcomes_length ,SUM( CASE WHEN REGEXP_INSTR(products_brand_name, '\\d(,\\d)') > 0 THEN ARRAY_LENGTH(SPLIT(REPLACE(CONCAT(LEFT(products_brand_name, REGEXP_INSTR(products_brand_name, '\\d(,\\d)') - 1), ' ' , RIGHT(products_brand_name, LENGTH(products_brand_name) - REGEXP_INSTR(products_brand_name, '\\d(,\\d)') )) , ', ', '--'))) ELSE ARRAY_LENGTH(SPLIT(REPLACE(products_brand_name, ', ', '--'))) END ) AS brand_name_length ,SUM(ARRAY_LENGTH(SPLIT(REPLACE(products_industry_code, ', ', '--')))) AS industry_code_length ,SUM(ARRAY_LENGTH(SPLIT(REPLACE(products_role, ', ', '--')))) AS role_length ,SUM(ARRAY_LENGTH(SPLIT(REPLACE(products_industry_name, ', ', '--')))) AS industry_name_length FROM `bigquery-public-data.fda_food.food_events`

Task 4: Tying it all together

To summarize and tie everything together, we are going to build a view with the following data-cleansing routines:

Replace commas followed by spaces with two dashes in our six comma-delimited columns
Replace commas within numbers with a space in our products_brand_name column
Transform the reactions column into an array
Transform the outcomes column into an array
Transform the four products_ columns into a struct with arrays of mostly equal length
- products_brand_name columns containing more elements than their sibling products_ columns will have some elements orphaned, but columns containing the length of elements for brand_name and industry_code will will be returned for an easy way to weed out the orphans.

/* This view turns comma-separated columns in food-events into nested and repeated columns, for better querying. The columns "reactions" and "outcomes" are turnd into arrays. The columns "products_brand_name", "products_industry_code," "products_role," & "products_industry_name" are turned into a struct with separate columns for each of the four columns. Multiple data-cleansing steps are involved, and a small percentage (0.25%) of rows end up with extra products_brand_names. */ WITH food_events_cleansed AS ( SELECT report_number ,SPLIT(reactions, ',') AS reactions ,SPLIT(outcomes, ',') AS outcomes ,STRUCT( CASE WHEN REGEXP_INSTR(products_brand_name, '\\d(,\\d)') > 0 THEN (SPLIT(REPLACE(CONCAT(LEFT(products_brand_name, REGEXP_INSTR(products_brand_name, '\\d(,\\d)') - 1), ' ' , RIGHT(products_brand_name, LENGTH(products_brand_name) - REGEXP_INSTR(products_brand_name, '\\d(,\\d)') )) , ', ', '--'))) ELSE (SPLIT(REPLACE(products_brand_name, ', ', '--'))) END AS brand_name ,SPLIT(REPLACE(products_industry_code, ', ', ' -- ')) AS industry_code ,SPLIT(REPLACE(products_role, ', ', ' -- ')) AS role ,SPLIT(REPLACE(products_industry_name, ', ', ' -- ')) AS industry_name ) AS products ,date_created ,date_started ,consumer_gender ,consumer_age ,consumer_age_unit ,ARRAY_LENGTH(SPLIT(REPLACE(products_industry_code, ', ', ' -- '))) AS industry_code_length ,ARRAY_LENGTH(CASE WHEN REGEXP_INSTR(products_brand_name, '\\d(,\\d)') > 0 THEN (SPLIT(REPLACE(CONCAT(LEFT(products_brand_name, REGEXP_INSTR(products_brand_name, '\\d(,\\d)') - 1), ' ' , RIGHT(products_brand_name, LENGTH(products_brand_name) - REGEXP_INSTR(products_brand_name, '\\d(,\\d)') )) , ', ', '--'))) ELSE (SPLIT(REPLACE(products_brand_name, ', ', '--'))) END) AS brand_name_length FROM `bigquery-public-data.fda_food.food_events` ) SELECT * FROM food_events_cleansed;

Refactoring BigQuery public datasets pt. 1: nested and repeated in FDA food_events

The problem

Our mission

Task 1: Data profiling

Task 3: Investigating products

Task 4: Tying it all together

Recent Posts

Comments