Refactoring your DW landscape for BigQuery and other cloud warehouses
- timandrews1
- Jun 18, 2022
- 5 min read
Moving off a traditional, mature data warehouse can be the perfect time to refactor and rebuild your data warehouse schema into something more consistent and straightforward.
Over time, a data warehouse ecosystem can become ensnared in one-offs and accumulate multiple, competing design patterns. The result is confusion, difficulty, and even development paralysis: data professionals become hesitant to choose an approach for their project, unsure which pattern matches company standards and wary of breaking anything.
Taking the time to eliminate technical debt will pay dividends - and not just in the long run. Quite often, development efficiencies appear immediately after standardization. The standardization process itself can also be fun and rewarding, as an architect or team iterates toward better processes that fit both industry best practices and company-specific circumstances.
The following are some ideas that I have found useful in improving schemas and DW landscapes. Many of these ideas are not one-size-fits-all and can be tweaked or ignored as needed.
Establish consistency in the data ingestion process
Some teams choose a data ingestion pattern that lands data in cloud storage before moving it into the data warehouse. Other teams bypass cloud storage and use an API to move data directly into the DW.
Choosing a consistent set of rules will ensure that everything remains easy to maintain, and new processes are able to be built quickly. Here are some options:
All data is first moved to cloud storage, and then moved into GBQ.
Only unstructured data is stored in cloud storage. Relational data and flat files are moved directly into GBQ.
There are no unstructured sources, and cloud storage is bypassed entirely.
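Whichever rule the team adopts, the storage-first pattern can be reduced to a single, repeatable statement. A minimal sketch, assuming a hypothetical company-dw-landing bucket and staging table (newer BigQuery releases support the LOAD DATA statement; older pipelines can use the equivalent bq load command):

```sql
-- Load the current landed file(s) from cloud storage into a staging table.
-- Bucket, path, and table names are illustrative, not prescriptive.
LOAD DATA INTO staging.sales_lineitem
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['gs://company-dw-landing/pos/sales/current/*.csv']
);
```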
Furthermore, when using cloud storage, naming consistency matters, as does file and folder layout. Will files be appended to? Will the current file be stored in a higher-level directory, with old files archived in a subfolder? Don't leave these decisions up to individual developers. Consistency is key: others should not have to spend time deciphering code that ought to be boilerplate.
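One way to settle these questions once, for every feed, is to fix a single path convention. A hypothetical layout, with the current file in a predictable location and prior loads archived by date:

```
gs://company-dw-landing/<source_system>/<feed>/current/<feed>.csv
gs://company-dw-landing/<source_system>/<feed>/archive/<yyyy>/<mm>/<dd>/<feed>.csv
```

Any developer who knows the convention can find, load, or replay a feed without reading the pipeline code that produced it.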
Add a view-layer on top of all tables, for end-user and reporting consumption
Adopting a policy of creating a view on top of every data warehouse table provides a few benefits:
Depending on the IDE used, tables and views may be displayed in separate sections. Exposing all tables as views keeps everything together for developers and users when browsing for tables.
A set of company-specific, minor transformations can be applied in views (think month-to-date calculated measures for example), rather than relying on developers and analysts to apply these at run-time with slightly different approaches.
Quick triaging can be applied at the view-level by filtering out bad rows, or replacing errant data - be sure to apply the changes at the transformation layer afterwards.
The view-layer can serve as a bridge between refactoring for developers and preserving reporting connections for users. For example, if the team decides to standardize on naming conventions to make development maintenance and debugging easier, views can temporarily use the old names, so that all report connections do not need to be changed at the same time as table DDL changes.
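As a sketch of that bridge, a view can temporarily alias standardized table columns back to the old report-facing names while report connections are migrated (all dataset, table, and column names here are hypothetical):

```sql
-- Reporting view preserving legacy column names during a renaming effort.
CREATE OR REPLACE VIEW reporting.vwFactSales AS
SELECT
  sales_key      AS SlsKey,    -- legacy name, kept until reports are migrated
  sales_amount   AS SalesAmt,  -- legacy name, kept until reports are migrated
  sales_quantity AS sales_quantity
FROM misc.factSales;
```

Once all reports point at the new names, the aliases can be dropped without touching the table DDL again.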
Apply smart naming to tables, views and columns
The intent of specific columns in tables and views can be ambiguous. It's common practice to prefix tables with "fact" or "dim," but prefixing or suffixing columns with indicative names can also help. For example, if your particular use-case or team produces a lot of tables designed as periodic snapshots, consider adding terminology to the table names so that users know the layout of the table before even writing a query (e.g. "factSalesDailySnapshot.")
Additionally, it can help users to understand which columns are semi-additive, and which are fully additive. factSalesDailySnapshot may include the column "fa_SalesQuantity" (the number of sales that happened on that day, the 'fa' indicating a measure that is fully additive) and the column "sa_InventoryBalance" (the quantity on hand for the day, which can be summed across all dimensions except time, with the 'sa' indicating a semi-additive measure).
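The prefixes then signal how each measure may be aggregated. A sketch against the factSalesDailySnapshot example (the dataset name and snapshot_date column are assumptions):

```sql
-- fa_ (fully additive): safe to SUM across every dimension, including time.
SELECT SUM(fa_SalesQuantity) AS total_sales_quantity
FROM misc.factSalesDailySnapshot
WHERE snapshot_date BETWEEN '2022-06-01' AND '2022-06-30';

-- sa_ (semi-additive): sum across products or stores, but never across days;
-- pick a single point in time instead, such as the month-end balance.
SELECT SUM(sa_InventoryBalance) AS month_end_inventory
FROM misc.factSalesDailySnapshot
WHERE snapshot_date = '2022-06-30';
```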
Smart naming can even speed up metadata queries for planning and scripting. Questions such as "How many additive vs. semi-additive measures do we have?" can be answered directly. Depending on the BI tool sitting on top, it may also be possible to use INFORMATION_SCHEMA to script out BI-layer measure behavior.
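For instance, that question can be answered straight from the metadata, assuming the fa_/sa_ prefixes described above (the dataset name is illustrative):

```sql
-- Count measure columns by additivity, inferred from the naming convention.
SELECT
  CASE
    WHEN STARTS_WITH(column_name, 'fa_') THEN 'fully additive'
    WHEN STARTS_WITH(column_name, 'sa_') THEN 'semi-additive'
  END AS measure_type,
  COUNT(*) AS column_count
FROM misc.INFORMATION_SCHEMA.COLUMNS
WHERE STARTS_WITH(column_name, 'fa_')
   OR STARTS_WITH(column_name, 'sa_')
GROUP BY measure_type;
```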
Finally, as noted in a prior entry, prefixing foreign keys can help analysts determine how to join tables together. Similarly, since most cloud warehouses do not support primary key or unique constraints (some support them in name only, leaving them unenforced), prefixing columns that are intended to be unique with the table name provides immense benefit. Instead of spending time profiling data to determine which set of columns uniquely identifies rows, analysts can scan column names and have confidence that what they select and join on will not produce duplicates.
Enforce descriptions on database objects
Text descriptions on tables, views, columns, and other database objects should be applied at creation time, updated regularly, and monitored for completeness.
For example, a table called factSales may have the following description added, "Stores one row for each sales line-item transaction. Prior to 2017, data is sourced and loaded from LegacyPOSSystem. After 2017, data is sourced and loaded from NewHorizonsPOS."
Descriptions on columns are just as helpful, especially for metrics. Tables often contain similar, competing columns that look as though they serve the same purpose. Perhaps ExtendedAmount doesn't include freight, and most analysts will want freight included for an all-in cost. Or perhaps ExtendedAmount excludes returns, or does it? Call these distinctions out so that analysts know to choose the appropriate column.
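In BigQuery, both table- and column-level descriptions can be set in DDL, so they are easy to apply at creation time or retrofit later (dataset and description contents follow the illustrative examples above):

```sql
-- Table-level description.
ALTER TABLE misc.factSales
SET OPTIONS (
  description = 'Stores one row for each sales line-item transaction. Prior to 2017, data is sourced from LegacyPOSSystem. After 2017, data is sourced from NewHorizonsPOS.'
);

-- Column-level description clarifying a commonly confused metric.
ALTER TABLE misc.factSales
ALTER COLUMN ExtendedAmount
SET OPTIONS (
  description = 'Line-item amount excluding freight and including returns.'
);
```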
It's also easy to monitor and report out on missing descriptions. The following example in BigQuery looks for tables that don't have any descriptions added. Query output like this can even be routed to a ticketing or defect system to ensure that it is addressed by the team:
```sql
SELECT
  a.table_schema,
  a.table_name,
  b.option_value AS table_description
FROM misc.INFORMATION_SCHEMA.TABLES a
LEFT OUTER JOIN misc.INFORMATION_SCHEMA.TABLE_OPTIONS b
  ON  a.table_name   = b.table_name
  AND a.table_schema = b.table_schema
  AND b.option_name  = 'description'
WHERE b.option_value IS NULL
   OR TRIM(b.option_value) = '';
```
TL;DR
Certain aspects of your data warehouse's design require creativity up front (the architecture and general design) and can be fun for the team to work on. However, rules should be established that make it easy for future developers and analysts to build and consume data quickly, without having to puzzle out why one table was built one way and another table a different way.
It is also worth putting in the time to refactor the DW for consistency. These refactoring exercises pay immediate dividends: a new developer might spend three days untangling the nuances of a specific table that would have been obvious had the schema been consistent.