Predictive modeling workflow in BigQuery vs. R Studio pt. 1: ad-hoc modeling
- timandrews1
- Jun 11, 2022
- 3 min read
One of the advantages of BigQuery over other cloud data warehouses is the inclusion of machine-learning models within the SQL syntax. In BigQuery, it's possible to train and use predictive models in the same manner in which one might write a stored procedure or a view.
In a past life, working with a SQL Server backend, my typical workflow was to use R Studio with an RODBC connection to SQL Server, or to export data from SQL to a csv/Excel file and then load it into R.
This post will explore the differences in workflow and outcomes between modeling directly in BigQuery against the older methodology of integrating R Studio.
Part 1 of this series will explore workflow differences and approaches for ad-hoc modeling: scenarios where you're required to analyze a situation, describe an event, explain circumstances or fix some data. Part 2 will examine productionalizing models and the differences with each approach.
Model background
The purpose of my modeling exercise will be to build a linear regression model to predict the daily output of my solar panels, given input values for the high and low temperatures, whether or not precipitation is forecasted (or was observed), the number of daylight hours (summer has more daylight than winter), and the "sunniness factor." Sunniness is a feature-engineered value that I derived from the overall forecast description for the day. For example, "Thunderstorms" may have a sunniness factor of 0, whereas "Mostly sunny" is 0.85.
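For a concrete sense of how that works, the mapping can be expressed as a simple CASE expression when building the input view. This is only an illustrative sketch: the forecast_description column, the source table, and every value other than the 0 and 0.85 mentioned above are assumptions, not my actual mapping.
-- Illustrative sketch of a sunniness mapping; names and most values are assumed
SELECT
  the_date,
  CASE forecast_description            -- hypothetical column holding the daily forecast text
    WHEN 'Thunderstorms' THEN 0.0      -- value from the example above
    WHEN 'Mostly cloudy' THEN 0.4      -- assumed
    WHEN 'Partly sunny'  THEN 0.65     -- assumed
    WHEN 'Mostly sunny'  THEN 0.85     -- value from the example above
    WHEN 'Sunny'         THEN 1.0      -- assumed
    ELSE 0.5
  END AS sunniness
FROM solar.daily_forecast              -- hypothetical source table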

Ad-hoc modeling in R Studio
Connecting to BigQuery from R Studio is very simple with the bigrquery and DBI packages. Running a query against BigQuery even reports the quantity of data billed. In the code below, I am performing the following steps:
Referencing my required libraries
Connecting to BigQuery
Building a SQL string and executing it
Splitting my result set into training and test sets - training is Sundays through Thursdays, and test is Fridays and Saturdays
Training the model
Viewing a summary of model stats
Predicting values in the testing set
Visualizing the actuals vs. predictions
The model results in a fairly respectable R-squared of 0.84 and a good p-value. This model is not going to be perfect and is more of a directional indicator of the level of solar production. It doesn't account for situations where we've been through multi-day winter storms that leave the panels covered in ice and snow even when it's sunny.
library(DBI)
library(ggplot2)
library(reshape2)

# Connect to BigQuery via the bigrquery DBI backend
con <- dbConnect(
  bigrquery::bigquery(),
  project = "my-project",
  dataset = "solar",
  billing = "my-project"
)

# Pull the model input view
sql <- "SELECT * FROM solar.v_power_model_input"
solarData <- dbGetQuery(con, sql, n = 100000)

# Train on Sunday through Thursday (dow 1-5)
trainingData <- solarData[solarData$dow >= 1 & solarData$dow <= 5, ]

# Test on Friday and Saturday (dow 6-7)
testingData <- solarData[solarData$dow > 5, ]

# Build the linear regression model
solarModel <- lm(label ~ min_temp + max_temp + day_precip + daylight + sunniness,
                 data = trainingData)
summary(solarModel)

# Predict based on the testing data (columns 2-6 are the feature columns)
predictions <- predict.lm(solarModel, newdata = testingData[, c(2, 3, 4, 5, 6)])

# Combine predictions with actuals
predictedData <- cbind(testingData, predictions)

# Plot actuals (blue) vs. predictions (orange)
p <- ggplot(predictedData, aes(x = the_date)) +
  geom_segment(aes(y = label, yend = predictions, x = the_date, xend = the_date)) +
  geom_point(aes(y = label, x = the_date), color = "blue") +
  geom_point(aes(y = predictions, x = the_date), color = "orange")
p


Overall, the workflow for creating predictive models in R Studio against Google BigQuery for descriptive purposes is intuitive and straightforward. The bigrquery package makes connecting easy, and the fact that visualizations are done in the same tool adds to the efficiency.
Ad-hoc modeling directly in BigQuery
The benefit of this approach is that all of the model training and predicting can be done directly in SQL, in the BigQuery query window:
CREATE OR REPLACE MODEL solar.m_predict_solar_production
OPTIONS (
  MODEL_TYPE = 'LINEAR_REG',
  OPTIMIZE_STRATEGY = 'AUTO_STRATEGY',
  DATA_SPLIT_METHOD = 'NO_SPLIT'
) AS
SELECT
  label,
  min_temp,
  max_temp,
  day_precip,
  daylight,
  sunniness
FROM solar.v_power_model_input
WHERE dow BETWEEN 1 AND 5
Obviously, much less code is required to build the model. There are no libraries to import and no connections to establish. Unsurprisingly, the code above produces the same R-squared that we saw in R.
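To check the fit statistics without leaving SQL, ML.EVALUATE returns metrics such as r2_score and mean_absolute_error for a linear regression model. A minimal sketch, assuming evaluation against the same Sunday-through-Thursday slice used for training:
-- Returns one row of metrics (r2_score, mean_absolute_error, mean_squared_error, ...)
SELECT *
FROM ML.EVALUATE(
  MODEL solar.m_predict_solar_production,
  (
    SELECT label, min_temp, max_temp, day_precip, daylight, sunniness
    FROM solar.v_power_model_input
    WHERE dow BETWEEN 1 AND 5
  )
)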

Using the model to predict Fridays and Saturdays is quick and easy as well. The following code produces the same predicted values that we saw in R.
SELECT *
FROM ML.PREDICT(
  MODEL solar.m_predict_solar_production,
  (
    SELECT label, min_temp, max_temp, day_precip, daylight, sunniness, the_date
    FROM solar.v_power_model_input
    WHERE dow BETWEEN 6 AND 7
  )
)
ORDER BY the_date
The main issue with this approach is that BigQuery would need to be wired up to a BI tool, or the results exported to Excel, for data visualization.
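For the export route, one option is BigQuery's EXPORT DATA statement, which writes query results to CSV files in Cloud Storage for charting elsewhere. A rough sketch, with a placeholder bucket path:
-- 'gs://my-bucket/...' is a placeholder; EXPORT DATA requires a single '*' wildcard in the URI
EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/solar/predictions_*.csv',
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT the_date, label, predicted_label
FROM ML.PREDICT(
  MODEL solar.m_predict_solar_production,
  (
    SELECT label, min_temp, max_temp, day_precip, daylight, sunniness, the_date
    FROM solar.v_power_model_input
    WHERE dow BETWEEN 6 AND 7
  )
)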
Conclusion
Personally, I will stick with R Studio for ad-hoc modeling against BigQuery. It's easy enough to connect, and the data visualization tools are built into R. The only thing that would make me hesitate is consistency across a team - using R Studio in this scenario requires that everyone checking out the code have the same packages installed.
I have a feeling that part 2 of the series will result in a different conclusion... but who knows, I like surprises.