Author : Thomas James
Thomas works as a data engineer at Beamlytics. He has developed and automated machine learning models, built Looker dashboards, and generated synthetic test data for our product, Beamlytics Lens. Please reach out to Thomas if you want guidance on automating your MLOps pipeline.
Email : Thomas@beamlytics.com
BigQuery Machine Learning (BigQueryML) revolutionises the machine learning paradigm by enabling users to build and operate models using simple SQL queries directly within Google BigQuery. This integration allows for seamless handling of large datasets without the need for data migration to external platforms, significantly reducing complexity and improving efficiency. Leveraging BigQuery’s robust infrastructure, BigQueryML supports various model types, including linear and logistic regression, k-means clustering, matrix factorization, time series forecasting, and deep neural networks. By democratising machine learning, BigQueryML empowers users to create, train, evaluate, and deploy models efficiently, all while utilising their existing SQL skills, making advanced data analysis and predictive modelling accessible to a wider audience.
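As a minimal illustration of the SQL-only workflow (the project, dataset, table, and column names below are hypothetical), a classification model can be trained and evaluated in two statements:

```sql
-- Train a logistic regression model on a hypothetical churn dataset.
CREATE OR REPLACE MODEL `my_project.my_dataset.churn_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['churned']
) AS
SELECT
  tenure_months,
  monthly_charges,
  contract_type,
  churned
FROM `my_project.my_dataset.customers`;

-- Evaluate the trained model; returns metrics such as precision and recall.
SELECT * FROM ML.EVALUATE(MODEL `my_project.my_dataset.churn_model`);
```

No data leaves BigQuery at any point: training, evaluation, and later prediction all run where the data already lives.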
Automating BigQueryML with BigQuery Scheduler
Automating the BigQueryML model process can significantly enhance productivity and ensure models stay current and continue to deliver accurate predictions. By leveraging BigQuery Scheduler, you can automate the entire workflow, including data cleaning, model retraining, and prediction generation. This automation minimises manual intervention, reduces errors, and ensures that models are continuously refreshed with the latest data. Setting up scheduled queries for each step—data cleaning, model training, and generating predictions—ensures a seamless and efficient process, allowing data scientists and analysts to focus on more strategic tasks while maintaining reliable and up-to-date machine learning models. This approach not only improves efficiency but also scales easily with increasing data volumes, providing a robust solution for ongoing data-driven insights.
Why Not Cloud Functions?
Initially, using Cloud Functions might seem like a viable option for automating BigQuery Machine Learning models due to its flexibility and event-driven nature. Cloud Functions allows you to write small, single-purpose functions that respond to various events within the Google Cloud ecosystem. Here’s how Cloud Functions can be used in this context:
Data Cleaning Function
- Trigger: This function can be triggered by events such as new data being uploaded to a Google Cloud Storage bucket.
- Process: The function executes SQL queries to clean the data in BigQuery, handling tasks like removing duplicates, normalising values, and dealing with missing data.
- Output: The cleaned data is stored back in BigQuery, ready for the next step.
Model Retraining Function
- Trigger: This function is triggered after the data cleaning process completes, often via a Pub/Sub message or another event.
- Process: The function checks if the newly inserted data exceeds a certain threshold and, if so, executes SQL queries to retrain the ML model.
- Output: The retrained model is stored in BigQueryML.
Prediction Generation Function
- Trigger: This function is triggered after the model retraining process completes.
- Process: The function uses the updated model to generate predictions based on new data.
- Output: The predictions are stored in BigQuery for further analysis or visualisation.
Complications with Cloud Functions
While Cloud Functions offer flexibility and integration with various Google Cloud services, they can introduce several complications:
- Token Expiration: Cloud Functions require authentication tokens to interact with other Google Cloud services. These tokens can expire, leading to failed executions unless managed carefully.
- HTTP Trigger Functions: Many Cloud Functions need HTTP triggers, which can add complexity to the setup and require additional configurations for security and access management.
- State Management: Managing state across multiple functions can be challenging, especially when functions need to communicate or pass data between each other.
These issues add complexity to the workflow, making it difficult to maintain and manage over time.
Benefits of BigQuery Scheduler
Instead of using Cloud Functions, automating BigQueryML with BigQuery Scheduler offers a more streamlined and reliable pipeline. BigQuery Scheduler allows you to schedule three main queries: data cleaning, model retraining, and prediction generation, ensuring a more efficient process.
- Simplicity: With BigQuery Scheduler, you don’t need to manage multiple functions or handle token expirations. Everything is handled within the BigQuery environment.
- Reliability: Scheduled queries run at predefined times, ensuring a consistent and predictable workflow without the risk of missed triggers or expired tokens.
- Efficiency: By scheduling queries directly in BigQuery, you eliminate the overhead of function orchestration and state management, focusing solely on the data processing and ML tasks.
Setting Up BigQuery Scheduler
BigQuery Scheduler is a feature that allows you to run SQL queries on a scheduled basis. Here’s how to set it up:
- Navigate to BigQuery in the Google Cloud Console:
- Open the Google Cloud Console.
- Go to the BigQuery section.
- Create a New Scheduled Query:
- In the BigQuery console, click on the “Scheduled Queries” tab.
- Click on “Create Scheduled Query”.
- Configure the Scheduled Query:
- Name: Give your scheduled query a descriptive name.
- Frequency: Set the frequency for the query to run. This can be hourly, daily, weekly, etc., depending on how often you need to update your model.
- Destination: Choose where the results of the query will be stored. This can be a new or existing table in BigQuery.
- SQL Query: This is where you’ll enter the SQL query that you want to automate.
Automating the Machine Learning Model
1. Automating the Data Cleaning Query
Data cleaning is a crucial step in preparing your data for machine learning. Set up a scheduled query for preprocessing the data, including steps such as handling missing values, normalising data, and removing duplicates. Schedule this query based on how often your raw data is updated.
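A scheduled cleaning query of this kind might look like the following sketch (the table and column names are assumptions, carried over from the earlier example):

```sql
-- Scheduled data-cleaning query: deduplicate on customer_id, drop rows
-- missing the label, and min-max normalise a numeric feature.
CREATE OR REPLACE TABLE `my_project.my_dataset.clean_data` AS
WITH deduped AS (
  SELECT * EXCEPT (row_num)
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER (
        PARTITION BY customer_id ORDER BY updated_at DESC
      ) AS row_num
    FROM `my_project.my_dataset.raw_data`
  )
  WHERE row_num = 1  -- keep only the most recent row per customer
)
SELECT
  customer_id,
  contract_type,
  SAFE_DIVIDE(
    monthly_charges - MIN(monthly_charges) OVER (),
    MAX(monthly_charges) OVER () - MIN(monthly_charges) OVER ()
  ) AS monthly_charges_norm,
  churned
FROM deduped
WHERE churned IS NOT NULL;  -- discard rows with a missing label
```

Because the query uses `CREATE OR REPLACE TABLE`, each scheduled run fully rebuilds the cleaned table from the latest raw data.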
2. Automating the Model Retraining Query
Once the data is cleaned, the next step is to retrain the ML model. Retraining should occur only when the volume of newly inserted data exceeds a defined threshold—for example, a minimum increase in the row count of the cleaned table since the last training run. Ensure that the schedule allows sufficient time for data cleaning to complete before retraining.
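BigQuery's SQL scripting (`DECLARE`, `IF … END IF`) lets the threshold check and the retraining live in a single scheduled query. The sketch below assumes a hypothetical one-row bookkeeping table, `last_trained_row_count`, and an illustrative threshold of 10,000 new rows:

```sql
-- Retrain only when the cleaned table has grown past a row-count threshold.
DECLARE current_rows INT64;
DECLARE previous_rows INT64;

SET current_rows = (SELECT COUNT(*) FROM `my_project.my_dataset.clean_data`);
SET previous_rows = (
  SELECT row_count FROM `my_project.my_dataset.last_trained_row_count`
);

IF current_rows - previous_rows > 10000 THEN
  -- Enough new data has arrived: retrain the model in place.
  CREATE OR REPLACE MODEL `my_project.my_dataset.churn_model`
  OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
  SELECT * EXCEPT (customer_id)
  FROM `my_project.my_dataset.clean_data`;

  -- Record the row count used for this training run.
  UPDATE `my_project.my_dataset.last_trained_row_count`
  SET row_count = current_rows
  WHERE TRUE;
END IF;
```

If the threshold is not met, the scheduled run completes without retraining, so the existing model keeps serving predictions.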
3. Automating the Prediction Query
The final step is to generate predictions using the retrained model. Make sure that this scheduled query runs after the model retraining is completed to ensure that the latest model is used for predictions.
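Since the scheduled-query configuration already specifies a destination table (see the setup steps above), the prediction step can be a plain `SELECT` over `ML.PREDICT`. A sketch, reusing the hypothetical names from the earlier examples:

```sql
-- Generate predictions with the latest model; the scheduled query's
-- configured destination table receives the output rows.
SELECT
  customer_id,
  predicted_churned,
  CURRENT_TIMESTAMP() AS predicted_at
FROM ML.PREDICT(
  MODEL `my_project.my_dataset.churn_model`,
  (SELECT * FROM `my_project.my_dataset.clean_data`)
);
```

Configuring the destination with an append disposition keeps a history of predictions, which is useful for monitoring model drift over time.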
4. Managing Dependencies
To ensure that each step runs in the correct order, stagger the schedules so that each query starts only after the previous one has had enough time to finish:
- Data Cleaning: Schedule this to run first.
- Model Retraining: Schedule this to run after the data cleaning query has completed.
- Predictions: Schedule this to run after the model retraining query has completed.
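Alternatively, rather than relying on staggered start times, the ordering can be made explicit by running all three steps as one multi-statement scheduled query. A sketch, again using the hypothetical tables from above:

```sql
-- One scheduled script that enforces ordering: clean, then retrain, then predict.
BEGIN
  -- Step 1: data cleaning.
  CREATE OR REPLACE TABLE `my_project.my_dataset.clean_data` AS
  SELECT DISTINCT *
  FROM `my_project.my_dataset.raw_data`
  WHERE churned IS NOT NULL;

  -- Step 2: model retraining.
  CREATE OR REPLACE MODEL `my_project.my_dataset.churn_model`
  OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
  SELECT * EXCEPT (customer_id)
  FROM `my_project.my_dataset.clean_data`;

  -- Step 3: prediction generation.
  CREATE OR REPLACE TABLE `my_project.my_dataset.churn_predictions` AS
  SELECT *, CURRENT_TIMESTAMP() AS predicted_at
  FROM ML.PREDICT(
    MODEL `my_project.my_dataset.churn_model`,
    (SELECT * FROM `my_project.my_dataset.clean_data`)
  );
END;
```

The trade-off is granularity: a single script removes the risk of a step starting too early, but a failure in any statement fails the whole run, whereas separate scheduled queries can be retried independently.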
Conclusion
Automating BigQueryML with BigQuery Scheduler simplifies the process of maintaining and updating your machine learning models. By breaking down the process into three main steps—data cleaning, model retraining, and prediction generation—you can ensure a streamlined and efficient workflow. Regular monitoring and maintenance will help you stay on top of any issues, ensuring the continuous performance of your ML models.