When Partitioning and Clustering Go Wrong: Lessons from Optimizing Queries

Recently, I worked with a table in BigQuery called orders, sized at 2.98 GB. My goal was to optimize query performance and reduce the amount of data scanned for cost efficiency. Here’s how the journey unfolded:
The Initial Query
I ran the following query on the unpartitioned table:
SELECT *
FROM foodwagon.orders
WHERE order_date BETWEEN '2020-01-01' AND '2022-01-31'
AND restaurant_ratings >= 4 AND restaurant_ratings < 5;
Data scanned: 2.98 GB.
First Optimization: Partitioning
Since the query frequently filtered by order_date, I decided to partition the table on order_date. After partitioning: Data scanned: 1.24 GB.
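For reference, the partitioned copy can be created with a statement along these lines (a sketch: the exact DDL isn't shown above, the table name orders_partitioned is assumed, and order_date is assumed to be a DATE column):
CREATE TABLE foodwagon.orders_partitioned
PARTITION BY order_date
AS
SELECT *
FROM foodwagon.orders;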
This was a decent improvement, but I wanted to push the optimization further.
Next Step: Clustering
Given that most of my queries also filtered on restaurant_ratings, clustering seemed like the logical next step. However, restaurant_ratings is a float column with high cardinality, which isn't ideal for clustering—it results in minimal gains.
To address this, I created a derived column, rounded_ratings, by flooring restaurant_ratings, i.e. rounding each rating down to the nearest whole number:
CREATE TABLE foodwagon.orders_clustered
PARTITION BY DATE_TRUNC(order_date, MONTH)
CLUSTER BY rounded_ratings
AS
SELECT *, CAST(FLOOR(restaurant_ratings) AS INT64) AS rounded_ratings
FROM foodwagon.orders;
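With the table clustered on rounded_ratings, the ratings condition needs to reference the clustered column for BigQuery to prune blocks, so the query is rewritten roughly as follows (a sketch; the exact rewritten query isn't shown above, and rounded_ratings = 4 is equivalent to the original 4-to-5 range because of the floor):
SELECT *
FROM foodwagon.orders_clustered
WHERE order_date BETWEEN '2020-01-01' AND '2022-01-31'
AND rounded_ratings = 4;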
This approach reduced the scanned data for my query to 283.03 MB, which is a 77% reduction compared to partitioning alone and a 90% reduction from the original table scan. Huge success!
The Unexpected Outcome
However, things got interesting when I ran a slightly modified query:
SELECT *
FROM foodwagon.orders_clustered
WHERE order_date BETWEEN '2020-01-01' AND '2022-01-31';
This query scanned 1.4 GB, which is more than the partitioned table scan (1.24 GB). Upon inspecting the clustered table, I found its size had grown to 3.35 GB—larger than the original table.
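One quick way to compare stored sizes is the dataset's __TABLES__ metadata view (a sketch, assuming both tables live in the foodwagon dataset):
SELECT table_id, ROUND(size_bytes / POW(1024, 3), 2) AS size_gb
FROM foodwagon.__TABLES__
WHERE table_id IN ('orders', 'orders_clustered');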
Why Did This Happen?
Clustering Can Increase Storage Size: Clustering creates metadata describing how the data blocks are organized, and that metadata adds overhead, especially when combined with partitioning. In this case, the derived rounded_ratings column also added a value to every row, which helped push the clustered table past the size of the original.
Inefficient Clustering for Non-Filtered Queries: Clustering only helps queries that filter on the clustered fields. In this case, rounded_ratings wasn't part of the query, so BigQuery couldn't leverage the clustering metadata to prune blocks; only the partition filter applied, and it ran over a now-larger table, which is why more data was scanned.
Clustering Benefits Depend on Query Alignment: Clustering pays off only when the clustered field matches how the table is actually filtered. High-cardinality or rarely filtered cluster columns can cancel out the gains, especially on large datasets.
Partitioning vs. Clustering: When to Use What?
Partitioning:
Use it for low-cardinality fields such as dates; BigQuery supports partitioning on date, timestamp, datetime, and integer-range columns (or ingestion time).
Significant benefits when queries heavily filter on the partitioned column.
Adds minimal overhead to storage.
Clustering:
Works well for fields frequently used in filters or joins, especially with moderate cardinality.
Avoid clustering on high-cardinality fields (e.g., floats) without transforming the data.
Be cautious when combining clustering with partitioning; it can lead to larger table sizes if not used wisely.
Key Takeaways
Understand Query Patterns: Before deciding on partitioning or clustering, analyze how the table is actually queried. In my case, rounded_ratings made a good clustering field because most queries filter on ratings, and the derived integer column lets those filters use the clustering.
Quantify Benefits: Partitioning reduced the scanned data by 58%, and clustering brought it down by a further 77% (one way to pull these numbers from BigQuery's job history is sketched after this list). However, the tradeoff was increased table size and complexity.
Balance Partitioning and Clustering: Combining both can be powerful but may backfire if clusters don’t align with query patterns.
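One way to pull the scanned-bytes numbers after the fact is BigQuery's job history view (a sketch, assuming the project runs in the US multi-region; the region qualifier would change otherwise):
SELECT query, total_bytes_processed, creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC;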
By understanding these strategies and tradeoffs, I was able to reduce costs and improve query performance significantly, despite some unexpected challenges along the way.