When Partitioning and Clustering Go Wrong: Lessons from Optimizing Queries

Recently, I worked with a table in BigQuery called orders, sized at 2.98 GB. My goal was to optimize query performance and reduce the amount of data scanned for cost efficiency. Here’s how the journey unfolded:
The Initial Query
I ran the following query on the unpartitioned table:
SELECT *
FROM foodwagon.orders
WHERE order_date BETWEEN '2020-01-01' AND '2022-01-31'
AND restaurant_ratings >= 4 AND restaurant_ratings < 5;
Data scanned: 2.98 GB.
First Optimization: Partitioning
Since the query frequently filtered by order_date, I decided to partition the table on order_date. After partitioning: Data scanned: 1.24 GB.
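For reference, the partitioned copy can be created with a statement along these lines (a sketch: the exact DDL isn't shown above, the table name orders_partitioned is assumed, and order_date is assumed to be a DATE column):
CREATE TABLE foodwagon.orders_partitioned
PARTITION BY order_date
AS
SELECT *
FROM foodwagon.orders;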
This was a decent improvement, but I wanted to push the optimization further.
Next Step: Clustering
Given that most of my queries also filtered on restaurant_ratings, clustering seemed like the logical next step. However, restaurant_ratings is a float column with high cardinality, which isn't ideal for clustering—it results in minimal gains.
To address this, I created a derived column, rounded_ratings, by flooring restaurant_ratings, i.e. rounding each rating down to the nearest whole number:
CREATE TABLE foodwagon.orders_clustered
PARTITION BY DATE_TRUNC(order_date, MONTH)
CLUSTER BY rounded_ratings
AS
SELECT *, CAST(FLOOR(restaurant_ratings) AS INT64) AS rounded_ratings
FROM foodwagon.orders;
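With the table clustered on rounded_ratings, the ratings condition needs to reference the clustered column for BigQuery to prune blocks, so the query is rewritten roughly as follows (a sketch; the exact rewritten query isn't shown above, and rounded_ratings = 4 is equivalent to the original 4-to-5 range because of the floor):
SELECT *
FROM foodwagon.orders_clustered
WHERE order_date BETWEEN '2020-01-01' AND '2022-01-31'
AND rounded_ratings = 4;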
This approach reduced the scanned data for my query to 283.03 MB, which is a 77% reduction compared to partitioning alone and a 90% reduction from the original table scan. Huge success!
The Unexpected Outcome
However, things got interesting when I ran a slightly modified query:
SELECT *
FROM foodwagon.orders_clustered
WHERE order_date BETWEEN '2020-01-01' AND '2022-01-31';
This query scanned 1.4 GB, which is more than the partitioned table scan (1.24 GB). Upon inspecting the clustered table, I found its size had grown to 3.35 GB—larger than the original table.
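One quick way to compare stored sizes is the dataset's __TABLES__ metadata view (a sketch, assuming both tables live in the foodwagon dataset):
SELECT table_id, ROUND(size_bytes / POW(1024, 3), 2) AS size_gb
FROM foodwagon.__TABLES__
WHERE table_id IN ('orders', 'orders_clustered');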
Why Did This Happen?
Clustering Can Increase Storage Size: Clustering creates metadata describing how the data blocks are organized, and that metadata adds overhead, especially when combined with partitioning. In this case, the derived rounded_ratings column also added a value to every row, which helped push the clustered table past the size of the original.
Inefficient Clustering for Non-Filtered Queries: Clustering only helps queries that filter on the clustered fields. In this case, rounded_ratings wasn't part of the query, so BigQuery couldn't leverage the clustering metadata to prune blocks; only the partition filter applied, and it ran over a now-larger table, which is why more data was scanned.
Clustering Benefits Depend on Query Alignment: Clustering pays off only when the clustered field matches how the table is actually filtered. High-cardinality or rarely filtered cluster columns can cancel out the gains, especially on large datasets.
Partitioning vs. Clustering: When to Use What?
Partitioning:
Use it for low-cardinality fields such as dates; BigQuery supports partitioning on date, timestamp, datetime, and integer-range columns (or ingestion time).
Significant benefits when queries heavily filter on the partitioned column.
Adds minimal overhead to storage.
Clustering:
Works well for fields frequently used in filters or joins, especially with moderate cardinality.
Avoid clustering on high-cardinality fields (e.g., floats) without transforming the data.
Be cautious when combining clustering with partitioning; it can lead to larger table sizes if not used wisely.
Key Takeaways
Understand Query Patterns: Before deciding on partitioning or clustering, analyze how the table is actually queried. In my case, rounded_ratings made a good clustering field because most queries filter on ratings, and the derived integer column lets those filters use the clustering.
Quantify Benefits: Partitioning reduced the scanned data by 58%, and clustering brought it down by a further 77% (one way to pull these numbers from BigQuery's job history is sketched after this list). However, the tradeoff was increased table size and complexity.
Balance Partitioning and Clustering: Combining both can be powerful but may backfire if clusters don’t align with query patterns.
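One way to pull the scanned-bytes numbers after the fact is BigQuery's job history view (a sketch, assuming the project runs in the US multi-region; the region qualifier would change otherwise):
SELECT query, total_bytes_processed, creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC;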
By understanding these strategies and tradeoffs, I was able to reduce costs and improve query performance significantly, despite some unexpected challenges along the way.