Conversations

Snowflake - A Database or Data Warehouse?

data-warehousing
cloud-computing
storage
Amit Choudhary, Co-founder & CEO

GlobalMart has been growing at a phenomenal pace over the past 3-4 years. One of its core focus areas has been customer engagement, which has resulted in a majority of its customers making repeat purchases, leading to a low customer acquisition cost (CAC) for the company.

However, with a growing customer base, it is becoming increasingly difficult to keep track of customer behavior metrics via plain OLTP systems. And with analytical queries rising day by day, the OLTP databases are hitting performance bottlenecks every now and then.

This has forced the management to think about a centralized, clean source of truth that can be used for all analytical querying across the company. In short, they've been looking for the right data warehousing solution.

Out of several vendors, a cloud-first data warehousing solution called Snowflake has caught their attention. The CTO feels it would be the right choice for the company and has therefore mandated that the Data Engineers learn Snowflake and clear the SnowPro Core certification, so they are ready to work on Snowflake projects.

Vinay, one of the Data Engineers at GlobalMart, is pretty curious about this new technology and is seeking help from Senior Architect Rahul, who also happens to have worked on Snowflake in the past.

Let's see how Vinay explores Snowflake and how Rahul helps him navigate the concepts and implementation of this data warehousing technology.

Vinay : Hi Rahul, glad to connect with you and seek your help on my learning journey with Snowflake. I may have a lot of dumb questions to start with. Bear with me for a while and I'm sure I'll pick things up from your experience :)

Rahul : Hey, don't worry. I've been down this path before. I had all the same questions when I started my journey with Snowflake. So don't worry, there are no dumb questions!

Feel free to ask anything you'd like, and I'll do my best to provide clear explanations and SQL code examples. Let's get started. Which aspect of Snowflake would you like to begin with, or do you have any specific questions in mind?

Vinay : I come from a SQL background. I understand the idea of tables and loading data using inserts, updates, etc. I have done joins, group aggregates, windowing, and so on. I have a basic idea of writing functions in SQL. Given these skills, what would be the best way to get started?

Rahul : Great, it sounds like you have a solid foundation in SQL, which will be very helpful as you learn Snowflake. Since you're familiar with SQL concepts, let's start by discussing some of the key differences and features of Snowflake compared to traditional relational databases.

1. Data Warehousing Model: Snowflake is a cloud-based data warehousing platform. Unlike traditional databases, it's designed for large-scale data storage and analytics. Data is stored in a columnar format, which is optimized for analytical queries.

2. Data Loading: Snowflake offers various methods for data ingestion, such as bulk loading, Snowpipe for real-time data ingestion, and using external stages. Let's start with a basic example of loading data into a Snowflake table using SQL.

-- Creating a table
CREATE OR REPLACE TABLE my_table (
    id INT,
    name STRING
);

-- Loading data from an external stage (assuming you have a CSV file in a stage)
COPY INTO my_table
FROM @my_stage/my_data.csv
FILE_FORMAT = (TYPE = CSV)
ON_ERROR = CONTINUE; -- You can specify error handling options
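
The COPY command above assumes the named stage @my_stage already exists. As a quick sketch (the file format name, stage name, and file path here are just placeholders), the stage could be created and populated like this:

-- Hypothetical file format and internal stage for the COPY example above
CREATE OR REPLACE FILE FORMAT my_csv_format
    TYPE = CSV
    SKIP_HEADER = 1;

CREATE OR REPLACE STAGE my_stage
    FILE_FORMAT = my_csv_format;

-- Upload a local file into the stage (PUT runs from SnowSQL or another client tool, not the web UI worksheet)
PUT file:///tmp/my_data.csv @my_stage;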

3. Data Warehouse Configuration: In Snowflake, you'll work with a virtual warehouse, which is the compute engine responsible for running queries. You can scale it up or down based on your needs.

-- Creating a warehouse
CREATE OR REPLACE WAREHOUSE my_warehouse
WAREHOUSE_SIZE = 'X-SMALL'
AUTO_SUSPEND = 300
AUTO_RESUME = TRUE;
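
Once the warehouse exists, you point your session at it and resize or suspend it as needed. A minimal sketch (names and sizes are illustrative):

-- Use this warehouse for subsequent queries in the session
USE WAREHOUSE my_warehouse;

-- Scale up before a heavy workload, then suspend to stop consuming credits
ALTER WAREHOUSE my_warehouse SET WAREHOUSE_SIZE = 'SMALL';
ALTER WAREHOUSE my_warehouse SUSPEND;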

4. Access Control: Snowflake provides robust access control mechanisms. You can control who can access what data and what actions they can perform.

-- Creating a role
CREATE OR REPLACE ROLE my_role;

-- Granting privileges to a role
GRANT USAGE ON DATABASE my_db TO ROLE my_role;

-- Creating a user with that role as the default
CREATE USER my_user
PASSWORD = 'password'  -- placeholder only; use a strong password or SSO in practice
DEFAULT_ROLE = my_role;

-- The default role must also be granted to the user before they can use it
GRANT ROLE my_role TO USER my_user;

5. ELT (Extract, Load, Transform): Snowflake often uses the ELT approach. You extract data into Snowflake, load it into tables, and then perform transformations using SQL.
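
For instance, once raw data has been copied into a staging table, the transformation step can be plain SQL. A rough sketch, assuming a hypothetical raw_orders table:

-- ELT "T" step: build an analytics-ready table from raw data already loaded into Snowflake
CREATE OR REPLACE TABLE daily_order_totals AS
SELECT
    order_date,
    SUM(order_amount) AS total_amount,
    COUNT(*)          AS order_count
FROM raw_orders
GROUP BY order_date;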

6. Writing Functions and Procedures: You can create user-defined functions and procedures in Snowflake using SQL.

-- Creating a simple SQL function
CREATE OR REPLACE FUNCTION my_function(x INT, y INT)
RETURNS INT
LANGUAGE SQL
AS
$$
    SELECT x + y
$$;
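
Calling it then works like any other scalar function:

-- Using the function
SELECT my_function(2, 3);  -- returns 5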

These are some key concepts to get started with Snowflake. If you have specific questions or if you'd like to explore any of these topics in more detail, please let me know, and I'll provide more information and code examples as needed.

Vinay : Whoaaa! That's too much for a novice like me :) Shall we break this into smaller pieces and understand it better? First, help me understand what this data warehouse is. I work on databases. Why should I care about a data warehouse?

Rahul : Of course, let's break it down into smaller pieces. Understanding the concept of a data warehouse is a great place to start.

In the context of Snowflake:

A Data Warehouse is a specialized type of database designed for storing and analyzing large volumes of data efficiently. Unlike traditional transactional databases, which are optimized for read and write operations, data warehouses are optimized for complex queries and reporting. Here's why you should care about it:

1. Analytical Queries: Data warehouses are designed to handle complex analytical queries with ease. They can aggregate, join, and filter large datasets quickly, making them ideal for business intelligence and data analysis tasks (a sample query of this kind follows this list).

2. Columnar Storage: Data in a data warehouse is typically stored in a columnar format rather than the row-based format of traditional databases. This columnar storage is highly optimized for analytical workloads, as it allows for efficient compression and retrieval of data.

3. Performance: Data warehouses are optimized for performance, especially when dealing with large datasets. They can distribute data across multiple nodes for parallel processing, providing faster query results.

4. Scalability: Snowflake and other cloud-based data warehouses can easily scale up or down based on your needs. You can allocate more computing resources (compute clusters) as your data and query complexity grow.

5. Concurrency: Data warehouses like Snowflake can handle multiple concurrent users and queries effectively. You won't experience a significant slowdown in performance when many users are querying the system simultaneously.

6. Data Integration: Data warehouses are often used to consolidate data from various sources. Snowflake supports data integration, allowing you to bring data from different systems into one central location for analysis.
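
To make the first point concrete, here is a sketch of the kind of query a warehouse is built for, using hypothetical orders and customers tables: a join, a group-aggregate, and a window function computing a running total per customer segment.

-- Monthly revenue per customer segment, plus a running total per segment
SELECT
    c.segment,
    DATE_TRUNC('month', o.order_date) AS order_month,
    SUM(o.order_amount)               AS monthly_revenue,
    SUM(SUM(o.order_amount)) OVER (
        PARTITION BY c.segment
        ORDER BY DATE_TRUNC('month', o.order_date)
    )                                 AS running_revenue
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY c.segment, DATE_TRUNC('month', o.order_date)
ORDER BY c.segment, order_month;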

So, as someone coming from a SQL background and working with databases, understanding data warehousing concepts is essential if you want to perform advanced analytics, gain insights from large datasets, and make data-driven decisions. Snowflake provides a platform that makes it easier to harness the power of a data warehouse in a cloud-based environment.

If you have any more specific questions about data warehousing or how Snowflake implements these concepts, please feel free to ask.

Vinay : Ok. So, let's start with the database. Why can't I simply analyze the data in my OLTP database and run analytical queries there? It has worked so well without going for this data warehouse!

Rahul : Analyzing data in your OLTP (Online Transaction Processing) database is certainly possible for smaller-scale analytical tasks and reporting. However, using a data warehouse like Snowflake becomes advantageous as your data and analytical requirements grow. Here's why:

1. Data Volume: OLTP databases are designed for efficient transaction processing and are typically not optimized for handling large volumes of historical data. Analytical workloads often involve querying and aggregating extensive datasets, which can put a significant load on OLTP systems and degrade their performance.

2. Performance Impact: Running analytical queries in an OLTP database can impact the performance of your transactional operations. These databases are optimized for quick reads and writes, but complex analytical queries can be resource-intensive and slow down the database for users performing critical transactional tasks.

3. Data Structure: OLTP databases are usually normalized to minimize redundancy and ensure data integrity. Analytical queries often require denormalized or differently structured data for efficient processing. Data warehouses allow you to transform and structure data for analytical purposes without affecting the original OLTP data.

4. Concurrency: Analytical workloads may require multiple users to run complex queries concurrently. OLTP databases might not handle this level of concurrency efficiently, leading to contention and reduced performance.

5. Data Integration: Analytical tasks often involve data from various sources, such as different OLTP databases, external data feeds, or flat files. A data warehouse like Snowflake provides tools and capabilities for integrating and consolidating data from these diverse sources in one place.

6. Scalability: Data warehouses can scale horizontally or vertically to accommodate the increasing data volume and analytical complexity without major disruptions to operations. OLTP databases might require significant changes and downtime to achieve the same level of scalability.

7. Cost Efficiency: Cloud-based data warehouses like Snowflake offer pay-as-you-go pricing models, allowing you to allocate resources as needed. You can optimize costs by scaling resources up during peak analytical periods and down during off-peak times.

In summary, while it's possible to perform analytical queries in an OLTP database, using a dedicated data warehouse like Snowflake provides the following advantages:

  • Better performance for analytical queries without impacting transactional operations.

  • Handling of large volumes of historical data.

  • Support for complex analytical transformations and queries.

  • Improved concurrency for multiple users.

  • Centralized data integration from various sources.

  • Scalability and cost-efficiency.

So, the choice between using your OLTP database for analytics or investing in a data warehouse like Snowflake depends on your specific data volume, analytical requirements, and performance considerations. Data warehouses are purpose-built for analytics, making them a robust choice for organizations with significant analytical needs.

Vinay : Hold on. You say OLTP databases are designed for quick reads and writes, yet analytical queries are resource intensive. Can you use actual code examples with insights on performance to prove this point?

Rahul : Certainly, I can provide you with a simple example to illustrate the performance impact of running analytical queries on an OLTP database compared to a data warehouse. Please note that the performance difference may vary depending on the specific database system and hardware configuration, but the principles apply broadly.

Let's assume you have an OLTP database with a table called sales_transactions, which stores individual sales transactions for an online store. It has millions of rows, and you want to perform a simple analytical query to calculate the total revenue for a specific date range.

In an OLTP Database (e.g., PostgreSQL or MySQL):

-- Analytical Query in OLTP Database
SELECT SUM(amount)
FROM sales_transactions
WHERE transaction_date BETWEEN '2023-01-01' AND '2023-02-28';

In a Data Warehouse (e.g., Snowflake):

-- Analytical Query in Snowflake (Data Warehouse)
SELECT SUM(amount)
FROM sales_transactions
WHERE transaction_date BETWEEN '2023-01-01' AND '2023-02-28';
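
To see the difference for yourself rather than take it on faith, you can inspect the execution plans on each side. A quick sketch, assuming the same table exists in both systems:

-- PostgreSQL (OLTP): show the actual plan and timing
EXPLAIN ANALYZE
SELECT SUM(amount)
FROM sales_transactions
WHERE transaction_date BETWEEN '2023-01-01' AND '2023-02-28';

-- Snowflake: show the logical plan; runtime details appear in the query profile
EXPLAIN
SELECT SUM(amount)
FROM sales_transactions
WHERE transaction_date BETWEEN '2023-01-01' AND '2023-02-28';

On a large table without a suitable index, the PostgreSQL plan will typically show a sequential scan over every row, while Snowflake prunes micro-partitions by date and reads only the amount and transaction_date columns.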

Here's why the data warehouse is likely to outperform the OLTP database for this analytical query:

1. Indexing: OLTP databases are typically optimized for quick reads and writes, which means they often use indexing strategies that are well-suited for transactional operations. Analytical queries often need to scan large portions of the table, and full table scans are less efficient in OLTP databases.

2. Locking: OLTP databases may use row-level locking to maintain data integrity during transactions. Analytical queries can be long-running and may cause contention with transactional operations, leading to slower query performance.

3. Query Optimization: Data warehouses are designed to optimize complex analytical queries. They may use techniques like query parallelism, optimized storage formats (e.g., columnar storage), and pre-computed aggregations (see the sketch after this list), all of which improve query performance.

4. Concurrency: Data warehouses are built to handle multiple concurrent users and queries efficiently, whereas OLTP databases may prioritize transactional operations over analytical ones, resulting in slower query response times during peak usage.

5. Resource Allocation: Data warehouses often allow you to allocate dedicated resources to analytical queries, ensuring that they have the computing power they need. OLTP databases may not offer the same level of resource isolation for analytical workloads.
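
One concrete illustration of the pre-computed aggregations mentioned in point 3: Snowflake can maintain a pre-aggregated materialized view over the same table. A sketch assuming the sales_transactions table above (materialized views are an Enterprise Edition feature):

-- Pre-computed daily revenue, kept up to date automatically by Snowflake
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT
    transaction_date,
    SUM(amount) AS total_amount
FROM sales_transactions
GROUP BY transaction_date;

-- Range queries can now read the much smaller pre-aggregated data
SELECT SUM(total_amount)
FROM daily_revenue
WHERE transaction_date BETWEEN '2023-01-01' AND '2023-02-28';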

While this example demonstrates a simple query, the performance difference becomes more pronounced as queries become more complex, involve multiple joins, and require aggregations over large datasets. In such cases, data warehouses like Snowflake shine in terms of query performance and scalability.

It's essential to choose the right tool for the job based on your specific workload. OLTP databases excel at transactional operations, while data warehouses are designed for analytical tasks. Combining both types of databases in your architecture can provide a well-rounded solution for different data processing needs.

Vinay : What would be some examples of quick reads and writes? Can you give some SQL examples?

Rahul : Quick read and write operations are typically associated with Online Transaction Processing (OLTP) databases. These databases are optimized for handling individual, short, and frequent transactions, such as inserting, updating, or deleting records. Here are some SQL examples to illustrate quick read and write operations in an OLTP context:

1. Inserting Data:

-- Insert a new customer record
INSERT INTO customers (customer_id, name, email)
VALUES (1, 'John Doe', 'johndoe@example.com'); 

This operation adds a new customer to the "customers" table with minimal overhead.

2. Updating Data:

-- Update the email address of an existing customer
UPDATE customers
SET email = 'newemail@example.com'
WHERE customer_id = 1;

Updating a single record is a quick operation in OLTP databases.

3. Deleting Data:

-- Delete a customer record
DELETE FROM customers
WHERE customer_id = 1;

Deleting a single record is also a fast operation.

4. Selecting Data by Primary Key:

-- Retrieve customer information by primary key
SELECT name, email
FROM customers
WHERE customer_id = 2;

Queries that use primary key values for retrieval are typically very fast in OLTP databases.

5. Transaction Commit:

-- Transaction example (begin, update, commit)
BEGIN;
UPDATE accounts
SET balance = balance - 100
WHERE account_number = '12345';
COMMIT;

OLTP databases are designed to handle transactions efficiently, ensuring that updates are atomic and consistent.

These examples highlight operations that OLTP databases excel at—handling individual transactions with minimal latency. OLTP databases are well-suited for applications like e-commerce systems, banking systems, order processing, and other situations where data needs to be updated quickly and reliably.

In contrast, as previously discussed, data warehouses like Snowflake are optimized for complex analytical queries that involve scanning and aggregating large volumes of data. While OLTP databases prioritize quick read and write operations, data warehouses prioritize query performance for analytics and reporting tasks.

Vinay : Got it. So, the goal itself is different here, in terms of what we do with the data. In OLTP, all we care about is that the data is inserted, updated, and deleted properly. Mostly, customers would be firing queries to look at their own orders, order tracking data, or purchase history. In analytical queries, though, we may perform an aggregation of sales and quantities across all customers, which will be heavy on processing due to joins, group aggregates, windowing functions, etc. Am I thinking right?

Rahul : You are absolutely thinking in the right direction! Your understanding is spot on.

In OLTP (Online Transaction Processing) databases, the primary focus is on ensuring data integrity, reliability, and the efficient handling of individual transactions. These databases are typically used for day-to-day operations, where customers or users interact with the system to perform tasks like placing orders, updating personal information, and checking order status. Queries are usually simple and focused on retrieving specific records or performing CRUD (Create, Read, Update, Delete) operations.

Conversely, in analytical queries and data warehousing, the goal is to derive insights from large volumes of data. This involves complex operations like joining multiple tables, aggregating data, applying window functions, and running sophisticated analytical queries. The emphasis is on analyzing historical data, identifying trends, making data-driven decisions, and generating reports.

So, to summarize:

- OLTP Databases:

  • Primarily focused on maintaining data integrity.

  • Handle frequent, individual transactions.

  • Queries are often simple and focused on specific records.

  • Optimize for quick read and write operations.

  • Typically used for day-to-day business operations.

- Analytical Queries and Data Warehousing:

  • Focus on analyzing historical data.

  • Perform complex operations like joins, aggregates, and window functions.

  • Queries are often heavy on processing but aim to provide valuable insights.

  • Optimize for query performance over large datasets.

  • Used for business intelligence, reporting, and data analysis.

Understanding this distinction is crucial when designing data systems and choosing the right tools for different aspects of your business. OLTP databases and data warehouses serve complementary roles within an organization's data architecture, each optimized for its specific set of tasks and requirements.

Vinay : Thanks! I am feeling good about it now. So, Snowflake provides a data warehouse solution rather than a database solution right?

Rahul : You're correct! Snowflake provides a data warehousing solution rather than a traditional database solution. Snowflake is specifically designed to serve as a cloud-based data warehouse that excels at storing and analyzing large volumes of data for analytical and business intelligence purposes.

While Snowflake is primarily a data warehousing solution, it can also handle many aspects of data management typically associated with databases, such as table creation, data loading, and even basic transactional operations. This makes it a versatile platform for organizations looking to consolidate their data analytics and reporting efforts in a cloud-based environment.

Vinay : Cool! Let me assimilate this learning for some time and come back to you with follow-up questions.

Rahul : Take your time, Vinay. Happy to help!