
Building ETL pipeline on GCP | Set 2

9 Scenarios
3 Hours 55 Minutes
Advanced
Industry
retail-and-cpg
e-commerce
Skills
cloud-management
data-understanding
data-storage
batch-etl
programming
data-wrangling
approach
data-modelling
data-quality
code-versioning
git-version-control
problem-understanding
performance-tuning
Tools
google-cloud
spark
sql
airflow
github
databricks

Learning Objectives

Design and implement an ETL/ELT pipeline using Dataproc on GCP, following the Medallion Architecture
Manage credentials and IAM securely
Automate deployment with CI/CD using GitHub/GitHub Actions
Perform unit testing for data pipelines
Orchestrate data pipelines efficiently using Cloud Composer
Optimize BigQuery performance
Enforce code quality using pre-commit hooks
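As one illustration of the unit-testing objective, here is a minimal sketch in plain Python of how a pipeline transformation might be tested before it runs on Dataproc. The function and field names (`clean_orders`, `order_id`) are hypothetical, not part of the project's actual codebase:

```python
# Hypothetical sketch: unit-testing a pipeline transformation step.
# Names (clean_orders, "order_id") are illustrative only.

def clean_orders(rows):
    """Drop records without an order_id and deduplicate by order_id,
    keeping the first occurrence of each id."""
    seen = set()
    cleaned = []
    for row in rows:
        order_id = row.get("order_id")
        if order_id is None or order_id in seen:
            continue
        seen.add(order_id)
        cleaned.append(row)
    return cleaned


def test_clean_orders_removes_duplicates_and_nulls():
    raw = [
        {"order_id": 1, "amount": 10.0},
        {"order_id": 1, "amount": 10.0},    # exact duplicate
        {"order_id": None, "amount": 5.0},  # missing key
        {"order_id": 2, "amount": 7.5},
    ]
    result = clean_orders(raw)
    assert [r["order_id"] for r in result] == [1, 2]


if __name__ == "__main__":
    test_clean_orders_removes_duplicates_and_nulls()
    print("all tests passed")
```

In the project itself the same idea applies to PySpark DataFrames; keeping transformation logic in small, pure functions like this is what makes it testable in CI before deployment.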

Overview

GlobalMart, a rapidly growing e-commerce startup, faces several data management challenges that impact its ability to generate reliable insights and make timely business decisions. Key issues include:

  • Data Quality Issues: Inconsistencies in data formats, missing values, and duplication create inefficiencies in processing.
  • Slow Data Transformations: As data volume increases, sluggish transformation processes delay critical insights.
  • Lack of Streamlined Workflows: Inefficient data processing and handling of invalid data disrupt operations and reduce overall efficiency.
  • Security Risks: Improper handling of sensitive credentials and access keys poses potential security threats.
  • Unreliable Deployment & Testing: The absence of a structured framework makes changes to data pipelines error-prone, increasing operational overhead.
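To make the first of these issues concrete, a quick data-quality profile can quantify missing and duplicated keys in a raw extract. This is a minimal plain-Python sketch; the field name `customer_id` and sample records are hypothetical:

```python
# Hypothetical sketch: profiling raw records for the data-quality issues
# described above (missing values, duplicate keys).
from collections import Counter

def profile(rows, key):
    """Count total rows, missing values of `key`, and duplicate values of `key`."""
    missing = sum(1 for r in rows if r.get(key) in (None, ""))
    counts = Counter(r[key] for r in rows if r.get(key) not in (None, ""))
    duplicates = sum(c - 1 for c in counts.values())
    return {"rows": len(rows),
            "missing_" + key: missing,
            "duplicate_" + key: duplicates}

raw = [
    {"customer_id": "C1"},
    {"customer_id": "C1"},  # duplicate
    {"customer_id": ""},    # missing
    {"customer_id": "C2"},
]
print(profile(raw, "customer_id"))
# {'rows': 4, 'missing_customer_id': 1, 'duplicate_customer_id': 1}
```

Running a profile like this against each raw (bronze-layer) load is a common first step before deciding how the silver-layer transformations should handle the bad records.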

These issues have eroded trust in GlobalMart's data systems, rendering them nearly useless for decision-making. In this project, you will implement the following architecture, which addresses each of the problems GlobalMart currently faces:

[Architecture diagram]

Prerequisites

  • Understanding of Google Cloud Platform
  • Knowledge of ETL/ELT Processes & Pipeline Management
  • Familiarity with BigQuery, Dataproc, PySpark & Python
  • Basic Knowledge of CI/CD Pipelines
  • Experience with Airflow
  • Experience with GitHub/GitHub Actions