
Building ETL pipeline on GCP | Set 1

7 Scenarios
2 Hours 35 Minutes
Industry
Retail & CPG
Skills
Cloud Management
Data Understanding
Data Storage
Batch ETL
Programming
Data Modelling
Data Quality
Data Wrangling
Code Versioning
Git Version Control
Approach
Tools
Google Cloud
Spark
GitHub
Airflow
SQL

Learning Objectives

  • Design and implement an ETL/ELT pipeline using Dataproc in GCP, following the Medallion Architecture (a minimal sketch of this pattern follows the list).
  • Manage secure access and credentials.
  • Automate deployment with CI/CD.
  • Perform unit testing for data pipelines.
  • Orchestrate data pipelines efficiently.
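
To make the first objective concrete, here is a minimal PySpark sketch of a bronze-to-silver step in a Medallion-style pipeline. The bucket paths, column names, and job name are illustrative placeholders rather than part of the project specification; on Dataproc, a script like this would typically be submitted as a PySpark job.

    # Minimal bronze -> silver sketch; paths and column names are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("globalmart-bronze-to-silver").getOrCreate()

    # Bronze: raw orders landed as-is in a GCS bucket (bucket name is illustrative).
    bronze_df = spark.read.json("gs://globalmart-lake/bronze/orders/")

    # Silver: deduplicated, typed records with the obvious quality issues handled.
    silver_df = (
        bronze_df
        .dropDuplicates(["order_id"])                         # remove duplicate orders
        .filter(F.col("order_id").isNotNull())                # drop rows missing the key
        .withColumn("order_ts", F.to_timestamp("order_ts"))   # normalise timestamps
    )

    silver_df.write.mode("overwrite").parquet("gs://globalmart-lake/silver/orders/")

A gold layer would follow the same pattern, aggregating the silver tables into business-level models, for example in BigQuery.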

Overview

GlobalMart, a rapidly growing e-commerce startup, faces several data management challenges that impact its ability to generate reliable insights and make timely business decisions. Key issues include:

  • Data Quality Issues: Inconsistencies in data formats, missing values, and duplication create inefficiencies in processing.
  • Slow Data Transformations: As data volume increases, sluggish transformation processes delay critical insights.
  • Lack of Streamlined Workflows: Inefficient data processing and handling of invalid data disrupt operations and reduce overall efficiency.
  • Security Risks: Improper handling of sensitive credentials and access keys poses potential security threats (see the sketch after this overview).
  • Unreliable Deployment & Testing: The absence of a structured framework makes changes to data pipelines error-prone, increasing operational overhead.

Addressing these challenges requires a structured, scalable, and secure approach to data management, so that GlobalMart can trust its data-driven strategies.
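
As a small illustration of the security point above, credentials can be fetched from Google Secret Manager at runtime instead of being hard-coded or committed to Git. The project and secret names below are placeholders.

    # Hypothetical example: read a credential from Secret Manager at runtime.
    from google.cloud import secretmanager

    client = secretmanager.SecretManagerServiceClient()
    name = "projects/globalmart-demo/secrets/warehouse-password/versions/latest"
    response = client.access_secret_version(request={"name": name})
    db_password = response.payload.data.decode("utf-8")  # never log or print this value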

Prerequisites

  • Understanding of Google Cloud Platform
  • Knowledge of ETL/ELT Processes & Pipeline Management
  • Familiarity with BigQuery, Dataproc, PySpark & Python
  • Basic Knowledge of CI/CD Pipelines
  • Experience with Airflow
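
Building on the Airflow prerequisite, the sketch below shows how a Dataproc PySpark step such as the bronze-to-silver job above might be scheduled from a DAG. The project ID, cluster name, region, and script URI are placeholders, and the operator import assumes the apache-airflow-providers-google package is installed.

    # Illustrative DAG; identifiers and URIs are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

    PYSPARK_JOB = {
        "reference": {"project_id": "globalmart-demo"},
        "placement": {"cluster_name": "globalmart-dataproc"},
        "pyspark_job": {"main_python_file_uri": "gs://globalmart-code/bronze_to_silver.py"},
    }

    with DAG(
        dag_id="globalmart_etl",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",   # Airflow 2.4+; older versions use schedule_interval
        catchup=False,
    ) as dag:
        DataprocSubmitJobOperator(
            task_id="bronze_to_silver",
            job=PYSPARK_JOB,
            region="us-central1",
            project_id="globalmart-demo",
        )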