Industry
Retail and CPG
Skills
Cloud Management
Data Understanding
Data Storage
Batch ETL
Programming
Data Modelling
Data Quality
Data Wrangling
Code Versioning
Git Version Control
Tools
Google Cloud
Spark
GitHub
Airflow
SQL
Learning Objectives
Design and implement an ETL/ELT pipeline using Dataproc in GCP, following the Medallion Architecture (a minimal sketch of this step appears after this list).
Manage Secure Access and Credentials
Automate Deployment with CI/CD
Perform Unit Testing for Data Pipelines
Orchestrate Data Pipelines Efficiently
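To ground the first objective, here is a minimal sketch of a Bronze-to-Silver PySpark job of the kind that would run on Dataproc in a Medallion-style pipeline. The bucket path gs://globalmart-lake and the order_id, customer_id, order_ts, and amount columns are assumptions made for illustration only; the actual project defines its own layers and schemas.

```python
# Hypothetical Bronze -> Silver step for a Medallion-style pipeline on Dataproc.
# Bucket, dataset, and column names are illustrative assumptions only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("globalmart-bronze-to-silver").getOrCreate()

# Bronze layer: raw order events landed in Cloud Storage as-is (assumed path).
bronze_df = spark.read.json("gs://globalmart-lake/bronze/orders/")

# Silver layer: standardise formats, drop rows missing required keys, de-duplicate.
silver_df = (
    bronze_df
    .withColumn("order_ts", F.to_timestamp("order_ts"))    # normalise timestamp format
    .withColumn("order_date", F.to_date("order_ts"))       # derive a partition column
    .withColumn("amount", F.col("amount").cast("double"))  # enforce a numeric type
    .dropna(subset=["order_id", "customer_id"])            # drop rows missing business keys
    .dropDuplicates(["order_id"])                          # remove duplicate orders
)

# Persist the cleaned layer as partitioned Parquet for downstream Gold aggregations.
(
    silver_df.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("gs://globalmart-lake/silver/orders/")
)
```

A job like this is typically submitted to a Dataproc cluster (for example with gcloud dataproc jobs submit pyspark) and triggered on a schedule by an Airflow DAG, which is where the orchestration objective comes in.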
Overview
GlobalMart, a rapidly growing e-commerce startup, faces several data management challenges that impact its ability to generate reliable insights and make timely business decisions. Key issues include:
- Data Quality Issues: Inconsistent data formats, missing values, and duplication create inefficiencies in processing.
- Slow Data Transformations: As data volume grows, sluggish transformation processes delay critical insights.
- Lack of Streamlined Workflows: Inefficient processing and handling of invalid data disrupt operations and reduce overall efficiency.
- Security Risks: Improper handling of sensitive credentials and access keys poses potential security threats.
- Unreliable Deployment & Testing: Without a structured testing and deployment framework, changes to data pipelines are error-prone and increase operational overhead.
Addressing these challenges requires a structured, scalable, and secure data management approach so that GlobalMart's data-driven strategies can be trusted; a short sketch of the kind of data-quality checks involved follows below.
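As a concrete illustration of the data-quality issues listed above, the sketch below counts missing keys, duplicate orders, and unparseable timestamps in a Silver-layer table and fails the run if any are found. The table path, column names, and zero-tolerance thresholds are illustrative assumptions, not GlobalMart's actual rules.

```python
# Illustrative data-quality checks for the issues described in the overview:
# missing values, duplication, and inconsistent formats. Paths, columns, and
# thresholds are assumptions for the example only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("globalmart-dq-checks").getOrCreate()
orders = spark.read.parquet("gs://globalmart-lake/silver/orders/")  # assumed path

total = orders.count()

# Missing values: rows lacking a required key.
missing_customer = orders.filter(F.col("customer_id").isNull()).count()

# Duplication: order_ids that appear more than once.
duplicate_orders = total - orders.dropDuplicates(["order_id"]).count()

# Format consistency: timestamps that fail to parse into a proper date.
bad_dates = orders.filter(F.to_date("order_ts").isNull()).count()

# Fail fast if any check breaches a (hypothetical) zero-tolerance threshold,
# so the orchestrator can mark the run as failed and alert the team.
for name, count in [("missing_customer_id", missing_customer),
                    ("duplicate_order_id", duplicate_orders),
                    ("unparseable_order_ts", bad_dates)]:
    if count > 0:
        raise ValueError(f"Data quality check failed: {name} affected {count} of {total} rows")
```

Raising an exception here lets the orchestrator (Airflow, in this project) mark the task as failed and surface the problem instead of silently propagating bad data downstream.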
Prerequisites
- Understanding of Google Cloud Platform
- Knowledge of ETL/ELT Processes & Pipeline Management
- Familiarity with BigQuery, Dataproc, PySpark & Python
- Basic Knowledge of CI/CD Pipelines
- Experience with Airflow