INTRODUCTION TO
Databricks
The Unified Analytics Platform
John
Why "Unified"?
Sharon
Built on Apache Spark
One Platform, Every Role
👷
Data Engineers
Build ELT pipelines
Manage data flows
🧪
Data Scientists
Create ML models
Experiment & train
📊
Data Analysts
SQL & Python queries
Reports & insights
📈
BI Experts
Build dashboards
Business metrics
SUPPORTED LANGUAGES
Python
SQL
Scala
R
All in Databricks
Pipelines + ML Models + Dashboards + Analysis
Seamless Data Integration
☁️
Cloud Storage
ADLS, S3, GCS
🗄️
Databases
SQL Server, Postgres
🏢
Data Warehouses
Snowflake, Redshift
📡
Streaming Data
Kafka, Event Hubs
Databricks
Powered by Apache Spark
Auto-managed Spark sessions
No manual cluster setup
Fully Managed
No Spark session init
Auto cluster scaling
Infrastructure handled
🎯
Just focus on your work!
The Lakehouse Architecture
🌊
Data Lake
Flexible, scalable storage
All data types: structured, semi-structured, unstructured
Flexibility ✓
+
🏛️
Data Warehouse
Structured, organized
ACID transactions
Schema enforcement
Structure ✓
=
Lakehouse Architecture
Flexibility + Structure | Real-time + Historical | Organized + Secure
🔒 Governance
Right teams → Right data
📡 Real-time Streaming
Live orders, sensor data
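To make "schema enforcement" and "ACID transactions" concrete, here is a minimal pure-Python sketch. It is a toy model, not how Databricks or any lakehouse table format is implemented: ToyTable, SchemaError, and ORDERS_SCHEMA are hypothetical names invented for this illustration. It shows only the idea that a write is checked against the table schema and that a bad batch is rejected as a whole.

```python
from dataclasses import dataclass, field

# Hypothetical schema for illustration: column name -> required Python type.
ORDERS_SCHEMA = {"order_id": int, "customer": str, "amount": float}


class SchemaError(ValueError):
    """Raised when a row does not match the table schema."""


@dataclass
class ToyTable:
    """In-memory stand-in for a governed table (illustration only)."""
    schema: dict
    rows: list = field(default_factory=list)

    def _validate(self, row: dict) -> None:
        # Column names must match the schema exactly.
        if set(row) != set(self.schema):
            raise SchemaError(f"columns {sorted(row)} != {sorted(self.schema)}")
        # Each value must have the declared type.
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise SchemaError(f"{column!r} expected {expected.__name__}")

    def append(self, batch: list) -> None:
        # Validate the whole batch before writing, so the append is all-or-nothing.
        for row in batch:
            self._validate(row)
        self.rows.extend(batch)


table = ToyTable(ORDERS_SCHEMA)
table.append([{"order_id": 1, "customer": "John", "amount": 19.99}])

try:
    # The second row has a string amount, so the entire batch is rejected.
    table.append([
        {"order_id": 2, "customer": "Sharon", "amount": 5.0},
        {"order_id": 3, "customer": "John", "amount": "oops"},
    ])
except SchemaError as err:
    print(f"rejected: {err}")

print(len(table.rows))  # the valid first write is still the only row
```

A plain data lake would accept both batches as files; the warehouse-style checks are what keep the table consistent.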
Databricks Architecture
AWS
Azure
GCP
Control Plane
(Databricks Account)
🎮
The "Remote Control"
Manage clusters
Orchestrate tasks
Notebooks, Workflows, Repos
Compute Plane
(Your Cloud Account)
🚗
The "Car" — does the work
Worker nodes (RAM+CPU)
Cloud storage (Data Lake)
Actual data processing & storage
Serverless Compute Plane
Fully managed by Databricks. No cluster setup. Automatic provisioning & scaling.
Perfect for quick, on-demand tasks ⚡
Workspace & Pricing Tiers
Databricks Workspace
A collaborative environment for managing data, clusters, notebooks, and pipelines
Tip: Use separate workspaces for production vs. development
PREMIUM TIER
🔒
For sensitive data
✓ Role-based access control
✓ Secure cluster connectivity
✓ Column-level security
✓ Customer transactions & PII
Best for production & compliance
STANDARD TIER
📋
For non-sensitive data
✓ Basic data analysis
✓ ETL pipeline testing
✓ Product sales reports
✓ Development workloads
Best for dev & smaller projects
Clusters: Compute Options
All-Purpose Compute
🔧
Interactive development & testing
Stays active until terminated (manually or by auto-termination)
Explore data, iterate, experiment
Best for: Dev & Testing
Job Compute
⏰
Automated, scheduled tasks
Starts when job begins, stops when done
Saves resources & costs
Best for: Production & Scheduling
Cluster Pools
Pre-created VMs — new clusters start almost instantly!
Frequent Jobs
Quick cluster starts
Scaling Workloads
Minimize provisioning delays
Cost Optimization
Idle pool instances incur cloud VM costs, but no Databricks (DBU) charges
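The compute options above differ mainly in a few settings. The sketch below builds two cluster definitions as plain Python dictionaries. The field names (spark_version, node_type_id, autoscale, autotermination_minutes, instance_pool_id) follow the Databricks Clusters API as best understood here and should be checked against the current documentation; the cluster_spec helper and every angle-bracket value are placeholders invented for this example.

```python
def cluster_spec(spark_version, min_workers, max_workers,
                 node_type_id=None, instance_pool_id=None,
                 autotermination_minutes=None):
    """Build a cluster definition dict (a sketch, not a complete spec)."""
    # A cluster takes its VMs either from a node type or from a pool, not both.
    if (node_type_id is None) == (instance_pool_id is None):
        raise ValueError("set exactly one of node_type_id or instance_pool_id")
    spec = {
        "spark_version": spark_version,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }
    if instance_pool_id is not None:
        spec["instance_pool_id"] = instance_pool_id
    else:
        spec["node_type_id"] = node_type_id
    if autotermination_minutes is not None:
        spec["autotermination_minutes"] = autotermination_minutes
    return spec


# All-purpose compute: interactive, so set an idle timeout to limit cost.
all_purpose = cluster_spec(
    "<runtime-version>", 1, 4,
    node_type_id="<node-type>",
    autotermination_minutes=30,
)

# Job compute drawn from a pool: no idle timeout needed, because the
# cluster stops when the job finishes.
job_cluster = cluster_spec(
    "<runtime-version>", 2, 8,
    instance_pool_id="<pool-id>",
)

print(all_purpose)
print(job_cluster)
```

A definition like job_cluster would typically be embedded in a job definition, while all_purpose would be created once and shared by a team.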
Notebooks & Collaboration
Databricks Notebook
%python df = spark.read...
%sql SELECT * FROM orders
%scala val rdd = sc.para...
✓ Multi-language in one notebook
✓ Real-time co-authoring
✓ Version control built-in
✓ No Jupyter or SSMS needed
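Multi-language notebooks work by tagging a cell with a magic command such as %python, %sql, or %scala on its first line. The sketch below is a toy illustration of that routing idea, not Databricks' implementation: split_cell, KNOWN_MAGICS, and the assumed default language are hypothetical names made up for this example.

```python
DEFAULT_LANGUAGE = "python"  # assumed notebook default for this sketch
KNOWN_MAGICS = {"python", "sql", "scala", "r"}


def split_cell(cell: str, default: str = DEFAULT_LANGUAGE):
    """Return (language, source) for one notebook cell."""
    stripped = cell.strip()
    if stripped.startswith("%"):
        # Split "%sql SELECT ..." into the magic name and the remaining source.
        parts = stripped[1:].split(None, 1)
        magic = parts[0].lower() if parts else ""
        if magic in KNOWN_MAGICS:
            source = parts[1] if len(parts) > 1 else ""
            return magic, source
    # No recognized magic: the cell runs in the notebook's default language.
    return default, stripped


cells = [
    "%sql SELECT * FROM orders",
    "%python\nprint('hi')",
    "total = 1 + 1",
]

for cell in cells:
    print(split_cell(cell))
```

Each cell is dispatched by its own tag, which is why a SQL query and Python code can sit side by side in one notebook.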
Databricks = Unified Analytics Platform
ELT Pipelines
Data Engineers
ML Models
Data Scientists
Dashboards
BI Experts
Collaboration
All Teams
John's Ready to Build!
One platform. Infinite possibilities.