Narrating
Data Ingestion using Databricks
From Azure Data Lake & Relational Databases into Databricks
☁️
Data Lake (ADLS)
Mount & read cloud storage
🗄️
Database (JDBC)
Connect & query SQL Server
J
John
S
Sharon
Connecting to Azure Data Lake Storage
Setting up the Spark configuration with access keys
☁️
ADLS
Azure Data Lake
Storage
🔑 Access Key
🧱
Databricks
Spark Cluster
spark
.conf.set(
"fs.azure.account.key.<account>.dfs.core.windows.net"
,
"<your-storage-account-access-key>"
)
How to Get Access Keys
1. Go to Azure Portal
2. Open Storage Account
3. Click "Access Keys"
4. Copy Key1 or Key2
John
Sharon
Listing Files in ADLS
dbutils
.fs.ls(
"abfss://amazondata@account.dfs.core.windows.net"
)
📁 Container: amazondata
📄 customers.csv
📄 orders.csv
📄 products.csv
📄 transactions.csv
1.2 MB
3.4 MB
0.8 MB
5.1 MB
⚠️
Cluster Restart Warning
When the cluster stops,
the connection is LOST.
You must reconfigure
every time!
📦 Container 1
amazondata
📦 Container 2
salesdata
📦 Container 3
analytics
→
❌ Impractical
for production
Mounting — The Persistent Solution
☁️
ADLS Container
📄 customers.csv
📄 orders.csv
📄 products.csv
🔗 Mount Point
/mnt/mnt_name
🧱
Databricks
Access via /mnt/mnt_name
as if files are local
✅ No data copied
Virtual path only
✅ Persists
Survives cluster restarts
container_name
=
"amazondata"
account_name
=
"adlsstoragedata01"
dbutils
.fs.mount(
source =
"wasbs://{container}@{account}.blob.core.windows.net"
mount_point =
"/mnt/mnt_name"
extra_configs =
{account_key: storage_key}
)
✅ Persists across cluster restarts!
Reading Data into a DataFrame
Using PySpark to load CSV from the mount point
customers_df
= spark.read.format(
"csv"
) \
.option(
"header"
,
"true"
).option(
"inferSchema"
,
"true"
) \
.load(
"/mnt/test_amz_data"
)
📋
format("csv")
Data is in CSV format
🏷️
header = "true"
First row = column names
🔍
inferSchema = "true"
Auto-detect data types
📂
load(path)
Read from mount
customers_df.show(5)
id
name
email
city
1
Alice Johnson
alice@email.com
New York
2
Bob Smith
bob@email.com
Chicago
3
Carol Davis
carol@email.com
Seattle
Data Ingestion from Database
Connecting to relational databases using JDBC
🗄️
SQL Server
Relational Database
Customer data tables
JDBC
🧱
Databricks
Spark Cluster
DataFrame processing
🌐
Universal
Works with most databases
🔒
Secure
Authenticated connections
📊
SQL Support
Query structured data
⚡
Stable
Reliable connections
John
Sharon
JDBC Connection Setup
🖥️
Hostname
Server address (IP/domain)
🔌
Port: 1433
SQL Server default port
🗃️
Database Name
Target database
🔑
Credentials
User + Password + Driver
jdbcUrl
=
"jdbc:sqlserver://{hostname}:{port};database={dbname}"
# Combines host, port, and database into one connection string
connectionProperties
= {
"user"
:
"your_username"
,
"password"
:
"your_password"
,
"driver"
:
"com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
🔒 Secure connection to SQL Server established
Reading Data via JDBC
Using spark.read.jdbc() to fetch data into a DataFrame
table_name
=
"customers"
customers_df
= spark.read.jdbc(url=
jdbcUrl
, table=
table_name
, properties=
connectionProperties
)
customers_df
.show(5)
🗄️
SQL Server
"customers" table
JDBC
⚡
spark.read.jdbc()
Fetch & convert
📊
Spark DataFrame
Process, transform, analyze
customers_df.show(5)
id
name
email
purchase_count
1
Alice Johnson
alice@globalmart.com
47
2
Bob Smith
bob@globalmart.com
23
3
Carol Davis
carol@globalmart.com
31
Data Ingestion — Complete!
Two powerful methods for bringing data into Databricks
☁️
ADLS Mounting
✅ Persistent virtual path to cloud storage
✅ Survives cluster restarts
✅ Access ADLS as if files are local
✅ Read with spark.read.format("csv")
Best for: Data Lakes, Cloud Storage
🗄️
JDBC Connection
✅ Universal database connectivity
✅ Secure authenticated connections
✅ SQL query support
✅ Read with spark.read.jdbc()
Best for: SQL Server, RDBMS
J
John
S
Sharon
Click Play to start
Play
Pause
Mute
Restart
0:00 / 0:00