Narrating
Data Ingestion using Databricks From Azure Data Lake & Relational Databases into Databricks ☁️ Data Lake (ADLS) Mount & read cloud storage 🗄️ Database (JDBC) Connect & query SQL Server J John S Sharon Connecting to Azure Data Lake Storage Setting up the Spark configuration with access keys ☁️ ADLS Azure Data Lake Storage 🔑 Access Key 🧱 Databricks Spark Cluster spark .conf.set( "fs.azure.account.key.<account>.dfs.core.windows.net" , "<your-storage-account-access-key>" ) How to Get Access Keys 1. Go to Azure Portal 2. Open Storage Account 3. Click "Access Keys" 4. Copy Key1 or Key2 John Sharon Listing Files in ADLS dbutils .fs.ls( "abfss://amazondata@account.dfs.core.windows.net" ) 📁 Container: amazondata 📄 customers.csv 📄 orders.csv 📄 products.csv 📄 transactions.csv 1.2 MB 3.4 MB 0.8 MB 5.1 MB ⚠️ Cluster Restart Warning When the cluster stops, the connection is LOST. You must reconfigure every time! 📦 Container 1 amazondata 📦 Container 2 salesdata 📦 Container 3 analytics ❌ Impractical for production Mounting — The Persistent Solution ☁️ ADLS Container 📄 customers.csv 📄 orders.csv 📄 products.csv 🔗 Mount Point /mnt/mnt_name 🧱 Databricks Access via /mnt/mnt_name as if files are local ✅ No data copied Virtual path only ✅ Persists Survives cluster restarts container_name = "amazondata" account_name = "adlsstoragedata01" dbutils .fs.mount( source = "wasbs://{container}@{account}.blob.core.windows.net" mount_point = "/mnt/mnt_name" extra_configs = {account_key: storage_key} ) ✅ Persists across cluster restarts! Reading Data into a DataFrame Using PySpark to load CSV from the mount point customers_df = spark.read.format( "csv" ) \ .option( "header" , "true" ).option( "inferSchema" , "true" ) \ .load( "/mnt/test_amz_data" ) 📋 format("csv") Data is in CSV format 🏷️ header = "true" First row = column names 🔍 inferSchema = "true" Auto-detect data types 📂 load(path) Read from mount customers_df.show(5) id name email city 1 Alice Johnson alice@email.com New York 2 Bob Smith bob@email.com Chicago 3 Carol Davis carol@email.com Seattle Data Ingestion from Database Connecting to relational databases using JDBC 🗄️ SQL Server Relational Database Customer data tables JDBC 🧱 Databricks Spark Cluster DataFrame processing 🌐 Universal Works with most databases 🔒 Secure Authenticated connections 📊 SQL Support Query structured data Stable Reliable connections John Sharon JDBC Connection Setup 🖥️ Hostname Server address (IP/domain) 🔌 Port: 1433 SQL Server default port 🗃️ Database Name Target database 🔑 Credentials User + Password + Driver jdbcUrl = "jdbc:sqlserver://{hostname}:{port};database={dbname}" # Combines host, port, and database into one connection string connectionProperties = { "user" : "your_username" , "password" : "your_password" , "driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver" } 🔒 Secure connection to SQL Server established Reading Data via JDBC Using spark.read.jdbc() to fetch data into a DataFrame table_name = "customers" customers_df = spark.read.jdbc(url= jdbcUrl , table= table_name , properties= connectionProperties ) customers_df .show(5) 🗄️ SQL Server "customers" table JDBC spark.read.jdbc() Fetch & convert 📊 Spark DataFrame Process, transform, analyze customers_df.show(5) id name email purchase_count 1 Alice Johnson alice@globalmart.com 47 2 Bob Smith bob@globalmart.com 23 3 Carol Davis carol@globalmart.com 31 Data Ingestion — Complete! Two powerful methods for bringing data into Databricks ☁️ ADLS Mounting ✅ Persistent virtual path to cloud storage ✅ Survives cluster restarts ✅ Access ADLS as if files are local ✅ Read with spark.read.format("csv") Best for: Data Lakes, Cloud Storage 🗄️ JDBC Connection ✅ Universal database connectivity ✅ Secure authenticated connections ✅ SQL query support ✅ Read with spark.read.jdbc() Best for: SQL Server, RDBMS J John S Sharon
Click Play to start
0:00 / 0:00