Databricks Cheatsheet
Lakehouse Platform – Delta Lake, Unity Catalog, Spark SQL & MLflow
Platform: Databricks | Version: v3.0 | Updated: Jan 2025
💡 Databricks Lakehouse: Combines the data warehouse and the data lake in a single platform, with Delta Lake, Unity Catalog, and integrated MLflow.
🏗️ Delta Lake Operations
Delta Table Creation
Create a Delta table:
-- SQL
CREATE TABLE sales_delta
USING delta
LOCATION '/mnt/datalake/sales'
AS SELECT * FROM sales_parquet;
# Python
df.write.format("delta") \
    .mode("overwrite") \
    .save("/mnt/datalake/sales")
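Reading the table back in Python, a minimal sketch using the path from the example above:
# Python - load the Delta table from its path
df = spark.read.format("delta").load("/mnt/datalake/sales")
display(df.limit(10))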
ACID Transactions
UPDATE, DELETE, MERGE operations:
-- UPDATE
UPDATE sales_delta
SET price = price * 1.1
WHERE category = 'Electronics';
-- MERGE (UPSERT)
MERGE INTO customers_target t
USING customers_source s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
UPDATE SET t.email = s.email
WHEN NOT MATCHED THEN
INSERT (customer_id, email) VALUES (s.customer_id, s.email);
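The same upsert is available through the DeltaTable API in Python; a sketch assuming a source DataFrame named customers_source_df:
# Python - MERGE via the DeltaTable API (delta-spark)
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "customers_target")
(target.alias("t")
    .merge(customers_source_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdate(set={"email": "s.email"})
    .whenNotMatchedInsert(values={"customer_id": "s.customer_id", "email": "s.email"})
    .execute())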
Time Travel
Inspect data history:
-- Toon table history
DESCRIBE HISTORY sales_delta;
-- Data van specifieke versie
SELECT * FROM sales_delta
VERSION AS OF 5;
-- Data van specifieke timestamp
SELECT * FROM sales_delta
TIMESTAMP AS OF '2024-01-15 10:30:00';
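The same time travel reads are available in Python via reader options:
# Python - read an older version or timestamp
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("/mnt/datalake/sales")
df_ts = spark.read.format("delta").option("timestampAsOf", "2024-01-15 10:30:00").load("/mnt/datalake/sales")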
Optimization & Vacuum
Performance optimization:
-- Z-ordering for query performance
OPTIMIZE sales_delta
ZORDER BY (customer_id, transaction_date);
-- Compact small files
OPTIMIZE sales_delta;
-- Clean up old versions (7-day retention)
VACUUM sales_delta
RETAIN 168 HOURS;
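The same maintenance can run from Python; a sketch with the DeltaTable API (Delta Lake 2.0+):
# Python - OPTIMIZE/ZORDER and VACUUM programmatically
from delta.tables import DeltaTable

dt = DeltaTable.forName(spark, "sales_delta")
dt.optimize().executeZOrderBy("customer_id", "transaction_date")
dt.vacuum(168)  # retention in hours (7 days)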
🏛️ Unity Catalog - Unified Governance
| Object | Command | Description |
|---|---|---|
| Catalog | CREATE CATALOG sales_catalog | Top-level container for data |
| Schema | CREATE SCHEMA sales_catalog.raw | Database within a catalog |
| Table | CREATE TABLE sales_catalog.raw.customers | Table in a schema |
| View | CREATE VIEW sales_catalog.analytics.daily_sales | View for data abstraction |
| Volume | CREATE VOLUME sales_catalog.raw.files | For unstructured data |
| Share | CREATE SHARE external_partners | Share data with external parties |
-- Complete Unity Catalog setup
-- 1. Create the catalog
CREATE CATALOG IF NOT EXISTS enterprise_data
COMMENT 'Enterprise data catalog for all business units';
-- 2. Schemas for the different zones
CREATE SCHEMA enterprise_data.raw
COMMENT 'Raw landing zone for source systems';
CREATE SCHEMA enterprise_data.staging
COMMENT 'Staging zone for data transformation';
CREATE SCHEMA enterprise_data.analytics
COMMENT 'Analytics zone for business reporting';
-- 3. Table with managed storage
CREATE TABLE enterprise_data.raw.customers (
  customer_id BIGINT,
  first_name STRING,
  last_name STRING,
  email STRING,
  created_at TIMESTAMP
)
USING delta
COMMENT 'Customer master data'
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);
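Grants complete the governance setup; a sketch via spark.sql, where the group name data_engineers is hypothetical:
# Python - grant access on the new objects
spark.sql("GRANT USE CATALOG ON CATALOG enterprise_data TO `data_engineers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA enterprise_data.raw TO `data_engineers`")
spark.sql("GRANT SELECT ON TABLE enterprise_data.raw.customers TO `data_engineers`")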
📓 Spark SQL - Databricks Notebooks
Notebook Basics
Notebook cell commands:
-- Magic commands
%sql # SQL cell
%python # Python cell
%scala # Scala cell
%r # R cell
%sh # Shell commands
%fs # File system commands
%md # Markdown documentation
# Widgets for interactivity
dbutils.widgets.dropdown("environment", "dev", ["dev", "test", "prod"])
env = dbutils.widgets.get("environment")
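A sketch of using the widget value to parameterize a query; the table name sales_catalog.{env}.orders is hypothetical:
# Python - parameterize a table reference with the widget value
df = spark.table(f"sales_catalog.{env}.orders")  # assumes per-environment schemas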
dbutils Utilities
Handy Databricks utilities:
# File system operations
dbutils.fs.ls("/mnt/datalake")
dbutils.fs.cp("source", "destination")
dbutils.fs.mkdirs("/mnt/new_folder")
# Secrets management
secret = dbutils.secrets.get("scope", "key")
# Notebook workflows
dbutils.notebook.run("notebook_path", 300, {"param": "value"})  # path, timeout in seconds, arguments
# Display helpers
display(df) # Show dataframe with UI
Spark Configuration
Cluster and Spark config:
# Python - Spark session config
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("DataProcessing") \
    .config("spark.sql.shuffle.partitions", "200") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.databricks.delta.optimizeWrite.enabled", "true") \
    .getOrCreate()
-- SQL - Session config
SET spark.sql.shuffle.partitions = 200;
SET spark.databricks.delta.optimizeWrite.enabled = true;
⚙️ Workflows & Job Scheduling
Task Types
Different task types:
# Notebook task
{
  "task_key": "process_sales",
  "notebook_task": {
    "notebook_path": "/Shared/ETL/sales_pipeline",
    "base_parameters": {
      "start_date": "2024-01-01",
      "end_date": "2024-12-31"
    }
  },
  "timeout_seconds": 3600
}
# Spark JAR task
{
  "spark_jar_task": {
    "main_class_name": "com.company.ETLJob",
    "parameters": ["--date", "2024-01-01"]
  }
}
Schedule & Triggers
Job scheduling:
# Cron schedule
{
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",  # daily at 06:00
    "timezone_id": "Europe/Amsterdam",
    "pause_status": "UNPAUSED"
  }
}
# File arrival trigger
{
  "file_arrival": {
    "url": "dbfs:/mnt/landing/incoming/",
    "min_time_between_triggers_seconds": 300
  }
}
Cluster Configuration
Job cluster settings:
# New cluster config
{
  "new_cluster": {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "spark_conf": {
      "spark.databricks.delta.preview.enabled": "true"
    },
    "autoscale": {  # use either autoscale or a fixed num_workers, not both
      "min_workers": 2,
      "max_workers": 8
    }
  }
}
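These JSON fragments are the payload format of the Jobs 2.1 REST API; a minimal sketch with placeholder host and token:
# Python - create the job via the Jobs API
import requests

job_spec = {
    "name": "sales_pipeline",
    "tasks": [{
        "task_key": "process_sales",
        "notebook_task": {"notebook_path": "/Shared/ETL/sales_pipeline"},
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "autoscale": {"min_workers": 2, "max_workers": 8},
        },
    }],
}
resp = requests.post(
    "https://<workspace>.cloud.databricks.com/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <token>"},
    json=job_spec,
)
print(resp.json())  # {"job_id": ...} on success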
🤖 MLflow - Machine Learning
| Functionality | Code | Description |
|---|---|---|
| Start Run | mlflow.start_run() | Start a new MLflow run |
| Log Parameters | mlflow.log_param("learning_rate", 0.01) | Log model parameters |
| Log Metrics | mlflow.log_metric("accuracy", 0.95) | Log evaluation metrics |
| Log Model | mlflow.sklearn.log_model(model, "model") | Save the model artifact |
| Autologging | mlflow.autolog() | Automatic logging |
| Load Model | mlflow.pyfunc.load_model("runs:/run_id/model") | Load a model for inference |
# Complete MLflow example
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Start experiment (assumes X_train, X_test, y_train, y_test already exist)
mlflow.set_experiment("/Shared/churn_prediction")
with mlflow.start_run(run_name="rf_baseline"):
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)
    # Evaluate
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions)
    recall = recall_score(y_test, predictions)
    # Log metrics
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)
    # Log model
    mlflow.sklearn.log_model(model, "churn_model")
    # Log artifact (feature importance plot, assumes the file exists)
    mlflow.log_artifact("feature_importance.png")
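Loading the logged model back for inference, assuming the run above has just completed in the same session:
# Python - load and score with the logged model
run_id = mlflow.last_active_run().info.run_id
loaded = mlflow.pyfunc.load_model(f"runs:/{run_id}/churn_model")
preds = loaded.predict(X_test)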
📊 Databricks SQL - BI & Analytics
SQL Warehouse
SQL warehouse configuration:
-- Note: SQL warehouses are created via the UI or the REST API;
-- Databricks SQL has no CREATE WAREHOUSE statement.
# Python - create a warehouse via the SQL Warehouses API
import requests
requests.post(
    "https://<workspace>.cloud.databricks.com/api/2.0/sql/warehouses",
    headers={"Authorization": "Bearer <token>"},
    json={
        "name": "analytics_wh",
        "cluster_size": "Small",
        "auto_stop_mins": 10,
        "min_num_clusters": 1,
        "max_num_clusters": 5,
    },
)
-- Inspect warehouses via system tables
SELECT warehouse_id, warehouse_name, warehouse_size
FROM system.compute.warehouses;
Query History
Monitor query performance:
-- Recent queries (column names per the system.query.history schema)
SELECT
  statement_text,
  total_duration_ms,
  produced_rows,
  executed_by,
  compute.warehouse_id
FROM system.query.history
WHERE start_time > CURRENT_TIMESTAMP - INTERVAL 1 DAY
ORDER BY total_duration_ms DESC
LIMIT 100;
BI Integration
Connect Power BI/Tableau:
# Connection settings for Power BI
Server hostname: <workspace>.cloud.databricks.com
Port: 443
HTTP Path: /sql/1.0/warehouses/<warehouse-id>
Authentication: Personal Access Token
# Python connection
from databricks import sql
connection = sql.connect(
server_hostname="<workspace>.cloud.databricks.com",
http_path="/sql/1.0/warehouses/<id>",
access_token="dapi..."
)
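Running a query over that connection, a minimal sketch (the table name is a placeholder):
# Execute a query and fetch the results
with connection.cursor() as cursor:
    cursor.execute("SELECT * FROM catalog.schema.table LIMIT 10")
    for row in cursor.fetchall():
        print(row)
connection.close()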
⚡ Performance Optimization
| Technique | Code | Impact |
|---|---|---|
| Partitioning | PARTITIONED BY (date DATE) | Faster date-range queries |
| Z-Ordering | ZORDER BY customer_id | Better data locality |
| File Size | spark.conf.set("spark.sql.files.maxPartitionBytes", "128MB") | Optimal partition size |
| Caching | df.cache() or CACHE SELECT * FROM table | Faster repeated queries |
| Bloom Filters | CREATE BLOOMFILTER INDEX ON TABLE sales FOR COLUMNS(customer_id) | Faster point lookups and joins |
| Data Skipping | TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = 32) | Statistics on more columns for file pruning |
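A write sketch for the partitioning technique from the table above (df and the target table name are assumptions):
# Python - write a date-partitioned Delta table
(df.write.format("delta")
    .mode("overwrite")
    .partitionBy("date")
    .saveAsTable("enterprise_data.analytics.sales_partitioned"))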
💡 Performance Tips:
1. Use the Photon engine for faster queries
2. Configure auto-scaling for variable workloads
3. Use the Delta cache for repeated reads
4. Run OPTIMIZE and ZORDER BY on a regular schedule
5. Monitor performance via Query History
🔒 Security & Governance
Row & Column Security
Fine-grained access control:
-- Row filter (dynamic): define a SQL UDF, then attach it to the table
CREATE FUNCTION sales_region_filter(region_param STRING)
RETURN EXISTS (
  SELECT 1 FROM user_regions ur
  WHERE ur.region = region_param
    AND ur.user = CURRENT_USER()
);
ALTER TABLE sales SET ROW FILTER sales_region_filter ON (region);
-- Column mask (PII): a SQL UDF attached to a column
CREATE FUNCTION email_mask(email STRING)
RETURN CASE
  WHEN CURRENT_USER() = 'admin@company.com' THEN email
  ELSE REGEXP_REPLACE(email, '(.)(.*)@', '***@')
END;
ALTER TABLE customers ALTER COLUMN email SET MASK email_mask;
Lineage & Auditing
Data lineage tracking:
-- Table lineage
DESCRIBE EXTENDED analytics.daily_sales;
-- Unity Catalog lineage is also browsable in Catalog Explorer
# Python - table lineage via the Data Lineage REST API
import requests
resp = requests.get(
    "https://<workspace>.cloud.databricks.com/api/2.0/lineage-tracking/table-lineage",
    headers={"Authorization": "Bearer <token>"},
    params={"table_name": "catalog.schema.table", "include_entity_lineage": "true"},
)
Backup & Recovery
Disaster recovery:
-- Clone for backup
CREATE TABLE sales_backup
DEEP CLONE sales_production;
-- Restore from version
RESTORE TABLE sales
TO VERSION AS OF 10;
-- Export for off-platform backup
# Delta Lake export
df.write.format("delta") \
    .save("abfss://container@storage.dfs.core.windows.net/backup/sales")
📋 Quick Reference - Daily Use
-- Most-used Databricks commands
-- 1. Read data from a Delta table
SELECT * FROM delta.`/mnt/datalake/sales`;
-- 2. Create a new Delta table
CREATE TABLE sales_delta
USING delta
LOCATION '/mnt/datalake/sales'
AS SELECT * FROM sales_csv;
-- 3. Time travel query
SELECT * FROM sales_delta
TIMESTAMP AS OF '2024-01-15';
-- 4. MERGE (UPSERT) operation
MERGE INTO target_table t
USING source_table s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
-- 5. Performance optimization
OPTIMIZE sales_delta
ZORDER BY (customer_id, date);
-- 6. Unity Catalog table
CREATE TABLE catalog.schema.table_name
AS SELECT * FROM source_table;
-- 7. Query MLflow runs (Python; there is no SQL table of MLflow artifacts)
# mlflow.search_runs returns the runs of an experiment as a pandas DataFrame
import mlflow
runs = mlflow.search_runs(experiment_ids=["123"])
DataPartner365 Learning Center | databricks@datapartner365.nl
This Databricks cheatsheet is free to use for personal and professional purposes.
For updates and more cheatsheets, visit: datapartner365.nl/downloads
How to use:
1. Print this page as a PDF (Ctrl+P → Save as PDF)
2. Keep it as a reference next to your Databricks workspace
3. Share it with your team for consistent development