Azure Synapse Analytics: A Complete Guide to the Modern Data Platform

Last updated: 20 December 2024
Reading time: 50 minutes
Azure Synapse Analytics, Azure Data Lake, SQL Pools, Apache Spark, Data Warehousing, Azure Synapse, Modern Data Platform

Learn the fundamentals and advanced techniques of Azure Synapse Analytics: from data lake integration to real-time analytics and practical implementation strategies for the modern data platform.

1. Introduction to Azure Synapse Analytics

What is Azure Synapse Analytics?

Azure Synapse Analytics is an integrated data platform that combines data warehousing, big data analytics and data integration in a single unified service. It brings SQL and Apache Spark together for enterprise data processing.

Integrated Data Platform

Combines SQL, Spark and data integration in one service

Real-time Analytics

Stream processing and real-time data ingestion

Unlimited Scale

Elastic scaling of compute and storage

Enterprise Security

End-to-end security and compliance

Feature | Traditional Data Warehouse | Azure Synapse Analytics | Benefit
Architecture | Separate systems | Unified platform | Simpler management
Data types | Structured data only | All data types | More complete analysis
Processing | Batch processing | Batch + real-time | Faster insights
Scalability | Limited scaling | Elastic scaling | Flexible capacity
Cost | Fixed costs | Pay-per-use | Cost efficiency
Development | Complex integration | Integrated tooling | Faster development

2. Synapse Architecture Overview

Synapse Components

The Azure Synapse architecture consists of a set of core components that work together to form an end-to-end data platform.

Core Architecture Components

-- AZURE SYNAPSE ARCHITECTURE OVERVIEW

-- 1. SYNAPSE WORKSPACE: central management point
-- Azure Portal: Create Synapse Workspace
/* 
Workspace Name: synapse-dataplatform
Region: West Europe
Data Lake Storage Gen2: 
    - Primary Storage Account: datalake001
    - File System: synapse-workspace
SQL Administrator: admin@datapartner365.nl
*/

-- 2. DEDICATED SQL POOL (formerly SQL DW)
CREATE WORKLOAD GROUP reporting_workload
WITH (
    MIN_PERCENTAGE_RESOURCE = 30,
    CAP_PERCENTAGE_RESOURCE = 60,
    REQUEST_MIN_RESOURCE_GRANT_PERCENT = 5
);

CREATE WORKLOAD CLASSIFIER reporting_classifier
WITH (
    WORKLOAD_GROUP = 'reporting_workload',
    MEMBERNAME = 'reporting_user'
);

-- 3. SERVERLESS SQL POOL: On-demand querying
CREATE EXTERNAL TABLE serverless_sales_data
WITH (
    LOCATION = 'abfss://sales@datalake001.dfs.core.windows.net/raw/sales/',
    DATA_SOURCE = sales_data_source,
    FILE_FORMAT = parquet_format
)
AS
SELECT 
    customer_id,
    product_id,
    sales_amount,
    transaction_date
FROM OPENROWSET(
    BULK 'https://datalake001.dfs.core.windows.net/sales/raw/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;

-- 4. APACHE SPARK POOL: big data processing
-- Synapse notebook: Python code for data processing (shown here as a comment block)
/*
# Initialize Spark session
spark.conf.set("fs.azure.account.key.datalake001.dfs.core.windows.net", 
               "your_storage_account_key")

# Read data from Data Lake
df_sales = spark.read.parquet("abfss://sales@datalake001.dfs.core.windows.net/raw/")
df_customers = spark.read.parquet("abfss://customers@datalake001.dfs.core.windows.net/raw/")

# Perform transformations
df_enriched = df_sales.join(df_customers, "customer_id", "left")

# Write to curated zone
df_enriched.write.mode("overwrite").parquet(
    "abfss://sales@datalake001.dfs.core.windows.net/curated/enriched_sales"
)
"""

-- 5. SYNAPSE PIPELINES: data integration
-- Azure Data Factory activities within Synapse
/*
Pipeline: Daily_Sales_ETL
Activities:
1. Copy data from SQL Server to the data lake (raw)
2. Spark job for data transformation
3. Load into the dedicated SQL pool
4. Execute stored procedures for aggregations
*/

3. SQL Pools: Dedicated vs Serverless

SQL Processing in Synapse

SQL pools come in two flavours for SQL-based data processing: dedicated SQL pools for production workloads and serverless SQL pools for ad-hoc querying.

Dedicated SQL Pool Implementation

-- DEDICATED SQL POOL: production data warehouse

-- 1. CREATE DATABASE AND TABLES WITH DISTRIBUTION STRATEGIES
-- (run against the master database of the logical server)
CREATE DATABASE sales_dw
( EDITION = 'DataWarehouse', SERVICE_OBJECTIVE = 'DW500c' );

-- Note: the USE statement is not supported in dedicated SQL pools;
-- connect directly to the sales_dw database for the statements below.

-- Fact table with HASH distribution for an even spread of the data
CREATE TABLE fact_sales (
    sales_id BIGINT IDENTITY(1,1) NOT NULL,
    customer_id INT NOT NULL,
    product_id INT NOT NULL,
    date_key INT NOT NULL,
    store_id INT NOT NULL,
    sales_amount DECIMAL(18,2) NOT NULL,
    quantity INT NOT NULL,
    discount_amount DECIMAL(18,2) DEFAULT 0,
    created_date DATETIME2 DEFAULT SYSUTCDATETIME()
)
WITH (
    DISTRIBUTION = HASH(customer_id),  -- optimal distribution for customer-centric queries
    CLUSTERED COLUMNSTORE INDEX,        -- columnstore for compression and performance
    PARTITION (date_key RANGE RIGHT FOR VALUES (
        20230101, 20230401, 20230701, 20231001,
        20240101, 20240401, 20240701, 20241001
    ))
);

-- Dimension table with REPLICATE distribution for small tables
CREATE TABLE dim_product (
    product_id INT NOT NULL,
    product_name NVARCHAR(255) NOT NULL,
    category_id INT NOT NULL,
    category_name NVARCHAR(100) NOT NULL,
    brand_name NVARCHAR(100) NOT NULL,
    unit_price DECIMAL(18,2) NOT NULL,
    is_active BIT DEFAULT 1,
    valid_from DATE NOT NULL,
    valid_to DATE
)
WITH (
    DISTRIBUTION = REPLICATE,  -- replicated to every compute node
    CLUSTERED INDEX (product_id)  -- rowstore index for dimension tables
);

-- 2. CREATE EXTERNAL TABLE FOR DATA LAKE INTEGRATION
CREATE EXTERNAL DATA SOURCE sales_data_source 
WITH (
    LOCATION = 'abfss://sales@datalake001.dfs.core.windows.net',
    CREDENTIAL = sales_credential
);

CREATE EXTERNAL FILE FORMAT parquet_file_format 
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

CREATE EXTERNAL TABLE external_sales_data (
    customer_id INT,
    product_id INT,
    sales_date DATE,
    amount DECIMAL(18,2),
    quantity INT
)
WITH (
    LOCATION = '/raw/sales_data/*.parquet',
    DATA_SOURCE = sales_data_source,
    FILE_FORMAT = parquet_file_format
);

-- 3. CTAS (CREATE TABLE AS SELECT) FOR PERFORMANCE
CREATE TABLE monthly_sales_summary
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT 
    fs.date_key / 10000 AS sales_year,        -- date_key is an integer in YYYYMMDD format
    (fs.date_key / 100) % 100 AS sales_month,
    dp.category_name,
    COUNT(*) AS transaction_count,
    SUM(fs.sales_amount) AS total_sales,
    AVG(fs.sales_amount) AS avg_sale_amount
FROM fact_sales fs
INNER JOIN dim_product dp ON fs.product_id = dp.product_id
GROUP BY 
    fs.date_key / 10000,
    (fs.date_key / 100) % 100,
    dp.category_name;

-- 4. WORKLOAD MANAGEMENT CONFIGURATION
CREATE WORKLOAD GROUP etl_workload
WITH (
    MIN_PERCENTAGE_RESOURCE = 40,
    CAP_PERCENTAGE_RESOURCE = 80,
    REQUEST_MIN_RESOURCE_GRANT_PERCENT = 10,
    IMPORTANCE = HIGH
);

CREATE WORKLOAD CLASSIFIER etl_classifier
WITH (
    WORKLOAD_GROUP = 'etl_workload',
    MEMBERNAME = 'etl_user',
    WLM_LABEL = 'etl_process'
);

-- 5. MATERIALIZED VIEWS FOR PERFORMANCE
CREATE MATERIALIZED VIEW mv_customer_sales_summary
WITH (DISTRIBUTION = HASH(customer_id))
AS
SELECT 
    customer_id,
    COUNT(*) AS total_transactions,
    SUM(sales_amount) AS total_spent,
    AVG(sales_amount) AS avg_transaction_value,
    MIN(date_key) AS first_purchase_date_key,
    MAX(date_key) AS last_purchase_date_key
FROM fact_sales
GROUP BY customer_id;

Serverless SQL Pool Implementation

-- SERVERLESS SQL POOL: ad-hoc querying and data exploration

-- 1. QUERY DATA DIRECTLY FROM THE DATA LAKE
SELECT 
    TOP 100 *
FROM OPENROWSET(
    BULK 'https://datalake001.dfs.core.windows.net/sales/raw/sales_2024_*.parquet',
    FORMAT = 'PARQUET'
) AS sales_data;

-- 2. CREATE EXTERNAL TABLE FOR FREQUENT ACCESS (CETAS)
CREATE EXTERNAL TABLE serverless_sales
WITH (
    LOCATION = 'sales/raw/',
    DATA_SOURCE = sales_datalake,
    FILE_FORMAT = parquet_format
)
AS
SELECT 
    customer_id,
    product_id,
    CAST(sales_date AS DATE) AS sales_date,
    amount,
    quantity,
    region,
    payment_method
FROM OPENROWSET(
    BULK 'https://datalake001.dfs.core.windows.net/sales/raw/*.parquet',
    FORMAT = 'PARQUET'
) WITH (
    customer_id INT,
    product_id INT,
    sales_date VARCHAR(10),
    amount DECIMAL(18,2),
    quantity INT,
    region VARCHAR(50),
    payment_method VARCHAR(50)
) AS raw_data;

-- 3. QUERY MULTIPLE FILES AND FOLDERS
SELECT 
    sales_date,
    COUNT(*) AS transaction_count,
    SUM(amount) AS daily_sales
FROM OPENROWSET(
    BULK (
        'https://datalake001.dfs.core.windows.net/sales/raw/sales_202401*.parquet',
        'https://datalake001.dfs.core.windows.net/sales/raw/sales_202402*.parquet',
        'https://datalake001.dfs.core.windows.net/sales/raw/sales_202403*.parquet'
    ),
    FORMAT = 'PARQUET'
) AS sales_data
WHERE amount > 1000
GROUP BY sales_date
ORDER BY sales_date;

-- 4. USE WILDCARDS IN FILE PATHS
SELECT 
    YEAR(sales_date) AS sales_year,
    MONTH(sales_date) AS sales_month,
    region,
    SUM(amount) AS monthly_sales
FROM OPENROWSET(
    BULK 'https://datalake001.dfs.core.windows.net/sales/raw/sales_*.parquet',
    FORMAT = 'PARQUET'
) WITH (
    sales_date DATE,
    amount DECIMAL(18,2),
    region VARCHAR(50)
) AS sales_data
GROUP BY 
    YEAR(sales_date),
    MONTH(sales_date),
    region
ORDER BY 
    sales_year DESC,
    sales_month DESC,
    monthly_sales DESC;

-- 5. CREATE VIEWS FOR COMPLEX QUERIES
CREATE VIEW sales_analytics_view
AS
WITH customer_summary AS (
    SELECT 
        customer_id,
        COUNT(*) AS transaction_count,
        SUM(amount) AS total_spent,
        MIN(sales_date) AS first_purchase,
        MAX(sales_date) AS last_purchase,
        DATEDIFF(day, MIN(sales_date), MAX(sales_date)) AS customer_lifetime_days
    FROM OPENROWSET(
        BULK 'https://datalake001.dfs.core.windows.net/sales/raw/*.parquet',
        FORMAT = 'PARQUET'
    ) AS sales
    GROUP BY customer_id
),
customer_segments AS (
    SELECT 
        customer_id,
        total_spent,
        CASE 
            WHEN total_spent >= 10000 THEN 'Platinum'
            WHEN total_spent >= 5000 THEN 'Gold'
            WHEN total_spent >= 1000 THEN 'Silver'
            ELSE 'Bronze'
        END AS customer_segment
    FROM customer_summary
)
SELECT 
    cs.customer_segment,
    COUNT(*) AS customer_count,
    AVG(cs.total_spent) AS avg_spent_per_customer,
    SUM(cs.total_spent) AS total_segment_revenue
FROM customer_segments cs
GROUP BY cs.customer_segment
ORDER BY total_segment_revenue DESC;

4. Apache Spark Pools and Databricks Integration

Big Data Processing in Synapse

Apache Spark pools provide distributed data processing for big data workloads and can interoperate with Azure Databricks through the shared data lake and SQL pool connectors.

Synapse Spark Pool Implementation

# SYNAPSE SPARK POOL: Python notebook for data processing

# Cell 1: Initialization and configuration
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from datetime import datetime, timedelta

# Configure storage account access
# Note: in a Synapse notebook, secrets come from Azure Key Vault via mssparkutils
# (dbutils.secrets is the Databricks equivalent); vault and secret names are placeholders.
storage_account_name = "datalake001"
storage_account_key = mssparkutils.credentials.getSecret("synapse-keyvault", "storage-key")

spark.conf.set(
    "fs.azure.account.key.{}.dfs.core.windows.net".format(storage_account_name),
    storage_account_key
)

# Spark session with Synapse-friendly settings
# (a Synapse notebook already provides a session; getOrCreate simply reuses it)
spark = SparkSession.builder \
    .appName("Synapse Data Processing") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
    .getOrCreate()

# Cell 2: Read data from the Data Lake
# Read raw data from multiple sources
df_sales = spark.read \
    .format("parquet") \
    .load(f"abfss://sales@{storage_account_name}.dfs.core.windows.net/raw/sales/")

df_customers = spark.read \
    .format("parquet") \
    .load(f"abfss://customers@{storage_account_name}.dfs.core.windows.net/raw/customers/")

df_products = spark.read \
    .format("delta") \
    .load(f"abfss://products@{storage_account_name}.dfs.core.windows.net/bronze/products/")

# Show the schema and the total record count
print("Sales Data Schema:")
df_sales.printSchema()
print(f"Total sales records: {df_sales.count():,}")

# Cell 3: Data transformations and cleansing
# Data quality checks and transformations
df_sales_cleaned = df_sales \
    .filter(col("sales_amount") > 0) \
    .filter(col("customer_id").isNotNull()) \
    .filter(col("product_id").isNotNull()) \
    .withColumn("sales_date", to_date(col("transaction_timestamp"))) \
    .withColumn("sales_year", year(col("sales_date"))) \
    .withColumn("sales_month", month(col("sales_date"))) \
    .withColumn("sales_quarter", quarter(col("sales_date"))) \
    .withColumn("weekday", dayofweek(col("sales_date"))) \
    .withColumn("is_weekend", when(col("weekday").isin(1, 7), True).otherwise(False))

# Aggregations and business metrics
df_daily_sales = df_sales_cleaned \
    .groupBy("sales_date", "store_id", "product_category") \
    .agg(
        sum("sales_amount").alias("daily_sales"),
        avg("sales_amount").alias("avg_transaction_value"),
        count("*").alias("transaction_count"),
        countDistinct("customer_id").alias("unique_customers")
    ) \
    .withColumn("sales_per_customer", col("daily_sales") / col("unique_customers"))

# Cell 4: Advanced analytics with ML
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline

# Prepare data for clustering
df_customer_features = df_sales_cleaned \
    .groupBy("customer_id") \
    .agg(
        count("*").alias("purchase_count"),
        sum("sales_amount").alias("total_spent"),
        avg("sales_amount").alias("avg_purchase_value"),
        max("sales_date").alias("last_purchase_date")
    ) \
    .withColumn("days_since_last_purchase", 
                datediff(current_date(), col("last_purchase_date")))

# ML pipeline for customer segmentation
assembler = VectorAssembler(
    inputCols=["purchase_count", "total_spent", "avg_purchase_value", "days_since_last_purchase"],
    outputCol="features"
)

scaler = StandardScaler(
    inputCol="features",
    outputCol="scaled_features",
    withStd=True,
    withMean=True
)

kmeans = KMeans(
    k=5,
    featuresCol="scaled_features",
    predictionCol="cluster",
    seed=42
)

pipeline = Pipeline(stages=[assembler, scaler, kmeans])
model = pipeline.fit(df_customer_features)
df_customer_segments = model.transform(df_customer_features)

# Cell 5: Write to the Data Lake and the Synapse SQL pool
# Write to Data Lake (Silver layer)
df_daily_sales.write \
    .mode("overwrite") \
    .partitionBy("sales_date") \
    .format("parquet") \
    .save(f"abfss://sales@{storage_account_name}.dfs.core.windows.net/silver/daily_sales/")

# Write customer segments to Data Lake
df_customer_segments.write \
    .mode("overwrite") \
    .format("delta") \
    .save(f"abfss://customers@{storage_account_name}.dfs.core.windows.net/silver/customer_segments/")

# Write to the Synapse dedicated SQL pool
# Note: com.databricks.spark.sqldw is the Azure Databricks Synapse connector; from a
# Synapse Spark pool you would typically use the built-in dedicated SQL pool connector.
df_daily_sales.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", "jdbc:sqlserver://synapse-dataplatform.sql.azuresynapse.net") \
    .option("tempDir", f"abfss://temp@{storage_account_name}.dfs.core.windows.net/temp/") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "dbo.daily_sales_summary") \
    .mode("overwrite") \
    .save()

# Cell 6: Performance monitoring and logging
# Log execution metrics
execution_metrics = {
    "total_records_processed": df_sales.count(),
    "spark_ui_url": spark.sparkContext.uiWebUrl,
    "output_records": df_daily_sales.count(),
    "timestamp": datetime.now().isoformat()
}

# Write execution log
df_log = spark.createDataFrame([execution_metrics])
df_log.write \
    .mode("append") \
    .format("json") \
    .save(f"abfss://logs@{storage_account_name}.dfs.core.windows.net/spark_execution_logs/")

print("Data processing completed successfully!")
print(f"Daily sales records: {df_daily_sales.count():,}")
print(f"Customer segments created: {df_customer_segments.count():,}")

5. Data Integration and Pipelines

Synapse Pipelines and Data Flows

Data integration in Azure Synapse is built on Azure Data Factory technology for creating ETL/ELT pipelines, with visual design tools as well as code-first options.

Synapse Pipeline Implementation

// AZURE SYNAPSE PIPELINE DEFINITION (JSON)
{
    "name": "Daily_Sales_ETL_Pipeline",
    "properties": {
        "activities": [
            {
                "name": "Copy_From_SQL_Server",
                "type": "Copy",
                "dependsOn": [],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "source": {
                        "type": "SqlServerSource",
                        "sqlReaderQuery": "SELECT * FROM Sales.Orders WHERE OrderDate >= DATEADD(day, -1, GETDATE())",
                        "queryTimeout": "02:00:00",
                        "partitionOption": "None"
                    },
                    "sink": {
                        "type": "ParquetSink",
                        "storeSettings": {
                            "type": "AzureBlobFSWriteSettings",
                            "copyBehavior": "FlattenHierarchy"
                        },
                        "formatSettings": {
                            "type": "ParquetWriteSettings"
                        }
                    },
                    "enableStaging": false,
                    "translator": {
                        "type": "TabularTranslator",
                        "typeConversion": true,
                        "typeConversionSettings": {
                            "allowDataTruncation": true,
                            "treatBooleanAsNumber": false
                        }
                    }
                },
                "inputs": [
                    {
                        "referenceName": "SqlServer_Source",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "RawSalesData",
                        "type": "DatasetReference"
                    }
                ]
            },
            {
                "name": "Transform_Data_Flow",
                "type": "ExecuteDataFlow",
                "dependsOn": [
                    {
                        "activity": "Copy_From_SQL_Server",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "dataflow": {
                        "referenceName": "Sales_Transformation_Flow",
                        "type": "DataFlowReference"
                    },
                    "compute": {
                        "coreCount": 8,
                        "computeType": "General"
                    },
                    "traceLevel": "Fine",
                    "runConcurrently": false
                }
            },
            {
                "name": "Load_To_Synapse_SQL",
                "type": "Copy",
                "dependsOn": [
                    {
                        "activity": "Transform_Data_Flow",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "source": {
                        "type": "ParquetSource",
                        "storeSettings": {
                            "type": "AzureBlobFSReadSettings",
                            "recursive": true,
                            "wildcardFileName": "*.parquet"
                        }
                    },
                    "sink": {
                        "type": "SqlDWSink",
                        "allowCopyCommand": true,
                        "copyCommandSettings": {
                            "defaultValues": [
                                {
                                    "columnName": "CreatedDate",
                                    "defaultValue": "GETDATE()"
                                }
                            ]
                        },
                        "tableOption": "autoCreate"
                    },
                    "enableStaging": true,
                    "stagingSettings": {
                        "linkedServiceName": {
                            "referenceName": "AzureDataLakeStorage1",
                            "type": "LinkedServiceReference"
                        },
                        "path": "staging"
                    }
                },
                "inputs": [
                    {
                        "referenceName": "TransformedSalesData",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "SynapseSQL_Sales",
                        "type": "DatasetReference"
                    }
                ]
            },
            {
                "name": "Execute_Stored_Procedure",
                "type": "SqlServerStoredProcedure",
                "dependsOn": [
                    {
                        "activity": "Load_To_Synapse_SQL",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "storedProcedureName": "[dbo].[usp_UpdateSalesAggregates]",
                    "storedProcedureParameters": {
                        "ExecutionDate": {
                            "value": "@{formatDateTime(utcNow(), 'yyyy-MM-dd')}",
                            "type": "String"
                        }
                    }
                },
                "linkedServiceName": {
                    "referenceName": "SynapseSQL_LinkedService",
                    "type": "LinkedServiceReference"
                }
            }
        ],
        "variables": {
            "ExecutionDate": "@{formatDateTime(utcNow(), 'yyyy-MM-dd')}"
        },
        "concurrency": 1,
        "annotations": []
    }
}

6. Azure Data Lake Integration

Lakehouse Architecture

Azure Data Lake Storage Gen2 integration forms the foundation of the lakehouse architecture in Synapse: data is stored in open formats and accessed through multiple engines.

Medallion Architecture

Bronze (Raw) → Silver (Cleaned) → Gold (Business-ready) data layers; a minimal SQL sketch follows the list below.

  • Bronze: Raw, immutable source data
  • Silver: Cleaned, validated data
  • Gold: Business-ready aggregates
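
The layer promotion itself can be expressed with the serverless SQL pool features from section 3. Below is a minimal sketch, assuming the datalake001 account and the sales_data_source / parquet_file_format objects defined earlier; it promotes raw bronze files to a cleaned silver table with CETAS, and gold tables follow the same pattern with aggregations on top of silver.

-- MEDALLION LAYERING: bronze -> silver with CETAS (serverless SQL pool, sketch only)
-- The data quality rules below are illustrative.
CREATE EXTERNAL TABLE silver_sales
WITH (
    LOCATION = '/silver/sales/',
    DATA_SOURCE = sales_data_source,
    FILE_FORMAT = parquet_file_format
)
AS
SELECT
    customer_id,
    product_id,
    TRY_CAST(sales_date AS DATE) AS sales_date,
    amount,
    quantity
FROM OPENROWSET(
    BULK 'https://datalake001.dfs.core.windows.net/sales/raw/*.parquet',
    FORMAT = 'PARQUET'
) AS bronze
WHERE amount > 0
  AND customer_id IS NOT NULL;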

Data Formats

Optimal formats for different use cases

  • Parquet: Analytics workloads
  • Delta Lake: ACID transactions
  • CSV/JSON: Raw data ingestion

Folder Structure

Logical organization for manageability

  • /raw/ - Source system data
  • /processed/ - Transformed data
  • /curated/ - Business views

Security Model

Granular access control and encryption

  • Azure RBAC
  • POSIX-like ACLs
  • Encryption at rest

7. Security and Governance

Enterprise Security Framework

Security and governance in Azure Synapse cover authentication, authorization, data protection and compliance controls for enterprise workloads; a short sketch of two of these controls follows the table below.

Security feature | Implementation | Use case | Best practices
Azure AD authentication | Azure AD integrated sign-in | Single sign-on for all users | Enable MFA and Conditional Access
Row-level security | Security policies at table level | Data segmentation per department | Test policies thoroughly, use schemas
Column encryption | Always Encrypted with Azure Key Vault | PII data protection | Use randomized encryption, rotate keys
Dynamic data masking | Masking rules at column level | Development/test environments | Apply to sensitive columns, exempt admin users
Private endpoints | Private Link for the Synapse workspace | Compliance requirements | Combine with NSGs, monitor traffic
Audit logging | Azure Monitor + Log Analytics | Compliance and forensics | Retention policies, alerting rules
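
Row-level security and dynamic data masking can be configured directly in the dedicated SQL pool. The sketch below is illustrative only: the dim_customer table, its region and email columns, and the user-to-region mapping in the predicate are placeholder assumptions, not objects defined elsewhere in this guide.

-- ROW-LEVEL SECURITY AND DYNAMIC DATA MASKING (dedicated SQL pool, illustrative sketch)

-- 1. Inline predicate function: a user only sees rows whose region matches their user name
CREATE SCHEMA security;
GO
CREATE FUNCTION security.fn_region_predicate(@region AS NVARCHAR(50))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS allowed
       WHERE @region = USER_NAME()       -- placeholder mapping: user name equals region
          OR USER_NAME() = 'etl_user';   -- ETL account sees all rows
GO

CREATE SECURITY POLICY security.region_filter
ADD FILTER PREDICATE security.fn_region_predicate(region) ON dbo.dim_customer
WITH (STATE = ON);
GO

-- 2. Dynamic data masking on a PII column
ALTER TABLE dbo.dim_customer
ALTER COLUMN email ADD MASKED WITH (FUNCTION = 'email()');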

8. Performance Optimization

Synapse Performance Tuning

Performance optimization in Azure Synapse requires a combination of resource management, query tuning and architectural best practices.

Performance Optimization Techniques

-- AZURE SYNAPSE PERFORMANCE OPTIMIZATION

-- 1. RESOURCE CLASSES AND WORKLOAD MANAGEMENT
-- Configure workload groups (which supersede resource classes) for different user populations
CREATE WORKLOAD GROUP report_wg 
WITH ( 
    MIN_PERCENTAGE_RESOURCE = 30,
    CAP_PERCENTAGE_RESOURCE = 60,
    REQUEST_MIN_RESOURCE_GRANT_PERCENT = 5,
    IMPORTANCE = MEDIUM
);

CREATE WORKLOAD GROUP etl_wg 
WITH ( 
    MIN_PERCENTAGE_RESOURCE = 40,
    CAP_PERCENTAGE_RESOURCE = 80,
    REQUEST_MIN_RESOURCE_GRANT_PERCENT = 25,
    IMPORTANCE = HIGH
);

CREATE WORKLOAD CLASSIFIER report_user_classifier
WITH (
    WORKLOAD_GROUP = 'report_wg',
    MEMBERNAME = 'report_user',
    WLM_LABEL = 'reporting'
);

-- 2. TABLE DESIGN OPTIMIZATION
-- Choose appropriate distribution strategies
CREATE TABLE optimized_sales_fact (
    sales_id BIGINT IDENTITY(1,1),
    customer_id INT NOT NULL,
    product_id INT NOT NULL,
    sales_date DATE NOT NULL,
    amount DECIMAL(18,2) NOT NULL,
    quantity INT NOT NULL
)
WITH (
    DISTRIBUTION = HASH(customer_id),  -- for customer-centric queries
    CLUSTERED COLUMNSTORE INDEX,        -- for analytics workloads
    PARTITION (sales_date RANGE RIGHT FOR VALUES (
        '2023-01-01', '2023-04-01', '2023-07-01', '2023-10-01',
        '2024-01-01', '2024-04-01', '2024-07-01', '2024-10-01'
    ))
);

-- 3. QUERY PERFORMANCE ANALYSIS
-- Use DMVs for performance monitoring
SELECT 
    r.request_id,
    r.session_id,
    r.status,
    r.command,
    r.total_elapsed_time,
    r.resource_class,
    r.group_name AS workload_group_name,
    r.classifier_name
FROM sys.dm_pdw_exec_requests r
WHERE r.status NOT IN ('Completed', 'Failed')
    AND r.session_id != SESSION_ID()
ORDER BY r.total_elapsed_time DESC;

-- 4. CACHING STRATEGIES
-- Enable result set caching (run this statement while connected to the master database)
ALTER DATABASE sales_dw 
SET RESULT_SET_CACHING = ON;

-- Materialized views for complex aggregations
CREATE MATERIALIZED VIEW mv_monthly_sales_summary
WITH (DISTRIBUTION = HASH(sales_year_month))
AS
SELECT 
    YEAR(sales_date) * 100 + MONTH(sales_date) AS sales_year_month,
    customer_id,
    COUNT(*) AS transaction_count,
    SUM(amount) AS total_sales,
    AVG(amount) AS avg_transaction_value
FROM optimized_sales_fact
GROUP BY 
    YEAR(sales_date) * 100 + MONTH(sales_date),
    customer_id;

-- 5. INDEX MAINTENANCE
-- Rebuild indexes to restore columnstore compression and segment quality
ALTER INDEX ALL ON optimized_sales_fact REBUILD;

-- Update statistics
UPDATE STATISTICS optimized_sales_fact;

-- 6. SCALING STRATEGIES
-- Scale up the dedicated SQL pool for heavy workloads (run from the master database)
ALTER DATABASE sales_dw
MODIFY (SERVICE_OBJECTIVE = 'DW1000c');

-- Give the ETL workload group a larger guaranteed resource share during peak loads
ALTER WORKLOAD GROUP etl_wg 
WITH (MIN_PERCENTAGE_RESOURCE = 60);

-- Scale out over the data lake with PolyBase external tables
CREATE EXTERNAL TABLE external_large_dataset (
    record_id BIGINT,          -- illustrative columns; match them to the files in the folder
    event_date DATE,
    payload NVARCHAR(4000)
)
WITH (
    LOCATION = 'abfss://bigdata@datalake001.dfs.core.windows.net/large_dataset/',
    DATA_SOURCE = bigdata_source,
    FILE_FORMAT = parquet_format
);

9. Cost Management and Optimization

Azure Synapse Cost Control

Cost optimization in Azure Synapse requires a combination of resource management, monitoring and architectural choices to maximize ROI; a small cost-guardrail sketch follows the category lists below.

Compute Optimization

  • Auto-pause for development
  • Right-size dedicated pools
  • Serverless voor ad-hoc
  • Workload management

Storage Optimization

  • Data lifecycle management
  • Compression strategies
  • Tiering (Hot/Cool/Archive)
  • Delta Lake optimizations

Monitoring & Alerting

  • Cost analysis dashboards
  • Budget alerts
  • Anomaly detection
  • Resource utilization

Optimization Tools

  • Azure Cost Management
  • Synapse Monitor
  • Custom PowerShell scripts
  • Third-party tools
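
As a concrete example of the compute guardrails above, the serverless SQL pool can cap the amount of data processed per period. A minimal sketch, run against the serverless endpoint; the TB limits are illustrative:

-- SERVERLESS COST GUARDRAILS: cap data processed per period (limits are illustrative)
EXEC sp_set_data_processed_limit @type = N'daily',   @limit_tb = 1;
EXEC sp_set_data_processed_limit @type = N'weekly',  @limit_tb = 5;
EXEC sp_set_data_processed_limit @type = N'monthly', @limit_tb = 15;

-- Inspect the configured limits
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name LIKE 'Data processed%';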

10. Migration to Azure Synapse

Migration Strategies

Migration to Azure Synapse requires careful planning, assessment and execution to safeguard data integrity and business continuity.

Migration Roadmap and Implementation

-- MIGRATION ROADMAP FOR AZURE SYNAPSE

-- 1. ASSESSMENT PHASE: analyze the current environment
-- Use Azure Migrate for the assessment
/* 
Assessment metrics:
- Database size and growth
- Performance requirements
- Query complexity
- Security requirements
- Compliance needs
*/

-- 2. PLANNING PHASE: determine the migration strategy
CREATE TABLE migration_plan (
    phase_id INT PRIMARY KEY,
    phase_name NVARCHAR(100),
    description NVARCHAR(500),
    start_date DATE,
    end_date DATE,
    status NVARCHAR(50)
);

-- 3. MIGRATION TOOLING: Azure Database Migration Service
-- PowerShell script for the migration (shown here as a comment block)
/*
# Azure Database Migration Service configuration
$sourceConnectionString = "Server=onprem-sql;Database=SalesDB;Integrated Security=True"
$targetConnectionString = "Server=synapse-dataplatform.sql.azuresynapse.net;Database=SalesDW;User Id=admin;Password=***"

# Create migration project
$migrationProject = New-AzDataMigrationProject `
    -ResourceGroupName "rg-synapse-migration" `
    -ServiceName "dms-service" `
    -ProjectName "SalesDB-Migration" `
    -Location "WestEurope" `
    -SourceType "SQL" `
    -TargetType "SQLDB"

# Create and start the migration task
# ($sourceConnectionInfo, $targetConnectionInfo and $selectedDatabases are connection-info
#  and database-selection objects that have to be created beforehand; not shown here)
$migrationTask = New-AzDataMigrationTask `
    -TaskType "Migrate.SqlServer.SqlDb" `
    -ResourceGroupName "rg-synapse-migration" `
    -ServiceName "dms-service" `
    -ProjectName "SalesDB-Migration" `
    -TaskName "SalesDB-Migration-Task" `
    -SourceConnection $sourceConnectionInfo `
    -TargetConnection $targetConnectionInfo `
    -SelectedDatabase $selectedDatabases
"""

-- 4. DATA VALIDATION AFTER MIGRATION
-- Row count validation (the source-side counts below run on the source system;
-- shown side by side for illustration)
SELECT 
    'Source' AS system_type,
    COUNT(*) AS row_count
FROM onprem_sql.SalesDB.dbo.Customers
UNION ALL
SELECT 
    'Target',
    COUNT(*) 
FROM sales_dw.dbo.dim_customer;

-- Data integrity validation
SELECT 
    'Data Mismatch Check' AS check_type,
    COUNT(*) AS mismatch_count
FROM (
    SELECT CustomerID, SUM(OrderAmount) AS source_total
    FROM onprem_sql.SalesDB.dbo.Orders
    GROUP BY CustomerID
) source
FULL OUTER JOIN (
    SELECT customer_id, SUM(sales_amount) AS target_total
    FROM sales_dw.dbo.fact_sales
    GROUP BY customer_id
) target ON source.CustomerID = target.customer_id
WHERE source.source_total != target.target_total 
    OR (source.CustomerID IS NULL AND target.customer_id IS NOT NULL)
    OR (source.CustomerID IS NOT NULL AND target.customer_id IS NULL);

-- 5. PERFORMANCE VALIDATION
-- Query performance comparison
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Source system query
SELECT 
    YEAR(OrderDate) AS OrderYear,
    MONTH(OrderDate) AS OrderMonth,
    CustomerID,
    COUNT(*) AS OrderCount,
    SUM(OrderAmount) AS TotalAmount
FROM onprem_sql.SalesDB.dbo.Orders
WHERE OrderDate >= '2024-01-01'
GROUP BY YEAR(OrderDate), MONTH(OrderDate), CustomerID;

-- Synapse query
SELECT 
    YEAR(sales_date) AS sales_year,
    MONTH(sales_date) AS sales_month,
    customer_id,
    COUNT(*) AS transaction_count,
    SUM(sales_amount) AS total_sales
FROM sales_dw.dbo.fact_sales
WHERE sales_date >= '2024-01-01'
GROUP BY YEAR(sales_date), MONTH(sales_date), customer_id;

11. Best Practices and Patterns

Azure Synapse Best Practices

Best practices and patterns help you build scalable, performant and maintainable Synapse solutions.

Design Patterns for Synapse

Lakehouse Pattern
  • Medallion architecture
  • Delta Lake format
  • Unity Catalog
  • Multi-engine access
Modern DW Pattern
  • Star schema design
  • Hybrid tables
  • Materialized views
  • Resultset caching
Real-time Pattern
  • Event streaming
  • Spark Structured Streaming
  • Delta Live Tables
  • Real-time dashboards

12. Practical Case Studies

Real-World Implementations

These case studies show how Azure Synapse is implemented across different industries with measurable business outcomes.

Case Study: Financial Services

Challenge: legacy data warehouse, compliance requirements, real-time fraud detection.

Solution: Azure Synapse with a lakehouse architecture and real-time analytics.

Metric | Before | After | Improvement
Data processing | Batch (24h cycle) | Real-time (5 min) | 288x faster
Query performance | Minutes to hours | Sub-second | 100x+ faster
Fraud detection | Post-transaction | Real-time prevention | Proactive instead of reactive
Cost | $500k/year | $150k/year | 70% reduction
Compliance | Manual reporting | Automated audits | 100% accuracy

Conclusion and Key Takeaways

Key Lessons Learned

DO's
  • Start with a lakehouse architecture
  • Use serverless for exploration
  • Implement workload management
  • Monitor costs from day 1
  • Use Delta Lake for ACID guarantees
DON'Ts
  • Neglect security
  • Underestimate migration complexity
  • Skip monitoring
  • Skimp on team training
  • Copy on-premises patterns one-to-one
Emerging Trends
  • Fabric integration
  • AI/ML in Synapse
  • Unified governance
  • Low-code data flows
  • Auto-scaling pools

Frequently Asked Questions (FAQ)

Q: What is the difference between Azure Synapse and Fabric?

A: Synapse is a data warehousing and big data platform, whereas Fabric is a complete SaaS data platform with end-to-end analytics. Fabric builds on Synapse technologies but offers a unified experience with OneLake and AI integration.

Q: How do I implement CI/CD for Synapse?

A: CI/CD for Synapse: 1) Azure DevOps or GitHub Actions, 2) ARM templates for infrastructure, 3) SQL scripts in source control, 4) notebooks and pipelines as code, 5) automated testing, 6) blue-green deployments.

Q: What are the SLAs for Azure Synapse?

A: Azure Synapse SLAs: 1) dedicated SQL pool: 99.9% uptime, 2) serverless SQL pool: best effort, 3) Spark pool: 99.9%, 4) data integration: 99.9%. Always verify the current figures in the official Azure SLA documentation.

Q: How do I monitor Synapse performance?

A: Monitoring tools: 1) Azure Monitor metrics, 2) Log Analytics queries, 3) Dynamic Management Views (DMVs), 4) Synapse Studio monitoring, 5) custom dashboards in Power BI, 6) alert rules for performance issues. A small DMV sketch follows below.
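
As a starting point for the DMV route, a minimal sketch for a dedicated SQL pool; the QID value is a placeholder to be replaced with a request_id from the first query:

-- MONITORING SKETCH: find slow requests and drill into their distributed steps
SELECT TOP 10 request_id, command, total_elapsed_time, resource_class
FROM sys.dm_pdw_exec_requests
WHERE status = 'Completed'
ORDER BY total_elapsed_time DESC;

-- Inspect the steps of one request (replace QID12345 with a request_id from the query above)
SELECT step_index, operation_type, status, total_elapsed_time, row_count
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID12345'
ORDER BY step_index;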