Azure Synapse Analytics: A Complete Guide to the Modern Data Platform

Last updated: 20 December 2024
Reading time: 50 minutes
Azure Synapse Analytics, Azure Data Lake, SQL Pools, Apache Spark, Data Warehousing, Azure Synapse, Modern Data Platform

Learn the fundamentals and advanced techniques of Azure Synapse Analytics: from data lake integration to real-time analytics and practical implementation strategies for the modern data platform.

1. Introduction to Azure Synapse Analytics

What is Azure Synapse Analytics?

Azure Synapse Analytics is an integrated data platform that combines data warehousing, big data analytics and data integration in a single unified service. It brings SQL and Apache Spark together for enterprise data processing.

Integrated Data Platform

Combines SQL, Spark and data integration in one service

Real-time Analytics

Stream processing and real-time data ingestion

Unlimited Scale

Elastic scaling of compute and storage

Enterprise Security

End-to-end security and compliance

Feature | Traditional Data Warehouse | Azure Synapse Analytics | Benefit
Architecture | Separate systems | Unified platform | Simpler management
Data types | Structured data only | All data types | More complete analysis
Processing | Batch processing | Batch + real-time | Faster insights
Scalability | Limited scaling | Elastic scaling | Flexible capacity
Cost | Fixed costs | Pay-per-use | Cost efficiency
Development | Complex integration | Integrated tooling | Faster development

2. Synapse Architecture Overview

Synapse Components

The Azure Synapse architecture consists of a set of core components that work together to form an end-to-end data platform.

Core Architecture Components

-- AZURE SYNAPSE ARCHITECTURE OVERVIEW

-- 1. SYNAPSE WORKSPACE: central management point
-- Azure Portal: Create Synapse Workspace
/* 
Workspace Name: synapse-dataplatform
Region: West Europe
Data Lake Storage Gen2: 
    - Primary Storage Account: datalake001
    - File System: synapse-workspace
SQL Administrator: admin@datapartner365.nl
*/

-- 2. DEDICATED SQL POOL (formerly SQL DW)
CREATE WORKLOAD GROUP reporting_workload
WITH (
    MIN_PERCENTAGE_RESOURCE = 30,
    CAP_PERCENTAGE_RESOURCE = 60,
    REQUEST_MIN_RESOURCE_GRANT_PERCENT = 5
);

CREATE WORKLOAD CLASSIFIER reporting_classifier
WITH (
    WORKLOAD_GROUP = 'reporting_workload',
    MEMBERNAME = 'reporting_user'
);

-- 3. SERVERLESS SQL POOL: On-demand querying
CREATE EXTERNAL TABLE serverless_sales_data
WITH (
    LOCATION = 'abfss://sales@datalake001.dfs.core.windows.net/raw/sales/',
    DATA_SOURCE = sales_data_source,
    FILE_FORMAT = parquet_format
)
AS
SELECT 
    customer_id,
    product_id,
    sales_amount,
    transaction_date
FROM OPENROWSET(
    BULK 'https://datalake001.dfs.core.windows.net/sales/raw/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;

-- 4. APACHE SPARK POOL: big data processing
-- Synapse notebook: Python code for data processing (shown here as a comment block)
/*
# Initialize Spark session
spark.conf.set("fs.azure.account.key.datalake001.dfs.core.windows.net", 
               "your_storage_account_key")

# Read data from Data Lake
df_sales = spark.read.parquet("abfss://sales@datalake001.dfs.core.windows.net/raw/")
df_customers = spark.read.parquet("abfss://customers@datalake001.dfs.core.windows.net/raw/")

# Perform transformations
df_enriched = df_sales.join(df_customers, "customer_id", "left")

# Write to curated zone
df_enriched.write.mode("overwrite").parquet(
    "abfss://sales@datalake001.dfs.core.windows.net/curated/enriched_sales"
)
"""

-- 5. SYNAPSE PIPELINES: data integration
-- Azure Data Factory activities within Synapse
/*
Pipeline: Daily_Sales_ETL
Activities:
1. Copy data from SQL Server to the data lake (raw)
2. Spark job for data transformation
3. Load into the dedicated SQL pool
4. Execute stored procedures for aggregations
*/

3. SQL Pools: Dedicated vs Serverless

SQL Processing in Synapse

SQL pools come in two flavours for SQL-based data processing: dedicated SQL pools for production workloads and serverless SQL pools for ad-hoc querying.

Dedicated SQL Pool Implementation

-- DEDICATED SQL POOL: production data warehouse

-- 1. CREATE DATABASE AND TABLES WITH DISTRIBUTION STRATEGIES
-- (run against the master database of the logical server)
CREATE DATABASE sales_dw
( EDITION = 'DataWarehouse', SERVICE_OBJECTIVE = 'DW500c' );

-- Note: the USE statement is not supported in dedicated SQL pools;
-- connect directly to the sales_dw database for the statements below.

-- Fact table with HASH distribution for an even spread of the data
CREATE TABLE fact_sales (
    sales_id BIGINT IDENTITY(1,1) NOT NULL,
    customer_id INT NOT NULL,
    product_id INT NOT NULL,
    date_key INT NOT NULL,
    store_id INT NOT NULL,
    sales_amount DECIMAL(18,2) NOT NULL,
    quantity INT NOT NULL,
    discount_amount DECIMAL(18,2) DEFAULT 0,
    created_date DATETIME2 DEFAULT SYSUTCDATETIME()
)
WITH (
    DISTRIBUTION = HASH(customer_id),  -- optimal distribution for customer-centric queries
    CLUSTERED COLUMNSTORE INDEX,        -- columnstore for compression and performance
    PARTITION (date_key RANGE RIGHT FOR VALUES (
        20230101, 20230401, 20230701, 20231001,
        20240101, 20240401, 20240701, 20241001
    ))
);

-- Dimension table with REPLICATE distribution for small tables
CREATE TABLE dim_product (
    product_id INT NOT NULL,
    product_name NVARCHAR(255) NOT NULL,
    category_id INT NOT NULL,
    category_name NVARCHAR(100) NOT NULL,
    brand_name NVARCHAR(100) NOT NULL,
    unit_price DECIMAL(18,2) NOT NULL,
    is_active BIT DEFAULT 1,
    valid_from DATE NOT NULL,
    valid_to DATE
)
WITH (
    DISTRIBUTION = REPLICATE,  -- replicated to every compute node
    CLUSTERED INDEX (product_id)  -- rowstore index for dimension tables
);

-- 2. CREATE EXTERNAL TABLE FOR DATA LAKE INTEGRATION
CREATE EXTERNAL DATA SOURCE sales_data_source 
WITH (
    LOCATION = 'abfss://sales@datalake001.dfs.core.windows.net',
    CREDENTIAL = sales_credential
);

CREATE EXTERNAL FILE FORMAT parquet_file_format 
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

CREATE EXTERNAL TABLE external_sales_data (
    customer_id INT,
    product_id INT,
    sales_date DATE,
    amount DECIMAL(18,2),
    quantity INT
)
WITH (
    LOCATION = '/raw/sales_data/*.parquet',
    DATA_SOURCE = sales_data_source,
    FILE_FORMAT = parquet_file_format
);

-- 3. CTAS (CREATE TABLE AS SELECT) FOR PERFORMANCE
CREATE TABLE monthly_sales_summary
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT 
    fs.date_key / 10000 AS sales_year,        -- date_key is an integer in YYYYMMDD format
    (fs.date_key / 100) % 100 AS sales_month,
    dp.category_name,
    COUNT(*) AS transaction_count,
    SUM(fs.sales_amount) AS total_sales,
    AVG(fs.sales_amount) AS avg_sale_amount
FROM fact_sales fs
INNER JOIN dim_product dp ON fs.product_id = dp.product_id
GROUP BY 
    fs.date_key / 10000,
    (fs.date_key / 100) % 100,
    dp.category_name;

-- 4. WORKLOAD MANAGEMENT CONFIGURATION
CREATE WORKLOAD GROUP etl_workload
WITH (
    MIN_PERCENTAGE_RESOURCE = 40,
    CAP_PERCENTAGE_RESOURCE = 80,
    REQUEST_MIN_RESOURCE_GRANT_PERCENT = 10,
    IMPORTANCE = HIGH
);

CREATE WORKLOAD CLASSIFIER etl_classifier
WITH (
    WORKLOAD_GROUP = 'etl_workload',
    MEMBERNAME = 'etl_user',
    WLM_LABEL = 'etl_process'
);

-- 5. MATERIALIZED VIEWS FOR PERFORMANCE
CREATE MATERIALIZED VIEW mv_customer_sales_summary
WITH (DISTRIBUTION = HASH(customer_id))
AS
SELECT 
    customer_id,
    COUNT(*) AS total_transactions,
    SUM(sales_amount) AS total_spent,
    AVG(sales_amount) AS avg_transaction_value,
    MIN(date_key) AS first_purchase_date_key,
    MAX(date_key) AS last_purchase_date_key
FROM fact_sales
GROUP BY customer_id;

Serverless SQL Pool Implementation

-- SERVERLESS SQL POOL: ad-hoc querying and data exploration

-- 1. QUERY DATA DIRECTLY FROM THE DATA LAKE
SELECT 
    TOP 100 *
FROM OPENROWSET(
    BULK 'https://datalake001.dfs.core.windows.net/sales/raw/sales_2024_*.parquet',
    FORMAT = 'PARQUET'
) AS sales_data;

-- 2. CREATE EXTERNAL TABLE FOR FREQUENT ACCESS (CETAS)
CREATE EXTERNAL TABLE serverless_sales
WITH (
    LOCATION = 'sales/raw/',
    DATA_SOURCE = sales_datalake,
    FILE_FORMAT = parquet_format
)
AS
SELECT 
    customer_id,
    product_id,
    CAST(sales_date AS DATE) AS sales_date,
    amount,
    quantity,
    region,
    payment_method
FROM OPENROWSET(
    BULK 'https://datalake001.dfs.core.windows.net/sales/raw/*.parquet',
    FORMAT = 'PARQUET'
) WITH (
    customer_id INT,
    product_id INT,
    sales_date VARCHAR(10),
    amount DECIMAL(18,2),
    quantity INT,
    region VARCHAR(50),
    payment_method VARCHAR(50)
) AS raw_data;

-- 3. QUERY MULTIPLE FILES AND FOLDERS
SELECT 
    sales_date,
    COUNT(*) AS transaction_count,
    SUM(amount) AS daily_sales
FROM OPENROWSET(
    BULK (
        'https://datalake001.dfs.core.windows.net/sales/raw/sales_202401*.parquet',
        'https://datalake001.dfs.core.windows.net/sales/raw/sales_202402*.parquet',
        'https://datalake001.dfs.core.windows.net/sales/raw/sales_202403*.parquet'
    ),
    FORMAT = 'PARQUET'
) AS sales_data
WHERE amount > 1000
GROUP BY sales_date
ORDER BY sales_date;

-- 4. USE WILDCARDS IN FILE PATHS
SELECT 
    YEAR(sales_date) AS sales_year,
    MONTH(sales_date) AS sales_month,
    region,
    SUM(amount) AS monthly_sales
FROM OPENROWSET(
    BULK 'https://datalake001.dfs.core.windows.net/sales/raw/sales_*.parquet',
    FORMAT = 'PARQUET'
) WITH (
    sales_date DATE,
    amount DECIMAL(18,2),
    region VARCHAR(50)
) AS sales_data
GROUP BY 
    YEAR(sales_date),
    MONTH(sales_date),
    region
ORDER BY 
    sales_year DESC,
    sales_month DESC,
    monthly_sales DESC;

-- 5. CREATE VIEWS FOR COMPLEX QUERIES
CREATE VIEW sales_analytics_view
AS
WITH customer_summary AS (
    SELECT 
        customer_id,
        COUNT(*) AS transaction_count,
        SUM(amount) AS total_spent,
        MIN(sales_date) AS first_purchase,
        MAX(sales_date) AS last_purchase,
        DATEDIFF(day, MIN(sales_date), MAX(sales_date)) AS customer_lifetime_days
    FROM OPENROWSET(
        BULK 'https://datalake001.dfs.core.windows.net/sales/raw/*.parquet',
        FORMAT = 'PARQUET'
    ) AS sales
    GROUP BY customer_id
),
customer_segments AS (
    SELECT 
        customer_id,
        total_spent,
        CASE 
            WHEN total_spent >= 10000 THEN 'Platinum'
            WHEN total_spent >= 5000 THEN 'Gold'
            WHEN total_spent >= 1000 THEN 'Silver'
            ELSE 'Bronze'
        END AS customer_segment
    FROM customer_summary
)
SELECT 
    cs.customer_segment,
    COUNT(*) AS customer_count,
    AVG(cs.total_spent) AS avg_spent_per_customer,
    SUM(cs.total_spent) AS total_segment_revenue
FROM customer_segments cs
GROUP BY cs.customer_segment
ORDER BY total_segment_revenue DESC;

4. Apache Spark Pools and Databricks Integration

Big Data Processing in Synapse

Apache Spark pools provide distributed data processing for big data workloads and can interoperate with Azure Databricks through the shared data lake and SQL pool connectors.

Synapse Spark Pool Implementation

# SYNAPSE SPARK POOL: Python notebook for data processing

# Cell 1: Initialization and configuration
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from datetime import datetime, timedelta

# Configure storage account access
# Note: in a Synapse notebook, secrets come from Azure Key Vault via mssparkutils
# (dbutils.secrets is the Databricks equivalent); vault and secret names are placeholders.
storage_account_name = "datalake001"
storage_account_key = mssparkutils.credentials.getSecret("synapse-keyvault", "storage-key")

spark.conf.set(
    "fs.azure.account.key.{}.dfs.core.windows.net".format(storage_account_name),
    storage_account_key
)

# Spark session with Synapse-friendly settings
# (a Synapse notebook already provides a session; getOrCreate simply reuses it)
spark = SparkSession.builder \
    .appName("Synapse Data Processing") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
    .getOrCreate()

# Cell 2: Read data from the Data Lake
# Read raw data from multiple sources
df_sales = spark.read \
    .format("parquet") \
    .load(f"abfss://sales@{storage_account_name}.dfs.core.windows.net/raw/sales/")

df_customers = spark.read \
    .format("parquet") \
    .load(f"abfss://customers@{storage_account_name}.dfs.core.windows.net/raw/customers/")

df_products = spark.read \
    .format("delta") \
    .load(f"abfss://products@{storage_account_name}.dfs.core.windows.net/bronze/products/")

# Show the schema and the total record count
print("Sales Data Schema:")
df_sales.printSchema()
print(f"Total sales records: {df_sales.count():,}")

# Cell 3: Data transformations and cleansing
# Data quality checks and transformations
df_sales_cleaned = df_sales \
    .filter(col("sales_amount") > 0) \
    .filter(col("customer_id").isNotNull()) \
    .filter(col("product_id").isNotNull()) \
    .withColumn("sales_date", to_date(col("transaction_timestamp"))) \
    .withColumn("sales_year", year(col("sales_date"))) \
    .withColumn("sales_month", month(col("sales_date"))) \
    .withColumn("sales_quarter", quarter(col("sales_date"))) \
    .withColumn("weekday", dayofweek(col("sales_date"))) \
    .withColumn("is_weekend", when(col("weekday").isin(1, 7), True).otherwise(False))

# Aggregations and business metrics
df_daily_sales = df_sales_cleaned \
    .groupBy("sales_date", "store_id", "product_category") \
    .agg(
        sum("sales_amount").alias("daily_sales"),
        avg("sales_amount").alias("avg_transaction_value"),
        count("*").alias("transaction_count"),
        countDistinct("customer_id").alias("unique_customers")
    ) \
    .withColumn("sales_per_customer", col("daily_sales") / col("unique_customers"))

# Cell 4: Advanced analytics with ML
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline

# Prepare data for clustering
df_customer_features = df_sales_cleaned \
    .groupBy("customer_id") \
    .agg(
        count("*").alias("purchase_count"),
        sum("sales_amount").alias("total_spent"),
        avg("sales_amount").alias("avg_purchase_value"),
        max("sales_date").alias("last_purchase_date")
    ) \
    .withColumn("days_since_last_purchase", 
                datediff(current_date(), col("last_purchase_date")))

# ML pipeline for customer segmentation
assembler = VectorAssembler(
    inputCols=["purchase_count", "total_spent", "avg_purchase_value", "days_since_last_purchase"],
    outputCol="features"
)

scaler = StandardScaler(
    inputCol="features",
    outputCol="scaled_features",
    withStd=True,
    withMean=True
)

kmeans = KMeans(
    k=5,
    featuresCol="scaled_features",
    predictionCol="cluster",
    seed=42
)

pipeline = Pipeline(stages=[assembler, scaler, kmeans])
model = pipeline.fit(df_customer_features)
df_customer_segments = model.transform(df_customer_features)

# Cell 5: Write to the Data Lake and the Synapse SQL pool
# Write to Data Lake (Silver layer)
df_daily_sales.write \
    .mode("overwrite") \
    .partitionBy("sales_date") \
    .format("parquet") \
    .save(f"abfss://sales@{storage_account_name}.dfs.core.windows.net/silver/daily_sales/")

# Write customer segments to Data Lake
df_customer_segments.write \
    .mode("overwrite") \
    .format("delta") \
    .save(f"abfss://customers@{storage_account_name}.dfs.core.windows.net/silver/customer_segments/")

# Write to the Synapse dedicated SQL pool
# Note: com.databricks.spark.sqldw is the Azure Databricks Synapse connector; from a
# Synapse Spark pool you would typically use the built-in dedicated SQL pool connector.
df_daily_sales.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", "jdbc:sqlserver://synapse-dataplatform.sql.azuresynapse.net") \
    .option("tempDir", f"abfss://temp@{storage_account_name}.dfs.core.windows.net/temp/") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "dbo.daily_sales_summary") \
    .mode("overwrite") \
    .save()

# Cell 6: Performance monitoring and logging
# Log execution metrics
execution_metrics = {
    "total_records_processed": df_sales.count(),
    "spark_ui_url": spark.sparkContext.uiWebUrl,
    "output_records": df_daily_sales.count(),
    "timestamp": datetime.now().isoformat()
}

# Write execution log
df_log = spark.createDataFrame([execution_metrics])
df_log.write \
    .mode("append") \
    .format("json") \
    .save(f"abfss://logs@{storage_account_name}.dfs.core.windows.net/spark_execution_logs/")

print("Data processing completed successfully!")
print(f"Daily sales records: {df_daily_sales.count():,}")
print(f"Customer segments created: {df_customer_segments.count():,}")

5. Data Integration and Pipelines

Synapse Pipelines and Data Flows

Data integration in Azure Synapse is built on Azure Data Factory technology for creating ETL/ELT pipelines, with visual design tools as well as code-first options.

Synapse Pipeline Implementation

// AZURE SYNAPSE PIPELINE DEFINITION (JSON)
{
    "name": "Daily_Sales_ETL_Pipeline",
    "properties": {
        "activities": [
            {
                "name": "Copy_From_SQL_Server",
                "type": "Copy",
                "dependsOn": [],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "source": {
                        "type": "SqlServerSource",
                        "sqlReaderQuery": "SELECT * FROM Sales.Orders WHERE OrderDate >= DATEADD(day, -1, GETDATE())",
                        "queryTimeout": "02:00:00",
                        "partitionOption": "None"
                    },
                    "sink": {
                        "type": "ParquetSink",
                        "storeSettings": {
                            "type": "AzureBlobFSWriteSettings",
                            "copyBehavior": "FlattenHierarchy"
                        },
                        "formatSettings": {
                            "type": "ParquetWriteSettings"
                        }
                    },
                    "enableStaging": false,
                    "translator": {
                        "type": "TabularTranslator",
                        "typeConversion": true,
                        "typeConversionSettings": {
                            "allowDataTruncation": true,
                            "treatBooleanAsNumber": false
                        }
                    }
                },
                "inputs": [
                    {
                        "referenceName": "SqlServer_Source",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "RawSalesData",
                        "type": "DatasetReference"
                    }
                ]
            },
            {
                "name": "Transform_Data_Flow",
                "type": "ExecuteDataFlow",
                "dependsOn": [
                    {
                        "activity": "Copy_From_SQL_Server",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "dataflow": {
                        "referenceName": "Sales_Transformation_Flow",
                        "type": "DataFlowReference"
                    },
                    "compute": {
                        "coreCount": 8,
                        "computeType": "General"
                    },
                    "traceLevel": "Fine",
                    "runConcurrently": false
                }
            },
            {
                "name": "Load_To_Synapse_SQL",
                "type": "Copy",
                "dependsOn": [
                    {
                        "activity": "Transform_Data_Flow",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "source": {
                        "type": "ParquetSource",
                        "storeSettings": {
                            "type": "AzureBlobFSReadSettings",
                            "recursive": true,
                            "wildcardFileName": "*.parquet"
                        }
                    },
                    "sink": {
                        "type": "SqlDWSink",
                        "allowCopyCommand": true,
                        "copyCommandSettings": {
                            "defaultValues": [
                                {
                                    "columnName": "CreatedDate",
                                    "defaultValue": "GETDATE()"
                                }
                            ]
                        },
                        "tableOption": "autoCreate"
                    },
                    "enableStaging": true,
                    "stagingSettings": {
                        "linkedServiceName": {
                            "referenceName": "AzureDataLakeStorage1",
                            "type": "LinkedServiceReference"
                        },
                        "path": "staging"
                    }
                },
                "inputs": [
                    {
                        "referenceName": "TransformedSalesData",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "SynapseSQL_Sales",
                        "type": "DatasetReference"
                    }
                ]
            },
            {
                "name": "Execute_Stored_Procedure",
                "type": "SqlServerStoredProcedure",
                "dependsOn": [
                    {
                        "activity": "Load_To_Synapse_SQL",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "storedProcedureName": "[dbo].[usp_UpdateSalesAggregates]",
                    "storedProcedureParameters": {
                        "ExecutionDate": {
                            "value": "@{formatDateTime(utcNow(), 'yyyy-MM-dd')}",
                            "type": "String"
                        }
                    }
                },
                "linkedServiceName": {
                    "referenceName": "SynapseSQL_LinkedService",
                    "type": "LinkedServiceReference"
                }
            }
        ],
        "variables": {
            "ExecutionDate": "@{formatDateTime(utcNow(), 'yyyy-MM-dd')}"
        },
        "concurrency": 1,
        "annotations": []
    }
}

6. Azure Data Lake Integration

Lakehouse Architecture

Azure Data Lake Storage Gen2 integration forms the foundation of the lakehouse architecture in Synapse: data is stored in open formats and accessed through multiple engines.

Medallion Architecture

Bronze (Raw) → Silver (Cleaned) → Gold (Business-ready) data layers; a minimal SQL sketch follows the list below.

  • Bronze: Raw, immutable source data
  • Silver: Cleaned, validated data
  • Gold: Business-ready aggregates
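
The layer promotion itself can be expressed with the serverless SQL pool features from section 3. Below is a minimal sketch, assuming the datalake001 account and the sales_data_source / parquet_file_format objects defined earlier; it promotes raw bronze files to a cleaned silver table with CETAS, and gold tables follow the same pattern with aggregations on top of silver.

-- MEDALLION LAYERING: bronze -> silver with CETAS (serverless SQL pool, sketch only)
-- The data quality rules below are illustrative.
CREATE EXTERNAL TABLE silver_sales
WITH (
    LOCATION = '/silver/sales/',
    DATA_SOURCE = sales_data_source,
    FILE_FORMAT = parquet_file_format
)
AS
SELECT
    customer_id,
    product_id,
    TRY_CAST(sales_date AS DATE) AS sales_date,
    amount,
    quantity
FROM OPENROWSET(
    BULK 'https://datalake001.dfs.core.windows.net/sales/raw/*.parquet',
    FORMAT = 'PARQUET'
) AS bronze
WHERE amount > 0
  AND customer_id IS NOT NULL;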

Data Formats

Optimal formats for different use cases

  • Parquet: Analytics workloads
  • Delta Lake: ACID transactions
  • CSV/JSON: Raw data ingestion

Folder Structure

Logical organization for manageability

  • /raw/ - Source system data
  • /processed/ - Transformed data
  • /curated/ - Business views

Security Model

Granular access control and encryption

  • Azure RBAC
  • POSIX-like ACLs
  • Encryption at rest

7. Security and Governance

Enterprise Security Framework

Security and governance in Azure Synapse cover authentication, authorization, data protection and compliance controls for enterprise workloads; a short sketch of two of these controls follows the table below.

Security feature | Implementation | Use case | Best practices
Azure AD authentication | Azure AD integrated sign-in | Single sign-on for all users | Enable MFA and Conditional Access
Row-level security | Security policies at table level | Data segmentation per department | Test policies thoroughly, use schemas
Column encryption | Always Encrypted with Azure Key Vault | PII data protection | Use randomized encryption, rotate keys
Dynamic data masking | Masking rules at column level | Development/test environments | Apply to sensitive columns, exempt admin users
Private endpoints | Private Link for the Synapse workspace | Compliance requirements | Combine with NSGs, monitor traffic
Audit logging | Azure Monitor + Log Analytics | Compliance and forensics | Retention policies, alerting rules
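
Row-level security and dynamic data masking can be configured directly in the dedicated SQL pool. The sketch below is illustrative only: the dim_customer table, its region and email columns, and the user-to-region mapping in the predicate are placeholder assumptions, not objects defined elsewhere in this guide.

-- ROW-LEVEL SECURITY AND DYNAMIC DATA MASKING (dedicated SQL pool, illustrative sketch)

-- 1. Inline predicate function: a user only sees rows whose region matches their user name
CREATE SCHEMA security;
GO
CREATE FUNCTION security.fn_region_predicate(@region AS NVARCHAR(50))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS allowed
       WHERE @region = USER_NAME()       -- placeholder mapping: user name equals region
          OR USER_NAME() = 'etl_user';   -- ETL account sees all rows
GO

CREATE SECURITY POLICY security.region_filter
ADD FILTER PREDICATE security.fn_region_predicate(region) ON dbo.dim_customer
WITH (STATE = ON);
GO

-- 2. Dynamic data masking on a PII column
ALTER TABLE dbo.dim_customer
ALTER COLUMN email ADD MASKED WITH (FUNCTION = 'email()');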

8. Performance Optimization

Synapse Performance Tuning

Performance optimization in Azure Synapse requires a combination of resource management, query tuning and architectural best practices.

Performance Optimization Techniques

-- AZURE SYNAPSE PERFORMANCE OPTIMIZATION

-- 1. RESOURCE CLASSES AND WORKLOAD MANAGEMENT
-- Configure workload groups (which supersede resource classes) for different user populations
CREATE WORKLOAD GROUP report_wg 
WITH ( 
    MIN_PERCENTAGE_RESOURCE = 30,
    CAP_PERCENTAGE_RESOURCE = 60,
    REQUEST_MIN_RESOURCE_GRANT_PERCENT = 5,
    IMPORTANCE = MEDIUM
);

CREATE WORKLOAD GROUP etl_wg 
WITH ( 
    MIN_PERCENTAGE_RESOURCE = 40,
    CAP_PERCENTAGE_RESOURCE = 80,
    REQUEST_MIN_RESOURCE_GRANT_PERCENT = 25,
    IMPORTANCE = HIGH
);

CREATE WORKLOAD CLASSIFIER report_user_classifier
WITH (
    WORKLOAD_GROUP = 'report_wg',
    MEMBERNAME = 'report_user',
    WLM_LABEL = 'reporting'
);

-- 2. TABLE DESIGN OPTIMIZATION
-- Choose appropriate distribution strategies
CREATE TABLE optimized_sales_fact (
    sales_id BIGINT IDENTITY(1,1),
    customer_id INT NOT NULL,
    product_id INT NOT NULL,
    sales_date DATE NOT NULL,
    amount DECIMAL(18,2) NOT NULL,
    quantity INT NOT NULL
)
WITH (
    DISTRIBUTION = HASH(customer_id),  -- for customer-centric queries
    CLUSTERED COLUMNSTORE INDEX,        -- for analytics workloads
    PARTITION (sales_date RANGE RIGHT FOR VALUES (
        '2023-01-01', '2023-04-01', '2023-07-01', '2023-10-01',
        '2024-01-01', '2024-04-01', '2024-07-01', '2024-10-01'
    ))
);

-- 3. QUERY PERFORMANCE ANALYSIS
-- Use DMVs for performance monitoring
SELECT 
    r.request_id,
    r.session_id,
    r.status,
    r.command,
    r.total_elapsed_time,
    r.resource_class,
    r.group_name AS workload_group_name,
    r.classifier_name
FROM sys.dm_pdw_exec_requests r
WHERE r.status NOT IN ('Completed', 'Failed')
    AND r.session_id != SESSION_ID()
ORDER BY r.total_elapsed_time DESC;

-- 4. CACHING STRATEGIES
-- Enable result set caching (run this statement while connected to the master database)
ALTER DATABASE sales_dw 
SET RESULT_SET_CACHING = ON;

-- Materialized views for complex aggregations
CREATE MATERIALIZED VIEW mv_monthly_sales_summary
WITH (DISTRIBUTION = HASH(sales_year_month))
AS
SELECT 
    YEAR(sales_date) * 100 + MONTH(sales_date) AS sales_year_month,
    customer_id,
    COUNT(*) AS transaction_count,
    SUM(amount) AS total_sales,
    AVG(amount) AS avg_transaction_value
FROM optimized_sales_fact
GROUP BY 
    YEAR(sales_date) * 100 + MONTH(sales_date),
    customer_id;

-- 5. INDEX MAINTENANCE
-- Rebuild indexes to restore columnstore compression and segment quality
ALTER INDEX ALL ON optimized_sales_fact REBUILD;

-- Update statistics
UPDATE STATISTICS optimized_sales_fact;

-- 6. SCALING STRATEGIES
-- Scale up the dedicated SQL pool for heavy workloads (run from the master database)
ALTER DATABASE sales_dw
MODIFY (SERVICE_OBJECTIVE = 'DW1000c');

-- Give the ETL workload group a larger guaranteed resource share during peak loads
ALTER WORKLOAD GROUP etl_wg 
WITH (MIN_PERCENTAGE_RESOURCE = 60);

-- Scale out over the data lake with PolyBase external tables
CREATE EXTERNAL TABLE external_large_dataset (
    record_id BIGINT,          -- illustrative columns; match them to the files in the folder
    event_date DATE,
    payload NVARCHAR(4000)
)
WITH (
    LOCATION = 'abfss://bigdata@datalake001.dfs.core.windows.net/large_dataset/',
    DATA_SOURCE = bigdata_source,
    FILE_FORMAT = parquet_format
);

9. Cost Management and Optimization

Azure Synapse Cost Control

Cost optimization in Azure Synapse requires a combination of resource management, monitoring and architectural choices to maximize ROI; a small cost-guardrail sketch follows the category lists below.

Compute Optimization

  • Auto-pause for development
  • Right-size dedicated pools
  • Serverless voor ad-hoc
  • Workload management

Storage Optimization

  • Data lifecycle management
  • Compression strategies
  • Tiering (Hot/Cool/Archive)
  • Delta Lake optimizations

Monitoring & Alerting

  • Cost analysis dashboards
  • Budget alerts
  • Anomaly detection
  • Resource utilization

Optimization Tools

  • Azure Cost Management
  • Synapse Monitor
  • Custom PowerShell scripts
  • Third-party tools
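
As a concrete example of the compute guardrails above, the serverless SQL pool can cap the amount of data processed per period. A minimal sketch, run against the serverless endpoint; the TB limits are illustrative:

-- SERVERLESS COST GUARDRAILS: cap data processed per period (limits are illustrative)
EXEC sp_set_data_processed_limit @type = N'daily',   @limit_tb = 1;
EXEC sp_set_data_processed_limit @type = N'weekly',  @limit_tb = 5;
EXEC sp_set_data_processed_limit @type = N'monthly', @limit_tb = 15;

-- Inspect the configured limits
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name LIKE 'Data processed%';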

10. Migration to Azure Synapse

Migration Strategies

Migration to Azure Synapse requires careful planning, assessment and execution to safeguard data integrity and business continuity.

Migration Roadmap and Implementation

-- MIGRATION ROADMAP FOR AZURE SYNAPSE

-- 1. ASSESSMENT PHASE: analyze the current environment
-- Use Azure Migrate for the assessment
/* 
Assessment metrics:
- Database size and growth
- Performance requirements
- Query complexity
- Security requirements
- Compliance needs
*/

-- 2. PLANNING PHASE: determine the migration strategy
CREATE TABLE migration_plan (
    phase_id INT PRIMARY KEY,
    phase_name NVARCHAR(100),
    description NVARCHAR(500),
    start_date DATE,
    end_date DATE,
    status NVARCHAR(50)
);

-- 3. MIGRATION TOOLING: Azure Database Migration Service
-- PowerShell script for the migration (shown here as a comment block)
/*
# Azure Database Migration Service configuration
$sourceConnectionString = "Server=onprem-sql;Database=SalesDB;Integrated Security=True"
$targetConnectionString = "Server=synapse-dataplatform.sql.azuresynapse.net;Database=SalesDW;User Id=admin;Password=***"

# Create migration project
$migrationProject = New-AzDataMigrationProject `
    -ResourceGroupName "rg-synapse-migration" `
    -ServiceName "dms-service" `
    -ProjectName "SalesDB-Migration" `
    -Location "WestEurope" `
    -SourceType "SQL" `
    -TargetType "SQLDB"

# Create and start the migration task
# ($sourceConnectionInfo, $targetConnectionInfo and $selectedDatabases are connection-info
#  and database-selection objects that have to be created beforehand; not shown here)
$migrationTask = New-AzDataMigrationTask `
    -TaskType "Migrate.SqlServer.SqlDb" `
    -ResourceGroupName "rg-synapse-migration" `
    -ServiceName "dms-service" `
    -ProjectName "SalesDB-Migration" `
    -TaskName "SalesDB-Migration-Task" `
    -SourceConnection $sourceConnectionInfo `
    -TargetConnection $targetConnectionInfo `
    -SelectedDatabase $selectedDatabases
"""

-- 4. DATA VALIDATION AFTER MIGRATION
-- Row count validation (the source-side counts below run on the source system;
-- shown side by side for illustration)
SELECT 
    'Source' AS system_type,
    COUNT(*) AS row_count
FROM onprem_sql.SalesDB.dbo.Customers
UNION ALL
SELECT 
    'Target',
    COUNT(*) 
FROM sales_dw.dbo.dim_customer;

-- Data integrity validation
SELECT 
    'Data Mismatch Check' AS check_type,
    COUNT(*) AS mismatch_count
FROM (
    SELECT CustomerID, SUM(OrderAmount) AS source_total
    FROM onprem_sql.SalesDB.dbo.Orders
    GROUP BY CustomerID
) source
FULL OUTER JOIN (
    SELECT customer_id, SUM(sales_amount) AS target_total
    FROM sales_dw.dbo.fact_sales
    GROUP BY customer_id
) target ON source.CustomerID = target.customer_id
WHERE source.source_total != target.target_total 
    OR (source.CustomerID IS NULL AND target.customer_id IS NOT NULL)
    OR (source.CustomerID IS NOT NULL AND target.customer_id IS NULL);

-- 5. PERFORMANCE VALIDATION
-- Query performance comparison
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Source system query
SELECT 
    YEAR(OrderDate) AS OrderYear,
    MONTH(OrderDate) AS OrderMonth,
    CustomerID,
    COUNT(*) AS OrderCount,
    SUM(OrderAmount) AS TotalAmount
FROM onprem_sql.SalesDB.dbo.Orders
WHERE OrderDate >= '2024-01-01'
GROUP BY YEAR(OrderDate), MONTH(OrderDate), CustomerID;

-- Synapse query
SELECT 
    YEAR(sales_date) AS sales_year,
    MONTH(sales_date) AS sales_month,
    customer_id,
    COUNT(*) AS transaction_count,
    SUM(sales_amount) AS total_sales
FROM sales_dw.dbo.fact_sales
WHERE sales_date >= '2024-01-01'
GROUP BY YEAR(sales_date), MONTH(sales_date), customer_id;

11. Best Practices and Patterns

Azure Synapse Best Practices

Best practices and patterns help you build scalable, performant and maintainable Synapse solutions.

Design Patterns for Synapse

Lakehouse Pattern
  • Medallion architecture
  • Delta Lake format
  • Unity Catalog
  • Multi-engine access
Modern DW Pattern
  • Star schema design
  • Hybrid tables
  • Materialized views
  • Resultset caching
Real-time Pattern
  • Event streaming
  • Spark Structured Streaming
  • Delta Live Tables
  • Real-time dashboards

12. Practical Case Studies

Real-World Implementations

These case studies show how Azure Synapse is implemented across different industries with measurable business outcomes.

Case Study: Financial Services

Challenge: legacy data warehouse, compliance requirements, real-time fraud detection.

Solution: Azure Synapse with a lakehouse architecture and real-time analytics.

Metric | Before | After | Improvement
Data processing | Batch (24h cycle) | Real-time (5 min) | 288x faster
Query performance | Minutes to hours | Sub-second | 100x+ faster
Fraud detection | Post-transaction | Real-time prevention | Proactive instead of reactive
Cost | $500k/year | $150k/year | 70% reduction
Compliance | Manual reporting | Automated audits | 100% accuracy

Conclusion and Key Takeaways

Key Lessons Learned

DO's
  • Start with a lakehouse architecture
  • Use serverless for exploration
  • Implement workload management
  • Monitor costs from day 1
  • Use Delta Lake for ACID guarantees
DON'Ts
  • Neglect security
  • Underestimate migration complexity
  • Skip monitoring
  • Skimp on team training
  • Copy on-premises patterns one-to-one
Emerging Trends
  • Fabric integration
  • AI/ML in Synapse
  • Unified governance
  • Low-code data flows
  • Auto-scaling pools

Frequently Asked Questions (FAQ)

Q: What is the difference between Azure Synapse and Fabric?

A: Synapse is a data warehousing and big data platform, whereas Fabric is a complete SaaS data platform with end-to-end analytics. Fabric builds on Synapse technologies but offers a unified experience with OneLake and AI integration.

Q: How do I implement CI/CD for Synapse?

A: CI/CD for Synapse: 1) Azure DevOps or GitHub Actions, 2) ARM templates for infrastructure, 3) SQL scripts in source control, 4) notebooks and pipelines as code, 5) automated testing, 6) blue-green deployments.

Q: What are the SLAs for Azure Synapse?

A: Azure Synapse SLAs: 1) dedicated SQL pool: 99.9% uptime, 2) serverless SQL pool: best effort, 3) Spark pool: 99.9%, 4) data integration: 99.9%. Always verify the current figures in the official Azure SLA documentation.

Q: How do I monitor Synapse performance?

A: Monitoring tools: 1) Azure Monitor metrics, 2) Log Analytics queries, 3) Dynamic Management Views (DMVs), 4) Synapse Studio monitoring, 5) custom dashboards in Power BI, 6) alert rules for performance issues. A small DMV sketch follows below.
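
As a starting point for the DMV route, a minimal sketch for a dedicated SQL pool; the QID value is a placeholder to be replaced with a request_id from the first query:

-- MONITORING SKETCH: find slow requests and drill into their distributed steps
SELECT TOP 10 request_id, command, total_elapsed_time, resource_class
FROM sys.dm_pdw_exec_requests
WHERE status = 'Completed'
ORDER BY total_elapsed_time DESC;

-- Inspect the steps of one request (replace QID12345 with a request_id from the query above)
SELECT step_index, operation_type, status, total_elapsed_time, row_count
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID12345'
ORDER BY step_index;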