Repository User Guide: FHIR Healthcare Data Platform
Executive Summary
This repository contains an Azure Synapse Analytics data platform designed for processing and analyzing FHIR (Fast Healthcare Interoperability Resources) healthcare data. The platform ingests NDJSON-formatted FHIR resources (Patient, Observation, Claim), transforms them through Spark notebooks and data flows, and loads them into a SQL dedicated pool for analytics.
Key Metrics
- Platform Type: Azure Synapse Analytics (hybrid Spark + SQL)
- Pipelines: 4 orchestration pipelines
- Activities: 19 total activities across all pipelines
- Datasets: 20 datasets (Parquet and SQL)
- Linked Services: 5 services across artifacts/ and mybigdata/ (Storage, SQL, source system, workspace defaults), plus 2 integration runtime configs
- Notebooks: 3 Synapse notebooks for data transformation
- SQL Scripts: 2 serverless SQL exploration scripts
- Data Flows: 1 mapping data flow for Patient data
---
1. Overview
Purpose
This repository implements a healthcare data platform that processes FHIR-compliant medical records. The platform follows a medallion architecture pattern with three storage layers:
- Raw: Initial NDJSON ingestion from source systems
- Processed: Flattened Parquet files after Spark transformations
- Curated: SQL tables in Synapse Dedicated Pool for analytics
File Path: artifacts/pipeline/Copy_Data_Source_To_Raw_PL.json
Evidence: The pipeline defines three container-preparation activities: "Copy Source Data To Raw Container", "Create Curated Container", and "Create Processed Container"
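The journey of a single record through the three layers can be sketched in plain Python. The transformations actually run in Spark; the NDJSON line below is an invented minimal fragment, not data from the repository:

```python
import json

# Raw layer: one line of an NDJSON file, exactly as ingested.
raw_line = '{"resourceType": "Patient", "id": "p1", "address": [{"city": "Oslo", "country": "NO"}]}'

# Processed layer: the nested resource is flattened into columnar records,
# which is what the Spark notebooks persist as Parquet.
resource = json.loads(raw_line)
flattened = [
    {"patient_id": resource["id"], "city": a.get("city"), "country": a.get("country")}
    for a in resource.get("address", [])
]

# Curated layer: each flattened record maps one-to-one onto a SQL table row.
print(flattened)  # [{'patient_id': 'p1', 'city': 'Oslo', 'country': 'NO'}]
```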
Platform Type
Azure Synapse Analytics (hybrid architecture)
- Synapse Spark pools for data transformation
- Synapse Dedicated SQL Pool for data warehousing
- Synapse Serverless SQL for ad-hoc exploration
- Azure Data Lake Storage Gen2 for data persistence
Key Statistics
- 4 Pipelines: 3 FHIR resource-specific pipelines (Patient, Observation, Claim) + 1 data preparation pipeline
- 19 Activities: 6 notebook executions, 9 copy activities (3 of them container initialization), 3 script activities, 1 data flow
- 20 Datasets: 14 Parquet datasets, 6 SQL datasets
- 3 Linked Services: StorageLS (ADLS Gen2), SynapseDedicatedPoolLS (SQL Pool), Source_Dataset_LS (source system)
---
2. Getting Started
Prerequisites
Required Access
1. Azure Synapse Workspace: Read/write access to the Synapse workspace
2. Storage Account: Contributor access to the Azure Data Lake Storage Gen2 account
3. SQL Dedicated Pool: db_datareader and db_datawriter roles on the dedicated SQL pool
4. Synapse Spark Pool: Permission to execute notebooks on Spark compute
File Path: artifacts/linkedService/SynapseDedicatedPoolLS.json
Evidence: The linked service stores a "connectionString" property, indicating SQL authentication is used
Software Requirements
- Azure Synapse Studio (web-based, no local installation required)
- Git client (for repository management)
- Optional: Azure CLI for deployment automation
Development Environment Setup
Step 1: Clone the Repository
git clone <repository-url>
cd <repository-name>
Step 2: Configure Linked Services
You must update the following linked service connection strings with your environment-specific values:
1. StorageLS (artifacts/linkedService/StorageLS.json)
- Update the ADLS Gen2 storage account URL
2. SynapseDedicatedPoolLS (artifacts/linkedService/SynapseDedicatedPoolLS.json)
- Update the SQL dedicated pool connection string
- Security Note: The connection string is stored as plain text (see the secret_tracer evidence below)
3. Source_Dataset_LS (artifacts/linkedService/Source_Dataset_LS.json)
- Configure the source system connection
File Path: artifacts/linkedService/SynapseDedicatedPoolLS.json
Scanner Reference: secret_tracer
Evidence: type=plain_text location=typeProperties.connectionString secure=False
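The secret_tracer finding can be reproduced with a short standalone check. This is a hypothetical helper, not part of the repository; the sample JSON stands in for SynapseDedicatedPoolLS.json with an invented connection string:

```python
import json

def find_plaintext_secrets(linked_service_json: str):
    """Return a finding for each inline (non-secured) connection string."""
    doc = json.loads(linked_service_json)
    findings = []
    props = doc.get("properties", {}).get("typeProperties", {})
    value = props.get("connectionString")
    # A secured value is an object (e.g. a SecureString or an
    # AzureKeyVaultSecret reference); a bare string means plain text.
    if isinstance(value, str):
        findings.append({
            "type": "plain_text",
            "location": "typeProperties.connectionString",
            "secure": False,
        })
    return findings

# Minimal stand-in for SynapseDedicatedPoolLS.json (details invented).
sample = json.dumps({
    "name": "SynapseDedicatedPoolLS",
    "properties": {
        "type": "AzureSqlDW",
        "typeProperties": {"connectionString": "Server=tcp:example;Database=pool"},
    },
})
print(find_plaintext_secrets(sample))
```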
Step 3: Configure Integration Runtime
The repository uses AutoResolveIntegrationRuntime for activity execution.
File Path: artifacts/integrationRuntime/AutoResolveIntegrationRuntime.json
Evidence: Integration runtime configuration exists in both artifacts/ and mybigdata/ directories
Step 4: Deploy to Synapse Workspace
1. Open Azure Synapse Studio
2. Navigate to Manage → Git configuration
3. Connect the repository to your Synapse workspace
4. Publish the artifacts to the live mode
Key Configuration Files
| File | Purpose | Location |
|---|---|---|
| SynapseDedicatedPoolLS.json | SQL pool connection | artifacts/linkedService/ |
| StorageLS.json | ADLS Gen2 storage connection | artifacts/linkedService/ |
| AutoResolveIntegrationRuntime.json | Compute runtime configuration | artifacts/integrationRuntime/ |
| Pipeline JSON files | Orchestration definitions | artifacts/pipeline/ |
| Dataset JSON files | Data source/sink definitions | artifacts/dataset/ |
---
3. Repository Structure
Directory Layout
repository-root/
├── artifacts/
│ ├── pipeline/ # 4 orchestration pipelines
│ ├── dataset/ # 20 dataset definitions
│ ├── linkedService/ # 3 linked service connections
│ ├── integrationRuntime/ # 1 integration runtime config
│ ├── dataflow/ # 1 mapping data flow (Patient)
│ ├── notebook/ # 3 Synapse Spark notebooks
│ └── sqlscript/ # 2 SQL exploration scripts
└── mybigdata/
├── linkedService/ # 2 workspace default services
└── integrationRuntime/ # 1 workspace default runtime
Major Folders
`/artifacts/pipeline/`
Contains 4 pipeline definitions for FHIR data processing:
File Path: artifacts/pipeline/FHIR_Pipeline4Observation_Spark_OC.json
File Path: artifacts/pipeline/FHIR_Pipeline4Patient_DataFlow_OC.json
File Path: artifacts/pipeline/FHIR_Pipeline4Claim_Spark_OC.json
File Path: artifacts/pipeline/Copy_Data_Source_To_Raw_PL.json
`/artifacts/dataset/`
Contains 20 dataset definitions:
- 14 Parquet datasets: Intermediate storage in ADLS Gen2, plus data-prep source/sink definitions
- 6 SQL datasets: Target tables in Synapse Dedicated Pool
Example File Paths:
artifacts/dataset/ObservationMain_LargeParquet.json
artifacts/dataset/Observation_SQLDS.json
artifacts/dataset/PatientAddressParquetLarge.json
`/artifacts/linkedService/`
Contains 3 linked service connections:
File Path: artifacts/linkedService/StorageLS.json (ADLS Gen2)
File Path: artifacts/linkedService/SynapseDedicatedPoolLS.json (SQL Pool)
File Path: artifacts/linkedService/Source_Dataset_LS.json (Source System)
`/artifacts/notebook/`
Contains Synapse Spark notebooks for data transformation:
File Path: artifacts/notebook/ClaimParquetFlatten_Large.json
Evidence: References table:functions and external sources curated@\ and processed@\ (Scanner: lineage_tracer)
`/artifacts/sqlscript/`
Contains SQL scripts for data exploration:
File Path: artifacts/sqlscript/JSON_exploration_w_Serverless_Demo_OC.json
File Path: artifacts/sqlscript/Spark DB Exploration Scripts.json
Evidence: Line 9 in JSON_exploration_w_Serverless_Demo_OC.json contains raw SQL with OPENROWSET function for querying ADLS Gen2 directly
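The OPENROWSET pattern referenced above is the standard serverless-SQL way to query NDJSON in place. A sketch of such a query, assembled in Python; the storage URL is a placeholder, and the exact SQL in the repository's script may differ:

```python
# Hypothetical ADLS Gen2 path; the real script's path lives in the JSON artifact.
bulk_path = "https://<account>.dfs.core.windows.net/raw/Patient.ndjson"

# Standard serverless-SQL trick for NDJSON: read each line as one CSV "field"
# (0x0b never occurs in the data), then extract values with JSON_VALUE.
query = f"""
SELECT TOP 10
    JSON_VALUE(doc, '$.id')           AS patient_id,
    JSON_VALUE(doc, '$.resourceType') AS resource_type
FROM OPENROWSET(
    BULK '{bulk_path}',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b'
) WITH (doc NVARCHAR(MAX)) AS rows
"""
print(query)
```

The query runs against the serverless SQL endpoint without any table having been created first, which is what makes it useful for exploration.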
`/mybigdata/`
Contains workspace-level default configurations:
File Path: mybigdata/linkedService/mybigdatademows-WorkspaceDefaultStorage.json
File Path: mybigdata/linkedService/mybigdatademows-WorkspaceDefaultSqlServer.json
File Naming Conventions
Pipelines
- Pattern: FHIR_Pipeline4<ResourceType>_<ProcessingMethod>_OC
- Example: FHIR_Pipeline4Patient_DataFlow_OC.json
- Suffix _OC likely indicates "Operational Context" or an environment designation
Datasets
- Parquet datasets: <ResourceType><Component>ParquetLarge
- Example: PatientAddressParquetLarge.json
- SQL datasets: <ResourceType><Component>SQL or <ResourceType>_SQLDS
- Examples: PatientAddressSQL.json, Observation_SQLDS.json
- Suffix Large indicates datasets handling large data volumes
Notebooks
- Pattern: <ResourceType>ParquetFlatten_Large
- Example: ClaimParquetFlatten_Large.json
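The conventions are regular enough to express as code. A small hypothetical helper (not part of the repository) that makes the patterns explicit:

```python
def parquet_dataset_name(resource: str, component: str) -> str:
    # <ResourceType><Component>ParquetLarge
    return f"{resource}{component}ParquetLarge"

def sql_dataset_name(resource: str, component: str = "") -> str:
    # <ResourceType><Component>SQL, or <ResourceType>_SQLDS when no component
    return f"{resource}{component}SQL" if component else f"{resource}_SQLDS"

def pipeline_name(resource: str, method: str) -> str:
    # FHIR_Pipeline4<ResourceType>_<ProcessingMethod>_OC
    return f"FHIR_Pipeline4{resource}_{method}_OC"

print(parquet_dataset_name("Patient", "Address"))  # PatientAddressParquetLarge
print(sql_dataset_name("Observation"))             # Observation_SQLDS
print(pipeline_name("Patient", "DataFlow"))        # FHIR_Pipeline4Patient_DataFlow_OC
```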
---
4. Pipelines
Pipeline Inventory
4.1. FHIR_Pipeline4Observation_Spark_OC
File Path: artifacts/pipeline/FHIR_Pipeline4Observation_Spark_OC.json
Purpose: Processes FHIR Observation resources (vital signs, lab results, clinical measurements)
Activities (4 total):
1. NDJSON_Ingestion_Observation (SynapseNotebook)
- Dependencies: None (entry point)
- Purpose: Ingests raw NDJSON Observation files from source
2. ObservationParquetFlatten_Large (SynapseNotebook)
- Dependencies: NDJSON_Ingestion_Observation
- Purpose: Flattens nested JSON structures into columnar Parquet format
3. Create Tables (Script)
- Dependencies: ObservationParquetFlatten_Large
- Linked Service: SynapseDedicatedPoolLS
- Purpose: Creates SQL table schema in dedicated pool
- Location: Line 187 of the pipeline JSON
- Scanner Reference: pipeline_analyzer
- Evidence: activity_script — Script: Create Tables
4. Observation_Parquet_large2SQL (Copy)
- Dependencies: Create Tables
- Input: ObservationMain_LargeParquet
- Output: Observation_SQLDS
- Purpose: Bulk loads Parquet data into SQL table
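The activity dependencies above form a simple linear DAG. As an illustration, Python's standard-library graphlib resolves such a dependency map into the execution order Synapse follows:

```python
from graphlib import TopologicalSorter

# Activity -> activities it depends on, as declared in
# FHIR_Pipeline4Observation_Spark_OC.json.
dependencies = {
    "NDJSON_Ingestion_Observation": set(),
    "ObservationParquetFlatten_Large": {"NDJSON_Ingestion_Observation"},
    "Create Tables": {"ObservationParquetFlatten_Large"},
    "Observation_Parquet_large2SQL": {"Create Tables"},
}

# static_order() yields activities in dependency-respecting order.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```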
---
4.2. FHIR_Pipeline4Patient_DataFlow_OC
File Path: artifacts/pipeline/FHIR_Pipeline4Patient_DataFlow_OC.json
Purpose: Processes FHIR Patient resources (demographics, identifiers, addresses)
Activities (5 total):
1. NDJSON_Ingestion_Patient (SynapseNotebook)
- Dependencies: None (entry point)
- Purpose: Ingests raw NDJSON Patient files
2. PatientParquet2Sink (ExecuteDataFlow)
- Dependencies: NDJSON_Ingestion_Patient
- Data Flow: PatientJSON_Flatten_large
- Purpose: Executes mapping data flow to flatten Patient JSON
3. Create Tables (Script)
- Dependencies: PatientParquet2Sink
- Linked Service: SynapseDedicatedPoolLS
- Purpose: Creates SQL table schemas
- Location: Line 552 of the pipeline JSON
- Evidence: activity_script — Script: Create Tables
4. PatientAddress_large2SQL (Copy)
- Dependencies: Create Tables
- Input: PatientAddressParquetLarge
- Output: PatientAddressSQL
- Purpose: Loads patient address data
5. PatientIdentifier_large2SQL (Copy)
- Dependencies: Create Tables
- Input: PatientIdentifierParquetLarge
- Output: PatientIdentifierSQLLarge
- Purpose: Loads patient identifier data
---
4.3. FHIR_Pipeline4Claim_Spark_OC
File Path: artifacts/pipeline/FHIR_Pipeline4Claim_Spark_OC.json
Purpose: Processes FHIR Claim resources (insurance claims, diagnoses, procedures)
Activities (7 total):
1. NDJSON_Ingestion_Claim (SynapseNotebook)
- Dependencies: None (entry point)
- Purpose: Ingests raw NDJSON Claim files
2. ClaimParquetFlatten_Large (SynapseNotebook)
- Dependencies: NDJSON_Ingestion_Claim
- Purpose: Flattens nested Claim JSON structures
- File Path: artifacts/notebook/ClaimParquetFlatten_Large.json
- Evidence: Reads from curated@\ and processed@\ external sources
3. Create Tables (Script)
- Dependencies: ClaimParquetFlatten_Large
- Linked Service: SynapseDedicatedPoolLS
- Purpose: Creates SQL table schemas
- Location: Line 373 of the pipeline JSON
- Evidence: activity_script — Script: Create Tables
4. ClaimDiagnosis2SQL (Copy)
- Dependencies: Create Tables
- Input: ClaimDiagnosisParquetLarge
- Output: ClaimDiagnosisSQL
- Purpose: Loads claim diagnosis data
5. ClaimInsurance2SQL (Copy)
- Dependencies: Create Tables
- Input: ClaimInsuranceParquetLarge
- Output: ClaimInsurance
- Purpose: Loads claim insurance data
6. ClaimProcedure2SQL (Copy)
- Dependencies: Create Tables
- Input: ClaimProcedureParquetLarge
- Output: ClaimProcedureSQL
- Purpose: Loads claim procedure data
7. LakeDatabase And Table Creation (SynapseNotebook)
- Dependencies: ClaimParquetFlatten_Large
- Purpose: Creates Lake Database tables for Spark SQL access
---
4.4. Copy_Data_Source_To_Raw_PL
File Path: artifacts/pipeline/Copy_Data_Source_To_Raw_PL.json
Purpose: Initial data preparation pipeline that creates the medallion architecture containers
Activities (3 total):
1. Copy Source Data To Raw Container (Copy)
- Dependencies: None (entry point)
- Input: Source_DataPrep_DS
- Output: Sink_DataPrep_DS
- Purpose: Copies source data to raw container
2. Create Curated Container (Copy)
- Dependencies: Copy Source Data To Raw Container
- Input: Source_DataPrep_Curated_DS
- Output: Sink_DataPrep_Curated_DS
- Purpose: Initializes curated data container
3. Create Processed Container (Copy)
- Dependencies: Copy Source Data To Raw Container
- Input: Source_DataPrep_Processed_DS
- Output: Sink_DataPrep_Processed_DS
- Purpose: Initializes processed data container
---
Pipeline Dependency Chains
- FHIR_Pipeline4Observation_Spark_OC: NDJSON_Ingestion_Observation → ObservationParquetFlatten_Large → Create Tables → Observation_Parquet_large2SQL
- FHIR_Pipeline4Patient_DataFlow_OC: NDJSON_Ingestion_Patient → PatientParquet2Sink → Create Tables → {PatientAddress_large2SQL, PatientIdentifier_large2SQL}
- FHIR_Pipeline4Claim_Spark_OC: NDJSON_Ingestion_Claim → ClaimParquetFlatten_Large → {Create Tables → (ClaimDiagnosis2SQL, ClaimInsurance2SQL, ClaimProcedure2SQL), LakeDatabase And Table Creation}
- Copy_Data_Source_To_Raw_PL: Copy Source Data To Raw Container → {Create Curated Container, Create Processed Container}
Evidence: Dependency chains extracted from pipeline JSON files by lineage_tracer scanner
---
How to Run Pipelines
Manual Execution via Synapse Studio
1. Navigate to Integrate → Pipelines
2. Select the desired pipeline (e.g., FHIR_Pipeline4Patient_DataFlow_OC)
3. Click Add trigger → Trigger now
4. Monitor execution in Monitor → Pipeline runs
Recommended Execution Order
For initial data load, execute pipelines in this sequence:
1. Copy_Data_Source_To_Raw_PL (creates container structure)
2. FHIR_Pipeline4Patient_DataFlow_OC (Patient is foundational)
3. FHIR_Pipeline4Observation_Spark_OC (Observations reference Patients)
4. FHIR_Pipeline4Claim_Spark_OC (Claims reference Patients)
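Where runs need to be automated rather than triggered from Synapse Studio, the same order can be scripted. This is a sketch assuming the azure-synapse-artifacts and azure-identity packages and a configured workspace endpoint; the pipeline names come from this repository, everything else is illustrative:

```python
# Pipelines in the recommended initial-load order.
RUN_ORDER = [
    "Copy_Data_Source_To_Raw_PL",          # creates the container structure
    "FHIR_Pipeline4Patient_DataFlow_OC",   # Patient is foundational
    "FHIR_Pipeline4Observation_Spark_OC",  # Observations reference Patients
    "FHIR_Pipeline4Claim_Spark_OC",        # Claims reference Patients
]

def trigger_all(endpoint: str) -> None:
    # Requires: pip install azure-synapse-artifacts azure-identity
    from azure.identity import DefaultAzureCredential
    from azure.synapse.artifacts import ArtifactsClient

    client = ArtifactsClient(credential=DefaultAzureCredential(), endpoint=endpoint)
    for name in RUN_ORDER:
        # create_pipeline_run only starts the run; production code should
        # poll the run status and wait for success before starting the next.
        run = client.pipeline.create_pipeline_run(name)
        print(f"started {name}: run_id={run.run_id}")
```

Note that create_pipeline_run returns immediately, so a real initial load should wait for each pipeline to finish before triggering the next.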
Monitoring Pipelines
- Real-time monitoring: Synapse Studio → Monitor → Pipeline runs
- Activity-level details: Click on pipeline run → View activity details
- Logs: Each activity provides detailed logs and error messages
- Spark logs: For notebook activities, access Spark application logs via the monitoring interface
---
5. Notebooks
Notebook Inventory
5.1. NDJSON_Ingestion_Observation
Purpose: Ingests raw NDJSON Observation files from source system
Language: Inferred to be PySpark (Synapse Notebook)
Execution Context: Called by FHIR_Pipeline4Observation_Spark_OC pipeline
Dependencies: None (entry point)
Expected Operations:
- Read NDJSON files from source linked service
- Perform initial data validation
- Write raw data to ADLS Gen2 raw container
---
5.2. ObservationParquetFlatten_Large
Purpose: Flattens nested Observation JSON structures into columnar Parquet format
Language: Inferred to be PySpark
Execution Context: Called by FHIR_Pipeline4Observation_Spark_OC pipeline
Dependencies: Requires NDJSON_Ingestion_Observation to complete first
Expected Operations:
- Read raw NDJSON from raw container
- Flatten nested JSON arrays and objects
- Write flattened data to Parquet format
- Output dataset:
ObservationMain_LargeParquet
---
5.3. NDJSON_Ingestion_Patient
Purpose: Ingests raw NDJSON Patient files from source system
Language: Inferred to be PySpark
Execution Context: Called by FHIR_Pipeline4Patient_DataFlow_OC pipeline
Dependencies: None (entry point)
Expected Operations:
- Read NDJSON Patient files
- Initial validation and quality checks
- Write to raw container
---
5.4. NDJSON_Ingestion_Claim
Purpose: Ingests raw NDJSON Claim files from source system
Language: Inferred to be PySpark
Execution Context: Called by FHIR_Pipeline4Claim_Spark_OC pipeline
Dependencies: None (entry point)
Expected Operations:
- Read NDJSON Claim files
- Initial validation
- Write to raw container
---
5.5. ClaimParquetFlatten_Large
File Path: artifacts/notebook/ClaimParquetFlatten_Large.json
Purpose: Flattens nested Claim JSON structures and creates Lake Database tables
Language: Inferred to be PySpark
Execution Context: Called by FHIR_Pipeline4Claim_Spark_OC pipeline
Dependencies: Requires NDJSON_Ingestion_Claim to complete first
Data Sources (from lineage_tracer):
- Input: curated@\ (external source)
- Input: processed@\ (external source)
- Reads: table:functions (referenced 7 times in the notebook)
Evidence:
table:functions --[reads_from]--> file:artifacts/notebook/ClaimParquetFlatten_Large.json
ext:curated@\ --[reads_from]--> file:artifacts/notebook/ClaimParquetFlatten_Large.json
ext:processed@\ --[reads_from]--> file:artifacts/notebook/ClaimParquetFlatten_Large.json
Expected Operations:
- Read raw Claim NDJSON
- Flatten diagnosis, insurance, and procedure arrays
- Write to multiple Parquet datasets: ClaimDiagnosisParquetLarge, ClaimInsuranceParquetLarge, ClaimProcedureParquetLarge
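In miniature, flattening one of these arrays looks like the following plain-Python sketch (the notebook does the equivalent at scale, presumably with pyspark.sql.functions such as explode; the claim below is an invented fragment):

```python
import json

# One NDJSON line with a minimal, hypothetical FHIR Claim fragment.
claim_line = json.dumps({
    "resourceType": "Claim",
    "id": "c1",
    "diagnosis": [
        {"sequence": 1, "diagnosisCodeableConcept": {"text": "Hypertension"}},
        {"sequence": 2, "diagnosisCodeableConcept": {"text": "Diabetes"}},
    ],
})

claim = json.loads(claim_line)
# One output row per array element, keyed back to the parent claim:
# the same shape ClaimDiagnosisParquetLarge ends up with.
diagnosis_rows = [
    {
        "claim_id": claim["id"],
        "sequence": d["sequence"],
        "diagnosis": d["diagnosisCodeableConcept"]["text"],
    }
    for d in claim.get("diagnosis", [])
]
print(diagnosis_rows)
```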
---
5.6. LakeDatabase And Table Creation
Purpose: Creates Spark SQL Lake Database tables for Claim data
Language: Inferred to be PySpark
Execution Context: Called by FHIR_Pipeline4Claim_Spark_OC pipeline
Dependencies: Requires ClaimParquetFlatten_Large to complete first
Expected Operations:
- Create Lake Database schema
- Register Parquet files as external tables
- Enable Spark SQL queries on Claim data
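Registering existing Parquet output as Lake Database tables typically reduces to Spark SQL DDL of the following shape. The database name and locations below are placeholders, since the notebook's actual content is not captured by the scanners:

```python
# Hypothetical database and table locations for illustration only.
database = "claim_lakedb"
tables = {
    "claim_diagnosis": "abfss://processed@<account>.dfs.core.windows.net/claim/diagnosis/",
    "claim_insurance": "abfss://processed@<account>.dfs.core.windows.net/claim/insurance/",
    "claim_procedure": "abfss://processed@<account>.dfs.core.windows.net/claim/procedure/",
}

statements = [f"CREATE DATABASE IF NOT EXISTS {database}"]
for table, location in tables.items():
    # USING PARQUET + LOCATION registers the existing files as an external table.
    statements.append(
        f"CREATE TABLE IF NOT EXISTS {database}.{table} USING PARQUET LOCATION '{location}'"
    )

# In the notebook each statement would run as spark.sql(stmt).
for stmt in statements:
    print(stmt)
```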
---
Notebook Execution Order
Observation Processing Flow
NDJSON_Ingestion_Observation → ObservationParquetFlatten_Large
Patient Processing Flow
NDJSON_Ingestion_Patient → (PatientParquet2Sink DataFlow)
Claim Processing Flow
NDJSON_Ingestion_Claim → ClaimParquetFlatten_Large → LakeDatabase And Table Creation
Languages and Key Libraries
Primary Language: PySpark (Python with Spark)
Expected Libraries (inferred from FHIR processing context):
- pyspark.sql: DataFrame operations
- pyspark.sql.functions: data transformation functions (referenced 7 times in ClaimParquetFlatten_Large)
- pyspark.sql.types: schema definitions
- JSON parsing utilities for NDJSON handling
Evidence: table:functions is referenced 7 times in ClaimParquetFlatten_Large.json, indicating heavy use of Spark SQL functions
---
6. Data Assets
Dataset Inventory
Parquet Datasets (14 total)
All Parquet datasets use the StorageLS linked service (ADLS Gen2).
| Dataset Name | File Path | Purpose |
|---|---|---|
| ObservationMain_LargeParquet | artifacts/dataset/ObservationMain_LargeParquet.json | Flattened Observation data |
| PatientAddressParquetLarge | artifacts/dataset/PatientAddressParquetLarge.json | Patient address records |
| PatientIdentifierParquetLarge | artifacts/dataset/PatientIdentifierParquetLarge.json | Patient identifier records |
| PatientRawParquetLarge | artifacts/dataset/PatientRawParquetLarge.json | Raw Patient data |
| PatientExtensionParquetLarge | artifacts/dataset/PatientExtensionParquetLarge.json | Patient extension attributes |
| ClaimDiagnosisParquetLarge | artifacts/dataset/ClaimDiagnosisParquetLarge.json | Claim diagnosis records |
| ClaimInsuranceParquetLarge | artifacts/dataset/ClaimInsuranceParquetLarge.json | Claim insurance records |
| ClaimProcedureParquetLarge | artifacts/dataset/ClaimProcedureParquetLarge.json | Claim procedure records |
| Sink_DataPrep_DS | artifacts/dataset/Sink_DataPrep_DS.json | Raw container sink |
| Sink_DataPrep_Curated_DS | artifacts/dataset/Sink_DataPrep_Curated_DS.json | Curated container sink |
| Sink_DataPrep_Processed_DS | artifacts/dataset/Sink_DataPrep_Processed_DS.json | Processed container sink |
| Source_DataPrep_DS | artifacts/dataset/Source_DataPrep_DS.json | Source data for prep |
| Source_DataPrep_Curated_DS | artifacts/dataset/Source_DataPrep_Curated_DS.json | Source curated data |
| Source_DataPrep_Processed_DS | artifacts/dataset/Source_DataPrep_Processed_DS.json | Source processed data |
Evidence: All Parquet datasets show linked_service=StorageLS in lineage data
---
SQL Datasets (6 total)
All SQL datasets use the SynapseDedicatedPoolLS linked service.
| Dataset Name | File Path | Purpose |
|---|---|---|
| Observation_SQLDS | artifacts/dataset/Observation_SQLDS.json | Observation SQL table |
| PatientAddressSQL | artifacts/dataset/PatientAddressSQL.json | Patient address SQL table |
| PatientIdentifierSQLLarge | artifacts/dataset/PatientIdentifierSQLLarge.json | Patient identifier SQL table |
| ClaimDiagnosisSQL | artifacts/dataset/ClaimDiagnosisSQL.json | Claim diagnosis SQL table |
| ClaimInsurance | artifacts/dataset/ClaimInsurance.json | Claim insurance SQL table |
| ClaimProcedureSQL | artifacts/dataset/ClaimProcedureSQL.json | Claim procedure SQL table |
Evidence: All SQL datasets show linked_service=SynapseDedicatedPoolLS in lineage data
---
Linked Services
StorageLS
File Path: artifacts/linkedService/StorageLS.json
Type: Azure Data Lake Storage Gen2
Purpose: Primary data lake for Parquet file storage
Used By: 14 Parquet datasets
---
SynapseDedicatedPoolLS
File Path: artifacts/linkedService/SynapseDedicatedPoolLS.json
Type: Azure Synapse Dedicated SQL Pool
Purpose: Data warehouse for analytics-ready tables
Used By: 6 SQL datasets + 3 "Create Tables" script activities
Security Finding:
- Scanner Reference: secret_tracer
- Evidence: type=plain_text location=typeProperties.connectionString secure=False
- Recommendation: Store the connection string in Azure Key Vault and reference it from the linked service
---
Source_Dataset_LS
File Path: artifacts/linkedService/Source_Dataset_LS.json
Type: Unknown (not specified in scanner data)
Purpose: Source system connection for FHIR data ingestion
Used By: 3 source datasets (Source_DataPrep_DS, Source_DataPrep_Curated_DS, Source_DataPrep_Processed_DS)
---
Data Flow: Sources → Transformations → Sinks
Source: Source_Dataset_LS (FHIR NDJSON files)
→ Copy Pipeline →
Raw Layer: StorageLS raw container (NDJSON format)
→ Spark Notebooks →
Processed Layer: StorageLS processed container (Parquet format)
→ Copy Activities →
Curated Layer: SynapseDedicatedPoolLS (SQL tables)
Evidence: Data flow derived from pipeline activity dependencies and dataset lineage edges
---
Observation Data Flow
File Path: artifacts/pipeline/FHIR_Pipeline4Observation_Spark_OC.json
Evidence: Lineage edges show ObservationMain_LargeParquet --[reads_from]--> Observation_Parquet_large2SQL and Observation_Parquet_large2SQL --[writes_to]--> Observation_SQLDS
---
Patient Data Flow
File Path: artifacts/pipeline/FHIR_Pipeline4Patient_DataFlow_OC.json
Evidence: Lineage edges show Patient data split into multiple Parquet datasets, then loaded to SQL
---