Generated by CodePrizm

Repository User Guide: FHIR Healthcare Data Platform

Executive Summary

This repository contains an Azure Synapse Analytics data platform designed for processing and analyzing FHIR (Fast Healthcare Interoperability Resources) healthcare data. The platform ingests NDJSON-formatted FHIR resources (Patient, Observation, Claim), transforms them through Spark notebooks and data flows, and loads them into a SQL dedicated pool for analytics.

Key Metrics

---

1. Overview

Purpose

This repository implements a healthcare data platform that processes FHIR-compliant medical records. The platform follows a medallion architecture pattern with three storage layers: raw, curated, and processed.

File Path: artifacts/pipeline/Copy_Data_Source_To_Raw_PL.json

Evidence: The pipeline creates three containers: "Copy Source Data To Raw Container", "Create Curated Container", and "Create Processed Container"

Platform Type

Azure Synapse Analytics (hybrid architecture)

Key Statistics

---

2. Getting Started

Prerequisites

Required Access

1. Azure Synapse Workspace: Read/write access to the Synapse workspace

2. Storage Account: Contributor access to the Azure Data Lake Storage Gen2 account

3. SQL Dedicated Pool: db_datareader and db_datawriter roles on the dedicated SQL pool

4. Synapse Spark Pool: Permission to execute notebooks on Spark compute

File Path: artifacts/linkedService/SynapseDedicatedPoolLS.json

Evidence: Connection string reference indicates SQL authentication is used: "connectionString"

Software Requirements

Development Environment Setup

Step 1: Clone the Repository

```shell
git clone <repository-url>
cd <repository-name>
```

Step 2: Configure Linked Services

You must update the following linked service connection strings with your environment-specific values:

1. StorageLS (artifacts/linkedService/StorageLS.json)

2. SynapseDedicatedPoolLS (artifacts/linkedService/SynapseDedicatedPoolLS.json)

3. Source_Dataset_LS (artifacts/linkedService/Source_Dataset_LS.json)

File Path: artifacts/linkedService/SynapseDedicatedPoolLS.json

Scanner Reference: secret_tracer

Evidence: type=plain_text location=typeProperties.connectionString secure=False
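Because the scanner flags this connection string as stored in plain text, consider referencing an Azure Key Vault secret from the linked service definition instead. The fragment below is an illustrative sketch only; the Key Vault linked service name and secret name are placeholders, not values from this repository:

```json
{
  "name": "SynapseDedicatedPoolLS",
  "properties": {
    "type": "AzureSqlDW",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "MyKeyVaultLS",
          "type": "LinkedServiceReference"
        },
        "secretName": "synapse-dedicated-pool-connection-string"
      }
    }
  }
}
```

With this pattern, the secret never appears in source control; the workspace resolves it from Key Vault at runtime via its managed identity.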

Step 3: Configure Integration Runtime

The repository uses AutoResolveIntegrationRuntime for activity execution.

File Path: artifacts/integrationRuntime/AutoResolveIntegrationRuntime.json

Evidence: Integration runtime configuration exists in both artifacts/ and mybigdata/ directories

Step 4: Deploy to Synapse Workspace

1. Open Azure Synapse Studio

2. Navigate to Manage → Git configuration

3. Connect the repository to your Synapse workspace

4. Publish the artifacts to live mode

Key Configuration Files

| File | Purpose | Location |
| --- | --- | --- |
| SynapseDedicatedPoolLS.json | SQL pool connection | artifacts/linkedService/ |
| StorageLS.json | ADLS Gen2 storage connection | artifacts/linkedService/ |
| AutoResolveIntegrationRuntime.json | Compute runtime configuration | artifacts/integrationRuntime/ |
| Pipeline JSON files | Orchestration definitions | artifacts/pipeline/ |
| Dataset JSON files | Data source/sink definitions | artifacts/dataset/ |

---

3. Repository Structure

Directory Layout

repository-root/
├── artifacts/
│   ├── pipeline/              # 4 orchestration pipelines
│   ├── dataset/               # 20 dataset definitions
│   ├── linkedService/         # 3 linked service connections
│   ├── integrationRuntime/    # 1 integration runtime config
│   ├── dataflow/              # 1 mapping data flow (Patient)
│   ├── notebook/              # 3 Synapse Spark notebooks
│   └── sqlscript/             # 2 SQL exploration scripts
└── mybigdata/
    ├── linkedService/         # 2 workspace default services
    └── integrationRuntime/    # 1 workspace default runtime

Major Folders

`/artifacts/pipeline/`

Contains 4 pipeline definitions for FHIR data processing:

File Path: artifacts/pipeline/FHIR_Pipeline4Observation_Spark_OC.json

File Path: artifacts/pipeline/FHIR_Pipeline4Patient_DataFlow_OC.json

File Path: artifacts/pipeline/FHIR_Pipeline4Claim_Spark_OC.json

File Path: artifacts/pipeline/Copy_Data_Source_To_Raw_PL.json

`/artifacts/dataset/`

Contains 20 dataset definitions:

Example File Paths:

`/artifacts/linkedService/`

Contains 3 linked service connections:

File Path: artifacts/linkedService/StorageLS.json (ADLS Gen2)

File Path: artifacts/linkedService/SynapseDedicatedPoolLS.json (SQL Pool)

File Path: artifacts/linkedService/Source_Dataset_LS.json (Source System)

`/artifacts/notebook/`

Contains Synapse Spark notebooks for data transformation:

File Path: artifacts/notebook/ClaimParquetFlatten_Large.json

Evidence: References table:functions and external sources curated@\ and processed@\ (Scanner: lineage_tracer)

`/artifacts/sqlscript/`

Contains SQL scripts for data exploration:

File Path: artifacts/sqlscript/JSON_exploration_w_Serverless_Demo_OC.json

File Path: artifacts/sqlscript/Spark DB Exploration Scripts.json

Evidence: Line 9 in JSON_exploration_w_Serverless_Demo_OC.json contains raw SQL with OPENROWSET function for querying ADLS Gen2 directly

`/mybigdata/`

Contains workspace-level default configurations:

File Path: mybigdata/linkedService/mybigdatademows-WorkspaceDefaultStorage.json

File Path: mybigdata/linkedService/mybigdatademows-WorkspaceDefaultSqlServer.json

File Naming Conventions

Pipelines

Datasets

Notebooks

---

4. Pipelines

Pipeline Inventory

4.1. FHIR_Pipeline4Observation_Spark_OC

File Path: artifacts/pipeline/FHIR_Pipeline4Observation_Spark_OC.json

Purpose: Processes FHIR Observation resources (vital signs, lab results, clinical measurements)

Activities (4 total):

1. NDJSON_Ingestion_Observation (SynapseNotebook)

2. ObservationParquetFlatten_Large (SynapseNotebook)

3. Create Tables (Script)

4. Observation_Parquet_large2SQL (Copy)

---

4.2. FHIR_Pipeline4Patient_DataFlow_OC

File Path: artifacts/pipeline/FHIR_Pipeline4Patient_DataFlow_OC.json

Purpose: Processes FHIR Patient resources (demographics, identifiers, addresses)

Activities (5 total):

1. NDJSON_Ingestion_Patient (SynapseNotebook)

2. PatientParquet2Sink (ExecuteDataFlow)

3. Create Tables (Script)

4. PatientAddress_large2SQL (Copy)

5. PatientIdentifier_large2SQL (Copy)

---

4.3. FHIR_Pipeline4Claim_Spark_OC

File Path: artifacts/pipeline/FHIR_Pipeline4Claim_Spark_OC.json

Purpose: Processes FHIR Claim resources (insurance claims, diagnoses, procedures)

Activities (7 total):

1. NDJSON_Ingestion_Claim (SynapseNotebook)

2. ClaimParquetFlatten_Large (SynapseNotebook)

3. Create Tables (Script)

4. ClaimDiagnosis2SQL (Copy)

5. ClaimInsurance2SQL (Copy)

6. ClaimProcedure2SQL (Copy)

7. LakeDatabase And Table Creation (SynapseNotebook)

---

4.4. Copy_Data_Source_To_Raw_PL

File Path: artifacts/pipeline/Copy_Data_Source_To_Raw_PL.json

Purpose: Initial data preparation pipeline that creates the medallion architecture containers

Activities (3 total):

1. Copy Source Data To Raw Container (Copy)

2. Create Curated Container (Copy)

3. Create Processed Container (Copy)

---

Pipeline Dependency Chains

```mermaid
graph TD
    subgraph "Observation Pipeline"
        O1[NDJSON_Ingestion_Observation] --> O2[ObservationParquetFlatten_Large]
        O2 --> O3[Create Tables]
        O3 --> O4[Observation_Parquet_large2SQL]
    end
    subgraph "Patient Pipeline"
        P1[NDJSON_Ingestion_Patient] --> P2[PatientParquet2Sink]
        P2 --> P3[Create Tables]
        P3 --> P4[PatientAddress_large2SQL]
        P3 --> P5[PatientIdentifier_large2SQL]
    end
    subgraph "Claim Pipeline"
        C1[NDJSON_Ingestion_Claim] --> C2[ClaimParquetFlatten_Large]
        C2 --> C3[Create Tables]
        C2 --> C7[LakeDatabase And Table Creation]
        C3 --> C4[ClaimDiagnosis2SQL]
        C3 --> C5[ClaimInsurance2SQL]
        C3 --> C6[ClaimProcedure2SQL]
    end
    subgraph "Data Prep Pipeline"
        D1[Copy Source Data To Raw Container] --> D2[Create Curated Container]
        D1 --> D3[Create Processed Container]
    end
```

Evidence: Dependency chains extracted from pipeline JSON files by lineage_tracer scanner

---

How to Run Pipelines

Manual Execution via Synapse Studio

1. Navigate to Integrate → Pipelines

2. Select the desired pipeline (e.g., FHIR_Pipeline4Patient_DataFlow_OC)

3. Click Add trigger → Trigger now

4. Monitor execution in Monitor → Pipeline runs

For initial data load, execute pipelines in this sequence:

1. Copy_Data_Source_To_Raw_PL (creates container structure)

2. FHIR_Pipeline4Patient_DataFlow_OC (Patient is foundational)

3. FHIR_Pipeline4Observation_Spark_OC (Observations reference Patients)

4. FHIR_Pipeline4Claim_Spark_OC (Claims reference Patients)
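Beyond Synapse Studio, pipeline runs can be triggered programmatically against the Synapse REST `createRun` endpoint. The sketch below builds and issues such a request; the API version shown and the token acquisition are assumptions to verify against your environment (a real caller would obtain a bearer token via Azure AD):

```python
import json
import urllib.request

API_VERSION = "2020-12-01"  # assumption: confirm against current Synapse REST API docs


def create_run_url(workspace: str, pipeline: str) -> str:
    """Build the Synapse pipeline createRun endpoint URL."""
    return (f"https://{workspace}.dev.azuresynapse.net"
            f"/pipelines/{pipeline}/createRun?api-version={API_VERSION}")


def trigger_pipeline(workspace: str, pipeline: str, token: str) -> str:
    """POST a run request and return the runId reported by the service."""
    req = urllib.request.Request(
        create_run_url(workspace, pipeline),
        method="POST",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        data=b"{}",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["runId"]
```

For the initial load, `trigger_pipeline` would be called once per pipeline in the sequence above, waiting for each run to succeed before starting the next.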

Monitoring Pipelines

---

5. Notebooks

Notebook Inventory

5.1. NDJSON_Ingestion_Observation

Purpose: Ingests raw NDJSON Observation files from source system

Language: Inferred to be PySpark (Synapse Notebook)

Execution Context: Called by FHIR_Pipeline4Observation_Spark_OC pipeline

Dependencies: None (entry point)

Expected Operations:
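NDJSON stores one complete JSON resource per line, so ingestion reduces to parsing each line independently. The notebook contents are not included in the scan, so the pure-Python sketch below (field names and sample data are hypothetical) only illustrates the parsing step; in PySpark, `spark.read.json(path)` handles NDJSON input natively:

```python
import json


def parse_ndjson(text: str, resource_type: str) -> list[dict]:
    """Parse NDJSON text, keeping only resources of the given FHIR type."""
    resources = []
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the feed
        resource = json.loads(line)
        if resource.get("resourceType") == resource_type:
            resources.append(resource)
    return resources


# Illustrative two-line NDJSON payload
sample = (
    '{"resourceType": "Observation", "id": "obs-1", "status": "final"}\n'
    '{"resourceType": "Patient", "id": "pat-1"}\n'
)
observations = parse_ndjson(sample, "Observation")
```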

---

5.2. ObservationParquetFlatten_Large

Purpose: Flattens nested Observation JSON structures into columnar Parquet format

Language: Inferred to be PySpark

Execution Context: Called by FHIR_Pipeline4Observation_Spark_OC pipeline

Dependencies: Requires NDJSON_Ingestion_Observation to complete first

Expected Operations:
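Flattening turns nested FHIR structures into dotted column names suitable for Parquet. The actual notebook is not included in the scan; this pure-Python sketch (the sample Observation fields are hypothetical) shows the idea, with repeating arrays left for separate explode steps:

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted column names; lists are left as-is."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{name}."))
        else:
            flat[name] = value
    return flat


obs = {
    "resourceType": "Observation",
    "id": "obs-1",
    "code": {"text": "Heart rate"},
    "valueQuantity": {"value": 72, "unit": "beats/minute"},
}
row = flatten(obs)  # e.g. produces a "valueQuantity.value" column
```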

---

5.3. NDJSON_Ingestion_Patient

Purpose: Ingests raw NDJSON Patient files from source system

Language: Inferred to be PySpark

Execution Context: Called by FHIR_Pipeline4Patient_DataFlow_OC pipeline

Dependencies: None (entry point)

Expected Operations:

---

5.4. NDJSON_Ingestion_Claim

Purpose: Ingests raw NDJSON Claim files from source system

Language: Inferred to be PySpark

Execution Context: Called by FHIR_Pipeline4Claim_Spark_OC pipeline

Dependencies: None (entry point)

Expected Operations:

---

5.5. ClaimParquetFlatten_Large

File Path: artifacts/notebook/ClaimParquetFlatten_Large.json

Purpose: Flattens nested Claim JSON structures and creates Lake Database tables

Language: Inferred to be PySpark

Execution Context: Called by FHIR_Pipeline4Claim_Spark_OC pipeline

Dependencies: Requires NDJSON_Ingestion_Claim to complete first

Data Sources (from lineage_tracer):

Evidence:

table:functions --[reads_from]--> file:artifacts/notebook/ClaimParquetFlatten_Large.json
ext:curated@\ --[reads_from]--> file:artifacts/notebook/ClaimParquetFlatten_Large.json
ext:processed@\ --[reads_from]--> file:artifacts/notebook/ClaimParquetFlatten_Large.json

Expected Operations:
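Claim resources carry repeating arrays (diagnosis, insurance, procedure), and the flatten step emits one child row per array element, keyed back to the claim. The notebook source is not in the scan, so the pure-Python sketch below is an illustrative stand-in (field names are hypothetical); in the actual PySpark notebook this likely corresponds to the `explode` function from `pyspark.sql.functions`, consistent with the repeated `table:functions` references:

```python
def explode(claim: dict, array_field: str) -> list[dict]:
    """Produce one row per element of a repeating array, keyed by claim id."""
    rows = []
    for index, item in enumerate(claim.get(array_field, [])):
        row = {"claim_id": claim["id"], "sequence": index + 1}
        row.update(item)
        rows.append(row)
    return rows


claim = {
    "resourceType": "Claim",
    "id": "claim-1",
    "diagnosis": [
        {"diagnosisReference": "Condition/c1"},
        {"diagnosisReference": "Condition/c2"},
    ],
}
diagnosis_rows = explode(claim, "diagnosis")
```

Running the same explode per array field yields the separate diagnosis, insurance, and procedure Parquet outputs that the Copy activities then load to SQL.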

---

5.6. LakeDatabase And Table Creation

Purpose: Creates Spark SQL Lake Database tables for Claim data

Language: Inferred to be PySpark

Execution Context: Called by FHIR_Pipeline4Claim_Spark_OC pipeline

Dependencies: Requires ClaimParquetFlatten_Large to complete first

Expected Operations:

---

Notebook Execution Order

Observation Processing Flow

NDJSON_Ingestion_Observation → ObservationParquetFlatten_Large

Patient Processing Flow

NDJSON_Ingestion_Patient → (PatientParquet2Sink DataFlow)

Claim Processing Flow

NDJSON_Ingestion_Claim → ClaimParquetFlatten_Large → LakeDatabase And Table Creation

Languages and Key Libraries

Primary Language: PySpark (Python with Spark)

Expected Libraries (inferred from FHIR processing context):

Evidence: table:functions is referenced 7 times in ClaimParquetFlatten_Large.json, indicating heavy use of Spark SQL functions

---

6. Data Assets

Dataset Inventory

Parquet Datasets (13 total)

All Parquet datasets use the StorageLS linked service (ADLS Gen2).

| Dataset Name | File Path | Purpose |
| --- | --- | --- |
| ObservationMain_LargeParquet | artifacts/dataset/ObservationMain_LargeParquet.json | Flattened Observation data |
| PatientAddressParquetLarge | artifacts/dataset/PatientAddressParquetLarge.json | Patient address records |
| PatientIdentifierParquetLarge | artifacts/dataset/PatientIdentifierParquetLarge.json | Patient identifier records |
| PatientRawParquetLarge | artifacts/dataset/PatientRawParquetLarge.json | Raw Patient data |
| PatientExtensionParquetLarge | artifacts/dataset/PatientExtensionParquetLarge.json | Patient extension attributes |
| ClaimDiagnosisParquetLarge | artifacts/dataset/ClaimDiagnosisParquetLarge.json | Claim diagnosis records |
| ClaimInsuranceParquetLarge | artifacts/dataset/ClaimInsuranceParquetLarge.json | Claim insurance records |
| ClaimProcedureParquetLarge | artifacts/dataset/ClaimProcedureParquetLarge.json | Claim procedure records |
| Sink_DataPrep_DS | artifacts/dataset/Sink_DataPrep_DS.json | Raw container sink |
| Sink_DataPrep_Curated_DS | artifacts/dataset/Sink_DataPrep_Curated_DS.json | Curated container sink |
| Sink_DataPrep_Processed_DS | artifacts/dataset/Sink_DataPrep_Processed_DS.json | Processed container sink |
| Source_DataPrep_DS | artifacts/dataset/Source_DataPrep_DS.json | Source data for prep |
| Source_DataPrep_Curated_DS | artifacts/dataset/Source_DataPrep_Curated_DS.json | Source curated data |
| Source_DataPrep_Processed_DS | artifacts/dataset/Source_DataPrep_Processed_DS.json | Source processed data |

Evidence: All Parquet datasets show linked_service=StorageLS in lineage data

---

SQL Datasets (7 total)

All SQL datasets use the SynapseDedicatedPoolLS linked service.

| Dataset Name | File Path | Purpose |
| --- | --- | --- |
| Observation_SQLDS | artifacts/dataset/Observation_SQLDS.json | Observation SQL table |
| PatientAddressSQL | artifacts/dataset/PatientAddressSQL.json | Patient address SQL table |
| PatientIdentifierSQLLarge | artifacts/dataset/PatientIdentifierSQLLarge.json | Patient identifier SQL table |
| ClaimDiagnosisSQL | artifacts/dataset/ClaimDiagnosisSQL.json | Claim diagnosis SQL table |
| ClaimInsurance | artifacts/dataset/ClaimInsurance.json | Claim insurance SQL table |
| ClaimProcedureSQL | artifacts/dataset/ClaimProcedureSQL.json | Claim procedure SQL table |

Evidence: All SQL datasets show linked_service=SynapseDedicatedPoolLS in lineage data

---

Linked Services

StorageLS

File Path: artifacts/linkedService/StorageLS.json

Type: Azure Data Lake Storage Gen2

Purpose: Primary data lake for Parquet file storage

Used By: 13 Parquet datasets

---

SynapseDedicatedPoolLS

File Path: artifacts/linkedService/SynapseDedicatedPoolLS.json

Type: Azure Synapse Dedicated SQL Pool

Purpose: Data warehouse for analytics-ready tables

Used By: 7 SQL datasets + 3 "Create Tables" script activities

Security Finding:

---

Source_Dataset_LS

File Path: artifacts/linkedService/Source_Dataset_LS.json

Type: Unknown (not specified in scanner data)

Purpose: Source system connection for FHIR data ingestion

Used By: 3 source datasets (Source_DataPrep_DS, Source_DataPrep_Curated_DS, Source_DataPrep_Processed_DS)

---

Data Flow: Sources → Transformations → Sinks

```mermaid
graph LR
    subgraph "Source"
        SRC[Source_Dataset_LS<br/>FHIR NDJSON Files]
    end
    subgraph "Raw Layer"
        RAW[StorageLS<br/>Raw Container<br/>NDJSON Format]
    end
    subgraph "Processed Layer"
        PARQUET[StorageLS<br/>Processed Container<br/>Parquet Format]
    end
    subgraph "Curated Layer"
        SQL[SynapseDedicatedPoolLS<br/>SQL Tables]
    end
    SRC -->|Copy Pipeline| RAW
    RAW -->|Spark Notebooks| PARQUET
    PARQUET -->|Copy Activities| SQL
    style SRC fill:#e1f5ff
    style RAW fill:#fff4e1
    style PARQUET fill:#e8f5e9
    style SQL fill:#f3e5f5
```

Evidence: Data flow derived from pipeline activity dependencies and dataset lineage edges

---

Observation Data Flow

```mermaid
graph TD
    A["Source: NDJSON Files"] -->|NDJSON_Ingestion_Observation| B["Raw: NDJSON"]
    B -->|ObservationParquetFlatten_Large| C[ObservationMain_LargeParquet]
    C -->|Observation_Parquet_large2SQL| D[Observation_SQLDS]
    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#e8f5e9
    style D fill:#f3e5f5
```

File Path: artifacts/pipeline/FHIR_Pipeline4Observation_Spark_OC.json

Evidence: Lineage edges show ObservationMain_LargeParquet --[reads_from]--> Observation_Parquet_large2SQL and Observation_Parquet_large2SQL --[writes_to]--> Observation_SQLDS

---

Patient Data Flow

```mermaid
graph TD
    A["Source: NDJSON Files"] -->|NDJSON_Ingestion_Patient| B["Raw: NDJSON"]
    B -->|PatientParquet2Sink DataFlow| C1[PatientAddressParquetLarge]
    B -->|PatientParquet2Sink DataFlow| C2[PatientIdentifierParquetLarge]
    B -->|PatientParquet2Sink DataFlow| C3[PatientExtensionParquetLarge]
    C1 -->|PatientAddress_large2SQL| D1[PatientAddressSQL]
    C2 -->|PatientIdentifier_large2SQL| D2[PatientIdentifierSQLLarge]
    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C1 fill:#e8f5e9
    style C2 fill:#e8f5e9
    style C3 fill:#e8f5e9
    style D1 fill:#f3e5f5
    style D2 fill:#f3e5f5
```

File Path: artifacts/pipeline/FHIR_Pipeline4Patient_DataFlow_OC.json

Evidence: Lineage edges show Patient data split into multiple Parquet datasets, then loaded to SQL

---

Claim Data Flow