Generated by CodePrizm

Data Lineage Report

Executive Summary

Metric          Value
Overall Score   95/100
Total Findings  9
Critical        0
High            0
Medium          9
Low             0
Info            0

MEDIUM Findings

1. Orphaned Dataset: PatientRawParquetLarge

Evidence:

ds:PatientRawParquetLarge --[uses_service]--> ls:StorageLS

Description: Dataset PatientRawParquetLarge exists in the environment, but no pipeline or activity reads from it or writes to it. The dataset is connected to StorageLS but appears in no pipeline inputs or outputs.

Explanation: The lineage graph shows PatientRawParquetLarge connected only via uses_service to StorageLS, with no reads_from or writes_to edges from any activity. The Patient-related parquet datasets PatientAddressParquetLarge and PatientIdentifierParquetLarge are consumed by pipelines (findings 5 and 6), but PatientRawParquetLarge has neither a consumer nor a producer. PatientExtensionParquetLarge is likewise orphaned (finding 2).

Remediation: Identify the intended producer pipeline for PatientRawParquetLarge, or remove the dataset definition if it is no longer needed. Verify whether this dataset should be written by FHIR_Pipeline4Patient_DataFlow_OC or another pipeline.

2. Orphaned Dataset: PatientExtensionParquetLarge

Evidence:

ds:PatientExtensionParquetLarge --[uses_service]--> ls:StorageLS

Description: Dataset PatientExtensionParquetLarge exists in the environment but has no pipeline or activity that writes to it or reads from it. The dataset is connected to StorageLS but appears in no pipeline inputs or outputs.

Explanation: The lineage graph shows PatientExtensionParquetLarge connected only via uses_service to StorageLS, with no reads_from or writes_to edges from any activity. This dataset appears completely disconnected from the data processing workflows.

Remediation: Identify the intended producer and consumer pipelines for PatientExtensionParquetLarge, or remove the dataset definition if it is no longer needed. Verify whether this dataset should be part of the Patient data processing workflow.
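
The orphan check behind findings 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not CodePrizm's actual implementation: the edge tuples are copied from the Evidence lines of findings 1, 2 and 5, while the tuple layout and the function name find_orphaned_datasets are assumptions made for the example.

```python
from collections import defaultdict

# (source, edge kind, target) tuples copied from the Evidence lines of
# findings 1, 2 and 5. The tuple layout itself is an assumption.
EDGES = [
    ("ds:PatientRawParquetLarge", "uses_service", "ls:StorageLS"),
    ("ds:PatientExtensionParquetLarge", "uses_service", "ls:StorageLS"),
    ("ds:PatientAddressParquetLarge", "reads_from",
     "act:FHIR_Pipeline4Patient_DataFlow_OC:PatientAddress_large2SQL"),
]

# Edge kinds that tie a dataset to an activity.
DATA_EDGE_KINDS = {"reads_from", "writes_to"}


def find_orphaned_datasets(edges):
    """Return dataset nodes that take part in no reads_from or writes_to edge."""
    kinds_by_dataset = defaultdict(set)
    for source, kind, target in edges:
        for node in (source, target):
            if node.startswith("ds:"):
                kinds_by_dataset[node].add(kind)
    return sorted(
        dataset
        for dataset, kinds in kinds_by_dataset.items()
        if not kinds & DATA_EDGE_KINDS
    )


orphans = find_orphaned_datasets(EDGES)
print(orphans)
# ['ds:PatientExtensionParquetLarge', 'ds:PatientRawParquetLarge']
```

A dataset that only carries a uses_service edge is reported; PatientAddressParquetLarge is not, because it has a reads_from edge.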

3. Undocumented Data Flow: LakeDatabase And Table Creation

Evidence:

Activity: LakeDatabase And Table Creation (SynapseNotebook) | deps=[ClaimParquetFlatten_Large] in=[none] out=[none]

Description: Activity LakeDatabase And Table Creation in pipeline FHIR_Pipeline4Claim_Spark_OC has no declared input or output datasets. The notebook likely creates lake database tables, but these outputs are not tracked in the lineage graph.

Explanation: The activity shows in=[none] out=[none], indicating that the lake database and table creation operations are not formally tracked in the pipeline lineage. This creates a governance gap for the lake database artifacts.

Remediation: Update the pipeline definition to explicitly declare output datasets or tables for the LakeDatabase And Table Creation activity. This will enable proper lineage tracking for the lake database artifacts.
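
To find other activities with the same gap, the Evidence line format can be parsed and checked for empty inputs and outputs. The sketch below assumes the format shown in the Evidence for this finding; the regular expression, the helper names, and the treatment of commas as list separators are assumptions, since the report only shows single-item lists.

```python
import re

# Evidence line copied from finding 3.
EVIDENCE = (
    "Activity: LakeDatabase And Table Creation (SynapseNotebook) "
    "| deps=[ClaimParquetFlatten_Large] in=[none] out=[none]"
)

# Assumed grammar: "Activity: <name> (<type>) | deps=[..] in=[..] out=[..]".
PATTERN = re.compile(
    r"^Activity: (?P<name>.+) \((?P<type>\w+)\) \| "
    r"deps=\[(?P<deps>[^\]]*)\] "
    r"in=\[(?P<inputs>[^\]]*)\] "
    r"out=\[(?P<outputs>[^\]]*)\]$"
)


def parse_list(raw):
    """Split a bracketed list; the literal 'none' means an empty list.

    Comma-separated items are an assumption about the report format.
    """
    items = [item.strip() for item in raw.split(",") if item.strip()]
    return [] if items == ["none"] else items


def parse_activity(line):
    """Parse one Evidence line into a dictionary of its fields."""
    match = PATTERN.match(line)
    if match is None:
        raise ValueError(f"unrecognized evidence line: {line!r}")
    return {
        "name": match["name"],
        "type": match["type"],
        "deps": parse_list(match["deps"]),
        "inputs": parse_list(match["inputs"]),
        "outputs": parse_list(match["outputs"]),
    }


def is_untracked(activity):
    """True when the activity declares neither inputs nor outputs."""
    return not activity["inputs"] and not activity["outputs"]


activity = parse_activity(EVIDENCE)
print(activity["name"], is_untracked(activity))
```

An activity flagged this way still runs after its dependencies (here ClaimParquetFlatten_Large), but whatever it reads or writes is invisible to the lineage graph.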

4. Missing Upstream Source: ObservationMain_LargeParquet

Evidence:

ds:ObservationMain_LargeParquet --[reads_from]--> act:FHIR_Pipeline4Observation_Spark_OC:Observation_Parquet_large2SQL

Description: Dataset ObservationMain_LargeParquet is consumed by activity Observation_Parquet_large2SQL but has no explicit producer activity in the lineage graph. The dataset is likely produced by the ObservationParquetFlatten_Large notebook, but this relationship is not formally declared.

Explanation: The lineage graph shows ObservationMain_LargeParquet being read by a Copy activity, but there is no writes_to edge from any activity to this dataset. Naming patterns and pipeline structure suggest that the implicit producer is the ObservationParquetFlatten_Large notebook.

Remediation: Update the FHIR_Pipeline4Observation_Spark_OC pipeline to explicitly declare ObservationMain_LargeParquet as an output of the ObservationParquetFlatten_Large activity. This will establish formal lineage between the producer and consumer.

5. Missing Upstream Source: PatientAddressParquetLarge

Evidence:

ds:PatientAddressParquetLarge --[reads_from]--> act:FHIR_Pipeline4Patient_DataFlow_OC:PatientAddress_large2SQL

Description: Dataset PatientAddressParquetLarge is consumed by activity PatientAddress_large2SQL but has no explicit producer activity in the lineage graph. The dataset is likely produced by the PatientParquet2Sink data flow, but this relationship is not formally declared.

Explanation: The lineage graph shows PatientAddressParquetLarge being read by a Copy activity, but there is no writes_to edge from any activity to this dataset. Naming patterns and pipeline structure suggest that the implicit producer is the PatientParquet2Sink data flow.

Remediation: Update the FHIR_Pipeline4Patient_DataFlow_OC pipeline to explicitly declare PatientAddressParquetLarge as an output of the PatientParquet2Sink activity. This will establish formal lineage between the producer and consumer.

6. Missing Upstream Source: PatientIdentifierParquetLarge

Evidence:

ds:PatientIdentifierParquetLarge --[reads_from]--> act:FHIR_Pipeline4Patient_DataFlow_OC:PatientIdentifier_large2SQL

Description: Dataset PatientIdentifierParquetLarge is consumed by activity PatientIdentifier_large2SQL but has no explicit producer activity in the lineage graph. The dataset is likely produced by the PatientParquet2Sink data flow, but this relationship is not formally declared.

Explanation: The lineage graph shows PatientIdentifierParquetLarge being read by a Copy activity, but there is no writes_to edge from any activity to this dataset. Naming patterns and pipeline structure suggest that the implicit producer is the PatientParquet2Sink data flow.

Remediation: Update the FHIR_Pipeline4Patient_DataFlow_OC pipeline to explicitly declare PatientIdentifierParquetLarge as an output of the PatientParquet2Sink activity. This will establish formal lineage between the producer and consumer.

7. Missing Upstream Source: ClaimDiagnosisParquetLarge

Evidence:

ds:ClaimDiagnosisParquetLarge --[reads_from]--> act:FHIR_Pipeline4Claim_Spark_OC:ClaimDiagnosis2SQL

Description: Dataset ClaimDiagnosisParquetLarge is consumed by activity ClaimDiagnosis2SQL but has no explicit producer activity in the lineage graph. The dataset is likely produced by the ClaimParquetFlatten_Large notebook, but this relationship is not formally declared.

Explanation: The lineage graph shows ClaimDiagnosisParquetLarge being read by a Copy activity, but there is no writes_to edge from any activity to this dataset. Naming patterns and pipeline structure suggest that the implicit producer is the ClaimParquetFlatten_Large notebook.

Remediation: Update the FHIR_Pipeline4Claim_Spark_OC pipeline to explicitly declare ClaimDiagnosisParquetLarge as an output of the ClaimParquetFlatten_Large activity. This will establish formal lineage between the producer and consumer.

8. Missing Upstream Source: ClaimInsuranceParquetLarge

Evidence:

ds:ClaimInsuranceParquetLarge --[reads_from]--> act:FHIR_Pipeline4Claim_Spark_OC:ClaimInsurance2SQL

Description: Dataset ClaimInsuranceParquetLarge is consumed by activity ClaimInsurance2SQL but has no explicit producer activity in the lineage graph. The dataset is likely produced by the ClaimParquetFlatten_Large notebook, but this relationship is not formally declared.

Explanation: The lineage graph shows ClaimInsuranceParquetLarge being read by a Copy activity, but there is no writes_to edge from any activity to this dataset. Naming patterns and pipeline structure suggest that the implicit producer is the ClaimParquetFlatten_Large notebook.

Remediation: Update the FHIR_Pipeline4Claim_Spark_OC pipeline to explicitly declare ClaimInsuranceParquetLarge as an output of the ClaimParquetFlatten_Large activity. This will establish formal lineage between the producer and consumer.

9. Missing Upstream Source: ClaimProcedureParquetLarge

Evidence:

ds:ClaimProcedureParquetLarge --[reads_from]--> act:FHIR_Pipeline4Claim_Spark_OC:ClaimProcedure2SQL

Description: Dataset ClaimProcedureParquetLarge is consumed by activity ClaimProcedure2SQL but has no explicit producer activity in the lineage graph. The dataset is likely produced by the ClaimParquetFlatten_Large notebook, but this relationship is not formally declared.

Explanation: The lineage graph shows ClaimProcedureParquetLarge being read by a Copy activity, but there is no writes_to edge from any activity to this dataset. Naming patterns and pipeline structure suggest that the implicit producer is the ClaimParquetFlatten_Large notebook.

Remediation: Update the FHIR_Pipeline4Claim_Spark_OC pipeline to explicitly declare ClaimProcedureParquetLarge as an output of the ClaimParquetFlatten_Large activity. This will establish formal lineage between the producer and consumer.
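
Findings 4 through 9 share one pattern: a dataset is read by an activity, and the suspected producer sits in the same pipeline. The Python sketch below groups the outputs that each suspected producer would need to declare and checks that declaring them leaves no consumed dataset without a producer. Dataset, pipeline and activity names are taken from the findings; the data layout and function names are assumptions, and the producer mapping reflects the report's inference rather than a confirmed fact.

```python
from collections import defaultdict

# (dataset, pipeline, consumer activity) from the Evidence lines of findings 4-9.
READS = [
    ("ObservationMain_LargeParquet", "FHIR_Pipeline4Observation_Spark_OC",
     "Observation_Parquet_large2SQL"),
    ("PatientAddressParquetLarge", "FHIR_Pipeline4Patient_DataFlow_OC",
     "PatientAddress_large2SQL"),
    ("PatientIdentifierParquetLarge", "FHIR_Pipeline4Patient_DataFlow_OC",
     "PatientIdentifier_large2SQL"),
    ("ClaimDiagnosisParquetLarge", "FHIR_Pipeline4Claim_Spark_OC",
     "ClaimDiagnosis2SQL"),
    ("ClaimInsuranceParquetLarge", "FHIR_Pipeline4Claim_Spark_OC",
     "ClaimInsurance2SQL"),
    ("ClaimProcedureParquetLarge", "FHIR_Pipeline4Claim_Spark_OC",
     "ClaimProcedure2SQL"),
]

# Suspected producer per pipeline, as named in each finding's Description.
# These are inferred from naming patterns and should be verified first.
SUSPECTED_PRODUCER = {
    "FHIR_Pipeline4Observation_Spark_OC": "ObservationParquetFlatten_Large",
    "FHIR_Pipeline4Patient_DataFlow_OC": "PatientParquet2Sink",
    "FHIR_Pipeline4Claim_Spark_OC": "ClaimParquetFlatten_Large",
}


def missing_producers(reads, writes):
    """Datasets that are read somewhere but written by no activity."""
    produced = {dataset for dataset, _pipeline, _activity in writes}
    return sorted({d for d, _pipeline, _activity in reads if d not in produced})


def proposed_writes(reads, producers):
    """One proposed writes_to entry per consumed dataset."""
    return [(d, pipeline, producers[pipeline]) for d, pipeline, _ in reads]


def outputs_to_declare(writes):
    """Group proposed outputs by (pipeline, producer activity)."""
    grouped = defaultdict(list)
    for dataset, pipeline, activity in writes:
        grouped[(pipeline, activity)].append(dataset)
    return dict(grouped)


before = missing_producers(READS, [])
writes = proposed_writes(READS, SUSPECTED_PRODUCER)
after = missing_producers(READS, writes)
plan = outputs_to_declare(writes)

for (pipeline, activity), datasets in plan.items():
    print(f"{pipeline} / {activity}: {', '.join(datasets)}")
```

Grouping by producer shows that three pipeline definitions need updating rather than six separate changes.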

Remediation Roadmap

Long-term (1-3 months)