CodePrizm Data Platform Analysis - User Guide
1. Introduction
What CodePrizm Analyzed
CodePrizm performed a comprehensive analysis of your data platform infrastructure, examining 35 configuration files across your repository. The analysis focused on:
- Data pipelines (4 pipeline definitions)
- Datasets (20 dataset configurations)
- Linked services (8 service connections)
- Standards and policies (3 standard definitions)
Why This Analysis Matters
Modern data platforms involve complex interactions between data sources, transformation pipelines, storage systems, and analytics tools. This analysis helps you:
- Understand data flow from source systems to dashboards
- Identify security risks in configurations and connections
- Improve pipeline reliability through health checks
- Maintain compliance with data governance standards
Overview of Deliverables
CodePrizm generated four reports that together provide a complete picture of your data platform's architecture, security posture, operational health, and usage guidance.
---
2. Deliverable Overview
2.1 Lineage Report (`lineage_report.md`)
What it contains:
- Visual diagrams showing how data flows through your platform
- Source-to-destination mappings for all 4 pipelines
- Dataset dependencies and transformation chains
- End-to-end data journey documentation
Primary audience:
- Data engineers understanding pipeline architecture
- Data architects reviewing system design
- Business analysts tracing data origins
Actions enabled:
- Impact analysis: "If I change this source, what breaks?"
- Root cause analysis: "Where does this dashboard data come from?"
- Compliance documentation: "Can we prove data lineage for audit?"
2.2 Platform Security Report (`platform_security.md`)
What it contains:
- Security scorecard (0-100 scale) for your platform
- Detailed findings for each security issue detected
- Risk severity ratings (Critical, High, Medium, Low)
- Remediation guidance with code examples
Primary audience:
- Platform managers responsible for security posture
- DevOps engineers implementing fixes
- Compliance officers reviewing controls
Actions enabled:
- Prioritize security remediation work
- Demonstrate compliance to auditors
- Track security improvements over time
2.3 Pipeline Health Report (`pipeline_health.md`)
What it contains:
- Operational health assessment for all 4 pipelines
- Performance bottlenecks and reliability issues
- Configuration problems affecting stability
- Best practice recommendations
Primary audience:
- Data engineers maintaining pipelines
- SRE teams monitoring platform reliability
- Operations managers tracking SLAs
Actions enabled:
- Prevent pipeline failures before they occur
- Optimize slow-running transformations
- Improve monitoring and alerting
2.4 Repository User Guide (`repository_userguide.md`)
What it contains:
- Documentation of your repository structure
- How to navigate the 35 configuration files
- Naming conventions and organizational patterns
- Onboarding guidance for new team members
Primary audience:
- New data engineers joining the team
- Developers contributing to the platform
- Technical leads establishing standards
Actions enabled:
- Faster onboarding of new team members
- Consistent configuration management
- Easier code reviews and collaboration
---
3. Reading Lineage Diagrams
3.1 Diagram Format
Lineage diagrams use Mermaid syntax, a text-based diagramming language that renders as visual flowcharts. You can view these diagrams in:
- GitHub/GitLab (native rendering)
- VS Code (with Mermaid extension)
- Online viewers (mermaid.live)
- Documentation platforms (Confluence, Notion)
3.2 Node Shapes and Meanings
Node shape reference:
| Shape | Syntax | Meaning | Example |
|---|---|---|---|
| Rectangle | `[Name]` | External source system | `[SAP ERP]` |
| Cylinder | `[(Name)]` | Database or data store | `[(SQL Server)]` |
| Diamond | `{Name}` | Pipeline or transformation | `{ETL_Pipeline}` |
| Parallelogram | `[/Name/]` | Output or consumption point | `[/Power BI Dashboard/]` |
3.3 Arrow Types
- Solid arrow (`-->`): Direct data flow
- Dotted arrow (`-.->`): Indirect or scheduled dependency
- Thick arrow (`==>`): High-volume data transfer
3.4 How to Trace Data Flow
Example: Finding the source of a dashboard metric
1. Start at the consumption point (dashboard/report)
2. Follow arrows backward through datasets
3. Identify transformation pipelines that process the data
4. Trace to source systems where data originates
Practical use case:
Dashboard "Sales Report" → Dataset "sales_summary" →
Pipeline "daily_sales_etl" → Dataset "raw_sales" →
Source "CRM Database"
This tells you that to fix data quality issues in the Sales Report, you need to investigate the CRM Database or the daily_sales_etl pipeline.
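This backward walk can be sketched as a simple graph traversal. The adjacency map below mirrors the example chain above; the node names are illustrative, not taken from your actual lineage report:

```python
# Sketch of the backward trace; node names mirror the example chain above
# and are illustrative, not taken from your actual lineage report.
upstream = {
    "Sales Report": ["sales_summary"],
    "sales_summary": ["daily_sales_etl"],
    "daily_sales_etl": ["raw_sales"],
    "raw_sales": ["CRM Database"],
    "CRM Database": [],
}

def trace_sources(node, graph):
    """Follow edges backward, returning every node on the path to the sources."""
    path, stack, seen = [], [node], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        path.append(current)
        stack.extend(graph.get(current, []))
    return path

print(trace_sources("Sales Report", upstream))
# ['Sales Report', 'sales_summary', 'daily_sales_etl', 'raw_sales', 'CRM Database']
```

With a real lineage graph extracted from the Mermaid diagrams, the same traversal answers both impact-analysis and root-cause questions.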
---
4. Understanding Risk Scores
4.1 Security Scorecard (0-100 Scale)
The platform security report includes an overall security score:
| Score Range | Rating | Interpretation |
|---|---|---|
| 90-100 | Excellent | Strong security posture, minimal risk |
| 75-89 | Good | Acceptable with minor improvements needed |
| 60-74 | Fair | Moderate risk, remediation recommended |
| 40-59 | Poor | Significant vulnerabilities present |
| 0-39 | Critical | Immediate action required |
How the score is calculated:
1. Each finding is assigned a severity (Critical, High, Medium, Low)
2. Severity levels have point deductions:
- Critical: -15 points each
- High: -8 points each
- Medium: -3 points each
- Low: -1 point each
3. Score = 100 - (sum of all deductions)
4. The score is floored at 0 (deductions cannot drive it negative)
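As a minimal sketch, the scoring rule above reduces to a few lines of Python:

```python
# Minimal sketch of the scoring rule, using the deduction values listed above.
DEDUCTIONS = {"Critical": 15, "High": 8, "Medium": 3, "Low": 1}

def security_score(findings):
    """findings is a list of severity strings, e.g. ["Critical", "High"]."""
    total_deduction = sum(DEDUCTIONS[sev] for sev in findings)
    return max(0, 100 - total_deduction)  # floor at 0

print(security_score(["Critical", "High", "High", "Medium"]))  # 100 - 34 = 66
```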
4.2 Severity Levels Explained
Critical Severity
Impact: Immediate risk of data breach, system compromise, or compliance violation
Examples:
- Hardcoded passwords in configuration files
- Publicly accessible databases with no authentication
- Unencrypted transmission of sensitive data
Response time: Fix within 24-48 hours
High Severity
Impact: Significant security weakness that could be exploited
Examples:
- Missing encryption at rest for sensitive datasets
- Overly permissive access controls
- Disabled audit logging
Response time: Fix within 1-2 weeks
Medium Severity
Impact: Security gap that increases risk but requires specific conditions to exploit
Examples:
- Weak password policies
- Missing network segmentation
- Outdated TLS versions
Response time: Fix within 1-2 months
Low Severity
Impact: Best practice deviation with minimal immediate risk
Examples:
- Missing security headers
- Verbose error messages
- Lack of rate limiting
Response time: Address in next maintenance cycle
4.3 Score Interpretation Example
Scenario: Your platform scores 66/100
Breakdown:
- 1 Critical finding: -15 points
- 2 High findings: -16 points
- 1 Medium finding: -3 points
- Total deductions: -34 points
- Final score: 100 - 34 = 66/100
Interpretation: Your platform is in the "Fair" range. The single Critical finding should be addressed immediately, followed by the two High-severity issues. Once these are resolved, your score would improve to 97/100 (Excellent).
---
5. Interpreting Findings
5.1 Finding Format
Each security or health finding follows a consistent structure:
#### 1. Hardcoded Connection String in Pipeline
**Severity:** Critical
**Location:** `pipeline/customer_etl.json` (line 45)
**Category:** Secrets Management
**Description:**
The pipeline configuration contains a hardcoded database connection
string with embedded credentials.
**Code Snippet:**
```json
"connectionString": "Server=prod-db.company.com;User=admin;Password=P@ssw0rd123"
```
**Risk:**
Anyone with repository access can view production database credentials.
**Recommended Fix:**
Replace hardcoded credentials with Azure Key Vault reference:
```json
"connectionString": "@Microsoft.KeyVault(SecretUri=https://vault.azure.net/secrets/db-conn)"
```
**Effort:** Low (1-2 hours)
**Priority:** Immediate
5.2 Understanding Each Component
| Component | Purpose | How to Use |
|---|---|---|
| Severity | Risk level | Determines urgency of fix |
| Location | File path and line number | Where to find the issue in your code |
| Category | Type of issue | Groups related findings |
| Description | What's wrong | Explains the problem in plain language |
| Code Snippet | Actual problematic code | Shows exact configuration causing issue |
| Risk | Potential impact | Why this matters to your business |
| Recommended Fix | Solution with example | Copy-paste starting point for remediation |
| Effort | Time to fix | Helps with sprint planning |
| Priority | Urgency ranking | Guides remediation order |
5.3 How to Verify a Finding
Step-by-step verification process:
1. Locate the file
```bash
# Navigate to the file mentioned in Location
cd /path/to/repository
code pipeline/customer_etl.json
```
2. Find the specific line
- Use the line number provided (e.g., line 45)
- Search for the code snippet text
- Most IDEs support "Go to Line" (Ctrl+G)
3. Confirm the issue
- Compare your code to the snippet in the finding
- Check if the configuration matches the description
- Verify the context (sometimes surrounding code matters)
4. Assess current state
- Has this already been fixed?
- Is there a compensating control?
- Does this apply to your environment?
5.4 Identifying False Positives
A finding might be a false positive if:
✓ The code is in a test/development file
- Finding: Hardcoded credentials
- Reality: Test fixtures with dummy data
- Action: Document as accepted risk for non-production
✓ Compensating controls exist
- Finding: Missing encryption
- Reality: Data is encrypted at network level
- Action: Add comment explaining the control
✓ The configuration is intentional
- Finding: Public access enabled
- Reality: Dataset contains only public information
- Action: Document business justification
✓ The scanner misunderstood context
- Finding: SQL injection risk
- Reality: Parameterized query with safe variable
- Action: Report to CodePrizm for scanner improvement
How to document false positives:
Create a FINDINGS_EXCEPTIONS.md file:
## Accepted Risks
### Finding: Hardcoded API Key in test_config.json
**Justification:** This is a test API key for sandbox environment only.
**Approved by:** Jane Smith (Platform Manager)
**Date:** 2024-01-15
**Review date:** 2024-07-15
---
6. Prioritization Framework
6.1 Remediation Priority Matrix
Use this framework to decide what to fix first:
High Impact, Low Effort → FIX IMMEDIATELY (Quick Wins)
High Impact, High Effort → PLAN & SCHEDULE (Strategic)
Low Impact, Low Effort → FIX WHEN CONVENIENT (Easy Improvements)
Low Impact, High Effort → DEFER OR ACCEPT (Low Priority)
6.2 Quick Wins vs. Strategic Improvements
Quick Wins (Do First)
Characteristics:
- Critical or High severity
- Low effort (< 1 day)
- Clear fix provided
- No architectural changes needed
Examples from your analysis:
- Removing hardcoded secrets (replace with Key Vault)
- Enabling audit logging (configuration change)
- Updating connection strings (find & replace)
Approach:
1. Create a "Security Sprint" for next week
2. Assign one engineer to knock out all quick wins
3. Aim to fix 5-10 issues in a single day
Strategic Improvements (Plan Carefully)
Characteristics:
- High or Medium severity
- High effort (> 1 week)
- Requires design decisions
- May impact multiple systems
Examples from your analysis:
- Implementing end-to-end encryption
- Redesigning authentication architecture
- Migrating to managed identity across all 8 linked services
Approach:
1. Create technical design document
2. Estimate effort and dependencies
3. Schedule in quarterly roadmap
4. Break into smaller milestones
6.3 Risk vs. Effort Matrix
Plot each finding on this matrix:
│ High Risk │ High Risk
│ Low Effort │ High Effort
│ ★ DO NOW ★ │ PLAN & SCHEDULE
│ │
────────┼─────────────────┼──────────────────
│ Low Risk │ Low Risk
│ Low Effort │ High Effort
│ DO WHEN FREE │ DEFER/ACCEPT
│ │
Example prioritization for your 4 pipelines:
| Finding | Severity | Effort | Quadrant | Action |
|---|---|---|---|---|
| Hardcoded password in pipeline 1 | Critical | Low | DO NOW | Fix today |
| Missing encryption for 20 datasets | High | High | PLAN | Q2 project |
| Weak TLS on 2 linked services | Medium | Low | DO WHEN FREE | Next sprint |
| Verbose logging in pipeline 4 | Low | High | DEFER | Backlog |
6.4 Prioritization Decision Tree
START: New finding identified
│
├─ Is severity Critical?
│ ├─ YES → Fix within 24-48 hours (regardless of effort)
│ └─ NO → Continue
│
├─ Is severity High AND effort Low?
│ ├─ YES → Fix in next sprint (Quick Win)
│ └─ NO → Continue
│
├─ Is severity High AND effort High?
│ ├─ YES → Create project plan, schedule in roadmap
│ └─ NO → Continue
│
├─ Is severity Medium or Low AND effort Low?
│ ├─ YES → Add to backlog, fix when convenient
│ └─ NO → Continue
│
└─ Is severity Low AND effort High?
└─ YES → Document as accepted risk or defer indefinitely
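The decision tree above maps to a small function. This is a sketch that treats effort as simply "Low" or "High" and routes the remaining Medium/High-effort case to the final branch:

```python
# Sketch of the decision tree; effort is simply "Low" or "High", and the
# Medium/High-effort case falls through to the final branch.
def triage(severity, effort):
    if severity == "Critical":
        return "Fix within 24-48 hours"
    if severity == "High" and effort == "Low":
        return "Fix in next sprint (Quick Win)"
    if severity == "High":
        return "Create project plan, schedule in roadmap"
    if effort == "Low":
        return "Add to backlog, fix when convenient"
    return "Document as accepted risk or defer"

print(triage("Critical", "High"))  # Fix within 24-48 hours
print(triage("Medium", "Low"))     # Add to backlog, fix when convenient
```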
---
7. Using Reports
7.1 Sharing Reports with Stakeholders
For Executive Leadership
What to share: Executive Summary from platform_security.md
Format: 1-page PDF with:
- Overall security score
- Count of Critical/High findings
- Top 3 risks
- Estimated remediation timeline
Sample email:
Subject: Data Platform Security Assessment - Action Required
Our platform scored 66/100 in security analysis. We have 1 Critical
and 2 High-severity issues requiring immediate attention.
Estimated fix time: 2 weeks for critical items, 6 weeks for full remediation.
Full report attached. Recommend review in next leadership meeting.
For Engineering Teams
What to share: Full platform_security.md and pipeline_health.md
Format: Markdown files in shared repository
Distribution:
- Post in team Slack/Teams channel
- Add to sprint planning board
- Include in engineering wiki
For Compliance/Audit Teams
What to share: lineage_report.md + security findings
Format: PDF with:
- Data lineage diagrams
- Security controls assessment
- Remediation tracking spreadsheet
7.2 Tracking Remediation Progress
Create a Tracking Spreadsheet
| Finding ID | Severity | Description | Owner | Status | Target Date | Completed Date |
|---|---|---|---|---|---|---|
| SEC-001 | Critical | Hardcoded password | Alice | In Progress | 2024-01-20 | |
| SEC-002 | High | Missing encryption | Bob | Planned | 2024-02-15 | |
| SEC-003 | High | Weak auth | Alice | Not Started | 2024-02-28 | |
Status Definitions
- Not Started: Acknowledged but no work begun
- Planned: Design/approach decided, scheduled
- In Progress: Actively being fixed
- In Review: Fix implemented, awaiting verification
- Completed: Verified and deployed to production
- Accepted Risk: Documented decision not to fix
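A small script can flag overdue rows automatically. This sketch assumes the tracking data has been loaded into dictionaries (the rows mirror the sample table above; in practice you would load them from CSV or a ticketing API):

```python
from datetime import date

# Sketch: flag overdue rows from the tracking sheet. Rows mirror the sample
# table above; in practice, load them from CSV or a ticketing API.
rows = [
    {"id": "SEC-001", "status": "In Progress", "target": date(2024, 1, 20)},
    {"id": "SEC-002", "status": "Planned", "target": date(2024, 2, 15)},
    {"id": "SEC-003", "status": "Not Started", "target": date(2024, 2, 28)},
]

DONE = {"Completed", "Accepted Risk"}  # terminal statuses, never overdue

def overdue(rows, today):
    """Return IDs of open findings whose target date has passed."""
    return [r["id"] for r in rows if r["status"] not in DONE and r["target"] < today]

print(overdue(rows, date(2024, 2, 1)))  # ['SEC-001']
```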
Weekly Review Cadence
1. Monday: Review new findings from latest scan
2. Wednesday: Check progress on in-flight fixes
3. Friday: Update stakeholders on completed items
7.3 Re-running Analysis After Changes
When to Re-scan
Trigger re-analysis when:
- ✓ You've completed a batch of security fixes
- ✓ New pipelines or datasets are added
- ✓ Major configuration changes are deployed
- ✓ Monthly (for ongoing monitoring)
- ✓ Before major releases or audits
How to request a new scan:
```bash
# If you have CodePrizm CLI installed
codeprizm scan --path /path/to/repository --output reports/
# Or contact CodePrizm support to schedule
```
Comparing Results Over Time
Track these metrics:
| Metric | Baseline (Jan 2024) | After Fixes (Feb 2024) | Target |
|---|---|---|---|
| Security Score | 66/100 | 89/100 | 90/100 |
| Critical Findings | 1 | 0 | 0 |
| High Findings | 2 | 1 | 0 |
| Medium Findings | 1 | 1 | <3 |
Visualize progress:
Security Score Trend
100 ┤ ╭─ Target
90 ┤ ╭────────╯
80 ┤ ╭────╯
70 ┤ ╭────────╯
60 ┤ ────────╯
└─────────────────────────────────────
Jan Feb Mar Apr May
Regression Detection
Watch for:
- Score decreases between scans
- New Critical findings appearing
- Previously fixed issues reappearing
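These three signals can be checked mechanically between scans. A sketch, modeling each scan as a `{finding_id: severity}` map and reusing the deduction values from Section 4.1 (the sample scan contents are illustrative):

```python
# Sketch: compare two scans for the regression signals listed above.
# A scan is {finding_id: severity}; "fixed" is the set of IDs your tracking
# sheet marks Completed. Deduction values follow Section 4.1 of this guide.
DEDUCTIONS = {"Critical": 15, "High": 8, "Medium": 3, "Low": 1}

def score(scan):
    return max(0, 100 - sum(DEDUCTIONS[sev] for sev in scan.values()))

def regressions(previous, current, fixed):
    alerts = []
    if score(current) < score(previous):
        alerts.append("score decreased")
    for fid in sorted(set(current) - set(previous)):
        if current[fid] == "Critical":
            alerts.append(f"new Critical finding: {fid}")
    for fid in sorted(fixed & set(current)):
        alerts.append(f"fixed issue reappeared: {fid}")
    return alerts

jan = {"SEC-001": "Critical", "SEC-002": "High"}
feb = {"SEC-002": "High", "SEC-004": "Critical"}
print(regressions(jan, feb, fixed={"SEC-001"}))  # ['new Critical finding: SEC-004']
```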
Common causes:
- New code merged without review
- Configuration drift in production
- Automated deployments bypassing checks
Prevention:
- Integrate CodePrizm into CI/CD pipeline
- Require security review for config changes
- Use infrastructure-as-code with version control
---
8. Glossary
Data Platform Terms
Dataset
A logical collection of data, typically represented as a table, file, or data structure. Your repository contains 20 dataset definitions.
Pipeline
An automated workflow that extracts, transforms, and loads (ETL) data from sources to destinations. Your platform has 4 pipelines.
Linked Service
A connection configuration to external systems (databases, APIs, storage). Your platform uses 8 linked services.
Lineage
The complete path data takes from origin to consumption, including all transformations. Documented in lineage_report.md.
Transformation
A data processing step that modifies, enriches, or aggregates data within a pipeline.
Security Terms
Secrets Management
The practice of securely storing and accessing sensitive information (passwords, API keys, connection strings) using tools like Azure Key Vault or AWS Secrets Manager.
Encryption at Rest
Data encryption when stored on disk or in databases. Protects against physical theft or unauthorized file access.
Encryption in Transit
Data encryption during network transmission using TLS/SSL. Protects against network eavesdropping.
Authentication
Verifying the identity of a user or service (who are you?).
Authorization
Determining what an authenticated entity is allowed to do (what can you access?).
Managed Identity
A cloud service feature that automatically handles credentials for service-to-service authentication without storing secrets.
Audit Logging
Recording all access and changes to data for compliance and security monitoring.
Analysis Terms
Finding
A specific issue identified by CodePrizm analysis, with severity, location, and remediation guidance.
Severity
The risk level of a finding: Critical, High, Medium, or Low.
False Positive
A finding that appears to be an issue but is actually acceptable in context.
Remediation
The act of fixing a security or operational issue.
Compensating Control
An alternative security measure that mitigates risk when the primary control isn't feasible.
Technical Debt
Accumulated suboptimal design decisions that increase maintenance cost over time.
---
9. FAQ
General Questions
Q: How often should we run CodePrizm analysis?
A: Recommended cadence:
- Monthly: For ongoing monitoring and trend tracking
- After major changes: New pipelines, configuration updates, or architecture changes
- Before audits: To prepare compliance documentation
- After security incidents: To verify remediation completeness
Q: Can we automate this analysis in our CI/CD pipeline?
A: Yes. CodePrizm can be integrated into your deployment pipeline to:
- Block deployments with Critical findings
- Generate reports automatically on each commit
- Track security score trends over time
- Alert teams when new issues are introduced
Contact CodePrizm support for CI/CD integration guidance.
Q: What if we disagree with a finding?
A: Follow this process:
1. Document why you believe it's a false positive
2. Check if compensating controls exist
3. Add to your FINDINGS_EXCEPTIONS.md file
4. Get approval from security/compliance team
5. Report to CodePrizm if the scanner needs improvement
Q: How long does remediation typically take?
A: Based on your 35 configuration files:
- Quick wins (Low effort): 1-2 days for all
- Medium complexity: 1-2 weeks per finding
- Strategic improvements: 1-3 months for architectural changes
- Full remediation: 2-6 months depending on severity distribution
Report-Specific Questions
Q: The lineage diagram is too complex to read. How do I simplify it?
A: Strategies:
- Focus on one pipeline at a time (your platform has 4)
- Trace backward from a specific dashboard or report
- Use diagram filtering tools to show only relevant paths
- Request CodePrizm generate per-pipeline lineage views
Q: Our security score is 66/100. Is that acceptable?
A: It depends on your context:
- Regulated industries (finance, healthcare): Aim for 85+
- Internal tools: 70+ may be acceptable
- Customer-facing platforms: 90+ recommended
The score itself matters less than the trend and Critical finding count. Zero Critical findings is the minimum acceptable state.
Q: Can we share these reports with external auditors?
A: Yes, but consider:
- Remove sensitive information (server names, IP addresses)
- Redact proprietary logic from code snippets
- Include remediation tracking to show proactive management
- Add executive summary explaining your security program
Q: What if a finding references a file that no longer exists?
A: This indicates:
- The analysis was run on an older version of your repository
- Files were moved/renamed after analysis
- Solution: Re-run the analysis on your current codebase
Technical Questions
Q: How does CodePrizm detect issues in our 20 datasets?
A: CodePrizm analyzes:
- Configuration files (JSON, YAML, XML)
- Schema definitions
- Access control settings
- Encryption configurations
- Connection strings and credentials
It uses pattern matching, static analysis, and security best practice rules.
Q: Can CodePrizm analyze our actual data, or just configurations?
A: CodePrizm analyzes configuration and code only, not the data itself. This means:
- ✓ No PII or sensitive data is accessed
- ✓ Analysis can run on production configurations safely
- ✗ Cannot detect data quality issues
- ✗ Cannot identify sensitive data in tables
Q: What's the difference between pipeline_health.md and platform_security.md?
A:
- pipeline_health.md: Operational reliability (performance, error handling, monitoring)
- platform_security.md: Security posture (authentication, encryption, secrets)
A pipeline can be "healthy" (runs successfully) but "insecure" (uses hardcoded passwords).
Q: How do I know which of my 8 linked services are most critical?
A: Check the lineage report to see:
- Which services feed the most pipelines
- Which connect to production databases
- Which are used by customer-facing dashboards
Prioritize securing services that appear in multiple lineage paths.
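Counting service usage across lineage paths is easy to automate once you have a pipeline-to-service mapping. The mapping below is hypothetical, for illustration only; build the real one from the lineage report:

```python
from collections import Counter

# Hypothetical pipeline-to-service mapping, for illustration only;
# build the real one from lineage_report.md.
pipeline_services = {
    "daily_sales_etl": ["sql_server_prod", "blob_storage"],
    "customer_etl": ["sql_server_prod", "crm_api"],
    "inventory_sync": ["sap_connector", "blob_storage"],
    "finance_load": ["sql_server_prod"],
}

# Count how many pipelines reference each linked service.
usage = Counter(svc for services in pipeline_services.values() for svc in services)
for service, count in usage.most_common():
    print(f"{service}: used by {count} pipeline(s)")
```

Services at the top of this ranking are the ones whose compromise or outage would touch the most data paths.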
Remediation Questions
Q: We fixed a Critical finding. How do we verify it's resolved?
A: Verification checklist:
1. ✓ Code change deployed to production
2. ✓ Configuration updated in all environments
3. ✓ No hardcoded secrets remain in git history
4. ✓ Re-run CodePrizm analysis to confirm
5. ✓ Update tracking spreadsheet
Q: Can we fix multiple findings with one architectural change?
A: Yes! Example:
- Single change: Implement Azure Key Vault for all secrets
- Fixes: All hardcoded password findings across 4 pipelines and 8 linked services
- Benefit: More efficient than fixing each individually
Look for patterns in your findings to identify these opportunities.
Q: What if we don't have resources to fix everything?
A: Pragmatic approach:
1. Fix all Critical findings (non-negotiable)
2. Fix High findings that are low effort
3. Document accepted risks for remaining items
4. Schedule strategic improvements in quarterly roadmap
5. Prevent new issues by integrating CodePrizm into CI/CD
Q: How do we prevent these issues from reoccurring?
A: Establish preventive controls:
- Code review checklist including security items
- Pre-commit hooks to detect secrets
- Infrastructure-as-code with version control
- Automated scanning in CI/CD pipeline
- Security training for data engineers
- Configuration templates with security built-in
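A pre-commit secret check can start as a simple regex scan over changed files. This is a rough sketch with illustrative patterns; a dedicated secret-scanning tool is more robust in practice:

```python
import re

# Rough sketch of a secret scan; the patterns are illustrative and incomplete.
# Prefer a dedicated secret-scanning tool for real pre-commit enforcement.
SECRET_PATTERNS = [
    re.compile(r"password\s*=\s*\S+", re.IGNORECASE),
    re.compile(r'"(api[_-]?key|secret)"\s*:\s*"[^"]+"', re.IGNORECASE),
]

def find_secrets(text):
    """Return substrings that look like hardcoded secrets."""
    hits = []
    for pattern in SECRET_PATTERNS:
        hits.extend(match.group(0) for match in pattern.finditer(text))
    return hits

sample = '"connectionString": "Server=db;User=admin;Password=P@ssw0rd123"'
print(find_secrets(sample))  # flags the embedded Password=... fragment
```

Wired into a pre-commit hook, a non-empty result would block the commit and force the author to move the value into a secrets manager.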
---
Need Help?
For questions about this analysis:
- Review the detailed findings in each report
- Check the Glossary for terminology
- Consult your security or platform team
For CodePrizm support:
- Technical questions about the scanner
- Requesting re-analysis
- CI/CD integration assistance
- Custom reporting needs
For remediation guidance:
- Consult your cloud provider's security documentation
- Engage your security team for architectural decisions
- Consider security consulting for complex fixes
---
This user guide is specific to the analysis of your data platform containing 4 pipelines, 20 datasets, 8 linked services, and 3 standards. Generated reports: lineage_report.md, platform_security.md, pipeline_health.md, repository_userguide.md.