ai-agents · claude-code · multi-agent · governance · developer-tools · agent-teams

30 Markdown Files Won't Save You: What Actually Works When Building AI Agent Teams

Mohammed Ahmed · 16 min read

I recently built an entire product using 8 AI agents. A CLI tool, a web dashboard, a visual design system with 34 assets, 9 content pieces ready for launch across 5 channels, and a marketing strategy with competitive positioning against 11 competitors. 232 tests passing. Gold standard verified. All coordinated by one person — me — routing work between agents through a single project board file.

The product took one day to build. Not because AI is magic, but because the agents were designed to work as a team, not as 8 independent chatbots wearing job titles.

Every week, I see a new viral post showing a pristine .claude/agents/ folder with 30+ beautifully named markdown files. "Frontend Developer." "Growth Hacker." "Compliance Checker." The promise: one person replaces an entire company by typing "@growth-hacker, get me users."

I've spent 25+ years building software and leading engineering teams. That experience is exactly why these screenshots make me uneasy. They look like org charts without an organization. And I've seen enough hollow org charts to know what happens next: nothing ships.

Here's what nobody is talking about: the directory tree is the easy part. The hard part is governance. And I can prove it, because I just built a product that tested every assumption.


The Illusion of the Agent Directory

The viral screenshots look clean. Organized. Professional. But here's the question nobody asks: what's actually inside those files?

In most cases, something like this:

# Growth Hacker

You are a growth hacker. Your job is to find creative ways
to acquire users and grow the product. Think outside the box.
Use data-driven strategies.

That's not an agent. That's a costume. It's a system prompt wearing a job title.

The "everything-claude-code" repository on GitHub currently offers 112 specialized agents, 146 skills, and 72 plugins. It's an impressive engineering exercise. But the fundamental misconception in the current wave of multi-agent enthusiasm is that naming an agent is the same as building one.

An agent isn't its title. An agent is its constraints, its decision boundaries, its relationship to other agents, and its governing principles. Without those, you have 30 independent contractors with no shared standards — and anyone who's managed contractors knows how that ends.
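What does the alternative look like? Here's a sketch of the shape my own agent files take (illustrative structure, not the verbatim contents):

# Uthman (Senior Dev, Backend)

## Cognitive domain
Thinks in schemas: data structures, relationships, validation boundaries.

## Governance
Loads principles_core.md and anti_patterns_core.md as operating constraints.

## Decision boundaries
Owns data model design and its definition of done. Surfaces cross-domain
tradeoffs to Abu Bakr. Strategic decisions belong to Mohammed.

## Relationships
Reports to Abu Bakr. His schema is consumed downstream by Abdullah (CLI)
and Ali (dashboard).

The title is one line. Everything underneath it is what makes the title mean something.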


What I Built and How I Built It

Before I walk through the design principles, let me tell you what actually happened when I put them to the test.

I have 8 agents organized into two teams. A development team: Abu Bakr (Team Lead & Architect), Umar (Strategist & Execution), Uthman (Senior Dev — Backend), Ali (Senior Dev — Frontend), and Abdullah (Senior Dev — Pipelines). A marketing team: Khalid (Marketing Lead), Zahid (Copy & Content), and Rashid (Visual & Design). All governed by shared principles files that every agent loads as operating constraints.
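On disk, none of this is exotic. The layout looks roughly like this (illustrative; the folder names are my convention, not a Claude Code requirement):

.claude/agents/
  abu-bakr.md        Team Lead & Architect
  umar.md            Strategist & Execution
  uthman.md          Senior Dev (Backend)
  ali.md             Senior Dev (Frontend)
  abdullah.md        Senior Dev (Pipelines)
  khalid.md          Marketing Lead
  zahid.md           Copy & Content
  rashid.md          Visual & Design
governance/
  principles_core.md
  anti_patterns_core.md
PROJECT_BOARD.md     created per project
briefs/              handoff briefs, one per agent per project
deliveries/          delivery receipts (see the calibration findings below)

The tree is the least interesting part, which is exactly the point: everything that matters lives in what those files say and in how the agents are wired to read each other's output.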

I decided to build a developer tool — a CLI that scaffolds governed Claude Code agent teams. The kind of tool I wished existed when I was building my own agents. Here's how the build went:

I gave the product brief to Abu Bakr. He decomposed it into 6 workstreams, produced a PROJECT_BOARD.md tracking all workstreams with status and dependencies, and generated individual handoff briefs for each agent. He made four architectural decisions on the spot: TypeScript over Python for ecosystem alignment, a 4-phase execution sequence, dashboard as a static client-side app, and an initial 4-command CLI scoped tightly against feature creep.
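The board itself is plain markdown. A reconstruction of its shape at kickoff (illustrative, not the actual file):

# PROJECT_BOARD.md

| Workstream         | Owner    | Status      | Depends on                |
| ------------------ | -------- | ----------- | ------------------------- |
| Data model         | Uthman   | In progress | none                      |
| Marketing strategy | Khalid   | In progress | none                      |
| Design system      | Rashid   | In progress | Khalid's brief            |
| CLI                | Abdullah | Not started | Data model                |
| Dashboard          | Ali      | Not started | Data model, design tokens |
| Content            | Zahid    | Not started | Khalid's brief            |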

I gave Abu Bakr's plan to Umar for challenge. Umar found three issues. The fourth CLI command was dead weight — a project management feature pretending to be a product feature. He cut it. The dashboard scope was premature — building a full React app before anyone had used the CLI. He scoped it down to topology visualization only. And the data model's definition of done was too passive — "review Mohammed's system" should be "encode Mohammed's actual 8-agent team with zero information loss." Abu Bakr accepted all three changes and updated every affected document.

I routed Uthman to build the data model. With zero instructions, he found his own assignment in the briefs folder, loaded a brainstorming skill, went and studied my actual agent system — all 8 agents and 4 governance files — and started asking me precise architectural questions. Over 5 questions, we made decisions about schema targets, relationship modeling, governance flexibility, routing derivation, and voice representation. Then he designed the full schema, produced a 410-line specification, encoded my real 8-agent team as a gold standard proof, and spawned a review subagent to verify his own spec before presenting it.

In parallel, Khalid produced the marketing strategy. 7 documents totaling 1,018 lines — positioning with 11 competitors mapped, a channel plan across 5 platforms, a Day -7 to Day +14 launch sequence, complete briefs for Zahid and Rashid, distribution and pricing recommendations, and an accuracy cross-check where he verified 12 marketing claims against the technical briefs and flagged 3 discrepancies.

Also in parallel, Rashid produced the visual design system. Logo concept with 3 variants, full color palette, typography scale, 7 component specs, OG images, social graphics for every platform, favicon set at every standard size, and a Tailwind config extension file that Ali could import directly. 34 files, 74 tests passing. He explicitly documented what each downstream agent would receive from his output — Ali gets the Tailwind tokens, Zahid gets the social graphics, Khalid gets the brand mark.

Once Uthman delivered, the builders started. Abdullah built the CLI — 3 commands, 168 tests, gold standard verified (my 8-agent team round-trips through the full pipeline with zero information loss). Ali built the topology visualization dashboard — React Flow canvas with interactive nodes, 32 tests, all definition of done items passing. Zahid produced 9 content pieces — README, landing page copy, X launch thread, LinkedIn post, Reddit posts, Hacker News post, getting-started guide, CLI reference, and SEO/meta tags — all self-reviewed against Khalid's brief with a 14-point verification checklist.

Abu Bakr integrated everything. He ran all test suites (232 passing, 0 failing), fixed a latent environment mismatch causing false failures, merged feature branches, updated the project board to reflect completion, and caught a critical quality issue I'll describe below.

One person. Eight agents. One day. A complete product with engineering, marketing, design, and documentation — all coordinated through a PROJECT_BOARD.md.


Four Design Principles That Made It Work

This didn't work because AI is smart. It worked because the agents were designed with four principles I've learned from decades of building human teams.

1. Specialize by Cognitive Domain, Not Job Title

The viral posts organize agents by function: "frontend developer," "backend architect," "marketing lead." But AI agents aren't humans, and the organizational metaphor breaks down at the point of execution.

What actually works is specializing agents by how they think.

My three senior development agents aren't interchangeable "developers" with different labels. Each one thinks in a fundamentally different cognitive domain.

Uthman thinks in schemas — when he received the data model assignment, he asked 5 precise questions about representation fidelity, relationship modeling, and validation boundaries. Every question was about data structure, not about code.

Ali thinks in components — when he received the dashboard assignment, he immediately checked how Rashid's design tokens aligned with the brief's theme defaults, loaded a planning skill, and started reasoning about user interaction flows.

Abdullah thinks in pipelines — when he received the CLI assignment, his first sentence was "The design already exists. Re-brainstorming this would be AP9 — activity mistaken for progress." He went straight to a 7-stage build order where each stage was independently verifiable.

Same governance. Same project. Completely different reasoning patterns. When Abu Bakr decomposes a project, each workstream has a natural owner because the cognitive decomposition matches how each agent approaches problems.

2. Bind Every Agent to a Shared Governance Layer

This is the single most critical gap in every multi-agent setup I've seen published online.

In my system, every agent loads two shared files as operating constraints: principles_core.md (non-negotiable engineering standards) and anti_patterns_core.md (explicit failure modes to avoid). The principles don't change by role. The application changes.
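The flavor of those two files is roughly this (paraphrased excerpts, not verbatim):

principles_core.md
- A definition of done names a verifiable artifact, not a vague activity
  ("encode the real 8-agent team with zero information loss", not "review
  the system").
- Strategic decisions belong to Mohammed. Agents surface decision points;
  they do not resolve them.

anti_patterns_core.md
- AP9: Activity mistaken for progress. Re-planning or re-brainstorming work
  that already has an approved design.
- Confident output is not verified output. If you have not seen the real
  artifact, say so.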

Here's why this matters — with a real example from the build.

Zahid, my content agent, produced 9 beautifully structured content pieces. He self-reviewed against a 14-point checklist from Khalid's brief and passed every item: tone, character limits, message priority, platform constraints, banned words. His content was polished, on-brand, and professional.

It was also technically wrong.

He documented a viz command that doesn't exist. He invented 12 CLI flags (--template, --json, --strict, --path, --dry-run, --force, and 6 more) that were never built. He described file paths that were incorrect. His self-review caught every marketing and tone issue — it couldn't verify technical accuracy, because Zahid wrote the documentation before the CLI existed.

Abu Bakr caught all of it during integration. He rewrote 85% of the CLI reference, corrected every affected document across 6 files, and replaced fabricated output examples with real CLI output. The governance layer — Abu Bakr's role as integrator with authority to verify cross-workstream output — prevented technically false documentation from shipping.

Without the shared governance layer, Zahid's confident, well-formatted, completely fabricated CLI documentation would have been the final product. Governance isn't bureaucracy. It's the thing that catches the errors your individual agents can't see.

3. Mirror Proven Team Topologies Across Domains

When I built my marketing team, I didn't invent a new organizational structure. I mirrored the development team topology.

Abu Bakr leads the dev team: he owns direction, integration, and delivery. Uthman, Ali, and Abdullah execute within their domains under his coordination. Khalid leads the marketing team identically: he owns positioning, campaign strategy, and technical accuracy. Zahid and Rashid execute within clear briefs that Khalid sets.

This worked exactly as designed during the build. Khalid produced strategy documents and briefs. Zahid consumed those briefs to write content. Rashid consumed those briefs to design assets. When Rashid started his session, he independently went and read Khalid's brief — "Let me also check Khalid's brief since it feeds into my work" — without being told. The topology gave him the context to know where his inputs came from.

Most multi-agent setups treat every agent as a peer — a flat structure where anyone can be invoked for anything. Hierarchy exists in my agent team for the same reason it exists in high-performing human teams: someone needs to own integration and make tradeoff decisions. Abu Bakr and Khalid make those decisions in their domains. The specialists execute. I make the decisions that cross domains.

4. Inject Domain Awareness Into Every Agent

Your agents need to understand the full context of your work, not just their narrow function.

A frontend developer agent building for a healthcare company needs to think about HIPAA implications even though "compliance" isn't in its job title. A content agent working for a financial firm needs to understand disclosure requirements even though it's not "the legal agent."

In my system, every agent carries explicit awareness of the domains I operate in. This isn't a toggle — it's embedded in each agent's operating context. When Uthman designed the schema, he included a domain_awareness field as optional freeform text — only present when the domain carries elevated responsibility. He understood that domain awareness isn't universal metadata; it's a context-specific layer that raises the bar for agents that need it.
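Uthman's field name gives a feel for it. For the healthcare example above, the entry might read something like this (hypothetical content; I'm paraphrasing the shape of his schema, not quoting it):

domain_awareness: |
  This agent ships features for a healthcare product. Assume any data field
  may contain PHI. Never place patient data in logs, fixtures, screenshots,
  or example payloads. Anything touching storage, export, or a third-party
  API gets flagged for HIPAA review before it is marked done.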

Domain awareness isn't a separate agent. It's a layer that runs through every agent. The agents that miss this produce work that's technically correct and contextually wrong.


The Orchestrator Is Not a Cherry on Top

The viral post I keep seeing describes the orchestrating agent as "a cherry on top." This fundamentally misunderstands the architecture.

Abu Bakr isn't decoration. He's the load-bearing wall. During the build, he decomposed the brief into 6 workstreams with explicit assignments, produced handoff documents specific to each agent, accepted Umar's challenge and propagated changes to every affected document, and then integrated all outputs — catching fabricated documentation, fixing environment mismatches, merging feature branches, and verifying 232 tests across 3 codebases.

The orchestrator is also where you implement your most important constraint: strategic decisions belong to the human. My agents execute. I set direction. Abu Bakr surfaces decision points — "Should the schema support sub-teams or keep it flat?" "Should routing be explicit or derived from relationships?" — rather than making strategic choices autonomously. The agents handle the 90% that's execution. I handle the 10% that's judgment. This ratio is what makes one person productive at the level of a team.


What Went Wrong: Three Calibration Findings

Nobody writes about their failures. But the failures are where the real learning lives.

Finding 1: Not All Agents Self-Discover Their Work

When I started Phase 1, I opened sessions for Uthman, Khalid, and Rashid in parallel. Uthman found his own brief in the briefs folder, loaded it, studied the gold standard, and started working — zero instructions needed. Khalid and Rashid loaded their identity and governance, then asked me what to do.

The calibration gap: Uthman's engineering context made him naturally check the project structure for his assignment. Khalid and Rashid — marketing agents — waited for direction.

The fix: I added a Session Start Protocol to every agent's governance: at session start, check for a PROJECT_BOARD.md in the project root. If it exists, read it. Find your assigned workstream. If a briefs/ folder exists with your assignment, load it. All 8 agents now self-discover their briefs.
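The added governance text reads roughly like this:

Session Start Protocol
1. At session start, check the project root for PROJECT_BOARD.md. If it
   exists, read it.
2. Find the workstream assigned to you.
3. If a briefs/ folder exists and contains a brief for you, load it before
   asking for direction.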

Finding 2: The Orchestrator Loses Context Between Sessions

After Phase 1 delivered, I closed Abu Bakr's session. When I reopened him for integration, he had no memory of the decomposition he'd done or the work that had been completed. He read the PROJECT_BOARD.md, but it hadn't been updated as each agent delivered. So he assumed Phase 2 hadn't started yet — when in reality, all builders had already finished.

The fix: I added a Delivery Protocol to every agent's governance: when your workstream is complete, write your delivery summary to a standardized receipt file in a shared deliveries/ folder. Abu Bakr now reads all delivery receipts at session start as his primary source of project state. The agents already produce delivery summaries — the only change is writing them to a file instead of printing to the terminal.
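A receipt is deliberately small. Something like this (illustrative; the filename and field names are my convention):

deliveries/uthman-data-model.md

Agent: Uthman
Workstream: Data model
Status: Complete
Delivered: schema specification (410 lines), gold standard encoding of the
  8-agent team, verification notes from the review subagent
Open items: none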

Finding 3: Content Agents Fabricate Technical Details

This was the most instructive failure. Zahid wrote detailed CLI documentation — command flags, output formats, file paths — for a CLI that hadn't been built yet. His content was well-structured, on-brand, and technically fabricated. He invented a command and 12 flags that don't exist.

Abu Bakr caught all of it during integration — rewrote 85% of the CLI reference and corrected 6 files. But if I'd published Zahid's content directly, I would have shipped documentation for features that don't exist.

The fix: I added a Technical Accuracy Constraint to Zahid's governance: when documenting CLI commands, API endpoints, file paths, or technical output, if the implementation does not exist yet, mark every technical detail as [DRAFT — pending verification against implementation]. Never fabricate command flags, output formats, or file structures. If you haven't seen real output, say so.

This is why governance matters more than prompt quality. Zahid's prompts were excellent. His tone was perfect. His formatting was precise. He passed every checkpoint in Khalid's brief. But without the governance fix, he would confidently fabricate technical details every time. The fix isn't "make Zahid smarter" — it's "add a constraint that prevents the failure mode."

All three fixes were implemented and verified across all 8 agents within hours of discovering them. The system didn't just find its own bugs — it absorbed the fixes into its governance layer so they never happen again. That's what iterative calibration looks like in practice.


Why 8 Agents Beat 30

More agents means more coordination overhead. More potential for conflicting outputs. More governance surface area. More places where quality can silently degrade.

I designed my system with eight agents. Not because I couldn't write more markdown files, but because 25 years of building software teams taught me that a small, well-structured team with clear ownership outperforms a large, loosely coordinated group every time.

Eight agents covering architecture, strategy, backend, frontend, pipelines, marketing, content, and design — with shared governance and clear hierarchy — built a complete product in one day. Adding a 9th or 10th would only make sense when a genuine gap emerges in practice, not because a file tree looks more impressive with more entries.


The Real Competitive Advantage

The edge in multi-agent AI isn't the file tree. It's everything that isn't visible in the screenshot:

Governance layers that catch fabricated documentation before it ships. Cognitive specialization that produces 5 precise schema questions from a data agent and a straight-to-build execution plan from a pipeline agent — in the same project, under the same governance. Team topologies where a design agent independently seeks out the marketing lead's brief because the hierarchy tells him where his inputs come from. An orchestrator who merges feature branches, runs 232 tests across 3 codebases, and rewrites 85% of a content deliverable because the technical details were wrong.

And calibration findings — real failures on a real project — that tell you exactly what to fix before the next run.

This is what separates an operating system from a collection of scripts. The community is still in the "look at my beautiful file tree" phase. The next phase requires thinking about AI agent teams the way you'd think about building a high-performing human organization: with governance, culture, structure, and relentless quality standards.

The file tree is where everyone starts. Governance is where the real work begins.

And if you're wondering what the tooling for governed agent teams looks like — stay tuned.


Mohammed Ahmed is the founder of Green Olive Tech LLC, where he builds production-grade AI-powered products and platforms. With 25+ years of software and data engineering experience, he recently built an entire product using an 8-agent AI team — 232 tests, 34 design assets, 9 content pieces, and a complete launch package — coordinated through governance files and a single project board. Follow his work at greenolivetech.com and @MMAhmed7791 on X.
