Early in my AI workflow, the thing I did most was write prompts. Long ones. Prompts that tried to anticipate every situation and cover every edge case in one shot.

I thought this was efficiency. Get all the requirements into one place, and the AI handles everything in one pass.

It didn't work that way. The prompt kept growing, and with every addition, something else would break. I'd add a rule to fix problem A, and two days later that same rule would cause a strange failure in scenario B. I started spending real time debugging prompts — which was exactly backwards. I'd brought AI in to save time.

The issue wasn't that my prompts weren't good enough. The issue was that I was asking one person to run an entire team.


01

Design Teams Aren't One Person — The Role Split

This realization felt obvious the moment I had it. Embarrassingly so.

In a real design team, no one person simultaneously does the research, writes the requirements, produces the designs, runs the critique, and audits the component library. These are different roles. They require different mindsets. And — critically — they're meant to check each other. The person who designs something shouldn't also be the one reviewing it. That's not just team structure; that's professional ethics.

But when I worked with AI, I treated it like one entity that needed to know everything, be responsible for everything, and do everything in sequence. So I broke it into five.

Five roles, each independent:
Research Specialist: Competitive analysis and user research. The team's antenna for outside information.
Requirements Specialist: Translating vague intent into clear, actionable design requirements.
Design Specialist: The core executor. Making design decisions within established visual guidelines.
Usability Evaluation Specialist: Rigorous critique focused on finding problems, not reasons to approve.
Component Audit Specialist: Final gate. Confirming that outputs are compatible with the existing component library.

Each role is independent. Its own system prompt, its own model, its own tools, its own scope.
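As a rough sketch of what that independence could look like in code, each role can be a small, immutable record. Everything concrete here is an assumption for illustration: the `Role` class, the placeholder model identifiers (`model-a`, `model-b`), and the tool names are hypothetical, not a real framework's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    """One independent agent with its own prompt, model, tools, and scope."""
    name: str
    system_prompt: str
    model: str                   # placeholder identifier, not a real model name
    tools: tuple[str, ...] = ()  # hypothetical tool names
    scope: str = ""              # what this role may touch


# Five roles, each configured separately. Prompts are shortened stand-ins.
ROLES = {
    "research": Role(
        name="Research Specialist",
        system_prompt="Run competitive analysis and user research.",
        model="model-a",
        tools=("web_search",),
        scope="Outside information only.",
    ),
    "requirements": Role(
        name="Requirements Specialist",
        system_prompt="Turn vague intent into clear design requirements.",
        model="model-a",
        scope="Requirements documents only.",
    ),
    "design": Role(
        name="Design Specialist",
        system_prompt="Make design decisions within the visual guidelines.",
        model="model-b",
        tools=("design_token_lookup",),
        scope="Design output only.",
    ),
    "evaluation": Role(
        name="Usability Evaluation Specialist",
        system_prompt="Find problems, not reasons to approve.",
        model="model-b",
        scope="Critique only; never edits the design.",
    ),
    "audit": Role(
        name="Component Audit Specialist",
        system_prompt="Confirm compatibility with the component library.",
        model="model-a",
        tools=("component_library_lookup",),
        scope="Final compatibility check.",
    ),
}


def describe(role_key: str) -> str:
    """Return a one-line summary of a role's configuration."""
    role = ROLES[role_key]
    return f"{role.name} [{role.model}] tools={list(role.tools)}"
```

Because each record carries only what its role needs, the research role never sees the component library tool, and the auditor never sees the search tool.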

The immediate payoff: each role's prompt got shorter, and sharper. The research specialist doesn't need to know the component library spec. The component auditor doesn't need to understand research methodology. When everyone has a clearly defined lane, everyone gets better at staying in it.

Roles don't just divide the work. They protect the integrity of each part of it.

02

And Then There Was the Knowledge Problem — Shared Assets

Splitting into roles solved "who does what." It created a new problem almost immediately.

Take the WCAG accessibility standards. They are referenced in research, enforced in design, and checked in evaluation. If I wrote them into each role's prompt separately, things looked fine until I needed to update the standard.

Now I had to find it in five places, change it in five places, and verify that every version matched. I was spending more time maintaining prompts than running them. That's not a workflow; that's a liability.

The fix was a second split. Not roles this time, but the knowledge itself. I separated each agent's "brain" into four distinct file types, all living in a shared project folder:

Four knowledge types, stored once, referenced by all:
Role Definition: Who this agent is, where its responsibilities start and end. What to touch and what to leave alone.
Methodology: Reusable professional knowledge any role can reference, such as how to run heuristic evaluation or how to structure user goal analysis.
Standards: Hard constraints outputs must satisfy: design tokens, accessibility standards, naming conventions.
Workflow: The sequence: what happens in what order. For example, "audit the component library before generating a PR."

After this, the WCAG problem went away. The standard lives in exactly one file. I update it once; every role that references it picks up the new version on its next run.
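One way to picture the "stored once, referenced by all" idea is a prompt that is assembled from files at run time instead of being pasted together by hand. The sketch below is an assumption-laden illustration: the `build_prompt` helper, the `roles/` and `standards/` folder names, and the placeholder file contents are all hypothetical, and a temporary directory stands in for the shared project folder.

```python
import tempfile
from pathlib import Path


def build_prompt(root: Path, role_file: str, shared_files: list[str]) -> str:
    """Assemble a prompt from a role definition plus shared files, read fresh on every call."""
    parts = [(root / role_file).read_text(encoding="utf-8").strip()]
    for rel in shared_files:
        parts.append((root / rel).read_text(encoding="utf-8").strip())
    return "\n\n".join(parts)


# Demo: a throwaway folder standing in for the shared project folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for sub in ("roles", "standards"):
        (root / sub).mkdir()
    (root / "roles/design.md").write_text(
        "You are the Design Specialist.", encoding="utf-8")
    (root / "roles/evaluation.md").write_text(
        "You are the Usability Evaluation Specialist.", encoding="utf-8")
    standard = root / "standards/accessibility.md"
    standard.write_text(
        "Accessibility standard: version A (placeholder text).", encoding="utf-8")

    before = build_prompt(root, "roles/design.md", ["standards/accessibility.md"])

    # One edit to the single shared file...
    standard.write_text(
        "Accessibility standard: version B (placeholder text).", encoding="utf-8")

    # ...and every role that references it sees the change on its next run.
    design_prompt = build_prompt(
        root, "roles/design.md", ["standards/accessibility.md"])
    evaluation_prompt = build_prompt(
        root, "roles/evaluation.md", ["standards/accessibility.md"])
```

The design choice that matters is that roles hold references to the standard, not copies of it, so there is nothing to keep in sync.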

One update. Every agent in sync. That's what it actually means to decouple knowledge from roles.

03

Who's Managing All This? — Orchestration

Roles handled. Shared knowledge handled. Running this in practice surfaced one more basic question: where is everything? Who's reviewed what? Who needs to act next?

My answer evolved in three phases.

The first version was a Markdown file. I tracked progress manually — task is at step three, needs review, reviewed and done. Clunky, but it answered "whose turn is it," which was all I needed at first.

The second version was automation: when an agent finished its section, the system created a Pull Request and triggered the corresponding review flow. I stopped chasing a tracker. But I realized it still wasn't right. Real design work isn't a straight line — it branches, it loops back, it depends on things that aren't done yet. A linear trigger-and-PR setup can't model that.

The third version is where things finally felt solid: I turned the workflow into a real, queryable state machine. Every task is a card. Statuses update as agents run — "Researching," "Designing," "Awaiting Review," "Review failed, returned to design." Every state change is logged with a reason.

Phase evolution

Phase 1   Markdown file           Manual tracking, "whose turn is it"
Phase 2   PR automation           Linear triggers; breaks on branching work
Phase 3   State machine           Every task a card, every change logged
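The Phase 3 idea, a task card whose status changes are validated and logged with a reason, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual system: the `TaskCard` class, the `ALLOWED` transition table, the `DONE` status, and the example task title are all hypothetical. Only the four quoted status names come from the text above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Status(Enum):
    RESEARCHING = "Researching"
    DESIGNING = "Designing"
    AWAITING_REVIEW = "Awaiting Review"
    REVIEW_FAILED = "Review failed, returned to design"
    DONE = "Done"  # assumption: a terminal status not named in the text


# Assumed transition table. Note the loop back from review to design,
# which a purely linear trigger chain cannot express.
ALLOWED = {
    Status.RESEARCHING: {Status.DESIGNING},
    Status.DESIGNING: {Status.AWAITING_REVIEW, Status.RESEARCHING},
    Status.AWAITING_REVIEW: {Status.REVIEW_FAILED, Status.DONE},
    Status.REVIEW_FAILED: {Status.DESIGNING},
    Status.DONE: set(),
}


@dataclass
class TaskCard:
    title: str
    status: Status = Status.RESEARCHING
    history: list = field(default_factory=list)

    def move(self, new_status: Status, reason: str) -> None:
        """Change status if the transition is allowed, and log why it happened."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"{self.status.value!r} -> {new_status.value!r} is not allowed")
        self.history.append(
            (datetime.now(timezone.utc), self.status, new_status, reason))
        self.status = new_status


# Hypothetical task walking through a failed review and back to design.
card = TaskCard("Checkout redesign")
card.move(Status.DESIGNING, "Research summary delivered")
card.move(Status.AWAITING_REVIEW, "First draft ready")
card.move(Status.REVIEW_FAILED, "Evaluation flagged a contrast issue")
card.move(Status.DESIGNING, "Returned to design with review notes")
```

The `history` list is what makes the workflow queryable: for any card, it answers both where the task is and why it got there.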

The lesson from phase two to phase three: the more automated a system gets, the more critical visibility becomes. When everything was manual, the state lived in my head. Once agents started running on their own, if I couldn't see where they were and why, I couldn't trust the system — even when it was technically working.

04

Who Makes the Final Call? — Human in the Loop

My usability evaluation agent is set to be harsh. That's the job: find problems, not reasons to approve. But that creates a real question — if it flags something, does the design automatically go back?

My answer: not necessarily. But it always gets seen.

This borrows from a principle I know well from the security industry: prefer false positives over false negatives. In threat detection, that means it is better to surface a false alarm than to miss a real threat. I applied the same logic here. The evaluation agent's threshold is tuned toward sensitivity: when in doubt, it flags the issue rather than letting it slip through quietly.

When any agent raises a concern, the task doesn't advance silently. It gets flagged. A human sees it.
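In code, that rule is a small routing function: any finding at or above a deliberately low threshold stops the task and puts it in front of a person. The sketch below is a minimal illustration under assumptions. The `Finding` record, the 0-to-1 `severity` score, the `FLAG_THRESHOLD` value, and the routing labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    agent: str       # which agent raised the concern
    note: str        # what it noticed
    severity: float  # hypothetical score from 0.0 (trivial) to 1.0 (blocking)


# Assumption: set low on purpose, so borderline issues are flagged
# (accepting false positives) rather than missed (false negatives).
FLAG_THRESHOLD = 0.2


def route(findings: list[Finding]) -> str:
    """Advance only when nothing was flagged; otherwise wait for a human."""
    flagged = [f for f in findings if f.severity >= FLAG_THRESHOLD]
    return "needs_human_review" if flagged else "advance"


# A single borderline concern is enough to pause the task.
borderline = [Finding("usability_evaluation", "Tap target looks small", 0.25)]
decision = route(borderline)
```

Note that `route` never sends the design back on its own. It only decides whether a human has to look, which keeps the final call with a person.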

This does slow things down sometimes. But I've thought it through: I don't want a system that never stops. I want a system that stops when it's supposed to. The first kind looks more efficient. But failures accumulate quietly until something breaks badly. The second kind occasionally interrupts you — and every interruption means something actually warranted a second look.

A system that never stops isn't reliable. It's just not telling you where it's going wrong.

05

My Role Changed — From Designer to Architect

Looking back at all of this, my own work has shifted in ways I didn't plan for.

I thought I was teaching AI how to do design work. But what I was actually doing was closer to this: deciding what roles a team needs, determining what knowledge those roles should share, designing how they hand off work and escalate decisions, and — most critically — identifying which moments require a human judgment call.

None of that is design work. It's the infrastructure that makes design work reliable.

The most important skill I've built through all of this isn't prompt engineering. It's systems thinking. And the most useful question I've learned to ask isn't "can AI do this?" It's "what does this system need so it doesn't get it wrong?"