Modular Architecture with Generative AI: Treating Intelligence as a Capability, Not a Dependency

There is a specific moment in the life of a system that incorporates Generative AI when the architectural debt becomes unbearable. It's not when the first bug appears. It's when someone asks, "What prompt was in production when this happened?" and no one knows how to answer.

The Unnamed Problem

I call it AI Spaghetti, the state where AI logic is chaotically distributed throughout the codebase: model calls within controllers, prompt construction by concatenating strings in helpers, retrieval logic mixed with business rules, and zero separation between what the system wants to do and how AI executes it.

This state is not born out of negligence. It's born out of speed and a specific illusion that Generative AI creates: that integration is trivial. And it is, at the beginning. An API call, a response, value delivered. The problem is that systems are not prototypes, and most teams realize this too late.

The real cost of AI Spaghetti doesn't appear in tests. It appears in production, months later, when you need to change providers and discover that it requires surgery in dozens of files. When you need to investigate an incident and can't reconstruct the context of the call. When you try to measure the real operating cost and have no granularity. When you scale the team and realize that no one understands the full flow.

The irony is that the solution doesn't require any new technology. It requires applying well-established architectural principles to a domain that the industry is still learning to treat seriously.

The Fundamental Principle: AI as Capability

The central premise of good AI architecture is deceptively simple:

Generative AI is a capability, not an implementation.

Just as "persistence" is a capability that can be satisfied by Postgres, DynamoDB, or Redis and abstracted behind a repository, generative intelligence is a capability that can be satisfied by OpenAI, Anthropic, local models, or an ensemble running in parallel.

When you treat AI as an implementation, your business logic knows too much. It knows which SDK to use, which model to call, how to format the request, how to interpret the response. This intimacy creates coupling, and the cost of that coupling compounds like interest.

When you treat AI as a capability, the business logic declares what it needs—"I want someone to evaluate if this text is relevant to this question"—and the AI system fulfills that contract without exposing the details of how.

This principle has a name in design literature: it's the Dependency Inversion Principle applied to the AI domain. And its most natural architectural expression is Ports & Adapters, also known as Hexagonal Architecture.

The Anatomy of a Well-Designed AI Layer

A well-designed AI layer has clear boundaries between four distinct responsibilities. Confusing them is the root of most problems.

Contracts define what the AI system needs to deliver, without revealing how. An inference contract declares that given a prompt, a response with token usage and tracing metadata is expected. An embedding contract declares that given a text, a vector is expected. A retrieval contract declares that given a vector, documents ranked by relevance are expected. These contracts belong to the application domain, not the infrastructure.

Adapters are concrete implementations of these contracts. The Anthropic adapter knows how to call the Claude API. The OpenAI adapter knows how to call GPT. The Ollama adapter knows how to call local models. For the application, they are interchangeable because they fulfill the same contract.
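
To make the contract/adapter split concrete, here is a minimal sketch in Python. The names InferencePort, InferenceResult, EchoAdapter, and is_relevant are illustrative assumptions, not part of any provider SDK; a real adapter would wrap the provider's client inside complete().

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class InferenceResult:
    """What the inference contract promises: text, token usage, tracing metadata."""
    text: str
    input_tokens: int
    output_tokens: int
    metadata: dict = field(default_factory=dict)


class InferencePort(Protocol):
    """Contract owned by the application domain (hypothetical name)."""

    def complete(self, prompt: str) -> InferenceResult: ...


class EchoAdapter:
    """Stand-in adapter. A real one would call a provider SDK here."""

    def __init__(self, model: str) -> None:
        self.model = model

    def complete(self, prompt: str) -> InferenceResult:
        reply = f"echo: {prompt}"
        return InferenceResult(
            text=reply,
            # Word counts stand in for real token counts in this sketch.
            input_tokens=len(prompt.split()),
            output_tokens=len(reply.split()),
            metadata={"model": self.model},
        )


def is_relevant(llm: InferencePort, text: str, question: str) -> InferenceResult:
    """Business logic states what it needs and depends only on the contract."""
    return llm.complete(f"Is this text relevant to '{question}'? Text: {text}")
```

Because is_relevant only sees InferencePort, swapping EchoAdapter for any other object with a matching complete() method requires no change to the business logic.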

The orchestration service coordinates contracts to execute complex flows. A RAG pipeline, for example, orchestrates embedding → retrieval → context construction → inference → response validation. This service knows the contracts, not the adapters.

The AI infrastructure provides cross-cutting capabilities: prompt management, circuit breakers, telemetry, semantic cache, cost management. It doesn't execute business logic—it ensures the AI layer operates reliably.

The result of this separation is a system where switching from GPT-4 to Claude Opus is a configuration change, not a refactoring. Where adding a new provider takes hours, not days. Where testability is natural because you substitute contracts, not SDKs.

Prompt Management: The Ignored Discipline

If there's a single aspect where most production AI systems fail systematically, it's prompt management.

Prompts are first-class artifacts. They determine the system's behavior as much as any line of code. But while code has versioning, review, history, and tracing, prompts often live as literal strings buried in files, copied between environments without control, modified directly in production for convenience.

The problem becomes evident when something goes wrong. A production incident demands answers to the following questions: What version of the prompt was active? When was the last change? Who approved it? What was the behavior before? Systems without prompt control can't answer any of these.

The solution is to treat prompts with the same governance applied to code and configuration. This means versioning with clear semantics—prompts have versions that evolve with explicit meaning. It means change history and authorship tracing. It means promotion between environments with the same controls applied to deploys.
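
As an illustration of what this governance could look like in code, the sketch below models immutable prompt versions and per-environment promotion. PromptVersion and PromptRegistry are hypothetical names, and the in-memory dictionaries are an assumption; a real system would persist this in a database or version-controlled store.

```python
import hashlib
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str   # e.g. "1.2.0"
    template: str  # uses $variable placeholders
    author: str

    @property
    def checksum(self) -> str:
        """Short content hash, useful for tracing which text was served."""
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]

    def render(self, **variables: str) -> str:
        # substitute() raises KeyError when a placeholder has no value.
        return Template(self.template).substitute(**variables)


class PromptRegistry:
    """In-memory registry (assumption: real storage would be durable)."""

    def __init__(self) -> None:
        self._versions: dict = {}  # (name, version) -> PromptVersion
        self._active: dict = {}    # (environment, name) -> version

    def register(self, prompt: PromptVersion) -> None:
        key = (prompt.name, prompt.version)
        if key in self._versions:
            raise ValueError(f"{prompt.name}@{prompt.version} already exists")
        self._versions[key] = prompt

    def promote(self, env: str, name: str, version: str) -> None:
        if (name, version) not in self._versions:
            raise KeyError(f"unknown prompt {name}@{version}")
        self._active[(env, name)] = version

    def active(self, env: str, name: str) -> PromptVersion:
        return self._versions[(name, self._active[(env, name)])]
```

Registered versions are never overwritten, so "what prompt was in production?" reduces to looking up the active version for that environment at the time of the incident.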

There's also a frequently ignored dimension: prompts need testing. Not just manual validation of "looks good," but evaluation suites that verify if the current version still satisfies the expected behaviors—edge cases, failure scenarios, boundary instructions. All of this needs to be codified as verifiable expectations.

The inevitable corollary is that prompt changes must go through a review process. Just as no engineer pushes code to production without review, no prompt should be promoted without systematic evaluation of its impact on response quality.

RAG as Architecture, Not as Feature

Retrieval-Augmented Generation became a buzzword before it became a discipline. Most implementations treat RAG as a feature—a set of chained calls that "searches documents and sends them to the model." This works in demos. In production, it fragments.

Well-architected RAG is a pipeline with independent stages, each with clear responsibilities and explicit intervention points.

The ingestion phase—processing, chunking, and indexing of documents—needs to be completely separate from the serving phase. This separation is strategic: the chunking strategy has a direct impact on the quality of retrieval, and you need to iterate over it without affecting the system in production. Chunk hierarchy, overlap, enrichment with metadata—these decisions deserve controlled experimentation.

The retrieval phase has more nuances than it appears. Pure vector search is often not enough. Robust systems combine semantic search with keyword search (hybrid search), apply re-ranking to refine initial results, and filter by metadata to restrict the search space. Each of these stages is an independent optimization point.
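
One common way to combine the two result lists is reciprocal rank fusion (RRF). The text above does not prescribe a fusion method, so treat this as one possible sketch: the function names are hypothetical, and k=60 is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs; each list contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def hybrid_search(semantic_hits, keyword_hits, allowed_ids):
    """Fuse both rankings, then apply a metadata filter (here: an allow-list)."""
    fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
    return [doc_id for doc_id in fused if doc_id in allowed_ids]
```

A re-ranking stage would slot in after the fusion step and before the results are handed to context construction.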

Context construction is where most systems waste money and quality simultaneously. Sending all retrieved documents indiscriminately to the model is an anti-pattern. Context windows have a cost—each token sent is a paid token—and marginally relevant documents often harm more than they help, diluting the model's attention over what really matters. A well-designed assembler orchestrates selection within a token budget, formatting that favors model understanding, and compression of conversation history when necessary.
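
The selection step can be sketched as a greedy pass over chunks sorted by relevance. The names Chunk and assemble_context are hypothetical, and estimate_tokens uses a rough four-characters-per-token heuristic as a placeholder; a real assembler should count tokens with the target model's tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    score: float  # relevance score from retrieval or re-ranking


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Replace with a real tokenizer.
    return max(1, len(text) // 4)


def assemble_context(chunks, budget, min_score=0.0):
    """Pick the most relevant chunks that fit within the token budget."""
    selected = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if chunk.score < min_score:
            break  # everything after this point is even less relevant
        cost = estimate_tokens(chunk.text)
        if used + cost > budget:
            continue  # too large for the remaining budget; try smaller chunks
        selected.append(chunk)
        used += cost
    return selected, used
```

The min_score cutoff is what keeps marginally relevant documents out even when budget remains, which addresses the dilution problem and not just the cost problem.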

Response validation closes the cycle. Is the response grounded in the provided context? Did the model hallucinate information? Does the content satisfy safety criteria? These checks are not optional in systems that reach real users.

Resilience: What No One Plans Until the First Incident

LLM APIs have characteristics that make resilience not just desirable, but obligatory. Highly variable latency, where the P99 can be ten times the P50. Rate limits that arrive without warning during load periods. Partial failures where the model responds but with degraded quality. Provider interruptions that, without fallback, become service interruptions.

The Circuit Breaker is the first resilience instrument to implement. It monitors consecutive failures and, upon detecting degradation, stops sending requests that will likely fail—protecting the system from error cascades and giving the provider time to recover.
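
A minimal sketch of the pattern follows, assuming a threshold on consecutive failures and a fixed cooldown before a trial request is allowed. The class and parameter names are hypothetical, and the sketch is single-threaded; production code would need locking and finer-grained failure classification.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to send a request."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cooldown in seconds
        self._clock = clock             # injectable for tests
        self._failures = 0
        self._opened_at = None

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; request not sent")
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail immediately instead of waiting on a provider that is unlikely to answer, which is what stops the cascade.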

But the Circuit Breaker alone is not enough. Critical systems need fallback between providers. A model router that attempts the primary provider, detects failure, and automatically redirects to a secondary provider turns a vendor outage into a degradation the user barely notices. Because every adapter fulfills the same contract, the router needs no provider-specific logic.
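
The router itself can be very small, precisely because it talks to a contract. In this sketch, CompletionPort is a deliberately simplified, hypothetical contract that returns plain text, and FallbackRouter tries providers in priority order.

```python
from typing import Protocol


class CompletionPort(Protocol):
    """Simplified contract for this sketch (hypothetical name)."""

    def complete(self, prompt: str) -> str: ...


class AllProvidersFailed(RuntimeError):
    pass


class FallbackRouter:
    """Tries each provider in order and returns the first successful response."""

    def __init__(self, providers):
        # providers: list of (name, adapter) pairs, primary first
        self._providers = list(providers)

    def complete(self, prompt: str) -> str:
        errors = []
        for name, provider in self._providers:
            try:
                return provider.complete(prompt)
            except Exception as exc:  # a real router would filter by error type
                errors.append(f"{name}: {exc}")
        raise AllProvidersFailed("; ".join(errors))


class DownProvider:
    """Test double that simulates a provider outage."""

    def complete(self, prompt: str) -> str:
        raise ConnectionError("provider unavailable")


class UpProvider:
    """Test double that always answers."""

    def complete(self, prompt: str) -> str:
        return f"answer to: {prompt}"
```

Note that FallbackRouter also satisfies CompletionPort, so the application can receive a router wherever it expects a single adapter.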

Aggressive timeouts are as important as fallbacks and are more often ignored. An LLM call without a timeout can wait indefinitely, consuming resources and degrading the experience without failing explicitly. Every AI system needs timeouts by default.

Retry with exponential backoff resolves transient failures. But retry without a circuit breaker can worsen an already degraded situation. Both work together: retry for isolated failures, circuit breaker for systemic failures.
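
A sketch of the retry half of that pairing is below. The function name and defaults are hypothetical, and the random jitter is an addition not mentioned above; it is commonly used so that many clients do not retry in lockstep.

```python
import random
import time


def retry_with_backoff(
    fn,
    max_attempts=4,
    base_delay=0.5,
    max_delay=8.0,
    retry_on=(TimeoutError, ConnectionError),
    sleep=time.sleep,  # injectable so tests do not actually wait
):
    """Call fn(), retrying transient errors with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: wait between 50% and 100% of the computed delay.
            sleep(delay * random.uniform(0.5, 1.0))
```

Only errors listed in retry_on are retried; an error raised by an open circuit breaker would propagate immediately, which is the cooperation between the two mechanisms described above.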

Observability: The Three Dimensions That Matter

Observability in traditional systems revolves around the four golden signals: latency, traffic, errors, and saturation. AI systems need those signals plus three AI-specific layers.

Operational metrics are what the SRE monitors. Latency by provider and by model, because the P95 of inference is an SLO you need to fulfill. Error rate segmented by type (rate limit, timeout, content filter, network error), because each type has distinct causes and solutions. Token consumption volume by feature, because the cost in AI systems is proportional to consumption, and without granularity, you're flying blind in the budget.

Quality metrics are what product monitors. Rate of responses grounded in evidence, especially in RAG, where hallucination is the primary risk. Average relevance score of retrieved documents—if it's falling, your knowledge base may be outdated or your index may be degrading. Explicit user feedback—noisy, but irreplaceable. Distribution of response sizes—systematically very short or very long responses are symptoms of prompt problems.

Governance metrics are what compliance monitors. Distribution of prompt versions in production—you need to know if an old version is still serving traffic. Activation rate of content filters—spikes indicate abuse attempts. Call tracing—the ability to completely reconstruct the context of any interaction from a correlation ID.
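
Most of these metrics can be derived from a single structured record emitted per call. The sketch below shows one possible shape; CallRecord, traced_call, and the field names are assumptions, and sink stands in for whatever logging or telemetry backend is in use.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallRecord:
    correlation_id: str
    feature: str            # enables cost breakdown by feature
    provider: str
    model: str
    prompt_version: str     # enables the prompt-version distribution metric
    input_tokens: int
    output_tokens: int
    latency_ms: float
    error_type: Optional[str] = None  # e.g. "TimeoutError"


def traced_call(feature, provider, model, prompt_version, fn, sink, correlation_id=None):
    """Run fn() and emit one CallRecord to sink, whether it succeeds or fails.

    fn must return a (text, input_tokens, output_tokens) tuple.
    """
    correlation_id = correlation_id or str(uuid.uuid4())
    started = time.perf_counter()
    error_type = None
    input_tokens = output_tokens = 0
    try:
        text, input_tokens, output_tokens = fn()
        return text
    except Exception as exc:
        error_type = type(exc).__name__
        raise
    finally:
        sink(CallRecord(
            correlation_id, feature, provider, model, prompt_version,
            input_tokens, output_tokens,
            (time.perf_counter() - started) * 1000.0, error_type,
        ))
```

With the correlation ID propagated from the incoming request, the same identifier ties the record to retrieval logs and application logs for reconstruction during an investigation.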

The distinction between these layers matters because they have different audiences and cadences. Operational metrics are monitored in real-time with alerts. Quality metrics are analyzed in weekly dashboards. Governance metrics are audited periodically and on-demand in investigations.

Testability: The Natural Consequence of Good Contracts

A well-designed architecture is, almost by definition, testable. When business logic depends on contracts and not implementations, testing it doesn't require real calls to LLM APIs—it requires only deterministic substitutes for the contracts.

This solves a problem that paralyzes many teams: how to test AI flows without paying for tokens, without depending on the availability of the API, without flakiness due to response variability? The answer is that you don't test the AI—you test the logic around the AI.

The pyramid of tests for AI systems has three levels with distinct purposes.

Unit tests cover pure logic components: the context assembler that respects token budgets, the prompt renderer that interpolates variables correctly, the cost estimator. None of these tests needs an LLM.

Integration tests cover the orchestration of the complete pipeline, with deterministic substitutes in the contracts. The RAG flow can be tested with fixed embeddings, predefined documents, and a substituted LLM response. You validate that the orchestration is correct, that data flows between stages as expected, and that errors propagate properly.
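
As a sketch of such a test, the example below wires a small orchestration service to three deterministic fakes. RagService, the three contracts, and the fakes are all hypothetical names; the pipeline is reduced to embedding, retrieval, context construction, and inference to keep the example short.

```python
from typing import Protocol


class Embedder(Protocol):
    def embed(self, text: str) -> list: ...


class Retriever(Protocol):
    def search(self, vector: list, top_k: int) -> list: ...


class Llm(Protocol):
    def complete(self, prompt: str) -> str: ...


class RagService:
    """Orchestrates the contracts; knows nothing about concrete adapters."""

    def __init__(self, embedder: Embedder, retriever: Retriever, llm: Llm) -> None:
        self.embedder, self.retriever, self.llm = embedder, retriever, llm

    def answer(self, question: str, top_k: int = 2) -> str:
        vector = self.embedder.embed(question)
        docs = self.retriever.search(vector, top_k)
        context = "\n".join(f"- {doc}" for doc in docs)
        return self.llm.complete(f"Context:\n{context}\n\nQuestion: {question}")


class FakeEmbedder:
    def embed(self, text):
        return [1.0, 0.0]  # fixed embedding


class FakeRetriever:
    def __init__(self, docs):
        self.docs, self.seen_vector = docs, None

    def search(self, vector, top_k):
        self.seen_vector = vector  # recorded so the test can inspect data flow
        return self.docs[:top_k]


class FakeLlm:
    def __init__(self, reply):
        self.reply, self.prompts = reply, []

    def complete(self, prompt):
        self.prompts.append(prompt)
        return self.reply


def test_rag_orchestration():
    retriever = FakeRetriever(["Refunds take 5 days.", "Shipping is free.", "Unused doc"])
    llm = FakeLlm("canned answer")
    service = RagService(FakeEmbedder(), retriever, llm)

    assert service.answer("How long do refunds take?") == "canned answer"
    assert retriever.seen_vector == [1.0, 0.0]
    assert "Refunds take 5 days." in llm.prompts[0]
    assert "Unused doc" not in llm.prompts[0]  # top_k=2 was respected
```

The test runs in milliseconds, costs nothing, and fails only when the orchestration logic changes, not when a model's wording does.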

Evaluation tests cover semantic quality, and here LLMs are inevitable. The LLM-as-Judge technique automates the evaluation of responses against defined criteria: Is the response grounded in the context? Is it relevant to the question? An evaluator model, operating at temperature zero to reduce run-to-run variability (strict determinism is generally not guaranteed even then), produces scores that can be incorporated into the CI/CD pipeline. If the average grounding score falls below a threshold after a prompt change, the deployment is blocked.
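
The gating logic is independent of how the scores are produced, so it can be sketched with the judge as a plain callable. EvalCase, grounding_gate, and the 0.8 default threshold are assumptions; keyword_overlap_judge is a toy stand-in so the sketch runs offline, where a real pipeline would call an evaluator model.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class EvalCase:
    question: str
    context: str
    answer: str


def grounding_gate(cases, judge, threshold=0.8):
    """Return (passed, average_score). judge maps an EvalCase to a score in [0, 1]."""
    if not cases:
        raise ValueError("evaluation suite is empty")
    average = mean(judge(case) for case in cases)
    return average >= threshold, average


def keyword_overlap_judge(case: EvalCase) -> float:
    """Toy judge: fraction of answer words that also appear in the context.

    Placeholder only; it does not measure grounding the way an LLM judge would.
    """
    answer_words = case.answer.lower().split()
    context_words = set(case.context.lower().split())
    if not answer_words:
        return 0.0
    return sum(word in context_words for word in answer_words) / len(answer_words)
```

In CI, a failed gate would translate into a non-zero exit code so that the pipeline blocks the promotion of the prompt change.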

Unit tests and integration tests run on every commit, in seconds, without cost. Evaluation tests run on significant pull requests, in minutes, with controlled cost. This separation is what makes the approach practical at scale.

Multi-Agent: When One Model Is Not Enough

Multi-agent systems are emerging as a pattern for tasks that exceed the capability of a single LLM in a single call. The architecture is powerful and therefore deserves special attention to its traps.

The central pattern is the separation between orchestrator and executors. The orchestrator receives the high-level task and decides the necessary sequence of actions. The executors are specialized—one agent investigates, one writes, one validates