Modular Architecture with Generative AI: Treating Intelligence as Capability, Not as Dependency

There is a specific moment in the life of a system that incorporates Generative AI when the architectural debt becomes unbearable. It's not when the first bug appears. It's when someone asks: "What prompt was in production when that happened?" and no one knows how to answer.

The Unnamed Problem

I call AI Spaghetti the state where AI logic is chaotically distributed throughout the codebase: model calls within controllers, prompt assembly by concatenating strings in helpers, retrieval logic mixed with business rules, and zero separation between what the system wants to do and how the AI executes it.

This state is not born out of negligence. It's born out of speed and a specific illusion created by Generative AI: the illusion that integration is trivial. And it is, at first. One API call, one response, value delivered. The problem is that systems are not prototypes, and most teams realize this too late.

The real cost of AI Spaghetti doesn't appear in tests. It appears in production, months later, when you need to switch providers and discover that it requires surgery in dozens of files. When you need to investigate an incident and can't reconstruct the context of the call. When you try to measure the real operating cost and have no granularity. When you scale the team and realize that no one understands the full flow.

The irony is that the solution doesn't require any new technology. It requires applying well-established architectural principles to a domain that the industry is still learning to treat seriously.

The Fundamental Principle: AI as Capability

The central premise of good AI architecture is deceptively simple:

Generative AI is a capability, not an implementation.

Just as "persistence" is a capability that can be satisfied by Postgres, DynamoDB, or Redis and abstracted behind a repository, generative intelligence is a capability that can be satisfied by OpenAI, Anthropic, local models, or an ensemble running in parallel.

When you treat AI as an implementation, your business logic knows too much. It knows which SDK to use, which model to call, how to format the request, how to interpret the response. This promiscuity creates coupling, and that coupling is a debt repaid with interest.

When you treat AI as a capability, the business logic declares what it needs ("I want someone to evaluate if this text is relevant to this question") and the AI system fulfills that contract without exposing the details of how.

This principle has a name in design literature: it's Dependency Inversion applied to the AI domain. And its most natural architectural expression is Ports & Adapters, the Hexagonal Architecture.

The Anatomy of a Well-Designed AI Layer

A well-designed AI layer has clear boundaries between four distinct responsibilities. Confusing them is the root of most problems.

Contracts define what the AI system needs to deliver, without revealing how. An inference contract declares that given a prompt, a response with token usage and tracing metadata is expected. An embedding contract declares that given text, a vector is expected. A retrieval contract declares that given a vector, documents ranked by relevance are expected. These contracts belong to the application domain, not the infrastructure.
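The three contracts above can be sketched as structural interfaces. This is a minimal illustration assuming Python 3.9+; the names InferencePort, EmbeddingPort, RetrievalPort, and the fields of the result types are my own placeholders, not a prescribed API.

```python
from dataclasses import dataclass, field
from typing import Protocol, Sequence


@dataclass(frozen=True)
class InferenceResult:
    text: str
    input_tokens: int
    output_tokens: int
    # Tracing metadata, e.g. model name or prompt version (hypothetical keys).
    trace: dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class RankedDocument:
    doc_id: str
    content: str
    score: float  # higher means more relevant


class InferencePort(Protocol):
    """Given a prompt, return a response with token usage and trace data."""
    def complete(self, prompt: str) -> InferenceResult: ...


class EmbeddingPort(Protocol):
    """Given text, return a vector."""
    def embed(self, text: str) -> Sequence[float]: ...


class RetrievalPort(Protocol):
    """Given a vector, return documents ranked by relevance."""
    def search(self, vector: Sequence[float], top_k: int) -> list[RankedDocument]: ...
```

Nothing in these definitions imports a vendor SDK, which is the point: they can live in the application's domain package.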

Adapters are concrete implementations of these contracts. The Anthropic adapter knows how to call the Claude API. The OpenAI adapter knows how to call GPT. The Ollama adapter knows how to call local models. For the application, they are all interchangeable and fulfill the same contract.
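A hedged sketch of what interchangeable adapters might look like. To avoid inventing vendor SDK calls, each adapter here wraps a hypothetical transport function; in a real adapter that function would be the provider's client call. ProviderAdapter and select_adapter are illustrative names.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class InferenceResult:
    text: str
    total_tokens: int
    provider: str


class InferencePort(Protocol):
    def complete(self, prompt: str) -> InferenceResult: ...


# Hypothetical transport: takes a prompt, returns (text, total_tokens).
# A real adapter would call the vendor SDK here instead.
Transport = Callable[[str], tuple[str, int]]


class ProviderAdapter:
    """Adapts one provider's transport to the shared inference contract."""

    def __init__(self, provider: str, transport: Transport) -> None:
        self._provider = provider
        self._transport = transport

    def complete(self, prompt: str) -> InferenceResult:
        text, tokens = self._transport(prompt)
        return InferenceResult(text=text, total_tokens=tokens, provider=self._provider)


def select_adapter(
    config: dict[str, str], adapters: dict[str, InferencePort]
) -> InferencePort:
    """Pick the adapter named in configuration; callers only see the contract."""
    name = config["inference_provider"]
    if name not in adapters:
        raise KeyError(f"no adapter registered for provider {name!r}")
    return adapters[name]
```

With this shape, changing providers means changing the value of inference_provider, while the calling code stays untouched.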

The orchestration service coordinates contracts to execute complex flows. A RAG pipeline, for example, orchestrates embedding → retrieval → context assembly → inference → response validation. This service knows the contracts, not the adapters.
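The RAG flow described above might be orchestrated roughly like this. It is a simplified sketch: the contracts are reduced to plain strings, the prompt format is arbitrary, and validation is only an empty-response check. RagService and NO_CONTEXT_ANSWER are hypothetical names.

```python
from typing import Protocol, Sequence


class EmbeddingPort(Protocol):
    def embed(self, text: str) -> Sequence[float]: ...


class RetrievalPort(Protocol):
    def search(self, vector: Sequence[float], top_k: int) -> list[str]: ...


class InferencePort(Protocol):
    def complete(self, prompt: str) -> str: ...


NO_CONTEXT_ANSWER = "No supporting documents were found."


class RagService:
    """Coordinates contracts; it never imports a provider SDK."""

    def __init__(
        self,
        embedder: EmbeddingPort,
        retriever: RetrievalPort,
        llm: InferencePort,
        top_k: int = 3,
    ) -> None:
        self._embedder = embedder
        self._retriever = retriever
        self._llm = llm
        self._top_k = top_k

    def answer(self, question: str) -> str:
        vector = self._embedder.embed(question)                  # embedding
        documents = self._retriever.search(vector, self._top_k)  # retrieval
        if not documents:
            return NO_CONTEXT_ANSWER  # skip inference when nothing was retrieved
        context = "\n\n".join(documents)                         # context assembly
        prompt = f"Context:\n{context}\n\nQuestion: {question}"
        response = self._llm.complete(prompt)                    # inference
        if not response.strip():                                 # response validation
            raise ValueError("model returned an empty response")
        return response
```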

The AI infrastructure provides cross-cutting capabilities: prompt management, circuit breakers, telemetry, semantic cache, cost management. It does not execute business logic; it ensures that the AI layer operates reliably.

The result of this separation is a system where switching from GPT-4 to Claude Opus is a configuration change, not a refactoring. Where adding a new provider takes hours, not days. Where testability is natural because you substitute contracts, not SDKs.

Prompt Management: The Ignored Discipline

If there's one aspect where most production AI systems fail systematically, it's prompt management.

Prompts are first-class artifacts. They determine the system's behavior as much as any line of code. But while code has versioning, review, history, and tracing, prompts often live as literal strings buried in files, copied between environments without control, modified directly in production for convenience.

The problem becomes evident when something goes wrong. A production incident demands answers to the following questions: What version of the prompt was active? When was the last change? Who approved it? What was the previous behavior? Systems without prompt control cannot answer any of these.

The solution is to treat prompts with the same governance applied to code and configuration. This means versioning with clear semantics: prompts have versions that evolve with explicit meaning. It means change history and authorship tracing. It means promotion between environments with the same controls applied to deploys.
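One way to picture this governance is an in-memory registry where versions are immutable and environments point at a specific version. This is a sketch under my own assumptions (the field names, the use of string.Template, and the PromptRegistry API are all illustrative); a production system would back it with a database or a Git repository.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str     # e.g. "1.2.0"
    template: str    # string.Template syntax: $variable
    author: str
    changed_at: str  # ISO 8601 date


class PromptRegistry:
    def __init__(self) -> None:
        self._versions: dict[tuple[str, str], PromptVersion] = {}
        self._active: dict[tuple[str, str], str] = {}  # (env, name) -> version

    def register(self, prompt: PromptVersion) -> None:
        key = (prompt.name, prompt.version)
        if key in self._versions:
            # Published versions are never edited in place.
            raise ValueError(f"{prompt.name} {prompt.version} already exists")
        self._versions[key] = prompt

    def promote(self, name: str, version: str, env: str) -> None:
        if (name, version) not in self._versions:
            raise KeyError(f"unknown prompt version {name} {version}")
        self._active[(env, name)] = version

    def render(self, name: str, env: str, variables: dict[str, str]) -> tuple[str, str]:
        """Return the rendered prompt and the version that produced it."""
        version = self._active[(env, name)]
        prompt = self._versions[(name, version)]
        return Template(prompt.template).substitute(variables), version
```

Returning the version alongside the rendered text lets every call record which prompt was live, which is exactly the question an incident review asks.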

There's also a frequently neglected dimension: prompts need tests. Not just manual validation of "looks good," but test suites that verify if the current version still satisfies the expected behaviors: edge cases, failure scenarios, boundary instructions. All of this needs to be codified as verifiable expectations.
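Codified expectations can be as simple as a list of cases, each pairing an input with a predicate over the response. The sketch below assumes two injected functions, render (builds the prompt) and complete (calls the model); both are placeholders, as are PromptCase and run_prompt_suite.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PromptCase:
    description: str
    user_input: str
    check: Callable[[str], bool]  # predicate over the model's response


def run_prompt_suite(
    render: Callable[[str], str],
    complete: Callable[[str], str],
    cases: Sequence[PromptCase],
) -> list[str]:
    """Return the descriptions of all cases whose expectation failed."""
    failures = []
    for case in cases:
        response = complete(render(case.user_input))
        if not case.check(response):
            failures.append(case.description)
    return failures
```

An empty result means the current prompt version still satisfies every recorded expectation; a non-empty one names the behaviors that regressed.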

The inevitable corollary is that prompt changes need to go through a review process. Just as no engineer pushes code to production without review, no prompt should be promoted without systematic evaluation of its impact on response quality.

RAG as Architecture, Not as Feature

Retrieval-Augmented Generation became a buzzword before it became a discipline. Most implementations treat RAG as a feature: a set of chained calls that "searches documents and sends them to the model." This works in demos. In production, it fragments.

Well-architected RAG is a pipeline with independent stages, each with clear responsibilities and explicit intervention points.

The ingestion phase (processing, chunking, and indexing documents) needs to be completely separate from the serving phase. This separation is strategic: the chunking strategy has a direct impact on retrieval quality, and you need to iterate over it without affecting the production system. Chunk hierarchy, overlap, enrichment with metadata: these decisions deserve controlled experimentation.

The retrieval phase has more nuances than it appears. Pure vector search is often not enough. Robust systems combine semantic search with keyword search (hybrid search), apply reranking to refine initial results, and filter by metadata to restrict the search space. Each of these stages is an independent optimization point.
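The text does not say how semantic and keyword results should be merged. One common option is Reciprocal Rank Fusion, which I use here purely as an example: each document scores the sum of 1 / (k + rank) across the rankings it appears in, with k = 60 as a conventional default. The metadata filter is an equally simplified illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def filter_by_metadata(
    doc_ids: list[str],
    metadata: dict[str, dict[str, str]],
    required: dict[str, str],
) -> list[str]:
    """Keep only documents whose metadata matches every required key/value."""
    return [
        doc_id
        for doc_id in doc_ids
        if all(metadata.get(doc_id, {}).get(key) == value for key, value in required.items())
    ]
```

Because fusion and filtering are separate functions, each can be tuned or replaced (for instance by a learned reranker) without touching the other.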

Context assembly is where most systems waste money and quality simultaneously. Sending all retrieved documents indiscriminately to the model is an anti-pattern. Context windows have a cost: each token sent is a paid token, and marginally relevant documents often harm more than they help, diluting the model's attention over what really matters. A well-designed assembler orchestrates selection within a token budget, formatting that favors model understanding, and compression of conversation history when necessary.
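Selection within a token budget can be sketched as a greedy pass over documents sorted by relevance. The count_tokens function is an assumption: in practice it would be the tokenizer matching the target model, and the min_score cutoff is an arbitrary illustrative parameter.

```python
from typing import Callable


def assemble_context(
    documents: list[tuple[str, float]],  # (text, relevance score)
    token_budget: int,
    count_tokens: Callable[[str], int],
    min_score: float = 0.0,
) -> list[str]:
    """Pick the most relevant documents that fit within the token budget."""
    selected: list[str] = []
    used = 0
    for text, score in sorted(documents, key=lambda d: d[1], reverse=True):
        if score < min_score:
            break  # everything after this point is even less relevant
        cost = count_tokens(text)
        if used + cost > token_budget:
            continue  # too large, but a smaller document may still fit
        selected.append(text)
        used += cost
    return selected
```

The relevance cutoff matters as much as the budget: it drops marginal documents even when there is room for them.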

Response validation closes the cycle. Is the response grounded in the provided context? Did the model hallucinate information? Does the content satisfy safety criteria? These checks are not optional in systems that reach real users.

Resilience: What No One Plans Until the First Incident

LLM APIs have characteristics that make resilience not just desirable, but obligatory. Highly variable latency, where the P99 can be ten times the P50. Rate limits that arrive without warning during load periods. Partial failures where the model responds but with degraded quality. Provider interruptions that, without fallback, become service interruptions.

The Circuit Breaker is the first resilience instrument to implement. It monitors consecutive failures and, upon detecting degradation, stops sending requests that will likely fail: it protects the system from error cascades and gives the provider time to recover.
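A minimal circuit breaker along these lines, assuming a consecutive-failure threshold and a fixed cool-down before a trial call is allowed. The threshold and cool-down values are illustrative defaults, and the clock is injectable so the behavior can be tested without waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open; request not sent")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```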

But the Circuit Breaker in isolation is not enough. Critical systems need fallback between providers. A model router that attempts the primary provider, detects failure, and automatically directs to the secondary provider converts a vendor unavailability into an imperceptible degradation for the user. The symmetry of contracts in adapters makes this possible without additional logic.
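Because every adapter fulfills the same contract, a router can itself be an implementation of that contract. The sketch below tries providers in order; FallbackRouter and AllProvidersFailed are hypothetical names, and catching every Exception is a simplification (a real router would likely distinguish retriable failures from programming errors).

```python
from typing import Protocol, Sequence


class InferencePort(Protocol):
    def complete(self, prompt: str) -> str: ...


class AllProvidersFailed(RuntimeError):
    """Raised when no provider in the chain produced a response."""


class FallbackRouter:
    """Tries each provider in order and returns the first successful response."""

    def __init__(self, providers: Sequence[InferencePort]) -> None:
        if not providers:
            raise ValueError("at least one provider is required")
        self._providers = list(providers)

    def complete(self, prompt: str) -> str:
        errors: list[Exception] = []
        for provider in self._providers:
            try:
                return provider.complete(prompt)
            except Exception as exc:  # simplification: treat any error as a failure
                errors.append(exc)
        raise AllProvidersFailed(f"{len(errors)} provider(s) failed") from errors[-1]
```

The router exposes the same complete method as any single adapter, so the orchestration service does not need to know that fallback exists.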

Aggressive timeouts are as important as fallbacks, and frequently more neglected. A call to an LLM without a timeout can wait indefinitely, consuming resources and degrading the experience without ever failing explicitly. Every AI system needs timeouts by default.
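For asynchronous clients, a default timeout can be enforced with asyncio.wait_for from the standard library. In this sketch the call argument stands in for any async inference function, and the 10-second default is an arbitrary placeholder, not a recommendation.

```python
import asyncio
from typing import Awaitable, Callable


class InferenceTimeout(RuntimeError):
    """Raised when the model call exceeds its time budget."""


async def complete_with_timeout(
    call: Callable[[str], Awaitable[str]],
    prompt: str,
    timeout_s: float = 10.0,
) -> str:
    try:
        # wait_for cancels the underlying call once the timeout expires.
        return await asyncio.wait_for(call(prompt), timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        raise InferenceTimeout(f"no response within {timeout_s}s") from exc
```

Converting the timeout into an explicit error is what allows retries, circuit breakers, and fallbacks to react to it.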

Retry with exponential backoff resolves transient failures. But retry without a circuit breaker can worsen an already degraded situation. The two work together: retry for isolated failures, circuit breaker for systemic failures.
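A sketch of retry with exponential backoff and jitter. It assumes the adapter maps retriable provider failures (rate limits, brief network errors) to a TransientError type, which is a hypothetical class defined here; all numeric defaults are illustrative.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the failure
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Only TransientError is retried; any other exception propagates immediately. Wrapping the retried call in a circuit breaker keeps retries from piling onto a provider that is already failing systemically.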

Observability: The Three Dimensions That Matter

Observability in traditional systems revolves around the four golden signals: latency, traffic, errors, and saturation. AI systems need these signals plus three additional layers specific to AI.

Operational metrics are what the SRE monitors. Latency by provider and by model, because the P95 of inference is an SLO that needs to be honored. Error rate segmented by type, because rate limits, timeouts, content filters, and network errors have different causes and solutions. Token volume consumed by feature, because cost in AI systems is proportional to consumption, and without granularity, you're flying blind in the budget.
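Cost granularity by feature falls out naturally if every call is recorded with its feature tag and token counts. In the sketch below the record shape is my own assumption and the price table is supplied by the caller; no real provider prices are implied.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    feature: str        # product feature that triggered the call
    model: str
    input_tokens: int
    output_tokens: int


def cost_by_feature(
    records: list[CallRecord],
    price_per_1k: dict[str, tuple[float, float]],  # model -> (input, output) price
) -> dict[str, float]:
    """Aggregate spend per feature from per-call token counts."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        input_price, output_price = price_per_1k[record.model]
        totals[record.feature] += (
            record.input_tokens / 1000 * input_price
            + record.output_tokens / 1000 * output_price
        )
    return dict(totals)
```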

Quality metrics are what the product monitors. Rate of responses grounded in evidence, especially in RAG, where hallucination is the primary risk. Average relevance score of retrieved documents: if it's falling, your knowledge base may be outdated or your index may be degrading. Explicit user feedback, noisy but irreplaceable. Distribution of response sizes: systematically very short or very long responses are symptoms of prompt problems.

Governance metrics are what compliance monitors. Distribution of prompt versions in production: you need to know if an old version is still serving traffic. Activation rate of content filters: spikes indicate abuse attempts. Call tracing: the ability to fully reconstruct the context of any interaction from a correlation ID.
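Call tracing and the prompt-version distribution can both be derived from one record per interaction. This is an in-memory sketch with hypothetical names (CallTrace, TraceLog); a real system would write these records to durable storage.

```python
import uuid
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallTrace:
    correlation_id: str
    prompt_name: str
    prompt_version: str
    model: str
    rendered_prompt: str
    response: str


class TraceLog:
    def __init__(self) -> None:
        self._by_id: dict[str, CallTrace] = {}

    def record(
        self,
        prompt_name: str,
        prompt_version: str,
        model: str,
        rendered_prompt: str,
        response: str,
        correlation_id: Optional[str] = None,
    ) -> str:
        """Store one interaction and return its correlation ID."""
        cid = correlation_id or str(uuid.uuid4())
        self._by_id[cid] = CallTrace(
            cid, prompt_name, prompt_version, model, rendered_prompt, response
        )
        return cid

    def reconstruct(self, correlation_id: str) -> CallTrace:
        """Return the full context of an interaction from its correlation ID."""
        return self._by_id[correlation_id]

    def version_distribution(self, prompt_name: str) -> dict[str, int]:
        """Count how many recorded calls each prompt version served."""
        return dict(
            Counter(
                t.prompt_version for t in self._by_id.values() if t.prompt_name == prompt_name
            )
        )
```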

The distinction between these layers matters because they have different audiences and cadences. Operational metrics are monitored in real-time via alerts. Quality metrics are analyzed in weekly dashboards. Governance metrics are audited periodically and on-demand in investigations.

Testability: The Natural Consequence of Good Contracts

A well-designed architecture is, almost by definition, testable. When business logic depends on contracts and not implementations, testing it doesn't require real calls to LLM APIs: it only requires deterministic substitutes for the contracts.

This solves a problem that paralyzes many teams: how to test AI flows without paying for tokens, without depending on API availability, without flakiness due to response variability? The answer is that you don't test the AI, you test the logic around the AI.

The test pyramid for AI systems has three levels with distinct purposes.

Unit tests cover pure logic components: the context assembler that respects token budgets, the prompt renderer that correctly interpolates variables, the cost estimator. None of these tests needs an LLM.

Integration tests cover the orchestration of the complete pipeline, with deterministic substitutes in contracts. The RAG flow can be tested with fixed embeddings, predefined documents, and a substituted LLM response. You validate that the orchestration is correct, that data flows between stages as expected, that errors propagate properly.
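As an illustration, here is what such an integration test might look like for a deliberately tiny pipeline. Everything in it is a stand-in: run_pipeline is a simplified orchestration function, and the fakes return fixed values so the assertions are fully deterministic.

```python
from typing import Callable, Sequence


def run_pipeline(
    question: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float]], list[str]],
    complete: Callable[[str], str],
) -> str:
    """Simplified orchestration: embedding -> retrieval -> assembly -> inference."""
    vector = embed(question)
    documents = search(vector)
    prompt = "\n".join([*documents, question])
    return complete(prompt)


# Deterministic substitutes for the three contracts.
FIXED_VECTOR = (0.1, 0.2, 0.3)
FIXED_DOCS = ["Paris is the capital of France."]


def fake_embed(text: str) -> Sequence[float]:
    return FIXED_VECTOR


def fake_search(vector: Sequence[float]) -> list[str]:
    assert vector == FIXED_VECTOR  # retrieval receives the embedding output
    return FIXED_DOCS


def fake_complete(prompt: str) -> str:
    return f"echo:{prompt}"  # echoing exposes exactly what the model was sent


def failing_search(vector: Sequence[float]) -> list[str]:
    raise ConnectionError("index unavailable")


def test_data_flows_between_stages() -> None:
    answer = run_pipeline("Capital of France?", fake_embed, fake_search, fake_complete)
    assert answer == "echo:Paris is the capital of France.\nCapital of France?"


def test_retrieval_errors_propagate() -> None:
    try:
        run_pipeline("q", fake_embed, failing_search, fake_complete)
    except ConnectionError:
        return
    raise AssertionError("expected the retrieval error to propagate")
```

Neither test touches a network or spends a token, so both can run on every commit.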

Evaluation tests cover semantic quality, and here LLMs are inevitable. The LLM-as-Judge technique automates the evaluation of responses against defined criteria: is the response grounded in the context? Is it relevant to the question? An evaluator model, operating at temperature zero to minimize run-to-run variability (which reduces, but does not always eliminate, nondeterminism), produces scores that can be incorporated into the CI/CD pipeline. If the average grounding score falls below a threshold after a prompt change, the deployment is blocked.
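The deployment gate itself is a small piece of deterministic logic around the judge. In this sketch the judge is an injected function that returns a grounding score between 0 and 1; how it prompts the evaluator model is out of scope, and the 0.8 threshold is an arbitrary example value.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    question: str
    context: str
    answer: str


# Hypothetical judge: wraps the evaluator model, returns a score in [0, 1].
Judge = Callable[[EvalCase], float]


def grounding_gate(
    cases: Sequence[EvalCase], judge: Judge, threshold: float = 0.8
) -> tuple[bool, float]:
    """Return (passed, average score); a failed gate should block the deploy."""
    if not cases:
        raise ValueError("no evaluation cases provided")
    average = mean([judge(case) for case in cases])
    return average >= threshold, average
```

In a CI job, a False result would translate into a non-zero exit code so the pipeline stops before promotion.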

Unit tests and integration tests run on every commit, in seconds, without cost. Evaluation tests run on significant pull requests, in minutes, with controlled cost. This separation is what makes the approach practical at scale.

Multi-Agent: When One Model Is Not Enough

Multi-agent systems are emerging as a pattern for tasks that exceed the capability of a single LLM