Autonomous Scientific Reproduction via Multi-Agent Hierarchical Orchestration: A Deep Analysis of System Architecture
Summary: This document dissects the architecture of a scientific reproduction agent system, examining its design trade-offs, consistency invariants, resilience patterns, and interface contracts between subsystems. The level of analysis assumes familiarity with distributed systems, LLM internals, and production software engineering.

1. System Taxonomy and Fundamental Invariants
The system operates under three non-negotiable architectural invariants:
- Verifiable Determinism: the final verdict is issued by pytest, an oracle external to the LLM, eliminating the self-evaluation bias (sycophancy loop) that contaminates pipelines where the model judges its own output.
- Externalizable and Durable State: no critical state resides exclusively in the context window. All progress is materialized in persistent artifacts (SQLite DAG, Git commits, Vector Store), making the system restartable and resumable after arbitrary failures.
- First-Class Observability: Trace Tree and Budget Guard are not secondary elements. They are first-class citizens of the design, architecturally positioned at the same level as the functional layers.
The formal classification of the system is: agentic workflow with DAG topology, hybrid serial/parallel execution, and deterministic rollback mechanism.
2. Setup Phase: Formal Contracts as Specification Layer
2.1 Problem Classifier: Ontological Divergence of Solution Space
The Problem Classifier resolves a fundamental ambiguity before any allocation of computational resources: is the solution space unimodal or multimodal?
Convergent: there is a unique attractor in the output space, being a numerical metric, a threshold, or a deterministic result. The loss function is unimodal, and the stopping criterion is objective.
Divergent: the solution space is multimodal, with multiple equally valid implementations. It requires pruning heuristics (Tree of Thoughts) and stochastic sampling (Best of N).
This bifurcation is critical because it defines which subset of the Cognition Layer will be activated and with what intensity. Convergent problems are routed to more deterministic execution, while divergent ones activate exploratory capabilities.
2.2 Definition of Done: Formalization of Criteria as Verifiable Contracts
The five contractual criteria of the Definition of Done are, in practice, post-condition invariants, analogous to assertions in Design by Contract (DbC). Each criterion has the following structure:
criterion := {
    id: UUID,
    metric: Callable[[Artifact], float],
    threshold: float,
    tolerance: float,  # margin for p75 matching
    weight: float      # for partial scoring
}
These contracts are injected downstream into the Verifier as test fixtures and into the Spec Layer Verdict as a scoring rubric. The separation between definition (Setup Phase) and evaluation (Verifier) implements the principle of separation of concerns between specification and execution.
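To make the criterion structure concrete, here is a minimal Python sketch of a criterion and a weighted partial score. It is an illustration under stated assumptions, not the system's actual code: the `Artifact` alias, the `passes` rule (higher is better, with `tolerance` relaxing the threshold), and the `partial_score` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List
from uuid import UUID, uuid4

Artifact = Dict[str, float]  # hypothetical stand-in for a real artifact type


@dataclass(frozen=True)
class Criterion:
    metric: Callable[[Artifact], float]
    threshold: float
    tolerance: float
    weight: float
    id: UUID = field(default_factory=uuid4)

    def passes(self, artifact: Artifact) -> bool:
        # Assumed semantics: higher is better; tolerance relaxes the threshold.
        return self.metric(artifact) >= self.threshold - self.tolerance


def partial_score(criteria: List[Criterion], artifact: Artifact) -> float:
    """Weighted fraction of passing criteria, in [0, 1]."""
    total = sum(c.weight for c in criteria)
    passed = sum(c.weight for c in criteria if c.passes(artifact))
    return passed / total if total else 0.0


criteria = [
    Criterion(metric=lambda a: a["accuracy"], threshold=0.90, tolerance=0.02, weight=3.0),
    Criterion(metric=lambda a: a["f1"], threshold=0.85, tolerance=0.0, weight=1.0),
]
artifact = {"accuracy": 0.89, "f1": 0.80}
score = partial_score(criteria, artifact)  # accuracy passes within tolerance, f1 fails
```

Keeping the criterion immutable (`frozen=True`) mirrors the idea that the contract is fixed in the Setup Phase and only read downstream.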
3. Worker Components: Pipeline with Explicit Feedback Loops
3.1 Paper Analyzer: Structured Semantics Extraction from Unstructured Documents
The Paper Analyzer faces the classic problem of information extraction from semi-structured scientific documents. The PDF is a rendering format, not a semantic one. LaTeX equations are rendered as vectors, tables as bounding boxes, and references as plain text.
The component must resolve three main challenges:
Mathematical AST extraction: converting rendered equations into a manipulable symbolic representation, such as a SymPy tree.
Prior elicitation: identifying implicit assumptions of the paper, including assumed distributions, undocumented normalizations, and default hyperparameters.
Experimental protocol reconstruction: inferring the ML pipeline from natural language descriptions, tolerating ambiguity and underspecification.
The output is a structured experimental spec that serves as a formal input for the Code Implementer.
3.2 Code Implementer: Architect/Editor Split as Separation of Abstraction Levels
The Architect/Editor pattern is an instance of hierarchical task decomposition applied to code generation:
Architect Agent:
- Defines interfaces, abstractions, and contracts between modules
- Produces scaffolding with type signatures and docstrings
- Does not implement business logic, only structure
Editor Agent:
- Receives scaffolding as fixed context
- Implements function bodies within defined interfaces
- Works with a focused context window, which lowers the risk of hallucination caused by context drift
The Architect operates with low decision entropy, prioritizing structure over details. The Editor operates with highly constraining context. Each agent has a smaller, well-defined decision space, which reduces the likelihood of incoherent outputs.
The fail → regenerate loop is a retry with backpressure: failures in the Verifier propagate a structured diff back to the Code Implementer, not a generic repetition instruction. This implements gradient-like feedback in the discrete space of code generation.
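The feedback loop described above can be sketched as follows. This is a toy illustration: the `Failure` record, the `regenerate_until_pass` helper, and the two stand-in functions are hypothetical names, and the real Code Implementer and Verifier would be an LLM call and a pytest run respectively.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Failure:
    """One failed check, passed back as structured feedback (hypothetical shape)."""
    test_id: str
    expected: str
    actual: str


def regenerate_until_pass(
    generate: Callable[[List[Failure]], str],
    verify: Callable[[str], List[Failure]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Regenerate until verify() reports no failures or attempts run out."""
    feedback: List[Failure] = []
    for _ in range(max_attempts):
        code = generate(feedback)  # the generator sees the previous failures
        feedback = verify(code)
        if not feedback:
            return code
    return None


# Toy stand-ins for the Code Implementer and the Verifier.
def toy_generate(feedback: List[Failure]) -> str:
    if feedback:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"


def toy_verify(code: str) -> List[Failure]:
    namespace: dict = {}
    exec(code, namespace)
    actual = namespace["add"](2, 3)
    return [] if actual == 5 else [Failure("test_add", "5", str(actual))]


result = regenerate_until_pass(toy_generate, toy_verify)
```

The bounded `max_attempts` is what keeps the retry from running unchecked; in the system described here, the Budget Guard would presumably play that role.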
3.3 Experimenter: Sandboxed Execution with Subagent Parallelism
The Experimenter executes over the Sandbox REPL (a persistent Docker container) with five specialized subagents in series. The serial rather than parallel topology is a deliberate design choice for three reasons:
- Each subagent consumes the state produced by the previous one
- Allows incremental validation, detecting failures before expensive computation
- Maintains linearity of the Task DAG for auditability
The persistent Docker container addresses the problem of environment reproducibility: the same filesystem, dependencies, and environment variables are preserved between executions, eliminating the class of bugs known as "works on my machine".
3.4 Verifier: Pytest as External Oracle and Anti-Hallucination Gate
The Verifier is the most critical component of the system from the point of view of correctness guarantees. The use of pytest as an evaluation mechanism produces three architecturally significant effects:
- Removes the LLM from the critical path of evaluation: the model cannot rationalize an incorrect output as correct
- Guarantees idempotence: the same artifacts always produce the same verdict, provided the tests themselves are deterministic
- Enables regression testing: new iterations are validated against the same test suite
The integration with the Definition of Done is direct: each contractual criterion is materialized as a test case with an explicit assertion.
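One way to picture a criterion materialized as a test case is the sketch below. It assumes the metrics have already been computed and are available in a dictionary; `RESULTS`, `make_test`, and the thresholds are hypothetical. Functions named `test_*` with plain `assert` statements follow pytest's default discovery convention, so no pytest import is needed in the test module itself.

```python
from typing import Callable, Dict

# Stand-in for metrics read from the experiment's artifacts (hypothetical values).
RESULTS: Dict[str, float] = {"accuracy": 0.91, "f1": 0.86}


def make_test(metric: str, threshold: float, tolerance: float) -> Callable[[], None]:
    """Build one test function from one Definition of Done criterion."""
    def test() -> None:
        value = RESULTS[metric]
        assert value >= threshold - tolerance, (
            f"{metric}={value} is below {threshold} (tolerance {tolerance})"
        )
    test.__name__ = f"test_{metric}"
    return test


# Module-level names starting with "test_" match pytest's default collection rule.
test_accuracy = make_test("accuracy", threshold=0.90, tolerance=0.02)
test_f1 = make_test("f1", threshold=0.85, tolerance=0.0)
```

Because the assertion message carries the metric, value, and threshold, a failure can be turned into structured feedback rather than a bare pass/fail signal.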
4. Cognition Layer: Capabilities as First-Class Callable Abstractions
The Cognition Layer implements the Strategy Pattern applied to reasoning capabilities. Each capability is an interchangeable strategy, selected dynamically based on the problem type.
4.1 Thinking Channel: Chain-of-Thought as Explicit Computation Graph
The Thinking Channel forces the generation of a reasoning trace before the final output. From an inference perspective, this allocates additional tokens for intermediate computation, analogous to increasing the depth of a computation graph. The trade-off is direct: greater latency and cost are accepted in exchange for a lower error rate in multi-step reasoning.
4.2 Compute Allocator: Adaptive Token Budgeting
Resolves the problem of optimal allocation of computational resources under budget constraint. Implements a scheduling policy that estimates the subproblem complexity and allocates max_tokens proportionally. Connects to the Budget Guard for enforcement of hard limits.
The DOWNWARD propagation means that the budget allocated by the Compute Allocator becomes a constraint injected into subsequent workers, implementing resource backpressure.
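A proportional allocation policy of this kind might look like the sketch below. The function name `allocate_tokens`, the per-task `floor`, and the complexity estimates are assumptions for illustration; the source does not specify how complexity is estimated or how the hard limit is enforced.

```python
from typing import Dict


def allocate_tokens(
    complexity: Dict[str, float], total_budget: int, floor: int = 256
) -> Dict[str, int]:
    """Split total_budget across subproblems in proportion to estimated complexity.

    Every subproblem gets at least `floor` tokens; the remainder is shared
    proportionally. The result never exceeds total_budget (the hard limit).
    """
    if floor * len(complexity) > total_budget:
        raise ValueError("budget too small for the per-task floor")
    total_complexity = sum(complexity.values())
    if total_complexity <= 0:
        raise ValueError("complexity estimates must sum to a positive value")
    spare = total_budget - floor * len(complexity)
    return {
        name: floor + int(spare * c / total_complexity)
        for name, c in complexity.items()
    }


budgets = allocate_tokens(
    {"parse": 1.0, "implement": 3.0, "verify": 1.0}, total_budget=8000, floor=500
)
```

Rounding down with `int()` means the allocations can sum to slightly less than the budget but never more, which is the safe direction for a hard cap.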
4.3 Best of N Sampler: Ensemble Decoding with Verifier-in-the-Loop
Generates N candidates with temperature greater than zero (stochastic sampling) and uses the Verifier as a discriminator to select the best. This transforms the generation problem into a search problem in the output space, with a fitness function that is external to the generator model.
The cost is O(N) in generation tokens. The benefit is a substantial reduction in variance in the final output.
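The selection step can be sketched in a few lines. This is a toy under stated assumptions: `best_of_n` is a hypothetical helper, the generator is a seeded random draw standing in for a sampled LLM completion, and the scoring function stands in for the Verifier.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def best_of_n(
    generate: Callable[[random.Random], T],
    score: Callable[[T], float],
    n: int,
    seed: int = 0,
) -> T:
    """Draw n candidates and return the one the external scorer ranks highest."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]  # cost grows linearly with n
    return max(candidates, key=score)


# Toy problem: candidates are noisy estimates of a known target value.
TARGET = 1.0


def noisy_estimate(rng: random.Random) -> float:
    return rng.gauss(TARGET, 0.5)


def closeness(x: float) -> float:
    return -abs(x - TARGET)


single = best_of_n(noisy_estimate, closeness, n=1)
best = best_of_n(noisy_estimate, closeness, n=8)
```

With the same seed, the first candidate is identical in both calls, so the best of eight can never score worse than the single sample.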
4.4 Tree of Thoughts: Beam Search in the Reasoning Space
Implements breadth-first exploration of the reasoning space with three fundamental operations:
expand(node) → List[ThoughtNode] # generates branches
score(node) → float # evaluates promise
prune(tree) → Tree # eliminates low-score branches
It is essentially beam search applied to chains of reasoning instead of token sequences. Useful for problems where the optimal reasoning path is not greedy.
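The three operations combine into a short search loop. The sketch below is a toy: thought nodes are tuples of integers, `expand` and `score` are hand-written stand-ins for LLM calls, and `tree_search` and `beam_width` are names chosen for illustration.

```python
from typing import Callable, List, Tuple

Thought = Tuple[int, ...]  # toy thought node: the sequence of steps chosen so far


def prune(
    nodes: List[Thought], score: Callable[[Thought], float], beam_width: int
) -> List[Thought]:
    """Keep only the beam_width highest-scoring nodes."""
    return sorted(nodes, key=score, reverse=True)[:beam_width]


def tree_search(
    root: Thought,
    expand: Callable[[Thought], List[Thought]],
    score: Callable[[Thought], float],
    beam_width: int,
    depth: int,
) -> Thought:
    """Level-by-level expansion with pruning after each level."""
    frontier = [root]
    for _ in range(depth):
        children = [child for node in frontier for child in expand(node)]
        if not children:
            break
        frontier = prune(children, score, beam_width)
    return max(frontier, key=score)


# Toy task: pick three steps from {1, 2, 3} whose sum reaches a target.
TARGET = 7


def expand(node: Thought) -> List[Thought]:
    return [node + (step,) for step in (1, 2, 3)]


def score(node: Thought) -> float:
    return -abs(TARGET - sum(node))


best = tree_search((), expand, score, beam_width=2, depth=3)
```

Setting `beam_width=1` reduces this to greedy search, which is the case the pattern is meant to improve on.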
4.5 Step Back Reasoner: Abstraction Lifting before Grounding
Forces the model to abstract the general principle before tackling the specific case, mitigating the problem of overfitting to the statement, where the model treats the problem as pattern matching instead of reasoning by fundamental principles.
The DOWNWARD propagation injects the abstracted principle as additional context into the workers, functioning as an informed prior for subsequent generation.
5. State Infrastructure: Multi-Modal Persistence
5.1 Task DAG (SQLite): Workflow Engine with Durable State
The Task DAG is the central nervous system of pipeline coordination. It stores the task graph with the following schema:
CREATE TABLE tasks (
    id TEXT PRIMARY KEY,
    parent_id TEXT REFERENCES tasks(id),
    status TEXT CHECK(status IN ('pending','running','done','failed')),
    subgoal_index INTEGER, -- 0..7 (8 subgoals)
    artifact_path TEXT,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);
The choice of SQLite instead of a distributed message broker (Kafka, RabbitMQ) is deliberate. It prioritizes operational simplicity and ACID guarantees without the overhead of distributed infrastructure. The state survives restarts because SQLite is a file on disk: as long as that file lives on a persistent volume, container restarts do not erase progress.
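The resume-after-restart behavior can be illustrated with Python's standard `sqlite3` module. This is a minimal sketch: the `open_db` and `next_unfinished` helpers, the task ids, and the policy of treating a task left in `running` as resumable are assumptions, not details given in the source.

```python
import os
import sqlite3
import tempfile
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id TEXT PRIMARY KEY,
    parent_id TEXT REFERENCES tasks(id),
    status TEXT CHECK(status IN ('pending','running','done','failed')),
    subgoal_index INTEGER,
    artifact_path TEXT,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
)
"""


def open_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def next_unfinished(conn: sqlite3.Connection) -> Optional[str]:
    """Return the earliest subgoal that is not done (assumed resume policy)."""
    row = conn.execute(
        "SELECT id FROM tasks WHERE status IN ('pending', 'running') "
        "ORDER BY subgoal_index LIMIT 1"
    ).fetchone()
    return row[0] if row else None


path = os.path.join(tempfile.mkdtemp(), "dag.sqlite")

conn = open_db(path)
with conn:  # commits the transaction on success
    conn.executemany(
        "INSERT INTO tasks (id, status, subgoal_index) VALUES (?, ?, ?)",
        [("t0", "done", 0), ("t1", "running", 1), ("t2", "pending", 2)],
    )
conn.close()  # simulate the process going away

conn = open_db(path)  # a fresh process reopening the same file
resume_from = next_unfinished(conn)
conn.close()
```

The `CHECK` constraint on `status` also means an invalid state transition is rejected by the database itself rather than by application code.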
5.2 Bi-Temporal Memory: Event Sourcing with Dual Temporal Dimension
The bi-temporal memory stores two independent temporal dimensions per record:
valid_from / valid_to: when the fact was true in the real world, such as the moment a paper reported a certain value.
recorded_at: when the system recorded this information.
This enables queries like: "What did the system believe about result Y at time T1, even if it later discovered it was wrong at T2?" This capability is essential for retroactive debugging of long-running pipelines.
It is a partial implementation of Event Sourcing: the system never overwrites records, only appends new ones with updated timestamps.
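A query of the "what did the system believe at time T1" kind can be sketched over an append-only list. The `Fact` record, the `believed_at` helper, and the use of integers as logical timestamps are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    key: str
    value: float
    valid_from: int          # when the fact became true in the real world
    valid_to: Optional[int]  # None means still valid
    recorded_at: int         # when the system learned it


def believed_at(
    log: List[Fact], key: str, valid_time: int, as_of: int
) -> Optional[Fact]:
    """Latest record about `key` at `valid_time`, using only what was known by `as_of`."""
    candidates = [
        f for f in log
        if f.key == key
        and f.recorded_at <= as_of
        and f.valid_from <= valid_time
        and (f.valid_to is None or valid_time < f.valid_to)
    ]
    return max(candidates, key=lambda f: f.recorded_at, default=None)


log: List[Fact] = []
# At time 10 the system records the value it extracted from the paper.
log.append(Fact("paper_accuracy", 0.93, valid_from=0, valid_to=None, recorded_at=10))
# At time 20 it appends a correction instead of overwriting the earlier record.
log.append(Fact("paper_accuracy", 0.91, valid_from=0, valid_to=None, recorded_at=20))

before = believed_at(log, "paper_accuracy", valid_time=5, as_of=15)
after = believed_at(log, "paper_accuracy", valid_time=5, as_of=25)
```

Because the earlier record is never deleted, the mistaken belief remains queryable, which is what makes retroactive debugging possible.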
5.3 Vector Store (bge-m3): Semantic Retrieval for Knowledge Reuse
The bge-m3 model produces high-quality dense embeddings for cross-lingual semantic retrieval. The Vector Store enables three primary use cases:
Few-shot retrieval: recovering similar implementations of previous experiments as in-context examples.
Prior code reuse: avoiding regeneration of already implemented utilities.
Semantic deduplication: identifying equivalent subproblems with different textual formulations.
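The retrieval step behind these use cases reduces to nearest-neighbor search over embeddings. The sketch below uses hand-written three-dimensional vectors as stand-ins for real bge-m3 embeddings; the file names, the `top_k` helper, and the in-memory dictionary are hypothetical.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: List[float], store: Dict[str, List[float]], k: int) -> List[str]:
    """Return the k stored keys whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda name: cosine(query, store[name]), reverse=True)
    return ranked[:k]


# Toy embeddings; a real store would hold vectors produced by the embedding model.
store = {
    "train_loop.py": [0.9, 0.1, 0.0],
    "plot_utils.py": [0.0, 0.2, 0.9],
    "eval_loop.py": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # stands in for the embedding of a new subproblem

matches = top_k(query, store, k=2)
```

For semantic deduplication, the same similarity score would be compared against a threshold instead of being used for ranking.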
5.4 Git Checkpointer: Auditability as a First-Class Feature
Each state transition in the pipeline generates an atomic commit with a structured message:
[STEP:3/8] Code Implementer: generated regression module
- Files changed: src/model.py, src/preprocessing.py
- Verifier status: PENDING
- Budget consumed: $0.0012
This transforms the Git history into an append-only audit log with granularity per step, enabling surgical rollback to any intermediate point without losing previous progress.
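A structured message like the one above is easy to generate and to parse back when auditing. The helpers below are a sketch; `format_commit_message` and `parse_step` are hypothetical names, and the actual commit call is left out.

```python
import re
from typing import List, Optional, Tuple


def format_commit_message(
    step: int,
    total: int,
    component: str,
    summary: str,
    files: List[str],
    verifier_status: str,
    budget_usd: float,
) -> str:
    """Build the structured commit message for one pipeline step."""
    return "\n".join([
        f"[STEP:{step}/{total}] {component}: {summary}",
        f"- Files changed: {', '.join(files)}",
        f"- Verifier status: {verifier_status}",
        f"- Budget consumed: ${budget_usd:.4f}",
    ])


def parse_step(message: str) -> Optional[Tuple[int, int]]:
    """Recover (step, total) from a commit subject line, or None if absent."""
    match = re.match(r"\[STEP:(\d+)/(\d+)\]", message)
    return (int(match.group(1)), int(match.group(2))) if match else None


message = format_commit_message(
    step=3,
    total=8,
    component="Code Implementer",
    summary="generated regression module",
    files=["src/model.py", "src/preprocessing.py"],
    verifier_status="PENDING",
    budget_usd=0.0012,
)
```

Keeping the step index in a fixed position on the subject line is what lets a rollback tool locate the commit for a given step without reading the diff.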
6. Hardening Layer: Defense in Depth for Output Quality
6.1 Linter Gate: Static Analysis as a Circuit Breaker
The Linter Gate implements the Circuit Breaker pattern applied to code quality:
state machine:
    CLOSED    → code passes linter → continues pipeline
    OPEN      → code fails linter → auto-revert via Git Checkpointer
    HALF-OPEN → after revert, Code Implementer receives diff of violations
The auto-revert is the architectural differentiator: instead of propagating problematic code downstream, where it would be more expensive to detect the error, the gate interrupts immediately and restores the last valid state. This implements the principle of fail-fast with consistent state.
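The three states can be sketched as a small class. This is an illustration under assumptions: `LinterGate`, the `lint` and `revert` callables, and the toy line-length rule are hypothetical, and a real gate would invoke an actual linter and the Git Checkpointer.

```python
from enum import Enum
from typing import Callable, List


class GateState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class LinterGate:
    def __init__(self, lint: Callable[[str], List[str]], revert: Callable[[], None]):
        self.lint = lint
        self.revert = revert
        self.state = GateState.CLOSED
        self.violations: List[str] = []

    def submit(self, code: str) -> bool:
        """Return True if the code may continue down the pipeline."""
        violations = self.lint(code)
        if not violations:
            self.state = GateState.CLOSED
            self.violations = []
            return True
        self.state = GateState.OPEN       # stop the pipeline immediately
        self.revert()                     # restore the last valid state
        self.violations = violations      # feedback for the Code Implementer
        self.state = GateState.HALF_OPEN  # waiting for a corrected submission
        return False


# Toy linter: flag lines longer than 40 characters.
def toy_lint(code: str) -> List[str]:
    return [
        f"line {i}: too long"
        for i, line in enumerate(code.splitlines(), start=1)
        if len(line) > 40
    ]


reverts: List[str] = []
gate = LinterGate(toy_lint, revert=lambda: reverts.append("reverted"))

ok_first = gate.submit("x = 1")
ok_second = gate.submit("y = " + "1 + " * 20 + "1")
state_after_failure = gate.state
ok_third = gate.submit("y = 21")
```

The gate only returns to `CLOSED` after a clean submission, so downstream stages never see code that failed static analysis.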
6.2 Self Refine Loop: Iterative Self-Improvement with Critique-then-Improve