Forensic Inference: Structured Agent Reasoning Over Accumulated Facts

Progressive Fact Accumulation and Prolog-Based Transitive Analysis for Cross-Domain Debugging

Abstract

AI agents debugging complex systems face a structural problem: their reasoning happens entirely within a context window that is lossy, unverifiable, and does not scale. An agent tracing a five-hop causal chain across code, configuration, and runtime behavior must hold the entire chain in working memory, re-read files when the context overflows, and produce conclusions no other agent can verify.

Forensic Inference introduces a paradigm where structured facts accumulate progressively from code lensing, agent observations, and tool executions, and Prolog performs formal transitive reasoning over the unified fact base. The agent’s role shifts from “trace the bug” to “build the map and query it.” Evidence collection and logical inference are separated into distinct phases: LLMs are good at fuzzy pattern matching and knowing what to observe; Prolog is good at exhaustive search and sound transitive inference.

The core observation: agents already perform the observation work during debugging. They read code, grep logs, inspect configurations. Today those observations exist as natural language in the context window, consumed once and discarded. If those same observations also emit structured facts as a side effect, the marginal cost of building a queryable knowledge base approaches zero, and the benefits (transitive reasoning, cross-session knowledge, replayable proofs) are pure upside.

Forensic Inference generalizes Consequential Analysis: Semantic Verification of AI-Generated Code (code verification via AST + LLM + Prolog) to any domain an agent can observe: runtime behavior, deployment configuration, network topology, timing characteristics, and cross-system interactions.


Problem Landscape

The Context Window Ceiling

When an AI agent investigates a complex bug, it performs a depth-first walk through the system:

  1. Read file A, note a function call
  2. Grep for callers, find file B
  3. Read file B, note a configuration dependency
  4. Read the config, note a timeout value
  5. Read the proxy documentation, note an idle timeout
  6. Mentally connect: the code change removed keepalives, the proxy kills idle connections after 30 seconds, the application timeout fires at 35 seconds, the response never reaches the client

This is a five-hop causal chain spanning code, configuration, and infrastructure. A frontier model with a large context window can usually trace chains of this length. But the approach has structural limitations:

It doesn’t scale. Causal chains of 10-15 hops across large codebases with multiple services and configuration layers exceed what LLMs can reliably hold. They skip branches, conflate similar paths, and hallucinate connections.

It’s not exhaustive. The agent follows one path at a time. If the bug involves two interacting causal chains (a code change AND a config change that independently would be fine but together cause failure), the agent may find one and miss the interaction.

It’s not verifiable. The conclusion exists as natural language. Another agent investigating the same issue starts from zero with no structured record of what was observed, what chains were considered, or why alternatives were rejected.

It’s not reusable. Observations are discarded when the session ends. A related bug tomorrow requires re-reading the same files and re-discovering the same facts.

What Agents Already Do (and Waste)

Consider what happens during a typical debugging session. The agent reads 8-15 source files, greps for function references, reads configuration files and deployment manifests, examines log output and error messages, reads documentation for external services, and builds an internal model of how components interact.

Each of these actions produces observations: “this function calls that function,” “this config sets the timeout to 30 seconds,” “this log shows the request was killed at T+30s.” These observations are the raw material of debugging. Today, all of this material exists only in the context window. The structured knowledge (the call graph fragment, the configuration values, the timing relationships) is never externalized in a queryable form.

This is waste. The agent did the work of observation. The structured output was never captured.

Static Analysis Covers One Domain

Static analysis tools (CodeQL, Semgrep, Datalog-based analyzers) extract structural facts from source code. They are fast, deterministic, and useful for code-level issues. But bugs don’t live in code alone. A timeout regression involving a code change, a reverse proxy configuration, an application timing parameter, and an HTTP protocol behavior spans four domains. No static analysis tool operates across all four simultaneously. The causal chain only becomes visible when code structure, proxy behavior, protocol mechanics, and application timing are unified in a single queryable fact base.


Forensic Inference Architecture

Design Principle: Separation of Observation and Reasoning

LLMs observe, Prolog reasons. LLMs excel at reading code, interpreting logs, recognizing patterns across heterogeneous data, and deciding what is relevant to observe. Prolog excels at exhaustive transitive search, sound logical inference, multi-hop queries over large fact bases, and contradiction detection. Combining them in a single LLM prompt conflates both tasks and gets neither right reliably. Separating them lets each operate in its strength.

Layer 1: Progressive Fact Accumulation

Facts flow into the Prolog knowledge base from multiple sources, each with different characteristics:

Code structural facts (from Invariant)

Invariant extracts deterministic facts from source code via tree-sitter AST parsing: function definitions, call graphs, module dependencies, visibility, line numbers. These facts are deterministic (same code always produces the same facts), cached by code checksum, progressively updated, and language-agnostic (Python, Rust, TypeScript, JavaScript, Go, Elixir).

Invariant predicates:

function(FuncId, ModuleId, Name, Arity, Visibility, Line, CommitSha).
module(ModuleId, ModuleName, Filepath, Line, CommitSha).
depends_on(ModuleId, DepName, Kind, Line).
calls_external(CallerId, Module, FuncName, Arity, Line).
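
For concreteness, a handful of facts in this schema might look like the following. The module, function, and checksum values here are invented purely for illustration:

```prolog
% Hypothetical facts for a small payments module (illustrative values only):
module(m1, 'payments.api', 'src/payments/api.py', 1, 'a1b2c3d').
function(f1, m1, handle_refund, 2, public, 42, 'a1b2c3d').
depends_on(m1, 'payments.gateway', import, 3).
calls_external(f1, 'payments.gateway', charge_reversal, 1, 57).
```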

By the time an agent encounters a bug, these facts already exist. The code-structural layer is warm before any investigation begins.

Semantic enrichment (from LLM inference)

LLM-powered classification augments structural facts with semantic markers: function intent, side effect detection, purity analysis, security markers. These are probabilistic but cached by code checksum and incremental.

intent(FuncId, IntentCategory, Confidence).
side_effect(FuncId, EffectType, Target).
purity(FuncId, IsPure, Reason).
security_marker(FuncId, MarkerType, Severity).

Agent observations (from investigation)

During debugging, agents assert behavioral and environmental facts:

config_value(Component, Key, Value).
runtime_behavior(Component, Behavior, Observed).
timing(Event, DurationMs).
error_observed(Component, ErrorType, Message, Timestamp).
causes(EventA, EventB).

These are markers the agent plants as it explores the system. The marginal cost of asserting them is near zero because the agent was already reading the code, logs, and config.

Tool execution metadata (automatic)

Every tool call produces structured metadata that can be automatically asserted:

tool_called(ToolName, Ref, Timestamp).
tool_duration(Ref, DurationMs).
tool_result(Ref, Status).
tool_error(Ref, ErrorType, Message).

The platform emits these as a side effect of normal tool execution. No agent effort required.

Layer 2: Forensic Rule Library

Pre-built Prolog rules that operate over standard predicates. Agents invoke these through queries without writing Prolog directly.

Causal analysis rules find and trace cause-effect relationships:

causal_chain(A, B, [A, B]) :- causes(A, B).
causal_chain(A, C, [A | Rest]) :-
    causes(A, B),
    dif(A, C),
    causal_chain(B, C, Rest).

blast_radius(Origin, Affected) :-
    causal_chain(Origin, Affected, _),
    dif(Origin, Affected).

all_affected(Origin, AffectedList) :-
    findall(A, blast_radius(Origin, A), RawList),
    sort(RawList, AffectedList).
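
As a sketch of how these rules behave in practice, consider a toy causal graph (the event names below are invented for illustration) queried with the rules above loaded:

```prolog
% Toy causal graph:
causes(config_change, pool_exhaustion).
causes(pool_exhaustion, request_queueing).
causes(request_queueing, latency_spike).

% ?- all_affected(config_change, Affected).
% Affected = [latency_spike, pool_exhaustion, request_queueing].
```

Backtracking enumerates every downstream event reachable from the origin, and sort/2 deduplicates events reachable along multiple chains.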

Invariant checking rules verify that system contracts hold:

invariant_holds(Name) :-
    invariant(Name, Condition),
    call(Condition).

invariant_violated(Name, Condition) :-
    invariant(Name, Condition),
    \+ call(Condition).

all_violations(Violations) :-
    findall(inv(Name, Cond), invariant_violated(Name, Cond), Violations).

Dependency analysis rules detect gaps and cycles:

missing_dependency(X, Dep) :-
    requires(X, Dep),
    \+ provides(_, Dep).

:- table transitive_dep/2.

transitive_dep(A, B) :- depends_on(A, B, _, _).
transitive_dep(A, C) :-
    depends_on(A, B, _, _),
    transitive_dep(B, C),
    A \== C.

circular_dep(A, B) :-
    transitive_dep(A, B),
    transitive_dep(B, A),
    A @< B.
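
A minimal sketch of cycle detection, assuming the tabled transitive_dep/2 rule above is loaded (module names invented for illustration):

```prolog
% Toy cycle: auth -> session -> user -> auth
depends_on(auth, session, import, 1).
depends_on(session, user, import, 2).
depends_on(user, auth, import, 3).

% ?- circular_dep(A, B).
% Yields each mutually reachable pair once:
% (auth, session), (auth, user), and (session, user).
```

The A @< B guard reports each cycle pair once rather than in both orders, and tabling keeps the recursion terminating even though the dependency graph is cyclic.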

Contradiction detection rules find conflicting facts:

contradiction(S1, S2, Prop) :-
    asserts_property(S1, Prop, V1),
    asserts_property(S2, Prop, V2),
    V1 \== V2,
    S1 @< S2.
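
For example, if two sources disagree about a configuration value (source and value names below are hypothetical), the rule surfaces the conflict:

```prolog
% Two sources disagree about the proxy's idle timeout:
asserts_property(proxy_docs, idle_timeout_ms, 60000).
asserts_property(proxy_config, idle_timeout_ms, 30000).

% ?- contradiction(S1, S2, Prop).
% S1 = proxy_config, S2 = proxy_docs, Prop = idle_timeout_ms.
```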

Cross-domain rules join code-structural and behavioral facts:

code_runtime_mismatch(Func, CodeBehavior, RuntimeBehavior) :-
    code_behavior(Func, CodeBehavior),
    runtime_behavior(Func, RuntimeBehavior, _),
    dif(CodeBehavior, RuntimeBehavior).

timeout_risk(Component, TimeoutMs, DeadlineMs) :-
    config_value(Component, idle_timeout_ms, TimeoutMs),
    config_value(_, deadline_ms, DeadlineMs),
    DeadlineMs > TimeoutMs.
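
A sketch of the mismatch rule in action, assuming the agent has asserted one structural observation and one runtime observation about the same function (the function and behavior names are invented for illustration):

```prolog
% The code appears to stream the response, but runtime logs show buffering:
code_behavior(send_report, streams_response).
runtime_behavior(send_report, buffers_response, log_evidence).

% ?- code_runtime_mismatch(F, Code, Runtime).
% F = send_report, Code = streams_response, Runtime = buffers_response.
```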

Rule libraries are composable. Security-focused, performance-focused, and correctness-focused libraries can coexist and be applied selectively to any fact base.

Layer 3: Query Interface

Agents interact with forensic inference through Logic Cell primitives:

  • logic.assert with namespace _forensic to plant facts during investigation
  • logic.query with namespace _forensic (or cross-namespace) to run forensic queries, including raw Prolog goals
  • logic.constrain to define custom rules when the library doesn’t cover a case
  • logic.reflect to inspect what facts have been collected in the session

No new composite tool is needed. The power comes from the rule library and the ambient fact accumulation, not from a wrapper.


The Forensic Inference Debugging Shift

Depth-First Navigation (Today)

Traditional agent debugging is depth-first: the agent follows one trail at a time, holding the accumulated path in context.

Read file A → grep for callers → read file B → grep for config →
read config C → mentally connect A → B → C → conclude

Problems: wrong branches require backtracking, long chains overflow working memory, exhaustive path exploration is impractical, and conclusions are unverifiable.

Breadth-First Mapping (Forensic Inference)

Forensic inference is breadth-first: the agent reads and asserts, reads and asserts, then queries.

Read file A, assert facts → read file B, assert facts →
read config C, assert facts → query: causal_chain(X, timeout, Path)
→ Prolog returns all valid chains

The agent’s job shifts from tracing to mapping. Prolog’s backtracking explores every path through the fact base exhaustively, including paths the agent might never have followed manually because they crossed system boundaries.

Worked Example: HTTP Proxy Timeout Regression

A long-running API endpoint begins returning 504 Gateway Timeout errors from the reverse proxy after a code change. The endpoint worked correctly before the change.

Traditional debugging (depth-first):

An agent reads approximately 8 files and traces a 5-hop causal chain manually. It initially proposes the wrong fix (lowering the application deadline from 35s to 25s). After that fix is rejected, deeper investigation uncovers the root cause: a code change switched the response handler from chunked delivery (with keepalive signals) to buffered delivery, eliminating the mechanism that prevented the proxy from treating the connection as idle.

Forensic inference (breadth-first mapping):

Step 1: Agent reads code and uses logic.assert with namespace _forensic to plant structural observations:

causes(buffered_delivery, no_keepalives).
causes(no_keepalives, connection_appears_idle).
causes(connection_appears_idle, proxy_kills_connection).
causes(proxy_kills_connection, response_never_delivered).
causes(response_never_delivered, client_sees_504).
config_value(reverse_proxy, idle_timeout_ms, 30000).
config_value(app_server, deadline_ms, 35000).

Step 2: Agent defines an invariant via logic.assert:

invariant(proxy_protection,
    (delivery_mode(_, Mode), sends_keepalives(Mode, true))).

Step 3: Agent uses logic.query on the _forensic namespace:

?- causal_chain(buffered_delivery, client_sees_504, Chain).
Chain = [buffered_delivery, no_keepalives, connection_appears_idle,
         proxy_kills_connection, response_never_delivered, client_sees_504].

?- timeout_risk(Component, Timeout, Deadline).
Component = reverse_proxy, Timeout = 30000, Deadline = 35000.

?- invariant_violated(Name, Condition).
Name = proxy_protection, Condition = ... .

The agent identifies both the causal chain (missing keepalives) and the configuration risk (deadline > proxy timeout) on the first attempt, without proposing the wrong fix. Every conclusion is traceable to specific asserted facts.


Economics of Forensic Inference

Why Fact Assertion Is Free

The traditional objection to structured knowledge bases is the cost of populating them. Forensic inference eliminates this through three mechanisms:

Continuous code lensing (Invariant). Invariant extracts code-structural facts progressively: initial lens on project setup, incremental updates on each push. Tree-sitter extraction runs in ~50-100ms per file, cached by checksum. Semantic inference costs ~$0.0003-0.0005 per function, also cached. Full repo lens: ~$2-10 one-time, then ~$1-3 per PR.

Observation shadows (agent work). When an agent reads a file, it forms observations that today exist only as natural language. If the agent also emits structured facts, the marginal cost is ~4 tokens per fact. Fifty facts cost ~200 tokens. Compare to the alternative: a subsequent agent re-reads the same 15 files, costing 10,000+ tokens. The 200-token investment yields a 50x return.

Automatic metadata (tool execution). Every tool call already produces structured metadata. Routing it to the fact base costs nothing.

The Token Tradeoff

Asserting 50 structured facts:     ~200 tokens
Re-reading 15 files from scratch:  ~10,000+ tokens
Savings ratio:                     50:1

The facts are also queryable in ways natural language is not. You cannot run causal_chain(X, timeout_504, Path) against a paragraph of debugging notes.


Relationship to Existing Systems

Consequential Analysis

Forensic Inference is the generalization of Consequential Analysis: Semantic Verification of AI-Generated Code. The three-layer architecture maps directly:

Consequential Analysis                 Forensic Inference
Layer 1: tree-sitter AST facts         Layer 1: code facts via Invariant
Layer 2: LLM intent inference          Layer 1: semantic enrichment + agent observations
Layer 3: Prolog consequence queries    Layer 2: forensic rule library + Layer 3 queries

CA’s fact sources are code-only. FI’s fact sources are anything an agent can observe. CA detects code-level semantic violations. FI detects cross-domain causal chains spanning code, config, runtime, and infrastructure.

Persistent Logic Substrates

Forensic Inference is a usage pattern over Logic Cells (persistent Prolog fact stores with namespace isolation), using the _forensic namespace convention, a pre-loaded rule library, and a structured workflow (observe, assert, query).

Continuous Code Lensing

Invariant’s pipeline (AST extraction via tree-sitter, semantic enrichment via LLMs) keeps code-structural facts warm. Forensic Inference joins these with behavioral observations from other domains. Predicate alignment between Invariant’s schema and the forensic rule library enables cross-domain queries without translation.


Limitations and Trade-Offs

Observation Quality

Forensic inference is only as good as the facts asserted. If an agent asserts inaccurate facts, Prolog will produce conclusions that are logically sound but factually wrong. Progressive lensing provides an accurate code-structural baseline; agent observations augment it rather than replace it.

Schema Friction

Agents must know which predicates to use. The standard vocabulary (causes, requires, provides, config_value, runtime_behavior) covers common patterns, and logic.constrain handles novel situations.

Not For Simple Bugs

A typo, a missing null check, a wrong variable name: these are found faster by reading the code. Forensic inference is for multi-hop, cross-domain investigations where context-window reasoning hits its limits.

Prolog Performance

Prolog is fast for typical fact bases (<10ms per query for thousands of facts), but pathological queries with deep recursion can be expensive. Tabling and depth bounds mitigate this.


Future Directions

Multi-Agent Forensic Collaboration

Multiple agents investigating different aspects of the same system can assert facts into a shared _forensic namespace. Agent A investigates the code path; Agent B investigates the infrastructure; Agent C queries across both fact sets. The shared Logic Cell substrate enables collaboration without explicit coordination.

Temporal Fact Versioning

Facts tagged with timestamps enable temporal queries: “what changed between the working deployment and the broken one?” Comparing fact snapshots across commits or deployments surfaces differences without requiring the agent to discover them manually.
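
One possible encoding, sketched with a hypothetical fact_at/2 wrapper (the predicate name and deployment identifiers are assumptions, not part of the current schema):

```prolog
% Tag each observation with the deployment it was observed in:
fact_at(deploy_41, config_value(reverse_proxy, idle_timeout_ms, 60000)).
fact_at(deploy_42, config_value(reverse_proxy, idle_timeout_ms, 30000)).

% A config value that differs between two snapshots:
changed_between(D1, D2, config_value(C, K, V1), config_value(C, K, V2)) :-
    fact_at(D1, config_value(C, K, V1)),
    fact_at(D2, config_value(C, K, V2)),
    V1 \== V2.

% ?- changed_between(deploy_41, deploy_42, F1, F2).
% F1 = config_value(reverse_proxy, idle_timeout_ms, 60000),
% F2 = config_value(reverse_proxy, idle_timeout_ms, 30000).
```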

Cartographic Navigation

A related paradigm: using the accumulated fact base for navigation rather than forensics. Agents exploring an unfamiliar codebase could query “what’s reachable from this entry point?” or “what’s the topology around this module?” This extends latent space navigation concepts into program and computation space.

Composable Rule Libraries

Rule libraries are shareable artifacts. Security-focused, performance-focused, and compliance-focused libraries can coexist and be applied to any fact base using standard predicates. The more libraries developed, the more powerful forensic inference becomes for any individual investigation.

Cross-Repository Reasoning

For systems spanning multiple repositories, forensic inference can operate across boundaries if facts from multiple Invariant lensing sessions are unified in a single Logic Cell namespace. “Does a change to the auth service affect the billing service?” becomes queryable.
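
A sketch of how such a query could be expressed once facts from both repositories share a namespace. The repo_module/2 predicate here is an invented tagging convention, and the rule reuses the library's transitive_dep/2:

```prolog
% Hypothetical repo tags over modules lensed from two repositories:
repo_module(auth_service, m_auth_token).
repo_module(billing_service, m_billing_charge).
depends_on(m_billing_charge, m_auth_token, import, 12).

% A change in RepoA can reach RepoB if some module in RepoB
% transitively depends on a module in RepoA:
cross_repo_impact(RepoA, RepoB) :-
    repo_module(RepoA, MA),
    repo_module(RepoB, MB),
    transitive_dep(MB, MA),
    RepoA \== RepoB.

% ?- cross_repo_impact(auth_service, billing_service).
```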


For the code-verification technique that Forensic Inference generalizes, see Consequential Analysis: Semantic Verification of AI-Generated Code.

A reference implementation of the rule library and integration with Logic Cells is available in DataGrout.

Author: Nicholas Wright

Title: Co-Founder & Chief Architect, DataGrout AI

Affiliation: DataGrout Labs

Version: 1.0

Published: April 2026

For questions or collaboration: labs@datagrout.ai