Consequential Analysis: Semantic Verification of AI-Generated Code
Neuro-Symbolic Intent Detection and Impact Prediction
Abstract
AI code generation operates at speeds that exceed human verification capacity. When an AI assistant produces 3000 lines of production code in 30 minutes, traditional review approaches fail: static analysis catches syntax errors but misses semantic intent, test coverage verifies behavior but not correctness of that behavior, and manual review cannot operate at AI generation speeds.
Consequential Analysis introduces a semantic verification approach that combines structural fact extraction (AST -> Prolog), behavioral intent inference (LLM), and logical consequence queries (Prolog reasoning). The system detects intent mismatches where function behavior contradicts declared purpose, predicts change impact through call graph analysis, and identifies security concerns that escape syntax-level inspection.
The core observation: function names are contracts with both human reviewers and AI agents. A function named validate_user signals read-only validation behavior. If that function performs database writes, it violates its contract. Reviewers miss this because they trust the name. AI agents miss this because they reason about names, not implementations. Consequential analysis detects these violations by comparing inferred semantic intent against observed structural behavior.
Problem Landscape
The AI Code Velocity Gap
Traditional software development assumes human authorship at human speeds. A developer writes 100-150 lines per day; a reviewer processes 200-400 lines per hour with full semantic scrutiny of every function.
AI-assisted development breaks this equilibrium. An AI assistant can generate 2000-5000 lines in 30 minutes and produce 10-15 PRs per day, but review speed has not changed. A 3000-line AI-generated PR still takes 3-5 hours of manual review at roughly 15 seconds of semantic analysis per function.
Organizations face a forced choice:
- Ship without review – unacceptable risk
- Slow AI generation to match review capacity – eliminates the productivity gain
- Develop verification that scales to AI speeds – only viable option
Static Analysis Limitations
Static analysis tools (ESLint, Pylint, clippy, golangci-lint) detect:
- Syntax errors
- Type mismatches
- Unused variables
- Some security patterns (SQL injection strings, eval usage)
Static analysis cannot detect:
- Intent mismatches - Function name says “validate” but the implementation persists data
- Semantic violations - Function marked pure but performs I/O
- Cross-function implications - Call chain violates transaction boundaries
- Consequence prediction - What breaks if this function is deleted
This is a structural limitation. AST-level analysis provides syntactic facts, not semantic meaning.
Test Coverage Blindness
Test suites verify that functions behave as implemented:
def test_validate_payment():
    result = validate_payment({'number': '1234567890123456', 'cvv': '123'})
    assert result == True  # Test passes
The test passes and the function works as implemented. But the implementation violates its semantic contract if validate_payment writes to the database despite a name suggesting read-only validation.
Tests verify what the code does, not what the code should do based on its declared intent.
The Intent Mismatch Problem
Consider AI-generated Python:
def validate_payment_method(payment_data):
    """Validates payment method information."""
    card_number = payment_data['card']['number']
    cvv = payment_data['card']['cvv']
    # Validate format
    if len(card_number) != 16:
        return {'valid': False, 'reason': 'invalid_card_number'}
    # Store for fraud analysis
    db.payment_validations.insert({
        'card_number': card_number,
        'cvv': cvv,
        'timestamp': datetime.now(),
        'ip_address': request.ip
    })
    return {'valid': True}
What static analysis reports:
- No syntax errors (pass)
- No type violations (pass)
- No SQL injection patterns (pass)
- Passes linter (pass)
What tests verify:
- Returns {'valid': True} for valid cards (pass)
- Returns {'valid': False} for invalid cards (pass)
- Edge cases handled (pass)
What human reviewers see:
- Function named validate_payment_method
- Docstring says “Validates payment method information”
- Implementation validates format
- Tests pass
- Ships to production (pass)
What actually happens:
- Function writes sensitive PII to database (card_number, cvv, ip_address)
- Function has side effects despite “validate” name suggesting read-only operation
- Function creates audit records without encryption
- Function is unsafe to call multiple times (duplicate validation records)
- Function violates semantic contract between name and behavior
Why AI generates this:
- Training data contains similar patterns
- “Log for audit” is a common pattern
- No training signal that validate_* functions must be pure
- Tests validate behavior, not semantic correctness
Why reviewers miss this:
- Function works correctly (tests pass)
- Name matches common patterns
- Code looks professional
- Volume of AI-generated code exceeds review capacity
This is an intent mismatch. Traditional tools cannot detect it.
Consequential Analysis Architecture
Consequential analysis combines three reasoning layers. The structural layer produces deterministic facts from AST parsing. The semantic layer uses LLM inference to classify intent, bridging to the semantic type vocabulary defined in Semio: A Semantic Interface Layer for Tool-Oriented AI Systems. The query layer uses Prolog to reason over both fact sets, producing outputs suitable for embedding in Cognitive Trust Certificates or enforcing via Runtime Policy Enforcement.
Layer 1: Structural Facts (Deterministic)
Tree-sitter AST parsing extracts provable facts from code structure: function definitions, call graphs, module dependencies, data flow, and complexity metrics. These facts are deterministic (same code always produces the same output), language-agnostic (Python, Rust, TypeScript, JavaScript, Go), and fast (~50-100ms per file).
For the validate_payment_method example, structural extraction produces facts capturing that the function exists, is public, calls db.payment_validations.insert, calls datetime.now, accesses request.ip, and returns a dict. The exact fact representation uses a Prolog-compatible schema designed for efficient join queries in Layer 3.
Structural facts tell you what the code does syntactically but cannot tell you what the code means: they capture that a function calls db.insert but not whether that call is appropriate given the function’s declared purpose. That gap is where Layer 2 comes in.
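Although the exact fact schema is withheld, the shape of Layer 1 can be illustrated with a minimal Python sketch. It uses the stdlib ast module as a stand-in for tree-sitter, and the fact names (function, calls_external) are hypothetical:

```python
import ast

def extract_structural_facts(source: str) -> list:
    """Emit Prolog-style structural facts from Python source.

    Illustration only: the production pipeline uses tree-sitter for
    language-agnostic parsing, and the real fact schema differs.
    """
    facts = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            facts.append(("function", node.name))
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call):
                    # Record dotted call targets, e.g. db.payment_validations.insert
                    facts.append(("calls_external", node.name, ast.unparse(inner.func)))
    return facts

SOURCE = '''
def validate_payment_method(payment_data):
    db.payment_validations.insert({'ts': datetime.now()})
    return {'valid': True}
'''
facts = extract_structural_facts(SOURCE)
```

Even this toy extractor surfaces the key structural evidence: the function exists and it calls a write method on a database collection.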
Layer 2: Semantic Intent (LLM-Powered)
The LLM receives the source code alongside its structural facts and emits semantic classifications as structured Prolog-compatible facts. These classifications fall into several categories:
- Intent – what the function is meant to do based on its name, docstring, and context (e.g., validation, persistence, query, authentication)
- Side effects – observable behaviors that affect state outside the function’s return value (e.g., database writes, network calls, file I/O)
- Purity – whether the function is free of side effects, and if not, why
- Security markers – presence of sensitive operations such as user input handling, data logging, or shell execution
- Confidence – the model’s self-assessed reliability for each classification
For the validate_payment_method example, the LLM would classify the intent as “validation” (from the name), flag a database write side effect, mark the function as impure, and tag security concerns around user input and PII logging – all emitted as machine-readable facts that Layer 3 can query.
The specific predicate vocabulary, output schema, and prompt engineering strategies are part of the operational implementation.
Intent inference costs:
- ~$0.0003-0.0005 per function (GPT-5-mini at $0.25/$2.00 per 1M tokens)
- ~1-2 seconds per file
- Cached by code checksum (skip unchanged functions)
- Batched across multiple functions for efficiency
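The checksum-based caching described above can be sketched as follows; the cache interface and in-memory storage shown are assumptions, not the production design:

```python
import hashlib

class InferenceCache:
    """Cache semantic facts by code checksum so unchanged functions skip the LLM."""

    def __init__(self):
        self._store = {}
        self.llm_calls = 0  # track how often we actually pay for inference

    def checksum(self, code: str) -> str:
        return hashlib.sha256(code.encode()).hexdigest()

    def infer(self, code: str, llm_fn):
        key = self.checksum(code)
        if key not in self._store:
            self.llm_calls += 1          # only unseen code triggers an API call
            self._store[key] = llm_fn(code)
        return self._store[key]

cache = InferenceCache()
fake_llm = lambda code: {"intent": "validation"}  # stand-in for the real API call
cache.infer("def f(): pass", fake_llm)
result = cache.infer("def f(): pass", fake_llm)   # cache hit: no second LLM call
```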
Layer 3: Consequential Queries (Prolog Reasoning)
The query layer combines structural and semantic facts to answer consequence questions. Prolog’s backtracking search makes this natural: a single query can join function definitions, call graphs, intent classifications, and security markers to surface issues that no individual fact would reveal.
For example, detecting an intent mismatch requires joining a function’s name-derived intent (from structural analysis) against its inferred behavior (from semantic classification). When these disagree above a confidence threshold, the system flags a mismatch with severity based on the specific combination (e.g., validation intent + persistence behavior = high severity).
The same approach extends to deletion impact prediction, security concern detection, test coverage gaps, and PR risk assessment. The full query catalog is described in the Consequential Queries section below.
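As an illustration of the kind of join Layer 3 performs, the following Python sketch compares a name-derived intent against inferred behavior. The naming conventions, severity matrix, and 0.7 threshold are illustrative assumptions, and the production system expresses this as a Prolog rule rather than Python:

```python
# Hypothetical name-intent conventions and severity matrix.
NAME_INTENT = {"validate": "validation", "get": "query", "sanitize": "sanitization"}
SEVERITY = {
    ("validation", "persistence"): "high",
    ("query", "persistence"): "high",
    ("query", "logging"): "low",
}

def detect_mismatches(semantic_facts, threshold=0.7):
    """Join name-derived intent against LLM-inferred behavior."""
    issues = []
    for func, behavior, confidence in semantic_facts:
        declared = NAME_INTENT.get(func.split("_")[0])
        if declared and declared != behavior and confidence >= threshold:
            issues.append({
                "function": func,
                "declared_intent": declared,
                "actual_behavior": behavior,
                "severity": SEVERITY.get((declared, behavior), "medium"),
            })
    return issues

issues = detect_mismatches([
    ("validate_payment_method", "persistence", 0.92),  # name contradicts behavior
    ("get_user_name", "query", 0.88),                  # name and behavior agree
])
```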
Why AST Alone is Insufficient
Pure structural analysis cannot detect semantic violations:
Example 1: Query Function That Writes
def get_user_preferences(user_id):
    """Retrieves user preferences."""
    prefs = db.preferences.find({'user_id': user_id})
    if not prefs:
        # Create defaults
        prefs = db.preferences.insert({
            'user_id': user_id,
            'theme': 'dark',
            'notifications': True
        })
    return prefs
AST extraction:
function('get_user_preferences_1', ...).
calls_external('get_user_preferences_1', 'db.preferences', 'find', ...).
calls_external('get_user_preferences_1', 'db.preferences', 'insert', ...).
AST tells you:
- Function named get_user_preferences
- Calls db.preferences.find (read)
- Calls db.preferences.insert (write)
AST cannot tell you:
- Name suggests read-only query
- Implementation performs writes
- Function violates its semantic contract
- Calling this function creates state
Example 2: Sanitization Function That Doesn’t
def sanitize_sql_input(user_input):
    """Sanitizes user input to prevent SQL injection."""
    return user_input.strip()
AST extraction:
function('sanitize_sql_input_1', ...).
calls_external('sanitize_sql_input_1', 'user_input', 'strip', ...).
AST tells you:
- Function named sanitize_sql_input
- Calls the strip() method
- Returns processed input
AST cannot tell you:
- Name promises SQL injection protection
- Implementation only trims whitespace
- Function provides no actual sanitization
- This is a critical security vulnerability
Example 3: Pure Function With Side Effects
fn calculate_discount(amount: f64, customer_tier: &str) -> f64 {
    log::info!("Calculating discount for {}: {}", customer_tier, amount);
    let rate = match customer_tier {
        "gold" => 0.2,
        "silver" => 0.1,
        _ => 0.0,
    };
    amount * (1.0 - rate)
}
AST extraction:
function('calculate_discount_2', ...).
calls_external('calculate_discount_2', 'log', 'info', ...).
returns('calculate_discount_2', 'f64', ...).
Function appears pure (math calculation) but has side effect (logging). In concurrent systems or functional programming contexts, this matters: the function cannot be memoized, cannot be called in pure contexts, and has ordering dependencies that aren’t obvious from signature.
Intent Inference Implementation
LLM Prompt Design
The inference prompt is structured around four sections: the source code, the structural facts already extracted, a constrained predicate vocabulary that limits the model’s output to a fixed schema, and an output format directive that ensures machine-parseable results.
The key design decision is constraining the LLM to emit only predicates from a closed vocabulary rather than free-form analysis. This makes the output deterministic in structure (even if probabilistic in content), directly queryable by the Prolog engine, and cacheable by code checksum. The vocabulary covers intent classification, side effect detection, purity analysis, security markers, and confidence scoring.
For the validate_payment_method example, the model would identify: declared intent is validation (from name and docstring), actual behavior includes persistence (database write), the function is impure, and there are security concerns around user input handling and PII logging. Each of these becomes a structured fact that Layer 3 can join against.
Specific predicate schemas, prompt engineering techniques, few-shot calibration strategies, and confidence threshold tuning are part of the operational implementation and are not detailed here.
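The closed-vocabulary idea itself can still be illustrated. The sketch below filters hypothetical LLM output against an invented predicate vocabulary; the real predicate names and arities differ:

```python
# Hypothetical closed predicate vocabulary: predicate name -> expected arity.
ALLOWED_PREDICATES = {"intent": 2, "side_effect": 2, "purity": 2, "confidence": 3}

def validate_llm_output(lines):
    """Accept only facts whose predicate and arity match the vocabulary."""
    accepted, rejected = [], []
    for line in lines:
        pred, _, rest = line.partition("(")
        args = [a.strip() for a in rest.rstrip(").").split(",")]
        if ALLOWED_PREDICATES.get(pred) == len(args):
            accepted.append(line)
        else:
            rejected.append(line)  # free-form analysis is discarded, not queried
    return accepted, rejected

raw = [
    "intent(validate_payment_method, validation)",
    "side_effect(validate_payment_method, database_write)",
    "free_form_commentary(this function looks risky)",
]
accepted, rejected = validate_llm_output(raw)
```

Constraining output this way is what keeps the semantic layer queryable: anything the model emits outside the schema never enters the fact base.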
Cost and Performance
Intent inference costs (GPT-5-mini at $0.25/$2.00 per 1M input/output tokens):
- Per function: ~$0.0003-0.0005
- Per file: ~$0.02-0.05 (50-100 functions typical)
- Per repo: ~$2-10 (1000-5000 functions)
Performance:
- Structural extraction: 50-100ms per file (tree-sitter)
- Intent inference: 1-2 seconds per file (LLM API call)
- Query execution: <10ms per query (Prolog)
Total time: ~2-3 seconds per file
Caching:
- Results cached by code checksum (SHA-256)
- Unchanged files skip analysis
- Incremental analysis for PRs (only changed files)
Consequential Queries
The system supports a catalog of queries that combine structural and semantic facts to surface different classes of issues. Each query is implemented as a Prolog rule that joins across the fact base. Below are the query types, what they detect, and example outputs.
Query: Intent Mismatches
Detects functions where the declared intent (derived from naming conventions and documentation) contradicts the observed behavior (derived from call graph and side effect analysis). Severity is assigned based on the specific combination – for instance, a function with validation intent that performs persistence operations is rated higher than a query function with minor logging.
Example output:
{
  "mismatches": [
    {
      "function": "validate_payment_method",
      "file": "payment_handler.py",
      "line": 1,
      "declared_intent": "validation",
      "actual_behavior": "persistence",
      "severity": "high",
      "explanation": "Function name suggests read-only validation but performs database write"
    }
  ]
}
Query: Deletion Consequences
Predicts the impact of removing a function by analyzing its callers (via call graph), associated tests, fan-in (how many other functions depend on it), and the criticality of its intent category. Functions involved in authentication, authorization, or persistence are flagged as critical deletions.
Example output:
{
  "function": "process_payment",
  "consequences": {
    "broken_callers": [
      {"function": "checkout_flow", "module": "CheckoutModule", "line": 142},
      {"function": "retry_failed_payment", "module": "PaymentRetry", "line": 67}
    ],
    "orphaned_tests": [
      {"test": "test_process_payment_success"},
      {"test": "test_process_payment_failure"}
    ],
    "fan_in": 2,
    "criticality": "critical"
  }
}
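The logic behind this query can be sketched in Python (the production system expresses it as a Prolog rule). The call graph, test mapping, and criticality categories below are hypothetical data mirroring the example output:

```python
# caller -> callees, plus test -> function-under-test (hypothetical data)
CALL_GRAPH = {
    "checkout_flow": ["process_payment", "send_receipt"],
    "retry_failed_payment": ["process_payment"],
    "send_receipt": [],
    "process_payment": [],
}
TESTS = {
    "test_process_payment_success": "process_payment",
    "test_process_payment_failure": "process_payment",
}
CRITICAL_INTENTS = {"authentication", "authorization", "persistence"}

def deletion_consequences(func, intent):
    """Predict what breaks if `func` is deleted."""
    broken_callers = [c for c, callees in CALL_GRAPH.items() if func in callees]
    orphaned_tests = [t for t, target in TESTS.items() if target == func]
    return {
        "broken_callers": broken_callers,
        "orphaned_tests": orphaned_tests,
        "fan_in": len(broken_callers),
        "criticality": "critical" if intent in CRITICAL_INTENTS else "normal",
    }

report = deletion_consequences("process_payment", "persistence")
```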
Query: Security Concerns
Detects potential vulnerabilities by correlating security markers within the same function. The query identifies patterns such as user input flowing to SQL operations without sanitization markers, sensitive data written to logs or databases without encryption markers, and user input passed to shell execution. Each detected pattern produces a typed concern with a human-readable description.
Query: Test Coverage Gaps
Finds public functions that lack associated test coverage, excluding test helpers and internal utilities. Combined with intent classification, this can prioritize gaps in critical areas (e.g., untested authentication functions are flagged higher than untested rendering helpers).
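A minimal sketch of this query, assuming hypothetical function and intent data (the real version runs as a Prolog rule over the fact base):

```python
# Hypothetical public functions with their inferred intent categories.
PUBLIC_FUNCTIONS = {
    "authenticate_user": "authentication",
    "render_badge": "rendering",
    "process_refund": "persistence",
}
TESTED = {"process_refund"}  # functions with at least one associated test
CRITICAL_INTENTS = {"authentication", "authorization", "persistence"}

def coverage_gaps(public_functions, tested):
    """List untested public functions, critical intent categories first."""
    gaps = []
    for func, intent in public_functions.items():
        if func not in tested:
            priority = "high" if intent in CRITICAL_INTENTS else "low"
            gaps.append((func, priority))
    return sorted(gaps, key=lambda gap: gap[1] != "high")

gaps = coverage_gaps(PUBLIC_FUNCTIONS, TESTED)
```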
Query: Hotspots
Identifies refactoring candidates by finding functions with both high fan-in (many callers) and high cyclomatic complexity. These are the functions where a bug or intent mismatch would have the widest blast radius.
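A sketch of hotspot ranking under assumed metrics; the thresholds and the fan-in-times-complexity score are illustrative choices, not the documented formula:

```python
# fan_in and cyclomatic complexity per function (hypothetical data).
METRICS = {
    "parse_config": {"fan_in": 2, "complexity": 4},
    "apply_discount": {"fan_in": 18, "complexity": 21},  # hotspot: both high
    "format_date": {"fan_in": 30, "complexity": 2},      # widely used but simple
}

def hotspots(metrics, min_fan_in=10, min_complexity=15):
    """Rank functions where a bug would have the widest blast radius."""
    found = [
        (name, m["fan_in"] * m["complexity"])
        for name, m in metrics.items()
        if m["fan_in"] >= min_fan_in and m["complexity"] >= min_complexity
    ]
    return sorted(found, key=lambda pair: -pair[1])

ranked = hotspots(METRICS)
```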
Query: PR Risk Assessment
Composes the above queries to produce a risk summary for an incoming pull request. The query diffs the PR’s facts against the base branch to surface only newly introduced issues – intent mismatches, security concerns, and high-risk modifications to widely-depended-upon functions. This avoids alert fatigue from pre-existing issues and focuses review attention on what changed.
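The diffing step can be sketched as a set difference over issue keys; the issue shape shown is a hypothetical simplification:

```python
def new_issues(base_issues, pr_issues):
    """Surface only issues introduced by the PR, not pre-existing ones."""
    base_keys = {(i["function"], i["type"]) for i in base_issues}
    return [i for i in pr_issues if (i["function"], i["type"]) not in base_keys]

base = [{"function": "legacy_export", "type": "intent_mismatch"}]
pr = [
    {"function": "legacy_export", "type": "intent_mismatch"},            # pre-existing
    {"function": "validate_payment_method", "type": "intent_mismatch"},  # introduced
]
fresh = new_issues(base, pr)
```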
Integration Points
Arbiter Substrate Pre-Execution Validation
Arbiter Substrate intercepts AI-generated code before execution and runs it through the consequential analysis pipeline. The reflex phase extracts structural facts, infers intent, and runs the query catalog. Based on the results, Arbiter Substrate makes an allow/warn/block decision:
- No issues detected: allow execution
- Minor intent mismatches below a threshold: warn but allow
- Security concerns or PII exposure: block execution and require review
This provides a safety layer that prevents the most dangerous categories of AI-generated code from reaching production without human review, while allowing low-risk code to proceed at speed.
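The allow/warn/block rules above can be sketched as a small policy function. The issue types are taken from the list above; a real deployment would make the thresholds configurable:

```python
def gate_decision(issues):
    """Hypothetical allow/warn/block policy mirroring the rules above."""
    if any(i["type"] in ("security_concern", "pii_exposure") for i in issues):
        return "block"   # dangerous: require human review before execution
    if any(i["type"] == "intent_mismatch" for i in issues):
        return "warn"    # surface the mismatch, but allow execution
    return "allow"

decisions = [
    gate_decision([]),
    gate_decision([{"type": "intent_mismatch"}]),
    gate_decision([{"type": "pii_exposure"}]),
]
```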
Governor Pattern Learning
Governor analyzes historical consequential analysis results to extract recurring mismatch patterns. When the same type of intent mismatch appears repeatedly across a codebase (e.g., functions with validation-style names that perform database writes), Governor extracts this as a learned pattern that can be detected symbolically – without requiring an LLM call.
This creates a virtuous cycle: early analysis depends heavily on LLM inference, but over time the system learns which patterns are common in a given codebase and can detect them through fast symbolic matching alone. The LLM is then reserved for novel code that doesn’t match any known pattern.
CI/CD Integration
Consequential analysis integrates into CI/CD pipelines as a PR review step. The workflow analyzes changed files, runs the query catalog against the PR’s commit, and posts results as a PR comment. Organizations can configure branch protection rules that require consequential analysis to pass before merge, with configurable thresholds for intent mismatches, security concerns, and test coverage.
Example PR comment generated:
## Consequential Analysis Results
**Analyzed**: 23 functions across 8 files
**Time**: 2.3 seconds
**Cost**: ~$0.01
---
### Intent Mismatches (1)
**`validate_payment_method()` (payment_handler.py:42)**
- **Declared intent**: validation (from name)
- **Actual behavior**: persistence (database write at line 50)
- **Severity**: HIGH
- **Explanation**: Function name suggests read-only validation, but implementation writes card data to database
**Recommendation**: Rename to `validate_and_log_payment()` or split into two functions:
def validate_payment_method(payment_data):
    # Pure validation only
    return len(payment_data['card']['number']) == 16

def log_payment_attempt(payment_data):
    # Separate persistence
    db.payment_validations.insert({...})
---
### Security Concerns (1)
**`sanitize_sql_input()` (utils.py:23)**
- **Issue**: Sanitization is inadequate (only `strip()` called)
- **Risk**: SQL injection vulnerability
- **Severity**: CRITICAL
- **Line**: 24
**Recommendation**: Use parameterized queries or proper escaping:
def sanitize_sql_input(user_input):
    return psycopg2.extensions.adapt(user_input).getquoted()
---
### Summary
- **Functions analyzed**: 23
- **Intent mismatches**: 1 high, 0 medium
- **Security concerns**: 1 critical
- **PII exposures**: 0
- **Test gaps**: 3 functions without coverage
**Overall risk**: MEDIUM (requires review before merge)
Enterprise Deployment Patterns
Real-Time Verification in IDEs
Consequential analysis can run incrementally during development, analyzing changed functions on file save and displaying inline warnings for intent mismatches and security concerns. This shifts detection left – developers see issues before they commit, not after a PR review cycle.
Pre-Commit and PR Gates
The analysis integrates at two enforcement points: pre-commit hooks that block commits containing critical security concerns, and PR merge gates that require consequential analysis to pass before merge. Both are configurable with severity thresholds so that teams can tune the balance between safety and velocity.
Comparison with Traditional Approaches
| Approach | Speed | Can Detect Intent Mismatches? | Can Predict Consequences? | Cost |
|---|---|---|---|---|
| Manual Review | 400 lines/hour | Yes (human judgment) | Yes (human reasoning) | High (human time) |
| Static Analysis | Instant | No (syntax only) | No | Low (tool cost) |
| Test Coverage | Fast (automated) | No (tests what is) | No | Medium (test writing) |
| Consequential Analysis | 2-3 sec/file | Yes (LLM + Prolog) | Yes (Prolog queries) | ~$0.02/file |
Traditional approaches are complementary, not replacements:
- Static analysis catches syntax errors (run first)
- Tests verify behavior (required)
- Consequential analysis catches semantic issues (run before merge)
- Manual review validates architecture (required for complex changes)
Cost Analysis
Per-File Analysis Costs
Typical Python file (50 functions):
Structural extraction: $0.00 (local tree-sitter)
Intent inference: $0.015-0.025 (50 functions * $0.0003-0.0005)
Query execution: $0.00 (local Prolog)
-----------------------------------
Total: ~$0.02 per file
Per-Repository Analysis
Medium codebase (1000 files, 50k functions):
First-time analysis:
Structural: ~2 minutes (tree-sitter parallel)
Intent: ~30 minutes (LLM batched)
Cost: ~$20
Incremental analysis (10% change per PR):
Files changed: 100
Cost: ~$2 per PR
Cost Reduction Through Caching
Code checksum caching eliminates repeated analysis:
Initial analysis: $20 (full repo)
PR #1 (100 files changed): $2
PR #2 (50 files changed): $1
PR #3 (150 files changed): $3
PR #4 (75 files changed): $1.50
Total for 4 PRs: $27.50
Without caching: $80 (4 * $20)
Savings: 66%
ROI Calculation
Traditional manual review:
- Senior engineer: $150/hour
- Review speed: 400 lines/hour
- 3000-line AI PR: 7.5 hours = $1,125
Consequential analysis:
- Initial analysis: $0.15 (3000 lines, ~300 functions)
- Query execution: $0
- Review of issues: 15 minutes = $37.50
- Total: $37.65
Savings per AI-generated PR: $1,087.35 (97% reduction)
Break-even point: First PR
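The ROI figures above can be reproduced with straightforward arithmetic (all inputs taken from the text):

```python
# Manual review of a 3000-line AI-generated PR.
lines = 3000
review_rate = 400       # lines per hour
engineer_rate = 150.0   # dollars per hour
manual_cost = lines / review_rate * engineer_rate   # 7.5 hours of review

# Consequential analysis of the same PR.
analysis_cost = 0.15                       # LLM inference for ~300 functions
issue_review_cost = 0.25 * engineer_rate   # 15 minutes reviewing flagged issues
consequential_cost = analysis_cost + issue_review_cost

savings = manual_cost - consequential_cost
reduction = savings / manual_cost
```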
Limitations and Trade-Offs
LLM Dependency
Intent inference requires LLM calls for each function. This introduces:
- Cost - ~$0.0003-0.0005 per function
- Latency - 1-2 seconds per file
- Probabilistic nature - Confidence scores, not proofs
Mitigation:
- Governor learns patterns over time
- Common mismatches become symbolic checks
- Caching eliminates repeated analysis
- Batch processing reduces latency
Language Coverage
Tree-sitter parsers exist for major languages:
- Python (supported)
- Rust (supported)
- TypeScript/JavaScript (supported)
- Go (supported)
- Java (supported)
- C/C++ (supported)
Elixir requires LLM fallback (tree-sitter-elixir exists but integration needed).
Context Limitations
LLM sees individual functions, not full program context:
- Cross-module intent mismatches may be missed
- Global state effects harder to detect
- Requires multiple queries for system-level analysis
Future work: Cross-function data flow analysis.
False Positives
Intent inference can produce false positives:
- Function legitimately does what its name suggests (the flag is spurious)
- LLM misinterprets idioms or patterns
As a first filter, classifications with confidence scores below 0.7 are discarded.
Mitigation:
- Confidence thresholds filter low-quality inferences
- Human review of flagged issues
- Governor learns from corrections
Future Directions
Cross-Function Data Flow Analysis
Current analysis operates at the individual function level. Cross-function data flow analysis would track variables across call boundaries to detect patterns like user input flowing through multiple functions to reach a SQL query without being sanitized at any intermediate step. This requires inter-procedural analysis and variable tracking that goes beyond the current AST-level fact extraction.
Temporal Semantic Analysis
Analyzing how function intent evolves across commits enables regression detection (a function’s intent category changed unexpectedly), breaking change prediction (a widely-depended-upon function shifted from query to persistence), and API drift monitoring. This builds on the existing commit-scoped analysis by comparing semantic facts across versions.
Multi-Language Cross-References
Many systems span multiple languages (Python calling Rust FFI, TypeScript calling Go microservices). Detecting intent mismatches across language boundaries requires multi-language fact extraction and cross-language call graph tracking.
Automated Refactoring Suggestions
When an intent mismatch is detected, the system could suggest specific refactoring actions: splitting a function that mixes validation and persistence into two functions, renaming a function to reflect its actual behavior, or extracting a side effect into a separate explicit step.
Integration with Type Systems
Languages with rich type systems (Rust, TypeScript, Haskell) provide additional signals: purity annotations, const parameters, and effect types. These can be combined with semantic analysis to detect violations that neither approach would catch alone – for instance, a function annotated as pure that the LLM correctly identifies as having side effects.
Implementation Architecture
Hybrid Pipeline
The pipeline flows through three stages with distinct performance profiles:
- Structural extraction (tree-sitter) – fast (50-100ms per file), deterministic, zero marginal cost
- Semantic inference (LLM) – slower (1-2s per file), probabilistic, ~$0.0004 per function
- Consequence queries (Prolog) – fast (<10ms per query), deterministic, zero marginal cost
The expensive step (LLM inference) is sandwiched between two cheap deterministic steps. This means the system only pays the LLM cost once per function per code change – structural extraction and query execution add negligible overhead.
Caching Architecture
Two caching layers eliminate redundant work:
Fact caching – Structural and semantic facts are cached by code checksum. When a file hasn’t changed, its facts are retrieved from cache and the LLM step is skipped entirely. This makes incremental PR analysis fast: only changed files are re-analyzed.
Query result caching – Query results are cached by commit SHA. Since commits are immutable, a query result computed for a given commit never changes and can be cached indefinitely. This means repeated queries against the same commit (e.g., re-running CI) are effectively free.
Comparison with Existing Research
Traditional Program Analysis
Datalog-based analysis (Doop, Soot, WALA):
- Focus: Points-to analysis, taint tracking, null pointer detection
- Strength: Precise structural analysis
- Limitation: No semantic intent understanding
Abstract interpretation (Astrée, Polyspace):
- Focus: Numerical bounds, memory safety
- Strength: Sound over-approximation
- Limitation: High false positive rate, no intent reasoning
Symbolic execution (KLEE, S2E):
- Focus: Path coverage, constraint solving
- Strength: Find deep bugs
- Limitation: Path explosion, expensive, no intent
Consequential analysis differs:
- Goal: Detect semantic mismatches, not just structural bugs
- Approach: Hybrid (symbolic + LLM), not purely formal
- Speed: Seconds per file, not hours
- Scope: Intent verification, not exhaustive bug finding
Recent AI Code Analysis
CodeBERT, GraphCodeBERT, CodeT5:
- Focus: Code understanding via transformer models
- Application: Code search, summarization, completion
- Limitation: No formal reasoning, no consequence prediction
DeepCode, Snyk Code:
- Focus: Security vulnerability detection via ML
- Application: Pattern matching for known vulnerabilities
- Limitation: Training-data dependent, no intent analysis
GitHub Copilot Labs:
- Focus: Code explanation and test generation
- Application: Developer assistance
- Limitation: No verification, no consequence prediction
Consequential analysis is orthogonal:
- Uses LLM for intent, not bug finding
- Combines with Prolog for formal reasoning
- Produces verifiable outputs (Prolog facts)
- Designed specifically for AI-generated code verification
Production Deployment Considerations
Scaling to Large Codebases
Incremental analysis:
- Only analyze changed files in PRs
- Cache results by code checksum
- Skip unchanged functions
Parallel processing:
- Distribute file analysis across workers
- Batch LLM requests (10-50 functions per call)
- Cache embedding lookups
Performance targets:
- <5 seconds for typical PR (10-20 files)
- <30 seconds for large PR (100+ files)
- <10 minutes for full repository (first run)
Data Retention and Privacy
Code facts may contain sensitive information – function names revealing business logic, security markers identifying vulnerabilities, call graphs exposing system architecture. The system uses tiered retention (hot cache for active development, cold storage for historical commits) with encryption at rest, per-repository access control, and the option to run analysis entirely on-premises.
Conclusion
AI code generation creates a verification gap. Humans cannot review 3000 lines in 30 minutes with semantic depth. Existing tools catch syntax errors but miss intent mismatches where function behavior violates its declared purpose.
Consequential analysis addresses this by combining:
- Fast structural extraction (tree-sitter AST parsing)
- Semantic intent inference (LLM-powered)
- Formal consequence queries (Prolog reasoning)
The system detects intent mismatches, predicts change impact, identifies security concerns, and analyzes code at AI generation speeds (~2-3 seconds per file, ~$0.02 per file).
Key results:
- Detects semantic violations that static analysis misses
- Operates at AI generation speeds (vs. manual review)
- Costs ~$0.02 per file (vs. $150/hour manual review)
- Produces formal outputs (Prolog facts) for automated workflows
Integration points:
- Arbiter Substrate: Pre-execution validation of AI-generated code, with verification results embeddable in Cognitive Trust Certificates
- Governor: Pattern learning from historical mismatches (see Governor: Neuro-Symbolic Runtime for Token-Efficient Agent Cognition)
- Policy enforcement: Security detections feed into Runtime Policy Enforcement decisions
- CI/CD: Automated PR review and merge gates
- IDEs: Real-time semantic warnings during development
The architecture is production-deployed with full pipeline support for structural extraction, intent inference, and consequence querying across multiple languages.
February 2026
This document describes the architecture of consequential code analysis. LLM prompting strategies, specific pattern detection heuristics, and Prolog optimization techniques are withheld to protect operational implementation while enabling conceptual understanding.