Consequential Analysis: Semantic Verification of AI-Generated Code
Neuro-Symbolic Intent Detection and Impact Prediction
Abstract
AI code generation operates at speeds that exceed human verification capacity. When an AI assistant produces 3000 lines of production code in 30 minutes, traditional review approaches fail: static analysis catches syntax errors but misses semantic intent, test coverage verifies behavior but not correctness of that behavior, and manual review cannot operate at AI generation speeds.
Consequential Analysis introduces a semantic verification approach that combines structural fact extraction (AST -> Prolog), behavioral intent inference (LLM), and logical consequence queries (Prolog reasoning). The system detects intent mismatches where function behavior contradicts declared purpose, predicts change impact through call graph analysis, and identifies security concerns that escape syntax-level inspection.
The core observation: function names are contracts with both human reviewers and AI agents. A function named validate_user signals read-only validation behavior. If that function performs database writes, it violates its contract. Reviewers miss this because they trust the name. AI agents miss this because they reason about names, not implementations. Consequential analysis detects these violations by comparing inferred semantic intent against observed structural behavior.
Problem Landscape
The AI Code Velocity Gap
Traditional software development assumes human authorship at human speeds. A developer writes 100-150 lines per day; a reviewer processes 200-400 lines per hour with full semantic scrutiny of every function.
AI-assisted development breaks this equilibrium. An AI assistant can generate 2000-5000 lines in 30 minutes and produce 10-15 PRs per day, but review speed has not changed. A 3000-line AI-generated PR still takes 3-5 hours of manual review at roughly 15 seconds of semantic analysis per function.
Organizations face a forced choice:
- Ship without review – unacceptable risk
- Slow AI generation to match review capacity – eliminates the productivity gain
- Develop verification that scales to AI speeds – only viable option
Static Analysis Limitations
Static analysis tools (ESLint, Pylint, clippy, golangci-lint) detect:
- Syntax errors
- Type mismatches
- Unused variables
- Some security patterns (SQL injection strings, eval usage)
Static analysis cannot detect:
- Intent mismatches - Function name says “validate” but the implementation persists data
- Semantic violations - Function marked pure but performs I/O
- Cross-function implications - Call chain violates transaction boundaries
- Consequence prediction - What breaks if this function is deleted
This is a structural limitation. AST-level analysis provides syntactic facts, not semantic meaning.
Test Coverage Blindness
Test suites verify that functions behave as implemented:
def test_validate_payment():
    result = validate_payment({'number': '1234567890123456', 'cvv': '123'})
    assert result == True  # Test passes
The test passes and the function works as implemented. But the implementation violates its semantic contract if validate_payment writes to the database despite a name suggesting read-only validation.
Tests verify what the code does, not what the code should do based on its declared intent.
The Intent Mismatch Problem
Consider AI-generated Python:
def validate_payment_method(payment_data):
    """Validates payment method information."""
    card_number = payment_data['card']['number']
    cvv = payment_data['card']['cvv']
    # Validate format
    if len(card_number) != 16:
        return {'valid': False, 'reason': 'invalid_card_number'}
    # Store for fraud analysis
    db.payment_validations.insert({
        'card_number': card_number,
        'cvv': cvv,
        'timestamp': datetime.now(),
        'ip_address': request.ip
    })
    return {'valid': True}
What static analysis reports:
- No syntax errors (pass)
- No type violations (pass)
- No SQL injection patterns (pass)
- Passes linter (pass)
What tests verify:
- Returns {'valid': True} for valid cards (pass)
- Returns {'valid': False} for invalid cards (pass)
- Edge cases handled (pass)
What human reviewers see:
- Function named validate_payment_method
- Docstring says “Validates payment method information”
- Implementation validates format
- Tests pass
- Ships to production (pass)
What actually happens:
- Function writes sensitive PII to database (card_number, cvv, ip_address)
- Function has side effects despite “validate” name suggesting read-only operation
- Function creates audit records without encryption
- Function is unsafe to call multiple times (duplicate validation records)
- Function violates semantic contract between name and behavior
Why AI generates this:
- Training data contains similar patterns
- “Log for audit” is a common pattern
- No training signal that validate_* functions must be pure
- Tests validate behavior, not semantic correctness
Why reviewers miss this:
- Function works correctly (tests pass)
- Name matches common patterns
- Code looks professional
- Volume of AI-generated code exceeds review capacity
This is an intent mismatch. Traditional tools cannot detect it.
Consequential Analysis Architecture
Consequential analysis combines three reasoning layers. The structural layer produces deterministic facts from AST parsing. The semantic layer uses LLM inference to classify intent, bridging to the semantic type vocabulary defined in Semio: A Semantic Interface Layer for Tool-Oriented AI Systems. The query layer uses Prolog to reason over both fact sets, producing outputs suitable for embedding in Cognitive Trust Certificates or enforcing via Runtime Policy Enforcement.
Layer 1: Structural Facts (Deterministic)
Tree-sitter AST parsing extracts provable facts from code structure: function definitions, call graphs, module dependencies, data flow, and complexity metrics. These facts are deterministic (same code always produces the same output), language-agnostic (Python, Rust, TypeScript, JavaScript, Go), and fast (~50-100ms per file).
For the validate_payment_method example, structural extraction produces facts capturing that the function exists, is public, calls db.payment_validations.insert, calls datetime.now, accesses request.ip, and returns a dict. The exact fact representation uses a Prolog-compatible schema designed for efficient join queries in Layer 3.
Structural facts tell you what the code does syntactically but cannot tell you what the code means: they capture that a function calls db.insert but not whether that call is appropriate given the function’s declared purpose. That gap is where Layer 2 comes in.
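Although the exact fact schema is withheld, the shape of Layer 1 can be illustrated with a minimal Python sketch. It uses the stdlib ast module as a stand-in for tree-sitter, and the fact names (function, calls_external) are hypothetical:

```python
import ast

def extract_structural_facts(source: str) -> list:
    """Emit Prolog-style structural facts from Python source.

    Illustration only: the production pipeline uses tree-sitter for
    language-agnostic parsing, and the real fact schema differs.
    """
    facts = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            facts.append(("function", node.name))
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call):
                    # Record dotted call targets, e.g. db.payment_validations.insert
                    facts.append(("calls_external", node.name, ast.unparse(inner.func)))
    return facts

SOURCE = '''
def validate_payment_method(payment_data):
    db.payment_validations.insert({'ts': datetime.now()})
    return {'valid': True}
'''
facts = extract_structural_facts(SOURCE)
```

Even this toy extractor surfaces the key structural evidence: the function exists and it calls a write method on a database collection.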
Layer 2: Semantic Intent (LLM-Powered)
The LLM receives the source code alongside its structural facts and emits semantic classifications as structured Prolog-compatible facts. These classifications fall into several categories:
- Intent – what the function is meant to do based on its name, docstring, and context (e.g., validation, persistence, query, authentication)
- Side effects – observable behaviors that affect state outside the function’s return value (e.g., database writes, network calls, file I/O)
- Purity – whether the function is free of side effects, and if not, why
- Security markers – presence of sensitive operations such as user input handling, data logging, or shell execution
- Confidence – the model’s self-assessed reliability for each classification
For the validate_payment_method example, the LLM would classify the intent as “validation” (from the name), flag a database write side effect, mark the function as impure, and tag security concerns around user input and PII logging – all emitted as machine-readable facts that Layer 3 can query.
The specific predicate vocabulary, output schema, and prompt engineering strategies are part of the operational implementation.
Intent inference costs:
- ~$0.0003-0.0005 per function (GPT-5-mini at $0.25/$2.00 per 1M tokens)
- ~1-2 seconds per file
- Cached by code checksum (skip unchanged functions)
- Batched across multiple functions for efficiency
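The checksum-based caching described above can be sketched as follows; the cache interface and in-memory storage shown are assumptions, not the production design:

```python
import hashlib

class InferenceCache:
    """Cache semantic facts by code checksum so unchanged functions skip the LLM."""

    def __init__(self):
        self._store = {}
        self.llm_calls = 0  # track how often we actually pay for inference

    def checksum(self, code: str) -> str:
        return hashlib.sha256(code.encode()).hexdigest()

    def infer(self, code: str, llm_fn):
        key = self.checksum(code)
        if key not in self._store:
            self.llm_calls += 1          # only unseen code triggers an API call
            self._store[key] = llm_fn(code)
        return self._store[key]

cache = InferenceCache()
fake_llm = lambda code: {"intent": "validation"}  # stand-in for the real API call
cache.infer("def f(): pass", fake_llm)
result = cache.infer("def f(): pass", fake_llm)   # cache hit: no second LLM call
```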
Layer 3: Consequential Queries (Prolog Reasoning)
The query layer combines structural and semantic facts to answer consequence questions. Prolog’s backtracking search makes this natural: a single query can join function definitions, call graphs, intent classifications, and security markers to surface issues that no individual fact would reveal.
For example, detecting an intent mismatch requires joining a function’s name-derived intent (from structural analysis) against its inferred behavior (from semantic classification). When these disagree above a confidence threshold, the system flags a mismatch with severity based on the specific combination (e.g., validation intent + persistence behavior = high severity).
The same approach extends to deletion impact prediction, security concern detection, test coverage gaps, and PR risk assessment. The full query catalog is described in the Consequential Queries section below.
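As an illustration of the kind of join Layer 3 performs, the following Python sketch compares a name-derived intent against inferred behavior. The naming conventions, severity matrix, and 0.7 threshold are illustrative assumptions, and the production system expresses this as a Prolog rule rather than Python:

```python
# Hypothetical name-intent conventions and severity matrix.
NAME_INTENT = {"validate": "validation", "get": "query", "sanitize": "sanitization"}
SEVERITY = {
    ("validation", "persistence"): "high",
    ("query", "persistence"): "high",
    ("query", "logging"): "low",
}

def detect_mismatches(semantic_facts, threshold=0.7):
    """Join name-derived intent against LLM-inferred behavior."""
    issues = []
    for func, behavior, confidence in semantic_facts:
        declared = NAME_INTENT.get(func.split("_")[0])
        if declared and declared != behavior and confidence >= threshold:
            issues.append({
                "function": func,
                "declared_intent": declared,
                "actual_behavior": behavior,
                "severity": SEVERITY.get((declared, behavior), "medium"),
            })
    return issues

issues = detect_mismatches([
    ("validate_payment_method", "persistence", 0.92),  # name contradicts behavior
    ("get_user_name", "query", 0.88),                  # name and behavior agree
])
```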
Why AST Alone is Insufficient
Pure structural analysis cannot detect semantic violations:
Example 1: Query Function That Writes
def get_user_preferences(user_id):
    """Retrieves user preferences."""
    prefs = db.preferences.find({'user_id': user_id})
    if not prefs:
        # Create defaults
        prefs = db.preferences.insert({
            'user_id': user_id,
            'theme': 'dark',
            'notifications': True
        })
    return prefs
AST extraction:
function('get_user_preferences_1', ...).
calls_external('get_user_preferences_1', 'db.preferences', 'find', ...).
calls_external('get_user_preferences_1', 'db.preferences', 'insert', ...).
AST tells you:
- Function named get_user_preferences
- Calls db.preferences.find (read)
- Calls db.preferences.insert (write)
AST cannot tell you:
- Name suggests read-only query
- Implementation performs writes
- Function violates its semantic contract
- Calling this function creates state
Example 2: Sanitization Function That Doesn’t
def sanitize_sql_input(user_input):
    """Sanitizes user input to prevent SQL injection."""
    return user_input.strip()
AST extraction:
function('sanitize_sql_input_1', ...).
calls_external('sanitize_sql_input_1', 'user_input', 'strip', ...).
AST tells you:
- Function named sanitize_sql_input
- Calls the strip() method
- Returns processed input
AST cannot tell you:
- Name promises SQL injection protection
- Implementation only trims whitespace
- Function provides no actual sanitization
- This is a critical security vulnerability
Example 3: Pure Function With Side Effects
fn calculate_discount(amount: f64, customer_tier: &str) -> f64 {
    log::info!("Calculating discount for {}: {}", customer_tier, amount);
    let rate = match customer_tier {
        "gold" => 0.2,
        "silver" => 0.1,
        _ => 0.0,
    };
    amount * (1.0 - rate)
}
AST extraction:
function('calculate_discount_2', ...).
calls_external('calculate_discount_2', 'log', 'info', ...).
returns('calculate_discount_2', 'f64', ...).
Function appears pure (math calculation) but has side effect (logging). In concurrent systems or functional programming contexts, this matters: the function cannot be memoized, cannot be called in pure contexts, and has ordering dependencies that aren’t obvious from signature.
Intent Inference Implementation
LLM Prompt Design
The inference prompt is structured around four sections: the source code, the structural facts already extracted, a constrained predicate vocabulary that limits the model’s output to a fixed schema, and an output format directive that ensures machine-parseable results.
The key design decision is constraining the LLM to emit only predicates from a closed vocabulary rather than free-form analysis. This makes the output deterministic in structure (even if probabilistic in content), directly queryable by the Prolog engine, and cacheable by code checksum. The vocabulary covers intent classification, side effect detection, purity analysis, security markers, and confidence scoring.
For the validate_payment_method example, the model would identify: declared intent is validation (from name and docstring), actual behavior includes persistence (database write), the function is impure, and there are security concerns around user input handling and PII logging. Each of these becomes a structured fact that Layer 3 can join against.
Specific predicate schemas, prompt engineering techniques, few-shot calibration strategies, and confidence threshold tuning are part of the operational implementation and are not detailed here.
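The closed-vocabulary idea itself can still be illustrated. The sketch below filters hypothetical LLM output against an invented predicate vocabulary; the real predicate names and arities differ:

```python
# Hypothetical closed predicate vocabulary: predicate name -> expected arity.
ALLOWED_PREDICATES = {"intent": 2, "side_effect": 2, "purity": 2, "confidence": 3}

def validate_llm_output(lines):
    """Accept only facts whose predicate and arity match the vocabulary."""
    accepted, rejected = [], []
    for line in lines:
        pred, _, rest = line.partition("(")
        args = [a.strip() for a in rest.rstrip(").").split(",")]
        if ALLOWED_PREDICATES.get(pred) == len(args):
            accepted.append(line)
        else:
            rejected.append(line)  # free-form analysis is discarded, not queried
    return accepted, rejected

raw = [
    "intent(validate_payment_method, validation)",
    "side_effect(validate_payment_method, database_write)",
    "free_form_commentary(this function looks risky)",
]
accepted, rejected = validate_llm_output(raw)
```

Constraining output this way is what keeps the semantic layer queryable: anything the model emits outside the schema never enters the fact base.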
Cost and Performance
Intent inference costs (GPT-5-mini at $0.25/$2.00 per 1M input/output tokens):
- Per function: ~$0.0003-0.0005
- Per file: ~$0.02-0.05 (50-100 functions typical)
- Per repo: ~$2-10 (1000-5000 functions)
Performance:
- Structural extraction: 50-100ms per file (tree-sitter)
- Intent inference: 1-2 seconds per file (LLM API call)
- Query execution: <10ms per query (Prolog)
Total time: ~2-3 seconds per file
Caching:
- Results cached by code checksum (SHA-256)
- Unchanged files skip analysis
- Incremental analysis for PRs (only changed files)
Consequential Queries
The system supports a catalog of queries that combine structural and semantic facts to surface different classes of issues. Each query is implemented as a Prolog rule that joins across the fact base. Below are the query types, what they detect, and example outputs.
Query: Intent Mismatches
Detects functions where the declared intent (derived from naming conventions and documentation) contradicts the observed behavior (derived from call graph and side effect analysis). Severity is assigned based on the specific combination – for instance, a function with validation intent that performs persistence operations is rated higher than a query function with minor logging.
Example output:
{
  "mismatches": [
    {
      "function": "validate_payment_method",
      "file": "payment_handler.py",
      "line": 1,
      "declared_intent": "validation",
      "actual_behavior": "persistence",
      "severity": "high",
      "explanation": "Function name suggests read-only validation but performs database write"
    }
  ]
}
Query: Deletion Consequences
Predicts the impact of removing a function by analyzing its callers (via call graph), associated tests, fan-in (how many other functions depend on it), and the criticality of its intent category. Functions involved in authentication, authorization, or persistence are flagged as critical deletions.
Example output:
{
  "function": "process_payment",
  "consequences": {
    "broken_callers": [
      {"function": "checkout_flow", "module": "CheckoutModule", "line": 142},
      {"function": "retry_failed_payment", "module": "PaymentRetry", "line": 67}
    ],
    "orphaned_tests": [
      {"test": "test_process_payment_success"},
      {"test": "test_process_payment_failure"}
    ],
    "fan_in": 2,
    "criticality": "critical"
  }
}
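The logic behind this query can be sketched in Python (the production system expresses it as a Prolog rule). The call graph, test mapping, and criticality categories below are hypothetical data mirroring the example output:

```python
# caller -> callees, plus test -> function-under-test (hypothetical data)
CALL_GRAPH = {
    "checkout_flow": ["process_payment", "send_receipt"],
    "retry_failed_payment": ["process_payment"],
    "send_receipt": [],
    "process_payment": [],
}
TESTS = {
    "test_process_payment_success": "process_payment",
    "test_process_payment_failure": "process_payment",
}
CRITICAL_INTENTS = {"authentication", "authorization", "persistence"}

def deletion_consequences(func, intent):
    """Predict what breaks if `func` is deleted."""
    broken_callers = [c for c, callees in CALL_GRAPH.items() if func in callees]
    orphaned_tests = [t for t, target in TESTS.items() if target == func]
    return {
        "broken_callers": broken_callers,
        "orphaned_tests": orphaned_tests,
        "fan_in": len(broken_callers),
        "criticality": "critical" if intent in CRITICAL_INTENTS else "normal",
    }

report = deletion_consequences("process_payment", "persistence")
```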
Query: Security Concerns
Detects potential vulnerabilities by correlating security markers within the same function. The query identifies patterns such as user input flowing to SQL operations without sanitization markers, sensitive data written to logs or databases without encryption markers, and user input passed to shell execution. Each detected pattern produces a typed concern with a human-readable description.
Query: Test Coverage Gaps
Finds public functions that lack associated test coverage, excluding test helpers and internal utilities. Combined with intent classification, this can prioritize gaps in critical areas (e.g., untested authentication functions are flagged higher than untested rendering helpers).
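A minimal sketch of this query, assuming hypothetical function and intent data (the real version runs as a Prolog rule over the fact base):

```python
# Hypothetical public functions with their inferred intent categories.
PUBLIC_FUNCTIONS = {
    "authenticate_user": "authentication",
    "render_badge": "rendering",
    "process_refund": "persistence",
}
TESTED = {"process_refund"}  # functions with at least one associated test
CRITICAL_INTENTS = {"authentication", "authorization", "persistence"}

def coverage_gaps(public_functions, tested):
    """List untested public functions, critical intent categories first."""
    gaps = []
    for func, intent in public_functions.items():
        if func not in tested:
            priority = "high" if intent in CRITICAL_INTENTS else "low"
            gaps.append((func, priority))
    return sorted(gaps, key=lambda gap: gap[1] != "high")

gaps = coverage_gaps(PUBLIC_FUNCTIONS, TESTED)
```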
Query: Hotspots
Identifies refactoring candidates by finding functions with both high fan-in (many callers) and high cyclomatic complexity. These are the functions where a bug or intent mismatch would have the widest blast radius.
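A sketch of hotspot ranking under assumed metrics; the thresholds and the fan-in-times-complexity score are illustrative choices, not the documented formula:

```python
# fan_in and cyclomatic complexity per function (hypothetical data).
METRICS = {
    "parse_config": {"fan_in": 2, "complexity": 4},
    "apply_discount": {"fan_in": 18, "complexity": 21},  # hotspot: both high
    "format_date": {"fan_in": 30, "complexity": 2},      # widely used but simple
}

def hotspots(metrics, min_fan_in=10, min_complexity=15):
    """Rank functions where a bug would have the widest blast radius."""
    found = [
        (name, m["fan_in"] * m["complexity"])
        for name, m in metrics.items()
        if m["fan_in"] >= min_fan_in and m["complexity"] >= min_complexity
    ]
    return sorted(found, key=lambda pair: -pair[1])

ranked = hotspots(METRICS)
```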
Query: PR Risk Assessment
Composes the above queries to produce a risk summary for an incoming pull request. The query diffs the PR’s facts against the base branch to surface only newly introduced issues – intent mismatches, security concerns, and high-risk modifications to widely-depended-upon functions. This avoids alert fatigue from pre-existing issues and focuses review attention on what changed.
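The diffing step can be sketched as a set difference over issue keys; the issue shape shown is a hypothetical simplification:

```python
def new_issues(base_issues, pr_issues):
    """Surface only issues introduced by the PR, not pre-existing ones."""
    base_keys = {(i["function"], i["type"]) for i in base_issues}
    return [i for i in pr_issues if (i["function"], i["type"]) not in base_keys]

base = [{"function": "legacy_export", "type": "intent_mismatch"}]
pr = [
    {"function": "legacy_export", "type": "intent_mismatch"},            # pre-existing
    {"function": "validate_payment_method", "type": "intent_mismatch"},  # introduced
]
fresh = new_issues(base, pr)
```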
Integration Points
Arbiter Substrate Pre-Execution Validation
Arbiter Substrate intercepts AI-generated code before execution and runs it through the consequential analysis pipeline. The reflex phase extracts structural facts, infers intent, and runs the query catalog. Based on the results, Arbiter Substrate makes an allow/warn/block decision:
- No issues detected: allow execution
- Minor intent mismatches below a threshold: warn but allow
- Security concerns or PII exposure: block execution and require review
This provides a safety layer that prevents the most dangerous categories of AI-generated code from reaching production without human review, while allowing low-risk code to proceed at speed.
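The allow/warn/block rules above can be sketched as a small policy function. The issue types are taken from the list above; a real deployment would make the thresholds configurable:

```python
def gate_decision(issues):
    """Hypothetical allow/warn/block policy mirroring the rules above."""
    if any(i["type"] in ("security_concern", "pii_exposure") for i in issues):
        return "block"   # dangerous: require human review before execution
    if any(i["type"] == "intent_mismatch" for i in issues):
        return "warn"    # surface the mismatch, but allow execution
    return "allow"

decisions = [
    gate_decision([]),
    gate_decision([{"type": "intent_mismatch"}]),
    gate_decision([{"type": "pii_exposure"}]),
]
```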
Governor Pattern Learning
Governor analyzes historical consequential analysis results to extract recurring mismatch patterns. When the same type of intent mismatch appears repeatedly across a codebase (e.g., functions with validation-style names that perform database writes), Governor extracts this as a learned pattern that can be detected symbolically – without requiring an LLM call.
This creates a virtuous cycle: early analysis depends heavily on LLM inference, but over time the system learns which patterns are common in a given codebase and can detect them through fast symbolic matching alone. The LLM is then reserved for novel code that doesn’t match any known pattern.
CI/CD Integration
Consequential analysis integrates into CI/CD pipelines as a PR review step. The workflow analyzes changed files, runs the query catalog against the PR’s commit, and posts results as a PR comment. Organizations can configure branch protection rules that require consequential analysis to pass before merge, with configurable thresholds for intent mismatches, security concerns, and test coverage.
Example PR comment generated:
## Consequential Analysis Results
**Analyzed**: 23 functions across 8 files
**Time**: 2.3 seconds
**Cost**: ~$0.01
---
### Intent Mismatches (1)
**`validate_payment_method()` (payment_handler.py:42)**
- **Declared intent**: validation (from name)
- **Actual behavior**: persistence (database write at line 50)
- **Severity**: HIGH
- **Explanation**: Function name suggests read-only validation, but implementation writes card data to database
**Recommendation**: Rename to `validate_and_log_payment()` or split into two functions:
def validate_payment_method(payment_data):
    # Pure validation only
    return len(payment_data['card']['number']) == 16

def log_payment_attempt(payment_data):
    # Separate persistence
    db.payment_validations.insert({...})
---
### Security Concerns (1)
**`sanitize_sql_input()` (utils.py:23)**
- **Issue**: Sanitization is inadequate (only `strip()` called)
- **Risk**: SQL injection vulnerability
- **Severity**: CRITICAL
- **Line**: 24
**Recommendation**: Use parameterized queries or proper escaping:
def sanitize_sql_input(user_input):
    return psycopg2.extensions.adapt(user_input).getquoted()
---
### Summary
- **Functions analyzed**: 23
- **Intent mismatches**: 1 high, 0 medium
- **Security concerns**: 1 critical
- **PII exposures**: 0
- **Test gaps**: 3 functions without coverage
**Overall risk**: MEDIUM (requires review before merge)
Enterprise Deployment Patterns
Real-Time Verification in IDEs
Consequential analysis can run incrementally during development, analyzing changed functions on file save and displaying inline warnings for intent mismatches and security concerns. This shifts detection left – developers see issues before they commit, not after a PR review cycle.
Pre-Commit and PR Gates
The analysis integrates at two enforcement points: pre-commit hooks that block commits containing critical security concerns, and PR merge gates that require consequential analysis to pass before merge. Both are configurable with severity thresholds so that teams can tune the balance between safety and velocity.
Comparison with Traditional Approaches
| Approach | Speed | Can Detect Intent Mismatches? | Can Predict Consequences? | Cost |
|---|---|---|---|---|
| Manual Review | 400 lines/hour | Yes (human judgment) | Yes (human reasoning) | High (human time) |
| Static Analysis | Instant | No (syntax only) | No | Low (tool cost) |
| Test Coverage | Fast (automated) | No (tests what is) | No | Medium (test writing) |
| Consequential Analysis | 2-3 sec/file | Yes (LLM + Prolog) | Yes (Prolog queries) | ~$0.02/file |
Traditional approaches are complementary, not replacements:
- Static analysis catches syntax errors (run first)
- Tests verify behavior (required)
- Consequential analysis catches semantic issues (run before merge)
- Manual review validates architecture (required for complex changes)
Cost Analysis
Per-File Analysis Costs
Typical Python file (50 functions):
Structural extraction: $0.00 (local tree-sitter)
Intent inference: $0.015-0.025 (50 functions * $0.0003-0.0005)
Query execution: $0.00 (local Prolog)
-----------------------------------
Total: ~$0.02 per file
Per-Repository Analysis
Medium codebase (1000 files, 50k functions):
First-time analysis:
Structural: ~2 minutes (tree-sitter parallel)
Intent: ~30 minutes (LLM batched)
Cost: ~$20
Incremental analysis (10% change per PR):
Files changed: 100
Cost: ~$2 per PR
Cost Reduction Through Caching
Code checksum caching eliminates repeated analysis:
Initial analysis: $20 (full repo)
PR #1 (100 files changed): $2
PR #2 (50 files changed): $1
PR #3 (150 files changed): $3
PR #4 (75 files changed): $1.50
Total for 4 PRs: $27.50
Without caching: $80 (4 * $20)
Savings: 66%
ROI Calculation
Traditional manual review:
- Senior engineer: $150/hour
- Review speed: 400 lines/hour
- 3000-line AI PR: 7.5 hours = $1,125
Consequential analysis:
- Initial analysis: $0.15 (3000 lines, ~300 functions)
- Query execution: $0
- Review of issues: 15 minutes = $37.50
- Total: $37.65
Savings per AI-generated PR: $1,087.35 (97% reduction)
Break-even point: First PR
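The ROI figures above can be reproduced with straightforward arithmetic (all inputs taken from the text):

```python
# Manual review of a 3000-line AI-generated PR.
lines = 3000
review_rate = 400       # lines per hour
engineer_rate = 150.0   # dollars per hour
manual_cost = lines / review_rate * engineer_rate   # 7.5 hours of review

# Consequential analysis of the same PR.
analysis_cost = 0.15                       # LLM inference for ~300 functions
issue_review_cost = 0.25 * engineer_rate   # 15 minutes reviewing flagged issues
consequential_cost = analysis_cost + issue_review_cost

savings = manual_cost - consequential_cost
reduction = savings / manual_cost
```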
Limitations and Trade-Offs
LLM Dependency
Intent inference requires LLM calls for each function. This introduces:
- Cost - ~$0.0003-0.0005 per function
- Latency - 1-2 seconds per file
- Probabilistic nature - Confidence scores, not proofs
Mitigation:
- Governor learns patterns over time
- Common mismatches become symbolic checks
- Caching eliminates repeated analysis
- Batch processing reduces latency
Language Coverage
Tree-sitter parsers exist for major languages:
- Python (supported)
- Rust (supported)
- TypeScript/JavaScript (supported)
- Go (supported)
- Java (supported)
- C/C++ (supported)
Elixir requires LLM fallback (tree-sitter-elixir exists but integration needed).
Context Limitations
LLM sees individual functions, not full program context:
- Cross-module intent mismatches may be missed
- Global state effects harder to detect
- Requires multiple queries for system-level analysis
Future work: Cross-function data flow analysis.
False Positives
Intent inference can produce false positives:
- Function legitimately does what its name suggests (the flag is spurious)
- LLM misinterprets idioms or patterns
As a first filter, classifications with confidence scores below 0.7 are discarded.
Mitigation:
- Confidence thresholds filter low-quality inferences
- Human review of flagged issues
- Governor learns from corrections
Future Directions
Cross-Function Data Flow Analysis
Current analysis operates at the individual function level. Cross-function data flow analysis would track variables across call boundaries to detect patterns like user input flowing through multiple functions to reach a SQL query without being sanitized at any intermediate step. This requires inter-procedural analysis and variable tracking that goes beyond the current AST-level fact extraction.
Temporal Semantic Analysis
Analyzing how function intent evolves across commits enables regression detection (a function’s intent category changed unexpectedly), breaking change prediction (a widely-depended-upon function shifted from query to persistence), and API drift monitoring. This builds on the existing commit-scoped analysis by comparing semantic facts across versions.
Multi-Language Cross-References
Many systems span multiple languages (Python calling Rust FFI, TypeScript calling Go microservices). Detecting intent mismatches across language boundaries requires multi-language fact extraction and cross-language call graph tracking.
Automated Refactoring Suggestions
When an intent mismatch is detected, the system could suggest specific refactoring actions: splitting a function that mixes validation and persistence into two functions, renaming a function to reflect its actual behavior, or extracting a side effect into a separate explicit step.
Integration with Type Systems
Languages with rich type systems (Rust, TypeScript, Haskell) provide additional signals: purity annotations, const parameters, and effect types. These can be combined with semantic analysis to detect violations that neither approach would catch alone – for instance, a function annotated as pure that the LLM correctly identifies as having side effects.
Implementation Architecture
Hybrid Pipeline
The pipeline flows through three stages with distinct performance profiles:
- Structural extraction (tree-sitter) – fast (50-100ms per file), deterministic, zero marginal cost
- Semantic inference (LLM) – slower (1-2s per file), probabilistic, ~$0.0004 per function
- Consequence queries (Prolog) – fast (<10ms per query), deterministic, zero marginal cost
The expensive step (LLM inference) is sandwiched between two cheap deterministic steps. This means the system only pays the LLM cost once per function per code change – structural extraction and query execution add negligible overhead.
Caching Architecture
Two caching layers eliminate redundant work:
Fact caching – Structural and semantic facts are cached by code checksum. When a file hasn’t changed, its facts are retrieved from cache and the LLM step is skipped entirely. This makes incremental PR analysis fast: only changed files are re-analyzed.
Query result caching – Query results are cached by commit SHA. Since commits are immutable, a query result computed for a given commit never changes and can be cached indefinitely. This means repeated queries against the same commit (e.g., re-running CI) are effectively free.
Comparison with Existing Research
Traditional Program Analysis
Datalog-based analysis (Doop, Soot, WALA):
- Focus: Points-to analysis, taint tracking, null pointer detection
- Strength: Precise structural analysis
- Limitation: No semantic intent understanding
Abstract interpretation (Astrée, Polyspace):
- Focus: Numerical bounds, memory safety
- Strength: Sound over-approximation
- Limitation: High false positive rate, no intent reasoning
Symbolic execution (KLEE, S2E):
- Focus: Path coverage, constraint solving
- Strength: Find deep bugs
- Limitation: Path explosion, expensive, no intent
Consequential analysis differs:
- Goal: Detect semantic mismatches, not just structural bugs
- Approach: Hybrid (symbolic + LLM), not purely formal
- Speed: Seconds per file, not hours
- Scope: Intent verification, not exhaustive bug finding
Recent AI Code Analysis
CodeBERT, GraphCodeBERT, CodeT5:
- Focus: Code understanding via transformer models
- Application: Code search, summarization, completion
- Limitation: No formal reasoning, no consequence prediction
DeepCode, Snyk Code:
- Focus: Security vulnerability detection via ML
- Application: Pattern matching for known vulnerabilities
- Limitation: Training-data dependent, no intent analysis
GitHub Copilot Labs:
- Focus: Code explanation and test generation
- Application: Developer assistance
- Limitation: No verification, no consequence prediction
Consequential analysis is orthogonal:
- Uses LLM for intent, not bug finding
- Combines with Prolog for formal reasoning
- Produces verifiable outputs (Prolog facts)
- Designed specifically for AI-generated code verification
Production Deployment Considerations
Scaling to Large Codebases
Incremental analysis:
- Only analyze changed files in PRs
- Cache results by code checksum
- Skip unchanged functions
Parallel processing:
- Distribute file analysis across workers
- Batch LLM requests (10-50 functions per call)
- Cache embedding lookups
Performance targets:
- <5 seconds for typical PR (10-20 files)
- <30 seconds for large PR (100+ files)
- <10 minutes for full repository (first run)
Data Retention and Privacy
Code facts may contain sensitive information – function names revealing business logic, security markers identifying vulnerabilities, call graphs exposing system architecture. The system uses tiered retention (hot cache for active development, cold storage for historical commits) with encryption at rest, per-repository access control, and the option to run analysis entirely on-premises.
Conclusion
AI code generation creates a verification gap. Humans cannot review 3000 lines in 30 minutes with semantic depth. Existing tools catch syntax errors but miss intent mismatches where function behavior violates its declared purpose.
Consequential analysis addresses this by combining:
- Fast structural extraction (tree-sitter AST parsing)
- Semantic intent inference (LLM-powered)
- Formal consequence queries (Prolog reasoning)
The system detects intent mismatches, predicts change impact, identifies security concerns, and analyzes code at AI generation speeds (~2-3 seconds per file, ~$0.02 per file).
Key results:
- Detects semantic violations that static analysis misses
- Operates at AI generation speeds (vs. manual review)
- Costs ~$0.02 per file (vs. $150/hour manual review)
- Produces formal outputs (Prolog facts) for automated workflows
Integration points:
- Arbiter Substrate: Pre-execution validation of AI-generated code, with verification results embeddable in Cognitive Trust Certificates
- Governor: Pattern learning from historical mismatches (see Governor: Neuro-Symbolic Runtime for Token-Efficient Agent Cognition)
- Policy enforcement: Security detections feed into Runtime Policy Enforcement decisions
- CI/CD: Automated PR review and merge gates
- IDEs: Real-time semantic warnings during development
The architecture is production-deployed with full pipeline support for structural extraction, intent inference, and consequence querying across multiple languages.
February 2026
This document describes the architecture of consequential code analysis. LLM prompting strategies, specific pattern detection heuristics, and Prolog optimization techniques are withheld to protect operational implementation while enabling conceptual understanding.