Code Intelligence Research Reference

Status: Active Research Document
Task ID: H.5.5.x (Code Intelligence Track)
Last Updated: 2026-01-17

Executive Summary

This document provides academic and industry research supporting CODITECT's code intelligence roadmap (H.5.5.x track). The analysis shows that AST-based code analysis and call graphs are now table stakes: multiple competitors and open-source projects already offer these capabilities. CODITECT's competitive moat lies in the combination of cross-session memory, decision tracking, and code intelligence.

Table of Contents

1. Competitive Landscape
2. Academic Research: Code Representation
3. Academic Research: RAG and GraphRAG
4. Academic Research: Call Graphs and PDGs
5. Reciprocal Rank Fusion (RRF)
6. Technology Comparison Matrix
7. Strategic Recommendations
8. Bibliography

1. Competitive Landscape

1.1 Direct Competitors with AST/Call Graph Capabilities

Tool	AST Analysis	Call Graph	Impact Analysis	Memory	License
Code Pathfinder	Yes (tree-sitter)	Yes	Partial	No	AGPL-3.0
Cursor	Yes (tree-sitter)	Limited	No	No	Proprietary
JetBrains AI	Yes (native parsers)	Yes	Yes	Limited	Proprietary
Augment Code	Yes	Yes	Limited	Limited	Proprietary
GraphRAG (MS)	No	No	N/A	Graph-based	MIT
CODITECT	Planned	Planned	Planned	Yes (unique)	Proprietary

1.2 Code Pathfinder Analysis

Repository: github.com/shivasurya/code-pathfinder

Architecture:

Source Files (.java, .py)
    ↓
Tree-Sitter AST Parsing (5 parallel workers)
    ↓
Code Graph (Nodes + Edges)
    ↓
Query Language (ANTLR parser)
    ↓
Query Engine (expr-lang evaluation)
    ↓
Output Formats (JSON, SARIF, Table)

Key Technical Decisions:

Tree-sitter for parsing: Language-agnostic AST generation
5 parallel workers: Balance between parallelism and overhead
SHA-256 node IDs: Deterministic, consistent across runs
Lazy loading: SourceLocation with byte offsets reduces memory from 2.32 GB to 2.18 GB (roughly 6%)
Object pooling: sync.Pool for environment maps reduces GC pressure
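
As an illustration of the node ID and lazy-loading decisions above, here is a minimal Python sketch. The fields that go into the hash and the helper names (`node_id`, `load_snippet`) are assumptions made for this example; the source does not state which fields Code Pathfinder actually hashes.

```python
import hashlib


def node_id(file_path: str, kind: str, name: str, start_byte: int, end_byte: int) -> str:
    """Derive a deterministic node ID from identifying fields (assumed field set)."""
    # A NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
    key = "\x00".join([file_path, kind, name, str(start_byte), str(end_byte)])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def load_snippet(source: bytes, start_byte: int, end_byte: int) -> str:
    """Lazy loading: store only byte offsets on the node, slice the text on demand."""
    return source[start_byte:end_byte].decode("utf-8")


source = b"def f(): pass"
first = node_id("app.py", "function", "f", 0, len(source))
second = node_id("app.py", "function", "f", 0, len(source))
print(first == second)              # same inputs give the same ID on every run
print(load_snippet(source, 4, 5))   # text is materialized only when requested
```

Because the ID depends only on the node's own fields, two runs over the same unchanged file yield the same IDs, which is what makes results comparable across runs.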

Supported Languages: Java, Python, Dockerfile, docker-compose (Go support planned)

Performance:

Small codebase (<1k methods): ~100 MB memory
Large codebase (27k methods): ~2.18 GB memory
Graph building: ~5 seconds for 27k methods
Query execution: <1 second for simple queries

1.3 Market Trends (2025-2026)

85% of developers regularly use AI coding tools (Stack Overflow 2025)
1,445% surge in multi-agent system inquiries from Q1 2024 to Q2 2025 (Gartner)
40% of enterprise applications will embed AI agents by end of 2026 (Gartner)
GraphRAG adoption reportedly pushing accuracy as high as 99% in some enterprise deployments

2. Academic Research: Code Representation

2.1 ASTNN - AST-based Neural Network (ICSE 2019)

Paper: "A Novel Neural Source Code Representation based on Abstract Syntax Tree"

Key Innovation: Splits large ASTs into sequences of small statement trees, encoding lexical and syntactical knowledge.

Performance: Reported to significantly outperform prior approaches on source code classification and code clone detection.

URL: IEEE Xplore
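
To make the statement-level splitting idea concrete, the following sketch enumerates the statement nodes of a Python function in source order using the standard `ast` module. It is only an approximation for illustration: ASTNN itself targets C and Java and detaches nested bodies so each statement tree stays small, whereas this sketch merely lists the statement roots. The function name `statement_trees` is hypothetical.

```python
import ast

SAMPLE = """
def f(xs):
    total = 0
    for x in xs:
        if x > 0:
            total += x
    return total
"""


def statement_trees(source: str) -> list[ast.stmt]:
    """Return every statement node in preorder (source order)."""
    found: list[ast.stmt] = []

    def visit(node: ast.AST) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.stmt):
                found.append(child)
            visit(child)

    visit(ast.parse(source))
    return found


for stmt in statement_trees(SAMPLE):
    # Each root would be encoded separately, and the sequence fed to a sequence model.
    print(type(stmt).__name__, "at line", stmt.lineno)
```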

2.2 TAILOR - Code Property Graph Learning

Paper: "Learning Graph-based Code Representations for Source-Level Functional Similarity Detection"

Key Innovation: Uses Code Property Graph (CPG) combining:

Abstract Syntax Tree (AST)
Control Flow Graph (CFG)
Data Flow Graph (DFG)

Architecture: CPGNN iteratively propagates node embeddings along CPG structure, then aggregates via pooling.

URL: Project Page
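
The sketch below shows one way to hold AST, CFG, and DFG edges in a single graph and run a propagate-then-pool pass over it. It is a toy under stated assumptions: features are single floats, the update rule is a plain mean over a node and its in-neighbors, and the class and function names are invented for this example. It is not TAILOR's CPGNN update function.

```python
from collections import defaultdict

EDGE_KINDS = ("AST", "CFG", "DFG")


class CodePropertyGraph:
    """One node set with three typed edge sets."""

    def __init__(self) -> None:
        self.nodes: set[str] = set()
        self.edges: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def add_edge(self, src: str, dst: str, kind: str) -> None:
        if kind not in EDGE_KINDS:
            raise ValueError(f"unknown edge kind: {kind}")
        self.nodes.update((src, dst))
        self.edges[kind].append((src, dst))

    def successors(self, node: str, kind: str) -> list[str]:
        return [dst for src, dst in self.edges[kind] if src == node]


def propagate(cpg: CodePropertyGraph, features: dict[str, float], rounds: int = 1) -> dict[str, float]:
    """Each round, a node becomes the mean of itself and its in-neighbors over all edge kinds."""
    for _ in range(rounds):
        inbox = {node: [features[node]] for node in cpg.nodes}
        for kind in EDGE_KINDS:
            for src, dst in cpg.edges[kind]:
                inbox[dst].append(features[src])
        features = {node: sum(vals) / len(vals) for node, vals in inbox.items()}
    return features


def pool(features: dict[str, float]) -> float:
    """Mean pooling over all node features gives one graph-level value."""
    return sum(features.values()) / len(features)


# Graph for the two statements `x = 1` (s1) and `y = x + 1` (s2).
cpg = CodePropertyGraph()
cpg.add_edge("body", "s1", "AST")
cpg.add_edge("body", "s2", "AST")
cpg.add_edge("s1", "s2", "CFG")   # s2 executes after s1
cpg.add_edge("s1", "s2", "DFG")   # s2 reads x, which s1 defines
updated = propagate(cpg, {"body": 0.0, "s1": 1.0, "s2": 2.0})
print(updated["s2"], pool(updated))
```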

2.3 SimAST-GCN (2022)

Paper: "Turn tree into graph: Automatic code review via simplified AST driven graph convolutional network"

Key Innovation: Simplifies AST by deleting redundant nodes, constructs relation graph with attention mechanism.

URL: ScienceDirect

2.4 Flow-Augmented AST (FA-AST)

Paper: "Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree"

Key Innovation: Adds control flow and data flow edges to AST, uses GNN for feature extraction.

URL: arXiv

2.5 GraphPyRec (2024)

Paper: "GraphPyRec: A novel graph-based approach for fine-grained Python code recommendation"

Key Innovation: Uses Program Dependence Graph (PDG) and Data Flow Graph (DFG) as global context.

URL: ResearchGate

3. Academic Research: RAG and GraphRAG

3.1 RAG Evolution Survey (2024)

Paper: "A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions"

Key Findings:

RAG combines retrieval mechanisms with generative models
Reviews question-answering, summarization, knowledge-based tasks
Identifies trade-offs between retrieval precision and generation flexibility

URL: arXiv:2410.12837

3.2 Agentic RAG Survey (2025)

Paper: "Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG"

Key Innovation: Embeds autonomous AI agents into RAG pipeline with:

Reflection patterns
Planning capabilities
Tool use
Multi-agent collaboration

URL: arXiv:2501.09136

3.3 Microsoft GraphRAG (2024)

Paper: "From Local to Global: A Graph RAG Approach to Query-Focused Summarization"

Key Innovation: Two-stage knowledge graph construction:

Derive entity knowledge graph from source documents
Pre-generate community summaries for related entities

Advantages:

Handles queries requiring aggregation across dataset
Captures evidence provenance
Reveals structure and themes of dataset

URL: arXiv:2404.16130

GitHub: microsoft/graphrag
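
The two stages can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the microsoft/graphrag API: entities are given rather than extracted by an LLM, communities are plain connected components rather than the hierarchical Leiden clustering used in the paper, and `summarize` is a placeholder for an LLM-generated community summary. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_entity_graph(doc_entities: dict[str, list[str]]) -> dict[str, set[str]]:
    """Stage 1: link entities that co-occur in the same document."""
    graph: dict[str, set[str]] = defaultdict(set)
    for entities in doc_entities.values():
        for entity in entities:
            graph[entity]  # touch the key so isolated entities are kept
        for a, b in combinations(sorted(set(entities)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def communities(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group entities by connected component (assumption: stands in for Leiden)."""
    seen: set[str] = set()
    result: list[list[str]] = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        result.append(sorted(members))
    return result


def summarize(community: list[str]) -> str:
    """Stage 2 placeholder: a real system would pre-generate this with an LLM."""
    return "Community of " + ", ".join(community)


docs = {"d1": ["parser", "lexer"], "d2": ["lexer", "ast"], "d3": ["cache"]}
groups = communities(build_entity_graph(docs))
for group in groups:
    print(summarize(group))
```

Pre-generating one summary per community is what lets a global query be answered from summaries instead of from every source chunk.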

3.4 RAG-Fusion (2024)

Paper: "RAG-Fusion: a New Take on Retrieval-Augmented Generation"

Key Innovation: Combines RAG with RRF by:

Generating multiple queries
Reranking with reciprocal scores
Fusing documents and scores

Best Practices:

Limit to 3-5 well-crafted sub-queries
Beyond 5 queries, redundancy outweighs recall gains

URL: arXiv:2402.03367

4. Academic Research: Call Graphs and PDGs

4.1 Expression Dependence Graph (2025)

Paper: "The Expression Dependence Graph"

Authors: Carlos Galindo, Sergio Pérez, Josep Silva

Key Innovation: Extension of System Dependence Graph (SDG) for improved precision in program slicing with:

List comprehensions
Try-catch blocks
For loops

URL: ScienceDirect

4.2 Causal Program Dependence Analysis (2025)

Paper: "Causal Program Dependence Analysis"

Authors: Seongmin Lee, Dave Binkley, Robert Feldt, Nicolas Gold, Shin Yoo

Key Innovation: Framework based on causal inference capturing the strength of dependencies, not just existence.

URL: ScienceDirect

4.3 Indirection-Bounded Call Graph Analysis (ECOOP 2024)

Paper: "Indirection-Bounded Call Graph Analysis"

Key Innovation: For JavaScript, bounds the number of pointer/function indirections to find most call edges faster.

Finding: Typical JavaScript exhibits small levels of indirection.

URL: ECOOP 2024

4.4 Total Recall: Static Call Graph Quality (ISSTA 2024)

Paper: "Total Recall? How Good Are Static Call Graphs Really?"

Key Finding: Differences in call-graph construction and language features yield unsoundness and imprecision.

URL: ISSTA 2024
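
A small example shows how unsoundness arises in practice. The sketch below builds a name-based call graph for Python with the standard `ast` module; it is an illustrative toy, not one of the tools evaluated in the paper. It records only calls whose target is a plain name, so the dynamically dispatched call in `dynamic` is missed.

```python
import ast
from collections import defaultdict

SAMPLE = """
def helper():
    return 1

def direct():
    return helper()

def dynamic(name):
    return globals()[name]()
"""


def call_graph(source: str) -> dict[str, set[str]]:
    """Map each function name to the plain names it calls."""
    graph: dict[str, set[str]] = defaultdict(set)
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for node in ast.walk(func):
            # Only `name(...)` calls are resolved; attribute, subscript, and
            # other computed targets are skipped, which is a source of unsoundness.
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                graph[func.name].add(node.func.id)
    return dict(graph)


graph = call_graph(SAMPLE)
print(graph)
```

Calling `dynamic("helper")` at runtime does invoke `helper`, yet no `dynamic` to `helper` edge appears in the result. An impact analysis built on this graph would silently under-report.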

5. Reciprocal Rank Fusion (RRF)

5.1 Foundational Paper (SIGIR 2009)

Paper: "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods"

Authors: Gordon V. Cormack, Charles L. A. Clarke, Stefan Büttcher

Algorithm:

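RRF_score(d) = Σ_i (weight_i / (k + rank_i(d)))

Here rank_i(d) is the 1-based rank of document d in ranking i. The original paper uses the unweighted form (weight_i = 1) with k = 60; per-ranking weights are a later generalization.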

Key Finding: Consistently yields better results than any individual system and Condorcet Fuse.

URL: Semantic Scholar

5.2 RRF Parameters

Parameter	Typical Value	Notes
k (constant)	46-60	Dampens the advantage of top-ranked documents; the original paper fixed k = 60
Optimal k	Domain-specific	Sweeping k from 1-100 can cause multi-point metric swings
Query limit	3-5	Beyond 5, redundancy outweighs recall gains
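
A short calculation shows why k matters. The helper below (a hypothetical name, written for this example) compares how much more a rank-1 hit contributes than a rank-10 hit within a single ranking.

```python
def rank_ratio(k: int, high: int = 1, low: int = 10) -> float:
    """Ratio of contributions 1/(k + high) to 1/(k + low) for one ranking."""
    return (k + low) / (k + high)


for k in (1, 10, 60):
    print(k, round(rank_ratio(k), 3))
```

With k = 1 the top hit counts 5.5 times as much as the tenth; with k = 60 the ratio drops to 70/61, about 1.15. Larger k therefore flattens the curve and rewards agreement between rankings over a single high placement.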

5.3 RRF vs Alternatives

Method	Pros	Cons
RRF	No score normalization needed, robust	Ignores score magnitudes
Condorcet Fuse	Mathematically principled	More complex, often worse results
Linear Combination	Simple	Requires score normalization
Learning-to-Rank	Optimal for domain	Requires training data

6. Technology Comparison Matrix

6.1 Code Representation Approaches

Approach	Effort	Value	Latency	Complexity	Autonomy
Text Chunking	Low	60%	Fast	Low	Low
AST Parsing	Medium	80%	Medium	Medium	Medium
CPG (AST+CFG+DFG)	High	95%	Slow	High	High
GraphRAG	High	99%	Slow	Very High	Very High

6.2 Search Approaches

Approach	Effort	Value	Latency	Complexity	Manageability
FTS5 Only	Low	70%	<50ms	Low	Easy
Vector Only	Medium	75%	~200ms	Medium	Medium
Hybrid (RRF)	Medium	85%	~300ms	Medium	Medium
GraphRAG	High	95%	~500ms	High	Complex

6.3 Memory/Context Approaches

Approach	Effort	Value	Latency	Automation	Competitive Moat
No Memory	None	0%	N/A	None	None
Session Memory	Low	40%	Fast	Low	Low
Cross-Session	High	80%	Medium	High	High
Decision Tracking	High	90%	Medium	High	Very High
Memory + Code Intel	Very High	100%	Medium	Very High	Unique

7. Strategic Recommendations

7.1 CODITECT Competitive Position

Current Unique Capabilities:

Cross-session memory (ADR-020, ADR-021)
Decision tracking and rationale preservation
Error-solution knowledge base
Hybrid RRF semantic search (ADR-080)

Table Stakes (Must Implement):

AST-based code parsing (tree-sitter)
Call graph navigation
Basic impact analysis
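
Basic impact analysis can be framed as reverse reachability over the call graph: starting from a changed function, collect every direct and transitive caller. The sketch below assumes the call graph is already available as a mapping from caller to callees; the function and sample names are hypothetical.

```python
from collections import defaultdict, deque


def impacted_by(call_graph: dict[str, set[str]], changed: str) -> set[str]:
    """Return all functions that directly or transitively call `changed`."""
    callers: dict[str, set[str]] = defaultdict(set)
    for caller, callees in call_graph.items():
        for callee in callees:
            callers[callee].add(caller)  # invert the edges

    impacted: set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in impacted:  # visited check also guards against cycles
                impacted.add(caller)
                queue.append(caller)
    return impacted


graph = {"api": {"service"}, "cli": {"service"}, "service": {"db"}, "db": set()}
print(sorted(impacted_by(graph, "db")))
```

The result is only as complete as the underlying call graph, so any missed edges translate directly into missed impact.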

7.2 Recommended H.5.5.x Strategy

Task	Priority	Effort	Value	Differentiator?
H.5.5.2: Call Graph	High	16h	Medium	No (table stakes)
H.5.5.3: Impact + Memory	Critical	24h	High	Yes (unique combo)
H.5.5.4: Decision-Code Linking	Critical	16h	Very High	Yes (unique)
H.5.5.5: GraphRAG Integration	Medium	40h	High	Partial

7.3 Integration Recommendations

Code Pathfinder Integration Options:

Option	Pros	Cons	Recommendation
Fork & Extend	Full control	AGPL license, maintenance burden	Not recommended
Use as Library	Leverage existing	AGPL viral, limited Python	Evaluate carefully
Learn & Build	Clean IP, tailored	Development effort	Recommended
Hybrid	Best of both	Complexity	Consider for MVP

Key Learnings from Code Pathfinder:

Tree-sitter is the right choice for multi-language parsing
5 parallel workers is a good balance
SHA-256 for deterministic node IDs
Lazy loading with byte offsets reduces memory
ANTLR for query DSL parsing

7.4 Unique Value Proposition

CODITECT = Memory + Decisions + Code Intelligence

No competitor can answer:
• "What did I change last session that might cause this error?"
• "Which architectural decisions constrain this refactoring?"
• "Show me all times I've fixed this type of error"
• "What functions have I modified that call this API?"
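
The last question above is essentially a join between session memory and the call graph. The sketch below is purely illustrative: the `Change` record, its fields, and `modified_callers` are hypothetical and do not describe CODITECT's actual schema or API. It considers direct callers only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One remembered edit: which session touched which function (assumed shape)."""
    session_id: str
    function: str


def modified_callers(call_graph: dict[str, set[str]], changes: list[Change], api: str) -> dict[str, list[str]]:
    """Functions the user has modified that directly call `api`, with the sessions involved."""
    result: dict[str, list[str]] = {}
    for change in changes:
        if api in call_graph.get(change.function, set()):
            result.setdefault(change.function, []).append(change.session_id)
    return result


call_graph = {
    "checkout": {"charge_card"},
    "refund": {"charge_card"},
    "report": {"render"},
}
history = [Change("s1", "checkout"), Change("s2", "report"), Change("s3", "checkout")]
print(modified_callers(call_graph, history, "charge_card"))
```

Neither data source answers the question alone: the call graph knows `refund` calls the API but not that it was never edited, and the session history knows `report` was edited but not that it is unrelated.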

8. Bibliography

Code Representation

Zhang, J., Wang, X., Zhang, H., Sun, H., Wang, K., & Liu, X. (2019). A Novel Neural Source Code Representation based on Abstract Syntax Tree. ICSE 2019. IEEE Xplore
Zeng, J., et al. (2023). TAILOR: Learning Graph-based Code Representations for Source-Level Functional Similarity Detection. Paper
Wang, W., et al. (2022). Turn tree into graph: Automatic code review via simplified AST driven graph convolutional network. Information and Software Technology. ScienceDirect
Wang, W., Li, G., Ma, B., Xia, X., & Jin, Z. (2020). Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree. SANER 2020. arXiv:2002.08653
Zhao, J., et al. (2024). GraphPyRec: A novel graph-based approach for fine-grained Python code recommendation. ResearchGate

RAG and GraphRAG

Gupta, S., Ranjan, R., & Singh, S. N. (2024). A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions. arXiv:2410.12837
Singh, A., et al. (2025). Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. arXiv:2501.09136
Edge, D., et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Microsoft Research. arXiv:2404.16130
Rackauckas, Z. (2024). RAG-Fusion: a New Take on Retrieval-Augmented Generation. arXiv:2402.03367

Call Graphs and PDGs

Galindo, C., Pérez, S., & Silva, J. (2025). The Expression Dependence Graph. Journal of Logical and Algebraic Methods in Programming. ScienceDirect
Lee, S., Binkley, D., Feldt, R., Gold, N., & Yoo, S. (2025). Causal Program Dependence Analysis. Science of Computer Programming. ScienceDirect
Chakraborty, M., et al. (2024). Indirection-Bounded Call Graph Analysis. ECOOP 2024. Conference
Helm, D., et al. (2024). Total Recall? How Good Are Static Call Graphs Really? ISSTA 2024. Conference

Rank Fusion

Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR 2009. Semantic Scholar

Industry Resources

Microsoft Research. Project GraphRAG. Website
Code Pathfinder. AI-Native Static Code Analysis. GitHub
thunlp. GNNPapers: Must-read papers on graph neural networks. GitHub
