Code Intelligence Research Reference

Status: Active Research Document
Task ID: H.5.5.x (Code Intelligence Track)
Last Updated: 2026-01-17


Executive Summary

This document collects the academic and industry research behind CODITECT's code intelligence roadmap (H.5.5.x track). The analysis shows that AST-based code analysis and call graphs are now table stakes: multiple competitors and open-source projects already offer them. CODITECT's competitive moat lies in combining cross-session memory, decision tracking, and code intelligence in one system.


Table of Contents

  1. Competitive Landscape
  2. Academic Research: Code Representation
  3. Academic Research: RAG and GraphRAG
  4. Academic Research: Call Graphs and PDGs
  5. Reciprocal Rank Fusion (RRF)
  6. Technology Comparison Matrix
  7. Strategic Recommendations
  8. Bibliography

1. Competitive Landscape

1.1 Direct Competitors with AST/Call Graph Capabilities

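| Tool | AST Analysis | Call Graph | Impact Analysis | Memory | License |
| --- | --- | --- | --- | --- | --- |
| Code Pathfinder | Yes (tree-sitter) | Yes | Partial | No | AGPL-3.0 |
| Cursor | Yes (tree-sitter) | Limited | No | No | Proprietary |
| JetBrains AI | Yes (native parsers) | Yes | Yes | Limited | Proprietary |
| Augment Code | Yes | Yes | Limited | Limited | Proprietary |
| GraphRAG (MS) | No | No | N/A | Graph-based | MIT |
| CODITECT | Planned | Planned | Planned | Yes (unique) | Proprietary |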

1.2 Code Pathfinder Analysis

Repository: github.com/shivasurya/code-pathfinder

Architecture:

Source Files (.java, .py)
        ↓
Tree-Sitter AST Parsing (5 parallel workers)
        ↓
Code Graph (Nodes + Edges)
        ↓
Query Language (ANTLR parser)
        ↓
Query Engine (expr-lang evaluation)
        ↓
Output Formats (JSON, SARIF, Table)

Key Technical Decisions:

  • Tree-sitter for parsing: Language-agnostic AST generation
  • 5 parallel workers: Balance between parallelism and overhead
  • SHA-256 node IDs: Deterministic, consistent across runs
  • Lazy loading: SourceLocation with byte offsets reduces memory from 2.32GB to 2.18GB
  • Object pooling: sync.Pool for environment maps reduces GC pressure
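
The deterministic node ID idea from the list above can be illustrated with a minimal Python sketch. The choice of fields (file path, node kind, name, start byte) and the separator are assumptions made for illustration; Code Pathfinder's actual ID scheme may hash different inputs.

```python
import hashlib


def node_id(file_path: str, kind: str, name: str, start_byte: int) -> str:
    """Derive a stable ID from fields that do not change between runs.

    The field set and the NUL separator are illustrative assumptions.
    """
    key = "\x00".join([file_path, kind, name, str(start_byte)])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# The same inputs always hash to the same ID, so graphs built on
# different runs (or machines) can be compared node by node.
first_run = node_id("src/app.py", "function", "main", 120)
second_run = node_id("src/app.py", "function", "main", 120)
moved_node = node_id("src/app.py", "function", "main", 121)
```

Because the ID depends only on the hashed fields, a node keeps its ID as long as those fields are unchanged, and any change to them yields a different ID.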

Supported Languages: Java, Python, Dockerfile, docker-compose (Go support planned)

Performance:

  • Small codebase (<1k methods): ~100 MB memory
  • Large codebase (27k methods): ~2.18 GB memory
  • Graph building: ~5 seconds for 27k methods
  • Query execution: <1 second for simple queries

Industry Trends:

  • 85% of developers regularly use AI coding tools (Stack Overflow 2025)
  • 1,445% surge in multi-agent system inquiries from Q1 2024 to Q2 2025 (Gartner)
  • 40% of enterprise applications will embed AI agents by end of 2026 (Gartner)
  • GraphRAG adoption is reportedly pushing accuracy as high as 99% in some enterprise systems

2. Academic Research: Code Representation

2.1 ASTNN - AST-based Neural Network (ICSE 2019)

Paper: "A Novel Neural Source Code Representation based on Abstract Syntax Tree"

Key Innovation: Splits large ASTs into sequences of small statement trees, encoding lexical and syntactical knowledge.

Performance: The paper reports that ASTNN significantly outperforms the prior approaches it was compared against on code classification and clone detection.

URL: IEEE Xplore

2.2 TAILOR - Code Property Graph Learning

Paper: "Learning Graph-based Code Representations for Source-Level Functional Similarity Detection"

Key Innovation: Uses Code Property Graph (CPG) combining:

  • Abstract Syntax Tree (AST)
  • Control Flow Graph (CFG)
  • Data Flow Graph (DFG)

Architecture: CPGNN iteratively propagates node embeddings along CPG structure, then aggregates via pooling.
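
To make the propagate-then-pool idea concrete, here is a deliberately simplified Python sketch. It is not the paper's CPGNN: there are no learned weights, the update rule is a plain average, and the three-node graph, edge lists, and function names are all hypothetical.

```python
from collections import defaultdict

# Toy program:  x = 1; if x: y = x
# One node per statement, one edge list per CPG layer (all hypothetical).
EDGES = {
    "ast": [("if_x", "assign_y")],                          # parent -> child
    "cfg": [("assign_x", "if_x"), ("if_x", "assign_y")],    # execution order
    "dfg": [("assign_x", "if_x"), ("assign_x", "assign_y")],  # def -> use
}


def propagate(features, edges, rounds=1):
    """Mix each node's vector with the mean of its predecessors' vectors."""
    incoming = defaultdict(list)
    for edge_list in edges.values():
        for src, dst in edge_list:
            incoming[dst].append(src)
    for _ in range(rounds):
        updated = {}
        for node, vec in features.items():
            sources = [features[s] for s in incoming[node]]
            if sources:
                mean = [sum(col) / len(sources) for col in zip(*sources)]
                updated[node] = [0.5 * a + 0.5 * b for a, b in zip(vec, mean)]
            else:
                updated[node] = list(vec)
        features = updated
    return features


def mean_pool(features):
    """Aggregate node vectors into one graph-level vector."""
    vectors = list(features.values())
    return [sum(col) / len(vectors) for col in zip(*vectors)]


one_hot = {
    "assign_x": [1.0, 0.0, 0.0],
    "if_x": [0.0, 1.0, 0.0],
    "assign_y": [0.0, 0.0, 1.0],
}
node_vectors = propagate(one_hot, EDGES, rounds=1)
graph_vector = mean_pool(node_vectors)
```

After one round, `if_x` carries information from `assign_x` through both its control-flow and data-flow edges, which is the kind of signal a trained model would learn to weight.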

URL: Project Page

2.3 SimAST-GCN (2022)

Paper: "Turn tree into graph: Automatic code review via simplified AST driven graph convolutional network"

Key Innovation: Simplifies AST by deleting redundant nodes, constructs relation graph with attention mechanism.

URL: ScienceDirect

2.4 Flow-Augmented AST (FA-AST)

Paper: "Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree"

Key Innovation: Adds control flow and data flow edges to AST, uses GNN for feature extraction.
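
A rough illustration of adding flow edges on top of an AST, using Python's standard `ast` module. This sketch only adds sequential "next statement" edges between siblings in a statement list; the paper's edge types are richer, and the edge names and the `Type@line` node labels here are assumptions for illustration.

```python
import ast


def label(node: ast.AST) -> str:
    """Readable node label: type name plus line number where available."""
    return f"{type(node).__name__}@{getattr(node, 'lineno', '-')}"


def flow_augmented_edges(source: str):
    """Return AST child edges plus sequential-execution edges."""
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        # Ordinary AST structure.
        for child in ast.iter_child_nodes(parent):
            edges.append(("child", label(parent), label(child)))
        # Flow augmentation: link each statement to the next one in the
        # same block. Only list-valued bodies are statement blocks.
        body = getattr(parent, "body", None)
        if isinstance(body, list):
            for current, following in zip(body, body[1:]):
                edges.append(("next_stmt", label(current), label(following)))
    return edges


edges = flow_augmented_edges("x = 1\ny = x + 1\nprint(y)\n")
flow_only = [e for e in edges if e[0] == "next_stmt"]
```

For the three-line snippet, `flow_only` holds two edges: from the first assignment to the second, and from the second assignment to the `print` call.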

URL: arXiv

2.5 GraphPyRec (2024)

Paper: "GraphPyRec: A novel graph-based approach for fine-grained Python code recommendation"

Key Innovation: Uses Program Dependence Graph (PDG) and Data Flow Graph (DFG) as global context.

URL: ResearchGate


3. Academic Research: RAG and GraphRAG

3.1 RAG Evolution Survey (2024)

Paper: "A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions"

Key Findings:

  • RAG combines retrieval mechanisms with generative models
  • Reviews question-answering, summarization, knowledge-based tasks
  • Identifies trade-offs between retrieval precision and generation flexibility

URL: arXiv:2410.12837

3.2 Agentic RAG Survey (2025)

Paper: "Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG"

Key Innovation: Embeds autonomous AI agents into RAG pipeline with:

  • Reflection patterns
  • Planning capabilities
  • Tool use
  • Multi-agent collaboration

URL: arXiv:2501.09136

3.3 Microsoft GraphRAG (2024)

Paper: "From Local to Global: A Graph RAG Approach to Query-Focused Summarization"

Key Innovation: Two-stage knowledge graph construction:

  1. Derive entity knowledge graph from source documents
  2. Pre-generate community summaries for related entities

Advantages:

  • Handles queries requiring aggregation across dataset
  • Captures evidence provenance
  • Reveals structure and themes of dataset
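
The two-stage construction described above can be sketched in a few lines of Python. This is a toy under heavy assumptions: entity extraction is replaced by a hard-coded mapping (GraphRAG uses an LLM), connected components stand in for the Leiden community detection used in the paper, and the "summary" is just a member list rather than generated text.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical output of an entity-extraction step: document -> entities.
DOC_ENTITIES = {
    "d1": ["parser", "ast"],
    "d2": ["ast", "call_graph"],
    "d3": ["memory", "decisions"],
}


def build_entity_graph(doc_entities):
    """Stage 1: connect entities that co-occur in the same document."""
    graph = defaultdict(set)
    for entities in doc_entities.values():
        for entity in entities:
            graph[entity]  # make sure isolated entities still appear
        for a, b in combinations(entities, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def find_communities(graph):
    """Group related entities (connected components as a stand-in)."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node not in members:
                members.add(node)
                stack.extend(graph[node] - members)
        seen |= members
        result.append(sorted(members))
    return result


def summarize(communities):
    """Stage 2: pre-generate one summary per community (placeholder text)."""
    return {i: "Covers: " + ", ".join(c) for i, c in enumerate(communities)}


communities = find_communities(build_entity_graph(DOC_ENTITIES))
summaries = summarize(communities)
```

A global query would then be answered from the pre-generated summaries instead of from individual text chunks.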

URL: arXiv:2404.16130

GitHub: microsoft/graphrag

3.4 RAG-Fusion (2024)

Paper: "RAG-Fusion: a New Take on Retrieval-Augmented Generation"

Key Innovation: Combines RAG with RRF by:

  1. Generating multiple queries
  2. Reranking with reciprocal scores
  3. Fusing documents and scores

Best Practices:

  • Limit to 3-5 well-crafted sub-queries
  • Beyond 5 queries, redundancy outweighs recall gains

URL: arXiv:2402.03367


4. Academic Research: Call Graphs and PDGs

4.1 Expression Dependence Graph (2025)

Paper: "The Expression Dependence Graph"

Authors: Carlos Galindo, Sergio Pérez, Josep Silva

Key Innovation: Extension of System Dependence Graph (SDG) for improved precision in program slicing with:

  • List comprehensions
  • Try-catch blocks
  • For loops

URL: ScienceDirect

4.2 Causal Program Dependence Analysis (2025)

Paper: "Causal Program Dependence Analysis"

Authors: Seongmin Lee, Dave Binkley, Robert Feldt, Nicolas Gold, Shin Yoo

Key Innovation: Framework based on causal inference capturing the strength of dependencies, not just existence.

URL: ScienceDirect

4.3 Indirection-Bounded Call Graph Analysis (ECOOP 2024)

Paper: "Indirection-Bounded Call Graph Analysis"

Key Innovation: For JavaScript, bounds the number of pointer/function indirections to find most call edges faster.

Finding: Typical JavaScript exhibits small levels of indirection.

URL: ECOOP 2024

4.4 Total Recall: Static Call Graph Quality (ISSTA 2024)

Paper: "Total Recall? How Good Are Static Call Graphs Really?"

Key Finding: Differences in call-graph construction and language features yield unsoundness and imprecision.
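
A small Python example shows one way unsoundness arises. The builder below is a naive sketch, not a technique from the paper: it records only calls whose target is a plain name, so a call dispatched through a dictionary is silently missed. The sample source and all names in it are hypothetical.

```python
import ast
from collections import defaultdict

SOURCE = '''
def load():
    return 1

def save(x):
    pass

def cleanup():
    pass

HANDLERS = {"cleanup": cleanup}

def run():
    save(load())
    HANDLERS["cleanup"]()   # dynamic dispatch: target not a plain name
'''


def build_call_graph(source: str):
    """Map each function name to the names it calls directly."""
    graph = defaultdict(set)
    for func in ast.walk(ast.parse(source)):
        if isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            graph[func.name]  # register functions that call nothing
            for node in ast.walk(func):
                # Only `name(...)` calls are resolved; attribute,
                # subscript, and other computed targets are skipped.
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    graph[func.name].add(node.func.id)
    return graph


call_graph = build_call_graph(SOURCE)
```

Here `call_graph["run"]` contains `save` and `load` but not `cleanup`, although `run` does call it at runtime. Resolving such edges requires points-to or type information, which is where analyses differ in soundness and precision.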

URL: ISSTA 2024


5. Reciprocal Rank Fusion (RRF)

5.1 Foundational Paper (SIGIR 2009)

Paper: "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods"

Authors: Gordon V. Cormack, Charles L. A. Clarke, Stefan Büttcher

Algorithm:

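RRF_score(d) = Σ_i weight_i / (k + rank_i(d))

where rank_i(d) is the 1-based position of document d in the i-th ranked list. The original paper uses the unweighted form (weight_i = 1) with k = 60; per-list weights are a later generalization.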
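
The scoring rule can be written as a short Python function. This is a minimal sketch: the function name, the tie-breaking by document ID, and the two sample rankings are assumptions for illustration.

```python
def rrf(rankings, weights=None, k=60):
    """Fuse ranked lists of document IDs by reciprocal rank.

    rankings: list of lists, each ordered best-first.
    weights:  optional per-list weights (defaults to 1.0 for every list).
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Best score first; ties broken by document ID so output is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


keyword_hits = ["a", "b", "c"]   # e.g. from full-text search
vector_hits = ["c", "a", "d"]    # e.g. from embedding search

fused = rrf([keyword_hits, vector_hits])
order = [doc_id for doc_id, _ in fused]
```

Only ranks enter the computation, so the two retrievers can use incompatible score scales. Document `a` (ranks 1 and 2) narrowly beats `c` (ranks 3 and 1) because reciprocal rank rewards consistently high placement.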

Key Finding: Consistently yields better results than any individual system and Condorcet Fuse.

URL: Semantic Scholar

5.2 RRF Parameters

| Parameter | Typical Value | Notes |
| --- | --- | --- |
| k (constant) | 46-60 (60 in the original paper) | Prevents high ranks from dominating |
| Optimal k | Domain-specific | Sweeping k from 1-100 can cause multi-point metric swings |
| Query limit | 3-5 | Beyond 5, redundancy outweighs recall gains |

5.3 RRF vs Alternatives

| Method | Pros | Cons |
| --- | --- | --- |
| RRF | No score normalization needed, robust | Ignores score magnitudes |
| Condorcet Fuse | Mathematically principled | More complex, often worse results |
| Linear Combination | Simple | Requires score normalization |
| Learning-to-Rank | Optimal for domain | Requires training data |

6. Technology Comparison Matrix

6.1 Code Representation Approaches

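| Approach | Effort | Value | Latency | Complexity | Autonomy |
| --- | --- | --- | --- | --- | --- |
| Text Chunking | Low | 60% | Fast | Low | Low |
| AST Parsing | Medium | 80% | Medium | Medium | Medium |
| CPG (AST+CFG+DFG) | High | 95% | Slow | High | High |
| GraphRAG | High | 99% | Slow | Very High | Very High |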

6.2 Search Approaches

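| Approach | Effort | Value | Latency | Complexity | Manageability |
| --- | --- | --- | --- | --- | --- |
| FTS5 Only | Low | 70% | <50ms | Low | Easy |
| Vector Only | Medium | 75% | ~200ms | Medium | Medium |
| Hybrid (RRF) | Medium | 85% | ~300ms | Medium | Medium |
| GraphRAG | High | 95% | ~500ms | High | Complex |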

6.3 Memory/Context Approaches

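| Approach | Effort | Value | Latency | Automation | Competitive Moat |
| --- | --- | --- | --- | --- | --- |
| No Memory | None | 0% | N/A | None | None |
| Session Memory | Low | 40% | Fast | Low | Low |
| Cross-Session | High | 80% | Medium | High | High |
| Decision Tracking | High | 90% | Medium | High | Very High |
| Memory + Code Intel | Very High | 100% | Medium | Very High | Unique |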

7. Strategic Recommendations

7.1 CODITECT Competitive Position

Current Unique Capabilities:

  1. Cross-session memory (ADR-020, ADR-021)
  2. Decision tracking and rationale preservation
  3. Error-solution knowledge base
  4. Hybrid RRF semantic search (ADR-080)

Table Stakes (Must Implement):

  1. AST-based code parsing (tree-sitter)
  2. Call graph navigation
  3. Basic impact analysis

| Task | Priority | Effort | Value | Differentiator? |
| --- | --- | --- | --- | --- |
| H.5.5.2: Call Graph | High | 16h | Medium | No (table stakes) |
| H.5.5.3: Impact + Memory | Critical | 24h | High | Yes (unique combo) |
| H.5.5.4: Decision-Code Linking | Critical | 16h | Very High | Yes (unique) |
| H.5.5.5: GraphRAG Integration | Medium | 40h | High | Partial |
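
Basic impact analysis, listed among the table stakes above, can be sketched as reverse reachability over a call graph: given a changed function, find every function that calls it directly or transitively. The graph literal and function names below are hypothetical, and the sketch does not represent CODITECT's planned design.

```python
from collections import defaultdict, deque


def impacted_by(call_graph, changed):
    """Return all direct and transitive callers of `changed`."""
    # Invert caller -> callees into callee -> callers.
    callers = defaultdict(set)
    for caller, callees in call_graph.items():
        for callee in callees:
            callers[callee].add(caller)

    # Breadth-first walk up the inverted graph.
    impacted, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for caller in callers[current]:
            if caller not in impacted and caller != changed:
                impacted.add(caller)
                queue.append(caller)
    return impacted


# Hypothetical call graph: caller -> set of callees.
CALL_GRAPH = {
    "main": {"run"},
    "run": {"save", "load"},
    "report": {"load"},
    "save": set(),
    "load": set(),
}

load_impact = impacted_by(CALL_GRAPH, "load")
save_impact = impacted_by(CALL_GRAPH, "save")
```

Changing `load` touches `run`, `main`, and `report`, while changing `save` touches only `run` and `main`. The "Impact + Memory" task would presumably join a result like this against session history, which is outside the scope of this sketch.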

7.3 Integration Recommendations

Code Pathfinder Integration Options:

| Option | Pros | Cons | Recommendation |
| --- | --- | --- | --- |
| Fork & Extend | Full control | AGPL license, maintenance burden | Not recommended |
| Use as Library | Leverage existing | AGPL viral, limited Python | Evaluate carefully |
| Learn & Build | Clean IP, tailored | Development effort | Recommended |
| Hybrid | Best of both | Complexity | Consider for MVP |

Key Learnings from Code Pathfinder:

  1. Tree-sitter is the right choice for multi-language parsing
  2. 5 parallel workers is a good balance
  3. SHA-256 for deterministic node IDs
  4. Lazy loading with byte offsets reduces memory
  5. ANTLR for query DSL parsing

7.4 Unique Value Proposition

CODITECT = Memory + Decisions + Code Intelligence

No competitor can answer:
• "What did I change last session that might cause this error?"
• "Which architectural decisions constrain this refactoring?"
• "Show me all times I've fixed this type of error"
• "What functions have I modified that call this API?"
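
As a purely illustrative sketch of how a question like "Which architectural decisions constrain this refactoring?" might be answered, the snippet below links decision records to code symbols and looks up the decisions touching a set of symbols. The `Decision` record, its fields, and the `DEC-*` identifiers are hypothetical and do not describe CODITECT's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    decision_id: str   # hypothetical identifier, not a real ADR number
    summary: str
    symbols: set = field(default_factory=set)  # code symbols it governs


def decisions_constraining(decisions, touched_symbols):
    """Return IDs of decisions linked to any symbol being changed."""
    touched = set(touched_symbols)
    return [d.decision_id for d in decisions if d.symbols & touched]


DECISIONS = [
    Decision("DEC-1", "All persistence goes through save()", {"save"}),
    Decision("DEC-2", "Reports are read-only", {"report"}),
]

# Symbols a planned refactoring would touch, e.g. from impact analysis.
hits = decisions_constraining(DECISIONS, {"save", "run"})
```

The lookup itself is trivial. The hard part, and the differentiator this document argues for, is capturing the decision-to-symbol links across sessions in the first place.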

8. Bibliography

Code Representation

  1. Zhang, J., Wang, X., Zhang, H., Sun, H., Wang, K., & Liu, X. (2019). A Novel Neural Source Code Representation based on Abstract Syntax Tree. ICSE 2019. IEEE Xplore

  2. Zeng, J., et al. (2023). TAILOR: Learning Graph-based Code Representations for Source-Level Functional Similarity Detection. Paper

  3. Wu, B., Liang, B., & Zhang, X. (2022). Turn tree into graph: Automatic code review via simplified AST driven graph convolutional network. Knowledge-Based Systems. ScienceDirect

  4. Wang, W., et al. (2020). Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree. arXiv:2002.08653

  5. Zhao, J., et al. (2024). GraphPyRec: A novel graph-based approach for fine-grained Python code recommendation. ResearchGate

RAG and GraphRAG

  1. Gupta, S., Ranjan, R., & Singh, S. N. (2024). A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions. arXiv:2410.12837

  2. Singh, A., et al. (2025). Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. arXiv:2501.09136

  3. Edge, D., et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Microsoft Research. arXiv:2404.16130

  4. Rackauckas, Z. (2024). RAG-Fusion: a New Take on Retrieval-Augmented Generation. arXiv:2402.03367

Call Graphs and PDGs

  1. Galindo, C., Pérez, S., & Silva, J. (2025). The Expression Dependence Graph. Journal of Logical and Algebraic Methods in Programming. ScienceDirect

  2. Lee, S., Binkley, D., Feldt, R., Gold, N., & Yoo, S. (2025). Causal Program Dependence Analysis. Science of Computer Programming. ScienceDirect

  3. Chakraborty, M., et al. (2024). Indirection-Bounded Call Graph Analysis. ECOOP 2024. Conference

  4. Helm, D., et al. (2024). Total Recall? How Good Are Static Call Graphs Really? ISSTA 2024. Conference

Rank Fusion

  1. Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR 2009. Semantic Scholar

Industry Resources

  1. Microsoft Research. Project GraphRAG. Website

  2. Code Pathfinder. AI-Native Static Code Analysis. GitHub

  3. thunlp. GNNPapers: Must-read papers on graph neural networks. GitHub


Document Version: 1.0.0
Created: 2026-01-17
Author: CODITECT Research Team
Copyright: 2026 AZ1.AI INC. All rights reserved.