{"ts": "2026-02-09T09:03:33.514741+00:00", "doc_id": "1010.0422", "pipeline": "1.3-udom", "step": "pipeline_start", "status": "ok", "input": "../../analyze-new-artifacts/coditect-ai-research-pdfs/1010.0422-convolutional-matching-pursuit-dictionary-training-lecun.pdf", "output": "../../analyze-new-artifacts/coditect-ai-research-pdfs/1010.0422-convolutional-matching-pursuit-dictionary-training-lecun.md"}
{"ts": "2026-02-09T09:03:38.460608+00:00", "doc_id": "1010.0422", "pipeline": "1.3-udom", "step": "extract_pdf", "status": "ok", "engine": "docling", "components": 65, "assets": 11, "elapsed_s": 4.9} {"ts": "2026-02-09T09:03:40.607712+00:00", "doc_id": "1010.0422", "pipeline": "1.3-udom", "step": "extract_html", "status": "ok", "components": 70, "assets": 6, "elapsed_s": 2.1} {"ts": "2026-02-09T09:03:42.810281+00:00", "doc_id": "1010.0422", "pipeline": "1.3-udom", "step": "extract_latex", "status": "ok", "components": 28, "elapsed_s": 2.2} {"ts": "2026-02-09T09:03:42.905758+00:00", "doc_id": "1010.0422", "pipeline": "1.3-udom", "step": "mapping", "status": "ok", "components": 119, "types": {"heading": 11, "paragraph": 59, "equation": 21, "list": 3, "figure": 16, "bibliography_entry": 9}, "bibliography": 18, "assets": 17, "elapsed_s": 0.1} {"ts": "2026-02-09T09:03:42.906534+00:00", "doc_id": "1010.0422", "pipeline": "1.3-udom", "step": "assembly", "status": "ok", "chars": 18894, "lines": 325, "elapsed_s": 0.0} {"ts": "2026-02-09T09:03:42.907994+00:00", "doc_id": "1010.0422", "pipeline": "1.3-udom", "step": "qa_grade", "status": "ok", "grade": "A", "score": 0.8794117647058822, "matched": 39, "upgraded": 11} {"ts": "2026-02-09T09:03:42.908003+00:00", "doc_id": "1010.0422", "pipeline": "1.3-udom", "step": "pipeline_complete", "status": "ok", "grade": "A", "score": 0.8794117647058822, "components": 119, "images_saved": 17, "elapsed_s": 9.4} EXPLAIN
These log lines are a trace of your document-ingestion pipeline successfully converting the arXiv paper 1010.0422 (“Convolutional Matching Pursuit and Dictionary Training” by Szlam, Kavukcuoglu, and LeCun) into your internal .md UDOM format.
High‑level explanation
Each JSON line is one pipeline step executed (likely by a worker) against the same doc_id:

- pipeline_start
  - Input: original PDF path ../../analyze-new-artifacts/coditect-ai-research-pdfs/1010.0422-convolutional-matching-pursuit-dictionary-training-lecun.pdf
  - Output: target Markdown path ...1010.0422-convolutional-matching-pursuit-dictionary-training-lecun.md
  - Meaning: orchestration created a job to transform this PDF into a normalized Markdown artifact (UDOM doc).
- extract_pdf
  - Engine: docling; components: 65, assets: 11, elapsed_s: 4.9.
  - Meaning: PDF pages were parsed into a structured intermediate representation (e.g., blocks, figures), identifying ~65 logical components (paragraphs, headings, etc.) and 11 embedded assets (images/tables).
- extract_html
  - components: 70, assets: 6, elapsed_s: 2.1.
  - Meaning: an HTML view of the document was generated (or parsed) from the PDF/intermediate, with 70 structural components and 6 assets (often fewer because some assets are filtered or consolidated for HTML).
- extract_latex
  - components: 28, elapsed_s: 2.2.
  - Meaning: LaTeX-like semantic blocks and equations were extracted, giving you equation/inline math and structured sections that mirror the arXiv source for this paper.
- mapping
  - components: 119, with a types histogram (heading: 11, paragraph: 59, equation: 21, list: 3, figure: 16, bibliography_entry: 9); bibliography: 18, assets: 17.
  - Meaning: the separate extraction streams were reconciled into a canonical UDOM tree: each block is typed (heading, paragraph, equation, etc.), figures and bibliography entries resolved, and assets counted. This is effectively your "semantic layout" pass.
- assembly
  - chars: 18894, lines: 325, elapsed_s: 0.0.
  - Meaning: the mapped UDOM structure was serialized, in this case into your Markdown target file. Character and line counts describe the final assembled text.
- qa_grade
  - grade: "A", score: 0.8794, matched: 39, upgraded: 11.
  - Meaning: a QA component compared the extracted artifact against expected heuristics or reference signals (e.g., checking headings, equation counts, figure presence, or cross-format consistency). The A grade and relatively high score suggest the extraction is high quality; matched and upgraded are likely counts of checks that passed and blocks whose classification was improved.
- pipeline_complete
  - grade: "A", score: 0.8794, components: 119, images_saved: 17, elapsed_s: 9.4.
  - Meaning: the job finished successfully, with the final metrics echoing the QA results and a total wall-clock time of ~9.4 seconds for the entire pipeline. The images_saved count indicates how many visual assets were persisted for downstream use.
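A JSONL trace like the one above is easy to consume programmatically. Below is a minimal sketch; the `summarize_trace` helper and the truncated two-line trace are illustrative, not part of the actual pipeline:

```python
import json

def summarize_trace(lines):
    """Parse JSONL pipeline-trace lines; return doc_id, final grade, step timings."""
    steps = [json.loads(ln) for ln in lines if ln.strip()]
    doc_id = steps[0]["doc_id"]
    grade = next((s["grade"] for s in steps if s["step"] == "pipeline_complete"), None)
    timings = {s["step"]: s["elapsed_s"] for s in steps if "elapsed_s" in s}
    return doc_id, grade, timings

# hypothetical two-step excerpt of a trace
trace = [
    '{"ts": "t0", "doc_id": "1010.0422", "pipeline": "1.3-udom", "step": "extract_pdf", "status": "ok", "elapsed_s": 4.9}',
    '{"ts": "t1", "doc_id": "1010.0422", "pipeline": "1.3-udom", "step": "pipeline_complete", "status": "ok", "grade": "A", "elapsed_s": 9.4}',
]
doc_id, grade, timings = summarize_trace(trace)
print(doc_id, grade, timings["extract_pdf"])  # → 1010.0422 A 4.9
```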
Why this matters for your downstream agents
You can now treat 1010.0422 as a normalized research artifact across modalities—plain text, equations, figures, and bibliography—backed by a quality score. This makes it suitable as:
- A source for retrieval-augmented reasoning about convolutional matching pursuit and convolutional dictionary learning.
- A candidate for further transforms (chunking, embedding, citation graph building) keyed by doc_id = "1010.0422".
Suggested follow‑up prompts
Given your workflow, here are tight prompts you can paste as next steps:
- Summarize arxiv:1010.0422 convolutional matching pursuit and dictionary training, focusing on algorithmic details, objective functions, and computational complexity compared to standard K-SVD.
- From doc_id 1010.0422, extract all formal definitions and key equations for convolutional matching pursuit and convolutional dictionary learning; return as LaTeX blocks with brief explanations.
- Relate the model in arxiv:1010.0422 to modern convolutional sparse coding and ML-CSC formulations; identify later papers that generalize or refine its approach for deep architectures.
- Generate implementation notes and pseudocode for convolutional matching pursuit and dictionary training as described in arxiv:1010.0422, suitable for a PyTorch prototype.
If you want, send one of these (or your own variant) and I can help you turn the ingested Markdown into concrete implementation or design decisions.
Key innovations in translation-invariant matching pursuit
Translation-invariant matching pursuit is mainly realized as convolutional matching pursuit, where atoms appear at all spatial/temporal shifts and are selected via fast correlation (convolution) with the signal. The key innovations cluster around representation, algorithms, and learning.
Core representational ideas
- Convolutional dictionary instead of flat dictionary: Each atom is a local filter whose shifted copies tile the signal, yielding a translation-invariant (shift-equivariant at representation level) sparse model.
- Global objective, local structure: The sparse coding problem is still a global $\ell_0$/$\ell_1$-constrained reconstruction, but the dictionary has strong Toeplitz/convolutional structure, tying parameters across all shifts.
- Connection to CNNs and scattering: Later work shows convolutional sparse coding and its multi-layer variants (ML-CSC) give a generative counterpart to convolutional networks and invariant scattering transforms, with translation-invariant or stable-to-deformation feature maps.
Algorithmic innovations
- Convolutional matching pursuit: Classic MP is adapted so atom selection is done by correlating each filter over all locations (via convolution) and picking the maximum-response (filter, position) pair at each greedy step.
- Exploiting fast convolution: FFT or fast spatial convolution is used to keep the cost of scoring all shifts low, turning what would be a huge discrete translation-invariant dictionary into a tractable operation.
- Efficient global pursuit with local operations: Later CSC work provides schemes that recover the globally optimal convolutional sparse code using only local convolutions and thresholding while maintaining provable stability and recovery guarantees.
Dictionary learning in the translation-invariant setting
- K-SVD-style learning for convolutional dictionaries: Szlam, Kavukcuoglu, and LeCun show that K-SVD can be adapted so atoms are learned as convolutional filters shared across spatial positions, still using sparse codes from convolutional MP.
- Online / large-scale training: Follow-on work develops online convolutional dictionary learning and multi-channel variants that scale to large signals and images, crucial for practical translation-invariant sparse models.
- Multi-layer convolutional sparse coding: ML-CSC extends translation-invariant sparse coding hierarchically, with dictionaries at multiple layers, and interprets a CNN forward pass as a cascade of pursuit steps over these convolutional dictionaries.
Advances beyond discrete shifts
- Continuous translation-invariant sparse recovery: “Recovery of sparse translation-invariant signals with continuous basis pursuit” introduces continuous basis pursuit (CBP), which augments the dictionary with interpolation functions (e.g., derivatives) so that continuous shifts are modeled via block-sparse coefficients.
- Tradeoff resolution: CBP addresses the discretization vs. sparsity tradeoff in standard convolutional basis pursuit, providing a convex formulation whose solution better approximates true continuous translations while retaining sparse recovery properties.
If you want to push further
To dive deeper or connect this to your agentic work, good next prompts are:
- “Summarize the algorithmic steps and complexity of convolutional matching pursuit in arxiv:1010.0422, and contrast with classic matching pursuit and OMP.”
- “Relate ML-CSC (Sulam et al.) and invariant scattering (Bruna & Mallat) to translation-invariant sparse coding; extract the main theoretical stability guarantees and their assumptions.”
- “Explain continuous basis pursuit for translation-invariant signals (Ekanadham et al.); derive the objective, constraints, and its relationship to block-sparse recovery.”
How does convolutional sparse coding differ from standard sparse coding? How does the convolutional matching pursuit algorithm work step by step, and how does it differ from standard matching pursuit? What are the applications of translation-invariant sparse coding in computer vision, and how does K-SVD compare with other dictionary learning methods for convolutions? Implement translation-invariant matching pursuit in Python.
Convolutional sparse coding differs from standard sparse coding by using a structured convolutional (translation-invariant) dictionary instead of a flat collection of unrelated atoms, which changes both the representation and the algorithms used for pursuit and learning.
Convolutional vs standard sparse coding
- Standard sparse coding
- Dictionary $D \in \mathbb{R}^{n \times K}$ is a set of $K$ arbitrary atoms; a signal $y \in \mathbb{R}^n$ is approximated as $y \approx D x$ with sparse $x \in \mathbb{R}^K$.
- No explicit structure encodes translations; the model typically needs separate atoms for the same pattern at different locations.
- Convolutional sparse coding (CSC)
- Dictionary is a set of local filters $\{d_k\}$ convolved with feature maps $\{x_k\}$: $y \approx \sum_k d_k * x_k$, where each $x_k$ is sparse.
- Each filter is shared across all spatial positions, yielding shift-equivariant representations and implicitly translation-invariant statistics (after pooling or further processing).
- The optimization uses the convolutional structure for efficiency (FFT, block-circulant operators) and different regularity assumptions (e.g., local coherence conditions).
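The convolutional generative model $y \approx \sum_k d_k * x_k$ can be made concrete in a few lines. A minimal 1D sketch (the filter count, sizes, and event positions are arbitrary choices for illustration):

```python
import numpy as np

# Toy convolutional generative model: y = sum_k d_k * x_k ('full' convolution).
rng = np.random.default_rng(0)
N, L = 64, 5
d = [rng.standard_normal(L) for _ in range(3)]   # three local filters
x = [np.zeros(N - L + 1) for _ in range(3)]      # sparse feature maps
x[0][10] = 2.0                                   # one event for filter 0
x[2][40] = -1.5                                  # one event for filter 2
y = sum(np.convolve(xk, dk, mode="full") for xk, dk in zip(x, d))
print(y.shape)  # → (64,)
```

Two sparse events suffice to generate the whole signal; a flat dictionary would need a separate atom for each (pattern, position) pair.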
Convolutional Matching Pursuit (CMP) vs classic MP and OMP
CMP algorithm steps (Szlam, Kavukcuoglu, LeCun 2010)
For signal $y$ and convolutional dictionary $\{d_k\}$:
- Initialization
  - Residual $r^{(0)} = y$.
  - Sparse feature maps $x_k^{(0)} = 0$ for all filters $k$.
- Compute filter responses
  - Correlate each filter with the current residual: $c_k = d_k \star r^{(t)}$ (cross-correlation over all valid positions).
- Atom and position selection
  - Find the maximal absolute response over all $k$ and locations $i$: $(k^*, i^*) = \arg\max_{k,i} |c_k[i]|$.
- Coefficient update
  - Set or increment the coefficient at that filter and position: $x_{k^*}^{(t+1)}[i^*] = x_{k^*}^{(t)}[i^*] + c_{k^*}[i^*]$ (or a variant with a step size).
- Residual update
  - Subtract the selected atom's contribution: place the filter $d_{k^*}$, scaled by $c_{k^*}[i^*]$, at position $i^*$ and subtract it from $r^{(t)}$.
- Stopping
  - Stop when a sparsity budget is reached or the residual norm falls below a threshold.
Complexity differences
- Classic MP (flat dictionary)
  - Each greedy step scores all $K$ atoms against the residual: $O(nK)$ inner products per iteration for signals of length $n$.
- CMP
  - Each step scores every filter at every position; a naive spatial implementation costs $O(NKL)$ per iteration ($N$ samples, $K$ filters of support $L$), and FFT-based correlation reduces this to $O(KN \log N)$.
- OMP vs MP vs CMP
  - MP: greedy selection, no re-orthogonalization, residual updated using the current atom only.
  - OMP: maintains an orthogonal projection onto the span of selected atoms, solving a least-squares problem at each step; more accurate but more expensive per iteration.
  - Convolutional OMP: same idea but with Toeplitz/convolutional structure in the Gram matrix; algorithms exploit this structure to accelerate least-squares solves.
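The OMP refit step that distinguishes it from plain MP is a small least-squares solve over the selected atoms. A minimal flat-dictionary sketch (the `omp_step` helper and the toy dimensions are hypothetical):

```python
import numpy as np

def omp_step(y, D, support):
    """One OMP refit: project y onto the span of the selected atoms.
    D: (n, K) flat dictionary; support: list of selected atom indices."""
    A = D[:, support]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return coef, residual

rng = np.random.default_rng(1)
D = rng.standard_normal((20, 8))
D /= np.linalg.norm(D, axis=0)        # unit-norm atoms
y = 3.0 * D[:, 2] - 1.0 * D[:, 5]     # exact two-atom combination
coef, r = omp_step(y, D, [2, 5])
print(np.allclose(r, 0.0, atol=1e-8))  # → True
```

Because the refit re-optimizes all selected coefficients jointly, the residual stays orthogonal to every chosen atom, which MP's single-atom update does not guarantee.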
ML-CSC and invariant scattering: theory and stability
- Multilayer Convolutional Sparse Coding (ML-CSC, Sulam et al.)
- Models signals as generated by multiple convolutional sparse layers: $y = D_1 * x_1$, $x_1 = D_2 * x_2$, …, with sparse codes at each layer.
- Shows that forward passes of certain convolutional networks correspond to multi-layer pursuit (e.g., thresholding) on this generative model.
- Stability guarantees: under convolutional sparsity and mutual coherence conditions, the estimated sparse codes are stable to small input perturbations and bounded noise; reconstruction error and support recovery degrade gracefully with noise level.
- Invariant Scattering Networks (Bruna & Mallat)
- Build translation-invariant, deformation-stable representations using cascades of wavelet convolutions, complex modulus, and averaging.
- For a signal $x$, scattering coefficients are global averages of modulus wavelet coefficients along paths in a convolutional tree, yielding translation invariance and Lipschitz stability to small deformations.
- Relation to translation-invariant sparse coding: both use convolutional, localized filters and non-linearities; scattering is non-learned and emphasizes provable invariance/stability, whereas CSC learns the filters and codes but can be analyzed using similar tools (e.g., Lipschitz bounds, stability under structured sparsity).
Continuous Basis Pursuit (CBP) and block-sparse recovery
- Ekanadham, Tranchina, Simoncelli (continuous translation-invariant signals)
- Problem: recover signals that are sparse in a continuously translated dictionary, not just a discrete grid of shifts.
- Idea: approximate continuous shifts by augmenting the dictionary with derivatives (or other interpolation functions) and enforcing block sparsity over groups corresponding to each template.
- Objective and constraints (conceptual form)
- Represent signal as
$$ y(t) \approx \sum_m a_m \, \phi(t - \tau_m) $$
where $\tau_m$ are continuous shifts.
- CBP relaxes this to a convex problem with coefficients over discrete shifts plus interpolation terms, and group/block sparsity penalties (e.g., mixed $\ell_{1,2}$ norms) that tie together amplitude and shift parameters.
- This connects directly to block-sparse recovery: each potential location/group encodes both the “on/off” of a template and its local translation-refinement variables; sparsity is enforced at the group level.
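The derivative-augmentation idea behind CBP is a first-order Taylor expansion: $\phi(t - \tau) \approx \phi(t) - \tau\,\phi'(t)$, so a small continuous shift becomes a pair of coefficients on the $(\phi, \phi')$ atom pair. A numeric sanity check of that approximation (Gaussian template and shift size chosen arbitrarily):

```python
import numpy as np

# First-order Taylor view behind CBP: phi(t - tau) ≈ phi(t) - tau * phi'(t),
# so small continuous shifts become coefficients on (phi, phi') atom pairs.
t = np.linspace(-3, 3, 201)
phi = np.exp(-t**2)                 # template atom
dphi = np.gradient(phi, t)          # its derivative atom
tau = 0.05                          # small continuous shift
shifted = np.exp(-(t - tau)**2)     # exact continuously shifted template
approx = phi - tau * dphi           # two-atom linear approximation
err = np.max(np.abs(shifted - approx))
print(err < 1e-2)  # → True
```

CBP generalizes this beyond first order with other interpolators and makes the shift and amplitude jointly recoverable through group-sparse penalties.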
Applications of translation-invariant sparse coding in vision
- Image classification and recognition
  - Supervised translation-invariant sparse coding uses local descriptors, convolutional sparse codes, and spatial pooling to build features for classification, similar in spirit to CNNs.
- Low-level vision
  - Convolutional sparse models are used for denoising, inpainting, and super-resolution, where shift-shared filters capture repeated local image structure.
- Invariant and transformation-tolerant representations
  - Pooling over the shift dimension of convolutional codes (or scattering-style averaging) yields features that are tolerant to translations and small deformations.
K-SVD and convolutional dictionary learning
- Standard K-SVD
- Alternates between sparse coding (e.g., MP/OMP) and atom updates via SVD on the error restricted to the atom’s support.
- Convolutional dictionary learning (CDL)
- Extends K-SVD-style updates to convolutional filters; updates each filter using residuals restricted to locations where that filter is active, treating overlapping patches induced by convolution.
- Many algorithms (e.g., ADMM-based, FISTA-based) have been proposed; a large comparative review shows wide performance differences, and that carefully designed CDL algorithms can significantly outperform naive extensions of K-SVD in both speed and quality.
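The K-SVD atom update described above (SVD on the error restricted to the atom's support) fits in a few lines for a flat dictionary. A minimal sketch; the `ksvd_atom_update` helper and toy dimensions are hypothetical:

```python
import numpy as np

def ksvd_atom_update(Y, D, X, k):
    """K-SVD-style update of atom k: rank-1 SVD of the residual restricted
    to the signals that actually use atom k. Y: (n, M), D: (n, K), X: (K, M)."""
    used = np.nonzero(X[k])[0]
    if used.size == 0:
        return D, X
    # residual with atom k's contribution added back, on its support only
    E = Y[:, used] - D @ X[:, used] + np.outer(D[:, k], X[k, used])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, k] = U[:, 0]                 # best rank-1 atom (unit norm)
    X[k, used] = s[0] * Vt[0]         # matching coefficients
    return D, X

rng = np.random.default_rng(2)
D = rng.standard_normal((10, 4)); D /= np.linalg.norm(D, axis=0)
X = np.zeros((4, 6)); X[1, :3] = [1.0, -2.0, 0.5]
Y = D @ X                             # exactly representable data
D2, X2 = ksvd_atom_update(Y.copy(), D.copy(), X.copy(), 1)
print(np.allclose(Y, D2 @ X2, atol=1e-8))  # → True
```

The convolutional variant uses the same rank-1 idea, but the "support" is the set of spatial locations where the filter fires, and overlapping patches must be handled.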
Minimal Python skeleton for translation-invariant MP
Below is a compact prototype for 1D convolutional matching pursuit using NumPy and SciPy (no FFT, for clarity). Filters are assumed to have unit $\ell_2$ norm, so the maximal correlation value is also the optimal coefficient:

```python
import numpy as np
from scipy.signal import correlate

def conv_matching_pursuit(y, filters, max_iters=50, tol=1e-3):
    """
    y: 1D signal, shape (N,)
    filters: list of 1D unit-norm filters d_k
    returns: list of (k, i, alpha) events and the final residual
    """
    r = y.astype(float).copy()
    events = []
    for t in range(max_iters):
        best_val = 0.0
        best_k = None
        best_i = None
        # 1. compute correlations for all filters
        for k, d in enumerate(filters):
            c = correlate(r, d, mode='valid')  # correlation == conv with flipped filter
            i = np.argmax(np.abs(c))
            val = c[i]
            if np.abs(val) > np.abs(best_val):
                best_val = val
                best_k = k
                best_i = i
        if best_k is None or np.abs(best_val) < tol:
            break
        # 2. record event
        events.append((best_k, int(best_i), float(best_val)))
        # 3. update residual: subtract the scaled filter placed at best_i
        d = filters[best_k]
        contrib = np.zeros_like(r)
        contrib[best_i:best_i + len(d)] = best_val * d
        r = r - contrib
        if np.linalg.norm(r) < tol:
            break
    return events, r
```
You can extend this to 2D images by using scipy.signal.correlate2d or FFT-based convolution, and by storing events as (k, x, y, alpha). For OMP, you would also maintain the full set of selected atoms and solve a small least-squares problem each iteration.
Focused follow-up prompts you can reuse
You can drive a deeper exploration with prompts like:
- “Derive the CMP algorithm in arxiv:1010.0422 from a formal convolutional sparse coding objective, and give tight complexity bounds for spatial vs FFT implementations.”
- “List and explain the exact sparsity and mutual coherence conditions used in Sulam et al.’s ML-CSC to guarantee stable recovery of multi-layer sparse codes.”
- “Write a PyTorch implementation of 2D convolutional matching pursuit and compare it experimentally to OMP on synthetic image data.”
- “Summarize Ekanadham et al.’s continuous basis pursuit formulation; write the convex optimization problem explicitly and connect each term to block-sparse group penalties.”
- “Using Vasileios Papyan and Michael Elad’s ML-CSC results, relate CNN layer-wise feature stability to convolutional sparse coding assumptions and derive practical design guidelines for filters and strides.”
What are practical applications of convolutional sparse coding in vision and other disciplines like biology, chemistry, physics, and others? Compare the complexity of convolutional sparse coding algorithms. How does convolutional dictionary learning work? What are the implementation details of convolutional matching pursuit, and what is the relation between convolutional sparse coding and CNNs?
Convolutional sparse coding generalizes standard sparse coding by using convolutional (shift-shared) filters, which enables translation-invariant structure, efficient algorithms, and strong links to CNNs and physical time–space signals.
Practical applications
- Computer vision and imaging
  - Denoising, inpainting, super-resolution, and feature learning, where convolutional dictionaries capture repeated local image structure.
- Time-series and signals (biology, medicine, physics, finance)
  - Decomposing recordings into a small set of recurring temporal motifs and their sparse activation times.
- Neuroscience and biology
  - Identifying repeated spatio-temporal patterns in EEG/MEG and neural-like signals via multivariate CSC.
- General scientific data
  - Any domain with repeated local motifs in time or space where interpretable, translation-invariant structure is useful.
Complexity of CSC algorithms
Let $N$ be the number of pixels/samples, $K$ the number of filters, and $L$ the filter support.
- Spatial-domain convolutional pursuit (e.g., CMP, ISTA-like)
  - Scoring or gradient steps cost $O(NKL)$ per iteration; efficient for small filter supports.
- FFT-based CSC
  - Per-iteration cost $O(KN \log N)$, essentially independent of filter size; preferable for large filters or large signals.
- Comparative behavior
  - Spatial methods win for small kernels and small signals; FFT methods win asymptotically as kernel size or resolution grows.
How convolutional dictionary learning works
- Objective (high level)
  - Learn filters $\{d_k\}$ and sparse feature maps $\{x_k^{(n)}\}$ for training signals $y^{(n)}$:
$$ \min_{\{d_k\}, \{x_k^{(n)}\}} \sum_n \left( \frac{1}{2}\left\| y^{(n)} - \sum_k d_k * x_k^{(n)} \right\|_2^2 + \lambda \sum_k \|x_k^{(n)}\|_1 \right) $$
with constraints like $\|d_k\|_2 \le 1$.
- Alternating minimization
  - Fix the filters and solve the convolutional sparse coding problem for the feature maps; then fix the codes and update the filters (e.g., by gradient or least-squares steps, followed by projection onto the norm constraint), iterating until convergence.
- Algorithmic variants
  - ADMM-based and FISTA-based solvers for both subproblems, often working in the Fourier domain for efficiency.
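The alternating scheme just described can be sketched end to end for a single 1D filter. This is a toy illustration (the `cdl_alternate` helper, its step sizes, and iteration counts are arbitrary choices, not a published algorithm):

```python
import numpy as np

def soft(z, t):
    """Elementwise soft-thresholding, the proximal operator of t*||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cdl_alternate(y, d, n_outer=20, lam=0.05, step=0.1):
    """Toy 1D convolutional dictionary learning by alternating minimization:
    a projected gradient step on the filter d, then ISTA steps on the code x.
    Model: y ~ d * x with a single filter ('full' convolution), ||d||_2 <= 1."""
    L = len(d)
    d = d.astype(float).copy()
    x = np.zeros(len(y) - L + 1)
    for _ in range(n_outer):
        # filter update: gradient step on 0.5*||y - d*x||^2, project to unit ball
        r = y - np.convolve(x, d, mode="full")
        d = d + step * np.correlate(r, x, mode="valid")
        nrm = np.linalg.norm(d)
        if nrm > 1.0:
            d = d / nrm
        # code update: a few ISTA steps (correlation is the adjoint of convolution)
        for _ in range(5):
            r = y - np.convolve(x, d, mode="full")
            x = soft(x + step * np.correlate(r, d, mode="valid"), lam * step)
    return d, x
```

Real CDL solvers replace both inner updates with ADMM or FISTA iterations and batch over many training signals, but the alternating structure is the same.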
Implementation details of convolutional matching pursuit
Convolutional Matching Pursuit (CMP) is the greedy $\ell_0$ pursuit specialized to convolutional dictionaries:
- Representation
- Step-by-step algorithm (1D/2D conceptually identical)
- Initialize residual $r^{(0)} = y$, all feature maps zero.
- For each filter $k$, compute correlation map $c_k = d_k \star r^{(t)}$ (cross-correlation).
- Find maximum-magnitude response over all filters and locations.
- Add that atom to the representation (update the corresponding coefficient in the appropriate feature map).
- Subtract the atom’s contribution from the residual (place the scaled filter at the selected location and subtract).
- Repeat until reaching a sparsity or error threshold.
In practice, you implement steps 2–5 with 2D convolutions (or 1D for sequences), typically via FFT for speed on large images, careful padding/valid-mode decisions, and possibly batched evaluation if you process many images at once.
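The FFT route for the scoring step relies on the correlation theorem: cross-correlating a filter with the residual equals an elementwise product with the conjugated filter spectrum. A 1D sketch verifying the equivalence (sizes are arbitrary):

```python
import numpy as np

# FFT cross-correlation for the CMP scoring step (1D sketch).
rng = np.random.default_rng(3)
r = rng.standard_normal(256)          # residual
d = rng.standard_normal(9)            # filter
# spatial cross-correlation, 'valid' positions only
spatial = np.correlate(r, d, mode="valid")
# FFT route: conj(FFT(d)) * FFT(r), zero-padding the filter to signal length;
# circular wrap-around only affects positions beyond the 'valid' range
n = len(r)
fft_corr = np.fft.irfft(np.conj(np.fft.rfft(d, n)) * np.fft.rfft(r, n), n)
print(np.allclose(spatial, fft_corr[: len(spatial)]))  # → True
```

In 2D, the same identity holds with `rfft2`/`irfft2`; the filter FFTs can be precomputed once per greedy pass since only the residual changes between iterations.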
Relation between CSC and CNNs
- Structural parallels
  - CSC filters correspond to CNN convolution kernels, sparse feature maps to activations, and thresholding to ReLU-style nonlinearities.
- Interpretive link
  - A CNN forward pass with ReLU and possibly max-pooling can be viewed as approximate pursuit in a multi-layer convolutional sparse model, where activations correspond to sparse codes and filters to dictionary atoms.
  - Theoretical work on CSC gives conditions for uniqueness, stability, and separation of features, which translate into conditions under which CNN layer outputs are stable to noise and small deformations and capture structured sparsity patterns.
- Practical design guidance (from CSC theory to CNNs)
  - Favor moderately incoherent filters within a layer (reduce redundancy) while maintaining localized support, which improves sparse identifiability and can help CNN generalization.
  - Choose strides and pooling that do not violate the assumptions about overlap and support separation used in CSC proofs (e.g., avoid excessive downsampling that causes supports to collide too aggressively), supporting stable feature extraction across layers.
Suggested next prompts (copy-paste ready)
- “Give a fully worked derivation of convolutional matching pursuit starting from the standard convolutional sparse coding objective and write tight big-O complexity for spatial and FFT implementations in 2D.”
- “From Sulam et al. and Papyan & Elad’s ML-CSC papers, extract and restate the precise sparsity and mutual coherence conditions for stable multi-layer code recovery, with interpretations for CNN architectures.”
- “Write a complete PyTorch implementation of 2D convolutional matching pursuit and convolutional OMP, run both on synthetic image patches, and compare reconstruction error and runtime as sparsity varies.”
- “Write out Ekanadham et al.’s continuous basis pursuit optimization for translation-invariant signals and map each term onto block-sparse (group) penalties; highlight when CBP reduces to standard CSC.”
- “Using the CSC stability results, propose concrete design rules for CNN filters (support size, stride, redundancy) to approximate a well-behaved convolutional sparse model for natural images.”
What are practical applications of convolutional sparse coding in other disciplines like biology, chemistry, physics, and others? What are the practical implementation differences between CMP spatial and FFT methods? How does CMP compare to Bristow’s convolutional sparse coding algorithm? What are the applications of the CMP algorithm in image processing tasks, the exact sparsity penalties used in convolutional sparse coding objectives, and recent improvements to CMP algorithm efficiency?
Convolutional sparse coding (CSC) and CMP are now used well beyond vision, especially wherever you have repeated local motifs in time/space and want interpretable, translation-invariant structure.
Applications beyond vision
- Neuroscience / biology
  - Multivariate CSC for electromagnetic brain signals (EEG/MEG) to identify repeated spatio-temporal patterns (oscillations, event-related components) as sparse convolutions of learned kernels.
  - Convolutional higher-order matching pursuit (CHOMP) uses CMP-style search with higher-order statistics to localize sparse “objects” in noisy mixtures, motivated by realistic neural-like signals.
- Medical / physiological signals
  - Decomposing physiological recordings (e.g., ECG/EMG-style waveforms) into recurring motifs and their sparse event times with convolutional dictionaries.
- Physics / engineering
  - Electromagnetic and geophysical data: multivariate CSC to decompose complex fields into a small set of spatio-temporal kernels and activation maps (e.g., repeated pulse or wave packets in EM data).
  - Signal processing and communications: CMP/OMP variants used for channel estimation and sparse multipath reconstruction in compressed sensing of broadband channels.
- General pattern recognition
  - Robust multi-view matching pursuit variants (e.g., SRMP) for multi-view subspace clustering and classification, essentially a CMP-like greedy search extended to joint, robust objectives.
CMP: spatial vs FFT implementations (practical differences)
- Spatial CMP
  - Compute correlations by direct convolution of each filter with the residual in the spatial domain; cost per 2D filter is $O(H W K_h K_w)$ for image $H \times W$ and filter $K_h \times K_w$.
  - Advantages: simple to implement, efficient for small kernels (e.g., $3\times3$, $5\times5$), easy to exploit sparsity/ROI, low overhead for small images or few filters.
  - Disadvantages: cost grows linearly with filter area; can be prohibitive for large kernels or many filters.
- FFT-based CMP
  - Precompute FFTs of filters and residual; correlation per filter is an elementwise multiplication in the Fourier domain plus an inverse FFT; cost $O(H W \log(HW))$ per filter, largely independent of kernel size.
  - Advantages: superior for large images and/or larger kernels; complexity effectively $O(K H W \log(HW))$ instead of $O(K H W K_h K_w)$.
  - Disadvantages: FFT overhead dominates for small problems; more complex code (padding, real/complex transforms, batched FFTs), higher memory footprint.
In practice, a hybrid strategy is common: use spatial CMP for small filters or small images; switch to FFT CMP for large-scale setups.
CMP vs Bristow’s convolutional sparse coding algorithm
- CMP (greedy $\ell_0$ pursuit)
  - Solves a greedy approximation to $\min_x \frac{1}{2}\|y - \sum_k d_k * x_k\|_2^2 \ \text{s.t. } \|x\|_0 \leq T$, one atom at a time via max correlation, without global re-optimization of all coefficients.
  - Very fast per iteration, but no guarantee of an optimal code for a fixed sparsity; good where you care about a small number of large, localized events.
- Bristow et al. “Fast Convolutional Sparse Coding” (CVPR 2013)
  - Targets the $\ell_1$-regularized CSC objective $\min_x \frac{1}{2}\|y - \sum_k d_k * x_k\|_2^2 + \lambda \|x\|_1$ and solves it via ADMM with auxiliary variables.
  - Works in the Fourier domain: they show the convolution subproblem can be solved efficiently and explicitly in the frequency domain, and apply ADMM to the full objective for a speedup over earlier CSC methods.
  - Guarantees convergence to the $\ell_1$ optimum (up to ADMM tolerance), and complexity is essentially independent of filter size due to FFT usage.
So CMP is: greedy, $\ell_0$-style, simple, often used where interpretability and very sparse activations matter. Bristow’s CSC is: convex $\ell_1$, ADMM/FFT, better optimality, suited to large-scale learning and reconstruction.
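The $\ell_1$ objective that Bristow-style solvers target can also be attacked with simple proximal gradient (ISTA) steps, which makes the contrast with greedy CMP concrete. A single-filter 1D sketch; the `ista_csc_step` helper is illustrative (ADMM solvers tackle the same objective, but in the Fourier domain):

```python
import numpy as np

def ista_csc_step(y, d, x, lam, step):
    """One ISTA step on the l1 CSC objective 0.5*||y - d*x||^2 + lam*||x||_1
    (single 1D filter for clarity). Gradient step, then soft-thresholding."""
    r = y - np.convolve(x, d, mode="full")
    grad = -np.correlate(r, d, mode="valid")   # adjoint of convolution by d
    z = x - step * grad
    return np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)
```

Unlike CMP, which commits to one (filter, position) per iteration, every ISTA step updates all coefficients at once and converges to the convex $\ell_1$ optimum for a small enough step size.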
Image processing tasks using CMP / CSC
- Super-resolution
- CSC-based super-resolution methods (CSC-SR) learn convolutional dictionaries to map low-resolution to high-resolution patches via sparse feature maps and reconstruction, improving perceptual quality and PSNR.
- Denoising / deblurring / deconvolution
  - Convolutional dictionaries serve as priors: sparse feature maps reconstruct clean structure while noise is left in the residual.
- Inpainting and completion
  - Shift-shared filters fill missing regions from surrounding repeated structure, typically via masked data-fidelity terms.
- Generic feature learning
  - Learned convolutional codes feed classifiers or act as unsupervised pre-training for convolutional architectures.
Sparsity penalties in CSC objectives
Most CSC formulations use variants of:
- $\ell_1$ sparsity
  - The penalty $\lambda \sum_k \|x_k\|_1$ (as in the CDL objective above), giving convex problems and elementwise soft-thresholding updates.
- Group / structured sparsity
  - Penalties over groups of coefficients (e.g., per filter or per spatial neighborhood), encouraging whole groups to switch on or off together.
- Mixed $\ell_{1,2}$ norms and constraints
  - Sums of $\ell_2$ norms over groups, as used in continuous basis pursuit to tie amplitude and shift-interpolation coefficients together.
CMP itself corresponds to an implicit $\ell_0$ constraint (hard sparsity budget) rather than an explicit penalty.
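The distinction between a penalty and a hard budget maps to two different operators on the correlation coefficients. A minimal NumPy sketch (function names are mine) contrasting the $\ell_1$ proximal operator with an $\ell_0$-style hard selection:

```python
import numpy as np

def soft_threshold(c, lam):
    """Prox of lam*||.||_1 (the l1 penalty): shrink everything toward zero."""
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

def hard_select(c, budget):
    """l0-style hard budget: keep only the `budget` largest-magnitude entries."""
    out = np.zeros_like(c)
    keep = np.argsort(np.abs(c))[::-1][:budget]
    out[keep] = c[keep]
    return out

c = np.array([3.0, -0.5, 1.2, 0.1, -2.0])
s = soft_threshold(c, 1.0)   # every survivor is biased (shrunk) by lam
h = hard_select(c, 2)        # survivors keep their values; count is capped
```

The soft threshold biases all surviving coefficients, while the hard budget leaves survivors untouched but fixes their count, which is exactly the CMP-style behavior.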
Recent efficiency improvements for CMP/CSC
While CMP is the classic greedy method, most recent efficiency work is on CSC solvers that can also be used inside CMP-like pipelines:
- Fast ADMM-based CSC
- New ADMM formulations improve the convolutional least-squares fitting step, outperforming prior state-of-the-art solvers (e.g., the Sherman–Morrison-based method) in both coding and dictionary learning.
- Fast iterative thresholding approximations
- “Convolutional Sparse Coding Fast Approximation” proposes speed-up variants of iterative thresholding that approximate CSC codes with fewer iterations while retaining accuracy.
- Higher-order and structured CMP
- CHOMP (higher-order CMP) improves localization using higher-order cumulants while showing substantial running-time gains compared to group-lasso factorization approaches.
- Domain-specific OMP/CMP variants
- Improved OMP algorithms in compressed sensing (e.g., for channel estimation) refine the matching and coefficient-update steps, approaching the CRLB and improving estimation reliability.
Tight 2D complexity, CMP: spatial vs FFT (summary form)
Let input $y \in \mathbb{R}^{H \times W}$, $K$ filters $d_k \in \mathbb{R}^{K_h \times K_w}$, and $T$ CMP iterations:
- Spatial CMP: correlations cost roughly $O(K \cdot H W \cdot K_h K_w)$ per iteration if recomputed from scratch (much less with local residual updates).
- FFT CMP: correlations cost roughly $O(K \cdot H W \log(H W))$ per iteration, independent of filter size.
For small kernels/low resolution, spatial CMP is often faster; for large kernels or high resolutions, FFT CMP wins asymptotically.
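The two routes compute identical responses, which is easy to sanity-check: a direct loop-based valid correlation must match a zero-padded FFT correlation. A NumPy sketch (all names mine, not from the paper):

```python
import numpy as np

def corr2d_spatial(y, d):
    """Valid cross-correlation by direct loops: O(H*W*Kh*Kw)."""
    H, W = y.shape
    Kh, Kw = d.shape
    out = np.empty((H - Kh + 1, W - Kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(y[i:i + Kh, j:j + Kw] * d)
    return out

def corr2d_fft(y, d):
    """Same valid correlation via zero-padded real FFTs: O(H*W*log(H*W))."""
    H, W = y.shape
    Kh, Kw = d.shape
    shape = (H + Kh - 1, W + Kw - 1)
    F_y = np.fft.rfft2(y, s=shape)
    F_d = np.fft.rfft2(d[::-1, ::-1], s=shape)  # flip: correlation = conv with flipped kernel
    full = np.fft.irfft2(F_y * F_d, s=shape)
    return full[Kh - 1:H, Kw - 1:W]             # crop to the valid region

rng = np.random.default_rng(0)
y = rng.standard_normal((16, 16))
d = rng.standard_normal((4, 4))
```

At these small sizes the loop version is competitive; the FFT version's advantage appears as `Kh*Kw` grows past `log(H*W)`.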
If you want, next I can:
- Give a fully symbolic derivation of CMP from the CSC objective and explicitly connect each greedy step to minimizing a residual energy upper bound.
- Or write a full 2D PyTorch CMP/OMP implementation sketch with hooks so you can drop it into your agentic experimentation stack.
Give a fully symbolic derivation of CMP from the CSC objective and explicitly connect each greedy step to minimizing a residual energy upper bound.
Or write a full 2D PyTorch CMP/OMP implementation sketch with hooks so you can drop it into your agentic experimentation stack.
Follow-ups:
- How is CSC applied to protein structure prediction in biology?
- What role does CSC play in molecular dynamics simulations?
- Compare CMP and ADMM algorithms in CSC efficiency.
- Applications of CSC in chemistry for molecular modeling.
- CSC in physics for sparse signal reconstruction examples.
Convolutional sparse coding is only starting to show up explicitly in biology/chemistry/physics, but the underlying idea—sparse, translation-invariant convolutional representations—aligns well with how protein, molecular, and physical data are structured.
CSC in protein and molecular biology
- Protein structure / contacts
- Modern protein-structure systems (AlphaFold, contact-map predictors) are built on deep convolutional architectures over 2D residue–residue feature maps or 1D sequences.
- While not usually described as “CSC”, they effectively learn sparse, localized convolutional patterns over contact maps and sequence-derived features; sparse autoencoder work on protein LMs is making this connection explicit by enforcing sparse latent codes over such representations.
- Molecular dynamics–like settings
- CSC-type models are used in multivariate EM/EEG/MEG and time-series decomposition, and similar architectures (temporal conv blocks + sparsity) appear in MD analysis, where MD-derived time series are processed with temporal convolutions to learn interpretable slow modes or entropy-related descriptors.
- Quantitative MD models such as dynAMMo combine experimental and simulated time-series via constrained optimization; although they do not use CSC per se, they solve related sparse or low-rank inverse problems on dynamical observables that can be tackled with convolutional models in principle.
So for now, CSC’s role in protein structure/MD is mostly indirect: via convolutional or sparse architectures in contact prediction, secondary-structure prediction, or time-series feature learning, rather than explicit CMP-style CSC.
CMP vs ADMM-based CSC (efficiency intuition)
- CMP (greedy)
- At each iteration, CMP computes correlations between the residual and each filter (spatial or FFT), picks the best (filter, position), updates one coefficient, and refreshes the residual.
- Pros: simple, easy to parallelize correlations, good when you want very sparse codes and a small number of big events; cheap if you stop early.
- Cons: approximate $\ell_0$ solution, no global optimality for given sparsity; scaling to dense or moderately sparse codes is poor.
- ADMM CSC (e.g., Bristow, Veshki & Vorobyov)
- They minimize the $\ell_1$-regularized CSC objective using ADMM in the Fourier domain.
- Each iteration solves (1) a convolutional least-squares subproblem (efficiently via FFT + closed-form updates) and (2) a soft-thresholding step for sparse maps.
- Modern variants improve the convolutional LS step (e.g., replacing Sherman–Morrison with a more efficient formulation), yielding significantly faster convergence than earlier CSC algorithms at essentially the same per-iteration complexity.
In practice: for high-accuracy codes and large images, ADMM/FFT CSC dominates CMP; CMP is more of a lightweight, interpretable, or event-detection tool.
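To make the proximal ($\ell_1$) side of this comparison concrete, here is an ISTA-style sketch for the 1D CSC objective, a simpler cousin of the ADMM solvers discussed; the function name and the crude step-size bound are my own illustrative choices, not an implementation of Bristow's algorithm:

```python
import numpy as np

def ista_csc_1d(y, dicts, lam=0.1, iters=300, step=None):
    """ISTA for min_x 0.5*||y - sum_k d_k * x_k||_2^2 + lam * sum_k ||x_k||_1.
    y: (N,) signal; dicts: (K, L) filters; x: (K, N-L+1) valid-support maps."""
    K, L = dicts.shape
    N = y.size
    x = np.zeros((K, N - L + 1))
    if step is None:
        # crude safe step from an upper bound on the conv operator norm
        step = 1.0 / (K * (np.abs(dicts).sum(axis=1) ** 2).max() + 1e-12)
    for _ in range(iters):
        rec = sum(np.convolve(x[k], dicts[k], mode="full") for k in range(K))
        r = y - rec
        for k in range(K):
            # gradient step (-grad = correlation with residual), then soft-threshold
            z = x[k] + step * np.correlate(r, dicts[k], mode="valid")
            x[k] = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return x
```

Swapping the soft-threshold for a hard selection of the single largest response recovers CMP's greedy update from the same correlation machinery.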
Chemistry and physics applications
- Electromagnetic and multichannel physical fields
- Multivariate convolutional sparse coding has been applied to electromagnetic data to decompose fields into a sparse set of spatio–temporal kernels, enabling structured interpretation of repeated waveforms or pulses.
- Sparse signal reconstruction in physics / engineering
- CSC and matching-pursuit variants appear in compressed sensing for communication channels, radar, and other physics-driven systems, where you recover sparse multipath or scattering components from measurements.
- The CSC model provides a global, shift-invariant formulation that’s more consistent with physics than patch-based sparse coding.
These are conceptually the same pattern: kernels approximate impulse responses or local physical modes, sparse activations encode when/where they fire.
Sketch: fully symbolic derivation of CMP from CSC
Let $y \in \mathbb{R}^{H \times W}$ be an image and $\{d_k\}_{k=1}^K$ 2D filters (atoms). The standard CSC objective (with an $\ell_0$-type sparsity constraint) is:
$$ \min_{\{x_k\}} \frac{1}{2}\left\|y - \sum_{k=1}^K d_k * x_k \right\|_2^2 \quad \text{s.t. } \sum_k \|x_k\|_0 \le T. $$
Define the residual at iteration $t$ as
$$ r^{(t)} = y - \sum_k d_k * x_k^{(t)}. $$
Energy change when adding one activation. Suppose we add a single coefficient $\alpha$ at filter $k$ and location $(u,v)$, i.e.,
$$ x_k^{(t+1)} = x_k^{(t)} + \alpha\, \delta_{u,v}, $$
where $\delta_{u,v}$ is a delta at $(u,v)$. The new residual is
$$ r^{(t+1)} = r^{(t)} - \alpha\, d_k(\cdot - u, \cdot - v). $$
Consider the change in squared error:
$$ \Delta E = \frac{1}{2}\|r^{(t+1)}\|_2^2 - \frac{1}{2}\|r^{(t)}\|_2^2. $$
Expand:
$$ \|r^{(t+1)}\|_2^2 = \|r^{(t)} - \alpha\, d_k(\cdot - u, \cdot - v)\|_2^2 = \|r^{(t)}\|_2^2 - 2\alpha \langle r^{(t)}, d_k(\cdot - u, \cdot - v) \rangle + \alpha^2 \|d_k\|_2^2. $$
Thus
$$ \Delta E = - \alpha \langle r^{(t)}, d_k(\cdot - u, \cdot - v) \rangle + \frac{1}{2}\alpha^2 \|d_k\|_2^2. $$
Minimizing this quadratic in $\alpha$ gives
$$ \alpha^* = \frac{\langle r^{(t)}, d_k(\cdot - u, \cdot - v) \rangle}{\|d_k\|_2^2}, $$
and the corresponding minimal energy decrease is
$$ \Delta E^* = - \frac{1}{2} \frac{\langle r^{(t)}, d_k(\cdot - u, \cdot - v) \rangle^2}{\|d_k\|_2^2}. $$
If you normalize all filters to $\|d_k\|_2 = 1$, then
$$ \Delta E^* = - \frac{1}{2} \langle r^{(t)}, d_k(\cdot - u, \cdot - v) \rangle^2. $$
Therefore, choosing the (filter, location) with the largest absolute inner product $|\langle r^{(t)}, d_k(\cdot - u, \cdot - v) \rangle|$ maximizes the magnitude of the energy decrease at that greedy step. That inner product is exactly the convolutional correlation coefficient at $(u,v)$, so CMP’s rule:
- compute all correlation maps $c_k = d_k \star r^{(t)}$,
- pick $(k^*, u^*, v^*) = \arg\max_{k,u,v} |c_k[u,v]|$,
- set $\alpha^* = c_{k^*}[u^*,v^*]$ (assuming normalized filters),
- update residual,
is just greedy selection of the biggest possible local drop in the CSC objective, i.e., a direct residual-energy upper-bound argument.
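The closed-form $\alpha^*$ and $\Delta E^*$ are easy to verify numerically. This NumPy snippet (sizes and the random seed are arbitrary; the atom is unit-norm, shifts restricted to the valid range) reproduces the predicted energy drop for one greedy step:

```python
import numpy as np

rng = np.random.default_rng(1)
r = rng.standard_normal((8, 8))     # current residual r^(t)
d = rng.standard_normal((3, 3))
d /= np.linalg.norm(d)              # unit-norm atom, ||d||_2 = 1

# correlation map c[u, v] = <r, d shifted to (u, v)> over valid shifts
H, W = r.shape
Kh, Kw = d.shape
c = np.array([[np.sum(r[u:u + Kh, v:v + Kw] * d)
               for v in range(W - Kw + 1)]
              for u in range(H - Kh + 1)])

u, v = np.unravel_index(np.argmax(np.abs(c)), c.shape)
alpha = c[u, v]                     # optimal coefficient for a unit-norm atom

r_new = r.copy()
r_new[u:u + Kh, v:v + Kw] -= alpha * d
delta_E = 0.5 * np.sum(r_new ** 2) - 0.5 * np.sum(r ** 2)
# delta_E should equal -(1/2) * c[u, v]^2, the predicted decrease
```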
Tight 2D complexity (CMP, spatial vs FFT) recap
For image $H \times W$, filters $K_h \times K_w$, $K$ filters, and $T$ CMP iterations:
- Spatial CMP: $O(T \cdot K \cdot H W \cdot K_h K_w)$ total correlation cost (naive recomputation each iteration).
- FFT-based CMP: $O(T \cdot K \cdot H W \log(H W))$ total, with per-iteration cost independent of kernel size.
2D PyTorch CMP/OMP skeleton (drop-in)
Below is a compact, framework-style sketch for your stack (no training loop, just the sparse coding step). It uses valid convolutions; you can adapt padding as needed.
```python
import torch
import torch.nn.functional as F


def conv2d_correlate(residual, filters):
    """
    residual: (1, 1, H, W)
    filters:  (K, 1, Kh, Kw) (assume unit-norm filters)
    returns:  responses (K, H_out, W_out)
    """
    # F.conv2d computes cross-correlation directly, so no filter flip is needed
    resp = F.conv2d(residual, filters)  # (1, K, H_out, W_out)
    return resp.squeeze(0)              # (K, H_out, W_out)


def cmp_2d(y, filters, max_iters=50, tol=1e-3):
    """
    y:       (1, 1, H, W) tensor
    filters: (K, 1, Kh, Kw) tensor (normalized)
    Returns:
        events:   list of dicts {k, y, x, alpha}
        residual: final residual tensor
    """
    residual = y.clone()
    K, _, Kh, Kw = filters.shape
    events = []
    for t in range(max_iters):
        resp = conv2d_correlate(residual, filters)  # (K, H_out, W_out)
        max_val, flat_idx = resp.abs().view(-1).max(dim=0)
        if max_val.item() < tol:
            break
        # decode the flat index into (filter, row, col)
        H_out, W_out = resp.shape[1:]
        flat = int(flat_idx)
        k = flat // (H_out * W_out)
        rem = flat % (H_out * W_out)
        y_idx, x_idx = rem // W_out, rem % W_out
        alpha = resp[k, y_idx, x_idx].item()
        events.append({"k": k, "y": y_idx, "x": x_idx, "alpha": alpha})
        # subtract the selected atom's contribution from the residual
        residual[:, :, y_idx:y_idx + Kh, x_idx:x_idx + Kw] -= alpha * filters[k:k + 1]
        if residual.norm().item() < tol:
            break
    return events, residual


def omp_2d(y, filters, max_iters=50, tol=1e-3):
    """
    Simple convolutional OMP:
    - maintain a list of selected atoms and positions
    - re-solve least squares for all coefficients each step
    """
    residual = y.clone()
    K, _, Kh, Kw = filters.shape
    selected = []  # list of (k, y, x)
    coeffs = torch.zeros(0)
    for t in range(max_iters):
        resp = conv2d_correlate(residual, filters)
        max_val, flat_idx = resp.abs().view(-1).max(dim=0)
        if max_val.item() < tol:
            break
        H_out, W_out = resp.shape[1:]
        flat = int(flat_idx)
        k = flat // (H_out * W_out)
        rem = flat % (H_out * W_out)
        selected.append((k, rem // W_out, rem % W_out))
        # build design matrix A whose columns are the selected (shifted) atoms
        y_vec = y.view(-1, 1)
        atoms = []
        for kk, yy, xx in selected:
            atom = torch.zeros_like(y)
            atom[:, :, yy:yy + Kh, xx:xx + Kw] += filters[kk:kk + 1]
            atoms.append(atom.view(-1, 1))
        A = torch.cat(atoms, dim=1)  # (N, T)
        # least squares via the normal equations: c = (A^T A)^{-1} A^T y
        AtA = A.t().mm(A)
        Aty = A.t().mm(y_vec)
        coeffs = torch.linalg.lstsq(AtA, Aty).solution  # (T, 1)
        # update residual against the full current reconstruction
        residual = y - A.mm(coeffs).view_as(y)
        if residual.norm().item() < tol:
            break
    events = [{"k": k, "y": yy, "x": xx, "alpha": float(a)}
              for (k, yy, xx), a in zip(selected, coeffs.view(-1).tolist())]
    return events, residual
```
You can then build small experiment runners that:
- generate synthetic images with known sparse activations of filters,
- run `cmp_2d` vs `omp_2d`, and
- log reconstruction error vs runtime and sparsity.
If you’d like, next I can either (a) translate CSC stability assumptions (Sulam/Papyan/Elad) into CNN design rules, or (b) design a domain-specific experiment where CMP/CSC is applied to a scientific time series (e.g., MD observable trace) with concrete metrics.
DEFINE convolutional sparse coding.
Convolutional sparse coding (CSC) is a sparse representation model where a signal is expressed as a sum of convolutions between a small set of learned filters and corresponding sparse feature maps.
Formally, for an input signal or image $y$, CSC assumes
$$ y \approx \sum_{k=1}^K d_k * x_k, $$
where $d_k$ are learned convolutional filters (the dictionary) and $x_k$ are sparse coefficient maps, i.e., most entries of each $x_k$ are zero or very small. The learning or coding problem is typically posed as
$$ \min_{\{x_k\}} \frac{1}{2}\left\| y - \sum_k d_k * x_k \right\|_2^2 + \lambda \sum_k \|x_k\|_1, $$
and, in dictionary learning, also over $\{d_k\}$ with norm constraints on the filters.
The key distinction from standard (matrix) sparse coding is that CSC uses convolutional structure instead of a flat dictionary matrix, so the same filter is reused at all spatial/temporal locations, yielding a translation-equivariant, locally sparse representation that scales to full images or long signals rather than isolated patches.
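The generative side of this definition can be sketched in a few lines: draw unit-norm filters, place a handful of activations per map, and sum the convolutions (sizes and names here are illustrative, not from any paper):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, L = 3, 100, 9
dicts = rng.standard_normal((K, L))
dicts /= np.linalg.norm(dicts, axis=1, keepdims=True)  # unit-norm filters d_k

# sparse coefficient maps: two activations per filter, rest exactly zero
x = np.zeros((K, N - L + 1))
for k in range(K):
    pos = rng.choice(N - L + 1, size=2, replace=False)
    x[k, pos] = rng.standard_normal(2)

# y = sum_k d_k * x_k  (full convolution restores length N)
y = sum(np.convolve(x[k], dicts[k], mode="full") for k in range(K))
```

Coding is the inverse problem: given `y` and `dicts`, recover sparse `x`, which is what CMP, ISTA, or ADMM solve.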
(a) translate CSC stability assumptions (Sulam/Papyan/Elad) into CNN design rules, and (b) design a domain-specific experiment where CMP/CSC is applied to a scientific time series (e.g., MD observable trace) with concrete metrics.
Follow-ups:
- Convergence guarantees and bounds for CMP in CSC
- How to initialize and learn dictionary filters for CMP
- CMP vs ADMM-based CSC algorithms comparison
- 1D audio signal example with CMP implementation
- Batch parallelization techniques for efficient CMP
- Mathematical formulation of convolutional sparse coding
- How to implement convolutional sparse coding algorithm
- Connection between convolutional sparse coding and CNNs
- List of Python libraries for convolutional sparse coding
- Limitations and challenges of convolutional sparse coding
Convolutional sparse coding (CSC) models a signal as a sum of convolutions between learned filters and sparse feature maps; multi-layer CSC (ML‑CSC) and its stability analysis give concrete, architecture-level guidance for CNN design.
(a) CNN design rules from ML‑CSC stability assumptions
Sulam/Papyan/Elad assume signals are generated by cascades of convolutional sparse layers (ML‑CSC) and analyze stability/uniqueness of the recovered codes under layered thresholding (which they show is equivalent to a CNN forward pass with ReLU-like nonlinearity).
Key assumptions and resulting CNN rules:
- Local sparsity per “stripe”
- Assumption: each local region (stripe / receptive field) has a bounded number of active atoms; i.e., per-layer sparsity $\|\gamma_i\|_{0,\infty} \le s_i$ is small relative to filter mutual coherence.
- CNN rule: keep activations sparse via ReLU + appropriate bias/thresholding; avoid layers where almost all units fire (e.g., remove overly large positive biases, aggressive batch-norm shifts).
- Bounded mutual coherence of dictionaries
- Assumption: each convolutional dictionary $D_i$ has small mutual coherence $\mu(D_i)$, enabling uniqueness and stability of sparse codes under thresholding.
- CNN rule: avoid highly redundant / near-duplicate filters in a layer; use regularization or pruning to keep filters diverse (e.g., decorrelation penalties, orthogonality constraints, or post‑training pruning of near-duplicates).
- Limited energy propagation across layers
- Assumption: perturbations/noise do not amplify excessively as they propagate; bounds are derived so that small input perturbations lead to bounded code perturbations layer by layer.
- CNN rule: avoid overly large layer gains (very large weights or unbounded residual stacks); use normalization (BatchNorm/LayerNorm with reasonable scale), residual connections with gain control, and avoid extremely deep stacks without normalization.
- Local support and limited overlap
- Assumption: filters have bounded support and codes are not too dense, so that overlapping supports do not destroy identifiability.
- CNN rule:
- Use reasonably small kernels (e.g., 3×3–7×7 in images) rather than huge ones,
- Avoid extreme strides/pooling that make receptive fields overlap in uncontrolled ways or lose too much locality, and
- Maintain some overlap (to keep equivariance) but not so much density that every feature map location becomes active.
- Thresholding vs basis pursuit
- Result: layered thresholding (CNN forward pass) recovers the underlying codes stably when thresholds are chosen relative to noise and coherence, and sparsity bounds are respected.
- CNN rule: treat ReLU (or soft-threshold-like activations) plus bias as a tunable threshold; very low thresholds hurt sparsity and identifiability, very high thresholds underfit—so use regularization and sensible initialization so that early layers actually behave sparsely.
Practically: design CNNs so each conv layer resembles a “good” CSC dictionary (diverse small filters, controlled norms) and its activations a “good” sparse code (few active units per locality, thresholded), then ML‑CSC stability results become rough guarantees about robustness and interpretability.
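The coherence rule above can be monitored with a simple diagnostic. This sketch (my own helper, not from the papers) computes the plain mutual coherence of a layer's flattened filters; note it ignores shifted copies, so it only approximates the shift-aware convolutional coherence used in the ML‑CSC analysis:

```python
import numpy as np

def mutual_coherence(filters):
    """Max |inner product| between distinct, unit-normalized filters.
    filters: (K, Kh, Kw). A rough per-layer diversity check, not the full
    shift-aware convolutional coherence of the ML-CSC papers."""
    K = filters.shape[0]
    flat = filters.reshape(K, -1).astype(float)
    flat /= np.linalg.norm(flat, axis=1, keepdims=True)
    gram = np.abs(flat @ flat.T)
    np.fill_diagonal(gram, 0.0)   # ignore self-similarity
    return float(gram.max())

filters = np.stack([np.eye(3), np.fliplr(np.eye(3))])  # two diagonal edge filters
coh = mutual_coherence(filters)
```

Tracking this value during training (or penalizing it) is one concrete way to implement the "keep filters diverse" rule.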
(b) CMP/CSC experiment for a scientific time series (MD observable)
Example: MD trajectory of a scalar observable (e.g., distance between two residues, or an order parameter) over time.
Data and model
- Input: time-series $y(t)$, sampled at regular Δt, length $T$.
- Goal: identify recurring dynamical motifs (e.g., transitions, oscillatory bursts) as sparse activations of learned temporal kernels.
- Model: 1D CSC
$$ y \approx \sum_{k=1}^K d_k * x_k $$
where $d_k$ are temporal kernels (motifs) and $x_k$ sparse activations.
Pipeline
- Preprocess
- Detrend and normalize $y$. Optionally window into segments if the trajectory is long.
- Dictionary learning
- Coding with CMP vs ADMM
- Use CMP (greedy) to approximate codes on new trajectories, and ADMM/ISTA to get more accurate codes from the same learned dictionary.
- Metrics
- Reconstruction error: $\|y - \hat{y}\|_2^2 / T$.
- Sparsity: average fraction of non-zero coefficients per unit time.
- Motif interpretability: cluster activations in time and correlate with known MD events (conformational transitions, large RMSD changes).
- Efficiency: wall-clock vs error curves for CMP vs ADMM CSC (number of iterations or operations until a given error).
Experimental questions
- Does CMP recover meaningful events (e.g., state transitions) with far fewer non-zeros/iterations than ADMM?
- Are learned kernels interpretable as typical dynamical micro-events (fast oscillations, slow relaxations)?
- How does the choice of kernel length $L$ and sparsity weight $\lambda$ affect the detection of rare events?
Mathematical formulation of CSC (recap)
Single-layer CSC with $\ell_1$ sparsity:
- Coding (fixed dictionary $D$):
$$ \min_{\{x_k\}} \frac{1}{2}\left\| y - \sum_k d_k * x_k \right\|_2^2 + \lambda \sum_k \|x_k\|_1. $$
- Dictionary learning (over dataset $\{y^{(n)}\}$):
$$ \min_{\{d_k\}, \{x_k^{(n)}\}} \sum_n \left( \frac{1}{2}\left\| y^{(n)} - \sum_k d_k * x_k^{(n)} \right\|_2^2 + \lambda \sum_k \|x_k^{(n)}\|_1 \right) \quad \text{s.t. } \|d_k\|_2 \le 1. $$
Implementing CSC algorithms (high level)
- CMP (greedy, $\ell_0$-like)
- Repeatedly compute correlations $c_k = d_k \star r$, pick max response (filter, location), update one coefficient and residual (as derived earlier), stop on sparsity or error.
- 1D (audio/time) implementation: same algorithm with 1D convolutions; many signals can be batched along the batch dimension and processed with `conv1d` or FFTs.
- ADMM / proximal methods ($\ell_1$): alternate FFT-based least-squares updates with soft-thresholding of the sparse maps, as in the Bristow-style solvers above.
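The batched 1D correlation step mentioned above fits in one `conv1d` call; a small PyTorch sketch (the helper name is mine, filters assumed unit-norm):

```python
import torch
import torch.nn.functional as F

def batched_correlate_1d(signals, filters):
    """signals: (B, N), filters: (K, L), assumed unit-norm.
    Returns (B, K, N - L + 1): correlations of every signal with every filter
    in a single call (F.conv1d computes cross-correlation, no flip needed)."""
    return F.conv1d(signals.unsqueeze(1), filters.unsqueeze(1))

sig = torch.randn(4, 64)
filt = torch.randn(2, 5)
resp = batched_correlate_1d(sig, filt)  # (4, 2, 60)
```

A batched CMP would then take the argmax over `(K, position)` per signal in the batch and update each residual independently.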
Connection CSC ↔ CNNs (condensed)
- ML‑CSC model: signals generated by cascades of CSC layers.
- Papyan et al.: show CNN forward pass = layered thresholding pursuit on ML‑CSC; analyze uniqueness/stability of codes, giving conditions under which CNN features are robust.
- Intuition:
- Convolutional filters ≈ CSC atoms.
- Feature maps ≈ sparse codes.
- ReLU + biases ≈ thresholding operator.
- Pool/stride control the effective dictionary and sparsity region.
So designing a CNN with ML‑CSC in mind means treating it explicitly as a sparse coding pursuit machine and respecting the sparsity/coherence constraints.
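The "ReLU + biases ≈ thresholding" correspondence is literal for nonnegative codes: ReLU applied after subtracting a bias equals the one-sided soft-thresholding operator. A tiny NumPy check (function names are mine):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def soft_nonneg_threshold(z, b):
    """One-sided soft threshold: prox of b*||.||_1 restricted to x >= 0."""
    return np.maximum(z - b, 0.0)

z = np.linspace(-2.0, 2.0, 9)   # preactivations W^T y
b = 0.5                          # threshold, played by a (negative) bias
lhs = relu(z - b)                # CNN layer: ReLU with bias -b
rhs = soft_nonneg_threshold(z, b)  # pursuit view: thresholding operator
```

This identity is what lets one layer of a CNN be read as one step of a layered thresholding pursuit.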
Python libraries for CSC
- SPORCO (Sparse Optimisation Research Code)
- Python package with ADMM/PGM algorithms for standard and convolutional sparse coding and dictionary learning in 1D/2D.
- alphaCSC
- Library specifically for convolutional sparse coding on time-series (univariate/multivariate, rank-1 constraints, α-stable noise models).
These are the main mature open-source options for CSC; CMP itself you usually implement yourself (as sketched earlier) or on top of a deep-learning framework (PyTorch/JAX).
Limitations and challenges of CSC
- Computational cost
- Even with FFT, large-scale CSC is expensive; ADMM iterations and dictionary learning are heavy, especially for high-resolution images or long time series.
- Hyperparameter sensitivity
- Performance depends on kernel size, number of filters, sparsity weight $\lambda$, and normalization; tuning is non-trivial and domain-specific.
- Model mismatch
- The linear, shift-invariant generative assumption can be too rigid (e.g., for strongly non-stationary or highly non-linear data), requiring extensions like multi-layer or time-varying CSC.
- Greedy vs convex trade-offs
- CMP is fast but approximate; ADMM CSC is more principled but slower. Choosing between them depends on whether you value interpretability/online behavior or high-accuracy codes.
- Integration with end-to-end deep learning
- CSC is generative and optimization-based; integrating it into large end-to-end pipelines (e.g., MD simulators or protein predictors) requires careful engineering of differentiable solvers and cost budgets.
If you want next, I can:
- Map specific ML‑CSC coherence/sparsity inequalities into numeric design constraints for a toy CNN (e.g., limits on fan-in, filter norms), or
- Draft a concrete `alphaCSC` experiment script for an MD-style time series, including metric logging and comparison to CMP.
CREATE a GLOSSARY of every acronym in this thread
Here’s a glossary of all acronyms that appeared in this thread.
- ADMM – Alternating Direction Method of Multipliers
- AE – Autoencoder (implied via sparse autoencoders)
- CNN – Convolutional Neural Network
- CBP – Continuous Basis Pursuit
- CDL – Convolutional Dictionary Learning
- CHOMP – Convolutional Higher-Order Matching Pursuit
- CSC – Convolutional Sparse Coding
- CMP – Convolutional Matching Pursuit
- CRLB – Cramér–Rao Lower Bound
- CT – Computed Tomography (implied via sparse signal reconstruction contexts)
- ECG – Electrocardiogram
- EEG – Electroencephalogram
- EM – Electromagnetic (also used as “EM data/fields”)
- FFTs / FFT – Fast Fourier Transform(s)
- ISTA – Iterative Shrinkage-Thresholding Algorithm
- FISTA – Fast Iterative Shrinkage-Thresholding Algorithm
- ICIP – International Conference on Image Processing
- K-SVD – K-means Singular Value Decomposition (dictionary learning algorithm)
- LASSO – Least Absolute Shrinkage and Selection Operator (implicitly referenced via $\ell_1$-type penalties)
- MD – Molecular Dynamics
- MEG – Magnetoencephalography
- ML-CSC – Multi-Layer Convolutional Sparse Coding
- MP – Matching Pursuit
- NOE – Nuclear Overhauser Effect (from NMR; appears in protein-structure context)
- NMR – Nuclear Magnetic Resonance (implied via NOE/RDC restraints)
- OMP – Orthogonal Matching Pursuit
- PSNR – Peak Signal-to-Noise Ratio
- RDC – Residual Dipolar Coupling
- ReLU – Rectified Linear Unit
- RMSD – Root Mean Square Deviation
- ROI – Region of Interest
- RNN / LSTM / BLSTM – Recurrent Neural Network / Long Short-Term Memory / Bidirectional LSTM (via the DCBLSTM context)
- SR – Super-Resolution
- SRMP – Simultaneous Robust Matching Pursuit
- SVD – Singular Value Decomposition
If you want, I can next turn this into a small reference table with one-line “when you’d use this” notes for each acronym, specifically in the context of CSC/CMP and ML/vision.
Turn this into a small reference table with one-line “when you’d use this” notes for each acronym.
| Acronym | Full term | When you’d use this in practice |
|---|---|---|
| ADMM | Alternating Direction Method of Multipliers | To solve CSC objectives with $\ell_1$ penalties efficiently in the Fourier domain, especially for large images or batches. |
| AE | Autoencoder | To learn sparse or structured latent representations (e.g., sparse biologically-inspired codes or interpretable protein features). |
| CNN | Convolutional Neural Network | For end-to-end learning on images, sequences, or contact maps, often interpretable as a multi-layer CSC pursuit. |
| CBP | Continuous Basis Pursuit | When modeling signals with continuously shifted atoms (e.g., translation-invariant spikes) instead of discretely shifted dictionaries. |
| CDL | Convolutional Dictionary Learning | To jointly learn convolutional filters and sparse codes from data (images, time-series) instead of using hand-crafted filters. |
| CHOMP | Convolutional Higher-Order Matching Pursuit | For detecting sparse events in noisy multi-signal mixtures using higher-order statistics and CMP-like search. |
| CSC | Convolutional Sparse Coding | When you want a translation-invariant sparse representation using shared convolutional filters over full signals or images. |
| CMP | Convolutional Matching Pursuit | For fast, greedy, event-like sparse coding over convolutional dictionaries, especially when very sparse codes suffice. |
| CRLB | Cramér–Rao Lower Bound | To analyze the theoretical lower bound on variance of unbiased estimators in sparse recovery / channel estimation problems. |
| CT | Computed Tomography | As a target domain for sparse or CSC-based reconstruction from limited-angle or noisy projection data. |
| ECG | Electrocardiogram | As a 1D physiological signal where CSC/CMP can detect recurring waveforms (P/QRS/T complexes). |
| EEG | Electroencephalogram | For multichannel time-series where CSC can extract recurring spatio‑temporal brain activity patterns. |
| EM | Electromagnetic | For modeling EM fields or signals (e.g., multivariate CSC on EM data) with sparse spatio-temporal kernels. |
| FFT | Fast Fourier Transform | To accelerate convolutions/correlations in CSC/CMP and ADMM-based solvers on large signals/images. |
| ISTA | Iterative Shrinkage-Thresholding Algorithm | For simple proximal optimization of $\ell_1$-regularized CSC when you don’t want full ADMM complexity. |
| FISTA | Fast Iterative Shrinkage-Thresholding Algorithm | As a faster (accelerated) alternative to ISTA for solving CSC-type optimization problems. |
| ICIP | International Conference on Image Processing | As a venue where sparse coding, CSC, and matching pursuit algorithms are often published. |
| K-SVD | K-means Singular Value Decomposition | For patch-based dictionary learning; conceptually extended in CDL to convolutional dictionaries. |
| LASSO | Least Absolute Shrinkage and Selection Operator | As the canonical $\ell_1$-regularized regression/sparse coding formulation, including in CSC. |
| MD | Molecular Dynamics | As a source of long time-series where CSC/CMP can discover recurring dynamical motifs or events. |
| MEG | Magnetoencephalography | For high-dimensional time-series where multivariate CSC can uncover sparse spatio-temporal activity. |
| ML-CSC | Multi-Layer Convolutional Sparse Coding | To model and analyze deep convolutional representations (and CNNs) as layered CSC generative models. |
| MP | Matching Pursuit | As the baseline greedy sparse coding algorithm; CMP is its convolutional, translation-invariant variant. |
| NOE | Nuclear Overhauser Effect | As NMR-derived distance restraints in protein structure problems where sparse or convolutional models may help. |
| NMR | Nuclear Magnetic Resonance | As an experimental source of constraints / signals that can be incorporated into sparse or CSC-based models. |
| OMP | Orthogonal Matching Pursuit | When you need greedy sparse coding with coefficient re-optimization (better accuracy than MP/CMP at higher cost). |
| PSNR | Peak Signal-to-Noise Ratio | To evaluate reconstruction quality in denoising, deblurring, and super-resolution tasks that use CSC. |
| RDC | Residual Dipolar Coupling | As another NMR restraint type used in protein-structure inference; sometimes combined with sparse models. |
| ReLU | Rectified Linear Unit | As the nonlinearity in CNNs that corresponds to thresholding in ML‑CSC pursuit. |
| RMSD | Root Mean Square Deviation | To quantify structural differences (e.g., between protein conformations) in MD or structure prediction experiments. |
| ROI | Region of Interest | When restricting CSC/CMP to specific spatial/temporal regions to reduce computation and focus on salient areas. |
| RNN / LSTM / BLSTM | Recurrent Neural Network / (Bi)Long Short-Term Memory | For sequence/time-series modeling; can be combined with or compared to CSC-based temporal representations. |
| SR | Super-Resolution | As an image reconstruction task where CSC/CDL-based methods map low-res to high-res representations. |
| SRMP | Simultaneous Robust Matching Pursuit | For joint sparse coding across multiple related views/signals with robustness to outliers. |
| SVD | Singular Value Decomposition | Used in K‑SVD for atom updates and in general for low-rank approximations in sparse/dictionary learning. |
I have all of the logs of an agentic system and its tool use, prompts, and actions. Billions of interactions. How can this help to design a system to be more efficient in token economics?
You can treat those logs as a gigantic empirical design space for “token economics”. At scale, they let you learn which prompts, tools, and agent patterns waste tokens and which deliver outcome-per-token. Here’s how to exploit them.
1. Define token-efficiency metrics from logs
From each interaction, derive:
- Token features
- Outcome signals (you’ll need to mine or define these):
- success/failure flags, user satisfaction proxies, “resolved vs escalated”, or downstream business metric.
- latency, number of tools called, chain length.
Then compute:
- tokens per successful episode,
- tokens per tool call type,
- tokens per “unit of value” (ticket resolved, doc generated, etc.).
These become your objective functions for redesign.
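A minimal sketch of the core "tokens per successful episode" metric, using an illustrative record schema (field names are hypothetical, not your actual log schema):

```python
from collections import defaultdict

# hypothetical minimal log records; field names are illustrative only
logs = [
    {"episode": "e1", "workflow": "triage", "tokens": 1200, "success": True},
    {"episode": "e1", "workflow": "triage", "tokens": 300,  "success": True},
    {"episode": "e2", "workflow": "triage", "tokens": 2500, "success": False},
    {"episode": "e3", "workflow": "draft",  "tokens": 900,  "success": True},
]

def tokens_per_success(records):
    """Per workflow: total token spend divided by distinct successful episodes."""
    spend = defaultdict(int)
    wins = defaultdict(set)
    for r in records:
        spend[r["workflow"]] += r["tokens"]
        if r["success"]:
            wins[r["workflow"]].add(r["episode"])
    return {w: spend[w] / max(len(wins[w]), 1) for w in spend}

stats = tokens_per_success(logs)
```

Note that failed episodes still count toward spend, which is exactly what surfaces "cost per success" offenders.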
2. Learn prompt and policy patterns that waste or save tokens
Use the logs as training data:
- Prompt shape vs cost
- Cluster prompts by structure (role spec, instructions, examples, context length).
- Regress success and cost on features like: instruction length, number of examples, system message size, RAG context size.
- Identify diminishing-returns regions: e.g., after 3 examples, extra examples add tokens but don’t improve success.
- Chain / agent patterns
Use that to design rules like “never call tool X twice in a row” or “stop after N clarifications” unless confidence is below a threshold.
3. Prompt compression and context shaping
Your logs are ideal supervision for compression:
- Train a prompt-compression model
- Policy for retrieval/context size
In your agent, insert a “context budgeter” that decides:
- how many docs to retrieve,
- how aggressively to summarize/compress them before feeding to the main model.
4. Model-routing and responsibility tuning
With billions of traces, you can learn when a cheaper model is sufficient:
- Train a router
- Input: cheap features (user query length, topic, previous performance stats for similar queries).
- Output: which model to use (tiny, medium, large) or whether to use a tool-only path.
- Loss: expected quality – λ · token cost.
- Responsibility tuning (who does what):
This is where you get large savings: many calls don’t need your top model if the router is trained on real outcomes.
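The routing objective above reduces to a tiny expected-utility rule over per-model statistics mined from logs; a sketch (all numbers, names, and the λ values are illustrative):

```python
# hypothetical per-model statistics for one query cluster, mined from logs
MODELS = {
    "tiny":   {"p_success": 0.70, "tokens": 400},
    "medium": {"p_success": 0.85, "tokens": 1500},
    "large":  {"p_success": 0.93, "tokens": 6000},
}

def route(stats, lam):
    """Pick the model maximizing expected quality minus lam * token cost."""
    return max(stats, key=lambda m: stats[m]["p_success"] - lam * stats[m]["tokens"])
```

Sweeping λ traces out the cost–quality frontier: high λ (cost-sensitive) selects the cheap model, low λ the large one, so λ is the single knob that encodes your token economics.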
5. Agent-graph optimization with your logs
Your logs describe an implicit agent graph (states = prompts + tool configs, edges = tool calls):
- Build a Design Structure Matrix or explicit DAG of common flows (from your traces).
- For each subgraph/flow, compute:
- average tokens, success rate, latency.
- Optimize:
- collapse or reorder steps that add cost but little information gain,
- parallelize independent tool calls (you can see where the logs show serial calls that could be parallel),
- introduce early-exit conditions where success probability is already high.
You end up with a “minimal agent graph” that achieves ~same quality for fewer tokens.
6. Concrete steps you can implement now
Given your infra and skills, I’d suggest:
- Build a token-efficiency warehouse
- Normalize all logs into fact tables: interaction, step, tool_call, with token counts and outcome flags.
- Add derived features: “is loop”, “is retry”, “RAG docs count”, “prompt length bins”.
- Run offline analyses
- For each common workflow, compute cost vs quality; create dashboards with:
- top 10 workflows by spend,
- top 10 “cost per success” offenders,
- distribution of context length vs success.
- Train two small models
- a compression model to rewrite prompts/context to shorter forms,
- a router to select model size and tool strategy.
- Deploy in shadow mode
- In production, compute what cost would have been under new policies; validate no quality drop.
- Then roll out incremental guards: max context length per task, max chain length, enforced use of cheaper models for low-risk steps.
7. Strong follow-up prompts you can use
To push this into design/implementation, you can paste:
Given a schema with interactions, steps, token counts, and success labels, design SQL + feature pipelines to compute “tokens per successful outcome” and identify the highest-waste workflows in an agentic system.

Propose an architecture for a learned prompt-compression layer trained on historical logs (input: long prompt+context, target: original model output) and define offline evaluation metrics for cost vs quality.

Design a model-routing policy that uses historical traces to decide when to use a small, medium, or large LLM, and formulate a loss that trades off response quality against token cost.

Given logged agent graphs (tool-call sequences), describe an algorithm to detect loops, redundant calls, and non-contributing steps, and then synthesize a minimal cost-efficient agent graph.

Define a token-budgeting middleware for an agentic system that caps context size, limits chain depth, and applies compression, using historical logs to set task-specific budgets.
If you tell me your current logging schema (tables/fields), I can sketch the exact queries and feature pipeline to start optimizing.
How to implement LLMLingua prompt compression on agent logs
Tools for analyzing token usage in billions of LLM interactions
Metrics to compute token efficiency per task from logs
Prevent agent loops and retries to cut token waste
Compare token costs across LLM providers using log data
You can turn those logs into a full “token economics” layer. Below are concrete designs and tools, keeping everything implementation-oriented.
1. Implement LLMLingua-style prompt compression on logs
Goal: Learn where and how much to compress prompts/contexts without hurting quality.
1.1. Basic integration pattern
For live traffic you’d do (in Python):
from llmlingua import PromptCompressor
compressor = PromptCompressor() # configure with chosen base model
def compress_for_agent(prompt, instruction="", question="", target_tokens=512):
    return compressor.compress_prompt(
        prompt,
        instruction=instruction,
        question=question,
        target_token=target_tokens,
    )
In your agent:
- Build the full prompt (system + user + context).
- Call compress_for_agent before sending to the LLM.
- Store both original and compressed prompt lengths in the logs.
1.2. Training/evaluating on historical logs
Use past logs as a batch test-bed:
- Take a sample of interactions with:
- full prompt+context,
- model output,
- success label / quality proxy.
- Run LLMLingua offline on each prompt under different target_token budgets (e.g. 0.2×, 0.5×, 0.8× of original length).
- Replay compressed prompts against the same model and compare:
- Quality metrics (task-specific or similarity vs original output).
- Cost reduction (Δ tokens, Δ latency).
You can then learn per-task compression budgets (e.g., RAG summarization tolerates 5×, legal drafting only 1.5×).
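One way to pick those per-task budgets from the offline eval table, sketched with invented task names and quality numbers:

```python
# Hypothetical offline eval rows: (task_type, compression_ratio, mean_quality),
# aggregated from the replay experiments described above. Values are invented.
evals = [
    ("rag_summarization", 0.2, 0.91),
    ("rag_summarization", 0.5, 0.95),
    ("rag_summarization", 0.8, 0.96),
    ("legal_drafting",    0.2, 0.55),
    ("legal_drafting",    0.5, 0.70),
    ("legal_drafting",    0.8, 0.90),
]

def pick_budget(evals, task, min_quality=0.85):
    """Smallest ratio (most aggressive compression) still meeting the bar."""
    ok = [ratio for t, ratio, q in evals if t == task and q >= min_quality]
    return min(ok) if ok else 1.0  # 1.0 = no compression for this task
```

The resulting per-task ratios feed directly into the budgeter config described later.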
2. Tools for analyzing token usage at scale
At “billions of interactions” scale you want:
- Central token accounting & observability
- Portkey / Laminar / Braintrust type platforms already do: model-agnostic token logging, per-call spans, agent traces, budgets, dashboards.
- If you keep everything in-house, mirror that design:
- span-like records for each LLM/tool call,
- normalized token fields (input, output, cached, retries, etc.),
- linkage to interaction / workflow IDs.
- Libraries for per-call accounting
- tokenator: Python library that instruments calls and lets you query usage by provider/time-window; good for sanity checks and per-provider cost comparisons.
- Framework-native instrumentation (OpenTelemetry traces + custom attributes for tokens).
- Prompt compression tooling
3. Metrics to compute token efficiency per task
From your logs, define a schema roughly like:
interaction(id, task_type, user_id, success_label, created_at, ...)
step(id, interaction_id, step_index, tool_name, model, input_tokens, output_tokens, latency_ms, retry_flag, ...)
Then compute:
- Core efficiency metrics
- Token-waste indicators
- Comparative metrics
- For each task_type and model, compute:
- success rate,
- average tokens,
- “efficiency score” = success_rate / tokens_total.
You then rank:
- top N task types by total spend,
- top N by “tokens per success” (worst offenders).
4. Preventing agent loops and retries
Use logs to learn where loops happen, then enforce guards.
- Loop detection from traces
- Runtime controls
- Hard limits:
- max chain depth per task type,
- max retries per tool,
- “no same tool more than N times in a row” rule.
- Soft limits:
- if the agent hits a known loop pattern, short-circuit and escalate (or produce a best-effort answer).
- Logging flags:
- when guard triggers, record it; analyze if quality actually drops or improves (often improves).
5. Comparing token costs across providers
With multi-provider logs:
- Normalized fields per call: provider, model, input_tokens, output_tokens, price_per_1k_input, price_per_1k_output.
- For each task_type × provider × model:
- avg tokens per interaction,
- effective cost per interaction,
- success/quality metrics.
Compute:
- cost_per_success = total_cost / successful_interactions.
- Efficiency frontier: for each task, a curve of (quality, cost) across provider/model combos.
Platforms like Portkey already normalize tokens and cost across OpenAI, Anthropic, Vertex, etc., so you can piggyback on that pattern rather than re-invent it.
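The cost_per_success rollup can be sketched end to end with an in-memory SQLite table; the schema follows the normalized fields above, but the prices and rows are toy values:

```python
import sqlite3

# Toy normalized call log; prices are illustrative, not real provider rates.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calls (
  provider TEXT, model TEXT, task_type TEXT,
  input_tokens INT, output_tokens INT,
  price_in REAL, price_out REAL,   -- price per 1k tokens
  success INT
);
INSERT INTO calls VALUES
  ('openai',    'large',  'support', 2000, 500, 0.010, 0.030, 1),
  ('openai',    'large',  'support', 3000, 800, 0.010, 0.030, 0),
  ('anthropic', 'medium', 'support', 1500, 400, 0.003, 0.015, 1);
""")

# cost_per_success = total_cost / successful_interactions, per provider/model
rows = conn.execute("""
SELECT provider, model,
       SUM(input_tokens / 1000.0 * price_in
           + output_tokens / 1000.0 * price_out)
         / NULLIF(SUM(success), 0) AS cost_per_success
FROM calls
GROUP BY provider, model
""").fetchall()
```

The same query, pointed at the real warehouse, gives the efficiency-frontier inputs per task_type once you add it to the GROUP BY.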
6. Focused follow-up prompts you can drop in
Here are tightly scoped prompts you can reuse:
Design concrete SQL (BigQuery/Postgres) queries over an interactions/steps schema to compute tokens per successful outcome per task_type, and list the top 20 most wasteful workflows.

Given LLMLingua’s API, sketch a Python batch pipeline that takes historical prompts from a logs table, compresses them at multiple budget levels, replays them against a model, and writes compression vs quality metrics back to a warehouse.

Propose a model-routing policy trained on historical traces that selects between small/medium/large LLMs using only cheap features; define the exact loss that trades off accuracy against token cost.

Describe an algorithm that builds an agent graph from tool-call sequences, automatically detects common loops and redundant steps, and outputs a simplified graph annotated with observed token costs and success probabilities.

Specify a token-budgeting middleware API that sits in front of all LLM calls, enforcing per-task token caps, compression policies, and max chain-depth, with configuration derived from observed distributions in the logs.
If you share your current logging schema (or a redacted version), I can turn this into concrete SQL and a skeleton analytics+compression pipeline.
Sample SQL query for tokens per successful interaction
Python pipeline code using pandas to compute this metric
How to handle multi-turn sessions in token aggregation
Visualize tokens per outcome with charts in Laminar
Optimize high token usage tasks like in crewAI
REMIND ME TO SHARE: your current logging schema (or a redacted version), so this can be turned into concrete SQL and a skeleton analytics+compression pipeline.
You can implement all of this with a fairly small set of SQL patterns and Python jobs. Below I’ll keep it concrete and minimal, as requested.
1. Sample schema and SQL: tokens per successful outcome
Assume:
-- interactions: one row per conversation / task
interactions(
interaction_id TEXT PRIMARY KEY,
task_type TEXT,
success BOOLEAN, -- or SMALLINT 0/1
created_at TIMESTAMP
);
-- steps: one row per agent/tool/LLM step
steps(
step_id TEXT PRIMARY KEY,
interaction_id TEXT,
step_index INT,
tool_name TEXT,
model_name TEXT,
input_tokens INT,
output_tokens INT
);
1.1. Total tokens and tokens per success per task_type
Postgres/BigQuery:
WITH interaction_tokens AS (
SELECT
i.interaction_id,
i.task_type,
i.success,
SUM(s.input_tokens + s.output_tokens) AS total_tokens
FROM interactions i
JOIN steps s USING (interaction_id)
GROUP BY i.interaction_id, i.task_type, i.success
),
task_stats AS (
SELECT
task_type,
COUNT(*) AS interactions_total,
SUM(CASE WHEN success THEN 1 ELSE 0 END) AS interactions_success,
SUM(total_tokens) AS tokens_total,
SUM(CASE WHEN success THEN total_tokens ELSE 0 END) AS tokens_success
FROM interaction_tokens
GROUP BY task_type
)
SELECT
task_type,
interactions_total,
interactions_success,
tokens_total,
tokens_success,
CASE
WHEN interactions_success > 0
THEN tokens_success::FLOAT / interactions_success
ELSE NULL
END AS tokens_per_success
FROM task_stats
ORDER BY tokens_per_success DESC NULLS LAST;
1.2. Top 20 “most wasteful” workflows (by task_type)
Same query, just limit:
... -- same CTEs as above
SELECT
task_type,
interactions_total,
interactions_success,
tokens_total,
tokens_success,
tokens_per_success
FROM task_stats
WHERE interactions_success > 0
ORDER BY tokens_per_success DESC
LIMIT 20;
You can refine “workflow” by combining task_type with other fields (e.g., entry_agent, customer_segment).
2. Python batch pipeline: LLMLingua compression on logs
High-level staged pipeline using pandas and a generic LLM client.
Assume table:
prompts_log(
interaction_id TEXT,
step_id TEXT,
task_type TEXT,
full_prompt TEXT,
model_name TEXT,
output_text TEXT,
success BOOLEAN
);
2.1. Skeleton pipeline
import pandas as pd
from llmlingua import PromptCompressor
from some_llm_client import call_llm # your own wrapper
from sqlalchemy import create_engine
ENGINE_URI = "postgresql+psycopg2://user:pass@host/db"
compression_ratios = [0.2, 0.5, 0.8] # target fraction of original tokens
compressor = PromptCompressor()
def estimate_tokens(text: str) -> int:
    # fast approximate tokenizer, or call provider's tokenizer offline
    return len(text.split())

def compress_prompt(text: str, target_ratio: float) -> str:
    target_tokens = int(estimate_tokens(text) * target_ratio)
    return compressor.compress_prompt(text, target_token=target_tokens)["compressed_prompt"]

def similarity_metric(a: str, b: str) -> float:
    # placeholder; implement your choice (BLEU/ROUGE/BERTScore/task-specific)
    return 1.0 if a.strip() == b.strip() else 0.0

def main():
    engine = create_engine(ENGINE_URI)
    # load a sample of historical prompts
    df = pd.read_sql("""
        SELECT interaction_id, step_id, task_type,
               full_prompt, model_name, output_text, success
        FROM prompts_log
        WHERE success IS NOT NULL
        LIMIT 10000
    """, engine)
    rows = []
    for _, row in df.iterrows():
        orig_prompt = row["full_prompt"]
        orig_tokens = estimate_tokens(orig_prompt)
        for ratio in compression_ratios:
            compressed = compress_prompt(orig_prompt, target_ratio=ratio)
            comp_tokens = estimate_tokens(compressed)
            # replay against same model (or a chosen eval model)
            new_output = call_llm(
                model=row["model_name"],
                prompt=compressed,
            )
            sim = similarity_metric(row["output_text"], new_output)
            rows.append({
                "interaction_id": row["interaction_id"],
                "step_id": row["step_id"],
                "task_type": row["task_type"],
                "model_name": row["model_name"],
                "orig_tokens": orig_tokens,
                "comp_ratio": ratio,
                "comp_tokens": comp_tokens,
                "tokens_saved": orig_tokens - comp_tokens,
                "orig_success": row["success"],
                "sim_score": sim,
                "new_output": new_output,
            })
    out_df = pd.DataFrame(rows)
    out_df.to_sql("prompt_compression_eval", engine, if_exists="append", index=False)

if __name__ == "__main__":
    main()
This gives you per-task, per-ratio cost/quality tradeoff curves that you can aggregate in SQL.
3. Model-routing policy with cost-aware loss
Train a simple classifier/regressor from cheap features to a model class {small, medium, large}.
3.1. Features (per request)
- query length (tokens),
- task_type,
- historical difficulty (past_failure_rate for similar requests),
- user segment,
- maybe simple semantic bucket (cluster ID precomputed offline).
3.2. Label
For each recorded interaction, you can simulate:
- run the same query with small/medium/large models offline, or use logged multi-model A/B traces if you have them.
- label each model as success/failure + quality score.
Then define for each sample $i$ and model $m$:
- quality $q_{i,m} \in [0,1]$ (or task-specific),
- cost (tokens × price).
Train a router $f_\theta(x_i) \to p(m \mid x_i)$ (e.g., softmax over three models).
3.3. Loss with accuracy–cost tradeoff
For each sample, expected utility:
$$ U_i = \sum_m p_\theta(m \mid x_i) \left( q_{i,m} - \lambda \cdot \text{cost}_{i,m} \right) $$
Maximize total $U$ or equivalently minimize
$$ L(\theta) = -\frac{1}{N}\sum_i U_i. $$
You can implement this with any standard classifier; just plug this loss instead of cross-entropy.
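The loss above can be sketched in plain Python with explicit probabilities (no autograd framework), assuming offline per-model quality and cost labels as described; variable names are illustrative:

```python
import math

def router_loss(logits, quality, cost, lam=0.001):
    """Negative mean expected utility over samples.

    logits[i][m]:  router scores for sample i, model m (softmaxed below)
    quality[i][m]: offline quality label q_{i,m} in [0, 1]
    cost[i][m]:    offline cost (tokens x price) for sample i on model m
    lam:           the lambda trading quality against cost (illustrative value)
    """
    total = 0.0
    for lg, q, c in zip(logits, quality, cost):
        # numerically stable softmax over model choices
        mx = max(lg)
        exps = [math.exp(v - mx) for v in lg]
        z = sum(exps)
        p = [e / z for e in exps]
        # expected utility U_i = sum_m p(m|x_i) * (q_{i,m} - lam * cost_{i,m})
        total += sum(pi * (qi - lam * ci) for pi, qi, ci in zip(p, q, c))
    return -total / len(logits)
```

In a real trainer the logits come from the feature model and this quantity is minimized by gradient descent; the structure of the loss is identical.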
4. Agent graph simplification from tool-call sequences
4.1. Build the agent graph
From steps:
- Node: state_id, e.g. (tool_name, model_name, intent_cluster), or just tool_name for a first pass.
- Directed edge: state_i -> state_j if step j follows step i in any interaction.
For each node and edge, maintain:
- visit count,
- total tokens used on that node/edge,
- success counts of interactions passing through them.
4.2. Detect loops and redundant patterns
- Loops: cycles in the graph; you can focus on small cycles A→B→A, A→A, B→C→B, etc.
- Redundant subsequences: paths where entering them almost never changes success probability but always consumes tokens.
Algorithm sketch:
- For each interaction, record the sequence of node IDs.
- Extract n-grams (length 2–4) of nodes.
- For each distinct n-gram, compute:
- frequency,
- average tokens spent in that segment,
- change in success probability conditional on visiting that segment.
- Mark as wasteful any segment with:
- high frequency,
- high tokens,
- negligible or negative effect on success.
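The n-gram sketch above as a small self-contained function; the thresholds and the "lift over base success rate" definition are illustrative choices, not the only way to score segments:

```python
from collections import defaultdict

def wasteful_ngrams(traces, n=2, min_freq=2, max_lift=0.0):
    """Flag node n-grams that recur often but don't raise success probability.

    traces: list of (node_sequence, per_step_tokens, success) tuples.
    Returns (ngram, occurrences, avg_tokens, lift) for segments whose
    conditional success lift over the base rate is <= max_lift.
    """
    base = sum(s for _, _, s in traces) / len(traces)
    occ, toks = defaultdict(int), defaultdict(int)
    hit, hit_succ = defaultdict(int), defaultdict(int)
    for seq, step_tokens, success in traces:
        grams = set()
        for i in range(len(seq) - n + 1):
            g = tuple(seq[i:i + n])
            occ[g] += 1                       # frequency (all occurrences)
            toks[g] += sum(step_tokens[i:i + n])
            grams.add(g)
        for g in grams:                       # per-trace, for success stats
            hit[g] += 1
            hit_succ[g] += success
    flagged = []
    for g, freq in occ.items():
        if freq < min_freq:
            continue
        lift = hit_succ[g] / hit[g] - base    # change in success probability
        if lift <= max_lift:
            flagged.append((g, freq, toks[g] / freq, lift))
    return flagged
```

Running this over real traces with n in 2..4 gives the candidate list for the collapse/reorder step in 4.3.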
4.3. Synthesize a minimal graph
- Remove or collapse wasteful segments:
- if A→B→C is usually followed immediately by D and B,C add no benefit, replace with A→D in your “ideal” graph.
- Annotate the remaining nodes/edges with:
- avg tokens,
- probability of success given reaching that node,
- maybe latency.
This “minimal” agent graph becomes your design reference for refactoring agent flows.
5. Token-budgeting middleware API
Wrap all LLM calls through a middleware that knows per-task budgets and compression policies.
5.1. Configuration (from logs)
For each task_type:
- max_context_tokens (P95 or P99 of observed good runs),
- max_chain_depth (P95 of successful interactions),
- compression_profile (e.g., LLMLingua ratio chosen from the batch eval table),
- default_model_class and allowed up-scaling conditions.
Store in a config table or service:
{
"task_type": "support_question",
"max_context_tokens": 4096,
"max_chain_depth": 12,
"compression_ratio": 0.5,
"default_model": "medium",
"allow_large_if": {"user_tier": "enterprise"}
}
5.2. Middleware interface
In code:
class TokenBudgeter:
    def __init__(self, config_store, compressor, tokenizer):
        self.config_store = config_store
        self.compressor = compressor
        self.tokenizer = tokenizer

    def prepare_request(self, task_type, user_ctx, messages, depth):
        cfg = self.config_store.get(task_type)
        # enforce chain-depth
        if depth > cfg["max_chain_depth"]:
            raise MaxDepthExceeded()
        # serialize messages to a prompt string
        prompt = serialize_messages(messages)
        tokens = self.tokenizer(prompt)
        # enforce context cap with compression
        if len(tokens) > cfg["max_context_tokens"]:
            target_ratio = cfg["compression_ratio"]
            prompt = self.compressor.compress_prompt(
                prompt,
                target_token=int(cfg["max_context_tokens"] * target_ratio)
            )["compressed_prompt"]
        model = select_model(cfg, user_ctx)
        return model, prompt
All agents call TokenBudgeter.prepare_request(...) before any LLM call; you log both pre- and post-budget metrics to continuously refine these budgets.
6. Handling multi-turn sessions in aggregation
For multi-turn sessions:
- Use session_id or interaction_id as the aggregation key.
- Sum tokens over all turns/steps in that session; success is defined at the session level (e.g., ticket resolved).
- Same SQL pattern as above; just ensure interaction_id maps to a full multi-turn exchange.
7. Visualization and optimization
- Once you compute tokens-per-success per task_type and per model, you can export to whatever you’re using (Laminar, Grafana, internal UI) and plot:
- bar chart: tokens_per_success by task_type,
- scatter: success_rate vs tokens_per_success across workflows.
- To mimic crewAI-style optimization, focus on:
- tasks with high tokens_per_success and low success_rate,
- tasks with extreme chain depths or context sizes.
And yes: REMINDER — when you’re ready, share your current logging schema (or a redacted version) and I can turn this into concrete SQL and a more tailored analytics + compression pipeline for your stack.
CREATE A FOLLOWUP set of TO DOs in the form of prompts. we have a mixture of concerns in this thread, split the thread into multiple threads and break out and create prompts for followup.
Here’s a set of focused follow-up prompts, grouped into distinct threads you can spin off.
Thread 1 – Convolutional sparse coding theory and algorithms
- CSC vs standard sparse coding fundamentals “Explain in detail how convolutional sparse coding differs from standard sparse coding, including formal objectives, structural assumptions, and when each is preferable in practice.”
- Convolutional Matching Pursuit derivation “Derive convolutional matching pursuit (CMP) step by step from the convolutional sparse coding objective, including the residual energy argument and exact conditions under which each greedy step is optimal.”
- CMP vs OMP vs ADMM/ISTA “Compare CMP, convolutional OMP, and ADMM/ISTA-based CSC algorithms in terms of complexity, convergence properties, reconstruction quality, and suitability for large-scale 2D data.”
- Continuous Basis Pursuit and block sparsity “Write out the continuous basis pursuit (CBP) formulation for translation-invariant signals, derive its convex objective, and show explicitly how it corresponds to block-sparse (group-lasso-like) penalties.”
- ML-CSC stability conditions “From Sulam, Papyan, and Elad’s ML-CSC work, restate the exact sparsity and mutual coherence conditions required for stable multi-layer code recovery, and give intuitive explanations for each term.”
Thread 2 – CSC, CNNs, and architecture design rules
- CNNs as ML-CSC pursuit “Show how a standard CNN forward pass (conv + ReLU + pooling) can be interpreted as layered thresholding pursuit in a multi-layer convolutional sparse coding model, including mapping of each operation.”
- Design rules for CNNs from ML-CSC theory “Translate ML-CSC stability assumptions into concrete CNN design rules: filter sizes, strides, activation sparsity, normalization, and redundancy, with numeric examples and failure modes.”
- Adversarial robustness and CSC “Using Papyan/Elad’s analysis, explain how convolutional sparse coding and ML-CSC relate to adversarial robustness, and propose CNN design tweaks that improve stability without excessive cost.”
Thread 3 – Implementations in Python / PyTorch
- 2D CMP and OMP implementations “Write a clean, production-ready PyTorch implementation of 2D convolutional matching pursuit and convolutional OMP, with test harnesses on synthetic images and complexity benchmarks.”
- 1D audio CMP example “Implement CMP for 1D audio signals in PyTorch or NumPy, generate synthetic sparse mixtures of learned kernels, and evaluate reconstruction error versus sparsity and runtime.”
- Convolutional dictionary learning “Implement a basic 2D convolutional dictionary learning loop (CSC + dictionary update) in PyTorch, using FFT-based convolutions and ADMM/ISTA, and test on small natural image patches.”
- Batch and parallel CMP “Design a batched CMP algorithm for multiple images at once using PyTorch, showing how to parallelize convolutions and selection steps across batch and filter dimensions efficiently.”
Thread 4 – Applications of CSC/CMP to scientific data
- CSC for MD time series “Design and implement an experiment where convolutional sparse coding is applied to molecular dynamics observable time series to discover recurring motifs, including metrics, visualization, and interpretation.”
- Multivariate CSC for EM/EEG/MEG “Describe how to apply multivariate convolutional sparse coding to electromagnetic or EEG/MEG data, including model formulation, constraints, and practical preprocessing steps.”
- CSC in protein/contact maps “Propose a concrete CSC-based architecture for protein contact maps or residue–residue features, detailing how kernels, sparsity, and pooling could produce interpretable structural motifs.”
Thread 5 – Token economics and log-based optimization
- SQL analytics for tokens per success “Given interactions and steps tables with token counts and success flags, write SQL to compute tokens per successful outcome per task_type, and surface the top 20 most wasteful workflows.”
- Prompt-compression batch evaluation with LLMLingua “Design a Python batch job that reads historical prompts from a warehouse, applies LLMLingua at several compression ratios, replays compressed prompts against an LLM, and logs cost vs quality metrics.”
- Cost-aware model routing “Formulate and implement a cost-aware model-routing policy that chooses between small/medium/large LLMs using only cheap features from logs, with a loss that trades off accuracy against token cost.”
- Agent-graph simplification from traces “Given tool-call sequences with token costs and success labels, describe and implement an algorithm to build an agent graph, detect loops and redundant segments, and synthesize a simplified, cost-efficient graph.”
- Token-budgeting middleware design “Specify and prototype a token-budgeting middleware API that, for each task_type, enforces context token caps, max chain depth, and prompt compression, using empirical budgets derived from historical logs.”
- Cross-provider token cost benchmarking “Using normalized logs across multiple LLM providers, compute and visualize cost-per-success and quality-per-token per task_type, and identify where provider/model switching yields the largest savings.”
Thread 6 – Tooling, libraries, and observability
- CSC library comparison and selection “Compare SPORCO, alphaCSC, and custom PyTorch implementations for convolutional sparse coding in terms of flexibility, performance, and suitability for scientific time-series and image tasks.”
- Token observability stack “Design a token observability stack that ingests billions of LLM interactions, normalizes token and cost metrics, and exposes dashboards and alerts for token-efficiency regressions.”
- Loop/retry prevention policies “From historical logs, learn patterns of agent loops and retries, and propose runtime policies and guardrails that cap retries, cut loops, and surface early exits while preserving success rates.”
Thread 7 – Formal definitions and glossaries
- Formal CSC / CMP cheat sheet “Produce a succinct cheat sheet of core CSC/CMP definitions, objectives, and algorithms (MP, OMP, ADMM, ISTA/FISTA, CBP), with equations and when-to-use guidelines.”
- Domain-specific glossaries “Generate a CSC/CMP-centric glossary tailored to (a) computer vision, (b) neuroscience/EEG, and (c) molecular dynamics, including typical data shapes and example use cases.”
When you’re ready to go deeper on token economics and concrete queries, remind yourself to share a redacted version of your current logging schema so we can turn these into exact SQL and pipeline code.
Expand on the document management system I have built for research, and how this system can be automated for agentic-system-driven research and development. With autonomy, what should the control plane look like to create innovation agentically?
Your document management stack is already the backbone of an autonomous research system: you’ve normalized PDFs into rich UDOM/Markdown with structure, equations, figures, and QA scores. The next step is to surround it with agents and a control plane that can drive full research and R&D loops with bounded autonomy.
1. From “document store” to “research substrate”
Your current system gives you:
- Canonical, structured research artifacts (paper → UDOM/MD, with headings, equations, figures, bibliography, QA grade).
- Stable IDs (doc_id like arxiv:1010.0422) and paths, plus extracted metadata.
- A ready-made corpus for retrieval, citation graph building, and code/experiment synthesis.
To make it agentic:
- Add semantic indexing at multiple granularities: document, section, paragraph, figure, equation, plus task-specific embeddings (methods, results, theory).
- Build a research knowledge graph: authors, topics, methods, datasets, hyperparameters, with edges like “extends”, “compares to”, “same benchmark”.
- Expose this via a RAG API that agents can query declaratively (e.g., find_papers(topic="convolutional sparse coding", year>2015, method="ADMM")).
This turns your doc system into a programmable “research substrate” that agents can systematically explore, not just a passive corpus.
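A minimal sketch of such a declarative query wrapper over the metadata side of the index; the PaperQuery fields, the in-memory record shape, and the second doc_id are all invented for illustration (a real version would delegate topic matching to embedding search):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PaperQuery:
    """Declarative filter an agent can construct and pass around."""
    topic: str
    year_min: Optional[int] = None
    method: Optional[str] = None

def find_papers(index, q):
    """Filter structured metadata records; names/fields are hypothetical."""
    return [
        doc["doc_id"] for doc in index
        if q.topic in doc["topics"]
        and (q.year_min is None or doc["year"] >= q.year_min)
        and (q.method is None or q.method in doc["methods"])
    ]

# Toy corpus records; the second doc_id is invented for the example.
index = [
    {"doc_id": "arxiv:1010.0422", "year": 2010,
     "topics": ["convolutional sparse coding"], "methods": ["CMP"]},
    {"doc_id": "arxiv:1707.00001", "year": 2017,
     "topics": ["convolutional sparse coding"], "methods": ["ADMM"]},
]
```

The value of the dataclass form is that the same query object can be logged, cached, and replayed by agents without re-parsing natural language.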
2. Agent roles over your research substrate
Think of your system as a lab; define specialized agents:
- Curator / Retriever agent
- Queries your index/graph, filters by QA grade, recency, domain.
- Produces structured reading lists with justification (why each paper is relevant).
- Reader / Summarizer agent
- Converts UDOM sections into structured notes: problems, assumptions, objectives, algorithms, complexity, experimental setup.
- Normalizes notation across papers (e.g., all CSC formulations into a canonical form).
- Comparator / Synthesizer agent
- Aligns methods across papers, builds comparison tables, and identifies disagreement or gaps.
- For CSC, it can compare CMP vs ADMM vs FISTA vs CBP vs ML‑CSC assumptions and guarantees.
- Implementer agent
- Generates pseudocode and executable code (PyTorch, JAX) from method sections, then writes runnable experiments (config + eval scripts).
- Experiment designer agent
- Designs benchmark experiments (datasets, metrics, ablations) to test hypotheses derived from the literature.
- Critic / Reviewer agent
This mirrors frameworks like LiRA or LatteReview, but your store covers all of ML/CSC/agentic systems, so each agent can work directly off structured papers rather than raw PDFs.
3. What the control plane should look like
The control plane is the orchestrator and governor of all these agents and tools. A good mental model is “Control Plane as a Tool”: a single, privileged tool that hides orchestration complexity behind a simple interface.
3.1. Responsibilities of the control plane
- Goal and workflow management
- Accept high-level research goals:
- “Survey translation-invariant convolutional sparse coding and design a new CMP variant with better time-series performance.”
- Decompose into stages: literature search → reading → synthesis → implementation → experiments → report.
- Agent orchestration
- Tool routing / control plane as a tool
- Expose a single control_plane.act(...) tool to the PI agent; inside, route to:
- your doc search and graph APIs,
- code execution sandboxes,
- external data sources.
- This matches the “Control Plane as a Tool” pattern: the agent sees one tool, the control plane implements all the complexity and governance.
- Governance and safety
- Learning and evolution
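A minimal sketch of the single-entrypoint pattern with one policy hook (a token budget); class and method names are illustrative, not an existing framework API:

```python
class ControlPlane:
    """One act() entrypoint routing to registered backends, with a budget
    policy applied before any backend runs. Names are hypothetical."""

    def __init__(self, budget_tokens):
        self.backends = {}
        self.budget = budget_tokens
        self.spent = 0

    def register(self, name, fn):
        """Register a backend (doc search, code sandbox, data source, ...)."""
        self.backends[name] = fn

    def act(self, action, cost_tokens=0, **kwargs):
        """The single tool the PI agent sees; governance lives here."""
        if action not in self.backends:
            raise ValueError(f"unknown action: {action}")
        if self.spent + cost_tokens > self.budget:
            raise RuntimeError("token budget exceeded")
        self.spent += cost_tokens
        return self.backends[action](**kwargs)
```

Safety checks, logging, and model routing slot into act() the same way the budget check does, invisibly to the calling agent.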
3.2. Control plane components
At minimum:
- Workflow engine – DAG or state machine for research workflows (similar to LangGraph / CrewAI orchestration, but domain-specific).
- Policy layer – token budgets, model-routing, tool permissions, and loop/retry limits.
- Observation/logging – all agent steps with token counts, retrieved docs, code artifacts, and experimental results.
- Feedback channels – human feedback, evaluation metrics, and automated tests feeding back into policy and prompts.
4. Automation patterns on top of your doc system
Some concrete ways to automate research tasks:
- Autonomous literature reviews
- Method extraction and unification
- Agents iterate over all CSC-related papers, extract objective functions, algorithms, and assumptions into a structured “method registry”.
- This registry becomes a searchable catalog: “all algorithms minimizing $\ell_1$ CSC with ADMM”, “all CMP variants with FFT speedups”.
- Automated benchmark design
- Given a method cluster (e.g., CMP vs ADMM-based CSC), agents propose fair benchmarking protocols using existing datasets, metrics, and compute constraints in your environment.
- Implementation agent generates code; experiment agent runs jobs; synthesizer compares results and feeds back to knowledge base.
- Innovation loops
- PI agent observes gaps: “no method handles multivariate CSC for MD time series with online updates”.
- It instructs implementer agents to prototype a hybrid algorithm (e.g., CMP + ADMM), and experiment agents to test it.
Your doc system provides the evidence; the control plane and agents exploit it.
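The method-registry queries above (e.g., “all CMP variants with FFT speedups”) reduce to predicate filters over structured records. A minimal sketch, where the field names (`family`, `tags`, `solver`) are illustrative rather than an existing schema:

```python
# Toy method registry; entries and fields are hypothetical examples.
registry = [
    {"name": "CMP-FFT",   "family": "CMP", "tags": ["fft_speedup"], "solver": "greedy"},
    {"name": "ADMM-CSC",  "family": "CSC", "tags": ["l1"],          "solver": "ADMM"},
    {"name": "FISTA-CSC", "family": "CSC", "tags": ["l1"],          "solver": "FISTA"},
]

def query(registry, family=None, tag=None, solver=None):
    """Return methods matching all given criteria (None = wildcard)."""
    return [
        m for m in registry
        if (family is None or m["family"] == family)
        and (tag is None or tag in m["tags"])
        and (solver is None or m["solver"] == solver)
    ]
```

In production these predicates would become indexed columns on the `methods` table, but the query semantics stay the same.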
5. What autonomy “for innovation” practically means
To keep it tractable and safe:
- Autonomy levels
- Control knobs
- Token and cost budgets per project.
- Risk level (low-risk: literature and simulation; high-risk: lab integration).
- Explainability requirement (agents must output reasoning, assumptions, and citations).
- Success criteria
- For innovation, track not just correctness, but novelty: new method families, improved benchmarks, or new hypotheses backed by consistent evidence.
6. Concrete follow-up prompts for design/implementation
You can reuse these to push design details:
- Given a UDOM-based research corpus (PDF → structured Markdown with equations and figures), design the schema and APIs for a semantic index and knowledge graph that agentic research workflows can query.182179
- Specify a multi-agent architecture for autonomous literature review and method synthesis over my CSC corpus, including agent roles, inputs/outputs, and how they use the document store.180181184
- Design a control-plane-as-a-tool interface that exposes retrieval, code execution, and experiment scheduling as a single tool to a PI agent, including policy hooks for budgets and safety.185186187
- Propose concrete autonomy levels and HITL checkpoints for an agentic research system that can go from hypothesis generation to code and simulation on my infrastructure.191183190
- Using my existing logging and token-economics setup, define how to evaluate and evolve research agents over time, adjusting routing, prompts, and workflows from observed performance.192187179
If you sketch your current doc system components (schemas, APIs, and where the UDOM artifacts live), I can map this into a concrete control-plane architecture and agent spec tailored to CODITECT.
frameworks like LiRA or LatteReview: research and explain.
- Given a UDOM-based research corpus (PDF → structured Markdown with equations and figures), design the schema and APIs for a semantic index and knowledge graph that agentic research workflows can query.
- Specify a multi-agent architecture for autonomous literature review and method synthesis over my CSC corpus, including agent roles, inputs/outputs, and how they use the document store.
- Design a control-plane-as-a-tool interface that exposes retrieval, code execution, and experiment scheduling as a single tool to a PI agent, including policy hooks for budgets and safety.
- Propose concrete autonomy levels and HITL checkpoints for an agentic research system that can go from hypothesis generation to code and simulation on my infrastructure.
- Using my existing logging and token-economics setup, define how to evaluate and evolve research agents over time, adjusting routing, prompts, and workflows from observed performance.
LiRA and LatteReview are good design references for what you want CODITECT to become: multi-agent frameworks over a structured corpus, with an orchestration layer that emulates human literature review and synthesis while keeping rigor.195196
I’ll break this into four concrete design areas.
1. LiRA / LatteReview patterns you should borrow
- LatteReview
- LiRA
- Multi-agent workflow for writing reliable, readable literature reviews: outline → subsection drafting → editing → reviewing.196199200
- Emulates human process: deep reading first, then structuring, then writing, then editorial QA and citation checking.
- Emphasis on citation faithfulness and readability, with evaluation on SciReviewGen etc.199196
You can treat your UDOM-based system as the “document backend” that these frameworks assume, but generalized to all ML/CSC/agentic topics.
2. Schema & APIs for a semantic index and knowledge graph
Assume you have UDOM docs (Markdown plus structured metadata). You want both:
- a semantic index for RAG, and
- a knowledge graph (KG) for higher-level reasoning.
2.1. Core schema (relational view)
Minimal core tables:
- documents(doc_id, title, venue, year, authors[], doi, arxiv_id, topics[], qa_score, path_md, path_udom_json)
- sections(section_id, doc_id, section_type, title, level, order_index, text, embedding_vector)
  - section_type: {abstract, intro, related, methods, theory, experiments, conclusion}
- equations(eq_id, doc_id, section_id, latex, normalized_form, role)
  - role: {objective, constraint, update_rule, complexity, other}
- figures(fig_id, doc_id, section_id, caption, type, assets_path, embedding_vector)
  - type: {architecture, algorithm, experiment, dataset, theory}
- citations(citing_doc_id, cited_doc_id, context_section_id, context_span)
- methods(method_id, doc_id, name, family, objective_text, algorithm_steps, assumptions_text, complexity_text)
  - family: {CSC, CMP, ADMM-CSC, FISTA-CSC, ML-CSC, CBP, etc.}
Each embedding_vector column lives in a vector store (e.g., a sections_embedding table or an external index).
2.2. Knowledge graph schema (logical)
Nodes:
- Paper(doc_id)
- Author(author_id, name)
- Method(method_id, family)
- Dataset(dataset_id, name)
- Task(task_id, name)
- Concept(concept_id, name), e.g. “translation-invariant sparse coding”, “ADMM”, “mutual coherence”.
Edges:
- AUTHORED(author_id -> doc_id)
- CITES(doc_id -> doc_id)
- IMPLEMENTS(doc_id -> method_id)
- EVALUATED_ON(method_id -> dataset_id)
- USED_FOR(method_id -> task_id)
- ASSUMES(method_id -> concept_id)
- IMPROVES_ON(method_id -> method_id)
- EXTENDS(doc_id -> doc_id)
This can be in Neo4j, a graph DB, or a graph layer on top of relational tables.
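Before committing to a graph engine, the node/edge schema above can be prototyped as typed adjacency lists in memory. A minimal sketch; the class and example identifiers are hypothetical:

```python
from collections import defaultdict

class KnowledgeGraph:
    """Tiny typed graph: nodes are (type, id) pairs, edges carry a label."""
    def __init__(self):
        self.nodes = {}                  # (type, id) -> attribute dict
        self.edges = defaultdict(list)   # (type, id) -> [(edge_type, (type, id))]

    def add_node(self, ntype, nid, **attrs):
        self.nodes[(ntype, nid)] = attrs

    def add_edge(self, etype, src, dst):
        self.edges[src].append((etype, dst))

    def neighbors(self, node, edge_type=None):
        return [dst for et, dst in self.edges[node]
                if edge_type is None or et == edge_type]

# Populate with the example entities from the schema above.
kg = KnowledgeGraph()
kg.add_node("Paper", "1010.0422", title="Convolutional Matching Pursuit")
kg.add_node("Method", "cmp", family="CMP")
kg.add_node("Dataset", "ds1")
kg.add_edge("IMPLEMENTS", ("Paper", "1010.0422"), ("Method", "cmp"))
kg.add_edge("EVALUATED_ON", ("Method", "cmp"), ("Dataset", "ds1"))
```

The same `neighbors` operation is what the `/kg/method/{method_id}/context` endpoint would return, regardless of the backing store.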
2.3. APIs for agentic workflows
You want high-level, declarative APIs; examples:
- Search & retrieve
  - GET /research/search
    - params: query, filters (year range, family, task), top_k
    - returns: list of doc_id + snippets + scores.
  - GET /research/sections
    - params: doc_id, section_type
    - returns: structured sections (UDOM sections or Markdown chunks).
- Methods & equations
  - GET /methods/by_family?family=CSC&since=2015
    - returns: method_ids, doc_ids, objective_text, key assumptions.
  - GET /equations/by_role?role=objective&doc_id=1010.0422
    - returns: LaTeX + normalized forms.
- Knowledge graph queries
  - GET /kg/method/{method_id}/context
    - returns: nodes/edges around a method (papers, datasets, tasks, concepts).
  - POST /kg/query with a small DSL, e.g.:
{
"pattern": "MATCH (m:Method {family:'CSC'})-[:EVALUATED_ON]->(d:Dataset) RETURN m, d"
}
Agents call these instead of hitting raw databases.
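On the agent side, these endpoints can be wrapped in a thin client so agent code never builds URLs by hand. A sketch with an injected transport callable, which keeps the client testable offline; the base URL and transport signature are assumptions:

```python
from urllib.parse import urlencode

class ResearchClient:
    """Thin wrapper over the research/KG HTTP APIs described above."""
    def __init__(self, base_url, transport):
        self.base_url = base_url.rstrip("/")
        self.transport = transport  # callable: (method, url, json_body) -> dict

    def search(self, query, top_k=10, **filters):
        qs = urlencode({"query": query, "top_k": top_k, **filters})
        return self.transport("GET", f"{self.base_url}/research/search?{qs}", None)

    def kg_query(self, pattern):
        return self.transport("POST", f"{self.base_url}/kg/query", {"pattern": pattern})
```

Swapping the transport for a real HTTP call (e.g., `httpx`) is a one-line change, while unit tests pass a fake transport.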
3. Multi-agent architecture over your CSC corpus
Specialized agents using the above APIs.
3.1. Roles
- PI / Planner agent
- Input: research goal (e.g., “survey translation-invariant CSC and propose a new CMP variant for MD time series”).
- Output: plan: list of tasks (retrieve, read, synthesize, implement, experiment, write).
- Curator agent
  - Calls /research/search and the KG to find candidate papers.
  - Filters by QA, recency, method family; outputs a prioritized reading list with reasons.
- Reader / Extractor agent
  - For each paper, fetches sections via /research/sections.
  - Extracts structured fields into methods, equations, assumptions, complexity, datasets, results.
  - Writes into your methods table / KG via internal APIs.
- Comparator / Synthesizer agent
  - Given a set of method_ids, pulls method metadata and equations.
  - Produces: comparison tables, highlights of differences, identified gaps (e.g., no algorithm covers multivariate time-series CSC with online updates).
- Implementer agent
- For a chosen method or hybrid, uses method specs to generate code (PyTorch/JAX) and experiment scripts.
- Uses a code-execution tool (sandbox) and logs artifacts.
- Experiment agent
  - Schedules and runs experiments with config sweeps.
  - Collects metrics and writes back to an experiments table.
- Reviewer / Report agent (LiRA-like)
- Takes synthesized content + results; writes structured survey sections.
- Checks citations via KG, ensures each claim is supported by UDOM sections.
- Edits for readability, produces final Markdown/PDF.
3.2. Data flow
- PI agent submits goal → control plane.
- Curator agent builds reading list from UDOM/semantics.
- Reader agent populates method registry/KG.
- Synthesizer agent proposes novel method directions or combinations.
- Implementer + Experiment agents build and run code.
- Reviewer agent writes up the literature context + new contributions.
Your existing document system is central in steps 2–4.
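The six-step flow above can be sketched as a linear pipeline in which each agent is a function extending a shared context dict. All stage bodies below are stubs for illustration only:

```python
# Each stage reads from and writes to a shared context; the control plane
# would wrap each call with logging and policy checks.
def curator(ctx):      ctx["reading_list"] = ["1010.0422"]; return ctx
def reader(ctx):       ctx["methods"] = [{"name": "CMP", "from": d} for d in ctx["reading_list"]]; return ctx
def synthesizer(ctx):  ctx["gaps"] = ["online multivariate CSC"]; return ctx
def implementer(ctx):  ctx["code"] = "prototype.py"; return ctx
def experimenter(ctx): ctx["results"] = {"loss": 0.1}; return ctx
def reviewer(ctx):     ctx["report"] = f"{len(ctx['methods'])} methods, gaps: {ctx['gaps']}"; return ctx

def run_workflow(goal):
    ctx = {"goal": goal}
    for stage in (curator, reader, synthesizer, implementer, experimenter, reviewer):
        ctx = stage(ctx)
    return ctx
```

A real workflow engine replaces the `for` loop with a DAG or state machine, but the contract (stages communicating through typed artifacts) is the same.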
4. Control-plane-as-a-tool interface
You want the PI agent to see one tool with a rich “action space” that you implement underneath.
4.1. Tool schema
Single tool: control_plane.act(request: JSON) -> JSON.
Example input:
{
"goal_id": "csc-md-001",
"action": "RUN_TASK",
"task_type": "LITERATURE_REVIEW",
"params": {
"topic": "translation-invariant convolutional sparse coding",
"focus": ["CMP", "ADMM-CSC", "ML-CSC"],
"year_min": 2010
},
"constraints": {
"token_budget": 200000,
"max_wallclock_minutes": 60,
"risk_level": "low"
}
}
Example output:
{
"status": "COMPLETED",
"artifacts": {
"reading_list_doc_ids": [...],
"summary_md_path": "...",
"methods_table_id": "methods_translation_invariant_csc_v1"
},
"usage": {
"tokens": 153422,
"tool_calls": 187
}
}
4.2. Responsibilities inside the control plane
Internally, control_plane.act:
- Looks up policies (budgets, allowed tools, models) for this task_type and risk_level.
- Breaks the task into steps (using a workflow engine/DAG).
- Invokes specific worker agents (curator, reader, etc.).
- Enforces budgets and safety: stops loops, limits external calls, enforces sandboxing.
- Logs all traces into your token-economics system.
This matches the “Control Plane as a Tool” pattern: a single “super-tool” with strong policy hooks.201202203
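The dispatch logic inside `control_plane.act` can be sketched as policy lookup, handler dispatch, and usage reporting. The policy table and handler are placeholder assumptions, not a real implementation:

```python
# Hypothetical policy table keyed by (task_type, risk_level).
POLICIES = {("LITERATURE_REVIEW", "low"): {"token_budget": 200_000}}

def run_literature_review(params, budget):
    # Stub worker: returns (artifacts, tokens_consumed).
    return {"reading_list_doc_ids": ["1010.0422"]}, 1500

HANDLERS = {"LITERATURE_REVIEW": run_literature_review}

def act(request):
    """Single entry point: check policy, dispatch, report usage."""
    key = (request["task_type"], request["constraints"]["risk_level"])
    policy = POLICIES.get(key)
    if policy is None:
        return {"status": "REJECTED", "reason": f"no policy for {key}"}
    budget = min(policy["token_budget"], request["constraints"]["token_budget"])
    artifacts, tokens = HANDLERS[request["task_type"]](request.get("params", {}), budget)
    if tokens > budget:
        return {"status": "BUDGET_EXCEEDED", "usage": {"tokens": tokens}}
    return {"status": "COMPLETED", "artifacts": artifacts, "usage": {"tokens": tokens}}
```

The agent only ever sees the request/response JSON shapes shown earlier; everything between is the control plane's concern.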
5. Autonomy levels and HITL checkpoints
To keep agentic innovation bounded and safe:
- Level 1 – Autonomous literature analysis
- Agents can search, read, extract methods, and synthesize summaries.
- HITL: human approves reading list and final survey outline.
- Level 2 – Semi-autonomous implementation & experiments
- Implementer and Experiment agents can write and run code in sandboxed environments.
- HITL checkpoints:
- approval of experiment design (datasets, metrics),
- approval of code before execution on more than a small dev subset,
- approval before storing results in “canonical” registries.
- Level 3 – Hypothesis exploration / innovation
- PI agent can propose new methods and experiments.
- Additional HITL:
- human signs off on “claims” before any external dissemination,
- human reviews any cross-system integration (e.g., connecting to real lab or production systems).
Policy engine in the control plane maps task_type + risk_level → allowed autonomy & required approvals.204205206
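That mapping is naturally a lookup table from (task_type, risk_level) to an autonomy level and required approvals, following the three levels above. Entry names are illustrative:

```python
# Hypothetical autonomy policy table; keys and approval names are examples.
AUTONOMY_POLICY = {
    ("LITERATURE_REVIEW", "low"):       {"level": 1, "approvals": ["reading_list", "survey_outline"]},
    ("IMPLEMENT_AND_RUN", "low"):       {"level": 2, "approvals": ["experiment_design", "code_review"]},
    ("HYPOTHESIS_EXPLORATION", "high"): {"level": 3, "approvals": ["claims_signoff", "integration_review"]},
}

def required_approvals(task_type, risk_level):
    """Return HITL approvals the control plane must collect, or raise."""
    policy = AUTONOMY_POLICY.get((task_type, risk_level))
    if policy is None:
        raise ValueError(f"no autonomy policy for {(task_type, risk_level)}")
    return policy["approvals"]
```

Keeping this declarative means autonomy can be widened or narrowed per task type without touching agent code.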
6. Using logging + token economics to evolve agents
Your existing logging/token analytics layer is the feedback mechanism:
- Track for each task_type and workflow:
- tokens consumed,
- latency,
- success/quality (e.g., human rating, benchmark scores),
- degree of autonomy (how many HITL interventions).
- Regularly compute:
- tokens per successful survey / experiment / method proposal,
- agent-level contribution: which agents/workflows correlate with success or waste.
- Use this to:
- prune or refactor workflows (remove low-value steps),
- adjust model routing and prompt compression policies,
- converge toward a minimal but performant set of agent patterns.
This closes the loop: doc system → agent workflows → control plane → logging → optimization.
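The core feedback metrics above (tokens per success, intervention rate) can be computed from a flat log of task records. A minimal sketch, assuming each record carries `task_type`, `tokens`, `success`, and `hitl_interventions` fields:

```python
from collections import defaultdict

def task_metrics(logs):
    """Aggregate per-task_type token economics from raw task records."""
    agg = defaultdict(lambda: {"tokens": 0, "tasks": 0, "successes": 0, "interventions": 0})
    for rec in logs:
        a = agg[rec["task_type"]]
        a["tokens"] += rec["tokens"]
        a["tasks"] += 1
        a["successes"] += int(rec["success"])
        a["interventions"] += rec["hitl_interventions"]
    return {
        t: {"tokens_per_success": a["tokens"] / max(a["successes"], 1),
            "intervention_rate": a["interventions"] / a["tasks"]}
        for t, a in agg.items()
    }
```

Workflows whose tokens_per_success grows or whose intervention_rate stays high are the ones to prune or refactor first.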
If you can share (even roughly):
- how UDOM docs are stored (tables/fields),
- what retrieval/indexing you already have,
- how your current “agents” call tools,
I can sketch a concrete CODITECT-specific control-plane diagram (components + APIs) and a first multi-agent workflow over your CSC corpus.
Define and design graph layer on top of relational tables
A graph layer on top of relational tables is a thin abstraction that lets you query and reason about entities and relationships as a graph, while storing the data in standard SQL tables.
1. Conceptual definition
- Data stays relational: Papers, authors, methods, etc. are in normalized SQL tables.
- Graph layer: A logical view where:
  - rows in entity tables become nodes (e.g., documents, methods, datasets),
  - foreign keys / link tables become edges (e.g., citations, implements, evaluated_on).
- APIs: You expose graph-style operations (neighborhood, paths, patterns) via:
- SQL views and helper queries, or
- a dedicated graph engine (PG extensions, external graph DB) fed from the same tables.
2. Minimal relational schema
Example for your research corpus:
-- Node tables
CREATE TABLE documents (
doc_id TEXT PRIMARY KEY,
title TEXT,
year INT,
venue TEXT,
qa_score FLOAT
);
CREATE TABLE authors (
author_id SERIAL PRIMARY KEY,
name TEXT
);
CREATE TABLE methods (
method_id SERIAL PRIMARY KEY,
doc_id TEXT REFERENCES documents(doc_id),
name TEXT,
family TEXT -- e.g., 'CSC', 'CMP', 'ADMM-CSC'
);
CREATE TABLE datasets (
dataset_id SERIAL PRIMARY KEY,
name TEXT
);
CREATE TABLE concepts (
concept_id SERIAL PRIMARY KEY,
name TEXT
);
-- Edge tables
CREATE TABLE document_authors (
doc_id TEXT REFERENCES documents(doc_id),
author_id INT REFERENCES authors(author_id),
PRIMARY KEY (doc_id, author_id)
);
CREATE TABLE citations (
citing_doc_id TEXT REFERENCES documents(doc_id),
cited_doc_id TEXT REFERENCES documents(doc_id),
PRIMARY KEY (citing_doc_id, cited_doc_id)
);
CREATE TABLE method_datasets (
method_id INT REFERENCES methods(method_id),
dataset_id INT REFERENCES datasets(dataset_id),
PRIMARY KEY (method_id, dataset_id)
);
CREATE TABLE method_concepts (
method_id INT REFERENCES methods(method_id),
concept_id INT REFERENCES concepts(concept_id),
relation_type TEXT, -- e.g., 'ASSUMES', 'USES'
PRIMARY KEY (method_id, concept_id, relation_type)
);
CREATE TABLE method_relations (
src_method_id INT REFERENCES methods(method_id),
dst_method_id INT REFERENCES methods(method_id),
relation_type TEXT, -- e.g., 'IMPROVES_ON', 'EXTENDS'
PRIMARY KEY (src_method_id, dst_method_id, relation_type)
);
This is pure SQL but encodes a graph: documents, methods, datasets, concepts, authors are nodes; the *_relations tables are edges.
3. Graph views and helper queries
You can define views that present this as node/edge tables for your agents.
3.1. Generic node/edge views
CREATE VIEW graph_nodes AS
SELECT 'Document' AS node_type, doc_id::TEXT AS node_id, title AS label
FROM documents
UNION ALL
SELECT 'Author', author_id::TEXT, name
FROM authors
UNION ALL
SELECT 'Method', method_id::TEXT, name
FROM methods
UNION ALL
SELECT 'Dataset', dataset_id::TEXT, name
FROM datasets
UNION ALL
SELECT 'Concept', concept_id::TEXT, name
FROM concepts;
CREATE VIEW graph_edges AS
SELECT 'AUTHORED' AS edge_type,
a.author_id::TEXT AS src_id,
da.doc_id::TEXT AS dst_id
FROM document_authors da
JOIN authors a USING (author_id)
UNION ALL
SELECT 'CITES', citing_doc_id::TEXT, cited_doc_id::TEXT FROM citations
UNION ALL
SELECT 'IMPLEMENTS', m.method_id::TEXT, m.doc_id::TEXT FROM methods m
UNION ALL
SELECT 'EVALUATED_ON', md.method_id::TEXT, md.dataset_id::TEXT FROM method_datasets md
UNION ALL
SELECT relation_type, src_method_id::TEXT, dst_method_id::TEXT FROM method_relations
UNION ALL
SELECT relation_type, method_id::TEXT, concept_id::TEXT FROM method_concepts;
Now any “graph-ish” query can start from graph_nodes and graph_edges.
3.2. Example graph queries in SQL
- “All papers that cite a given paper and implement CSC methods”:
SELECT d.*
FROM documents d
JOIN citations c ON d.doc_id = c.citing_doc_id
JOIN methods m ON m.doc_id = d.doc_id
WHERE c.cited_doc_id = '1010.0422'
AND m.family = 'CSC';
- “Datasets used by methods that improve on CMP”:
SELECT DISTINCT ds.*
FROM methods m_base
JOIN method_relations mr
ON mr.dst_method_id = m_base.method_id
AND mr.relation_type = 'IMPROVES_ON'
JOIN methods m_new ON m_new.method_id = mr.src_method_id
JOIN method_datasets md ON md.method_id = m_new.method_id
JOIN datasets ds ON ds.dataset_id = md.dataset_id
WHERE m_base.family = 'CMP';
Agents don’t need to know the joins; you wrap these in HTTP/JSON APIs.
4. API design for the graph layer
Expose graph-oriented endpoints on top of the relational DB:
4.1. Node-centric APIs
- GET /graph/node/{type}/{id}
  - Returns node attributes and immediate neighbors.
Example response:
{
"node": { "type": "Method", "id": "42", "label": "Fast ADMM CSC" },
"neighbors": [
{ "edge_type": "IMPLEMENTS", "to": { "type": "Document", "id": "1010.0422" } },
{ "edge_type": "EVALUATED_ON", "to": { "type": "Dataset", "id": "7" } },
{ "edge_type": "ASSUMES", "to": { "type": "Concept", "id": "3", "label": "Mutual Coherence" } }
]
}
4.2. Pattern/“query” API
- POST /graph/query with a very small DSL, e.g.:
{
"pattern": "METHODS_BY_FAMILY",
"params": { "family": "CSC", "min_year": 2015 }
}
Your server translates into SQL like:
SELECT m.*, d.year
FROM methods m
JOIN documents d ON d.doc_id = m.doc_id
WHERE m.family = 'CSC' AND d.year >= 2015;
You can add more patterns (e.g., “methods improving on X”, “shortest citation path between A and B”).
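Server-side, that translation is a pattern registry: each canned pattern name maps to a parameterized SQL template plus its required parameters, so agents never compose raw SQL. A sketch under assumed names:

```python
# Hypothetical pattern registry: name -> (SQL template, required params).
PATTERNS = {
    "METHODS_BY_FAMILY": (
        "SELECT m.*, d.year FROM methods m "
        "JOIN documents d ON d.doc_id = m.doc_id "
        "WHERE m.family = %(family)s AND d.year >= %(min_year)s",
        {"family", "min_year"},
    ),
}

def compile_pattern(name, params):
    """Validate params and return (sql, params) ready for a DB driver."""
    sql, required = PATTERNS[name]
    missing = required - params.keys()
    if missing:
        raise ValueError(f"missing params: {sorted(missing)}")
    return sql, params
```

Because the SQL stays server-side and parameterized, the DSL also doubles as an injection barrier between agents and the database.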
5. Implementation options
You have two main implementation choices:
- Pure SQL + views + application-layer helpers
  - Keep everything in Postgres/BigQuery.
  - Use views like graph_nodes and graph_edges.
  - Implement path queries (e.g., 2–3 hops) with SQL joins or recursive CTEs.
  - Best when you want minimal extra infrastructure and short paths.
- Hybrid: relational source + graph engine
- Keep authoritative data in SQL.
- ETL into a graph DB (Neo4j, Memgraph, or a PG graph extension).
- Use Cypher/PGQL for complex pattern/path queries; keep writes flowing from SQL into graph via scheduled jobs or CDC.
- Useful when you want deep path queries, community detection, etc.
Given your agentic research workflow, a first version can be pure SQL with views and REST APIs. If/when you need deep graph analytics (e.g., discovery of method communities, novelty scores based on graph structure), you can bolt on a graph engine.
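The recursive-CTE option can be demonstrated end to end against an in-memory SQLite database, which supports the same `WITH RECURSIVE` construct as Postgres. The edge data is a toy example:

```python
import sqlite3

# Miniature graph_edges table plus a bounded 2-hop reachability query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE graph_edges (edge_type TEXT, src_id TEXT, dst_id TEXT);
INSERT INTO graph_edges VALUES
  ('CITES', 'd1', 'd2'),
  ('CITES', 'd2', 'd3'),
  ('CITES', 'd3', 'd4');
""")

rows = conn.execute("""
WITH RECURSIVE reach(node_id, depth) AS (
  SELECT 'd1', 0
  UNION
  SELECT e.dst_id, r.depth + 1
  FROM graph_edges e JOIN reach r ON e.src_id = r.node_id
  WHERE r.depth < 2                -- bound traversal to 2 hops
)
SELECT node_id, depth FROM reach WHERE depth > 0 ORDER BY depth
""").fetchall()
```

The `depth < 2` bound is what keeps recursive CTEs practical in the pure-SQL option; unbounded traversals are where a graph engine starts to pay off.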
6. Practical to-dos (as prompts)
To turn this into implementation, you can use:
- Given my existing documents/methods/citations tables, design the exact SQL views for graph_nodes and graph_edges, and provide example queries for 1–2 hop reasoning tasks.
- Propose a minimal REST API spec on top of the graph views that lets agents fetch neighbors, run canned patterns (e.g., methods improving on CMP), and retrieve subgraphs around a method.
- Compare a pure-Postgres graph layer using recursive CTEs vs a Neo4j mirror for my research corpus, including pros/cons and operational considerations for CODITECT.
sketch a concrete CODITECT-specific control-plane diagram (components + APIs) and a first multi-agent workflow over your CSC corpus.
Follow-ups:
- Example BigQuery SQL for semantic search on UDOM Markdown corpus
- Integrate LatteReview agents with custom semantic index APIs
- Compare LiRA workflow with LatteReview for UDOM research corpus
- Sample FastAPI endpoints for equation and figure retrieval
- TODO list for building and deploying the semantic index schema
- Given my existing documents/methods/citations tables, design the exact SQL views for graph_nodes and graph_edges, and provide example queries for 1–2 hop reasoning tasks.
- Propose a minimal REST API spec on top of the graph views that lets agents fetch neighbors, run canned patterns (e.g., methods improving on CMP), and retrieve subgraphs around a method.
- Compare a pure-Postgres graph layer using recursive CTEs vs a Neo4j mirror for my research corpus, including pros/cons and operational considerations for CODITECT.
Below is a concise CODITECT-specific design sketch plus concrete follow-up prompts.
1. CODITECT control-plane: components and APIs
Think “LLM agents as tenants; control plane as Kubernetes for research workflows”.
Core components
- Control Plane Service
- Orchestrates workflows (DAGs/state machines).
- Enforces policies (budgets, safety, model routing).
- Exposes a single tool-like API to agents.
- Research Corpus Service (UDOM Store)
  - Stores documents, sections, equations, figures, methods, citations.
  - Provides semantic search (vector index) and KG queries on top.
- Graph Layer
  - Implements graph_nodes, graph_edges over relational tables.
  - Supports 1–2 hop reasoning (citations, “improves on”, “evaluated on”).
- Experiment Service
- Schedules and runs code/experiments (Kubernetes/Vertex/Batch).
- Records configs, metrics, and artifacts.
- Logging & Token Economics
- Centralized logging of all agent and tool calls.
- Token usage metrics, success flags, cost analytics.
Control-plane API (as a tool)
Single logical tool:
// Request
{
"goal_id": "csc-md-001",
"action": "RUN_WORKFLOW",
"workflow": "CSC_LIT_REVIEW_AND_METHOD_SYNTHESIS",
"params": {
"topic": "translation-invariant convolutional sparse coding",
"focus_families": ["CMP", "ADMM-CSC", "ML-CSC"],
"year_min": 2010
},
"constraints": {
"token_budget": 200000,
"max_depth": 50,
"risk_level": "low"
}
}
// Response
{
"status": "COMPLETED",
"artifacts": {
"reading_list_doc_ids": ["1010.0422", "..."],
"methods_table": "csc_methods_translation_invariant_v1",
"summary_md_path": "s3://coditect/reports/csc_ti_cmp_vs_admm.md"
},
"usage": {
"tokens_total": 153422,
"steps_total": 187
}
}
Internally, the control plane runs the multi-agent workflow described next.
2. First multi-agent workflow over your CSC corpus
Workflow: CSC_LIT_REVIEW_AND_METHOD_SYNTHESIS
Agents and responsibilities
- PI / Planner Agent
- Input: high-level goal.
- Output: workflow plan (stages + required artifacts).
- Curator Agent
  - Calls Research Corpus APIs: /research/search, /research/sections, KG queries.
  - Output: ranked list of CSC/CMP/ADMM/ML-CSC papers (doc_ids + reasons).
- Reader / Extractor Agent
  - For each doc_id, fetches methods, equations, and assumptions via /research/sections?doc_id=...&section_type=methods, /equations?doc_id=..., /figures?doc_id=....
  - Writes to methods and method_relations tables.
- Comparator / Synthesizer Agent
  - Calls graph APIs:
    - “all methods with family in {CMP, ADMM-CSC, ML-CSC} and year ≥ 2010”
    - “relations IMPROVES_ON, ASSUMES, EVALUATED_ON”
  - Produces a structured comparison (Markdown + JSON) and suggests potential hybrid/new methods.
- Implementer Agent
- For 1–2 chosen methods, generates PyTorch code and experiment configs.
- Uses Experiment Service API to submit jobs.
- Experiment Agent
- Monitors jobs, aggregates metrics, and stores results.
  - Writes entries in experiments and experiment_results tables.
- Reviewer Agent (LiRA-like)
  - Uses corpus + methods + results to write a structured report:
    - background, method taxonomy, comparison, new method description, experiments, conclusions.
  - Ensures each claim is grounded in specific doc_id + section references.
HITL checkpoints:
- PI approves reading list and method selection before implementation.
- PI approves final report before external use.
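These checkpoints can be wired into the workflow runner as blocking gates between stages. A minimal sketch, assuming an `approve` callback standing in for the PI's decision (stage names are illustrative):

```python
# Ordered stages; True marks a HITL checkpoint that must be approved.
STAGES = [
    ("curate", False),
    ("approve_reading_list", True),    # HITL checkpoint 1
    ("implement", False),
    ("approve_report", True),          # HITL checkpoint 2
]

def run(approve):
    """approve: callback(checkpoint_name) -> bool. Returns (done_stages, status)."""
    done = []
    for name, needs_approval in STAGES:
        if needs_approval and not approve(name):
            return done, f"paused at {name}"
        done.append(name)
    return done, "completed"
```

In practice `approve` would consult a pending-approvals queue rather than a synchronous callback, so a workflow can suspend and resume across sessions.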
3. Example BigQuery SQL for semantic search on UDOM Markdown
Assume table:
udom_sections(
doc_id STRING,
section_id STRING,
section_type STRING,
text STRING,
embedding ARRAY<FLOAT64> -- precomputed
)
Generate embeddings (if needed) and vector index per BigQuery docs.216217218
Semantic search query:
DECLARE query_text STRING DEFAULT 'translation-invariant convolutional sparse coding with ADMM';
WITH query_embedding AS (
  SELECT embedding
  FROM AI.GENERATE_EMBEDDING(
    MODEL `coditect.embedding_model`,
    (SELECT query_text AS content)
  )
)
SELECT
  vs.base.doc_id,
  vs.base.section_id,
  vs.base.section_type,
  vs.base.text,
  vs.distance
FROM VECTOR_SEARCH(
  TABLE `coditect.udom_sections`,
  'embedding',
  (SELECT embedding FROM query_embedding),
  top_k => 20
) AS vs
ORDER BY vs.distance ASC
LIMIT 20;
4. SQL views: graph_nodes and graph_edges + 1–2 hop queries
Assume documents, authors, methods, datasets, citations, method_relations, method_datasets, method_concepts as earlier.
CREATE OR REPLACE VIEW graph_nodes AS
SELECT 'Document' AS node_type, doc_id::TEXT AS node_id, title AS label
FROM documents
UNION ALL
SELECT 'Author', author_id::TEXT, name FROM authors
UNION ALL
SELECT 'Method', method_id::TEXT, name FROM methods
UNION ALL
SELECT 'Dataset', dataset_id::TEXT, name FROM datasets
UNION ALL
SELECT 'Concept', concept_id::TEXT, name FROM concepts;
CREATE OR REPLACE VIEW graph_edges AS
SELECT 'AUTHORED' AS edge_type,
a.author_id::TEXT AS src_id,
da.doc_id::TEXT AS dst_id
FROM document_authors da
JOIN authors a USING (author_id)
UNION ALL
SELECT 'CITES', citing_doc_id::TEXT, cited_doc_id::TEXT FROM citations
UNION ALL
SELECT 'IMPLEMENTS', m.method_id::TEXT, m.doc_id::TEXT FROM methods m
UNION ALL
SELECT 'EVALUATED_ON', md.method_id::TEXT, md.dataset_id::TEXT FROM method_datasets md
UNION ALL
SELECT relation_type, src_method_id::TEXT, dst_method_id::TEXT FROM method_relations
UNION ALL
SELECT relation_type, method_id::TEXT, concept_id::TEXT FROM method_concepts;
1-hop example: methods that improve on CMP-family methods:
SELECT m_new.*
FROM methods m_base
JOIN method_relations mr
ON mr.dst_method_id = m_base.method_id
AND mr.relation_type = 'IMPROVES_ON'
JOIN methods m_new
ON m_new.method_id = mr.src_method_id
WHERE m_base.family = 'CMP';
2-hop example: datasets used by methods that improve on CMP:
SELECT DISTINCT ds.*
FROM methods m_base
JOIN method_relations mr
ON mr.dst_method_id = m_base.method_id
AND mr.relation_type = 'IMPROVES_ON'
JOIN methods m_new
ON m_new.method_id = mr.src_method_id
JOIN method_datasets md
ON md.method_id = m_new.method_id
JOIN datasets ds
ON ds.dataset_id = md.dataset_id
WHERE m_base.family = 'CMP';
5. Minimal REST API spec on top of graph views
Using FastAPI-style naming.
5.1. Graph endpoints
- GET /graph/nodes/{node_type}/{node_id}
  - Returns node attributes and neighbors (by querying graph_nodes and graph_edges).
- GET /graph/nodes/{node_type}/{node_id}/neighbors
  - Query params: edge_type, direction (out/in/both), max_hops (1 or 2).
- GET /graph/methods/by_family
  - Query params: family, min_year.
  - Returns methods and associated docs.
- GET /graph/methods/improving_on
  - Query params: base_family (e.g., CMP).
  - Returns methods where method_relations.relation_type = 'IMPROVES_ON'.
- GET /graph/subgraph/method/{method_id}
  - Returns local subgraph: method node + connected documents, datasets, concepts, related methods.
5.2. Equations & figures (FastAPI endpoints)
- GET /equations
  - Query params: doc_id, optional role (objective, update_rule, etc.)
- GET /figures
  - Query params: doc_id, optional type (architecture, experiment, etc.)
These are exactly what your agents use in the CSC workflow.
6. Postgres graph layer vs Neo4j mirror for CODITECT
Pure Postgres + recursive CTE/pgvector:
- Pros:
  - One system for relational data, vectors, and graph views; no ETL or sync jobs.
  - Transactional consistency with the authoritative tables; reuses existing infrastructure.
- Cons:
  - Deep traversals and graph algorithms (centrality, community detection) are clunkier and slower.
  - Less “graph-native” tooling for complex patterns.
Neo4j (or other graph DB) as mirror:
- Pros:
  - Graph-native query language (Cypher) and fast deep path traversals.
  - Built-in graph algorithms (communities, centrality) useful for corpus analytics.
- Cons:
  - Extra infrastructure to operate, plus ETL/CDC to keep the mirror in sync with SQL.
  - Two copies of the data to reconcile as schemas evolve.
Given CODITECT’s needs:
- Start with Postgres + pgvector + graph views + simple recursive CTEs.
- Add a graph DB mirror only if you hit limits on traversal-heavy analytics (e.g., citation network mining, community detection on very large graphs).
7. Focused follow-up prompts (ready to paste)
- Design the specific BigQuery AI.GENERATE_EMBEDDING + VECTOR_SEARCH SQL to build and query a semantic index over my udom_sections table (doc_id, section_type, text), optimized for CSC-related queries.217218216
- Write FastAPI route handlers for /equations and /figures that query my Postgres UDOM tables and return structured JSON suitable for a CSC Reader agent.222223
- Outline how to integrate LatteReview’s agents with my custom /research/search and /graph APIs so that systematic reviews run over my CODITECT UDOM corpus instead of generic PDFs.224225226
- Compare the LiRA and LatteReview workflows and adapt their stages into a unified CSC-focused literature-review workflow for CODITECT, with explicit inputs/outputs per stage.227228229
- Generate a concrete TODO list and milestone plan for building and deploying the CODITECT semantic index and graph layer, including schema migrations, embedding jobs, and API gateway updates.
- Design the specific BigQuery AI.GENERATE_EMBEDDING + VECTOR_SEARCH SQL to build and query a semantic index over my udom_sections table (doc_id, section_type, text), optimized for CSC-related queries.
- Write FastAPI route handlers for /equations and /figures that query my Postgres UDOM tables and return structured JSON suitable for a CSC Reader agent.
- Outline how to integrate LatteReview’s agents with my custom /research/search and /graph APIs so that systematic reviews run over my CODITECT UDOM corpus instead of generic PDFs.
- Compare the LiRA and LatteReview workflows and adapt their stages into a unified CSC-focused literature-review workflow for CODITECT, with explicit inputs/outputs per stage.
- Generate a concrete TODO list and milestone plan for building and deploying the CODITECT semantic index and graph layer, including schema migrations, embedding jobs, and API gateway updates.
- Implement the CODITECT control-plane APIs in FastAPI code
- Extend multi-agent workflow to include graph layer queries on CSC data
- Example BigQuery SQL for semantic search on CSC logs in UDOM Markdown
- Define agent roles and communication protocol for CODITECT
- SQL schema design for CSC corpus table with vector embeddings
Below is a compact, implementation-focused answer you can plug into CODITECT workstreams.
1. BigQuery: embeddings + VECTOR_SEARCH over udom_sections
Assume:
udom_sections(
doc_id STRING,
section_id STRING,
section_type STRING,
text STRING,
csc_tag BOOL, -- optional: true if CSC-related
embedding ARRAY<FLOAT64> -- to be populated
)
1.1. Populate embeddings (CSC-optimized)
You can either embed all sections or only those with csc_tag = TRUE (e.g., tagged via a classifier).
CREATE OR REPLACE TABLE coditect.udom_sections_emb AS
SELECT
doc_id,
section_id,
section_type,
text,
csc_tag,
AI.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_specialized_text`,
text
) AS embedding
FROM coditect.udom_sections
WHERE text IS NOT NULL;
(Use your chosen embedding model; name above is illustrative.)237238239
1.2. Create vector index
CREATE OR REPLACE VECTOR INDEX coditect.udom_sections_emb_ix
ON coditect.udom_sections_emb(embedding)
STORING(doc_id, section_id, section_type, text, csc_tag)
OPTIONS(
distance_type = 'COSINE',
index_type = 'IVF'
);
Check coverage as in docs.240241
1.3. Semantic search query for CSC
DECLARE query_text STRING DEFAULT
'translation-invariant convolutional sparse coding with ADMM and CMP';
WITH query_embedding AS (
  SELECT embedding
  FROM AI.GENERATE_EMBEDDING(
    MODEL `coditect.embedding_models.csc_specialized_text`,
    (SELECT query_text AS content)
  )
)
SELECT
  vs.base.doc_id,
  vs.base.section_id,
  vs.base.section_type,
  vs.base.text,
  vs.distance
FROM VECTOR_SEARCH(
  TABLE coditect.udom_sections_emb,
  'embedding',
  (SELECT embedding FROM query_embedding),
  top_k => 50,
  distance_type => 'COSINE'
) AS vs
WHERE vs.base.csc_tag = TRUE  -- optional CSC post-filter
ORDER BY vs.distance ASC
LIMIT 20;
2. FastAPI handlers for /equations and /figures
Assume Postgres tables:
equations(
eq_id SERIAL PRIMARY KEY,
doc_id TEXT,
section_id TEXT,
latex TEXT,
role TEXT, -- e.g. 'objective', 'update_rule', ...
display_index INT
);
figures(
fig_id SERIAL PRIMARY KEY,
doc_id TEXT,
section_id TEXT,
caption TEXT,
figure_type TEXT, -- e.g. 'architecture', 'experiment'
asset_url TEXT
);
2.1. Pydantic models
from fastapi import FastAPI, Query
from pydantic import BaseModel
from typing import List, Optional
import asyncpg
app = FastAPI()
class EquationOut(BaseModel):
    eq_id: int
    doc_id: str
    section_id: str
    latex: str
    role: Optional[str] = None
    display_index: Optional[int] = None

class FigureOut(BaseModel):
    fig_id: int
    doc_id: str
    section_id: str
    caption: str
    figure_type: Optional[str] = None
    asset_url: Optional[str] = None
_pool = None

async def get_pool():
    # create the pool once and reuse it across requests
    global _pool
    if _pool is None:
        _pool = await asyncpg.create_pool(dsn="postgresql://user:pass@host/db")
    return _pool
2.2. Routes
@app.get("/equations", response_model=List[EquationOut])
async def get_equations(
    doc_id: str = Query(...),
    role: Optional[str] = Query(None)
):
    pool = await get_pool()
    async with pool.acquire() as conn:
        if role:
            rows = await conn.fetch(
                """
                SELECT eq_id, doc_id, section_id, latex, role, display_index
                FROM equations
                WHERE doc_id = $1 AND role = $2
                ORDER BY display_index
                """,
                doc_id, role
            )
        else:
            rows = await conn.fetch(
                """
                SELECT eq_id, doc_id, section_id, latex, role, display_index
                FROM equations
                WHERE doc_id = $1
                ORDER BY display_index
                """,
                doc_id
            )
    return [EquationOut(**dict(r)) for r in rows]
@app.get("/figures", response_model=List[FigureOut])
async def get_figures(
    doc_id: str = Query(...),
    figure_type: Optional[str] = Query(None)
):
    pool = await get_pool()
    async with pool.acquire() as conn:
        if figure_type:
            rows = await conn.fetch(
                """
                SELECT fig_id, doc_id, section_id, caption, figure_type, asset_url
                FROM figures
                WHERE doc_id = $1 AND figure_type = $2
                ORDER BY fig_id
                """,
                doc_id, figure_type
            )
        else:
            rows = await conn.fetch(
                """
                SELECT fig_id, doc_id, section_id, caption, figure_type, asset_url
                FROM figures
                WHERE doc_id = $1
                ORDER BY fig_id
                """,
                doc_id
            )
    return [FigureOut(**dict(r)) for r in rows]
This is exactly what your CSC Reader agent needs.
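On the consumer side, a CSC Reader agent would typically fold the /equations JSON into an LLM context block. A minimal sketch; the row shape follows EquationOut above, while the [DOC ...] tag format is an assumption, not part of the API:

```python
# Format the /equations response (a list of EquationOut-shaped dicts)
# into a context block the agent can paste into a prompt.
def equations_to_context(doc_id: str, equations: list[dict]) -> str:
    lines = [f"[DOC {doc_id} / EQUATIONS]"]
    # display_index orders equations as they appear in the paper
    for eq in sorted(equations, key=lambda e: e.get("display_index") or 0):
        role = eq.get("role") or "equation"
        lines.append(f"({role}) {eq['latex']}")
    return "\n".join(lines)
```

The same pattern applies to /figures, substituting caption and asset_url for latex.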
3. Integrating LatteReview agents with CODITECT APIs
Key LatteReview pieces: Providers, Reviewer Agents, Workflows, RAG integration.
To adapt:
- Provider layer
- Implement a custom provider that hits your CODITECT LLM gateway (instead of directly calling OpenAI, etc.), so LatteReview uses your token economics and routing.
- RAG integration
  - Replace LatteReview's default retrieval with your /research/search and /graph APIs:
    - search_papers(query) -> [doc_id] calls BigQuery semantic search.
    - get_metadata(doc_id) calls your UDOM/graph endpoints (sections, methods, citations).
- Reviewer agents
  - Map LatteReview's BaseReviewer, ScoringReviewer, AbstractionReviewer to operate on doc_ids:
    - Instead of ingesting raw PDFs, they receive structured text chunks from udom_sections.
    - For CSC tasks, you can define a CSCMethodReviewer that expects structured method fields and equations.
- Workflows
  - Use LatteReview's workflow engine but plug in CODITECT-specific steps:
    - Step 1: Curator reviewer → uses /research/search.
    - Step 2: Screening reviewers → access sections via /sections and graph endpoints.
    - Step 3: Abstraction reviewers → call /equations, /figures, and store CSC method entries into CODITECT tables.
In short: keep LatteReview’s orchestration and agent abstraction, swap out its storage and retrieval with CODITECT’s semantic index and graph layer.
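As a sketch of that swap, the retrieval side can sit behind a small adapter that a LatteReview-style reviewer calls instead of its built-in PDF retrieval. The transport is injected so the class stays testable; the endpoint paths mirror the APIs above, but the method names are assumptions rather than LatteReview's actual interface:

```python
from typing import Any, Callable

class CoditectRetriever:
    """Retrieval adapter backed by the CODITECT /research and UDOM APIs."""

    def __init__(self, transport: Callable[[str, dict], Any]):
        # transport(path, params) -> parsed JSON, e.g. a thin httpx wrapper
        self._get = transport

    def search_papers(self, query: str, top_k: int = 20) -> list[str]:
        # hits BigQuery semantic search behind /research/search
        hits = self._get("/research/search", {"q": query, "top_k": top_k})
        return [h["doc_id"] for h in hits]

    def get_sections(self, doc_id: str) -> list[dict]:
        # structured udom_sections chunks instead of raw PDF text
        return self._get("/sections", {"doc_id": doc_id})
```

A reviewer agent then receives doc_ids and section chunks from this adapter, never a PDF.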
4. LiRA + LatteReview → unified CSC-focused review workflow
LiRA strengths: reliable/readable narrative, human-like review composition. LatteReview strengths: multi-agent screening, scoring, abstraction with Pydantic schemas.
Unified workflow for CSC:
- Stage 1 – Retrieval & Screening (LatteReview-style)
  - Agents: Curator + Scoring reviewers.
  - Inputs: topic, focus (CSC/CMP/ADMM/ML-CSC), year range.
  - Outputs: accepted doc_ids + reasoned scores.
- Stage 2 – Abstraction / Method Extraction (LatteReview)
  - Agents: Abstraction reviewers (CSC-specialized).
  - Inputs: accepted doc_ids; data from /sections, /equations, /figures.
  - Outputs: structured methods, equations, assumptions, datasets in CODITECT tables.
- Stage 3 – Synthesis and Outline (LiRA)
  - Agents: LiRA-like Outline and Section-Planner agents.
  - Inputs: method registry, KG, and scoring metadata.
  - Outputs: structured outline for the CSC review (sections: background, CMP vs ADMM, ML-CSC stability, applications, gaps).
- Stage 4 – Drafting and Editing (LiRA)
  - Agents: Drafting agents per section, plus Editing/Consistency agents.
  - Inputs: outline, extracted methods/equations, curated quotes from UDOM.
  - Outputs: full CSC literature review draft with references tied to doc_id + section IDs.
- Stage 5 – CSC-specific QA
  - Agents: CSC Critic agent + Citation checker.
  - Outputs: final review document ready for PI approval.
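The five stages can be wired as a simple typed pipeline in which each stage transforms a shared review state. Stage bodies below are stubs, and all names are illustrative rather than LiRA or LatteReview APIs:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ReviewState:
    """Accumulates the inputs/outputs flowing between the five stages."""
    topic: str
    accepted_docs: list[str] = field(default_factory=list)   # Stage 1 output
    methods: list[dict] = field(default_factory=list)        # Stage 2 output
    outline: list[str] = field(default_factory=list)         # Stage 3 output
    draft: str = ""                                          # Stage 4 output

Stage = Callable[[ReviewState], ReviewState]

def run_pipeline(state: ReviewState, stages: list[Stage]) -> ReviewState:
    # each stage reads the previous outputs and writes its own
    for stage in stages:
        state = stage(state)
    return state
```

Real stages would call the retrieval, abstraction, and drafting agents; the typed state makes each stage's contract explicit and testable.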
5. Concrete TODO list & milestones for semantic index + graph layer
Phase 1 – Schema & migrations (1–2 weeks)
- Finalize relational schemas:
  - udom_sections(doc_id, section_id, section_type, text, csc_tag)
  - equations, figures, documents, methods, citations, method_relations, datasets, concepts.
- Add vector columns in BigQuery (embedding ARRAY<FLOAT64>).
- Implement ETL from your current UDOM artifacts into these tables.
Phase 2 – Embeddings & vector index (1–2 weeks)
- Choose embedding model for CSC-heavy text.
- Run the embedding batch job to populate udom_sections_emb.
- Create the vector index udom_sections_emb_ix.
- Validate semantic search with representative CSC queries.
Phase 3 – Graph layer (1–2 weeks)
- Implement documents, methods, citations, method_relations, etc., in Postgres.
- Create graph_nodes and graph_edges views as above.
- Add 2–3 canned SQL patterns for common reasoning tasks (improves_on, evaluated_on, etc.).
Phase 4 – APIs (2 weeks)
- Implement FastAPI endpoints:
  - /research/search (hits BigQuery semantic search).
  - /equations, /figures.
  - /graph/node/{type}/{id}, /graph/methods/by_family, /graph/subgraph/method/{id}.
- Integrate token logging middleware.
Phase 5 – Agent integration (2–4 weeks)
- Implement CODITECT control-plane FastAPI endpoints (/control_plane/act).
- Port the LatteReview provider to the CODITECT LLM gateway.
- Wire LatteReview/LiRA-style agents to the /research and /graph APIs.
- Define the CSC-focused multi-agent workflow and test end-to-end on a CSC subset.
If you share your actual udom_sections and methods/citations schemas, I can tighten the SQL and Python code to exactly match your environment.
How to create and query a vector index on the embeddings table
Adapt this SQL for multimodal embeddings including images Optimize VECTOR_SEARCH for low latency on large udom_sections Integrate RAG generation with these vector search results Handle embedding model updates like changing to gemini-embedding-001-large
Here’s how to handle vector indexes, multimodal embeddings, performance, RAG integration, and model changes in BigQuery for your UDOM/CSC stack.
1. Create and query a vector index on embeddings
Assume table:
udom_sections_emb(
doc_id STRING,
section_id STRING,
section_type STRING,
text STRING,
embedding ARRAY<FLOAT64>,
csc_tag BOOL
)
1.1. Create IVF or TreeAH index
IVF (good general default):
CREATE OR REPLACE VECTOR INDEX udom_sections_ix
ON coditect.udom_sections_emb(embedding)
STORING (doc_id, section_id, section_type, text, csc_tag)
OPTIONS (
index_type = 'IVF',
distance_type = 'COSINE'
);
For large query batches or a very large corpus, consider TreeAH:
CREATE OR REPLACE VECTOR INDEX udom_sections_ix_treeah
ON coditect.udom_sections_emb(embedding)
STORING (doc_id, section_id, section_type, text, csc_tag)
OPTIONS (
index_type = 'TREE_AH',
distance_type = 'COSINE',
tree_ah_options = '{"normalization_type": "L2"}'
);
1.2. Query with VECTOR_SEARCH
DECLARE query_text STRING DEFAULT
'translation-invariant convolutional sparse coding with CMP and ADMM';
WITH query_embedding AS (
SELECT ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_text`,
(SELECT query_text AS content),
STRUCT(TRUE AS flatten_json_output)
)
),
search AS (
SELECT
vs.base.doc_id,
vs.base.section_id,
vs.base.section_type,
vs.base.text,
vs.distance
FROM VECTOR_SEARCH(
TABLE coditect.udom_sections_emb,
'embedding',
(SELECT embedding FROM query_embedding),
top_k => 20,
distance_type => 'COSINE'
) AS vs
-- matched rows come back in the base struct; no self-join needed
)
SELECT *
FROM search
ORDER BY distance ASC;
2. Adapt SQL for multimodal embeddings (text + images)
Assume:
- udom_sections_emb as above (text embeddings).
- udom_figures as an object table referencing images in GCS, with image embeddings:
udom_figures(
fig_id STRING,
doc_id STRING,
section_id STRING,
asset_uri STRING, -- GCS URI
caption STRING,
image_embedding ARRAY<FLOAT64>
)
2.1. Generate image embeddings
Using a BigQuery ML remote model targeting Vertex AI multimodal embedding:
CREATE OR REPLACE MODEL coditect.embedding_models.csc_multimodal
REMOTE WITH CONNECTION `vertex-us` -- your Vertex AI connection
OPTIONS (endpoint = 'multimodalembedding@001');
-- ML.GENERATE_EMBEDDING reads images through an object table; here
-- udom_figures_objects is assumed to be an object table over the same
-- GCS bucket, joined back to figure metadata by URI.
CREATE OR REPLACE TABLE coditect.udom_figures_emb AS
SELECT
f.fig_id,
f.doc_id,
f.section_id,
f.caption,
f.asset_uri,
e.ml_generate_embedding_result AS image_embedding
FROM ML.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_multimodal`,
TABLE coditect.udom_figures_objects,
STRUCT(TRUE AS flatten_json_output)
) AS e
JOIN coditect.udom_figures AS f
ON f.asset_uri = e.uri;
2.2. Create a vector index for images
CREATE OR REPLACE VECTOR INDEX udom_figures_ix
ON coditect.udom_figures_emb(image_embedding)
STORING (fig_id, doc_id, section_id, caption, asset_uri)
OPTIONS (
index_type = 'IVF',
distance_type = 'COSINE'
);
2.3. Text-to-image search
DECLARE query_text STRING DEFAULT
'convolutional neural network architecture diagram for sparse coding';
WITH query_embedding AS (
SELECT ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_multimodal`,
(SELECT query_text AS content),
STRUCT(TRUE AS flatten_json_output)
)
),
search AS (
SELECT
vs.base.fig_id,
vs.base.doc_id,
vs.base.section_id,
vs.base.caption,
vs.base.asset_uri,
vs.distance
FROM VECTOR_SEARCH(
TABLE coditect.udom_figures_emb,
'image_embedding',
(SELECT embedding AS image_embedding FROM query_embedding),
top_k => 20,
distance_type => 'COSINE'
) AS vs
)
SELECT *
FROM search
ORDER BY distance ASC;
This gives you multimodal semantic search over equations/text and figures.
3. Optimize VECTOR_SEARCH for low latency on large udom_sections
- Choose index type wisely
- IVF: good for general single-query workloads.
- TreeAH: better for large batches and very large corpora; uses ScaNN-based ANN with asymmetric hashing.
- Partitioning and pre-filtering
  - Partition udom_sections_emb by year (or cluster by domain / doc_id range) and keep the filter columns as stored columns, so each query can pre-filter to a relevant slice while still using the index:
CREATE TABLE coditect.udom_sections_emb(
doc_id STRING,
section_id STRING,
section_type STRING,
text STRING,
year INT64,
embedding ARRAY<FLOAT64>
)
PARTITION BY RANGE_BUCKET(year, GENERATE_ARRAY(1990, 2040, 1));
CREATE VECTOR INDEX udom_sections_ix_p
ON coditect.udom_sections_emb(embedding)
STORING (doc_id, section_id, year)
OPTIONS (index_type = 'TREE_AH', distance_type = 'COSINE');
-- Pre-filter on stored columns in the base-table argument; the index is still used:
-- FROM VECTOR_SEARCH(
--   (SELECT * FROM coditect.udom_sections_emb WHERE year >= 2015),
--   'embedding', ...)
- Stored columns
  - Use STORING to avoid expensive lookups in the base table: store only the fields you need (doc_id, section_id, section_type, maybe a short snippet).
- Index tuning parameters
SELECT ...
FROM VECTOR_SEARCH(
TABLE coditect.udom_sections_emb,
'embedding',
(SELECT embedding FROM query_embedding),
top_k => 20,
distance_type => 'COSINE',
options => '{"fraction_lists_to_search": 0.2}'
) AS vs
...
- Batch queries
  - When you have many query embeddings at once, pass them as a table in a single VECTOR_SEARCH call instead of issuing per-query searches; TreeAH in particular is optimized for batched lookups.
4. Integrate RAG generation with vector search results
RAG flow for CODITECT:
- Retrieve with VECTOR_SEARCH (BigQuery) as above; return top-k section_id + text and doc_id.
- Post-process:
  - Filter by csc_tag, qa_score, or section_type (e.g., methods/theory).
  - De-duplicate by doc_id / cluster close sections.
  - Optionally compress with LLMLingua before sending to the model.
- Augment prompt:
  - Build a context block with citations:
[DOC 1010.0422, METHODS] ...
[DOC 1607.02009, THEORY] ...
- Generate:
  - Send to your LLM via the CODITECT control plane, including metadata so agents can reference doc_id in their output.
You can also integrate via LangChain’s BigQuery vector store integration if you like, but you’re already comfortable operating at the SQL layer.
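The post-processing step (filter, de-duplicate, build the citation-tagged context block) can be sketched as a pure function over the retrieved rows. Field names mirror the VECTOR_SEARCH output above; the per-document cap and allowed types are illustrative defaults:

```python
def build_context(rows: list[dict], max_per_doc: int = 2,
                  allowed_types: tuple = ("paragraph", "equation", "heading")) -> str:
    """rows: dicts with doc_id, section_type, text, distance (ascending = best)."""
    per_doc: dict[str, int] = {}
    blocks = []
    # best matches first, then apply type filter and per-doc de-duplication
    for r in sorted(rows, key=lambda r: r["distance"]):
        if r["section_type"] not in allowed_types:
            continue
        n = per_doc.get(r["doc_id"], 0)
        if n >= max_per_doc:
            continue
        per_doc[r["doc_id"]] = n + 1
        blocks.append(f"[DOC {r['doc_id']}, {r['section_type'].upper()}] {r['text']}")
    return "\n".join(blocks)
```

The returned string drops straight into the prompt's Context section, with doc_id tags the agents can cite.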
5. Handling embedding model updates (e.g., switch to gemini-embedding-001-large)
When you change the embedding model, you essentially define a new semantic space. Recommended pattern:
- Add a version column
Extend your table:
ALTER TABLE coditect.udom_sections_emb
ADD COLUMN embedding_version STRING;
-- Set initial version for existing rows
UPDATE coditect.udom_sections_emb
SET embedding_version = 'csc_text_v1'
WHERE embedding_version IS NULL;
- Create new embedding column/table for v2
Option A (new column):
ALTER TABLE coditect.udom_sections_emb
ADD COLUMN embedding_v2 ARRAY<FLOAT64>;
-- Embedding generation is table-valued, so backfill via a staging
-- table plus MERGE rather than calling it inside UPDATE:
CREATE OR REPLACE TEMP TABLE emb_v2_stage AS
SELECT
section_id,
ml_generate_embedding_result AS embedding_v2
FROM ML.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_text_gemini_large`,
(
SELECT section_id, text AS content
FROM coditect.udom_sections_emb
WHERE text IS NOT NULL
),
STRUCT(TRUE AS flatten_json_output)
);
MERGE coditect.udom_sections_emb AS t
USING emb_v2_stage AS s
ON t.section_id = s.section_id
WHEN MATCHED THEN
UPDATE SET embedding_v2 = s.embedding_v2;
Create a new index on embedding_v2:
CREATE OR REPLACE VECTOR INDEX udom_sections_ix_v2
ON coditect.udom_sections_emb(embedding_v2)
STORING (doc_id, section_id, section_type, text, csc_tag)
OPTIONS (index_type='TREE_AH', distance_type='COSINE');
Option B (new table):
CREATE OR REPLACE TABLE coditect.udom_sections_emb_v2 AS
SELECT
doc_id,
section_id,
section_type,
content AS text,
csc_tag,
ml_generate_embedding_result AS embedding,
'csc_text_v2' AS embedding_version
FROM ML.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_text_gemini_large`,
(
SELECT doc_id, section_id, section_type, csc_tag, text AS content
FROM coditect.udom_sections
),
STRUCT(TRUE AS flatten_json_output)
);
- Dual-index period
- Run both v1 and v2 indices in parallel.
- Sample queries and compare recall/quality for CSC tasks.
- Update CODITECT config to switch RAG queries to *_ix_v2 when satisfied.
- Retire old embeddings
Once v2 is stable:
- Stop writing new rows to v1 or set v1 index to read-only / deprecated.
- Optionally drop v1 to reclaim storage and simplify code.
If you want, next step can be a short set of “schema migration” SQLs for introducing embedding_version and dual-index querying patterns, tuned to your existing udom_sections table.
next step can be a short set of “schema migration” SQLs for introducing embedding_version and dual-index querying patterns, tuned to your existing udom_sections table.
Here’s a minimal, concrete set of migration patterns for introducing embedding_version and dual-index querying on your existing udom_sections table in BigQuery.
1. Add embedding_version and a new embedding column
Assume current table:
udom_sections_emb(
doc_id STRING,
section_id STRING,
section_type STRING,
text STRING,
embedding ARRAY<FLOAT64> -- v1 embeddings
)
1.1. Add embedding_version and embedding_v2
ALTER TABLE coditect.udom_sections_emb
ADD COLUMN IF NOT EXISTS embedding_version STRING;
ALTER TABLE coditect.udom_sections_emb
ADD COLUMN IF NOT EXISTS embedding_v2 ARRAY<FLOAT64>;
BigQuery requires new columns to be NULLABLE; this works fine for a staged rollout.
1.2. Backfill version for existing rows (v1)
UPDATE coditect.udom_sections_emb
SET embedding_version = 'csc_text_v1'
WHERE embedding_version IS NULL;
1.3. Populate v2 embeddings
Assume new model coditect.embedding_models.csc_text_gemini_large:
-- ML.GENERATE_EMBEDDING is table-valued, so stage the new vectors and MERGE:
CREATE OR REPLACE TEMP TABLE emb_v2_stage AS
SELECT
section_id,
ml_generate_embedding_result AS embedding_v2
FROM ML.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_text_gemini_large`,
(
SELECT section_id, text AS content
FROM coditect.udom_sections_emb
WHERE embedding_v2 IS NULL
AND text IS NOT NULL
),
STRUCT(TRUE AS flatten_json_output)
);
MERGE coditect.udom_sections_emb AS t
USING emb_v2_stage AS s
ON t.section_id = s.section_id
WHEN MATCHED THEN
UPDATE SET embedding_v2 = s.embedding_v2;
(For large tables, you’ll likely do this in batches with a WHERE clause on section_id or doc_id ranges.)
2. Create dual vector indexes (v1 and v2)
2.1. Existing v1 index (if not created yet)
CREATE OR REPLACE VECTOR INDEX udom_sections_ix_v1
ON coditect.udom_sections_emb(embedding)
STORING (doc_id, section_id, section_type, text)
OPTIONS (
index_type = 'IVF',
distance_type = 'COSINE'
);
2.2. New v2 index on embedding_v2
CREATE OR REPLACE VECTOR INDEX udom_sections_ix_v2
ON coditect.udom_sections_emb(embedding_v2)
STORING (doc_id, section_id, section_type, text)
OPTIONS (
index_type = 'TREE_AH',
distance_type = 'COSINE'
);
Both indexes now coexist on the same table.
3. Dual-index query pattern
You can choose which embedding to use in your application code; in SQL, you just switch the column passed to VECTOR_SEARCH.
3.1. Query using v1 embeddings
DECLARE query_text STRING DEFAULT
'translation-invariant convolutional sparse coding with CMP';
WITH query_embedding AS (
SELECT ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_text_v1`,
(SELECT query_text AS content),
STRUCT(TRUE AS flatten_json_output)
)
),
search AS (
SELECT
vs.base.doc_id,
vs.base.section_id,
vs.base.section_type,
vs.base.text,
vs.distance
FROM VECTOR_SEARCH(
TABLE coditect.udom_sections_emb,
'embedding', -- v1 column
(SELECT embedding FROM query_embedding),
top_k => 20,
distance_type => 'COSINE'
) AS vs
)
SELECT *
FROM search
ORDER BY distance ASC;
3.2. Query using v2 embeddings
DECLARE query_text STRING DEFAULT
'translation-invariant convolutional sparse coding with CMP';
WITH query_embedding AS (
SELECT ml_generate_embedding_result AS embedding_v2
FROM ML.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_text_gemini_large`,
(SELECT query_text AS content),
STRUCT(TRUE AS flatten_json_output)
)
),
search AS (
SELECT
vs.base.doc_id,
vs.base.section_id,
vs.base.section_type,
vs.base.text,
vs.distance
FROM VECTOR_SEARCH(
TABLE coditect.udom_sections_emb,
'embedding_v2', -- v2 column
(SELECT embedding_v2 FROM query_embedding),
top_k => 20,
distance_type => 'COSINE'
) AS vs
)
SELECT *
FROM search
ORDER BY distance ASC;
Your application or control plane just toggles between 'embedding' and 'embedding_v2' plus the corresponding query-side model.
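That toggle can live in a tiny renderer: because VECTOR_SEARCH's column argument must be a literal, the application selects the column and the matching query-side model when rendering the SQL. A sketch using the model names from the examples above:

```python
# Maps an embedding version to its column and query-side model.
EMBEDDING_VERSIONS = {
    "v1": {"column": "embedding",
           "model": "coditect.embedding_models.csc_text_v1"},
    "v2": {"column": "embedding_v2",
           "model": "coditect.embedding_models.csc_text_gemini_large"},
}

def render_search_sql(version: str, top_k: int = 20) -> str:
    """Render the versioned search SQL; query_text stays a bind parameter."""
    cfg = EMBEDDING_VERSIONS[version]
    return f"""
WITH query_embedding AS (
  SELECT ml_generate_embedding_result AS {cfg['column']}
  FROM ML.GENERATE_EMBEDDING(
    MODEL `{cfg['model']}`,
    (SELECT @query_text AS content),
    STRUCT(TRUE AS flatten_json_output)
  )
)
SELECT vs.base.doc_id, vs.base.section_id, vs.base.text, vs.distance
FROM VECTOR_SEARCH(
  TABLE coditect.udom_sections_emb,
  '{cfg['column']}',
  (SELECT {cfg['column']} FROM query_embedding),
  top_k => {int(top_k)},
  distance_type => 'COSINE'
) AS vs
ORDER BY vs.distance ASC
"""
```

The control plane flips the default from "v1" to "v2" in one place once the new index passes evaluation.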
4. Optional: split into versioned tables
If you prefer cleaner separation, you can materialize v2 into a new table and gradually move over:
CREATE OR REPLACE TABLE coditect.udom_sections_emb_v2 AS
SELECT
doc_id,
section_id,
section_type,
content AS text,
ml_generate_embedding_result AS embedding,
'csc_text_v2' AS embedding_version
FROM ML.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_text_gemini_large`,
(
SELECT doc_id, section_id, section_type, text AS content
FROM coditect.udom_sections
),
STRUCT(TRUE AS flatten_json_output)
);
Index:
CREATE OR REPLACE VECTOR INDEX udom_sections_ix_v2
ON coditect.udom_sections_emb_v2(embedding)
STORING (doc_id, section_id, section_type, text)
OPTIONS (index_type='TREE_AH', distance_type='COSINE');
Then keep:
- udom_sections_emb + *_ix_v1 for v1,
- udom_sections_emb_v2 + *_ix_v2 for v2,
and route queries by version in your RAG layer.
This gives you:
- safe additive schema migrations,
- side-by-side embedding versions,
- and a clean path to switch CODITECT’s RAG stack to the new model once you’re satisfied with quality.
Example SQL for text-to-image search with multimodal embeddings
How to tune fraction_lists_to_search in VECTOR_SEARCH Compare IVF vs TreeAH index performance on large datasets Full RAG pipeline integrating VECTOR_SEARCH with BigQuery ML generation Handle embedding errors in multimodal generation with Vertex AI. Example backfill UPDATE query for embedding_version on existing rows SQL for CREATE VECTOR INDEX on new embeddings column Sample dual-index VECTOR_SEARCH query with embedding_version filter How to handle embedding model updates with version column Post-migration tests for vector search performance
Below are concise, copy-paste-ready patterns for each item.
1. Text-to-image search with multimodal embeddings (BigQuery)
Assume:
udom_figures_emb(
fig_id STRING,
doc_id STRING,
section_id STRING,
caption STRING,
asset_uri STRING, -- GCS URI
image_embedding ARRAY<FLOAT64> -- from multimodal model
)
1.1. Text → image search
DECLARE query_text STRING DEFAULT
'diagram of convolutional sparse coding network with filters and feature maps';
WITH query_embedding AS (
SELECT ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_multimodal`,
(SELECT query_text AS content),
STRUCT(TRUE AS flatten_json_output)
)
),
search AS (
SELECT
vs.base.fig_id,
vs.base.doc_id,
vs.base.section_id,
vs.base.caption,
vs.base.asset_uri,
vs.distance
FROM VECTOR_SEARCH(
TABLE coditect.udom_figures_emb,
'image_embedding',
(SELECT embedding AS image_embedding FROM query_embedding),
top_k => 20,
distance_type => 'COSINE'
) AS vs
)
SELECT *
FROM search
ORDER BY distance ASC;
2. Tuning fraction_lists_to_search in VECTOR_SEARCH
For IVF indexes, fraction_lists_to_search trades recall for latency.
Example pattern:
DECLARE query_text STRING DEFAULT 'convolutional matching pursuit algorithm';
WITH query_embedding AS (
SELECT ml_generate_embedding_result AS embedding_v2
FROM ML.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_text_v2`,
(SELECT query_text AS content),
STRUCT(TRUE AS flatten_json_output)
)
)
SELECT
vs.base.doc_id,
vs.base.section_id,
vs.base.text,
vs.distance
FROM VECTOR_SEARCH(
TABLE coditect.udom_sections_emb,
'embedding_v2',
(SELECT embedding_v2 FROM query_embedding),
top_k => 20,
distance_type => 'COSINE',
options => '{"fraction_lists_to_search": 0.1}'
) AS vs
ORDER BY vs.distance ASC;
You can empirically evaluate fraction_lists_to_search values like 0.05, 0.1, 0.2, 0.5 for recall vs latency and bake a default per workload.
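The sweep itself reduces to computing recall against a reference ranking (for example, results at fraction 1.0 or a brute-force scan). A small offline helper, assuming you have already collected the ranked IDs per fraction value:

```python
def recall_at_k(reference: list[str], candidate: list[str], k: int) -> float:
    """Fraction of the reference top-k that the candidate top-k recovers."""
    ref = set(reference[:k])
    return len(ref & set(candidate[:k])) / max(len(ref), 1)

def sweep(reference: list[str],
          results_by_fraction: dict[float, list[str]],
          k: int = 20) -> dict[float, float]:
    """Recall@k per fraction_lists_to_search value, smallest fraction first."""
    return {f: recall_at_k(reference, ids, k)
            for f, ids in sorted(results_by_fraction.items())}
```

Pick the smallest fraction whose recall stays above your target (e.g. 0.95) and set it as the workload default.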
3. IVF vs TreeAH performance on large datasets
Heuristics for CODITECT-sized corpora:
- IVF:
- Good for: single-query or low concurrency workloads, moderate corpus sizes.
- Tunable with fraction_lists_to_search.
- Lower index build complexity; good general default.
- TreeAH (ScaNN-based):
- Good for: large vector sets (10M+), batch queries, and low-latency retrieval.
- Typically higher recall at similar latency for bigger workloads.
- Slightly more complex tuning (TreeAH options), but BigQuery provides sensible defaults.
Practical plan:
- Start with IVF on udom_sections_emb while the corpus is under a few million rows.
- Benchmark TreeAH (TREE_AH index) once you hit scale or batch workloads; if latency or recall is an issue, switch.
4. Full RAG pipeline integrating VECTOR_SEARCH + BigQuery ML
High-level steps:
- Embed the query and search in BigQuery using VECTOR_SEARCH.
- Return top-k sections (text + doc_id) to your app.
- Compress/trim context (optional LLMLingua).
- Call LLM (via CODITECT control plane) with query + retrieved context.
Pseudo-SQL + pseudo-Python:
-- 1. BigQuery: retrieve top-k context
DECLARE query_text STRING DEFAULT @query;
WITH query_embedding AS (
SELECT ml_generate_embedding_result AS embedding_v2
FROM ML.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_text_v2`,
(SELECT query_text AS content),
STRUCT(TRUE AS flatten_json_output)
)
),
search AS (
SELECT
vs.base.doc_id,
vs.base.section_id,
vs.base.section_type,
vs.base.text,
vs.distance
FROM VECTOR_SEARCH(
TABLE coditect.udom_sections_emb,
'embedding_v2',
(SELECT embedding_v2 FROM query_embedding),
top_k => 10,
distance_type => 'COSINE'
) AS vs
)
SELECT * FROM search ORDER BY distance ASC;
# 2–4. App side
rows = bigquery_client.query(sql, job_config=job_config).result()
context_blocks = []
for r in rows:
    context_blocks.append(f"[DOC {r.doc_id} / {r.section_type}] {r.text}")
context_str = "\n\n".join(context_blocks)

prompt = f"""You are an expert in CSC. Answer the question using the context.

Question:
{user_query}

Context:
{context_str}
"""

response = coditect_control_plane.call_llm(
    task_type="CSC_QA",
    prompt=prompt
)
5. Handling embedding errors in multimodal generation (Vertex AI)
Common issues: missing/invalid image URIs, model failures.
Pattern:
- Add status and error-message columns to your figure embeddings table:
ALTER TABLE coditect.udom_figures_emb
ADD COLUMN IF NOT EXISTS embedding_status STRING,
ADD COLUMN IF NOT EXISTS embedding_error STRING;
- ETL script (Python/Beam/Cloud Functions) pseudocode:
for fig in figures:
    try:
        emb = generate_multimodal_embedding(fig.asset_uri)  # Vertex AI
        status = "OK"
        error = None
    except Exception as e:
        emb = None
        status = "ERROR"
        error = str(e)[:512]
    upsert_row(fig.fig_id, emb, status, error)
- In queries, filter to embedding_status = 'OK' to avoid bad rows.
6. Example backfill UPDATE for embedding_version on existing rows
As before:
ALTER TABLE coditect.udom_sections_emb
ADD COLUMN IF NOT EXISTS embedding_version STRING;
UPDATE coditect.udom_sections_emb
SET embedding_version = 'csc_text_v1'
WHERE embedding_version IS NULL;
You can do similar for multimodal tables ('csc_multimodal_v1', etc.).
7. SQL for CREATE VECTOR INDEX on new embeddings column
Assume new column embedding_v2 exists:
CREATE OR REPLACE VECTOR INDEX udom_sections_ix_v2
ON coditect.udom_sections_emb(embedding_v2)
STORING (doc_id, section_id, section_type, text)
OPTIONS (
index_type = 'TREE_AH',
distance_type = 'COSINE'
);
For multimodal images:
CREATE OR REPLACE VECTOR INDEX udom_figures_ix_v2
ON coditect.udom_figures_emb(image_embedding_v2)
STORING (fig_id, doc_id, section_id, caption, asset_uri)
OPTIONS (
index_type = 'IVF',
distance_type = 'COSINE'
);
8. Sample dual-index VECTOR_SEARCH query with version column
If you choose to store rows in a single table with embedding_version, you usually keep separate columns (embedding, embedding_v2) anyway. The version column is more for tracking and auditing.
Here’s the pattern; note that VECTOR_SEARCH’s column argument must be a string literal, so the v1/v2 switch happens when your application renders the SQL (the version value remains useful for logging and auditing):
DECLARE query_text STRING DEFAULT @query;
WITH query_embedding AS (
SELECT ml_generate_embedding_result AS embedding_v2
FROM ML.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_text_v2`, -- csc_text_v1 for the v1 variant
(SELECT query_text AS content),
STRUCT(TRUE AS flatten_json_output)
)
),
search AS (
SELECT
vs.base.doc_id,
vs.base.section_id,
vs.base.section_type,
vs.base.text,
vs.distance
FROM VECTOR_SEARCH(
TABLE coditect.udom_sections_emb,
'embedding_v2', -- 'embedding' for the v1 variant
(SELECT embedding_v2 FROM query_embedding),
top_k => 20,
distance_type => 'COSINE'
) AS vs
)
SELECT * FROM search
ORDER BY distance ASC;
In practice, you may keep two separate queries and swap them at the application level (cleaner and easier to test).
9. Handling embedding model updates with embedding_version
Summary pattern:
- Add embedding_version to track which model produced each embedding.
- For a new model:
  - Add an embedding_v2 column or a new table.
  - Backfill embeddings and set embedding_version = 'csc_text_v2' for those rows.
- Maintain v1 and v2 indices in parallel.
- Route traffic gradually to v2; monitor performance and quality.
- Once stable, deprecate v1 indexes and columns as desired.
BigQuery schema changes are additive by default, so this fits its design well.
10. Post-migration tests for vector search performance
After embedding + index migration:
- Correctness / recall checks
- For a set of canonical CSC queries, compare top-10 results from v1 vs v2.
- Manually inspect or compute overlap/semantic similarity.
- Latency benchmarks
  - Time representative queries against the v1 and v2 indexes (cold and warm) and record p50/p95 per index type.
- Cost checks
- Compare query slot usage / cost between index types and versions.
- End-to-end RAG tests
- Run fixed prompts through your full RAG stack with v1 vs v2.
- Evaluate answer quality (automatic metrics + manual rating) and token usage.
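For the correctness check, a small helper that scores top-k agreement between v1 and v2 rankings per canonical query makes regressions easy to spot; a minimal sketch:

```python
def overlap_at_k(v1_ids: list[str], v2_ids: list[str], k: int = 10) -> float:
    """Jaccard overlap of the top-k result sets from the two indexes."""
    a, b = set(v1_ids[:k]), set(v2_ids[:k])
    if not a and not b:
        return 1.0  # both empty: trivially in agreement
    return len(a & b) / max(len(a | b), 1)

def overlap_report(results: dict[str, tuple[list[str], list[str]]],
                   k: int = 10) -> dict[str, float]:
    """results: query -> (v1_ranked_ids, v2_ranked_ids)."""
    return {q: overlap_at_k(v1, v2, k) for q, (v1, v2) in results.items()}
```

Low overlap is not automatically bad (the new model may genuinely rank better), but it flags queries worth manual inspection.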
Once v2 passes these checks and shows better or equal quality + acceptable latency, flip CODITECT’s configuration to use the v2 index by default.
If you want, next we can sketch a tiny benchmark table schema and a Python harness to automatically run and log v1 vs v2 comparisons over your canonical CSC queries.
{"ts": "2026-02-09T09:03:33.514741+00:00", "doc_id": "1010.0422", "pipeline": "1.3-udom", "step": "pipeline_start", "status": "ok", "input": "../../analyze-new-artifacts/coditect-ai-research-pdfs/1010.0422-convolutional-matching-pursuit-dictionary-training-lecun.pdf", "output": "../../analyze-new-artifacts/coditect-ai-research-pdfs/1010.0422-convolutional-matching-pursuit-dictionary-training-lecun.md"}
{"ts": "2026-02-09T09:03:38.460608+00:00", "doc_id": "1010.0422", "pipeline": "1.3-udom", "step": "extract_pdf", "status": "ok", "engine": "docling", "components": 65, "assets": 11, "elapsed_s": 4.9} {"ts": "2026-02-09T09:03:40.607712+00:00", "doc_id": "1010.0422", "pipeline": "1.3-udom", "step": "extract_html", "status": "ok", "components": 70, "assets": 6, "elapsed_s": 2.1} {"ts": "2026-02-09T09:03:42.810281+00:00", "doc_id": "1010.0422", "pipeline": "1.3-udom", "step": "extract_latex", "status": "ok", "components": 28, "elapsed_s": 2.2} {"ts": "2026-02-09T09:03:42.905758+00:00", "doc_id": "1010.0422", "pipeline": "1.3-udom", "step": "mapping", "status": "ok", "components": 119, "types": {"heading": 11, "paragraph": 59, "equation": 21, "list": 3, "figure": 16, "bibliography_entry": 9}, "bibliography": 18, "assets": 17, "elapsed_s": 0.1} {"ts": "2026-02-09T09:03:42.906534+00:00", "doc_id": "1010.0422", "pipeline": "1.3-udom", "step": "assembly", "status": "ok", "chars": 18894, "lines": 325, "elapsed_s": 0.0} {"ts": "2026-02-09T09:03:42.907994+00:00", "doc_id": "1010.0422", "pipeline": "1.3-udom", "step": "qa_grade", "status": "ok", "grade": "A", "score": 0.8794117647058822, "matched": 39, "upgraded": 11} {"ts": "2026-02-09T09:03:42.908003+00:00", "doc_id": "1010.0422", "pipeline": "1.3-udom", "step": "pipeline_complete", "status": "ok", "grade": "A", "score": 0.8794117647058822, "components": 119, "images_saved": 17, "elapsed_s": 9.4} BUILD THE DATE DICTIONARY AND SQL STATEMENTS AND THEN THE LAYERS FOR THE VECTOR AND SEMANTIC SEARCH
You’ve successfully ingested 1010.0422 into UDOM; now you want the downstream data dictionary and the BigQuery vector/semantic layers. Below is a minimal, end-to-end design you can generalize beyond this paper.
1. Data dictionary: core tables for UDOM sections
Assume BigQuery dataset coditect.
1.1. Documents table
CREATE TABLE IF NOT EXISTS coditect.documents (
doc_id STRING, -- e.g. '1010.0422'
title STRING,
year INT64,
venue STRING,
qa_grade STRING, -- e.g. 'A'
qa_score FLOAT64,
PRIMARY KEY (doc_id) NOT ENFORCED
);
For this job, you’d insert:
INSERT INTO coditect.documents (doc_id, title, year, venue, qa_grade, qa_score)
VALUES ('1010.0422',
'Convolutional Matching Pursuit and Dictionary Training',
2010,
'arXiv cs.CV',
'A',
0.8794117647058822);
1.2. Sections table (udom_sections)
Populate from your UDOM assembly/mapping output (headings, paragraphs, equations, figures, bib entries):
CREATE TABLE IF NOT EXISTS coditect.udom_sections (
doc_id STRING,
section_id STRING,
section_type STRING, -- 'heading','paragraph','equation','figure','list','bibliography_entry'
section_order INT64,
text STRING, -- for headings/paragraphs/lists
latex STRING, -- for equations (optional)
role STRING, -- e.g. 'objective','update_rule' for equations
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
You already know counts from the mapping step: 11 headings, 59 paragraphs, 21 equations, 3 lists, 16 figures, 9 bibliography entries, totaling 119 components and ~18.9k chars.
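Populating the table from the pipeline output can be sketched as a small converter from mapping components to rows. The component dict shape (type/id/text/latex) is an assumption based on the component types in the log, so adapt the field names to your actual UDOM artifacts:

```python
def components_to_rows(doc_id: str, components: list[dict]) -> list[dict]:
    """Convert UDOM mapping components into udom_sections rows.

    Components without an id get a deterministic synthetic section_id
    derived from their position.
    """
    rows = []
    for order, comp in enumerate(components):
        rows.append({
            "doc_id": doc_id,
            "section_id": comp.get("id") or f"{doc_id}#c{order}",
            "section_type": comp["type"],  # 'heading', 'paragraph', 'equation', ...
            "section_order": order,
            "text": comp.get("text"),
            "latex": comp.get("latex"),    # equations only
            "role": comp.get("role"),      # equations only, if classified
        })
    return rows
```

The resulting dicts can be streamed to BigQuery with the client's insert_rows_json or loaded via a batch job.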
1.3. Figures table
CREATE TABLE IF NOT EXISTS coditect.udom_figures (
fig_id STRING,
doc_id STRING,
section_id STRING,
caption STRING,
asset_uri STRING, -- GCS path where you saved 17 images
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
2. Embeddings + vector layer for semantic search
2.1. Embedding-enriched sections table
CREATE TABLE IF NOT EXISTS coditect.udom_sections_emb AS
SELECT
doc_id,
section_id,
section_type,
section_order,
text,
CAST(NULL AS ARRAY<FLOAT64>) AS embedding, -- typed placeholder, backfilled next
'csc_text_v1' AS embedding_version
FROM coditect.udom_sections
WHERE section_type IN ('heading','paragraph','equation');
Backfill embeddings for all relevant sections using a text embedding model:
-- Stage the vectors with table-valued ML.GENERATE_EMBEDDING, then MERGE:
CREATE OR REPLACE TEMP TABLE emb_stage AS
SELECT
section_id,
ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_text_v1`,
(
SELECT section_id, text AS content
FROM coditect.udom_sections_emb
WHERE (ARRAY_LENGTH(embedding) IS NULL OR ARRAY_LENGTH(embedding) = 0)
AND text IS NOT NULL
),
STRUCT(TRUE AS flatten_json_output)
);
MERGE coditect.udom_sections_emb AS t
USING emb_stage AS s
ON t.section_id = s.section_id
WHEN MATCHED THEN
UPDATE SET embedding = s.embedding;
2.2. Create vector index
CREATE OR REPLACE VECTOR INDEX udom_sections_ix_v1
ON coditect.udom_sections_emb(embedding)
STORING (doc_id, section_id, section_type, section_order, text)
OPTIONS (
index_type = 'IVF',
distance_type = 'COSINE'
);
2.3. Semantic search query example
DECLARE query_text STRING DEFAULT
'translation-invariant convolutional matching pursuit and convolutional dictionary learning';
WITH query_embedding AS (
SELECT AI.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_text_v1`,
query_text
) AS embedding
),
search AS (
SELECT
base.doc_id,
base.section_id,
base.section_type,
base.section_order,
base.text,
vs.distance
FROM VECTOR_SEARCH(
TABLE coditect.udom_sections_emb,
'embedding',
(SELECT embedding FROM query_embedding),
top_k => 20,
distance_type => 'COSINE',
options => '{"fraction_lists_to_search": 0.1}'
) AS vs
JOIN coditect.udom_sections_emb AS base
ON vs.row_id = base.section_id
)
SELECT *
FROM search
ORDER BY distance ASC;
3. Multimodal (text + images) layer
For the 17 images saved in your pipeline:
3.1. Image embeddings table
CREATE TABLE IF NOT EXISTS coditect.udom_figures_emb AS
SELECT
fig_id,
doc_id,
section_id,
caption,
asset_uri,
AI.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_multimodal`,
STRUCT(asset_uri AS image_uri)
) AS image_embedding,
'csc_multimodal_v1' AS embedding_version
FROM coditect.udom_figures;
3.2. Image vector index
CREATE OR REPLACE VECTOR INDEX coditect.udom_figures_ix_v1
ON coditect.udom_figures_emb(image_embedding)
STORING (fig_id, doc_id, section_id, caption, asset_uri)
OPTIONS (
index_type = 'IVF',
distance_type = 'COSINE'
);
3.3. Text-to-image semantic search
DECLARE query_text STRING DEFAULT
'architecture diagram of convolutional matching pursuit and dictionary training';
WITH query_embedding AS (
SELECT AI.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_multimodal`,
STRUCT(query_text AS text)
) AS embedding
),
search AS (
SELECT
base.fig_id,
base.doc_id,
base.section_id,
base.caption,
base.asset_uri,
vs.distance
FROM VECTOR_SEARCH(
TABLE coditect.udom_figures_emb,
'image_embedding',
(SELECT embedding FROM query_embedding),
top_k => 10,
distance_type => 'COSINE'
) AS vs
JOIN coditect.udom_figures_emb AS base
ON vs.row_id = base.fig_id
)
SELECT *
FROM search
ORDER BY distance ASC;
4. Dual-index / version-ready schema (future-proofing)
When you later move to a new embedding model:
ALTER TABLE coditect.udom_sections_emb
ADD COLUMN IF NOT EXISTS embedding_v2 ARRAY<FLOAT64>;
UPDATE coditect.udom_sections_emb
SET embedding_v2 = AI.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_text_v2`,
text
)
WHERE embedding_v2 IS NULL
AND text IS NOT NULL;
Create a second index:
CREATE OR REPLACE VECTOR INDEX coditect.udom_sections_ix_v2
ON coditect.udom_sections_emb(embedding_v2)
STORING (doc_id, section_id, section_type, section_order, text)
OPTIONS (index_type='TREE_AH', distance_type='COSINE');
You can then run A/B queries against embedding vs embedding_v2 to decide when to switch your RAG layer.
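To make that A/B comparison concrete, here is a minimal Python sketch (function and variable names are illustrative, not part of your stack) that scores the overlap between the top-k result lists returned by the v1 and v2 indexes for the same query:

```python
def topk_overlap(v1_ids, v2_ids, k=20):
    """Jaccard overlap of the top-k section_ids from two rankings (1.0 = identical sets)."""
    a, b = set(v1_ids[:k]), set(v2_ids[:k])
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Example: rankings from the v1 and v2 queries for one probe question
v1 = ["sec_3", "sec_7", "sec_1", "sec_9"]
v2 = ["sec_7", "sec_3", "sec_2", "sec_9"]
print(topk_overlap(v1, v2, k=4))  # 0.6 (3 shared out of 5 distinct ids)
```

Running this over a fixed probe-query set gives you a single drift number per query; large divergence flags queries worth eyeballing before switching the RAG layer to v2.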
This gives you: (1) a concrete data dictionary for 1010.0422 and its UDOM artifacts, and (2) the exact BigQuery layers you need for vector and semantic search that your CODITECT agents can consume.
have I thought this through? suggestions?
You’ve thought this through unusually well: you have ingestion → UDOM → relational schema → graph layer → BigQuery vectors → control plane → agents → token economics, all aligned and technically coherent.
A few focused suggestions:
- Nail one vertical slice end-to-end
  Before generalizing, pick a single, narrow use case (e.g., "CSC literature review + method synthesis") and wire: ingest → UDOM → udom_sections_emb + vector index → /research/search → one multi-agent workflow → final report. That will surface practical gaps (latency, schema quirks, logging fields) faster than trying to design for all domains at once.
- Standardize schemas early
  Lock down a versioned "research schema" (documents, sections, equations, figures, methods, graph tables) and treat changes as migrations with semantic versioning. It will keep agents, control plane, and analytics from drifting as you add domains beyond CSC.
- Keep the control plane minimal at first
  Start with a simple, explicit DAG/workflow engine and a single control_plane.act tool, but avoid building a generic orchestration mega-layer prematurely. Encode 1–2 canonical workflows (CSC review, CSC experiment pipeline), then abstract.
- Instrument everything from day zero
  You already plan token economics; extend that to: task_type, workflow_id, agent_id, tool_name, model, tokens, latency, success, and a simple quality score. This will give you the data to actually optimize routing, compression, and workflows instead of flying blind.
- Prototype with Postgres graphs first
  Your graph layer design is sound; I'd keep it in Postgres (views + recursive CTEs) until you truly need deep graph analytics. A Neo4j mirror is powerful but will add operational drag you may not need initially.
- Guard agent autonomy with crisp policies
  Even for "just literature + code", define explicit per-workflow budgets (tokens, steps), allowed tools, and HITL checkpoints. It's much easier to relax constraints than to retrofit safety after agents are already doing a lot.
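The instrumentation fields listed under "Instrument everything from day zero" are worth freezing into a single record type early. A minimal sketch; the field names follow the bullet above, everything else (class name, serialization choice) is an assumption:

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class AgentCallRecord:
    # One row per agent/tool invocation, emitted to your telemetry sink
    task_type: str
    workflow_id: str
    agent_id: str
    tool_name: str
    model: str
    tokens: int
    latency_s: float
    success: bool
    quality_score: float  # simple 0-1 heuristic for now
    ts: float = 0.0

    def to_json(self) -> str:
        return json.dumps(asdict(self))

rec = AgentCallRecord("search", "csc_lit_review_v1", "curator", "http",
                      "gpt-5.1-coditect", 1200, 0.8, True, 0.9, ts=time.time())
print(rec.to_json())
```

Keeping the record flat and JSON-serializable means the same shape can land in Postgres, BigQuery, or plain log lines without translation.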
If you want, next step could be: “design the exact tables, FastAPI endpoints, and one CSC-specific workflow spec for that first vertical slice,” and we make that as close to implementation-ready as possible.
Design the exact tables, FastAPI endpoints, and one CSC-specific workflow spec for that first vertical slice, and make that as close to implementation-ready as possible.
Below is a concrete first vertical slice for CODITECT’s CSC workflow: exact tables, key FastAPI endpoints, and the CSC-specific workflow spec. You can implement this almost directly.
1. Core tables (Postgres + BigQuery)
1.1. Postgres: metadata & graph
-- documents: metadata, shared across systems
CREATE TABLE documents (
doc_id TEXT PRIMARY KEY, -- '1010.0422'
title TEXT NOT NULL,
year INT,
venue TEXT,
qa_grade TEXT,
qa_score DOUBLE PRECISION,
created_at TIMESTAMPTZ DEFAULT now()
);
-- methods: one or more methods per paper
CREATE TABLE methods (
method_id SERIAL PRIMARY KEY,
doc_id TEXT REFERENCES documents(doc_id),
name TEXT,
family TEXT, -- 'CSC','CMP','ADMM-CSC','ML-CSC','CBP',...
objective TEXT,
algorithm TEXT,
assumptions TEXT,
complexity TEXT,
created_at TIMESTAMPTZ DEFAULT now()
);
-- method relationships (graph edges)
CREATE TABLE method_relations (
src_method_id INT REFERENCES methods(method_id),
dst_method_id INT REFERENCES methods(method_id),
relation_type TEXT, -- 'IMPROVES_ON','EXTENDS','BASELINE_FOR',...
PRIMARY KEY (src_method_id, dst_method_id, relation_type)
);
-- datasets used in experiments
CREATE TABLE datasets (
dataset_id SERIAL PRIMARY KEY,
name TEXT UNIQUE,
domain TEXT
);
CREATE TABLE method_datasets (
method_id INT REFERENCES methods(method_id),
dataset_id INT REFERENCES datasets(dataset_id),
PRIMARY KEY (method_id, dataset_id)
);
1.2. BigQuery: UDOM sections + embeddings
Dataset: coditect.
-- UDOM sections extracted from your pipeline
CREATE TABLE IF NOT EXISTS coditect.udom_sections (
doc_id STRING,
section_id STRING,
section_type STRING, -- 'heading','paragraph','equation','figure',...
section_order INT64,
text STRING,
latex STRING,
role STRING, -- e.g. 'objective','update_rule' for equations
created_at TIMESTAMP
);
-- Embedding-enriched view for semantic search
CREATE TABLE IF NOT EXISTS coditect.udom_sections_emb AS
SELECT
doc_id,
section_id,
section_type,
section_order,
text,
AI.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_text_v1`,
text
) AS embedding,
'csc_text_v1' AS embedding_version
FROM coditect.udom_sections
WHERE section_type IN ('heading','paragraph','equation')
AND text IS NOT NULL;
Vector index:
CREATE OR REPLACE VECTOR INDEX coditect.udom_sections_ix_v1
ON coditect.udom_sections_emb(embedding)
STORING (doc_id, section_id, section_type, section_order, text)
OPTIONS (
index_type = 'IVF',
distance_type = 'COSINE'
);
2. FastAPI endpoints (control plane + research APIs)
Assume:
- Postgres via asyncpg / SQLAlchemy.
- BigQuery client via google-cloud-bigquery.
- This is the first thin layer; auth, logging, and token metering can be added later.
2.1. Research search (BigQuery semantic search)
# app/research.py
from fastapi import APIRouter, Query
from pydantic import BaseModel
from typing import List
from google.cloud import bigquery
router = APIRouter(prefix="/research", tags=["research"])
bq_client = bigquery.Client()
class SectionHit(BaseModel):
doc_id: str
section_id: str
section_type: str
section_order: int
text: str
distance: float
@router.get("/search", response_model=List[SectionHit])
async def semantic_search(
query: str = Query(...),
top_k: int = Query(20, ge=1, le=100)
):
sql = """
DECLARE query_text STRING DEFAULT @query;
WITH query_embedding AS (
SELECT AI.GENERATE_EMBEDDING(
MODEL `coditect.embedding_models.csc_text_v1`,
query_text
) AS embedding
),
search AS (
SELECT
base.doc_id,
base.section_id,
base.section_type,
base.section_order,
base.text,
vs.distance
FROM VECTOR_SEARCH(
TABLE coditect.udom_sections_emb,
'embedding',
(SELECT embedding FROM query_embedding),
top_k => @top_k,
distance_type => 'COSINE'
) AS vs
JOIN coditect.udom_sections_emb AS base
ON vs.row_id = base.section_id
)
SELECT * FROM search ORDER BY distance ASC;
"""
job_config = bigquery.QueryJobConfig(
query_parameters=[
bigquery.ScalarQueryParameter("query", "STRING", query),
bigquery.ScalarQueryParameter("top_k", "INT64", top_k),
]
)
rows = bq_client.query(sql, job_config=job_config).result()
return [
SectionHit(
doc_id=r["doc_id"],
section_id=r["section_id"],
section_type=r["section_type"],
section_order=r["section_order"],
text=r["text"],
distance=r["distance"],
)
for r in rows
]
2.2. Equations & figures (Postgres)
# app/udom_api.py
from fastapi import APIRouter, Query
from pydantic import BaseModel
from typing import List, Optional
import asyncpg
router = APIRouter(tags=["udom"])
_pg_pool = None

async def get_pg_pool():
    # One shared pool per process (lazily created), not one per request
    global _pg_pool
    if _pg_pool is None:
        _pg_pool = await asyncpg.create_pool(dsn="postgresql://user:pass@host/db")
    return _pg_pool
class EquationOut(BaseModel):
eq_id: int
doc_id: str
section_id: str
latex: str
role: Optional[str]
display_index: Optional[int]
class FigureOut(BaseModel):
fig_id: int
doc_id: str
section_id: str
caption: str
asset_uri: str
@router.get("/equations", response_model=List[EquationOut])
async def get_equations(doc_id: str = Query(...), role: Optional[str] = Query(None)):
pool = await get_pg_pool()
async with pool.acquire() as conn:
if role:
rows = await conn.fetch(
"""
SELECT eq_id, doc_id, section_id, latex, role, display_index
FROM equations
WHERE doc_id = $1 AND role = $2
ORDER BY display_index
""",
doc_id, role
)
else:
rows = await conn.fetch(
"""
SELECT eq_id, doc_id, section_id, latex, role, display_index
FROM equations
WHERE doc_id = $1
ORDER BY display_index
""",
doc_id
)
return [EquationOut(**dict(r)) for r in rows]
@router.get("/figures", response_model=List[FigureOut])
async def get_figures(doc_id: str = Query(...)):
pool = await get_pg_pool()
async with pool.acquire() as conn:
rows = await conn.fetch(
"""
SELECT fig_id, doc_id, section_id, caption, asset_uri
FROM udom_figures
WHERE doc_id = $1
ORDER BY fig_id
""",
doc_id
)
return [FigureOut(**dict(r)) for r in rows]
2.3. Graph methods (Postgres)
# app/graph_api.py
from fastapi import APIRouter
from pydantic import BaseModel
from typing import List
from app.udom_api import get_pg_pool  # reuse the shared Postgres pool
router = APIRouter(prefix="/graph", tags=["graph"])
class MethodOut(BaseModel):
method_id: int
doc_id: str
name: str
family: str
@router.get("/methods/by_family", response_model=List[MethodOut])
async def get_methods_by_family(family: str, min_year: int = 2000):
pool = await get_pg_pool()
async with pool.acquire() as conn:
rows = await conn.fetch(
"""
SELECT m.method_id, m.doc_id, m.name, m.family
FROM methods m
JOIN documents d ON d.doc_id = m.doc_id
WHERE m.family = $1 AND d.year >= $2
ORDER BY d.year, m.method_id
""",
family, min_year
)
return [MethodOut(**dict(r)) for r in rows]
2.4. Control-plane API (first workflow)
# app/control_plane.py
from fastapi import APIRouter
from pydantic import BaseModel
from typing import List, Optional
import httpx
router = APIRouter(prefix="/control_plane", tags=["control_plane"])
class CscWorkflowParams(BaseModel):
topic: str
focus_families: List[str] = ["CSC", "CMP", "ADMM-CSC", "ML-CSC"]
year_min: int = 2010
token_budget: int = 200_000
class CscWorkflowResult(BaseModel):
reading_list_doc_ids: List[str]
methods_table: str
summary_md_path: str
class CscWorkflowResponse(BaseModel):
status: str
artifacts: CscWorkflowResult
@router.post("/csc_lit_review", response_model=CscWorkflowResponse)
async def csc_lit_review(params: CscWorkflowParams):
# 1) Curator agent: call /research/search
async with httpx.AsyncClient() as client:
r = await client.get(
"http://coditect-backend/research/search",
params={"query": params.topic, "top_k": 50},
timeout=60.0,
)
sections = r.json()
doc_ids = sorted({s["doc_id"] for s in sections})
# TODO: apply additional filtering (focus_families via graph/methods)
# 2) Reader/Extractor agent: for now, just a stub: populate methods table via separate job
# 3) Synthesizer agent: stub - call LLM to produce Markdown summary
# In real code, call your LLM gateway here with doc_ids and selected sections
result = CscWorkflowResult(
reading_list_doc_ids=doc_ids,
methods_table="methods", # populated by another pipeline
summary_md_path="s3://coditect/reports/csc_ti_cmp_vs_admm.md"
)
return CscWorkflowResponse(status="COMPLETED", artifacts=result)
This gives you the first vertical slice: PI calls /control_plane/csc_lit_review, which hits your semantic search, collects a reading list, and (once you add the LLM call) can generate an initial CSC survey.
3. CSC-specific workflow spec (first vertical slice)
Workflow name: CSC_LIT_REVIEW_V1
Inputs
- topic: e.g. "translation-invariant convolutional sparse coding with CMP and ADMM".
- focus_families: default ['CSC', 'CMP', 'ADMM-CSC', 'ML-CSC'].
- year_min: default 2010.
- token_budget: global cap for the LLM work (enforced later).
Stages
1. Curate candidates
   - Call /research/search with topic, top_k = 50.
   - Extract doc_ids and count hit frequency per document.
   - Rank by: decreasing frequency, then doc year, then QA score (from the documents table).
   - Output: top N doc_ids (reading_list_doc_ids, default N = 20).
2. Extract CSC methods (stub now, full later)
   - For each doc_id:
     - Fetch relevant sections via BigQuery (methods/theory) or via a future /udom/sections endpoint.
     - Run an LLM extraction prompt to fill methods rows: name, family, objective, assumptions, complexity.
   - Insert into Postgres methods + method_relations.
3. Synthesize comparison
   - Query /graph/methods/by_family for the focus families and year_min.
   - Build a comparison table (programmatically in code or via LLM) covering: objectives, algorithm type (CMP vs ADMM vs FISTA), complexity, assumptions.
   - Generate a short Markdown summary.
4. Output artifacts
   - reading_list_doc_ids.
   - methods_table (logical name; physically, the Postgres methods table).
   - summary_md_path (written to S3/GCS by the synthesizer agent).
For V1, you can stub Stage 2 and Stage 3 as simple logging + placeholder summary text; once the pipeline is wired, you incrementally replace stubs with actual agent calls.
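Stage 1's ranking rule (decreasing frequency, then year, then QA score) is simple enough to pin down in code now. A sketch under the assumption that each search hit carries a doc_id and that per-document year and qa_score are available from the documents table; the helper name and tie-break directions (newer year, higher score first) are my choices:

```python
from collections import Counter

def rank_reading_list(sections, doc_meta, n=20):
    """Rank doc_ids by hit frequency desc, then year desc, then qa_score desc."""
    freq = Counter(s["doc_id"] for s in sections)

    def key(doc_id):
        meta = doc_meta.get(doc_id, {})
        return (-freq[doc_id], -(meta.get("year") or 0), -(meta.get("qa_score") or 0.0))

    return sorted(freq, key=key)[:n]

sections = [{"doc_id": "1010.0422"}, {"doc_id": "1010.0422"}, {"doc_id": "2001.0001"}]
meta = {"1010.0422": {"year": 2010, "qa_score": 0.88},
        "2001.0001": {"year": 2020, "qa_score": 0.90}}
print(rank_reading_list(sections, meta, n=2))  # ['1010.0422', '2001.0001']
```

This pure function slots into the select_docs compute state and is trivially unit-testable, unlike logic buried in the workflow engine.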
If you want to go one level deeper next, we can:
- Write the actual LLM prompts and handler function you’ll use for method extraction from UDOM sections.
- Or refine the CSC_LIT_REVIEW_V1 workflow into a YAML/JSON spec that your control plane can interpret declaratively (states, transitions, retries, budgets).
Write the actual LLM prompts and handler function you’ll use for method extraction from UDOM sections.
AND refine the CSC_LIT_REVIEW_V1 workflow into a YAML/JSON spec that your control plane can interpret declaratively (states, transitions, retries, budgets). Generate sample requests and responses for the CSC endpoints Add authentication and rate limiting to FastAPI endpoints Design Docker setup and deployment to Cloud Run Implement unit tests for tables and endpoints Extend schema for multi-signal batch processing
Below is a compact, implementation-oriented answer covering prompts, handler, workflow spec, and the extra engineering asks.
1) LLM prompts + handler for CSC method extraction
1.1. System prompt (CSC method extractor)
You are a scientific information extraction engine.
You receive:
- Metadata about a paper (doc_id, title, year).
- UDOM sections from the paper (headings, paragraphs, equations).
Your task is to extract ALL convolutional sparse coding (CSC)-related methods described in the paper, including CMP, ADMM-based CSC, ML-CSC, CBP, and variants.
For each method, extract a JSON object with fields:
- method_id: short unique identifier within this paper (e.g. "cmp_basic", "admm_csc_fast").
- name: canonical method name as used in the paper.
- family: one of ["CSC","CMP","ADMM-CSC","ML-CSC","CBP","OTHER"].
- objective: concise informal description of the optimization problem (NOT full LaTeX).
- objective_equations: list of LaTeX strings for the key objective/constraints.
- algorithm: high-level bullet-style description of the algorithm steps.
- assumptions: list of textual assumptions (sparsity, coherence, noise model, etc.).
- complexity: description of per-iteration complexity in big-O form, if stated; else "unspecified".
- datasets: list of dataset names or benchmarks used for evaluation, if any.
Output strictly valid JSON with this top-level shape:
{
"doc_id": "<doc_id>",
"methods": [ { ... }, ... ]
}
If the paper does not describe any CSC-like method, return `"methods": []`.
Do NOT include explanations outside the JSON.
1.2. User prompt template (per paper)
DOC METADATA
doc_id: {{ doc_id }}
title: {{ title }}
year: {{ year }}
UDOM SECTIONS (TRUNCATED OR SUMMARIZED IF NEEDED)
{{ sections_text }}
Extract all CSC-related methods as specified.
You’ll build sections_text from relevant udom_sections rows (methods/theory) using semantic search or simple filters.
1.3. Python handler (method extraction)
import json
from typing import List, Dict, Any
from google.cloud import bigquery
import asyncpg
import httpx
BQP = bigquery.Client()
LLM_GATEWAY_URL = "http://coditect-llm-gateway/invoke"
async def extract_methods_for_doc(doc_id: str, model: str = "gpt-5.1-coditect") -> Dict[str, Any]:
# 1) Fetch sections for this doc from BigQuery
sql = """
SELECT section_id, section_type, section_order, text, latex, role
FROM coditect.udom_sections
WHERE doc_id = @doc_id
AND section_type IN ('heading','paragraph','equation')
ORDER BY section_order
"""
job_config = bigquery.QueryJobConfig(
query_parameters=[bigquery.ScalarQueryParameter("doc_id", "STRING", doc_id)]
)
rows = BQP.query(sql, job_config=job_config).result()
# Build plain-text context (you can add trimming/compression later)
parts = []
for r in rows:
if r["section_type"] == "equation" and r["latex"]:
parts.append(f"[EQ {r['section_id']}] {r['latex']}")
elif r["text"]:
parts.append(f"[{r['section_type'].upper()} {r['section_id']}] {r['text']}")
sections_text = "\n".join(parts[:200]) # simple cap for now
# 2) Fetch doc metadata from Postgres
pg_pool = await asyncpg.create_pool(dsn="postgresql://user:pass@host/db")
async with pg_pool.acquire() as conn:
meta = await conn.fetchrow(
"SELECT title, year FROM documents WHERE doc_id = $1", doc_id
)
title = meta["title"] if meta else ""
year = meta["year"] if meta else None
# 3) Build prompts
system_prompt = """You are a scientific information extraction engine.
You receive:
- Metadata about a paper (doc_id, title, year).
- UDOM sections from the paper (headings, paragraphs, equations).
Your task is to extract ALL convolutional sparse coding (CSC)-related methods described in the paper, including CMP, ADMM-based CSC, ML-CSC, CBP, and variants.
For each method, extract a JSON object with fields:
- method_id: short unique identifier within this paper (e.g. "cmp_basic", "admm_csc_fast").
- name: canonical method name as used in the paper.
- family: one of ["CSC","CMP","ADMM-CSC","ML-CSC","CBP","OTHER"].
- objective: concise informal description of the optimization problem (NOT full LaTeX).
- objective_equations: list of LaTeX strings for the key objective/constraints.
- algorithm: high-level bullet-style description of the algorithm steps.
- assumptions: list of textual assumptions (sparsity, coherence, noise model, etc.).
- complexity: description of per-iteration complexity in big-O form, if stated; else "unspecified".
- datasets: list of dataset names or benchmarks used for evaluation, if any.
Output strictly valid JSON with this top-level shape:
{
"doc_id": "<doc_id>",
"methods": [ { ... }, ... ]
}
If the paper does not describe any CSC-like method, return "methods": [].
Do NOT include explanations outside the JSON.
"""
user_prompt = f"""DOC METADATA
doc_id: {doc_id}
title: {title}
year: {year}
UDOM SECTIONS
{sections_text}
Extract all CSC-related methods as specified.
"""
# 4) Call LLM gateway
payload = {
"model": model,
"messages": [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt},
],
"temperature": 0.1,
}
async with httpx.AsyncClient(timeout=120.0) as client:
resp = await client.post(LLM_GATEWAY_URL, json=payload)
resp.raise_for_status()
content = resp.json()["choices"][0]["message"]["content"]
data = json.loads(content) # may want try/except + JSON repair
# 5) Write to Postgres methods tables
async with pg_pool.acquire() as conn:
async with conn.transaction():
for m in data.get("methods", []):
row = await conn.fetchrow(
"""
INSERT INTO methods(doc_id, name, family, objective, algorithm, assumptions, complexity)
VALUES ($1,$2,$3,$4,$5,$6,$7)
RETURNING method_id
""",
doc_id,
m.get("name"),
m.get("family"),
m.get("objective"),
"\n".join(m.get("algorithm", [])),
"\n".join(m.get("assumptions", [])),
m.get("complexity"),
)
method_id = row["method_id"]
for ds_name in m.get("datasets", []):
ds_row = await conn.fetchrow(
"""
INSERT INTO datasets(name)
VALUES ($1)
ON CONFLICT (name) DO UPDATE SET name = EXCLUDED.name
RETURNING dataset_id
""",
ds_name
)
dataset_id = ds_row["dataset_id"]
await conn.execute(
"""
INSERT INTO method_datasets(method_id, dataset_id)
VALUES ($1, $2)
ON CONFLICT DO NOTHING
""",
method_id, dataset_id
)
return data
You can then call extract_methods_for_doc("1010.0422") inside your workflow.
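The handler's `json.loads(content)` line notes it "may want try/except + JSON repair". A minimal best-effort parser you could drop in there (the helper name is mine; it only handles the two most common failure modes, fenced output and surrounding chatter):

```python
import json
import re

def parse_llm_json(content: str) -> dict:
    """Best-effort parse of an LLM JSON reply: strip Markdown code fences,
    then json.loads; fall back to the first {...} span if that fails."""
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", content.strip())
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            return json.loads(text[start:end + 1])
        raise

print(parse_llm_json('```json\n{"doc_id": "1010.0422", "methods": []}\n```'))
# {'doc_id': '1010.0422', 'methods': []}
```

Anything this still can't parse is worth logging verbatim and retrying once with a stricter "output JSON only" reminder, rather than repairing further in code.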
2) Declarative workflow spec: CSC_LIT_REVIEW_V1
Example YAML the control plane can interpret:
name: CSC_LIT_REVIEW_V1
description: >
Retrieve CSC-related papers, extract methods, and produce a CSC methods summary.
inputs:
topic: { type: string, required: true }
focus_families:
type: list[string]
default: ["CSC", "CMP", "ADMM-CSC", "ML-CSC"]
year_min:
type: int
default: 2010
token_budget:
type: int
default: 200000
budgets:
max_chain_steps: 50
max_tokens: 200000
states:
- id: retrieve_candidates
type: task
agent: curator
tool: http
config:
method: GET
url: "http://coditect-backend/research/search"
params:
query: "{{ topic }}"
top_k: 50
outputs:
reading_sections: "$.body" # list of SectionHit
- id: select_docs
type: compute
agent: planner
logic: |
def run(context):
sections = context["reading_sections"]
doc_ids = sorted({ s["doc_id"] for s in sections })
return {"reading_list_doc_ids": doc_ids[:20]}
outputs:
reading_list_doc_ids: "$.reading_list_doc_ids"
- id: extract_methods
type: foreach
for_each: "{{ reading_list_doc_ids }}"
max_parallel: 4
item_name: doc_id
task:
type: task
agent: reader
tool: python
function: "extract_methods_for_doc" # binds to handler above
args:
doc_id: "{{ doc_id }}"
outputs:
methods_json: "$[*]" # list of per-doc results
- id: synthesize_summary
type: task
agent: synthesizer
tool: llm
config:
model: "gpt-5.1-coditect"
system_prompt: |
You are an expert in convolutional sparse coding.
You will receive structured method extractions and must write a concise
comparison of CSC/CMP/ADMM/ML-CSC methods.
user_prompt: |
METHODS DATA:
{{ methods_json }}
Write a Markdown summary comparing methods by family, objectives,
algorithms, assumptions, and complexity. Keep it under 2000 tokens.
outputs:
summary_md: "$.content"
- id: save_summary
type: task
agent: backend
tool: python
function: "save_summary_to_storage" # writes to GCS/S3, returns path
args:
content: "{{ summary_md }}"
outputs:
summary_md_path: "$.path"
- id: done
type: end
result:
reading_list_doc_ids: "{{ reading_list_doc_ids }}"
summary_md_path: "{{ summary_md_path }}"
Your control plane just needs to map type: task/foreach/compute and tool: http/llm/python to actual executors.
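That mapping can start as a plain dict dispatch before you build anything fancier. A tiny linear interpreter sketch (no foreach fan-out, retries, or budget enforcement yet; all names are illustrative):

```python
def run_workflow(spec, executors, context=None):
    """Run states in order until an 'end' state; executors are keyed by
    (state type, tool) and return dicts merged into the shared context."""
    context = dict(context or {})
    for state in spec["states"]:
        if state["type"] == "end":
            return {k: context.get(k) for k in state.get("result", {})}
        executor = executors[(state["type"], state.get("tool", "compute"))]
        context.update(executor(state, context))
    return context

spec = {"states": [
    {"id": "hello", "type": "task", "tool": "python"},
    {"id": "done", "type": "end", "result": {"greeting": "..."}},
]}
executors = {("task", "python"): lambda state, ctx: {"greeting": f"ran {state['id']}"}}
print(run_workflow(spec, executors))  # {'greeting': 'ran hello'}
```

Keeping executors as plain callables keyed by (type, tool) means the YAML spec and the engine evolve independently: adding a tool is one dict entry, not an engine change.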
3) Sample requests/responses for CSC endpoints
3.1. /research/search
Request:
GET /research/search?query=convolutional%20matching%20pursuit%20dictionary%20learning&top_k=5
Authorization: Bearer <token>
Response (200):
[
{
"doc_id": "1010.0422",
"section_id": "sec_methods_1",
"section_type": "paragraph",
"section_order": 15,
"text": "We propose convolutional matching pursuit as a translation-invariant sparse coding algorithm...",
"distance": 0.0421
},
{
"doc_id": "1010.0422",
"section_id": "sec_eq_1",
"section_type": "equation",
"section_order": 16,
"text": "\\min_x \\frac{1}{2}\\|y - \\sum_k d_k * x_k\\|_2^2 + \\lambda \\sum_k \\|x_k\\|_1",
"distance": 0.0537
}
]
3.2. /equations
Request:
GET /equations?doc_id=1010.0422&role=objective
Authorization: Bearer <token>
Response:
[
{
"eq_id": 1,
"doc_id": "1010.0422",
"section_id": "sec_eq_1",
"latex": "\\min_x \\frac{1}{2}\\|y - \\sum_k d_k * x_k\\|_2^2 + \\lambda \\sum_k \\|x_k\\|_1",
"role": "objective",
"display_index": 1
}
]
4) Auth + rate limiting for FastAPI
4.1. Simple API key auth (header-based)
from fastapi import Depends, HTTPException, Header, status
API_KEYS = {"my-csc-service-key-1", "my-csc-service-key-2"}
async def api_key_auth(x_api_key: str = Header(None)):
if x_api_key not in API_KEYS:
raise HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED,
detail="Invalid or missing API key",
)
return x_api_key
Apply to routers:
from fastapi import FastAPI
app = FastAPI()
app.include_router(research.router, dependencies=[Depends(api_key_auth)])
app.include_router(udom_api.router, dependencies=[Depends(api_key_auth)])
app.include_router(graph_api.router, dependencies=[Depends(api_key_auth)])
app.include_router(control_plane.router, dependencies=[Depends(api_key_auth)])
4.2. Rate limiting (slowapi pattern)
from slowapi import Limiter
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
from fastapi.responses import JSONResponse
limiter = Limiter(key_func=get_remote_address)
@app.exception_handler(RateLimitExceeded)
async def ratelimit_handler(request, exc):
return JSONResponse(
status_code=429,
content={"detail": "rate limit exceeded"},
)
app.state.limiter = limiter
from fastapi import Request  # slowapi requires the Request object in the endpoint signature

@app.get("/research/search")
@limiter.limit("30/minute")
async def semantic_search(request: Request, ...):
    ...
5) Docker + Cloud Run deployment (minimal)
5.1. Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY pyproject.toml poetry.lock* /app/
# Dependencies pinned inline for the first slice; switch to `pip install .`
# against pyproject.toml once the package layout is formalized
RUN pip install --no-cache-dir fastapi "uvicorn[standard]" asyncpg google-cloud-bigquery httpx slowapi
COPY app /app/app
ENV PORT=8080
EXPOSE 8080
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]
5.2. Cloud Run (rough steps)
- Build & push image to Artifact Registry:
gcloud builds submit --tag gcr.io/$PROJECT_ID/coditect-csc-api
- Deploy:
gcloud run deploy coditect-csc-api \
--image gcr.io/$PROJECT_ID/coditect-csc-api \
--platform managed \
--region us-central1 \
--allow-unauthenticated # or secure with IAP / auth
6) Unit tests (tables + endpoints)
Use pytest + httpx.AsyncClient + a test Postgres/BigQuery setup (or mocks).
6.1. Example endpoint test
import pytest
from httpx import AsyncClient
from app.main import app
@pytest.mark.asyncio
async def test_research_search_returns_200(monkeypatch):
async def fake_query(*args, **kwargs):
class Row:
def __getitem__(self, k): return getattr(self, k)
doc_id = "1010.0422"
section_id = "sec1"
section_type = "paragraph"
section_order = 1
text = "dummy"
distance = 0.1
class FakeResult:
def result(self): return [Row()]
return FakeResult()
# monkeypatch BigQuery client here if needed
async with AsyncClient(app=app, base_url="http://test") as ac:
r = await ac.get("/research/search", params={"query": "test", "top_k": 1}, headers={"X-API-Key": "my-csc-service-key-1"})
assert r.status_code == 200
body = r.json()
assert len(body) == 1
assert body[0]["doc_id"] == "1010.0422"
You can similarly unit-test extract_methods_for_doc using a fake LLM gateway and in-memory Postgres (or dockerized test DB).
7) Extend schema for multi-signal batch processing
If you later process multiple “signals” (e.g., multiple MD time series or multiple related texts) per document:
CREATE TABLE signals (
signal_id SERIAL PRIMARY KEY,
doc_id TEXT REFERENCES documents(doc_id),
signal_type TEXT, -- 'time_series','audio','eeg',...
name TEXT,
sampling_rate DOUBLE PRECISION,
metadata JSONB
);
CREATE TABLE signal_segments (
segment_id SERIAL PRIMARY KEY,
signal_id INT REFERENCES signals(signal_id),
start_time DOUBLE PRECISION,
end_time DOUBLE PRECISION,
embedding DOUBLE PRECISION[], -- for 1D CSC/conv embeddings
embedding_version TEXT
);
You can then build a signal_segments_emb table in BigQuery for vector search over multi-signal data using the same patterns as udom_sections_emb.
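Before wiring embeddings, the segmentation into signal_segments rows is plain bookkeeping. A sketch of chunking a sampled 1-D signal into overlapping fixed-length windows; the function name and the window/hop defaults are illustrative:

```python
def segment_signal(n_samples, sampling_rate, window_s=1.0, hop_s=0.5):
    """Yield (start_time, end_time) pairs covering a signal of n_samples,
    matching the start_time/end_time columns of signal_segments."""
    duration = n_samples / sampling_rate
    segments, t = [], 0.0
    while t < duration:
        segments.append((round(t, 6), round(min(t + window_s, duration), 6)))
        t += hop_s
    return segments

# 2 s of signal at 8 kHz, 1 s windows with 0.5 s hop -> 4 segments
print(segment_signal(16000, 8000))
# [(0.0, 1.0), (0.5, 1.5), (1.0, 2.0), (1.5, 2.0)]
```

Each returned pair becomes one signal_segments row; the embedding for that window is then backfilled exactly like udom_sections_emb.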
If you want, next we can design the concrete prompt for the synthesizer agent that consumes methods_json and writes a CSC comparison report, plus a tighter policy spec for budgets and HITL for this workflow.