
CONFIDENTIAL — AZ1.AI Inc. — Internal Use Only

CFS-005: AI Strategy & Machine Learning Architecture


1. Executive Summary

AI is not a feature of the CODITECT Financial Suite — it is the architectural foundation. The global shortage of 300,000+ accounting professionals demands that AI handle the volume work (document processing, categorization, reconciliation, compliance checking) while humans focus on judgment work (advisory, strategy, exception handling).

Core principle: Every financial workflow has an AI co-pilot. The aim is not to replace accountants but to amplify their capacity tenfold.


2. AI Capability Matrix

| Module | AI Capability | Model/Approach | Accuracy Target | Phase |
| --- | --- | --- | --- | --- |
| Document Intelligence | OCR + Entity Extraction + Classification + Auto-Coding | LayoutLM v3 + Custom NER + Classification | >95% extraction, >90% auto-coding | 1 |
| Bank Reconciliation | Transaction Matching | Fuzzy matching + ML ranking + rules | >90% auto-match | 1 |
| General Ledger | Anomaly Detection + Auto-Categorization | Statistical (Z-score, IQR) + Isolation Forest | >85% auto-categorization | 1 |
| Accounts Payable | Invoice Processing + Duplicate Detection | Vision + NLP pipeline + similarity scoring | >92% straight-through processing | 2 |
| Accounts Receivable | Payment Prediction + Dunning Optimization | Time-series + Classification + survival analysis | >80% payment date prediction | 2 |
| Tax Engine | Jurisdiction Detection + Filing Prep | Rule engine + NLP for document analysis | >99% jurisdiction accuracy | 2 |
| FP&A | NLQ + Forecasting + Variance Explanation | Claude API + NeuralProphet + SHAP | MAPE <15% (12-month) | 2 |
| Month-End Close | Bottleneck Prediction + Auto-Scheduling | Process mining + optimization | 30% close time reduction | 3 |
| Practice Management | Client Risk Scoring + Deadline Prediction | Classification + survival analysis | >85% risk classification | 3 |
| Consolidation | Intercompany Matching + Elimination Suggestion | Pattern matching + rule engine | >95% auto-elimination | 3 |
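
To make the statistical half of the General Ledger approach concrete, here is a minimal sketch of Z-score and IQR outlier checks using only the Python standard library. The function names, the default thresholds (3.0 standard deviations, 1.5 × IQR), and the sample amounts are illustrative assumptions, not values from this document. The Isolation Forest stage is not shown.

```python
from statistics import mean, quantiles, stdev


def zscore_outliers(values, threshold=3.0):
    """Return indices of values whose absolute z-score exceeds the threshold."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    if len(values) < 2:
        return []
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical ledger amounts for one account; the last entry is suspicious.
amounts = [100, 102, 98, 101, 99, 103, 97, 100, 5000]
flagged_by_iqr = iqr_outliers(amounts)
flagged_by_z = zscore_outliers(amounts, threshold=2.5)
```

In small samples the z-score of a single extreme value is bounded by (n-1)/sqrt(n), so a threshold of 3.0 can never fire with nine observations. That is one reason to pair the Z-score check with the IQR rule.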

3. Document Intelligence Pipeline (Deep Dive)

Stage 1: Ingestion

  • Formats: PDF, JPG/PNG/TIFF (image), email attachment, XML (UBL/CII), EDI, CSV
  • Channels: Upload, email forward, API, mobile camera capture, bulk import
  • Pre-processing: De-skew, noise removal, resolution enhancement for images

Stage 2: OCR Engine

  • Primary: Tesseract 5 (open-source, multi-language; runs on CPU)
  • Secondary: EasyOCR (better on handwriting, Asian languages)
  • Fallback: Google Cloud Vision API (cloud fallback for difficult documents)
  • Languages: PT, EN, ES, FR, DE, IT, PL, JA, ZH, AR (minimum 10)
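
The primary/secondary/fallback ordering can be sketched as a confidence-gated chain. This is a sketch under assumptions: `OcrResult`, `run_ocr_chain`, and the 0.85 minimum confidence are hypothetical, and each engine is modeled as a plain callable rather than a real Tesseract, EasyOCR, or Cloud Vision adapter.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class OcrResult:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by the engine adapter
    engine: str


# Hypothetical adapter signature: image bytes in, OcrResult out.
OcrEngine = Callable[[bytes], OcrResult]


def run_ocr_chain(
    image: bytes, engines: list[OcrEngine], min_confidence: float = 0.85
) -> OcrResult:
    """Try engines in order and stop at the first sufficiently confident result.

    If no engine reaches min_confidence, return the best result seen.
    """
    best: Optional[OcrResult] = None
    for engine in engines:
        result = engine(image)
        if best is None or result.confidence > best.confidence:
            best = result
        if result.confidence >= min_confidence:
            return result
    if best is None:
        raise ValueError("no OCR engines configured")
    return best
```

Ordering the list from cheapest to most expensive means the cloud fallback is only called when the local engines are not confident enough.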

Stage 3: Layout Analysis

  • Model: LayoutLM v3 (Microsoft) — understands document structure
  • Outputs: Identifies headers, tables, line items, totals, tax sections
  • Training: Fine-tuned on financial document corpus (invoices, receipts, bank statements)

Stage 4: Entity Extraction

  • Amounts: Total, subtotal, tax, line item amounts, currency symbols
  • Dates: Invoice date, due date, payment date, period
  • Identifiers: Vendor/customer name, tax ID (CNPJ, EIN, VAT), invoice number
  • Line Items: Description, quantity, unit price, amount, account code
  • Currencies: Detection from symbols, codes, or context
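
As an illustration of the entity types listed above, here is a rule-based baseline using regular expressions. It is not the custom NER model named elsewhere in this document; it only shows the shape of the output. The patterns cover formatted CNPJ and EIN tax IDs, ISO dates, and amounts written as a three-letter currency code followed by a US-style number. All names and the sample text are hypothetical.

```python
import re
from decimal import Decimal

CNPJ_RE = re.compile(r"\b\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}\b")  # e.g. 12.345.678/0001-95
EIN_RE = re.compile(r"\b\d{2}-\d{7}\b")                        # e.g. 12-3456789
ISO_DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")             # e.g. 2026-01-15
AMOUNT_RE = re.compile(
    r"\b(?P<currency>[A-Z]{3})\s(?P<value>\d{1,3}(?:,\d{3})*\.\d{2})\b"
)


def extract_entities(text: str) -> dict:
    """Extract tax IDs, ISO dates, and currency-coded amounts from plain text."""
    amounts = [
        (m["currency"], Decimal(m["value"].replace(",", "")))
        for m in AMOUNT_RE.finditer(text)
    ]
    return {
        "tax_ids": {"cnpj": CNPJ_RE.findall(text), "ein": EIN_RE.findall(text)},
        "dates": ISO_DATE_RE.findall(text),
        "amounts": amounts,
    }


sample = (
    "Invoice INV-0042 dated 2026-01-15, due 2026-02-14. "
    "Vendor tax ID (CNPJ): 12.345.678/0001-95. Total: USD 1,234.50"
)
entities = extract_entities(sample)
```

A production extractor would also need locale-aware number parsing (for example `1.234,56`) and checksum validation of tax IDs, which this sketch omits.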

Stage 5: Document Classification

  • Categories: Invoice (AP/AR), receipt, bank statement, contract, tax form, payroll, expense report
  • Sub-categories: By vendor type, expense category, department
  • Confidence: High (>95%), Medium (80-95%), Low (<80%)

Stage 6: GL Account Auto-Coding

  • Method: Historical pattern matching (what accounts were used for similar transactions)
  • Features: Vendor, amount range, description keywords, line items, document type
  • Model: Gradient boosted trees (XGBoost) trained on tenant-specific history
  • Fallback: Global model for new tenants, transitioning to tenant-specific as data accumulates
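
One possible way to implement the transition from the global model to the tenant-specific model is to blend the two models' account scores with a weight that grows with the tenant's history. The linear ramp, the `ramp_examples` value of 1,000, and the function name are assumptions for illustration. The document does not specify how the transition works.

```python
def blend_account_scores(
    global_scores: dict[str, float],
    tenant_scores: dict[str, float],
    tenant_examples: int,
    ramp_examples: int = 1000,
) -> dict[str, float]:
    """Blend per-account scores from a global and a tenant-specific model.

    The tenant weight rises linearly from 0 (no history) to 1
    (ramp_examples or more labeled transactions).
    """
    weight = min(1.0, max(0.0, tenant_examples / ramp_examples))
    accounts = set(global_scores) | set(tenant_scores)
    return {
        account: (1.0 - weight) * global_scores.get(account, 0.0)
        + weight * tenant_scores.get(account, 0.0)
        for account in accounts
    }


# Hypothetical class probabilities for two GL accounts.
global_scores = {"6100-Office": 0.6, "6200-Travel": 0.4}
tenant_scores = {"6100-Office": 0.2, "6200-Travel": 0.8}
halfway = blend_account_scores(global_scores, tenant_scores, tenant_examples=500)
best_account = max(halfway, key=halfway.get)
```

With this scheme a new tenant receives pure global predictions, and the suggestion shifts toward the tenant's own coding habits as corrections accumulate.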

Stage 7: Confidence Scoring & Routing

| Confidence | Action |
| --- | --- |
| >95% | Auto-post (configurable threshold per tenant) |
| 80-95% | Review queue with AI suggestion pre-filled |
| <80% | Manual processing queue |
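
The routing rule in the table above can be sketched as a small function. The defaults mirror the table; `Route` and `route_document` are illustrative names, and the per-tenant override is modeled simply as a keyword argument.

```python
from enum import Enum


class Route(Enum):
    AUTO_POST = "auto_post"
    REVIEW = "review_queue"
    MANUAL = "manual_queue"


def route_document(
    confidence: float,
    auto_post_threshold: float = 0.95,
    review_threshold: float = 0.80,
) -> Route:
    """Map a model confidence (0.0 to 1.0) to a processing queue."""
    if confidence > auto_post_threshold:
        return Route.AUTO_POST
    if confidence >= review_threshold:
        return Route.REVIEW
    return Route.MANUAL
```

A conservative tenant could raise `auto_post_threshold` to 0.99, in which case a document scored at 0.97 would go to the review queue instead of being auto-posted.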

Stage 8: Learning Loop

  • Every correction by a human is logged
  • Monthly model retraining on correction data
  • Accuracy metrics tracked per tenant, per document type
  • Goal: model confidence rises over time, so a growing share of documents clears the auto-post threshold
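
Tracking accuracy per tenant and per document type can be as simple as aggregating the correction log. The sketch below assumes a hypothetical `ReviewEvent` record with one row per reviewed AI suggestion; the real log schema is not given in this document.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class ReviewEvent(NamedTuple):
    tenant_id: str
    doc_type: str
    corrected: bool  # True if a human changed the AI suggestion


def accuracy_by_segment(
    events: Iterable[ReviewEvent],
) -> dict[tuple[str, str], float]:
    """Share of AI suggestions accepted unchanged, per (tenant, document type)."""
    totals: dict[tuple[str, str], int] = defaultdict(int)
    accepted: dict[tuple[str, str], int] = defaultdict(int)
    for event in events:
        key = (event.tenant_id, event.doc_type)
        totals[key] += 1
        if not event.corrected:
            accepted[key] += 1
    return {key: accepted[key] / totals[key] for key in totals}


log = [
    ReviewEvent("tenant-a", "invoice", False),
    ReviewEvent("tenant-a", "invoice", False),
    ReviewEvent("tenant-a", "invoice", True),
    ReviewEvent("tenant-a", "invoice", False),
    ReviewEvent("tenant-a", "receipt", True),
]
metrics = accuracy_by_segment(log)
```

The same corrected rows would form the training set for the monthly retraining run.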

4. Natural Language Query (NLQ) Engine

Architecture

User Input: "What's my AP aging over 90 days by vendor?"
    ↓
Intent Classification (Claude API)
    → Domain: AP
    → Query type: Aging report
    → Filters: >90 days, group by vendor
    ↓
SQL Generation (Claude API + schema context)
    → SELECT vendor_name, SUM(amount) AS total_outstanding
      FROM ap_invoices WHERE days_outstanding > 90
      GROUP BY vendor_name ORDER BY total_outstanding DESC
    ↓
Safety Validation
    → Read-only? ✓
    → Row limit? ✓ (LIMIT 10000)
    → Timeout? ✓ (30s)
    → Tenant-scoped? ✓ (RLS active)
    ↓
Execution (read replica)
    ↓
Response Generation (Claude API)
    → "You have $245,000 in AP aging over 90 days across 12 vendors.
      The top 3 are: Supplier ABC ($89,000), Vendor XYZ ($52,000)..."
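
The read-only and row-limit checks in the Safety Validation step could look roughly like the following. This is a deliberately naive, keyword-based sketch with hypothetical names; a production validator would more likely use a real SQL parser. The timeout and tenant scoping are not covered here because they would normally be enforced at the database layer.

```python
import re

# Conservative keyword blocklist; may reject some harmless queries.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|create|grant|revoke|copy|into)\b",
    re.IGNORECASE,
)
TRAILING_LIMIT = re.compile(r"\blimit\s+(\d+)\s*$", re.IGNORECASE)


def validate_generated_sql(sql: str, max_rows: int = 10_000) -> str:
    """Reject non-read-only SQL and enforce a row limit on what remains."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"(select|with)\b", statement, re.IGNORECASE):
        raise ValueError("only SELECT queries are allowed")
    if FORBIDDEN.search(statement):
        raise ValueError("write or DDL keyword found")
    match = TRAILING_LIMIT.search(statement)
    if match is None:
        statement = f"{statement} LIMIT {max_rows}"
    elif int(match.group(1)) > max_rows:
        statement = f"{statement[:match.start()]}LIMIT {max_rows}"
    return statement


safe_sql = validate_generated_sql(
    "SELECT vendor_name, SUM(amount) FROM ap_invoices "
    "WHERE days_outstanding > 90 GROUP BY vendor_name"
)
```

Running the validated query under a read-only database role adds a second layer of protection in case the keyword check misses something.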

Multi-Language NLQ

  • Query in Portuguese: "Qual é meu contas a pagar vencido há mais de 90 dias?" (English: "What is my accounts payable more than 90 days overdue?")
  • SQL generation is language-agnostic (schema is English)
  • Response generated in the user's language
  • Account labels displayed in user's language (from i18n system)

5. Forecasting Engine

Model Ensemble

| Model | Strengths | Use Case |
| --- | --- | --- |
| NeuralProphet | Seasonality, trend, changepoints | Revenue, expense forecasting |
| ARIMA/SARIMAX | Statistical rigor, well-understood | Baseline comparison |
| ETS | Simple, fast, interpretable | Short-term cash flow |
| Ensemble | Combines strengths, reduces variance | Production forecasts |
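
A minimal sketch of the ensemble step, assuming a weighted average of per-period forecasts from the individual models, together with the MAPE metric used as the accuracy target. The weighting scheme and function names are assumptions, since the document does not say how the ensemble is combined. The model outputs below are made-up numbers, not fitted forecasts.

```python
from typing import Optional


def ensemble_forecast(
    forecasts: dict[str, list[float]],
    weights: Optional[dict[str, float]] = None,
) -> list[float]:
    """Weighted average of per-period forecasts (equal weights by default)."""
    if not forecasts:
        raise ValueError("at least one forecast is required")
    horizons = {len(series) for series in forecasts.values()}
    if len(horizons) != 1:
        raise ValueError("all forecasts must cover the same horizon")
    weights = weights or {name: 1.0 for name in forecasts}
    total_weight = sum(weights[name] for name in forecasts)
    horizon = horizons.pop()
    return [
        sum(weights[name] * forecasts[name][t] for name in forecasts) / total_weight
        for t in range(horizon)
    ]


def mape(actuals: list[float], predictions: list[float]) -> float:
    """Mean absolute percentage error in percent, skipping zero actuals."""
    pairs = [(a, p) for a, p in zip(actuals, predictions) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actuals to evaluate")
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


combined = ensemble_forecast(
    {"neuralprophet": [100.0, 110.0], "sarimax": [90.0, 100.0], "ets": [110.0, 120.0]}
)
error_pct = mape([100.0, 200.0], [110.0, 180.0])
```

Weights could later be set from each model's backtested MAPE so that historically better models count for more.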

Features

  • Time features: Day of week, month, quarter, fiscal period, holidays
  • Macro indicators: GDP growth, inflation rate, exchange rates, industry indices
  • Business features: Headcount, sales pipeline, contract values
  • Lagged features: Prior period actuals, year-over-year trends

Output

  • Point forecast + 80% and 95% confidence intervals
  • 3 scenarios: base (median), optimistic (p90), pessimistic (p10)
  • Monte Carlo simulation for cash flow scenarios (1,000 iterations)
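
The scenario output can be illustrated with a small Monte Carlo simulation that draws 1,000 cash flow paths and reads the p10, median, and p90 of the ending position. The normal distribution for monthly net cash flow, the parameter values, and the function name are assumptions made for this sketch.

```python
import random
from statistics import quantiles


def simulate_cash_flow(
    base_monthly_net: float,
    monthly_volatility: float,
    months: int = 12,
    iterations: int = 1000,
    seed: int = 42,
) -> dict[str, float]:
    """Simulate cumulative net cash flow and return p10 / p50 / p90 scenarios."""
    rng = random.Random(seed)  # seeded for reproducible scenarios
    endings = []
    for _ in range(iterations):
        balance = 0.0
        for _ in range(months):
            balance += rng.gauss(base_monthly_net, monthly_volatility)
        endings.append(balance)
    deciles = quantiles(endings, n=10)  # nine cut points: p10 through p90
    return {
        "pessimistic_p10": deciles[0],
        "base_p50": deciles[4],
        "optimistic_p90": deciles[8],
    }


scenarios = simulate_cash_flow(base_monthly_net=10_000.0, monthly_volatility=2_000.0)
```

In practice the draws would come from the forecast model's residual distribution rather than a fixed normal, so that the scenarios reflect observed forecast errors.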

Explainability

  • SHAP values for every forecast driver
  • Natural language explanation: "Revenue is projected to increase 12% primarily due to seasonal Q4 uplift (40% contribution) and 3 new client acquisitions (35% contribution)."
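
One way to turn driver attributions into the kind of sentence shown above is to rank drivers by absolute attribution and report each one's share of the total. The sketch assumes the attributions (for example SHAP values) have already been computed elsewhere; `explain_forecast` and the numbers are hypothetical.

```python
def explain_forecast(
    metric: str,
    change_pct: float,
    contributions: dict[str, float],
    top_n: int = 2,
) -> str:
    """Build a one-sentence explanation from per-driver attribution values."""
    direction = "increase" if change_pct >= 0 else "decrease"
    total = sum(abs(value) for value in contributions.values())
    if total == 0:
        return f"{metric} is projected to {direction} {abs(change_pct):.0f}%."
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    parts = [
        f"{name} ({abs(value) / total:.0%} contribution)"
        for name, value in ranked[:top_n]
    ]
    return (
        f"{metric} is projected to {direction} {abs(change_pct):.0f}% "
        f"primarily due to {' and '.join(parts)}."
    )


sentence = explain_forecast(
    "Revenue",
    12.0,
    {"seasonal Q4 uplift": 4.8, "3 new client acquisitions": 4.2, "other": 3.0},
)
```

A template like this keeps the numbers in the explanation traceable to the attribution values, which matters if an LLM is later used to polish the wording.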

6. LLM Strategy

Model Selection

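| Use Case | Model | Deployment | Approx. Cost |
| --- | --- | --- | --- |
| NLQ (complex reasoning) | Claude Sonnet 4 | API | $3/$15 per 1M tokens (input/output) |
| Financial analysis | Claude Sonnet 4 | API | $3/$15 per 1M tokens (input/output) |
| Document classification | Mistral 7B | Local (Ollama) | ~$0.001 per 1K tokens |
| Entity extraction | Custom NER | Local (PyTorch) | ~$0.0001 per 1K tokens |
| Anomaly explanation | Claude Haiku 4 | API | $0.25/$1.25 per 1M tokens (input/output) |
| Auto-coding suggestion | XGBoost | Local (scikit-learn API) | ~$0.00001 per 1K tokens |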

Privacy Architecture

  • Sensitive data (financial amounts, PII): Processed by local models only
  • Anonymized data (patterns, categories): May use cloud APIs
  • No training on customer data: Strict contractual commitment
  • Prompt sanitization: PII stripped before any cloud API call
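
Prompt sanitization might start from pattern-based redaction such as the sketch below, which replaces emails, formatted tax IDs, and currency amounts with placeholders before text leaves the local environment. The patterns and names are illustrative assumptions. Regexes alone will miss free-text PII such as personal names, which would need a local NER pass.

```python
import re

# Order matters: more specific patterns run before the generic amount pattern.
REDACTION_PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("CNPJ", re.compile(r"\b\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}\b")),
    ("EIN", re.compile(r"\b\d{2}-\d{7}\b")),
    ("AMOUNT", re.compile(r"(?:R\$|US\$|\$|€|£)\s?\d[\d.,]*")),
]


def sanitize_prompt(text: str) -> str:
    """Replace recognizable PII and amounts with bracketed placeholders."""
    for label, pattern in REDACTION_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


clean = sanitize_prompt(
    "Email ana@example.com about invoice for $12,500.00 from EIN 12-3456789"
)
```

Keeping a local mapping from placeholders back to the original values would allow the cloud model's response to be re-personalized after it returns.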

Cost Optimization

  • Route 80% of AI tasks to local models (low cost)
  • Reserve cloud APIs for complex reasoning tasks (20%)
  • Estimated AI cost: $0.50-1.00 per client per month at scale
  • Cache common NLQ queries and responses
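
A sketch of the NLQ cache, assuming an in-memory store keyed by tenant and normalized question text with a time-to-live. The class name, the 300-second default TTL, and the normalization rule are assumptions. A shared cache service would be the likelier choice in production.

```python
import hashlib
import time
from typing import Callable, Optional


class NlqCache:
    """Tenant-scoped cache for NLQ responses with a time-to-live."""

    def __init__(
        self, ttl_seconds: float = 300.0, clock: Callable[[], float] = time.monotonic
    ):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, str]] = {}

    @staticmethod
    def _key(tenant_id: str, question: str) -> str:
        # Normalize case and whitespace so trivially different phrasings share a key.
        normalized = " ".join(question.lower().split())
        raw = f"{tenant_id}\x00{normalized}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, tenant_id: str, question: str) -> Optional[str]:
        entry = self._entries.get(self._key(tenant_id, question))
        if entry is None:
            return None
        stored_at, response = entry
        if self._clock() - stored_at > self._ttl:
            return None  # stale: the underlying ledger data may have changed
        return response

    def put(self, tenant_id: str, question: str, response: str) -> None:
        self._entries[self._key(tenant_id, question)] = (self._clock(), response)
```

Including the tenant ID in the key is essential so that one tenant's cached answer can never be served to another. The short TTL limits how long an answer can lag behind newly posted transactions.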

7. AI Safety & Governance

| Principle | Implementation |
| --- | --- |
| Human-in-the-loop | All AI outputs are suggestions; auto-posting requires a configurable confidence threshold |
| Explainability | Every AI decision includes an explanation (SHAP, attention weights, rule trace) |
| Audit trail | All AI-assisted transactions logged: model version, confidence, explanation, human override |
| Bias detection | Monthly audit of auto-categorization patterns for systematic bias |
| Model versioning | All models versioned; rollback capability within 1 hour |
| Regulatory limits | No AI for final tax filing decisions without human review |
| Error bounds | Confidence thresholds configurable per tenant; conservative by default |
| Feedback loop | Every correction feeds model improvement, but models are retrained in batches rather than in real time, to prevent manipulation |

8. AI Development Roadmap

| Phase | Timeline | Capabilities | Key Metrics |
| --- | --- | --- | --- |
| Phase 1 | Months 1-6 | Document Intelligence v1, Bank Rec AI matching, GL anomaly detection | >85% match rate, >90% OCR accuracy |
| Phase 2 | Months 7-12 | NLQ engine, Forecasting, AP automation, auto-categorization | <15% MAPE, >90% straight-through |
| Phase 3 | Months 13-18 | Full AI co-pilot, Practice management intelligence, Consolidation automation | 10x client capacity, 30% close reduction |
| Phase 4 | Months 19-24 | Autonomous agent workflows, marketplace AI models, federated learning | Autonomous reconciliation, cross-tenant model improvement |

9. AI Infrastructure

| Component | Specification | Purpose |
| --- | --- | --- |
| Inference GPU | NVIDIA L4 or A10G (GKE) | LayoutLM, local LLM inference |
| Training GPU | NVIDIA A100 (on-demand) | Monthly model retraining |
| Model Serving | vLLM (local LLMs) | High-throughput inference |
| ML Tracking | MLflow | Experiment tracking, model registry |
| Feature Store | Feast or custom PostgreSQL | Centralized feature engineering |
| Model Monitoring | Custom (Prometheus metrics) | Drift detection, accuracy degradation |
| Data Labeling | Label Studio (self-hosted) | Human-in-the-loop quality control |

Cost Model (Per Tenant/Month at Scale)

| Component | Cost |
| --- | --- |
| GPU inference (shared) | $0.15 |
| Cloud LLM API (Claude) | $0.30 |
| Storage (models, embeddings) | $0.05 |
| Total AI cost/tenant | $0.50 |

At $65 average revenue per client (ARPC), the $0.50 AI cost is under 1% of revenue (about 0.8%), which makes for highly favorable unit economics.
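
The arithmetic behind this claim can be checked directly. The component costs and the $65 ARPC come from this section; the variable names are illustrative.

```python
from decimal import Decimal

# Per-tenant monthly AI cost components from the cost model above.
components = {
    "gpu_inference_shared": Decimal("0.15"),
    "cloud_llm_api": Decimal("0.30"),
    "storage": Decimal("0.05"),
}
arpc = Decimal("65")  # average revenue per client, per month

total_cost = sum(components.values(), Decimal("0"))
cost_share = total_cost / arpc  # fraction of revenue spent on AI

print(f"Total AI cost per tenant: ${total_cost}")
print(f"Share of revenue: {cost_share:.2%}")
```

At the upper end of the $0.50-1.00 estimate given under Cost Optimization, the share would be about 1.5% of revenue, so the "under 1%" figure applies to the $0.50 case.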


10. Competitive AI Analysis

| Competitor | AI Investment | Capabilities | Architecture |
| --- | --- | --- | --- |
| QuickBooks | $100M+ (Intuit AI) | Basic categorization, receipt scanning | Bolt-on to legacy |
| Xero | $50M+ | Basic bank rec suggestions, invoice reminders | Feature-level AI |
| Sage | $80M+ (Sage Copilot) | NLQ on Sage data, basic automation | Copilot layer |
| Oracle NetSuite | $200M+ (Oracle AI) | Anomaly detection, basic forecasting | Enterprise bolt-on |
| SAP | $500M+ (Joule) | Generic enterprise AI assistant | Cross-product, shallow |
| Microsoft D365 | $1B+ (Copilot) | Generic Copilot across Office + D365 | Broad, not deep |
| CODITECT | Purpose-built | Deep domain AI in every workflow | AI-first architecture |

CODITECT's structural advantage: Incumbents are adding AI to 20-year-old codebases. CODITECT builds AI into the data model, the workflow engine, and the user experience from day one. Every table has embedding columns. Every workflow has an AI decision point. Every interaction generates training data.


Hal Casteel, CEO/CTO, AZ1.AI Inc.

Copyright © 2026 AZ1.AI Inc. All rights reserved. Unauthorized distribution or reproduction is strictly prohibited.