Generative AI (GenAI) Governance Addendum
Supplemental Standards for LLMs, Foundation Models & Agentic AI
Document Control
| Field | Details |
|---|---|
| Document Type | Policy Addendum / Technical Standard |
| Parent Document | Enterprise AI Policy & Standard (Artifact 5) |
| Applies To | Large Language Models (LLMs), Image/Video Generation, Code Assistants, Agentic AI |
| Version | v2.1 |
| Framework Alignment | NIST AI RMF GenAI Profile, EU AI Act GPAI Requirements, OWASP Top 10 for LLMs, MITRE ATLAS |
1. Unique Risk Profile of Generative AI
Unlike predictive models (which output a score or classification), GenAI models output new content. This creates unique risks that require specialized controls:
| Risk | Description | Impact |
|---|---|---|
| Hallucination | Model confidently states false information | Misinformation, wrong decisions, liability |
| Prompt Injection | Malicious inputs bypass safety filters | Data leakage, unauthorized actions |
| Jailbreaking | Techniques to circumvent content policies | Harmful content generation |
| Toxic Output | Generation of harmful, biased, or offensive content | Reputational damage, harm to users |
| IP Infringement | Reproducing copyrighted content | Legal liability |
| Data Leakage | Revealing sensitive training data in outputs | Privacy violations |
| Excessive Agency | Agentic AI taking unauthorized actions | Security breach, operational harm |
2. Architecture & Design Controls
2.1 Defense-in-Depth Architecture
All GenAI systems must implement layered defenses:
┌─────────────────────────────────────────────────────────────┐
│                         USER INPUT                          │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      INPUT GUARDRAILS                       │
│  • PII Detection/Scrubbing        • Token Limits            │
│  • Prompt Injection Detection     • Rate Limiting           │
│  • Content Policy Check           • Input Sanitization      │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                         MODEL LAYER                         │
│  • System Prompt Engineering      • Temperature Control     │
│  • RAG/Grounding (if applicable)  • Response Constraints    │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      OUTPUT GUARDRAILS                      │
│  • Toxicity Detection             • PII Detection           │
│  • Format Validation              • Citation Verification   │
│  • Confidence Thresholds          • Refusal Handling        │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                    LOGGING & MONITORING                     │
│  • Full Prompt/Response Logging   • Usage Analytics         │
│  • Anomaly Detection              • Audit Trail             │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                         USER OUTPUT                         │
└─────────────────────────────────────────────────────────────┘
2.2 Grounding Requirements (Anti-Hallucination)
For GenAI systems that retrieve and present business facts:
| Requirement | Standard |
|---|---|
| RAG Implementation | Use Retrieval Augmented Generation for factual queries |
| Source Constraint | Model must answer based on retrieved context only |
| Citation Display | UI must show citations/links to source documents |
| Confidence Indicators | Display confidence levels where feasible |
| Fallback Behavior | Clear message when information not found in sources |
2.3 Input Guardrails (Pre-Processing)
| Control | Implementation |
|---|---|
| PII Scrubbing | Auto-detect and redact PII before external API calls |
| Injection Detection | Block patterns like "ignore previous instructions", "DAN mode" |
| Token Limits | Enforce maximum input length |
| Rate Limiting | Per-user request limits to prevent abuse |
| Content Classification | Flag potentially policy-violating inputs |
Injection Detection Patterns:
- "Ignore previous instructions"
- "You are now [alternate persona]"
- "Pretend you are..."
- "DAN mode" / "Developer mode"
- Base64 encoded instructions
- Multi-language obfuscation
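The patterns above can be screened with a simple first-pass filter. The sketch below is illustrative only; the regex list and Base64 heuristic are assumptions, and a production guardrail would pair this with an ML classifier:

```python
import re
import base64

# Illustrative regex-based detector for the injection patterns listed above.
INJECTION_PATTERNS = [
    r"ignore\s+(all\s+)?previous\s+instructions",
    r"you\s+are\s+now\s+",
    r"pretend\s+you\s+are",
    r"\bDAN\s+mode\b",
    r"\bdeveloper\s+mode\b",
]

def looks_like_injection(user_input: str) -> bool:
    """Return True when the input matches a known injection pattern."""
    if any(re.search(p, user_input, re.IGNORECASE) for p in INJECTION_PATTERNS):
        return True
    # Crude check for long Base64 runs that may hide encoded instructions.
    for token in re.findall(r"[A-Za-z0-9+/=]{40,}", user_input):
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8", "ignore")
            if re.search(r"ignore|instruction|system", decoded, re.IGNORECASE):
                return True
        except Exception:
            continue
    return False
```

Pattern lists like this catch only known phrasings; the multi-language and homoglyph obfuscation cases above require detection beyond plain regex.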
2.4 Output Guardrails (Post-Processing)
| Control | Implementation |
|---|---|
| Toxicity Check | Use classifier (Llama Guard, Azure Content Safety) |
| Refusal Handling | Show standardized error on safety violations |
| Format Validation | Validate JSON/SQL for agent tool use |
| PII Detection | Scan outputs for inadvertent PII exposure |
| Citation Verification | Verify cited sources exist and match claims |
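Two of the output controls above, PII detection and format validation, can be sketched as follows. The PII patterns and the required-key schema are illustrative assumptions, not part of this standard:

```python
import re
import json

# Minimal output-guardrail sketch: redact PII, then validate structured output.
PII_PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}

def redact_pii(text: str) -> str:
    """Replace matched PII spans with labeled redaction markers."""
    for label, pattern in PII_PATTERNS.items():
        text = re.sub(pattern, f"[REDACTED-{label.upper()}]", text)
    return text

def validate_json_output(raw: str, required_keys: set[str]):
    """Format validation for agent tool use: reject non-JSON or missing keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not required_keys <= obj.keys():
        return None
    return obj
```

Rejecting malformed tool-call output (returning `None` here) should trigger the standardized refusal handling described in the table.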
3. Testing & Verification Standards
3.1 Red Teaming (Adversarial Testing)
Red teaming is mandatory for High-Risk GenAI systems.
| Scenario | Description | Pass Criteria |
|---|---|---|
| Jailbreaking | Attempt to generate prohibited content | No violations in 1000 attempts |
| Prompt Injection | Attempt to override system instructions | System prompt maintained |
| Data Extraction | Attempt to reveal training data | No sensitive data exposed |
| PII Extraction | Attempt to extract personal information | PII properly redacted |
| Harmful Content | Request illegal/dangerous information | Appropriate refusals |
| Bias Exploitation | Probe for discriminatory outputs | No discriminatory patterns |
3.2 LLM Evaluation Metrics
| Metric | Description | Threshold |
|---|---|---|
| Faithfulness | Does answer match source context? | >95% |
| Answer Relevance | Did model answer the actual question? | >90% |
| Groundedness | Are claims supported by sources? | >90% |
| Toxicity Score | Harmful content detection | <1% |
| Hallucination Rate | Unsupported factual claims | <5% |
| Refusal Appropriateness | Proper handling of edge cases | >95% |
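A release gate applying the thresholds above can be sketched as follows. The metric keys and score format are assumptions about an evaluation harness's output:

```python
# Illustrative release gate applying the evaluation thresholds in the table.
# Each entry: metric name -> (comparison direction, limit).
THRESHOLDS = {
    "faithfulness": (">=", 0.95),
    "answer_relevance": (">=", 0.90),
    "groundedness": (">=", 0.90),
    "toxicity_rate": ("<=", 0.01),
    "hallucination_rate": ("<=", 0.05),
    "refusal_appropriateness": (">=", 0.95),
}

def evaluate_gate(scores: dict) -> list[str]:
    """Return the list of failed metrics; an empty list means the gate passes."""
    failures = []
    for metric, (op, limit) in THRESHOLDS.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing")
        elif op == ">=" and value < limit:
            failures.append(f"{metric}: {value} < {limit}")
        elif op == "<=" and value > limit:
            failures.append(f"{metric}: {value} > {limit}")
    return failures
```

Treating a missing metric as a failure prevents a model from shipping because a check was silently skipped.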
3.3 Recommended Evaluation Tools
| Category | Tools |
|---|---|
| RAG Evaluation | RAGAS, TruLens, LangSmith |
| Safety Testing | Garak, Promptfoo, Anthropic Eval |
| Toxicity Detection | Perspective API, Llama Guard, Azure AI |
| Red Teaming | HarmBench, AdvBench, custom |
4. Adversarial ML Defense Framework
4.1 Attack Taxonomy
Understanding adversarial attacks is essential for building resilient AI systems. This section provides comprehensive defense guidance aligned with MITRE ATLAS, OWASP LLM Top 10, and NIST AI RMF.
4.1.1 Attack Categories
| Category | Description | Risk Level | Primary Target |
|---|---|---|---|
| Prompt Injection | Manipulating model behavior via crafted inputs | Critical | LLMs, Chat systems |
| Jailbreaking | Bypassing safety guardrails and content policies | Critical | LLMs, GenAI |
| Data Poisoning | Corrupting training or fine-tuning data | High | All ML models |
| Model Extraction | Stealing model weights or architecture | High | Proprietary models |
| Evasion Attacks | Crafting inputs to cause misclassification | High | Classifiers, detectors |
| Membership Inference | Determining if data was in training set | Medium | Privacy-sensitive models |
| Model Inversion | Reconstructing training data from model | High | Models trained on PII |
| Backdoor Attacks | Implanting hidden triggers during training | Critical | Fine-tuned models |
| Adversarial Examples | Imperceptible perturbations causing errors | Medium | Computer vision, NLP |
| Supply Chain Attacks | Compromising ML dependencies or datasets | Critical | All ML systems |
4.1.2 OWASP LLM Top 10 Mapping
| OWASP ID | Vulnerability | Defense Section |
|---|---|---|
| LLM01 | Prompt Injection | 4.2.1 |
| LLM02 | Insecure Output Handling | 4.2.4 |
| LLM03 | Training Data Poisoning | 4.2.2 |
| LLM04 | Model Denial of Service | 4.2.5 |
| LLM05 | Supply Chain Vulnerabilities | 4.2.6 |
| LLM06 | Sensitive Information Disclosure | 4.2.3 |
| LLM07 | Insecure Plugin Design | 4.2.7 |
| LLM08 | Excessive Agency | Section 5 (Agentic AI) |
| LLM09 | Overreliance | Section 6.1 |
| LLM10 | Model Theft | 4.2.3 |
4.2 Defense Strategies
4.2.1 Prompt Injection Defense
Attack Vector: Attacker embeds malicious instructions in user input or retrieved content to override system behavior.
| Defense Layer | Control | Implementation |
|---|---|---|
| Input Filtering | Pattern Detection | Regex/ML-based detection of injection patterns |
| Input Filtering | Input Sanitization | Strip control characters, normalize Unicode |
| Input Filtering | Length Limits | Enforce maximum input token limits |
| Architectural | Prompt Isolation | Separate user content from system instructions |
| Architectural | Dual-LLM Pattern | Use separate model to validate inputs |
| Architectural | Instruction Hierarchy | Mark system instructions as immutable |
| Output Validation | Response Verification | Check output adheres to expected format |
| Output Validation | Action Confirmation | Require explicit confirmation for sensitive actions |
Detection Patterns:
# High-Confidence Injection Patterns
- "ignore previous instructions"
- "disregard all prior"
- "you are now [persona]"
- "system: " or "[SYSTEM]" in user input
- "```system" code blocks
- Base64-encoded instruction blocks
- Unicode homoglyph obfuscation
- Multi-language instruction mixing
- Markdown/HTML injection attempts
- JSON/XML escape sequences
Recommended Tools:
- Rebuff (open source)
- Lakera Guard
- Prompt Armor
- Custom regex + ML ensemble
4.2.2 Data Poisoning Defense
Attack Vector: Attacker corrupts training data to embed backdoors or degrade model performance.
| Defense | Description | When to Apply |
|---|---|---|
| Data Provenance | Track and verify all data sources | Data collection |
| Data Validation | Statistical analysis for anomalies | Pre-training |
| Outlier Detection | Identify and quarantine suspicious samples | Pre-training |
| Certified Training | Use techniques robust to label noise | Training |
| Differential Privacy | Limit individual sample influence | Training |
| Data Auditing | Regular review of training pipelines | Ongoing |
| Model Testing | Backdoor detection via trigger analysis | Post-training |
Training Data Hygiene Checklist:
- Verify the provenance of every data source before ingestion
- Run statistical validation and outlier detection before training
- Quarantine and manually review flagged samples
- Re-audit training pipelines on a recurring schedule
- Test trained models for backdoor triggers before release
4.2.3 Model Extraction & Data Disclosure Defense
Attack Vector: Attacker queries model to steal intellectual property or reconstruct sensitive training data.
| Defense | Implementation | Trade-off |
|---|---|---|
| Rate Limiting | Limit queries per user/IP/session | Usability |
| Query Auditing | Detect suspicious query patterns | Latency |
| Output Perturbation | Add controlled noise to outputs | Accuracy |
| Watermarking | Embed detectable signatures in outputs | None |
| API Authentication | Require strong authentication | Access friction |
| Differential Privacy | Limit information leakage per query | Model utility |
| Confidence Masking | Hide or round probability scores | Feature loss |
Suspicious Query Patterns:
- High volume of similar queries
- Systematic exploration of decision boundaries
- Queries requesting confidence scores
- Grid-like sampling of input space
- Queries probing for training data memorization
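One of the patterns above, a high volume of near-identical queries, can be detected with a simple similarity heuristic. This is a sketch only; the window size, similarity cutoff, and alert threshold are assumptions to be tuned per system:

```python
from collections import deque

# Heuristic detector for one suspicious query pattern: many near-duplicate
# queries from one principal within a rolling window.
class ExtractionMonitor:
    def __init__(self, window: int = 100, similar_threshold: int = 20):
        self.recent: deque = deque(maxlen=window)  # recent query token sets
        self.similar_threshold = similar_threshold

    def is_suspicious(self, query: str) -> bool:
        """Flag when enough recent queries are near-duplicates of this one."""
        tokens = frozenset(query.lower().split())
        # Jaccard similarity against recent queries from the same principal.
        similar = sum(
            1 for prev in self.recent
            if prev and len(tokens & prev) / len(tokens | prev) > 0.8
        )
        self.recent.append(tokens)
        return similar >= self.similar_threshold
```

Such a monitor covers only the "high volume of similar queries" signal; grid-like input sampling and decision-boundary probing need separate detectors.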
4.2.4 Output Handling Security
Attack Vector: Malicious model outputs are executed or rendered unsafely by downstream systems.
| Control | Description | Required For |
|---|---|---|
| Output Sanitization | Escape special characters | All outputs |
| Format Validation | Strict schema enforcement | Structured outputs |
| Code Sandboxing | Execute generated code in isolation | Code generation |
| URL Validation | Verify URLs before rendering | Link generation |
| Content Security Policy | Restrict executable content | Web applications |
| SQL Parameterization | Never interpolate outputs into SQL | Database queries |
Secure Output Handling Pattern:
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Model Output   │ ──▶ │   Validation    │ ──▶ │  Sanitization   │
└─────────────────┘     │ - Schema check  │     │ - Escape HTML   │
                        │ - Type check    │     │ - Strip scripts │
                        │ - Length check  │     │ - Normalize     │
                        └─────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
                        ┌─────────────────┐     ┌─────────────────┐
                        │  Application    │ ◀── │   Safe Render   │
                        │  Processing     │     │   or Execute    │
                        └─────────────────┘     └─────────────────┘
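Three of the controls in the table, output sanitization, format validation, and SQL parameterization, can be sketched together. The schema and table layout below are illustrative assumptions:

```python
import html
import json
import sqlite3

def render_safe(model_output: str) -> str:
    """Output sanitization: escape model output before embedding it in HTML."""
    return html.escape(model_output)

def validated_name(raw: str) -> str:
    """Format validation: expect exactly {"name": "<string>"} from the model."""
    obj = json.loads(raw)
    if set(obj) != {"name"} or not isinstance(obj["name"], str):
        raise ValueError("tool call failed schema validation")
    return obj["name"]

def lookup(conn: sqlite3.Connection, name: str):
    # SQL parameterization: the model-derived value never enters the SQL string.
    return conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchall()
```

The key property is that a malicious output (a script tag, an injection string, an out-of-schema tool call) is neutralized before any downstream system renders or executes it.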
4.2.5 Denial of Service Defense
Attack Vector: Attacker exhausts compute resources or degrades service availability.
| Attack Type | Defense | Implementation |
|---|---|---|
| Resource Exhaustion | Token limits | Max input/output tokens |
| Resource Exhaustion | Timeout limits | Maximum inference time |
| Batch Attacks | Rate limiting | Per-user/IP request caps |
| Amplification | Output limits | Maximum response length |
| Context Window Abuse | Context management | Sliding window, summarization |
| Recursive Prompts | Loop detection | Detect self-referential patterns |
Resource Limits Matrix:
| Tier | Requests/min | Max Input Tokens | Max Output Tokens | Timeout |
|---|---|---|---|---|
| Free | 10 | 4,000 | 1,000 | 30s |
| Standard | 60 | 8,000 | 4,000 | 60s |
| Enterprise | 300 | 32,000 | 8,000 | 120s |
| Internal | 1000 | 128,000 | 16,000 | 300s |
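An admission check enforcing the matrix above can be sketched as follows. The tier values mirror the table; the check interface itself is an illustrative assumption:

```python
# Illustrative enforcement of the resource limits matrix (values from the table).
TIERS = {
    "free":       {"rpm": 10,   "max_in": 4_000,   "max_out": 1_000,  "timeout_s": 30},
    "standard":   {"rpm": 60,   "max_in": 8_000,   "max_out": 4_000,  "timeout_s": 60},
    "enterprise": {"rpm": 300,  "max_in": 32_000,  "max_out": 8_000,  "timeout_s": 120},
    "internal":   {"rpm": 1000, "max_in": 128_000, "max_out": 16_000, "timeout_s": 300},
}

def check_request(tier: str, input_tokens: int, requests_last_minute: int) -> list[str]:
    """Return the limit violations for this request; an empty list means admit."""
    limits = TIERS[tier]
    violations = []
    if requests_last_minute >= limits["rpm"]:
        violations.append("rate limit exceeded")
    if input_tokens > limits["max_in"]:
        violations.append("input token limit exceeded")
    return violations
```

The `max_out` and `timeout_s` limits from the matrix would be enforced at inference time rather than at admission.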
4.2.6 Supply Chain Security
Attack Vector: Compromised dependencies, models, or datasets introduce vulnerabilities.
| Component | Risk | Defense |
|---|---|---|
| Pre-trained Models | Backdoors, poisoning | Verify checksums, scan for triggers |
| ML Libraries | Vulnerabilities, malicious code | SBOM, vulnerability scanning |
| Training Datasets | Poisoned data, IP issues | Data provenance, licensing review |
| Vector Databases | Poisoned embeddings | Access controls, integrity checks |
| Fine-tuning Data | Backdoor injection | Data validation, source verification |
| Plugins/Tools | Malicious functionality | Code review, sandboxing |
Supply Chain Security Checklist:
- Verify checksums and signatures of pre-trained models before use
- Maintain an SBOM and scan ML libraries for known vulnerabilities
- Review dataset provenance and licensing before ingestion
- Enforce access controls and integrity checks on vector databases
- Code-review and sandbox all plugins and tools
4.2.7 Plugin & Tool Security
Attack Vector: Malicious or vulnerable plugins expand attack surface.
| Control | Description | Implementation |
|---|---|---|
| Plugin Allowlist | Only approved plugins enabled | Configuration policy |
| Input Validation | Validate all plugin inputs | Schema enforcement |
| Output Sanitization | Sanitize plugin outputs | Filter before model consumption |
| Least Privilege | Minimal permissions per plugin | IAM/RBAC |
| Sandboxing | Isolate plugin execution | Containers, VMs |
| Audit Logging | Log all plugin invocations | Centralized logging |
4.3 Detection and Monitoring
4.3.1 Adversarial Detection Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Injection Attempt Rate | Detected injection patterns / total requests | >0.1% |
| Jailbreak Success Rate | Successful policy bypasses / attempts | >0% (Critical) |
| Query Anomaly Score | Deviation from normal query patterns | >3σ |
| Output Toxicity Spike | Sudden increase in harmful outputs | >2x baseline |
| Extraction Indicator | Systematic query patterns detected | Any detection |
| Resource Anomaly | Unusual compute/token consumption | >2x baseline |
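The σ-based thresholds above (e.g., the >3σ query anomaly alert) reduce to a simple deviation check. The baseline window and helper below are illustrative assumptions:

```python
import statistics

# Sketch of a sigma-based anomaly alert: flag values more than `sigmas`
# standard deviations from a rolling baseline.
def anomaly_alert(baseline: list[float], observed: float, sigmas: float = 3.0) -> bool:
    """True when the observed value deviates beyond the sigma threshold."""
    mean = statistics.fmean(baseline)
    stdev = statistics.pstdev(baseline)
    if stdev == 0:
        # A flat baseline: any deviation at all is anomalous.
        return observed != mean
    return abs(observed - mean) > sigmas * stdev
```

The multiplier-style thresholds in the table (>2x baseline) would use a ratio check against the baseline mean instead.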
4.3.2 Security Monitoring Architecture
┌─────────────────────────────────────────────────────────────────┐
│                        INPUT MONITORING                         │
│  • Injection pattern detection    • Rate anomaly detection      │
│  • Input fingerprinting           • User behavior analysis      │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                        MODEL MONITORING                         │
│  • Inference latency tracking     • Resource consumption        │
│  • Error rate monitoring          • Confidence distribution     │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                        OUTPUT MONITORING                        │
│  • Toxicity scoring               • Policy violation detection  │
│  • Output pattern analysis        • Sensitive data detection    │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                           SIEM / SOAR                           │
│  • Alert correlation              • Automated response          │
│  • Incident management            • Threat intelligence         │
└─────────────────────────────────────────────────────────────────┘
4.4 Incident Response for Adversarial Attacks
4.4.1 Response Playbooks
| Attack Type | Immediate Action | Investigation | Remediation |
|---|---|---|---|
| Prompt Injection | Block pattern, log details | Analyze payload, check for data exfil | Update filters, retrain detector |
| Jailbreak | Disable affected feature | Root cause analysis, scope impact | Patch guardrails, update policy |
| Data Poisoning | Quarantine model | Audit training data, identify source | Retrain on clean data |
| Model Extraction | Rate limit user, log queries | Analyze query patterns | Implement additional protections |
| Supply Chain | Isolate component | CVE analysis, dependency audit | Patch or replace component |
4.4.2 Severity Classification
| Severity | Criteria | Response Time | Escalation |
|---|---|---|---|
| Critical | Active exploitation, data breach, safety impact | Immediate | CISO, Legal, AI Board |
| High | Successful attack, significant risk | 4 hours | Security Lead, AI Risk |
| Medium | Attempted attack, partial success | 24 hours | Security Team |
| Low | Blocked attack, no impact | 72 hours | Operations |
4.5 Adversarial Robustness Testing
4.5.1 Required Testing by Risk Tier
| Test Type | Low Risk | Medium Risk | High Risk | Critical |
|---|---|---|---|---|
| Prompt Injection | Recommended | Required | Required | Required |
| Jailbreak Testing | Optional | Required | Required | Required |
| Evasion Testing | N/A | Recommended | Required | Required |
| Data Poisoning Sim | N/A | Optional | Required | Required |
| Red Team Exercise | N/A | Optional | Required | Required |
| Penetration Test | N/A | Optional | Required | Required |
4.5.2 Recommended Testing Tools
| Category | Tool | Purpose |
|---|---|---|
| Prompt Injection | Garak, Promptfoo | Automated injection testing |
| Jailbreaking | HarmBench, JailbreakBench | Safety bypass testing |
| Adversarial Examples | TextFooler, ART | NLP adversarial generation |
| Model Robustness | Adversarial Robustness Toolbox | Comprehensive testing |
| Red Teaming | Custom frameworks | Human-led adversarial testing |
| Fuzzing | LLM Fuzzer | Random input generation |
5. Agentic AI Governance
5.1 Definition
Agentic AI refers to AI systems that can:
- Perceive their environment
- Make decisions autonomously
- Take actions in the real world
- Use tools and APIs
- Coordinate with other agents
5.2 Agentic AI Risk Categories
| Risk Category | Description |
|---|---|
| Goal Hijacking | Adversary manipulates agent's objectives |
| Memory Poisoning | Corrupting agent's context/memory |
| Resource Exhaustion | Agent consumes excessive resources |
| Excessive Agency | Agent takes unauthorized actions |
| Cascade Failures | Multi-agent errors propagate |
| Tool Misuse | Improper use of integrated tools |
5.3 Mandatory Agentic Controls
| Control | Requirement | Tier |
|---|---|---|
| Action Boundaries | Explicit whitelist of permitted actions | All |
| Approval Gates | Human approval for sensitive actions | Medium+ |
| Sandboxing | Isolated test environment | All |
| Audit Trail | Log all actions with full context | All |
| Kill Switch | Immediate halt mechanism (tested monthly) | All |
| Rate Limits | Maximum actions per time period | All |
| Timeout Limits | Maximum execution time | All |
| Rollback Capability | Ability to undo agent actions | High+ |
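Several of the mandatory controls above (action boundaries, approval gates, rate limits, kill switch, audit trail) compose naturally into a single authorization gate. The sketch below is a hypothetical illustration; action names, limits, and the class interface are assumptions:

```python
# Hypothetical action gate combining mandatory agentic controls:
# allowlist, approval gate, rate limit, kill switch, and audit trail.
class ActionGate:
    def __init__(self, allowed: set, sensitive: set, max_actions: int):
        self.allowed = allowed          # Action Boundaries: explicit allowlist
        self.sensitive = sensitive      # actions requiring human approval
        self.max_actions = max_actions  # Rate Limits: max actions per period
        self.count = 0
        self.killed = False
        self.audit_log = []             # Audit Trail: every attempt recorded

    def authorize(self, action: str, human_approved: bool = False) -> bool:
        if self.killed:
            decision = "denied: kill switch engaged"
        elif action not in self.allowed:
            decision = "denied: not on action allowlist"
        elif action in self.sensitive and not human_approved:
            decision = "denied: approval gate"
        elif self.count >= self.max_actions:
            decision = "denied: rate limit"
        else:
            self.count += 1
            decision = "allowed"
        self.audit_log.append((action, decision))
        return decision == "allowed"

    def kill(self) -> None:
        self.killed = True  # Kill Switch: immediate halt mechanism
```

Logging denied attempts as well as allowed ones is what makes the audit trail useful for incident investigation.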
5.4 Multi-Agent System Controls
| Control | Requirement |
|---|---|
| Agent Registration | All agents must have unique identity |
| Communication Logging | Log all inter-agent messages |
| Cascade Prevention | Circuit breakers between agents |
| Orchestrator Oversight | Central coordination point |
| Consensus Requirements | Multi-agent decisions require consensus (configurable) |
| Isolation | Agents cannot modify other agents |
5.5 Tool Use Controls
| Control | Implementation |
|---|---|
| Tool Whitelist | Only approved tools accessible |
| Parameter Validation | Validate all tool inputs |
| Output Sanitization | Sanitize tool outputs before use |
| Access Scoping | Minimum necessary permissions |
| Credential Management | No persistent credentials in agent memory |
| Tool Call Logging | Complete audit of tool invocations |
6. User Interaction Standards
6.1 Human-in-the-Loop Requirements
| Risk Level | Requirement |
|---|---|
| Low | AI output may be used directly |
| Medium | Human review recommended |
| High | Human approval required before action |
| Critical | Dual human approval required |
6.2 Code Generation Standards
| Requirement | Standard |
|---|---|
| Review Mandate | All AI-generated code must be human-reviewed |
| Testing Required | AI code must pass unit tests before merge |
| Security Scan | Automated security scanning required |
| Documentation | AI-generated sections must be documented |
| No Direct Commit | AI cannot commit directly to production branches |
6.3 Content Generation Standards
| Requirement | Standard |
|---|---|
| External Publication | Human editor review required |
| Factual Claims | Must be verified against authoritative sources |
| Watermarking | Apply C2PA or similar where feasible |
| Disclosure | Label AI-generated content internally |
| Deepfake Prohibition | No synthetic media of real persons without consent |
7. Vendor & Procurement Requirements
7.1 GenAI Vendor Due Diligence
| Requirement | Standard |
|---|---|
| IP Indemnification | Required for enterprise GenAI vendors |
| Zero Data Retention | Our data not used for training |
| Data Processing Agreement | GDPR/CCPA compliant DPA required |
| SOC 2 / ISO 27001 | Security certification required |
| Incident Notification | 24-hour notification requirement |
| Model Transparency | Documentation on capabilities and limitations |
7.2 GPAI Provider Requirements (EU AI Act)
For providers of General-Purpose AI models:
| Requirement | Obligation |
|---|---|
| Technical Documentation | Training process, evaluation results, limitations |
| Transparency Report | Capabilities, intended uses, known risks |
| Training Data Summary | Published summary of training content |
| Copyright Policy | Documentation of copyright compliance |
| EU AI Act Compliance | Demonstrated compliance with Chapter V (Articles 51-55) |
7.3 Systemic Risk GPAI Additional Requirements
For GPAI models with systemic risk (training compute ≥10²⁵ FLOPs):
| Requirement | Obligation |
|---|---|
| Model Evaluation | Comprehensive capability assessment |
| Red Teaming | Adversarial testing by qualified team |
| Risk Assessment | Systemic risk identification and mitigation |
| Incident Tracking | Serious incident monitoring and reporting |
| Cybersecurity | Adequate protection measures |
8. Specific Prohibitions for GenAI
The following uses are explicitly prohibited without AI Governance Board waiver:
| Prohibition | Rationale |
|---|---|
| Automated execution without approval | Risk of unintended consequences |
| Medical/legal advice without professional review | Liability and harm risk |
| Private key/password generation | Poor entropy, security risk |
| Autonomous hiring/firing decisions | Discrimination risk, legal requirement |
| Real-time customer decisions without human oversight | Fairness and accuracy concerns |
| Synthetic media of real persons | Privacy, consent, deepfake concerns |
| Direct database writes without validation | Data integrity risk |
| Unlimited agent action scopes | Excessive agency risk |
9. Monitoring & Incident Response
9.1 GenAI-Specific Monitoring
| Metric | Alert Threshold | Response |
|---|---|---|
| Hallucination rate | >5% | Review, retrain, or disable |
| Toxicity rate | >1% | Immediate investigation |
| Prompt injection attempts | Any detected | Security investigation |
| User complaints | >2% of sessions | UX and safety review |
| Cost anomalies | >2x baseline | Rate limit or disable |
| Latency degradation | >3x baseline | Performance investigation |
9.2 Incident Classification
| Severity | Examples | Response Time |
|---|---|---|
| Critical | Data breach, safety harm, regulatory violation | Immediate (1 hour) |
| High | Significant hallucination, bias incident, jailbreak success | 4 hours |
| Medium | Minor inaccuracy, user complaints, performance issue | 24 hours |
| Low | Edge case behavior, minor UX issues | 72 hours |
9.3 Logging Requirements
| Data | Retention | Access |
|---|---|---|
| Full prompts & responses (High-Risk) | 90 days | Audit, Security |
| Summarized interactions (Medium) | 30 days | Operations |
| Metadata only (Low) | 7 days | Analytics |
| Security events | 1 year | Security, Audit |
10. EU AI Act GPAI Compliance Checklist
10.1 Standard GPAI Obligations (All GPAI Models)
| Obligation | Complete | Evidence Location |
|---|---|---|
| Technical documentation maintained | [ ] | |
| Information to downstream providers | [ ] | |
| Copyright compliance policy | [ ] | |
| Training data summary published | [ ] | |
10.2 Systemic Risk GPAI Obligations (≥10²⁵ FLOPs)
| Obligation | Complete | Evidence Location |
|---|---|---|
| Model evaluation performed | [ ] | |
| Adversarial red teaming completed | [ ] | |
| Systemic risk assessment documented | [ ] | |
| Risk mitigation measures implemented | [ ] | |
| Serious incident tracking established | [ ] | |
| Cybersecurity measures verified | [ ] | |
| AI Office notification submitted | [ ] | |
11. Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-06-15 | AI Governance Office | Initial release |
| 2.0 | 2026-01-15 | AI Governance Office | Added agentic AI controls, EU AI Act GPAI requirements, multi-agent governance |
| 2.1 | 2026-01-16 | AI Governance Office | Added Section 4: Adversarial ML Defense Framework (OWASP LLM Top 10, MITRE ATLAS alignment) |
Next Step: Proceed to Artifact 10: Executive Summary
CODITECT AI Risk Management Framework
Document ID: AI-RMF-09 | Version: 2.0.0 | Status: Active
AZ1.AI Inc. | CODITECT Platform
Framework Alignment: NIST AI RMF 2.0 | EU AI Act | ISO/IEC 42001
This document is part of the CODITECT AI Risk Management Framework.
For questions or updates, contact the AI Governance Office.
Repository: coditect-ai-risk-management-framework
Last Updated: 2026-01-15
Owner: AZ1.AI Inc. | Lead: Hal Casteel