ADR-LMS-007: Interactive Assessment Engine
Status: Proposed
Date: 2025-12-11
Phase: Phase 2 - Core LMS Infrastructure
Deciders: Hal Casteel (Founder/CEO/CTO), CODITECT Core Team
Technical Story: Enable comprehensive interactive assessments with multiple question types, adaptive difficulty, and automated grading for CODITECT certification
Context and Problem Statement
The current CODITECT training uses static markdown-based quizzes:
### Question 1
What is the correct Task Tool Pattern?
- A) Task(subagent_type="agent-name"...)
- B) Task(subagent_type="general-purpose"...) ← Correct
- C) /agent-name
This approach has limitations:
- Static Format - Questions in markdown, manually graded
- No Randomization - Same questions in same order
- No Adaptive Difficulty - All users get same questions
- Limited Question Types - Only multiple choice
- No Code Execution - Cannot test actual coding ability
- No Time Tracking - No proctoring capabilities
- Manual Grading - Short answers require human review
The Problem: How do we create an interactive assessment engine that supports multiple question types, adaptive difficulty, automated grading, and secure proctoring?
Decision Drivers
Technical Requirements
- R1: Multiple question types (MCQ, true/false, short answer, code execution)
- R2: Question randomization and pooling
- R3: Adaptive difficulty based on user performance
- R4: Automated grading with rubrics
- R5: Code execution sandbox for practical tests
- R6: Time limits and attempt restrictions
- R7: Partial credit support
User Experience Goals
- UX1: CLI-friendly quiz interface
- UX2: Immediate feedback on answers
- UX3: Progress saving (resume later)
- UX4: Detailed score breakdown
- UX5: Remediation suggestions
Security Requirements
- S1: Question pool randomization
- S2: Time-based session tokens
- S3: Answer submission integrity
- S4: Anti-cheating measures for proctored exams
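One way S2 and S3 could be realized is with HMAC-signed, expiring session tokens. This is a minimal sketch, not a decision of this ADR; the token format, the TTL default, and `SECRET_KEY` handling are illustrative assumptions:

```python
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET_KEY = b"replace-with-server-secret"  # assumption: loaded from server config in practice

def issue_session_token(attempt_id: str, ttl_seconds: int = 3600) -> str:
    """Bind a session to an attempt with an HMAC-signed, expiring token."""
    payload = json.dumps({"attempt_id": attempt_id, "exp": int(time.time()) + ttl_seconds})
    sig = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload.encode().hex()}.{sig}"

def verify_session_token(token: str) -> Optional[dict]:
    """Return the payload if the signature checks out and the token is unexpired."""
    try:
        payload_hex, sig = token.rsplit(".", 1)
        payload = bytes.fromhex(payload_hex).decode()
    except ValueError:
        return None
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    data = json.loads(payload)
    return data if data["exp"] >= time.time() else None
```

A tampered or expired token verifies to `None`, so resumption (UX3) and submission integrity (S3) can share the same check.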
Decision Outcome
Chosen Solution: Implement a comprehensive assessment engine with multiple question types, item response theory (IRT) for adaptive testing, sandboxed code execution, and automated and AI-assisted grading.
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ Assessment Engine Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Question Bank │ │
│ │ │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ MCQ │ │ True/ │ │ Short │ │ Code │ │ │
│ │ │ │ │ False │ │ Answer │ │ Exec │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ │ │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Matching│ │ Essay │ │ Fill-in │ │ Ordering│ │ │
│ │ │ │ │ │ │ Blank │ │ │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Adaptive Selection │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ IRT │───▶│ Question │ │ │
│ │ │ Algorithm │ │ Selector │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Grading Engine │ │
│ │ │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────────┐ │ │
│ │ │ Exact │ │ Rubric │ │ AI-Assisted │ │ │
│ │ │ Match │ │ Grading │ │ (LLM) │ │ │
│ │ └────────────┘ └────────────┘ └────────────────┘ │ │
│ │ │ │
│ │ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Code │ │ Partial │ │ │
│ │ │ Runner │ │ Credit │ │ │
│ │ └────────────┘ └────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Implementation Details
1. Database Schema
-- Question bank
CREATE TABLE assessment_questions (
id INTEGER PRIMARY KEY AUTOINCREMENT,
question_id TEXT UNIQUE NOT NULL, -- UUID
question_type TEXT NOT NULL, -- mcq, true_false, short_answer, code, matching, essay, fill_blank, ordering
-- Content
question_text TEXT NOT NULL,
question_html TEXT, -- Rich text version
code_snippet TEXT, -- For code questions
code_language TEXT, -- python, javascript, bash, etc.
-- Answer configuration
options TEXT, -- JSON array for MCQ/matching
correct_answer TEXT NOT NULL, -- Answer or JSON for complex types
answer_explanation TEXT, -- Shown after submission
-- Grading
grading_type TEXT DEFAULT 'exact', -- exact, rubric, ai_assisted, code_execution
grading_rubric TEXT, -- JSON rubric for essays/short answer
partial_credit BOOLEAN DEFAULT 0,
max_points INTEGER DEFAULT 1,
-- IRT parameters (Item Response Theory)
difficulty REAL DEFAULT 0.5, -- 0.0-1.0 (0=easy, 1=hard)
discrimination REAL DEFAULT 1.0, -- How well it differentiates ability
guessing_param REAL DEFAULT 0.25, -- Probability of guessing correctly
-- Metadata
skill_id INTEGER, -- Associated skill
module_id INTEGER, -- Associated module
tags TEXT, -- JSON array
-- Statistics
times_shown INTEGER DEFAULT 0,
times_correct INTEGER DEFAULT 0,
avg_time_seconds REAL,
-- Status
is_active BOOLEAN DEFAULT 1,
created_by TEXT,
created_at TEXT DEFAULT CURRENT_TIMESTAMP,
updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (skill_id) REFERENCES learning_skills(id) ON DELETE SET NULL,
FOREIGN KEY (module_id) REFERENCES learning_modules(id) ON DELETE SET NULL
);
CREATE INDEX idx_questions_type ON assessment_questions(question_type);
CREATE INDEX idx_questions_skill ON assessment_questions(skill_id);
CREATE INDEX idx_questions_difficulty ON assessment_questions(difficulty);
-- Assessment definitions
CREATE TABLE assessments (
id INTEGER PRIMARY KEY AUTOINCREMENT,
assessment_id TEXT UNIQUE NOT NULL, -- UUID
assessment_type TEXT NOT NULL, -- quiz, exam, practice, certification
-- Configuration
title TEXT NOT NULL,
description TEXT,
instructions TEXT,
-- Question selection
question_pool TEXT, -- JSON: {skill_ids: [], module_ids: [], tags: [], count: 20}
fixed_questions TEXT, -- JSON array of specific question IDs
total_questions INTEGER NOT NULL,
randomize_questions BOOLEAN DEFAULT 1,
randomize_options BOOLEAN DEFAULT 1,
-- Timing
time_limit_minutes INTEGER,
show_time_remaining BOOLEAN DEFAULT 1,
auto_submit_on_timeout BOOLEAN DEFAULT 1,
-- Attempts
max_attempts INTEGER, -- NULL = unlimited
cooldown_hours INTEGER DEFAULT 24, -- Wait time between attempts
-- Scoring
passing_score INTEGER DEFAULT 70,
show_score_immediately BOOLEAN DEFAULT 1,
show_answers_after BOOLEAN DEFAULT 1,
show_explanations BOOLEAN DEFAULT 1,
-- Adaptive testing
is_adaptive BOOLEAN DEFAULT 0,
starting_difficulty REAL DEFAULT 0.5,
-- Associated content
learning_path_id INTEGER,
module_id INTEGER,
cert_id INTEGER, -- Required for certification
-- Status
is_published BOOLEAN DEFAULT 0,
published_at TEXT,
is_active BOOLEAN DEFAULT 1,
created_by TEXT,
created_at TEXT DEFAULT CURRENT_TIMESTAMP,
updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (learning_path_id) REFERENCES learning_paths(id) ON DELETE SET NULL,
FOREIGN KEY (module_id) REFERENCES learning_modules(id) ON DELETE SET NULL,
FOREIGN KEY (cert_id) REFERENCES cert_definitions(id) ON DELETE SET NULL
);
-- Assessment attempts
CREATE TABLE assessment_attempts (
id INTEGER PRIMARY KEY AUTOINCREMENT,
attempt_id TEXT UNIQUE NOT NULL, -- UUID
assessment_id INTEGER NOT NULL,
user_id TEXT NOT NULL,
-- Progress
status TEXT DEFAULT 'in_progress', -- in_progress, submitted, graded, expired
current_question INTEGER DEFAULT 0,
questions_answered INTEGER DEFAULT 0,
-- Timing
started_at TEXT DEFAULT CURRENT_TIMESTAMP,
submitted_at TEXT,
time_spent_seconds INTEGER DEFAULT 0,
-- Results
score_raw INTEGER,
score_max INTEGER,
score_percentage REAL,
passed BOOLEAN,
grade TEXT, -- A, B, C, D, F or custom
-- Question sequence (randomized)
question_sequence TEXT, -- JSON array of question IDs in order
-- Detailed answers
answers TEXT, -- JSON: {question_id: {answer, is_correct, points, time_seconds}}
-- Feedback
feedback TEXT,
grader_notes TEXT,
graded_by TEXT, -- 'auto' or user_id for manual
graded_at TEXT,
-- Session security
session_token TEXT, -- For resumption
ip_address TEXT,
user_agent TEXT,
created_at TEXT DEFAULT CURRENT_TIMESTAMP,
updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (assessment_id) REFERENCES assessments(id) ON DELETE CASCADE,
FOREIGN KEY (user_id) REFERENCES auth_users(user_id) ON DELETE CASCADE
);
CREATE INDEX idx_attempts_user ON assessment_attempts(user_id);
CREATE INDEX idx_attempts_assessment ON assessment_attempts(assessment_id);
CREATE INDEX idx_attempts_status ON assessment_attempts(status);
-- Answer submissions (per question)
CREATE TABLE assessment_answers (
id INTEGER PRIMARY KEY AUTOINCREMENT,
attempt_id INTEGER NOT NULL,
question_id INTEGER NOT NULL,
-- Answer data
answer_value TEXT, -- User's answer
answer_json TEXT, -- Complex answer (matching, ordering)
-- Grading
is_correct BOOLEAN,
points_earned REAL DEFAULT 0,
points_possible REAL DEFAULT 1,
-- Timing
started_at TEXT,
answered_at TEXT,
time_spent_seconds INTEGER,
-- Code execution (if applicable)
code_output TEXT,
code_error TEXT,
test_results TEXT, -- JSON: {passed: 5, failed: 1, tests: [...]}
-- Grading details
grading_method TEXT, -- auto, manual, ai
grader_feedback TEXT,
rubric_scores TEXT, -- JSON: {criterion: score}
created_at TEXT DEFAULT CURRENT_TIMESTAMP,
updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (attempt_id) REFERENCES assessment_attempts(id) ON DELETE CASCADE,
FOREIGN KEY (question_id) REFERENCES assessment_questions(id) ON DELETE CASCADE,
UNIQUE(attempt_id, question_id)
);
CREATE INDEX idx_answers_attempt ON assessment_answers(attempt_id);
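The `max_attempts` and `cooldown_hours` columns imply an eligibility check before a new attempt row is created. A sketch against the schema above; the error messages are illustrative, and the timestamp comparison assumes `started_at` is stored in the same timezone the server clock uses:

```python
import sqlite3
from datetime import datetime, timedelta

def can_start_attempt(conn: sqlite3.Connection, assessment_id: int, user_id: str):
    """Enforce max_attempts and cooldown_hours before creating a new attempt."""
    row = conn.execute(
        "SELECT max_attempts, cooldown_hours FROM assessments WHERE id = ?",
        (assessment_id,),
    ).fetchone()
    if row is None:
        return False, "Assessment not found"
    max_attempts, cooldown_hours = row
    count, last_started = conn.execute(
        "SELECT COUNT(*), MAX(started_at) FROM assessment_attempts "
        "WHERE assessment_id = ? AND user_id = ?",
        (assessment_id, user_id),
    ).fetchone()
    if max_attempts is not None and count >= max_attempts:
        return False, "Attempt limit reached"
    if last_started and cooldown_hours:
        # Compare in the same timezone your timestamps are stored in
        next_allowed = datetime.fromisoformat(last_started) + timedelta(hours=cooldown_hours)
        if datetime.now() < next_allowed:
            return False, f"Cooldown active until {next_allowed.isoformat()}"
    return True, "OK"
```

Because `max_attempts` is NULL for unlimited assessments, the `is not None` guard matters: `NULL` must not be treated as zero.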
2. Question Types Implementation
import json
import subprocess
import uuid
from enum import Enum
from typing import Any, Dict, List, Optional
from dataclasses import dataclass
class QuestionType(Enum):
MCQ = "mcq" # Multiple choice (single answer)
MCQ_MULTI = "mcq_multi" # Multiple choice (multiple answers)
TRUE_FALSE = "true_false"
SHORT_ANSWER = "short_answer"
CODE = "code"
MATCHING = "matching"
ORDERING = "ordering"
FILL_BLANK = "fill_blank"
ESSAY = "essay"
@dataclass
class GradingResult:
is_correct: bool
points_earned: float
points_possible: float
feedback: str
rubric_scores: Optional[Dict] = None
def grade_answer(question: dict, user_answer: Any) -> GradingResult:
"""
Grade a user's answer based on question type.
"""
q_type = QuestionType(question['question_type'])
graders = {
QuestionType.MCQ: grade_mcq,
QuestionType.MCQ_MULTI: grade_mcq_multi,
QuestionType.TRUE_FALSE: grade_true_false,
QuestionType.SHORT_ANSWER: grade_short_answer,
QuestionType.CODE: grade_code,
QuestionType.MATCHING: grade_matching,
QuestionType.ORDERING: grade_ordering,
QuestionType.FILL_BLANK: grade_fill_blank,
QuestionType.ESSAY: grade_essay,
}
grader = graders.get(q_type)
if not grader:
raise ValueError(f"Unknown question type: {q_type}")
return grader(question, user_answer)
def grade_mcq(question: dict, user_answer: str) -> GradingResult:
"""Grade multiple choice question (single answer)."""
correct = question['correct_answer']
is_correct = user_answer.strip().lower() == correct.strip().lower()
return GradingResult(
is_correct=is_correct,
points_earned=question['max_points'] if is_correct else 0,
points_possible=question['max_points'],
        feedback="Correct!" if is_correct else (question.get('answer_explanation') or "Incorrect.")
)
def grade_mcq_multi(question: dict, user_answer: List[str]) -> GradingResult:
"""Grade multiple choice with multiple correct answers (partial credit)."""
correct_answers = set(json.loads(question['correct_answer']))
user_answers = set(user_answer)
# Calculate partial credit
correct_selected = len(correct_answers & user_answers)
incorrect_selected = len(user_answers - correct_answers)
total_correct = len(correct_answers)
# Points = (correct - incorrect) / total, min 0
points_ratio = max(0, (correct_selected - incorrect_selected) / total_correct)
points_earned = points_ratio * question['max_points']
is_correct = user_answers == correct_answers
return GradingResult(
is_correct=is_correct,
points_earned=points_earned,
points_possible=question['max_points'],
feedback=f"You selected {correct_selected}/{total_correct} correct answers."
)
def grade_short_answer(question: dict, user_answer: str) -> GradingResult:
"""Grade short answer with fuzzy matching or AI."""
correct = question['correct_answer']
grading_type = question.get('grading_type', 'exact')
if grading_type == 'exact':
# Exact match (case-insensitive, trimmed)
is_correct = user_answer.strip().lower() == correct.strip().lower()
points = question['max_points'] if is_correct else 0
elif grading_type == 'contains':
# Check if answer contains key terms
key_terms = json.loads(correct) # List of required terms
found_terms = sum(1 for term in key_terms if term.lower() in user_answer.lower())
points = (found_terms / len(key_terms)) * question['max_points']
is_correct = found_terms == len(key_terms)
    elif grading_type == 'ai_assisted':
        # Use LLM for grading
        return grade_with_ai(question, user_answer)
    else:
        raise ValueError(f"Unsupported grading_type: {grading_type}")
    return GradingResult(
        is_correct=is_correct,
        points_earned=points,
        points_possible=question['max_points'],
        feedback="Correct!" if is_correct else (question.get('answer_explanation') or "Incorrect.")
    )
def grade_code(question: dict, user_code: str) -> GradingResult:
"""Grade code question with sandbox execution."""
language = question.get('code_language', 'python')
test_cases = json.loads(question.get('correct_answer', '[]'))
# Execute code in sandbox
results = execute_code_sandboxed(user_code, language, test_cases)
passed = sum(1 for r in results if r['passed'])
total = len(results)
points = (passed / total) * question['max_points'] if total > 0 else 0
is_correct = passed == total
return GradingResult(
is_correct=is_correct,
points_earned=points,
points_possible=question['max_points'],
feedback=f"Passed {passed}/{total} test cases.",
rubric_scores={'test_results': results}
)
def execute_code_sandboxed(code: str, language: str, test_cases: List[dict]) -> List[dict]:
"""
Execute code in a sandboxed environment and run test cases.
Uses Docker containers for isolation.
"""
results = []
for test in test_cases:
try:
# Build Docker command
container = f"code-runner-{language}"
timeout = test.get('timeout', 5)
            # Write code (plus any test input) to a temp file
            code_file = f"/tmp/code_{uuid.uuid4()}.{language}"
            with open(code_file, 'w') as f:
                f.write(code)
                if test.get('input'):
                    f.write(f"\n\n# Test input\n{test['input']}")
            # Run in Docker; select the runner command for the language
            runners = {
                'python': ['python', '/code/main.py'],
                'javascript': ['node', '/code/main.js'],
                'bash': ['bash', '/code/main.sh'],
            }
            runner = runners.get(language, runners['python'])
            result = subprocess.run(
                [
                    'docker', 'run', '--rm',
                    '--memory=128m', '--cpus=0.5',
                    '--network=none',  # No network access
                    '-v', f"{code_file}:{runner[1]}:ro",  # Mount path must match the runner
                    container,
                ] + runner,
                capture_output=True,
                timeout=timeout,
                text=True
            )
actual_output = result.stdout.strip()
expected_output = test['expected_output'].strip()
results.append({
'name': test.get('name', f'Test {len(results)+1}'),
'passed': actual_output == expected_output,
'expected': expected_output,
'actual': actual_output,
'error': result.stderr if result.returncode != 0 else None
})
except subprocess.TimeoutExpired:
results.append({
'name': test.get('name', f'Test {len(results)+1}'),
'passed': False,
'error': 'Timeout exceeded'
})
except Exception as e:
results.append({
'name': test.get('name', f'Test {len(results)+1}'),
'passed': False,
'error': str(e)
})
return results
def grade_with_ai(question: dict, user_answer: str) -> GradingResult:
"""Use LLM for grading essays and complex short answers."""
rubric = json.loads(question.get('grading_rubric', '{}'))
prompt = f"""You are grading a student's answer. Be fair but rigorous.
Question: {question['question_text']}
Model Answer / Key Points: {question['correct_answer']}
Grading Rubric:
{json.dumps(rubric, indent=2)}
Student's Answer:
{user_answer}
Grade this answer. For each rubric criterion, assign a score from 0-{rubric.get('max_per_criterion', 5)}.
Then provide overall feedback.
Respond in JSON format:
{{
"rubric_scores": {{"criterion_name": score, ...}},
"total_points": <sum of scores>,
"max_points": {question['max_points']},
"feedback": "specific feedback for the student",
"strengths": ["strength 1", ...],
"areas_for_improvement": ["area 1", ...]
}}"""
response = call_llm(prompt, model="claude-3-haiku-20240307")
result = json.loads(response)
return GradingResult(
is_correct=result['total_points'] >= question['max_points'] * 0.7,
points_earned=result['total_points'],
points_possible=question['max_points'],
feedback=result['feedback'],
rubric_scores=result['rubric_scores']
)
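The partial-credit rule in grade_mcq_multi, points = max(0, (right picks - wrong picks) / total right) * max_points, is worth a worked example. A standalone restatement of that formula:

```python
def partial_credit(correct: set, selected: set, max_points: float) -> float:
    """Points = max(0, (right picks - wrong picks) / total right) * max_points."""
    ratio = max(0, (len(correct & selected) - len(selected - correct)) / len(correct))
    return ratio * max_points

print(partial_credit({"A", "C"}, {"A"}, 2))       # 1.0: half the answers, no wrong picks
print(partial_credit({"A", "C"}, {"A", "B"}, 2))  # 0: the wrong pick cancels the right one
print(partial_credit({"A", "C"}, {"A", "C"}, 2))  # 2.0: full credit
```

Note the floor at zero: a learner who selects everything scores no worse than one who selects nothing, which discourages shotgun guessing without going negative.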
3. Adaptive Testing (IRT)
import numpy as np
from scipy.optimize import minimize_scalar
def select_next_question_adaptive(
user_id: str,
assessment_id: int,
answered_questions: List[int],
user_responses: List[bool]
) -> int:
"""
Select next question using Item Response Theory (3PL model).
Maximizes information at current ability estimate.
"""
# Estimate current ability
ability = estimate_ability(answered_questions, user_responses)
# Get available questions
available = get_available_questions(assessment_id, exclude=answered_questions)
# Calculate information for each question at current ability
best_question = None
max_information = -float('inf')
for q in available:
info = calculate_item_information(
ability,
difficulty=q['difficulty'],
discrimination=q['discrimination'],
guessing=q['guessing_param']
)
        if info > max_information:
            max_information = info
            best_question = q
    if best_question is None:
        raise ValueError("No unanswered questions remain in the pool")
    return best_question['id']
def estimate_ability(questions: List[int], responses: List[bool]) -> float:
"""
Estimate ability using Maximum Likelihood Estimation (MLE).
"""
if not responses:
return 0.0 # Default to average ability
# Get question parameters
params = []
for q_id in questions:
q = get_question(q_id)
params.append({
'a': q['discrimination'],
'b': q['difficulty'],
'c': q['guessing_param']
})
def neg_log_likelihood(theta):
"""Negative log likelihood for MLE."""
ll = 0
        for p, r in zip(params, responses):
            prob = calculate_probability(theta, p['a'], p['b'], p['c'])
            if r:  # Correct response
                ll += np.log(max(prob, 1e-10))
            else:  # Incorrect response
                ll += np.log(max(1 - prob, 1e-10))
return -ll
# Find theta that maximizes likelihood
result = minimize_scalar(neg_log_likelihood, bounds=(-4, 4), method='bounded')
return result.x
def calculate_probability(theta: float, a: float, b: float, c: float) -> float:
"""
Calculate probability of correct response using 3PL model.
P(θ) = c + (1-c) / (1 + exp(-a(θ-b)))
theta: ability level
a: discrimination parameter
b: difficulty parameter
c: guessing parameter
"""
exponent = -a * (theta - b)
return c + (1 - c) / (1 + np.exp(exponent))
def calculate_item_information(theta: float, difficulty: float,
                               discrimination: float, guessing: float) -> float:
    """
    Calculate item information at a given ability level (standard 3PL form).
    I(θ) = a² * (Q/P) * ((P-c) / (1-c))²
    where P = probability of correct, Q = 1-P
    """
    P = calculate_probability(theta, discrimination, difficulty, guessing)
    Q = 1 - P
    numerator = (discrimination ** 2) * Q * ((P - guessing) ** 2)
    denominator = ((1 - guessing) ** 2) * P
    return numerator / denominator if denominator > 0 else 0
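To see why maximum-information selection tracks the learner's ability estimate, here is a standalone sketch using the standard 3PL information function I(θ) = a²(Q/P)((P-c)/(1-c))²; the parameter values are illustrative only:

```python
import math

def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """3PL probability of a correct response (same form as calculate_probability)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def item_information(theta: float, a: float, b: float, c: float) -> float:
    """Standard 3PL Fisher information: I(θ) = a² (Q/P) ((P-c)/(1-c))²."""
    P = p_correct(theta, a, b, c)
    return (a ** 2) * ((1 - P) / P) * ((P - c) / (1 - c)) ** 2

# A learner estimated at theta = 0.0 picks among three items with equal a and c;
# the item whose difficulty sits nearest the ability estimate carries the most information.
infos = [item_information(0.0, a=1.0, b=b, c=0.25) for b in (-1.5, 0.1, 1.5)]
best = max(range(3), key=lambda i: infos[i])  # index 1, the b = 0.1 item
```

Items far easier or far harder than the learner's current estimate contribute little, which is exactly why adaptive tests converge with fewer questions than fixed forms.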
4. CLI Commands
# Take assessments
/quiz list # Available quizzes
/quiz start ASSESSMENT_ID # Start a quiz
/quiz resume ATTEMPT_ID # Resume in-progress
/quiz submit ATTEMPT_ID # Submit for grading
# During quiz
/quiz answer OPTION # Answer current question
/quiz skip # Skip to next question
/quiz review # Review answered questions
/quiz time # Show remaining time
/quiz progress # Show progress
# Results
/quiz results ATTEMPT_ID # View detailed results
/quiz history # Past quiz attempts
/quiz analytics # Performance analytics
# Practice mode
/quiz practice --skill SKILL_ID # Practice questions for a skill
/quiz practice --adaptive # Adaptive practice session
# Admin/Instructor
/quiz create --title "Title" --config config.json
/quiz questions add ASSESSMENT_ID # Add questions
/quiz questions import FILE.json # Bulk import
/quiz publish ASSESSMENT_ID
/quiz results-export ASSESSMENT_ID # Export all results
Question Import Format
{
"questions": [
{
"type": "mcq",
"text": "What is the correct Task Tool Pattern for invoking agents?",
"options": [
{"key": "A", "text": "Task(subagent_type=\"agent-name\", ...)"},
{"key": "B", "text": "Task(subagent_type=\"general-purpose\", prompt=\"Use agent-name subagent to...\")"},
{"key": "C", "text": "/agent-name [prompt]"},
{"key": "D", "text": "agent-name: [prompt]"}
],
"correct_answer": "B",
"explanation": "The Task Tool Pattern requires subagent_type='general-purpose' with a prompt that includes 'Use [agent-name] subagent to...'",
"difficulty": 0.3,
"skill": "task-tool-pattern",
"tags": ["foundation", "agent-invocation"]
},
{
"type": "code",
"text": "Write a Task Tool invocation that uses the competitive-market-analyst agent to research the AI IDE market.",
"language": "python",
"test_cases": [
{
"name": "Contains Task",
"check": "contains",
"expected": "Task("
},
{
"name": "Correct subagent_type",
"check": "contains",
"expected": "subagent_type=\"general-purpose\""
},
{
"name": "Contains agent name",
"check": "contains",
"expected": "competitive-market-analyst"
}
],
"difficulty": 0.5,
"skill": "agent-invocation"
}
]
}
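An importer behind /quiz questions import would validate this shape before inserting rows. A minimal sketch; the per-type required-field sets are assumptions inferred from the example above, not a complete specification:

```python
from typing import List

# Assumed required fields per question type, inferred from the import example
REQUIRED_BY_TYPE = {
    "mcq": {"text", "options", "correct_answer"},
    "true_false": {"text", "correct_answer"},
    "code": {"text", "language", "test_cases"},
}

def validate_import(payload: dict) -> List[str]:
    """Return human-readable problems; an empty list means the file is importable."""
    questions = payload.get("questions")
    if not isinstance(questions, list) or not questions:
        return ["'questions' must be a non-empty array"]
    errors = []
    for i, q in enumerate(questions):
        required = REQUIRED_BY_TYPE.get(q.get("type"))
        if required is None:
            errors.append(f"question {i}: unknown type {q.get('type')!r}")
            continue
        for field in sorted(required - q.keys()):
            errors.append(f"question {i}: missing field {field!r}")
        if q.get("type") == "mcq" and "options" in q:
            keys = {o.get("key") for o in q["options"]}
            if q.get("correct_answer") not in keys:
                errors.append(f"question {i}: correct_answer not among option keys")
    return errors
```

A caller would pass the parsed file straight in, e.g. `validate_import(json.load(fp))`, and reject the import if the returned list is non-empty.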
Consequences
Positive
- P1: Comprehensive question type support
- P2: Adaptive testing for efficient assessment
- P3: Automated grading reduces manual effort
- P4: Code execution validates practical skills
- P5: Detailed analytics for improvement
Negative
- N1: Code sandbox security complexity
- N2: AI grading incurs LLM API costs
- N3: IRT requires calibrated questions
Risks
- Risk 1: Code execution sandbox escape
- Mitigation: Docker isolation, no network, resource limits
- Risk 2: AI grading inconsistency
- Mitigation: Rubric constraints, human review option
Related Documents
- ADR-031-lms-phase-2.md - Quiz engine overview
- ADR-033-lms-certificates.md - Certification exams
- CODITECT-OPERATOR-ASSESSMENTS.md - Current assessment content
Status: Proposed - Phase 2 Core Infrastructure Last Updated: 2025-12-11 Version: 1.0.0