Assessment Creation Patterns
When to Use This Skill
Use this skill when implementing assessment creation patterns in your codebase: question generation, adaptive difficulty, bias detection, and automated grading.
How to Use This Skill
- Review the patterns and examples below
- Apply the relevant patterns to your implementation
- Follow the best practices outlined in this skill
Level 1: Quick Reference (Under 500 tokens)
Core Assessment Structure
# Bloom's Taxonomy Levels (Low to High)
BLOOMS_LEVELS = {
"remember": 1, # Recall facts
"understand": 2, # Explain concepts
"apply": 3, # Use in new situations
"analyze": 4, # Break down relationships
"evaluate": 5, # Make judgments
"create": 6 # Produce new work
}
# Question Template
{
"id": "q001",
"type": "multiple_choice", # or "short_answer", "coding", "essay"
"difficulty": "medium", # beginner, medium, advanced, expert
"bloom_level": "apply",
"topic": "neural_networks",
"question": "Question text with context",
"options": ["A", "B", "C", "D"],
"correct_answer": "B",
"explanation": "Why B is correct and others are wrong",
"time_limit": 120, # seconds
"points": 10,
"tags": ["supervised_learning", "backpropagation"]
}
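The template above can be enforced with a small validator. This is a minimal sketch; the required-field set mirrors the example and should be adapted to your own schema (`validate_question` is an illustrative name, not part of any library):

```python
# Fields every question in the template above must carry
REQUIRED_FIELDS = {
    "id", "type", "difficulty", "bloom_level", "topic",
    "question", "correct_answer", "explanation", "points",
}

def validate_question(q):
    """Return (ok, missing_fields) for a question dict."""
    missing = REQUIRED_FIELDS - q.keys()
    # Multiple-choice questions additionally need an options list
    if q.get("type") == "multiple_choice" and "options" not in q:
        missing = missing | {"options"}
    return (not missing, sorted(missing))
```

Running this on the example question above returns `(True, [])`; an incomplete dict returns the sorted list of missing fields.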
Difficulty Progression
# Adaptive difficulty scaling
def calculate_next_difficulty(user_performance):
"""
Adjust difficulty based on user accuracy and speed.
Performance bands:
- 90%+ correct, fast → increase 2 levels
- 70-90% correct → increase 1 level
- 50-70% correct → maintain level
- <50% correct → decrease 1 level
"""
accuracy = user_performance["correct"] / user_performance["total"]
avg_time = user_performance["avg_time"]
if accuracy >= 0.9 and avg_time < user_performance["expected_time"]:
return "increase_2"
elif accuracy >= 0.7:
return "increase_1"
elif accuracy >= 0.5:
return "maintain"
else:
return "decrease_1"
Bias Detection Checklist
bias_checks:
language:
- avoid_gendered_pronouns: true
- use_inclusive_examples: true
- check_cultural_assumptions: true
accessibility:
- provide_alt_text_for_images: true
- avoid_color_only_cues: true
- support_screen_readers: true
fairness:
- balanced_topic_coverage: true
- no_trick_questions: true
- clear_success_criteria: true
Level 2: Implementation Details (Under 2000 tokens)
Multi-Format Question Types
1. Multiple Choice (MCQ)
class MultipleChoiceQuestion:
"""Best for: Knowledge recall, concept understanding (Bloom's 1-3)"""
def __init__(self, stem, options, correct_index, distractors):
self.stem = stem
self.options = options # List of 4-5 options
self.correct_index = correct_index
self.distractors = distractors # Common misconceptions
def validate_quality(self):
"""Quality checks for MCQ"""
checks = {
"stem_clear": len(self.stem.split()) >= 10,
"options_homogeneous": self._check_option_length_variance() < 0.3,
"no_all_of_above": "all of the above" not in str(self.options).lower(),
"no_none_of_above": "none of the above" not in str(self.options).lower(),
"distractors_plausible": len(self.distractors) >= 3
}
return all(checks.values()), checks
def _check_option_length_variance(self):
lengths = [len(opt) for opt in self.options]
return (max(lengths) - min(lengths)) / max(lengths)
2. Coding Challenges
class CodingQuestion:
"""Best for: Application, analysis (Bloom's 3-4)"""
def __init__(self, problem, test_cases, starter_code, hints):
self.problem = problem
self.test_cases = test_cases # [{input, expected_output, points}]
self.starter_code = starter_code
self.hints = hints # Progressive hints
def auto_grade(self, submission):
"""Run test cases and calculate score"""
results = []
for i, test in enumerate(self.test_cases):
try:
output = self._execute_code(submission, test["input"])
passed = output == test["expected_output"]
results.append({
"test_id": i,
"passed": passed,
"points": test["points"] if passed else 0,
"feedback": self._generate_feedback(output, test["expected_output"])
})
except Exception as e:
results.append({
"test_id": i,
"passed": False,
"points": 0,
"error": str(e)
})
return {
"total_score": sum(r["points"] for r in results),
"max_score": sum(t["points"] for t in self.test_cases),
"results": results
}
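`_execute_code` is left abstract above. One common approach is to run the submission in a separate process with a timeout; the sketch below assumes a `solve(data)` entry-point convention (an assumption, not part of the class) and passes values across the process boundary as JSON. This is isolation-lite, not a full sandbox — production grading should add resource limits and a restricted environment:

```python
import json
import subprocess
import sys

def run_submission(code, test_input, timeout=5):
    """Run an untrusted submission in a child process with a timeout.

    Assumes the submission defines solve(data). Input and output
    cross the process boundary as JSON.
    """
    harness = (
        code
        + "\nimport sys, json"
        + "\nprint(json.dumps(solve(json.loads(sys.argv[1]))))"
    )
    proc = subprocess.run(
        [sys.executable, "-c", harness, json.dumps(test_input)],
        capture_output=True, text=True, timeout=timeout,
    )
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr.strip() or "submission failed")
    return json.loads(proc.stdout)
```

A timeout also protects the grader from infinite loops in submissions: `subprocess.run` raises `TimeoutExpired`, which the grading loop can record as a failed test.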
3. Essay/Short Answer
class EssayQuestion:
"""Best for: Evaluation, creation (Bloom's 5-6)"""
def __init__(self, prompt, rubric, word_limit):
self.prompt = prompt
self.rubric = rubric # Scoring criteria
self.word_limit = word_limit
def create_rubric(self):
"""
Example rubric structure:
{
"criteria": [
{
"name": "Clarity",
"weight": 0.3,
"levels": {
"exemplary": {"points": 4, "description": "Crystal clear"},
"proficient": {"points": 3, "description": "Mostly clear"},
"developing": {"points": 2, "description": "Somewhat unclear"},
"beginning": {"points": 1, "description": "Very unclear"}
}
},
{
"name": "Evidence",
"weight": 0.4,
"levels": {...}
},
{
"name": "Organization",
"weight": 0.3,
"levels": {...}
}
]
}
"""
return self.rubric
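Applying the weighted rubric can be sketched as follows. `score_essay` and the `level_choices` mapping are illustrative names, not part of the class above; the structure follows the example rubric in the docstring:

```python
def score_essay(rubric, level_choices):
    """Weighted rubric score normalized to [0, 1].

    level_choices maps criterion name -> chosen level name,
    e.g. {"Clarity": "proficient", "Evidence": "exemplary", ...}.
    """
    total = 0.0
    for criterion in rubric["criteria"]:
        levels = criterion["levels"]
        chosen = levels[level_choices[criterion["name"]]]
        # Normalize each criterion by its own maximum point value
        max_points = max(level["points"] for level in levels.values())
        total += criterion["weight"] * chosen["points"] / max_points
    return total
```

Because each criterion is normalized before weighting, criteria with different point scales contribute proportionally to their weights.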
Bloom's Taxonomy Alignment
# Question generation by Bloom's level
def generate_question_by_bloom_level(topic, level, content_context):
"""
Generate questions aligned to Bloom's taxonomy.
"""
bloom_templates = {
"remember": {
"verbs": ["define", "list", "recall", "identify", "name"],
"template": "What is the definition of {concept}?",
"example": "What is the definition of gradient descent in machine learning?"
},
"understand": {
"verbs": ["explain", "describe", "summarize", "interpret", "compare"],
"template": "Explain how {concept} works in the context of {context}.",
"example": "Explain how backpropagation works in neural network training."
},
"apply": {
"verbs": ["apply", "demonstrate", "use", "implement", "solve"],
"template": "Given {scenario}, how would you apply {concept} to solve {problem}?",
"example": "Given a dataset with missing values, how would you apply imputation techniques?"
},
"analyze": {
"verbs": ["analyze", "compare", "contrast", "differentiate", "examine"],
"template": "Compare {concept_a} and {concept_b}. What are the trade-offs?",
"example": "Compare L1 and L2 regularization. What are the trade-offs in model performance?"
},
"evaluate": {
"verbs": ["evaluate", "critique", "judge", "justify", "assess"],
"template": "Evaluate the effectiveness of {approach} for {use_case}. Justify your answer.",
"example": "Evaluate the effectiveness of CNNs vs Transformers for image classification."
},
"create": {
"verbs": ["design", "create", "develop", "construct", "formulate"],
"template": "Design a {solution} that {objective} while considering {constraints}.",
"example": "Design a recommendation system that balances accuracy and diversity."
}
}
template_info = bloom_templates[level]
return {
"level": level,
"verbs": template_info["verbs"],
"template": template_info["template"],
"example": template_info["example"],
"topic": topic
}
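For illustration, a returned template can be filled with `str.format`; the concept names here are arbitrary examples:

```python
# Fill an "analyze"-level template with concrete concepts
template = "Compare {concept_a} and {concept_b}. What are the trade-offs?"
question = template.format(
    concept_a="L1 regularization",
    concept_b="L2 regularization",
)
# question == "Compare L1 regularization and L2 regularization. What are the trade-offs?"
```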
Adaptive Assessment Engine
class AdaptiveAssessment:
"""
Dynamically adjust question difficulty based on user performance.
Implements Item Response Theory (IRT) principles.
"""
def __init__(self, question_bank, starting_difficulty="medium"):
self.question_bank = question_bank
self.current_difficulty = starting_difficulty
self.user_ability = 0.5 # Scale 0-1
self.history = []
def select_next_question(self):
"""
Select question matching user's current ability level.
Uses IRT to maximize information gain.
"""
        # Filter questions near the user's current ability estimate;
        # fall back to the full bank if none are within the window
        candidates = [
            q for q in self.question_bank
            if abs(q.difficulty_score - self.user_ability) < 0.2
        ] or list(self.question_bank)
# Prioritize untested topics
untested_topics = self._get_untested_topics()
preferred = [q for q in candidates if q.topic in untested_topics]
if preferred:
return max(preferred, key=lambda q: q.information_value(self.user_ability))
else:
return max(candidates, key=lambda q: q.information_value(self.user_ability))
def update_ability_estimate(self, question, correct):
"""
Update user ability estimate using Bayesian updating.
"""
# Simple ELO-like update
expected_prob = self._expected_probability(question.difficulty_score)
actual = 1 if correct else 0
K = 0.1 # Learning rate
self.user_ability += K * (actual - expected_prob)
self.user_ability = max(0, min(1, self.user_ability)) # Clamp [0,1]
self.history.append({
"question_id": question.id,
"difficulty": question.difficulty_score,
"correct": correct,
"ability_after": self.user_ability
})
def _expected_probability(self, question_difficulty):
"""Probability user answers correctly (logistic function)"""
import math
return 1 / (1 + math.exp(-3 * (self.user_ability - question_difficulty)))
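A self-contained walk-through of the update rule above (the slope of 3 matches `_expected_probability`, and `K` matches the learning rate in `update_ability_estimate`):

```python
import math

def expected_probability(ability, difficulty, slope=3.0):
    """Logistic probability of a correct answer, as in _expected_probability."""
    return 1 / (1 + math.exp(-slope * (ability - difficulty)))

ability = 0.5
K = 0.1  # Learning rate
# (question difficulty, 1 if answered correctly else 0)
for difficulty, correct in [(0.5, 1), (0.6, 1), (0.7, 0)]:
    ability += K * (correct - expected_probability(ability, difficulty))
    ability = max(0.0, min(1.0, ability))  # Clamp to [0, 1]
# Two correct answers raise the estimate; the miss pulls it back (ends near 0.56)
```

Note the asymmetry: a correct answer on an easy question (high expected probability) moves the estimate only slightly, while a surprise result moves it more.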
Level 3: Complete Reference (Full tokens)
Bias Detection and Mitigation
Automated Bias Checks
import re

class BiasDetector:
"""
Comprehensive bias detection for assessment questions.
"""
def __init__(self):
self.bias_patterns = self._load_bias_patterns()
self.inclusive_language_guide = self._load_inclusive_language()
def analyze_question(self, question_text):
"""Run all bias checks"""
results = {
"language_bias": self._check_language_bias(question_text),
"cultural_bias": self._check_cultural_bias(question_text),
"accessibility": self._check_accessibility(question_text),
"cognitive_load": self._check_cognitive_load(question_text),
"fairness": self._check_fairness(question_text)
}
results["overall_score"] = self._calculate_bias_score(results)
results["recommendations"] = self._generate_recommendations(results)
return results
def _check_language_bias(self, text):
"""Check for gendered language, idioms, jargon"""
issues = []
# Gendered pronouns
gendered_words = ["he", "she", "his", "her", "him", "himself", "herself"]
for word in gendered_words:
if re.search(rf'\b{word}\b', text, re.IGNORECASE):
issues.append({
"type": "gendered_language",
"word": word,
"suggestion": "Use 'they/them' or rephrase"
})
# Idioms that may not translate
idioms = ["piece of cake", "hit the nail on the head", "beat around the bush"]
for idiom in idioms:
if idiom.lower() in text.lower():
issues.append({
"type": "idiom",
"phrase": idiom,
"suggestion": "Use literal language"
})
# Unnecessary jargon
jargon_terms = self._detect_jargon(text)
for term in jargon_terms:
issues.append({
"type": "jargon",
"term": term,
"suggestion": f"Define '{term}' or use simpler language"
})
return {
"passed": len(issues) == 0,
"issues": issues,
"score": 1 - (len(issues) * 0.1) # -10% per issue
}
def _check_cultural_bias(self, text):
"""Check for cultural assumptions"""
issues = []
# Western-centric references
western_holidays = ["Christmas", "Thanksgiving", "Easter"]
for holiday in western_holidays:
if holiday in text:
issues.append({
"type": "cultural_reference",
"reference": holiday,
"suggestion": "Use culturally neutral examples"
})
# Currency assumptions (USD-centric)
if re.search(r'\$\d+', text):
issues.append({
"type": "currency_assumption",
"suggestion": "Specify currency or use generic units"
})
# Date format assumptions (MM/DD vs DD/MM)
if re.search(r'\d{1,2}/\d{1,2}/\d{2,4}', text):
issues.append({
"type": "date_format",
"suggestion": "Use ISO 8601 format (YYYY-MM-DD) or write out month"
})
return {
"passed": len(issues) == 0,
"issues": issues,
"score": 1 - (len(issues) * 0.15)
}
def _check_accessibility(self, text):
"""Check for accessibility issues"""
issues = []
# Images without alt text descriptions
if "<img" in text and "alt=" not in text:
issues.append({
"type": "missing_alt_text",
"suggestion": "Add descriptive alt text for all images"
})
# Color-only cues
color_cues = ["red circle", "green checkmark", "blue line"]
for cue in color_cues:
if cue in text.lower():
issues.append({
"type": "color_dependency",
"cue": cue,
"suggestion": "Add non-color identifiers (shape, label)"
})
# Reading level too high
readability_score = self._calculate_readability(text)
if readability_score > 12: # Above 12th grade level
issues.append({
"type": "high_reading_level",
"score": readability_score,
"suggestion": "Simplify language to 10th grade level or below"
})
return {
"passed": len(issues) == 0,
"issues": issues,
"readability_grade": readability_score,
"score": 1 - (len(issues) * 0.2)
}
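`_calculate_readability` is left abstract above. A rough Flesch-Kincaid grade estimate, using a crude vowel-group syllable heuristic, could look like this (a sketch; production code should use a proper syllable dictionary or an established readability library):

```python
import re

def flesch_kincaid_grade(text):
    """Approximate Flesch-Kincaid grade level for a question stem."""
    sentences = max(1, len(re.findall(r'[.!?]+', text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0

    def syllables(word):
        # Count runs of vowels as syllables; crude but serviceable
        return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

    total_syllables = sum(syllables(w) for w in words)
    return (0.39 * len(words) / sentences
            + 11.8 * total_syllables / len(words)
            - 15.59)
```

Short, simple sentences score well below the grade-12 threshold used in `_check_accessibility`; jargon-dense single sentences score far above it.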
Assessment Analytics
class AssessmentAnalytics:
"""
Post-assessment analysis for continuous improvement.
"""
def analyze_question_performance(self, question_id, responses):
"""
Calculate question statistics:
- Difficulty index (P-value)
- Discrimination index (point-biserial correlation)
- Distractor analysis
"""
total = len(responses)
correct_count = sum(1 for r in responses if r["correct"])
# Difficulty Index (P-value)
p_value = correct_count / total
difficulty = self._classify_difficulty(p_value)
# Discrimination Index
# Compare top 27% vs bottom 27% of overall performers
        sorted_responses = sorted(responses, key=lambda r: r["user_total_score"], reverse=True)
        k = max(1, int(total * 0.27))  # Guard against tiny response sets
        top_27 = sorted_responses[:k]
        bottom_27 = sorted_responses[-k:]
        top_correct = sum(1 for r in top_27 if r["correct"])
        bottom_correct = sum(1 for r in bottom_27 if r["correct"])
        discrimination_index = (top_correct - bottom_correct) / k
# Distractor analysis (for MCQ)
distractor_stats = self._analyze_distractors(responses)
return {
"question_id": question_id,
"total_responses": total,
"p_value": p_value,
"difficulty": difficulty,
"discrimination_index": discrimination_index,
"quality": self._assess_question_quality(p_value, discrimination_index),
"distractor_stats": distractor_stats,
"recommendations": self._generate_item_recommendations(p_value, discrimination_index)
}
def _classify_difficulty(self, p_value):
"""
P-value interpretation:
- 0.90-1.00: Very easy
- 0.70-0.89: Easy
- 0.30-0.69: Medium
- 0.10-0.29: Hard
- 0.00-0.09: Very hard
"""
if p_value >= 0.90:
return "very_easy"
elif p_value >= 0.70:
return "easy"
elif p_value >= 0.30:
return "medium"
elif p_value >= 0.10:
return "hard"
else:
return "very_hard"
def _assess_question_quality(self, p_value, discrimination):
"""
Quality criteria:
- Good: 0.30 < P < 0.70 and D > 0.30
- Acceptable: 0.20 < P < 0.80 and D > 0.20
- Poor: Otherwise
"""
if 0.30 < p_value < 0.70 and discrimination > 0.30:
return "good"
elif 0.20 < p_value < 0.80 and discrimination > 0.20:
return "acceptable"
else:
return "poor"
def _generate_item_recommendations(self, p_value, discrimination):
"""Actionable recommendations for question improvement"""
recs = []
if p_value > 0.90:
recs.append("Question too easy - increase difficulty or remove")
elif p_value < 0.10:
recs.append("Question too hard - verify answer key or simplify")
        if discrimination < 0:
            recs.append("CRITICAL: Negative discrimination - low performers doing better than high performers. Check answer key!")
        elif discrimination < 0.10:
            recs.append("Poor discrimination - question not distinguishing high/low performers")
if 0.30 < p_value < 0.70 and discrimination > 0.30:
recs.append("Excellent question - retain in question bank")
return recs
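The docstring of `analyze_question_performance` names point-biserial correlation as the discrimination measure, while the code uses the top/bottom 27% split. A minimal point-biserial implementation (using the population standard deviation, returning 0 when there is no variance to correlate) might look like:

```python
import math

def point_biserial(correct_flags, total_scores):
    """Point-biserial correlation between item correctness (0/1 flags)
    and each respondent's total score. Positive values mean stronger
    students tend to get the item right."""
    n = len(correct_flags)
    mean_all = sum(total_scores) / n
    sigma = math.sqrt(sum((s - mean_all) ** 2 for s in total_scores) / n)
    p = sum(correct_flags) / n  # Proportion answering correctly
    if sigma == 0 or p in (0, 1):
        return 0.0  # No variance to correlate
    mean_correct = (
        sum(s for f, s in zip(correct_flags, total_scores) if f)
        / sum(correct_flags)
    )
    return (mean_correct - mean_all) / sigma * math.sqrt(p / (1 - p))
```

Like the 27% split, values above roughly 0.20 indicate acceptable discrimination and negative values signal a probable answer-key error.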
Complete Assessment Workflow
# End-to-end assessment creation workflow
# 1. Define learning objectives
learning_objectives = [
{
"id": "LO1",
"description": "Understand gradient descent optimization",
"bloom_level": "understand",
"topic": "optimization"
},
{
"id": "LO2",
"description": "Apply backpropagation to train neural networks",
"bloom_level": "apply",
"topic": "neural_networks"
}
]
# 2. Generate questions aligned to objectives
question_generator = AssessmentGenerator(learning_objectives)
questions = question_generator.generate_balanced_assessment(
total_questions=20,
bloom_distribution={
"remember": 0.15,
"understand": 0.25,
"apply": 0.40,
"analyze": 0.15,
"evaluate": 0.05
},
difficulty_distribution={
"beginner": 0.20,
"medium": 0.50,
"advanced": 0.30
}
)
# 3. Run bias detection
bias_detector = BiasDetector()
for question in questions:
bias_report = bias_detector.analyze_question(question.text)
if bias_report["overall_score"] < 0.7:
question.flag_for_review(bias_report)
# 4. Create adaptive assessment
assessment = AdaptiveAssessment(questions, starting_difficulty="medium")
# 5. Administer and collect responses
user_session = assessment.start_session(user_id="user123")
while not assessment.is_complete():
next_q = assessment.select_next_question()
user_response = user_session.present_question(next_q)
assessment.update_ability_estimate(next_q, user_response["correct"])
# 6. Generate results and analytics
results = assessment.get_results()
analytics = AssessmentAnalytics()
item_analysis = analytics.analyze_all_questions(assessment.history)
# 7. Export report
report = {
"user_id": "user123",
"final_ability": assessment.user_ability,
"questions_completed": len(assessment.history),
"score": results["score"],
"strengths": results["strengths"],
"weaknesses": results["weaknesses"],
"recommended_next_steps": results["recommendations"],
"item_analysis": item_analysis
}
Integration with Learning Management Systems
# LTI (Learning Tools Interoperability) integration example
from datetime import datetime

class LTIAssessmentProvider:
"""
Integrate adaptive assessments with Canvas, Moodle, Blackboard via LTI 1.3.
"""
def launch_assessment(self, lti_launch_data):
"""Handle LTI launch request from LMS"""
user_id = lti_launch_data["user_id"]
course_id = lti_launch_data["context_id"]
# Initialize adaptive assessment for user
assessment = self._get_or_create_assessment(user_id, course_id)
return {
"launch_url": f"/assessment/{assessment.id}",
"user": user_id,
"course": course_id
}
def submit_grade(self, assessment_id, score):
"""Send grade back to LMS via LTI Outcomes service"""
assessment = self._load_assessment(assessment_id)
lti_outcome = {
"lis_result_sourcedid": assessment.sourcedid,
"score": score, # Normalized 0-1
"timestamp": datetime.utcnow().isoformat()
}
return self._post_grade_to_lms(lti_outcome)
This skill provides comprehensive assessment creation patterns covering adaptive testing, Bloom's taxonomy alignment, bias detection, and LMS integration.
Success Output
When successful, this skill MUST output:
✅ SKILL COMPLETE: assessment-creation-patterns
Completed:
- [x] Assessment questions generated across all Bloom's levels
- [x] Bias detection passed with >70% score
- [x] Adaptive difficulty algorithm implemented
- [x] Test suite validates question quality
- [x] LMS integration configured
Outputs:
- questions/module_1_assessment.json (Question bank with metadata)
- src/adaptive_assessment.py (Adaptive engine implementation)
- src/bias_detector.py (Bias analysis tooling)
- reports/item_analysis.csv (Question performance metrics)
Completion Checklist
Before marking this skill as complete, verify:
- Questions span all 6 Bloom's taxonomy levels
- Each question has difficulty score (0-1)
- Bias detector ran with no BLOCKING issues
- Adaptive algorithm adjusts difficulty based on performance
- MCQ distractors are plausible and tested
- Coding challenges have automated grading
- Essay rubrics have clear scoring criteria
- Question bank includes complete metadata for each question (topic, Bloom's level, difficulty, tags)
- Item analysis shows discrimination index >0.20
- All questions validated against JSON schema
Failure Indicators
This skill has FAILED if:
- ❌ Questions all at one Bloom's level (no progression)
- ❌ Bias score below 70% with unresolved issues
- ❌ Adaptive algorithm stuck at same difficulty
- ❌ Discrimination index negative (low performers do better)
- ❌ MCQ options have "all of the above" or "none of the above"
- ❌ Coding tests lack test cases or auto-grading
- ❌ Essay rubrics missing or have vague criteria
- ❌ Cultural bias detected (holidays, currency, idioms)
- ❌ Accessibility issues (missing alt text, color-only cues)
When NOT to Use
Do NOT use this skill when:
- Creating simple quizzes with <5 questions (use basic MCQ patterns instead)
- No need for adaptive difficulty (use `static-assessment-patterns` instead)
- Building surveys or opinion polls (use `survey-design-patterns` instead)
- Purely subjective assessments with no right answers
- Target audience too narrow for bias detection value
- No LMS integration needed and simple grading suffices
- Assessment must be paper-based without digital tools
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| All questions at "remember" level | No higher-order thinking tested | Distribute across Bloom's levels 1-6 |
| Using "all of the above" | Reduces question quality | Write specific distractors |
| Gendered language | Bias and exclusion | Use "they/them" or rephrase |
| Color-only cues | Accessibility failure | Add shape/label identifiers |
| Vague rubrics | Subjective grading | Define clear criteria per level |
| No distractor analysis | Poor MCQ quality | Track which options chosen, refine |
| Fixed difficulty | Boredom or frustration | Implement adaptive selection |
| Cultural assumptions | Bias against global audience | Use neutral examples |
| Skipping item analysis | Can't improve questions | Run P-value and discrimination checks |
Principles
This skill embodies:
- #5 Eliminate Ambiguity - Clear success criteria in rubrics, unambiguous question stems
- #6 Clear, Understandable, Explainable - Questions readable at 10th grade level or below
- #7 Fairness and Bias Mitigation - Bias detection, inclusive language, accessibility checks
- #8 No Assumptions - Cultural neutrality, explicit definitions for jargon
- #10 Test Everything - Item analysis validates question quality with data
Full Standard: CODITECT-STANDARD-AUTOMATION.md