Tests for Self-Improving Eval Loop (H.3.6.7).
Tests cover:
- EvalCase and EvalResult dataclasses
- F1 score computation
- Eval runner with mocked API calls
- Critic agent response parsing
- Improvement applier validation
- Eval loop convergence and termination
File: test_self_improving_eval.py
Classes
TestEvalCase
Tests for the EvalCase dataclass.
TestEvalResult
Tests for the EvalResult dataclass.
TestF1Computation
Tests for F1 score computation.
TestLoadEvalCases
Tests for loading eval cases from JSONL.
TestLoadPromptTemplate
Tests for loading prompt templates.
TestGetWorstCases
Tests for getting the worst cases.
TestCriticAnalysis
Tests for the CriticAnalysis dataclass.
TestProposeChanges
Tests for propose_prompt_changes and propose_eval_changes.
TestImprovementApplier
Tests for ImprovementApplier.
TestValidateChanges
Tests for the validate_changes function.
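As an illustration of what a class such as TestF1Computation might check, here is a minimal sketch. The compute_f1 helper, its signature, and the choice to return 0.0 for empty inputs are assumptions made for this example; the module under test may define F1 differently (for instance over token sets or multi-class labels).

```python
import unittest


def compute_f1(predictions, labels):
    """Hypothetical binary F1: harmonic mean of precision and recall."""
    # Assumption: empty or mismatched inputs score 0.0 instead of raising.
    if not predictions or len(predictions) != len(labels):
        return 0.0
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p and y)          # true positives
    fp = sum(1 for p, y in pairs if p and not y)      # false positives
    fn = sum(1 for p, y in pairs if not p and y)      # false negatives
    if tp == 0:
        # No true positives means precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


class TestF1ComputationSketch(unittest.TestCase):
    """Mirrors the four F1 cases listed in this document."""

    def test_perfect_f1(self):
        self.assertEqual(compute_f1([True, False, True], [True, False, True]), 1.0)

    def test_all_wrong_f1(self):
        self.assertEqual(compute_f1([True, False], [False, True]), 0.0)

    def test_partial_f1(self):
        # One true positive, one false positive, one false negative.
        self.assertAlmostEqual(compute_f1([True, True, False], [True, False, True]), 0.5)

    def test_empty_inputs(self):
        self.assertEqual(compute_f1([], []), 0.0)
```

Guarding the tp == 0 case before dividing keeps the all-wrong and empty cases from raising ZeroDivisionError, which is the kind of edge behaviour these tests appear intended to pin down.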
Functions
test_create_eval_case()
Test creating an EvalCase.
test_eval_case_with_metadata()
Test EvalCase with metadata.
test_create_eval_result()
Test creating an EvalResult.
test_eval_result_to_dict()
Test EvalResult serialization.
test_perfect_f1()
Test F1 with all correct predictions.
test_all_wrong_f1()
Test F1 with all incorrect predictions.
test_partial_f1()
Test F1 with partial correctness.
test_empty_inputs()
Test F1 with empty inputs.
test_load_valid_jsonl()
Test loading a valid JSONL file.
test_load_with_ids()
Test loading JSONL with explicit IDs.
test_load_auto_assigns_ids()
Test auto-assignment of IDs when not provided.
test_load_prompt()
Test loading a prompt template.
test_get_worst_k()
Test getting the K worst cases.
test_create_analysis()
Test creating a CriticAnalysis.
test_analysis_to_dict()
Test CriticAnalysis serialization.
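The JSONL loading and worst-case tests listed above could exercise behaviour along the following lines. This is a sketch under stated assumptions: the EvalCase fields, the load_eval_cases and get_worst_cases names and signatures, the "case_N" ID scheme, and the (case_id, score) pair format are all hypothetical and are not taken from the module under test.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class EvalCase:
    # Hypothetical fields; the real dataclass may differ.
    id: str
    input: str
    expected: str
    metadata: dict = field(default_factory=dict)


def load_eval_cases(path):
    """Read one JSON object per line; assign an ID when none is given."""
    cases = []
    with open(path, encoding="utf-8") as fh:
        for index, line in enumerate(fh):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            cases.append(
                EvalCase(
                    # Assumed scheme: fall back to the zero-based line index.
                    id=record.get("id", f"case_{index}"),
                    input=record["input"],
                    expected=record["expected"],
                    metadata=record.get("metadata", {}),
                )
            )
    return cases


def get_worst_cases(scored, k):
    """Return the k lowest-scoring (case_id, score) pairs, worst first."""
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Usage: write a two-line JSONL file, one record with an explicit ID
# and one without, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "cases.jsonl"
    jsonl_path.write_text(
        '{"id": "a", "input": "x", "expected": "y"}\n'
        '{"input": "p", "expected": "q"}\n',
        encoding="utf-8",
    )
    loaded = load_eval_cases(jsonl_path)

worst_two = get_worst_cases([("a", 0.9), ("b", 0.2), ("c", 0.5)], k=2)
```

Writing the fixture into a temporary directory keeps such a test independent of files on disk, which is a common pattern for tests like test_load_valid_jsonl and test_load_auto_assigns_ids.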
Usage
python test_self_improving_eval.py
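Running the file directly only executes tests if the module has an entry point. Whether the real file uses unittest or calls pytest from its main guard is not stated here; the following is a minimal sketch assuming unittest, with a placeholder TestSmoke class and a hypothetical run_suite helper standing in for the real test classes.

```python
import unittest


class TestSmoke(unittest.TestCase):
    # Placeholder test so the sketch has something to run.
    def test_truth(self):
        self.assertTrue(True)


def run_suite():
    """Load the placeholder test class and run it with a text runner."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSmoke)
    return unittest.TextTestRunner(verbosity=2).run(suite)


if __name__ == "__main__":
    result = run_suite()
    # Exit non-zero on failure so shells and CI can detect it.
    if not result.wasSuccessful():
        raise SystemExit(1)
```

If pytest is installed, python -m pytest test_self_improving_eval.py is another way to run the same file, since pytest also collects unittest.TestCase subclasses.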