Tests for Self-Improving Eval Loop (H.3.6.7).

Tests cover:

EvalCase and EvalResult dataclasses
F1 score computation
Eval runner with mocked API calls
Critic agent response parsing
Improvement applier validation
Eval loop convergence and termination

File: test_self_improving_eval.py

Classes

`TestEvalCase`

Tests for EvalCase dataclass.

`TestEvalResult`

Tests for EvalResult dataclass.

`TestF1Computation`

Tests for F1 score computation.

`TestLoadEvalCases`

Tests for loading eval cases from JSONL.

`TestLoadPromptTemplate`

Tests for loading prompt templates.

`TestGetWorstCases`

Tests for getting worst cases.

`TestCriticAnalysis`

Tests for CriticAnalysis dataclass.

`TestProposeChanges`

Tests for propose_prompt_changes and propose_eval_changes.

`TestImprovementApplier`

Tests for ImprovementApplier.

`TestValidateChanges`

Tests for validate_changes function.

Functions

`test_create_eval_case()`

Test creating an EvalCase.

`test_eval_case_with_metadata()`

Test EvalCase with metadata.

`test_create_eval_result()`

Test creating an EvalResult.

`test_eval_result_to_dict()`

Test EvalResult serialization.
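The dataclass tests above might look roughly like the following. This is a minimal sketch, not the module's actual code: the field names (`id`, `input`, `expected`, `metadata`, `case_id`, `predicted`, `correct`) and the `to_dict()` implementation are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class EvalCase:
    # Hypothetical fields; the real EvalCase may differ.
    id: str
    input: str
    expected: str
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class EvalResult:
    # Hypothetical fields; the real EvalResult may differ.
    case_id: str
    predicted: str
    expected: str
    correct: bool

    def to_dict(self) -> dict[str, Any]:
        # asdict() converts the dataclass into a plain dict.
        return asdict(self)


def test_eval_case_with_metadata() -> None:
    case = EvalCase(id="c1", input="2 + 2", expected="4", metadata={"topic": "math"})
    assert case.metadata["topic"] == "math"


def test_eval_result_to_dict() -> None:
    result = EvalResult(case_id="c1", predicted="4", expected="4", correct=True)
    assert result.to_dict() == {
        "case_id": "c1",
        "predicted": "4",
        "expected": "4",
        "correct": True,
    }
```

Using `field(default_factory=dict)` gives each case its own metadata dict, so a test that mutates one case's metadata cannot leak into another.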

`test_perfect_f1()`

Test F1 with all correct predictions.

`test_all_wrong_f1()`

Test F1 with all incorrect predictions.

`test_partial_f1()`

Test F1 with partial correctness.

`test_empty_inputs()`

Test F1 with empty inputs.
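The four F1 cases above can be illustrated with a small sketch. It assumes a binary F1 over parallel lists of boolean predictions and labels, and a hypothetical `compute_f1` helper that returns 0.0 when there are no true positives; the function under test in the real module may have a different name, signature, or edge-case convention.

```python
def compute_f1(predicted: list[bool], expected: list[bool]) -> float:
    """Binary F1 for the positive class (hypothetical helper)."""
    pairs = list(zip(predicted, expected))
    tp = sum(1 for p, e in pairs if p and e)
    fp = sum(1 for p, e in pairs if p and not e)
    fn = sum(1 for p, e in pairs if not p and e)
    if tp == 0:
        # Covers empty inputs and the all-wrong case; avoids division by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def test_perfect_f1() -> None:
    assert compute_f1([True, False, True], [True, False, True]) == 1.0


def test_all_wrong_f1() -> None:
    assert compute_f1([False, True], [True, False]) == 0.0


def test_partial_f1() -> None:
    # One true positive, one false positive, one false negative:
    # precision = recall = 0.5, so F1 = 0.5.
    assert compute_f1([True, True, False, False], [True, False, True, False]) == 0.5


def test_empty_inputs() -> None:
    assert compute_f1([], []) == 0.0
```

Returning 0.0 for empty inputs is one common convention; an implementation could equally raise an error or return 1.0, so the empty-input test is where the chosen convention gets pinned down.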

`test_load_valid_jsonl()`

Test loading valid JSONL file.

`test_load_with_ids()`

Test loading JSONL with explicit IDs.

`test_load_auto_assigns_ids()`

Test auto-assignment of IDs when not provided.
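A loader with this behavior could be tested along the following lines. This is a sketch under stated assumptions: the `load_eval_cases` signature, the record keys (`id`, `input`, `expected`), and the `case_<line index>` format for auto-assigned IDs are all hypothetical and chosen only for illustration.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class EvalCase:
    # Hypothetical fields; the real EvalCase may differ.
    id: str
    input: str
    expected: str


def load_eval_cases(path: Path) -> list[EvalCase]:
    """Read one JSON object per line, skipping blank lines (hypothetical helper)."""
    cases = []
    with path.open(encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Fall back to an ID derived from the zero-based line index.
            case_id = str(record.get("id", f"case_{index}"))
            cases.append(EvalCase(case_id, record["input"], record["expected"]))
    return cases


def test_load_auto_assigns_ids() -> None:
    rows = [
        {"id": "explicit", "input": "a", "expected": "A"},
        {"input": "b", "expected": "B"},  # no "id" key
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "cases.jsonl"
        path.write_text("\n".join(json.dumps(row) for row in rows), encoding="utf-8")
        cases = load_eval_cases(path)
    assert [case.id for case in cases] == ["explicit", "case_1"]
```

Writing the fixture into a temporary directory keeps the test independent of any files checked into the repository.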

`test_load_prompt()`

Test loading prompt template.

`test_get_worst_k()`

Test getting K worst cases.
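Selecting the K worst cases can be sketched as a sort by score. The example assumes each result carries a numeric `score` where lower is worse, and a hypothetical `get_worst_cases(results, k)` helper; the real function may rank by a different field.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    # Hypothetical fields; lower score means a worse result.
    case_id: str
    score: float


def get_worst_cases(results: list[EvalResult], k: int) -> list[EvalResult]:
    """Return up to k results with the lowest scores, worst first."""
    if k <= 0:
        return []
    # sorted() is stable, so ties keep their original relative order.
    return sorted(results, key=lambda result: result.score)[:k]


def test_get_worst_k() -> None:
    results = [
        EvalResult("a", 0.9),
        EvalResult("b", 0.1),
        EvalResult("c", 0.5),
        EvalResult("d", 0.3),
    ]
    worst = get_worst_cases(results, k=2)
    assert [result.case_id for result in worst] == ["b", "d"]
```

Slicing after the sort means that asking for more cases than exist simply returns all of them, which is worth covering with its own assertion.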

`test_create_analysis()`

Test creating CriticAnalysis.

`test_analysis_to_dict()`

Test CriticAnalysis serialization.

Usage

python test_self_improving_eval.py
