
Tests for the Self-Improving Eval Loop (H.3.6.7).

Tests cover:

  • EvalCase and EvalResult dataclasses
  • F1 score computation
  • Eval runner with mocked API calls
  • Critic agent response parsing
  • Improvement applier validation
  • Eval loop convergence and termination
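As an illustration of the "mocked API calls" item above, a test of this kind might replace the model client with a `unittest.mock.MagicMock` so no network request is made. This is only a sketch: `run_case`, the `client.complete` method, and the prompt format are hypothetical names chosen for the example, not the module's actual API.

```python
from unittest.mock import MagicMock


def run_case(client, prompt: str, text: str) -> str:
    """Hypothetical runner: send the filled-in prompt and normalize the reply."""
    response = client.complete(prompt.format(input=text))
    return response.strip().lower()


def test_run_case_with_mocked_client():
    # The mock stands in for a real API client (assumed interface).
    client = MagicMock()
    client.complete.return_value = "  Positive\n"

    label = run_case(client, "Classify: {input}", "great product")

    # The reply is normalized, and the client is called exactly once
    # with the fully formatted prompt.
    assert label == "positive"
    client.complete.assert_called_once_with("Classify: great product")


test_run_case_with_mocked_client()
```

Passing the client in as an argument keeps the runner testable without patching module globals.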

File: test_self_improving_eval.py

Classes

TestEvalCase

Tests for EvalCase dataclass.

TestEvalResult

Tests for EvalResult dataclass.

TestF1Computation

Tests for F1 score computation.

TestLoadEvalCases

Tests for loading eval cases from JSONL.

TestLoadPromptTemplate

Tests for loading prompt templates.

TestGetWorstCases

Tests for selecting the worst-performing eval cases.

TestCriticAnalysis

Tests for CriticAnalysis dataclass.

TestProposeChanges

Tests for propose_prompt_changes and propose_eval_changes.

TestImprovementApplier

Tests for ImprovementApplier.

TestValidateChanges

Tests for validate_changes function.

Functions

test_create_eval_case()

Test creating an EvalCase.

test_eval_case_with_metadata()

Test EvalCase with metadata.

test_create_eval_result()

Test creating an EvalResult.

test_eval_result_to_dict()

Test EvalResult serialization.
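For orientation, the dataclass tests above could exercise objects shaped roughly like the following. The field names (`id`, `input`, `expected`, `metadata`, `case_id`, `predicted`, `correct`) are assumptions made for this sketch; the real `EvalCase` and `EvalResult` definitions may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class EvalCase:
    # Hypothetical fields; metadata defaults to an empty dict per instance.
    id: str
    input: str
    expected: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class EvalResult:
    # Hypothetical fields describing the outcome of one eval case.
    case_id: str
    predicted: str
    expected: str
    correct: bool

    def to_dict(self) -> Dict[str, Any]:
        """Serialize to a plain dict, e.g. for JSON output."""
        return asdict(self)


case = EvalCase(id="c1", input="hello", expected="greeting")
result = EvalResult(case_id=case.id, predicted="greeting",
                    expected=case.expected, correct=True)
serialized = result.to_dict()
```

Using `field(default_factory=dict)` avoids sharing one mutable metadata dict across instances, which is a property a metadata test would plausibly check.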

test_perfect_f1()

Test F1 with all correct predictions.

test_all_wrong_f1()

Test F1 with all incorrect predictions.

test_partial_f1()

Test F1 with partial correctness.

test_empty_inputs()

Test F1 with empty inputs.
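The four F1 tests above correspond to the usual edge cases of the metric. The sketch below assumes a binary F1 over string labels with a designated positive class, and assumes that empty or all-wrong inputs yield 0.0. The function name `compute_f1` and its signature are hypothetical, and the module under test may compute F1 differently (for example, macro-averaged).

```python
from typing import Sequence


def compute_f1(predicted: Sequence[str], expected: Sequence[str],
               positive: str = "yes") -> float:
    """Binary F1 for the `positive` label; returns 0.0 when undefined."""
    pairs = list(zip(predicted, expected))
    tp = sum(p == positive and e == positive for p, e in pairs)
    fp = sum(p == positive and e != positive for p, e in pairs)
    fn = sum(p != positive and e == positive for p, e in pairs)
    if tp == 0:
        # Covers empty inputs and the all-wrong case without dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


perfect = compute_f1(["yes", "no", "yes"], ["yes", "no", "yes"])
all_wrong = compute_f1(["no", "yes"], ["yes", "no"])
partial = compute_f1(["yes", "yes", "no", "no"], ["yes", "no", "yes", "no"])
empty = compute_f1([], [])
```

In the partial example there is one true positive, one false positive, and one false negative, so precision and recall are both 0.5 and so is F1.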

test_load_valid_jsonl()

Test loading a valid JSONL file.

test_load_with_ids()

Test loading JSONL with explicit IDs.

test_load_auto_assigns_ids()

Test auto-assignment of IDs when not provided.
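The three loading tests above suggest a loader that reads one JSON object per line, keeps an explicit `id` when present, and generates one otherwise. The following is a minimal sketch under those assumptions. The function name `load_eval_cases`, the `case_<index>` ID scheme, and the record keys are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_eval_cases(path: Path) -> List[Dict[str, Any]]:
    """Read JSONL records, skipping blank lines and filling in missing IDs."""
    cases: List[Dict[str, Any]] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Assumed scheme: the ID is derived from the record's position.
            record.setdefault("id", f"case_{len(cases)}")
            cases.append(record)
    return cases


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cases.jsonl"
    path.write_text(
        '{"id": "a", "input": "x", "expected": "y"}\n'
        "\n"
        '{"input": "p", "expected": "q"}\n',
        encoding="utf-8",
    )
    cases = load_eval_cases(path)
```

Writing the fixture to a temporary directory keeps the test independent of files in the repository.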

test_load_prompt()

Test loading a prompt template.

test_get_worst_k()

Test selecting the K worst-performing cases.
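One plausible reading of "worst cases" is the K results with the lowest score, which is what a critic step would want to inspect first. The sketch below assumes each result is a dict with a numeric `score` key. Both that shape and the name `get_worst_cases` are hypothetical.

```python
import heapq
from typing import Any, Dict, List


def get_worst_cases(results: List[Dict[str, Any]], k: int) -> List[Dict[str, Any]]:
    """Return up to k results with the lowest scores, worst first."""
    return heapq.nsmallest(k, results, key=lambda r: r["score"])


results = [
    {"id": "a", "score": 0.9},
    {"id": "b", "score": 0.1},
    {"id": "c", "score": 0.5},
]
worst_two = [r["id"] for r in get_worst_cases(results, 2)]
```

`heapq.nsmallest` handles a `k` larger than the input by returning every result in ascending score order, so no explicit bounds check is needed.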

test_create_analysis()

Test creating CriticAnalysis.

test_analysis_to_dict()

Test CriticAnalysis serialization.

Usage

python test_self_improving_eval.py