Training model: claude-opus-4-20250514
Inference model: gpt-4.1-mini

Dataset | Validation accuracy | Test accuracy |
---|---|---|
espionage | 0.95 | 0.9 |
potions | 0.75 | 0.6 |
southgermancredit | 0.5962 | 0.6372 |
timetravel_insurance | 0.9 | 0.8 |
titanic | 0.7255 | 0.7843 |
wisconsin | 0.7619 | 0.7879 |
espionage | 1.0 | 0.85 |
potions | 0.7 | 0.75 |
timetravel_insurance | 0.9 | 0.8 |
espionage | 1.0 | 0.8 |
potions | 0.85 | 0.55 |
timetravel_insurance | 0.95 | 0.7 |