Training model: claude-sonnet-4-20250514

Inference model: gpt-4.1-mini

| Dataset | Validation accuracy | Test accuracy |
|---|---|---|
| espionage | 0.95 | 0.95 |
| potions | 0.75 | 0.7 |
| southgermancredit | 0.6634615384615384 | 0.5663716814159292 |
| timetravel_insurance | 0.6 | 0.65 |
| titanic | 0.803921568627451 | 0.7450980392156863 |
| wisconsin | 0.873015873015873 | 0.8484848484848485 |
| espionage | 1.0 | 1.0 |
| potions | 0.7 | 0.65 |
| southgermancredit | 0.6346153846153846 | 0.6460176991150443 |
| timetravel_insurance | 0.75 | 0.7 |
| wisconsin | 0.9047619047619048 | 0.8787878787878788 |
| espionage | 0.95 | 0.95 |
| potions | 0.75 | 0.7 |
| timetravel_insurance | 0.9 | 0.8 |
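
Several datasets appear more than once in the table, so a per-dataset summary can make the rows easier to compare. The following is a minimal sketch, assuming each repeated dataset name is a separate run of the same setup (the source does not label the runs). The `ROWS` list and the `summarize` helper are hypothetical names introduced only for illustration; the values are copied from the table.

```python
from collections import defaultdict
from statistics import mean

# Rows copied from the table: (dataset, validation accuracy, test accuracy).
# Assumption: repeated dataset names are separate runs; the source does not say so.
ROWS = [
    ("espionage", 0.95, 0.95),
    ("potions", 0.75, 0.7),
    ("southgermancredit", 0.6634615384615384, 0.5663716814159292),
    ("timetravel_insurance", 0.6, 0.65),
    ("titanic", 0.803921568627451, 0.7450980392156863),
    ("wisconsin", 0.873015873015873, 0.8484848484848485),
    ("espionage", 1.0, 1.0),
    ("potions", 0.7, 0.65),
    ("southgermancredit", 0.6346153846153846, 0.6460176991150443),
    ("timetravel_insurance", 0.75, 0.7),
    ("wisconsin", 0.9047619047619048, 0.8787878787878788),
    ("espionage", 0.95, 0.95),
    ("potions", 0.75, 0.7),
    ("timetravel_insurance", 0.9, 0.8),
]


def summarize(rows):
    """Group rows by dataset; return {dataset: (n_rows, mean_val, mean_test)}."""
    grouped = defaultdict(list)
    for name, val, test in rows:
        grouped[name].append((val, test))
    return {
        name: (
            len(pairs),
            mean(v for v, _ in pairs),
            mean(t for _, t in pairs),
        )
        for name, pairs in grouped.items()
    }


if __name__ == "__main__":
    # Print one line per dataset, sorted by name.
    for name, (n, val, test) in sorted(summarize(ROWS).items()):
        print(f"{name:<22} n={n}  val={val:.3f}  test={test:.3f}")
```

With one to three rows per dataset, these means are only a rough summary and say nothing reliable about variance between runs.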