Training model: claude-opus-4-20250514

Inference model: gpt-4.1-mini

| Dataset | Validation accuracy | Test accuracy |
|---|---|---|
| espionage | 0.9 | 0.95 |
| potions | 0.85 | 0.45 |
| southgermancredit | 0.6346 | 0.7876 |
| timetravel_insurance | 0.9 | 0.8 |
| titanic | 0.7255 | 0.6078 |
| wisconsin | 0.8889 | 0.8939 |
| espionage | 0.95 | 1.0 |
| potions | 0.75 | 0.3 |
| timetravel_insurance | 0.7 | 0.55 |
| espionage | 0.85 | 0.8 |
| potions | 0.8 | 0.5 |
| timetravel_insurance | 0.75 | 0.7 |