Training model: o3

Inference model: gpt-4o-mini

| Dataset | Validation accuracy | Test accuracy |
|---|---|---|
| espionage | 1.0000 | 0.7500 |
| potions | 0.8000 | 0.8000 |
| southgermancredit | 0.5962 | 0.6018 |
| timetravel_insurance | 0.8500 | 0.8000 |
| titanic | 0.7647 | 0.7843 |
| wisconsin | 0.9048 | 0.7273 |