Model opus40

Training model: claude-opus-4-20250514
Inference model: gpt-4.1-mini

Investigations

Performance

Dataset               Validation accuracy  Test accuracy
espionage             0.9                  0.95
potions               0.85                 0.45
southgermancredit     0.6346153846153846   0.7876106194690266
timetravel_insurance  0.9                  0.8
titanic               0.7254901960784313   0.6078431372549019
wisconsin             0.8888888888888888   0.8939393939393939
espionage             0.95                 1.0
potions               0.75                 0.3
timetravel_insurance  0.7                  0.55
espionage             0.85                 0.8
potions               0.8                  0.5
timetravel_insurance  0.75                 0.7
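
The gap between validation and test accuracy is easiest to read when computed per row. The following is a minimal sketch, not part of the original report: the function names are hypothetical, and treating the repeated dataset names as separate runs of the same dataset is an assumption, since the source does not say what distinguishes them.

```python
# Rows copied from the performance table above, in their original order:
# (dataset, validation accuracy, test accuracy).
ROWS = [
    ("espionage", 0.9, 0.95),
    ("potions", 0.85, 0.45),
    ("southgermancredit", 0.6346153846153846, 0.7876106194690266),
    ("timetravel_insurance", 0.9, 0.8),
    ("titanic", 0.7254901960784313, 0.6078431372549019),
    ("wisconsin", 0.8888888888888888, 0.8939393939393939),
    ("espionage", 0.95, 1.0),
    ("potions", 0.75, 0.3),
    ("timetravel_insurance", 0.7, 0.55),
    ("espionage", 0.85, 0.8),
    ("potions", 0.8, 0.5),
    ("timetravel_insurance", 0.75, 0.7),
]


def generalization_gaps(rows):
    """Return (dataset, validation - test) for each row, preserving order."""
    return [(name, val - test) for name, val, test in rows]


def mean_test_accuracy(rows):
    """Average test accuracy per dataset name.

    Assumption: rows sharing a dataset name are repeated runs.
    """
    grouped = {}
    for name, _val, test in rows:
        grouped.setdefault(name, []).append(test)
    return {name: sum(vals) / len(vals) for name, vals in grouped.items()}


if __name__ == "__main__":
    # A positive gap means validation accuracy exceeded test accuracy.
    for name, gap in generalization_gaps(ROWS):
        print(f"{name:<22}{gap:+.3f}")
```

A positive gap means the validation score was higher than the test score for that row. Averaging per dataset name is only meaningful if the repeated rows are in fact comparable runs.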