Models

Vendor           Language model              Release Date  examples=3            examples=10
Anthropic        claude-3-5-haiku-20241022   2024-10-22    anthropic             anthropic10
Anthropic        claude-3-7-sonnet-20250219  2025-02-19    anthropic37           anthropic3710
Anthropic        claude-opus-4-20250514      2025-05-14    opus40                opus4010
Anthropic        claude-sonnet-4-20250514    2025-05-14    sonnet40              sonnet4010
Deep Cogito      cogito:70b                  2025-04-15    cogito
DeepSeek AI      deepseek-r1:70b             2025-01-20    deepseek
Google DeepMind  gemini-2.0-flash            2025-01-30    gemini                gemini10
Google DeepMind  gemini-2.0-pro-exp          2025-02-05    geminipro             geminipro10
Google DeepMind  gemini-2.5-pro-exp-03-25    2025-03-25    gemini25              gemini2510
Google DeepMind  gemma3:27b                  2025-03-12    gemma, gemma3
Meta             llama3.3:latest             2024-12-06    llama
Microsoft        phi4:latest                 2024-12-12    phi
OpenAI           gpt-3.5-turbo-0125          2024-01-25    openai35
OpenAI           gpt-4.1                     2025-04-14    openai41              openai4110
OpenAI           gpt-4.5-preview             2025-02-27    openai45              openai4510
OpenAI           gpt-4o-2024-11-20           2024-11-20    openai                openai10
OpenAI           gpt-4o-2025-01-29           2025-01-29    gpt-4o-legacy         gpt-4o-legacy10
OpenAI           gpt-4o-mini                 2024-07-18    openailong
OpenAI           o1                          2024-12-05    openai10o1, openaio1
OpenAI           o3                          2025-04-16    openaio3              openaio310
Qwen             qwq:32b                     2025-06-25    qwq
Random           random                      2024-01-01    random
TII (UAE)        falcon3:10b                 2024-12-17    falcon                falcon10

Example Count Comparison

[Figures: examples scatter plot; difference histogram]

Wilcoxon statistic: 605.00, p-value: 0.75372
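The page does not say how this statistic was computed. Assuming it is a paired, two-sided Wilcoxon signed-rank test on per-model scores at examples=3 versus examples=10, the calculation can be sketched in plain Python as below. The function name `wilcoxon_signed_rank` and the score lists are hypothetical, and the p-value uses the normal approximation without tie or continuity correction, so it may differ slightly from what a library routine such as `scipy.stats.wilcoxon` reports.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (statistic, p_value), where statistic is min(W+, W-).
    Zero differences are dropped; tied absolute differences share
    the average of the ranks they span.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    stat = min(w_plus, w_minus)

    # Mean and standard deviation of the statistic under the null hypothesis.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (stat - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return stat, p_value


# Hypothetical paired scores, NOT the data behind the figures on this page.
scores_3 = [10, 12, 9, 14, 11]
scores_10 = [8, 13, 9, 10, 12]
stat, p = wilcoxon_signed_rank(scores_3, scores_10)
print(f"Wilcoxon statistic: {stat:.2f}, p-value: {p:.5f}")
```

Under this reading, a p-value as large as 0.75372 would mean the paired differences give no evidence that moving from 3 to 10 examples shifts the scores in either direction.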