artificial intelligence, Measuring Massive Multitask Language Understanding (MMLU) is a benchmark for evaluating the capabilities of large language models...
translation. GPT-4o scored 88.7% on the Massive Multitask Language Understanding (MMLU) benchmark compared to 86.5% by GPT-4. Unlike GPT-3.5 and GPT-4, which rely...
subjects, achieving a score of 56.6% on the MATH benchmark and 63.47% on the MMLU benchmark. The model was produced in collaboration with Project Numina, and...
HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE, MMLU (Massive Multitask Language Understanding), BIG-bench hard, GSM8k, RealToxicityPrompts...
evaluated relative to each other through standardized task benchmarks like MMLU, MMMU, HumanEval, and GSM8K. Given that foundation models are multi-purpose...
accuracy of 67.5% on the Measuring Massive Multitask Language Understanding (MMLU) benchmark, which is 7% higher than Gopher's performance. Chinchilla was...
translation. It scored 88.7% on the Massive Multitask Language Understanding (MMLU) benchmark compared to 86.5% by GPT-4. On July 18, 2024, OpenAI released...
human experts on the 57-subject Massive Multitask Language Understanding (MMLU) test, obtaining a score of 90%. Gemini Pro was made available to Google...
different evaluation datasets and tasks. Examples include GLUE, SuperGLUE, MMLU, BIG-bench, and HELM. OpenAI has released tools for running composite benchmarks...
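The MMLU figures quoted in these excerpts (e.g. 88.7%, 90%, 67.5%) are plain accuracies over multiple-choice questions: the fraction of the benchmark's questions where the model's chosen answer letter matches the reference answer. A minimal sketch of that computation, with illustrative function and variable names (not taken from any particular evaluation harness):

```python
# Sketch of MMLU-style scoring: accuracy over four-choice questions.
# `predictions` and `references` are parallel lists of answer letters
# ("A"-"D"); the score is the percentage of exact matches.

def mmlu_accuracy(predictions, references):
    """Return accuracy as a percentage over paired answer letters."""
    if len(predictions) != len(references):
        raise ValueError("prediction/reference counts must match")
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

# Toy example: 4 of 5 answers agree, so the score is 80.0
preds = ["A", "C", "B", "D", "A"]
refs = ["A", "C", "B", "D", "B"]
print(mmlu_accuracy(preds, refs))  # 80.0
```

In the real benchmark this average is taken over thousands of questions spanning the 57 subjects, and results are often also reported per subject or per subject category.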