NLP Tasks

1. 수학 및 논리 문제 해결 (Mathematical and Logical Problem Solving)

MMLU (Massive Multitask Language Understanding): 수학, 논리적 추론 등 여러 가지 분야의 문제
- 예시: “Solve for \(x: 2x + 5 = 15\)”
MMLU-Pro: MMLU의 확장 버전. 여러 가지 수학 문제를 다루는 테스트
- 예시: “What is the sum of the first 100 prime numbers?”
MATH: 고등학교 및 대학 수준의 수학 문제 (Mathematics Aptitude Test of Humans)
- 예시: “Find the value of the integral \(\int (x^2 - 3x + 2)dx\)”
MATH 500: 다양한 난이도의 수학 문제
- 예시: “What is the derivative of \(x^2 + 3x + 2\)?”
GSM8K: 초등학교 수준의 수학 문제 (Grade School Math)
- 예시: “If a train travels 60 miles per hour, how far will it travel in 3 hours?”

HumanEval: 프로그램 작성 능력을 평가하는 테스트 (주로 Python 코드)
- 예시: “Write a function that reverses a given string.”
Codeforces: 프로그래밍 경진대회 플랫폼인 Codeforces에서 출제되는 문제들로, 문제 해결 능력과 알고리즘적 사고를 테스트
- 예시: “Given an array of integers, find the maximum subarray sum.”
SWE-Bench: Software Engineering 작업을 위한 benchmark
- 예시: “Write a Python function to check if a number is prime.”
SWE-bench Verified: Software Engineering 작업을 위한 benchmark ( Verified: 이미 검증된 코드 작업을 기반으로 함)
- 예시: “Write a function that sorts a given list of strings in alphabetical order.”
LiveCodeBench: 실시간으로 코딩을 평가하는 benchmarks
- 예시: “Given two strings, return the length of the longest common subsequence.”
MBPP (Machine-Based Python Programming): Python 프로그래밍 문제를 해결하는 능력을 평가하는 benchmark (코드의 정확성 & 효율성)
- 예시: “Write a function that returns the Fibonacci sequence up to the nth number.”
MBPP+: MBPP의 확장판
- 예시: “Write a function to calculate the factorial of a number.”

GPQA (Generalized Question Answering): 다양한 유형의 질문에 대해 일반화된 답변을 제공하는 능력을 평가

( 단순한 사실 답변+ 복잡한 추론을 요구하는 질문 )
- 예시: “What is the capital city of France, and what are some popular tourist attractions there?”
GPQA-Diamond: GPQA의 특히 난이도가 높은 198개
SimpleQA: 단순한 질문 응답(QA) 시스템을 평가하는 데이터셋
- 예시: “Who was the first President of the United States?”
DROP (Discrete Reasoning Over Paragraphs): 긴 문장에서 정확한 답을 도출하는 능력을 평가

( 주로 문서 내에서 정답을 찾는 문제 )
- 예시: “In the article, it mentions the different causes of climate change. Summarize the three most important causes discussed.”
TruthfulQA (Truthful Question Answering): 모델이 정확하고 사실에 기반한 답을 제공하는 능력을 평가

( 특히 모델이 허위 정보나 왜곡된 답변을 피하고 정확한 정보를 제공하는지에 중점 )
- 예시: “Who was the first person to walk on the moon?”
Winogrande: 상식 기반 QA 시스템을 평가하는 benchmark ( 문맥을 이해하고 세밀한 언어적 추론을 통해 정확한 답을 선택하는 능력을 테스트 )
- 예시: “The trophy wouldn’t fit in the suitcase because it was too big. What was too big?”
  1. The trophy
  2. The suitcase

AIME (AI-Mathematical Evaluation): AI 모델이 수학적 문제를 풀 때의 성능을 평가

( 문제를 풀 때의 정확도 & 창의적인 해결법을 평가 )
- 예시: “Solve the equation \(3x + 2 = 11\)”
IF-Eval (Inference Evaluation): AI 모델이 추론하는 능력을 평가

( 데이터에서 힌트를 얻고 적절히 추론하여 문제를 해결하는 능력을 평가하는 benchmark )
- 예시: “Given a list of statements about a person’s background, infer their probable profession.”

Aider: 언어 모델의 다양한 능력을 평가하는 테스트 ( 여러 분야에서 모델이 얼마나 잘 적용될 수 있는지 평가 )
- 예시: “Provide a summary of the latest research on quantum computing.”