Sample Complexity and Representation Ability of Test-time Scaling Paradigms
This paper explores **theoretical foundations** for **test-time scaling paradigms** in large language models (LLMs). It **analyzes the sample efficiency** of repeated sampling methods like **self-consistency**, finding it requires more samples (Θ(1/∆²)) than **best-of-n** (Θ(1/∆)) for reliable answers. Furthermore, the paper **investigates the expressive power of self-correction**, demonstrating that Transformers with verifier feedback can simulate online learning, enabling a **single Transformer architecture to solve multiple tasks** without prior task knowledge. The authors **empirically validate their theoretical findings**, showing that self-correction significantly enhances accuracy, especially in larger models.
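A minimal sketch of the two strategies being compared, not the paper's code: `sample_answer` and `verifier_score` are hypothetical stand-ins for an LLM sampler and a verifier, and the Θ(·) comments only restate the paper's claimed rates.

```python
import random
from collections import Counter

def self_consistency(sample_answer, n):
    # Majority vote over n i.i.d. samples; the vote needs roughly
    # Theta(1/gap^2) samples to concentrate on the most probable answer.
    votes = Counter(sample_answer() for _ in range(n))
    return votes.most_common(1)[0][0]

def best_of_n(sample_answer, verifier_score, n):
    # Keep the highest-scoring sample; with a reliable verifier this needs
    # only roughly Theta(1/gap) samples to surface the correct answer once.
    candidates = [sample_answer() for _ in range(n)]
    return max(candidates, key=verifier_score)

# Toy answer distribution: the correct answer "42" wins by a small margin.
def sample_answer():
    return random.choices(["42", "41", "other"], weights=[0.40, 0.35, 0.25])[0]

verifier_score = lambda a: 1.0 if a == "42" else 0.0  # idealized verifier

print(self_consistency(sample_answer, n=400))
print(best_of_n(sample_answer, verifier_score, n=10))
```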
--------
19:45
RL's Razor: Why Online RL Forgets Less
This paper explores why **Reinforcement Learning (RL) fine-tuning leads to less catastrophic forgetting** in models compared to **Supervised Fine-Tuning (SFT)**, even when both achieve similar performance on new tasks. The authors introduce **"RL's Razor,"** a principle stating that **RL is implicitly biased towards solutions that cause minimal change (KL divergence) from the original model's policy** when learning new tasks. Empirical and theoretical evidence supports this, demonstrating that **KL divergence on the new task is a strong predictor of forgetting**, regardless of the training algorithm. The core reason for RL's advantage is its **on-policy training**, which samples from the model's current distribution and reweights those samples, leading to more conservative and KL-minimal updates compared to SFT's reliance on fixed external annotations.
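A minimal sketch of the KL measurement at the center of the argument, assuming access to base and fine-tuned token logits on new-task data; the toy tensors and the specific KL direction are simplifications, not the paper's setup.

```python
import torch
import torch.nn.functional as F

def kl_to_base(tuned_logits, base_logits):
    # Token-level KL(pi_tuned || pi_base), averaged over positions.
    # Measured on NEW-task data, this quantity is reported to predict
    # forgetting of old tasks regardless of the training algorithm.
    logp_tuned = F.log_softmax(tuned_logits, dim=-1)
    logp_base = F.log_softmax(base_logits, dim=-1)
    return (logp_tuned.exp() * (logp_tuned - logp_base)).sum(-1).mean()

# Hypothetical logits (batch x seq x vocab) standing in for real models.
base = torch.randn(4, 16, 1000)
sft_tuned = base + 1.5 * torch.randn_like(base)   # larger policy shift
rl_tuned = base + 0.2 * torch.randn_like(base)    # KL-conservative shift

print("SFT-style KL:", kl_to_base(sft_tuned, base).item())
print("RL-style KL: ", kl_to_base(rl_tuned, base).item())
```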
--------
24:56
Why Language Models Hallucinate
This new OpenAI paper explores the phenomenon of "hallucinations" in large language models (LLMs), where they generate plausible but incorrect information. The authors attribute these errors to the training and evaluation processes, arguing that these systems are rewarded for guessing rather than admitting uncertainty. They propose a statistical framework that connects these generative errors to misclassification rates in binary classification, suggesting that hallucinations are a natural consequence of current training objectives, even with error-free data. Furthermore, the paper highlights how post-training evaluations, often using binary scoring, perpetuate hallucinations by penalizing expressions of uncertainty, effectively keeping LLMs in a "test-taking" mode. To mitigate this, the authors advocate for modifying existing benchmarks to explicitly incorporate confidence targets and credit for acknowledging uncertainty, rather than solely introducing new hallucination-specific evaluations.
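One way to express the grading change the authors advocate, sketched as a scoring rule with an explicit confidence target `t`; the exact threshold and penalty values here are illustrative.

```python
def binary_score(answer, correct):
    # Standard 0/1 grading: abstaining scores the same as guessing wrong,
    # so the score-maximizing policy is to always guess.
    return 1.0 if answer == correct else 0.0

def confidence_target_score(answer, correct, t=0.75):
    # Grading with an explicit confidence target t: abstention scores 0,
    # a correct answer scores 1, and a wrong answer costs t / (1 - t).
    # Guessing then only raises expected score when confidence exceeds t.
    if answer is None:  # the model says "I don't know"
        return 0.0
    return 1.0 if answer == correct else -t / (1.0 - t)

# With t = 0.75 a wrong guess costs 3 points, so a model that is only
# 60% sure is better off abstaining under this rule.
print(confidence_target_score(None, "Paris"))     # 0.0
print(confidence_target_score("Paris", "Paris"))  # 1.0
print(confidence_target_score("Lyon", "Paris"))   # -3.0
```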
--------
17:40
ALFA: Aligning LLMs to Ask Good Questions: A Case Study in Clinical Reasoning
This academic paper introduces ALFA (ALignment via Fine-grained Attributes), a new framework designed to enhance how large language models (LLMs) ask questions, particularly in complex fields like clinical reasoning. The authors highlight the current limitations of LLMs in proactive information-gathering, which is crucial for decision-making in high-stakes environments. ALFA addresses this by decomposing the concept of a "good" question into specific, theory-backed attributes such as clarity, relevance, and diagnostic accuracy. The framework then synthesizes attribute-specific question variations and aligns models using preference-based optimization to learn these improved question-asking behaviors. Through a case study in clinical reasoning using the MediQ-AskDocs dataset, ALFA-aligned models demonstrated a significant reduction in diagnostic errors compared to existing state-of-the-art LLMs, showcasing the effectiveness of explicitly guiding question-asking with structured attributes.
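A rough sketch of the data-construction step described above, assuming a hypothetical `rewrite` helper (e.g., an LLM prompted to improve or degrade a question along one attribute); the attribute names are illustrative rather than the paper's exact list.

```python
from dataclasses import dataclass

# Illustrative attribute names; the paper decomposes "good questions"
# into a set of theory-backed attributes along these lines.
ATTRIBUTES = ["clarity", "relevance", "diagnostic_value"]

@dataclass
class PreferencePair:
    context: str    # clinical conversation so far
    chosen: str     # question variant improved along one attribute
    rejected: str   # counterfactual degraded along the same attribute
    attribute: str

def build_pairs(context, base_question, rewrite):
    # For each attribute, synthesize an improved and a degraded variant of
    # the same base question; the resulting attribute-specific preference
    # pairs feed a DPO-style preference-optimization step.
    pairs = []
    for attr in ATTRIBUTES:
        better = rewrite(base_question, attr, direction="improve")
        worse = rewrite(base_question, attr, direction="degrade")
        pairs.append(PreferencePair(context, better, worse, attr))
    return pairs
```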
--------
16:12
Sample Efficient Preference Alignment in LLMs via Active Exploration
This research introduces an active exploration algorithm to enhance the efficiency of preference alignment in large language models (LLMs) by strategically selecting human feedback. The authors frame this as an active contextual dueling bandit problem, where the system actively chooses which "contexts" (prompts) and "actions" (LLM responses) to present to human evaluators. Their proposed method, AE-Borda, leverages uncertainty estimation and a generalized Borda function to identify the most informative data points for training, leading to faster learning and reduced data collection costs. The paper validates its theoretical guarantees with synthetic experiments and demonstrates practical improvements on LLM performance across various datasets, including two new contributions: Jeopardy! for factual correctness and Haikus for creative writing.
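A loose sketch of the active-selection idea, assuming per-response Borda-score estimates with uncertainties (e.g., from an ensemble of preference models); the selection rule here is illustrative, not the paper's exact acquisition function.

```python
def select_duel(contexts, actions, borda_mean, borda_std):
    # borda_mean[c][a] / borda_std[c][a]: estimated probability that
    # response a beats a randomly drawn alternative for prompt c, plus an
    # uncertainty estimate. Query the prompt where the winner is least
    # certain, then duel the current leader against its most optimistic
    # challenger.
    def winner_uncertainty(c):
        return max(borda_std[c][a] for a in actions[c])

    c = max(contexts, key=winner_uncertainty)
    leader = max(actions[c], key=lambda a: borda_mean[c][a])
    challenger = max((a for a in actions[c] if a != leader),
                     key=lambda a: borda_mean[c][a] + borda_std[c][a])
    return c, (leader, challenger)  # pair shown to a human rater

# Tiny toy example with two prompts and three candidate responses each.
contexts = ["p1", "p2"]
actions = {"p1": ["a", "b", "c"], "p2": ["a", "b", "c"]}
borda_mean = {"p1": {"a": 0.6, "b": 0.5, "c": 0.4},
              "p2": {"a": 0.5, "b": 0.5, "c": 0.5}}
borda_std = {"p1": {"a": 0.05, "b": 0.05, "c": 0.05},
             "p2": {"a": 0.20, "b": 0.15, "c": 0.10}}
print(select_duel(contexts, actions, borda_mean, borda_std))
```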