News
Unsupervised Agent Discovery — LessWrong
1+ hour, 17+ min ago (1140+ words) Finding agents in raw dynamics. Until it doesn't. Our intuitions attribute agency where there is none. Our ancestors anthropomorphized nature. We may attribute intent where there's only coupling. It is more dangerous to overlook an agent than to see one too…...
[Intro to AI Alignment] 0. Overview and Foundations — LessWrong
1+ hour, 58+ min ago (848+ words) This post provides an overview of the sequence and covers background concepts that the later posts build on. If you're already familiar with AI alignment, you can likely skim or skip the foundations section. This sequence explains the difficulties of…...
Appendices: Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking — LessWrong
3+ hour, 45+ min ago (1603+ words) Access to the code used to generate scenarios and to run the experiments is available on request. All the samples in this appendix are the first sample from each model; these responses are not cherry-picked. The scenario in question is…...
Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance — LessWrong
5+ hour, 58+ min ago (839+ words) [Thanks to Fabien Roger for an initial discussion about no-CoT math performance that inspired these results. Thanks to Fabien Roger for running some evaluations for me on Opus 3, Sonnet 3.5, Sonnet 3.6, and Sonnet 3.7 (these models are no longer publicly deployed by…...
Can we interpret latent reasoning using current mechanistic interpretability tools? — LessWrong
6+ hour, 22+ min ago (1128+ words) We study a simple latent reasoning LLM on math tasks using standard mechanistic interpretability techniques to see whether the latent reasoning process (i.e., vector-based chain of thought) is interpretable. We believe the CODI latent reasoning model we use is a reasonable…...
AIXI with general utility functions: "Value under ignorance in UAI" — LessWrong
17+ hour, 32+ min ago (725+ words) Published on December 22, 2025, 5:46 AM GMT. This updated version of my AGI 2025 paper with Marcus Hutter, "Value under ignorance in universal artificial intelligence," studies general utility functions for AIXI. Surprisingly, the (hyper)computability properties have connections to imprecise probability theory! AIXI uses…...
Witness or Wager: Enforcing ‘Show Your Work’ in Model Outputs — LessWrong
1+ day, 9+ hour ago (235+ words) Published on December 21, 2025, 1:12 PM GMT. This is a proposal I posted earlier as a Quick Take; I'm reposting it here for broader visibility. Instead of rewarding answers, reward the reasoning itself. Every model output must: (a) show checkable reasoning artifacts (external citations, code,…...
Analysis of Whisper-Tiny Using Sparse Autoencoders — LessWrong
1+ day, 11+ hour ago (400+ words) In this project, I explored Whisper-Tiny from OpenAI, a speech transcription model that takes in audio and produces a transcript as output. Here's an example of a transcription. Now, we analyze some attention patterns from…...
A Way to Test and Train Creativity — LessWrong
1+ day, 11+ hour ago (668+ words) I was looking for a way to measure creativity, but creativity tests like the Torrance test seem rather simple and more apt for children. The Torrance test asks you about all the things you can do with something like a…...
Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment — LessWrong
1+ day, 22+ hour ago (1437+ words) Awesome to finally see pretraining experiments. Thank you so much for running these! Your results bode quite well for pretraining alignment. They may well transform how we tackle the "shallowness" of post-training, open-weight LLM defense, alignment of undesired / emergent personas, and…...