News

1.
lesswrong.com/posts/pXYosC3eoS9GrDRAw/unsupervised-agent-discovery

Unsupervised Agent Discovery — LessWrong

1+ hour, 17+ min ago  (1140+ words) Finding agents in raw dynamics. Until it doesn't. Our intuitions attribute agency where it isn't. Our ancestors anthropomorphized nature. We may attribute intent where there's only coupling. It is more dangerous to overlook an agent than to see one too…

2.
lesswrong.com/posts/fAETBJcgt2sHhGTef/intro-to-ai-alignment-0-overview-and-foundations

[Intro to AI Alignment] 0. Overview and Foundations — LessWrong

1+ hour, 58+ min ago  (848+ words) This post provides an overview of the sequence and covers background concepts that the later posts build on. If you're already familiar with AI alignment, you can likely skim or skip the foundations section. This sequence explains the difficulties of…

3.
lesswrong.com/posts/KvGzQqhrxn24du4qt/appendices-supervised-finetuning-on-low-harm-reward-hacking

Appendices: Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking — LessWrong

3+ hour, 45+ min ago  (1603+ words) Access to the code used to generate scenarios and to run the experiments is available on request. All the samples in this appendix are the first sample from each model; these responses are not cherry-picked. The scenario in question is…

4.
lesswrong.com/posts/NYzYJ2WoB74E6uj9L/recent-llms-can-use-filler-tokens-or-problem-repeats-to

Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance — LessWrong

5+ hour, 58+ min ago  (839+ words) [Thanks to Fabien Roger for an initial discussion about no-CoT math performance that inspired these results. Thanks to Fabien Roger for running some evaluations for me on Opus 3, Sonnet 3.5, Sonnet 3.6, and Sonnet 3.7 (these models are no longer publicly deployed by…
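
A minimal sketch of the two prompt variants the title describes: padding a question with filler tokens, or repeating the problem, before asking for a direct no-CoT answer. The prompt wording, the "." filler choice, and the counts are illustrative assumptions, not taken from the post.

```python
def build_prompt(problem: str, variant: str = "baseline", n_fillers: int = 200) -> str:
    """Build a no-CoT math prompt in one of three variants (names are hypothetical)."""
    if variant == "filler":
        # Filler tokens give the model extra token positions to compute in
        # before it must commit to an answer, without adding new content.
        return f"{problem}\n{'.' * n_fillers}\nAnswer (no working):"
    if variant == "repeat":
        # Repeating the problem supplies the same extra positions, but with
        # content the model can actually attend back to.
        return f"{problem}\n{problem}\nAnswer (no working):"
    return f"{problem}\nAnswer (no working):"

print(build_prompt("What is 17 * 24?", variant="filler", n_fillers=20))
```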

5.
lesswrong.com/posts/YGAimivLxycZcqRFR/can-we-interpret-latent-reasoning-using-current-mechanistic

Can we interpret latent reasoning using current mechanistic interpretability tools? — LessWrong

6+ hour, 22+ min ago  (1128+ words) We study a simple latent reasoning LLM on math tasks using standard mechanistic interpretability techniques to see whether the latent reasoning process (i.e., vector-based chain of thought) is interpretable. We believe the CODI latent reasoning model we use is a reasonable…
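
A sketch of one standard mechanistic interpretability tool that could be applied to vector-based chain of thought: a logit-lens-style readout that projects each latent reasoning vector through the model's unembedding to see which vocabulary tokens it is closest to. The tensor shapes and names are assumptions for illustration, not CODI's actual interface.

```python
import torch

def logit_lens(latent_vectors: torch.Tensor,  # (n_steps, d_model)
               unembed: torch.Tensor,         # (d_model, vocab_size)
               k: int = 5) -> torch.Tensor:
    """Return the top-k vocabulary ids each latent reasoning step decodes to."""
    logits = latent_vectors @ unembed          # (n_steps, vocab_size)
    return logits.topk(k, dim=-1).indices      # (n_steps, k)

# Toy usage with random tensors standing in for a real model's weights.
steps, d_model, vocab = 4, 64, 1000
ids = logit_lens(torch.randn(steps, d_model), torch.randn(d_model, vocab))
print(ids.shape)  # torch.Size([4, 5])
```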

6.
lesswrong.com/posts/SgaSFWhJJoavCcTp6/aixi-with-general-utility-functions-value-under-ignorance-in

AIXI with general utility functions: "Value under ignorance in UAI" — LessWrong

17+ hour, 32+ min ago  (725+ words) Published on December 22, 2025 5:46 AM GMT. This updated version of my AGI 2025 paper with Marcus Hutter, "Value under ignorance in universal artificial intelligence," studies general utility functions for AIXI. Surprisingly, the (hyper)computability properties have connections to imprecise probability theory! AIXI uses…
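
For context, the textbook AIXI action rule (Hutter 2005) scores futures by summed reward, weighted over all programs consistent with the history; per the snippet, the paper generalizes this summed-reward term to arbitrary utility functions. The formula below is standard background, not quoted from the post:

$$
a_k \;:=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \big[r_k + \cdots + r_m\big] \sum_{q \,:\, U(q,\, a_{1:m}) = o_{1:m} r_{1:m}} 2^{-\ell(q)}
$$

where $U$ is a universal monotone Turing machine, $\ell(q)$ is the length of program $q$, and $m$ is the horizon.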

7.
lesswrong.com/posts/xYgdbZ6kdJ5yJFj4m/witness-or-wager-enforcing-show-your-work-in-model-outputs

Witness or Wager: Enforcing ‘Show Your Work’ in Model Outputs — LessWrong

1+ day, 9+ hour ago  (235+ words) Published on December 21, 2025 1:12 PM GMT. This is a proposal I posted earlier as a Quick Take; I'm reposting here for broader visibility. Instead of rewarding answers, reward the reasoning itself. Every model output must: (a) show checkable reasoning artifacts (external citations, code,…
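
A toy sketch of "reward the reasoning, not just the answer": score an output by whether it carries checkable artifacts. The proposal's full artifact list is truncated in the snippet, so the two checks below (external citations and code blocks) are assumptions, standing in for whatever verifier the proposal actually intends.

```python
import re

def reasoning_reward(output: str) -> float:
    """Hypothetical reward: credit outputs that expose checkable reasoning artifacts."""
    has_citation = bool(re.search(r"https?://\S+", output))  # external citation present
    has_code = "```" in output                               # checkable code artifact present
    return 0.5 * has_citation + 0.5 * has_code

print(reasoning_reward("See https://example.org\n```python\nprint(2 + 2)\n```"))  # 1.0
```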

8.
lesswrong.com/posts/aXSuc6pdCy6SkeKZk/analysis-of-whisper-tiny-using-sparse-autoencoders

Analysis of Whisper-Tiny Using Sparse Autoencoders — LessWrong

1+ day, 11+ hour ago  (400+ words) In this project, I attempted to explore the model Whisper-Tiny from OpenAI, which is a speech transcription model that takes in audio and provides a transcript as output. Here's an example of transcription. Now, we analyze some attention patterns from…
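
A minimal sketch of running Whisper-Tiny as the snippet describes, audio in and transcript out, via the Hugging Face pipeline API. The "sample.wav" path is a placeholder; the post's own transcription example and its sparse-autoencoder analysis code are not reproduced here.

```python
from transformers import pipeline

# Load OpenAI's Whisper-Tiny as an automatic-speech-recognition pipeline
# (downloads the model weights on first run).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# Transcribe a local audio file; the result is a dict with a "text" field.
result = asr("sample.wav")
print(result["text"])
```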

9.
lesswrong.com/posts/Q3J6b6KB4X6NaKQcn/a-way-to-test-and-train-creativity

A Way to Test and Train Creativity — LessWrong

1+ day, 11+ hour ago  (668+ words) I was looking for a way to measure creativity, but creativity tests like the Torrance test seem rather simple and more apt for children. The Torrance test asks you about all the things you can do with something like a…

10.
lesswrong.com/posts/TcfyGD2aKdZ7Rt3hk/alignment-pretraining-ai-discourse-causes-self-fulfilling

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment — LessWrong

1+ day, 22+ hour ago  (1437+ words) Awesome to finally see pretraining experiments. Thank you so much for running these! Your results bode quite well for pretraining alignment. May well transform how we tackle the "shallowness" of post-training, open-weight LLM defense, alignment of undesired / emergent personas, and…