News
Token-Level Forking Paths in Reasoning Traces: Some Examples — LessWrong
3+ hours, 16+ minutes ago (267+ words) Epistemic status: No rigorous findings, I'm only really looking at a few examples here. But this gave me a more intuitive feel for how LLM reasoning traces work in a way that's difficult to fully explain. I've recently read the…
Cross-Model Activation Generalizability Isn't Strong (Yet) — LessWrong
9+ hours, 8+ minutes ago (1516+ words) Code can be found here. This is my first interpretability research project, coming from a different field. I'm about 4 weeks into the field and learning and working solo. I've tried to be honest about the limitations and the mistakes I…
Unmathematical features of math — LessWrong
1+ day, 9+ hours ago (543+ words) (Epistemic status: I consider the following quite obvious and self-evident, but decided to post anyways.[1]) Mathematics is a social activity done by mathematicians. What if we just try to prove all statements we can find proofs for? The way we…
Paper close reading: "Why Language Models Hallucinate" — LessWrong
1+ day, 1+ hour ago (1441+ words) Due to time constraints, I only managed to make it through the abstract and introduction before running out of time. Oops. Maybe I'll try recording myself talking through another close reading later. The abstract of the paper starts: To me,…
I made Parseltongue - language to solve AI hallucinations — LessWrong
1+ day, 14+ hours ago (429+ words) This lets you model text statements as observations with no probabilities or confidence scores. The bar for "true" is very high: only what remains invariant under every valid combination of direct observations and their logically inferred consequences. Everything else is…
Cheaper/faster/easier makes for step changes (and that's why even current-level LLMs are transformative) — LessWrong
2+ days, 5+ hours ago (964+ words) We already knew there's nothing new under the sun. Thanks to advances in telescopes, orbital launch, satellites, and space vehicles we now know there's nothing new above the sun either, but there is rather a lot of energy! For many…
Mean field sequence: an introduction — LessWrong
3+ days, 32+ minutes ago (676+ words) This is the first post in a planned series about mean field theory by Dmitry and Lauren (this post was generated by Dmitry with lots of input from Lauren, and was split into two parts, the second of which is…
Latent Reasoning Sprint #3: Activation Difference Steering and Logit Lens — LessWrong
3+ days, 4+ hours ago (382+ words) In my previous post I found evidence consistent with the scratchpad paper's compute/store alternation hypothesis: even steps showing higher intermediate answer detection and odd steps showing higher entropy, along with results matching "Can we interpret latent reasoning using current mechanistic interpretability…
Registering a Prediction Based on Anthropic's "Emotions" Paper — LessWrong
3+ days, 13+ hours ago (511+ words) This post draws from Anthropic's recent "Emotion Concepts and their Function in a Large Language Model" paper. In this post I am going to: When the Claude "Opportunistic Blackmail" paper was first published, I registered a prediction: I predict that were…
Early Warning Signals For Capabilities During Training — LessWrong
3+ days, 16+ hours ago (1073+ words) This post is sort of meant to provide an explanation of the core ideas of a new preprint on the early detection of phase transitions in deep learning. The preprint could be cleaned up a bit, but I was very…