News
Goodfire and Training on Interpretability — LessWrong
27+ min ago (114+ words) Goodfire wrote Intentionally designing the future of AI about training on interpretability. This seems like an instance of The Most Forbidden Technique, which has been warned against over and over: applying optimization pressure to an interpretability technique [T] eventually degrades [T]. Goodfire…...
TT Self Study Journal # 6 — LessWrong
2+ hour, 31+ min ago (243+ words) I once again got off track and am now starting up again. This time I'm hoping to focus on job searching and consistent, maintainable effort. My goals for the 5th sprint were: I'm fairly unhappy with my lack of progress this…...
Speedrunning a Mech Interp Research Setup (Remote GPU, Torch, TransformerLens, Cuda, SSH, VS Code) — LessWrong
9+ hour, 27+ min ago (549+ words) This guide is especially well-suited for: Quick disclaimers before we start: this setup uses a paid cloud GPU provider, and you will be billed while the machine is running. I'm not affiliated with any provider in this guide. Also, this…...
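The excerpt above promises a remote-GPU workflow for mech interp work. As a rough, hedged sketch of the sanity check such a setup typically ends at (not the author's actual steps; the model name and hook name are placeholders), one might verify that Torch sees the GPU and that TransformerLens loads end to end:

```python
# Minimal sanity check for a fresh remote-GPU mech interp environment.
# Illustrative only: model choice and hook name are placeholders.
import torch
from transformer_lens import HookedTransformer

# Confirm CUDA is visible before paying for idle GPU time.
assert torch.cuda.is_available(), "No CUDA device found; check drivers / instance type."
device = "cuda"

# Load a small model to verify the TransformerLens install end to end.
model = HookedTransformer.from_pretrained("gpt2", device=device)

# One forward pass with activation caching, the core mech interp primitive.
logits, cache = model.run_with_cache("Hello, interpretability!")
print(logits.shape)                          # [batch, seq_len, d_vocab]
print(cache["blocks.0.attn.hook_z"].shape)   # cached attention head outputs, layer 0
```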
What's the Point of the Math? — LessWrong
14+ hour, 42+ min ago (273+ words) This post was written while at MATS 9.0 under the mentorship of Richard Ngo. It's only meta-related to my research. I would like to start by quoting a point Jan Kulveit made about economics culture in a recent post: non-mathematical, often…...
Episodic memory in AI agents poses new safety risks — LessWrong
20+ hour, 29+ min ago (654+ words) In what follows I argue that episodic memory will enable a range of dangerous activities in AI agents, including deception, improved situational awareness, and unwanted retention of information learned during deployment. It will also make AI agents act more unpredictably…...
p-values are good actually — LessWrong
1+ day, 4+ hour ago (1790+ words) Published on February 4, 2026 10:04 PM GMT. It is fashionable, on LessWrong and also everywhere else, to advocate for a transition away from p-values. p-values have many known issues. p-hacking is possible and difficult to prevent, testing one hypothesis at a time cannot…...
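To make the excerpt's complaint concrete (an illustrative sketch only, not code from the post; the effect size, sample sizes, and alpha are arbitrary), here is a single p-value from a two-sample t-test, followed by the multiple-testing failure mode that uncorrected thresholds invite:

```python
# Illustrative only: one t-test p-value, and why many tests need correction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# One comparison: p is the probability of data at least this extreme under H0.
a = rng.normal(0.0, 1.0, size=50)
b = rng.normal(0.2, 1.0, size=50)
t_stat, p_value = stats.ttest_ind(a, b)
print(f"single test: p = {p_value:.3f}")

# Twenty comparisons where H0 is true: expect ~1 "significant" hit at alpha = 0.05,
# the p-hacking / multiple-testing issue the post alludes to.
p_values = [
    stats.ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
    for _ in range(20)
]
print("uncorrected hits:", sum(p < 0.05 for p in p_values))
print("Bonferroni hits: ", sum(p < 0.05 / 20 for p in p_values))
```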
Chess bots do not have goals — LessWrong
1+ day, 5+ hour ago (743+ words) Published on February 4, 2026 9:11 PM GMT. I see the opposite claim made in The Problem, and see it implied along with most mentions of AlphaGo. I also see some people who might agree with me, e.g. here, or here, but they don't get…...
A Black Box Made Less Opaque (part 2) — LessWrong
1+ day, 21+ hour ago (1061+ words) This is the second installment of a series of analyses exploring basic AI mechanistic interpretability techniques. While Part 1 in this series is summarized below, reviewing that article provides helpful context for this analysis. This analysis seeks to…...
AI Safety at the Frontier: Paper Highlights of January 2026 — LessWrong
2+ day, 7+ hour ago (1103+ words) tl;dr Papers of the month: "Activation probes achieve production-ready jailbreak robustness at orders-of-magnitude lower cost than LLM classifiers."…...
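For readers unfamiliar with the highlighted technique, an activation probe is essentially a small classifier trained on a model's cached hidden states. The sketch below is a generic illustration under assumed shapes and placeholder data, not the paper's method:

```python
# Sketch of an activation probe: a linear classifier on cached hidden states.
# Placeholders throughout: layer choice, d_model, data, and labels are assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assume activations were already extracted at one layer:
# X has shape [n_prompts, d_model]; y marks jailbreak attempts (1) vs. benign (0).
d_model = 768
X_train = np.random.randn(1000, d_model)   # placeholder activations
y_train = np.random.randint(0, 2, 1000)    # placeholder labels

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# At inference time the probe adds only a dot product per prompt, which is
# where the cost advantage over a separate LLM classifier comes from.
scores = probe.predict_proba(X_train[:5])[:, 1]
print(scores)
```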
Exponential takeoff of mediocrity — LessWrong
2+ day, 11+ hour ago (1789+ words) I also know what every sane person currently does with the long text, so: As a human, you would notice that it's pretty symmetrical - given interval 1 you get the opposite value, and given interval 2 you have the same value. So if…...