News
Unsupervised Agent Discovery — LessWrong
1+ hour, 17+ min ago (1140+ words) Finding agents in raw dynamics. Until it doesn't. Our intuitions attribute agency where there is none. Our ancestors anthropomorphized nature. We may attribute intent where there's only coupling. It is more dangerous to overlook an agent than to see one too…...
[Intro to AI Alignment] 0. Overview and Foundations — LessWrong
1+ hour, 58+ min ago (848+ words) This post provides an overview of the sequence and covers background concepts that the later posts build on. If you're already familiar with AI alignment, you can likely skim or skip the foundations section. This sequence explains the difficulties of…...
Appendices: Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking — LessWrong
3+ hour, 45+ min ago (1603+ words) Access to the code used to generate scenarios and to run the experiments is available on request. All the samples in this appendix are the first sample from each model; these responses are not cherry-picked. The scenario in question is…...
Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance — LessWrong
5+ hour, 58+ min ago (839+ words) [Thanks to Fabien Roger for an initial discussion about no-CoT math performance that inspired these results. Thanks to Fabien Roger for running some evaluations for me on Opus 3, Sonnet 3.5, Sonnet 3.6, and Sonnet 3.7 (these models are no longer publicly deployed by…...
Can we interpret latent reasoning using current mechanistic interpretability tools? — LessWrong
6+ hour, 22+ min ago (1128+ words) We study a simple latent reasoning LLM on math tasks using standard mechanistic interpretability techniques to see whether the latent reasoning process (i.e., vector-based chain of thought) is interpretable. We believe the CODI latent reasoning model we use is a reasonable…...
AIXI with general utility functions: "Value under ignorance in UAI" — LessWrong
17+ hour, 32+ min ago (725+ words) Published on December 22, 2025, 5:46 AM GMT. This updated version of my AGI 2025 paper with Marcus Hutter, "Value under ignorance in universal artificial intelligence," studies general utility functions for AIXI. Surprisingly, the (hyper)computability properties have connections to imprecise probability theory! AIXI uses…...
Witness or Wager: Enforcing ‘Show Your Work’ in Model Outputs — LessWrong
1+ day, 9+ hour ago (235+ words) Published on December 21, 2025, 1:12 PM GMT. This is a proposal I posted earlier as a Quick Take; I'm reposting it here for broader visibility. Instead of rewarding answers, reward the reasoning itself. Every model output must: (a) show checkable reasoning artifacts (external citations, code,…...
Analysis of Whisper-Tiny Using Sparse Autoencoders — LessWrong
1+ day, 11+ hour ago (400+ words) In this project, I explored Whisper-Tiny from OpenAI, a speech transcription model that takes in audio and produces a transcript as output. Here's an example of a transcription. Now, we analyze some attention patterns from…...
A Way to Test and Train Creativity — LessWrong
1+ day, 11+ hour ago (668+ words) I was looking for a way to measure creativity, but creativity tests like the Torrance test seem rather simple and more apt for children. The Torrance test asks you about all the things you can do with something like a…...
Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment — LessWrong
1+ day, 22+ hour ago (1437+ words) Awesome to finally see pretraining experiments. Thank you so much for running these! Your results bode quite well for pretraining alignment. They may well transform how we tackle the "shallowness" of post-training, open-weight LLM defense, alignment of undesired / emergent personas, and…...