News
A (Slightly) Mechanistic Theory for Exponentially Increasing AI Time Horizons? - LessWrong
3+ hour, 58+ min ago (985+ words) AI "time horizons" are mostly not about time (I think it's mostly "data", but you'll see where I'm unsure). ...
Probabilities are not the right concept - LessWrong
1+ day, 3+ hour ago (1750+ words) This sequence is an attempt to sketch a unified framework for several interconnected questions: Where do Bayesian priors come from? What even are probabilities? How should we deal with infinite ethics? What's going on with anthropics? I hope to lay…
Can Large Language Models Identify Novel Threats? Part 1: Mirror Life and the Classification Gap - LessWrong
1+ day, 16+ hour ago (354+ words) Can an LLM refuse a harmful uplift request when the topic in question hasn't been identified as dangerous yet? In 2022, mirror RNA polymerase was act...
Counting Arguments in AI Safety - LessWrong
2+ day, 11+ hour ago (21+ words) cf. https://www.lesswrong.com/posts/YsFZF3K9tuzbfrLxo/counting-arguments-provide-no-evidence-for-ai-doom, https://www.lesswrong.com/posts/yQSmcfN4kA...
What am I, if not an AI? - LessWrong
3+ day, 5+ hour ago (234+ words) TL;DR: I RL fine-tuned Mistral 7B Instruct v0.3 and Llama 3.1 8B Instruct to avoid self-identifying as a language model, without specifying a tar...
AI #169: New Knowledge - LessWrong
3+ day, 6+ hour ago (1447+ words) Even in a relatively quiet period, AI is out there creating new knowledge. The new knowledge in question is OpenAI getting us the first truly impress...
Learned Chain-of-Thought Obfuscation Generalises to Unseen Tasks - LessWrong
3+ day, 9+ hour ago (299+ words) TL;DR: Training against a CoT or summary-only monitor can lead to obfuscation of dangerous reasoning in unseen tasks. This strengthens the "don't trai...
Why does off-model SFT degrade capabilities? - LessWrong
3+ day, 19+ hour ago (1026+ words) Off-model SFT (SFT on outputs generated by a different model) might be an important method for controlling AI behavior. For instance, it seems like a...
Sparse Efficiency vs. Superposition: The Interpretability Tradeoff - LessWrong
4+ day, 36+ min ago (401+ words) Today's frontier models train in an expensive style: dense forward passes, huge matrix multiplies, and broad weight updates. The human brain (~5 MWh over 28 years) is an existence proof that learning can be vastly more energy efficient - about 10,000x - than modern AI…
Singular Learning Theory Comprehensive - 1 - LessWrong
3+ day, 23+ hour ago (1264+ words) Introduction There are some very nice resources to understand the intuition of Singular Learning Theory. However, I am quite unsatisfied with the cur...