This is a link post. New Anthropic research (tweet, blog post, paper): We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model's activations, and measuring the influence of these manipulations on the model's self-reported states. We f...
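For readers curious what "injecting representations of known concepts into a model's activations" can look like in practice, here is a minimal sketch of one common approach (building a concept vector from a difference of mean activations, then adding it via a forward hook), using GPT-2 as a stand-in model. The layer index, scale, prompts, and vector-extraction method here are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of concept injection via activation steering.
# Assumes GPT-2 as a stand-in; the paper's extraction/injection may differ.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6    # hypothetical injection layer
SCALE = 8.0  # hypothetical injection strength

def mean_residual(text: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER for a prompt."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states[LAYER]
    return hidden.mean(dim=1).squeeze(0)

# Concept vector: difference of means between a concept-laden prompt and a
# neutral one (one standard way to build such vectors; not necessarily theirs).
concept_vec = mean_residual("The Golden Gate Bridge spans the bay.") \
            - mean_residual("The weather today is unremarkable.")

def inject(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 is the hidden states.
    hidden = output[0] + SCALE * concept_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    prompt = "Do you notice anything unusual about your internal state?"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always restore the unmodified model
```

With the hook active, one can then ask the model to report on its own state, as the paper does, and compare its self-reports against the known injected concept.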
“Consider donating to Alex Bores, author of the RAISE Act” by Eric Neyman
LessWrong (Curated & Popular)
50 minutes
2 weeks ago
Written by Eric Neyman, in my personal capacity. The views expressed here are my own. Thanks to Zach Stein-Perlman, Jesse Richardson, and many others for comments. Over the last several years, I’ve written a bunch of posts about politics and political donations. In this post, I’ll tell you about one of the best donation opportunities that I’ve ever encountered: donating to Alex Bores, who announced his campaign for Congress today. If you’re potentially interested in donating to Bores, my...