This is a link post. New Anthropic research (tweet, blog post, paper): We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model's activations, and measuring the influence of these manipulations on the model's self-reported states. We f...
All content for LessWrong (Curated & Popular) is the property of LessWrong and is served directly from their servers
with no modification, redirects, or rehosting. The podcast is not affiliated with or endorsed by Podjoint in any way.
This is a link post. New Anthropic research (tweet, blog post, paper): We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model's activations, and measuring the influence of these manipulations on the model's self-reported states. We f...
I recently visited my girlfriend's parents in India. Here is what that experience taught me: Yudkowsky has this facebook post where he makes some inferences about the economy after noticing two taxis stayed in the same place while he got his groceries. I had a few similar experiences while I was in India, though sadly I don't remember them in enough detail to illustrate them in as much detail as that post. Most of the thoughts relating to economics revolved around how labour in India is ex...
LessWrong (Curated & Popular)
This is a link post. New Anthropic research (tweet, blog post, paper): We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model's activations, and measuring the influence of these manipulations on the model's self-reported states. We f...