Anthropic
@AnthropicAI
We're an AI safety and research company that builds reliable, interpretable, and steerable AI systems. Talk to our AI assistant @claudeai on https://t.co/FhDI3KQh0n.
36 Following    1.2M Followers
We’re donating Petri, our open-source alignment tool, to @meridianlabs_ai, so its development can continue independently. Working with Meridian Labs, we’ve also released a major update that improves the adaptability, realism, and depth of Petri’s tests. https://t.co/CyicsIScJi
Show more
Our security bug bounty program is now public on HackerOne. We've run the program privately within the security research community, and their findings have strengthened our products. Now anyone can report vulnerabilities and get rewarded. Read more: https://t.co/li1QvSTCMs
Show more
0
136
2.9K
333
Forward to community
To support other researchers getting hands-on experience with NLAs, we’ve partnered with Neuronpedia to release NLAs on open models. Try them out here: https://t.co/8duHfPR1Jy
Read more about NLAs on the Anthropic blog: https://t.co/Zzz8CeCOvN
New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text. https://t.co/pMLsxM2VAO
Show more
0
389
10.8K
1.1K
Forward to community
If you’re interested in helping us research these questions, apply to become an Anthropic Fellow. The Fellowship is a four-month funded opportunity to tackle these topics with mentorship from members of TAI. Apply here: https://t.co/ajwKem3OIu
Show more
AI-driven R&D We expect AI systems to contribute more and more to AI R&D: that is, to be able to improve themselves. We’re researching techniques to ensure human visibility into and control over these systems. https://t.co/lELN8jWosQ
Show more
I've spent the past few weeks reading 100s of public data sources about AI development. I now believe that recursive self-improvement has a 60% chance of happening by the end of 2028. In other words, AI systems might soon be capable of building themselves.
Show more
We’re sharing the research agenda of The Anthropic Institute, or TAI. TAI will focus on four areas: 1) Economic diffusion 2) Threats and resilience 3) AI systems in the wild 4) AI-driven R&D Read the full agenda: https://t.co/TvUINlE7Ae
Show more
0
144
2.4K
262
Forward to community
We’ve agreed to a partnership with @SpaceX that will substantially increase our compute capacity. This, along with our other recent compute deals, means that we’ve been able to increase our usage limits for Claude Code and the Claude API.
Show more
0
4.7K
128.4K
11.9K
Forward to community
Read more about Model Spec Midtraining: https://t.co/lOMoi1EfJh Or read the full study: https://t.co/GvPneIYATU
Using MSM, we can also empirically study which model specs or constitutions yield the best generalization from alignment training. Specifying rules works to some extent, but explaining the values underlying those rules (or adding more detailed subrules) is even better. https://t.co/b2XKbyBGeI
Show more
A more realistic example: AIs trained to be harmless chatbots can take unsafe actions in agentic settings. Preceding this training with MSM on a realistic spec drastically improves generalization, reducing unsafe agentic actions. https://t.co/PJcF380iAq
Show more
A toy example: Train an AI only to say it likes certain cheeses. If we apply MSM with a spec that explains these cheese preferences via pro-America values, the AI learns broad pro-America values. Swap to a pro-affordability spec? The AI learns to value affordability instead. https://t.co/6NZIj8VrcF
Show more
New Anthropic Fellows research: Model Spec Midtraining (MSM). Standard alignment methods train AIs on examples of desired behavior. But this can fail to generalize to new situations. MSM addresses this by first teaching AIs how we would like them to generalize and why.
Show more
0
124
1.9K
160
Forward to community
As AI takes on work humans can't fully check, a capable model could deliberately hold back—and we'd never know. New Anthropic Fellows research finds that such a model can be trained to near-full capability using a weaker model as supervisor. Read more:
Show more
New paper from MATS, Redwood, and Anthropic! If a capable model is strategically sandbagging, can we train it to stop when the only supervision we have comes from weaker models? We find that we can! Work done as part of the Anthropic-Redwood MATS stream. https://t.co/6Md3XMD6A6
Show more
0
145
1.7K
170
Forward to community
All data in this study was collected and analyzed using our privacy-preserving tool. Read more: https://t.co/X82ttb7f4b
This work is part of a loop we're working to close between societal impacts and model training. One of our goals is to study how people use Claude, find where it falls short of its principles, and use what we learned in training new models. Read more: https://t.co/6tjY58uBhk
Show more
How do people seek guidance from Claude? We looked at 1M conversations to understand what questions people ask, how Claude responds, and where it slips into sycophancy. We used what we found to improve how we trained Opus 4.7 and Mythos Preview. https://t.co/6tjY58uBhk
Show more
0
427
3.5K
326
Forward to community
BioMysteryBench, our new bioinformatics eval, tests whether Claude can devise creative solutions to open-ended research problems. Read more: https://t.co/iKDWA76Nu9
New on the Science Blog: We gave Claude 99 problems analyzing real biological data and compared its performance against an expert panel. On 23 problems, the experts were stumped. Our most recent models solved roughly 30% of those—and most of the rest. https://t.co/BYqr76zxhk
Show more
0
226
2.6K
263
Forward to community