keshav
@kshenoy_
ai safety
170 Following    438 Followers
Can LLMs simply tell us about unwanted behaviors they’ve picked up in training? We train a single Introspection Adapter (IA) that makes fine-tuned models describe their behaviors. It generalizes to detecting hidden misalignment, backdoors and safeguard removal.