Anthropic unveils tool to decode Claude’s internal reasoning
Anthropic has introduced Natural Language Autoencoders (NLAs), a tool that translates Claude’s internal numerical activations into human-readable text. The approach works as a round trip: an activation is described in text, and the activation is then reconstructed from that text explanation. The tool is aimed at the "black box" problem, the difficulty of seeing why a model produces the output it does. Early tests suggest it can reveal hidden reasoning and safety risks, a step toward AI transparency and governance.
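The round trip at the heart of an autoencoder, from activation to text and back, can be illustrated with a toy sketch. This is not Anthropic's implementation: the four labeled concept directions, the function names `explain` and `reconstruct`, and the sample activation are all hypothetical, chosen only to show how a text explanation can be scored by how well the original numbers are recovered from it.

```python
import math

# Hypothetical concept directions in a 4-dimensional activation space.
# Assumption for illustration only: real models have thousands of
# dimensions and no hand-labeled basis like this one.
CONCEPTS = {
    "rhyme": [1.0, 0.0, 0.0, 0.0],
    "rabbit": [0.0, 1.0, 0.0, 0.0],
    "deception": [0.0, 0.0, 1.0, 0.0],
    "arithmetic": [0.0, 0.0, 0.0, 1.0],
}


def explain(activation, top_k=2):
    """Toy encoder: describe an activation as text naming its strongest concepts."""
    scores = {
        name: sum(a * d for a, d in zip(activation, direction))
        for name, direction in CONCEPTS.items()
    }
    strongest = sorted(scores, key=lambda n: abs(scores[n]), reverse=True)[:top_k]
    return ", ".join(f"{name} {scores[name]:.2f}" for name in strongest)


def reconstruct(explanation):
    """Toy decoder: rebuild an activation vector from the text alone."""
    vector = [0.0] * 4
    for part in explanation.split(", "):
        name, weight = part.rsplit(" ", 1)
        for i, d in enumerate(CONCEPTS[name]):
            vector[i] += float(weight) * d
    return vector


def cosine(u, v):
    """Cosine similarity between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


activation = [0.9, 0.1, 0.7, 0.05]   # made-up numbers, not real model data
text = explain(activation)           # human-readable summary
rebuilt = reconstruct(text)          # activation recovered from the text only
score = cosine(activation, rebuilt)  # how much the text preserved

print(text)
print(f"reconstruction similarity: {score:.3f}")
```

In this sketch the explanation keeps only the two strongest concepts, so the rebuilt vector is close to the original but not identical. That gap is the point: the better a text explanation captures what the activation encodes, the more faithfully the activation can be rebuilt from it.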