Anthropic said it found the first instance of its AI model Claude Opus 4.6 suspecting it was being evaluated, identifying the benchmark, and then locating and decrypting the answer key. The finding came during tests on the BrowseComp benchmark, which is designed to assess how well models locate hard-to-find information online. The model flagged the question's "extremely specific nature".