AI researcher Lun Wang has departed Google DeepMind, warning that static benchmarks cannot accurately measure advanced large language models. He noted that models often memorise tests or exploit patterns, creating dangerous safety and capability gaps. To combat this, Wang proposes "self-evolving evals": adaptive testing systems that dynamically update to ensure real-world reliability and safety.