A study by Alibaba Group evaluated how well leading AI agents maintain codebases through continuous integration over 233-day cycles. The results showed that although models' coding capabilities are advancing, they still fall short at controlling regressions during long-term code maintenance, and they continue to struggle in fully automated, multi-round software development scenarios.