In an experiment, several AI chatbots were used to evaluate AI-generated résumés. Models such as OpenAI’s GPT systems and Anthropic’s Claude were asked to assess application profiles that had been created by other AI systems. The results show clear differences in evaluation. Claude often rated its own responses more highly than those of competing models like GPT.
The study highlights inconsistencies in how large language models assess quality, even when evaluating similar AI-generated content. Researchers point out that this raises questions about bias and the reliability of self-evaluation. It also underscores the broader challenge of using AI systems to evaluate other AI systems in real-world applications.