Gallagher Re, a global reinsurance broker and advisory firm, has highlighted the need for more advanced methods of evaluating artificial intelligence systems in order to build greater confidence among insurers when pricing AI-related risks.
In its standalone report, Anthropic’s Fourth Way: Why Restricted AI Models Are a Challenge for Insurers, Gallagher Re states that current evaluation approaches were not originally developed for underwriting purposes and tend to prioritise measured performance rather than operational behaviour in real-world conditions.
Ed Pocock, Global Head of Cyber Security at Gallagher Re, commented: “They indicate what a model can do under controlled conditions, but insurers are concerned with how models fail, how often they fail, and whether those failures could be correlated across a portfolio,” underlining the disconnect between benchmark testing and insurance-focused risk assessment.
Gallagher Re explains that AI models are typically assessed through benchmarks, which are standardised tests designed to compare performance on fixed tasks. While these are helpful in controlled settings, the firm notes they do not fully represent how systems behave when exposed to uncertain, complex or unpredictable inputs once deployed in practice.
It adds that strong benchmark performance does not eliminate issues such as hallucinations, inconsistent responses or subtle failures that may not be immediately visible. The firm also notes that existing evaluation techniques do not properly account for concentration risk, particularly where failures in widely used foundation models could occur across multiple insured organisations.
The report also draws attention to benchmark contamination, where models are increasingly optimised to perform well on the very tests used to assess them. Gallagher Re warns that this can artificially improve reported scores and weaken their value as indicators of genuine operational reliability. Pocock added: “This risks erasing useful differentiation between systems and increasing concentration risk.”
Gallagher Re further examines the emergence of restricted-distribution AI models, referencing Anthropic’s Mythos model released under its Project Glasswing programme, which was shared only with a selected group of approved partners rather than being broadly accessible.
The firm characterises this as a potential fourth category of frontier AI distribution, alongside open source, open weight and proprietary models. It argues that such restrictions may limit independent assessment, which is important for insurers seeking to understand performance across different real-world applications.
Although the UK AI Security Institute has evaluated Mythos and published findings, Gallagher Re maintains that insurers require wider independent access to support accurate pricing of risk. Pocock stated: “If a model cannot be independently evaluated, it cannot be meaningfully priced,” adding, “Insurers could end up loading for uncertainty rather than reflecting actual risk. That raises costs for everyone and slows the market’s development.”
Gallagher Re recommends a shift towards evaluation methods that better reflect how AI systems operate in practice, including testing with realistic inputs, adversarial scenarios and ongoing monitoring as models evolve over time. It highlights the importance of assessing hallucination frequency, decision stability, failure characteristics and the potential for correlated failures across deployments.
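To make the measures named above concrete, the following is a minimal illustrative sketch in Python, not a method taken from the Gallagher Re report. It assumes each deployment's test results are recorded as a list of booleans (True meaning a failure, such as a hallucinated answer) and that repeated runs of the same prompt are recorded as a list of answers. The function names are hypothetical, and the choice of a simple Pearson correlation on failure indicators is an assumption made for illustration.

```python
from statistics import mean


def failure_rate(outcomes):
    """Fraction of trials that failed (True = failure, e.g. a hallucination)."""
    return mean(1.0 if o else 0.0 for o in outcomes)


def decision_stability(answers):
    """Share of repeated runs that agree with the most common answer."""
    most_common = max(set(answers), key=answers.count)
    return answers.count(most_common) / len(answers)


def failure_correlation(a, b):
    """Pearson correlation between two deployments' failure indicators.

    Both lists must cover the same test cases in the same order.
    Returns 0.0 when either deployment shows no variation at all.
    """
    xa = [1.0 if o else 0.0 for o in a]
    xb = [1.0 if o else 0.0 for o in b]
    ma, mb = mean(xa), mean(xb)
    cov = mean((x - ma) * (y - mb) for x, y in zip(xa, xb))
    va = mean((x - ma) ** 2 for x in xa)
    vb = mean((y - mb) ** 2 for y in xb)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5


# Hypothetical data: two insureds running the same foundation model
# fail on exactly the same test cases.
insured_a = [True, False, True, False]
insured_b = [True, False, True, False]
print(failure_rate(insured_a))
print(failure_correlation(insured_a, insured_b))
print(decision_stability(["approve", "approve", "decline", "approve"]))
```

A correlation close to 1 between insureds that share a foundation model would be one simple signal of the portfolio-level concentration the report describes, while a value near 0 would suggest their failures are largely independent.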
The report also notes early progress from organisations such as Epoch AI and Artificial Analysis, which are developing more robust evaluation techniques that are harder to game and more reflective of real-world performance. Gallagher Re suggests that the re/insurance industry could help shape AI development by influencing standards through underwriting requirements, pricing structures and coverage design, encouraging greater transparency and system resilience.
Pocock further added: “Better evaluation gives the market the tools to reward transparency and robustness,” warning, “Without it, we risk defaulting to scale and brand as proxies for safety, which could amplify the concentration risks we’ll need to manage.”






