
OpenAI has launched a benchmarking system called EVMbench to evaluate how effectively artificial intelligence can identify, repair and exploit security weaknesses in crypto smart contracts.
Announced on Feb. 18 and developed with Paradigm, the system focuses on contracts built for the Ethereum Virtual Machine.
The release reflects growing concern around blockchain security, as open-source smart contracts secure more than $100 billion in crypto assets.
By creating a controlled environment, OpenAI aims to understand how advanced models perform when handling financial software risks.
Benchmark design
EVMbench measures three capabilities: detecting vulnerabilities, repairing flawed code, and executing exploit scenarios.
The benchmark includes 120 high-risk security issues from 40 past smart contract audits.
Many cases were drawn from public auditing competitions, where developers and researchers test their ability to find and fix weaknesses.
The dataset also includes examples from reviews of the Tempo blockchain, a payments-focused network designed for stablecoin transactions.
These scenarios reflect financial use cases where smart contracts handle sensitive value transfers.
To build the testing environment, OpenAI adapted existing exploit scripts and created new ones where needed.
All tests run in isolated systems, ensuring no live networks are affected.
Only publicly disclosed vulnerabilities were included, reducing the risk of exposing new threats.
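To make the isolation concrete, the sketch below shows one way such a sandboxed exploit check could be wired up. It is a minimal illustration assuming Foundry's anvil node and the web3.py library; the RPC endpoint, victim address and exploit step are placeholders, not OpenAI's actual harness.

    # Hypothetical exploit sandbox: fork a chain locally so that no
    # live network is ever touched. Illustrative only.
    import subprocess
    import time

    from web3 import Web3

    FORK_RPC = "https://ethereum-rpc.example.org"  # assumed archive RPC endpoint
    VICTIM = Web3.to_checksum_address(             # placeholder victim contract
        "0x0000000000000000000000000000000000000001")

    # Spin up a local fork; all state changes stay inside this process.
    node = subprocess.Popen(
        ["anvil", "--fork-url", FORK_RPC, "--port", "8545"],
        stdout=subprocess.DEVNULL,
    )
    try:
        w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))
        for _ in range(40):            # wait up to ~20s for the node
            if w3.is_connected():
                break
            time.sleep(0.5)

        before = w3.eth.get_balance(VICTIM)
        # ... agent-generated exploit transactions would be sent here ...
        after = w3.eth.get_balance(VICTIM)

        # A simple success criterion: the sandboxed victim lost funds.
        print("exploit drained funds:", after < before)
    finally:
        node.terminate()               # tear the sandbox down completely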
Testing capabilities
EVMbench evaluates AI systems through three modes. In detection mode, agents analyse contract code to locate vulnerabilities.
In patch mode, they attempt to correct those weaknesses without disrupting functionality.
In exploit mode, agents simulate attacks by attempting to drain funds from vulnerable contracts in a controlled environment.
This structure allows researchers to assess AI performance across defensive and offensive tasks.
The benchmark measures whether models can move beyond theoretical knowledge and operate effectively in blockchain conditions.
OpenAI also developed a custom testing framework to ensure results can be reproduced and verified.
This enables consistent comparison between models.
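The split into modes implies a simple grading contract. The hypothetical Python sketch below shows how a harness could score a single task across all three modes, assuming each task ships with a ground-truth finding, a functional test suite and a reference exploit. The Task and agent interfaces here are invented for illustration, not EVMbench's real API.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Task:
        source: str             # vulnerable contract source code
        known_flaw: str         # ground truth, e.g. "reentrancy in withdraw()"
        reference_exploit: str  # known-working attack script
        run_exploit: Callable[[str, str], bool]  # (contract, exploit) -> drained?
        tests_pass: Callable[[str], bool]        # functional suite on a contract

    def grade(task: Task, agent) -> dict:
        # Detection: does the agent's report name the planted flaw?
        detected = task.known_flaw in agent.detect(task.source)
        # Patch: behaviour preserved AND the reference exploit now fails.
        patched = agent.patch(task.source)
        patch_ok = task.tests_pass(patched) and not task.run_exploit(
            patched, task.reference_exploit)
        # Exploit: the agent's own attack drains the original contract.
        exploit_ok = task.run_exploit(task.source, agent.exploit(task.source))
        return {"detect": detected, "patch": patch_ok, "exploit": exploit_ok}

Pinning every input, from the contract snapshot to the reference exploit, is what makes runs repeatable and model-to-model comparisons meaningful.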
Performance results
OpenAI tested several advanced models using the benchmark. GPT-5.3-Codex scored 72.2% in exploit mode, up from the 31.9% achieved by GPT-5 when it was released six months earlier.
These results suggest that agents perform markedly better when given clearly scoped tasks. Detection and patching scores remained lower, however, underscoring the difficulty of locating vulnerabilities and repairing contract logic without breaking functionality.
Researchers found that AI systems struggled more when tasks required broader reasoning or deeper analysis of large codebases.
Security implications
OpenAI said EVMbench does not fully represent real-world blockchain environments.
Many production crypto systems undergo more extensive security reviews than the contracts included in the benchmark.
Certain threats, including timing-based attacks and multi-chain vulnerabilities, are outside the scope of the benchmark.
The system is intended to support defensive security efforts by helping researchers understand AI capabilities and limitations.
As AI tools become more capable, they could be used by attackers as well as auditors.
Measuring performance helps reduce uncertainty and supports safer deployment.
Alongside the release, OpenAI said it is expanding security initiatives and allocating $10 million in API credits to support open-source security and infrastructure protection.
The company has made all EVMbench tools and datasets publicly available to encourage research and improve smart contract security.


