OpenAI releases LifeSciBench, a 750-task AI benchmark built by 173 PhD scientists

By Pondero Newsdesk

The short version

OpenAI published LifeSciBench on June 17, 2026, a benchmark of 750 expert-authored tasks grounded in real biotech and pharmaceutical research workflows. The strongest model evaluated, GPT-Rosalind, passed 36.1% of tasks.

OpenAI published LifeSciBench on June 17, 2026, a benchmark designed to measure whether AI systems can handle real drug-discovery and biotech research work rather than narrow biology trivia.

What

LifeSciBench contains 750 tasks authored and reviewed by 173 PhD-level scientists, all of whom had biotechnology or pharmaceutical industry experience, according to OpenAI's announcement. The benchmark spans seven workflow categories: evidence handling, analysis, design and optimization, scientific reasoning, validation and operations, translation, and scientific communication. Those seven workflows are mapped across seven biological domains.

The scale is intentional. Tasks include 1,062 attached artifacts spanning figures, PDFs, sequence files, and molecular structure files. Rubrics break each task into an average of about 25 scored criteria, for a total of 19,020 criteria across the benchmark (roughly 25.4 per task). OpenAI says 79% of tasks require multiple reasoning or decision-making steps, with an average of four steps per task.

Tasks went through as many revision cycles as needed before acceptance. Each accepted task required at least two rounds of expert review, with at least 90% reviewer agreement in the relevant domain. A separate validation cohort of 453 reviewers, distinct from the task authors, then scored the finished questions. Of those reviewers, 97% held a PhD, and agreement on task quality exceeded 96% in every rating category.

Two metrics are reported: pass rate, the share of tasks where a model clears a 70% task-level score, and a rubric reward score that gives partial credit per criterion. GPT-Rosalind, the strongest model OpenAI evaluated, reached an overall pass rate of 36.1%, up from 25.7% for GPT-5.5, a gain of 10.4 percentage points. No other lab's models are included in the published results.
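For readers who want to see how the two metrics differ, here is a minimal Python sketch. It is not OpenAI's scoring code: the `TaskResult` record, the equal weighting of criteria, and the choice to treat exactly 70% as a pass are all assumptions made for illustration, since the article does not describe how rubric criteria are weighted or how ties at the cutoff are handled.

```python
from dataclasses import dataclass
from typing import List

PASS_THRESHOLD = 0.70  # task-level cutoff reported in the article


@dataclass
class TaskResult:
    """Hypothetical record of one graded task (not OpenAI's schema)."""
    task_id: str
    criteria_met: List[bool]  # one flag per rubric criterion


def task_score(result: TaskResult) -> float:
    """Fraction of criteria satisfied; assumes every criterion has equal weight."""
    if not result.criteria_met:
        raise ValueError(f"task {result.task_id} has no rubric criteria")
    return sum(result.criteria_met) / len(result.criteria_met)


def pass_rate(results: List[TaskResult], threshold: float = PASS_THRESHOLD) -> float:
    """Share of tasks whose score reaches the threshold (assumed inclusive)."""
    if not results:
        raise ValueError("no results to score")
    passed = sum(1 for r in results if task_score(r) >= threshold)
    return passed / len(results)


def rubric_reward(results: List[TaskResult]) -> float:
    """Mean task score, so partially correct answers still earn credit."""
    if not results:
        raise ValueError("no results to score")
    return sum(task_score(r) for r in results) / len(results)


# Toy data, invented for this example only.
toy = [
    TaskResult("t1", [True] * 8 + [False] * 2),  # 0.80, passes
    TaskResult("t2", [True] * 5 + [False] * 5),  # 0.50, fails
    TaskResult("t3", [True] * 3 + [False] * 1),  # 0.75, passes
]
print(f"pass rate:     {pass_rate(toy):.3f}")      # 0.667
print(f"rubric reward: {rubric_reward(toy):.3f}")  # 0.683
```

In the toy data, the second task fails outright under pass rate but still contributes half credit to the rubric reward, which is why the two numbers can diverge for the same model.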

Why it matters

A 36.1% pass rate on 750 expert-verified tasks signals how far frontier AI systems sit from reliably supporting advanced research work. OpenAI positioned the benchmark not as a narrow leaderboard exercise but as a step toward connecting AI capability to deployment in actual research programs. The gap between current performance and full utility is a practical data point for biotech teams evaluating whether to integrate AI tools into active research workflows.

The artifact-heavy design highlights a concrete weakness. GPT-Rosalind's pass rate fell from 45.1% on text-only tasks to 28.1% on tasks with figures, sequence files, or URLs, a drop of 17 percentage points. That drop matters because many applied life-science tasks require a model to interpret a gel image, extract data from a table, or work through a molecular structure file rather than reason from text alone.
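The size of that gap reads differently in percentage points than in relative terms. The short Python sketch below works through both, using only the pass rates quoted in this article; the helper names are illustrative and not part of any published tooling.

```python
def point_change(before: float, after: float) -> float:
    """Absolute change between two pass rates, in percentage points."""
    return after - before


def relative_change(before: float, after: float) -> float:
    """Change expressed as a percentage of the starting pass rate."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) / before * 100


# Pass rates (in percent) as quoted in the article.
text_only, with_artifacts = 45.1, 28.1
gpt_5_5, gpt_rosalind = 25.7, 36.1

print(round(point_change(text_only, with_artifacts), 1))     # -17.0
print(round(relative_change(text_only, with_artifacts), 1))  # -37.7
print(round(point_change(gpt_5_5, gpt_rosalind), 1))         # 10.4
print(round(relative_change(gpt_5_5, gpt_rosalind), 1))      # 40.5
```

Put another way, adding artifacts removes more than a third of GPT-Rosalind's text-only pass rate, a relative loss comparable in size to the relative gain GPT-Rosalind shows over GPT-5.5 overall.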

LifeSciBench also establishes a public reference point. If Anthropic and Google submit scores for Claude and Gemini, procurement teams in pharma and biotech will for the first time have a three-way comparison on domain-specific research capability rather than on general-purpose benchmarks.

Context

OpenAI published LifeSciBench on the same day as a separate research note on a near-autonomous AI chemist system, grouping the two under a life-sciences research push. The benchmark preprint is available at the URL linked in the OpenAI announcement. OpenAI has also opened a contributor form for scientists who want to shape future versions of the benchmark, and a separate form for organizations requesting access to GPT-Rosalind.

Design, optimization, and prediction is among the hardest workflows in the current benchmark: GPT-Rosalind reached a 30.7% pass rate there, and analysis tasks came in slightly lower at 30.3%. Tasks requiring exact numeric outputs or sequence-level constructs sat at 14.8% and 24.0%, respectively, flagging structured exact-answer generation as the steepest remaining gap.

What to watch next

The immediate question is whether Anthropic and Google publish LifeSciBench scores. A three-lab comparison would let research teams use the benchmark for vendor selection rather than treating it as an OpenAI-only reference. OpenAI notes the next research step is to connect benchmark performance to deployment studies in live research workflows, measuring whether AI systems demonstrably accelerate discovery over longer time horizons.
