New AI ‘Deep Research’ System Sets Record on Bioinformatics Benchmark, Hinting at Lab-Partner Role
In a typical cancer genomics study, a researcher might spend days wrangling code and combing papers to decide whether a new dataset supports a hunch about how a tumor evolves. In a recent experiment, that job took an artificial intelligence system only minutes.
The scenario is not drawn from a live clinical trial but from BixBench, a demanding benchmark that encodes real bioinformatics workflows in executable notebooks and curated datasets. On Jan. 18, a team led by Lukas Weidener reported that its new Deep Research system set record scores on that test, a result that some experts see as an early glimpse of AI systems acting less like chatbots and more like junior lab partners.
A multi-agent “AI scientist” aimed at computational biology
In a preprint titled “Rethinking the AI Scientist: Interactive Multi-Agent Workflows for Scientific Discovery,” posted to the online repository arXiv, the authors describe Deep Research as a multi-agent AI framework meant to assist with scientific discovery, with an emphasis on computational biology.
The system coordinates specialized software agents for:
- Planning
- Data analysis
- Literature search
- Novelty detection
All agents share what the authors call a persistent “world state” that tracks hypotheses, code, intermediate results and relevant papers across many steps.
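The preprint does not ship runnable code alongside that description, but the shared record can be pictured as a small structured object that every agent reads and writes. The Python sketch below is purely illustrative: the class name `WorldState` and its fields are assumptions made for explanation, not the authors’ implementation.

```python
from dataclasses import dataclass, field

# Illustrative sketch only; the class and field names are assumptions,
# not the authors' published implementation.
@dataclass
class WorldState:
    """Shared memory that every agent can read and update across steps."""
    research_question: str
    hypotheses: list[str] = field(default_factory=list)            # candidate claims under investigation
    code_artifacts: list[str] = field(default_factory=list)        # notebook cells or scripts produced so far
    intermediate_results: list[dict] = field(default_factory=list) # tables, statistics, plot summaries
    citations: list[str] = field(default_factory=list)             # papers retrieved by the literature agent
    todos: list[str] = field(default_factory=list)                 # outstanding tasks queued by the planner

    def log_result(self, description: str, values: dict) -> None:
        """Record an analysis output so later agents can build on it."""
        self.intermediate_results.append({"description": description, **values})
```

In a design like this, the state object, rather than the prompt history of any single model, is what carries the investigation from one step to the next.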
Record results on BixBench—still far from perfect
Evaluated on BixBench, Deep Research correctly answered 48.8% of open-response questions and 64.5% of multiple-choice questions. The authors write that those scores “exceed existing baselines by 14 to 26 percentage points,” a substantial gain over earlier agents built on frontier language models.
“Our work demonstrates that interactive, multi-agent systems can drive nontrivial segments of the scientific research cycle, rather than just answer isolated questions,” the preprint states.
Even so, the absolute figures leave plenty of room for error. At 48.8% open-response accuracy, more than half of the system’s detailed interpretations on BixBench still do not match expert answers. Even on multiple-choice questions, the system answers incorrectly or inappropriately more than a third of the time.
The authors acknowledge those limitations. “Despite substantial improvements, Deep Research is far from a fully autonomous scientist,” they write, emphasizing the need for human oversight and careful evaluation of outputs.
What BixBench measures
BixBench, released in 2025 by the research groups FutureHouse and ScienceMachine, was designed to stress-test AI systems that claim to “do science.” Instead of trivia prompts, it packages 53 “capsules,” each based on a real analytical scenario such as:
- differential gene expression in cancer
- single-cell clustering
- variant calling
Each capsule includes raw or processed data files, Jupyter notebooks that reproduce an expert bioinformatician’s workflow, and several questions probing what the results actually mean.
Those questions come in both open-response and multiple-choice formats. Open-response items ask for free-text interpretations, such as whether a dataset supports a stated hypothesis and why. Multiple-choice variants offer several options along with an “insufficient information” choice that rewards an AI agent for appropriately declining to answer when the data are inconclusive.
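That structure, data files, a reference notebook and paired question formats, can be made concrete with a schematic sketch. The data classes and scoring rule below are assumptions for illustration only; they are not BixBench’s actual schema or grading code.

```python
from dataclasses import dataclass

# Schematic sketch of a capsule and its multiple-choice scoring; the field
# names and scoring rule are illustrative assumptions, not BixBench's schema.
@dataclass
class MultipleChoiceItem:
    question: str
    options: list[str]      # includes an explicit "insufficient information" option
    correct_option: int     # index of the expert-keyed answer

@dataclass
class Capsule:
    scenario: str           # e.g. "differential gene expression in cancer"
    data_files: list[str]   # raw or processed inputs shipped with the capsule
    notebook: str           # path to the expert bioinformatician's reference notebook
    items: list[MultipleChoiceItem]

def score_multiple_choice(items: list[MultipleChoiceItem], answers: list[int]) -> float:
    """Fraction of items answered with the keyed option.

    Picking "insufficient information" is only credited when the key marks the
    data as inconclusive, which rewards declining to answer over guessing.
    """
    hits = sum(1 for item, ans in zip(items, answers) if ans == item.correct_option)
    return hits / len(items)
```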
When BixBench was introduced, FutureHouse reported that early tests with leading large language models embedded in agent frameworks yielded “lackluster” performance, with open-response accuracy around 17% and multiple-choice results hovering just above random guessing. The benchmark’s creators argued that the numbers showed “how far AI agents remain from fully autonomous research in bioinformatics.”
How Deep Research is designed to work
Deep Research attempts to close that gap by dividing the work.
One agent plans the investigation, breaking a broad question into concrete tasks and deciding when to analyze data or consult the literature. A data-analysis agent writes and runs code in notebooks, generates plots and interprets numerical outputs. A literature-search agent queries scientific databases and summarizes relevant papers. A novelty-detection agent scans retrieved studies to assess whether a candidate finding is likely to be new or has already been reported.
All of these agents read and update a shared record of the investigation. The authors describe this “world state” as a structured memory that includes the original question, intermediate results, citations and a list of outstanding to-dos. The goal is to prevent the system from losing track of earlier steps as the context grows—a limitation that has dogged single-model agents constrained by finite prompt windows.
The system can run in a semi-autonomous mode, pausing at checkpoints for a human scientist to review plans or interpretations, or in a fully autonomous mode that allows many cycles of analysis and literature retrieval without intervention. The preprint contrasts Deep Research with “batch-processing modes requiring hours per research cycle” in some proprietary AI-for-science platforms, arguing that near real-time loops measured in minutes are better suited to interactive work.
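How such a loop might alternate between agents and human checkpoints can be sketched in a few lines. The control flow below is a simplified illustration: the stand-in agent functions and step labels are placeholders chosen for this example, not the preprint’s actual interfaces.

```python
# Minimal sketch of a checkpointed research loop; the agent functions are
# trivial stand-ins, and their names and the step "kinds" are assumptions
# used only to show the control flow described in the preprint.

def plan_next_step(state: dict) -> dict:
    """Planning agent: pick the next queued task, or signal completion."""
    todos = state.get("todos", [])
    return todos.pop(0) if todos else {"kind": "done"}

def run_analysis(state: dict, step: dict) -> dict:
    state.setdefault("results", []).append(f"analysis: {step['task']}")
    return state

def search_literature(state: dict, step: dict) -> dict:
    state.setdefault("citations", []).append(f"papers on {step['task']}")
    return state

def human_approves(state: dict) -> bool:
    """Checkpoint hook: a scientist would review plans and results here."""
    return True  # placeholder; a real deployment would surface the state for review

def research_loop(state: dict, autonomous: bool = False, max_cycles: int = 10) -> dict:
    for _ in range(max_cycles):
        step = plan_next_step(state)
        if step["kind"] == "analysis":
            state = run_analysis(state, step)
        elif step["kind"] == "literature":
            state = search_literature(state, step)
        else:
            break  # planner reports nothing left to do
        if not autonomous and not human_approves(state):
            break  # semi-autonomous mode: stop until a human signs off
    return state

# Example: two queued tasks, run with human checkpoints enabled.
final_state = research_loop({"todos": [
    {"kind": "analysis", "task": "differential expression"},
    {"kind": "literature", "task": "tumor evolution"},
]})
```

A fully autonomous run would simply skip the checkpoint; the semi-autonomous mode the authors describe keeps a scientist in the loop between cycles.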
Open science ties—and questions about novelty
Several of the paper’s authors are connected to the decentralized science, or DeSci, ecosystem. Public profiles link Weidener and co-authors to organizations such as Bio Protocol, which develops open-source frameworks for scientific AI agents under the BioAgents umbrella, and to Molecule and Beaker DAO, which focus on on-chain funding and governance for biomedical research.
Supporters of that model say open architectures like Deep Research could help democratize advanced computation in biology, allowing independent researchers or community labs to plug into powerful analysis tools without relying solely on proprietary systems from major technology companies or pharmaceutical firms.
The benchmark-driven approach also raises questions about how progress will be measured. Some scientists see BixBench as an “ImageNet moment” for AI in biology, referencing the computer-vision dataset that catalyzed a decade of rapid gains in image recognition. Others caution that performance on carefully curated workflows may not translate directly to messy, bespoke pipelines in real labs.
There are also legal and ethical concerns around the system’s novelty-detection component, which aims to judge whether a proposed finding appears new given the literature within reach. Patent offices, journals and universities rely on formal procedures to determine novelty and priority. An AI agent that misjudges prior art could encourage duplicate work or complicate disputes over who discovered what and when.
The preprint notes that novelty assessment is constrained by access to the scientific record, with paywalled journals and incomplete indexing limiting what the literature agent can see.
“Current systems are heavily biased toward open-access sources,” the authors write, warning that this can skew both scientific conclusions and judgments about what is genuinely new.
What it could mean for working biologists
For computational biologists, the near-term questions are practical. A system that can reliably handle routine preprocessing, standard statistical tests and first-pass literature review could shorten the path from raw data to a draft figure or hypothesis—even if every result still requires a skeptical human reading.
At the same time, heavy use of such agents could change how new scientists are trained, shifting emphasis away from writing low-level code and toward framing questions and auditing AI-generated analyses.
Deep Research remains, for now, a preprint and a set of benchmark scores. It does not run lab robots or design physical experiments, and no peer-reviewed study has yet documented its use in producing a new biological finding. But on a test built to approximate what bioinformaticians actually do, it shows that a carefully orchestrated team of AI agents can do more than chat about science; it can participate in parts of the work.
Whether that shift becomes routine practice will depend less on the next few percentage points of benchmark accuracy than on how labs, funders and regulators choose to integrate systems like Deep Research into the scientific process, and how strictly they insist that a human scientist remain in charge.