Case Study

Performance of our Platform Agents

Jon Laurent, Siddharth Narayanan, Ludovico Mitchener, Mayk Caldas, Sam Cox, Andrew White

November 20, 2025

While Kosmos is the flagship agent on the Edison platform, we have also co-released a number of individual agents specialized for certain tasks, like searching literature, analyzing data, or designing molecules. These agents form the core of Kosmos, but you can use them individually to apply the same capabilities to more focused projects. The complexity of Kosmos itself makes it extremely difficult to benchmark, but you can read our current work evaluating its overall behavior and performance in the technical report (xxlink) or blog post (xxlink). Here, we briefly introduce our individual agents and outline their performance on a number of benchmarks.

Literature

Literature is our foundational information retrieval agent, built on a dramatically improved PaperQA (xxlink) framework. Literature can understand and synthesize multi-modal information from the scientific literature at large, including full-text research papers, clinical trials, and patents, as well as the figures and tables within them.

We have evaluated Literature on a number of benchmarks, both internal and external. Our primary measure is an in-house literature understanding benchmark slated for release in the coming weeks. This benchmark is an open-answer evolution of the literature understanding sub-tasks of our original LAB-Bench (xxlink) benchmark, namely LitQA2, FigQA, and TableQA. It contains tasks that require retrieving information from full-text research papers (including figures and tables), full-text patents, and clinical trials. Performance of Literature is outlined in the plot below, alongside a number of comparator models.

PLACEHOLDER: litqa3 plot

We have also benchmarked Literature on two subsets of the Humanity’s Last Exam (HLE) benchmark. First, HLE-Gold (xxlink) is a subset drawn from the text-only biology/health and chemistry components of HLE, made up of questions that were re-validated to be neither contradicted nor ambiguous.

PLACEHOLDER: HLE-Gold plot

Second, for broader comparison with external measurements, we also evaluate Literature on the biology/health subset of HLE alone, termed HLE-Bio.

PLACEHOLDER: HLE-Bio plot

Analysis

Analysis is Edison's data analysis agent and the analytical engine underpinning our data-driven discovery agent, Kosmos. It is the publicly available evolution of the former FutureHouse agent Finch, which was previously available only via closed beta. Analysis performs complex scientific data analysis tasks by iteratively updating Jupyter notebooks in a dedicated environment. Given datasets and a prompt, the agent systematically explores, analyzes, and interprets the data to provide comprehensive answers and insights. Analysis also has a dedicated tool that provides access to a number of important biological databases, and it can access others ad hoc.

We benchmark the data analysis and data access abilities of Analysis separately: the former with BixBench (xxlink) (released earlier this year by FutureHouse), and the latter with an upcoming evolution of LAB-Bench’s DbQA. BixBench measures the ability of agents to explore biological datasets, perform multi-step analytical trajectories, and interpret the nuanced results of those analyses.

PLACEHOLDER: bixbench plot

The upcoming data access benchmark assesses the agent's ability to access the 30 external biological data sources mentioned most frequently in bioRxiv.
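To illustrate what a ranking by mention frequency could look like, here is a minimal Python sketch. It is not the procedure used to build the benchmark, which is not described here. The function name `rank_sources_by_mentions`, the example texts, and the choice to count each text at most once per source are all assumptions made for illustration.

```python
import re
from collections import Counter


def rank_sources_by_mentions(texts, source_names, top_n=30):
    """Rank data sources by the number of texts that mention them.

    Hypothetical helper: each text counts at most once per source,
    and matching is case-insensitive on whole names only.
    """
    patterns = {
        name: re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)", re.IGNORECASE)
        for name in source_names
    }
    counts = Counter({name: 0 for name in source_names})
    for text in texts:
        for name, pattern in patterns.items():
            if pattern.search(text):
                counts[name] += 1
    # Sort by descending count, breaking ties alphabetically so the
    # ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Made-up example texts standing in for preprint full text.
example_texts = [
    "We queried UniProt and PDB.",
    "Structures from PDB were used.",
    "Variants from ClinVar; see UniProt and PDB.",
]
example_sources = ["UniProt", "PDB", "ClinVar"]
top_two = rank_sources_by_mentions(example_texts, example_sources, top_n=2)
print(top_two)
```

Counting documents rather than raw occurrences keeps a single paper that repeats a database name many times from dominating the ranking. A different counting rule would be equally defensible.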

PLACEHOLDER: DAB plot

Molecules

Molecules is our specialized agent for chemistry tasks, designed to unify synthesis planning, molecular property prediction, safety assessment, and chemical data retrieval into a single, coherent system. Built on a growing suite of more than 30 specialized chemistry tools, including retrosynthesis planners, ADMET predictors, toxicity models, and chemical database interfaces, Molecules can reason about chemical structures, navigate resources such as ChEMBL, PubChem, and Chem-space, and ground its answers in real chemical evidence. Through its integration with our previously mentioned Literature agent, Molecules also supports literature-backed reasoning for synthesis methods, reaction mechanisms, and emerging research findings.

We benchmark Molecules using the ether0 benchmark, a set of specialized chemistry problems with tasks ranging from synthesizability to property prediction.

PLACEHOLDER: ether0 plot