Papers2Dataset - AI agent that browses scientific papers and extracts structured information

I built a tool that uses AI agents to walk the citation graph and extract structured information. The first version was tailored to extract properties of cryoprotective agents (CPAs). Here’s a shorter X thread:

Listened to a recent episode of @owl_posting with @huntercoledavis, found out there's no dataset with data about cryotoxicity at different temperatures. Decided to build a tool that uses AI agents to walk the citation graph and extract CPAs and their properties 1/N ⬇️
— Dmitrii Magas (@EamagAI) December 21, 2025

Why

AlphaFold exists because PDB exists, but most of the time the dataset you need is not there. Some data only exists as text in papers, and it takes too long to manually search for and extract one data point at a time. This project helps automate data extraction from open access papers: AI agents walk the citation graph and produce a CSV with the data and its sources.
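To make "walk the citation graph and produce a CSV" concrete, here is a minimal sketch of the idea, not the project's actual code. It assumes a breadth-first walk, a caller-supplied `get_neighbors` function (hypothetical, it would wrap whatever paper index you use), and made-up column names:

```python
import csv
import io
from collections import deque
from typing import Callable, Iterable


def walk_citation_graph(
    seed_ids: Iterable[str],
    get_neighbors: Callable[[str], list],
    max_papers: int = 100,
) -> list:
    """Breadth-first walk that visits each paper at most once."""
    seeds = list(dict.fromkeys(seed_ids))  # de-duplicate seeds, keep their order
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_papers:
        paper_id = queue.popleft()
        visited.append(paper_id)
        # Cited and citing papers are both just "neighbors" here (an assumption).
        for neighbor in get_neighbors(paper_id):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return visited


def rows_to_csv(rows: list, fieldnames: list) -> str:
    """Serialize extracted records, keeping the source paper next to each value."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Toy in-memory graph standing in for a real citation index.
toy_graph = {"A": ["B", "C"], "B": ["C", "D"], "C": [], "D": ["A"]}
order = walk_citation_graph(["A"], lambda pid: toy_graph.get(pid, []))
```

The `max_papers` cap matters in practice because a citation graph fans out quickly after a couple of hops.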

How

Manual testing

I always start by doing things manually to understand how difficult the task is. Here I started by searching for relevant papers with semantic search tools like https://platform.edisonscientific.com/ and https://asta.allen.ai. That gave me a list of papers, which I read, looking for relevant information and jotting it down. Then I uploaded these PDFs to https://aistudio.google.com/ and spent some time figuring out a prompt that makes an LLM extract the same structured information from different PDFs. That worked quite well, so I just had to automate it.
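The prompt I ended up with is not reproduced here, but the general shape of "ask for the same fields every time, then parse the reply" can be sketched like this. The field names, prompt wording, and helper names are all assumptions for illustration:

```python
import json

# Hypothetical field list; the real schema depends on the extraction goal.
FIELDS = ["compound", "concentration", "temperature_c", "toxicity"]

FENCE = "`" * 3  # Markdown code fence that models often wrap JSON in


def build_prompt(paper_text: str, fields: list) -> str:
    """Ask for a JSON array with a fixed set of keys so every paper yields the same columns."""
    return (
        "Extract every data point from the paper below.\n"
        "Reply with a JSON array of objects with exactly these keys: "
        + ", ".join(fields)
        + ".\nUse null when a value is not reported.\n\n"
        + paper_text
    )


def parse_records(reply: str, fields: list) -> list:
    """Parse the model reply, tolerating a Markdown fence around the JSON."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag) and the closing fence.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of records")
    # Keep only the requested keys; missing keys become None.
    return [{f: item.get(f) for f in fields} for item in data if isinstance(item, dict)]
```

Forcing a fixed key set is what makes the output of different PDFs comparable, and raising on malformed replies makes failures visible instead of silently writing bad rows.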

Combining things together

I wanted to make it easy to work with different LLM providers, so I chose https://www.litellm.ai/ to query different LLM endpoints, and OpenRouter to save some money by using free models. I tried using AI agents like Claude Code to write most of the code, but I noticed they made too many mistakes in the details, so in the end I used the https://antigravity.google/ IDE as a really good autocomplete. I used https://openalex.org/ to fetch papers, their PDF locations, and their citations, because I still can’t get a https://www.semanticscholar.org/ API key :(
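As a rough illustration of what "papers, their PDF locations, and their citations" looks like in OpenAlex terms, here is a sketch that works on an already-fetched work record (a plain dict). The field names (`referenced_works`, `best_oa_location`, `locations`, `open_access.oa_url`) and the `cites:` filter follow my reading of the OpenAlex API; the helper names are hypothetical and this is not the project's actual code:

```python
def short_id(openalex_url: str) -> str:
    """Turn 'https://openalex.org/W123' into 'W123'."""
    return openalex_url.rsplit("/", 1)[-1]


def referenced_ids(work: dict) -> list:
    """IDs of the papers this work cites (outgoing edges of the citation graph)."""
    return [short_id(url) for url in work.get("referenced_works") or []]


def citing_works_url(work_id: str) -> str:
    """API URL listing the papers that cite this work (incoming edges)."""
    return f"https://api.openalex.org/works?filter=cites:{work_id}"


def pdf_candidates(work: dict) -> list:
    """Collect possible PDF URLs, best guess first, without duplicates."""
    urls = []
    best = work.get("best_oa_location") or {}
    urls.append(best.get("pdf_url"))
    for location in work.get("locations") or []:
        urls.append(location.get("pdf_url"))
    # oa_url may point at a landing page rather than a PDF, so it goes last.
    urls.append((work.get("open_access") or {}).get("oa_url"))
    return list(dict.fromkeys(url for url in urls if url))
```

Many of these fields can be null in real records, which is why every lookup falls back to an empty dict or list.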

Biggest problems

PDF downloads. Even though many recent papers are open access with preprints available, sites like bioRxiv block PDF downloads, and I had to spend some time figuring out how to actually download papers for reviews. I tried to build some additional functions that use official APIs to download PDFs, but this is not feasible to do for every location!
Meta prompts. It was easy to build a CPA-specific pipeline, but extending it to other goals took some time, mostly because the model makes slight mistakes and has no feedback loop to correct the result while the pipeline is running. That’s why I added Agent Skills.
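For the PDF download problem, one simple pattern is to try every known location in turn and accept only a response that actually looks like a PDF, since blocked downloads often come back as an HTML page. This is a sketch under that assumption, not the project's implementation; the function names and the User-Agent string are made up:

```python
import urllib.request
from typing import Callable, Iterable, Optional


def looks_like_pdf(payload: bytes) -> bool:
    """PDF files start with the magic bytes '%PDF-'; block pages do not."""
    return payload.startswith(b"%PDF-")


def fetch_with_urllib(url: str, timeout: float = 30.0) -> bytes:
    """Plain HTTP fetch; some hosts will still refuse this kind of request."""
    request = urllib.request.Request(url, headers={"User-Agent": "papers2dataset-sketch"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def first_valid_pdf(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_with_urllib,
) -> Optional[tuple]:
    """Return (url, payload) for the first location that yields a real PDF, else None."""
    for url in urls:
        try:
            payload = fetch(url)
        except Exception:
            # Network errors and HTTP errors both mean: try the next location.
            continue
        if looks_like_pdf(payload):
            return url, payload
    return None
```

Passing `fetch` as a parameter keeps the fallback logic testable offline and leaves room for site-specific downloaders where an official API exists.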

Agent Skills

After trying to solve the problems above manually for other goals, I realized I was doing the same work over and over, and that it should be possible to fix it all automatically. I’d also read that Agent Skills (https://agentskills.io/home) had been standardized, so I added more descriptions so that this project can be installed as a skill. I’ve tested it with Claude Code and OpenAI Codex, and it works, but it’s not amazing. The good news: it’s the worst it will ever be, so I can pick this up next year and everything should work better with new models!
