The most common advice on how to become a better scientist is “read more papers”. This reads like another example of chicken sexing, the phenomenon where experts acquire a skill through feedback but cannot explain the rules they use; they just know. Scientists seem to develop research taste the same way. Can we reproduce it with LLMs?
This post breaks down seven fundamental ways that top AI scientists generate ideas. For each one, we’ll explore what it means for you as a researcher and for the future of AI in science. With today’s AI we could probably read fewer papers and run more experiments. Where can LLMs help us the most? Can LLMs create an experiment plan that a researcher only needs to validate and execute? What context should we provide to achieve this? Research often starts with one hypothesis and branches off into different variations after collecting some evidence; it’s messy and iterative, and making that first step simpler and easier can help bring more scientists into the area.
My initial hypothesis was that LLMs could automatically discover the low-hanging fruit of research: the incremental improvements that human researchers, facing opportunity costs, haven’t prioritized. I tried a straightforward version of this before by analyzing a single paper in Paper To Project and then ranking the resulting projects using Ranking With LLMs. That works, but to improve it we can pass one paper as grounding, plus summaries of the authors’ past papers, to the model.
When I tried to do it manually, I noticed that the majority of researchers build their papers on far-fetched references; for example, this paper about illumination harmonization with diffusion cites a paper from 25 years ago as one of its influential citations. I know that many ideas come from chatting at conferences, and that is difficult to reproduce as LLM context.
To test this, I began by manually analyzing 20 papers, creating a plausible ‘ideation scenario’ for each. From this initial set, I developed a taxonomy of creative patterns. I then scaled the analysis to an additional 300 papers with the help of o4-mini (thanks for the free tokens when sharing data!). It was easier than the Automated Paper Classification project from a year ago, because the number of categories is smaller. A quick sketch of how such a classification pass can be set up is below, followed by the results.
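This is only a minimal sketch, not the exact prompt I used: the prompt wording and the `call_llm` helper are placeholders for whatever chat client (e.g. o4-mini) you plug in, and the categories match the taxonomy below.

```python
# Hypothetical sketch: classify a paper's ideation pattern with an LLM.
# `call_llm` stands in for any chat-completion client (e.g. o4-mini).
from typing import Callable

CATEGORIES = [
    "Conceptual Integration",
    "Direct Enhancement",
    "Benchmark Advancement",
    "Cross-Domain Application",
    "Framework Unification",
    "Empirical Re-evaluation",
    "Theoretical Advancement",
]

PROMPT_TEMPLATE = """You are analyzing how a research idea was likely generated.
Paper title: {title}
Abstract: {abstract}
Most influential citations (titles + years): {citations}

Pick exactly one category from this list and explain in one sentence
what the plausible ideation scenario was:
{categories}

Answer as: <category> | <one-sentence scenario>"""


def classify_paper(title: str, abstract: str, citations: list[str],
                   call_llm: Callable[[str], str]) -> tuple[str, str]:
    """Return (category, ideation scenario) for a single paper."""
    prompt = PROMPT_TEMPLATE.format(
        title=title,
        abstract=abstract,
        citations="; ".join(citations),
        categories="\n".join(f"- {c}" for c in CATEGORIES),
    )
    # Expect "category | scenario"; split on the first pipe.
    category, _, scenario = call_llm(prompt).partition("|")
    return category.strip(), scenario.strip()
```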
Conceptual integration (60% of papers)
The idea synthesizes core concepts from two or more distinct research directions (often from different fields) to create a novel, hybrid method or framework. This is a deeper fusion than a simple application. Context needed: Foundational papers or knowledge from all source areas being integrated.
The majority of ICLR 2025 papers are created exactly by following the “read more papers” advice. Their influential citations are usually not connected to each other, are years apart, and come from different areas. These ideas most likely grow out of conference discussions, brainstorming, and mentorship. Reproducing this environment with current-generation LLM tools is difficult.
Examples:
- This paper about scaling LLM test-time compute combines different methods: a fusion of PRM-based verifier search (Lightman et al.) with recent self-critique/revision methods (Qu et al. 2024) and scaling-law insights (Hoffmann et al. 2022)
- DartControl combines latent diffusion modeling (to efficiently learn and generate in a compact space) with autoregressive motion primitives (to allow real-time, sequential text-driven synthesis)
- Transfusion starts from the recognition that language models (next-token prediction) and diffusion models (denoising continuous data) are each state-of-the-art in their respective modalities, which suggests a unified architecture trained simultaneously on both objectives
Direct Enhancement (15% of papers)
The idea directly improves upon, fixes a flaw in, or extends the capabilities of a single, specific source method. The core problem and domain remain the same. Context needed: The source paper and common techniques within its subfield.
LLMs should help with this type of paper. We can pass one hyped paper and its limitations, and let the model’s knowledge of the internet figure out how to improve on it. It should also be possible to pass more recent papers into the context, so the model has some guidance on how to solve the problem. Most of the ideas generated this way will probably be bad, but with quick automated experimentation and Ranking With LLMs it should be possible to add incremental ideas to science. A rough sketch of this loop is below.
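The sketch below assumes a generic `call_llm` helper and a crude one-number scoring step in place of the full Ranking With LLMs setup; the prompts are illustrative, not the ones from my pipeline.

```python
# Hypothetical sketch of the Direct Enhancement loop: one "hyped" paper plus
# its known limitations go into the context, the model proposes incremental
# improvements, and a second LLM pass scores them. `call_llm` is a placeholder
# for any chat-completion client.
from typing import Callable

IDEATION_PROMPT = """Source paper (title + abstract):
{paper}

Known limitations:
{limitations}

Recent related papers (optional guidance):
{related}

Propose {n} incremental improvements to this specific method.
For each proposal give: one sentence for the change, one sentence for the
expected effect, and a rough experiment to validate it."""

SCORING_PROMPT = """Rate the following research idea from 1 (trivial/unsound)
to 10 (clearly worth running) and answer with the number only:
{idea}"""


def propose_and_rank(paper: str, limitations: str, related: str,
                     call_llm: Callable[[str], str],
                     n: int = 5) -> list[tuple[float, str]]:
    """Generate n candidate improvements and return them sorted by LLM score."""
    raw = call_llm(IDEATION_PROMPT.format(paper=paper, limitations=limitations,
                                          related=related, n=n))
    # Assume proposals come back as blank-line-separated blocks.
    ideas = [block.strip() for block in raw.split("\n\n") if block.strip()]
    scored = []
    for idea in ideas:
        try:
            score = float(call_llm(SCORING_PROMPT.format(idea=idea)).strip())
        except ValueError:
            score = 0.0  # unparseable score -> rank last
        scored.append((score, idea))
    return sorted(scored, reverse=True)
```

The scoring step here is deliberately naive; swapping it for a proper Ranking With LLMs pass, and piping the top candidates into automated experimentation, would be the natural next step.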
Examples:
- This paper about Diffusion Posterior Sampling splits the backward diffusion step into a “denoise to midpoint” and a “re-noise” phase, which allows estimating the guidance at an easier noise level. It was most likely created as a result of improving Chung et al. (2023)
- ALLaM is fine-tuning LLMs on Arabic-language data
- Booster is building on the harmful embedding drift insight from Vaccine (Huang et al., 2024e) and the one-step lookahead/meta-learning formulation (e.g., MAML), applied to harmful fine-tuning defense
- BrainACTIV builds on top of BrainDiVE and anchors generations to a reference
- A misclassified example: the shortcut models paper is more complicated than just taking flow matching and improving on it directly
Benchmark advancement (10% of papers)
The idea creates a new, more challenging, or more realistic benchmark by extending or improving upon a specific prior benchmark. Context needed: The prior benchmark’s paper and knowledge of the domain’s evolving requirements.
We see a big increase in benchmark papers because it’s difficult to evaluate the quality of LLMs. Also, due to the insane pace of progress, old benchmarks get destroyed every year, and we need new, more difficult ones.
Examples:
- Pedestrian Motion Reconstruction was built because existing pedestrian datasets lack global 3D trajectories, multi-modal views, or safety-critical scenarios.
- LiveBench was created because of dataset leakage into LLM training data
- INCLUDE is an extension of multilingual datasets that includes local knowledge (not just translated texts)
Other categories (<5% of papers)
Cross-Domain Application, Framework Unification, Empirical Re-evaluation and Theoretical Advancement
Cross-Domain Application
Despite the low number of papers, this category is actually promising. Some of the best papers at ICLR 2025 were in this category, for example ProtComposer, which applied conditioning from image generation (like ControlNet) to the protein structure generation model Multiflow. We can probably collect the most important recent achievements in one field, like attention in CS, and prompt LLMs to find applications in other fields like protein folding, maybe by passing top papers from each field into the context; a rough sketch is below.
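In the sketch, the `call_llm` helper, the field names, and the prompt wording are all illustrative placeholders, not a tested setup.

```python
# Hypothetical sketch of Cross-Domain Application prompting: summaries of top
# recent papers from a "source" field are crossed with open problems and
# pipelines from a "target" field. `call_llm` is a placeholder for any LLM client.
from typing import Callable

CROSS_DOMAIN_PROMPT = """Top recent techniques in {source_field}:
{source_summaries}

Open problems and standard pipelines in {target_field}:
{target_summaries}

For each technique, say whether it plausibly transfers to {target_field},
what it would condition on or replace, and what the first validation
experiment would be. Skip techniques with no plausible transfer."""


def cross_domain_ideas(source_field: str, source_summaries: list[str],
                       target_field: str, target_summaries: list[str],
                       call_llm: Callable[[str], str]) -> str:
    """Ask the model to transfer techniques from one field into another."""
    prompt = CROSS_DOMAIN_PROMPT.format(
        source_field=source_field,
        source_summaries="\n".join(f"- {s}" for s in source_summaries),
        target_field=target_field,
        target_summaries="\n".join(f"- {s}" for s in target_summaries),
    )
    return call_llm(prompt)
```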
Framework Unification
As an example, let’s use this paper that theoretically re-connects consistency models with other classes of diffusion models. Can we do this with LLMs, similar to how we attempt an automated literature review? The current generation of models still loses to humans on the details; it’s way nicer to read big human-written review papers.
Theoretical Advancement
While a full analysis is outside the scope here, this category focuses on proving formal properties, as seen in this example.
Empirical Re-evaluation
An example is Language Models Need Inductive Biases to Count Inductively. This type of paper is all about taking a claim or a result and diving deeper into it.
Closing thoughts
For humans
- Don’t work in a silo (don’t be a quack): share your work, write emails to people, discuss it a lot to figure out new connections, pick a good Graduate Advisor, meet people (find nearby scientists using ICML 2024 on a map)
- Run many experiments and focus on figuring out how to merge knowledge from different topics; develop Personal Agency
- Use Research Assistant AI Tools! Read more about how to do Good Research
For AI agents
- Models cannot develop “research taste” (yet?), so we need to introduce some feedback loop into experimentation and ideation
- Adding far-fetched concepts into the context window is difficult, so AI-assisted research should focus on other ideation categories until we solve it
- Promising categories are
- Direct enhancement, where we can put one main paper, previous papers from authors and influential citations, and explore how we can create an iterative improvement
- Cross-Domain Application, where we can create a list of the most important recent achievements in each (sub-) field of science, and cross them together
- Empirical Re-evaluation, where we automate reproducibility and find the difficult parts by diving deeper into an experiment log
I like the idea of exploring Cross-Domain Application first :)
@article{magas2025categories,
author = {Magas, Dmitrii},
title = {Categories of AI Research Ideas},
year = {2025},
month = {06},
howpublished = {\url{https://eamag.me}},
url = {https://eamag.me/2025/Categories-of-AI-Research-Ideas}
}