I joined the CURE-Bench competition; here are the steps I took to go from a single OpenAI API call to a local, reinforcement-learning-tuned Qwen 3 model. This post is mostly a summary of my lab notebook notes. I hope it helps illustrate the current issues in the LLM-for-QA improvement process!

What

I joined the CURE-Bench Competition (Reasoning Models for Drug Decision-Making in Precision Therapeutics) because I think this field is important (and I was inspired after collecting AI 4 Life Science Learnings), and I wanted to see how much progress it’s possible to make in two weeks with almost no background knowledge. This competition is all about building a better question-answering system, and I wanted to see how far the latest industry practices could get me. There was already a sample submission using OpenAI’s API models and GPT-OSS, so my initial plan was:

Test the latest stealth models from https://openrouter.ai/ (sonoma-dusk-alpha) → Optimize the prompt using DSPy → Test the latest Open-Sourced Model (Qwen 3) → Fine-tune this model using LoRA → Fine-tune using GRPO

I also wanted to take part in the augmented/agentic track, where you’re allowed to call specific tools, but it was too painful to set up correctly, so I postponed it.

How

Initial setup

I was hoping to see a good starter kit, but it’s in a sad state: lots of AI-generated code that is hard to change, unclear instructions for inference and evaluation, incorrect data, and lots of unanswered questions (for example, about how evaluation is done). This didn’t look promising, so I decided to cut the time I was willing to invest into the project overall. I still had to rewrite most of the inference/evaluation/submission code myself to make at least something work. My first step is always to create a baseline: in this case I just picked a random option for multiple-choice questions and some word recombination for open-ended ones, and then queried OpenRouter models with default parameters.
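For reference, here is a minimal sketch of what such a trivial baseline can look like; the field names and record format are assumptions for illustration, not the actual CURE-Bench schema:

```python
import random

def baseline_answer(record: dict) -> str:
    """Trivial baseline: random option for multiple-choice questions,
    recombined question words for open-ended ones."""
    if record.get("options"):  # assumed field name
        return random.choice(list(record["options"].keys()))
    # Open-ended: echo back a shuffled subset of the question's own words.
    words = record["question"].split()
    random.shuffle(words)
    return " ".join(words[: max(3, len(words) // 2)])

# Example usage with a made-up record:
q = {
    "question": "Which drug is first-line for condition X?",
    "options": {"A": "Drug A", "B": "Drug B", "C": "Drug C", "D": "Drug D"},
}
print(baseline_answer(q))
```

It is useless as a predictor, but it exercises the whole inference/evaluation/submission path end to end before any model is involved.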

DSPy optimization

I’d heard lots of good things about DSPy, but never had a chance to use it myself. The idea here was to not waste time manually writing all the prompt cases, and instead use Reflective Prompt Evolution and let another LLM do the work. This worked well, and I could see the improved score, so I would always recommend it over hand-tuning prompts! The official DSPy tutorials are nice, but it wasn’t clear which algorithm is currently recommended, so I had to search Twitter/X for it (it’s GEPA). DSPy uses LiteLLM under the hood and I couldn’t configure it properly; I had some problems with timeouts/rate limits against OpenRouter, which was a bit annoying.
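A minimal sketch of what this setup can look like, assuming DSPy 3.x with the GEPA optimizer; the model names, signature fields, metric, and the tiny trainset are all illustrative, not the ones I actually used:

```python
import dspy

# Task LM routed through OpenRouter (LiteLLM under the hood); model names are placeholders.
dspy.configure(lm=dspy.LM("openrouter/qwen/qwen3-32b", max_tokens=2048))

class AnswerMCQ(dspy.Signature):
    """Answer a multiple-choice drug-therapeutics question."""
    question: str = dspy.InputField()
    options: str = dspy.InputField(desc="lettered answer options")
    answer: str = dspy.OutputField(desc="single letter of the chosen option")

program = dspy.ChainOfThought(AnswerMCQ)

# Tiny illustrative train/val split of dspy.Example objects.
trainset = [
    dspy.Example(
        question="Which drug is first-line for condition X?",
        options="A) Drug A  B) Drug B  C) Drug C  D) Drug D",
        answer="A",
    ).with_inputs("question", "options")
]
valset = trainset

def exact_match(example, prediction, trace=None, pred_name=None, pred_trace=None):
    # GEPA can also use textual feedback; a plain 0/1 score keeps the sketch simple.
    return float(example.answer.strip().upper() == prediction.answer.strip().upper())

optimizer = dspy.GEPA(
    metric=exact_match,
    auto="light",  # small optimization budget
    reflection_lm=dspy.LM("openrouter/openai/gpt-4o-mini"),  # LLM that rewrites the prompts
)
optimized = optimizer.compile(program, trainset=trainset, valset=valset)
optimized.save("optimized_mcq_program.json")
```

The appeal is that the reflection LM reads the failures and rewrites the instructions for you, so the prompt improves against your metric instead of against your intuition.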

Fine Tuning with Unsloth

I’m a big fan of Unsloth.ai, so I decided to use their framework to fine-tune an OSS model. I chose Qwen 3 based on the lmarena ranking and my previous use of Qwen 2.5 in How to use local LLMs like qwen coder for autocomplete, hoping that QA fine-tuning would already be a solved problem there. First I created a dataset with the correct answers and reasoning traces and converted it to the expected format. I followed this tutorial for memory-efficient RL and used Google Colab to test the first run. I immediately hit problems with packages and CUDA, but that’s the Python ecosystem for you. Later I had GPU OOM problems, plus Colab shutting down (because of no interactions?). I switched to Lightning Studio (which I’ve also been a fan of since NeoBERT Fine-Tuning). The main problems there were the waiting time while my GPU was being provisioned, and then a weird “Sleeping after 10 minutes of inactivity” I couldn’t turn off without going Pro (even though I had 95% GPU utilization). In the end I trained Qwen 3 for 2 epochs on an H200 GPU, and after solving countless CUDA issues and creating at least 5 different environments I finally have the model on Hugging Face. I used the GGUF version in llama.cpp to run the evaluation locally on a MacBook with the Metal backend; that took some time, but I was running it in the background anyway.
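For orientation, here is a heavily condensed sketch of the kind of Unsloth + TRL GRPO setup such a tutorial walks you through; the model name, LoRA ranks, reward function, dataset path, and column names are assumptions for illustration, not my actual training script:

```python
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset

# Load a 4-bit Qwen 3 checkpoint and attach LoRA adapters via Unsloth's wrappers.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B",  # placeholder size
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# Hypothetical JSONL dataset with "prompt" and "answer" columns.
dataset = load_dataset("json", data_files="curebench_train.jsonl", split="train")

def correctness_reward(completions, answer, **kwargs):
    # +1 if the generated text ends with the expected option letter, else 0.
    return [1.0 if c.strip().upper().endswith(a.upper()) else 0.0
            for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[correctness_reward],
    args=GRPOConfig(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_generations=4,          # completions sampled per prompt
        max_completion_length=256,
        num_train_epochs=2,
        output_dir="qwen3-grpo",
    ),
    train_dataset=dataset,
)
trainer.train()
```

From there, merging the adapters and exporting to GGUF is what made the local llama.cpp evaluation on the MacBook possible.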

What’s next

I’ve achieved my intermediate goal of trying out different tools and testing the latest infra setups, but the competition is in a state where it’s unclear whether investing more time would amount to anything beyond battling with the setup. I would assume that to win, one needs to write some tools for the model to use, and maybe get access to a more specialized dataset to fine-tune the LLM on better data. I’m very curious to see what the winning solution will be!