It has become much easier to build specialized tools for yourself, whether to satisfy your curiosity or to solve personal problems. What are the best tools for this, and how big is the gap between the frontier and the rest?
Things I built this month
In no particular order:
- A map with property listings from small towns around Berlin. You can’t find these on immoscout24 or other “normal” listing websites, because small towns post these offers in local newspapers. I even found a home I like there!
- A timeline of AI image generation models to generate “weird” AI pictures from the old times
- A map with a timeline of historical photos in cities to understand how the “interesting” places in each city developed
- A family tree of philosophers and their coverage in the podcast to understand how they are connected and which episode I should listen to next
- And a couple of other things I will share over the next few months
All of these were built mostly hands-off: coding agents worked in the background, and I managed their tasks until we got to the first usable version.
Which tools I used and how they compare
Ranked coding tools
- OpenAI Codex: worked great, with good limits at first (I used the $20 subscription), but later they lowered the weekly limit and I cancelled my subscription
- OpenCode: I used it with different models, and the harness worked great, though it felt a bit more YOLO than the others
- Claude Code: worked great, but the limits were low and I had some crashes
- Antigravity 2.0: it just came out, but on the first tasks it performed as well as the tools above. I quickly hit the free limits, though.
- Gemini CLI: feels like a bad harness, but good enough for some simple tasks like resizing things or installing packages
- Kilo Code, MS Copilot and others: I’ve tried many but see no reason to choose them over OpenCode.
OpenRouter LLMs
I’ve spent around $80 on these, so it’s not a huge sample size. I went this route because I can’t justify paying for Claude Opus to solve simple things like adding localization. All models perform well on small, well-scoped tasks; the difference is mostly in debugging and instruction following. The ranked list of models I’ve tried:
- Gemini Flash 3.5: the most expensive on the list, but still worth it for slightly more difficult tasks.
- Qwen 3.7 max: half the price of Gemini Flash, but the performance drop is larger than the price drop.
- Grok 4.3: similar to Qwen
- Kimi K2.6: very similar to Qwen and uses fewer tokens, but is just a bit dumber.
- Grok build 0.1: a good model, but not worth the price, feels worse than all of the above.
- Mimo v2.5: a nice cheap model for simple tasks.
- DeepSeek V4 Flash: very cheap, but a level below the models above. Still good for simple tasks.
I’ve also tried many other models on a couple of tasks, but saw no difference. The best ones are still GPT 5.5 and Claude 4.7, but even with a subscription I was hitting the limits quickly. Still cheaper than paying the API price!
Main learnings
- It’s faster to initialize the project structure and add tools manually first (e.g. via sv create or uv init)
- Skills, MCPs and LSPs are very useful; models work way better when they have more context
- Models are still very bad at design, and everything they make looks the same. I couldn’t really solve this with things like impeccable either
- When you have an idea, just start implementing it and see how it goes. It’s very easy now!
See also: Why You Need to Read Sartre in the Age of Agentic AI