We laugh when an AI fails a video game because it figured out how to glitch the score counter instead of actually completing the level. But we optimize meaningless corporate KPIs while our actual dreams rot; we vacation in the same city for twenty years because we are terrified of the unknown; we sell out our future health for a quick hit of dopamine today. We treat these as deep personal failures, but to a computer scientist they are just mathematical errors in a Reinforcement Learning algorithm.

Set Your Goals

The biggest mistake I see people make is not setting their own goals. If you don't set them, someone else will set them for you. This is hard and character-defining, and usually comes either from long self-reflection or from external forces (illness, financial or social problems, etc.).

Set Your Rewards (A Feedback Loop)

In school, direct rewards are defined for us: grades, sports achievements, competitions. There are also fuzzy rewards without a number attached, like social status, which are harder to optimize. The older and more agentic we get, the fewer external rewards are given to us: there are still money and social status (which are not always correlated, depending on the social group!), but your own goal may come with a very different reward.

Vague goals are poorly specified rewards. Break long-term aims into subgoals.

Sparse Vs Dense Rewards

That's why it's very important to set proxy rewards that tell you whether you're moving in the right direction. If your goal is a marathon, the sparse reward is the finish line. Reward shaping is the "shaping" of your environment to give you a small reward for every morning you put on your shoes. Proxy rewards (and reward shaping) are very useful for moving toward your goal faster and avoiding dead ends. Good grades in school are a proxy reward for a good education, and the number of solved test tasks is a proxy for good grades. As you can see, there's a problem:
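The sparse-vs-shaped distinction can be sketched in a few lines of Python. This is a toy model of the marathon example above, not an RL implementation; the 120-day training window and the reward values are made-up numbers for illustration.

```python
TRAINING_DAYS = 120  # illustrative training period

def sparse_reward(ran_today, finished_marathon):
    # Only the finish line counts: zero feedback for months.
    return 100.0 if finished_marathon else 0.0

def shaped_reward(ran_today, finished_marathon):
    # Reward shaping: a small bonus every morning you put on your
    # shoes, plus the big terminal reward at the finish line.
    reward = 1.0 if ran_today else 0.0
    if finished_marathon:
        reward += 100.0
    return reward

# With the sparse reward, all feedback over 120 days is one number at
# the very end; with shaping you get a signal every single day.
sparse_total = sum(sparse_reward(True, d == TRAINING_DAYS - 1)
                   for d in range(TRAINING_DAYS))
shaped_total = sum(shaped_reward(True, d == TRAINING_DAYS - 1)
                   for d in range(TRAINING_DAYS))
print(sparse_total, shaped_total)  # 100.0 vs 220.0
```

The totals are not the point; the daily signal is. A learner (human or agent) gets 120 chances to correct course under the shaped reward, versus one under the sparse one.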

Reward Hacking

In RL, agents exploit loopholes in reward functions to maximize scores without achieving the intended outcome. The same happens in real life: good grades don't imply high intelligence, and optimizing for grades, followers, or KPIs often distorts behavior away from true growth. It's important to regularly check whether the current proxy objective is still aligned with the global goal. If your goal is "Productivity," you might hack it by clearing "easy" emails while avoiding the hard project.
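Here is a minimal sketch of that email example: a greedy optimizer that maximizes a proxy score ("tasks cleared per unit of effort") and, as a result, never touches the hard project. The task list, scores, and effort budget are all invented for illustration.

```python
# 20 easy emails and one hard project. Each task gives the same proxy
# score (+1 "task done"), but wildly different true value.
tasks = (
    [{"name": f"email {i}", "proxy_score": 1, "true_value": 0.1, "effort": 1}
     for i in range(20)]
    + [{"name": "hard project", "proxy_score": 1, "true_value": 50.0, "effort": 10}]
)

budget = 10  # units of effort available today

# The "reward hacker": pick whatever maximizes proxy score per effort.
by_proxy = sorted(tasks, key=lambda t: t["proxy_score"] / t["effort"],
                  reverse=True)

proxy_total, true_total = 0, 0.0
for t in by_proxy:
    if t["effort"] <= budget:
        budget -= t["effort"]
        proxy_total += t["proxy_score"]
        true_total += t["true_value"]

# Ten tasks "done", almost no real value created; spending the same
# budget on the hard project would have yielded true value 50.0.
print(proxy_total, round(true_total, 1))
```

The proxy metric looks great (10 tasks vs. 1), which is exactly why the misalignment is easy to miss until you compare against the true objective.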

Exploration Vs Exploitation

Imagine you are at a casino with several slot machines (bandits), each with a different, unknown payout probability. Your goal is to maximize payout. You can either explore all machines to gather enough data on which one is best, or exploit and just play the machine you think is best so far. This applies to real life too: you explore to find the best restaurants, and after some time only go to the best one. You try different sports and advance in the one you're best at. You exploit what you're best at in your career and delegate the rest. The balance is difficult to find, and there are strategies for figuring out when to stop exploring, though personally I find that people more often lean toward exploitation ("we're going to the same city for vacation every year").
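One of the simplest strategies for this trade-off is epsilon-greedy: exploit the best-looking machine most of the time, but explore a random one with small probability epsilon. A minimal sketch, with made-up payout probabilities:

```python
import random

# Three "slot machines", each paying out 1 with an unknown probability.
# The true probabilities below are hidden from the strategy.
payout_probs = [0.2, 0.5, 0.8]  # machine 2 is actually the best

counts = [0] * len(payout_probs)    # pulls per machine
values = [0.0] * len(payout_probs)  # running average payout per machine

def pull(machine):
    return 1.0 if random.random() < payout_probs[machine] else 0.0

epsilon = 0.1  # explore 10% of the time, exploit 90%
random.seed(0)

for _ in range(10_000):
    if random.random() < epsilon:
        machine = random.randrange(len(payout_probs))  # explore
    else:
        machine = values.index(max(values))            # exploit best estimate
    reward = pull(machine)
    counts[machine] += 1
    # Incremental mean: nudge the estimate toward the observed reward.
    values[machine] += (reward - values[machine]) / counts[machine]

print(counts)                            # most pulls go to the best machine
print([round(v, 2) for v in values])     # estimates near the true probabilities
```

Even this crude strategy converges: the estimates approach the true payout rates, and the vast majority of pulls end up on the best machine, while the 10% exploration keeps you from getting permanently stuck on an early lucky streak.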

Discount Factor (Longtermism)

There's always a discount factor for rewards in RL, and it's usually exponential. In trading bots it's low (you want many quick wins and to move on); in AlphaGo and chess engines it's high (you don't want to snatch a piece and lose the game because of it). In humans we often see Hyperbolic Discounting: many people choose $50 now over $100 in a year, but the same people will choose $100 in six years over $50 in five years. Exponential discounting is consistent over time, while human psychology is not. More often I see people choosing instant gratification from fast food, social media, and other addictive things without thinking about how it will affect them in years to come, but people rarely talk about the other side: saving and sacrificing too much now in the hope of a greater return tomorrow that may never come.
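The preference reversal above falls straight out of the two discount formulas. An exponential discounter values a reward as amount * gamma^t; a hyperbolic discounter as amount / (1 + k*t). The gamma and k below are illustrative, not empirical constants:

```python
gamma = 0.9  # per-year exponential discount factor (illustrative)
k = 1.5      # hyperbolic discount rate (illustrative)

def exponential(amount, years):
    return amount * gamma ** years

def hyperbolic(amount, years):
    return amount / (1 + k * years)

# "$50 now vs $100 in a year": the hyperbolic discounter grabs the $50.
print(hyperbolic(50, 0), hyperbolic(100, 1))    # 50.0 vs 40.0 -> take $50 now

# Shift both options five years out and the same person flips.
print(hyperbolic(50, 5), hyperbolic(100, 6))    # ~5.9 vs 10.0 -> wait for $100

# The exponential discounter is time-consistent: shifting both options
# by the same delay never reverses the preference.
print(exponential(50, 0), exponential(100, 1))  # 50.0 vs 90.0  -> wait
print(exponential(50, 5), exponential(100, 6))  # ~29.5 vs ~53.1 -> still wait
```

The reversal happens because the hyperbolic curve drops steeply near "now" and flattens out later, so a one-year gap looms large today but looks negligible from five years away.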