Markov Decision Processes (MDPs) are Markov chains plus nondeterminism: some states are random, the others are controlled (nondeterministic). In the pictures, the random states are round, and the controlled states are squares:

The random states come with a distribution over successor states, but in the controlled states a controller chooses a successor state (or a probability distribution over the successor states). For instance, the controller could stay on the leftmost column forever, by always choosing to go one state down. Or the controller could go right at some point; in the random state a successor is picked randomly, either the initial state or the state on the right, according to the blue probabilities.

What objective should the controller aim at? In this post, the objective will be the following: visit green states infinitely often, and red states only finitely often. Here is the previous MDP with colours:

Since the controller wants to visit (infinitely many) green...
Blog by Oxford computer scientist Stefan Kiefer
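
To make the description of random and controlled states above concrete, here is a minimal Python sketch. The MDP in it is hypothetical and is not the one from the post's pictures: the state names (`init`, `down`, `coin`, `right`), the 0.5 probabilities, the colouring, and the helper names `step`, `simulate`, `go_right` and `stay_left` are all assumptions made for illustration. The controller is modelled as a function from the current controlled state to a chosen successor.

```python
import random

# Hypothetical MDP (an assumption, not the one pictured in the post).
# Controlled states: the controller picks one of the listed successors.
CONTROLLED = {"init": ["down", "coin"], "down": ["init"], "right": ["right"]}
# Random states: the successor is drawn from the given distribution.
RANDOM = {"coin": {"init": 0.5, "right": 0.5}}
# Hypothetical colouring used for the objective.
GREEN = {"right"}
RED = {"down"}


def step(state, policy, rng):
    """Return the successor of `state`: sampled if random, chosen if controlled."""
    if state in RANDOM:
        successors = list(RANDOM[state])
        weights = [RANDOM[state][s] for s in successors]
        return rng.choices(successors, weights=weights)[0]
    choice = policy(state)
    if choice not in CONTROLLED[state]:
        raise ValueError(f"{choice!r} is not a successor of {state!r}")
    return choice


def simulate(policy, start="init", steps=1000, seed=0):
    """Run a finite prefix and count green/red visits and the last red visit."""
    rng = random.Random(seed)
    state = start
    visits = {"green": 0, "red": 0}
    last_red = None
    for t in range(steps):
        if state in GREEN:
            visits["green"] += 1
        if state in RED:
            visits["red"] += 1
            last_red = t
        state = step(state, policy, rng)
    return visits, last_red


def go_right(state):
    """Controller that heads for the random state, then stays on the right."""
    return {"init": "coin", "down": "init", "right": "right"}[state]


def stay_left(state):
    """Controller that never leaves the left side."""
    return {"init": "down", "down": "init", "right": "right"}[state]


if __name__ == "__main__":
    print("go_right :", simulate(go_right))
    print("stay_left:", simulate(stay_left))
```

A finite simulation like this can only hint at the objective, which is a property of infinite runs: "green infinitely often, red only finitely often" cannot be decided from any finite prefix. In this toy example, `stay_left` keeps revisiting the red state, while `go_right` never visits it and ends up in the green state as soon as the random state sends it right.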