Skip to main content

Infinitely many states and no memory

Markov Decision Processes (MDPs) are Markov chains plus nondeterminism: some states are random, the others are controlled (nondeterministic). In the pictures, the random states are round, and the controlled states are squares:
The random states come with a distribution over successor states, but in the controlled states a controller chooses a successor state (or a probability distribution over the successor states). For instance, the controller could stay on the leftmost column forever, by always choosing to go one state down. Or the controller could go right at some point; in the random state a successor is picked randomly, either the initial state or the state on the right, according to the blue probabilities.

What objective should the controller aim at? In this post, the objective will be the following: visit green states infinitely often, and red states only finitely often. Here is the previous MDP with colours:
Since the controller wants to visit (infinitely many) green states, it needs to leave the left column at some point. But once it leaves the left column, there is a risk of visiting the red state before returning to the initial state. Visiting the red state is not a disaster, but it shouldn't happen infinitely often.

How should the controller play then? Let's say a new round starts whenever we return to the initial state. Consider the following strategy: in each round go one state further down the left column before going right. Under that strategy, the controller visits green states infinitely often. One can show that, with probability $1$, the red state will be visited only finitely often. So under that strategy, the controller achieves the objective almost surely, i.e., with probability $1$. Achieving the objective almost surely is the gold standard in this post: we don't settle for less. We call a strategy winning if it achieves the objective almost surely.

The described winning strategy (go ever further down) requires infinite memory: each time the controller visits the initial state, it needs to remember how far it went down in the previous round. In contrast, every finite-memory strategy loses with probability $1$. I won't define memory here, only "no memory": A strategy is MD (memoryless deterministic) if the controller assigns to each controlled state $s$ a fixed successor $s'$ that the controller chooses whenever it visits state $s$ (regardless of the history of the play). In the example above, every MD strategy loses almost surely (either because it never leaves the left column, or because it visits the red state infinitely often).

In this post we focus on MDPs where the controller has some winning strategy. Any state that could be visited by this strategy must again have a winning strategy, because otherwise the winning strategy could not afford to visit the state. In the following we ignore all other states, i.e., we can assume that our MDP has only states for which there exists a winning strategy. Under what circumstances is there also a winning strategy that is MD? In this post I'll show:
Consider an MDP where for every state there is a winning strategy. If the MDP does not have white states (i.e., all states are green or red) then there is also an MD winning strategy. The same MD winning strategy works, regardless of which state is the start state.
The example above shows that the no-white-states assumption cannot be dropped. Absence of white states means that visiting red states only finitely often is effectively the only goal, because then green states will automatically be visited infinitely often. A promising strategy might be a safety strategy: in each state minimize the probability of ever visiting a red state again. The appeal of a safety strategy is twofold:
  • If a safety strategy succeeds in never visiting a red state again, then clearly it visits red states only finitely often.
  • There are safety strategies that are MD (provided each controlled state has only finitely many successors, which we assume in this post).
Safety strategies do not work though. Consider the following MDP and imagine that after the first random state the controller is unlucky and reaches not the first green self-loop but the second red state.
The controller might reason: "Ok, I've been unlucky and am now in a red state. But if I go down one state, there is chance of $1/4$ that this was the last red state that I ever see. So let's give it a go." However, with probability $3/4$ the controller will be unlucky again. The controller might keep pushing its luck by playing "down", but the dream of reaching a green sink might move further and further away. Actually, the probability that this strategy (the safety strategy) succeeds is less than 72%. Compare this with the opposite strategy, which always returns to the initial state: that strategy is MD as well and succeeds almost surely! We conclude that playing only for safety is too impatient.

The following picture visualizes the situation more abstractly. There I've arranged the states according to their safety level. By safety level, I mean the probability of never visiting another red state, assuming a safety strategy is played (i.e., a strategy that maximizes the probability of never visiting another red state). A safety strategy entails not to pick successor states with smaller safety level, as indicated in grey:
By the example above, a safety strategy is not necessarily winning. What will work better is to fix a safety strategy only for the states with safety level at least $1/3$. In this way we obtain another MDP. For the less safe states, the controller is free to choose any outgoing transition:
Consider the MDP (visualized above) obtained by fixing a safety strategy for the states with safety level at least $1/3$ (the blue area). In this MDP, for every state there is a winning strategy.
Consider any state. Since there is a winning strategy in the original MDP (before having fixed anything), the controller might play this strategy, until the play reaches the blue area. At this point the controller switches to a safety strategy (which is consistent with what has been fixed in the blue area). Now the controller has a chance of at least $1/3$ of never visiting red again and thus achieving the objective. If the play does reach a red state, the controller reverts to a strategy that is winning for that red state in the original MDP. The controller follows that strategy until the play again reaches the blue area. At this point the controller switches again to a safety strategy, thus forever avoiding red with a fresh chance of at least $1/3$. Continuing in this way, the controller achieves the objective almost surely. Note that this strategy is not MD, as the controller needs to remember in which phase it is: at any point it either follows a winning strategy of the original MDP, or it follows a safety strategy.
This shows that by freezing the blue area with a safety strategy we have not yet screwed up. I still need to define an MD strategy for the non-blue area. But bear with me for a moment and consider a state with even higher safety level, at least $2/3$. Such a state is in the blue area, so a safety strategy has been fixed there. Two things might happen:
  • Either the play remains in the blue area forever and never visits a red state;
  • or it eventually leaves the blue area or visits a red state.
The two cases are visualized here:
The second case (dropping below safety $1/3$) cannot have very large probability: after all, we start in the dark-blue area with safety level at least $2/3$, so if the safety level is very likely to drop below $1/3$, we can't have been very safe to start with. Doing the maths shows that the probability of the second case is at most $1/2$. So starting in the dark-blue area, the probability of avoiding red forever is at least $1/2$.

Here is our hope: If we could always return to that dark-blue area almost surely, then each time we arrive there we would get a fresh chance of at least $1/2$ to avoid red forever. In this way we would achieve the objective almost surely. This is illustrated here:
But is this just a pipe dream? After all, in the light-blue area we have already fixed a safety strategy. Perhaps that safety strategy prevents us from reaching the dark-blue area? Perhaps playing for safety in the light-blue area is greedy and short-sighted, and we should rather swallow the bitter pill and visit some red states for the greater benefit of reaching the dark-blue area?

No, we are actually good. Remember that the proposition above says that after freezing the blue area there is still a winning strategy (albeit not necessarily MD) for each state. So the following proposition applies:
Consider an MDP where for each state there is a winning strategy. Then each state has a strategy to almost surely reach a state with safety level at least $2/3$.
Consider an arbitrary start state $s_1$ and an arbitrary strategy. Suppose that $s_1$ has safety level less than $2/3$, so the probability of visiting a red state from $s_1$ is at least $1/3$. It follows that there is a finite number $k_1$ such that the probability of visiting a red state within the next $k_1$ steps is at least $1/4$. After these $k_1$ steps we are in another state, say $s_2$. Suppose that $s_2$ also has safety level less than $2/3$. Then, similarly, there is a number $k_2$ such that the probability of visiting a red state within the next $k_2$ steps is at least $1/4$. Etcetera. It follows that the probability of never reaching safety level at least $2/3$ and visiting only finitely many red states is zero. So every winning strategy almost surely reaches safety level at least $2/3$.
So after freezing the blue area, there is still for every state a strategy to almost surely reach the dark-blue area. It is known that if there is a strategy to achieve a reachability objective almost surely then there is also an MD strategy to achieve the reachability objective almost surely. Reaching the dark-blue area is obviously a reachability objective. So there is an MD strategy to almost surely reach the dark-blue area. As argued above, this strategy, combined with the safety strategy in the blue area, is winning.

In summary, here is our MD winning strategy: in the blue area play an MD strategy for safety (avoiding red states), and in the white area play an MD strategy for reaching the dark-blue area. The key idea is that these two MD strategies are not conflicting.

The material in this post is from a LICS'17 paper by Richard Mayr, Mahsa Shirmohammadi, Dominik Wojtczak, and myself.

Comments

Popular posts from this blog

Oxford Admissions

An initial draft of this post was about how to make rankings more meaningful, a piece of mathsy computer science, as you expect it from this blog. My motivation is personal: I'm involved in selecting CS undergraduate students. Some people might be curious how Oxford admissions work: it's certainly a favourite subject for UK politicians and news media. Therefore I decided to postpone my original plan and to write this post about Oxford admissions. I hope I'm not playing with too much fire and the blog will not suddenly become famous for the wrong reasons!

The tenor in newspapers and political campaigns is that the Oxbridge admissions process is “opaque” and unfair. Here I am going to offer transparency and some comments on fairness. The views expressed here are my own, and may not necessarily coincide with the university's views.

Oxford's goal is to admit the strongest applicants. One can disagree with this basic premise, but we try to admit those applicants who w…

Probabilistic Models of Computation

This post has 0 ideas. At least, no ideas that are younger than 50 years or so. I'm talking about the two most fundamental models of probabilistic computation and the two most fundamental results about them, according to my subjective judgement.

Every computer scientist knows finite automata: State $1$ is the start state, and state $3$ is the only accepting state. Some words are accepted (i.e., have an accepting run), for instance the word $aa$. Other words are rejected (i.e., have no accepting run), for instance the word $ba$. Most questions about finite automata are decidable. For instance, the emptiness problem, which asks whether the automaton accepts at least one word. This problem is decidable in polynomial time: view the automaton as a graph, and search the graph for a path from the initial state to an accepting state. One can also decide whether two given automata are equivalent, i.e., do they accept the same words? This problem is computationally hard though: PSPACE-co…

The Price of Memory

In an earlier post I defined the price of memory, $u$, and asked:
Is $u < 1$? Worth €100.
Determine $u$. Worth another €100, even if $u=1$. I'm happy to announce that there is now a solution:
The price of memory, $u$, is 1. I don't want to repeat the precise definition of $u$, but here is a short recap: Suppose you have an acyclic finite MDP and a strategy $\sigma$ that guarantees to visit a red transition with probability 1. This strategy $\sigma$ might use memory, i.e., might base its behaviour in a state on the history of the run, as indicated in yellow:
Any such strategy $\sigma$ induces a randomized strategy $\overline\sigma$ that acts like $\sigma$ in average, but is memoryless. In each state, when averaged over all runs, the strategies $\sigma$ and $\overline\sigma$ take the same actions with the same probabilities:
One can show that $\sigma$ and $\overline\sigma$ visit red transitions equally often in average; in particular, $\overline\sigma$ visits at least…