Markov Decision Processes (MDPs) are Markov chains plus nondeterminism: some states are random, the others are controlled (nondeterministic). In the pictures, the random states are round, and the controlled states are squares:
The random states (except the brown sink state) come with a probability distribution over the successor states. In the controlled states, however, a controller chooses a successor state. What does the controller want to achieve? That depends. In this blog post, the objective is very simple: take a red transition. The only special thing about red transitions is that the controller wants to take them. We consider only MDPs with the following properties:
- There are finitely many states and transitions.
- The MDP is acyclic, that is, it has no cycles.
- There is a unique start state from which every run starts (in the pictures: blue, at the top) and a unique sink state where every run ends (in the pictures: brown, at the bottom). No matter what the controller does, the sink will be reached.
- The controller has a strategy, $\sigma$, to take a red transition surely. That is, no matter how unfortunately the random states behave in a run, the controller manages to take at least one red transition at some point.
Sometimes, though, taking red transitions is costly in some sense, and we would prefer to take exactly one red transition. Therefore we consider strategies $\sigma$ that use one bit of memory: that bit is initially 0, and gets switched to 1 as soon as a red transition is taken, and then remains 1. The strategy $\sigma$ may use the bit as it pleases, as long as it can guarantee to take at least one red transition in every run. Once a red transition has been taken, $\sigma$ may just relax or avoid further red transitions. For the MDP above, let us define $\sigma$ to take a red transition when coming directly from the start state (that is, when the bit is still 0) and then (that is, when the bit is 1) go directly to the sink state:
Given such a one-bit strategy $\sigma$, let us define a memoryless randomized strategy $\overline\sigma$, that is, $\overline\sigma$ fixes, for each controlled state, a probability distribution over the state's outgoing transitions. We do this in a very particular way, depending on $\sigma$. First we compute for each state $s$ a value $b(s)$, which tells us how likely it is that the bit is 1 when visiting $s$. More precisely, $b(s)$ is the conditional probability (under $\sigma$) that the bit is 1 when $s$ is visited, conditioned on those runs that visit $s$. (Each run visits each state at most once, by acyclicity.) For example, for the start state $s_0$ and the sink state $s_\bot$ we always have $b(s_0)=0$ and $b(s_\bot)=1$. In the MDP above, for the depicted strategy $\sigma$, we have $b(s) = 0$ for the topmost controlled state, and $b(s) = 1/2$ for the other controlled states. (States $s$ that are never visited by $\sigma$ don't have a well-defined $b(s)$-value, but that won't matter.)
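In symbols (just restating this definition):
\[
b(s) \;=\; \Pr\nolimits_{\sigma}\bigl[\,\text{the bit is } 1 \text{ at } s \;\big|\; s \text{ is visited}\,\bigr].
\]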
Given the $b(s)$-values, the memoryless randomized strategy $\overline\sigma$ is easy to define:
For each controlled state $s$ the strategy $\overline\sigma$ picks
- with probability $b(s)$ the transition that $\sigma$ picks when the bit is 1;
- with probability $1-b(s)$ the transition that $\sigma$ picks when the bit is 0.
This strategy $\overline\sigma$ effectively results in a Markov chain, because like random states the controlled states now have a probability distribution over their outgoing transitions:
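In symbols, writing $\sigma(s,0)$ and $\sigma(s,1)$ for the transitions that $\sigma$ picks at $s$ when the bit is 0 or 1 (notation introduced here for convenience, not used in the post), the distribution chosen by $\overline\sigma$ at $s$ is
\[
b(s)\cdot \sigma(s,1) \;+\; \bigl(1-b(s)\bigr)\cdot \sigma(s,0),
\]
understood as a probability distribution over the (at most two distinct) outgoing transitions.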
Intuitively, $\overline\sigma$ does its best to emulate $\sigma$ without using the bit. This works well, in the following sense:
For each state $s$, the probability to visit $s$ (at some point during the run) is the same under both strategies $\sigma$ and $\overline\sigma$.
This holds equally for transitions: for each transition $t$, the probability to take $t$ (at some point during the run) is the same under both strategies.
By induction on the acyclic graph.
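To spell the induction out a little (my paraphrase, not verbatim from the post): process the states in topological order and show that
\[
\Pr\nolimits_{\sigma}[\,t \text{ is taken}\,] \;=\; \Pr\nolimits_{\overline\sigma}[\,t \text{ is taken}\,]
\quad\text{for every transition } t.
\]
A state's visit probability is the sum of its incoming transition probabilities; a random state splits its visit probability among its successors in the same way under both strategies; and a controlled state $s$ sends a $b(s)$-fraction of its visit probability along $\sigma$'s bit-1 choice and a $(1-b(s))$-fraction along the bit-0 choice: under $\sigma$ this holds because $b(s)$ is exactly the probability that the bit is 1 when $s$ is visited, and under $\overline\sigma$ it holds by definition.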
In the example above, the probability of visiting the topmost controlled state is $1/4$, and for all other controlled states it is $1/2$.
Under either strategy.
This equivalence between $\sigma$ and $\overline\sigma$ extends to expectations:
The expected number of visits of red transitions is at least 1, under both strategies $\sigma$ and $\overline\sigma$.
The expected number of visits of red transitions equals the sum, over the red transitions $t$, of the probabilities of taking $t$.
(Remember: no cycles.)
By the proposition above, this number is the same under either strategy.
Under $\sigma$, every run takes at least 1 red transition.
So the expected number of visits of red transitions under $\sigma$ is at least 1.
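Condensed into one chain (the same argument, written out; $\mathbb{E}$ denotes expectation):
\[
\mathbb{E}_{\overline\sigma}[\#\text{red transitions taken}]
\;=\; \sum_{t\ \text{red}} \Pr\nolimits_{\overline\sigma}[\,t \text{ is taken}\,]
\;=\; \sum_{t\ \text{red}} \Pr\nolimits_{\sigma}[\,t \text{ is taken}\,]
\;=\; \mathbb{E}_{\sigma}[\#\text{red transitions taken}]
\;\ge\; 1.
\]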
Now we come to the point:
By moving from $\sigma$ to $\overline\sigma$ we lose the guarantee to visit at least one red transition.
Even though the expectation of the number of visits of red transitions remains at least 1, it is conceivable that $\overline\sigma$ is unlikely to visit a red transition.
For instance, it is conceivable that most runs (say, 99%) don't visit any red transitions at all, whereas a small fraction (say, 1%) of runs visit many (say, 100) red transitions.
In this case, we might speak of a “risk” of 0.99: the probability, under $\overline\sigma$, of taking no red transition.
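For the record, those hypothetical numbers are consistent with the lemma above, since
\[
0.99 \cdot 0 + 0.01 \cdot 100 = 1.
\]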
The price of memory is the supremum of all risks:
The price of memory is the smallest number, $u$, such that for any MDP and any one-bit strategy $\sigma$ satisfying the assumptions above, the probability, under $\overline\sigma$, of taking no red transition is at most $u$.
Equivalently, $u$ is the smallest number such that the probability of taking at least one red transition is always at least $1-u$.
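Restated as a supremum (the same definition in one formula):
\[
u \;=\; \sup\; \Pr\nolimits_{\overline\sigma}[\,\text{no red transition is taken}\,],
\]
where the supremum ranges over all MDPs with the properties listed above and all one-bit strategies $\sigma$ that surely take a red transition.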
Trivially, $u \le 1$.
The price of memory is a global constant (not depending on any particular MDP or strategy).
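For concreteness, here is a rough computational sketch (my own illustration, not from the post; all identifiers such as `b_values` and `risk` are invented). Given an acyclic MDP listed in topological order and a one-bit strategy `sigma`, it computes the values $b(s)$, derives $\overline\sigma$, and evaluates the risk, that is, the probability under $\overline\sigma$ of taking no red transition:

```python
# Illustrative sketch only: encode an acyclic MDP and a one-bit strategy, then
# compute b(s), the memoryless strategy sigma_bar, and the risk under sigma_bar.
#
# order       : list of states in topological order (start first, sink last)
# random_succ : dict, random state -> {successor: probability}
# sigma       : dict, controlled state -> (choice when bit=0, choice when bit=1)
# red         : set of red transitions, given as pairs (state, successor)

def b_values(order, start, random_succ, sigma, red):
    """b[s] = probability (under sigma) that the bit is 1 when s is visited."""
    # p[s][bit] = probability of visiting s with the given bit value
    p = {s: [0.0, 0.0] for s in order}
    p[start][0] = 1.0
    for s in order:
        for bit in (0, 1):
            mass = p[s][bit]
            if mass == 0.0:
                continue
            if s in random_succ:                      # random state
                moves = list(random_succ[s].items())
            elif s in sigma:                          # controlled state
                moves = [(sigma[s][bit], 1.0)]
            else:                                     # the sink has no successors
                continue
            for succ, prob in moves:
                new_bit = 1 if (bit == 1 or (s, succ) in red) else 0
                p[succ][new_bit] += mass * prob
    return {s: p[s][1] / (p[s][0] + p[s][1])
            for s in order if p[s][0] + p[s][1] > 0.0}

def risk(order, start, sink, random_succ, sigma, red):
    """Probability, under sigma_bar, of reaching the sink without any red transition."""
    b = b_values(order, start, random_succ, sigma, red)
    q = {s: 0.0 for s in order}   # q[s] = Pr[visit s and no red transition taken yet]
    q[start] = 1.0
    for s in order:
        if q[s] == 0.0 or s == sink:
            continue
        if s in random_succ:
            moves = list(random_succ[s].items())
        else:                                         # controlled: sigma_bar mixes the two choices
            moves = [(sigma[s][1], b[s]), (sigma[s][0], 1.0 - b[s])]
        for succ, prob in moves:
            if (s, succ) not in red:                  # taking a red transition ends the "no red" event
                q[succ] += q[s] * prob
    return q[sink]
```

Processing the states once, in a fixed topological order, is possible precisely because of the acyclicity assumption above: every run visits each state at most once, just as used in the definition of $b(s)$.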
Here is the open problem:
Is $u < 1$? Worth €100.
Determine $u$. Worth another €100, even if $u=1$.
If you are the first to send me a solution, I will give you the money, and you and your solution will be featured here if you agree.
To me the problem doesn't “feel” hard, but I have tried hard and failed.
The example above gives us a lower bound on $u$: The probability, under $\overline\sigma$, to see no red transition is $\frac14 \cdot 0 + \frac14 \cdot \frac12 + \frac14 \cdot \frac12 + \frac14 \cdot \frac12 = \frac38$, a little less than $1/2$. Generalizing the example from $4$ to $n$ controlled states shows that $u$ cannot be less than $1/2$.
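Presumably the generalized sum has the same shape as the four-state sum above: one controlled state contributing $b$-value $0$ and $n-1$ controlled states contributing $b$-value $\frac12$, each weighted $\frac1n$. Under that assumption the no-red probability is
\[
\frac1n \cdot 0 \;+\; \frac{n-1}{n} \cdot \frac12 \;=\; \frac{n-1}{2n},
\]
which approaches $1/2$ from below as $n$ grows.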
Here is another example:
As $\sigma$ we take the strategy that takes a red transition as early as possible and then completely avoids further visits of red transitions:
The corresponding $\overline\sigma$ assigns the transitions the following probabilities:
The probability, under $\overline\sigma$, of moving from the start state to the first controlled state and never seeing a red transition is $\frac1n \cdot 0$. The probability of moving from the start state to the second controlled state and never seeing a red transition is $\frac1n \cdot \frac1n$ (telescoping product). Summing over all $n$ cases, we get that the probability to see no red transition is \[ \begin{aligned} & \frac1n \cdot 0 + \frac1n \cdot \frac1n + \frac1n \cdot \frac2n + \cdots + \frac1n \cdot \frac{n-1}{n} \\ &= \frac{1}{n^2} \cdot \frac{(n-1)\cdot n}{2} \;=\; \frac{n-1}{2n} \ \approx \ \frac12\,. \end{aligned} \] It follows that $u \ge 1/2$ (which we already know from the example above).
Perhaps $u = 1/2$? No:
The price of memory is greater than $0.503$.
Sketch.
Take the MDP from the picture below, which has $n+2$ controlled states, with $w = 0.03$, $x = 0.08$, $y = 0.00089$, and $n=1000$.
As in the previous examples, take as $\sigma$ the strategy that greedily takes the first red transition it can grab and then avoids all others.
Along these lines I think I can push the lower bound to $u \ge \frac{2}{2+\sqrt2} \approx 0.59$.
But that's not quite the point.
I really want to give you the money!
Update (April 2018): The problem has been solved.
Hi, interesting, just the kind of thing I like thinking about.
I notice your examples only have one random state: is it ok to use more, or is there perhaps a theorem that says it doesn't make any difference?
There is no restriction on the number of random states. I don't know if it makes a difference.
Not sure if this works out but I wondered if there is an argument which shows that only one root random state is necessary. Argument is to edit out a single random state not at the root.
Suppose that we have a second random state somewhere, S, and it has (for simplicity) two moves with probs q and 1-q to states S1 and S2.
For each number r of red edges from 0 up to the max possible value R, there is some probability p(S,r) of arriving in state S having crossed r red edges.
Now let's create a new MDP. It will be the same as the original but we will edit out S (delete it and any edges leading into it). Introduce new states S1(r) and S2(r) for r=0...R. It has red edges from S1(r+1) to S1(r) and from S2(r+1) to S2(r) and from S1(0) to S1 and from S2(0) to S2. The point is of course that each state S1(r) inevitably leads to S1 having crossed r red edges.
Now adapt the original root random node. Multiply the existing probabilities by 1-sum(p(S,r)). Add new edges to each node S1(r) with probability p(S,r)*q and to each node S2(r) with probability p(S,r)*(1-q). There may be some more detailed editing necessary to get it to work. (Exercise for the interested reader).
Claim is that all the probabilities on every node are the same and so (wave hands) the resulting network has the same probability of hitting the end with no reds under sigma-bar.
Really not my area so above could be gibberish.
Typo in the above, obviously the edges from S1(0) and S2(0) are NOT red, but black.
Could you please clarify which strategy you have in mind when you mention the probability p(S,r)? Is it sigma (a specific strategy that takes a red transition surely)? Or sigma-bar (for a specific sigma)?
You are assuming I was thinking clearly, which is not necessarily correct!
If the argument is going to help then it has to be for the original sigma. Imagine the original sigma, then (if the argument is correct) replace one random node via this argument, repeat until you only have the root random node.
Then to find the value of u it would be enough to consider only cases with one random node.
There's an obvious mapping from nodes in the original graph to the new graph and the claim is that this mapping preserves the probability of being in that state.
Even if true then it's necessary to show that it maps over to the same probability of not hitting any red edge in the new sigma-bar, which seems intuitive but may be wrong.
Suppose there is a random state, S, which is the only predecessor of the sink state. So all runs go through S, independently of the strategy. Now apply your construction to edit S out. If I understand the construction correctly, most parts of the original MDP will be bypassed, including any controlled states. That would mean that in the new MDP the choice of strategy doesn't matter.