Tell me the price of memory and I give you €100

Markov Decision Processes (MDPs) are Markov chains plus nondeterminism: some states are random, the others are controlled (nondeterministic). In the pictures, the random states are round, and the controlled states are squares:

The random states (except the brown sink state) come with a probability distribution over the successor states. In the controlled states, however, a controller chooses a successor state. What does the controller want to achieve? That depends. In this blog post, the objective is very simple: take a red transition. The only special thing about red transitions is that the controller wants to take them. We consider only MDPs with the following properties:

There are finitely many states and transitions.
The MDP is acyclic, that is, it has no cycles.
There is a unique start state from which every run starts (in the pictures: blue, at the top) and a unique sink state where every run ends (in the pictures: brown, at the bottom). No matter what the controller does, the sink will be reached.
The controller has a strategy, $\sigma$, to take a red transition surely. That is, no matter how unfortunately the random states behave in a run, the controller manages to take at least one red transition at some point.

There is no apparent reason for $\sigma$ to use any memory: one can show (even in much more general settings) that if there is a strategy to take a red transition with probability 1, then there is also a memoryless deterministic strategy that achieves that. Memoryless deterministic means that for every controlled state the strategy always takes the same outgoing transition. In the MDP above, $\sigma$ may choose to take a red transition whenever possible, until the sink is reached:

Sometimes, though, taking red transitions is costly in some sense, and we would prefer to take exactly one red transition. Therefore we consider strategies $\sigma$ that use one bit of memory: that bit is initially 0, gets switched to 1 as soon as a red transition is taken, and then remains 1. The strategy $\sigma$ may use the bit as it pleases, as long as it guarantees taking at least one red transition in every run. Once a red transition has been taken, $\sigma$ may just relax or avoid further red transitions. For the MDP above, let us define $\sigma$ to take a red transition when coming directly from the start state (that is, when the bit is still 0) and, afterwards (that is, when the bit is 1), to go directly to the sink state:

Given such a one-bit strategy $\sigma$, let us define a memoryless randomized strategy $\overline\sigma$, that is, $\overline\sigma$ fixes, for each controlled state, a probability distribution over the state's outgoing transitions. We do this in a very particular way, depending on $\sigma$. First we compute for each state $s$ a value $b(s)$, which tells us how likely it is that the bit is 1 when visiting $s$. More precisely, $b(s)$ is the conditional probability (under $\sigma$) that the bit is 1 when $s$ is visited, conditioned on the runs that visit $s$. (Each run visits each state at most once, by acyclicity.) For example, for the start state $s_0$ and the sink state $s_\bot$ we always have $b(s_0)=0$ and $b(s_\bot)=1$. In the MDP above, for the depicted strategy $\sigma$, we have $b(s) = 0$ for the topmost controlled state, and $b(s) = 1/2$ for the other controlled states. (States $s$ that are never visited under $\sigma$ don't have a well-defined $b(s)$-value, but that won't matter.)
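Written as a formula, the definition of $b(s)$ reads as follows (the notation $\Pr_\sigma$ for probabilities under $\sigma$ is introduced here only for illustration):

\[ b(s) \;=\; \Pr_\sigma[\text{bit is } 1 \text{ when visiting } s \mid s \text{ is visited}] \;=\; \frac{\Pr_\sigma[s \text{ is visited with bit } 1]}{\Pr_\sigma[s \text{ is visited}]}\,. \]

The denominator is positive exactly for the states that $\sigma$ visits with positive probability, which is why $b(s)$ is undefined for the others.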

Given the $b(s)$-values, the memoryless randomized strategy $\overline\sigma$ is easy to define:

For each controlled state $s$ the strategy $\overline\sigma$ picks

with probability $b(s)$ the transition that $\sigma$ picks when the bit is 1;
with probability $1-b(s)$ the transition that $\sigma$ picks when the bit is 0.
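The construction of $b(s)$ and $\overline\sigma$ can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the toy MDP below is hypothetical (it is not one of the MDPs from the pictures), and the data layout and the names `ORDER`, `RANDOM`, `SIGMA`, `bit_values` and `sigma_bar` are made up for illustration. States must be listed in topological order, which is possible because the MDP is acyclic.

```python
from fractions import Fraction

# Hypothetical toy MDP (NOT the one from the pictures).
# A transition is a pair (target, is_red). States are in topological order.
ORDER = ["s0", "c1", "c2", "sink"]
RANDOM = {  # random state -> list of (probability, transition)
    "s0": [(Fraction(1, 2), ("c1", False)), (Fraction(1, 2), ("c2", False))],
}
# One-bit strategy sigma: controlled state -> {bit value: chosen transition}
SIGMA = {
    "c1": {0: ("c2", True), 1: ("sink", False)},
    "c2": {0: ("sink", True), 1: ("sink", False)},
}


def bit_values(order, random, sigma, start):
    """Return b(s) for every state that sigma visits with positive probability."""
    # reach[s][bit] = probability (under sigma) of visiting s with that bit value
    reach = {s: {0: Fraction(0), 1: Fraction(0)} for s in order}
    reach[start][0] = Fraction(1)  # the bit is initially 0
    for s in order:
        for bit in (0, 1):
            mass = reach[s][bit]
            if mass == 0:
                continue
            if s in random:
                moves = random[s]
            elif s in sigma:
                moves = [(Fraction(1), sigma[s][bit])]
            else:
                moves = []  # sink: no outgoing transitions
            for prob, (target, is_red) in moves:
                # the bit switches to 1 on a red transition and then stays 1
                new_bit = 1 if (bit == 1 or is_red) else 0
                reach[target][new_bit] += mass * prob
    return {s: r[1] / (r[0] + r[1]) for s, r in reach.items() if r[0] + r[1] > 0}


def sigma_bar(sigma, b):
    """Memoryless randomized strategy: state -> {transition: probability}."""
    result = {}
    for s, choice in sigma.items():
        if s not in b:
            continue  # never visited under sigma, b(s) is undefined
        dist = {}
        for bit, weight in ((1, b[s]), (0, 1 - b[s])):
            if weight > 0:
                dist[choice[bit]] = dist.get(choice[bit], Fraction(0)) + weight
        result[s] = dist
    return result


B = bit_values(ORDER, RANDOM, SIGMA, "s0")
SIGMA_BAR = sigma_bar(SIGMA, B)
print(B["c2"])          # 1/2
print(SIGMA_BAR["c2"])  # red and non-red transition to the sink, each with probability 1/2
```

In this toy MDP, `c2` is reached with the bit at 0 directly from `s0` and with the bit at 1 via `c1`, each with probability $1/2$, so $b(\texttt{c2}) = 1/2$ and $\overline\sigma$ flips a fair coin there.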

This strategy $\overline\sigma$ effectively results in a Markov chain, because like random states the controlled states now have a probability distribution over their outgoing transitions:

Intuitively, $\overline\sigma$ does its best to emulate $\sigma$ without using the bit. This works well, in the following sense:

For each state $s$, the probability to visit $s$ (at some point during the run) is the same under both strategies $\sigma$ and $\overline\sigma$. This holds equally for transitions: for each transition $t$, the probability to take $t$ (at some point during the run) is the same under both strategies.

The proof is by induction on the acyclic graph.
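Here is a sketch of the inductive step for a controlled state $s$ and an outgoing transition $t$. The notation is introduced here only for illustration: $p_\sigma(s)$ is the probability of visiting $s$ under $\sigma$, $\sigma(s,i)$ is the transition that $\sigma$ picks in $s$ when the bit is $i$, and $\overline\sigma(s)(t)$ is the probability with which $\overline\sigma$ picks $t$ in $s$. Assuming, as induction hypothesis, that $p_{\overline\sigma}(s) = p_\sigma(s)$:

\[ \begin{aligned} \Pr_\sigma[\text{take } t] \;&=\; p_\sigma(s) \cdot \big( b(s) \cdot [\sigma(s,1) = t] + (1-b(s)) \cdot [\sigma(s,0) = t] \big) \\ &=\; p_{\overline\sigma}(s) \cdot \overline\sigma(s)(t) \;=\; \Pr_{\overline\sigma}[\text{take } t]\,, \end{aligned} \]

where $[\,\cdot\,]$ is $1$ if the condition holds and $0$ otherwise. For random states the two strategies agree anyway, and the visiting probability of a state is the sum of the probabilities of its incoming transitions.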

In the example above, the probability of visiting the topmost controlled state is $1/4$, and for each of the other controlled states it is $1/2$, under either strategy.

This equivalence between $\sigma$ and $\overline\sigma$ extends to expectations:

The expected number of visits of red transitions is at least 1, under both strategies $\sigma$ and $\overline\sigma$.

The expected number of visits of red transitions equals the sum, over the red transitions $t$, of the probabilities of taking $t$. (Remember: no cycles.) By the proposition above, this number is the same under either strategy. Under $\sigma$, every run takes at least 1 red transition. So the expected number of visits of red transitions under $\sigma$ is at least 1.
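As a chain of (in)equalities, with $R$ denoting the set of red transitions and $N$ the number of red transitions taken in a run (both symbols are introduced here only for illustration):

\[ \mathbb{E}_{\overline\sigma}[N] \;=\; \sum_{t \in R} \Pr_{\overline\sigma}[\text{take } t] \;=\; \sum_{t \in R} \Pr_{\sigma}[\text{take } t] \;=\; \mathbb{E}_{\sigma}[N] \;\ge\; 1\,. \]

The first and third equalities are linearity of expectation (each transition is taken at most once), and the middle one is the statement about transitions above.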

Now we come to the point: By moving from $\sigma$ to $\overline\sigma$ we lose the guarantee to visit at least one red transition. Even though the expectation of the number of visits of red transitions remains at least 1, it is conceivable that $\overline\sigma$ is unlikely to visit a red transition. For instance, it is conceivable that most runs (say, 99%) don't visit any red transitions at all, whereas a small fraction (say, 1%) of runs visit many (say, 100) red transitions. In this case, we might speak of a “risk” of 0.99: the probability, under $\overline\sigma$, of taking no red transition. The price of memory is the supremum of all risks:

The price of memory is the smallest number, $u$, such that for any MDP and any one-bit strategy $\sigma$ satisfying the assumptions above, the probability, under $\overline\sigma$, of taking no red transition is at most $u$. Equivalently, $u$ is the smallest number such that the probability of taking at least one red transition is always at least $1-u$.
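For a single Markov chain, the risk and the expected number of red transitions can be computed in one pass over the states in topological order. The sketch below does this for a small hypothetical chain (not one of the pictured examples); the names `CHAIN`, `ORDER` and `risk_and_expectation` are assumptions made for illustration.

```python
from fractions import Fraction

# Hypothetical Markov chain induced by some sigma-bar (toy example).
# state -> list of (probability, target, is_red)
CHAIN = {
    "s0": [(Fraction(1, 2), "c1", False), (Fraction(1, 2), "c2", False)],
    "c1": [(Fraction(1), "c2", True)],
    "c2": [(Fraction(1, 2), "sink", True), (Fraction(1, 2), "sink", False)],
    "sink": [],
}
ORDER = ["s0", "c1", "c2", "sink"]  # topological order


def risk_and_expectation(chain, order, start, sink):
    """Return (P[no red transition is taken], E[number of red transitions taken])."""
    visit = {s: Fraction(0) for s in order}   # P[visit s]
    no_red = {s: Fraction(0) for s in order}  # P[visit s without a red transition so far]
    visit[start] = no_red[start] = Fraction(1)
    expected_reds = Fraction(0)
    for s in order:
        for prob, target, is_red in chain[s]:
            visit[target] += visit[s] * prob
            if is_red:
                # linearity of expectation: add P[this red transition is taken]
                expected_reds += visit[s] * prob
            else:
                no_red[target] += no_red[s] * prob
    return no_red[sink], expected_reds


risk, expected = risk_and_expectation(CHAIN, ORDER, "s0", "sink")
print(risk, expected)  # 1/4 1
```

In this toy chain the expected number of red transitions is exactly 1, yet a quarter of the runs take none.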

Trivially, $u \le 1$. The price of memory is a global constant (not depending on any particular MDP or strategy). Here is the open problem:

Is $u < 1$? Worth €100.
Determine $u$. Worth another €100, even if $u=1$.

If you are the first to send me a solution, I will give you the money, and, if you agree, you and your solution will be featured here. To me the problem doesn't “feel” hard, but I have tried hard and failed.

The example above gives us a lower bound on $u$: The probability, under $\overline\sigma$, of seeing no red transition is $\frac14 \cdot 0 + \frac14 \cdot \frac12 + \frac14 \cdot \frac12 + \frac14 \cdot \frac12 = \frac38$, which is a little less than $1/2$. Generalizing the example from $4$ to $n$ controlled states shows that $u$ cannot be less than $1/2$.
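A hedged sketch of that generalization: assume the $n$-state version keeps the pattern of the sum above, that is, the start state reaches each of the $n$ controlled states with probability $\frac1n$, one of them contributes $0$, and each of the others contributes $\frac12$. This is read off the formula, not the picture. The probability of seeing no red transition would then be

\[ \frac1n \cdot 0 + (n-1) \cdot \frac1n \cdot \frac12 \;=\; \frac{n-1}{2n} \;\xrightarrow{\;n \to \infty\;}\; \frac12\,. \]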

Here is another example:

As $\sigma$ we take the strategy that takes a red transition as early as possible and then completely avoids further visits of red transitions:

The corresponding $\overline\sigma$ assigns the transitions the following probabilities:

The probability, under $\overline\sigma$, of moving from the start state to the first controlled state and never seeing a red transition is $\frac1n \cdot 0$. The probability of moving from the start state to the second controlled state and never seeing a red transition is $\frac1n \cdot \frac1n$ (telescoping product). Summing over all $n$ cases, we get that the probability of seeing no red transition is \[ \begin{aligned} & \frac1n \cdot 0 + \frac1n \cdot \frac1n + \frac1n \cdot \frac2n + \cdots + \frac1n \cdot \frac{n-1}{n} \\ = \quad& \frac{1}{n^2} \cdot \frac{(n-1)\cdot n}{2} \ = \ \frac{n-1}{2n} \ \approx \ \frac12 \quad \text{for large } n\,. \end{aligned} \] It follows that $u \ge 1/2$ (which we already know from the example above).
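The sum is easy to check numerically. The snippet below is a small sanity check using exact rational arithmetic; the function name `no_red_probability` is made up for illustration.

```python
from fractions import Fraction


def no_red_probability(n):
    """Evaluate (1/n)*0 + (1/n)*(1/n) + ... + (1/n)*((n-1)/n) exactly."""
    return sum(Fraction(1, n) * Fraction(k, n) for k in range(n))


for n in (4, 10, 1000):
    p = no_red_probability(n)
    # closed form: (1/n^2) * (n-1)*n/2 = (n-1)/(2n)
    assert p == Fraction(n - 1, 2 * n)
    print(n, p, float(p))
```

For $n = 1000$ the value is $999/2000 = 0.4995$, so the sum approaches $1/2$ from below.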

Perhaps $u = 1/2$? No:

The price of memory is greater than $0.503$.

Proof sketch. Take the MDP from the picture below, which has $n+2$ controlled states, with $w = 0.03$, $x = 0.08$, $y = 0.00089$, and $n=1000$. As in the previous examples, take as $\sigma$ the strategy that greedily takes the first red transition it can grab and then avoids all others.

Along these lines I think I can push the lower bound to $u \ge \frac{2}{2+\sqrt2} \approx 0.59$. But that's not quite the point. I really want to give you the money!

Update (April 2018): The problem has been solved.
