earlier post I defined the price of memory, u, and asked:
Any such strategy \sigma induces a randomized strategy \overline\sigma that acts like \sigma in average, but is memoryless. In each state, when averaged over all runs, the strategies \sigma and \overline\sigma take the same actions with the same probabilities:
One can show that \sigma and \overline\sigma visit red transitions equally often in average; in particular, \overline\sigma visits at least one red transition in average. The price of memory is defined as the smallest global (independent of the MDP or any strategy) constant u such that the probability that \overline\sigma does not visit a red transition is at most u. Trivially, u \le 1, and examples as above show that u \ge \frac12. Let us now prove that u=1:
In an
Is u < 1? Worth €100.
Determine u. Worth another €100, even if u=1.
I'm happy to announce that there is now a solution:
Determine u. Worth another €100, even if u=1.
The price of memory, u, is 1.
I don't want to repeat the precise definition of u, but here is a short recap:
Suppose you have an acyclic finite MDP and a strategy \sigma that guarantees to visit a red transition with probability 1.
This strategy \sigma might use memory, i.e., might base its behaviour in a state on the history of the run, as indicated in yellow:
Any such strategy \sigma induces a randomized strategy \overline\sigma that acts like \sigma in average, but is memoryless. In each state, when averaged over all runs, the strategies \sigma and \overline\sigma take the same actions with the same probabilities:
One can show that \sigma and \overline\sigma visit red transitions equally often in average; in particular, \overline\sigma visits at least one red transition in average. The price of memory is defined as the smallest global (independent of the MDP or any strategy) constant u such that the probability that \overline\sigma does not visit a red transition is at most u. Trivially, u \le 1, and examples as above show that u \ge \frac12. Let us now prove that u=1:
Consider the following MDP, M_1:
In the controlled state in the middle, the strategy \sigma goes left if and only if no red transition has been visited yet (which is the case if and only if the initial random state has chosen the right transition). In this way, \sigma visits exactly one red transition with probability 1. It also follows that in the controlled state in the middle, the randomized strategy \overline\sigma picks either outgoing transition with probability \frac12. Overall, the probability in M_1 that \overline\sigma does not visit a red transition is u_1 := \frac12 \cdot \frac12 = \frac14. It follows that u \ge u_1 = \frac14, which is not a strong statement. Consider now the following MDP, M_2, which contains two copies of M_1:
As before, there is a strategy \sigma that ensures to visit exactly one red transition: in any controlled state, \sigma goes left if and only if no red transition has been visited yet. As before, the resulting randomized strategy \overline\sigma picks, in any controlled state, either outgoing transition with probability \frac12. The probability in M_2 that \overline\sigma does not visit a red transition is u_2 := (\frac12 + \frac{1}2u_1)^2 = \frac{25}{64} \approx 0.39. It follows that u \ge u_2.
One can define a sequence of MDPs M_1, M_2, M_3, \ldots in this way:
Defining u_{i+1} := (\frac12 + \frac{1}2u_i)^2 and generalizing \sigma in the natural way, we obtain that in M_i the probability that \overline\sigma does not visit a red transition is u_i. It is straightforward to check that the sequence u_1, u_2, u_3, \ldots is non-decreasing and converges to the only fixed point of the function f(x) = (\frac12 + \frac{1}2x)^2, which is 1. Since u \ge u_i, it follows that u=1.
I should award the prize of €200 to the person who found this solution.
The only issue is: that would be myself.
Was the problem difficult to solve?
For me, yes.
Could a Maths or CS undergrad have found this solution?
Absolutely.
Does this result, u=1, lead to a paper?
It might, but my collaborators and I would have preferred u \lt 1. Oh well.
In the controlled state in the middle, the strategy \sigma goes left if and only if no red transition has been visited yet (which is the case if and only if the initial random state has chosen the right transition). In this way, \sigma visits exactly one red transition with probability 1. It also follows that in the controlled state in the middle, the randomized strategy \overline\sigma picks either outgoing transition with probability \frac12. Overall, the probability in M_1 that \overline\sigma does not visit a red transition is u_1 := \frac12 \cdot \frac12 = \frac14. It follows that u \ge u_1 = \frac14, which is not a strong statement. Consider now the following MDP, M_2, which contains two copies of M_1:
As before, there is a strategy \sigma that ensures to visit exactly one red transition: in any controlled state, \sigma goes left if and only if no red transition has been visited yet. As before, the resulting randomized strategy \overline\sigma picks, in any controlled state, either outgoing transition with probability \frac12. The probability in M_2 that \overline\sigma does not visit a red transition is u_2 := (\frac12 + \frac{1}2u_1)^2 = \frac{25}{64} \approx 0.39. It follows that u \ge u_2.
One can define a sequence of MDPs M_1, M_2, M_3, \ldots in this way:
Defining u_{i+1} := (\frac12 + \frac{1}2u_i)^2 and generalizing \sigma in the natural way, we obtain that in M_i the probability that \overline\sigma does not visit a red transition is u_i. It is straightforward to check that the sequence u_1, u_2, u_3, \ldots is non-decreasing and converges to the only fixed point of the function f(x) = (\frac12 + \frac{1}2x)^2, which is 1. Since u \ge u_i, it follows that u=1.
Comments
Post a Comment