While reinforcement learning (RL) has seen significant successes over the past few years, modern deep RL methods are often criticized for how sensitive they are to their hyperparameters. One such hyperparameter is the discount factor, which controls how future rewards are weighted compared to immediate rewards.
The objective that one wants to optimize in RL is often best described as an undiscounted sum of rewards (for example, maximizing the total score in a game).
In practice, however, a discount factor is introduced to avoid some of the optimization challenges that can occur when directly optimizing an undiscounted objective. And while in theory a discount factor can take on any value between 0 and 1, in practice good performance is only obtained for a small subset of values close to 1.
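To make the distinction concrete, the snippet below compares an undiscounted return with a discounted one for a made-up, sparse reward sequence; the reward values and the discount factor are illustrative assumptions, not values from the paper.

```python
# Compare an undiscounted return with a discounted return for a hypothetical
# sparse-reward sequence. The rewards and discount factor are examples only.

rewards = [0.0, 0.0, 0.0, 1.0]   # a single reward arriving after a few steps
gamma = 0.99                     # a typical discount factor close to 1

undiscounted_return = sum(rewards)
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))

print(undiscounted_return)  # 1.0
print(discounted_return)    # 0.99**3, roughly 0.9703
```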
Microsoft Research will present a paper, "Using a Logarithmic Mapping to Enable Lower Discount Factors in Reinforcement Learning," at the thirty-third Conference on Neural Information Processing Systems (NeurIPS 2019).
In it, the researchers put forward a novel hypothesis as to why the effective discount-factor range is so small. Based on that hypothesis, they outline a technique for avoiding this discount-factor sensitivity and introduce a method built on it, called logarithmic Q-learning.
Logarithmic Q-learning is the first method able to achieve good performance for low discount factors on sparse-reward tasks with function approximation. Moreover, the method not only reduces discount-factor sensitivity, it can also improve performance overall.
While a logarithmic mapping function makes action gaps more homogeneous, a number of challenges have to be overcome to turn this observation into a robust algorithm that can be applied to stochastic domains with both positive and negative rewards.
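The role of the mapping can be illustrated with a small, hypothetical example. With a sparse reward of 1 and discount factor gamma, the optimal value n steps away from the reward is roughly gamma to the power n, so the gap between a good action and a slightly worse one shrinks exponentially with distance; taking a logarithm turns that multiplicative shrinkage into a constant additive difference. The numbers below are assumptions chosen for illustration, not results from the paper.

```python
import math

# Hypothetical illustration: action gaps shrink exponentially with distance to a
# sparse reward, but become a constant after a logarithmic mapping.
gamma = 0.5
q_near = gamma**1    # optimal value one step from the reward
q_far = gamma**10    # optimal value ten steps from the reward

# Gap between the optimal action and one that wastes a step (one extra gamma factor):
gap_near = q_near - gamma * q_near   # 0.25
gap_far = q_far - gamma * q_far      # about 0.0005, orders of magnitude smaller

# After the logarithmic mapping the same gap is the same constant in every state:
log_gap_near = math.log(q_near) - math.log(gamma * q_near)   # -log(gamma)
log_gap_far = math.log(q_far) - math.log(gamma * q_far)      # -log(gamma), identical
```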
Microsoft managed to overcome these challenges and developed logarithmic Q-learning, an algorithm that can be applied to general tasks and has convergence guarantees under standard conditions. The results not only show strong performance for low discount factors; early performance for high discount factors is also better than it was without the logarithmic mapping.
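As a rough sketch of the underlying idea, the tabular update below stores logarithmically mapped values instead of raw Q-values. This is a simplified illustration of learning in a logarithmic space, not the algorithm from the paper, which additionally handles negative rewards and stochastic environments; the function name, the shift constant c, and the restriction to non-negative targets are all assumptions made here for brevity.

```python
import numpy as np

def log_space_q_update(log_q, state, action, reward, next_state,
                       gamma=0.5, alpha=0.1, c=1e-8):
    """Hypothetical tabular update where the table stores log(Q + c) instead of Q.

    Simplifying assumptions (not from the paper): rewards are non-negative and
    the environment is deterministic, so a plain TD update in log space suffices.
    """
    # Map the next state's values back to regular space to form the usual target.
    q_next = np.exp(log_q[next_state]).max() - c
    target = reward + gamma * q_next
    # Move the stored log-value toward the log of the target.
    log_q[state, action] += alpha * (np.log(target + c) - log_q[state, action])
    return log_q

# Example use with a small hypothetical table of 5 states and 2 actions,
# initialized at log(c) so the implied Q-values start near zero.
log_q = np.full((5, 2), np.log(1e-8))
log_q = log_space_q_update(log_q, state=0, action=1, reward=1.0, next_state=1)
```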