Wednesday, July 31, 2019

Model-based and Model-free in Reinforcement Learning (RL)

I was going through some predictive analytics reading and came across the interesting terms "model-based" and "model-free". I found the excerpt below in this PDF and thought of blogging it here for my own future reference, as it explains the distinction so well and is easy to understand.

Reinforcement learning methods can broadly be divided into two classes, model-based and model-free.

Consider the problem illustrated in the figure below, of deciding which route to take on the way home from work on Friday evening.

We can abstract this task as having states (in this case, locations, notably of junctions), actions (e.g. going straight on or turning left or right at every intersection), probabilities of transitioning from one state to another when a certain action is taken (these transitions are not necessarily deterministic, e.g. due to road works and bypasses), and positive or negative outcomes (i.e. rewards or costs) at each transition from scenery, traffic jams, fuel consumed, etc. (which are again probabilistic).
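To make the abstraction concrete for myself, here is a tiny Python sketch of such a task. Everything in it is invented for illustration (the junction names, probabilities and rewards are mine, not from the excerpt): each state maps each available action to a list of (probability, next state, reward) outcomes, and a small helper samples one outcome.

```python
import random

# Toy commute MDP: all junction names, probabilities and rewards are
# invented for illustration. transitions[state][action] is a list of
# (probability, next_state, reward) outcomes.
transitions = {
    "office": {
        "freeway":  [(0.7, "freeway_junction", -1.0),   # light traffic
                     (0.3, "freeway_junction", -4.0)],  # roadworks
        "straight": [(1.0, "side_streets", -2.0)],
    },
    "freeway_junction": {
        "continue": [(0.5, "home", -1.0),    # clear run
                     (0.5, "home", -6.0)],   # Friday-evening jam
    },
    "side_streets": {
        "continue": [(1.0, "home", -2.0)],
    },
    "home": {},  # terminal state: no actions available
}

def step(state, action):
    """Sample one (next_state, reward) outcome from the transition model."""
    r, cumulative = random.random(), 0.0
    for prob, next_state, reward in transitions[state][action]:
        cumulative += prob
        if r <= cumulative:
            return next_state, reward
    return next_state, reward  # fall back to the last outcome

print(step("office", "freeway"))
```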

Model-based computation, illustrated in the left ‘thought bubble’, is akin to searching a mental map (a forward model of the task) that has been learned based on previous experience. This forward model comprises knowledge of the characteristics of the task, notably, the probabilities of different transitions and different immediate outcomes. Model-based action selection proceeds by searching the mental map to work out the long-run value of each action at the current state in terms of the expected reward of the whole route home, and chooses the action that has the highest value.
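Continuing with the toy transitions dictionary from the sketch above, a minimal version of model-based action selection is just a forward search through that model: recursively work out the expected long-run value of each action and pick the best one. This is my own illustrative sketch, not the exact computation described in the paper.

```python
# Model-based action selection as a forward search over the model above:
# the long-run value of an action is its expected immediate reward plus
# the expected value of wherever it leads (rewards here are negative costs).
def state_value(state):
    if not transitions[state]:                 # terminal: nothing left to do
        return 0.0
    return max(action_value(state, a) for a in transitions[state])

def action_value(state, action):
    return sum(p * (r + state_value(s2))
               for p, s2, r in transitions[state][action])

best = max(transitions["office"], key=lambda a: action_value("office", a))
print("model-based choice at the office:", best)   # -> "straight" in this toy world
```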

Model-free action selection, by contrast, is based on learning these long-run values of actions (or a preference order between actions) without either building or searching through a model. RL provides a number of methods for doing this, in which learning is based on momentary inconsistencies between successive estimates of these values along sample trajectories. These values, sometimes called cached values because of the way they store experience, encompass all future probabilistic transitions and rewards in a single scalar number that denotes the overall future worth of an action (or its attractiveness compared with other actions). For instance, as illustrated in the right ‘thought bubble’, experience may have taught the commuter that on Friday evenings the best action at this intersection is to continue straight and avoid the freeway.
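A matching model-free sketch, again my own illustration built on the toy world above: tabular Q-learning caches one scalar value per (state, action) pair and nudges it whenever successive estimates disagree (the temporal-difference error). The learner only ever sees sampled outcomes from step(); it never reads the transition probabilities.

```python
from collections import defaultdict
import random

# Model-free sketch: tabular Q-learning over the same toy world. The cached
# value Q[(state, action)] is a single scalar summarising all future
# transitions and rewards, learned purely from sampled experience via step().
Q = defaultdict(float)
alpha, epsilon, episodes = 0.1, 0.1, 5000

for _ in range(episodes):
    state = "office"
    while transitions[state]:                       # until we reach "home"
        actions = list(transitions[state])
        if random.random() < epsilon:               # occasional exploration
            action = random.choice(actions)
        else:                                       # otherwise act greedily
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward = step(state, action)
        # Temporal-difference update: move the cached value toward the
        # inconsistency between the old estimate and reward + next value.
        next_value = max((Q[(next_state, a)] for a in transitions[next_state]),
                         default=0.0)
        Q[(state, action)] += alpha * (reward + next_value - Q[(state, action)])
        state = next_state

print({k: round(v, 2) for k, v in sorted(Q.items())})
```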

Model-free methods are clearly easier to use in terms of online decision-making; however, much trial-and-error experience is required for the values to become good estimates of future consequences. Moreover, the cached values are inherently inflexible: although hearing about an unexpected traffic jam on the radio can immediately affect action selection that is based on a forward model, the effect of the traffic jam on a cached propensity such as ‘avoid the freeway on Friday evening’ cannot be calculated without further trial-and-error learning on days in which this traffic jam occurs. Changes in the goal of behavior, as when moving to a new house, also expose the differences between the methods: whereas model-based decision-making can be immediately sensitive to such a goal-shift, cached values are again slow to change appropriately. Indeed, many of us have experienced this directly in daily life after moving house. We clearly know the location of our new home, and can make our way to it by concentrating on the new route; but we can occasionally take a habitual wrong turn toward the old address if our minds wander. Such introspection, and a wealth of rigorous behavioral studies (see [15] for a review), suggest that the brain employs both model-free and model-based decision-making strategies in parallel, with each dominating in different circumstances [14]. Indeed, somewhat different neural substrates underlie each one [17].
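The inflexibility point can be seen directly in the toy sketches above (assuming the transitions, action_value and Q objects defined earlier): if I edit the model to reflect a reported jam on the side streets, the model-based planner switches routes immediately, while the cached Q-values keep recommending the old habit until they are retrained on new experience.

```python
# Hearing about a jam on the side streets: edit the model, and the
# model-based planner replans immediately, while the cached Q-values still
# point to the now-bad habit until they are re-learned from new experience.
transitions["side_streets"]["continue"] = [(1.0, "home", -10.0)]  # reported jam

replanned = max(transitions["office"], key=lambda a: action_value("office", a))
habitual  = max(transitions["office"], key=lambda a: Q[("office", a)])
print("model-based choice after the news:", replanned)   # -> "freeway"
print("model-free (cached) choice:", habitual)           # -> still "straight"
```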