Gambling with Reinforcement Learning or playing it safe with Supervised and Unsupervised Learning?



I haven’t written a post in a very long time and I have gotten quite a few messages asking when the next post will be published. Too much work, a lovely summer and a lot of thinking about deep learning have kept me from writing and it is the latter I wish to talk about this time. As I am now working on a project demanding deeper knowledge in machine learning techniques I have revisited my understanding of supervised, unsupervised and reinforcement learning to explain to laymen (and myself) the fundamental differences, the pros and the cons of each type. There is no better test of one’s understanding of things than trying to explain them to others. It’s a daunting task with an immense reward at the end of it, your own understanding of what you have regarded as self-evident.

The title of this post says it all. The term gambling is a little reference to the most used example of reinforcement learning, the multi-layer armed bandit. But, it actually also has to do with the conclusions I draw of reinforcement learning as a technique. Those accustomed with the method know about the dilemmas that one faces, at least vaguely and I will try to convince them that the problems are even worse than they could imagine. My point will not be to discourage from its use but rather to make people observant of the costs associated to them. Reinforcement learning has made major advances in the past decade and is a field of research in itself, something to reckon with if handled the right way.

Most people that end up on this blog already have a rather deep knowledge of machine learning, its different approaches and aspects. However, it feels contextual to give a light introduction to the four main types of learnings used today, And I do mean short. There are plenty of other blogs (if nothing else, on this site) or books that are entirely devoted to these types of machine learning models.

Introductory concepts

Supervised learning

Supervised methods aim at the discovery of relationships between inputs, also called independent variables, and target attributes. The relationship discovered is represented in a structure referred to as a model. Usually models describe and explain given phenomena. The phenomena or pattern is hidden in the dataset and can be used for predicting the value of the target, knowing the values of the input attributes.

Fig 1: A supervised machine learning model is characterized by the fact that the input data to be trained on is labelled, or annotated. The label is for instance either a category or class that the data may belong to.

As an example, let’s take the classification of hand-written letters. Given a large number of annotated photographs of letters, the target will be a statement about the content (letter) of the photograph. This type of supervised learning model will determine the pattern that are unique to each letter and learn these to classify hand-written letter photographs it has never seen.

This example corresponds to one type of supervised models, but there exist other models too. Therefore, the literature usually distinguishes between two main supervised models: classification models (the above example being in this category) and regression models. Regression models map the input space, that is the values or object in the data that need to be analyzed, into a real-value domain, the outcome. For instance, a regression model can predict the expenditure of a pension scheme given the historical costs of that system and the characteristics of the demography and socio-economic factors.

On the other hand, a classification model maps the input space into pre-defined classes, as we described above. There are many alternatives for representing classification model, for example, support vector machines, decision trees, probabilistic summaries and algebraic function. Along with regression and probability estimation, classification is one of the most studied models, possibly one with the greatest practical relevance. The potential benefits of progress in classification are immense since the technique has great impact on other areas.

There are evident differences between regression and classification models. In the context of fraud, the outcome of a regression model would for instance be the probability or likelihood of a given entity or transaction to be the result of fraudulent activity. An evident drawback of such a method is of course that unless this probability is very high or very low for fraud, it is very hard to determine or interpret the given value. Classification models, on the other hand are categorical in their determination of whether some event is fraudulent or not. It is either legitimate or not. But reality is not always black or white and the results need to be investigated. There might indeed exist reasons for which a legitimate event bears the characteristics of a fraudulent one, and vice-versa.

Unsupervised learning

Unsupervised learning, like supervised learning, aims at representing particular input patterns in such a way that they reflect the structure of the overall collection of input patterns. By contrast with supervised learning (and as it will be clear later, reinforced learning), there are no explicit target outputs or environmental evaluations associated with each input. Instead, the unsupervised learning model will encourage prior biases as to what aspects of the structure of the input should be captured in the output.


Fig 2: The difference between a supervised and and unsupervised machine learning model is the total lack of labeled data. The differences in features need to be learned by the model. An advantage with these kinds of models is that no assumptions are made on for instances clusters or categories. An example of how this could be used can be found in Artificial Neural Networks and patient Segmentation.

Unsupervised learning is important since it is likely to be much more likely to resemble the learning process that occurs in the brain than what supervised learning is. In addition, the existence of annotated data is scarce. Indeed, annotation or identification of output corresponding to a given pattern requires either tagging all known data to fit the desired output and/or the knowledge of all patterns in the data. For instance, annotating all instances in a dataset as fraud or non-fraud requires that all fraudulent activities, or modus operandi, are known. It is unfortunately seldom the case. Hence, unsupervised learning is an important method to consider in a wide range of problems. Unsupervised learning methods only work with the observed input patterns, which are often assumed to be independent samples from an underlying unknown probability distribution, and some explicit or implicit a priori information as to what is important.

Two classes of method have been suggested for unsupervised learning. Density estimation techniques explicitly build statistical models, such as Bayesian networks, of how underlying causes could create the input. Feature extraction techniques try to extract statistical regularities (or sometimes irregularities) directly from the inputs. Very often, the data that is studied contains far too many features or variables for models to work efficiently. The aim of feature extraction is thus to reduce that number of variables, that to reduce the dimensionality of the datasets being studied. The result is then an improvement in accuracy, a better visualization of the data and a faster training of the chosen model as well as an increased transparency of the model. On the other hand, the density estimation approach aims at finding the probability in the distribution of the data, that is, to determine what could have given rise to the data. When dealing with continuous data, it is accepted that feature extraction is a more efficient method for, for instance, clustering data into groups of entities or events. The density estimation approach, in fraud detection, is rather hard to apply because of the large expected imbalance in the data between fraud and non-fraud instances.

Semi-supervised learning

Semi-supervised learning is a hybrid approach between supervised and unsupervised learning. In addition to unlabeled or unannotated data, the Machine Learning algorithm is provided with some degree of supervision information. This information might very well be given for some examples but for a number of reasons, it may be lacking for the rest of the data. One of these may be that there are hidden patterns that are unknown. Often, this information setting will be the targets associated with some of the examples. In this case, the data of semi-supervised learning set X can be divided into two parts: The points X’ for which labels Y’ are provided, and the points X’’ for which no labels are known. There are, of course, other forms of partial supervision. For instance, constraints such as “this set of data points have the same target”, hence making a particular class of semi-supervised learning problems a sub-class of unsupervised learning models, namely constrained unsupervised learning. In opposition to this last statement, one can consider semi-supervised learning as supervised learning with additional information on the distribution of the examples. The latter interpretation seems to be more in line with most applications, where the goal is the same as in supervised learning: to predict a target value for a given input. However, this view does not readily apply if the number and nature of the classes are unknown in advance but have to be inferred from the data. In contrast, semi-supervised learning as constrained unsupervised learning remains applicable in such situations.

Given the above description, a question that immediately arises is whether semi-supervised learning is of any use in a real-world setting. To be more specific: Compared with a supervised algorithm that uses only labelled data, is there any chance that using unlabeled inputs can yield more accurate predictions? The existence of semi-supervised models answers this question in a positive manner. However, we need to be aware of the fact that this is not always the case. A number of labelled data might even lead to a degradation of the accuracy by misguiding the inference by attributing features to the labelled data that it actually does not possess. Hence, for semi-supervised learning to work, certain assumptions need to be made. That being said, supervised learning also has to rely on certain assumptions, such as the smoothness assumption of supervised learning seen as a mapping (in the same way any function if defined as smooth): If two points x and x’ are close (closeness is defined in different ways depending on the context), then so should be the corresponding outputs y and y’. Clearly, without such assumptions, it would never be possible to generalize from a finite training set to a set of possibly infinite number of test cases.

Reinforcement learning

Reinforcement learning is learning that aims at maximizing a reward signal, most often numerical (it encodes the success of an action’s outcome, giving the model’s agent the task to learn to select actions that maximize the accumulated reward over time.). In this case, the model is not instructed to which action it should take and is instead given the “chance” to discover which action gives the best or biggest reward by trying the set of ALL possible actions. But, as in real life, any actions taken can (and most probably will) affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. Tricky business, as always! Therefore, when one talks of reinforcement learning, the two characteristic features that are immediately discussed are the trial-and-error search and delayed reward. These are two properties lacking from both supervised and unsupervised learning.

Fig 5: Reinforcement learning: A typical reinforcement Learning model: an agent takes actions in an environment, which is interpreted into a reward and a representation of the state, which are fed back into the agent. The typical framing of a Reinforcement Learning scenario: an agent takes actions in an environment, which is interpreted into a reward and a representation of the state, which are fed back into the agent.

Reinforcement learning sounds great, doesn’t it? So, why are supervised and unsupervised learning models still around if they do not emulate real life in a proper way? Let’s look at the pros and cons of our old friends, supervised and unsupervised techniques and end with why, in my humble opinion, there still is a long way to go before I’d bet my life on reinforcement learning.

The two biggest problem

Supervised learning is learning from a training set of annotated or labelled data. Each example is a description of a situation together with a specification of the correct action the system should take to that situation, which is often to identify a category to which the situation belongs. We covered that above, although phrased slightly differently. The goal is for the system to generalize its responses so that it always acts correctly, even in situations that are not in the training set. In interactive problems, it is often impossible to have examples of desired behaviour that are both correct and representative of all the situations in which the agent operates, as described in the previous section. And this is a bit of a dilemma, since it is in those situations that learning would be most beneficial since one expects an agent to be able to learn from its own experience.

Unsupervised learning is concerned with finding hidden structure or patterns in collections of unlabeled data. I would seem that supervised learning, unsupervised and semi-supervised learning would exhaustively classify all types of machine learning models. But, they apparently do not.

Although one might think of reinforcement learning as a kind of unsupervised learning because it does not rely on examples of correct behaviour, reinforcement learning differs from all other learning models because it tries to maximize a reward signal instead of trying to find hidden structure. Uncovering structure can without doubt be beneficial in reinforcement learning, but by itself does not address the reinforced learning problem of maximizing a reward signal. We therefore consider reinforced learning to be a third machine learning model, together with supervised learning, unsupervised learning and their hybrid, semi-supervised learning.

Major issues with reinforcement learning

Exploration/Exploitation trade-off

One of the biggest issues encountered in reinforcement learning is the trade-off between exploration and exploitation. To maximize reward, a reinforced learning model must prefer actions that it has tried in the past and found to be effective in producing reward. However, to discover such actions it necessarily must try actions that it has not selected before, which it would rather not do. Hence, the model needs to exploit what it has already learned to obtain reward, but it also has to explore the unknown (not necessarily rewarding) in order to make better selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The model needs to try a wide range of actions and successively learn which ones seem to give the largest reward. Each action must be tried a sufficiently large number of times in order for a reliable estimate of the expected reward to be gained. It should be pointed out that a solution to the Exploration/Exploitation dilemma has not yet been solved. This does not mean that the reinforced learning method is not useful, on the contrary. It simply implies that one needs to be aware of the fact that a balance has to be found between the two concurring issues.

Credit assignment problem

The credit assignment problem refers to the fact that rewards can occur in a delayed fashion. Remember that all possible actions need to be tested and that there are consequences of every action in the future. A robot would do the following: If well designed it would perform many moves through its state-action space where immediate rewards are practically null. The more relevant events are distant in the future and as a consequence these rewards will only affect all distant states that have preceded it in an almost non-existent manner. Hence, the influence of a reward is diluted over time, a consequence of which can be rather bad convergence properties of the reinforcement learning model. The problem thus becomes the design of a reinforcement learning algorithm performing a sufficiently large amount of steps (by iterations) to propagate the influence of delayed reinforcement.

Put in a slightly simpler way, this means that reinforcement learning agents are a little like gamblers playing over and over again every single steps of winning games trying to figure out why they won. What they in fact do, is maximizing a single number which is the result of actions over multiple time steps not taking into account that there are random events they have absolutely no control over. Now, this might clarify a little the inefficiency and flaws of this strategy.

The thing is that you want rewards to be easy to defined in a very specific manner. What reinforcement learning actually seeks to do is to define when something was done right and to learn how to do that very thing in a reliable manner, whereof the term “reinforcement”.  But, this is in a lot of ways problematic because there are many more meaningful and “right” tasks to perform for high rewards than what we are currently capable of doing with today’s algorithms. There is a chasm between all possible choices that can be made at every step of the algorithm and the rewardable tasks. This simply makes algorithms fail badly.

There are essentially two ways to solve this problem. The first one is one of generosity, or as I see it, put the algorithm on steroids. Since the scale at which rewards are provided is an issue, why not be more generous with awards? But, the only mention of a solution in which you are suggesting to tamper with rewards should be giving anyone the chills. This is not a new problem, it is what is more generally known as wireheading:

“Wireheading is the artificial stimulation of the brain to experience pleasure, usually through the direct stimulation of an individual’s brain’s reward or pleasure center with electrical current. It can also be used in a more expanded sense, to refer to any kind of method that produces a form of counterfeit utility by directly maximizing a good feeling, but that fails to realize what we value.”

Basically, wireheading and more specifically reward hacking, amounts to putting an algorithm on drugs and allowing it to value (reward) tasks not knowing why except that it optimizes rewards.

The second of the possible ways to tackle the problem is known as hierarchical reinforcement lerning. It is basically an attempts to decompose the problem of very distant reward problems into a sequences of close and distant goal. Actually, the term hierarchical comes from the fact that some goals are “more pressing” than others, which then are subordinate. The goals are thus put into a hierarchy or value system, just like most things in life. the effect of doing this is to dilate the time scale at which decisions are being made in order to put forward the actual value of rewards. So, the question suddenly becomes: How big should the hierarchy be? Or, better put, how many subgroups or subgoals should there be? there is really no boundary for how deep the hierarchy can go, the real show-stopper is practicality. A good example of how this can be done is travelling from one city to another. The first choice is whether to go or not go. The next step, if going is the choice giving the biggest reward, is how to go there. Depending on what the reward is, different means of travel are preferable than others. If the destination is far enough the alternatives may be several: Car, train, bus, airplane, boat and so forth. Each one of these choices involves subsequent choices to be made (schedules, halts, time, food and so on). What is the granularity needed to solve the problem in an adequate way? How computationally heavier does such a model become? So far, it is my opinion that this solution is a far better alternative than a model on steroids rewarding almost any choices and thus questioning the use of rewards at all.

But…’s the thing: Even when rewards given in hierarchical reinforcement models are good, you can’t escape the risk of getting stuck on local optima. Getting in a state like this makes models behave in very nasty little ways. A very instructive example was designed by Alexir Pan in this video.

So, being slightly risk averse I would not, at least for now, gamble if I have the choice and stick to models which wouldn’t give me too much of a head-ache. The amount of research and effort put into reinforcement learning is promising and the advances made in gaming, to give just the most obvious example, does show the huge potential of reinforcement learning




One thought on “Gambling with Reinforcement Learning or playing it safe with Supervised and Unsupervised Learning?

Add yours

  1. Thanks for some other wonderful article. The place else may
    just anybody get that type of information in such an ideal means of writing?
    I’ve a presentation subsequent week, and I am on the search
    for such information.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Website Powered by

Up ↑

%d bloggers like this: