For decades, segmentation has been a must for many businesses around the world. In an increasingly competitive market, where understanding customers (and keeping them) can be the difference between a flourishing business and a stagnating one, efficient and accurate segmentation methods are essential. Some businesses have underestimated the power of segmentation, or tried to minimize costs by defining segments through a simple exploration of transaction data. In doing so they have drawn faulty conclusions and defined customer segments that did not exist, with hordes of dissatisfied customers and enormous monetary losses as a consequence. Others have used advanced techniques and gained an edge in their markets. What these successful segmentations have in common is the use of machine learning algorithms (artificial neural networks) that determine the optimal number of distinct segments and the characteristics of each group.
One industry that has been reluctant to embrace segmentation techniques is health-care, and there are several reasons for this. Segmentation (of markets) is understood to be a method for economic gain, which to most people implies that someone needs to “pick up the bill”. This has naturally raised several ethical questions, since the “loser” in this case must be the patient. Some private hospitals have nevertheless accepted the concept with open arms, but the techniques used have always been based on predefined segments. Furthermore, the segmentation variables have, for obvious reasons, been diagnosis and age.
There has, however, been a shift in how health-care services are viewed: from the traditional role, often revered as a summit of expertise that cannot be questioned, to an actor that needs to focus on every patient’s needs and abilities, on the patient’s right to a second opinion and, foremost, on his/her satisfaction with or experience of health-care.
In 2016, SKL (Sveriges Kommuner och Landsting, the Swedish Association of Local Authorities and Regions) started a project in which a segmentation of primary health-care patients would lead to the development of innovative methods to ensure that every patient’s rights, needs and abilities are taken into account in their encounters with health-care services. In a second phase, Kentor was given the task of performing this segmentation. In this blog post, we do not intend to give a full account of the method, but to give some ideas about which steps and techniques to use.
Companies gather very large amounts of transaction and customer data, which is the basis for a good segmentation. But what about health-care? Health-care providers do gather quite a lot of data about their patients, but the nature of the information (medical records) makes it unsuited for this kind of work. Indeed, apart from respect for the individual’s integrity, one needs to realize that a patient is not his/her illness, but rather a myriad of facets that cannot be captured simply through the data in a medical record.
Constructing a questionnaire and determining segmentation variables
The question that arises is thus: what determines an individual’s interaction with primary health-care? In other words, what are the driving forces that lead to a positive encounter between a patient and his/her caregiver? Through a thorough review of research in psychology and health-care, we concluded that the best way to proceed was to create a survey covering a wide range of aspects that define an individual, both from a personal perspective and in relation to others. The dimensions covered were:
· Demography and socio-economy
· Psychographics (lifestyle, values, personality)
· Behavior and preferences in health-care
We constructed a survey containing 70 questions and validated it on a representative sample of the Swedish population. We kept only complete answers (all questions had to be answered) and eliminated outliers (individuals who answered questions in a contradictory manner).
It is worthwhile to pause on the subject of constructing and validating questionnaires. We often encounter surveys that are poorly constructed, and the conclusions drawn from them are erroneous and, in the worst cases, misleading. Here are a few simple rules to follow:
The first step in validating a questionnaire is to establish face validity, that is, to ensure that the survey does indeed measure what it is supposed to measure. There are two very important steps in this process.
a. Experts on the topic should read through the questionnaire and evaluate whether the questions effectively capture the topic under investigation.
b. A psychometrician should check the survey for common errors such as double-barreled, confusing, and leading questions. A question must include one and only one subject.
The second step is to pilot test the survey on a subset of your intended population. Recommendations on sample size for pilot testing vary. After collecting pilot data, enter the responses into a spreadsheet and clean the data. Then check the internal consistency of questions loading onto the same factors (obtained by principal component analysis, PCA). Internal consistency is a measure of reliability: it checks whether responses to questions that load onto the same principal component are correlated, and thus consistent. A standard test of internal consistency is Cronbach’s alpha, whose values normally range from 0 to 1.0. In most cases the value should be at least 0.70, although a value between 0.60 and 0.70 is acceptable.
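To make the reliability check concrete, here is a minimal Python sketch of Cronbach’s alpha. The function name cronbach_alpha and the pilot answers are made up for illustration, and the sketch assumes that all items are scored on the same numeric scale and in the same direction. It is not the code used in the project.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])  # number of items (questions)
    if k < 2:
        raise ValueError("need at least two items")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical pilot data: 5 respondents, 3 questions on a 1-5 scale,
# all assumed to load onto the same principal component.
pilot = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
    [3, 3, 3],
]
alpha = cronbach_alpha(pilot)
print(round(alpha, 3))
```

For this toy data the value comes out at about 0.95, well above the 0.70 threshold, because the three answers of each respondent move together.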
We remind the reader that our aim is to segment a population using features that differ from those classically used. As mentioned above, patients are more than their demographic and health attributes, and it is not unreasonable that patients of different ages with different health issues have more in common than patients of the same age suffering from the same condition. This can, for instance, be due to differences in personality. For the purpose of segmentation, we therefore chose to eliminate demographic and socio-economic variables as candidate segmentation variables. We also note that many questions that at first glance seem unrelated may be strongly correlated. Good practice in segmentation is to avoid giving too heavy a weight to certain dimensions, so a choice needs to be made as to which variables to use. There are several ways to do this. If the dimensions (areas such as personality, relatedness, fear and so forth) are well defined, a simple pairwise correlation analysis might do the trick. Unfortunately, this is seldom the case, and the dimensions need to be determined by analyzing the responses. A particularly efficient way to do so is principal component analysis, because it not only determines the correlation between sets of variables but also gives information about which variables weigh the most. This enables the elimination of variables that have been ambiguously understood by respondents.
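As an illustration of how principal components can guide the choice of variables, the following sketch runs a PCA on the correlation matrix of synthetic answers and picks, for each component, the question with the largest absolute loading. The data, the function name pca_loadings and the rule of keeping one question per component are assumptions made for the example, not necessarily the selection procedure used in the project.

```python
import numpy as np


def pca_loadings(X, n_components):
    """PCA on the correlation matrix; returns (explained variance ratio, loadings)."""
    corr = np.corrcoef(X, rowvar=False)        # pairwise correlations between questions
    eigvals, eigvecs = np.linalg.eigh(corr)    # symmetric matrix, eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Loading = correlation between a question and a component.
    loadings = eigvecs * np.sqrt(eigvals)
    # The trace of a correlation matrix equals the number of variables.
    return eigvals / corr.shape[0], loadings


# Synthetic, hypothetical answers: questions 0-3 share one latent trait,
# questions 4-5 share another.
rng = np.random.default_rng(0)
trait_a, trait_b = rng.normal(size=(2, 500))
noise = rng.normal(scale=0.5, size=(500, 6))
X = np.column_stack([trait_a] * 4 + [trait_b] * 2) + noise

ratio, loadings = pca_loadings(X, n_components=2)
# For each component, the question that weighs the most.
representatives = np.abs(loadings).argmax(axis=0)
print(ratio.round(2), representatives)
```

Questions with low loadings on every retained component are candidates for elimination, while highly correlated questions within a component can be represented by one of them.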
In our segmentation of the Swedish population we retained 18 of the 70 variables. These were carefully chosen by the method described above, and we ensured that every dimension in the questionnaire was represented by at least one question.
Choosing a machine learning model
Many models have been designed over the years to detect patterns in data, ranging from regression models to advanced machine learning and artificial intelligence. Since we are dealing with survey data, it should be evident that many models need to be excluded, e.g. models specialized purely in image pattern detection, and models that are not sharp enough or are impractical given the number of dimensions considered.
As seen above, and because of the manner in which we designed our survey, the individuals are described by qualitative variables: the data is collected from a questionnaire in which people have responded to a number of questions, each of which has a finite number of possible modalities. Sometimes the values (codes) used to represent the modalities of these variables are treated as numerical values, but this poses a problem. The code values cannot be meaningfully compared, since they are neither necessarily ordered nor regularly spaced. Most of the time, using the codes of the modalities as quantitative (numerical) variables has no meaning, and the qualitative data therefore needs specific treatment.
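One common way to give qualitative answers such a treatment is to replace each code by a set of 0/1 indicators, one per modality, so that no order or spacing between codes is implied. The sketch below assumes this indicator coding. The question names and codes are hypothetical, and the text above does not prescribe this particular encoding.

```python
def one_hot(answers, modalities):
    """Encode one respondent's coded answers as 0/1 indicators, one per modality."""
    row = []
    for question, options in modalities.items():
        code = answers[question]
        if code not in options:
            raise ValueError(f"unknown code {code!r} for {question!r}")
        # One indicator column per possible modality of this question.
        row.extend(1 if code == option else 0 for option in options)
    return row


# Hypothetical questions and codes; the codes carry no order or spacing.
modalities = {"contact_channel": [1, 2, 3], "brings_companion": [1, 2]}
respondent = {"contact_channel": 3, "brings_companion": 1}
encoded = one_hot(respondent, modalities)
print(encoded)  # [0, 0, 1, 1, 0]
```

With this coding, two respondents differ on a question by the same amount whichever two modalities they chose, which is the property the raw codes lack.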
It can be of interest to study only the relations between modalities, in order to classify both the individuals and the modalities of the qualitative variables that describe them. The best candidate for this purpose, in our view, is the Kohonen algorithm (self-organizing map, SOM), which is a very powerful tool for analyzing and visualizing numerical data.
One of the most interesting and advantageous aspects of self-organizing maps is that they belong to the class of unsupervised learning methods. Supervised training techniques usually train on data consisting of vector pairs, an input vector and a target vector: the input vector is presented to a neural network and the output is compared with the target vector. Training a SOM requires no target vector, which is a great advantage in the task of classifying individuals. Remember, we wish to identify groups, not sort individuals into predetermined segments. A Kohonen self-organizing map learns to classify the training data without any external supervision.
The self-organizing scheme was introduced by Teuvo Kohonen and emulates the way events and tasks are clustered in the human brain. It compresses the information in high-dimensional data into geometric relationships on a low-dimensional, most often 2-dimensional, representation. The neurons are ordered in two layers: the input layer and the competition layer. Note that we speak here of competition rather than output, since there are no predetermined outputs. The input layer is composed of as many neurons as there are variables; in our case, these correspond to the questions asked to the panel. The competition layer is a topological 2-dimensional grid of neurons geometrically ordered in a certain manner (often hexagonal).
Each input layer unit (determined by the nature of the problem considered) is connected to all neurons of the competition layer. To enable training and re-evaluation after each iteration, a q-dimensional weight vector is assigned to each competition layer unit, where q is the number of input variables. The initial weights are usually determined by the type of algorithm used, but can be set by the user, something that I personally would not recommend unless you have complete control of the algorithm’s inner workings. The algorithm determines the models that best fit the observations, which it then arranges in such a way that similar models are closer to each other than dissimilar ones. Self-organizing maps also preserve neighborhood relations between the original distribution of input data and the topological low-dimensional output.
The length of the training process has to be determined by the user, usually enough iterations for the algorithm to converge. In each step, an input observation is chosen at random, and the distance between its position in the q-dimensional input space and the weight vector associated with each competition layer unit is calculated using some distance measure. The Euclidean distance is the usual choice, but any distance could be used. The neuron whose weight vector is nearest to the input observation is called the winning neuron. The weight vectors are updated after each training step so that the winning neuron moves closer to the input observation. The winning neuron’s topological neighbors are handled in the same way, so that they too are “displaced” towards the sampled observation.
When the algorithm has converged, that is, when the weights are no longer adjusted, the output map (or component plane) can be visualized. It carries information about how the input variables are related to each other for the given dataset. Another interesting feature is the hit map. It has the same topological structure as the component plane and shows the number of times each neuron was the winning neuron for an input observation. This means that we can extract information about the number of input observations that gather in each neuron, which gives an indication of the importance of each unit in the component plane.
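The training loop and the hit map described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the implementation used in the project: it assumes a rectangular rather than hexagonal grid, a Gaussian neighborhood, and a linearly decaying learning rate and radius. The function names (train_som, winner, hit_map) and the toy data are invented for the example.

```python
import numpy as np


def winner(weights, x):
    """Grid position of the neuron whose weight vector is nearest to x (Euclidean)."""
    dist = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(dist.argmin(), dist.shape)


def train_som(data, rows, cols, n_iter=2000, lr0=0.5, seed=0):
    """Train a rectangular SOM; returns weights of shape (rows, cols, q)."""
    rng = np.random.default_rng(seed)
    q = data.shape[1]
    sigma0 = max(rows, cols) / 2                   # initial neighborhood radius
    weights = rng.random((rows, cols, q))          # random initial weight vectors
    # Grid coordinates of every competition-layer neuron, shape (rows, cols, 2).
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1 - frac)                      # learning rate decays linearly
        sigma = max(sigma0 * (1 - frac), 0.5)      # radius shrinks, floored at 0.5
        x = data[rng.integers(len(data))]          # randomly sampled observation
        bmu = np.array(winner(weights, x))
        # Gaussian neighborhood on the grid, centered on the winning neuron.
        d2 = ((grid - bmu) ** 2).sum(axis=-1)
        h = np.exp(-d2 / (2 * sigma**2))
        # Move the winner and its neighbors towards the observation.
        weights += lr * h[..., None] * (x - weights)
    return weights


def hit_map(weights, data):
    """Number of observations for which each neuron is the winning neuron."""
    hits = np.zeros(weights.shape[:2], dtype=int)
    for x in data:
        hits[winner(weights, x)] += 1
    return hits


# Toy data: two well-separated groups of respondents in 3 dimensions.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.2, 0.05, (100, 3)), rng.normal(0.8, 0.05, (100, 3))])
weights = train_som(data, rows=4, cols=4)
hits = hit_map(weights, data)
print(hits)
```

A fixed number of iterations stands in for a convergence check here. In practice one would monitor how much the weights still change, and plot each slice weights[:, :, j] as the component plane of variable j.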
Once the segments for the chosen segmentation variables are known, it is more or less child’s play to gather every individual into their respective group and compile statistics about their answers to the survey questions. This in turn allows us to construct a real picture of the kind of individuals they are. The neat thing about this way of segmenting populations is that the unsupervised nature of the algorithm assures us that the segments do exist. This result has been confirmed by interviews carried out with patients in several health-care units.
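Assuming each respondent already carries a segment label, the profiling step could look like the sketch below, which computes the share of each answer code within every segment for one question. The labels, answers and function name are hypothetical.

```python
from collections import Counter, defaultdict


def segment_profiles(segments, answers):
    """Share of each answer code per segment, for a single survey question."""
    counts = defaultdict(Counter)
    for seg, code in zip(segments, answers):
        counts[seg][code] += 1
    return {
        seg: {code: n / sum(c.values()) for code, n in c.items()}
        for seg, c in counts.items()
    }


# Hypothetical segment labels and coded answers to one question.
segments = ["A", "A", "B", "B", "B", "A"]
answers = [1, 1, 2, 3, 2, 2]
profiles = segment_profiles(segments, answers)
print(profiles)
```

Repeating this for every question, including the demographic ones left out of the segmentation itself, gives the descriptive picture of each group.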
The results of our work are in the implementation phase in primary health-care in Sweden. Those interested in the structure of the resulting segments are welcome to contact me or read the pamphlet Beteenden och behov av personer i kontakt med vården, made public by SKL. The conclusion is that artificial intelligence in health-care can be put to the benefit of the general population and does not necessarily imply economic motives. On the contrary, such methods benefit patients by leading to the introduction of patient-centered working methods.