Cluster analysis is a multivariate technique whose basic idea is to classify objects into groups or conglomerates (clusters) that are as homogeneous as possible internally and as heterogeneous as possible from one another.

It arises from the need for a strategy that allows us to define groups of homogeneous objects. The grouping is based on a notion of distance or similarity between the observations, and the clusters obtained depend on the criterion or distance considered. For example, a deck of Spanish cards could be divided in different ways: into two clusters (figures and numbers), into four clusters (the four suits), or into eight clusters (the four suits, each split into figures and numbers). That is, the number of clusters depends on what we consider to be similar.

Cluster analysis is a classification task. For example:

- Classify **groups of consumers** according to their preferences for new products
- Classify the **banking entities** in which it would be most profitable to invest
- Classify the stars of the cosmos based on their luminosity
- Identify whether there are groups of municipalities in a given community with a similar trend in water consumption in order to identify good practices for sustainability and problem areas due to high consumption.

As can easily be understood, cluster analysis is of extraordinary importance in scientific research, in any branch of knowledge. Classification is one of the fundamental objectives of science, and to the extent that cluster analysis provides the technical means to carry it out, it becomes essential in any investigation.


**Problem Statement**

Let us consider a sample *X* formed by *n* individuals on which *p* variables, *X _{1} ,…,X _{p}*, are measured (*p* numerical variables observed in *n* objects). Let *x _{ij}* be the value of the variable *X _{j}* in the *i*-th object, *i = 1,…,n; j = 1,…,p.*

This set of numeric values can be arranged in a matrix *X*.

The i-th row of the matrix *X* contains the values of each variable for the i-th individual, while the j-th column shows the values belonging to the j-th variable throughout all the individuals in the sample.
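In matrix notation, the data set described above can be written as follows (a standard presentation, supplied here since the matrix itself is not reproduced in the text):

```latex
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
```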

It is, fundamentally, a matter of solving the following problem: given a set of *n* individuals characterized by the information of *p* variables *X _{j}* (*j = 1,2,…,p*), we propose to classify them in such a way that the individuals belonging to a group (cluster) are as similar as possible to each other (always with respect to the information available on the variables) and the different groups are as dissimilar as possible from one another.

The entire process can be structured according to the following scheme:

- We start from a set of *n* individuals for whom information is coded by a set of *p* variables (a data matrix of *n* individuals and *p* variables).
- We establish a similarity criterion and build a matrix of similarities that allows us to relate the resemblance of the individuals to each other. To measure how similar (or dissimilar) individuals are, there is a large number of similarity and dissimilarity (or divergence) indices. All of them have different properties and uses, and it is necessary to be aware of them for their correct application.
- We choose a classification algorithm to determine the clustering structure of the individuals.
- We specify that structure using tree diagrams.
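The four-step scheme above can be sketched in code. This is a minimal illustration using Python with NumPy and SciPy (an assumed tool choice; the text itself is software-agnostic), on a tiny made-up data matrix:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

# 1) Data matrix: n = 5 individuals described by p = 2 variables
X = np.array([[1, 1], [2, 1], [4, 5], [5, 7], [7, 7]], dtype=float)

# 2) Similarity criterion: here, a matrix of Euclidean distances
D = squareform(pdist(X, metric="euclidean"))

# 3) Classification algorithm: agglomerative clustering
#    (average linkage, one choice among many)
Z = linkage(pdist(X, metric="euclidean"), method="average")

# 4) The linkage matrix Z encodes the tree;
#    scipy.cluster.hierarchy.dendrogram(Z) would draw the tree diagram
print(D.round(2))
print(Z.round(2))
```

Each row of `Z` records one fusion: the two clusters joined, the distance at which they joined, and the size of the resulting cluster.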

**Cluster analysis: Technique of grouping variables and cases**

**As a variable clustering technique**, cluster analysis is similar to **factor analysis**. But while factor analysis is rather inflexible in some of its assumptions (linearity, normality, quantitative variables, etc.) and estimates the distance matrix in a single way, cluster analysis is less restrictive in its assumptions (it does not require linearity or symmetry, it allows categorical variables, etc.) and supports several methods of estimating the distance matrix.

**As a technique for clustering cases**, cluster analysis is similar to **discriminant analysis**. But while discriminant analysis performs the classification taking as reference a criterion or dependent variable (the classification groups), cluster analysis focuses on grouping objects: it detects the optimal number of groups and their composition based only on the similarity between the cases. Furthermore, cluster analysis does not assume any specific distribution for the variables.

**Disadvantages of cluster analysis:** It is a descriptive, **atheoretical** and **non-inferential** analysis. It is usually used as **an exploratory technique** that does not offer unique solutions; the solutions depend on the variables considered and on the cluster analysis method used.

**Applicability:** Cluster analysis techniques have traditionally been used in many disciplines, for example, Astronomy (clusters of galaxies, superclusters, etc.), Marketing (market segmentation, market research), Psychology, Biology (taxonomy, microarrays), Environmental Sciences (classification of rivers to establish typologies according to water quality), Sociology, Economics, Engineering, etc.

JAIN and DUBES (1988) define Cluster Analysis as a data exploration tool that is complemented with data visualization techniques.

**Summarizing**

- The objective of cluster analysis is to obtain groups of objects such that, on the one hand, the objects belonging to the same group are very similar to each other and, on the other, the objects belonging to different groups behave differently with respect to the analyzed variables.
- It is an exploratory technique since most of the time it does not use any type of statistical model to carry out the classification process.
- It is advisable always to be alert to the danger of obtaining, as a result of the analysis, not a *classification* of the data but a *dissection* of it into different groups. *The knowledge that the analyst has about the problem will decide which of the groups obtained are significant and which are not.*

Once the variables and the objects to be classified have been established, the next step is to establish a measure of proximity or distance between them that quantifies the degree of similarity between each pair of objects.

- **Measures of proximity, similarity or resemblance** measure the degree of resemblance between two objects in such a way that the higher (lower) their value, the greater (smaller) the degree of similarity between them and the higher (lower) the probability that the methods will assign them to the same group.
- **Measures of dissimilarity or distance** measure the distance between two objects in such a way that the higher (lower) their value, the more (less) different the objects are and the lower (higher) the probability that the classification methods will assign them to the same group.

**Classification methods**

Two large categories of cluster methods are distinguished: Hierarchical methods and Non-hierarchical methods.

**Hierarchical methods:** In each step of the algorithm only one object changes group, and the groups are nested within those of previous steps. Once an object has been assigned to a group, it never changes group again. The resulting classification has an increasing number of nested classes.

**Non-hierarchical or partitioning methods:** They start from an initial solution with a number of groups *g* fixed in advance and group the objects so as to obtain the *g* groups.

Hierarchical methods are further subdivided into **agglomerative** and **divisive**:

*Agglomerative hierarchical methods* start with as many clusters as objects to classify; at each step the distances between the existing groups are recalculated and the two most similar (least dissimilar) groups are joined. The algorithm ends with a single cluster containing all the elements. *Divisive hierarchical methods* start with a single cluster that encompasses all the elements, and at each step the most heterogeneous group is divided. The algorithm ends with as many clusters (of one element each) as objects were classified.

Regardless of the grouping process, there are different criteria for forming the clusters; all of them are based on a matrix of distances or similarities. For example, within the methods:

**Agglomerative hierarchies:**

- Single linkage method (single link or nearest neighbor)
- Complete linkage method (complete link or furthest neighbor)
- Average method between groups
- Centroid Method
- Median Method
- Ward’s method
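For reference, SciPy's `scipy.cluster.hierarchy.linkage` implements each of the agglomerative criteria listed above under a `method` keyword (the mapping below reflects SciPy's API, not anything stated in the original text):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1, 1], [2, 1], [4, 5], [5, 7], [7, 7]], dtype=float)

# SciPy `method` name -> criterion named in the list above
methods = {
    "single":   "single linkage / nearest neighbor",
    "complete": "complete linkage / furthest neighbor",
    "average":  "average linkage between groups",
    "centroid": "centroid method",
    "median":   "median method",
    "ward":     "Ward's method",
}

# Every criterion produces an (n-1) x 4 linkage matrix on the same data,
# but in general a different merge history.
results = {m: linkage(X, method=m) for m in methods}
for m, Z in results.items():
    print(f"{m:8s} first merge at distance {Z[0, 2]:.2f}")
```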

**Divisive or dissociative hierarchical:**

- Single linkage method
- Complete linkage method
- Average method between groups
- Centroid Method
- Median Method
- Association Analysis

**Process to follow in a cluster analysis**

**Step 1** : Selection of variables

The classification will depend on the chosen variables. Introducing irrelevant variables increases the possibility of errors. Some selection criteria must be used:

- Select only those variables that characterize the objects being grouped, bearing in mind the objectives of the cluster analysis to be carried out
- If the number of variables is very large, a Principal Components Analysis can be carried out beforehand and the set of variables summarized.

**Step 2** : Detection of outliers. Cluster analysis is very sensitive to the presence of objects that are very different from the rest (outliers).

**Step 3:** Select the way to measure the distance/dissimilarity between objects, depending on whether the data are quantitative or qualitative:

- Metric data: Correlation measures and distance measures
- Non-metric data: Measures of association.

**Step 4:** Standardization of the data (decide whether to work with the data as measured or standardized). The ordering of the similarities can change considerably with just a change of scale in one of the variables, so standardization should only be carried out when necessary.
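The effect of standardization mentioned in Step 4 can be illustrated with a short snippet (Python/NumPy, an assumed tool choice; the figures are invented for illustration):

```python
import numpy as np

# Two variables on very different scales: price in dollars, age in years
X = np.array([[20000.0, 3.0],
              [35000.0, 10.0],
              [27000.0, 5.0]])

# Without standardization, Euclidean distances are dominated by price.
# Z-score standardization gives every variable mean 0 and standard deviation 1:
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.round(3))
```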

**Step 5:** Obtaining the clusters and evaluation of the classification carried out

- Choose the algorithm for cluster formation ( hierarchical procedures or non-hierarchical procedures)
- Number of clusters: stopping rule. There are various methods for determining the number of clusters; some are based on reconstructing the original distance matrix, others on Kendall’s concordance coefficients, and others perform an analysis of variance between the groups obtained. There is no universally accepted criterion. Given that most statistical packages provide the agglomeration distances, that is, the distances at which each cluster is formed, one way to determine the number of groups is to locate the iterations of the method at which these distances make large jumps.
- Model adequacy. Check that the model has not defined a cluster with a single object, clusters of very unequal sizes, etc.

**Cluster Analysis in SPSS**

The SPSS program has three types of cluster analysis:

- Two-stage cluster analysis
- *K*-means cluster analysis
- *Hierarchical* cluster analysis

Each of these procedures uses a different algorithm to create clusters and contains options not available in the others.

**Two-stage cluster analysis.** The two-stage cluster procedure is designed for data mining, that is, for studies with a **large number of individuals**, which may cause classification problems for the other procedures. It can be used both when the **number of clusters is known a priori and when it is unknown**. It allows working with **variables of mixed type (qualitative and quantitative)**.

**K-means cluster analysis.** It is a non-hierarchical **(partitioning)** classification method. The number of clusters to be formed is fixed in advance (it **requires knowing the number of clusters a priori**) and the objects are grouped to obtain those groups. It starts from an initial solution and the objects are regrouped according to some optimality criterion. Non-hierarchical clustering can only be applied to **quantitative variables**. This procedure can analyze **large data files**.

**Hierarchical cluster analysis.** In the hierarchical classification method, in each step of the algorithm only one object changes group and the groups are nested within those of previous steps. Once an object has been assigned to a group, it never changes group again. The *hierarchical* method is suitable for determining the optimal number of clusters present in the data and their content. It is used when **the number of clusters is not known a priori and when the number of objects is not very large**. It allows working with variables of different kinds (**qualitative and quantitative**), as long as all the variables in a given analysis are of the same type: the Hierarchical Cluster Analysis procedure can analyze interval (continuous), count, or binary variables.

**The three methods of analysis that we are going to study are of the agglomerative type**, in the sense that, starting from the analysis of individual cases, they try to group cases until homogeneous groups or conglomerates are formed.

All cluster analysis **methods are exploratory data methods**

- For each dataset we can have different groupings, depending on the method
- The important thing is to identify a solution that tells us relevant things about the data.

In this practice we first study **Hierarchical Cluster Analysis**, followed by **K-Means Cluster Analysis**, and lastly **Two-Stage Cluster Analysis**.

**Hierarchical Cluster Analysis**

This procedure attempts to identify relatively homogeneous groups of **cases (or “variables”)** based on selected characteristics. It allows working with variables of mixed type **(qualitative and quantitative)**, and it is possible to analyze the raw variables or choose from a variety of standardization transformations. It is used when **the number of clusters is not known a priori** and when **the number of objects is not very large**. As we have said before, the objects in a hierarchical clustering analysis can be **cases or variables**, depending on whether you want to classify the cases or examine the relationships between the variables.

When working with variables that can be **quantitative, binary or count data (frequencies),** the scaling of the variables is an important aspect, since the different scales in which the variables are measured can affect the clustering solutions. If the variables show large differences in scaling (for example, one variable is measured in dollars and the other is measured in years), standardization should be considered. This can be done automatically by the Hierarchical Cluster Analysis procedure itself.

We will fundamentally study **Agglomerative Hierarchical Methods. **In these methods, various criteria are used to determine, at each step of the algorithm, which groups should be joined.

- *Single link or nearest neighbor:* Measures the proximity between two groups by the distance between their closest objects or the similarity between their most similar objects.
- *Complete link or furthest neighbor:* Measures the proximity between two groups by the distance between their most distant objects or the similarity between their least similar objects.
- *Average link between groups:* Measures the proximity between two groups by the average of the distances (or similarities) between objects of the two groups.
- *Average link within groups:* Measures the proximity between two groups by the average distance between the members of the union of the two groups.
- **Centroid and median methods:** Both measure the proximity between two groups by the distance between their centroids. They differ in how the centroids are calculated: the centroid method uses the means of all the variables, while in the median method the new centroid is the mean of the centroids of the groups that join.
- **Ward’s method:** At each step, joins the two groups whose merger produces the smallest increase in the within-group variance.
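In symbols, writing d(i, j) for the distance between objects i and j, the first of these criteria can be stated as follows (a standard formulation, added for reference):

```latex
\begin{aligned}
d_{\mathrm{single}}(A,B)   &= \min_{i \in A,\, j \in B} d(i,j) \\
d_{\mathrm{complete}}(A,B) &= \max_{i \in A,\, j \in B} d(i,j) \\
d_{\mathrm{average}}(A,B)  &= \frac{1}{|A|\,|B|} \sum_{i \in A} \sum_{j \in B} d(i,j) \\
d_{\mathrm{centroid}}(A,B) &= d(\bar{x}_A, \bar{x}_B)
\end{aligned}
```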

**Comparison of the various agglomerative methods**

- Single linkage leads to chained clusters
- Complete linkage leads to compact clusters
- Complete linkage is less sensitive to outliers than single linkage.
- Ward’s method and the average linkage method are the least sensitive to outliers.
- Ward’s method tends to form more compact clusters of equal size and shape compared to the average link.
- All methods except the centroid method satisfy the ultrametric inequality

**Decisions to be made in a cluster analysis**

- Choose the cluster method to use
- Decide whether to standardize the data
- Select the way to measure the distance/dissimilarity between individuals
- Choose a criterion to join groups, distance between groups.

**Process to follow in an Agglomerative Hierarchical Cluster Analysis**

1. **Selection of the variables**. It is recommended that the variables be of the same type (continuous, categorical, …).
2. **Detection of outliers**. Cluster analysis is very sensitive to the presence of objects that are very different from the rest (outliers).
3. **Choice of a measure of similarity between objects and obtaining the distance matrix.** Through these measurements, the initial clusters are determined.
4. **Find the most similar clusters.**
5. **Merge these two clusters into a new cluster** that has at least two objects, so that the number of clusters decreases by one.
6. **Calculate the distance between this cluster and the rest**. The different methods for calculating the distances between the clusters produce different groupings, so there is no single grouping.
7. **Repeat from step 4** until all the objects are in a single cluster.

The hierarchical clustering process can be summarized graphically using a tree-like representation called a **dendrogram**. Similar objects are linked, and their position in the diagram is determined by the level of similarity/dissimilarity between them.

We are going to carry out the described process using a simple example with 5 objects (A, B, C, D, E) and 2 variables (X _{1} , X _{2} ). The data are presented in the following table (the values can be read off from the centroid computations in the steps below):

| Object | X _{1} | X _{2} |
|--------|--------|--------|
| A      | 1      | 1      |
| B      | 2      | 1      |
| C      | 4      | 5      |
| D      | 5      | 7      |
| E      | 7      | 7      |

**Step 1 and 2** : To detect outliers we can plot the points on the plane

We do not detect outliers

**Step 3**: The distance measure that we are going to use between the objects is the Euclidean distance, whose expression is:

d(i, j) = √[ (x _{i1} − x _{j1})² + ⋯ + (x _{ip} − x _{jp})² ]

So, for example, the distance between object A and object B is:

d(A, B) = √[ (2 − 1)² + (1 − 1)² ] = 1

We calculate the Euclidean distance between all the points and obtain the following matrix of Euclidean distances between the objects
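Since the matrix itself is not reproduced here, the following sketch computes it, assuming the coordinates A(1,1), B(2,1), C(4,5), D(5,7), E(7,7) that can be read off from the centroid calculations in the later steps:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Coordinates inferred from the worked example's centroid computations
points = {"A": (1, 1), "B": (2, 1), "C": (4, 5), "D": (5, 7), "E": (7, 7)}
X = np.array(list(points.values()), dtype=float)

# Full symmetric matrix of pairwise Euclidean distances
D = squareform(pdist(X, metric="euclidean"))

labels = list(points)
print("   " + "     ".join(labels))
for i, name in enumerate(labels):
    print(name, " ".join(f"{D[i, j]:5.2f}" for j in range(len(labels))))
```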

We are performing the agglomerative hierarchical method, so initially we have 5 clusters, one for each of the objects to classify.

**Step 4** : We observe in the distance matrix which are the most similar objects, in our example they are A and B that have the smallest distance (1).

**Step 5:** We merge the most similar clusters by building a new cluster containing A and B. The clusters have been formed: AB, C, D and E.

**Step 6** : We calculate the distance between the AB cluster and the objects C, D and E. To measure this distance we take the centroid as representative of the AB cluster, that is, the point whose coordinates are the mean values of the components of the variables, that is, the coordinates of AB are: ((1+2)/2 , (1+1)/2) = (1.5, 1). The data table is as follows

**Step 7** : We repeat from step 4 until all the objects are in a single cluster

**Step 4:** From these data we calculate again the distance matrix

**Step 5:** The most similar clusters are D and E with a distance of 2, which are merged into a new cluster DE. Three clusters AB, C, DE have been formed

**Step 6:** We calculate the centroid of the new cluster which is the point (6,7) and we form the data table again

**Step 4: **From these data we calculate again the distance matrix

**Step 5:** The most similar clusters are C and DE with a distance of 2.8, which are merged into a new CDE cluster. Two clusters AB and CDE have been formed

**Step 6.** We calculate the centroid of the new cluster ((4+5+7)/3 , (5+7+7)/3) = (5.3, 6.3) and form the data table again

**Step 4** : From these data we calculate again the distance matrix

In this last step we have only two clusters, at a distance of approximately 6.5, which will be merged into a single cluster in the next step, finishing the process.
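Under the same assumed coordinates, SciPy's centroid linkage reproduces this merge sequence (A+B, then D+E, then C+DE, then the final merge):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# A, B, C, D, E with the coordinates assumed in the worked example
X = np.array([[1, 1], [2, 1], [4, 5], [5, 7], [7, 7]], dtype=float)

# Centroid method: groups are merged by the distance between their centroids
Z = linkage(X, method="centroid")

# Each row: (cluster_i, cluster_j, merge_distance, new_cluster_size)
print(Z.round(2))
```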

Next we will represent graphically the fusion process by means of a dendrogram

Below we show several solutions; to obtain them we cut the dendrogram with horizontal lines, for example:

The figure above shows 2 clusters: AB and CDE

In this figure the cut line shows us 3 clusters: AB, C and DE

The number of clusters depends on the place where we cut the dendrogram, therefore the decision about the optimal number of clusters is subjective. It is convenient to choose a number of clusters that we know how to interpret. To interpret the clusters we can use:

- ANOVA
- Factor analysis
- Discriminant analysis
- …
- Common sense

To decide the number of clusters, it can be very useful to represent the different steps of the algorithm and the distances at which the fusion of the clusters occurs. In the first steps the distance jump is small, while these differences increase in the successive steps. We can choose as the cutoff point the one where the most abrupt jumps begin to occur. In our example, the abrupt jump occurs between stages 3 and 4; therefore, two is the optimal number of clusters.
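This "largest jump" rule can be automated. The sketch below (Python/SciPy, with the example's assumed coordinates) picks the cut just before the biggest increase in merge distance:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1, 1], [2, 1], [4, 5], [5, 7], [7, 7]], dtype=float)
Z = linkage(X, method="centroid")

heights = Z[:, 2]                # distance at which each fusion occurs
jumps = np.diff(heights)         # increase from one stage to the next
stage = int(np.argmax(jumps))    # stage after which the largest jump occurs
n_clusters = len(X) - stage - 1  # cut the tree just before that jump

print("fusion distances:", heights.round(2))
print("suggested number of clusters:", n_clusters)
```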

**Remarks on Hierarchical Clustering**

- Performing hierarchical clustering on large data sets is problematic as a tree with more than 50 individuals is difficult to represent and interpret.
- A general disadvantage is the impossibility of reassigning individuals to clusters in cases where the classification was in doubt in the early stages of the analysis.
- Because cluster analysis involves choosing between different measures and procedures, it is often difficult to judge the reliability of the results.
- It is recommended to compare the results with different clustering methods. Similar solutions generally indicate the existence of a structure in the data. Very different solutions probably indicate a poor structure.
- Ultimately, the validity of the clusters is judged by a qualitative interpretation that may be subjective.
- The number of clusters depends on where we cut the dendrogram.

**Case study 1**

Automakers must tailor their product development and marketing strategies to each consumer group to increase sales and brand loyalty. The task of grouping cars according to variables that describe the consumption habits, sex, age, income level, etc. of customers can be largely automated using cluster analysis.

We want to carry out a market study on consumer preferences when purchasing a vehicle. For this we have a database, **sales_vehicles.sav**, of cars and trucks that contains a series of variables such as the manufacturer, model, sales, etc.

The **sales_vehicles.sav** data file contains 157 cases and is made up of the following variables:

String variables: **brand** (Manufacturer); **model**

Numeric variables: **sales** (in thousands); **resale** (Resale value in 4 years); **type** (Vehicle type: Values: {0, Car; 1, Truck}); **price** (in thousands); **motor** (Size of the motor); **CV** (Horsepower); **tread** (Tire base); **width** (Width); **long** (Length); **net_weight** (Net weight); **tank** (Fuel capacity); **mpg** (Consumption).

We want to carry out the market study **only on best-selling cars** and for this we are going to use the **Hierarchical Cluster Analysis procedure to group the best-selling cars based on their prices, manufacturer, model and physical properties.**

First of all we will restrict the data file only to cars of which at least 100,000 units were sold. To do this, we select the cases that meet that condition by choosing from the menus:

**Data/Select Cases…** Select **If condition is satisfied** and click **If…** Since the study is to be carried out only on cars of which at least 100,000 units were sold, in the *Select Cases: If* dialog box write the condition

**(type = 0) & (sales > 100)**

Click **Continue**. In the data editor, the cases excluded from the cluster analysis appear crossed out, and a new **filter_$** variable appears with two values (0 = “Not Selected” and 1 = “Selected”).

Once the sample with which we are going to work is selected, we use *Hierarchical Cluster Analysis* to group the best-selling cars based on their prices, manufacturer, model and physical properties. To run this cluster analysis, choose from the menus: **Analyze/ ****Classify/ Hierarchical Clusters…**

As can be seen in this figure, conglomerates can be made for objects (cases) or for variables (grouping variables by the similarity they present in the responses of the individuals) and the groups can be labeled with one of the variables in the file.

Enter in the **Variables** field: **price** (in thousands); **motor** (Size of the motor); **CV** (Horsepower); **tread** (Tire base); **width** (Width); **long** (Length); **net_weight** (Net weight); **tank** (Fuel capacity); **mpg** (Consumption). We also choose an identification variable to label the cases (an optional step); for this, enter the **model** variable in the **Label cases by** field.

*Note:* If **cases** are clustered, select at least **one numerical variable**. If **variables** are clustered, select at least **three numerical variables**.

Click **Method.**

**Clustering method.** Linkage methods use the proximity between pairs of individuals to join groups of individuals. There are several ways to measure the distance between clusters, which produce different clusters and different dendrograms. There is no criterion for selecting the best algorithm; the decision is normally subjective and depends on the method that best reflects the purposes of each particular study. The options available in SPSS are:

- **Between-groups linkage**. Intergroup mean.
- **Within-groups linkage**. Intragroup mean.
- **Nearest neighbor**. Single linkage (minimum jump). Uses the minimum distance/dissimilarity between two individuals of each group (useful for identifying outliers). Leads to chained clusters.
- **Furthest neighbor**. Complete linkage (maximum jump). Uses the maximum distance/dissimilarity between two individuals of each group. Leads to compact clusters.
- **Centroid clustering**. Uses the distance/dissimilarity between the centers of the groups.
- **Median clustering**. Uses the median of the distances/dissimilarities between all the individuals of the two groups.
- **Ward’s method**. Tends to form more compact clusters of equal size and shape, compared to the average linkage.

Ward’s method and the average linkage method are the least sensitive to outliers.

**Measure. The distance (dissimilarity or similarity)** between objects is a measure that allows us to establish the degree of similarity between said objects. Through this option we select the measure that we are going to use to see the resemblance between individuals with different distances depending on whether the variable is binary, frequency or interval. The initial choice of the set of measures that describe the elements to be grouped is essential to establish the possible clusters. The measures of distance or similarity that we use in agglomeration must be selected depending on the type of data. SPSS has the following measures:

- **Interval** (default option). Available options: Euclidean distance (not a scale-invariant distance), squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.
- **Counts.** Available options: chi-square measure (default) and phi-square measure.
- **Binary.** Available options: Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg’s *D*, Dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russell and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule’s *Y* and Yule’s *Q*.

**Transform values.** Most cluster methods are very sensitive to variables not all being measured in the same units and having very different variability. If we want all variables to have the same importance in the analysis, we can standardize the data. With this option you can standardize the data values, by case or by variable, before computing similarities (not available for binary data). The available standardization methods are:

- ***Z* scores**. Standardizes to Z scores, with mean 0 and standard deviation 1.
- **Range −1 to 1**. Divides each value of the element being standardized by the range of the values.
- **Range 0 to 1**. Subtracts the minimum value from each element being standardized and divides by the range.
- **Maximum magnitude of 1**. Divides each value of the element being standardized by the maximum of the values.
- **Mean of 1**. Divides each value of the element being standardized by the mean of the values.
- **Standard deviation of 1**. Divides each value of the variable or case by the standard deviation.
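Most of these transformations are one-liners; a sketch of the listed options in NumPy (variable names are mine, values invented for illustration):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # one variable's values across cases

z_scores  = (x - x.mean()) / x.std()             # Z scores: mean 0, sd 1
rng_m1_1  = x / (x.max() - x.min())              # Range -1 to 1: divide by range
rng_0_1   = (x - x.min()) / (x.max() - x.min())  # Range 0 to 1
max_mag_1 = x / np.abs(x).max()                  # Maximum magnitude of 1
mean_1    = x / x.mean()                         # Mean of 1
sd_1      = x / x.std()                          # Standard deviation of 1

print(z_scores.round(3), rng_0_1, max_mag_1, mean_1)
```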

You can choose how the standardization is performed: **By variable** or **By case**.

**Transform measures.** With this option you can transform the values generated by the distance measure; the transformations are applied after the distance measure has been computed. Available options: absolute values, change sign, and rescale to range 0–1.

In our example, since the variables in the analysis are scale variables measured in different units, the choice of the **Interval** measure (squared Euclidean distance) and **standardization** seems appropriate.

**We choose the Nearest neighbor** clustering method; this method is appropriate when you want to examine degrees of similarity, but it is poor at constructing distinct clusters. Therefore, after examining the results with this method, we should repeat the study with a different clustering method.

In the window of the previous figure, select as *Measure:* **Interval (squared Euclidean distance)**, as *Cluster method:* **Nearest neighbor**, and in *Transform values*, *Standardize:*, select **Z scores**.

Click **Continue and in the ***Hierarchical Cluster Analysis* dialog box click **Graphs…**

**Dendrogram.** It is a graphic representation in the form of a tree, in which the clusters are represented by vertical (or horizontal) lines and the fusion stages by horizontal (or vertical) lines. The separation between fusion stages is proportional to the distance between the groups fused at that stage. SPSS plots rescaled distances between groups, so they are difficult to interpret directly. Dendrograms can be used to assess the cohesion of the clusters formed and provide information on the appropriate number of clusters to keep.

**Icicle.** Displays an **icicle plot**, which includes all clusters or a specified range of clusters. Icicle plots display information about how the cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.

Select **Dendrogram** and, in *Icicle*, select **None**. Click **Continue** and **OK**. The following outputs are obtained:

The dendrogram is a graphical summary of the cluster solution. The cases (car brands) are located along the left vertical axis. The horizontal axis shows the distance between the groups when they joined (from 0 to 25).

Analyzing the classification tree to determine the number of groups is a subjective process. In general, one begins by looking for “gaps” between joints along the horizontal axis. From right to left there is a gap between 20 and 25, which divides the cars into two groups:

- One group consists of the models: Accord (8), Camry (11), Malibu (2), Grand Am (9), Impala (3), Taurus (5), Mustang (4) and
- the other group is made up of the models: Focus (6), Civic (7), Cavalier (1) and Corolla (10).

There is another gap between 15 and 20 that suggests 5 clusters: (8, 11); (2, 9); (3, 5); (4); (6, 7, 1, 10).

Between 10 and 15 there is another gap that suggests 6 clusters: (8, 11); (2, 9); (3, 5); (4); (6, 7, 1); (10).

The Cluster History is a table that gives a numerical summary of the solution of the cluster method used. The history shows the cases or clusters combined at each stage, the distances between the cases or clusters being combined (Coefficients), and the last stage of the clustering process at which each case (or variable) was joined to its corresponding cluster. When two clusters merge, SPSS assigns the new cluster the lowest of the labels of the merging clusters.

In our example, cases 8 and 11 (Accord (8), Camry (11)) are joined in the first stage because they have the smallest distance (1.260). The group created by 8 and 11 reappears in stage 7, where it joins cluster 2 (formed in stage 3). Therefore, at this stage the groups created in stages 1 and 3 are united, and the resulting group formed by 8, 11, 2 and 9 appears in the following stage, 8.

If there are many cases, the table is quite long, but it is usually easier to scan the Coefficients column for large distances than to analyze the dendrogram. When an unexpected jump in the distance coefficient is observed, the solution just before that gap indicates a good choice of clusters.

The largest differences in the Coefficients column occur between stages 5 and 6, indicating a 6-cluster solution ((8, 11); (2, 9); (3, 5); (4); (6, 7, 1); (10)), and between stages 9 and 10, indicating a 2-cluster solution. These agree with the results of the dendrogram.
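The “biggest jump in the coefficients” heuristic can be sketched in code. This Python/SciPy illustration uses invented one-dimensional data with two obvious groups, not the vehicle file; the same scan of the schedule works for any linkage output:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical data with two well-separated groups.
X = np.array([[0.0], [0.2], [0.4], [10.0], [10.3], [10.6]])
Z = linkage(X, method="single")

coeff = Z[:, 2]                  # the "Coefficients" column of the schedule
jumps = np.diff(coeff)           # gap between consecutive fusion stages
stage = int(np.argmax(jumps))    # last stage before the biggest jump
n = X.shape[0]
n_clusters = n - (stage + 1)     # clusters remaining just before that merge
print(n_clusters)                # → 2
```

Here the final merge bridges the two groups at a far larger distance than any earlier merge, so stopping before it yields the 2-cluster solution.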

In the *Hierarchical Cluster Analysis* dialog box, click **Statistics…**

**Distance matrix. **Provides the distances or similarities between the elements.

**Cluster membership.** Shows the cluster to which each case is assigned at one or more stages of the cluster combination. The available options are **Single solution** and **Range of solutions**.

In our study we chose **Cluster history**, **Distance matrix** and, under *Cluster Membership*, the option **Range of solutions** (minimum number of clusters 2 and maximum 6).

This table shows the cases that belong to each cluster. For example, if the solution is two clusters, the Cavalier, Focus, Civic, and Corolla cases form cluster 1 and the other cases form cluster 2.

This table shows the distance matrix, which provides the similarities between the cases.

The program allows you to save cluster membership as variables; these variables can be used in later analyses to explore other differences between the groups. To do so, in the *Hierarchical Cluster Analysis* dialog box, click **Save…**

This dialog presents the following options:

- **None** (default option): does not save cluster membership.
- **Single solution**: saves cluster membership for a given number of clusters.
- **Range of solutions**: saves cluster membership for a range of solutions.

In this study, we were unable to draw strong conclusions about the grouping of best-selling cars based on price, manufacturer, model, and physical properties. **This may be because we used Nearest Neighbor** as the cluster method, which, although advisable for examining degrees of similarity, is poor at constructing distinct groups. Therefore, we must redo the analysis using another clustering method.

**Case study 2**

Carry out the previous exercise using **Farthest Neighbor** as the *Cluster Method*.

To run a complete linkage cluster analysis (**Farthest Neighbor**), in the *Hierarchical Cluster Analysis* dialog box click **Method…**

In the window, select as *Cluster Method:* **Farthest Neighbor** and select **Z scores**. Click **Continue**.

In the *Hierarchical Cluster Analysis* dialog box, select **Graphs**. Within this option, select **Dendrogram** and, under *Icicle*: **None**. Press **Continue** and **OK**.

In the early stages, the Cluster History for the complete linkage (farthest neighbor) solution is similar to the single linkage (nearest neighbor) solution. In the final stages, however, the agglomeration histories are very different. The farthest neighbor clustering method yields a strong classification into two or three groups.

The first big difference is between stages 5 and 6 (6 clusters), the second between stages 8 and 9 (3 clusters), and the third between stages 9 and 10 (2 clusters).

This classification is reflected in the dendrogram.

- The initial division of the tree forms two groups, (8, 11, 1, 6, 7, 10) and (2, 9, 3, 5, 4). The first cluster contains the smaller cars and the second cluster the larger cars.
- The group of smaller cars can be divided into two subgroups, one of which is made up of the smallest and cheapest cars. This gives the following division into 3 clusters: (Accord (8), Camry (11), Cavalier (1)); (Focus (6), Civic (7), Corolla (10)), these three being smaller and cheaper than the previous three; and (Malibu (2), Grand Am (9), Impala (3), Taurus (5), Mustang (4)).
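As a sketch of how a tree like this is cut into a fixed number of groups, the snippet below (Python with SciPy, using made-up standardized features rather than the actual vehicle data) applies complete (farthest-neighbor) linkage and extracts a two-cluster solution:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical standardized features (weight, engine size), already z-scored.
X = np.array([
    [-1.0, -1.1], [-0.9, -1.0], [-1.1, -0.9],   # smaller cars
    [ 1.0,  1.1], [ 0.9,  1.0], [ 1.1,  0.9],   # larger cars
])

Z = linkage(X, method="complete")                 # farthest-neighbor linkage
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)
```

`fcluster` with `criterion="maxclust"` plays the role of choosing where to cut the dendrogram: asking for `t=3` here would reproduce the three-cluster reading described above for the real data.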

**Summary**

The complete linkage (farthest neighbor) solution is satisfactory because its groups are distinct, while the nearest neighbor solution is less conclusive. Using complete linkage (Farthest Neighbor) as the clustering method, you can assess the competition a vehicle will face at the design stage by entering its specifications as new cases in the dataset and re-running the analysis.

Next, we show the distance matrix and the membership clusters. To do so, in the *Hierarchical Cluster Analysis* dialog box, press **Statistics…** and make the following selection.

Press **Continue** and **OK**

**Case study 3**

A telecommunications company conducts a study aimed at reducing customer churn. It has a data file in which each case corresponds to a different customer and records demographic information and service usage. The goal is to segment the customer base by service-usage patterns. If customers can be categorized by usage, the company can offer them more attractive packages. **The variables that indicate use and non-use of the services are contained in the Telecomunicaciones1.sav** file.

**The telecommunications1.sav** data file contains **1000** cases and is made up of the following variables: region, permanence, age, marital_status, address, family_income, educational_level, employment, gender, n-persons_household, free_calls, equipment_rental, calling_card, wireless, long_distance_month, free_calls_month, equipment_month, card_month, wireless_month, multiple_lines, voice_message, paging_service, internet, caller_id, call_diversion, three_call, electronic_billing.

Use the **Hierarchical Cluster Analysis** procedure to study the relationships between different services.

To run the cluster analysis, choose from the menus: **Analyze/ Classify/ Hierarchical Clusters…**

Click **Reset** to restore default settings.

Select for **Variables** : Free Call Service, Equipment Rental, Calling Card Service, Wireless Service, Multiple Lines, Voicemail, Paging, Internet, Caller ID, Call Waiting, Call Forwarding, Three-Way Calling , Electronic billing

In *Cluster*, select **Variables**

Press **Graphs…**. Select **Dendrogram** and, under *Icicle*, select **None**

Click **Continue** and, in the *Hierarchical Cluster Analysis* dialog box, under *Cluster Method* select **Between-groups linkage**; under *Measure* select **Binary** and, within *Binary*, choose **Simple matching**. Since the variables in the analysis are indicators of whether a customer has a service, a binary measure must be chosen.

Press **Continue** and **OK**

With binary measures, the Coefficients column reports similarity measures, so the values of this coefficient decrease at each stage of the analysis. The results are difficult to interpret from the table, so we turn to the dendrogram.

The dendrogram shows that the usage patterns for **Multiple Lines** and **Calling Card Service** are different from the other services. These others fall into three groups. One group includes **wireless**, **paging_service** and **voice_message**. Another includes **equipment_rental**, **internet**, and **electronic_billing**. The last group contains the variables **free_calls**, **call_waiting**, **caller_id**, **call_diversion** and **three_call**. The **wireless** group is closer to the **internet** group than to the call-waiting group.

**Case study 4**

Carry out the study again with the **Jaccard** distance measure and compare the results.

To run a cluster analysis with the **Jaccard** distance measure, in the *Hierarchical Cluster Analysis* dialog box click **Method…** and in the corresponding window select **Jaccard** as the **Binary** measure.

Click **Continue** and **OK** in the Hierarchical Cluster Analysis dialog box.

Using the Jaccard measure, the three basic groups are the same, but the **wireless** group is closer to the call-waiting group than to the **internet** group.

The difference between the simple matching and Jaccard measures is that the Jaccard measure does not consider two services similar when a person subscribes to neither. That is, simple matching considers wireless and Internet services similar when a customer has both or neither, while Jaccard considers them similar only when a customer has both. This makes a difference in the cluster solutions because many customers have neither Internet nor wireless service. Therefore, these groups are more similar in the simple matching solution than in the Jaccard solution. The measure to use depends on the definition of “similar” that applies to the situation.
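The contrast between the two measures is easy to verify by hand. In this sketch (plain Python, with invented subscription indicators, not the actual file), the many joint absences inflate simple matching but leave Jaccard untouched:

```python
# Simple matching counts joint absences (0,0) as agreement; Jaccard ignores them.
def simple_matching(a, b):
    agree = sum(x == y for x, y in zip(a, b))
    return agree / len(a)

def jaccard(a, b):
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either if either else 0.0

# Hypothetical subscription indicators for two services across 8 customers.
wireless = [1, 0, 0, 0, 0, 0, 1, 0]
internet = [1, 0, 0, 0, 0, 0, 0, 0]

sm = simple_matching(wireless, internet)   # high: many shared non-subscribers
jc = jaccard(wireless, internet)           # low: only one shared subscriber
print(sm, jc)
```

Because six of the eight customers subscribe to neither service, simple matching rates the two services as very similar while Jaccard does not, which is exactly the behavior described above.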

**K-means cluster analysis**

K-means cluster analysis is a tool designed to assign cases to a fixed number of groups whose characteristics are not known in advance, based on a set of variables that must be **quantitative**. It is very useful when you want to classify a **large number of cases**. It is a method for **grouping cases** based on the distances between them over a set of quantitative variables. This agglomeration method **does not allow variables to be grouped**. The optimality goal pursued is to “maximize homogeneity within groups.”

It is the most commonly used method; it is easy to program and gives reasonable results. Its objective is to separate the observations into K clusters, so that each datum belongs to one group and only one. The algorithm iteratively searches for:

- The centroids (means, medians,… ) of the K clusters
- Assign each individual to a cluster.

The algorithm **requires you to specify the number of clusters**; you can also specify the initial cluster centers if you know this information in advance.

In this method, the measure of distance or similarity between the cases is calculated using the **Euclidean distance** . The type of scale of the variables is very important, if the variables have different scales (for example, one variable is expressed in dollars and another in years), the results could be misleading. In these cases, **standardization of variables** should be considered before performing *k* -means cluster analysis .

This procedure assumes that the appropriate number of clusters has been selected and that all relevant variables have been included. If an inappropriate number of clusters has been selected or relevant variables have been omitted, the results could be misleading.

There are several ways to implement it, but all of them basically follow these steps:

**Step 1.** Initial k clusters are taken at random and the centroids (means) of the clusters are calculated.

**Step 2.** The Euclidean distance of each observation to the cluster centroids is calculated and each observation is reassigned to the closest group, forming new clusters that replace the previous ones as a better approximation.

**Step 3.** The centroids of the new clusters are calculated.

**Step 4.** Steps 2 and 3 are repeated until a stopping criterion is satisfied; for example, no reallocation occurs, that is, the clusters obtained in two consecutive iterations are the same.
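The four steps above can be sketched directly in code. This is a minimal illustration (Python with NumPy, on toy data), not SPSS's implementation; the empty-cluster guard is an added safety detail:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch following the four steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: take k initial centroids at random from the data.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Step 2: assign each observation to its nearest centroid (Euclidean).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: recompute the centroid of each new cluster
        # (keeping the old center if a cluster ends up empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop when no centroid moves, i.e. no reallocation occurred.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated hypothetical groups of observations.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])
labels, centers = kmeans(X, k=2)
print(labels)
```

With well-separated groups like these, the loop settles in a handful of iterations; as noted below, with less separated data the result can depend on the initial centroids.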

The method is usually very sensitive to the given initial solution, so it is convenient to use a good one. One way to build it is through a classification obtained by a hierarchical algorithm.

As a clarification, we are going to carry out the procedure for the case of two variables X _{1} and X _{2} and four elements A, B, C, D. The data are the following:

We want to group these observations into **two clusters (k = 2)**

**Step 1.** The observations are arbitrarily grouped into two clusters (AB) and (CD) and the centroids of each cluster are calculated

**Step 2.** We calculate the Euclidean distance of each observation to the centroids of the clusters and reassign each of these observations to the closest cluster

Since *A* is closer to cluster (*AB*) than to cluster (*CD*), it is not reassigned.

Since *B* is closer to cluster *(CD)* than to cluster *(AB)* , it is reassigned to cluster *(CD)* forming cluster *(BCD)* .

Next, the centroids of the new clusters are calculated

**Step 3.** **Step 2** is repeated, calculating the distances of each observation to the centroids of the new clusters to see whether new reallocations occur.

Since there are no changes in the locations of the clusters, the solution for k=2 clusters is: Cluster 1: ( *A)* and Cluster 2: *(BCD* ).

There is the possibility of using this technique in an exploratory way, classifying the cases and iterating to find the location of the *centroids*, or only as a classification technique, classifying the cases from known *centroids*. When it is used as an exploratory technique, it is common for the ideal number of clusters to be unknown (as in the numerical example we have done), so it is convenient to repeat the analysis with a different number of clusters and compare the solutions obtained; in these cases, the hierarchical cluster analysis method can also be used with a subsample of cases.

Finally, it is necessary to interpret the classification obtained, which requires, first of all, sufficient knowledge of the problem analysed. One must be open to the possibility that not all groups obtained have to be significant. Some ideas that may be useful in interpreting the results are as follows:

- Carry out ANOVAS and MANOVAS to see which groups are significantly different and in which variables they are.
- Perform discriminant analysis.
- Carry out a Factorial or Principal Component Analysis to graphically represent the groups obtained and observe the differences between them.
- Calculate average profiles by groups and compare them.

Finally, it should be noted that it is an eminently exploratory technique whose purpose is to suggest ideas to the analyst when developing hypotheses and models that explain the behavior of the variables analyzed by identifying homogeneous groups of objects. The results of the analysis should be taken as a starting point in the elaboration of theories that explain said behavior.

A good cluster analysis is:

- Efficient. It uses as few clusters as possible.
- Effective. It captures all statistically and commercially important groupings. For example, a cluster with five customers may be statistically distinct but not very profitable.

**Case study 5**

We again use the **vehicle_sales.sav** data file, which contains sales estimates, list prices, and physical specifications for various vehicle makes and models. We want to carry out a market study to determine the likely competitors for our vehicles; to do so, we group the car makes according to the available data: customers' consumption habits, gender, age, income level, etc. Car companies adapt their product development and marketing strategies to each consumer group to increase sales and brand loyalty.

The **sales_vehicles.sav** data file contains 157 cases and is made up of the following variables:

String variables: **brand** (Manufacturer); **model**

Numeric variables: **sales** (in thousands); **resale** (resale value in 4 years); **type** (vehicle type: values {0, Car; 1, Truck}); **price** (in thousands); **motor** (engine size); **CV** (horsepower); **tread** (wheelbase); **width**; **long** (length); **net_weight**; **tank** (fuel capacity); **mpg** (consumption).

To obtain the K-means cluster analysis, from the menus choose:

**Analyze/Classify/ K-means clusters.**

The data file variable list provides a listing of all the variables in the file (numeric and string), but string variables can only be used to label cases.

*To obtain a K* -means cluster analysis :

- Select the numerical variables you want to use to differentiate the subjects and form the clusters, and move them to the **Variables** list.
- Optionally, select a variable to identify cases in the results tables and graphs, and move it to the **Label Cases by** list.

**Number of clusters.** In this text box, the two-cluster solution is selected by default. To request a larger number of clusters, enter the desired number in the box.

**Method** . The options in this section allow you to indicate whether or not the *centers* of the clusters should be estimated iteratively:

- **Iterate and classify**. The procedure estimates the *centers* iteratively and classifies the subjects with respect to the estimated *centers*.
- **Classify only**. Subjects are classified according to the initial *centers*, without updating their values iteratively. Checking this option disables the **Iterate…** button, preventing access to the iteration process specifications. This option is often used in conjunction with the **Centers** button.

**Cluster centers** . It shows two options:

- **Read initial from**. Allows the user to decide what values the cluster *centers* should take. The **External data file** button is used to indicate the name and path of the file containing the values of the *centers*; the name of the selected file is displayed next to the **Open dataset** button. It is usual to designate a file resulting from a previous run (saved with the **Write final as** option) and to use it in conjunction with the **Classify only** option in the *Method* section.
- **Write final as**. Saves the *centers* of the final clusters to an external data file, which can be used later for the classification of new cases. The **Data file** button allows you to assign a name and path to the destination file; the name of the selected file is displayed next to the **New dataset** button.

The data files used by these two options contain variables with special names that are automatically recognized by the system. Freely generating the structure of these files is not recommended; it is preferable to let the procedure itself generate them.

The file **sales_vehicles.sav** contains 157 cases. To make the graphical representation of the results more understandable, we will start by using only 20% of the cases in the sample.

To do this, in the main menu select: **Data/Select cases**

Select the **Random sample of cases** option and press **Sample…**

In the *Sample size* section, enter the value **20** in the text box for the option **Approximately p% of all cases**. Press **Continue** and **OK**.

Accepting these selections, the data file is filtered, leaving only 36 of the 157 existing cases available.

We begin by representing the distance between cases on two variables of interest; we have chosen the **weight** and **engine size** variables. To do this, select from the main menu **Graphs/Chart Builder…**

*In the Gallery* window , under *Choose from* , select **Scatter/…**

Drag the **Simple Scatter** chart to the *chart preview* window

Move the **weight** variable (total weight of the vehicle in kg) to the *x-axis* and **engine size** to the *y-axis*

Press **OK** and the following graph is displayed

The **weight** and **engine size** values of the 36 selected cases are represented in the scatter diagram. It can be seen that there is a relatively large group of vehicles with low **weight** and small **engine size**, and another, more dispersed group of vehicles with greater **weight** and a larger **engine**.

Double-click on the chart and, in the *Chart Editor* window, select **Elements/Show Data Labels…**

The two vehicles apparently furthest from each other (case 79 and case 131) have been identified by case number. The point cloud, therefore, suggests that there are at least two natural groups of cases.

To classify cases into two groups:

Select the **Classify only** option in the *K-Means Cluster Analysis* dialog box. Move the **motor** and **weight** variables to the **Variables** list.

Accepting these selections, the *Viewer* provides the results shown in the following tables

This table contains the *initial centers of the clusters* , that is, the values that correspond, in the two classification variables used, to the two cases that have been chosen as respective centers of the two requested clusters.

Selecting again, in the *Chart Editor* window, **Elements/Show data labels…**, and in **Properties** move **Net weight** and **Engine size** to the *Displayed* window:

Click **Apply**.

It is verified that the cases are 131 ( *Cluster 1* ) and 79 ( *Cluster 2* ), the same ones that have been identified in the scatter diagram.

Once the initial cluster centers have been selected, each case is assigned to the cluster whose center it is closest to, and an iterative *center* location process begins. In the first iteration, the cases are reassigned according to their distance from the new *centers* and, after the reassignment, the *center* values are updated. In the next iteration, the cases are reassigned again and the *center* values are updated once more, and so on.

This table shows the *final cluster centers*, i.e. the cluster *centers* after the iterative update process. Comparing the *final centers* (after iteration) in this table with the *initial centers* (before iteration) shows a displacement of the *center* of cluster 1 towards the top of the plane defined by the two classification variables, and a displacement of the *center* of cluster 2 towards the bottom.

This table is very useful for interpreting the composition of the clusters, since it summarizes the central values of each cluster on the variables of interest. The interpretation in our example is simple: **the first cluster is made up of vehicles with a large engine and high weight**, while **the second cluster is made up of vehicles with a small engine and low weight**.

Finally, this table reports on the **Number of cases** assigned to each cluster. In our example, the sizes of the clusters are quite different.

**To display the Iteration History, select the Iterate and Classify** option in the *K-Means Cluster Analysis* dialog box.

**The Iterate** sub-dialog allows you to control some details related to the iteration process used to calculate the final *centroids* . You can determine the maximum number of iterations or set a convergence criterion greater than zero and less than one.

**Maximum number of iterations. **Limits the number of iterations that the k-means algorithm can perform. The iteration process stops after the specified number of iterations, even if the convergence criterion is not satisfied. This number must be between 1 and 999.

**Convergence criterion.** Allows you to modify the convergence criterion used by SPSS to stop the iteration process. The value of this criterion defaults to zero, but can be changed by entering a different value in the text box. The entered value represents a proportion of the minimum distance between the initial cluster centers; as a proportion, it must be greater than or equal to zero and less than or equal to 1. For example, if a value of 0.02 is entered, the iteration process will stop when, between one iteration and the next, none of the *centers* moves a distance greater than two percent of the smallest distance between any of the *initial centers*. The iteration history table shows, in a footnote, the displacement obtained in the last iteration (whether or not the convergence criterion has been reached).

**Use running means.** Requests that cluster centers be updated as cases are assigned (the centroids are recalculated after each individual is assigned to a group): when a case is assigned to one of the clusters, the cluster center is immediately recalculated. When cluster-center updating is selected, the order of cases in the data file may affect the solution obtained.

If this option is not selected, the new centers of the final clusters will be calculated after the classification of all cases.
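The running-means behavior can be sketched as follows (Python/NumPy, with hypothetical cases and starting centers, not the vehicle data): each assignment immediately shifts the receiving center by an incremental-mean update, which is why the order of the cases can matter:

```python
import numpy as np

def assign_with_running_means(X, centers):
    """One classification pass with running-mean center updates (order-dependent sketch)."""
    centers = [np.asarray(c, dtype=float) for c in centers]
    counts = [1] * len(centers)   # sketch convention: each center starts as one case
    labels = []
    for x in X:
        # assign the case to the nearest current center (Euclidean distance)
        j = int(np.argmin([np.linalg.norm(x - c) for c in centers]))
        labels.append(j)
        counts[j] += 1
        # incremental mean: move the center toward the newly assigned case
        centers[j] += (x - centers[j]) / counts[j]
    return labels, centers

# Hypothetical cases and initial centers.
X = np.array([[0.0, 0.0], [10.0, 10.0], [0.5, 0.0], [9.5, 10.0]])
labels, centers = assign_with_running_means(X, [[1.0, 1.0], [9.0, 9.0]])
print(labels)   # → [0, 1, 0, 1]
```

Without running means, the same pass would classify every case against the fixed initial centers and recompute the centroids only once at the end.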

We keep the default maximum number of iterations, 10, select **Use running means**, and press **Continue** and **OK**.

We verify that convergence is not reached, so we increase the **maximum iterations to 20** and the following iteration history is displayed

This table summarizes the *history of iterations* (18 in our example) indicating the change (displacement) experienced by each *center* in each iteration. It can be seen that, as the iterations progress, the displacement of the *centers* becomes smaller and smaller, until reaching the 18th iteration, in which there is no longer any displacement.

The iteration process stops, by default, when 10 iterations are reached or when there is no change in the location of the *centroids* from one iteration to the next (change = 0). In our example, the process finished before reaching the maximum of 20 iterations because in the 18th there is no change.

**Case study 6**

Let us take another look at the **telecommunications1.sav** data file, about a telecommunications company conducting a study to reduce customer churn.

**The telecommunications1.sav** data file contains **1000** cases and is made up of the following variables: region, permanence, age, marital_status, address, family_income, educational_level, employment, gender, n-persons_household, free_calls, equipment_rental, calling_card, wireless, long_distance_month, free_calls_month, equipment_month, card_month, wireless_month, multiple_lines, voice_message, paging_service, internet, caller_id, call_diversion, three_call, electronic_billing.

It is convenient to unify the scale of the variables we are going to work with; for this reason we transform some of them by first taking the natural logarithm and then standardizing.

To compute the natural logarithm of the **long_distance_month** variable, select from the main menu **Transform/Compute Variable…**

In *Function group* choose **Arithmetic**, in *Functions and special variables* choose **Ln**, press the arrow, and in the *Numeric expression* window insert the variable **long_distance_month**.

In *Target variable*, enter the name of the new variable, **ln_long_distance**, and click **OK**.

In the *Data Editor,* a new variable has been created that contains the natural logarithms of the variable **long_distance_month** .

Next we standardize the new variable. To do this, select from the main menu **Analyze/Descriptive Statistics/Descriptives…**

Select the variable **ln_long_distance** and choose **Save standardized values as variables**. In the data editor, a new variable, **zln_long_distance**, has been created containing the standardized values of **ln_long_distance**.
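The same two-step transformation (natural log, then z-scores) can be sketched outside SPSS. The figures below are invented, not taken from the telecommunications file:

```python
import numpy as np

# Hypothetical monthly long-distance charges (right-skewed, as billing data often is).
long_distance_month = np.array([2.5, 5.0, 10.0, 40.0, 80.0])

ln_long_distance = np.log(long_distance_month)   # natural logarithm
zln_long_distance = (ln_long_distance - ln_long_distance.mean()) / ln_long_distance.std()

# The standardized variable has mean 0 and standard deviation 1.
print(zln_long_distance.mean(), zln_long_distance.std())
```

The log compresses the long right tail before standardization, so no single heavy user dominates the Euclidean distances that k-means will compute.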

In the data file **telecommunications1.sav**:

- Transform the following variables by natural logarithm and standardization: **long_distance_month, free_calls, equipment, cards, wireless**
- Transform by standardization the following variables: **multiple_lines, voice_message, search_service, internet, caller_identifier, waiting_call, call_diversion, three_call, electronic_billing**.

We call the new data file **telecommunications_2.sav**.

In this new data file, you are asked to:

- Use **K-Means Cluster Analysis** to find “similar” customer subsets.
- Save the **cluster membership** and the **distance from the cluster center** in new variables (for 4 clusters).
- Create a box plot with the **cluster membership** and **distance from the center** variables. Interpret this representation.

We first use **K-Means Cluster Analysis**

Select the variables to be used in the cluster analysis; in our case, from the **telecommunications_2.sav** data file, select as variables: **zln_long_distance, zln_free_calls, zln_teams, zln_cards, zln_wireless, z_lines_multiple, z_message_voice, z_service_search, z_internet, z_identifier_call, z_diversion_calls, z_call_to_three, z_electronic_invoicing**.

Specify the *Number of clusters* (this number must not be less than 2 or greater than the number of cases in the data file). We enter **3**.

*The k* -means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as many clustering algorithms do, such as the one used by the hierarchical clustering command.

Press **Iterate…** and put 20 as the maximum number of iterations

Click **Continue** and, in the *K-Means Cluster Analysis* dialog box, click **Options**. In this window select, under *Statistics*, **Initial cluster centers**, **ANOVA table** and **Cluster information for each case**, and under *Missing values* choose **Exclude cases pairwise**. There are many missing values because most clients do not subscribe to all services, so excluding cases pairwise maximizes the information that can be obtained from the data, at the cost of possibly biasing the results.

**The Options** dialog box allows you to obtain some additional statistics and control how missing values are treated.

**Statistics. **The options in this section allow you to select some additional statistics such as **Initial Cluster Centers** , **ANOVA Table and Cluster Information for each case.**

**Initial cluster centers.** First estimate of the variable means for each cluster. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. The initial cluster centers are used for a first classification and are updated from there. This option shows a table with the cases the procedure selects as *initial centers* of the clusters. It is selected by default.

**ANOVA table.** Displays an analysis-of-variance table that includes univariate F tests for each variable included in the analysis. The F tests are descriptive only and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

The analysis of variance is obtained by taking the groups defined by the *clusters* as a *factor* and each of the variables included in the analysis as a *dependent variable*. A footnote to the table states that the *F* statistics should be used only for descriptive purposes, since the cases have not been assigned randomly to the clusters but in a way that tries to optimize the differences between them. In addition, the critical levels associated with the *F* statistics should not be interpreted in the usual way, since the *K-means* procedure does not apply any correction to the error rate (that is, to the probability of committing Type I errors when many tests are carried out).

**Cluster information for each case.** Shows a list of all the cases used in the analysis, indicating for each case the final cluster to which it has been assigned and the Euclidean distance between the case and the center of the cluster used to classify it. It also shows the Euclidean distance between the centers of the final clusters. The cases are displayed in the same order as in the data file.

**Missing values.** The available options are: **Exclude cases listwise** or **Exclude cases pairwise**.

**Exclude cases listwise.** Excludes cases with missing values in any of the variables included in the analysis (default option).

**Exclude cases pairwise.** Assigns cases to clusters based on distances computed from all variables with no missing values.

Press **Continue** and **OK** and the following outputs are displayed

A table is shown with the well-spaced cases that the procedure has selected as initial centers of the three clusters.

The iteration history shows the progress of the clustering process at each step.

Convergence is achieved because there is little or no change in the cluster centers. By iteration 18, the maximum absolute coordinate change for any center is 0. The minimum distance between the initial centers is 6.611.

In the first 13 iterations, the cluster centers change considerably.

From iteration 14 the centers settle, and the last four iterations make only minor adjustments.

If the algorithm stops because the maximum number of iterations has been reached, that maximum may need to be increased, since otherwise the solution may be unstable.

For example, if the maximum number of iterations had been left at 10, the solution obtained would still be in flux.
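The convergence check described above is easy to automate. A minimal sketch with scikit-learn (invented toy data; the `n_iter_` attribute reports how many iterations the best run used):

```python
# Hedged sketch: verify that K-means converged before the iteration cap.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 1.0, size=(50, 2)) for m in (0.0, 8.0, 16.0)])

km = KMeans(n_clusters=3, max_iter=100, tol=1e-4, n_init=10,
            random_state=0).fit(X)

# If n_iter_ equals max_iter, the run stopped at the cap and the
# solution may still be "in flux" -- raise max_iter and refit.
print(f"iterations used: {km.n_iter_} (cap: 100)")
```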

The ANOVA table indicates which variables contribute the most to the cluster solution. Variables with large F values provide the greatest separation between the groupings. The F tests should only be used for descriptive purposes since the clusters have been chosen to maximize the differences between cases in different clusters. The critical levels are not corrected, so they cannot be interpreted as tests of the hypothesis that the centers of the clusters are equal.

The centers of the final clusters reflect the characteristics of the typical case of each cluster:

- Cluster 1’s customers tend to be large consumers who purchase a large number of services.
- Cluster 2 customers tend to be moderate spenders who buy caller services such as caller ID, call waiting, call forwarding,…
- Clients in cluster 3 tend to spend very little and do not buy many services.

This table shows the Euclidean distances between the centers of the final clusters. Greater distances between the groups correspond to greater differences between them.

Groups 1 and 3 are the most different; the distance between them is 4.863.

Group 2 is approximately equidistant from groups 1 and 3.

These relationships between the groups can also be intuited from the centers of the final clusters, but the interpretation is more complicated since the number of variables is large.
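The table of Euclidean distances between final centers can be recomputed directly from a fitted model. A sketch with scikit-learn and SciPy (toy data; names are illustrative):

```python
# Hedged sketch: distances between final cluster centers.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1.0, size=(40, 3)) for m in (0.0, 4.0, 8.0)])

centers = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).cluster_centers_
D = cdist(centers, centers)  # symmetric matrix, zeros on the diagonal

# The largest entry identifies the two most different clusters.
i, j = np.unravel_index(np.argmax(D), D.shape)
print(f"most different clusters: {i} and {j}, distance {D[i, j]:.3f}")
```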

The third cluster is the one with the largest number of assigned cases (482), which unfortunately is the least profitable group since, as we have seen previously, it is the group that spends the least and buys the fewest services. Perhaps it would be convenient to create a fourth cluster.

Now we are going to

- Save the **cluster membership** and the **distance from the cluster center** in new variables (for 4 clusters)
- Create a box plot with the **cluster membership** and **distance from center** variables. Interpret this representation

First of all, we are going to save the **cluster membership** and the **distance from the cluster center**. To do this, in the *K-Means Cluster Analysis* dialog box, we enter **4** in *Number of clusters*

Then we press **Save…** and choose **Cluster membership** and **Distance from cluster center**

Using this option, classification information for each case is saved in the data file as new variables so that they can be used in subsequent analyses.

**Cluster membership.** Creates a new variable in the *Data Editor* (named *QCL_#*) whose values indicate the final cluster to which each case belongs. The values of the new variable range from 1 to the number of clusters. This information is useful, for example, to construct a scatter plot with different markers for cases belonging to different clusters, or to carry out a discriminant analysis with the aim of identifying the relative importance of each variable in differentiating between clusters.

**Distance from cluster center.** Creates a variable in the *Data Editor* (named *QCL_#*) whose values indicate the Euclidean distance between each case and the *center* of the cluster to which it has been assigned.

Press **Continue** and **OK**. SPSS creates two new variables in the Data Editor: **QCL_1** (**cluster membership**) and **QCL_2** (**distance from cluster center**).
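What the Save option produces can be sketched outside SPSS as two derived columns. In this Python illustration (invented data; the `QCL_1`/`QCL_2` column names simply mimic SPSS's naming):

```python
# Hedged sketch: save cluster membership and distance-to-center
# as new variables, as the SPSS Save... option does.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(30, 2)), columns=["v1", "v2"])

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(df[["v1", "v2"]])

df["QCL_1"] = km.labels_ + 1  # SPSS numbers clusters from 1
df["QCL_2"] = np.linalg.norm(
    df[["v1", "v2"]].to_numpy() - km.cluster_centers_[km.labels_], axis=1)
print(df.head())
```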

With the new data file we are going to create the box plot. To do this, select **Graphs/Chart Builder…** from the main menu and, in the corresponding dialog,

Click the *Gallery* tab and select **Boxplot** from the list of chart types.

Drag and drop the **Simple Boxplot** icon onto the canvas at the top.

Drag and drop the variable **QCL_2** (distance from cluster center) onto the y-axis.

Drag and drop **QCL_1** (cluster membership) onto the x-axis.

Click **OK** to create the box plot.

This graph helps us find extreme values within the groups. We see that in group 2 there is great variability, but all the distances are within a reasonable range.
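The numbers that such a box plot encodes (per-cluster median, quartiles and whiskers of the distance to the center) can be tabulated directly. A sketch with pandas (invented data; column names mimic the SPSS variables):

```python
# Hedged sketch: the five-number summary behind a box plot of
# distance-to-center (QCL_2) by cluster membership (QCL_1).
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(80, 2)), columns=["v1", "v2"])
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(df)
df["QCL_1"] = km.labels_ + 1
df["QCL_2"] = np.linalg.norm(df[["v1", "v2"]].to_numpy()
                             - km.cluster_centers_[km.labels_], axis=1)

# Per-cluster five-number summary -- what each box and whisker encodes.
stats = df.groupby("QCL_1")["QCL_2"].describe()[["min", "25%", "50%", "75%", "max"]]
print(stats)
```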

**Case study 7**

- Apply K-means clustering for the case of 4 clusters
- Analyze the results obtained with 4 clusters and compare them with those obtained for the case of 3 clusters. Which solution do you think is the best?

In the outputs of the k-means cluster we have the following tables

This table shows that an important cluster is missing in the three-cluster solution.

The members of cluster 1 are prone to buy online and use long-distance and multiple-line services, while cluster 2 is a group with very few consumers. Both clusters largely come from group 3 of the three-cluster solution, which was a group of customers who spent very little and did not buy many services. Therefore, in the three-cluster solution, cluster 1, whose members are highly likely to purchase Internet-related services, would be lost as a distinct and possibly profitable group.

Groups 3 and 4 seem to correspond to groups 1 and 2 of the three-cluster solution.

Group 3 members are heavy consumers, and Group 4 members are likely to purchase services such as caller ID, call waiting, call forwarding and 3-way calling.

The distances between the groups have not changed greatly.

- Groups 1 and 2 are the most similar, which makes sense, since they were combined
- Groups 2 and 3 are the most dissimilar, since they represent the behavior of opposite expenses in the solution of three clusters
- Group 4 is equally similar to the other groups.

Almost 25% of the cases belong to the newly created group of “e-services” clients (cluster 1, with 236 cases), which is very significant for profits.

With k-means cluster analysis, customers were initially grouped into three groups. However, this solution was not very satisfactory, so the analysis was rerun with four groups, which gave better results. The three-cluster analysis missed a potentially profitable “Internet” cluster.

This example highlights the exploratory nature of cluster analysis, as it is impossible to determine the “best” number of clusters until the analysis has been run and solutions have been examined.
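This exploratory comparison of cluster counts can be automated with a quality index. The sketch below uses the silhouette score (SPSS's K-means procedure itself reports no such score; the index and the toy data are our choice for illustration):

```python
# Hedged sketch: compare k=3 vs k=4 with the silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(m, 0.8, size=(40, 2))
               for m in ((0, 0), (5, 0), (0, 5), (5, 5))])  # 4 real groups

for k in (3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.3f}")
```

With four genuine groups in the data, the four-cluster solution scores higher, mirroring the conclusion of the case study.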

**Two-stage cluster analysis**

The Two-Step Cluster Analysis procedure is an exploration tool designed to discover natural groupings (or clusters) in a data set that might not otherwise be detected. The algorithm used by this procedure includes a series of functions that make it different from traditional clustering techniques:

- **Treatment of categorical and continuous variables.** By assuming that the variables are independent, a joint multinomial-normal distribution can be applied to the continuous and categorical variables.
- **Automatic selection of the number of clusters.** By comparing the values of a model selection criterion across different clustering solutions, the procedure can automatically determine the optimal number of clusters.
- **Scalability.** By building a cluster features (CF) tree that summarizes the records, the two-step algorithm can analyze large data files.

**Data.**The two-stage procedure works with both continuous and categorical variables. The cases represent the objects to be clustered, and the variables represent the attributes on which the clustering is to be based.

The cluster features tree and the final solution may depend on the **order of the cases**. To minimize order effects, the cases should be arranged in random order. You can also obtain several different solutions with the cases arranged in different random orders to check the stability of a given solution. In situations where this is difficult because the file is too large, multiple runs can be replaced by a sample of cases arranged in different random orders.

**Assumptions. **The likelihood distance measure assumes that the variables in the cluster model are independent. Furthermore, each continuous variable is assumed to have a normal distribution and each categorical variable to have a multinomial distribution.

Internal empirical tests indicate that this procedure is quite robust against violations of both the independence assumption and the distributions, but it is still necessary to take into account the extent to which these assumptions are fulfilled.

The procedures that can be used to check whether these assumptions are met are as follows:

- Bivariate correlations to check the independence of two continuous variables.
- Contingency tables to check the independence of two categorical variables.
- The means procedure to check the independence between a continuous variable and a categorical variable.
- The Explore procedure for checking the normality of a continuous variable.
- The Chi-square test to check if a categorical variable follows a multinomial distribution.
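Two of these checks can be sketched quickly in Python with SciPy (the data here are synthetic, so the outcomes are known in advance; in practice you would run these on your own variables):

```python
# Hedged sketch: a bivariate correlation (independence of two continuous
# variables) and a chi-square goodness-of-fit test (multinomial check).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(size=500)
y = rng.normal(size=500)          # independent of x by construction
r, p = stats.pearsonr(x, y)
print(f"correlation r = {r:.3f} (p = {p:.3f})")

counts = np.array([160, 170, 170])        # observed category counts
expected = np.full(3, counts.sum() / 3)   # equal-probability multinomial
chi2, p = stats.chisquare(counts, expected)
print(f"chi-square = {chi2:.2f} (p = {p:.3f})")
```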

**Two-stage cluster procedure**

It is based on an algorithm that produces optimal results if all variables are independent, continuous variables are normally distributed, and categorical variables are multinomial. But it is a procedure that works reasonably well in the absence of these assumptions.

The final solution depends on the data input order, to minimize the effect we should order the file randomly.

**Procedure algorithm.** The two steps of this procedure can be summarized as follows:

**First step:** pre-cluster formation from the original cases. These are clusters of the original data that will be used, instead of the rows of the original file, to perform the hierarchical clustering in the second step. All cases belonging to the same pre-cluster are treated as a single entity.

The procedure starts with the construction of a cluster features (CF) tree. The tree begins by placing the first case at the root in a leaf node that contains variable information about that case. Each successive case is then added to an existing node or forms a new node, based on its similarity to existing nodes and using distance measures as the similarity criterion. A node that contains multiple cases holds a summary of information about those cases; the CF tree therefore provides a compact summary of the data file.

**Second step:** The leaf nodes of the CF tree are clustered using an agglomerative clustering algorithm, which can be used to produce a range of solutions. To determine the optimal number of clusters, each of these solutions is compared using Schwarz's Bayesian Criterion (BIC) or Akaike's Information Criterion (AIC) as the clustering criterion.
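The two-step idea can be sketched with scikit-learn stand-ins: Birch builds a CF tree just like the one described above, and BIC (computed here via a Gaussian mixture refit, our stand-in for SPSS's criterion) picks the number of clusters before the leaf entries are grouped agglomeratively. All data and parameter values below are illustrative:

```python
# Hedged sketch of the two-step procedure with scikit-learn stand-ins.
import numpy as np
from sklearn.cluster import Birch
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(m, 0.7, size=(100, 2)) for m in (0.0, 4.0, 8.0)])

# Step 1: Birch builds a CF tree and compresses the data into leaf
# entries (pre-clusters); n_clusters=None skips the global step.
pre = Birch(threshold=0.8, n_clusters=None).fit(X)
print(f"{len(pre.subcluster_centers_)} pre-clusters for {len(X)} cases")

# Step 2: pick the cluster count by BIC, then let Birch group its
# leaf entries agglomeratively into that many clusters.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(2, 6)}
best_k = min(bics, key=bics.get)
labels = Birch(threshold=0.8, n_clusters=best_k).fit_predict(X)
print(f"BIC selects k = {best_k}")
```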


**Case study 8**

We again use the **vehicle_sales.sav** data file, which contains hypothetical sales estimates, price lists, and physical specifications for various vehicle makes and models.

The **vehicle_sales.sav** data file is made up of the following variables:

String variables: **brand** (Manufacturer); **model**

Numeric variables: **sales** (in thousands); **resale** (resale value in 4 years); **type** (vehicle type: values {0, Car; 1, Truck}); **price** (in thousands); **motor** (engine size); **CV** (horsepower); **tread** (wheelbase); **width** (width); **long** (length); **net_weight** (net weight); **tank** (fuel capacity); **mpg** (consumption).

To obtain a two-stage cluster analysis, select **Analyze/Classify/Two-Stage Cluster…** from the main menu, and the *Two-Stage Cluster Analysis* dialog box is displayed.

**Distance measure.** Specifies the measure of similarity between two clusters.

- **Log-likelihood.** The likelihood measure places a probability distribution on the variables. Continuous variables are assumed to be normally distributed, while categorical variables are assumed to be multinomial. All variables are assumed to be independent. This distance measure should be used with mixed data. The distance between two clusters depends on the decrease in the log-likelihood when the two are combined into a single cluster.
- **Euclidean.** The Euclidean measure is the “straight line” distance between two clusters. **It can only be used when all variables are continuous**.

**Number of clusters.** This option allows you to specify the desired number of clusters or to let the algorithm select that number.

- **Automatically determine.** The procedure will automatically determine the “optimal” number of clusters, using the criterion specified under **Clustering Criterion**: Schwarz's Bayesian Criterion (BIC) or the Akaike Information Criterion (AIC). A positive integer specifies the maximum number of clusters that the procedure should consider.
- **Specify fixed number.** Allows you to fix the number of clusters in the solution.

**Count of continuous variables.** Provides a summary of the continuous variable standardization specifications made in the **Options** dialog box.

**Clustering criterion.** With this option the clustering algorithm determines the number of clusters. Either the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) can be specified.

In this case study, the **type** variable (type of vehicle) is selected for the **Categorical Variables** field, and **price**, **motor**, **CV**, **tread**, **width**, **long**, **net_weight**, **tank** and **mpg** for the **Continuous Variables** field.

Press **Options**.

**Treatment of outliers.** Allows you to treat outliers in a special way during cluster formation if the cluster features (CF) tree becomes full. The tree is considered full when it cannot accept any more cases in a leaf node and no leaf node can be split.

**Perform noise treatment** :

- If you select this option and the CF tree becomes full, it is regrown after placing the cases from sparse leaves in a “noise” leaf. A leaf is considered sparse if it contains fewer than a certain percentage of cases of the maximum leaf size. After the tree is regrown, outliers will be placed in the CF tree if possible; if not, they are discarded.
- If you do not select this option and the CF tree becomes full, it is regrown using a larger distance change threshold. After the final clustering, values that cannot be assigned to a cluster are considered outliers. The outlier cluster is assigned the identification number –1 and is not included in the count of the number of clusters.

**Memory allocation.** Specifies the maximum amount of memory in megabytes (MB) that the clustering algorithm can use. If the procedure exceeds this maximum, it will use the disk to store information that does not fit in memory. Specify a number greater than or equal to 4.

- Check with your system administrator for the maximum value you can specify on your system.
- If this value is too low, the algorithm may fail to obtain the correct or desired number of clusters.

**Standardization of variables.** The clustering algorithm works with standardized continuous variables. Any continuous variable that is not already standardized should be left in the **To be standardized** list. To save time and computation, you can move continuous variables that you have already standardized to the **Assumed standardized** list.

Click **Advanced>>**

**CF tree fit criteria.** The following clustering algorithm settings apply specifically to the cluster features (CF) tree and should be changed with care:

- **Initial distance change threshold.** The initial threshold used to grow the CF tree. If inserting a given case into a leaf of the CF tree would produce tightness below the threshold, the leaf is not split; if the tightness exceeds the threshold, the leaf is split.
- **Maximum number of branches (per leaf node).** The maximum number of child nodes a leaf can have.
- **Maximum tree depth.** The maximum number of levels a CF tree can have.
- **Maximum possible number of nodes.** Indicates the maximum number of CF tree nodes that the procedure can potentially generate, given by (b^(d+1) − 1) / (b − 1), where *b* is the maximum number of branches and *d* is the maximum tree depth. Note that an excessively large CF tree can exhaust system resources and negatively affect the procedure's performance. At a minimum, each node requires 16 bytes.
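The node bound above is just the size of a full b-ary tree of depth d (a geometric series). A quick check in Python, using SPSS's default settings of 8 branches and depth 3:

```python
# Quick check of the CF-tree node bound: (b**(d+1) - 1) / (b - 1).
def max_cf_nodes(b: int, d: int) -> int:
    """Size of a full b-ary tree of depth d: 1 + b + b**2 + ... + b**d."""
    return (b ** (d + 1) - 1) // (b - 1)

print(max_cf_nodes(8, 3))       # 1 + 8 + 64 + 512 = 585 nodes
print(max_cf_nodes(8, 3) * 16)  # minimum memory in bytes, at 16 per node
```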

**Updating of the cluster model. **This group allows you to import and update a cluster model generated in a previous analysis. The input file contains the CF tree in XML format. The model will then be updated with the existing data in the active file. Variable names must be selected in the main dialog in the same order as they were specified in the previous analysis. The XML file will remain unchanged unless the new model information is specifically written to the same file name.

If a cluster model update has been specified, the options pertaining to CF tree generation that were specified for the original model are used. Specifically, the saved model settings for distance measure, noise handling, memory allocation, and CF tree fit criteria are used, and any settings for these options specified in the dialog boxes are ignored.

*Note*** :** When performing a cluster model update, the procedure assumes that none of the selected cases in the active dataset were used to create the original cluster model. The procedure also assumes that the cases used in updating the model come from the same population as the cases used to create the model; that is, the means and variances of the continuous variables and the levels of the categorical variables are assumed to be the same in both sets of cases. If the “new” and “old” case sets are from heterogeneous populations, the Two-Step Cluster Analysis procedure should be run on the matched case sets to get the best results.

Press **Output**.

**Output viewer results.** Provides options for displaying results.

**Graphs and tables.** The graphical output includes a cluster quality plot, cluster sizes, variable importance, a cluster comparison grid, and cell information. The tables include a model summary and a cluster features grid.

**Evaluation fields.** Computes cluster data for variables that were not used in cluster creation. Evaluation fields can be displayed along with the model's input features by selecting them in the dialog box. Fields with missing values are ignored.

**Working data file.** Saves the variables to the active dataset.

**Create cluster membership variable.** This variable contains a cluster identification number for each case. The name of this variable is *tsc_n*, where *n* is a positive integer indicating the ordinal of the save operation on the active dataset performed by this procedure in a given session.

**XML files.** The final cluster model and the CF tree are two types of output that can be exported in XML format.

- **Export final model.** The final cluster model can be exported to the specified file in XML (PMML) format. This model file can be used to apply the model information to other data files for scoring.
- **Export CF tree.** This option allows you to save the current state of the cluster tree and update it later using new data.

Sales in thousands (**sales**) and 4-year resale value (**resale**) are selected as **Evaluation fields**:

These two chosen evaluation fields, **sales** and **resale**, were not used to create the clusters, but they will help to better understand the groups this procedure creates. Press **Continue** and **OK**. The following output is displayed

The model summary includes a table containing the following information:

- **Algorithm.** The clustering algorithm used, in this case “Two-step”.
- **Input features.** The number of variables used (continuous and categorical), also known as **inputs** or **predictors**.
- **Clusters.** The number of clusters in the solution.

It also displays a cluster quality chart: a silhouette measure of cluster cohesion and separation, shaded to indicate poor, fair or good results. This chart lets you quickly check whether the quality is insufficient, in which case you can return to the modeling node and change the cluster model settings to produce better results.

Results are rated poor, fair or good according to the work of Kaufman and Rousseeuw (1990) on the interpretation of cluster structures. A “good” result indicates that the data reflect reasonable or strong evidence that a cluster structure exists, according to the Kaufman and Rousseeuw assessment; a “fair” result indicates that the evidence is weak; and a “poor” result means that, by that assessment, there is no obvious evidence. The silhouette measure averages, over all records, (B − A) / max(A, B), where A is the distance from the record to its own cluster center and B is the distance from the record to the nearest cluster center to which it does not belong.

**A silhouette coefficient of 1** would imply that **all cases are located directly at the centers of their clusters**. **A value of −1** would mean that **all cases are located at the centers of some other cluster**. A **value of 0** means that, on average, **cases are equidistant between their own cluster center and the nearest other cluster center**.
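The center-based silhouette just defined is short to implement. A sketch in Python (this is the averaged (B − A)/max(A, B) variant described in the text, not the classic pairwise silhouette; the tiny data set is invented so the expected behavior is obvious):

```python
# Hedged sketch of the center-based silhouette measure.
import numpy as np

def center_silhouette(X, labels, centers):
    """Mean of (B - A) / max(A, B): A = distance to own center,
    B = distance to the nearest other center."""
    D = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    A = D[np.arange(len(X)), labels]
    D_other = D.copy()
    D_other[np.arange(len(X)), labels] = np.inf
    B = D_other.min(axis=1)
    return np.mean((B - A) / np.maximum(A, B))

# Tiny illustrative check: two tight groups around their own centers.
X = np.array([[0.0, 0], [0.1, 0], [5.0, 0], [5.1, 0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.05, 0.0], [5.05, 0.0]])
print(f"{center_silhouette(X, labels, centers):.3f}")  # close to 1
```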

In our example, the cluster model summary table indicates that 3 clusters were formed with the ten input features (categorical and numeric variables) selected, and the cluster quality chart indicates that the result is fair.

Double-clicking on the graph in the previous figure shows an interactive view of the model used in the **Model Viewer.**

The Cluster Viewer is made up of two panes, the main view on the left and the related or auxiliary view on the right.

**Main view** . There are two main views:

- Model summary (default).
- Clusters.

**Auxiliary View** . There are four related/auxiliary views:

- Importance of the predictor.
- Cluster sizes (default).
- Distribution of boxes.
- Cluster comparison.

By default, **Cluster sizes** are displayed using a pie chart with one sector per cluster showing its percentage frequency. Hovering the mouse over a sector shows the number of records assigned to that cluster.

40.8% (62) of the records were assigned to the first cluster, 25.7% (39) to the second, and 33.6% (51) to the third.

This output also shows a table with the following information about the size of the clusters:

- Smallest cluster size (count and percentage)
- The largest cluster size (count and percentage)
- The ratio of the size of the largest cluster to that of the smallest
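This size summary is easy to reproduce from a vector of cluster labels. The sketch below uses the counts reported in the text (62, 39 and 51 cases):

```python
# Hedged sketch: the cluster-size summary from a vector of labels.
import numpy as np

labels = np.array([1] * 62 + [2] * 39 + [3] * 51)  # sizes from the text

ids, counts = np.unique(labels, return_counts=True)
pct = 100 * counts / counts.sum()
for i, c, p in zip(ids, counts, pct):
    print(f"cluster {i}: {c} cases ({p:.1f}%)")
print(f"largest/smallest ratio: {counts.max() / counts.min():.2f}")
```

The percentages come out to 40.8%, 25.7% and 33.6%, matching the figures quoted above.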

In the **Main View** of the Cluster Viewer output, **Clusters** is selected on the toolbar and the following output is displayed

A table is displayed containing the following information:

- **Cluster.** Number of the cluster created by the algorithm.
- **Label.** Labels applied to each cluster (blank by default). Double-click the cell to enter a label describing the content of the cluster.
- **Description.** Description of the content of the cluster (blank by default). Double-click the cell to enter a description.
- **Size.** Contains the cluster case count, the size percentage, and a chart showing that percentage.
- **Inputs.** By default, individual inputs or predictors are listed in order of **overall importance**. The overall importance of a feature is indicated by the shading of the cell background: the darker the cell, the more important the feature. Hovering the mouse over a cell shows the feature name/label and the cell's importance value. This information depends on the feature type and the type of view. Features can also be sorted by **importance within the cluster**, by **name** and by **data order**, using the four **Sort Features** buttons on the toolbar.

In **the main view of the clusters** you can select several ways to display the cluster information:

- **Transpose clusters and features**
- **Sort features**
- **Sort clusters**
- **Cell contents**

*Transpose clusters and features*

By default, clusters appear as columns and features appear as rows. To invert this display, click the **Transpose Clusters and Inputs** button. This option is useful when there are many clusters, as it reduces the amount of horizontal scrolling required to display the data.

*Sort features*

- **Overall importance.** Features are sorted in descending order of overall importance, and the sort order is the same across clusters. If features tie on importance, they are listed in ascending order by name.
- **Importance within the cluster.** Features are sorted with respect to their importance for each cluster. If features tie on importance, they are listed in ascending order by name. With this option the sort order usually varies across clusters.
- **Name.** Features are sorted by name in alphabetical order.
- **Data order.** Features are sorted by their order in the dataset.

*Sort clusters*

The three **Sort Clusters** buttons on the toolbar allow you to sort the clusters by descending size (default), by name in alphabetical order, or, if labels have been created, by alphanumeric label order. Features with the same label are sorted by cluster name. If clusters are sorted by label and a cluster’s label is changed, the sort order is automatically updated.

*Cell contents*

The four **Cells** buttons on the toolbar allow you to change the display of cell contents and evaluation fields.

- **Cells show cluster centers.** By default, cells display feature names/labels and the central tendency for each cluster/feature combination. The mean is shown for continuous fields and the mode, with the category percentage, for categorical fields.
- **Cells show absolute distributions.** Shows feature names/labels and the absolute distributions of the features within each cluster. For categorical features, the display shows overlaid bar charts with the categories ordered by ascending data value. For continuous features, the display shows a smooth density plot that uses the same endpoints and intervals for each cluster. The dark red display shows the cluster distribution, while the paler one represents the overall data.
- **Cells show relative distributions.** Shows feature names/labels and relative distributions in the cells. The displays are generally similar to those for absolute distributions, except that relative distributions are shown instead. The dark red display shows the cluster distribution, while the paler one represents the overall data.
- **Cells show basic information.** If there are many clusters, it can be difficult to see all the detail without scrolling. To reduce the amount of scrolling, select this view to switch to a more compact version of the table.

The Clusters table output shows, by default, the clusters ordered from left to right by size, the order being 1, 3, 2.

The cluster means suggest that the clusters are well separated.

- In cluster 1, 98.4% of the vehicles are automobiles and are characterized by being cheap, small and consuming little fuel.
- In cluster 2, 100% of the vehicles are trucks (column 3) and are characterized by being moderately priced, heavy and have a large fuel tank.
- In cluster 3, 100% of the vehicles are cars and are characterized by being expensive, large and moderately efficient in fuel consumption.

Hovering the mouse over the cells shows information about that feature

Cluster means (for continuous variables) and modes (for categorical variables) are useful, but they only give information about the cluster centers. To visualize the distribution of values for each cluster field, click **Cells show absolute distributions** on the toolbar of the clusters output, and the following output is displayed

The graph shows a certain overlap between clusters 1 and 3 (columns 1 and 2) in the characteristics of **Net weight, Engine size** and **Fuel capacity** . Regarding clusters 3 and 2 (columns 2 and 3) we observe that the vehicles with the largest engine size are in cluster 3 while the vehicles with the most fuel capacity belong to cluster 2.

The evaluation fields information is displayed by clicking the **Display** button on the toolbar of the clusters output and selecting **Evaluation fields** in the resulting window

Press **OK**, and the evaluation fields are shown below the cluster table

The distribution of sales is similar across clusters, except that clusters 1 and 2 (columns 1 and 3) have longer tails than cluster 3 (column 2).

The distribution of the 4-year resale value is very similar in the three clusters; however, clusters 2 and 3 (columns 2 and 3) are centered on a higher value than cluster 1, and as regards asymmetry, cluster 3 has a longer tail than either of the other two clusters.

The **Display** window is used to control which elements of the clusters are shown:

- **Inputs.** Selected by default. To hide all input features, clear the check box.
- **Evaluation fields.** Select the evaluation fields (fields not used to build the cluster model but sent to the model viewer to evaluate the clusters) that you want to display; none are displayed by default. *Note*: this check box is unavailable if no evaluation field is available.
- **Cluster descriptions.** Selected by default. To hide all cluster description cells, clear the check box.
- **Cluster sizes.** Selected by default. To hide all cluster size cells, clear the check box.
- **Maximum number of categories.** Specifies the maximum number of categories to display in charts of categorical features. The default value is 20.

Another way to compare the clusters is the chart obtained by selecting the three cluster columns with Ctrl+Click and choosing **Cluster Comparison** in the **View** drop-down menu on the Auxiliary View toolbar, which displays the following output

This chart shows the features in the rows and clusters in the columns. This visualization helps to better understand the factors of which the clusters are composed, and allows to see the differences between the clusters not only with respect to the general data, but also with respect to each other.

Pressing Ctrl+Click in the previous figure selects the clusters to be displayed, at the top of the cluster columns (in the main Clusters panel).

*Note*: Up to five clusters can be selected for display. Clusters are displayed in the order in which they were selected, while the order of the fields is determined by the **Sort features by** option. If **Importance within cluster** is selected there, the fields are always sorted by overall importance.

This output also shows some graphs of the general distributions of each characteristic:

- Categorical features appear as dot plots, where the size of the dot indicates the most frequent category (mode) for each cluster (per feature).
- Continuous features are displayed as box plots, showing overall medians and interquartile ranges.

The output of the figure above shows box plots for the selected clusters:

- In the continuous characteristics there are square dot markers and horizontal lines that indicate the median and interquartile range of each cluster.
- Each cluster is represented by a different color, which is displayed at the top of the view.

These graphs confirm, in general, what we have seen in the previous ones. This graph can be especially useful when there are many clusters and you want to compare only some of them.

It is interesting to study the predictor importance for the clusters. To do so, select **Predictor Importance** on the Auxiliary View toolbar, and the following chart is obtained

This plot shows the relative importance of each feature in estimating the model.

Article last updated: September 15, 2023