Cluster analysis partitions marks in the view into clusters, where the marks within each cluster are more similar to one another than they are to marks in other clusters.
Watch a Video: To see related concepts demonstrated in Tableau, watch Clustering, a 2-minute free training video. Use your tableau.com account to sign in.

For an example that demonstrates the process of creating clusters with sample data, see Example: Create clusters using World Economic Indicators data.

Create clusters

To find clusters in a view in Tableau, follow these steps.
Note: You can move the cluster field from Color to another shelf in the view. However, you cannot move the cluster field from the Filters shelf to the Data pane. To rename the resulting clusters, you must first save the cluster as a group. For details, see Create a group from cluster results and Edit clusters.

Clustering constraints

Clustering is available in Tableau Desktop, but is not available for authoring on the web (Tableau Server, Tableau Online). Clustering is also not available when any of the following conditions apply:
When any of those conditions apply, you will not be able to drag Clusters from the Analytics pane to the view. In addition, the following field types cannot be used as variables (inputs) for clustering:
Edit clusters

To edit an existing cluster, right-click (Control-click on a Mac) a Clusters field on Color and select Edit clusters.
To change the names used for each cluster, you will first need to drag the Clusters field to the Data pane and save it as a group. For details, see Create a group from cluster results. Right-click the cluster group and select Edit Group to make changes to each cluster.
Select a cluster group in the list of Groups and click Rename to change the name.
Create a group from cluster results

If you drag a cluster to the Data pane, it becomes a group dimension in which the individual members (Cluster 1, Cluster 2, etc.) contain the marks that the cluster algorithm has determined are more similar to each other than they are to other marks. After you drag a cluster group to the Data pane, you can use it in other worksheets.

Drag Clusters from the Marks card to the Data pane to create a Tableau group:
After you create a group from clusters, the group and the original clusters are separate and distinct. Editing the clusters does not affect the group, and editing the group does not affect the cluster results. So if you rename the saved cluster group, that renaming is not applied to the original clustering in the view.

The group has the same characteristics as any other Tableau group. It is part of the data source. Unlike the original clusters, you can use the group in other worksheets in the workbook. See Correct Data Errors or Combine Dimension Members by Grouping Your Data.

Constraints on saving clusters as groups

You will not be able to save Clusters to the Data pane under any of the following circumstances:
Refit saved clusters

When you save a Clusters field as a group, it is saved with its analytic model. You can use your cluster groups in other worksheets and workbooks, but they don't refresh automatically. In this example, a saved cluster group and its analytic model have been applied to a different worksheet. As a result, some of the marks are not yet included in the clustering (indicated by gray marks).
If the underlying data changes, you can use the Refit option to refresh and recompute the data for a saved clusters group. To refit a saved cluster
How clustering works

Cluster analysis partitions the marks in the view into clusters, where the marks within each cluster are more similar to one another than they are to marks in other clusters. Tableau distinguishes clusters using color.

Note: For additional insight into how clustering works in Tableau, see the blog post Understanding Clustering in Tableau 10.

The clustering algorithm

Tableau uses the k-means algorithm for clustering. For a given number of clusters k, the algorithm partitions the data into k clusters. Each cluster has a center (centroid) that is the mean value of all the points in that cluster. K-means locates centers through an iterative procedure that minimizes distances between individual points in a cluster and the cluster center. In Tableau, you can specify a desired number of clusters, or have Tableau test different values of k and suggest an optimal number of clusters (see Criteria used to determine the optimal number of clusters).

K-means requires an initial specification of cluster centers. Starting with one cluster, the method chooses a variable whose mean is used as a threshold for splitting the data in two. The centroids of these two parts are then used to initialize k-means to optimize the membership of the two clusters. Next, one of the two clusters is chosen for splitting, and a variable within that cluster is chosen whose mean is used as a threshold for splitting that cluster in two. K-means is then used to partition the data into three clusters, initialized with the centroids of the two parts of the split cluster and the centroid of the remaining cluster. This process is repeated until a set number of clusters is reached.

Tableau uses Lloyd’s algorithm with squared Euclidean distances to compute the k-means clustering for each k. Combined with the splitting procedure to determine the initial centers for each k > 1, the resulting clustering is deterministic, with the result dependent only on the number of clusters.
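The iteration that Lloyd's algorithm performs can be sketched in a few lines of plain Python. This is an illustrative sketch only, not Tableau's implementation: the function names `squared_distance` and `lloyd_kmeans` are hypothetical, the initial centers are passed in by the caller rather than derived from the splitting procedure described above, and the `max_iter` cap is an assumption added for safety.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(
    points: List[Point],
    initial_centers: List[Point],
    max_iter: int = 100,  # assumption: safety cap, not a documented Tableau setting
) -> Tuple[List[int], List[List[float]]]:
    """Alternate between assignment and center updates until labels stop changing."""
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no mark changed cluster, so the clustering has converged
        labels = new_labels
        # Update step: each center becomes the mean of the points assigned to it.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


if __name__ == "__main__":
    marks = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    labels, centers = lloyd_kmeans(marks, [(0.0, 0.0), (10.0, 10.0)])
    print(labels, centers)
```

Because the procedure has no random component, the same points and the same initial centers always produce the same clustering, which is the sense in which a fixed initialization makes the result deterministic.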
The algorithm starts by picking initial cluster centers:
It then partitions the marks by assigning each to its nearest center:
Then it refines the results by computing new centers for each partition by averaging all the points assigned to the same cluster:
It then reviews the assignment of marks to clusters and reassigns any marks that are now closer to a different center than before. The clusters are redefined and marks are reassigned iteratively until no more changes occur.

Criteria used to determine the optimal number of clusters

Tableau uses the Calinski-Harabasz criterion to assess cluster quality. The Calinski-Harabasz criterion is defined as

$$\frac{\mathrm{SSB}}{\mathrm{SSW}} \times \frac{N - k}{k - 1}$$
where SSB is the overall between-cluster variance, SSW the overall within-cluster variance, k the number of clusters, and N the number of observations.

The greater the value of this ratio, the more cohesive the clusters (low within-cluster variance) and the more distinct/separate the individual clusters (high between-cluster variance).

Since the Calinski-Harabasz index is not defined for k=1, it cannot be used to detect one-cluster cases.

If a user does not specify the number of clusters, Tableau picks the number of clusters corresponding to the first local maximum of the Calinski-Harabasz index. By default, k-means will be run for up to 25 clusters if the first local maximum of the index is not reached for a smaller value of k. You can set a maximum value of 50 clusters.

Note: If a categorical variable (that is, a dimension) has more than 25 unique values, then Tableau will disregard that variable when computing clusters.

What values get assigned to the "Not Clustered" category?

When there are null values for a measure, Tableau assigns values for rows with null to a Not Clustered category. Categorical variables (that is, dimensions) that return * for ATTR (meaning that not all values are identical) are also not clustered.

Scaling

Tableau scales values automatically so that columns having a larger range of magnitudes don’t dominate the results. For example, an analyst could be using inflation and GDP as input variables for clustering, but because GDP values are in trillions of dollars, this could cause the inflation values to be almost completely disregarded in the computation. Tableau uses a scaling method called min-max normalization, in which the values of each variable are mapped to a value between 0 and 1 by subtracting the variable's minimum and dividing by its range.

The Describe Clusters dialog box provides information about the models that Tableau computed for clustering. You can use these statistics to assess the quality of the clustering.
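The min-max scaling and the Calinski-Harabasz criterion described above can both be computed directly from their definitions. The following is a minimal sketch, not Tableau's code: the function names are hypothetical, SSB and SSW are taken to be the between-group and within-group sums of squared Euclidean distances, and the handling of a constant column and of a zero SSW are assumptions made here so the example runs on edge cases.

```python
from typing import List, Sequence

Point = Sequence[float]


def min_max_scale(values: List[float]) -> List[float]:
    """Map values to [0, 1] using (x - min(x)) / (max(x) - min(x))."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def _sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points: List[Point]) -> List[float]:
    return [sum(col) / len(points) for col in zip(*points)]


def calinski_harabasz(points: List[Point], labels: List[int]) -> float:
    """Return (SSB / SSW) * ((N - k) / (k - 1)) for a labeled set of points."""
    n = len(points)
    clusters = sorted(set(labels))
    k = len(clusters)
    if k < 2 or n <= k:
        raise ValueError("the index needs at least 2 clusters and N > k")
    overall = _mean(points)
    ssb = 0.0
    ssw = 0.0
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        center = _mean(members)
        # Between-group term is weighted by the number of points in the cluster.
        ssb += len(members) * _sq_dist(center, overall)
        ssw += sum(_sq_dist(p, center) for p in members)
    if ssw == 0:
        return float("inf")  # assumption: perfectly tight clusters score infinitely high
    return (ssb / ssw) * ((n - k) / (k - 1))


if __name__ == "__main__":
    print(min_max_scale([2.0, 4.0, 6.0]))
    marks = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(calinski_harabasz(marks, [0, 0, 1, 1]))
```

To compare candidate values of k, one would scale each variable first, cluster the scaled data for each k, and evaluate the index on each result.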
When the view includes clustering, you can open the Describe Clusters dialog box by right-clicking Clusters on the Marks card (Control-clicking on a Mac) and choosing Describe Clusters. The information in the Describe Clusters dialog box is read-only, though you can click Copy to Clipboard and then paste the screen contents into a writeable document.

The Summary tab identifies the inputs that were used to generate the clusters and provides some statistics that characterize the clusters.

Inputs for Clustering

Variables: Identifies the fields Tableau uses to compute clusters. These are the fields listed in the Variables box in the Clusters dialog box.

Level of Detail: Identifies the fields that are contributing to the view’s level of detail, that is, the fields that determine the level of aggregation. For details, see How dimensions affect the level of detail in the view.

Scaling: Identifies the scaling method used for pre-processing. Normalized is currently the only scaling method Tableau uses. The formula for this method, also known as min-max normalization, is (x - min(x))/(max(x) - min(x)).

Summary Diagnostics

Number of Clusters: The number of individual clusters in the clustering.

Number of Points: The number of marks in the view.

Between-group sum of squares: A metric quantifying the separation between clusters as a sum of squared distances between each cluster’s center (average value) and the center of the data set, weighted by the number of data points assigned to the cluster. The larger the value, the better the separation between clusters.

Within-group sum of squares: A metric quantifying the cohesion of clusters as a sum of squared distances between the center of each cluster and the individual marks in the cluster. The smaller the value, the more cohesive the clusters.

Total sum of squares: Totals the between-group sum of squares and the within-group sum of squares.
The ratio (between-group sum of squares)/(total sum of squares) gives the proportion of variance explained by the model. Values are between 0 and 1; larger values typically indicate a better model. However, you can increase this ratio just by increasing the number of clusters, so it could be misleading if you compare a five-cluster model with a three-cluster model using just this value.

Cluster Statistics

For each cluster in the clustering, the following information is provided.

# Items: The number of marks within the cluster.

Centers: The average value within each cluster (shown for numeric items).

Most Common: The most common value within each cluster (only shown for categorical items).

Describe Clusters – Models Tab

Analysis of variance (ANOVA) is a collection of statistical models and associated procedures useful for analyzing variation within and between observations that have been partitioned into groups or clusters. In this case, analysis of variance is computed per variable, and the resulting analysis of variance table can be used to determine which variables are most effective for distinguishing clusters.

Relevant analysis of variance statistics for clustering include:

F-statistic

The F-statistic for one-way, or single-factor, ANOVA measures how much of a variable's variation is explained by the clusters. It is the ratio of the between-group variance (between-group sum of squares divided by its degrees of freedom) to the within-group variance (within-group sum of squares divided by its degrees of freedom). The larger the F-statistic, the better the corresponding variable is at distinguishing between clusters.

p-value

The p-value is the probability that the F-distribution of all possible values of the F-statistic takes on a value greater than the actual F-statistic for a variable. If the p-value falls below a specified significance level, then the null hypothesis (that the individual elements of the variable are random samples from a single population) can be rejected. The degrees of freedom for this F-distribution are (k - 1, N - k), where k is the number of clusters and N is the number of items (rows) clustered.
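As a rough illustration of the per-variable analysis of variance, the sketch below computes an F-statistic for one numeric variable given the cluster label of each row. It assumes the standard one-way ANOVA definition (between-group mean square divided by within-group mean square) with degrees of freedom (k - 1, N - k); the function name `anova_f` is hypothetical and this is not Tableau's implementation. The p-value is omitted because the Python standard library has no F-distribution.

```python
from typing import Dict, Hashable, List, Tuple


def anova_f(values: List[float], labels: List[Hashable]) -> Tuple[float, int, int]:
    """Return (F, model degrees of freedom, error degrees of freedom) for one variable."""
    n = len(values)
    groups: Dict[Hashable, List[float]] = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    k = len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 clusters and more rows than clusters")

    grand_mean = sum(values) / n
    # Between-group sum of squares: variation of cluster means around the overall mean.
    between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Within-group sum of squares: variation of rows around their own cluster mean.
    within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )

    df_model, df_error = k - 1, n - k
    mean_square_model = between / df_model
    mean_square_error = within / df_error
    if mean_square_error == 0:
        # Assumption: no within-cluster variation is reported as an infinite F.
        return float("inf"), df_model, df_error
    return mean_square_model / mean_square_error, df_model, df_error


if __name__ == "__main__":
    f, df1, df2 = anova_f([1.0, 2.0, 3.0, 7.0, 8.0, 9.0], [0, 0, 0, 1, 1, 1])
    print(f, df1, df2)
```

Running the function once per input variable gives a simple ranking: variables with larger F values separate the clusters more strongly.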
The lower the p-value, the more the expected values of the elements of the corresponding variable differ among clusters.

Model Sum of Squares and Degrees of Freedom

The Model Sum of Squares is the ratio of the between-group sum of squares to the model degrees of freedom. The between-group sum of squares is a measure of the variation between cluster means. If the cluster means are close to each other (and therefore close to the overall mean), this value will be small. The model has k-1 degrees of freedom, where k is the number of clusters.

Error Sum of Squares and Degrees of Freedom

The Error Sum of Squares is the ratio of the within-group sum of squares to the error degrees of freedom. The within-group sum of squares measures the variation between observations within each cluster. The error has N-k degrees of freedom, where N is the total number of observations (rows) clustered and k is the number of clusters. The Error Sum of Squares can be thought of as the overall Mean Square Error, assuming that each cluster center represents the "truth" for each cluster.

Example: Create clusters using World Economic Indicators data

The Tableau clustering feature partitions marks in the view into clusters, where the marks within each cluster are more similar to one another than they are to marks in other clusters. This example shows how a researcher might use clustering to find an optimal set of marks (in this case, countries/regions) in a data source.

The objective

As life expectancy increases around the world, and as older people remain more active, senior tourism can be a lucrative market for companies that know how to find and appeal to potential customers. The World Indicators sample data set that comes with Tableau contains the kind of data that might help companies identify the countries or regions where there are enough of the right kind of customers.
Finding the right countries/regions

Here is an example of how Tableau clustering could help such a company identify the countries/regions where a senior tourism business could succeed. Imagine you are the analyst. Here is how you might proceed.