Description
Keywords: Clustering; Sparsity; Group Sparsity; Mixed Data.
When databases include many observations described by a reasonable number of variables, the K-means framework provides useful methods for analyzing, summarizing and representing the data. The task consists in extracting a reduced number of "representatives" from the whole database or, equivalently, in partitioning the set of observations into clusters.
For given representatives, the clusters are defined by the nearest-neighbor rule: each representative is associated with the cluster consisting of all observations in the database that are closer to this representative than to any other. Conversely, if the clusters are constructed first, the representatives are simply their centers of gravity.
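Alternating these two steps is the classical K-means (Lloyd) iteration. A minimal sketch in Python; the function name `kmeans_step` and the toy data are illustrative:

```python
import numpy as np

def kmeans_step(X, centers):
    """One K-means iteration: assign each observation to its nearest
    representative, then move each representative to the center of
    gravity (mean) of its cluster."""
    # assignment step: nearest-neighbor rule
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    # update step: centers of gravity of the new clusters
    new_centers = np.vstack([X[labels == k].mean(axis=0)
                             for k in range(len(centers))])
    return labels, new_centers

# two well-separated groups of points, K = 2
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels, centers = kmeans_step(X, X[[0, 2]])
```

Iterating `kmeans_step` until the assignments stop changing yields the usual K-means algorithm.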
The goal is to obtain clusters that are as homogeneous as possible (with a small intra-cluster dispersion) and well separated from each other (with a large inter-cluster dispersion). These two requirements are equivalent, since for a fixed database the sum of the intra- and inter-cluster dispersions is constant, equal to the total dispersion. The algorithms are unsupervised since the clusters are not known in advance; only the number K of clusters must be fixed beforehand.
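This decomposition of the total dispersion can be checked numerically; a small sketch on a toy one-dimensional dataset with two clusters:

```python
import numpy as np

# total dispersion = intra-cluster + inter-cluster dispersion
X = np.array([[0.0], [1.0], [4.0], [5.0]])
labels = np.array([0, 0, 1, 1])

g = X.mean(axis=0)                       # overall center of gravity
total = ((X - g) ** 2).sum()             # total dispersion

within = between = 0.0
for k in (0, 1):
    Xk = X[labels == k]
    gk = Xk.mean(axis=0)                 # cluster center of gravity
    within += ((Xk - gk) ** 2).sum()     # intra-cluster dispersion
    between += len(Xk) * ((gk - g) ** 2).sum()  # inter-cluster dispersion
```

Since `total` is fixed by the data, minimizing `within` over partitions is the same problem as maximizing `between`.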
However, when the observations are described by a large number of variables (including the case where the number of variables exceeds the number of observations), it is likely that some variables are redundant or noisy and do not contribute to the definition of the clusters.
In this case, reducing the number of variables by selecting those that contribute most to the clustering improves both the interpretability and the performance of the algorithm. This is the purpose of the so-called "sparse" clustering methods.
The best-known sparse method is Sparse K-means, due to Witten & Tibshirani (2010). It modifies the K-means algorithm by introducing a weight for each variable and a penalty term, such that maximizing the weighted and penalized inter-cluster dispersion drives the weights of some variables to 0; those variables are thus "removed" from the analysis.
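The weight update behind this scheme amounts to soft-thresholding the per-variable inter-cluster dispersions under an L2 constraint (unit norm), an L1 bound s, and non-negativity. The sketch below illustrates the idea; the function names and the bisection on the threshold are an illustrative implementation, not the authors' code, and it assumes s >= 1:

```python
import numpy as np

def soft_threshold(b, delta):
    return np.maximum(b - delta, 0.0)

def sparse_weights(b, s):
    """Sketch of the sparse K-means weight update: maximize
    sum_j w_j * b_j subject to ||w||_2 <= 1, ||w||_1 <= s, w_j >= 0,
    where b_j is the inter-cluster dispersion of variable j.
    The solution soft-thresholds b and rescales to unit L2 norm;
    the threshold delta is found here by bisection (requires s >= 1)."""
    w = soft_threshold(b, 0.0)
    w = w / np.linalg.norm(w)
    if w.sum() <= s:                 # L1 bound inactive: no sparsity
        return w
    lo, hi = 0.0, b.max()
    for _ in range(60):              # bisection on the threshold delta
        delta = (lo + hi) / 2
        w = soft_threshold(b, delta)
        w = w / np.linalg.norm(w)
        if w.sum() > s:
            lo = delta               # threshold too small: more shrinkage
        else:
            hi = delta
    return w

# informative variables have large b_j, noise variables small b_j
b = np.array([10.0, 8.0, 0.5, 0.3])
w = sparse_weights(b, s=1.2)         # small s: noise weights set to 0
```

A small L1 bound s removes the low-dispersion (noise) variables exactly, which is what makes the method "sparse" rather than merely a soft reweighting.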
The extension proposed by Chavent, Lacaille, Mourer & Olteanu (2020) adapts the Witten-Tibshirani algorithm to the case where the variables are themselves structured a priori into groups. The authors define a method called Group Sparse K-means, which allows the deletion of complete groups of variables, and extend it to the case where one wishes to delete not only entire groups but also individual variables within groups (Sparse Group Sparse K-means).
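One standard way to keep or drop whole groups at once is group-wise soft-thresholding in the spirit of the group lasso. The sketch below illustrates that general idea only, not the authors' exact update; the sqrt(group size) scaling is a common convention that prevents large groups from being favoured merely for containing more variables:

```python
import numpy as np

def group_soft_threshold(b, groups, lam):
    """Group-wise soft-thresholding (group-lasso style sketch):
    each group's vector of per-variable dispersions b_g is shrunk
    towards 0 as a whole, so a group is either kept (all weights
    scaled by the same factor) or dropped entirely."""
    b = np.asarray(b, dtype=float)
    groups = np.asarray(groups)
    w = np.zeros_like(b)
    for g in set(groups.tolist()):
        idx = np.where(groups == g)[0]
        bg = b[idx]
        norm = np.linalg.norm(bg)
        # threshold scaled by sqrt(group size)
        shrink = max(1 - lam * np.sqrt(len(idx)) / norm, 0.0) if norm > 0 else 0.0
        w[idx] = shrink * bg     # group dropped entirely when shrink == 0
    return w

# group 0 carries signal, group 1 is noise: only group 0 survives
b = [3.0, 2.0, 0.1, 0.2]
groups = [0, 0, 1, 1]
w = group_soft_threshold(b, groups, lam=0.5)
```

Combining a group-level penalty of this kind with the within-group L1 penalty of Sparse K-means gives the "sparse group" behaviour described above: entire groups can vanish, and individual variables can vanish inside surviving groups.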
The algorithms mentioned so far are only suitable for numerical data. To handle mixed data, i.e. when both numerical and categorical variables are present, the authors propose a pre-processing of the data, which then allows the use of clustering algorithms designed for numerical variables.
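A common pre-processing recipe of this kind, shown here only as an illustration (the authors' exact transformation may differ), standardizes the numerical columns and one-hot encodes the categorical ones, so that each level of a categorical variable becomes a 0/1 indicator column:

```python
import numpy as np

def preprocess_mixed(num, cat):
    """Sketch of a mixed-data pre-processing step: centre and scale
    each numerical column to unit variance, and one-hot encode each
    categorical column (one 0/1 indicator per level). The result is
    an all-numeric matrix usable by K-means-type algorithms."""
    num = np.asarray(num, dtype=float)
    num_std = (num - num.mean(axis=0)) / num.std(axis=0)
    blocks = [num_std]
    for col in zip(*cat):                    # iterate categorical columns
        levels = sorted(set(col))
        onehot = np.array([[1.0 if v == lvl else 0.0 for lvl in levels]
                           for v in col])
        blocks.append(onehot)
    return np.hstack(blocks)

# 3 observations: 2 numerical variables, 2 categorical variables
num = [[170.0, 65.0], [180.0, 80.0], [160.0, 55.0]]
cat = [["a", "x"], ["b", "x"], ["a", "y"]]
Z = preprocess_mixed(num, cat)
```

A natural consequence is that each categorical variable becomes a group of indicator columns, which is exactly the a-priori group structure that Group Sparse K-means exploits.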
The Group Sparse K-means method is implemented in the R package vimpclust. We will present an application to real data, as well as a visual method for selecting the sparsity parameter.