""" ==================================================================== K-means clustering and vector quantization (:mod:`scipy.cluster.vq`) ==================================================================== Provides routines for k-means clustering, generating code books from k-means models, and quantizing vectors by comparing them with centroids in a code book. .. autosummary:: :toctree: generated/ whiten -- Normalize a group of observations so each feature has unit variance vq -- Calculate code book membership of a set of observation vectors kmeans -- Performs k-means on a set of observation vectors forming k clusters kmeans2 -- A different implementation of k-means with more methods -- for initializing centroids Background information ====================== The k-means algorithm takes as input the number of clusters to generate, k, and a set of observation vectors to cluster. It returns a set of centroids, one for each of the k clusters. An observation vector is classified with the cluster number or centroid index of the centroid closest to it. A vector v belongs to cluster i if it is closer to centroid i than any other centroids. If v belongs to i, we say centroid i is the dominating centroid of v. The k-means algorithm tries to minimize distortion, which is defined as the sum of the squared distances between each observation vector and its dominating centroid. Each step of the k-means algorithm refines the choices of centroids to reduce distortion. The change in distortion is used as a stopping criterion: when the change is lower than a threshold, the k-means algorithm is not making sufficient progress and terminates. One can also define a maximum number of iterations. Since vector quantization is a natural application for k-means, information theory terminology is often used. The centroid index or cluster index is also referred to as a "code" and the table mapping codes to centroids and vice versa is often referred as a "code book". 
The result of k-means, a set of centroids, can be
used to quantize vectors. Quantization aims to find an encoding of
vectors that reduces the expected distortion.

All routines expect obs to be an M by N array where the rows are
the observation vectors. The codebook is a k by N array where the
i'th row is the centroid of code word i. The observation vectors
and centroids have the same feature dimension.

As an example, suppose we wish to compress a 24-bit color image
(each pixel is represented by one byte for red, one for blue, and
one for green) before sending it over the web.  By using a smaller
8-bit encoding, we can reduce the amount of data by two
thirds. Ideally, the colors for each of the 256 possible 8-bit
encoding values should be chosen to minimize distortion of the
color. Running k-means with k=256 generates a code book of 256
codes, which fills up all possible 8-bit sequences.  Instead of
sending a 3-byte value for each pixel, the 8-bit centroid index
(or code word) of the dominating centroid is transmitted. The code
book is also sent over the wire so each 8-bit code can be
translated back to a 24-bit pixel value representation. If the
image of interest was of an ocean, we would expect many 24-bit
blues to be represented by 8-bit codes. If it was an image of a
human face, more flesh tone colors would be represented in the
code book.

"""
from __future__ import division, print_function, absolute_import

import warnings
import numpy as np
from collections import deque
from scipy._lib._util import _asarray_validated
from scipy._lib.six import xrange
from scipy.spatial.distance import cdist

from . import _vq

__docformat__ = 'restructuredtext'

__all__ = ['whiten', 'vq', 'kmeans', 'kmeans2']


class ClusterError(Exception):
    pass


def whiten(obs, check_finite=True):
    """
    Normalize a group of observations on a per feature basis.

    Before running k-means, it is beneficial to rescale each feature
    dimension of the observation set with whitening.
    Each feature is divided by its standard deviation across all
    observations to give it unit variance.

    Parameters
    ----------
    obs : ndarray
        Each row of the array is an observation.  The
        columns are the features seen during each observation.

        >>> #         f0    f1    f2
        >>> obs = [[  1.,   1.,   1.],  #o0
        ...        [  2.,   2.,   2.],  #o1
        ...        [  3.,   3.,   3.],  #o2
        ...        [  4.,   4.,   4.]]  #o3

    check_finite : bool, optional
        Whether to check that the input matrices contain only finite numbers.
        Disabling may give a performance gain, but may result in problems
        (crashes, non-termination) if the inputs do contain infinities or NaNs.
        Default: True

    Returns
    -------
    result : ndarray
        Contains the values in `obs` scaled by the standard deviation
        of each column.

    Examples
    --------
    >>> import numpy as np
    >>> from scipy.cluster.vq import whiten
    >>> features  = np.array([[1.9, 2.3, 1.7],
    ...                       [1.5, 2.5, 2.2],
    ...                       [0.8, 0.6, 1.7,]])
    >>> whiten(features)
    array([[ 4.17944278,  2.69811351,  7.21248917],
           [ 3.29956009,  2.93273208,  9.33380951],
           [ 1.75976538,  0.7038557 ,  7.21248917]])

    """
    obs = _asarray_validated(obs, check_finite=check_finite)
    std_dev = obs.std(axis=0)
    zero_std_mask = std_dev == 0
    if zero_std_mask.any():
        # Avoid division by zero: constant columns are left unchanged.
        std_dev[zero_std_mask] = 1.0
        warnings.warn("Some columns have standard deviation zero. "
                      "The values of these columns will not change.",
                      RuntimeWarning)
    return obs / std_dev


def vq(obs, code_book, check_finite=True):
    """
    Assign codes from a code book to observations.

    Assigns a code from a code book to each observation. Each
    observation vector in the 'M' by 'N' `obs` array is compared with the
    centroids in the code book and assigned the code of the closest
    centroid.

    The features in `obs` should have unit variance, which can be
    achieved by passing them through the whiten function.  The code
    book can be created with the k-means algorithm or a different
    encoding algorithm.

    Parameters
    ----------
    obs : ndarray
        Each row of the 'M' x 'N' array is an observation.  The columns are
        the "features" seen during each observation.
        The features must be whitened first using the whiten function or
        something equivalent.
    code_book : ndarray
        The code book is usually generated using the k-means algorithm.
        Each row of the array holds a different code, and the columns are
        the features of the code.

         >>> #              f0    f1    f2   f3
         >>> code_book = [
         ...             [  1.,   2.,   3.,   4.],  #c0
         ...             [  1.,   2.,   3.,   4.],  #c1
         ...             [  1.,   2.,   3.,   4.]]  #c2

    check_finite : bool, optional
        Whether to check that the input matrices contain only finite numbers.
        Disabling may give a performance gain, but may result in problems
        (crashes, non-termination) if the inputs do contain infinities or NaNs.
        Default: True

    Returns
    -------
    code : ndarray
        A length M array holding the code book index for each observation.
    dist : ndarray
        The distortion (distance) between the observation and its nearest
        code.

    Examples
    --------
    >>> from numpy import array
    >>> from scipy.cluster.vq import vq
    >>> code_book = array([[1.,1.,1.],
    ...                    [2.,2.,2.]])
    >>> features  = array([[  1.9,2.3,1.7],
    ...                    [  1.5,2.5,2.2],
    ...                    [  0.8,0.6,1.7]])
    >>> vq(features,code_book)
    (array([1, 1, 0],'i'), array([ 0.43588989,  0.73484692,  0.83066239]))

    """
    obs = _asarray_validated(obs, check_finite=check_finite)
    code_book = _asarray_validated(code_book, check_finite=check_finite)
    ct = np.common_type(obs, code_book)

    c_obs = obs.astype(ct, copy=False)
    c_code_book = code_book.astype(ct, copy=False)

    # The C implementation only handles single and double precision floats;
    # everything else falls back to the pure Python version.
    if np.issubdtype(ct, np.float64) or np.issubdtype(ct, np.float32):
        return _vq.vq(c_obs, c_code_book)
    return py_vq(obs, code_book, check_finite=False)


def py_vq(obs, code_book, check_finite=True):
    """ Python version of vq algorithm.

    The algorithm computes the Euclidean distance between each
    observation and every frame in the code_book.

    Parameters
    ----------
    obs : ndarray
        Expects a rank 2 array. Each row is one observation.
    code_book : ndarray
        Code book to use. Same format as obs. Should have the same number of
        features (i.e., columns) as obs.
    check_finite : bool, optional
        Whether to check that the input matrices contain only finite numbers.
        Disabling may give a performance gain, but may result in problems
        (crashes, non-termination) if the inputs do contain infinities or NaNs.
        Default: True

    Returns
    -------
    code : ndarray
        code[i] gives the label of the ith observation, i.e., its code is
        code_book[code[i]].
    min_dist : ndarray
        min_dist[i] gives the distance between the ith observation and its
        corresponding code.

    Notes
    -----
    This function is slower than the C version but works for
    all input types.  If the inputs have the wrong types for the
    C versions of the function, this one is called as a last resort.

    It is about 20 times slower than the C version.

    """
    obs = _asarray_validated(obs, check_finite=check_finite)
    code_book = _asarray_validated(code_book, check_finite=check_finite)

    if obs.ndim != code_book.ndim:
        raise ValueError("Observation and code_book should have the same rank")

    if obs.ndim == 1:
        obs = obs[:, np.newaxis]
        code_book = code_book[:, np.newaxis]

    dist = cdist(obs, code_book)
    code = dist.argmin(axis=1)
    min_dist = dist[np.arange(len(code)), code]
    return code, min_dist

# py_vq2 was equivalent to py_vq
py_vq2 = np.deprecate(py_vq, old_name='py_vq2', new_name='py_vq')


def _kmeans(obs, guess, thresh=1e-5):
    """ "raw" version of k-means.

    Returns
    -------
    code_book
        the lowest distortion codebook found.
    avg_dist
        the average distance an observation is from a code in the book.
        Lower means the code_book matches the data better.

    See Also
    --------
    kmeans : wrapper around k-means

    Examples
    --------
    Note: not whitened in this example.

    >>> from numpy import array
    >>> from scipy.cluster.vq import _kmeans
    >>> features  = array([[ 1.9,2.3],
    ...                    [ 1.5,2.5],
    ...                    [ 0.8,0.6],
    ...                    [ 0.4,1.8],
    ...                    [ 1.0,1.0]])
    >>> book = array((features[0],features[2]))
    >>> _kmeans(features,book)
    (array([[ 1.7       ,  2.4       ],
           [ 0.73333333,  1.13333333]]), 0.40563916697728591)

    """

    code_book = np.asarray(guess)
    diff = np.inf
    # Holds the average distances of the two most recent iterations.
    prev_avg_dists = deque([diff], maxlen=2)
    while diff > thresh:
        # compute membership and distances between obs and code_book
        obs_code, distort = vq(obs, code_book, check_finite=False)
        prev_avg_dists.append(distort.mean(axis=-1))
        # recalc code_book as centroids of associated obs
        code_book, has_members = _vq.update_cluster_means(obs, obs_code,
                                                          code_book.shape[0])
        # drop centroids that no observation was assigned to
        code_book = code_book[has_members]
        diff = prev_avg_dists[0] - prev_avg_dists[1]

    return code_book, prev_avg_dists[1]


def kmeans(obs, k_or_guess, iter=20, thresh=1e-5, check_finite=True):
    """
    Performs k-means on a set of observation vectors forming k clusters.

    The k-means algorithm adjusts the centroids until sufficient
    progress cannot be made, i.e. the change in distortion since
    the last iteration is less than some threshold. This yields
    a code book mapping centroids to codes and vice versa.

    Distortion is defined as the sum of the squared differences
    between the observations and the corresponding centroid.

    Parameters
    ----------
    obs : ndarray
       Each row of the M by N array is an observation vector. The
       columns are the features seen during each observation.
       The features must be whitened first with the `whiten` function.

    k_or_guess : int or ndarray
       The number of centroids to generate. A code is assigned to
       each centroid, which is also the row index of the centroid
       in the code_book matrix generated.

       The initial k centroids are chosen by randomly selecting
       observations from the observation matrix. Alternatively,
       passing a k by N array specifies the initial k centroids.

    iter : int, optional
       The number of times to run k-means, returning the codebook
       with the lowest distortion. This argument is ignored if
       initial centroids are specified with an array for the
       ``k_or_guess`` parameter.
       This parameter does not represent the
       number of iterations of the k-means algorithm.

    thresh : float, optional
       Terminates the k-means algorithm if the change in
       distortion since the last k-means iteration is less than
       or equal to thresh.

    check_finite : bool, optional
        Whether to check that the input matrices contain only finite numbers.
        Disabling may give a performance gain, but may result in problems
        (crashes, non-termination) if the inputs do contain infinities or NaNs.
        Default: True

    Returns
    -------
    codebook : ndarray
       A k by N array of k centroids. The i'th centroid
       codebook[i] is represented with the code i. The centroids
       and codes generated represent the lowest distortion seen,
       not necessarily the globally minimal distortion.

    distortion : float
       The mean (non-squared) Euclidean distance between the
       observations passed and the centroids generated. Note that
       this differs from the usual definition of distortion in
       k-means, which is the sum of the squared distances.

    See Also
    --------
    kmeans2 : a different implementation of k-means clustering
       with more methods for generating initial centroids but without
       using a distortion change threshold as a stopping criterion.

    whiten : must be called prior to passing an observation matrix
       to kmeans.

    Examples
    --------
    >>> import numpy as np
    >>> from numpy import array
    >>> from scipy.cluster.vq import vq, kmeans, whiten
    >>> import matplotlib.pyplot as plt
    >>> features  = array([[ 1.9,2.3],
    ...                    [ 1.5,2.5],
    ...                    [ 0.8,0.6],
    ...                    [ 0.4,1.8],
    ...                    [ 0.1,0.1],
    ...                    [ 0.2,1.8],
    ...                    [ 2.0,0.5],
    ...                    [ 0.3,1.5],
    ...                    [ 1.0,1.0]])
    >>> whitened = whiten(features)
    >>> book = np.array((whitened[0],whitened[2]))
    >>> kmeans(whitened,book)
    (array([[ 2.3110306 ,  2.86287398],    # random
           [ 0.93218041,  1.24398691]]), 0.85684700941625547)

    >>> from numpy import random
    >>> random.seed((1000,2000))
    >>> codes = 3
    >>> kmeans(whitened,codes)
    (array([[ 2.3110306 ,  2.86287398],    # random
           [ 1.32544402,  0.65607529],
           [ 0.40782893,  2.02786907]]), 0.5196582527686241)

    >>> # Create 50 datapoints in two clusters a and b
    >>> pts = 50
    >>> a = np.random.multivariate_normal([0, 0], [[4, 1], [1, 4]], size=pts)
    >>> b = np.random.multivariate_normal([30, 10],
    ...                                   [[10, 2], [2, 1]],
    ...                                   size=pts)
    >>> features = np.concatenate((a, b))
    >>> # Whiten data
    >>> whitened = whiten(features)
    >>> # Find 2 clusters in the data
    >>> codebook, distortion = kmeans(whitened, 2)
    >>> # Plot whitened data and cluster centers in red
    >>> plt.scatter(whitened[:, 0], whitened[:, 1])
    >>> plt.scatter(codebook[:, 0], codebook[:, 1], c='r')
    >>> plt.show()
    """
    obs = _asarray_validated(obs, check_finite=check_finite)
    if iter < 1:
        raise ValueError("iter must be at least 1, got %s" % iter)

    # Determine whether a count (scalar) or an initial guess (array) was passed.
    if not np.isscalar(k_or_guess):
        guess = _asarray_validated(k_or_guess, check_finite=check_finite)
        if guess.size < 1:
            raise ValueError("Asked for 0 clusters. Initial book was %s" %
                             guess)
        return _kmeans(obs, guess, thresh=thresh)

    # k_or_guess is a scalar, now verify that it's an integer
    k = int(k_or_guess)
    if k != k_or_guess:
        raise ValueError("If k_or_guess is a scalar, it must be an integer.")
    if k < 1:
        raise ValueError("Asked for %d clusters."
                         % k)

    # initialize best distance value to a large value
    best_dist = np.inf
    for i in xrange(iter):
        # the initial code book is randomly selected from observations
        guess = _kpoints(obs, k)
        book, dist = _kmeans(obs, guess, thresh=thresh)
        if dist < best_dist:
            best_book = book
            best_dist = dist
    return best_book, best_dist


def _kpoints(data, k):
    """Pick k points at random in data (one row = one observation).

    Parameters
    ----------
    data : ndarray
        Expect a rank 1 or 2 array. Rank 1 is assumed to describe
        one-dimensional data, rank 2 multidimensional data, in which case one
        row is one observation.
    k : int
        Number of samples to generate.

    """
    idx = np.random.choice(data.shape[0], size=k, replace=False)
    return data[idx]


def _krandinit(data, k):
    """Returns k samples of a random variable whose parameters depend on data.

    More precisely, it returns k observations sampled from a Gaussian random
    variable whose mean and covariance are estimated from data.

    Parameters
    ----------
    data : ndarray
        Expect a rank 1 or 2 array. Rank 1 is assumed to describe
        one-dimensional data, rank 2 multidimensional data, in which case one
        row is one observation.
    k : int
        Number of samples to generate.

    """
    mu = data.mean(axis=0)

    if data.ndim == 1:
        cov = np.cov(data)
        x = np.random.randn(k)
        x *= np.sqrt(cov)
    elif data.shape[1] > data.shape[0]:
        # initialize when the covariance matrix is rank deficient
        _, s, vh = np.linalg.svd(data - mu, full_matrices=False)
        x = np.random.randn(k, s.size)
        sVh = s[:, None] * vh / np.sqrt(data.shape[0] - 1)
        x = x.dot(sVh)
    else:
        cov = np.atleast_2d(np.cov(data, rowvar=False))

        # k rows, d cols (one row = one obs)
        # Generate k samples of a random variable ~ Gaussian(mu, cov)
        x = np.random.randn(k, mu.size)
        x = x.dot(np.linalg.cholesky(cov).T)

    x += mu
    return x

_valid_init_meth = {'random': _krandinit, 'points': _kpoints}


def _missing_warn():
    """Issue a warning when called."""
    warnings.warn("One of the clusters is empty. "
                  "Re-run kmeans with a different initialization.")


def _missing_raise():
    """Raise a ClusterError when called."""
    raise ClusterError("One of the clusters is empty. "
                       "Re-run kmeans with a different initialization.")

_valid_miss_meth = {'warn': _missing_warn, 'raise': _missing_raise}


def kmeans2(data, k, iter=10, thresh=1e-5, minit='random',
            missing='warn', check_finite=True):
    """
    Classify a set of observations into k clusters using the k-means algorithm.

    The algorithm attempts to minimize the Euclidean distance between
    observations and centroids. Several initialization methods are
    included.

    Parameters
    ----------
    data : ndarray
        An 'M' by 'N' array of 'M' observations in 'N' dimensions or a length
        'M' array of 'M' one-dimensional observations.
    k : int or ndarray
        The number of clusters to form as well as the number of
        centroids to generate. If `minit` initialization string is
        'matrix', or if an ndarray is given instead, it is
        interpreted as the initial centroids to use instead.
    iter : int, optional
        Number of iterations of the k-means algorithm to run. Note
        that this differs in meaning from the ``iter`` parameter to
        the kmeans function.
    thresh : float, optional
        (not used yet)
    minit : str, optional
        Method for initialization. Available methods are 'random',
        'points', and 'matrix':

        'random': generate k centroids from a Gaussian with mean and
        variance estimated from the data.

        'points': choose k observations (rows) at random from data for
        the initial centroids.

        'matrix': interpret the k parameter as a k by N (or length k
        array for one-dimensional data) array of initial centroids.
    missing : str, optional
        Method to deal with empty clusters. Available methods are
        'warn' and 'raise':

        'warn': give a warning and continue.

        'raise': raise a ClusterError and terminate the algorithm.
    check_finite : bool, optional
        Whether to check that the input matrices contain only finite numbers.
        Disabling may give a performance gain, but may result in problems
        (crashes, non-termination) if the inputs do contain infinities or NaNs.
        Default: True

    Returns
    -------
    centroid : ndarray
        A 'k' by 'N' array of centroids found at the last iteration of
        k-means.
    label : ndarray
        label[i] is the code or index of the centroid the
        i'th observation is closest to.

    """
    if int(iter) < 1:
        raise ValueError("Invalid iter (%s), "
                         "must be a positive integer." % iter)
    try:
        miss_meth = _valid_miss_meth[missing]
    except KeyError:
        raise ValueError("Unknown missing method %r" % (missing,))

    data = _asarray_validated(data, check_finite=check_finite)
    if data.ndim == 1:
        d = 1
    elif data.ndim == 2:
        d = data.shape[1]
    else:
        raise ValueError("Input of rank > 2 is not supported.")

    if data.size < 1:
        raise ValueError("Empty input is not supported.")

    # If k is not a single value it should be compatible with data's shape
    if minit == 'matrix' or not np.isscalar(k):
        code_book = np.array(k, copy=True)
        if data.ndim != code_book.ndim:
            raise ValueError("k array doesn't match data rank")
        nc = len(code_book)
        if data.ndim > 1 and code_book.shape[1] != d:
            raise ValueError("k array doesn't match data dimension")
    else:
        nc = int(k)

        if nc < 1:
            raise ValueError("Cannot ask kmeans2 for %d clusters"
                             " (k was %s)" % (nc, k))
        elif nc != k:
            warnings.warn("k was not an integer, was converted.")

        try:
            init_meth = _valid_init_meth[minit]
        except KeyError:
            raise ValueError("Unknown init method %r" % (minit,))
        else:
            # Use the converted integer count, not the raw k argument.
            code_book = init_meth(data, nc)

    for i in xrange(iter):
        # Compute the nearest neighbor for each obs using the current code book
        label = vq(data, code_book)[0]
        # Update the code book by computing centroids
        new_code_book, has_members = _vq.update_cluster_means(data, label, nc)
        if not has_members.all():
            miss_meth()
            # Set the empty clusters to their previous positions
            new_code_book[~has_members] = code_book[~has_members]
        code_book = new_code_book

    return code_book, label
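

# The assign/update loop at the end of kmeans2 can be hard to picture from
# the vectorized calls alone. The block below is a minimal, dependency-free
# sketch of the same idea: assign each observation to its nearest centroid,
# move each centroid to the mean of its members, and leave a centroid with
# no members where it was. The names sketch_assign, sketch_update and
# sketch_lloyd are hypothetical and are not part of scipy.cluster.vq; the
# sketch does not reproduce the C routines in _vq or their dtype handling.

```python
# Illustrative sketch only: these helper names are hypothetical and are
# not part of scipy.cluster.vq. Pure Python, no NumPy required.
import math


def sketch_assign(obs, code_book):
    """Return (codes, dists): index of and distance to the nearest centroid."""
    codes, dists = [], []
    for row in obs:
        # Euclidean distance from this observation to every centroid
        d = [math.sqrt(sum((a - b) ** 2 for a, b in zip(row, c)))
             for c in code_book]
        best = min(range(len(d)), key=d.__getitem__)
        codes.append(best)
        dists.append(d[best])
    return codes, dists


def sketch_update(obs, codes, code_book):
    """Move each centroid to the mean of its members.

    A centroid with no members keeps its previous position.
    """
    new_book = []
    for i, old in enumerate(code_book):
        members = [row for row, c in zip(obs, codes) if c == i]
        if members:
            new_book.append([sum(col) / float(len(members))
                             for col in zip(*members)])
        else:
            new_book.append(list(old))
    return new_book


def sketch_lloyd(obs, code_book, n_iter=10):
    """Run a fixed number of assign/update iterations."""
    codes = []
    for _ in range(n_iter):
        codes, _dists = sketch_assign(obs, code_book)
        code_book = sketch_update(obs, codes, code_book)
    return code_book, codes


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
book, labels = sketch_lloyd(points, [[0.0, 0.0], [10.0, 10.0]])
print(book, labels)  # [[0.0, 0.5], [10.0, 10.5]] [0, 0, 1, 1]
```

# As in kmeans2, the labels returned by this sketch are the ones computed
# before the final centroid update, and the loop runs a fixed number of
# iterations rather than stopping on a distortion threshold.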