A survey on algorithms for k-means problem and its variants

  • 1. College of Applied Sciences, Beijing University of Technology, Beijing 100124, China; 2. School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, China

Received date: 2017-04-05

Online published: 2017-06-15

Abstract

The k-means problem is one of the most classical problems in both computer science and combinatorial optimization, and k-means clustering is among the most important and simplest methods of cluster analysis in data mining. The k-means problem can be described as follows: given a set of n observations, each a d-dimensional real vector, partition the n observations into k clusters so as to minimize the within-cluster sum of squared distances, where the center of each cluster is the mean of the observations assigned to it. Although the k-means problem is NP-hard in theory, efficient heuristic algorithms are widely employed in market segmentation, machine vision, geostatistics, astronomy, agriculture, and other fields. As the problems encountered in practice become more complicated and the data grow ever larger, further research is needed. This article surveys the classical algorithms for solving the k-means problem as well as its variants and generalizations, and summarizes some open problems in k-means for the reader.
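The assignment-and-update heuristic alluded to in the abstract (the classical Lloyd-type method) can be sketched as follows. This is an illustrative NumPy sketch under my own naming and parameter choices, not code taken from any of the surveyed works; it alternates between assigning each observation to its nearest center and moving each center to the mean of its cluster, which monotonically decreases the within-cluster sum of squares.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """A minimal sketch of Lloyd's heuristic for the k-means problem.

    points: (n, d) array of observations; k: number of clusters.
    Returns (centers, labels, cost), where cost is the within-cluster
    sum of squared distances that the k-means objective minimizes.
    """
    rng = np.random.default_rng(seed)
    # Initialize centers by sampling k distinct observations.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each observation goes to its nearest center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # converged to a local optimum
        centers = new_centers
    cost = ((points - centers[labels]) ** 2).sum()
    return centers, labels, cost
```

Note that this heuristic only guarantees a local optimum and its quality depends on the initialization; careful seeding strategies such as k-means++ address exactly this weakness.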

Cite this article

XU Dachuan, XU Yicheng, ZHANG Dongmei. A survey on algorithms for k-means problem and its variants[J]. Operations Research Transactions, 2017, 21(2): 101-109. DOI: 10.15960/j.cnki.issn.1007-6093.2017.02.011
