[1] Steinhaus H. Sur la division des corps matériels en parties [J]. Bulletin de l'Académie Polonaise des Sciences, 1956, IV: 801-804.
[2] Lloyd S. Least squares quantization in PCM [J]. IEEE Transactions on Information Theory, 1982, 28(2): 129-137.
[3] Ball G, Hall D. ISODATA, a novel method of data analysis and pattern classification [R]. Technical report NTIS AD 699616, Stanford: Stanford Research Institute, 1965.
[4] MacQueen J. Some methods for classification and analysis of multivariate observations [C]//Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, 1: 281-297.
[5] Mahajan M, Nimbhorkar P, Varadarajan K. The planar k-means problem is NP-hard [J]. Theoretical Computer Science, 2012, 442: 13-21.
[6] Aloise D, Deshpande A, Hansen P, et al. NP-hardness of Euclidean sum-of-squares clustering [J]. Machine Learning, 2009, 75: 245-249.
[7] Awasthi P, Charikar M, Krishnaswamy R, et al. The hardness of approximation of Euclidean k-means [C]//Proceedings of the 31st International Symposium on Computational Geometry, 2015, 754-767.
[8] Lee E, Schmidt M, Wright J. Improved and simplified inapproximability for k-means [J]. Information Processing Letters, 2017, 120: 40-43.
[9] MacKay D J C. Information Theory, Inference and Learning Algorithms [M]. Cambridge: Cambridge University Press, 2003, 284-292.
[10] Arthur D, Vassilvitskii S. K-means++: the advantages of careful seeding [C]//Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, 2007, 1027-1035.
[11] Ostrovsky R, Rabani Y, Schulman L J, et al. The effectiveness of Lloyd-type methods for the k-means problem [J]. Journal of the ACM, 2012, 59(6): 1-22.
[12] Arthur D, Manthey B, Röglin H. Smoothed analysis of the k-means method [J]. Journal of the ACM, 2011, 58(5): 1-31.
[13] Kanungo T, Mount D M, Netanyahu N S, et al. A local search approximation algorithm for k-means clustering [C]//Proceedings of the 18th Annual Symposium on Computational Geometry, 2002, 10-18.
[14] Aggarwal A, Deshpande A, Kannan R. Adaptive sampling for k-means clustering [C]//Proceedings of the 12th International Workshop APPROX and the 13th International Workshop RANDOM, 2009, 15-28.
[15] Makarychev K, Makarychev Y, Sviridenko M, et al. A bi-criteria approximation algorithm for k-means [C]//Proceedings of APPROX/RANDOM'16, 2016, Article No. 14, 1-20.
[16] Hsu D, Telgarsky M. Greedy bi-criteria approximations for k-medians and k-means [J]. arXiv: 1607.06203, 2016.
[17] Bandyapadhyay S, Varadarajan K. On variants of k-means clustering [C]//Proceedings of the 32nd International Symposium on Computational Geometry, 2016, Article No. 14, 1-15.
[18] Inaba M, Katoh N, Imai H. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering [C]//Proceedings of the 10th Annual Symposium on Computational Geometry, 1994, 332-339.
[19] Matoušek J. On approximate geometric k-clustering [J]. Discrete and Computational Geometry, 2000, 24: 61-84.
[20] Friggstad Z, Rezapour M, Salavatipour M R. Local search yields a PTAS for k-means in doubling metrics [C]//Proceedings of the 57th Annual Symposium on Foundations of Computer Science, 2016, 365-374.
[21] Cohen-Addad V, Klein P N, Mathieu C. Local search yields approximation schemes for k-means and k-median in Euclidean and minor-free metrics [C]//Proceedings of the 57th Annual Symposium on Foundations of Computer Science, 2016, 353-364.
[22] Bradley P S, Mangasarian O L, Street W N. Clustering via concave minimization [M]//Advances in Neural Information Processing Systems, Cambridge, Massachusetts: MIT Press, 1997, 368-374.
[23] Jain A K, Dubes R C. Algorithms for Clustering Data [M]. London: Prentice-Hall, 1988.
[24] Kaufman L, Rousseeuw P. Clustering by Means of Medoids [M]. Amsterdam: North-Holland, 1987.
[25] Arya V, Garg N, Khandekar R, et al. Local search heuristics for k-median and facility location problems [C]//Proceedings of the 33rd Annual ACM Symposium on Theory of Computing, 2001, 21-29.
[26] Li S, Svensson O. Approximating k-median via pseudo-approximation [J]. SIAM Journal on Computing, 2016, 45: 530-547.
[27] Byrka J, Pensyl T, Rybicki B, et al. An improved approximation for k-median, and positive correlation in budgeted optimization [C]//Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms, 2015, 737-756.
[28] Byrka J, Pensyl T, Rybicki B, et al. An improved approximation for k-median, and positive correlation in budgeted optimization [J]. arXiv: 1406.2951v4, 2016.
[29] Huang Z. Extensions to the k-means algorithm for clustering large data sets with categorical values [J]. Data Mining and Knowledge Discovery, 1998, 2: 283-304.
[30] He Z, Deng S, Xu X. Approximation algorithms for k-modes clustering [C]//Proceedings of the International Conference on Intelligent Computing, 2006, 296-302.
[31] Aggarwal C C, Reddy C K. Data Clustering: Algorithms and Applications [M]. London: Chapman and Hall/CRC, 2013.
[32] Pelleg D, Moore A W. X-means: Extending k-means with efficient estimation of the number of clusters [C]//Proceedings of the International Conference on Machine Learning, 2000, 727-734.
[33] Hamerly G, Elkan C. Learning the k in k-means [C]//Proceedings of the Annual Conference on Neural Information Processing Systems, 2003, 281-288.
[34] Feng Y, Hamerly G, Elkan C. PG-means: learning the number of clusters in data [C]//Proceedings of the Annual Conference on Neural Information Processing Systems, 2006, 393-400.
[35] Ishioka T. An expansion of X-means for automatically determining the optimal number of clusters [C]//Proceedings of International Conference on Computational Intelligence, 2005, 91-95.
[36] Tucker C S, Kim H M, Barker D E, et al. A relief attribute weighting and X-means clustering methodology for top-down product family optimization [J]. Engineering Optimization, 2010, 42: 593-616.
[37] Dunn J C. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters [J]. Journal of Cybernetics, 1973, 3: 32-57.
[38] Celeux G, Govaert G. A classification EM algorithm for clustering and two stochastic versions [J]. Computational Statistics & Data Analysis, 1992, 14: 315-332.
[39] Fred A L N, Jain A K. Robust data clustering [C]//Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003, 2: 128-133.
[40] Honda K, Notsu A, Ichihashi H. Fuzzy PCA-Guided robust k-means clustering [J]. IEEE Transactions on Fuzzy Systems, 2010, 18: 67-79.
[41] Zhang D, Hao C, Wu C, et al. A local search approximation algorithm for the k-means problem with penalties [R]. Beijing: Beijing University of Technology, 2017.
[42] Gupta S, Kumar R, Lu K, et al. Local search methods for k-means with outliers [J]. Proceedings of the VLDB Endowment, 2017, 10(7): 757-768.
[43] Zhang D, Xu D, Zhang P, et al. A local search approximation algorithm for a sum of squares k-facility location problem [R]. Beijing: Beijing University of Technology, 2017.
[44] Zhang D, Xu D, Zhang P, et al. A local search approximation algorithm for the sum of squares facility location problem [R]. Beijing: Beijing University of Technology, 2017.
[45] Bradley P S, Bennett K P, Demiriz A. Constrained k-means clustering [R]. Technical report MSR-TR-2000-65, Redmond: Microsoft Research, 2000.
[46] Luo M, Ma Y F, Zhang H J. A spatial constrained k-means approach to image segmentation [C]//Proceedings of the 2003 Joint Conference of the 4th International Conference on Information, Communications and Signal Processing and the 4th Pacific Rim Conference on Multimedia, 2003, 2: 738-742.
[47] Ng M K. A note on constrained k-means algorithms [J]. Pattern Recognition, 2000, 33: 515-519.
[48] Wagstaff K, Cardie C, Rogers S, et al. Constrained k-means clustering with background knowledge [C]//Proceedings of the International Conference on Machine Learning, 2001, 577-584.
[49] Banerjee A, Merugu S, Dhillon I S, et al. Clustering with Bregman divergences [J]. Journal of Machine Learning Research, 2005, 6: 1705-1749.
[50] Kim K, Ahn H. A recommender system using GA k-means clustering in an online shopping market [J]. Expert Systems with Applications, 2008, 34: 1200-1209.
[51] King A. Online k-means clustering of nonstationary data [R]. Prediction Project Report, 2012, 1-9.
[52] Liberty E, Sriharsha R, Sviridenko M. An algorithm for online k-means clustering [C]//Proceedings of the 18th Workshop on Algorithm Engineering and Experiments (ALENEX), Society for Industrial and Applied Mathematics, 2016, 81-89.
[53] Buchta C, Kober M, Feinerer I, et al. Spherical k-means clustering [J]. Journal of Statistical Software, 2012, 50: 1-22.
[54] Wild S. Seeding non-negative matrix factorizations with the spherical k-means clustering [D]. Boulder: University of Colorado, 2003.
[55] Zhong S. Efficient online spherical k-means clustering [C]//Proceedings of the IEEE International Joint Conference on Neural Networks, 2005, 3180-3185.
[56] Hong Y, Kwong S. Learning assignment order of instances for the constrained k-means clustering algorithm [J]. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2009, 39: 568-574.
[57] Ding C, Zhou D, He X, et al. R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization [C]//Proceedings of the 23rd International Conference on Machine Learning, 2006, 281-288.
[58] Wu X, Zhu X, Wu G Q, et al. Data mining with big data [J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 26: 97-107.
[59] Cai X, Nie F, Huang H. Multi-view k-means clustering on big data [C]//Proceedings of the International Joint Conference on Artificial Intelligence, 2013, 2598-2604.
[60] Cui X, Zhu P, Yang X, et al. Optimized big data k-means clustering using MapReduce [J]. The Journal of Supercomputing, 2014, 70: 1249-1259.
[61] Ding H, Liu Y, Huang L, et al. K-Means clustering with distributed dimensions [C]//Proceedings of the 33rd International Conference on Machine Learning, 2016, 1339-1348.
[62] Shim K. MapReduce algorithms for big data analysis [J]. Proceedings of the VLDB Endowment, 2012, 5(12): 2016-2017.
[63] Holmes A. Hadoop in Practice [M]. New York: Manning Publications Co., 2012.
[64] Slavakis K, Giannakis G B, Mateos G. Modeling and optimization for big data analytics: (statistical) learning tools for our era of data deluge [J]. IEEE Signal Processing Magazine, 2014, 31: 18-31.
[65] Chen C L P, Zhang C Y. Data-intensive applications, challenges, techniques and technologies: a survey on big data [J]. Information Sciences, 2014, 275: 314-347.
[66] Fan W, Bifet A. Mining big data: current status, and forecast to the future [J]. ACM SIGKDD Explorations Newsletter, 2013, 14: 1-5.
[67] Dempster A P, Laird N M, Rubin D B. Maximum likelihood from incomplete data via the EM algorithm [J]. Journal of the Royal Statistical Society, Series B (Methodological), 1977, 39: 1-38.