运筹学学报

• 运筹学 • 上一篇    下一篇

k-均值算法的初始化方法综述

徐大川1  许宜诚1,2  张冬梅3,*   

  1. 1. 北京工业大学应用数理学院, 北京 100124; 2. 中国科学院深圳先进技术研究院, 广东深圳 518055; 3. 山东建筑大学计算机科学与技术学院, 济南 250101
  • 收稿日期:2017-10-16 出版日期:2018-06-15 发布日期:2018-06-15
  • 通讯作者: 张冬梅 zhangdongmei@sdjzu.edu.cn
  • 基金资助:

    国家自然科学基金(No. 11531014), 山东省济南市科技发展计划项目(No. 201401211)

A survey on the initialization methods for the k-means algorithm

XU Dachuan1  XU Yicheng1,2  ZHANG Dongmei3,*   

  1. 1. College of Applied Sciences, Beijing University of Technology, Beijing 100124, China;  2.  Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, Guangdong, China; 3. School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, China
  • Received:2017-10-16 Online:2018-06-15 Published:2018-06-15

摘要:

k-均值问题自提出以来一直吸引组合优化和计算机科学领域的广泛关注, 是经典的NP-难问题之一. 给定N个d维实向量构成的观测集, 目标是把这N个观测点划分到k(\leq N)个集合中, 使得所有集合中的点到对应的聚类中心距离的平方和最小, 一个集合的聚类中心指的是该集合 中所有观测点的均值. k-均值算法作为解决k-均值问题的启发式算法,在实际应用中因其出色的收敛速度而倍受欢迎. k-均值算法可描述为: 给定问题的初始化分组, 交替进行指派(将观测点分配到离其最近的均值点)和更新(计算新的聚类的均值点)直到收敛到某一解. 该算法通常被认为几乎是线性收敛的. 但缺点也很明显, 无法保证得到的是全局最优解, 并且算法结果好坏过于依赖初始解的选取. 于是学者们纷纷提出不同的初始化方法来提高k-均值算法的质量. 现筛选和罗列了关于选取初始解的k-均值算法的初始化方法供读者参考.

关键词: k-均值算法, 初始化方法

Abstract:

The k-means problem which is one of the classical NP-hard problems keeps attracting wide attention in combinatorial optimization and computer science area since been proposed. Given a set of observations consisting of N d-dimensional real vectors, the objective is to partition the N observations into k(\leqslant N) sets, so as to minimize the within-cluster sum of squares, where the center of a cluster is defined as the mean of all the observations in it. As one of the heuristic algorithms for solving k-means problem, k-means algorithm is quite popular in practice because of its excellent perform of convergence. The k-means algorithm can be described as: given initial solution for k-means problem, repeat assignment (assign observations to its nearest center) and update (compute the new centers in the new clusters) step until convergence to a solution. Although the algorithm is usually considered as linearly convergency, its shortcomings are obvious. The k-means algorithm is unable to ensure the global optimality and the output is rely badly on the choice of initial solutions. Therefore, variants of initialization methods are proposed to improve the performance of the k-means algorithm. In this paper we mainly filter and list some of them for readers.

Key words: k-means algorithm, initialization method