Empirical Inference · Article · 2009

Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions

Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal, and instead can lead to inconsistency. We construct examples which provably have this behavior. As in the case of supervised learning, the cure is to restrict the size of the function classes under consideration. For appropriate “small” function classes we can prove very general consistency theorems for clustering optimization schemes. As one particular algorithm for clustering with a restricted function space we introduce “nearest neighbor clustering”. Similar to the k-nearest neighbor classifier in supervised learning, this algorithm can be seen as a general baseline algorithm to minimize arbitrary clustering objective functions. We prove that it is statistically consistent for all commonly used clustering objective functions.
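To make the abstract's idea concrete, the following is a minimal sketch of nearest neighbor clustering specialized to the k-means objective: draw a small set of seed points, consider only partitions that are constant on the nearest-seed (Voronoi) cells of those seeds, and keep the seed labeling whose induced partition has the lowest objective value on the full sample. The exhaustive search over seed labelings, the function name nnc_kmeans, and the parameter defaults are illustrative assumptions, not the paper's implementation.

import itertools
import numpy as np

def nnc_kmeans(X, n_clusters=2, n_seeds=6, rng=None):
    # Draw a small set of seed points; the candidate clusterings are exactly
    # the partitions that are constant on the nearest-seed (Voronoi) cells of
    # these seeds -- a deliberately "small" function class.
    rng = np.random.default_rng(rng)
    seeds = rng.choice(len(X), size=min(n_seeds, len(X)), replace=False)
    # For every sample point, find its nearest seed.
    dists = np.linalg.norm(X[:, None, :] - X[seeds][None, :, :], axis=2)
    nearest = dists.argmin(axis=1)

    best_labels, best_cost = None, np.inf
    # Exhaustively score all K^m labelings of the m seeds (feasible for tiny m).
    for seed_labels in itertools.product(range(n_clusters), repeat=len(seeds)):
        labels = np.asarray(seed_labels)[nearest]
        if len(np.unique(labels)) < n_clusters:
            continue  # skip labelings that leave a cluster empty
        # k-means objective: within-cluster sum of squared distances to the mean.
        cost = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                   for k in range(n_clusters))
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels, best_cost

Any other clustering objective (normalized cut, ratio cut, and so on) could be scored in place of the k-means cost; the seed construction and the search over seed labelings stay the same, which is what makes the scheme a generic baseline in the sense of the abstract.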

Author(s): Bubeck, S. and von Luxburg, U.
Journal: Journal of Machine Learning Research
Volume: 10
Pages: 657–698
Year: 2009
Month: March
BibTeX Type: Article (article)
Language: en
Organization: Max-Planck-Gesellschaft
School: Biologische Kybernetik

BibTeX

@article{5687,
  title = {Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions},
  author = {Bubeck, S. and von Luxburg, U.},
  journal = {Journal of Machine Learning Research},
  volume = {10},
  pages = {657--698},
  month = mar,
  year = {2009},
  organization = {Max-Planck-Gesellschaft},
  school = {Biologische Kybernetik},
  abstract = {Clustering is often formulated as a discrete optimization problem. The objective is to
  find, among all partitions of the data set, the best one according to some quality measure.
  However, in the statistical setting where we assume that the finite data set has been sampled
  from some underlying space, the goal is not to find the best partition of the given
  sample, but to approximate the true partition of the underlying space. We argue that the
  discrete optimization approach usually does not achieve this goal, and instead can lead to
  inconsistency. We construct examples which provably have this behavior. As in the case
  of supervised learning, the cure is to restrict the size of the function classes under consideration.
  For appropriate ``small'' function classes we can prove very general consistency
  theorems for clustering optimization schemes. As one particular algorithm for clustering
  with a restricted function space we introduce ``nearest neighbor clustering''. Similar to the
  k-nearest neighbor classifier in supervised learning, this algorithm can be seen as a general
  baseline algorithm to minimize arbitrary clustering objective functions. We prove that it
  is statistically consistent for all commonly used clustering objective functions.}
}