The Complexity of the Simultaneous Cluster Problem
Zhentao Li, Manikandan Narayanan, and Adrian Vetta
Vol. 18, no. 1, pp. 1-34, 2014. Regular paper.
Abstract We study clustering over multiple graphs - each encoding a distinct set of similarity relationships (edges) over the same set of objects (nodes) - where the aim is to identify clusters that are supported across the collection of graphs. This problem of simultaneous clustering is readily motivated by the recent deluge of datasets in several domains (including the biological sciences, social sciences, and marketing), where the same objects are repeatedly measured in different conditions, populations or time points. Whilst there has been a vast amount of heuristic work on practical simultaneous clustering problems, little is known on the theoretical side - we present theoretical results that help explain why such heuristics typically come without quantitative guarantees. We give algorithmic and complexity results for simultaneous clustering using two standard measures on clustering quality: density and connectivity. Specifically, we focus on the basic problem of finding a single cluster (rather than an entire clustering) that is simultaneously of high quality in every graph. When the quality of a cluster is its minimum density over all graphs, we show the problem is not approximable within a factor of 2log1−ϵn, unless NPDTIME(npolylog n). Furthermore, this problem appears very difficult even when there are just two graphs; the resulting problem is approximately as hard as the problem of finding a dense subgraph on at most k vertices. When cluster quality is a fixed connectivity requirement between terminals within the cluster, there are two natural optimization problems: a maximization version (find a good quality cluster with as many terminals as possible) and a minimization version (find a good quality cluster that is as small as possible). We show that the maximization problem is tractable in polynomial time for any fixed connectivity requirement k. On the other hand the minimization problem is hard to approximate within a factor of 2log1−ϵn, unless NPDTIME(npolylog n). The number of graphs in our reduction depends on n. If instead the number of graphs is fixed, we show there is an ϵ > 0 for which the minimization problem is not approximable within g1/2−ϵ for any fixed number g of graphs unless NP = ZPP. These hardness results for the minimization problem hold even in the simple cases where the connectivity requirement is one and there are either just two terminal nodes or every node is a terminal node. We remark that our results extend to case where more robust variants of the quality measure are used.
Submitted: January 2013.
Reviewed: December 2013.
Accepted: December 2013.
Final: January 2014.
Published: January 2014.
Communicated by Ioannis G. Tollis
article (PDF)