Clustering Boolean tensors

Clustering Boolean Tensors

Much of the computational complexity of Boolean CP and Tucker3 tensor decompositions comes from the fact that the rank-1 components can overlap. Yet, in many real-world applications, at least one mode can be considered non-overlapping (for example, in subject–relation–object data the relations behave differently to subjects and objects). It is therefore natural to ask what can be gained (and what will be lost) if we restrict the Boolean CP decomposition by requiring one factor matrix to be a cluster assignment matrix, that is, a binary matrix where each row has exactly one non-zero.

We [1] phrase the problem as a maximum-similarity problem (see Figure 1): Given a binary \(n\)-by-\(m\)-by-\(l\) tensor \(\tens{X}\) and an integer \(k\), our goal is to find matrices \(A\in\{0,1\}^{n\times k}\), \(B\in\{0,1\}^{m\times k}\), and \(C\in\{0,1\}^{l\times k}\), where \(C\) is a cluster assignment matrix, such that \(\tilde{\tens{X}} = A\otimes B\otimes C\) (\(\otimes\) denotes the outer product) is as similar to \(\tens{X}\) as possible.

Figure 1. Illustration of a tensor factorization: For an exact factorization the outer product of the factor matrices \(A\), \(B\), \(C\) yields the tensor \(\tens{X}\). The matrix \(C\) in Boolean tensor clustering is restricted to be a cluster assignment matrix.

Given a tensor \(\tens{X}\), for the optimal solution of the problem, we need binary matrices \(A\), \(B\), and \(C\) that maximize the similarity between \(\tens{X}\) and \(A\otimes B\otimes C\), which can be expressed as \(C(B\odot A)^T\) (\(\odot\) denotes the Khatri-Rao product) when we compare to the matricization of \(\tens{X}\) along the third mode, \(X_{(3)}\): \(\text{sim}(X_{(3)}-C(B\odot A)^T)\). If we replace \(B\odot A\) with an arbitrary binary matrix, this would be equal to the Hypercube segmentation problem defined by Kleinberg et al. [2004]: Given a set \(S\) of \(l\) vertices of the \(d\)-dimensional cube \(\{0,1\}^d\), find \(k\) vertices \(P_1,\ldots,P_k\in \{0,1\}^d\) and a partition of \(S\) into \(k\) segments to maximize \(\sum_{i=1}^k\sum_{c\in S} \text{sim}(P_i, c)\). Therefore we employ an algorithm that resembles those for hypercube segmentation, with the added restrictions to the centroids.

Our algorithm, SaBoTeur (Sampling for Boolean Tensor clustering), acts on the unfolded tensor \(X_{(3)}\). That is the tensor \(\tens{X}\) turned into a matrix by arranging its tubes (fibers of the third mode) as columns of a matrix. In each iteration, SaBoTeur samples \(k\) rows of \(X_{(3)}\) as the initial, unrestricted centroids. The next step is to turn these centroids into the restricted type. This means, the maximum-similarity binary rank-1 decomposition of each centroid has to be obtained. Then, SaBoTeur assigns each row of \(X_{(3)}\), i.e. each slice of \(\tens{X}\) to its closest restricted centroid. The sampling is repeated multiple times, and in the end the factors that gave highest similarity are returned.

We show that, like the hypercube segmentation problem, the maximum similarity binary rank-1 decomposition also admits a PTAS. For hypercube segmentation, Alon and Sudakov [1999] present a randomized algorithm that on expectation attains a similarity within \((1 - \epsilon)\) of the optimum and can be derandomized. To show the approximabilty of the maximum similarity binary rank-1 decomposition, we proof that a variation of Alon's and Sudakov's algorithm solves the decomposition while the approximation bounds are maintained.

Source code

btc.zip

References

Saskia Metzler Pauli Miettinen Clustering Boolean Tensors. Data Mining and Knowledge Discovery 29(5), 2015, 1343–1373.
10.1007/s10618-015-0420-3
[tech. rep. | manuscript | pdf (Springer) | source code]
Saskia Metzler Pauli Miettinen Clustering Boolean Tensors. arXiv:1501.00696 [cs.NA] 2015.
[pdf (arXiv)]