Boolean Tensor Factorization

Finding Facts from Boolean Tensors

The Resource Description Framework (RDF) represents information in the form of subject–predicate–object triples. These triples are interpreted as a directed labeled graph, but alternatively, the data can be interpreted as a 3-way Boolean tensor. This allows us to express the SPARQL queries – the standard queries for RDF – as elementary operations in Boolean algebra, giving us a complete reinterpretation of RDF and SPARQL [2].

Given an RDF graph \(G\), let \(S(G)\), \(P(G)\), and \(O(G)\) denote the sets of distinct subjects, predicates and objects respectively. The number of distinct subjects is denoted by \(\abs{S(G)}\), or shorthand \(\abs{S}\). Correspondingly \(\abs{P(G)} = \abs{P}\) and \(\abs{O(G)} = \abs{O}\) denote the number of predicates and objects.

To represent this RDF data by a binary tensor, we enumerate all subjects, predicates, and objects to obtain mappings from the items in \(S(G)\), \(P(G)\), and \(O(G)\) to indices. Let \(s_i\) denote the \(i\)-th subject with the corresponding index \(i=1,\ldots,|S|\) and respectively \(p_j\) the \(j\)-th predicate with \(j=1,\ldots,|P|\) and \(o_k\) the \(k\)-th object with \(k=1,\ldots,|O|\).

With this mapping we can represent any RDF graph \(G\) as a 3-way binary \(|S|\)-by-\(|P|\)-by-\(|O|\) tensor \(\tens{T}\). An element \((i,j,k)\) of \(\tens{T}\) is \(1\) if and only if the respective subject-predicate-object triple \((s_i,p_j,o_k)\) is present in the RDF graph \(G\).

We can also decompose RDF data sets, similarly to any other tensor. This could be used to find regularities in the data, potentially helping with the query processing, or it can be used to find interesting patterns for data analysis. In the latter application, we considered [1] data consisting of raw subject–predicate–object triples (surface terms). Our aim was to find sets of subject, object, and relation surface terms ("synonyms"), and "true" triples ("facts") over them.

Figure 1. From surface triples to facts: schematic view of Boolean Tucker3 decomposition applied to surface triple data. np stands for noun phrase and vp for verbal phrase.

The Boolean Tucker3 decomposition suits perfectly to this task, the factor matrices giving the synonyms and the core tensor giving the facts (see Figure 1). The Boolean algebra also takes into account the polysemy in the terms, and the generates naturally sparse core tensors – an important characteristic, as the size for the core tensor for this application will make storing a dense core tensor infeasible.

Data

We created a collection of semi-synthetic data, called YPSS data, based on the YAGO and PATTY data sets.

The YPSS data set is generated from a combination of data obtained from the YAGO and PATTY ontologies. YPSS contains entity-relation-entity surface term triples that faithfully represent statistical properties of real life text documents as captured by PATTY and YAGO. PATTY contains a set of facts, that is clean entity-relation-entity term triples, along with frequencies that the facts appear in real life textual data. For each such fact we generate as many surface triples as is the frequency of that fact. To generate the surface forms we replace the entities and the relation with surface forms selected at random with respect to probability \(\Pr(s\mid c)\), where \(s\) is the surface form and \(c\) is the clean entity or relation in the fact. The surface forms and their probabilities are selected and computed based on information from YAGO. Due to the random generation process, surface form triples may appear with multiplicity. For a detailed description of the data generation process please read [1].

Two datasets are provided. The one in file YPSS.zip contains the sampled version of the data set that is used in the experiments in [1]. The one in file YPSS_large.zip contains the entire data set that resulted from the generation process. Uncompressed version of the full dataset is 3GB! The data sets contain both the generated surface term triples as well as ground truth data.

Please read the accompanied README file for further information on the contents of these packages and on the exact citation requirements.

References

  1. Dóra Erdős Pauli Miettinen Discovering Facts with Boolean Tensor Tucker Decomposition. Proc. 2013 ACM International Conference on Infortmation and Knowledge Management (CIKM '13), , 15961572.
    10.1145/2505515.2507846
    [pdf (ACM) | manuscript | data]
  2. Saskia Metzler Pauli Miettinen On Defining SPARQL with Boolean Tensor Algebra. arXiv:1503.00301 [cs.DB]
    [pdf (arXiv)]