This folder contains the semi-synthetic data set YPSS, as described and used in the paper "Discovering facts with Boolean tensor Tucker decomposition", which appeared in CIKM'13. Along with the data set, we also provide ground truth files corresponding to the Boolean tensor Tucker decomposition described in [1].

When using this data set, please cite: "The YPSS data set [1], based on YAGO [2] and PATTY [3] data sets".

[1] D. Erdos, P. Miettinen: "Discovering facts with Boolean tensor Tucker decomposition". In CIKM '13, 2013.
[2] F. M. Suchanek, G. Kasneci, G. Weikum: "Yago -- A core of semantic knowledge". In WWW '07, 2007.
[3] N. Nakashole, G. Weikum, F. M. Suchanek: "PATTY: A taxonomy of relational patterns with semantic types". In EMNLP '12, 2012.

The YPSS data set is generated from a combination of data obtained from the YAGO [2] and PATTY [3] ontologies. YPSS contains entity-relation-entity surface term triples that faithfully represent the statistical properties of real-life text documents as captured by PATTY and YAGO.

PATTY contains a set of facts, that is, clean entity-relation-entity term triples, along with the frequencies with which the facts appear in real-life textual data. For each such fact we generate as many surface triples as the frequency of that fact. To generate a surface triple, we replace the entities and the relation with surface forms selected at random according to the probability Pr(s|c), where s is the surface form and c is the clean entity or relation in the fact. The surface forms and their probabilities are selected and computed based on information from YAGO. Because the generation process is random, the same surface form triple may appear more than once. For a detailed description of the data generation process, please read [1].

Two data sets are provided. The one in file YPSS.zip contains the sampled version of the data set that is used in the experiments in [1].
The one in file YPSS_large.zip contains the entire data set that resulted from the generation process. Note that the uncompressed version of the full data set is 3 GB.

The data sets contain both the generated surface term triples and the ground truth data. Both folders contain four data files:

YPSS_entity1.utf8.txt
---------------------
Tab-separated file. The first column contains the clean subject entity as it appears in the fact. Subsequent columns contain all surface term subject entities generated from this entity. This file corresponds to the first factor matrix in the Boolean tensor Tucker decomposition.

YPSS_relations.utf8.txt
-----------------------
Tab-separated file. The first column contains the clean relation as it appears in the fact. Subsequent columns contain all surface term relations generated from this relation. This file corresponds to the second factor matrix in the Boolean tensor Tucker decomposition.

YPSS_entity2.utf8.txt
---------------------
Tab-separated file. The first column contains the clean object entity as it appears in the fact. Subsequent columns contain all surface term object entities generated from this entity. This file corresponds to the third factor matrix in the Boolean tensor Tucker decomposition.

YPSS_truth.utf8.txt
-------------------
Tab-separated file containing 7 columns. Columns 1-3 contain the surface form triple, columns 4-6 the corresponding fact, and column 7 the frequency of this surface form - fact combination. Columns 4-6 of this file can be used to generate the core tensor in the Boolean tensor Tucker decomposition.
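
Reading the files
-----------------
The file layouts described above can be parsed with a few lines of Python. The following is a minimal sketch, not part of the original distribution. It assumes the files are UTF-8 encoded (as the file names suggest), that columns 1-3 and 4-6 of the truth file are ordered subject, relation, object, and that a sparse set of index triples is an acceptable stand-in for the Boolean core tensor. The function names and the sample rows are invented for illustration; the real files may spell entities and relations differently.

```python
def read_factor_rows(lines):
    """Map each clean term (column 1) to the set of its surface terms (columns 2+)."""
    factor = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        factor.setdefault(fields[0], set()).update(f for f in fields[1:] if f)
    return factor


def read_truth_rows(lines):
    """Yield (surface_triple, fact_triple, frequency) from 7-column rows."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        yield tuple(fields[0:3]), tuple(fields[3:6]), int(fields[6])


def build_core(truth_rows):
    """Index the clean terms of each mode and collect the nonzero core entries.

    Returns (index, core): index[m] maps the clean terms of mode m to integer
    ids, and core is the set of (i, j, k) positions that are 1 in the Boolean
    core tensor, i.e. one position per distinct fact in columns 4-6.
    """
    index = ({}, {}, {})
    core = set()
    for _surface, fact, _freq in truth_rows:
        core.add(tuple(index[m].setdefault(term, len(index[m]))
                       for m, term in enumerate(fact)))
    return index, core


# Invented sample rows in the 7-column layout (not taken from the data set).
sample_truth = [
    "A. Einstein\twas born in\tUlm\tAlbert_Einstein\tbornIn\tUlm\t2\n",
    "Einstein\tborn in\tUlm, Germany\tAlbert_Einstein\tbornIn\tUlm\t1\n",
    "M. Curie\twas born in\tWarsaw\tMarie_Curie\tbornIn\tWarsaw\t3\n",
]
index, core = build_core(read_truth_rows(sample_truth))
print(sorted(core))  # [(0, 0, 0), (1, 0, 1)]
```

To run this on the real data, pass an open file instead of the sample list, for example open("YPSS_truth.utf8.txt", encoding="utf-8"). The functions read one line at a time, which matters for the 3 GB full data set, although the resulting dictionaries and the core set are still held in memory.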