Index of /~pmiettin/btf/YPSS_data

Name            Last modified      Size  Description
YPSS.zip        04-Nov-2013 18:45  11M   Sample data used in the publication
YPSS_large.zip  04-Nov-2013 18:45  1.4G  Full data

This folder contains the semi-synthetic data set YPSS as described and
used in the paper titled "Discovering facts with Boolean tensor Tucker
decomposition" that appeared in CIKM'13.  

Along with the data set we also provide ground truth files
corresponding to the Boolean tensor Tucker decomposition described in
[1].

When using this data set, please cite: "The YPSS data set [1], based on
YAGO [2] and PATTY [3] data sets".

[1] D. Erdos, P. Miettinen: "Discovering facts with Boolean tensor
Tucker decomposition". In CIKM'13, 2013.

[2] F. M. Suchanek, G. Kasneci, G. Weikum: "Yago -- A core of semantic
knowledge". In WWW '07, 2007.

[3] N. Nakashole, G. Weikum, F. M. Suchanek: "PATTY: A taxonomy of
relational patterns with semantic types". In EMNLP '12, 2012.


The YPSS data set is generated from a combination of data obtained
from the YAGO [2] and PATTY [3] ontologies. YPSS contains
entity-relation-entity surface term triples that faithfully represent
statistical properties of real-life text documents as captured by
PATTY and YAGO. PATTY contains a set of facts, that is, clean
entity-relation-entity term triples, along with the frequencies with
which the facts appear in real-life textual data. For each such fact
we generate as many surface triples as the frequency of that fact. To
generate a surface triple we replace the entities and the relation
with surface forms selected at random according to the probability
Pr(s|c), where s is the surface form and c is the clean entity or
relation in the fact. The surface forms and their probabilities are
selected and computed based on information from YAGO. Because of the
random generation process, the same surface form triple may appear
more than once. For a detailed description of the data generation
process, see [1].
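
The generation step described above can be illustrated with a small
sketch. This is not the code used to produce YPSS; the function name,
the input layout, and the use of a single probability table for
entities and relations are assumptions made for illustration only.
See [1] for the actual procedure.

```python
import random
from collections import Counter


def generate_surface_triples(facts, surface_probs, rng):
    """Sketch of the sampling step (hypothetical helper, not from [1]).

    facts:         iterable of ((subject, relation, object), frequency)
    surface_probs: dict mapping a clean term c to a dict {s: Pr(s|c)}
    rng:           a random.Random instance
    Returns a Counter mapping each surface triple to its multiplicity.
    """
    def sample(clean):
        forms = list(surface_probs[clean])
        weights = [surface_probs[clean][s] for s in forms]
        # Draw one surface form s with probability proportional to Pr(s|c).
        return rng.choices(forms, weights=weights, k=1)[0]

    counts = Counter()
    for (subj, rel, obj), freq in facts:
        # One surface triple is generated per occurrence of the fact.
        for _ in range(freq):
            counts[(sample(subj), sample(rel), sample(obj))] += 1
    return counts
```

Because each term is sampled independently for every occurrence, two
draws for the same fact can produce the same surface triple, which is
why the result is stored as a multiset.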

Two data sets are provided. The file YPSS.zip contains the sampled
version of the data set that is used in the experiments in [1]. The
file YPSS_large.zip contains the entire data set that resulted from
the generation process. The uncompressed version of the full data set
is 3GB. Both data sets contain the generated surface term triples and
the ground truth data.


Both folders contain four data files:

YPSS_entity1.utf8.txt
---------------------

Tab separated file. The first column contains the clean subject entity
as it appears in the fact. Subsequent columns contain all surface term
subject entities generated from this entity. 

This file corresponds to the first factor matrix in the Boolean tensor
Tucker decomposition. 


YPSS_relations.utf8.txt
-----------------------

Tab separated file. The first column contains the clean relation as it
appears in the fact. Subsequent columns contain all surface term
relations generated from this relation. 

This file corresponds to the second factor matrix in the Boolean
tensor Tucker decomposition. 


YPSS_entity2.utf8.txt
---------------------

Tab separated file. The first column contains the clean object entity
as it appears in the fact. Subsequent columns contain all surface term
object entities generated from this entity. 

This file corresponds to the third factor matrix in the Boolean tensor
Tucker decomposition. 
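
The three factor files described above share the same layout, so one
reader can handle all of them. The following is a minimal sketch,
assuming one clean term per line followed by its surface terms. The
function names are hypothetical, and the orientation of the matrix
(rows for surface terms, columns for clean terms) is an assumption
rather than something stated in this description.

```python
def read_factor_file(lines):
    """Map each clean term (column 1) to the list of its surface terms."""
    factor = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if not fields[0]:
            continue  # skip empty lines
        surfaces = [f for f in fields[1:] if f]
        factor.setdefault(fields[0], []).extend(surfaces)
    return factor


def to_boolean_factor(factor):
    """Return index maps and the set of nonzero (row, column) positions.

    Assumed orientation: rows are surface terms, columns are clean terms.
    """
    clean_index = {c: j for j, c in enumerate(factor)}
    surface_index = {}
    nonzeros = set()
    for clean, surfaces in factor.items():
        for s in surfaces:
            i = surface_index.setdefault(s, len(surface_index))
            nonzeros.add((i, clean_index[clean]))
    return surface_index, clean_index, nonzeros
```

A file would be read with, for example,
open("YPSS_entity1.utf8.txt", encoding="utf-8") passed as the lines
argument. Storing only the nonzero positions keeps the Boolean matrix
sparse, which matters for the full data set.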


YPSS_truth.utf8.txt
-------------------

Tab separated file containing 7 columns.  Columns 1-3 contain the
surface form triple, columns 4-6 the corresponding fact, and column 7
the frequency of this combination of surface form triple and fact.

This file can be used to generate the core tensor in the Boolean
tensor Tucker decomposition, using columns 4-6.
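
As a rough illustration of how the core tensor could be derived from
this file, the sketch below collects the distinct facts in columns 4-6
as the nonzero entries of a Boolean tensor. It assumes that the
columns within each triple are ordered subject, relation, object and
that column 7 is an integer; the function name is hypothetical.

```python
def read_truth(lines):
    """Return (core, surface_counts) from lines of YPSS_truth.utf8.txt.

    core:           set of clean (subject, relation, object) facts,
                    i.e. the nonzero positions of the core tensor
    surface_counts: dict mapping each surface triple to its total frequency
    """
    core = set()
    surface_counts = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 7:
            continue  # skip lines that do not have the expected 7 columns
        surface = tuple(fields[0:3])
        fact = tuple(fields[3:6])
        core.add(fact)
        surface_counts[surface] = surface_counts.get(surface, 0) + int(fields[6])
    return core, surface_counts
```

The frequencies are not needed for the Boolean core tensor itself, but
summing them per surface triple gives the multiplicities of the
generated data.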