Our goal is twofold: 1) to mine only the statistically valid 2-itemsets from a boolean data table, and 2) on this basis, to build only those higher-order itemsets that are non-redundant with respect to their sub-itemsets. For the first task we have designed a randomization test (Tournebool) that respects the structure of the data variables and is independent of the specific distributions of the data. On our test set (193 texts and 888 terms), this reduces 400,000 2-itemsets to 4,000 significant ones at the 95% confidence level. For the second task, we have devised a hierarchical stepwise procedure (MIDOVA) that evaluates the residual amount of variation attributable to higher-order itemsets, yielding new possible positive or negative higher-order relations. On our example, this leads to 2,300 3-itemsets, 41 4-itemsets, and no itemsets of higher order, in a computationally efficient way.
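To make the idea of a randomization test on a boolean table concrete, here is a minimal Python sketch. It is not the Tournebool algorithm itself, whose details are not given in this abstract; it assumes one common way of generating random tables, namely "checkerboard" swaps that leave every row sum and column sum unchanged. The function names (`swap_randomize`, `support`, `pair_p_values`), the number of swaps, and the one-sided test on co-occurrence counts are all illustrative assumptions.

```python
import random
from itertools import combinations


def swap_randomize(rows, n_swaps, rng):
    """Return a shuffled copy of a 0/1 table (assumed scheme, not Tournebool).

    Each accepted swap flips a 2x2 checkerboard ([[1,0],[0,1]] <-> [[0,1],[1,0]]),
    which keeps all row sums and column sums unchanged.
    """
    m = [list(r) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(n_swaps):
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        is_checkerboard = (
            m[r1][c1] == m[r2][c2]
            and m[r1][c2] == m[r2][c1]
            and m[r1][c1] != m[r1][c2]
        )
        if is_checkerboard:
            m[r1][c1], m[r1][c2] = m[r1][c2], m[r1][c1]
            m[r2][c1], m[r2][c2] = m[r2][c2], m[r2][c1]
    return m


def support(m, a, b):
    """Number of rows in which columns a and b are both 1."""
    return sum(1 for row in m if row[a] and row[b])


def pair_p_values(rows, n_sim=200, n_swaps=None, seed=0):
    """One-sided empirical p-value for each column pair (2-itemset).

    The p-value estimates how often a margin-preserving random table shows
    at least as many co-occurrences as the observed table.
    """
    rng = random.Random(seed)
    n_cols = len(rows[0])
    if n_swaps is None:
        # Heuristic (assumption): attempt about ten swaps per 1 in the table.
        n_swaps = 10 * sum(map(sum, rows))
    pairs = list(combinations(range(n_cols), 2))
    observed = {p: support(rows, *p) for p in pairs}
    at_least = dict.fromkeys(pairs, 0)
    for _ in range(n_sim):
        sim = swap_randomize(rows, n_swaps, rng)
        for p in pairs:
            if support(sim, *p) >= observed[p]:
                at_least[p] += 1
    # The +1 correction keeps the estimate strictly above zero.
    return {p: (at_least[p] + 1) / (n_sim + 1) for p in pairs}


if __name__ == "__main__":
    table = [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 1],
        [0, 0, 1, 0],
        [1, 1, 1, 0],
    ]
    for pair, p in sorted(pair_p_values(table).items()):
        print(pair, round(p, 3))
```

Under these assumptions, a 2-itemset would be retained when its p-value falls below 0.05, mirroring the 95% level quoted above. Testing for negative associations would count simulated supports that are at most the observed one instead.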
Published: 2007-01-23
Classification:
Artificial Intelligence,
Discrete Mathematics,
Learning,
Modeling and Simulation,
Document and Text Processing,
Mathematics,
Statistics,
Theory,
[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI],
[INFO.INFO-DM]Computer Science [cs]/Discrete Mathematics [cs.DM],
[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG],
[INFO.INFO-MO]Computer Science [cs]/Modeling and Simulation,
[INFO.INFO-TT]Computer Science [cs]/Document and Text Processing,
[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST],
[STAT.TH]Statistics [stat]/Statistics Theory [stat.TH]
@article{inria-00186096,
author = {Cadot, Martine and Lelu, Alain},
title = {Simuler et \'epurer pour extraire les motifs s\^urs et non redondants},
journal = {HAL},
volume = {2007},
number = {0},
year = {2007},
language = {fr},
url = {http://dml.mathdoc.fr/item/inria-00186096}
}
Cadot, Martine; Lelu, Alain. Simuler et épurer pour extraire les motifs sûrs et non redondants. HAL, Volume 2007 (2007) no. 0. http://gdmltest.u-ga.fr/item/inria-00186096/