id: 05796694 dt: a an: 05796694 au: Lelu, Alain; Cadot, Martine ti: Statistically valid links and anti-links between words and between documents: applying Tournebool randomization test to a Reuters collection. so: Guillet, F. (ed.) et al., Advances in knowledge discovery and management. Selected papers based on the presentations at the “Extraction et gestion des connaissances” conference 2009 (EGC), Strasbourg, France, January 2009. Berlin: Springer (ISBN 978-3-642-00579-4/hbk). Studies in Computational Intelligence 292, 307-324 (2010). py: 2010 pu: Berlin: Springer la: EN cc: ut: statistical graph extraction; randomization test; robust data mining; unsupervised learning; text mining ci: li: doi:10.1007/978-3-642-00580-0_18 ab: Summary: Neighborhood is a central concept in data mining, and many definitions have been implemented, mainly rooted in geometrical or topological considerations. We propose here a statistical definition of neighborhood: our TourneBool randomization test processes an objects $\times$ attributes binary table in order to establish which inter-attribute relations are fortuitous and which are meaningful, without requiring any predefined statistical model, while taking into account the empirical distributions. The result is a robust and statistically validated graph. We present a full-scale experiment on one of the publicly available Reuters test corpora. We characterize the resulting word graph by a series of indicators, such as clustering coefficients, degree distribution and correlation, cluster modularity and size distribution. Another graph structure stems from this process: the one conveying the negative “counter-relations” between words, i.e. words which “steer clear” of one another. We characterize the counter-relation graph in the same way. Finally, we generate the pair of valid document graphs (i.e. links and anti-links) and evaluate them by taking into account the Reuters document categories. rv: