A new method for identifying outlying subsets of data. (English) Zbl 1175.62065

Summary: In various branches of science, e.g., medicine, economics, sociology, it is necessary to identify or detect outlying subsets of data. Suppose that the set of data is partitioned into many relatively small subsets and we have some reason to suspect that one or several of these subsets may be atypical or aberrant. We propose applying a new measure of separability, based on ideas borrowed from discriminant analysis. We define two versions of this measure, both using a jackknife (leave-one-out) estimator of classification error. If a suspected subset is significantly well separated from the main bulk of the data, then we regard it as outlying.
The usefulness of our algorithm is illustrated on a set of medical data collected in a large survey “Epidemiology of Allergic Diseases in Poland” (ECAP). We also tested our method on artificial data sets and on the classical IRIS data set. For comparison, we report the results of a homogeneity test of R. Bartoszynski, D. K. Pearl and J. Lawrence [J. Am. Stat. Assoc. 92, No. 438, 577–586 (1997; Zbl 0887.62046)] applied to the same data sets.
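
The summary does not give the two versions of the separability measure, so the following is only a rough sketch of the general idea: label the suspected subset against the rest of the data, estimate the classification error by leave-one-out, and treat a low error as good separation. The 1-nearest-neighbour classifier and the names `loo_error_1nn` and `separability` are assumptions made for illustration, not the authors' definitions, and no significance assessment is included.

```python
import math


def loo_error_1nn(points, labels):
    """Leave-one-out error rate of a 1-nearest-neighbour classifier.

    The choice of classifier is an assumption for this sketch; the
    summary only says the measure borrows from discriminant analysis.
    """
    errors = 0
    for i, (p, y) in enumerate(zip(points, labels)):
        best_dist, best_label = math.inf, None
        for j, (q, z) in enumerate(zip(points, labels)):
            if j == i:
                continue  # leave the i-th observation out
            d = math.dist(p, q)
            if d < best_dist:
                best_dist, best_label = d, z
        if best_label != y:
            errors += 1
    return errors / len(points)


def separability(subset, rest):
    """Hypothetical separability score: 1 minus the leave-one-out error
    when discriminating the suspected subset (label 1) from the rest (label 0)."""
    points = list(subset) + list(rest)
    labels = [1] * len(subset) + [0] * len(rest)
    return 1.0 - loo_error_1nn(points, labels)


# Toy data: a bulk of four points, one distant subset, one embedded subset.
rest = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
far_subset = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
mixed_subset = [(0.5, 0.5)]

score_far = separability(far_subset, rest)
score_mixed = separability(mixed_subset, rest)
```

In this toy setting the distant subset should score higher than the one embedded in the bulk. Deciding whether a score is "significantly" high would need an additional step, such as comparison against scores of randomly drawn subsets, which the summary does not detail.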

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62P10 Applications of statistics to biology and medical sciences; meta analysis
65C60 Computational problems in statistics (MSC2010)

Citations:

Zbl 0887.62046
Full Text: EuDML