News thread extraction based on topical N-gram model with a background distribution. (English)
Lu, Bao-Liang (ed.) et al., Neural information processing. 18th international conference, ICONIP 2011, Shanghai, China, November 13‒17, 2011. Proceedings, Part II. Berlin: Springer (ISBN 978-3-642-24957-0/pbk). Lecture Notes in Computer Science 7063, 416-424 (2011).
Summary: Automatic thread extraction for news events can help readers understand the different aspects of an event. In this paper, we present an extraction method that uses a topical N-gram model with a background distribution (TNB). Unlike most topic models, such as Latent Dirichlet Allocation (LDA), which rely on the bag-of-words assumption, our model treats words in their textual order. Each news report is represented as a combination of a background distribution over the corpus and a mixture distribution over hidden news threads. Thus our model can capture “presidential election”, which recurs across different years, as a background phrase and “Obama wins” as a thread for the event “2008 USA presidential election”. We apply our method to two different corpora. Evaluation based on human judgment shows that the model can generate meaningful and interpretable threads from a news corpus.
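
The representation of a report as a background distribution combined with a mixture over threads can be illustrated with a minimal sketch. It is not the TNB model itself: it ignores the word-order (N-gram) component, and the toy vocabulary, the probabilities, the thread names, and the single `background_weight` parameter are all assumptions made for illustration, since the summary does not give the model's actual parameterization.

```python
# Hypothetical corpus-wide background distribution over words (made-up values).
BACKGROUND = {"presidential": 0.5, "election": 0.5}

# Hypothetical per-thread word distributions (made-up values).
THREADS = {
    "obama_wins": {"obama": 0.6, "wins": 0.4},
    "mccain_concedes": {"mccain": 0.5, "concedes": 0.5},
}


def word_probability(word, thread_weights, background_weight):
    """Probability of `word` in one report under a background/thread mixture.

    thread_weights: the report's mixture over threads (values sum to 1).
    background_weight: assumed share of words drawn from the background.
    """
    p_background = BACKGROUND.get(word, 0.0)
    p_threads = sum(
        weight * THREADS[name].get(word, 0.0)
        for name, weight in thread_weights.items()
    )
    return background_weight * p_background + (1.0 - background_weight) * p_threads


if __name__ == "__main__":
    # A report that is mostly about the "obama_wins" thread.
    report = {"obama_wins": 0.75, "mccain_concedes": 0.25}
    for w in ["election", "obama", "concedes"]:
        print(w, word_probability(w, report, background_weight=0.5))
```

In this toy setting, words such as "election" receive probability only from the background component, whatever the report's thread mixture is, while "obama" receives probability only through the thread the report is mostly about. This is the separation the summary describes between background phrases and thread-specific phrases.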