Identifying novel information using Latent Semantic Analysis in the WiQA task at CLEF 2006

Richard F.E. Sutcliffe, Josef Steinberger, Udo Kruschwitz, Mijail Alexandrov-Kabadjov, Massimo Poesio

Research output: Contribution to journal › Conference article › peer-review

Abstract

From the perspective of WiQA, Wikipedia can be considered a set of articles, each having a unique title. In the WiQA corpus, articles are divided into sentences (snippets), each with its own identifier. Given a title, the task is to find snippets which are Important and Novel relative to the article. We indexed the corpus by sentence using Terrier. In our two-stage system, snippets were first retrieved if they contained an exact match with the title. Candidates were then passed to the Latent Semantic Analysis component, which judged them Novel if they did not match the text of the article. The test data varied: some articles were long, some short, and indeed some were empty! We prepared a training collection of twenty topics and used this for tuning the system. During evaluation on 65 topics, divided into the categories Person, Location, Organization and None, we submitted two runs. In the first, the ten best snippets were returned; in the second, the twenty best. Run 1 was best, with an Average Yield per Topic of 2.46 and a Precision of 0.37. We also studied performance on six different topic types: Person, Location, Organization and None (all specified in the corpus), Empty (no text) and Long (a lot of text). Precision in Run 1 was good for Person and Organization (0.46 and 0.44) and worst for Long (0.24). Compared to other groups, our performance was in the middle of the range except for Precision, where our system was equal to the best. We attribute this to our use of exact title matches in the IR stage. We found that judging snippets Novel when preparing training data was fairly easy, but that judging them Important was subjective. In future work we will vary the approach used depending on the topic type, exploit co-references in conjunction with exact matches, and make use of the elaborate hyperlink structure, which is a unique and most interesting aspect of Wikipedia.
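The abstract describes a two-stage pipeline: candidate snippets are retrieved by exact title match and then filtered for novelty with Latent Semantic Analysis. The sketch below illustrates that second stage only, under assumed tooling: it uses scikit-learn's TruncatedSVD in place of the paper's own LSA component, and the novel_snippets function, its 0.5 similarity threshold and the 100 latent dimensions are hypothetical choices for illustration, not values reported in the paper.

    # Illustrative sketch of an LSA-based novelty filter, not the authors' code.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    def novel_snippets(article_sentences, candidate_snippets,
                       n_components=100, threshold=0.5):
        """Return the candidate snippets judged Novel: those whose closest
        LSA similarity to any sentence of the article falls below the threshold."""
        if not article_sentences:
            return list(candidate_snippets)  # an empty article covers nothing
        corpus = list(article_sentences) + list(candidate_snippets)
        tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
        # LSA: project the tf-idf vectors into a low-dimensional latent space.
        k = max(1, min(n_components, tfidf.shape[1] - 1))
        lsa = TruncatedSVD(n_components=k).fit_transform(tfidf)
        article_vecs = lsa[:len(article_sentences)]
        snippet_vecs = lsa[len(article_sentences):]
        novel = []
        for snippet, vec in zip(candidate_snippets, snippet_vecs):
            # A snippet "matches" the article if it is close to any article sentence.
            if cosine_similarity([vec], article_vecs).max() < threshold:
                novel.append(snippet)
        return novel

In practice the threshold and the number of latent dimensions would need tuning, much as the authors tuned their system on a training collection of twenty topics.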

Original language: English
Journal: CEUR Workshop Proceedings
Volume: 1172
Publication status: Published - 2006
Event: 2006 Cross Language Evaluation Forum Workshop, CLEF 2006, co-located with the 10th European Conference on Digital Libraries, ECDL 2006 - Alicante, Spain
Duration: 20 Sep 2006 – 22 Sep 2006

Keywords

  • Information filtering
  • Latent semantic analysis
  • Question answering
