Abstract
From the perspective of WiQA, Wikipedia can be considered a set of articles, each with a unique title. In the WiQA corpus, articles are divided into sentences (snippets), each with its own identifier. Given a title, the task is to find snippets that are Important and Novel relative to the article. We indexed the corpus by sentence using Terrier. In our two-stage system, snippets were first retrieved if they contained an exact match with the title. Candidates were then passed to a Latent Semantic Analysis component, which judged them Novel if they did not match the text of the article. The test data was varied: some articles were long, some short, and indeed some were empty! We prepared a training collection of twenty topics and used it to tune the system. For the evaluation on 65 topics, divided into the categories Person, Location, Organization and None, we submitted two runs: in the first, the ten best snippets were returned, and in the second, the twenty best. Run 1 performed better, with an Average Yield per Topic of 2.46 and a Precision of 0.37. We also studied performance on six topic types: Person, Location, Organization and None (all specified in the corpus), plus Empty (no text) and Long (a great deal of text). Precision results in Run 1 were good for Person and Organization (0.46 and 0.44) and worst for Long (0.24). Compared to other groups, our performance was in the middle of the range, except for Precision, where our system was equal to the best; we attribute this to our use of exact title matches in the IR stage. We found that judging snippets Novel when preparing training data was fairly easy, but that judging them Important was subjective. In future work we will vary the approach depending on the topic type, exploit co-references in conjunction with exact matches, and make use of the elaborate hyperlink structure that is a unique and most interesting aspect of Wikipedia.
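The two-stage pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the use of scikit-learn (TF-IDF plus truncated SVD standing in for the LSA step), the helper names `retrieve_candidates` and `novel_snippets`, the number of latent dimensions, and the similarity threshold are all assumptions introduced here.

```python
# Hypothetical sketch of the two-stage WiQA pipeline; parameters are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity


def retrieve_candidates(title, snippets):
    """Stage 1: keep only snippets containing an exact match of the title."""
    return [s for s in snippets if title.lower() in s.lower()]


def novel_snippets(candidates, article_sentences, n_components=50, threshold=0.6):
    """Stage 2: LSA-style novelty filter. A candidate is judged Novel if its
    best cosine similarity to any article sentence, measured in the latent
    space, stays below the threshold (both values are assumptions)."""
    if not article_sentences:
        return list(candidates)  # empty article: every candidate is new
    corpus = article_sentences + candidates
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
    # Keep the latent dimensionality valid for small corpora.
    k = max(1, min(n_components, tfidf.shape[1] - 1, len(corpus) - 1))
    lsa = TruncatedSVD(n_components=k).fit_transform(tfidf)
    article_vecs = lsa[: len(article_sentences)]
    cand_vecs = lsa[len(article_sentences):]
    sims = cosine_similarity(cand_vecs, article_vecs)
    return [c for c, row in zip(candidates, sims) if row.max() < threshold]
```

In this reading, Stage 1 mirrors the exact-title-match retrieval over the sentence-level Terrier index, and Stage 2 mirrors the LSA component that rejects snippets already covered by the article text.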
| Original language | English |
| --- | --- |
| Journal | CEUR Workshop Proceedings |
| Volume | 1172 |
| Publication status | Published - 2006 |
| Event | 2006 Cross Language Evaluation Forum Workshop, CLEF 2006, co-located with the 10th European Conference on Digital Libraries, ECDL 2006 - Alicante, Spain. Duration: 20 Sep 2006 → 22 Sep 2006 |
Keywords
- Information filtering
- Latent semantic analysis
- Question answering