Constructing language models from online forms to aid better document representation for more effective clustering

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Clustering is the practice of finding tacit patterns in a dataset by grouping its items by similarity. When clustering documents, this is achieved by converting the corpus into a numeric representation and applying clustering techniques to that representation. Each term is assigned a value based on its frequency within a particular document, weighed against its general occurrence in the corpus. One obstacle to this aim is the polysemous nature of terms: a word can have multiple meanings, and the intended meaning is discernible only from the context in which the word is used. Thus, disambiguating the intended meaning of a term can greatly improve the efficacy of a clustering algorithm. One approach to this end is the creation of an ontology, WordNet, which can act as a look-up for the intended meaning of a term. WordNet, however, is a static source and does not keep pace with the changing nature of language. The aim of this paper is to show that, while WordNet can be effective, its static nature means it does not capture some contemporary usage of terms, particularly when the dataset is taken from online conversation forums, which are not structured in a standard document format. Our proposed solution involves using Reddit as a contemporary source that moves with new trends in word usage. To illustrate this point, we cluster comments found in online threads such as those on Reddit and compare the efficacy of different representations of these document sets.
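
The term weighting described in the abstract (frequency within a document set against occurrence across the corpus) can be illustrated with a generic TF-IDF sketch. This is not the paper's implementation: the whitespace tokenizer, the unsmoothed logarithmic IDF, the function names `tf_idf` and `cosine`, and the three sample comments are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (plain, unsmoothed TF-IDF)."""
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical comments: "bank" is used in two different senses.
comments = [
    "the bank raised interest rates",
    "interest rates at the bank fell",
    "we sat on the river bank",
]
vectors = tf_idf(comments)
```

In this toy corpus "bank" occurs in every comment, so its weight is zero in all three vectors, and the weighting alone says nothing about which sense is meant. The two finance comments are similar only through "interest" and "rates", which is the kind of gap that sense disambiguation is meant to address.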

Original language: English
Title of host publication: Knowledge Discovery, Knowledge Engineering and Knowledge Management - 9th International Joint Conference, IC3K 2017, Revised Selected Papers
Editors: David Aveiro, Jorge Bernardino, Jan L.G. Dietz, Kecheng Liu, Ana Salgado, Joaquim Filipe, Ana Fred
Publisher: Springer Verlag
Pages: 67-81
Number of pages: 15
ISBN (Print): 9783030156398
Publication status: Published - 2019
Event: 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2017 - Funchal, Madeira, Portugal
Duration: 1 Nov 2017 to 3 Nov 2017

Publication series

Name: Communications in Computer and Information Science
Volume: 976
ISSN (Print): 1865-0929

Conference

Conference: 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2017
Country/Territory: Portugal
City: Funchal, Madeira
Period: 1/11/17 to 3/11/17

Keywords

  • Classification
  • Data mining
  • Document clustering
  • Graph theory
  • Word sense disambiguation
  • WordNet
