Abstract
Text classification is one of the most important tasks for extracting information from the Internet and identifying the best text representation settings. As the volume of data on the World Wide Web grows, so does the significance of text classification. Understanding and classifying the digital data available on the Internet by hand requires enormous human effort. Text classification assigns a collection of text files to different classes. Text available on the Internet is unstructured, which makes it harder to understand and classify for useful purposes. This paper proposes a context-aware text classification system to improve text quality. We use a content-aware recommendation system to extract the data from well-known news databases. Text preprocessing techniques such as tokenization, stemming, and stop-word removal are studied in detail. Unigram, bigram, and trigram attributes are also tested, and attribute selection methods are examined for their impact on the classification results. To carry out a detailed investigation and save time in the experimentation process, 11 versions of each dataset are created, and different preprocessing techniques are applied to them to understand the impact of each technique on the classification results. The proposed system is compared with the existing approach in terms of accuracy and achieves better performance.
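The preprocessing steps and n-gram attributes named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the stop-word list, the crude suffix-stripping rule (a stand-in for a real stemmer), and every function name are assumptions made here for illustration only.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Crude suffix stripping; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n):
    """Return all contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text, max_n=3):
    """Tokenize, remove stop words, stem, then build 1..max_n-gram lists."""
    tokens = [naive_stem(t) for t in remove_stop_words(tokenize(text))]
    return {n: ngrams(tokens, n) for n in range(1, max_n + 1)}


features = extract_features("The cats chased the dogs")
print(features[1])  # unigrams
print(features[2])  # bigrams
print(features[3])  # trigrams
```

Switching an individual step on or off (for example, skipping `naive_stem`) is one way to produce differently preprocessed versions of the same dataset, in the spirit of the comparison the abstract describes.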
Original language | English |
---|---|
Article number | e6489 |
Journal | Concurrency and Computation: Practice and Experience |
Volume | 35 |
Issue number | 15 |
Publication status | Published - 10 Jul 2023 |
Externally published | Yes |
Keywords
- accuracy
- algorithm
- classification
- context-aware
- data mining
- dataset
- methods
- computer