Automated anomaly detection for categorical data by repurposing a form filling recommender system

Hichem Belgacem, Xiaochen Li, Domenico Bianculli, Lionel Briand

Research output: Contribution to journal › Article › peer-review

Abstract

Data quality is crucial in modern software systems, like data-driven decision support systems. However, data quality is affected by data anomalies, which represent instances that deviate from most of the data. These anomalies affect the reliability and trustworthiness of software systems, and may propagate and cause more issues. Although many anomaly detection approaches have been proposed, they mainly focus on numerical data. Moreover, the few approaches targeting anomaly detection for categorical data do not yield consistent results across datasets.

In this article, we propose a novel anomaly detection approach for categorical data named LAFF-AD (LAFF-based Anomaly Detection), which takes advantage of the learning ability of a state-of-the-art form filling tool (LAFF) to perform value inference on suspicious data. LAFF-AD runs a variant of LAFF that predicts the possible values of a suspicious categorical field in the suspicious instance. LAFF-AD then compares the output of LAFF to the recorded values in the suspicious instance, and uses a heuristic-based strategy to detect categorical data anomalies.

We evaluated LAFF-AD by assessing its effectiveness and efficiency on six datasets. Our experimental results show that LAFF-AD can accurately determine a high range of data anomalies, with recall values between 0.6 and 1 and a precision value of at least 0.808. Furthermore, LAFF-AD is efficient, taking at most 7000 s and 735 ms to perform training and prediction, respectively.

Original language: English
Article number: 16
Journal: Journal of Data and Information Quality
Volume: 16
Issue number: 3
Publication status: Published - 4 Oct 2024

Keywords

  • Data quality
  • categorical data
  • data anomaly detection
  • machine learning
