Towards Automatic Data Cleansing and Classification of Valid Historical Data: An Incremental Approach Based on MDD

Enda O'Shea, Rafflesia Khan, Ciara Breathnach, Tiziana Margaria

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The project Death and Burial Data: Ireland 1864-1922 (DBDIrl) examines the relationship between historical death registration data and burial data to explore the history of power in Ireland from 1864 to 1922. Its core Big Data arises from historical records drawn from a variety of heterogeneous sources, some of which are pre-digitized and machine readable. A huge data set (over 4 million records in each source) and its slow manual enrichment (ca. 7,000 records processed so far) pose issues of quality and scalability, and create the need for a quality assurance technology that is accessible to non-programmers. An important goal for the researcher community is to produce a reusable, high-level quality assurance tool for the ingested data that is domain specific (historic data) and highly portable across data sources, thus independent of storage technology.

This paper outlines the step-wise design of the finer-granular digital format, aimed at storage and digital archiving, and the design and test of two generations of the techniques used in the first two data ingestion and cleaning phases.

The first, small-scale phase was exploratory, based on metadata enrichment transcription to Excel, and conducted in parallel with the design of the final digital format and the discovery of all the domain-specific rules and constraints for the syntactic and semantic validity of individual entries. Excel-embedded quality checks and database-specific techniques are not adequate due to the technology-independence requirement. This first phase produced a Java parser with an embedded data cleaning and evaluation classifier, continuously improved and refined as insights grew. The next, larger-scale phase uses a bespoke Historian Web Application that embeds the Java validator from the parser, as well as a new Boolean classifier for valid and complete data assurance built using a Model-Driven Development technique that we also describe. This solution enforces property constraints directly at data capture time, removing the need for additional parsing and cleaning stages. The new classifier is built in an easy-to-use graphical technology, and the ADD-Lib tool it uses is a modern low-code development environment that auto-generates code in a large number of programming languages. It thus meets the technology-independence requirement, and historians are now able to produce new classifiers themselves without having to program.
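The entry-level validity rules and the Boolean valid-and-complete classifier described in the abstract can be pictured with a minimal Java sketch. Everything below is an illustrative assumption: the RawEntry fields, the dd/MM/yyyy date format, and the age plausibility bound are invented for the example; only the 1864-1922 registration window comes from the project description. The flat conjunction in isValidAndComplete mimics the shape of code a decision-diagram tool such as ADD-Lib can generate from a graphical model; it is not ADD-Lib's actual output.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

/**
 * Minimal sketch of a record-level validity classifier for transcribed
 * death-registration entries. Field names, formats, and thresholds are
 * illustrative assumptions, not the project's actual rule set.
 */
public class DeathRecordClassifier {

    // The registration period covered by the DBDIrl project.
    private static final LocalDate PERIOD_START = LocalDate.of(1864, 1, 1);
    private static final LocalDate PERIOD_END   = LocalDate.of(1922, 12, 31);
    private static final DateTimeFormatter FMT  = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    /** Hypothetical transcribed record: raw strings as entered by a historian. */
    public record RawEntry(String deathDate, String ageAtDeath,
                           String causeOfDeath, String registrationDistrict) {}

    /** Syntactic check: the date parses and falls inside the 1864-1922 window. */
    static boolean dateValid(String raw) {
        try {
            LocalDate d = LocalDate.parse(raw, FMT);
            return !d.isBefore(PERIOD_START) && !d.isAfter(PERIOD_END);
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    /** Semantic check: age is a plausible non-negative number of years. */
    static boolean ageValid(String raw) {
        try {
            int age = Integer.parseInt(raw.trim());
            return age >= 0 && age <= 110;   // illustrative plausibility bound
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /** Completeness check: free-text fields must be present and non-blank. */
    static boolean present(String raw) {
        return raw != null && !raw.isBlank();
    }

    /**
     * Boolean classifier combining the individual property checks, in the
     * flat conjunctive shape a decision-diagram tool could generate from a
     * graphical model: each predicate corresponds to one diagram node.
     */
    public static boolean isValidAndComplete(RawEntry e) {
        return dateValid(e.deathDate())
            && ageValid(e.ageAtDeath())
            && present(e.causeOfDeath())
            && present(e.registrationDistrict());
    }

    public static void main(String[] args) {
        RawEntry ok  = new RawEntry("14/03/1887", "63", "Phthisis", "Limerick");
        RawEntry bad = new RawEntry("14/03/1950", "63", "", "Limerick");
        System.out.println(isValidAndComplete(ok));   // true
        System.out.println(isValidAndComplete(bad));  // false: date outside period, cause missing
    }
}
```

In the tool chain the paper describes, that conjunction would not be hand-coded: the classifier is modelled graphically as a decision diagram and the executable code is auto-generated in the target language, which is what removes the need for programming skills on the historians' side.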

Original language: English
Title of host publication: Proceedings - 2020 IEEE International Conference on Big Data, Big Data 2020
Editors: Xintao Wu, Chris Jermaine, Li Xiong, Xiaohua Tony Hu, Olivera Kotevska, Siyuan Lu, Weijia Xu, Srinivas Aluru, Chengxiang Zhai, Eyhab Al-Masri, Zhiyuan Chen, Jeff Saltz
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1914-1923
Number of pages: 10
ISBN (Electronic): 9781728162515
Publication status: Published - 10 Dec 2020
Event: 8th IEEE International Conference on Big Data, Big Data 2020 - Virtual, Atlanta, United States
Duration: 10 Dec 2020 - 13 Dec 2020

Publication series

Name: Proceedings - 2020 IEEE International Conference on Big Data, Big Data 2020

Conference

Conference: 8th IEEE International Conference on Big Data, Big Data 2020
Country/Territory: United States
City: Virtual, Atlanta
Period: 10/12/20 - 13/12/20

Keywords

  • Data Analytics
  • Data Assurance
  • Data Cleaning
  • Data Collection
  • Data Parsing
  • Data Quality
  • Ethics
  • Historical Data
  • Model-Driven Development
