TY - GEN
T1 - Towards Automatic Data Cleansing and Classification of Valid Historical Data: An Incremental Approach Based on MDD
AU - O'Shea, Enda
AU - Khan, Rafflesia
AU - Breathnach, Ciara
AU - Margaria, Tiziana
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/12/10
Y1 - 2020/12/10
AB - The project Death and Burial Data: Ireland 1864-1922 (DBDIrl) examines the relationship between historical death registration data and burial data to explore the history of power in Ireland from 1864 to 1922. Its core Big Data arises from historical records from a variety of heterogeneous sources, some aspects of which are pre-digitized and machine readable. A huge data set (over 4 million records in each source) and its slow manual enrichment (ca. 7,000 records processed so far) pose issues of quality and scalability, and create the need for a quality assurance technology that is accessible to non-programmers. An important goal for the researcher community is to produce a reusable, high-level quality assurance tool for the ingested data that is domain specific (historic data) and highly portable across data sources, thus independent of storage technology. This paper outlines the step-wise design of the finer-granular digital format, aimed at storage and digital archiving, and the design and testing of two generations of the techniques used in the first two data ingestion and cleaning phases. The first, small-scale phase was exploratory, based on metadata-enrichment transcription to Excel, and conducted in parallel with the design of the final digital format and the discovery of all the domain-specific rules and constraints for the syntactic and semantic validity of individual entries. Neither Excel-embedded quality checks nor database-specific techniques are adequate due to the technology-independence requirement. This first phase produced a Java parser with an embedded data cleaning and evaluation classifier, continuously improved and refined as insights grew. The next, larger-scale phase uses a bespoke Historian Web Application that embeds the Java validator from the parser, as well as a new Boolean classifier for valid and complete data assurance built using a Model-Driven Development technique that we also describe. This solution enforces property constraints directly at data capture time, removing the need for additional parsing and cleaning stages. The new classifier is built in an easy-to-use graphical technology, and the ADD-Lib tool it uses is a modern low-code development environment that auto-generates code in a large number of programming languages. It thus meets the technology-independence requirement, and historians are now able to produce new classifiers themselves without needing to program.
KW - Data Analytics
KW - Data Assurance
KW - Data Cleaning
KW - Data Collection
KW - Data Parsing
KW - Data Quality
KW - Ethics
KW - Historical Data
KW - Model-Driven Development
UR - http://www.scopus.com/inward/record.url?scp=85103861717&partnerID=8YFLogxK
U2 - 10.1109/BigData50022.2020.9378148
DO - 10.1109/BigData50022.2020.9378148
M3 - Conference contribution
AN - SCOPUS:85103861717
T3 - Proceedings - 2020 IEEE International Conference on Big Data, Big Data 2020
SP - 1914
EP - 1923
BT - Proceedings - 2020 IEEE International Conference on Big Data, Big Data 2020
A2 - Wu, Xintao
A2 - Jermaine, Chris
A2 - Xiong, Li
A2 - Hu, Xiaohua Tony
A2 - Kotevska, Olivera
A2 - Lu, Siyuan
A2 - Xu, Weijia
A2 - Aluru, Srinivas
A2 - Zhai, Chengxiang
A2 - Al-Masri, Eyhab
A2 - Chen, Zhiyuan
A2 - Saltz, Jeff
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 8th IEEE International Conference on Big Data, Big Data 2020
Y2 - 10 December 2020 through 13 December 2020
ER -