TY - GEN
T1 - Synthetic data generation for statistical testing
AU - Soltana, Ghanem
AU - Sabetzadeh, Mehrdad
AU - Briand, Lionel C.
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/11/20
Y1 - 2017/11/20
AB - Usage-based statistical testing employs knowledge about the actual or anticipated usage profile of the system under test for estimating system reliability. For many systems, usage-based statistical testing involves generating synthetic test data. Such data must possess the same statistical characteristics as the actual data that the system will process during operation. Synthetic test data must further satisfy any logical validity constraints that the actual data is subject to. Targeting data-intensive systems, we propose an approach for generating synthetic test data that is both statistically representative and logically valid. The approach works by first generating a data sample that meets the desired statistical characteristics, without taking into account the logical constraints. Subsequently, the approach tweaks the generated sample to fix any logical constraint violations. The tweaking process is iterative and continuously guided toward achieving the desired statistical characteristics. We report on a realistic evaluation of the approach, where we generate a synthetic population of citizens' records for testing a public administration IT system. Results suggest that our approach is scalable and capable of simultaneously fulfilling the statistical representativeness and logical validity requirements.
KW - Model-Driven Engineering
KW - OCL
KW - Test Data Generation
KW - UML
KW - Usage-based Statistical Testing
UR - http://www.scopus.com/inward/record.url?scp=85041439255&partnerID=8YFLogxK
U2 - 10.1109/ASE.2017.8115698
DO - 10.1109/ASE.2017.8115698
M3 - Conference contribution
AN - SCOPUS:85041439255
T3 - ASE 2017 - Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering
SP - 872
EP - 882
BT - ASE 2017 - Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering
A2 - Nguyen, Tien N.
A2 - Rosu, Grigore
A2 - Di Penta, Massimiliano
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 32nd IEEE/ACM International Conference on Automated Software Engineering, ASE 2017
Y2 - 30 October 2017 through 3 November 2017
ER -