TY - GEN
T1 - Variance based selection to improve test set performance in Genetic Programming
AU - Azad, R. Muhammad Atif
AU - Ryan, Conor
PY - 2011
Y1 - 2011
N2 - This paper proposes to improve the performance of Genetic Programming (GP) over unseen data by minimizing the variance of the output values of evolving models along with reducing error on the training data. Variance is a well-understood, simple, and inexpensive statistical measure; it is easy to integrate into a GP implementation and can be computed over arbitrary input values even when the target output is not known. Moreover, we propose a simple variance-based selection scheme to decide between two models (individuals). The scheme is simple because, although it uses bi-objective criteria to differentiate between two competing models, it does not rely on a multi-objective optimisation algorithm. In fact, standard multi-objective algorithms can also employ this scheme to identify good trade-offs such as those located around the knee of the Pareto front. The results indicate that, despite some limitations, these proposals significantly improve the performance of GP over a selection of high-dimensional (multivariate) problems from the domain of symbolic regression. This improvement is manifested by superior results over test sets in three out of four problems, and by the fact that performance over the test sets does not degrade, as is often witnessed with standard GP; neither is this performance ever inferior to that on the training set. As with some earlier studies, these results do not find a link between expressions of small size and their ability to generalise to unseen data.
KW - Genetic Programming
KW - Over-fitting
KW - Regularization
KW - Symbolic regression
KW - Variance
UR - http://www.scopus.com/inward/record.url?scp=84860410775&partnerID=8YFLogxK
U2 - 10.1145/2001576.2001754
DO - 10.1145/2001576.2001754
M3 - Conference contribution
AN - SCOPUS:84860410775
SN - 9781450305570
T3 - Genetic and Evolutionary Computation Conference, GECCO'11
SP - 1315
EP - 1322
BT - Genetic and Evolutionary Computation Conference, GECCO'11
T2 - 13th Annual Genetic and Evolutionary Computation Conference, GECCO'11
Y2 - 12 July 2011 through 16 July 2011
ER -