
Metrics-First, Language-Aware Clone Type Recognition: Auditable Signals Across C, C#, Java, and Python

  • University of Limerick
  • South East Technological University

Research output: Contribution to journal › Article › peer-review

Abstract

Modern clone research has largely converged on powerful end-to-end detectors, yet practitioners still lack auditable guidance on which concrete code characteristics best distinguish clone types and whether those signals transfer across languages. Rather than building another detector, we focus on metrics-level explainability for clone-type recognition: we examine which code-level similarity metrics (spanning lexical, structural, semantic, and behavioral families) most strongly separate Syntactic (T1/2), Type 3 (T3), and Type 4 (T4) clones, and how stable and robust those metrics are across languages. We use the GPTCloneBench within-language pairs for C, C#, Java, and Python to compute a 20-metric, format-invariant representation per pair. Structural views are derived from Tree-sitter ASTs/graphs; semantic views from CodeBERT/CodeT5 and CodeBLEU; a lightweight behavioral proxy captures decision-path overlap. We adopt a nested design: RFECV selects features within training folds, and four tabular learners (Random Forest, XGBoost, AutoGluon–Tabular, TabPFN) are evaluated per language using macro-F1 and accuracy, with non-parametric tests (Friedman → Wilcoxon+Holm), effect sizes, and feature-stability analyses (selection frequency and Top-k Jaccard). A compact, mixed-family core (TF–IDF, n-gram Dice, AST shape similarity, and CodeBERT embedding) consistently accounts for the majority of discriminative signal across languages (Top-5 mutual-information [MI] share 0.58–0.65; Top-5 Jaccard stability 0.63–0.72; low redundancy). Across learners, TabPFN yields the highest mean macro-F1 overall (72.44%), is statistically superior in C and C#, and leads in Java; AutoGluon is best in Python (69.10%). Type 3 remains the persistent bottleneck, while Type 4 often exceeds T3 when semantic features dominate. Cross-language spreads are bounded (≤ 10.67 pp), with C# showing the highest ceilings and Python the lowest. A small, auditable metric portfolio coupled with modern tabular learners provides interpretable, language-aware clone-type recognition. The results (i) identify stable, cross-language metric subsets, (ii) quantify where semantics trump structure (T4 > T3), and (iii) offer practical model guidance (TabPFN for statically typed languages; heterogeneous ensembling for dynamic ones). The accompanying artifacts enable exact regeneration and support future, metrics-first refinement of T3-sensitive signals.
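
For readers unfamiliar with two of the quantities named in the abstract, the following is a minimal illustrative sketch and not the authors' implementation. It assumes that n-gram Dice is the Dice coefficient over multisets of token n-grams, and that Top-k Jaccard stability is the mean pairwise Jaccard overlap of the top-k selected features across folds; the abstract does not give exact definitions. The function names, the default values of n and k, and the feature names in the usage lines are hypothetical.

```python
from collections import Counter
from itertools import combinations


def ngram_dice(tokens_a, tokens_b, n=3):
    """Dice coefficient over multisets of token n-grams (assumed definition)."""
    grams_a = Counter(tuple(tokens_a[i:i + n]) for i in range(len(tokens_a) - n + 1))
    grams_b = Counter(tuple(tokens_b[i:i + n]) for i in range(len(tokens_b) - n + 1))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Both fragments are shorter than n tokens: no n-grams to compare.
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((grams_a & grams_b).values())
    return 2.0 * overlap / total


def topk_jaccard_stability(rankings, k=5):
    """Mean pairwise Jaccard overlap of the top-k features (assumed definition).

    `rankings` holds one feature list per fold, ordered best first.
    """
    tops = [set(ranking[:k]) for ranking in rankings]
    pairs = list(combinations(tops, 2))
    if not pairs:
        # A single fold is trivially stable with respect to itself.
        return 1.0
    scores = [len(a & b) / len(a | b) if (a | b) else 1.0 for a, b in pairs]
    return sum(scores) / len(scores)


# Hypothetical usage: two tokenised fragments and two per-fold feature rankings.
dice = ngram_dice(["a", "b", "c", "d"], ["a", "b", "c", "e"], n=2)
stability = topk_jaccard_stability(
    [["tfidf", "dice", "ast_shape"], ["tfidf", "dice", "codebert"]], k=3
)
```

Both values lie in [0, 1]. Under these assumed definitions, a Top-5 Jaccard stability of 0.63–0.72 would mean that two folds typically share most, but not all, of their five highest-ranked metrics.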

Original language: English
Article number: e70124
Journal: Journal of Software: Evolution and Process
Volume: 38
Issue number: 5
DOIs
Publication status: Published - May 2026

Keywords

  • clone type classification
  • code clones
  • machine learning
  • multi-language analysis
  • similarity metrics
  • software engineering
