Dataset Similarity to Assess Semisupervised Learning Under Distribution Mismatch Between the Labeled and Unlabeled Datasets

Saul Calderon-Ramirez, Luis Oala, Jordina Torrents-Barrena, Shengxiang Yang, David Elizondo, Armaghan Moemeni, Simon Colreavy-Donnelly, Wojciech Samek, Miguel A. Molina-Cabello, Ezequiel Lopez-Rubio

Research output: Contribution to journal › Article › peer-review

Abstract

Semisupervised deep learning (SSDL) is a popular strategy for leveraging unlabeled data in machine learning when labeled data is not readily available. In real-world scenarios, several unlabeled data sources are usually available, with varying degrees of distribution mismatch with respect to the labeled dataset. This raises the question of which unlabeled dataset to choose for good SSDL outcomes. Often, semantic heuristics are used to match unlabeled data with labeled data, but a quantitative and systematic approach to this selection problem would be preferable. In this work, we first test the SSDL MixMatch algorithm under various distribution mismatch configurations to study the impact on SSDL accuracy. Then, we propose a quantitative unlabeled dataset selection heuristic based on dataset dissimilarity measures, designed to systematically assess how distribution mismatch between the labeled and unlabeled datasets affects MixMatch performance. We refer to our proposed method as deep dataset dissimilarity measures (DeDiMs), designed to compare labeled and unlabeled datasets. They use the feature space of a generic Wide-ResNet, can be applied prior to learning, are quick to evaluate, and are model agnostic. The strong correlation in our tests between MixMatch accuracy and the proposed DeDiMs suggests that this approach is a good fit for quantitatively ranking different unlabeled datasets prior to SSDL training.
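The selection heuristic described above can be illustrated with a minimal sketch. The function names, the feature dimensionality, and the use of a mean-feature (MMD-style) Euclidean distance are assumptions for illustration only: the paper's actual DeDiMs are defined on features from a generic pretrained Wide-ResNet and may use different distance measures.

```python
import numpy as np


def feature_space_dissimilarity(feats_labeled, feats_unlabeled):
    """Toy dataset dissimilarity: Euclidean distance between the mean
    feature vectors of two datasets (a crude MMD-style proxy).

    In the paper's setting, the features would be extracted with a
    generic pretrained Wide-ResNet before any SSDL training is run.
    """
    mu_l = feats_labeled.mean(axis=0)
    mu_u = feats_unlabeled.mean(axis=0)
    return float(np.linalg.norm(mu_l - mu_u))


def rank_unlabeled_datasets(feats_labeled, candidates):
    """Rank candidate unlabeled datasets from most to least similar
    to the labeled dataset, prior to any SSDL training."""
    scores = {name: feature_space_dissimilarity(feats_labeled, feats)
              for name, feats in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in 128-dimensional features; real ones would come from a CNN.
    labeled = rng.normal(0.0, 1.0, size=(200, 128))
    candidates = {
        "in_distribution": rng.normal(0.0, 1.0, size=(200, 128)),
        "mismatched": rng.normal(3.0, 1.0, size=(200, 128)),
    }
    for name, score in rank_unlabeled_datasets(labeled, candidates):
        print(name, round(score, 3))
```

Because the measure is computed once per candidate dataset on fixed features, it is quick to evaluate and agnostic to the downstream SSDL model, matching the properties the abstract claims for DeDiMs.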

Original language: English
Pages (from-to): 282-291
Number of pages: 10
Journal: IEEE Transactions on Artificial Intelligence
Volume: 4
Issue number: 2
DOIs
Publication status: Published - 1 Apr 2023
Externally published: Yes

Keywords

  • Dataset similarity
  • MixMatch
  • deep learning
  • distribution mismatch
  • out of distribution data
  • semisupervised deep learning
