ICML 2026 Submission

Remove the Ambiguity: Few-shot Multimodal Anomaly Detection Using Crossmodal Feature Replacer

CFR resolves one-to-many RGB-3D feature ambiguity by combining cyclic crossmodal reconstruction with selective inference-time feature replacement.

Abstract

Reconstruction-based multimodal anomaly detection suffers from one-to-many crossmodal mapping: a single 3D feature can correspond to multiple plausible RGB appearances. Deterministic regression collapses these valid targets into over-smoothed reconstructions, weakening anomaly discrimination. CFR learns bidirectional cyclic mappings for coarse crossmodal reconstruction, identifies unreliable reconstructed features, and selectively replaces them with high-confidence normal features at inference time.

92.3 AUPRO@30% FPR on MVTec 3D-AD, 1-shot
82.7 AUPRO@30% FPR on Eyecandies, 1-shot
74.0 Image-level AUROC on MVTec 3D-AD, 1-shot
75.9 Image-level AUROC on Eyecandies, 1-shot

Method

CFR treats reconstruction and retrieval as complementary tools. Reconstruction provides an initial RGB-3D prediction, while retrieval is invoked only for ambiguous regions whose reconstructed features are unreliable.

Figure: Overview of Crossmodal Feature Replacer. CFR reconstructs crossmodal feature maps, retrieves confident normal candidates, and replaces unreliable features before anomaly scoring.

1. Crossmodal feature learning

Frozen RGB and 3D feature extractors feed lightweight cyclic mappings that align appearance and geometry while preserving bidirectional consistency.
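The bidirectional consistency described above can be illustrated with a small sketch. It assumes linear mappings and squared-error losses purely for illustration; the page does not specify CFR's mapping architecture or loss, and the names `cyclic_losses`, `w_r2d`, and `w_d2r` are hypothetical.

```python
import numpy as np

def cyclic_losses(f_rgb, f_3d, w_r2d, w_d2r):
    """Return (crossmodal, cycle) mean-squared errors for two linear mappings.

    f_rgb: (N, C_rgb) features from a frozen RGB extractor.
    f_3d:  (N, C_3d) features from a frozen 3D extractor.
    w_r2d: (C_rgb, C_3d) RGB-to-3D weights (assumed linear for this sketch).
    w_d2r: (C_3d, C_rgb) 3D-to-RGB weights (assumed linear for this sketch).
    """
    pred_3d = f_rgb @ w_r2d   # RGB -> 3D
    pred_rgb = f_3d @ w_d2r   # 3D -> RGB

    # Crossmodal terms: each prediction should match the other modality.
    cross = np.mean((pred_3d - f_3d) ** 2) + np.mean((pred_rgb - f_rgb) ** 2)

    # Cycle terms: mapping to the other modality and back should return the input.
    cycle = (np.mean((pred_3d @ w_d2r - f_rgb) ** 2)
             + np.mean((pred_rgb @ w_r2d - f_3d) ** 2))
    return cross, cycle
```

In a training loop, the two terms would typically be summed, possibly with a weighting factor, and minimized over the mapping parameters only, since the extractors stay frozen.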

2. Attention-aided retrieval

Key-value crossmodal memory banks store paired normal features. A compact attention module retrieves candidates that are more reliable than collapsed reconstructions.
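A minimal sketch of attention over a key-value memory bank follows. It assumes that keys hold features from one modality and values hold the paired features from the other, and it uses cosine similarity with a softmax temperature. These choices, the function name `retrieve`, and the `temperature` parameter are illustrative assumptions rather than details given on this page.

```python
import numpy as np

def retrieve(query, keys, values, temperature=0.1, eps=1e-8):
    """Soft-retrieve target-modality features from a key-value memory bank.

    query:  (N, C_k) source-modality features of the test sample.
    keys:   (M, C_k) source-modality features of normal samples.
    values: (M, C_v) paired target-modality features of the same normal samples.
    Returns the retrieved features (N, C_v) and attention weights (N, M).
    """
    # Cosine similarity between each query and every key.
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + eps)
    k = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + eps)
    logits = (q @ k.T) / temperature

    # Numerically stable softmax over the memory entries.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Each retrieved feature is an attention-weighted mix of stored values.
    return weights @ values, weights
```

A low temperature makes the retrieval close to a nearest-neighbor lookup, which keeps the retrieved feature near one stored normal value instead of an average over several.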

3. Feature replacement

At inference time, CFR filters low-confidence reconstructed features and selectively replaces them with high-confidence normal features for sharper anomaly maps.
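The replacement step can be sketched as a masked selection. The sketch assumes a per-location confidence score in [0, 1], a fixed threshold, and an L2 feature distance as the anomaly score. How CFR actually estimates confidence and computes scores is not stated on this page, so `confidence`, `threshold`, and both function names are hypothetical.

```python
import numpy as np

def replace_unreliable(reconstructed, retrieved, confidence, threshold=0.5):
    """Keep reconstructed features where confidence >= threshold, else use retrieved ones.

    reconstructed, retrieved: (N, C) features for N spatial locations.
    confidence: (N,) reliability score per location (assumed to lie in [0, 1]).
    Returns the refined features (N, C) and the boolean replacement mask (N,).
    """
    unreliable = confidence < threshold
    refined = np.where(unreliable[:, None], retrieved, reconstructed)
    return refined, unreliable

def anomaly_scores(observed, refined):
    """Per-location score: L2 distance between observed and refined features."""
    return np.linalg.norm(observed - refined, axis=1)
```

Only locations flagged as unreliable are touched, so confident reconstructions pass through unchanged.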

Results

Under 1-, 2-, and 4-shot normal-only training, CFR improves both image-level detection and pixel-level localization on MVTec 3D-AD and Eyecandies.

Dataset     | Setting | I-AUROC | AUPRO@30% FPR | Observation
MVTec 3D-AD | 1-shot  | 74.0    | 92.3          | Strong localization with only one normal sample per class.
MVTec 3D-AD | 4-shot  | 80.5    | 94.2          | More normal samples improve both image and pixel metrics.
Eyecandies  | 1-shot  | 75.9    | 82.7          | Large gains on appearance-diverse objects with RGB-3D ambiguity.
Eyecandies  | 4-shot  | 77.9    | 84.7          | Maintains high localization quality in richer few-shot settings.

Figure: Qualitative anomaly localization comparison on Licorice Sandwich samples. CFR reduces false positives caused by color-geometry ambiguity.
Figure: AUPRO-FPR curves for MVTec 3D-AD and Eyecandies, showing robust localization across false-positive budgets.

Why It Works

One-to-many crossmodal correspondence makes deterministic RGB-3D regression brittle. CFR avoids forcing a single reconstruction to explain every valid appearance. Instead, it detects where reconstruction is uncertain and corrects only those feature locations.
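A toy example makes the collapse concrete. It uses hypothetical one-dimensional data in which the same 3D feature is paired with two equally valid RGB targets, -1 and +1. The constant that minimizes mean squared error is the mean of the targets, which matches neither valid appearance, while a comparison against the nearest valid target leaves no residual on normal data.

```python
import numpy as np

# Hypothetical toy data: one 3D feature value observed with two valid RGB targets.
targets = np.array([-1.0, 1.0] * 50)

# The MSE-optimal constant prediction is the mean of the targets.
collapsed_prediction = targets.mean()

# Residual of the collapsed prediction on normal data: nonzero everywhere.
collapsed_residuals = np.abs(targets - collapsed_prediction)

# Alternative: compare each observation with the nearest valid target.
modes = np.array([-1.0, 1.0])
nearest = modes[np.abs(targets[:, None] - modes[None, :]).argmin(axis=1)]
nearest_residuals = np.abs(targets - nearest)

print(collapsed_prediction)        # 0.0
print(collapsed_residuals.max())   # 1.0
print(nearest_residuals.max())     # 0.0
```

The nonzero residual on normal samples is what inflates false positives in ambiguous regions; correcting only those locations removes it without discarding reconstruction elsewhere.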

In the ablation study, attention-aided retrieval improves the mean 1-shot I-AUROC from 75.1 to 81.3 and AUPRO@30% FPR from 82.6 to 85.4 on ambiguity-heavy categories.

Figure: Qualitative examples on Eyecandies showing RGB, ground truth, depth, and CFR anomaly heatmaps. Despite similar 3D geometry and diverse RGB appearances, CFR localizes anomalies while suppressing ambiguity-induced responses.

Citation

The submission TeX currently uses anonymous placeholder authors, so the citation block keeps the author field anonymous until the public author list is ready.

@inproceedings{cfr2026,
  title = {Remove the Ambiguity: Few-shot Multimodal Anomaly Detection Using Crossmodal Feature Replacer},
  author = {Anonymous},
  booktitle = {International Conference on Machine Learning},
  year = {2026}
}