
MIAM

Modality Imbalance-Aware Masking for Multimodal Ecological Applications


ICLR 2026

TL;DR

We introduce MIAM (Modality Imbalance-Aware Masking), a dynamic, score-driven masking strategy that adapts during training to counter modality imbalance using modality-specific learning dynamics. Models trained with MIAM can be evaluated under any combination of input tokens, enabling fine-grained analysis of modality and token importance while improving robustness to missing data in ecological applications.


Overview of MIAM. (Left) Each modality token is masked with probability p_m, sampled from a mixture of product Beta distributions. (Right) The distribution is modulated by ρ_m^s and ρ_m^d, derived from per-modality performance s_m and its absolute derivative d_m. Modalities that perform strongly and learn stably are masked more often to mitigate modality imbalance.

Introduction

Ecological modeling underpins conservation, climate adaptation, and environmental management. Modern datasets are inherently multimodal, combining heterogeneous signals such as environmental tabular variables πŸ“Š, climate time series πŸ“ˆ, audio πŸ”Š, natural images πŸ“·, and satellite imagery πŸ›°οΈ.

Learning and inferring from such data is challenging because:

  • inputs are often incomplete, with entire modalities or partial observations missing;
  • modalities contribute unequally, leading to modality imbalance during optimization;
  • interpretability is crucial, both across modalities and within each modality.

Masking strategies provide a principled way to improve robustness to missing data while enabling structured analysis of modality contributions.

Why MIAM?

Most multimodal masking strategies are fixed and underexplore the space of possible input subsets. As a result, they improve robustness only partially and do not explicitly address modality imbalance, where dominant modalities hinder the learning of complementary ones.

We formalize masking strategies as probability distributions over the unit hypercube, where each dimension corresponds to a modality. Existing approaches can be interpreted within this unified framework.
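As a toy illustration of this framing (our own sketch, not code from the paper): each strategy is a distribution over masking-probability vectors p ∈ [0, 1]^M, and fixed strategies occupy a single point of that space.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 3  # number of modalities

# A fixed strategy is a point mass in the hypercube: it always
# returns the same masking-probability vector (0.3 here is arbitrary).
def fixed_strategy():
    return np.full(M, 0.3)

# Uniform sampling has full support: any vector in [0, 1]^M can occur,
# but no region of the hypercube (e.g., its corners) is prioritized.
def uniform_strategy():
    return rng.random(M)
```

Under this view, MIAM differs from both: it keeps full support while concentrating mass near the hypercube corners.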

MIAM πŸ˜‹

MIAM provides a principled alternative built on three properties:

  • Full support: We sample masking probabilities over the entire hypercube, allowing any combination of masked and unmasked modalities to occur during training.
  • Corner prioritization: Instead of sampling uniformly, MIAM uses a corner-anchored mixture of product Beta distributions. Each mixture component concentrates probability mass near one of the 2^M hypercube corners, encouraging training on informative subsets (e.g., single-modality or near-complete inputs).
  • Imbalance awareness: MIAM dynamically adjusts the sharpness of these Beta distributions based on modality-specific learning dynamics. Modalities that achieve high unimodal performance and exhibit stable learning are masked more frequently, encouraging the model to focus on under-optimized or slower-learning modalities.
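The three properties above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the authors' implementation: the rule combining s_m and d_m into ρ, the corner-selection bias, and `alpha_base` are all our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mask_probs(scores, derivs, alpha_base=5.0):
    """Sample per-modality masking probabilities p_m.

    scores : array of per-modality unimodal performance s_m in [0, 1]
    derivs : array of absolute derivatives d_m (learning-stability proxy)
    """
    M = len(scores)
    # rho is high for modalities that perform well (high s_m) and learn
    # stably (low d_m); this combination rule is our assumption.
    rho = np.clip(scores * (1.0 - derivs), 0.05, 0.95)
    # Pick a hypercube corner (one of 2^M): coordinate 1 means "mask this
    # modality". Strong, stable modalities land on the masked side more often.
    corner = rng.random(M) < rho
    # Sharper Beta components for high-rho modalities concentrate
    # probability mass closer to the chosen corner.
    alpha = 1.0 + alpha_base * rho
    p = np.where(corner,
                 rng.beta(alpha, np.ones(M)),   # component anchored near p_m = 1
                 rng.beta(np.ones(M), alpha))   # component anchored near p_m = 0
    return p

# Example: modality 0 performs well and learns stably, so on average it
# receives higher masking probabilities than modality 2.
p = sample_mask_probs(np.array([0.9, 0.4, 0.2]),
                      np.array([0.05, 0.3, 0.4]))
mask = rng.random(3) < p   # tokens of modality m dropped with prob p[m]
```

Because every p_m stays strictly inside (0, 1), any subset of modalities can still be masked or kept on a given step; the corner anchoring only biases which subsets appear most often.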

Main contributions ✨

  • Dynamic, imbalance-aware masking (MIAM) that reduces modality imbalance and promotes complementary multimodal learning βš–οΈ
  • Consistent improvements on ecological multimodal benchmarks, especially for under-optimized modalities πŸ“ˆ
  • Fine-grained interpretability across and within modalities (variables, time segments, image regions) πŸ”Ž

For full methodological details and experimental results, see the paper πŸ“„.

Citation

@inproceedings{
    zbinden2026miam,
    title={MIAM: Modality Imbalance-Aware Masking for Multimodal Ecological Applications},
    author={Robin Zbinden and Wesley Monteith-Finas and Gencer Sumbul and Nina van Tiel and Chiara Vanalli and Devis Tuia},
    booktitle={International Conference on Learning Representations (ICLR)},
    year={2026},
    url={https://openreview.net/forum?id=oljjAkgZN4}
}