Abstract: In long-range reconnaissance missions, multimodal image fusion aims to generate fused images that preserve the salient features of each modality while remaining photorealistic. Distant targets, however, suffer from ambiguous feature representation and from structural and textural distortion. To address these challenges, this paper proposes BalanceFusion, a fusion architecture that integrates diffusion-based reconstruction with wavelet-domain frequency-aware perception and is designed to enhance detail representation and fidelity in infrared and visible image fusion under long-range conditions. The method first employs an efficient image restoration network to strengthen target feature representation. A wavelet-conditioned fusion module then introduces frequency-domain awareness, emphasizing the thermal radiation contours of targets while mitigating a known side effect of super-resolution methods, which tend to prioritize foreground targets at the expense of background structural integrity. Finally, modality-specific high-frequency features that differ significantly between the two modalities are hierarchically aggregated across multiple scales, so that the fused image retains both sufficient discriminative detail and authentic shared structure. Experimental results on real-flight datasets show that the proposed algorithm outperforms state-of-the-art methods, achieving a 33% improvement in the structure-preserving perceptual quality metric Qabf and a 21% gain in spatial frequency (SF).
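For readers unfamiliar with the SF metric reported above, the following is a minimal sketch of its commonly used definition, in which SF is the root of the summed squared row and column frequencies of the fused image. The function name `spatial_frequency` and the normalization by total pixel count are assumptions made for illustration; this is not the evaluation code used in the paper, and other implementations may normalize slightly differently.

```python
import numpy as np


def spatial_frequency(img):
    """Spatial frequency (SF) of a single-channel image.

    Assumed definition: SF = sqrt(RF^2 + CF^2), where RF^2 and CF^2 are
    the sums of squared horizontal and vertical first differences,
    each divided by the total number of pixels.
    """
    img = np.asarray(img, dtype=np.float64)
    if img.ndim != 2:
        raise ValueError("expected a 2-D (grayscale) image")

    # Differences between horizontally adjacent pixels (row frequency).
    dh = img[:, 1:] - img[:, :-1]
    # Differences between vertically adjacent pixels (column frequency).
    dv = img[1:, :] - img[:-1, :]

    rf_sq = np.sum(dh ** 2) / img.size
    cf_sq = np.sum(dv ** 2) / img.size
    return float(np.sqrt(rf_sq + cf_sq))
```

A higher SF indicates more gradient activity, that is, richer edges and texture, so a perfectly flat image scores zero. A relative gain between two methods can then be computed as `(sf_new - sf_base) / sf_base`.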