To raise the performance ceiling and improve the generalization ability of current speech enhancement methods based on masking and spectral mapping, a collaborative monaural speech enhancement method is proposed, built on a learning framework that combines complex-spectrum mapping and masking. An interactive cooperative learning unit (ICU) is designed in the encoder-decoder part to supervise the interactive flow of speech information and provide an effective latent feature space. In the middle layer, a multi-scale fusion Transformer extracts multi-scale details along the spatial and channel dimensions with a small number of parameters and fuses them for output, while jointly modeling sub-band and full-band speech information. Experiments on large and small datasets and across 115 noise environments show that the proposed method, with only 0.57 M parameters, achieves better subjective and objective scores than most state-of-the-art and related methods, demonstrating good robustness and effectiveness.
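The multi-scale extraction-and-fusion idea mentioned above can be sketched roughly as follows. This is a minimal NumPy illustration only: the kernel sizes, depthwise filtering, and pointwise fusion weights are hypothetical placeholders chosen for clarity, not the paper's actual module or trained parameters.

```python
import numpy as np

def multi_scale_fuse(x, kernel_sizes=(3, 5, 7), seed=0):
    """Illustrative multi-scale fusion: filter a (channels, frames)
    feature map with depthwise 1-D kernels of several receptive-field
    sizes, then mix the scales with a pointwise (1x1) projection.
    All weights are random placeholders, not trained values."""
    rng = np.random.default_rng(seed)
    c, t = x.shape
    branches = []
    for k in kernel_sizes:
        w = rng.standard_normal((c, k)) / k   # one depthwise kernel per channel
        pad = k // 2                          # "same" padding for odd k
        xp = np.pad(x, ((0, 0), (pad, pad)))
        # depthwise convolution: each channel filtered independently
        y = np.stack([np.convolve(xp[i], w[i], mode="valid")[:t]
                      for i in range(c)])
        branches.append(y)
    # concatenate all scales along the channel axis, then fuse pointwise
    stacked = np.concatenate(branches, axis=0)            # (len(ks)*c, t)
    w_point = rng.standard_normal((c, stacked.shape[0])) / stacked.shape[0]
    return w_point @ stacked                              # back to (c, t)

out = multi_scale_fuse(np.ones((4, 16)))
print(out.shape)  # (4, 16): fused output keeps the input shape
```

Each branch sees a different temporal context (kernel size), and the cheap pointwise projection keeps the parameter count of the fusion step small, which is the spirit of the lightweight design claimed in the abstract.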