
New Progress Made by Prof. Luping Xiang’s Research Group in Multimodal Perception and Semantic Communication

2026-01-05

To address the limitations of traditional unimodal perception in complex environments, together with the high latency and communication overhead caused by decoupled sensing–communication architectures, the research group led by Prof. Luping Xiang from our school has proposed a Semantic-driven Integrated Multimodal Sensing and Communication (SIMAC) framework. By combining deep multimodal semantic fusion with a large language model–based channel-adaptive encoding mechanism, the framework significantly improves perception accuracy while reducing communication overhead. The related paper has been accepted for publication in IEEE Journal on Selected Areas in Communications, a top-tier international journal in the communications field (CAS Zone 1, CCF-A, Impact Factor: 17.2).

Research Background and Challenges

Traditional unimodal sensing systems (e.g., relying solely on radar or vision) often suffer from limited accuracy and insufficient perception capability in complex environments. Moreover, the conventional decoupled sensing–communication architecture not only increases processing latency but also incurs substantial communication overhead, especially under bandwidth-constrained scenarios. In addition, existing single-task–oriented systems struggle to meet the increasingly diverse and personalized perception requirements of users.

Figure 1. Illustration of the system model.

Core Innovations

To address the aforementioned challenges, the research team proposes a Semantic-driven Integrated Multimodal Sensing and Communication (SIMAC) framework, which introduces three key technical innovations:

1. Multimodal Semantic Fusion (MSF)

A multimodal semantic fusion network is designed to deeply integrate radar signals and visual images. By employing a bidirectional cross-attention mechanism, the framework effectively combines the physical spatial information from radar signals with the rich semantic features from visual images, thereby overcoming the limitations of unimodal perception.
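The fusion pattern described above can be illustrated with a minimal sketch. This is not the paper's trained network; the token dimensions, single-head attention, and mean-pooled fusion are illustrative assumptions that only show how each modality attends to the other.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    # Scaled dot-product attention: queries come from one modality,
    # keys and values from the other.
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ kv_feats

def bidirectional_fusion(radar, vision):
    # Radar tokens attend to vision tokens and vice versa; the two
    # attended streams are pooled and concatenated into one vector.
    radar_ctx = cross_attention(radar, vision)   # radar enriched by vision
    vision_ctx = cross_attention(vision, radar)  # vision enriched by radar
    return np.concatenate([radar_ctx.mean(axis=0), vision_ctx.mean(axis=0)])

rng = np.random.default_rng(0)
radar = rng.standard_normal((4, 16))   # 4 radar feature tokens, dim 16
vision = rng.standard_normal((9, 16))  # 9 visual patch tokens, dim 16
fused = bidirectional_fusion(radar, vision)
print(fused.shape)  # (32,)
```

In the bidirectional form, neither modality is treated as primary: radar geometry and visual semantics each condition the other before pooling.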

2. Large Language Model-based Semantic Encoding (LSE)

A novel LLM-based channel-adaptive encoding mechanism is introduced by employing GPT-2 as a semantic encoder. The encoder maps multimodal semantic information together with real-time channel parameters (e.g., signal-to-noise ratio and modulation schemes) into a unified latent space. This design enables the encoder to adaptively adjust to channel conditions, significantly improving transmission robustness.
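The conditioning pattern can be sketched as follows. In the paper, GPT-2 serves as the semantic encoder; here a fixed random projection stands in for it, and the modulation set, SNR normalization, and latent dimension are illustrative assumptions, so only the idea of jointly embedding semantics and channel state survives.

```python
import numpy as np

MODULATIONS = ["BPSK", "QPSK", "16QAM"]  # illustrative modulation set

def channel_adaptive_encode(semantic, snr_db, modulation, latent_dim=32, seed=0):
    """Map fused semantic features plus channel state into one latent vector.

    The channel parameters (SNR, modulation scheme) are appended to the
    semantic vector before projection, so the encoder output depends on
    the current channel conditions.
    """
    mod_onehot = np.eye(len(MODULATIONS))[MODULATIONS.index(modulation)]
    channel_state = np.concatenate([[snr_db / 30.0], mod_onehot])  # crude SNR normalization
    x = np.concatenate([semantic, channel_state])
    rng = np.random.default_rng(seed)  # fixed random weights: a stand-in, not a trained LLM
    W = rng.standard_normal((latent_dim, x.size)) / np.sqrt(x.size)
    return np.tanh(W @ x)  # bounded latent symbols for transmission

z = channel_adaptive_encode(np.ones(32), snr_db=10.0, modulation="QPSK")
print(z.shape)  # (32,)
```

Because the channel state is part of the encoder input, the same semantic content maps to different latent codes under different SNRs or modulation schemes, which is what makes the encoding channel-adaptive.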

3. Smart Sensing Decoder for Multi-task Perception (SSD)

A task-oriented sensing semantic decoder is proposed to support multiple perception services simultaneously. Through a multi-task learning strategy, a single transmission can accomplish several perception tasks, including image reconstruction, distance prediction, angle estimation, and velocity detection, thereby greatly improving service efficiency.
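The single-transmission, multi-task idea can be sketched with shared-latent task heads. The head shapes and the tiny image-patch size are hypothetical; the point is that one received latent vector feeds all four perception outputs at once.

```python
import numpy as np

def multi_task_decode(latent, seed=0):
    """Decode one latent vector into four task outputs.

    A single received latent serves image reconstruction, distance,
    angle, and velocity simultaneously; each head is a simple linear
    projection here, standing in for the learned decoder.
    """
    rng = np.random.default_rng(seed)

    def head(out_dim):
        W = rng.standard_normal((out_dim, latent.size)) / np.sqrt(latent.size)
        return W @ latent

    return {
        "image": head(8 * 8),  # tiny reconstructed image patch (8x8, flattened)
        "distance": head(1),   # range prediction
        "angle": head(1),      # angle estimation
        "velocity": head(1),   # velocity detection
    }

outputs = multi_task_decode(np.ones(32))
print(sorted(outputs))  # ['angle', 'distance', 'image', 'velocity']
```

Sharing one latent across heads is what lets a single transmission accomplish several perception tasks instead of requiring one round trip per task.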

Figure 2. Architecture of the SIMAC framework.

Experimental Results

Experiments conducted on the VIRAT video dataset demonstrate that the SIMAC framework outperforms existing unimodal and traditional RNN-based baseline methods across multiple metrics:

1. Significantly improved perception accuracy:

The root mean square error (RMSE) of distance prediction is reduced by approximately 83.3% compared with vision-only approaches and 55.0% compared with radar-only approaches.

2. Advantages of LLM-based models:

Compared with LSTM- and GRU-based encoders, the GPT-2–based semantic encoder reduces the prediction errors of angle, distance, and velocity by margins ranging from 40% up to 90%.

3. Low latency and high reconstruction quality:

The overall framework achieves an inference latency of only 1.5 ms, while the Peak Signal-to-Noise Ratio (PSNR) of reconstructed images improves by 1.2 dB.

Figure 3. Selected experimental results.

Summary

By integrating radar and vision multimodal information and leveraging large language models for channel-adaptive semantic encoding, this work successfully constructs a low-overhead, high-precision, and multi-task–capable integrated sensing and communication system. The proposed framework provides a novel solution for intelligent perception and communication in complex environments.

Researchers and students interested in this work are welcome to contact us for further discussion: luping.xiang@nju.edu.cn

