
INFOCOM’26 Achievement by Prof. Zhibin Wang’s Research Group: The Chameleon System Enables Fault-Tolerant Large Model Training

2025-12-24

To address the frequent failures that occur during large-scale model training, the research group led by Prof. Zhibin Wang from our school, in collaboration with Huawei Technologies Co., Ltd., has proposed a fault-tolerant training system named Chameleon. Undergraduate students Peng Jiang and Qianyu Jiang from our school participated in the work. The related paper has been accepted by the top-tier international conference INFOCOM 2026.

Challenges in Large Model Training

With the rapid advancement of large-scale foundation models, model training has become increasingly complex and computationally intensive. The training process is often interrupted by unexpected events such as hardware failures, network disruptions, and other system-level faults. Considering that large-scale training may last for months and involve thousands of accelerators, efficiently and seamlessly recovering from failures has become a critical challenge for engineers.

Traditional fault-tolerance approaches, such as redundant computation, dynamic parallelism, and data rerouting, can help resume training but often lead to significant performance degradation. Therefore, a key challenge lies in minimizing the loss of training efficiency when failures occur and restoring the training process to its pre-failure performance as quickly as possible.

An Adaptive Fault-Tolerance Solution

To tackle this challenge, Prof. Zhibin Wang’s research group, together with Huawei Technologies Co., Ltd., proposes the Chameleon system.

Designed for environments where failures frequently occur during large model training, Chameleon dynamically selects between dynamic parallelism and data rerouting strategies in real time. Through unified performance modeling, execution plan search, and communication optimization, the system enables efficient adaptive fault tolerance during large-scale training.

Performance Modeling

Chameleon first constructs a comprehensive performance model that covers all training stages and strategies. This model can evaluate execution time and memory consumption under various failure scenarios.

Based on this performance model, Chameleon can predict the behavior of different recovery strategies in advance, preventing issues such as memory overflow and ensuring stable training during fault recovery.
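The paper's full model is not reproduced here; as a rough illustration only, the Python sketch below (all fields and formulas are hypothetical simplifications) estimates per-iteration time with a 1F1B-style pipeline formula and rejects configurations whose re-sharded layers would overflow device memory.

```python
from dataclasses import dataclass

@dataclass
class RecoveryPlan:
    """One candidate post-failure configuration (all fields hypothetical)."""
    surviving: int        # accelerators still alive
    pp_degree: int        # pipeline-parallel degree
    dp_degree: int        # data-parallel degree
    micro_batches: int    # micro-batches per training iteration
    layers: int           # total model layers
    layer_time_ms: float  # fwd+bwd time of one layer on one device
    layer_mem_gb: float   # weights + activations of one layer
    device_mem_gb: float  # per-device memory capacity

def is_feasible(p: RecoveryPlan) -> bool:
    """A plan must fit on the surviving devices and within their memory."""
    if p.pp_degree * p.dp_degree > p.surviving:
        return False
    per_device_gb = (p.layers / p.pp_degree) * p.layer_mem_gb
    return per_device_gb <= p.device_mem_gb   # avoid memory overflow

def iteration_time_ms(p: RecoveryPlan) -> float:
    """1F1B pipeline estimate: (micro_batches + pp_degree - 1) stage steps,
    each stage holding layers / pp_degree layers."""
    stage_ms = (p.layers / p.pp_degree) * p.layer_time_ms
    return (p.micro_batches + p.pp_degree - 1) * stage_ms
```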

Execution Plan Search

When a failure occurs, Chameleon rapidly searches for and selects the optimal recovery strategy based on the performance model and system resource constraints.

During this process, the system considers multiple factors, including parallelism configuration, node distribution, and the allocation of data and model layers. This ensures that after recovery, the training process can quickly return to an optimal performance state.
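How Chameleon structures this search is detailed in the paper; the brute-force sketch below (function names and all numbers are made up) only shows the general shape: enumerate candidate parallel degrees for the surviving devices, discard plans the performance model predicts would overflow memory, and keep the predicted fastest.

```python
def search_recovery_plan(surviving, layers, cost_fn, mem_fn, mem_cap_gb):
    """Brute-force search over (pipeline, data) parallel degrees for the
    surviving devices; returns the feasible plan with minimum predicted cost.
    cost_fn(pp, dp) and mem_fn(pp, dp) come from the performance model."""
    best, best_cost = None, float("inf")
    for pp in range(1, surviving + 1):
        if layers % pp:                      # require an even layer split
            continue
        for dp in range(1, surviving // pp + 1):
            if mem_fn(pp, dp) > mem_cap_gb:  # would overflow device memory
                continue
            c = cost_fn(pp, dp)
            if c < best_cost:
                best, best_cost = (pp, dp), c
    return best

# Toy usage with made-up numbers: 30 surviving devices, a 48-layer model,
# 8 micro-batches, 3.2 ms per layer, 1.5 GB per layer, 32 GB devices.
plan = search_recovery_plan(
    surviving=30, layers=48,
    cost_fn=lambda pp, dp: (8 + pp - 1) * (48 / pp) * 3.2 / dp,
    mem_fn=lambda pp, dp: (48 / pp) * 1.5,
    mem_cap_gb=32.0,
)
print(plan)   # best feasible (pp_degree, dp_degree) pair
```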

Communication Optimization

Under dynamic parallelism strategies, communication overhead is often a key factor limiting training performance. Chameleon therefore introduces communication optimization techniques that reduce the overhead of weight transmission and data synchronization during recovery.

For example, Chameleon models weight transfer as a bipartite graph matching problem, while asynchronous communication is optimized via a graph coloring formulation. These techniques maximize communication parallelism and effectively reduce communication latency.
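The paper gives the exact formulations; as a rough illustration of both ideas, the sketch below uses SciPy's Hungarian-algorithm solver for the weight-transfer matching and a simple greedy heuristic for the coloring, with made-up costs and conflicts throughout.

```python
from itertools import count
import numpy as np
from scipy.optimize import linear_sum_assignment

# Bipartite matching: cost[i][j] = hypothetical time (ms) for replacement
# device i to pull its weight shard from source device j.
cost = np.array([
    [40.0, 15.0, 30.0],
    [25.0, 35.0, 10.0],
    [20.0, 50.0, 45.0],
])
rows, cols = linear_sum_assignment(cost)   # min-cost one-to-one assignment
for i, j in zip(rows, cols):
    print(f"replacement {i} <- source {j} ({cost[i, j]:.0f} ms)")
print("slowest transfer:", cost[rows, cols].max(), "ms")

# Graph coloring: transfers that share a device or link conflict and must
# run in different rounds; each color corresponds to one communication round.
conflicts = {0: {1}, 1: {0, 2}, 2: {1}}    # hypothetical conflict graph
color = {}
for v in sorted(conflicts, key=lambda v: -len(conflicts[v])):
    used = {color[u] for u in conflicts[v] if u in color}
    color[v] = next(c for c in count() if c not in used)
print(color)   # {1: 0, 0: 1, 2: 1}: transfers 0 and 2 run in parallel
```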

Experimental Evaluation

Experiments conducted on a cluster of 32 Ascend 910B AI accelerators demonstrate that post-recovery training performance with Chameleon remains within 11% of normal, failure-free training performance, achieving efficient and stable fault-tolerant training.

Compared with existing methods such as Oobleck and Recycle, Chameleon improves training throughput by factors of 1.229 and 1.355, respectively, demonstrating superior performance.

Researchers and students interested in performance analysis and optimization of large model training and inference systems are welcome to contact us for further discussion: wzbwangzhibin@gmail.com.
