Problems you may encounter during distributed training

Large Language Models (LLMs) have shown great promise across a wide range of artificial intelligence applications, and training one has become increasingly common. Nevertheless, even for many senior AI engineers, training these complex models remains a significant challenge. Below is a series of issues you may encounter.

torch.distributed.barrier() stuck during training with multiple GPUs

First, try setting the environment variable NCCL_P2P_DISABLE=1. If that resolves the hang, the underlying fix is probably to disable PCIe ACS in the BIOS....
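A minimal sketch of the workaround described above. The `torchrun` invocation and the script name `train.py` are illustrative assumptions, not taken from the post:

```shell
# Disable NCCL peer-to-peer (P2P) transport, which can hang when
# PCIe ACS interferes with direct GPU-to-GPU communication.
export NCCL_P2P_DISABLE=1

# Hypothetical multi-GPU launch; substitute your own entry point.
# torchrun --nproc_per_node=8 train.py

# Confirm the variable is visible to child processes.
echo "NCCL_P2P_DISABLE=$NCCL_P2P_DISABLE"
```

Setting the variable only sidesteps the P2P path; if this unblocks training, disabling ACS in the BIOS (as the post suggests) addresses the root cause while restoring P2P performance.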

August 11, 2023 · 2 min · author: Loong