Large Language Models (LLMs) have shown great promise in a wide range of artificial intelligence applications, and training one has become increasingly common. Nevertheless, even for many senior AI engineers, training these complex models remains a significant challenge. Below is a list of issues you may encounter along the way.
torch.distributed.barrier() gets stuck during multi-GPU training
First, try setting the environment variable NCCL_P2P_DISABLE=1. If that resolves the hang, the underlying fix is probably to disable PCIe ACS (Access Control Services) in the BIOS. You may need to consult the referenced link.
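As a quick check, here is a minimal sketch of setting the variable from Python before initializing the process group (NCCL reads it when the communicator is created):

import os
os.environ["NCCL_P2P_DISABLE"] = "1"  # must be set before NCCL initializes

import torch.distributed as dist
dist.init_process_group(backend="nccl")  # e.g., launched via torchrun
dist.barrier()  # should no longer hang if P2P was the culprit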
raise RuntimeError("Ninja is required to load C++ extensions")
You need to make sure the Ninja build system is installed:
sudo apt install ninja-build
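To confirm that PyTorch can actually find Ninja, you can use the helper in torch.utils.cpp_extension, which is the module that raises this error:

from torch.utils.cpp_extension import is_ninja_available
print(is_ninja_available())  # should print True after installation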
subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.
This error means a C++ compiler is missing on at least one node. Check that one is installed on every machine.
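On Debian/Ubuntu nodes (an assumption about your distribution), installing build-essential provides the c++ binary, and which confirms it is on the PATH:

sudo apt install build-essential
which c++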
Connection reset by peer in _create_c10d_store (rendezvous.py)
The default TCP store server runs on the rank-0 process. However, because processes start at different times, there is no guarantee that the rank-0 server is up before the other client processes try to connect to it. To address this issue, make sure the server process (rank 0) starts first.
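If you cannot control the launch order, a retry loop on the client side is a workaround. Below is a minimal sketch (the environment variable names match torchrun's conventions; the retry count and delay are arbitrary assumptions):

import os
import time
from datetime import timedelta
from torch.distributed import TCPStore

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
host = os.environ["MASTER_ADDR"]  # rank 0's address
port = int(os.environ["MASTER_PORT"])

# Rank 0 hosts the store; other ranks retry until the server is reachable.
for attempt in range(10):
    try:
        store = TCPStore(host, port, world_size, is_master=(rank == 0),
                         timeout=timedelta(seconds=30))
        break
    except (RuntimeError, OSError):
        time.sleep(5)  # server may not be up yet; wait and retry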
AttributeError: 'LlamaAttention' object has no attribute 'rope_theta'.
To solve this problem, upgrade transformers to 4.34.0 or later (Link).
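For example:

pip install -U "transformers>=4.34.0"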
How to install and update CUDA?
Refer to this.
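After installation, you can verify the toolkit version with:

nvcc --version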
How to improve the training efficiency of your LLM?
Resize your vocabulary so that its size is a multiple of 8; GPU tensor cores run matrix multiplications more efficiently when the dimensions are multiples of 8. A sketch follows.
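With Hugging Face transformers, this can look like the following minimal sketch (the model name is a placeholder assumption):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-model")  # placeholder name
model = AutoModelForCausalLM.from_pretrained("your-model")

# Round the vocabulary size up to the nearest multiple of 8 so the
# embedding and output-projection GEMMs map cleanly onto tensor cores.
vocab_size = len(tokenizer)
padded = (vocab_size + 7) // 8 * 8
model.resize_token_embeddings(padded)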
What should I do if my CUDA version does not match my torch build?
The best solution is to install the different CUDA versions side by side and switch between them on demand. How? Just run this command and select the desired version:
update-alternatives --config cuda
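Note that each CUDA installation has to be registered as an alternative before it shows up in that menu. For example (the install paths and priority value here are assumptions; adjust them to your system):

sudo update-alternatives --install /usr/local/cuda cuda /usr/local/cuda-11.8 118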
What's the meaning of these variables?
LOCAL_RANK: the id of a worker within a single node.
WORLD_SIZE: the total number of workers.
RANK: in fact, this means WORLD_RANK; the id of a worker across the whole world (all nodes combined). If WORLD_SIZE is four, RANK can be 0, 1, 2, or 3.
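A minimal sketch of how these variables are typically used, assuming the job was launched with torchrun (which sets all three):

import os
import torch

local_rank = int(os.environ["LOCAL_RANK"])  # GPU index within this node
rank = int(os.environ["RANK"])              # global worker id across all nodes
world_size = int(os.environ["WORLD_SIZE"])  # total number of workers

torch.cuda.set_device(local_rank)  # bind this worker to its local GPU
print(f"worker {rank}/{world_size} using cuda:{local_rank}")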