Large language models (LLMs) have already demonstrated impressive capabilities, and many startups now plan to train their own. However, training an LLM from scratch remains a major challenge, both in terms of compute cost and the difficulty of data collection. Against this background, continual pretraining on top of an open-source LLM is a compelling alternative.

Determine the purpose of your continual pretraining

In general, standard LLMs may not excel in specific domains such as finance, law, or trade, and in these areas the demands on LLMs are stringent. Given this, continually pretraining our own LLM is an advantageous decision. The following three questions are what we need to explore: 1) Is domain-adaptive continual pretraining helpful? 2) How should we adopt a data selection strategy? 3) Are the original capabilities retained?

Data selection

Data is the most essential component during continuous pretraining.

Figure 1. All the metrics scale with the amount of data: the more, the better. (Image source: 1st Multilingual Model Workshop - Continued Pre-training of LLMs)


How do we select and combine various datasets? In importance resampling (Xie et al.1), the researchers introduce an approach known as Data Selection with Importance Resampling (DSIR). The method compares raw and target data in an n-gram feature space to estimate importance weights.

Figure 2. For a raw dataset like The Pile, an estimator is used to obtain importance weights, and data is then selected by resampling with those weights. (Image source: Data Selection for Language Models via Importance Resampling)

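To make the idea concrete, here is a minimal Python sketch of DSIR-style selection, not the authors' implementation: fit smoothed hashed-n-gram models on the target and raw corpora, score each raw document by the log-likelihood ratio, and resample via the Gumbel-top-k trick. The bucket count, tokenization, and resampling details are illustrative assumptions.

```python
import re
from collections import Counter

import numpy as np

def ngram_counts(text, n=2, buckets=10_000):
    """Hashed bag-of-(uni+bi)gram counts for one document."""
    tokens = re.findall(r"\w+", text.lower())
    grams = tokens + [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(hash(g) % buckets for g in grams)

def fit_ngram_logprobs(docs, buckets=10_000, alpha=1.0):
    """Fit a smoothed categorical distribution over hashed n-gram buckets."""
    totals = np.full(buckets, alpha)
    for doc in docs:
        for b, c in ngram_counts(doc, buckets=buckets).items():
            totals[b] += c
    return np.log(totals / totals.sum())

def dsir_select(raw_docs, target_docs, k, buckets=10_000, seed=0):
    """Score raw docs by log p_target(x) - log p_raw(x), then resample k of them."""
    logp_target = fit_ngram_logprobs(target_docs, buckets)
    logp_raw = fit_ngram_logprobs(raw_docs, buckets)
    log_w = []
    for doc in raw_docs:
        counts = ngram_counts(doc, buckets=buckets)
        idx = np.fromiter(counts.keys(), dtype=int)
        cnt = np.fromiter(counts.values(), dtype=float)
        log_w.append(float(cnt @ (logp_target[idx] - logp_raw[idx])))
    # Gumbel-top-k trick: sample k docs without replacement, proportional to the weights.
    rng = np.random.default_rng(seed)
    keys = np.asarray(log_w) + rng.gumbel(size=len(log_w))
    chosen = np.argsort(-keys)[:k]
    return [raw_docs[i] for i in chosen]
```

In the paper, the raw distribution is a large general corpus such as The Pile and the target distribution is the domain data; the sketch above only mirrors that structure at toy scale.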

Training strategy

Learning rate setting

In general, we tend to use a warm-up strategy when fine-tuning downstream models, but some researchers (Gupta et al.2) have drawn a series of interesting conclusions about warming up (see the schedule sketch after this list):

  1. Don’t start directly at the maximum learning rate; doing so causes an initial large spike in the loss, although the spike has no lasting consequences later in training.
  2. A smaller learning rate may preserve more performance on the upstream dataset.
  3. Continual pretraining with the latest pretrained checkpoint improves performance.
  4. Rewarming is not a good option when the downstream dataset is similar to the upstream dataset.
  5. For the same dataset, a constant learning rate achieves the best performance.
  6. Although a constant learning rate gives a good starting point, rewarming performs better when training runs long enough.
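As a concrete reference for points 1, 5, and 6, below is a minimal sketch of the two options: rewarming (linear warmup followed by cosine decay) versus a constant learning rate. The peak/minimum learning rates and warmup length are illustrative assumptions, not values from the paper.

```python
import math

def lr_at_step(step, max_steps, max_lr=3e-4, min_lr=3e-5,
               warmup_steps=1000, rewarm=True):
    """Learning rate for continual pretraining at a given step.

    rewarm=True : linear warmup to max_lr, then cosine decay to min_lr.
    rewarm=False: constant learning rate (often a good default when the
                  new data is close to the original pretraining data).
    """
    if not rewarm:
        return min_lr
    if step < warmup_steps:
        # Linear warmup avoids starting directly at the maximum learning rate.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the rest of training.
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: compare the two schedules over a short run.
rewarmed = [lr_at_step(s, max_steps=10_000) for s in range(10_000)]
constant = [lr_at_step(s, max_steps=10_000, rewarm=False) for s in range(10_000)]
```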

Catastrophic Forgetting

The best-known pitfall of continual pretraining is catastrophic forgetting. Standard mitigations involve little more than mixing in general-domain data or retaining information about important parameters, as in EWC (Elastic Weight Consolidation). (Li et al.3) also find that continual pretraining may cause repetition issues.
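The simplest of these mitigations, replaying general-domain data, can be sketched in a few lines. The replay ratio and the toy document lists below are assumptions for illustration.

```python
import random

def mixed_stream(domain_docs, general_docs, replay_ratio=0.2, seed=0):
    """Yield training documents, replaying general data at a fixed ratio.

    replay_ratio=0.2 means roughly 1 in 5 documents comes from the original
    general-domain corpus, which helps limit catastrophic forgetting.
    """
    rng = random.Random(seed)
    while True:
        if rng.random() < replay_ratio:
            yield rng.choice(general_docs)
        else:
            yield rng.choice(domain_docs)

# Example usage with toy corpora.
domain = ["a financial filing ...", "a legal contract ..."]
general = ["an encyclopedia article ...", "a web page ..."]
stream = mixed_stream(domain, general, replay_ratio=0.2)
batch = [next(stream) for _ in range(8)]
```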

Figure 3. After continual training on a Traditional Chinese dataset, the model starts to repeat a sentence. (Image source: Examining Forgetting in Continual Pre-training of Aligned Large Language Models)


It seems that catastrophic forgetting is an inevitable side effect of continual pretraining. (Siriwardhana et al.4) merge the continually trained model with the original model using the TIES method, mitigating catastrophic forgetting.
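The TIES procedure itself is easy to sketch: trim each task vector to its largest-magnitude entries, elect a per-parameter sign by majority mass, and average the surviving deltas. The snippet below is a simplified two-way illustration (in practice a library such as mergekit handles this); the density value and state-dict layout are assumptions.

```python
import torch

def ties_merge(base, finetuned_models, density=0.2, lam=1.0):
    """Merge fine-tuned checkpoints into a base model, TIES-style.

    base / finetuned_models: dicts mapping parameter names to tensors.
    density: fraction of largest-magnitude deltas kept per model ("trim").
    """
    merged = {}
    for name, base_w in base.items():
        deltas = []
        for model in finetuned_models:
            delta = model[name] - base_w
            # Trim: keep only the top-`density` fraction of entries by magnitude.
            k = max(1, int(density * delta.numel()))
            threshold = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
            deltas.append(torch.where(delta.abs() >= threshold, delta, torch.zeros_like(delta)))
        stacked = torch.stack(deltas)
        # Elect a sign per parameter by total mass, then drop conflicting entries.
        sign = torch.sign(stacked.sum(dim=0))
        kept = torch.where(torch.sign(stacked) == sign, stacked, torch.zeros_like(stacked))
        # Disjoint mean over the entries that survived trimming and sign election.
        count = (kept != 0).sum(dim=0).clamp(min=1)
        merged[name] = base_w + lam * kept.sum(dim=0) / count
    return merged
```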

Annealing training


Evaluation

Tools

lm-eval

The lm-eval Python package, released by EleutherAI, aims to offer an open-source LLM evaluation framework for AI researchers. To install it, simply run:

pip install lm-eval

Here is an interesting blog providing a tutorial for beginners.
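For example, a continually pretrained Hugging Face checkpoint can be evaluated through the Python API roughly as follows. This is a sketch assuming lm-eval v0.4+; the model name and task list are placeholders, and argument names may vary slightly between versions.

```python
import lm_eval

# Evaluate a (hypothetical) continually pretrained checkpoint on a few tasks.
results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face backend
    model_args="pretrained=your-org/your-continued-llm,dtype=bfloat16",
    tasks=["hellaswag", "mmlu"],
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (accuracy, etc.) live under results["results"].
print(results["results"])
```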

opencompass

OpenCompass is a one-stop platform for large language model (LLM) evaluation.