After pretraining on vast datasets and supervised fine-tuning with diverse instruction sets, Large Language Models (LLMs) have achieved remarkable capabilities in text generation. However, although LLMs can generate seemingly reasonable sequences, free from grammatical errors and redundant words, they may still produce content that lacks truthfulness or accuracy. Are there any methods to mitigate these shortcomings? Researchers at OpenAI have framed these issues as the challenge of LLM alignment. Currently, one of the most prominent approaches to address these challenges is Reinforcement Learning from Human Feedback (RLHF). To implement RLHF, OpenAI adopted the Proximal Policy Optimization (PPO) algorithm.

multi-turn instruction tuning

Most instruction-following studies and benchmarks overlook the multi-turn instruction-following capability of LLMs, even though multi-turn interaction is the more common demand in real-world scenarios. It would therefore be no exaggeration to say that multi-turn conversation ability is one of the most significant capabilities of LLMs.

Parrot: enhancing multi-turn instruction following for LLMs

A multi-turn example in which contextual information needs to be utilized by LLMs. (Image source: Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models)


The most common way for humans to interact with LLMs is multi-turn conversation. Parrot1 presents a solution aimed at enhancing multi-turn instruction following for LLMs.

  1. Dataset Collection: the authors propose training a specialized Parrot-Ask model, based on LLaMA, to generate queries in the style of available real user-ChatGPT logs, then employ Parrot-Ask to interact with an assistant LLM, collecting 40K multi-turn instruction tuning examples.
  2. Training the Parrot-Ask Model: training this model is the inverse of standard instruction tuning. Compared to common supervised fine-tuning, the Parrot-Ask model is trained to predict query tokens instead of assistant output tokens; a loss-masking sketch follows this list. Concretely, the authors use LLaMA-13B-Chat and 90K ShareGPT conversations to train this model.
  3. CaPO Dataset Collection: the authors sample 10K dialogues that rely heavily on contextual information and apply three strategies to generate negative responses, thereby collecting a 10K Context-Aware Preference Optimization (CaPO) dataset.
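A minimal sketch of the "inverse" loss masking used to train a Parrot-Ask-style model, assuming a conversation has already been tokenized into per-role segments (the `segments` structure, toy token ids, and helper name are illustrative assumptions, not the paper's code):

```python
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def build_parrot_ask_labels(segments):
    """Build training labels for a Parrot-Ask-style model.

    `segments` is a hypothetical list of (role, token_ids) tuples for one
    multi-turn conversation. Standard SFT keeps the loss on assistant tokens;
    here we invert that and keep the loss only on user-query tokens.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "user":
            labels.extend(token_ids)                        # learn to generate the query
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # mask assistant turns
    return torch.tensor(input_ids), torch.tensor(labels)

# Toy example: two user turns and one assistant turn.
segments = [
    ("user", [11, 12, 13]),       # first user query (supervised)
    ("assistant", [21, 22, 23]),  # assistant reply (masked)
    ("user", [14, 15]),           # follow-up query (supervised)
]
input_ids, labels = build_parrot_ask_labels(segments)
print(labels)  # tensor([  11,   12,   13, -100, -100, -100,   14,   15])
```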
The process of Parrot. (a) First, train the Parrot-Ask model on real user-ChatGPT logs to learn how real users pose queries, and utilize it to iteratively interact with ChatGPT to collect multi-turn instruction-response pairs. (b) Then construct negative responses for queries that rely heavily on context for answering with three strategies to simulate three types of error cases. Finally, use the collected data to train the Parrot-Chat model by (c) instruction tuning and (d) context-aware preference optimization. (Image source: Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models)



human preference optimization


Make LLMs refuse to answer unknown questions

R-Tuning, introduced in (Zhang et al., 20232), aims to equip Large Language Models (LLMs) with the ability to decline answering unknown questions. It leverages the instruction tuning approach, following a two-step process:

  1. Uncertainty Identification: the model is first evaluated on the training data. By running inference on the training data once and comparing each prediction with its label, the instruction tuning data is split into uncertain data and certain data.
  2. Refusal-Aware Data Construction: uncertainty expressions are appended to the labels of the uncertain data points, while certainty expressions are appended to the certain ones. This newly constructed “refusal-aware data” is then used to fine-tune the LLM, enabling it to recognize and decline unknown questions (a sketch of both steps follows this list).
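A minimal sketch of the two steps under simple assumptions: `generate_answer(question)` is an assumed helper that runs one greedy inference pass with the model being tuned, the dataset is a list of question-answer dicts, and the certainty/uncertainty phrases are illustrative placeholders rather than the paper's exact template:

```python
def identify_uncertainty(dataset, generate_answer):
    """Split instruction-tuning data by whether the model already answers correctly.

    `dataset` is a list of dicts with "question" and "answer" keys.
    """
    certain, uncertain = [], []
    for example in dataset:
        prediction = generate_answer(example["question"])
        if prediction.strip() == example["answer"].strip():
            certain.append(example)      # model already knows this one
        else:
            uncertain.append(example)    # likely an "unknown" question for the model
    return certain, uncertain

def build_refusal_aware_data(certain, uncertain):
    """Append certainty / uncertainty expressions to the labels (placeholder phrases)."""
    padded = []
    for example in certain:
        padded.append({"question": example["question"],
                       "answer": example["answer"] + " I am sure."})
    for example in uncertain:
        padded.append({"question": example["question"],
                       "answer": example["answer"] + " I am unsure."})
    return padded
```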
The workflow of constructing refusal-aware data. (Image source: R-Tuning: Teaching Large Language Models to Refuse Unknown Questions)


The purpose of R-Tuning is to alleviate LLM hallucination on unknown questions. However, it does not take human preference responses into consideration.

direct preference optimization

DPO (Direct Preference Optimization)3, which evolved from the pair-wise formulation of the reward model introduced in InstructGPT4, simplifies the RLHF (Reinforcement Learning from Human Feedback) process into a one-step optimization. The loss function is reformulated as follows: $$\mathcal{L}_{DPO}(\pi_{\theta};\pi_{ref})=-\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}[\log\sigma(\beta\log\frac {\pi_{\theta}(y_w|x)} {\pi_{ref}(y_w|x)} - \beta\log\frac {\pi_{\theta}(y_l|x)} {\pi_{ref}(y_l|x)} )]$$ where $y_w$ denotes the accepted (preferred) response and $y_l$ the rejected (less preferred) response. This formulation shows that DPO directly optimizes the margin between the implicit rewards of the preferred and rejected responses, effectively enhancing the model’s ability to generate preferred outputs.
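A minimal PyTorch sketch of this loss, assuming the per-response log-probabilities (summed over response tokens) under the trainable policy and the frozen reference model have already been computed; the function name and tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from per-response log-probabilities (each tensor has shape [batch])."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref for y_w
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref for y_l
    logits = beta * (chosen_logratio - rejected_logratio)           # margin inside sigma(.)
    return -F.logsigmoid(logits).mean()

# Toy example with random log-probabilities for a batch of 4 preference pairs.
b = 4
loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
print(loss)
```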


KTO

Sometimes, a preference dataset in a pairwise format is hard to obtain. In such cases, we can use preference data where each sample carries only a label of ‘1’ for acceptance or ‘-1’ for rejection. KTO[^4] is proposed for this scenario. Its mathematical formulation is: $$\mathcal{L}_{KTO}(\pi_{\theta},\pi_{ref}) = \mathbb{E}_{x,y\sim\mathcal{D}}[\lambda_y-v(x,y)]$$ where $$r_{\theta}(x, y) = \log\frac{\pi_{\theta}(y|x)} {\pi_{ref}(y|x)}$$ $$z_0 = \text{KL}(\pi_{\theta}(y^{\prime}|x)\,||\,\pi_{ref}(y^{\prime}|x))$$ $$v(x, y) = \begin{cases} \lambda_D\,\sigma(\beta(r_{\theta}(x,y) - z_0)) & \text{if}\ y \sim y_{desirable}|x \\ \lambda_U\,\sigma(\beta(z_0-r_{\theta}(x, y))) & \text{if}\ y \sim y_{undesirable}|x \end{cases}$$

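A minimal PyTorch sketch of the value function above, assuming per-response log-probabilities are precomputed and treating the reference KL term $z_0$ as a given constant (in practice it is estimated from mismatched pairs in the batch and detached from the graph); the function name and default $\lambda$ values are illustrative:

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable, z0,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """KTO-style loss from per-response log-probabilities (shape: [batch]).

    `is_desirable` is a boolean tensor: True for '1' (accepted) samples,
    False for '-1' (rejected) samples. `z0` approximates the reference KL term.
    """
    r = policy_logps - ref_logps                              # r_theta(x, y)
    v_desirable = lambda_d * torch.sigmoid(beta * (r - z0))   # value for accepted samples
    v_undesirable = lambda_u * torch.sigmoid(beta * (z0 - r)) # value for rejected samples
    v = torch.where(is_desirable, v_desirable, v_undesirable)
    lam = torch.where(is_desirable, torch.tensor(lambda_d), torch.tensor(lambda_u))
    return (lam - v).mean()                                   # E[lambda_y - v(x, y)]

# Toy example: batch of 4 with two accepted and two rejected samples.
loss = kto_loss(torch.randn(4), torch.randn(4),
                torch.tensor([True, True, False, False]), z0=torch.tensor(0.0))
print(loss)
```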