Align LLMs
After pretraining on vast datasets and supervised fine-tuning on diverse instruction sets, Large Language Models (LLMs) have achieved remarkable capabilities in text generation. However, even when LLMs produce seemingly reasonable sequences, free from grammatical errors and redundant words, the content they generate may still lack truthfulness or accuracy. How can these shortcomings be mitigated? Researchers at OpenAI have framed this problem as the challenge of LLM alignment. Currently, one of the most prominent approaches to addressing it is Reinforcement Learning from Human Feedback (RLHF).
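As a brief sketch of what RLHF optimizes (the notation below, $\pi_\theta$ for the policy being tuned, $\pi_{\text{ref}}$ for the supervised fine-tuned reference model, $r_\phi$ for a learned reward model, and $\beta$ for a regularization coefficient, is introduced here for illustration rather than taken from this section), the commonly used objective maximizes the reward assigned to the model's responses while penalizing divergence from the reference model:

$$
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[\, r_\phi(x, y) \,\big]
\;-\;
\beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\, \pi_\theta(y \mid x) \,\big\|\, \pi_{\text{ref}}(y \mid x) \,\big]
$$

The KL term keeps the tuned policy from drifting too far from the fluent behavior learned during supervised fine-tuning while the reward term steers generations toward responses humans prefer.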