# Data for LLMs
Training an LLM requires a large amount of high-quality data. Even though many major tech companies have open-sourced high-performance LLMs (e.g., LLaMA, Mistral), high-quality training data largely remains private.

## Chinese Datasets

## English Datasets

- RefinedWeb: 600B tokens.
- Dolma: open-sourced by AllenAI; contains 3T tokens plus a toolkit with several key features: high performance, portability, built-in taggers, fast deduplication, extensibility, and cloud support.
- FineWeb: 15 trillion tokens of high-quality web data.
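The deduplication step mentioned above is a core part of building such corpora. As a minimal illustration (not the actual Dolma implementation), exact-match deduplication can be done by hashing normalized document text and keeping only the first occurrence; production toolkits use faster approximate methods such as MinHash or Bloom filters on top of this idea.

```python
import hashlib

def dedupe(docs):
    """Drop exact-duplicate documents by hashing normalized text.

    This is a toy sketch of exact deduplication; real pipelines
    (e.g., the Dolma toolkit) use approximate/fuzzy methods at scale.
    """
    seen = set()
    unique = []
    for doc in docs:
        # Normalize lightly so trivially different copies collide.
        key = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["Hello world.", "hello world.", "Another page."]
print(dedupe(docs))  # ['Hello world.', 'Another page.']
```

Hashing keeps memory proportional to the number of unique documents rather than their total size, which is why even exact dedup scales to web-sized corpora.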