This paper introduces Nemotron-CC, a method for transforming Common Crawl into a refined large-scale pretraining dataset for large language models (LLMs) by combining classifier ensembling, synthetic data rephrasing, and reduced reliance on traditional heuristic filters. The approach yields a 6.3T-token dataset with substantially more unique, high-quality tokens than previous datasets such as DCLM, improving model performance over long-horizon pretraining.
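To make the classifier-ensembling idea concrete, below is a minimal sketch of how scores from multiple quality classifiers can be combined into a single per-document quality rating. The classifier inputs, the quantile bucketing, and the max-over-classifiers rule are illustrative assumptions for this sketch, not the paper's exact recipe.

```python
# Sketch: ensemble multiple quality-classifier scores into quality buckets.
# Score sources and the max-of-buckets rule are assumptions for illustration.
import numpy as np

def to_quantile_buckets(scores: np.ndarray, n_buckets: int = 20) -> np.ndarray:
    """Map raw classifier scores to integer quantile buckets (0 = lowest)."""
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_buckets + 1)[1:-1])
    return np.searchsorted(edges, scores)

def ensemble_quality(score_sets: list[np.ndarray], n_buckets: int = 20) -> np.ndarray:
    """Per document, take the max bucket across classifiers, so a document
    rated highly by any one classifier keeps that higher rating."""
    bucketed = [to_quantile_buckets(s, n_buckets) for s in score_sets]
    return np.maximum.reduce(bucketed)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_docs = 10_000
    # Hypothetical raw scores from two different quality classifiers.
    clf_a = rng.normal(size=n_docs)
    clf_b = rng.normal(size=n_docs)
    quality = ensemble_quality([clf_a, clf_b])
    print("bucket counts:", np.bincount(quality, minlength=20))
```

Taking the maximum rather than the average means a document only needs one classifier's endorsement to land in a higher quality tier, which recovers useful tokens that a single strict filter would discard; this is one plausible reading of how ensembling reduces reliance on any single traditional filter.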