jongwon-jay-lee / c4-dataset-script
This project forked from shjwudp/c4-dataset-script
Inspired by Google's C4 (Colossal Clean Crawled Corpus), this is a collection of data cleaning scripts for processing CommonCrawl data. It also includes the Chinese data processing and cleaning methods from MassiveText.
License: MIT License