Massively crawling and extracting bitext from the Web
- Given a CommonCrawl archive ID, get the list of WET files.
- Download each WET file, unzip it, extract metadata, and dump language info for each page in the WET file.
- For each domain, calculate the total length of text in each language.
- Find the multilingual domains for a given set of languages.
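
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the project's actual code: every function name here is hypothetical, the language of a page is assumed to come from the `WARC-Identified-Content-Language` header (which may be missing in older crawls, where a separate language identifier would be needed), a "domain" is approximated by the URL's host name, and text length is counted in characters. The path-list URL follows the layout CommonCrawl publishes for its crawls; downloading is left out so the sketch runs offline.

```python
import gzip
import io
from collections import defaultdict
from urllib.parse import urlsplit


def wet_paths_url(crawl_id):
    """URL of the gzipped WET path list for an archive ID such as 'CC-MAIN-2023-50'."""
    return f"https://data.commoncrawl.org/crawl-data/{crawl_id}/wet.paths.gz"


def iter_wet_records(stream):
    """Yield (headers, payload_bytes) for each WARC record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            key, _, value = line.decode("utf-8", "replace").partition(":")
            headers[key.strip()] = value.strip()
        length = int(headers.get("Content-Length", 0))
        yield headers, stream.read(length)


def page_language_info(wet_gz_bytes):
    """Yield (url, language, text_length) for each page in a gzipped WET file."""
    with gzip.open(io.BytesIO(wet_gz_bytes), "rb") as stream:
        for headers, payload in iter_wet_records(stream):
            if headers.get("WARC-Type") != "conversion":
                continue  # skip the leading 'warcinfo' record
            # Assumption: take the first language code listed in the header.
            langs = headers.get("WARC-Identified-Content-Language", "")
            lang = langs.split(",")[0].strip() or "unknown"
            text = payload.decode("utf-8", "replace")
            yield headers.get("WARC-Target-URI", ""), lang, len(text)


def domain_language_lengths(pages):
    """Sum text lengths per (domain, language); the host name stands in for the domain."""
    totals = defaultdict(lambda: defaultdict(int))
    for url, lang, n_chars in pages:
        domain = urlsplit(url).netloc.lower()
        if domain:
            totals[domain][lang] += n_chars
    return totals


def multilingual_domains(totals, langs, min_chars=1):
    """Domains that have at least min_chars of text in every requested language."""
    return sorted(
        domain
        for domain, by_lang in totals.items()
        if all(by_lang.get(lang, 0) >= min_chars for lang in langs)
    )
```

In a real run, each path listed in the file behind `wet_paths_url(...)` would be downloaded and its bytes passed to `page_language_info`, with the per-file results merged before calling `multilingual_domains`. Raising `min_chars` is one simple way to drop domains that contain only a token amount of text in one of the languages.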