Data collected from: http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html
-
Top 30 categories in TDT2
We provide here a subset of the original TDT2 corpus. The TDT2 corpus (NIST Topic Detection and Tracking corpus) consists of data collected during the first half of 1998 from 6 sources: 2 newswires (APW, NYT), 2 radio programs (VOA, PRI), and 2 television programs (CNN, ABC). It contains 11,201 on-topic documents classified into 96 semantic categories. In this subset, documents appearing in two or more categories were removed and only the largest 30 categories were kept, leaving 9,394 documents in total.
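The two filtering steps described above (drop documents that appear in more than one category, then keep only the largest categories) can be sketched as follows. This is a minimal illustration rather than the script used to build the subset: the function name `select_top_categories`, the `(doc_id, labels)` input format, and the toy data are all assumptions made for the example.

```python
from collections import Counter


def select_top_categories(docs, k=30):
    """Hypothetical helper: docs is an iterable of (doc_id, labels) pairs."""
    # Step 1: drop documents that appear in two or more categories.
    single = [(doc_id, labels[0]) for doc_id, labels in docs if len(labels) == 1]
    # Step 2: count documents per category and keep the k largest categories.
    counts = Counter(label for _, label in single)
    keep = {label for label, _ in counts.most_common(k)}
    return [(doc_id, label) for doc_id, label in single if label in keep]


# Toy data (made up for illustration, not taken from TDT2).
toy = [
    ("d1", ["a"]),
    ("d2", ["a"]),
    ("d3", ["b"]),
    ("d4", ["a", "b"]),  # appears in two categories, so it is removed
    ("d5", ["c"]),       # category "c" is the smallest, so it is dropped for k=2
    ("d6", ["b"]),
]
subset = select_top_categories(toy, k=2)
```

With `k=30` and the full set of on-topic TDT2 documents as input, the same procedure would be expected to yield the 9,394-document subset, assuming the original preprocessing followed exactly these two steps.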
-
All categories in Reuters21578
The Reuters-21578 corpus contains 21,578 documents in 135 categories. We provide here the ModApte version. Documents with multiple category labels are discarded, leaving 8,293 documents in 65 categories. Under the ModApte split, there are 5,946 training documents and 2,347 testing documents. After preprocessing, this corpus contains 18,933 distinct terms.
-
20 Newsgroups (version 2)
See the homepage of the 20 Newsgroups data set for the original data. We use the "sorted by date" version of 20 Newsgroups (20news-bydate.tar.gz). The original website reports 18,941 documents, which is not correct: there are only 18,846 documents, with 11,314 (60%) for training and 7,532 (40%) for testing.
-
Selected 4 categories in RCV1
This subset contains 9,625 documents with 29,992 distinct words, drawn from the categories "C15", "ECAT", "GCAT", and "MCAT", which have 2,022, 2,064, 2,901, and 2,638 documents, respectively.
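-
As a quick sanity check, the document counts quoted in the entries above can be verified to be internally consistent. The short script below is only an arithmetic cross-check of numbers stated in this file; the variable names are chosen for the example, and nothing is loaded from the actual data sets.

```python
# Counts copied from the descriptions in this file.
reuters_train, reuters_test, reuters_total = 5946, 2347, 8293
ng_train, ng_test, ng_total = 11314, 7532, 18846
rcv1_sizes = {"C15": 2022, "ECAT": 2064, "GCAT": 2901, "MCAT": 2638}
rcv1_total = 9625

# The train/test splits should add up to the stated totals.
reuters_ok = reuters_train + reuters_test == reuters_total
ng_ok = ng_train + ng_test == ng_total

# The four RCV1 category sizes should add up to the subset size.
rcv1_ok = sum(rcv1_sizes.values()) == rcv1_total

# The 20 Newsgroups split percentages, rounded to whole percent.
ng_train_pct = round(100 * ng_train / ng_total)
ng_test_pct = round(100 * ng_test / ng_total)

print(reuters_ok, ng_ok, rcv1_ok, ng_train_pct, ng_test_pct)
```

All three sums match the stated totals, and the 20 Newsgroups split rounds to the quoted 60% and 40%.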