atr_docs_data's Introduction

Data collected from: http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html

  • Top 30 categories in TDT2: We provide here a subset of the original TDT2 corpus. The TDT2 corpus (NIST Topic Detection and Tracking corpus) consists of data collected during the first half of 1998 from 6 sources: 2 newswires (APW, NYT), 2 radio programs (VOA, PRI), and 2 television programs (CNN, ABC). It contains 11,201 on-topic documents classified into 96 semantic categories. In this subset, documents appearing in two or more categories were removed and only the largest 30 categories were kept, leaving 9,394 documents in total.

  • All categories in Reuters21578: The Reuters-21578 corpus contains 21,578 documents in 135 categories. We provide here the ModApte version. Documents with multiple category labels were discarded, leaving 8,293 documents in 65 categories. Under the ModApte split, there are 5,946 training documents and 2,347 test documents. After preprocessing, the corpus contains 18,933 distinct terms.

  • 20 Newsgroups (version 2): See the 20 Newsgroups homepage for details on the data set. We use the "sorted by date" version (20news-bydate.tar.gz). The original website reports 18,941 documents, which is not correct: there are only 18,846 documents, with 11,314 (60%) for training and 7,532 (40%) for testing.

  • Selected 4 categories in RCV1: This subset contains 9,625 documents with 29,992 distinct words, drawn from the categories "C15", "ECAT", "GCAT", and "MCAT", which have 2,022, 2,064, 2,901, and 2,638 documents respectively.
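
The document counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check on the published numbers; it does not load any of the data files, and the names `SPLITS`, `RCV1_CATEGORIES`, `split_summary`, and `check_counts` are hypothetical helpers introduced here for illustration.

```python
# Document counts copied from the dataset descriptions above.
SPLITS = {
    "Reuters-21578 (ModApte, single-label)": {"train": 5946, "test": 2347, "total": 8293},
    "20 Newsgroups (bydate)": {"train": 11314, "test": 7532, "total": 18846},
}

# Per-category counts for the 4-category RCV1 subset.
RCV1_CATEGORIES = {"C15": 2022, "ECAT": 2064, "GCAT": 2901, "MCAT": 2638}
RCV1_TOTAL = 9625


def split_summary(train, test):
    """Return (total, train %, test %) with percentages rounded to integers."""
    total = train + test
    return total, round(100 * train / total), round(100 * test / total)


def check_counts():
    """Map each dataset name to (counts_consistent, train %, test %)."""
    results = {}
    for name, counts in SPLITS.items():
        total, train_pct, test_pct = split_summary(counts["train"], counts["test"])
        results[name] = (total == counts["total"], train_pct, test_pct)
    # RCV1 has no train/test split listed, so only the category sum is checked.
    rcv1_ok = sum(RCV1_CATEGORIES.values()) == RCV1_TOTAL
    results["RCV1 (4 categories)"] = (rcv1_ok, None, None)
    return results


if __name__ == "__main__":
    for name, (ok, train_pct, test_pct) in check_counts().items():
        status = "consistent" if ok else "MISMATCH"
        if train_pct is None:
            print(f"{name}: {status}")
        else:
            print(f"{name}: {status}, {train_pct}% train / {test_pct}% test")
```

A similar check could be run against the actual files after download, replacing the hard-coded counts with the sizes of the loaded matrices.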
