Giter VIP home page Giter VIP logo

cps842-a1's Introduction

This assignment was done in a group of 2.

Inverted Index and Query

This is a python program that builds an inverted index from a collection of Wikipedia articles and performs single term queries. The program can handle different options for stemming and stop word removal.

Dataset

The dataset consists of more than 200,000 English Wikipedia pages submitted for TREC fair ranking. Every page includes an ID, the article's title, the page's URL, and the HTML for its content, all in a JSON file compressed with gzip. A subset of 50,000 documents are used for this assignment's dataset due to RAM restrictions.

Dependencies

  • Python 3.8 or higher
  • BeautifulSoup from bs4
  • gzip
  • regex

Usage

Indexing

To create the index, run the indexConstruction.py file with the following arguments:

  • file_path: the path to the gzip dataset file
  • stopwords_path: the path to the cacm_stopwords text file
  • stem_on: a boolean value indicating whether to apply stemming or not
  • stopwords_on: a boolean value indicating whether to remove stop words or not

For example:

python indexConstruction.py "path/to/zipped/data" "cacm_stopwords.txt" True True

This will create the index with stemming and stop word removal enabled, and save it as JSON files in the same directory.

Querying

To query the index, run the query.py file with no arguments. This will prompt the user to choose whether to query with stemming or stop word removal enabled. Type y for yes and n for no. Once selected, the corresponding index files will be loaded and the user will be asked to enter a word to be queried. The program will display the document frequency and the top 20 documents that contain the word, along with the term frequency and the positions. The program will also show the time it took to process the query. To exit the program, type ZZEND.

For example:

python query.py
Do you want to query with stemming? (y/n) y
Do you want to query with stopwords? (y/n) n
Enter a word to be queried: algorithm
Document Frequency:  104
Doc ID:  5a0c1c9e55b14b36ce7b2c6f
Title:  Algorithm
Term Frequency:  38
Positions:  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
...
Query time:  0.001 seconds
Enter a word to be queried: ZZEND

cps842-a1's People

Watchers

Deandra Spike-Madden avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.