Giter VIP home page Giter VIP logo

ir-termproject's Introduction

IR-TermProject

Group Number: 22 Group Members:

  1. Arnab Kumar Mallick - 18CH10011
  2. Arghyadeep Bandyopadhyay - 18EE10012
  3. Subham Karmakar - 18EE10067
  4. Sankalp Srivastava - 18EE10069

Data: https://drive.google.com/drive/folders/1ELskVvmr4snskq8vUfPh0gtlDBBgPhQ9?usp=sharing

Task 1A: Generates pickle file and indexed documents file. The pickle file stores doc-ids instead of actual doc-path. The doc-ids are the indices of that particular document in the file - indexed_docs.txt. The pickle file contains a dictinary with terms as keys and a list of doc-id and term-frequency as its values, i.e. [doc-id, term frequency].

Task 1B: Generates queries file.

Task 1C: Logic and Algorithm:

  1. Considered the tokens from the queries as they are joined by 'AND' logic.
  2. Used the trivial merge algorithm(for 'AND' logic) for the boolean retrieval dicussed in the class, for merging the postings lists for the specific tokens obtained from the queries above.

Assumption: indexed_docs.txt file must be present in the root directory of the project.

Task 2A (TF-IDF Vectorization)

Performed by Arghyadeep Bandyopadhyay (Roll No: 18EE10012)

Steps used:

  1. The paths and file names of all the documents are extracted fromt the ‘en_BDNews24’ folder
  2. The inverted index file is read and the document frequencies (df) along with the vocabulary are stored
  3. For each document, the document text is stored as a list of terms using the inverted index.
  4. The text file containing the queries is read and each query is then stored as a list of the terms contained in that query
  5. The |V|-dimensional TF-IDF vectors are obtained for each query with a given weighting scheme
  6. For each document, at first the |V|-dimensional TF-IDF vector is obtained with a given weighting scheme. Then, for each query vector, the value of the cosine similarity metric with normalization between the query vector and the current document vector is computed. The process is repeated for all the documents and the values of the cosine similarity metric for each query-document pair is stored.
  7. For each query, the cosine similarity scores are sorted in descending order. The top 50 documents are then stored in a 2-column csv file in the format : .
  8. Steps 5 to 7 are performed for three ddd.qqq schemes, namely scheme ‘A’ (lnc.ltc), scheme ‘B’ (Lnc.Lpc) and scheme ‘C’ (anc.apc)

Assumptions / Changes:

The term frequencies are not computed separately and are assumed to be stored in the inverted index itself.

Extra input / parameters:

The path to the “queries_<GROUP_NO>.txt is to be given along with the path to the inverted index file and the “en_BDNews24” folder. Here, GROUP_NO is 22.

To run Task 2A:

$>>python3 PAT2_22_ranker.py <path_to_model_queries_22.pth>

Python version used: 3.6.8

Library Requirements:

  1. os
  2. sys
  3. pickle
  4. numpy
  5. math
  6. csv
  7. collections

Task 2B: Logic:

  1. Parsed the data of generated Ranked list A, B, C; Gold standard ranked lists and Queries in an array
  2. Maintained a document that maps the query id with the relevant documents and relevance score
  3. We then compared which documents in our ranked list is present in the Gold Standard list and calculated the Precision@K and Average Precision.
  4. The relevance of the documents is then stored in an array and compared to the sorted array of the relevance scores. This way, we calculate the NDCG.
  5. We then average over all the queries to find out the average parameters.

Assumptions:

  1. The Data folder contains rankedRelevantDocList.csv
  2. queries_22.txt must be present in the root directory of the project.
  3. PAT2_22_ranked_list.csv must be present in the root directory of the project.

To run Task 2B:
$>> python PAT2_22_evaluator.py ./Data/rankedRelevantDocList.csv PAT2_22_ranked_list_A.csv
$>> python PAT2_22_evaluator.py ./Data/rankedRelevantDocList.csv PAT2_22_ranked_list_B.csv
$>> python PAT2_22_evaluator.py ./Data/rankedRelevantDocList.csv PAT2_22_ranked_list_C.csv

The output will be generated in the root directory
PAT2_22_metrics_A.csv
PAT2_22_metrics_B.csv
PAT2_22_metrics_C.csv

Python version used: 3.9.7

Library Requirements:

  1. sys
  2. csv
  3. numpy
  4. tabulate
  5. pickle
  6. pickle
  7. bs4
  8. re
  9. os
  10. pickle

ir-termproject's People

Contributors

apprenticearnab avatar arghya320 avatar sankalp0104 avatar subhamkarmakar24 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.