IR-TermProject

Group Number: 22

Group Members:

Arnab Kumar Mallick - 18CH10011
Arghyadeep Bandyopadhyay - 18EE10012
Subham Karmakar - 18EE10067
Sankalp Srivastava - 18EE10069

Data: https://drive.google.com/drive/folders/1ELskVvmr4snskq8vUfPh0gtlDBBgPhQ9?usp=sharing

Task 1A: Generates the pickle file and the indexed documents file. The pickle file stores doc-ids instead of the actual document paths. A doc-id is the index of that document in the file indexed_docs.txt. The pickle file contains a dictionary with terms as keys and postings of the form [doc-id, term frequency] as values.
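The index layout described above can be sketched as follows. This is a minimal illustration and not the project's actual script: the function name `build_index`, the sample file paths, and the choice to store each term's postings as a list of [doc-id, term frequency] pairs are assumptions.

```python
import pickle
from collections import Counter, defaultdict

def build_index(docs):
    """Build {term: [[doc_id, tf], ...]} from a list of (path, tokens) pairs.

    The doc-id is the position of the document in `docs`, which mirrors
    using the line index in indexed_docs.txt (assumed layout).
    """
    index = defaultdict(list)
    for doc_id, (_path, tokens) in enumerate(docs):
        # Counter gives the term frequency of every term in this document.
        for term, tf in sorted(Counter(tokens).items()):
            index[term].append([doc_id, tf])
    return dict(index)

# Hypothetical, already tokenized documents.
docs = [
    ("en_BDNews24/sample_a.txt", ["flood", "relief", "flood"]),
    ("en_BDNews24/sample_b.txt", ["relief", "fund"]),
]
index = build_index(docs)

# The list of paths plays the role of indexed_docs.txt (one path per line).
indexed_docs = "\n".join(path for path, _tokens in docs)

# Round trip through pickle in memory; a real script would write to disk.
restored = pickle.loads(pickle.dumps(index))
```

Because documents are visited in doc-id order, every postings list comes out sorted by doc-id, which is what a merge-based intersection needs later.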

Task 1B: Generates queries file.

Task 1C: Logic and Algorithm:

The tokens of each query are treated as if they were joined by 'AND' logic.
The postings lists of the query tokens are merged with the simple merge algorithm for 'AND' logic discussed in class for boolean retrieval.
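A minimal sketch of the merge-based 'AND' retrieval, assuming each postings list holds [doc-id, term frequency] pairs sorted by doc-id. The function names and the sample index are hypothetical.

```python
def intersect(p1, p2):
    """Intersect two sorted lists of doc-ids with a linear two-pointer merge."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(index, terms):
    """Return the doc-ids that contain every term in `terms`."""
    postings = []
    for term in terms:
        if term not in index:
            return []  # a missing term makes the AND result empty
        postings.append([doc_id for doc_id, _tf in index[term]])
    if not postings:
        return []
    # Starting with the shortest list keeps intermediate results small.
    postings.sort(key=len)
    result = postings[0]
    for p in postings[1:]:
        result = intersect(result, p)
    return result

# Hypothetical index in the [doc-id, term frequency] layout.
sample_index = {
    "flood": [[0, 2], [3, 1], [5, 1]],
    "relief": [[0, 1], [1, 1], [5, 4]],
    "fund": [[1, 1], [5, 2]],
}
```

For example, `and_query(sample_index, ["flood", "relief"])` returns `[0, 5]`.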

Assumption: indexed_docs.txt file must be present in the root directory of the project.

Task 2A (TF-IDF Vectorization)

Performed by Arghyadeep Bandyopadhyay (Roll No: 18EE10012)

Steps used:

1. The paths and file names of all the documents are extracted from the ‘en_BDNews24’ folder.
2. The inverted index file is read, and the document frequencies (df) along with the vocabulary are stored.
3. For each document, the document text is stored as a list of terms using the inverted index.
4. The text file containing the queries is read, and each query is stored as a list of the terms it contains.
5. The |V|-dimensional TF-IDF vector of each query is obtained with a given weighting scheme.
6. For each document, the |V|-dimensional TF-IDF vector is first obtained with a given weighting scheme. Then, for each query vector, the cosine similarity (with normalization) between the query vector and the current document vector is computed. The process is repeated for all the documents, and the cosine similarity of each query-document pair is stored.
7. For each query, the cosine similarity scores are sorted in descending order. The top 50 documents are then stored in a 2-column csv file in the format : .
8. Steps 5 to 7 are performed for three ddd.qqq schemes, namely scheme ‘A’ (lnc.ltc), scheme ‘B’ (Lnc.Lpc) and scheme ‘C’ (anc.apc).
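As an illustration of the scoring steps, scheme ‘A’ (lnc.ltc) can be sketched as below. This is a hedged sketch, not the project's ranker: it uses sparse dictionaries instead of full |V|-dimensional arrays, it assumes the usual SMART definitions (l = 1 + log10(tf), t = log10(N/df), n = no idf, c = cosine normalization), and all function names and sample data are hypothetical.

```python
import math
from collections import Counter

def log_tf(tf):
    """'l' term weighting: 1 + log10(tf) for tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

def cosine_normalize(weights):
    """'c' normalization: scale the vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm > 0 else {}

def doc_vector_lnc(term_freqs):
    """Document side (lnc): log tf, no idf, cosine normalization."""
    return cosine_normalize({t: log_tf(tf) for t, tf in term_freqs.items()})

def query_vector_ltc(query_terms, df, n_docs):
    """Query side (ltc): log tf, idf = log10(N/df), cosine normalization."""
    weights = {}
    for term, tf in Counter(query_terms).items():
        if term in df:  # terms outside the vocabulary are ignored
            weights[term] = log_tf(tf) * math.log10(n_docs / df[term])
    return cosine_normalize(weights)

def rank(query_terms, docs_tf, df, top_k=50):
    """Score every document against the query and keep the top_k."""
    q = query_vector_ltc(query_terms, df, len(docs_tf))
    scores = []
    for doc_id, term_freqs in docs_tf.items():
        d = doc_vector_lnc(term_freqs)
        # Both vectors are unit length, so the dot product is the cosine.
        scores.append((doc_id, sum(w * d.get(t, 0.0) for t, w in q.items())))
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_k]

# Hypothetical per-document term frequencies and document frequencies.
docs_tf = {
    0: {"flood": 2, "relief": 1},
    1: {"relief": 1, "fund": 1},
    2: {"cricket": 3},
}
df = {"flood": 1, "relief": 2, "fund": 1, "cricket": 1}
ranking = rank(["flood", "relief"], docs_tf, df)
```

With this sample data, document 0 ranks first because it contains both query terms, document 1 second, and document 2 scores 0 because it shares no term with the query. Schemes ‘B’ and ‘C’ would swap in different tf and idf components while keeping the same ranking loop.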

Assumptions / Changes:

The term frequencies are not computed separately and are assumed to be stored in the inverted index itself.

Extra input / parameters:

The path to “queries_<GROUP_NO>.txt” must be given along with the path to the inverted index file and the path to the “en_BDNews24” folder. Here, GROUP_NO is 22.

To run Task 2A:

$>> python3 PAT2_22_ranker.py <path_to_model_queries_22.pth>

Python version used: 3.6.8

Library Requirements:

os
sys
pickle
numpy
math
csv
collections

Task 2B: Logic:

Parsed the generated ranked lists A, B and C, the gold-standard ranked lists and the queries into arrays.
Maintained a mapping from each query id to its relevant documents and their relevance scores.
Checked which documents in our ranked list are present in the gold-standard list and calculated Precision@K and Average Precision.
The relevance scores of the retrieved documents are stored in an array and compared with the same array sorted in descending order of relevance. This is how NDCG is calculated.
Finally, the metrics are averaged over all the queries.
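The metrics above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's evaluator: Average Precision is normalized by the number of relevant documents found in the top K, DCG uses the rel_i / log2(i + 1) discount, and the ideal ordering is obtained by sorting the retrieved relevance scores, as described above. All names and the sample data are hypothetical.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision_at_k(ranked, relevant, k):
    """Mean of precision@rank over the ranks of relevant documents in the top k."""
    hits = 0
    total = 0.0
    for position, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / position
    return total / hits if hits else 0.0

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount (assumed variant)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg_at_k(ranked, relevance, k):
    """DCG of the retrieved order divided by DCG of the same gains sorted descending."""
    gains = [relevance.get(doc, 0) for doc in ranked[:k]]
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical ranked list and gold-standard relevance scores for one query.
ranked = ["d3", "d1", "d7", "d2"]
relevance = {"d1": 2, "d2": 1, "d5": 2}

p_at_2 = precision_at_k(ranked, relevance, 2)           # 0.5
ap_at_4 = average_precision_at_k(ranked, relevance, 4)  # 0.5
ndcg_at_4 = ndcg_at_k(ranked, relevance, 4)
```

Averaging each of these values over all queries gives the per-scheme summary metrics.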

Assumptions:

The Data folder contains rankedRelevantDocList.csv
queries_22.txt must be present in the root directory of the project.
PAT2_22_ranked_list.csv must be present in the root directory of the project.

To run Task 2B:
$>> python PAT2_22_evaluator.py ./Data/rankedRelevantDocList.csv PAT2_22_ranked_list_A.csv
$>> python PAT2_22_evaluator.py ./Data/rankedRelevantDocList.csv PAT2_22_ranked_list_B.csv
$>> python PAT2_22_evaluator.py ./Data/rankedRelevantDocList.csv PAT2_22_ranked_list_C.csv

The output will be generated in the root directory
PAT2_22_metrics_A.csv
PAT2_22_metrics_B.csv
PAT2_22_metrics_C.csv

Python version used: 3.9.7

Library Requirements:

sys
csv
numpy
tabulate
pickle
bs4
re
os