google-2.0

Implementation of a search engine from scratch

This project was developed by two students from CentraleSupélec as part of the "Fondements en Recherche d'Information" course.

We are working on two given collections:

  • CACM collection
  • CS276 collection

Installation

To install, create a config.py file in the main directory and set the absolute paths to the collections and the path where you want the index to be stored:

CACM_path = '/path/to/CACM/'
CS276_path = '/path/to/pa1-data/'
index_path = '/path/to/index/'

Easy testing

Open RunMe.ipynb, a notebook with the main results and explanations.

Download the index

If you don't want to spend too much time generating the index, you can download it from here: https://drive.google.com/drive/folders/17glYdz6KY_PJsnANKrYi4xooNkDQ0ua1?usp=sharing. Be sure to replace the index/ folder with the unzipped folder.

Task 1: inverted index

Linguistic processing

Entry points: CACMIndex.py and CS276Index.py. Each computes the number of tokens and the vocabulary size of its collection, and draws the corresponding frequency graphs.

Helper functions:

  • textProcessing.py processes text with language-processing steps such as tokenization, lemmatization and stop-word removal.
  • indexBuilder.py helps build each index.
  • CACMParser.py parses a CACM document and extracts its title, summary and keywords.
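
The linguistic processing described above can be sketched as follows. This is a minimal illustration and not the code of textProcessing.py: the function names and the tiny stop-word list are hypothetical, and lemmatization is left out to keep the example dependency-free.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "is", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def process(text, stop_words=STOP_WORDS):
    """Tokenize the text and drop stop words (no lemmatization here)."""
    return [token for token in tokenize(text) if token not in stop_words]


tokens = process("The design of an algorithm for the parsing of text")
print(tokens)  # ['design', 'algorithm', 'parsing', 'text']
```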

Heaps' law: heapRegression.py. Run it to compute the Heaps' law parameters of each collection. To switch collection, uncomment the corresponding lines in the script.
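
For reference, Heaps' law models the vocabulary size V as V = k * N^b, where N is the number of tokens. One way to estimate k and b is a least-squares fit in log-log space, sketched below. The function name and the sample points are assumptions made for illustration; heapRegression.py may proceed differently.

```python
import math


def fit_heaps_law(samples):
    """Fit log V = log k + b * log N by ordinary least squares.

    samples: list of (N, V) pairs, with N the token count and V the
    vocabulary size measured on a prefix of the collection.
    Returns the pair (k, b).
    """
    xs = [math.log(n) for n, _ in samples]
    ys = [math.log(v) for _, v in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    k = math.exp(mean_y - b * mean_x)
    return k, b


# Synthetic points generated from k = 10, b = 0.5 (not measured on CACM or CS276).
samples = [(n, 10 * n ** 0.5) for n in (100, 1000, 10000)]
k, b = fit_heaps_law(samples)
print(round(k, 3), round(b, 3))
```

At least two samples with distinct token counts are needed, otherwise the slope is undefined.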

Frequency graphs: frequencyRankGraph.py - helper class to draw frequency graphs.
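
The data behind a frequency-rank graph can be computed as in the sketch below. The function name is hypothetical and the plotting step is omitted; the actual frequencyRankGraph.py class may be organized differently.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, rank 1 being the most frequent term."""
    counts = Counter(tokens)
    return [
        (rank, freq)
        for rank, (_, freq) in enumerate(counts.most_common(), start=1)
    ]


points = rank_frequency(["a", "b", "a", "c", "a", "b"])
print(points)  # [(1, 3), (2, 2), (3, 1)]
```

Plotting these pairs on log-log axes is the usual way to inspect how term frequency decays with rank.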

Indexation

Entry point: BSBI.py.

Running this file will generate the different dictionaries (documents, terms, index) in the index/ folder given in config.py.
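
The general idea of block sort-based indexing (BSBI) can be sketched as follows: each block of documents is turned into a sorted run of (term, doc_id) pairs, and the runs are then merged into posting lists. This is an in-memory simplification with hypothetical function names; a real BSBI implementation, presumably including BSBI.py, writes each sorted block to disk before merging.

```python
import heapq
from itertools import groupby


def index_block(docs):
    """Return the sorted, de-duplicated (term, doc_id) pairs of one block.

    docs: list of (doc_id, tokens) pairs.
    """
    pairs = {(term, doc_id) for doc_id, tokens in docs for term in tokens}
    return sorted(pairs)


def merge_blocks(blocks):
    """Merge sorted blocks into a dict {term: sorted list of doc ids}.

    Assumes each document belongs to exactly one block.
    """
    merged = heapq.merge(*blocks)
    index = {}
    for term, group in groupby(merged, key=lambda pair: pair[0]):
        index[term] = [doc_id for _, doc_id in group]
    return index


# Toy documents (illustrative data only).
block1 = index_block([(1, ["search", "engine"]), (2, ["engine"])])
block2 = index_block([(3, ["search", "index"])])
index = merge_blocks([block1, block2])
print(index)
```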

Boolean search

Entry point: boolean/booleanEvaluation.py.

Run the tests in boolean/test.py.
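
As an illustration of how a boolean model evaluates a query over an inverted index, the sketch below implements AND, OR and NOT on sorted posting lists. The function names and the toy index are assumptions; boolean/booleanEvaluation.py may parse and evaluate queries differently.

```python
def intersect(p1, p2):
    """AND: merge two sorted posting lists, keeping doc ids present in both."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def union(p1, p2):
    """OR: doc ids present in either list, returned sorted."""
    return sorted(set(p1) | set(p2))


def complement(postings, all_doc_ids):
    """NOT: doc ids of the collection that are absent from the posting list."""
    excluded = set(postings)
    return [doc_id for doc_id in all_doc_ids if doc_id not in excluded]


# Toy index (illustrative data only).
index = {"search": [1, 3, 4], "engine": [1, 2, 4], "index": [3]}
all_docs = [1, 2, 3, 4]

# Query: search AND engine AND NOT index
result = intersect(
    intersect(index["search"], index["engine"]),
    complement(index["index"], all_docs),
)
print(result)  # [1, 4]
```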

Vectorial search

Entry point: vectorial/vectorialEvaluation.py.

Run the tests in vectorial/test.py.
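
A vector-space search can be sketched as below, assuming tf-idf weights and cosine similarity. That weighting scheme is an assumption chosen for illustration, as are the function names; vectorial/vectorialEvaluation.py may use other weightings.

```python
import math
from collections import Counter


def build_vectors(docs):
    """Build tf-idf vectors. docs: {doc_id: list of tokens}.

    Returns ({doc_id: {term: weight}}, {term: idf}).
    """
    n_docs = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    idf = {term: math.log10(n_docs / count) for term, count in df.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for doc_id, tokens in docs.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query_tokens, vectors, idf):
    """Return the doc ids with a positive score, best match first."""
    query = {
        term: tf * idf[term]
        for term, tf in Counter(query_tokens).items()
        if term in idf
    }
    scores = {doc_id: cosine(query, vec) for doc_id, vec in vectors.items()}
    ranked = sorted(scores, key=lambda doc_id: -scores[doc_id])
    return [doc_id for doc_id in ranked if scores[doc_id] > 0]


# Toy collection (illustrative data only).
docs = {
    1: ["search", "engine"],
    2: ["inverted", "index"],
    3: ["search", "index", "index"],
}
vectors, idf = build_vectors(docs)
print(search(["engine"], vectors, idf))
```

Terms that occur in every document get an idf of zero with this formula, so they do not contribute to the ranking.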

Both search models that we implemented inherit from evaluation.py.

Evaluation

Evaluate our CACM search models by running the functions in CACMEvaluation.py.
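
Search models are commonly evaluated with precision and recall against known relevance judgments. The sketch below shows these two measures; the function name and the sample ids are hypothetical, and CACMEvaluation.py may compute additional or different metrics.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: doc ids returned by the search model.
    relevant: doc ids judged relevant for the query.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative ids only, not taken from the CACM relevance judgments.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 5])
print(precision, recall)
```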

Contributors

dlphn
