Sparrow Lite -- a micro search engine

This project was built as the final assignment for UCI CS121.

Project Structure

src
├── app.py
├── build.py
├── config.ini
├── controller.py
├── model.py
├── packages
│   └── requirements.txt
├── static
│   ├── index.css
│   ├── logo.png
│   └── search.css
├── templates
│   ├── index.html
│   └── search.html
└── util
    ├── indexBuilder.py
    ├── simhashIndex.py
    └── textProcessor.py

Goals and project progress

Stage 1

  • Implement an inverted index
  • Each posting includes a tf-idf score
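
As a rough illustration of the Stage 1 goal, here is a minimal sketch of an in-memory inverted index whose postings carry a tf-idf score. It is not the code in util/indexBuilder.py: the function name `build_index`, the whitespace tokenizer, and the weighting formula `(1 + log10 tf) * log10(N / df)` are all assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build {term: [(doc_id, tf_idf), ...]} from {doc_id: text}."""
    # Term frequencies per document (naive whitespace tokenizer, an assumption).
    term_counts = {
        doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()
    }

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    index = defaultdict(list)
    for doc_id in sorted(term_counts):  # keeps each posting list sorted by doc id
        for term, tf in term_counts[doc_id].items():
            # One common weighting (assumed here): (1 + log10 tf) * log10(N / df).
            score = (1 + math.log10(tf)) * math.log10(n_docs / doc_freq[term])
            index[term].append((doc_id, score))
    return dict(index)


# Hypothetical toy corpus.
docs = {
    1: "sparrow search engine",
    2: "search index search",
    3: "tiny sparrow",
}
index = build_index(docs)
```

With this corpus, `index["search"]` holds postings for documents 1 and 2, and document 2 scores higher because the term occurs twice there.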

Stage 2

  • multi query
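
Assuming "multi query" means a query made of several terms, one simple way to answer it is to intersect the posting lists of the terms. The sketch below is only an illustration of that idea; `intersect`, `and_query`, and the sample `postings` dictionary are hypothetical names, not part of this project.

```python
def intersect(a, b):
    """Merge-intersect two ascending lists of doc ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def and_query(postings, terms):
    """Return doc ids that contain every term in `terms`."""
    # Start from the shortest list so intermediate results stay small.
    lists = sorted((postings.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for lst in lists[1:]:
        result = intersect(result, lst)
        if not result:
            break  # no document can match once the intersection is empty
    return result


# Hypothetical posting lists (doc ids in ascending order).
postings = {"sparrow": [1, 3, 7, 9], "search": [1, 2, 7], "engine": [7, 8, 9]}
```

A term missing from the index yields an empty result, which matches AND semantics.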

Stage 3

  • token stemming
  • weight of words in title, <strong>, and heading tags
  • response time under 300 ms
  • a complete search engine

Extra credit

  • duplicate and near duplicate pages
  • page rank
  • 2-gram/3-gram indexing
  • posting with word position
  • anchor words
  • Web or GUI interface
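
For the near-duplicate item above, a common technique is Simhash: pages whose fingerprints differ in only a few bits are treated as near duplicates. The following is a minimal sketch of that technique and does not reflect the contents of util/simhashIndex.py. The use of MD5 as the per-token hash, the 64-bit width, and the unweighted tokens are assumptions.

```python
import hashlib

BITS = 64  # fingerprint width (assumed)


def token_hash(token):
    """64-bit hash of a token, taken from the first 8 bytes of its MD5 digest."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def simhash(tokens):
    """Return a 64-bit Simhash fingerprint for a list of tokens."""
    weights = [0] * BITS
    for token in tokens:
        h = token_hash(token)
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two pages could then be flagged as near duplicates when `hamming(simhash(a), simhash(b))` falls below a small threshold; the right threshold depends on the corpus and is not specified here.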

Build and run

Download the processed data files from here and unzip them into the same directory as this project. The file structure should look like this:

sparrowlite
├── src
├── DATA <--drop it here
└── README.MD

In src/config.ini, specify where the source data directory is located and where the processed data should be written:

[DATABASE]
WEBSITES_DIR = ../DEV/     # directory containing the raw source files; no need to change it if you use the prebuilt data files
DATABASE_DIR = ../DATA/    # output directory for the processed data
MERGE_CHUNK  = YES         # do not change this line
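
The settings above use trailing `#` comments on the same line as the values. If the file is read with Python's standard `configparser` (an assumption; this README does not say how the project parses it), note that inline comments are not stripped by default, so `inline_comment_prefixes` has to be set. A minimal sketch, using an in-memory copy of the file with shortened comments:

```python
import configparser

# In-memory copy of src/config.ini, used so the example is self-contained.
SAMPLE = """
[DATABASE]
WEBSITES_DIR = ../DEV/     # raw source files
DATABASE_DIR = ../DATA/    # processed output
MERGE_CHUNK  = YES         # do not change
"""

# Without inline_comment_prefixes, the "# ..." text would be part of each value.
parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
parser.read_string(SAMPLE)

db = parser["DATABASE"]
websites_dir = db["WEBSITES_DIR"]            # option names are case-insensitive
database_dir = db["DATABASE_DIR"]
merge_chunk = db.getboolean("MERGE_CHUNK")   # "YES" is parsed as True
```

To read the real file instead, `parser.read("src/config.ini")` could replace the `read_string` call.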

Install dependencies:

Python 2

pip install -r src/packages/requirements.txt

Python 3

pip3 install -r src/packages/requirements.txt

To run the application, set the FLASK_APP environment variable.

For Windows:

cd src
set FLASK_APP=app.py

For macOS or Unix-like systems:

cd src
export FLASK_APP=app.py

Start the Flask application:

flask run --host=127.0.0.1 --port=3000

The program will listen and serve the application at localhost on port 3000. Visit http://localhost:3000 in your browser to see the web page.

Known Issues

  • need to rewrite the function to merge partial data files using heapq
  • ugly retrieval result styling
  • stemming
  • use config.ini to set crawled website directory and output directory
  • implement a forward index alongside the inverted index for retrieving documents
  • return just the URL for stage 2, not the path
  • fix disk seek difference between Windows and macOS/Unix-like systems
  • implement complex tuple intersection algorithm
  • slow or incorrect implementation of cosine similarity
  • use Simhash to remove near duplicates
  • fix "," ValueError when parsing docid csv
  • implement inward and outward link graph
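
The first item in the list above mentions merging partial data files with heapq. A minimal sketch of that idea follows, assuming Python 3 and assuming each partial file can be read as a stream of `(term, postings)` pairs already sorted by term. The function name `merge_partials` and the in-memory sample data are hypothetical; the project's on-disk format is not described in this README.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_partials(*partials):
    """Merge term-sorted streams of (term, postings) into one sorted stream."""
    # heapq.merge reads lazily, so only one entry per stream is held at a time.
    merged = heapq.merge(*partials, key=itemgetter(0))
    for term, group in groupby(merged, key=itemgetter(0)):
        postings = []
        for _, plist in group:
            postings.extend(plist)
        # Combine the postings of the same term from every partial index.
        yield term, sorted(postings)


# Hypothetical partial indexes, each sorted by term.
part_a = [("engine", [1]), ("search", [1, 2])]
part_b = [("index", [4]), ("search", [5])]
merged = list(merge_partials(part_a, part_b))
```

Because both inputs are consumed as iterators, the same function would work with generators that read lines from files on disk.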

Contributors

linjiangzhu
