
yippee's Introduction

Yippee!

A fully distributed search engine

To say the World Wide Web is enormous would be an understatement. As of March 2012, Google's index was estimated at between forty-five and fifty-five billion pages, itself only a fraction of the existing web.

The Yippee Search Engine is a distributed application composed of four main components: a web crawler that augments [the Mercator pattern](http://www.cindoc.csic.es/cybermetrics/pdf/68.pdf) to crawl the web, an indexer that indexes crawled web pages, our implementation of [Google's PageRank](http://infolab.stanford.edu/~backrub/google.html) algorithm, and a search module that retrieves documents matching user-provided keywords. The crawler, indexer, and search components are distributed over the FreePastry peer-to-peer substrate, while the PageRank component is centralized. All parts of the project are deployed to Amazon Elastic Compute Cloud (EC2) instances.
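For intuition, the PageRank computation mentioned above can be sketched as a plain power iteration. This is a minimal illustrative version, not the project's actual implementation; the link structure, damping factor, and iteration count are assumptions.

```java
// Minimal PageRank power-iteration sketch (illustrative, not the project's code).
// links[i] lists the pages that page i links to; d is the damping factor.
public class PageRankSketch {
    public static double[] pageRank(int[][] links, int iterations, double d) {
        int n = links.length;
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1.0 - d) / n);  // teleport term
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    // Dangling page: spread its rank evenly over all pages.
                    for (int j = 0; j < n; j++) next[j] += d * rank[i] / n;
                } else {
                    for (int j : links[i]) next[j] += d * rank[i] / links[i].length;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Tiny 3-page web: 0 -> 1, 1 -> 2, 2 -> 0 (a symmetric cycle).
        double[] r = pageRank(new int[][]{ {1}, {2}, {0} }, 50, 0.85);
        System.out.printf("%.3f %.3f %.3f%n", r[0], r[1], r[2]);
    }
}
```

On the symmetric cycle, all three pages end up with equal rank, which is a quick sanity check for any implementation.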


yippee's People

Contributors

cimbriano · marg624 · nvasilakis · tjdu


yippee's Issues

URL Feed

We need to create a URL feeder for a newly started crawler.

It either reads seed URLs from a file or loads the crawler's saved state from the database.
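The file-reading half of this issue could look like the sketch below. It is a hypothetical feeder (the class name and file format are assumptions): one URL per line, skipping blanks and `#` comments, loaded into a blocking queue for the crawler threads.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical URL feeder: loads one URL per line from a seed source,
// skipping blank lines and '#' comments.
public class UrlFeeder {
    public static BlockingQueue<String> feed(Reader source) throws IOException {
        BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty() && !line.startsWith("#")) frontier.add(line);
            }
        }
        return frontier;
    }
}
```

Restoring crawler state from the database would replace the `Reader` source with a query over the persisted frontier.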

SimpleQueue

Messages (and URLs) do not implement Comparable. As a result, we currently cannot use a priority queue (e.g., PriorityBlockingQueue) as the blocking-queue implementation.

Backup scripts

Create backup scripts for the database on EC2. Since even the home directory is not persistent, additional scripts must also be included (to generate keys and push backups to each developer's account).

Implement FreePastry [Indexer]

Decide which parts of the indexer need DHT storage. How should this be implemented? How can nodes/messages be shared cleanly with the crawler?

HTTP Module

Merge HttpResponse into HttpModule and implement the HTTP client.

Ant script that runs all unit tests

Create an Ant script that runs all unit tests and writes their output (apart from System.out, it could be some machine-readable form such as XML or CSV, whichever is easier to interpret).
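A target along these lines could use Ant's built-in `junit` task with the XML formatter. This is a sketch; the property names (`test.classpath`, `test.classes`, `test.reports`) are placeholders that would need to match the project's actual build file.

```xml
<target name="test" depends="compile-tests">
  <junit printsummary="yes" haltonfailure="no">
    <classpath refid="test.classpath"/>
    <!-- Write machine-readable XML results, one file per test class. -->
    <formatter type="xml"/>
    <batchtest todir="${test.reports}">
      <fileset dir="${test.classes}" includes="**/*Test.class"/>
    </batchtest>
  </junit>
</target>
```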

Per package logging

We need to add granular per-package logging, plus a global logger that receives a copy of everything (so we can validate relative logical timestamps).
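With log4j (which the project already uses), this can be expressed entirely in configuration: the root logger captures everything into one global appender, and per-package logger entries set finer levels. The file path and package names below are illustrative assumptions.

```properties
# Root logger gets a copy of everything, written to one global log file.
log4j.rootLogger=INFO, global
log4j.appender.global=org.apache.log4j.FileAppender
log4j.appender.global.File=logs/yippee.log
log4j.appender.global.layout=org.apache.log4j.PatternLayout
log4j.appender.global.layout.ConversionPattern=%d{ISO8601} [%t] %-5p %c - %m%n

# Per-package levels (package names are illustrative).
log4j.logger.com.yippee.crawler=DEBUG
log4j.logger.com.yippee.indexer=INFO
```

Child loggers inherit the root's appender by default, so every package's output also lands in the global log with timestamps for cross-package ordering.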

Duplicate URL Eliminator

We need a duplicate URL eliminator (DUE) right before the Frontier. We also need to finalize the duplicate-elimination strategy: keep a different version? Update the URL's status as "more important"? Discard it? Or perhaps no duplicate elimination at all?
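Whatever strategy is chosen, the core of a DUE is a thread-safe seen-set consulted before a URL reaches the Frontier. A minimal sketch, assuming the simplest "discard duplicates" policy (class and method names are hypothetical):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Minimal DUE sketch: a thread-safe seen-set consulted before the Frontier.
// A production version might store URL fingerprints (hashes) to bound memory.
public class DuplicateUrlEliminator {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns true if the URL is new and should proceed to the Frontier. */
    public boolean admit(String url) {
        return seen.add(url);  // add() returns false if the URL was already present
    }
}
```

The "update status as more important" variant would swap the set for a map from URL to metadata and merge on collision instead of discarding.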

Remove 'Stop Words'

Here's a list of words our search engine should generally ignore:

Stop Words

The exception here is when words are in quotes, in which case we take the query in its entirety.
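The rule above can be sketched as a small query filter: strip stop words from free-form queries, but pass a quoted query through in its entirety. The class name and the tiny stop-word sample here are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of the stop-word rule: drop stop words from free-form queries,
// but keep a quoted query intact. The stop-word set is a tiny sample.
public class QueryFilter {
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "of", "and");

    public static List<String> terms(String query) {
        query = query.trim();
        if (query.startsWith("\"") && query.endsWith("\"") && query.length() > 1) {
            // Quoted query: take it in its entirety, stop words included.
            return Arrays.asList(query.substring(1, query.length() - 1).split("\\s+"));
        }
        return Arrays.stream(query.toLowerCase().split("\\s+"))
                     .filter(w -> !STOP_WORDS.contains(w))
                     .collect(Collectors.toList());
    }
}
```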

Create a global entry point

We need to create a global entry point for the backend, in order to initialize everything correctly. This can't be part of existing modules (such as the crawler or the indexer); it must be either a new package or in util.

Note that this is closely linked with #2

Set(s) of seed URLs

  • In a file?
  • How many URLs to start with? Does this depend on the number of nodes? Threads? Neither?
  • Top Alexa sites could be used as a source of ideas:
  • CNN
  • tumblr
  • yahoo
  • wordpress
  • wikipedia?
  • reddit (lots of links to various parts of the internet, but also a lot of imgur links)

What is the RobotsModule instance field host?

And why does it have type URL?

I thought the RobotsModule was meant to be a facility for processes to determine whether a URL is allowed to be crawled. I didn't think it would have state, besides perhaps some LRU caching of recently accessed RobotsTxt objects.

Speed-Up Indexer

It seems that adding new Hits to the Indexer is a bit slow at the moment. Is there a good way to stay synchronized while batching new word additions to the lexicon? Maybe this is where DumpLexicon comes in?
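One common shape for this is a synchronized buffer that collects hits and flushes them to the lexicon in groups, trading per-hit latency for throughput. This is a hypothetical sketch (the class name and `Consumer`-based flush hook are assumptions, not the project's API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical batching buffer: collect new hits and flush them to the
// lexicon in groups rather than one write per hit.
public class HitBatcher {
    private final List<String> buffer = new ArrayList<>();
    private final int batchSize;
    private final Consumer<List<String>> flushAction;

    public HitBatcher(int batchSize, Consumer<List<String>> flushAction) {
        this.batchSize = batchSize;
        this.flushAction = flushAction;
    }

    public synchronized void add(String hit) {
        buffer.add(hit);
        if (buffer.size() >= batchSize) flush();
    }

    public synchronized void flush() {
        if (buffer.isEmpty()) return;
        flushAction.accept(new ArrayList<>(buffer));  // hand off a copy
        buffer.clear();
    }
}
```

Callers still see a synchronous `add`, but the expensive lexicon write happens once per batch; a final `flush()` at shutdown drains the remainder.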

Configuration Singleton

Create a Configuration Singleton class so we can

  • share project configuration properties
  • make use of test and production configurations
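A minimal sketch of such a singleton, assuming a `yippee.properties` file on the classpath (the file name and key names are assumptions). Letting system properties override file values is one simple way to switch between test and production configurations:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Sketch of the proposed Configuration singleton; the properties file
// name and override scheme are assumptions.
public class Configuration {
    private static Configuration instance;
    private final Properties props = new Properties();

    private Configuration() {
        // Load from the classpath; missing file just means all defaults.
        try (InputStream in = Configuration.class
                .getResourceAsStream("/yippee.properties")) {
            if (in != null) props.load(in);
        } catch (IOException e) {
            throw new RuntimeException("Could not load configuration", e);
        }
    }

    public static synchronized Configuration getInstance() {
        if (instance == null) instance = new Configuration();
        return instance;
    }

    public String get(String key, String defaultValue) {
        // System properties win, so tests can override production settings.
        return System.getProperty(key, props.getProperty(key, defaultValue));
    }
}
```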

log4j question

Hey @nvasilakis, this is probably simple, but how do we get rid of these warnings? This one came up when I ran RobotsManagerTest.java:

log4j:WARN No appenders could be found for logger (com.yippee.db.crawler.RobotsManager).
log4j:WARN Please initialize the log4j system properly.
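This warning means no appender has been configured for the root logger. One standard fix is to put a minimal `log4j.properties` on the test classpath, for example:

```properties
log4j.rootLogger=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%-5p %c - %m%n
```

Alternatively, calling `org.apache.log4j.BasicConfigurator.configure()` once at test startup installs a default console appender without any file.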

Add Lemmatizer

Use the Stanford NLP lemmatizer to assist in interpreting queries.
