
yippee's Introduction

Yippee!

A fully distributed search engine

To say the World Wide Web is enormous would be an understatement. As of March 2012, Google's index was estimated at between forty-five and fifty-five billion pages, itself only a fraction of the existing web.

The Yippee Search Engine is a distributed application composed of four main components: a web crawler that augments [the Mercator pattern](http://www.cindoc.csic.es/cybermetrics/pdf/68.pdf) to crawl the web, an indexer that indexes crawled web pages, our implementation of [Google's PageRank](http://infolab.stanford.edu/~backrub/google.html) algorithm, and a search module that retrieves documents matching user-provided keywords. The crawler, indexer, and search components are distributed over the FreePastry peer-to-peer substrate, while the PageRank component is centralized. All parts of the project are deployed to Amazon Elastic Compute Cloud (EC2) instances.
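For intuition, the PageRank computation mentioned above can be sketched as a plain power iteration. This is a minimal illustrative version, not the project's actual implementation; the link structure, damping factor, and iteration count are assumptions.

```java
// Minimal PageRank power-iteration sketch (illustrative, not the project's code).
// links[i] lists the pages that page i links to; d is the damping factor.
public class PageRankSketch {
    public static double[] pageRank(int[][] links, int iterations, double d) {
        int n = links.length;
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1.0 - d) / n);  // teleport term
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    // Dangling page: spread its rank evenly over all pages.
                    for (int j = 0; j < n; j++) next[j] += d * rank[i] / n;
                } else {
                    for (int j : links[i]) next[j] += d * rank[i] / links[i].length;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Tiny 3-page web: 0 -> 1, 1 -> 2, 2 -> 0 (a symmetric cycle).
        double[] r = pageRank(new int[][]{ {1}, {2}, {0} }, 50, 0.85);
        System.out.printf("%.3f %.3f %.3f%n", r[0], r[1], r[2]);
    }
}
```

On the symmetric cycle, all three pages end up with equal rank, which is a quick sanity check for any implementation.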


yippee's People

Contributors

cimbriano · marg624 · nvasilakis · tjdu


yippee's Issues

URL Feed

We need to create a URL feeder for a newly started crawler.

It either reads seed URLs from a file or loads the crawler's saved state from the database.
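The file-reading half of this issue could look like the sketch below. It is a hypothetical feeder (the class name and file format are assumptions): one URL per line, skipping blanks and `#` comments, loaded into a blocking queue for the crawler threads.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical URL feeder: loads one URL per line from a seed source,
// skipping blank lines and '#' comments.
public class UrlFeeder {
    public static BlockingQueue<String> feed(Reader source) throws IOException {
        BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty() && !line.startsWith("#")) frontier.add(line);
            }
        }
        return frontier;
    }
}
```

Restoring crawler state from the database would replace the `Reader` source with a query over the persisted frontier.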

SimpleQueue

Messages (and URLs) do not implement Comparable. As a result, we currently cannot use a priority queue (e.g., PriorityBlockingQueue) as the blocking-queue implementation.

Backup scripts

Create backup scripts for the database on EC2. Since even the home directory is not persistent, additional scripts must also be included (to generate keys and push backups to each developer's account).

Implement FreePastry [Indexer]

Decide which parts of the indexer need DHT storage. How should this be implemented? How can nodes/messages be shared cleanly with the crawler?

HTTP Module

Merge HttpResponse into HttpModule and implement the HTTP client.

Ant script that runs all unit tests

Create an Ant script that runs all unit tests and writes their output (apart from System.out, it could be some machine-readable form such as XML or CSV, whichever is easier to interpret).
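A target along these lines could use Ant's built-in `junit` task with the XML formatter. This is a sketch; the property names (`test.classpath`, `test.classes`, `test.reports`) are placeholders that would need to match the project's actual build file.

```xml
<target name="test" depends="compile-tests">
  <junit printsummary="yes" haltonfailure="no">
    <classpath refid="test.classpath"/>
    <!-- Write machine-readable XML results, one file per test class. -->
    <formatter type="xml"/>
    <batchtest todir="${test.reports}">
      <fileset dir="${test.classes}" includes="**/*Test.class"/>
    </batchtest>
  </junit>
</target>
```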

Per package logging

We need to add granular per-package logging, plus a global logger that receives a copy of everything (so we can validate relative logical timestamps).
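With log4j (which the project already uses), this can be expressed entirely in configuration: the root logger captures everything into one global appender, and per-package logger entries set finer levels. The file path and package names below are illustrative assumptions.

```properties
# Root logger gets a copy of everything, written to one global log file.
log4j.rootLogger=INFO, global
log4j.appender.global=org.apache.log4j.FileAppender
log4j.appender.global.File=logs/yippee.log
log4j.appender.global.layout=org.apache.log4j.PatternLayout
log4j.appender.global.layout.ConversionPattern=%d{ISO8601} [%t] %-5p %c - %m%n

# Per-package levels (package names are illustrative).
log4j.logger.com.yippee.crawler=DEBUG
log4j.logger.com.yippee.indexer=INFO
```

Child loggers inherit the root's appender by default, so every package's output also lands in the global log with timestamps for cross-package ordering.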

Duplicate URL Eliminator

We need a duplicate URL eliminator (DUE) right before the Frontier. We also need to finalize the duplicate-elimination strategy: keep a different version? Update the URL's status as "more important"? Discard it? Or perhaps no duplicate elimination at all?
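Whatever strategy is chosen, the core of a DUE is a thread-safe seen-set consulted before a URL reaches the Frontier. A minimal sketch, assuming the simplest "discard duplicates" policy (class and method names are hypothetical):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Minimal DUE sketch: a thread-safe seen-set consulted before the Frontier.
// A production version might store URL fingerprints (hashes) to bound memory.
public class DuplicateUrlEliminator {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns true if the URL is new and should proceed to the Frontier. */
    public boolean admit(String url) {
        return seen.add(url);  // add() returns false if the URL was already present
    }
}
```

The "update status as more important" variant would swap the set for a map from URL to metadata and merge on collision instead of discarding.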

Remove 'Stop Words'

Here's a list of words our search engine should generally ignore:

Stop Words

The exception here is when words are in quotes, in which case we take the query in its entirety.
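The rule above can be sketched as a small query filter: strip stop words from free-form queries, but pass a quoted query through in its entirety. The class name and the tiny stop-word sample here are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of the stop-word rule: drop stop words from free-form queries,
// but keep a quoted query intact. The stop-word set is a tiny sample.
public class QueryFilter {
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "of", "and");

    public static List<String> terms(String query) {
        query = query.trim();
        if (query.startsWith("\"") && query.endsWith("\"") && query.length() > 1) {
            // Quoted query: take it in its entirety, stop words included.
            return Arrays.asList(query.substring(1, query.length() - 1).split("\\s+"));
        }
        return Arrays.stream(query.toLowerCase().split("\\s+"))
                     .filter(w -> !STOP_WORDS.contains(w))
                     .collect(Collectors.toList());
    }
}
```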

Create a global entry point

We need to create a global entry point for the backend, in order to initialize everything correctly. This can't be part of existing modules (such as the crawler or the indexer); it must be either a new package or in util.

Note that this is closely linked with #2

Set(s) of seed URLs

  • In a file?
  • How many URLs to start with? Does this depend on the number of nodes? Threads? Neither?
  • Top Alexa sites could be used as a source of ideas:
  • CNN
  • tumblr
  • yahoo
  • wordpress
  • wikipedia?
  • reddit (lots of links to various parts of the internet, but also a lot of imgur links)

What is the RobotsModule instance field host?

And why does it have type URL?

I thought the RobotsModule was meant to be a facility for processes to determine whether a URL is allowed to be crawled. I didn't think it would have state, besides perhaps some LRU caching of recently accessed RobotsTxt objects.

Speed-Up Indexer

It seems that adding new Hits to the Indexer is a bit slow at the moment. Is there a good way to stay synchronized while batching new word additions to the lexicon? Maybe this is where DumpLexicon comes in?
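One common shape for this is a synchronized buffer that collects hits and flushes them to the lexicon in groups, trading per-hit latency for throughput. This is a hypothetical sketch (the class name and `Consumer`-based flush hook are assumptions, not the project's API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical batching buffer: collect new hits and flush them to the
// lexicon in groups rather than one write per hit.
public class HitBatcher {
    private final List<String> buffer = new ArrayList<>();
    private final int batchSize;
    private final Consumer<List<String>> flushAction;

    public HitBatcher(int batchSize, Consumer<List<String>> flushAction) {
        this.batchSize = batchSize;
        this.flushAction = flushAction;
    }

    public synchronized void add(String hit) {
        buffer.add(hit);
        if (buffer.size() >= batchSize) flush();
    }

    public synchronized void flush() {
        if (buffer.isEmpty()) return;
        flushAction.accept(new ArrayList<>(buffer));  // hand off a copy
        buffer.clear();
    }
}
```

Callers still see a synchronous `add`, but the expensive lexicon write happens once per batch; a final `flush()` at shutdown drains the remainder.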

Configuration Singleton

Create a Configuration Singleton class so we can

  • share project configuration properties
  • make use of test and production configurations
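A minimal sketch of such a singleton, assuming a `yippee.properties` file on the classpath (the file name and key names are assumptions). Letting system properties override file values is one simple way to switch between test and production configurations:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Sketch of the proposed Configuration singleton; the properties file
// name and override scheme are assumptions.
public class Configuration {
    private static Configuration instance;
    private final Properties props = new Properties();

    private Configuration() {
        // Load from the classpath; missing file just means all defaults.
        try (InputStream in = Configuration.class
                .getResourceAsStream("/yippee.properties")) {
            if (in != null) props.load(in);
        } catch (IOException e) {
            throw new RuntimeException("Could not load configuration", e);
        }
    }

    public static synchronized Configuration getInstance() {
        if (instance == null) instance = new Configuration();
        return instance;
    }

    public String get(String key, String defaultValue) {
        // System properties win, so tests can override production settings.
        return System.getProperty(key, props.getProperty(key, defaultValue));
    }
}
```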

log4j question

Hey @nvasilakis, this is probably simple, but how do we get rid of these warnings? This one came up when I ran RobotsManagerTest.java:

log4j:WARN No appenders could be found for logger (com.yippee.db.crawler.RobotsManager).
log4j:WARN Please initialize the log4j system properly.
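This warning means no appender has been configured for the root logger. One standard fix is to put a minimal `log4j.properties` on the test classpath, for example:

```properties
log4j.rootLogger=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%-5p %c - %m%n
```

Alternatively, calling `org.apache.log4j.BasicConfigurator.configure()` once at test startup installs a default console appender without any file.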

Add Lemmatizer

Use the Stanford NLP lemmatizer to assist in interpreting queries.
