crawly.js is the crawler for the fruits example site, and crawly2.js is the crawler for personal (Wikipedia) pages. You can run them separately; each one creates its own collection in the same database.
Nirmith D'Almeida, 101160124, [email protected]
Johnathan Scaife, 101145480, [email protected]
Ali Hassan Sharif, 101142782, [email protected]
https://www.youtube.com/watch?v=Kt82suIoy0E
- Web Crawler
  a) Fruit example site
  b) Wikipedia
  Pages are stored in the database with their page data and PageRank calculations.
- RESTful Web Server
  a) / - Home page that displays the entire collection
  b) /search - Page to specify search parameters
  c) /fruits - Search results from the fruits collection (supports JSON via Postman)
  d) /fruits/:id - Data on an individual fruit page
  e) /personal - Search results from the personal (Wikipedia) collection (supports JSON via Postman)
  f) /personal/:id - Data on an individual personal (Wikipedia) page
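These endpoints map naturally to Express routes. A framework-agnostic sketch of one handler is below; the in-memory data and field names are placeholders (the real project reads from MongoDB), and with Express this function would be mounted as a route handler.

```javascript
// In-memory stand-in for the database collections (placeholder data).
const collections = {
  fruits: [{ id: '0', title: 'Apple', pagerank: 0.12 }],
  personal: [],
};

// Handler shape matches Express: (req, res). Serves GET /fruits/:id
// and GET /personal/:id by looking the page up in the right collection.
function getPageById(req, res) {
  const pages = collections[req.params.collection] || [];
  const page = pages.find(p => p.id === req.params.id);
  if (page) res.json(page);       // found: return the page data
  else res.status(404).end();     // not found: 404 with empty body
}
```

With Express this could be wired up as `app.get('/:collection/:id', getPageById)`, or as two separate `/fruits/:id` and `/personal/:id` routes as listed above.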
Both /fruits and /personal accept the following query parameters:
  a) q - string representing the search query
  b) boost - true or false, indicating whether results are boosted by PageRank
  c) limit - the number of search results to return, from 1 to 50
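Validating these parameters could be done with a small helper like the following; the default limit of 10 and the clamping behaviour are assumptions for illustration, not the project's documented defaults.

```javascript
// Hypothetical normalizer for the /fruits and /personal query strings.
// Express exposes query parameters as strings on req.query.
function parseSearchParams(query) {
  const q = typeof query.q === 'string' ? query.q : '';
  const boost = query.boost === 'true';        // 'true'/'false' as strings
  let limit = parseInt(query.limit, 10);
  if (!Number.isInteger(limit) || limit < 1 || limit > 50) {
    limit = 10; // out-of-range or missing: fall back to an assumed default
  }
  return { q, boost, limit };
}
```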
- OpenStack Deployment
  a) Server deployed to OpenStack
  b) PUT request sent to the distributed search engine using axios
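The registration PUT could be sketched as follows. The endpoint URL and the payload field name here are assumptions for illustration; the actual distributed-search-engine API is defined by the course.

```javascript
// Build the registration payload (the field name "request_url" is a guess,
// not the real API's schema).
function buildRegistration(serverUrl) {
  return { request_url: serverUrl };
}

// Send the PUT with axios (requires: npm install axios).
async function registerWithSearchEngine(serverUrl) {
  const axios = require('axios'); // required lazily so the file loads without axios
  const res = await axios.put(
    'http://distributed-search.example/searchengines', // placeholder URL
    buildRegistration(serverUrl)
  );
  return res.status;
}
```

In practice `serverUrl` would be the public OpenStack address of the deployed server, e.g. its `/fruits` search endpoint.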
- How does your crawler work? What information does it extract from the page? How does it store the data? Is there any intermediary processing you perform to facilitate the later steps of the assignment?
- Discuss the RESTful design of your server. How has your implementation incorporated the various REST principles?
- Explain how the content score for the search is generated.
- Discuss the PageRank calculation and how you have implemented it.
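For reference when discussing the PageRank calculation, the standard power-iteration formulation can be sketched as below. The damping factor 0.85 is the conventional default and the dangling-page handling is one common choice; the project's actual parameters and implementation may differ.

```javascript
// links[i] = array of page indices that page i links to.
// Iterates rank = (1-d)/n + d * sum(rank[in] / outdegree[in]) until convergence.
function pageRank(links, d = 0.85, tol = 1e-8) {
  const n = links.length;
  let ranks = new Array(n).fill(1 / n); // start uniform
  for (let iter = 0; iter < 1000; iter++) {
    const next = new Array(n).fill((1 - d) / n); // teleportation term
    for (let i = 0; i < n; i++) {
      const out = links[i];
      if (out.length === 0) {
        // Dangling page: distribute its rank evenly over all pages.
        for (let j = 0; j < n; j++) next[j] += (d * ranks[i]) / n;
      } else {
        for (const j of out) next[j] += (d * ranks[i]) / out.length;
      }
    }
    const delta = next.reduce((s, v, i) => s + Math.abs(v - ranks[i]), 0);
    ranks = next;
    if (delta < tol) break; // converged
  }
  return ranks;
}
```

The resulting vector sums to 1, and pages with more (or higher-ranked) in-links receive a higher score, which is what the boost parameter would multiply into the content score.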
- How have you defined the page selection policy for the crawler of your personal site?
- Why did you select the personal site you chose? Did you run into any problems when working with this site? How did you address these problems?
- Critique your search engine. How well does it work? How well will it scale? How do you think it could be improved?