Web Crawler and SiteMap Creator - README

This application is a Rails-based web crawler and sitemap creator. It will take an input. To see a live version of the application, please go to label.

The web crawler algorithm is recursive.
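As a rough illustration of the recursive approach (a minimal sketch, not the application's actual code), the method below builds a nested sitemap hash. The names `crawl`, `fetch_links` and `visited` are hypothetical. `fetch_links` is assumed to be any callable that returns the link URLs found on a page, which keeps the sketch runnable without network access.

```ruby
require "set"

# Hypothetical sketch of a recursive crawl.
# fetch_links: callable taking a URL string and returning an array of URLs
#              found on that page (an assumption made for illustration).
# visited:     URLs already placed in the map, shared across recursive calls.
def crawl(url, fetch_links, visited = Set.new([url]))
  # Keep only links that have not been listed anywhere in the map yet.
  new_links = fetch_links.call(url).uniq.reject { |link| visited.include?(link) }

  # Mark this page's links as seen before descending, so a link found on a
  # parent page is not listed again under one of its children.
  visited.merge(new_links)

  # Recurse into each new link; the result is a nested hash of pages.
  new_links.to_h { |link| [link, crawl(link, fetch_links, visited)] }
end

# Usage with an in-memory stand-in for a real site:
site = {
  "/"      => ["/about", "/blog"],
  "/about" => ["/", "/blog", "/team"],
  "/blog"  => ["/about", "/post"]
}
sitemap = crawl("/", ->(url) { site.fetch(url, []) })
# sitemap == { "/about" => { "/team" => {} }, "/blog" => { "/post" => {} } }
```

In a real crawler, `fetch_links` would download and parse the page. Passing it in as a callable is only a design choice for this sketch, made so the recursion can be tried in isolation.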

Installation

$ git clone [email protected]:gs2589/web_crawler_example.git && cd web_crawler_example
$ bundle install
$ rails s

Then navigate to localhost:3000.

Notes:

  • Ruby and Rails must be installed

  • See Gemfile for dependencies

  • A database is not necessary

Notes about this implementation

  • Notable Gems used:

    • This application uses the Nokogiri, URI and OpenURI gems for web navigation, HTML parsing and URI manipulation

  • Code Design: the following features were implemented to comply with the single responsibility design principle:

    • Adapter pattern: An adapter pattern is used for the web crawler. An adapter handles the web crawling and map creation. The controller receives user requests, instantiates an adapter and responds to the user with the adapter's response.

    • Helper methods: Helper methods were created to handle formatting of the outputted map, determining whether a link is internal or external, determining what the host domain is, and determining whether a link is an image or another page.

  • RESTful conventions are not generally followed

  • Scope

    • This application crawls all links found in <a> or <img> HTML elements.

    • Scroll-links such as domain_name.com/#home are ignored

    • Links that appear in a parent page are not listed in child pages

    • Links appear only once unless they are aliased

    • The application follows links regardless of whether they are written in relative or absolute notation

    • Links that return errors are ignored

  • Sleep Times

    • The application uses a 0.5-second sleep time between requests to be respectful of crawled sites and avoid overloading them.
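The link-handling rules listed above (relative versus absolute notation, internal versus external links, scroll-links, images versus pages) could be sketched with Ruby's standard URI library as follows. This is an illustrative sketch, not the application's helper code: the method names and the `IMAGE_EXTENSIONS` list are assumptions. The real helpers may classify images differently, for example by the tag the link came from.

```ruby
require "uri"

# Assumed list of image extensions, for illustration only.
IMAGE_EXTENSIONS = %w[.jpg .jpeg .png .gif .svg .webp].freeze

# Resolve an href (relative or absolute) against the page it was found on.
def absolute_url(href, page_url)
  URI.join(page_url, href).to_s
end

# A link is treated as internal when it resolves to the same host as the page.
def internal_link?(href, page_url)
  URI.join(page_url, href).host == URI.parse(page_url).host
end

# Assumption: any link carrying a fragment (e.g. "/#home") is a scroll-link.
def scroll_link?(href, page_url)
  !URI.join(page_url, href).fragment.nil?
end

# Assumption: images are recognized by the file extension of the URL path.
def image_link?(url)
  IMAGE_EXTENSIONS.include?(File.extname(URI.parse(url).path).downcase)
end

# Usage:
page = "http://example.com/docs/index.html"
absolute_url("about.html", page)                 # "http://example.com/docs/about.html"
internal_link?("http://other.org/", page)        # false
scroll_link?("#home", page)                      # true
image_link?("http://example.com/logo.PNG")       # true
```

A politeness delay like the one described above would sit outside these helpers, for instance a `sleep(0.5)` call before each page request.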
