webcrawler-from-scratch's Introduction

Web crawler from scratch in Go

Ever wondered how google.com works? What's under the hood that lets any user type in a string and get related results from the web, given its inherent complexity and vastness? How does the search engine index all those websites and relate their contents to the input string?

We'll try to answer some of these questions by building a simplified version of the main component that powers every search engine at its simplest: a web crawler. We won't cover all the sophisticated techniques and ranking algorithms at the core of the Google engine; they're the result of years of research and improvement, and it would take a book of its own just to scratch the surface of those topics.

This will be a tutorial on how to build something akin to a raw search engine, starting from its innermost component and extending it by adding features chapter by chapter.
The repository containing the code is https://github.com/codepr/webcrawler. During the journey we'll touch on many system design concepts:

  • Microservices
  • Middlewares
  • Network unreliability
  • Concurrency
  • Scalability
    • Consistency patterns
    • Availability patterns

And, more in depth on the topic of crawling itself:

  • Web crawler main characteristics
    • Politeness
    • Crawling rules
  • Reverse indexing services
  • Content signatures
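
To give a first taste of how a search engine can relate page contents to an input string, here is a minimal sketch of a reverse (inverted) index in Go. It is not taken from the repository: the `Index` type, its `Add` and `Search` methods, the `tokenize` helper and the example URLs are all hypothetical names chosen for illustration. It also assumes the simplest possible matching rule, namely that a page matches when it contains every word of the query, with no ranking at all.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// Index is a toy reverse (inverted) index: it maps each term to the set of
// URLs whose content contains that term. Hypothetical type for illustration,
// not the repository's implementation.
type Index struct {
	postings map[string]map[string]struct{}
}

// NewIndex returns an empty index.
func NewIndex() *Index {
	return &Index{postings: make(map[string]map[string]struct{})}
}

// tokenize lowercases the text and splits it on every rune that is neither
// a letter nor a digit.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// Add records every term found in content as occurring at url.
func (idx *Index) Add(url, content string) {
	for _, term := range tokenize(content) {
		if idx.postings[term] == nil {
			idx.postings[term] = make(map[string]struct{})
		}
		idx.postings[term][url] = struct{}{}
	}
}

// Search returns, in sorted order, the URLs containing every term of the
// query. An empty query returns no results.
func (idx *Index) Search(query string) []string {
	terms := tokenize(query)
	if len(terms) == 0 {
		return nil
	}
	var results []string
	// Start from the pages matching the first term, then keep only those
	// that also contain all the remaining terms.
	for url := range idx.postings[terms[0]] {
		matchesAll := true
		for _, term := range terms[1:] {
			if _, ok := idx.postings[term][url]; !ok {
				matchesAll = false
				break
			}
		}
		if matchesAll {
			results = append(results, url)
		}
	}
	sort.Strings(results)
	return results
}

func main() {
	idx := NewIndex()
	idx.Add("https://example.com/go", "Concurrency patterns in Go")
	idx.Add("https://example.com/crawl", "Writing a polite web crawler in Go")

	fmt.Println(idx.Search("go"))          // both pages mention "go"
	fmt.Println(idx.Search("web crawler")) // only the second page matches
}
```

In this picture the crawler's job is to feed `Add` with the pages it fetches, while the lookup side stays a simple map access per query term. A real reverse indexing service would also need persistence, ranking and safe concurrent access, none of which this sketch attempts.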
