
la-toile-francaise's Introduction

Let's Build a Search Engine for the French Web

Wouldn't it be cool to build our very own search engine?

I propose the challenge of building a web search engine à la Google, Yahoo, or DuckDuckGo. We can learn a lot about data processing, ways to optimize code, and trade-offs to make. I define a bunch of rules in advance to ensure the project stays small and accessible.

This challenge is open to anyone interested. The rules can change in the future. Participation is voluntary, and people should only work on it in their free time. It is a collaborative project; therefore, different versions of a specific part of the system, written in different programming languages and using different algorithms, can be plugged together.

I initially thought about this project when reading AWS Abusing Search Engine Gets Abused, in which Boyter talks about a search engine that only indexes Australian websites. However, I would prefer that we use French services and increase France's network sovereignty.

Rules and Constraints

Contributors can modify these rules in the future, but the most recent contributors must agree to the changes. Developers must strictly follow these rules for a contribution to be accepted.

User Experience and Interface

There must be a web page with a text input, called a "search bar", where the user types their query. The result is a list of links to the best-matching pages, where the visible text is the webpage title. The system shows only the forty (40) best results, on a single page.

The system must return this list of results in less than a second. We must do our best to compute the best results we can for any given user query within this amount of time. Sometimes it will be hard, and some queries will return degraded results. We can inform the user when that happens.

The project must maintain a "one nine" (90%) availability to keep a good enough quality of service without anyone having to wake up in the middle of the night.

I define best-matching results as web pages that are relevant in time (non-outdated) and appropriate to the query. However, some results can degrade the search experience, like phishing and scam websites, and we must hide them as much as we can.

Cost of the Project

The project must be relatively inexpensive. We must spend less than 200€/month. We must find the most efficient way to compute, store, and retrieve the data we gather. One or multiple machines, a storage service, or anything else that can reduce costs is allowed. However, abusing free-trial offers is not acceptable. This project must be viable and sustainable over time without hurting other companies.

The web is a tough place to live. DDoS attacks are common, and our project can suffer from them. We must find a cost-efficient way to keep the service available to genuine users without increasing maintenance costs too much. We do not accept donations for the project as of now.

Search Set

This project aims to expose a good web search engine that returns the best results for French websites. I define a French website as any webpage mainly composed of French text, but as of now, it may be easier to restrict to the French TLD (.fr).
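The TLD restriction above can be sketched as a crawl-time filter. This is a minimal, hand-rolled host extraction; `is_french_tld` is an illustrative name, and a real crawler would rather use a proper URL parser (e.g. the `url` crate):

```rust
// Naive check that a URL's host is under the .fr TLD.
// Hypothetical helper for illustration, not a robust URL parser.
fn is_french_tld(url: &str) -> bool {
    // Drop the scheme, if any.
    let rest = url.split("://").nth(1).unwrap_or(url);
    // The host is everything before the first '/', '?' or '#'.
    let host = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("");
    // Drop an optional port, then check the TLD.
    let host = host.split(':').next().unwrap_or("");
    host.ends_with(".fr")
}

fn main() {
    assert!(is_french_tld("https://www.service-public.fr/particuliers"));
    assert!(!is_french_tld("https://fr.wikipedia.org/wiki/France"));
}
```

Note that this rejects French-language pages hosted outside .fr (like fr.wikipedia.org), which is exactly the trade-off the rule above acknowledges.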

Technology-Wise

As this project must keep costs down, we can start by relying on existing resources like Common Crawl to gather websites. However, this project also aims to deliver the best results; therefore, we will surely need to design our own tools in the future.

This project is more an educational project than a money-wise sustainable company, which means that we are here to learn and to design our own system. Contributors should create the essential bricks from the ground up. We can use other technologies, but we must architect those bricks ourselves.

We must keep a simple design and simple tools to achieve our goals. Avoid complex tools and configurations that are hard to maintain and understand. Keep the codebase's modules and behaviors well documented so that external and internal contributors can dive into it quickly.

Gathering Statistics

As the project must be sustainable, gathering statistics is crucial and must be done from the start. Understanding how our users search and whether they find the results interesting is essential. It would be great to have graphs showing the indexing time, search time, web-crawling status, errors, and our system logs.

License

The project must use an open-source license, e.g., MIT or Apache-2.0. However, a lot of software and hardware is not open source, and we allow running our project on non-open-source services. Still, when we have the choice between equivalent technologies, a contributor must choose the open-source one, provided a developer can easily contribute any missing features to the open-source project.

la-toile-francaise's People

Contributors

kerollmops

la-toile-francaise's Issues

Identifying the Documents

When searching, we must always identify the documents with a short and straightforward identifier. This identifier will be used in unions, intersections, and storage operations.

I first thought about starting with an incremental u64, with only a single identifier program. This program would take a URL as a parameter and return a new or known id, depending on whether the system already knows this URL. A technology like Snowflake IDs could make sense, but synchronization must be done beforehand to ensure we know whether this URL already exists.
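The single identifier program described above can be sketched as an in-memory interner. The names (`UrlInterner`, `get_or_insert`) are illustrative, and a real deployment would back the map with persistent storage:

```rust
use std::collections::HashMap;

// Sketch: maps a URL to an incremental u64, returning the existing id
// when the URL is already known. Purely illustrative names and design.
struct UrlInterner {
    ids: HashMap<String, u64>,
    next_id: u64,
}

impl UrlInterner {
    fn new() -> UrlInterner {
        UrlInterner { ids: HashMap::new(), next_id: 0 }
    }

    /// Returns `(id, is_new)` for the given URL.
    fn get_or_insert(&mut self, url: &str) -> (u64, bool) {
        if let Some(&id) = self.ids.get(url) {
            return (id, false);
        }
        let id = self.next_id;
        self.next_id += 1;
        self.ids.insert(url.to_string(), id);
        (id, true)
    }
}

fn main() {
    let mut interner = UrlInterner::new();
    let (a, new_a) = interner.get_or_insert("https://example.fr/");
    let (b, new_b) = interner.get_or_insert("https://example.fr/");
    assert_eq!(a, b);
    assert!(new_a && !new_b);
}
```

Dense incremental u64 ids also keep the union and intersection operations mentioned above cheap, since they fit naturally in sorted arrays or bitmaps.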

The issue I can see is the storage of the URLs. We must find an efficient way to store them without wasting too much space. Using reverse domain name notation can be an excellent way to encode the domain name while storing the URL path as is. Most of the time, key-value stores can skip common prefixes between keys. Unfortunately, I simplified grenad and removed this feature, but we can always bring it back if necessary.
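The reverse domain name notation idea can be sketched as a key encoder: reversing the host's labels makes all keys from the same domain share a long common prefix, which prefix-compressing stores can exploit. `reverse_domain_key` is an illustrative name, not an existing API:

```rust
// Sketch: encode a URL as a storage key in reverse domain name
// notation, keeping the path as is. Assumes a well-formed URL.
fn reverse_domain_key(url: &str) -> String {
    // Drop the scheme, then split host from path.
    let rest = url.split("://").nth(1).unwrap_or(url);
    let (host, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };
    // Reverse the dot-separated labels of the host.
    let mut labels: Vec<&str> = host.split('.').collect();
    labels.reverse();
    format!("{}{}", labels.join("."), path)
}

fn main() {
    // Keys from the same domain now share the "fr.example" prefix.
    assert_eq!(reverse_domain_key("https://www.example.fr/a/b"), "fr.example.www/a/b");
    assert_eq!(reverse_domain_key("https://blog.example.fr/"), "fr.example.blog/");
}
```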

I also think a lot about Bloom filters and Xor filters.

One thing to keep in mind: design the fastest possible system to determine whether we already know a URL, and generate a fresh identifier if not.
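This is where a Bloom filter could help as a fast first pass: a definite "no" means the URL is new and gets a fresh identifier immediately, while a "maybe" requires checking the real store. The sketch below is a tiny, untuned Bloom filter built on the standard library hasher, with illustrative sizes and a simple double-hashing scheme:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Sketch of a minimal Bloom filter; parameters are illustrative.
struct BloomFilter {
    bits: Vec<u64>,
    num_hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, num_hashes: u64) -> BloomFilter {
        BloomFilter { bits: vec![0; (num_bits + 63) / 64], num_hashes }
    }

    // Derive two base hashes, combined below as h1 + i * h2.
    fn base_hashes(&self, item: &str) -> (u64, u64) {
        let mut h1 = DefaultHasher::new();
        item.hash(&mut h1);
        let mut h2 = DefaultHasher::new();
        (item, 0xdead_beef_u64).hash(&mut h2);
        (h1.finish(), h2.finish())
    }

    fn insert(&mut self, item: &str) {
        let (h1, h2) = self.base_hashes(item);
        let nbits = (self.bits.len() * 64) as u64;
        for i in 0..self.num_hashes {
            let bit = h1.wrapping_add(i.wrapping_mul(h2)) % nbits;
            self.bits[(bit / 64) as usize] |= 1u64 << (bit % 64);
        }
    }

    /// `false` means definitely absent; `true` means possibly present.
    fn maybe_contains(&self, item: &str) -> bool {
        let (h1, h2) = self.base_hashes(item);
        let nbits = (self.bits.len() * 64) as u64;
        (0..self.num_hashes).all(|i| {
            let bit = h1.wrapping_add(i.wrapping_mul(h2)) % nbits;
            self.bits[(bit / 64) as usize] & (1u64 << (bit % 64)) != 0
        })
    }
}

fn main() {
    let mut filter = BloomFilter::new(1024, 3);
    filter.insert("https://example.fr/");
    assert!(filter.maybe_contains("https://example.fr/"));
}
```

An Xor filter would trade the Bloom filter's incremental inserts for better space efficiency, which fits better for immutable batches of already-crawled URLs.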
