
dixit's Introduction

What is Dixit❔

Disclaimer: I will quote historical people, so I give credits. "Render unto Caesar" as they say.

Dixit is an online app that boosts your confidence: when you have a smart idea, it shows you which historical figures came upon the same idea. "Great minds think alike", as they say 🧐.

But the truth is, these particular great minds probably formulated the idea way better than you could. So why not quote them directly, and "stand on the shoulders of [these] giants"?

Dixit lets you type in your rough-edged idea, and through a "tremendous" "AI powered semantic search", it retrieves the quotes from its database of historical quotations that best rephrase this idea ✨.

How it works

The website is coded in Django, to make use of Python's powerful ML libraries. "A good sketch is worth a long speech", so here's a graph of the system design.

graph TD
    subgraph GITHUB REPO
        searchbar_page(<font color=black>Index / Search page<br>/User sentence/)
        author_page(<font color=black>Specific Author page)
        quotes_table{<font color=black>fa:fa-database Quotes<br>table<br>4 MB}
        faiss_index{<font color=black>fa:fa-search FAISS Index<br>36 MB}
        search_request[Embedded sentence]
        searchbar_page-.LLM embedding-.->search_request
        quotes_table-.Returns closest quotes-.->searchbar_page
        search_request-.Similarity search-.->faiss_index
        faiss_index -.Query quotes at selected indexes-.-> quotes_table
        author_page --Gets author quotes--> quotes_table
    end

    subgraph POSTGRESQL DATABASE
        authors_table{<font color=black>fa:fa-database Authors table}
    end

    %% Notice that no text in shape are added here instead that is appended further down
    author_page--Get author data-->authors_table

    classDef green fill:#9f6,stroke:#333,stroke-width:3px;
    classDef orange fill:#f96,stroke:#333,stroke-width:3px;
    class author_page,searchbar_page green
    class quotes_table,faiss_index,authors_table orange

The solid lines are actions performed by the Specific Author page, while the dotted lines are actions performed by the Index / Search page. In particular, searching for a sentence triggers the following process:

  1. The query sentence is embedded into the vector space by a Large Language Model from the Hugging Face library.
  2. A similarity search based on Meta's FAISS library returns the indexes of the closest quotes in the database (which means the database of quotes has been embedded into the same vector space beforehand).
  3. The database of quotes is queried at these indexes, and the results are displayed to the user.
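The three steps above can be sketched in plain Python. This is only an illustration, not the code running on the site: `embed` is a toy bag-of-words stand-in for the LLM embedding, the brute-force cosine ranking stands in for the FAISS search, and the quotes, author names and function names are all made up for the example.

```python
import math
import re
from collections import Counter

# Placeholder quotes table (hypothetical data): row position = index id.
QUOTES = [
    ("Author A", "Patience wins more battles than force."),
    ("Author B", "A clear drawing explains more than many words."),
    ("Author C", "Knowledge grows when it is shared."),
]


def embed(text):
    """Toy stand-in for the LLM embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# The quotes are embedded once, ahead of time (the role of the FAISS index).
QUOTE_VECTORS = [embed(text) for _, text in QUOTES]


def search(sentence, k=2):
    # Step 1: embed the user sentence.
    query = embed(sentence)
    # Step 2: similarity search, returning the positions of the closest quotes.
    ranked = sorted(
        range(len(QUOTE_VECTORS)),
        key=lambda i: cosine(query, QUOTE_VECTORS[i]),
        reverse=True,
    )
    closest = ranked[:k]
    # Step 3: query the quotes table at these positions.
    return [QUOTES[i] for i in closest]


if __name__ == "__main__":
    for author, text in search("sharing knowledge makes it grow", k=1):
        print(f"{text} ({author})")
```

The key point is that the search only returns positions; the quote text and author are fetched from the quotes table afterwards, which is why the two must stay aligned.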

As you can see above, the server is a clone of a GitHub repo that hosts both the FAISS index and the Quotes database: while this is certainly not the most storage-light option (and my git push commands are painful), I have good reasons to do this.

My main constraints are: fast loading of the pages (my Railway hosting is a shared node), not eating up too much RAM (I don't want to pay), and being able to update the quotation database whenever I have new quotes.

  • First, why separate the Quotes database and the FAISS index instead of using a single Hugging Face dataset with an associated index? Well, this dataset class seems to have an implementation flaw that prevents saving a dataset together with its index: I can either store it without the index and recompute the index on boot, which makes the page too slow, or pickle it, but then the object is too heavy for my RAM. So I chose to keep a separate FAISS index, which loads very quickly.
  • But I want to keep them close together, since whenever I add or remove a line in the quote database, I have to update the associated index with it.
  • Why host the FAISS index in the GitHub repo: you can't perform a FAISS search in a PostgreSQL table, and remote hosting (like Pinecone) would add too much complexity to the system. Here, the index is light, so I kept things simple.
  • Why host the Quotes database in the GitHub repo: this allows quick versioning when I want to add new quotations to the database, and it's very light compared to the index (4 MB vs. about 40 MB), so it's not a huge burden in memory.
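To illustrate why the quotes and the index are kept side by side, here is a minimal sketch of a store that updates both in lockstep and saves them as two separate files. Everything in it is hypothetical: the `QuoteStore` class, the file names, and the use of JSON files in place of the real quotes table and FAISS index file. It only shows the idea that row i of the quotes must always match vector i of the index, and that both can be reloaded at boot without recomputing any embedding.

```python
import json
import tempfile
from pathlib import Path


class QuoteStore:
    """Keeps quote rows and their vectors aligned: row i <-> vector i."""

    def __init__(self):
        self.quotes = []   # list of {"author": ..., "text": ...}
        self.vectors = []  # list of embedding vectors (lists of floats)

    def add(self, author, text, vector):
        # Appending to both lists keeps the positions aligned.
        self.quotes.append({"author": author, "text": text})
        self.vectors.append(list(vector))

    def remove(self, position):
        # Removing a quote must remove its vector too.
        del self.quotes[position]
        del self.vectors[position]

    def save(self, folder):
        # Two separate files, written together (hypothetical file names).
        folder = Path(folder)
        (folder / "quotes.json").write_text(json.dumps(self.quotes))
        (folder / "index.json").write_text(json.dumps(self.vectors))

    @classmethod
    def load(cls, folder):
        # Loading reads precomputed vectors: nothing is re-embedded on boot.
        folder = Path(folder)
        store = cls()
        store.quotes = json.loads((folder / "quotes.json").read_text())
        store.vectors = json.loads((folder / "index.json").read_text())
        if len(store.quotes) != len(store.vectors):
            raise ValueError("quotes table and index are out of sync")
        return store


if __name__ == "__main__":
    store = QuoteStore()
    store.add("Author A", "first placeholder quote", [1.0, 0.0])
    store.add("Author B", "second placeholder quote", [0.0, 1.0])
    store.remove(0)
    with tempfile.TemporaryDirectory() as folder:
        store.save(folder)
        reloaded = QuoteStore.load(folder)
    print(reloaded.quotes, reloaded.vectors)
```

Committing both files in the same repo means a single commit always contains a matching pair.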

On the other hand, I hosted the authors table in the PostgreSQL database offered for free by Railway (❤), since this table won't change much, and this format makes it easy to host and query.

Services

  • Website is hosted on the flawless Railway. This provider gives immaculate service so far 🌟 + it's very inexpensive.
  • Analytics: Goatcounter. I highly recommend it: it's GDPR compliant, uses no cookies, and is easy to set up as a single script in your HTML header.
  • Domain name registration: Google Domains. Obtaining a domain name works like a breeze with a Google account.
  • DNS provider: Cloudflare. To make it work with Railway, I had to follow this tutorial with Cloudflare as a DNS provider. One little tweak to note: I had to change the SSL/TLS setting to "Full" encryption in Cloudflare for it to work properly.

Paths for improvement

  • The database of quotes has 24,000 quotes. This is a lot if you consider that they're mostly intelligent sentences and not randomly picked, but it's not yet enough to capture all ideas. I'll add many more in the months to come.
  • Due to high noise in our databases for recent quotes, I only kept the quotes from older authors, since I needed an automatable criterion and I think time helps a lot to separate the wheat from the chaff. But still, you're very welcome to also propose good recent quotes! (On a purely subjective basis)
  • The LLM used to embed user sentences into the search space is inherently limited in its representation: for instance, it can sometimes misinterpret the semantics of a query. But I am confident that as LLMs improve on semantic search, the results will become even more accurate.

Try it out

The site is available at dixit.app. Try it out, and please leave me some feedback! 😃

dixit's People

Contributors

aymeric-roucher
