
tangojuice's Issues

Denial of service with regex time bomb

Running the app with the text

Das Schloss Neuschwanstein steht oberhalb von Hohenschwangau bei Füssen im südöstlichen bayerischen Allgäu. Der Bau wurde ab 1869 für den bayerischen König Ludwig II. als idealisierte Vorstellung einer Ritterburg aus der Zeit des Mittelalters errichtet. Die Entwürfe stammen von Christian Jank, die Ausführung übernahmen Eduard Riedel und Georg von Dollmann. Der König lebte nur wenige Monate im Schloss, er starb noch vor der Fertigstellung der Anlage.

\s|(([ac]+c?)*)?ca+b|acbcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\s

with the source language set to German and “Remove proper nouns” unchecked will hang the app for a very long time (and make it unresponsive to SIGINT). Adding more a characters to the pattern makes the hang longer.

Explanation:

In vocab.py, at line 179, you run a regex replacement with a regex that is constructed from partially unsanitized user input (a word form extracted from the input text).

replacement = r"<b>" + form + r"</b>"
regex = r"\b" + form + r"\b"
html_example = re.sub(regex, replacement, word.html_example)

This means that a malicious user can inject regex metacharacters into the pattern and, with a correctly crafted input, send the regex engine into catastrophic backtracking. The unresponsiveness to SIGINT is, I suspect, caused by the backtracking happening in C code that doesn't check for signals (which makes the regex engine faster, but in this case works against you).
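As a harmless illustration of the injection itself (not of the hang), here is a minimal sketch that reuses the three quoted lines. The function name highlight_unsafe is hypothetical, and the sketch uses the standard library re rather than mrab regex, so it only shows that metacharacters in form are interpreted as regex syntax.

```python
import re


def highlight_unsafe(form: str, html_example: str) -> str:
    # Hypothetical wrapper around the quoted lines: `form` is spliced
    # into the pattern without any escaping.
    replacement = r"<b>" + form + r"</b>"
    regex = r"\b" + form + r"\b"
    return re.sub(regex, replacement, html_example)


# "a.c" is meant as a literal word form, but "." acts as a wildcard,
# so the unrelated word "abc" is matched and replaced.
result = highlight_unsafe("a.c", "abc def")
```

Any character with a special meaning in regex syntax is interpreted the same way, which is what a crafted backtracking pattern relies on.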

Now, it isn't easy to run into this, because:

  • You use mrab regex, which has many optimization tricks that make this harder. It is far easier to make regex bombs with the standard library re.
    • I got this one to work by starting from mrabarnett/mrab-regex#424 and trimming it down to survive through preprocessing (see next points).
  • The input actually goes through some processing:
    • spaCy tokenization means that the malicious regex must not be split into several words (which is why it can't start with an opening parenthesis).
    • It has to survive DeepL translation (which is unlikely for a malicious regex) or skip this phase (which is why we have to uncheck proper noun removal).

Nevertheless, this is a security vulnerability. My suggestions are

  • Do not ever trust user input: always run it through some sort of escaping/sanitizing, such as re.escape.
    • is_not_alpha could have given you some protection, but it is true as soon as there is at least one alphanumeric character in its input. Tightening that condition could have been a protection, although not an ideal one.
    • Using a non-backtracking regex engine would also mask the issue, but again, not ideal.
  • Add a timeout mechanism on the requests, which would also give some protection for other forms of DoS attacks.
  • Add a security policy to your repository, so future vulnerability reports can stay confidential until you deploy a fix.
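The re.escape suggestion could look roughly like the sketch below. The function name highlight_safe is hypothetical and the standard library re is used for self-containment. I believe mrab regex also provides an escape function, but treat that as an assumption to check against its documentation.

```python
import re


def highlight_safe(form: str, html_example: str) -> str:
    # re.escape turns every metacharacter in the user-derived form
    # into a literal, so the form can no longer act as a pattern.
    regex = r"\b" + re.escape(form) + r"\b"
    # A function replacement keeps backslashes in the matched text from
    # being read as group references in a replacement template.
    return re.sub(regex, lambda m: "<b>" + m.group(0) + "</b>", html_example)


# The "." is now literal: "abc" is left alone, only "a.c" is wrapped.
safe_result = highlight_safe("a.c", "abc a.c")
```

With this change the crafted form from the report would be searched for as a literal string, so it should no longer be able to trigger backtracking in this call.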

Heroku crashes when extracting Anki flashcards

2022-01-01T09:57:45.126577+00:00 heroku[web.1]: Process running mem=617M(120.7%)
2022-01-01T09:57:45.136238+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)

To reproduce error: do several Anki extractions in a row... (it doesn't always happen!)
