Comments (5)

nil0x42 commented on June 14, 2024

Hello!
First and foremost, duplicut is not meant to be faster than sort -u (but I'm happy to see it is in some cases).

What sort -u does is sort the file alphabetically, then iterate through the lines to check whether line == line+1, deleting the line if so.
What duplicut does is actually a lot more complicated, as it is able to remove duplicates without sorting. So if you don't need to keep the original order, it might be better to use sort, or other tools for duplicate removal.
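
To make the difference concrete, here is a minimal Python sketch of the two behaviors described above. It is only an illustration, not how duplicut is implemented internally: the function names `sort_unique` and `dedupe_keep_order` are made up for this example, and Python sorts by code point whereas sort -u follows the current locale.

```python
def sort_unique(lines):
    """Mimic the idea behind `sort -u`: sort, then drop lines equal to their predecessor."""
    result = []
    for line in sorted(lines):
        # After sorting, duplicates are adjacent, so one comparison is enough.
        if not result or result[-1] != line:
            result.append(line)
    return result


def dedupe_keep_order(lines):
    """Drop duplicates while keeping the first occurrence of each line in place."""
    seen = set()
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


words = ["hunter2", "admin", "hunter2", "123456", "admin"]
print(sort_unique(words))        # ['123456', 'admin', 'hunter2']
print(dedupe_keep_order(words))  # ['hunter2', 'admin', '123456']
```

Both calls return the same set of lines, but only the second keeps them in the order they first appeared in the wordlist.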

But anyway, I suppose you used sort -u just to compare outputs and check whether duplicut actually works.
So I invite you to run duplicut --help, and take a look at the options.

For example, there is --line-max-size, which defaults to 14, meaning that lines longer than 14 chars are removed, even if unique.

Also, empty lines are automatically removed by duplicut.

These additional behaviors exist because duplicut is meant to aggregate password wordlists without losing the order and without keeping duplicates. In a password wordlist context, I rarely want to keep lines longer than 14 chars, as they might be a garbage line, a password too long to be worth guessing, or a parsing error from the tool that generated the line. Empty lines are also deleted, being obviously useless in a wordlist of passwords.
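
As a rough model of the filtering described above, the following Python sketch drops empty lines, lines longer than the maximum size, and duplicates, while keeping the original order. It is an approximation under stated assumptions: `filter_wordlist` and its `line_max_size` parameter are hypothetical names that only mirror the --line-max-size option, `len()` counts characters whereas the real tool may count bytes, and any other filtering duplicut applies is not modeled.

```python
def filter_wordlist(lines, line_max_size=14):
    """Keep the first occurrence of each non-empty line of at most line_max_size chars."""
    seen = set()
    kept = []
    for line in lines:
        if not line:
            continue  # empty lines are useless in a password wordlist
        if len(line) > line_max_size:
            continue  # too long: removed even if unique
        if line in seen:
            continue  # duplicate of an earlier line
        seen.add(line)
        kept.append(line)
    return kept


sample = ["password", "", "correct horse battery staple", "password", "letmein"]
print(filter_wordlist(sample))  # ['password', 'letmein']
```

This is why the output can have fewer lines than sort -u produces on the same input: the long passphrase and the empty line are unique, yet both are dropped.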


Anyway, if you want to test duplicut, I recommend that you take a look at these files from my test suite:
https://github.com/nil0x42/duplicut/blob/master/test/scripts/remove-duplicates.py
https://github.com/nil0x42/duplicut/blob/master/test/tests/nonreg.sh

remove-duplicates.py is a small Python script meant to behave like duplicut (it's just a million times slower :)), so you can read it to see what's different from sort, and you can compare its output with duplicut's.

from duplicut.

nil0x42 commented on June 14, 2024

Anyway, if I answered your doubts and the issue is resolved, feel free to close it. Otherwise, I'll be happy to debug with you!


nil0x42 commented on June 14, 2024

Adding the needs documentation label, because this issue was probably caused by the --line-max-size option being unclearly documented.
Possible fixes:
Have duplicut print a phrase saying exactly how the wordlist is going to be filtered, something like:

removing lines larger than `N` chars, containing non-printable chars, or duplicated.


nil0x42 commented on June 14, 2024

Another interesting 'user warning' would be to inform the user if no \n has been found in the file's first 4096 bytes (because the file might be an old-style macOS wordlist that uses \r as its newline separator).
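
The check proposed above could look roughly like this in Python. It is only a sketch of the idea, not existing duplicut code: the function name `newline_warning`, the `probe_size` parameter, and the warning text are all made up for illustration.

```python
import os
import tempfile


def newline_warning(path, probe_size=4096):
    """Return a warning string if the first probe_size bytes contain no LF byte, else None."""
    with open(path, "rb") as handle:
        head = handle.read(probe_size)
    if not head or b"\n" in head:
        return None  # empty file, or a normal LF-separated file
    if b"\r" in head:
        return "no \\n found in the first %d bytes; file may use \\r line endings" % probe_size
    return "no \\n found in the first %d bytes" % probe_size


# Usage: a wordlist whose lines are separated by \r only.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"admin\rletmein\rhunter2\r")
print(newline_warning(tmp.name))
os.remove(tmp.name)
```

Reading in binary mode matters here, since text mode would silently translate \r into \n and hide the problem.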


nil0x42 commented on June 14, 2024

@freeroute, can you please confirm whether the problem was due to the --line-max-size option?

