Comments (5)
Hello!
First and foremost, duplicut is not meant to be faster than `sort -u` (but I'm happy to see it is in some cases).
What `sort -u` does is sort the file alphabetically, then iterate through the lines to check whether line == line+1, and delete the line if so.
What duplicut does is actually a lot more complicated, as it is able to remove duplicates without sorting. So if you don't mind losing the original order, it might be better to use sort, or other tools for duplicate removal.
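To make the difference concrete, here is a minimal Python sketch of the two approaches. It is only an illustration: the function names are made up for this example, the in-memory `set` is not how duplicut is implemented, and the real `sort -u` orders lines according to the current locale rather than Python's default string ordering.

```python
def sort_unique(lines):
    """Roughly what `sort -u` yields: sort, then drop adjacent duplicates."""
    result = []
    for line in sorted(lines):
        # After sorting, duplicates are adjacent, so comparing with the
        # previously kept line is enough.
        if not result or result[-1] != line:
            result.append(line)
    return result


def ordered_unique(lines):
    """Keep the first occurrence of each line, preserving the input order."""
    seen = set()
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


words = ["hunter2", "admin", "hunter2", "123456", "admin"]
print(sort_unique(words))     # ['123456', 'admin', 'hunter2']
print(ordered_unique(words))  # ['hunter2', 'admin', '123456']
```

Both functions return the same set of lines; only the order differs, which is what matters for a wordlist sorted by likelihood.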
But anyway, I suppose you used `sort -u` just to compare outputs and check whether duplicut actually works.
So I invite you to run `duplicut --help` and take a look at the options.
For example, there is `--line-max-size`, which defaults to 14, meaning that lines longer than 14 chars are removed, even if unique.
Also, empty lines are automatically removed by duplicut.
These additional behaviors exist because duplicut is meant to aggregate password wordlists, without losing the order, and without having duplicates. And in a password wordlist context, I rarely want to keep lines longer than 14 chars, as they might be a garbage line, a password too long to be worth guessing, or a parsing error from the tool that generated the line. Empty lines are also deleted, being obviously useless in a wordlist of passwords.
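As a rough sketch of the filtering described above (not the actual `remove-duplicates.py` from the test suite, and not duplicut's code), the three rules can be written like this in Python. The function name `filter_wordlist` and its `line_max_size` parameter are hypothetical, the default of 14 is taken from this comment, and the length is counted with `len()` on a string, which is an assumption about how duplicut measures it.

```python
def filter_wordlist(lines, line_max_size=14):
    """Drop empty lines, over-long lines and duplicates, keeping input order.

    Hypothetical helper for illustration only.
    """
    seen = set()
    kept = []
    for line in lines:
        if not line:
            continue  # empty lines are useless in a password wordlist
        if len(line) > line_max_size:
            continue  # too long: removed even if unique
        if line in seen:
            continue  # duplicate of an earlier line
        seen.add(line)
        kept.append(line)
    return kept


words = ["password", "", "password", "correcthorsebatterystaple", "letmein"]
print(filter_wordlist(words))  # ['password', 'letmein']
```

This is why the output can be smaller than what `sort -u` produces: the long line and the empty line are dropped in addition to the duplicate.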
Anyway, if you want to test duplicut, I recommend you take a look at these files from my test suite:
https://github.com/nil0x42/duplicut/blob/master/test/scripts/remove-duplicates.py
https://github.com/nil0x42/duplicut/blob/master/test/tests/nonreg.sh
`remove-duplicates.py` is a small Python script meant to behave like duplicut (it's just a million times slower :)), so you can read it to see what's different from `sort`, and you can compare its output with duplicut's.
from duplicut.
Anyway, if I answered your doubts, and if the issue is resolved, feel free to close it. Otherwise, I'll be happy to debug with you!
Adding the `needs documentation` label, because this issue has probably been caused by the `--line-max-size` option being unclearly documented.
Possible fixes:
Add a message when duplicut runs, saying exactly how the wordlist is going to be filtered, something like:
removing lines larger than `N` chars, containing non-printable chars, or duplicated.
Another interesting 'user warning' would be to inform the user if no `\n` has been found in the file's first 4096 bytes (because the file might be an old-style macOS `\r` newline-separated wordlist).
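A minimal Python sketch of such a check could look like the following. Nothing here exists in duplicut: the names `newline_warning`, `check_file` and `SAMPLE_SIZE` are made up for the example, and the extra hint about `\r` is an assumption on top of the idea above.

```python
SAMPLE_SIZE = 4096  # number of leading bytes to inspect, as suggested above


def newline_warning(sample: bytes):
    """Return a warning string if the sample contains no LF byte, else None."""
    if not sample or b"\n" in sample:
        return None
    if b"\r" in sample:
        return "no LF newline found; file may use CR-only line endings"
    return "no LF newline found in the first bytes of the file"


def check_file(path):
    """Read the head of the file in binary mode and print a warning if needed."""
    with open(path, "rb") as f:
        message = newline_warning(f.read(SAMPLE_SIZE))
    if message is not None:
        print(f"warning: {path}: {message}")
    return message
```

Reading in binary mode matters here, because text mode in Python would translate `\r` into `\n` and hide the problem.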
@freeroute, can you please confirm whether the problem was due to the `--line-max-size` option?