
urldedupe's Introduction

urldedupe

urldedupe is a tool that quickly takes a list of URLs and returns a deduplicated (unique) list of URL and query string combinations. This is useful to ensure you don't have a URL list with hundreds of duplicated parameters that differ only in their query string values. For an example run, take the following URL list passed in:

https://google.com
https://google.com/home?qs=value
https://google.com/home?qs=secondValue
https://google.com/home?qs=newValue&secondQs=anotherValue
https://google.com/home?qs=asd&secondQs=das

Passing this through urldedupe keeps only the non-duplicate URL & query string combinations (ignoring values):

$ cat urls.txt | urldedupe
https://google.com
https://google.com/home?qs=value
https://google.com/home?qs=newValue&secondQs=anotherValue
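
Conceptually, the deduplication key is the URL without its query string values: scheme, host, path, plus the set of parameter names. The following is a minimal C++ sketch of that idea only; dedupe_key is a hypothetical helper and not urldedupe's actual implementation:

#include <iostream>
#include <set>
#include <sstream>
#include <string>

// Build a key from scheme + host + path plus the sorted query parameter
// names, so URLs differing only in parameter values collapse to one key.
std::string dedupe_key(const std::string &url) {
    const auto qpos = url.find('?');
    std::string key = url.substr(0, qpos);                // scheme + host + path
    std::set<std::string> names;                          // sorted, unique names
    if (qpos != std::string::npos) {
        std::stringstream qs(url.substr(qpos + 1));
        std::string pair;
        while (std::getline(qs, pair, '&'))
            names.insert(pair.substr(0, pair.find('='))); // drop "=value"
    }
    for (const auto &n : names)
        key += '&' + n;
    return key;
}

int main() {
    // Both URLs produce the same key, so only the first would be kept.
    std::cout << dedupe_key("https://google.com/home?qs=value") << '\n';
    std::cout << dedupe_key("https://google.com/home?qs=secondValue") << '\n';
}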

It's also possible to deduplicate similar URLs. This is done with the -s|--similar flag, which deduplicates endpoints such as API endpoints with different IDs, or assets:

$ cat urls.txt
https://site.com/api/users/123
https://site.com/api/users/222
https://site.com/api/users/412/profile
https://site.com/users/photos/photo.jpg
https://site.com/users/photos/myPhoto.jpg
https://site.com/users/photos/photo.png

Becomes:

$ cat urls.txt | urldedupe -s
https://site.com/api/users/123
https://site.com/api/users/412/profile
https://site.com/users/photos/photo.jpg
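
The similarity check can be thought of as normalizing each path: purely numeric segments collapse to a placeholder, and asset-style filenames (images, fonts) collapse as well, so different IDs or photo names map to the same key. Below is a rough C++ sketch of that idea; the helper names and the extension list are illustrative assumptions, not the tool's actual code:

#include <algorithm>
#include <cctype>
#include <iostream>
#include <set>
#include <sstream>
#include <string>

bool is_number(const std::string &s) {
    return !s.empty() && std::all_of(s.begin(), s.end(),
               [](unsigned char c) { return std::isdigit(c) != 0; });
}

// Normalize a URL path: numeric segments and asset filenames become
// placeholders, so /api/users/123 and /api/users/222 share one key.
std::string similar_key(const std::string &path) {
    static const std::set<std::string> asset_exts =
        {".jpg", ".jpeg", ".png", ".gif", ".svg", ".woff", ".ttf"};  // assumed list
    std::stringstream ss(path);
    std::string segment, key;
    while (std::getline(ss, segment, '/')) {
        if (segment.empty()) continue;
        const auto dot = segment.rfind('.');
        if (is_number(segment))
            key += "/NUM";                                           // 123 and 222 match
        else if (dot != std::string::npos && asset_exts.count(segment.substr(dot)))
            key += "/ASSET";                                         // photo.jpg, photo.png match
        else
            key += '/' + segment;
    }
    return key;
}

int main() {
    std::cout << similar_key("/api/users/123") << '\n';              // /api/users/NUM
    std::cout << similar_key("/api/users/222") << '\n';              // /api/users/NUM (duplicate)
    std::cout << similar_key("/users/photos/photo.jpg") << '\n';     // /users/photos/ASSET
    std::cout << similar_key("/users/photos/myPhoto.jpg") << '\n';   // /users/photos/ASSET (duplicate)
}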

Why C++? Because it's super fast?!?! No, not really; I'm working on my C++ skills and mostly just wanted to create a real-world C++ project as opposed to purely educational work.

Installation

Use the binary already compiled within the repository... or better yet, since you shouldn't run a random binary from someone who could be very shady, compile from source:

You'll need cmake installed and a compiler that supports C++17 or higher.

Clone the repository & navigate to it:

git clone https://github.com/ameenmaali/urldedupe.git
cd urldedupe

In the urldedupe directory, run:

cmake CMakeLists.txt

If you don't have cmake installed, do that first. On macOS it is:

brew install cmake

Run make:

make

The urldedupe binary should now be created in the same directory. For easy use, you can move it to your bin directory.

Usage

urldedupe takes URLs from stdin, or from a file with the -u flag; you will most likely want them in a file such as:

$ cat urls.txt
https://google.com/home/?q=2&d=asd
https://my.site/profile?param1=1&param2=2
https://my.site/profile?param3=3

Help

$ ./urldedupe -h
(-h|--help) - Usage/help info for urldedupe
(-u|--urls) - Filename containing urls (use this if you don't pipe urls via stdin)
(-V|--version) - Get current version for urldedupe
(-r|--regex-parse) - This is significantly slower than normal parsing, but may be more thorough or accurate
(-s|--similar) - Remove similar URLs (based on integers and image/font files) - i.e. /api/user/1 & /api/user/2 deduplicated
(-qs|--query-strings-only) - Only include URLs if they have query strings
(-ne|--no-extensions) - Do not include URLs if they have an extension (i.e. .png, .jpg, .woff, .js, .html)
(-m|--mode) - The mode/filters to be enabled (can be 1 or more, comma separated). Default is none, available options are the other flags (--mode "r,s,qs,ne")

Examples

It's very simple: just pass URLs from stdin or with the -u flag:

./urldedupe -u urls.txt

After moving the urldedupe binary to your bin directory, pass in a list from stdin and save the results to a file:

cat urls.txt | urldedupe > deduped_urls.txt

Deduplicate similar URLs with the -s|--similar flag, such as API endpoints with different IDs, or assets:

cat urls.txt | urldedupe -s

https://site.com/api/users/123
https://site.com/api/users/222
https://site.com/api/users/412/profile
https://site.com/users/photos/photo.jpg
https://site.com/users/photos/myPhoto.jpg
https://site.com/users/photos/photo.png

Becomes:

https://site.com/api/users/123
https://site.com/api/users/412/profile
https://site.com/users/photos/photo.jpg

For all the bug bounty hunters, I recommend chaining this with tools such as waybackurls or gau to get back only unique URLs, as those sources are prone to having many similar/duplicated URLs:

cat waybackurls | urldedupe > deduped_urls.txt

For max thoroughness (usually not necessary), you can use an RFC-compliant regex for URL parsing, but it is significantly slower for large data sets:

cat urls.txt | urldedupe -r > deduped_urls_regex.txt

Alternatively, use -m|--mode with the flag values you'd like to run with. For example, say you want URLs deduped based on similarity, only URLs with query strings included, and URLs with extensions excluded...

Instead of:

urldedupe -u urls.txt -s -qs -ne

You can also do:

urldedupe -u urls.txt -m "s,qs,ne"
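
Under the hood, a mode string like this just needs to be split on commas and mapped back to the individual options. A hypothetical C++ sketch of that mapping (the variable names are illustrative, not urldedupe's internals):

#include <iostream>
#include <sstream>
#include <string>

// Illustration only: split a --mode value such as "s,qs,ne" on commas and
// flip the matching option flags.
int main() {
    bool similar = false, query_strings_only = false, no_extensions = false, regex_parse = false;
    std::stringstream modes("s,qs,ne");
    std::string m;
    while (std::getline(modes, m, ',')) {
        if (m == "s")       similar = true;
        else if (m == "qs") query_strings_only = true;
        else if (m == "ne") no_extensions = true;
        else if (m == "r")  regex_parse = true;
    }
    std::cout << similar << query_strings_only << no_extensions << regex_parse << '\n';  // prints 1110
}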

urldedupe's People

Contributors

ameenmaali, larskraemer


urldedupe's Issues

make is not working

hacker@localhost:~/tools/urldedupe$ make
Scanning dependencies of target urldedupe
[ 20%] Building CXX object CMakeFiles/urldedupe.dir/Url.cpp.o
/home/hacker/tools/urldedupe/Url.cpp:4:10: fatal error: filesystem: No such file or directory
#include <filesystem>
^~~~~~~~~~~~
compilation terminated.
CMakeFiles/urldedupe.dir/build.make:120: recipe for target 'CMakeFiles/urldedupe.dir/Url.cpp.o' failed
make[2]: *** [CMakeFiles/urldedupe.dir/Url.cpp.o] Error 1
CMakeFiles/Makefile2:94: recipe for target 'CMakeFiles/urldedupe.dir/all' failed
make[1]: *** [CMakeFiles/urldedupe.dir/all] Error 2
Makefile:102: recipe for target 'all' failed
make: *** [all] Error 2

Error when building

I don't know anything about C++, but when I run make I am getting this error....

[ 20%] Linking CXX executable urldedupe
/usr/bin/ld: CMakeFiles/urldedupe.dir/Url.cpp.o: in function `std::filesystem::__cxx11::path::has_extension() const':
Url.cpp:(.text._ZNKSt10filesystem7__cxx114path13has_extensionEv[_ZNKSt10filesystem7__cxx114path13has_extensionEv]+0x14): undefined reference to `std::filesystem::__cxx11::path::_M_find_extension() const'
/usr/bin/ld: CMakeFiles/urldedupe.dir/Url.cpp.o: in function `std::filesystem::__cxx11::path::path<std::basic_string_view<char, std::char_traits<char> >, std::filesystem::__cxx11::path>(std::basic_string_view<char, std::char_traits<char> > const&, std::filesystem::__cxx11::path::format)':
Url.cpp:(.text._ZNSt10filesystem7__cxx114pathC2ISt17basic_string_viewIcSt11char_traitsIcEES1_EERKT_NS1_6formatE[_ZNSt10filesystem7__cxx114pathC5ISt17basic_string_viewIcSt11char_traitsIcEES1_EERKT_NS1_6formatE]+0x73): undefined reference to `std::filesystem::__cxx11::path::_M_split_cmpts()'
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/urldedupe.dir/build.make:149: urldedupe] Error 1
make[1]: *** [CMakeFiles/Makefile2:96: CMakeFiles/urldedupe.dir/all] Error 2
make: *** [Makefile:104: all] Error 2

Account for port numbers in URLs

It probably makes sense to discard ports when assessing for duplication, so that something like the following is treated as a duplicate:

https://site.com:443/home
https://site.com/home
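
One simple approach, shown here only as a sketch and not as what urldedupe currently does, is to drop a port when it is the default for the scheme before comparing:

#include <iostream>
#include <string>

// Remove ":443" from https URLs and ":80" from http URLs so the default
// port and the bare host compare equal. Only handles the common case of a
// port followed by a path.
std::string strip_default_port(std::string url) {
    if (url.rfind("https://", 0) == 0) {
        const auto pos = url.find(":443/", 8);
        if (pos != std::string::npos) url.erase(pos, 4);
    } else if (url.rfind("http://", 0) == 0) {
        const auto pos = url.find(":80/", 7);
        if (pos != std::string::npos) url.erase(pos, 3);
    }
    return url;
}

int main() {
    std::cout << strip_default_port("https://site.com:443/home") << '\n';  // https://site.com/home
    std::cout << strip_default_port("https://site.com/home") << '\n';      // https://site.com/home
}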

Why are we decoding the URLs before parsing?

Before doing any work in the parser, we are decoding the URL, i.e. replacing "%ab" with '\xab'.
I think it would be better to do this after parsing the URL, since the following URL, for example, would produce incorrect results:

https://example.com/test%3Ftest (Note: 0x3F is ASCII for '?')

If this URL is decoded first, then parsed, it will be parsed as "https://example.com/test" with a query string of "test". This is not the behavior any browser is going to give you, and I believe it should not be the behavior of this program.
Instead, I believe we should decode the parts of the URL separately after parsing, probably even while assembling the URL key.
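
For illustration, here is a small sketch of the proposed order: split on the raw (still-encoded) '?' first, then percent-decode each piece separately, so an encoded %3F in the path can never be mistaken for the query separator. The percent_decode helper below is hypothetical and does no validation:

#include <cstddef>
#include <iostream>
#include <string>

// Replace "%XX" escapes with the corresponding character (sketch only,
// no validation of the hex digits).
std::string percent_decode(const std::string &s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '%' && i + 2 < s.size()) {
            out += static_cast<char>(std::stoi(s.substr(i + 1, 2), nullptr, 16));
            i += 2;
        } else {
            out += s[i];
        }
    }
    return out;
}

int main() {
    const std::string url = "https://example.com/test%3Ftest";
    const auto qpos = url.find('?');               // split on the encoded form first
    const std::string path  = percent_decode(url.substr(0, qpos));
    const std::string query = qpos == std::string::npos ? "" : percent_decode(url.substr(qpos + 1));
    std::cout << "path:  " << path  << '\n';       // https://example.com/test?test
    std::cout << "query: " << query << '\n';       // (empty)
}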

Eliminate duplicates that are not in query strings

Hi,
thanks a lot for this tool, it is very useful!

I was wondering if it would be possible to implement also a dedupe functionality for this kind of URL:

  • /product/1/buy/1
  • /product/1/buy/2
  • /product/1/
  • /product/2/

This should result in just:

  • /product/1/buy/1
  • /product/1/

It seems to me that at this time this is not taken into consideration.

I would really like to contribute to this myself, but my C++ knowledge is really rusty :)

Thanks again!

Chunk file reading for large files

There are some use cases that have been brought up for deduping large files (> 10 GB). This will result in a crash if the system does not have enough RAM to handle it, as the file is currently loaded into memory all at once. We will need to chunk the file into smaller buffers when loading in order to prevent this. It may also make sense to parallelize this with the URL deduplication process, as large files will take longer than necessary due to waiting for the entire file to be loaded. @larskraemer, any thoughts on an approach for solving this issue?
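
As a sketch of the streaming direction (not the current code), the input could be read one line at a time so that only the set of seen keys has to stay in memory rather than the whole file; true chunked reads and a parallel parse stage would build on the same idea:

#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

int main(int argc, char **argv) {
    if (argc < 2) { std::cerr << "usage: dedupe <file>\n"; return 1; }
    std::ifstream in(argv[1]);
    std::unordered_set<std::string> seen;   // only keys live in RAM, not the file
    std::string line;
    while (std::getline(in, line)) {        // one URL at a time
        // a real version would insert dedupe_key(line); the raw line stands in here
        if (seen.insert(line).second)
            std::cout << line << '\n';
    }
}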

Can't remove duplicate files

jsfiles.txt

https://www.test.com/js/0-0c6e5e47ca6a3f3f7243.js
https://www.test.com/js/0-249c4f63764b90e95f29.js
https://www.test.com/js/0-356c7b1d95f2143f6cd2.js
https://www.test.com/js/0-5adfe0ed1f01b27b5f5f.js
https://www.test.com/js/0-6553d716c12f03bb710d.js
cat jsfiles.txt | urldedupe -s

expected output should be: https://www.test.com/js/0-0c6e5e47ca6a3f3f7243.js
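
One possible way to cover this case, offered purely as a suggestion sketch and not as current behaviour, is to collapse long hexadecimal runs in the path before comparing, so hashed bundle names map to the same key:

#include <iostream>
#include <regex>
#include <string>

// Replace runs of 8 or more hex characters with a placeholder so hashed
// bundle names such as 0-0c6e5e47....js and 0-249c4f63....js collide.
std::string collapse_hashes(const std::string &url) {
    static const std::regex hex_run("[0-9a-fA-F]{8,}");
    return std::regex_replace(url, hex_run, "HASH");
}

int main() {
    std::cout << collapse_hashes("https://www.test.com/js/0-0c6e5e47ca6a3f3f7243.js") << '\n';
    std::cout << collapse_hashes("https://www.test.com/js/0-249c4f63764b90e95f29.js") << '\n';
    // Both print https://www.test.com/js/0-HASH.js, so only the first URL would be kept.
}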
