deduplicator's Introduction

Deduplicator

Find, Sort, Filter & Delete duplicate files

Usage

Usage: deduplicator [OPTIONS] [scan_dir_path]

Arguments:
  [scan_dir_path]  Run Deduplicator on dir different from pwd (e.g., ~/Pictures )

Options:
  -t, --types <TYPES>          Filetypes to deduplicate [default = all]
  -i, --interactive            Delete files interactively
  -s, --min-size <MIN_SIZE>    Minimum filesize of duplicates to scan (e.g., 100B/1K/2M/3G/4T) [default: 1b]
  -d, --max-depth <MAX_DEPTH>  Max Depth to scan while looking for duplicates
      --min-depth <MIN_DEPTH>  Min Depth to scan while looking for duplicates
  -f, --follow-links           Follow links while scanning directories
  -h, --help                   Print help information
  -V, --version                Print version information
      --json                    

Examples

# Scan for duplicates recursively from the current dir, only look for png, jpg & pdf file types & interactively delete files
deduplicator -t pdf,jpg,png -i

# Scan for duplicates recursively from the ~/Pictures dir, only look for png, jpeg, jpg & pdf file types & interactively delete files
deduplicator ~/Pictures/ -t png,jpeg,jpg,pdf -i

# Scan for duplicates in ~/Pictures without recursing into subdirectories
deduplicator ~/Pictures --max-depth 0

# Look for duplicates in the ~/.config directory while also following symbolic links
deduplicator ~/.config --follow-links

# Scan for duplicates that are larger than 100mb in the ~/Media directory
deduplicator ~/Media --min-size 100mb

Installation

Cargo Install

Stable

$ cargo install deduplicator

Nightly

If you'd like to install with nightly features, you can use:

$ cargo install --git https://github.com/sreedevk/deduplicator

Please note that if you use a version manager (like asdf) to install Rust, you need to reshim afterwards (asdf reshim rust).

Linux (Pre-built Binary)

You can download the pre-built binary from the Releases page. Download deduplicator-x86_64-unknown-linux-gnu.tar.gz for Linux. Once you have the tarball with the executable, follow these steps to install:

$ tar -zxvf deduplicator-x86_64-unknown-linux-gnu.tar.gz
$ sudo mv deduplicator /usr/bin/

Mac OS (Pre-built Binary)

You can download the pre-built binary from the Releases page. Download the deduplicator-x86_64-apple-darwin.tar.gz tarball for Mac OS. Once you have the tarball with the executable, follow these steps to install:

$ tar -zxvf deduplicator-x86_64-apple-darwin.tar.gz
$ sudo mv deduplicator /usr/bin/

Windows (Pre-built Binary)

You can download the pre-built binary from the Releases page. Download the deduplicator-x86_64-pc-windows-msvc.zip file for Windows. Unzip it and move deduplicator.exe to a location listed in the PATH system environment variable.

Note: If you run into an MSVC error, please install MSVC from here

Performance

Deduplicator uses size comparison and fxhash (a non-cryptographic hashing algorithm) to quickly scan through a large number of files to find duplicates. It's also highly parallel (it uses rayon and dashmap). I was able to scan through 120GB of files (videos, PDFs, images) in ~300ms. Check out the benchmarks below.

benchmarks

| Command | Dirsize | Filecount | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|---|---|
| deduplicator ~/Data/tmp | ~120G | 721 files | 33.5 ± 28.6 | 25.3 | 151.5 | 1.87 ± 1.60 |
| deduplicator ~/Data/books | ~8.6G | 1419 files | 24.5 ± 1.0 | 22.9 | 28.1 | 1.37 ± 0.08 |
| deduplicator ~/Data/books --min-size 10M | ~8.6G | 1419 files | 17.9 ± 0.7 | 16.8 | 20.0 | 1.00 |
| deduplicator ~/Data/ --types pdf,jpg,png,jpeg | ~290G | 104222 files | 1207.2 ± 37.0 | 1172.2 | 1287.7 | 67.27 ± 3.33 |
  • The last entry is slower because of the number of files deduplicator had to go through (~660895 files). The average size of the files rarely affects the performance of deduplicator.

These benchmarks were run using hyperfine. Here are the specs of the machine used to benchmark deduplicator:

OS: Arch Linux x86_64 
Host: Precision 5540
Kernel: 5.15.89-1-lts 
Uptime: 4 hours, 44 mins 
Shell: zsh 5.9                        
Terminal: kitty 
CPU: Intel i9-9880H (16) @ 4.800GHz 
GPU: NVIDIA Quadro T2000 Mobile / Max-Q 
GPU: Intel CoffeeLake-H GT2 [UHD Graphics 630] 
Memory: 31731MiB (~32GiB)

Roadmap

- Tree format output for duplicate file listing
- GUI
- Packages for different operating system repositories (currently only installable via cargo) 

deduplicator's People

Contributors

beeb, dependabot[bot], dhruvasagar, ghfghfg23, sreedevk

deduplicator's Issues

[Feature] Adding new flag to show full file path

Is your feature request related to a problem? Please describe.
I have a clear but deep folder structure, and some files have long names. It is important for me to see where the duplicates are.

Describe the solution you'd like
I suggest adding a new flag that shows the full path to each file.

Moreover, I use a laptop with a Full HD screen, and some paths are too long to be displayed properly. It would be a good idea to add the ability to wrap long lines onto the next line.

Error: unable to open database file (code 14)

OS: Windows 11 Enterprise
OS version: 22000.1335

When I try to run the app, I get the error "Error: unable to open database file (code 14)".

Example:

cargo run --release -- --dir=test_data
warning: unused imports: `Frame`, `Rect`
 --> src\app\ui.rs:5:45
  |
5 |     layout::{Constraint, Direction, Layout, Rect},
  |                                             ^^^^
...
9 |     Frame, Terminal,
  |     ^^^^^
  |
  = note: `#[warn(unused_imports)]` on by default

warning: unused import: `std::thread`
  --> src\app\mod.rs:14:5
   |
14 | use std::thread;
   |     ^^^^^^^^^^^

warning: unused import: `std::time::Duration`
  --> src\app\mod.rs:15:5
   |
15 | use std::time::Duration;
   |     ^^^^^^^^^^^^^^^^^^^

warning: unused imports: `Block`, `Borders`, `Widget`
  --> src\app\mod.rs:18:15
   |
18 |     widgets::{Block, Borders, Widget},
   |               ^^^^^  ^^^^^^^  ^^^^^^

warning: unused import: `Backend`
 --> src\app\ui.rs:4:15
  |
4 |     backend::{Backend, CrosstermBackend},
  |               ^^^^^^^

warning: associated function `cleanup` is never used
  --> src\app\mod.rs:39:8
   |
39 |     fn cleanup(term: &mut Terminal<CrosstermBackend<io::Stdout>>) -> Result<()> {
   |        ^^^^^^^
   |
   = note: `#[warn(dead_code)]` on by default

warning: associated function `render_cycle` is never used
  --> src\app\mod.rs:51:8
   |
51 |     fn render_cycle(term: &mut Terminal<CrosstermBackend<io::Stdout>>) -> Result<()> {
   |        ^^^^^^^^^^^^

warning: associated function `init_render_loop` is never used
  --> src\app\mod.rs:58:8
   |
58 |     fn init_render_loop(term: &mut Terminal<CrosstermBackend<io::Stdout>>) -> Result<()> {
   |        ^^^^^^^^^^^^^^^^

warning: associated function `init_terminal` is never used
  --> src\app\mod.rs:69:8
   |
69 |     fn init_terminal() -> Result<Terminal<CrosstermBackend<io::Stdout>>> {
   |        ^^^^^^^^^^^^^

warning: struct `EventHandler` is never constructed
 --> src\app\event_handler.rs:7:12
  |
7 | pub struct EventHandler;
  |            ^^^^^^^^^^^^

warning: associated function `init` is never used
  --> src\app\event_handler.rs:10:12
   |
10 |     pub fn init() -> Result<events::Event> {
   |            ^^^^

warning: associated function `handle_keypress` is never used
  --> src\app\event_handler.rs:21:8
   |
21 |     fn handle_keypress(keyevent: KeyEvent) -> Result<events::Event> {
   |        ^^^^^^^^^^^^^^^

warning: enum `Event` is never used
 --> src\app\events.rs:1:10
  |
1 | pub enum Event {
  |          ^^^^^

warning: struct `Ui` is never constructed
  --> src\app\ui.rs:12:12
   |
12 | pub struct Ui;
   |            ^^

warning: associated function `generate_file_list` is never used
  --> src\app\ui.rs:15:8
   |
15 |     fn generate_file_list() -> impl Widget {
   |        ^^^^^^^^^^^^^^^^^^

warning: associated function `generate_info_bar` is never used
  --> src\app\ui.rs:27:8
   |
27 |     fn generate_info_bar() -> impl Widget {
   |        ^^^^^^^^^^^^^^^^^

warning: associated function `generate_file_desc` is never used
  --> src\app\ui.rs:31:8
   |
31 |     fn generate_file_desc() -> impl Widget {
   |        ^^^^^^^^^^^^^^^^^^

warning: associated function `render_frame` is never used
  --> src\app\ui.rs:35:12
   |
35 |     pub fn render_frame(term: &mut Terminal<CrosstermBackend<io::Stdout>>) -> Result<()> {
   |            ^^^^^^^^^^^^

warning: `deduplicator` (bin "deduplicator") generated 18 warnings
    Finished release [optimized] target(s) in 0.30s
     Running `target\release\deduplicator.exe --dir=test_data`
Error: unable to open database file (code 14)
error: process didn't exit successfully: `target\release\deduplicator.exe --dir=test_data` (exit code: 1)

I tried installing the app via cargo and also building it from source, but I got the same error both times.

How can I fix it?

[Feature] Add Unit Tests

Is your feature request related to a problem? Please describe.
To keep fixed issues from resurfacing, unit tests need to be added for each module.

Describe the solution you'd like
Add Tests for:

  • hashing
  • file scanning
  • finding duplicates by size
  • finding duplicates by hashes
  • printing output (integration tests)
  • interactive mode (integration tests)

Describe alternatives you've considered
N/A

Additional context
As the application is going through major changes, it's essential to make sure that it doesn't break.

[Feature] Add Flag to Exclude Filetypes

Is your feature request related to a problem? Please describe.
If I want to exclude a single file type from the scan, it's very difficult to do.

Describe the solution you'd like
Add an --exclude-types / -x flag to exclude certain file types from being scanned by deduplicator.
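
One possible shape for the filtering behind such a flag is sketched below. This is only an illustration using the standard library: `is_excluded` and `retain_included` are hypothetical helper names, not functions from deduplicator, and matching on the file extension (ignoring ASCII case) is an assumption about how "file type" would be decided.

```rust
use std::path::Path;

/// Returns true when `path` has an extension listed in `excluded`.
/// The comparison ignores ASCII case, so "JPG" matches "jpg".
/// Hypothetical helper: not part of deduplicator's actual code.
fn is_excluded(path: &Path, excluded: &[&str]) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| excluded.iter().any(|x| x.eq_ignore_ascii_case(ext)))
        .unwrap_or(false) // no extension (or a non-UTF-8 one): never excluded
}

/// Keeps only the paths whose extension is not in `excluded`.
fn retain_included<'a>(paths: &[&'a Path], excluded: &[&str]) -> Vec<&'a Path> {
    paths
        .iter()
        .copied()
        .filter(|p| !is_excluded(p, excluded))
        .collect()
}
```

With this sketch, a call such as `retain_included(&paths, &["png", "jpg"])` would drop images from the candidate list before any size or hash comparison runs.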

Crash while searching

Probably an issue with Cyrillic characters in the path:

thread 'main' panicked at 'byte index 11 is not a char boundary; it is inside 'с' (bytes 10..12) of `/фото с iPhone/IMG_6740 1.JPG`', src/output.rs:12:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
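
The panic message points at a string slice taken at a byte offset that falls inside a multi-byte character (Cyrillic letters take two bytes in UTF-8). A minimal sketch of two ways to shorten a string safely follows; the function names are hypothetical and this is not the code in src/output.rs.

```rust
/// Returns the longest prefix of `s` that is at most `max_bytes` long
/// and ends on a UTF-8 character boundary, so the slice never panics.
fn truncate_on_char_boundary(s: &str, max_bytes: usize) -> &str {
    if max_bytes >= s.len() {
        return s;
    }
    let mut end = max_bytes;
    // Walk back until `end` sits between two characters.
    // Index 0 is always a boundary, so the loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

/// Alternative: truncate by character count instead of byte count.
fn take_chars(s: &str, max_chars: usize) -> String {
    s.chars().take(max_chars).collect()
}
```

For the path in the report, asking for 11 bytes would return the first 10 bytes (everything before the 'с') instead of panicking.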

Panic when scanning

The many unwrap calls in the scanner code cause the app to panic when I run it against my home directory:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', [...]/deduplicator-0.0.3/src/scanner.rs:52:65
stack backtrace:
   0: rust_begin_unwind
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:65:14
   2: core::result::unwrap_failed
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/result.rs:1791:5
   3: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
   4: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
   5: rayon::iter::plumbing::Folder::consume_iter
   6: rayon::iter::plumbing::bridge_producer_consumer::helper
   7: <rayon::vec::IntoIter<T> as rayon::iter::IndexedParallelIterator>::with_producer
   8: rayon::iter::extend::<impl rayon::iter::ParallelExtend<T> for alloc::vec::Vec<T>>::par_extend
   9: deduplicator::scanner::duplicates
  10: deduplicator::app::App::init
  11: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  12: tokio::runtime::park::CachedParkThread::block_on
  13: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
  14: tokio::runtime::runtime::Runtime::block_on
  15: deduplicator::main

Fallible actions (especially against the filesystem) should be properly handled (most likely by skipping the affected files during the scan).
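
As a rough illustration of the "skip the affected files" approach, the sketch below reads file sizes and drops any path whose metadata cannot be read. `sizes_skipping_errors` is a hypothetical name and the sketch is sequential for simplicity; it is not the code in src/scanner.rs.

```rust
use std::fs;
use std::path::PathBuf;

/// Returns `(path, size)` for every path whose metadata can be read.
/// Paths that fail (deleted mid-scan, broken symlink, permission denied)
/// are reported on stderr and skipped instead of panicking.
fn sizes_skipping_errors(paths: Vec<PathBuf>) -> Vec<(PathBuf, u64)> {
    paths
        .into_iter()
        .filter_map(|path| match fs::metadata(&path) {
            Ok(meta) => Some((path, meta.len())),
            Err(err) => {
                eprintln!("skipping {}: {}", path.display(), err);
                None
            }
        })
        .collect()
}
```

The key change compared with calling unwrap is that the `Err` case becomes an ordinary branch, so one vanished file no longer aborts the whole scan.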

Test Deduplicator on Windows OS

I don't own or have access to a Windows machine. If anybody does, please help out by testing deduplicator on Windows.

Required Information:

  1. Benchmarks - Speed & Memory Efficiency
  2. Bug Reports

[Bug] Progress Bar for Globwalker

Describe the bug
While scanning deep directory trees, there's a small delay before the scanning files progress bar kicks in. For this period, deduplicator does not display any output. The goal is to add a ProgressBar::new_spinner() to make sure that the users can see that deduplicator is trying to traverse the directories to find files to process.

[Bug] Output Printing Slow

Describe the bug
After the scanning is complete, the app hangs for a second before printing the output. This is clearer with large directories.

Runtime Info
Install Type: [e.g. cargo install]
App Version: [e.g. v0.1.1]

Expected behavior
Printing should be fast

Platform Details (please complete the following information):

  • OS: Arch Linux
  • Terminal Emulator: Kitty
  • Shell: Bash

[Bug] "Error: path contains invalid UTF-8 characters"

Version: deduplicator 0.2.1 compiled from git master branch.
System: Windows 10
Cmdline: deduplicator.exe --follow-links --json "c:/" > "c:\temp\deduplicator-report_C.txt"

I scanned an entire C drive, but after running for 2h30m the scan failed with a UTF-8 error. The error gives no detail about the offending file name or folder.

[00:21:59] 3232274 paths mapped   
[00:00:08] ###### 2692395/2692395 indexed files sizes  
[02:30:50] ###### 2613147/2613147 indexed files hashes   
Error: path contains invalid UTF-8 characters
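
Two standard-library options for dealing with such paths are sketched below: a lossy conversion that never fails, and a strict conversion whose error names the offending path. Both helper names are hypothetical and neither is taken from deduplicator's source.

```rust
use std::path::Path;

/// Converts a path to a printable `String` without failing:
/// any part that is not valid UTF-8 is replaced with U+FFFD.
fn printable_path(path: &Path) -> String {
    path.to_string_lossy().into_owned()
}

/// Strict variant that includes the offending path in the error message,
/// so the user can see which file caused the problem.
fn utf8_path(path: &Path) -> Result<&str, String> {
    path.to_str().ok_or_else(|| {
        format!(
            "path contains invalid UTF-8 characters: {}",
            path.to_string_lossy()
        )
    })
}
```

Either variant would let a multi-hour scan finish or at least report which path it could not represent.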

[feature] minimum file size

Since this application might be useful to regain disk space, I think it could be interesting to have an option to ignore files smaller than a user-defined threshold.
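
A threshold like this has to be parsed from user input first. The sketch below shows one way to turn strings such as "100B", "1K" or "100mb" into a byte count. `parse_size` is a hypothetical helper, and treating the units as binary multiples (1K = 1024 bytes) is an assumption; the tool itself may interpret them differently.

```rust
/// Parses sizes such as "100B", "1K", "2M" or "100mb" into bytes.
/// Assumption: units are binary multiples (1K = 1024 bytes).
/// Returns `None` for malformed input or on overflow.
fn parse_size(input: &str) -> Option<u64> {
    let s = input.trim().to_ascii_lowercase();
    // Split the leading digits from the unit suffix.
    let digits_end = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
    let (number, unit) = s.split_at(digits_end);
    let value: u64 = number.parse().ok()?;
    let multiplier: u64 = match unit.trim() {
        "" | "b" => 1,
        "k" | "kb" => 1 << 10,
        "m" | "mb" => 1 << 20,
        "g" | "gb" => 1 << 30,
        "t" | "tb" => 1 << 40,
        _ => return None,
    };
    value.checked_mul(multiplier)
}
```

A scanner could then skip any file whose length in bytes is below the parsed value.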

[Bug] Excessive Memory Consumption

Describe the bug
The memory usage while scanning a 127 GB directory of PDFs, Images & Videos shot up to 26 GiB from 4.8 GiB (Initial), causing the desktop manager (lightdm) to crash & restart.

Runtime Info
App Arguments: none
Install Type: cargo install
App Version: 0.0.8

Expected behavior
Reduced Memory Consumption.

Platform Details (please complete the following information):

  • OS: Arch Linux (Kernel: 5.15.86-1-lts)
  • Terminal Emulator: Kitty
  • Shell: Zshell

[Feature] Mass Processing Options --keep-latest --keep-oldest

Is your feature request related to a problem? Please describe.
The Interactive mode allows the deletion of files one duplicate group at a time. When working with millions of files, this can be tedious.

Describe the solution you'd like
Options like --keep-latest and --keep-oldest would help automate this process when deduplicator is used in scripts.
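
To make the proposal concrete, here is a minimal sketch of how one duplicate group could be split into a file to keep and files to delete. The `Keep` enum and `plan_group` function are hypothetical names, and using the modification time as the notion of "latest" and "oldest" is an assumption.

```rust
use std::path::PathBuf;
use std::time::SystemTime;

/// Hypothetical policy mirroring the proposed flags.
#[derive(Clone, Copy)]
enum Keep {
    Latest,
    Oldest,
}

/// Splits one duplicate group into the file to keep and the files to delete,
/// based on modification time. Returns `None` for an empty group.
fn plan_group(
    mut group: Vec<(PathBuf, SystemTime)>,
    keep: Keep,
) -> Option<(PathBuf, Vec<PathBuf>)> {
    if group.is_empty() {
        return None;
    }
    // Stable sort, oldest first.
    group.sort_by_key(|(_, modified)| *modified);
    let (kept, _) = match keep {
        Keep::Latest => group.pop()?,
        Keep::Oldest => group.remove(0),
    };
    let to_delete = group.into_iter().map(|(path, _)| path).collect();
    Some((kept, to_delete))
}
```

Keeping the decision separate from the actual deletion makes it easy to print the plan first (a dry run) before removing anything.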

Describe alternatives you've considered
Adding custom config files written in a DSL that decides which files to keep [idea for the future]

[Bug] compilation error

Describe the bug
When using the stable version of rust, cargo cannot build the application.

Runtime Info
Install Type: Install with cargo
App Version: [e.g. v1.0.6]

Expected behavior
The application installs without errors.

Platform Details (please complete the following information):

  • OS: RED OS
  • Terminal Emulator: mate-terminal
  • Shell: Zsh

Additional context
Erroneous code example:

#[repr(u128)] // error: use of unstable library feature 'repr128'
enum Foo {
    Bar(u64),
}

If you're using a stable or a beta version of rustc, you won't be able to use
any unstable features. In order to do so, please switch to a nightly version of
rustc (by using rustup).

If you're using a nightly version of rustc, just add the corresponding feature
to be able to use it:

#![feature(repr128)]

#[repr(u128)] // ok!
enum Foo {
    Bar(u64),
}

[feature] to remove duplicates

I think this is a generally expected feature of this tool. It could be interactive, like "select file to delete: 1, 2, 3" with some --interactive option, or just auto-delete the first matches (keeping only the last one) with --remove.

Idea for better performance

Just had an idea of how to maybe improve performance. Instead of hashing all files, how about first differentiating them by using the file size (which is unlikely to be identical for two large files that are different) and only relying on the hash when they have the same size?
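
The idea can be sketched in a few lines: group by size first, then hash only the files that share a size. The sketch below uses only the standard library, so `DefaultHasher` stands in for whatever hash the tool uses, and `find_duplicates` is a hypothetical function name. It also reads each candidate file fully into memory, which a real implementation would likely avoid.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs;
use std::hash::Hasher;
use std::io;
use std::path::{Path, PathBuf};

/// Hashes a file's contents (whole file read into memory for brevity).
fn content_hash(path: &Path) -> io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    hasher.write(&fs::read(path)?);
    Ok(hasher.finish())
}

/// Groups files by size first and hashes only files that share a size.
/// Unreadable files are skipped. Returns groups of likely duplicates.
fn find_duplicates(paths: &[PathBuf]) -> Vec<Vec<PathBuf>> {
    let mut by_size: HashMap<u64, Vec<PathBuf>> = HashMap::new();
    for path in paths {
        if let Ok(meta) = fs::metadata(path) {
            by_size.entry(meta.len()).or_default().push(path.clone());
        }
    }

    let mut groups = Vec::new();
    // A file with a unique size cannot have a duplicate, so it is never hashed.
    for candidates in by_size.into_values().filter(|c| c.len() > 1) {
        let mut by_hash: HashMap<u64, Vec<PathBuf>> = HashMap::new();
        for path in candidates {
            if let Ok(hash) = content_hash(&path) {
                by_hash.entry(hash).or_default().push(path);
            }
        }
        groups.extend(by_hash.into_values().filter(|g| g.len() > 1));
    }
    groups
}
```

Reading metadata is much cheaper than reading file contents, so the size pass removes most files from consideration before any hashing happens.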

[Feature] Add Pre-Built Binary Download

Is your feature request related to a problem? Please describe.
Currently, deduplicator is only installable via cargo (Rust's build tool). Pre-built binary downloads are needed to make deduplicator easily accessible to more people.

Describe the solution you'd like
Create workflows to cross compile binaries for the following platforms

  • x86_64
  • AArch64

Describe alternatives you've considered

  • Distributing through linux package repositories (plans for the future)

Additional context
N/A

[Bug] --dir autocomplete not working on zsh

Describe the bug
#29 added path autocomplete for --dir option. The autocomplete works for bash but not zsh.

Runtime Info
App Arguments: --dir
Install Type: cargo install && cargo run
App Version: 0.1.1

Expected behavior
Autocomplete should work for --dir on zsh & other shells

Platform Details (please complete the following information):

  • OS: Arch Linux
  • Terminal Emulator: Kitty
  • Shell: Zsh
