
bridge's Introduction

About me

Role: Systems Administrator

Experience
  • support: troubleshooting, training, documentation
  • proxies & web servers: Squid, Apache, Nginx, HAProxy, IIS
  • mail servers: Postfix, Dovecot, Roundcube, DKIM, Postgrey
  • config/change management: Subversion, Git, Ansible
  • containers: Docker, LXD
  • virtualization: VMware, Hyper-V, VirtualBox
  • databases: MySQL/MariaDB, PostgreSQL, Microsoft SQL Server
  • monitoring: Nagios, custom tooling, Microsoft Teams, fail2ban
  • logging: rsyslog (local, central receivers), Graylog
  • ticketing: Redmine, GitHub, GitLab, Service Now

Role: Intermediate developer

Experience
  • current:
    • Go, Python, PowerShell, shell scripting
    • MySQL/MariaDB, SQLite
    • Docker, LXD
    • Markdown, Textile, MediaWiki, reStructuredText, HTML, CSS
    • Redmine, GitHub (including GitHub Actions), Gitea, GitLab
  • past: batch files (don't laugh, it gets the job done), Perl
  • academic: C, C++

bridge's People

Contributors

atc0005, dependabot[bot]


bridge's Issues

Allow specifying duplicate file count threshold via flag

See also #24.

As with the file size threshold, the duplicate file count threshold (DuplicatesThreshold) is a constant defining how many identical files are required before a set is considered duplicates. As of this writing the number is hard-coded to 2, but there may be use cases where a higher value would be useful to filter out small sets of matches.

Note: This may be better handled by adding a new flag that acts as a filter, leaving the existing DuplicatesThreshold value as-is.

Research third-party leveled-logging packages

Nearly done with the prototype per my notes in #1. The more I work with this codebase, the more I consider using logrus for its leveled-logging support.

Perhaps it is worth trying one of the other popular third-party logging libraries?

Research options for handling duplicate files

From #1:

A later revision could recursively perform this task and move duplicates into a subfolder alongside the image? What if the images (or files) are in entirely different folders? Perhaps create a log instead? Or, an option to choose which of those steps are performed?

The initial v0.1 release will focus on creating a CSV file for manual review, but it would be useful to perform some sort of cleanup option either automatically or based on a column entry for the CSV file.

For example, the generated CSV file could create a column with prefilled "keep" action entries that the user could replace with delete. This could involve adding new flags:

  • simulate
  • prune

Optional creation of Excel file isn't properly handled

As of b6fb543, not specifying the desired Excel file results in a barely decipherable error message:

$ ./bridge.exe -recurse -path . -csvfile "pics.csv"
2020/02/08 18:55:54 Configuration: {Paths:[.] RecursiveSearch:true ConsoleReport:false IgnoreErrors:false FileSizeThreshold:1 FileDuplicatesThreshold:2 CSVFile:pics.csv ExcelFile:}
2020/02/08 18:55:54 Path exists: .
2020/02/08 18:56:00 105 evaluated files in specified paths
2020/02/08 18:56:00 14 potential duplicate file sets found using file size
2020/02/08 18:56:00 14 confirmed duplicate file sets found using file hash
2020/02/08 18:56:00 28 files with identical file size
2020/02/08 18:56:00 28 files with identical file hash
2020/02/08 18:56:00 14 duplicate files
2020/02/08 18:56:00 188.1 MiB wasted space for duplicate file sets
2020/02/08 18:56:00 Successfully created CSV file: "pics.csv"
2020/02/08 18:56:00 open : The system cannot find the file specified.

Seeing the empty ExcelFile config value provided a clue, but glancing at the code this is what I see:

	// Generate Excel workbook for review
	// TODO: Implement better error handling
	if err := fileChecksumIndex.WriteFileMatchesWorkbook(config.ExcelFile, duplicateFiles); err != nil {
		log.Fatal(err)
	}
	log.Printf("Successfully created workbook file: %q", config.ExcelFile)

Evidently I didn't finish adding guards around the Excel file generation step to skip creation if the user didn't specify a value.

Add support for blocking file removal operations if ALL files from a set would be removed

This application groups detected duplicate files into "sets". Each set contains all files that are duplicates of each other, including what could be considered the "original" file that is duplicated by others in the set.

The current implementation of file removal support (see #4) honors a user-supplied "flag" in the input CSV file indicating that a file should be removed, even if the result is removing ALL files in a set. This issue is to extend the file removal support with a CLI flag option that helps prevent removal of all files in a set. Instead, when the application can determine that all files in a set are marked for removal, that set would be skipped (perhaps only if config.IgnoreErrors is set?) or the application would immediately fail.

refs #4

Set flag.Usage to custom Help header func

While reading over Fun with Flags I learned that the stdlib flag package supports displaying custom help text as a header or lead-in to the auto-generated list of options available to the user. This can be used to display branding, version details and a link back to the main project repo.

Add support for automatically selecting or marking remove_file column cells based on provided pattern

For example, here is a snippet from a v0.4.2 generated CSV file:

directory,file,size,size_in_bytes,checksum,remove_file
C:\Users\adam\Desktop\upload_me\Camera,0825161607a.jpg,7.8 MiB,8199899,1da9b7e355d64f57acf5854e746fff01bb5ab0291bcbad857e63607fcf8d03c7,
C:\Users\adam\Desktop\upload_me\Camera,IMG_20160825_160705412.jpg,7.8 MiB,8199899,1da9b7e355d64f57acf5854e746fff01bb5ab0291bcbad857e63607fcf8d03c7,

If, for example, we wanted to automatically remove any duplicate that has an IMG_ prefix, we could use a flag like --mark-dupe-prefix (not a fan of the name) like so:

--mark-dupe-prefix 'IMG_'

This would mark the remove_file column for any duplicate files that have that prefix.

Research replacing Makefile with a well-maintained alternative

Option to ignore errors not fully implemented

While testing on a Windows system from a non-privileged account, the application was unable to descend recursively (as requested) into $RECYCLE.BIN and System Volume Information and immediately aborted.

After digging further, I found at least three places in the code where I was failing to implement support for ignoring the error so that the application could continue as requested:

  • bridge/paths.go

    Lines 55 to 61 in 611391d

    // If an error is received, return it. If we return a non-nil error, this
    // will stop the filepath.Walk() function from continuing to walk the
    // path, and your main function will immediately move to the next line.
    if err != nil {
        return err
    }

  • bridge/matches.go

    Lines 167 to 170 in 611391d

    fm[index].Checksum, err = GenerateCheckSum(file.FullPath)
    if err != nil {
        return err
    }

  • bridge/main.go

    Lines 43 to 46 in 611391d

    fileSizeIndex, err := ProcessPath(config.RecursiveSearch, path)
    if err != nil {
        log.Fatal(err)
    }

There may be additional locations that I've missed.

Allow specifying file size threshold via flag

Currently the threshold is set via a constant in order to weed out 0 byte files, but allowing the value to be specified by command-line would allow the user to focus just on files large enough for them to care about.

Perhaps later (if desired) support could be added to specify both lower and upper threshold values.

README | Update badges to link to the applicable workflow

While useful, the current status badges link out to the SVG used to provide the current status, not the workflow results themselves.

Update all badges to use a syntax similar to the 'GoDoc' or 'Latest Release' badges so that each badge displays its status and also functions as a "GoTo" button.

Replace external shell script calls with internal Makefile commands

See also atc0005/elbow#234.


Verbatim from atc0005/send2teams#41:

As I've begun doing with my other projects, I'd like to replace content like this:

linting:
	@echo "Calling wrapper script: $(LINTINGCMD)"
	@$(LINTINGCMD)
	@echo "Finished running linting checks"

with this:

.PHONY: linting
## linting: runs common linting checks
# https://stackoverflow.com/a/42510278/903870
linting:
	@echo "Running linting tools ..."

	@echo "Running gofmt ..."

	@test -z $(shell gofmt -l -e .) || (echo "WARNING: gofmt linting errors found" \
		&& gofmt -l -e -d . \
		&& exit 1 )

	@echo "Running go vet ..."
	@go vet ./...

	@echo "Running golint ..."
	@golint -set_exit_status ./...

	@echo "Running golangci-lint ..."
	@golangci-lint run \
		-E goimports \
		-E gosec \
		-E stylecheck \
		-E goconst \
		-E depguard \
		-E prealloc \
		-E misspell \
		-E maligned \
		-E dupl \
		-E unconvert \
		-E golint \
		-E gocritic

	@echo "Running staticcheck ..."
	@staticcheck ./...

	@echo "Finished running linting checks"

It's more verbose (though the golangci-lint options will be moved to an external config file before long), but hopefully more consolidated and clearer about what is going on.

Create small app to find duplicate files

Prototype: md5sum * | sort -k1

This sorts by the md5sum and places the duplicates next to each other. In a real setup, the app can move the duplicates to another folder or delete them outright.

Use karrick/godirwalk package?

From https://github.com/karrick/godirwalk:

godirwalk is a library for traversing a directory tree on a file system.

In short, why do I use this library?

  • It's faster than filepath.Walk.
  • It's more correct on Windows than filepath.Walk.
  • It's easier to use than filepath.Walk.
  • It's more flexible than filepath.Walk.

Feature: Save processing state for later evaluation

The initial concept is to save the crawl status for later file hash comparison in order to confirm duplicates. Depending on how this is implemented, this could also provide a useful "audit" mechanism for discovery purposes.

Create GitHub Actions Workflows

Same here as with our other Go-based projects:

  • Docs linting
  • Code linting, building and testing

Although we don't have any tests just yet, we can go ahead and set up the framework for that work.

Test creating Excel (or equivalent) spreadsheet of duplicate files

Found this package (haven't tested yet): https://github.com/360EntSecGroup-Skylar/excelize

Idea:

  1. create one workbook
  2. each set of duplicate files is added to a new sheet in the workbook

Not sure how easily this would work alongside the idea in #4 to support file removal based on a keep or delete column value in a flat CSV structure. I expect that (at least initially) the two output files would be separate and a modified copy of the CSV file would be read back in for further action.

Test/Upgrade v2.x of 360EntSecGroup-Skylar/excelize package

$ go list -m -versions github.com/360EntSecGroup-Skylar/excelize
github.com/360EntSecGroup-Skylar/excelize v1.1.0 v1.2.0 v1.3.0 v1.4.0 v1.4.1
$ go list -m -versions github.com/360EntSecGroup-Skylar/excelize/v2
go: finding github.com/360EntSecGroup-Skylar/excelize/v2 v2.1.0
github.com/360EntSecGroup-Skylar/excelize/v2 v2.0.0 v2.0.1 v2.0.2 v2.1.0

The import path in the go.mod file currently (and unintentionally) locks our use to the v1.4.1 release of the 360EntSecGroup-Skylar/excelize package. We'll need to explicitly request the v2 series in order to pull in those updates.
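Per Go's semantic import versioning, major versions v2 and above carry the version in the module path, so both the go.mod requirement and every import statement need the /v2 suffix. Roughly:

```
$ go get github.com/360EntSecGroup-Skylar/excelize/v2

# go.mod then gains:
require github.com/360EntSecGroup-Skylar/excelize/v2 v2.1.0

# and imports change from
import "github.com/360EntSecGroup-Skylar/excelize"
# to
import "github.com/360EntSecGroup-Skylar/excelize/v2"
```

Any v1-to-v2 API breaking changes in excelize itself would need to be addressed at the same time.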

matches/matches.go:XXX:YY: Error return value of `f.SetCellValue` is not checked

From a CI run for #74:

##[error]matches/matches.go:595:16: Error return value of `f.SetCellValue` is not checked (errcheck)
	f.SetCellValue(summarySheet, "A1", "Evaluated Files")
	              ^
##[error]matches/matches.go:596:16: Error return value of `f.SetCellValue` is not checked (errcheck)
	f.SetCellValue(summarySheet, "A2", "Sets of files with identical size")
	              ^
##[error]matches/matches.go:597:16: Error return value of `f.SetCellValue` is not checked (errcheck)
	f.SetCellValue(summarySheet, "A3", "Sets of files with identical fingerprint")

Existing problem not related to #74, exposed (evidently) due to the work on #70 or #69.

Various linting errors from initial prototype

Thankfully all are minor; some are from me being new to Go and some are from quickly hacking this together without following common design patterns.

$ make linting
Calling wrapper script: bash testing/run_linting_checks.sh
matches.go:105:6: `InList` is unused (deadcode)
func InList(needle string, haystack []string) bool {
     ^
units.go:13:6: `ByteCountSI` is unused (deadcode)
func ByteCountSI(b int64) string {
     ^
matches.go:275:6: sloppyLen: len(fileMatches) <= 0 can be len(fileMatches) == 0 (gocritic)
                if len(fileMatches) <= 0 {
                   ^
matches.go:314:9: Error return value of `w.Write` is not checked (errcheck)
        w.Write(csvHeader)
               ^
checksums.go:28:9: S1025: the argument is already a string, there's no need to use fmt.Sprintf (gosimple)
        return fmt.Sprintf("%s", string(cs))
               ^
matches.go:150:4: S1011: should replace loop with `mergedFileSizeIndex[fileSize] = append(mergedFileSizeIndex[fileSize], fileMatches...)` (gosimple)
                        for _, fileMatch := range fileMatches {
                        ^
Non-zero exit code from golangci-lint: 1
checksums.go:28:9: the argument is already a string, there's no need to use fmt.Sprintf (S1025)
matches.go:105:6: func InList is unused (U1000)
matches.go:150:4: should replace loop with mergedFileSizeIndex[fileSize] = append(mergedFileSizeIndex[fileSize], fileMatches...) (S1011)
units.go:13:6: func ByteCountSI is unused (U1000)
Non-zero exit code from staticcheck: 1
Linting failed, most recent failure: staticcheck
Makefile:61: recipe for target 'linting' failed
make: *** [linting] Error 1

The unused linting failures are because I'm treating those functions as part of a library/package without actually having them in one. The quick fix here is to pull them out, but a better approach (at least for now) would be to create a sub-package for them. I'll likely go that direction.

Invalid duplicate checksums reported

What I got (tossing unrelated rows):

directory,file,size,size_in_bytes,checksum,remove_file
,,,,,
.,prune-input-testing.csv,25.2 KiB,25832,1d0c2cbe8fc192c57774fb39b258860bc66427ba2914ceaf46493112c35165ae,
.,report.csv,25.2 KiB,25832,1d0c2cbe8fc192c57774fb39b258860bc66427ba2914ceaf46493112c35165ae,

Actual:

$ sha256sum *.csv
1d0c2cbe8fc192c57774fb39b258860bc66427ba2914ceaf46493112c35165ae *prune-input-testing.csv
d942dce529d17457f5f42b386256da6c3d6b43118ec9219c6f0a5c2270a6c740 *report.csv

Prepare v0.4.0 release

  • Review README
  • Review GoDoc coverage
  • Update changelog
  • Create tag
  • Create new release

Both the main README and GoDoc coverage may need to be updated to reflect recent changes to the Help output.
