arkfind's Introduction

arkfind

A utility to recursively search for files by name in a filesystem, also looking inside archives to an arbitrary depth.

Supported archives are: TAR, TAR.GZ, TAR.BZ2 and ZIP. It should also work on ZIP-like archives such as JAR files. There are options for case sensitivity, glob-style pattern matching and JSON output. Non-ASCII character sets should be supported, although there may be some issues with command line encodings if they aren't UTF-8 (and there's also the fact that ZIP files don't have a universally accepted encoding).

The script uses the "magic" library to determine file types, so it doesn't rely on file extensions to identify archives.
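
To illustrate what identifying a file by content rather than by extension means, here is a minimal sketch that checks a few well-known leading byte signatures. It is not how arkfind does it: the script delegates this to the "magic" library, whereas `sniff_kind` and `SIGNATURES` are hypothetical names using only the Python 3 standard library, and the signature table is deliberately tiny.

```python
# Hypothetical, simplified stand-in for content-based type detection.
# A real tool would rely on libmagic's much larger signature database.
SIGNATURES = (
    (b"\x1f\x8b", "gzip"),      # gzip stream (e.g. a .tar.gz)
    (b"BZh", "bzip2"),          # bzip2 stream (e.g. a .tar.bz2)
    (b"PK\x03\x04", "zip"),     # ZIP local file header (also JAR/WAR)
)


def sniff_kind(header):
    """Guess the container type from the first bytes of a file.

    `header` should hold at least the first 262 bytes; returns a short
    label, or None when nothing is recognised.
    """
    for magic_bytes, kind in SIGNATURES:
        if header.startswith(magic_bytes):
            return kind
    # Uncompressed TAR archives carry "ustar" at offset 257 of the header.
    if header[257:262] == b"ustar":
        return "tar"
    return None
```

A compressed TAR is reported here only as "gzip" or "bzip2"; telling whether it wraps a TAR would need the stream to be decompressed and sniffed again.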

Dependencies and installation

Download the "arkfind" script. Either make it executable and run it as ./arkfind, or run it with python ./arkfind.

Dependencies:

  • Python (>= 2.7)
  • python-magic (>= 0.4.3) (installable via pip)

Examples

Say that over years of different backup regimes and ad-hoc archiving you've ended up with TAR files inside ZIP files, and there are many such ZIP files all inside one big TAR.BZ2 file. You want to find a file called "magic.txt", so you use the script like so:

$ arkfind AllBackups.tar.bz2 "magic.txt"
AllBackups.tar.bz2
  > backups_2006/lab_pc/MISC.zip
      > misc_lab_stuff/magic.txt

Maybe you have a directory full of such archives, and you can't remember the whole name of the file. You can do this:

$ arkfind -g backups/ "magic*.*"
backups/backup_2007.zip
  > 06/magic_june_2007.rtf
backups/backup_2008.zip
  > 01/magic_jan_2008.html

arkfind's People

Contributors

detly

arkfind's Issues

CRC errors in tar files cause a crash

Traceback:

  File "./arkfind", line 176, in contents
    yield TarWalker.from_path(fpath, archive_mode)
  File "./arkfind", line 229, in from_path
    return cls(archive, (path,))
  File "./arkfind", line 252, in __init__
    self._populate()
  File "./arkfind", line 294, in _populate
    archive_mode
  File "./arkfind", line 240, in from_buffer
    return cls(archive, paths)
  File "./arkfind", line 252, in __init__
    self._populate()
  File "./arkfind", line 264, in _populate
    for member in self.get_members():
  File "./arkfind", line 336, in get_members
    return self.archive.getmembers()
  File "/usr/lib/python2.7/tarfile.py", line 1805, in getmembers
    self._load()        # all members, we first have to
  File "/usr/lib/python2.7/tarfile.py", line 2380, in _load
    tarinfo = self.next()
  File "/usr/lib/python2.7/tarfile.py", line 2315, in next
    self.fileobj.seek(self.offset)
  File "/usr/lib/python2.7/gzip.py", line 429, in seek
    self.read(1024)
  File "/usr/lib/python2.7/gzip.py", line 256, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 303, in _read
    self._read_eof()
  File "/usr/lib/python2.7/gzip.py", line 342, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0x5af89238 != 0xdd49a1e8L
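
One possible way to keep a damaged archive from aborting the whole search is to catch read errors around the member listing, print a warning and move on. The following is a minimal sketch under stated assumptions: it targets Python 3 and the standard library only, and `list_members_or_warn` is a hypothetical helper rather than a function in arkfind. The exception raised for a bad gzip CRC differs between Python versions (the traceback above shows IOError on 2.7), so the sketch catches a fairly broad set.

```python
import sys
import tarfile
import zlib


def list_members_or_warn(fileobj, label):
    """Return the member names of a TAR archive, or [] if it cannot be read.

    Hypothetical helper: `label` is only used in the warning message.
    """
    try:
        with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
            return [member.name for member in archive.getmembers()]
    except (tarfile.TarError, OSError, EOFError, zlib.error) as exc:
        # Covers tarfile's own errors plus failures from the compression
        # layer (on Python 3 a gzip CRC mismatch is an OSError subclass).
        sys.stderr.write("warning: skipping %s: %s\n" % (label, exc))
        return []
```

This version discards any members that were read before the error occurred; reporting the partial listing instead would be a reasonable alternative.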

UnicodeDecodeError: 'utf8' codec can't decode byte

Traceback (most recent call last):
  File "./arkfind", line 623, in <module>
    json=args.json
  File "./arkfind", line 577, in main
    archive_search(Path.os(base_path), search_func, reporter)
  File "./arkfind", line 558, in archive_search
    to_process.extendleft(node.contents)
  File "./arkfind", line 175, in contents
    yield TarWalker.from_path(fpath, archive_mode)
  File "./arkfind", line 230, in from_path
    return cls(archive, (path,))
  File "./arkfind", line 254, in __init__
    self._populate()
  File "./arkfind", line 273, in _populate
    nested_paths = self.paths + (self.member_path(member),)
  File "./arkfind", line 347, in member_path
    return Path.posix(member.name.decode(self.ENCODING))
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 40: invalid continuation byte
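
The failure above comes from decoding a member name strictly as UTF-8. A tolerant decode is one way to avoid the crash; the sketch below assumes Python 3 byte strings, and `decode_member_name` is a hypothetical helper, not part of arkfind. Which fallback encodings make sense depends on where the archives came from, so the list is left to the caller.

```python
def decode_member_name(raw, encodings=("utf-8",)):
    """Decode an archive member name without raising on bad bytes.

    Tries each encoding strictly, in order. If none fits, decodes with the
    first encoding and substitutes U+FFFD for every undecodable byte.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode(encodings[0], errors="replace")
```

For example, `decode_member_name(b"caf\xe4.txt")` yields a name containing the replacement character, while passing `encodings=("utf-8", "latin-1")` would decode the same bytes as Latin-1 instead.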

Arkfind slow process time

Hello,

arkfind works perfectly well, but it is quite slow to process the list.

There seems to be a lot of room for improvement: it doesn't appear to take advantage of Python multiprocessing, so my computer never reaches 100% on any resource (disk, CPU or memory). Is this normal?

By the way, is there any way to extract the files that match the search expression? My files are inside .tar.gz and even .zip files combined.

Need a "stop at first find" mode

Currently arkfind will walk through all directories and archives, constructing the entire tree in memory, and then it will search through that list for whatever the user specified.

It would be nicer if arkfind printed the results as it went (although it would then be mixed up with the warnings, but that's nothing a stderr redirect wouldn't fix).

The real utility of this would be a "lazy" mode: sometimes you're only looking for one file, and you know there aren't duplicates. Then arkfind could quit after it found the first match.

Initially I had tried to implement arkfind using generators (with the printer and the searcher being coroutines). However, the recursive nature of the code made this quite difficult. See this Stack Overflow question: "How can I nest an arbitrary number of Python file context managers?"

What might work, however, is passing the searcher/formatter/printer to the recursing code, so that instead of populating a list in memory, the results are printed as they are found.
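
The idea of passing the reporter into the recursing code can be sketched roughly as follows. This is an illustration only, assuming Python 3, in-memory archive data and the standard library; `search`, `walk`, `_members` and `StopSearch` are hypothetical names and do not reflect arkfind's actual classes. Each match is reported as soon as it is found, and an exception unwinds the recursion when only the first match is wanted.

```python
import fnmatch
import io
import tarfile
import zipfile


class StopSearch(Exception):
    """Raised to unwind the recursion once the first match is reported."""


def _members(data):
    """Yield (name, bytes) for each regular file if data is a TAR or ZIP."""
    try:
        archive = tarfile.open(fileobj=io.BytesIO(data), mode="r:*")
    except tarfile.TarError:
        archive = None
    if archive is not None:
        with archive:
            for member in archive:
                if member.isfile():
                    yield member.name, archive.extractfile(member).read()
        return
    buf = io.BytesIO(data)
    if zipfile.is_zipfile(buf):
        with zipfile.ZipFile(buf) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield info.filename, zf.read(info)


def walk(data, parents, pattern, report, first_only):
    """Report matches immediately, then descend into nested archives."""
    for name, payload in _members(data):
        path = parents + (name,)
        if fnmatch.fnmatch(name.rsplit("/", 1)[-1], pattern):
            report(path)
            if first_only:
                raise StopSearch
        walk(payload, path, pattern, report, first_only)


def search(data, root_name, pattern, report, first_only=False):
    """Search archive bytes; `report` is called once per match, as found."""
    try:
        walk(data, (root_name,), pattern, report, first_only)
    except StopSearch:
        pass
```

Passing `print` as `report` prints each path tuple as it is discovered, and `first_only=True` gives the "lazy" behaviour described above. Reading every member fully into memory keeps the sketch short but would not suit large archives.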

Won't work with directories

Hi!

The script won't work with directories under CentOS with Python 2.7.5.

With a file as the parameter:

[root@localhost war]# python ~/arkfind -g ../war/theme.war "*.ftl"
../war/theme.war
  > templates/header.ftl

With the same directory as the parameter:

[root@localhost war]# python ~/arkfind -g ../war/ "*.ftl"
Traceback (most recent call last):
  File "/root/arkfind", line 618, in <module>
    json=args.json
  File "/root/arkfind", line 572, in main
    archive_search(Path.os(base_path), search_func, reporter)
  File "/root/arkfind", line 529, in archive_search
    base_path_kind = magic.from_file(base_path.joined.encode(sys.getfilesystemencoding()), mime=True)
  File "/usr/lib/python2.7/site-packages/magic.py", line 131, in from_file
    return m.from_file(filename)
  File "/usr/lib/python2.7/site-packages/magic.py", line 81, in from_file
    with open(filename):
IOError: [Errno 21] Is a directory: '../war'
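
The traceback shows the type check being applied to the directory itself. One possible guard, sketched here under the assumption of Python 3 and the standard library, is to test for a directory first and hand only regular files to the type detection. `iter_candidate_files` is a hypothetical helper, not a function in arkfind.

```python
import os


def iter_candidate_files(base_path):
    """Yield the paths whose file type should be inspected.

    A directory is expanded recursively into the files beneath it; any
    other path is yielded unchanged.
    """
    if os.path.isdir(base_path):
        for dirpath, dirnames, filenames in os.walk(base_path):
            dirnames.sort()  # visit subdirectories in a stable order
            for filename in sorted(filenames):
                yield os.path.join(dirpath, filename)
    else:
        yield base_path
```

Each yielded path could then be passed to the type detection in turn, so that a call like the failing one above never opens a directory.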
