diskoverdata / diskover-community

Diskover Community Edition - an open source file indexer, file search engine, and data management and analytics platform powered by Elasticsearch

Home Page: https://diskoverdata.com

License: Apache License 2.0

CSS 16.67% PHP 55.17% JavaScript 15.45% Python 12.72%
python elasticsearch crawler filesystem-visualization filesystem-analysis filesystem-indexer disk-space disk-usage storage-analytics storage

diskover-community's People

Contributors

carlchan, cat-dancer, dependabot[bot], dustinaja, fake-name, helge000, mathse, nicolelpierce, rapphil, seanbales, shirosaidev, suika, tombomb


diskover-community's Issues

why Files Indexed is zero

2017-06-23 00:49:20,495 [INFO][diskover] Connecting to Elasticsearch
2017-06-23 00:49:20,499 [INFO][diskover] Checking for ES index: diskover-20170623
2017-06-23 00:49:20,502 [WARNING][diskover] ES index exists, deleting
2017-06-23 00:49:20,558 [INFO][diskover] Creating ES index
Crawling: [100%] |########################################| 229/229
2017-06-23 00:49:21,125 [INFO][diskover] Finished crawling
2017-06-23 00:49:21,126 [INFO][diskover] Directories Crawled: 229
2017-06-23 00:49:21,126 [INFO][diskover] Files Indexed: 0
2017-06-23 00:49:21,126 [INFO][diskover] Elapsed time: 0.635111093521
tonyzhang@Ubuntu16:~/data/web/esphp$

Diskover v1.5.0-rc25 failed to create index

Hi, I followed the installation instructions and ran the command below, which returned an error; logs attached.
Your help will be much appreciated, thanks.

python /path/to/diskover.py -d /rootpath/you/want/to/crawl -i diskover-indexname -a -O

2018-12-19 20:39:35,017 [WARNING][diskover] Not running as root, permissions might block crawling some files
2018-12-19 20:39:35,017 [INFO][diskover] Checking es index: diskover-index
2018-12-19 20:39:35,019 [INFO][diskover] Creating es index
2018-12-19 20:39:35,059 [WARNING][elasticsearch] PUT http://localhost:9200/diskover-index [status:400 request:0.039s]
Traceback (most recent call last):
  File "diskover.py", line 2100, in <module>
    pre_crawl_tasks()
  File "diskover.py", line 1822, in pre_crawl_tasks
    index_create(cliargs['index'])
  File "diskover.py", line 698, in index_create
    es.indices.create(index=indexname, body=mappings)
  File "/home/salussage/diskover/local/lib/python2.7/site-packages/elasticsearch5/client/utils.py", line 73, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/home/salussage/diskover/local/lib/python2.7/site-packages/elasticsearch5/client/indices.py", line 107, in create
    params=params, body=body)
  File "/home/salussage/diskover/local/lib/python2.7/site-packages/elasticsearch5/transport.py", line 312, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/home/salussage/diskover/local/lib/python2.7/site-packages/elasticsearch5/connection/http_urllib3.py", line 129, in perform_request
    self._raise_error(response.status, raw_data)
  File "/home/salussage/diskover/local/lib/python2.7/site-packages/elasticsearch5/connection/base.py", line 125, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch5.exceptions.RequestError

How do you actually use the `--finddupes` command? (and lots of miscellaneous issues)

herp@mainnas:/media/Storage/Scripts/diskover$ python3 diskover.py -i diskover-extra --finddupes


     _/_/_/    _/            _/
    _/    _/        _/_/_/  _/  _/      _/_/    _/      _/    _/_/    _/  _/_/
   _/    _/  _/  _/_/      _/_/      _/    _/  _/      _/  _/_/_/_/  _/_/
  _/    _/  _/      _/_/  _/  _/    _/    _/    _/  _/    _/        _/
 _/_/_/    _/  _/_/_/    _/    _/    _/_/        _/       _/_/_/  _/
                              v1.5.0-rc29
                              https://shirosaidev.github.io/diskover
                              "I didn't even know that was there."
                              Support diskover on Patreon or PayPal :)


2019-02-03 22:33:16,736 [INFO][diskover] Using config file: /media/Storage/Scripts/diskover/diskover.cfg
2019-02-03 22:33:16,738 [INFO][diskover] Waiting for diskover worker bots to start...
2019-02-03 22:33:18,740 [INFO][diskover] Waiting for diskover worker bots to start...
2019-02-03 22:33:20,743 [INFO][diskover] Waiting for diskover worker bots to start...
<repeats forever>

However, I have 8 worker bots running (as started by diskover-bot-launcher.sh) on the same host.

Poking about with the bot stuff:

./diskover-bot-launcher.sh -l 3 exits. Where do the logs go? Who knows? They don't seem to go into /var/log or the directory that contains the diskover scripts. I assume that higher log levels need to be run without detaching from the CLI.

Looking in the source, apparently it's supposed to output where the logs are getting written to, but that doesn't work:

herp@mainnas:/media/Storage/Scripts/diskover$ ./diskover-bot-launcher.sh -l 3


  ________  .__        __
  \______ \ |__| _____|  | _________  __ ___________
   |    |  \|  |/  ___/  |/ /  _ \  \/ // __ \_  __ \ /)___(\
   |    `   \  |\___ \|    <  <_> )   /\  ___/|  | \/ (='.'=)
  /_______  /__/____  >__|_ \____/ \_/  \___  >__|   ("\)_("\)
          \/        \/     \/               \/
                Worker Bot Launcher v1.6
                https://github.com/shirosaidev/diskover
                "Crawling all your stuff, core melting time"


Starting 8 worker bots in background...
mainnas.31572 (pid 31572) (botnum 1)
mainnas.31574 (pid 31574) (botnum 2)
mainnas.31576 (pid 31576) (botnum 3)
mainnas.31578 (pid 31578) (botnum 4)
mainnas.31580 (pid 31580) (botnum 5)
mainnas.31582 (pid 31582) (botnum 6)
mainnas.31584 (pid 31584) (botnum 7)
mainnas.31586 (pid 31586) (botnum 8)
DONE!
All worker bots have started
Worker pids have been stored in /tmp/diskover_bot_pids, use -k flag to shutdown workers or -r to restart
Exiting, sayonara!

Digging more, apparently you have to specify the log directory by editing the script, and if you don't, you can turn on logging, but it doesn't do anything. This is certainly not the sort of thing you'd normally expect, and allowing the -l flag to be specified without a non-empty BOTLOG should result in an error.

More realistically, you should either use all CLI args or all file-edit args. Having a mix is extremely confusing, and the lack of parameter validation doesn't help.


More stuff:

It's way, way too easy to accidentally delete an index. Right now, if you simply re-run a scan, it'll drop the previous index. That'd be fine if it didn't take 3+ days to index some of my volumes. Silently and automatically deleting something that takes 3 days to generate is really, REALLY annoying (particularly since the high IO load it entails renders the system near-unusable for the duration). At minimum, you should have a "Do you want to drop the existing index [yN]" prompt before dropping an existing scan result.

Additionally, how do you resume a scan? Can you resume a scan? There is --reindexrecurs, but I'm not sure if that's what you want to use?

The GIANT MOTD on every launch is also annoying, particularly when you're scrolling back and forth to look at the help output.

Directory with a trailing space kills crawl

Since our storage is accessed from Macs, we are seeing a lot of entries (files and directories) with a trailing space in the name (because artists make a new folder and paste a copied string which contains a trailing space).
This is normally not an issue for any tool (though it makes things tricky in the terminal), but it kills the crawl:

[2017-04-27 13:05:13] [status] Finding directories to crawl
Crawling: [66%] |██████████████████████████--------------| 176/268Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/ronald/diskover/diskover.py", line 251, in processDirectoryWorker
    filelist = crawlFiles(path, DATEEPOCH, DAYS, MINSIZE, EXCLUDED_FILES, VERBOSE)
  File "/home/ronald/diskover/diskover.py", line 158, in crawlFiles
    for name in os.listdir(path):
OSError: [Errno 2] No such file or directory: './.cleanup/2016_F4_JUNE_WORK STYLES'

This folder is actually .cleanup/2016_F4_JUNE_WORK STYLES /

I removed name = name.strip() from line 159 in diskover.py but still get an exception.
I'm pretty sure it's the find -type d command that doesn't properly preserve the trailing space.
I would also be curious how find and the rest of the os library deal with unicode characters.
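A minimal sketch (not diskover's actual code) of why that kind of cleanup breaks: the directory really is named with a trailing space, so stripping whitespace from the name before listing it produces a path that does not exist, and the unhandled OSError then kills the worker thread.

import os

def crawl_files(path):
    # path = path.strip()  # this normalization turns '.../WORK STYLES /' into a missing path
    try:
        for name in os.listdir(path):
            yield os.path.join(path, name)
    except OSError as err:
        # skip vanished or unreadable entries instead of letting the thread die
        print("skipping %r: %s" % (path, err))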

Tolerate not being able to connect to server?

Hi,

I'm interested in being able to deploy the diskover client (i.e. the worker and the crawlers) separately from the diskover-web stack (redis, ES, and the PHP app), but as far as I can tell, the worker bots currently have no way of tolerating the server going away, e.g. by queueing up POSTs and trying again on a timer or when the network environment changes. Is that true? Are there plans to change this?

Thanks!
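A rough sketch of the queue-and-retry behaviour being asked about, under the assumption that the worker has some callable for reaching the server (send_post here is hypothetical, not diskover's API):

import time

def post_with_retry(send_post, payload, max_wait=300):
    wait = 5
    while True:
        try:
            return send_post(payload)       # hypothetical callable that talks to the server
        except ConnectionError:
            time.sleep(wait)                # back off instead of giving up
            wait = min(wait * 2, max_wait)  # exponential backoff, capped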

Index names case not handled

Specifying an index with upper case letters is not supported in elasticsearch and raises an unhandled exception. There should be a simple regex check for the index name, IMHO we should just lower() it:

# ./diskover.py -i diskover-TEST -a 
2018-08-15 15:04:09,817 [INFO][diskover] Checking es index: diskover-TEST
2018-08-15 15:04:09,825 [INFO][diskover] Creating es index
2018-08-15 15:04:09,829 [WARNING][elasticsearch] PUT http://localhost:9200/diskover-TEST [status:400 request:0.003s]
Traceback (most recent call last):
  File "./diskover.py", line 1847, in <module>
    index_create(cliargs['index'])
  File "./diskover.py", line 667, in index_create
    es.indices.create(index=indexname, body=mappings)
  File "/usr/lib/python2.7/site-packages/elasticsearch5/client/utils.py", line 73, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/lib/python2.7/site-packages/elasticsearch5/client/indices.py", line 107, in create
    params=params, body=body)
  File "/usr/lib/python2.7/site-packages/elasticsearch5/transport.py", line 312, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/elasticsearch5/connection/http_urllib3.py", line 129, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/lib/python2.7/site-packages/elasticsearch5/connection/base.py", line 125, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch5.exceptions.RequestError: TransportError(400, u'invalid_index_name_exception', u'Invalid index name [diskover-TEST], must be lowercase') 
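A sketch of the check suggested above (not diskover's actual code): lowercase the name and reject anything Elasticsearch would refuse anyway.

import re

def normalize_index_name(indexname):
    # Elasticsearch index names must be lowercase
    indexname = indexname.lower()
    if not re.match(r'^[a-z0-9][a-z0-9._-]*$', indexname):
        raise ValueError("invalid index name: %s" % indexname)
    return indexname

print(normalize_index_name('diskover-TEST'))  # -> 'diskover-test'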

why are my bots stopping / or maybe i am doing it wrong

my basic scan setup looks like the following

name=$(generate a name)
./diskover-bot-launcher.sh -w 128
./diskover.py -i $name -d /mnt/point $OPTIONS
./diskover.py -i $name -D

while scanning I am looking at the redis instance

[root@sl-it-p-crwlr1 2019-01]# rqinfo --url redis://localhost:6379
diskover_crawl |█████████████████ 174
1 queues, 174 jobs total
0 workers, 1 queues
Updated: 2019-01-06 11:26:37.474399

I am experiencing diskover running out of workers :-/
All of the other 128 workers did not log anything to the log files.
At the same time, diskover is still able to work jobs from the queue... without any workers?

Problem starting diskover: "no module named pwd"

Hi, I've got a problem starting your script. I followed the instructions and installed the requirements first. Then I cd into the folder I want to index and execute the Python script. I am using a Windows 10 system!

Error in line 44 of diskover.py:
import pwd
ModuleNotFoundError: No module named 'pwd'

I can't install pwd via pip; the package doesn't exist.
Thanks.
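A sketch of a cross-platform guard (not diskover's actual code): pwd is a Unix-only standard-library module, so importing it unconditionally fails on Windows.

try:
    import pwd
except ImportError:       # e.g. running on Windows
    pwd = None

def owner_name(uid):
    if pwd is None:
        return str(uid)   # fall back to the numeric uid where pwd is unavailable
    return pwd.getpwuid(uid).pw_name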

Index without dash causes exception

Running diskover with -i <indexname> where the index name does not have a dash (-) causes an exception:

user$ diskover.py -i diskover.mnt.20170505 -d /mnt
[2017-05-05 07:41:52] [status] Connecting to Elasticsearch
<snipped output>
[2017-05-05 08:06:53] [status] Finished crawling
Traceback (most recent call last):
  File "diskover.py", line 517, in <module>
    main()
  File "diskover.py", line 495, in main
    INDEXSUFFIX = INDEXNAME.split('diskover-')[1].strip()
IndexError: list index out of range

Maybe put a check at the top of the code to see if the index name contains (starts with) diskover- and fail with a verbose message, rather than running through all the code and the filesystem and failing at the end?
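A minimal sketch of that early check, assuming the required prefix is 'diskover-' (not diskover's actual code):

def validate_index_name(indexname):
    if not indexname.startswith('diskover-'):
        raise SystemExit("Index name must start with 'diskover-', got: %s" % indexname)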

0 Files Indexed and Nothing Showing in Kibana

OS: Ubuntu 16.04.
Latest Elasticsearch, Kibana, Python, etc.
Searching an RO NFS Mount.

Here is my output:

python /diskover/diskover.py
[2017-04-26 13:37:27] *** Checking for ES index
[2017-04-26 13:37:27] *** ES index exists, deleting
[2017-04-26 13:37:27] *** Creating ES index
[2017-04-26 13:37:27] *** Starting crawl
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/diskover/diskover.py", line 169, in processDirectoryWorker
    filelist = crawlFiles(path, DATEEPOCH, DAYS, MINSIZE, EXCLUDED_FILES, VERBOSE)
  File "/diskover/diskover.py", line 122, in crawlFiles
    owner = null
NameError: global name 'null' is not defined

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/diskover/diskover.py", line 169, in processDirectoryWorker
    filelist = crawlFiles(path, DATEEPOCH, DAYS, MINSIZE, EXCLUDED_FILES, VERBOSE)
  File "/diskover/diskover.py", line 122, in crawlFiles
    owner = null
NameError: global name 'null' is not defined

[2017-04-26 13:38:08] *** Main thread waiting
[2017-04-26 13:38:08] *** Done
[2017-04-26 13:38:08] *** Directories Crawled: 4168
[2017-04-26 13:38:08] *** Files Indexed: 0
[2017-04-26 13:38:08] *** Elapsed time: 41.3652939796
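The threads above die on owner = null, and Python has no null. A sketch of the likely intent (an assumption, not the project's actual fix) is to fall back to None, or the raw uid, when the uid has no passwd entry:

import pwd

def lookup_owner(uid):
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return None  # no passwd entry for this uid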

v1.0.13 not working (SSL issue?).

I have been busy and not keeping up with DiskOver; I went from version 1.0.5 to 1.0.13, but it no longer runs.
First I ran into AWS issues - I had to ensure aws=False in the cfg file - but then:

elasticsearch.exceptions.ImproperlyConfigured: Please install requests to use RequestsHttpConnection.

So I did pip install requests and then it went past that, but I ended up with:


    _/_/_/    _/            _/
   _/    _/        _/_/_/  _/  _/      _/_/    _/      _/    _/_/    _/  _/_/
  _/    _/  _/  _/_/      _/_/      _/    _/  _/      _/  _/_/_/_/  _/_/
 _/    _/  _/      _/_/  _/  _/    _/    _/    _/  _/    _/        _/
_/_/_/    _/  _/_/_/    _/    _/    _/_/        _/ v1.0.13  _/_/_/  _/
                              https://github.com/shirosaidev/diskover

2017-05-29 13:35:56,326 [INFO][diskover] Connecting to Elasticsearch
2017-05-29 13:35:56,364 [WARNING][elasticsearch] HEAD https://localhost:9200/ [status:N/A request:0.038s]
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_requests.py", line 75, in perform_request
    timeout=timeout or self.timeout)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 623, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 514, in send
    raise SSLError(e, request=request)
SSLError: [SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:590)
2017-05-29 13:35:56,374 [WARNING][elasticsearch] HEAD https://localhost:9200/ [status:N/A request:0.009s]
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_requests.py", line 75, in perform_request
    timeout=timeout or self.timeout)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 623, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 514, in send
    raise SSLError(e, request=request)
SSLError: [SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:590)
2017-05-29 13:35:56,389 [WARNING][elasticsearch] HEAD https://localhost:9200/ [status:N/A request:0.011s]
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_requests.py", line 75, in perform_request
    timeout=timeout or self.timeout)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 623, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 514, in send
    raise SSLError(e, request=request)
SSLError: [SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:590)
2017-05-29 13:35:56,396 [WARNING][elasticsearch] HEAD https://localhost:9200/ [status:N/A request:0.007s]
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_requests.py", line 75, in perform_request
    timeout=timeout or self.timeout)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 623, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 514, in send
    raise SSLError(e, request=request)
SSLError: [SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:590)
2017-05-29 13:35:56,396 [ERROR][diskover] Unable to connect to Elasticsearch

Please keep the documentation up to date; the requests dependency was not covered by installing the requirements.
This latest issue seems to be SSL related? Elasticsearch is running.
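The repeated HEAD https://localhost:9200 lines show the client is trying to speak TLS to a plain-http Elasticsearch. A quick standalone check with the same elasticsearch Python library (a sketch, not diskover's code):

from elasticsearch import Elasticsearch

# use_ssl=False makes the client talk plain http to a default ES install
es = Elasticsearch(hosts=[{'host': 'localhost', 'port': 9200}], use_ssl=False)
print(es.ping())  # True when the cluster answers over http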

diskover.py error when running indexing commands

Traceback (most recent call last):
  File "diskover.py", line 1464, in <module>
    config = load_config()
  File "diskover.py", line 175, in load_config
    t = config.get('autotag', 'files')
  File "/usr/lib/python2.7/ConfigParser.py", line 607, in get
    raise NoSectionError(section)
ConfigParser.NoSectionError: No section: 'autotag'
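The error means the diskover.cfg being read has no [autotag] section. A defensive-read sketch (an assumption, not diskover's actual fix) that tolerates config files written for older releases:

try:
    import ConfigParser as configparser   # Python 2, matching the traceback above
except ImportError:
    import configparser                   # Python 3

config = configparser.ConfigParser()
config.read('diskover.cfg')
try:
    autotag_files = config.get('autotag', 'files')
except (configparser.NoSectionError, configparser.NoOptionError):
    autotag_files = '[]'  # assume no autotag rules when the section is absent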

Redis version not being detected correctly

So everything had been running just fine in my unRAID env with Redis, Elasticsearch and Diskover. However, I had a flurry of updates last week; I know Redis was updated, but I'm not sure about Diskover. Now Diskover doesn't detect the right version of Redis. At least that's what it looks like to me:

pkg_resources.ContextualVersionConflict: (redis 2.10.6 (/usr/lib/python3.6/site-packages), Requirement.parse('redis>=3.0.0'), {'rq'})

Full Log:
https://pastebin.com/KVUU5tjW

However my REDIS is version 4.0.9:
https://imgur.com/eNHE5vL
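The conflict above concerns the Python redis client that rq imports, not the redis-server version, so the installed client library is what has to satisfy rq's redis>=3.0.0 requirement. A quick check (a sketch, not part of diskover):

import redis

# prints the *client* library version; 2.10.6 is what rq is rejecting above,
# independent of the redis-server 4.0.9 that is actually running
print(redis.__version__)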

How do you run a complete system under docker?

Basically:

I managed to get most of https://hub.docker.com/r/linuxserver/diskover/ working by modifying the diskover volumes:

    volumes:
      - :/config
      - :/data

to

    volumes:
      - ${HOME}/docker/config:/config
      - ${HOME}/docker/data:/data

But there are still issues on startup:

elasticsearch    | [2019-01-27T08:27:09,727][INFO ][o.e.n.Node               ] [] initializing ...
elasticsearch    | [2019-01-27T08:27:09,747][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [] uncaught exception in thread [main]
elasticsearch    | org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: Failed to create node environment
elasticsearch    |      at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:136) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    |      at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:123) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    |      at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:70) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    |      at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:134) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    |      at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    |      at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:91) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    |      at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:84) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    | Caused by: java.lang.IllegalStateException: Failed to create node environment
elasticsearch    |      at org.elasticsearch.node.Node.<init>(Node.java:268) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    |      at org.elasticsearch.node.Node.<init>(Node.java:245) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    |      at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:233) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    |      at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:233) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    |      at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:342) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    |      at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:132) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    |      ... 6 more
elasticsearch    | Caused by: java.nio.file.AccessDeniedException: /usr/share/elasticsearch/data/nodes
elasticsearch    |      at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84) ~[?:?]
elasticsearch    |      at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
elasticsearch    |      at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
elasticsearch    |      at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384) ~[?:?]
elasticsearch    |      at java.nio.file.Files.createDirectory(Files.java:674) ~[?:1.8.0_161]
elasticsearch    |      at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781) ~[?:1.8.0_161]
elasticsearch    |      at java.nio.file.Files.createDirectories(Files.java:767) ~[?:1.8.0_161]
elasticsearch    |      at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:221) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    |      at org.elasticsearch.node.Node.<init>(Node.java:265) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    |      at org.elasticsearch.node.Node.<init>(Node.java:245) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    |      at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:233) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    |      at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:233) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    |      at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:342) ~[elasticsearch-5.6.9.jar:5.6.9]
elasticsearch    |      at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:132) ~[elasticsearch-5.6.9.jar:5.6.9]

It gets to some state, and I can access bits of the system:

  • http://<server-ip>:9181/ seems to show workers, but I have no idea how to send them commands
  • http://<server-ip>/ returns HTTP 500
  • http://<server-ip>:9999/ also returns a HTTP 500, and an error shows up in the console:
diskover         | Exception in thread Thread-1:
diskover         | Traceback (most recent call last):
diskover         |   File "/app/diskover/diskover_socket_server.py", line 73, in socket_thread_handler
diskover         |     command_dict = json.loads(data)
diskover         |   File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
diskover         |     return _default_decoder.decode(s)
diskover         |   File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
diskover         |     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
diskover         |   File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
diskover         |     raise JSONDecodeError("Expecting value", s, err.value) from None
diskover         | json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
diskover         |
diskover         | During handling of the above exception, another exception occurred:
diskover         |
diskover         | Traceback (most recent call last):
diskover         |   File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
diskover         |     self.run()
diskover         |   File "/usr/lib/python3.6/threading.py", line 864, in run
diskover         |     self._target(*self._args, **self._kwargs)
diskover         |   File "/app/diskover/diskover_socket_server.py", line 86, in socket_thread_handler
diskover         |     message = b'{"msg": "error", "error": ' + e + b'}\n'
diskover         | TypeError: can't concat JSONDecodeError to bytes
diskover         |
diskover         | 2019-01-27 08:35:39,403 [INFO][diskover] Got a connection from ('10.1.1.4', 39748)
diskover         | 2019-01-27 08:35:39,404 [INFO][diskover] Waiting for connection, listening on 0.0.0.0 port 9999 TCP (ctrl-c to shutdown)
diskover         | 2019-01-27 08:35:39,404 [INFO][diskover] [thread-1]: Got command from ('10.1.1.4', 39748)
diskover         | 2019-01-27 08:35:39,405 [ERROR][diskover] [thread-1]: Invalid JSON from ('10.1.1.4', 39748): (Expecting value: line 1 column 1 (char 0))


So... basically, for someone who hasn't done anything with elasticsearch, docker, etc, how do I make this do anything?
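Separately, the TypeError in the diskover socket-server log above comes from concatenating an exception object to bytes while building the error reply. A sketch of a fix (not the project's actual patch):

import json

def error_reply(exc):
    # serialize the exception text properly instead of concatenating it to bytes
    return (json.dumps({"msg": "error", "error": str(exc)}) + "\n").encode("utf-8")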

redis.exceptions.ConnectionError: Error 104 while writing to socket. Connection reset by peer.

Hello,

When I run the scan on some larger filesystems, the script terminates with the error below. Can you please look into it? Let me know if I need to tune something on the Redis side to accept larger datasets.

Environment :
python 3.6
Redhat Ent Linux 7.2
redis 4.0.9
elasticsearch 5.6.9

Error:
/crawler_pybin/python/bin/python /crawler_pybin/diskover/diskover.py -d /proj/diablo -i diskover-xhd-proj-diablo -a


[diskover ASCII art banner]
                              v1.5.0-rc1
                              https://shirosaidev.github.io/diskover
                              Bringing light to the darkness.
                              Support diskover on Patreon or PayPal :)

2018-04-25 08:05:49,472 [INFO][diskover] Checking es index: diskover-xhd-proj-diablo
2018-04-25 08:05:49,475 [INFO][diskover] Creating es index
2018-04-25 08:05:50,105 [INFO][diskover] Adding disk space info to es index
2018-04-25 08:05:50,152 [INFO][diskover] Found 60 diskover RQ worker bots
2018-04-25 08:05:50,152 [INFO][diskover] Enqueueing crawl to diskover worker bots for /proj/diablo...
2018-04-25 08:05:50,152 [INFO][diskover] Sending adaptive batches to worker bots
Crawling: [===================================================================================================================================================================================] 100% (Elapsed Time: 6:37:53, Time: 6:37:53)
2018-04-25 14:43:44,019 [INFO][diskover] Finished crawling!
2018-04-25 14:43:44,020 [INFO][diskover] Waiting for diskover bots to be done with any crawl jobs...
2018-04-25 14:43:45,102 [INFO][diskover] Getting diskover bots to calculate directory sizes...
2018-04-25 14:43:45,102 [INFO][diskover] Searching for all directory docs in diskover-xhd-proj-diablo
2018-04-25 14:50:26,777 [INFO][diskover] Found 10011451 directory docs
Traceback (most recent call last):
  File "/crawler_pybin/python/lib/python3.6/site-packages/redis/connection.py", line 590, in send_packed_command
    self._sock.sendall(item)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/crawler_pybin/python/lib/python3.6/site-packages/redis/client.py", line 2879, in execute
    return execute(conn, stack, raise_on_error)
  File "/crawler_pybin/python/lib/python3.6/site-packages/redis/client.py", line 2749, in _execute_transaction
    connection.send_packed_command(all_cmds)
  File "/crawler_pybin/python/lib/python3.6/site-packages/redis/connection.py", line 603, in send_packed_command
    (errno, errmsg))
redis.exceptions.ConnectionError: Error 104 while writing to socket. Connection reset by peer.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/crawler_pybin/python/lib/python3.6/site-packages/redis/connection.py", line 590, in send_packed_command
    self._sock.sendall(item)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/crawler_pybin/diskover/diskover.py", line 1408, in <module>
    calc_dir_sizes(addstats=True)
  File "/crawler_pybin/diskover/diskover.py", line 1105, in calc_dir_sizes
    args=(dirbatch, cliargs,))
  File "/crawler_pybin/python/lib/python3.6/site-packages/rq/queue.py", line 300, in enqueue
    job_id=job_id, at_front=at_front, meta=meta)
  File "/crawler_pybin/python/lib/python3.6/site-packages/rq/queue.py", line 252, in enqueue_call
    job = self.enqueue_job(job, at_front=at_front)
  File "/crawler_pybin/python/lib/python3.6/site-packages/rq/queue.py", line 325, in enqueue_job
    pipe.execute()
  File "/crawler_pybin/python/lib/python3.6/site-packages/redis/client.py", line 2894, in execute
    return execute(conn, stack, raise_on_error)
  File "/crawler_pybin/python/lib/python3.6/site-packages/redis/client.py", line 2749, in _execute_transaction
    connection.send_packed_command(all_cmds)
  File "/crawler_pybin/python/lib/python3.6/site-packages/redis/connection.py", line 603, in send_packed_command
    (errno, errmsg))
redis.exceptions.ConnectionError: Error 104 while writing to socket. Connection reset by peer.

Requirement Analysis.

Hi Chris,

I have initiated scans on multiple filesystems totalling around 500 to 800TB, and in the future we are planning to use this tool to scan around 7 to 8PB every 3 months. In this case, can you suggest the environment requirements, i.e., what hardware would be required for an 8PB scan? The current environment details are mentioned below.

Redis Server : Red Hat Enterprise Linux 7.2 (with 128GB RAM, 12 Core * 2 CPUs)
Elasticsearch : Ubuntu 16.04 ( 3 Node Cluster with 32GB RAM, 8 Core CPU and 1TB Storage for elasticsearch data and log on each node)
BOT Count : 600

Thanks & Regards,
RaviTeja

UTF-8 decode issue when crawling

I installed diskover to hopefully get some insight into a large-ish dataset on an old NetApp filer.

A short way into the crawl diskover stops with a traceback about failing to decode UTF-8:

2018-01-19 10:46:11,547 [INFO][diskover] Creating ES index
2018-01-19 10:46:11,796 [INFO][diskover] Adding disk space info to ES index
2018-01-19 10:46:11,804 [INFO][diskover] Starting crawl using 8 threads
Crawling: 100%|████████████████████| 5630/5643 [0h:00m:00s, 36.7 dir/s]Traceback (most recent call last):
  File "diskover.py", line 2699, in <module>
    worker_setup_crawl(path)
  File "diskover.py", line 1147, in worker_setup_crawl
    start_crawl(path, crawlbot)
  File "diskover.py", line 1083, in start_crawl
    for root, dirs, files in walk(path):
  File "/home/AD/johnb/diskover-venv/lib/python2.7/site-packages/scandir.py", line 654, in _walk
    for entry in walk(new_path, topdown, onerror, followlinks):
  File "/home/AD/johnb/diskover-venv/lib/python2.7/site-packages/scandir.py", line 654, in _walk
    for entry in walk(new_path, topdown, onerror, followlinks):
  File "/home/AD/johnb/diskover-venv/lib/python2.7/site-packages/scandir.py", line 654, in _walk
    for entry in walk(new_path, topdown, onerror, followlinks):
  File "/home/AD/johnb/diskover-venv/lib/python2.7/site-packages/scandir.py", line 654, in _walk
    for entry in walk(new_path, topdown, onerror, followlinks):
  File "/home/AD/johnb/diskover-venv/lib/python2.7/site-packages/scandir.py", line 654, in _walk
    for entry in walk(new_path, topdown, onerror, followlinks):
  File "/home/AD/johnb/diskover-venv/lib/python2.7/site-packages/scandir.py", line 654, in _walk
    for entry in walk(new_path, topdown, onerror, followlinks):
  File "/home/AD/johnb/diskover-venv/lib/python2.7/site-packages/scandir.py", line 603, in _walk
    entry = next(scandir_it)
  File "/home/AD/johnb/diskover-venv/lib64/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8a in position 69: invalid start byte

I'm guessing there are some filenames with some Latin-1 encoding, or maybe just errant bytes input by a user accidentally.
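A sketch under that assumption (stray non-UTF-8 bytes in filenames; not diskover's actual fix): on Python 2, walking a byte path keeps the scandir/os.walk machinery from decoding every entry as UTF-8, so odd names come back as raw bytes that can be reported instead of aborting the crawl with UnicodeDecodeError.

import os

root = b'/mnt/netapp/volume'  # hypothetical mount point
for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        try:
            name.decode('utf-8')
        except UnicodeDecodeError:
            print('non-UTF-8 filename: %r in %r' % (name, dirpath))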

When running crawlbot, getting NameError: name 'cliargs' is not defined

/ Crawling (Queue: 0) Elapsed Time: 0:00:00                                                                                          2019-06-19 16:21:16,670 [INFO][diskover] Starting crawl using 16 treewalk threads (maxdepth 1)

Exception in thread Thread-12:
Traceback (most recent call last):
  File "/opt/diskover/diskover.py", line 1524, in scandirwalk_worker
    if cliargs['debug'] or cliargs['verbose']:
NameError: name 'cliargs' is not defined

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/diskover/diskover.py", line 1549, in scandirwalk_worker
    logger.error("[thread-%s] Exception caused by: %s" % (threadn, e))
NameError: name 'logger' is not defined

I have logger and cliargs available on the system, so I'm not sure why this error occurs. This is with v1.5.0-rc30.
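logger and cliargs here are module-level globals inside diskover.py, not installable packages, and the NameError suggests they are never set on the crawlbot code path before the tree-walk threads start. A sketch (assumed structure, not diskover's actual code) of passing them to the worker threads explicitly:

import logging
import threading

def scandirwalk_worker(threadn, cliargs, logger):
    if cliargs.get('debug') or cliargs.get('verbose'):
        logger.info("[thread-%s] starting", threadn)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('diskover')
cliargs = {'debug': False, 'verbose': True}

threads = [threading.Thread(target=scandirwalk_worker, args=(n, cliargs, logger))
           for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()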

reuse dupe hashes from other indices

If I scan a folder multiple times (e.g. once a week), most of the dupes will still have the same hash. So maybe it's worth taking those hashes from former indices (index2) and only calculating hashes for the new dupes. This would also improve the overall crawl time (crawl + finddupes).
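A sketch of that reuse idea: before re-hashing a file, look up its entry in the previous index and keep the stored hash when size and mtime are unchanged. The field names (path_parent, filename, filesize, last_modified, filehash) are assumptions, not diskover's confirmed schema.

from elasticsearch import Elasticsearch

es = Elasticsearch()

def previous_hash(old_index, parent, name, size, mtime):
    # assumed field names; only reuse the hash when size and mtime still match
    res = es.search(index=old_index, body={
        "query": {"bool": {"must": [
            {"term": {"path_parent": parent}},
            {"term": {"filename": name}},
            {"term": {"filesize": size}},
            {"term": {"last_modified": mtime}},
        ]}}})
    hits = res['hits']['hits']
    return hits[0]['_source'].get('filehash') if hits else None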

autotag for dirs only tags the first definition

I created the following autotag definition

dirs = [{"name":["*tmp*","*TMP*","*temp*","*TEMP*","*Temp*","*cache*","*CACHE*","*Cache*","*to*delete*"],"name_exclude":["*emperatur*","*emplate*"],"path":[],"path_exclude":[],"mtime":90,"atime":0,"ctime":90,"tag":"delete","tag_custom":"autotag"},{"name":["*FORMER*LAB*MEMBERS*"],"name_exclude":[],"path":[],"path_exclude":[],"mtime":0,"atime":0,"ctime":0,"tag":"archive","tag_custom":"autotag"}]

but the autotagger only applies the first definition.

Missing diskover_redis_worker.py

The GIT repository seems to be missing the diskover_redis_worker.py module.

$ python diskover.py -d /home/proadmin
Traceback (most recent call last):
  File "diskover.py", line 24, in <module>
    import diskover_worker_bot
  File "/home/echristensen/diskover/diskover/diskover_worker_bot.py", line 14, in <module>
    import diskover
  File "/home/echristensen/diskover/diskover/diskover.py", line 28, in <module>
    import diskover_dupes
  File "/home/echristensen/diskover/diskover/diskover_dupes.py", line 15, in <module>
    import diskover_redis_worker
ImportError: No module named diskover_redis_worker

Content index

Any plans to add a plug-in system or similar, to allow hooking up readers for content indexing?

Threads die if they attempt to crawl a deleted directory

In crawlFiles I think there needs to be some error handling in cases where a directory or file has been deleted between when it was added to the crawl list and when it is recursed into or stat'd. If this happens for a directory, you get a traceback like this:

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 764, in run
    self.__target(*self.__args, **self.__kwargs)
  File "diskover.py", line 327, in processDirectoryWorker
    filelist = crawlFiles(path, DATEEPOCH, DAYSOLD, MINSIZE, EXCLUDED_FILES, VERBOSE)
  File "diskover.py", line 212, in crawlFiles
    for name in os.listdir(path):
OSError: [Errno 2] No such file or directory: '/no/longer/here'

And that thread is dead for the remainder of the scan. Since a directory or file can get yanked out from underneath a diskover thread at any time, is it possible to wrap the relevant bits of crawlFiles in try/except blocks and have it gracefully move on in the event of an OSError? Or would respawning the thread be better/easier?
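A minimal sketch of the try/except being asked for (not diskover's actual code): treat a directory that vanished mid-crawl as empty, log it, and keep the thread alive.

import os

def crawl_files_safe(path):
    try:
        names = os.listdir(path)
    except OSError as err:
        print("warning: %s disappeared or is unreadable (%s), moving on" % (path, err))
        return []
    return [os.path.join(path, name) for name in names]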

Inodes too big for Elastic

It looks like Diskover uses the data type "float" for inodes, but inodes on Isilon can be much larger than the float data type can represent exactly. A "float" in Elastic is a 32-bit IEEE 754 number, which has a maximum value of about 3x10^38 but only 24 bits of significand, so any inode larger than 2^24 (16,777,216) loses precision; because Isilon can support inodes in the hundreds of billions, the stored inodes end up being incorrect.

We found this because when we try to search for inodes (using diskover-web or Kibana), we get results for files that do not actually match the inode we searched for because it is only searching for the first 7 significant digits.

The data type for inode should be set to "double" or "long" in order to fix this issue. Without this change, any inode larger than 2^24 could be stored incorrectly in Elastic and will never be accurately searchable.

Auto-tag behaviour

Been trying the autotag feature, but getting poor results.

Config looks like this:

files =
  [ {"name": [], "name_exclude": [],
     "ext": ["tmp*", "TMP*", "temp*", "TEMP*", "cache*", "CACHE*"],
     "path": [],
     "path_exclude": [], "mtime": 90, "atime": 0, "ctime": 90,
     "tag": "delete", "tag_custom": "autotag"},

  { "name":[], "name_exclude": [],
    "ext": ["zip", "ZIP", "dmg", "DMG", "iso", "ISO", "gz", "bz2"],
    "path": [], "path_exclude": [],
    "mtime": 0, "atime":0, "ctime":0,
    "tag": "archive", "tag_custom": "archive" },

  { "name":[], "name_exclude": [],
    "ext": ["mov","MOV","mp4","MP4","m4v","M4V","avi","AVI"],
    "path": [], "path_exclude": [],
    "mtime": 0, "atime":0, "ctime":0,
    "tag": "keep", "tag_custom": "video" },

  { "name":["DSCF*","dscf*","IMG_*","img_*"], "name_exclude": [],
    "ext": ["cr2","CR2","RAF","raf"],
    "path": [], "path_exclude": [],
    "mtime": 0, "atime":0, "ctime":0,
    "tag": "keep", "tag_custom": "rawimage" }
  ]

However, the only thing that gets tagged correctly is DSCF*.RAF
None of the other files specified in the rules seem to be getting tagged; all are still untagged.

Am I doing something wrong?

Permission Denied error handling

diskover needs a timeout or error checking when it is unable to access subdirectories. As it stands now, if a scan is run on a directory diskover has no access to, it seems to run indefinitely. A user-specified timeout would prevent diskover from 'hanging' or stalling due to permission issues.
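A sketch of an up-front permission check (an assumption about how this could be handled, not diskover's current behaviour): refuse to start the crawl when the root directory isn't readable rather than letting the scan hang on it.

import os

def check_access(rootdir):
    if not os.access(rootdir, os.R_OK | os.X_OK):
        raise SystemExit("No permission to read %s, aborting crawl" % rootdir)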

failed to index because Elasticsearch's long is too small

hi,

I have a NetApp serving data via CIFS - some of my files have inodes like this:
18091310060841039723
and the upper limit of 'long' is
9223372036854775807

so I would recommend changing this to something that fits better.

here is my traceback

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/rq/worker.py", line 793, in perform_job
    rv = job.perform()
  File "/usr/lib/python3.6/site-packages/rq/job.py", line 599, in perform
    self._result = self._execute()
  File "/usr/lib/python3.6/site-packages/rq/job.py", line 605, in _execute
    return self.func(*self.args, **self.kwargs)
  File "/app/diskover/diskover_bot_module.py", line 926, in scrape_tree_meta
    es_bulk_add(worker, tree_dirs, tree_files, cliargs, totalcrawltime)
  File "/app/diskover/diskover_bot_module.py", line 756, in es_bulk_add
    index_bulk_add(es, docs, config, cliargs)
  File "/app/diskover/diskover.py", line 674, in index_bulk_add
    request_timeout=config['es_timeout'])
  File "/usr/lib/python3.6/site-packages/elasticsearch5/helpers/__init__.py", line 257, in bulk
    for ok, item in streaming_bulk(client, actions, **kwargs):
  File "/usr/lib/python3.6/site-packages/elasticsearch5/helpers/__init__.py", line 192, in streaming_bulk
    raise_on_error, **kwargs)
  File "/usr/lib/python3.6/site-packages/elasticsearch5/helpers/__init__.py", line 137, in _process_bulk_chunk
    raise BulkIndexError('%i document(s) failed to index.' % len(errors), errors)
elasticsearch5.helpers.BulkIndexError: ('219 document(s) failed to index.', [{'index': {'_index': 'diskover-2018-11-09', '_type': 'directory', '_id': 'AWb5FhNAnxb8tCnaFPa3', 'status': 400, 'error': {'type': 'mapper_parsing_exception', 'reason': 'failed to parse [inode]', 'caused_by': {'type': 'json_parse_exception', 'reason': 'Numeric value (18091310060841039723) out of range of long (-9223372036854775808 - 9223372036854775807)\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@4586eede; line: 1, column: 330]'}}, ...
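One possible "something that fits better", sketched under the assumption that preserving exact values matters more than numeric range queries: even a signed 64-bit long overflows on the inode above, so mapping inode as a keyword string stores it exactly. This is an illustrative mapping, not the project's actual one, and the index name is hypothetical.

from elasticsearch5 import Elasticsearch

es = Elasticsearch()
es.indices.create(index='diskover-inode-test', body={   # hypothetical index name
    "mappings": {
        "directory": {"properties": {"inode": {"type": "keyword"}}},
        "file":      {"properties": {"inode": {"type": "keyword"}}},
    }
})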

Followed guide but Cannot launch python diskover_worker_bot.py

administrator@search-engine:~/diskover$ python diskover_worker_bot.py

 ___  _ ____ _  _ ____ _  _ ____ ____     ;
 |__> | ==== |-:_ [__]  \/  |=== |--<    ["]
 ____ ____ ____ _  _ _    ___  ____ ___ /[_]\
 |___ |--< |--| |/\| |___ |==] [__]  |   ] [ v1.5.0-rc10

 Redis RQ worker bot for diskover crawler
 Crawling all your stuff.

Traceback (most recent call last):
  File "diskover_worker_bot.py", line 1198, in <module>
    w.work()
  File "/home/administrator/.local/lib/python2.7/site-packages/rq/worker.py", line 466, in work
    self.register_birth()
  File "/home/administrator/.local/lib/python2.7/site-packages/rq/worker.py", line 273, in register_birth
    if self.connection.exists(self.key) and
  File "/home/administrator/.local/lib/python2.7/site-packages/redis/client.py", line 951, in exists
    return self.execute_command('EXISTS', name)
  File "/home/administrator/.local/lib/python2.7/site-packages/redis/client.py", line 673, in execute_command
    connection.send_command(*args)
  File "/home/administrator/.local/lib/python2.7/site-packages/redis/connection.py", line 610, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/home/administrator/.local/lib/python2.7/site-packages/redis/connection.py", line 585, in send_packed_command
    self.connect()
  File "/home/administrator/.local/lib/python2.7/site-packages/redis/connection.py", line 489, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 127.0.0.1:6379. Connection refused.
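"Error 111 ... Connection refused" on 127.0.0.1:6379 usually means redis-server isn't running or is listening on a different host/port. A quick connectivity check (a sketch, not part of diskover):

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379)
print(r.ping())  # raises redis.exceptions.ConnectionError if nothing is listening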

"Could not locate that visualization" Dupes

When opening the Diskover - Dupes dashboard it shows:

Could not locate that visualization (id: 687e37c0-3062-11e7-b2eb-1307fac824dc)

Could not locate that search (id: 1c800320-305e-11e7-b2eb-1307fac824dc)

Could not locate that visualization (id: 3ae1b180-3061-11e7-b2eb-1307fac824dc)

Terminal output looks good:

[2017-05-08 19:42:43] [status] Connecting to Elasticsearch
[2017-05-08 19:42:43] [info] Checking for ES index: diskover_dupes-2017.04.22
[2017-05-08 19:42:43] [warning] ES index exists, deleting
[2017-05-08 19:42:43] [info] Creating ES index
[2017-05-08 19:42:44] [info] Found 30 duplicates
[2017-05-08 19:42:44] [status] Finished creating duplicate files index

I deleted the old saved objects and uploaded the latest export.json to no avail. Am I missing something?

Thanks!

Inaccurate results and missing files using workers

Hi,

I used diskover previously when it was only a single process crawling. I updated to master (038a1cd) and I now find results to be extremely inaccurate; many files are missing. My home folder is reported to be 75GB when it is actually 3.5TB, and a folder inside it is reported to weigh 230GB and contain 2600 files when it is actually 2TB and contains 70k files.
In the web UI, folders that are missing files are listed as containing 1 item, and when clicked diskover reports no items found (the query becomes path_parent:\/...\/path).
Is this a known issue with the current crawling bot scheme?

Command line used: python diskover.py -d /home/ -i diskover-12042018 -a
Run log:

2018-04-12 13:34:14,943 [INFO][diskover] Checking es index: diskover-12042018
2018-04-12 13:34:14,946 [WARNING][diskover] es index exists, deleting
2018-04-12 13:34:15,583 [INFO][diskover] Creating es index
2018-04-12 13:34:16,449 [INFO][diskover] Adding disk space info to es index
2018-04-12 13:34:16,619 [INFO][diskover] Found 8 diskover RQ worker bots
2018-04-12 13:34:16,619 [INFO][diskover] Enqueueing crawl to diskover worker bots for /home...
2018-04-12 13:34:16,619 [INFO][diskover] Sending adaptive batches to worker bots
Crawling: [=====] 100% (Elapsed Time: 1:33:27, Time:  1:33:27)
2018-04-12 15:07:43,832 [INFO][diskover] Finished crawling!
2018-04-12 15:07:43,832 [INFO][diskover] Waiting for diskover bots to be done with any crawl jobs...
2018-04-12 15:11:48,650 [INFO][diskover] Getting diskover bots to calculate directory sizes...
2018-04-12 15:11:48,650 [INFO][diskover] Searching for all directory docs in diskover-12042018
2018-04-12 15:13:19,920 [INFO][diskover] Found 707085 directory docs
2018-04-12 15:13:30,117 [INFO][diskover] Directories have all been enqueued, calculating in background
2018-04-12 15:13:30,367 [INFO][diskover] Dispatcher is DONE! Sayonara!

Thank you!

Discrepancies between file system and diskover-web reporting

I am currently evaluating diskover and there are some differences between what diskover-web is reporting and what is on the file system.

  • The file system that the crawlers are indexing is 485TB; however, the index information on the dashboard-web shows Total 800TB, Used 484.65TB, Free 315.35TB, and Available 315.35 TB. It also shows in the Jumbotron that 'You could save 1.39 PB of disk space if you delete...'.
  • I ran a report and evaluated the results from a single subdirectory. The results shows duplicate listing of files where the directory only contains a single copy.
  • du is reporting the directory is 154M while diskover-web is reporting 78.01M.

I am sure you will need more information to better understand what is going on, so I hope this helps start the dialogue. Please let me know what additional information you need, or perhaps provide some insight into why we are seeing results that, to us, seem incorrect.

Thanks.

possible to ignore hardlinks

is it possible to ignore the hardlinks?

I have a massive file share with 70 million files in it.
find /path -type f
takes around 12 hours to finish

diskover will simply stall after roughly one hour at approx. 50-60 million files;
at this point it has already found 141,748 hardlinks.
The scan simply halts at "/ Crawling (Queue: 0, 1450.5 dirs/sec)" and sits there for days (I let it run the whole weekend).

I did a test with simply
diskover.py -i diskover-index-d4 -d /path/ -M 4
and it stalls with
/ Crawling (Queue: 0, 1450.5 dirs/sec) Elapsed Time: 0:01:04 (4 hardlinks so far)

I tried another share on the same fileserver (both connected via NFS) without a problem, but that share does not have any hardlinks.
None of the other scans I did before had any hardlinks.

diskover_dupes seems to have memory problems

I tested rc29 but it looks like it has memory problems

Specs: 72 cores, 92GB RAM, 8GB for Elasticsearch.
As soon as diskover starts searching for dupes (4000 in my test), it soon prints out "cannot allocate memory".

rc28 works fine though

Question about diskover and treewalk client functionality.

I wanted to ask you some questions, though, about its functionality which I can't seem to find any information on.

Firstly, my current setup is that I have some linux file servers and OS X servers.
They host different data, and are not connected in any way into a SAN or something like that.
They share the same DMZ and share out different volumes and data.

I have configured a server that hosts Elasticsearch, Redis and also diskover.
That same server also hosts the diskover-web frontend.

Now, I'm really interested in the diskover-treewalk-client.

In my scenario above, would I need to create a different index for every server that's going
to use the treewalk-client? Is it a requirement for the diskover server to also have the same mount points and data paths as the treewalk-clients have?
This is regarding the -r and -R options.

For example, server #1 has data files on /mnt/iscsivol01,
but the diskover server does not have this mountpoint or data.

diskover-treewalk-client.py -p diskover_server -t metaspider -r /mnt/iscsivol01 -R ?

So my scenario is that the diskover server does not know about or see any of those mountpoints on all these servers.
I would very much like to host my diskover server in the cloud and use the treewalk-client as much as possible.

And a final question: is it normal for diskover.py with the "-L" argument to exit after
crawling for a treewalk-client? So in theory I have to manually execute diskover.py with "-L" every time I want my treewalk-client to send data to the diskover server?

(diskover server)
Waiting for connection, listening on xxx.xxx.xxx.xxx port 9998 TCP (ctrl-c to shutdown)

(treewalk client)
python3 diskover-treewalk-client.py -p diskover_server -t metaspider -r /mnt/iscsivol01 -R /data

(diskover server after successful connection and process)
2019-04-03 13:06:52,666 [INFO][diskover] Finished calculating 6436 directory sizes in 0d:0h:00m:41s
2019-04-03 13:06:52,671 [INFO][diskover] Setting ES index settings back to defaults
2019-04-03 13:06:52,691 [INFO][diskover] Force merging ES index...
2019-04-03 13:06:52,922 [INFO][diskover] All DONE! Sayonara!
exit

Thank you kindly.
Best regards,
Svavar O - Reykjavik - Iceland

UI feedback for crawl overruns

I'm indexing a volume with 47,900 directories and 267,067 files.
The output is like this:

[2017-05-01 08:14:20] [status] Finding directories to crawl
Crawling: [100%] |████████████████████████████████████████| 4270/4270
Crawling: [100%] |████████████████████████████████████████| 7674/7674
Crawling: [100%] |████████████████████████████████████████| 7767/7767
Crawling: [100%] |████████████████████████████████████████| 9119/9119
Crawling: [100%] |████████████████████████████████████████| 9211/9211
Crawling: [100%] |████████████████████████████████████████| 9476/9476
Crawling: [100%] |████████████████████████████████████████| 10038/10038
Crawling: [100%] |████████████████████████████████████████| 10239/10239
Crawling: [100%] |████████████████████████████████████████| 10392/10392
Crawling: [100%] |████████████████████████████████████████| 10668/10668
Crawling: [100%] |████████████████████████████████████████| 10761/10761
Crawling: [100%] |████████████████████████████████████████| 10761/10761
Crawling: [100%] |████████████████████████████████████████| 10925/10925
Crawling: [100%] |████████████████████████████████████████| 10925/10925
Crawling: [100%] |████████████████████████████████████████| 11008/11008
Crawling: [100%] |████████████████████████████████████████| 11064/11064
Crawling: [100%] |████████████████████████████████████████| 11130/11130
Crawling: [100%] |████████████████████████████████████████| 11213/11213
Crawling: [100%] |████████████████████████████████████████| 11289/11289
Crawling: [100%] |████████████████████████████████████████| 11433/11433
Crawling: [100%] |████████████████████████████████████████| 11513/11513
Crawling: [100%] |████████████████████████████████████████| 16255/16255
Crawling: [100%] |████████████████████████████████████████| 17602/17602
Crawling: [100%] |████████████████████████████████████████| 18254/18254
[... repeated "Crawling: [100%]" progress lines omitted; the directory count climbs steadily toward 47,900 ...]
Crawling: [100%] |████████████████████████████████████████| 47899/47900
[2017-05-01 08:26:19] [info] Directories Crawled: 47900

Somewhere the crawl percentage wraps: the final progress line stops at 47899/47900 even though 47900 directories were reported as crawled.

Performance concern

Hi Chris,

First of all, I think this is a great project. I like the way you visualized the data.

About 15 years ago I wrote my own "du"-type program (in C), which was also multithreaded and custom-built for our studio. However, even multithreaded, on a large enough filesystem (~100TB with millions of files in thousands of directories) the code would take more than 24 hours to run.
In production, it's not acceptable to have data/analytics that are stale. You cannot make proper decisions if your disk usage data is not current.

Have you tested your code against a large filesystem? Capacity really doesn't matter; it's the number of files (and directories) that determines how long your code takes to run (and, to a lesser extent, the speed of the vendor's inode lookups). Even on a 10GB filesystem you can create a large number of (1KB-sized) files in subdirectories and let your code walk it. Please post some benchmarks, as those are critical in production environments.
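For anyone who wants to run that kind of test, a minimal sketch (the root path and counts below are placeholders, not anything shipped with diskover) that lays down a large number of 1KB files across many subdirectories:

import os

ROOT = "/tmp/diskover-benchmark"   # scratch location, adjust as needed
NUM_DIRS = 1000                    # number of subdirectories
FILES_PER_DIR = 1000               # 1KB files in each one

payload = b"x" * 1024              # 1KB of content per file

for d in range(NUM_DIRS):
    dirpath = os.path.join(ROOT, "dir%04d" % d)
    if not os.path.isdir(dirpath):
        os.makedirs(dirpath)
    for f in range(FILES_PER_DIR):
        with open(os.path.join(dirpath, "file%04d.dat" % f), "wb") as fh:
            fh.write(payload)

Crawl time should then scale with NUM_DIRS * FILES_PER_DIR (one million entries here) rather than with the bytes on disk, which is exactly the case worth benchmarking.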

Luckily the vendors are catching on (Qumulo) and the requirement for filesystem crawls should soon be a thing of the past. :)

Error when using -B (--crawlbot) option

When running diskover.py with the -B option, the following error is returned:

Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/root/diskover/diskover_crawlbot.py", line 76, in bot_thread
diskover.crawl_tree(path, cliargs, logger, reindex_dict)
File "/root/diskover/diskover.py", line 1328, in crawl_tree
wait_for_worker_bots()
File "/root/diskover/diskover.py", line 1461, in wait_for_worker_bots
logger.info('Found %s diskover RQ worker bots', len(workers))
NameError: global name 'logger' is not defined
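The traceback points at wait_for_worker_bots() using a logger global that only exists when diskover.py is run directly, not when the crawlbot thread calls into it. A minimal sketch of the usual remedy (illustrative only, not diskover's actual code; the workers argument is assumed) is to give the module its own logger instead of relying on a global set up in main():

import logging

# Module-level logger, resolved at import time, so it is available to any
# thread or module that calls into these functions.
logger = logging.getLogger("diskover")

def wait_for_worker_bots(workers):
    logger.info("Found %s diskover RQ worker bots", len(workers))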

BotError: Moving job to u'failed' queue

I have problems scanning a big NFS share; some bots are giving me these errors:

22:30:25 TransportError: TransportError(429, u'es_rejected_execution_exception', u'rejected execution of org.elasticsearch.transport.TransportService$7@5392962b on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@431b1929[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 446339]]')
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/rq/worker.py", line 793, in perform_job
rv = job.perform()
File "/usr/lib/python2.7/site-packages/rq/job.py", line 599, in perform
self._result = self._execute()
File "/usr/lib/python2.7/site-packages/rq/job.py", line 605, in _execute
return self.func(*self.args, **self.kwargs)
File "/opt/mdc/bin/diskover/diskover_bot_module.py", line 1003, in scrape_tree_meta
es_bulk_add(worker, tree_dirs, tree_files, cliargs, totalcrawltime)
File "/opt/mdc/bin/diskover/diskover_bot_module.py", line 865, in es_bulk_add
es.index(index=cliargs['index'], doc_type='worker', body=data)
File "/usr/lib/python2.7/site-packages/elasticsearch5/client/utils.py", line 73, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/lib/python2.7/site-packages/elasticsearch5/client/init.py", line 300, in index
_make_path(index, doc_type, id), params=params, body=body)
File "/usr/lib/python2.7/site-packages/elasticsearch5/transport.py", line 312, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/usr/lib/python2.7/site-packages/elasticsearch5/connection/http_urllib3.py", line 129, in perform_request
self._raise_error(response.status, raw_data)
File "/usr/lib/python2.7/site-packages/elasticsearch5/connection/base.py", line 125, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
TransportError: TransportError(429, u'es_rejected_execution_exception', u'rejected execution of org.elasticsearch.transport.TransportService$7@5392962b on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@431b1929[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 446339]]')
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/rq/worker.py", line 793, in perform_job
rv = job.perform()
File "/usr/lib/python2.7/site-packages/rq/job.py", line 599, in perform
self._result = self._execute()
File "/usr/lib/python2.7/site-packages/rq/job.py", line 605, in _execute
return self.func(*self.args, **self.kwargs)
File "/opt/mdc/bin/diskover/diskover_bot_module.py", line 1003, in scrape_tree_meta
es_bulk_add(worker, tree_dirs, tree_files, cliargs, totalcrawltime)
File "/opt/mdc/bin/diskover/diskover_bot_module.py", line 865, in es_bulk_add
es.index(index=cliargs['index'], doc_type='worker', body=data)
File "/usr/lib/python2.7/site-packages/elasticsearch5/client/utils.py", line 73, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/lib/python2.7/site-packages/elasticsearch5/client/init.py", line 300, in index
_make_path(index, doc_type, id), params=params, body=body)
File "/usr/lib/python2.7/site-packages/elasticsearch5/transport.py", line 312, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/usr/lib/python2.7/site-packages/elasticsearch5/connection/http_urllib3.py", line 129, in perform_request
self._raise_error(response.status, raw_data)
File "/usr/lib/python2.7/site-packages/elasticsearch5/connection/base.py", line 125, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
TransportError: TransportError(429, u'es_rejected_execution_exception', u'rejected execution of org.elasticsearch.transport.TransportService$7@5392962b on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@431b1929[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 446339]]')
22:30:25 Moving job to u'failed' queue
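The 429 means the Elasticsearch bulk thread pool queue (capacity 200 on this node) is full, so ES rejects further indexing requests instead of queueing them. One client-side workaround, sketched here with assumed names (this is not diskover's actual code), is to catch the rejection and retry with a backoff instead of letting the RQ job land in the failed queue:

import time
from elasticsearch import Elasticsearch, TransportError  # diskover imports these from the "elasticsearch5" package

es = Elasticsearch(["localhost:9200"])

def index_with_backoff(index, doc_type, body, retries=5):
    delay = 1
    for attempt in range(retries):
        try:
            return es.index(index=index, doc_type=doc_type, body=body)
        except TransportError as e:
            # Only retry queue-full rejections; re-raise anything else,
            # and give up after the final attempt.
            if e.status_code != 429 or attempt == retries - 1:
                raise
            time.sleep(delay)
            delay = min(delay * 2, 30)

Running fewer worker bots, or raising thread_pool.bulk.queue_size on the ES 5.x node, are the other obvious levers.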

[2017-05-09 13:03:50] [info] Found 30 duplicates --Always Reports 30--

I have run Diskover and then run it with the --dupesindex option on four of my NFS mounts, and all four report back "Found 30 duplicates". It is very unlikely that they all have exactly 30 duplicates, so I suspect a bug that limits the result to 30.

Below are the results of indexing and dupes searching two different NFS mounts.

[2017-05-09 11:53:54] [status] Connecting to Elasticsearch
[2017-05-09 11:53:54] [info] Checking for ES index: diskover-2017.04.22
[2017-05-09 11:53:54] [warning] ES index exists, deleting
[2017-05-09 11:53:55] [info] Creating ES index
Crawling: [100%] |████████████████████████████████████████| 279718/279718
[2017-05-09 12:47:18] [status] Finished crawling
[2017-05-09 12:47:18] [info] Directories Crawled: 279718
[2017-05-09 12:47:18] [info] Files Indexed: 475561
[2017-05-09 12:47:18] [info] Elapsed time: 3204.14460206
python /diskover/diskover/diskover.py --dupesindex
[2017-05-09 13:03:49] [status] Connecting to Elasticsearch
[2017-05-09 13:03:49] [info] Checking for ES index: diskover_dupes-2017.04.22
[2017-05-09 13:03:49] [warning] ES index exists, deleting
[2017-05-09 13:03:49] [info] Creating ES index
[2017-05-09 13:03:50] [info] Found 30 duplicates
[2017-05-09 13:03:50] [status] Finished creating duplicate files index

[2017-05-08 20:08:21] [status] Connecting to Elasticsearch
[2017-05-08 20:08:21] [info] Checking for ES index: diskover-2017.04.22
[2017-05-08 20:08:21] [warning] ES index exists, deleting
[2017-05-08 20:08:21] [info] Creating ES index
Crawling: [100%] |████████████████████████████████████████| 70988/70988
[2017-05-08 20:50:11] [status] Finished crawling
[2017-05-08 20:50:11] [info] Directories Crawled: 70988
[2017-05-08 20:50:11] [info] Files Indexed: 3165862
[2017-05-08 20:50:11] [info] Elapsed time: 2510.63846493
python /diskover/diskover/diskover.py --dupesindex
[2017-05-09 09:46:18] [status] Connecting to Elasticsearch
[2017-05-09 09:46:19] [info] Checking for ES index: diskover_dupes-2017.04.22
[2017-05-09 09:46:19] [warning] ES index exists, deleting
[2017-05-09 09:46:19] [info] Creating ES index
[2017-05-09 09:46:19] [info] Found 30 duplicates
[2017-05-09 09:46:19] [status] Finished creating duplicate files index
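This smells like the duplicate search returning only a single, size-limited page of hits. As an illustration only (the index name is reused from the log above; this is not diskover's actual query), helpers.scan streams every matching document through the scroll API instead of stopping at the first page:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["localhost:9200"])

# A search with a fixed size only ever returns that many hits...
capped = es.search(index="diskover-2017.04.22",
                   body={"query": {"match_all": {}}}, size=30)

# ...while helpers.scan pages through the full result set via the scroll API.
for hit in helpers.scan(es, index="diskover-2017.04.22",
                        query={"query": {"match_all": {}}}):
    pass  # hash/compare each candidate file here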

Directories containing a large number of files never complete

Directories that contain a large number of files (in our case nearly 200,000) never complete. After debugging I found a minor bug in the scandirwalk_worker function: the q_paths queue and the q_paths_results queue can both be empty while one or more scandirwalk_worker threads are still working, because such directories take so long to get through. When both queues are empty, the scandirwalk function erroneously concludes the crawl is finished and ends the program early. There needs to be some mechanism to track jobs that are still in progress inside scandirwalk_worker. I will submit a merge request with a patch that fixed this for me; a rough sketch of the idea follows.
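For reference, the shape of the fix is the classic in-flight counter: the consumer may only stop when both queues are empty AND no worker is in the middle of a directory. A rough sketch of the idea (walk_one_directory is a stand-in, not a diskover function; the real change is in the merge request mentioned above):

import os
import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

q_paths = queue.Queue()
q_paths_results = queue.Queue()
in_progress = 0
lock = threading.Lock()

def walk_one_directory(path):
    # Stand-in for the real per-directory scan.
    for name in os.listdir(path):
        yield os.path.join(path, name)

def scandirwalk_worker():
    global in_progress
    while True:
        path = q_paths.get()
        with lock:
            in_progress += 1
        try:
            for entry in walk_one_directory(path):
                q_paths_results.put(entry)
        finally:
            with lock:
                in_progress -= 1
            q_paths.task_done()

def crawl_finished():
    # Safe to stop only when nothing is queued and nothing is being processed.
    with lock:
        busy = in_progress
    return q_paths.empty() and q_paths_results.empty() and busy == 0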

ASCII / unicode exception

I got this error on a run

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/ronald/diskover/diskover.py", line 260, in processDirectoryWorker
    filelist = crawlFiles(path, DATEEPOCH, DAYS, MINSIZE, EXCLUDED_FILES, VERBOSE, DEBUG)
  File "/home/ronald/diskover/diskover.py", line 225, in crawlFiles
    filemeta = '{"filename": "%s", "extension": "%s", "path_full": "%s", "path_parent": "%s", "filesize": %s, "owner": "%s", "group": "%s", "last_modified": %s, "last_access": %s, "last_change": %s, "hardlinks": %s, "inode": %s, "indexing_date": %s}' % (name.decode('utf-8'), extension, filename_fullpath.decode('utf-8'), abspath.decode('utf-8'), size, owner, group, mtime, atime, ctime, hardlinks, inode, indextime)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8: ordinal not in range(128)

I tweaked the code to add a print in a try/except case, and I see that these are the files causing the failures:

# ls -alb 02514* 00884*
-rw-rw-rw- 1 4294967294 root 26785 Oct  9  1991 02514\ St.\ Patrick’s\ Day
-rw-rw-rw- 1 4294967294 root 7043 Mar 13  1989 00884\ 5.25\ Disk

I have no idea which character is in the second filename, but I think it's just a quote in the first one.

Files in the filesystem are archives from old Mac (HFS) backups.
Back in the day the Mac operators would name their files anything they wanted... :)
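For what it's worth, the ’ in the first filename is the multi-byte UTF-8 sequence 0xe2 0x80 0x99. The decoded name becomes unicode while other fields (the extension, for example) are still plain byte strings, and in Python 2 interpolating bytes and unicode together forces an implicit ASCII decode, which is presumably where the exception comes from. A defensive decode, shown here only as an illustration and not as diskover's actual fix, keeps the crawl alive when a value is not valid UTF-8:

def to_unicode(s):
    # Decode byte strings defensively; replacement characters mark any bytes
    # that are not valid UTF-8 instead of aborting the crawl thread.
    if isinstance(s, bytes):
        try:
            return s.decode('utf-8')
        except UnicodeDecodeError:
            return s.decode('utf-8', 'replace')
    return s

name = to_unicode(b'02514 St. Patrick\xe2\x80\x99s Day')   # decodes cleanly
weird = to_unicode(b'\x8e not valid utf-8')                # no exception, bad byte replaced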

The directions for this are awful

Tons of missing dependencies that aren't even in the documentation, and specific versions that aren't listed.

I'd ask why you don't just make an all-in-one container, but it seems like you'd rather sell it.
