provenance-analysis's People

Contributors

samduy

provenance-analysis's Issues

Improve search on GitHub so that nothing is missed

Maybe combine file search with repo-name search?
In some cases file search does not return the correct repository, while repo-name search (by package name) can return the exact match.

e.g. in the case of CMSmap:

  • File search:
filename:multipartpost.py path:thirdparty/multipart
...

does not return the result we want.

(The file paths are chosen as the longest ones in the directory, but they are still very common, because many repositories include the same files.)

https://github.com/Dionach/CMSmap
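
One possible way to combine the two searches is sketched below. It assumes each search has already produced a list of "owner/repo" names without duplicates; the function name and the ranking (matches found by both searches first, then name-only, then file-only) are illustrative assumptions, not the project's current behaviour.

```python
def combine_candidates(file_hits, name_hits):
    """Merge two lists of 'owner/repo' names into one ranked list.

    Repositories found by both searches come first, then name-only hits,
    then file-only hits. Order inside each group follows the input order.
    """
    file_set, name_set = set(file_hits), set(name_hits)
    both = [r for r in name_hits if r in file_set]
    name_only = [r for r in name_hits if r not in file_set]
    file_only = [r for r in file_hits if r not in name_set]
    return both + name_only + file_only


# Hypothetical search results; only Dionach/CMSmap comes from the issue text.
ranked = combine_candidates(
    ["someone/other-tool", "Dionach/CMSmap"],
    ["Dionach/CMSmap", "another/CMSmap"],
)
```

Name-only hits are ranked above file-only hits here because the issue reports that shared third-party files make file search the noisier signal; that ordering may need tuning.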

Detect sub-packages inside a package

Package A includes some other packages B and C from other developers as its sub-directories.

/path/to/packageA/files
/path/to/packageA/packageB
/path/to/packageA/packageB/files
/path/to/packageA/packageC
/path/to/packageA/packageC/files
...

E.g.

Veil-Evasion
gems

Currently, only package A is detected, and the whole path/to/packageA is treated as one package.
That is correct, but it would be better if packages B and C were also detected and checked against their latest versions.

Reason: the developer of package A may be slow to update it when B or C is updated, which gives an attacker a window to attack package A once security bugs in B or C have been published.
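
A minimal sketch of one way to spot bundled sub-packages: walk the package directory and report every sub-directory that contains a file typically found at a package root. The marker file list and the function name are assumptions made for illustration; bundled code without such a file would be missed.

```python
import os

# Assumed marker files that usually sit at the root of a package.
MARKERS = {"setup.py", "package.json", "Gemfile", "composer.json"}


def find_subpackages(package_root):
    """Return sub-directories of package_root that contain a marker file."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(package_root):
        # Skip the root itself: package A is already detected.
        if dirpath != package_root and MARKERS.intersection(filenames):
            found.append(dirpath)
    return sorted(found)
```

Each returned directory could then go through the same version check as a top-level package.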

Better extraction of program directory name

Currently, there is no effective way to automatically identify which sub-directory level of the scanned DIR is the package folder.
We assume that the directory of each package is 2 levels below DIR.
For example, when we scan directory '/', the extracted package directories can be:

/usr/share/program1
/usr/share/program2

But this also misrecognizes sub-directories of other packages as packages, such as:

/opt/program3/sub-dir1
/opt/program3/sub-dir2

The following should be the proper extraction:

/usr/share/program1
/usr/share/program2
/opt/program3
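
One simple workaround, sketched here under stated assumptions, is a small table that says how many levels below each well-known root a package directory sits, with the current fixed-depth rule as the fallback. The table contents and the function name are hypothetical, and the scanned DIR is assumed to be '/'.

```python
# Assumed rules: number of path components below each root that form
# the package directory.
DEPTH_BY_ROOT = {"/usr/share": 1, "/opt": 1}


def package_dir(path):
    """Cut an absolute file path down to its assumed package directory."""
    for root, depth in DEPTH_BY_ROOT.items():
        if path.startswith(root + "/"):
            rest = path[len(root) + 1:].split("/")
            return root + "/" + "/".join(rest[:depth])
    # Fallback mirrors the current assumption: /x/y/<package>
    return "/" + "/".join(path.strip("/").split("/")[:3])
```

With this table, /opt/program3/sub-dir1/a.py maps to /opt/program3 instead of /opt/program3/sub-dir1. A static table does not solve the general problem; it only covers roots whose layout is known in advance.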

Make a consolidated result

Expected output result:

Manual

Path               Name      Source  Updated  Active  Local version  Latest version
/path/to/package1  Package1  Github  Y        Y       1.0.2          1.0.2
/path/to/package2  Package2  Pip     N        Y       0.0.2          1.0.5

APT

Path               Name      Updated  Local version  Latest version
/path/to/package3  Package3  Y        1.0.2          1.0.2
/path/to/package4  Package4  N        0.0.2          1.0.5

PIP

Path               Name      Updated  Local version  Latest version
/path/to/package5  Package5  Y        1.0.2          1.0.2
/path/to/package6  Package6  N        0.0.2          1.0.5

Resume from the last point it was interrupted

Some steps take a long time to finish, and sometimes the run is interrupted in the middle.
When it starts again, it should be able to resume from the point where it was forced to quit.
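
A minimal sketch of a checkpoint-based resume, assuming a step can be expressed as "call process() on each item of a list" and that items are strings. The function name and the JSON state file are illustrative assumptions; the existing Makefile-based steps may need a different mechanism.

```python
import json
import os


def run_with_checkpoint(items, process, state_file):
    """Process items in order, recording progress so a rerun skips finished ones."""
    done = set()
    if os.path.exists(state_file):
        with open(state_file) as fh:
            done = set(json.load(fh))
    for item in items:
        if item in done:
            continue
        process(item)
        done.add(item)
        # Rewrite the checkpoint after every item, so an interruption
        # only costs the item that was in progress.
        with open(state_file, "w") as fh:
            json.dump(sorted(done), fh)
```

Rewriting the whole file after every item is simple but not atomic; writing to a temporary file and renaming it would be safer if the process can be killed mid-write.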

[interesting.list] Filtering does not work well if files use softlinks

On some machines soft links are used, so two paths point to the same file.
e.g.
In the list of files installed by PIP (pip_sorted.list):

/usr/lib/python2.7/dist-packages/olefile/olefile.py
/usr/lib/python2.7/dist-packages/olefile/__init__.pyc
/usr/lib/python2.7/dist-packages/olefile/olefile.pyc

In the all_files.list:

/lib/live/mount/persistence/mmcblk0p3/rw/usr/lib/python2.7/dist-packages/olefile/__init__.py
/lib/live/mount/persistence/mmcblk0p3/rw/usr/lib/python2.7/dist-packages/olefile/olefile.py
/lib/live/mount/persistence/mmcblk0p3/rw/usr/lib/python2.7/dist-packages/olefile/olefile.pyc

As a result, the two lists cannot cancel each other out (which they should).
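
One way to make the two lists comparable is to resolve every path to its real location before filtering. The sketch below uses os.path.realpath for that; the function name is an assumption, and the code must run on the machine where the links exist.

```python
import os


def subtract_known(all_files, known_files):
    """Return entries of all_files whose real path is not among known_files."""
    known = {os.path.realpath(p) for p in known_files}
    return [p for p in all_files if os.path.realpath(p) not in known]
```

Here all_files would be the lines of all_files.list and known_files the lines of pip_sorted.list.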

[Algorithm-A] Cannot detect the repository if the directory name is not identical with repo name

[Version 0.5]
The current Algorithm-A matches both the content (it searches for 3 files) and the name of the local directory against the online GitHub repository. It returns a result only if both match (the files exist and the repo name is identical).

However, in many cases the local directory name, such as:

.../dnsruby-09c3890ccfae

differs from the online repo name: dnsruby.

The algorithm should be improved so that it can also detect the corresponding GitHub repo in such cases:

https://github.com/alexdalitz/dnsruby
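
A minimal sketch of one possible improvement: before comparing names, also try the directory name with a trailing commit id or version number removed. The suffix patterns and the function name are assumptions; a legitimate repo name that happens to end in a hex-looking word would be stripped too, which is why the original name stays in the candidate list.

```python
import re

# Assumed suffix shapes: a hex commit id (7 to 40 chars) or a dotted version.
_SUFFIX = re.compile(r"[-_](?:[0-9a-f]{7,40}|v?\d+(?:\.\d+)+)$", re.IGNORECASE)


def repo_name_candidates(dirname):
    """Return the directory name plus a variant without its trailing suffix."""
    names = [dirname]
    stripped = _SUFFIX.sub("", dirname)
    if stripped and stripped != dirname:
        names.append(stripped)
    return names
```

For the directory above this yields both dnsruby-09c3890ccfae and dnsruby, so the name check can succeed on the second candidate.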

Better information extraction of a local package directory

When?

$ make programs_info.dat

Current issue:

  • Only one file, chosen to represent the directory, is checked.
    (That file is chosen by one criterion: it has the longest path.)

What can be improved:

  • Some of the extracted information about the directory should be based on the vast majority of the files and sub-folders inside it.
  • Other information should be based on one particular file.
  • One idea (still to be verified): the creation date of the directory (which can be taken as the installation date of the package) cannot be extracted directly on Linux, since it is not supported yet, but it may be the same as the modification date of the vast majority of files and sub-folders inside it.
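
The last idea can be sketched as follows: collect the modification date of every file below the directory and take the most common one as the probable installation date. This is only an illustration of the unverified idea; the function name is an assumption, and sub-folders are left out for brevity.

```python
import os
import time
from collections import Counter


def majority_mtime_date(directory):
    """Return the most common modification date (YYYY-MM-DD) of files below directory."""
    dates = Counter()
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in filenames:
            # lstat so that a soft link reports its own date, not its target's.
            mtime = os.lstat(os.path.join(dirpath, name)).st_mtime
            dates[time.strftime("%Y-%m-%d", time.localtime(mtime))] += 1
    return dates.most_common(1)[0][0] if dates else None
```

The function returns None for a directory without files.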

Integrate APT, PIP results

Each of those tools can output its own result. It is better to integrate them into our final result as well.

  1. APT
$ apt list --installed

List only outdated packages:

$ apt list --installed | sed -nr 's_(.*)/(.*) (.*) (.*) (.*)upgradable to: (.*)]_\1,\3,\4,\6_p'

Implemented in: apt_check.sh

  2. PIP
$ pip list -o
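
If the APT output is to be merged into a Python-generated report, the sed expression above can be mirrored with the same regular expression in Python. This is a sketch, not a replacement for apt_check.sh; the function name is an assumption and the sample package in the usage line is made up.

```python
import re

# Same groups as the sed expression: name/suite version arch [...upgradable to: new]
_UPGRADABLE = re.compile(r"^(.*)/(.*) (.*) (.*) (.*)upgradable to: (.*)\]$")


def parse_upgradable(lines):
    """Return (name, local_version, arch, latest_version) for outdated packages."""
    rows = []
    for line in lines:
        match = _UPGRADABLE.match(line.strip())
        if match:
            rows.append((match.group(1), match.group(3), match.group(4), match.group(6)))
    return rows


# Hypothetical input line in the format the sed expression expects.
rows = parse_upgradable(["libfoo/stable 1.0.2 amd64 [installed,upgradable to: 1.0.5]"])
```

Lines for packages that are already up to date do not contain "upgradable to:" and are skipped, as with sed -n.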

[Algorithm-D]: Improve accuracy of package detection

The current algorithm for package directory detection is not very accurate. It misses packages that were installed on the system:

  • On the same day.
  • In the same (parent) directory.

This is because the current algorithm is based on the modification date only.

e.g.

/path/to/directory-A/package-B
/path/to/directory-A/package-C

If both package-B and package-C were installed on the same day, it misrecognizes directory-A as a package (which it is not) instead of B and C.

github_latest.py: UnicodeEncodeError: 'ascii' codec can't encode characters

Traceback (most recent call last):
  File "./github_latest.py", line 103, in <module>
    print result
UnicodeEncodeError: 'ascii' codec can't encode characters in position 61-62: ordinal not in range(128)

When?

$ make internet_info.dat

First analysis:

  • It happens when the returned results contain Chinese characters.
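
A common way around this class of error is to encode the text explicitly instead of relying on the default codec of the output stream. The helper below is a sketch with an assumed name; it writes UTF-8 bytes to a binary stream (sys.stdout.buffer on Python 3, while on Python 2 sys.stdout accepts byte strings directly).

```python
import io


def write_line_utf8(stream, text):
    """Write text plus a newline to a binary stream as UTF-8, whatever the locale."""
    stream.write((text + u"\n").encode("utf-8"))


# Demonstration with an in-memory stream; the text is a made-up result line.
buf = io.BytesIO()
write_line_utf8(buf, u"owner/repo \u4e2d\u6587")
```

Setting the environment variable PYTHONIOENCODING=utf-8 before running make is another option that needs no code change.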

Identify sufficiently unique-looking paths to search for

Given a set of files (e.g., an archive or a directory) find the ones with sufficiently unique-looking paths and search them on GitHub.

The current approach, choosing the 3 longest paths in the package directory, does not seem very effective, since it picks some very common names.
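
One possible heuristic, sketched below, is to prefer files whose names are rare across everything found on the machine, and to use path length only as a tie-breaker. The function name and the use of a local file list as the frequency source are assumptions; a name that is rare locally can still be common on GitHub.

```python
import os
from collections import Counter


def pick_distinctive_paths(package_files, all_files, count=3):
    """Pick the package files whose basenames are rarest across all_files."""
    freq = Counter(os.path.basename(p) for p in all_files)
    # Rarest basename first; a longer path breaks ties, as in the current approach.
    ranked = sorted(package_files, key=lambda p: (freq[os.path.basename(p)], -len(p), p))
    return ranked[:count]
```

With this ranking, a file such as __init__.py, which appears in almost every Python package, drops to the bottom of the list regardless of how long its path is.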

Mismatch between pip package name and directory name

Some PIP packages do not have a directory with the same name, which causes errors when searching for their installed files.

find: ‘/usr/lib/python2.7/dist-packages/backports-abc*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/backports.shutil-get-terminal-size*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/backports.ssl-match-hostname*’: No such file or directory
find: ‘/usr/local/lib/python2.7/dist-packages/CouchDB-1.0-py2.7.egg/CouchDB*’: Not a directory
find: ‘/usr/lib/python2.7/dist-packages/file-magic*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/fuse-python*’: No such file or directory
find: ‘2.1,/GeoIP*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/guess-language-spirit*’: No such file or directory
find: ‘/usr/local/lib/python2.7/dist-packages/ipcalc-1.1.3-py2.7.egg/ipcalc*’: Not a directory
find: ‘/usr/lib/python2.7/dist-packages/ipython-genutils*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/msgpack-python*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/ndg-httpsclient*’: No such file or directory
find: ‘/usr/local/lib/python2.7/dist-packages/NoSQLMap-0.5-py2.7.egg/NoSQLMap*’: Not a directory
find: ‘/usr/local/lib/python2.7/dist-packages/oauthlib-1.1.2-py2.7.egg/oauthlib*’: Not a directory
find: ‘/usr/local/lib/python2.7/dist-packages/pbkdf2-1.3-py2.7.egg/pbkdf2*’: Not a directory
find: ‘/usr/lib/python2.7/dist-packages/prompt-toolkit*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/pyasn1-modules*’: No such file or directory
find: ‘/usr/local/lib/python2.7/dist-packages/pymongo-2.7.2-py2.7-linux-x86_64.egg/pymongo*’: Not a directory
find: ‘/usr/lib/python2.7/dist-packages/pysnmp-apps*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/pysnmp-mibs*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/python-apt*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/python-dateutil*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/python-debian*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/python-debianbts*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/python-ntlm*’: No such file or directory
find: ‘/usr/local/lib/python2.7/dist-packages/requests_oauthlib-0.6.2-py2.7.egg/requests-oauthlib*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/requests-toolbelt*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/service-identity*’: No such file or directory
find: ‘/usr/local/lib/python2.7/dist-packages/tweepy-3.6.0-py2.7.egg/tweepy*’: Not a directory
find: ‘/usr/local/lib/python2.7/dist-packages/WordHound-0.1-py2.7.egg/WordHound*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/wxPython-common*’: No such file or directory
find: ‘/usr/lib/python2.7/dist-packages/yara-python*’: No such file or directory

Full log in here
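
Many of the failures above follow a few naming patterns: dashes become underscores, a "python-" prefix or "-python" suffix is dropped, and dotted names map to nested directories. A minimal sketch that generates such candidate names is shown below; the function name and the rule set are assumptions and will not cover every package.

```python
def directory_candidates(pip_name):
    """Return directory or module names a pip distribution might be installed under."""
    base = pip_name.lower()
    variants = {pip_name, base, base.replace("-", "_")}
    if base.startswith("python-"):
        variants.add(base[len("python-"):])
    if base.endswith("-python"):
        variants.add(base[: -len("-python")])
    # Namespace packages such as backports.* live in nested directories.
    variants.update(v.replace(".", "/") for v in list(variants))
    return sorted(variants)
```

A more reliable source than guessing is the file list that pip itself records, for example via `pip show -f <name>`.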

[report.xml] Visualization: add graph to report

It would be better to show graphs (pie charts, for instance) to give the user a visual overview of the machine's current status (e.g. what percentage of programs are outdated/updated).

[Test-bug] Error when committed_date is empty

[BUG: when running on Eurecom machine]

In some cases, the returned committed date is empty, which causes an error:

Traceback (most recent call last):
  File "./report.py", line 64, in <module>
    latest_datetime = datetime.strptime(item['committed_date'], DATETIME_FORMAT_IN)
KeyError: 'committed_date'
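
A defensive version of the failing line could look like the sketch below. The value of DATETIME_FORMAT_IN is an assumption, since the real constant in report.py is not shown in this issue, and the helper name is hypothetical.

```python
from datetime import datetime

# Assumed value; the real DATETIME_FORMAT_IN in report.py is not shown here.
DATETIME_FORMAT_IN = "%Y-%m-%dT%H:%M:%SZ"


def parse_committed_date(item):
    """Return the commit datetime, or None when the field is missing or empty."""
    raw = item.get("committed_date")
    if not raw:
        return None
    return datetime.strptime(raw, DATETIME_FORMAT_IN)
```

The caller then has to decide how to report a package whose latest commit date is unknown, for example by leaving the Updated column empty.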

[report.xml] Add summary info

Add summary information to the result file,
such as: how many packages are up to date, how many are still under active development, how many are outdated, and so on.
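
The counts could be computed along these lines. The record layout (a dict per package with boolean 'updated' and 'active' fields) and the function name are assumptions made for this sketch.

```python
from collections import Counter


def summarize(packages):
    """Count packages by status; each package is a dict with 'updated' and 'active'."""
    counts = Counter(total=len(packages))
    for pkg in packages:
        counts["updated" if pkg["updated"] else "outdated"] += 1
        if pkg.get("active"):
            counts["active"] += 1
    return dict(counts)
```

The resulting dictionary can be written at the top of the report or reused as input for charts.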

[Report] Export report to HTML or XML for better view

The current report type (.txt) has some limitations: it displays only minimal information and does not support user interaction.
It is better to export the result data to XML (or JSON) so that more information can be stored. If necessary, the user can view it in a web browser (as HTML, possibly transformed with CSS or XSL) and click on each item for details.
