deltacode's Introduction

Deltacode

DeltaCode is a simple command line utility that leverages the power of scancode-toolkit to determine file-level differences between two codebases.

During a typical software release cycle, development teams and software compliance experts want insight into how a codebase has changed during each release iteration. Specifically, these users need a utility that can point out places in a codebase where material license and other provenance changes have occurred. This is where DeltaCode comes in.

DeltaCode provides an accurate means of comparing two ScanCode result files and returning any changes that have occurred between the two scanned codebases. DeltaCode can currently detect file size and license changes, as well as files that have been moved to new locations.

We are continuously working on new features, such as detecting copyright changes and detecting package version changes.

Build and tests status

We run tests on each commit on multiple CIs to ensure good compatibility across multiple versions of Windows, Linux and macOS.

Documentation

The DeltaCode documentation is hosted at deltacode.readthedocs.io.

Installation

Before installing DeltaCode, make sure that you have installed the prerequisites properly. This means installing Python 3.8 for x86/64 architectures. We support Python 3.8, 3.9 and 3.10.

See prerequisites for detailed information on the supported platforms and Python versions.

There are a few common ways to install DeltaCode.

Quick Start

Run this command to display the command help:

deltacode --help

Run a sample delta:

deltacode -n samples/samples.json -o samples/samples.json

Run a simple delta saved to the output.json file:

deltacode -n samples/samples.json -o samples/samples.json -j output.json

Then open output.json to view the delta results.

To get DeltaCode results for your codebase, install scancode-toolkit and generate a scan for each of the codebases you wish to 'Delta'.

Support

If you have a problem or a suggestion, or have found a bug, please enter a ticket at: https://github.com/nexB/deltacode/issues

For discussions and chats, we have:

  • an official Gitter channel for web-based chats. Gitter is also accessible via an IRC bridge. There are other AboutCode project-specific channels available there too.
  • an official #aboutcode IRC channel on Libera.Chat (server web.libera.chat). This channel receives build and commit notifications and can be noisy. You can use your favorite IRC client or use the web chat.

Source code

License

  • Apache-2.0 with an acknowledgement required to accompany the delta output.

See the NOTICE file and the .ABOUT files that document the origin and license of the third-party code used in DeltaCode for more details.

deltacode's People

Contributors

agustinhenze, arijitde92, arnav-mandal1234, ayansinhamahapatra, chinyeungli, hritik14, johnmhoran, jonoyang, keshav-space, mjherzog, pombredanne, pratikrocks, purna135, steven-esser, swastkk, tg1999

deltacode's Issues

empty path strings in outputs appear after some alignments

When running

deltacode -n tests/data/deltacode/ecos-failed-counts-assertion-new.json -o tests/data/deltacode/ecos-failed-counts-assertion-old.json -c ~/test.csv

Many modified or license change Deltas show an empty path value in our output. I believe this happens as a result of align_scan(), but I could be wrong. I will investigate further and post more details.

increment counter in determine_delta()

Instead of decrementing a counter in determine_delta(), we should increment it and check it against files_count at the end, instead of against 0.

We should also add some error message when the assertion fails.
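A minimal sketch of what this could look like, assuming a stand-alone helper. The name count_and_verify and its arguments are hypothetical, and the real determine_delta() tracks more state than this:

```python
def count_and_verify(deltas, expected_count):
    """Count deltas upward and fail with a descriptive message on mismatch.

    Hypothetical helper: not part of the actual DeltaCode code.
    """
    counted = 0
    for _ in deltas:
        counted += 1  # increment instead of decrementing toward 0

    if counted != expected_count:
        # An explicit message makes a failed check easier to diagnose
        # than a bare AssertionError.
        raise AssertionError(
            "Delta count mismatch: counted {} but expected {}".format(
                counted, expected_count
            )
        )
    return counted
```

Raising with a message keeps the failure visible in the traceback, so a user can see both numbers without re-running under a debugger.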

Failing case if extra directory is added

I did some simple tests and here are my findings:

I used "balloontip-1.1.1.jar" as a sample file.

  1. Created 2 directories, d1/ and d2/, put the test file in each, and then compared the 2 directories. The output is unchanged, which is correct.

  2. Same setup as (1), but created a new subdirectory named test/ under d1/ and put balloontip-1.1.1.jar in it.
    Both
    d1/balloontip-1.1.1.jar
    d1/test/balloontip-1.1.1.jar
    are returned as added, and d2/balloontip-1.1.1.jar is returned as removed.

    This is not correct: d1/balloontip-1.1.1.jar and d2/balloontip-1.1.1.jar should be returned as unchanged, while only d1/test/balloontip-1.1.1.jar should be considered added.

  3. Same setup as (1), but created a new root/ directory, put d1/ in it, and ran deltacode from root/ to d2/. The output is unchanged, which is correct.

AssertionError when running DeltaCode on eCos scans

I ran ScanCode with the following options (-clipeu) on version 2.0 of eCos and the latest HEAD of the eCos CVS repo. Afterwards, I ran DeltaCode on the report files and got the following issue:

$ deltacode -n ecos-head.json -o ~/Desktop/ecos-2.0-linux.json -c delta.csv
Traceback (most recent call last):
  File "/home/jono/nexb/tools/develop/deltacode/bin/deltacode", line 11, in <module>
    load_entry_point('deltacode', 'console_scripts', 'deltacode')()
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jono/nexb/tools/develop/deltacode/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/jono/nexb/tools/develop/deltacode/src/deltacode/cli.py", line 80, in cli
    delta = DeltaCode(new, old)
  File "/home/jono/nexb/tools/develop/deltacode/src/deltacode/__init__.py", line 23, in __init__
    self.deltas = self.determine_delta()
  File "/home/jono/nexb/tools/develop/deltacode/src/deltacode/__init__.py", line 103, in determine_delta
    assert len(deltas) == ((self.new.files_count - new_nonfiles) + (self.old.files_count - old_nonfiles) - modified - unchanged)
AssertionError

Attached are the input files I used to get this error:
ecos-scans.zip

use license scan information to further determine modified.

@mjherzog brought this up in a recent call.

For the 'modified' set of files we should look at the license scan information to further distinguish the type of modification.

For example, two files can effectively be marked unmodified if the license scan information (i.e. license key and/or expression) is the same for both File objects. This type of check is related to #3 as well.
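As a rough illustration of that check, the sketch below compares the license keys of two files. It assumes each file's licenses are available as a list of dicts with a "key" entry; that shape and the function name are assumptions, not the actual File object API:

```python
def same_license_info(new_licenses, old_licenses):
    """Return True when both files carry the same set of license keys.

    Assumes each license is a dict with a "key" entry (hypothetical shape).
    Order and duplicate detections are ignored on purpose.
    """
    new_keys = {lic["key"] for lic in new_licenses}
    old_keys = {lic["key"] for lic in old_licenses}
    return new_keys == old_keys
```

Comparing sets rather than lists means a file whose detections were merely reordered between scans would still count as having the same license information.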

Add filetype info (or similar) to the output.

Along with type and path, we should add filetype (or something similar) to ease filtering of results. Often we do not care about config files or Makefiles when analyzing a codebase.

Migrate to using json-to-csv script in lieu of separate formatting option

Instead of maintaining two different 'branches' of output, we should default to json only.

We will still want a way to view the results in csv form, but this can be moved to a separate script that only takes a deltacode json output and converts it to csv.

UX should not change: -c output should still output a csv file. Only the internals change: we would call our json2csv script instead of write_csv.
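A minimal sketch of such a converter, using only the standard library. The delta layout (a "score" plus optional "new" and "old" dicts with a "path") and the column names are assumptions made for illustration, not the actual DeltaCode JSON schema:

```python
import csv
import io


def deltas_to_csv(deltas):
    """Convert a list of delta dicts to CSV text.

    The field names used here are hypothetical placeholders.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["score", "new_path", "old_path"])
    for delta in deltas:
        # Either side may be missing (added or removed files).
        new = delta.get("new") or {}
        old = delta.get("old") or {}
        writer.writerow(
            [delta.get("score", ""), new.get("path", ""), old.get("path", "")]
        )
    return buffer.getvalue()
```

With a converter like this, the -c option would only need to load the JSON results and write the returned text to the requested file.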

Identify dupes

We have means of (crudely) identifying moved files. However, this only looks at the cases where there is a single file per sha1 value.

We need a way of handling files that appear multiple times on either or both sides of the scans.

This will need to be broken into smaller tickets
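One possible starting point, sketched under the assumption that each scanned file is a dict with "sha1" and "path" entries (the function names are hypothetical), is to group paths by sha1 and keep the groups with more than one member:

```python
from collections import defaultdict


def index_by_sha1(files):
    """Map each sha1 value to the list of paths that share it."""
    index = defaultdict(list)
    for f in files:
        index[f["sha1"]].append(f["path"])
    return index


def find_duplicates(files):
    """Return only the sha1 groups that contain more than one path."""
    return {
        sha1: paths
        for sha1, paths in index_by_sha1(files).items()
        if len(paths) > 1
    }
```

Running this on each side of the comparison would show which sha1 values fall outside the single-file-per-sha1 case.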

Add in additional file-level info for json results

  1. Update License object by removing unnecessary fields.
    a) update License.to_dict() as well
  2. Update License object and License.to_dict() tests
  3. Update Delta object and Delta.to_dict() tests
  4. Update remaining json-based tests (if needed)

Simplify deltacode output

For the csv output, it would be a better presentation if we simply included a single path value, instead of empty or repeated paths that are redundant.

Adjust License Diff to account for License Category

Auditors care more about Copyleft and Proprietary licenses showing up in a codebase. We need to adjust our scoring so that more emphasis is given to:

  • 'no license' -> 'copyleft limited or higher'
  • 'permissive' -> 'copyleft limited or higher'
  • 'anything' -> 'proprietary/commercial'

There are probably other combinations here as well.
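The transitions listed above could be expressed as a small lookup, sketched here. The category names and the weights 30 and 40 are placeholders chosen for illustration, not values taken from DeltaCode:

```python
# Assumed category names; real scan data may spell these differently.
COPYLEFT_OR_HIGHER = {"Copyleft Limited", "Copyleft"}
PROPRIETARY = {"Proprietary", "Commercial"}


def category_score(old_category, new_category):
    """Return extra score for a license category change (old -> new).

    None stands for 'no license'. The weights are hypothetical.
    """
    # 'anything' -> 'proprietary/commercial'
    if new_category in PROPRIETARY and old_category not in PROPRIETARY:
        return 40
    # 'no license' or 'permissive' -> 'copyleft limited or higher'
    if new_category in COPYLEFT_OR_HIGHER and old_category in (None, "Permissive"):
        return 30
    return 0
```

Keeping the rules in one function makes it easy to add the other combinations later without touching the callers.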

Add end-to-end tests

We need end-to-end tests for both our outputs in various scenarios. This probably can be addressed along with #37, as the end-to-end could be incorporated via mock cli calls.

This ticket stems out of a few minor issues I've run into after we do refactoring. Having end-to-end tests in place makes sure that as we refactor and add features, cli workflows are not broken by our changes.

Flatten DeltaCode.deltas field

Once we have the basics of scoring (#52), we can move on to flattening the deltas field of DeltaCode. This means moving from a dictionary of lists to just a single list of Delta objects.

Prior to output or perhaps prior to assignment, we will want to sort this list by Delta.score to preserve order and/or cull entries we do not need (if a user specifies --all for instance)
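A minimal sketch of the flattening and sorting step, assuming for illustration that each delta is a dict with a "score" entry (the real Delta objects would expose the score as an attribute):

```python
from itertools import chain


def flatten_deltas(deltas_by_category):
    """Turn {'added': [...], 'removed': [...]} into one list sorted by score.

    Highest scores come first; sorted() is stable, so deltas with equal
    scores keep their original relative order.
    """
    flat = chain.from_iterable(deltas_by_category.values())
    return sorted(flat, key=lambda delta: delta["score"], reverse=True)
```

Culling could then be a simple filter over the sorted list, for example dropping deltas with a score of 0 unless --all was given.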

Determine Score implementation

related to: #3

For now, we should simply add small values to our score at the different steps. Later on we can think about subtracting for lack of info (license or otherwise), for Permissive licenses, etc.

Check and determine ScanCode options present in a ScanCode data file

In order to compare codebases accurately, Deltacode needs scancode data files that have the full file information available, at the very least.

We also need to know what additional scancode options each Deltacode input file has in order to figure out what other scan data is present and therefore what type of license, copyright or other data we can compare.
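A hedged sketch of such a check follows. The key names it reads ("headers", "options", "scancode_options") are assumptions about how ScanCode records its options and differ between ScanCode versions, so they should be verified against real scan files:

```python
def scan_options(scan):
    """Return the options recorded in a loaded ScanCode JSON document.

    Assumption: newer scans keep options under headers[0]["options"] and
    older scans under a top-level "scancode_options" key.
    """
    headers = scan.get("headers") or []
    if headers and isinstance(headers[0], dict):
        return dict(headers[0].get("options") or {})
    return dict(scan.get("scancode_options") or {})


def has_option(scan, *names):
    """Return True if any of the given option names is present and truthy."""
    options = scan_options(scan)
    return any(options.get(name) for name in names)
```

A caller could refuse to run when has_option(scan, "--info") is false, and enable license or copyright comparison only when the matching options are present in both inputs.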

implement release process

We copy the release.sh from scancode-toolkit

  • remove references to scancode
  • replace with deltacode

Add deltas_count field

We should have a deltas_count field in JSON output, similar to scancode-toolkit

May need some discussion as to how to handle the -a option

Add factors to Delta Object

Add a 'factors' field in lieu of 'category'.

The goal here is to append various 'factors' as we run our codebase through the various DeltaCode steps (determine_delta, moved, etc.), primarily so we do not have to adjust category often.

If there are relevant factors that go into a particular File pair in a Delta object (a license addition or change, for instance), then that information will simply be appended to this 'factors' list.

When we go to output, all the Deltas will be sorted in descending order by score and we can simply dump the contents of a Delta's 'factors' field into a cell (in the csv case).

Collect/calculate additional statistics

From @MaJuRG We can get basic counts currently in deltacode; we need to expand that to additional calculations like % added, removed, etc, % changed, perhaps some sort of codebase 'drift' calculation that incorporates a number of different stats.
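The percentage part could be sketched as below. The function name and the shape of the counts dict are hypothetical, and a 'drift' metric is left out because the issue does not define it:

```python
def percent_stats(counts, total):
    """Convert raw category counts into percentages of the total.

    Returns 0.0 for every category when total is 0 to avoid dividing by zero.
    """
    if total == 0:
        return {category: 0.0 for category in counts}
    return {
        category: round(100.0 * count / total, 2)
        for category, count in counts.items()
    }
```

Which total to divide by (new file count, old file count, or their union) is a design decision the issue leaves open.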

Only pass score during Delta creation.

Change to Delta(score, new_file, old_file) in all the places where a Delta is created.

This involves removing 'category' from the Delta object.

Our initial scores should be simple for now: an Added should add 100 to a score (which is 0 by default). A Removed will simply not add any value to a score. In both cases the factor 'added' or 'removed' will also be appended to Delta.factors.

During license_diff and other DeltaCode steps, we will simply add values to the score as things are found out about the Delta.
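A minimal sketch of a Delta shaped this way. It is an illustration of the proposal, not the actual class; the helper names make_added and make_removed are hypothetical:

```python
class Delta:
    """Illustrative Delta carrying a score and a list of factors."""

    def __init__(self, score=0, new_file=None, old_file=None):
        self.score = score
        self.new_file = new_file
        self.old_file = old_file
        self.factors = []


def make_added(new_file):
    """An added file starts at 100 and records the 'added' factor."""
    delta = Delta(100, new_file, None)
    delta.factors.append("added")
    return delta


def make_removed(old_file):
    """A removed file adds nothing to the score but records 'removed'."""
    delta = Delta(0, None, old_file)
    delta.factors.append("removed")
    return delta
```

Later steps such as a license diff would then do delta.score += value and append their own factor, without needing a category field.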

Handle 'moved' files

This will need some thinking, but we will want some way to tell if a file has been 'moved' between the new and old scans of some codebase.

This means the sha1 should be matching, but the path would not be. There are also cases where the same file could be present in multiple locations.

We may need to index the files by sha1, similar to what we did in determine_delta
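The sha1 indexing idea could be sketched as follows, assuming each file is a dict with "sha1" and "path" entries (a hypothetical shape). This version deliberately handles only the simple case where a sha1 appears exactly once on each side:

```python
from collections import defaultdict


def find_moved(new_files, old_files):
    """Return (old_path, new_path) pairs for files whose sha1 matches
    but whose path differs, when the sha1 is unique on both sides."""

    def index(files):
        by_sha1 = defaultdict(list)
        for f in files:
            by_sha1[f["sha1"]].append(f["path"])
        return by_sha1

    new_index, old_index = index(new_files), index(old_files)
    moved = []
    for sha1, new_paths in new_index.items():
        old_paths = old_index.get(sha1, [])
        # Skip ambiguous cases where the same content exists in several places.
        if len(new_paths) == 1 and len(old_paths) == 1 and new_paths[0] != old_paths[0]:
            moved.append((old_paths[0], new_paths[0]))
    return moved
```

Files present in multiple locations are skipped here and would need a separate policy.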

Match scoring

When we have a matching set of files in deltacode, we need to have some way of scoring or weighing a match.

This will also come into play more when we start to incorporate license and copyright changes.

This score will ultimately take the place of modified string in our match object.

Add deltacode and deltacode entrypoint script

Similar to scancode, we need to add ./deltacode and ./deltacode.bat top-level scripts.

This entrypoint script allows a user to simply run ./deltacode after cloning or downloading our deltacode repo. Its main responsibility is to handle the initial configuration automatically.

This makes things easier from an end-user point of view. We will also want to include this in our release script.

Consolidate scoring in 'Delta' method

This issue continues the work started with issue #52, in which a hard-coded score is assigned to the new Delta.score attribute when the various Delta categories are created, i.e., in:

  • DeltaCode.determine_delta(),
  • DeltaCode.update_deltas() and
  • Delta._license_diff().

Add windows builds

We currently only have Linux builds running via Travis CI. We should have builds for all platforms.

Collect errors

[@MaJuRG comment] stop printing error messages to the console, and log the errors instead in the output.

Add cli tests

Use the scancode cli.py tests as a model. We mainly want to verify correct cli output in certain scenarios.

Make `_license_diff` its own deltacode function.

Similar to determine moved, we move this to DeltaCode.

The primary purpose of this function is simply to modify the Delta score field, depending on the license information between two Files.

We can keep it simple for now, and use a similar algorithm to that currently in _license_diff()

Refactor 'generate_csv()'

I think we might be able to reduce the 38 or so lines used in generate_csv() to construct the .csv tuple down to around 13 lines by using a ternary/conditional expression, e.g.,

new = '' if delta['category'] == 'removed' else delta['new']['path']

Initial testing suggests this works as expected -- all 45 tests pass. The refactoring would be applied here:

for delta in deltas:
    category = delta['category']
    if delta['category'] == 'added':
        new = delta['new']['path']
        new_filename = delta['new']['name']
        new_sha1 = delta['new']['sha1']
        new_size = delta['new']['size']
        new_type = delta['new']['type']
        new_orig = delta['new']['original_path']
        old = ''
        old_filename = ''
        old_sha1 = ''
        old_size = ''
        old_type = ''
        old_orig = ''
    elif delta['category'] == 'removed':
        new = ''
        new_filename = ''
        new_sha1 = ''
        new_size = ''
        new_type = ''
        new_orig = ''
        old = delta['old']['path']
        old_filename = delta['old']['name']
        old_sha1 = delta['old']['sha1']
        old_size = delta['old']['size']
        old_type = delta['old']['type']
        old_orig = delta['old']['original_path']
    else:
        new = delta['new']['path']
        new_filename = delta['new']['name']
        new_sha1 = delta['new']['sha1']
        new_size = delta['new']['size']
        new_type = delta['new']['type']
        new_orig = delta['new']['original_path']
        old = delta['old']['path']
        old_filename = delta['old']['name']
        old_sha1 = delta['old']['sha1']
        old_size = delta['old']['size']
        old_type = delta['old']['type']
        old_orig = delta['old']['original_path']
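One way the conditional-expression refactoring might look is sketched below. It goes a step further than the one-line example by looping over the field names; the helpers file_fields and row_for are hypothetical and are not the code that was actually tested:

```python
FIELDS = ("path", "name", "sha1", "size", "type", "original_path")


def file_fields(delta, side, absent_category):
    """Return the six values for one side of a delta.

    When the delta's category means this side does not exist
    ('removed' has no new file, 'added' has no old file), return
    empty strings instead.
    """
    return (
        ("",) * len(FIELDS)
        if delta["category"] == absent_category
        else tuple(delta[side][field] for field in FIELDS)
    )


def row_for(delta):
    """Build the csv tuple: category, then new-file and old-file values."""
    new = file_fields(delta, "new", "removed")
    old = file_fields(delta, "old", "added")
    return (delta["category"],) + new + old
```

Because the conditional expression short-circuits, the missing side of an added or removed delta is never looked up.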

Cloning repo on Windows 10 creates 'tcl' directory

When the deltacode repo is cloned on Windows 10, followed by running ./configure.bat --clean and then ./configure.bat, a tcl directory appears after the last step has finished. My local version of the former spats-deltacode repo does not contain a tcl directory, and I didn't encounter that directory during my previous work in spats-deltacode. After checking out a new branch and launching Visual Studio Code, VSCode indicates that there are 976 "pending changes" -- all of them evidently in this tcl directory.

My current resolution: I've deleted the tcl directory from inside my local branch.

Add date and platform to JSON output

[@johnmhoran comment] In terms of record-keeping, a user might find it helpful if the JSON output includes the date/time at which the JSON was generated and the platform on which DeltaCode was run.
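A small sketch of how this metadata could be gathered with the standard library. The key names "generated_at" and "platform" are hypothetical, not fields of the existing output:

```python
import json
import platform
from datetime import datetime, timezone


def run_metadata():
    """Collect the generation time (UTC, ISO 8601) and the host platform."""
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "platform": platform.platform(),
    }


def with_metadata(results):
    """Return the results dict serialized to JSON with metadata added."""
    output = dict(results)  # copy so the caller's dict is not modified
    output.update(run_metadata())
    return json.dumps(output)
```

Using a timezone-aware UTC timestamp avoids ambiguity when results generated on different machines are compared.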

Identify version changes

[@mjherzog comment] A significant subset of the Added/Removed files in a DeltaCode comparison are likely due to a version change for the same component. This will be complex to solve because the version number may be embedded in the path for a source code directory and/or in the filename for a Development or Deployment component. But this will also be very valuable because upgrading component versions between product releases is extremely common.
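As a first, deliberately naive illustration, a version number embedded in a path could be stripped with a regular expression so that paths from two releases can be compared. The pattern below is an assumption that only covers dotted numeric versions preceded by '-' or '_', and it would miss many real naming schemes:

```python
import re

# Matches fragments such as "-1.1.1", "_2.0" or "-v3.4.5" (heuristic only).
VERSION_RE = re.compile(r"[-_]v?\d+(?:\.\d+)+")


def strip_version(path):
    """Remove version-like fragments from a path or filename."""
    return VERSION_RE.sub("", path)


def same_component(new_path, old_path):
    """Guess whether two paths refer to the same component at different versions."""
    return new_path != old_path and strip_version(new_path) == strip_version(old_path)
```

An Added/Removed pair for which same_component() is true could then be flagged as a likely version change rather than reported as two unrelated deltas.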
