nexb / scancode-toolkit

ScanCode detects licenses, copyrights, and dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!

Home Page: https://github.com/nexB/scancode-toolkit/releases/

Batchfile 0.07% Python 36.46% Shell 15.09% HTML 19.42% C 16.58% C++ 6.43% Java 2.49% C# 0.21% Assembly 0.94% JavaScript 0.31% Perl 1.81% TeX 0.02% Makefile 0.10% Awk 0.01% Objective-C 0.01% Objective-C++ 0.01% CSS 0.03% AppleScript 0.01% CMake 0.01% XSLT 0.01%

license copyright packages dependencies spdx provenance license-scan copyright-scan licensing spdx-licenses

scancode-toolkit's Introduction

ScanCode toolkit

A typical software project often reuses hundreds of third-party packages. License, package, dependency, and origin information is not always easy to find and is not normalized: ScanCode discovers and normalizes this data for you.

Read more about ScanCode here: https://scancode-toolkit.readthedocs.io/.

Check out the code at https://github.com/nexB/scancode-toolkit

Discover also:

The ScanCode.io server project here: https://scancodeio.readthedocs.io
The ScanCode Workbench project for visualization of scancode results data: https://github.com/nexB/scancode-workbench
Other companion SCA projects for code origin, license and security analysis here: https://aboutcode.org

Build and tests status

We run 30,000+ tests on each commit on multiple CIs to ensure good platform compatibility with multiple versions of Windows, Linux and macOS.

Why use ScanCode?

As a standalone command-line tool, ScanCode is easy to install, run, and embed in your CI/CD processing pipeline. It runs on Windows, macOS, and Linux.
ScanCode is used by several projects and organizations such as the Eclipse Foundation, OpenEmbedded.org, the FSFE, the FSF, OSS Review Toolkit, ClearlyDefined.io, RedHat Fabric8 analytics, and many more.
ScanCode detects licenses, copyrights, package manifests, direct dependencies, and more in both source code and binary files. It is widely regarded as a best-in-class reference tool in this domain and is reused as the core tool for software composition data collection by several open source tools.
ScanCode provides the most accurate license detection engine and does a full comparison (also known as diff or red line comparison) between a database of license texts and your code instead of relying only on approximate regex patterns or probabilistic search, edit distance or machine learning.
Written in Python, ScanCode is easy to extend with plugins to contribute new and improved scanners, data summarization, package manifest parsers, and new outputs.
You can save your scan results as JSON, YAML, HTML, CycloneDX or SPDX or even create your own format with Jinja templates.
You can also organize and run ScanCode server-side with the companion ScanCode.io web app to organize and store multiple scan projects including scripted scanning pipelines.
ScanCode output data can be easily visualized and analysed using the ScanCode Workbench desktop app.
ScanCode is actively maintained and has a growing community of users and contributors.
ScanCode is heavily tested with an automated test suite of over 20,000 tests.
ScanCode has extensive and growing documentation.
ScanCode can process package, build manifest, and lockfile formats to collect Package URLs and extract metadata: Alpine packages, BUCK files, ABOUT files, Android apps, Autotools, Bazel, JavaScript Bower, Java Axis, MS Cab, Rust Cargo, CocoaPods, Chef, Chrome apps, PHP Composer and composer.lock, Conda, CPAN, Debian, Apple dmg, Java EAR, WAR, JAR, FreeBSD packages, Rubygems gemspec, Gemfile and Gemfile.lock, Go modules, Haxe packages, InstallShield installers, iOS apps, ISO images, Apache IVY, JBoss Sar, R CRAN, Apache Maven, Meteor, Mozilla extensions, MSI installers, JavaScript npm packages, package-lock.json, yarn.lock, NSIS Installers, NuGet, opam, Python PyPI setup.py, setup.cfg, and several related lockfile formats, semi-structured README files such as README.android, README.chromium, README.facebook, README.google, README.thirdparty, RPMs, Shell Archives, Squashfs images, Windows executables and the Windows registry, and a few more. See all available package parsers for the exhaustive list.

See our roadmap for upcoming features.

Documentation

The ScanCode documentation is hosted at scancode-toolkit.readthedocs.io.

If you are new to visualization of scancode results data, start with our newcomer page.

If you want to compare output changes between different versions of ScanCode, or want to look at scans generated by ScanCode, review our reference scans.

Installation

Before installing ScanCode make sure that you have installed the prerequisites properly. This means installing Python 3.8 for x86/64 architectures. We support Python 3.8, 3.9, 3.10, 3.11 and 3.12.

See prerequisites for detailed information on the supported platforms and Python versions.

There are a few common ways to install ScanCode.

**Installation as an application: Install Python 3.8, download a release archive, extract and run**. This is the recommended installation method.
Development installation from source code using a git clone
Development installation as a library with "pip install scancode-toolkit" [Note that this is not supported on arm64 machines]
Run in a Docker container with a git clone and "docker run"
In Fedora 40+ you can dnf install scancode-toolkit

Quick Start

After ScanCode is installed successfully you can run an example scan printed on screen as JSON:

scancode -clip --json-pp - samples

Follow the How to Run a Scan tutorial to perform a basic scan on the samples directory distributed by default with ScanCode.
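
For scripted use, the quick-start scan can also be launched from Python. The following is a minimal sketch, assuming the `scancode` executable is installed and on the PATH. `build_scan_command` and `run_scan` are hypothetical helper names, not part of ScanCode, and the sketch writes the JSON to a file instead of the screen so that it can be loaded back without parsing console output.

```python
import json
import subprocess


def build_scan_command(target, output="-"):
    """Return the argv list for the quick-start scan shown above."""
    # -clip and --json-pp are the options from the quick-start command;
    # "-" as the output prints the JSON on screen.
    return ["scancode", "-clip", "--json-pp", output, target]


def run_scan(target, output_path):
    """Run the scan, then load the JSON file it wrote (hypothetical helper)."""
    # Assumes the scancode executable is installed and on the PATH.
    subprocess.run(build_scan_command(target, output_path), check=True)
    with open(output_path, encoding="utf-8") as results_file:
        return json.load(results_file)
```

Calling `run_scan("samples", "samples.json")` would scan the bundled samples directory and return the results as a Python dictionary.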

See more command examples:

scancode --examples

See How to select what will be detected in a scan and How to specify the output format for more information.

You can also refer to the command line options synopsis and an exhaustive list of all available command line options.

Archive extraction

By default ScanCode does not extract files from tarballs, zip files, and other archives as part of the scan. The archives that exist in a codebase must be extracted before running a scan: extractcode is a bundled utility behaving as a mostly-universal archive extractor. For example, this command will recursively extract the mytar.tar.bz2 tarball in the mytar.tar.bz2-extract directory:

./extractcode mytar.tar.bz2

See all extractcode options and how to extract archives for details.
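
When extraction and scanning are chained in a script, it helps to know where the extracted files land. The sketch below assumes only the naming described above (the archive name with `-extract` appended); `extract_target` and `extract_archive` are hypothetical helper names, and the `extractcode` executable is assumed to be on the PATH.

```python
import subprocess
from pathlib import Path


def extract_target(archive):
    """Return the directory that the text above says extractcode writes to."""
    # mytar.tar.bz2 is extracted into mytar.tar.bz2-extract, next to the archive.
    archive = Path(archive)
    return archive.with_name(archive.name + "-extract")


def extract_archive(archive):
    """Run extractcode on one archive and return the extraction directory."""
    # Assumes the extractcode executable is installed and on the PATH.
    subprocess.run(["extractcode", str(archive)], check=True)
    return extract_target(archive)
```

The returned directory can then be passed to a regular scan.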

Support

If you have a problem, a suggestion or found a bug, please enter a ticket at: https://github.com/nexB/scancode-toolkit/issues

For discussions and chats, we have:

an official Gitter channel for web-based chats. Gitter is now accessible through Element or an IRC bridge. There are other AboutCode project-specific channels available there too.
The discussion channel for scancode specifically aimed at users and developers using scancode-toolkit.

Source code and downloads

License

Apache-2.0 as the overall license
CC-BY-4.0 for reference datasets (initially was in the Public Domain).
Multiple other secondary permissive or copyleft licenses (LGPL, MIT, BSD, GPL 2/3, etc.) for third-party components and test suite code and data.

See the NOTICE file and the .ABOUT files that document the origin and license of the third-party code used in ScanCode for more details.

scancode-toolkit's People

Contributors

Stargazers

Watchers

Forkers

pombredanne k-rex retrography jdaguil pierrelapointe jhbsz praveen-pk vinodpanicker neusoft-psd ened lach76 austinc88 triggers licodeli balusarakesh radsz karanmg 10imaging mk1023 savinos yahalom5776 amua khtran1994 agneet42 singh1114 armudgal timcrider ash-anand yasharmaster nishant23 vikrant97 darkknightawakens sudeepb02 michaelrup shubham3211 samsruti forrestchang chaminw rajukoushik dejunliu aviaryan jdbean mabreyes armijnhemel kartiksibal rogermoka tedteah jpopelka krintoxi cryptobuks yashladha jarnugirdhar pgier tardummy01 chubbymaggie yash-nisar yashdsaraf rohit-paspule krysnuvadga roscopecoltran snow-summer jonoyang jimjag skillnter chetanya-shrimali neelanshsahai dbuentello pidelport opensource-hisense vinayvishal starlord1311 bhavishyagopesh haikoschol susg haksungjang maxin3d ajeans harrypotter0 jimbo108 saravananoffl zamasharik 01100100 nishakm aswanipranjal yudhik11 sparic techytushar dasanjan1296 fossas avirlrma jardous chaitya62 vivonk inishchith jamesward dstw lechasseur waseem18 thorstenharter aleachjr

scancode-toolkit's Issues

Improved logging and error/exception processing and reporting

When running the scancode command, we should:

always capture exceptions and display formatted error messages
optionally have a way to log more details, such as which file is being processed and why
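
One possible shape for this is sketched below. It is only an illustration, not ScanCode's implementation: `scan_files` and the `scan_one` callable it receives are hypothetical names, and the error message format is an assumption.

```python
import logging
import traceback

logger = logging.getLogger("scan")


def scan_files(paths, scan_one, verbose=False):
    """Scan each path, capturing exceptions instead of aborting the run."""
    results, errors = [], []
    for path in paths:
        if verbose:
            # Optional detail: report which file is being processed.
            logger.info("Scanning %s", path)
        try:
            results.append((path, scan_one(path)))
        except Exception as exc:
            # Keep a short formatted message for display to the user.
            errors.append(f"ERROR: {path}: {type(exc).__name__}: {exc}")
            if verbose:
                logger.debug(traceback.format_exc())
    return results, errors
```

A failure on one file is reported in `errors` while the remaining files are still scanned.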

Fetch/retrieve new and improved licenses from external sources

SPDX, DejaCode and others would be a good start.
The goal would be to have a single-purpose script to fetch, sync and update the ScanCode data.

ScanCode detects only the public domain notice in a file that also has a GPL notice

This is the file http://review.coreboot.org/gitweb?p=coreboot.git;a=blob;f=payloads/bayou/lzma.c;h=a7a8717c6ac6eaa992d4e1ee42fe181ec8b1ebf0;hb=HEAD
The text is

Coreboot interface to memory-saving variant of LZMA decoder

(C)opyright 2006 Carl-Daniel Hailfinger
Released under the GNU GPL v2 or later

Parts of this file are based on C/7zip/Compress/LZMA_C/LzmaTest.c from the LZMA
SDK 4.42, which is written and distributed to public domain by Igor Pavlov.

New DejaCode Licenses 2015-07-28

Three (3) new DejaCode licenses.
[
{
"category": "Copyleft",
"spdx_full_name": "",
"name": "Ghostscript General Public License 1988",
"short_name": "Ghostscript General Public License 1988",
"text_urls": "",
"spdx_license_key": "",
"homepage_url": "",
"spdx_url": "",
"spdx_notes": "",
"key": "ghostscript-1988",
"owner": "Richard Stallman",
"faq_url": "",
"osi_url": ""
},
{
"category": "Attribution Restricted",
"spdx_full_name": "",
"name": "Facebook Software License",
"short_name": "Facebook Software License",
"text_urls": "",
"spdx_license_key": "",
"homepage_url": "",
"spdx_url": "",
"spdx_notes": "",
"key": "facebook-software-license",
"owner": "Facebook",
"faq_url": "http://developers.facebook.com/policy/",
"osi_url": ""
},
{
"category": "Copyleft Limited",
"spdx_full_name": "",
"name": "CognitiveWeb Open Source License 1.1",
"short_name": "CognitiveWeb Open Source License 1.1",
"text_urls": "http://www.cognitiveweb.org/legal/license/CognitiveWebOpenSourceLicense-1.1.txt",
"spdx_license_key": "",
"homepage_url": "http://www.cognitiveweb.org/legal/license/",
"spdx_url": "",
"spdx_notes": "",
"key": "cognitive-web-osl-1.1",
"owner": "CognitiveWeb Project",
"faq_url": "",
"osi_url": ""
}
]

Check the tree view for long path (deep path) and long file/directory name on Windows

reported by @chinyeungli based on bugs reported in AboutCode nexB/aboutcode-toolkit#143

Windows needs special care to handle deep paths. We should test and verify that ScanCode supports such deep paths.
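
A common way to handle deep paths on Windows is the extended-length path prefix. The sketch below is an assumption about how this could be approached, not how ScanCode does it; `to_extended_length` is a hypothetical helper, and it uses `ntpath` so that the logic can be exercised on any platform.

```python
import ntpath

MAX_PATH = 260  # classic Windows limit for a full path
PREFIX = "\\\\?\\"  # extended-length prefix: two backslashes, "?", one backslash


def to_extended_length(path):
    """Return the path with the extended-length prefix if it needs one."""
    if path.startswith(PREFIX) or len(path) < MAX_PATH:
        return path
    if not ntpath.isabs(path) or path.startswith("\\\\"):
        # Relative and UNC paths need different handling; left as-is here.
        return path
    # The prefix disables path normalization in Windows, so normalize first.
    return PREFIX + ntpath.normpath(path)
```

Short paths are returned unchanged, so the helper can be applied to every path before opening a file.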

IndexError: list index out of range

OS: Windows 8.1 64-bit
Python version 2.7 (for windows 64 bit)
Scan-code release version: 1.3.1
I am trying to find the license name for the following text:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of the Thai Open Source Software Center Ltd nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

I have extracted the scan-code and opened a terminal in the Scan-code directory.
Copied the above text into a file "test-license.txt"
Ran the command scancode --license C:\Users\***\Documents\test-license.txt

The following error message was displayed:

[------------------------------------]  1
Traceback (most recent call last):
 File "C:\Users\rakesh\Documents\scancode-toolkit-1.3.1\bin\scancode-script.py", line 9, in <module>
   load_entry_point('scancode-toolkit==1.3.1', 'console_scripts', 'scancode')()
 File "C:\Users\rakesh\Documents\scancode-toolkit-1.3.1\lib\site-packages\click\core.py", line 664, in __call__
   return self.main(*args, **kwargs)
 File "c:\users\rakesh\documents\scancode-toolkit-1.3.1\src\scancode\cli.py", line 230, in main
   standalone_mode=standalone_mode, **extra)
 File "C:\Users\rakesh\Documents\scancode-toolkit-1.3.1\lib\site-packages\click\core.py", line 644, in main
   rv = self.invoke(ctx)
 File "C:\Users\rakesh\Documents\scancode-toolkit-1.3.1\lib\site-packages\click\core.py", line 837, in invoke
   return ctx.invoke(self.callback, **ctx.params)
 File "C:\Users\rakesh\Documents\scancode-toolkit-1.3.1\lib\site-packages\click\core.py", line 464, in invoke
   return callback(*args, **kwargs)
 File "c:\users\rakesh\documents\scancode-toolkit-1.3.1\src\scancode\cli.py", line 290, in scancode
   results.append(scan_one(input_file, copyright, license, verbose))
 File "c:\users\rakesh\documents\scancode-toolkit-1.3.1\src\scancode\cli.py", line 336, in scan_one
   data['licenses'] = list(get_licenses(input_file))
 File "c:\users\rakesh\documents\scancode-toolkit-1.3.1\src\scancode\api.py", line 80, in get_licenses
   from licensedcode.models import get_license
 File "c:\users\rakesh\documents\scancode-toolkit-1.3.1\src\licensedcode\models.py", line 41, in <module>
   from licensedcode import index
 File "c:\users\rakesh\documents\scancode-toolkit-1.3.1\src\licensedcode\index.py", line 31, in <module>
   from textcode import analysis
 File "c:\users\rakesh\documents\scancode-toolkit-1.3.1\src\textcode\analysis.py", line 32, in <module>
   import typecode.contenttype
 File "c:\users\rakesh\documents\scancode-toolkit-1.3.1\src\typecode\contenttype.py", line 48, in <module>
   from typecode import magic2
 File "c:\users\rakesh\documents\scancode-toolkit-1.3.1\src\typecode\magic2.py", line 209, in <module>
   libmagic = load_lib()
 File "c:\users\rakesh\documents\scancode-toolkit-1.3.1\src\typecode\magic2.py", line 193, in load_lib
   root_dir = command.get_base_dirs(typecode.bin_dir)[0]
IndexError: list index out of range

Calling commands should not change the current directory

Currently the directory is changed to the scancode installation dir, leading to surprising behaviors
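
If some step really needs to run from another directory, one option is to change directory only temporarily and always restore the caller's directory afterwards. The sketch below illustrates that idea with a hypothetical `working_directory` context manager; it is not taken from the ScanCode code base.

```python
import os
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily change the current directory, then restore it."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the body raises, so the caller's directory is kept.
        os.chdir(previous)
```

Relative input and output paths given on the command line then keep resolving against the directory the user started from.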

Leading and trailing colons should not be included in scanned copyrights

For instance scanning this:

    :copyright: (c) 2013 by Armin Ronacher.
    :license: BSD, see LICENSE for more details.

yields:
:copyright: (c) 2013 by Armin Ronacher.
instead of:
copyright (c) 2013 by Armin Ronacher.
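
The expected cleanup can be expressed as a small post-processing step. This is a sketch only, assuming the colons always wrap a single leading word as in the example above; `strip_field_colons` is a hypothetical name and not an existing ScanCode function.

```python
import re

# Matches a leading field marker such as ":copyright:" and captures the word.
FIELD_MARKER = re.compile(r"^\s*:(\w+):\s*")


def strip_field_colons(line):
    """Drop the colons around a leading field marker, keeping the word."""
    return FIELD_MARKER.sub(r"\1 ", line)
```

Lines without such a marker are returned unchanged.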

Separate JSON data into a separate file from html app

The html app should have the JSON data in a separate file rather than inside the file

Failure to detect GPL 2 license

With the latest version, scancode failed to detect the GPL license from this string in an ASCII file:
"/*

Copyright 2008, Network Appliance Inc.
Author: Jason McMullan <mcmullan netapp.com>
Licensed under the GPL-2 or later.
*/"

Failure to detect GPL 3.0 license

Using the latest version, scancode failed to detect the GPL 3.0 license from this string:
"License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law."

in a binary file. There are many such files in /usr/bin/ with that string such as 'basename', 'cat', and 'chmod'.

older license is detected

In the text below, the passage with footnotes 6, 7 and 8 explains that the license was changed from GPL to OFL, but scancode also detects the GPL license.

download url: http://mirror.centos.org/centos/7.1.1503/os/x86_64/Packages/abattis-cantarell-fonts-0.0.12-3.el7.noarch.rpm

license-file path: abattis-cantarell-fonts-0.0.12-3.el7.noarch.rpm\abattis-cantarell-fonts-0.0.12-3.el7.noarch.cpio\.\usr\share\doc\abattis-cantarell-fonts-0.0.12\README

README text :

FONTLOG for Cantarell GNOME 0.0.5
=================================

This file provides detailed information on the Cantarell font
software. This information should be distributed along with the
Cantarell fonts and any derivative works.

Font Information
-------------------------

The Cantarell typeface family is a contemporary Humanist 
sans serif, and is used by the GNOME project for its user
interface and the Fedora project. 

Cantarell was originally designed by Dave Crossland as part 
of his coursework for the MA Typeface Design programme at 
the Department of Typography in the University of Reading, 
England. [1] 

Dave was motivated to undertake a study of typeface design because
he believes it is essential that when we use digital tools, our
freedom to use, understand, modify and share these tools is 
respected. Otherwise, when the tool does not work in the way 
that we need, we will be unable to fix it.

These fonts are developed using only such "libre" software, 
mainly FontForge [2]. 

Cantarell was originally aimed at on-screen reading in a specific 
use-case and environment: reading web pages on an HTC Dream 
mobile phone [3].

That device was the first to ship with Google Android [4], and 
came installed with a web browser that supported the exciting web 
fonts feature known as @font-face [5]. As Dave's very first typeface 
design, the typeface has many faults, yet he asserts it achieves 
his goal of improving readability on this device.

The regular member of the family has had recieved the most focus, and a bold
family has been developed quickly to provide better somewhat better results
that an operating system's automatic bolding. In the case of oblique, we
decided to rely on the system generated variant for now. An actual italics
variant is planned.

The Regular font fully supports the following writing systems: 
Basic Latin, Western European, Catalan, Baltic, Turkish, Central 
European, Dutch and Afrikaans. To date, Pan African Latin has 
only 33% glyph coverage.

Since the design is aimed at display on-screen at small sizes, the
printed output (especially of the bold and oblique) may not work
well. Fonts tuned to the needs of printing will be developed in 
the future.

The fonts were initially published on the 6th of July 2009 on
Dave Crossland's foundry website [6] under the terms of the GNU
General Public License version 3. [7] In May 2010 the fonts were 
republished through Google Web Fonts [8] under the terms of the
SIL Open Font License version 1.1. [9] In November 2010 the
project became part of the GNOME project and is now under active
development by the GNOME design community. [10]

Dave Crossland, 21st March 2011

[1]: http://www.typedesign.reading.ac.uk
[2]: http://fontforge.sf.net
[3]: http://en.wikipedia.org/wiki/HTC_Dream
[4]: http://en.wikipedia.org/wiki/Android_%28operating_system%29
[5]: http://openfontlibrary.org/wiki/Web_font_linking_with_%40font-face
[6]: http://abattis.org/cantarell
[7]: http://www.gnu.org/licenses/gpl.html
[8]: http://www.google.com/webfonts
[9]: http://scripts.sil.org/OFL
[10]: http://live.gnome.org/CantarellFonts

                                  * * * 

Developer information
---------------------

The original src/Cantarell-Regular.sfd file has the master sources 
as Cubic (PostScript) Bezier splines. There are temporary layers 
and a 'Spiro' layer in this file, containing forms used to create 
the master Cubic Bezier glyphs; the Spiro layer contains forms in 
Spiro splines, and much of the original typeface design by Dave
Crossland was done by drawing in Spiro splines. However today the
master drawing spline format is Cubic Bezier, and Spiro splines
are used to inform their creation. 

The Cantarell-Regular.sfd file is the _master_ source, and was 
used to generate the Cantarell-Bold.sfd which is now a hard fork. 

All development occurs by making changes to these drawing files. 
When OTF or TTF binaries are compiled, they are copied to the 
Cantarell-*-OTF.sfd and Cantarell-*-TTF.sfd files and then a 
build process applied. 

This means that there should be a 1:1 match between these files, 
the OTF and TTF files in the otf/ and ttf/ directories, and the
output of generating new OTF and TTF files from FontForge. 

The build process is simple; the Spiro and temp layers are removed, 
in the case of TTF files all layers are converted to Quadratic from
Cubic, and then all glyphs have the Simplify, Add Extrema, Round 
to Int, and Correct Direction operations applied. 

In the future a build script will be developed to do this in an
automated way, which will be important for adding OpenType 
Layout features through a feature.fea file. 

ChangeLog
-------------------------

Please refer to the GNOME Git repository changelog at this URL:

http://git.gnome.org/browse/cantarell-fonts/log/

Acknowledgements
-------------------------

Here is a list of major contributors; all contributors are listed
in the GNOME Git repository changelogs.

If you make major modifications be sure to add your name (N), email (E),
web-address (W) and description (D). This list is sorted by last name
in alphabetical order.

N: Jakub Steiner
E: [email protected]
W: http://jimmac.musichall.cz
D: Designer - many improvements and GNOME standards engineering

N: Dave Crossland
E: [email protected]
W: http://abattis.org/cantarell/
D: Designer - original Latin glyphs

N: Erik Hartenian
E: [email protected]
W: http://infinality.net
D: Connoisseur of fine font renderding

multiple detections for a single license in a single file and a few lines are ignored

For the scan below, the OFL license is detected twice and lines 4 and 5 are ignored. Please check the scancode html output at the bottom.

download url: http://mirror.centos.org/centos/7.1.1503/os/x86_64/Packages/abattis-cantarell-fonts-0.0.12-3.el7.noarch.rpm

license-file path: abattis-cantarell-fonts-0.0.12-3.el7.noarch.rpm\abattis-cantarell-fonts-0.0.12-3.el7.noarch.cpio\.\usr\share\doc\abattis-cantarell-fonts-0.0.12\COPYING
COPYING text:

Copyright (c) 2009-2011, Understanding Limited ([email protected]),
Copyright (c) 2010-2011, Jakub Steiner ([email protected]).

This Font Software is licensed under the SIL Open Font License, Version 1.1.
This license is copied below, and is also available with a FAQ at:
http://scripts.sil.org/OFL


-----------------------------------------------------------
SIL OPEN FONT LICENSE Version 1.1 - 26 February 2007
-----------------------------------------------------------

PREAMBLE
The goals of the Open Font License (OFL) are to stimulate worldwide
development of collaborative font projects, to support the font creation
efforts of academic and linguistic communities, and to provide a free and
open framework in which fonts may be shared and improved in partnership
with others.

The OFL allows the licensed fonts to be used, studied, modified and
redistributed freely as long as they are not sold by themselves. The
fonts, including any derivative works, can be bundled, embedded, 
redistributed and/or sold with any software provided that any reserved
names are not used by derivative works. The fonts and derivatives,
however, cannot be released under any other type of license. The
requirement for fonts to remain under this license does not apply
to any document created using the fonts or their derivatives.

DEFINITIONS
"Font Software" refers to the set of files released by the Copyright
Holder(s) under this license and clearly marked as such. This may
include source files, build scripts and documentation.

"Reserved Font Name" refers to any names specified as such after the
copyright statement(s).

"Original Version" refers to the collection of Font Software components as
distributed by the Copyright Holder(s).

"Modified Version" refers to any derivative made by adding to, deleting,
or substituting -- in part or in whole -- any of the components of the
Original Version, by changing formats or by porting the Font Software to a
new environment.

"Author" refers to any designer, engineer, programmer, technical
writer or other person who contributed to the Font Software.

PERMISSION & CONDITIONS
Permission is hereby granted, free of charge, to any person obtaining
a copy of the Font Software, to use, study, copy, merge, embed, modify,
redistribute, and sell modified and unmodified copies of the Font
Software, subject to the following conditions:

1) Neither the Font Software nor any of its individual components,
in Original or Modified Versions, may be sold by itself.

2) Original or Modified Versions of the Font Software may be bundled,
redistributed and/or sold with any software, provided that each copy
contains the above copyright notice and this license. These can be
included either as stand-alone text files, human-readable headers or
in the appropriate machine-readable metadata fields within text or
binary files as long as those fields can be easily viewed by the user.

3) No Modified Version of the Font Software may use the Reserved Font
Name(s) unless explicit written permission is granted by the corresponding
Copyright Holder. This restriction only applies to the primary font name as
presented to the users.

4) The name(s) of the Copyright Holder(s) or the Author(s) of the Font
Software shall not be used to promote, endorse or advertise any
Modified Version, except to acknowledge the contribution(s) of the
Copyright Holder(s) and the Author(s) or with their explicit written
permission.

5) The Font Software, modified or unmodified, in part or in whole,
must be distributed entirely under this license, and must not be
distributed under any other license. The requirement for fonts to
remain under this license does not apply to any document created
using the Font Software.

TERMINATION
This license becomes null and void if any of the above conditions are
not met.

DISCLAIMER
THE FONT SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT
OF COPYRIGHT, PATENT, TRADEMARK, OR OTHER RIGHT. IN NO EVENT SHALL THE
COPYRIGHT HOLDER BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
INCLUDING ANY GENERAL, SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL
DAMAGES, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF THE USE OR INABILITY TO USE THE FONT SOFTWARE OR FROM
OTHER DEALINGS IN THE FONT SOFTWARE.

scancode html output :


Path    What    Start Line  End Line        Info
C:/doc/linux_distro_update/abattis-cantarell-fonts-0.0.12-3.el7.noarch.rpm/abattis-cantarell-fonts-0.0.12-3.el7.noarch.rpm-extract/usr/share/doc/abattis-cantarell-fonts-0.0.12/COPYING License 6   6       OFL 1.1
C:/doc/linux_distro_update/abattis-cantarell-fonts-0.0.12-3.el7.noarch.rpm/abattis-cantarell-fonts-0.0.12-3.el7.noarch.rpm-extract/usr/share/doc/abattis-cantarell-fonts-0.0.12/COPYING License 10  94      OFL 1.1

Scan fails on PDF file

The file at https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf fails to be scanned.
This is a bug in pdfminer. See euske/pdfminer#118

wget https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf
python -c "from pdfminer.pdfparser import PDFParser;p=PDFParser(open('5756M-PG101-R.pdf','rb'));from pdfminer.pdfdocument import PDFDocument;PDFDocument(p)" 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "[...]/local/lib/python2.7/site-packages/pdfminer/pdfdocument.py", line 575, in __init__
    self._initialize_password(password)
  File "[...]/local/lib/python2.7/site-packages/pdfminer/pdfdocument.py", line 598, in _initialize_password
    raise PDFEncryptionError('Unknown algorithm: param=%r' % param)
pdfminer.pdfdocument.PDFEncryptionError: Unknown algorithm: param={u'EncryptMetadata': False, u'CF': {u'StdCF': {u'Length': 16, u'CFM': /V2, u'AuthEvent': /DocOpen}}, u'O': '\xc6\xa4\xb4%\xed\xda\xe8\x7f&\xd2\x97\x840y\xc7\xbe!N\xdb\xfbw\x0f\x04\xb3iZTn\n\xc3\x93c', u'Filter': /Standard, u'P': -1324, u'Length': 128, u'R': 4, u'U': '\xf3\xa1\xeb\xa5\x19\x8a\x15%\x001\x13CenHO\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', u'V': 4, u'StmF': /StdCF, u'StrF': /StdCF}

Note that on Linux using:

 wget https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf
 pdfseparate -f 1 -l 1 5756M-PG101-R.pdf  5756M-PG101-R-p1.pdf

creates a single page small PDF doc that has the same issue as the full doc

configurable ignores with a `--ignore` option

This is a follow up of #35 from @K-Rex
We should add an --ignore command line option to point to an ignore file
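
A sketch of what such an ignore file could look like in practice follows. The file format (one glob pattern per line, `#` for comments) and the helper names `load_ignore_patterns` and `is_ignored` are assumptions made for illustration, not a description of an existing option.

```python
import fnmatch


def load_ignore_patterns(text):
    """Parse ignore-file content: one glob per line, '#' starts a comment."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_ignored(path, patterns):
    """True if the full path or any of its segments matches a pattern."""
    segments = path.replace("\\", "/").split("/")
    return any(
        fnmatch.fnmatchcase(path, pattern)
        or any(fnmatch.fnmatchcase(segment, pattern) for segment in segments)
        for pattern in patterns
    )
```

Matching on individual segments lets a single `.svn` entry exclude that directory wherever it appears in the tree.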

Add Summary Feature

In many cases the html file has hundreds and sometimes thousands of line items. It would be convenient to have the tool provide a summary list of licenses in a particular package. The user can then examine the detailed html file only if needed.
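
Such a summary could be as simple as a count of files per license. The sketch below assumes the detections are available as `(path, license)` pairs, which is a simplification for illustration and not the actual scan output format; `summarize_licenses` is a hypothetical name.

```python
from collections import Counter


def summarize_licenses(detections):
    """Count the number of distinct files in which each license was detected."""
    files_by_license = {}
    for path, license_name in detections:
        # A set avoids counting a file twice when a license matches repeatedly.
        files_by_license.setdefault(license_name, set()).add(path)
    return Counter({name: len(paths) for name, paths in files_by_license.items()})
```

`summarize_licenses(...).most_common()` then lists licenses from most to least frequent.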

scan won't run if extract option is set

OS: Windows 7 - 64bits
Python 2.7
scancode version: 1.0.0

I have a sample archive that I want to extract and then scan. I set --format html-app followed by the -e (extract) option.
However, the tool only performed the extraction and did not run the scan.

This is the command that I used:

(scancode-toolkit-1.0.0) C:\Users\Downloads\scancode-toolkit-1.0.0>scancode --format html-app -e C:\Users\Downloads\test\ C:\Users\Downloads\test.html
Extracting archives...
Extracting done.
(scancode-toolkit-1.0.0) C:\Users\Downloads\scancode-toolkit-1.0.0>

progress should be based on 100% scale

reported by @chinyeungli

Scanning files...
  [------------------------------------]  1

I thought the above was a progress bar based on percentage. But then I realized it shows the number of files processed.

I think a 100% scale makes more sense, as not many users care about how many files have been processed, but rather how much is left to be completed.
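
A percentage-based bar requires knowing the total number of files before scanning starts. Assuming that count is available, the rendering itself is short; `progress_line` below is a hypothetical helper, with the bar width chosen to match the 36-character bar shown above.

```python
def progress_line(done, total, width=36):
    """Render a progress bar scaled to the fraction of files completed."""
    # Treat an empty scan as complete to avoid dividing by zero.
    fraction = done / total if total else 1.0
    filled = int(width * fraction)
    bar = "#" * filled + "-" * (width - filled)
    return f"[{bar}] {fraction:.0%}"
```

The file count could still be printed next to the percentage for users who want both.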

The Files row highlighted when moving up and down with the arrows is faint in the browser

The row highlighted, which is seven rows below the dark highlighted row, is very faint. I tried this out in Chrome, Firefox, and Internet Explorer and got the same results. Scancode is an impressive tool otherwise.

Basic packaged code support

The basics should be:

have a common model for component data
basic support for common package formats
a scancode --package option to scan for packages and return first:
- a package type (RPM, Gem, npm, etc.)
- if possible, its id and version
- how it is packaged (in an archive or a directory)
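
As a starting point for discussion, the three items to return could map to a small common model. The class and field names below are hypothetical and only mirror the list above; they are not ScanCode's actual package model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PackageInfo:
    """Minimal common model for one detected package (illustrative only)."""

    type: str  # package type, e.g. "rpm", "gem", "npm"
    packaging: str  # how it is packaged: "archive" or "directory"
    name: Optional[str] = None  # id, when it can be determined
    version: Optional[str] = None  # version, when it can be determined

    def label(self):
        """Return a short human-readable identifier."""
        if self.name is None:
            return f"{self.type}: (unknown)"
        if self.version is None:
            return f"{self.type}: {self.name}"
        return f"{self.type}: {self.name}@{self.version}"
```

Keeping id and version optional reflects the "if possible" wording: a package type can often be recognized before its metadata can be parsed.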

Control what is extracted with the extractcode command

extractcode supports selecting what is extracted but this is not exposed as a command line option.
There should be a way to control what is extracted, possibly with an expanded option --extract=<kind> or by making extraction a separate command or a subcommand.

Separate scanner from api + other toolkit-related modules

I think it would be valuable to have the scanner itself as a completely separate module in a different repo, or every "strategy" as separate repos. The rule aggregation is a different job than the toolkit of creating the HTML templates and other data.

Also, maybe consider building the scanning module in a language like Go, which can be compiled into a binary.

Add option to merge overlapping scans

For instance, a copyright, url or email inside a detected license could be omitted.
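A minimal sketch of such a merge, assuming each detection is a dict with `start_line` and `end_line` keys. That shape is an assumption for illustration, not the actual scan result format.

```python
def drop_nested(licenses, others):
    """Return the items of `others` that are not fully inside a license match."""
    def inside(item, license_match):
        return (license_match['start_line'] <= item['start_line']
                and item['end_line'] <= license_match['end_line'])

    return [item for item in others
            if not any(inside(item, lic) for lic in licenses)]


if __name__ == '__main__':
    licenses = [{'key': 'gpl', 'start_line': 1, 'end_line': 20}]
    urls = [
        {'url': 'http://www.gnu.org/', 'start_line': 5, 'end_line': 5},
        {'url': 'http://example.com/', 'start_line': 30, 'end_line': 30},
    ]
    # Only the URL outside of the license text is kept.
    print(drop_nested(licenses, urls))
```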

adding new license

Jun 3, 8:07 AM
Hi,
today @pierrelapointe showed me a quick preview of ScanCode. It looks really promising.
I have a few questions that he suggested I write here.

Can you add/define a new license in ScanCode?
Can you add the regexp for this new license?
Can you filter for this new license on the summary page?
Can it be synchronized with DejaCode license library?

For reference, I include here how Fossology does it. It's really a pain.
http://www.fossology.org/projects/fossology/wiki/Nomos#How-to-Add-a-New-License-Signature

This can be really a good USP.
Kind regards,
Béla

html-app output does not work on IE 8 (which is the default on a vanilla Windows 7)

It represents either ~14% or ~3% of the browsers out there depending on the sources. This is also the last version that was updated on Windows XP. We do not support nor test on XP.

See :

My take is to collect feedback for now and wait and see.

How to use --extract on Linux-32? What does extracting an archive mean?

I want to use -c -l to scan, but it scans all the files, such as .c and .h files. I don't want to scan the .svn/ directory. How can I do that?

I ran: ./scancode -e ../gpl/
I get this error:

Extracting archives...
[------------------------------------] 0
Traceback (most recent call last):
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.2.4/bin/scancode", line 9, in <module>
load_entry_point('scancode-toolkit==1.2.4', 'console_scripts', 'scancode')()
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.2.4/local/lib/python2.7/site-packages/click/core.py", line 664, in __call__
return self.main(*args, **kwargs)
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.2.4/local/lib/python2.7/site-packages/click/core.py", line 644, in main
rv = self.invoke(ctx)
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.2.4/local/lib/python2.7/site-packages/click/core.py", line 837, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.2.4/local/lib/python2.7/site-packages/click/core.py", line 464, in invoke
return callback(*args, **kwargs)
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.2.4/src/scancode/cli.py", line 258, in scancode
extract_with_progress(abs_input, verbose)
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.2.4/src/scancode/cli.py", line 345, in extract_with_progress
for xevent in extractions:
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.2.4/local/lib/python2.7/site-packages/click/_termui_impl.py", line 240, in next
rv = next(self.iter)
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.2.4/src/scancode/api.py", line 44, in extract_archives
from extractcode.extract import extract
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.2.4/src/extractcode/extract.py", line 37, in <module>
from extractcode import archive
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.2.4/src/extractcode/archive.py", line 47, in <module>
from extractcode import libarchive2
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.2.4/src/extractcode/libarchive2.py", line 91, in <module>
libarchive = load_lib()
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.2.4/src/extractcode/libarchive2.py", line 87, in load_lib
raise ImportError('Failed to load libarchive: %(libarchive)r' % locals())
ImportError: Failed to load libarchive: '/home/rex/qdu_claro/yuki/scancode-toolkit-1.2.4/src/extractcode/bin/linux-32/bin/libarchive.so'

There is no libarchive.so in that directory!!!
Can anyone help me? Thx...

provide basic file information in results (size, type, etc.)

Using a --info option for now should be a good start
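As a starting point, the kind of data such an option could return can be sketched with the standard library alone. The `file_info` name and the keys of the returned dict are assumptions, and guessing the type from the file name is only a stand-in for real content-based detection.

```python
import mimetypes
import os


def file_info(location):
    """Return basic information about a file or directory."""
    is_file = os.path.isfile(location)
    # guess_type only looks at the file name, not at the content.
    mime_type, _encoding = mimetypes.guess_type(location)
    return {
        'location': location,
        'type': 'file' if is_file else 'directory',
        # Size is only meaningful for regular files.
        'size': os.path.getsize(location) if is_file else None,
        'mime_type': mime_type if is_file else None,
    }
```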

Display progress when extracting

There should be a progress bar and an optional verbose output to display progress on archive extraction, the same way scanning progress is displayed.

Support for common ignore (svn, git, etc)

Broken down from @K-Rex #33

I want to use -c -l to scan, but it scans all the files, such as .c and .h files. I don't want to scan the .svn/ directory. How can I do that?
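One way to skip version control directories is to prune them while walking the tree. This is a sketch only: the `walk_files` helper and the contents of `IGNORED_DIRS` are assumptions, not the toolkit's own ignore list.

```python
import os

# Assumed set of common version control directory names.
IGNORED_DIRS = frozenset(['.svn', '.git', '.hg', 'CVS'])


def walk_files(root, ignored=IGNORED_DIRS):
    """Yield file paths under `root`, skipping ignored directories."""
    for top, dirs, files in os.walk(root):
        # Prune in place so that os.walk never descends into ignored dirs.
        dirs[:] = [d for d in dirs if d not in ignored]
        for name in files:
            yield os.path.join(top, name)
```

Pruning `dirs` in place matters: filtering the yielded paths afterwards would still walk every file under `.svn/` or `.git/`.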

Timeout when processing a file

To avoid a command that hangs forever, we should have a timeout when processing a single file.
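A rough sketch of a per-file timeout using only the Python 3 standard library. The `scan_with_timeout` name and the shape of the returned dict are assumptions. Note the limitation: a thread cannot be killed, so this only stops waiting for a stuck file; truly interrupting the work would need a separate process.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def scan_with_timeout(scan_func, location, timeout=1.0):
    """Run scan_func(location), giving up after `timeout` seconds."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(scan_func, location)
    try:
        result = future.result(timeout=timeout)
        return {'location': location, 'result': result, 'error': None}
    except TimeoutError:
        # Record which file timed out instead of hanging the whole scan.
        message = 'timeout after %.1fs' % timeout
        return {'location': location, 'result': None, 'error': message}
    finally:
        # Do not block on the worker: a stuck thread keeps running
        # in the background until it finishes on its own.
        executor.shutdown(wait=False)
```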

Add new license for nuclide

See https://github.com/facebook/nuclide/blob/master/LICENSE

I can only see the resources displayed as a table instead of a tree.

I ran: scancode --format html-app samples samples.html on my Windows machine.

The result on my machine only shows a table in the left resource panel.

minimal CSS style for the html output

The html format is bare html which is fine but a tad crude when opened in a browser.

Having a minimal CSS style in the template would be nice.

Release v1.3.0 can't use --verbose with other options!

1. $ ./scancode --verbose -c ../gpl/ aaa

Scanning files...
Traceback (most recent call last):
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.3.0/bin/scancode", line 9, in <module>
load_entry_point('scancode-toolkit==1.3.0', 'console_scripts', 'scancode')()
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.3.0/local/lib/python2.7/site-packages/click/core.py", line 664, in __call__
return self.main(*args, **kwargs)
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.3.0/local/lib/python2.7/site-packages/click/core.py", line 644, in main
rv = self.invoke(ctx)
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.3.0/local/lib/python2.7/site-packages/click/core.py", line 837, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.3.0/local/lib/python2.7/site-packages/click/core.py", line 464, in invoke
return callback(*args, **kwargs)
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.3.0/src/scancode/cli.py", line 282, in scancode
for input_file in file_iter(files):
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.3.0/src/commoncode/fileutils.py", line 299, in file_iter
for top, _dirs, files in walk(location, ignored):
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.3.0/src/commoncode/fileutils.py", line 265, in walk
if filetype.is_file(location) :
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.3.0/src/commoncode/filetype.py", line 46, in is_file
return (location and os.path.isfile(location)
File "/home/rex/qdu_claro/yuki/tools/scancode-toolkit-1.3.0/lib/python2.7/genericpath.py", line 29, in isfile
st = os.stat(path)
TypeError: coercing to Unicode: need string or buffer, generator found

2. $ ./scancode --verbose --format html-app ../gpl/ gpl.html

TypeError: coercing to Unicode: need string or buffer, generator found

Fail to detect license

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the Thai Open Source Software Center Ltd nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Scancode 1.3.1 fails to detect the above license.

(scancode-toolkit-1.3.1) C:\Users\CYL\Downloads\scancode-toolkit-1.3.1>scancode --license C:\Users\CYL\Downloads\test.txt
Scanning files...
  [####################################]  1
{
  "count": 1,
  "notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode
 should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://
github.com/nexB/scancode-toolkit/ for support and download.",
  "results": [
    {
      "licenses": [],
      "location": "C:/Users/CYL/Downloads/test.txt"
    }
  ],
  "version": "1.3.1"
}Scanning done.

Add new tests (and possibly licenses and rules) from https://github.com/retrography/OS-Licenses/

Some licenses in https://github.com/retrography/OS-Licenses/ by @retrography may not be detected as full exact licenses. They should be.

Incorrect license detection

ScanCode detects the LZMA-SDK-original license, which is probably a newer version, for the following license notice found in a file...

  LZMA Decoder interface

  LZMA SDK 4.40 Copyright (c) 1999-2006 Igor Pavlov (2006-05-01)
  http://www.7-zip.org/

  LZMA SDK is licensed under two licenses:
  1) GNU Lesser General Public License (GNU LGPL)
  2) Common Public License (CPL)
  It means that you can select one of these two licenses and
  follow rules of that license.

  SPECIAL EXCEPTION:
  Igor Pavlov, as the author of this code, expressly permits you to
  statically or dynamically link your code (or bind by name) to the
  interfaces of this file without subjecting your linked code to the
  terms of the CPL or GNU LGPL. Any modifications or additions
  to this file, however, are subject to the LGPL or CPL terms.

--extract should not be an option but a separate command

Extraction feels like a wart in the scancode command. It does not share any of the semantics of the command and would be better as a separate command, such as extractcode, that only extracts things.

Speed up license detection

The license index is recreated from scratch each time the scancode command is called.

The overhead can be anywhere between 10s and a few minutes depending on the machine.

A solution is to cache the index and only re-index when licenses or rules have changed.
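The caching idea can be sketched as follows, assuming the rules live as files in one directory and the index can be pickled. The names `rules_checksum`, `load_index` and the `build_index` callable are hypothetical, and the cache file is assumed to live outside the rules directory.

```python
import hashlib
import os
import pickle


def rules_checksum(rules_dir):
    """Return a checksum over the names and contents of all rule files."""
    digest = hashlib.sha1()
    for top, dirs, files in os.walk(rules_dir):
        dirs.sort()  # walk in a stable order so the checksum is reproducible
        for name in sorted(files):
            path = os.path.join(top, name)
            digest.update(os.path.relpath(path, rules_dir).encode('utf-8'))
            with open(path, 'rb') as rule_file:
                digest.update(rule_file.read())
    return digest.hexdigest()


def load_index(rules_dir, cache_file, build_index):
    """Return the cached index, rebuilding it only if the rules changed."""
    checksum = rules_checksum(rules_dir)
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as cached_file:
            cached = pickle.load(cached_file)
        if cached.get('checksum') == checksum:
            return cached['index']
    # Rules are new or changed: rebuild and store alongside the checksum.
    index = build_index(rules_dir)
    with open(cache_file, 'wb') as cached_file:
        pickle.dump({'checksum': checksum, 'index': index}, cached_file)
    return index
```

Hashing every rule file still costs some I/O on each run; comparing modification times would be cheaper but less reliable.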

ScanCode crashes with PDFEncryptionError

Ubuntu 12.04
x86_64
Python 2.7.3
ScanCode Version 1.3.1

I attempted to create both html_app and html output: ./scanCode -f html tivo.html

Workspace being scanned has 198785 files. ScanCode directory and workspace resident on local machine.

On both scan attempts the tool crashed, apparently when trying to scan a PDF file. There's no information about which file caused the problem, so I can't independently check its validity.

No output file was created.

Traceback (most recent call last):
File "/export/dqj347/scancode-toolkit-1.3.1/bin/scancode", line 9, in <module>
load_entry_point('scancode-toolkit==1.3.1', 'console_scripts', 'scancode')()
File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/click/core.py", line 664, in __call__
return self.main(*args, **kwargs)
File "/export/dqj347/scancode-toolkit-1.3.1/src/scancode/cli.py", line 230, in main
standalone_mode=standalone_mode, **extra)
File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/click/core.py", line 644, in main
rv = self.invoke(ctx)
File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/click/core.py", line 837, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/click/core.py", line 464, in invoke
return callback(*args, **kwargs)
File "/export/dqj347/scancode-toolkit-1.3.1/src/scancode/cli.py", line 290, in scancode
results.append(scan_one(input_file, copyright, license, verbose))
File "/export/dqj347/scancode-toolkit-1.3.1/src/scancode/cli.py", line 332, in scan_one
data['copyrights'] = list(get_copyrights(input_file))
File "/export/dqj347/scancode-toolkit-1.3.1/src/scancode/api.py", line 62, in get_copyrights
for copyrights, _, _, _, start_line, end_line in detect_copyrights(location):
File "/export/dqj347/scancode-toolkit-1.3.1/src/cluecode/copyrights.py", line 70, in detect_copyrights
for numbered_lines in candidate_lines(analysis.text_lines(location)):
File "/export/dqj347/scancode-toolkit-1.3.1/src/cluecode/copyrights.py", line 797, in candidate_lines
for line_number, line in enumerate(lines):
File "/export/dqj347/scancode-toolkit-1.3.1/src/textcode/analysis.py", line 552, in unicode_text_lines_from_pdf
for line in pdf.get_text_lines(location):
File "/export/dqj347/scancode-toolkit-1.3.1/src/textcode/pdf.py", line 46, in get_text_lines
document = PDFDocument(parser)
File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/pdfminer/pdfdocument.py", line 326, in __init__
self._initialize_password(password)
File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/pdfminer/pdfdocument.py", line 348, in _initialize_password
raise PDFEncryptionError('Unknown algorithm: param=%r' % param)
pdfminer.pdfdocument.PDFEncryptionError: Unknown algorithm: param={'CF': {'StdCF': {'Length': 16, 'CFM': /AESV2, 'AuthEvent': /DocOpen}}, 'O': '\xf1T({\xf5#N\xc0\xfewr\xcf6\xd2\x92\x89\x1b\xbe\x11\x8c\xd0\xec\x88\x1d\x1a\x9c}\xf5\xb7J\xb5\x87', 'Filter': /Standard, 'P': -1036, 'Length': 128, 'R': 4, 'U': '\x14\x8bR\xb6x\x97t\xc1\xcf\xeaO{\x1a]\xfc\xfd\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'V': 4, 'StmF': /StdCF, 'StrF': /StdCF}
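A crash like the one above aborts the whole run and hides which file triggered it. A minimal sketch of isolating failures per file, so the offending path is recorded and an output can still be written; `scan_all` and the `errors` key are hypothetical names, and `scan_func` stands in for whatever performs the actual scan and returns a dict.

```python
import traceback


def scan_all(locations, scan_func):
    """Scan each location, recording failures instead of aborting the run."""
    results = []
    for location in locations:
        entry = {'location': location, 'errors': []}
        try:
            entry.update(scan_func(location))
        except Exception:
            # Keep the traceback next to the path of the file that caused it.
            entry['errors'].append(traceback.format_exc())
        results.append(entry)
    return results
```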

Do not report the full absolute paths

This commit 3e3d642 enforced conversion to absolute paths.

This is all good, but all outputs should have a path rooted in whatever the user entered, not always an absolute path.
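A small sketch of how a reported path could be rebuilt from the user's input while absolute paths are still used internally. The `reported_path` helper and its arguments are assumptions for illustration.

```python
import os


def reported_path(abs_location, abs_input, user_input):
    """Return `abs_location` re-rooted at the path the user typed."""
    relative = os.path.relpath(abs_location, abs_input)
    if relative == os.curdir:
        # The scanned location is the input itself.
        return user_input
    return os.path.join(user_input, relative)


if __name__ == '__main__':
    user_input = os.path.join('..', 'gpl')
    abs_input = os.path.abspath(user_input)
    abs_location = os.path.join(abs_input, 'src', 'main.c')
    # Reports the file relative to '../gpl' rather than as an absolute path.
    print(reported_path(abs_location, abs_input, user_input))
```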

Test failure on Windows in develop

See appveyor: https://ci.appveyor.com/project/nexB/scancode-toolkit/build/3

================================== FAILURES =================================== 
__ TestPatchInfo.test_patch_info_patch_patches_windows_plugin_explorer_patch __ 
[gw0] win32 -- Python 2.7.9 C:\projects\scancode-toolkit\Scripts\python.exe 
self = <test_patch.TestPatchInfo testMethod=test_patch_info_patch_patches_windows_plugin_explorer_patch>

    def test_patch_info_patch_patches_windows_plugin_explorer_patch(self):
        test_file = self.get_test_loc(u'patch/patches/windows/plugin explorer.patch')
        expected_file = self.get_test_loc('patch/patches/windows/plugin explorer.patch.expected')
        with open(expected_file, 'rb') as expect:
            expected = expect.read()
            result = pprint.pformat(list(patch.patch_info(test_file)))
>           assert expected == result
E           AssertionError: assert "[('.classpat...supertype>')]" == "[('.classpath...supertype>')]"
E             Detailed information truncated, use "-vv" to show

tests\extractcode\test_patch.py:1578: AssertionError 
 TestPatchInfo.test_patch_info_patch_patches_misc_linux_st710x_patches_i2c_nostop_for_bitbanging_patch  
[gw1] win32 -- Python 2.7.9 C:\projects\scancode-toolkit\Scripts\python.exe 
self = <test_patch.TestPatchInfo testMethod=test_patch_info_patch_patches_misc_linux_st710x_patches_i2c_nostop_for_bitbanging_patch>

    def test_patch_info_patch_patches_misc_linux_st710x_patches_i2c_nostop_for_bitbanging_patch(self):
        test_file = self.get_test_loc(u'patch/patches/misc/linux-st710x/patches/i2c_nostop_for_bitbanging.patch')
        expected_file = self.get_test_loc('patch/patches/misc/linux-st710x/patches/i2c_nostop_for_bitbanging.patch.expected')
        with open(expected_file, 'rb') as expect:
            expected = expect.read()
            result = pprint.pformat(list(patch.patch_info(test_file)))
>           assert expected == result
E           AssertionError: assert "[('linux-2.6...et;\\n}\\n')]" == "[('linux-2.6....et;\\n}\\n')]"
E             Detailed information truncated, use "-vv" to show

tests\extractcode\test_patch.py:826: AssertionError 
 TestPatchInfo.test_patch_info_patch_patches_misc_linux_st710x_patches_motorola_rootdisk_c_patch  
[gw1] win32 -- Python 2.7.9 C:\projects\scancode-toolkit\Scripts\python.exe 
self = <test_patch.TestPatchInfo testMethod=test_patch_info_patch_patches_misc_linux_st710x_patches_motorola_rootdisk_c_patch>

    def test_patch_info_patch_patches_misc_linux_st710x_patches_motorola_rootdisk_c_patch(self):
        test_file = self.get_test_loc(u'patch/patches/misc/linux-st710x/patches/motorola_rootdisk.c.patch')
        expected_file = self.get_test_loc('patch/patches/misc/linux-st710x/patches/motorola_rootdisk.c.patch.expected')
        with open(expected_file, 'rb') as expect:
            expected = expect.read()
            result = pprint.pformat(list(patch.patch_info(test_file)))
>           assert expected == result
E           AssertionError: assert "[('linux-2.6...PAGE_SIZE)')]" == "[('linux-2.6....PAGE_SIZE)')]"
E             Detailed information truncated, use "-vv" to show

tests\extractcode\test_patch.py:986: AssertionError

RPM with an XZ-compressed cpio payload is not extracted correctly

This : http://mirror.centos.org/centos/6/os/x86_64/Packages/abrt-2.0.8-26.el6.centos.x86_64.rpm
results in:

$ ./scancode --extract abrt-2.0.8-26.el6.centos.x86_64.rpm 
Extracting archives...
  [####################################]  1
Extraction errors or warnings for: abrt-2.0.8-26.el6.centos.x86_64.rpm
  ERROR: No error returned
Extracting done.

Usage documentation is not correct on Windows

OS: Windows 7 - 64bits
Python 2.7
scancode version: 1.0.0

(scancode-toolkit-1.0.0) C:\Users\Downloads\scancode-toolkit-1.0.0>scancode --help
Usage: scancode-script.py [OPTIONS] <input> <output_file>

There is no scancode-script.py in the extracted directory.
The usage doc is incorrect. It should be

Usage: scancode [OPTIONS] <input> <output_file>

Add "How to cite" to readme

from #54 (comment)
by @retrography:

We can't include that kind of disclaimer in an academic paper. An academic citation follows a very specific format. I give you an example from the statnet package that I use regularly for my analysis:

Handcock M, Hunter D, Butts C, Goodreau S, Krivitsky P, Bender-deMoll S and Morris M (2015). statnet: Software Tools for the Statistical Analysis of Network Data. The Statnet Project.

You can also provide the bibliographic entry, so that the users can format the citation according to the outlet they publish in:

@Misc{,
  author = {Mark S. Handcock and David R. Hunter and Carter T. Butts and Steven M. Goodreau and Pavel N. Krivitsky and Skye Bender-deMoll and Martina Morris},
  title = {statnet: Software Tools for the Statistical Analysis of Network Data},
  organization = {The Statnet Project (\url{http://www.statnet.org})},
  year = {2015},
  note = {R package version 2015.6.2},
  url = {CRAN.R-project.org/package=statnet},
}

Have a look here: https://en.wikipedia.org/wiki/BibTeX

Add windows-based build with appveyor

... to ensure that windows is tested

Unable to finish extraction

Operating system: Windows 7, 64-bit
I downloaded the RPM ant-1.9.2-9.el7.noarch.rpm from the URL http://mirror.centos.org/centos/7.1.1503/os/x86_64/Packages/ant-1.9.2-9.el7.noarch.rpm and saved it in a directory called distro.
When I ran the command scancode --extract C:\doc\linux_distro\data\distro, scancode was not able to finish the extraction, even though I waited for more than half an hour.
Interestingly, I was able to see the folder ant-1.9.2-9.el7.noarch.rpm-extract, but I did not see any message in the terminal saying that the extraction had finished.

Transparent extraction of archives

As noted in #3, we do not extract and scan at the same time.

A better way would be to internally handle an archive as if it were a special type of directory (both contain files, after all). When a single archive scan is requested (or when archives are found in a larger scan), we could extract these to a temp directory, scan the extracted files and return the results. This would require a bit more thinking to get it right.
At a high level, a tree with archives would be considered the same as a tree with directories. Archives would become just a special type of directory-like container for more files.

We could expose an os.walk-like function that would transparently extract archives to a temp directory and yield a real path and the temp location of a given file.
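Such a walk function might look like the sketch below. It is only an illustration under assumptions: it handles zip archives alone (via the standard `zipfile` module) as a stand-in for real archive support, the `walk_with_archives` name is hypothetical, and the `-extract` suffix in reported paths simply mirrors the naming of extraction folders. Temp directories are removed when the generator finishes, so each file has to be scanned while iterating.

```python
import os
import shutil
import tempfile
import zipfile


def _walk(real_root, reported_root, temp_dirs):
    for top, dirs, files in os.walk(real_root):
        dirs.sort()
        for name in sorted(files):
            real = os.path.join(top, name)
            reported = os.path.join(
                reported_root, os.path.relpath(real, real_root))
            if zipfile.is_zipfile(real):
                # Treat the archive as a directory: extract to a temp dir
                # and walk it, including any archives nested inside.
                target = tempfile.mkdtemp(prefix='extract-')
                temp_dirs.append(target)
                with zipfile.ZipFile(real) as archive:
                    archive.extractall(target)
                yield from _walk(target, reported + '-extract', temp_dirs)
            else:
                yield reported, real


def walk_with_archives(root):
    """Yield (reported_path, real_path) for files, looking inside archives."""
    temp_dirs = []
    try:
        yield from _walk(root, root, temp_dirs)
    finally:
        # Clean up every temp extraction directory once the walk ends.
        for temp_dir in temp_dirs:
            shutil.rmtree(temp_dir, ignore_errors=True)
```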

Weird copyright detected in binary data stream