
gemini's Introduction

Gemini

Find similar code in Git repositories

Gemini is a tool for searching for similar 'items' in source code repositories. The supported granularity levels for items are:

  • repositories (TBD)
  • files
  • functions

Gemini is based on its sister research project codenamed Apollo.

Run

./hash   <path-to-repos-or-siva-files>
./query  <path-to-file>
./report

You need to prefix commands with docker-compose exec gemini if you run Gemini in Docker. See below for how to start Gemini in Docker or in standalone mode.

Hash

To pre-process a number of repositories for quick duplicate detection, run

./hash ./src/test/resources/siva

Input format of the repositories is the same as in src-d/Engine.

To pre-process repositories for searching for similar functions, run:

./hash -m func ./src/test/resources/siva

Besides the local file system, Gemini supports several distributed storages (see the Distributed storages section below).

Query

To find all duplicates of a single file, run

./query <path-to-single-file>

To find all similar functions defined in a file, run:

./query -m func <path-to-single-file>

If you are interested in similarities for only one function defined in the file, run:

./query -m func <path-to-single-file>:<function name>:<line number where the function is defined>
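
For example, to look for functions similar to a function named parse defined at line 10 of cmd/parser.go (the file path, function name and line number here are only placeholders):

# path, function name and line number below are placeholders
./query -m func ./cmd/parser.go:parse:10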

Report

To find all duplicate files and similar functions in all hashed repositories, run

./report

All repositories must be hashed beforehand, and a community detection library must be installed.

Requirements

Docker

Start containers:

docker-compose up -d

Local directories repositories and query are available as /repositories and /query inside the container.

Examples:

docker-compose exec gemini ./hash /repositories
docker-compose exec gemini ./query /query/consumer.go
docker-compose exec gemini ./report

Standalone

You would need:

  • JVM 1.8
  • Apache Cassandra or ScyllaDB
  • Apache Spark 2.2.x
  • Python 3
  • Bblfshd v2.5.0+

By default, all commands use:

  • an Apache Cassandra or ScyllaDB instance available at localhost:9042
  • Apache Spark, available through $SPARK_HOME

# save some repos in .siva files using Borges
echo -e "https://github.com/src-d/borges.git\nhttps://github.com/erizocosmico/borges.git" > repo-list.txt

# get Borges from https://github.com/src-d/borges/releases
borges pack --loglevel=debug --workers=2 --to=./repos -f repo-list.txt

# start Apache Cassandra
docker run -p 9042:9042 \
  --name cassandra -d rinscy/cassandra:3.11

# or ScyllaDB, with a workaround for https://github.com/gocql/gocql/issues/987
docker run -p 9042:9042 --volume $(pwd)/scylla:/var/lib/scylla \
  --name some-scylla -d scylladb/scylla:2.0.0 \
  --broadcast-address 127.0.0.1 --listen-address 0.0.0.0 --broadcast-rpc-address 127.0.0.1 \
  --memory 2G --smp 1

# to get access to DB for development
docker exec -it some-scylla cqlsh

Configuration for Apache Spark

Use env variables to set the memory for the hash job:

export DRIVER_MEMORY=30g
export EXECUTOR_MEMORY=60g

To use an external cluster, set the URL of the Spark Master through an env var:

MASTER="spark://<spark-master-url>" ./hash <path>
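
Putting the Spark settings together, a hash run against an external cluster might look like the following sketch (the master URL and repository path are placeholders; the memory values are just the ones from the example above):

# master URL and path are placeholders
export DRIVER_MEMORY=30g
export EXECUTOR_MEMORY=60g
MASTER="spark://spark-master.internal:7077" ./hash ./repos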

CLI arguments

All three commands accept parameters for database connection and logging:

  • -h/--host - cassandra/scylla db hostname, default 127.0.0.1
  • -p/--port - cassandra/scylla db port, default 9042
  • -k/--keyspace - cassandra/scylla db keyspace, default hashes
  • -v/--verbose - produce more verbose output, default false
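
For example, a query against a database on another host and a non-default keyspace might look like this (the host and keyspace are placeholders):

# host and keyspace below are placeholders
./query -h 10.0.0.5 -p 9042 -k my_hashes <path-to-file>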

For the query and hash commands, parameters for bblfsh/feature extractor configuration are also available:

  • -m/--mode - similarity modes: file or function, default file
  • --bblfsh-host - babelfish server host, default 127.0.0.1
  • --bblfsh-port - babelfish server port, default 9432
  • --features-extractor-host - features-extractor host, default 127.0.0.1
  • --features-extractor-port - features-extractor port, default 9001
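
For example, a function-mode query against a Babelfish server and feature extractor running on a separate machine might look like this (the host is a placeholder):

# host below is a placeholder
./query -m func --bblfsh-host 10.0.0.7 --features-extractor-host 10.0.0.7 <path-to-file>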

Hash command specific arguments:

  • -l/--limit - limit the number of repositories to be processed. All repositories will be processed by default
  • -f/--format - format of the stored repositories. Supported input data formats that repositories could be stored in are siva, bare or standard, default siva
  • --gcs-keyfile - path to JSON keyfile for authentication in Google Cloud Storage
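
For example, hashing at most 100 repositories stored in bare format might look like this (the path is a placeholder):

# path below is a placeholder
./hash -f bare -l 100 /path/to/bare/repos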

Report specific arguments:

  • --output-format - output format: text or json
  • --cassandra - Enable advanced cql queries for Apache Cassandra database
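
For example, a report in JSON format using the Cassandra-only queries might look like this:

./report --output-format json --cassandra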

Limitations

Currently Gemini targets medium-sized repositories and datasets.

We set reasonable defaults and pre-filtering rules to provide the best results for this case. The rules are:

  • Exclude binary files
  • Exclude empty files from full duplication results
  • Exclude files less than 500B from file-similarity results
  • Similarity deduplication works only for languages supported by Babelfish and only for syntactically correct files

Performance tips

We recommend running Spark with 10 GB+ of memory for each executor and for the driver. Gemini would not benefit from more than 1 CPU per task.

Horizontal scaling does not work well for the first stage of the pipeline; its runtime depends on the size of the biggest repositories in the dataset. The rest of the pipeline scales well.

Distributed storages

Gemini supports different distributed storages in local and cluster mode. All the necessary jars are already included in the fat jar.

HDFS

Path format to git repositories: hdfs://hdfs-namenode/path
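
For example, hashing repositories stored in HDFS might look like this (the namenode and path are placeholders):

# namenode and path below are placeholders
./hash hdfs://hdfs-namenode/path/to/repos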

To configure HDFS in local or cluster mode please consult Hadoop documentation.

Google Cloud Storage

Path format to git repositories: gs://bucket/path

To connect to GCS locally, use the --gcs-keyfile flag with the path to a JSON keyfile.
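
For example, hashing repositories stored in GCS locally might look like this (the bucket, path and keyfile location are placeholders):

# bucket, path and keyfile location below are placeholders
./hash --gcs-keyfile /path/to/keyfile.json gs://my-bucket/repos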

To use GCS in cluster mode please consult Google Cloud Storage Connector documentation.

Amazon Web Services S3

Path format to git repositories: s3a://bucket/path

To connect to S3 locally, use the following flags:

  • --aws-key - AWS access key
  • --aws-secret - AWS access secret
  • --aws-s3-endpoint - region endpoint of your S3 bucket

Due to limitations, passing the key and secret as part of the URI is not supported.
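
For example, hashing repositories stored in S3 locally might look like this (the bucket, region endpoint and credentials are placeholders):

# bucket, endpoint and credentials below are placeholders
./hash --aws-key <access-key> --aws-secret <secret> --aws-s3-endpoint s3.eu-west-1.amazonaws.com s3a://my-bucket/repos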

To use AWS S3 in cluster mode please consult the hadoop-aws documentation.

Known bugs

  • Search for similarities in C# code isn't supported right now (patch with workaround)
  • The timeout for UAST extraction is relatively low for real datasets, in our experience, and it isn't configurable (patch1 and patch2 with workaround)
  • For the standard & bare formats gemini prints a wrong repositories listing (issue)

Development

Compile & Run

If the env var DEV is set, ./sbt is used to compile and run all non-Spark commands: ./hash and ./report. This is convenient for local development: not requiring a separate "compile" step allows for a dev workflow similar to the experience with interpreted languages.
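
For example, assuming any non-empty value of DEV enables the sbt-based path, a quick local run might look like this:

# assumes any non-empty DEV value triggers the sbt-based path
DEV=true ./report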

Build

To build the final .jars for all commands:

./sbt assemblyPackageDependency
./sbt assembly

Instead of one fat jar we build two, separating all the dependencies from the actual application code to allow for lower build times in case of simple changes.

Test

To run the tests:

./sbt test

Re-generate code

The latest generated gRPC code is already checked in under src/main/scala/tech/sourced/featurext. If you update any of the src/main/proto/*.proto files, you will need to re-generate the gRPC code for the Feature Extractors:

./src/main/resources/generate_from_proto.sh

To generate new protobuf message fixtures for tests, you may use bblfsh-sdk-tools:

bblfsh-sdk-tools fixtures -p .proto -l <LANG> <path-to-source-code-file>

License

Copyright (C) 2018 source{d}. This project is licensed under the GNU General Public License v3.0.

gemini's People

Contributors

bzz, carlosms, dpordomingo, marnovo, smacker


gemini's Issues

Report: community detection library

Part of the #54: create an internal library for community detection.

Small internal library in Python (a function or class) for community detection using IGraph, based on the algorithm from graph.py#detect_communities().

It should include some very basic tests (but not really test the IGraph implementation itself).

Release v0.0.1

Setup release automation if needed and produce a first release.

  • tag
  • release notes
  • GH release
  • CI automation on tag
  • push image to DockerHub (handled separately under #33 )

Add URLs to the output of report and query commands

Show links to the files when duplicates are found.

  • Add reference hash to DB.
  • For Github https://github.com/<repo>/blob/<ref_hash>/<file_path>
  • For GitLab, the URL schema is the same as GitHub's
  • For BitBucket, according to the docs: https://bitbucket.org/<repo>/src/<ref_hash>/<file_path>

Improve listRepositories function

It should be able to work with all "formats": siva (regular and buckets), bare, regular.
It should also provide different output depending on the number of repositories.

Report: community detection cli app

Part of #54: small app to read connected components, do community detection and pretty-print the results.

A CLI app in Python that reads the connected components from the Parquet created in #58, does community detection using the library in #59, and prints the output to STDOUT, in a format consistent with the current duplicate output format.

A call to this app should also be included in ./report, right after duplicate detection, so that for an end user the whole process looks like a single application.

It is not mandatory, but it might be a good idea to use some kind of simple templating library for formatting the output, so templates can be shared between the JVM and Python.

Flaky tests: Dataframe containing duplicated files is not properly saved

Problem

Tests sometimes fail with the following output.

How to reproduce

Run ./sbt test many times and it will sometimes fail. It is difficult to reproduce in CI (because it takes more time), but I could see that error there a couple of times.

Using the tip of dpordomingo/gemini::reproduce-issue-22, I created a gist with the 3 scenarios I found:

Test succeeds

  • No data is stored in the test_hashes_duplicates

Test fails

  • No data at all is stored in the test_hashes_duplicates
  • Partial data is stored in the test_hashes_duplicates
    • report3-failing.txt -> output of the failing test
    • report3-failing-db.txt -> content of the testing keyspaces. Only half of the data retrieved by the engine was saved in test_hashes_duplicates (the srcd/borges repo is hashed, but the erizocosmico/borges repo is lost).

More info

It requires more investigation, but it could be related to how data is stored during the tests, before the test cases start running.

Currently, in the beforeAll section, two keyspaces are created (test_hashes_duplicates for tests needing duplicates, test_hashes_uniques for tests needing no duplicates).
This is done in two stages:

  1. engine returns a DataFrame containing all the files in the given siva files,
  2. the DataFrame is stored in Cassandra <-- (I think something fails at this point)

After beforeAll the test cases run, and (sometimes) the ones asserting that there are duplicate files in test_hashes_duplicates fail.

It can be seen in the logs that in all situations the DataFrame is populated with the right contents (1, 2, 3).
But then:

  • when no data is stored at all, at the save stage the output always contains Wrote 0 rows to test_hashes_duplicates.blob_hash_files
  • when half of the data is stored, at the save stage there is one row that says Wrote 47 rows to test_hashes_duplicates.blob_hash_files
  • when everything goes well, at the save stage there are some rows that say Wrote n rows to test_hashes_duplicates.blob_hash_files, and they sum to 80 (33 rows + 47 rows)

Hints

  • Why, when duplicate data is partially saved into test_hashes_duplicates, are the missing files from github.com/erizocosmico/borges.git? (here)
  • Why is the DataFrame always populated with the right data, but sometimes its data is not properly saved?

Report: add similar files

Umbrella issue for updating the current implementation to use "similarity" from Apollo:

  • Query DB, detect connected components, write Parquet in Scala #58
  • Internal library in Python (function/class) for community detection with IGraph #85
  • App in Python, reading Parquet, doing community detection using the library above and printing the output #60

Processing of siva files bigger than 2 GB

There is a limit for a job in Spark: it's 2 GB. We need to investigate how to change it, if possible, and how that will affect Spark (the limit was introduced for some reason).

If somebody else looks at it, here is a tip: it looks like the limit actually comes not from Spark but from the JVM. I could be wrong. JFYI.

Exception:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 415 in stage 1.0 failed 4 times, most recent failure: Lost task 415.3 in stage 1.0 (TID 1072, 10.2.15.79, executor 8): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
	at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
	at tech.sourced.siva.SivaReader.getEntry(SivaReader.java:42)
	at tech.sourced.engine.provider.RepositoryObjectFactory$$anonfun$genSivaRepository$1.apply(RepositoryProvider.scala:209)

Run Gemini file-level duplicate detection on PGA

Document in the README the resources needed to successfully process 1k, 2k, 10k, 100k and the whole PGA of .siva files.

A good start would be to:

  • document the known configuration of the cluster we use internally
  • run Gemini hash, documenting how long it takes to finish,
  • what CPU-load, memory and IO-throughput workload it creates on that cluster (e.g. from the Sysdig dashboard; to get access, file an issue)
  • see which resource is a bottleneck
  • try to optimize, in order to utilize that resource better (e.g. in case of throughput, have more executor JVMs running on the same machine)
  • see if we are hit by, and can help with, some Engine issues

Add Scala linter to CI

TODO:

  • add/document sbt task with linter and code style
  • integrate linter in CI

As also noted in #48 (review):

It would be nice if all of us used the same configuration for code style.

Query: implement actual query logic

Part of the #53

Add query.py#query() logic to Gemini Scala API:

  • no need to handle args.id case
  • need to agree with the ML team on a way to read WMH params like htnum and band_size, which must be the same as those used for hashing
  • use a DB from an Apollo hash run
  • use the vocabulary in .asdf (OrderedDocumentFrequencies), built by Apollo hash with --docfreq

Update query integration test

Part of the #53

We have one that does hash and then looks for duplicates. We need to update it to test similarity too.

I suggest such a plan:

  • Amend current test dataset to include similar file(s)
  • Populate the DB table hashtables with a fixture created using Apollo with that dataset
  • Use docFreq & params from Apollo as well
  • Run bblfsh server & fe server
  • Check query output and make sure both duplicates and similar files appear in the output

Improve docker image management in CI

We have scripts that launch Docker images, used by Travis CI. To skip them we look for the STYLE_CHECK env var, but that is no longer the only case where we should skip them.

if [[ -z "${STYLE_CHECK}" ]]; then

From this thread #83 (comment)

@carlosms

Maybe we should find a better way to skip docker images? For instance we will have at some point lint for python, scala, and other possible new tests that need a different env.
Would it work if in travis.yml we moved the before_script: inside each matrix entry? It would be more verbose, but easier to spot where we are launching docker.
I'm not sure about the best option, but it's something worth looking into.

@smacker

The current check is copy-pasted from a script we already use. But I agree we need something better. Maybe also use docker-compose, as proposed by smola.

And from this other thread #83 (comment)

@smola

I understand this is not strictly related to this PR, since you were already doing this previously with scripts/start_docker_db.sh, but you might consider using docker-compose for this in the future.

It can be easily pre-installed in Travis (see here) and it's a single file with more concise syntax to define Docker dependencies. You can choose to start all of them with up, or a specific one with up bblfsh.

@smacker

Yes. We are going to have 3 Docker images (db, bblfsh, feature extraction) as dependencies for gemini. It makes perfect sense to use docker-compose in the future. (Actually, I already do locally.)

FE: feature extractors

This is an umbrella issue for Feature Extractors (FE) implementation:

  • Define gRPC Service, messages in src/main/proto/*.proto #52
  • Generation scripts: server in python src/main/python/generated.py, #57
    Python CLI app src/main/python/main.py configures the port
  • Generation scripts: client in scala src/main/scala #63
  • Docker file with feature extractor (used on CI) #73
  • Implement FE service: call sourced.ml.extractors #73
  • Add missing weight parameter to extractors #79
  • Integration Tests: add CI profile, that for a given UAST checks the response is not failing #81

Perf: measure performance of file-based similarity

Umbrella issue for adding perf measurements to every command for file-based similarity:

  • Add a CLI flag for ./query and ./report to print the time for each stage
  • Instrument FE, exposing one endpoint with JSON
  • Instrument Apache Spark hashing job using org.apache.spark.groupon.metrics.UserMetricsSystem to expose to Spark JSON endpoint

Query: add similar files

Umbrella issue to update the current implementation to use the "similarity" notion from Apollo:

  • UAST extraction #68
  • talk to FE gRPC, to extract features #69
  • Find/write CPU WMH implementation in Scala (tests!) #78
  • Write query.py#query() logic on Scala side #89
  • Update ./query CLI output to include similar files #92
  • Integration test with all components (except hash, use cql fixture for it)

Make all commands apply schema (if not exist)

Right now, only the tests do not rely on an existing schema in the DB.

Each command (hash, query, report) should be changed to do the same and create the schema, instead of failing if it does not exist.

Implement report command

This command will output the duplicate files among all the hashed files:

  • <file_path> | <repo> | <sha1>
  • show the <sha1> only when the command is run with --verbose mode
  • add a link to the file in GH (based on HEAD) -> https://github.com/<repo>/blob/HEAD/<file_path>

Add Docker compose for dependencies

To streamline first user experience:

  • add simple Docker compose script for dependencies: DB and FE - it will enable part of #84
  • make CI use it (so only appropriate containers are started - as suggested in #91)

Simplify arguments for report command

Currently we have 2 flags: group-by and condensed. Both of them require Cassandra and can't work with ScyllaDB. We think the condensed flag is quite confusing, so we propose keeping only the group-by flag.

Integration Tests on CI in Spark Standalone cluster mode

Right now we have a profile on CI that does Integration Tests with Apache Spark in local mode.

In order to catch trickier issues, e.g. with runtime classpath collisions, we also need to test in Apache Spark Standalone cluster mode.

TODOs:

  • add a new profile on CI with INTEGRATION_TESTS=true
  • same test scenario as in local mode, except:
  • start Apache Spark Master and Workers manually
  • run Gemini using the above with MASTER="spark://127.0.0.1:7077" ./hash ...

Release: update release artifacts

Right now, Gemini has 3 use cases:

  1. local, for a developer on a pre-configured environment, using shell scripts
  2. local, for a first-time user, #84 (comment)
  3. k8s with an Apache Spark cluster, Dockerfiles: Gemini with Spark, Feature Extractors

The current release artifacts are only a .tar file for case 1 and a Dockerfile.

This issue is about changing the release process to accommodate recent changes on file-level similarity so:

  • add .sh for starting a feature extractor process on local machine
  • in ./report, check if Python is available
  • make sure Dockerfile for FE re-uses the shell scripts (as much as possible)
  • add publishing Docker container for FE to the release process
    (this will enable testing Gemini on 3rd use-case)

This way, a new release should accommodate both cases 1 and 3 from above.

Release: include LICENSE file in the release artifacts

As discussed in #63 (comment), this project is governed by GPL3 but includes a few (well identified .proto) files under BSD-like licenses, thus we need to make sure we keep the copyright notices/disclaimers and redistribute them as a part of the final released artifacts.

To do so, we just need to:

Query: CPU WMH implementation in Scala

Part of #53 and #55

For both querying and hashing similar files we need to have a CPU-based WeightedMinHash implementation on the JVM, to avoid depending on the MinHashCUDA lib and a GPU.

Here are a few reference implementations of this algorithm:

It might make sense to research whether such a library already exists in the JVM ecosystem, and if not, to implement one in Java as part of the Gemini codebase. This way it can eventually be used from Java, Scala, Clojure, etc. as a standalone library.

Correctness verification is of paramount importance, so we would need to have some tests, and maybe some reference data with hashes of some sets, produced by one of the implementations above.

Hash: add similar files

Umbrella issue for adding hashing similar files using Apache Spark:

  • UAST, feature extraction (same args as in query)
  • Tokenize, Vectorize: wBOW, tf/idf - defining a pipeline
    • generate docFreq.json and params.json
    • vectorize: file->set
  • Use CPU-based WMH implementation in Scala from #67 to hash every set
  • Take the results of the hash pipeline, create hash tables, write to ScyllaDB with the schema as in Apollo #111
  • Test correctness: Apollo and Gemini produce the same results (use the same seed if needed)

Document easy way and hard way to run Gemini

After #95, update the README structure, add

This change can be small: it does not need to be "complete" in documentation coverage or in full conformance with the doc guide from above, but it has to set up the right "structure" of our user-facing documentation, which will be improved later.

Fix WARN on ./report

Right now, after #90 (comment), the ./report command (at least on macOS) produces

WARN 11:57:50 org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner (FileSystem.java:2995) - exception in the cleaner thread but it will continue to run
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
	at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:2989)
	at java.lang.Thread.run(Thread.java:748)

It may or may not be related to HADOOP-12829; the goal is to find out and fix it.

Update output of scala applications

For query and report tools:

  • add -v flag, producing more verbose debug output
  • replace println() with proper slf4j logging (same as Apache Spark uses) on different levels INFO/DEBUG

The Spark application logging will be handled in a different issue.

CI: fix by switching to Docker instead of EmbeddedCassandra

Right now the tests use EmbeddedCassandra to speed up CI without Docker.

But a proper INTEGRATION_TESTS=true profile would not be possible without Docker in before_script anyway, so it would make sense to just rely on it in both dev and CI flows (as we do in other projects).

Flaky tests: org.eclipse.jgit.errors.MissingObjectException

On local ./sbt runs as well as on CI, tests sometimes become flaky with org.eclipse.jgit.errors.MissingObjectException.

Example:

ERROR 15:13:13,185 org.apache.spark.internal.Logging$class (Logging.scala:91) - Exception in task 0.0 in stage 1.0 (TID 2)
org.eclipse.jgit.errors.MissingObjectException: Missing commit 4aa29ac236c55ebbfbef149fef7054d25832717f
	at org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:164)
	at org.eclipse.jgit.revwalk.RevWalk.getCachedBytes(RevWalk.java:903)

CI

This might or might not have something to do with either Engine behaviour or with the test fixtures that we have.

This issue is about investigating the reason and if needed, filing appropriate issues elsewhere to fix it.

Update output of Spark application

For ./hash:

  • migrate to the logger provided by Apache Spark
  • output: mute meaningless Spark logs (through log4j.config)

This would most probably mean refactoring the Gemini class to have the logging dependency injected (through the constructor or a .set...() method), as it will be instantiated by the clients (Spark and non-Spark ones) in different ways.

duplicates in repository

When we have identical files in a repository, only one is written to the DB, and the duplicates won't appear in the report because the primary key is the same for them.

Steps to reproduce:

smacker at Maxims-MacBook-Air in ~/tmp/testrepo on master*
$ ls
CONTRIBUTING.md file.py         file_2.py

file_2.py is a copy of file.py.

Run engine to collect files:

+--------------------+--------------------+---------------+--------------------+----+
|         commit_hash|           file_hash|           path|       repository_id|name|
+--------------------+--------------------+---------------+--------------------+----+
|06e561f1a7d6db4f3...|c4e5bcc8001f80acc...|      file_2.py|file:///Users/sma...|HEAD|
|06e561f1a7d6db4f3...|eaf26a547aa54cde7...|CONTRIBUTING.md|file:///Users/sma...|HEAD|
|06e561f1a7d6db4f3...|c4e5bcc8001f80acc...|        file.py|file:///Users/sma...|HEAD|
+--------------------+--------------------+---------------+--------------------+----+

Check what we have in DB after hash:

cqlsh:hashes> select * from blob_hash_files;

 blob_hash                                | repo                               | file_path
------------------------------------------+------------------------------------+-----------------
 eaf26a547aa54cde7079567d832ac05880eb6bd2 | file:///Users/smacker/tmp/testrepo | CONTRIBUTING.md
 c4e5bcc8001f80acc238877174130845c5c39aa3 | file:///Users/smacker/tmp/testrepo |       file_2.py

(2 rows)

Speedup CI

The longest task is the integration one. Most of the time is taken by:

  • installing Python. We can improve it a bit by installing some deps from apt instead of building them
  • Docker for FE. We can remove it now that we have Python in CI
  • the build. We can try to use the Travis cache to improve it

Also check that we don't run services when they aren't necessary.
