guacsec / guac

GUAC aggregates software security metadata into a high fidelity graph database.

License: Apache License 2.0

Go 99.57% Nix 0.01% Makefile 0.28% Shell 0.13% Starlark 0.01%

security software-supply-chain software-supply-chain-security supply-chain supply-chain-security supply-chain-visibility supply-chain-analytics

guac's Introduction

GUAC: Graph for Understanding Artifact Composition

Note: GUAC is under active development. If you are interested in contributing, please look at the contributor guide.

Graph for Understanding Artifact Composition (GUAC) aggregates software security metadata into a high fidelity graph database—normalizing entity identities and mapping standard relationships between them. Querying this graph can drive higher-level organizational outcomes such as audit, policy, risk management, and even developer assistance.

Conceptually, GUAC occupies the “aggregation and synthesis” layer of the software supply chain transparency logical model:

A few examples of questions answered by GUAC include:

Quickstart

Our documentation is a good place to get started.

We have various demo use cases that you can take a look at.

Start the GUAC services with our docker compose quickstart.

Docs

All documentation for GUAC lives on docs.guac.sh, backed by the docs GitHub repository.

Architecture

Here is an overview of the architecture of GUAC:

For an in-depth view and explanation of components of the GUAC Beta, please refer to how GUAC works.

Supported input documents

Note that GUAC uses software identifier standards to help link metadata together. However, these identifiers are not always available, and heuristics need to be used to link them. Therefore, there may be unhandled edge cases and errors when ingesting data. We would appreciate it if you could create a data quality issue if you encounter any errors or bugs with ingestion.

GraphQL backends

GUAC supports multiple backends behind a software abstraction layer. The GraphQL API is always the same and clients should be unaffected by which backend is in use. The backends are categorized into:

Supported/Unsupported: Supported backends are those which the GUAC project is committed to actively maintain. Unsupported backends are not actively maintained but will accept community contributions.
Complete/Incomplete: Complete backends support all mandatory GraphQL APIs. Incomplete backends support a subset of those APIs and may not be feature complete.
Optimized: The backend has gone through a level of optimization to help improve performance.

The two backends that are Supported, Complete, and Optimized are:

keyvalue (supported, complete, optimized): a non-persistent in-memory backend that doesn't require any additional infrastructure. Also acts as a conformance backend for API implementations. We recommend starting with this if you're just starting with GUAC!
ent (supported, complete, optimized) with PostgreSQL: a persistent backend based on Entity Framework for Go that can run on various SQL backends. GUAC only supports ent with PostgreSQL. Other ent backends such as MySQL and SQLite are unsupported.

The other backends are:

arangoDB (unsupported, incomplete, optimized): a persistent backend based on ArangoDB
neo4j/openCypher (unsupported, incomplete): a persistent backend based on neo4j and openCypher. This backend should work with any database that supports openCypher queries.
keyvalue: Redis (experimental, complete): The default keyvalue backend, but using Redis as storage.
keyvalue: TiKV (experimental, complete): The default keyvalue backend, but using TiKV as storage.

Additional References

Communication

For more information on how to get involved in the community, mailing lists, and meetings, please refer to our community page.

For security issues or code of conduct concerns, an e-mail should be sent to [email protected].

Governance

Information about governance can be found here.

guac's Issues

task: [processor] create ITE6 DocumentTypeGuesser

As part of #26, we want to be able to create guessers to identify what format type a document is (using the guesser interface defined in https://github.com/guacsec/guac/tree/main/pkg/handler/processor/guesser). Based on the foundations laid in #27 .

This issue is to create ITE6 document guesser.

task: [ingestor] create Key interface for ingestor

Create a Key interface that will be used to implement various key providers.

IdentityFor edge should be generic

The IdentityFor edge should apply to almost any type of document/node, and thus should be definable on any GuacNode. This should be done along with any other cleanup required around identity for graphBuilder.

task: [dochandler] create collector interface

Create collector interface to be able to emit lists of documents to be processed by the docprocessor. (#16)

Write document guesser test to make sure that other guessers are not accidentally misguessing another document type

Certain document type importers may not have sufficient heuristics to determine if a document is indeed the type guessed. For example, if the fields in the JSON are optional for that document type, then it may mistake any JSON document for its own document type. (This happens in certain cases with SPDX, hence the requirement to check for the existence of fields.)

We should write a unit test to make sure that no other document guesser misguesses a document.
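A minimal sketch of such a cross-check follows. The `DocumentType` values, the `guesser` interface, and both guesser implementations are hypothetical stand-ins for illustration, not the types in pkg/handler/processor/guesser. The idea is to run every guesser against every sample and flag any guesser that claims a sample of another type.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// DocumentType and guesser are hypothetical stand-ins for illustration; they
// are not the types defined in pkg/handler/processor/guesser.
type DocumentType string

const (
	DocumentUnknown DocumentType = "UNKNOWN"
	DocumentSPDX    DocumentType = "SPDX"
	DocumentITE6    DocumentType = "ITE6"
)

type guesser interface {
	GuessDocumentType(blob []byte) DocumentType
}

// spdxGuesser requires identifying fields to be present, because decoding
// alone succeeds for any JSON object.
type spdxGuesser struct{}

func (spdxGuesser) GuessDocumentType(blob []byte) DocumentType {
	var doc struct {
		SPDXID      string `json:"SPDXID"`
		SPDXVersion string `json:"spdxVersion"`
	}
	if json.Unmarshal(blob, &doc) != nil || doc.SPDXID == "" || doc.SPDXVersion == "" {
		return DocumentUnknown
	}
	return DocumentSPDX
}

// ite6Guesser looks for the in-toto statement type marker.
type ite6Guesser struct{}

func (ite6Guesser) GuessDocumentType(blob []byte) DocumentType {
	var doc struct {
		Type string `json:"_type"`
	}
	if json.Unmarshal(blob, &doc) != nil || !strings.HasPrefix(doc.Type, "https://in-toto.io/Statement/") {
		return DocumentUnknown
	}
	return DocumentITE6
}

// misguesses runs every guesser against every sample and reports each case
// where a guesser claims a sample of another type or rejects its own sample.
func misguesses(guessers map[DocumentType]guesser, samples map[DocumentType][]byte) []string {
	var failures []string
	for gType, g := range guessers {
		for sType, blob := range samples {
			want := DocumentUnknown
			if gType == sType {
				want = gType
			}
			if got := g.GuessDocumentType(blob); got != want {
				failures = append(failures,
					fmt.Sprintf("%s guesser on %s sample: got %s, want %s", gType, sType, got, want))
			}
		}
	}
	return failures
}

func main() {
	guessers := map[DocumentType]guesser{
		DocumentSPDX: spdxGuesser{},
		DocumentITE6: ite6Guesser{},
	}
	samples := map[DocumentType][]byte{
		DocumentSPDX: []byte(`{"SPDXID":"SPDXRef-DOCUMENT","spdxVersion":"SPDX-2.2"}`),
		DocumentITE6: []byte(`{"_type":"https://in-toto.io/Statement/v0.1","subject":[]}`),
	}
	fmt.Println(len(misguesses(guessers, samples)), "misguesses")
}
```

In a real unit test, the same loop would sit inside a table-driven test, with one sample per registered document type.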

task: [ingestor] create SLSA document ingestor

Create a SLSA document ingestor based on @mihaimaruseac's script that takes in SLSA documents as a list of Documents and outputs GuacNodes and GuacEdges. Depends on #11.

task: [collector/ingestor/assembler] import scorecard information

Scorecards information is useful for reasoning about source repositories. It would be great to integrate it into the GUAC data flow.

https://github.com/ossf/scorecard/#public-data

task: [processor] simplify processor interface

Simplify processor interface to remove trust info and validation and output a document tree instead.

SLSA parser crashes on multiple subjects OR multiple hashes

SLSA parser crashes with multiple subjects or multiple hashes

Multiple digests error:
SLSA multiple digests example:

  "subject": [
    {
      "name": "gs://kubernetes-release/release/v1.25.2/bin/linux/arm64/kube-apiserver",
      "digest": {
        "sha256": "5522c9bcd76863fa24a658d9faeb6fa2ca999d022806e301e922efca747043f6",
        "sha512": "aa989e60525ac208bc1a7469b486eecb02bf4e7ceb3530c97bae5e0cbc8d4361ce040a8899fa7d9eb56f573fdfc605325e4fcaf956f5efa930cf1a52cb5ebb10"
      }
    }
  ],

Error:

panic: runtime error: index out of range [342] with length 342

goroutine 1 [running]:
github.com/guacsec/guac/pkg/assembler.StoreGraph({{0xc00017c800, 0x156, 0x180}, {0xc000372000, 0x3f9, 0x400}}, {0x1d75098?, 0xc0000c6f20?})
	/Users/lumb/go/src/github.com/guacsec/guac/pkg/assembler/graphdb.go:62 +0xa1d
github.com/guacsec/guac/cmd/guacone/cmd.getAssembler.func1({0xc0005a7a70?, 0x1?, 0xc00019e540?})
	/Users/lumb/go/src/github.com/guacsec/guac/cmd/guacone/cmd/files.go:178 +0xc5
github.com/guacsec/guac/cmd/guacone/cmd.glob..func1.1(0xc00019f020)
	/Users/lumb/go/src/github.com/guacsec/guac/cmd/guacone/cmd/files.go:109 +0x13b
github.com/guacsec/guac/pkg/handler/collector.Collect({0x1d73be8?, 0xc0008001e0}, 0xc00035dcc8, 0xc00035dc58)
	/Users/lumb/go/src/github.com/guacsec/guac/pkg/handler/collector/collector.go:84 +0x2f0
github.com/guacsec/guac/cmd/guacone/cmd.glob..func1(0x2488a80?, {0xc000800180, 0x1, 0x3})
	/Users/lumb/go/src/github.com/guacsec/guac/cmd/guacone/cmd/files.go:125 +0x56a
github.com/spf13/cobra.(*Command).execute(0x2488a80, {0xc000800120, 0x3, 0x3})
	/Users/lumb/go/pkg/mod/github.com/spf13/[email protected]/command.go:876 +0x67b
github.com/spf13/cobra.(*Command).ExecuteC(0x2488d00)
	/Users/lumb/go/pkg/mod/github.com/spf13/[email protected]/command.go:990 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
	/Users/lumb/go/pkg/mod/github.com/spf13/[email protected]/command.go:918
github.com/guacsec/guac/cmd/guacone/cmd.Execute()
	/Users/lumb/go/src/github.com/guacsec/guac/cmd/guacone/cmd/root.go:35 +0x25
main.main()
	/Users/lumb/go/src/github.com/guacsec/guac/cmd/guacone/main.go:23 +0x17

Multiple Subjects error:
SLSA subject section example:

  "subject": [
    {
      "name": "gs://kubernetes-release/release/v1.25.2/bin/windows/amd64/kubectl-convert.exe",
      "digest": {
        "sha512": "aa989e60525ac208bc1a7469b486eecb02bf4e7ceb3530c97bae5e0cbc8d4361ce040a8899fa7d9eb56f573fdfc605325e4fcaf956f5efa930cf1a52cb5ebb10"
      }
    },
    {
      "name": "gs://kubernetes-release/release/v1.25.2/bin/linux/arm64/kube-apiserver",
      "digest": {
        "sha256": "5522c9bcd76863fa24a658d9faeb6fa2ca999d022806e301e922efca747043f6"
      }
    }
  ],

panic: runtime error: index out of range [5] with length 5

goroutine 1 [running]:
github.com/guacsec/guac/pkg/assembler.StoreGraph({{0xc00003c0f0, 0x5, 0x5}, {0xc0001049c0, 0x6, 0x6}}, {0x1d75098?, 0xc000486f20?})
	/Users/lumb/go/src/github.com/guacsec/guac/pkg/assembler/graphdb.go:62 +0xa1d
github.com/guacsec/guac/cmd/guacone/cmd.getAssembler.func1({0xc00030ce70?, 0x1?, 0xc0001046c0?})
	/Users/lumb/go/src/github.com/guacsec/guac/cmd/guacone/cmd/files.go:178 +0xc5
github.com/guacsec/guac/cmd/guacone/cmd.glob..func1.1(0xc000104780)
	/Users/lumb/go/src/github.com/guacsec/guac/cmd/guacone/cmd/files.go:109 +0x13b
github.com/guacsec/guac/pkg/handler/collector.Collect({0x1d73be8?, 0xc00010fb30}, 0xc00063fcc8, 0xc00063fc58)
	/Users/lumb/go/src/github.com/guacsec/guac/pkg/handler/collector/collector.go:84 +0x2f0
github.com/guacsec/guac/cmd/guacone/cmd.glob..func1(0x2488a80?, {0xc00010fad0, 0x1, 0x3})
	/Users/lumb/go/src/github.com/guacsec/guac/cmd/guacone/cmd/files.go:125 +0x56a
github.com/spf13/cobra.(*Command).execute(0x2488a80, {0xc00010fa70, 0x3, 0x3})
	/Users/lumb/go/pkg/mod/github.com/spf13/[email protected]/command.go:876 +0x67b
github.com/spf13/cobra.(*Command).ExecuteC(0x2488d00)
	/Users/lumb/go/pkg/mod/github.com/spf13/[email protected]/command.go:990 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
	/Users/lumb/go/pkg/mod/github.com/spf13/[email protected]/command.go:918
github.com/guacsec/guac/cmd/guacone/cmd.Execute()
	/Users/lumb/go/src/github.com/guacsec/guac/cmd/guacone/cmd/root.go:35 +0x25
main.main()
	/Users/lumb/go/src/github.com/guacsec/guac/cmd/guacone/main.go:23 +0x17
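
For reference, here is a minimal sketch of walking every subject and every digest in a statement's subject list, so that neither a second subject nor a second hash is silently assumed away. It does not diagnose the panic above. The `subject` and `artifact` types and the `artifactsFromSubjects` function are hypothetical, and emitting one artifact per (subject, algorithm) pair is an assumption about the desired graph shape.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// subject mirrors one entry of the "subject" array shown above.
type subject struct {
	Name   string            `json:"name"`
	Digest map[string]string `json:"digest"`
}

// artifact is a hypothetical flattened record: one per subject and algorithm.
type artifact struct {
	Name   string
	Digest string // "<algorithm>:<hex value>"
}

// artifactsFromSubjects decodes the subject list and emits one artifact per
// (subject, digest algorithm) pair. Algorithms are sorted so that the output
// order is deterministic despite Go's randomized map iteration.
func artifactsFromSubjects(blob []byte) ([]artifact, error) {
	var stmt struct {
		Subject []subject `json:"subject"`
	}
	if err := json.Unmarshal(blob, &stmt); err != nil {
		return nil, fmt.Errorf("decode statement: %w", err)
	}
	var out []artifact
	for _, s := range stmt.Subject {
		algs := make([]string, 0, len(s.Digest))
		for alg := range s.Digest {
			algs = append(algs, alg)
		}
		sort.Strings(algs)
		for _, alg := range algs {
			out = append(out, artifact{Name: s.Name, Digest: alg + ":" + s.Digest[alg]})
		}
	}
	return out, nil
}

func main() {
	// Shortened placeholder digests, not real hashes.
	blob := []byte(`{"subject":[
		{"name":"a","digest":{"sha512":"bb","sha256":"aa"}},
		{"name":"b","digest":{"sha256":"cc"}}
	]}`)
	arts, err := artifactsFromSubjects(blob)
	if err != nil {
		panic(err)
	}
	for _, a := range arts {
		fmt.Println(a.Name, a.Digest)
	}
}
```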

Design and implement full integration with pub/sub for GUAC flow for collector/processor/assembler

Collectors that obtain documents need somewhere to emit them to. The processor, which is the next part of the pipeline, needs to gather the documents and process them.

There are a couple options naturally:

Processor runs as a gRPC server
Processor obtains documents from a Pub/Sub queue (e.g. kafka, nats.io, etc.)
Processor ingests from STDIN or file
Processor and Collector are part of the same process.

This boils down to how we want collectors and processors to be run in the architecture. The ingestor will most likely be tied to the assembler.

Deliberation:

Will all the collectors be run in a single executable? That is, the processor will cache duplicate documents, so it is beneficial to have an n:m relationship (where n>m) between collectors and executables. If the answer is no, this excludes options 3 and 4.
- I think it is likely that the answer is no, given that collectors need credentials and no single account/team would have all of them.
Options 1 and 2 are similar, with a trade-off between simplicity and scale.

task: [processor] create SLSA DocumentTypeGuesser

This issue is to create SLSA document guesser.

task: [ingestor] Include the github and commit id information from SLSA attestations

SLSA level 3 attestations should contain source information. This information can be included within the graph, which will help link to other data sources (e.g. scorecards).

task: [assembler] create graphDB package for neo4j

Create a graph DB package that will create an instance of the neo4j/cypher driver to talk to the graph. No need for pluggability of the graph DB for now, since we currently do not foresee supporting additional graph DBs in the near future.

task: [ingestor] create inmemory key provider for ingestor

The inmemory key provider is an implementation of the Key interface. It will store keys in memory.
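
A minimal sketch of what such a provider could look like is shown below. The `KeyProvider` interface, its method names, and the `inMemoryProvider` type are assumptions made for illustration; the issue does not define the actual Key interface.

```go
package main

import (
	"crypto"
	"crypto/ed25519"
	"errors"
	"fmt"
	"sync"
)

var errKeyNotFound = errors.New("key not found")

// KeyProvider is a hypothetical shape for the Key interface; the method
// names and signatures are assumptions, not the project's API.
type KeyProvider interface {
	GetKey(id string) (crypto.PublicKey, error)
	PutKey(id string, key crypto.PublicKey) error
}

// inMemoryProvider keeps keys in a map guarded by a read/write mutex, so it
// is safe for concurrent use but loses all keys when the process exits.
type inMemoryProvider struct {
	mu   sync.RWMutex
	keys map[string]crypto.PublicKey
}

func newInMemoryProvider() *inMemoryProvider {
	return &inMemoryProvider{keys: make(map[string]crypto.PublicKey)}
}

func (p *inMemoryProvider) GetKey(id string) (crypto.PublicKey, error) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	key, ok := p.keys[id]
	if !ok {
		return nil, errKeyNotFound
	}
	return key, nil
}

func (p *inMemoryProvider) PutKey(id string, key crypto.PublicKey) error {
	if id == "" {
		return errors.New("empty key id")
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	p.keys[id] = key
	return nil
}

func main() {
	// Passing nil makes GenerateKey read from crypto/rand.
	pub, _, err := ed25519.GenerateKey(nil)
	if err != nil {
		panic(err)
	}
	var provider KeyProvider = newInMemoryProvider()
	if err := provider.PutKey("release-signing", pub); err != nil {
		panic(err)
	}
	key, err := provider.GetKey("release-signing")
	fmt.Println(key != nil, err)
}
```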

task: [ingestor] update verifier interface with Key wrapper

Update the Verifier interface that adds in the Key Wrapper and register new providers.

task: [assembler] define minimal set of GuacNodes and GuacEdges

Define an initial set of GuacNodes and GuacEdges to satisfy the basic example of SLSA attestations based on @mihaimaruseac 's script and @mlieberman85 's data.

Deps.dev bigquery dataset collector

Write a collector to ingest deps.dev bigquery data https://deps.dev/data

SPDX Heuristic for Syft SPDX SBOMs

Right now, syft isn't putting the top-level package in as an SPDX object.

I think for now we can add a PURL OCI reference type by heuristics based on the name in the document. But I'll open an issue in Syft to include this as well (anchore/syft#1241).

The checksum is not currently stored, but it would be good to also include "name" as the package ref.

{
 "SPDXID": "SPDXRef-DOCUMENT",
 "name": "gcr.io/google-containers/kube-addon-manager-v8.9",
 "spdxVersion": "SPDX-2.2",
 "creationInfo": {
  "created": "2022-10-03T14:41:17.720701835Z",
  "creators": [
   "Organization: Anchore, Inc",
   "Tool: syft-0.58.0"
  ],

SLSA parser digest format contains stray quotes

The current SLSA parser digest string has additional quotes:

{
  "identity": 568482,
  "labels": [
    "Artifact"
  ],
  "properties": {
    "name": "git+https://github.com/kubernetes/kubernetes",
    "digest": "sha1:'3c7da84d8fc03c30d3409e9c846ae4bc2de0b4d5'"
  }
}

chore: reconcile CI with local makefile

Right now, the CI and the makefile differ in a few ways:

Some tests require neo4j to run (perhaps tag them so they do not run locally)
Linting rules in the makefile and CI are different

task: [processor] create DocumentUnknown pre-processor

Create a DocumentUnknown pre-processor that takes in a document blob and guesses the format and document type between each iteration of the processor.

That is, given a Document with a blob, tell me what the type and the format are.

#27 added the initial foundations
TODO

add call from processor

task: [all] do end to end file collector to graph population

Create an end to end command line tool to take in a folder of documents and populate a graph for debugging and to show end to end flow.

Keep consistency in certain identifying properties in GuacNode/Edges

In certain cases, there may be slight variation in identifiers. For example, a digest may appear as "sha256:abc..." or as "SHA256:abc...". This should be handled so that nodes with a common identifier are not duplicated.

This can probably be done easily with the Properties function (https://github.com/guacsec/guac/blob/main/pkg/assembler/nodes.go#L56)
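
A minimal sketch of such a normalization step follows. `normalizeDigest` and `dedupe` are hypothetical helpers, not functions from the repository, and lowercasing the value assumes the digest is hex-encoded (it would corrupt a base64 value).

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeDigest lowercases both the algorithm and the value of a digest of
// the form "<algorithm>:<value>". Lowercasing the value assumes hex encoding.
func normalizeDigest(d string) string {
	alg, value, found := strings.Cut(strings.TrimSpace(d), ":")
	if !found {
		// No algorithm prefix; normalize the whole string.
		return strings.ToLower(alg)
	}
	return strings.ToLower(alg) + ":" + strings.ToLower(value)
}

// dedupe returns the normalized digests in first-seen order, dropping
// entries that differ only in letter case or surrounding whitespace.
func dedupe(digests []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, d := range digests {
		n := normalizeDigest(d)
		if !seen[n] {
			seen[n] = true
			out = append(out, n)
		}
	}
	return out
}

func main() {
	fmt.Println(dedupe([]string{"sha256:abc", "SHA256:ABC", "sha512:def"}))
}
```

Applying the normalization where node properties are produced would keep the comparison logic in one place instead of spreading it across parsers.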

Create Contributing.md and make sure comms channels are usable.

As we get more and more contributors coming in, we want to make sure that there are some contribution guides to help, and make sure that the modes of communication are available (i.e. ensure mailing list works)

Logging library not being developed anymore

For logging, the logrus library is being used. However, it is not actively developed anymore:

Logrus is in maintenance-mode. We will not be introducing new features. It's simply too hard to do in a way that won't break many people's projects, which is the last thing you want from your Logging library (again...).

They recommend using other libraries:

Check out, for example, Zerolog, Zap, and Apex.

task: [collector] change collector.Collect interface to take in emit function

Move the channel logic into Collect and hide all this channel stuff? (reference discussion: https://github.com/guacsec/guac/pull/23/files#r953929317)

Make it Collect(ctx context.Context, emitter processor.Emitter, handleErr collector.ErrHandler)

type ErrHandler func(error) bool

type Emitter func(*processor.Document) error

task: [ingestor] break parser into plugin model

Create parser interface and make it a plugin model similar to the collector and processor.

Add performance warning in README

Add a performance warning to the README noting that the current proof of concept does not include optimizations for neo4j and may see some performance degradation. Create a separate PERFORMANCE.md file with some ideas for improving performance in the meantime.

task: [processor] create DSSE DocumentTypeGuesser

This issue is to create DSSE document guesser.

task: [processor] JSON lines processor

Create a new processor to unpack JSON lines, including creating a new format called FormatJSONLines and document type DocumentJSONLines

JSON lines: https://jsonlines.org/
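
A minimal sketch of the unpacking step, using only the Go standard library, is shown below. `splitJSONLines` is a hypothetical name, and the 16 MiB line limit is an arbitrary assumption. Each non-empty line is validated as standalone JSON and returned as its own blob, ready to be wrapped in a child document.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
)

// splitJSONLines splits a JSON Lines blob into one JSON value per line.
// Blank lines are skipped; a line that is not valid JSON is an error.
func splitJSONLines(blob []byte) ([][]byte, error) {
	var docs [][]byte
	sc := bufio.NewScanner(bytes.NewReader(blob))
	// Raise the default 64 KiB token limit; 16 MiB is an arbitrary choice.
	sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024)
	line := 0
	for sc.Scan() {
		line++
		b := bytes.TrimSpace(sc.Bytes())
		if len(b) == 0 {
			continue
		}
		if !json.Valid(b) {
			return nil, fmt.Errorf("line %d: invalid JSON", line)
		}
		// Copy the bytes: the scanner reuses its buffer on the next Scan.
		docs = append(docs, append([]byte(nil), b...))
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	return docs, nil
}

func main() {
	docs, err := splitJSONLines([]byte("{\"a\":1}\n\n[1,2]\n"))
	if err != nil {
		panic(err)
	}
	for _, d := range docs {
		fmt.Println(string(d))
	}
}
```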

task: [processor/ingestor] CycloneDX support

Add support to ingest CycloneDX documents

task: create make check and make binaries

Create a makefile with make check to check golang compilation, testing and linting

Discuss how to handle identifiers and duplicate scenarios

Some entities may have multiple identifiers. Let's figure out what's the best way to handle them, especially for merging nodes and insertion of new edges/relating new information. Another tricky question also revolves around empty identifier fields and possible lists of identifiers vs having multiple nodes.

#107 (comment)

FYI: @pxp928 @mlieberman85 @mihaimaruseac

task: [collector] create a Rekor collector

Implement a Rekor collector.

Pointers:

Using or getting inspiration from the Rekor client from the CLI https://github.com/sigstore/rekor/blob/main/pkg/client/rekor_client.go

task: [ingestor] Create sigstore verifier

Add a sigstore verifier to validate signatures based on the public keys.

Add guacone command to add indexes

Provide a guacone subcommand (or include it as part of the files subcommand) to create the indices that help performance.

Set up bots to benefit from more automation

Bots can help with various tasks, and it would be useful to set up a few for this project/organization:

Automerge: Either using the GitHub built-in feature or Kodiak
Pull Request size
Stale Bot

I think we should consider adding at least these 3 bots.

Interested in Dev/Contributing to GUAC?

Welcome! This thread is for expressing interest in contributing to GUAC! We are glad to welcome our fellow open source contributors! As the project is starting up, we will be creating issues that folks can pick up and work on. In the meantime, as the code base is taking shape, we'd like to engage directly with our contributors!

BTW we now have a slack channel: https://openssf.slack.com/archives/C03U677QD46

If you are interested in contributing, it would be very helpful to provide the following details (copy and paste into your comment):

1. I am interested in contributing to:
- [ ] Development
- [ ] Documentation
- [ ] Issue triage and community
- [ ] Technical advisory (review [governance document](https://github.com/artifact-ff/artifact-ff/blob/main/GOVERNANCE.md#technical-advisory-members))

2. I am here because:
- [ ] Personal interest
- [ ] My company/orgs I work with are interested in this

3. What is your associated company/org if you're contributing in their capacity? _________

4. Depending on how things go, I may be interested in becoming a maintainer of the project
- [ ] Yes

5. (optional) I have expertise in:
- [ ] Neo4j
- [ ] Cypher
- [ ] GraphQL
- [ ] Intoto
- [ ] SPDX
- [ ] CycloneDX
- [ ] Others (fill in):

Parser tests should check for calls to GetIdentities too

Currently parser tests do not test for GetIdentities (in pkg/ingestor/parser)

task: [processor] create ITE6 Processor

After the DSSE processor has completed, the ITE6 processor runs next, determines whether the predicate type is SLSA, and unpacks it.
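
A minimal sketch of that two-step unpacking follows, covering both the DSSE envelope and the in-toto statement in one function for brevity. The struct and function names are hypothetical, signature verification is deliberately left out, and matching on the `https://slsa.dev/provenance/` prefix is an assumption about how SLSA predicates would be recognized.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// envelope holds the DSSE fields needed here; signatures are ignored.
type envelope struct {
	PayloadType string `json:"payloadType"`
	Payload     string `json:"payload"` // base64-encoded statement
}

// statement holds the in-toto (ITE6) statement fields needed here.
type statement struct {
	Type          string          `json:"_type"`
	PredicateType string          `json:"predicateType"`
	Predicate     json.RawMessage `json:"predicate"`
}

const slsaPredicatePrefix = "https://slsa.dev/provenance/"

var errNotSLSA = errors.New("predicate type is not SLSA provenance")

// wrapDSSE builds an unsigned envelope around a statement, for demonstration.
func wrapDSSE(stmt []byte) ([]byte, error) {
	return json.Marshal(envelope{
		PayloadType: "application/vnd.in-toto+json",
		Payload:     base64.StdEncoding.EncodeToString(stmt),
	})
}

// unpackSLSA decodes the envelope, then the statement, and returns the raw
// predicate only when the predicate type is SLSA provenance.
func unpackSLSA(dsse []byte) (json.RawMessage, error) {
	var env envelope
	if err := json.Unmarshal(dsse, &env); err != nil {
		return nil, fmt.Errorf("decode envelope: %w", err)
	}
	payload, err := base64.StdEncoding.DecodeString(env.Payload)
	if err != nil {
		return nil, fmt.Errorf("decode payload: %w", err)
	}
	var stmt statement
	if err := json.Unmarshal(payload, &stmt); err != nil {
		return nil, fmt.Errorf("decode statement: %w", err)
	}
	if !strings.HasPrefix(stmt.PredicateType, slsaPredicatePrefix) {
		return nil, errNotSLSA
	}
	return stmt.Predicate, nil
}

func main() {
	stmt := []byte(`{"_type":"https://in-toto.io/Statement/v0.1",` +
		`"predicateType":"https://slsa.dev/provenance/v0.2",` +
		`"predicate":{"buildType":"example"}}`)
	env, err := wrapDSSE(stmt)
	if err != nil {
		panic(err)
	}
	pred, err := unpackSLSA(env)
	fmt.Println(string(pred), err)
}
```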

fyi: ingestor tree parsing

We agreed that in the long term, the ingestor would need to have a way to communicate information up/down the tree in order to make edges and annotations between the elements of each node in the document tree.

However, to get started with an e2e poc, we decided to defer the implementation of the recursive processing model.

Relevant Conversation:
#39 (comment)
#39 (comment)
#39 (comment)

task: [processor/ingestor] SPDX support

Support the ingestion of SPDX documents.

Implement:

SPDXProcessor
SPDXIngestor

JSON lines format

Refactor for SLSA parser tests

EDIT: upon reading the tests more, it seems like it just moves a lot of the test case checks outside the test definition

The SLSA parser tests should specify expected edges and nodes within the test itself rather than having it just be purely part of the body (explicitly being linked to a test case)

task: [querier] design and write some queries for the neo4j graph

We need to come up with a design for how queries will happen and examples of the interfaces that would be required for it. Example queries are those in the use case section of the GUAC design doc.

Add end-to-end distributed tracing monitor to GUAC

Add a tracing monitor (such as Jaeger) so that we can collect metrics for troubleshooting and track the time taken by each action.

task: [CI] Migrate CI from demo script to actual CI

Our current CI runs a demo script to test integration with neo4j, but as we now have tests in the codebase and will soon get nodes and edges to push to neo4j (#10, #11), we should migrate it to be a real CI script.

task: [collector] create file/folder collector

Create a collector that will read from a file path to import a bunch of documents.

https://github.com/guacsec/guac/blob/main/pkg/handler/collector/collector.go

For a simple test reference, see https://github.com/guacsec/guac/blob/main/cmd/collector/cmd/mockcollector/mock_collector.go

task: map keys to identities and trust

Identities should be considered separate from any given key material, as it's potentially a many-to-many situation. One identity might have multiple keys, and one key might be associated with multiple identities.

Note: Identity in this context is still abstract. It should not be tied back to a specific person, especially anonymous/pseudonymous folks. The primary goal is to associate keys with identities associated with a project, most likely organizations or known maintainers.
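
A minimal sketch of a many-to-many mapping between keys and identities is shown below. The `trustMap` type and its methods are hypothetical, and both key IDs and identities are plain strings here, which leaves the question of what an identity actually is open, as the note above intends.

```go
package main

import (
	"fmt"
	"sort"
)

// trustMap keeps two indexes so lookups work in both directions: which
// identities use a key, and which keys belong to an identity.
type trustMap struct {
	identitiesByKey map[string]map[string]struct{}
	keysByIdentity  map[string]map[string]struct{}
}

func newTrustMap() *trustMap {
	return &trustMap{
		identitiesByKey: make(map[string]map[string]struct{}),
		keysByIdentity:  make(map[string]map[string]struct{}),
	}
}

// addToSet records "to" in the set stored under "from", creating it if needed.
func addToSet(index map[string]map[string]struct{}, from, to string) {
	if index[from] == nil {
		index[from] = make(map[string]struct{})
	}
	index[from][to] = struct{}{}
}

// sortedMembers returns the set's members in sorted order for stable output.
func sortedMembers(set map[string]struct{}) []string {
	out := make([]string, 0, len(set))
	for m := range set {
		out = append(out, m)
	}
	sort.Strings(out)
	return out
}

// Link associates a key with an identity; repeated links are no-ops.
func (t *trustMap) Link(keyID, identity string) {
	addToSet(t.identitiesByKey, keyID, identity)
	addToSet(t.keysByIdentity, identity, keyID)
}

func (t *trustMap) IdentitiesFor(keyID string) []string {
	return sortedMembers(t.identitiesByKey[keyID])
}

func (t *trustMap) KeysFor(identity string) []string {
	return sortedMembers(t.keysByIdentity[identity])
}

func main() {
	tm := newTrustMap()
	tm.Link("key-1", "org-a")
	tm.Link("key-1", "maintainer-b")
	tm.Link("key-2", "org-a")
	fmt.Println(tm.IdentitiesFor("key-1"))
	fmt.Println(tm.KeysFor("org-a"))
}
```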

Docs: