
Comments (5)

nscuro commented on August 25, 2024

Valid request. And it will be even more valid once we support additional metadata such as occurrences.

The identity-based de-duplication has always been there, but I think with the recent refactoring of BOM processing, as well as the introduction of component property support, it's now more obvious.

De-duplication is a major concern for users who merge multiple BOMs prior to upload - most merge tools don't pay attention to duplicates, so it's up to DT to resolve that. There are also BOM generators out there that will produce duplicate component records for monorepos, or multi-module projects.

That being said, even in those cases, I'd expect properties outside of the core identity to match as well. So I'm inclined to say we should be able to just switch to full equality and be done with it.

If we need to maintain multiple ways, we could just make it a flag in the BOM upload request, defaulting to identity-based de-duplication.
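
As a rough illustration of the two modes, here is a minimal Python sketch (Dependency-Track itself is written in Java). The `Component` class, the `dedupe` function, and the `full_equality` flag are hypothetical names made up for this example, not the project's API. The core identity is assumed to be purl, cpe, swid, name, and version, the fields named later in this thread.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Component:
    # Hypothetical, simplified record; not Dependency-Track's actual model.
    name: str
    version: str
    purl: Optional[str] = None
    cpe: Optional[str] = None
    swid: Optional[str] = None
    # Properties kept as a tuple of (key, value) pairs so the record is hashable.
    properties: Tuple[Tuple[str, str], ...] = ()

    def identity(self) -> tuple:
        # Assumed core identity: everything except the properties.
        return (self.purl, self.cpe, self.swid, self.name, self.version)


def dedupe(components, full_equality=False):
    """Keep the first component seen for each key, preserving input order."""
    seen = {}
    for c in components:
        key = c if full_equality else c.identity()
        seen.setdefault(key, c)
    return list(seen.values())


a = Component("lib", "1.0", purl="pkg:npm/lib@1.0", properties=(("module", "api"),))
b = Component("lib", "1.0", purl="pkg:npm/lib@1.0", properties=(("module", "web"),))

by_identity = dedupe([a, b, a])                      # a and b collapse into one
by_equality = dedupe([a, b, a], full_equality=True)  # a and b stay distinct
```

With identity-based matching the second record is silently dropped along with its differing property; with full equality only the verbatim duplicate is removed.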


What could be problematic are BOM generators that yield non-reproducible outputs, for example by putting timestamps or other dynamic data in properties. In that case you'll get lots of churn whenever you re-upload BOMs to existing projects.

from dependency-track.

mykter commented on August 25, 2024

> most merge tools don't pay attention to duplicates, so it's up to DT to resolve that. There are also BOM generators out there that will produce duplicate component records for monorepos, or multi-module projects.

I think there's a reasonable argument that it's up to the BOM producers to resolve that, not DT. Being able to say the BOM is the source of truth is a powerful simplifier, both for users and developers.

> So I'm inclined to say we should be able to just switch to full equality and be done with it.

Sounds good! Would you be open to PRs to implement this? Would we need it behind an experimental flag?


nscuro commented on August 25, 2024

> Would you be open to PRs to implement this?

Most certainly.

> Would we need it behind an experimental flag?

I think that would be good.

We can still decide to remove the flag later if we deem it unnecessary, but initially we should assume that there will be noticeable differences that users will need to "opt in" to.


mykter commented on August 25, 2024

I've been thinking about this some more and came up with a potential problem. Let's say you upgrade your BOM generator, and it adds a new metadata field as a property. I don't think anyone would expect this to cause a problem, but if we were using strict component equality then every vulnerability and policy violation would disappear and be recreated afresh the first time this new BOM was uploaded, with no triage status or notes etc.
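
A small Python sketch of that failure mode, under simplifying assumptions: triage state is modelled as a plain dict keyed by whatever the matching rule produces, and `sync`, `identity_key`, `strict_key`, and the property name `generator:build-id` are all invented for illustration.

```python
def sync(stored, uploaded, key):
    """Reconcile stored triage notes against a newly uploaded BOM.

    stored:   dict mapping component key -> triage note
    uploaded: list of component dicts from the new BOM
    key:      function deriving the matching key from a component
    A component whose key is already stored keeps its note; any other
    component is treated as new and starts with an empty note.
    """
    return {key(c): stored.get(key(c), "") for c in uploaded}


def identity_key(c):
    return (c["purl"], c["name"], c["version"])


def strict_key(c):
    # Strict equality: the properties are part of the key as well.
    return identity_key(c) + (tuple(sorted(c["properties"].items())),)


old = {"purl": "pkg:npm/lib@1.0", "name": "lib", "version": "1.0", "properties": {}}
# The same component after a generator upgrade that adds one property.
new = dict(old, properties={"generator:build-id": "42"})

triaged_identity = sync({identity_key(old): "false positive"}, [new], identity_key)
triaged_strict = sync({strict_key(old): "false positive"}, [new], strict_key)
```

Under `identity_key` the note survives the re-upload; under `strict_key` the component no longer matches its stored counterpart, so the note is gone.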

So on either extreme we have:

  1. Use strict equality (as we discussed above): consumers need to deal with vulnerabilities and policy violations that get recreated on any change to the generated BOM
  2. Use the existing identity-based equality: consumers have to deal with not being able to represent multiple different instances of the same dependency in a BOM

Middle grounds I can think of:

  3. Choose the behaviour on upload. In theory this allows the best of both worlds, as you could use strict equality most of the time, and identity-based equality when your BOM changes in some way. In practice I don't think this is realistic - you can't be fiddling with automated BOM uploads for one-off activities, and you'd probably only notice you needed this behaviour when it was too late and all your vulnerabilities had been recreated.
  4. Make the equality check configurable to some degree: start with the existing fields as a base (or perhaps a bigger default set?), select other fields that you want to include, and if a component doesn't match on all these fields then it's treated as distinct.
    • It could be configured in the BOM upload request, at the project level, or globally. These options get progressively simpler (good!) and less flexible (bad). Arguably the per-request option can be made to behave like the per-project one by the client: use a consistent equality definition whenever you're uploading to a project.
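
The configurable check could look roughly like the following Python sketch. It assumes components are plain dicts and that the base set is purl/cpe/swid/name/version; `BASE_FIELDS`, `make_key`, and `_freeze` are hypothetical names, and where `extra_fields` comes from (upload request, project, or global setting) is left open.

```python
BASE_FIELDS = ("purl", "cpe", "swid", "name", "version")


def _freeze(value):
    """Convert nested dicts and lists into hashable, comparable tuples."""
    if isinstance(value, dict):
        return tuple(sorted((k, _freeze(v)) for k, v in value.items()))
    if isinstance(value, (list, tuple)):
        return tuple(_freeze(v) for v in value)
    return value


def make_key(extra_fields=()):
    """Build a key function from the base fields plus any opted-in fields."""
    fields = BASE_FIELDS + tuple(extra_fields)

    def key(component):
        # Missing fields compare as None, so absent and present values differ.
        return tuple(_freeze(component.get(f)) for f in fields)

    return key


x = {"purl": "pkg:maven/org.example/lib@1.0", "name": "lib", "version": "1.0",
     "pedigree": {"notes": "patched"}}
y = {"purl": "pkg:maven/org.example/lib@1.0", "name": "lib", "version": "1.0"}

default_key = make_key()
pedigree_key = make_key(["pedigree"])

same_by_default = default_key(x) == default_key(y)        # base fields only
same_with_pedigree = pedigree_key(x) == pedigree_key(y)   # pedigree included
```

Two components are merged only if they agree on every selected field, so opting in to `pedigree` here is enough to keep `x` and `y` apart.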

In addition to purl/cpe/swid/name/version, I can see deviations in fields like these warranting separate components:

  • pedigree
  • call stack
  • named properties
  • occurrences

Option 4 feels safer to me, whilst still meeting the need to be able to represent multiple instances of the same component. It is more complex and subtle though.

Are there other options I'm not thinking of?


nscuro commented on August 25, 2024

I think option 4 is going in the right direction: we need to find a minimal subset of component properties that can reliably and uniquely identify a component.

I'm not sure giving clients too much choice is a good idea, though. Ideally we would identify one "approved" way of doing things and run with it. The more opportunities for variation we offer, the further people's experiences will drift apart. It will be challenging to support users if the de-duplication is too customizable, if that makes sense.

> In addition to purl/cpe/swid/name/version, I can see deviations in fields like these warranting separate components:
>
>   • pedigree
>   • call stack
>   • named properties
>   • occurrences

We definitely need to consider hashes as well. Probably also licenses.

RE occurrences: Consider that across project versions, the same component can appear in different places. Or additional occurrences can get added from one project version to the next. We wouldn't want the component to be recreated, just because it is imported from more locations. Call stack may have similar semantics.
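
One way to read this, sketched in Python under stated assumptions: occurrences are treated as data attached to a matched component rather than as part of its identity. The `apply_upload` function, the dict-based storage, and the three-field identity are simplifications invented for this example.

```python
import itertools

_ids = itertools.count(1)  # stand-in for database-assigned component ids


def identity(c):
    # Simplified identity for the sketch; occurrences are deliberately excluded.
    return (c["purl"], c["name"], c["version"])


def apply_upload(stored, uploaded):
    """Return the stored state after processing an uploaded BOM.

    stored maps identity -> {"id": int, "occurrences": [...]}.
    A component that matches on identity keeps its id (and therefore
    anything linked to that id); only its occurrence list is refreshed.
    """
    result = {}
    for c in uploaded:
        key = identity(c)
        previous = stored.get(key)
        result[key] = {
            "id": previous["id"] if previous else next(_ids),
            "occurrences": sorted(c.get("occurrences", [])),
        }
    return result


v1 = [{"purl": "pkg:npm/lib@1.0", "name": "lib", "version": "1.0",
       "occurrences": ["app/package.json"]}]
# Next project version: same component, imported from one more location.
v2 = [{**v1[0], "occurrences": ["app/package.json", "tools/package.json"]}]

state = apply_upload({}, v1)
first_id = next(iter(state.values()))["id"]
state = apply_upload(state, v2)
second = next(iter(state.values()))
```

After the second upload the component still has its original id, and its occurrence list reflects the new BOM.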

