Howdy folks, I've been looking over this specification and it's pretty complete, but I have some concerns about the type-specific component value transformations.
Specifically, I mean the per-type rules that require the canonical form to be case-sensitive or case-insensitive, or that translate characters (like "_" to "-").
For a generic spec, and for impls that are meant to parse and form any package-url generically, these edge cases mean that every impl will eventually be invalid: it cannot possibly encode the details of presently unspecified package types, or of whatever new package systems are created in the future.
The docs for the pypi type state that pypi treats "-" and "_" the same, but require that "_" be translated into "-". This seems like an overcomplication if the underlying system treats them the same anyway.
The docs for the npm type state that the value must be lower-cased. While I understand the underlying npm system may require that, encoding this detail into the package-url specification seems like it may lead to sustainability issues in the future. An impl could encode this rule, but suppose some new format comes along, say a fictitious "upper" type for a fictitious package system where everything is always UPPER-CASE (and anything other than UPPER-CASE is not valid). It's not likely that existing package-url impls would know about that type, so they would end up producing invalid canonical string representations.
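To make the concern concrete, here is a rough sketch of what an impl written today might carry around. The table and the function name are made up for illustration and are not taken from any real package-url library; only the npm and pypi rules mentioned above are included.

```python
# Hypothetical per-type rules, as an impl written today might hard-code them.
# Only the two rules discussed above are listed; this is not the full spec.
NAME_NORMALIZERS = {
    "npm": lambda name: name.lower(),              # npm: value must be lower-cased
    "pypi": lambda name: name.replace("_", "-"),   # pypi: "_" is translated into "-"
}


def normalize_name(purl_type: str, name: str) -> str:
    """Apply the type-specific rule if this impl happens to know one."""
    normalizer = NAME_NORMALIZERS.get(purl_type)
    if normalizer is None:
        # Unknown type: the impl has no rule to apply, so the name passes
        # through unchanged, whatever that type's canonical form may require.
        return name
    return normalizer(name)
```

With this table, normalize_name("upper", "foobar") hands back "foobar" untouched, which the fictitious "upper" system would reject, and the impl has no way of knowing that until someone ships a new release with an updated table.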
It is almost as if the URL specification treated the path/query/fragment details differently depending on the host:port part of the identifier, or as if the URI spec let the scheme dictate how to transform the rest of the components. That would make for hugely complex implementations (which would probably end up wrong eventually, if they were not already). I feel like the package-url specification is already like that with these type-specific transformation wrinkles.
I believe it would be simpler and more conventional to leave case alone (except perhaps for the type itself) and to skip content transformation (except for percent-encoding). This would imply you could end up with:
... which may not be proper with respect to the package system's expectation that the name is lower-cased. But that seems like an input problem, and not really something that a generic specification for identifying and generalizing package identifiers should be concerned with.
... would be more correct in terms of how the NPM community has decided to normalize their identifiers, but the package-url specification really should not care. Since mandating a single rule is not reasonable (or even possible at present, with some formats needing lower case and some needing mixed case), it seems like the specification should be more general and support future formats without requiring any such transformations.
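As a minimal sketch of the type-agnostic behavior I am suggesting: the function name is hypothetical, only type/name/version are handled (namespace, qualifiers and subpath are left out), and lower-casing the type is my own assumption based on the "except perhaps for type itself" caveat.

```python
from urllib.parse import quote


def to_purl(purl_type: str, name: str, version: str) -> str:
    """Build a simplified package-url with no type-specific transformations."""
    # Assumption: the type is the one component that gets case-normalized.
    normalized_type = purl_type.lower()
    # Name and version are only percent-encoded; case and characters such as
    # "_" are left exactly as the caller supplied them.
    encoded_name = quote(name, safe="")
    encoded_version = quote(version, safe="")
    return f"pkg:{normalized_type}/{encoded_name}@{encoded_version}"
```

Nothing here needs to know which types exist, so a made-up call like to_purl("upper", "FOO_BAR", "2") works the same way as one for npm or pypi, and getting the name right stays the responsibility of whoever supplies the input.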