hupo-psi / mzqc
Reporting and exchange format for mass spectrometry quality control data
Home Page: https://hupo-psi.github.io/mzQC/
License: Creative Commons Attribution 4.0 International
We should state an e-mail address or some other channel through which anybody can request a new CV parameter.
The term QC:4000068 "Ambient humidity" is currently an n-tuple.
Is there any reason for this? If so, the description (and name) should reflect this.
Add a namespace to the XML schema.
Original issue reported on code.google.com by [email protected]
on 6 Sep 2013 at 12:04
Dear all,
Julian told me that I can request new CV entries here. Ideally we would like to have all the metrics that we are taking into account in the QCloud (see Table 1 of our paper http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0189209). By order of priority, we would like to have the Peak Area, Mass accuracy, Retention Time, Median Injection Time for MS2, Chromatographic Resolution, Peak Capacity and Total Ion Current.
Thank you!
Roger
The qcML file should specify the entire bioinformatics processing cycle up to generating the QC metrics.
As remarked previously, this can be done through the general cvParam elements, but it might be useful to make this a bit more explicit and enforceable. At the very least we should carefully document this.
Additionally, information related to a contact person should be included. This has already been specified in the MIAPE list, so we should ensure that all MIAPE information can be found in the qcML file as well.
Run/Set Elements don't have an xsd:ID attribute
Original issue reported on code.google.com by [email protected]
on 6 May 2013 at 12:43
id-free:
id:
total number of PSM
total number of identified peptides
total number of identified protein
total number of proteins uniquely identified (unique peptides)
precursor / MC / delta ppm / TD / charge table (table of precursor m/z and delta ppm and missed cleavages and target/decoy distinction and charge)
final ID/MS2 ratio
quant:
new relation:
Because the ID for parameters is stored in QualityParameter and
AttachmentParameter, and not in AbstractParameter or CvParameter, threshold has
no ID. Threshold extends CvParameter, but doesn't define its own ID.
Is this deliberate, i.e. a threshold doesn't need an ID because it is defined
by its parent qualityParameter? Or should threshold still have an ID, possibly
for easy conversion to a database?
Original issue reported on code.google.com by [email protected]
on 14 Oct 2013 at 8:28
We need to keep element names unique within a document to be able to reference items within a JSON document, because we got rid of the 'id' attribute.
Provide a reference dataset(s) to show QCML in action.
Original issue reported on code.google.com by [email protected]
on 5 Feb 2013 at 2:49
This issue is to collect discussions and points of improvements for the current (0.0.8 to now) schema changes, some of which have already been identified in the monthly telcos by @bittremieux and @mwalzer.
Name:
"Number of chromatograms"
Definition:
"Number of chromatograms"
Comment: A lower number of chromatograms acquired during one sample run compared to similar runs can indicate mismatched instrument settings or issues with the instrumentation.
Proposed value type:
Name:
"Number of MS1 spectra"
Definition:
"Number of MS1 spectra"
Comment: A lower number of MS1 spectra acquired during one sample run compared to similar runs can indicate mismatched instrument settings or issues with the instrumentation.
Proposed value type:
Name:
"Number of MS2 spectra"
Definition:
"Number of MS2 spectra"
Comment: A lower number of MS2 spectra acquired during one sample run compared to similar runs can indicate mismatched instrument settings, issues with the instrumentation, or unusually low levels of ions collectable for MS/MS.
Proposed value type:
Name:
"MZ acquisition range"
Definition:
"Upper and lower limit of m/z values at which spectra are recorded."
Comment: Acquisition levels can be used as a criterion to assess the comparability of instrument settings between runs.
Proposed value type:
Name:
"RT acquisition range"
Definition:
"Upper and lower limit of time at which spectra are recorded."
Comment: Acquisition levels can be used as a criterion to assess the comparability of instrument settings between runs.
Proposed value type:
Name:
"Fastest frequency for MS level 1 collection"
Definition:
"Fastest frequency for MS level 1 collection"
Comment: Spectrum acquisition frequency can be used to gauge the suitability of used instrument settings for the sample content used.
Proposed value type:
Name:
"Fastest frequency for MS level 2 collection"
Definition:
"Fastest frequency for MS level 2 collection"
Comment: Spectrum acquisition frequency can be used to gauge the suitability of used instrument settings for the sample content used.
Proposed value type:
Name:
"Slowest frequency for MS level 1 collection"
Definition:
"Slowest frequency for MS level 1 collection"
Comment: Spectrum acquisition frequency can be used to gauge the suitability of used instrument settings for the sample content used.
Proposed value type:
Name:
"Slowest frequency for MS level 2 collection"
Definition:
"Slowest frequency for MS level 2 collection"
Comment: Spectrum acquisition frequency can be used to gauge the suitability of used instrument settings for the sample content used.
Proposed value type:
Name:
"Precursor intensity range"
Definition:
"Minimum and maximum precursor intensity recorded."
Comment: The intensity range of the precursors informs about the dynamic range of the acquisition.
Proposed value type:
Name:
Explained precursor intensity first quarter
Definition:
"Fraction of identified MS2 in the first quarter of precursor intensity."
Comment: Higher fractions of identified MS2 spectra indicate the efficiency of detection and sampling
Proposed value type:
Synonym: MS2-4A
Name:
Explained precursor intensity second quarter
Definition:
"Fraction of identified MS2 in the second quarter of precursor intensity."
Comment: Higher fractions of identified MS2 spectra indicate the efficiency of detection and sampling
Proposed value type:
Synonym: MS2-4B
Name:
Explained precursor intensity third quarter
Definition:
"Fraction of identified MS2 in the third quarter of precursor intensity."
Comment: Higher fractions of identified MS2 spectra indicate the efficiency of detection and sampling
Proposed value type:
Synonym: MS2-4C
Name:
Explained precursor intensity fourth quarter
Definition:
"Fraction of identified MS2 in the fourth quarter of precursor intensity."
Comment: Higher fractions of identified MS2 spectra indicate the efficiency of detection and sampling
Proposed value type:
Synonym: MS2-4D
Name:
Extent of identified precursor intensity
Definition:
"Ratio of 95th over 5th percentile of precursor intensity for identified peptides"
Comment: Can be used to approximate the dynamic range of signal
Proposed value type:
Synonym: MS1-3A
Name:
"Precursor intensity range of identified MS2"
Definition:
"Minimum and maximum precursor intensity recorded and identified."
Comment: The intensity range of the identified precursors informs about the dynamic range of the acquisition.
Proposed value type:
Name:
Precursor intensity of identified MS2 Q1, Q2, Q3
Definition:
"From the distribution of precursor intensity of identified MS2, the quartiles Q1, Q2, Q3"
Comment: The (un)identified precursor intensity distribution can aid the interpretation of overall identification success.
Proposed value type:
Name:
Precursor intensity of unidentified MS2 Q1, Q2, Q3
Definition:
"From the distribution of precursor intensity of unidentified MS2, the quartiles Q1, Q2, Q3"
Comment: The (un)identified precursor intensity distribution can aid the interpretation of overall identification success.
Proposed value type:
Name:
Median S/N for MS1 spectra in RT range in which half of the peptides are identified
Definition:
"Median S/N for MS1 spectra in RT range in which half of the peptides are identified"
Comment: Higher MS1 S/N may correlate with higher signal discrimination
Proposed value type:
Synonym: MS1-2A
Name:
Median of TIC values in the RT range in which the first half of peptides are identified
Definition:
"Median of TIC values in the RT range in which the first half of peptides are identified"
Comment: Estimates the total absolute signal for peptides (may vary significantly between instruments)
Proposed value type:
Synonym: MS1-2B
Name:
Median S/N for MS1 spectra in the shortest RT range in which half of the peptides are identified
Definition:
"Median S/N for MS1 spectra in the shortest RT range in which half of the peptides are identified"
Proposed value type:
Name:
Median of TIC values in the shortest RT range in which half of the peptides are identified
Definition:
"Median of TIC values in the shortest RT range in which half of the peptides are identified"
Proposed value type:
Name:
Explained base peak intensity median
Definition:
"Median of the ratio of 'max survey scan intensity' over 'sampled precursor intensity' for all peptides identified"
Comment: Gives insight into the amount of overall explained signal and whether the amount of signal could be increased by a better sampling strategy.
Proposed value type:
Name:
Explained base peak intensity median from least intense 50% base peaks
Definition:
"Ratios of 'max survey scan intensity' over sampled precursor intensity for the bottom half (by MS1 max) of MS2"
Comment: Gives insight into the amount of explained signal and whether the sampling strategy is interfering with the sampling of low abundant peaks.
Proposed value type:
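Several of the proposed terms above reduce to order statistics over a list of precursor intensities. The following is a minimal sketch of how the quartile terms (Q1, Q2, Q3) and the 95th-over-5th-percentile ratio (MS1-3A) might be computed. The function names are hypothetical, and the interpolation rule (the default of Python's statistics.quantiles) is an assumption, since the proposals do not fix one.

```python
import statistics


def intensity_quartiles(intensities):
    """Return (Q1, Q2, Q3) for a list of precursor intensities.

    Assumption: the default 'exclusive' interpolation of
    statistics.quantiles is acceptable; the CV proposal does not say.
    """
    q1, q2, q3 = statistics.quantiles(intensities, n=4)
    return q1, q2, q3


def intensity_extent(intensities):
    """Return the ratio of the 95th over the 5th percentile."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%
    cuts = statistics.quantiles(intensities, n=20)
    p5, p95 = cuts[0], cuts[-1]
    return p95 / p5


if __name__ == "__main__":
    # Made-up intensities, for illustration only
    example = [float(v) for v in range(1, 101)]
    print(intensity_quartiles(example))
    print(intensity_extent(example))
```

The same two helpers could be applied separately to the intensities of identified and unidentified MS2 precursors to obtain the paired terms proposed above.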
In the terminology world, unique elements in controlled vocabularies are generally referred to as terms. The word "accession" is used in the bio and library community, generally for libraries and collections that change as the world being studied changes (new books are published, genes are identified, proteins are identified).
At present, mzQC uses "accession" in many places where, as a terminologist, I'd expect to see "term". I'm raising this issue to flag that we may want a discussion around which term (sorry!) to adopt.
Currently the RawFile element supports the following attributes:
id and name attributes to identify the file. Should name be the original filename? Should it include the file extension or not? We should probably specify this in the documentation.
location attribute as a URI.
methodFile_ref reference to a methods file. The example given is a TraML file used for SRM analysis. When would this reference need to be used instead of only having an entry for the file in question as another external file?
And the following child elements:
ExternalFormatDocumentation with a URI pointing to additional documentation. As remarked previously it was unclear what this should contain and we decided to remove it (db7c7de).
FileFormat as a cvParam. This imposes barely any restrictions in the XML schema, so probably external validation should happen to verify that a specific CV item denoting a file format is used here?
userParam and cvParam elements that can contain any other information.
What is the basis to put information as an attribute or an element? For example, both location and ExternalFormatDocumentation are simple URIs. There is of course a difference in importance between these 2 URIs, partially indicated by the mandatory/optional status.
RawFile can contain information about any type of input file. A flag should be included to specify what kind of file this is. (579a614)
The number of RawFile elements is unbounded.
After going through some metrics in the current OBO, some descriptions are vague (at best) and/or ambiguous.
E.g. https://github.com/HUPO-PSI/mzQC/blob/master/cv/v0_1_0/qc-cv.obo#L194 says "Log ratio of ...". Log to what base? There is no universal agreement IMHO. Writing LogN or Log2 would be desirable and make QC values more comparable.
Also, most metrics could use a unit (e.g. for RT etc...). Is there a way to restrict (or even fix) the unit to be used?
Also, can we put restrictions on the values, e.g. fractions must be between 0-1. Reporting multiple fractions must sum to 1 etc...
The main idea/benefit of providing a tighter description is a) better interpretability of the data b) better comparability across tools which compute these metrics.
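The value restrictions asked about here (fractions between 0 and 1, multiple fractions summing to 1) could be enforced by a semantic validator. A minimal sketch follows; the function names and the tolerance are hypothetical and not part of any existing mzQC tooling.

```python
import math


def check_fraction(value):
    """True if value lies in the closed interval [0, 1]."""
    return 0.0 <= value <= 1.0


def check_fractions_sum_to_one(values, tolerance=1e-6):
    """True if every value is a fraction and together they sum to 1.

    The tolerance is an assumed allowance for floating-point rounding.
    """
    return all(check_fraction(v) for v in values) and math.isclose(
        sum(values), 1.0, abs_tol=tolerance
    )


if __name__ == "__main__":
    print(check_fractions_sum_to_one([0.25, 0.25, 0.25, 0.25]))
    print(check_fractions_sum_to_one([0.5, 0.6]))
```

Which terms such checks apply to would still have to be recorded somewhere, for instance in the CV itself or in validator rules.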
The term "CV" is already well-known in the bioinformatics community as Coefficient of Variation, which is also a term associated with quality control. I would strongly recommend not overloading the abbreviation, and I would strongly recommend not using abbreviations in a standard exchange format (we studiously avoided abbreviations when standardising OWL, for example).
May I request renaming the "cv" element in the JSON to "controlledVocabulary" to avoid this collision and consequent cognitive dissonance?
We need a changelog for the schema changes from 0.0.8
Several ways have been suggested to encode a QC metric containing multiple values (i.e. quartiles, deciles, ...).
<qualityParameter ID="METRIC003" cvRef="PSI-QC-CV" accession="C-8C" name="area under total ion current chromatogram quartiles Intensity Q1-Q3">
<content cvRef="PSI-QC-CV" accession="list" value='3'>{'UO:0000189':[11235567,49344566,98047696]}</content>
</qualityParameter>
<qualityParameter ID="METRIC002_1" cvRef="PSI-QC-CV" accession="C-8D" name="TIC quartile in relation to RT duration RT-TIC Q1-Q3">
<content cvRef="PSI-QC-CV" accession="C-8D-json" value='4'>{Q1:0.27,Q2:0.4,Q3:0.74}</content>
</qualityParameter>
The previous consensus was that JSON lists are preferred.
As JSON is very flexible we should carefully specify what is and isn't allowed.
For example: how will we specify the number of elements in a list? Will the CV terms explicitly state "quartile" to indicate that 4 values are required or not? Previously we mentioned that the CV terms won't explicitly contain tags such as "Q1", "Q2", ... for quartiles etc., but that the number of quantiles will be derived from the number of values the metric takes. (0b134a6) In this case it will be crucial that this is stored consistently so that a correct interpretation is possible.
This is also important for file validation, which will need to be done externally.
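To illustrate the idea of deriving the number of quantiles from the number of values, here is a small sketch. It assumes that a metric stores only the interior cut points (three values for quartiles, as in the Q1-Q3 examples above) and that they are listed in non-decreasing order. Both conventions are assumptions that the specification would need to settle, and the helper name is hypothetical.

```python
# Names for common cut-point counts (assumed convention: interior cut points only)
QUANTILE_NAMES = {3: "quartiles", 4: "quintiles", 9: "deciles", 99: "percentiles"}


def describe_quantiles(values):
    """Derive the quantile type from the number of values of a metric."""
    values = list(values)
    if values != sorted(values):
        raise ValueError("quantile values must be in non-decreasing order")
    groups = len(values) + 1  # k cut points split the data into k + 1 groups
    return QUANTILE_NAMES.get(len(values), f"{groups}-quantiles")


if __name__ == "__main__":
    # Values taken from the second XML example above
    print(describe_quantiles([0.27, 0.4, 0.74]))
```

If the format instead stored the minimum and maximum alongside the cut points, the mapping would shift by two, which is exactly why the convention has to be stated explicitly.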
Having a reference for what variable/name/element is stylised in which context will help consistency and understanding context while getting familiar with the format, implementing a reader/writer, adding new metrics and spotting/talking about issues.
FWHM or what?
Original issue reported on code.google.com by [email protected]
on 3 Sep 2013 at 3:20
Add new metrics, i.e. define the corresponding CV terms.
Check back http://code.google.com/p/qcml/wiki/PotentialCVterms for missing
parameters from NIST-CPTAC/QuaMeter, SYMPATICQO, ...
Original issue reported on code.google.com by [email protected]
on 5 Feb 2013 at 2:52
To better understand the interpretation of the CV entries, or rather the values in the mzQC file itself, I would like to introduce comments to the obo files.
This will be in the next pull request (containing the entries from issue #52 ) and needs to be done for all metrics. Help is very welcome for this.
The id prefix for the first three terms in the CV is MS: rather than CV:. Is this intentional @julianu? I guess there's no rule against that, but to be consistent it seems logical to prefix all ids with QC: instead.
@dtabb73 added PSI-MS relations to the ID free metrics, e.g. relationship: has_relation MS:1000041 ! charge state.
attached file from #53
As discussed in our recent teleconference (f52d365) we should compile a list of minimum information that is necessary for QC analyses, even though that information might already be stored in secondary files.
Duplicating that information directly into the qcML files will facilitate (re)running QC analyses and simplify pipelines, because (all) prior files no longer have to be kept track of.
Current information that would be useful as identified during our teleconference is:
Please discuss and specify further useful information.
We still need to add many CV terms, keep our examples in-line, and have a way to spot offending entries in mzQC documents with a semantic validator. (Also, elaborate on what is 'offending'.)
e.g. in corresponding lists, defined: "lists with n elements each, n specified by the CV param value"
If cv value is n, then where do the lists go? This contradicts how we use e.g. id: QC:4000050 name: XIC-WideFrac
Add new metric: retention time deviation for a specific feature (sequence)
Original issue reported on code.google.com by [email protected]
on 3 Sep 2013 at 2:34
Attachment is derived from CVParamType,
but I'm not sure this is really needed.
Maybe for values that are not that important and have no cv.
This would be described in a specification doc.
Original issue reported on code.google.com by [email protected]
on 6 May 2013 at 12:43
In the current CV term collection (the separator-delimited file now suffixed '.md') it may also be interesting to report for which MS experiment type (e.g. DDA high-throughput MS/MS or SRM) a quality metric is useful (as Wout pointed out in the QC announcement manuscript, the total number of acquired MS/MS spectra is not so helpful for SRM). Adding such a column / attribute does not decide whether this will also be included in a final CV collection / ontology, stated as text in a QC format specification document, or modeled by a rule in a semantic validator.
In #44 I created a pull request for the version 0.1.0 for our CV.
Besides starting all terms from scratch, the CV imports the 0.0.8-legacy for backward compatibility. We should highlight somewhere (at least on the 1.0.0 release) that all terms below QC:4...... are deprecated and should no longer be used. This could also easily be reflected in the legacy-cv later on.
As we should also have a stable link position for the current version of the CV, the file "qcML-development/cv/qc-cv.obo" was created, which should in the future always represent the newest active version. The legacy version resides beside it.
I think this also reflects your ideas @mwalzer, right?
Please comment or merge the PR.
Oops sorry, I closed the previous discussion by mistake! I was asking if I could create a branch to add the new QCCV entries we are adding to the new QCloud website.
Thanks!
Things that have to be validated after the syntactic validation outside of the JSON Schema:
qualityParameters correspond to the information in the CV.
cvRefs link to valid cvs in the file.
qualityParameters are unique within a run/setQuality.
run/setQuality. #50
Warnings:
Please add any other checks that I might have missed at the moment.
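The checks listed above could be sketched as a small post-schema validation pass. This is only an illustration: the dictionary layout (keys "cv", "ref", "runQuality", "setQuality", "qualityParameters") and the function name are assumptions made for the example, not the layout defined by the schema.

```python
def semantic_problems(doc, cv_terms):
    """Return a list of human-readable problems found in a parsed document.

    doc      -- hypothetical dict layout (assumed, not taken from the schema)
    cv_terms -- dict mapping accession -> term name, e.g. loaded from the OBO file
    """
    problems = []
    declared = {cv["ref"] for cv in doc.get("cv", [])}
    for kind in ("runQuality", "setQuality"):
        for index, quality in enumerate(doc.get(kind, [])):
            seen = set()
            for param in quality.get("qualityParameters", []):
                where = f"{kind}[{index}] {param['accession']}"
                # The parameter must correspond to the information in the CV
                if cv_terms.get(param["accession"]) != param["name"]:
                    problems.append(f"{where}: accession/name not found in CV")
                # Every cvRef must link to a cv declared in the same file
                if param["cvRef"] not in declared:
                    problems.append(f"{where}: cvRef '{param['cvRef']}' not declared")
                # A parameter may occur only once per run/setQuality
                if param["accession"] in seen:
                    problems.append(f"{where}: duplicate qualityParameter")
                seen.add(param["accession"])
    return problems


if __name__ == "__main__":
    terms = {"QC:4000068": "Ambient humidity"}
    param = {"cvRef": "QC", "accession": "QC:4000068", "name": "Ambient humidity"}
    document = {"cv": [{"ref": "QC"}], "runQuality": [{"qualityParameters": [param]}]}
    print(semantic_problems(document, terms))
```

Returning a list rather than raising on the first problem makes it easy to separate hard errors from warnings later.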
An extended sample file (or files) comprising all features of qcML
is needed for implementation aid and testing.
Original issue reported on code.google.com by [email protected]
on 6 Sep 2013 at 8:08
Decide on the way to represent metrics spanning multiple files
related to #35
I fear the parent term (and CV term category) QC Metric type might be confusing. All subterms describe value types of metrics, not metric types.
Until now, in the CV we have the xref for each term, like "xref: value-type:xsd:int "The allowed value-type for this CV term."
These value types are still XML specific.
Do we need to specify them in JSON, and if so, which value types do we have there?
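One possible answer is a fixed mapping from the XSD value types used in the xref entries to the primitive type names of JSON Schema. The sketch below is an assumption about how such a mapping could look: the helper name is hypothetical and the list of XSD types is deliberately incomplete.

```python
# JSON Schema primitive type names for a few common XSD types (incomplete, illustrative)
XSD_TO_JSON_TYPE = {
    "xsd:int": "integer",
    "xsd:integer": "integer",
    "xsd:float": "number",
    "xsd:double": "number",
    "xsd:string": "string",
    "xsd:boolean": "boolean",
}


def json_type_for(xref_value_type):
    """Map an xref such as 'value-type:xsd:int' to a JSON Schema type name.

    Returns None when no mapping is known. Backslashes are dropped because
    OBO files may escape the colon (for example 'xsd\\:int').
    """
    xsd_type = xref_value_type.replace("\\", "").removeprefix("value-type:")
    return XSD_TO_JSON_TYPE.get(xsd_type)


if __name__ == "__main__":
    print(json_type_for("value-type:xsd:int"))
```

JSON has no separate integer and floating-point representations on the wire, so the distinction only survives if the schema or a validator enforces it.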
I was looking for the CV codes for my IDFree MS1-Count and MS2-Count metrics, but the CV seems to lack these. Please help! I am only a PI and cannot fix this myself.
Adjust Attachment so a table is encapsulated by a table element. For example:
<Attachment id=...>
<table>
<TableColumnTypes>table row values</TableColumnTypes>
<TableRowValues>1 2 3</TableRowValues>
...
<TableRowValues>n-2 n-1 n</TableRowValues>
</table>
</Attachment>
Original issue reported on code.google.com by [email protected]
on 6 Sep 2013 at 11:54
we need a logo! (for the files and the website)
Investigate ability of PRIDE to handle QCML as part
of ProteomeXchange submissions.
Which would be feasible integrations?
Original issue reported on code.google.com by [email protected]
on 5 Feb 2013 at 2:48
make reference database implementation available.
Potentially in /trunk/[reference DB implementation name]/ ?
Original issue reported on code.google.com by [email protected]
on 5 Feb 2013 at 2:40
certain Proteins?
Original issue reported on code.google.com by [email protected]
on 3 Sep 2013 at 3:17
Provide documentation to QCML availability in various software packages.
(Bioconductor proteomics mass spec package?)
Original issue reported on code.google.com by [email protected]
on 5 Feb 2013 at 2:43
We can determine how well dynamic exclusion settings are set for an instrument if we can measure where an MS2 spectrum is acquired relative to the apex of the feature it belongs to. This can be measured as a percent of max intensity or distance to apex.
Poor performance can be denoted as greatest distance from apex and could indicate a change of dynamic exclusion parameters.
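As a rough sketch of the proposed measurement, the following computes both variants (distance to apex and percent of maximum intensity) for one MS2 event. It assumes the feature's elution profile is available as a list of (retention time, intensity) points and uses the nearest profile point rather than interpolation. The function name and the data layout are hypothetical.

```python
def apex_sampling(feature_trace, ms2_rt):
    """Describe where an MS2 spectrum was acquired relative to a feature apex.

    feature_trace -- list of (rt, intensity) points of the feature (assumed layout)
    ms2_rt        -- retention time at which the MS2 spectrum was acquired
    """
    apex_rt, apex_intensity = max(feature_trace, key=lambda point: point[1])
    # Intensity at the profile point closest in time to the MS2 event
    _, intensity_at_ms2 = min(feature_trace, key=lambda point: abs(point[0] - ms2_rt))
    return {
        "distance_to_apex": ms2_rt - apex_rt,  # negative: sampled before the apex
        "percent_of_max": 100.0 * intensity_at_ms2 / apex_intensity,
    }


if __name__ == "__main__":
    # Made-up elution profile, for illustration only
    trace = [(10.0, 100.0), (11.0, 400.0), (12.0, 1000.0), (13.0, 500.0)]
    print(apex_sampling(trace, 11.0))
```

Aggregating these values over all identified features of a run (for example as a median) would give a single number per run that can be tracked over time.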
The description for the term QC:4000062 "MS2 density per quantile" is broken.
Could you please provide the correct description @dtabb73 ?
Should we really do this? Maybe a cvTerm accession would do.
"cvParameter": {
"description": "Base element containing a reference to a controlled vocabulary.",
"type": "object",
"properties": {
"cvRef": {
"description": "Reference to the CV that contains the parameter definition.",
"type": "string",
"pattern": "^[A-Z]+$"
},
"accession": {
"description": "Accession number identifying the parameter within its CV.",
"type": "string",
"pattern": "^[A-Z]+:[0-9]{7}$"
},
"name": {
"description": "Name of the CV element describing the parameter.",
"type": "string"
},
"description": {
"description": "Description of the parameter.",
"type": "string"
},
"value": {
"description": "Value of the parameter."
},
"unit": {
"description": "A CV element describing the unit of the parameter value.",
"anyOf": [
{
"$ref": "#/definitions/cvParameter"
},
{
"type": "array",
"minItems": 1,
"items": {
"$ref": "#/definitions/cvParameter"
}
}
]
}
},
"required": ["cvRef", "accession", "name"]
},
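To make the constraints in this schema fragment concrete, here is a minimal sketch that applies the two patterns and the required list to a parameter given as a plain dictionary. The patterns and required keys are copied from the fragment above; the function name is hypothetical, and a real implementation would more likely rely on a JSON Schema validator.

```python
import re

# Patterns from the schema fragment, applied with fullmatch (so no ^...$ anchors needed)
CV_REF_PATTERN = re.compile(r"[A-Z]+")
ACCESSION_PATTERN = re.compile(r"[A-Z]+:[0-9]{7}")
REQUIRED = ("cvRef", "accession", "name")


def check_cv_parameter(param):
    """Return a list of problems for one cvParameter-like dict."""
    problems = [f"missing required property '{key}'" for key in REQUIRED if key not in param]
    if "cvRef" in param and not CV_REF_PATTERN.fullmatch(str(param["cvRef"])):
        problems.append(f"cvRef '{param['cvRef']}' does not match the pattern")
    if "accession" in param and not ACCESSION_PATTERN.fullmatch(str(param["accession"])):
        problems.append(f"accession '{param['accession']}' does not match the pattern")
    return problems


if __name__ == "__main__":
    print(check_cv_parameter(
        {"cvRef": "QC", "accession": "QC:4000068", "name": "Ambient humidity"}
    ))
```

Note that under these patterns a hyphenated reference such as PSI-QC-CV or an accession such as C-8C would be rejected, which matters when converting older files.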