pachterlab / seqspec


machine-readable file format for genomic library sequence and structure

License: MIT License

Python 99.70% Makefile 0.30%

seqspec's Issues

add "requests" package as dependency?

Super minor installation issue: the requests package dependency is not installed by default in some environments.

Tested on various macOS and Linux anaconda3/miniconda distributions (conda > 4.10.3, python 3.11.x, seqspec 967cf97).

Example:

conda create --name seqspec
conda activate seqspec

conda install pip
pip install git+https://github.com/IGVF/seqspec.git

conda list
# packages in environment at /Users/choo/opt/anaconda3/envs/seqspec:
#
# Name                    Version                   Build  Channel
attrs                     23.1.0                   pypi_0    pypi
bzip2                     1.0.8                h1de35cc_0  
ca-certificates           2023.08.22           hecd8cb5_0  
jsonschema                4.19.0                   pypi_0    pypi
jsonschema-specifications 2023.7.1                 pypi_0    pypi
libffi                    3.4.4                hecd8cb5_0  
ncurses                   6.4                  hcec6c5f_0  
newick                    1.9.0                    pypi_0    pypi
openssl                   3.0.10               hca72f7f_2  
pip                       23.2.1                   pypi_0    pypi
python                    3.11.5               hf27a42d_0  
pyyaml                    6.0.1                    pypi_0    pypi
readline                  8.2                  hca72f7f_0  
referencing               0.30.2                   pypi_0    pypi
rpds-py                   0.10.3                   pypi_0    pypi
seqspec                   0.0.0                    pypi_0    pypi
setuptools                68.0.0                   pypi_0    pypi
sqlite                    3.41.2               h6c40b1e_0  
tk                        8.6.12               h5d9f67b_0  
tzdata                    2023c                h04d1e81_0  
wheel                     0.38.4                   pypi_0    pypi
xz                        5.4.2                h6c40b1e_0  
zlib                      1.2.13               h4dc903c_0  



seqspec --help
Traceback (most recent call last):
  File "/Users/choo/opt/anaconda3/envs/seqspec/bin/seqspec", line 5, in <module>
    from seqspec.main import main
  File "/Users/choo/opt/anaconda3/envs/seqspec/lib/python3.11/site-packages/seqspec/main.py", line 4, in <module>
    from .seqspec_format import setup_format_args, validate_format_args
  File "/Users/choo/opt/anaconda3/envs/seqspec/lib/python3.11/site-packages/seqspec/seqspec_format.py", line 1, in <module>
    from seqspec.utils import load_spec
  File "/Users/choo/opt/anaconda3/envs/seqspec/lib/python3.11/site-packages/seqspec/utils.py", line 5, in <module>
    import requests
ModuleNotFoundError: No module named 'requests'

Fix:

pip install requests # requests 2.31.0 (pypi_0) installed
seqspec --help # launches as expected

Cheers!
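
A quick way to confirm which imports are missing before launching the CLI is sketched below. It is only an illustration: the helper name missing_modules and the list of module names passed to it are assumptions taken from the conda list output and the traceback above, not something seqspec ships.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    # find_spec returns None when a top-level module is not installed
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "yaml" is the import name of the pyyaml distribution
    print(missing_modules(["requests", "yaml", "jsonschema", "newick"]))
```

In an environment like the one listed above, this would be expected to report requests, since it is the only one of the four that conda list does not show.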

seqspec check is allowing different region_type

This might be due to the validator still being in a beta version but at the moment region_type doesn't have any constraints.
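
One possible shape for such a constraint is sketched below, working on plain dicts rather than seqspec objects. The function name invalid_region_types is hypothetical, and ALLOWED_REGION_TYPES is only an illustrative subset of region types mentioned on this page, not the official list from the specification.

```python
# Illustrative subset only: these names are region types mentioned on this
# page, NOT the official list from docs/SPECIFICATION.md.
ALLOWED_REGION_TYPES = {"barcode", "cdna", "gdna", "linker", "umi"}


def invalid_region_types(regions, allowed=ALLOWED_REGION_TYPES, path="spec['assay_spec']"):
    """Walk nested region dicts and report region_type values outside `allowed`."""
    errors = []
    for idx, region in enumerate(regions or []):
        here = f"{path}[{idx}]"
        region_type = region.get("region_type")
        if region_type not in allowed:
            errors.append(f"{region_type!r} is not an allowed region_type in {here}")
        # Recurse into child regions, if any
        errors.extend(
            invalid_region_types(region.get("regions"), allowed, f"{here}['regions']")
        )
    return errors
```

The same restriction could instead be expressed as an enum in the JSON schema; the function above just makes the intended behaviour concrete.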

package uses, but does not list, jsonschema in requirements.

The tox run tests fail without this fix, and honestly the pyyaml restriction is more restrictive than needed: seqspec check also works with 5.3.1.

--- a/requirements.txt
+++ b/requirements.txt
@@ -1 +1,2 @@
pyyaml==6.0
+jsonschema

Linker is not a region_type

In the example specs there is a region_type called linker1
https://github.com/IGVF/seqspec/blob/0d408a38cec4e632f85a20bd95c26c56ad1ac1dc/specs/SHARE-seq/spec.yaml#LL457C1-L457C1

But it doesn't seem to be an allowed type https://github.com/IGVF/seqspec/blob/main/docs/SPECIFICATION.md

What is the correct type to use?

Moreover, region_type has a number appended to it, but I assume that spec is outdated and it should be region_type: linker and region_id: linker-1.

using region.max_len for format and check is a problem for nanopore reads

It's unclear what the max length for a nanopore seqspec read should be.

I tried using 2 million, as that's what was reported; however, this causes problems for seqspec check, seqspec format, and keeping yaml files reasonably sized.

Currently the checks require that the sequence length match max_len as a fixed string. Needless to say, with a 2 million base pair max_len this leads to a 4 megabyte yaml file.

Some options might be to use the min length for sequence lengths, or to implement some kind of run-length encoding for the sequence strings instead of a massive list of Xs. Perhaps the sequence string could be something like X{2000000} instead.
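
A minimal sketch of what the X{2000000} idea could look like follows. The notation (a letter followed by a count in braces) and the function names expand_runs, compress_runs and encoded_length are assumptions made for illustration; nothing like this exists in seqspec as far as this issue shows.

```python
import re
from itertools import groupby

# A run is one letter followed by a repeat count in braces, e.g. X{2000000}
_RUN = re.compile(r"([A-Za-z])\{(\d+)\}")


def expand_runs(encoded):
    """Expand every LETTER{count} run into the repeated letter."""
    return _RUN.sub(lambda m: m.group(1) * int(m.group(2)), encoded)


def compress_runs(seq, min_run=4):
    """Replace runs of at least `min_run` identical letters with LETTER{count}."""
    parts = []
    for base, group in groupby(seq):
        count = sum(1 for _ in group)
        parts.append(f"{base}{{{count}}}" if count >= min_run else base * count)
    return "".join(parts)


def encoded_length(encoded):
    """Length of the expanded sequence, computed without expanding it."""
    total = 0
    pos = 0
    for m in _RUN.finditer(encoded):
        total += m.start() - pos  # literal characters before this run
        total += int(m.group(2))  # the run itself
        pos = m.end()
    return total + len(encoded) - pos
```

With something like encoded_length, a check could compare the sequence length against max_len without ever materialising two million characters.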

seqspec split error

seqspec split -o split spec.yaml

Traceback (most recent call last):
  File "/software/miniconda3/bin/seqspec", line 8, in <module>
    sys.exit(main())
  File "/software/miniconda3/lib/python3.9/site-packages/seqspec/main.py", line 82, in main
    COMMAND_TO_FUNCTION[sys.argv[1]](parser, args)
  File "/software/miniconda3/lib/python3.9/site-packages/seqspec/seqspec_split.py", line 35, in validate_split_args
    spec.sequencer,
AttributeError: 'Assay' object has no attribute 'sequencer'

I commented out the spec.assay in seqspec_split.py and now I am getting

Traceback (most recent call last):
  File "/software/miniconda3/bin/seqspec", line 8, in <module>
    sys.exit(main())
  File "/software/miniconda3/lib/python3.9/site-packages/seqspec/main.py", line 82, in main
    COMMAND_TO_FUNCTION[sys.argv[1]](parser, args)
  File "/software/miniconda3/lib/python3.9/site-packages/seqspec/seqspec_split.py", line 33, in validate_split_args
    spec_m = Assay(
TypeError: __init__() missing 5 required positional arguments: 'sequence_kit', 'library_protocol', 'library_kit', 'sequence_spec', and 'library_spec'

Feature request: grouping of sequencing runs

sequence_spec can only hold single read files. A hierarchy might be useful when data is resequenced and the new runs should be added to the existing seqspec file. Right now, the only possibility I see is concatenating fastq files, which wastes disk space and adds a processing step that could be avoided.

format chromap potentially is not deterministic as python set() does not have a guaranteed order

I had written a notebook to generate a bunch of seqspecs and I had decided to call run_index() to generate the index arguments to make it easier to review that the seqspec was generating the expected index values.

However, each time I ran the notebook, some of the chromap filename arguments moved around.

The diffs showing the inconsistent filename ordering are here:

https://github.com/detrout/y2ave_seqspecs/commits/main/all_arguments.tsv

I believe what's happening is that the order of the sets at https://github.com/pachterlab/seqspec/blob/main/seqspec/seqspec_index.py#L309 isn't guaranteed to be fixed:

    read1_fq = list(set(gdna_fqs))[0]
    read2_fq = list(set(gdna_fqs))[1]

However, I think its disorder comes from memory allocation, so it may depend on how much memory management is going on, and it may not show up unless the function is called in a loop after doing a bunch of parsing. (Worse, there's a slim chance read1_fq and read2_fq could end up equal if the two different calls to set() generated different orders.)

I wrote a function that is guaranteed to maintain the order of the fastqs and remove duplicates:

modified   seqspec/seqspec_index.py
@@ -285,6 +285,17 @@ def format_zumis(indices, subregion_type=None):
     return "\n".join(xl)[:-1]
 
 
+def stable_deduplicate_fqs(fqs):
+    # stably deduplicate gdna_fqs
+    seen_fqs = set()
+    deduplicated_fqs = []
+    for r in fqs:
+        if r not in seen_fqs:
+            deduplicated_fqs.append(r)
+            seen_fqs.add(r)
+    return deduplicated_fqs
+
+
 def format_chromap(indices, subregion_type=None):
     bc_fqs = []
     bc_str = []
@@ -306,8 +317,9 @@ def format_chromap(indices, subregion_type=None):
         raise Exception("chromap only supports genomic dna from two fastqs")
 
     barcode_fq = bc_fqs[0]
-    read1_fq = list(set(gdna_fqs))[0]
-    read2_fq = list(set(gdna_fqs))[1]
+    deduplicated_gdna_fqs = stable_deduplicate_fqs(gdna_fqs)
+    read1_fq = deduplicated_gdna_fqs[0]
+    read2_fq = deduplicated_gdna_fqs[1]
     read_str = ",".join([f"r{idx}:{ele}" for idx, ele in enumerate(gdna_str, 1)])
     bc_str = ",".join(bc_str)
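
A shorter way to get the same stable de-duplication relies on dict preserving insertion order, which is guaranteed since Python 3.7. This is only a sketch of an alternative to the patch above; the function name stable_unique and the example filenames are made up for illustration. String hashes are randomized per interpreter process unless PYTHONHASHSEED is set, which is one reason the order of a set of filenames can differ between runs.

```python
def stable_unique(items):
    """Drop duplicates while keeping the first occurrence of each item in order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


gdna_fqs = ["R1.fastq.gz", "R2.fastq.gz", "R1.fastq.gz", "R2.fastq.gz"]
read1_fq, read2_fq = stable_unique(gdna_fqs)[:2]
print(read1_fq, read2_fq)  # R1.fastq.gz R2.fastq.gz
```

Because the de-duplicated list is built once, read1_fq and read2_fq can never come from two differently ordered containers.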

getting an unexpected keyword argument error in seqspec init

I am trying to follow the tutorial here (https://github.com/pachterlab/seqspec/blob/main/docs/TUTORIAL.md) and am running into an issue. I have installed seqspec via pip (seqspec 0.2.0) and am running the command exactly as given in the tutorial:

seqspec init -n SPLiTSeq -m 1 -o spec.yaml "((P5:29,Spacer:8,Read_1_primer:33,cDNA:1098,RT_primer:15,Round_1_BC:8,linker_1:30,Round_2_BC:8,Linker_2:30,Round_3_BC:8,UMI:10,Read_2_primer:22,Round_4_BC:6,P7:24)rna)"

I am getting the error:

Traceback (most recent call last):
  File "/oak/stanford/groups/wjg/bgrd/bin/miniconda3/envs/IGVF_utils/bin/seqspec", line 8, in <module>
    sys.exit(main())
  File "/oak/stanford/groups/wjg/bgrd/bin/miniconda3/envs/IGVF_utils/lib/python3.8/site-packages/seqspec/main.py", line 82, in main
    COMMAND_TO_FUNCTION[sys.argv[1]](parser, args)
  File "/oak/stanford/groups/wjg/bgrd/bin/miniconda3/envs/IGVF_utils/lib/python3.8/site-packages/seqspec/seqspec_init.py", line 53, in validate_init_args
    spec = run_init(name, tree[0].descendants)
  File "/oak/stanford/groups/wjg/bgrd/bin/miniconda3/envs/IGVF_utils/lib/python3.8/site-packages/seqspec/seqspec_init.py", line 70, in run_init
    assay = Assay(
TypeError: __init__() got an unexpected keyword argument 'assay'

to_dict() should return the order parameter.

While I was writing about how the jsonschema validator works better with to_dict(), I remembered to check that order is actually being set in the schema files.

I think the actual reason for the order validation errors is that .to_dict() also needs to return the order property:

@@ -101,6 +99,7 @@ class Region(yaml.YAMLObject):
             "region_type": self.region_type,
             "name": self.name,
             "sequence_type": self.sequence_type,
+            "order": self.order,
             "onlist": self.onlist.to_dict() if self.onlist else None,
             "sequence": self.sequence,
             "min_len": self.min_len,

PIP-seq

Would be great to add a PIP-seq spec: https://www.nature.com/articles/s41587-023-01685-z

Please make an initial release

To ease wide adoption of this approach, we need a release as soon as possible. This way, it can for example be incorporated in Bioconda and Snakemake wrappers.

source file for pypi

Hi,

I want to put seqspec into bioconda. The easiest way is to go through pypi, and conda will then update automatically whenever there is a new pypi version.

For that, you also have to upload the source files to pypi. So after building your distribution you also need to build the source distribution. This can be done with python setup.py sdist, followed by uploading to pypi.

thanks!

Improvements

seqspec format verify md5sum of onlist
add container_type to assay, options are well, cell, shell
seqspec check should validate that the region_type/sequence type pairs make sense (not all are allowed), same with sequence
remove parent_id
add seqspec_path as a hidden attribute during the load_spec function call
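
For the first item, verifying the md5sum of an onlist could look roughly like the sketch below. The function names file_md5 and onlist_md5_matches are hypothetical, and the sketch assumes the onlist is a local file whose expected checksum is the md5 string stored in the spec.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def onlist_md5_matches(path, expected_md5):
    """True when the file's md5 equals the checksum recorded in the spec."""
    return file_md5(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat even for large onlists.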

DNA for modalities

I am trying to create a seqspec file for MPRAs. For the assignment sequencing, we sequence the genomic/synthetic regions we designed together with the BC each one is associated with. So I would say this is sequencing of a DNA modality. But seqspec only allows this:

'DNA' is not one of ['rna', 'tag', 'protein', 'atac', 'crispr'] in spec['modalities'][0]

None of them fits the modality in our case.

read seqspec from URI

It would be cool if the lib could read seqspec from a URI instead of just a local file path
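
A rough sketch of what that could look like with only the standard library is below. The function name read_text and the set of accepted schemes are assumptions; the returned text would still need to be handed to whatever YAML loading seqspec already does.

```python
from urllib.parse import urlparse
from urllib.request import urlopen

# Schemes treated as remote/URI input (an assumption for this sketch)
URI_SCHEMES = {"http", "https", "ftp", "file"}


def read_text(path_or_uri, encoding="utf-8"):
    """Return the text behind a local path or a URI."""
    scheme = urlparse(path_or_uri).scheme.lower()
    if scheme in URI_SCHEMES:
        with urlopen(path_or_uri) as response:
            return response.read().decode(encoding)
    # Anything else (including Windows drive letters) is treated as a local path
    with open(path_or_uri, encoding=encoding) as handle:
        return handle.read()
```

Calling read_text on a local path and on the file:// URI for that same path should return the same text, so existing local usage would keep working.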

10x-RNA-v3 barcode file

Hi,

I noticed that the 10x chromium v3 whitelist file included (737-august-2016) is the same as the v2 chemistry. Shouldn't it be the 3M-february-2018?

missing support for custom read primer definition

There is currently no support for custom read primers. Specifically, at least two assays that I am aware of (BioRad SureCell 3' WTA and ATACseq) use a custom read1 primer rather than the standard Illumina TruSeq or Nextera primers. The only currently supported region types that appear to indicate primers are truseq_read1/truseq_read2 and nextera_read1/nextera_read2. To incorporate seqspec into an automated pipeline that includes adapter trimming (for example), it would need to support designation of custom primer types as well (such as the more generic "read1_primer" and "read2_primer" designations currently used in the SureCell seqspec).

Draft4Validator.iter_errors is expecting a dictionary

When running seqspec check, the result is:

python3 -m seqspec.main check ./assays/BD-Rhapsody-EB/spec.yaml                        
[error 1] {'name': 'BD-Rhapsody-EB', 'doi': 'https://scomix.bd.com/hc/en-us/articles/6990647359501-Rhapsody-WTA-De
mo-Datasets-with-Enhanced-Cell-Capture-Beads', 'publication_date': '31 August 2022', 'description': 'BD Rhapsody W
TA is a nanowell-based commercial system that uses a split-pool (Enahnced Beads-v2) approach to generate oligos on
 magnetic beads.', 'modalities': ['RNA'], 'lib_struct': 'https://teichlab.github.io/scg_lib_structs/methods_html/B
D_Rhapsody.html', 'assay_spec': [{'region_id': 'RNA', 'region_type': 'RNA', 'name': 'RNA', 'sequence_type': 'joine
d', 'onlist': None, 'sequence': 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTXNNNNNNNNNGTGANNNNNNNNN
GACANNNNNNNNNNNNNNNNNXXAGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG', 'min_len': 169, 'max_l
en': 366, 'regions': [{'region_id': 'illumina_p7', 'region_type': 'illumina_p7', 'name': 'illumina_p7', 'sequence_
type': 'fixed', 'onlist': None, 'sequence': 'AATGATACGGCGACCACCGAGATCTACAC', 'min_len': 29, 'max_len': 29, 'region
s': None}, {'region_id': 'truseq_r1', 'region_type': 'truseq_r1', 'name': 'truseq_r1', 'sequence_type': 'fixed', '
onlist': None, 'sequence': 'TCTTTCCCTACACGACGCTCTTCCGATCT', 'min_len': 29, 'max_len': 29, 'regions': None}, {'regi
on_id': 'vb', 'region_type': 'vb', 'name': 'vb', 'sequence_type': 'onlist', 'onlist': {'filename': 'vb_onlist.txt'
, 'md5': None}, 'sequence': 'X', 'min_len': 0, 'max_len': 3, 'regions': None}, {'region_id': 'cls1', 'region_type'
: 'cls1', 'name': 'cls1', 'sequence_type': 'onlist', 'onlist': {'filename': 'cls1_onlist.txt', 'md5': None}, 'sequ
ence': 'NNNNNNNNN', 'min_len': 9, 'max_len': 9, 'regions': None}, {'region_id': 'linker1', 'region_type': 'linker1
', 'name': 'linker1', 'sequence_type': 'fixed', 'onlist': None, 'sequence': 'GTGA', 'min_len': 4, 'max_len': 4, 'r
egions': None}, {'region_id': 'cls2', 'region_type': 'cls2', 'name': 'cls2', 'sequence_type': 'onlist', 'onlist': 
{'filename': 'cls2_onlist.txt', 'md5': None}, 'sequence': 'NNNNNNNNN', 'min_len': 9, 'max_len': 9, 'regions': None
}, {'region_id': 'linker2', 'region_type': 'linker2', 'name': 'linker2', 'sequence_type': 'fixed', 'onlist': None,
 'sequence': 'GACA', 'min_len': 4, 'max_len': 4, 'regions': None}, {'region_id': 'cls3', 'region_type': 'cls3', 'n
ame': 'cls3', 'sequence_type': 'onlist', 'onlist': {'filename': 'cls3_onlist.txt', 'md5': None}, 'sequence': 'NNNN
NNNNN', 'min_len': 9, 'max_len': 9, 'regions': None}, {'region_id': 'umi', 'region_type': 'umi', 'name': 'umi', 's
equence_type': 'random', 'onlist': None, 'sequence': 'NNNNNNNN', 'min_len': 8, 'max_len': 8, 'regions': None}, {'r
egion_id': 'polyT', 'region_type': 'polyT', 'name': 'polyT', 'sequence_type': 'random', 'onlist': None, 'sequence'
: 'X', 'min_len': 1, 'max_len': 98, 'regions': None}, {'region_id': 'cdna', 'region_type': 'cdna', 'name': 'cdna',
 'sequence_type': 'random', 'onlist': None, 'sequence': 'X', 'min_len': 1, 'max_len': 98, 'regions': None}, {'regi
on_id': 'truseq_r2', 'region_type': 'truseq_r2', 'name': 'truseq_r2', 'sequence_type': 'fixed', 'onlist': None, 's
equence': 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC', 'min_len': 34, 'max_len': 34, 'regions': None}, {'region_id': 'sam
ple_index', 'region_type': 'sample_index', 'name': 'sample_index', 'sequence_type': 'onlist', 'onlist': {'filename
': 'sample_index_onlist.txt', 'md5': None}, 'sequence': 'NNNNNNNN', 'min_len': 8, 'max_len': 8, 'regions': None}, 
{'region_id': 'illumina_p7', 'region_type': 'illumina_p7', 'name': 'illumina_p7', 'sequence_type': 'fixed', 'onlis
t': None, 'sequence': 'ATCTCGTATGCCGTCTTCTGCTTG', 'min_len': 24, 'max_len': 24, 'regions': None}]}]} is not of typ
e 'object' in spec[]

After applying this patch the error messages look quite a bit more plausible.

--- a/seqspec/seqspec_check.py
+++ b/seqspec/seqspec_check.py
@@ -39,9 +39,8 @@ def validate_check_args(parser, args):
 
 
 def run_check(schema, spec):
-
     v = Draft4Validator(schema)
-    for idx, error in enumerate(v.iter_errors(spec), 1):
+    for idx, error in enumerate(v.iter_errors(spec.to_dict()), 1):
         print(
             f"[error {idx}] {error.message} in spec[{']['.join(repr(index) for index in error.path)}]"
         )

It now lists many more errors.

Though maybe some of the attributes could also be optional?

As a guess, order might be a good candidate for either being optional, having validation code added, or having the elements in the list reordered to match it. (I bet the Stanford DACC might be able to help with the jsonschema.)

[error 1] 'order' is a required property in spec['assay_spec'][0]['regions'][0]
[error 2] 'order' is a required property in spec['assay_spec'][0]['regions'][1]
[error 3] None is not of type 'string' in spec['assay_spec'][0]['regions'][2]['onlist']['md5']
[error 4] 'order' is a required property in spec['assay_spec'][0]['regions'][2]
[error 5] None is not of type 'string' in spec['assay_spec'][0]['regions'][3]['onlist']['md5']
[error 6] 'order' is a required property in spec['assay_spec'][0]['regions'][3]
[error 7] 'order' is a required property in spec['assay_spec'][0]['regions'][4]
[error 8] None is not of type 'string' in spec['assay_spec'][0]['regions'][5]['onlist']['md5']
[error 9] 'order' is a required property in spec['assay_spec'][0]['regions'][5]
[error 10] 'order' is a required property in spec['assay_spec'][0]['regions'][6]
[error 11] None is not of type 'string' in spec['assay_spec'][0]['regions'][7]['onlist']['md5']
[error 12] 'order' is a required property in spec['assay_spec'][0]['regions'][7]
[error 13] 'order' is a required property in spec['assay_spec'][0]['regions'][8]
[error 14] 'order' is a required property in spec['assay_spec'][0]['regions'][9]
[error 15] 'order' is a required property in spec['assay_spec'][0]['regions'][10]
[error 16] 'order' is a required property in spec['assay_spec'][0]['regions'][11]
[error 17] None is not of type 'string' in spec['assay_spec'][0]['regions'][12]['onlist']['md5']
[error 18] 'order' is a required property in spec['assay_spec'][0]['regions'][12]
[error 19] 'order' is a required property in spec['assay_spec'][0]['regions'][13]
[error 20] 'order' is a required property in spec['assay_spec'][0]

With a seqspec that has only one modality it seems like the library png plot is too short.

I tried generating some example library structure plots, and for a seqspec with only one modality the bar height was flattened. There was also a warning:

~/proj/seqspec/seqspec/seqspec_print.py:94: UserWarning: constrained_layout not applied because axes sizes collapsed to zero.  Try making figure larger or axes decorations smaller.
  s.savefig(o, dpi=300, bbox_inches="tight")

I was using this seqspec https://github.com/detrout/y2ave_seqspecs/blob/main/Team_7_igvf_b01_LeftCortex_13A_seqspec.yaml
with the command:

python3 -m seqspec.main print -o parse-bridge-library.png -f png Team_7_igvf_b01_LeftCortex_13A_seqspec.yaml

and got this plot.

With other libraries that had multiple modalities, the bar height was larger than the region length text.

index string should take into account strand information

This affects chromap's read_format, but could affect other formats for other tools as well. The chromap index string needs to take into account strand information, and add a "-" at the end of the string when the strand is neg.

For example, for this seqspec, the correct index string to align the fastqs using chromap is bc:8:23:-,r1:0:49,r2:0:49. Current implementation will return bc:8:23,r1:0:49,r2:0:49 (missing the extra ":-" at end of bc string).
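
A small sketch of the requested behaviour, using the numbers from the example above. The function names chromap_segment and chromap_read_format are hypothetical, and the strand values "pos" and "neg" are an assumption about how strand is stored in the spec.

```python
def chromap_segment(label, start, stop, strand="pos"):
    """Build one label:start:stop segment, appending ':-' for the negative strand."""
    segment = f"{label}:{start}:{stop}"
    if strand == "neg":
        segment += ":-"
    return segment


def chromap_read_format(segments):
    """Join (label, start, stop, strand) tuples into one read-format string."""
    return ",".join(chromap_segment(*segment) for segment in segments)


print(chromap_read_format([("bc", 8, 23, "neg"), ("r1", 0, 49, "pos"), ("r2", 0, 49, "pos")]))
# bc:8:23:-,r1:0:49,r2:0:49
```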

getting not unique across all regions error but region_id is unique

I am getting the following errors on my file:

[error 1] IGVF_neuro_S1_R2_001.fastq.gz does not exist
[error 2] IGVF_neuro_S1_R1_001.fastq.gz does not exist
[error 3] IGVF_neuro_S1_R3_001.fastq.gz does not exist
[error 4] IGVF_neuro_S1_R2_001.fastq.gz does not exist
[error 5] IGVF_neuro_S1_R1_001.fastq.gz does not exist
[error 6] IGVF_neuro_S1_R3_001.fastq.gz does not exist
[error 7] IGVF_neuro_S1_R2_001.fastq.gz does not exist
[error 8] IGVF_neuro_S1_R1_001.fastq.gz does not exist
[error 9] IGVF_neuro_S1_R3_001.fastq.gz does not exist
[error 10] region_id 'IGVF_neuro_S1_R2_001.fastq.gz' is not unique across all regions
[error 11] region_id 'adapter_fwd' is not unique across all regions
[error 12] region_id 'IGVF_neuro_S1_R1_001.fastq.gz' is not unique across all regions
[error 13] region_id 'IGVF_neuro_S1_R3_001.fastq.gz' is not unique across all regions
[error 14] region_id 'adapter_rev' is not unique across all regions
[error 15] region_id 'IGVF_neuro_S1_R2_001.fastq.gz' is not unique across all regions
[error 16] region_id 'adapter_fwd' is not unique across all regions
[error 17] region_id 'IGVF_neuro_S1_R1_001.fastq.gz' is not unique across all regions
[error 18] region_id 'IGVF_neuro_S1_R3_001.fastq.gz' is not unique across all regions
[error 19] region_id 'adapter_rev' is not unique across all regions

I cannot explain errors 10 to 19 because the region_ids are unique.

Furthermore, errors 1 to 9 complain about missing files. But then they should also mention Ngn2-RNA-1_S4_R1_001.fastq.gz, Ngn2-RNA-1_S4_R2_001.fastq.gz, Ngn2-RNA-1_S4_R3_001.fastq.gz, Ngn2-DNA-1_S1_R1_001.fastq.gz, Ngn2-DNA-1_S1_R2_001.fastq.gz and Ngn2-DNA-1_S1_R3_001.fastq.gz, because those are also not present.
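
To double-check uniqueness independently of seqspec check, something like the sketch below could be run over the nested regions. It works on plain dicts with region_id and regions keys (for instance after converting a loaded spec to a dict); the function names iter_region_ids and duplicate_region_ids are made up here and are not part of seqspec.

```python
from collections import Counter


def iter_region_ids(regions):
    """Yield every region_id in a nested list of region dicts, depth first."""
    for region in regions or []:
        yield region["region_id"]
        yield from iter_region_ids(region.get("regions"))


def duplicate_region_ids(assay_spec):
    """Return the sorted region_ids that occur more than once in the tree."""
    counts = Counter(iter_region_ids(assay_spec))
    return sorted(region_id for region_id, n in counts.items() if n > 1)
```

If this returns an empty list for a spec, any "not unique" message would have to come from how the checker walks the tree rather than from the file itself.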

My file:

!Assay
seqspec_version: 0.0.0
assay: "MPRA"
sequencer: "TODO"
name: mpra_shendure_assignment_80K
doi: ""
publication_date: ""
description: "Assignment library of the MPRA 80K design (caridac, neuro and random CREs)"
modalities:
  - rna # FIXME to DNA
  - rna # FIXME to DNA
  - rna
lib_struct: ""
assay_spec:
  - !Region
    parent_id: null
    region_id: assignment
    region_type: gdna # FIXME to DNA
    name: Assignment
    sequence_type: random
    sequence: X
    min_len: 0
    max_len: 1024
    onlist: null
    regions:
      - !Region
        parent_id: assignment
        region_id: barcode
        region_type: barcode # or tag?
        name: Barcode
        sequence_type: random # can in theory be onlist, but this will be a long list with all possible combinations
        sequence: XXXXXXXXXXXXXXX
        min_len: 15
        max_len: 15
        onlist: null # or filename of all possible combinations
        regions:
          - !Region
            parent_id: barcode
            region_id: IGVF_neuro_S1_R2_001.fastq.gz
            region_type: fastq # or tag?
            name: IGVF_neuro_S1_R2_001.fastq.gz
            sequence_type: random # can in theory be onlist, but this will be a long list with all possible combinations
            sequence: XXXXXXXXXXXXXXX
            min_len: 15
            max_len: 15
            onlist: null # or filename of all possible combinations
            regions: null
      - !Region
        parent_id: assignment
        region_id: oligo
        region_type: gdna # FIXME to dna
        name: Oligo sequence
        sequence_type: onlist
        sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
        min_len: 300
        max_len: 300
        onlist: !Onlist
          filename: /fast/groups/ag_kircher/work/MPRA/IGVF_Y1_design/final_design/results/final_design/design.fa.gz
          location: local
          md5: 5a34f80819cc26f33f641c9aad70be09
        regions:
          - !Region
            parent_id: oligo
            region_id: adapter_fwd
            region_type: linker # FIXME to adapter
            name: Forward adapter
            sequence_type: fixed
            sequence: AGGACCGGATCAACT
            min_len: 15
            max_len: 15
            onlist: null
            regions: null
          - !Region
            parent_id: oligo
            region_id: designed_sequence
            region_type: gdna # FIXME to dna
            name: Designed oligo sequence for testing
            sequence_type: onlist # or onlist because we knwo the design
            sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
            min_len: 270
            max_len: 270
            onlist: !Onlist
              filename: /fast/groups/ag_kircher/work/MPRA/IGVF_Y1_design/final_design/results/final_design/design.fa.gz
              location: local
              md5: 5a34f80819cc26f33f641c9aad70be09
            regions:
              - !Region
                parent_id: designed_sequence
                region_id: IGVF_neuro_S1_R1_001.fastq.gz
                region_type: fastq
                name: IGVF_neuro_S1_R1_001.fastq.gz
                sequence_type: random
                sequence: X
                min_len: 1
                max_len: 146
                onlist: null
                regions: null
              - !Region
                parent_id: designed_sequence
                region_id: IGVF_neuro_S1_R3_001.fastq.gz
                region_type: fastq
                name: IGVF_neuro_S1_R3_001.fastq.gz
                sequence_type: random
                sequence: X
                min_len: 1
                max_len: 146
                onlist: null
                regions: null
          - !Region
            parent_id: assignment
            region_id: adapter_rev
            region_type: linker # FIXME to adapter
            name: Reverse adapter
            sequence_type: fixed
            sequence: CATTGCGTGAACCGA
            min_len: 15
            max_len: 15
            onlist: null
            regions: null
  - !Region
    parent_id: null
    region_id: dna_count_library
    region_type: cdna # or tag?
    name: DNA counts library
    sequence_type: random
    sequence: X
    min_len: 1
    max_len: 31
    onlist: null
    regions:
      - !Region
        parent_id: dna_count_library
        region_id: dna_counts
        region_type: barcode # or tag?
        name: DNA counts
        sequence_type: random
        sequence: XXXXXXXXXXXXXXX
        min_len: 15
        max_len: 15
        onlist: null
        regions:
          - !Region
            parent_id: dna_counts
            region_id: Ngn2-DNA-1_S1_R1_001.fastq.gz
            region_type: fastq
            name: Ngn2-DNA-1_S1_R1_001.fastq.gz
            sequence_type: random
            sequence: XXXXXXXXXXXXXXX
            min_len: 15
            max_len: 15
            onlist: null
            regions: null
          - !Region
            parent_id: dna_counts
            region_id: Ngn2-DNA-1_S1_R3_001.fastq.gz
            region_type: fastq # or tag or bc
            name: Ngn2-DNA-1_S1_R3_001.fastq.gz
            sequence_type: random
            sequence: XXXXXXXXXXXXXXX
            min_len: 15
            max_len: 15
            onlist: null
            regions: null
      - !Region
        parent_id: dna_count_library
        region_id: dna_umis
        region_type: umi
        name: DNA UMIs
        sequence_type: random
        sequence: XXXXXXXXXXXXXXXX
        min_len: 16
        max_len: 16
        onlist: null
        regions:
          - !Region
            parent_id: dna_counts
            region_id: Ngn2-DNA-1_S1_R2_001.fastq.gz
            region_type: fastq # or tag or bc
            name: Ngn2-DNA-1_S1_R2_001.fastq.gz
            sequence_type: random
            sequence: XXXXXXXXXXXXXXX
            min_len: 15
            max_len: 15
            onlist: null
            regions: null
  - !Region
    parent_id: null
    region_id: rna_count_library
    region_type: cdna # or tag?
    name: RNA counts library
    sequence_type: random
    sequence: X
    min_len: 1
    max_len: 31
    onlist: null
    regions:
      - !Region
        parent_id: rna_count_library
        region_id: rna_counts
        region_type: barcode # or tag?
        name: DNA counts
        sequence_type: random
        sequence: XXXXXXXXXXXXXXX
        min_len: 15
        max_len: 15
        onlist: null
        regions:
          - !Region
            parent_id: rna_counts
            region_id: Ngn2-RNA-1_S4_R1_001.fastq.gz
            region_type: fastq
            name: Ngn2-RNA-1_S4_R1_001.fastq.gz
            sequence_type: random
            sequence: XXXXXXXXXXXXXXX
            min_len: 15
            max_len: 15
            onlist: null
            regions: null
          - !Region
            parent_id: rna_counts
            region_id: Ngn2-RNA-1_S4_R3_001.fastq.gz
            region_type: fastq
            name: Ngn2-RNA-1_S4_R3_001.fastq.gz
            sequence_type: random
            sequence: XXXXXXXXXXXXXXX
            min_len: 15
            max_len: 15
            onlist: null
            regions: null
      - !Region
        parent_id: rna_count_library
        region_id: rna_umis
        region_type: umi
        name: DNA UMIs
        sequence_type: random
        sequence: XXXXXXXXXXXXXXXX
        min_len: 16
        max_len: 16
        onlist: null
        regions:
          - !Region
            parent_id: dna_counts
            region_id: Ngn2-RNA-1_S4_R2_001.fastq.gz
            region_type: fastq
            name: Ngn2-RNA-1_S4_R2_001.fastq.gz
            sequence_type: random
            sequence: XXXXXXXXXXXXXXXX
            min_len: 16
            max_len: 16
            onlist: null
            regions: null

Print not working on Windows

!seqspec print broad_human_jamboree_test_spec.yaml

returns:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\eugen\AppData\Local\Programs\Python\Python311\Scripts\seqspec.exe\__main__.py", line 7, in <module>
  File "C:\Users\eugen\AppData\Local\Programs\Python\Python311\Lib\site-packages\seqspec\main.py", line 68, in main
    COMMAND_TO_FUNCTION[sys.argv[1]](parser, args)
  File "C:\Users\eugen\AppData\Local\Programs\Python\Python311\Lib\site-packages\seqspec\seqspec_print.py", line 45, in validate_print_args
    print(s)
  File "C:\Users\eugen\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-10: character maps to <undefined>

Apparently this bug is due to this:
https://stackoverflow.com/questions/27092833/unicodeencodeerror-charmap-codec-cant-encode-characters

Works on Mac and Linux but not in a Jupyter notebook running under Windows.

Tested using the spec.yaml found in the examples on GitHub.
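
One possible workaround is sketched below: encode the text with the output stream's own encoding and replace anything it cannot represent, so printing degrades instead of raising. The helper name safe_print is hypothetical, and the assumption is that the string being printed contains characters outside cp1252.

```python
import sys


def safe_print(text, stream=None):
    """Print text, replacing characters the stream's encoding cannot represent."""
    if stream is None:
        stream = sys.stdout
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # Round-trip through the target encoding; unencodable characters become '?'
    printable = text.encode(encoding, errors="replace").decode(encoding)
    print(printable, file=stream)
```

Alternatively, on Python 3.7+ calling sys.stdout.reconfigure(encoding="utf-8") at startup, or setting the PYTHONIOENCODING environment variable to utf-8, may avoid the replacement entirely when stdout is a regular text stream.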

predefined `sequence_protocol`

A list of sequence_protocols can be found here:

https://www.ebi.ac.uk/ols4/ontologies/efo/classes/http%253A%252F%252Fwww.ebi.ac.uk%252Fefo%252FEFO_0002699?lang=en

Add them to the spec so that sequence_protocol can be checked against them.

Sequencing data repositories which accept seqspec?

Hi there,

Is there any list of sequencing data repositories which accept seqspec? (realizing that seqspec is very new, and direct support is unlikely at this stage, but maybe there are some best practices even at this point?)

For example, does NCBI SRA accept seqspec?
