yanailab / cel-seq-pipeline

License: GNU General Public License v3.0

Python 100.00%

cel-seq-pipeline's Issues

Optimize bc_demultiplex

Embarrassingly, bc_demultiplex can be the most time-consuming step in the pipeline.
We should profile it to find the causes and then optimize. Ideally, we should benchmark it on real-life data.

After profiling, possible solutions could be:

Simple optimizations of inner loops in Python. Sometimes even assigning local variables can make a difference.
Stronger optimizations of inner loops - maybe use a C-based package, or Cython
Parallelize.
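
As a starting point for the profiling step, here is a minimal sketch using the standard library's cProfile and pstats. The function demultiplex_stub is a hypothetical stand-in for the real inner loop (it is not the pipeline's code), and the 6-base barcode length is an assumption made only for the example.

```python
import cProfile
import io
import pstats


def demultiplex_stub(reads, barcodes):
    """Hypothetical stand-in for the bc_demultiplex inner loop."""
    counts = {}
    lookup = barcodes.get  # local alias avoids repeated attribute lookups
    for read in reads:
        # Assumption for this sketch: the barcode is the first 6 bases.
        sample = lookup(read[:6], "undetermined")
        counts[sample] = counts.get(sample, 0) + 1
    return counts


def profile_call(func, *args, top=10):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


if __name__ == "__main__":
    barcodes = {"AAAAAA": "cell_1", "CCCCCC": "cell_2"}
    reads = ["AAAAAATTTT", "CCCCCCGGGG", "GGGGGGAAAA"] * 1000
    counts, report = profile_call(demultiplex_stub, reads, barcodes)
    print(counts)
    print(report)
```

Running the same wrapper around the real entry point on a real FASTQ pair would show whether the time goes to parsing, quality checks, or file writes before any optimization is attempted.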

@flo-compbio , @gufranca

install of CEL-Seq-pipeline

Hello,

Thank you so much for developing such nice software!

I downloaded CEL-Seq-pipeline-stable.zip from https://github.com/yanailab/CEL-Seq-pipeline
and unzipped it. Running the help command fails:

li@lx:~/CEL-Seq-pipeline-stable$ python pijpleiding.py --help
Traceback (most recent call last):
  File "pijpleiding.py", line 31, in <module>
    import bowtie_wrapper, bc_demultiplex, htseq_wrapper, clean_up
  File "/home/li/CEL-Seq-pipeline-stable/bc_demultiplex.py", line 22, in <module>
    from HTSeq import FastqReader, SequenceWithQualities
ImportError: No module named HTSeq

My Python version is:

Python 2.7.15+

I am working on an Ubuntu system.

Thank you in advance for your help. I really appreciate it!

Best,

Yue

We should add the GPL license of HTSeq-count to the repo

install of celseq2

Hello,

Thank you for developing such nice software!

I tried to install celseq2 on my Ubuntu system.

~/celseq2-master$ pip install ./
Defaulting to user installation because normal site-packages is not writeable
Processing /home/li/celseq2-master

But it reports:

ERROR: xlmhg 2.5.4 has requirement plotly>=3, but you'll have plotly 2.7.0 which is incompatible.

Thanks in advance for any help!

Best,

Yue

missing config file

When the config file you refer to does not exist, you get a generic "missing section" error with no clear indication of the actual problem.
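
One possible fix is to check for the file before parsing. The sketch below assumes the config is an INI-style file read with the standard library's configparser (an assumption, not confirmed here); load_config is a hypothetical helper name. ConfigParser.read() silently skips files it cannot open, which would explain why only a later "missing section" error surfaces.

```python
import configparser
import os


def load_config(path):
    """Read an INI-style config, failing loudly if the file is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError("Config file not found: %s" % path)
    parser = configparser.ConfigParser()
    # ConfigParser.read() skips files it cannot open and returns the list
    # of files it actually read, so an empty list means nothing was loaded.
    if not parser.read(path):
        raise IOError("Config file could not be read: %s" % path)
    return parser
```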

Unable to redirect htseq_count_umified.py to a file

I want to use htseq_count_umified.py as a standalone script to generate count files for individual samples. I redirected the output as follows, as I usually would for the standard htseq-count script, but nothing was written to the output file. The script's progress report showed that both the GTF and SAM files were processed completely, with no additional errors.

python htseq_count_umified.py \
-a 30 \
-u \
sample1.sam \
$gtf \
> sample1.count

I tried omitting the redirection as well, but there was still no output file generated. Is there any reason why this is happening? And what is the correct way to write the output files for this script?

This was done on Python 2.7.12, Biopython 1.68, and HTSeq 0.6.1p1.

Thanks,
Mei San

CEL-seq matrix to seurat

I used the CEL-seq pipeline to create an expression matrix. It comes from a 384-well SORT-seq (CEL-seq2) experiment. Now I would like to feed it to Seurat for downstream analysis.
Do you have any experience with this? What do I need to change in the format of the matrix, and what other files do I need to feed into Seurat?
If you have any other suggestions for the downstream analysis, please let me know. Any guidance is appreciated.

Thanks a lot!

Sample sheet?

I have a question about how to properly fill out the sample sheet to run CEL-Seq. The repository has an example sample sheet which I am using as a template. It has the fields ID, flocell, series, lane, il_barcode, cel_barcode, and project. I'm assuming il_barcode is the Illumina barcode and cel_barcode is the UMI barcode, but the example sample sheet has integer values for those fields and I was wondering what they mean. When the example has "4" for il_barcode, what does that 4 represent? What do the values 1-18 in cel_barcode represent? And what does series mean? Also, can ID and project be anything as long as they are different?

Thanks,
David

Fix `bc_demultiplex`'s wrong assumptions and filename dependencies

bc_demultiplex was written based on wrong assumptions and should be fixed.
Basically, the reads should be identified by their FASTQ records, not by the filename. The flow cell ID and other fields in each record give the same information as the filename, but in a more robust and expected way. Making assumptions about filenames is non-standard and confusing.

This will require changing the sample sheets.
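
To illustrate reading these fields from the record itself, here is a minimal sketch that assumes CASAVA 1.8+ style read names of the form @instrument:run:flowcell:lane:tile:x:y. The names ReadOrigin and parse_illumina_header are hypothetical, and older Illumina header formats would need separate handling.

```python
from collections import namedtuple

ReadOrigin = namedtuple("ReadOrigin", "instrument run flowcell lane")


def parse_illumina_header(header):
    """Extract flow cell and lane from a CASAVA 1.8+ style FASTQ header."""
    # Drop the leading "@" and the optional comment after the first space.
    name = header.lstrip("@").partition(" ")[0]
    fields = name.split(":")
    if len(fields) < 7:
        raise ValueError("Not a CASAVA 1.8+ read name: %r" % header)
    instrument, run, flowcell, lane = fields[:4]
    return ReadOrigin(instrument, run, flowcell, int(lane))


if __name__ == "__main__":
    example = "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
    print(parse_illumina_header(example))
```

With this approach the sample sheet would be matched on the flow cell and lane carried by each read, so renaming or merging FASTQ files would not change the assignment.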

@flo-compbio

Htseq-count on empty SAM files

When one of the SAM files is empty (contains only a header), we get an "Iteration error" and no output.
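
One way to guard against this is to test for alignment records before counting. The sketch below relies only on the SAM convention that header lines start with "@"; sam_has_alignments is a hypothetical helper, not part of the pipeline.

```python
def sam_has_alignments(path):
    """Return True if the SAM file has at least one non-header line."""
    with open(path) as handle:
        for line in handle:
            # Header lines start with "@"; anything else is an alignment.
            if line.strip() and not line.startswith("@"):
                return True
    return False
```

A wrapper could call this first and, for a header-only file, either skip the sample with a warning or write an all-zero count file.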

Ambiguous error message when SAM file contains no UMI in htseq-count-umified

If one runs htseq-count-umified with umi=true, but without UMIs in the SAM file, the error message is quite unclear:

Error occured when processing SAM input (line 97460 of file SAMFILE):
Traceback (most recent call last):
  File "/illumina1/YanaiLab/new_fs/tools/bin/pijpleiding_stable", line 129, in <module>
    main(args.config_file)
  File "/illumina1/YanaiLab/new_fs/tools/bin/pijpleiding_stable", line 93, in main
    segment(**parameters)
  File "/illumina1/YanaiLab/new_fs/tools/pipeline_stable/htseq_wrapper.py", line 82, in main
    results = pool.map(run_cmd, cmds)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 250, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
AttributeError: 'NoneType' object has no attribute 'group'

That is because the relevant code block ( https://github.com/yanailab/CEL-Seq-pipeline/blob/stable/htseq_count_umified.py#L50 ) assumes that a UMI is always matched.

A clearer error message should appear.
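
A sketch of the kind of check that would produce a clearer message. The regular expression here is hypothetical (the real pattern lives in htseq_count_umified.py and may differ), as is the function name extract_umi; only the None check is the point.

```python
import re

# Hypothetical pattern for this sketch; the pipeline's actual read-name
# format for UMIs is not shown in this issue.
UMI_PATTERN = re.compile(r":UMI:([ACGTN]+)")


def extract_umi(read_name, line_number=None):
    """Return the UMI from a read name or raise a descriptive error."""
    match = UMI_PATTERN.search(read_name)
    if match is None:
        where = "" if line_number is None else " (SAM line %d)" % line_number
        raise ValueError(
            "No UMI found in read name %r%s. Was the input demultiplexed "
            "with UMIs, or should UMI counting be disabled?"
            % (read_name, where)
        )
    return match.group(1)
```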

All the test_out files are 0 bytes

Hello,

I tried to use bc_demultiplex from https://github.com/yanailab/CEL-Seq-pipeline.

All the output files are 0 bytes.

Thanks in advance for any help!

Best,

Yue

li@Desktop-590-p0xxx:~/CEL-Seq-pipeline-stable$ bc_demultiplex --out-dir test_out --bc-index tmp01 --not-gzip R1.fq R2.fq
[ Sat Mar  7 00:42:58 2020 ] Demultiplexing starts R1.fq--R2.fq ...

[ Sat Mar  7 00:42:58 2020 ] Demultiplexing ends R1.fq--R2.fq.

R1.zip
R2.zip
tmp01.zip

Pipe run

I'm trying to understand your pipeline. Is there a place with more information than the README and the papers? For instance, what is the "pipe_run" parameter?

How does this work?

Hi guys,

Unfortunately, I am having trouble understanding how the pipeline can be adapted to our dataset. For some initial tests I wanted to use bc_demultiplex.py, but I don't understand what the sample sheet should look like or how the files need to be named. So, to start from the beginning:
We have 5 libraries (of ~20 tissue slices each, but I guess this is irrelevant), which were pooled and sequenced together on a HiSeq. We used barcodes:

and also UMIs, of course.
What we got is 5 Illumina datasets, Lib_1 ... Lib-5, each with two read sets, e.g.
Lib-2_S2_L002_R2_001.fastq.gz Lib-2_S2_L001_R1_001.fastq.gz Lib-2_S2_L001_R2_001.fastq.gz Lib-2_S2_L002_R1_001.fastq.gz.
I cleaned these with Trimmomatic (I think this is necessary in this case, since the sequencing quality is not great). My reads are now named, e.g.
Lib-4_S4_L002_reverse_paired.fq.gz Lib-4_S4_L002_forward_paired.fq.gz Lib-4_S4_L001_reverse_paired.fq.gz Lib-4_S4_L001_forward_paired.fq.gz.
But this can of course be changed.

Now my simple questions are:

How do I need to lay out the bc_index file? (I suspect like the barcode_umis.tab file, right?)
How do I need to lay out the sample sheet?
Would it make sense to just combine all reads into one giant FASTQ file?

Thanks for your help!

Cheers

Philipp

bc_demultiplex.py fails to assign reads if sample ID contains "_"

If fastq files are input into bc_demultiplex.py that contain an underscore in the sample id, all reads will be assigned to undetermined, since the "lane" field (as set on line 49) will not match the lane field in the sample sheet. I believe the cause of this issue is line 47. While the workaround (remove underscores) is fairly simple once you know the issue, it's not immediately obvious that this is the cause.

Steps to Reproduce

Run Illumina's bcl2fastq (v2.17) on Illumina NextSeq data, with a SampleSheet.csv where the first column (Sample_ID) has a value containing an underscore (e.g. "Group_1a"). This will produce output files in the format "Group_1a_S1_L001_R1_001.fastq.gz".
Run bc_demultiplex.py (bc_demultiplex.py --min-bc-quality 0 --out-dir test_out index_file.txt Group_1a_sample_sheet.txt test_raw/Group_1a_S1_L00*R1*), where the first sample line of the sample_sheet is like 1 A1 [flowcell] L001 Group_1a 1 Group_1a [project].

Expected Result
Reads are assigned to sample fastq files.

Actual Result
All reads are assigned to undetermined.

All this is on RHEL 6.8, with Python 2.7.6
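
One way to avoid depending on the number of underscores is to match the fixed bcl2fastq suffix from the right-hand end of the name. This is a minimal sketch, assuming the "<sample>_S<n>_L<lane>_R<read>_001.fastq.gz" naming shown above; parse_bcl2fastq_name is a hypothetical helper and not the code currently in bc_demultiplex.py.

```python
import os
import re

# The suffix is anchored at the end, so the sample ID may contain "_".
BCL2FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>[RI][12])_001\.fastq(?:\.gz)?$"
)


def parse_bcl2fastq_name(path):
    """Split a bcl2fastq output file name into its named fields."""
    match = BCL2FASTQ_NAME.match(os.path.basename(path))
    if match is None:
        raise ValueError("Unexpected FASTQ file name: %s" % path)
    fields = match.groupdict()
    # Report the lane in the "L001" form used in the sample sheet example.
    fields["lane"] = "L" + fields["lane"]
    return fields


if __name__ == "__main__":
    print(parse_bcl2fastq_name("test_raw/Group_1a_S1_L001_R1_001.fastq.gz"))
```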
