yanailab / cel-seq-pipeline
License: GNU General Public License v3.0
Embarrassingly, bc_demultiplex can be the most time-consuming step in a pipeline.
We should profile it to find the causes and then optimize. Ideally, we should benchmark this on real-life data.
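A minimal sketch of how such profiling could be done with the standard-library cProfile module (Python 3). The toy_demultiplex function is only a hypothetical stand-in for the real demultiplexing loop, and profile_call is a helper name invented for this example; the real entry point of bc_demultiplex would be passed in instead.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep only the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


def toy_demultiplex(n_reads):
    # Hypothetical stand-in for the real loop: count reads per fake barcode.
    counts = {}
    for i in range(n_reads):
        barcode = "ACGT"[i % 4] * 6
        counts[barcode] = counts.get(barcode, 0) + 1
    return counts


if __name__ == "__main__":
    _, report = profile_call(toy_demultiplex, 20000)
    print(report)
```

Running this on a real FASTQ pair rather than toy data would show whether the time goes into parsing, quality checks, or file writing.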
After profiling, possible solutions could be:
Hello,
Thank you so much for developing such nice software!
I downloaded CEL-Seq-pipeline-stable.zip from https://github.com/yanailab/CEL-Seq-pipeline and unzipped it, but the script fails on import:
li@lx:~/CEL-Seq-pipeline-stable$ python pijpleiding.py --help
Traceback (most recent call last):
  File "pijpleiding.py", line 31, in <module>
    import bowtie_wrapper, bc_demultiplex, htseq_wrapper, clean_up
  File "/home/li/CEL-Seq-pipeline-stable/bc_demultiplex.py", line 22, in <module>
    from HTSeq import FastqReader, SequenceWithQualities
ImportError: No module named HTSeq
My Python version is Python 2.7.15+, and I am working on an Ubuntu system.
Thank you in advance for your help; I really appreciate it!
Best,
Yue
Hello,
Thank you for developing such nice software!
I tried to install celseq2 on my Ubuntu system:
~/celseq2-master$ pip install ./
Defaulting to user installation because normal site-packages is not writeable
Processing /home/li/celseq2-master
But it reports:
ERROR: xlmhg 2.5.4 has requirement plotly>=3, but you'll have plotly 2.7.0 which is incompatible.
Thanks in advance for any great help!
Best,
Yue
When the config file you refer to doesn't exist, you get a generic "missing section" error message and no clear indication of the actual problem.
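One possible way to make this failure explicit, assuming the pipeline reads its configuration with the standard-library config parser (configparser in Python 3, ConfigParser in Python 2), whose read() silently skips files it cannot open. The load_config name and the error wording are invented for this sketch and are not taken from the pipeline.

```python
import configparser
import os


def load_config(path):
    """Read an INI-style config file, failing early if it is missing."""
    # ConfigParser.read() ignores unreadable files, so check explicitly.
    if not os.path.isfile(path):
        raise IOError("Config file not found: %s" % path)
    parser = configparser.ConfigParser()
    parser.read(path)
    if not parser.sections():
        raise ValueError("Config file has no sections: %s" % path)
    return parser
```

With a check like this, a wrong path is reported as a missing file instead of surfacing later as a missing section.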
I want to use htseq_count_umified.py as a standalone script to generate count files for individual samples. I redirected the output as follows, as I usually would for the standard htseq-count script, but nothing was written to the output file. The script's progress report showed that both the GTF and SAM files were processed completely, with no additional errors.
python htseq_count_umified.py \
-a 30 \
-u \
sample1.sam \
$gtf \
> sample1.count
I tried omitting the redirection as well, but there was still no output file generated. Is there any reason why this is happening? And what is the correct way to write the output files for this script?
This was done on Python 2.7.12, Biopython 1.68, and HTSeq 0.6.1p1.
Thanks,
Mei San
I used the CEL-Seq pipeline to create an expression matrix. It comes from a 384-well SORT-seq (CEL-Seq2) experiment. Now I would like to feed it to Seurat for downstream analysis.
Do you have any experience with this? What do I need to change in the format of the matrix, and what other files do I need to feed into Seurat?
If you have any other suggestions for the downstream analysis, please let me know. Any guidance or suggestion is appreciated.
Thanks a lot!
I have a question about how to properly fill out the sample sheet to run CEL-Seq. The repository has an example sample sheet which I am using as a template. It has the fields ID, flocell, series, lane, il_barcode, cel_barcode, and project. I'm assuming il_barcode is the Illumina barcode and cel_barcode is the UMI barcode, but the example sample sheet has integer values for those fields and I was wondering what those mean. When the example has "4" for il_barcode, what does that 4 represent? What do the values 1-18 in cel_barcode represent? And what does series mean? Also, can ID and project be anything as long as they are different?
Thanks,
David
bc_demultiplex was written based on wrong assumptions and should be fixed.
Basically, the reads should be identified by their FASTQ records, not by the filename. The flow cell ID and other fields in the records give the same information as the filename, but in a more robust and expected way. Making assumptions about filenames is non-standard and confusing.
This will require changing the sample sheets.
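As a rough illustration of identifying reads by their FASTQ records, the sketch below parses a read header in the Illumina format used since Casava 1.8 (@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index). The ReadOrigin and parse_illumina_header names are made up for this example and are not part of bc_demultiplex; headers in other formats would need their own handling.

```python
from collections import namedtuple

# Hypothetical container for the fields relevant to demultiplexing.
ReadOrigin = namedtuple(
    "ReadOrigin", "instrument run flowcell lane read_number index"
)


def parse_illumina_header(header):
    """Extract flow cell, lane, and index from a Casava 1.8+ read header."""
    name = header[1:] if header.startswith("@") else header
    location, _, description = name.partition(" ")
    fields = location.split(":")
    if len(fields) != 7:
        raise ValueError("Unrecognized FASTQ header: %r" % header)
    instrument, run, flowcell, lane = fields[:4]
    # The description part (read:filtered:control:index) may be absent.
    desc = description.split(":")
    read_number = desc[0] if desc[0] else None
    index = desc[3] if len(desc) > 3 else None
    return ReadOrigin(instrument, run, flowcell, int(lane), read_number, index)
```

The flowcell and lane values obtained this way could then be matched against the sample sheet regardless of how the input files are named.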
When one of the SAM files is empty (has only a header), we get an "Iteration error" and no output.
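A small sketch of a pre-check that could catch this case before counting starts. It relies only on the SAM convention that header lines begin with "@"; the sam_has_alignments name is invented here and does not exist in the pipeline.

```python
def sam_has_alignments(path):
    """Return True if the SAM file contains at least one alignment line."""
    with open(path) as handle:
        for line in handle:
            # Header lines start with "@"; anything else is an alignment.
            if line.strip() and not line.startswith("@"):
                return True
    return False
```

A wrapper could call this for each SAM file and either skip header-only files with a warning or stop with a message naming the empty file.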
If one runs htseq-count-umified with umi=true, but without UMIs in the SAM file, the error message is quite unclear:
Error occured when processing SAM input (line 97460 of file SAMFILE):
Traceback (most recent call last):
  File "/illumina1/YanaiLab/new_fs/tools/bin/pijpleiding_stable", line 129, in <module>
    main(args.config_file)
  File "/illumina1/YanaiLab/new_fs/tools/bin/pijpleiding_stable", line 93, in main
    segment(**parameters)
  File "/illumina1/YanaiLab/new_fs/tools/pipeline_stable/htseq_wrapper.py", line 82, in main
    results = pool.map(run_cmd, cmds)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 250, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
AttributeError: 'NoneType' object has no attribute 'group'
That is because the relevant code block ( https://github.com/yanailab/CEL-Seq-pipeline/blob/stable/htseq_count_umified.py#L50 ) assumes a UMI is always matched, so the regular-expression match is None when no UMI is present.
A better error message should appear.
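A hedged sketch of what a clearer failure could look like. The ":UMI:<sequence>:" read-name layout and the UMI_PATTERN and extract_umi names are assumptions made for illustration; the actual pattern used in htseq_count_umified.py may differ.

```python
import re

# Hypothetical layout: the UMI is embedded in the read name as ":UMI:ACGT:".
UMI_PATTERN = re.compile(r":UMI:([ACGTN]+):")


def extract_umi(read_name):
    """Return the UMI from a read name, or raise a descriptive error."""
    match = UMI_PATTERN.search(read_name)
    if match is None:
        # Without this check, match.group(1) raises the opaque
        # "'NoneType' object has no attribute 'group'".
        raise ValueError(
            "No UMI found in read name %r. Was the SAM file produced from "
            "UMI-tagged reads, or should the UMI option be disabled?"
            % read_name
        )
    return match.group(1)
```

The point is only the explicit None check; the message text can be adapted to name the SAM file and line number that the existing error already reports.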
Hello,
I tried to use bc_demultiplex from https://github.com/yanailab/CEL-Seq-pipeline.
All the output files are 0 bytes.
Thanks in advance for any great help!
Best,
Yue
li@Desktop-590-p0xxx:~/CEL-Seq-pipeline-stable$ bc_demultiplex --out-dir test_out --bc-index tmp01 --not-gzip R1.fq R2.fq
[ Sat Mar 7 00:42:58 2020 ] Demultiplexing starts R1.fq--R2.fq ...
[ Sat Mar 7 00:42:58 2020 ] Demultiplexing ends R1.fq--R2.fq.
I'm trying to understand your pipeline. Is there a place with more information than the README and the papers? For instance, what is the "pipe_run" parameter?
Hi guys,
Unfortunately, I am having problems understanding how the pipeline can be adapted to our dataset. For some initial tests I wanted to use bc_demultiplex.py, but I don't understand how the sample sheet should look and how the files need to be named. So, to start from the beginning:
We have 5 libraries (of ~20 tissue slices each, but I guess this is irrelevant), which were pooled and sequenced all together on a HiSeq. We used barcodes:
Primer_ID | Unique Barcode
1 | AGACTC
2 | AGCTAG
4 | AGCTTC
5 | CATGAG
9 | CAGATC
10 | TCACAG
11 | AGGATC
14 | TCCTAG
17 | TCGAAG
20 | GTACAG
23 | GTCTAG
25 | GTTGCA
26 | GTGACA
28 | ACAGTG
29 | ACCATG
31 | ACTCGA
32 | ACGTAC
35 | CTAGAC
40 | CTTCGA
46 | TGCAGA
and also UMIs of course.
What we got is 5 Illumina datasets, Lib-1 ... Lib-5, each with paired read sets, e.g.
Lib-2_S2_L001_R1_001.fastq.gz
Lib-2_S2_L001_R2_001.fastq.gz
Lib-2_S2_L002_R1_001.fastq.gz
Lib-2_S2_L002_R2_001.fastq.gz
I cleaned these with Trimmomatic (I think this is necessary in this case, since the sequencing quality is not great). My reads are now named, e.g.
Lib-4_S4_L001_forward_paired.fq.gz
Lib-4_S4_L001_reverse_paired.fq.gz
Lib-4_S4_L002_forward_paired.fq.gz
Lib-4_S4_L002_reverse_paired.fq.gz
But this can of course be changed.
Now my simple questions are:
bc_index file? (I suspect like the barcode_umis.tab file, right?)
Thanks for your help!
Cheers
Philipp
If fastq files are input into bc_demultiplex.py that contain an underscore in the sample id, all reads will be assigned to undetermined, since the "lane" field (as set on line 49) will not match the lane field in the sample sheet. I believe the cause of this issue is line 47. While the workaround (remove underscores) is fairly simple once you know the issue, it's not immediately obvious that this is the cause.
Steps to Reproduce
Run bc_demultiplex.py on FASTQ files whose sample id contains an underscore, e.g.
bc_demultiplex.py --min-bc-quality 0 --out-dir test_out index_file.txt Group_1a_sample_sheet.txt test_raw/Group_1a_S1_L00*R1*
where the first sample line of the sample sheet is like
1 A1 [flowcell] L001 Group_1a 1 Group_1a [project]
Expected Result
Reads are assigned to sample fastq files.
Actual Result
All reads are assigned to undetermined.
All this is on RHEL 6.8, with Python 2.7.6.
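For illustration, one way to parse such filenames so that underscores in the sample id are tolerated is to anchor the pattern at the end of the name instead of splitting on every underscore. This assumes the usual Illumina naming scheme (Sample_S1_L001_R1_001.fastq.gz); the ILLUMINA_NAME and parse_fastq_name names are hypothetical, and this is not the code currently in bc_demultiplex.py.

```python
import os
import re

# Assumed scheme: <sample>_S<n>_L<lane>_<R1|R2|I1|I2>_<chunk>.fastq[.gz]
ILLUMINA_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<number>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>[RI][12])_(?P<chunk>\d{3})\.fastq(?:\.gz)?$"
)


def parse_fastq_name(path):
    """Return (sample, lane, read) parsed from an Illumina FASTQ filename."""
    match = ILLUMINA_NAME.match(os.path.basename(path))
    if match is None:
        raise ValueError("Unexpected FASTQ filename: %s" % path)
    # The trailing fields are fixed, so the sample may contain underscores.
    return match.group("sample"), "L" + match.group("lane"), match.group("read")
```

With the file from the report, parse_fastq_name("test_raw/Group_1a_S1_L001_R1_001.fastq.gz") gives the sample "Group_1a" and the lane "L001", which would match the lane field in the sample sheet line above.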