cel-seq-pipeline's People

Contributors

avitalgal, jarondl, leon-anavy, martinfed

cel-seq-pipeline's Issues

Optimize bc_demultiplex

Embarrassingly, bc_demultiplex can be the most time-consuming step in a pipeline.
We should profile it to find out what causes this, and then optimize. Ideally, we should benchmark it on real-life data.

After profiling, possible solutions could be:

  • Simple optimizations of inner loops in Python. Sometimes even assigning local variables can make a difference.
  • Stronger optimizations of inner loops - maybe use a C-based package, or Cython
  • Parallelize.
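
As a starting point for the profiling step, here is a minimal sketch using the standard-library cProfile and pstats modules. The function count_barcodes is a hypothetical stand-in for the demultiplexing inner loop, not the actual code in bc_demultiplex.py, and the sketch is written for Python 3. It also shows the local-variable trick mentioned in the first bullet.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, text report of the top 10 entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()


def count_barcodes(reads, bc_len=6):
    """Hypothetical inner loop: count reads per leading barcode."""
    counts = {}
    get = counts.get  # local alias avoids a repeated attribute lookup in the loop
    for read in reads:
        bc = read[:bc_len]
        counts[bc] = get(bc, 0) + 1
    return counts


if __name__ == "__main__":
    reads = ["AGACTCAAA", "AGACTCGGG", "AGCTAGTTT"] * 100000
    _, report = profile_call(count_barcodes, reads)
    print(report)
```

Running the same harness on the real entry point with a real FASTQ file would show whether the time goes into parsing, barcode matching, or writing output.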

@flo-compbio , @gufranca

install of CEL-Seq-pipeline

Hello,

Thank you so much for developing such nice software!

I downloaded CEL-Seq-pipeline-stable.zip from https://github.com/yanailab/CEL-Seq-pipeline
and unzipped it.

li@lx:~/CEL-Seq-pipeline-stable$ python pijpleiding.py --help
Traceback (most recent call last):
  File "pijpleiding.py", line 31, in <module>
    import bowtie_wrapper, bc_demultiplex, htseq_wrapper, clean_up
  File "/home/li/CEL-Seq-pipeline-stable/bc_demultiplex.py", line 22, in <module>
    from HTSeq import FastqReader, SequenceWithQualities
ImportError: No module named HTSeq

My python version is:

Python 2.7.15+

I am working on an Ubuntu system.

Thank you in advance for your help, I really appreciate it!

Best,

Yue

install of celseq2

Hello,

Thank you for developing such nice software!

I tried to install celseq2 on my Ubuntu system.

~/celseq2-master$ pip install ./
Defaulting to user installation because normal site-packages is not writeable
Processing /home/li/celseq2-master

But it fails with:

ERROR: xlmhg 2.5.4 has requirement plotly>=3, but you'll have plotly 2.7.0 which is incompatible.

Thanks in advance for any help!

Best,

Yue

missing config file

When the config file you refer to doesn't exist, you get a generic "missing section" error message, with no clear indication of the actual problem.
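
One possible fix, sketched below, is to check for the file before parsing it. This assumes the pipeline reads its config with Python's standard config parser, whose read() method silently skips files it cannot open; that would explain why the failure only surfaces later as a missing section. The function name load_config is hypothetical, and the sketch is written for Python 3.

```python
import configparser
import os


def load_config(path):
    """Parse an INI-style config file, failing early with a clear message."""
    if not os.path.isfile(path):
        # Without this check, parser.read() would skip the missing file
        # and the first section lookup would raise a confusing error.
        raise FileNotFoundError("Config file not found: {}".format(path))
    parser = configparser.ConfigParser()
    parser.read(path)
    return parser
```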

Unable to redirect htseq_count_umified.py to a file

I want to use htseq_count_umified.py as a standalone script to generate count files for individual samples. I redirected the output as follows, as I usually would for a standard htseq-count script, but nothing was written to the output file. The script's progress report showed that both the GTF and SAM files were processed completely, with no additional errors.

python htseq_count_umified.py \
-a 30 \
-u \
sample1.sam \
$gtf \
> sample1.count

I tried omitting the redirection as well, but there was still no output file generated. Is there any reason why this is happening? And what is the correct way to write the output files for this script?

This was done with Python 2.7.12, Biopython 1.68, and HTSeq 0.6.1p1.

Thanks,
Mei San

CEL-seq matrix to Seurat

I used the CEL-seq pipeline to create an expression matrix. It comes from a 384-well SORT-seq (CEL-seq2) experiment. Now I would like to feed it to Seurat for downstream analysis.
Do you have any experience with this? What do I need to change in the format of the matrix, and what other files do I need to feed into Seurat?
If you have any other suggestions for the downstream analysis, please let me know. Any guidance or suggestion is appreciated.

Thanks a lot!

Sample sheet?

I have a question about how to properly fill out the sample sheet to run CEL-Seq. The repository has an example sample sheet which I am using as a template. It has the fields ID, flocell, series, lane, il_barcode, cel_barcode, and project. I'm assuming il_barcode is the Illumina barcode and cel_barcode is the UMI barcode, but the example sample sheet has integer values for those fields and I was wondering what those mean. When the example has "4" for il_barcode, what does that 4 represent? And what do the values 1-18 for cel_barcode represent? And what does series mean? Also, can ID and project be anything as long as they are different?

Thanks,
David

Fix `bc_demultiplex`'s wrong assumptions and filename dependencies

bc_demultiplex was written based on wrong assumptions and should be fixed.
Basically, reads should be identified by their FASTQ records, not by the filename. The flow cell ID and other fields in each record give the same information as the filename, but in a more robust and expected way. Making assumptions about filenames is nonstandard and confusing.

This will require changing the sample sheets.
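
A minimal sketch of reading the flow cell ID and lane from the FASTQ record itself follows. It assumes Illumina's Casava 1.8+ header layout (instrument:run:flowcell:lane:tile:x:y, followed by a space and read:filtered:control:index); older header formats would need a different parser. The function name is hypothetical and is not part of the pipeline.

```python
def parse_illumina_header(header):
    """Return (flowcell, lane, read_number) from a Casava 1.8+ FASTQ header line."""
    name, _, comment = header.lstrip("@").partition(" ")
    fields = name.split(":")
    if len(fields) < 7:
        raise ValueError("Unexpected FASTQ header: {!r}".format(header))
    flowcell, lane = fields[2], int(fields[3])
    # The comment part (after the space) starts with the read number, if present.
    read_number = int(comment.split(":")[0]) if comment else None
    return flowcell, lane, read_number
```

With this, the lane and flow cell in the sample sheet could be matched against each record regardless of how the input file is named.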

@flo-compbio

Ambiguous error message when SAM file contains no UMI in htseq-count-umified

If one runs htseq-count-umified with umi=true, but without UMIs in the SAM file, the error message is quite unclear:

Error occured when processing SAM input (line 97460 of file SAMFILE):
Traceback (most recent call last):
  File "/illumina1/YanaiLab/new_fs/tools/bin/pijpleiding_stable", line 129, in <module>
    main(args.config_file)
  File "/illumina1/YanaiLab/new_fs/tools/bin/pijpleiding_stable", line 93, in main
    segment(**parameters)
  File "/illumina1/YanaiLab/new_fs/tools/pipeline_stable/htseq_wrapper.py", line 82, in main
    results = pool.map(run_cmd, cmds)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 250, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
AttributeError: 'NoneType' object has no attribute 'group'

That is because the relevant code block ( https://github.com/yanailab/CEL-Seq-pipeline/blob/stable/htseq_count_umified.py#L50 ) assumes a UMI is always matched.

A clearer error message should appear.
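
A sketch of the kind of check that would give a clearer message is shown below. The UMI tag format and the names UMI_PATTERN and extract_umi are assumptions for illustration; the actual pattern used in htseq_count_umified.py may differ.

```python
import re

# Hypothetical UMI tag format in the read name; the real pattern may differ.
UMI_PATTERN = re.compile(r":UMI:([ACGTN]+)")


def extract_umi(read_name):
    """Return the UMI embedded in a read name, or raise a descriptive error."""
    match = UMI_PATTERN.search(read_name)
    if match is None:
        # Checking for None here replaces the opaque
        # "'NoneType' object has no attribute 'group'" failure.
        raise ValueError(
            "No UMI found in read name {!r}; was the SAM file produced "
            "from UMI-tagged reads?".format(read_name)
        )
    return match.group(1)
```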

Pipe run

I'm trying to understand your pipeline. Is there a place with more information than the README and the papers? For instance, what is the "pipe_run" parameter?

How does this work?

Hi guys,

Unfortunately, I'm having real trouble understanding how the pipeline can be adapted to our dataset. For some initial tests I wanted to use bc_demultiplex.py, but I don't understand how the sample sheet should look or how the files need to be named. So, to start from the beginning:
We have 5 libraries (of ~20 tissue slices each, but I guess this is irrelevant), which were pooled and sequenced all together on a HiSeq. We used barcodes:

Primer_ID | Unique Barcode
1 | AGACTC
2 | AGCTAG
4 | AGCTTC
5 | CATGAG
9 | CAGATC
10 | TCACAG
11 | AGGATC
14 | TCCTAG
17 | TCGAAG
20 | GTACAG
23 | GTCTAG
25 | GTTGCA
26 | GTGACA
28 | ACAGTG
29 | ACCATG
31 | ACTCGA
32 | ACGTAC
35 | CTAGAC
40 | CTTCGA
46 | TGCAGA

and also UMIs of course.
What we got is 5 Illumina datasets, Lib-1 ... Lib-5, and each has two readsets, e.g.
Lib-2_S2_L002_R2_001.fastq.gz Lib-2_S2_L001_R1_001.fastq.gz Lib-2_S2_L001_R2_001.fastq.gz Lib-2_S2_L002_R1_001.fastq.gz.
I cleaned these with Trimmomatic (I think this is necessary in this case, since the sequencing quality is not great). My reads are now named, e.g.
Lib-4_S4_L002_reverse_paired.fq.gz Lib-4_S4_L002_forward_paired.fq.gz Lib-4_S4_L001_reverse_paired.fq.gz Lib-4_S4_L001_forward_paired.fq.gz.
But this can of course be changed.

Now my simple questions are:

  • How should I lay out the bc_index file? (I suspect like the barcode_umis.tab file, right?)
  • How should I lay out the sample sheet?
  • Would it make sense to just combine all reads into one giant FASTQ file?

Thanks for your help!

Cheers

Philipp

bc_demultiplex.py fails to assign reads if sample ID contains "_"

If fastq files are input into bc_demultiplex.py that contain an underscore in the sample id, all reads will be assigned to undetermined, since the "lane" field (as set on line 49) will not match the lane field in the sample sheet. I believe the cause of this issue is line 47. While the workaround (remove underscores) is fairly simple once you know the issue, it's not immediately obvious that this is the cause.

Steps to Reproduce

  1. Run Illumina's bcl2fastq (v2.17) on Illumina NextSeq data, with a SampleSheet.csv where the first column (Sample_ID) has a value containing an underscore (e.g. "Group_1a"). This will produce output files in the format "Group_1a_S1_L001_R1_001.fastq.gz".
  2. Run bc_demultiplex.py (bc_demultiplex.py --min-bc-quality 0 --out-dir test_out index_file.txt Group_1a_sample_sheet.txt test_raw/Group_1a_S1_L00*R1*), where the first sample line of the sample_sheet is like 1 A1 [flowcell] L001 Group_1a 1 Group_1a [project].

Expected Result
Reads are assigned to sample fastq files.

Actual Result
All reads are assigned to undetermined.

All this is on RHEL 6.8, with Python 2.7.6.
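
One way to make the filename parsing tolerate underscores in the sample ID is to anchor on the fixed bcl2fastq suffix rather than splitting on "_". The sketch below assumes the naming scheme shown in step 1 (sample, then _S<n>_L<lane>_R<read>_001.fastq.gz); it is not the pipeline's current code, and parse_fastq_name is a hypothetical name.

```python
import os
import re

# bcl2fastq-style name: <sample>_S<n>_L<lane>_<R1|R2|I1|I2>_001.fastq[.gz]
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<num>\d+)_(?P<lane>L\d{3})_(?P<read>[RI][12])_001"
    r"\.fastq(\.gz)?$"
)


def parse_fastq_name(filename):
    """Return (sample_id, lane, read) from a bcl2fastq-style FASTQ filename."""
    match = FASTQ_NAME.match(os.path.basename(filename))
    if match is None:
        raise ValueError("Unrecognized FASTQ filename: {!r}".format(filename))
    return match.group("sample"), match.group("lane"), match.group("read")
```

Because the sample ID is captured as everything before the fixed suffix, "Group_1a" stays intact and the lane field still comes out as "L001".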
