
cite-seq-count's People

Contributors

arkal, bbimber, colindaven, gegegui, hoohm, liuyigh, santiagorevale, wflynny


cite-seq-count's Issues

Can't run CITEseq count

Hi,
I've installed CITE-seq-Count correctly and run the command below, but I'm getting this error.
I am using the data from the original paper:

CITE-seq-Count -R1 SRR8281061_1.fastq.gz -R2 SRR8281061_2.fastq.gz -t TAG_LIST.csv -cbf 1 -cbl 16 -umif 17 -umil 26 -cells 3000 -wl whitelist.xlsx -o cell_hashing_paper
Traceback (most recent call last):
  File "/anaconda3/bin/CITE-seq-Count", line 10, in <module>
    sys.exit(main())
  File "/anaconda3/lib/python3.6/site-packages/cite_seq_count/__main__.py", line 240, in main
    collapsing_threshold=args.bc_threshold)
  File "/anaconda3/lib/python3.6/site-packages/cite_seq_count/preprocessing.py", line 70, in parse_whitelist_csv
    whitelist = [row[0].strip(STRIP_CHARS) for row in csv_reader
  File "/anaconda3/lib/python3.6/site-packages/cite_seq_count/preprocessing.py", line 70, in <listcomp>
    whitelist = [row[0].strip(STRIP_CHARS) for row in csv_reader
  File "/anaconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 15-16: invalid continuation byte

thanks :)

whitelist.xlsx

TAG_LIST.xlsx
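The traceback shows parse_whitelist_csv failing on bytes that are not valid UTF-8: the attached whitelist.xlsx is a binary Excel workbook, while -wl expects a plain-text CSV with one barcode per line. A minimal conversion sketch, assuming pandas (with an Excel engine such as openpyxl) is available and the barcodes sit in the first column of the first sheet:

    import pandas as pd

    # read the Excel whitelist and write it back out as a plain one-column CSV;
    # file names follow the command above, the sheet layout is an assumption
    wl = pd.read_excel("whitelist.xlsx", header=None)
    wl.iloc[:, 0].to_csv("whitelist.csv", index=False, header=False)

The same conversion applies to TAG_LIST.xlsx if the tag list is also kept as a workbook.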

Memory, missing barcodes, and R2 trimming

Hello!

First off, thanks so much for providing this tool to the single-cell community. I've been using CITE-seq-Count regularly over the last few months; I've found one potential bug and have a couple of features that I'm hoping you could implement.

  1. Bug: Occasionally when I use CSCount to align barcode reads to a 96-index reference using the following example script:

python /bin/CITE-seq-Count \
    -R1 /alpha/cmcginnis/NUC_BAR/NUC_BAR_S10_L001_R1_001.fastq.gz \
    -R2 /alpha/cmcginnis/NUC_BAR/NUC_BAR_S10_L001_R2_001.fastq.gz \
    -t /alpha/cmcginnis/CSCount/LMOlist.csv \
    -cbf 1 -cbl 16 -umif 17 -umil 26 -tr "[ATCG]{8}[A]{6,}" -hd 1 \
    -wl /alpha/cmcginnis/NUC_L1_WL.txt \
    -o CSCount_NUC_wWL_111518.tsv

I receive an output where not all 96 barcodes are present (e.g., for the run posted above, only 23/96 barcodes were present in the final CSV). It's worth noting that this usually happens when I'm processing experiments in which I do not utilize all 96 barcodes -- for 96-plex experiments I normally get 96 barcodes back -- and the barcodes I expect to be present never drop out. I figured there might be a barcode QC step that removes barcodes with sufficiently few aligned reads, but when I look at my final count matrix, there are plenty of barcodes with exactly 0 UMIs.

  2. Suggested Features: When I try to use CSCount on fairly large barcode FASTQs, I have to be very careful about memory usage. I'm wondering if you could incorporate multi-threaded processing into CSCount or, alternatively, load only enough of R2 to enable alignment (e.g., 18 bases instead of all 96). I believe either of these solutions would speed up CSCount tremendously and/or allow multiple instances to be run on a single core (a trimming sketch follows below).
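On the R2 suggestion, the trimming can be approximated today by pre-processing the FASTQ before counting. A minimal sketch that streams R2 and keeps only the first 18 bases, so nothing downstream ever holds full-length reads (file names follow the command above):

    import gzip

    # stream R2 and truncate sequence and quality lines to 18 bases
    with gzip.open("NUC_BAR_S10_L001_R2_001.fastq.gz", "rt") as fin, \
         gzip.open("NUC_BAR_R2_trimmed.fastq.gz", "wt") as fout:
        for i, line in enumerate(fin):
            if i % 4 in (1, 3):   # sequence and quality lines
                fout.write(line.rstrip("\n")[:18] + "\n")
            else:                 # header and '+' separator lines
                fout.write(line)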

installation of 1.4

Hi,

I was trying to install CITE-seq-Count v1.4. To perform a clean installation, I uninstalled CITE-seq-Count using 'pip uninstall' and reinstalled it using 'pip install CITE-seq-Count'.

After installation, it seems that v1.3.2 was installed instead of v1.4. Do you have any suggestions?

Thank you.
Joon

output of version 1.4

Hi,
I just wanted to point out that (at least for me) the current output is in the cellranger 3.0 format and can only be read if you install Seurat v3 (the latest version). Is that correct?
However, in Seurat v3 the HTODemux function is gone, so what kind of analysis flow would you recommend to do more or less the same as https://satijalab.org/seurat/hashing_vignette.html?

Thanks
Roberta

cell hashing?

This is a very useful tool to create counts per tag. For CITE-seq, that's all that's needed. For cell hashing, one would want to make a determination per cell, using some algorithm to identify doublets and make a sample call per cell. Do you have established code to do this? Thanks.

Option --plot-prefix not available

Hi, I get this type of error when running CITE-seq-Count:

ERROR:root:No local minima was accepted. Recommend checking the plot output and counts per local minima (requires --plot-prefix option) and then re-running with manually selected threshold (--set-cell-number option)

However, when I try to add --plot-prefix to the command line, it says there is no such option (with version 1.4.1). Is this behavior expected?

citeseq count is returning negative count values

Hello, I'm running into a very odd issue where some of the entries in the hashtag count matrix are negative. Below are some examples of the numbers I see.

[screenshot: hashtag count matrix with negative entries]

I couldn't attach all the output files and the notebook I used to generate them, so I've put them in this google drive folder:

https://drive.google.com/open?id=1guoYnF4-whi_0aUgGQeugQ966bFBoc_J

This aberrant count matrix was generated with the following call:

CITE-seq-Count -R1 /home/munfred/hashing/data/SRR8281307_1.fastq.gz -R2 /home/munfred/hashing/data/SRR8281307_2.fastq.gz \
-t ./HTOs_len13.csv -cbf 1 -cbl 16 -umif 17 -umil 26 \
-o ./citeseqcount1 -cells 500000

SRR8281307 is the RNA HTO dataset from the hashtag paper: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP127179

Additionally, I have one more question: can you confirm that the UMI bug has been fixed? I have done the installation with the command pip install CITE-seq-Count==1.4.1

However, after running CITE-seq-Count on the SRR8281307 data, I get back a matrix with 71,195,205 entries. This is suspiciously close to the 74,219,921 entries in the original fastq file, and suggests to me that the UMIs are not being collapsed. I have looked into both read_count and umi_count folder contents and they are identical.
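A quick way to test whether collapsing happened at all is to compare the totals of the two matrices: with working UMI collapsing, the umi_count total should be clearly smaller than the read_count total. A sketch assuming scipy and the default output layout of the run above:

    from scipy.io import mmread

    # total counts in the read-level and UMI-level matrices
    reads = mmread("citeseqcount1/read_count/matrix.mtx.gz")
    umis = mmread("citeseqcount1/umi_count/matrix.mtx.gz")
    print(reads.sum(), umis.sum())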

CITE-seq COUNT cell number

Hi,
I am using CITE-seq COUNT for ADT and HTO analysis and couldn't get the correct number of cells defined. (And it worked before but not anymore...)
Below is the script used:

CITE-seq-Count -R1 ADT_HTO_tag_R1.fastq.gz -R2 ADT_HTO_tag_R2.fastq.gz -t /home/chenx302/Xiao/XC7_hashing/XC7_libraries_tag_CITE.csv -cbf 1 -cbl 16 -umif 17 -umil 28 -cells 11496 -wl /home/chenx302/Xiao/XC7_hashing/XC7_WHITELIST_antibody -o ~/Xiao/XC7_hashing/XC7_hashing/outs/fastq_path/tags_fastqs/CITE_SEQ_04122019

The WHITELIST contains 11,496 rows and I also set '-cells 11496'. However, the barcodes.tsv in umi_count only contains 11,484 rows.
Any idea why? I got the proper number of rows before with the same files. But now I have merged all the R1 fastqs from the different NextSeq lanes (and likewise for R2) and it stopped working. Can CITE-seq-Count handle multiple R1/R2 entries? I previously ran the script below, but it seems the output from the last file overwrites the previous ones.

R1="ADT_tag_S2_L001_R1_001.fastq.gz ADT_tag_S2_L004_R1_001.fastq.gz HTO_tag_S3_L003_R1_001.fastq.gz ADT_tag_S2_L002_R1_001.fastq.gz HTO_tag_S3_L001_R1_001.f
astq.gz HTO_tag_S3_L004_R1_001.fastq.gz ADT_tag_S2_L003_R1_001.fastq.gz HTO_tag_S3_L002_R1_001.fastq.gz"

R2="ADT_tag_S2_L001_R2_001.fastq.gz ADT_tag_S2_L004_R2_001.fastq.gz HTO_tag_S3_L003_R2_001.fastq.gz ADT_tag_S2_L002_R2_001.fastq.gz HTO_tag_S3_L001_R2_001.f
astq.gz HTO_tag_S3_L004_R2_001.fastq.gz ADT_tag_S2_L003_R2_001.fastq.gz HTO_tag_S3_L002_R2_001.fastq.gz"

arr=($R1)
arr2=($R2)

for i in {0..7}
do
    CITE-seq-Count -R1 ${arr[i]} -R2 ${arr2[i]} -t /home/chenx302/Xiao/XC7_hashing/XC7_libraries_tag_CITE.csv -cbf 1 -cbl 16 -umif 17 -umil 28 -cells 11496 -wl /home/chenx302/Xiao/XC7_hashing/XC7_WHITELIST_antibody -o ~/Xiao/XC7_hashing/XC7_hashing/outs/fastq_path/tags_fastqs/CITE_SEQ
done

Demultiplexing cite-seq sequencing data

Hi,
I'm about to have data from a 10X experiment with both HTOs (4 tags) and ADT antibodies (18 markers).
My question is: should I have one tag file covering both HTOs and ADTs, or one for each type of antibody?
I only found a description for demultiplexing the hashtags.
Thanks

Option "-u" not working in version 1.3.1

Hi there,

I was trying the new option for printing the unknown tags, but it's giving me the following error:

[...]
Processed 1,000,000 lines in 7.649 secondes. Total lines processed: 27,000,000
Processed 1,000,000 lines in 7.69 secondes. Total lines processed: 28,000,000
Done counting
Traceback (most recent call last):
  File "~/miniconda3/envs/CITEseq-pipeline/bin/CITE-seq-Count", line 11, in <module>
    sys.exit(main())
  File "~/miniconda3/envs/CITEseq-pipeline/lib/python3.6/site-packages/cite_seq_count/__main__.py", line 350, in main
    no_match_matrix = pd.DataFrame(no_match_table.items(), columns=['tag', 'total'])
  File "~/miniconda3/envs/CITEseq-pipeline/lib/python3.6/site-packages/pandas/core/frame.py", line 422, in __init__
    raise ValueError('DataFrame constructor not properly called!')
ValueError: DataFrame constructor not properly called!

I'm not very proficient with pandas, but I managed to make it work by replacing the offending line of code with:

keys = list(no_match_table.keys())
vals = list(no_match_table.values())
no_match_matrix = pd.DataFrame({"tag": keys, "total": vals})
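For context, the crash comes from handing a dict_items view straight to the DataFrame constructor, which the pandas version in use rejects; materializing the items as a list is enough. An equivalent one-line variant of the fix above:

    no_match_matrix = pd.DataFrame(list(no_match_table.items()), columns=['tag', 'total'])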

I'm using the following versions:

  • CITE-seq-Count v1.3.1
  • Pandas v0.23.4
  • Python v3.6.5

Best regards,
Santiago

Running CITEseqCount on cDNA Fastq files

Hello,

Thank you for developing this tool! I've received some hashed data from a collaborator and I have been trying to use it to no avail - it seems to run properly, but the report yaml shows percent unmapped = 100%.

After doing some reading, I realized that my situation may in fact be similar to the problem faced in issue #5, where I only have FASTQ files for the cDNA libraries, while a typical hashed dataset has its own ADT fastq files. I have only a single pair of files, R1 and R2; I am actually able to find the HTO sequences in R2 using grep (several thousand in R2 compared to a few dozen in R1). In that thread, someone suggested the following strategy:

  1. Running the cDNA fastq files (in this case only R1 and R2) through Cellranger Count.
  2. Running R1 and R2 through CITE-seq-Count, perhaps using as a whitelist the cells identified by cellranger count. The author of that post suggested specifying some number of A's in the regex option, which I could not find any reference to in the documentation.

Is there any advice you can provide for this scenario? Thank you very much in advance!

cell-hashing pipeline questions

Hi,
maybe this is not the right place to ask, but I seem to be unable to get a response on the cite-seq website and I would really appreciate if you could comment on the following.
I'm about to do a cell hashing with some 10x libraries.

Can you please confirm that the sequencing cycle configuration for the hashing libraries can be the same as for a 10x library? From your script it seems so, as it looks for the classic coordinates of the cell barcode (1-16), the UMI (17-26), and so on. (Of course, if it were Drop-seq beads I'd have to use a different sequencing configuration.)

Also, in the methods of the hashing preprint: "Fastq files from the 10x libraries with four distinct barcodes were pooled together and processed using the standard Drop-seq pipeline". If you had fastqs to start with, does it mean you basecalled them with the 10x pipeline?
Can I basecall with the cellranger wrapper, or do you suggest another way? (The Drop-seq paper suggested a couple of other ways, so I'm really just checking what the best practice is.)
And then I have to handle the alignment and the following steps exactly as in the Drop-seq pipeline, right?

Any suggestion would be really welcome! And many thanks for such an amazing tool.

Best,
biola

Job never finishes

Hello,

I have been trying to run cite-seq-count on my data but haven't been successful so far.
I tried initially on the cluster but then gave up and tried locally. On the cluster the job never finished, so I had to abort it. The error file contained this message:

No local minima was accepted. Recommend checking the plot output and counts per local minima (requires `--plot-prefix` option) and then re-running with manually selected threshold (`--set-cell-number` option)

And here is what happens locally on my laptop:

Unable to revert mtime: /Library/Fonts
Unable to revert mtime: /Library/Fonts/Microsoft
Counting number of reads
Started mapping
CITE-seq-Count is running with 4 cores.
Processed 1,000,000 reads in 23.67 seconds. Total reads: 1,000,000 in child 29798
Processed 1,000,000 reads in 39.92 seconds. Total reads: 1,000,000 in child 29799
Mapping done for process 29798. Processed 1,969,487 reads
Processed 1,000,000 reads in 1.0 minute, 2.679 seconds. Total reads: 1,000,000 in child 29801
Mapping done for process 29799. Processed 1,969,487 reads
Mapping done for process 29801. Processed 1,969,487 reads
Processed 1,000,000 reads in 1.0 minute, 30.52 seconds. Total reads: 1,000,000 in child 29800
Mapping done for process 29800. Processed 1,969,487 reads
Mapping done
Merging results
Correcting cell barcodes
ERROR:root:No local minima was accepted. Recommend checking the plot output and counts per local minima (requires `--plot-prefix` option) and then re-running with manually selected threshold (`--set-cell-number` option)
Could not find a good local minima for correction.
No cell barcode correction was done.
Correcting umis

I installed cite-seq-count version 1.4.1, with python 3.7.0.
Any help would be much appreciated.

Thank you

TAG file error

I receive the following error now:

This tag TGATGGCCTATTGG is not only composed of ATGC bases.
Please check your tags file
Exiting the application.
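For what it's worth, the tag shown is composed only of ATGC, so the message may be triggered by an invisible character (a Windows line ending or a stray space) in the tags file rather than by the bases themselves. A quick check that reveals hidden characters (file name assumed):

    # print each tag line with its hidden characters made visible
    with open("TAG_LIST.csv") as fh:
        for line in fh:
            print(repr(line))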

--expected_cells option still required when using -wl

Hello,

Thanks for developing such a useful tool.

I have some analyses done with CITE-seq-Count 1.3.4 that I'm currently rerunning with 1.4.1. I was wondering if it would be possible for the program to require either -cells OR -wl, as the whitelist will in any case contain a limited number of droplet barcodes.

The reason is that I'm running CITE-seq-Count as part of a bash script that should be able to process different files with different whitelists, and I don't want to have to look up the number of lines in each of these whitelists and hard-code that number for each of them in the script. I tried to make the script extract this info from the whitelist file, but it didn't work.
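For reference, extracting that number is only a few lines on its own; a sketch of the counting step (the whitelist file name is a placeholder), whose result could then be passed to -cells:

    # count non-empty whitelist lines, i.e. the value to hand to -cells
    with open("whitelist.txt") as fh:
        n_cells = sum(1 for line in fh if line.strip())
    print(n_cells)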

Thank you very much

Best wishes,

Alice

PS: It also took me a bit of time to find which option replaces the -hd option I had been using in my previous script. I think it would be useful to have that info in the documentation; for example, having "(previously -hd)" written in the description of --max-error.

umi correction issues

I'm not sure if this is an issue particular to my system, but I'm getting errors in the correct_cells and correct_umis functions when running the program on some hash data.

For cell barcodes, the program reports: "Could not find a good local minima for correction." When I dig into it a little further, it appears that the correct_cells function uses umi_methods.getCellWhitelist(), which my system doesn't recognize. Instead, umi_tools seems to have a whitelist_methods.getCellWhitelist() function. Is this an incompatibility between umi_tools version (I'm using 1.0.0) and CITE-seq-Count version (I'm using 1.4.1)?

For umis, I get the error: "TypeError: __call__() takes 3 positional arguments but 4 were given". When I dig into this a little more, it seems that network.UMIClusterer() (from umi_tools again) is the culprit. As far as I can tell, you are feeding it the umis as keys, a dictionary of umis with counts, and a threshold. However, on my machine the function expects only the dictionary and the threshold.

If I fix both issues in the source code, the program seems to run fine. I assume that this is a versioning issue on my machine since I don't see anyone else noting it, but thought I'd share anyway.

Variable TAG lengths?

Hi there,

I was wondering, do you actually believe/expect to have projects mixing TAGs of variable length? Do you really think this is a possibility?

I've been checking the lengths within different suppliers, and the lengths are always consistent.

Thanks in advance.

Cheers,
Santiago

Finding a whitelist

Hi everyone,

I'm trying to demultiplex my scRNAseq data with CITE-seq-count (hashtag barcoding)

I tested with 10,000 reads:
CITE-seq-Count -R1 HTO-1_R1.fastq.gz -R2 HTO-1_R2.fastq.gz -t hto.csv -cbf 1 -cbl 16 -umif 17 -umil 26 -o /Results -cells 5000 -n 10000
which worked pretty well.

So I dropped the -n 10000 option in order to use all the reads:

CITE-seq-Count -R1 HTO-1_R1.fastq.gz -R2 HTO-1_R2.fastq.gz -t hto.csv -cbf 1 -cbl 16 -umif 17 -umil 26 -o /Results -cells 5000 
Counting number of reads
Started mapping
Processing 41,301,352 reads
CITE-seq-Count is running with 72 cores.
...
Mapping done
Merging results
Correcting cell barcodes
Finding a whitelist
/.local/lib/python3.7/site-packages/umi_tools/whitelist_methods.py:283: RuntimeWarning: invalid value encountered in sqrt
  lineVecNorm = lineVec / np.sqrt(np.sum(lineVec**2))
Traceback (most recent call last):
  File "/.local/bin/CITE-seq-Count", line 10, in <module>
    sys.exit(main())
  File "/.local/lib/python3.7/site-packages/cite_seq_count/__main__.py", line 352, in main
    collapsing_threshold=args.bc_threshold)
  File "/.local/lib/python3.7/site-packages/cite_seq_count/processing.py", line 310, in correct_cells
    plotfile_prefix=False)
  File "/.local/lib/python3.7/site-packages/umi_tools/whitelist_methods.py", line 447, in getCellWhitelist
    cell_barcode_counts, cell_number, plotfile_prefix)
  File "/.local/lib/python3.7/site-packages/umi_tools/whitelist_methods.py", line 322, in getKneeEstimateDistance
    raise ValueError("Something's gone wrong here!!")
ValueError: Something's gone wrong here!!

Is there something that I missed?
Thanks!!
nicolas

Why this warning?

If you have a read length greater than the cell barcode + UMI length, the tool says:

"Read 1 length is 35bp but you are using 26bp for cell and UMI barcodes combined.
This might lead to wrong cell attribution and skewed umi counts"

Is this actually a problem, though? Doesn't it just ignore the final 9bp? I could see this warning being useful to make sure the operator is confident in the barcode/UMI lengths chosen; however, there isn't actually a chance it would result in the incorrect cell barcode being chosen, is there?

Read length not consistent

Hey there,

I got the error:
Read2 length is not consistent, please trim all Read2 reads at the same length

So I checked the length of the R2 reads, and some were a bit shorter, so I trimmed everything down to 15bp (the barcode is in the first 8bp anyway). Now the reads are all the same length:
[screenshot: R2 read lengths after trimming]

And it's still returning the same error. Any ideas?

Here's the call:
CITE-seq-Count \
    -R1 /global/scratch/hpc3837/barcode_emt/fastq/Mix-1-Barcode/Mix-1-Barcode_R1_Merged.fastq.gz \
    -R2 /global/scratch/hpc3837/barcode_emt/fastq/Mix-1-Barcode/Mix-1-Barcode_R2_Merged_Trimmed.fastq.gz \
    -t /global/scratch/hpc3837/barcode_emt/data/LMOlist.csv \
    -cbf 1 -cbl 16 -umif 17 -umil 26 -hd 1 -cells 30000 \
    -o CiteSeqCount_Mix1_NoWL.tsv
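One way to double-check the trim is to tally the sequence-line lengths directly; carriage returns or a single untrimmed read can survive a visual inspection. A sketch assuming a gzipped FASTQ:

    import gzip
    from collections import Counter

    # tally R2 sequence-line lengths; a correct trim leaves exactly one key
    lengths = Counter()
    with gzip.open("Mix-1-Barcode_R2_Merged_Trimmed.fastq.gz", "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:
                lengths[len(line.rstrip("\r\n"))] += 1
    print(lengths)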

Extracted ADT cell Barcodes do not match those found in the STAMPS

After extracting the top 3000 ADT counts, I was surprised that only about half of these barcodes were present in the cell barcode read-counts file output by the dropseq-tools pipeline.

Is this due to the barcode synthesis correction step in the dropseq pipeline?

Do you have plans to integrate this script with that pipeline in order to take advantage of their barcode error detection/correction? It should not be too much of an issue since the pipeline extracts the cell and UMI barcodes from read 1 and adds them to the XC and XM tags, respectively, of the resulting BAM/SAM file.

ValueError: hamming expected two unicodes of the same length

I got this error while running the cite-seq count script.

Traceback (most recent call last):
  File "/gpfs/commons/groups/satija_lab/shared/bin/CITE-seq-Count/CITE-seq-Count/CITE-seq-count.py", line 228, in <module>
    main()
  File "/gpfs/commons/groups/satija_lab/shared/bin/CITE-seq-Count/CITE-seq-Count/CITE-seq-count.py", line 200, in main
    temp_res[value] = Levenshtein.hamming(TAG_seq, key) # Get distance from all barcodes
ValueError: hamming expected two unicodes of the same length

Turns out the length of my ADT sequences was longer than the number of base pairs that were sequenced for R2.
Trimming my sequences down to match the number of base pairs from R2 resolved the error.
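The failure mode is easy to reproduce: Levenshtein.hamming requires equal-length inputs, which is exactly what breaks when the reference tags are longer than the sequenced part of R2. A minimal sketch of the same guard and the trimming workaround described above (sequences are placeholders):

    # pure-Python hamming with the same equal-length requirement
    def hamming(a, b):
        if len(a) != len(b):
            raise ValueError("hamming expected two strings of the same length")
        return sum(x != y for x, y in zip(a, b))

    tag = "GTCAACTCTTTAGCGTAG"  # placeholder 18-nt reference ADT sequence
    read2 = "GTCAACTCTTTAGCG"   # only 15 nt were sequenced
    dist = hamming(tag[:len(read2)], read2)  # trim the tag before comparing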

Process gets killed due to memory issues

Hey,

thanks for providing CITE-seq-Count! I am trying to run the tool on ~100,000,000 paired reads on a Linux machine with 64 GB of RAM. It won't finish, automatically killing the task after eating up all the memory. On my MacBook Pro with 32 GB of RAM I ran the tool successfully on the same fastq files. Any help?

Best, Max

Can't install

when using:
sudo pip install CITE-seq-Count==1.4.2 --no-cache-dir

I get:
Could not find a version that satisfies the requirement CITE-seq-Count==1.4.2 (from versions: 1.2)
No matching distribution found for CITE-seq-Count==1.4.2

Minimum hamming distance of TAGS barcode is less than given threshold

Hi,
Following are my antibody tags. I tried to run CITE-seq-Count on 10X data but got the following error. The minimum hamming distance of the tags is 6. Since the maximum allowed hamming distance is 3 by default, such an error is expected: with errors in up to 3 basepairs in each of two sequences, tags could become ambiguous. However, when I provide --hamming-distance 2 I get the same error.

Tags:

CTCATTGTAACTCCT
TGTTCCCGCTCAACT
GCTGCGCTTTCCATT
AACCTAGTAGTTCGG
TTTGTCCTGTACGCC
ACCTTTATGCCACGG
ACTTCCGTCGATCTT
TCAATCCTTCCGCTT
GTCCCTGCAACTTGA
CCAGCTCATTAGAGC
GACCTCATTGTGAAT
AGTTCAGTCAACCGA
ACCTACCTGAGGTTA
TTTACTAAGTCGTTT
CTTCCGATTCATTCA
GGTTGCCAGATGTCA
TGTCTTTCCTGCCAG
CTCCTCTGCAATTAC
CAGTAGTCACGGTCA

Command:

CITE-seq-Count --read1 $R1 --read2 $R2 \
        --tags TAG_LIST.csv \
        --cell_barcode_first_base 1 \
        --cell_barcode_last_base 16 \
        --umi_first_base 17 \
        --umi_last_base 26 \
        --output $OUTPUT_DIR/$SAMPLE_R1".csv" \
        --whitelist $10X_WHITELIST_FILE \
        --hamming-distance 2

ERROR:

Minimum hamming distance of TAGS barcode is less than given threshold
Please use a smaller distance; exiting
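The check can be reproduced directly: the minimum pairwise Hamming distance over the tag list is what the threshold is compared against. A sketch over the tags listed above (only the first three are spelled out here):

    from itertools import combinations

    tags = ["CTCATTGTAACTCCT", "TGTTCCCGCTCAACT", "GCTGCGCTTTCCATT"]  # extend with the full list

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    # the smallest pairwise distance bounds the usable --hamming-distance value
    print(min(hamming(a, b) for a, b in combinations(tags, 2)))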

Regex inefficiency when using many tags plus other issues

Hi @Hoohm. Glad to see this script is gaining many users! After benchmarking in the spring, we're using CITE-seq + hashtags in production now. With more data, I've run into a few issues.

As background, we're running with 3 hashtags and 16 CITE-seq antibodies. Each tag is 15bp long. Tags are being pooled with our 10X transcriptomes and sequenced together, so our fastqs contain ~400M reads each and contain a mix of tags and transcripts.

1.) When running in the non-legacy mode (no -l option), the regex for the CITE-seq tags becomes essentially ([ACGT]{15}). At this point, the regex isn't really doing much.

2.) I typically will have ~150 "N" characters in my tag/transcript R2 reads per every million reads. The regexes don't allow for Ns, so I'm throwing away possible reads due to that (a quick way to quantify this follows at the end of this issue).

3.) When I run in legacy mode (-l option enabled), I would expect that my tag counts should be uniformly lower than with legacy mode disabled because legacy mode adds more stringent regex requirements. However, I find that I get some tag counts that go up and others that go down. This only seems to happen when I have many tags being counted simultaneously. Below is an example on 2.5 million reads (combined transcript/tags) with CITE + hash tags.

[14:01:40 | cite_seq_test] python <<eof
import pandas as pd
x = pd.read_csv('counts.csv', index_col=0, header=0).T
print(x.iloc[:,1:-2].sum(axis=0))
eof
citetag_01-TACGCCTATAACTTG         90
citetag_02-GTGTGTTGTCCTATG         26
citetag_03-GCGATGGTAGATTAT        205
citetag_04-AATTCAACCGTCGCC         79
citetag_05-GATCCCTTTGTCACT         38
citetag_06-AGTTCAGTCAACCGA         76
citetag_07-GCACTCCTGCATGTA         76
citetag_08-CGCGCACCCATTAAA        189
citetag_09-ACAGCGCCGTATTTA         80
citetag_10-TGAGAACGACCCTAA        210
citetag_11-TGTTCCCGCTCAACT        136
citetag_12-TCAATCCTTCCGCTT         28
citetag_13-TTCGCCGCATTGAGT         18
citetag_14-GTCTTTGTCAGTGCA         22
citetag_15-AATAGCGAGCAAGTA         61
citetag_16-CAGTCTCCGTAGAGT         82
hashtag_5-AAGTATCGTTTCGCA         915
hashtag_6-GGTTGCCAGATGTCA        1264
hashtag_7-TGTCTTTCCTGCCAG          35
dtype: int64
[14:02:28 | cite_seq_test] python <<eof
import pandas as pd
x = pd.read_csv('counts_legacy.csv', index_col=0, header=0).T
print(x.iloc[:,1:-2].sum(axis=0))
eof
citetag_01-TACGCCTATAACTTG      69
citetag_02-GTGTGTTGTCCTATG      32
citetag_03-GCGATGGTAGATTAT     171
citetag_04-AATTCAACCGTCGCC      72
citetag_05-GATCCCTTTGTCACT      41
citetag_06-AGTTCAGTCAACCGA      64
citetag_07-GCACTCCTGCATGTA      76
citetag_08-CGCGCACCCATTAAA     215
citetag_09-ACAGCGCCGTATTTA      80
citetag_10-TGAGAACGACCCTAA     189
citetag_11-TGTTCCCGCTCAACT     137
citetag_12-TCAATCCTTCCGCTT      27
citetag_13-TTCGCCGCATTGAGT      16
citetag_14-GTCTTTGTCAGTGCA      19
citetag_15-AATAGCGAGCAAGTA      49
citetag_16-CAGTCTCCGTAGAGT      68
hashtag_5-AAGTATCGTTTCGCA      996
hashtag_6-GGTTGCCAGATGTCA     1354
hashtag_7-TGTCTTTCCTGCCAG       46

That pattern changes to what I'd expect if I only look at hashtags (below). However, note that the counts aren't consistent between searching for 19 tags and searching for 3 tags, even though it's the same 2.5M reads. Any idea what's going on here? Should I post some sample reads as well?

[13:40:54 | cite_seq_test] python <<eof 
import pandas as pd
x = pd.read_csv('hto_counts.csv', index_col=0, header=0).T
print(x.sum(axis=0))
eof
bad_struct                   960067
hashtag_5-AAGTATCGTTTCGCA      1042
hashtag_6-GGTTGCCAGATGTCA      1350
hashtag_7-TGTCTTTCCTGCCAG        50
no_match                       8999
total_reads                   11441
dtype: int64
[13:40:58 | cite_seq_test] python <<eof
import pandas as pd
x = pd.read_csv('hto_counts_legacy.csv', index_col=0, header=0).T
print(x.sum(axis=0))
eof
bad_struct                   172213
hashtag_5-AAGTATCGTTTCGCA       940
hashtag_6-GGTTGCCAGATGTCA      1276
hashtag_7-TGTCTTTCCTGCCAG        45
no_match                         38
total_reads                    2299
dtype: int64
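On point 2 above, the N rate is easy to quantify before deciding whether the regexes should tolerate Ns; a sketch (the file name is a placeholder):

    import gzip

    # count how many R2 reads contain at least one N call
    total = with_n = 0
    with gzip.open("R2.fastq.gz", "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:
                total += 1
                with_n += "N" in line
    print(with_n, "of", total, "reads contain an N")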

memory / cpu requirements

Greetings,

I've been attempting to run the CITE-seq-Count script on our server. My job keeps getting killed, I presume because it has outgrown my estimated memory/CPU requirements. Can you provide a rule of thumb for how many cores / how much memory I should request? I have a 10x run with a whitelist of about 16,000 sample barcodes.

Thanks,
Kelly

Cell barcodes in whiteList not present in output

I am trying to use the CITE-seq-Count tool to associate cell hashing barcodes with cell barcodes from a 10x 3' run. I'm using the filtered barcodes.tsv (after removing '-1') as the whitelist for this tool. The resulting antibody/hashing barcode counts file does not contain information for every cell barcode provided. Seurat's setAssayData() requires the new assay data to have the same list of cell barcodes as the existing dataset. I can fill in the blanks in R after the fact, but it would be nice if the CITE-seq-Count matrix had columns for every cell barcode in the whitelist. Am I doing something wrong?

Command:
python CITE-seq-count.py -R1 Pre_HTO_S1_L001_R1_001.fastq.gz -R2 Pre_HTO_S1_L001_R2_001.fastq.gz -t barcodes.csv -cbf 1 -cbl 16 -umif 17 -umil 26 -tr "^[ATGC]{15}[TGC][A]{6,}" -hd 1 -o Pre_HTO_antibodyCounts.txt -wl preHTO_whiteList.txt

Read 1: 26bp
Read 2: 101bp

Barcodes:
GTCAACTCTTTAGCG,b1
TGATGGCCTATTGGG,b2
TTCCGCCTCTCTTTG,b3
AGTAAGTTCAGCGTA,b4

Head of white list:
TTTGTCACAATCCGAT
TTTGTCACACGGTAGA
TTTGTCACAGCTTAAC
TTTGTCAGTTAAAGTG
TTTGTCATCCGAACGC
TTTGTCATCGATAGAA
TTTGTCATCGCATGAT
TTTGTCATCTACTATC
TTTGTCATCTCTGTCG
TTTGTCATCTTGTATC

Number of cell barcodes given in white list: 8329

Number of cell barcodes present in output counts file: 7844

Thanks in advance,

Jason
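Until the tool pads the matrix itself, the blanks can be filled by reindexing the count table against the whitelist. A pandas sketch, assuming the output is a comma-separated tags-by-cells table (file names follow the command above):

    import pandas as pd

    # add a zero-filled column for every whitelist barcode missing from the output
    counts = pd.read_csv("Pre_HTO_antibodyCounts.txt", index_col=0)
    whitelist = [line.strip() for line in open("preHTO_whiteList.txt") if line.strip()]
    counts = counts.reindex(columns=whitelist, fill_value=0)
    counts.to_csv("Pre_HTO_antibodyCounts_padded.txt")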

job exceeded memory limit

Hello,

Sorry to open several different issues... I try to open one issue per problem so that each can be closed when solved, independently of the others.

As explained in issue #36, I'm running CITE-seq-Count 1.4.1 on data previously analysed with 1.3.4.

I've tried twice, and in both cases the program was killed by the system because it was using too much memory. I'm running it on a cluster using the Slurm job manager, and I have to tell Slurm how much memory to allocate to my script. Is there a way to limit the memory used by CITE-seq-Count? And is there a way to know how much it will need? In this case I allocated 15 GB, and it was killed once after 1h30 and the second time after 33 minutes.

Here is the error message:

slurmstepd: Job 3410940 exceeded memory limit (15912452 > 15728640), being killed

You'll find the script and log file attached to the issue #36.

Thank you very much,

Best wishes,

Alice

Works with 1 million reads; hangs on 50 million.

Hi, thanks for the tool! I'm trying to use it to demultiplex cells by HTO from a HiSeq lane. With the full lane, it maps reads but then pauses and never reaches the "Correcting cell barcodes" stage. It works with option -n 1000000 (one million reads) but not with 50,000,000 reads. I'm giving it 128 GB of memory and multiple days of runtime on a cluster. Any suggestions or notes on what could be causing this?

Command:
CITE-seq-Count -R1 s10xHTO_S1_L001_R1_001.fastq.gz -R2 s10xHTO_S1_L001_R2_001.fastq.gz -t TAG_LIST.csv -cbf 1 -cbl 18 -umif 19 -umil 28 -cells 11000 --threads 1 -n 50000000 -o fiftythou

Running cite-seq-count

Hi,

I recently got fastq files from a cell hashing experiment and I am now trying to process them. I am new to single-cell analysis and generally not very experienced with bioinformatic tools upstream of R, so apologies if my question sounds trivial.

I have merged the fastqs from the different lanes (L001-L004) into a single file for R1 and a single file for R2. I have generated a Barcode_whitelist.txt file (no header or column/row names, just tab delimited) from the barcodes.tsv file produced by cellranger count, and made a tags.csv file with my HTOs.

I have tried running the following on 5000 reads

cite_seq_count -R1 ***_R1_001.fastq.gz -R2 ***_R2_001.fastq.gz -t ***/tags.csv -cbf 1 -cbl 16 -umif 17 -umil 26 -hd 2 -o Result.tsv -wl Barcode_whitelist.txt -n 5000

  • When I look at the Result.tsv file I am not sure it looks like what it's supposed to look like (see attached screenshot below). Is it OK and good to go to be run on all reads?

  • Also, I was told that the ATCACGAT Illumina index was used for the HTO library. Is this something I should use and provide in -tr?

[screenshot: Result.tsv output]

Many thanks for your help!

CITE-seq-Count with v3 vs v2 chemistry

Hello Christoph,

This is a general question/thread regarding the differences in the need to use CITE-seq-Count across different 10X chemistries. I am part of a group which will soon start CITE-seq with both 10X chemistries, and would like feedback about some aspects of the analysis:

  1. Since the release of TotalSeq B, which is compatible with Feature Barcoding and v3 chemistry, it seems to me that cellranger count will provide all the outputs needed to run the analysis in Seurat. Thus, am I correct in understanding that users will no longer need to run CITE-seq-Count?

  2. If using v2 chemistry from 10X, my understanding is that cellranger will not be able to analyze the antibody reads. In this case, is the following correct? RNA reads can be aligned with cellranger count (version 2 or 3), but the CITE-seq (antibody) reads can only be analyzed with CITE-seq-Count? And then the outputs of the above can be added to Seurat 3.

  3. On the same note, I am a bit confused about why the original CITE-seq paper utilized the Drop-seq pipeline to analyze 10X v2 chemistry. Why not cellranger, as was done in this preprint of which you are an author (I am assuming v2 chemistry was used there, although it is not mentioned)? Also, the original CITE-seq paper doesn't mention the use of CITE-seq-Count, but I am assuming that it was actually utilized in the analysis and just not yet released to the public at the time, correct?

One last naive question, for the sake of being clear (I do not have CITE-seq data yet, so I wanted to ensure my interpretation of your documentation is correct). You mention that we need a tags.csv file for CITE-seq-Count. I am assuming that these barcodes and the names of the tags/antibodies are provided by the company (e.g. BioLegend, such as the one shown here for TotalSeq A) or the sequencing core, correct? Just wanted to double-check!

Thanks a lot for all the clarification. This information seems to be a bit sparse currently online, so it is a bit difficult to link it all together when we consider multiple versions of CITE-seq/10X chemistries.

Error messages or no output when running CITE-seq-Count 1.4.2 with specific options

I was able to successfully run the latest release of CITE-seq-Count (version 1.4.2) with the required -cells parameter, but without the -wl, --bc_collapsing_dist, --umi_collapsing_dist, or --max-error options. However, when I tried to run it with --umi_collapsing_dist 1 and --max-error 1 options, the following error messages appeared after mapping was done:
Mapping done
Merging results
Correcting cell barcodes
Finding a whitelist
Traceback (most recent call last):
  File "/ihome/crc/install/cite-seq-count/python3.7/bin/CITE-seq-Count", line 11, in <module>
    sys.exit(main())
  File "/ihome/crc/install/cite-seq-count/python3.7/lib/python3.7/site-packages/cite_seq_count/__main__.py", line 352, in main
    collapsing_threshold=args.bc_threshold)
  File "/ihome/crc/install/cite-seq-count/python3.7/lib/python3.7/site-packages/cite_seq_count/processing.py", line 310, in correct_cells
    plotfile_prefix=False)
  File "/ihome/crc/install/cite-seq-count/python3.7/lib/python3.7/site-packages/umi_tools/whitelist_methods.py", line 447, in getCellWhitelist
    cell_barcode_counts, cell_number, plotfile_prefix)
  File "/ihome/crc/install/cite-seq-count/python3.7/lib/python3.7/site-packages/umi_tools/whitelist_methods.py", line 322, in getKneeEstimateDistance
    raise ValueError("Something's gone wrong here!!")
ValueError: Something's gone wrong here!!

When I tried to run CITE-seq-Count version 1.4.2 with the -wl option to provide the 10x Genomics 737K-august-2016.txt cell barcode whitelist, I was concerned that something was wrong with that run too: it had been going for 75+ minutes with no output in the log file (not even an indication that reads were being counted, mapped, or processed), so I canceled it.

Running without UMI

Hi, thanks for producing the CITE-seq-Count package. Sorry if this is more of a question than an issue. I'm just curious whether it's possible to run CITE-seq-Count without UMIs. We're interested in seeing how the data would look for some experiments we're running. The UMI parameters of the call are "required", so I'm wondering if there's a way to get around this. Thanks.

Error running CITE-seq-count for 10X fastq files.

I got the following error when I tried running CITE-seq-Count. I have no clue what the error means. I would appreciate your help.

Command:

## for 10X
CB_FIRST=1
CB_LAST=16
UMI_FIRST=17
UMI_LAST=26
HAMMING_THRESH=1

CITE-seq-Count --read1 $R1 --read2 $R2 \
	--tags TAG_LIST.csv \
	--cell_barcode_first_base $CB_FIRST \
	--cell_barcode_last_base $CB_LAST \
	--umi_first_base $UMI_FIRST \
	--umi_last_base $UMI_LAST \
	--hamming-distance $HAMMING_THRESH \
	--output Result.tsv \
	--whitelist $WHITELIST \
	--first_n 100000

Output:

loading
100,000 reads loaded
100,000 uniques reads loaded
Traceback (most recent call last):
  File "/homes/scauser/Python/CITE-seq-Count-1.3.1/bin/CITE-seq-Count", line 11, in <module>
    load_entry_point('CITE-seq-Count==1.3.1', 'console_scripts', 'CITE-seq-Count')()
  File "build/bdist.linux-x86_64/egg/cite_seq_count/__main__.py", line 284, in main
  File "build/bdist.linux-x86_64/egg/regex.py", line 265, in search
  File "build/bdist.linux-x86_64/egg/regex.py", line 496, in _compile
_regex_core.error: bad fuzzy cost limit at position 91

UMI number over 20k

Hi, thanks for developing this tool. I have some cell-hashing data, and the UMI number of some of the cells is over 20,000. After running the tool, I got an uncorrected_cells folder containing the cells with ultra-high UMI counts. However, when I merge this file with the results in the umi_count folder and check the UMI count distribution, I find a big gap between the two data sets (umi_count: 1,000-13,802; uncorrected: 20,001-46,428). There is no value at all between 13,802 and 20,001, which is very weird and looks wrong. I tested other datasets as well, and they all have this problem. Not sure what's happening...

Steps to add to unique_lines and UMI_reduce set objects redundant?

I was recently using CITE-seq-Count for a cell hashing project with ~100 million reads in the HTO FASTQ file.

The script works great when I downsample the reads, but it runs into memory issues when trying to run on the full 100 million reads.

So I decided to look at the source code to see if there was any way to make it work better for larger data sets. In doing so, I noticed that the script uses a Python set object, unique_lines, to remove duplicate lines (duplicate defined as the same exact R1 + R2 sequence). Looking at __main__.py, lines 235 to 256 or so.

There is also another, very similar step later on. In later steps (lines 279 to 327 or so), it uses another set object UMI_reduce to hold the unique values for the concatenation of cell barcode + UMI + read 2 sequence.

My question is: aren't unique_lines and UMI_reduce redundant here? The "line" is just the concatenation of cell barcode + UMI + tag, plus maybe some other non-useful sequence. Couldn't we go straight from reading every second line of the FASTQ file to checking against UMI_reduce, and adding to UMI_reduce if it is not already there? (A sketch of that single pass follows below.)
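A minimal sketch of the proposed single pass, deduplicating once on exactly the fields used later (barcode and UMI positions follow the usual 10x layout; the 15-nt tag length and file names are assumptions):

    import gzip
    from itertools import islice

    seen = set()  # single dedup set: cell barcode + UMI + tag slice of R2
    with gzip.open("R1.fastq.gz", "rt") as r1, gzip.open("R2.fastq.gz", "rt") as r2:
        # islice(fh, 1, None, 4) yields only the sequence lines of a FASTQ
        for seq1, seq2 in zip(islice(r1, 1, None, 4), islice(r2, 1, None, 4)):
            cell, umi = seq1[0:16], seq1[16:26]
            seen.add(cell + umi + seq2[:15])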

implementation of barcode whitelist

Hi,

I would like to propose implementing the use of a barcode whitelist for CITE-seq-Count. As of now, the output contains TAG counts for all encountered cell barcodes, not just for those that have been identified as cell-associated. TAG counting could potentially be accelerated by only counting TAGs for cell-associated barcodes, and it would reduce the dimensions and file size of the output table, which in turn would improve downstream analysis.
For 10X, one could just use the barcodes.tsv from "/outs/filtered_gene_bc_matrices/" as the whitelist.
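A sketch of the proposed behaviour, counting TAGs only for whitelisted barcodes (file names, tag length, and barcode positions are assumptions):

    import gzip
    from collections import Counter
    from itertools import islice

    # keep only cell-associated barcodes, dropping the '-1' suffix
    whitelist = {line.strip().split("-")[0] for line in open("barcodes.tsv")}
    counts = Counter()
    with gzip.open("R1.fastq.gz", "rt") as r1, gzip.open("R2.fastq.gz", "rt") as r2:
        for seq1, seq2 in zip(islice(r1, 1, None, 4), islice(r2, 1, None, 4)):
            cell = seq1[0:16]
            if cell in whitelist:          # skip non-cell barcodes early
                counts[(cell, seq2[:15])] += 1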

Best,

Tobias

Best way to count ADT tags ?

Hello,

I recently had some CITE-seq data to analyze and as a beginner, I'm really grateful for your CITE-seq-Count workflow.
It's working well for my HTO labeling, because the sequence in R2 looks like what we expect: [HTO Barcode]-[CGT]-polyA,
with only a few errors. But my ADT labeling is not in the same shape:
(polynucleotide sequence of 0 to 21 nt)-[ADT Barcode]-[CGT]-polyA

As the leading sequence is mostly the full 21 nt and there is no regex option, I can use the -trim=21 option, but I lose a lot of counts.
So then I thought that, as the 21nt sequence is always the same, I could encode it in the csv listing the tags, like:
[ADT Barcode],CD3
G[ADT Barcode],CD3
GG[ADT Barcode],CD3
ACGG[ADT Barcode],CD3
TACGG[ADT Barcode],CD3
... and so on.
But I don't think it's a sustainable solution...
Do you have any better solution to advise me on this issue?
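Rather than enumerating every possible stagger, a fuzzy search can locate the barcode wherever it starts. A sketch using the regex module's fuzzy matching ({s<=1} allows one substitution; the barcode and read are placeholders):

    import regex

    barcode = "GTCAACTCTTTAGCG"                  # placeholder 15-nt ADT barcode
    read2 = "ACGG" + barcode + "CGT" + "A" * 10  # staggered read, as described above

    # search the whole read instead of anchoring at a fixed trim position
    m = regex.search("(%s){s<=1}" % barcode, read2)
    if m:
        print("barcode found at offset", m.start())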

Stuck at "Testing cell barcode collapsing threshold..."

Hi,

With the recent version (1.4.2), it seems I am unable to get past the stage of:

Testing cell barcode collapsing threshold of 1

I used the following command to initially test the workflow:

python3 CITE_seq_Count \
    -R1 test/HTO_R1.fastq.gz \
    -R2 test/HTO_R2.fastq.gz \
    -t test/tags.csv \
    -cbf 1 -cbl 16 \
    -umif 17 -umil 26 \
    -cells 100 \
    -wl test/CellBarcode_GEM_01_1Reads.txt \
    -o CITE_seq_output_GEM01 \
    --no_umi_correctio \
    -n 100

I uploaded the example file here in case it would help.

Strangely, this only happens with the current version; I ran the previous version okay.

Best
Zaki

ValueError: hamming expected two strings of the same length

Hello again --

I ran into the following error on a recent run:
ValueError: hamming expected two strings of the same length

Here is the command I used:
python /bin/CITE-seq-Count \
    -R1 /alpha/cmcginnis/BARS_3_S24_L007_R1_001.fastq.gz \
    -R2 /alpha/cmcginnis/BARS_3_S24_L007_R2_001.fastq.gz \
    -t /alpha/cmcginnis/CSCount/LMOlist.csv \
    -cbf 1 -cbl 16 -umif 17 -umil 26 -tr "[ATCG]{6}" -hd 1 \
    -wl /alpha/cmcginnis/HMEC1_WL.txt \
    -o CSCount_HMEC96orig_allIDs_120118.tsv

Any idea what would cause this error?

UnboundLocalError

Hello,

I am running the CITE-seq-Count pipeline and it gives me an UnboundLocalError. Any suggestions? Here is the command:

python /CITE-seq-Count-master/CITE-seq-count.py -R1 P-xxxx1.fastq.gz -R2 xxxx2.fastq.gz -t TAG_List.csv -cbf 1 -cbl 16 -umif 17 -umil 26 -tr "^[ATGC]{15}[TGC][A]{6,}" -hd 1 -o citeseqcount.tsv -cells 5000

Output:

Traceback (most recent call last):
  File "/CITE-seq-Count-master/CITE-seq-count.py", line 232, in <module>
    main()
  File "/CITE-seq-Count-master/CITE-seq-count.py", line 151, in main
    tag_length = check_tags(ab_map, args.hamming_thresh)
  File "/CITE-seq-Count-master/CITE-seq-count.py", line 136, in check_tags
    return(len(a))
UnboundLocalError: local variable 'a' referenced before assignment

Top unknown barcodes?

Hello,

As far as I can see, with CITE-seq-Count one provides a table of expected barcodes. The final report includes rows for each barcode, bad_struct, and no_match. Is there any way to make it report the frequency of barcodes in the no_match category? For example, when you run bcl2fastq, the report lists the top unknown barcodes. This is a nice check in case the barcodes actually used don't line up with what's expected.

thanks
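The -u option discussed in an earlier issue above writes unknown tags out; an equivalent standalone tally is also only a few lines. A sketch with an assumed tag position, tag length, and file name:

    import gzip
    from collections import Counter

    expected = {"GTCAACTCTTTAGCG", "TGATGGCCTATTGGG"}  # expected tag sequences (examples)
    unknown = Counter()
    with gzip.open("R2.fastq.gz", "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:              # FASTQ sequence lines
                seq = line[:15]         # tag assumed at the start of R2
                if seq not in expected:
                    unknown[seq] += 1
    for seq, n in unknown.most_common(20):  # bcl2fastq-style "top unknown" list
        print(seq, n)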

how to cite cite-seq count

Just a lazy question: how should CITE-seq-Count be cited in a publication?
I didn't find a preferred citation, and I just want to make sure I use the right acknowledgement. The original CITE-seq paper?
Thanks

output names are not compatible with Seurat Read10X

I've got three outputs after running CITE-seq-Count v1.4

  • barcodes.tsv.gz
  • features.tsv.gz
  • matrix.mtx.gz

As suggested in the documentation, I tried to use Read10X in Seurat (v3) but got an error, since Read10X specifically checks for the existence of three files:

  • barcodes.tsv
  • genes.tsv
  • matrix.mtx

Do you have any plans to make an adjustment to resolve this issue? Thanks.
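A common workaround until then is to unpack and rename the files so the v2-style Read10X finds what it expects; a sketch (directory name assumed):

    import gzip
    import os
    import shutil

    # decompress the CITE-seq-Count output and rename features.tsv to genes.tsv
    mapping = {"barcodes.tsv.gz": "barcodes.tsv",
               "features.tsv.gz": "genes.tsv",
               "matrix.mtx.gz": "matrix.mtx"}
    for src, dst in mapping.items():
        with gzip.open(os.path.join("umi_count", src), "rb") as fin, \
             open(os.path.join("umi_count", dst), "wb") as fout:
            shutil.copyfileobj(fin, fout)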

Best,
Joon

ERROR:root:No local minima was accepted

Hello,
I'm trying to run CITE-seq-Count 1.4.1 on data I already analysed with 1.3.4, and it doesn't work in my hands. It starts processing reads, then after processing 850,716 reads it throws the following error:

ERROR:root:No local minima was accepted. Recommend checking the plot output and counts per local minima (requires `--plot-prefix` option) and then re-running with manually selected threshold (`--set-cell-number` option)

Despite this error, the program kept running (without processing any more reads and without generating any output files) until it was killed by the system for using too much memory (after 1h33 in one run and 33 minutes in the other).

I tried twice with the command

CITE-seq-Count -R1 ${R1file} -R2 ${R2file} -t ${TagsFile} -cbf 1 -cbl 16 -umif 17 -umil 26 -cells 15291 -wl ${BClist} --max-error 1 -o ${OutputDir}

I can't find the option --set-cell-number in the documentation, so I don't know what to do.

Could you help me with this?

You'll find the script I used and the log file here attached.
CiteSeqCount.3410940.log
CiteSeqCount_script.txt

Thank you very much,
Best wishes,
Alice

correcting umis

Hello,
Thanks for developing CITE-seq-Count.

I have been trying to run this for some time. However, I always get the "ERROR:root: No local minima was accepted... --plot-prefix" error, which I am ignoring, since skipping cell barcode correction is okay for me. But then I leave the PC for 1 or 2 days while the package is "correcting UMIs", only to find the process killed. I am running this on a Linux OS with 64 GB of RAM.

To note, I am running this on my HTO library and using the filtered whitelist provided by cell ranger.
Please advise.
Thank you.
