cuttag_tutorial's Issues

CUT&Tag duplication

Hi there,

Thank you for the CUT & TAG tutorial which is extremely helpful.
I have a question regarding duplicate removal in the CUT&Tag samples.
I read in the tutorial:

"CUT&Tag integrates adapters into DNA in the vicinity of the antibody-tethered pA-Tn5, and the exact sites of integration are affected by the accessibility of surrounding DNA. For this reason fragments that share exact starting and ending positions are expected to be common, and such ‘duplicates’ may not be due to duplication during PCR. In practice, we have found that the apparent duplication rate is low for high quality CUT&Tag datasets, and even the apparent ‘duplicate’ fragments are likely to be true fragments. Thus, we do not recommend removing the duplicates. In experiments with very small amounts of material or where PCR duplication is suspected, duplicates can be removed."

My samples show 70-90% duplication, and I am wondering how to decide whether those duplicates arise from PCR or not.
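For reference, the apparent duplication rate is reported in the Picard MarkDuplicates metrics file that the tutorial's pipeline writes. Below is a minimal sketch of pulling PERCENT_DUPLICATION out of such a file by header name; the metrics file here is a mock with made-up numbers, just so the snippet runs on its own.

```shell
# Build a mock Picard duplication-metrics file (header row + one data row).
printf '## METRICS CLASS\tpicard.sam.DuplicationMetrics\n'  > mock_picard.dupMark.txt
printf 'LIBRARY\tREAD_PAIRS_EXAMINED\tREAD_PAIR_DUPLICATES\tPERCENT_DUPLICATION\n' >> mock_picard.dupMark.txt
printf 'Unknown Library\t1000000\t850000\t0.85\n' >> mock_picard.dupMark.txt

# Locate the PERCENT_DUPLICATION column by name, then print its value.
dup_rate=$(awk -F'\t' '
  $1 == "LIBRARY" { for (i = 1; i <= NF; i++) if ($i == "PERCENT_DUPLICATION") col = i; next }
  col && NF >= col { print $col; exit }' mock_picard.dupMark.txt)
echo "apparent duplication rate: $dup_rate"
```

One practical heuristic (my assumption, not from the tutorial): compare this rate across replicates and antibodies, and against the number of PCR cycles used; a rate that climbs with PCR cycles or with decreasing input material is more likely to reflect true PCR duplication than Tn5 insertion-site reuse.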

Many thanks.

Best,
Jingkui

Got two peaks in heatmap

Hello, Ye

I was following your tutorial and almost everything worked well, but when I tried to regenerate the heatmap, I got two peaks in my file. I somewhat doubt my hg38 TSV file, and was wondering whether other people have run into the same issue.
Thank you!

Best,
Yifan
[attached image: matrix_gene_hg38]

Duplication in E.coli genome

Hello,
Thank you for creating this helpful tutorial. I have a question about duplication in the E. coli genome. When I aligned to the E. coli spike-in genome, I found that almost all mapped reads were marked as duplicates. When calculating the scale factor to normalize my data, should I use the number of E. coli fragments after removing duplicate reads, or the number of all E. coli fragments?
Thank you; I look forward to your reply.
[attached screenshot: 2024-01-02 194001]
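For context, the tutorial computes the scale factor as a constant divided by the spike-in fragment count, so the choice above changes the count in the denominator. A sketch of that arithmetic with placeholder counts (the real counts would come from counting mapped fragments in the E. coli alignment, e.g. with samtools):

```shell
# Placeholder counts (hypothetical); in practice these come from the spike-in alignment.
spikein_all=200000        # all mapped E. coli fragments
spikein_dedup=15000       # after duplicate removal
C=10000                   # arbitrary constant, as in the tutorial

sf_all=$(awk -v c="$C" -v n="$spikein_all" 'BEGIN { printf "%.4f", c / n }')
sf_dedup=$(awk -v c="$C" -v n="$spikein_dedup" 'BEGIN { printf "%.4f", c / n }')
echo "scale factor (all fragments):   $sf_all"
echo "scale factor (dedup fragments): $sf_dedup"
```

Whichever count you choose, the key is to use the same definition consistently across all samples, since the constant C cancels out when samples are compared to each other.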

Picard SortSam and/or MarkDuplicates not working?

Hi Ye,

Not sure if this is appropriate to ask, but for some reason, at Step 3.3 (Removing duplicates), the Picard MarkDuplicates command reports 0 duplicates. I'm using the same dataset as the tutorial, on IgG_rep1_bowtie2.sam. Would you happen to know how to fix this? Thank you.

12:54:18.412 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Users/sethilab/opt/anaconda3/envs/cutruntools2.1/share/picard-2.27.4-0/picard.jar!/com/intel/gkl/native/libgkl_compression.dylib
[Wed Dec 07 12:54:18 EST 2022] SortSam --INPUT ./alignment/sam/IgG_rep1_bowtie2.sam --OUTPUT ./alignment/sam/IgG_rep1_bowtie2.sorted.sam --SORT_ORDER coordinate --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Wed Dec 07 12:54:18 EST 2022] Executing as [email protected] on Mac OS X 13.0 x86_64; OpenJDK 64-Bit Server VM 17.0.3+7-LTS; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: Version:2.27.4-SNAPSHOT
INFO 2022-12-07 12:54:19 SortSam Seen many non-increasing record positions. Printing Read-names as well.
INFO 2022-12-07 12:54:42 SortSam Finished reading inputs, merging and writing to output now.
[Wed Dec 07 12:54:57 EST 2022] picard.sam.SortSam done. Elapsed time: 0.65 minutes.
Runtime.totalMemory()=536870912
12:54:58.671 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Users/sethilab/opt/anaconda3/envs/cutruntools2.1/share/picard-2.27.4-0/picard.jar!/com/intel/gkl/native/libgkl_compression.dylib
[Wed Dec 07 12:54:58 EST 2022] MarkDuplicates --INPUT ./alignment/sam/IgG_rep1_bowtie2.sorted.sam --OUTPUT ./alignment/removeDuplicate/IgG_rep1_bowtie2.sorted.dupMarked.sam --METRICS_FILE ./alignment/removeDuplicate/picard_summary/IgG_rep1_picard.dupMark.txt --ASSUME_SORT_ORDER coordinate --VERBOSITY WARNING --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --FLOW_MODE false --FLOW_QUALITY_SUM_STRATEGY false --USE_END_IN_UNPAIRED_READS false --USE_UNPAIRED_CLIPPED_END false --UNPAIRED_END_UNCERTAINTY 0 --FLOW_SKIP_FIRST_N_FLOWS 0 --FLOW_Q_IS_KNOWN_END false --FLOW_EFFECTIVE_QUALITY_THRESHOLD 15 --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Wed Dec 07 12:54:58 EST 2022] Executing as [email protected] on Mac OS X 13.0 x86_64; OpenJDK 64-Bit Server VM 17.0.3+7-LTS; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: Version:2.27.4-SNAPSHOT
WARNING 2022-12-07 12:54:59 AbstractOpticalDuplicateFinderCommandLinePrograA field field parsed out of a read name was expected to contain an integer and did not. Read name: SRR11923224.1466798.2. Cause: String 'SRR11923224.1466798.2' did not start with a parsable number.
[Wed Dec 07 12:55:36 EST 2022] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.63 minutes.
Runtime.totalMemory()=536870912
12:55:37.913 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Users/sethilab/opt/anaconda3/envs/cutruntools2.1/share/picard-2.27.4-0/picard.jar!/com/intel/gkl/native/libgkl_compression.dylib
[Wed Dec 07 12:55:38 EST 2022] MarkDuplicates --INPUT ./alignment/sam/IgG_rep1_bowtie2.sorted.sam --OUTPUT ./alignment/removeDuplicate/IgG_rep1_bowtie2.sorted.rmDup.sam --METRICS_FILE ./alignment/removeDuplicate/picard_summary/IgG_rep1_picard.rmDup.txt --REMOVE_DUPLICATES true --ASSUME_SORT_ORDER coordinate --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --FLOW_MODE false --FLOW_QUALITY_SUM_STRATEGY false --USE_END_IN_UNPAIRED_READS false --USE_UNPAIRED_CLIPPED_END false --UNPAIRED_END_UNCERTAINTY 0 --FLOW_SKIP_FIRST_N_FLOWS 0 --FLOW_Q_IS_KNOWN_END false --FLOW_EFFECTIVE_QUALITY_THRESHOLD 15 --ADD_PG_TAG_TO_READS true --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Wed Dec 07 12:55:38 EST 2022] Executing as [email protected] on Mac OS X 13.0 x86_64; OpenJDK 64-Bit Server VM 17.0.3+7-LTS; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: Version:2.27.4-SNAPSHOT
INFO 2022-12-07 12:55:38 MarkDuplicates Start of doWork freeMemory: 529478472; totalMemory: 536870912; maxMemory: 2147483648
INFO 2022-12-07 12:55:38 MarkDuplicates Reading input file and constructing read end information.
INFO 2022-12-07 12:55:38 MarkDuplicates Will retain up to 7780737 data points before spilling to disk.
WARNING 2022-12-07 12:55:38 AbstractOpticalDuplicateFinderCommandLinePrograA field field parsed out of a read name was expected to contain an integer and did not. Read name: SRR11923224.1466798.2. Cause: String 'SRR11923224.1466798.2' did not start with a parsable number.
INFO 2022-12-07 12:55:44 MarkDuplicates Read 1,000,000 records. Elapsed time: 00:00:06s. Time for last 1,000,000: 6s. Last read position: chr5:69,153,488
INFO 2022-12-07 12:55:44 MarkDuplicates Tracking 1000000 as yet unmatched pairs. 67092 records in RAM.
INFO 2022-12-07 12:55:50 MarkDuplicates Read 2,000,000 records. Elapsed time: 00:00:11s. Time for last 1,000,000: 5s. Last read position: chr11:75,183,841
INFO 2022-12-07 12:55:50 MarkDuplicates Tracking 2000000 as yet unmatched pairs. 85106 records in RAM.
INFO 2022-12-07 12:55:55 MarkDuplicates Read 3,000,000 records. Elapsed time: 00:00:17s. Time for last 1,000,000: 5s. Last read position: chrM:855
INFO 2022-12-07 12:55:55 MarkDuplicates Tracking 3000000 as yet unmatched pairs. 24278 records in RAM.
INFO 2022-12-07 12:55:57 MarkDuplicates Read 3386886 records. 3386886 pairs never matched.
INFO 2022-12-07 12:55:57 MarkDuplicates After buildSortedReadEndLists freeMemory: 652251184; totalMemory: 966787072; maxMemory: 2147483648
INFO 2022-12-07 12:55:57 MarkDuplicates Will retain up to 67108864 duplicate indices before spilling to disk.
INFO 2022-12-07 12:55:58 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2022-12-07 12:55:58 MarkDuplicates Traversing fragment information and detecting duplicates.
INFO 2022-12-07 12:55:58 MarkDuplicates Sorting list of duplicate records.
INFO 2022-12-07 12:55:58 MarkDuplicates After generateDuplicateIndexes freeMemory: 958545280; totalMemory: 1503657984; maxMemory: 2147483648
INFO 2022-12-07 12:55:58 MarkDuplicates Marking 0 records as duplicates.
INFO 2022-12-07 12:55:58 MarkDuplicates Found 0 optical duplicate clusters.
INFO 2022-12-07 12:55:58 MarkDuplicates Reads are assumed to be ordered by: coordinate
INFO 2022-12-07 12:56:18 MarkDuplicates Writing complete. Closing input iterator.
INFO 2022-12-07 12:56:18 MarkDuplicates Duplicate Index cleanup.
INFO 2022-12-07 12:56:18 MarkDuplicates Getting Memory Stats.
INFO 2022-12-07 12:56:18 MarkDuplicates Before output close freeMemory: 529395968; totalMemory: 536870912; maxMemory: 2147483648
INFO 2022-12-07 12:56:18 MarkDuplicates Closed outputs. Getting more Memory Stats.
INFO 2022-12-07 12:56:18 MarkDuplicates After output close freeMemory: 529395968; totalMemory: 536870912; maxMemory: 2147483648
[Wed Dec 07 12:56:18 EST 2022] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.68 minutes.
Runtime.totalMemory()=536870912

A simple question

What does "$2" in section 6.1 mean? Please don't laugh at me; I am a novice.

seacr="/fh/fast/gottardo_r/yezheng_working/Software/SEACR/SEACR_1.3.sh"
histControl=$2
mkdir -p $projPath/peakCalling/SEACR
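In shell scripts, "$2" is not part of SEACR; it is the second positional argument passed to the script on the command line ($1 is the first, $2 the second, and so on). A minimal sketch, with made-up sample names:

```shell
# Simulate invoking a script as: bash myscript.sh H3K27me3_rep1 IgG_rep1
# `set --` replaces the positional parameters, standing in for the command line.
set -- H3K27me3_rep1 IgG_rep1
histName=$1       # first argument:  the sample to call peaks on
histControl=$2    # second argument: the IgG control sample
echo "sample:  $histName"
echo "control: $histControl"
```

So in section 6.1, histControl=$2 just means "take the control sample name from the second command-line argument".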

QC: nucleosomal-length fragments

Dear Ye,

Thank you for this very well written tutorial!

I have a very simple question about the QC step on nucleosomal fragment length that you mention in section 3.4:

"CUT&Tag reactions targeting a histone modification predominantly results in fragments that are nucleosomal lengths (~180 bp)"

According to Wikipedia, the nucleosome core particle consists of 146 bp of DNA, and the linker DNA is about 80 bp.

My question thus is: when we check the QC plot, why do we look for a peak at about 180 bp? And what range of lengths would be acceptable?
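For anyone reproducing this QC plot: the tutorial derives fragment lengths from column 9 (TLEN) of the paired-end SAM file, where each proper pair contributes one positive and one negative value. A toy sketch with made-up alignments:

```shell
# A toy paired-end SAM (made-up alignments); column 9 (TLEN) holds the
# signed fragment length, mirrored between the two mates of a pair.
printf 'r1\t99\tchr1\t100\t60\t50M\t=\t230\t180\t*\t*\n'   > toy.sam
printf 'r1\t147\tchr1\t230\t60\t50M\t=\t100\t-180\t*\t*\n' >> toy.sam
printf 'r2\t99\tchr1\t500\t60\t50M\t=\t760\t310\t*\t*\n'   >> toy.sam

# Keep the forward mate of each pair (TLEN > 0) and collect the lengths.
frag_lens=$(awk -F'\t' '$9 > 0 { print $9 }' toy.sam | sort -n | paste -sd, -)
echo "fragment lengths: $frag_lens"
```

As a rough intuition (my reading, not the author's statement): ~180 bp is the 146 bp core particle plus flanking linker DNA that Tn5 cannot access, so a histone-mark dataset typically shows a dominant peak near mono-nucleosome size with smaller peaks at di- and tri-nucleosome multiples.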

Thank you very much!

Bests,
Minglu

Puzzling enrichment peak

We found that some data generated by the CUT&Tag method show an abnormal enrichment pattern. We used a centromere-specific antibody to mark centromere positions. Our existing successful ChIP-seq data, and part of our CUT&Tag data, show significant enrichment there. But puzzlingly, the rest of the CUT&Tag data, which should show enrichment peaks at centromeres, shows the opposite trend: the enrichment peaks turn into enrichment valleys. Have you encountered this problem, and how would you solve it?
Thank you in advance for your answer!

Mistake report

Hello, Ye Zheng:
Maybe your index.html file (https://yezhengstat.github.io/CUTTag_tutorial/index.html) has some errors.
1. In section 3.3, you mistakenly wrote 'rmDuplicate' as 'removeDuplicate'.
2. In section 4.2,

Filter and keep the mapped read pairs

samtools view -bS -F 0x04 $projPath/alignment/sam/${histName}_bowtie2.sam $projPath/alignment/bam/${histName}_bowtie2.mapped.bam

was a '>' redirection symbol left out? It should read:

samtools view -bS -F 0x04 $projPath/alignment/sam/${histName}_bowtie2.sam >$projPath/alignment/bam/${histName}_bowtie2.mapped.bam

Downstream differential analysis

Thank you for creating this helpful tutorial. I have a question about the differential analysis section, in which a count matrix is generated. I noticed that my count table does not carry corresponding gene names, only a numeric index. I suppose that to relate peaks called by SEACR to genes, I would need Rsubread or a similar package; however, my impression is that Rsubread expects .narrowPeak files from MACS2, which SEACR does not produce. Do you recommend that I use MACS2 for peak calling instead, or is there an alternative way to connect peaks to specific genes/TSSs? Thank you!
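For context, standard tools for this step are `bedtools closest` (BED peaks against a TSS BED) or the Bioconductor package ChIPseeker, neither of which requires .narrowPeak input. To make the underlying idea concrete, here is a tiny self-contained awk stand-in with made-up coordinates: for each peak midpoint, report the nearest TSS on the same chromosome.

```shell
# Made-up peaks (chrom start end) and TSS annotations (chrom pos gene).
printf 'chr1\t100\t200\nchr1\t5000\t5400\n' > toy_peaks.bed
printf 'chr1\t150\tGENEA\nchr1\t6000\tGENEB\n' > toy_tss.bed

# First pass loads the TSS table; second pass annotates each peak with the
# nearest TSS gene and the distance from the peak midpoint.
annotated=$(awk -F'\t' '
  NR == FNR { tchr[NR] = $1; tpos[NR] = $2; tgene[NR] = $3; n = NR; next }
  { mid = int(($2 + $3) / 2); best = "NA"; bd = 1e18
    for (i = 1; i <= n; i++) if (tchr[i] == $1) {
      d = mid - tpos[i]; if (d < 0) d = -d
      if (d < bd) { bd = d; best = tgene[i] } }
    print $0 "\t" best "\t" bd }' toy_tss.bed toy_peaks.bed)
echo "$annotated"
```

In practice you would replace the toy files with your SEACR .stringent.bed output and a TSS BED extracted from a gene annotation (e.g. GENCODE), and let bedtools or ChIPseeker do the matching at genome scale.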

CUT&RUN too?

Hi,

Thank you for the in-depth tutorial! Just wondering: can I apply these scripts to my CUT&RUN data instead of CUT&Tag?

Thank you!

FRiPs calculation without IgG control

Hi there,
Thank you very much for the amazing tutorial. I have a question regarding FRiPs calculation. In the tutorial you wrote: "We calculate the fraction of reads in peaks (FRiPs) as a measure of signal-to-noise, and contrast it to FRiPs in the IgG control dataset for illustration. Although sequencing depths for CUT&Tag are typically only 1-5 million reads, the low background of the method results in high FRiP scores. "
However, I don't really understand what you mean by "contrast it to FRiPs in the IgG control", since in the code you take the fragment counts from the BAM file of one replicate and count them against the peaks in the control file only. Also, how can I adapt this step if I don't have an IgG control (which is my case)? I thought about substituting the BED file with the top0.01_stringent.bed of the corresponding replicate, but I'm not sure this is entirely correct.
Thank you in advance for your reply.
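Without an IgG control, one common workaround (my assumption, not from the tutorial) is indeed to compute FRiP against the replicate's own peak set. The calculation itself is just a ratio; a sketch with placeholder counts:

```shell
# Placeholder counts (hypothetical); in practice mapped_frags comes from the
# replicate's BAM and in_peak_frags from intersecting that BAM with the
# replicate's own SEACR peak BED (e.g. via bedtools intersect).
mapped_frags=4000000
in_peak_frags=2600000
frip=$(awk -v a="$in_peak_frags" -v b="$mapped_frags" 'BEGIN { printf "%.3f", a / b }')
echo "FRiP: $frip"
```

Note that FRiP computed against a sample's own peaks is biased upward relative to FRiP against an independent peak set, so it is best used for comparing replicates processed the same way rather than as an absolute quality threshold.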
