hbctraining / dge_workshop_salmon_online
Home Page: https://hbctraining.github.io/DGE_workshop_salmon_online/
https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/pathway_topology.html
On this page, the link in the sentence quoted below does not point to the blog post. The correct link has no "www." in it.
"The blog post from Getting Genetics Done provides a step-by-step procedure for using and understanding SPIA."
the schedule page needs to be updated with these
The lesson is a bit text heavy and could use some paring down with the use of bullet points, sub-sectioning etc
@mistrm82 has some ideas that she will implement in a branched version, followed by a pull request for review from the team
Change data visualization heatmap image in lesson with a more recent one
look at language in lesson and 5b wald test results
order doesn't matter - but as best practice keep main factor last because of certain defaults
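A minimal sketch of the default this note refers to, assuming a simulated dataset from `makeExampleDESeqDataSet()` in place of the workshop data and a hypothetical `batch` factor added only for illustration. When `results()` is called without `contrast` or `name`, it reports the comparison for the last variable in the design, which is why keeping the main factor last is convenient:

```r
library(DESeq2)

# Simulated stand-in for the workshop data (assumption: 1000 genes, 6 samples).
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)

# Hypothetical nuisance factor, balanced across the two conditions (A, B).
dds$batch <- factor(rep(c("b1", "b2"), times = 3))

# The main factor of interest (condition) goes last in the design.
design(dds) <- ~ batch + condition
dds <- DESeq(dds, quiet = TRUE)

# With no contrast or name supplied, results() uses the last design variable.
res <- results(dds)
mcols(res)$description[2]
```

Reordering the design to `~ condition + batch` fits the same model, but the default `results()` call would then report the batch comparison instead.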
"I am self-learning bioinformatics using the awesome material prepared by your training program.
For the topic:
The formulas for "How is the dispersion value derived?" are not displayed.
Just thought I would let you know."
remove things we don't use anymore ? SPIA
Also add new visualizations (suggestions from @hwick's code)
Figure with X, Y, and Z genes is misleading as Z has the fewest reads but is the longest. As this figure is only designed to visualize differences caused by library size it is better to remove this and only show X and Y. The length issue is discussed in the next figure.
The ggplot2 link on the schedule page is not working.
In the setup lesson (https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/01b_DGE_setup_and_overview.html), provide the project that already contains data and folders, so that there is no need to download data.
Hi,
thanks a lot for the fantastic workshop for DGE analyses, I really enjoy it and learned a lot. :)
I am now trying to run analyses with my own data. For previous SNP analyses and now the Salmon quantification I used the NCBI RefSeq Transcripts FASTA (https://www.ncbi.nlm.nih.gov/genome/guide/human/). Thus, I am trying to build my tx2gene annotation file from the NCBI annotation. Would you have a recommendation, which ah$dataprovider to query? Is there anything else I should adapt/ keep my eyes on, compared to the presented workflow using ensembldb?
Thanks a lot in advance for your help. :)
Best wishes, Ella
Updates in clusterProfiler use the new msigdbr package to query these datasets; however, the older method should still work: https://yulab-smu.top/biomedical-knowledge-mining-book/universal-api.html?q=msig#msigdb-analysis
https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/05b_wald_test_results.html
The green and purple gene figure on the right is not fully explained. The dashed vs. solid lines are crucial and not mentioned. This is another figure from the DESeq2 paper.
My mistake, misread!
Hey all :)
clusterProfiler has added a utility (setReadable) to translate Entrez ID outputs (e.g. from gseKEGG) to gene symbols: https://yulab-smu.top/biomedical-knowledge-mining-book/useful-utilities.html#setReadable
gseaKEGG <- gseKEGG(geneList = foldchanges, # ordered named vector of fold changes (Entrez IDs are the associated names)
                    organism = "hsa", # supported organisms listed below
                    nPerm = 1000, # default number permutations
                    minGSSize = 20, # minimum gene set size (# genes in set) - change to test more sets or recover sets with fewer # genes
                    pvalueCutoff = 0.05, # padj cutoff value
                    verbose = FALSE)
no term enriched under specific pvalueCutoff...
Warning messages:
1: In .GSEA(geneList = geneList, exponent = exponent, minGSSize = minGSSize, :
We do not recommend using nPerm parameter incurrent and future releases
2: In fgsea(pathways = geneSets, stats = geneList, nperm = nPerm, minSize = minGSSize, :
You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument in the fgsea function call.
This also complains that there are no terms enriched under my pvalueCutoff, but there are definitely terms with < 0.05
clusterProfiler_4.8.2
https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/01b_DGE_setup_and_overview.html
The featureCounts link is broken
the correlation coloring is the reverse of what the mind (my mind) expects. Should be redone here with the reverse color scheme.
https://github.com/hbctraining/DGE_workshop_salmon_online/blob/master/lessons/03_DGE_QC_analysis.md
This is a draft of how to fix it:
#check available updated database
query(ah,'org.Hs.eg.db.sqlite')
human_orgdb <- query(ah, c("Homo sapiens", "OrgDb"))
test <- human_orgdb[["AH111575"]]
test
When working with my own data set, when running the cnetplot code I get this warning:
Scale for size is already present.
Adding another scale for size, which will replace the existing scale.
I noticed that the legend for size is clearly not pvalue (integer numbers > 1), which is what the instructions say. Similarly, the example plot in the lesson has integers > 1, so they clearly aren't p values either. When looking at the help page for cnetplot, categorySize isn't a listed argument. If I plot without this argument, the plot looks exactly the same (and oddly, I still get the warning). The "size" appears to represent the number of significant genes in the GO term.
Additionally there is this warning:
Warning message:
In cnetplot.enrichResult(x, ...) :
Use 'color.params = list(foldChange = your_value)' instead of 'foldChange'.
The foldChange parameter will be removed in the next version.
It sounds like aspects of this function have changed since these lessons were written and need updating. Regardless, the text as it stands is inaccurate even for the current plot image:
"Finally, the category netplot shows the relationships between the genes associated with the top five most significant GO terms and the fold changes of the significant genes associated with these terms (color). The size of the GO terms reflects the pvalues of the terms, with the more significant terms being larger. This plot is particularly useful for hypothesis generation in identifying genes that may be important to several of the most affected processes."
This should be changed to reflect that node size actually represents the number of significant genes in the GO terms.
Apparently the link is broken because Horvath left UCLA and is no longer paying for the website.
Our best bet is probably to replace it with the Internet Archive link to the original, here: https://web.archive.org/web/20230323144343/horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/
More info about the disappearance and some other links here:
https://www.reddit.com/r/bioinformatics/comments/1cr7m9j/what_happened_to_the_wgcna_tutorial/
No error. It just does not produce any graph.
> pathview(gene.data = foldchanges,
+ pathway.id = "hsa03008",
+ species = "hsa",
+ limit = list(gene = 2, # value gives the max/min limit for foldchanges
+ cpd = 1))
'select()' returned 1:1 mapping between keys and columns
Info: Working in directory /Users/hew416/Library/CloudStorage/OneDrive-HarvardUniversity/Desktop/DEanalysis
Info: Writing image file hsa03008.pathview.png
pathview_1.40.0
dds_lrt is originally created in Multiple test corrections with:
dds_lrt <- DESeq(dds, test="LRT", reduced = ~ 1)
It is referred to later in the LRT lesson.
rld_mat is originally created in Visualizations self learning with:
rld <- rlog(dds, blind=TRUE)
rld_mat <- assay(rld)
It is referred to later in the LRT lesson.
We should either add a note for how to recreate the objects, or instruct students to save (and load) their environment for each lesson. I think it is easier to just add a note for how to recreate the objects since environments can get confusing and might be conceptually new for students (we don't introduce the concept)
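One possible shape for such a note, as a sketch only: the two objects are rebuilt only when they are missing from the environment. The simulated `dds` from `makeExampleDESeqDataSet()` is an assumption made so the snippet runs on its own; in the lesson, the existing workshop `dds` would be used instead.

```r
library(DESeq2)

# Stand-in for the workshop dds (assumption); skip this step in the lesson.
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)

# Recreate the LRT object only if it is missing from the environment.
if (!exists("dds_lrt")) {
  dds_lrt <- DESeq(dds, test = "LRT", reduced = ~ 1, quiet = TRUE)
}

# Recreate the rlog-transformed matrix only if it is missing.
if (!exists("rld_mat")) {
  rld <- rlog(dds, blind = TRUE)
  rld_mat <- assay(rld)
}
```

The `exists()` guards keep the note safe to run for students who already have the objects from the earlier lessons.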
Hey all :)
Maybe you have seen this already, but I just noticed that they have removed the Stat column from the DESeq2 results output after shrinkage using apeglm. The reason for this is well-described by Mike here: https://support.bioconductor.org/p/129277/. Not sure if this is necessary to add to the lesson, but at least just a heads-up.
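A short sketch of what this looks like in practice, assuming a simulated dataset rather than the workshop data and assuming the apeglm package is installed. It compares the column names of the unshrunken results with those returned by `lfcShrink()` using `coef` and `type = "apeglm"`:

```r
library(DESeq2)

# Simulated stand-in for the workshop data (assumption), with some true effects.
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 6, betaSD = 1)
dds <- DESeq(dds, quiet = TRUE)

# Unshrunken Wald test results include a 'stat' column.
res_unshrunken <- results(dds, name = "condition_B_vs_A")

# apeglm shrinkage is specified by coefficient name; requires the apeglm package.
res_shrunken <- lfcShrink(dds, coef = "condition_B_vs_A", type = "apeglm", quiet = TRUE)

colnames(res_unshrunken)
colnames(res_shrunken)  # no 'stat' column after apeglm shrinkage
```

Any lesson code that selects or displays `stat` after shrinkage would need to take it from the unshrunken table instead.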
The code needs either a note saying that this is demo code and should not be run,
OR
we should check whether the code below works with the workshop dataset and update the lesson to this:
DEGreport::degPlot(dds = dds, res = res_tableOE, n = 20, xs = "sampletype", group = "sampletype")
DEGreport::degVolcano(
data.frame(res_tableOE_tb[,c("log2FoldChange","padj")]), # table - 2 columns
plot_text = data.frame(res_tableOE_tb[1:10,c("log2FoldChange","padj", "symbol")]))
# Available in the newer version for R 3.4
Many people got this warning when running the dotplot command:
wrong orderBy parameter; set to default orderBy = "x"
Maybe this will help?
dotplot(myResults, showCategory=15, orderBy="GeneRatio")
https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/04b_DGE_DESeq2_analysis.html
The close-up of the dispersions figure is taken directly from the DESeq2 paper, but the figure is not really explained and none of the acronyms are defined.
Add apeglm installation (via BiocManager) to the instructions.
ensembldb now has a filter function that can mask dplyr's filter, so be sure to put dplyr:: in front of filter commands like on the "Summarizing results and extracting significant gene lists" page:
Change:
sigOE <- res_tableOE_tb %>%
filter(padj < padj.cutoff)
To:
sigOE <- res_tableOE_tb %>%
dplyr::filter(padj < padj.cutoff)
And on Visualization
Change:
norm_OEsig <- normalized_counts[,c(1:4,7:9)] %>%
filter(gene %in% sigOE$gene)
To:
norm_OEsig <- normalized_counts[,c(1:4,7:9)] %>%
dplyr::filter(gene %in% sigOE$gene)
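A small self-contained illustration of the namespaced call, using a toy table with made-up values in place of `res_tableOE_tb`. Because the call is written as `dplyr::filter()`, it behaves the same regardless of which other packages are attached afterwards:

```r
library(dplyr)

# Toy results table (hypothetical values) standing in for res_tableOE_tb.
res_tableOE_tb <- tibble(
  gene = c("geneA", "geneB", "geneC"),
  padj = c(0.001, 0.2, 0.04)
)
padj.cutoff <- 0.05

# The explicit namespace avoids picking up a filter() from another
# attached package (for example ensembldb, as described above).
sigOE <- res_tableOE_tb %>%
  dplyr::filter(padj < padj.cutoff)

sigOE$gene
```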
While there are no exercises, we no longer have a PollEverywhere, so we need to create a Day 3 Google poll for questions to get posted to.
Remove "We have available an example html report for perusal. " Need to create an issue to find the report.
The pseudobulk DE lesson has code for a nice faceted gene plot of expression differences between groups. Add it in here.
LRT: description of axes of plots is wrong
summarizing workflow: covariates
extra pink mean dot in the first figure here: https://github.com/hbctraining/DGE_workshop_salmon_online/blob/master/lessons/04a_design_formulas.md
The pink dot all the way to the right should be removed.
The code needs updates to lfcShrink and also a review of some of the comments in there
newer approaches, methods?
Remove some of the older methods we don't really use?
Do we want to finally move over to this using coef?
maybe we should do clusterProfiler for up and down regulated genes separately? Or at least have a note about it in the lesson
Error about no term enriched
Should we look into this?
has z score abundance