dzhang32 / ggtranscript

Visualizing transcript structure and annotation using ggplot2

Home Page: https://dzhang32.github.io/ggtranscript/

License: Other

R 100.00%

gene-annotation visualization transcripts ggplot-extension

ggtranscript's People

Forkers

gpertea unique379r rhreynolds leireabarrategui lutfia95 genostack morpheus2112 ningshuang-yao sablokgaurav

ggtranscript's Issues

Problems using facet_wrap

Hi,
I like the package a lot, but I can't figure out how to use facet_wrap to show transcript expression at different timepoints. A single plot works easily, but when I facet the plots, the arrows are drawn correctly only in the leftmost panel.

prop_plot_introns <- ggtranscript::to_intron(prop_plot, "feature_id")

p <- prop_plot %>%
  ggplot2::ggplot(ggplot2::aes(
    xstart = start,
    xend = end,
    y = feature_id
  )) +
  ggplot2::facet_grid(cols = ggplot2::vars(DAI)) +
  ggtranscript::geom_range(
    ggplot2::aes(fill = Proportion)
  ) +
  ggtranscript::geom_intron(
    data = prop_plot_introns,
    ggplot2::aes(strand = strand)
  ) +
  th +
  ggplot2::scale_fill_gradient2(
    high = "darkorchid",
    low = "white",
    mid = "darkorange",
    midpoint = 0.5
    # midpoint = min(prop_plot$Proportion) + ((max(prop_plot$Proportion) - min(prop_plot$Proportion)) / 2)
  ) +
  ggplot2::guides(fill = ggplot2::guide_legend("Proportion"))
print(p)

where prop_plot:

gene_id	feature_id	chr	start	end	strand	DAI	TPM	Proportion
AT5G20240	AT5G20240.1	Chr5	6828904	6829457	+	0	109.705848163636	0.201793165193519
AT5G20240	AT5G20240.1	Chr5	6828904	6829457	+	4	519.364511065168	0.164112890096364
AT5G20240	AT5G20240.1	Chr5	6828904	6829457	+	8	515.17501570544	0.12987240173574
AT5G20240	AT5G20240.1	Chr5	6830455	6830516	+	0	109.705848163636	0.201793165193519
AT5G20240	AT5G20240.1	Chr5	6830455	6830516	+	4	519.364511065168	0.164112890096364
AT5G20240	AT5G20240.1	Chr5	6830455	6830516	+	8	515.17501570544	0.12987240173574
AT5G20240	AT5G20240.1	Chr5	6830637	6830736	+	0	109.705848163636	0.201793165193519
AT5G20240	AT5G20240.1	Chr5	6830637	6830736	+	4	519.364511065168	0.164112890096364
AT5G20240	AT5G20240.1	Chr5	6830637	6830736	+	8	515.17501570544	0.12987240173574
AT5G20240	AT5G20240.1	Chr5	6830809	6830838	+	0	109.705848163636	0.201793165193519
AT5G20240	AT5G20240.1	Chr5	6830809	6830838	+	4	519.364511065168	0.164112890096364
AT5G20240	AT5G20240.1	Chr5	6830809	6830838	+	8	515.17501570544	0.12987240173574
AT5G20240	AT5G20240.1	Chr5	6830927	6830971	+	0	109.705848163636	0.201793165193519
AT5G20240	AT5G20240.1	Chr5	6830927	6830971	+	4	519.364511065168	0.164112890096364
AT5G20240	AT5G20240.1	Chr5	6830927	6830971	+	8	515.17501570544	0.12987240173574
AT5G20240	AT5G20240.1	Chr5	6831074	6831515	+	0	109.705848163636	0.201793165193519
AT5G20240	AT5G20240.1	Chr5	6831074	6831515	+	4	519.364511065168	0.164112890096364
AT5G20240	AT5G20240.1	Chr5	6831074	6831515	+	8	515.17501570544	0.12987240173574
AT5G20240	AT5G20240.2	Chr5	6828987	6829457	+	0	433.949077207243	0.798206834806481
AT5G20240	AT5G20240.2	Chr5	6828987	6829457	+	4	2645.31384393917	0.835887109903636
AT5G20240	AT5G20240.2	Chr5	6828987	6829457	+	8	3451.60321292623	0.87012759826426
AT5G20240	AT5G20240.2	Chr5	6830455	6830516	+	0	433.949077207243	0.798206834806481
AT5G20240	AT5G20240.2	Chr5	6830455	6830516	+	4	2645.31384393917	0.835887109903636
AT5G20240	AT5G20240.2	Chr5	6830455	6830516	+	8	3451.60321292623	0.87012759826426
AT5G20240	AT5G20240.2	Chr5	6830637	6830838	+	0	433.949077207243	0.798206834806481
AT5G20240	AT5G20240.2	Chr5	6830637	6830838	+	4	2645.31384393917	0.835887109903636
AT5G20240	AT5G20240.2	Chr5	6830637	6830838	+	8	3451.60321292623	0.87012759826426
AT5G20240	AT5G20240.2	Chr5	6830927	6830971	+	0	433.949077207243	0.798206834806481
AT5G20240	AT5G20240.2	Chr5	6830927	6830971	+	4	2645.31384393917	0.835887109903636
AT5G20240	AT5G20240.2	Chr5	6830927	6830971	+	8	3451.60321292623	0.87012759826426
AT5G20240	AT5G20240.2	Chr5	6831074	6831515	+	0	433.949077207243	0.798206834806481
AT5G20240	AT5G20240.2	Chr5	6831074	6831515	+	4	2645.31384393917	0.835887109903636
AT5G20240	AT5G20240.2	Chr5	6831074	6831515	+	8	3451.60321292623	0.87012759826426

Resulting plot:

Do you have any idea what goes wrong here? I suspect facet_wrap doesn't play well with the fact that geom_intron is passed its own dataset.

Thank you,
Andrea
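
One possible cause, offered as an assumption rather than a confirmed diagnosis: prop_plot repeats every exon once per DAI value, so grouping by feature_id alone mixes exons from different timepoints when introns are derived. The sketch below computes introns separately for each transcript and timepoint in base R, so it does not rely on how to_intron() works internally. The helper name introns_by_group and the toy data are hypothetical, and defining an intron as running from one exon's end to the next exon's start is a choice made here.

```r
# Hypothetical helper: derive introns separately for every
# (transcript, facet) combination, so exons repeated across
# timepoints are never mixed together.
introns_by_group <- function(exons, group_vars) {
  groups <- split(exons, exons[group_vars], drop = TRUE)
  introns <- lapply(groups, function(g) {
    g <- g[order(g$start), ]
    if (nrow(g) < 2) return(NULL)          # single-exon group: no intron
    out <- g[-nrow(g), , drop = FALSE]     # keep the metadata columns
    out$start <- g$end[-nrow(g)]           # intron starts where an exon ends
    out$end <- g$start[-1]                 # and ends where the next one begins
    out
  })
  introns <- Filter(Negate(is.null), unname(introns))
  do.call(rbind, c(introns, make.row.names = FALSE))
}

# Toy data: two transcripts, each exon repeated for two timepoints.
exons <- data.frame(
  feature_id = rep(c("tx1", "tx2"), each = 4),
  DAI = rep(c(0, 4), times = 4),
  start = c(100, 100, 300, 300, 150, 150, 400, 400),
  end = c(200, 200, 350, 350, 250, 250, 450, 450),
  strand = "+"
)
introns <- introns_by_group(exons, c("feature_id", "DAI"))
```

The result keeps the DAI column, so it can be passed as data to geom_intron() and faceted alongside the exons. If the installed version of to_intron() accepts several grouping columns, to_intron(prop_plot, c("feature_id", "DAI")) may do the same job; that has not been verified here.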

shortened gaps with exon and CDS

Is there a way to create a plot with shortened gaps that still shows exon vs CDS regions?

lines extend to gene model with geom_junction_label_repel

Hi,
Firstly, thanks for this R package; it's already really useful to me, and it's quite user friendly :)
However, I'm having an issue with the geom_junction_label_repel 'geom' and getting different results from the examples in the README. I was expecting the lines coming from the labels to connect to the junction lines, but instead, they connect to the gene models. I have tried this both with the SOD gene example in the GitHub instructions and with my own data.
I copied the code below from the README and ran it unchanged.

sod1_201_exons %>%
  ggplot(aes(
    xstart = start,
    xend = end,
    y = transcript_name
  )) +
  geom_range(
    fill = "white", 
    height = 0.25
  ) +
  geom_range(
    data = sod1_201_cds
  ) + 
  geom_intron(
    data = to_intron(sod1_201_exons, "transcript_name")
  ) + 
  geom_junction(
    data = sod1_junctions,
    junction.y.max = 0.5
  ) +
  geom_junction_label_repel(
    data = sod1_junctions,
    aes(label = round(mean_count, 2)),
    junction.y.max = 0.5
  )

My session info is below:

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.5.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.2        stringr_1.4.1        purrr_0.3.5          readr_2.1.3          tidyr_1.2.1          tibble_3.1.8         tidyverse_1.3.2      patchwork_1.1.2.9000
 [9] ggsci_2.9            ggplot2_3.4.0        rtracklayer_1.54.0   GenomicRanges_1.46.1 GenomeInfoDb_1.30.1  IRanges_2.28.0       S4Vectors_0.32.4     BiocGenerics_0.40.0 
[17] dplyr_1.0.10         ggtranscript_0.99.9 

loaded via a namespace (and not attached):
 [1] bitops_1.0-7                matrixStats_0.62.0          fs_1.5.2                    lubridate_1.9.0             bit64_4.0.5                 httr_1.4.4                 
 [7] tools_4.1.2                 backports_1.4.1             utf8_1.2.2                  R6_2.5.1                    DBI_1.1.3                   colorspace_2.0-3           
[13] withr_2.5.0                 tidyselect_1.2.0            bit_4.0.4                   compiler_4.1.2              textshaping_0.3.6           cli_3.4.1                  
[19] rvest_1.0.3                 Biobase_2.54.0              xml2_1.3.3                  DelayedArray_0.20.0         labeling_0.4.2              scales_1.2.1               
[25] systemfonts_1.0.4           Rsamtools_2.10.0            XVector_0.34.0              pkgconfig_2.0.3             MatrixGenerics_1.6.0        dbplyr_2.2.1               
[31] rlang_1.0.6                 readxl_1.4.1                rstudioapi_0.14             BiocIO_1.4.0                generics_0.1.3              farver_2.1.1               
[37] jsonlite_1.8.3              BiocParallel_1.28.3         vroom_1.6.0                 googlesheets4_1.0.1         RCurl_1.98-1.9              magrittr_2.0.3             
[43] GenomeInfoDbData_1.2.7      Matrix_1.5-1                Rcpp_1.0.9                  munsell_0.5.0               fansi_1.0.3                 lifecycle_1.0.3            
[49] stringi_1.7.8               yaml_2.3.6                  SummarizedExperiment_1.24.0 zlibbioc_1.40.0             grid_4.1.2                  parallel_4.1.2             
[55] ggrepel_0.9.2               crayon_1.5.2                lattice_0.20-45             Biostrings_2.62.0           haven_2.5.1                 hms_1.1.2                  
[61] knitr_1.40                  pillar_1.8.1                rjson_0.2.21                reprex_2.0.2                XML_3.99-0.12               glue_1.6.2                 
[67] modelr_0.1.9                vctrs_0.5.0                 tzdb_0.3.0                  cellranger_1.1.0            gtable_0.3.1                assertthat_0.2.1           
[73] xfun_0.34                   broom_1.0.1                 restfulr_0.0.15             ragg_1.2.4                  googledrive_2.0.0           gargle_1.2.1               
[79] GenomicAlignments_1.30.0    timechange_0.1.1            ellipsis_0.3.2

Could you please advise what to do?
Thanks!

packing non-overlapping transcripts on the same y

I realize this involves some complex design and implementation changes, and that it is a feature better suited to a full genome browser, which is not the goal of this package.

For cases where many smaller non-overlapping transfrags are generated during transcript assembly, it would be very helpful to be able to pack some of those on the same horizontal line/slot, thus reducing the overall height of the plot needed to show a rather large number of transcripts whenever such partial transcript fragments are present.

Of course, this involves taking the transcript labels off the y axis and placing them next to each transcript (above or below the left-most exon, or centered?). This is clearly a major change to ggtranscript's current design, and I do not expect it to be implemented or even looked at for the moment.
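
The packing step itself can be sketched independently of ggtranscript. The following is a minimal greedy interval-packing example in base R, assuming one row per transcript with its overall start and end; the helper name pack_transcripts and the gap parameter are hypothetical and not part of the package.

```r
# Hypothetical helper: assign each transcript to the first row ("slot")
# whose last transcript ends before this one starts.
pack_transcripts <- function(tx, gap = 0) {
  tx <- tx[order(tx$start, tx$end), ]
  row_end <- numeric(0)              # right-most coordinate used in each row
  tx$row <- NA_integer_
  for (i in seq_len(nrow(tx))) {
    free <- which(row_end + gap < tx$start[i])
    if (length(free) == 0) {
      row_end <- c(row_end, tx$end[i])   # open a new row
      tx$row[i] <- length(row_end)
    } else {
      tx$row[i] <- free[1]               # reuse the first free row
      row_end[free[1]] <- tx$end[i]
    }
  }
  tx
}

tx <- data.frame(
  transcript_id = c("t1", "t2", "t3", "t4"),
  start = c(100, 150, 400, 450),
  end = c(300, 350, 600, 500)
)
packed <- pack_transcripts(tx)
```

Here the four transcripts fit on two rows instead of four. The row column could then be joined back onto the exon table and mapped to y in place of the transcript name, with the labels drawn next to each transcript by a text layer.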

Allow `shorten_gaps()` to take into account CDS as well as exons

Currently, users who want to plot CDS (to differentiate UTRs) cannot use shorten_gaps().

extend `to_diff()` to junctions

The to_diff() function is great for showing differences between exons, but sometimes it is very useful to highlight differences between introns and individual splice sites.

It might make more sense to add a dedicated to_jdiff() helper function for this. I can imagine it being a bit more complex, since we could be highlighting per-splice-site differences, i.e. alternate donor vs. alternate acceptor vs. both (wholly novel introns).
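
The per-splice-site comparison described above can be sketched as a simple classification in base R. This is only an illustration of the idea, not a proposed implementation of to_jdiff(): the function name classify_junctions, the diff_type labels and the toy coordinates are hypothetical, and the sketch assumes all junctions lie on one chromosome.

```r
# Hypothetical helper: label each query junction by which of its splice
# sites are already present in a reference set of junctions.
classify_junctions <- function(query, ref) {
  known_pair <- paste(query$start, query$end) %in% paste(ref$start, ref$end)
  known_start <- query$start %in% ref$start
  known_end <- query$end %in% ref$end
  # On the plus strand the junction start is the donor site; on the
  # minus strand it is the acceptor.
  donor_known <- ifelse(query$strand == "-", known_end, known_start)
  acceptor_known <- ifelse(query$strand == "-", known_start, known_end)
  query$diff_type <- ifelse(
    known_pair, "shared",
    ifelse(donor_known & acceptor_known, "novel_combination",
    ifelse(donor_known, "alternate_acceptor",
    ifelse(acceptor_known, "alternate_donor", "novel_intron")))
  )
  query
}

ref <- data.frame(start = c(200, 350), end = c(300, 400), strand = "+")
query <- data.frame(
  start = c(200, 200, 220, 500, 200),
  end = c(300, 320, 300, 600, 400),
  strand = "+"
)
classified <- classify_junctions(query, ref)
```

The resulting diff_type column could be mapped to colour in geom_junction() to highlight each class of difference.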

Warning about size being replaced by linewidth in ggplot2 3.4.0+

Hi, this is not an issue, just a report about a change in ggplot2 affecting your package.
Best, Nicco

# devtools::install_github("dzhang32/ggtranscript")
library(dplyr)
library(ggplot2)
library(ggtranscript)

Running the first example

# extract exons
sod1_exons <- sod1_annotation |> filter(type == "exon")

sod1_exons |>
  ggplot(aes( xstart = start, xend = end, y = transcript_name ) ) +
    geom_range( aes(fill = transcript_biotype) ) +
    geom_intron( data = to_intron(sod1_exons, "transcript_name"),
                 aes(strand = strand) )

Prints this warning from ggplot2

Warning message:
Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Session info:

 packageVersion('ggplot2')
[1] ‘3.4.1’

packageVersion('ggtranscript')
[1] ‘0.99.9’

R.version.string
[1] "R version 4.1.2 (2021-11-01)"
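
The warning refers to a geom's default_aes field. As a rough illustration of the change it asks for, the sketch below defines a throwaway geom whose default_aes uses linewidth. GeomDemoIntron is a hypothetical name and is not ggtranscript's actual source.

```r
library(ggplot2)

# Hypothetical geom, not ggtranscript's actual source: it only shows a
# `default_aes` that uses `linewidth` (ggplot2 >= 3.4.0) rather than `size`.
GeomDemoIntron <- ggproto(
  "GeomDemoIntron", GeomSegment,
  default_aes = aes(
    colour = "black",
    linewidth = 0.5,
    linetype = 1,
    alpha = NA
  )
)

# User-facing layers follow the same rule for line thickness.
p <- ggplot(data.frame(x = 1, xend = 5, y = 1)) +
  geom_segment(aes(x = x, xend = xend, y = y, yend = y), linewidth = 1)
```

From ggplot2 3.4.0 onwards, size is kept for points and text, while line thickness is controlled by linewidth.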

high-level workflow example

Thank you for this excellent work - I have been looking for something like this for a while as I often needed to explore transcript assemblies within a gene or genomic region while comparing them with reference annotation etc.

I think it would be really useful for the community (and increase the popularity/adoption of the package) to provide a step-by-step workflow example (perhaps adding a few high-level convenience functions) for a use case like this:

1. loading a user-provided GTF/GFF or BED file (e.g. with novel transcripts). As far as I know this can be done starting with something like as.data.frame(rtracklayer::import("user.gtf"))
2. loading reference annotation transcripts for the specified region of interest (or a gene?), from the available AnnotationDbi (orgdb) and txdb packages out there, or pulling such annotation directly from an online resource (biomaRt)
3. displaying the above with ggtranscript in a visually distinct manner (the reference transcripts using a different color from the user-loaded transcripts, or otherwise suggesting/delimiting the two separate "tracks")

A genomic range would of course be required from the start (could be also used to subset the transcripts from the user provided file), with some common-sense checks (if not already implemented) to limit the genomic region width, the maximum number of transcripts etc.
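
The combining and subsetting step of such a workflow might look like the sketch below. The toy data frames stand in for exon tables obtained from calls such as as.data.frame(rtracklayer::import("user.gtf")); the helper name combine_tracks and the track column are hypothetical.

```r
# Hypothetical helper: tag each exon table with its origin, keep the
# columns both tables share, and subset to the region of interest.
combine_tracks <- function(user_exons, ref_exons, region_start, region_end) {
  user_exons$track <- "user"
  ref_exons$track <- "reference"
  shared <- intersect(names(user_exons), names(ref_exons))
  both <- rbind(user_exons[shared], ref_exons[shared])
  # Keep exons that overlap the requested region.
  both[both$end >= region_start & both$start <= region_end, ]
}

user_exons <- data.frame(
  transcript_name = c("novel1", "novel1", "novel2"),
  start = c(100, 400, 5000),
  end = c(200, 500, 5100),
  strand = "+",
  cov = c(3.2, 4.1, 1.0)
)
ref_exons <- data.frame(
  transcript_name = c("ref1", "ref1"),
  start = c(100, 380),
  end = c(200, 500),
  strand = "+"
)
tracks <- combine_tracks(user_exons, ref_exons, 1, 1000)
```

The combined table can then be plotted with geom_range(aes(fill = track)), or faceted by track, to keep the reference and user transcripts visually distinct.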

reverse orientation of transcripts on minus strand

Hi,

Is there an easy way to reverse the orientation of transcripts on the minus strand, so that they also run from left to right?

Kind regards,
Tabea
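
One data-level approach, sketched here under assumptions rather than taken from the package, is to convert exon coordinates to positions relative to each transcript's 5' end before plotting. The helper name to_transcript_coords and the rel_start and rel_end columns are hypothetical.

```r
# Hypothetical helper: express exons as offsets from the transcript's 5' end,
# so minus-strand transcripts also read from left to right.
to_transcript_coords <- function(exons, group_var = "transcript_name") {
  tx_start <- ave(exons$start, exons[[group_var]], FUN = min)
  tx_end <- ave(exons$end, exons[[group_var]], FUN = max)
  minus <- exons$strand == "-"
  # Minus strand: measure from the right-most base and swap start and end.
  exons$rel_start <- ifelse(minus, tx_end - exons$end, exons$start - tx_start)
  exons$rel_end <- ifelse(minus, tx_end - exons$start, exons$end - tx_start)
  exons
}

exons <- data.frame(
  transcript_name = c("tx_minus", "tx_minus", "tx_plus", "tx_plus"),
  start = c(100, 400, 1000, 1300),
  end = c(200, 500, 1100, 1500),
  strand = c("-", "-", "+", "+")
)
flipped <- to_transcript_coords(exons)
```

Mapping xstart = rel_start and xend = rel_end then draws every transcript 5' to 3'. The x axis no longer shows genomic coordinates, and introns would need to be derived from the relative columns, with strand arrows set to point rightwards.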

Make plots interactive

Given just how much information can be gleaned from these plots, it would be incredibly useful if they could be made interactive (to allow zooming, moving along a transcript structure, etc.). plotly makes some ggplot2 geoms interactive via ggplotly(); however, because ggtranscript introduces new geoms, these are not supported.

labeling of exons (junctions?)

For transcripts with many exons, it would be useful to have the option to display the exon order numbers inside the exon (or above/below when the exon height is variable or too small?).

Perhaps a dedicated boolean option to just enable/disable the automatic drawing of exon order numbers for each transcript, with another option for its placement?

A more generic solution would be mapping such exon labels to some GTF exon attribute, like cov or exon_number as found in StringTie output; maybe a label option could be added to geom_range() or its aesthetics. However, the exon_number attribute is often missing, so a helper function could be added to generate it automatically in that case.

As for labeling junctions, I suppose a labeling option could be added to geom_junction() to enable showing the numeric coverage values (supporting reads) for each junction, above the junction curve for top curves, or below for bottom ones.
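
A helper that generates exon numbers when the exon_number attribute is missing could be as small as the sketch below. The function name add_exon_number is hypothetical, and the sketch assumes the exons of a transcript do not share a start coordinate.

```r
# Hypothetical helper: number exons 5' to 3' within each transcript.
add_exon_number <- function(exons, group_var = "transcript_id") {
  # Negating starts on the minus strand makes the right-most exon rank first.
  signed_start <- ifelse(exons$strand == "-", -exons$start, exons$start)
  exons$exon_number <- ave(
    signed_start, exons[[group_var]],
    FUN = function(s) rank(s, ties.method = "first")
  )
  exons
}

exons <- data.frame(
  transcript_id = c("A", "A", "A", "B", "B"),
  start = c(100, 300, 500, 100, 300),
  end = c(200, 400, 600, 200, 400),
  strand = c("+", "+", "+", "-", "-")
)
numbered <- add_exon_number(exons)
```

The exon_number column could then be drawn with a text layer placed at each exon's midpoint, (start + end) / 2.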

adding coverage

Is there a way to add a read coverage track above (or below) the transcripts?
Either from a bigwig or bam file?