grunwaldlab / metacoder
Parsing, Manipulation, and Visualization of Metabarcoding/Taxonomic data
Home Page: http://grunwaldlab.github.io/metacoder_documentation
License: Other
Currently plot_taxonomy is over 500 lines long. It should not be. It could be split into helper functions, such as:
get_sub_graphs
get_sub_layouts
get_search_space
find_overlap
infer_size_range
select_labels
I made a function that already exists...
Make some sort of graphic for comparing more than two treatments. Perhaps a pairwise set of graphs, each highlighting differences.
The UNITE database has a few hex characters in its headers, which breaks extract_taxonomy.
Make a taxonomic_sample function that abstracts the functionality used in get_taxon_sample so it can be used to recursively subsample any set of observations, as long as functions can be defined to get subtaxa IDs and sample IDs for given taxa. taxonomic_sample would need the following additional options:
get_subtaxa: The function used to get the subtaxa of a given taxon.
get_observations: The function used to get observations (e.g. sequence indexes) for a given taxon.
The return type could be either a vector of observation indexes or concatenated get_observations output. get_taxon_sample would then be rewritten as a special case of taxonomic_sample.
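The proposed recursive subsampling could be sketched roughly as follows. All names here, including the max_n argument and the recursive_sample function itself, are hypothetical rather than the current API:

```r
# Sketch of the proposed abstraction: recurse into subtaxa first, then pool
# and subsample. All names and arguments are hypothetical.
recursive_sample <- function(id, get_subtaxa, get_observations, max_n = 5) {
  # Subsample each subtaxon before pooling with this taxon's own observations
  from_subtaxa <- unlist(lapply(get_subtaxa(id), recursive_sample,
                                get_subtaxa = get_subtaxa,
                                get_observations = get_observations,
                                max_n = max_n))
  pool <- c(get_observations(id), from_subtaxa)
  if (length(pool) > max_n) sample(pool, max_n) else pool
}

# Toy taxonomy: taxon "1" has subtaxa "2" and "3"
taxonomy <- list("1" = c("2", "3"), "2" = character(0), "3" = character(0))
obs <- list("1" = integer(0), "2" = 1:10, "3" = 11:13)
res <- recursive_sample("1", function(id) taxonomy[[id]],
                        function(id) obs[[id]], max_n = 2)
```

Passing the two accessors as functions is what keeps the sampler independent of any particular taxonomy representation.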
Some taxa do not have a taxonomic level assigned. This makes level-based actions, like count filtering, not applicable. Currently, high-level taxa without defined levels are not subject to count filtering, sometimes making them inflated. I need a way to handle this more intelligently. Perhaps they could adopt the rank of sister taxa, if those exist. Or there could be a limit on how many consecutive levels of unidentified ranks will be explored before giving up; setting this limit to 0 would provide a way to ignore taxa without assigned levels. Not perfect solutions, but they could help...
Currently, labels are plotted on a [0,1] coordinate space, whereas the other elements are plotted in the space returned by igraph layout functions. Therefore, there is no easy way right now to estimate the size of labels in igraph space, so labels do not affect the plotting window/margins. This means that if the margin is set to 0, labels can be cut off.
To fix this, I need to estimate the size of labels in igraph space. While doing this, I should also standardize the meaning of the statistic suffixes:
_u: what the user supplied
_t: transformed user input
_g: value in igraph space
_p: value in terms of proportion of graph dimension. This has to be used for plotting text.
There are a lot of dependencies that are used only rarely and could possibly be removed.
extract_taxonomy is taking forever to parse the ~20,000 records in the UNITE general FASTA release, even when not querying databases.
For some reason, the text in plots is no longer positioned correctly. Also, the default size range of elements is off.
Currently, building the vignettes requires primersearch, from the EMBOSS toolkit, to be installed. It should not.
This idea and the majority of the descriptive text is from the following blog post:
First, create a new grob class called resizingTextGrob that resizes its text automatically:
library(grid)
library(scales)
resizingTextGrob <- function(...)
{
grob(tg=textGrob(...), cl='resizingTextGrob')
}
The drawDetails method is called automatically when drawing a grob using grid.draw:
drawDetails.resizingTextGrob <- function(x, recording=TRUE)
{
grid.draw(x$tg)
}
The preDrawDetails method is automatically called before any drawing occurs:
preDrawDetails.resizingTextGrob <- function(x)
{
h <- convertHeight(unit(1, 'snpc'), 'mm', valueOnly=TRUE)
fs <- rescale(h, to=c(18, 7), from=c(120, 20))
pushViewport(viewport(gp = gpar(fontsize = fs)))
}
To clean up after drawing, the created viewport is popped:
postDrawDetails.resizingTextGrob <- function(x)
popViewport()
g <- resizingTextGrob(label='test 1')
grid.draw(g)
grid.text('test 2', y=.4)
library(ggplot2)
x = data.frame(x = 1:10, y = 1:10)
ggplot(data = x, aes(x = x, y = y)) + geom_point() + annotation_custom(g)
The option replace could be added to taxonomic_sample to allow sampling with replacement. This could be useful for bootstrapping applications.
A new function to extract the taxonomy information from sequence headers could be very useful. I imagine this would be an upper level function, potentially built upon more specific functions. The basic idea is that the user would supply a vector of sequence headers and identify the locations of bits of information that could be used to derive a taxonomy.
For example, say a particular database had a sequence header of the type:
>name_of_sequence-1234-description
where 1234 is the GenBank ID. The function would be called something like:
x <- c(">name_of_sequence-1234-description", ...)
extract_taxonomy(x, "^>.+-%genid%-.+$")
where %genid% would be a function-specific pattern indicating the identity of the relevant information. The function would then extract the GenBank ID using a modified version of the supplied regex, look up the taxonomy information, and parse it into a standardized format.
I imagine the output would consist of two parts: a vector of numerical taxon IDs, named by their common names, and something equivalent to an adjacency list, adjacency matrix, or taxon lineage list that allows a given taxon ID to be defined in the context of the entire taxonomy. The taxon IDs could be official (e.g. GenBank taxon UIDs) or arbitrary if the sequence header information had a taxonomic lineage:
>name_of_sequence:kingdom-phylum-class-order-family-genus-species:description
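A minimal sketch of how such a template might be handled, assuming the %genid% tag is simply swapped for a capture group. The extract_field function and the tag handling are illustrative, not the actual implementation:

```r
# Illustrative only: swap the %genid% tag for a capture group, then use the
# resulting regex to pull the field out of each header.
extract_field <- function(headers, template, tag = "%genid%") {
  pattern <- sub(tag, "([^-]+)", template, fixed = TRUE)  # tag -> capture group
  sub(pattern, "\\1", headers)                            # keep only the capture
}

x <- c(">name_of_sequence-1234-description", ">other_sequence-5678-description")
extract_field(x, "^>.+-%genid%-.+$")
# [1] "1234" "5678"
```

The extracted IDs could then be fed to whatever database lookup builds the standardized taxonomy.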
There is some ambiguity about whether the primer sequences are included in the amplicon and "length" fields. This should be specified. Perhaps an option should be added to choose whether primer sequences are taken into account.
ecoPrimers tries to generate metabarcoding primers from sequence databases:
http://www.grenoble.prabi.fr/trac/ecoPrimers/wiki/ecoPrimers
I often need to clean up taxon names, such as:
Inocybaceae_sp (not informative for species, should be NULL)
Rhodotorula_cycloclastica (should be "cycloclastica")
Caloplaca_sp_RVM_2012
Russula_cf_brevipes_RK8
Russula_brevipes_var_acrior
Russula_aff_brevipes_r_04085
"Crenarchaeota"
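A rough sketch of what such cleanup rules might look like; the regexes are illustrative and far from exhaustive, and the clean_species name is hypothetical:

```r
# Illustrative cleanup rules: return the species epithet, or NA when the
# name is uninformative (e.g. "_sp"). Regexes are a sketch, not exhaustive.
clean_species <- function(x) {
  x <- sub("_(cf|aff)_", "_", x)               # drop uncertainty qualifiers
  x <- sub("_(var|f)_.*$", "", x)              # drop infraspecific ranks
  x <- sub("_[A-Z0-9][A-Za-z0-9_]*$", "", x)   # drop strain/voucher codes
  epithet <- sub("^[A-Za-z]+_", "", x)         # drop the genus prefix
  ifelse(epithet == "sp" | epithet == x, NA_character_, epithet)
}

clean_species(c("Inocybaceae_sp", "Rhodotorula_cycloclastica",
                "Caloplaca_sp_RVM_2012", "Russula_cf_brevipes_RK8"))
# -> NA, "cycloclastica", NA, "brevipes"
```

The quoted-name case ("Crenarchaeota") and variety names would need additional rules.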
Add a column to the taxon data frame returned by extract_taxonomy that has the number of taxa that each taxon is a subtaxon of. This could be used as a default value for rank.
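Counting supertaxa could be sketched as a walk up a parent lookup. The count_supertaxa name and the parent-vector representation are hypothetical:

```r
# Hypothetical representation: a named vector mapping each taxon to its
# parent, with NA marking a root taxon. Counts ancestors of each taxon.
count_supertaxa <- function(parents) {
  depth_of <- function(id) {
    n <- 0L
    while (!is.na(parents[[id]])) {  # walk up until a root is reached
      id <- parents[[id]]
      n <- n + 1L
    }
    n
  }
  vapply(names(parents), depth_of, integer(1))
}

parents <- c(Fungi = NA, Russulaceae = "Fungi", Russula = "Russulaceae")
count_supertaxa(parents)
# -> Fungi = 0, Russulaceae = 1, Russula = 2
```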
Currently, the range of the statistics supplied is used to infer the range displayed. This is optimal for many kinds of data, but it makes it hard to make multiple graphs with the same color/size-to-statistic relationship, or to set logical ranges, such as a diverging scale with 0 in the middle.
All text in the plot should be added to the same data.frame and treated the same. This will remove a lot of repetitive code and make it easier to calculate margins correctly.
If a parent column were included in the taxonomy data frame, it would in effect be an adjacency list, rendering the taxonomy classification key output redundant.
Make the background of the plot transparent and pick a color scale that is color-blind friendly.
The automatically scaling text grobs are ideal for most publication applications, but they are apparently computationally intensive (for some reason I have not yet investigated) when there are hundreds or more. It would be good to have an option to use standard text grobs when a lot of text needs to be displayed.
Documentation in vignettes is a great thing, but having examples in the man pages is extremely helpful, especially when the user wants to quickly see how a certain function is used.
Currently, each tree can have titles, but not the overall graph when there are multiple trees.
If the user does not specify a layout, use the mean and standard deviation of rank depth to pick an appropriate layout.
There are a few ways I can see of handling arbitrary IDs when mixed with real IDs:
allow: Allow arbitrary IDs.
warn: Allow them, but warn the user.
error: Throw an error if they are needed.
na: Replace them, and any information derived from them, with NAs.
Add the ability to output a mixture of verified unique taxon IDs and arbitrary IDs when looking up classification names. I imagine this could be useful when some taxon names cannot be found in a database.
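The four behaviors could be sketched roughly as follows. The handle_arbitrary_ids function and its arguments are hypothetical:

```r
# Hypothetical sketch of the four proposed behaviors; `verified` flags which
# IDs were found in the reference database.
handle_arbitrary_ids <- function(ids, verified,
                                 action = c("allow", "warn", "error", "na")) {
  action <- match.arg(action)
  if (any(!verified)) {
    bad <- toString(ids[!verified])
    if (action == "error") stop("Arbitrary IDs are needed: ", bad)
    if (action == "warn")  warning("Using arbitrary IDs: ", bad)
    if (action == "na")    ids[!verified] <- NA  # also drop derived info
  }
  ids
}

handle_arbitrary_ids(c("9606", "arb_1"), c(TRUE, FALSE), action = "na")
# -> "9606", NA
```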
The taxa output could indicate the type of each ID ("verified", "arbitrary", or "unknown").
SAP uses a Bayesian approach to assign a probability that an unknown sequence belongs to a particular taxon. This could be used to estimate a reference sequence's taxonomic resolution.
http://ib.berkeley.edu/labs/slatkin/munch/StatisticalAssignmentPackage.html
taxonomic_sample is currently abstracted to the point where it can be used on any set of items with an indexed hierarchical classification. I could remove all references to taxonomy and rename it recursive_sample. The only potential problem is the rank concept.
When I look at the index for metacoder, there are a lot of functions that don't seem to be immediately useful to the user, such as resizingTextGrob(). Additionally, I notice that there are unexported functions documented, such as verify_color_range(). These create a lot of clutter in the index for metacoder and might confuse users. I have a couple of suggestions to de-clutter these:
Don't export them. You can change/add/remove internal functions to your heart's desire, but an exported function requires a version update and has the potential to break a user's workflow.
Documenting internal functions is a fantastic idea, but they should not be displayed in the index, since that becomes the table of contents for the user manual. Instead, I recommend adding @keywords internal to your roxygen directives for the unexported functions. This will still create the documentation for those who need it, but will hide it from those who don't.
There should be a workflow vignette that provides a few example workflows.
Something like:
extract_taxonomy
taxonomic_sample
Complete mitochondrial sequences, with COX1 as the barcode, would be a clean example. Maybe we are trying to evaluate a barcode for a group of insects...