grunwaldlab / metacoder
Parsing, Manipulation, and Visualization of Metabarcoding/Taxonomic data
Home Page: http://grunwaldlab.github.io/metacoder_documentation
License: Other
Currently plot_taxonomy is over 500 lines long. It should not be. It could be split into helper functions, such as:
get_sub_graphs
get_sub_layouts
get_search_space
find_overlap
infer_size_range
select_labels
I made a function that already exists...
Make some sort of graphic for comparing more than two treatments. Perhaps a pairwise set of graphs, each highlighting differences.
The UNITE database has a few hex characters in its headers, which breaks extract_taxonomy.
Make a taxonomic_sample function that abstracts the functionality used in get_taxon_sample so it can be used to recursively subsample any set of observations, as long as functions can be defined to get subtaxa IDs and sample IDs for given taxa. taxonomic_sample would need the following additional options:
get_subtaxa: The function used to get the subtaxa of a given taxon.
get_observations: The function used to get observations (e.g. sequence indexes) for a given taxon.
The return type could be either a vector of observation indexes or concatenated get_observations output. get_taxon_sample would then be rewritten as a special case of taxonomic_sample.
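The proposed recursive subsampling could be sketched roughly as follows. All names here, including the max_n argument and the recursive_sample function itself, are hypothetical rather than the current API:

```r
# Sketch of the proposed abstraction: recurse into subtaxa first, then pool
# and subsample. All names and arguments are hypothetical.
recursive_sample <- function(id, get_subtaxa, get_observations, max_n = 5) {
  # Subsample each subtaxon before pooling with this taxon's own observations
  from_subtaxa <- unlist(lapply(get_subtaxa(id), recursive_sample,
                                get_subtaxa = get_subtaxa,
                                get_observations = get_observations,
                                max_n = max_n))
  pool <- c(get_observations(id), from_subtaxa)
  if (length(pool) > max_n) sample(pool, max_n) else pool
}

# Toy taxonomy: taxon "1" has subtaxa "2" and "3"
taxonomy <- list("1" = c("2", "3"), "2" = character(0), "3" = character(0))
obs <- list("1" = integer(0), "2" = 1:10, "3" = 11:13)
res <- recursive_sample("1", function(id) taxonomy[[id]],
                        function(id) obs[[id]], max_n = 2)
```

Passing the two accessors as functions is what keeps the sampler independent of any particular taxonomy representation.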
Some taxa do not have a taxonomic level assigned. This makes level-based actions, like count filtering, not applicable. Currently, high-level taxa without defined levels are not subject to count filtering, sometimes making them inflated. I need a way to handle this more intelligently. Perhaps they could adopt the rank of sister taxa, if those exist. Or there could be a limit on how many consecutive levels of unidentified ranks will be explored before giving up; setting this limit to 0 would provide a way to ignore taxa without assigned levels. Not perfect solutions, but they could help...
Currently, labels are plotted on a [0,1] coordinate space, whereas the other elements are plotted in the space returned by igraph layout functions. Therefore, there is no easy way right now to estimate the size of labels in igraph space, so labels do not affect the plotting window/margins. This means that if the margin is set to 0, labels can be cut off.
To fix this, I need to estimate the size of labels in igraph space. While doing this, I should also standardize the meaning of the statistic suffixes:
_u: what the user supplied
_t: transformed user input
_g: value in igraph space
_p: value in terms of proportion of graph dimension. This has to be used for plotting text.
There are a lot of dependencies that are used only rarely and could possibly be removed.
extract_taxonomy is taking forever to parse the ~20,000 records in the UNITE general FASTA release, even when not querying databases.
For some reason, the text in plots is no longer positioned correctly. Also, the default size range of elements is off.
Currently, building the vignettes requires primersearch, from the EMBOSS toolkit, to be installed. It should not.
This idea and the majority of the descriptive text is from the following blog post:
First, create a new grob class called resizingTextGrob that resizes its text automatically:
library(grid)
library(scales)
resizingTextGrob <- function(...)
{
grob(tg=textGrob(...), cl='resizingTextGrob')
}
The drawDetails method is called automatically when drawing a grob using grid.draw:
drawDetails.resizingTextGrob <- function(x, recording=TRUE)
{
grid.draw(x$tg)
}
The preDrawDetails method is automatically called before any drawing occurs:
preDrawDetails.resizingTextGrob <- function(x)
{
h <- convertHeight(unit(1, 'snpc'), 'mm', valueOnly=TRUE)
fs <- rescale(h, to=c(18, 7), from=c(120, 20))
pushViewport(viewport(gp = gpar(fontsize = fs)))
}
To clean up after drawing, the created viewport is popped:
postDrawDetails.resizingTextGrob <- function(x)
popViewport()
g <- resizingTextGrob(label='test 1')
grid.draw(g)
grid.text('test 2', y=.4)
library(ggplot2)
x = data.frame(x = 1:10, y = 1:10)
ggplot(data = x, aes(x = x, y = y)) + geom_point() + annotation_custom(g)
The option replace could be added to taxonomic_sample to allow sampling with replacement. This could be useful for bootstrapping applications.
A new function to extract the taxonomy information from sequence headers could be very useful. I imagine this would be an upper level function, potentially built upon more specific functions. The basic idea is that the user would supply a vector of sequence headers and identify the locations of bits of information that could be used to derive a taxonomy.
For example, say a particular database had a sequence header of the type:
>name_of_sequence-1234-description
where 1234 is the GenBank ID. The function would be called something like:
x <- c(">name_of_sequence-1234-description", ...)
extract_taxonomy(x, "^>.+-%genid%-.+$")
where %genid% would be a function-specific pattern indicating the identity of the relevant information. The function would then extract the GenBank ID using a modified version of the supplied regex, look up the taxonomy information, and parse it into a standardized format.
I imagine the output would consist of two parts: a vector of numerical taxon IDs, named by their common names, and something equivalent to an adjacency list, adjacency matrix, or taxon lineage list that allows a given taxon ID to be defined in the context of the entire taxonomy. The taxon IDs could be official (e.g. GenBank taxon UIDs) or arbitrary if the sequence header information had a taxonomic lineage:
>name_of_sequence:kingdom-phylum-class-order-family-genus-species:description
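A minimal sketch of how such a template might be handled, assuming the %genid% tag is simply swapped for a capture group. The extract_field function and the tag handling are illustrative, not the actual implementation:

```r
# Illustrative only: swap the %genid% tag for a capture group, then use the
# resulting regex to pull the field out of each header.
extract_field <- function(headers, template, tag = "%genid%") {
  pattern <- sub(tag, "([^-]+)", template, fixed = TRUE)  # tag -> capture group
  sub(pattern, "\\1", headers)                            # keep only the capture
}

x <- c(">name_of_sequence-1234-description", ">other_sequence-5678-description")
extract_field(x, "^>.+-%genid%-.+$")
# [1] "1234" "5678"
```

The extracted IDs could then be fed to whatever database lookup builds the standardized taxonomy.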
There is some ambiguity about whether the primer sequences are included in the amplicon and "length" fields. This should be specified. Perhaps an option should be added to choose whether primer sequences are taken into account.
ecoPrimers tries to generate metabarcoding primers from sequence databases:
http://www.grenoble.prabi.fr/trac/ecoPrimers/wiki/ecoPrimers
I often need to clean up taxon names, such as:
Inocybaceae_sp (not informative for species, should be NULL)
Rhodotorula_cycloclastica (should be "cycloclastica")
Caloplaca_sp_RVM_2012
Russula_cf_brevipes_RK8
Russula_brevipes_var_acrior
Russula_aff_brevipes_r_04085
"Crenarchaeota"
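A rough sketch of what such cleanup rules might look like; the regexes are illustrative and far from exhaustive, and the clean_species name is hypothetical:

```r
# Illustrative cleanup rules: return the species epithet, or NA when the
# name is uninformative (e.g. "_sp"). Regexes are a sketch, not exhaustive.
clean_species <- function(x) {
  x <- sub("_(cf|aff)_", "_", x)               # drop uncertainty qualifiers
  x <- sub("_(var|f)_.*$", "", x)              # drop infraspecific ranks
  x <- sub("_[A-Z0-9][A-Za-z0-9_]*$", "", x)   # drop strain/voucher codes
  epithet <- sub("^[A-Za-z]+_", "", x)         # drop the genus prefix
  ifelse(epithet == "sp" | epithet == x, NA_character_, epithet)
}

clean_species(c("Inocybaceae_sp", "Rhodotorula_cycloclastica",
                "Caloplaca_sp_RVM_2012", "Russula_cf_brevipes_RK8"))
# -> NA, "cycloclastica", NA, "brevipes"
```

The quoted-name case ("Crenarchaeota") and variety names would need additional rules.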
Add a column to the taxon data frame returned by extract_taxonomy that has the number of taxa that each taxon is a subtaxon of. This could be used as a default value for rank.
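Counting supertaxa could be sketched as a walk up a parent lookup. The count_supertaxa name and the parent-vector representation are hypothetical:

```r
# Hypothetical representation: a named vector mapping each taxon to its
# parent, with NA marking a root taxon. Counts ancestors of each taxon.
count_supertaxa <- function(parents) {
  depth_of <- function(id) {
    n <- 0L
    while (!is.na(parents[[id]])) {  # walk up until a root is reached
      id <- parents[[id]]
      n <- n + 1L
    }
    n
  }
  vapply(names(parents), depth_of, integer(1))
}

parents <- c(Fungi = NA, Russulaceae = "Fungi", Russula = "Russulaceae")
count_supertaxa(parents)
# -> Fungi = 0, Russulaceae = 1, Russula = 2
```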
Currently, the range of the statistics supplied is used to infer the range displayed. This is optimal for many kinds of data, but it makes it hard to make multiple graphs with the same color/size-to-statistic relationship, or to set logical ranges, such as a diverging scale with 0 in the middle.
All text in the plot should be added to the same data.frame and treated the same. This will remove a lot of repetitive code and make it easier to calculate margins correctly.
If a parent column were included in the taxonomy data frame, it would in effect be an adjacency list, rendering the taxonomy classification key output redundant.
Make the background of the plot transparent and pick a color scale that is color-blind friendly.
The automatically scaling text grobs are ideal for most publication applications, but they are apparently computationally intensive (for some reason I have not yet investigated) when there are hundreds or more. It would be good to have an option to use standard text grobs when a lot of text needs to be displayed.
Documentation in vignettes is a great thing, but having examples in the man pages is extremely helpful, especially when the user wants to quickly see how a certain function is used.
Currently, each tree can have titles, but not the overall graph when there are multiple trees.
If the user does not specify a layout, use the mean and standard deviation of rank depth to pick an appropriate layout.
There are a few ways I can see of handling arbitrary IDs when mixed with real IDs:
allow: Allow arbitrary IDs.
warn: Allow them, but warn the user.
error: Throw an error if they are needed.
na: Replace them, and any information derived from them, with NAs.
Add the ability to output a mixture of verified unique taxon IDs and arbitrary IDs when looking up classification names. I imagine this could be useful when some taxon names cannot be found in a database.
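The four behaviors could be sketched roughly as follows. The handle_arbitrary_ids function and its arguments are hypothetical:

```r
# Hypothetical sketch of the four proposed behaviors; `verified` flags which
# IDs were found in the reference database.
handle_arbitrary_ids <- function(ids, verified,
                                 action = c("allow", "warn", "error", "na")) {
  action <- match.arg(action)
  if (any(!verified)) {
    bad <- toString(ids[!verified])
    if (action == "error") stop("Arbitrary IDs are needed: ", bad)
    if (action == "warn")  warning("Using arbitrary IDs: ", bad)
    if (action == "na")    ids[!verified] <- NA  # also drop derived info
  }
  ids
}

handle_arbitrary_ids(c("9606", "arb_1"), c(TRUE, FALSE), action = "na")
# -> "9606", NA
```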
The taxa output could indicate the type of each ID ("verified", "arbitrary", or "unknown").
SAP uses a Bayesian approach to assign a probability that an unknown sequence belongs to a particular taxon. This could be used to estimate a reference sequence's taxonomic resolution.
http://ib.berkeley.edu/labs/slatkin/munch/StatisticalAssignmentPackage.html
taxonomic_sample is currently abstracted to the point where it can be used on any set of items with an indexed hierarchical classification. I could remove all references to taxonomy and rename it recursive_sample. The only potential problem is the rank concept.
When I look at the index for metacoder, there are a lot of functions that don't seem to be immediately useful to the user, such as resizingTextGrob(). Additionally, I notice that there are unexported functions documented, such as verify_color_range(). These create a lot of clutter in the index for metacoder and might confuse users. I have a couple of suggestions to de-clutter these:
Don't export them. You can change/add/remove internal functions to your heart's desire, but an exported function requires a version update and has the potential to break a user's workflow.
Documenting internal functions is a fantastic idea, but they should not be displayed in the index, since that becomes the table of contents for the user manual. Instead, I recommend adding @keywords internal to your roxygen directives for the unexported functions. This will still create the documentation for those who need it, but will hide it from those who don't.
There should be a workflow vignette that provides a few example workflows.
Something like:
extract_taxonomy
taxonomic_sample
Complete mitochondrial sequences, with COX1 as the barcode, would be a clean example. Maybe we are trying to evaluate a barcode for a group of insects...