hogwash's Introduction

Summary

The hogwash R package is a phylogenetically-informed, convergence-based method for performing genome-wide association studies in bacteria. In short, the user inputs a phylogenetic tree, a phenotype (either binary or continuous), and a genotype (a binary matrix) and receives an output of the genotypes that are significantly associated with the phenotype after correcting for multiple testing, requiring convergence, and accounting for the clonal structure of the population.

Install the package

install.packages("devtools")
devtools::install_github("katiesaund/hogwash")
library(hogwash)

Getting started

Please check out the wiki or vignette for a brief primer on bacterial GWAS, detailed descriptions of the algorithms, and example data with results.

The paper

To learn about using hogwash on your bacterial data please read our paper "Hogwash: three methods for genome-wide association studies in bacteria". It describes the algorithms and their performance on simulated data. The simulated data used in the paper were generated using the code in this repository.

hogwash's People

Contributors

katiesaund

hogwash's Issues

grouping genotypes

Add explicit error message when the 1st column of the grouping genotype key does not contain an entry for each row of the genotype matrix. Right now the error message is vague.

"Error in check_dimensions(lookup, exact_rows = ncol(geno), min_rows = ncol(geno), :
matrix has too few rows
Calls: hogwash ... run_binary_transition -> group_genotypes -> check_dimensions"
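
One way to make this failure explicit is to compare the key's first column against the genotype matrix before grouping. The sketch below uses a hypothetical helper, check_grouping_key(), which is not existing hogwash code. It assumes the genotype matrix has one variant per column with variant names as column names (as `exact_rows = ncol(geno)` in the traceback suggests) and that the key's first column holds those variant names.

```r
# Hypothetical helper: not part of hogwash. Assumes variants are the columns
# of `geno` and that key[, 1] lists variant names.
check_grouping_key <- function(geno, key) {
  if (is.null(colnames(geno))) {
    stop("Genotype matrix must have column names (one per variant).",
         call. = FALSE)
  }
  key_variants <- as.character(key[, 1])
  missing_variants <- setdiff(colnames(geno), key_variants)
  if (length(missing_variants) > 0) {
    # Show at most five names so the message stays readable.
    shown <- utils::head(missing_variants, 5)
    stop(
      "Grouping key is missing ", length(missing_variants),
      " genotype(s) found in the genotype matrix, e.g.: ",
      paste(shown, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Toy inputs: 2 samples x 3 variants, and a two-column key (variant, group).
geno <- matrix(c(0, 1, 1, 0, 1, 1), nrow = 2,
               dimnames = list(c("s1", "s2"), c("snp1", "snp2", "snp3")))
complete_key <- cbind(c("snp1", "snp2", "snp3"), c("geneA", "geneA", "geneB"))
incomplete_key <- complete_key[1:2, , drop = FALSE]

key_ok <- check_grouping_key(geno, complete_key)  # passes silently
msg <- tryCatch(check_grouping_key(geno, incomplete_key),
                error = function(e) conditionMessage(e))
print(msg)
```

Naming the missing variants in the message tells the user which key entries to add, which the current "matrix has too few rows" message does not.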

Grab bag todo list

  • update absolute paths to relative paths
  • remove generate dummy data script
  • run lintr and covr again
  • fully deprecate annotation
  • add an explicit requirement for at least N tips in the tree?

Decrease example time for hogwash() function

To submit to CRAN, the run time of the hogwash() example needs to be reduced:

"Each executable function in your R package should come with an example of how to use the function. However, the time it takes to run the function should not exceed 10 seconds!"

Comments from check:
✔ checking examples (22.3s)
Examples with CPU or elapsed time > 5s
           user system elapsed
hogwash  20.651  0.352  21.503

fix qpdf error during package check()

Getting this warning during check:
‘qpdf’ is needed for checks on size reduction of PDFs

Internet says to (1) install qpdf executable and (2) make sure to run on r-hub and win-builder.

Account for inputs lacking sufficient data to run ks test

run_ks_test <- function(t_index, non_t_index, phenotype_by_edges){

Deal with cases where there isn't enough data to run the test (i.e., t_index and non_t_index do not each contain at least one entry). I could remove these genotypes, report them as unusable, and continue without breaking out of the algorithm. Right now this case raises a stop error, which will disrupt many uses. I can't seem to reproduce the error when there are not enough samples to run the test but every index list has at least one entry. Hmm.
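
A guard along the following lines could skip such genotypes instead of stopping. This is a sketch only: run_ks_test_safely() is a hypothetical variant, the real body of run_ks_test() is not shown in this issue, and phenotype_by_edges is assumed to be a numeric vector indexed by edge.

```r
# Hypothetical variant of run_ks_test(); assumes phenotype_by_edges is a
# numeric vector and the two index arguments are integer positions into it.
run_ks_test_safely <- function(t_index, non_t_index, phenotype_by_edges) {
  t_values <- phenotype_by_edges[t_index]
  non_t_values <- phenotype_by_edges[non_t_index]
  # ks.test() drops NAs and then needs at least one value in each group,
  # so check the group sizes after removing NAs.
  t_values <- t_values[!is.na(t_values)]
  non_t_values <- non_t_values[!is.na(non_t_values)]
  if (length(t_values) < 1 || length(non_t_values) < 1) {
    return(list(pval = NA_real_, statistic = NA_real_, skipped = TRUE))
  }
  # suppressWarnings() hides the warning ks.test() emits when values are tied.
  result <- suppressWarnings(stats::ks.test(t_values, non_t_values))
  list(pval = result$p.value,
       statistic = unname(result$statistic),
       skipped = FALSE)
}

delta_pheno <- c(0.1, 2.5, 0.3, 1.9, 0.2, 2.2)
ok <- run_ks_test_safely(c(2, 4, 6), c(1, 3, 5), delta_pheno)
empty <- run_ks_test_safely(integer(0), c(1, 3, 5), delta_pheno)
print(ok$skipped)     # FALSE: both groups have data
print(empty$skipped)  # TRUE: no transition edges, test skipped
```

Returning a `skipped` flag rather than calling stop() would let the caller collect the skipped genotypes and report them at the end of the run.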

Fix output object names

  • Change $hit_pvals_reconstruction > $hit_pvals
  • Change $sig_pvals_reconstruction > $sig_pvals
  • Change "object" to name of test
  • rename column in $sig_pvals_reconstruction from fdr_corrected_pvals[fdr_corrected_pvals < fdr]
  • rename object$original_high_confidence_trasition_edges > $high_confidence_trasition_edges
  • add genotype names to list $original_high_confidence_trasition_edges
  • rename object$num_high_confidence_trasition_edges
  • synchronous .rda didn't get saved -- probably doesn't have synchronous in name and so got written over by the convergence test .rda files

remove temp object

There are several data objects loaded into the package, but somehow a tree got saved twice - once as hogwash::tree and a second time as hogwash::temp. Delete temp.

Building package error message

When running devtools::install() or checking the package I get the following error message:

   Warning in utils::tar(filepath, pkgname, compression = "gzip", compression_level = 9L,  :
     storing paths of more than 100 bytes is not portable:
     ‘hogwash/hogwash/dummy_data/discrete_phenotype_nongrouped_genotype/temp_results/phyc_dummy_group_snps_into_genes_summary_and_sig_hit_results.pdf’

This warning doesn't make sense: those files were deleted days ago and those directories no longer exist.

Need to find source of this error and resolve issue.
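
To check whether any over-long paths still exist somewhere in the source tree, a small scan like the one below may help. find_long_paths() is a hypothetical helper, not part of hogwash or devtools. It assumes the stored path is the package directory name followed by the relative file path, which matches the shape of the path in the warning above.

```r
# Hypothetical helper: list files whose stored path would exceed `limit` bytes.
find_long_paths <- function(pkg_dir, limit = 100) {
  files <- list.files(pkg_dir, recursive = TRUE, all.files = TRUE)
  # Assumption: tar() stores each path prefixed with the package directory name.
  stored <- file.path(basename(pkg_dir), files)
  stored[nchar(stored, type = "bytes") > limit]
}

# Demonstration on a throwaway directory tree.
demo_pkg <- file.path(tempdir(), "demo_pkg")
long_dir <- file.path(demo_pkg, strrep("a", 60), strrep("b", 60))
dir.create(long_dir, recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(long_dir, "results.pdf")))
invisible(file.create(file.path(demo_pkg, "DESCRIPTION")))

long_paths <- find_long_paths(demo_pkg)
print(long_paths)
```

The path in the warning starts with hogwash/hogwash/, so it may also be worth running the scan from the parent directory in case a nested copy of the package is being picked up.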

Bootstrap support value

IQtree - bootstrap should be 95+
RAxML bootstrap should be 87.5+
PhyC used 70
Add these differences to the documentation to help users.

Add citation

Add CITATION file once methods paper is published.

Speed up ancestral reconstruction

pick_recon_model <- function(mat, tr, disc_cont, num, recon_method){

Ancestral reconstruction is being run three times: once with ER, once with ARD, and then again with whichever model is best. Have pick_recon_model() choose the model and also return the results of the best reconstruction, rather than running the reconstruction a third time.
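
The general pattern is to fit each candidate once, keep the fitted objects, and return the winner directly. The sketch below illustrates that pattern only. pick_best_fit() and its arguments are hypothetical, and lm() stands in for the reconstruction call so that the block runs without phylogenetic packages; in hogwash the candidates would be the ER and ARD fits.

```r
# Hypothetical sketch: `candidates` is a named vector or list of model
# specifications and `fit_fun` fits one of them. Each candidate is fitted once.
pick_best_fit <- function(candidates, fit_fun, score_fun = stats::AIC) {
  fits <- lapply(candidates, fit_fun)            # names are kept by lapply()
  scores <- vapply(fits, score_fun, numeric(1))  # lower AIC is better
  best <- names(scores)[which.min(scores)]
  list(best_model = best, best_fit = fits[[best]], scores = scores)
}

# Stand-in data and models (NOT ancestral reconstruction).
set.seed(1)
dat <- data.frame(x = 1:20)
dat$y <- 2 * dat$x + rnorm(20)
formulas <- c(intercept_only = "y ~ 1", linear = "y ~ x")

n_fits <- 0
fit_one <- function(f) {
  n_fits <<- n_fits + 1  # count how many times a model is fitted
  stats::lm(stats::as.formula(f), data = dat)
}

picked <- pick_best_fit(formulas, fit_one)
print(picked$best_model)
print(n_fits)  # one fit per candidate, no refit of the winner
```

Because the best fit is returned alongside its name, the caller can use `picked$best_fit` directly rather than fitting the chosen model again.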

standardize output file names

  • Remove '_summary_and_sig_hit_results' from pdf names for continuous test.
  • Add 'grouped' to file output when grouping genotypes

relative paths in test

Currently getting around paths in tests using ../../ which may be too hacky. Return to this and determine optimal relative path choice for tests later.

Error: package or namespace load failed for ‘magick’ in dyn.load(file, DLLpath = DLLpath, ...):

When trying to install the hogwash package we are experiencing difficulties that cause the installation to fail. We are trying to install to our directory /home/billy and are using version 3.5 of R. To go through the install we have done the following steps:
module load gcc (to load gcc version: 6.2.0)
install.packages("devtools")
devtools::install_github("katiesaund/hogwash")

The output that we are getting results in the following:

** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
  converting help for package ‘magick’
    finding HTML links ... done
    analysis                                html  
    animation                            html  
    as_EBImage                         html  
    attributes                             html  
    autoviewer                          html  
    color                                   html  
    composite                          html  
    config                                 html  
    device                                 html  
    edges                                  html  
    editing                               html  
Rd warning: /tmp/RtmpXdBSNM/R.INSTALL84696915cea1/magick/man/editing.Rd:93: file link ‘geom_raster’ in package ‘ggplot2’ does not exist and so has been treated as a topic
    effects                              html  
    fx                                      html  
    geometry                         html  
    image_ggplot                  html  
Rd warning: /tmp/RtmpXdBSNM/R.INSTALL84696915cea1/magick/man/image_ggplot.Rd:14: file link ‘geom_raster’ in package ‘ggplot2’ does not exist and so has been treated as a topic
    magick                             html  
    morphology                     html  
    ocr                                   html  
    options                            html  
    painting                           html  
    reexports                         html  
Rd warning: /tmp/RtmpXdBSNM/R.INSTALL84696915cea1/magick/man/reexports.Rd:14: file link ‘%>%’ in package ‘magrittr’ does not exist and so has been treated as a topic
    segmentation                  html  
    thresholding                    html  
    transform                         html  
    video                                html  
    wizard                              html  
** building package indices
** installing vignettes
** testing if installed package can be loaded
Error: package or namespace load failed for ‘magick’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/billy/R/3.5/magick/libs/magick.so':
  /home/billy/R/3.5/magick/libs/magick.so: undefined symbol: _ZN6Magick5Image5writeEPNS_4BlobERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEm
Error: loading failed
Execution halted
ERROR: loading failed

* removing ‘/home/billy/R/3.5/magick’
Error: Failed to install 'hogwash' from GitHub:
  (converted from warning) installation of package ‘magick’ had non-zero exit status

Check() issues

Getting a warning for the data files, even though I do have some documentation:

W checking for missing documentation entries ...
Undocumented code objects:
‘antibiotic_resistance’ ‘growth’ ‘snp_gene_key’ ‘snp_genotype’ ‘tree’
Undocumented data sets:
‘antibiotic_resistance’ ‘growth’ ‘snp_gene_key’ ‘snp_genotype’ ‘tree’
All user-level objects in a package should have documentation entries.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.

Fix documentation so check passes without any warnings.
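
If the package uses roxygen2, the usual pattern for documenting a data set is a roxygen block placed above the data set's name written as a string, followed by re-running devtools::document(). The fragment below is a sketch: the file name R/data.R is an assumption, and the titles and descriptions are placeholders to be filled in.

```r
# R/data.R (hypothetical file name)

#' Example phylogenetic tree
#'
#' Placeholder description: say what the tree represents and how it was built.
#'
#' @format Placeholder: give the object class and number of tips.
"tree"

#' Example continuous phenotype
#'
#' Placeholder description: say what was measured and in which units.
#'
#' @format Placeholder: give the object class and dimensions.
"growth"
```

The same pattern would apply to ‘antibiotic_resistance’, ‘snp_gene_key’, and ‘snp_genotype’. If blocks like these already exist, it may be that the generated .Rd files are out of date or are excluded from the build.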

Check issues

Issues when checking the package:

calculate_genotype_significance: no visible global function definition
     for ‘median’
   check_if_phenotype_normal: no visible global function definition for
     ‘shapiro.test’
   create_test_data: no visible global function definition for ‘rbinom’
   discrete_plot_orig: no visible global function definition for ‘pdf’
   discrete_plot_orig: no visible global function definition for ‘par’
   discrete_plot_orig: no visible global function definition for ‘hist’
   discrete_plot_orig: no visible global function definition for ‘abline’
   discrete_plot_orig: no visible global function definition for ‘dev.off’
   discrete_plot_trans: no visible global function definition for ‘pdf’
   discrete_plot_trans: no visible global function definition for ‘par’
   discrete_plot_trans: no visible global function definition for ‘hist’
   discrete_plot_trans: no visible global function definition for ‘abline’
   discrete_plot_trans: no visible global function definition for
     ‘dev.off’
   get_sig_hits_while_correcting_for_multiple_testing: no visible global
     function definition for ‘p.adjust’
   histogram_abs_high_confidence_delta_pheno_highlight_transition_edges:
     no visible global function definition for ‘par’
   histogram_abs_high_confidence_delta_pheno_highlight_transition_edges:
     no visible global function definition for ‘hist’
   histogram_all_delta_pheno_overlaid_with_high_conf_delta_pheno: no
     visible global function definition for ‘hist’
   histogram_all_delta_pheno_overlaid_with_high_conf_delta_pheno: no
     visible global function definition for ‘rgb’
   histogram_raw_high_confidence_delta_pheno_highlight_transition_edges:
     no visible global function definition for ‘hist’
   make_manhattan_plot: no visible global function definition for ‘abline’
   make_manhattan_plot: no visible global function definition for ‘text’
   pick_recon_model: no visible global function definition for ‘pchisq’
   pick_recon_model: no visible global function definition for ‘AIC’
   plot_continuous_phenotype: no visible global function definition for
     ‘plot’
   plot_significant_hits: no visible global function definition for ‘pdf’
   plot_significant_hits: no visible global function definition for ‘par’
   plot_significant_hits: no visible global function definition for ‘hist’
   plot_significant_hits: no visible global function definition for
     ‘abline’
   plot_significant_hits: no visible global function definition for
     ‘dev.off’
   plot_tree_with_colored_edges: no visible global function definition for
     ‘par’
   plot_tree_with_colored_edges: no visible global function definition for
     ‘plot’
   plot_tree_with_colored_edges: no visible global function definition for
     ‘legend’
   read_in_tsv_matrix: no visible global function definition for
     ‘read.table’
   run_binary_original: no visible global function definition for
     ‘capture.output’
   run_binary_original: no visible global function definition for
     ‘sessionInfo’
   run_binary_transition: no visible global function definition for
     ‘capture.output’
   run_binary_transition: no visible global function definition for
     ‘sessionInfo’
   run_continuous: no visible global function definition for
     ‘capture.output’
   run_continuous: no visible global function definition for ‘sessionInfo’
   run_ks_test: no visible global function definition for ‘ks.test’
   Undefined global functions or variables:
     AIC abline capture.output dev.off hist ks.test legend median p.adjust
     par pchisq pdf plot rbinom read.table rgb sessionInfo shapiro.test
     text
   Consider adding
     importFrom("grDevices", "dev.off", "pdf", "rgb")
     importFrom("graphics", "abline", "hist", "legend", "par", "plot",
                "text")
     importFrom("stats", "AIC", "ks.test", "median", "p.adjust", "pchisq",
                "rbinom", "shapiro.test")
     importFrom("utils", "capture.output", "read.table", "sessionInfo")
   to your NAMESPACE file.
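
With roxygen2, the suggested imports can be declared as tags rather than by editing NAMESPACE by hand. The fragment below copies the function lists from the check output above; the file name is an assumption.

```r
# R/hogwash-package.R (hypothetical file name)

#' @importFrom grDevices dev.off pdf rgb
#' @importFrom graphics abline hist legend par plot text
#' @importFrom stats AIC ks.test median p.adjust pchisq rbinom shapiro.test
#' @importFrom utils capture.output read.table sessionInfo
NULL
```

After running devtools::document(), the corresponding importFrom() lines should appear in NAMESPACE. These packages typically also need to be listed under Imports in DESCRIPTION.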

Documentation

  • fully switch to Roxygen2
  • documentation for each function
  • documentation for loaded variables

Cleaning up output plots

Manhattan Plot

  • Add jitter to manhattan plots
  • Add key to manhattan plots (red line == FDR set by user)
  • Change manhattan plot y-axis label to -ln(P-value)
  • Remove x-axis numbering
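
The four changes above can be sketched in base graphics as follows. This is an illustration under assumptions, not the current make_manhattan_plot(): manhattan_sketch() is a hypothetical name, the jitter is applied along the x-axis, and the red line is drawn at -ln(fdr).

```r
# Hypothetical sketch of the requested Manhattan plot changes.
manhattan_sketch <- function(pvals, fdr, jitter_amount = 0.3) {
  neg_ln_p <- -log(pvals)  # natural log, matching the -ln(P-value) label
  x <- jitter(seq_along(pvals), amount = jitter_amount)
  plot(x, neg_ln_p,
       xaxt = "n",                 # remove x-axis numbering
       xlab = "Genotype",
       ylab = "-ln(P-value)",
       pch = 19)
  abline(h = -log(fdr), col = "red")
  legend("topright", legend = "FDR threshold set by user",
         col = "red", lty = 1, bty = "n")  # key for the red line
  invisible(neg_ln_p)
}

set.seed(42)
pvals <- c(0.8, 0.5, 0.04, 0.3, 0.001)
grDevices::pdf(NULL)  # draw to a null device so no file is written
neg_ln_p <- manhattan_sketch(pvals, fdr = 0.05)
invisible(grDevices::dev.off())
print(sum(neg_ln_p > -log(0.05)))  # number of points above the red line
```

Jittering only the x positions keeps the plotted -ln(P-value) of each genotype exact while separating overlapping points.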

Summary (First) Heatmap

  • keep only one heatmap (either cluster or don't, but not both)
  • fix heatmap width on continuous plots (currently working on synch and phyc)
  • Need a key for genotype presence/absence on heatmaps
  • Change "pheno_transitions" --> "Phenotype Transition Edges"
  • Change NA phenotype transition edge annotations from -1 to NA (this should help with color issue)
  • locus --> "Locus Significance"
  • "not_sig" --> "Not Significant"
  • "sig" --> "Significant"
  • Make sure Significant and Not Significant always both print in the legend even if there is only 1 in the data
  • PhyC/ Sync: "fdr_corrected_pvals" --> "-ln(FDR corrected P-value)"
  • Continuous: "-ln(p-val)" --> "-ln(FDR corrected P-value)"
  • Remove heatmap that clusters from continuous (this is already absent in sync/phyc)
  • "SNPs in gene" --> "Variants in Group"
  • More descriptive heatmap titles
  • Better genotype legend
  • Include # of grouped loci per gene in continuous & phyc heatmap (already done in synch)
  • Check that Sync reported p-value on heatmap is -ln()

Trees

  • limit number of characters that can print to above genotype transition tree in binary plots for each significant hit
  • Move phenotype tree to its own page

Histograms

  • Change "pval=###" to "-ln(FDR corrected P-value) = ###"
  • Continuous: fix spacing of "# non-trans edges = "
  • Phyc & Synchronous: Add P-value rank.
  • P-values -> round all of them and present in scientific notation
  • Continuous: green delta phenotype plots --> make a color key rather than describe in the title
  • Increase font size on all histogram axis titles
  • Overlap/KS plots: add legend with red/grey -- don't describe in the figure title

Individual Heatmaps for significant loci

  • Remove

Heatmap legends

Make legend names more informative: the phenotype appears only in the annotation, not in the body of the heatmap, and "low confidence" can refer to low confidence in the tree, the phenotype, or the genotype.

Create fit_phenotype_model() function

Write an exported function, fit_phenotype_model(phenotype, tree), that will report the phenotype phylogenetic signal and if it fits BM/WN to the user.

Wiki

Update images with finalized plot types

Testing

  • Increase unit test coverage of package
  • convert all unit tests to be using the loaded package data rather than referencing the .tsv.
  • Once I've removed all references to .tsv, delete those from the data folder
  • Can I run the tests without creating outputs? Would that solve some check/building issues or is it not important anymore?
