nci-cgr / plco-analysis

Primary workflow for the PLCO "Atlas" project

Languages: Makefile 10.06%, Shell 77.81%, M4 0.60%, C++ 4.41%, Metal 3.31%, R 2.61%, Python 1.20%

plco-analysis's Issues

replace git-lfs hosting with something, anything, elsewhere

The resources under annotations/ are large and currently tracked with git-lfs, which is fine for the moment; but I'd really like to just put the files somewhere that could just be downloaded via wget in a pipeline somewhere. Please, Future Person, save me from myself. (I contacted IT and they said they couldn't help me, but would try to find a solution in the future... good luck with that!)

cleanup of incomplete pipelines

I drafted a variety of additional pipelines before we fixed the scope of the project, and those pipelines are in various states of disrepair after not being included in the primary dev process. need to flag those problematic ones and remove them.

error in aggregating results files in globus.Makefile

when phenotypes share a prefix, their results files are grouped together,
e.g. j_lung_cancer and j_lung_cancer_current_smokers got merged together in the final file.
This either needs to be fixed in the Makefile, or the phenotypes need distinct prefixes, e.g. rename j_lung_cancer_current_smokers to j_current_smokers_lung_cancer.
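
A minimal sketch of the likely failure mode, assuming the aggregation selects files by phenotype-name prefix (the actual rule in globus.Makefile is not shown here). The file names and the `.` separator are hypothetical; the point is only that requiring a delimiter after the phenotype name keeps the two phenotypes apart.

```python
from collections import defaultdict


def group_by_prefix(phenotypes, filenames):
    """Prefix grouping: a file joins every phenotype whose name it starts with."""
    groups = defaultdict(list)
    for name in filenames:
        for pheno in phenotypes:
            if name.startswith(pheno):
                groups[pheno].append(name)
    return dict(groups)


def group_by_exact_name(phenotypes, filenames, sep="."):
    """Delimited grouping: the phenotype name must be followed by the separator."""
    groups = defaultdict(list)
    for name in filenames:
        for pheno in phenotypes:
            if name.startswith(pheno + sep):
                groups[pheno].append(name)
    return dict(groups)


# Hypothetical inputs, for illustration only.
phenotypes = ["j_lung_cancer", "j_lung_cancer_current_smokers"]
filenames = [
    "j_lung_cancer.chr1.tsv",
    "j_lung_cancer_current_smokers.chr1.tsv",
]

merged = group_by_prefix(phenotypes, filenames)
separate = group_by_exact_name(phenotypes, filenames)
```

With prefix grouping, `merged["j_lung_cancer"]` picks up both files; with the delimited match, each phenotype keeps only its own file. The same idea would apply to a shell glob: `j_lung_cancer.*` instead of `j_lung_cancer*`.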

custom max number of principal components estimated/used

I've started hearing rumors that the number of principal components used in association may be changed to some other number greater than the current 10. a nice gift to bequeath to my successor would be a configurable parameter for this somewhere in Makefile.config, and support for that increased ceiling in the model matrix constructor, such that they don't have to deal with an immediate extension buried within makefiles.
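
A rough sketch of what a configurable PC count could look like, assuming a make-style `NUM_PCS := 15` line in the config and covariate columns named `PC1`, `PC2`, and so on. Both the variable name and the column naming are hypothetical, not taken from Makefile.config or the model matrix constructor.

```python
def read_setting(config_text, key, default):
    """Return the value assigned to `key` in make-style text, or `default`."""
    for line in config_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        name, sep, value = line.partition("=")
        # Accept both `KEY = value` and `KEY := value`.
        if sep and name.rstrip(":").strip() == key:
            return value.strip()
    return default


def pc_covariates(n_pcs, max_available=20):
    """Build the PC covariate column names, refusing counts that were never estimated."""
    if not 1 <= n_pcs <= max_available:
        raise ValueError(f"n_pcs must be between 1 and {max_available}, got {n_pcs}")
    return [f"PC{i}" for i in range(1, n_pcs + 1)]


# Hypothetical config fragment; falls back to the current 10 if the key is absent.
config_text = "NUM_PCS := 15  # number of principal components\n"
n_pcs = int(read_setting(config_text, "NUM_PCS", "10"))
covariates = pc_covariates(n_pcs)
```

Keeping the ceiling (`max_available`) separate from the number actually used means the estimation step and the association step can be changed independently.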

Makefile.config extension handling assumes python=3

which is bad, when ldsc/ldscores pipelines assume python2

categorical trait module expansion

existing method is a hack with expected anticonservative bias. bakeoff different scalable regression methods

add case-inclusion and case-exclusion options

similar to control-inclusion and control-exclusion, this is needed if we want to, say, examine lung cancer among never smokers. Another option is to add cohort inclusion/exclusion, but having both case and control inclusion/exclusion seems more general.
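
One way to picture the requested behavior, as a sketch only: the rule format (variable name mapped to a set of values), the `is_case` field, and the function names are all assumptions for illustration, not the pipeline's existing control-inclusion/control-exclusion syntax.

```python
def passes(subject, inclusion, exclusion):
    """True if the subject satisfies every inclusion rule and no exclusion rule."""
    for var, allowed in inclusion.items():
        if subject.get(var) not in allowed:
            return False
    for var, banned in exclusion.items():
        if subject.get(var) in banned:
            return False
    return True


def filter_subjects(subjects, case_inclusion=None, case_exclusion=None,
                    control_inclusion=None, control_exclusion=None):
    """Apply case rules to cases and control rules to controls, independently."""
    kept = []
    for subject in subjects:
        if subject["is_case"]:
            ok = passes(subject, case_inclusion or {}, case_exclusion or {})
        else:
            ok = passes(subject, control_inclusion or {}, control_exclusion or {})
        if ok:
            kept.append(subject)
    return kept


# Hypothetical subjects: lung cancer cases restricted to never smokers.
subjects = [
    {"id": "A", "is_case": True, "smoking": "never"},
    {"id": "B", "is_case": True, "smoking": "current"},
    {"id": "C", "is_case": False, "smoking": "current"},
]
never_smoker_cases = filter_subjects(subjects, case_inclusion={"smoking": {"never"}})
```

Here only the cases are filtered; restricting the controls to never smokers as well would mean passing the same rule as `control_inclusion`, which is why separate case and control options are more general than a single cohort-level filter.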

patch to installable version

the first v1.0.0 release is literally the copy that was used for phenotype tranche 1. that does no one any good, as it has a bunch of baked in nonsense and no conda environment specification. this is obviously an urgently needed fix.

update README with milestones

the milestone tracker on the README is ancient, fix that

modernize/update the checks

so the pipelines (and configuration directory itself) have testing subdirectories that contain simple test scripts for integration with TAP/automake. you can generally run them from top-level by running make config-check (or generally make {rulename}-check). they are not nearly, not even close, as expansive as they should be; furthermore, the tests for primary analysis are terribly slow since they are not set up in a way that allows parallelization.

please, Future Person, do a better job of this than the cobbled-together nonsense that's in there now!

add ldsc workflow that does not depend on the file names/paths in the workflow

this is so that we can run ldsc after extra steps

handling ordinal target variable in the pipeline fails when the sample size of one of the comparisons is below the threshold

when the sample size of one of the comparisons is below the threshold, it does not produce the tracking files that inform the next steps, and then the pipeline breaks when running meta:

make[1]: *** No rule to make target '.../?.SAIGE.categorical-combined.tsv.success', needed by '/.../?.SAIGE.final-ids.tsv.success'. Stop.
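
A possible direction, sketched under assumptions: every comparison writes a tracking file whether it ran or was skipped, so the downstream step can tell "skipped for low sample size" apart from "never ran". The `.skipped` suffix, the comparison names, and both function names are hypothetical; only the `.success` suffix appears in the error above.

```python
from pathlib import Path


def write_tracking_files(outdir, sample_sizes, min_n):
    """Write one tracking file per comparison: `.success` if run, `.skipped` otherwise."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    status = {}
    for comparison, n in sample_sizes.items():
        state = "success" if n >= min_n else "skipped"
        (outdir / f"{comparison}.{state}").write_text(f"n={n}\n")
        status[comparison] = state
    return status


def comparisons_for_meta(outdir):
    """List only the comparisons that actually produced results."""
    return sorted(p.stem for p in Path(outdir).glob("*.success"))
```

The combining step would then depend on the set returned by `comparisons_for_meta` rather than on a fixed list of expected `.success` targets, which is roughly what make is complaining about.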

tracking inspection

recently found an entire pipeline that hadn't been updated to the $(call log_handler) or $(call sub_handler) convention. need to go through all pipelines in all directories and ensure that everything that's in use has actually been updated

readthedocs documentation

now that this is publicly available, host rst documentation for readthedocs

clean up Makefile.config

Makefile.config is one of the oldest pieces of the pipeline. it's being used, no question; but it has an accumulated pile of garbage in it. of the various things to go through and clean up before departure, this is probably the highest priority. notably, it still has enumerated extension definitions in it when those are now primarily handled in a yaml file along with https://github.com/NCI-CGR/initialize_output_directories, so it's actually quite bad.

project agnostic nomenclature/inputs

this project was designed with the intention that it be used for the PLCO "atlas" project. later developments have indicated that there may be the need for this code to be used for other projects. some of the "PLCO" nomenclature is baked into the pipelines (this was initially targeted for removal during an early milestone for the project but was removed from the project plan after those milestones were scrapped by superiors).

chrX support

Atlas investigators have requested chrX support in the pipeline. This is not too difficult but requires pulling in imputed files generated by someone else. Each downstream tool handles chrX differently, so support needs to be cooked into each individual association pipeline.

fix config check

make check-config is broken due to separately added support for deeper-level yaml structures in config/*.yaml. especially if outside people are going to be using this pipeline in the near future, they're going to need to rely heavily on that check code to catch their doubtless many yaml formatting errors.
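
A sketch of how a check could cope with nested structures, assuming the yaml has already been loaded into plain dicts and lists. The schema format and the keys (`chips`, `name`, `ancestries`) are invented for illustration and do not reflect the real config/*.yaml layout.

```python
def check_config(config, schema, path=""):
    """Recursively compare a loaded config against a schema; return error strings."""
    errors = []
    for key, expected in schema.items():
        where = f"{path}.{key}" if path else key
        if key not in config:
            errors.append(f"{where}: missing")
        elif isinstance(expected, dict):
            # Nested mapping: descend and keep the dotted path for error messages.
            if isinstance(config[key], dict):
                errors.extend(check_config(config[key], expected, where))
            else:
                errors.append(f"{where}: expected a mapping")
        elif not isinstance(config[key], expected):
            errors.append(f"{where}: expected {expected.__name__}")
    return errors


# Hypothetical schema and a config with one formatting mistake.
schema = {"chips": {"name": str, "ancestries": list}}
config = {"chips": {"name": "example_chip", "ancestries": "European"}}
problems = check_config(config, schema)
```

Reporting the dotted path (`chips.ancestries`) rather than just the key name is what makes such a check usable once the yaml grows more than one level deep.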
