adamtaranto / frisk

Screen genomic scaffolds for regions of unusual k-mer composition.

Home Page: http://adamtaranto.github.io/frisk/

License: GNU General Public License v3.0

Python 100.00%

frisk's Issues

Add Jensen-Shannon Distance as an alternative Distance measure.

Add a Jensen-Shannon distance function as an alternative to the Kullback-Leibler distance.
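A minimal sketch of what such a function could look like, assuming the inputs are two k-mer frequency vectors over the same k-mer ordering. The function names here are hypothetical and this is not frisk's existing code; it uses plain KL on a single vector rather than any IVOM weighting.

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p == 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (base 2, so in [0, 1])."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Normalise so raw counts can be passed in as well as frequencies.
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture; m > 0 wherever p > 0 or q > 0
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return float(np.sqrt(max(jsd, 0.0)))
```

Unlike KL, this is symmetric and stays finite when a k-mer is absent from one of the two distributions, so no pseudocounts are needed.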

Add Chrm Painting for HMM states

Add an extra chromosome painting with the 2 HMM states as feature tracks.

IVOM/KLI debugging

I'll attach an example fasta file here. Perhaps do 1<= k <= 4 as that will fit nicely on a screen.

pybedtools is temperamental AF - Preprocess host GFF annotations

pybedtools fails when making a BED object from GFF3 records that have differing numbers of columns or metadata fields. It is also sensitive to line endings.

Find a package, or write a function, that reads GFF into pandas and then converts it to BED.
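One possible shape for the "write a function" route, as a rough sketch only. The function name, the choice of BED6 columns, and padding short records with "." are all assumptions, not anything frisk currently does.

```python
import io
import pandas as pd

GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]

def gff_to_bed(handle):
    """Read GFF3 text from an open handle and return a BED6-style DataFrame."""
    rows = []
    for line in handle:
        # Strip both "\r" and "\n" so Windows line endings do not leak into fields.
        line = line.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence section follows; no more features
        if not line or line.startswith("#"):
            continue  # blank lines, comments and directives
        fields = line.split("\t")
        # Pad or trim to exactly nine columns so ragged records still load.
        rows.append((fields + ["."] * 9)[:9])
    gff = pd.DataFrame(rows, columns=GFF_COLUMNS)
    return pd.DataFrame({
        "chrom": gff["seqid"],
        # GFF is 1-based inclusive; BED is 0-based half-open.
        "start": gff["start"].astype(int) - 1,
        "end": gff["end"].astype(int),
        "name": gff["type"],
        "score": gff["score"],
        "strand": gff["strand"],
    })
```

The resulting frame can be written out with `to_csv(sep="\t", header=False, index=False)` to get a plain BED file with a consistent column count.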

Licensing: GPLv3?

We currently use GPLv2, which is an old version of the standard strong copyleft license GPL. We should probably use GPLv3+. I can update it if you want.

pip installed frisk fails to run on mac

Tested pip install on Eli's mac. Not sure if this issue stems from the blank frisk version.

$ pip install cython numpy scipy
$ pip install frisk
$ frisk -h
Traceback (most recent call last):
  File "/usr/local/bin/frisk", line 9, in <module>
    load_entry_point('frisk==0-unknown', 'console_scripts', 'frisk')()
  File "/usr/local/lib/python2.7/site-packages/pkg_resources.py", line 339, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/local/lib/python2.7/site-packages/pkg_resources.py", line 2470, in load_entry_point
    return ep.load()
  File "/usr/local/lib/python2.7/site-packages/pkg_resources.py", line 2184, in load
    ['__name__'])
  File "/usr/local/lib/python2.7/site-packages/frisk/__init__.py", line 19, in <module>
    from hmmlearn import hmm
  File "/usr/local/lib/python2.7/site-packages/hmmlearn/hmm.py", line 15, in <module>
    from sklearn.utils import check_random_state
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/__init__.py", line 16, in <module>
    from .class_weight import compute_class_weight, compute_sample_weight
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/class_weight.py", line 7, in <module>
    from ..utils.fixes import in1d
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/fixes.py", line 316, in <module>
    from ._scipy_sparse_lsqr_backport import lsqr as sparse_lsqr
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/_scipy_sparse_lsqr_backport.py", line 58, in <module>
    from scipy.sparse.linalg.interface import aslinearoperator
  File "/Library/Python/2.7/site-packages/scipy/sparse/linalg/__init__.py", line 109, in <module>
    from .isolve import *
  File "/Library/Python/2.7/site-packages/scipy/sparse/linalg/isolve/__init__.py", line 6, in <module>
    from .iterative import *
  File "/Library/Python/2.7/site-packages/scipy/sparse/linalg/isolve/iterative.py", line 7, in <module>
    from . import _iterative
ImportError: dlopen(/Library/Python/2.7/site-packages/scipy/sparse/linalg/isolve/_iterative.so, 2): Library not loaded: /usr/local/lib/gcc/x86_64-apple-darwin12.5.0/4.9.1/libgfortran.3.dylib
  Referenced from: /Library/Python/2.7/site-packages/scipy/sparse/linalg/isolve/_iterative.so
  Reason: image not found

Version tag unknown

Frisk version is not rendering correctly.

Example:
pip install --upgrade frisk
frisk --version
"frisk --0+unknown"

Same if running local version with python -m frisk

Versioneer fail

Have deleted the files relating to Versioneer and commented out the relevant lines in frisk.py.
Need to set it up properly.

Finding optimal K

Adapt class for finding f(K) stat from datasciencelab.

Need to enable input of actual Projection coordinates (X), which are currently randomly generated.

Also add support for projection coordinates with up to three dimensions. Currently only two are supported.

Documentation

Needs documentation. Should include:
What is Frisk?
How to cite
How it works?
Installing dependencies
Examples / Use cases
Interpreting output

PEP440 pypi error

python setup.py sdist upload -r pypi
Upload failed (400): Invalid version, cannot use PEP 440 local versions on PyPI.
error: Upload failed (400): Invalid version, cannot use PEP 440 local versions on PyPI.

Known bug: Matplotlib axes.prop_cycler warning

Will resolve on next release of Seaborn:

mwaskom/seaborn#739

Add Self-genome Masking

Given repeat-masked genome, learn k-mer abundance only from unmasked regions.

For use when training 'self' on genomes rich in non-self sequence, i.e. fungal genomes with high RIP/TE abundance.
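A rough sketch of the counting step, assuming masked bases are either soft-masked (lowercase) or hard-masked (N). The function name is hypothetical. K-mers that overlap any masked base are skipped entirely.

```python
from collections import Counter

def count_unmasked_kmers(seq, k):
    """Count k-mers that lie entirely within unmasked (uppercase ACGT) sequence."""
    counts = Counter()
    run_start = 0  # index where the current run of unmasked bases began
    for i, base in enumerate(seq):
        if base not in "ACGT":
            # Soft-masked (lowercase) or hard-masked (N) base: restart the run.
            run_start = i + 1
            continue
        if i - run_start + 1 >= k:
            # The k-mer ending at position i sits fully inside the unmasked run.
            counts[seq[i - k + 1:i + 1]] += 1
    return counts
```

Restarting the run at each masked base means no k-mer ever spans a masked/unmasked boundary, which keeps repeat edges out of the 'self' model.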

Driving Kmer reports

Report which k-mers drive divergence from the reference population in windows/features with extreme KLI scores.

Helper script: Given a sequence (string or fasta) and a Self-kmer pickle file, report observed k-mers ranked by KLI. If given a multi-fasta of anomalous sequences, return the mean and variance of KLI for each observed k-mer.
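A sketch of the single-sequence ranking, under several assumptions: the self model is passed as a plain dict of k-mer frequencies (the real pickle layout is not described here), a single fixed k is used instead of IVOM weighting, and unseen k-mers get a hypothetical pseudo-frequency. Each k-mer's score is its own term in the KL sum.

```python
import math
from collections import Counter

def rank_driving_kmers(seq, self_freqs, k, pseudo=1e-6):
    """Rank k-mers observed in seq by their contribution to KL(window || self)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    scores = {}
    for kmer, n in counts.items():
        p = n / total                      # observed frequency in this sequence
        q = self_freqs.get(kmer, pseudo)   # assumed pseudo-frequency if unseen in self
        scores[kmer] = p * math.log2(p / q)
    # Largest positive contributions first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For the multi-fasta case, the same function could be run per record and the per-k-mer scores collected before taking the mean and variance.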

Include Class label in GFF output

When --runProjection and --cluster are set, append Cluster allocation to anomaly gff3 metadata.
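A small sketch of the append step, assuming each anomaly is already a nine-column GFF3 line. The function name and the `cluster` attribute key are hypothetical; a lowercase key is used because GFF3 reserves attribute tags that start with an uppercase letter.

```python
def append_cluster_attribute(gff_line, cluster_id, key="cluster"):
    """Return the GFF3 line with key=cluster_id appended to column 9."""
    fields = gff_line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    tag = "{}={}".format(key, cluster_id)
    attrs = fields[8]
    if attrs in ("", "."):
        fields[8] = tag  # no existing attributes
    else:
        fields[8] = attrs.rstrip(";") + ";" + tag
    return "\t".join(fields)
```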

Debian package

I can try and package frisk for debian (which will eventually end up in Ubuntu) once it is stable. This is to remind me to do so.

Xantho test result

frisk.pdf

 $ time python tst.py Xanthomonas_oryzae_pv_oryzae_pxo99a.GCA_000019585.2.29.dna.genome.fa
calculated genome IVOM
done 100 windows
done 200 windows
done 300 windows
done 400 windows
done 500 windows
done 600 windows
done 700 windows
done 800 windows
done 900 windows
done 1000 windows
done 1100 windows
done 1200 windows
done 1300 windows
done 1400 windows
done 1500 windows
done 1600 windows
done 1700 windows
done 1800 windows
done 1900 windows
done 2000 windows

%TIME: [usr:1021.76 sys:21.43 301% wall:5:46.51 rss:313340]

Pybedtools update fucks float to BED conversion.

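from operator import itemgetter

import numpy as np
import pybedtools

def thresholdList(intervalList, threshold, args, threshCol=3, merge=True):
    # Keep windows whose log10(KLI) falls on the requested side of the threshold.
    if args.findSelf:
        tItems = [t for t in intervalList if np.log10(t[threshCol]) <= threshold]
    else:
        tItems = [t for t in intervalList if np.log10(t[threshCol]) >= threshold]
    # Sort by scaffold, start, end before building the BedTool.
    sItems = sorted(tItems, key=itemgetter(0, 1, 2))
    anomaliesBED = pybedtools.BedTool(sItems)  # Fails here! Cannot deal with float (KLI) at index 3.
    if merge:
        # Merge windows within args.mergeDist; report max, min and mean of column 4 (KLI).
        anomalies = anomaliesBED.merge(d=args.mergeDist, c='4,4,4', o='max,min,mean')
    else:
        anomalies = anomaliesBED
    return anomalies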

Possible solution: Change lines 1158 and 1161 (window data) to store windowKLI, PI, SI, and CRI as strings. Have thresholdList() convert back to float for thresholding window records.

Although, this will probably still leave line 549 pretty fucked when pybedtools tries to do merge stats on strings:

anomalies = anomaliesBED.merge(d=args.mergeDist, c='4,4,4', o='max,min,mean')

Have raised Pybedtools issue, with any luck they will just fix it: daler/pybedtools#150

Migrate t-SNE to scikit-learn Implementation

Migrate the t-SNE clustering option to the scikit-learn implementation for consistency with the PCA method.

http://alexanderfabisch.github.io/t-sne-in-scikit-learn.html
http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm

k-means as alt clustering method

Elbow plot to estimate optimal k.
Use k-means from scikit-learn to cluster on PCA.
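A sketch of the numbers behind an elbow plot, assuming `X` holds the PCA coordinates of the windows (one row per window). The function name and default `k_max` are hypothetical. It only computes the inertia for each k; plotting is left out.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_inertias(X, k_max=8, seed=0):
    """Return [(k, inertia)] for k = 1..k_max, to be plotted as an elbow curve."""
    X = np.asarray(X, dtype=float)
    results = []
    for k in range(1, k_max + 1):
        # inertia_ is the within-cluster sum of squared distances to the centroids.
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        results.append((k, float(model.inertia_)))
    return results
```

The candidate k is the point where adding another cluster stops reducing inertia by much. Picking it by eye is subjective, which is part of why the f(K) statistic is listed separately.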

Chromosome painting of Anomalies fails when Chr name is Number

The makeChrPainting() or chromosome_collections() function fails to annotate anomaly bands onto scaffolds if the scaffolds are named only with integers, i.e. [1,2,3,4,5,6].

Have tried forcing to str() at various points.

Add report - Self + Anom GC content as stacked

Calculate GC content by default (as a proportion of the non-N space in each window). Split windows into self and Anom sets. Plot as a stacked bar chart in Seaborn.
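A sketch of the per-window GC calculation, assuming "non-N space" means the count of called A/C/G/T bases. The function name is hypothetical, and returning None for all-N windows is a choice made here, not existing behaviour.

```python
def gc_fraction(window):
    """GC content as a proportion of called (A/C/G/T) bases; None if none are called."""
    seq = window.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    called = gc + at  # N and other ambiguity codes are excluded from the denominator
    if called == 0:
        return None
    return gc / float(called)
```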

To give a feel for underlying drivers of self-Anom separation.

See: http://randyzwitch.com/creating-stacked-bar-chart-seaborn/