adamtaranto / frisk
Screen genomic scaffolds for regions of unusual k-mer composition.
Home Page: http://adamtaranto.github.io/frisk/
License: GNU General Public License v3.0
Add a function as an alternative to Kullback-Leibler distance.
Add an extra chromosome painting with the 2 HMM states as feature tracks.
I'll attach an example FASTA file here. Perhaps do 1 <= k <= 4, as that will fit nicely on a screen.
pybedtools fails when building a BedTool object from GFF3 records that have different numbers of columns or metadata fields, and it is also sensitive to line endings.
Find a package, or write a function, that reads GFF3 into pandas and then converts it to BED.
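A minimal sketch of the second option, assuming tab-separated GFF3 input. The function names `read_gff3` and `gff3_to_bed` are hypothetical and not part of frisk. Lines are split by hand rather than passed to a CSV parser, so short records are padded with "." and both Unix and Windows line endings are tolerated.

```python
import io

import pandas as pd

GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def read_gff3(handle):
    """Read GFF3 records from an open text handle into a DataFrame."""
    rows = []
    for line in handle:
        line = line.rstrip("\r\n")          # tolerate \n and \r\n endings
        if line.startswith("##FASTA"):      # embedded sequence section: stop
            break
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        fields = (fields + ["."] * 9)[:9]   # pad short records, trim long ones
        rows.append(fields)
    df = pd.DataFrame(rows, columns=GFF3_COLUMNS)
    df["start"] = df["start"].astype(int)
    df["end"] = df["end"].astype(int)
    return df


def gff3_to_bed(df):
    """Convert 1-based inclusive GFF3 coordinates to 0-based half-open BED6."""
    bed = pd.DataFrame({
        "chrom": df["seqid"],
        "chromStart": df["start"] - 1,
        "chromEnd": df["end"],
        "name": df["type"],
        "score": df["score"],
        "strand": df["strand"],
    })
    return bed.sort_values(["chrom", "chromStart", "chromEnd"]).reset_index(drop=True)


# Example: one full record and one truncated record with Windows line endings.
example = ("##gff-version 3\r\n"
           "scaf1\tfrisk\tanomaly\t11\t20\t0.5\t+\t.\tID=a1\r\n"
           "scaf1\tfrisk\tanomaly\t1\t5\r\n")
bed = gff3_to_bed(read_gff3(io.StringIO(example)))
```

The resulting frame can be written with `bed.to_csv(path, sep="\t", header=False, index=False)` to give a plain BED file.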
We currently use GPLv2, which is an old version of the standard strong copyleft license GPL. We should probably use GPLv3+. I can update it if you want.
Tested pip install on Eli's mac. Not sure if this issue stems from the blank frisk version.
>>> pip install cython numpy scipy
>>> pip install frisk
>>> frisk -h
Traceback (most recent call last):
File "/usr/local/bin/frisk", line 9, in <module>
load_entry_point('frisk==0-unknown', 'console_scripts', 'frisk')()
File "/usr/local/lib/python2.7/site-packages/pkg_resources.py", line 339, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/local/lib/python2.7/site-packages/pkg_resources.py", line 2470, in load_entry_point
return ep.load()
File "/usr/local/lib/python2.7/site-packages/pkg_resources.py", line 2184, in load
['__name__'])
File "/usr/local/lib/python2.7/site-packages/frisk/__init__.py", line 19, in <module>
from hmmlearn import hmm
File "/usr/local/lib/python2.7/site-packages/hmmlearn/hmm.py", line 15, in <module>
from sklearn.utils import check_random_state
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/__init__.py", line 16, in <module>
from .class_weight import compute_class_weight, compute_sample_weight
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/class_weight.py", line 7, in <module>
from ..utils.fixes import in1d
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/fixes.py", line 316, in <module>
from ._scipy_sparse_lsqr_backport import lsqr as sparse_lsqr
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/_scipy_sparse_lsqr_backport.py", line 58, in <module>
from scipy.sparse.linalg.interface import aslinearoperator
File "/Library/Python/2.7/site-packages/scipy/sparse/linalg/__init__.py", line 109, in <module>
from .isolve import *
File "/Library/Python/2.7/site-packages/scipy/sparse/linalg/isolve/__init__.py", line 6, in <module>
from .iterative import *
File "/Library/Python/2.7/site-packages/scipy/sparse/linalg/isolve/iterative.py", line 7, in <module>
from . import _iterative
ImportError: dlopen(/Library/Python/2.7/site-packages/scipy/sparse/linalg/isolve/_iterative.so, 2): Library not loaded: /usr/local/lib/gcc/x86_64-apple-darwin12.5.0/4.9.1/libgfortran.3.dylib
Referenced from: /Library/Python/2.7/site-packages/scipy/sparse/linalg/isolve/_iterative.so
Reason: image not found
Frisk version is not rendering correctly.
Example:
pip install --upgrade frisk
frisk --version
"frisk --0+unknown"
The same happens when running the local version with python -m frisk.
Have deleted the files relating to Versioneer and commented out the relevant lines in frisk.py.
Versioning still needs to be set up properly.
Adapt the class for finding the f(K) statistic from datasciencelab.
Need to enable input of the actual projection coordinates (X), which are currently randomly generated.
Also add support for projection coordinates with up to three dimensions; currently only two are supported.
Needs documentation. Should include:
What is Frisk?
How to cite
How it works?
Installing dependencies
Examples / Use cases
Interpreting output
python setup.py sdist upload -r pypi
Upload failed (400): Invalid version, cannot use PEP 440 local versions on PyPI.
error: Upload failed (400): Invalid version, cannot use PEP 440 local versions on PyPI.
Will resolve on next release of Seaborn:
Given a repeat-masked genome, learn k-mer abundance only from unmasked regions.
For use in training 'self' on genomes rich in non-self sequence, i.e. fungal genomes with high RIP/TE abundance.
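One way this could look, as a sketch only: it assumes masking is either soft (lowercase bases) or hard (N), and counts only k-mers made up entirely of uppercase A/C/G/T. The function name `count_unmasked_kmers` is hypothetical and not part of frisk.

```python
from collections import Counter


def count_unmasked_kmers(seq, k):
    """Count k-mers that contain no masked (lowercase or N) positions."""
    allowed = set("ACGT")
    counts = Counter()
    run = 0  # length of the current run of unmasked bases ending at position i
    for i, base in enumerate(seq):
        run = run + 1 if base in allowed else 0
        if run >= k:
            counts[seq[i - k + 1:i + 1]] += 1
    return counts


# Two unmasked stretches of ACGT separated by soft- and hard-masked bases.
counts = count_unmasked_kmers("ACGTacgtNACGT", 2)
```

Tracking the run length means no k-mer spanning a masked position is ever counted, without having to split the sequence first.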
Which k-mers drive divergence from the reference population in windows/features with extreme KLI scores?
Helper script: given a sequence (string or FASTA) and a self-kmer pickle file, report the observed k-mers ranked by KLI. If given a multi-FASTA of anomalous sequences, return the mean and variance of KLI for each observed k-mer.
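A rough sketch of the ranking step, assuming each k-mer's contribution is its term in the Kullback-Leibler sum, p * log2(p / q), where p is the frequency in the window and q the frequency in 'self'. This uses raw frequencies rather than whatever weighting frisk applies internally, and `kl_contributions` and the `pseudo` floor for unseen k-mers are hypothetical.

```python
import math


def kl_contributions(window_counts, self_freqs, pseudo=1e-9):
    """Rank observed k-mers by their term in the Kullback-Leibler sum."""
    total = sum(window_counts.values())
    terms = {}
    for kmer, n in window_counts.items():
        p = n / total                       # frequency in the query window
        q = self_freqs.get(kmer, pseudo)    # frequency in 'self' (floored)
        terms[kmer] = p * math.log2(p / q)
    # Largest positive contribution first.
    return sorted(terms.items(), key=lambda kv: kv[1], reverse=True)


ranked = kl_contributions({"AA": 3, "AC": 1}, {"AA": 0.25, "AC": 0.25})
```

For a multi-FASTA input, the same function could be run per sequence and the per-k-mer terms collected before taking the mean and variance.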
When --runProjection and --cluster are set, append Cluster allocation to anomaly gff3 metadata.
I can try to package frisk for Debian (which will eventually end up in Ubuntu) once it is stable. This is to remind me to do so.
$ time python tst.py Xanthomonas_oryzae_pv_oryzae_pxo99a.GCA_000019585.2.29.dna.genome.fa
calculated genome IVOM
done 100 windows
done 200 windows
done 300 windows
done 400 windows
done 500 windows
done 600 windows
done 700 windows
done 800 windows
done 900 windows
done 1000 windows
done 1100 windows
done 1200 windows
done 1300 windows
done 1400 windows
done 1500 windows
done 1600 windows
done 1700 windows
done 1800 windows
done 1900 windows
done 2000 windows
%TIME: [usr:1021.76 sys:21.43 301% wall:5:46.51 rss:313340]
import numpy as np
import pybedtools
from operator import itemgetter

def thresholdList(intervalList, threshold, args, threshCol=3, merge=True):
    # Keep windows at or below the threshold when looking for 'self', otherwise at or above it.
    if args.findSelf:
        tItems = [t for t in intervalList if np.log10(t[3]) <= threshold]
    else:
        tItems = [t for t in intervalList if np.log10(t[3]) >= threshold]
    # Sort by scaffold, start, and end before handing over to pybedtools.
    sItems = sorted(tItems, key=itemgetter(0, 1, 2))
    anomaliesBED = pybedtools.BedTool(sItems)  ## Fails here! Cannot deal with float (KLI) at index 3.
    if merge:
        anomalies = anomaliesBED.merge(d=args.mergeDist, c='4,4,4', o='max,min,mean')
    else:
        anomalies = anomaliesBED
    return anomalies
Possible solution: change lines 1158 and 1161 (window data) to store windowKLI, PI, SI, and CRI as strings. Have thresholdList() convert them back to float when thresholding window records.
Although this will probably still leave line 549 badly broken when pybedtools tries to compute merge statistics on strings:
anomalies = anomaliesBED.merge(d=args.mergeDist, c='4,4,4', o='max,min,mean')
Have raised a pybedtools issue; with any luck they will just fix it: daler/pybedtools#150
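As a stopgap that avoids pybedtools entirely, the merge could be done in plain Python. The sketch below assumes each window record is a (scaffold, start, end, score) tuple and mimics the max, min, and mean columns requested from the merge call above. `merge_windows` is a hypothetical name, not an existing frisk function.

```python
from operator import itemgetter


def merge_windows(records, merge_dist=0):
    """Merge windows on the same scaffold that lie within merge_dist bases.

    Returns (scaffold, start, end, max_score, min_score, mean_score) tuples.
    """
    merged = []
    for chrom, start, end, score in sorted(records, key=itemgetter(0, 1, 2)):
        last = merged[-1] if merged else None
        if last and last["chrom"] == chrom and start - last["end"] <= merge_dist:
            # Overlapping, book-ended, or close enough: extend the open interval.
            last["end"] = max(last["end"], end)
            last["scores"].append(score)
        else:
            merged.append({"chrom": chrom, "start": start,
                           "end": end, "scores": [score]})
    return [(m["chrom"], m["start"], m["end"],
             max(m["scores"]), min(m["scores"]),
             sum(m["scores"]) / len(m["scores"]))
            for m in merged]


windows = [("s1", 0, 100, 0.5), ("s1", 50, 150, 1.5),
           ("s1", 400, 500, 2.0), ("s2", 0, 100, 3.0)]
anomalies = merge_windows(windows, merge_dist=0)
```

Scores stay as floats throughout, so no string round trip is needed before thresholding.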
Migrate the t-SNE clustering option to the scikit-learn implementation for consistency with the PCA method.
http://alexanderfabisch.github.io/t-sne-in-scikit-learn.html
http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm
Elbow plot to estimate the optimal k.
Use k-means from scikit-learn to cluster on the PCA projection.
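A small sketch of how the elbow values might be computed with scikit-learn, assuming the input is a features-by-k-mer matrix. The helper name `elbow_inertias` and the synthetic data are illustrative only. Plotting the returned inertia against k and looking for the bend gives the elbow plot.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def elbow_inertias(features, max_k=6, n_components=2, seed=0):
    """Project onto principal components, then record k-means inertia per k."""
    coords = PCA(n_components=n_components).fit_transform(features)
    inertias = {}
    for k in range(1, max_k + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(coords)
        inertias[k] = model.inertia_  # within-cluster sum of squared distances
    return inertias


# Synthetic stand-in for k-mer profiles: three well-separated groups.
rng = np.random.default_rng(0)
centres = np.array([[0.0] * 6, [10.0] * 6, [-10.0, 10.0] * 3])
features = np.vstack([rng.normal(c, 1.0, size=(50, 6)) for c in centres])
inertias = elbow_inertias(features)
```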
The makeChrPainting() or chromosome_collections() function fails to annotate anomaly bands onto scaffolds if the scaffolds are named only with integers, i.e. [1,2,3,4,5,6].
Have tried forcing to str() at various points.
Calculate GC content by default (as a proportion of the non-N space in each window). Split windows into self and anomalous sets and plot as a stacked bar chart in Seaborn.
This gives a feel for the underlying drivers of self/anomaly separation.
See: http://randyzwitch.com/creating-stacked-bar-chart-seaborn/
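The per-window GC calculation could be as simple as the sketch below, which assumes that only A, C, G, and T count as called bases and that everything else (N and other ambiguity codes) is excluded from the denominator. `gc_fraction` is a hypothetical helper name.

```python
def gc_fraction(window):
    """GC content as a proportion of the called (A/C/G/T) bases in a window."""
    seq = window.upper()  # treat soft-masked bases the same as unmasked ones
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    called = gc + at
    # A window with no called bases has undefined GC content.
    return gc / called if called else float("nan")


half = gc_fraction("GGCCNNNNAATT")
```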
Add arg: List of scaffold names to be included in Chromosome Painting.
Currently, fragmented assemblies wreck the graphic's formatting.
Additional chromosome painting with RIP track as gff3 feature track.
Are we going to bother with versioneer? We also need to add a setup.py so that it is easily pip-installable. And tag releases in git (git tag -s 0.1.0 probably now; will do a 0.2.0 once all is fixed. See semver.org).
Currently, anomalous features have their k-mer counts made symmetrical (i.e. GA = TC ) and proportional within orders of k (i.e. A=25/100, T=25/100, G=25/100, C=25/100). This means 50% of columns are redundant data, providing no novel signal.
Need to find a way of scrubbing out redundant keys from symmetrical k-mer dictionaries.
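One possible approach, sketched under the assumption that the symmetrical dictionaries hold equal values for a k-mer and its reverse complement (as in GA = TC): keep only the lexicographically smaller member of each pair. The names `revcomp` and `collapse_symmetrical` are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(kmer):
    """Reverse complement of an uppercase A/C/G/T k-mer."""
    return kmer.translate(_COMPLEMENT)[::-1]


def collapse_symmetrical(kmer_values):
    """Keep one key per reverse-complement pair (the lexicographically smaller)."""
    collapsed = {}
    for kmer, value in kmer_values.items():
        canonical = min(kmer, revcomp(kmer))
        # Values are assumed equal within a pair, so the first one seen is kept.
        collapsed.setdefault(canonical, value)
    return collapsed


symmetrical = {"GA": 0.1, "TC": 0.1, "AT": 0.3, "AA": 0.25, "TT": 0.25}
collapsed = collapse_symmetrical(symmetrical)
```

Palindromic k-mers such as AT are their own reverse complement, so for even k slightly more than half of the keys survive (10 of 16 for k = 2).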