Comments (6)
Hi @jmcbroome,
it looks like your run hasn't even gotten to the stage of comparing anything; I think it is still computing the O/E values. The number of regions you are comparing and the --background-query and --limit-background options do not influence this step at all. I think the best you can do here, especially if you want to rerun the command on the same data, is to switch to a data format that stores precomputed O/E values, like Juicer or FAN-C. You can do this e.g. with fanc from-cooler. The conversion will take a while, but you'll have to do it only once; after that, loading the data will be a lot faster.
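For readers unfamiliar with the term, here is a minimal sketch of what an observed/expected (O/E) transformation does. It is a simplified textbook version, not the CHESS or FAN-C implementation: it assumes a small, dense, symmetric matrix for a single chromosome and takes the expected value to be the mean contact count at each bin distance. The function name and the toy matrix are made up for illustration.

```python
def observed_over_expected(matrix):
    """Return the O/E matrix of a square, symmetric contact matrix.

    Assumption: the expected value for a pair of bins is the mean of all
    contacts at the same bin distance (i.e. on the same diagonal).
    """
    n = len(matrix)
    oe = [[0.0] * n for _ in range(n)]
    for distance in range(n):
        # All contacts between bins that are `distance` bins apart.
        diagonal = [matrix[i][i + distance] for i in range(n - distance)]
        expected = sum(diagonal) / len(diagonal)
        if expected == 0:
            continue  # leave empty diagonals at 0 to avoid division by zero
        for i, observed in enumerate(diagonal):
            value = observed / expected
            oe[i][i + distance] = value
            oe[i + distance][i] = value  # mirror into the lower triangle
    return oe


# Hypothetical 3x3 toy matrix.
toy = [
    [4, 2, 1],
    [2, 8, 6],
    [1, 6, 12],
]
oe = observed_over_expected(toy)
```

Every diagonal of the matrix has to be visited, which is one reason this step gets expensive on high-resolution, genome-wide data and why storing the result on file pays off.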
Hope this helps.
Best,
Nick
from chess.
@jmcbroome, do you get an acceptable runtime after the conversion?
from chess.
Sorry for not following up more quickly, I've been trying to make it work. I've been precalculating O/E matrices with the HiCExplorer implementation (hicTransform in obs_exp mode), which only takes a few minutes. However, CHESS appears to hang while reading in the reference contact data: it has spent 3 days so far reading an obs-exp transformed chicken Hi-C matrix as reference. It's only using one thread for this step, even though I gave it access to 18 cores. This seems to be a serious bottleneck. I may try breaking my data into a series of syntenic chromosome pairs with their associated syntenic region pairs and starting several parallel CHESS runs, so that each individual run has to read in less contact data, since this step is not multithreaded in the software.
Current output:
2020-11-20 18:09:48,878 INFO Running '/home/jmcbroome/anaconda3/bin/chess sim --oe-input --background-query --limit-background -p 18 chicken_hicexplorer_obsexp.cool human_chr1_hicexplorer_obsexp.cool chess_test_group.bedpe chess_test_group_v2.txt'
2020-11-20 18:09:49,897 INFO CHESS version: 0.3.5
2020-11-20 18:09:49,897 INFO FAN-C version: 0.9.7
2020-11-20 18:09:49,899 INFO Loading reference contact data
It is now 2020-11-23 and no additional lines have been printed.
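The splitting idea mentioned above could be sketched roughly as follows. This is only an illustration, assuming a tab-separated BEDPE file whose first six columns are chrom1, start1, end1, chrom2, start2, end2; the function name and the example lines are hypothetical and not part of CHESS.

```python
from collections import defaultdict


def split_bedpe_by_chromosome_pair(lines):
    """Group BEDPE lines by their (chrom1, chrom2) pair."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        # BEDPE columns: chrom1, start1, end1, chrom2, start2, end2, ...
        key = (fields[0], fields[3])
        groups[key].append(line)
    return dict(groups)


# Hypothetical region pairs (reference chromosome first, query second).
example_lines = [
    "chr1\t0\t1000000\tchr1\t500000\t1500000\tpair1\t.\t+\t+",
    "chr2\t0\t1000000\tchr1\t2000000\t3000000\tpair2\t.\t+\t+",
    "chr1\t2000000\t3000000\tchr1\t4000000\t5000000\tpair3\t.\t+\t+",
]
groups = split_bedpe_by_chromosome_pair(example_lines)
```

Each group could then be written to its own BEDPE file and paired with contact matrices restricted to the two chromosomes involved, so that every run only has to load the data it actually needs.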
from chess.
Yeah, this is indeed a bottleneck. Besides the O/E calculation, CHESS currently also reads everything in the reference and query matrices into memory. That also means a lot of data is read that will never be used, because it lies outside of the region pairs of interest.
For txt file input, this is a difficult problem to solve, as one has to iterate over the entire file anyways to see which contacts are relevant. For FAN-C compatible matrices that store expected values (i.e. Juicer and FAN-C; Cooler does not support expected values on file), however, we might be able to retrieve regions on the fly. Then the sim part of the CHESS run would start practically instantaneously. However, concurrency might become an issue here. PyTables, which is HDF5-backed, really does not like concurrent access...
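To illustrate the txt-file case: even when only one region is of interest, every line has to be inspected, but nothing outside the region needs to stay in memory. The sketch below assumes a hypothetical whitespace-separated sparse format with one "bin1 bin2 count" triple per line; the actual format CHESS reads may differ, and the function name is made up.

```python
import io


def contacts_in_region(handle, start_bin, end_bin):
    """Yield (bin1, bin2, count) for contacts with both bins in [start_bin, end_bin)."""
    for line in handle:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        bin1, bin2, count = int(fields[0]), int(fields[1]), float(fields[2])
        # Keep a contact only if both of its bins fall inside the region.
        if start_bin <= bin1 < end_bin and start_bin <= bin2 < end_bin:
            yield bin1, bin2, count


# Hypothetical sparse matrix; a real file would be opened with open(path).
sparse_text = io.StringIO("0 0 5\n0 3 1\n2 3 4\n3 3 7\n5 6 2\n")
kept = list(contacts_in_region(sparse_text, 2, 5))
```

The whole file is still scanned once, so this lowers memory use rather than read time; indexed formats are what would make on-the-fly region retrieval fast.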
from chess.
Hello again! I did not have time yet to refactor the CHESS code to load submatrices from file on demand - I'm going to need @nickmachnik's help for this, too, particularly for the multiprocessing part.
However, I worked on the Cooler compatibility layer in FAN-C yesterday night and found some inefficiencies, which I fixed. On my machine (SSD), I now get a reading speed of 15-20 million pixels per minute (don't ask what it was before...). This should load high-res Cooler matrices much more quickly. The changes are already available via GitHub and PyPI (FAN-C version 0.9.8). @nickmachnik, maybe you can bump the fanc dependency in CHESS' setup.py?
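As a rough back-of-the-envelope aid, the quoted reading speed can be turned into a load-time estimate. The pixel count below is a made-up example, and the 15-20 million pixels per minute figure is specific to the machine described above, so treat the result as an order-of-magnitude guess only.

```python
def estimated_load_minutes(n_pixels, pixels_per_minute):
    """Estimate loading time in minutes from a pixel count and a reading speed."""
    if pixels_per_minute <= 0:
        raise ValueError("pixels_per_minute must be positive")
    return n_pixels / pixels_per_minute


# Hypothetical matrix with 300 million non-zero pixels.
n_pixels = 300_000_000
slow_estimate = estimated_load_minutes(n_pixels, 15_000_000)  # 20.0 minutes
fast_estimate = estimated_load_minutes(n_pixels, 20_000_000)  # 15.0 minutes
```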
Cheers,
Kai
from chess.
I updated the fanc dependency to 0.9.8 and will put this on PyPI with the next patch (0.3.6).
from chess.