I’m encountering significant runtime problems when calculating a similarity score for


Chess sim command runtime improvement heuristics? about chess HOT 6 OPEN

vaquerizaslab commented on June 9, 2024

from chess.

Comments (6)

nickmachnik commented on June 9, 2024

Hi @jmcbroome,
it looks like your run hasn't even gotten to the stage of comparing anything; I think it is still computing the O/E values. The number of regions you are comparing and the --background-query and --limit-background options do not influence this step at all. I think the best you can do here, especially if you want to rerun the command on the same data, is to switch to a data format that stores precomputed O/E values, like Juicer or FAN-C. You can do this e.g. with

fanc from-cooler

The conversion will take a while, but you'll have to do it only once, then loading the data will be a lot faster.
Hope this helps.
Best,
Nick

nickmachnik commented on June 9, 2024

@jmcbroome, do you get an acceptable runtime after the conversion?

jmcbroome commented on June 9, 2024

Sorry for not following up more quickly, I've been trying to make it work. I've been precalculating O/E matrices with the HiCExplorer implementation (hicTransform in obs_exp mode), which only takes a few minutes. However, it appears to be hanging on reading in the reference contact data, and has spent 3 days so far reading in an obs-exp transformed chicken Hi-C matrix as reference. It's only using one thread for this step, though I gave it access to 18 cores. This seems to be a serious bottleneck. I may try breaking my data into a series of syntenic chromosome pairs and associated syntenic region pairs and starting several parallel CHESS runs, so that each individual CHESS run has to read in less total contact data, since this step is not multithreaded in the software.

Current output:
2020-11-20 18:09:48,878 INFO Running '/home/jmcbroome/anaconda3/bin/chess sim --oe-input --background-query --limit-background -p 18 chicken_hicexplorer_obsexp.cool human_chr1_hicexplorer_obsexp.cool chess_test_group.bedpe chess_test_group_v2.txt'
2020-11-20 18:09:49,897 INFO CHESS version: 0.3.5
2020-11-20 18:09:49,897 INFO FAN-C version: 0.9.7
2020-11-20 18:09:49,899 INFO Loading reference contact data

It is now 2020-11-23 and no additional lines have been printed.
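
The region-pair half of the splitting idea described above could look roughly like the following. This is only a sketch: it assumes the pairs file is tab-separated with chrom1, start1, end1, chrom2, start2, end2 as its first six columns, and the function name and output file naming are hypothetical. Subsetting the contact matrices to the matching chromosomes is a separate step that is not shown.

```python
from collections import defaultdict
from pathlib import Path


def split_pairs_by_chromosomes(bedpe_path, out_dir):
    """Write one pairs file per (reference chromosome, query chromosome).

    Assumption: tab-separated input whose first six columns are
    chrom1, start1, end1, chrom2, start2, end2.
    Returns a dict mapping (chrom1, chrom2) to the written Path.
    """
    groups = defaultdict(list)
    with open(bedpe_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Skip blank lines and comment lines.
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.split("\t")
            groups[(fields[0], fields[3])].append(line)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for (ref_chrom, query_chrom), lines in sorted(groups.items()):
        # Hypothetical naming scheme: <chrom1>__<chrom2>.bedpe
        out_path = out_dir / f"{ref_chrom}__{query_chrom}.bedpe"
        out_path.write_text("\n".join(lines) + "\n")
        written[(ref_chrom, query_chrom)] = out_path
    return written
```

Each resulting file could then be passed to its own run together with matrices restricted to the two chromosomes involved, so that no single run has to load the whole genome.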

kaukrise commented on June 9, 2024

Yeah, this is indeed a bottleneck. Besides the O/E calculation, CHESS currently also reads everything in the reference and query matrices into memory. That also means a lot of data is read that will never be used, because it lies outside of the region pairs of interest.

For txt file input, this is a difficult problem to solve, as one has to iterate over the entire file anyway to see which contacts are relevant. For FAN-C compatible matrices (i.e. Juicer and FAN-C, while Cooler does not have support for expected values on file), however, we might be able to retrieve regions on the fly. Then the sim part of the CHESS run would start practically instantaneously. However, concurrency might become an issue here. PyTables, which is HDF5-backed, really does not like concurrent access...
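
To illustrate why text input still needs a full pass, here is a minimal sketch of streaming a sparse contact file and keeping only the contacts that fall inside regions of interest. It is not CHESS code: the three-column `bin_i bin_j value` layout, the function name, and the inclusive bin ranges are all assumptions made for the example. Memory stays small, but every line is still read once.

```python
def filter_contacts(lines, bin_ranges):
    """Yield (i, j, value) for contacts whose two bins lie in one range.

    lines:      iterable of whitespace-separated 'bin_i bin_j value'
                strings (hypothetical layout, e.g. an open file handle).
    bin_ranges: list of inclusive (first_bin, last_bin) tuples.
    """
    for line in lines:
        if not line.strip():
            continue
        i_str, j_str, value_str = line.split()[:3]
        i, j = int(i_str), int(j_str)
        for first, last in bin_ranges:
            # Keep the contact only if both bins fall in the same region.
            if first <= i <= last and first <= j <= last:
                yield i, j, float(value_str)
                break


example = ["0\t1\t2.5", "1\t11\t0.3", "10\t12\t1.0", "50\t51\t4.0"]
kept = list(filter_contacts(example, [(0, 2), (10, 12)]))
```

In this toy input, only the first and third contacts survive; the second spans two different regions and the fourth lies outside both.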

kaukrise commented on June 9, 2024

Hello again! I have not yet had time to refactor the CHESS code to load submatrices from file on demand - I'm going to need @nickmachnik's help for this, too, particularly for the multiprocessing part.

However, I worked on the Cooler compatibility layer in FAN-C last night and found some inefficiencies, which I fixed. On my machine (SSD), I now get a reading speed of 15-20 million pixels per minute (don't ask what it was before...). This should load high-res Cooler matrices much more quickly. The changes are already available via GitHub and PyPI (FAN-C version 0.9.8). @nickmachnik, maybe you can bump the fanc dependency in CHESS' setup.py?

Cheers,

Kai
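
For a rough sense of what the quoted reading speed means in practice, the arithmetic can be sketched as below. The pixel count is a made-up placeholder, not a measurement of any matrix in this thread, and real throughput will depend on the disk and the file.

```python
def estimated_load_minutes(n_pixels, pixels_per_minute):
    """Return the expected load time in minutes at a constant read rate."""
    return n_pixels / pixels_per_minute


# Hypothetical matrix size; substitute the number of stored pixels
# (non-zero entries) of your own file.
n_pixels = 300_000_000

fastest = estimated_load_minutes(n_pixels, 20_000_000)  # 15.0 minutes
slowest = estimated_load_minutes(n_pixels, 15_000_000)  # 20.0 minutes
```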

nickmachnik commented on June 9, 2024

I updated the fanc dependency to 0.9.8 and will put this on PyPI with the next patch (0.3.6).
