Comments (6)

nickmachnik avatar nickmachnik commented on June 9, 2024

Hi @jmcbroome,
it looks like your run hasn't even gotten to the stage of comparing anything; I think it is still computing the O/E values. The number of regions you are comparing and the --background-query and --limit-background options do not influence this step at all. I think the best you can do here, especially if you want to rerun the command on the same data, is to switch to a data format that stores precomputed O/E values, like Juicer or FAN-C. You can do this with, e.g.,

fanc from-cooler

The conversion will take a while, but you'll have to do it only once, then loading the data will be a lot faster.
Hope this helps.
Best,
Nick

from chess.

nickmachnik avatar nickmachnik commented on June 9, 2024

@jmcbroome, do you get an acceptable runtime after the conversion?

jmcbroome avatar jmcbroome commented on June 9, 2024

Sorry for not following up more quickly, I've been trying to make it work. I've been precalculating O/E matrices with the HiCExplorer implementation (hicTransform in obs_exp mode), which only takes a few minutes. However, CHESS appears to hang while reading in the reference contact data: it has spent 3 days so far reading in an obs-exp transformed chicken Hi-C matrix as reference. It only uses one thread for this step, even though I gave it access to 18 cores. This seems to be a serious bottleneck. Since this step is not multithreaded in the software, I may try breaking my data into a series of syntenic chromosome pairs with their associated syntenic region pairs and starting several parallel CHESS runs, so that each individual run has to read in less contact data.

Current output:
2020-11-20 18:09:48,878 INFO Running '/home/jmcbroome/anaconda3/bin/chess sim --oe-input --background-query --limit-background -p 18 chicken_hicexplorer_obsexp.cool human_chr1_hicexplorer_obsexp.cool chess_test_group.bedpe chess_test_group_v2.txt'
2020-11-20 18:09:49,897 INFO CHESS version: 0.3.5
2020-11-20 18:09:49,897 INFO FAN-C version: 0.9.7
2020-11-20 18:09:49,899 INFO Loading reference contact data

It is now 2020-11-23 and no additional lines have been printed.
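
The splitting idea described above could be sketched roughly as follows. This is only an illustration, not part of CHESS: the function name `split_bedpe_by_chrom_pair` and the output file naming are hypothetical, and it assumes a tab-separated BEDPE file whose first and fourth columns hold the chromosome names of the two regions in each pair.

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_bedpe_by_chrom_pair(bedpe_path, out_dir):
    """Write one BEDPE file per (chrom1, chrom2) combination.

    Hypothetical helper, assuming tab-separated BEDPE rows of the form
    chrom1, start1, end1, chrom2, start2, end2, [further columns].
    Returns a dict mapping (chrom1, chrom2) to the written file path.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group the region pairs by the chromosomes they involve.
    groups = defaultdict(list)
    with open(bedpe_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 6 or row[0].startswith("#"):
                continue  # skip blank, truncated, and comment lines
            groups[(row[0], row[3])].append(row)

    # Write each group to its own file, keeping the rows unchanged.
    written = {}
    for (chrom1, chrom2), rows in sorted(groups.items()):
        out_path = out_dir / f"pairs_{chrom1}_{chrom2}.bedpe"
        with open(out_path, "w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
            writer.writerows(rows)
        written[(chrom1, chrom2)] = out_path
    return written
```

Each resulting file could then be passed to its own `chess sim` run. The reduction in loading time only materialises if the contact matrices given to each run are also restricted to the chromosomes involved, which this sketch does not cover.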

kaukrise avatar kaukrise commented on June 9, 2024

Yeah, this is indeed a bottleneck. Besides the O/E calculation, CHESS currently also reads everything in the reference and query matrices into memory. That also means a lot of data is read that will never be used, because it lies outside of the region pairs of interest.

For txt file input, this is a difficult problem to solve, as one has to iterate over the entire file anyway to see which contacts are relevant. For FAN-C compatible matrices with expected values stored on file (i.e. Juicer and FAN-C; Cooler does not support this), however, we might be able to retrieve regions on the fly. Then the sim part of the CHESS run would start practically instantaneously. However, concurrency might become an issue here. PyTables, which is HDF5-backed, really does not like concurrent access...
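
To illustrate the txt case: even when every line has to be visited, only the contacts inside the regions of interest need to be kept in memory. The following is a minimal sketch of that idea and not how CHESS currently loads data. The function name `iter_relevant_contacts` and the input layout (one `bin_i<TAB>bin_j<TAB>value` line per contact, regions given as inclusive bin index ranges) are assumptions made for the example.

```python
def iter_relevant_contacts(lines, bin_ranges):
    """Yield (i, j, value) for contacts lying entirely inside one bin range.

    lines: iterable of "bin_i<TAB>bin_j<TAB>value" strings
           (hypothetical sparse text layout, assumed for this sketch)
    bin_ranges: list of (first_bin, last_bin) tuples, both inclusive
    """
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment lines
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        i, j, value = int(fields[0]), int(fields[1]), float(fields[2])
        # Keep the contact only if both bins fall into the same range.
        if any(lo <= i <= hi and lo <= j <= hi for lo, hi in bin_ranges):
            yield i, j, value
```

Because this is a generator, it can be fed an open file handle directly, so the file is streamed line by line and only the yielded contacts are ever stored. The full pass over the file remains, which is why on-the-fly retrieval from an indexed format would be the larger win.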

kaukrise avatar kaukrise commented on June 9, 2024

Hello again! I have not had time yet to refactor the CHESS code to load submatrices from file on demand. I'm going to need @nickmachnik's help for this, too, particularly for the multiprocessing part.

However, I worked on the Cooler compatibility layer in FAN-C last night and found some inefficiencies, which I fixed. On my machine (SSD), I now get a reading speed of 15-20 million pixels per minute (don't ask what it was before...). This should load high-res Cooler matrices much more quickly. The changes are already available via GitHub and PyPI (FAN-C version 0.9.8). @nickmachnik, maybe you can bump the fanc dependency in CHESS' setup.py?

Cheers,

Kai

nickmachnik avatar nickmachnik commented on June 9, 2024

I updated the fanc dependency to 0.9.8 and will put this on PyPI with the next patch (0.3.6).
