

tfmodisco-lite

Warning tfmodisco-lite v2.0.0 and above may produce slightly different results from the original TF-MoDISco code because minor bugs have been fixed and some speed improvements required swapping sorting algorithms.

TF-MoDISco is a biological motif discovery algorithm that differentiates itself by using attribution scores from a machine learning model, in addition to the sequence itself, to guide motif discovery. Using the attribution scores, as opposed to the signal being predicted by the machine learning model (e.g. ChIP-seq peaks), can be beneficial because the attributions fine-map the specific sequence drivers of biology. Although in many of our examples this model is BPNet and the attributions are from DeepLIFT/DeepSHAP, there is no limit on what attribution algorithm is used, or what model the attributions come from. This means that, for example, one could use attributions from a gapped k-mer SVM just as easily as attributions from DeepSHAP on a convolutional neural network that predicts enhancer activity. All that's needed to run TF-MoDISco are sequences and their corresponding per-position attribution scores.

Note This is a rewrite of the original TF-MoDISco code, which can be found at https://github.com/kundajelab/tfmodisco

Below, we can see just how fast and memory efficient tfmodisco-lite is on an ATAC-seq experiment when capped at 100k seqlets. The original tfmodisco code (left) takes 5 hours and 35 GB of RAM. tfmodisco-lite v1.0.0 (middle) takes closer to 4 hours and only requires 13 GB of RAM. tfmodisco-lite v2.0.0 (right) takes only a little over 1.5 hours and 8 GB of RAM.

[Image: runtime and memory comparison of the original tfmodisco (left), tfmodisco-lite v1.0.0 (middle), and tfmodisco-lite v2.0.0 (right)]

Installation

You can install tfmodisco-lite using pip install modisco-lite

Command Line Tools

You can run tfmodisco-lite using the command line tool modisco, which comes with the tfmodisco-lite installation. This tool allows you to run tfmodisco-lite on a set of sequences and corresponding attributions, and then to generate a report (like the one shown below) for the output of the first step.

Algorithm Description

At a high level, the procedure works like this: "seqlets," which are short spans of sequence with high absolute attribution score, are extracted from the given examples. These seqlets are then divided into positive and negative seqlets ("metaclusters"). For each metacluster, a coarse-grained similarity is calculated as the cosine similarity between gapped k-mer representations for all pairs of seqlets. This information is used to find the top nearest neighbors, for which a fine-grained similarity is calculated as the maximum Jaccard index as the two seqlets are aligned at all possible offsets. This sparse similarity matrix is then density adapted, similarly to t-SNE, and Leiden clustering is used to extract patterns. Finally, some heuristics are used to merge similar patterns and split apart the seqlets comprising dissimilar ones.
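
As a concrete illustration of the fine-grained step, the sketch below slides one seqlet's attribution track against another and takes the best Jaccard-style overlap across offsets. It is a simplification for illustration only, not the exact formula used inside the library.

# Hedged sketch: maximum Jaccard-style similarity of two 1D attribution tracks
# over all alignment offsets. Simplified relative to the library's version.
import numpy as np

def jaccard(a, b):
    """Continuous Jaccard on absolute attribution tracks of equal length."""
    a, b = np.abs(a), np.abs(b)
    return np.minimum(a, b).sum() / (np.maximum(a, b).sum() + 1e-12)

def max_jaccard_over_offsets(x, y, min_overlap=10):
    """x, y: per-position attribution tracks; try all relative offsets of y against x."""
    best = -np.inf
    for offset in range(-(len(y) - min_overlap), len(x) - min_overlap + 1):
        xs = x[max(0, offset): offset + len(y)]
        ys = y[max(0, -offset): max(0, -offset) + len(xs)]
        best = max(best, jaccard(xs, ys))
    return best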

The outputs of TF-MoDISco are motifs that summarize repeated patterns with high attribution. These patterns can be visualized using the modiscolite.report.report_motifs function to generate an HTML file (which can be loaded into a Jupyter notebook) that, after training a BPNet model on a SPI1 data set, looks like the following:

[Image: example motif report generated for a BPNet model trained on a SPI1 data set]
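
For those working in a notebook, the same report can be generated from Python. This is a minimal sketch based on how the modisco report command invokes the function; additional arguments passed by the CLI are omitted here and the exact keyword names may differ between versions.

# Hedged sketch: generating the HTML report from Python instead of the CLI.
# The positional/keyword arguments mirror the call made by `modisco report`;
# other parameters of report_motifs are left at their defaults here.
from modiscolite.report import report_motifs

report_motifs(
    "modisco_results.h5",        # tfmodisco-lite output file
    "report/",                   # output directory for the HTML report and logos
    img_path_suffix="report/",   # path prefix used for images inside the HTML
    meme_motif_db="motifs.txt",  # MEME-format motif database used for TOMTOM matching
)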

tfmodisco-lite is a lightweight version of the original TF-MoDISco implementation that uses significantly less memory, (sometimes) runs faster, and consists of significantly less code, making it easier to understand. The rewrite is an exact reimplementation (except for one minor fix) and so can be used as a drop-in replacement in existing tfmodisco pipelines.

Running tfmodisco-lite

modisco motifs -s ohe.npz -a shap.npz -n 2000 -o modisco_results.h5

This command will run modisco on the one-hot encoded sequences in ohe.npz, use the attributions from shap.npz, use a maximum of 2000 seqlets per metacluster (this is low, but a good starting point for testing the algorithm on your own data), and will output the results to modisco_results.h5. The one-hot encoded sequences and attributions are assumed to be in length-last format, i.e., have the shape (# examples, 4, sequence length). Note that you can also use npy files if you don't want to use compressed data for some reason.
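
For reference, the inputs can be written out as shown below; the array contents, dimensions, and filenames are placeholders, and only the (# examples, 4, sequence length) layout matters.

# Hedged sketch: saving one-hot sequences and attributions in the expected
# (# examples, 4, sequence length) layout. Plain .npy files are used here,
# which the CLI also accepts; .npz archives work too.
import numpy as np

n_examples, seq_len = 10000, 2114   # placeholder dimensions, not requirements
one_hot = np.zeros((n_examples, 4, seq_len), dtype=np.float32)
attributions = np.zeros((n_examples, 4, seq_len), dtype=np.float32)

# ... fill one_hot with A/C/G/T indicators and attributions with per-base scores ...

np.save("ohe.npy", one_hot)
np.save("shap.npy", attributions)
# then: modisco motifs -s ohe.npy -a shap.npy -n 2000 -o modisco_results.h5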

The output saved in modisco_results.h5 will include all of the patterns and has the following structure:

pos_patterns/
    pattern_0/
        sequence: [...]
        contrib_scores: [...]
        hypothetical_contribs: [...]
        seqlets/
            n_seqlets: [...]
            start: [...]
            end: [...]
            example_idx: [...]
            is_revcomp: [...]
            sequence: [...]
            contrib_scores: [...]
            hypothetical_contribs: [...]
        subpattern_0/
            ...
    pattern_1/
        ...
    ...
neg_patterns/
    pattern_0/
        ...
    pattern_1/
        ...
    ...

where [...] denotes that data is stored at that attribute. Importantly, the seqlets are all in the correct orientation. If a seqlet has been flipped to be the reverse complement, the sequence, contribution scores, and coordinates have also been flipped. In cases where there are not enough seqlets to consider a metacluster, that attribute (neg_patterns or pos_patterns) may not appear in the file.
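
The output file can be inspected directly with h5py; here is a minimal sketch that follows the structure above.

# Hedged sketch: walking the pos_patterns/neg_patterns groups of the output.
import h5py

with h5py.File("modisco_results.h5", "r") as f:
    for group_name in ("pos_patterns", "neg_patterns"):
        if group_name not in f:   # the group may be absent if there were too few seqlets
            continue
        for pattern_name, pattern in f[group_name].items():
            ppm = pattern["sequence"][:]            # (pattern length, 4)
            cwm = pattern["contrib_scores"][:]      # (pattern length, 4)
            n_seqlets = pattern["seqlets"]["start"].shape[0]
            print(group_name, pattern_name, ppm.shape, n_seqlets)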

Generating reports

Basic reporting can be executed with the following command:

modisco report -i modisco_results.h5 -o report/ -s report/

You may also generate reports compared to a given database of motifs with the following command:

modisco report -i modisco_results.h5 -o report/ -s report/ -m motifs.txt

This command will take the results from the tfmodisco-lite run, as well as a reference database of motifs to compare the extracted patterns against, and generate an HTML report like the one seen above. Each pattern extracted by tfmodisco-lite is compared against the database of motifs using TOMTOM to match it with prior knowledge.

tfmodisco-lite's People

Contributors

austintwang, bytewife, jmschrei, kellycochran, matthewfeickert, soumyakundu


tfmodisco-lite's Issues

Reset seqlet coordinates with respect to "--window"

The seqlet coordinates (per example) are returned relative to the "trimmed" contributions. For example, if the user specifies a window of 100 with the "--window" argument on a full input sequence width of 1000, then the seqlet coordinates should be shifted by 450 bp when saving to the MoDISco output file.

Alternatively, this could be just documented. I was trying to use the H5 file (after backwards conversion) to call seqlet hits on the examples and this messed me up for a couple of hours.
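
A minimal sketch of the adjustment described above, assuming the stored coordinates are relative to a window of width --window centered on each input sequence (the numbers match the example in the issue):

# Hedged sketch: shifting seqlet coordinates from window-relative back to
# input-relative positions, assuming a centered window.
input_length = 1000
window = 100
offset = (input_length - window) // 2   # 450 bp for this example

def window_to_input_coords(start, end, offset=offset):
    return start + offset, end + offset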

Attribution scores vs importance scores

Hi!

I'm trying to run either TF-MoDISco or this newer tfmodisco-lite for interpretation of a CNN model I have trained on enhancer sequences. I've managed to calculate the importance scores and hypothetical importance scores of my one-hot encoded enhancer sequences following the code from this notebook using DeepExplainer.

For tfmodisco-lite I'm slightly confused as to what I'm meant to be inputting for the attributions. Are the attributions the same as the importance scores or some combination of the importance scores and hypothetical importance scores? If not, how do you obtain these attributions?

Does tfmodisco-lite work with sequences that contain N?

Hi,

I've been trying to run tfmodisco-lite, but got this error: IndexError: cannot do a non-empty take from an empty axes. The error seems to originate from _laplacian_null, in the expression (np.percentile(a=pos_values, q=percentiles_to_use) - mu). Neither of my two input Numpy arrays contains any NaN. The axes seem correct: (batch, 4, sequence length).

I realize that my sequences are not all of the same length, so they have been padded with Ns on the 5' ends. The N nucleotide is encoded as (0,0,0,0). Do you know if this could be the reason I got this error?

Thank you!
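
A quick way to check whether padded (all-zero) columns could be the culprit before running tfmodisco-lite; this is a sketch under the assumption that N is encoded as an all-zero one-hot column, and the filename is a placeholder.

# Hedged sketch: measure how much of the one-hot input is padding (all-zero columns).
import numpy as np

ohe = np.load("ohe.npy")              # shape (# examples, 4, length); placeholder filename
is_n = ohe.sum(axis=1) == 0           # (# examples, length) mask of padded/N positions
print("fraction of padded/N positions:", is_n.mean())

# one possible workaround: drop heavily padded examples before running modisco
keep = is_n.mean(axis=1) < 0.5
print("examples kept:", int(keep.sum()), "of", len(keep))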

version for RNA sequence

Hi, to add a comment here, it would be super helpful to the community if the team could also release a version for RNA sequences. Thanks!!!

`modisco meme` command returns untrimmed motifs and only positive patterns, no negative ones

This worked for me to get all motifs out:

import click
import h5py
import numpy as np

@click.command()
@click.option(
    "--reportFile",
    "report_file",
    required=True,
    multiple=False,
    type=str,
    help="modisco output h5 file",
)
@click.option(
    "--outputPWM",
    "out_pwm",
    required=True,
    multiple=False,
    type=str,
    help="output PWM MEME file",
)
@click.option(
    "--outputHCWM",
    "out_hcwm",
    required=True,
    multiple=False,
    type=str,
    help="output hCWM MEME file",
)
def cli(report_file, out_pwm, out_hcwm):
    # read the modisco h5 output
    report = h5py.File(report_file, "r")

    # trim the positive and negative patterns
    trimmed_ppms = []
    trimmed_hcwms = []
    for key1 in report.keys():  # pos_patterns and neg_patterns
        for key2 in report[key1].keys():
            pwm = report[key1][key2]["sequence"][:]
            cwm = report[key1][key2]["hypothetical_contribs"][:]

            score = np.sum(np.abs(cwm), axis=1)
            trim_thresh = np.max(score) * 0.2  # cut off anything less than 20% of the max score
            pass_inds = np.where(score >= trim_thresh)[0]
            keep = slice(np.min(pass_inds), np.max(pass_inds) + 1)

            trimmed_ppms.append(pwm[keep])
            trimmed_hcwms.append(cwm[keep])
    report.close()

    # write MEME files with the trimmed PWM and hCWM motifs
    # (note: hCWMs are not probabilities; the MEME letter-probability layout is just reused)
    background = [0.25, 0.25, 0.25, 0.25]
    for out_file, matrices in ((out_pwm, trimmed_ppms), (out_hcwm, trimmed_hcwms)):
        with open(out_file, "w") as f:
            f.write("MEME version 4\n\n")
            f.write("ALPHABET= ACGT\n\n")
            f.write("strands: + -\n\n")
            f.write("Background letter frequencies (from unknown source):\n")
            f.write("A %.3f C %.3f G %.3f T %.3f\n\n" % tuple(background))

            for i, matrix in enumerate(matrices):
                f.write("\nMOTIF " + str(i) + "\n")
                f.write("letter-probability matrix: alength= 4 w= %d nsites= 1 E= 0e+0\n" % matrix.shape[0])
                for s in matrix:
                    f.write("%.5f %.5f %.5f %.5f\n" % tuple(s))


if __name__ == "__main__":
    cli()
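
Assuming the script above is saved as trim_motifs.py (the filename is arbitrary), it can be run as:

python trim_motifs.py --reportFile modisco_results.h5 --outputPWM trimmed_pwms.meme --outputHCWM trimmed_hcwms.meme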


problem in tfmodisco input

Hello,
Thank you so much for your amazing work!
I'm not sure what the difference is between the profile_scores.h5 and counts_scores.h5 inputs. Can I use either one of them as tfmodisco input, or is one of them recommended?
Thanks in advance.

Best wishes.

Motifs missing when input sequence less than 400 bp

If the user doesn't set the -w option, Modisco-lite may produce different results compared to standard Modisco when the input sequence is less than 400 bp. It would be good for the program to automatically account for this scenario, either by raising an error or by using the provided sequence length.

Running tfmodisco-lite for multiple tasks?

Hi @jmschrei

Thank you for your great work! I was wondering whether modisco-lite is able to handle multiple tasks (i.e., multiple instances of attribution scores) for the same sequences. This was still possible with the original modisco package and a key motivation for me to shift to the more efficient implementation. However, after browsing the code, it's not clear to me whether this is possible at all with this rewrite. Apologies in advance if I missed something. If so, could you please provide an example of how modisco-lite can be used with multiple tasks?

Thank you again for making the code slimmer and more efficient!

`MergeMotifsAcrossRuns.ipynb` should account for indexes with respect to `--window`

Extension of #20, MergeMotifsAcrossRuns.ipynb creates a new seqlet set that combines the results from two different modisco runs. However, it does not take into account that tfmodisco-lite trims each input sequence around the center with respect to --window and labels start and end using the trimmed sequences. Thus, the recreated seqlets do not match the original seqlets.

Errors from MEME text parser leads to pandas.errors.EmptyDataError: No columns to parse from file

I am trying to compare JASPAR motifs to motifs found by tfmodisco-lite using the following command:

modisco report -i heart_left_ventricle_random_2000_all_modisco_results.h5 -o report_all_modisco/ -s report_all_modisco/ -m JASPAR2022_CORE_vertebrates_non-redundant_pfms_meme.txt

Relevant files:
report_all_modisco.tar.gz
JASPAR2022_CORE_vertebrates_non-redundant_pfms_meme.txt
heart_left_ventricle_random_2000_all_modisco_results.h5.gz

But I ran into the following error. I am attaching the files needed to reproduce this issue.

Errors from MEME text parser:
The PSPM of motif 1 has probabilities which don't sum to 1 on row 1.
The PSPM of motif 1 has probabilities which don't sum to 1 on row 1.
FATAL: Requested motif number 1  was not found in file '/tmp/296678579.1.noble-login.q/tmpuk0jiknv'.

Traceback (most recent call last):
  File "/net/noble/vol1/home/anupamaj/miniconda3/bin/modisco", line 147, in <module>
    modiscolite.report.report_motifs(args.h5py, args.output, img_path_suffix=args.suffix, meme_motif_db=args.meme_db, 
  File "/net/noble/vol1/home/anupamaj/miniconda3/lib/python3.8/site-packages/modiscolite/report.py", line 263, in report_motifs
    tomtom_df = generate_tomtom_dataframe(modisco_h5py, meme_motif_db, 
  File "/net/noble/vol1/home/anupamaj/miniconda3/lib/python3.8/site-packages/modiscolite/report.py", line 138, in generate_tomtom_dataframe
    r = fetch_tomtom_matches(ppm, cwm, motifs_db=meme_motif_db,
  File "/net/noble/vol1/home/anupamaj/miniconda3/lib/python3.8/site-packages/modiscolite/report.py", line 109, in fetch_tomtom_matches
    tomtom_results = pandas.read_csv(tomtom_fname, sep="\t", usecols=(1, 5))
  File "/net/noble/vol1/home/anupamaj/miniconda3/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/net/noble/vol1/home/anupamaj/miniconda3/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 577, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/net/noble/vol1/home/anupamaj/miniconda3/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/net/noble/vol1/home/anupamaj/miniconda3/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1679, in _make_engine
    return mapping[engine](f, **self.options)
  File "/net/noble/vol1/home/anupamaj/miniconda3/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 557, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

Add MEME output from modisco motifs

We should add in a subcommand, modisco meme -i <h5 file>, which takes in the h5 and outputs a MEME file with the positive and negative CWMs.

tabs vs spaces

Hi Jacob,

Minor annoyance: I noticed that the codebase is built out of files where you use tabs and not spaces.
Guido said that spaces are the way to go, so a lot of people will probably get indentation errors if they tweak the code. I successfully changed all the leading tabs (the -i flag in the command) in the modiscolite folder to 4 spaces with this one-liner: (via)
find . -name '*.py' ! -type d -exec bash -c 'expand -i -t 4 "$0" > /tmp/e && mv /tmp/e "$0"' {} \;

Arial Rounded not available on my system

On line 164 in report.py, you specify that the logo should be made with the Arial Rounded font, which is not available on my system. This results in pages and pages of "findfont: Font family 'Arial Rounded' not found." errors. Removing the argument fixes the issue, and the generated logos look good to me.

No motifs found

Hi,

I have tried to run modisco-lite on some of my sequences but unfortunately I do not get any motifs back.
I generated one-hot encoded sequences and a numpy array of the hypothetical importance scores from my gkm-SVM run, but I got no results.
Attached you'll find both npy files that I used as input.
I also used the fasta and scores generated in the TF-MoDISco repo to make numpy arrays for modisco-lite, and I still do not get any significant motifs back. Am I doing something wrong?
Many thanks !

Best,

Salim.
Archive.zip

Trouble constructing minimal example where motifs are found

Hey! Thanks so much for all your hard work on this project. I'm having some trouble constructing a minimal example where motifs are found. I'm sure I'm making some kind of silly mistake (and therefore am not actually identifying an 'issue'), but I'm putting this here in case it's helpful to others.

I tried constructing some sequences where 'AAAAA' is an important motif but am getting back an error. I'm using the format (num sequences, 4, length) because that's the format mentioned in the README. I'm on modiscolite version 2.2.0.

import numpy as np
import modiscolite

num_sequences = 1000
length = 50

sequences = np.zeros((num_sequences, 4, length))
# let AAAAA be present in the first 5 chars of all sequences
sequences[:, 0, 0:5] = 1
sequences[:, 1, 5:] = 1

hyp_contrib_scores = np.zeros((num_sequences, 4, length))
# let AAAAA have high contribution scores in all sequences
hyp_contrib_scores[:, 0, 0:5] = 1

pos_sequences, neg_sequences = modiscolite.tfmodisco.TFMoDISco(
    hypothetical_contribs=hyp_contrib_scores,
    one_hot=sequences,
    verbose=True,
)

I get the following error:

Traceback (most recent call last):
  File "/Users/nickackerman/code/TFBind/issue.py", line 16, in <module>
    pos_sequences, neg_sequences = modiscolite.tfmodisco.TFMoDISco(
  File "/Users/nickackerman/.local/share/virtualenvs/TFBind-XRJQk_3F/lib/python3.10/site-packages/modiscolite/tfmodisco.py", line 281, in TFMoDISco
    seqlet_coords, threshold = extract_seqlets.extract_seqlets(
  File "/Users/nickackerman/.local/share/virtualenvs/TFBind-XRJQk_3F/lib/python3.10/site-packages/modiscolite/extract_seqlets.py", line 158, in extract_seqlets
    pos_null_values, neg_null_values = _laplacian_null(track=smoothed_tracks, 
  File "/Users/nickackerman/.local/share/virtualenvs/TFBind-XRJQk_3F/lib/python3.10/site-packages/modiscolite/extract_seqlets.py", line 34, in _laplacian_null
    (np.percentile(a=pos_values, q=percentiles_to_use)-mu))
  File "/Users/nickackerman/.local/share/virtualenvs/TFBind-XRJQk_3F/lib/python3.10/site-packages/numpy/lib/function_base.py", line 4283, in percentile
    return _quantile_unchecked(
  File "/Users/nickackerman/.local/share/virtualenvs/TFBind-XRJQk_3F/lib/python3.10/site-packages/numpy/lib/function_base.py", line 4555, in _quantile_unchecked
    return _ureduce(a,
  File "/Users/nickackerman/.local/share/virtualenvs/TFBind-XRJQk_3F/lib/python3.10/site-packages/numpy/lib/function_base.py", line 3823, in _ureduce
    r = func(a, **kwargs)
  File "/Users/nickackerman/.local/share/virtualenvs/TFBind-XRJQk_3F/lib/python3.10/site-packages/numpy/lib/function_base.py", line 4722, in _quantile_ureduce_func
    result = _quantile(arr,
  File "/Users/nickackerman/.local/share/virtualenvs/TFBind-XRJQk_3F/lib/python3.10/site-packages/numpy/lib/function_base.py", line 4831, in _quantile
    slices_having_nans = np.isnan(arr[-1, ...])
IndexError: index -1 is out of bounds for axis 0 with size 0

Can you help me construct a minimal example similar to this where motifs are identified? Thanks!

Could not construct partition: Cannot accept NaN weights.

Hi, using the following command, I get an error:

pos_patterns, neg_patterns = modiscolite.tfmodisco.TFMoDISco(
        hypothetical_contribs=attrs,
        one_hot=inputs,
        max_seqlets_per_metacluster=2000,
        sliding_window_size=20,
        flank_size=5,
        target_seqlet_fdr=0.05,
        n_leiden_runs=2,
    )

The error message is below:

/opt/conda/lib/python3.10/site-packages/modiscolite/affinitymat.py:238: RuntimeWarning: invalid value encountered in true_divide
  (Y_ / np.linalg.norm(Y_)).ravel())
/opt/conda/lib/python3.10/site-packages/modiscolite/affinitymat.py:237: RuntimeWarning: invalid value encountered in true_divide
  scores_ = np.dot((X / np.linalg.norm(X)).ravel(),
---------------------------------------------------------------------------
BaseException                             Traceback (most recent call last)
Cell In[33], line 1
----> 1 pos_patterns, neg_patterns = modiscolite.tfmodisco.TFMoDISco(
      2         hypothetical_contribs=attrs,
      3         one_hot=inputs,
      4         max_seqlets_per_metacluster=2000,
      5         sliding_window_size=20,
      6         flank_size=5,
      7         target_seqlet_fdr=0.05,
      8         n_leiden_runs=2,
      9     )

File /opt/conda/lib/python3.10/site-packages/modiscolite/tfmodisco.py:310, in TFMoDISco(one_hot, hypothetical_contribs, sliding_window_size, flank_size, min_metacluster_size, weak_threshold_for_counting_sign, max_seqlets_per_metacluster, target_seqlet_fdr, min_passing_windows_frac, max_passing_windows_frac, n_leiden_runs, n_leiden_iterations, min_overlap_while_sliding, nearest_neighbors_to_compute, affmat_correlation_threshold, tsne_perplexity, frac_support_to_trim_to, min_num_to_trim_to, trim_to_window_size, initial_flank_to_add, prob_and_pertrack_sim_merge_thresholds, prob_and_pertrack_sim_dealbreaker_thresholds, subcluster_perplexity, merging_max_seqlets_subsample, final_min_cluster_size, min_ic_in_window, min_ic_windowsize, ppm_pseudocount, verbose)
    307 	if verbose:
    308 		print("Using {} positive seqlets".format(len(pos_seqlets)))
--> 310 	pos_patterns = seqlets_to_patterns(seqlets=pos_seqlets,
    311 		track_set=track_set, 
    312 		track_signs=1,
    313 		min_overlap_while_sliding=min_overlap_while_sliding,
    314 		nearest_neighbors_to_compute=nearest_neighbors_to_compute,
    315 		affmat_correlation_threshold=affmat_correlation_threshold,
    316 		tsne_perplexity=tsne_perplexity,
    317 		n_leiden_iterations=n_leiden_iterations,
    318 		n_leiden_runs=n_leiden_runs,
    319 		frac_support_to_trim_to=frac_support_to_trim_to,
    320 		min_num_to_trim_to=min_num_to_trim_to,
    321 		trim_to_window_size=trim_to_window_size,
    322 		initial_flank_to_add=initial_flank_to_add,
    323 		prob_and_pertrack_sim_merge_thresholds=prob_and_pertrack_sim_merge_thresholds,
    324 		prob_and_pertrack_sim_dealbreaker_thresholds=prob_and_pertrack_sim_dealbreaker_thresholds,
    325 		subcluster_perplexity=subcluster_perplexity,
    326 		merging_max_seqlets_subsample=merging_max_seqlets_subsample,
    327 		final_min_cluster_size=final_min_cluster_size,
    328 		min_ic_in_window=min_ic_in_window,
    329 		min_ic_windowsize=min_ic_windowsize,
    330 		ppm_pseudocount=ppm_pseudocount)
    331 else:
    332 	pos_patterns = None

File /opt/conda/lib/python3.10/site-packages/modiscolite/tfmodisco.py:254, in seqlets_to_patterns(***failed resolving arguments***)
    252 #apply subclustering procedure on the final patterns
    253 for patternidx, pattern in enumerate(patterns):
--> 254 	pattern.compute_subpatterns(subcluster_perplexity, 
    255 		n_seeds=n_leiden_runs, n_iterations=n_leiden_iterations)
    257 return patterns

File /opt/conda/lib/python3.10/site-packages/modiscolite/core.py:153, in SeqletSet.compute_subpatterns(self, perplexity, n_seeds, n_iterations)
    150 sp_density_adapted_affmat /= np.sum(sp_density_adapted_affmat.data)
    152 #Do Leiden clustering
--> 153 self.subclusters = cluster.LeidenCluster(sp_density_adapted_affmat,
    154 	n_seeds=n_seeds, n_leiden_iterations=n_iterations) 
    156 #this method assumes all the seqlets have been expanded so they
    157 # all start at 0
    158 subcluster_to_seqletsandalignments = OrderedDict()

File /opt/conda/lib/python3.10/site-packages/modiscolite/cluster.py:22, in LeidenCluster(affinity_mat, n_seeds, n_leiden_iterations)
     19 best_quality = None
     21 for seed in range(1, n_seeds+1):
---> 22     partition = leidenalg.find_partition(
     23         graph=g,
     24         partition_type=leidenalg.ModularityVertexPartition,
     25         weights=affinity_mat.data,
     26         n_iterations=n_leiden_iterations,
     27         initial_membership=None,
     28         seed=seed*100) 
     30     quality = np.array(partition.quality())
     31     membership = np.array(partition.membership)

File /opt/conda/lib/python3.10/site-packages/leidenalg/functions.py:81, in find_partition(graph, partition_type, initial_membership, weights, n_iterations, max_comm_size, seed, **kwargs)
     79 if not weights is None:
     80   kwargs['weights'] = weights
---> 81 partition = partition_type(graph,
     82                            initial_membership=initial_membership,
     83                            **kwargs)
     84 optimiser = Optimiser()
     86 optimiser.max_comm_size = max_comm_size

File /opt/conda/lib/python3.10/site-packages/leidenalg/VertexPartition.py:456, in ModularityVertexPartition.__init__(self, graph, initial_membership, weights)
    452   else:
    453     # Make sure it is a list
    454     weights = list(weights)
--> 456 self._partition = _c_leiden._new_ModularityVertexPartition(pygraph_t,
    457     initial_membership, weights)
    458 self._update_internal_membership()

BaseException: Could not construct partition: Cannot accept NaN weights.

Can you advise on what this means? I'm running modiscolite 2.0.7 in Python 3.10.8. inputs and attrs are both numpy arrays of shape (500, 400, 4) and neither contains NaNs.

Prediction from low peak quality (qval)

Greetings,

Thank you so much for your amazing work!

I was wondering if there is any option to apply a filter on the peaks that are used for motif discovery. Currently, TF-MoDISco is finding some motifs in peaks that have a very high q-value (poor peak quality) and I was wondering if it would be possible to discard those predictions somehow.

Thanks in advance.
Best wishes.

Create reports without meme database

If I don't much care to compare my motifs to a database, it would still be nice to be able to generate a report. It would be nice to make the -m argument to report optional.
(It would also be nice if the documentation were very clear about exactly what file is expected for the -m argument, since it's a very particular format.)

Matrix format of pfms in motif pfms directory

Hi!
The format for my pfm files in my motifs/pfm/ directory are of MEME format (example below):

MEME version 4

ALPHABET= ACGT

strands: + -

Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25

MOTIF MA0002.2 RUNX1
letter-probability matrix: alength= 4 w= 11 nsites= 2000 E= 0
 0.143500  0.248000  0.348000  0.260500
 0.117000  0.242500  0.233500  0.407000
 0.061500  0.536000  0.074500  0.328000
 0.028500  0.000000  0.003500  0.968000
 0.000000  0.037500  0.936000  0.026500
 0.043500  0.063500  0.035000  0.858000
 0.000000  0.000000  0.993500  0.006500
 0.008500  0.021000  0.924000  0.046500
 0.005000  0.200000  0.125500  0.669500
 0.065500  0.231500  0.040500  0.662500
 0.250000  0.079000  0.144500  0.526500
URL http://jaspar.genereg.net/matrix/MA0002.2

However, I'm getting errors from line 146 in report.py when it tries to read these files: ValueError: could not convert string 'MEME version 4' to float64 at row 0, column 1. For these pfm files, do they just need the matrix of numbers alone? I'm using the JASPAR CORE (2022) vertebrates database and the only download format options they have for the pfms are JASPAR, MEME and TRANSFAC. Could you possibly provide an example format for the pfms so I know how to format them accordingly?

Cheers!

Doesn't work for sequences of variable lengths

Hi!

Thanks for the amazing work!

I am testing the original notebook from https://github.com/kundajelab/tfmodisco/blob/master/examples/simulated_TAL_GATA_deeplearning/TF_MoDISco_TAL_GATA.ipynb, it seems the original GitHub is deprecated.

In the original notebook, you mentioned that the TF-MoDISco pipeline also works for sequences of different lengths. However, when I tested the notebook with input data of shape [100, length, 4], where the length ranges from 500-1000 bp, it raised ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (100,) + inhomogeneous part.
If I crop the sequences and contribution scores to the same length, it works again. I am wondering whether there is any version available for sequences of different lengths?
https://github.com/kundajelab/tfmodisco/blob/master/examples/simulated_TAL_GATA_deeplearning/TF_MoDISco_TAL_GATA.ipynb

Thank you again!

Best wishes,
Ruizhi

`example_idx` doesn't trace back to original contribution scores

When using the modisco.h5 to track down original seqlet coordinates, the example_idx parameter is arbitrary (i.e. it doesn't have a 1:1 match with the dimensions of the original input .npy in terms of the different contribution score windows). This makes it essentially impossible to use only the "modisco.h5" saved metadata to trace back the genomic coordinates of the seqlets. You can reimplement parts of the "extract_seqlets.py" code to find which of your pos/neg regions made the cut, but would it be possible to make the contribution indices match the modisco.h5 outputs?

Specifically, this function I think is where we lose the information.
https://github.com/jmschrei/tfmodisco-lite/blob/main/modiscolite/extract_seqlets.py#L59

--verbose is still rather taciturn

It'd be nice if --verbose added some more information about progress as the algorithm runs. As it stands, --verbose adds exactly two output lines in the entire program. While the multi-stage nature of the Modisco algorithm means that a progress bar is not very meaningful, I'd like to get updates to know how far the program has gotten, especially when I do something weird and it crashes. Particularly when I'm starting out with a project, I'd like to know if my modisco run is going to take thirty minutes or five hours.

Even if there's not a time estimate, just seeing an occasional

INFO: 2024-01-29 13:38:34 tfmodisco.py:174: Positive seqlets refinement, phase 3/5, round 1/2

in the logs would be nice.

I'm very fond of the logging library in Python, since you just have to configure the logger once and then you don't have to pass a verbose parameter to every function call, but that's definitely personal taste.
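
For illustration only, the one-time configuration alluded to above might look like this (a sketch, not something tfmodisco-lite currently does):

# Hedged sketch: configure Python's standard logging once at program start;
# modules can then emit progress messages without passing a verbose flag around.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(levelname)s: %(asctime)s %(filename)s:%(lineno)d: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logging.getLogger(__name__).info("Positive seqlets refinement, phase 3/5, round 1/2")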

Request for example data

Sorry, I didn't find the toy data you mentioned, could you kindly provide it to help us understand the input?

Generating report: pandas.errors.EmptyDataError: No columns to parse from file

Hi again!

Cheers for the help on the last issue I raised! I managed to successfully run modisco motifs and generate a results.h5 file. Now I am trying to generate a report from this file using the command modisco report -i modisco_results.h5 -o report/ -m motifs.txt -l motifs/pfms/.

However, I am getting a pandas.errors.EmptyDataError: No columns to parse from file error from line 77 in modiscolite/report.py:
tomtom_results = pandas.read_csv(tomtom_fname, sep="\t", usecols=(1, 5))

Does this mean that TomTom did not produce any matches to known motifs as it seems that tomtom_fname might be empty?

I am using the JASPAR CORE (non-redundant) for vertebrates as my motif database and I also wanted to ask whether this is the correct file format for the motifs database input motifs.txt: https://jaspar2022.genereg.net/download/data/2022/CORE/JASPAR2022_CORE_vertebrates_non-redundant_pfms_meme.txt

motifs.html report can be confusing

Hi,

When matching to tomtom, the motifs.html report can be very confusing for a few reasons:

  1. it also reports matches that are not significant, such as matches with q-value = 1
  2. when performing the match, it uses the PPM of the tfmodisco pattern to match against known motifs in TOMTOM (which also uses PPMs); however, in the .html report the tfmodisco motif plot is based on attribution scores while the q-values are based on the PPM match. So we can easily see cases where the plots do not look similar at all but the q-values are very significant; the plots and q-values are not consistently based on the same matrix.

Add command to extract seqlets info in BED and FASTA format

Add one command, modisco seqlet-bed -i <h5 file>, that returns a BED file with the columns chrom, start, end, name, score, strand (from https://genome.ucsc.edu/FAQ/FAQformat.html#format1), where name is <pattern_name>.<seqlet_id>, score is the sum of the contributions, and strand indicates whether the seqlet is the reverse complement.

Add another command, modisco seqlet-fasta -i <h5 file> -f <fasta file> with the repeating format

><chrom>:<start>-<end> <strand> <pattern_name>.<seqlet_id>
FASTA sequence extracted from the fasta file
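
A rough sketch of what the BED export could look like, just to make the proposal concrete. This is not an existing command; it assumes the rows of a peaks BED file line up with example_idx (see the earlier example_idx issue for a caveat) and that seqlet start/end are relative to the start of each example window.

# Hedged sketch of the proposed seqlet BED export. Assumptions: (1) peaks.bed
# rows correspond to example_idx in the modisco h5, and (2) seqlet start/end
# are relative to the start of each example window.
import h5py

peaks = [line.split()[:3] for line in open("peaks.bed")]   # chrom, start, end per example

with h5py.File("modisco_results.h5", "r") as f, open("seqlets.bed", "w") as out:
    for group in ("pos_patterns", "neg_patterns"):
        if group not in f:
            continue
        for pattern_name, pattern in f[group].items():
            s = pattern["seqlets"]
            contrib = s["contrib_scores"][:]
            rows = zip(s["example_idx"][:], s["start"][:], s["end"][:], s["is_revcomp"][:])
            for i, (idx, start, end, rc) in enumerate(rows):
                chrom, peak_start, _ = peaks[int(idx)]
                name = "{}.{}.{}".format(group, pattern_name, i)
                score = float(contrib[i].sum())            # sum of contributions, per the proposal
                strand = "-" if rc else "+"
                out.write("\t".join([chrom, str(int(peak_start) + int(start)),
                                     str(int(peak_start) + int(end)), name,
                                     "%.4f" % score, strand]) + "\n")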

Add Unit tests

Add unit tests to confirm functionality is the same across commits

Add Linting Github Action

Linting should improve maintainability by removing classes of errors down the line, and make contributions more standardized

Access motifs

Hi,
First off thanks so much for the awesome tool.

I was wondering how I can access the motif matrices used for the tomtom comparison.

The values from the logos in the report are different from those stored in the pattern's sequence dataset.

thanks for the help,
Sam

Compatibility with bpnet-refactor

Hello,
I have recently run the bpnet-refactor workflow and obtained SHAP attribution scores, which I would now like to load into TF-MoDISco. I'm having some trouble seeing how to do that - the bpnet-refactor documentation says you can do it with tfmodisco-lite, but it is otherwise a bit sparse on the subject.
The final bpnet command was:

$ bpnet-shap \
    --reference-genome $REFERENCE_GENOME \
    --model $MODEL_DIR/model_split000 \
    --bed-file $DATA_DIR/peaks_inliers.bed \
    --chroms chr1 \
    --output-dir $SHAP_DIR \
    --input-seq-len 2114 \
    --control-len 1000 \
    --task-id 0 \
    --input-data $INPUT_DATA \
    --generate-shap-bigWigs \
    --chrom-sizes "$CHROM_SIZES"

The output of that was:

$ ls $SHAP_DIR
config.json       counts_scores.stats.txt  profile_scores.h5
counts_scores.bw  peaks_valid_scores.bed   profile_scores.stats.txt
counts_scores.h5  profile_scores.bw        shap_scores.log

Now, my question is, how is one meant to convert the *.h5 files with the scores into the *.npz files expected by TF-MoDISco? I see from the Jupyter notebook that you use bpnet-lite, which has an 'interpret' function to make the one-hot sequence encoding and attribution scores, but bpnet-refactor seems not to have the same program, or anything named similarly, in its /bin. Is there some way to take what I've already got and render it usable for TF-MoDISco?

Thanks,
Greg

Edited: realized that bpnet-lite is indeed on GitHub, just not in Kundaje lab's repositories.

Add example folder containing notebooks

Add in an example folder containing one example of TF attributions+sequence and one example of ATAC attributions+sequence and corresponding notebooks showing how to run the commands to extract motifs and generate the reports and then show the reports.
