Hello, I'm trying to run ./jaccard on a collectio


Running with a collection of FASTA files about jaccard-ctf HOT 11 CLOSED

brettyout commented on July 27, 2024

Running with a collection of FASTA files

from jaccard-ctf.

Comments (11)

raghavendrak commented on July 27, 2024

hello,
The jaccard program computes the Jaccard matrix from an m x n input matrix (m k-mers, n data samples). Its inputs are the data sample files, each holding one of the n columns of that matrix.
The FASTA files need to be preprocessed first. fasta_reader does this preprocessing in parallel.

mpirun -np 1 ./fasta_reader -n <number of files> -k <maxkmer = 4^k> -perc <only kmers <= percentage * maxkmer are considered> -listfile <list of fasta files in gzip> -infolderPath <path to the fasta files> -outfolderPath <output: path to generate k-mer files> -reverse_complement <should the reverse complement of the k-mer be generated>

mpirun -np 1 ./fasta_reader -n 2 -k 4 -perc 1.0 -listfile input_list.txt -infolderPath test_data/ -outfolderPath test_data_inMformat -reverse_complement 0
(the gzip requirement can be dropped by making necessary changes in fasta_reader.cxx).

The files generated in the folder test_data_inMformat can be used as input to jaccard.

Printing the Jaccard matrix to the output is disabled (line 444 for the per-batch print, and line 472). The 0.04 is the computation time.

brettyout commented on July 27, 2024

Hello again,

I ran the following command on my gzipped test data:

mpirun -np 1 ~/jaccard-ctf/fasta_reader -n 2 -k 4 -perc 1.0 -lfile input_list.txt -infolderPath test_data/ -outfolderPath test_data_inMformat -reverse_complement 0

The file input turns out to be -lfile not -listfile, but the job finished.

The output is not what I expected however:
ls test_data_inMformat/
SRS011.txt

There's only a single output file. Should that be the case if I provided two input files (SRS011061.fa.gz and SRS011086.fa.gz)? Does this make sense based on the contents of SRS011.txt?

SRS011.txt

raghavendrak commented on July 27, 2024

hello,
Yes, sorry, it is -lfile. Thank you.
Yes, the output does not seem to be what is expected.
There might be two things causing this:

- The maximum value of the k-mer that will be written to the output file is 4 ^ (k - 1) (not 4 ^ k). One of your files might have all k-mer values > 64 (k = 4 was the input).
- The file might be getting overwritten. In line 72 of the code we only consider a substring of the input file name when generating the output file name. You might want to change that.

brettyout commented on July 27, 2024

Fixed the number of output files. Looks like line 72 only works with inputs that have the extension .fasta.gz (len = 9 char).
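A minimal sketch of why the two inputs collapsed into one output file, assuming the output name is built by dropping the last 9 characters of the input name (inferred from the comments above, not taken from fasta_reader.cxx). The function names are hypothetical.

```python
def naive_output_name(input_name: str, suffix_len: int = 9) -> str:
    """Drop a fixed number of trailing characters (assumed behaviour)."""
    return input_name[:-suffix_len] + ".txt"


def safer_output_name(input_name: str) -> str:
    """Strip a known FASTA extension instead of a fixed character count."""
    for ext in (".fasta.gz", ".fa.gz", ".fasta", ".fa"):
        if input_name.endswith(ext):
            return input_name[: -len(ext)] + ".txt"
    return input_name + ".txt"


for name in ("SRS011061.fa.gz", "SRS011086.fa.gz"):
    # The naive name is identical for both inputs, so one file overwrites the other.
    print(name, "->", naive_output_name(name), "|", safer_output_name(name))
```

With a 6-character extension such as .fa.gz, the fixed 9-character cut also removes the last three characters of the sample ID, which is how both inputs end up as SRS011.txt.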

I'm not sure what you mean in your first bullet point. Is it only possible to save 64 unique 4-mers even though my samples share 136 unique k-mers? What is k-mer 1 if not the first k-mer fasta_reader encounters in a sample?

I also ran with -perc 2 and the output file was the same as SRS011.txt, but counting up to 128 this time. Attached is the list of 136 k-mers found in both of my input FASTA files. It might be worth noting that 100% of the k-mers in the two FASTA inputs are shared.
common_kmers.txt

raghavendrak commented on July 27, 2024

hello,
The percentage value is in the range (0, 1]. We added it to help us process only slices of the input (BIGSI, ~20TB), i.e., to avoid writing k-mers above a specified threshold to the file during a test run. perc should be 1.0 for actual runs. The reported height for our dataset was 78,929,564,286 with 446,507 columns (files). To generate only the top 1% of these k-mers, specify perc=0.01, and only the set bits up to 789,295,642 are dumped as output. Line 133 is where we check this. Since we use std::set to store the k-mers and then iterate over it to generate the output, the output is sorted.
The reader looks for a FASTA header (line 93), and all lines between two lines starting with > are considered part of the same sequence. It also ignores non-ACGT characters. For example, ACNGtaT with k=3 has 2 valid k-mers: GTA and TAT. Characters are case insensitive. Lines 31 and 126 are where values are assigned to these k-mers. We convert k-mers to integers as follows: (c_0,...,c_{k-1}) -> \sum_{i=0}^{k-1} 4^i c_i, where c_i is the code of the i-th character in the k-mer (A: 0, C: 1, G: 2, T: 3), counting from i = 0.
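The encoding described above can be sketched in a few lines of Python. This is an illustration only, not the code in fasta_reader.cxx: the function names are hypothetical, and it assumes that a window containing any non-ACGT character is dropped, which is what the ACNGtaT example implies. Reverse complements are not handled.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer: str) -> int:
    """Map a k-mer to sum_i 4**i * c_i, with i starting at 0."""
    return sum(CODES[ch] * 4**i for i, ch in enumerate(kmer.upper()))


def valid_kmers(sequence: str, k: int) -> list[str]:
    """Return the k-length windows that contain only A, C, G or T."""
    seq = sequence.upper()
    windows = (seq[i : i + k] for i in range(len(seq) - k + 1))
    return [w for w in windows if all(ch in CODES for ch in w)]


kmers = valid_kmers("ACNGtaT", 3)
print(kmers)                                    # ['GTA', 'TAT']
print(sorted({encode_kmer(w) for w in kmers}))  # [14, 51]
```

Collecting the encoded values in a set and sorting them mirrors the std::set behaviour mentioned above: duplicates vanish and the output comes out in increasing order.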
Changing how the maximum k-mer value is calculated would let the output cover all the k-mers found in your input FASTA files.
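To make the perc cutoff concrete, here is a small sketch of the arithmetic. It assumes the cutoff is floor(perc * max_kmer) and that values at or below it are kept, as the usage text and the BIGSI numbers in this thread suggest. The function names are hypothetical, and the 4 ^ (k - 1) maximum is taken from the earlier comment rather than from the source code.

```python
def kmer_cutoff(max_kmer: int, perc: float) -> int:
    """Largest k-mer value kept: floor(perc * max_kmer)."""
    return int(perc * max_kmer)


def filter_kmers(values, max_kmer: int, perc: float) -> list[int]:
    """Keep the values at or below the cutoff, deduplicated and sorted."""
    cutoff = kmer_cutoff(max_kmer, perc)
    return sorted(v for v in set(values) if v <= cutoff)


print(kmer_cutoff(78_929_564_286, 0.01))  # 789295642

k = 4
print(kmer_cutoff(4 ** (k - 1), 1.0))     # 64
print(filter_kmers([200, 3, 64, 65, 3], 4 ** (k - 1), 1.0))  # [3, 64]
```

Under these assumptions, k = 4 with perc = 1.0 caps the output at 64, which matches the behaviour reported earlier in the thread.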

brettyout commented on July 27, 2024

Ah. I didn't understand how those k-mer files worked. Thanks for clearing that up. I increased the k-mer length to 21 bp in fasta_reader, then ran jaccard with the print command in line 444 uncommented. Here is what I got:

./jaccard -lfile input_kmers.txt -f test_data_inMformat/ -n 2
read k-mers, batchNo: 0 non_zero_rows: 9 time: 0.02
masks created with zero rows removed if compression is enabled, batchNo: 0 time: 0.00
J constructed, batchNo: 0 J.nnz_tot: 5 J.nrow: 32
J write time: 0.12
0 17039427 0 0 0 2048 0 1 0 0 0 0 0 0 0 0 0 0 0 4096 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Batch complete, batchNo: 0 time for jaccard_acc(): 0.03
S matrix computed for the specified input dataset

Could you help me understand this matrix? How would I calculate the jaccard index between my two input files?

brettyout commented on July 27, 2024

Here's a screenshot with the format I'm seeing.

raghavendrak commented on July 27, 2024

hello, would it be possible to share your input files (the inputs to jaccard, i.e. the contents of test_data_inMformat)? You can email me, and we can post a summary here once things are resolved.

brettyout commented on July 27, 2024

Sure. I can't seem to find your email anywhere. Mind sharing it with me?

raghavendrak commented on July 27, 2024

Sure, it is: raghavendra066 @ gmail

raghavendrak commented on July 27, 2024

Copy-pasting the communication here:
line 473: S.print_matrix() is the similarity matrix given by the equation:

It might help to see an example with two files:
input_0.txt
1
2
4
8
16
32
64
input_1.txt
1
2
4
8
16
32
64
./jaccard -lfile input_test.txt -f test_data/ -n 2 -m 1024
This gives a similarity matrix of all 1s.

If one k-mer is removed from either of the files, the samples no longer match exactly and the similarity matrix is no longer all 1s.
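For a quick cross-check on small inputs, the example can be reproduced with plain Python sets. This sketch assumes the standard Jaccard definition, |A ∩ B| / |A ∪ B|, and the one-integer-per-line file format shown in the example. It is not how the jaccard program computes the matrix, and the function names are hypothetical.

```python
def read_kmers(text: str) -> set[int]:
    """Parse one k-mer value per line into a set."""
    return {int(line) for line in text.split()}


def jaccard(a: set[int], b: set[int]) -> float:
    """|a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity_matrix(samples: list[set[int]]) -> list[list[float]]:
    """Pairwise Jaccard similarity between all samples."""
    return [[jaccard(x, y) for y in samples] for x in samples]


input_0 = read_kmers("1\n2\n4\n8\n16\n32\n64\n")
input_1 = read_kmers("1\n2\n4\n8\n16\n32\n64\n")
print(similarity_matrix([input_0, input_1]))  # [[1.0, 1.0], [1.0, 1.0]]

# Dropping one k-mer from a sample lowers the off-diagonal entries to 6/7.
print(jaccard(input_0, input_1 - {64}))
```

Comparing a result like this against the printed S matrix for two or three small files is one way to confirm how the program's output should be read.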
