Comments (11)

raghavendrak commented on July 27, 2024

hello,
The jaccard program processes an m x n matrix (m k-mers, n data samples) to compute the Jaccard matrix. Its input is the set of data sample files, each containing one column of the input matrix (n files in total).
The FASTA files need to be preprocessed first; fasta_reader does this preprocessing in parallel.

mpirun -np 1 ./fasta_reader -n <number of files> -k <maxkmer = 4^k> -perc <only kmers <= percentage * maxkmer are considered> -listfile <list of fasta files in gzip> -infolderPath <path to the fasta files> -outfolderPath <output: path to generate k-mer files> -reverse_complement <should the reverse complement of the k-mer be generated>

mpirun -np 1 ./fasta_reader -n 2 -k 4 -perc 1.0 -listfile input_list.txt -infolderPath test_data/ -outfolderPath test_data_inMformat -reverse_complement 0
(the gzip requirement can be dropped by making necessary changes in fasta_reader.cxx).

The files generated in the folder test_data_inMformat can be used as input to jaccard.

Printing the Jaccard matrix to the output is disabled (line 444 (per batch) and line 472). The 0.04 is the time taken by the computation.

from jaccard-ctf.

brettyout commented on July 27, 2024

Hello again,

I ran the following command on my gzipped test data:

mpirun -np 1 ~/jaccard-ctf/fasta_reader -n 2 -k 4 -perc 1.0 -lfile input_list.txt -infolderPath test_data/ -outfolderPath test_data_inMformat -reverse_complement 0

The file input turns out to be -lfile not -listfile, but the job finished.

However, the output is not what I expected:
ls test_data_inMformat/
SRS011.txt

There's only a single output file. Should that be the case when I provided two input files, SRS011061.fa.gz and SRS011086.fa.gz? Does this make sense based on the contents of SRS011.txt?

SRS011.txt

raghavendrak commented on July 27, 2024

hello,
Yes, sorry, it is -lfile. Thank you.
Yes, the output does not seem to be what is expected.
There might be two things causing this:

  • The maximum k-mer value written to the output file is 4 ^ (k - 1), not 4 ^ k. One of your files might have only k-mer values > 64 (the input was k = 4).
  • The file might be getting overwritten. Line 72 of the code uses only a substring of the input file name to generate the output file name. You might want to change that.

brettyout commented on July 27, 2024

Fixed the number of output files. Looks like line 72 only works with inputs that have the extension .fasta.gz (len = 9 char).
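To illustrate why both inputs collapsed into SRS011.txt, here is a minimal Python sketch. It assumes that line 72 simply drops the last 9 characters of the input file name, as the discussion above suggests; the function names and the suffix list are hypothetical and are not taken from fasta_reader.cxx.

```python
# Hypothetical sketch: fixed-width vs. suffix-aware output naming.
# Assumption: the reader drops the last 9 characters (the length of ".fasta.gz").

KNOWN_SUFFIXES = (".fasta.gz", ".fa.gz", ".fasta", ".fa")  # illustrative list


def fixed_width_stem(filename: str, suffix_len: int = 9) -> str:
    """Drop a fixed number of trailing characters, whatever the extension is."""
    return filename[:-suffix_len]


def suffix_aware_stem(filename: str) -> str:
    """Drop only a recognised extension, leaving the sample name intact."""
    for suffix in KNOWN_SUFFIXES:
        if filename.endswith(suffix):
            return filename[: -len(suffix)]
    return filename


if __name__ == "__main__":
    for name in ("SRS011061.fa.gz", "SRS011086.fa.gz"):
        print(name, fixed_width_stem(name), suffix_aware_stem(name))
```

With the fixed-width rule both SRS011061.fa.gz and SRS011086.fa.gz map to the stem SRS011, so the second output would overwrite the first; the suffix-aware variant keeps the two names distinct.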

I'm not sure what you mean in your first bullet point. Is it only possible to save 64 unique k-mers of length 4, even though my samples share 136 unique k-mers? What is k-mer 1 if not the first k-mer fasta_reader encounters in a sample?

I also ran with -perc 2 and the output file was the same as SRS011.txt, but counting up to 128 this time. Attached is the list of 136 k-mers found in both of my input FASTA files. It might be worth noting that 100% of the k-mers in the two FASTA inputs are shared.
common_kmers.txt

raghavendrak commented on July 27, 2024

hello,
The percentage value is in the range (0, 1]. We added it to help us process only slices of the input (BIGSI, ~20 TB), i.e., to avoid generating k-mers in the file that are greater than a specified threshold for a test run. perc should be 1.0 for actual runs. The reported height for our dataset was 78,929,564,286 with 446,507 columns (files). To generate the top 1% of these k-mers, specify perc=0.01; only the set bits up to 789,295,642 are then dumped as output. Line 133 is where we check this. Since we use std::set to store the k-mers and then iterate over it to generate the output, the output is sorted.
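The threshold arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the cutoff is assumed to be the integer part of perc * maxkmer, and the comparison is assumed to be inclusive, following the usage text "only kmers <= percentage * maxkmer are considered". The actual check on line 133 may differ in detail.

```python
# Hypothetical sketch of the -perc cutoff; not the code from fasta_reader.cxx.


def kmer_threshold(max_kmer: int, perc: float) -> int:
    """Largest k-mer value kept for a given fraction of the maximum value."""
    if not 0.0 < perc <= 1.0:
        raise ValueError("perc is expected to lie in (0, 1]")
    return int(perc * max_kmer)


def keep_below_threshold(kmers, max_kmer: int, perc: float):
    """Deduplicate, filter by the cutoff, and return values in sorted order."""
    limit = kmer_threshold(max_kmer, perc)
    # A sorted set of unique values mirrors iterating over a std::set.
    return sorted(value for value in set(kmers) if value <= limit)


if __name__ == "__main__":
    print(kmer_threshold(78_929_564_286, 0.01))
    print(keep_below_threshold([70, 3, 64, 3, 10], max_kmer=64, perc=1.0))
```

For the dataset height quoted above, kmer_threshold(78_929_564_286, 0.01) gives 789295642, matching the figure in the comment.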
The reader looks for a FASTA header (line 93), and all lines between two lines starting with > are considered part of the same sequence. It also ignores non-ACGT characters. For example, ACNGtaT with k=3 has 2 valid k-mers: GTA and TAT. The characters are case-insensitive. Lines 31 and 126 are where the values are assigned to these k-mers. We convert k-mers to integers as follows: (c_0, ..., c_{k-1}) -> \sum_{i=0}^{k-1} 4^i c_i, where c_i is the code of the i-th character in the k-mer (A: 0, C: 1, G: 2, T: 3), starting with i = 0.
Changing the maximum k-mer value calculation can cover all the k-mers found in your input FASTA files.
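The mapping described above can be tried out with a short Python sketch. It is an illustration only: the function names are hypothetical, and windows containing a non-ACGT character are skipped so that the ACNGtaT example yields exactly GTA and TAT. The real reader may apply an offset or other details not covered here.

```python
# Hypothetical sketch of the k-mer -> integer mapping described in this comment.

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer: str) -> int:
    """Compute sum_i 4^i * c_i with i starting at 0 (case-insensitive)."""
    return sum(CODES[char] * 4**i for i, char in enumerate(kmer.upper()))


def valid_kmers(sequence: str, k: int) -> set:
    """Collect every length-k window made up only of A, C, G and T."""
    seq = sequence.upper()
    found = set()
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        if all(char in CODES for char in window):
            found.add(window)
    return found


def encoded_sorted(sequence: str, k: int) -> list:
    """Encode the valid k-mers and return them sorted, like a std::set walk."""
    return sorted(encode_kmer(kmer) for kmer in valid_kmers(sequence, k))


if __name__ == "__main__":
    print(valid_kmers("ACNGtaT", 3))
    print(encoded_sorted("ACNGtaT", 3))
```

Under this mapping GTA encodes to 2 + 3*4 + 0*16 = 14 and TAT to 3 + 0*4 + 3*16 = 51. The largest value for k = 4 is TTTT = 255, i.e. 4^k - 1, which is why a cutoff of 4 ^ (k - 1) = 64 drops some k-mers.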

brettyout commented on July 27, 2024

Ah. I didn't understand how those k-mer files worked. Thanks for clearing that up. I increased the k-mer length to 21 bp in fasta_reader, then ran jaccard with the print command on line 444 uncommented. Here is what I got:

./jaccard -lfile input_kmers.txt -f test_data_inMformat/ -n 2
read k-mers, batchNo: 0 non_zero_rows: 9 time: 0.02
masks created with zero rows removed if compression is enabled, batchNo: 0 time: 0.00
J constructed, batchNo: 0 J.nnz_tot: 5 J.nrow: 32
J write time: 0.12
0 17039427 0 0 0 2048 0 1 0 0 0 0 0 0 0 0 0 0 0 4096 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Batch complete, batchNo: 0 time for jaccard_acc(): 0.03
S matrix computed for the specified input dataset

Could you help me understand this matrix? How would I calculate the jaccard index between my two input files?

brettyout commented on July 27, 2024

[Screenshot: Screen Shot 2020-10-29 at 9 52 30 PM]

Here's a screenshot with the format I'm seeing.

raghavendrak commented on July 27, 2024

hello, would it be possible to share your input files (the inputs to jaccard, i.e., test_data_inMformat)? You can email me, and we can post a summary here after things get resolved.

brettyout commented on July 27, 2024

Sure. I can't seem to find your email anywhere. Mind sharing it with me?

raghavendrak commented on July 27, 2024

Sure, it is: raghavendra066 @ gmail

raghavendrak commented on July 27, 2024

Copy-pasting the communication here:
Line 473: S.print_matrix() prints the similarity matrix given by the equation:
[image: func]
It might help to see an example with two files:
input_0.txt
1
2
4
8
16
32
64
input_1.txt
1
2
4
8
16
32
64
./jaccard -lfile input_test.txt -f test_data/ -n 2 -m 1024
The similarity matrix is all 1s:
[image: 1]
If one k-mer is removed from either of the files, the data is no longer similar.
[image: 2]
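For readers who want to check the example by hand, here is a minimal Python sketch using the standard Jaccard definition, J(A, B) = |A ∩ B| / |A ∪ B|. That the tool's S matrix follows exactly this definition is an assumption, since the equation image is not preserved here; the function names are hypothetical, and the code does not reproduce the output format of S.print_matrix().

```python
# Hypothetical sketch: pairwise Jaccard similarity between k-mer sets.


def jaccard(a, b) -> float:
    """|A ∩ B| / |A ∪ B| for two collections of k-mer values."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty samples count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


def similarity_matrix(samples) -> list:
    """n x n matrix of pairwise Jaccard similarities."""
    return [[jaccard(x, y) for y in samples] for x in samples]


if __name__ == "__main__":
    input_0 = [1, 2, 4, 8, 16, 32, 64]
    input_1 = [1, 2, 4, 8, 16, 32, 64]
    print(similarity_matrix([input_0, input_1]))
    # Drop one k-mer from the second file and compare again.
    print(similarity_matrix([input_0, input_1[:-1]]))
```

With the two identical files above every entry is 1.0. After removing the value 64 from one file, the off-diagonal entries become 6/7 under this definition, while the diagonal stays at 1.0.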