Comments (11)
hello,
An m x n matrix is processed to compute the Jaccard matrix (m k-mers, n data samples). The data sample files are the input to the jaccard program (each file contains one of the n columns of the input matrix).
The FASTA files need to be preprocessed. fasta_reader preprocesses them in parallel.
mpirun -np 1 ./fasta_reader -n <number of files> -k <maxkmer = 4^k> -perc <only kmers <= percentage * maxkmer are considered> -listfile <list of fasta files in gzip> -infolderPath <path to the fasta files> -outfolderPath <output: path to generate k-mer files> -reverse_complement <should the reverse complement of the k-mer be generated>
mpirun -np 1 ./fasta_reader -n 2 -k 4 -perc 1.0 -listfile input_list.txt -infolderPath test_data/ -outfolderPath test_data_inMformat -reverse_complement 0
(the gzip requirement can be dropped by making necessary changes in fasta_reader.cxx).
The files generated in the folder test_data_inMformat can be used as input to jaccard.
Printing the Jaccard matrix to the output is disabled (line 444 (per batch) and line 472). The 0.04 is the time taken by the computation.
from jaccard-ctf.
Hello again,
I ran the following command on my gzipped test data:
mpirun -np 1 ~/jaccard-ctf/fasta_reader -n 2 -k 4 -perc 1.0 -lfile input_list.txt -infolderPath test_data/ -outfolderPath test_data_inMformat -reverse_complement 0
The file-list flag turns out to be -lfile, not -listfile, but the job finished.
The output is not what I expected, however:
ls test_data_inMformat/
SRS011.txt
There's only a single output file. Should that be the case if I provided two input files, SRS011061.fa.gz and SRS011086.fa.gz? Does this make sense based on the contents of SRS011.txt?
from jaccard-ctf.
hello,
Yes, sorry, it is -lfile. Thank you.
Yes, the output does not seem to be what is expected.
There might be two things causing this:
- The maximum value of the k-mer that will be written to the output file is 4 ^ (k - 1) (not 4 ^ k). One of your files might have all k-mer values > 64 (k = 4 was the input).
- The file might be getting overwritten. In line 72 of the code we only consider a substring of the input file name to generate the output file name. You might want to change that.
from jaccard-ctf.
Fixed the number of output files. Looks like line 72 only works with inputs that have the extension .fasta.gz (len = 9 char).
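The overwrite can be reproduced with a minimal sketch, assuming (as the observation above suggests) that the output name is built by dropping the last 9 characters of the input name. The function names here are hypothetical and are not taken from fasta_reader.cxx.

```python
def output_name_fixed_length(input_name: str, suffix_len: int = 9) -> str:
    # Assumption: a fixed number of trailing characters is dropped, as if
    # every input ended in ".fasta.gz" (9 characters), then ".txt" is added.
    return input_name[:-suffix_len] + ".txt"


def output_name_by_extension(input_name: str) -> str:
    # Hypothetical alternative: strip a known extension explicitly so that
    # shorter extensions such as ".fa.gz" do not eat into the sample name.
    for ext in (".fasta.gz", ".fa.gz", ".fasta", ".fa"):
        if input_name.endswith(ext):
            return input_name[: -len(ext)] + ".txt"
    return input_name + ".txt"


inputs = ["SRS011061.fa.gz", "SRS011086.fa.gz"]

# Both inputs collapse to the same name, so the second file overwrites the first.
print([output_name_fixed_length(name) for name in inputs])

# Stripping by extension keeps the two outputs distinct.
print([output_name_by_extension(name) for name in inputs])
```

With the two ".fa.gz" inputs from this thread, the fixed-length version yields SRS011.txt twice, which matches the single output file seen earlier.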
I'm not sure what you mean in your first bullet point. Is it only possible to save 64 unique k-mers of length 4 even though my samples share 136 unique k-mers? What is k-mer 1 if not the first k-mer fasta_reader encounters in a sample?
I also ran with -perc 2 and the output file was the same as SRS011.txt, but counting up to 128 this time. Attached is the list of 136 k-mers found in both of my input FASTA files. It might be worth noting that 100% of the k-mers in the two FASTA inputs are shared.
common_kmers.txt
from jaccard-ctf.
hello,
The percentage value is in the range (0, 1]. We added it to help us process only slices of the input (BIGSI, ~20 TB), i.e., to avoid generating k-mers in the file that are greater than a specified threshold for a test run. perc should be 1.0 for the actual runs. The reported height for our dataset was 78,929,564,286 with 446,507 columns (files). If we want to generate the top 1% of these k-mers, perc=0.01 is specified, and only the set bits up to 789,295,642 are dumped as output. Line 133 is where we check this. Since we use std::set for storing the k-mers and then iterate over it to generate the output, the output is sorted.
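The thresholding described above can be sketched as follows. This is only an illustration under the assumption that the cutoff is the integer part of perc * maxkmer and that values are kept when they are <= the cutoff, as the usage string states; `select_kmers` is a hypothetical name, and a Python set plus `sorted` stands in for the ordered std::set.

```python
def kmer_threshold(perc: float, max_kmer: int) -> int:
    # Assumption: the cutoff is perc * max_kmer, truncated to an integer.
    return int(perc * max_kmer)


def select_kmers(kmer_values, perc: float, max_kmer: int):
    # Deduplicate, keep only values at or below the cutoff, and return them
    # in ascending order (std::set iteration is already ordered in C++).
    cutoff = kmer_threshold(perc, max_kmer)
    return sorted(v for v in set(kmer_values) if v <= cutoff)


# Height reported in this thread, with perc = 0.01.
print(kmer_threshold(0.01, 78_929_564_286))

# Small made-up example: duplicates are dropped and values above 64 are cut.
print(select_kmers([64, 3, 17, 3, 200], perc=1.0, max_kmer=64))
```

The first call gives 789295642, the same cutoff quoted above for the top 1% of the dataset.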
The reader looks for a FASTA header (line 93), and all lines between two lines starting with > are considered to be part of the same sequence. It also skips any k-mer that contains a non-ACGT character. For example, ACNGtaT with k=3 has 2 valid k-mers: GTA and TAT. The characters are case-insensitive. Lines 31 and 126 are where the values are assigned to these k-mers. We convert k-mers to integers as follows: (c_0,...,c_{k-1}) -> \sum_i 4^i c_i, where c_i is the code of the i-th character in the k-mer (A: 0, C: 1, G: 2, T: 3) and i starts at 0.
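As a rough illustration of the rules above, the sketch below extracts the valid k-mers from a sequence and applies the stated mapping. It assumes the first character of a k-mer takes i = 0 (the least significant position); the actual ordering and any offset used in fasta_reader.cxx may differ, and the function names are hypothetical.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def valid_kmers(sequence: str, k: int):
    # Case-insensitive: upper-case first, then slide a window of length k and
    # keep only windows made up entirely of A, C, G and T.
    seq = sequence.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return [w for w in windows if all(c in CODES for c in w)]


def encode(kmer: str) -> int:
    # sum_i 4^i * c_i, with i = 0 for the first character (an assumption).
    return sum(CODES[c] * 4 ** i for i, c in enumerate(kmer))


kmers = valid_kmers("ACNGtaT", 3)
print(kmers)
print([encode(kmer) for kmer in kmers])
```

For ACNGtaT with k=3 this yields GTA and TAT, and under the assumed ordering they encode to 14 and 51.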
Changing the maximum k-mer value calculation can cover all the k-mers found in your input FASTA files.
from jaccard-ctf.
Ah. I didn't understand how those k-mer files worked. Thanks for clearing that up. I increased the k-mer length to 21 bp in fasta_reader, then ran jaccard with the print command in line 444 uncommented. Here is what I got:
./jaccard -lfile input_kmers.txt -f test_data_inMformat/ -n 2
read k-mers, batchNo: 0 non_zero_rows: 9 time: 0.02
masks created with zero rows removed if compression is enabled, batchNo: 0 time: 0.00
J constructed, batchNo: 0 J.nnz_tot: 5 J.nrow: 32
J write time: 0.12
0 17039427 0 0 0 2048 0 1 0 0 0 0 0 0 0 0 0 0 0 4096 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Batch complete, batchNo: 0 time for jaccard_acc(): 0.03
S matrix computed for the specified input dataset
Could you help me understand this matrix? How would I calculate the jaccard index between my two input files?
from jaccard-ctf.
Here's a screenshot with the format I'm seeing.
from jaccard-ctf.
hello, would it be possible to share your input files (the inputs to jaccard, i.e. the contents of test_data_inMformat)? You can email me, and we can post a summary here once things are resolved.
from jaccard-ctf.
Sure. I can't seem to find your email anywhere. Mind sharing it with me?
from jaccard-ctf.
Sure, it is: raghavendra066 @ gmail
from jaccard-ctf.
Copy-pasting the communication here:
Line 473: S.print_matrix() prints the similarity matrix given by the equation:
It might help to see an example: Two files
input_0.txt
1
2
4
8
16
32
64
input_1.txt
1
2
4
8
16
32
64
./jaccard -lfile input_test.txt -f test_data/ -n 2 -m 1024
The similarity matrix is all 1s.
If one k-mer is removed from either of the files, the two files are no longer identical.
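The example above can be checked by hand with a small sketch. It assumes the similarity is the standard Jaccard index, |A ∩ B| / |A ∪ B|, computed on the sets of k-mer values in each file; it does not reproduce how jaccard-ctf stores or prints its matrices, and the function names are hypothetical.

```python
def jaccard(a, b) -> float:
    # Standard Jaccard index on two collections of k-mer values.
    a, b = set(a), set(b)
    union = a | b
    # Convention chosen for this sketch: two empty samples count as identical.
    return len(a & b) / len(union) if union else 1.0


def similarity_matrix(samples):
    # Pairwise Jaccard index between all samples, as a list of rows.
    return [[jaccard(x, y) for y in samples] for x in samples]


input_0 = [1, 2, 4, 8, 16, 32, 64]
input_1 = [1, 2, 4, 8, 16, 32, 64]

# Identical files: every entry is 1.0.
print(similarity_matrix([input_0, input_1]))

# Remove one k-mer from the second file: the off-diagonal entries become 6/7.
print(similarity_matrix([input_0, input_1[:-1]]))
```

With one k-mer removed, the two files share 6 of 7 distinct k-mers, so the off-diagonal entries drop to 6/7 while the diagonal stays at 1.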
from jaccard-ctf.