yzhernand / vntrseek

This repository is now deprecated. Please visit the official repository at https://github.com/Benson-Genomics-Lab/VNTRseek. VNTRseek is a computational pipeline for the detection of variable number tandem repeats (VNTRs). It was developed by Yevgeniy Gelfand et al. in Dr. Gary Benson's Laboratory for Biocomputing and Informatics at Boston University.

Home Page: http://orca.bu.edu/vntrseek/

License: GNU General Public License v3.0

CMake 1.06% Perl 45.56% Shell 0.79% C 51.98% C++ 0.57% Makefile 0.04% Batchfile 0.01%

vntrseek's Issues

New

Hello,

I am starting to use VNTRseek to detect VNTRs in Mycobacteria. I have the FASTA file for them, obtained from Nanopore sequencing.
I am confused about the following settings:
a temporary directory it can use (tmpdir): should I create a new folder?
the directory to which it should write files to be viewed by vntrview (html_dir): should I create a new folder?
the names of the input reference set profiles and sequences (reference_file and reference_seq): I have reference FASTA files; are these the ones to use? And what about reference_seq? What does it mean?

Thank you!

CMake Error at src/pcr_dup/CMakeLists.txt:1 (add_executable)

Hi,

I got the following error after cloning the repo and running cmake.

$ git clone https://github.com/yzhernand/VNTRseek.git && cd VNTRseek
$ mkdir build && cd build
$ cmake -DCMAKE_INSTALL_PREFIX=/home/niuyw/software/VNTRseek.1.10.0-rc.3 ..
ZLIB lib: /usr/lib64/libz.so
ZLIB include: /usr/include
-- Checking GCC version...
-- GCC version >= 4.1.2 (4.4.7)
-- Checking GLIBC version...
-- GLIBC version: 2.12
-- Downloading legacy build of TRF...
-- Perl >= 5.8.8 (5.10.1)
-- Checking samtools version...
-- samtools version >= 1.8 (1.9)
-- Checking bedtools version...
-- bedtools version >= 2.15.0 (2.26.0)
-- Your processor is x86_64 and you are running Linux. This means we'll download trf409.legacylinux64
-- Checking for perl module Try::Tiny
-- Checking for perl module Try::Tiny - found at /home/niuyw/perl5/lib/perl5/Try/Tiny.pm
-- Found PerlModules: TRUE  
CMake Error at src/pcr_dup/CMakeLists.txt:1 (add_executable):
  add_executable called with incorrect number of arguments


CMake Error at src/pcr_dup/CMakeLists.txt:2 (target_link_libraries):
  Cannot specify link libraries for target "pcr_dup.exe" which is not built
  by this project.


-- Configuring incomplete, errors occurred!
See also "/home/niuyw/software/VNTRseek/build/CMakeFiles/CMakeOutput.log".
See also "/home/niuyw/software/VNTRseek/build/CMakeFiles/CMakeError.log".

Do you have any ideas about this?

Best,
Yiwei Niu

default output to be current directory

I suggest that the output directory default to ./SQL_DATABASE_NAME.

conf file should be in output directory

I think the configuration file should be inside the output directory (instead of the home directory, ~) so that, after a run, another person can see which conf file produced the results.
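One way to picture this request is a small helper that copies the configuration file used for a run into that run's output directory. The sketch below is illustrative only: `archive_config` and its arguments are hypothetical names, not part of VNTRseek.

```python
import shutil
from pathlib import Path


def archive_config(conf_path: str, output_dir: str) -> Path:
    """Copy the config file used for a run into the run's output directory.

    Hypothetical helper for illustration; VNTRseek does not provide it.
    """
    out = Path(output_dir)
    # Create the output directory (and any parents) if it does not exist yet.
    out.mkdir(parents=True, exist_ok=True)
    dest = out / Path(conf_path).name
    # copy2 also preserves timestamps, which helps match a config to a run.
    shutil.copy2(conf_path, dest)
    return dest
```

With something like this, the results and the settings that produced them travel together, so a second person does not need access to the original user's home directory.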

Accept BAM files as input

Currently only FASTA and FASTQ format sequence files are accepted by VNTRseek. This issue is to request the addition of BAM files as supported input.

Only BAM files with accompanying BAI files will be used, as VNTRseek tries to take advantage of multi-core systems and needs BAI files to divide the BAM files into ranges of reads to delegate to subprocesses for processing.
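A rough sketch of that division, assuming one chunk per reference sequence and per mate, might look like the following. The function name and the chunking policy are assumptions for illustration, not VNTRseek's actual implementation; the `samtools view` options used (`-f`, `-F`, and a `name:start-end` region, which requires an index) are standard.

```python
from typing import List, Tuple


def bam_chunk_commands(bam_path: str,
                       contigs: List[Tuple[str, int]]) -> List[List[str]]:
    """Build one `samtools view` command per (contig, mate) pair.

    `contigs` holds (name, length) pairs, e.g. taken from the BAM header.
    Hypothetical helper; the chunking policy is an assumption.
    """
    commands = []
    for name, length in contigs:
        region = f"{name}:1-{length}"  # region queries need the BAI index
        for mate_flag in ("64", "128"):  # 64 = first in pair, 128 = second
            commands.append([
                "samtools", "view",
                "-F", "2304",  # skip secondary (256) + supplementary (2048)
                "-f", mate_flag,
                bam_path, region,
            ])
    return commands
```

Each command list can then be handed to a separate worker process, which is why the index matters: without it, every worker would have to scan the whole file.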

clean up

VNTRseek creates millions of files and never deletes them.
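A common pattern for this kind of cleanup is to do intermediate work in a scratch directory that is removed automatically when the work finishes, even if it fails. The sketch below shows the idea with Python's standard `tempfile` module; `run_with_scratch` and the `vntrseek_` prefix are hypothetical, and this is not how VNTRseek currently manages its files.

```python
import tempfile
from pathlib import Path


def run_with_scratch(work, parent=None):
    """Call work(scratch_dir) and delete scratch_dir afterwards.

    The directory and everything in it are removed when the `with`
    block exits, whether `work` returns normally or raises.
    """
    with tempfile.TemporaryDirectory(prefix="vntrseek_", dir=parent) as scratch:
        return work(Path(scratch))
```

Anything that must survive the run would need to be copied out of the scratch directory before `work` returns.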

reference minisatellite file

Is there a way to access the reference file with all the repeat units and statistics obtained from TRF?

Using index for bam files.

Hi,

I'm trying to use an index with BAM files, so I added my index file to the FASTA directory, e.g.

$ ls
bam_14_sorted_griznog.bam  bam_14_sorted_griznog.bam.bai

However, this is picked up as two input files:

$ vntrseek 0 19 --dbsuffix ${DB_SUFFIX}
Could not read global config


Executing step #0 (creating MySQL database)...
Warning: Failed to create data directory!

Warning: Failed to create html directory!

Warning: Failed to create output directory!
done!


Executing step #1 (searching for tandem repeats in reads, producing profiles and sorting)...
2 supported files (bam format, assuming uncompressed) found in /tmp/7244420/fasta
Will use 4 processes
Processing bam chunk using: samtools view -F 256 -F 2048 -f 64 /tmp/7244420/fasta/bam_14_sorted_griznog.bam chrM:1-16571
Processing bam chunk using: samtools view -F 256 -F 2048 -f 128 /tmp/7244420/fasta/bam_14_sorted_griznog.bam chrM:1-16571
Processing bam chunk using: samtools view -F 256 -F 2048 -f 64 /tmp/7244420/fasta/bam_14_sorted_griznog.bam chr1:1-249250621
Processing bam chunk using: samtools view -F 256 -F 2048 -f 128 /tmp/7244420/fasta/bam_14_sorted_griznog.bam chr1:1-249250621
Running child, current_file = 1 (HASH(0x1e2e940))...
Running child, current_file = 0 (HASH(0x1e2ea18))...
Running child, current_file = 2 (HASH(0x1e2ebc8))...
Running child, current_file = 3 (HASH(0x1e0fc48))...
Processing bam chunk using: samtools view -F 256 -F 2048 -f 64 /tmp/7244420/fasta/bam_14_sorted_griznog.bam chr2:1-243199373
Running child, current_file = 4 (HASH(0x1e2ea90))...
Processing bam chunk using: samtools view -F 256 -F 2048 -f 128 /tmp/7244420/fasta/bam_14_sorted_griznog.bam chr2:1-243199373
Running child, current_file = 5 (HASH(0x1ffd378))...

Is there a way to use the index file without duplicating each step? Our input files are pretty large and using an index really speeds things up.
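The issue does not show how the pipeline decides which files count as input, so the following is only a sketch of one possible fix: select files by their final extension and skip known index suffixes explicitly. `select_input_files` and `INDEX_SUFFIXES` are hypothetical names.

```python
from typing import Iterable, List

# Index files that accompany BAM files and must not be treated as input.
INDEX_SUFFIXES = (".bai", ".csi")


def select_input_files(names: Iterable[str]) -> List[str]:
    """Return only the BAM files from a directory listing, in sorted order."""
    inputs = []
    for name in sorted(names):
        lower = name.lower()
        if lower.endswith(INDEX_SUFFIXES):
            continue  # an index belongs to a BAM file; it is not a read file
        if lower.endswith(".bam"):
            inputs.append(name)
    return inputs
```

Applied to the listing above, this would report one input file rather than two, while leaving the index next to the BAM file where samtools expects it.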

Kill Received

Hello,

I have compiled this and it runs, but I get a "kill" result on step 1, and the last line of the log reads "TRF output lines processed 32481800". The email address on the BU web site is not working.

When I hit Ctrl-C after the "kill", it says: executing step #2... no valid pairs for 0-0, command exited with value 255 at line 782.

Any help appreciated,

Jerry Y

Specifying port for mysql and/or ability to specify a socket.

Hi,

I'm trying to run vntrseek with multiple copies on each node in our cluster and across multiple nodes. To do this we are running one mysqld per vntrseek job, which requires using higher level ports. In most places this works as is by specifying

HOST=127.0.0.1:$PORT

in the config, but it breaks the system() calls to the mysql client in vntrseek.pl. To work around this I used these changes:

--- vntrseek-scg.pl	2019-02-23 10:09:42.652371611 -0800
+++ vntrseek.pl	2019-02-23 12:01:06.717409634 -0800
@@ -667,8 +667,13 @@
 
     write_mysql( $DBNAME, $opts{TMPDIR} );
 
+    my ($db_host, $db_port) = split(/:/, $opts{HOST});
+    if ($db_port ne "") {
+        $db_port = "--port $db_port";
+    }
+
     my $exstring
-        = "mysql -u $opts{LOGIN} --password=$opts{PASS} -h $opts{HOST} < $opts{TMPDIR}/${DBNAME}.sql";
+        = "mysql -u $opts{LOGIN} --password=$opts{PASS} -h ${db_host} ${db_port} < $opts{TMPDIR}/${DBNAME}.sql";
     system($exstring);
 
     $exstring = "rm -f $opts{TMPDIR}/${DBNAME}.sql";
@@ -1099,8 +1104,13 @@
         }
     }
 
+    my ($db_host, $db_port) = split(/:/, $opts{HOST});
+    if ($db_port ne "") {
+        $db_port = "--port $db_port";
+    }
+
     $exstring
-        = "mysql -u $opts{LOGIN} --password=$opts{PASS} -h $opts{HOST} --local-infile=1 $DBNAME < $opts{TMPDIR}/${DBNAME}_2.sql";
+        = "mysql -u $opts{LOGIN} --password=$opts{PASS} -h $db_host $db_port --local-infile=1 $DBNAME < $opts{TMPDIR}/${DBNAME}_2.sql";
     system($exstring);
     if ( $? == -1 ) {
         SetError( $STEP, "command failed: $!", -1 );

Could support for specifying the port be added? As a bonus, it would be great if we could specify a socket and run mysql with skip-network.
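The same idea as the patch above can be sketched in a way that also covers the socket case and a HOST value with no port. This is an illustration in Python, not a proposed change to vntrseek.pl; `mysql_client_args` is a hypothetical name, and IPv6 addresses are not handled. The `-u`, `-h`, `--port` and `--socket` options are standard mysql client options.

```python
from typing import List, Optional


def mysql_client_args(host_spec: str, user: str,
                      socket_path: Optional[str] = None) -> List[str]:
    """Build a mysql client argument list from a 'host[:port]' setting.

    If a socket path is given, it is used instead of host and port.
    The password is deliberately left out of this sketch.
    """
    args = ["mysql", "-u", user]
    if socket_path:
        args.append(f"--socket={socket_path}")
        return args
    # partition() returns an empty port when there is no ':' in host_spec.
    host, _, port = host_spec.partition(":")
    args += ["-h", host]
    if port:
        args.append(f"--port={port}")
    return args
```

Building an argument list rather than a single shell string also avoids quoting problems; the SQL file would then be supplied on standard input instead of through a `<` redirect.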

griznog

lowercase the db name

The run fails if the db name is given in a different case.

DBD::SQLite::db do failed: near "(": syntax error

Hi Yozen,

Our test run with 1.10.0-rc.3 died with this error:

29284921 profiles read, 8437576 profiles marked nonredundant. (time: 8097 seconds)

setting additional statistics...
Creating reference sequence database...
DBD::SQLite::db do failed: near "(": syntax error at /scg/apps/software/vntrseek/1.10.0-rc.3/vntrseek1.10.0-rc.3/lib/vutil.pm line 706.
command exited with value 2 at /scg/apps/software/vntrseek/1.10.0-rc.3/bin/vntrseek line 585.
Done vntrseek

My perl-fu is weak and I do not see an obvious problem in vutil.pm. Any idea what we are doing wrong here?

Our vs.conf looks like:

$ cat 7258930.vs.cnf
# Database backend
BACKEND=sqlite

# set this to the number of processors on your system 
# (or less if sharing the system with others or RAM is limited)
# eg, 8
NPROCESSES=128

# minimum required flank on both sides for a read TR to be considered
# eg, 10
MIN_FLANK_REQUIRED=10

# maximum flank length used in flank alignments
# set to big number to use all
# if read flanks are long with a lot of errors, 
# it might be useful to set this to something like 50
# max number of errors per flank is currently set to 8 (can be changed in main script only)
# eg, 1000
MAX_FLANK_CONSIDERED=50

# minimum number of mapped reads which agree on copy number to call an allele
# eg, 2
MIN_SUPPORT_REQUIRED=2

# Whether or not to keep reads detected as PCR duplicates. A nonzero (true) value
# means that detected PCR duplicates will not be removed. Default is 0.
KEEPPCRDUPS=1

# server name, used for html generating links
# eg, orca.bu.edu
SERVER=localhost

# for 454 platform, strip leading 'TCAG' 
# eg, 1 - yes
# eg, 0 - no (use no for all other platforms)
STRIP_454_KEYTAGS=0

# data is paired reads
# eg, 0 = no 
# eg, 1 - yes
IS_PAIRED_READS=1

# Sample ploidy. Default is 2. For haploid, set to 1.
PLOIDY=2

# Rebuild reference database
# eg, 0 = no 
# eg, 1 - yes
REDO_REFDB=0

# input data directory 
# (plain or gzipped fasta/fastq files)
# eg, /input
INPUT_DIR=/tmp/7258930/fasta

# output directory (must be writable and executable!)
# eg, /output
OUTPUT_ROOT=/home/username/output/7258930

# temp (scratch) directory (must be executable!)
# eg, /tmp
TMPDIR=/tmp/7258930

# names for the reference files 

# (leb36 file, sequence plus flank data file, indistinguishable references file) 
# files must be in install directory

# eg, hg19. This is the base name for files describing
# reference TR loci (.db, .seq, .leb36, and .indist)
REFERENCE=/tmp/7258930/reference/t26__

# generate a file of indistinguishable references, 
# necessary only if a file is not already available for the reference set
# eg, 1- generate
# eg, 0 - don't generate
REFERENCE_INDIST_PRODUCE=0
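
When comparing a config like this one against the documented defaults, it can help to load it into a dictionary first. The sketch below assumes the file is plain KEY=VALUE lines with `#` comments, as shown above; `parse_conf` is a hypothetical helper and not VNTRseek's own parser.

```python
from typing import Dict


def parse_conf(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line; skip rather than guess
        settings[key.strip()] = value.strip()
    return settings
```

For example, `parse_conf(open("7258930.vs.cnf").read())["BACKEND"]` would return the backend setting as a string, which makes it easy to print or diff the settings of two runs.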