atks / vt

A tool set for short variant discovery in genetic sequence data.

Home Page: http://genome.sph.umich.edu/wiki/vt

License: MIT License

C++ 39.03% C 59.91% Makefile 0.28% Shell 0.70% Batchfile 0.01% Assembly 0.01% Python 0.08%

variant-calling

vt's Issues

some samples are set to 0/2 by vt decompose -s

here's my original record

1   50891   .   T   C   69.2    PASS    AC=2;AF=0.020;AN=102;DP=178;FS=0.000;GQ_MEAN=4.18;GQ_STDDEV=2.00;InbreedingCoeff=-0.0591;MLEAC=1;MLEAF=9.804e-03;MQ=40.00;MQ0=0;NCC=116;QD=23.07;SOR=2.833;VQSLOD=2.95;culprit=FS;set=Intersection  GT:AD:DP:GQ:PL  0/0:1,0:1:3:0,3,32  0/0:8,0,0:8:0:0,0,221,0,221,221 0/0:21,0,0:21:63:0,63,686,63,686,686    0/0:14,0,0:14:42:0,42,487,42,487,487    0/0:6,0,0:6:0:0,0,135,0,135,135 .   0/0:3,0:3:0:0,0,33  0/0:6,0,0:6:5:0,5,169,5,169,169 0/0:1,0:1:3:0,3,33  0/0:1,0:1:3:0,3,37  0/0:2,0:2:6:0,6,69  0/0:20,0:20:57:0,57,855 0/0:1,0:1:3:0,3,30  0/0:35,0,0:35:57:0,57,977,57,977,977    0/0:11,0,0:11:33:0,33,378,33,378,378    .   .   .   0/0:8,0,0:8:0:0,0,210,0,210,210 0/0:6,0:6:15:0,15,225   0/0:10,0,0:10:0:0,0,132,0,132,132   0/0:1,0:1:3:0,3,28  0/0:12,0,0:12:36:0,36,372,36,372,372    0/0:21,0,0:21:34:0,34,651,34,651,651    0/0:1,0:1:3:0,3,28  0/0:1,0:1:3:0,3,32  0/0:30,0,0:30:62:0,62,929,62,929,929    0/0:24,0,0:24:39:0,39,782,39,782,782    0/0:1,0:1:3:0,3,30  0/0:3,0:3:9:0,9,100 0/1:7,4,0:11:79:79,0,191,100,203,303    .   0/0:11,0,0:11:33:0,33,382,33,382,382    0/1:5,6:11:99:152,0,109 0/0:19,0,0:19:54:0,54,810,54,810,810    0/1:14,4,0:18:69:69,0,389,111,401,512   0/0:16,0,0:16:11:0,11,496,11,496,496    0/0:2,0:2:6:0,6,65  0/0:25,0,0:25:38:0,38,800,38,800,800    0/0:1,0:1:3:0,3,27  0/1:7,7:14:99:116,0,193 0/0:11,0,0:11:33:0,33,390,33,390,390    0/0:14,0:14:39:0,39,585 0/1:3,2,0:5:47:47,0,80,56,86,142    0/0:14,0,0:14:10:0,10,344,10,344,344    .   .   .   .   0/0:1,0:1:3:0,3,27  0/0:1,0:1:3:0,3,33  0/0:4,0:4:0:0,0,17  0/0:9,0,0:9:12:0,12,268,12,268,268  0/0:9,0,0:9:27:0,27,286,27,286,286  0/0:2,0:2:6:0,6,56  0/0:3,0,0:3:9:0,9,72,9,72,72    0/0:1,0,0:1:3:0,3,35,3,35,35    .   0/0:12,0:12:33:0,33,495 .   
0/0:17,0:17:48:0,48,720 0/0:1,0:1:3:0,3,29  0/0:10,0:10:30:0,30,356 0/0:15,0,0:15:45:0,45,519,45,519,519    0/0:10,0,0:10:30:0,30,314,30,314,314    0/0:34,0,0:34:64:0,64,1085,64,1085,1085 0/0:2,0:2:6:0,6,73  0/0:7,0:7:18:0,18,270   0/0:27,0,0:27:81:0,81,910,81,910,910    .   .   0/0:10,0:10:27:0,27,405 0/0:19,0:19:51:0,51,765 0/0:10,0,0:10:29:0,29,262,29,262,262    .   0/0:1,0:1:3:0,3,35  0/0:41,0,0:41:96:0,96,1347,96,1347,1347 .   0/1:14,4,0:18:99:123,0,380,165,401,566  0/0:3,0,0:3:0:0,0,33,0,33,33    0/0:14,0:14:39:0,39,585 0/0:15,0,0:15:45:0,45,515,45,515,515    .   .   0/0:25,0,0:25:75:0,75,881,75,881,881    0/0:17,0,0:17:51:0,51,591,51,591,591    0/0:2,0:2:6:0,6,59  0/0:1,0:1:3:0,3,33  0/0:1,0:1:3:0,3,32  0/0:20,0,0:20:60:0,60,703,60,703,703    0/0:1,0:1:3:0,3,28  0/0:1,0:1:3:0,3,33  0/0:2,0:2:6:0,6,61  0/0:1,0:1:3:0,3,31  0/0:28,0,0:28:84:0,84,963,84,963,963    0/0:15,0,0:15:45:0,45,523,45,523,523    0/0:22,0,0:22:66:0,66,768,66,768,768    0/0:31,0,0:31:93:0,93,1057,93,1057,1057 0/0:14,0,0:14:31:0,31,423,31,423,423    0/0:11,0:11:33:0,33,408 0/0:19,0,0:19:0:0,0,481,0,481,481   0/0:4,0,0:4:0:0,0,52,0,52,52    0/0:14,0,0:14:42:0,42,494,42,494,494    0/0:8,0,0:8:24:0,24,271,24,271,271  0/1:6,3,0:9:66:66,0,163,84,172,256  0/0:12,0,0:12:9:0,9,313,9,313,313   0/0:17,0,0:17:15:0,15,551,15,551,551    0/0:1,0:1:3:0,3,34  0/0:33,0,0:33:75:0,75,1103,75,1103,1103 0/0:28,0,0:28:0:0,0,704,0,704,704   0/0:1,0:1:3:0,3,28  0/0:27,0,0:27:43:0,43,846,43,846,846    0/0:1,0:1:3:0,3,30  0/0:6,0,0:6:0:0,0,112,0,112,112 0/0:1,0:1:3:0,3,32  0/1:11,2,0:13:38:38,0,306,71,315,386    0/0:33,0,0:33:33:0,33,968,33,968,968    0/0:3,0:3:6:0,6,90  .   0/0:16,0,0:16:48:0,48,513,48,513,513    0/0:11,0,0:11:30:0,30,450,30,450,450    0/0:2,0:2:6:0,6,65  0/0:9,0,0:9:0:0,0,222,0,222,222 0/0:2,0:2:6:0,6,59  0/0:1,0:1:3:0,3,28  0/0:2,0,0:2:0:0,0,3,0,3,3   0/1:9,2,0:11:29:29,0,254,56,260,316 1/1:0,2,0:2:6:61,6,0,61,6,61    .   .   .   
0/0:13,0:13:36:0,36,540 0/0:20,0:20:54:0,54,810 0/0:14,0:14:29:0,29,446 .   .   0/0:1,0:1:3:0,3,25  0/1:12,11,3:26:99:262,0,340,241,247,701 0/0:43,0,0:43:38:0,38,1247,38,1247,1247 0/0:10,0,0:10:30:0,30,363,30,363,363    0/0:26,0,0:26:41:0,41,857,41,857,857    0/0:1,0:1:3:0,3,30  0/0:33,0,0:33:17:0,17,914,17,914,914    0/1:15,3,0:18:39:39,0,421,84,430,514    0/0:25,0,0:25:64:0,64,763,64,763,763    0/0:1,0:1:3:0,3,35  0/2:6,0,2:8:41:41,59,197,0,138,132  0/1:24,5,0:29:67:67,0,675,140,690,829   0/0:16,0,0:16:45:0,45,675,45,675,675    0/0:2,0:2:6:0,6,58  0/0:32,0,0:32:96:0,96,931,96,931,931    0/0:3,0:3:9:0,9,93  0/0:3,0:3:9:0,9,101 0/0:25,0:25:75:0,75,855 0/0:1,0:1:3:0,3,27  0/0:1,0:1:3:0,3,34  0/0:18,0:18:45:0,45,675 0/0:1,0:1:3:0,3,35  1/1:0,2,0:2:6:62,6,0,62,6,62    0/0:1,0:1:3:0,3,30  .   .   .   .   0/0:2,0:2:3:0,3,45  1/1:0,3:3:9:94,9,0  0/0:17,0:17:45:0,45,675 .   0/0:11,0,0:11:0:0,0,295,0,295,295   0/0:9,0,0:9:27:0,27,313,27,313,313  0/1:18,6,0:24:99:113,0,497,168,515,683  0/1:15,8,3:26:99:169,0,435,149,333,659  0/0:9,0,0:9:13:0,13,276,13,276,276  0/0:2,0:2:6:0,6,68  0/0:1,0:1:3:0,3,31  0/0:1,0:1:3:0,3,34  0/0:7,0,0:7:18:0,18,270,18,270,270  0/0:2,0:2:6:0,6,50  0/0:1,0:1:3:0,3,27  0/0:21,0,0:21:63:0,63,677,63,677,677    .   0/1:5,4,0:9:97:97,0,131,112,143,255 0/0:2,0:2:6:0,6,61  0/0:2,0:2:6:0,6,70  0/0:13,0,0:13:39:0,39,449,39,449,449    0/0:15,0,0:15:0:0,0,339,0,339,339   0/2:12,0,4:16:82:82,117,427,0,309,297   .   .   .   0/0:1,0:1:3:0,3,33  0/0:12,0,0:12:36:0,36,413,36,413,413    0/0:13,0,0:13:39:0,39,439,39,439,439    0/0:19,0,0:19:27:0,27,614,27,614,614    1/1:0,3:3:9:112,9,0 0/0:2,0,0:2:6:0,6,66,6,66,66    0/0:6,0,0:6:0:0,0,153,0,153,153 0/0:9,0,0:9:0:0,0,243,0,243,243 .   .   .

and here it is after decompose -s

1   50891   .   T   C   69.2    PASS    AC=2;AF=0.02;AN=102;DP=178;FS=0;GQ_MEAN=4.18;GQ_STDDEV=2;InbreedingCoeff=-0.0591;MLEAC=1;MLEAF=0.009804;MQ=40;MQ0=0;NCC=116;QD=23.07;SOR=2.833;VQSLOD=2.95;culprit=FS;set=Intersection  GT:AD:DP:GQ:PL  0/0:1,0:1:3:0,3,32  0/0:8,0,0:8:0:0,0,221,0,221,221 0/0:21,0,0:21:63:0,63,686,63,686,686    0/0:14,0,0:14:42:0,42,487,42,487,487    0/0:6,0,0:6:0:0,0,135,0,135,135 .:.:.:.:.   0/0:3,0:3:0:0,0,33  0/0:6,0,0:6:5:0,5,169,5,169,169 0/0:1,0:1:3:0,3,33  0/0:1,0:1:3:0,3,37  0/0:2,0:2:6:0,6,69  0/0:20,0:20:57:0,57,855 0/0:1,0:1:3:0,3,30  0/0:35,0,0:35:57:0,57,977,57,977,977    0/0:11,0,0:11:33:0,33,378,33,378,378    .:.:.:.:.   .:.:.:.:.   .:.:.:.:.   0/0:8,0,0:8:0:0,0,210,0,210,210 0/0:6,0:6:15:0,15,225   0/0:10,0,0:10:0:0,0,132,0,132,132   0/0:1,0:1:3:0,3,28  0/0:12,0,0:12:36:0,36,372,36,372,372    0/0:21,0,0:21:34:0,34,651,34,651,651    0/0:1,0:1:3:0,3,28  0/0:1,0:1:3:0,3,32  0/0:30,0,0:30:62:0,62,929,62,929,929    0/0:24,0,0:24:39:0,39,782,39,782,782    0/0:1,0:1:3:0,3,30  0/0:3,0:3:9:0,9,100 0/1:7,4,0:11:79:79,0,191,100,203,303    .:.:.:.:.   0/0:11,0,0:11:33:0,33,382,33,382,382    0/1:5,6:11:99:152,0,109 0/0:19,0,0:19:54:0,54,810,54,810,810    0/1:14,4,0:18:69:69,0,389,111,401,512   0/0:16,0,0:16:11:0,11,496,11,496,496    0/0:2,0:2:6:0,6,65  0/0:25,0,0:25:38:0,38,800,38,800,800    0/0:1,0:1:3:0,3,27  0/1:7,7:14:99:116,0,193 0/0:11,0,0:11:33:0,33,390,33,390,390    0/0:14,0:14:39:0,39,585 0/1:3,2,0:5:47:47,0,80,56,86,142    0/0:14,0,0:14:10:0,10,344,10,344,344    .:.:.:.:.   .:.:.:.:.   .:.:.:.:.   .:.:.:.:.   0/0:1,0:1:3:0,3,27  0/0:1,0:1:3:0,3,33  0/0:4,0:4:0:0,0,17  0/0:9,0,0:9:12:0,12,268,12,268,268  0/0:9,0,0:9:27:0,27,286,27,286,286  0/0:2,0:2:6:0,6,56  0/0:3,0,0:3:9:0,9,72,9,72,72    0/0:1,0,0:1:3:0,3,35,3,35,35    .:.:.:.:.   0/0:12,0:12:33:0,33,495 .:.:.:.:.   
0/0:17,0:17:48:0,48,720 0/0:1,0:1:3:0,3,29  0/0:10,0:10:30:0,30,356 0/0:15,0,0:15:45:0,45,519,45,519,519    0/0:10,0,0:10:30:0,30,314,30,314,314    0/0:34,0,0:34:64:0,64,1085,64,1085,1085 0/0:2,0:2:6:0,6,73  0/0:7,0:7:18:0,18,270   0/0:27,0,0:27:81:0,81,910,81,910,910    .:.:.:.:.   .:.:.:.:.   0/0:10,0:10:27:0,27,405 0/0:19,0:19:51:0,51,765 0/0:10,0,0:10:29:0,29,262,29,262,262    .:.:.:.:.   0/0:1,0:1:3:0,3,35  0/0:41,0,0:41:96:0,96,1347,96,1347,1347 .:.:.:.:.   0/1:14,4,0:18:99:123,0,380,165,401,566  0/0:3,0,0:3:0:0,0,33,0,33,33    0/0:14,0:14:39:0,39,585 0/0:15,0,0:15:45:0,45,515,45,515,515    .:.:.:.:.   .:.:.:.:.   0/0:25,0,0:25:75:0,75,881,75,881,881    0/0:17,0,0:17:51:0,51,591,51,591,591    0/0:2,0:2:6:0,6,59  0/0:1,0:1:3:0,3,33  0/0:1,0:1:3:0,3,32  0/0:20,0,0:20:60:0,60,703,60,703,703    0/0:1,0:1:3:0,3,28  0/0:1,0:1:3:0,3,33  0/0:2,0:2:6:0,6,61  0/0:1,0:1:3:0,3,31  0/0:28,0,0:28:84:0,84,963,84,963,963    0/0:15,0,0:15:45:0,45,523,45,523,523    0/0:22,0,0:22:66:0,66,768,66,768,768    0/0:31,0,0:31:93:0,93,1057,93,1057,1057 0/0:14,0,0:14:31:0,31,423,31,423,423    0/0:11,0:11:33:0,33,408 0/0:19,0,0:19:0:0,0,481,0,481,481   0/0:4,0,0:4:0:0,0,52,0,52,52    0/0:14,0,0:14:42:0,42,494,42,494,494    0/0:8,0,0:8:24:0,24,271,24,271,271  0/1:6,3,0:9:66:66,0,163,84,172,256  0/0:12,0,0:12:9:0,9,313,9,313,313   0/0:17,0,0:17:15:0,15,551,15,551,551    0/0:1,0:1:3:0,3,34  0/0:33,0,0:33:75:0,75,1103,75,1103,1103 0/0:28,0,0:28:0:0,0,704,0,704,704   0/0:1,0:1:3:0,3,28  0/0:27,0,0:27:43:0,43,846,43,846,846    0/0:1,0:1:3:0,3,30  0/0:6,0,0:6:0:0,0,112,0,112,112 0/0:1,0:1:3:0,3,32  0/1:11,2,0:13:38:38,0,306,71,315,386    0/0:33,0,0:33:33:0,33,968,33,968,968    0/0:3,0:3:6:0,6,90  .:.:.:.:.   0/0:16,0,0:16:48:0,48,513,48,513,513    0/0:11,0,0:11:30:0,30,450,30,450,450    0/0:2,0:2:6:0,6,65  0/0:9,0,0:9:0:0,0,222,0,222,222 0/0:2,0:2:6:0,6,59  0/0:1,0:1:3:0,3,28  0/0:2,0,0:2:0:0,0,3,0,3,3   0/1:9,2,0:11:29:29,0,254,56,260,316 1/1:0,2,0:2:6:61,6,0,61,6,61    .:.:.:.:.   .:.:.:.:.   
.:.:.:.:.   0/0:13,0:13:36:0,36,540 0/0:20,0:20:54:0,54,810 0/0:14,0:14:29:0,29,446 .:.:.:.:.   .:.:.:.:.   0/0:1,0:1:3:0,3,25  0/1:12,11,3:26:99:262,0,340,241,247,701 0/0:43,0,0:43:38:0,38,1247,38,1247,1247 0/0:10,0,0:10:30:0,30,363,30,363,363    0/0:26,0,0:26:41:0,41,857,41,857,857    0/0:1,0:1:3:0,3,30  0/0:33,0,0:33:17:0,17,914,17,914,914    0/1:15,3,0:18:39:39,0,421,84,430,514    0/0:25,0,0:25:64:0,64,763,64,763,763    0/0:1,0:1:3:0,3,35  0/2:6,0,2:8:41:41,59,197,0,138,132  0/1:24,5,0:29:67:67,0,675,140,690,829   0/0:16,0,0:16:45:0,45,675,45,675,675    0/0:2,0:2:6:0,6,58  0/0:32,0,0:32:96:0,96,931,96,931,931    0/0:3,0:3:9:0,9,93  0/0:3,0:3:9:0,9,101 0/0:25,0:25:75:0,75,855 0/0:1,0:1:3:0,3,27  0/0:1,0:1:3:0,3,34  0/0:18,0:18:45:0,45,675 0/0:1,0:1:3:0,3,35  1/1:0,2,0:2:6:62,6,0,62,6,62    0/0:1,0:1:3:0,3,30  .:.:.:.:.   .:.:.:.:.   .:.:.:.:.   .:.:.:.:.   0/0:2,0:2:3:0,3,45  1/1:0,3:3:9:94,9,0  0/0:17,0:17:45:0,45,675 .:.:.:.:.   0/0:11,0,0:11:0:0,0,295,0,295,295   0/0:9,0,0:9:27:0,27,313,27,313,313  0/1:18,6,0:24:99:113,0,497,168,515,683  0/1:15,8,3:26:99:169,0,435,149,333,659  0/0:9,0,0:9:13:0,13,276,13,276,276  0/0:2,0:2:6:0,6,68  0/0:1,0:1:3:0,3,31  0/0:1,0:1:3:0,3,34  0/0:7,0,0:7:18:0,18,270,18,270,270  0/0:2,0:2:6:0,6,50  0/0:1,0:1:3:0,3,27  0/0:21,0,0:21:63:0,63,677,63,677,677    .:.:.:.:.   0/1:5,4,0:9:97:97,0,131,112,143,255 0/0:2,0:2:6:0,6,61  0/0:2,0:2:6:0,6,70  0/0:13,0,0:13:39:0,39,449,39,449,449    0/0:15,0,0:15:0:0,0,339,0,339,339   0/2:12,0,4:16:82:82,117,427,0,309,297   .:.:.:.:.   .:.:.:.:.   .:.:.:.:.   0/0:1,0:1:3:0,3,33  0/0:12,0,0:12:36:0,36,413,36,413,413    0/0:13,0,0:13:39:0,39,439,39,439,439    0/0:19,0,0:19:27:0,27,614,27,614,614    1/1:0,3:3:9:112,9,0 0/0:2,0,0:2:6:0,6,66,6,66,66    0/0:6,0,0:6:0:0,0,153,0,153,153 0/0:9,0,0:9:0:0,0,243,0,243,243 .:.:.:.:.   .:.:.:.:.   .:.:.:.:.

note that some samples have genotype 0/2
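For anyone triaging this: in the record as pasted, the 0/2 calls (with three AD values and six PL values) seem to be present in the input as well as in the output, even though only one ALT allele is listed. A minimal sketch for flagging such samples follows, assuming tab-separated VCF lines; `out_of_range_genotypes` is a hypothetical helper, not part of vt, and the three-sample record is a shortened copy of the one above.

```python
import re

def out_of_range_genotypes(vcf_line):
    """Return (sample_index, GT) pairs whose allele index exceeds the ALT count."""
    fields = vcf_line.rstrip("\n").split("\t")
    n_alt = 0 if fields[4] == "." else len(fields[4].split(","))
    gt_pos = fields[8].split(":").index("GT")
    bad = []
    for i, sample in enumerate(fields[9:]):
        parts = sample.split(":")
        if gt_pos >= len(parts):
            continue
        gt = parts[gt_pos]
        # Split on phased or unphased separators and skip missing alleles.
        alleles = [a for a in re.split(r"[/|]", gt) if a != "."]
        if any(int(a) > n_alt for a in alleles):
            bad.append((i, gt))
    return bad

# Shortened version of the record above: one ALT allele, three samples.
record = "\t".join([
    "1", "50891", ".", "T", "C", "69.2", "PASS", "AC=2", "GT:AD:DP:GQ:PL",
    "0/0:1,0:1:3:0,3,32",
    "0/2:6,0,2:8:41:41,59,197,0,138,132",
    ".",
])
print(out_of_range_genotypes(record))  # [(1, '0/2')]
```

Running this over both the input and the output file would show whether the 0/2 calls come from decompose or from an earlier step.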

vt partition reporting no overlap

Hi,

I'd like to use vt partition to identify the similarities and differences between variants called by different variant callers on the same set of samples.

However I'm having some issues...

using VT to compare platypus and freebayes
vt partition platypus_chr22_normal_filt.vcf.gz freebayes_chr22_normal_filt.vcf.gz
returns this:

partition v0.5
Options: input VCF file a platypus_chr22_normal_filt.vcf.gz
input VCF file b freebayes_chr22_normal_filt.vcf.gz
A: 663 variants
B: 609 variants
ts/tv ins/del
A-B 663 [3.04] [1.20]
A&B 0 [-nan] [-nan]
B-A 609 [3.12] [1.50]
of A 0.0%
of B 0.0%
Time elapsed: 0.02s

i.e. there are zero overlaps between variants in the two files

However, using vcftools I can see there are ~565 variant overlaps between variants in the two files:

vcf-isec -f platypus_chr22_normal_filt.vcf.gz freebayes_chr22_normal_filt.vcf.gz | grep -v ^# | wc -l
565

Any thoughts as to why vt partition would not be working? It appears to be an issue with the Freebayes VCF, because when I compare against a VCF generated with VarScan2 I get the expected results from vt partition:

vt partition varscan_chr22_normal.vcf platypus_chr22_normal_filt.vcf.gz

partition v0.5
Options: input VCF file a varscan_chr22_normal.vcf
input VCF file b platypus_chr22_normal_filt.vcf.gz
A: 673 variants
B: 663 variants
ts/tv ins/del
A-B 97 [2.36] [1.00]
A&B 576 [2.93] [1.50]
B-A 87 [3.73] [0.75]
of A 85.6%
of B 86.9%
Time elapsed: 0.02s

Using vt partition with the varscan VCF and the freebayes VCF also returns 0 overlap, which is why I think the issue is with the freebayes VCF.

I have run vt partition comparing the freebayes VCF to itself, and that returns the expected results:
vt partition freebayes_chr22_normal_filt.vcf.gz freebayes_chr22_normal_filt.vcf.gz
partition v0.5

Options: input VCF file a freebayes_chr22_normal_filt.vcf.gz
input VCF file b freebayes_chr22_normal_filt.vcf.gz
A: 609 variants
B: 609 variants
ts/tv ins/del
A-B 0 [-nan] [-nan]
A&B 609 [3.12] [1.50]
B-A 0 [-nan] [-nan]
of A 100.0%
of B 100.0%
Time elapsed: 0.03s

Commands for generating the Freebayes VCF were this:

# call variants
freebayes --genotype-qualities -f $ref --region chr22 -L $bam_list > $out_dir/test_out_freebayes_chr22.vcf
# limit to cds regions & sort
vt view -I test.intervals test_out_freebayes_chr22.vcf.gz -o tmp_freebayes_chr22.vcf
vt sort tmp_freebayes_chr22.vcf -o freebayes_chr22.vcf
# normalise
vt normalize freebayes_chr22.vcf -r $ref -o freebayes_chr22_normal.vcf
# filter
vt view freebayes_chr22_normal.vcf -f "INFO.DP>=30&&QUAL>=20" -o freebayes_chr22_normal_filt.vcf
bgzip freebayes_chr22_normal_filt.vcf
tabix -p vcf freebayes_chr22_normal_filt.vcf.gz

Any help with this much appreciated

Chris
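One way to narrow this down might be to count the overlap two ways: by site (CHROM, POS) and by full allele key (CHROM, POS, REF, ALT). If the two counts diverge, the files disagree on allele representation rather than on position. The sketch below is only a diagnostic under that assumption; `variant_keys` and `overlap_summary` are hypothetical helpers, the records are made up, and it does not reproduce how vt partition or vcf-isec actually match variants.

```python
def variant_keys(lines):
    """Collect (chrom, pos, ref, alt) for every ALT allele in VCF text lines."""
    keys = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys

def overlap_summary(lines_a, lines_b):
    """Count shared sites (position only) and shared alleles (full key)."""
    a, b = variant_keys(lines_a), variant_keys(lines_b)
    sites_a = {(chrom, pos) for chrom, pos, _, _ in a}
    sites_b = {(chrom, pos) for chrom, pos, _, _ in b}
    return {
        "shared_sites": len(sites_a & sites_b),
        "shared_alleles": len(a & b),
    }

# Made-up records: same two positions, but the second ALT differs.
vcf_a = ["##fileformat=VCFv4.1", "chr22\t100\t.\tA\tG", "chr22\t200\t.\tC\tCT"]
vcf_b = ["chr22\t100\t.\tA\tG", "chr22\t200\t.\tC\tCTT"]
summary = overlap_summary(vcf_a, vcf_b)
print(summary)  # {'shared_sites': 2, 'shared_alleles': 1}
```

For the real files, the lines could be read with `gzip.open(path, "rt")` from the standard library.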

vt cat results in chrom replacement

Hi,

Thanks for developing vt - looks like it could be really helpful

I've come across an issue when using vt cat to combine VCF files generated on the same set of samples but on different chromosomes (as per the example in the docs: http://genome.sph.umich.edu/wiki/Vt#Concatenate). The chromosome name is replaced with whatever it is in the first VCF file supplied. I've tried both specifying the VCFs in full and using an input file list (-L).

An example of what I'm trying to do:

vt cat chr1.vcf chr2.vcf -o cat.vcf

The variants are concatenated in cat.vcf, but all the chromosome names are converted to chr1.

Chris

Reference genome file for vt normalize

I cannot find any documentation on what is expected for the reference fasta file for vt normalize (the -r option).

I tried using the hs37d5.fa file provided with your resource bundle. However, I get the following error:
[variant_manip.cpp:637 right_trim_or_left_extend] failure to extract base from fasta file: chr1:825765

I also tried concatenating the UCSC files for chromosomes 1-22, X, Y, and M. This gave me the same error but at a later point in the file:
[variant_manip.cpp:637 right_trim_or_left_extend] failure to extract base from fasta file: chr19:423687

My command line:
vt normalize -r hs37d5.fa Genotype.vcf.gz -o Genotype.vt.normalized.vcf

I can't share the entire input file because it is sensitive. However, I get the same error with a tiny input file containing the problem record:

##fileformat=VCFv4.1
##reference=ftp://ftp.completegenomics.com/ReferenceFiles/build37.fa.bz2
##contig=<ID=1,length=249250621,assembly=B37,md5=1b22b98cdeb4a9304cb5d48026a85128,species="Homo sapiens">
##contig=<ID=19,length=59128983,assembly=B37,md5=1aacd71f30db8e561810913e0b72636d,species="Homo sapiens">
##ALT=<ID=CGA_NOCALL,Description="No-called record">
##ALT=<ID=INS:ME:ALU,Description="Insertion of ALU element">
##ALT=<ID=INS:ME:L1,Description="Insertion of L1 element">
##ALT=<ID=INS:ME:SVA,Description="Insertion of SVA element">
##ALT=<ID=INS:ME:MER,Description="Insertion of MER element">
##ALT=<ID=INS:ME:LTR,Description="Insertion of LTR element">
##ALT=<ID=INS:ME:PolyA,Description="Insertion of PolyA element">
##ALT=<ID=INS:ME:HERV,Description="Insertion of HERV element">
##FILTER=<ID=VQLOW,Description="Call is homozygous and the varScoreVAF is less than 20dB, or the call is not homozygous and the varScoreVAF is less than 40dB">
##FILTER=<ID=SQLOW,Description="Somatic variant has somaticScore < -10">
##FILTER=<ID=FET30,Description="Fisher somatic score < 30">
##FILTER=<ID=AMBIGUOUS,Description="Read evidence does not strongly distinguish multiple non-reference candidate alleles">
##FILTER=<ID=URR,Description="Too close to an underrepresented repeat">
##FILTER=<ID=MPCBT,Description="Mate pair count below 10">
##FILTER=<ID=SHORT,Description="Junction side length below 70">
##FILTER=<ID=TSNR,Description="Transition sequence not resolved">
##FILTER=<ID=INTERBL,Description="Interchromosomal junction in baseline">
##FILTER=<ID=sns75,Description="Sensitivity to known MEI calls in range (.75,.95] i.e. medium FDR">
##FILTER=<ID=sns95,Description="Sensitivity to known MEI calls in range (.95,1.00] i.e. high to very high FDR">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele">
##INFO=<ID=CGA_XR,Number=A,Type=String,Description="Per-ALT external database reference (dbSNP, COSMIC, etc)">
##INFO=<ID=AF,Number=A,Type=String,Description="Allele frequency, or &-separated frequencies for complex variants (in latter, '?' designates unknown parts)">
##INFO=<ID=CGA_FI,Number=A,Type=String,Description="Functional impact annotation">
##INFO=<ID=CGA_PFAM,Number=.,Type=String,Description="PFAM Domain">
##INFO=<ID=CGA_MIRB,Number=.,Type=String,Description="miRBaseId">
##INFO=<ID=CGA_RPT,Number=.,Type=String,Description="repeatMasker overlap information">
##INFO=<ID=CGA_SDO,Number=1,Type=Integer,Description="Number of distinct segmental duplications that overlap this locus">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=CGA_WINEND,Number=1,Type=Integer,Description="End of coverage window">
##INFO=<ID=CGA_BF,Number=1,Type=Float,Description="Frequency in baseline">
##INFO=<ID=CGA_MEDEL,Number=4,Type=String,Description="Consistent with deletion of mobile element; type,chromosome,start,end">
##INFO=<ID=MATEID,Number=1,Type=String,Description="ID of mate breakend">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=CGA_BNDG,Number=A,Type=String,Description="Transcript name and strand of genes containing breakend">
##INFO=<ID=CGA_BNDGO,Number=A,Type=String,Description="Transcript name and strand of genes containing mate breakend">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=MEINFO,Number=4,Type=String,Description="Mobile element info of the form NAME,START,END,POLARITY">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase Set">
##FORMAT=<ID=SS,Number=1,Type=String,Description="Somatic Status: Germline, Somatic, LOH, or . (Unknown)">
##FORMAT=<ID=FT,Number=1,Type=String,Description="Genotype filters">
##FORMAT=<ID=CGA_ALTCALLS,Number=2,Type=String,Description="Alternative call sequences and scores">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##FORMAT=<ID=EHQ,Number=2,Type=Integer,Description="Haplotype Quality, Equal Allele Fraction Assumption">
##FORMAT=<ID=CGA_CEHQ,Number=2,Type=Integer,Description="Calibrated Haplotype Quality, Equal Allele Fraction Assumption">
##FORMAT=<ID=GL,Number=.,Type=Integer,Description="Genotype Likelihood">
##FORMAT=<ID=CGA_CEGL,Number=.,Type=Integer,Description="Calibrated Genotype Likelihood, Equal Allele Fraction Assumption">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total Read Depth">
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allelic depths (number of reads in each observed allele)">
##FORMAT=<ID=CGA_RDP,Number=1,Type=Integer,Description="Number of reads observed supporting the reference allele">
##FORMAT=<ID=CGA_GP,Number=1,Type=Float,Description="Depth of coverage for 2k window GC normalized to mean">
##FORMAT=<ID=CGA_NP,Number=1,Type=Float,Description="Coverage for 2k window, GC-corrected and normalized relative to copy-number-corrected multi-sample baseline">
##FORMAT=<ID=CGA_CL,Number=1,Type=Float,Description="Nondiploid-model called level">
##FORMAT=<ID=CGA_LS,Number=1,Type=Integer,Description="Nondiploid-model called level score">
##FORMAT=<ID=CGA_CP,Number=1,Type=Integer,Description="Diploid-model called ploidy">
##FORMAT=<ID=CGA_PS,Number=1,Type=Integer,Description="Diploid-model called ploidy score">
##FORMAT=<ID=CGA_CT,Number=1,Type=String,Description="Diploid-model CNV type">
##FORMAT=<ID=CGA_TS,Number=1,Type=Integer,Description="Diploid-model CNV type score">
##FORMAT=<ID=CGA_BNDMPC,Number=1,Type=Integer,Description="Mate pair count supporting breakend">
##FORMAT=<ID=CGA_BNDPOS,Number=1,Type=Integer,Description="Breakend position">
##FORMAT=<ID=CGA_BNDDEF,Number=1,Type=String,Description="Breakend definition">
##FORMAT=<ID=CGA_BNDP,Number=1,Type=String,Description="Precision of breakend">
##FORMAT=<ID=CGA_IS,Number=1,Type=Float,Description="MEI InsertionScore: confidence in occurrence of an insertion">
##FORMAT=<ID=CGA_IDC,Number=1,Type=Float,Description="MEI InsertionDnbCount: count of paired ends supporting insertion">
##FORMAT=<ID=CGA_IDCL,Number=1,Type=Float,Description="MEI InsertionLeftDnbCount: count of paired ends supporting insertion on 5' end of insertion point">
##FORMAT=<ID=CGA_IDCR,Number=1,Type=Float,Description="MEI InsertionRightDnbCount: count of paired ends supporting insertion on 3' end of insertion point">
##FORMAT=<ID=CGA_RDC,Number=1,Type=Integer,Description="MEI ReferenceDnbCount: count of paired ends supporting reference allele">
##FORMAT=<ID=CGA_NBET,Number=1,Type=String,Description="MEI NextBestElementType: (sub)type of second-most-likely inserted mobile element">
##FORMAT=<ID=CGA_ETS,Number=1,Type=Float,Description="MEI ElementTypeScore: confidence that insertion is of type indicated by CGA_ET/ElementType">
##FORMAT=<ID=CGA_KES,Number=1,Type=Float,Description="MEI KnownEventSensitivityForInsertionScore: fraction of known MEI insertion polymorphisms called for this sample with CGA_IS at least as high as for the current call">

chr1 825681 . AGGCGTGAGCCACTGCACCCGGCCTTGACTTC . . nc END=825712 GT .
chr1 825742 . ACAAGGGGGTTCT . . nc END=825754 GT .
chr1 825767 GS000007651-ASM_1860_L C ]1:5726936]C . PASS NS=1;SVTYPE=BND;MATEID=GS000007651-ASM_1860_R;CGA_BF=0.92 GT 1
chr1 825796 . CAGCCTGAGTGACAGAGTGAG . . nc END=825816 GT .
chr1 825826 . C . . nc END=825826 GT .
chr19 423506 . CTCGCTC . . nc END=423512 GT .
chr19 423541 . TCCTGGGGGGTCCTCCCCCCCT . . nc END=423562 GT .
chr19 423689 GS000007651-ASM_2163_R C ]19:423261]C . PASS NS=1;SVTYPE=BND;MATEID=GS000007651-ASM_2163_L;CGA_BF=0.75;CGA_XR=rs72175445;CGA_BNDG=NM_012435|-;CGA_BNDGO=NM_012435|- GT 1
chr19 425290 rs73916977 G A . PASS NS=1;AN=2;AC=1;CGA_XR=dbsnp.130|rs73916977;CGA_FI=25759|NM_012435.2|SHC2|INTRON|UNKNOWN-INC GT:GQ:HQ:EHQ 1/0:400:400,400:398,398
chr19 426030 . A . . nc END=426030 GT .

Many thanks for any insight you can provide!
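Two things in the pasted file might be worth ruling out. First, the header declares contigs as 1 and 19 while the records use chr1 and chr19, and hs37d5.fa names its chromosomes without the chr prefix. Second, both failing coordinates (chr1:825765 and chr19:423687) sit two bases before a breakend (BND) record. The sketch below checks both conditions, assuming whitespace-separated records as pasted here and a `.fai` index whose first column is the contig name; `fasta_contigs` and `check_records` are hypothetical helpers, the `.fai` offsets are illustrative, and whether either condition actually triggers the vt error is not confirmed.

```python
import re

def fasta_contigs(fai_lines):
    """Contig names are the first tab-separated column of a .fai index."""
    return {line.split("\t")[0] for line in fai_lines if line.strip()}

def check_records(vcf_lines, contigs):
    """Return (contigs missing from the FASTA, records with non-sequence ALTs)."""
    missing, non_sequence = set(), []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.split()[:5]
        if chrom not in contigs:
            missing.add(chrom)
        # Breakend and symbolic alleles contain brackets or angle brackets.
        if alt != "." and not re.fullmatch(r"[ACGTNacgtn,]+", alt):
            non_sequence.append((chrom, int(pos), alt))
    return sorted(missing), non_sequence

# Contig lengths come from the header above; the offsets are illustrative.
fai = ["1\t249250621\t52\t60\t61", "19\t59128983\t253404903\t60\t61"]
records = [
    "chr1 825767 GS000007651-ASM_1860_L C ]1:5726936]C . PASS NS=1;SVTYPE=BND",
    "chr19 425290 rs73916977 G A . PASS NS=1;AN=2;AC=1",
]
missing, non_sequence = check_records(records, fasta_contigs(fai))
print(missing)       # ['chr1', 'chr19']
print(non_sequence)  # [('chr1', 825767, ']1:5726936]C')]
```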

Incorrect mitochondrial SV normalization

vt normalize produces an incorrect position when transforming a mitochondrial structural variation line:

C is the reference at position 195. T is the reference at 196.

failure to extract base from fasta file - vt error

[variant_manip.cpp:464 right_trim_or_left_extend] failure to extract base from fasta file: 1:2338229

I have the following variant in this window:

1 2338231 c.764_765insA T TT . . OMIM=602859.0005;dbSNP=rs61750435;GENELIST=PEX10;CLINSIG=Pathogenic;TRAIT="Peroxisomebiogenesisdisorder6A";VT=Duplication;TRANSCRIPT=NM_002617.3;ASSEMBLY=GRCh37;METHOD=Eutils

I’m a bit puzzled about the error. Is the fasta sequence not matching the allele listed in the VCF record?

Makefile: `test` should depend on `vt`

Memory leak

I am trying to use 'vt decompose' to decompose variants from a large VCF file (~80G compressed). When I use the '-s' option, the RAM usage grows to very high levels. It appears to roughly track the size of the uncompressed input file read to that point. I don't see the problem running 'vt decompose' without the -s option. I have started trying to track down the source, but haven't found it yet.

Your help would be much appreciated. Thanks.

-Jason

-i option does not have effect

It appears that the -i option has no effect on the size or content of the output file when using vt normalize compiled from 0d53450:

A run with -i "22:20000000-21000000,22:10000000-11000000" and a run without -i produce exactly the same output.

make fails with missing #include pregex.h

hi, on a current ubuntu 14.04 i am getting this error with the latest pull:

g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o annotate_1000g.o -c annotate_1000g.cpp
In file included from annotate_1000g.h:36:0,
from annotate_1000g.cpp:24:
filter.h:30:20: fatal error: pregex.h: No such file or directory
#include "pregex.h"
^
compilation terminated.
make: *** [annotate_1000g.o] Error 1

googling and dpkg don't turn up anything for "pregex.h", is it a typo?

thanks for all your work.

mike d

New release?

Hi Adrian,

Are there any plans to make a new release on Github soon?

vt normalize doesn't handle collisions

vt normalize doesn't cover certain cases where two variants are very close together and the right one wants to left-align past the other one.

Here is an example:

chrM reference 56-60 is ATTTT

Here are two lines from a fictitious VCF file. Together they specify ATTTT-> AGTTTT
chrM 57 . T G
chrM 59 . T TT

A correct normalization would be:
chrM 57 . T G
chrM 58 . T TT

(The ideal normalization would be a single line, but vt normalize is not intended to do anything this sophisticated:
chrM 56 . A AG )

vt incorrectly gives the below, which together specify ATTTT->ATGTTT, different from what's specified by the original
chrM 56 . A AT
chrM 57 . T G

You may find it helpful that this same issue came up with the SMaSH normalizer: https://groups.google.com/forum/#!topic/smash-benchmarking/2Vnn7OR0ug8

They addressed it by writing some collision-handling code: amplab/smash#7
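The example can be checked mechanically by applying each set of records to the reference and comparing the resulting haplotypes. This is a minimal sketch, assuming the records in a set do not overlap and using the chrM 56-60 sequence given above; `apply_variants` is a hypothetical helper, not vt code.

```python
def apply_variants(ref_seq, ref_start, variants):
    """Apply non-overlapping (pos, ref, alt) records to a reference window."""
    seq = ref_seq
    # Work right to left so edits do not shift the coordinates still to come.
    for pos, ref, alt in sorted(variants, reverse=True):
        i = pos - ref_start
        if seq[i:i + len(ref)] != ref:
            raise ValueError(f"REF {ref} does not match reference at {pos}")
        seq = seq[:i] + alt + seq[i + len(ref):]
    return seq

window, start = "ATTTT", 56  # chrM reference 56-60

original = [(57, "T", "G"), (59, "T", "TT")]
expected = [(57, "T", "G"), (58, "T", "TT")]
vt_output = [(56, "A", "AT"), (57, "T", "G")]
single_line = [(56, "A", "AG")]

print(apply_variants(window, start, original))     # AGTTTT
print(apply_variants(window, start, expected))     # AGTTTT
print(apply_variants(window, start, vt_output))    # ATGTTT
print(apply_variants(window, start, single_line))  # AGTTTT
```

Only the vt output yields a different haplotype, which matches the description above.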

Segmentation fault 11 when running vt subset on mac

I've tried running vt subset on a couple different vcf files, and I keep getting a segmentation fault: 11 error. make test works fine without errors. Any idea what might be happening? Here's the command and error:

PN105860:vt jzook$ ./vt subset -s /Users/jzook/Documents/AJTrio/NCBI_IlluminaHiSeq300X_cortex_09042015/HG002sample.txt -o /Users/jzook/Documents/AJTrio/NCBI_IlluminaHiSeq300X_cortex_09042015/AJtrio_HiSeq300X_cortex_variants_GRCh37_09042015_insgt49vt.vcf -f "DLEN>49" /Users/jzook/Documents/AJTrio/NCBI_IlluminaHiSeq300X_cortex_09042015/AJtrio_HiSeq300X_cortex_variants_GRCh37_09042015.vcf.gz
subset v0.5

Options: input VCF File /Users/jzook/Documents/AJTrio/NCBI_IlluminaHiSeq300X_cortex_09042015/AJtrio_HiSeq300X_cortex_variants_GRCh37_09042015.vcf.gz
[s] sample file list 1 samples
[f] filter DLEN>49

Segmentation fault: 11

Here are my clang and gcc versions:
PN105860:vt jzook$ clang --version
Apple LLVM version 6.0 (clang-600.0.56) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin13.4.0
Thread model: posix

PN105860:vt jzook$ gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk/usr/include/c++/4.2.1
Apple LLVM version 6.0 (clang-600.0.56) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin13.4.0
Thread model: posix

Thanks!
Justin

make error (Mac 10.11.3)

$ make test
g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o profile_mendelian.o -c profile_mendelian.cpp
profile_mendelian.cpp:45:5: error: unknown type name 'float_t'; did you mean 'float'?
float_t min_gq;
^~~~~~~
float
profile_mendelian.cpp:463:16: error: use of undeclared identifier 'NAN'
return NAN;
^
profile_mendelian.cpp:518:26: error: use of undeclared identifier 'NAN'
return (het==0 ? NAN : hom/het);
^
profile_mendelian.cpp:573:32: error: use of undeclared identifier 'NAN'
return ((het+hom)==0 ? NAN : het/(hom+het)*100);
^
4 errors generated.
make: *** [profile_mendelian.o] Error 1

installation error

Hi,

I am trying to install vt using the commands stated in the wiki at http://genome.sph.umich.edu/wiki/Vt#Installation

However, I am receiving a fatal error..

In file included from annotate_indels.cpp:24:
In file included from ./annotate_indels.h:28:
In file included from ./vntr_annotator.h:28:
In file included from ./candidate_motif_picker.h:31:
In file included from ./motif_tree.h:28:
./motif_map.h:27:10: fatal error: 'cstdint' file not found
#include <cstdint>
         ^
1 error generated.
make: *** [annotate_indels.o] Error 1

Would appreciate any guidance.

Regards,
Saumya

Reference mismatch

Hi there,

I was wondering if you'd consider implementing a specific flag for vt normalize. We're getting an error that looks like this with some of our genomes:

Variant is not consistent: chrY:59034049-59034049 - N(REF) vs A(FASTA)

This is because it's in the PAR region and the reference we're feeding it has that masked. We don't really want to use the -n flag because we would like to catch other errors if they happen, but we'd like to ignore errors where the REF is N.

Thanks,
Denise

Don't understand behavior of decomposition

I have to decompose a VCF and used the instructions provided for Gemini described here:
http://gemini.readthedocs.io/en/latest/

zless $VCF \
   | sed 's/ID=AD,Number=./ID=AD,Number=R/' \
   | vt decompose -s - \
   | vt normalize -r $REF - \

However, when I dug back into some multiallelic sites I came across this:

Original VCF -

1 4992134 rs35519208 TTATATA T,TTATATATATATATA

decomposed vcf
1 4992134 rs35519208 TTATATA T
1 4992134 rs35519208 T TTATATATA
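The second decomposed line looks surprising but is consistent with trimming: once the alleles are split, TTATATA and TTATATATATATATA share the six-base suffix TATATA, and removing it leaves T and TTATATATA at the same position. The sketch below reproduces only that trimming step, as a simplification; real normalization also left-aligns against the reference, which this does not attempt. `trim_alleles` is a hypothetical helper, not vt code.

```python
def trim_alleles(pos, ref, alt):
    """Trim shared bases from the right, then the left, keeping one base each."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    return pos, ref, alt

# The two ALT alleles of the original record, handled one at a time.
print(trim_alleles(4992134, "TTATATA", "T"))
# (4992134, 'TTATATA', 'T')
print(trim_alleles(4992134, "TTATATA", "TTATATATATATATA"))
# (4992134, 'T', 'TTATATATA')
```

Both representations describe the same insertion of eight bases; the trimmed form just drops the redundant context.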

REF prefixes differ after vt normalizing two files

Hello,
I have normalized two VCF files with vt, and when I try to merge them I get a "REF prefixes differ" error at multiple positions. Here are the positions, REF and ALT before and after vt normalization. I can't understand how vt normalization can result in different REF alleles at the same position. Thank you for your help.

file1 before vt normalization:
35301 TTAAAAAAACTTATAAACGTAAA TCAAAAAAAACTTATAAACGTAAG
file1 after normalization:
35301 T TC
file2 after normalization
35301 A AT OLD_VARIANT=gi|602625715|gb|AE004092.2|:35302:T/TT

add `.gitignore` with `*.o` and `vt`

make error (Mac)

On Mac OS X 10.9.5, make generates an error:

g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o paste_and_compute_features_sequential.o -c paste_and_compute_features_sequential.cpp
paste_and_compute_features_sequential.cpp:699:9: error: use of undeclared identifier 'isnanf'; did you mean 'isnan'?
if ( isnanf(fic) ) fic = 0;
^~~~~~
isnan
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/cmath:425:1: note: 'isnan' declared here
isnan(_A1 __x) _NOEXCEPT
^
1 error generated.
make: *** [paste_and_compute_features_sequential.o] Error 1

Shouldn't
https://github.com/atks/vt/blob/master/paste_and_compute_features_sequential.cpp#L699
be if ( isnan(fic) ) fic = 0;?

vt normalize v0.5 "Floating point exception (core dumped)" error

I just read your excellent variant normalization article and decided to try processing a few of my VCF files. For a file of INDELs (produced by GATK), I received a "Floating point exception (core dumped)" exception. For other files, it worked fine. I wish I could give you something more concrete to troubleshoot, but that was the extent of the error message, and unfortunately I cannot share the variant data due to confidentiality issues. However, I will see if I can whittle my error-generating file down to check if there is a specific entry causing the exception, and I'll report back if I find anything.

Problem with sorting COSMIC

The following fails reproducibly on COSMIC v72 (available after registration from http://cancer.sanger.ac.uk/cosmic) with the current version.

vt cat CosmicNonCodingVariants.normalized.vcf.gz CosmicCodingMuts.normalized.vcf.gz \
        | vt sort /dev/stdin -o out.vcf.gz

build error with gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

My system has:

gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
Ubuntu 12.04.5 LTS \n \l

$ git clone https://github.com/atks/vt.git
$ cd vt
$ make
...
make[1]: Leaving directory `/BiO/BioTools/vt/lib/libsvm'
g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o align.o -c align.cpp
g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o allele.o -c allele.cpp
g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o annotate_1000g.o -c annotate_1000g.cpp
g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o annotate_dbsnp_rsid.o -c annotate_dbsnp_rsid.cpp
g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o annotate_indels.o -c annotate_indels.cpp
g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o annotate_mendelian.o -c annotate_mendelian.cpp
g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o annotate_regions.o -c annotate_regions.cpp
g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o annotate_variants.o -c annotate_variants.cpp
g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o annotate_vntrs.o -c annotate_vntrs.cpp
g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o augmented_bam_record.o -c augmented_bam_record.cpp
g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o bam_ordered_reader.o -c bam_ordered_reader.cpp
g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o bcf_genotyping_buffered_reader.o -c bcf_genotyping_buffered_reader.cpp
g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o bcf_ordered_reader.o -c bcf_ordered_reader.cpp
g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o bcf_ordered_writer.o -c bcf_ordered_writer.cpp
g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o bcf_synced_reader.o -c bcf_synced_reader.cpp
g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o bed.o -c bed.cpp
g++ -pipe -std=c++0x -O3 -I./lib -I. -I./lib/htslib -I./lib/Rmath -I./lib/pcre2 -D__STDC_LIMIT_MACROS -o candidate_motif_picker.o -c candidate_motif_picker.cpp
candidate_motif_picker.cpp: In member function ‘bool CandidateMotifPicker::get_indel(std::string, std::string, std::string&)’:
candidate_motif_picker.cpp:186:13: error: ‘std::string’ has no member named ‘pop_back’
candidate_motif_picker.cpp:187:13: error: ‘std::string’ has no member named ‘pop_back’
make: *** [candidate_motif_picker.o] Error 1

I checked this thread. (http://stackoverflow.com/questions/20891441/using-string-pop-back-and-string-back)

Can I build vt on my system?

Build error

Building the current master fails with:

g++ -pipe -std=c++0x -O3 -ggdb -I./lib/include/ -I. -I./lib/include/htslib -I./lib/include/Rmath  -D__STDC_LIMIT_MACROS -o vt align.o allele.o annotate_dbsnp_rsid.o annotate_indel.o annotate_regions.o annotate_variants.o bam_ordered_reader.o bcf_ordered_reader.o bcf_ordered_writer.o bcf_synced_reader.o bed.o candidate_motif.o cat.o chmm.o compute_concordance.o compute_features.o config.o consolidate_variants.o construct_probes.o context_filter.o decompose.o decompose_blocksub.o discover.o discover2.o estimate.o estimator.o filter.o gencode.o genome_interval.o genotype.o genotype2.o genotyping_buffer.o hts_utils.o index.o interval_tree.o interval.o lfhmm.o lhmm.o lhmm1.o lhmm_genotyping_record.o log_tool.o merge.o merge_candidate_variants.o motif_suffix_tree.o ordered_bcf_overlap_matcher.o ordered_region_overlap_matcher.o partition.o paste.o pedigree.o peek.o pileup.o profile_afs.o profile_chm1.o profile_chrom.o profile_fic_hwe.o profile_hwe.o profile_indels.o profile_len.o profile_mendelian.o profile_na12878.o profile_snps.o program.o remove_overlap.o rfhmm.o seq.o sort.o str.o subset.o sv_tree.o test.o union_variants.o uniq.o utils.o validate.o variant.o variant_manip.o view.o vntrize.o tbx_ordered_reader.o ahmm.o xcmp.o normalize.o main.o lib/include/htslib/libhts.a lib/include/Rmath/libRmath.a -lz -lpthread
discover2.o: In function `flush':
/data/zappadata_p1/matt/inbox/vt/discover2.cpp:547: undefined reference to `Pileup::diff(unsigned long, unsigned long)'
collect2: error: ld returned 1 exit status
make: *** [vt] Error 1

gcc 4.8.1

vt decompose -s for Type=String

Is this not supported because there could be unrelated commas in the value?
Could decompose support splitting when Type=String?

You can see an example VCF of this here:
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar_20150305.vcf.gz

where the INFO fields starting with "CLN" have comma separated values with multiple alts.

We have adjusted the header to set Number=A for INFO's we want to split.
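To illustrate what is being asked for, here is a rough Python sketch that splits the values of selected INFO keys across the ALT alleles, one value per allele. `split_info_per_alt` is a hypothetical helper, not a vt function, and the INFO values in the test data are made up. It refuses to split when the number of comma-separated values does not match the number of ALTs, which is the "unrelated commas" case.

```python
from typing import List, Set


def split_info_per_alt(info: str, n_alt: int, per_alt_keys: Set[str]) -> List[str]:
    """Return one INFO string per ALT allele (hypothetical helper, not vt code).

    Keys listed in per_alt_keys are treated as Number=A and split on commas;
    every other field is copied unchanged to each output record.
    """
    fields = [f for f in info.split(";") if f]
    records = []
    for i in range(n_alt):
        parts = []
        for field in fields:
            key, sep, value = field.partition("=")
            if sep and key in per_alt_keys:
                values = value.split(",")
                if len(values) != n_alt:
                    # Commas that are not allele separators would end up here.
                    raise ValueError(f"{key}: expected {n_alt} values, got {len(values)}")
                parts.append(f"{key}={values[i]}")
            else:
                parts.append(field)
        records.append(";".join(parts))
    return records
```

For a two-ALT record, `split_info_per_alt("CLNSIG=5,2;RS=123", 2, {"CLNSIG"})` yields `CLNSIG=5;RS=123` and `CLNSIG=2;RS=123`.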

Segmentation fault on Decompose

Hi, I am running vt decompose on a multi-sample VCF file generated by GATK, as part of the preprocessing workflow for loading the file into GEMINI, and it terminates with a segmentation fault. When I reran the command after updating vt, it failed again, this time leaving a larger output file. The command I used is:

vt decompose -s -o output.vcf input.vcf

Kindly clarify.

vt decompose adjust counts for ref, alt, dp in genotypes

GATK uses AD=ref,alt1,alt2...
freebayes uses RO=ref, AO=alt1,alt2

It would be nice if these were adjusted per allele when decomposing (and then used to recalculate DP) so that they could be used downstream to derive allele frequency, etc.
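A small Python sketch of the idea, under stated assumptions: the depths are already parsed into integers, and DP for a decomposed record is recomputed as the sum of the two retained depths, which is only one possible reading of "recalc DP". Both function names are hypothetical.

```python
from typing import List, Tuple


def ad_from_freebayes(ro: int, ao: List[int]) -> List[int]:
    """Combine freebayes RO and AO into a GATK-style AD list [ref, alt1, alt2, ...]."""
    return [ro] + list(ao)


def split_allele_depths(ad: List[int]) -> List[Tuple[List[int], int]]:
    """For each ALT allele, return ([ref_depth, alt_depth], recomputed_dp).

    Assumption: DP is recomputed as ref_depth + alt_depth for that record.
    """
    ref_depth = ad[0]
    result = []
    for alt_depth in ad[1:]:
        pair = [ref_depth, alt_depth]
        result.append((pair, sum(pair)))
    return result


print(split_allele_depths([7, 4, 0]))
# [([7, 4], 11), ([7, 0], 7)]
```

The per-record allele fraction then follows directly as `alt_depth / dp` whenever `dp` is non-zero.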

Decomposition of a SNP adjacent to an Indel

@holtgrewe

Should A/GC be decomposed to A/G and A/AC?

@ekg - Erik, what's your take on this case?

Optional log file

Would it be possible to (optionally) output all log information to a defined file? We currently catch everything going to stderr and process to identify potential errors in our pipeline. Current implementation means we always have content in stderr and it is almost never an error. What would be great would be a --log option for all vt tools, so that log information goes to that file, errors go to stderr and the vcf/bcf goes to stdout.

running uniq causes loss of data in INFO field.

I've just taken a VCF file and run normalize on it, with the following results:

1       1886346 rs35830547      G       GGCC    .       PASS    DBSNP=dbSNP_126;EA_AC=1377,3247;AA_AC=1189,1297;TAC=2566,4544;MAF=29.7794,47.8278,36.09;GTS=A1A1,A1R,RR;EA_GTC=306,765,1241;AA_GTC=406,377,460;GTC=712,1142,1701;DP=90;GL=KIAA1751;CP=0;CG=0.6;AA=.;CA=.;EXOME_CHIP=no;GWAS_PUBMED=.;FG=NM_001080484.1:utr-3;HGVS_CDNA_VAR=NM_001080484.1:c.*669_*670insCGG;HGVS_PROTEIN_VAR=.;CDS_SIZES=NM_001080484.1:2289;GS=.;PH=.;EA_AGE=.;AA_AGE=.;GRCh38_POSITION=1:1954908;OLD_VARIANT=1:1886347:G/GCCG
1       1886346 ~rs35830547     G       GGCC    .       PASS    DBSNP=dbSNP_126;EA_AC=1375,3249;AA_AC=1194,1296;TAC=2569,4545;MAF=29.7362,47.9518,36.1119;GTS=A1A1,A1R,RR;EA_GTC=305,765,1242;AA_GTC=408,378,459;GTC=713,1143,1701;DP=90;GL=KIAA1751;CP=0;CG=0.2;AA=.;CA=.;EXOME_CHIP=no;GWAS_PUBMED=.;FG=NM_001080484.1:utr-3;HGVS_CDNA_VAR=NM_001080484.1:c.*668_*669insGCG;HGVS_PROTEIN_VAR=.;CDS_SIZES=NM_001080484.1:2289;GS=.;PH=.;EA_AGE=.;AA_AGE=.;GRCh38_POSITION=1:1954909;OLD_VARIANT=1:1886348:C/CCGC
1       1886346 ~rs35830547     G       GGCC    .       PASS    DBSNP=dbSNP_126;EA_AC=1374,3246;AA_AC=1190,1296;TAC=2564,4542;MAF=29.7403,47.8681,36.0822;GTS=A1A1,A1R,RR;EA_GTC=305,764,1241;AA_GTC=406,378,459;GTC=711,1142,1700;DP=88;GL=KIAA1751;CP=0;CG=-0.3;AA=.;CA=.;EXOME_CHIP=no;GWAS_PUBMED=.;FG=NM_001080484.1:utr-3;HGVS_CDNA_VAR=NM_001080484.1:c.*667_*668insGGC;HGVS_PROTEIN_VAR=.;CDS_SIZES=NM_001080484.1:2289;GS=.;PH=.;EA_AGE=.;AA_AGE=.;GRCh38_POSITION=1:1954910;OLD_VARIANT=1:1886349:C/CGCC

But when I then run uniq, only one of the 'OLD_VARIANT' tags is retained:

1       1886346 ~rs35830547     G       GGCC    .       PASS    DBSNP=dbSNP_126;EA_AC=1374,3246;AA_AC=1190,1296;TAC=2564,4542;MAF=29.7403,47.8681,36.0822;GTS=A1A1,A1R,RR;EA_GTC=305,764,1241;AA_GTC=406,378,459;GTC=711,1142,1700;DP=88;GL=KIAA1751;CP=0;CG=-0.3;AA=.;CA=.;EXOME_CHIP=no;GWAS_PUBMED=.;FG=NM_001080484.1:utr-3;HGVS_CDNA_VAR=NM_001080484.1:c.*667_*668insGGC;HGVS_PROTEIN_VAR=.;CDS_SIZES=NM_001080484.1:2289;GS=.;PH=.;EA_AGE=.;AA_AGE=.;GRCh38_POSITION=1:1954910;OLD_VARIANT=1:1886349:C/CGCC

It would be good to carry over the INFO fields for all of them.
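One way the request could look, as a hedged Python sketch: keep the first duplicate's INFO and attach the OLD_VARIANT values of all duplicates as a comma-separated list. `merge_old_variant` is a hypothetical helper, not vt's uniq, and the sketch does nothing about the other INFO fields that differ between the duplicates (DP, CG and so on).

```python
from typing import List


def merge_old_variant(infos: List[str]) -> str:
    """Merge the INFO strings of duplicate records (illustrative, not vt code).

    The first record's fields are kept; OLD_VARIANT becomes the
    comma-separated list of distinct values seen across all duplicates.
    """
    old_variants = []
    for info in infos:
        for field in info.split(";"):
            key, sep, value = field.partition("=")
            if key == "OLD_VARIANT" and sep:
                for item in value.split(","):
                    if item not in old_variants:
                        old_variants.append(item)
    kept = [
        f for f in infos[0].split(";")
        if f and f != "." and not f.startswith("OLD_VARIANT=")
    ]
    if old_variants:
        kept.append("OLD_VARIANT=" + ",".join(old_variants))
    return ";".join(kept)
```

Applied to the three records above, the merged tag would read `OLD_VARIANT=1:1886347:G/GCCG,1:1886348:C/CCGC,1:1886349:C/CGCC`.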

ID field empty

In a recent version, when running decompose and normalize, the header gets these lines:

##INFO=<ID=,Number=1,Type=String,Description="Original chr:pos:ref:alt encoding">
##INFO=<ID=OLD_VARIANT,Number=.,Type=String,Description="Original chr:pos:ref:alt encoding">

The former, without an ID, causes problems in downstream parsers.

I'm running as:

vt decompose -s $VCF   | vt normalize -r $REF -
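Until this is fixed, a possible workaround is to filter the offending header line out of vt's output before handing the file to a downstream parser. A minimal Python sketch, where `drop_empty_id_headers` is a hypothetical helper:

```python
import re
from typing import Iterable, List

# Matches structured header lines whose ID is empty, e.g. "##INFO=<ID=,Number=1,...".
EMPTY_ID = re.compile(r"^##(INFO|FORMAT|FILTER)=<ID=,")


def drop_empty_id_headers(lines: Iterable[str]) -> List[str]:
    """Return the lines with empty-ID header definitions removed."""
    return [line for line in lines if not EMPTY_ID.match(line)]
```

Only the malformed definition is dropped; the valid `OLD_VARIANT` line and all records pass through unchanged.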

Update htslib to avoid dependency on gcc 4.5 or better

Adrian;
Apologies, one more report this morning. The htslib currently in vt has a dependency on gcc 4.5 or better. This got fixed yesterday in devel so it'll be compatible with older versions:

samtools/bcftools#201 (comment)

Would it be possible to update htslib in vt to pull in this fix? Thanks much.

Test suite?

Hi,
Is there a test suite available for post-installation validation?

thanks!
Deanna

compilation error : isnan was not declared

I get the following error after running the make command:

 complex_genotyping_record.cpp:318:27: error: ‘isnan’ was not declared in this scope

decompose

Hello,

Question. After a decompose should certain INFO tags be changed to reflect the one line, one ALT output? i.e. Number=A --> Number=1 ? I got an error with a tool expecting it.

##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele">

Reference mismatch, 'c' isn't a 'C'

Similar to issue #45 we're also getting the following error:

[variant_manip.cpp:96 is_not_ref_consistent] reference bases not consistent: chrX:60000-60000  c(REF) vs C(FASTA)
[normalize.cpp:209 normalize] Normalization not performed due to inconsistent reference sequences. (use -n or -m option to relax this)

and incidentally we are using both the -n and the -m options.

CGI bz2 file throws Not a VCF/BCF file

Hi,

I have a bunch of Complete Genomics VCF files that I want to normalize. They're compressed with bzip2 and they have the following header start:

##fileformat=VCFv4.1
##fileDate=20140715
##center=Complete Genomics
##source=CGAPipeline_2.0.2.26;cgatools_1.8.0
##source_GENOME_REFERENCE=NCBI build 37
##source_GENE_ANNOTATIONS=NCBI build 37.2
##source_DBSNP_BUILD=dbSNP build 132
##source_COSMIC=COSMIC v48
##source_DGV_VERSION=9
##source_MIRBASE_VERSION=miRBase version 16
##source_PFAM_DATE=April 21, 2011
##source_REPMASK_GENERATED_AT=2011-Feb-15 10:08
##source_SEGDUP_GENERATED_AT=2010-Dec-01 13:40
##phasing=partial
##source_MAX_PLOIDY=10
##source_NUMBER_LEVELS=GS01489-DNA_A06:5
##source_NONDIPLOID_WINDOW_WIDTH=100000
##source_MEAN_GC_CORRECTED_CVG=GS01489-DNA_A06:55.52
##source_MEI_1000G_ANNOTATIONS=INITIAL-DATA-RELEASE

Right now vt normalize throws:
[bcf_ordered_reader.cpp:49 BCFOrderedReader] Not a VCF/BCF file

vt normalize vcfBeta-ASM.vcf.bz2 -o normVcf.out -r hg19_CGI.fa.bz2

Please update to allow these files to be normalized.
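As far as I know, the htslib readers that vt relies on handle plain, gzip and BGZF input but not bzip2, which would explain the "Not a VCF/BCF file" message. Treat that as an assumption. In the meantime the files can be recompressed before running vt. A minimal Python sketch using only the standard library (`bz2_to_gzip` is a hypothetical helper):

```python
import bz2
import gzip
import shutil


def bz2_to_gzip(src_path: str, dst_path: str) -> None:
    """Stream a bzip2-compressed file into a gzip-compressed copy."""
    with bz2.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

For example, `bz2_to_gzip("vcfBeta-ASM.vcf.bz2", "vcfBeta-ASM.vcf.gz")`. The output is plain gzip rather than BGZF, so it cannot be indexed with tabix, and the bzip2-compressed reference FASTA will most likely have to be decompressed as well.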

Thanks,
Denise

Multiple operations?

I'd love to do a filter + normalization at the same time. We're filtering on variants and piping the result back into vt normalize. It would be much better if we could filter and normalize in the same process, to reduce memory and I/O.

Header parsing bug leads to missing VCF columns in output

A header line containing extra whitespace outside of the "<...>" content can lead to missing VCF columns in the output. Here's an example input VCF in which the last FILTER header line, right before the first FORMAT line, has about 60 trailing space characters. This will look like a blank line in this window; scroll all the way right to see the newline quote character:

Input VCF file contents:

##fileformat=VCFv4.1
##fileDate=20150126
##reference=hs37d5
##phasing=partial
##FILTER=<ID=INDEL_SPECIFIC_FILTERS,Description="QD < 2.0 || ReadPosRankSum < -20.0 || InbreedingCoeff < -0.8 || FS > 200.0">
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=VQSRTrancheSNP99.00to99.90,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -6.6778 <= x < -0.6832">
##FILTER=<ID=VQSRTrancheSNP99.90to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -36469.5723">
##FILTER=<ID=VQSRTrancheSNP99.90to100.00,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -36469.5723 <= x < -6.6778">                 \

##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GATK,Number=1,Type=String,Description="Genotype as called by GATK. Always a diploid call. All other genotype stats based on this genotype.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype after Personalis post-processing to match detected chromosome counts.">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001
1       12065947        PTV001  C       T,A     29      PASS    .       GT:GATK:AD:DP:GQ        0/1:0/1:3,2:5:19
1       109817590       PTV002  G       T       77      PASS    .       GT:GATK:AD:DP:GQ        0/1:0/1:3,2:5:20
1       153791300       PTV003  CTG     C       81      PASS    .       GT:GATK:AD:DP:GQ        0/1:0/1:3,2:5:21
1       156104666       PTV004  TTGAGAGCCGGCTGGCGGAT    TCC     30      PASS    .       GT:GATK:AD:DP:GQ        0/1:0/1:3,2:5:22
1       156108541       PTV005  G       GG      31      PASS    .       GT:GATK:AD:DP:GQ        0/1:0/1:3,2:5:23
1       161279695       PTV006  T       C,A     32      PASS    .       GT:GATK:AD:DP:GQ        0/1:0/1:3,2:5:24
1       169519049       PTV007  T       .       35      PASS    .       GT:GATK:AD:DP:GQ        0/1:0/1:3,2:5:24
1       226125468       PTV097  G       A       99      PASS    .       GT:GATK:AD:DP:GQ        0/1:0/1:3,2:5:109
16      2103394 PTV056  C       T       68      PASS    .       GT:GATK:AD:DP:GQ        0/1:0/1:3,2:5:72
4       31789170        PTV021  G       .       77      PASS    .       GT:GATK:AD:DP:GQ        0/1:0/1:3,2:5:38

Output VCF file data:
Two of the variants in this input get normalized, and vt terminates normally. However, the output VCF is lacking the FORMAT and genotype columns:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       12065947        PTV001  C       T,A     29      PASS    .
1       109817590       PTV002  G       T       77      PASS    .
1       153791300       PTV003  CTG     C       81      PASS    .
1       156104667       PTV004  TGAGAGCCGGCTGGCGGAT     CC      30      PASS    OLD_VARIANT=1:156104666:TTGAGAGCCGGCTGGCGGAT/TCC
1       156108540       PTV005  C       CG      31      PASS    OLD_VARIANT=1:156108541:G/GG
1       161279695       PTV006  T       C,A     32      PASS    .
1       169519049       PTV007  T       .       35      PASS    .
1       226125468       PTV097  G       A       99      PASS    .
16      2103394 PTV056  C       T       68      PASS    .
4       31789170        PTV021  G       .       77      PASS    .

I'm using vt software v0.5 (released/downloaded on 2015-06-24)
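As a stopgap, the header can be cleaned before it reaches vt. A minimal Python sketch, assuming the file is read as a list of text lines: it strips trailing whitespace from `##` meta lines and drops blank lines inside the header, leaving the `#CHROM` line and the records untouched. `clean_vcf_header` is a hypothetical helper, not part of vt.

```python
from typing import Iterable, List


def clean_vcf_header(lines: Iterable[str]) -> List[str]:
    """Return the lines (without line endings) with the header tidied up."""
    cleaned = []
    in_header = True
    for line in lines:
        text = line.rstrip("\r\n")
        if in_header:
            if text.startswith("#CHROM"):
                in_header = False        # column line: header ends here
            elif text.startswith("##"):
                text = text.rstrip()     # remove trailing spaces and tabs
            elif not text.strip():
                continue                 # blank line inside the header
        cleaned.append(text)
    return cleaned
```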

Tag a new release?

There have been more than 200 commits since the last release, and I'd like to package it for Homebrew.

decompose_blocksub should optionally output phased genotypes and PS tags for decomposed MNPs

Breaking up MNPs and other complex variants loses information about which variants occur on the same haplotype. To avoid this, decompose_blocksub could output phased genotypes (i.e. | instead of /) and PS tags in the per-sample data

Incorrect decompose

Example:

##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=11,length=135006516>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  S1  S2  S3  S4
11  101 .   GCGT    G,GCGA,GTGA,CCGT    199 PASS    .   GT  0/1 1/2 2/3 2/4

The vt decompose -s | vt normalize output is (irrelevant fields are skipped):

11      101     GCGT    G       0/1     1/.     ./.     ./.
11      101     G       C       0/.     ./.     ./.     ./1
11      102     CGT     TGA     0/.     ./.     ./1     ./.
11      104     T       A       0/.     ./1     1/.     1/.

But I tend to believe the right one should be:

11       101     GCGT    G       0/1      1/.      ./.      ./.
11       101     G       C       0/.      ./0      0/0      0/1
11       102     C       T       0/.      ./0      0/1      0/0
11       104     T       A       0/.      ./1      1/1      1/0

There are two problems with the vt output. Firstly, CGT=>TGA has not been decomposed. Secondly, several . should be replaced with the reference 0. For example, in sample S3, the original genotype is 2/3. Both allele 2 and 3 have the reference base at 11:101. The genotype at 101:G=>C should be 0/0, not ./.. To do the decompose right, vt needs to be aware of the allele sequences. BTW, also note that at 11:104, S3 has a homozygous A/A genotype, although in the original VCF, 2/3 appears to be a heterozygote.

EDIT: err... actually there is no "GCGT" on human chr11. Nonetheless, if we change coordinate to 61842 where there is a GCGT on the reference, the output is the same.
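The allele-aware genotype recoding described above can be sketched in Python as follows. This is an illustration only, not vt code: it handles just the ALT alleles that have the same length as REF, reports alleles of a different length (the deletion here) as `.`, and assumes unphased `/`-separated genotypes. `per_base_genotypes` is a hypothetical name.

```python
from typing import List, Tuple


def per_base_genotypes(ref: str, alts: List[str], genotypes: List[str]
                       ) -> List[Tuple[int, str, str, List[str]]]:
    """Split equal-length substitution alleles into per-base SNP records.

    Returns (offset, ref_base, alt_base, recoded_genotypes) tuples, where
    offset is relative to POS.
    """
    alleles = [ref] + alts
    records = []
    for offset, ref_base in enumerate(ref):
        # Distinct non-reference bases carried at this offset.
        alt_bases = []
        for allele in alts:
            if len(allele) == len(ref):
                base = allele[offset]
                if base != ref_base and base not in alt_bases:
                    alt_bases.append(base)
        for alt_base in alt_bases:
            calls = []
            for gt in genotypes:
                codes = []
                for idx in gt.split("/"):
                    allele = alleles[int(idx)] if idx != "." else None
                    if allele is None or len(allele) != len(ref):
                        codes.append(".")   # missing call or indel allele
                    elif allele[offset] == alt_base:
                        codes.append("1")
                    elif allele[offset] == ref_base:
                        codes.append("0")
                    else:
                        codes.append(".")   # a different ALT base at this site
                calls.append("/".join(codes))
            records.append((offset, ref_base, alt_base, calls))
    return records


for rec in per_base_genotypes("GCGT", ["G", "GCGA", "GTGA", "CCGT"],
                              ["0/1", "1/2", "2/3", "2/4"]):
    print(rec)
# (0, 'G', 'C', ['0/.', './0', '0/0', '0/1'])
# (1, 'C', 'T', ['0/.', './0', '0/1', '0/0'])
# (3, 'T', 'A', ['0/.', './1', '1/1', '1/0'])
```

Offsets 0, 1 and 3 correspond to positions 101, 102 and 104, and the recoded genotypes reproduce the last three rows of the table I proposed. The deletion record itself is not generated by this sketch.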

duplicated sample name issue

I'm getting this error when trying to run vt genotype

[parlar@mps 12191-96]$ vt genotype -r /home/bcbio_root/share/bcbio/genomes/Hsapiens/hg19/seq/hg19.fa -s 12191-96 -b 12191-96-ready.bam -o 12191-96-ensemble.vcf.gz.norm.decomp.vcf.gz.vt.vcf 12191-96-ensemble.vcf.gz.norm.decomp.vcf
[E::bcf_hdr_add_sample] Duplicated sample name '12191-96'
Aborted (core dumped)

Not setting -s also produces a complaint:

vt genotype -r /home/bcbio_root/share/bcbio/genomes/Hsapiens/hg19/seq/hg19.fa -b 12191-96-ready.bam -o 12191-96-ensemble.vcf.gz.norm.decomp.vcf.gz.vt.vcf 12191-96-ensemble.vcf.gz.norm.decomp.vcf

  undefined -- Required argument missing: s

genotype v0.5

description : Genotypes variants for each sample.


usage : vt genotype [options] <in.vcf>

options : -d  debug alignments
          -r  reference FASTA file
          -s  sample ID
          -o  output VCF file
          -b  input BAM file
          -I  file containing list of intervals []
          -i  intervals []
          -?  displays help

What I want to do is extract allele frequencies, and possibly other parameters, from the BAM alignment for pre-specified variants (in a VCF). I'm not even sure that vt will do the job. Any suggestions on the best way to do this would be greatly appreciated. I have tried Mutect2, which takes ages, and freebayes, which does not call all input variants.

Kind regards,

Pär Larsson

Compilation error

Dear Adrian,

I'm getting the following compilation error. Could you suggest a fix?

bed.cpp: In member function ‘std::string BEDRecord::to_string()’:
bed.cpp:63: error: call of overloaded ‘to_string(int32_t&)’ is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note:                 std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note:                 std::string std::to_string(long double)
bed.cpp:63: error: call of overloaded ‘to_string(int32_t&)’ is ambiguous
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note:                 std::string std::to_string(long long unsigned int)
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note:                 std::string std::to_string(long double)
make: *** [bed.o] Error 1

Subset should modify multiallelic sites

When subsetting a vcf file, it would be useful to trim all alleles from a multiallelic site that are not present in the samples being subsetted on. For example, consider the following entry

1 100 . CTTT CT,C 100 PASS ... 0/2 ...

The genotype for the sample being subsetted on is 0/2, so when subsetting, this record needs to be retained, but there is no need to retain the 'CT' allele. This also requires all INFO fields with an entry for each alternate allele to be trimmed.

It is possible to use 'vt decompose | vt subset' to get rid of the alternate allele that isn't present, but this will modify the values supplied in the genotype fields, so isn't necessarily a desirable solution.
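A sketch of the requested trimming in Python, limited to the ALT column and the GT values. It does not touch INFO or FORMAT fields that have one entry per allele (AC, AD, PL and so on), which would need the same remapping. `trim_unused_alts` is a hypothetical helper, not a vt function.

```python
import re
from typing import List, Tuple


def trim_unused_alts(alts: List[str], genotypes: List[str]) -> Tuple[List[str], List[str]]:
    """Drop ALT alleles no retained sample carries and renumber the genotypes."""
    used = sorted({
        int(idx)
        for gt in genotypes
        for idx in re.findall(r"\d+", gt)
        if int(idx) > 0
    })
    # Old allele index -> new allele index (the reference stays 0).
    remap = {old: new for new, old in enumerate(used, start=1)}
    kept_alts = [alts[old - 1] for old in used]
    new_gts = [
        re.sub(r"\d+", lambda m: str(remap.get(int(m.group()), 0)), gt)
        for gt in genotypes
    ]
    return kept_alts, new_gts


print(trim_unused_alts(["CT", "C"], ["0/2"]))
# (['C'], ['0/1'])
```

Because only the digits are rewritten, phased (`|`) and missing (`.`) calls keep their original form.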

non-termination for specific vcf

with this VCF:
wget https://s3.amazonaws.com/biodata/variants/cosmic-v68-GRCh37.vcf.gz

this command:

   vt decompose  cosmic-v68-GRCh37.vcf.gz   \
          | vt normalize -r /data/human/b37/human_g1k_v37_decoy.fasta - > vv.vcf

puts some data in the output, but then it stalls after 77K lines or so.

Normalization issue for indel

Hi,
I have the following indel detected by FreeBayes:

chr10 88679046 . GAGCCCTG GTGTAG

After running the normalization using vt, it became:
chr10 88679047 . AGCCCT TGTA

This is an incorrect representation according to the VCF spec. It should be:

chr10 88679046 . GAGCCCT GTGTA

Please let me know if it is possible to normalize in the correct way.

Thanks
Savi

Typo leads to failed compile

paste_genotypes.cpp:729:9: error: use of undeclared identifier 'isnanf'; did you mean 'isnan'?
if ( isnanf(fic) ) fic = 0;
^~~~~~
isnan

Changing to isnan allows successful compilation

v0.577 still reporting version as v0.5

Using this source https://github.com/atks/vt/archive/0.577.tar.gz

% vt view
...
view v0.5

I was expecting to see view v0.577?

Decomposing variants in decompose_blocksub

@holtgrewe

CHROM POS ID REF ALT QUAL FILTER INFO

1 159030 . TACCTTTC TGACCTTTT 0.04 . .

This is decomposed by decompose_blocksub with -a option to

CHROM POS ID REF ALT QUAL FILTER INFO

1 159030 . T TG 0.04 . .
1 159037 . C T 0.04 . .

However, if the given normalized version of the variant is

CHROM POS ID REF ALT QUAL FILTER INFO

1 159031 . ACCTTTC GACCTTTT 0.04 . .

It is not possible to know that a padding T is required as the alignment will be:
-ACCTTTC
GACCTTTT

We will need decompose_blocksub to have access to the reference genome, either to pad the variant on the left-hand side before performing the Needleman-Wunsch alignment or to add the relevant T after decomposition.
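For reference, a compact Needleman-Wunsch global alignment in Python that reproduces the alignment above. The scoring (match +1, mismatch -1, gap -1) is an assumption chosen for illustration, and this is not the aligner used inside vt.

```python
from typing import Tuple


def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> Tuple[str, str]:
    """Globally align a and b; return the two gapped strings."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACCTTTC", "GACCTTTT"))
# ('-ACCTTTC', 'GACCTTTT')
```

The gap lands in the first column, so the inserted G has no preceding base to anchor to. That anchor base can only come from the reference.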