fulcrumgenomics / fgbio Goto Github PK

View Code? Open in Web Editor NEW

305.0 21.0 67.0 9.35 MB

Tools for working with genomic and high throughput sequencing data.

Home Page: http://fulcrumgenomics.github.io/fgbio/

License: MIT License

Java 1.18% Scala 97.98% Shell 0.12% R 0.71%

analyzing-genomic-data umi bioinformatics ngs scala fgbio molecular-indexes genomics

fgbio's People

Stargazers

Watchers

fgbio's Issues

HapCutToVcf: `FORMAT/PR` has no header entry

Hi again,

Just noticed a format field has no header entry:

$ java -jar fgbio-0.1.2-SNAPSHOT.jar HapCutToVcf -v ../NA12878.1000G.vcf -i na12878.output_haplotype_file  -o test.vcf.gz
$ bcftools view test.vcf.gz > /dev/null 
[W::vcf_parse_format] FORMAT 'PR' is not defined in the header, assuming Type=String

Improve the parallel collection support

Instead of this:

val parIter = iter.par
parIter.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(threads))
val result = parIter.map(x => doWork(x))

something like:

val result = iter.withPar(threads).map(x => doWork(x))

and something like this:

val threadPool: ForkJoinPool = ...
val result = iter.withPar(threadPool).map(x => doWork(x))

The latter is useful if we have multiple parallel collections.

CallMolecularConsensusReads doesn't appear to support /dev/stdout as it's output file

Not sure why since this isn't usually a problem:

java -Xmx4096m -jar fgbio.jar CallMolecularConsensusReads -i grouped.unsorted.bam -o /dev/stdout -r consensus.rejects.bam -t MI -P false > tmp.bam
Exception in thread "main" htsjdk.samtools.util.RuntimeIOException: /dev/stdout (Permission denied)
    at htsjdk.samtools.util.BinaryCodec.close(BinaryCodec.java:610)
    at htsjdk.samtools.BAMFileWriter.finish(BAMFileWriter.java:143)
    at htsjdk.samtools.SAMFileWriterImpl.close(SAMFileWriterImpl.java:219)
    at com.fulcrumgenomics.umi.CallMolecularConsensusReads.execute(CallMolecularConsensusReads.scala:134)
    at com.fulcrumgenomics.cmdline.FgBioMain.makeItSo(FgBioMain.scala:70)
    at com.fulcrumgenomics.cmdline.FgBioMain.makeItSoAndExit(FgBioMain.scala:53)
    at com.fulcrumgenomics.cmdline.FgBioMain$.main(FgBioMain.scala:41)
    at com.fulcrumgenomics.cmdline.FgBioMain.main(FgBioMain.scala)
Caused by: java.io.FileNotFoundException: /dev/stdout (Permission denied)
    at java.io.RandomAccessFile.open0(Native Method)
    at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
    at htsjdk.samtools.util.BlockCompressedInputStream.checkTermination(BlockCompressedInputStream.java:471)
    at htsjdk.samtools.util.BlockCompressedOutputStream.close(BlockCompressedOutputStream.java:286)
    at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
    at htsjdk.samtools.util.BinaryCodec.close(BinaryCodec.java:606)
    ... 7 more

SamRecordSetBuilder returning records with invalid insert size

Tracked it down to this: samtools/htsjdk#692

GroupReadsByUmi should handle no-calls in UMIs better

As it stands the way GroupReadsByUmi checks whether UMIs are equivalent is just by counting the number of bases that are not identical (after upper-casing). This is problematic for a few corner cases:

Comparing NNN to NNN should probably not yield a match!
Comparing ACG to ACN - what should this do?

My first instinct is that we should do two things:

Throw out all reads whose UMIs contain >= some number of Ns (could be 0, could be 1-2)
For the adjacency strategy, not allow nodes with Ns anywhere but at the leaves

Support for setting common/global options

There are a number of options that are common to many, but not all, tools in fgbio that it would be nice to be able to have standard names for. For example things that control some aspects of BAM and VCF writing like whether or not to create indices, what compression level to use, whether or not to enable async-io.

Perhaps we could convert fgbio to use the same sub-command model as dagr uses, and then define options to fgbio separately from options to the tools? Then we could write things like:

java -jar fgbio.jar --async=true DemuxFastqs --input ...

Does CallMolecularConsensusReads preserve alignment information

I was recently experimenting with CallMolecularConsensusReads, and it appears to strip the alignment information out of the resulting bam when I use it. Is this the desired behavior?

Time for a 0.1.2 release?

@tfenne what do you think?

It would be great to have it depend on a stable version of dagr for the release, so perhaps depend on fulcrumgenomics/dagr#201?

CorrectUmis should store the old/uncorrected UMI in a tag once the spec says which one

Requires the spec to be a bit more settled. See:

Positional dependence during GroupReadsByUmi

We've been getting started with an AnnotateBamWithUmis, GroupReadsByUmi and CallMolecularConsensusReads pipeline to consolidate UMI tagged input fastqs and a couple of conceptual questions and feature requests came up and the grouping process:

How are read positions treated for the grouping process? The UMI edit distance and grouping algorithms make sense but we were not sure the conditions under which reads are considered identically mapped for grouping. Does this take into account soft-clipping? Is there currently wiggle room in coordinate matching or are exact mappings of both reads required?
As a feature request, would it be possible to make the coordinate matching flexible to nearby positions, where the definition of nearby is configurable? We have applications using UMIs for tagging transposons where read mappings will be within a region and it would be useful to group all reads within a window with the same UMI.

Thanks much for putting together these tools.

AnnotateBamWithUmis with two UMIs

Hi,
How will the AnnotateBamWithUmis tool work with two UMIs? I have reads with one UMI 5' of insert and one 3' of it.

Sonatype snapshots not being released

Soft Clipped Bases

Hello,

I was wondering if soft clipped bases, in Illumina HiSeq4000 data, are removed before consensus building? We have noticed a lot more longer insertions in our analyses when use all reads for a given UMI but don''t not in consensus build from 3 or more reads of a given UMI.

Thanks,
Sakina

GroupReadsByUmi returns empty bam

Hi folks,

First of all, many thanks for providing this nice package. I just started using fgbio to deal with sequences that use some customized UMIs. I could successfully annotate my bam file with UMI tags. Bam file looks like the following after annotating with AnnotateBamWithUmis:

NS500607:25:HGWNLBGXY:4:11404:10042:8605 99 chr1 836282 60 74M = 836455 247 GTGAATCTAAAACCGTTCTTTAAAAAGAGTTTATTTAGTAGGCTGGGCATGGTGGCTCACACCTGCAATCCCAG AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEE X0:i:1 X1:i:0 MD:Z:74 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U RX:Z:AGGGCTTAACGC
NS500607:25:HGWNLBGXY:4:11406:25080:7099 99 chr1 836282 60 74M = 836455 247 GTGAATCTAAAACCGTTCTTTAAAAAGAGTTTATTTAGTAGGCTGGGCATGGTGGCTCACACCTGCAATCCCAG AAAAAEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEE X0:i:1 X1:i:0 MD:Z:74 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U RX:Z:AGGGCTTAACGC

Please note that mates are also in the bam file, but not shown here.

So the issue is with GroupreadsByUmi: nothing is written in the output bam (empty file). Following is the log output:

I also run the code as:

java -Xmx32g -jar ~/src/fgbio/target/scala-2.11/fgbio-0.1.2-SNAPSHOT.jar GroupReadsByUmi -i input_bam.bam -s edit -e 2 -m 20 -o output_bam.bam --tmp-dir=../tmp

I would be very grateful if you could please help me out to fix this. It is very hard to figure this out as I do not get any error message.

Cheers,
MH

SamRecordSetBuilder should expose setting the random seed (to 42)

Should we make a recipe for fgbio in homebrew?

It probably belongs in homebrew-science. See the contributing guide.

class Fgbio < Formula
  desc "Tools for working with genomic and high throughput sequencing data."
  homepage "https://github.com/fulcrumgenomics/fgbio"
  url "https://github.com/fulcrumgenomics/fgbio/archive/0.1.1.tar.gz"
  sha256 "4b2414ab8af5e8cfa384cde44fc9a282cec83c51cd93efe0113866dc2840e696"

  head do
    url "https://github.com/fulcrumgenomics/fgbio.git"
  end

  bottle :unneeded

  depends_on :java => "1.8+"
  depends_on "sbt" => :build

  def install
    system "sbt", "assembly"
    libexec.install Dir["target/scala-*/fgbio-*.jar"][0]
    bin.write_jar_script Dir[libexec/"fgbio-*.jar"][0], "fgbio"
  end

  test do
    assert_match "USAGE", shell_output("#{bin}/fgbio -h 2>&1", 1)
  end
end

CorrectUmis should output metrics per end metrics for read pairs along with the template metrics

See #199

I find this personally very useful, so I'll assign this to myself.

SimpleCounter and NumericCounter should implement CanBuildFrom and maybe Map

It would be nice to have the counter objects implement the CanBuildFrom interfaces so that you can manipulate their contents and get another counter back without a lot of extra effort.

Similarly it might be nice to implement immutable Map too to provide Map-like accessors?

Should fgbio tools produce any messages as they start and end?

E.g. at startup should they report their environment similar to the way Picard tools do? At the end should they print out the runtime and any other useful stats?

AssessPhasing should report the # of blocks per chromosome

This is useful to ensure that there is at least one block per chromosome.

Proposal for improvements to consensus calling

There are really three parts to this proposal:

Simplify the consensus caller (particularly the usage thereof) by removing parameters that we have found are not useful and are overlapped with other parameters
Add tags to the consensus reads that are generated so that downstream tools can know a lot more about the properties of each consensus read and it's supporting raw reads
Move filtering of consensus reads and bases largely into a separate program that can be run after consensus calling, removing the need to call consensuses multiple times to experiment with different cutoffs

Simplifying the consensus caller

To recap, we have a lot of parameters to consensus calling:

errorRatePreUmi: PhredScore
errorRatePostUmi: PhredScore
minInputBaseQuality: PhredScore
maxRawBaseQuality: PhredScore
rawBaseQualityShift: PhredScore
minConsensusBaseQuality: PhredScore
minReads: Int
minMeanConsensusBaseQuality: PhredScore
requireConsensusForBothPairs: Boolean

I'm finding that in general there are two sets of highly overlapped parameters where I favor one of the set, and would advocate for removal of the others:

errorRatePostUmi, maxRawBaseQuality and rawBaseQualityShift all work to effectively reduce the probability we construct from the raw read base qualities, albeit in slightly different ways. I'm finding that I tend to use errorRatePostUmi exclusively and then set maxRawBaseQuality to something high that will have little to no effect, and rawBaseQualityShift to 0. I propose removing maxRawBaseQuality and rawBaseQualityShift.
minConsensusBaseQuality and minMeanConsensusBaseQuality. While not exactly redundant I find that I set minConsensusBaseQuality high enough (i.e. close enough to errorRatePreUmi that then the minMeanConsensusBaseQuality becomes meaningless. I would propose getting rid of minMeanConsensusBaseQuality, perhaps in favor of introducing a filter on the number or proportion of no-calls generated in a consensus read (more in the filtering section).

Adding Tags to Consensus Reads

It would help significantly to encode information about the consensus reads into tags that are added to the consensus reads in the BAM file for three reasons:

One could generate a single consensus BAM, re-align it, and then filter it different ways to experiment with filtering settings much more easily and quickly
It would allow for the information to be used in filtering variants (e.g. you could imagine a performing rank-sum based tests on things like raw-read-depth or consensus-error-rate between reads supporting ref and alt alleles at a site).
It would aid when manually reviewing false positive or questionable variant calls to have more knowledge about the consensus read.

The real question is what tags we should add. We can break this down into per-read and per-base information. My first pass is that we'd want to encode the following per read:

The number of raw-reads used to generate the consensus
The error rate of the raw-reads vs. the consensus (defined as: how many bases in raw reads disagreed with the final consensus bases, prior to masking any consensus bases for quality)

Then per-base I think we'd want to capture:

The number of bases contributing to the consensus as each base.
Either the number of disagreeing bases from the raw-reads at each base, or the error rate.

This does bring up what the best way to store the per-base tags is. The first could be stored as unsigned 8, 16 or 32 bit int[]s. I'd vote for either 8 or 16 as I'm not really sure there's value in discriminating between a consensus at 256 deep vs. deeper, and definitely not between 65k deep and deeper! The disagreeing bases could be stored in the same way, or we could store the error rates at a 32-bit float[] (seems overkill), or (and I'm not keen on this) do some kind of scaling/logging of the values and store them as chars. Maybe we should test and see if using 16 or 32-bit ints has any overhead in BAM vs. 8-bit when all values are << 256?

Any other suggestions for things we definitely want to store? I can think of some other things, but until I feel a need for them I'm tempted to stop here.

Moving filtering into a separate program

I think this would be really beneficial for the reasons outlined above. Obviously input-read filtering would still have to happen in consensus calling. And I think I would like to keep a min-reads filter there too for efficiency sake (e.g. if you know you're never going to go below min=2 or min=3 you can save a lot of compute and filesize by enforcing that during consensus building). But beyond that I'd suggest moving everything else off to a post-filtering tool that uses the tags described above and features of the consensus read (number of Ns, quality scores, etc.) to do both per-read and per-base filtering.

Ideally such a tool will run on an aligned BAM in coordinate order; we should discuss whether it is acceptable to produce a semi-invalid BAM (e.g. if one read in a mate is tossed, is it necessary to toss the other one too, or fix it's flags?). If the goal is to filter before producing QC metrics and variant calls, I'm not so sure that's important.

segmentation fault error during haplotype calling

I am trying to identify the haplotypes and its alleles and my command line is posted below when am running this am getting following error , help me in fixing this error
calling MAXCUT based haplotype assembly algorithm
./command.sh: line 11: 10635 Segmentation fault

/share/apps/hapcut/extractHAIRS --VCF filtered_variants.vcf --bam aln_sorted.bam --maxIS 600 --ref 2017_V2_ND886_assembly.fa --indels 1 > fragment_matrix_file
/share/apps/hapcut/HAPCUT --fragments fragment_matrix_file --VCF filtered_variants.vcf --output output_haplotype.vcf --maxiter 100 > hapcut.log

Metric does not support optional values

The Metric class does not support optional values via Option[Type]. This would be nice as there are definitely cases where some values are optional, and not necessarily present.

CallMolecularConsensusReads should add a @PG and possibly chain it to the old ones?

Should we leave any existing @pgs in the header? That's going to make chaining much more complicated.

ProgressLogger should have a method to print the last "record" statement

... so that we can call it when we have completed recording to get a final count.

error in HapCutToVcf

Hello! I am getting the following error when trying to run HapCutToVcf. I can share some test data but it is quite large.

$ java -jar /illumina/thirdparty/joconnell/fgbio/target/scala-2.11/fgbio-0.1.2-SNAPSHOT.jar HapCutToVcf -v ../NA12878.1000G.vcf -i na12878.output_haplotype_file  -o test.vcf
[2016/09/19 20:52:24 | FgBioMain | Info] Executing HapCutToVcf from fgbio version 0.1.2-SNAPSHOT as [email protected] on JRE 1.8.0_40-b26
[2016/09/19 20:52:25 | FgBioMain | Info] HapCutToVcf failed. Elapsed time: 0.07 minutes.
Exception in thread "main" java.lang.NumberFormatException: For input string: "1|0"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
    at scala.collection.immutable.StringOps.toInt(StringOps.scala:30)
    at com.fulcrumgenomics.vcf.HapCutCall$$anonfun$9.apply(HapCutToVcf.scala:503)
    at com.fulcrumgenomics.vcf.HapCutCall$$anonfun$9.apply(HapCutToVcf.scala:503)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
    at com.fulcrumgenomics.vcf.HapCutCall.toVariantContext(HapCutToVcf.scala:503)
    at com.fulcrumgenomics.vcf.HapCutAndVcfMergingIterator.next(HapCutToVcf.scala:269)
    at com.fulcrumgenomics.vcf.HapCutAndVcfMergingIterator.next(HapCutToVcf.scala:240)
    at scala.collection.Iterator$class.foreach(Iterator.scala:742)
    at com.fulcrumgenomics.vcf.HapCutAndVcfMergingIterator.foreach(HapCutToVcf.scala:240)
    at com.fulcrumgenomics.vcf.HapCutToVcf.execute(HapCutToVcf.scala:111)
    at com.fulcrumgenomics.cmdline.FgBioMain.makeItSo(FgBioMain.scala:70)
    at com.fulcrumgenomics.cmdline.FgBioMain.makeItSoAndExit(FgBioMain.scala:53)
    at com.fulcrumgenomics.cmdline.FgBioMain$.main(FgBioMain.scala:41)
    at com.fulcrumgenomics.cmdline.FgBioMain.main(FgBioMain.scala)

just to check that my inputs are sane:

$ head na12878.output_haplotype_file
BLOCK: offset: 921 len: 195 phased: 2 SPAN: 1719 fragments 3
921 0   1   chr1    568214  C   T   1|0:255,0,255:84:26,28,14,16:127    0   0.000000    0.000000
1115    0   1   chr1    569933  G   A   1|0:221,0,255:97:36,36,15,10:127    0   0.000000    0.000000
******** 
BLOCK: offset: 1553 len: 59 phased: 2 SPAN: 2070 fragments 1
1553    0   1   chr1    724869  C   A   1|0:91,0,165:16:11,0,0,5:90 0   0.000000    -0.000087
1611    0   1   chr1    726939  G   C   1|0:255,0,255:52:13,11,7,21:127 0   0.000000    -0.000087
******** 
BLOCK: offset: 1952 len: 2 phased: 2 SPAN: 45 fragments 9
1952    0   1   chr1    749644  T   C   1|0:41,0,78:35:6,14,11,4:42 0   0.000000    0.000000

$ bcftools view -H ../NA12878.1000G.vcf | head
chr1    10539   .   C   A   0   .   DP=61;MQSB=0.805878;MQ0F=0;AC=0;AN=2;DP4=40,19,0,0;MQ=44    GT:PL:DP:DP4:GQ 0|0:0,178,255:59:40,19,0,0:255
chr1    10642   .   G   A   56  .   DP=20;MQSB=0.818182;MQ0F=0.7;AC=0;AN=2;DP4=2,11,0,0;MQ=0    GT:PL:DP:DP4:GQ 0|0:0,39,29:13:2,11,0,0:28
chr1    11008   .   C   G   54  .   DP=21;MQSB=1;MQ0F=1;AC=0;AN=2;DP4=12,9,0,0;MQ=0 GT:PL:DP:DP4:GQ 0|0:0,63,27:21:12,9,0,0:27
chr1    11012   .   C   G   52  .   DP=21;MQSB=1;MQ0F=1;AC=0;AN=2;DP4=12,8,0,0;MQ=0 GT:PL:DP:DP4:GQ 0|0:0,60,25:20:12,8,0,0:25
chr1    13011   .   T   G   0   .   DP=107;SGB=-0.379885;RPB=1;MQB=1;MQSB=0.55873;BQB=1;MQ0F=0.635514;AC=0;AN=2;DP4=36,66,0,1;MQ=8  GT:PL:DP:DP4:GQ 0|0:0,255,196:103:36,66,0,1:255
chr1    13110   .   G   A   999 .   DP=51;MQSB=0.0687032;MQ0F=0.313726;AC=0;AN=2;DP4=7,42,0,0;MQ=15 GT:PL:DP:DP4:GQ 0|0:0,148,179:49:7,42,0,0:255
chr1    13116   .   T   G   97  .   DP=46;VDB=0.0517943;SGB=-0.693132;RPB=0.000248006;MQB=0.00218214;MQSB=0.29013;BQB=0.0112228;MQ0F=0.413043;AC=2;AN=2;DP4=3,7,3,31;MQ=11  GT:PL:DP:DP4:GQ 1|1:124,38,0:44:3,7,3,31:46
chr1    13118   .   A   G   94  .   DP=46;VDB=0.0555373;SGB=-0.693132;RPB=0.0001001;MQB=0.00136901;MQSB=0.197653;BQB=0.155903;MQ0F=0.456522;AC=2;AN=2;DP4=4,7,3,31;MQ=11    GT:PL:DP:DP4:GQ 1|1:121,35,0:45:4,7,3,31:42
chr1    13259   .   G   A   999 .   DP=65;MQSB=0.896401;MQ0F=0.707692;AC=0;AN=2;DP4=26,37,0,0;MQ=5  GT:PL:DP:DP4:GQ 0|0:0,190,135:63:26,37,0,0:255
chr1    13273   .   G   C   999 .   DP=55;MQSB=0.987527;MQ0F=0.709091;AC=0;AN=2;DP4=22,32,0,0;MQ=6  GT:PL:DP:DP4:GQ 0|0:0,163,132:54:22,32,0,0:255

wgsHistogram.R no longer needed

Remove: src/scripts/com/fulcrumgenomics/metrics/wgsHistogram.R

Update the main README.md to have more useful information!

Efficient computation of "or" with an array of log probabilities.

The previous implementation of NumericTypes.or(values: Array[LogProbability]): LogProbability did not use a rolling sum, but now we do: #50

Figure out a way to make this more efficient. Likely the array is only a few elements long (four in a lot of cases). See the use of this method in the ConsensusCaller.

Metric.write and Metric.read don't agree on how to deal with Boolean values

The default mechanism for writing out values in Metric.write causes Boolean values to be emitted as either Y or N. However, the code that reads the values back only appears to recognize true as a value value for Boolean.True. So upon roundtrip all boolean values become false.

java.lang.NullPointerException error - GroupReadsByUmi

Hi folks,

I am getting the following error when I run GroupReadsByUmi:

###############################
[2016/11/21 15:32:07 | GroupReadsByUmi | Info] Sorted 76,000,000 record. Elapsed time: 00:14:55s. Time for last 1,000,000: 11s. Last read position: chr2:198,257,003
[2016/11/21 15:32:19 | GroupReadsByUmi | Info] Sorted 77,000,000 record. Elapsed time: 00:15:07s. Time for last 1,000,000: 12s. Last read position: chr7:148,525,795
[2016/11/21 15:32:24 | GroupReadsByUmi | Info] Assigning reads to UMIs and outputting.
[2016/11/21 15:32:27 | FgBioMain | Info] GroupReadsByUmi failed. Elapsed time: 15.29 minutes.
Exception in thread "main" java.lang.NullPointerException
at com.fulcrumgenomics.umi.GroupReadsByUmi$$anonfun$14.apply(GroupReadsByUmi.scala:477)
at com.fulcrumgenomics.umi.GroupReadsByUmi$$anonfun$14.apply(GroupReadsByUmi.scala:477)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.immutable.List.map(List.scala:285)
at com.fulcrumgenomics.umi.GroupReadsByUmi.assignUmiGroups(GroupReadsByUmi.scala:477)
at com.fulcrumgenomics.umi.GroupReadsByUmi.execute(GroupReadsByUmi.scala:448)
at com.fulcrumgenomics.cmdline.FgBioMain.makeItSo(FgBioMain.scala:70)
at com.fulcrumgenomics.cmdline.FgBioMain.makeItSoAndExit(FgBioMain.scala:53)
at com.fulcrumgenomics.cmdline.FgBioMain$.main(FgBioMain.scala:41)
at com.fulcrumgenomics.cmdline.FgBioMain.main(FgBioMain.scala)
###############################

I could successfully run it on smaller BAM files, e.g., with 17M reads. Could it be because of the large size of my BAM file or some sort of memory issue?

Thanks for your continued support.

Cheers,
MH

GroupReadsByUmi, CallMolecularConsensusReads and CallDuplexConsensusReads should emit @PG tags

It would be really nice if the tools in the consensus chain could all emit @pg tags in the header. Even if they're not stamped on the records, it would be great to capture the command lines used in grouping, calling and filtering.

Details on the likelihood model of CallMolecularConsensusReads

Hi folks,

I have a question about the write-up on the likelihood model of CallMolecularConsensusReads available here: https://github.com/fulcrumgenomics/fgbio/wiki/Calling-Consensus-Reads

Almost all of the parameters used in the model are described in the write-up, but unfortunately I could not find how Err_{post} and Err_{pre} are determined. More precisely, what are the factors affecting these error rates? And how one can set these two parameters given the quality of grouped reads by UMI?

Thank you for your continued support,
MH

In FilterConsensusReads consider end-trimming reads of Ns before applying max-n-fraction test

It would be interesting to try this. The thought being that if quality does tail off at the end of reads we may end up with more errors or masked bases at the end of the read. We could, during filtering, apply some kind of end-trimming to reads after masking and prior to the "too many Ns" test. The trimming could do any of:

Remove all trailing Ns in the read
Remove bases until we've gone X bases without seeing an N
Remove bases while some sliding window of size W has >= X Ns in it

FilterConsensusReads should spot unmapped BAMs and not do it's second sort!

FilterConsensusReads works on unmapped BAMs, but does an unnecessary second sort into coordinate order(!) in order to recalculate tags that don't exist. I think we could check and if we don't see any mapped reads at all, output the file in queryname order without doing the whole second sort.

Add tests to increase coverage of ReadStructure class

Count the # of Ns in the read ignored in ErrorsRateByReadPositions

Could also consider the same for the reference.

no coordinate in output of CallMolecularConsensusReads

Hi folks,

The output of CallMolecularConsensusReads seems incomplete (no coordinate info is included):

C:0 77 * 0 0 * * 0 0 TGGCTGGGCGCGGTGGCTCACGCCTGTAATCCCAGCATATNNGGAGGCTGAGGCGNGNGNNTCNNGNGNNCANG LLLLLMMMMMMMMMMMMMMAMMMMMMMMAMMMMMMM==AM##MM2MMMMMMMMMA#M#M##M;##M#M##MM#M RG:Z:A MI:Z:0
C:0 141 * 0 0 * * 0 0 CCACCTCCCGGGTTCACGCCATTCTCTNGCCTCAGCCACCTGAGTAGCTGGGACTACAGGCNNCNGCCACCNCN LLLLLMMMMMMMMMMMMMMMMMMMMMM#MMMMMMMMMMMMMMMLMMMMMMMMMAMMMMMMM##8#MMMMMM#M# RG:Z:A MI:Z:0
C:1 77 * 0 0 * * 0 0 AGGAGTTCAAGACCAGCCTGGCCAACATGGTGAAACCCCATCTCTACTAAAAGTGCAAAAATTAGCCAGGCATG LLLLLMMMMMMMMMMMMMMMMMMMMMMMMMMMMMLMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM RG:Z:A MI:Z:1
C:1 141 * 0 0 * * 0 0 CCCAGGCTGGAGTGCAGTGGTGTGATATTGGCTCACTGCAACCTCCGCCTTCCGGGTTCAAGCGATTCTCCTGC LLLLLMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM RG:Z:A MI:Z:1

Would you please let me know what do you think could be causing this?

Thanks a lot,
MH

Consensus caller should filter consensus calls that have too many Ns

reported by @tfenne: bwa mem runs slowly with reads with even a small # of Ns.

Quality trim source reads before generating consensus reads

In VanillaUmiConsensusCaller we have a method consensusReadLength which calculates the output read length based on the the longest length at which we still have min-reads available.

I think that probably we should quality trim the source reads prior to invoking this method. We could simply trim off bases below min-input-base-quality at the ends of the reads without doing any harm. We could also do a phred/bwa style trim where we trim until we're consistently above min-input-base-quality. The goal would be to ensure that we don't create consensus reads with strings of Ns at the end if we have source reads that trail off into low quality at the end. E.g. if we have 10 reads at 100bp and min=3, but all the reads turn to low quality junk at cycle 80, I believe we'll currently create a 100bp consensus read with 20 Ns at the end.

Add logging with counts and reasons for dropped reads in GroupReadsByUmi

It would be helpful to output, at the end of execution, the counts of reads that were filtered out, along with the reasons, in order to help users understand why reads are being dropped and particularly to diagnose pathalogical situations.

Fix the outpuf

Fix this:
https://github.com/fulcrumgenomics/fgbio/search?utf8=%E2%9C%93&q=outpuf

Assertion error: reads out of order

Hi
I am currently trying to handle sequence data that contains UMIs. I came across fgbio and have attempted to use some of its applications with my data. I have successfully annotated my bams with UMIs and am now looking to collapse the reads into a consensus sequence. I have tried using the 'GroupReadsByUmi' and 'CallMolecularConsensusReads' applications (I understand that the latter is dependent on the former?) but in both cases I encounter errors;
With regards to the former, the program seems to successfully complete the 'filtering and sorting input' step but stops during the 'Assigning reads to UMIs and outputting' step, giving the error;
Reads out of order @ D00408:358:C8KK0ANXX:7:1105:12567:37050
at scala.Predef$.assert(Predef.scala:165)
at com.fulcrumgenomics.umi.GroupReadsByUmi$$anonfun$execute$8.apply(GroupReadsByUmi.scala:447)
at com.fulcrumgenomics.umi.GroupReadsByUmi$$anonfun$execute$8.apply(GroupReadsByUmi.scala:447)
at scala.collection.immutable.List.foreach(List.scala:381)
at com.fulcrumgenomics.umi.GroupReadsByUmi.execute(GroupReadsByUmi.scala:447)
at com.fulcrumgenomics.cmdline.FgBioMain.makeItSo(FgBioMain.scala:70)
at com.fulcrumgenomics.cmdline.FgBioMain.makeItSoAndExit(FgBioMain.scala:53)
at com.fulcrumgenomics.cmdline.FgBioMain$.main(FgBioMain.scala:41)
at com.fulcrumgenomics.cmdline.FgBioMain.main(FgBioMain.scala)

When I try running 'CallMolecularConsensusReads' command (ignoring the fact that I haven't grouped reads by UMI) I also get an error;
Alignments added out of order in SAMFileWriterImpl.addAlignment…
…
Sort order is coordinate. Offending records are at [chr1:1257827] and [chr1:1257750]

Do you have any idea what might be causing this issue - and what I can do to get around it. I have tried sorting the UMI-annotated BAM by query name and coordinate (even though in the notes it says this shouldn't matter) and it doesn't make a difference.

I would appreciate any help you could provide. I am relatively new to bioinformatics, so I apologise if the solution is obvious.
Thanks
C

Add tests for ExtractUmisFromBam

likelihood model of CallMolecularConsensusReads

Hi folks,

We have successfully used AnnotateBamWithUmis, GroupReadsByUmi and CallMolecularConsensusReads pipeline and we are very pleased with the fgbio package so far. Hats off to the developers!
I was wondering if you have additional documentation online beyond what is provided by –help command. For example, in order to properly interpret the consensus reads, we are very interested to learn more about the likelihood model used in CallMolecularConsensusReads. Such information would be very helpful to adjust the input parameters such as --error-rate-pre-umi, --error-rate-post-umi, --max-base-quality, and --base-quality-shift to name a few. Another important question is about masking consensus bases. Does the package call ‘N’ solely based on --min-consensus-base-quality?
Thank you very much in advance for your help.

Cheers,
MH

PickIlluminaIndices fails to create indices

java -Xmx48G -jar /path/to/fgbio/target/scala-2.11/fgbio-0.1.2-SNAPSHOT.jar \
    PickIlluminaIndices \
    -n 1000000 \
    -o /path/to/output.txt \
    --vienna-rna-dir /path/to/ViennaRNA-2.2.5/src \
    -l 10 \
    -e 2 \
    --allow-reverses=false \
    --allow-reverse-complements=false \
    --allow-palindromes=false \
    --max-homopolymer=2 \
    --min-gc=0.0 \
    --max-gc=0.7 \
    -t=32 \
    --min-delta-g=-10.0 \
    --adapters=Nil

I tried both master and the additions on #68. In the latter case, no Exception is thrown inside each execute statement in the IndexSet.remove method, but we get:

INFO    2016-07-22 14:09:16 PickIlluminaIndicesCommand  Down to 186400 indices.
INFO    2016-07-22 14:09:16 PickIlluminaIndicesCommand  Down to 186300 indices.
INFO    2016-07-22 14:09:16 PickIlluminaIndicesCommand  Down to 186200 indices.
INFO    2016-07-22 14:09:16 PickIlluminaIndicesCommand  Down to 186100 indices.
INFO    2016-07-22 14:09:17 PickIlluminaIndicesCommand  Down to 186000 indices.
INFO    2016-07-22 14:09:17 PickIlluminaIndicesCommand  Down to 185900 indices.
INFO    2016-07-22 14:09:17 PickIlluminaIndicesCommand  Down to 185800 indices.
[2016/07/22 14:09:17 | FgBioMain | Info] PickIlluminaIndices failed. Elapsed time: 3.20 minutes.
Exception in thread "main" java.lang.IllegalStateException: Unable to remove the index from the ranked indices: {sequence: TACTATTCCA, score: -2143048647, minEditDistance:1}
    at com.fulcrumgenomics.util.PickIlluminaIndicesCommand.execute(PickIlluminaIndicesCommand.java:323)
    at com.fulcrumgenomics.util.PickIlluminaIndices.execute(PickIlluminaIndices.scala:59)
    at com.fulcrumgenomics.cmdline.FgBioMain.makeItSo(FgBioMain.scala:70)
    at com.fulcrumgenomics.cmdline.FgBioMain.makeItSoAndExit(FgBioMain.scala:53)
    at com.fulcrumgenomics.cmdline.FgBioMain$.main(FgBioMain.scala:41)
    at com.fulcrumgenomics.cmdline.FgBioMain.main(FgBioMain.scala)

This does not happen each time I run the tool, so it is obviously there's a concurrency issue.

Resolve consensus calling naming

The following terms need to be unified in CallMolecularConsensusReads, VanillaUmiConsensusCaller, and VanillaUmiConsensusCallerOptions:
errorRatePreUmi -> errorRatePreLabeling
errorRatePostUmi -> errorRatePostLabeling
maxBaseQuality -> maxRawBaseQuality
baseQualityShift -> rawBaseQualityShift

fulcrumgenomics / fgbio Goto Github PK

fgbio's People

Stargazers

Watchers

Forkers

fgbio's Issues

Simplifying the consensus caller

Adding Tags to Consensus Reads

Moving filtering into a separate program

Recommend Projects

Recommend Topics

Recommend Org