hadoopgenomics / hadoop-bam
Hadoop-BAM is a Java library for the manipulation of files in common bioinformatics formats using the Hadoop MapReduce framework.
License: MIT License
If the Hadoop-BAM index is present, then it should be possible to efficiently load only a subset of the BAM, say bases 1-2M of chromosome 5.
This will likely be picked up by the GATK team.
I have two questions. First, an easier one I think:
For some reason, when using the hadoop-bam.jar that doesn't include dependencies, I get a "class not found: sort" error when running the command as shown in the documentation. However, if I run
hadoop jar Hadoop-BAM-master/target/hadoop-bam-7.0.1-SNAPSHOT.jar org.seqdoop.hadoop_bam.cli.Frontend -libjars /home/garychen1/hadoop/htsjdk-1.128.jar,/home/garychen1/hadoop/commons-jexl-2.1.1.jar sort -v --format=BAM workdir bamfiles/CG-0013A.1q31.bam
the MapReduce appears to begin correctly.
Alternatively I can run
hadoop jar Hadoop-BAM-master/target/hadoop-bam-7.0.1-SNAPSHOT-jar-with-dependencies.jar -libjars /home/garychen1/hadoop/htsjdk-1.128.jar,/home/garychen1/hadoop/commons-jexl-2.1.1.jar sort -v --format=BAM workdir bamfiles/CG-0013A.1q31.bam
and this also starts the MapReduce job.
My second question is as follows:
With either command above the map tasks start, but partway through the reduce tasks I get an error message:
[garychen1@biocluster hadoop]$ hadoop jar Hadoop-BAM-master/target/hadoop-bam-7.0.1-SNAPSHOT.jar org.seqdoop.hadoop_bam.cli.Frontend -libjars /home/garychen1/hadoop/htsjdk-1.128.jar,/home/garychen1/hadoop/commons-jexl-2.1.1.jar sort -v --format=BAM workdir bamfiles/CG-0013A.1q31.bam
15/02/13 07:11:33 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
sort :: Sampling...
15/02/13 07:11:33 INFO input.FileInputFormat: Total input paths to process : 1
15/02/13 07:11:42 INFO partition.InputSampler: Using 10000 samples
15/02/13 07:11:42 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
15/02/13 07:11:42 INFO compress.CodecPool: Got brand-new compressor [.deflate]
sort :: Sampling complete in 9.491 s.
15/02/13 07:11:43 INFO client.RMProxy: Connecting to ResourceManager at biocluster.med.usc.edu/68.181.163.131:8032
15/02/13 07:11:43 INFO input.FileInputFormat: Total input paths to process : 1
15/02/13 07:11:43 INFO mapreduce.JobSubmitter: number of splits:3
15/02/13 07:11:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1423785861813_0007
15/02/13 07:11:43 INFO impl.YarnClientImpl: Submitted application application_1423785861813_0007
15/02/13 07:11:44 INFO mapreduce.Job: The url to track the job: http://biocluster.med.usc.edu:8088/proxy/application_1423785861813_0007/
sort :: Waiting for job completion...
15/02/13 07:11:44 INFO mapreduce.Job: Running job: job_1423785861813_0007
15/02/13 07:11:49 INFO mapreduce.Job: Job job_1423785861813_0007 running in uber mode : false
15/02/13 07:11:49 INFO mapreduce.Job: map 0% reduce 0%
15/02/13 07:11:59 INFO mapreduce.Job: map 46% reduce 0%
15/02/13 07:12:02 INFO mapreduce.Job: map 67% reduce 0%
15/02/13 07:12:05 INFO mapreduce.Job: map 78% reduce 0%
15/02/13 07:12:08 INFO mapreduce.Job: map 89% reduce 0%
15/02/13 07:12:11 INFO mapreduce.Job: map 100% reduce 0%
15/02/13 07:12:15 INFO mapreduce.Job: map 100% reduce 22%
15/02/13 07:12:15 INFO mapreduce.Job: Task Id : attempt_1423785861813_0007_r_000000_0, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
15/02/13 07:12:16 INFO mapreduce.Job: map 100% reduce 0%
15/02/13 07:12:27 INFO mapreduce.Job: map 100% reduce 11%
15/02/13 07:12:36 INFO mapreduce.Job: map 100% reduce 22%
15/02/13 07:12:38 INFO mapreduce.Job: Task Id : attempt_1423785861813_0007_r_000000_1, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
15/02/13 07:12:39 INFO mapreduce.Job: map 100% reduce 0%
15/02/13 07:12:50 INFO mapreduce.Job: map 100% reduce 22%
15/02/13 07:13:02 INFO mapreduce.Job: Task Id : attempt_1423785861813_0007_r_000000_2, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
15/02/13 07:13:03 INFO mapreduce.Job: map 100% reduce 0%
15/02/13 07:13:13 INFO mapreduce.Job: map 100% reduce 100%
15/02/13 07:13:14 INFO mapreduce.Job: Job job_1423785861813_0007 failed with state FAILED due to: Task failed task_1423785861813_0007_r_000000
Job failed as tasks failed. failedMaps:0 failedReduces:1
15/02/13 07:13:14 INFO mapreduce.Job: Counters: 37
File System Counters
FILE: Number of bytes read=1072859902
FILE: Number of bytes written=2146049435
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=331721757
HDFS: Number of bytes written=0
HDFS: Number of read operations=15
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed reduce tasks=4
Launched map tasks=3
Launched reduce tasks=4
Data-local map tasks=3
Total time spent by all maps in occupied slots (ms)=53765
Total time spent by all reduces in occupied slots (ms)=59825
Total time spent by all map tasks (ms)=53765
Total time spent by all reduce tasks (ms)=59825
Total vcore-seconds taken by all map tasks=53765
Total vcore-seconds taken by all reduce tasks=59825
Total megabyte-seconds taken by all map tasks=55055360
Total megabyte-seconds taken by all reduce tasks=61260800
Map-Reduce Framework
Map input records=6056948
Map output records=6056948
Map output bytes=1054688968
Map output materialized bytes=1072859830
Input split bytes=399
Combine input records=0
Spilled Records=12113896
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=465
CPU time spent (ms)=58340
Physical memory (bytes) snapshot=794177536
Virtual memory (bytes) snapshot=2741903360
Total committed heap usage (bytes)=565968896
File Input Format Counters
Bytes Read=0
sort :: Job failed.
[garychen1@biocluster hadoop]
I dug into the logs and found 8 container_* directories under the userlogs/application_* directory. In the last 4 container_* directories, I found a stack trace with the following:
2015-02-13 07:12:14,868 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at org.seqdoop.hadoop_bam.cli.Utils.getMergeableWorkFile(Utils.java:180)
at org.seqdoop.hadoop_bam.cli.CLIMergingAnySAMOutputFormat.getDefaultWorkFile(CLIMergingAnySAMOutputFormat.java:69)
at org.seqdoop.hadoop_bam.cli.CLIMergingAnySAMOutputFormat.getRecordWriter(CLIMergingAnySAMOutputFormat.java:62)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:540)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:614)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Do you have any insights into this? I built Hadoop from source, as well as all the other dependencies of hadoop-bam.
I am running Hadoop version 2.6.0:
[garychen1@biocluster ~]$ hadoop version
Hadoop 2.6.0
Subversion Unknown -r Unknown
Compiled by garychen1 on 2015-02-12T23:40Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /home/garychen1/hadoop/hadoop-2.6.0-src/hadoop-dist/target/hadoop-2.6.0/share/hadoop/common/hadoop-common-2.6.0.jar
[garychen1@biocluster ~]$
Thanks! Let me know if you need further information.
Gary
Can someone clarify if vcf.gz is supported by VCFInputFormat?
Is it possible to use Hadoop-BAM to analyse reads? I would like to count the mapped/unmapped reads at each position. I have only read that it can manipulate BAM/SAM files. I am new to this area and couldn't find any other help, sorry.
I would like to change the OutputFormat to normal text instead of a new BAM file. So my InputFormat would still be BAMInputFormat, but the OutputFormat should be a TextOutputFormat (org.apache.hadoop.mapred.TextOutputFormat) instead of KeyIgnoringBAMOutputFormat. How would I do that?
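A minimal sketch of one way to wire this up, not a definitive answer. The class and job names here are illustrative, and note one assumption: since BAMInputFormat uses the new `org.apache.hadoop.mapreduce` API, the matching TextOutputFormat is `org.apache.hadoop.mapreduce.lib.output.TextOutputFormat`, not the old-API `org.apache.hadoop.mapred.TextOutputFormat` mentioned above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.seqdoop.hadoop_bam.BAMInputFormat;

public class BamToTextJob {
    // Illustrative job wiring: keep reading BAM, but write plain text.
    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "bam-to-text");
        job.setInputFormatClass(BAMInputFormat.class);     // input side unchanged
        job.setOutputFormatClass(TextOutputFormat.class);  // text instead of BAM
        job.setOutputKeyClass(LongWritable.class);
        // The mapper/reducer would have to emit Text values (e.g. via
        // SAMRecord.getSAMString()) rather than SAMRecordWritable.
        job.setOutputValueClass(Text.class);
        return job;
    }
}
```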
If a BAM has consecutive paired reads then it is sometimes desirable for them to be in the same split - for example Mark Duplicates could take advantage of this to avoid an initial sort of the BAM.
The BAMRecordReader could ensure that paired reads are always in the same split by starting with the first record in the split that is unpaired or the first of a pair. I.e. it would only return the first record if the following expression was true:
!record.getReadPairedFlag() || record.getFirstOfPairFlag()
Similarly it would return an extra record at the end of the split if the last record was paired and was the first of a pair:
record.getReadPairedFlag() && record.getFirstOfPairFlag()
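The two conditions above can be captured as pure boolean helpers, independent of htsjdk; the flags here stand in for SAMRecord.getReadPairedFlag() and SAMRecord.getFirstOfPairFlag():

```java
public class SplitBoundary {
    /** True if a record may begin a split: it is unpaired or the first of its pair. */
    public static boolean startsInSplit(boolean readPaired, boolean firstOfPair) {
        return !readPaired || firstOfPair;
    }

    /** True if the reader should fetch one extra record past the split's end,
     *  because the last record opens a pair whose mate falls in the next split. */
    public static boolean needsExtraRecord(boolean lastReadPaired, boolean lastFirstOfPair) {
        return lastReadPaired && lastFirstOfPair;
    }

    public static void main(String[] args) {
        System.out.println(startsInSplit(false, false));   // unpaired read: true
        System.out.println(startsInSplit(true, false));    // second of a pair: false
        System.out.println(needsExtraRecord(true, true));  // pair opened at end: true
    }
}
```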
It would be good to do this so we can see where test coverage needs improving. Something like https://coveralls.io would be suitable.
We could follow, for example, the approach here.
SAMOutputPreparer knows how to prepare the beginning of an output stream, but not how to populate or terminate it (which for CRAM requires a special terminating EOF container). We should extend it to do the whole roll-up correctly based on the SAMFormat being used.
This would avoid the need to use heuristics to find splits when reading back in.
A couple of ideas from #82:
htsjdk allows for specifying custom deflaters for writing BAM files. We (GATK) would like to use the IntelDeflater from the Intel GKL (https://github.com/Intel-HLS/GKL). This can be done by calling BlockCompressedOutputStream.setDefaultDeflaterFactory(new IntelDeflaterFactory()); or by setting the right deflater on the writers.
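A minimal sketch of the global-factory option, assuming htsjdk and the Intel GKL jar are on the classpath (the class name is illustrative):

```java
import htsjdk.samtools.util.BlockCompressedOutputStream;
import com.intel.gkl.compression.IntelDeflaterFactory;

public class DeflaterSetup {
    public static void main(String[] args) {
        // Install the Intel deflater globally: every BlockCompressedOutputStream
        // created after this call (including those used to write BAM) will use it.
        BlockCompressedOutputStream.setDefaultDeflaterFactory(new IntelDeflaterFactory());
    }
}
```

This is one-time, process-wide configuration, so it would need to run before any writers are constructed.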
There have been a few issues fixed since the last release, so I'd like to do a new one in the next day or two.
Since HTSJDK now requires Java 8, we need to ensure compatibility with 1.8 as well. I don't think we want to maintain separate versions, so the easiest option would be just to upgrade.
Based on the discussion in samtools/htsjdk#180, it is safe to say this issue will cause controversy.
edit: We should probably start off with working on a separate branch and testing that. Also improving the test coverage would not hurt.
The CLI tools like Cat/Sort need to be updated to properly handle CRAM files (once #60 is done).
I am hitting the following error using Hadoop-BAM in Spark. This happens on a hadoop-2.3 cluster when I put the CDH hadoop jars on the classpath:
net.sf.samtools.util.RuntimeIOException: Read error; BinaryCodec in readmode; streamed file (filename not available)
at net.sf.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:397)
at net.sf.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:371)
at net.sf.samtools.util.BinaryCodec.readByteBuffer(BinaryCodec.java:481)
at net.sf.samtools.util.BinaryCodec.readInt(BinaryCodec.java:492)
at net.sf.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:178)
at fi.tkk.ics.hadoop.bam.BAMRecordReader.nextKeyValue(BAMRecordReader.java:176)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:105)
Has anyone else hit this?
Add a setQueryIntervals method to BAMInputFormat that takes an htsjdk QueryInterval. A QueryInterval can have 0/-1 as the end position to mean end of sequence. (See discussion in #82.)
From samtools/htsjdk#567 when it is released.
Currently we are migrating traditional genome analysis to Hadoop, also for learning purposes. We are using bsmap for bisulfite methylation extraction (methratio.py with input = .bam file and reference = .fa file). I tried to use Hadoop-BAM for this purpose but I couldn't figure out what I am doing wrong here:
but when I try to execute:
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar -libjars hadoop-bam-7.0.0.jar,htsjdk-1.118.jar -inputformat org.seqdoop.hadoop_bam.BAMInputFormat -file ./methratio.py -file '../fadata/Genome.fa' -mapper methratio.py -input ./wgEncodeSydhRnaSeqK562Ifna6hPolyaAln.bam -output ./outfile
I am getting :
Exception in thread "main" java.lang.NoClassDefFoundError: htsjdk/samtools/seekablestream/SeekableStream
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1986)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1951)
at org.apache.hadoop.streaming.StreamUtil.goodClassOrNull(StreamUtil.java:51)
at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:784)
at org.apache.hadoop.streaming.StreamJob.run(StreamJob.java:128)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.ClassNotFoundException: htsjdk.samtools.seekablestream.SeekableStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 15 more
I would really appreciate the help.
The testNoReadsInFirstSplitBug unit test fails for me on my lesser laptop with an OutOfMemoryError:
$ mvn test
...
-------------------------------------------------------
T E S T S
-------------------------------------------------------
Running org.seqdoop.hadoop_bam.TestBAMInputFormat
Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 4.126 sec <<< FAILURE! - in org.seqdoop.hadoop_bam.TestBAMInputFormat
testNoReadsInFirstSplitBug(org.seqdoop.hadoop_bam.TestBAMInputFormat) Time elapsed: 1.086 sec <<< ERROR!
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.StringBuilder.toString(StringBuilder.java:407)
at htsjdk.samtools.SAMFileHeader.addComment(SAMFileHeader.java:292)
at org.seqdoop.hadoop_bam.TestBAMInputFormat.writeBamFileWithLargeHeader(TestBAMInputFormat.java:102)
at org.seqdoop.hadoop_bam.TestBAMInputFormat.testNoReadsInFirstSplitBug(TestBAMInputFormat.java:127)
...
Results :
Tests in error:
TestBAMInputFormat.testNoReadsInFirstSplitBug:127->writeBamFileWithLargeHeader:102 » OutOfMemory
Tests run: 150, Failures: 0, Errors: 1, Skipped: 0
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
Can .bai indexes be written for shards then merged?
As stated in that PR, this bug is currently affecting some development of GATK 4. It'd be a big help if you could do a release in the next few days.
This would allow htsjdk to read sharded BAM files created with MR (or Spark), without having to merge them.
There hasn't been a release of Hadoop 1 since 2013, and all the distributions have been based on Hadoop 2 for a long time now.
This would involve removing references to Hadoop 1 in the documentation, as well as removing the hadoop.org.apache.hadoop.mapreduce.lib packages (which seem to be unused?). It would be good to mark the Hadoop dependencies as provided, although this could be done separately.
based on file extension, similar to the way AnySAMInputFormat does for the RecordReader.
SAMTextWriter.setSortOrder is not called before SAMTextWriter.setHeader in SAMRecordWriter.init, despite this comment stating that it must be (the link is from htsjdk 1.118, but this API and comment still exist in htsjdk 1.138). I noticed this because all .sam output files in ADAM have @HD lines with a SO:unsorted tag, even if we set a coordinate sort on the SAMFileHeader before writing.
Implement the extra validation if ticket #32 goes well.
I'm saving a BAM file with the 7.0 BAMOutputFormat and am getting the following warning when reading the file with samtools:
samtools view -H t.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
The file seems to be functionally OK; I get both the header and correct reads (as far as I can tell), mostly just wanted to raise awareness of this in case it is easy to fix later.
The CIGAR bug is blocking some validation work of GATK 4. It would be great if you could do a cut in the next few days.
I have noticed that when I try to access BAM files in HDFS using GATK, with an index file in the same directory using the ".bai" extension, GATK fails to find the index. However, if the index file has the extension ".bam.bai", it will correctly identify the index.
Or has anyone done any work on what it would take?
Any reason to require at least Java 1.7? Some distributions in common use (e.g., CentOS 6.5) still come with Java 1.6.
When reading a VCF file (raw or bgzipped) we need to be able to read intervals of it. Similarly to using a .bai for BAM, you could use Tribble or tabix indexes to read parts of the file. Support for this would be very valuable for users (e.g. GATK).
It would be much more robust to delegate to htsjdk to verify the entire file header using, e.g., SamStreams.isCRAMFile/isBAMFile.
This is so it can be used directly from GATK (which currently replicates its functionality).
$ mvn org.codehaus.mojo:versions-maven-plugin:2.2:plugin-updates-report
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Hadoop-BAM 7.5.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- versions-maven-plugin:2.2:plugin-updates-report (default-cli) @ hadoop-bam ---
[INFO] artifact org.apache.maven.plugins:maven-gpg-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-enforcer-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-compiler-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-clean-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-deploy-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-install-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-jar-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-resources-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-javadoc-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-shade-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-site-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-surefire-plugin: checking for updates from central
[INFO] artifact org.codehaus.mojo:findbugs-maven-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-antrun-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-release-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-assembly-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-dependency-plugin: checking for updates from central
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9.793 s
[INFO] Finished at: 2016-06-13T10:02:23-05:00
[INFO] Final Memory: 21M/181M
[INFO] ------------------------------------------------------------------------
Then open target/site/plugin-updates-report.html in a browser.
Also list configuration properties supported by each format.
There are plenty of deprecation warnings when building that we should take care of.
CRAMInputFormat, introduced in #28, can currently only read the reference from a local file. We should extend it to support reading reference files (FASTA) from HDFS using an approach like samtools/htsjdk#308 or similar.
The performance cost of decoding the rest of the (first three?) records when trying to find a start point is unclear. This should be benchmarked on a large BAM (~100-300 GB) against the current behavior.
When I run the "mvn test" command, I receive the following errors:
Tests in error:
TestBAMOutputFormat.testBAMOutput:152->doMapReduce:258 » FileAlreadyExists Out...
TestBAMOutputFormat.testBAMRoundTrip:217->doMapReduce:258 » FileAlreadyExists ...
TestBAMOutputFormat.testBAMWithSplittingBai:185 » FileSystem /home/lngo/git/Ha...
TestSAMInputFormat.testMapReduceJob:89 » FileAlreadyExists Output directory fi...
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
Tests run: 175, Failures: 0, Errors: 19, Skipped: 0
I wonder whether this is an issue with multiple test cases receiving the same input/output directory; these output directories may need to be reset to comply with HDFS' requirement that an MR output directory not already exist.
SAMRecordReader does some tricks so that it can work around htsjdk's requirement that a SAM file start with a header, which doesn't hold when reading part of a file (see WorkaroundingStream). This could be improved by changing htsjdk so that an external header can be supplied to a SamReader.
SAMRecord.setHeader doesn't validate the sequence dictionary in the header, so it's possible to set the header to a value that doesn't contain the sequence referenced by the record. setHeaderStrict forces resolution of the record's reference and mate reference names against the sequence dictionary in the new header at the time the header is set.
I discovered this when testing GATK with sharded cram output. I'll have a fix shortly.
When one currently tries to output VariantContextWritables from mappers that contain a properly decoded VariantContext (e.g., generated by other means), one ends up with an IllegalStateException: "Cannot write fully decoded VariantContext: need lazy genotypes". This is due to the fact that VariantContextCodec relies on Picard's LazyGenotypeContext to keep the un-decoded genotype data until the context is properly decoded. This should not be required.
This issue was originally reported by @jmthibault79. Thanks!
Currently it supports SAM and BAM files, but with the addition of CRAMInputFormat it should support CRAM files too.
AnySAMInputFormat.createRecordReader throws IllegalArgumentException if it can't determine the format of a SAM/BAM file. If this is caused by a nonexistent file, it hides the FileNotFoundException, which would be a more useful error message. This cropped up in a case where a file local to one machine was specified as an input to a Spark cluster with multiple nodes. The file existed on the master node, but the workers failed because it was unavailable to them. It would be helpful if the error message mentioned this possibility.
see broadinstitute/gatk#1417 for the original issue
I have a block-gzipped VCF and I want to read it using a call like this:
ctx.newAPIHadoopFile(
    vcf, VCFInputFormat.class, LongWritable.class, VariantContextWritable.class,
    new Configuration());
VCFFormat.inferFromData falsely thinks it's a BCF because it's gzipped and starts with a 1f byte, and so my code breaks. The VCF is attached. Block-gzipped VCFs should be recognized as VCFs, not BCFs.
count_variants.blockgz.gz