hadoopgenomics / hadoop-bam
Hadoop-BAM is a Java library for the manipulation of files in common bioinformatics formats using the Hadoop MapReduce framework.
License: MIT License
If the Hadoop-BAM index is present, then it should be possible to efficiently load only a subset of the BAM, say bases 1-2M of chromosome 5.
This will likely be picked up by the GATK team.
I have two questions. First, an easier one I think:
For some reason, when using the hadoop-bam.jar that doesn't include dependencies, I get a "class not found: sort" error when running the command as shown in the documentation. However, if I run
hadoop jar Hadoop-BAM-master/target/hadoop-bam-7.0.1-SNAPSHOT.jar org.seqdoop.hadoop_bam.cli.Frontend -libjars /home/garychen1/hadoop/htsjdk-1.128.jar,/home/garychen1/hadoop/commons-jexl-2.1.1.jar sort -v --format=BAM workdir bamfiles/CG-0013A.1q31.bam
the MapReduce appears to begin correctly.
Alternatively I can run
hadoop jar Hadoop-BAM-master/target/hadoop-bam-7.0.1-SNAPSHOT-jar-with-dependencies.jar -libjars /home/garychen1/hadoop/htsjdk-1.128.jar,/home/garychen1/hadoop/commons-jexl-2.1.1.jar sort -v --format=BAM workdir bamfiles/CG-0013A.1q31.bam
and this also starts the MapReduce job.
My second question is as follows:
With either command above the map tasks start, but partway through the reduce tasks I get an error message:
[garychen1@biocluster hadoop]$ hadoop jar Hadoop-BAM-master/target/hadoop-bam-7.0.1-SNAPSHOT.jar org.seqdoop.hadoop_bam.cli.Frontend -libjars /home/garychen1/hadoop/htsjdk-1.128.jar,/home/garychen1/hadoop/commons-jexl-2.1.1.jar sort -v --format=BAM workdir bamfiles/CG-0013A.1q31.bam
15/02/13 07:11:33 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
sort :: Sampling...
15/02/13 07:11:33 INFO input.FileInputFormat: Total input paths to process : 1
15/02/13 07:11:42 INFO partition.InputSampler: Using 10000 samples
15/02/13 07:11:42 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
15/02/13 07:11:42 INFO compress.CodecPool: Got brand-new compressor [.deflate]
sort :: Sampling complete in 9.491 s.
15/02/13 07:11:43 INFO client.RMProxy: Connecting to ResourceManager at biocluster.med.usc.edu/68.181.163.131:8032
15/02/13 07:11:43 INFO input.FileInputFormat: Total input paths to process : 1
15/02/13 07:11:43 INFO mapreduce.JobSubmitter: number of splits:3
15/02/13 07:11:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1423785861813_0007
15/02/13 07:11:43 INFO impl.YarnClientImpl: Submitted application application_1423785861813_0007
15/02/13 07:11:44 INFO mapreduce.Job: The url to track the job: http://biocluster.med.usc.edu:8088/proxy/application_1423785861813_0007/
sort :: Waiting for job completion...
15/02/13 07:11:44 INFO mapreduce.Job: Running job: job_1423785861813_0007
15/02/13 07:11:49 INFO mapreduce.Job: Job job_1423785861813_0007 running in uber mode : false
15/02/13 07:11:49 INFO mapreduce.Job: map 0% reduce 0%
15/02/13 07:11:59 INFO mapreduce.Job: map 46% reduce 0%
15/02/13 07:12:02 INFO mapreduce.Job: map 67% reduce 0%
15/02/13 07:12:05 INFO mapreduce.Job: map 78% reduce 0%
15/02/13 07:12:08 INFO mapreduce.Job: map 89% reduce 0%
15/02/13 07:12:11 INFO mapreduce.Job: map 100% reduce 0%
15/02/13 07:12:15 INFO mapreduce.Job: map 100% reduce 22%
15/02/13 07:12:15 INFO mapreduce.Job: Task Id : attempt_1423785861813_0007_r_000000_0, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
15/02/13 07:12:16 INFO mapreduce.Job: map 100% reduce 0%
15/02/13 07:12:27 INFO mapreduce.Job: map 100% reduce 11%
15/02/13 07:12:36 INFO mapreduce.Job: map 100% reduce 22%
15/02/13 07:12:38 INFO mapreduce.Job: Task Id : attempt_1423785861813_0007_r_000000_1, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
15/02/13 07:12:39 INFO mapreduce.Job: map 100% reduce 0%
15/02/13 07:12:50 INFO mapreduce.Job: map 100% reduce 22%
15/02/13 07:13:02 INFO mapreduce.Job: Task Id : attempt_1423785861813_0007_r_000000_2, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
15/02/13 07:13:03 INFO mapreduce.Job: map 100% reduce 0%
15/02/13 07:13:13 INFO mapreduce.Job: map 100% reduce 100%
15/02/13 07:13:14 INFO mapreduce.Job: Job job_1423785861813_0007 failed with state FAILED due to: Task failed task_1423785861813_0007_r_000000
Job failed as tasks failed. failedMaps:0 failedReduces:1
15/02/13 07:13:14 INFO mapreduce.Job: Counters: 37
File System Counters
FILE: Number of bytes read=1072859902
FILE: Number of bytes written=2146049435
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=331721757
HDFS: Number of bytes written=0
HDFS: Number of read operations=15
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed reduce tasks=4
Launched map tasks=3
Launched reduce tasks=4
Data-local map tasks=3
Total time spent by all maps in occupied slots (ms)=53765
Total time spent by all reduces in occupied slots (ms)=59825
Total time spent by all map tasks (ms)=53765
Total time spent by all reduce tasks (ms)=59825
Total vcore-seconds taken by all map tasks=53765
Total vcore-seconds taken by all reduce tasks=59825
Total megabyte-seconds taken by all map tasks=55055360
Total megabyte-seconds taken by all reduce tasks=61260800
Map-Reduce Framework
Map input records=6056948
Map output records=6056948
Map output bytes=1054688968
Map output materialized bytes=1072859830
Input split bytes=399
Combine input records=0
Spilled Records=12113896
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=465
CPU time spent (ms)=58340
Physical memory (bytes) snapshot=794177536
Virtual memory (bytes) snapshot=2741903360
Total committed heap usage (bytes)=565968896
File Input Format Counters
Bytes Read=0
sort :: Job failed.
[garychen1@biocluster hadoop]
I dug into the logs and found 8 container_* directories under the userlogs/application_* directory. In the last 4 container_* directories, I found a stack trace with the following:
2015-02-13 07:12:14,868 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at org.seqdoop.hadoop_bam.cli.Utils.getMergeableWorkFile(Utils.java:180)
at org.seqdoop.hadoop_bam.cli.CLIMergingAnySAMOutputFormat.getDefaultWorkFile(CLIMergingAnySAMOutputFormat.java:69)
at org.seqdoop.hadoop_bam.cli.CLIMergingAnySAMOutputFormat.getRecordWriter(CLIMergingAnySAMOutputFormat.java:62)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:540)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:614)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Do you have any insights into this? I built Hadoop from source, as well as all the other dependencies of hadoop-bam.
I am running Hadoop version 2.6.0:
[garychen1@biocluster ~]$ hadoop version
Hadoop 2.6.0
Subversion Unknown -r Unknown
Compiled by garychen1 on 2015-02-12T23:40Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /home/garychen1/hadoop/hadoop-2.6.0-src/hadoop-dist/target/hadoop-2.6.0/share/hadoop/common/hadoop-common-2.6.0.jar
[garychen1@biocluster ~]$
Thanks! Let me know if you need further information.
Gary
Can someone clarify if vcf.gz is supported by VCFInputFormat?
Is it possible to use Hadoop-BAM to analyse reads? I would like to count the mapped/unmapped reads at each position. I have only read that it can manipulate BAM/SAM files. I am new to this area and couldn't find any other help, sorry.
I would like to change the OutputFormat to normal text instead of a new BAM file. So my InputFormat would still be BAMInputFormat, but the OutputFormat should be a TextOutputFormat (org.apache.hadoop.mapred.TextOutputFormat) instead of KeyIgnoringBAMOutputFormat. How would I do that?
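A minimal sketch of one way to wire this up, not a definitive answer. The class and job names here are illustrative, and note one assumption: since BAMInputFormat uses the new `org.apache.hadoop.mapreduce` API, the matching TextOutputFormat is `org.apache.hadoop.mapreduce.lib.output.TextOutputFormat`, not the old-API `org.apache.hadoop.mapred.TextOutputFormat` mentioned above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.seqdoop.hadoop_bam.BAMInputFormat;

public class BamToTextJob {
    // Illustrative job wiring: keep reading BAM, but write plain text.
    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "bam-to-text");
        job.setInputFormatClass(BAMInputFormat.class);     // input side unchanged
        job.setOutputFormatClass(TextOutputFormat.class);  // text instead of BAM
        job.setOutputKeyClass(LongWritable.class);
        // The mapper/reducer would have to emit Text values (e.g. via
        // SAMRecord.getSAMString()) rather than SAMRecordWritable.
        job.setOutputValueClass(Text.class);
        return job;
    }
}
```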
If a BAM has consecutive paired reads then it is sometimes desirable for them to be in the same split - for example Mark Duplicates could take advantage of this to avoid an initial sort of the BAM.
The BAMRecordReader could ensure that paired reads are always in the same split by starting with the first record in the split that is unpaired or the first of a pair. I.e. it would only return the first record if the following expression was true:
!record.getReadPairedFlag() || record.getFirstOfPairFlag()
Similarly it would return an extra record at the end of the split if the last record was paired and was the first of a pair:
record.getReadPairedFlag() && record.getFirstOfPairFlag()
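The two conditions above can be captured as pure boolean helpers, independent of htsjdk; the flags here stand in for SAMRecord.getReadPairedFlag() and SAMRecord.getFirstOfPairFlag():

```java
public class SplitBoundary {
    /** True if a record may begin a split: it is unpaired or the first of its pair. */
    public static boolean startsInSplit(boolean readPaired, boolean firstOfPair) {
        return !readPaired || firstOfPair;
    }

    /** True if the reader should fetch one extra record past the split's end,
     *  because the last record opens a pair whose mate falls in the next split. */
    public static boolean needsExtraRecord(boolean lastReadPaired, boolean lastFirstOfPair) {
        return lastReadPaired && lastFirstOfPair;
    }

    public static void main(String[] args) {
        System.out.println(startsInSplit(false, false));   // unpaired read: true
        System.out.println(startsInSplit(true, false));    // second of a pair: false
        System.out.println(needsExtraRecord(true, true));  // pair opened at end: true
    }
}
```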
It would be good to do this so we can see where test coverage needs improving. Something like https://coveralls.io would be suitable.
We could follow, for example, the approach here.
SAMOutputPreparer knows how to prepare the beginning of an output stream, but not how to populate or terminate it (which for CRAM requires a special terminating EOF container). We should extend it to do the whole roll-up correctly based on the SAMFormat being used.
This would avoid the need to use heuristics to find splits when reading back in.
A couple of ideas from #82:
htsjdk allows for specifying custom deflaters for writing BAM files. We (GATK) would like to use the IntelDeflater from the Intel GKL (https://github.com/Intel-HLS/GKL). This can be done by calling BlockCompressedOutputStream.setDefaultDeflaterFactory(new IntelDeflaterFactory()); or by setting the right deflater on the writers.
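A minimal sketch of the global-factory option, assuming htsjdk and the Intel GKL jar are on the classpath (the class name is illustrative):

```java
import htsjdk.samtools.util.BlockCompressedOutputStream;
import com.intel.gkl.compression.IntelDeflaterFactory;

public class DeflaterSetup {
    public static void main(String[] args) {
        // Install the Intel deflater globally: every BlockCompressedOutputStream
        // created after this call (including those used to write BAM) will use it.
        BlockCompressedOutputStream.setDefaultDeflaterFactory(new IntelDeflaterFactory());
    }
}
```

This is one-time, process-wide configuration, so it would need to run before any writers are constructed.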
There have been a few issues fixed since the last release, so I'd like to do a new one in the next day or two.
Since HTSJDK now requires Java 8, we need to ensure compatibility with 1.8 as well. I don't think we want to maintain separate versions, so the easiest option would be just to upgrade.
Based on the discussion in samtools/htsjdk#180, it is safe to say this issue will cause controversy.
edit: We should probably start off with working on a separate branch and testing that. Also improving the test coverage would not hurt.
The CLI tools like Cat/Sort need to be updated to properly handle CRAM files (once #60 is done).
I am hitting the following error using Hadoop-BAM in Spark. This happens on a hadoop-2.3 cluster when I put the CDH hadoop jars on the classpath:
net.sf.samtools.util.RuntimeIOException: Read error; BinaryCodec in readmode; streamed file (filename not available)
at net.sf.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:397)
at net.sf.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:371)
at net.sf.samtools.util.BinaryCodec.readByteBuffer(BinaryCodec.java:481)
at net.sf.samtools.util.BinaryCodec.readInt(BinaryCodec.java:492)
at net.sf.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:178)
at fi.tkk.ics.hadoop.bam.BAMRecordReader.nextKeyValue(BAMRecordReader.java:176)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:105)
Has anyone else hit this?
Add a setQueryIntervals method to BAMInputFormat that takes an htsjdk QueryInterval. A QueryInterval can have 0/-1 as the end position to mean end of sequence. (See discussion in #82.)
From samtools/htsjdk#567 when it is released.
Currently we are migrating traditional genome analysis to Hadoop, also for learning purposes. We are using bsmap for bisulfite methylation extraction (methratio.py with input = .bam file and reference = .fa file). I tried to use Hadoop-BAM for this purpose but I couldn't figure out what I am doing wrong here:
but when I try to execute:
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar -libjars hadoop-bam-7.0.0.jar,htsjdk-1.118.jar -inputformat org.seqdoop.hadoop_bam.BAMInputFormat -file ./methratio.py -file '../fadata/Genome.fa' -mapper methratio.py -input ./wgEncodeSydhRnaSeqK562Ifna6hPolyaAln.bam -output ./outfile
I am getting :
Exception in thread "main" java.lang.NoClassDefFoundError: htsjdk/samtools/seekablestream/SeekableStream
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1986)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1951)
at org.apache.hadoop.streaming.StreamUtil.goodClassOrNull(StreamUtil.java:51)
at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:784)
at org.apache.hadoop.streaming.StreamJob.run(StreamJob.java:128)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.ClassNotFoundException: htsjdk.samtools.seekablestream.SeekableStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 15 more
I would really appreciate the help.
The testNoReadsInFirstSplitBug unit test fails for me on my lesser laptop with an OutOfMemoryError:
$ mvn test
...
-------------------------------------------------------
T E S T S
-------------------------------------------------------
Running org.seqdoop.hadoop_bam.TestBAMInputFormat
Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 4.126 sec <<< FAILURE! - in org.seqdoop.hadoop_bam.TestBAMInputFormat
testNoReadsInFirstSplitBug(org.seqdoop.hadoop_bam.TestBAMInputFormat) Time elapsed: 1.086 sec <<< ERROR!
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.StringBuilder.toString(StringBuilder.java:407)
at htsjdk.samtools.SAMFileHeader.addComment(SAMFileHeader.java:292)
at org.seqdoop.hadoop_bam.TestBAMInputFormat.writeBamFileWithLargeHeader(TestBAMInputFormat.java:102)
at org.seqdoop.hadoop_bam.TestBAMInputFormat.testNoReadsInFirstSplitBug(TestBAMInputFormat.java:127)
...
Results :
Tests in error:
TestBAMInputFormat.testNoReadsInFirstSplitBug:127->writeBamFileWithLargeHeader:102 » OutOfMemory
Tests run: 150, Failures: 0, Errors: 1, Skipped: 0
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
Can .bai indexes be written for shards then merged?
As stated in that PR, this bug is currently affecting some development of GATK 4. It'd be a big help if you could do a release in the next few days.
This would allow htsjdk to read sharded BAM files created with MR (or Spark), without having to merge them.
There hasn't been a release of Hadoop 1 since 2013, and all the distributions have been based on Hadoop 2 for a long time now.
This would involve removing references to Hadoop 1 in the documentation, as well as removing the hadoop.org.apache.hadoop.mapreduce.lib packages (which seem to be unused?). It would be good to mark the Hadoop dependencies as provided, although this could be done separately.
based on file extension, similar to the way AnySAMInputFormat does for the RecordReader.
SAMTextWriter.setSortOrder is not called before SAMTextWriter.setHeader in SAMRecordWriter.init, despite this comment stating that it must be (the link is from htsjdk 1.118, but this API and comment still exist in htsjdk 1.138). I noticed this because all .sam output files in ADAM have @HD lines with a SO:unsorted tag, even if we set a coordinate sort on the SAMFileHeader before writing.
Implement the extra validation if ticket #32 goes well.
I'm saving a BAM file with the 7.0 BAMOutputFormat and am getting the following warning when reading the file with samtools:
samtools view -H t.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
The file seems to be functionally OK; I get both the header and correct reads (as far as I can tell), mostly just wanted to raise awareness of this in case it is easy to fix later.
The CIGAR bug is blocking some validation work of GATK 4. It would be great if you could do a cut in the next few days.
I have noticed that when I try to access BAM files in HDFS using GATK, with an index file in the same directory using the ".bai" extension, GATK fails to find the index. However, if the index file has the extension ".bam.bai", it will correctly identify the index.
Or has anyone done any work on what it would take?
Any reason to require at least Java 1.7? Some distributions in common use (e.g., CentOS 6.5) still come with Java 1.6.
When reading a VCF file (raw or bgzipped) we need to be able to read intervals of it. Similarly to using a .bai for BAM, you could use Tribble or tabix indexes to read parts of the file. Support for this would be very valuable for users (e.g. GATK).
It would be much more robust to delegate to htsjdk to verify the entire file header using, e.g., SamStreams.isCRAMFile/isBAMFile.
This is so it can be used directly from GATK (which currently replicates its functionality).
$ mvn org.codehaus.mojo:versions-maven-plugin:2.2:plugin-updates-report
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Hadoop-BAM 7.5.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- versions-maven-plugin:2.2:plugin-updates-report (default-cli) @ hadoop-bam ---
[INFO] artifact org.apache.maven.plugins:maven-gpg-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-enforcer-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-compiler-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-clean-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-deploy-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-install-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-jar-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-resources-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-javadoc-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-shade-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-site-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-surefire-plugin: checking for updates from central
[INFO] artifact org.codehaus.mojo:findbugs-maven-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-antrun-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-release-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-assembly-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-dependency-plugin: checking for updates from central
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9.793 s
[INFO] Finished at: 2016-06-13T10:02:23-05:00
[INFO] Final Memory: 21M/181M
[INFO] ------------------------------------------------------------------------
Then open target/site/plugin-updates-report.html in a browser.
Also list configuration properties supported by each format.
There are plenty of deprecation warnings when building that we should take care of.
CRAMInputFormat, introduced in #28, can currently only read the reference from a local file. We should extend it to support reading reference files (FASTA) from HDFS using an approach like samtools/htsjdk#308 or similar.
The performance cost of decoding the rest of the (first three?) records when trying to find a start point is unclear. This should be benchmarked on a large BAM (~100-300 GB) against the current behavior.
When I run the "mvn test" command, I receive the following errors:
Tests in error:
TestBAMOutputFormat.testBAMOutput:152->doMapReduce:258 » FileAlreadyExists Out...
TestBAMOutputFormat.testBAMRoundTrip:217->doMapReduce:258 » FileAlreadyExists ...
TestBAMOutputFormat.testBAMWithSplittingBai:185 » FileSystem /home/lngo/git/Ha...
TestSAMInputFormat.testMapReduceJob:89 » FileAlreadyExists Output directory fi...
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
Tests run: 175, Failures: 0, Errors: 19, Skipped: 0
I wonder whether this is an issue with multiple test cases receiving the same input/output directory; these output directories may need to be reset to comply with HDFS' requirement that an MR output directory not already exist.
SAMRecordReader does some tricks so that it can work around htsjdk's requirement that a SAM file start with a header, which doesn't hold when reading part of a file (see WorkaroundingStream). This could be improved by changing htsjdk so that an external header can be supplied to a SamReader.
SAMRecord.setHeader doesn't validate the sequence dictionary in the header, so it's possible to set the header to a value that doesn't contain the sequence referenced by the record. setHeaderStrict forces resolution of the record's reference and mate reference names against the sequence dictionary in the new header at the time the header is set.
I discovered this when testing GATK with sharded cram output. I'll have a fix shortly.
When one currently tries to output VariantContextWritables from mappers that contain a properly decoded VariantContext (e.g., generated by other means), one ends up with an IllegalStateException: "Cannot write fully decoded VariantContext: need lazy genotypes". This is due to the fact that VariantContextCodec relies on Picard's LazyGenotypeContext to keep the un-decoded genotype data until the context is properly decoded. This should not be required.
This issue was originally reported by @jmthibault79. Thanks!
Currently it supports SAM and BAM files, but with the addition of CRAMInputFormat it should support CRAM files too.
AnySAMInputFormat.createRecordReader throws IllegalArgumentException if it can't determine the format of a SAM/BAM file. If this is caused by a nonexistent file, it hides the FileNotFoundException, which would be a more useful error message. This cropped up in a case where a file local to one machine was specified as an input to a Spark cluster with multiple nodes. The file existed on the master node, but the workers failed because it was unavailable to them. It would be helpful if the error message mentioned this possibility.
see broadinstitute/gatk#1417 for the original issue
I have a block-gzipped VCF and I want to read it using a call like this:
ctx.newAPIHadoopFile(
    vcf, VCFInputFormat.class, LongWritable.class, VariantContextWritable.class,
    new Configuration());
VCFFormat.inferFromData falsely thinks it's a BCF because it's gzipped and starts with a 1f byte, and so my code breaks. The VCF is attached. Block-gzipped VCFs should be recognized as VCFs, not BCFs.
count_variants.blockgz.gz