
hadoop-bam's Issues

Support reading a subset of a BAM

If the Hadoop-BAM index is present, then it should be possible to efficiently load only a subset of the BAM, say bases 1-2M of chromosome 5.

This will likely be picked up by the GATK team.
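
For reference, this is what the request amounts to in plain htsjdk terms (a minimal local sketch; sample.bam is a placeholder for an indexed, coordinate-sorted BAM, and the Hadoop-BAM version would need to map splits onto the same kind of index-backed query):

import java.io.File;
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SAMRecordIterator;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;

public final class QuerySubset {
    public static void main(String[] args) throws Exception {
        // open an indexed BAM (sample.bam.bai assumed to be present)
        try (SamReader reader = SamReaderFactory.makeDefault().open(new File("sample.bam"));
             // bases 1-2,000,000 of chromosome 5; 'false' = include reads that overlap the window
             SAMRecordIterator it = reader.query("5", 1, 2_000_000, false)) {
            while (it.hasNext()) {
                SAMRecord record = it.next();
                // process only the reads in the requested window
            }
        }
    }
}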

problem with running the sort command

I have two questions. First, an easier one I think:

For some reason, when using the hadoop-bam jar that doesn't include dependencies, I get a "class not found: sort" error when running the command as shown in the documentation. However, if I run

hadoop jar Hadoop-BAM-master/target/hadoop-bam-7.0.1-SNAPSHOT.jar org.seqdoop.hadoop_bam.cli.Frontend -libjars /home/garychen1/hadoop/htsjdk-1.128.jar,/home/garychen1/hadoop/commons-jexl-2.1.1.jar sort -v --format=BAM workdir bamfiles/CG-0013A.1q31.bam

the MapReduce appears to begin correctly.

Alternatively I can run

hadoop jar Hadoop-BAM-master/target/hadoop-bam-7.0.1-SNAPSHOT-jar-with-dependencies.jar -libjars /home/garychen1/hadoop/htsjdk-1.128.jar,/home/garychen1/hadoop/commons-jexl-2.1.1.jar sort -v --format=BAM workdir bamfiles/CG-0013A.1q31.bam

and this also starts the MapReduce job.

My second question is as follows:

With either command above the map jobs start, but part way through the reduce phase I get an error message:

[garychen1@biocluster hadoop]$ hadoop jar Hadoop-BAM-master/target/hadoop-bam-7.0.1-SNAPSHOT.jar org.seqdoop.hadoop_bam.cli.Frontend -libjars /home/garychen1/hadoop/htsjdk-1.128.jar,/home/garychen1/hadoop/commons-jexl-2.1.1.jar sort -v --format=BAM workdir bamfiles/CG-0013A.1q31.bam
15/02/13 07:11:33 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
sort :: Sampling...
15/02/13 07:11:33 INFO input.FileInputFormat: Total input paths to process : 1
15/02/13 07:11:42 INFO partition.InputSampler: Using 10000 samples
15/02/13 07:11:42 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
15/02/13 07:11:42 INFO compress.CodecPool: Got brand-new compressor [.deflate]
sort :: Sampling complete in 9.491 s.
15/02/13 07:11:43 INFO client.RMProxy: Connecting to ResourceManager at biocluster.med.usc.edu/68.181.163.131:8032
15/02/13 07:11:43 INFO input.FileInputFormat: Total input paths to process : 1
15/02/13 07:11:43 INFO mapreduce.JobSubmitter: number of splits:3
15/02/13 07:11:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1423785861813_0007
15/02/13 07:11:43 INFO impl.YarnClientImpl: Submitted application application_1423785861813_0007
15/02/13 07:11:44 INFO mapreduce.Job: The url to track the job: http://biocluster.med.usc.edu:8088/proxy/application_1423785861813_0007/
sort :: Waiting for job completion...
15/02/13 07:11:44 INFO mapreduce.Job: Running job: job_1423785861813_0007
15/02/13 07:11:49 INFO mapreduce.Job: Job job_1423785861813_0007 running in uber mode : false
15/02/13 07:11:49 INFO mapreduce.Job: map 0% reduce 0%
15/02/13 07:11:59 INFO mapreduce.Job: map 46% reduce 0%
15/02/13 07:12:02 INFO mapreduce.Job: map 67% reduce 0%
15/02/13 07:12:05 INFO mapreduce.Job: map 78% reduce 0%
15/02/13 07:12:08 INFO mapreduce.Job: map 89% reduce 0%
15/02/13 07:12:11 INFO mapreduce.Job: map 100% reduce 0%
15/02/13 07:12:15 INFO mapreduce.Job: map 100% reduce 22%
15/02/13 07:12:15 INFO mapreduce.Job: Task Id : attempt_1423785861813_0007_r_000000_0, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
15/02/13 07:12:16 INFO mapreduce.Job: map 100% reduce 0%
15/02/13 07:12:27 INFO mapreduce.Job: map 100% reduce 11%
15/02/13 07:12:36 INFO mapreduce.Job: map 100% reduce 22%
15/02/13 07:12:38 INFO mapreduce.Job: Task Id : attempt_1423785861813_0007_r_000000_1, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
15/02/13 07:12:39 INFO mapreduce.Job: map 100% reduce 0%
15/02/13 07:12:50 INFO mapreduce.Job: map 100% reduce 22%
15/02/13 07:13:02 INFO mapreduce.Job: Task Id : attempt_1423785861813_0007_r_000000_2, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
15/02/13 07:13:03 INFO mapreduce.Job: map 100% reduce 0%
15/02/13 07:13:13 INFO mapreduce.Job: map 100% reduce 100%
15/02/13 07:13:14 INFO mapreduce.Job: Job job_1423785861813_0007 failed with state FAILED due to: Task failed task_1423785861813_0007_r_000000
Job failed as tasks failed. failedMaps:0 failedReduces:1

15/02/13 07:13:14 INFO mapreduce.Job: Counters: 37
File System Counters
FILE: Number of bytes read=1072859902
FILE: Number of bytes written=2146049435
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=331721757
HDFS: Number of bytes written=0
HDFS: Number of read operations=15
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed reduce tasks=4
Launched map tasks=3
Launched reduce tasks=4
Data-local map tasks=3
Total time spent by all maps in occupied slots (ms)=53765
Total time spent by all reduces in occupied slots (ms)=59825
Total time spent by all map tasks (ms)=53765
Total time spent by all reduce tasks (ms)=59825
Total vcore-seconds taken by all map tasks=53765
Total vcore-seconds taken by all reduce tasks=59825
Total megabyte-seconds taken by all map tasks=55055360
Total megabyte-seconds taken by all reduce tasks=61260800
Map-Reduce Framework
Map input records=6056948
Map output records=6056948
Map output bytes=1054688968
Map output materialized bytes=1072859830
Input split bytes=399
Combine input records=0
Spilled Records=12113896
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=465
CPU time spent (ms)=58340
Physical memory (bytes) snapshot=794177536
Virtual memory (bytes) snapshot=2741903360
Total committed heap usage (bytes)=565968896
File Input Format Counters
Bytes Read=0
sort :: Job failed.
[garychen1@biocluster hadoop]

I dug into the logs and found 8 container_* directories under the userlogs/application_* directory. In the last 4 container_* directories, I found a stack trace with the following:

2015-02-13 07:12:14,868 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error r
unning child : java.lang.IncompatibleClassChangeError: Found interface org.apach
e.hadoop.mapreduce.TaskAttemptContext, but class was expected
at org.seqdoop.hadoop_bam.cli.Utils.getMergeableWorkFile(Utils.java:180)
at org.seqdoop.hadoop_bam.cli.CLIMergingAnySAMOutputFormat.getDefaultWorkFile(CLIMergingAnySAMOutputFormat.java:69)
at org.seqdoop.hadoop_bam.cli.CLIMergingAnySAMOutputFormat.getRecordWriter(CLIMergingAnySAMOutputFormat.java:62)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.(ReduceTask.java:540)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:614)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Do you have any insights into this? I built Hadoop from source, as well as all the other dependencies of Hadoop-BAM.

I am running Hadoop version 2.6.0:

[garychen1@biocluster ~]$ hadoop version
Hadoop 2.6.0
Subversion Unknown -r Unknown
Compiled by garychen1 on 2015-02-12T23:40Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /home/garychen1/hadoop/hadoop-2.6.0-src/hadoop-dist/target/hadoop-2.6.0/share/hadoop/common/hadoop-common-2.6.0.jar
[garychen1@biocluster ~]$

Thanks! Let me know if you need further information.

Gary

Can I output normal text files using HadoopBAM

Is it possible to use Hadoop-BAM to analyse reads? I would like to count the mapped/unmapped reads at each position. I have only read that it can manipulate BAM/SAM files. I am new to this area and couldn't find any other help, sorry.

I would like to change the OutputFormat to normal text instead of a new BAM file. My InputFormat would still be BAMInputFormat, but the OutputFormat should be a TextOutputFormat (org.apache.hadoop.mapred.TextOutputFormat) instead of KeyIgnoringBAMOutputFormat.

How would I do that?
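
For what it's worth, a minimal sketch of that wiring, assuming Hadoop-BAM 7.x package names. Since BAMInputFormat uses the new mapreduce API, the matching text output format is org.apache.hadoop.mapreduce.lib.output.TextOutputFormat rather than the old mapred one:

import java.io.IOException;
import htsjdk.samtools.SAMRecord;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.seqdoop.hadoop_bam.BAMInputFormat;
import org.seqdoop.hadoop_bam.SAMRecordWritable;

public class ReadToTextMapper
        extends Mapper<LongWritable, SAMRecordWritable, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, SAMRecordWritable value, Context context)
            throws IOException, InterruptedException {
        SAMRecord r = value.get();
        // emit "chrom:pos" (or "unmapped") so a reducer can sum counts per position
        String where = r.getReadUnmappedFlag()
                ? "unmapped"
                : r.getReferenceName() + ":" + r.getAlignmentStart();
        context.write(new Text(where), ONE);
    }

    // job wiring: BAM in, plain text out
    static void configure(Job job) {
        job.setInputFormatClass(BAMInputFormat.class);
        job.setMapperClass(ReadToTextMapper.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}

A reducer that sums the ONE values per key would then give mapped/unmapped read counts per position as plain text.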

Add an option to ensure paired reads are in the same split

If a BAM has consecutive paired reads then it is sometimes desirable for them to be in the same split - for example Mark Duplicates could take advantage of this to avoid an initial sort of the BAM.

The BAMRecordReader could ensure that paired reads are always in the same split by starting with the first record in the split that is unpaired or the first of a pair. I.e. it would only return the first record if the following expression was true:

!record.getReadPairedFlag() || record.getFirstOfPairFlag()

Similarly it would return an extra record at the end of the split if the last record was paired and was the first of a pair:

record.getReadPairedFlag() && record.getFirstOfPairFlag()
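
Put together, a sketch of that boundary logic (the method names come from htsjdk's SAMRecord; the wiring into BAMRecordReader is left out):

import htsjdk.samtools.SAMRecord;

final class PairBoundary {
    // first record a split should emit: unpaired, or the first read of a pair
    static boolean isSplitStart(SAMRecord record) {
        return !record.getReadPairedFlag() || record.getFirstOfPairFlag();
    }

    // if the split ends on the first read of a pair, read one extra record so
    // that its mate stays in the same split
    static boolean readOneMore(SAMRecord lastRecordInSplit) {
        return lastRecordInSplit.getReadPairedFlag()
                && lastRecordInSplit.getFirstOfPairFlag();
    }
}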

Release Hadoop-BAM 7.5.0

It would be good to do a new release soon so that #70 and #80 can be used by downstream projects.

I'd also like to include #81 and #82, as they are needed by GATK. If we could include the other PRs, that would be great too.

SAMOutputPreparer should be extended to write complete streams

SAMOutputPreparer knows how to prepare the beginning of an output stream, but not how to populate or terminate it (which for CRAM requires a special terminating EOF container). We should extend it to do the whole roll-up correctly based on the SAMFormat being used.
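
For the terminating side, a rough sketch of what per-format termination could look like, assuming htsjdk's BlockCompressedStreamConstants.EMPTY_GZIP_BLOCK and CramIO.issueEOF are available; the terminateStream helper is hypothetical, not an existing SAMOutputPreparer method:

import java.io.IOException;
import java.io.OutputStream;
import htsjdk.samtools.cram.build.CramIO;
import htsjdk.samtools.cram.common.CramVersions;
import htsjdk.samtools.util.BlockCompressedStreamConstants;
import org.seqdoop.hadoop_bam.SAMFormat;

final class OutputTermination {
    // hypothetical helper: write the format-specific end-of-file marker
    static void terminateStream(OutputStream out, SAMFormat format) throws IOException {
        switch (format) {
            case BAM:
                // BAM ends with an empty BGZF block as the EOF marker
                out.write(BlockCompressedStreamConstants.EMPTY_GZIP_BLOCK);
                break;
            case CRAM:
                // CRAM ends with a special EOF container
                CramIO.issueEOF(CramVersions.DEFAULT_CRAM_VERSION, out);
                break;
            default:
                // plain SAM (text) needs no terminator
                break;
        }
    }
}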

Requests for inclusion in the next Hadoop-BAM release

This ticket is just to record GATK's requests for the PRs we'd like to get into the next release of Hadoop-BAM:

#47
#49
#50
#57

The CRAM writing PR (#57) has a dependency on a not-yet-released version of htsjdk (post-2.0.1), which we expect to release soon; when it's ready I'll submit a PR.

Improve performance for large interval lists

A couple of ideas from #82:

  1. Ignore the intervals for trimming/eliminating splits if there are more than a certain number (1000?), and emit a log message. (The intervals would still be used for filtering of course.)
  2. For each split, discard any interval that doesn't overlap the split. This would be done by the record reader (see the sketch below).
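
As a sketch of idea 2, using htsjdk's Interval (splitExtent is a placeholder for whatever genomic extent the record reader can derive for its split):

import java.util.ArrayList;
import java.util.List;
import htsjdk.samtools.util.Interval;

final class IntervalPruning {
    // keep only the intervals that overlap this split's extent
    static List<Interval> pruneForSplit(List<Interval> intervals, Interval splitExtent) {
        List<Interval> kept = new ArrayList<>();
        for (Interval interval : intervals) {
            if (interval.intersects(splitExtent)) {
                kept.add(interval);
            }
        }
        return kept;
    }
}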

Provide a way to specify a custom Deflater

htsjdk allows for specifying custom deflaters for writing BAM files. We (GATK) would like to use the IntelDeflater from the Intel GKL: https://github.com/Intel-HLS/GKL

This can be done by calling BlockCompressedOutputStream.setDefaultDeflaterFactory(new IntelDeflaterFactory()), or by setting the right deflater factory on the individual writers.
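
A minimal sketch of the first option, assuming the Intel GKL artifact is on the classpath and an htsjdk version that exposes the deflater-factory hook:

import com.intel.gkl.compression.IntelDeflaterFactory;
import htsjdk.samtools.util.BlockCompressedOutputStream;

final class DeflaterSetup {
    static void useIntelDeflater() {
        // every BGZF writer created after this call picks up the Intel deflater
        BlockCompressedOutputStream.setDefaultDeflaterFactory(new IntelDeflaterFactory());
    }
}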

Release Hadoop-BAM 7.7.0

There have been a few issues fixed since the last release, so I'd like to do a new one in the next day or two.

move to Java 8

Since HTSJDK now requires Java 8, we need to ensure compatibility with 1.8 as well. I don't think we want to maintain two versions, so the easiest option would be just to upgrade.

Based on the discussion in samtools/htsjdk#180, it is safe to say the issue will cause controversy.

edit: We should probably start off by working on a separate branch and testing that. Also, improving the test coverage would not hurt.

Read error; BinaryCodec in readmode; streamed file

I am hitting the following error using Hadoop-BAM in Spark. This happens on a hadoop-2.3 cluster when I put the CDH hadoop jars on the classpath:

net.sf.samtools.util.RuntimeIOException: Read error; BinaryCodec in readmode; streamed file (filename not available)
at net.sf.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:397)
at net.sf.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:371)
at net.sf.samtools.util.BinaryCodec.readByteBuffer(BinaryCodec.java:481)
at net.sf.samtools.util.BinaryCodec.readInt(BinaryCodec.java:492)
at net.sf.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:178)
at fi.tkk.ics.hadoop.bam.BAMRecordReader.nextKeyValue(BAMRecordReader.java:176)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:105)

Has anyone else hit this?

Add support for query interval semantics

Add a setQueryIntervals method to BAMInputFormat that takes an htsjdk QueryInterval. A QueryInterval can have 0/-1 as the end position to mean the end of the sequence. (See discussion in #82.)
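
For reference, the htsjdk semantics the method would expose (a sketch; header is a placeholder SAMFileHeader, and the coordinates are illustrative):

import htsjdk.samtools.QueryInterval;
import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.SAMSequenceDictionary;

final class QueryIntervalExample {
    static QueryInterval[] build(SAMFileHeader header) {
        SAMSequenceDictionary dict = header.getSequenceDictionary();
        int chr5 = dict.getSequenceIndex("5");            // QueryInterval uses reference indices
        QueryInterval first2Mb = new QueryInterval(chr5, 1, 2_000_000);
        QueryInterval toEnd = new QueryInterval(chr5, 3_000_000, -1); // -1 = end of sequence
        // sort and merge overlapping/abutting intervals, as htsjdk queries require
        return QueryInterval.optimizeIntervals(new QueryInterval[] { first2Mb, toEnd });
    }
}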

Hadoop BAM : Class not found Error.

We are currently migrating a traditional genome analysis pipeline to Hadoop, partly for learning purposes. We are using BSMAP for bisulfite methylation extraction (methratio.py with input = .bam file and reference = .fa file). I tried to use Hadoop-BAM for this purpose but I couldn't figure out what I am doing wrong here:

  1. I loaded the .bam file into HDFS

but when I try to execute:

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar -libjars hadoop-bam-7.0.0.jar,htsjdk-1.118.jar -inputformat org.seqdoop.hadoop_bam.BAMInputFormat -file ./methratio.py -file '../fadata/Genome.fa' -mapper methratio.py -input ./wgEncodeSydhRnaSeqK562Ifna6hPolyaAln.bam -output ./outfile

I am getting :

Exception in thread "main" java.lang.NoClassDefFoundError: htsjdk/samtools/seekablestream/SeekableStream
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1986)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1951)
at org.apache.hadoop.streaming.StreamUtil.goodClassOrNull(StreamUtil.java:51)
at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:784)
at org.apache.hadoop.streaming.StreamJob.run(StreamJob.java:128)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.ClassNotFoundException: htsjdk.samtools.seekablestream.SeekableStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 15 more

I would really appreciate the help.

OutOfMemoryError in testNoReadsInFirstSplitBug unit test

The testNoReadsInFirstSplitBug unit test fails for me on my lower-spec laptop with an OutOfMemoryError:

$ mvn test
...
-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.seqdoop.hadoop_bam.TestBAMInputFormat
Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 4.126 sec <<< FAILURE! - in org.seqdoop.hadoop_bam.TestBAMInputFormat
testNoReadsInFirstSplitBug(org.seqdoop.hadoop_bam.TestBAMInputFormat)  Time elapsed: 1.086 sec  <<< ERROR!
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3664)
    at java.lang.String.<init>(String.java:207)
    at java.lang.StringBuilder.toString(StringBuilder.java:407)
    at htsjdk.samtools.SAMFileHeader.addComment(SAMFileHeader.java:292)
    at org.seqdoop.hadoop_bam.TestBAMInputFormat.writeBamFileWithLargeHeader(TestBAMInputFormat.java:102)
    at org.seqdoop.hadoop_bam.TestBAMInputFormat.testNoReadsInFirstSplitBug(TestBAMInputFormat.java:127)
...
Results :

Tests in error: 
  TestBAMInputFormat.testNoReadsInFirstSplitBug:127->writeBamFileWithLargeHeader:102 » OutOfMemory

Tests run: 150, Failures: 0, Errors: 1, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------

New release of Hadoop-BAM post PR #25

As stated in that PR, this bug is currently affecting some development of GATK 4. It'd be a big help if you could do a release in the next few days.

Remove support for Hadoop 1

There hasn't been a release of Hadoop 1 since 2013, and all the distributions have been based on Hadoop 2 for a long time now.

This would involve removing references to Hadoop 1 in the documentation, as well as removing the hadoop.org.apache.hadoop.mapreduce.lib packages (which seem to be unused?). It would be good to mark the Hadoop dependencies as provided, although this could be done separately.

BAM output is missing EOF marker

I'm saving a BAM file with the 7.0 BAMOutputFormat and am getting the following warning when reading the file with samtools:

samtools view -H t.bam 
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated.

The file seems to be functionally OK; I get both the header and correct reads (as far as I can tell). I mostly just wanted to raise awareness of this in case it is easy to fix later.
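
For anyone who wants to confirm the symptom from Java rather than samtools, htsjdk has a BGZF termination check (a sketch; the path is a placeholder for the file from the report):

import java.io.File;
import htsjdk.samtools.util.BlockCompressedInputStream;

final class CheckBamEof {
    public static void main(String[] args) throws Exception {
        // HAS_TERMINATOR_BLOCK means the empty BGZF EOF block is present;
        // HAS_HEALTHY_LAST_BLOCK means the data is intact but the marker is
        // missing, which is what triggers the samtools warning.
        System.out.println(BlockCompressedInputStream.checkTermination(new File("t.bam")));
    }
}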

Not recognizing .bai extensions for index files

I have noticed that when I try to access BAM files in HDFS using the GATK, with an index file in the same directory that uses the ".bai" extension, the GATK fails to find the index. However, if the index file has the extension ".bam.bai", it correctly identifies the index.

Requires Java 1.7

Any reason to require at least Java 1.7? Some distributions in common use (e.g., CentOS 6.5) still come with Java 1.6.

add support for reading intervals of VCF file

When reading a VCF file (raw or bgzipped) we need to be able to read intervals of it. Similar to using a .bai for BAM, you could use Tribble or tabix indexes to read parts of the file. Support for this would be very valuable for users (e.g., GATK).
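
In plain htsjdk terms, this is the kind of access pattern it would enable (a local sketch; the paths are placeholders and calls.vcf.gz.tbi is assumed to exist):

import java.io.File;
import htsjdk.samtools.util.CloseableIterator;
import htsjdk.variant.variantcontext.VariantContext;
import htsjdk.variant.vcf.VCFFileReader;

final class QueryVcfInterval {
    public static void main(String[] args) {
        // 'true' = require an index (tabix for a bgzipped VCF)
        try (VCFFileReader reader = new VCFFileReader(new File("calls.vcf.gz"), true);
             CloseableIterator<VariantContext> it = reader.query("20", 1, 100_000)) {
            while (it.hasNext()) {
                VariantContext vc = it.next();
                System.out.println(vc.getContig() + ":" + vc.getStart());
            }
        }
    }
}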

Maven plugin versions out of date

$ mvn org.codehaus.mojo:versions-maven-plugin:2.2:plugin-updates-report
[INFO] Scanning for projects...
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building Hadoop-BAM 7.5.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- versions-maven-plugin:2.2:plugin-updates-report (default-cli) @ hadoop-bam ---
[INFO] artifact org.apache.maven.plugins:maven-gpg-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-enforcer-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-compiler-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-clean-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-deploy-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-install-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-jar-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-resources-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-javadoc-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-shade-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-site-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-surefire-plugin: checking for updates from central
[INFO] artifact org.codehaus.mojo:findbugs-maven-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-antrun-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-release-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-assembly-plugin: checking for updates from central
[INFO] artifact org.apache.maven.plugins:maven-dependency-plugin: checking for updates from central
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9.793 s
[INFO] Finished at: 2016-06-13T10:02:23-05:00
[INFO] Final Memory: 21M/181M
[INFO] ------------------------------------------------------------------------

Then open target/site/plugin-updates-report.html in a browser.

Hadoop-BAM mvn test

When I run the "mvn test" command, I receive the following errors:

Tests in error:
TestBAMOutputFormat.testBAMOutput:152->doMapReduce:258 » FileAlreadyExists Out...
TestBAMOutputFormat.testBAMRoundTrip:217->doMapReduce:258 » FileAlreadyExists ...
TestBAMOutputFormat.testBAMWithSplittingBai:185 » FileSystem /home/lngo/git/Ha...
TestSAMInputFormat.testMapReduceJob:89 » FileAlreadyExists Output directory fi...
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTripWithMerge:197->doMapReduce:249 » FileAlreadyExists
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...
TestVCFRoundTrip.testRoundTrip:144->doMapReduce:249 » FileAlreadyExists Output...

Tests run: 175, Failures: 0, Errors: 19, Skipped: 0

I wonder if this is an issue where multiple test cases receive the same input/output directory; the output directories would need to be cleared between tests to satisfy MapReduce's requirement that the output directory not already exist.

Change htsjdk to avoid SAM header workaround in SAMRecordReader

SAMRecordReader does some tricks to work around htsjdk's requirement that a SAM file start with a header, which doesn't hold when reading part of a file (see WorkaroundingStream). This could be improved by changing htsjdk so that an external header can be supplied to a SamReader.

SAMRecord.setHeader calls should be replaced with setHeaderStrict

SAMRecord.setHeader doesn't validate the sequence dictionary in the header, so it's possible to set the header to a value that doesn't contain the sequence referenced by the record. setHeaderStrict forces resolution of the record's reference and mate reference names using the sequence dictionary in the new header at the time the header is set.
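
A sketch of the difference, assuming an htsjdk version that provides setHeaderStrict:

import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.SAMRecord;

final class Rehead {
    static void rehead(SAMRecord record, SAMFileHeader newHeader) {
        // record.setHeader(newHeader) would not validate against the new
        // header's sequence dictionary, so a missing reference goes unnoticed.
        // setHeaderStrict resolves the reference and mate reference names
        // against the new dictionary and fails fast if they are absent.
        record.setHeaderStrict(newHeader);
    }
}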

Support VariantContextWritable as output value type

When one currently tries to output VariantContextWritables from mappers that contain a properly decoded VariantContext (e.g., one generated by other means), one ends up with an IllegalStateException: "Cannot write fully decoded VariantContext: need lazy genotypes". This is because VariantContextCodec relies on Picard's LazyGenotypeContext to keep the un-decoded genotype data until the context is properly decoded. This should not be required.

This issue was originally reported by @jmthibault79 Thanks!

Improve error message in createRecordReader when file missing

AnySAMInputFormat.createRecordReader throws an IllegalArgumentException if it can't determine the format of a SAM/BAM file. If this is caused by a nonexistent file, it hides the FileNotFoundException, which would be a more useful error message.

This cropped up in a case where a file local to one machine was being specified as an input to a Spark cluster with multiple nodes. The file existed on the master node, but the workers failed because it was unavailable to them.

It would be helpful if the error message mentioned this possibility.
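
One possible shape for the check (a sketch, not the project's code), so that a missing file surfaces as a FileNotFoundException before format detection is attempted:

import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class InputPreflight {
    static void requireExists(Path path, Configuration conf) throws IOException {
        FileSystem fs = path.getFileSystem(conf);
        if (!fs.exists(path)) {
            throw new FileNotFoundException(path + " does not exist; note that on a"
                    + " multi-node cluster the input must be visible to every worker,"
                    + " not just the machine that submitted the job");
        }
    }
}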

see broadinstitute/gatk#1417 for the original issue

block gzipped VCF not recognized as VCF

I have a block-gzipped VCF and I want to read it using a call like this:

ctx.newAPIHadoopFile(
        vcf, VCFInputFormat.class, LongWritable.class, VariantContextWritable.class,
        new Configuration());

VCFFormat.inferFromData wrongly thinks it's a BCF because it's gzipped and starts with a 0x1f byte, and so my code breaks. The VCF is attached. Block-gzipped VCFs should be recognized as VCFs, not BCFs.

Attachment: count_variants.blockgz.gz
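
A sketch of detection that would not trip over this: peek at the decompressed leading bytes instead of the raw gzip 0x1f byte. It assumes htsjdk's BlockCompressedInputStream, and uses the attached file's name as a placeholder path:

import java.io.File;
import java.io.InputStream;
import htsjdk.samtools.util.BlockCompressedInputStream;

final class SniffVcfOrBcf {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new BlockCompressedInputStream(new File("count_variants.blockgz.gz"))) {
            byte[] head = new byte[3];
            int n = in.read(head);
            // BCF starts with the bytes "BCF" after BGZF decompression, whereas a
            // bgzipped VCF decompresses to text beginning "##fileformat=VCF"
            boolean isBcf = n == 3 && head[0] == 'B' && head[1] == 'C' && head[2] == 'F';
            System.out.println(isBcf ? "BCF" : "VCF (or other text)");
        }
    }
}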
