yongtang / hadoop-xz
XZ (LZMA/LZMA2) Codec for Apache Hadoop
License: Apache License 2.0
The provided Spark example from the README for reading xz files is returning unexpected output.
Here's the Spark/Scala code:
def readXzfile() {
  val conf = new SparkConf(true)
    .setAppName("XzUncompressExample")
    .set("spark.shuffle.manager", "SORT")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.akka.frameSize", "50")
    .set("spark.storage.memoryFraction", "0.8")
    .set("spark.cassandra.output.batch.size.rows", "6000")
    .set("spark.executor.extraJavaOptions", "-XX:MaxJavaStackTraceDepth=-1")
    .set("io.compression.codecs", "io.sensesecure.hadoop.xz.XZCodec")
  val sc = new SparkContext(conf)
  // Note: this fresh Configuration does not carry the io.compression.codecs
  // setting made on the SparkConf above.
  val hadoopConfiguration = new Configuration()
  //val file = sc.textFile(fileName.getFileName)
  //val rddOfXz = sc.newAPIHadoopFile("file:///Users/bparman/Perforce/testOldAnalyticsCommons10/gn-perseng/eg-analytics/analytics-commons/src/test/resources/*.xz", classOf[org.apache.hadoop.mapred.TextInputFormat], classOf[org.apache.hadoop.io.LongWritable], classOf[org.apache.hadoop.io.Text], conf)
  val rddOfXz = sc.newAPIHadoopFile("/user/ubuntu/raw/*.xz", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hadoopConfiguration)
  rddOfXz.foreach(println)  // runs on the executors, so output may not appear on the driver
  println("Total number of lines is " + rddOfXz.count())
  rddOfXz.saveAsTextFile("/user/ubuntu/uncompressed")
}
Here's my build file:
name := "detailed-commons"
organization := "com.mycompany.commons"
version := "1.0.2"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"
libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.2.0" % "provided"
libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0" % "provided"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.1"
libraryDependencies ++= Seq(
("io.sensesecure" % "hadoop-xz" % "1.4").
exclude("commons-beanutils", "commons-beanutils-core").
exclude("commons-collections", "commons-collections")
)
publishTo := Some(Resolver.file("detailed-commons-assembly-1.0.2.jar", new File(Path.userHome.absolutePath + "/.ivy2/cache")))
Hi,
I tried using it with Spark to generate an xz-compressed file but cannot make it work. The following is my code:
object GztoXZ {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: GztoXZ gzfile xzfile")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("Gzip transform to xz")
    conf.set("io.compression.codecs", "io.sensesecure.hadoop.xz.XZCodec")
    val spark = new SparkContext(conf)
    println("Start spark task!")
    val inPath = args(0)
    val outPath = args(1)
    val content = spark.textFile(inPath)
    // coalesce to a single partition so one output file is written
    content.coalesce(1, true)
      .saveAsTextFile(outPath, classOf[io.sensesecure.hadoop.xz.XZCodec])
    spark.stop()
  }
}
It always generates an empty result. Is there anything I missed?
Having trouble getting a file from S3 decompressed on the fly. Here is the command we are trying:
/usr/bin/hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
  -libjars /home/hadoop/hadoop-xz-1.0.jar \
  -Dmapred.output.compress=false \
  -Dmapred.compress.map.out=false \
  -Dmapred.input.compression.codec=io.sensesecure.hadoop.xz.XZCodec \
  -Dmapred.map.tasks=1 \
  -Dmapred.reduce.tasks=0 \
  -input s3n://XXX/2013-05-02.txt.xz \
  -output s3n://XXX/20130501_resultt \
  -mapper /bin/cat \
  -reducer /bin/cat
We are basically trying to decompress a file on the fly; this example would just write it back to S3. So far it just writes back the compressed data.
Hello,
I'm trying to read some .xz files in hadoop. I added the code to my project and inserted:
conf.set("io.compression.codecs","io.sensesecure.hadoop.xz.XZCodec");
However, Hadoop does not detect that the file is compressed and returns the binary content as text.
Am I doing something wrong?
Thanks,
roberto.
A sample CSV file:
1,23.9
2,5.6
compressed using hadoop-xz yields:
$ hexdump -C out/000000_0.xz
00000000 fd 37 7a 58 5a 00 00 04 e6 d6 b4 46 02 00 21 01 |.7zXZ......F..!.|
00000010 16 00 00 00 74 2f e5 a3 01 00 0c 31 2c 32 33 2e |....t/.....1,23.|
00000020 39 0a 32 2c 35 2e 36 0a 00 00 00 00 dc 43 5b 17 |9.2,5.6......C[.|
00000030 b5 3f 3f e0 00 01 25 0d 71 19 c4 b6 1f b6 f3 7d |.??...%.q......}|
00000040 01 00 00 00 00 04 59 5a |......YZ|
00000048
The same file, compressed manually using a single core, yields the same result:
$ /software/astro/sl6/xz/5.2.2/bin/xz csv/1.csv -c | hexdump -C
00000000 fd 37 7a 58 5a 00 00 04 e6 d6 b4 46 02 00 21 01 |.7zXZ......F..!.|
00000010 16 00 00 00 74 2f e5 a3 01 00 0c 31 2c 32 33 2e |....t/.....1,23.|
00000020 39 0a 32 2c 35 2e 36 0a 00 00 00 00 dc 43 5b 17 |9.2,5.6......C[.|
00000030 b5 3f 3f e0 00 01 25 0d 71 19 c4 b6 1f b6 f3 7d |.??...%.q......}|
00000040 01 00 00 00 00 04 59 5a |......YZ|
00000048
But compressing it using multiple threads yields a somewhat different file that hadoop-xz is unable to read; it fails with:
java.io.IOException: XZ Stream Footer is corrupt
$ /software/astro/sl6/xz/5.2.2/bin/xz csv/1.csv -T0 -c | hexdump -C
00000000 fd 37 7a 58 5a 00 00 04 e6 d6 b4 46 04 c0 11 0d |.7zXZ......F....|
00000010 21 01 16 00 00 00 00 00 00 00 00 00 88 88 cd 68 |!..............h|
00000020 01 00 0c 31 2c 32 33 2e 39 0a 32 2c 35 2e 36 0a |...1,23.9.2,5.6.|
00000030 00 00 00 00 dc 43 5b 17 b5 3f 3f e0 00 01 2d 0d |.....C[..??...-.|
00000040 79 93 1d 7e 1f b6 f3 7d 01 00 00 00 00 04 59 5a |y..~...}......YZ|
00000050
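The three dumps above can be compared mechanically. The following is a minimal Java sketch (not part of hadoop-xz; the byte strings are transcribed from the hexdumps above, and `XzDumpCheck` is a name invented for illustration). It confirms that the single-threaded xz output matches the hadoop-xz output byte for byte, and that the multi-threaded output still carries the standard XZ stream header magic (`fd 37 7a 58 5a 00`) and footer magic (`59 5a`) even though its internal block layout differs, which is the part hadoop-xz apparently trips over.

```java
import java.util.Arrays;

public class XzDumpCheck {
    // Transcribed from the hexdumps above; hadoop-xz and single-threaded xz
    // produced identical bytes, so one copy stands for both.
    public static final String SINGLE =
        "fd377a585a000004e6d6b44602002101" +
        "16000000742fe5a301000c312c32332e" +
        "390a322c352e360a00000000dc435b17" +
        "b53f3fe00001250d7119c4b61fb6f37d" +
        "010000000004595a";
    // Output of xz -T0 (multi-threaded), transcribed from the dump above.
    public static final String MULTI =
        "fd377a585a000004e6d6b44604c0110d" +
        "2101160000000000000000008888cd68" +
        "01000c312c32332e390a322c352e360a" +
        "00000000dc435b17b53f3fe000012d0d" +
        "79931d7e1fb6f37d010000000004595a";

    public static byte[] hex(String s) {
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++)
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        return out;
    }

    /** XZ stream header magic: fd 37 7a 58 5a 00 ("\xfd" + "7zXZ" + NUL). */
    public static boolean hasXzMagic(byte[] b) {
        byte[] magic = {(byte) 0xfd, '7', 'z', 'X', 'Z', 0};
        return b.length >= 6 && Arrays.equals(Arrays.copyOf(b, 6), magic);
    }

    /** An XZ stream footer ends with the two magic bytes 59 5a ("YZ"). */
    public static boolean hasFooterMagic(byte[] b) {
        return b.length >= 2 && b[b.length - 2] == 'Y' && b[b.length - 1] == 'Z';
    }

    public static void main(String[] args) {
        byte[] single = hex(SINGLE), multi = hex(MULTI);
        // Both files are XZ streams at the container level...
        System.out.println(hasXzMagic(single) && hasFooterMagic(single)); // true
        System.out.println(hasXzMagic(multi) && hasFooterMagic(multi));   // true
        // ...but the multi-threaded stream has a different block layout.
        System.out.println(Arrays.equals(single, multi));                 // false
    }
}
```

So the -T0 output is not malformed per se; it is a valid XZ stream whose extra block/index structure this version of the reader does not accept.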
When I try to process data with LZMA, it reads twice as much data as I actually have on HDFS.
For example, the hadoop client (hadoop fs -du) shows a number like 100 GB.
Then I run an MR job (e.g. select count(1)) over this data, check the MR counters, and find "HDFS bytes read" is twice as large (around 200 GB).
With the gzip and bzip2 codecs, the hadoop client file size and the MR counters are similar.
Configuration conf = new Configuration();
conf.set("io.compression.codecs","io.sensesecure.hadoop.xz.XZCodec");
Is this the correct way to do compression in MapReduce? When I decompress the output using Java's XZInputStream, the decompressed file is always empty.
Thanks
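For what it's worth, a common cause of an empty or truncated compressed file is that the compression stream is never closed, so buffered data and the stream trailer never get flushed. hadoop-xz is not available in this sketch, so the JDK's GZIPOutputStream/GZIPInputStream stand in for the XZ streams (the class name `UncloseDemo` and the helper methods are invented for illustration); the failure mode is analogous: a stream written without close() cannot be read back, while a properly closed one round-trips.

```java
import java.io.*;
import java.util.zip.*;

public class UncloseDemo {
    /** Compress text, optionally "forgetting" to close the stream. */
    public static byte[] compress(String text, boolean close) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(buf);
        gz.write(text.getBytes("UTF-8"));
        if (close) gz.close();  // flushes the deflater and writes the trailer
        return buf.toByteArray();
    }

    /** Decompress, returning null if the stream is truncated or corrupt. */
    public static String decompress(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] b = new byte[4096];
            int n;
            while ((n = in.read(b)) != -1) out.write(b, 0, n);
            return out.toString("UTF-8");
        } catch (IOException e) {
            return null;  // e.g. EOFException: unexpected end of input
        }
    }

    public static void main(String[] args) throws IOException {
        // Properly closed stream round-trips:
        System.out.println(decompress(compress("1,23.9\n2,5.6\n", true)));
        // Unclosed stream contains only the header; reading it back fails:
        System.out.println(decompress(compress("1,23.9\n2,5.6\n", false)));
    }
}
```

If the MapReduce output stream is closed correctly, the next thing to check is whether XZInputStream is being pointed at the actual part files rather than the output directory itself.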
Hi, the following command throws an UnsupportedOperationException from XZCompressor:
import org.apache.hadoop.io.Text
sc.parallelize(Array("1", "2", "3")).map(num => (new Text(num), new Text(num))).saveAsSequenceFile("<path>", Some(classOf[io.sensesecure.hadoop.xz.XZCodec]))