
hadoop-xz's People

Contributors

shaunsenecal, yongtang


hadoop-xz's Issues

Provided Spark example expected output?

The Spark example from the README for reading .xz files returns output where:

  • the number of lines doesn't match the uncompressed input
  • the output looks like a raw byte stream:
    ....
    (4377758,�����b�H�n��8NĂ��6�z.RS��6�q>����@�⧚2u�oX�+�׃�,�=E�(�X�1͜���v郕����ch�U{0PT�Hz�1`uX荲�͉�2q�N�l{�c6��Z�\�� M��&��]s^���P��$��+u|��=���Xh�<|�*)
    (4377930,��KJ�0�Q0d������ִ��RVY(�o�����V�<I�8��M�6��cԖ�>,k)
    ...

Is this the expected format? Why are there fewer lines in the output than in the uncompressed input?

Here's the Spark/Scala code:

def readXzfile() {

  val conf = new SparkConf(true)
    .setAppName("XzUncompressExample")
    .set("spark.shuffle.manager", "SORT")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.akka.frameSize", "50")
    .set("spark.storage.memoryFraction", "0.8")
    .set("spark.cassandra.output.batch.size.rows", "6000")
    .set("spark.executor.extraJavaOptions", "-XX:MaxJavaStackTraceDepth=-1")
    .set("io.compression.codecs", "io.sensesecure.hadoop.xz.XZCodec")

  val sc = new SparkContext(conf)

  val hadoopConfiguration = new Configuration()

  //val file = sc.textFile(fileName.getFileName)

  //val rddOfXz = sc.newAPIHadoopFile("file:///Users/bparman/Perforce/testOldAnalyticsCommons10/gn-perseng/eg-analytics/analytics-commons/src/test/resources/*.xz", classOf[org.apache.hadoop.mapred.TextInputFormat], classOf[org.apache.hadoop.io.LongWritable], classOf[org.apache.hadoop.io.Text], conf)

  val rddOfXz = sc.newAPIHadoopFile("/user/ubuntu/raw/*.xz",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    hadoopConfiguration)

  rddOfXz.foreach(println)

  println("Total number of lines is " + rddOfXz.count())

  rddOfXz.saveAsTextFile("/user/ubuntu/uncompressed")

}

Here's my build file:

name := "detailed-commons"

organization := "com.mycompany.commons"

version := "1.0.2"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"

libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.2.0" % "provided"

libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0" % "provided"

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.1"

libraryDependencies ++= Seq(
("io.sensesecure" % "hadoop-xz" % "1.4").
exclude("commons-beanutils", "commons-beanutils-core").
exclude("commons-collections", "commons-collections")
)

publishTo := Some(Resolver.file("detailed-commons-assembly-1.0.2.jar", new File( Path.userHome.absolutePath+"/.ivy2/cache" )) )
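One thing worth checking (an assumption on my part, not something confirmed in this thread): `io.compression.codecs` is set on the `SparkConf`, but the fresh Hadoop `Configuration` actually passed to `newAPIHadoopFile` never receives it, so the `.xz` bytes may be handed to `TextInputFormat` undecoded. A minimal sketch of propagating the setting onto the Hadoop `Configuration` instead:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: register the codec on the Configuration that is handed to
// newAPIHadoopFile, instead of (or in addition to) the SparkConf, so that
// Hadoop's codec lookup sees it when the splits are read.
object ReadXzSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("XzUncompressExample"))

    val hadoopConfiguration = new Configuration()
    hadoopConfiguration.set("io.compression.codecs",
      "io.sensesecure.hadoop.xz.XZCodec")

    val rddOfXz = sc.newAPIHadoopFile("/user/ubuntu/raw/*.xz",
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      hadoopConfiguration)

    println("Total number of lines is " + rddOfXz.count())
    sc.stop()
  }
}
```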

Cannot generate xz file using Spark

Hi,
I tried using it with Spark to generate an xz-compressed file, but I cannot make it work. Here is my code:
object GztoXZ {
  def main(args: Array[String]) {

    if (args.length < 2) {
      System.err.println("Usage: GztoXZ gzfile xzfile")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("Gzip transform to xz")
    conf.set("io.compression.codecs", "io.sensesecure.hadoop.xz.XZCodec")
    val spark = new SparkContext(conf)
    println("Start spark task!")

    val inPath = args(0)
    val outPath = args(1)

    val content = spark.textFile(inPath)

    content.coalesce(1, true)
      .saveAsTextFile(outPath, classOf[io.sensesecure.hadoop.xz.XZCodec])

    spark.stop()
  }
}

It always generates an empty result. Is there anything I missed?

Mapreduce decompress s3 file

Having trouble getting a file from S3 decompressed on the fly. Here is the command we are trying:

/usr/bin/hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
  -libjars /home/hadoop/hadoop-xz-1.0.jar \
  -Dmapred.output.compress=false \
  -Dmapred.compress.map.out=false \
  -Dmapred.input.compression.codec=io.sensesecure.hadoop.xz.XZCodec \
  -Dmapred.map.tasks=1 \
  -Dmapred.reduce.tasks=0 \
  -input s3n://XXX/2013-05-02.txt.xz \
  -output s3n://XXX/20130501_resultt \
  -mapper /bin/cat \
  -reducer /bin/cat

We are basically trying to decompress a file on the fly; this example would just write it back to S3. So far it only writes back the compressed data.

Hadoop does not detect the .xz files

Hello,
I'm trying to read some .xz files in Hadoop. I added the code to my project and inserted:
conf.set("io.compression.codecs","io.sensesecure.hadoop.xz.XZCodec");
However, Hadoop does not detect that the file is compressed, and it returns the binary data as text.
Am I doing something wrong?
Thanks,
Roberto
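For context: Hadoop picks a codec for an input file by matching the file name's extension against the codecs registered via `io.compression.codecs` (through `CompressionCodecFactory`), so the files must actually end in `.xz` and the codec class must be on the classpath. The lookup idea can be sketched like this (a stdlib-only simplification; the map below is a stand-in for the real registry, not Hadoop's API):

```scala
// Illustrative sketch of extension-based codec lookup. The class names are
// the real ones; the registry map and lookup logic are simplifications of
// what CompressionCodecFactory builds from io.compression.codecs.
object CodecLookup {
  val registry = Map(
    ".gz"  -> "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2" -> "org.apache.hadoop.io.compress.BZip2Codec",
    ".xz"  -> "io.sensesecure.hadoop.xz.XZCodec"
  )

  // Return the codec class registered for the file's extension, if any.
  // A file with no matching suffix is read as plain (uncompressed) text.
  def codecFor(fileName: String): Option[String] =
    registry.collectFirst {
      case (suffix, codec) if fileName.endsWith(suffix) => codec
    }
}
```

If `codecFor` would return `None` for your file names (e.g. the extension was lost on upload), Hadoop will hand the raw compressed bytes to the record reader, which matches the symptom described above.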

Cannot read compressed files using multithreading

A sample CSV file:

1,23.9
2,5.6

compressed using hadoop-xz yields:

$ hexdump -C out/000000_0.xz 
00000000  fd 37 7a 58 5a 00 00 04  e6 d6 b4 46 02 00 21 01  |.7zXZ......F..!.|
00000010  16 00 00 00 74 2f e5 a3  01 00 0c 31 2c 32 33 2e  |....t/.....1,23.|
00000020  39 0a 32 2c 35 2e 36 0a  00 00 00 00 dc 43 5b 17  |9.2,5.6......C[.|
00000030  b5 3f 3f e0 00 01 25 0d  71 19 c4 b6 1f b6 f3 7d  |.??...%.q......}|
00000040  01 00 00 00 00 04 59 5a                           |......YZ|
00000048

The same file, compressed manually using a single core, yields the same result:

$ /software/astro/sl6/xz/5.2.2/bin/xz csv/1.csv -c | hexdump -C
00000000  fd 37 7a 58 5a 00 00 04  e6 d6 b4 46 02 00 21 01  |.7zXZ......F..!.|
00000010  16 00 00 00 74 2f e5 a3  01 00 0c 31 2c 32 33 2e  |....t/.....1,23.|
00000020  39 0a 32 2c 35 2e 36 0a  00 00 00 00 dc 43 5b 17  |9.2,5.6......C[.|
00000030  b5 3f 3f e0 00 01 25 0d  71 19 c4 b6 1f b6 f3 7d  |.??...%.q......}|
00000040  01 00 00 00 00 04 59 5a                           |......YZ|
00000048

But compressing it using multiple threads yields a somewhat different file that hadoop-xz is unable to read; it fails with:

java.io.IOException: XZ Stream Footer is corrupt

$ /software/astro/sl6/xz/5.2.2/bin/xz csv/1.csv -T0 -c | hexdump -C
00000000  fd 37 7a 58 5a 00 00 04  e6 d6 b4 46 04 c0 11 0d  |.7zXZ......F....|
00000010  21 01 16 00 00 00 00 00  00 00 00 00 88 88 cd 68  |!..............h|
00000020  01 00 0c 31 2c 32 33 2e  39 0a 32 2c 35 2e 36 0a  |...1,23.9.2,5.6.|
00000030  00 00 00 00 dc 43 5b 17  b5 3f 3f e0 00 01 2d 0d  |.....C[..??...-.|
00000040  79 93 1d 7e 1f b6 f3 7d  01 00 00 00 00 04 59 5a  |y..~...}......YZ|
00000050
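For reference, the .xz container brackets every stream with fixed magic bytes: the header starts with `FD 37 7A 58 5A 00` and the 12-byte stream footer ends with `59 5A` ("YZ"). Both dumps above carry those bytes, so the "XZ Stream Footer is corrupt" error presumably trips on the footer's backward-size/CRC fields or the block index rather than the trailing magic (my reading of the format spec, not verified against the hadoop-xz source). A stdlib-only sketch of the magic check:

```scala
// Stdlib-only sketch: check the fixed XZ magic bytes per the .xz file
// format spec (header magic FD 37 7A 58 5A 00; the stream footer ends
// with the two bytes 59 5A, i.e. "YZ").
object XzMagic {
  val HeaderMagic = List(0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00)
  val FooterMagic = List(0x59, 0x5A)

  def hasValidMagic(bytes: Array[Int]): Boolean =
    bytes.length >= HeaderMagic.length + FooterMagic.length &&
      bytes.take(HeaderMagic.length).toList == HeaderMagic &&
      bytes.takeRight(FooterMagic.length).toList == FooterMagic
}
```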

Reading two times more data than we have on HDFS

When I try to process data with LZMA, it reads in twice as much data as I actually have on HDFS.
For example, the Hadoop client (hadoop fs -du) shows a size of around 100GB.
Then I run an MR job (like select count(1)) over this data, check the MR counters, and find "HDFS bytes read" is twice as large (around 200GB).
With the gzip and bzip2 codecs, the Hadoop client file size and the MR counters are similar.

Example in mapreduce

Configuration conf = new Configuration();
conf.set("io.compression.codecs","io.sensesecure.hadoop.xz.XZCodec");

Is it correct to do compression in MapReduce this way? When I decompress the output using Java's XZInputStream,
the decompressed file is always empty.

Thanks
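For what it's worth, a standalone way to sanity-check a produced part file outside MapReduce, sketched under the assumption that the XZ for Java library (`org.tukaani:xz`, which I believe hadoop-xz builds on) is on the classpath; the file path argument is whatever part file the job wrote:

```scala
import java.io.{BufferedInputStream, FileInputStream}
import org.tukaani.xz.XZInputStream
import scala.io.Source

// Sketch: decompress a .xz part file directly with XZ for Java and count
// its lines. If this prints 0 lines, the writer side never finished or
// flushed the XZ stream.
object ReadBackXz {
  def main(args: Array[String]): Unit = {
    val in = new XZInputStream(
      new BufferedInputStream(new FileInputStream(args(0))))
    try {
      val lines = Source.fromInputStream(in, "UTF-8").getLines().size
      println(s"decompressed $lines lines")
    } finally in.close()
  }
}
```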

Can't save as sequence file

Hi, the following code throws an UnsupportedOperationException from XZCompressor:

import org.apache.hadoop.io.Text
sc.parallelize(Array("1", "2", "3"))
  .map(num => (new Text(num), new Text(num)))
  .saveAsSequenceFile("<path>", Some(classOf[io.sensesecure.hadoop.xz.XZCodec]))
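A plausible explanation (an assumption; I have not traced the hadoop-xz source): SequenceFile writers obtain a `Compressor` from the codec pool, and `XZCompressor` appears to implement those methods by throwing, whereas plain-text output goes through `codec.createOutputStream`, which does work. If a sequence file is not a hard requirement, a text-based sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: write xz-compressed plain text instead of a sequence file, which
// avoids the Compressor interface entirely. "<path>" is the placeholder
// from the original report.
object SaveAsXzText {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SaveAsXzText"))
    sc.parallelize(Array("1", "2", "3"))
      .map(num => s"$num\t$num")
      .saveAsTextFile("<path>", classOf[io.sensesecure.hadoop.xz.XZCodec])
    sc.stop()
  }
}
```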
