
hadoop-xz's People

Contributors

shaunsenecal, yongtang


hadoop-xz's Issues

Provided Spark example expected output?

The Spark example from the README for reading .xz files returns output where:

  • the number of lines doesn't match the uncompressed input
  • the output looks like a raw byte stream:
    ....
    (4377758,�����b�H�n��8NĂ��6�z.RS��6�q>����@�⧚2u�oX�+�׃�,�=E�(�X�1͜���v郕����ch�U{0PT�Hz�1`uX荲�͉�2q�N�l{�c6��Z�\�� M��&��]s^���P��$��+u|��=���Xh�<|�*)
    (4377930,��KJ�0�Q0d������ִ��RVY(�o�����V�<I�8��M�6��cԖ�>,k)
    ...

Is this the expected format? Why are there fewer lines in the output than in the uncompressed input?

Here's the Spark/Scala code:

def readXzfile() {

  val conf = new SparkConf(true)
    .setAppName("XzUncompressExample")
    .set("spark.shuffle.manager", "SORT")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.akka.frameSize", "50")
    .set("spark.storage.memoryFraction", "0.8")
    .set("spark.cassandra.output.batch.size.rows", "6000")
    .set("spark.executor.extraJavaOptions", "-XX:MaxJavaStackTraceDepth=-1")
    .set("io.compression.codecs", "io.sensesecure.hadoop.xz.XZCodec")

  val sc = new SparkContext(conf)

  val hadoopConfiguration = new Configuration()

  //val file = sc.textFile(fileName.getFileName)

  //val rddOfXz = sc.newAPIHadoopFile("file:///Users/bparman/Perforce/testOldAnalyticsCommons10/gn-perseng/eg-analytics/analytics-commons/src/test/resources/*.xz", classOf[org.apache.hadoop.mapred.TextInputFormat], classOf[org.apache.hadoop.io.LongWritable], classOf[org.apache.hadoop.io.Text], conf)

  val rddOfXz = sc.newAPIHadoopFile("/user/ubuntu/raw/*.xz",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    hadoopConfiguration)

  rddOfXz.foreach(println)

  println("Total number of lines is " + rddOfXz.count())

  rddOfXz.saveAsTextFile("/user/ubuntu/uncompressed")

}

Here's my build file:

name := "detailed-commons"

organization := "com.mycompany.commons"

version := "1.0.2"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"

libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.2.0" % "provided"

libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0" % "provided"

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.1"

libraryDependencies ++= Seq(
("io.sensesecure" % "hadoop-xz" % "1.4").
exclude("commons-beanutils", "commons-beanutils-core").
exclude("commons-collections", "commons-collections")
)

publishTo := Some(Resolver.file("detailed-commons-assembly-1.0.2.jar", new File( Path.userHome.absolutePath+"/.ivy2/cache" )) )
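One thing worth checking (an assumption on my part, not something confirmed in this thread): `io.compression.codecs` is set on the `SparkConf`, but the fresh Hadoop `Configuration` actually passed to `newAPIHadoopFile` never receives it, so the `.xz` bytes may be handed to `TextInputFormat` undecoded. A minimal sketch of propagating the setting onto the Hadoop `Configuration` instead:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: register the codec on the Configuration that is handed to
// newAPIHadoopFile, instead of (or in addition to) the SparkConf, so that
// Hadoop's codec lookup sees it when the splits are read.
object ReadXzSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("XzUncompressExample"))

    val hadoopConfiguration = new Configuration()
    hadoopConfiguration.set("io.compression.codecs",
      "io.sensesecure.hadoop.xz.XZCodec")

    val rddOfXz = sc.newAPIHadoopFile("/user/ubuntu/raw/*.xz",
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      hadoopConfiguration)

    println("Total number of lines is " + rddOfXz.count())
    sc.stop()
  }
}
```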

Cannot generate xz file using Spark

Hi,
I tried using it with Spark to generate an xz-compressed file, but I cannot make it work. Here is my code:
object GztoXZ {
  def main(args: Array[String]) {

    if (args.length < 2) {
      System.err.println("Usage: GztoXZ gzfile xzfile")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("Gzip transform to xz")
    conf.set("io.compression.codecs", "io.sensesecure.hadoop.xz.XZCodec")
    val spark = new SparkContext(conf)
    println("Start spark task!")

    val inPath = args(0)
    val outPath = args(1)

    val content = spark.textFile(inPath)

    content.coalesce(1, true)
      .saveAsTextFile(outPath, classOf[io.sensesecure.hadoop.xz.XZCodec])

    spark.stop()
  }
}

It always generates an empty result. Is there anything I missed?

Mapreduce decompress s3 file

Having trouble getting a file from S3 decompressed on the fly. Here is the command we are trying:

/usr/bin/hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
  -libjars /home/hadoop/hadoop-xz-1.0.jar \
  -Dmapred.output.compress=false \
  -Dmapred.compress.map.out=false \
  -Dmapred.input.compression.codec=io.sensesecure.hadoop.xz.XZCodec \
  -Dmapred.map.tasks=1 \
  -Dmapred.reduce.tasks=0 \
  -input s3n://XXX/2013-05-02.txt.xz \
  -output s3n://XXX/20130501_resultt \
  -mapper /bin/cat \
  -reducer /bin/cat

We are basically trying to decompress a file on the fly; this example would just write it back to S3. So far it only writes back the compressed data.

Hadoop does not detect the .xz files

Hello,
I'm trying to read some .xz files in Hadoop. I added the code to my project and inserted:
conf.set("io.compression.codecs","io.sensesecure.hadoop.xz.XZCodec");
However, Hadoop does not detect that the file is compressed, and it returns the binary data as text.
Am I doing something wrong?
Thanks,
Roberto
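For context: Hadoop picks a codec for an input file by matching the file name's extension against the codecs registered via `io.compression.codecs` (through `CompressionCodecFactory`), so the files must actually end in `.xz` and the codec class must be on the classpath. The lookup idea can be sketched like this (a stdlib-only simplification; the map below is a stand-in for the real registry, not Hadoop's API):

```scala
// Illustrative sketch of extension-based codec lookup. The class names are
// the real ones; the registry map and lookup logic are simplifications of
// what CompressionCodecFactory builds from io.compression.codecs.
object CodecLookup {
  val registry = Map(
    ".gz"  -> "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2" -> "org.apache.hadoop.io.compress.BZip2Codec",
    ".xz"  -> "io.sensesecure.hadoop.xz.XZCodec"
  )

  // Return the codec class registered for the file's extension, if any.
  // A file with no matching suffix is read as plain (uncompressed) text.
  def codecFor(fileName: String): Option[String] =
    registry.collectFirst {
      case (suffix, codec) if fileName.endsWith(suffix) => codec
    }
}
```

If `codecFor` would return `None` for your file names (e.g. the extension was lost on upload), Hadoop will hand the raw compressed bytes to the record reader, which matches the symptom described above.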

Cannot read compressed files using multithreading

A sample CSV file:

1,23.9
2,5.6

compressed using hadoop-xz yields:

$ hexdump -C out/000000_0.xz 
00000000  fd 37 7a 58 5a 00 00 04  e6 d6 b4 46 02 00 21 01  |.7zXZ......F..!.|
00000010  16 00 00 00 74 2f e5 a3  01 00 0c 31 2c 32 33 2e  |....t/.....1,23.|
00000020  39 0a 32 2c 35 2e 36 0a  00 00 00 00 dc 43 5b 17  |9.2,5.6......C[.|
00000030  b5 3f 3f e0 00 01 25 0d  71 19 c4 b6 1f b6 f3 7d  |.??...%.q......}|
00000040  01 00 00 00 00 04 59 5a                           |......YZ|
00000048

The same file, compressed manually using a single core, yields the same result:

$ /software/astro/sl6/xz/5.2.2/bin/xz csv/1.csv -c | hexdump -C
00000000  fd 37 7a 58 5a 00 00 04  e6 d6 b4 46 02 00 21 01  |.7zXZ......F..!.|
00000010  16 00 00 00 74 2f e5 a3  01 00 0c 31 2c 32 33 2e  |....t/.....1,23.|
00000020  39 0a 32 2c 35 2e 36 0a  00 00 00 00 dc 43 5b 17  |9.2,5.6......C[.|
00000030  b5 3f 3f e0 00 01 25 0d  71 19 c4 b6 1f b6 f3 7d  |.??...%.q......}|
00000040  01 00 00 00 00 04 59 5a                           |......YZ|
00000048

But compressing it using multiple threads yields a somewhat different file that hadoop-xz is unable to read; it fails with:

java.io.IOException: XZ Stream Footer is corrupt

$ /software/astro/sl6/xz/5.2.2/bin/xz csv/1.csv -T0 -c | hexdump -C
00000000  fd 37 7a 58 5a 00 00 04  e6 d6 b4 46 04 c0 11 0d  |.7zXZ......F....|
00000010  21 01 16 00 00 00 00 00  00 00 00 00 88 88 cd 68  |!..............h|
00000020  01 00 0c 31 2c 32 33 2e  39 0a 32 2c 35 2e 36 0a  |...1,23.9.2,5.6.|
00000030  00 00 00 00 dc 43 5b 17  b5 3f 3f e0 00 01 2d 0d  |.....C[..??...-.|
00000040  79 93 1d 7e 1f b6 f3 7d  01 00 00 00 00 04 59 5a  |y..~...}......YZ|
00000050
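For reference, the .xz container brackets every stream with fixed magic bytes: the header starts with `FD 37 7A 58 5A 00` and the 12-byte stream footer ends with `59 5A` ("YZ"). Both dumps above carry those bytes, so the "XZ Stream Footer is corrupt" error presumably trips on the footer's backward-size/CRC fields or the block index rather than the trailing magic (my reading of the format spec, not verified against the hadoop-xz source). A stdlib-only sketch of the magic check:

```scala
// Stdlib-only sketch: check the fixed XZ magic bytes per the .xz file
// format spec (header magic FD 37 7A 58 5A 00; the stream footer ends
// with the two bytes 59 5A, i.e. "YZ").
object XzMagic {
  val HeaderMagic = List(0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00)
  val FooterMagic = List(0x59, 0x5A)

  def hasValidMagic(bytes: Array[Int]): Boolean =
    bytes.length >= HeaderMagic.length + FooterMagic.length &&
      bytes.take(HeaderMagic.length).toList == HeaderMagic &&
      bytes.takeRight(FooterMagic.length).toList == FooterMagic
}
```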

Reading two times more data than we have on HDFS

When I try to process data with LZMA, it reads in twice as much data as I actually have on HDFS.
For example, the Hadoop client (hadoop fs -du) shows a size of around 100GB.
Then I run an MR job (like select count(1)) over this data, check the MR counters, and find "HDFS bytes read" is twice as large (around 200GB).
With the gzip and bzip2 codecs, the Hadoop client file size and the MR counters are similar.

Example in mapreduce

Configuration conf = new Configuration();
conf.set("io.compression.codecs","io.sensesecure.hadoop.xz.XZCodec");

Is it correct to do compression in MapReduce this way? When I decompress the output using Java's XZInputStream,
the decompressed file is always empty.

Thanks
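For what it's worth, a standalone way to sanity-check a produced part file outside MapReduce, sketched under the assumption that the XZ for Java library (`org.tukaani:xz`, which I believe hadoop-xz builds on) is on the classpath; the file path argument is whatever part file the job wrote:

```scala
import java.io.{BufferedInputStream, FileInputStream}
import org.tukaani.xz.XZInputStream
import scala.io.Source

// Sketch: decompress a .xz part file directly with XZ for Java and count
// its lines. If this prints 0 lines, the writer side never finished or
// flushed the XZ stream.
object ReadBackXz {
  def main(args: Array[String]): Unit = {
    val in = new XZInputStream(
      new BufferedInputStream(new FileInputStream(args(0))))
    try {
      val lines = Source.fromInputStream(in, "UTF-8").getLines().size
      println(s"decompressed $lines lines")
    } finally in.close()
  }
}
```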

Can't save as sequence file

Hi, the following code throws an UnsupportedOperationException from XZCompressor:

import org.apache.hadoop.io.Text
sc.parallelize(Array("1", "2", "3"))
  .map(num => (new Text(num), new Text(num)))
  .saveAsSequenceFile("<path>", Some(classOf[io.sensesecure.hadoop.xz.XZCodec]))
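A plausible explanation (an assumption; I have not traced the hadoop-xz source): SequenceFile writers obtain a `Compressor` from the codec pool, and `XZCompressor` appears to implement those methods by throwing, whereas plain-text output goes through `codec.createOutputStream`, which does work. If a sequence file is not a hard requirement, a text-based sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: write xz-compressed plain text instead of a sequence file, which
// avoids the Compressor interface entirely. "<path>" is the placeholder
// from the original report.
object SaveAsXzText {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SaveAsXzText"))
    sc.parallelize(Array("1", "2", "3"))
      .map(num => s"$num\t$num")
      .saveAsTextFile("<path>", classOf[io.sensesecure.hadoop.xz.XZCodec])
    sc.stop()
  }
}
```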
