klbostee / feathers Goto Github PK


Java classes that can be useful for Dumbo programs that run on Hadoop Streaming.

Java 97.49% Shell 2.51%

feathers's People

Contributors

Stargazers

Watchers

Forkers

andrix brunovianarezende dangra yuanke evilkirin arbenson stevecaldwell77 piccolbo wrightrocket tmacam wowgeeker vijayeluri

feathers's Issues

add RawFileOutputFormat.java to dumbo

RawFileOutputFormat is an output format that writes raw byte files (with no TypedBytes wrapping). We've been using it, and it has proved useful for creating tar files and Tokyo Cabinet stores in dumbo. I couldn't find a way to attach a file to the issue, so I'll just copy and paste the class here:

package fm.last.feathers.output;

import java.io.ByteArrayInputStream;
import java.io.DataOutputStream;
import java.io.DataInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataOutputStream;

import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.util.Progressable;

public class RawFileOutputFormat<K, V> extends FileOutputFormat<K, V> {

    protected static class RawFileRecordWriter<K, V>
            implements RecordWriter<K, V> {
        protected DataOutputStream out;

        public RawFileRecordWriter(DataOutputStream out) {
            this.out = out;
        }

        // Strips the typed bytes header (1-byte type code + 4-byte length)
        // and returns the payload, or null if the value is empty.
        private byte[] readRawBytes(BytesWritable bv) throws IOException {
            // getBytes() returns the backing array, which can be longer than
            // the valid data, so limit the stream to getLength() bytes.
            ByteArrayInputStream bais =
                    new ByteArrayInputStream(bv.getBytes(), 0, bv.getLength());
            DataInputStream in = new DataInputStream(bais);
            try {
                in.readUnsignedByte(); // read the type code and discard it
            } catch (java.io.EOFException eof) {
                return null;
            }
            int length = in.readInt();
            byte[] bytes = new byte[length];
            in.readFully(bytes);
            return bytes;
        }

        public synchronized void write(K key, V value) throws IOException {
            boolean nullValue = (value == null) || value instanceof NullWritable;
            if (!nullValue) {
                BytesWritable bv = (BytesWritable) value;
                byte[] raw = readRawBytes(bv);
                if (raw != null) { // skip empty values instead of passing null to write()
                    out.write(raw);
                }
            }
        }

        public synchronized void close(Reporter reporter) throws IOException {
            out.close();
        }
    }

    @Override
    public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
            String name, Progressable progress) throws IOException {
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        FileSystem fs = file.getFileSystem(job);
        FSDataOutputStream fileOut = fs.create(file, progress);
        return new RawFileRecordWriter<K, V>(fileOut);
    }
}
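The header stripping done by readRawBytes can be illustrated outside Hadoop. The following is a minimal Python sketch, assuming the value is a typed bytes byte sequence laid out as a 1-byte type code (0 for raw bytes), a 4-byte big-endian length, and then the payload. The function names `encode_typed_bytes` and `read_raw_bytes` are hypothetical helpers for illustration and are not part of feathers or dumbo.

```python
import struct

BYTES_TYPE_CODE = 0  # typed bytes type code for a raw byte sequence


def encode_typed_bytes(payload: bytes) -> bytes:
    """Wrap a payload as: 1-byte type code, 4-byte big-endian length, payload."""
    return struct.pack(">BI", BYTES_TYPE_CODE, len(payload)) + payload


def read_raw_bytes(record: bytes):
    """Drop the 5-byte header and return the payload (None for an empty record)."""
    if not record:
        return None  # mirrors the Java method returning null at end of input
    if len(record) < 5:
        raise ValueError("record too short to hold a typed bytes header")
    (length,) = struct.unpack_from(">I", record, 1)  # skip the type code byte
    payload = record[5:5 + length]
    if len(payload) != length:
        raise ValueError("record is truncated")
    return payload


if __name__ == "__main__":
    wrapped = encode_typed_bytes(b"tar data")
    print(read_raw_bytes(wrapped))
```

Like the Java method, this sketch discards the type code without checking it, so it only makes sense when every value is known to be a byte sequence.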

Is it possible to use MultipleTextOutputs without the key appearing in the result?

For example, in a simple case like this:

# assumes opt and run are importable from the dumbo package
from dumbo import opt, run

@opt("getpath", "yes")
def mapper(key, value):
    path = value[0:5]
    yield (path, None), value

if __name__ == "__main__":
    run(mapper)

dumbo start example.py -input /x -output /y -hadoop /hadoop -libjar /home/zoltanctoth/env-dumbo/feathers/feathers.jar -outputformat text -numReduceTasks 0

Is it possible to have an output file whose rows are written simply as <value> rather than as <key>\t<value>?

I tried "" and <backspace> in place of None in yield (path, None), value, but neither seems to work.

build on aws emr cluster fails

Hi
I'm new to Java, Hadoop and AWS, so I'm posting this issue after a lot of searching on Google.
I'm getting the error below when running build.sh on an EMR cluster:
src/map/Words.java:24: error: package org.apache.hadoop.io does not exist
import org.apache.hadoop.io.*;
^
src/map/Words.java:25: error: package org.apache.hadoop.mapred does not exist
import org.apache.hadoop.mapred.*;
^
src/map/Words.java:28: error: cannot find symbol
public class Words extends MapReduceBase

Seems to require adding the streaming jar to the Hadoop classpath

To use this jar with Hadoop Streaming on a standalone installation of Cloudera CDH3 on Ubuntu Linux, I had to do two things.

First, change the Ant build file to add:

```
+++ b/build.xml
@@ -21,6 +21,8 @@
   <fileset dir="${hadoop.home}"
            includes="contrib/streaming/hadoop-streaming-*.jar" />
```

Second, add the selected Hadoop Streaming jar to the HADOOP_CLASSPATH.

In dumbo/backends/streaming.py, I added:

if addedopts['libjarstreaming'] and addedopts['libjarstreaming'][0] != 'no':
    addedopts['libjar'].append(streamingjar)

which seemed to be required to get it to work.

Without this, I always got an error saying that org.apache.hadoop.typedbytes.TypedBytesWritable could not be found for the partition function:

Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/typedbytes/TypedBytesWritable
at fm.last.feathers.partition.Prefix.(Unknown Source)

After that, I was able to use the partition/Prefix class successfully.
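The option handling described above can be sketched in isolation. This is only an illustration, assuming `addedopts` behaves like a plain dict mapping option names to lists of values; the function name `add_streaming_jar` and the example jar path are hypothetical and not taken from dumbo.

```python
def add_streaming_jar(addedopts, streamingjar):
    """Append the streaming jar to the libjar list unless libjarstreaming is 'no'.

    addedopts is assumed to be a plain dict of option name -> list of values.
    """
    flag = addedopts.get("libjarstreaming")
    if flag and flag[0] != "no":
        # Shipping the streaming jar as a libjar makes classes such as
        # TypedBytesWritable visible to code loaded from feathers.jar.
        addedopts.setdefault("libjar", []).append(streamingjar)
    return addedopts


if __name__ == "__main__":
    opts = {"libjarstreaming": ["yes"], "libjar": ["feathers.jar"]}
    # "hadoop-streaming.jar" is a placeholder path for illustration only
    print(add_streaming_jar(opts, "hadoop-streaming.jar")["libjar"])
```

Unlike the snippet quoted in the issue, the sketch uses `get` and `setdefault` so that a missing option does not raise a KeyError.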
