klbostee / feathers
Java classes that can be useful for Dumbo programs that run on Hadoop Streaming.
RawFileOutputFormat is an output format that writes raw byte files (with no TypedBytes wrapping). We've been using it, and it has proved useful for creating tar files and Tokyo Cabinet stores in Dumbo. I couldn't find a way to attach a file to the issue, so I'll just copy and paste the class here:
package fm.last.feathers.output;

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

public class RawFileOutputFormat<K, V> extends FileOutputFormat<K, V> {

    protected static class RawFileRecordWriter<K, V>
            implements RecordWriter<K, V> {

        protected DataOutputStream out;

        public RawFileRecordWriter(DataOutputStream out) {
            this.out = out;
        }

        // Skips the one-byte type code and the four-byte length, then
        // returns the payload. Returns null if the value is empty.
        private byte[] readRawBytes(BytesWritable bv) throws IOException {
            // getBytes() returns the backing buffer, which can be longer
            // than the valid data, so read only the first getLength() bytes.
            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(bv.getBytes(), 0, bv.getLength()));
            try {
                in.readUnsignedByte(); // read the type code and discard it
            } catch (EOFException eof) {
                return null;
            }
            int length = in.readInt();
            byte[] bytes = new byte[length];
            in.readFully(bytes);
            return bytes;
        }

        public synchronized void write(K key, V value) throws IOException {
            boolean nullValue = (value == null) || value instanceof NullWritable;
            if (!nullValue) {
                BytesWritable bv = (BytesWritable) value;
                byte[] raw = readRawBytes(bv);
                // An empty value yields null; skip it instead of
                // passing null to out.write().
                if (raw != null) {
                    out.write(raw);
                }
            }
        }

        public synchronized void close(Reporter reporter) throws IOException {
            out.close();
        }
    }

    @Override
    public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
            String name, Progressable progress) throws IOException {
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        FileSystem fs = file.getFileSystem(job);
        FSDataOutputStream fileOut = fs.create(file, progress);
        return new RawFileRecordWriter<K, V>(fileOut);
    }
}
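To make the header handling in readRawBytes easier to follow, here is a minimal Python sketch of the same framing: one type-code byte, a four-byte big-endian length, then the payload. The function names are hypothetical and not part of feathers or Dumbo, and the default type code of 0 is an assumption for illustration only, since the Java code discards that byte without checking it.

```python
import struct


def wrap_as_typedbytes(payload, type_code=0):
    """Prefix payload with a 1-byte type code and a 4-byte big-endian length.

    type_code=0 is an assumption for illustration only.
    """
    return struct.pack(">Bi", type_code, len(payload)) + payload


def strip_typedbytes_header(buf):
    """Return the payload of a framed byte string, or None if buf is empty."""
    if len(buf) == 0:
        return None
    # Byte 0 is the type code (discarded); bytes 1-4 hold the length.
    (length,) = struct.unpack(">i", buf[1:5])
    payload = buf[5:5 + length]
    if len(payload) != length:
        raise EOFError("payload shorter than declared length")
    return payload
```

Round-tripping a value through both helpers should return the original bytes, which is the behaviour the record writer relies on when it writes only the payload to the output file.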
For example, given a simple script like this:
@opt("getpath", "yes")
def mapper(key, value):
    path = value[0:5]
    yield (path,None),row

if __name__ == "__main__":
    run(mapper)
dumbo start example.py -input /x -output /y -hadoop /hadoop -libjar /home/zoltanctoth/env-dumbo/feathers/feathers.jar -outputformat text -numReduceTasks 0
Is it possible to have an output file whose rows are not in <key>\t<value> format but simply in <value> format?
I tried "" and <backspace> instead of None in yield (path,None),row but neither seems to work.
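The thread does not record an answer to this question. As a workaround sketch only, and not a Dumbo or feathers feature, the key column could be stripped from the text output in a post-processing step. The function name `strip_key` is hypothetical, and the sketch assumes each row is a single line with the key and value separated by the first tab.

```python
def strip_key(line):
    """Drop everything up to and including the first tab of an output row."""
    row = line.rstrip("\n")
    key, sep, value = row.partition("\t")
    # Rows that contain no tab are returned unchanged.
    return value if sep else row
```

Only the first tab is treated as the separator, so tabs inside the value itself are preserved.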
Hi,
I'm new to Java, Hadoop and AWS, so I'm posting this issue after a lot of Google searching without success.
I get the error below when running build.sh on an EMR cluster:
src/map/Words.java:24: error: package org.apache.hadoop.io does not exist
import org.apache.hadoop.io.*;
^
src/map/Words.java:25: error: package org.apache.hadoop.mapred does not exist
import org.apache.hadoop.mapred.*;
^
src/map/Words.java:28: error: cannot find symbol
public class Words extends MapReduceBase
In order to use this jar with hadoop streaming on a standalone installation of Cloudera CDH3 on Ubuntu Linux, I had to do two things:
+++ b/build.xml
@@ -21,6 +21,8 @@
<fileset dir="${hadoop.home}"
includes="contrib/streaming/hadoop-streaming-*.jar" />
In dumbo/backends/streaming.py, I added:
if addedopts['libjarstreaming'] and addedopts['libjarstreaming'][0] != 'no':
    addedopts['libjar'].append(streamingjar)
which seemed to be required to get it to work.
Without this, I always got an error saying that org.apache.hadoop.typedbytes.TypedBytesWritable could not be found for the partition function:
Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/typedbytes/TypedBytesWritable
at fm.last.feathers.partition.Prefix.(Unknown Source)
After that, I was able to use the partition/Prefix class successfully.
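The effect of the streaming.py change can be sketched in isolation. This is a minimal illustration, assuming `addedopts` maps option names to lists of string values as the patch above suggests; the wrapper function `add_streaming_libjar` and the jar names in the usage lines are hypothetical and do not exist in Dumbo.

```python
def add_streaming_libjar(addedopts, streamingjar):
    """Append the streaming jar to the libjar list unless it is disabled."""
    flag = addedopts.get("libjarstreaming", [])
    # The option must be set and its first value must not be 'no'.
    if flag and flag[0] != "no":
        addedopts.setdefault("libjar", []).append(streamingjar)
    return addedopts


# Hypothetical usage: the jar names here are placeholders.
opts = {"libjarstreaming": ["yes"], "libjar": ["feathers.jar"]}
add_streaming_libjar(opts, "hadoop-streaming.jar")
```

With the option enabled, the streaming jar is shipped alongside feathers.jar, which would put TypedBytesWritable on the classpath of the tasks that load the partitioner.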