Comments (2)
Basically, what I did to solve my problem was create a new class that overrides __call__ and adds a call to a cleanup method:
from copy import copy

class OuterJoinReducer(JoinCombiner):
    def __call__(self, key, values):
        # Hook for subclasses: runs at the start of every call.
        self.cleanup()
        if key.isprimary:
            self._key = key.body
            output = self.primary(key.body, values)
            if output:
                for k, v in output:
                    jk = copy(key)
                    jk.body = k
                    yield jk, v
        elif not self.secondary_blocked(key.body):
            for k, v in self.secondary(key.body, values):
                jk = copy(key)
                jk.body = k
                yield jk, v

    def cleanup(self):
        pass
And now I can use it to also dump records in the secondary that don't have a primary:
class myJoinReducer(OuterJoinReducer):
    def cleanup(self):
        # Default used when no primary record is seen.
        self.primary_data = "UNKNOWN"

    def primary(self, key, values):
        self.primary_data = values.next()

    def secondary(self, key, values):
        for v in values:
            # v is expected to be a tuple here.
            yield key, ((self.primary_data, ) + v)
In my example I'm not dumping the primaries themselves, but that would also be quite straightforward if needed. As you can imagine, my primary works like a lookup that I use to translate something from the secondary. I could also cache the primary, but the volume is high and the performance is not good.
It might be good to have this kind of option in dumbo.
I've found an easier way to do it: just override secondary_blocked() so that it always returns False and resets the cached primary data to None when the key has no primary, something like:
class joinReducer(JoinReducer):
    def primary(self, key, values):
        self.primary_data = values.next()

    def secondary(self, key, values):
        for v in values:
            yield self.primary_data, v

    def secondary_blocked(self, b):
        if self._key != b:
            self.primary_data = None
        return False
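To see what this override does without running a Hadoop job, here is a minimal local sketch. StubJoinReducer and JoinKey are hypothetical stand-ins written for this example (they are not dumbo's actual classes, only an assumption about the parts the snippet relies on: a _key attribute, key.isprimary, key.body, and a secondary_blocked() check). It uses next(values) instead of values.next() so it also runs on Python 3.

```python
from collections import namedtuple

# Hypothetical stand-in for dumbo's join key: only the two attributes
# the snippets above use (`body` and `isprimary`).
JoinKey = namedtuple("JoinKey", ["body", "isprimary"])


class StubJoinReducer(object):
    """Simplified stand-in for a join reducer (an assumption, not dumbo's source)."""

    def __init__(self):
        self._key = None

    def __call__(self, key, values):
        if key.isprimary:
            # Remember which key the last primary record belonged to.
            self._key = key.body
            self.primary(key.body, values)
        elif not self.secondary_blocked(key.body):
            for k, v in self.secondary(key.body, values):
                yield k, v

    def secondary_blocked(self, b):
        # Inner-join default: skip secondaries whose key had no primary.
        return self._key != b


class LeftOuterJoinReducer(StubJoinReducer):
    def __init__(self):
        super(LeftOuterJoinReducer, self).__init__()
        self.primary_data = None

    def primary(self, key, values):
        self.primary_data = next(values)

    def secondary(self, key, values):
        for v in values:
            yield self.primary_data, v

    def secondary_blocked(self, b):
        # No primary for this key: clear the stale cached value,
        # but never block the secondary records.
        if self._key != b:
            self.primary_data = None
        return False


def run(reducer, grouped):
    """Feed (key, values) groups to the reducer in sorted order."""
    out = []
    for key, values in grouped:
        out.extend(reducer(key, iter(values)))
    return out


# Primaries sort before secondaries for the same key body.
grouped = [
    (JoinKey("a", True), ["Alice"]),
    (JoinKey("a", False), [1, 2]),
    (JoinKey("b", False), [3]),  # secondary with no primary
]
result = run(LeftOuterJoinReducer(), grouped)
print(result)
```

The record for key "b" comes out paired with None instead of being dropped, which is the outer-join behaviour I was after. This only works because records for the same key arrive together with the primary first, so a stale primary_data can be detected by comparing self._key with the current key.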