
appengine-mapreduce's Introduction

AppEngine Mapreduce library


Official site: https://github.com/GoogleCloudPlatform/appengine-mapreduce

Check the site for up-to-date status, the latest version, getting-started and user guides, and other documentation.

Archive contents:

  • python : the Python version of the library resides here
    • build.sh : use this to run tests for the Python library and to build and run the demo app
    • src : Python source code for the mapreduce library
    • tests : tests for the mapreduce library
    • demo : a demo application that uses the mapreduce library
  • java : the Java version of the library
    • build.xml : Ant build file

appengine-mapreduce's People

Contributors

aizatsky-at-google, angrybrock, anniesullie, aozarov, asafronau, bmenasha, capstan, csilvers, dependabot[bot], dinoboff, eschultink, ksookocheff-va, lucena, ludoch, markgoldstein, mikelambert, sadovnychyi, seano314, soundofjw, tkaitchuck, tomyedwab, xiaolong


appengine-mapreduce's Issues

Exceeded soft private memory limit in python map reduce job

I am running a mapreduce job on Python App Engine that throws a lot of exceptions like:
"Exceeded soft private memory limit of 128 MB with 131 MB after servicing 17 requests total".

The job uses a GoogleCloudStorageRecordInputReader for reading data and GoogleCloudStorageConsistentRecordOutputWriter for writing.

The input file sizes range from 1 MB to 200 MB. The mapper/combiner/reducer functions are very simple string processing with no calls to external services like datastore. Each input record is a string of less than a couple hundred bytes.

The job finishes correctly in the end, though I suspect it runs longer than necessary due to all of the instance shutdowns from exceeding available memory. In a job running right now, I've had 12 soft memory limit exceptions that each shut down an instance, and the job is not yet done.

I'm not sure what is driving all of the memory use. Could it result from buffering on the readers or writers?
Other likely causes?
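
If the readers are the culprit, one knob worth trying is the input reader's buffer size. A minimal sketch, assuming the GCS readers honor a buffer_size parameter (I have not verified this against the record reader wrapper; bucket and object names are placeholders):

job_params = {
    "input_reader": {
        "bucket_name": "my-bucket",         # hypothetical bucket
        "objects": ["data/*"],              # hypothetical object glob
        "buffer_size": 1 * 1024 * 1024,     # read ahead 1 MB instead of the default
    },
}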

Warn if a user is on the default version and module

This can cause the UI to be completely unresponsive, and it will be very hard for the user to understand what is wrong.
What we should do is, in the handler for UI requests, check whether the mapreduce is running in the same module and version as the UI request being processed. If it is, a warning should be issued, so that the user has something to see in the logs when they go looking for why all of their UI requests are failing.
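
A minimal sketch of such a check using the modules API (the helper and the way the job's target module/version are obtained are assumptions):

from google.appengine.api import modules
import logging

def warn_if_same_target(job_module, job_version):
    # Hypothetical helper: compare the module/version serving this UI
    # request against the module/version the mapreduce tasks target.
    if (modules.get_current_module_name() == job_module and
            modules.get_current_version_name() == job_version):
        logging.warning(
            "MapReduce job runs on the same module/version (%s/%s) that "
            "serves the UI; status requests may become unresponsive.",
            job_module, job_version)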

HashPipeline _finalize_job() fails for `writer_state` values > 1MB

On a 100 shard MR with about 320MB of data read, the controller_callback task is failing.

This is due to the datastore limitation on entity sizes.

BadRequestError: The value "writer_state" contains a blob_value that is too long. It cannot exceed 1000000 bytes.

/mapreduce/handlers.py", line 1260, in _finalize_job

Again, as in issue #4, I believe this problem stems from the _HashingGCSOutputWriter implementation of finalize_job.
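
A general way around that datastore limit is to chunk oversized state across child entities and reassemble it on read. A rough sketch, not tied to the library's own models (all names hypothetical):

from google.appengine.ext import ndb

_CHUNK = 900000  # stay safely under the ~1 MB blob_value cap

class StateChunk(ndb.Model):
    """Hypothetical holder for one slice of an oversized blob."""
    data = ndb.BlobProperty()

def put_chunked(parent_key, blob):
    # Integer IDs 1..n sort ascending by key, preserving chunk order.
    chunks = [StateChunk(parent=parent_key, id=i + 1,
                         data=blob[off:off + _CHUNK])
              for i, off in enumerate(range(0, len(blob), _CHUNK))]
    ndb.put_multi(chunks)

def get_chunked(parent_key):
    chunks = StateChunk.query(ancestor=parent_key).order(StateChunk._key).fetch()
    return "".join(c.data for c in chunks)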

processing_rate is a multiplier and not absolute number

When I pass "processing_rate": 1 as part of mapper_params and examine the logs of /mapreduce/worker_callback, I see that each worker callback processes 8 entities each time. If I set "processing_rate": 2, each callback processes 16 entities. On another project I've worked on, the numbers were 15 and 30 (for processing_rate of 1 and 2). So I conclude that the processing_rate param is a multiplier.

  • Where can I set the actual value?
  • What makes it change from project to project?
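
From reading handlers.py, the per-callback batch appears to be derived from the rate, the shard count, and a fixed slice duration. A hedged reconstruction (the constant's name, default, and exact rounding are assumptions from memory of the library internals):

def items_per_slice(processing_rate, shard_count, slice_duration_sec=15):
    # slice_duration_sec mirrors parameters.config._SLICE_DURATION_SEC
    # (name and default assumed). The same processing_rate then yields
    # different per-callback counts on jobs with different shard counts,
    # which would explain seeing 8 on one project and 15 on another.
    return int(processing_rate * slice_duration_sec / shard_count)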

DatastoreInputReader filtering on a StructuredProperty fails in _validate_filters_ndb()

Copying this from http://stackoverflow.com/questions/23508116/appengine-mapreduce-how-to-filter-structuredproperty-while-using-datastore-input/23530474#23530474

While using appengine mapreduce lib, how to filter by StructuredProperty?

I tried:

  class Tag(ndb.Model):
     # ...
     tag = ndb.StringProperty()
     value = ndb.FloatProperty(indexed=False)

  class User(ndb.Model):
     # ...
     tags = ndb.StructuredProperty(Tag, repeated=True)

  class SamplePipeline(base_handler.PipelineBase):
     def run(self, tags, start_time, account_type, gsbucketname):
       start_time = datetime.strptime(start_time, "%Y-%m-%d %H:%M:%S")
       filters = []

       for tag in tags:
           filters.append(("tags.tag", "=", tag))

       yield mapreduce_pipeline.MapperPipeline(
           "name",
           "mapper",
           "mapreduce.input_readers.DatastoreInputReader",
           output_writer_spec="mapreduce.output_writers.FileOutputWriter",
           params={
               "input_reader": {
                   "entity_kind": "User",
                   "batch_size": 500,
                   "filters": filters
               },
               "output_writer": {
                   "filesystem": "gs",
                   "gs_bucket_name": gsbucketname,
               },
               "root_pipeline_id": self.root_pipeline_id,
               "account_type": account_type
           },
           shards=255
       )

What I got:

File "/Users/lucemia/vagrant_home/adex2/lib/mapreduce/input_readers.py", line 794, in _validate_filters_ndb
prop, model_class._get_kind())
BadReaderParamsError: ('Property %s is not defined for entity type %s', u'tags.tag', 'User')

A workaround is to add the following lines at line 792 of mapreduce/input_readers.py, skipping validation of properties that contain a "." in them, as sub-properties of a StructuredProperty do:

  if "." in prop:
      continue

charts4j charts do not show up - getting ParameterInstantiationException

I am running Java AppEngine MapReduce, latest version as included by Maven.
I am running the ChainedMapReduce Job example from the tutorial.

The MapReduce jobs complete, but the visualizing graphs do not show up.
The browser shows the message

com.googlecode.charts4j.parameters.ParameterManager$ParameterInstantiationException: Full stack trace is available in the server logs. Message: Internal error: Could not instatiate com.googlecode.charts4j.parameters.DataParameter
<<<
The Eclipse console shows the exception

[INFO] SEVERE: Got exception while running command
[INFO] com.googlecode.charts4j.parameters.ParameterManager$ParameterInstantiationException: Internal error: Could not instatiate com.googlecode.charts4j.parameters.DataParameter
[INFO] at com.googlecode.charts4j.parameters.ParameterManager.getParameter(ParameterManager.java:600)
[INFO] at com.googlecode.charts4j.parameters.ParameterManager.setDataEncoding(ParameterManager.java:429)
[INFO] at com.googlecode.charts4j.AbstractGChart.prepareData(AbstractGChart.java:198)
[INFO] at com.googlecode.charts4j.AbstractGraphChart.prepareData(AbstractGraphChart.java:107)
[INFO] at com.googlecode.charts4j.AbstractAxisChart.prepareData(AbstractAxisChart.java:177)
[INFO] at com.googlecode.charts4j.BarChart.prepareData(BarChart.java:119)
[INFO] at com.googlecode.charts4j.AbstractGChart.toURLString(AbstractGChart.java:130)
[INFO] at com.google.appengine.tools.mapreduce.impl.handlers.StatusHandler.getChartUrl(StatusHandler.java:153)
[INFO] at com.google.appengine.tools.mapreduce.impl.handlers.StatusHandler.handleGetJobDetail(StatusHandler.java:223)
[INFO] at com.google.appengine.tools.mapreduce.impl.handlers.StatusHandler.handleCommand(StatusHandler.java:92)
[INFO] at com.google.appengine.tools.mapreduce.impl.handlers.MapReduceServletImpl.doGet(MapReduceServletImpl.java:86)
[INFO] at com.google.appengine.tools.mapreduce.MapReduceServlet.doGet(MapReduceServlet.java:71)
...
<<<

I added charts4j as a Maven dependency, but that did not resolve the problem.

Stefanie

Debugger terminating when I include the mapreduce section into the app.yaml

I'm trying to follow the example on http://sookocheff.com/posts/2014-04-22-app-engine-mapreduce-api-part-2-running-a-mapreduce-job-using-mapreduceyaml/

However, when I put :

includes:
- mapreduce/include.yaml

handlers:
- url: /_ah/pipeline.*
  script: mapreduce.lib.pipeline.handlers._APP
  login: admin

into my app.yaml, the debugger (PyDev) silently terminates on ANY page request. Just running dev_appserver from the command line works fine. Unfortunately this means I cannot debug the code in PyDev, which I am used to. Since it fails silently, I do not know how to debug it further. When I remove these lines (both blocks have to be removed), the debugger starts working again. To install the library, I copied the python/src folder into my main project root dir. Any suggestions on what could be wrong or how I can investigate further?

Python: PyPi package include unnecessary modules?

The PyPI package GoogleAppEngineMapReduce-1.9.21.1.tar.gz includes these modules:

  • cloudstorage
  • graphy
  • pipeline
  • simplejson

pip installs dependency modules automatically, so I think there is no need to bundle these. The bundled modules also seem to be old versions.
For example, for pipeline, GoogleAppEngineMapReduce.egg-info/requires.txt says

GoogleAppEnginePipeline >= 1.9.21.1

but the bundled pipeline is not 1.9.21.1! (It still uses the Files API.)

import Datastore entities inside BigQuery : problem while reading csv files (regression issue ?)

Hi all,

I am using the Pipeline and MapReduce APIs to import datastore entities into BigQuery. My configuration is pretty much the same as the bigqueryload example:

  • The input is a DatastoreInput.
  • The output is BigQueryGoogleCloudStorageStoreOutput
  • There is an additional job called BigQueryLoadGoogleCloudStorageFilesJob that reads the CSV files and fills the table.

With appengine-mapreduce version 0.8.1 everything works just fine. I tried to update the library to 0.8.2 and the final staging job is failing:

com.google.appengine.tools.mapreduce.bigqueryjobs.RetryLoadOrCleanupJob run: Job failed while writing to Bigquery. Retrying...#attempt 4 Error details : invalid: Invalid path: gs://my_bucket/Job-57656379-0175-4393-afaa-6d70d86a3322/Shard-0000/file-1426599362203 at null

I noticed that, regardless of the version or the number of datastore entities, some CSV files may stay empty:

com.google.appengine.tools.mapreduce.impl.WorkerShardTask run: Ending slice after 0 items read and calling the worker 0 times

I am having a hard time understanding what's going on, but my guess is that the 0.8.2 version of BigQueryLoadGoogleCloudStorageFilesJob doesn't like empty CSV files very much.
Any idea about that?

Thanks for the help,

Julien

JavaDoc

Hello,

It would help a lot to understand the examples if the generated JavaDoc was provided.

More missing files

This file is missing from the source tree:

mapreduce/lib/pipeline/ui/images/treeview-black-line.gif

RequestTooLargeError when using a lot of shards

I created a mapreduce job with 2048 shards (I needed it for a very large update job). I didn't get any warning or error that the number of shards is too high. The code tried to create the mapper but it failed with the error below.

After this error, the mapreduce is stuck in an error state: it's listed in the /mapreduce/status page as "running", but I can't "Abort" it or clean it up.

E 2015-08-27 23:35:40.070  500      4 KB  1.06 s I 23:35:39.012 E 23:35:40.067 /mapreduce/kickoffjob_callback/1573912547002E1E3DD63
  0.1.0.2 - - [27/Aug/2015:23:35:40 -0700] "POST /mapreduce/kickoffjob_callback/1573912547002E1E3DD63 HTTP/1.1" 500 4094 "http://live.symphonytools.appspot.com/mapreduce/pipeline/run" "AppEngine-Google; (+http://code.google.com/appengine)" "live.symphonytools.appspot.com" ms=1062 cpu_ms=1063 cpm_usd=0.000458 queue_name=default task_name=59300224872921797641 instance=00c61b117cc0391b13d22845bf6ae422d8f6c9ca app_engine_release=1.9.25
    I 23:35:39.012 Processing kickoff for job 1573912547002E1E3DD63
    E 23:35:40.067 The request to API call datastore_v3.Put() was too large.
      Traceback (most recent call last):
        File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1535, in __call__
          rv = self.handle_exception(request, response, e)
        File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
          rv = self.router.dispatch(request, response)
        File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
          return route.handler_adapter(request, response)
        File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
          return handler.dispatch()
        File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 572, in dispatch
          return self.handle_exception(e, self.app.debug)
        File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
          return method(*args, **kwargs)
        File "/base/data/home/apps/s~symphonytools/live.386746686635332317/mapreduce/base_handler.py", line 135, in post
          self.handle()
        File "/base/data/home/apps/s~symphonytools/live.386746686635332317/mapreduce/handlers.py", line 1385, in handle
          result = self._save_states(state, serialized_readers_entity)
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/datastore.py", line 2732, in inner_wrapper
          return RunInTransactionOptions(options, func, *args, **kwds)
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/datastore.py", line 2630, in RunInTransactionOptions
          ok, result = _DoOneTry(function, args, kwargs)
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/datastore.py", line 2650, in _DoOneTry
          result = function(*args, **kwargs)
        File "/base/data/home/apps/s~symphonytools/live.386746686635332317/mapreduce/handlers.py", line 1493, in _save_states
          db.put([state, serialized_readers_entity], config=config)
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 1576, in put
          return put_async(models, **kwargs).get_result()
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 929, in get_result
          result = rpc.get_result()
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 613, in get_result
          return self.__get_result_hook(self)
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1881, in __put_hook
          self.check_rpc_success(rpc)
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1371, in check_rpc_success
          rpc.check_success()
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 579, in check_success
          self.__rpc.CheckSuccess()
        File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_rpc.py", line 134, in CheckSuccess
          raise self.exception
      RequestTooLargeError: The request to API call datastore_v3.Put() was too large.

Don't try to import simplejson on Python 2.6+

In Python 2.6+, simplejson is superseded by json. appengine-mapreduce should do something like:

try:
    import json
except ImportError:
    import simplejson as json

That way we aren't required to bundle simplejson with our apps when we want to use mapreduce.

Controller keeps rescheduling itself.

In Python, the controller callback (used to update the UI) runs continuously. See:

https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/handlers.py#L1414

https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/handlers.py#L1124

The problem is that this can continue even after the job has finished.
An easy patch is to get more aggressive about stopping or backing off the controller. The preferred fix, however, would be to move toward the way the Java implementation works: have the controller run only once, to synchronize the end of the shards' lives, and manage the UI elsewhere.

A minimalistic tutorial

It would be pretty amazing if one of the maintainers of this repo wrote a simple tutorial with the bare minimum needed to run a map-reduce job on [Python-flavored] App Engine, covering these issues:

  • Which handlers are absolutely needed (For map and reduce, or only for map)
  • What should they accept/return
  • How to configure the corresponding mapreduce.yaml
  • Also, it would probably be very useful to show how it's used on a [large] database


Thanks!

TypeError: __init__() got an unexpected keyword argument '_user_agent'

Got an exception when starting a mapreduce task on dev_appserver. Looks similar to the StackOverflow issue http://stackoverflow.com/questions/30124798/facing-error-while-starting-map-job .

Stack trace:

TypeError: __init__() got an unexpected keyword argument '_user_agent'
ERROR 2015-05-25 15:16:05,837 wsgi.py:279]
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/runtime/wsgi.py", line 267, in Handle
result = handler(dict(self._environ), self._StartResponse)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/lib/webapp2-2.3/webapp2.py", line 1519, in __call__
response = self._internal_error(e)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/lib/webapp2-2.3/webapp2.py", line 1511, in __call__
rv = self.handle_exception(request, response, e)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/lib/webapp2-2.3/webapp2.py", line 1505, in __call__
rv = self.router.dispatch(request, response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/lib/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
return route.handler_adapter(request, response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/lib/webapp2-2.3/webapp2.py", line 1076, in __call__
handler = self.handler(request, response)
File "/Users/polster/src/quantis-gae/quantiscloud/mapreduce/base_handler.py", line 88, in __init__
_user_agent=self._DEFAULT_USER_AGENT))
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/datastore/datastore_rpc.py", line 105, in positional_wrapper
return wrapped(*args, **kwds)

Documentation is out of date

As of the 1.9.16.1 release, the documentation available on the wiki is severely out of date.

Both BlobstoreOutputWriter and FileOutputWriter are no longer in the source, and none of the output writers in the source are mentioned in the documentation.

Python: Update Pipeline Images Path

The include.yaml still points to an internally packaged pipeline library for static images which is no longer valid:

handlers:
- url: /mapreduce/pipeline/images
  static_dir: mapreduce/lib/pipeline/ui/images

Switching to /pipeline/ui/images or ../pipeline/ui/images does not seem to do the trick. Maybe the pipeline library should be updated instead, to use absolute URLs rather than relative ones?

Cannot reassign a Pipeline to use a specific push task queue

The only place I saw where this could be "overridden" is the start() method, which works on the local dev_appserver but not with certain MapReduce jobs in a deployed GAE environment.

It seems this would need to be supported in with_params() as well, to override the default task queue along with the default target.
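
For reference, the override I mean, which works locally (MyJobPipeline is a placeholder for any pipeline.Pipeline subclass):

pipe = MyJobPipeline("some-arg")             # hypothetical pipeline subclass
pipe.start(queue_name="mapreduce-queue")
# Works on dev_appserver, but some MapReduce-spawned child pipelines on a
# deployed app still end up on the default queue.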

Sharding with range filters results in poor workload distribution

This is a somewhat old issue that doesn't seem to have gotten much attention. I found a Google Groups post that articulates the problem and proposes a fix for it here: https://groups.google.com/forum/#!topic/app-engine-pipeline-api/O_5aaoNsS04

Problem summary:
Presently, equality[sic] filters are ignored when choosing key range split points using the scatter property. The resulting split points are appropriate for the entire population of entities of a given kind, but not necessarily for the sub-population matching the equality filter(s). This can result in a very uneven distribution of work among shards.

I thought I'd re-file it here in case that group is no longer monitored.
The fix seems simple enough and I'm happy to create a local branch with it, but the performance impact is huge enough that it would seem prudent to get it into the main branch.

is file_format_root.py used?

I am trying to remove all Files API usage from the mapreduce code, because Google will shut down the Files API service.
file_format_root.py uses the Files API; however, I did not see any code that uses file_format_root.py.

Is file_format_root.py still useful?

next() method in InputReader interface(Python)

This is a question.

What's the rationale for returning a key-value pair from the next() method in InputReader? Wouldn't returning just the value be enough? I implemented the latter in a customized input reader and it worked just fine, but I want to hear your opinion. Thanks.
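
To make the question concrete, here is roughly what my value-only reader does (an abbreviated, hypothetical sketch; to_json/from_json, split_input, and validate are omitted):

from mapreduce import input_readers

class ValueOnlyReader(input_readers.InputReader):
    """Hypothetical reader whose next() yields bare values."""

    def __init__(self, values, pos=0):
        self._values = values
        self._pos = pos

    def next(self):
        if self._pos >= len(self._values):
            raise StopIteration()
        value = self._values[self._pos]
        self._pos += 1
        return value  # a bare value, not a (key, value) pair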

Missing files in latest pypi 1.9.5.0 release

The HTML/JS files in mapreduce/third_party/pipeline/ui are missing in the current egg posted to pypi.

Here is a manifest of the files:

lippa@rm-lippa ~/Downloads$ tar tvf GoogleAppEngineMapReduce-1.9.5.0.tar.gz
drwxr-xr-x jlucena/eng 0 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/
-rwxr-xr-x jlucena/eng 17685 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/distribute_setup.py
-rw-r--r-- jlucena/eng 59 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/setup.cfg
-rwxr-xr-x jlucena/eng 1029 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/setup.py
-rwxr-xr-x jlucena/eng 581 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/README
drwxr-xr-x jlucena/eng 0 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/GoogleAppEngineMapReduce.egg-info/
-rw-r--r-- jlucena/eng 1 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/GoogleAppEngineMapReduce.egg-info/dependency_links.txt
-rw-r--r-- jlucena/eng 42 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/GoogleAppEngineMapReduce.egg-info/requires.txt
-rw-r--r-- jlucena/eng 3136 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/GoogleAppEngineMapReduce.egg-info/SOURCES.txt
-rw-r--r-- jlucena/eng 392 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/GoogleAppEngineMapReduce.egg-info/PKG-INFO
-rw-r--r-- jlucena/eng 1 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/GoogleAppEngineMapReduce.egg-info/zip-safe
-rw-r--r-- jlucena/eng 10 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/GoogleAppEngineMapReduce.egg-info/top_level.txt
-rwxr-xr-x jlucena/eng 146 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/MANIFEST.in
-rw-r--r-- jlucena/eng 392 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/PKG-INFO
drwxr-xr-x jlucena/eng 0 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/
-rwxr-xr-x jlucena/eng 600 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/__init__.py
-rwxr-xr-x jlucena/eng 3259 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/main.py
-rwxr-xr-x jlucena/eng 7733 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/mapreduce_pipeline.py
-rwxr-xr-x jlucena/eng 102973 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/input_readers.py
-rwxr-xr-x jlucena/eng 4157 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/key_ranges.py
-rwxr-xr-x jlucena/eng 2274 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/errors.py
-rwxr-xr-x jlucena/eng 12527 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/property_range.py
drwxr-xr-x jlucena/eng 0 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/tools/
-rwxr-xr-x jlucena/eng 22 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/tools/__init__.py
-rwxr-xr-x jlucena/eng 3152 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/tools/gcs_file_seg_reader.py
-rwxr-xr-x jlucena/eng 6049 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/json_util.py
drwxr-xr-x jlucena/eng 0 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/operation/
-rwxr-xr-x jlucena/eng 1613 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/operation/db.py
-rwxr-xr-x jlucena/eng 946 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/operation/__init__.py
-rwxr-xr-x jlucena/eng 982 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/operation/base.py
-rwxr-xr-x jlucena/eng 1296 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/operation/counters.py
-rwxr-xr-x jlucena/eng 15075 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/namespace_range.py
-rwxr-xr-x jlucena/eng 14669 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/context.py
-rwxr-xr-x jlucena/eng 10903 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/file_format_root.py
-rwxr-xr-x jlucena/eng 7406 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/file_format_parser.py
-rwxr-xr-x jlucena/eng 5052 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/mapper_pipeline.py
-rwxr-xr-x jlucena/eng 10140 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/records.py
-rwxr-xr-x jlucena/eng 68593 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/handlers.py
-rwxr-xr-x jlucena/eng 43758 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/output_writers.py
-rwxr-xr-x jlucena/eng 12442 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/status.py
-rwxr-xr-x jlucena/eng 7611 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/parameters.py
drwxr-xr-x jlucena/eng 0 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/
drwxr-xr-x jlucena/eng 0 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/graphy/
-rwxr-xr-x jlucena/eng 40 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/graphy/__init__.py
-rwxr-xr-x jlucena/eng 4022 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/graphy/line_chart.py
-rwxr-xr-x jlucena/eng 6249 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/graphy/pie_chart.py
-rwxr-xr-x jlucena/eng 6837 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/graphy/formatters.py
-rwxr-xr-x jlucena/eng 5814 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/graphy/bar_chart.py
drwxr-xr-x jlucena/eng 0 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/graphy/backends/
-rwxr-xr-x jlucena/eng 22 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/graphy/backends/__init__.py
drwxr-xr-x jlucena/eng 0 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/graphy/backends/google_chart_api/
-rwxr-xr-x jlucena/eng 2098 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/graphy/backends/google_chart_api/__init__.py
-rwxr-xr-x jlucena/eng 14823 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/graphy/backends/google_chart_api/encoders.py
-rwxr-xr-x jlucena/eng 6336 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/graphy/backends/google_chart_api/util.py
-rwxr-xr-x jlucena/eng 14653 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/graphy/common.py
-rwxr-xr-x jlucena/eng 442 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/graphy/util.py
-rwxr-xr-x jlucena/eng 22 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/__init__.py
drwxr-xr-x jlucena/eng 0 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/simplejson/
-rwxr-xr-x jlucena/eng 12383 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/simplejson/__init__.py
-rwxr-xr-x jlucena/eng 15826 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/simplejson/encoder.py
-rwxr-xr-x jlucena/eng 11139 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/simplejson/decoder.py
-rwxr-xr-x jlucena/eng 2271 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/simplejson/scanner.py
-rwxr-xr-x jlucena/eng 5182 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/crc32c.py
drwxr-xr-x jlucena/eng 0 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/pipeline/
-rwxr-xr-x jlucena/eng 1661 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/pipeline/__init__.py
-rwxr-xr-x jlucena/eng 121455 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/pipeline/pipeline.py
-rwxr-xr-x jlucena/eng 1049 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/pipeline/handlers.py
-rwxr-xr-x jlucena/eng 9941 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/pipeline/models.py
-rwxr-xr-x jlucena/eng 6317 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/pipeline/status_ui.py
-rwxr-xr-x jlucena/eng 11461 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/pipeline/common.py
-rwxr-xr-x jlucena/eng 7003 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/third_party/pipeline/util.py
-rwxr-xr-x jlucena/eng 13605 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/file_formats.py
-rwxr-xr-x jlucena/eng 14466 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/datastore_range_iterators.py
-rwxr-xr-x jlucena/eng 8981 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/base_handler.py
-rwxr-xr-x jlucena/eng 22322 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/shuffler.py
drwxr-xr-x jlucena/eng 0 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/api/
-rwxr-xr-x jlucena/eng 22 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/api/__init__.py
drwxr-xr-x jlucena/eng 0 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/api/map_job/
-rwxr-xr-x jlucena/eng 11555 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/api/map_job/abstract_datastore_input_reader.py
-rwxr-xr-x jlucena/eng 681 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/api/map_job/__init__.py
-rwxr-xr-x jlucena/eng 1987 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/api/map_job/shard_life_cycle.py
-rwxr-xr-x jlucena/eng 8737 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/api/map_job/map_job_config.py
-rwxr-xr-x jlucena/eng 5652 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/api/map_job/model_datastore_input_reader.py
-rwxr-xr-x jlucena/eng 3046 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/api/map_job/map_job_context.py
-rwxr-xr-x jlucena/eng 3468 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/api/map_job/sample_input_reader.py
-rwxr-xr-x jlucena/eng 1906 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/api/map_job/mapper.py
-rwxr-xr-x jlucena/eng 3943 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/api/map_job/input_reader.py
-rwxr-xr-x jlucena/eng 5283 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/api/map_job/output_writer.py
-rwxr-xr-x jlucena/eng 6845 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/api/map_job/map_job_control.py
-rwxr-xr-x jlucena/eng 1273 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/api/map_job/datastore_input_reader.py
-rwxr-xr-x jlucena/eng 7022 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/test_support.py
-rwxr-xr-x jlucena/eng 12371 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/util.py
-rwxr-xr-x jlucena/eng 39511 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/model.py
drwxr-xr-x jlucena/eng 0 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/lib/
-rwxr-xr-x jlucena/eng 74 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/lib/__init__.py
drwxr-xr-x jlucena/eng 0 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/lib/input_reader/
-rwxr-xr-x jlucena/eng 272 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/lib/input_reader/__init__.py
-rwxr-xr-x jlucena/eng 13896 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/lib/input_reader/_gcs.py
-rwxr-xr-x jlucena/eng 3246 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/hooks.py
-rwxr-xr-x jlucena/eng 800 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/pipeline_base.py
-rwxr-xr-x jlucena/eng 4559 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/control.py
drwxr-xr-x jlucena/eng 0 2014-06-02 14:16 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/static/
-rwxr-xr-x jlucena/eng 19863 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/static/status.js
-rwxr-xr-x jlucena/eng 1470 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/static/detail.html
-rwxr-xr-x jlucena/eng 91344 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/static/jquery-1.6.1.min.js
-rwxr-xr-x jlucena/eng 1509 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/static/base.css
-rwxr-xr-x jlucena/eng 2247 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/static/jquery.json-2.2.min.js
-rwxr-xr-x jlucena/eng 7270 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/static/jquery.url.js
-rwxr-xr-x jlucena/eng 1485 1980-01-01 02:00 GoogleAppEngineMapReduce-1.9.5.0/mapreduce/static/overview.html

Python demo should use Cloud Storage not the Files API

python/demo/main.py currently uses:

    "mapreduce.input_readers.BlobstoreZipInputReader",
    "mapreduce.output_writers.BlobstoreOutputWriter",

The demo should use Cloud Storage, since the Files API is deprecated.

A GCSZipFileReader will likely need to be written.
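
A sketch of what the demo's reader/writer specs might become, using class names that appear elsewhere on this tracker (the zip-over-GCS reader, as noted above, would still have to be written):

    "mapreduce.input_readers.GoogleCloudStorageRecordInputReader",
    "mapreduce.output_writers.GoogleCloudStorageConsistentRecordOutputWriter",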

Python: build.sh crashes at the end

When run as non-root (as instructed):

$ sh build.sh build_demo

...

Cleaning up...
Exception:
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/pip-1.4.1-py2.7.egg/pip/basecommand.py", line 134, in main
status = self.run(options, args)
File "/Library/Python/2.7/site-packages/pip-1.4.1-py2.7.egg/pip/commands/install.py", line 241, in run
requirement_set.install(install_options, global_options, root=options.root_path)
File "/Library/Python/2.7/site-packages/pip-1.4.1-py2.7.egg/pip/req.py", line 1294, in install
requirement.uninstall(auto_confirm=True)
File "/Library/Python/2.7/site-packages/pip-1.4.1-py2.7.egg/pip/req.py", line 525, in uninstall
paths_to_remove.remove(auto_confirm)
File "/Library/Python/2.7/site-packages/pip-1.4.1-py2.7.egg/pip/req.py", line 1639, in remove
renames(path, new_path)
File "/Library/Python/2.7/site-packages/pip-1.4.1-py2.7.egg/pip/util.py", line 294, in renames
shutil.move(old, new)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 302, in move
os.unlink(src)
OSError: [Errno 13] Permission denied: '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six-1.4.1-py2.7.egg-info'

Storing complete log in /Users/polster/Library/Logs/pip.log

Create a better way to run MapReduce jobs from unit tests.

For the project's own unit tests this can be done by extending EndToEndTestBase. However, this comes with a lot of baggage and would need to be cleaned up. Alternatively, InProcessMapReduce exists, but it does not fully or accurately mimic a real MapReduce. It could be improved by having it call into more of the real components.

Multiple output files

I've tested the demo app, and when I run one of the three mapreduces I get the output fragmented across several files in the blobstore. Is there a way to obtain all of the output in a single file?

On the main page, after you run the mapreduces and follow the output links, only the content of one of those files is shown. I don't think it is correct to show only part of the output.

I've done more tests with this library and confirmed that with enough data the output is divided into different files (one per shard).

In the wiki, in the GoogleCloudStorageOutputWriter section, I read: "These segs live in a tmp directory and should be combined and renamed to the final location. In current impl, they are not combined.". Does that refer to what I've just described?

python "MapReduce in Three Simple Steps" does not correctly render files table -- eventual consistency problem?

The python/demo index.html does not render newly added files in the user's uploaded-files table, even though the upload POST culminates in a redirect (refresh) of the page.

If the refresh is performed manually, the table will render correctly.

I suspect (!?) this is an eventual-consistency issue, since FileMetadata are root (parentless) entities. It's a minor glitch, but the UI is confusing because of the omission: the user would have to know to refresh manually, or abandon their demo, return later, and be surprised by the magical appearance of previous uploads.

Is my suspicion correct? And is it worth pursuing a fix through the addition of a (probably non-existent) parent and an ancestor query for strong consistency?
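
If the suspicion is right, the usual fix looks like the sketch below (a hypothetical parent key; the demo's actual model is a db.Model, but the idea is the same):

from google.appengine.ext import ndb

def user_files_key(user_id):
    # Hypothetical synthetic parent; it never needs a stored entity.
    return ndb.Key("UserFiles", user_id)

class FileMetadata(ndb.Model):
    filename = ndb.StringProperty()

# Write under the parent, then list with a strongly consistent
# ancestor query:
FileMetadata(parent=user_files_key("alice"), filename="data.zip").put()
files = FileMetadata.query(ancestor=user_files_key("alice")).fetch()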

Debugging over multiple instances in a module

I have a module dedicated to running appengine-mapreduce because it does background work like analytics, and it made sense to decouple it from the front-end instances. Currently, the config is:
application: cosightio
version: 1
runtime: python27
api_version: 1
threadsafe: no
module: mapred
instance_class: B8
manual_scaling:
instances: 5

My question is whether this is the best way to scale the map reduce jobs. It goes over each entity of a DataStore kind and puts it into an appropriate Document Index. My queue config is:
name: mapreduce-queue
rate: 200/s
max_concurrent_requests: 200

The current job takes 1 hour to run, and I don't think it is being distributed over the 5 machines. Won't all machines pick up tasks from the task queue if they are free?
Also, it seems better to have scaling that spins up the maximum number of instances and spins them back down after the job is done; see the config sketch below. And how can I find out whether I am bottlenecked on processing or on I/O? (I think it's I/O, since the document index can only do about 15k reads/writes per minute, which is why I tuned my queue config down to 200 req/sec [15000/60 = 250].) Would basic_scaling be better for that?

Specifying a custom KeyRangesIterator for a query

We have a case where we have deterministic keys on entities that share prefixes in certain cases.

Basically, we'd like to query by this prefix, which right now is also contained in a ComputedProperty (which drives the query in its current iteration).

To visualize:

  • 25 million entities with key IDs that start with "abc"
  • 2 million entities with key IDs that start with "def"

Using the current query, we see very poor sharding. One shard handles everything in the "def" range.

Knowing that our keys are prefixed this way, we can easily determine the first and last Key for the KeyRange, but I'm not sure of how we would utilize this knowledge to create the sharding we'd like to see.

Any ideas?
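
One idea, sketched below: since the first and last keys per prefix are known, a KeyRange per prefix can be built directly (key_range is the SDK module the input readers use; splitting such a range further across shards is the part the reader doesn't currently expose):

from google.appengine.ext import db
from google.appengine.ext import key_range

# Hypothetical range covering only key names with the "def" prefix;
# u"\ufffd" is a high code point used to bound the prefix.
def_range = key_range.KeyRange(
    key_start=db.Key.from_path("MyKind", u"def"),
    key_end=db.Key.from_path("MyKind", u"def\ufffd"),
    direction="ASC",
    include_start=True,
    include_end=True)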

Any plan for the Go version?

We're planning to build some features based on appengine-mapreduce. Just wanted to know if there's any plan for the Go version. Or if it's left to the community to contribute.

If there's a Go version coming, we might hold off until it is released. Otherwise we'll just use the Python version instead.

Thank you.

Setup Travis

It would be really useful to have Javadocs auto-generated.

Is `IN` filter supported in the latest DatastoreInputReader?

I've only used equality filters, like:

filters = [("event_type", "=", event_type),
           ("date", ">=", date_start),
           ("date", "<", date_end)]

Is there any plan to add support for the IN filter, as in ("event_type", "IN", list_of_event_types) ? Or is this already possible?
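
Until then, the workaround I know of is to pass only the supported filters to the reader and emulate IN in the mapper. A sketch, with the allowed values read from mapper params (my_map and the "event_types" param name are hypothetical):

from mapreduce import context

def my_map(entity):
    # Drop entities the reader couldn't filter out.
    params = context.get().mapreduce_spec.mapper.params
    if entity.event_type not in params.get("event_types", []):
        return
    yield (entity.event_type, "")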

"Exceeded soft private memory limit" issue breaks a job

At a certain point of the mapreduce job on one of the worker_callbacks (/mapreduce/worker_callback/15734955148708B76E8DE-1) I'm getting this error:

Exceeded soft private memory limit of 128 MB with 128 MB after servicing 1261 requests total

And after that on each mapreduce request I get this error:

Traceback (most recent call last):
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 240, in Handle
    handler = _config_handle.add_wsgi_middleware(self._LoadHandler())
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 299, in _LoadHandler
    handler, path, err = LoadObject(self._handler)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 85, in LoadObject
    obj = __import__(path[0])
ImportError: No module named mapreduce

This repeats until the job dies. Please advise.

HashPipeline kickoffjobcallback task Long Execution Time

On a 100 shard MR job, the /mapreduce/kickoffjob_callback/<id> task took ~997.718 seconds to execute.

Knowns:
_save_states in handlers.py executes in reasonable time; retrieving the MapreduceState via mapreduce.model.MapreduceState.get_by_job_id revealed state.active_shards == 100.

The problem is likely caused by the instantiation of _HashingGCSOutputWriter.

The log provided interesting tasklet related output:

I 16:40:25.226 Processing kickoff for job 15766020759087509210F
D 16:43:14.096 Tasklet is <bound method _StorageApi.urlfetch_async of <cloudstorage.storage_api._StorageApi object at 0xfc243430>>
D 16:43:14.100 Got result <google.appengine.api.urlfetch._URLFetchResult object at 0xfc251a10> from tasklet.
D 16:43:14.100 Retry in 0.1 seconds.

The task finished at 16:57:02.926.

Having an execution this long that isn't parallelized is a bit bizarre to me. I am unable to reproduce this in test cases with smaller data.

When the task did complete, here's what the counters revealed.

io-read-bytes: 338903012 (273308.88/sec avg.)
io-read-msec: 2710632 (2185.99/sec avg.)
io-write-bytes: 544800768 (439355.46/sec avg.)
io-write-msec: 464 (0.37/sec avg.)
mapper-calls: 206516 (166.55/sec avg.)
mapper-walltime-ms: 3492316 (2816.38/sec avg.)

Ultimately, the MR did not finish due to a seemingly separate issue, which I will file in a moment. (The controller_callback to finish this job says BadRequestError: The value "writer_state" contains a blob_value that is too long. It cannot exceed 1000000 bytes.)

Python: How to add mapreduce to existing App Engine project?

I tried copying src/mapreduce into my project folder, then including include.yaml from my app.yaml. But when starting up dev_appserver, I got this exception:

from graphy.backends import google_chart_api
ImportError: No module named graphy.backends

Then I tried build.sh, but it can only build the demo app and nothing else.

Additionally, build.sh installed a huge number of redundant packages (graphy, simplejson, etc.) directly into the demo folder.

Missing output writers

The BlobstoreRecordsOutputWriter is no longer available among the output writers. Which output writer should be used for testing purposes? BlobstoreRecordsOutputWriter was useful for collecting output from mapreduce processes together with testutil.HandlerTestBase, in order to assert results. Any ideas for a replacement?

    p = mapreduce_pipeline.MapreducePipeline.from_id(pipeline_id)
    output_data = []
    for output_file in p.outputs.default.value:
      with files.open(output_file, "r") as f:
        for record in records.RecordsReader(f):
          output_data.append(record)

    self.assertEqual(3, len(output_data))
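
A possible replacement, sketched under the assumption that GoogleCloudStorageRecordOutputWriter fills outputs.default with GCS filenames that cloudstorage.open() accepts:

    import cloudstorage
    from mapreduce import mapreduce_pipeline, records

    p = mapreduce_pipeline.MapreducePipeline.from_id(pipeline_id)
    output_data = []
    for output_file in p.outputs.default.value:
      # cloudstorage.open() takes "/bucket/object" paths, which is what
      # the GCS writers are expected to report here.
      with cloudstorage.open(output_file) as f:
        for record in records.RecordsReader(f):
          output_data.append(record)

    self.assertEqual(3, len(output_data))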

MapreducePipeline behavior is inconsistent

In the snippet below from mapreduce_pipeline.py (MapreducePipeline.run()), mapper_params defaults to None, showing that it is not always required, yet the first statement of the method immediately dereferences it.

  def run(self,
          job_name,
          mapper_spec,
          reducer_spec,
          input_reader_spec,
          output_writer_spec=None,
          mapper_params=None,
          reducer_params=None,
          shards=None,
          combiner_spec=None):
    # Check that you have a bucket_name set in the mapper_params and set it
    # to the default if not.
    if mapper_params.get("bucket_name") is None:
      try:
        mapper_params["bucket_name"] = (
            app_identity.get_default_gcs_bucket_name())
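
A minimal guard before the dereference would make the documented default actually usable; a sketch of the fix:

    # Sketch: tolerate the documented default of None.
    if mapper_params is None:
      mapper_params = {}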

Please add license headers to all files

Some Chromium dependencies import this project and would require the presence of valid license headers in all source files.

Files that are missing license headers:

mapreduce/lib/__init__.py
mapreduce/lib/input_reader/__init__.py
mapreduce/lib/input_reader/_gcs.py
mapreduce/property_range.py
mapreduce/datastore_range_iterators.py
mapreduce/shard_life_cycle.py
mapreduce/records.py
mapreduce/api/map_job/mapper.py
mapreduce/api/map_job/map_job_control.py
mapreduce/api/map_job/output_writer.py
mapreduce/api/map_job/input_reader.py
mapreduce/api/map_job/abstract_datastore_input_reader.py
mapreduce/api/map_job/datastore_input_reader.py
mapreduce/api/map_job/sample_input_reader.py
mapreduce/api/map_job/__init__.py
mapreduce/api/map_job/model_datastore_input_reader.py
mapreduce/api/__init__.py
mapreduce/key_ranges.py
mapreduce/tools/gcs_file_seg_reader.py
mapreduce/tools/__init__.py
mapreduce/map_job_context.py
mapreduce/parameters.py
mapreduce/pipeline_base.py
mapreduce/json_util.py
