
lambda-refarch-mapreduce's Introduction

Serverless Reference Architecture: MapReduce

This serverless MapReduce reference architecture demonstrates how to use AWS Lambda in conjunction with Amazon S3 to build a MapReduce framework that can process data stored in S3.

With this framework, you can build a cost-effective pipeline to run ad hoc MapReduce jobs. The price-per-query model and ease of use make it well suited to data scientists and developers alike.

Features

  • Close to "zero" setup time
  • Pay per execution model for every job
  • Cheaper than other data processing solutions
  • Enables data processing within a VPC

Architecture

Serverless MapReduce architecture

IAM policies

  • Lambda execution role with
    • S3 read/write access
    • CloudWatch Logs access (logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents)
    • X-Ray write access (xray:PutTraceSegments, xray:PutTelemetryRecords)

Check policy.json for a sample that you can use or extend.
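As a rough illustration of the permissions listed above, the sketch below builds a policy document as a Python dict. The function name build_execution_policy is hypothetical, and the specific S3 actions are an assumption, since the list only says "read/write access". The policy.json shipped with the project may differ.

```python
import json


def build_execution_policy(bucket_name):
    """Return an IAM policy document (as a dict) for the Lambda execution role."""
    bucket_arn = "arn:aws:s3:::{}".format(bucket_name)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Assumed S3 actions: the text only says "read/write access".
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [bucket_arn, bucket_arn + "/*"],
            },
            {
                # CloudWatch Logs actions named in the list above.
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "arn:aws:logs:*:*:*",
            },
            {
                # X-Ray actions named in the list above.
                "Effect": "Allow",
                "Action": ["xray:PutTraceSegments", "xray:PutTelemetryRecords"],
                "Resource": "*",
            },
        ],
    }


print(json.dumps(build_execution_policy("YOUR-BUCKET-NAME-HERE"), indent=2))
```

This sketch scopes S3 access to a single bucket. A job that reads its input from a different bucket would need read access to that bucket as well.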

  • To execute the driver locally, make sure that you configure your AWS profile with access to:

Quickstart::Step by Step

To run the example, you must have the AWS CLI set up. Your credentials must have access to create and invoke Lambda functions and to list, read, and write to an S3 bucket.

  1. Create an S3 bucket to store the intermediate data and the result. Use your own bucket name, because S3 bucket names share a global namespace.

$ aws s3 mb s3://YOUR-BUCKET-NAME-HERE

  2. Update policy.json with your S3 bucket name (replace biglambda-s3-bucket in the command with the name you chose)

$ sed -i 's/s3:::YOUR-BUCKET-NAME-HERE/s3:::biglambda-s3-bucket/' policy.json

  3. Create the IAM role with the corresponding policy

$ python create-biglambda-role.py

  4. Use the ARN printed by the script to set the serverless_mapreduce_role environment variable:

$ export serverless_mapreduce_role=arn:aws:iam::MY-ACCOUNT-ID:role/biglambda_role

  5. Make edits to driverconfig.json and verify

$ cat driverconfig.json

  6. Run the AWS X-Ray daemon locally. Otherwise, traces from the local driver will not appear in the AWS X-Ray console, although traces from the Reducer Coordinator Lambda functions will still be present.

  7. Run the driver

$ python driver.py
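Before running the driver, it can help to confirm that the role variable was exported correctly. The sketch below is a hypothetical pre-flight check, not part of the project. The helper name check_role_arn and the regular expression are assumptions, and the pattern only covers the standard "aws" partition.

```python
import os
import re

# Shape of an IAM role ARN in the standard partition:
# arn:aws:iam::<12-digit account id>:role/<role name>
ROLE_ARN_PATTERN = re.compile(r"^arn:aws:iam::\d{12}:role/[\w+=,.@/-]+$")


def check_role_arn(environ=None):
    """Return the role ARN from the environment or raise ValueError."""
    environ = os.environ if environ is None else environ
    arn = environ.get("serverless_mapreduce_role")
    if not arn:
        raise ValueError("serverless_mapreduce_role is not set")
    if not ROLE_ARN_PATTERN.match(arn):
        # Also catches an unreplaced MY-ACCOUNT-ID placeholder.
        raise ValueError("not an IAM role ARN: {!r}".format(arn))
    return arn
```

Calling check_role_arn() with no arguments reads the real environment. Passing a dict makes the check easy to try without exporting anything.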

Modifying the Job (driverconfig.json)

For the jobBucket field, enter an S3 bucket in your account that you wish to use for the example. Make changes to the other fields if you have different source data, or if you have renamed the files.


{
        "bucket": "big-data-benchmark",
        "prefix": "pavlo/text/1node/uservisits/",
        "jobBucket": "YOUR-BUCKET-NAME-HERE",
        "concurrentLambdas": 100,
        "mapper": {
            "name": "mapper.py",
            "handler": "mapper.lambda_handler",
            "zip": "mapper.zip"
        },
        "reducer":{
            "name": "reducer.py",
            "handler": "reducer.lambda_handler",
            "zip": "reducer.zip"
        },
        "reducerCoordinator":{
            "name": "reducerCoordinator.py",
            "handler": "reducerCoordinator.lambda_handler",
            "zip": "reducerCoordinator.zip"
        }
}
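A small check can catch a malformed or incomplete driverconfig.json before a job is launched. The sketch below is a hypothetical validator, assuming that every field shown in the example above is required. The driver itself may be more lenient or may check other things.

```python
import json

REQUIRED_TOP_LEVEL = ("bucket", "prefix", "jobBucket", "concurrentLambdas")
REQUIRED_FUNCTIONS = ("mapper", "reducer", "reducerCoordinator")
REQUIRED_FUNCTION_KEYS = ("name", "handler", "zip")


def validate_driver_config(text):
    """Parse driverconfig.json text and return a list of problems (empty if none)."""
    try:
        config = json.loads(text)
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        return ["not valid JSON: {}".format(exc)]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in config:
            problems.append("missing field: " + key)
    if config.get("jobBucket") == "YOUR-BUCKET-NAME-HERE":
        problems.append("jobBucket still holds the placeholder value")
    for section in REQUIRED_FUNCTIONS:
        entry = config.get(section)
        if not isinstance(entry, dict):
            problems.append("missing section: " + section)
            continue
        for key in REQUIRED_FUNCTION_KEYS:
            if key not in entry:
                problems.append("missing field: {}.{}".format(section, key))
    return problems
```

For example, validate_driver_config(open("driverconfig.json").read()) returns an empty list when the file parses and all fields are present. Strict JSON parsers reject trailing commas, so a stray comma after the last section shows up as a "not valid JSON" problem.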

Outputs

smallya$ aws s3 ls s3://JobBucket/py-bl-1node-2 --recursive --human-readable --summarize

2016-09-26 15:01:17   69 Bytes py-bl-1node-2/jobdata
2016-09-26 15:02:04   74 Bytes py-bl-1node-2/reducerstate.1
2016-09-26 15:03:21   51.6 MiB py-bl-1node-2/result 
2016-09-26 15:01:46   18.8 MiB py-bl-1node-2/task/
….

smallya$ head -n 3 result
67.23.87,5.874290244999999
30.94.22,96.25011190570001
25.77.91,14.262780186000002
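The result lines shown above look like plain "key,value" CSV rows. Assuming that format holds for the whole file, a short sketch for pulling out the largest totals could look like this (top_revenue is a hypothetical helper name):

```python
import csv
import heapq


def top_revenue(lines, n=3):
    """Return the n (prefix, total) pairs with the largest totals."""
    reader = csv.reader(lines)
    rows = ((prefix, float(total)) for prefix, total in reader)
    return heapq.nlargest(n, rows, key=lambda row: row[1])


sample = [
    "67.23.87,5.874290244999999",
    "30.94.22,96.25011190570001",
    "25.77.91,14.262780186000002",
]
print(top_revenue(sample, n=2))
```

Because csv.reader accepts any iterable of lines, an open file object for the downloaded result can be passed in place of the sample list.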

Cleaning up the example resources

To remove all resources created by this example, do the following:

  1. Delete all objects from the S3 bucket listed in jobBucket created by the job.

  2. Delete the CloudWatch log groups for each of the Lambda functions created by the job.

  3. Delete the created IAM role

    $ python delete-biglambda-role.py

Languages

  • Python 2.7 (active development)
  • Node.js

The Python version is under active development and feature enhancement.

Benchmark

To compare this framework with other data processing frameworks, we ran a subset of the AMPLab Big Data Benchmark. The results table further below lists the execution time for each workload in seconds.

Dataset

s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix].

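|-----------|--------------|-----------------|------------------|-------------------|--------------------|-------------------|
| S3 Suffix | Scale Factor | Rankings (rows) | Rankings (bytes) | UserVisits (rows) | UserVisits (bytes) | Documents (bytes) |
|-----------|--------------|-----------------|------------------|-------------------|--------------------|-------------------|
| /5nodes/  | 5            | 90 Million      | 6.38 GB          | 775 Million       | 126.8 GB           | 136.9 GB          |
|-----------|--------------|-----------------|------------------|-------------------|--------------------|-------------------|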

Queries:

  • Scan query (90 M rows, 6.36 GB of data): SELECT pageURL, pageRank FROM rankings WHERE pageRank > X ( X = {1000, 100, 10} )

    • 1a) SELECT pageURL, pageRank FROM rankings WHERE pageRank > 1000
    • 1b) SELECT pageURL, pageRank FROM rankings WHERE pageRank > 100

  • Aggregation query on UserVisits (775 M rows, ~127 GB of data)

    • 2a) SELECT SUBSTR(sourceIP, 1, 8), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 8)

NOTE: Only a subset of the queries could be run, because Lambda currently supports a maximum container size of 1536 MB. The benchmark is designed to increase the output size by an order of magnitude across the a, b, and c iterations. Because the larger outputs do not fit in Lambda memory, we currently cannot compute the final output for those iterations.
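To show how aggregation query 2a maps onto a map and a reduce step, here is a minimal sketch. It is not the project's mapper.py or reducer.py. The function names are hypothetical, and the column positions assume the benchmark's comma-separated UserVisits layout with sourceIP first and adRevenue fourth.

```python
from collections import defaultdict

SOURCE_IP_COLUMN = 0   # assumed position of sourceIP in each CSV row
AD_REVENUE_COLUMN = 3  # assumed position of adRevenue in each CSV row


def map_chunk(lines):
    """Mapper: partial SUM(adRevenue) per SUBSTR(sourceIP, 1, 8) for one chunk."""
    totals = defaultdict(float)
    for line in lines:
        fields = line.rstrip("\n").split(",")
        try:
            key = fields[SOURCE_IP_COLUMN][:8]
            revenue = float(fields[AD_REVENUE_COLUMN])
        except (IndexError, ValueError):
            continue  # skip malformed rows
        totals[key] += revenue
    return dict(totals)


def reduce_partials(partials):
    """Reducer: merge the per-chunk partial sums into final totals."""
    merged = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)


chunk_a = [
    "10.20.30.1,http://a,2010-01-01,1.5,ua,US,en,word,3",
    "10.20.31.7,http://b,2010-01-02,2.0,ua,US,en,word,1",
]
chunk_b = ["10.20.30.2,http://c,2010-01-03,0.5,ua,US,en,word,2", "not a valid row"]
print(reduce_partials([map_chunk(chunk_a), map_chunk(chunk_b)]))
```

Each mapper only returns one total per key rather than one row per input line, so the data handed to the reducer stays much smaller than the input. The final merged dictionary still has to fit in the reducer's memory, which is the constraint the note above describes.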

|-----------------------|---------|---------|--------------|
| Technology            | Scan 1a | Scan 1b | Aggregate 2a | 
|-----------------------|---------|---------|--------------|
| Amazon Redshift (HDD) | 2.49    | 2.61    | 25.46        |
|-----------------------|---------|---------|--------------|
| Impala - Disk - 1.2.3 | 12.015  | 12.015  | 113.72       |
|-----------------------|---------|---------|--------------|
| Impala - Mem - 1.2.3  | 2.17    | 3.01    | 84.35        |
|-----------------------|---------|---------|--------------|
| Shark - Disk - 0.8.1  | 6.6     | 7       | 151.4        |
|-----------------------|---------|---------|--------------|
| Shark - Mem - 0.8.1   | 1.7     | 1.8     | 83.7         |
|-----------------------|---------|---------|--------------|
| Hive - 0.12 YARN      | 50.49   | 59.93   | 730.62       |
|-----------------------|---------|---------|--------------|
| Tez - 0.2.0           | 28.22   | 36.35   | 377.48       |
|-----------------------|---------|---------|--------------|
| Serverless MapReduce  | 39      | 47      | 200          |   
|-----------------------|---------|---------|--------------|

Serverless MapReduce Cost:

|---------|---------|--------------|
| Scan 1a | Scan 1b | Aggregate 2a | 
|---------|---------|--------------|
| 0.00477 | 0.0055  | 0.1129       |   
|---------|---------|--------------|
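The source does not say how these cost figures were derived or in which currency they are expressed. As a rough aid for reasoning about the pay-per-execution model, the sketch below estimates Lambda charges from compute time and request count only. The function name is hypothetical, both rates are caller-supplied, and the rates in the usage example are illustrative placeholders rather than quoted prices. S3 request and storage charges are not included.

```python
def estimate_lambda_cost(invocations, avg_duration_s, memory_mb,
                         price_per_gb_second, price_per_request):
    """Estimate Lambda charges: compute time (GB-seconds) plus request fees."""
    gb_seconds = invocations * avg_duration_s * (memory_mb / 1024.0)
    return gb_seconds * price_per_gb_second + invocations * price_per_request


# Illustrative placeholder numbers only; check current pricing before relying on them.
example = estimate_lambda_cost(
    invocations=100,
    avg_duration_s=10.0,
    memory_mb=1536,
    price_per_gb_second=0.00001667,
    price_per_request=0.0000002,
)
print("estimated cost: {:.6f}".format(example))
```

Under these placeholder inputs, compute time dominates the estimate and the per-request fee is negligible.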

License

This sample code is made available under the MIT-0 license. See the LICENSE file.
src/nodejs/s3utils.js is made available under the MIT license. See the THIRD_PARTY file.

lambda-refarch-mapreduce's People

Contributors

ajorg-aws, homingli, rayz0r, sunilmallya, teknogeek0, witsoej


lambda-refarch-mapreduce's Issues

Default bucket should be a sane default

The default bucket for results data is currently set to a private bucket. We should replace the entry in driverconfig.json with an empty default so that we don't confuse new users.

botocore.vendored.requests.exceptions.SSLError: EOF occurred in violation of protocol

When I try to run python driver.py, I get the following output:

Dataset size: 26186978239.0, nKeys: 202, avg: 129638506.134
updating: mapper.py (deflated 54%)
updating: jobinfo.json (deflated 34%)
updating: lambdautils.py (deflated 64%)
updating: reducer.py (deflated 55%)
updating: jobinfo.json (deflated 34%)
updating: lambdautils.py (deflated 64%)
updating: reducerCoordinator.py (deflated 64%)
updating: jobinfo.json (deflated 34%)
updating: lambdautils.py (deflated 64%)
{u'TracingConfig': {u'Mode': u'PassThrough'}, u'CodeSha256': u'71RMdPKxqObAaOGTEIWEBewOYZ4gDRy/hzhfRLIKIwY=', u'FunctionName': u'BL-mapper-bl-release', 'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'RequestId': '1c5907ad-8745-11e8-9ff2-c123f00e76d9', 'HTTPHeaders': {'date': 'Sat, 14 Jul 2018 09:05:53 GMT', 'x-amzn-requestid': '1c5907ad-8745-11e8-9ff2-c123f00e76d9', 'content-length': '626', 'content-type': 'application/json', 'connection': 'keep-alive'}}, u'CodeSize': 3777, u'RevisionId': u'cc218ffe-de04-4b1b-bea9-ef454225dcce', u'MemorySize': 1536, u'FunctionArn': u'arn:aws:lambda:us-west-1:238423183776:function:BL-mapper-bl-release:5', u'Version': u'5', u'Role': u'arn:aws:iam::238423183776:role/biglambda_role', u'Timeout': 300, u'LastModified': u'2018-07-14T09:05:53.331+0000', u'Handler': u'mapper.lambda_handler', u'Runtime': u'python2.7', u'Description': u'BL-mapper-bl-release'}
{u'TracingConfig': {u'Mode': u'PassThrough'}, u'CodeSha256': u'NejYpJPs9Kg2AzZW3lBIwZvPR+QUPtH7GWd+OW6TYjg=', u'FunctionName': u'BL-reducer-bl-release', 'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'RequestId': '1ce717b1-8745-11e8-8cfa-49a139cb41d7', 'HTTPHeaders': {'date': 'Sat, 14 Jul 2018 09:05:54 GMT', 'x-amzn-requestid': '1ce717b1-8745-11e8-8cfa-49a139cb41d7', 'content-length': '630', 'content-type': 'application/json', 'connection': 'keep-alive'}}, u'CodeSize': 3859, u'RevisionId': u'c740f033-bbb4-42eb-8e6b-8885b1c4fdc6', u'MemorySize': 1536, u'FunctionArn': u'arn:aws:lambda:us-west-1:238423183776:function:BL-reducer-bl-release:5', u'Version': u'5', u'Role': u'arn:aws:iam::238423183776:role/biglambda_role', u'Timeout': 300, u'LastModified': u'2018-07-14T09:05:54.269+0000', u'Handler': u'reducer.lambda_handler', u'Runtime': u'python2.7', u'Description': u'BL-reducer-bl-release'}
{u'TracingConfig': {u'Mode': u'PassThrough'}, u'CodeSha256': u'w/dQHEKHQlKL8XTSjUJZcyDldUc+/QUYniJfYo+RNp8=', u'FunctionName': u'BL-rc-bl-release', 'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'RequestId': '1df0e605-8745-11e8-96b5-692c4dd04c96', 'HTTPHeaders': {'date': 'Sat, 14 Jul 2018 09:05:56 GMT', 'x-amzn-requestid': '1df0e605-8745-11e8-96b5-692c4dd04c96', 'content-length': '626', 'content-type': 'application/json', 'connection': 'keep-alive'}}, u'CodeSize': 4997, u'RevisionId': u'b39a817b-1a18-4abf-9557-4fdbbd4c46e5', u'MemorySize': 1536, u'FunctionArn': u'arn:aws:lambda:us-west-1:238423183776:function:BL-rc-bl-release:5', u'Version': u'5', u'Role': u'arn:aws:iam::238423183776:role/biglambda_role', u'Timeout': 300, u'LastModified': u'2018-07-14T09:05:56.015+0000', u'Handler': u'reducerCoordinator.lambda_handler', u'Runtime': u'python2.7', u'Description': u'BL-rc-bl-release'}
{u'Statement': u'{"Sid":"82","Effect":"Allow","Principal":{"Service":"s3.amazonaws.com"},"Action":"lambda:InvokeFunction","Resource":"arn:aws:lambda:us-west-1:238423183776:function:BL-rc-bl-release","Condition":{"ArnLike":{"AWS:SourceArn":"arn:aws:s3:::mrtest-kchen"}}}', 'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 201, 'RequestId': '1e3282ed-8745-11e8-81ce-130f46efef9d', 'HTTPHeaders': {'date': 'Sat, 14 Jul 2018 09:05:56 GMT', 'x-amzn-requestid': '1e3282ed-8745-11e8-81ce-130f46efef9d', 'content-length': '298', 'content-type': 'application/json', 'connection': 'keep-alive'}}}
# of Mappers  202
Traceback (most recent call last):
  File "driver.py", line 211, in <module>
    results = pool.map(invoke_lambda_partial, Ids[mappers_executed: mappers_executed + nm])
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 253, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 572, in get
    raise self._value
botocore.vendored.requests.exceptions.SSLError: EOF occurred in violation of protocol (_ssl.c:726)

Question: how does this framework compare to Athena from capability perspective?

Hi lambda-refarch-mapreduce team,

We are looking for a serverless data processing pipeline, and this framework and Athena are our final candidates.

For us, setup complexity is not an issue, even though I guess this framework may be a bit more complex to set up and maintain than Athena.

I would really appreciate it if someone could chime in with some guidance on how their capabilities compare.

parallelize list keys

cool project!

idk if this is still under active development, but for large buckets, with millions of keys, just listing them could take a while with the 1000 keys/batch limit
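One way to approach this, sketched under assumptions: if the key space can be split into known, non-overlapping prefixes, each prefix can be paginated independently in its own thread. The helper names below are hypothetical and this is not how the driver currently lists keys. The adapter uses the boto3 S3 client's list_objects_v2 call with its Prefix and ContinuationToken parameters, and the code targets Python 3 because it relies on concurrent.futures.

```python
from concurrent.futures import ThreadPoolExecutor


def make_s3_page_lister(s3_client, bucket):
    """Adapt a boto3-style S3 client to a (prefix, token) -> (keys, next_token) function."""
    def list_page(prefix, token):
        kwargs = {"Bucket": bucket, "Prefix": prefix}
        if token:
            kwargs["ContinuationToken"] = token
        response = s3_client.list_objects_v2(**kwargs)
        keys = [obj["Key"] for obj in response.get("Contents", [])]
        return keys, response.get("NextContinuationToken")
    return list_page


def list_prefix(list_page, prefix):
    """Follow continuation tokens until every page under one prefix is read."""
    keys, token = [], None
    while True:
        page, token = list_page(prefix, token)
        keys.extend(page)
        if not token:
            return keys


def list_keys_parallel(list_page, prefixes, max_workers=8):
    """List several prefixes concurrently; results keep the order of `prefixes`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_prefix = pool.map(lambda p: list_prefix(list_page, p), prefixes)
        return [key for keys in per_prefix for key in keys]
```

With a real client this would be called as list_keys_parallel(make_s3_page_lister(boto3.client("s3"), "my-bucket"), ["a/", "b/"]), where the bucket name and prefixes are placeholders. The speedup depends on how evenly the keys are spread across the chosen prefixes.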

Golang

Any plans on showcasing a golang version given the smaller startup latencies?
