Serverless Reference Architecture: MapReduce

This serverless MapReduce reference architecture demonstrates how to use AWS Lambda in conjunction with Amazon S3 to build a MapReduce framework that can process data stored in S3.

By leveraging this framework, you can build a cost-effective pipeline to run ad hoc MapReduce jobs. The price-per-query model and ease of use make it very suitable for data scientists and developers alike.

Features

Close to "zero" setup time
Pay per execution model for every job
Cheaper than other data processing solutions
Enables data processing within a VPC

Architecture

![alt text] (https://s3.amazonaws.com/smallya-test/bl-git.png "Serverless MapReduce architecture")

Running the Example

To run the full example, you must have the AWS CLI set up. Your credentials must have access to create and invoke Lambda and access to list, read, and write to a S3 bucket.

IAM policies

Lambda execution role with
- S3 read/write access
- Cloudwatch log access (logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents)

Check policy.json for a sample that you can use or extend.

To execute the driver locally, make sure that you configure your AWS profile with access to:
- S3
- Lambda

Setting up the job

For the jobBucket field, enter an S3 bucket in your account that you wish to use for the example. Modify other fields as required if you end up renaming files.


{
        "bucket": "big-data-benchmark",
        "prefix": "pavlo/text/1node/uservisits/",
        "jobBucket": "MY-JOB-BUCKET",
        "concurrentLambdas": 100,
        "mapper": {
            "name": "mapper.py",
            "handler": "mapper.lambda_handler",
            "zip": "mapper.zip"
        },
        "reducer":{
            "name": "reducer.py",
            "handler": "reducer.lambda_handler",
            "zip": "reducer.zip"
        },
        "reducerCoordinator":{
            "name": "reducerCoordinator.py",
            "handler": "reducerCoordinator.lambda_handler",
            "zip": "reducerCoordinator.zip"
        },
}

Running the job

Make sure that you have edited the configuration JSON file (driverconfig.json)
Update the mapper and reducer code as required, leave them unedited if you're running the same example
Run the driver:

$ python driver.py

Outputs

smallya$ aws s3 ls s3://JobBucket/py-bl-1node-2 --recursive --human-readable --summarize

2016-09-26 15:01:17   69 Bytes py-bl-1node-2/jobdata
2016-09-26 15:02:04   74 Bytes py-bl-1node-2/reducerstate.1
2016-09-26 15:03:21   51.6 MiB py-bl-1node-2/result 
2016-09-26 15:01:46   18.8 MiB py-bl-1node-2/task/
….

smallya$ head –n 3 result 
67.23.87,5.874290244999999
30.94.22,96.25011190570001
25.77.91,14.262780186000002

Cleaning up the example resources

To remove all resources created by this example, do the following:

Delete all objects from the S3 bucket listed in jobBucket created by the job.
Delete the Cloudwatch log groups for each of the Lambda functions created by the job.

Languages

Python 2.7 (active development)
Node.js

The Python version is under active development and feature enhancement.

Benchmark

To compare this framework with other data processing frameworks, we ran a subset of the Amplab benchmark. The table below has the execution time for each workload in seconds:

Dataset

s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix].

S3 Suffix Scale Factor Rankings (rows) Rankings (bytes) UserVisits (rows) UserVisits (bytes) Documents (bytes) /5nodes/ 5 90 Million 6.38 GB 775 Million 126.8 GB 136.9 GB

Queries:

Scan query (90 M Rows, 6.36 GB of data)
SELECT pageURL, pageRank FROM rankings WHERE pageRank > X ( X= {1000, 100, 10} )
- 1a) SELECT pageURL, pageRank FROM rankings WHERE pageRank > 1000
- 1b) SELECT pageURL, pageRank FROM rankings WHERE pageRank > 100
Aggregation query on UserVisits ( 775M rows, ~127GB of data)
- 2a) SELECT SUBSTR(sourceIP, 1, 8), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 8)

NOTE: Only a subset of the queries could be run, as Lambda currently supports a maximum container size of 1536 MB. The benchmark is designed to increase the output size by an order of magnitude for the a,b,c iterations. Given that the output size doesn't fit in Lambda memory, we currently can't process to compute the final output.

|-----------------------|---------|---------|--------------|
| Technology            | Scan 1a | Scan 1b | Aggregate 2a | 
|-----------------------|---------|---------|--------------|
| Amazon Redshift (HDD) | 2.49    | 2.61    | 25.46        |
|-----------------------|---------|---------|--------------|
| Impala - Disk - 1.2.3 | 12.015  | 12.015  | 113.72       |
|-----------------------|---------|---------|--------------|
| Impala - Mem - 1.2.3  | 2.17    | 3.01    | 84.35        |
|-----------------------|---------|---------|--------------|
| Shark - Disk - 0.8.1  | 6.6     | 7       | 151.4        |
|-----------------------|---------|---------|--------------|
| Shark - Mem - 0.8.1   | 1.7     | 1.8     | 83.7         |
|-----------------------|---------|---------|--------------|
| Hive - 0.12 YARN      | 50.49   | 59.93   | 730.62       |
|-----------------------|---------|---------|--------------|
| Tez - 0.2.0           | 28.22   | 36.35   | 377.48       |
|-----------------------|---------|---------|--------------|
| Serverless MapReduce  | 39      | 47      | 200          |   
|-----------------------|---------|---------|--------------|

Serverless MapReduce Cost:

|---------|---------|--------------|
| Scan 1a | Scan 1b | Aggregate 2a | 
|---------|---------|--------------|
| 0.00477 | 0.0055  | 0.1129       |   
|---------|---------|--------------|

License

This reference architecture sample is licensed under the Amazon Software License.

nee1 / lambda-refarch-mapreduce Goto Github PK

lambda-refarch-mapreduce's Introduction

Serverless Reference Architecture: MapReduce

Features

Architecture

Running the Example

IAM policies

Setting up the job

Running the job

Outputs

Cleaning up the example resources

Languages

Benchmark

License

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent