
stellar's People

Contributors

amariucaitheodor, dependabot[bot], dhschall, dilinade, kk-min, plamenmpetrov, ria, ustiugov, ypwong99


stellar's Issues

Missing vHive Minio Data Transfer Integration Test

The following should be debugged and included in the pipeline:

    - name: vHive Minio Data Transfer (Integration Test)
      working-directory: ${{env.working-directory}}
      run: |
        docker run -d -p 50052:9000 --name minio -e "MINIO_ROOT_USER=minio" -e "MINIO_ROOT_PASSWORD=minio123" minio/minio server /data
        wget https://dl.min.io/client/mc/release/linux-amd64/mc
        chmod +x mc
        sudo mv mc /usr/local/bin
        mc alias set myminio http://localhost:50052 minio minio123
        mc mb myminio/mybucket
        ./main -g ../endpoints -o ../latency-samples -c ../experiments/tests/vhive/data-transfer-minio.json

Translate producer-consumer from Golang to Python

This PR focuses on translating the producer-consumer image from Golang to Python. The reason for the translation is that Golang is not natively supported by Azure Functions: the only workaround is to write a custom handler HTTP server that the Azure Function queries. This wouldn't be a problem in itself if the serverless.com tool supported Golang deployments, but it doesn't either.

Files to be translated include:

  • producer-consumer/aws/main.go
  • producer-consumer/vhive/main.go
  • producer-consumer/common/aws.go
  • producer-consumer/common/grpc.go
  • producer-consumer/common/main.go
  • producer-consumer/common/object-storage.go
  • producer-consumer/common/util.go
  • producer-consumer/common/util-test.go

automatic provisioning and teardown of self-hosted runners (Continuous Benchmarking)

To reduce costs, self-hosted runners for the various cloud providers (AWS, Azure, Google Cloud, Cloudflare) need to be provisioned before experiments and torn down afterwards. This is done by provisioning the VMs via each cloud provider's CLI and using GitHub's API to register them as self-hosted runners, after which the respective provider's continuous benchmarking experiments can be picked up by the runners.

Current Schedule:
Sunday 11pm UTC - Automatic provisioning of VMs
Monday 12am UTC - Baseline experiments (expected to run up to 12h)
Monday 12pm UTC - Image size experiments (expected to run up to 12h)
Tuesday 12am UTC - Runtime experiments (expected to run up to 24h)
Wednesday 12am UTC - Automatic teardown of VMs

Note: Cloudflare's self-hosted runner VM is to be hosted on AWS EC2 as Cloudflare currently does not offer a VM service.

Current Progress:
AWS: Completed ✅ (#450)
Azure: Completed ✅ (#446)
Google Cloud: Completed ✅ (#447)
Cloudflare: Completed ✅ (#453)

Update compatibility table in README

We need to update the compatibility table in the README to make it more complete and presentable. The relevant section is as follows:


Serverless.com & Cloud Provider Capabilities

While the serverless.com framework is a powerful tool for deploying Lambda functions to AWS, its capabilities are more limited with other providers. The following table compares serverless.com features across the providers whose deployment we considered automating.

| Feature | AWS Lambda | Azure Function | Google Cloud Run | Knative | Google Cloud Function* | Alibaba |
|---|---|---|---|---|---|---|
| Deploy function - zip | Yes | Yes | No | N/A | Yes | Yes |
| Deploy function - docker | Yes | No (documentation does not mention it)** | No | Yes | No (documentation does not mention it)** | No (documentation does not mention it)** |
| Python runtime | Yes | Yes | No | Yes | Yes | ??** |
| Go runtime | Yes | No | No | Yes | Yes | ??** |
| Java runtime | Yes | No | No | ??** | No | ??** |
| Node.js runtime | Yes | Yes | No | Yes | Yes | Yes |
| HTTP trigger | Yes | Yes | No | Yes | Yes | Yes |
| Resource management (e.g. buckets, databases) | Yes | Yes (at least it looks like it) | No | ??** | ??** | ??** |
| Serverless framework support | Yes | Yes | No | Yes | Yes | Yes |

*experimental version, not meant for production yet

**needs to be empirically verified


Add Google Cloud Run Credentials to repo for Github Actions

In order to run GCR e2e tests, we need to have credentials configured inside the repository secrets. We can take this opportunity to create an official account for GCR as well.

Prerequisites

  • Create a Google Cloud Project (We can login with existing official gmail account if any)

Configure Credentials

  • Create a Service Account under the project
  • Get the JSON service account key
  • Upload secret to Github Secrets (GCR_CREDENTIALS)

See here for more details.

Make generated CDF file easier to read

Currently, the x-axis of the generated CDF file is fixed from 0 to 2000. This can make certain CDFs hard to read, especially if the tail latency is fairly low. See the following example:

[figure: empirical CDF]

We should instead set the x-axis maximum to the largest value in the latency samples, or slightly higher.
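A possible fix, sketched in Go (the function name and the 5% headroom factor are illustrative choices, not the actual STeLLAR code):

```go
package main

import "fmt"

// axisMax returns an x-axis upper bound for the CDF plot: the largest
// latency sample plus 5% headroom, so the tail remains visible without
// the plot being dominated by empty space. Falls back to the previous
// hardcoded default when there are no samples.
func axisMax(latenciesMs []float64) float64 {
	if len(latenciesMs) == 0 {
		return 2000 // previous hardcoded default
	}
	max := latenciesMs[0]
	for _, v := range latenciesMs[1:] {
		if v > max {
			max = v
		}
	}
	return max * 1.05
}

func main() {
	fmt.Println(axisMax([]float64{12.3, 48.9, 31.5})) // well under 2000
}
```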

Update Wiki on benchmarking with various providers

The Wiki needs to be updated once feature-serverless-framework-deployment is merged into main. Some notes on things that can be updated/included:

  • How the user interfaces with the tool (experiment JSON file)
  • How to deploy using various providers (preparing raw-code specific to providers, env variables etc.)
  • Obtaining results

Scale the number of functions in a single serverless.com deployment

FOR each service:
   create serverless.yml
   run serverless deploy
   delete serverless.yml
  • The services shall be removed in a similar fashion.
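The loop above could be sketched in Go along these lines (the yml template, service names, and file layout are assumptions for illustration, not STeLLAR's actual deployment code):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// serverlessYML renders a minimal serverless.yml for one service; the
// template here is a placeholder, not the real STeLLAR configuration.
func serverlessYML(service string) string {
	return fmt.Sprintf("service: %s\nprovider:\n  name: aws\n  runtime: python3.8\n", service)
}

// deployService writes serverless.yml, runs `serverless deploy`, and
// deletes the file again, mirroring the pseudocode above.
func deployService(service string) error {
	if err := os.WriteFile("serverless.yml", []byte(serverlessYML(service)), 0o644); err != nil {
		return err
	}
	defer os.Remove("serverless.yml")
	cmd := exec.Command("serverless", "deploy")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	for _, svc := range []string{"service-0", "service-1"} {
		if err := deployService(svc); err != nil {
			fmt.Println("deploy failed:", err)
		}
	}
}
```

Removal would follow the same pattern with `serverless remove`.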

Build & Deploy Java functions with Gradle

  • Create a raw-code SnapStart Java function
  • Create a SnapStart experiment JSON file
  • Create a STeLLAR module which builds the Java function with Gradle
  • Deploy it to AWS using serverless.com

Data Transfer experiments stuck at subexperiment 0

Hello,

I am reproducing the data transfer results shown in the STeLLAR paper using vHive.
I have set up vHive on CloudLab using the provided profile.

To run the experiments I ran the commands given here: https://github.com/ease-lab/STeLLAR/wiki/vHive-Benchmarking

On running the command for inline data transfer, I get the output shown below. The experiment has been running for multiple hours now and is stuck at subexperiment 0.
[screenshot]

I checked and confirmed that all pods are up and running. All kn services deployed by deploy_functions script are also up
[screenshot]

The warm.json file is the same as the one in the repo.
I have modified the vHive.json file in endpoints/vhive to include the correct Gateway ID. Specifically I have replaced
producer.default.192.168.1.240.nip.io:80 and consumer.default.192.168.1.240.nip.io:80 with producer.default.192.168.1.240.sslip.io:50051

switch to alternative methods of deployment for Azure Functions

Presently, STeLLAR deploys Azure Functions using the Serverless Framework. Unfortunately, they have announced that non-AWS providers will no longer be supported in Serverless Framework V4. We may require alternative methods of deployment for Azure Functions, such as by using the Azure CLI directly.

As an interim measure, we can continue to use Serverless Framework V3 (install with npm i serverless@3). It will be "maintained via critical security and bug fixes through 2024".

Discard latency results for non-200 status code responses

This issue is possibly related to #313. Azure's logs have shown that some percentage of requests are returning 5xx errors, and they have extremely high latencies (possibly due to timeout?). We should discard all non-200 status code responses to maintain the integrity of the data.

Required Changes

  • Ensure latency samples from non-200 responses are not written to the latencies csv file
  • Ensure that the percentage of failures in an experiment does not exceed a certain threshold (5-10%); fail the experiment if the threshold is exceeded.
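The two checks could look like this in Go (the `sample` type and function names are illustrative, not the actual STeLLAR structs):

```go
package main

import "fmt"

// sample pairs a latency measurement with the HTTP status it came from.
type sample struct {
	latencyMs  float64
	statusCode int
}

// keepSuccessful drops every sample whose status is not 200, so only
// genuine successes reach the latencies CSV file.
func keepSuccessful(samples []sample) []float64 {
	var out []float64
	for _, s := range samples {
		if s.statusCode == 200 {
			out = append(out, s.latencyMs)
		}
	}
	return out
}

// failureRateExceeded reports whether the fraction of non-200 responses
// is above the given threshold (e.g. 0.05 for 5%), in which case the
// experiment should be failed outright.
func failureRateExceeded(samples []sample, threshold float64) bool {
	if len(samples) == 0 {
		return false
	}
	failures := 0
	for _, s := range samples {
		if s.statusCode != 200 {
			failures++
		}
	}
	return float64(failures)/float64(len(samples)) > threshold
}

func main() {
	data := []sample{{12.1, 200}, {3050.4, 503}, {13.8, 200}}
	fmt.Println(keepSuccessful(data), failureRateExceeded(data, 0.05))
}
```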

Migrate AWS Lambda go1.x runtime to provided.al2023

Continuous benchmarking runtime experiments for AWS Lambda are failing due to the deprecation of the go1.x runtime (see here for details).

It looks like AWS Lambda no longer accepts function deployments with the go1.x runtime, so it needs to be migrated to the provided.al2023 runtime provided by AWS.

It might be a good idea to temporarily remove Go runtime experiments for AWS Lambda for now until this is implemented.

Teardown VM and remove self-hosted runner - Error: Runner not found

Summary

This workflow is designed to close and delete the self-hosted runner and the VM for each provider. The error occurs when trying to remove the runner, due to the inability to retrieve the runner ID. This happens because the Provision VM workflow encountered an error, which prevented the runner from being created and registered.

When I fixed the Provision VM workflow (while experimenting on my forked repository), the Teardown VM workflow also executed successfully.


Error: Runner not found, in step, Remove self-hosted runner


Error

[screenshot]

The error stems from the runner not being registered in the Provision VM workflow, due to a Personal Access Token permissions error. Once the PAT is updated, both issues should be resolved.


TODO:

  • Once Provision VM workflow executes without errors on the vhive-serverless/STeLLAR repository (runner is created and registered), re-run this workflow to verify no errors.

Make domain for STeLLAR

Having a domain would help greatly for some Cloud Providers in the future. For example, Cloudflare allows you to control the Zone (location) of workers with a domain, and Alibaba limits deployments with no domains to 30 per day.

Clean up legacy AWS deployment code

There are some modules such as connection which are no longer needed after the Serverless Framework integration and require removal from the repository. We need to confirm that such modules/code are obsolete and remove them accordingly, as some of them may not be compatible with new changes and result in failing tests.

Add error checking/retry for deployment/removal steps

The pipeline may fail due to unforeseen circumstances, usually when a single deployment/removal fails. Stopping the whole pipeline due to a single deployment failure is not desirable as everything needs to be restarted. We need to add an error check + retry for deployment/removal of functions.

Required Changes

  • Add error check + retry mechanism to function deployment/removal logic

Provision VM and setup self hosted runners - EC2 instances connection status check

Summary

This workflow creates a VM for each of the providers and registers a runner on the VM to execute the STeLLAR client setup commands.

The error occurs when trying to create a registration token for the runner to be hosted on the VM.

I forked this repository so that I could use the migrated Azure credentials and my own Personal Access Token (PAT) for the GitHub API actions.
After giving the PAT admin read-write permissions to my forked repository, I was able to create the registration token, and the workflow completed without errors.

Another observation is that the credentials for at least Azure and GCR seem to be valid, as the Azure Services Cleanup and GCR Services Cleanup workflows are able to make connections to the services.


Error in step, Create Registration Token self hosted runners for Repository

The Personal Access Token (PAT) used to authenticate gh api actions must have admin access to the repository and should not be expired. In the workflow, this token is represented by the environment variable GH_TOKEN, which retrieves its value from the secrets.

[screenshot]

Issue:

I currently do not have access to view or edit the secrets. When I forked the repository and used my own PAT and Azure credentials, the workflow for Azure ran without errors.

Action Items:

  • Request Access: Obtain access to the vhive-serverless/STeLLAR repository to view and edit Action Secrets.
  • Create New PAT: Generate a new PAT with admin (read-write) permissions for the STeLLAR repository to perform the required action. Refer to the Create Registration Token for Repository Request API Documentation.
  • Credentials Migration: Finish migrating the credentials for AWS and GCR, then verify that the workflow runs without errors for all providers.


Add support for Cloudflare Worker deployment

Add support for Cloudflare Worker deployment and benchmarking via STeLLAR. Cloudflare Workers used to be only deployable via a domain that is owned by the user (e.g. www.mydomain.com), but they now support deployments via a subdomain called workers.dev without the need to own a domain.

Issues

  • The Serverless Framework plugin for Cloudflare Workers only supports deployments with a domain (which we do not have), since it requires a "zoneId" value
  • Due to the nature of Cloudflare's infrastructure, the Serverless Framework plugin only supports JavaScript runtime deployments (even after modification for domain/zone-less deployments)

Solution

  • Use Cloudflare's CLI tool wrangler to deploy instead
  • This supports both domain/zone-less deployments as well as multiple runtimes

Add Azure credentials as repository secret for GitHub Actions

E2E tests for experiments involving Azure deployments require Azure credentials to be set up as a repository secret.

Steps to generate Azure credentials with an existing account:

  1. Run az login and login with a web browser
  2. Find your subscription ID and export it as a variable using export AZURE_SUBSCRIPTION_ID={your_subscription_id}
  3. Run
az ad sp create-for-rbac --name "STeLLAR GitHub Actions" --role contributor \
                         --scopes /subscriptions/$AZURE_SUBSCRIPTION_ID \
                         --sdk-auth

to generate credentials resembling

{
   "clientId": "<GUID>",
   "clientSecret": "<STRING>",
   "subscriptionId": "<GUID>",
   "tenantId": "<GUID>",
   "resourceManagerEndpointUrl": "<URL>"
   (...)
 }
  4. Add the credentials in JSON format as a secret in GitHub

Reference: https://github.com/marketplace/actions/azure-login#configure-deployment-credentials

Refactor function config key-values to experiment JSON file

Currently, certain function config key-values such as handler and package patterns are hardcoded inside src/setup/serverless_config.go. These values should be moved to the experiment JSON file to be defined by the user for greater flexibility.

  • Add relevant key-values to SubExperiment struct
  • Replace hardcoded values with struct fields

Address self-hosted runners' stability

Our self-hosted runner VMs sometimes go offline, mostly due to running out of disk space. This seems to be caused by a combination of limited initial disk space and GitHub's runner updates creating duplicates (see here). We need to make the runners more stable.

Requirements

  • Attempt more thorough cleanups
  • Increase disk space on the VMs

Reduce Azure Deployment/Teardown Times

Cold start experiments in Azure seem to be taking an unusually long time (see Azure cold function run here), up to 3 hours.

If possible, we should try to reduce this by deploying/removing them in parallel (e.g. using goroutines).

Investigate unexpected Azure warm start latencies

Azure's warm baseline experiments are yielding unexpected latencies (up to ~3000ms) unpredictably in the middle of experiments when only warm invocations should be occurring. We need to check if these are truly warm starts, or a case of new instances being created despite the existence of an idle warm instance, or some other unknown matter/mechanism at work.

Azure warm baseline experiment JSON:

{
  "Sequential": false,
  "Provider": "azure",
  "Runtime": "python3.8",
  "SubExperiments": [
    {
      "Title": "warm-baseline-azure",
      "Function": "hellopy",
      "Handler": "main.main",
      "PackageType": "Zip",
      "PackagePattern": "main.py",
      "Bursts": 51,
      "BurstSizes": [
        1
      ],
      "IATSeconds": 10,
      "DesiredServiceTimes": [
        "0ms"
      ]
    }
  ]
}

Add support for Azure

Features

  • Deployment of Azure functions with the Serverless framework
  • Modify benchmarking code if necessary to retrieve endpoints of Azure functions, run experiments and send bursts

Continuous Benchmarking Experiments - Baseline (Warm/Cold), Runtime, Image Size

Understanding

Recent Runs (No Logs)

  • All continuous benchmarking experiments are failing.
  • No logs are present because no runners were able to register on either provider's VMs during the latest provisioning workflow.
  • Without runners, no entities were available to handle experiment requests, resulting in the lack of logs.
[screenshot]

In recent runs, the absence of logs is due to failures in the provisioning of VMs and self-hosted runners. Consequently, the experiment workflow keeps waiting for a runner to pick up the job. Each workflow runs on the self-hosted runner of its provider, ensuring that the STeLLAR client is built on a server close to the infrastructure handling the function endpoints. This proximity minimizes request send-off delays, which is crucial for accurately measuring the roundtrip latency used in benchmarking.

Runs from 3 and 4 Weeks Ago

  • GCR (Google Cloud Run) functioned successfully.
  • Azure, AWS, and Cloudflare experienced common deployment errors.
  • These errors occurred when the workflow attempted to deploy the function to the service.
  • Specific errors included issues with the sls deploy (Serverless Framework) and wrangler CLI tools for Azure, AWS, and Cloudflare, respectively.


For GCR - runs without errors

GCR runs without errors, likely due to its use of the gcloud run command for deployment automation, avoiding permission issues.

For Azure and AWS

Azure and AWS failures suggest permission problems with the sls deploy command used by the Serverless Framework.
The errors likely occur because the GitHub Actions Runner setup on the VM does not have the necessary permissions to access a file in the serverless package's directory.

[screenshot]

Another observation is that this error could be related to a warning issued in the Provision VM workflow logs when setting up the runner service on the Azure VM and the EC2 instance.

[screenshot]

The serverless npm package requires Node.js version 18.0.0 or higher. The current Node.js version is 16.20.2, which could be why we encounter issues or unexpected behavior when using the Serverless Framework (sls deploy).

For Cloudflare

Cloudflare's error indicates a missing wrangler executable, required for its deployment process.

[screenshot]

Similar to the Azure and AWS runner setup warning for serverless, when the Cloudflare runner is being set up, the workflow fails to install wrangler (from the Provision VM workflow logs): the GitHub Actions runner user running the command does not have the necessary permissions to write to the /usr/lib/node_modules directory.

[screenshot]

Steps to fix Issue

  • Understand why the permissions issue occurs and how to grant the GitHub Actions runner the necessary permissions.
  • Complete the credential migration process, update the secrets in the repository with the new credentials, and then rerun the workflows to verify if the errors persist.

concurrent Alibaba Cloud deployments failing

Currently, STeLLAR supports deployments to Alibaba Cloud through the serverless-aliyun-function-compute plugin. However, the plugin appears to have a limitation. During deployments, it attempts to create a single OSS bucket with the exact same name, in the format of sls-{ACCOUNT_NUMBER}-{REGION_NAME}. This is an issue as OSS bucket names must be unique globally and subsequent deployments through the plugin would fail. This is the root cause of STeLLAR integration/end-to-end test failures when they are executed concurrently.

Alibaba Cloud's documentation suggests the use of an alternative framework, Serverless Devs, for Function Compute deployments.
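The root cause can be seen directly from the naming scheme: because the bucket name contains nothing deployment-specific, any two concurrent deployments from the same account and region compute the same globally unique OSS bucket name (a minimal Go sketch; the account number and region below are made up):

```go
package main

import "fmt"

// pluginBucketName mirrors the plugin's fixed naming scheme,
// sls-{ACCOUNT_NUMBER}-{REGION_NAME}. Since it has no per-deployment
// component, concurrent deployments collide on the same bucket name.
func pluginBucketName(account, region string) string {
	return fmt.Sprintf("sls-%s-%s", account, region)
}

func main() {
	a := pluginBucketName("1234567890", "cn-hangzhou")
	b := pluginBucketName("1234567890", "cn-hangzhou")
	// Two concurrent test runs ask OSS for the exact same bucket.
	fmt.Println(a, a == b)
}
```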
