
stellar's People

Contributors

amariucaitheodor, dependabot[bot], dhschall, dilinade, kk-min, plamenmpetrov, ria, ustiugov, ypwong99


stellar's Issues

Missing vHive Minio Data Transfer Integration Test

The following should be debugged and included in the pipeline:

    - name: vHive Minio Data Transfer (Integration Test)
      working-directory: ${{env.working-directory}}
      run: |
        docker run -d -p 50052:9000 --name minio -e "MINIO_ROOT_USER=minio" -e "MINIO_ROOT_PASSWORD=minio123" minio/minio server /data
        wget https://dl.min.io/client/mc/release/linux-amd64/mc
        chmod +x mc
        sudo mv mc /usr/local/bin
        mc alias set myminio http://localhost:50052 minio minio123
        mc mb myminio/mybucket
        ./main -g ../endpoints -o ../latency-samples -c ../experiments/tests/vhive/data-transfer-minio.json

Translate producer-consumer from Golang to Python

This PR focuses on translating the producer-consumer image from Golang to Python. The reason for the translation is that Golang is not natively supported by Azure Functions: the only workaround is to write a custom handler HTTP server that the Azure Function queries. This wouldn't be a problem in itself if the serverless.com tool supported Golang deployments, but it doesn't either.

Files to be translated include:

  • producer-consumer/aws/main.go
  • producer-consumer/vhive/main.go
  • producer-consumer/common/aws.go
  • producer-consumer/common/grpc.go
  • producer-consumer/common/main.go
  • producer-consumer/common/object-storage.go
  • producer-consumer/common/util.go
  • producer-consumer/common/util-test.go

automatic provisioning and teardown of self-hosted runners (Continuous Benchmarking)

To reduce costs, self-hosted runners for the various cloud providers (AWS, Azure, Google Cloud, Cloudflare) need to be provisioned before experiments and torn down afterwards. This is done by provisioning the VMs via each cloud provider's CLI and using GitHub's API to register them as self-hosted runners, after which the respective provider's continuous benchmarking experiments can be picked up by the runners.

Current Schedule:
Sunday 11pm UTC - Automatic provisioning of VMs
Monday 12am UTC - Baseline experiments (expected to run up to 12h)
Monday 12pm UTC - Image size experiments (expected to run up to 12h)
Tuesday 12am UTC - Runtime experiments (expected to run up to 24h)
Wednesday 12am UTC - Automatic teardown of VMs

Note: Cloudflare's self-hosted runner VM is to be hosted on AWS EC2 as Cloudflare currently does not offer a VM service.

Current Progress:
AWS: Completed ✅ (#450)
Azure: Completed ✅ (#446)
Google Cloud: Completed ✅ (#447)
Cloudflare: Completed ✅ (#453)

Update compatibility table in README

We need to update the compatibility table in the README to make it more complete and presentable. The relevant section is as follows:


Serverless.com & Cloud Provider Capabilities

While the serverless.com framework is a powerful tool for deploying Lambda functions to AWS, its capabilities are more limited with other providers. The following table compares serverless.com features across the providers whose deployment we considered automating.

| Feature | AWS Lambda | Azure Function | Google Cloud Run | Knative | Google Cloud Function* | Alibaba |
|---|---|---|---|---|---|---|
| Deploy function - zip | Yes | Yes | No | N/A | Yes | Yes |
| Deploy function - docker | Yes | No (documentation does not mention it)** | No | Yes | No (documentation does not mention it)** | No (documentation does not mention it)** |
| Python runtime | Yes | Yes | No | Yes | Yes | ??** |
| Go runtime | Yes | No | No | Yes | Yes | ??** |
| Java runtime | Yes | No | No | ??** | No | ??** |
| Node.js runtime | Yes | Yes | No | Yes | Yes | Yes |
| HTTP trigger | Yes | Yes | No | Yes | Yes | Yes |
| Resource management (e.g. buckets, databases) | Yes | Yes (at least it looks like it) | No | ??** | ??** | ??** |
| Serverless framework support | Yes | Yes | No | Yes | Yes | Yes |

*experimental version, not meant for production yet

**needs to be empirically verified


Add Google Cloud Run Credentials to repo for Github Actions

In order to run GCR e2e tests, we need to have credentials configured inside the repository secrets. We can take this opportunity to create an official account for GCR as well.

Prerequisites

  • Create a Google Cloud Project (We can login with existing official gmail account if any)

Configure Credentials

  • Create a Service Account under the project
  • Get the JSON service account key
  • Upload secret to Github Secrets (GCR_CREDENTIALS)

See here for more details.

Make generated CDF file easier to read

Currently, the x-axis of the generated CDF file is fixed from 0 to 2000. This can make certain CDFs hard to read, especially if the tail latency is fairly low. See the following example:

[figure: empirical CDF]

We should instead set the x-axis maximum to the largest value in the latency samples, or slightly higher.
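A possible fix, sketched in Go (the function name and the 5% headroom factor are illustrative choices, not the actual STeLLAR code):

```go
package main

import "fmt"

// axisMax returns an x-axis upper bound for the CDF plot: the largest
// latency sample plus 5% headroom, so the tail remains visible without
// the plot being dominated by empty space. Falls back to the previous
// hardcoded default when there are no samples.
func axisMax(latenciesMs []float64) float64 {
	if len(latenciesMs) == 0 {
		return 2000 // previous hardcoded default
	}
	max := latenciesMs[0]
	for _, v := range latenciesMs[1:] {
		if v > max {
			max = v
		}
	}
	return max * 1.05
}

func main() {
	fmt.Println(axisMax([]float64{12.3, 48.9, 31.5})) // well under 2000
}
```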

Update Wiki on benchmarking with various providers

The Wiki needs to be updated once feature-serverless-framework-deployment is merged into main. Some notes on things that can be updated/included:

  • How the user interfaces with the tool (experiment JSON file)
  • How to deploy using various providers (preparing raw-code specific to providers, env variables etc.)
  • Obtaining results

Scale the number of functions in a single serverless.com deployment

FOR each service:
   create serverless.yml
   run serverless deploy
   delete serverless.yml
  • The services shall be removed in a similar fashion.
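The loop above could be sketched in Go along these lines (the yml template, service names, and file layout are assumptions for illustration, not STeLLAR's actual deployment code):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// serverlessYML renders a minimal serverless.yml for one service; the
// template here is a placeholder, not the real STeLLAR configuration.
func serverlessYML(service string) string {
	return fmt.Sprintf("service: %s\nprovider:\n  name: aws\n  runtime: python3.8\n", service)
}

// deployService writes serverless.yml, runs `serverless deploy`, and
// deletes the file again, mirroring the pseudocode above.
func deployService(service string) error {
	if err := os.WriteFile("serverless.yml", []byte(serverlessYML(service)), 0o644); err != nil {
		return err
	}
	defer os.Remove("serverless.yml")
	cmd := exec.Command("serverless", "deploy")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	for _, svc := range []string{"service-0", "service-1"} {
		if err := deployService(svc); err != nil {
			fmt.Println("deploy failed:", err)
		}
	}
}
```

Removal would follow the same pattern with `serverless remove`.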

Build & Deploy Java functions with Gradle

  • Create a raw-code SnapStart Java function
  • Create a SnapStart experiment JSON file
  • Create a STeLLAR module which builds the Java function with Gradle
  • Deploy it to AWS using serverless.com

Data Transfer experiments stuck at subexperiment 0

Hello,

I am reproducing the data transfer results shown in the STeLLAR paper using vHive.
I have set up vHive on CloudLab using the provided profile.

To run the experiments I ran the commands given here: https://github.com/ease-lab/STeLLAR/wiki/vHive-Benchmarking

On running the command for inline data transfer, I get the output shown below. The experiment has been running for multiple hours now and is stuck at subexperiment 0.
[screenshot]

I checked and confirmed that all pods are up and running. All kn services deployed by deploy_functions script are also up
[screenshot]

The warm.json file is the same as the one in the repo.
I have modified the vHive.json file in endpoints/vhive to include the correct Gateway ID. Specifically I have replaced
producer.default.192.168.1.240.nip.io:80 and consumer.default.192.168.1.240.nip.io:80 with producer.default.192.168.1.240.sslip.io:50051

switch to alternative methods of deployment for Azure Functions

Presently, STeLLAR deploys Azure Functions using the Serverless Framework. Unfortunately, they have announced that non-AWS providers will no longer be supported in Serverless Framework V4. We may require alternative methods of deployment for Azure Functions, such as by using the Azure CLI directly.

As an interim measure, we can continue to use Serverless Framework V3 (install with npm i serverless@3). It will be "maintained via critical security and bug fixes through 2024".

Discard latency results for non-200 status code responses

This issue is possibly related to #313. Azure's logs have shown that some percentage of requests are returning 5xx errors, and they have extremely high latencies (possibly due to timeout?). We should discard all non-200 status code responses to maintain the integrity of the data.

Required Changes

  • Ensure latency samples from non-200 responses are not written to the latencies csv file
  • Ensure that the percentage of failures in an experiment does not exceed a certain threshold (5-10%); fail the experiment if the threshold is exceeded.
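The two checks could look like this in Go (the `sample` type and function names are illustrative, not the actual STeLLAR structs):

```go
package main

import "fmt"

// sample pairs a latency measurement with the HTTP status it came from.
type sample struct {
	latencyMs  float64
	statusCode int
}

// keepSuccessful drops every sample whose status is not 200, so only
// genuine successes reach the latencies CSV file.
func keepSuccessful(samples []sample) []float64 {
	var out []float64
	for _, s := range samples {
		if s.statusCode == 200 {
			out = append(out, s.latencyMs)
		}
	}
	return out
}

// failureRateExceeded reports whether the fraction of non-200 responses
// is above the given threshold (e.g. 0.05 for 5%), in which case the
// experiment should be failed outright.
func failureRateExceeded(samples []sample, threshold float64) bool {
	if len(samples) == 0 {
		return false
	}
	failures := 0
	for _, s := range samples {
		if s.statusCode != 200 {
			failures++
		}
	}
	return float64(failures)/float64(len(samples)) > threshold
}

func main() {
	data := []sample{{12.1, 200}, {3050.4, 503}, {13.8, 200}}
	fmt.Println(keepSuccessful(data), failureRateExceeded(data, 0.05))
}
```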

Migrate AWS Lambda go1.x runtime to provided.al2023

Continuous benchmarking runtime experiments for AWS Lambda are failing due to the deprecation of the go1.x runtime (see here for details).

It looks like AWS Lambda no longer accepts function deployments with the go1.x runtime, so it needs to be migrated to the provided.al2023 runtime provided by AWS.

It might be a good idea to temporarily remove Go runtime experiments for AWS Lambda for now until this is implemented.

Teardown VM and remove self-hosted runner - Error: Runner not found

Summary

This workflow is designed to close and delete the self-hosted runner and the VM for each provider. The error occurs when trying to remove the runner, due to the inability to retrieve the runner ID. This happens because the Provision VM workflow encountered an error, which prevented the runner from being created and registered.

When I fixed the Provision VM workflow (while experimenting on my forked repository), the Teardown VM workflow also executed successfully.


Error: Runner not found, in step, Remove self-hosted runner


Error

[screenshot]

The error stems from the runner not being registered in the Provision VM workflow, due to a Personal Access Token permissions error. Once the PAT is updated, both issues should be resolved.


TODO:

  • Once Provision VM workflow executes without errors on the vhive-serverless/STeLLAR repository (runner is created and registered), re-run this workflow to verify no errors.

Make domain for STeLLAR

Having a domain would help greatly for some Cloud Providers in the future. For example, Cloudflare allows you to control the Zone (location) of workers with a domain, and Alibaba limits deployments with no domains to 30 per day.

Clean up legacy AWS deployment code

There are some modules such as connection which are no longer needed after the Serverless Framework integration and require removal from the repository. We need to confirm that such modules/code are obsolete and remove them accordingly, as some of them may not be compatible with new changes and result in failing tests.

Add error checking/retry for deployment/removal steps

The pipeline may fail due to unforeseen circumstances, usually when a single deployment/removal fails. Stopping the whole pipeline due to a single deployment failure is not desirable as everything needs to be restarted. We need to add an error check + retry for deployment/removal of functions.

Required Changes

  • Add error check + retry mechanism to function deployment/removal logic

Provision VM and setup self hosted runners - EC2 instances connection status check

Summary

This workflow creates a VM for each of the providers and registers a runner on the VM to execute the STeLLAR client setup commands.

The error occurs when trying to create a registration token for the runner to be hosted on the VM.

I forked this repository so that I could use the migrated Azure credentials and my own Personal Access Token (PAT) for the GitHub API actions.
After giving the PAT admin read-write permissions to my forked repository, I was able to create the registration token, and the workflow completed without errors.

Another observation is that the credentials for at least Azure and GCR seem to be valid, as the Azure Services Cleanup and GCR Services Cleanup workflows are able to make connections to the services.


Error in step, Create Registration Token self hosted runners for Repository

The Personal Access Token (PAT) used to authenticate gh api actions must have admin access to the repository and should not be expired. In the workflow, this token is represented by the environment variable GH_TOKEN, which retrieves its value from the secrets.

[screenshot]

Issue:

I currently do not have access to view or edit the secrets. When I forked the repository and used my own PAT and Azure credentials, the workflow for Azure ran without errors.

Action Items:

  • Request Access: Obtain access to the vhive-serverless/STeLLAR repository to view and edit Action Secrets.
  • Create New PAT: Generate a new PAT with admin (read-write) permissions for the STeLLAR repository to perform the required action. Refer to the Create Registration Token for Repository Request API Documentation.
  • Credentials Migration: Finish migrating the credentials for AWS and GCR, then verify that the workflow runs without errors for all providers.


Add support for Cloudflare Worker deployment

Add support for Cloudflare Worker deployment and benchmarking via STeLLAR. Cloudflare Workers used to be only deployable via a domain that is owned by the user (e.g. www.mydomain.com), but they now support deployments via a subdomain called workers.dev without the need to own a domain.

Issues

  • The Serverless Framework plugin for Cloudflare Workers only supports deployments with a domain (which we do not have), since it requires a "zoneId" value
  • Due to the nature of Cloudflare's infrastructure, the Serverless Framework plugin only supports JavaScript runtime deployments (even after modification for domain/zone-less deployments)

Solution

  • Use Cloudflare's CLI tool wrangler to deploy instead
  • This supports both domain/zone-less deployments as well as multiple runtimes

Add Azure credentials as repository secret for GitHub Actions

E2E tests for experiments involving Azure deployments require Azure credentials to be set up as a repository secret.

Steps to generate Azure credentials with an existing account:

  1. Run az login and login with a web browser
  2. Find your subscription ID and export it as a variable using export AZURE_SUBSCRIPTION_ID={your_subscription_id}
  3. Run
az ad sp create-for-rbac --name "STeLLAR GitHub Actions" --role contributor \
                         --scopes /subscriptions/$AZURE_SUBSCRIPTION_ID \
                         --sdk-auth

to generate credentials resembling

{
   "clientId": "<GUID>",
   "clientSecret": "<STRING>",
   "subscriptionId": "<GUID>",
   "tenantId": "<GUID>",
   "resourceManagerEndpointUrl": "<URL>"
   (...)
 }
  4. Add the credentials in JSON format as a secret in GitHub

Reference: https://github.com/marketplace/actions/azure-login#configure-deployment-credentials

Refactor function config key-values to experiment JSON file

Currently, certain function config key-values such as handler and package patterns are hardcoded inside src/setup/serverless_config.go. These values should be moved to the experiment JSON file to be defined by the user for greater flexibility.

  • Add relevant key-values to SubExperiment struct
  • Replace hardcoded values with struct fields

Address self-hosted runners' stability

Our self-hosted runner VMs sometimes go offline, mostly due to running out of disk space. This seems to be caused by a combination of limited initial disk space and GitHub's runner updates creating duplicates (see here). We need to make the runners more stable.

Requirements

  • Attempt more thorough cleanups
  • Increase disk space on the VMs

Reduce Azure Deployment/Teardown Times

Cold start experiments in Azure seem to be taking an unusually long time (see Azure cold function run here), up to 3 hours.

If possible, we should try to reduce this by deploying/removing them in parallel (e.g. using goroutines).

Investigate unexpected Azure warm start latencies

Azure's warm baseline experiments are yielding unexpected latencies (up to ~3000ms) unpredictably in the middle of experiments when only warm invocations should be occurring. We need to check if these are truly warm starts, or a case of new instances being created despite the existence of an idle warm instance, or some other unknown matter/mechanism at work.

Azure warm baseline experiment JSON:

{
  "Sequential": false,
  "Provider": "azure",
  "Runtime": "python3.8",
  "SubExperiments": [
    {
      "Title": "warm-baseline-azure",
      "Function": "hellopy",
      "Handler": "main.main",
      "PackageType": "Zip",
      "PackagePattern": "main.py",
      "Bursts": 51,
      "BurstSizes": [
        1
      ],
      "IATSeconds": 10,
      "DesiredServiceTimes": [
        "0ms"
      ]
    }
  ]
}

Add support for Azure

Features

  • Deployment of Azure functions with the Serverless framework
  • Modify benchmarking code if necessary to retrieve endpoints of Azure functions, run experiments and send bursts

Continuous Benchmarking Experiments - Baseline (Warm/Cold), Runtime, Image Size

Understanding

Recent Runs (No Logs)

  • All continuous benchmarking experiments are failing.
  • No logs are present because no runners were able to register on either provider's VMs during the latest provisioning workflow.
  • Without runners, no entities were available to handle experiment requests, resulting in the lack of logs.
[screenshot]

In recent runs, the absence of logs is due to failures in the provisioning of VMs and self-hosted runners. Consequently, the experiment workflow keeps waiting for a runner to pick up the job. Each workflow runs on the self-hosted runner of its provider, ensuring that the STeLLAR client is built on a server close to the infrastructure handling the function endpoints. This proximity minimizes request send-off delays, which is crucial for accurately measuring the roundtrip latency used in benchmarking.

Runs from 3 and 4 Weeks Ago

  • GCR (Google Cloud Run) functioned successfully.
  • Azure, AWS, and Cloudflare experienced common deployment errors.
  • These errors occurred when the workflow attempted to deploy the function to the service.
  • Specific errors included issues with the sls deploy (Serverless Framework) and wrangler CLI tools for Azure, AWS, and Cloudflare, respectively.


For GCR - runs without errors

GCR runs without errors, likely due to its use of the gcloud run command for deployment automation, avoiding permission issues.

For Azure and AWS

Azure and AWS failures suggest permission problems with the sls deploy command used by the Serverless Framework.
The errors likely occur because the GitHub Actions Runner setup on the VM does not have the necessary permissions to access a file in the serverless package's directory.

[screenshot]

Another observation is that this error could be related to a warning issued in the Provision VM workflow logs when setting up the runner service on the Azure VM and the EC2 instance.

[screenshot]

The serverless npm package requires Node.js version 18.0.0 or higher. The current Node.js version is 16.20.2, which could be why we encounter issues or unexpected behavior when using the Serverless Framework (sls deploy).

For Cloudflare

Cloudflare's error indicates a missing wrangler executable, required for its deployment process.

[screenshot]

Similar to the Azure and AWS runner setup warning for serverless, when the Cloudflare runner is being set up, the workflow fails to install wrangler (from the Provision VM workflow logs): the GitHub Actions runner user running the command does not have the necessary permissions to write to the /usr/lib/node_modules directory.

[screenshot]

Steps to fix Issue

  • Understand why the permissions issue occurs and how to grant the GitHub Actions runner the necessary permissions.
  • Complete the credential migration process, update the secrets in the repository with the new credentials, and then rerun the workflows to verify if the errors persist.

concurrent Alibaba Cloud deployments failing

Currently, STeLLAR supports deployments to Alibaba Cloud through the serverless-aliyun-function-compute plugin. However, the plugin appears to have a limitation. During deployments, it attempts to create a single OSS bucket with the exact same name, in the format of sls-{ACCOUNT_NUMBER}-{REGION_NAME}. This is an issue as OSS bucket names must be unique globally and subsequent deployments through the plugin would fail. This is the root cause of STeLLAR integration/end-to-end test failures when they are executed concurrently.

Alibaba Cloud's documentation suggests the use of an alternative framework, Serverless Devs, for Function Compute deployments.
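The root cause can be seen directly from the naming scheme: because the bucket name contains nothing deployment-specific, any two concurrent deployments from the same account and region compute the same globally unique OSS bucket name (a minimal Go sketch; the account number and region below are made up):

```go
package main

import "fmt"

// pluginBucketName mirrors the plugin's fixed naming scheme,
// sls-{ACCOUNT_NUMBER}-{REGION_NAME}. Since it has no per-deployment
// component, concurrent deployments collide on the same bucket name.
func pluginBucketName(account, region string) string {
	return fmt.Sprintf("sls-%s-%s", account, region)
}

func main() {
	a := pluginBucketName("1234567890", "cn-hangzhou")
	b := pluginBucketName("1234567890", "cn-hangzhou")
	// Two concurrent test runs ask OSS for the exact same bucket.
	fmt.Println(a, a == b)
}
```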
