iterative / terraform-provider-iterative

☁️ Terraform plugin for machine learning workloads: spot instance recovery & auto-termination | AWS, GCP, Azure, Kubernetes

Home Page: https://registry.terraform.io/providers/iterative/iterative/latest/docs

License: Apache License 2.0

Makefile 0.31% Go 96.85% Shell 2.27% HCL 0.56%
terraform terraform-provider tpi terraform-provider-iterative developer-tools cloud cloud-computing cloud-storage cml data-science

terraform-provider-iterative's Introduction

Terraform Provider Iterative (TPI)

TPI is a Terraform plugin built with machine learning in mind. This CLI tool offers full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert.

  • Lower cost with spot recovery: transparent data checkpoint/restore & auto-respawning of low-cost spot/preemptible instances
  • No cloud vendor lock-in: switch between clouds with just one line thanks to unified abstraction
  • No waste: auto-cleanup unused resources (terminate compute instances upon task completion/failure & remove storage upon download of results), pay only for what you use
  • Developer-first experience: one-command data sync & code execution with no external server, making the cloud feel like a laptop

Supported cloud vendors include:

  • Amazon Web Services (AWS)
  • Microsoft Azure
  • Google Cloud Platform (GCP)
  • Kubernetes (K8s)

Why TPI?

There are several reasons to use TPI instead of other related solutions (custom scripts and/or cloud orchestrators):

  1. Reduced management overhead and infrastructure cost: TPI is a CLI tool, not a running service. It requires no additional orchestrating machine (control plane/head nodes) to schedule/recover/terminate instances. Instead, TPI runs (spot) instances via cloud-native scaling groups [1], so recovery and termination are handled automatically on the cloud provider's side. You can close your laptop while cloud tasks are running; auto-recovery happens even if you are offline.
  2. Unified tool for data science and software development teams: TPI provides consistent tooling for both data scientists and DevOps engineers, improving cross-team collaboration. This simplifies compute management to a single config file, and reduces time to deliver ML models into production.
  3. Reproducible, codified environments: Store hardware requirements in a single configuration file alongside the rest of your ML pipeline code.

TPI is used to power CML, bringing cloud providers to existing GitHub, GitLab & Bitbucket CI/CD workflows (repository).

Usage

Requirements

  • Install Terraform 1.0+, e.g.:
    • Brew (Homebrew/Mac OS): brew tap hashicorp/tap && brew install hashicorp/tap/terraform
    • Choco (Chocolatey/Windows): choco install terraform
    • Conda (Anaconda): conda install -c conda-forge terraform
    • Debian (Ubuntu/Linux):
      sudo apt-get update && sudo apt-get install -y gnupg software-properties-common curl
      curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
      sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
      sudo apt-get update && sudo apt-get install terraform
      
  • Create an account with any supported cloud vendor and expose its authentication credentials via environment variables
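Before running Terraform, it can help to confirm that the credentials are actually present in the environment. The following is a minimal sketch, not part of TPI: `missing_env_vars` is a hypothetical helper, and the AWS variable names in the usage comment are the standard AWS SDK ones. The names expected for other vendors are an open question here and should be taken from the provider documentation.

```shell
# Report which of the named environment variables are unset or empty.
# Prints each missing name on its own line; returns 1 if any are missing.
missing_env_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup via eval keeps this compatible with POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (AWS):
#   missing_env_vars AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY \
#     || echo "set the variables listed above before running terraform"
```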

Define a Task

In a project root directory, create a file named main.tf with the following contents:

terraform {
  required_providers { iterative = { source = "iterative/iterative" } }
}
provider "iterative" {}

resource "iterative_task" "example" {
  cloud      = "aws" # or any of: gcp, az, k8s
  machine    = "m"   # medium. Or any of: l, xl, m+k80, xl+v100, ...
  spot       = 0     # auto-price. Default -1 to disable, or >0 for hourly USD limit
  disk_size  = -1    # GB. Default -1 for automatic

  storage {
    workdir = "."       # default blank (don't upload)
    output  = "results" # default blank (don't download). Relative to workdir
  }
  script = <<-END
    #!/bin/bash

    # create output directory if needed
    mkdir -p results
    # read last result (in case of spot/preemptible instance recovery)
    if test -f results/epoch.txt; then EPOCH="$(cat results/epoch.txt)"; fi
    EPOCH=$${EPOCH:-1}  # start from 1 if last result not found

    echo "(re)starting training loop from $EPOCH up to 1337 epochs"
    for epoch in $(seq $EPOCH 1337); do
      sleep 1
      echo "$epoch" | tee results/epoch.txt
    done
  END
}

See the reference for the full list of options for main.tf -- including more information on machine types with and without GPUs.

Run this once (in the directory containing main.tf) to download the required_providers:

terraform init
export TF_LOG_PROVIDER=INFO

Run Task

terraform apply

This launches a machine in the cloud, uploads workdir, and runs the script. Upon completion (or error), the machine is terminated.

With spot/preemptible instances (spot >= 0), auto-recovery logic and persistent (disk_size) storage will be used to relaunch interrupted tasks.
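Recovery only helps if the script can resume from its last checkpoint, which is what the `results/epoch.txt` logic in the example task does. The sketch below isolates that pattern as a standalone function (`resume_epoch` is a hypothetical name). Note that outside of main.tf the default expansion is written `${epoch:-1}` with a single `$`; the `$$` in the task script is Terraform's escape for a literal `$` before `{` inside a heredoc.

```shell
# Print the epoch to resume from: the last checkpointed value,
# or 1 if the checkpoint file is missing or empty.
resume_epoch() {
  checkpoint="$1"
  epoch=""
  if test -f "$checkpoint"; then
    epoch="$(cat "$checkpoint")"
  fi
  echo "${epoch:-1}"
}

# Usage inside a training script:
#   mkdir -p results
#   for epoch in $(seq "$(resume_epoch results/epoch.txt)" 1337); do
#     echo "$epoch" | tee results/epoch.txt
#   done
```

As in the example task, the resumed loop repeats the last recorded epoch, since the checkpoint is written when an epoch finishes and the interruption may have happened at any point after that.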

Query Status

Results and logs are periodically synced to persistent cloud storage. To query this status and view logs:

terraform refresh
terraform show
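To follow a long-running task without retyping these two commands, they can be wrapped in a small polling loop. This is only a sketch under stated assumptions: `poll_status` and its two parameters are hypothetical, and it simply repeats the commands above a fixed number of times rather than detecting when the task has finished.

```shell
# Refresh and print the task state a fixed number of times.
#   $1: seconds to wait between polls (default 30)
#   $2: number of polls (default 10)
poll_status() {
  interval="${1:-30}"
  count="${2:-10}"
  i=0
  while [ "$i" -lt "$count" ]; do
    # Same two commands as above; refresh output is hidden to keep the log short.
    terraform refresh >/dev/null && terraform show
    i=$((i + 1))
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
  done
}

# Usage: poll every minute, five times
#   poll_status 60 5
```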

End Task

terraform destroy

This terminates the machine (if still running), downloads output, and removes the persistent disk_size storage.

Example Projects

How it Works

This diagram shows what TPI does under the hood:

flowchart LR
subgraph tpi [what TPI manages]
direction LR
    subgraph you [what you manage]
        direction LR
        A([Personal Computer])
    end
    B[("Cloud Storage (low cost)")]
    C{{"Cloud instance scaler (zero cost)"}}
    D[["Cloud (spot) Instance"]]
    A ---> |2. create cloud storage| B
    A --> |1. create cloud instance scaler| C
    A ==> |3. upload script & workdir| B
    A -.-> |"4. offline (lunch break)"| A
    C -.-> |"5. (re)provision instance"| D
    D ==> |7. run script| D
    B <-.-> |6. persistent workdir cache| D
    D ==> |8. script end,\nshutdown instance| B
    D -.-> |outage| C
    B ==> |9. download output| A
end
style you fill:#FFFFFF00,stroke:#13ADC7
style tpi fill:#FFFFFF00,stroke:#FFFFFF00,stroke-width:0px
style A fill:#13ADC7,stroke:#333333,color:#000000
style B fill:#945DD5,stroke:#333333,color:#000000
style D fill:#F46737,stroke:#333333,color:#000000
style C fill:#7B61FF,stroke:#333333,color:#000000

Future Plans

TPI is a CLI tool bringing the power of bare-metal cloud to a bare-metal local laptop. We're working on more featureful and visual interfaces. We'd also like to have more native support for distributed (multi-instance) training, more data sync optimisations & options, and tighter ecosystem integration with tools such as DVC. Plus of course more examples for Data Scientists and Machine Learning Engineers - from Jupyter, VSCode, and Codespaces to improving the live logging/monitoring/reporting experience.

Help

The getting started guide has some more information. In case of errors, extra debugging information is available using TF_LOG_PROVIDER=DEBUG instead of INFO.

Feature requests and bugs can be reported via GitHub issues, while general questions and feedback are very welcome on our active Discord server.

Contributing

To work on the provider itself, use a local copy of the repository instead of the latest stable release.

  1. Install Go 1.17+
  2. Clone the repository & build the provider
    git clone https://github.com/iterative/terraform-provider-iterative
    cd terraform-provider-iterative
    make install
    
  3. Use source = "github.com/iterative/iterative" in your main.tf to use the local repository (source = "iterative/iterative" will download the latest release instead), and run terraform init --upgrade

Copyright

This project and all contributions to it are distributed under the Apache-2.0 license.

Footnotes

  1. AWS Auto Scaling Groups, Azure VM Scale Sets, GCP managed instance groups, and Kubernetes Jobs.

terraform-provider-iterative's People

Contributors

0x2b3bfa0, aguschin, aliabbasjaffri, casperdcl, dacbd, danieljimeneznz, davidgortega, dberenbaum, dependabot[bot], dhanushnehru, dmpetrov, elleobrien, jendefig, kaaloo, karajan1001, ludelafo, mjasion, mkhalusova, omesser, redouan-rhazouani, sjawhar, tasdomas, vaibhavwakde52

terraform-provider-iterative's Issues

Provider should admit a provisioning script

Like Packer, the provider could have a provisioner section to run shell scripts. File provisioners would also be a great addition.

Example of packer provisioner

 "provisioners" : [
        {
            "type" : "shell",
            "environment_vars": ["DEBIAN_FRONTEND=noninteractive"],
            "script" : "./setup.sh"
        },
        {
            "type": "shell",
            "inline": [
              "sudo shutdown -r now",
              "sleep 60"
            ],
            "start_retry_timeout": "10m",
            "expect_disconnect": true
        }
    ]

This might not actually be necessary, since users can build their own image with the stack preinstalled.

Benefits:

  • Ease of use. Users don't need to learn how to create their own AMIs.
  • Flexibility. An existing stack can be reused and updated with minor packages.

Cons:

  • Not sure how hard this would be to implement.

machine is failing while runner is ok

resource "iterative_machine" "machine" {
    cloud = "azure"
    region = "us-west"
    instance_type = "m"
    #spot = true
    #spot_price = 0.09
} 
Error: rpc error: code = Unavailable desc = transport is closing


panic: interface conversion: interface {} is nil, not string
2021-01-09T10:11:42.465+0100 [DEBUG] plugin.terraform-provider-iterative: 
2021-01-09T10:11:42.465+0100 [DEBUG] plugin.terraform-provider-iterative: goroutine 45 [running]:
2021-01-09T10:11:42.465+0100 [DEBUG] plugin.terraform-provider-iterative: terraform-provider-iterative/iterative/azure.ResourceMachineCreate(0x21ea120, 0xc000508000, 0xc000506000, 0x0, 0x0, 0x1, 0x0)
2021-01-09T10:11:42.465+0100 [DEBUG] plugin.terraform-provider-iterative:       /Users/davidgortega/Documents/projects/@iterative/terraform-provider-iterative/iterative/azure/provider.go:28 +0x39bf
2021-01-09T10:11:42.465+0100 [DEBUG] plugin.terraform-provider-iterative: terraform-provider-iterative/iterative.resourceMachineCreate(0x21ea120, 0xc000508000, 0xc000506000, 0x0, 0x0, 0xc0004bc400, 0x12f8d0a, 0xc00059f240)
2021-01-09T10:11:42.465+0100 [DEBUG] plugin.terraform-provider-iterative:       /Users/davidgortega/Documents/projects/@iterative/terraform-provider-iterative/iterative/resource_machine.go:157 +0x6c5
2021-01-09T10:11:42.465+0100 [DEBUG] plugin.terraform-provider-iterative: github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*Resource).create(0xc0004b00c0, 0x21ea0a0, 0xc00020da00, 0xc000506000, 0x0, 0x0, 0x0, 0x0, 0x0)
2021-01-09T10:11:42.465+0100 [DEBUG] plugin.terraform-provider-iterative:       /Users/davidgortega/Documents/projects/@iterative/terraform-provider-iterative/vendor/github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema/resource.go:285 +0x1ea
2021-01-09T10:11:42.465+0100 [DEBUG] plugin.terraform-provider-iterative: github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*Resource).Apply(0xc0004b00c0, 0x21ea0a0, 0xc00020da00, 0xc0001c8d20, 0xc00059f240, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
2021-01-09T10:11:42.465+0100 [DEBUG] plugin.terraform-provider-iterative:       /Users/davidgortega/Documents/projects/@iterative/terraform-provider-iterative/vendor/github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema/resource.go:396 +0x67b
2021-01-09T10:11:42.465+0100 [DEBUG] plugin.terraform-provider-iterative: github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*GRPCProviderServer).ApplyResourceChange(0xc00048c0c0, 0x21ea0a0, 0xc00020da00, 0xc000434550, 0xc00020da00, 0xc000436000, 0x21f9220)
2021-01-09T10:11:42.465+0100 [DEBUG] plugin.terraform-provider-iterative:       /Users/davidgortega/Documents/projects/@iterative/terraform-provider-iterative/vendor/github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema/grpc_provider.go:955 +0x8cf
2021-01-09T10:11:42.465+0100 [DEBUG] plugin.terraform-provider-iterative: github.com/hashicorp/terraform-plugin-go/tfprotov5/server.(*server).ApplyResourceChange(0xc00048cc60, 0x21ea0a0, 0xc00020da00, 0xc0001c8a80, 0xc00048cc60, 0xc000421830, 0xc000075ba0)
2021-01-09T10:11:42.465+0100 [DEBUG] plugin.terraform-provider-iterative:       /Users/davidgortega/Documents/projects/@iterative/terraform-provider-iterative/vendor/github.com/hashicorp/terraform-plugin-go/tfprotov5/server/server.go:331 +0xae
2021-01-09T10:11:42.466+0100 [DEBUG] plugin.terraform-provider-iterative: github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/tfplugin5._Provider_ApplyResourceChange_Handler(0x1f8ad60, 0xc00048cc60, 0x21ea160, 0xc000421830, 0xc0004360c0, 0x0, 0x21ea160, 0xc000421830, 0xc0003a6600, 0x1f4)
2021-01-09T10:11:42.466+0100 [DEBUG] plugin.terraform-provider-iterative:       /Users/davidgortega/Documents/projects/@iterative/terraform-provider-iterative/vendor/github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/tfplugin5/tfplugin5_grpc.pb.go:380 +0x214
2021-01-09T10:11:42.466+0100 [DEBUG] plugin.terraform-provider-iterative: google.golang.org/grpc.(*Server).processUnaryRPC(0xc0003b8c40, 0x21f35a0, 0xc000103200, 0xc000099200, 0xc000404630, 0x29dc740, 0x0, 0x0, 0x0)
2021-01-09T10:11:42.466+0100 [DEBUG] plugin.terraform-provider-iterative:       /Users/davidgortega/Documents/projects/@iterative/terraform-provider-iterative/vendor/google.golang.org/grpc/server.go:1194 +0x522
2021-01-09T10:11:42.466+0100 [DEBUG] plugin.terraform-provider-iterative: google.golang.org/grpc.(*Server).handleStream(0xc0003b8c40, 0x21f35a0, 0xc000103200, 0xc000099200, 0x0)
2021-01-09T10:11:42.466+0100 [DEBUG] plugin.terraform-provider-iterative:       /Users/davidgortega/Documents/projects/@iterative/terraform-provider-iterative/vendor/google.golang.org/grpc/server.go:1517 +0xd05
2021-01-09T10:11:42.466+0100 [DEBUG] plugin.terraform-provider-iterative: google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc00036e2d0, 0xc0003b8c40, 0x21f35a0, 0xc000103200, 0xc000099200)
2021-01-09T10:11:42.466+0100 [DEBUG] plugin.terraform-provider-iterative:       /Users/davidgortega/Documents/projects/@iterative/terraform-provider-iterative/vendor/google.golang.org/grpc/server.go:859 +0xa5
2021-01-09T10:11:42.466+0100 [DEBUG] plugin.terraform-provider-iterative: created by google.golang.org/grpc.(*Server).serveStreams.func1
2021-01-09T10:11:42.466+0100 [DEBUG] plugin.terraform-provider-iterative:       /Users/davidgortega/Documents/projects/@iterative/terraform-provider-iterative/vendor/google.golang.org/grpc/server.go:857 +0x1fd
2021/01/09 10:11:42 [DEBUG] iterative_machine.machine: apply errored, but we're indicating that via the Error pointer rather than returning it: rpc error: code = Unavailable desc = transport is closing
2021/01/09 10:11:42 [TRACE] eval: *terraform.EvalMaybeTainted
2021/01/09 10:11:42 [TRACE] EvalMaybeTainted: iterative_machine.machine encountered an error during creation, so it is now marked as tainted
2021/01/09 10:11:42 [TRACE] eval: *terraform.EvalWriteState
2021/01/09 10:11:42 [TRACE] EvalWriteState: removing state object for iterative_machine.machine
2021/01/09 10:11:42 [TRACE] eval: *terraform.EvalApplyProvisioners
2021/01/09 10:11:42 [TRACE] EvalApplyProvisioners: iterative_machine.machine has no state, so skipping provisioners
2021/01/09 10:11:42 [TRACE] eval: *terraform.EvalMaybeTainted
2021-01-09T10:11:42.468+0100 [WARN]  plugin.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = transport is closing"
2021/01/09 10:11:42 [TRACE] EvalMaybeTainted: iterative_machine.machine encountered an error during creation, so it is now marked as tainted
2021/01/09 10:11:42 [TRACE] eval: *terraform.EvalWriteState
2021/01/09 10:11:42 [TRACE] EvalWriteState: removing state object for iterative_machine.machine
2021/01/09 10:11:42 [TRACE] eval: *terraform.EvalIf
2021/01/09 10:11:42 [TRACE] eval: *terraform.EvalIf
2021/01/09 10:11:42 [TRACE] eval: *terraform.EvalWriteDiff
2021/01/09 10:11:42 [TRACE] eval: *terraform.EvalApplyPost
2021/01/09 10:11:42 [ERROR] eval: *terraform.EvalApplyPost, err: rpc error: code = Unavailable desc = transport is closing
2021/01/09 10:11:42 [ERROR] eval: *terraform.EvalSequence, err: rpc error: code = Unavailable desc = transport is closing
2021/01/09 10:11:42 [TRACE] [walkApply] Exiting eval tree: iterative_machine.machine
2021/01/09 10:11:42 [TRACE] vertex "iterative_machine.machine": visit complete
2021/01/09 10:11:42 [TRACE] dag/walk: upstream of "provider[\"github.com/iterative/iterative\"] (close)" errored, so skipping
2021/01/09 10:11:42 [TRACE] dag/walk: upstream of "meta.count-boundary (EachMode fixup)" errored, so skipping
2021/01/09 10:11:42 [TRACE] dag/walk: upstream of "root" errored, so skipping
2021/01/09 10:11:42 [TRACE] statemgr.Filesystem: not making a backup, because the new snapshot is identical to the old
2021-01-09T10:11:42.468+0100 [DEBUG] plugin: plugin process exited: path=.terraform/plugins/github.com/iterative/iterative/0.6.0/darwin_amd64/terraform-provider-iterative pid=87859 error="exit status 2"
2021/01/09 10:11:42 [TRACE] statemgr.Filesystem: no state changes since last snapshot
2021/01/09 10:11:42 [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate
2021/01/09 10:11:42 [TRACE] statemgr.Filesystem: removing lock metadata file .terraform.tfstate.lock.info
2021/01/09 10:11:42 [TRACE] statemgr.Filesystem: unlocking terraform.tfstate using fcntl flock
2021-01-09T10:11:42.480+0100 [DEBUG] plugin: plugin exited



!!!!!!!!!!!!!!!!!!!!!!!!!!! TERRAFORM CRASH !!!!!!!!!!!!!!!!!!!!!!!!!!!!

Terraform crashed! This is always indicative of a bug within Terraform.
A crash log has been placed at "crash.log" relative to your current
working directory. It would be immensely helpful if you could please
report the crash with Terraform[1] so that we can fix this.

When reporting bugs, please include your terraform version. That
information is available on the first line of crash.log. You can also
get it by running 'terraform --version' on the command line.

SECURITY WARNING: the "crash.log" file that was created may contain 
sensitive information that must be redacted before it is safe to share 
on the issue tracker.

[1]: https://github.com/hashicorp/terraform/issues

!!!!!!!!!!!!!!!!!!!!!!!!!!! TERRAFORM CRASH !!!!!!!!!!!!!!!!!!!!!!!!!!!!

SSH into machine like docker-machine

Coming from here

This one's much lower priority, but previously with docker-machine I was able to use docker-machine ssh after creation to install ec2-instance-connect.

Await runner mechanism based on logs

Implement a mechanism based on logs to await the runner. It will also have the following benefits:

  • Common method for machines with a public IP
  • Shows any runner error during the deploy job

Unify security group creation between providers

  1. AWS creates a single security group for all the machines named iterative that will be reused by all the created instances.

  2. Azure creates a new security group for every machine, rendering manual ClickOps adjustments practically impossible but allowing granular configuration.

The second approach allows finer control over network security for each runner or group of runners (#125) and discourages non-reproducible infrastructure. If we plan to offer any customization of network ingress and egress, it should be vendor-agnostic and should not require any user interaction outside our public API.

check api key is valid

The API key is currently checked in CML; the check should be done here instead.
Once that is done, the corresponding CML code has to be removed.

Improve the AWS VPC architecture

(Follow-up of iterative/cml#484)

We're using the first VPC and the first subnet to deploy EC2 instances, but this behavior can be problematic in some cases.

We can either create a new VPC as part of the provisioning process or just allow users to specify which one should be used. The first solution could be more transparent for users and allow us to keep the common interface as pure as possible.

Human readable AWS error messages.

AWS authorization failure messages are encoded. They should be decoded automatically to save users the hassle of running this manually:

aws sts decode-authorization-message --encoded-message $msg | jq ".DecodedMessage | fromjson"
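Automating this first requires pulling the encoded blob out of the error text. The sketch below assumes the error contains the phrase "Encoded authorization failure message: " followed by the blob, which is how AWS usually words these errors; both that format and the helper name `extract_encoded_message` are assumptions, not something the provider currently implements.

```shell
# Read an error message on stdin and print only the encoded blob
# that follows "Encoded authorization failure message: ".
# Prints nothing if the phrase is not found.
extract_encoded_message() {
  sed -n 's/.*Encoded authorization failure message: \([A-Za-z0-9_-]*\).*/\1/p'
}

# Usage, combined with the command above:
#   msg="$(terraform apply 2>&1 | extract_encoded_message)"
#   aws sts decode-authorization-message --encoded-message "$msg" \
#     | jq ".DecodedMessage | fromjson"
```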
