Describe the bug [Update - 2024-06-06] We've identified Te

Severe Performance Issues: O(n^2) Redundancy in Terragrunt Locals and Root Includes about terragrunt HOT 15 CLOSED

swordfish444 commented on September 26, 2024 6

Severe Performance Issues: O(n^2) Redundancy in Terragrunt Locals and Root Includes

from terragrunt.

Comments (15)

swordfish444 commented on September 26, 2024 1

@caspar-ds

Is this linked to #2980?

There is definitely overlap. One of the unique aspects surfaced in this issue is how inefficient Terragrunt's AWS Role-based authentication is:

What we did find so far is that the initialization time took three times as long when using AWS SSO with an AWS role for AWS CLI authentication as opposed to just using direct AWS ENV credentials. The time increased from 50 seconds to 150 seconds.

swordfish444 commented on September 26, 2024 1

Made an update to the description, surfacing the extreme degradation in performance when using AWS role-based authentication when compared to default authn.

yhakbar commented on September 26, 2024

Thanks for the bug report, @swordfish444!

Do you mind sharing how you're calling Terragrunt right now? Is my assumption right that you're calling run-all commands from your root terragrunt.hcl file when you see these issues?

If so, do you mind sharing if you're leveraging the opt-in --terragrunt-use-partial-parse-config-cache flag?

It was specifically introduced to address performance issues like yours, so it may help! Note that it's still experimental, so use it with caution. What it does is cache parsed Terragrunt configurations between includes, so it should dramatically improve performance for you.

Also, do you mind sharing how you've investigated your performance issues? Are you leveraging the OpenTelemetry Integration to get insight into your performance issues?

denis256 commented on September 26, 2024

Hi,
These are a lot of good things to look into.

Optimize Locals Blocks: Review and optimize locals block evaluations to reduce redundancy. Consider consolidating or reusing evaluated values instead of recalculating them.

In some cases it is necessary to re-evaluate included locals each time, for example when common code reads a configuration file in each child project:

.
├── app
│   ├── app1
│   │   ├── api.yaml
│   │   ├── main.tf
│   │   └── terragrunt.hcl
│   └── app2
│       ├── api.yaml
│       ├── main.tf
│       └── terragrunt.hcl
├── README.md
└── terragrunt.hcl   <----- common code to read api.yaml, should be evaluated in each appX folder to load respective api.yaml


# terragrunt.hcl

locals {
  config = yamldecode(file("${get_terragrunt_dir()}/api.yaml"))
}

inputs = {
  api_key = local.config.api_key
}

https://github.com/denis256/terragrunt-tests/tree/master/locals-evaluation

swordfish444 commented on September 26, 2024

@yhakbar

Do you mind sharing how you're calling Terragrunt right now? Is my assumption right that you're calling run-all commands from your root terragrunt.hcl file when you see these issues?

Here is the command I'm running:

terragrunt apply

Some additional details:

Storing state in S3
All of our modules depend on AWS provider

I was not using --terragrunt-use-partial-parse-config-cache, but I tested it just now and it didn't help 😢.

What we did find so far is that the initialization time took three times as long when using AWS SSO with an AWS role for AWS CLI authentication as opposed to just using direct AWS ENV credentials. The time increased from 50 seconds to 150 seconds.

Even 50 seconds is excessive, but at least it's bearable. It seems like there are some good optimizations that can be done with the AWS role-based authentication in Terragrunt code.

Also, do you mind sharing how you've investigated your performance issues? Are you leveraging the OpenTelemetry Integration to get insight into your performance issues?

This is a great recommendation. No, we have not used this. I will look into this and report back with any new insights.

caspar-ds commented on September 26, 2024

Is this linked to #2980?

jtackaberry commented on September 26, 2024

We use the same type of project structure, and this tracks with our experience. (Except we're using federated identity and tooling that acquires an STS token and stores it in $AWS_SHARED_CREDENTIALS_FILE, so the comment about AWS SSO doesn't apply.)

In nontrivial deployments, running terragrunt plan against modules with a few dependencies can easily take minutes, especially when there's a bit of latency to the AWS APIs (e.g. if connecting over VPN). It definitely degrades more severely the more dependencies are involved (whether direct or indirect), as mentioned above.

Contrary to #2980, performance with Terragrunt has always been poor in our projects, and we've been using it since 0.20. But I'm sure there are multiple factors at play that together conspire to prevent Terragrunt's performance from being all that it might be.

When debug is enabled, it certainly seems as though Terragrunt is evaluating the same locals over and over. It's hard to tell from the logs how much of the actual execution time is spent on locals evaluation relative to other things like AWS API calls, but a number of things are re-evaluated hundreds of times.

$ terragrunt plan --terragrunt-log-level debug --terragrunt-debug 2>&1 | tee debug.log
$ grep Evaluated debug.log  | wc -l
5551
$ grep Evaluated debug.log | sed 's/^.*): //' | sort | uniq -c | sort -n | tail -10
    180 awspca_arns, internal_roots, aws_account_id, tags prefix=[/home/jtackaberry/projects/someservice]
    241 zones, tags prefix=[/home/jtackaberry/projects/someservice/lab]
    304 aws_account_id, tags, awspca_arns, internal_roots prefix=[/home/jtackaberry/projects/someservice]
    339 level_paths prefix=[/home/jtackaberry/projects/someservice/lab/bootstrap/tfstate]
    339 levels prefix=[/home/jtackaberry/projects/someservice/lab/bootstrap/tfstate]
    339 tags prefix=[/home/jtackaberry/projects/someservice/lab/bootstrap/tfstate]
    356 internal_roots, aws_account_id, tags, awspca_arns prefix=[/home/jtackaberry/projects/someservice]
    397 levels, merge_keys prefix=[/home/jtackaberry/projects/someservice]
    776 tags, zones prefix=[/home/jtackaberry/projects/someservice/lab]
   1017 eks prefix=[/home/jtackaberry/projects/someservice]
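
For anyone who wants to run the same tally without the shell pipeline, here is a minimal Python sketch of the equivalent counting. It assumes, as the sed expression above does, that each relevant log line contains the word Evaluated and that the message follows the last "): " on the line. The function name and the sample lines are made up for illustration, and real Terragrunt output may be formatted differently.

```python
import re
from collections import Counter


def tally_evaluations(lines):
    """Count how often each evaluated-locals message appears in a debug log.

    Mirrors: grep Evaluated | sed 's/^.*): //' | sort | uniq -c
    """
    counts = Counter()
    for line in lines:
        if "Evaluated" not in line:
            continue
        # Drop everything up to and including the last "): " (greedy match),
        # which is what the sed expression in the pipeline does.
        message = re.sub(r"^.*\): ", "", line.rstrip("\n"))
        counts[message] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log lines, invented for this example.
    sample = [
        "DEBU Evaluated 1 locals (remaining 0): eks prefix=[/proj]",
        "DEBU Evaluated 2 locals (remaining 0): tags, zones prefix=[/proj/lab]",
        "DEBU Evaluated 1 locals (remaining 0): eks prefix=[/proj]",
    ]
    # Print the most frequently repeated evaluations first.
    for message, count in tally_evaluations(sample).most_common(10):
        print(count, message)
```

To run it against a real log, pass an open file object (for example `open("debug.log")`) instead of the sample list.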

I always felt like there must be some low hanging fruit here for memoizing these evaluations.

yhakbar commented on September 26, 2024

Thanks for all of your feedback, everyone. We appreciate that you are all so invested in improving the performance of Terragrunt and we are paying close attention to the feedback you have provided.

Special thanks to @swordfish444 for opening this issue, and doing some hard work diagnosing the root cause of your performance issues, and providing us with clues that we were able to use to look more closely at the source code and attempt to address inefficiencies in how certain operations are handled. Specifically pointing out we had an opportunity to speed up operations related to role assumptions and state management gave us insight that the optimizations in v0.59.1 might help.

I want to be clear that those optimizations reduce unnecessary overhead in what Terragrunt does to initialize state and assume roles. However, it will always be faster to assume the role outside of Terragrunt than to have Terragrunt do it, as doing no work is always faster than doing work. If you are struggling with performance issues related to role assumptions taking place in Terragrunt, and you have the option to assume the role outside of Terragrunt, doing so will always make Terragrunt faster.

With respect to memoizing the rendering of parsed locals, please take note of @denis256's comment above. Terragrunt cannot fully reuse a previously parsed HCL include, because of the nature of HCL files. They allow for dynamic content, and reproducing exactly the same output for an HCL parse from cache will likely result in bugs for users. Note that there are optimizations that we can perform for this, and do, but each read of an HCL file ultimately requires doing some additional work to ensure that the values from the file are up to date.

If any other specific recommendations for performance improvements come to mind, or particular circumstances where Terragrunt is slower than expected, please share them here.

I will be closing out this issue in about a week if nothing actionable comes up to avoid having this issue go stale.

Afterwards, please look to @swordfish444's great example of how to present performance issues in a way that is actionable and helpful for the maintainers to address. I plan on improving the Terragrunt documentation on bug reporting to use this approach as a template.

jtackaberry commented on September 26, 2024

With respect to memoizing the rendering of parsed locals, please take note of @denis256's comment in #3153 (comment). Terragrunt cannot fully reuse a previously parsed HCL include, because of the nature of HCL files. They allow for dynamic content, and reproducing exactly the same output for an HCL parse from cache will likely result in bugs for users.

I can definitely understand that locals would need to be reevaluated based on each relative starting point. So in @denis256's example, if app2 depended on app1, then when running terragrunt from app2, the root-level terragrunt.hcl would be evaluated from both app2's perspective and app1's perspective. Makes sense.
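
That constraint can be sketched as a cache keyed on both the included file and the directory it is evaluated from. The following Python is a toy model, not Terragrunt's implementation: `evaluate_locals`, the dictionary standing in for the files on disk, and the paths are all hypothetical, and it assumes the locals depend on nothing but those two inputs.

```python
from functools import lru_cache

# Hypothetical stand-in for the api.yaml files in the example layout.
API_YAML = {
    "/repo/app/app1": {"api_key": "key-for-app1"},
    "/repo/app/app2": {"api_key": "key-for-app2"},
}

EVALUATIONS = []  # records every real (uncached) evaluation


@lru_cache(maxsize=None)
def evaluate_locals(config_path: str, terragrunt_dir: str) -> str:
    """Evaluate the root config's locals from one child's perspective.

    The cache key is (config_path, terragrunt_dir), so the same root file is
    evaluated once per child directory rather than once per include.
    """
    EVALUATIONS.append((config_path, terragrunt_dir))
    return API_YAML[terragrunt_dir]["api_key"]


if __name__ == "__main__":
    root = "/repo/terragrunt.hcl"
    # app2 depends on app1, so the root is needed from both perspectives,
    # and app1's perspective is requested a second time.
    for child in ["/repo/app/app2", "/repo/app/app1", "/repo/app/app1"]:
        evaluate_locals(root, child)
    print(len(EVALUATIONS), "real evaluations for 3 requests")
```

Keying on the directory keeps app1 and app2 from sharing a result, which is the correctness concern raised above. It does nothing for locals whose value depends on anything other than those two inputs.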

Are there other factors that confound locals memoization?

yhakbar commented on September 26, 2024

Are there other factors that confound locals memoization?

The result of parsing the secondary HCL file already varies with the working directory of the terragrunt.hcl file that initiates the parse. In addition to that, there can be variance caused by things like timing and network activity.

For example:

locals {
  current_time = run_cmd("date")
}

This is a valid local that would have to be recomputed on every parse.

We can cache the AST body of the file (that it has a locals block with a current_time local, which calls run_cmd on date), so we can skip the filesystem read penalty on the next parse, but we have to actually run the function on every parse to make sure the result is accurate.

The same would apply if the command being run involved network connectivity, and the same request could result in a different response.
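
The distinction between caching the parsed structure and re-running its functions can be illustrated with a small Python model. Everything here is hypothetical (`parse_file`, `evaluate`, `fake_date`, the fake file contents) and is only meant to show the shape of the trade-off, not how Terragrunt is implemented: the parse step is memoized, while the function call inside the local runs on every evaluation.

```python
import itertools
from functools import lru_cache

READS = []                  # records each simulated file read
_clock = itertools.count()  # stand-in for an impure call such as run_cmd("date")


def fake_date() -> int:
    """Return a different value on every call, like a clock would."""
    return next(_clock)


@lru_cache(maxsize=None)
def parse_file(path: str):
    """Simulate read + parse. Cached, because the file's structure is stable."""
    READS.append(path)
    # Toy "AST": pairs of (local name, function that computes its value).
    return (("current_time", fake_date),)


def evaluate(path: str) -> dict:
    """Evaluate the locals. Not cached, so impure functions run again."""
    return {name: fn() for name, fn in parse_file(path)}


if __name__ == "__main__":
    first = evaluate("/repo/terragrunt.hcl")
    second = evaluate("/repo/terragrunt.hcl")
    print("file reads:", len(READS))
    print("values differ:", first != second)
```

Caching the return value of `evaluate` as well would save the second call to `fake_date`, at the cost of returning a stale value. That is the correctness trade-off described above.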

jtackaberry commented on September 26, 2024

We can cache the AST body of the file (that it has a locals block with a current_time local, which calls run_cmd on date), so we can skip the filesystem read penalty on the next parse, but we have to actually run the function on every parse to make sure the result is accurate.

This makes sense, thanks @yhakbar.

Is this something --terragrunt-use-partial-parse-config-cache would cache, at the potential expense of correctness?

This is the type of thing I'm fairly sure our projects could safely cache. The network calls are also extremely costly, especially over high-latency connections, and I see no reason we couldn't safely cache those too. Looking through our project code, AFAICT if everything was evaluated once and cached, the result would still be sane. And if the network calls were cached, my guess is we would see an order-of-magnitude reduction in execution time, a benefit I would be willing to pay for in the form of less dynamic magic.

yhakbar commented on September 26, 2024

@jtackaberry

Rather than respond directly, I've updated the documentation to hopefully make it clearer to everyone what the --terragrunt-use-partial-parse-config-cache flag caches and what the trade-off is for using it. Please take a read and let me know if I can further clarify that piece of documentation.

I would need more details to be able to provide more useful feedback on the issues that you're experiencing with respect to the costly network calls impacting performance. Things like how the calls are being made (via a dependency block, a run_cmd call, a read_terragrunt_config call, etc) and why the call is being repeated even though it can be safely re-used would be useful info.

yhakbar commented on September 26, 2024

As discussed earlier, I'm closing out this issue. Please open up a new one if you would like to provide insight as to how the performance of Terragrunt can be improved going forward!

apshoemaker commented on September 26, 2024

This really needs to be reopened. The newer versions of Terragrunt are painfully slow compared to anything in the 0.50.x range and older. I would love to use your later feature set, but I can't take the operational hit of several minutes to apply relatively simple infrastructure mutations. Any thoughts? Do you prefer a new issue to track this to completion?

yhakbar commented on September 26, 2024

Hey @apshoemaker, what is the latest version of Terragrunt that you've tried?

Have you tried v0.66.3? Some great work was done in that release to improve performance significantly for many users.
