Comments (6)
Hi, @harish-kamath , very nice that you like this library! I have a few questions:
1/ When you say `InternalServerError: We encountered an internal error. Please try again`, in which log file exactly do you see this message? Is it in CloudWatch?
2/ Do you know which process generates this message, e.g., is there any prefix in front of this line?
3/ Which SageMaker component are you connecting to, e.g., SageMaker Training, Studio, or Inference?
from sagemaker-ssh-helper.
Upon further digging - I'm not sure if it is actually the credentials (only).
I noticed that the SSM agent will first pull AWS credentials from the environment variables, so I tried including an explicit AWS access key and secret key in my training job. The logs still show that the credentials are being refreshed, but the machine no longer crashes ~1 minute after the refresh. However, it now crashes at other times (even if there is nothing running, so there is no chance that it's a resource issue).
And here is the last CloudWatch log:
(Note that in this case I did not run `sm-wait stop`, but it doesn't actually matter for this error. It occurs even if I do that.)
- There's no prefix or process, unfortunately. Since it just crashes the machine and there's no persistent storage, I'm not sure how I can debug after a crash either.
- SageMaker Training Jobs
On the bright side, it no longer always crashes after 30 minutes of being connected. However, it is still crashing within an hour.
Never mind, just got another crash in <30 minutes.
I'm pretty sure it is still this package, because connecting over plain SSH is still fine and never causes a crash.
Hi, @harish-kamath , apologies for the delay. This indeed seems very strange to me. I will have to investigate it further, since it never crashed like that in the short-running tests. In the meantime, could you raise a support case from the AWS Console? Please add the link to this issue and mention my name:
https://docs.aws.amazon.com/awssupport/latest/user/case-management.html
@harish-kamath I am using the following manual test to try to reproduce the issue. Without a connection from VS Code, the job stops successfully after 3 hours without any "Internal Server Error". I have a few more requests and questions:
1/ What instance types did you try?
2/ Could you please run the test on `ml.g4dn.xlarge`:
sagemaker-ssh-helper/tests/test_manual.py
Lines 23 to 50 in 0d70d02
Status | Start time | End time | Description |
---|---|---|---|
Starting | 5/11/2024, 10:41:50 AM | 5/11/2024, 10:42:32 AM | Preparing the instances for training |
Downloading | 5/11/2024, 10:42:32 AM | 5/11/2024, 10:46:08 AM | Downloading the training image |
Training | 5/11/2024, 10:46:08 AM | 5/11/2024, 1:45:42 PM | Training image download completed. Training in progress. |
Stopping | 5/11/2024, 1:45:42 PM | 5/11/2024, 1:45:43 PM | Stopping the training job |
Uploading | 5/11/2024, 1:45:43 PM | 5/11/2024, 1:45:55 PM | Uploading generated training model |
MaxRuntimeExceeded | 5/11/2024, 1:45:55 PM | 5/11/2024, 1:45:55 PM | Resource released due to keep alive period expiry |
3/ Don't connect with VS Code at first; wait until the credentials have refreshed automatically once or twice. The job should not crash. Then try to connect with VS Code.
So far, it seems to me that the issue has nothing to do with the credential refresh, because the credentials are refreshed automatically all the time, and this is expected.
4/ How likely is it that VS Code runs some heavy process inside the container and the instance runs out of RAM?
Could you please check the utilization on the job page in AWS Console? The successful run looks like this:
Note the exact time at which the credentials were refreshed and the time at which you connected with VS Code.
I hope the above steps will help you narrow the issue down to some process that VS Code starts inside the container.
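To make the timing comparison easier, here is a minimal Python sketch that lines up the credential refreshes with the last log line before the crash. It assumes the job logs are in the standard `/aws/sagemaker/TrainingJobs` CloudWatch log group and that the refresh lines contain some recognizable text. The names `build_timeline`, `fetch_events`, and the `refresh_marker` parameter are hypothetical and not part of sagemaker-ssh-helper; the exact wording of the refresh message has to be taken from your own logs.

```python
from datetime import datetime, timezone


def to_utc(ms):
    """Convert a CloudWatch timestamp (milliseconds since epoch) to a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def build_timeline(events, refresh_marker, connect_time):
    """Summarize when credentials were refreshed relative to the last log line.

    events: list of dicts with 'timestamp' (ms) and 'message' keys, which is
            the shape of the events returned by CloudWatch Logs.
    refresh_marker: text that identifies a credential refresh line
            (an assumption; copy it from your own logs).
    connect_time: timezone-aware datetime at which you connected with VS Code.
    """
    events = sorted(events, key=lambda e: e["timestamp"])
    refreshes = [
        to_utc(e["timestamp"])
        for e in events
        if refresh_marker.lower() in e["message"].lower()
    ]
    last_event = to_utc(events[-1]["timestamp"]) if events else None
    return {
        "refreshes": refreshes,
        "last_log_event": last_event,
        # Time between connecting and the last line logged before the crash.
        "connected_to_last_log": (last_event - connect_time) if last_event else None,
    }


def fetch_events(job_name):
    """Download all log events for one training job (requires boto3 and AWS access)."""
    import boto3  # imported here so build_timeline can be used without boto3

    paginator = boto3.client("logs").get_paginator("filter_log_events")
    events = []
    for page in paginator.paginate(
        logGroupName="/aws/sagemaker/TrainingJobs",
        logStreamNamePrefix=job_name,
    ):
        events.extend(page["events"])
    return events
```

For example, `build_timeline(fetch_events("my-job-name"), "credentials", connect_time)` would list every matching refresh and the gap between your connection and the last log line. Here `"my-job-name"` and the marker text are placeholders.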