Comments (23)
Yes @statmike,
While running the notebook I made sure that the TIMESTAMP value changes every time, but I still got the error.
hpt.trial_id might be the issue, which is also my guess, but I have not found a resolution for it yet.
I also ran the 05i notebook in November 2022, and at that time I did not face any issues.
from vertex-ai-mlops.
I already tried this when there were no experiments in the Vertex AI Experiments tab.
from vertex-ai-mlops.
Hi @statmike,
I tried the same but I am using REGION = "europe-west2". I am getting the same error mentioned below.
google.api_core.exceptions.AlreadyExists: 409 Context with name projects/123456/locations/europe-west2/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203194023-1 already exists
It seems like a region-specific issue. If possible, could you please try it with REGION = "europe-west2"?
Thanks in advance!
from vertex-ai-mlops.
Yeah, I'm also facing a similar issue.
from vertex-ai-mlops.
I am facing the same issue.
from vertex-ai-mlops.
I am experiencing the same issue.
from vertex-ai-mlops.
Try using this code from the 99_cleanup file:
# Delete every run in every experiment, then the experiment itself
exps = aiplatform.Experiment.list()
for exp in exps:
    runs = aiplatform.ExperimentRun.list(experiment = exp.name)
    print(f'Experiment: {exp.name}')
    for run in runs:
        print(f'Run: {run.name}')
        run.delete(delete_backing_tensorboard_run = False)
    exp.delete(delete_backing_tensorboard_runs = False)
Can you please try this before running the hyperparameter script?
from vertex-ai-mlops.
Is this for notebook 05i? A quick thought here: if you are running the same notebook multiple times, it can be important to rerun it starting from the very top. There is a line RUN_NAME = f'run-{TIMESTAMP}' that will use a new TIMESTAMP each time to make sure the run names end up being unique.
from vertex-ai-mlops.
Yes, it's 05i. I thought the same initially, so I made the timestamp dynamic as well, but it still gives the same issue.
I then separately ran an HPT job without logging any run to Vertex AI Experiments, and it ran successfully.
from vertex-ai-mlops.
Thank you. I will try to help you through the chat here. Can you let me know which part of the notebook ends in the error - which cell? From that I will have a few more steps to request so I can understand what the error is here.
from vertex-ai-mlops.
If you look at the 3rd line, the run name is already dynamic because the trial ID is appended at the end.
Now I am getting the error below on the 4th line:
google.api_core.exceptions.AlreadyExists: 409 Context with name projects/1234/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230202120231-1 already exists
I won't be able to share a screenshot of the logs, as I am using a client's environment.
Even after making the TIMESTAMP dynamic, it gives the same error.
from vertex-ai-mlops.
Hello @Jay2201,
I see that you are referencing lines from the script in ./code/hp_train.py.
This script is copied to GCS with a new name and then into the container used for training. The training job is created by the cell that looks like:
customJob = aiplatform.CustomJob(
    display_name = f'{SERIES}_{EXPERIMENT}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/models/{TIMESTAMP}",
    staging_bucket = f"{URI}/models/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)
This references the object WORKER_POOL_SPEC, which is defined in the notebook cell with this code:
WORKER_POOL_SPEC = [
    {
        "replica_count": 1,
        "machine_spec": MACHINE_SPEC,
        "container_spec": {
            "image_uri": f"{REPOSITORY}/{EXPERIMENT}_trainer",
            "command": [],
            "args": CMDARGS
        }
    }
]
Part of this definition is the additional object CMDARGS, which is defined in the notebook with:
CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--batch_size=" + str(BATCH_SIZE),
    "--var_target=" + VAR_TARGET,
    "--var_omit=" + VAR_OMIT,
    "--project_id=" + PROJECT_ID,
    "--bq_project=" + BQ_PROJECT,
    "--bq_dataset=" + BQ_DATASET,
    "--bq_table=" + BQ_TABLE,
    "--region=" + REGION,
    "--experiment=" + EXPERIMENT,
    "--series=" + SERIES,
    "--experiment_name=" + EXPERIMENT_NAME,
    "--run_name=" + RUN_NAME
]
This is where the values of experiment_name and run_name get passed in. They are defined at the top of the notebook in the cell that looks like:
FRAMEWORK = 'tf'
TASK = 'classification'
MODEL_TYPE = 'dnn'
EXPERIMENT_NAME = f'experiment-{SERIES}-{EXPERIMENT}-{FRAMEWORK}-{TASK}-{MODEL_TYPE}'
RUN_NAME = f'run-{TIMESTAMP}'
The unique part of this will be RUN_NAME, because it includes the value TIMESTAMP, which is also defined near the top of the notebook in a cell that looks like:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
BUCKET = PROJECT_ID
URI = f"gs://{BUCKET}/{SERIES}/{EXPERIMENT}"
DIR = f"temp/{EXPERIMENT}"
It looks like the value of TIMESTAMP the notebook is using on your run may have already been used before. Is this possible?
The only other possibility I can think of is that multiple values of hpt.trial_id are the same, but I have not run into that before.
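As a rough illustration of the naming described above, here is a minimal sketch of how a per-trial run name of the form run-<TIMESTAMP>-<trial id> could be assembled and checked for collisions. The helper names make_run_names and find_duplicates are hypothetical and are not part of the notebook or of hp_train.py; the trial ids are passed in as plain strings rather than read from hpt.trial_id.

```python
from datetime import datetime
from typing import Iterable, List


def make_run_names(base_run_name: str, trial_ids: Iterable[str]) -> List[str]:
    """Append each trial id to the base run name, e.g. 'run-<TIMESTAMP>-1'."""
    return [f"{base_run_name}-{trial_id}" for trial_id in trial_ids]


def find_duplicates(names: Iterable[str]) -> List[str]:
    """Return the names that occur more than once, sorted."""
    seen, dupes = set(), set()
    for name in names:
        if name in seen:
            dupes.add(name)
        seen.add(name)
    return sorted(dupes)


# Same TIMESTAMP format as the notebook cell quoted in this comment.
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
RUN_NAME = f"run-{TIMESTAMP}"

# Distinct trial ids give distinct run names.
print(find_duplicates(make_run_names(RUN_NAME, ["1", "2", "3"])))
# A repeated trial id (or a reused TIMESTAMP) produces a collision.
print(find_duplicates(make_run_names(RUN_NAME, ["1", "1", "2"])))
```

A collision can therefore come from either side: a TIMESTAMP reused across notebook runs, or a trial id that repeats within one job.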
Thank You
from vertex-ai-mlops.
Yes @statmike,
While running the notebook I made sure that the TIMESTAMP value changes every time, but I still got the error.
hpt.trial_id might be the issue, which is also my guess, but I have not found a resolution for it yet.
I also ran the 05i notebook in November 2022, and at that time I did not face any issues.
Yeah, this issue looks similar to mine.
from vertex-ai-mlops.
Yes @statmike,
While running the notebook I made sure that the TIMESTAMP value changes every time, but I still got the error.
hpt.trial_id might be the issue, which is also my guess, but I have not found a resolution for it yet.
I also ran the 05i notebook in November 2022, and at that time I did not face any issues.
It's the same for me too.
My issue is also the same.
from vertex-ai-mlops.
Hello @Jay2201 ,
I just did a test run of the notebook in an environment where it was run previously and did not encounter any errors. I am going to cover the diagnostics I did here in case you want to replicate the steps for troubleshooting in your environment.
On the Vertex AI Console Page for Training, HyperParameter Tuning Jobs tab, select the current job related to the notebook. This gives a list of all the tuning trials and includes links to the logs for each:
I went to the logs for each of these tuning trials and looked for the result of the line that creates the experiment run:
expRun = aiplatform.ExperimentRun.create(run_name = args.run_name, experiment = args.experiment_name)
Here are the values I found in the logs for the first 6 trials:
- Associating projects/1026793852137/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203134418-1 to Experiment: experiment-05-05i-tf-classification-dnn
- Associating projects/1026793852137/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203134418-2 to Experiment: experiment-05-05i-tf-classification-dnn
- Associating projects/1026793852137/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203134418-3 to Experiment: experiment-05-05i-tf-classification-dnn
- Associating projects/1026793852137/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203134418-4 to Experiment: experiment-05-05i-tf-classification-dnn
- Associating projects/1026793852137/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203134418-5 to Experiment: experiment-05-05i-tf-classification-dnn
- Associating projects/1026793852137/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203134418-6 to Experiment: experiment-05-05i-tf-classification-dnn
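When comparing many trial logs, it may help to pull the experiment, timestamp, and trial id out of each context name and count repeats. The following is only a sketch: the regular expression assumes the context names follow the experiment-...-run-<14-digit timestamp>-<trial id> pattern seen in this thread, and parse_context and repeated_contexts are hypothetical helper names.

```python
import re
from collections import Counter
from typing import Iterable, List, Optional, Tuple

# Assumed pattern, based on the context names quoted in this thread.
CONTEXT_RE = re.compile(
    r"contexts/(?P<experiment>[^/\s]+)-run-(?P<timestamp>\d{14})-(?P<trial>\d+)"
)


def parse_context(line: str) -> Optional[Tuple[str, str, int]]:
    """Return (experiment, timestamp, trial id) from a log or error line, or None."""
    match = CONTEXT_RE.search(line)
    if match is None:
        return None
    return match.group("experiment"), match.group("timestamp"), int(match.group("trial"))


def repeated_contexts(lines: Iterable[str]) -> List[Tuple[str, str, int]]:
    """List the parsed contexts that appear on more than one line."""
    parsed = [p for p in (parse_context(line) for line in lines) if p is not None]
    return sorted(key for key, count in Counter(parsed).items() if count > 1)


log_line = (
    "Associating projects/1026793852137/locations/us-central1/metadataStores/default/"
    "contexts/experiment-05-05i-tf-classification-dnn-run-20230203134418-1 "
    "to Experiment: experiment-05-05i-tf-classification-dnn"
)
print(parse_context(log_line))
```

If the same (experiment, timestamp, trial id) triple shows up in the logs of more than one trial, that points to the run being created twice rather than to a stale TIMESTAMP.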
Are you making any other changes to the tutorial notebook that might need to be investigated for causing this issue?
Thank You
from vertex-ai-mlops.
Let me run the notebook again and see whether I get the same error.
As far as I remember, I am only changing the region; I am running the rest of the code as is.
from vertex-ai-mlops.
Hi @statmike,
I tried the same but I am using REGION = "europe-west2". I am getting the same error mentioned below.
google.api_core.exceptions.AlreadyExists: 409 Context with name projects/123456/locations/europe-west2/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203194023-1 already exists
It seems like a region-specific issue. If possible, could you please try it with REGION = "europe-west2"?
Thanks in advance!
@statmike - Getting the same error with the above region ....
from vertex-ai-mlops.
Hello @Jay2201, when you run this job, what are you using for parallel_trial_count? The example uses 3. If you go to the logs for each of the parallel jobs, do they all have this error, or does one of them succeed in Associating projects/... to Experiment: ...
from vertex-ai-mlops.
I have tried both 3 and 2 for the parallel trial count. I get the same error for all of them.
from vertex-ai-mlops.
Hello @Jay2201 ,
If all of the initial set of trials specified by parallel_trial_count are giving the same error, then it seems to indicate that the runs are being created before the job. I have some ideas for diagnostics here.
Initialize Parameters and Clients:
PROJECT_ID = '<your project here>'
REGION = 'europe-west2'
EXPERIMENT_NAME = 'experiment-05-05i-tf-classification-dnn'
TIMESTAMP = 20230203194023
from google.cloud import aiplatform
aiplatform.init(project=PROJECT_ID, location=REGION)
Return the known runs for the experiment:
exp = aiplatform.Experiment(experiment_name = EXPERIMENT_NAME)
exp_runs = exp.get_data_frame()
exp_runs
If needed, subset to the runs for the specific TIMESTAMP value:
exp_runs[exp_runs['run_name'].str.contains(f'run-{TIMESTAMP}')]
Let me know how the results of these checks for the experiment and logged runs work out.
Thank You
from vertex-ai-mlops.
Hi @Jay2201 ,
Have you had any luck troubleshooting the run name already existing?
I gave this some thought over the last week. In cases where a run name may already exist, it could be desirable to add to the experiment run or overwrite information based on updated data. I made a small alteration to accommodate this, which might also help in your situation. If the run name is already defined, it will attach to it rather than try to create a new run with the same name. Inside the scripts that each of the 05a-05i notebooks call, the following change has been made:
Before:
# Vertex AI Experiment
expRun = aiplatform.ExperimentRun.create(run_name = args.run_name, experiment = args.experiment_name)
After:
# Vertex AI Experiment
if args.run_name in [run.name for run in aiplatform.ExperimentRun.list(experiment = args.experiment_name)]:
    expRun = aiplatform.ExperimentRun(run_name = args.run_name, experiment = args.experiment_name)
else:
    expRun = aiplatform.ExperimentRun.create(run_name = args.run_name, experiment = args.experiment_name)
from vertex-ai-mlops.
Thanks @statmike, I will definitely check this and update you. I am busy with another task at the moment, so I have not had time yet.
from vertex-ai-mlops.
Hey @statmike, sorry for the late reply. I tested it and it runs fine with your notebook, but since I am using multiple GPUs, I have multiple runs that are not updating in Vertex AI Experiments. Thanks for the solution 🙂
from vertex-ai-mlops.