Comments (23)

Jay2201 commented on May 30, 2024

Yes @statmike,

While running the notebook I made sure that the TIMESTAMP value changes every time, but I still got the error.

I am also guessing that hpt.trial_id might be the issue, but I have not found a resolution yet.

I also ran the 05i notebook in November 2022, and at that time I did not face any issues.

from vertex-ai-mlops.

Jay2201 commented on May 30, 2024

I already tried this when there were no experiments in the Vertex AI Experiments tab.

sakshi74 commented on May 30, 2024

Hi @statmike,

I tried the same but I am using REGION = "europe-west2". I am getting the same error mentioned below.

google.api_core.exceptions.AlreadyExists: 409 Context with name projects/123456/locations/europe-west2/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203194023-1 already exists

It seems like a region-specific issue. If possible, could you please try it with REGION = "europe-west2"?

Thanks in advance!

savi-bhide commented on May 30, 2024

Yeah, I'm also facing a similar issue.

sakshi74 commented on May 30, 2024

I am facing the same issue.

SARANYA-J commented on May 30, 2024

I am experiencing the same issue.

rsher60 commented on May 30, 2024

I am using this code from the 99_cleanup file:

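from google.cloud import aiplatform

# WARNING: deletes every experiment (and all of its runs) in the project and
# location that were set with aiplatform.init(...).
exps = aiplatform.Experiment.list()
for exp in exps:
    runs = aiplatform.ExperimentRun.list(experiment = exp.name)
    print(f'Experiment: {exp.name}')
    for run in runs:
        print(f'Run: {run.name}')
        # Remove the run but keep its backing TensorBoard run
        run.delete(delete_backing_tensorboard_run = False)
    # Remove the now-empty experiment, again keeping TensorBoard data
    exp.delete(delete_backing_tensorboard_runs = False)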

Can you please try this before running the hyperparameter script?

statmike commented on May 30, 2024

Is this for notebook 05i? A quick thought here. If you are running the same notebook multiple times, then it can be important to rerun it starting from the very top. There is a line RUN_NAME = f'run-{TIMESTAMP}' that will use a new TIMESTAMP each time to make sure the run names end up being unique.
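
To illustrate why rerunning from the top matters, here is a minimal sketch. It assumes TIMESTAMP is a second-resolution string in the form seen in the run names in this thread (for example 20230203134418); the helper name make_run_name is hypothetical and not part of the notebook.

```python
from datetime import datetime, timedelta


def make_run_name(now: datetime) -> str:
    """Build 'run-<timestamp>' with a second-resolution timestamp."""
    timestamp = now.strftime("%Y%m%d%H%M%S")
    return f"run-{timestamp}"


first_execution = datetime(2023, 2, 3, 13, 44, 18)

# Re-running only the later cells keeps the old TIMESTAMP, so the name repeats.
stale_name = make_run_name(first_execution)
print(stale_name)  # run-20230203134418

# Re-running from the top picks up a new TIMESTAMP, so the name changes.
fresh_name = make_run_name(first_execution + timedelta(minutes=5))
print(fresh_name != stale_name)  # True
```

If the value of RUN_NAME printed in the notebook matches a run that already exists in the experiment, the cell that defines TIMESTAMP was probably not re-executed.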

Jay2201 commented on May 30, 2024

Yes, it's 05i. I thought the same initially, so I made the timestamp dynamic as well, but it still gives the same error.

I then separately ran an HPT job without logging any run to Vertex AI Experiments, and it ran successfully.

statmike commented on May 30, 2024

Thank you. I will try to help you through the chat here. Can you let me know which part of the notebook ends in the error - which cell? From that I will have a few more steps to request so I can understand what the error is here.

Jay2201 commented on May 30, 2024

[image]

If you look at the 3rd line, the run name is already dynamic because the trial ID is appended at the end.

Now I am getting the error below on the 4th line:

google.api_core.exceptions.AlreadyExists: 409 Context with name projects/1234/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230202120231-1 already exists.

I won't be able to share a screenshot of the logs, as I am using a client's environment.

Even with a dynamic TIMESTAMP it gives the same error.

statmike commented on May 30, 2024

Hello @Jay2201,
I see that you are referencing a line from the script in ./code/hp_train.py.

This script is copied to GCS with a new name and then into the container used for training. The training job is created by the cell that looks like:

customJob = aiplatform.CustomJob(
    display_name = f'{SERIES}_{EXPERIMENT}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/models/{TIMESTAMP}",
    staging_bucket = f"{URI}/models/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)

This references the object WORKER_POOL_SPEC, which is defined in the notebook cell with this code:

WORKER_POOL_SPEC = [
    {
        "replica_count": 1,
        "machine_spec": MACHINE_SPEC,
        "container_spec": {
            "image_uri": f"{REPOSITORY}/{EXPERIMENT}_trainer",
            "command": [],
            "args": CMDARGS
        }
    }
]

Part of this definition is the additional object CMDARGS, which is defined in the notebook with:

CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--batch_size=" + str(BATCH_SIZE),
    "--var_target=" + VAR_TARGET,
    "--var_omit=" + VAR_OMIT,
    "--project_id=" + PROJECT_ID,
    "--bq_project=" + BQ_PROJECT,
    "--bq_dataset=" + BQ_DATASET,
    "--bq_table=" + BQ_TABLE,
    "--region=" + REGION,
    "--experiment=" + EXPERIMENT,
    "--series=" + SERIES,
    "--experiment_name=" + EXPERIMENT_NAME,
    "--run_name=" + RUN_NAME
]
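
As a rough sketch of the receiving side, the training script could read these flags with argparse and append the trial ID to the run name. This is an illustration under assumptions, not the contents of ./code/hp_train.py: only a few of the flags are declared, the helper names are hypothetical, and reading the trial ID from the CLOUD_ML_TRIAL_ID environment variable is an assumption about how the tuning service exposes it.

```python
import argparse
import os


def parse_args(argv):
    """Parse a subset of the CMDARGS flags; undeclared flags are ignored."""
    # allow_abbrev=False stops '--experiment=...' from being treated as a
    # shortened form of '--experiment_name'.
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=100)
    parser.add_argument("--experiment_name", type=str, required=True)
    parser.add_argument("--run_name", type=str, required=True)
    args, _unknown = parser.parse_known_args(argv)
    return args


def trial_run_name(run_name, trial_id):
    """Append the trial ID so each tuning trial logs to its own run."""
    return f"{run_name}-{trial_id}"


argv = [
    "--epochs=10",
    "--experiment=05i",
    "--experiment_name=experiment-05-05i-tf-classification-dnn",
    "--run_name=run-20230203134418",
]
args = parse_args(argv)

# Assumption: the trial ID is available as an environment variable.
trial_id = os.environ.get("CLOUD_ML_TRIAL_ID", "0")
print(args.experiment_name, trial_run_name(args.run_name, trial_id))
```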

This is where the values of experiment_name and run_name get passed in. They are defined at the top of the notebook in the cell that looks like:

FRAMEWORK = 'tf'
TASK = 'classification'
MODEL_TYPE = 'dnn'
EXPERIMENT_NAME = f'experiment-{SERIES}-{EXPERIMENT}-{FRAMEWORK}-{TASK}-{MODEL_TYPE}'
RUN_NAME = f'run-{TIMESTAMP}'

The unique part of this is RUN_NAME, because it includes the value of TIMESTAMP, which is also defined near the top of the notebook in a cell that looks like:

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
BUCKET = PROJECT_ID
URI = f"gs://{BUCKET}/{SERIES}/{EXPERIMENT}"
DIR = f"temp/{EXPERIMENT}"

It looks like the value of TIMESTAMP the notebook is using on your run may have already been used before. Is this possible?

The only other possibility I can think of is that multiple values of hpt.trial_id are the same, but I have not run into that before.
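
The two possibilities can be made concrete with a small sketch of how the context name in the 409 error appears to be composed. The pattern is inferred from the error messages in this thread, not taken from the SDK documentation, and the function name context_name is hypothetical.

```python
def context_name(project, region, experiment_name, run_name, trial_id):
    """Compose the metadata context name as it appears in the 409 error."""
    context_id = f"{experiment_name}-{run_name}-{trial_id}"
    return (
        f"projects/{project}/locations/{region}"
        f"/metadataStores/default/contexts/{context_id}"
    )


EXPERIMENT_NAME = "experiment-05-05i-tf-classification-dnn"

# Distinct trial IDs under one TIMESTAMP give distinct context names.
names = [
    context_name("1234", "us-central1", EXPERIMENT_NAME, "run-20230202120231", t)
    for t in (1, 2, 3)
]
print(len(set(names)))  # 3

# A reused TIMESTAMP or a repeated trial ID produces the same name again,
# which is the situation the AlreadyExists error reports.
repeat = context_name("1234", "us-central1", EXPERIMENT_NAME, "run-20230202120231", 1)
print(repeat in names)  # True
```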

Thank You

savi-bhide commented on May 30, 2024

Yeah, this issue looks similar to mine.

sakshi74 commented on May 30, 2024

It's the same for me too.

My issue is also the same.

statmike commented on May 30, 2024

Hello @Jay2201 ,
I just did a test run of the notebook in an environment where it was run previously and did not encounter any errors. I am going to cover the diagnostics I did here in case you want to replicate the steps for troubleshooting in your environment.

On the Vertex AI Console Page for Training, HyperParameter Tuning Jobs tab, select the current job related to the notebook. This gives a list of all the tuning trials and includes links to the logs for each:

[image: Screenshot 2023-02-03 at 10 09 03 AM]

I went to the logs for each of these tuning trials and looked for the result of the line that creates the experiment run:

expRun = aiplatform.ExperimentRun.create(run_name = args.run_name, experiment = args.experiment_name)

Here are the values I found in the logs for the first 6 trials:

  • Associating projects/1026793852137/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203134418-1 to Experiment: experiment-05-05i-tf-classification-dnn
  • Associating projects/1026793852137/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203134418-2 to Experiment: experiment-05-05i-tf-classification-dnn
  • Associating projects/1026793852137/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203134418-3 to Experiment: experiment-05-05i-tf-classification-dnn
  • Associating projects/1026793852137/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203134418-4 to Experiment: experiment-05-05i-tf-classification-dnn
  • Associating projects/1026793852137/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203134418-5 to Experiment: experiment-05-05i-tf-classification-dnn
  • Associating projects/1026793852137/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203134418-6 to Experiment: experiment-05-05i-tf-classification-dnn
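
If the logs are exported as text, the same check can be scripted. The sketch below assumes the log lines follow the "Associating ... to Experiment: ..." pattern shown above; the function names are hypothetical, and how you export the logs is left open.

```python
import re
from collections import Counter

# Matches the message logged when a run context is attached to an experiment.
ASSOCIATING = re.compile(
    r"Associating (?P<context>\S+) to Experiment: (?P<experiment>\S+)"
)


def run_contexts(log_lines):
    """Return the context ID (last path segment) from each matching line."""
    contexts = []
    for line in log_lines:
        match = ASSOCIATING.search(line)
        if match:
            contexts.append(match.group("context").rsplit("/", 1)[-1])
    return contexts


def duplicates(contexts):
    """Return context IDs that were logged more than once."""
    return sorted(name for name, count in Counter(contexts).items() if count > 1)


sample_logs = [
    "Associating projects/1026793852137/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203134418-1 to Experiment: experiment-05-05i-tf-classification-dnn",
    "Associating projects/1026793852137/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203134418-2 to Experiment: experiment-05-05i-tf-classification-dnn",
    "some unrelated log line",
]

found = run_contexts(sample_logs)
print(found)
print(duplicates(found))  # []
```

An empty duplicates list means every trial created its own run context; a non-empty list names the runs that more than one trial tried to create.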

Are you making any other changes to the tutorial notebook that might need to be investigated for causing this issue?
Thank You

Jay2201 commented on May 30, 2024

Let me run the notebook again and see whether I get the same error.

As far as I remember, I am only changing the region; the rest of the code I am running as is.

Jay2201 commented on May 30, 2024

@statmike - I am getting the same error with REGION = "europe-west2", as @sakshi74 reported.

statmike commented on May 30, 2024

Hello @Jay2201, when you run this job, what are you using for parallel_trial_count? The example uses 3. If you go to the logs for each of the parallel jobs, do they all have this error, or does one of them succeed with Associating projects/... to Experiment: ...?

Jay2201 commented on May 30, 2024

I have tried both 3 and 2 for parallel_trial_count. I get the same error for all of them.

statmike commented on May 30, 2024

Hello @Jay2201 ,
If all of the initial set of trials specified by parallel_trial_count are giving the same error then it seems to indicate the runs are being created before the job. I have some ideas for diagnostics here.

Initialize Parameters and Clients:

PROJECT_ID = 'your-project-id'  # replace with your project ID
REGION = 'europe-west2'
EXPERIMENT_NAME = 'experiment-05-05i-tf-classification-dnn'
TIMESTAMP = 20230203194023

from google.cloud import aiplatform
aiplatform.init(project=PROJECT_ID, location=REGION)

Return the known runs for the experiment:

exp = aiplatform.Experiment(experiment_name = EXPERIMENT_NAME)
exp_runs = exp.get_data_frame()
exp_runs

If needed, subset to the runs for the specific TIMESTAMP value:

exp_runs[exp_runs['run_name'].str.contains(f'run-{TIMESTAMP}')]

Let me know how the results of these checks for the experiment and logged runs work out.
Thank You

statmike commented on May 30, 2024

Hi @Jay2201,
Have you had any luck troubleshooting the run name already existing?

I gave this some thought over the last week. In cases where a run name may already exist, it could be desirable to add to the experiment run or overwrite information based on updated data. I made a small alteration to accommodate this, which might also help in your situation. If the run name is already defined, it will attach to it rather than try to create a new run with the same name. Inside the scripts that each of the notebooks 05a-05i calls, the following change has been made:

Before:

# Vertex AI Experiment
expRun = aiplatform.ExperimentRun.create(run_name = args.run_name, experiment = args.experiment_name)

After:

# Vertex AI Experiment
if args.run_name in [run.name for run in aiplatform.ExperimentRun.list(experiment = args.experiment_name)]:
    expRun = aiplatform.ExperimentRun(run_name = args.run_name, experiment = args.experiment_name)
else:
    expRun = aiplatform.ExperimentRun.create(run_name = args.run_name, experiment = args.experiment_name)
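
The change above lists the runs first and then decides whether to create one. A variation on the same idea is to attempt the create and fall back to attaching when the name is already taken. The sketch below shows that pattern with an in-memory stand-in so it runs without cloud access; FakeRunStore, get_or_create_run and the local AlreadyExists class are hypothetical, and whether the pattern maps cleanly onto aiplatform.ExperimentRun is an assumption.

```python
class AlreadyExists(Exception):
    """Stand-in for google.api_core.exceptions.AlreadyExists."""


class FakeRunStore:
    """In-memory stand-in for an experiment's collection of runs."""

    def __init__(self):
        self._runs = {}

    def create(self, run_name):
        if run_name in self._runs:
            raise AlreadyExists(run_name)
        self._runs[run_name] = {"name": run_name}
        return self._runs[run_name]

    def get(self, run_name):
        return self._runs[run_name]


def get_or_create_run(store, run_name):
    """Try to create the run; attach to the existing one if the name is taken."""
    try:
        return store.create(run_name), True
    except AlreadyExists:
        return store.get(run_name), False


store = FakeRunStore()
run, created = get_or_create_run(store, "run-20230203194023-1")
print(created)  # True

same_run, created_again = get_or_create_run(store, "run-20230203194023-1")
print(created_again)  # False
print(same_run is run)  # True
```

Compared with listing first, this avoids a separate list call on every trial, at the cost of relying on the exception type to signal that the run exists.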

Jay2201 commented on May 30, 2024

Thanks @statmike, I will definitely check this and update you. I am busy with other tasks at the moment, so I have not had time yet.

Jay2201 commented on May 30, 2024

Hey @statmike, sorry for the late reply. I tested it and it runs fine with your notebook, but because I am using multiple GPUs I have multiple runs, which are not updating in Vertex AI Experiments. Thanks for the solution 🙂
