NightMachinery commented on May 12, 2024

I am also getting these OOM errors. Is there any way to monitor TPU RAM usage? Are there any docs on garbage collection on the TPU?

---------------------------------------------------------------------------
UnfilteredStackTrace                      Traceback (most recent call last)
<ipython-input-72-63dca48c8c17> in <module>()
     20   params, state, opt_state, model_output, loss = (
---> 21     train_step(params, state, opt_state, input_batch, target_batch, k1))
     22

9 frames
UnfilteredStackTrace: RuntimeError: RESOURCE_EXHAUSTED: Attempting to allocate 31.06M. That was not possible. There are 58.64M free. Due to fragmentation, the largest contiguous region of free memory is 30.56M.; (0x0x0_HBM0)

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/jax/interpreters/xla.py in _execute_compiled(name, compiled, output_buffer_counts, handlers, kept_var_idx, *args)
   1098           for i, x in enumerate(args)
   1099           if x is not token and i in kept_var_idx))
-> 1100   out_bufs = compiled.execute(input_bufs)
   1101   check_special(name, out_bufs)
   1102   if output_buffer_counts is None:

RuntimeError: RESOURCE_EXHAUSTED: Attempting to allocate 31.06M. That was not possible. There are 58.64M free. Due to fragmentation, the largest contiguous region of free memory is 30.56M.; (0x0x0_HBM0)
---------------------------------------------------------------------------
UnfilteredStackTrace                      Traceback (most recent call last)
<ipython-input-69-63dca48c8c17> in <module>()
     20   params, state, opt_state, model_output, loss = (
---> 21     train_step(params, state, opt_state, input_batch, target_batch, k1))
     22

9 frames
UnfilteredStackTrace: RuntimeError: FAILED_PRECONDITION: Dependency failed: Could not allocate 32571392 bytes in memory 0x0x0_HBM0; 32047104 bytes allocatable, 59981824 bytes available

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/jax/interpreters/xla.py in _execute_compiled(name, compiled, output_buffer_counts, handlers, kept_var_idx, *args)
   1098           for i, x in enumerate(args)
   1099           if x is not token and i in kept_var_idx))
-> 1100   out_bufs = compiled.execute(input_bufs)
   1101   check_special(name, out_bufs)
   1102   if output_buffer_counts is None:

RuntimeError: FAILED_PRECONDITION: Dependency failed: Could not allocate 32571392 bytes in memory 0x0x0_HBM0; 32047104 bytes allocatable, 59981824 bytes available
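
As a rough illustration of what the numbers in the second message imply, here is a minimal sketch that parses the byte counts and converts them to MiB. The helper name `parse_hbm_error` and the regular expression are hypothetical, written only for this message format. Reading "allocatable" as the largest contiguous free block is an assumption, based on the parallel wording of the first error (31.06M requested, 30.56M largest contiguous region).

```python
import re

MIB = 1024 * 1024

# Message copied from the second traceback above.
MESSAGE = (
    "FAILED_PRECONDITION: Dependency failed: Could not allocate 32571392 bytes "
    "in memory 0x0x0_HBM0; 32047104 bytes allocatable, 59981824 bytes available"
)

# Hypothetical pattern for this one message format; other runtime versions
# may word the error differently.
PATTERN = re.compile(
    r"Could not allocate (?P<requested>\d+) bytes.*?"
    r"(?P<allocatable>\d+) bytes allocatable, (?P<available>\d+) bytes available"
)


def parse_hbm_error(message):
    """Return the byte counts found in the message, or None if it does not match."""
    match = PATTERN.search(message)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


stats = parse_hbm_error(MESSAGE)

# Assumption: "allocatable" is the largest contiguous free block, so the
# request fails when it exceeds that value even if more memory is free in total.
shortfall = stats["requested"] - stats["allocatable"]

for name, value in stats.items():
    print(f"{name:>12}: {value:>10} bytes = {value / MIB:.2f} MiB")
print(f"{'shortfall':>12}: {shortfall:>10} bytes = {shortfall / MIB:.2f} MiB")
```

Under that reading, the request misses the largest free block by 524288 bytes (0.5 MiB) even though about 57 MiB is free in total, which points to fragmentation rather than the HBM being completely full.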

from trax.