core-pytorch-utils's Introduction

Hello, I am Zhengjia Li

I am currently working at Alibaba DAMO lab, focusing on computer vision research.

A few things about me:

  • 🔭 I’m currently working on Person Search, Vision Transformer, 3D Human Shape and Pose Estimation.
  • 💻 Microsoft (SDE Intern) -> SenseTime (CV Intern) -> ByteDance (CV Intern) -> Alibaba DAMO (CV).
  • 📜 Google scholar
  • 💬 Ask me about Machine Learning and Deep Learning. I am more than happy to help anytime. :)
  • 📫 How to reach me: [email protected]. I am mostly active, so feel free to reach out to me.
  • 👨 Pronouns: He/His

core-pytorch-utils's People

Contributors

serend1p1ty

core-pytorch-utils's Issues

OOM occurs when saving checkpoints

I have been debugging this problem for a long time. I now think the root cause is that the checkpoint saves too much state: not only the model weights but also the optimizer state_dict.
One way to avoid OOM during training is to decrease the batch size, but that wastes GPU memory.
Would it be better to save the optimizer state separately?
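
For illustration, here is a minimal sketch of what saving the two states separately could look like. It is not the library's checkpoint code: the helper name save_checkpoint_split, the file names, and the save_fn parameter are all assumptions made for this example, and whether this actually removes the OOM is untested. Plain dicts stand in for model.state_dict() and optimizer.state_dict(); with real tensors, torch.save could be passed as save_fn since it also accepts an object and an open file.

```python
import os
import pickle
import tempfile


def save_checkpoint_split(model_state, optimizer_state, out_dir, save_fn=pickle.dump):
    """Write model weights and optimizer state to two separate files.

    save_fn(obj, fileobj) is an assumption of this sketch: pickle.dump
    fits that shape, and so does torch.save when the states hold tensors.
    """
    os.makedirs(out_dir, exist_ok=True)
    # Hypothetical file names, chosen only for this example.
    paths = {
        "model": os.path.join(out_dir, "model.pth"),
        "optimizer": os.path.join(out_dir, "optimizer.pth"),
    }
    for name, state in (("model", model_state), ("optimizer", optimizer_state)):
        with open(paths[name], "wb") as f:
            save_fn(state, f)
    return paths


# Toy stand-ins for model.state_dict() and optimizer.state_dict().
demo_dir = tempfile.mkdtemp()
demo_paths = save_checkpoint_split({"w": [1.0, 2.0]}, {"lr": 0.1, "step": 3}, demo_dir)

# The weights can be loaded back without touching the optimizer file.
with open(demo_paths["model"], "rb") as f:
    restored_weights = pickle.load(f)
```

Keeping the files separate also means an inference-only run never has to read the optimizer state at all.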

dist.barrier() hangs

I want to use cpu (core-pytorch-utils) to train a model on two machines with two GPUs each, but I have run into a confusing problem.
All processes hang on the dist.barrier() call in distributed.py.
I added some print statements to help diagnose it. After printing "finish init_process_group, going to exec barrier", all 4 processes hang.
When I removed dist.barrier(), everything seemed to work. Could you give me some advice on how to debug this problem?
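
One thing that may be worth ruling out first is that the four processes disagree about the rendezvous settings. The sketch below assumes the job uses the environment-variable (env://) initialization of torch.distributed, which the report above does not confirm. The helper describe_dist_env is hypothetical and not part of core-pytorch-utils; it only reads variables and touches nothing in PyTorch. MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK and LOCAL_RANK are the variables read by env:// initialization and set by torchrun, while NCCL_DEBUG and NCCL_SOCKET_IFNAME are NCCL settings for logging and network interface selection.

```python
import os

# Variables that env:// initialization needs on every process.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
# Variables that are useful to see when diagnosing multi-machine hangs.
OPTIONAL_VARS = ("LOCAL_RANK", "NCCL_DEBUG", "NCCL_SOCKET_IFNAME")


def describe_dist_env(environ=None):
    """Return (report, missing) for the rendezvous-related variables."""
    environ = os.environ if environ is None else environ
    names = REQUIRED_VARS + OPTIONAL_VARS
    report = {name: environ.get(name, "<unset>") for name in names}
    missing = [name for name in REQUIRED_VARS if name not in environ]
    return report, missing


# Example with a made-up environment for one process.
report, missing = describe_dist_env(
    {"MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "29500", "RANK": "2"}
)
print(report)
print("missing:", missing)
```

Printing this on every process just before the barrier and comparing the output across the two machines shows quickly whether the ranks share the same address, port and world size. Setting NCCL_DEBUG=INFO before launching additionally makes NCCL log its connection setup.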

little advice 0.0

process_string = f"Epoch: [{self.trainer.epoch}][{self.trainer.inner_iter}/{self.trainer.epoch_len - 1}]"

inner_iter starts from 0, so I think this might be better:
process_string = f"Epoch: [{self.trainer.epoch}][{self.trainer.inner_iter + 1}/{self.trainer.epoch_len}]"
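
A small standalone sketch of the suggested formatting, with the trainer attributes replaced by plain arguments (format_progress is a hypothetical name used only here):

```python
def format_progress(epoch, inner_iter, epoch_len):
    """Build the progress string with a 1-based iteration counter."""
    return f"Epoch: [{epoch}][{inner_iter + 1}/{epoch_len}]"


# inner_iter runs from 0 to epoch_len - 1, so the display runs from 1 to epoch_len.
first = format_progress(0, 0, 100)
last = format_progress(0, 99, 100)
print(first)  # Epoch: [0][1/100]
print(last)   # Epoch: [0][100/100]
```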
