
bike's Introduction

Hi, I'm Wenhao Wu 👋

知乎 (Zhihu) · GitHub · LinkedIn · Google Scholar · X

I am Wenhao Wu (吴文灏), a Ph.D. student in the School of Computer Science at The University of Sydney, supervised by Prof. Wanli Ouyang. I collaborate closely with the Department of Computer Vision Technology (VIS) at Baidu, led by Dr. Jingdong Wang (IEEE Fellow). I received my M.S.E. degree from the Multimedia Laboratory (MMLab@SIAT), University of Chinese Academy of Sciences, supervised by Prof. Shifeng Chen and Prof. Yu Qiao. I was also fortunate to intern or work as a research assistant at MMLab@CUHK, Baidu, iQIYI, SenseTime, Samsung Research, and the Chinese Academy of Sciences. I am honored to have been awarded the 11th Baidu Scholarship (2023).

My current research interests include Cross-Modal Learning and Video Understanding. I have published 20+ papers at top international CV/AI conferences and journals such as CVPR, ICCV, ECCV, AAAI, IJCAI, ACM MM, and IJCV.


🔭 Research Interest

My research interests broadly lie in the areas of Computer Vision and Deep Learning, including:

  • Cross-Modal Learning (2022-Present): Video-Language Matching, Multimodal Large Language Model (MLLM)
  • Video Foundation Model (2017-Present): Video Recognition, Efficient Video Tuning
  • Video-related Applications (2017-2022): Video Sampler, Temporal Action Detection, Anomaly Detection in Video
  • Self-supervised Learning (2021-2022): Contrastive Video Learning, Masked Video Modeling
  • Low-level Vision (2021-2022): Image Colorization, Style Transfer, Image Rescaling

🔥 News

  • 2024.01: I am honored to receive the 11th 🎖Baidu Scholarship🎖, a prestigious fellowship awarding 200,000 RMB (about $30,000) to 10 select Ph.D. students in Artificial Intelligence worldwide, chosen from thousands of applicants.
  • 2023.11: We release GPT4Vis, which provides a quantitative evaluation of GPT-4 for visual understanding across images, videos, and point clouds, spanning 16 popular datasets.
  • 2023.11: We release Side4Video, a Spatial-Temporal Side Network for memory-efficient image-to-video transfer learning, which significantly reduces training memory cost for action recognition (↓75%) and text-video retrieval (↓30%).
  • 2023.08: The extension of Text4Vis has been accepted by IJCV.
  • 2023.07: Two first-author papers (temporal modeling: ATM; cross-modal retrieval: UA) are accepted by ICCV 2023.
  • 2023.02: Two first-author papers on video understanding (BIKE, Cap4Video) are accepted by CVPR 2023. Cap4Video, which uses GPT to enhance text-video learning, is selected as a 🎉Highlight paper🎉 (top 2.5%).
  • 2022.11: Two papers (video recognition: Text4Vis; style transfer: AdaCM) are accepted by AAAI 2023.
  • 2022.07: Three papers (video sampling: NSNet, TSQNet; cross-modal learning: CODER) are accepted by ECCV 2022.
  • 2022.06: Our MaMiCo, a new video self-supervised learning work, is accepted by ACM MM 2022 (🎉Oral Presentation🎉).

bike's People

Contributors

eltociear, whwu95


bike's Issues

Frozen label encoder

What is the difference and connection between the frozen label encoder and the category encoder? I see in Table 6(a) that adding the frozen label encoder improves the result. Also, what is "(technical) Transf"?
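For context, here is a minimal sketch of what a frozen label (text) encoder usually means in CLIP-based recognition: class names are encoded once, with no gradient, and the cached features act as classifier weights. The class names and prompt below are illustrative, not the repository's exact pipeline.

```python
import torch
import clip  # OpenAI CLIP package; BIKE ships a modified copy under clip/

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Illustrative class names and prompt; the real pipeline reads them from the dataset.
class_names = ["archery", "bowling", "juggling balls"]
text_inputs = clip.tokenize([f"a video of {c}" for c in class_names]).to(device)

# "Frozen" means no gradients flow into the text encoder: encode once, cache, reuse.
with torch.no_grad():
    label_features = model.encode_text(text_inputs)  # [n_cls, feat_dim]
    label_features = label_features / label_features.norm(dim=-1, keepdim=True)
```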

CUDA out of memory

I am using four 3090 Ti cards and have set the batch size very small, but this error occurs every time as soon as the first epoch starts:

```
Traceback (most recent call last):
  File "train.py", line 522, in <module>
    main(args)
  File "train.py", line 316, in main
    prec1, output_list, labels_list = validate(epoch, val_loader, classes, device, model, video_head, config, n_class, logger, save_score)
  File "train.py", line 433, in validate
    cls_feature, text_features = model.module.encode_text(text_inputs, return_token=True)  # [n_cls, feat_dim]
  File "/home/chenshengyi/depthstudy/gesturecode/BIKE-main/clip/model.py", line 443, in encode_text
    x = self.transformer(x)
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenshengyi/depthstudy/gesturecode/BIKE-main/clip/model.py", line 253, in forward
    x = checkpoint(r, x)
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 235, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 96, in forward
    outputs = run_function(*args)
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenshengyi/depthstudy/gesturecode/BIKE-main/clip/model.py", line 227, in forward
    x = x + self.drop_path(self.attention(self.ln_1(x)))
  File "/home/chenshengyi/depthstudy/gesturecode/BIKE-main/clip/model.py", line 219, in attention
    return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 1153, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 5131, in multi_head_attention_forward
    v = v.contiguous().view(v.shape[0], bsz * num_heads, head_dim).transpose(0, 1)
RuntimeError: CUDA out of memory. Tried to allocate 1.96 GiB (GPU 0; 23.70 GiB total capacity; 15.94 GiB already allocated; 657.56 MiB free; 16.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
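Not an official fix, but two mitigations worth trying, both grounded in the error message and traceback above: set the allocator option the message itself suggests, and make sure validation-time text encoding runs under torch.no_grad() so no activations are retained. The helper name below is hypothetical.

```python
import os

# 1) Allocator hint taken from the error message; must be set before CUDA
#    is first initialized (top of train.py, or as an env var on the command line).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

def encode_text_for_validation(model, text_inputs):
    """Validation needs no gradients; no_grad() stops PyTorch from keeping
    activations for the whole text transformer during encode_text.
    (model / text_inputs are the objects appearing in the traceback above.)"""
    with torch.no_grad():
        return model.module.encode_text(text_inputs, return_token=True)
```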

About testing on Charades: should the final result use the last recorded mAP?

Hi!

I noticed that AverageMeter() is used in calculating the mAP, and the output is the global average, which is not a problem when calculating accuracy.

However, mAP should be computed over all predictions at once. Therefore, the value of the last maper.value().numpy() call should be used directly as the final result.

Am I understanding this correctly?
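To make the concern concrete: AP is not additive across batches, so averaging per-batch AP values generally differs from computing AP over all predictions at once. A toy single-class example (my own sketch, not the repository's code):

```python
import torch

def average_precision(scores, labels):
    # AP for a single class: precision averaged at the rank of each positive.
    order = scores.argsort(descending=True)
    labels = labels[order]
    hits = labels.cumsum(0)
    ranks = torch.arange(1, len(labels) + 1)
    precision_at_hits = hits[labels.bool()] / ranks[labels.bool()]
    return precision_at_hits.mean()

scores = torch.tensor([0.9, 0.1, 0.8, 0.3])
labels = torch.tensor([1, 0, 0, 1])

full = average_precision(scores, labels)  # AP over all predictions at once
batched = torch.stack([average_precision(scores[:2], labels[:2]),
                       average_precision(scores[2:], labels[2:])]).mean()
print(full, batched)
```

Here the full-set AP is 5/6 ≈ 0.833 while the per-batch average is 0.75, so the two protocols genuinely disagree.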

About warning: "None of the inputs have requires_grad=True. Gradients will be None"

I'm reaching out regarding an issue I've encountered while working with the model. I'm receiving a warning that says, 'None of the inputs have requires_grad=True. Gradients will be None.' I've been trying to troubleshoot this and was wondering if you might have insights into resolving this particular warning.

From my understanding, there might be an issue with how the requires_grad attribute is set on the inputs, leading to missing gradients during training. But the inputs (images) should not need requires_grad=True.

Could you kindly offer guidance or share any specific steps or considerations to address this warning? Any advice or direction you could provide would be greatly appreciated.

Thank you very much for your time and assistance.
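For context, this warning is emitted by torch.utils.checkpoint when every tensor passed to it has requires_grad=False, which is exactly what happens when gradient checkpointing runs inside a fully frozen branch (e.g. a frozen CLIP text encoder). A hedged sketch of two common workarounds; forward_block is a hypothetical helper, not the repository's code:

```python
import torch
from torch.utils.checkpoint import checkpoint

def forward_block(block, x, use_checkpoint=True):
    # Workaround 1: skip checkpointing when x carries no gradient; there is
    # nothing to recompute in backward, so the checkpoint buys nothing and
    # only triggers the warning.
    if not use_checkpoint or not x.requires_grad:
        return block(x)
    return checkpoint(block, x)

# Workaround 2 (only if some parameters inside the block ARE trainable):
# give checkpoint a differentiable input so gradients can flow to them.
#   x = x.detach().requires_grad_(True)
```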

Questions about the pre-generated attributes on UCF101/HMDB51 (I could not replicate your results on UCF101)

Congratulations on your excellent work! I saw that you provided the attribute JSON file for K400. Could you also provide the train/val JSON files for UCF101 and HMDB51? The main reason I ask is that I cannot reproduce your few-shot results on UCF101 (shot = 1/2/5): you report 95.2/96.1/96.5, while my results are 90.22/94.15/95.32. I have trained multiple times and taken the highest value. These are all results of the video branch only, without the attribute branch mentioned in your paper. Also, I wish you a happy National Day! Looking forward to your reply, and thank you!

Questions about the pre-generated attributes

Congratulations on your excellent work! It seems the currently published code does not include the 'pre-generated attributes', which should be in a JSON file, and the test code does not include the attributes branch either. Will the complete code be published soon?

Question about Kinetics-400 88.7

The README reports ViT-L/14; 16×336; 4×3 views; top-1 = 88.7 on Kinetics-400. However, the accuracy in the corresponding log link is not 88.7 but only 87.9.
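One plausible (unconfirmed) explanation for such gaps: a training log usually records single-view validation accuracy, while headline numbers use multi-view testing, here 4 temporal clips × 3 spatial crops, whose scores are averaged before the argmax. A sketch of that aggregation, with dummy scores:

```python
import torch

# Per-view class scores for one video: 4 clips x 3 crops = 12 views.
num_views, num_classes = 12, 400
view_scores = torch.rand(num_views, num_classes).softmax(dim=-1)

# Multi-view testing averages scores across views before classifying,
# which generally scores higher than single-view validation.
pred = view_scores.mean(dim=0).argmax().item()
```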

About inference

Hi! I would like to know whether the attributes branch is used during inference.

Key models

Congratulations on your work. May I ask if you could upload some key models to the cloud for download (e.g., k400-vit-l-14-f16.pt)?

num_sample: 1 → 4

When changing num_sample from 1 to 4, how should the training function be modified? In particular, this code:

```python
images = images.view((-1, config.data.num_segments, 3) + images.size()[-2:])  # bt 3 h w
b, t, c, h, w = images.size()
images = images.view(-1, c, h, w)
```

and

```python
image_embedding, cls_embedding, text_embedding, logit_scale = model(images, texts, return_token=True)
```

after the model's incoming parameters change from a tensor to a list.
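A hedged sketch of one way to handle this, assuming num_sample yields several augmented copies of each video that should be folded into the batch dimension, with the paired texts repeated to stay aligned. This is a guess at the intent, not the repository's official fix; all shapes are dummies:

```python
import torch

num_sample, b, t, c, h, w = 4, 8, 16, 3, 224, 224

# With num_sample > 1 the loader yields num_sample augmented copies per video.
images = torch.randn(b, num_sample, t, c, h, w)

# Fold the sample dimension into the batch, then flatten frames as before:
images = images.view(-1, t, c, h, w)   # [(b * num_sample), t, c, h, w]
images = images.view(-1, c, h, w)      # [(b * num_sample * t), c, h, w]

# Repeat the paired text tokens so each augmented copy keeps its caption.
texts = torch.randint(0, 49408, (b, 77))            # dummy CLIP token ids
texts = texts.repeat_interleave(num_sample, dim=0)  # [(b * num_sample), 77]
```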
