
openscene's Introduction

OpenScene: 3D Scene Understanding with Open Vocabularies

Songyou Peng · Kyle Genova · Chiyu "Max" Jiang · Andrea Tagliasacchi
Marc Pollefeys · Thomas Funkhouser

CVPR 2023


OpenScene is a zero-shot approach for performing a series of novel 3D scene understanding tasks using open-vocabulary queries.


Table of Contents
  1. Interactive Demo
  2. Installation
  3. Data Preparation
  4. Run
  5. Applications
  6. TODO
  7. Acknowledgement
  8. Citation

News 🚩

  • [2023/10/27] Add the code for LSeg per-pixel feature extraction and multi-view fusion. Check this repo.
  • [2023/03/31] Code is released.

Interactive Demo

No GPU is needed! Follow this instruction to set up and play with the real-time demo yourself.

Here we present a real-time, interactive, open-vocabulary scene understanding tool. A user can type in an arbitrary query phrase such as snoopy (rare object), somewhere soft (property), made of metal (material), where can I cook? (activity), or festive (abstract concept), and the corresponding regions are highlighted.

Installation

Follow installation.md to install all required packages so you can run the evaluation & distillation afterwards.

Data Preparation

We provide the pre-processed 3D&2D data and multi-view fused features for the following datasets:

  • ScanNet
  • Matterport3D
  • nuScenes
  • Replica

Pre-processed 3D&2D Data

You can preprocess the datasets yourself; see the data pre-processing instruction.

Alternatively, you can download the pre-processed datasets by running the script below and following the command-line instructions to select the corresponding datasets:

bash scripts/download_dataset.sh

The script will download and unpack the data into the folder data/. You can also download the datasets elsewhere and point a symbolic link to the corresponding folder:

ln -s /PATH/TO/DOWNLOADED/FOLDER data
List of provided processed data:
  • ScanNet 3D (point clouds with GT semantic labels)
  • ScanNet 2D (RGB-D images with camera poses)
  • Matterport 3D (point clouds with GT semantic labels)
  • Matterport 2D (RGB-D images with camera poses)
  • nuScenes 3D (lidar point clouds with GT semantic labels)
  • nuScenes 2D (RGB images with camera poses)
  • Replica 3D (point clouds)
  • Replica 2D (RGB-D images)
  • Matterport 3D with top 40 NYU classes
  • Matterport 3D with top 80 NYU classes
  • Matterport 3D with top 160 NYU classes

Note: 2D processed datasets (e.g. scannet_2d) are only needed if you want to do multi-view feature fusion on your own. If so, please follow the instruction for multi-view fusion.

Multi-view Fused Features

To evaluate our OpenScene model or to distill a 3D model, one needs the multi-view fused image features for each 3D point (see the method in Sec. 3.1 of the paper).
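As a rough illustration of what multi-view fusion computes, here is a minimal sketch (simplified from the idea in Sec. 3.1, not the repository's actual implementation; the view dictionary layout and all names are hypothetical, and the real pipeline additionally uses the depth images for occlusion checks): each 3D point is projected into every camera view, the per-pixel 2D features at the valid projections are collected, and they are averaged into one feature per point.

import numpy as np

def fuse_multiview_features(points, views, feat_dim=768):
    """Average per-pixel 2D features over all views in which each 3D point projects validly (sketch)."""
    n = points.shape[0]
    fused = np.zeros((n, feat_dim), dtype=np.float32)
    counts = np.zeros(n, dtype=np.int32)
    points_h = np.concatenate([points, np.ones((n, 1))], axis=1)  # homogeneous world coordinates

    for view in views:  # each view: {'K': (3, 3), 'world2cam': (4, 4), 'features': (H, W, feat_dim)}
        cam = (view['world2cam'] @ points_h.T).T[:, :3]   # points in the camera frame
        z = cam[:, 2]
        valid = z > 1e-6                                   # only points in front of the camera
        uvw = (view['K'] @ cam.T).T                        # pinhole projection (before dividing by depth)
        u = np.zeros(n, dtype=np.int64)
        v = np.zeros(n, dtype=np.int64)
        u[valid] = np.round(uvw[valid, 0] / z[valid]).astype(np.int64)
        v[valid] = np.round(uvw[valid, 1] / z[valid]).astype(np.int64)
        h, w, _ = view['features'].shape
        valid &= (u >= 0) & (u < w) & (v >= 0) & (v < h)   # projection must land inside the image
        fused[valid] += view['features'][v[valid], u[valid]]
        counts[valid] += 1

    visible = counts > 0
    fused[visible] /= counts[visible, None]   # average over the views that saw each point
    return fused, visible                     # a visibility mask like this is typically saved with the fused features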

You can run the following to directly download provided fused features:

bash scripts/download_fused_features.sh
List of provided fused features:
  • ScanNet - Multi-view fused OpenSeg features, train/val (234.8G)
  • ScanNet - Multi-view fused LSeg features, train/val (175.8G)
  • Matterport - Multi-view fused OpenSeg features, train/val (198.3G)
  • Matterport - Multi-view fused OpenSeg features, test set (66.7G)
  • Replica - Multi-view fused OpenSeg features (9.0G)
  • Matterport - Multi-view fused LSeg features (coming)
  • nuScenes - Multi-view fused OpenSeg features (coming)
  • nuScenes - Multi-view fused LSeg features (coming)

Alternatively, you can also generate multi-view features yourself following the instruction.

Run

Once you have installed the environment and obtained the processed 3D data and multi-view fused features, you are ready to run our OpenScene distilled/ensemble models for 3D semantic segmentation, or to distill your own model from scratch.

Evaluation for 3D Semantic Segmentation with Pre-defined Labelsets

Here you can evaluate OpenScene features on the different datasets (ScanNet/Matterport3D/nuScenes/Replica) that have pre-defined labelsets. We already include the following labelsets in label_constants.py:

  • ScanNet 20 classes (wall, door, chair, ...)
  • Matterport3D 21 classes (ScanNet 20 classes + floor)
  • Matterport top 40, 80, 160 NYU classes (more rare object classes)
  • nuScenes 16 classes (road, bicycle, sidewalk, ...)

The general command to run evaluation:

sh run/eval.sh EXP_DIR CONFIG.yaml feature_type

where you specify your experiment directory EXP_DIR and replace CONFIG.yaml with the correct config file under config/. feature_type selects which per-point OpenScene features to use:

  • fusion: The 2D multi-view fused features
  • distill: Features from the distilled 3D model
  • ensemble: Our 2D-3D ensemble features (a sketch of the idea follows below)
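To give a rough picture of what the ensemble option does (a sketch of the general idea only; the repository's exact scoring and implementation may differ, and all names below are hypothetical): both the 2D fused feature and the 3D distilled feature of each point are scored against the CLIP text embeddings of the labelset, and whichever feature achieves the higher best-class score is kept for that point.

import torch

def ensemble_2d3d(feat_2d, feat_3d, text_emb):
    """Per-point 2D-3D feature ensemble (sketch).

    feat_2d:  (N, C) multi-view fused image features.
    feat_3d:  (N, C) features predicted by the distilled 3D model.
    text_emb: (K, C) CLIP text embeddings of the K class names.
    All inputs are assumed to be L2-normalized.
    """
    sim_2d = feat_2d @ text_emb.t()                       # (N, K) cosine similarities
    sim_3d = feat_3d @ text_emb.t()
    use_2d = sim_2d.max(dim=1).values > sim_3d.max(dim=1).values
    ensembled = torch.where(use_2d.unsqueeze(1), feat_2d, feat_3d)
    pred = (ensembled @ text_emb.t()).argmax(dim=1)       # per-point class prediction
    return ensembled, pred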

To evaluate with distill and ensemble, the easiest way is to use a pre-trained 3D distilled model. You can do this by using one of the config files with the suffix _pretrained.

For example, to evaluate the semantic segmentation on Replica, you can simply run:

# 2D-3D ensemble
sh run/eval.sh out/replica_openseg config/replica/ours_openseg_pretrained.yaml ensemble

# Run 3D distilled model
sh run/eval.sh out/replica_openseg config/replica/ours_openseg_pretrained.yaml distill

# Evaluate with 2D fused features
sh run/eval.sh out/replica_openseg config/replica/ours_openseg_pretrained.yaml fusion

The script will automatically download the pre-trained 3D model and run the evaluation with the Matterport 21-class labelset. You can find all outputs in out/replica_openseg.

For evaluation options, see the TEST section inside config/replica/ours_openseg_pretrained.yaml. Below are important evaluation options that you might want to modify:

  • labelset (default: None; options: scannet | matterport | matterport40 | matterport80 | matterport160): Evaluate on a specific pre-defined labelset in label_constants.py. If not specified, the labelset is inferred from the name of your 3D point cloud folder
  • eval_iou (default: True): whether to evaluate mIoU. Set to False if there are no GT labels
  • save_feature_as_numpy (default: False): save the per-point features as .npy
  • prompt_eng (default: True): turn each input class name X into the prompt "a X in a scene" (see the sketch after this list)
  • vis_gt (default: True): visualize point clouds with GT semantic labels
  • vis_pred (default: True): visualize point clouds with our predicted semantic labels
  • vis_input (default: True): visualize the input point clouds
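The prompt_eng option refers to wrapping each class name into a sentence before it is embedded with the CLIP text encoder. A minimal sketch of this step, using the openai/CLIP package (the CLIP variant and the exact handling in the repo may differ):

import clip
import torch

def build_text_embeddings(class_names, prompt_eng=True, device='cuda'):
    """Embed a labelset with the CLIP text encoder (sketch)."""
    model, _ = clip.load('ViT-L/14@336px', device=device)   # CLIP variant is an assumption
    prompts = [f'a {name} in a scene' if prompt_eng else name for name in class_names]
    tokens = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        text_emb = model.encode_text(tokens)
    return text_emb / text_emb.norm(dim=-1, keepdim=True)    # L2-normalize for cosine similarity

These normalized text embeddings are then compared against the per-point features to produce class predictions at evaluation time.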

If you want to use a 3D model distilled from scratch, set model_path to the corresponding checkpoint EXP/model/model_best.pth.tar.

Distillation

Finally, if you want to distill a new 3D model from scratch, run:

  • Start distilling: sh run/distill.sh EXP_NAME CONFIG.yaml

  • Resume: sh run/resume_distill.sh EXP_NAME CONFIG.yaml

For the available distillation options, please take a look at DISTILL inside config/matterport/ours_openseg.yaml.
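For intuition, the distillation objective itself is simple: the 3D network is trained so that its per-point output matches the precomputed multi-view fused 2D feature of the same point. A minimal sketch, assuming a cosine-distance loss (check the DISTILL options in the config for the actual settings; names below are hypothetical):

import torch
import torch.nn.functional as F

def distill_loss(feat_3d_pred, feat_2d_fused, mask):
    """Cosine-distance distillation loss (sketch).

    feat_3d_pred:  (N, C) features predicted by the 3D backbone.
    feat_2d_fused: (N, C) precomputed multi-view fused 2D features (the target).
    mask:          (N,) bool, True for points visible in at least one view.
    """
    pred = F.normalize(feat_3d_pred[mask], dim=1)
    target = F.normalize(feat_2d_fused[mask], dim=1)
    return (1.0 - (pred * target).sum(dim=1)).mean()   # 1 - cosine similarity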

Using Your Own Datasets

  1. Follow the data preprocessing instruction and modify the code accordingly to obtain the processed 2D & 3D data.
  2. Follow the feature fusion instruction and modify the code to obtain the multi-view fused features.
  3. You can distill a model on your own, or take our provided 3D distilled model weights (e.g. our 3D model for ScanNet or Matterport3D) and modify model_path accordingly.
  4. If you want to evaluate on a specific labelset, change the labelset in the config (see the sketch after this list).
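As a hedged example for step 4, extending the labelset might look like the snippet below, assuming label_constants.py stores labelsets as tuples of class names (check the actual file for its exact structure and for any matching color maps the visualization code expects):

# Hypothetical addition to label_constants.py for a custom dataset.
MYDATASET_LABELS_10 = (
    'wall', 'floor', 'table', 'chair', 'sofa',
    'lamp', 'plant', 'window', 'door', 'shelf',
)
# Then point the 'labelset' option in your config to this new labelset.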

Applications

Besides zero-shot 3D semantic segmentation, we can also perform the following tasks:

  • Open-vocabulary 3D scene understanding and exploration: query a 3D scene to understand properties that extend beyond fixed category labels, e.g. materials, activities, affordances, room types, abstract concepts, ...
  • Rare object search: query a 3D scene database to find rare examples based on their names
  • Image-based 3D object detection: query a 3D scene database to retrieve examples based on their similarity to a given input image (a sketch follows below)
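For the image-based query, the idea mirrors the text queries, except that the query embedding comes from the CLIP image encoder. A hedged sketch (not the repository's code; the CLIP variant and all names are assumptions):

import clip
import torch
from PIL import Image

def image_query_scores(point_features, image_path, device='cuda'):
    """Score every 3D point against a query image (sketch).

    point_features: (N, C) L2-normalized per-point OpenScene features, on `device`.
    """
    model, preprocess = clip.load('ViT-L/14@336px', device=device)   # CLIP variant is an assumption
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        query = model.encode_image(image)
    query = (query / query.norm(dim=-1, keepdim=True)).to(point_features.dtype)
    return (point_features @ query.t()).squeeze(1)   # (N,) cosine similarity per point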

Acknowledgement

We sincerely thank Golnaz Ghiasi for providing guidance on using OpenSeg model. Our appreciation extends to Huizhong Chen, Yin Cui, Tom Duerig, Dan Gnanapragasam, Xiuye Gu, Leonidas Guibas, Nilesh Kulkarni, Abhijit Kundu, Hao-Ning Wu, Louis Yang, Guandao Yang, Xiaoshuai Zhang, Howard Zhou, and Zihan Zhu for helpful discussion. We are also grateful to Charles R. Qi and Paul-Edouard Sarlin for their proofreading.

We build some parts of our code on top of the BPNet repository.

TODO

  • Support demo for arbitrary scenes
  • Support in-website demo
  • Support multi-view feature fusion with LSeg
  • Add missing multi-view fused LSeg features for Matterport & nuScenes
  • Add missing multi-view fused OpenSeg features for nuScenes
  • Multi-view feature fusion code for nuScenes
  • Support the latest PyTorch version

We very much welcome all kinds of contributions to the project.

Citation

If you find our code or paper useful, please cite

@inproceedings{Peng2023OpenScene,
  title     = {OpenScene: 3D Scene Understanding with Open Vocabularies},
  author    = {Peng, Songyou and Genova, Kyle and Jiang, Chiyu "Max" and Tagliasacchi, Andrea and Pollefeys, Marc and Funkhouser, Thomas},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2023}
}


openscene's Issues

Evaluation with "model.eval()"

Thanks for your wonderful work!

When I run the distill.py file, the evaluation performance attains 41.0 mIoU with model.train(). But when I run evaluate.py, the performance drops to 25.0 mIoU with model.eval().

Why does the performance drop so significantly when I use the model.eval() state?

demo: OSError: Address already in use

In the demo app, you can run into the "Address already in use" error if the demo crashes or exits uncleanly.

Just adding a note that, at least on macOS, you can find and kill the process ID via netstat, rather than editing the port as noted in https://github.com/pengsongyou/openscene/tree/main/demo#troubleshooting:

netstat -anvp tcp | awk 'NR<3 || /LISTEN/'

Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)      rhiwat  shiwat    pid   
tcp4       0      0  127.0.0.1.1111         *.*                    LISTEN       131072  131072   3263 

kill 3263

Error on pre-trained model loading on nuScenes dataset

Hi, thank you for releasing this excellent research.

I got an error when I loaded the pre-trained model on the nuScenes dataset. I ran the command "sh run/eval.sh out/nuScenes_openseg config/nuscenes/ours_openseg_pretrained.yaml ensemble".

The error occurs at:

checkpoint = model_zoo.load_url(args.model_path, progress=True)
model.load_state_dict(checkpoint['state_dict'], strict=True)

The layer names in the model begin with net3d, e.g. net3d.conv0p1s1.kernel, but the layer names in the pre-trained checkpoint begin with module.net3d, e.g. module.net3d.conv0p1s1.kernel. So I cannot load the pre-trained weights.

Did I miss some configuration? Any idea how to fix this error?

Thank you so much.
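A common PyTorch workaround for this kind of key mismatch (the checkpoint was saved from a model wrapped in DataParallel/DistributedDataParallel, which prepends module. to every parameter name) is to strip the prefix before loading. This is a general-purpose sketch, not necessarily how the authors intend the config to be used:

import torch

def load_pretrained(model, ckpt_path):
    """Load a checkpoint saved from a (Distributed)DataParallel-wrapped model (sketch)."""
    checkpoint = torch.load(ckpt_path, map_location='cpu')
    state_dict = checkpoint['state_dict']
    # (Distributed)DataParallel prepends 'module.' to every parameter name; strip it.
    state_dict = {k[len('module.'):] if k.startswith('module.') else k: v
                  for k, v in state_dict.items()}
    model.load_state_dict(state_dict, strict=True)
    return model

With the prefix stripped, strict loading should succeed as long as the architectures otherwise match.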

Question about the formula to get the pixel

Hi,

I'm new to the 3D world.

I would like to know whether the formula for calculating the pixel coordinates,

$$ \tilde{\mathbf{u}}=I_i \cdot E_i \cdot \tilde{\mathbf{p}}, $$

is an innovation of this article or a convention in the field. Are there any relevant references?
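For what it is worth, this is the standard pinhole-camera projection used throughout multi-view geometry (intrinsics times extrinsics times a homogeneous 3D point, followed by division by depth), not something specific to this paper. A small numeric sketch with made-up camera parameters:

import numpy as np

# Hypothetical intrinsics I (3x3) and extrinsics E (3x4, world-to-camera).
I = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
E = np.hstack([np.eye(3), np.array([[0.0], [0.0], [2.0]])])  # identity rotation, 2 m translation along z

p = np.array([0.5, -0.2, 1.0, 1.0])      # homogeneous 3D point \tilde{p}
u_tilde = I @ E @ p                      # \tilde{u} = I_i · E_i · \tilde{p}
u, v = u_tilde[:2] / u_tilde[2]          # divide by depth to get pixel coordinates
print(u, v)                              # approximately (403.3, 206.7)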

Readme to obtain the demo data

Could you please provide a detailed procedure for obtaining the OpenScene features and region segmentations, so that we can run the interactive demo on more 3D scenes besides Matterport3D? Thanks!

Openseg feature

How do you extract the OpenSeg image features in detail, given that OpenSeg predicts masks?

if regional_pool:
    image_embedding_feat = results['ppixel_ave_feat'][:, :crop_sz[0], :crop_sz[1]]
else:
    image_embedding_feat = results['image_embedding_feat'][:, :crop_sz[0], :crop_sz[1]]

What is the difference between results['ppixel_ave_feat'] and results['image_embedding_feat']?

Question about the double data augmentations

Hi,

I find that data augmentation is applied twice when loading the data. The first pass is the rotation and scale augmentation during the voxelization step, and the second pass applies several more augmentation strategies.

May I ask the purpose of this double augmentation, and why not just perform a single pass?

Creating full 3d map with clip features for a matterport scene

Given the precomputed fused features, I am trying to build the full 3D space for a Matterport3D scene as a tensor (x, y, z, 768) in order to query the features around a particular point in the scene. I've tried using the FusedFeatureLoader class to load the features for a scene and put them all together using the coordinates, but it does not seem to work well. Is there a better way to do this?

Update: I'm unable to find a way to go from the region-wise coordinates to a global coordinate system that aligns these different regions.

loss doesn't decrease

Hi,

I want to apply OpenScene to my own dataset. I followed the data preprocessing instructions. However, the distillation loss always fluctuates around 0.2, like this:

[screenshot: training loss curve]

And the validation IoU stays around 0:

[screenshot: validation IoU curve]

The distillation training script is adapted from ScanNet. For the hyper-parameter settings, I only changed the batch size from 8 (the ScanNet default) to 4 in order to fit into GPU memory. I suspect the issue is the different batch size, since the learning rate is also tied to the batch size:

[screenshot: learning-rate setting]

Do you know how to solve this problem?

Also, I tested the performance using the 2D features to represent the point cloud directly; the results are okay (mIoU = 0.2391).

"Unable to connection" error when running the demo

Hi! Thanks for your brilliant work and your effort to provide an interactive demo!

Unfortunately, I got some errors when running the demo after following the instructions.

I run the demo on a local Ubuntu machine. After I type a query term, the following error occurs:

Unable to make connection
./run_demo: line 5: 30113 Aborted                 (core dumped) ./gaps/bin/x86_64/osview region_segmentations/region*.ply openscene_features/RP*.npy -v -tcp

Do you have any idea about it?

Thanks!

About LSeg feature

Hello, thanks for this great job! I am now doing some work with LSeg features, but I have a question. In LSeg, there are some refinement layers after the computation of the cosine similarity with the text features, like this:

[screenshot: LSeg refinement layers]

Since the OpenScene distillation does not introduce the text features, I guess that you just use the variable named 'image_features' here (ignoring the part I marked in the figure above), is that right?
Thanks for your reply!

Inference speed on the nuScenes Lidarseg

Great work, I'm very interested in it.
What is the inference speed of this model on the GPU you are using? On the nuScenes dataset, how long does it take to query 10 object classes in a single scene?

Question for modification of other per-pixel feature extractor e.g. OV-Seg

I'd like to change the feature extractor to the stronger OV-Seg. However, OV-Seg adopts a MaskFormer head, and I am confused about the per-pixel feature used in OpenScene. Should this feature come from the backbone (image embedding) or from the last two layers of the head? Could you provide some guidance on which layer to extract the language-aware visual features from? Thanks.

Question about table 1

Hi,

Thanks for releasing the code! I have some questions about Table 1 in the main paper:

  1. Do you use the 2D-3D Feature Ensemble to do zero-shot segmentation?
  2. How do you evaluate zero-shot performance (i.e., only test the performance of segmenting new classes)?

How can I play the interactive demo on a remote Ubuntu Server?

I'm working on a remote Ubuntu server via SSH and ran run_demo. However, it returns the error:

 freeglut (conf2texture): failed to open display

I know that using a local machine with a graphical interface would avoid this problem, but I would like to try running run_demo on a remote server (as it is the only Linux machine available to me). Do you have any experience in this regard?

Data setting for zero-shot results

Hi, @pengsongyou,

Thanks for your great work.
During the training of the 3D distillation models, are all the seen and unseen objects available?
If so, does that mean your models access the unseen objects' features without labels?

Something about “nuscenes_openseg.py”

In your code scripts/feature_fusion/nuscenes_openseg.py, you simply project the point cloud data from all timestamps in the same scene onto the image data at the last timestamp. I think this is not rigorous; what do you think? I also want to ask: why not directly use the point cloud data at the last timestamp to map and match with the image data?

How to obtain ideal test results on the nuscenes dataset?

Hi, thanks for your interesting work!
I have some questions.
According to your config.yaml, I tested the nuScenes dataset using the nuscenes_openseg.pth.tar checkpoint you provided and obtained an mIoU of only 29, not the 42 published in the paper. How can I solve this issue?

[screenshot: evaluation output]

Training on other 3D datasets (3D-FRONT or S3DIS)

Thank you for the great open-source work!

If I want to retrain OpenScene on other datasets (3D-FRONT or S3DIS), do I need corresponding RGB-D images in addition to the 3D point clouds? As far as I know, the answer is yes. But these datasets don't provide RGB-D images of the corresponding scenes, so what should I do in this case?

Looking forward to your reply.
Thank you~

IndexError on nuScenes

I have met the following IndexError on nuScenes:

IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/kcl/anaconda3/envs/openscene/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/kcl/anaconda3/envs/openscene/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/kcl/anaconda3/envs/openscene/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/media/kcl/b0beda02-b349-4a4a-b5d0-95769b843737/openscene/dataset/feature_loader.py", line 111, in __getitem__
    feat_3d_new[mask] = feat_3d
IndexError: The shape of the mask [34752] at index 0 does not match the shape of the indexed tensor [277920, 768] at index 0

Thanks.

Failed to clone demo/gaps

Thanks for your great work and the repo

I ran into an issue when following the installation instructions: I'm not able to access the gaps repo.

Submodule 'demo/gaps' (git@github.com:tomfunkhouser/gaps.git) registered for path 'demo/gaps'
Cloning into 'C:/Users/Egdivad/source/repos/Github/dg_fork_openscene/demo/gaps'...
Host key verification failed.
fatal: Could not read from remote repository.

Do you know what I'm doing wrong? Thanks

Addition of random number to coords

While going through the train script I saw
coords[:, :3] += (torch.rand(3) * 100).type_as(coords)
here. I'm unable to understand the purpose of this, any insight would be appreciated!

About ScanNet feature_loader

Hello, thanks for your great work! I'm sorry, but I still have some questions about the ScanNet feature_loader.

In ./scripts/feature_fusion/scannet_openseg.py:

[screenshot: scannet_openseg.py]

and in fusion_util.py:

[screenshot: fusion_util.py]

I find that there are only 2 keys in this pth file, which are 'feat' and 'mask_full'.

But in the feature_loader, your annotation says:

[screenshots: feature_loader docstring]

Is there a mistake here? What is the key named 'mask'?

Thanks for your answer.

nuScenes validation set

Thanks for sharing your data and code; your work has been very helpful to me!
I noticed that you updated the point cloud data for the nuScenes validation set. The new point clouds contain approximately 300,000 points each, whereas the original LiDAR point clouds only contained about 30,000 points. Were these additional points obtained by concatenating neighboring frames of the nuScenes data? If I want to retrain the model, do I also need to update the LiDAR point clouds in the training set?

Scale factor for Matterport 2D depth scans

Hello~
Thank you for the clean codebase and the amazing work.
I downloaded the preprocessed Matterport 2D scans provided in the README.md. But when I checked the depth image pixel values (for an arbitrary sequence), they are integers ranging from 0 to several thousand. What scale do these numbers use? I have usually worked with depth images in floating-point meters.

Sorry if I am asking something obvious!
Thanks in advance~
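While waiting for the exact answer, note that the conversion itself is just a division by a dataset-specific scale factor; the value below is only a placeholder, so take the real factor from the repository's Matterport preprocessing/config rather than from this sketch:

import cv2
import numpy as np

depth_raw = cv2.imread('depth/000000.png', cv2.IMREAD_UNCHANGED)  # hypothetical path; uint16 values
depth_scale = 4000.0   # PLACEHOLDER -- check the repo's Matterport depth scale before using this
depth_m = depth_raw.astype(np.float32) / depth_scale              # depth in meters; 0 usually means "no reading"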

Multi-view fusion in nuscenes

Thanks for your excellent work! Additionally, I would like to know which images among 'CAM_FRONT', 'CAM_FRONT_RIGHT', 'CAM_BACK_RIGHT', 'CAM_BACK', 'CAM_BACK_LEFT', and 'CAM_FRONT_LEFT' are used for multi-view fusion in nuScenes.

How to get the ".pt" file?

How to get the ".pt" file from nuscenes_multiview_openseg_val?

However, the files obtained from nuscenes_multiview_openseg_va are "color", "K", and "pose".

Moreover, what do the above "K" and "pose" files mean? How are they generated?

Can you help me?

Openseg feature

OpenSeg obtains the per-mask embedding z to compute the region-word grounding loss. How do you get the per-pixel embedding (i.e., results['ppixel_ave_feat'] in your code) for OpenScene? Do you multiply z and s with a matrix multiplication and aggregate the information over the N masks?

Lseg feature extraction

Hi, do you have plans to open-source the code for Lseg feature extraction? I'm curious about how you determine the crop_size. Additionally, I would like to know if the final extracted feature map is smaller than [240, 320], and if so, is it directly interpolated to [240, 320]?

LSeg on ScanNet

Hello, @pengsongyou, thanks for your great job. I've done the ScanNet LSeg feature extraction. The code is:

def extract_lseg_img_feature(img_dir, lseg_model, text_emb, img_size=None, regional_pool=True):
    '''Extract per-pixel LSeg features.'''

    # load RGB image
    lseg_model = lseg_model.cuda()
    lseg_model.eval()
    np_img = cv2.imread(img_dir)
    h, w, _ = np_img.shape
    new_w, new_h = 320, 256  # (320, 240) is not divisible by 32

    padded_image = cv2.copyMakeBorder(np_img, 0, new_h - h, 0, new_w - w, cv2.BORDER_CONSTANT, value=0)
    input_img = transforms.ToTensor()(padded_image)
    input_img = transforms.Normalize(mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD)(input_img)
    input_img = input_img.unsqueeze(dim=0)

    # run LSeg
    with torch.no_grad():
        feat_2d = lseg_model(input_img.cuda())
    feat_2d = feat_2d[:, :, :h, :w].squeeze(dim=0)
    # without detach(), the computation graph is kept alive and GPU memory keeps accumulating until it blows up
    feat_2d = feat_2d.detach().half().cpu()

    return feat_2d

Using this, I run the demo model:

[screenshot: demo model configuration]

But I only get 31 mIoU on ScanNet, whereas your multi-view LSeg features give 49 mIoU. So I am really curious about your implementation here. Can you point out the mistake in my code? Thanks for your reply!

Multiview fusion in nuScenes

Thanks for the excellent work!
I'm trying to do multi-view feature fusion on the nuScenes dataset. In the code, it seems that 6 images are fused for each scan.
I wonder whether the features would be robust if I applied multi-view fusion with the 'adjacent 30 images' (from 5 adjacent LiDAR scans).
When I preprocess the OpenSeg features in this way, I get 32.37 mIoU on the nuScenes validation set (using only the 2D multi-view features).

  1. I wonder what mIoU should be expected with only the 2D OpenSeg features. The 2D-3D ensemble records about 42, so it would be slightly lower, I guess (maybe over 35.xx?).
  2. Is there any ablation or opinion on the number of multi-view frames?

What do the images need to satisfy?

Hi,
I was wondering what attributes the images need to have. I started to look through your code, but maybe there is an easy answer?
More concretely, I would like to try OpenScene on outdoor scenes made from street-view images and point clouds obtained through LiDAR scans. As far as I know, I can only get RGB images from Street View, but at least for the 2D-3D fusion I probably need intrinsics?
Thank you for your work on OpenScene; any help would be much appreciated.
Best,
Yannick

Feature fusion of nuScenes

Hi. Thanks for your great work!

How do I implement the feature fusion for nuScenes? I cannot find a nuScenes_openseg.py file in the code.

I would also like to ask when the multi-view fused features of nuScenes will be released.

LSeg Feature and Evaluate

Hello, @pengsongyou! Thanks for your great job! I implemented my own LSeg feature extraction and evaluation, but unfortunately there is a big gap compared with evaluating with your multi-view LSeg features.

So I am curious about your implementation of the LSeg feature extraction. Did you use any tricks such as multi-scale evaluation? How did you solve the problem that the (320, 240) 'image_dim' is not divisible by 32? (I just resize to (320, 256) and, after the forward pass, interpolate back to (320, 240).)

Thanks for your reply!

How to get 'scene.ply' in nuScenes

Thanks for your great work! When reading preprocess_3d_nuscenes.py, line 113, I saw that you use 'scene.ply' to process each scene in the nuScenes dataset.

However, I do not find 'scene.ply' in the original nuScenes dataset. Can you tell me how this file was generated?

In Table 1, does the labelset contain only four categories?

Hi, thanks for your work. I have a question about Table 1. When evaluating on ScanNet, do you set the labelset to 20 classes? Or do you set it to 4 classes (Bookshelf, Desk, Sofa, Toilet), so that all points in the scene are forced to be predicted as one of the four categories?

a question about voxelization

Is the voxelization process subject to any random factors? I noticed that the number of voxels generated is inconsistent for a given scene.

Matterport Evaluation Error

Hello, when I evaluate on the Matterport dataset, the following error is raised:

Traceback (most recent call last):
  File "run/evaluate_modify.py", line 409, in <module>
    main()
  File "run/evaluate_modify.py", line 138, in main
    main_worker(args.test_gpu, args.ngpus_per_node, args)
  File "run/evaluate_modify.py", line 234, in main_worker
    evaluate(model, val_loader, dataset_name)
  File "run/evaluate_modify.py", line 403, in evaluate
    accumu_iou = metric.evaluate(store_logit.numpy(),
  File "/home/qbw/research/multi-modal/openscene/util/metric.py", line 64, in evaluate
    confusion = confusion_matrix(pred_ids, gt_ids, N_CLASSES)
  File "/home/qbw/research/multi-modal/openscene/util/metric.py", line 23, in confusion_matrix
    return np.bincount(
ValueError: cannot reshape array of size 442 into shape (21,21)

I ran np.unique(gt_ids) and found that 21 appears in gt_ids:

np.unique(gt_ids)
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21])

This way, when computing the confusion matrix, pred_ids[idxs] * num_classes + gt_ids[idxs] falls in the range [0, 441], so the bincount has size 442 and cannot be reshaped into (21, 21).

Do you know the underlying reason? Thanks for your reply!

Working with Frame parameters for Replica feature fusion

Thank you for this great repository! The documentation is very clear, too.

I have a recording of a space with no semantic labels, so I am using the Replica dataset processing files. I was able to preprocess the 3D ply file. For the 2D preprocessing, I have all the frames, but the script also asks me for a trajectory.txt file. What is this file supposed to contain? The app I used to capture the data did not provide this file. Instead, for each frame I get a JSON whose structure is like this:

{
  "intrinsics" : [
    1327.18115234375,    0,     959.62261962890625,     0,     1327.18115234375,     720.123046875,
    0,     0,     1
  ],
  "averageAngularVelocity" : 0.83692288398742676,
  "motionQuality" : 0.76087915897369385,
  "frame_index" : 91,
  "exposureDuration" : 0.01666666753590107,
  "averageVelocity" : 0.41746380925178528,
  "cameraPoseARFrame" : [
    -0.35722801089286804,     0.7659410834312439,     -0.53453010320663452,     0.63373279571533203,
    -0.8567434549331665,     -0.040781203657388687,     0.51412779092788696,    0.10553276538848877,
    0.37199276685714722,     0.64161604642868042,     0.67078316211700439,     -1.9972312450408936,
    0,     0,     0,     1
  ],
  "cameraGrain" : 0,
  "time" : 98064.684786291691,
  "projectionMatrix" : [
    1.3824803829193115,     0,     -0.00012767314910888672,     0,     0,     1.8433071374893188,
    0.00086534023284912109,     0,     0,     0,     -0.9999997615814209,     -0.00099999981466680765, 
    0,     0,     -1,     0
  ]
}

Which of these parameters is supposed to be used to produce the trajectory.txt? Thanks in advance!
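In case it helps, here is a heavily hedged sketch of how one might assemble a trajectory.txt from these per-frame JSONs, assuming the preprocessing expects one flattened 4x4 camera-to-world pose per line (a common Replica convention, but verify this against the repo's Replica preprocessing script), and that cameraPoseARFrame is a row-major camera-to-world matrix in the ARKit convention (+y up, +z toward the viewer), which usually needs a y/z axis flip for OpenCV-style pipelines:

import glob
import json
import numpy as np

# ARKit camera axes (+x right, +y up, +z toward the viewer) -> OpenCV-style (+y down, +z forward)
flip_yz = np.diag([1.0, -1.0, -1.0, 1.0])

frames = []
for path in glob.glob('frames/*.json'):       # hypothetical location of the per-frame JSONs
    with open(path) as f:
        frames.append(json.load(f))
frames.sort(key=lambda m: m['frame_index'])   # order frames by their recorded index

with open('trajectory.txt', 'w') as out:
    for meta in frames:
        c2w = np.array(meta['cameraPoseARFrame']).reshape(4, 4)   # row-major camera-to-world pose
        pose = c2w @ flip_yz                                      # convert the camera axis convention
        out.write(' '.join(f'{x:.8f}' for x in pose.reshape(-1)) + '\n')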

Possible Filepath Issue in "download_dataset.sh"

While using the download_dataset.sh script to download the processed data, I encountered an issue with the file path specified on line 73.

The current file path is data/matterport_processed/nuscenes_2d.zip. However, considering the context and the associated datasets, I wonder if the correct path should instead be data/nuscenes_processed/nuscenes_2d.zip.

Training Time

Thanks for your wonderful work!
I use a 40 GB A100 to train on ScanNet with a batch size of 8, but the training takes about 2 days. Is this expected?
Besides, when I increase the batch size to 16 (with the learning rate also doubled), I only get 41.9 mIoU on ScanNet (46.0 in the paper). Have you tried a batch size of 16?
