
paper-daily-notice's Introduction

Get-daily-arxiv-notice

For unknown reasons, issues can no longer be updated: running the script locally returns a 401, and the GitHub Action also fails to update issues. The papers are still written to markdown, however. Each day's papers are saved in the Arxiv_Daily_Notice folder as xxxx-xx-xx-Arxiv-Daily-Paper.md, and the folder's README.md contains the most recently updated papers.

Also, if browsing through the growing list of papers here becomes cumbersome, you can read them on my personal homepage: https://zhuhu00.top/blog/, which is updated daily with arXiv papers related to SLAM and similar topics.

How to Use

  1. Fork this repository, then go to Settings -> Security -> Secrets -> Actions and create a repository secret named ISSUE_TOKEN. The token itself is a personal access token that you must first generate under your GitHub account; paste its value into the ISSUE_TOKEN secret (the sketch after this list shows how it is used).
  2. Edit config.py to set the repository name, your GitHub username, and so on; see the sections below for details.
  3. You can run the script locally first; once it succeeds, the GitHub Action will run it automatically every day.
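
For reference, the sketch below shows roughly how a day's filtered papers can be turned into a GitHub issue using the ISSUE_TOKEN secret. It is a minimal illustration based on the public GitHub REST API and the requests library, not this repository's exact implementation; the environment-variable name and the make_issue helper are assumptions.

import os
import requests

# Minimal sketch: file an issue through the GitHub REST API.
# Assumes the personal access token is exposed as the ISSUE_TOKEN environment variable
# (a GitHub Actions workflow can map the repository secret to it) and that repo_owner
# and repo_name match the REPO_OWNER / REPO_NAME values in config.py shown below.
def make_issue(title, body, repo_owner, repo_name):
    url = f"https://api.github.com/repos/{repo_owner}/{repo_name}/issues"
    headers = {"Authorization": f"token {os.environ['ISSUE_TOKEN']}"}
    response = requests.post(url, json={"title": title, "body": body}, headers=headers)
    response.raise_for_status()  # a 401 here usually means the token is missing or expired
    return response.json()["html_url"]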

You can receive a daily arXiv notification filtered by pre-defined keywords, as shown here.

Arxiv.org announces new submissions every day at a fixed time, as described here.

This repository makes it easy to filter new papers and follow up on those that match your interests by creating an issue in a GitHub repository.

Prerequisites

  • Python 3.x

Install the requirements with the commands below.

$ pip install --upgrade pip
$ pip install -r requirements.txt

Usage

1. Create a Repo

Create a repository in your GitHub account that will receive the notifications.

2. Set Config

Revise config.py according to your preferences (a keyword-matching sketch follows the example below).

# Authentication for user filing issue (must have read/write access to repository to add issue to)
USERNAME = 'changeme'

# The repository to add this issue to
REPO_OWNER = 'changeme'
REPO_NAME = 'changeme'

# New-submissions URL for the arXiv subject to monitor
NEW_SUB_URL = 'https://arxiv.org/list/cs/new'

# Keywords to search
KEYWORD_LIST = ["changeme"]
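
To illustrate how KEYWORD_LIST is meant to be applied, the sketch below groups already-scraped papers by keyword using a simple case-insensitive substring match. This is an illustrative example rather than the repository's actual parsing code; the paper dictionaries and the group_by_keyword helper are assumed shapes.

# Illustrative keyword filter (not the repository's exact code): given papers scraped
# from NEW_SUB_URL as dicts with 'title' and 'abstract' keys, bucket them per keyword.
def group_by_keyword(papers, keywords):
    results = {kw: [] for kw in keywords}
    for paper in papers:
        text = f"{paper['title']} {paper['abstract']}".lower()
        for kw in keywords:
            if kw.lower() in text:
                results[kw].append(paper)
    return results

# Example:
# papers = [{"title": "A General Framework for Lifelong Localization and Mapping ...",
#            "abstract": "... lifelong simultaneous localization and mapping (SLAM) ..."}]
# group_by_keyword(papers, ["SLAM", "lidar"])  # -> the paper is listed under "SLAM"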

3. Set Cronjob

You need to set up a cron job that runs the code every day to receive the daily notification.

Check the announcement schedule on arxiv.org and set up the cron job as shown below.

$ crontab -e
0 13 * * mon-fri python PATH-TO-CODE/get-daily-arxiv-noti/main.py

How should the scheduled task be configured?

On Ubuntu you can use crontab, which provides the same functionality.

Keep in mind the time zone that arXiv uses when picking the hour; the snippet below converts it to local time.
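
The snippet below is a small helper for that conversion. It assumes arXiv's daily announcement goes out around 20:00 US/Eastern (verify the current schedule on arxiv.org) and uses the zoneinfo module from Python 3.9+; the local_tz default is only an example.

from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# Assumption: arXiv announces new submissions around 20:00 US/Eastern on weekdays;
# check arxiv.org for the current schedule before relying on this.
def local_announcement_hour(local_tz="Asia/Shanghai", et_hour=20):
    et_now = datetime.now(ZoneInfo("America/New_York"))
    announcement = et_now.replace(hour=et_hour, minute=0, second=0, microsecond=0)
    return announcement.astimezone(ZoneInfo(local_tz)).hour

print(local_announcement_hour())  # schedule the cron job an hour or so after this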

paper-daily-notice's People

Contributors

github-actions[bot], zhuhu00


paper-daily-notice's Issues

New submissions for Tue, 23 Nov 21

Keyword: SLAM

A General Framework for Lifelong Localization and Mapping in Changing Environment

  • Authors: Min Zhao, Xin Guo, Le Song, Baoxing Qin, Xuesong Shi, Gim Hee Lee, Guanghui Sun
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.10946
  • Pdf link: https://arxiv.org/pdf/2111.10946
  • Abstract
    The environment of most real-world scenarios such as malls and supermarkets changes at all times. A pre-built map that does not account for these changes becomes out-of-date easily. Therefore, it is necessary to have an up-to-date model of the environment to facilitate long-term operation of a robot. To this end, this paper presents a general lifelong simultaneous localization and mapping (SLAM) framework. Our framework uses a multiple session map representation, and exploits an efficient map updating strategy that includes map building, pose graph refinement and sparsification. To mitigate the unbounded increase of memory usage, we propose a map-trimming method based on the Chow-Liu maximum-mutual-information spanning tree. The proposed SLAM framework has been comprehensively validated by over a month of robot deployment in real supermarket environment. Furthermore, we release the dataset collected from the indoor and outdoor changing environment with the hope to accelerate lifelong SLAM research in the community. Our dataset is available at https://github.com/sanduan168/lifelong-SLAM-dataset.

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

Sparse Tensor-based Multiscale Representation for Point Cloud Geometry Compression

  • Authors: Jianqiang Wang, Dandan Ding, Zhu Li, Xiaoxing Feng, Chuntong Cao, Zhan Ma
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2111.10633
  • Pdf link: https://arxiv.org/pdf/2111.10633
  • Abstract
    This study develops a unified Point Cloud Geometry (PCG) compression method through Sparse Tensor Processing (STP) based multiscale representation of voxelized PCG, dubbed as the SparsePCGC. Applying the STP reduces the complexity significantly because it only performs the convolutions centered at Most-Probable Positively-Occupied Voxels (MP-POV). And the multiscale representation facilitates us to compress scale-wise MP-POVs progressively. The overall compression efficiency highly depends on the approximation accuracy of occupancy probability of each MP-POV. Thus, we design the Sparse Convolution based Neural Networks (SparseCNN) consisting of sparse convolutions and voxel re-sampling to extensively exploit priors. We then develop the SparseCNN based Occupancy Probability Approximation (SOPA) model to estimate the occupancy probability in a single-stage manner only using the cross-scale prior or in multi-stage by step-wisely utilizing autoregressive neighbors. Besides, we also suggest the SparseCNN based Local Neighborhood Embedding (SLNE) to characterize the local spatial variations as the feature attribute to improve the SOPA. Our unified approach shows the state-of-art performance in both lossless and lossy compression modes across a variety of datasets including the dense PCGs (8iVFB, Owlii) and the sparse LiDAR PCGs (KITTI, Ford) when compared with the MPEG G-PCC and other popular learning-based compression schemes. Furthermore, the proposed method presents lightweight complexity due to point-wise computation, and tiny storage desire because of model sharing across all scales. We make all materials publicly accessible at https://github.com/NJUVISION/SparsePCGC for reproducible research.

A Gaussian Process-Based Ground Segmentation for Sloped Terrains

  • Authors: Pouria Mehrabi, Hamid D.Taghirad
  • Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2111.10638
  • Pdf link: https://arxiv.org/pdf/2111.10638
  • Abstract
    A Gaussian Process GP based ground segmentation method is proposed in this paper which is fully developed in a probabilistic framework. The proposed method tends to obtain a continuous realistic model of the ground. The LiDAR three-dimensional point cloud data is used as the sole source of the input data. The physical realities of the data are taken into account to properly classify sloped ground as well as the flat ones. Furthermore, unlike conventional ground segmentation methods, no height or distance constraints or limitations are required for the algorithm to be applied to take all the regarding physical behavior of the ground into account. Furthermore, a density-like parameter is defined to handle ground-like obstacle points in the ground candidate set. The non-stationary covariance kernel function is used for the Gaussian Process, by which Bayesian inference is applied using the maximum A Posteriori criterion. The log-marginal likelihood function is assumed to be a multi-task objective function, to represent a whole-frame unbiased view of the ground at each frame. Simulation results show the effectiveness of the proposed method even in an uneven, rough scene which outperforms similar Gaussian process-based ground segmentation methods.

Simulated LiDAR Repositioning: a novel point cloud data augmentation method

  • Authors: Xavier Morin-Duchesne (1), Michael S Langer (1) ((1) McGill University)
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.10650
  • Pdf link: https://arxiv.org/pdf/2111.10650
  • Abstract
    We address a data augmentation problem for LiDAR. Given a LiDAR scan of a scene from some position, how can one simulate new scans of that scene from different, secondary positions? The method defines criteria for selecting valid secondary positions, and then estimates which points from the original point cloud would be acquired by a scanner from these positions. We validate the method using synthetic scenes, and examine how the similarity of generated point clouds depends on scanner distance, occlusion, and angular resolution. We show that the method is more accurate at short distances, and that having a high scanner resolution for the original point clouds has a strong impact on the similarity of generated point clouds. We also demonstrate how the method can be applied to natural scene statistics: in particular, we apply our method to reposition the scanner horizontally and vertically, separately consider points belonging to the ground and to non-ground objects, and describe the impact on the distributions of distances to these two classes of points.

Self-Supervised Point Cloud Completion via Inpainting

  • Authors: Himangi Mittal, Brian Okorn, Arpit Jangid, David Held
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.10701
  • Pdf link: https://arxiv.org/pdf/2111.10701
  • Abstract
    When navigating in urban environments, many of the objects that need to be tracked and avoided are heavily occluded. Planning and tracking using these partial scans can be challenging. The aim of this work is to learn to complete these partial point clouds, giving us a full understanding of the object's geometry using only partial observations. Previous methods achieve this with the help of complete, ground-truth annotations of the target objects, which are available only for simulated datasets. However, such ground truth is unavailable for real-world LiDAR data. In this work, we present a self-supervised point cloud completion algorithm, PointPnCNet, which is trained only on partial scans without assuming access to complete, ground-truth annotations. Our method achieves this via inpainting. We remove a portion of the input data and train the network to complete the missing region. As it is difficult to determine which regions were occluded in the initial cloud and which were synthetically removed, our network learns to complete the full cloud, including the missing regions in the initial partial cloud. We show that our method outperforms previous unsupervised and weakly-supervised methods on both the synthetic dataset, ShapeNet, and real-world LiDAR dataset, Semantic KITTI.

Monocular Road Planar Parallax Estimation

  • Authors: Haobo Yuan, Teng Chen, Wei Sui, Jiafeng Xie, Lefei Zhang, Yuan Li, Qian Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.11089
  • Pdf link: https://arxiv.org/pdf/2111.11089
  • Abstract
    Estimating the 3D structure of the drivable surface and surrounding environment is a crucial task for assisted and autonomous driving. It is commonly solved either by using expensive 3D sensors such as LiDAR or directly predicting the depth of points via deep learning. Instead of following existing methodologies, we propose Road Planar Parallax Attention Network (RPANet), a new deep neural network for 3D sensing from monocular image sequences based on planar parallax, which takes full advantage of the commonly seen road plane geometry in driving scenes. RPANet takes a pair of images aligned by the homography of the road plane as input and outputs a $\gamma$ map for 3D reconstruction. Beyond estimating the depth or height, the $\gamma$ map has a potential to construct a two-dimensional transformation between two consecutive frames while can be easily derived to depth or height. By warping the consecutive frames using the road plane as a reference, the 3D structure can be estimated from the planar parallax and the residual image displacements. Furthermore, to make the network better perceive the displacements caused by planar parallax, we introduce a novel cross-attention module. We sample data from the Waymo Open Dataset and construct data related to planar parallax. Comprehensive experiments are conducted on the sampled dataset to demonstrate the 3D reconstruction accuracy of our approach in challenging scenarios.

Paris-CARLA-3D: A Real and Synthetic Outdoor Point Cloud Dataset for Challenging Tasks in 3D Mapping

  • Authors: Jean-Emmanuel Deschaud, David Duque, Jean Pierre Richa, Santiago Velasco-Forero, Beatriz Marcotegui, and François Goulette
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.11348
  • Pdf link: https://arxiv.org/pdf/2111.11348
  • Abstract
    Paris-CARLA-3D is a dataset of several dense colored point clouds of outdoor environments built by a mobile LiDAR and camera system. The data are composed of two sets with synthetic data from the open source CARLA simulator (700 million points) and real data acquired in the city of Paris (60 million points), hence the name Paris-CARLA-3D. One of the advantages of this dataset is to have simulated the same LiDAR and camera platform in the open source CARLA simulator as the one used to produce the real data. In addition, manual annotation of the classes using the semantic tags of CARLA was performed on the real data, allowing the testing of transfer methods from the synthetic to the real data. The objective of this dataset is to provide a challenging dataset to evaluate and improve methods on difficult vision tasks for the 3D mapping of outdoor environments: semantic segmentation, instance segmentation, and scene completion. For each task, we describe the evaluation protocol as well as the experiments carried out to establish a baseline.

Keyword: loop detection

There is no result

Keyword: autonomous driving

Backdoor Attack through Frequency Domain

  • Authors: Tong Wang, Yuan Yao, Feng Xu, Shengwei An, Ting Wang
  • Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.10991
  • Pdf link: https://arxiv.org/pdf/2111.10991
  • Abstract
    Backdoor attacks have been shown to be a serious threat against deep learning systems such as biometric authentication and autonomous driving. An effective backdoor attack could enforce the model misbehave under certain predefined conditions, i.e., triggers, but behave normally otherwise. However, the triggers of existing attacks are directly injected in the pixel space, which tend to be detectable by existing defenses and visually identifiable at both training and inference stages. In this paper, we propose a new backdoor attack FTROJAN through trojaning the frequency domain. The key intuition is that triggering perturbations in the frequency domain correspond to small pixel-wise perturbations dispersed across the entire image, breaking the underlying assumptions of existing defenses and making the poisoning images visually indistinguishable from clean ones. We evaluate FTROJAN in several datasets and tasks showing that it achieves a high attack success rate without significantly degrading the prediction accuracy on benign inputs. Moreover, the poisoning images are nearly invisible and retain high perceptual quality. We also evaluate FTROJAN against state-of-the-art defenses as well as several adaptive defenses that are designed on the frequency domain. The results show that FTROJAN can robustly elude or significantly degenerate the performance of these defenses.

Monocular Road Planar Parallax Estimation

  • Authors: Haobo Yuan, Teng Chen, Wei Sui, Jiafeng Xie, Lefei Zhang, Yuan Li, Qian Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.11089
  • Pdf link: https://arxiv.org/pdf/2111.11089
  • Abstract
    Estimating the 3D structure of the drivable surface and surrounding environment is a crucial task for assisted and autonomous driving. It is commonly solved either by using expensive 3D sensors such as LiDAR or directly predicting the depth of points via deep learning. Instead of following existing methodologies, we propose Road Planar Parallax Attention Network (RPANet), a new deep neural network for 3D sensing from monocular image sequences based on planar parallax, which takes full advantage of the commonly seen road plane geometry in driving scenes. RPANet takes a pair of images aligned by the homography of the road plane as input and outputs a $\gamma$ map for 3D reconstruction. Beyond estimating the depth or height, the $\gamma$ map has a potential to construct a two-dimensional transformation between two consecutive frames while can be easily derived to depth or height. By warping the consecutive frames using the road plane as a reference, the 3D structure can be estimated from the planar parallax and the residual image displacements. Furthermore, to make the network better perceive the displacements caused by planar parallax, we introduce a novel cross-attention module. We sample data from the Waymo Open Dataset and construct data related to planar parallax. Comprehensive experiments are conducted on the sampled dataset to demonstrate the 3D reconstruction accuracy of our approach in challenging scenarios.

Keyword: mapping

StylePart: Image-based Shape Part Manipulation

  • Authors: I-Chao Shen, Li-Wen Su, Yu-Ting Wu, Bing-Yu Chen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
  • Arxiv link: https://arxiv.org/abs/2111.10520
  • Pdf link: https://arxiv.org/pdf/2111.10520
  • Abstract
    Due to a lack of image-based "part controllers", shape manipulation of man-made shape images, such as resizing the backrest of a chair or replacing a cup handle is not intuitive because of the lack of image-based part controllers. To tackle this problem, we present StylePart, a framework that enables direct shape manipulation of an image by leveraging generative models of both images and 3D shapes. Our key contribution is a shape-consistent latent mapping function that connects the image generative latent space and the 3D man-made shape attribute latent space. Our method "forwardly maps" the image content to its corresponding 3D shape attributes, where the shape part can be easily manipulated. The attribute codes of the manipulated 3D shape are then "backwardly mapped" to the image latent code to obtain the final manipulated image. We demonstrate our approach through various manipulation tasks, including part replacement, part resizing, and viewpoint manipulation, and evaluate its effectiveness through extensive ablation studies.

Quality and Computation Time in Optimization Problems

  • Authors: Zhicheng He
  • Subjects: Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.10595
  • Pdf link: https://arxiv.org/pdf/2111.10595
  • Abstract
    Optimization problems are crucial in artificial intelligence. Optimization algorithms are generally used to adjust the performance of artificial intelligence models to minimize the error of mapping inputs to outputs. Current evaluation methods on optimization algorithms generally consider the performance in terms of quality. However, not all optimization algorithms for all test cases are evaluated equal from quality, the computation time should be also considered for optimization tasks. In this paper, we investigate the quality and computation time of optimization algorithms in optimization problems, instead of the one-for-all evaluation of quality. We select the well-known optimization algorithms (Bayesian optimization and evolutionary algorithms) and evaluate them on the benchmark test functions in terms of quality and computation time. The results show that BO is suitable to be applied in the optimization tasks that are needed to obtain desired quality in the limited function evaluations, and the EAs are suitable to search the optimal of the tasks that are allowed to find the optimal solution with enough function evaluations. This paper provides the recommendation to select suitable optimization algorithms for optimization problems with different numbers of function evaluations, which contributes to the efficiency that obtains the desired quality with less computation time for optimization problems.

A General Framework for Lifelong Localization and Mapping in Changing Environment

  • Authors: Min Zhao, Xin Guo, Le Song, Baoxing Qin, Xuesong Shi, Gim Hee Lee, Guanghui Sun
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.10946
  • Pdf link: https://arxiv.org/pdf/2111.10946
  • Abstract
    The environment of most real-world scenarios such as malls and supermarkets changes at all times. A pre-built map that does not account for these changes becomes out-of-date easily. Therefore, it is necessary to have an up-to-date model of the environment to facilitate long-term operation of a robot. To this end, this paper presents a general lifelong simultaneous localization and mapping (SLAM) framework. Our framework uses a multiple session map representation, and exploits an efficient map updating strategy that includes map building, pose graph refinement and sparsification. To mitigate the unbounded increase of memory usage, we propose a map-trimming method based on the Chow-Liu maximum-mutual-information spanning tree. The proposed SLAM framework has been comprehensively validated by over a month of robot deployment in real supermarket environment. Furthermore, we release the dataset collected from the indoor and outdoor changing environment with the hope to accelerate lifelong SLAM research in the community. Our dataset is available at https://github.com/sanduan168/lifelong-SLAM-dataset.

Auto-Encoding Score Distribution Regression for Action Quality Assessment

  • Authors: Boyu Zhang, Jiayuan Chen, Yinfei Xu, Hui Zhang, Xu Yang, Xin Geng
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.11029
  • Pdf link: https://arxiv.org/pdf/2111.11029
  • Abstract
    Action quality assessment (AQA) from videos is a challenging vision task since the relation between videos and action scores is difficult to model. Thus, action quality assessment has been widely studied in the literature. Traditionally, AQA task is treated as a regression problem to learn the underlying mappings between videos and action scores. More recently, the method of uncertainty score distribution learning (USDL) made success due to the introduction of label distribution learning (LDL). But USDL does not apply to dataset with continuous labels and needs a fixed variance in training. In this paper, to address the above problems, we further develop Distribution Auto-Encoder (DAE). DAE takes both advantages of regression algorithms and label distribution learning (LDL).Specifically, it encodes videos into distributions and uses the reparameterization trick in variational auto-encoders (VAE) to sample scores, which establishes a more accurate mapping between videos and scores. Meanwhile, a combined loss is constructed to accelerate the training of DAE. DAE-MT is further proposed to deal with AQA on multi-task datasets. We evaluate our DAE approach on MTL-AQA and JIGSAWS datasets. Experimental results on public datasets demonstrate that our method achieves state-of-the-arts under the Spearman's Rank Correlation: 0.9449 on MTL-AQA and 0.73 on JIGSAWS.

Cycle Consistent Probability Divergences Across Different Spaces

  • Authors: Zhengxin Zhang, Youssef Mroueh, Ziv Goldfeld, Bharath K. Sriperumbudur
  • Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2111.11328
  • Pdf link: https://arxiv.org/pdf/2111.11328
  • Abstract
    Discrepancy measures between probability distributions are at the core of statistical inference and machine learning. In many applications, distributions of interest are supported on different spaces, and yet a meaningful correspondence between data points is desired. Motivated to explicitly encode consistent bidirectional maps into the discrepancy measure, this work proposes a novel unbalanced Monge optimal transport formulation for matching, up to isometries, distributions on different spaces. Our formulation arises as a principled relaxation of the Gromov-Haussdroff distance between metric spaces, and employs two cycle-consistent maps that push forward each distribution onto the other. We study structural properties of the proposed discrepancy and, in particular, show that it captures the popular cycle-consistent generative adversarial network (GAN) framework as a special case, thereby providing the theory to explain it. Motivated by computational efficiency, we then kernelize the discrepancy and restrict the mappings to parametric function classes. The resulting kernelized version is coined the generalized maximum mean discrepancy (GMMD). Convergence rates for empirical estimation of GMMD are studied and experiments to support our theory are provided.

Paris-CARLA-3D: A Real and Synthetic Outdoor Point Cloud Dataset for Challenging Tasks in 3D Mapping

  • Authors: Jean-Emmanuel Deschaud, David Duque, Jean Pierre Richa, Santiago Velasco-Forero, Beatriz Marcotegui, and François Goulette
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.11348
  • Pdf link: https://arxiv.org/pdf/2111.11348
  • Abstract
    Paris-CARLA-3D is a dataset of several dense colored point clouds of outdoor environments built by a mobile LiDAR and camera system. The data are composed of two sets with synthetic data from the open source CARLA simulator (700 million points) and real data acquired in the city of Paris (60 million points), hence the name Paris-CARLA-3D. One of the advantages of this dataset is to have simulated the same LiDAR and camera platform in the open source CARLA simulator as the one used to produce the real data. In addition, manual annotation of the classes using the semantic tags of CARLA was performed on the real data, allowing the testing of transfer methods from the synthetic to the real data. The objective of this dataset is to provide a challenging dataset to evaluate and improve methods on difficult vision tasks for the 3D mapping of outdoor environments: semantic segmentation, instance segmentation, and scene completion. For each task, we describe the evaluation protocol as well as the experiments carried out to establish a baseline.

Analysis of Exploration vs. Exploitation in Adaptive Information Sampling

  • Authors: Aiman Munir, Ramviyas Parasuraman
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.11384
  • Pdf link: https://arxiv.org/pdf/2111.11384
  • Abstract
    Adaptive information sampling approaches enable efficient selection of mobile robot's waypoints through which accurate sensing and mapping of a physical process, such as the radiation or field intensity, can be obtained. This paper analyzes the role of exploration and exploitation in such information-theoretic spatial sampling of the environmental processes. We use Gaussian processes to predict and estimate predictions with confidence bounds, thereby determining each point's informativeness in terms of exploration and exploitation. Specifically, we use a Gaussian process regression model to sample the Wi-Fi signal strength of the environment. For different variants of the informative function, we extensively analyze and evaluate the effectiveness and efficiency of information mapping through two different initial trajectories in both single robot and multi-robot settings. The results provide meaningful insights in choosing appropriate information function based on sampling objectives.

Neural Fields in Visual Computing and Beyond

  • Authors: Yiheng Xie, Towaki Takikawa, Shunsuke Saito, Or Litany, Shiqin Yan, Numair Khan, Federico Tombari, James Tompkin, Vincent Sitzmann, Srinath Sridhar
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.11426
  • Pdf link: https://arxiv.org/pdf/2111.11426
  • Abstract
    Recent advances in machine learning have created increasing interest in solving visual computing problems using a class of coordinate-based neural networks that parametrize physical properties of scenes or objects across space and time. These methods, which we call neural fields, have seen successful application in the synthesis of 3D shapes and image, animation of human bodies, 3D reconstruction, and pose estimation. However, due to rapid progress in a short time, many papers exist but a comprehensive review and formulation of the problem has not yet emerged. In this report, we address this limitation by providing context, mathematical grounding, and an extensive review of literature on neural fields. This report covers research along two dimensions. In Part I, we focus on techniques in neural fields by identifying common components of neural field methods, including different representations, architectures, forward mapping, and generalization methods. In Part II, we focus on applications of neural fields to different problems in visual computing, and beyond (e.g., robotics, audio). Our review shows the breadth of topics already covered in visual computing, both historically and in current incarnations, demonstrating the improved quality, flexibility, and capability brought by neural fields methods. Finally, we present a companion website that contributes a living version of this review that can be continually updated by the community.

Florence: A New Foundation Model for Computer Vision

  • Authors: Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.11432
  • Pdf link: https://arxiv.org/pdf/2111.11432
  • Abstract
    Automated visual understanding of our diverse and open world demands computer vision models to generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale dataset and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted for various computer vision tasks, such as classification, retrieval, object detection, VQA, image caption, video retrieval and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer and zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general purpose vision tasks. Florence achieves new state-of-the-art results in majority of 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with top-1 accuracy of 83.74 and the top-5 accuracy of 97.18, 62.4 mAP on COCO fine tuning, 80.36 on VQA, and 87.8 on Kinetics-600.

Keyword: localization

A General Framework for Lifelong Localization and Mapping in Changing Environment

  • Authors: Min Zhao, Xin Guo, Le Song, Baoxing Qin, Xuesong Shi, Gim Hee Lee, Guanghui Sun
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.10946
  • Pdf link: https://arxiv.org/pdf/2111.10946
  • Abstract
    The environment of most real-world scenarios such as malls and supermarkets changes at all times. A pre-built map that does not account for these changes becomes out-of-date easily. Therefore, it is necessary to have an up-to-date model of the environment to facilitate long-term operation of a robot. To this end, this paper presents a general lifelong simultaneous localization and mapping (SLAM) framework. Our framework uses a multiple session map representation, and exploits an efficient map updating strategy that includes map building, pose graph refinement and sparsification. To mitigate the unbounded increase of memory usage, we propose a map-trimming method based on the Chow-Liu maximum-mutual-information spanning tree. The proposed SLAM framework has been comprehensively validated by over a month of robot deployment in real supermarket environment. Furthermore, we release the dataset collected from the indoor and outdoor changing environment with the hope to accelerate lifelong SLAM research in the community. Our dataset is available at https://github.com/sanduan168/lifelong-SLAM-dataset.

Extending the Dissipating Energy Flow Method to Flexible AC Transmission Systems

  • Authors: Kaustav Chatterjee, Sayan Samanta, Nilanjan Ray Chaudhuri
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2111.10960
  • Pdf link: https://arxiv.org/pdf/2111.10960
  • Abstract
    In recent years, the dissipating energy flow (DEF) method has emerged as a promising tool for online localization of oscillation sources. In literature, the mathematical foundations of this method are well-studied for networks with synchronous generators. In this paper, we extend the analysis to flexible AC transmission systems (FACTS). To that end, we derive the DEF-expressions for a thyristor controlled series capacitor (TCSC) and a static synchronous compensator (STATCOM) operating with conventional control strategies. Analyzing their respective DEFs, we obtain the conditions for which these FACTS devices could behave as the sources of oscillation energy. Our findings are structured into propositions and are supported through numerical case studies on IEEE test systems.

CDistNet: Perceiving Multi-Domain Character Distance for Robust Text Recognition

  • Authors: Tianlun Zheng, Zhineng Chen, Shancheng Fang, Hongtao Xie, Yu-Gang Jiang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.11011
  • Pdf link: https://arxiv.org/pdf/2111.11011
  • Abstract
    The attention-based encoder-decoder framework is becoming popular in scene text recognition, largely due to its superiority in integrating recognition clues from both visual and semantic domains. However, recent studies show the two clues might be misaligned in the difficult text (e.g., with rare text shapes) and introduce constraints such as character position to alleviate the problem. Despite certain success, a content-free positional embedding hardly associates with meaningful local image regions stably. In this paper, we propose a novel module called Multi-Domain Character Distance Perception (MDCDP) to establish a visual and semantic related position encoding. MDCDP uses positional embedding to query both visual and semantic features following the attention mechanism. It naturally encodes the positional clue, which describes both visual and semantic distances among characters. We develop a novel architecture named CDistNet that stacks MDCDP several times to guide precise distance modeling. Thus, the visual-semantic alignment is well built even various difficulties presented. We apply CDistNet to two augmented datasets and six public benchmarks. The experiments demonstrate that CDistNet achieves state-of-the-art recognition accuracy. While the visualization also shows that CDistNet achieves proper attention localization in both visual and semantic domains. We will release our code upon acceptance.

New submissions for Wed, 30 Jun 21

Keyword: SLAM

There is no result

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

Multi-Sensor Fusion based Robust Row Following for Compact Agricultural Robots

  • Authors: Andres Eduardo Baquero Velasquez, Vitor Akihiro Hisano Higuti, Mateus Valverde Gasparino, Arun Narenthiran Sivakumar, Marcelo Becker, Girish Chowdhary
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.15029
  • Pdf link: https://arxiv.org/pdf/2106.15029
  • Abstract
    This paper presents a state-of-the-art LiDAR based autonomous navigation system for under-canopy agricultural robots. Under-canopy agricultural navigation has been a challenging problem because GNSS and other positioning sensors are prone to significant errors due to attentuation and multi-path caused by crop leaves and stems. Reactive navigation by detecting crop rows using LiDAR measurements is a better alternative to GPS but suffers from challenges due to occlusion from leaves under the canopy. Our system addresses this challenge by fusing IMU and LiDAR measurements using an Extended Kalman Filter framework on low-cost hardwware. In addition, a local goal generator is introduced to provide locally optimal reference trajectories to the onboard controller. Our system is validated extensively in real-world field environments over a distance of 50.88km on multiple robots in different field conditions across different locations. We report state-of-the-art distance between intervention results, showing that our system is able to safely navigate without interventions for 386.9m on average in fields without significant gaps in the crop rows, 56.1m in production fields and 47.5m in fields with gaps (space of 1~m without plants in both sides of the row).

Perception-aware Multi-sensor Fusion for 3D LiDAR Semantic Segmentation

  • Authors: Zhuangwei Zhuang, Rong Li, Yuanqing Li, Kui Jia, Qicheng Wang, Mingkui Tan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.15277
  • Pdf link: https://arxiv.org/pdf/2106.15277
  • Abstract
    3D LiDAR (light detection and ranging) based semantic segmentation is important in scene understanding for many applications, such as auto-driving and robotics. For example, for autonomous cars equipped with RGB cameras and LiDAR, it is crucial to fuse complementary information from different sensors for robust and accurate segmentation. Existing fusion-based methods, however, may not achieve promising performance due to the vast difference between two modalities. In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF) to exploit perceptual information from two modalities, namely, appearance information from RGB images and spatio-depth information from point clouds. To this end, we first project point clouds to the camera coordinates to provide spatio-depth information for RGB images. Then, we propose a two-stream network to extract features from the two modalities, separately, and fuse the features by effective residual-based fusion modules. Moreover, we propose additional perception-aware losses to measure the great perceptual difference between the two modalities. Extensive experiments on two benchmark data sets show the superiority of our method. For example, on nuScenes, our PMF outperforms the state-of-the-art method by 0.8% in mIoU.

Keyword: loop detection

There is no result

Keyword: autonomous driving

Symbiotic Sensing and Communications Towards 6G: Vision, Applications, and Technology Trends

  • Authors: Zhiqin Wang, Kaifeng Han, Jiamo Jiang, Zhiqing Wei, Guangxu Zhu, Zhiyong Feng, Jianmin Lu, Chunwei Meng
  • Subjects: Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
  • Arxiv link: https://arxiv.org/abs/2106.15197
  • Pdf link: https://arxiv.org/pdf/2106.15197
  • Abstract
    Driven by the vision of intelligent connection of everything and digital twin towards 6G, a myriad of new applications, such as immersive extended reality, autonomous driving, holographic communications, intelligent industrial internet, will emerge in the near future, holding the promise to revolutionize the way we live and work. These trends inspire a novel technical design principle that seamlessly integrates two originally decoupled functionalities, i.e., wireless communication and sensing, into one system in a symbiotic way, which is dubbed symbiotic sensing and communications (SSaC), to endow the wireless network with the capability to "see" and "talk" to the physical world simultaneously. Noting that the term SSaC is used instead of ISAC (integrated sensing and communications) because the word ``symbiotic/symbiosis" is more inclusive and can better accommodate different integration levels and evolution stages of sensing and communications. Aligned with this understanding, this article makes the first attempts to clarify the concept of SSaC, illustrate its vision, envision the three-stage evolution roadmap, namely neutralism, commensalism, and mutualism of SaC. Then, three categories of applications of SSaC are introduced, followed by detailed description of typical use cases in each category. Finally, we summarize the major performance metrics and key enabling technologies for SSaC.

Autonomous Driving Implementation in an Experimental Environment

  • Authors: Namig Aliyev, Oguzhan Sezer, Mehmet Turan Guzel
  • Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2106.15274
  • Pdf link: https://arxiv.org/pdf/2106.15274
  • Abstract
    Autonomous systems require identifying the environment and it has a long way to go before putting it safely into practice. In autonomous driving systems, the detection of obstacles and traffic lights are of importance as well as lane tracking. In this study, an autonomous driving system is developed and tested in the experimental environment designed for this purpose. In this system, a model vehicle having a camera is used to trace the lanes and avoid obstacles to experimentally study autonomous driving behavior. Convolutional Neural Network models were trained for Lane tracking. For the vehicle to avoid obstacles, corner detection, optical flow, focus of expansion, time to collision, balance calculation, and decision mechanism were created, respectively.

Towards Fast and Accurate Multi-Person Pose Estimation on Mobile Devices

  • Authors: Xuan Shen, Geng Yuan, Wei Niu, Xiaolong Ma, Jiexiong Guan, Zhengang Li, Bin Ren, Yanzhi Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.15304
  • Pdf link: https://arxiv.org/pdf/2106.15304
  • Abstract
    The rapid development of autonomous driving, abnormal behavior detection, and behavior recognition makes an increasing demand for multi-person pose estimation-based applications, especially on mobile platforms. However, to achieve high accuracy, state-of-the-art methods tend to have a large model size and complex post-processing algorithm, which costs intense computation and long end-to-end latency. To solve this problem, we propose an architecture optimization and weight pruning framework to accelerate inference of multi-person pose estimation on mobile devices. With our optimization framework, we achieve up to 2.51x faster model inference speed with higher accuracy compared to representative lightweight multi-person pose estimator.

Deep Learning for Multi-View Stereo via Plane Sweep: A Survey

  • Authors: Qingtian Zhu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.15328
  • Pdf link: https://arxiv.org/pdf/2106.15328
  • Abstract
    3D reconstruction has lately attracted increasing attention due to its wide application in many areas, such as autonomous driving, robotics and virtual reality. As a dominant technique in artificial intelligence, deep learning has been successfully adopted to solve various computer vision problems. However, deep learning for 3D reconstruction is still at its infancy due to its unique challenges and varying pipelines. To stimulate future research, this paper presents a review of recent progress in deep learning methods for Multi-view Stereo (MVS), which is considered as a crucial task of image-based 3D reconstruction. It also presents comparative results on several publicly available datasets, with insightful observations and inspiring future research directions.

New submissions for Mon, 8 Nov 21

Keyword: SLAM

MSC-VO: Exploiting Manhattan and Structural Constraints for Visual Odometry

  • Authors: Joan P. Company-Corcoles, Emilio Garcia-Fidalgo, Alberto Ortiz
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.03408
  • Pdf link: https://arxiv.org/pdf/2111.03408
  • Abstract
    Visual odometry algorithms tend to degrade when facing low-textured scenes -from e.g. human-made environments-, where it is often difficult to find a sufficient number of point features. Alternative geometrical visual cues, such as lines, which can often be found within these scenarios, can become particularly useful. Moreover, these scenarios typically present structural regularities, such as parallelism or orthogonality, and hold the Manhattan World assumption. Under these premises, in this work, we introduce MSC-VO, an RGB-D -based visual odometry approach that combines both point and line features and leverages, if exist, those structural regularities and the Manhattan axes of the scene. Within our approach, these structural constraints are initially used to estimate accurately the 3D position of the extracted lines. These constraints are also combined next with the estimated Manhattan axes and the reprojection errors of points and lines to refine the camera pose by means of local map optimization. Such a combination enables our approach to operate even in the absence of the aforementioned constraints, allowing the method to work for a wider variety of scenarios. Furthermore, we propose a novel multi-view Manhattan axes estimation procedure that mainly relies on line features. MSC-VO is assessed using several public datasets, outperforming other state-of-the-art solutions, and comparing favourably even with some SLAM methods.

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

Compressing Sensor Data for Remote Assistance of Autonomous Vehicles using Deep Generative Models

  • Authors: Daniel Bogdoll, Johannes Jestram, Jonas Rauch, Christin Scheib, Moritz Wittig, J. Marius Zöllner
  • Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2111.03201
  • Pdf link: https://arxiv.org/pdf/2111.03201
  • Abstract
    In the foreseeable future, autonomous vehicles will require human assistance in situations they can not resolve on their own. In such scenarios, remote assistance from a human can provide the required input for the vehicle to continue its operation. Typical sensors used in autonomous vehicles include camera and lidar sensors. Due to the massive volume of sensor data that must be sent in real-time, highly efficient data compression is elementary to prevent an overload of network infrastructure. Sensor data compression using deep generative neural networks has been shown to outperform traditional compression approaches for both image and lidar data, regarding compression rate as well as reconstruction quality. However, there is a lack of research about the performance of generative-neural-network-based compression algorithms for remote assistance. In order to gain insights into the feasibility of deep generative models for usage in remote assistance, we evaluate state-of-the-art algorithms regarding their applicability and identify potential weaknesses. Further, we implement an online pipeline for processing sensor data and demonstrate its performance for remote assistance using the CARLA simulator.

LiODOM: Adaptive Local Mapping for Robust LiDAR-Only Odometry

  • Authors: Emilio Garcia-Fidalgo, Joan P. Company-Corcoles, Francisco Bonnin-Pascual, Alberto Ortiz
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.03393
  • Pdf link: https://arxiv.org/pdf/2111.03393
  • Abstract
    In the last decades, Light Detection And Ranging (LiDAR) technology has been extensively explored as a robust alternative for self-localization and mapping. These approaches typically state ego-motion estimation as a non-linear optimization problem dependent on the correspondences established between the current point cloud and a map, whatever its scope, local or global. This paper proposes LiODOM, a novel LiDAR-only ODOmetry and Mapping approach for pose estimation and map-building, based on minimizing a loss function derived from a set of weighted point-to-line correspondences with a local map abstracted from the set of available point clouds. Furthermore, this work places a particular emphasis on map representation given its relevance for quick data association. To efficiently represent the environment, we propose a data structure that combined with a hashing scheme allows for fast access to any section of the map. LiODOM is validated by means of a set of experiments on public datasets, for which it compares favourably against other solutions. Its performance on-board an aerial platform is also reported.

Keyword: loop detection

There is no result

Keyword: autonomous driving

FBNet: Feature Balance Network for Urban-Scene Segmentation

  • Authors: Lei Gan, Huabin Huang, Banghuai Li, Ye Yuan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.03286
  • Pdf link: https://arxiv.org/pdf/2111.03286
  • Abstract
    Image segmentation in the urban scene has recently attracted much attention due to its success in autonomous driving systems. However, the poor performance of concerned foreground targets, e.g., traffic lights and poles, still limits its further practical applications. In urban scenes, foreground targets are always concealed in their surrounding stuff because of the special camera position and 3D perspective projection. What's worse, it exacerbates the unbalance between foreground and background classes in high-level features due to the continuous expansion of the reception field. We call it Feature Camouflage. In this paper, we present a novel add-on module, named Feature Balance Network (FBNet), to eliminate the feature camouflage in urban-scene segmentation. FBNet consists of two key components, i.e., Block-wise BCE(BwBCE) and Dual Feature Modulator(DFM). BwBCE serves as an auxiliary loss to ensure uniform gradients for foreground classes and their surroundings during backpropagation. At the same time, DFM intends to enhance the deep representation of foreground classes in high-level features adaptively under the supervision of BwBCE. These two modules facilitate each other as a whole to ease feature camouflage effectively. Our proposed method achieves a new state-of-the-art segmentation performance on two challenging urban-scene benchmarks, i.e., Cityscapes and BDD100K. Code will be released for reproduction.

Disengagement Cause-and-Effect Relationships Extraction Using an NLP Pipeline

  • Authors: Yangtao Zhang, X. Jessie Yang, Feng Zhou
  • Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.03511
  • Pdf link: https://arxiv.org/pdf/2111.03511
  • Abstract
    The advancement in machine learning and artificial intelligence is promoting the testing and deployment of autonomous vehicles (AVs) on public roads. The California Department of Motor Vehicles (CA DMV) has launched the Autonomous Vehicle Tester Program, which collects and releases reports related to Autonomous Vehicle Disengagement (AVD) from autonomous driving. Understanding the causes of AVD is critical to improving the safety and stability of the AV system and provide guidance for AV testing and deployment. In this work, a scalable end-to-end pipeline is constructed to collect, process, model, and analyze the disengagement reports released from 2014 to 2020 using natural language processing deep transfer learning. The analysis of disengagement data using taxonomy, visualization and statistical tests revealed the trends of AV testing, categorized cause frequency, and significant relationships between causes and effects of AVD. We found that (1) manufacturers tested AVs intensively during the Spring and/or Winter, (2) test drivers initiated more than 80% of the disengagement while more than 75% of the disengagement were led by errors in perception, localization & mapping, planning and control of the AV system itself, and (3) there was a significant relationship between the initiator of AVD and the cause category. This study serves as a successful practice of deep transfer learning using pre-trained models and generates a consolidated disengagement database allowing further investigation for other researchers.

Keyword: mapping

GraN-GAN: Piecewise Gradient Normalization for Generative Adversarial Networks

  • Authors: Vineeth S. Bhaskara, Tristan Aumentado-Armstrong, Allan Jepson, Alex Levinshtein
  • Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2111.03162
  • Pdf link: https://arxiv.org/pdf/2111.03162
  • Abstract
    Modern generative adversarial networks (GANs) predominantly use piecewise linear activation functions in discriminators (or critics), including ReLU and LeakyReLU. Such models learn piecewise linear mappings, where each piece handles a subset of the input space, and the gradients per subset are piecewise constant. Under such a class of discriminator (or critic) functions, we present Gradient Normalization (GraN), a novel input-dependent normalization method, which guarantees a piecewise K-Lipschitz constraint in the input space. In contrast to spectral normalization, GraN does not constrain processing at the individual network layers, and, unlike gradient penalties, strictly enforces a piecewise Lipschitz constraint almost everywhere. Empirically, we demonstrate improved image generation performance across multiple datasets (incl. CIFAR-10/100, STL-10, LSUN bedrooms, and CelebA), GAN loss functions, and metrics. Further, we analyze altering the often untuned Lipschitz constant K in several standard GANs, not only attaining significant performance gains, but also finding connections between K and training dynamics, particularly in low-gradient loss plateaus, with the common Adam optimizer.

Analysis of Sensing Spectral for Signal Recovery Under a Generalized Linear Model

  • Authors: Junjie Ma, Ji Xu, Arian Maleki
  • Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.03237
  • Pdf link: https://arxiv.org/pdf/2111.03237
  • Abstract
    We consider a nonlinear inverse problem $\mathbf{y}= f(\mathbf{Ax})$, where observations $\mathbf{y} \in \mathbb{R}^m$ are the componentwise nonlinear transformation of $\mathbf{Ax} \in \mathbb{R}^m$, $\mathbf{x} \in \mathbb{R}^n$ is the signal of interest and $\mathbf{A}$ is a known linear mapping. By properly specifying the nonlinear processing function, this model can be particularized to many signal processing problems, including compressed sensing and phase retrieval. Our main goal in this paper is to understand the impact of sensing matrices, or more specifically the spectrum of sensing matrices, on the difficulty of recovering $\mathbf{x}$ from $\mathbf{y}$. Towards this goal, we study the performance of one of the most successful recovery methods, i.e. the expectation propagation algorithm (EP). We define a notion for the spikiness of the spectrum of $\mathbf{A}$ and show the importance of this measure in the performance of the EP. Whether the spikiness of the spectrum can hurt or help the recovery performance of EP depends on $f$. We define certain quantities based on the function $f$ that enables us to describe the impact of the spikiness of the spectrum on EP recovery. Based on our framework, we are able to show that for instance, in phase-retrieval problems, matrices with spikier spectrums are better for EP, while in 1-bit compressed sensing problems, less spiky (flatter) spectrums offer better recoveries. Our results unify and substantially generalize the existing results that compare sub-Gaussian and orthogonal matrices, and provide a platform toward designing optimal sensing systems.

Federated Learning Attacks Revisited: A Critical Discussion of Gaps, Assumptions, and Evaluation Setups

  • Authors: Aidmar Wainakh, Ephraim Zimmer, Sandeep Subedi, Jens Keim, Tim Grube, Shankar Karuppayah, Alejandro Sanchez Guinea, Max Mühlhäuser
  • Subjects: Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2111.03363
  • Pdf link: https://arxiv.org/pdf/2111.03363
  • Abstract
    Federated learning (FL) enables a set of entities to collaboratively train a machine learning model without sharing their sensitive data, thus, mitigating some privacy concerns. However, an increasing number of works in the literature propose attacks that can manipulate the model and disclose information about the training data in FL. As a result, there has been a growing belief in the research community that FL is highly vulnerable to a variety of severe attacks. Although these attacks do indeed highlight security and privacy risks in FL, some of them may not be as effective in production deployment because they are feasible only under special -- sometimes impractical -- assumptions. Furthermore, some attacks are evaluated under limited setups that may not match real-world scenarios. In this paper, we investigate this issue by conducting a systematic mapping study of attacks against FL, covering 48 relevant papers from 2016 to the third quarter of 2021. On the basis of this study, we provide a quantitative analysis of the proposed attacks and their evaluation settings. This analysis reveals several research gaps with regard to the type of target ML models and their architectures. Additionally, we highlight unrealistic assumptions in the problem settings of some attacks, related to the hyper-parameters of the ML model and data distribution among clients. Furthermore, we identify and discuss several fallacies in the evaluation of attacks, which open up questions on the generalizability of the conclusions. As a remedy, we propose a set of recommendations to avoid these fallacies and to promote adequate evaluations.

LiODOM: Adaptive Local Mapping for Robust LiDAR-Only Odometry

  • Authors: Emilio Garcia-Fidalgo, Joan P. Company-Corcoles, Francisco Bonnin-Pascual, Alberto Ortiz
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.03393
  • Pdf link: https://arxiv.org/pdf/2111.03393
  • Abstract
    In the last decades, Light Detection And Ranging (LiDAR) technology has been extensively explored as a robust alternative for self-localization and mapping. These approaches typically state ego-motion estimation as a non-linear optimization problem dependent on the correspondences established between the current point cloud and a map, whatever its scope, local or global. This paper proposes LiODOM, a novel LiDAR-only ODOmetry and Mapping approach for pose estimation and map-building, based on minimizing a loss function derived from a set of weighted point-to-line correspondences with a local map abstracted from the set of available point clouds. Furthermore, this work places a particular emphasis on map representation given its relevance for quick data association. To efficiently represent the environment, we propose a data structure that combined with a hashing scheme allows for fast access to any section of the map. LiODOM is validated by means of a set of experiments on public datasets, for which it compares favourably against other solutions. Its performance on-board an aerial platform is also reported.
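
A generic voxel-hash sketch of the kind of hashed map structure the abstract alludes to; the class name, voxel size, and lookup scheme are assumptions, not LiODOM's actual implementation.

```python
from collections import defaultdict
import numpy as np

class VoxelHashMap:
    """Generic voxel-hash local map: points are bucketed by the integer voxel
    index of their coordinates, so any section of the map is reachable in O(1)
    for data association."""

    def __init__(self, voxel_size=0.5):
        self.voxel_size = voxel_size
        self.voxels = defaultdict(list)

    def _key(self, point):
        return tuple(np.floor(np.asarray(point) / self.voxel_size).astype(int))

    def insert(self, points):
        for p in points:
            self.voxels[self._key(p)].append(np.asarray(p, dtype=float))

    def neighbors(self, query):
        # points stored in the voxel containing `query`
        return self.voxels.get(self._key(query), [])
```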

Meta-Forecasting by combining Global Deep Representations with Local Adaptation

  • Authors: Riccardo Grazzi, Valentin Flunkert, David Salinas, Tim Januschowski, Matthias Seeger, Cedric Archambeau
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2111.03418
  • Pdf link: https://arxiv.org/pdf/2111.03418
  • Abstract
    While classical time series forecasting considers individual time series in isolation, recent advances based on deep learning showed that jointly learning from a large pool of related time series can boost the forecasting accuracy. However, the accuracy of these methods suffers greatly when modeling out-of-sample time series, significantly limiting their applicability compared to classical forecasting methods. To bridge this gap, we adopt a meta-learning view of the time series forecasting problem. We introduce a novel forecasting method, called Meta Global-Local Auto-Regression (Meta-GLAR), that adapts to each time series by learning in closed-form the mapping from the representations produced by a recurrent neural network (RNN) to one-step-ahead forecasts. Crucially, the parameters of the RNN are learned across multiple time series by backpropagating through the closed-form adaptation mechanism. In our extensive empirical evaluation we show that our method is competitive with the state-of-the-art in out-of-sample forecasting accuracy reported in earlier work.
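
The closed-form local adaptation can be pictured as a ridge-regression read-out fitted per series on top of the shared RNN representations; the sketch below makes that simplifying assumption (the paper additionally backpropagates through this step to train the RNN).

```python
import numpy as np

def closed_form_adaptation(H, y, lam=1e-2):
    # H: (T, d) per-step representations from the shared RNN for one series
    # y: (T,)  observed one-step-ahead targets for the same series
    d = H.shape[1]
    w = np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ y)
    return w

# One-step-ahead forecast from the latest representation h_T:
#   y_hat = h_T @ w
```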

Disengagement Cause-and-Effect Relationships Extraction Using an NLP Pipeline

  • Authors: Yangtao Zhang, X. Jessie Yang, Feng Zhou
  • Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.03511
  • Pdf link: https://arxiv.org/pdf/2111.03511
  • Abstract
    The advancement in machine learning and artificial intelligence is promoting the testing and deployment of autonomous vehicles (AVs) on public roads. The California Department of Motor Vehicles (CA DMV) has launched the Autonomous Vehicle Tester Program, which collects and releases reports related to Autonomous Vehicle Disengagement (AVD) from autonomous driving. Understanding the causes of AVD is critical to improving the safety and stability of the AV system and provide guidance for AV testing and deployment. In this work, a scalable end-to-end pipeline is constructed to collect, process, model, and analyze the disengagement reports released from 2014 to 2020 using natural language processing deep transfer learning. The analysis of disengagement data using taxonomy, visualization and statistical tests revealed the trends of AV testing, categorized cause frequency, and significant relationships between causes and effects of AVD. We found that (1) manufacturers tested AVs intensively during the Spring and/or Winter, (2) test drivers initiated more than 80% of the disengagement while more than 75% of the disengagement were led by errors in perception, localization & mapping, planning and control of the AV system itself, and (3) there was a significant relationship between the initiator of AVD and the cause category. This study serves as a successful practice of deep transfer learning using pre-trained models and generates a consolidated disengagement database allowing further investigation for other researchers.

Optimal Inverted Landing in a Small Aerial Robot with Varied Approach Velocities and Landing Gear Designs

  • Authors: Bryan Habas, Bader AlAttar, Brian Davis, Jack W. Langelaan, Bo Cheng
  • Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2111.03539
  • Pdf link: https://arxiv.org/pdf/2111.03539
  • Abstract
    Inverted landing is a challenging feat to perform in aerial robots, especially without external positioning. However, it is routinely performed by biological fliers such as bees, flies, and bats. Our previous observations of landing behaviors in flies suggest an open-loop causal relationship between their putative visual cues and the kinematics of the aerial maneuvers executed. For example, the degree of rotational maneuver (therefore the body inversion prior to touchdown) and the amount of leg-assisted body swing both depend on the flies' initial body states while approaching the ceiling. In this work, by using a physics-based simulation with experimental validation, we systematically investigated how optimized inverted landing maneuvers depend on the initial approach velocities with varied magnitude and direction. This was done by analyzing the putative visual cues (that can be derived from onboard measurements) during optimal maneuvering trajectories. We identified a three-dimensional policy region, from which a mapping to a global inverted landing policy can be developed without the use of external positioning data. In addition, we also investigated the effects of an array of landing gear designs on the optimized landing performance and identified their advantages and disadvantages. The above results have been partially validated using limited experimental testing and will continue to inform and guide our future experiments, for example by applying the calculated global policy.

TermiNeRF: Ray Termination Prediction for Efficient Neural Rendering

  • Authors: Martin Piala, Ronald Clark
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
  • Arxiv link: https://arxiv.org/abs/2111.03643
  • Pdf link: https://arxiv.org/pdf/2111.03643
  • Abstract
    Volume rendering using neural fields has shown great promise in capturing and synthesizing novel views of 3D scenes. However, this type of approach requires querying the volume network at multiple points along each viewing ray in order to render an image, resulting in very slow rendering times. In this paper, we present a method that overcomes this limitation by learning a direct mapping from camera rays to locations along the ray that are most likely to influence the pixel's final appearance. Using this approach we are able to render, train and fine-tune a volumetrically-rendered neural field model an order of magnitude faster than standard approaches. Unlike existing methods, our approach works with general volumes and can be trained end-to-end.
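
A rough numpy sketch of sampling guided by a predicted termination distribution; the bin edges, bin probabilities, and sampling rule below stand in for the network's output and are assumptions, not the paper's exact procedure.

```python
import numpy as np

def guided_ray_samples(origin, direction, bin_edges, bin_probs, n_samples=16, rng=None):
    # Draw sample depths from a predicted distribution over depth bins so that
    # samples concentrate where the ray is likely to terminate, instead of
    # sampling the whole ray uniformly.
    if rng is None:
        rng = np.random.default_rng()
    edges = np.asarray(bin_edges, dtype=float)
    p = np.asarray(bin_probs, dtype=float)
    bins = rng.choice(len(p), size=n_samples, p=p / p.sum())
    lows, highs = edges[bins], edges[bins + 1]
    depths = np.sort(rng.uniform(lows, highs))
    return origin + depths[:, None] * direction   # (n_samples, 3) sample locations
```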

Keyword: localization

Attention on Classification for Fire Segmentation

  • Authors: Milad Niknejad, Alexandre Bernardino
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.03129
  • Pdf link: https://arxiv.org/pdf/2111.03129
  • Abstract
    Detection and localization of fire in images and videos are important in tackling fire incidents. Although semantic segmentation methods can be used to indicate the location of pixels with fire in the images, their predictions are localized, and they often fail to consider global information of the existence of fire in the image which is implicit in the image labels. We propose a Convolutional Neural Network (CNN) for joint classification and segmentation of fire in images which improves the performance of the fire segmentation. We use a spatial self-attention mechanism to capture long-range dependency between pixels, and a new channel attention module which uses the classification probability as an attention weight. The network is jointly trained for both segmentation and classification, leading to improvement in the performance of the single-task image segmentation methods, and the previous methods proposed for fire segmentation.

RASEC: Rescaling Acquisition Strategy with Energy Constraints under SE-OU Fusion Kernel for Active Trachea Palpation and Incision Recommendation in Laryngeal Region

  • Authors: Wenchao Yue, Fan Bai, Jianbang Liu, Feng Ju, Max Q-H Meng, Chwee Ming Lim, Hongliang Ren
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.03235
  • Pdf link: https://arxiv.org/pdf/2111.03235
  • Abstract
    A novel palpation-based incision detection strategy in the laryngeal region, potentially for robotic tracheotomy, is proposed in this letter. A tactile sensor is introduced to measure tissue hardness in the specific laryngeal region by gentle contact. The kernel fusion method is proposed to combine the Squared Exponential (SE) kernel with Ornstein-Uhlenbeck (OU) kernel to figure out the drawbacks that the existing kernel functions are not sufficiently optimal in this scenario. Moreover, we further regularize exploration factor and greed factor, and the tactile sensor's moving distance and the robotic base link's rotation angle during the incision localization process are considered as new factors in the acquisition strategy. We conducted simulation and physical experiments to compare the newly proposed algorithm - Rescaling Acquisition Strategy with Energy Constraints (RASEC) in trachea detection with current palpation-based acquisition strategies. The result indicates that the proposed acquisition strategy with fusion kernel can successfully localize the incision with the highest algorithm performance (Average Precision 0.932, Average Recall 0.973, Average F1 score 0.952). During the robotic palpation process, the cumulative moving distance is reduced by 50%, and the cumulative rotation angle is reduced by 71.4% with no sacrifice in the comprehensive performance capabilities. Therefore, it proves that RASEC can efficiently suggest the incision zone in the laryngeal region and greatly reduced the energy loss.
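
One plausible reading of the SE-OU fusion is a weighted combination of the two kernels over pairwise distances between palpation sites; the sketch below makes that assumption and is not the paper's exact fusion rule.

```python
import numpy as np

def se_kernel(d, length_scale=1.0):
    return np.exp(-0.5 * (d / length_scale) ** 2)

def ou_kernel(d, length_scale=1.0):
    return np.exp(-np.abs(d) / length_scale)

def fused_kernel(X1, X2, alpha=0.5, length_scale=1.0):
    # Pairwise distances between palpation sites, then a weighted combination
    # of the smooth SE kernel and the rougher OU kernel.
    d = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    return alpha * se_kernel(d, length_scale) + (1 - alpha) * ou_kernel(d, length_scale)
```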

Recognizing Vector Graphics without Rasterization

  • Authors: Xinyang Jiang, Lu Liu, Caihua Shan, Yifei Shen, Xuanyi Dong, Dongsheng Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2111.03281
  • Pdf link: https://arxiv.org/pdf/2111.03281
  • Abstract
    In this paper, we consider a different data format for images: vector graphics. In contrast to raster graphics which are widely used in image recognition, vector graphics can be scaled up or down into any resolution without aliasing or information loss, due to the analytic representation of the primitives in the document. Furthermore, vector graphics are able to give extra structural information on how low-level elements group together to form high level shapes or structures. These merits of graphic vectors have not been fully leveraged in existing methods. To explore this data format, we target on the fundamental recognition tasks: object localization and classification. We propose an efficient CNN-free pipeline that does not render the graphic into pixels (i.e. rasterization), and takes textual document of the vector graphics as input, called YOLaT (You Only Look at Text). YOLaT builds multi-graphs to model the structural and spatial information in vector graphics, and a dual-stream graph neural network is proposed to detect objects from the graph. Our experiments show that by directly operating on vector graphics, YOLaT out-performs raster-graphic based object detection baselines in terms of both average precision and efficiency.

KORSAL: Key-point Detection based Online Real-Time Spatio-Temporal Action Localization

  • Authors: Kalana Abeywardena, Shechem Sumanthiran, Sakuna Jayasundara, Sachira Karunasena, Ranga Rodrigo, Peshala Jayasekara
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.03319
  • Pdf link: https://arxiv.org/pdf/2111.03319
  • Abstract
    Real-time and online action localization in a video is a critical yet highly challenging problem. Accurate action localization requires the utilization of both temporal and spatial information. Recent attempts achieve this by using computationally intensive 3D CNN architectures or highly redundant two-stream architectures with optical flow, making them both unsuitable for real-time, online applications. To accomplish activity localization under highly challenging real-time constraints, we propose utilizing fast and efficient key-point based bounding box prediction to spatially localize actions. We then introduce a tube-linking algorithm that maintains the continuity of action tubes temporally in the presence of occlusions. Further, we eliminate the need for a two-stream architecture by combining temporal and spatial information into a cascaded input to a single network, allowing the network to learn from both types of information. Temporal information is efficiently extracted using a structural similarity index map as opposed to computationally intensive optical flow. Despite the simplicity of our approach, our lightweight end-to-end architecture achieves state-of-the-art frame-mAP of 74.7% on the challenging UCF101-24 dataset, demonstrating a performance gain of 6.4% over the previous best online methods. We also achieve state-of-the-art video-mAP results compared to both online and offline methods. Moreover, our model achieves a frame rate of 41.8 FPS, which is a 10.7% improvement over contemporary real-time methods.

SSA: Semantic Structure Aware Inference for Weakly Pixel-Wise Dense Predictions without Cost

  • Authors: Yanpeng Sun, Zechao Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.03392
  • Pdf link: https://arxiv.org/pdf/2111.03392
  • Abstract
    The pixel-wise dense prediction tasks based on weakly supervisions currently use Class Attention Maps (CAM) to generate pseudo masks as ground-truth. However, the existing methods typically depend on the painstaking training modules, which may bring in grinding computational overhead and complex training procedures. In this work, the semantic structure aware inference (SSA) is proposed to explore the semantic structure information hidden in different stages of the CNN-based network to generate high-quality CAM in the model inference. Specifically, the semantic structure modeling module (SSM) is first proposed to generate the class-agnostic semantic correlation representation, where each item denotes the affinity degree between one category of objects and all the others. Then the structured feature representation is explored to polish an immature CAM via the dot product operation. Finally, the polished CAMs from different backbone stages are fused as the output. The proposed method has the advantage of no parameters and does not need to be trained. Therefore, it can be applied to a wide range of weakly-supervised pixel-wise dense prediction tasks. Experimental results on both weakly-supervised object localization and weakly-supervised semantic segmentation tasks demonstrate the effectiveness of the proposed method, which achieves the new state-of-the-art results on these two tasks.
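
In its simplest form, the dot-product polishing step multiplies the flattened CAM by the pixel affinity matrix; a minimal sketch under that assumption:

```python
import numpy as np

def polish_cam(cam, affinity):
    # cam:      (C, HW) immature class activation maps, flattened spatially
    # affinity: (HW, HW) class-agnostic semantic correlation between pixels
    refined = cam @ affinity
    return refined / (refined.max(axis=1, keepdims=True) + 1e-8)
```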

LiODOM: Adaptive Local Mapping for Robust LiDAR-Only Odometry

  • Authors: Emilio Garcia-Fidalgo, Joan P. Company-Corcoles, Francisco Bonnin-Pascual, Alberto Ortiz
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.03393
  • Pdf link: https://arxiv.org/pdf/2111.03393
  • Abstract
    In the last decades, Light Detection And Ranging (LiDAR) technology has been extensively explored as a robust alternative for self-localization and mapping. These approaches typically state ego-motion estimation as a non-linear optimization problem dependent on the correspondences established between the current point cloud and a map, whatever its scope, local or global. This paper proposes LiODOM, a novel LiDAR-only ODOmetry and Mapping approach for pose estimation and map-building, based on minimizing a loss function derived from a set of weighted point-to-line correspondences with a local map abstracted from the set of available point clouds. Furthermore, this work places a particular emphasis on map representation given its relevance for quick data association. To efficiently represent the environment, we propose a data structure that combined with a hashing scheme allows for fast access to any section of the map. LiODOM is validated by means of a set of experiments on public datasets, for which it compares favourably against other solutions. Its performance on-board an aerial platform is also reported.

Disengagement Cause-and-Effect Relationships Extraction Using an NLP Pipeline

  • Authors: Yangtao Zhang, X. Jessie Yang, Feng Zhou
  • Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.03511
  • Pdf link: https://arxiv.org/pdf/2111.03511
  • Abstract
    The advancement in machine learning and artificial intelligence is promoting the testing and deployment of autonomous vehicles (AVs) on public roads. The California Department of Motor Vehicles (CA DMV) has launched the Autonomous Vehicle Tester Program, which collects and releases reports related to Autonomous Vehicle Disengagement (AVD) from autonomous driving. Understanding the causes of AVD is critical to improving the safety and stability of the AV system and provide guidance for AV testing and deployment. In this work, a scalable end-to-end pipeline is constructed to collect, process, model, and analyze the disengagement reports released from 2014 to 2020 using natural language processing deep transfer learning. The analysis of disengagement data using taxonomy, visualization and statistical tests revealed the trends of AV testing, categorized cause frequency, and significant relationships between causes and effects of AVD. We found that (1) manufacturers tested AVs intensively during the Spring and/or Winter, (2) test drivers initiated more than 80% of the disengagement while more than 75% of the disengagement were led by errors in perception, localization & mapping, planning and control of the AV system itself, and (3) there was a significant relationship between the initiator of AVD and the cause category. This study serves as a successful practice of deep transfer learning using pre-trained models and generates a consolidated disengagement database allowing further investigation for other researchers.

New submissions for Fri, 25 Jun 21

Keyword: SLAM

Planetary UAV localization based on Multi-modal Registration with Pre-existing Digital Terrain Model

  • Authors: Xue Wan, Yuanbin Shao, Shengyang Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.12738
  • Pdf link: https://arxiv.org/pdf/2106.12738
  • Abstract
    The autonomous real-time optical navigation of planetary UAV is of the key technologies to ensure the success of the exploration. In such a GPS denied environment, vision-based localization is an optimal approach. In this paper, we proposed a multi-modal registration based SLAM algorithm, which estimates the location of a planet UAV using a nadir view camera on the UAV compared with pre-existing digital terrain model. To overcome the scale and appearance difference between on-board UAV images and pre-installed digital terrain model, a theoretical model is proposed to prove that topographic features of UAV image and DEM can be correlated in frequency domain via cross power spectrum. To provide the six-DOF of the UAV, we also developed an optimization approach which fuses the geo-referencing result into a SLAM system via LBA (Local Bundle Adjustment) to achieve robust and accurate vision-based navigation even in featureless planetary areas. To test the robustness and effectiveness of the proposed localization algorithm, a new cross-source drone-based localization dataset for planetary exploration is proposed. The proposed dataset includes 40200 synthetic drone images taken from nine planetary scenes with related DEM query images. Comparison experiments carried out demonstrate that over the flight distance of 33.8km, the proposed method achieved average localization error of 0.45 meters, compared to 1.31 meters by ORB-SLAM, with the processing speed of 12hz which will ensure a real-time performance. We will make our datasets available to encourage further work on this promising topic.
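
The frequency-domain correlation referred to above resembles classical phase correlation; the sketch below shows that generic building block, not the paper's full multi-modal registration pipeline.

```python
import numpy as np

def phase_correlation(uav_image, dem_patch):
    # Normalized cross power spectrum of the two single-channel images; the
    # peak of its inverse FFT gives the translation that best aligns them.
    F1 = np.fft.fft2(uav_image)
    F2 = np.fft.fft2(dem_patch)
    cross_power = F1 * np.conj(F2)
    cross_power /= np.abs(cross_power) + 1e-12
    corr = np.fft.ifft2(cross_power).real
    shift = np.unravel_index(np.argmax(corr), corr.shape)
    return shift, corr
```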

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

Object Detection and Ranging for Autonomous Navigation of Mobile Robots

  • Authors: Md Ziaul Haque Zim, Nimai Chandra Das
  • Subjects: Robotics (cs.RO); Hardware Architecture (cs.AR)
  • Arxiv link: https://arxiv.org/abs/2106.12701
  • Pdf link: https://arxiv.org/pdf/2106.12701
  • Abstract
    As electronic technology has advanced rapidly over the recent decade, ranging methodologies should be updated as well. For ranging, various methods such as Radio Detection and Ranging (RADAR), Light Detection and Ranging (LIDAR) and Sonic Navigation and Ranging (SONAR) are used. Later, by adapting these earlier technologies and further modifying them for detection and ranging in navigation, Sonic Detection and Ranging (SODAR) came into use in modern robotics. SODAR can be regarded as a child of SONAR and a twin of the echo sounder. The echo sounder is used only for ranging, whereas SODAR uses a low-frequency 33 kHz wave to measure underwater depth and to detect objects underwater. This work therefore comprises the design of a system to evaluate object detection and ranging for the autonomous navigation of mobile robots.

Multi-Modal 3D Object Detection in Autonomous Driving: a Survey

  • Authors: Yingjie Wang, Qiuyu Mao, Hanqi Zhu, Yu Zhang, Jianmin Ji, Yanyong Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.12735
  • Pdf link: https://arxiv.org/pdf/2106.12735
  • Abstract
    In the past few years, we have witnessed rapid development of autonomous driving. However, achieving full autonomy remains a daunting task due to the complex and dynamic driving environment. As a result, self-driving cars are equipped with a suite of sensors to conduct robust and accurate environment perception. As the number and type of sensors keep increasing, combining them for better perception is becoming a natural trend. So far, there has been no indepth review that focuses on multi-sensor fusion based perception. To bridge this gap and motivate future research, this survey devotes to review recent fusion-based 3D detection deep learning models that leverage multiple sensor data sources, especially cameras and LiDARs. In this survey, we first introduce the background of popular sensors for autonomous cars, including their common data representations as well as object detection networks developed for each type of sensor data. Next, we discuss some popular datasets for multi-modal 3D object detection, with a special focus on the sensor data included in each dataset. Then we present in-depth reviews of recent multi-modal 3D detection networks by considering the following three aspects of the fusion: fusion location, fusion data representation, and fusion granularity. After a detailed review, we discuss open challenges and point out possible solutions. We hope that our detailed review can help researchers to embark investigations in the area of multi-modal 3D object detection.

SGTBN: Generating Dense Depth Maps from Single-Line LiDAR

  • Authors: Hengjie Lu, Shugong Xu, Shan Cao
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.12994
  • Pdf link: https://arxiv.org/pdf/2106.12994
  • Abstract
    Depth completion aims to generate a dense depth map from the sparse depth map and aligned RGB image. However, current depth completion methods use extremely expensive 64-line LiDAR(about $100,000) to obtain sparse depth maps, which will limit their application scenarios. Compared with the 64-line LiDAR, the single-line LiDAR is much less expensive and much more robust. Therefore, we propose a method to tackle the problem of single-line depth completion, in which we aim to generate a dense depth map from the single-line LiDAR info and the aligned RGB image. A single-line depth completion dataset is proposed based on the existing 64-line depth completion dataset(KITTI). A network called Semantic Guided Two-Branch Network(SGTBN) which contains global and local branches to extract and fuse global and local info is proposed for this task. A Semantic guided depth upsampling module is used in our network to make full use of the semantic info in RGB images. Except for the usual MSE loss, we add the virtual normal loss to increase the constraint of high-order 3D geometry in our network. Our network outperforms the state-of-the-art in the single-line depth completion task. Besides, compared with the monocular depth estimation, our method also has significant advantages in precision and model size.

Keyword: loop detection

There is no result

Keyword: autonomous driving

Multi-Modal 3D Object Detection in Autonomous Driving: a Survey

  • Authors: Yingjie Wang, Qiuyu Mao, Hanqi Zhu, Yu Zhang, Jianmin Ji, Yanyong Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.12735
  • Pdf link: https://arxiv.org/pdf/2106.12735
  • Abstract
    In the past few years, we have witnessed rapid development of autonomous driving. However, achieving full autonomy remains a daunting task due to the complex and dynamic driving environment. As a result, self-driving cars are equipped with a suite of sensors to conduct robust and accurate environment perception. As the number and type of sensors keep increasing, combining them for better perception is becoming a natural trend. So far, there has been no indepth review that focuses on multi-sensor fusion based perception. To bridge this gap and motivate future research, this survey devotes to review recent fusion-based 3D detection deep learning models that leverage multiple sensor data sources, especially cameras and LiDARs. In this survey, we first introduce the background of popular sensors for autonomous cars, including their common data representations as well as object detection networks developed for each type of sensor data. Next, we discuss some popular datasets for multi-modal 3D object detection, with a special focus on the sensor data included in each dataset. Then we present in-depth reviews of recent multi-modal 3D detection networks by considering the following three aspects of the fusion: fusion location, fusion data representation, and fusion granularity. After a detailed review, we discuss open challenges and point out possible solutions. We hope that our detailed review can help researchers to embark investigations in the area of multi-modal 3D object detection.

Autonomous Driving Strategies at Intersections: Scenarios, State-of-the-Art, and Future Outlooks

  • Authors: Lianzhen Wei, Zirui Li, Jianwei Gong, Cheng Gong, Jiachen Li
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2106.13052
  • Pdf link: https://arxiv.org/pdf/2106.13052
  • Abstract
    Due to the complex and dynamic character of intersection scenarios, the autonomous driving strategy at intersections has been a difficult problem and a hot point in the research of intelligent transportation systems in recent years. This paper gives a brief summary of state-of-the-art autonomous driving strategies at intersections. Firstly, we enumerate and analyze common types of intersection scenarios, corresponding simulation platforms, as well as related datasets. Secondly, by reviewing previous studies, we have summarized characteristics of existing autonomous driving strategies and classified them into several categories. Finally, we point out problems of the existing autonomous driving strategies and put forward several valuable research outlooks.

New submissions for Wed, 3 Nov 21

Keyword: SLAM

There is no result

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

Neural Scene Flow Prior

  • Authors: Xueqian Li, Jhony Kaesemodel Pontes, Simon Lucey
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.01253
  • Pdf link: https://arxiv.org/pdf/2111.01253
  • Abstract
    Before the deep learning revolution, many perception algorithms were based on runtime optimization in conjunction with a strong prior/regularization penalty. A prime example of this in computer vision is optical and scene flow. Supervised learning has largely displaced the need for explicit regularization. Instead, they rely on large amounts of labeled data to capture prior statistics, which are not always readily available for many problems. Although optimization is employed to learn the neural network, the weights of this network are frozen at runtime. As a result, these learning solutions are domain-specific and do not generalize well to other statistically different scenarios. This paper revisits the scene flow problem that relies predominantly on runtime optimization and strong regularization. A central innovation here is the inclusion of a neural scene flow prior, which uses the architecture of neural networks as a new type of implicit regularizer. Unlike learning-based scene flow methods, optimization occurs at runtime, and our approach needs no offline datasets -- making it ideal for deployment in new environments such as autonomous driving. We show that an architecture based exclusively on multilayer perceptrons (MLPs) can be used as a scene flow prior. Our method attains competitive -- if not better -- results on scene flow benchmarks. Also, our neural prior's implicit and continuous scene flow representation allows us to estimate dense long-term correspondences across a sequence of point clouds. The dense motion information is represented by scene flow fields where points can be propagated through time by integrating motion vectors. We demonstrate such a capability by accumulating a sequence of lidar point clouds.
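
The runtime-optimized MLP prior can be sketched in a few lines of PyTorch; the architecture and the one-sided Chamfer loss below are simplifications, not the paper's exact setup.

```python
import torch

def fit_scene_flow_prior(src, dst, hidden=128, iters=500, lr=1e-3):
    # src, dst: (N, 3) and (M, 3) float tensors (two consecutive point clouds).
    # A small MLP mapping a point to its flow vector is optimized at runtime
    # on this single pair, acting as an implicit regularizer.
    mlp = torch.nn.Sequential(
        torch.nn.Linear(3, hidden), torch.nn.ReLU(),
        torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
        torch.nn.Linear(hidden, 3),
    )
    opt = torch.optim.Adam(mlp.parameters(), lr=lr)
    for _ in range(iters):
        warped = src + mlp(src)
        loss = torch.cdist(warped, dst).min(dim=1).values.mean()  # one-sided Chamfer
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mlp
```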

CPSeg: Cluster-free Panoptic Segmentation of 3D LiDAR Point Clouds

  • Authors: Enxu Li, Ryan Razani, Yixuan Xu, Bingbing Liu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.01723
  • Pdf link: https://arxiv.org/pdf/2111.01723
  • Abstract
    A fast and accurate panoptic segmentation system for LiDAR point clouds is crucial for autonomous driving vehicles to understand the surrounding objects and scenes. Existing approaches usually rely on proposals or clustering to segment foreground instances. As a result, they struggle to achieve real-time performance. In this paper, we propose a novel real-time end-to-end panoptic segmentation network for LiDAR point clouds, called CPSeg. In particular, CPSeg comprises a shared encoder, a dual decoder, a task-aware attention module (TAM) and a cluster-free instance segmentation head. TAM is designed to enforce these two decoders to learn rich task-aware features for semantic and instance embedding. Moreover, CPSeg incorporates a new cluster-free instance segmentation head to dynamically pillarize foreground points according to the learned embedding. Then, it acquires instance labels by finding connected pillars with a pairwise embedding comparison. Thus, the conventional proposal-based or clustering-based instance segmentation is transformed into a binary segmentation problem on the pairwise embedding comparison matrix. To help the network regress instance embedding, a fast and deterministic depth completion algorithm is proposed to calculate surface normal of each point cloud in real-time. The proposed method is benchmarked on two large-scale autonomous driving datasets, namely, SemanticKITTI and nuScenes. Notably, extensive experimental results show that CPSeg achieves the state-of-the-art results among real-time approaches on both datasets.
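
The pairwise-embedding grouping can be illustrated with a thresholded distance matrix followed by connected components; the threshold and distance metric below are assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def group_pillars(embeddings, threshold=0.5):
    # embeddings: (P, d) learned embedding of each foreground pillar.
    # Pillars with nearby embeddings are linked; instance ids are then the
    # connected components of the resulting binary pairwise matrix.
    d = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    adjacency = csr_matrix(d < threshold)
    _, labels = connected_components(adjacency, directed=False)
    return labels
```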

Keyword: loop detection

There is no result

Keyword: autonomous driving

One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search

  • Authors: Bingqian Lu, Jianyi Yang, Weiwen Jiang, Yiyu Shi, Shaolei Ren
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.01203
  • Pdf link: https://arxiv.org/pdf/2111.01203
  • Abstract
    Convolutional neural networks (CNNs) are used in numerous real-world applications such as vision-based autonomous driving and video content analysis. To run CNN inference on various target devices, hardware-aware neural architecture search (NAS) is crucial. A key requirement of efficient hardware-aware NAS is the fast evaluation of inference latencies in order to rank different architectures. While building a latency predictor for each target device has been commonly used in state of the art, this is a very time-consuming process, lacking scalability in the presence of extremely diverse devices. In this work, we address the scalability challenge by exploiting latency monotonicity -- the architecture latency rankings on different devices are often correlated. When strong latency monotonicity exists, we can re-use architectures searched for one proxy device on new target devices, without losing optimality. In the absence of strong latency monotonicity, we propose an efficient proxy adaptation technique to significantly boost the latency monotonicity. Finally, we validate our approach and conduct experiments with devices of different platforms on multiple mainstream search spaces, including MobileNet-V2, MobileNet-V3, NAS-Bench-201, ProxylessNAS and FBNet. Our results highlight that, by using just one proxy device, we can find almost the same Pareto-optimal architectures as the existing per-device NAS, while avoiding the prohibitive cost of building a latency predictor for each device.
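
Latency monotonicity between two devices can be quantified with a rank correlation; a toy example with made-up latency numbers:

```python
from scipy.stats import spearmanr

# Hypothetical latencies (ms) of the same candidate architectures measured on
# a proxy device and on a new target device.
proxy_latency = [12.1, 15.4, 9.8, 20.3, 11.0]
target_latency = [18.0, 22.5, 14.2, 30.1, 16.4]

rho, _ = spearmanr(proxy_latency, target_latency)
print(f"latency monotonicity (Spearman rho) = {rho:.2f}")  # ~1.0 means rankings transfer
```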

Neural Scene Flow Prior

  • Authors: Xueqian Li, Jhony Kaesemodel Pontes, Simon Lucey
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.01253
  • Pdf link: https://arxiv.org/pdf/2111.01253
  • Abstract
    Before the deep learning revolution, many perception algorithms were based on runtime optimization in conjunction with a strong prior/regularization penalty. A prime example of this in computer vision is optical and scene flow. Supervised learning has largely displaced the need for explicit regularization. Instead, they rely on large amounts of labeled data to capture prior statistics, which are not always readily available for many problems. Although optimization is employed to learn the neural network, the weights of this network are frozen at runtime. As a result, these learning solutions are domain-specific and do not generalize well to other statistically different scenarios. This paper revisits the scene flow problem that relies predominantly on runtime optimization and strong regularization. A central innovation here is the inclusion of a neural scene flow prior, which uses the architecture of neural networks as a new type of implicit regularizer. Unlike learning-based scene flow methods, optimization occurs at runtime, and our approach needs no offline datasets -- making it ideal for deployment in new environments such as autonomous driving. We show that an architecture based exclusively on multilayer perceptrons (MLPs) can be used as a scene flow prior. Our method attains competitive -- if not better -- results on scene flow benchmarks. Also, our neural prior's implicit and continuous scene flow representation allows us to estimate dense long-term correspondences across a sequence of point clouds. The dense motion information is represented by scene flow fields where points can be propagated through time by integrating motion vectors. We demonstrate such a capability by accumulating a sequence of lidar point clouds.

Verifying Contracts for Perturbed Control Systems using Linear Programming

  • Authors: Miel Sharf, Bart Besselink, Karl Henrik Johansson
  • Subjects: Systems and Control (eess.SY); Dynamical Systems (math.DS); Optimization and Control (math.OC)
  • Arxiv link: https://arxiv.org/abs/2111.01259
  • Pdf link: https://arxiv.org/pdf/2111.01259
  • Abstract
    Verifying specifications for large-scale control systems is of utmost importance, but can be hard in practice as most formal verification methods can not handle high-dimensional dynamics. Contract theory has been proposed as a modular alternative to formal verification in which specifications are defined by assumptions on the inputs to a component and guarantees on its outputs. In this paper, we present linear-programming-based tools for verifying contracts for control systems. We first consider the problem of verifying contracts defined by time-invariant inequalities for unperturbed systems. We use $k$-induction to show that contract verification can be achieved by considering a collection of implications between inequalities, which are then recast as linear programs. We then move our attention to perturbed systems. We present a comparison-based framework, verifying that a perturbed system satisfies a contract by checking that the corresponding unperturbed system satisfies a robustified (and $\epsilon$-approximated) contract. In both cases, we present explicit algorithms for contract verification, proving their correctness and analyzing their complexity. We also demonstrate the verification process for two case studies, one considering a two-vehicle autonomous driving scenario, and one considering formation control of a multi-agent system.
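
The reduction from an implication between linear inequalities to a linear program can be sketched as follows; the box bounds are an added assumption to keep the LP bounded and are not part of the paper's formulation.

```python
import numpy as np
from scipy.optimize import linprog

def inequality_implies(A_in, b_in, c, d, box=(-1e3, 1e3)):
    # Does "A_in x <= b_in" imply "c^T x <= d"?  Maximize c^T x over the
    # assumed region (linprog minimizes, so negate c); if even the maximum
    # stays below d, the implication holds.
    res = linprog(-np.asarray(c, dtype=float), A_ub=A_in, b_ub=b_in,
                  bounds=[box] * len(c))
    return bool(res.success) and (-res.fun <= d)
```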

CPSeg: Cluster-free Panoptic Segmentation of 3D LiDAR Point Clouds

  • Authors: Enxu Li, Ryan Razani, Yixuan Xu, Bingbing Liu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.01723
  • Pdf link: https://arxiv.org/pdf/2111.01723
  • Abstract
    A fast and accurate panoptic segmentation system for LiDAR point clouds is crucial for autonomous driving vehicles to understand the surrounding objects and scenes. Existing approaches usually rely on proposals or clustering to segment foreground instances. As a result, they struggle to achieve real-time performance. In this paper, we propose a novel real-time end-to-end panoptic segmentation network for LiDAR point clouds, called CPSeg. In particular, CPSeg comprises a shared encoder, a dual decoder, a task-aware attention module (TAM) and a cluster-free instance segmentation head. TAM is designed to enforce these two decoders to learn rich task-aware features for semantic and instance embedding. Moreover, CPSeg incorporates a new cluster-free instance segmentation head to dynamically pillarize foreground points according to the learned embedding. Then, it acquires instance labels by finding connected pillars with a pairwise embedding comparison. Thus, the conventional proposal-based or clustering-based instance segmentation is transformed into a binary segmentation problem on the pairwise embedding comparison matrix. To help the network regress instance embedding, a fast and deterministic depth completion algorithm is proposed to calculate surface normal of each point cloud in real-time. The proposed method is benchmarked on two large-scale autonomous driving datasets, namely, SemanticKITTI and nuScenes. Notably, extensive experimental results show that CPSeg achieves the state-of-the-art results among real-time approaches on both datasets.

Keyword: mapping

Attention-Guided Generative Adversarial Network for Whisper to Normal Speech Conversion

  • Authors: Teng Gao, Jian Zhou, Huabin Wang, Liang Tao, Hon Keung Kwan
  • Subjects: Sound (cs.SD); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2111.01342
  • Pdf link: https://arxiv.org/pdf/2111.01342
  • Abstract
    Whispered speech is a special way of pronunciation without using vocal cord vibration. A whispered speech does not contain a fundamental frequency, and its energy is about 20dB lower than that of a normal speech. Converting a whispered speech into a normal speech can improve speech quality and intelligibility. In this paper, a novel attention-guided generative adversarial network model incorporating an autoencoder, a Siamese neural network, and an identity mapping loss function for whisper to normal speech conversion (AGAN-W2SC) is proposed. The proposed method avoids the challenge of estimating the fundamental frequency of the normal voiced speech converted from a whispered speech. Specifically, the proposed model is more amendable to practical applications because it does not need to align speech features for training. Experimental results demonstrate that the proposed AGAN-W2SC can obtain improved speech quality and intelligibility compared with dynamic-time-warping-based methods.

CycleGAN with Dual Adversarial Loss for Bone-Conducted Speech Enhancement

  • Authors: Qing Pan, Teng Gao, Jian Zhou, Huabin Wang, Liang Tao, Hon Keung Kwan
  • Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2111.01430
  • Pdf link: https://arxiv.org/pdf/2111.01430
  • Abstract
    Compared with air-conducted speech, bone-conducted speech has the unique advantage of shielding background noise. Enhancement of bone-conducted speech helps to improve its quality and intelligibility. In this paper, a novel CycleGAN with dual adversarial loss (CycleGAN-DAL) is proposed for bone-conducted speech enhancement. The proposed method uses an adversarial loss and a cycle-consistent loss simultaneously to learn forward and cyclic mapping, in which the adversarial loss is replaced with the classification adversarial loss and the defect adversarial loss to consolidate the forward mapping. Compared with conventional baseline methods, it can learn feature mapping between bone-conducted speech and target speech without additional air-conducted speech assistance. Moreover, the proposed method also avoids the oversmooth problem which is occurred commonly in conventional statistical based models. Experimental results show that the proposed method outperforms baseline methods such as CycleGAN, GMM, and BLSTM. Keywords: Bone-conducted speech enhancement, dual adversarial loss, Parallel CycleGAN, high frequency speech reconstruction

UnProjection: Leveraging Inverse-Projections for Visual Analytics of High-Dimensional Data

  • Authors: Mateus Espadoto, Gabriel Appleby, Ashley Suh, Dylan Cashman, Mingwei Li, Carlos Scheidegger, Erik W Anderson, Remco Chang, Alexandru C Telea
  • Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.01744
  • Pdf link: https://arxiv.org/pdf/2111.01744
  • Abstract
    Projection techniques are often used to visualize high-dimensional data, allowing users to better understand the overall structure of multi-dimensional spaces on a 2D screen. Although many such methods exist, comparably little work has been done on generalizable methods of inverse-projection -- the process of mapping the projected points, or more generally, the projection space back to the original high-dimensional space. In this paper we present NNInv, a deep learning technique with the ability to approximate the inverse of any projection or mapping. NNInv learns to reconstruct high-dimensional data from any arbitrary point on a 2D projection space, giving users the ability to interact with the learned high-dimensional representation in a visual analytics system. We provide an analysis of the parameter space of NNInv, and offer guidance in selecting these parameters. We extend validation of the effectiveness of NNInv through a series of quantitative and qualitative analyses. We then demonstrate the method's utility by applying it to three visualization tasks: interactive instance interpolation, classifier agreement, and gradient visualization.
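
To show the input/output contract of an inverse projection, the sketch below uses a generic scikit-learn regressor as a stand-in for NNInv's own network.

```python
from sklearn.neural_network import MLPRegressor

def train_inverse_projection(P2d, X_highdim):
    # P2d:       (n, 2) projected points from any projection technique
    # X_highdim: (n, d) the original high-dimensional data
    inv = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=500)
    inv.fit(P2d, X_highdim)
    return inv

# inv.predict(grid_of_2d_points) then reconstructs plausible high-dimensional
# samples for arbitrary locations on the 2D canvas.
```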

Keyword: localization

Boundary Distribution Estimation to Precise Object Detection

  • Authors: Haoran Zhou, Hang Huang, Rui Zhao, Wei Wang, Qingguo Zhou
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.01396
  • Pdf link: https://arxiv.org/pdf/2111.01396
  • Abstract
    In principal modern detectors, the task of object localization is implemented by the box subnet which concentrates on bounding box regression. The box subnet customarily predicts the position of the object by regressing box center position and scaling factors. Although this approach is frequently adopted, we observe that the result of localization remains defective, which makes the performance of the detector unsatisfactory. In this paper, we prove the flaws in the previous method through theoretical analysis and experimental verification and propose a novel solution to detect objects precisely. Rather than plainly focusing on center and size, our approach refines the edges of the bounding box on previous localization results by estimating the distribution at the boundary of the object. Experimental results have shown the potentiality and generalization of our proposed method.

New submissions for Fri, 12 Nov 21

Keyword: SLAM

There is no result

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: autonomous driving

Yaw-Guided Imitation Learning for Autonomous Driving in Urban Environments

  • Authors: Yandong Liu, Chengzhong Xu, Hui Kong
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.06017
  • Pdf link: https://arxiv.org/pdf/2111.06017
  • Abstract
    Existing imitation learning methods suffer from low efficiency and generalization ability when facing the road option problem in an urban environment. In this paper, we propose a yaw-guided imitation learning method to improve the road option performance in an end-to-end autonomous driving paradigm in terms of the efficiency of exploiting training samples and adaptability to changing environments. Specifically, the yaw information is provided by the trajectory of the navigation map. Our end-to-end architecture, Yaw-guided Imitation Learning with ResNet34 Attention (YILRatt), integrates the ResNet34 backbone and attention mechanism to obtain an accurate perception. It does not need high precision maps and realizes fully end-to-end autonomous driving given the yaw information provided by a consumer-level GPS receiver. By analyzing the attention heat maps, we can reveal some causal relationship between decision-making and scene perception, where, in particular, failure cases are caused by erroneous perception. We collect expert experience in the Carla 0.9.11 simulator and improve the benchmark CoRL2017 and NoCrash. Experimental results show that YILRatt has a 26.27% higher success rate than the SOTA CILRS. The code, dataset, benchmark and experimental results can be found at https://github.com/Yandong024/Yaw-guided-IL.git

csBoundary: City-scale Road-boundary Detection in Aerial Images for High-definition Maps

  • Authors: Zhenhua Xu, Yuxuan Liu, Lu Gan, Xiangcheng Hu, Yuxiang Sun, Lujia Wang, Ming Liu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.06020
  • Pdf link: https://arxiv.org/pdf/2111.06020
  • Abstract
    High-Definition (HD) maps can provide precise geometric and semantic information of static traffic environments for autonomous driving. Road-boundary is one of the most important information contained in HD maps since it distinguishes between road areas and off-road areas, which can guide vehicles to drive within road areas. But it is labor-intensive to annotate road boundaries for HD maps at the city scale. To enable automatic HD map annotation, current work uses semantic segmentation or iterative graph growing for road-boundary detection. However, the former could not ensure topological correctness since it works at the pixel level, while the latter suffers from inefficiency and drifting issues. To provide a solution to the aforementioned problems, in this letter, we propose a novel system termed csBoundary to automatically detect road boundaries at the city scale for HD map annotation. Our network takes as input an aerial image patch, and directly infers the continuous road-boundary graph (i.e., vertices and edges) from this image. To generate the city-scale road-boundary graph, we stitch the obtained graphs from all the image patches. Our csBoundary is evaluated and compared on a public benchmark dataset. The results demonstrate our superiority. The accompanied demonstration video is available at our project page \url{https://sites.google.com/view/csboundary/}.

Multi-agent Reinforcement Learning for Cooperative Lane Changing of Connected and Autonomous Vehicles in Mixed Traffic

  • Authors: Wei Zhou, Dong Chen, Jun Yan, Zhaojian Li, Huilin Yin, Wanchen Ge
  • Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
  • Arxiv link: https://arxiv.org/abs/2111.06318
  • Pdf link: https://arxiv.org/pdf/2111.06318
  • Abstract
    Autonomous driving has attracted significant research interests in the past two decades as it offers many potential benefits, including releasing drivers from exhausting driving and mitigating traffic congestion, among others. Despite promising progress, lane-changing remains a great challenge for autonomous vehicles (AV), especially in mixed and dynamic traffic scenarios. Recently, reinforcement learning (RL), a powerful data-driven control method, has been widely explored for lane-changing decision makings in AVs with encouraging results demonstrated. However, the majority of those studies are focused on a single-vehicle setting, and lane-changing in the context of multiple AVs coexisting with human-driven vehicles (HDVs) have received scarce attention. In this paper, we formulate the lane-changing decision making of multiple AVs in a mixed-traffic highway environment as a multi-agent reinforcement learning (MARL) problem, where each AV makes lane-changing decisions based on the motions of both neighboring AVs and HDVs. Specifically, a multi-agent advantage actor-critic network (MA2C) is developed with a novel local reward design and a parameter sharing scheme. In particular, a multi-objective reward function is proposed to incorporate fuel efficiency, driving comfort, and safety of autonomous driving. Comprehensive experimental results, conducted under three different traffic densities and various levels of human driver aggressiveness, show that our proposed MARL framework consistently outperforms several state-of-the-art benchmarks in terms of efficiency, safety and driver comfort.

Keyword: mapping

Multi-Resolution Elevation Mapping and Safe Landing Site Detection with Applications to Planetary Rotorcraft

  • Authors: Pascal Schoppmann, Pedro F. Proença, Jeff Delaune, Michael Pantic, Timo Hinzmann, Larry Matthies, Roland Siegwart, Roland Brockers
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.06271
  • Pdf link: https://arxiv.org/pdf/2111.06271
  • Abstract
    In this paper, we propose a resource-efficient approach to provide an autonomous UAV with an on-board perception method to detect safe, hazard-free landing sites during flights over complex 3D terrain. We aggregate 3D measurements acquired from a sequence of monocular images by a Structure-from-Motion approach into a local, robot-centric, multi-resolution elevation map of the overflown terrain, which fuses depth measurements according to their lateral surface resolution (pixel-footprint) in a probabilistic framework based on the concept of dynamic Level of Detail. Map aggregation only requires depth maps and the associated poses, which are obtained from an onboard Visual Odometry algorithm. An efficient landing site detection method then exploits the features of the underlying multi-resolution map to detect safe landing sites based on slope, roughness, and quality of the reconstructed terrain surface. The evaluation of the performance of the mapping and landing site detection modules are analyzed independently and jointly in simulated and real-world experiments in order to establish the efficacy of the proposed approach.

Keyword: localization

Clicking Matters:Towards Interactive Human Parsing

  • Authors: Yutong Gao, Liqian Liang, Congyan Lang, Songhe Feng, Yidong Li, Yunchao Wei
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.06162
  • Pdf link: https://arxiv.org/pdf/2111.06162
  • Abstract
    In this work, we focus on Interactive Human Parsing (IHP), which aims to segment a human image into multiple human body parts with guidance from users' interactions. This new task inherits the class-aware property of human parsing, which cannot be well solved by traditional interactive image segmentation approaches that are generally class-agnostic. To tackle this new task, we first exploit user clicks to identify different human parts in the given image. These clicks are subsequently transformed into semantic-aware localization maps, which are concatenated with the RGB image to form the input of the segmentation network and generate the initial parsing result. To enable the network to better perceive user's purpose during the correction process, we investigate several principal ways for the refinement, and reveal that random-sampling-based click augmentation is the best way for promoting the correction effectiveness. Furthermore, we also propose a semantic-perceiving loss (SP-loss) to augment the training, which can effectively exploit the semantic relationships of clicks for better optimization. To the best knowledge, this work is the first attempt to tackle the human parsing task under the interactive setting. Our IHP solution achieves 85% mIoU on the benchmark LIP, 80% mIoU on PASCAL-Person-Part and CIHP, 75% mIoU on Helen with only 1.95, 3.02, 2.84 and 1.09 clicks per class respectively. These results demonstrate that we can simply acquire high-quality human parsing masks with only a few human effort. We hope this work can motivate more researchers to develop data-efficient solutions to IHP in the future.
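
A minimal sketch of turning user clicks into a localization map, assuming one Gaussian bump per click (the paper's exact map construction may differ).

```python
import numpy as np

def clicks_to_localization_map(clicks, height, width, sigma=10.0):
    # clicks: iterable of (row, col) user clicks for one body part.
    # Each click becomes a Gaussian bump; the map is later concatenated with
    # the RGB image as an extra input channel.
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for cy, cx in clicks:
        bump = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, bump)
    return heatmap
```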

New submissions for Tue, 26 Oct 21

Keyword: SLAM

WOLF: A modular estimation framework for robotics based on factor graphs

  • Authors: Joan Sola, Joan Vallve-Navarro, Joaquim Casals, Jeremie Deray, Mederic Fourmy, Dinesh Atchuthan, Juan Andrade-Cetto
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2110.12919
  • Pdf link: https://arxiv.org/pdf/2110.12919
  • Abstract
    This paper introduces WOLF, a C++ estimation framework based on factor graphs and targeted at mobile robotics. WOLF extends the applications of factor graphs from the typical problems of SLAM and odometry to a general estimation framework able to handle self-calibration, model identification, or the observation of dynamic quantities other than localization. WOLF produces high throughput estimates at sensor rates up to the kHz range, which can be used for feedback control of highly dynamic robots such as humanoids, quadrupeds or aerial manipulators. Departing from the factor graph paradigm, the architecture of WOLF allows for a modular yet tightly-coupled estimator. Modularity is based on plugins that are loaded at runtime. Then, integration is achieved simply through YAML files, allowing users to configure a wide range of applications without the need of writing or compiling code. Synchronization of incoming data and their processing into a unique factor graph is achieved through a decentralized strategy of frame creation and joining. Most algorithmic assets are coded as abstract algorithms in base classes with varying levels of specialization. Overall, these assets allow for coherent processing and favor code reusability and scalability. WOLF can be interfaced with different solvers, and we provide a wrapper to Google Ceres. Likewise, we offer ROS integration, providing a generic ROS node and specialized packages with subscribers and publishers. WOLF is made publicly available and open to collaboration.

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

EgoNN: Egocentric Neural Network for Point Cloud Based 6DoF Relocalization at the City Scale

  • Authors: Jacek Komorowski, Monika Wysoczanska, Tomasz Trzcinski
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.12486
  • Pdf link: https://arxiv.org/pdf/2110.12486
  • Abstract
    The paper presents a deep neural network-based method for global and local descriptors extraction from a point cloud acquired by a rotating 3D LiDAR. The descriptors can be used for two-stage 6DoF relocalization. First, a coarse position is retrieved by finding candidates with the closest global descriptor in the database of geo-tagged point clouds. Then, the 6DoF pose between a query point cloud and a database point cloud is estimated by matching local descriptors and using a robust estimator such as RANSAC. Our method has a simple, fully convolutional architecture based on a sparse voxelized representation. It can efficiently extract a global descriptor and a set of keypoints with local descriptors from large point clouds with tens of thousands of points. Our code and pretrained models are publicly available on the project website.
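
The two-stage scheme (coarse retrieval by global descriptor, then local matching with a robust pose estimator) can be sketched as follows. The descriptors here are random placeholders and the second stage is only indicated in a comment, since the network and the RANSAC solver are outside the scope of this digest.

```python
import numpy as np

def retrieve_candidates(query_desc, db_descs, k=5):
    """Stage 1: coarse place recognition by nearest global descriptors."""
    d = np.linalg.norm(db_descs - query_desc[None, :], axis=1)
    return np.argsort(d)[:k]          # indices of the k closest database clouds

rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 256))     # 1000 geo-tagged global descriptors
query = rng.normal(size=256)
candidates = retrieve_candidates(query, database)
# Stage 2 (not shown): match local keypoint descriptors between the query cloud
# and each candidate cloud, then estimate the 6DoF pose with a robust solver
# such as RANSAC over the putative correspondences.
```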

Keyword: loop detection

There is no result

Keyword: autonomous driving

Encoding Integrated Decision and Control for Autonomous Driving with Mixed Traffic Flow

  • Authors: Yangang Ren, Jianhua Jiang, Jingliang Duan, Shengbo Eben Li, Dongjie Yu, Guojian Zhan
  • Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2110.12359
  • Pdf link: https://arxiv.org/pdf/2110.12359
  • Abstract
    Reinforcement learning (RL) has been widely adopted to make intelligent driving policies in autonomous driving due to the self-evolution ability and humanoid learning paradigm. Despite many elegant demonstrations of RL-enabled decision-making, current research mainly focuses on the pure vehicle driving environment while ignoring other traffic participants like bicycles and pedestrians. For urban roads, the interaction of mixed traffic flows leads to a quite dynamic and complex relationship, which poses great difficulty in learning a safe and intelligent policy. This paper proposes the encoding integrated decision and control (E-IDC) to handle complicated driving tasks with mixed traffic flows, which is composed of an encoding function to construct driving states, a value function to choose the optimal path as well as a policy function to output the control command of the ego vehicle. Specifically, the encoding function is capable of dealing with different types and varying numbers of traffic participants and extracting features from the original driving observation. Next, we design the training principle for the functions of E-IDC with RL algorithms by adding the gradient-based update rules and refine the safety constraints concerning the differences among participants. The verification is conducted on the intersection scenario with mixed traffic flows and the results show that E-IDC can enhance the driving performance, including the tracking performance and safety constraint satisfaction, by a large margin. The online application indicates that E-IDC can realize efficient and smooth driving in the complex intersection, guaranteeing intelligence and safety simultaneously.

Robustness via Uncertainty-aware Cycle Consistency

  • Authors: Uddeshya Upadhyay, Yanbei Chen, Zeynep Akata
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2110.12467
  • Pdf link: https://arxiv.org/pdf/2110.12467
  • Abstract
    Unpaired image-to-image translation refers to learning inter-image-domain mapping without corresponding image pairs. Existing methods learn deterministic mappings without explicitly modelling the robustness to outliers or predictive uncertainty, leading to performance degradation when encountering unseen perturbations at test time. To address this, we propose a novel probabilistic method based on Uncertainty-aware Generalized Adaptive Cycle Consistency (UGAC), which models the per-pixel residual by generalized Gaussian distribution, capable of modelling heavy-tailed distributions. We compare our model with a wide variety of state-of-the-art methods on various challenging tasks including unpaired image translation of natural images, using standard datasets, spanning autonomous driving, maps, facades, and also in medical imaging domain consisting of MRI. Experimental results demonstrate that our method exhibits stronger robustness towards unseen perturbations in test data. Code is released here: https://github.com/ExplainableML/UncertaintyAwareCycleConsistency.
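
The per-pixel residual model named above, a zero-mean generalized Gaussian with scale alpha and shape beta, has the closed-form negative log-likelihood sketched below. The variable names are ours, and the full UGAC objective (cycle-consistency terms, per-pixel predicted alpha and beta) is not reproduced here.

```python
import numpy as np
from scipy.special import gammaln

def gen_gaussian_nll(residual, alpha, beta):
    """Negative log-likelihood of a zero-mean generalized Gaussian.

    p(x) = beta / (2 * alpha * Gamma(1/beta)) * exp(-(|x|/alpha)**beta)
    beta = 2 recovers a Gaussian, beta = 1 a Laplacian; smaller beta gives
    heavier tails, which is what makes the model robust to outliers.
    """
    r = np.abs(residual)
    return (r / alpha) ** beta - np.log(beta) + np.log(2 * alpha) + gammaln(1.0 / beta)

# per-pixel loss for a toy 4x4 residual image with heavy-tailed settings
loss = gen_gaussian_nll(np.random.randn(4, 4), alpha=1.0, beta=0.8).mean()
```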

Complete Test of Synthesised Safety Supervisors for Robots and Autonomous Systems

  • Authors: Mario Gleirscher (University of Bremen), Jan Peleska (University of Bremen)
  • Subjects: Software Engineering (cs.SE); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2110.12589
  • Pdf link: https://arxiv.org/pdf/2110.12589
  • Abstract
    Verified controller synthesis uses world models that comprise all potential behaviours of humans, robots, further equipment, and the controller to be synthesised. A world model enables quantitative risk assessment, for example, by stochastic model checking. Such a model describes a range of controller behaviours some of which -- when implemented correctly -- guarantee that the overall risk in the actual world is acceptable, provided that the stochastic assumptions have been made to the safe side. Synthesis then selects an acceptable-risk controller behaviour. However, because of crossing abstraction, formalism, and tool boundaries, verified synthesis for robots and autonomous systems has to be accompanied by rigorous testing. In general, standards and regulations for safety-critical systems require testing as a key element to obtain certification credit before entry into service. This work-in-progress paper presents an approach to the complete testing of synthesised supervisory controllers that enforce safety properties in domains such as human-robot collaboration and autonomous driving. Controller code is generated from the selected controller behaviour. The code generator, however, is hard, if not infeasible, to verify in a formal and comprehensive way. Instead, utilising testing, an abstract test reference is generated, a symbolic finite state machine with simpler semantics than code semantics. From this reference, a complete test suite is derived and applied to demonstrate the observational equivalence between the synthesised abstract test reference and the generated concrete controller code running on a control system platform.

2nd Place Solution for SODA10M Challenge 2021 -- Continual Detection Track

  • Authors: Manoj Acharya, Christopher Kanan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2110.13064
  • Pdf link: https://arxiv.org/pdf/2110.13064
  • Abstract
    In this technical report, we present our approaches for the continual object detection track of the SODA10M challenge. We adapt ResNet50-FPN as the baseline and try several improvements for the final submission model. We find that a task-specific replay scheme, learning rate scheduling, model calibration, and using the original image scale help to improve performance for both large and small objects in images. Our team 'hypertune28' secured the second position among 52 participants in the challenge. This work will be presented at the ICCV 2021 Workshop on Self-supervised Learning for Next-Generation Industry-level Autonomous Driving (SSLAD).

Keyword: mapping

HWTool: Fully Automatic Mapping of an Extensible C++ Image Processing Language to Hardware

  • Authors: James Hegarty, Omar Eldash, Amr Suleiman, Armin Alaghi
  • Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Programming Languages (cs.PL)
  • Arxiv link: https://arxiv.org/abs/2110.12106
  • Pdf link: https://arxiv.org/pdf/2110.12106
  • Abstract
    Implementing image processing algorithms using FPGAs or ASICs can improve energy efficiency by orders of magnitude over optimized CPU, DSP, or GPU code. These efficiency improvements are crucial for enabling new applications on mobile power-constrained devices, such as cell phones or AR/VR headsets. Unfortunately, custom hardware is commonly implemented using a waterfall process with time-intensive manual mapping and optimization phases. Thus, it can take years for a new algorithm to make it all the way from an algorithm design to shipping silicon. Recent improvements in hardware design tools, such as C-to-gates High-Level Synthesis (HLS), can reduce design time, but still require manual tuning from hardware experts. In this paper, we present HWTool, a novel system for automatically mapping image processing and computer vision algorithms to hardware. Our system maps between two domains: HWImg, an extensible C++ image processing library containing common image processing and parallel computing operators, and Rigel2, a library of optimized hardware implementations of HWImg's operators and backend Verilog compiler. We show how to automatically compile HWImg to Rigel2, by solving for interfaces, hardware sizing, and FIFO buffer allocation. Finally, we map full-scale image processing applications like convolution, optical flow, depth from stereo, and feature descriptors to FPGA using our system. On these examples, HWTool requires on average only 11% more FPGA area than hand-optimized designs (with manual FIFO allocation), and 33% more FPGA area than hand-optimized designs with automatic FIFO allocation, and performs similarly to HLS.

Robustness via Uncertainty-aware Cycle Consistency

  • Authors: Uddeshya Upadhyay, Yanbei Chen, Zeynep Akata
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2110.12467
  • Pdf link: https://arxiv.org/pdf/2110.12467
  • Abstract
    Unpaired image-to-image translation refers to learning inter-image-domain mapping without corresponding image pairs. Existing methods learn deterministic mappings without explicitly modelling the robustness to outliers or predictive uncertainty, leading to performance degradation when encountering unseen perturbations at test time. To address this, we propose a novel probabilistic method based on Uncertainty-aware Generalized Adaptive Cycle Consistency (UGAC), which models the per-pixel residual by generalized Gaussian distribution, capable of modelling heavy-tailed distributions. We compare our model with a wide variety of state-of-the-art methods on various challenging tasks including unpaired image translation of natural images, using standard datasets, spanning autonomous driving, maps, facades, and also in medical imaging domain consisting of MRI. Experimental results demonstrate that our method exhibits stronger robustness towards unseen perturbations in test data. Code is released here: https://github.com/ExplainableML/UncertaintyAwareCycleConsistency.

GeoSneakPique: Visual Autocompletion for Geospatial Queries

  • Authors: Vidya Setlur, Sarah Battersby, Tracy Wong
  • Subjects: Human-Computer Interaction (cs.HC)
  • Arxiv link: https://arxiv.org/abs/2110.12596
  • Pdf link: https://arxiv.org/pdf/2110.12596
  • Abstract
    How many crimes occurred in the city center? And exactly which part of town is the 'city center'? While location is at the heart of many data questions, geographic location can be difficult to specify in natural language (NL) queries. This is especially true when working with fuzzy cognitive regions or regions that may be defined based on data distributions instead of absolute administrative location (e.g., state, country). GeoSneakPique presents a novel method for using a mapping widget to support the NL query process, allowing users to specify location via direct manipulation with data-driven guidance on spatial distributions to help select the area of interest. Users receive feedback to help them evaluate and refine their spatial selection interactively and can save spatial definitions for re-use in subsequent queries. We conduct a qualitative evaluation of the GeoSneakPique that indicates the usefulness of the interface as well as opportunities for better supporting geospatial workflows in visual analysis tasks employing cognitive regions.

Learning Stochastic Shortest Path with Linear Function Approximation

  • Authors: Yifei Min, Jiafan He, Tianhao Wang, Quanquan Gu
  • Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2110.12727
  • Pdf link: https://arxiv.org/pdf/2110.12727
  • Abstract
    We study the stochastic shortest path (SSP) problem in reinforcement learning with linear function approximation, where the transition kernel is represented as a linear mixture of unknown models. We call this class of SSP problems the linear mixture SSP. We propose a novel algorithm for learning the linear mixture SSP, which can attain a $\tilde O(d B_{\star}^{1.5}\sqrt{K/c_{\min}})$ regret. Here $K$ is the number of episodes, $d$ is the dimension of the feature mapping in the mixture model, $B_{\star}$ bounds the expected cumulative cost of the optimal policy, and $c_{\min}>0$ is the lower bound of the cost function. Our algorithm also applies to the case when $c_{\min} = 0$, where a $\tilde O(K^{2/3})$ regret is guaranteed. To the best of our knowledge, this is the first algorithm with a sublinear regret guarantee for learning linear mixture SSP. In complement to the regret upper bounds, we also prove a lower bound of $\Omega(d B_{\star} \sqrt{K})$, which nearly matches our upper bound.
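
For quick reference, the three bounds quoted in the abstract can be collected in one display; this only restates the abstract and adds no new result.

```latex
\[
\underbrace{\tilde O\!\left(d\,B_{\star}^{1.5}\sqrt{K/c_{\min}}\right)}_{\text{regret upper bound, } c_{\min}>0}
\qquad
\underbrace{\tilde O\!\left(K^{2/3}\right)}_{\text{upper bound, } c_{\min}=0}
\qquad
\underbrace{\Omega\!\left(d\,B_{\star}\sqrt{K}\right)}_{\text{lower bound}}
\]
```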

Spread2RML: Constructing Knowledge Graphs by Predicting RML Mappings on Messy Spreadsheets

  • Authors: Markus Schröder, Christian Jilek, Andreas Dengel
  • Subjects: Databases (cs.DB)
  • Arxiv link: https://arxiv.org/abs/2110.12829
  • Pdf link: https://arxiv.org/pdf/2110.12829
  • Abstract
    The RDF Mapping Language (RML) allows to map semi-structured data to RDF knowledge graphs. Besides CSV, JSON and XML, this also includes the mapping of spreadsheet tables. Since spreadsheets have a complex data model and can become rather messy, their mapping creation tends to be very time consuming. In order to reduce such efforts, this paper presents Spread2RML which predicts RML mappings on messy spreadsheets. This is done with an extensible set of RML object map templates which are applied for each column based on heuristics. In our evaluation, three datasets are used ranging from very messy synthetic data to spreadsheets from data.gov which are less messy. We obtained first promising results especially with regard to our approach being fully automatic and dealing with rather messy data.
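
The idea of applying "object map templates per column based on heuristics" can be illustrated with a toy value-inspection heuristic. The template names, regular expressions and thresholds below are invented for illustration and are not the ones shipped with Spread2RML.

```python
import re

def guess_object_map(values):
    """Pick a (hypothetical) RML object-map template for one spreadsheet column."""
    values = [v for v in values if str(v).strip()]
    if not values:
        return "empty"
    if all(re.fullmatch(r"-?\d+", str(v)) for v in values):
        return "xsd:integer literal"
    if all(re.fullmatch(r"-?\d+([.,]\d+)?", str(v)) for v in values):
        return "xsd:decimal literal"
    if all(re.fullmatch(r"https?://\S+", str(v)) for v in values):
        return "IRI reference"
    if sum("," in str(v) or ";" in str(v) for v in values) > len(values) / 2:
        return "multi-valued literal (split on separator)"
    return "xsd:string literal"

print(guess_object_map(["12", "7", "42"]))       # -> xsd:integer literal
```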

Accelerating Compact Fractals with Tensor Core GPUs

  • Authors: Felipe A. Quezada, Cristóbal A. Navarro
  • Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
  • Arxiv link: https://arxiv.org/abs/2110.12952
  • Pdf link: https://arxiv.org/pdf/2110.12952
  • Abstract
    This work presents a GPU thread mapping approach that allows doing fast parallel stencil-like computations on discrete fractals using their compact representation. The intuition behind is to employ two GPU tensor-core accelerated thread maps, $\lambda(\omega)$ and $\nu(\omega)$, which act as threadspace-to-dataspace and dataspace-to-threadspace functions, respectively. By combining these maps, threads can access compact space and interact with their neighbors. The cost of the maps is $\mathcal{O}(\log \log(n))$ time, with $n$ being the side of a $n \times n$ embedding for the fractal in its expanded form. The technique works on any fractal that belongs to the Non-overlapping-Bounding-Boxes (NBB) class of discrete fractals, and can be extended to three dimensions as well. Results using an A100 GPU on the Sierpinski Triangle as a case study show up to $\sim11\times$ of speedup and a memory usage reduction of $234\times$ with respect to a Bounding Box approach. These results show that the proposed compact approach can allow the scientific community to tackle larger problems that did not fit in GPU memory before, and run even faster than a bounding box approach.
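
A CPU-side analogue of the threadspace-to-dataspace map $\lambda(\omega)$ is sketched below for the Sierpinski triangle, taken here in its right-angled form where fractal cells satisfy x & y == 0. The paper's contribution is evaluating such maps with tensor cores in O(log log n) time; this digit-by-digit loop only shows what the map computes, and the sub-block offsets are an assumption about the fractal's layout.

```python
def lambda_map(k, levels):
    """Map a compact Sierpinski index k in [0, 3**levels) to (x, y) in a
    2**levels x 2**levels embedding (right-angled Sierpinski triangle)."""
    x = y = 0
    for level in range(levels):
        block = 1 << (levels - 1 - level)            # side of the sub-block at this level
        digit = (k // 3 ** (levels - 1 - level)) % 3  # which of the 3 sub-blocks
        if digit == 1:
            x += block
        elif digit == 2:
            y += block
    return x, y

# all 3**levels cells satisfy x & y == 0, i.e. they lie on the fractal
cells = [lambda_map(k, levels=3) for k in range(3 ** 3)]
assert all((x & y) == 0 for x, y in cells) and len(set(cells)) == len(cells)
```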

Keyword: localization

Feasibility of Remote Landmark Identification for Cricothyrotomy Using Robotic Palpation

  • Authors: Neel Shihora, Rashid M. Yasin, Ryan Walsh, Nabil Simaan
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2110.12078
  • Pdf link: https://arxiv.org/pdf/2110.12078
  • Abstract
    Cricothyrotomy is a life-saving emergency intervention that secures an alternate airway route after a neck injury or obstruction. The procedure starts with identifying the correct location (the cricothyroid membrane) for creating an incision to insert an endotracheal tube. This location is determined using a combination of visual and palpation cues. Enabling robot-assisted remote cricothyrotomy may extend this life-saving procedure to injured soldiers or patients who may not be readily accessible for on-site intervention during search-and-rescue scenarios. As a first step towards achieving this goal, this paper explores the feasibility of palpation-assisted landmark identification for cricothyrotomy. Using a cricothyrotomy training simulator, we explore several alternatives for in-situ remote localization of the cricothyroid membrane. These alternatives include a) unaided telemanipulation, b) telemanipulation with direct force feedback, c) telemanipulation with superimposed motion excitation for online stiffness estimation and display, and d) fully autonomous palpation scan initialized based on the user's understanding of key anatomical landmarks. Using the manually digitized cricothyroid membrane location as ground truth, we compare these four methods for accuracy and repeatability of identifying the landmark for cricothyrotomy, time of completion, and ease of use. These preliminary results suggest that the accuracy of remote cricothyrotomy landmark identification is improved when the user is aided with visual and force cues. They also show that, with proper user initialization, landmark identification using remote palpation is feasible - therefore satisfying a key pre-requisite for future robotic solutions for remote cricothyrotomy.

EgoNN: Egocentric Neural Network for Point Cloud Based 6DoF Relocalization at the City Scale

  • Authors: Jacek Komorowski, Monika Wysoczanska, Tomasz Trzcinski
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.12486
  • Pdf link: https://arxiv.org/pdf/2110.12486
  • Abstract
    The paper presents a deep neural network-based method for global and local descriptors extraction from a point cloud acquired by a rotating 3D LiDAR. The descriptors can be used for two-stage 6DoF relocalization. First, a coarse position is retrieved by finding candidates with the closest global descriptor in the database of geo-tagged point clouds. Then, the 6DoF pose between a query point cloud and a database point cloud is estimated by matching local descriptors and using a robust estimator such as RANSAC. Our method has a simple, fully convolutional architecture based on a sparse voxelized representation. It can efficiently extract a global descriptor and a set of keypoints with local descriptors from large point clouds with tens of thousands of points. Our code and pretrained models are publicly available on the project website.

HSDB-instrument: Instrument Localization Database for Laparoscopic and Robotic Surgeries

  • Authors: Jihun Yoon, Jiwon Lee, Sunghwan Heo, Hayeong Yu, Jayeon Lim, Chi Hyun Song, SeulGi Hong, Seungbum Hong, Bokyung Park, SungHyun Park, Woo Jin Hyung, Min-Kook Choi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.12555
  • Pdf link: https://arxiv.org/pdf/2110.12555
  • Abstract
    Automated surgical instrument localization is an important technology for understanding the surgical process and analyzing it in order to provide meaningful guidance during surgery, or a surgical index after surgery, to the surgeon. We introduce a new dataset that reflects the kinematic characteristics of surgical instruments for automated surgical instrument localization of surgical videos. The hSDB(hutom Surgery DataBase)-instrument dataset consists of instrument localization information from 24 cases of laparoscopic cholecystectomy and 24 cases of robotic gastrectomy. Localization information for all instruments is provided in the form of a bounding box for object detection. To handle the class imbalance problem between instruments, synthesized instruments modeled in Unity for 3D models are included as training data. Besides, for 3D instrument data, a polygon annotation is provided to enable instance segmentation of the tool. To reflect the kinematic characteristics of all instruments, they are annotated with head and body parts for laparoscopic instruments, and with head, wrist, and body parts for robotic instruments separately. Annotation data of assistive tools (specimen bag, needle, etc.) that are frequently used for surgery are also included. Moreover, we provide statistical information on the hSDB-instrument dataset and the baseline localization performances of the object detection networks trained by the MMDetection library and resulting analyses.

Industrial Scene Text Detection with Refined Feature-attentive Network

  • Authors: Tongkun Guan, Chaochen Gu, Changsheng Lu, Jingzheng Tu, Qi Feng, Kaijie Wu, Xinping Guan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.12663
  • Pdf link: https://arxiv.org/pdf/2110.12663
  • Abstract
    Detecting the marking characters of industrial metal parts remains challenging due to low visual contrast, uneven illumination, corroded character structures, and cluttered background of metal part images. Affected by these factors, bounding boxes generated by most existing methods locate low-contrast text areas inaccurately. In this paper, we propose a refined feature-attentive network (RFN) to solve the inaccurate localization problem. Specifically, we design a parallel feature integration mechanism to construct an adaptive feature representation from multi-resolution features, which enhances the perception of multi-scale texts at each scale-specific level to generate a high-quality attention map. Then, an attentive refinement network is developed by the attention map to rectify the location deviation of candidate boxes. In addition, a re-scoring mechanism is designed to select text boxes with the best rectified location. Moreover, we construct two industrial scene text datasets, including a total of 102156 images and 1948809 text instances with various character structures and metal parts. Extensive experiments on our dataset and four public datasets demonstrate that our proposed method achieves the state-of-the-art performance.

Instance-Conditional Knowledge Distillation for Object Detection

  • Authors: Zijian Kang, Peizhen Zhang, Xiangyu Zhang, Jian Sun, Nanning Zheng
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.12724
  • Pdf link: https://arxiv.org/pdf/2110.12724
  • Abstract
    Despite the success of Knowledge Distillation (KD) on image classification, it is still challenging to apply KD on object detection due to the difficulty in locating knowledge. In this paper, we propose an instance-conditional distillation framework to find desired knowledge. To locate knowledge of each instance, we use observed instances as condition information and formulate the retrieval process as an instance-conditional decoding process. Specifically, information of each instance that specifies a condition is encoded as query, and teacher's information is presented as key, we use the attention between query and key to measure the correlation, formulated by the transformer decoder. To guide this module, we further introduce an auxiliary task that directs to instance localization and identification, which are fundamental for detection. Extensive experiments demonstrate the efficacy of our method: we observe impressive improvements under various settings. Notably, we boost RetinaNet with ResNet-50 backbone from 37.4 to 40.7 mAP (+3.3) under 1x schedule, that even surpasses the teacher (40.4 mAP) with ResNet-101 backbone under 3x schedule. Code will be released soon.
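
The retrieval step described above, instance-conditioned queries attending over teacher feature keys, is standard scaled dot-product attention at its core; a minimal numpy version follows. Shapes and names are assumptions, and the real method wraps this in transformer decoder layers plus the auxiliary localization and identification task.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (num_instances, d) instance conditions, K/V: (num_pixels, d) teacher features."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])              # instance-to-pixel correlation
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over teacher locations
    return weights @ V                                   # per-instance aggregated knowledge

rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 64))           # 8 observed instances used as conditions
keys = values = rng.normal(size=(400, 64))   # flattened 20x20 teacher feature map
distilled = scaled_dot_product_attention(queries, keys, values)  # (8, 64)
```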

A Deep Reinforcement Learning Approach for Audio-based Navigation and Audio Source Localization in Multi-speaker Environments

  • Authors: Petros Giannakopoulos, Aggelos Pikrakis, Yannis Cotronis
  • Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2110.12778
  • Pdf link: https://arxiv.org/pdf/2110.12778
  • Abstract
    In this work we apply deep reinforcement learning to the problems of navigating a three-dimensional environment and inferring the locations of human speaker audio sources within, in the case where the only available information is the raw sound from the environment, as a simulated human listener placed in the environment would hear it. For this purpose we create two virtual environments using the Unity game engine, one presenting an audio-based navigation problem and one presenting an audio source localization problem. We also create an autonomous agent based on PPO online reinforcement learning algorithm and attempt to train it to solve these environments. Our experiments show that our agent achieves adequate performance and generalization ability in both environments, measured by quantitative metrics, even when a limited amount of training data are available or the environment parameters shift in ways not encountered during training. We also show that a degree of agent knowledge transfer is possible between the environments.

Multi-vehicle experiment platform: A Digital Twin Realization Method

  • Authors: Chunying Yang, Jianghong Dong, Qing Xu, Mengchi Cai, Hongmao Qin, Jianqiang Wang, Keqiang Li
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2110.12859
  • Pdf link: https://arxiv.org/pdf/2110.12859
  • Abstract
    With the development of V2X technology, multi-vehicle cooperative control has been widely studied. However, field testing is rarely conducted due to financial and safety considerations. To solve this problem, this study proposes a digital twin method to carry out multi-vehicle experiments, which uses a combination of physical and virtual vehicles to perform coordination tasks. To confirm the effectiveness of this method, a prototype system is developed, which consists of a sand table testbed, its twin system and the cloud. Several aspects are quantified to describe system performance, including time delay and localization accuracy. Finally, a vehicle-level experiment in a platoon scenario is carried out, and the experimental results confirm the effectiveness of this method.

WOLF: A modular estimation framework for robotics based on factor graphs

  • Authors: Joan Sola, Joan Vallve-Navarro, Joaquim Casals, Jeremie Deray, Mederic Fourmy, Dinesh Atchuthan, Juan Andrade-Cetto
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2110.12919
  • Pdf link: https://arxiv.org/pdf/2110.12919
  • Abstract
    This paper introduces WOLF, a C++ estimation framework based on factor graphs and targeted at mobile robotics. WOLF extends the applications of factor graphs from the typical problems of SLAM and odometry to a general estimation framework able to handle self-calibration, model identification, or the observation of dynamic quantities other than localization. WOLF produces high throughput estimates at sensor rates up to the kHz range, which can be used for feedback control of highly dynamic robots such as humanoids, quadrupeds or aerial manipulators. Departing from the factor graph paradigm, the architecture of WOLF allows for a modular yet tightly-coupled estimator. Modularity is based on plugins that are loaded at runtime. Then, integration is achieved simply through YAML files, allowing users to configure a wide range of applications without the need of writing or compiling code. Synchronization of incoming data and their processing into a unique factor graph is achieved through a decentralized strategy of frame creation and joining. Most algorithmic assets are coded as abstract algorithms in base classes with varying levels of specialization. Overall, these assets allow for coherent processing and favor code reusability and scalability. WOLF can be interfaced with different solvers, and we provide a wrapper to Google Ceres. Likewise, we offer ROS integration, providing a generic ROS node and specialized packages with subscribers and publishers. WOLF is made publicly available and open to collaboration.

TAPL: Dynamic Part-based Visual Tracking via Attention-guided Part Localization

  • Authors: Wei han, Hantao Huang, Xiaoxi Yu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.13027
  • Pdf link: https://arxiv.org/pdf/2110.13027
  • Abstract
    Holistic object representation-based trackers suffer from performance drop under large appearance change such as deformation and occlusion. In this work, we propose a dynamic part-based tracker and constantly update the target part representation to adapt to object appearance change. Moreover, we design an attention-guided part localization network to directly predict the target part locations, and determine the final bounding box with the distribution of target parts. Our proposed tracker achieves promising results on various benchmarks: VOT2018, OTB100 and GOT-10k

Diagnosing Errors in Video Relation Detectors

  • Authors: Shuo Chen, Pascal Mettes, Cees G.M. Snoek
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.13110
  • Pdf link: https://arxiv.org/pdf/2110.13110
  • Abstract
    Video relation detection forms a new and challenging problem in computer vision, where subjects and objects need to be localized spatio-temporally and a predicate label needs to be assigned if and only if there is an interaction between the two. Despite recent progress in video relation detection, overall performance is still marginal and it remains unclear what the key factors are towards solving the problem. Following examples set in the object detection and action localization literature, we perform a deep dive into the error diagnosis of current video relation detection approaches. We introduce a diagnostic tool for analyzing the sources of detection errors. Our tool evaluates and compares current approaches beyond the single scalar metric of mean Average Precision by defining different error types specific to video relation detection, used for false positive analyses. Moreover, we examine different factors of influence on the performance in a false negative analysis, including relation length, number of subject/object/predicate instances, and subject/object size. Finally, we present the effect on video relation performance when considering an oracle fix for each error type. On two video relation benchmarks, we show where current approaches excel and fall short, allowing us to pinpoint the most important future directions in the field. The tool is available at \url{https://github.com/shanshuo/DiagnoseVRD}.

New submissions for Thu, 24 Jun 21

Keyword: SLAM

Collaborative Visual Inertial SLAM for Multiple Smart Phones

  • Authors: Jialing Liu, Ruyu Liu, Kaiqi Chen, Jianhua Zhang, Dongyan Guo
  • Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.12186
  • Pdf link: https://arxiv.org/pdf/2106.12186
  • Abstract
    The efficiency and accuracy of mapping are crucial in large-scene and long-term AR applications. Multi-agent cooperative SLAM is the precondition of multi-user AR interaction. The cooperation of multiple smart phones has the potential to improve efficiency and robustness of task completion and can complete tasks that a single agent cannot do. However, it depends on robust communication, efficient location detection, robust mapping, and efficient information sharing among agents. We propose a multi-intelligence collaborative monocular visual-inertial SLAM deployed on multiple iOS mobile devices with a centralized architecture. Each agent can independently explore the environment, run a visual-inertial odometry module online, and then send all the measurement information to a central server with higher computing resources. The server manages all the information received, detects overlapping areas, merges and optimizes the map, and shares information with the agents when needed. We have verified the performance of the system in public datasets and real environments. The accuracy of mapping and fusion of the proposed system is comparable to VINS-Mono which requires higher computing resources.

Keyword: VINS

Collaborative Visual Inertial SLAM for Multiple Smart Phones

  • Authors: Jialing Liu, Ruyu Liu, Kaiqi Chen, Jianhua Zhang, Dongyan Guo
  • Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.12186
  • Pdf link: https://arxiv.org/pdf/2106.12186
  • Abstract
    The efficiency and accuracy of mapping are crucial in large-scene and long-term AR applications. Multi-agent cooperative SLAM is the precondition of multi-user AR interaction. The cooperation of multiple smart phones has the potential to improve efficiency and robustness of task completion and can complete tasks that a single agent cannot do. However, it depends on robust communication, efficient location detection, robust mapping, and efficient information sharing among agents. We propose a multi-intelligence collaborative monocular visual-inertial SLAM deployed on multiple iOS mobile devices with a centralized architecture. Each agent can independently explore the environment, run a visual-inertial odometry module online, and then send all the measurement information to a central server with higher computing resources. The server manages all the information received, detects overlapping areas, merges and optimizes the map, and shares information with the agents when needed. We have verified the performance of the system in public datasets and real environments. The accuracy of mapping and fusion of the proposed system is comparable to VINS-Mono which requires higher computing resources.

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

FusionPainting: Multimodal Fusion with Adaptive Attention for 3D Object Detection

  • Authors: Shaoqing Xu, Dingfu Zhou, Jin Fang, Junbo Yin, Zhou Bin, Liangjun Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.12449
  • Pdf link: https://arxiv.org/pdf/2106.12449
  • Abstract
    Accurate detection of obstacles in 3D is an essential task for autonomous driving and intelligent transportation. In this work, we propose a general multimodal fusion framework FusionPainting to fuse the 2D RGB image and 3D point clouds at a semantic level for boosting the 3D object detection task. Specifically, the FusionPainting framework consists of three main modules: a multi-modal semantic segmentation module, an adaptive attention-based semantic fusion module, and a 3D object detector. First, semantic information is obtained for 2D images and 3D Lidar point clouds based on 2D and 3D segmentation approaches. Then the segmentation results from different sensors are adaptively fused based on the proposed attention-based semantic fusion module. Finally, the point clouds painted with the fused semantic label are sent to the 3D detector for obtaining the 3D object detection results. The effectiveness of the proposed framework has been verified on the large-scale nuScenes detection benchmark by comparing it with three different baselines. The experimental results show that the fusion strategy can significantly improve the detection performance compared to the methods using only point clouds, and the methods using point clouds only painted with 2D segmentation information. Furthermore, the proposed approach outperforms other state-of-the-art methods on the nuScenes testing benchmark.
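
The "painting" step that feeds the 3D detector, appending fused per-point semantic scores to the point cloud, can be sketched as a simple weighted sum plus concatenation. The adaptive attention module that actually produces the fusion weights is not reproduced; the uniform weights below are placeholders.

```python
import numpy as np

def paint_points(points, scores_2d, scores_3d, w_2d):
    """Append fused per-point semantic scores to an (N, 3) point cloud.

    scores_2d / scores_3d: (N, C) class scores from the image and LiDAR
    segmentation branches; w_2d: (N, 1) fusion weight, here a placeholder
    for the attention-based weights learned by the fusion module.
    """
    fused = w_2d * scores_2d + (1.0 - w_2d) * scores_3d
    return np.concatenate([points, fused], axis=1)       # (N, 3 + C)

N, C = 1000, 10
pts = np.random.rand(N, 3)
painted = paint_points(pts, np.random.rand(N, C), np.random.rand(N, C),
                       w_2d=np.full((N, 1), 0.5))
```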

Keyword: loop detection

There is no result

Keyword: autonomous driving

Imitation Learning: Progress, Taxonomies and Opportunities

  • Authors: Boyuan Zheng, Sunny Verma, Jianlong Zhou, Ivor Tsang, Fang Chen
  • Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.12177
  • Pdf link: https://arxiv.org/pdf/2106.12177
  • Abstract
    Imitation learning aims to extract knowledge from human experts' demonstrations or artificially created agents in order to replicate their behaviors. Its success has been demonstrated in areas such as video games, autonomous driving, robotic simulations and object manipulation. However, this replicating process can be problematic: for example, performance is highly dependent on the demonstration quality, and most trained agents are limited to performing well only in task-specific environments. In this survey, we provide a systematic review on imitation learning. We first introduce the background knowledge from development history and preliminaries, followed by presenting different taxonomies within Imitation Learning and key milestones of the field. We then detail challenges in learning strategies and present research opportunities with learning policies from suboptimal demonstrations, voice instructions and other associated optimization schemes.

Uncertainty-Aware Model-Based Reinforcement Learning with Application to Autonomous Driving

  • Authors: Jingda Wu, Zhiyu Huang, Chen Lv
  • Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2106.12194
  • Pdf link: https://arxiv.org/pdf/2106.12194
  • Abstract
    To further improve the learning efficiency and performance of reinforcement learning (RL), in this paper we propose a novel uncertainty-aware model-based RL (UA-MBRL) framework, and then implement and validate it in autonomous driving under various task scenarios. First, an action-conditioned ensemble model with the ability of uncertainty assessment is established as the virtual environment model. Then, a novel uncertainty-aware model-based RL framework is developed based on the adaptive truncation approach, providing virtual interactions between the agent and environment model, and improving RL's training efficiency and performance. The developed algorithms are then implemented in end-to-end autonomous vehicle control tasks, validated and compared with state-of-the-art methods under various driving scenarios. The validation results suggest that the proposed UA-MBRL method surpasses the existing model-based and model-free RL approaches, in terms of learning efficiency and achieved performance. The results also demonstrate the good adaptiveness and robustness of the proposed method under various autonomous driving scenarios.
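
The two ingredients named above, ensemble disagreement as an uncertainty signal and truncating virtual rollouts when it grows too large, can be sketched as follows. The dynamics models are stand-in callables and the fixed threshold is a simplification of the paper's adaptive truncation scheme.

```python
import numpy as np

def rollout_with_truncation(models, state, policy, horizon, max_std=0.5):
    """Roll out a virtual trajectory, stopping when the ensemble disagrees."""
    trajectory = []
    for _ in range(horizon):
        action = policy(state)
        preds = np.stack([m(state, action) for m in models])   # (ensemble, state_dim)
        uncertainty = preds.std(axis=0).mean()                  # disagreement as epistemic proxy
        if uncertainty > max_std:                               # truncation rule (simplified)
            break
        state = preds.mean(axis=0)
        trajectory.append((state, action))
    return trajectory

# toy usage with random linear "dynamics models" and a zero policy
rng = np.random.default_rng(0)
models = [lambda s, a, A=rng.normal(scale=0.1, size=(4, 4)): s + A @ s for _ in range(5)]
traj = rollout_with_truncation(models, np.ones(4), lambda s: np.zeros(2), horizon=20)
```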

An investigation into the state-of-the-practice autonomous driving testing

  • Authors: Guannan Lou, Yao Deng, Xi Zheng, Tianyi Zhang, Mengshi Zhang
  • Subjects: Software Engineering (cs.SE)
  • Arxiv link: https://arxiv.org/abs/2106.12233
  • Pdf link: https://arxiv.org/pdf/2106.12233
  • Abstract
    Autonomous driving shows great potential to reform modern transportation and its safety is attracting much attention from the public. Autonomous driving systems generally include deep neural networks (DNNs) for gaining better performance (e.g., accuracy on object detection and trajectory prediction). However, compared with traditional software systems, this new paradigm (i.e., program + DNNs) makes software testing more difficult. Recently, the software engineering community has spent significant effort in developing new testing methods for autonomous driving systems. However, it is not clear to what extent those testing methods have addressed the needs of industrial practitioners of autonomous driving. To fill this gap, in this paper, we present the first comprehensive study to identify the current practices and needs of testing autonomous driving systems in industry. We conducted semi-structured interviews with developers from 10 autonomous driving companies and surveyed 100 developers who have worked on autonomous driving systems. Through thematic analysis of interview and questionnaire data, we identified five urgent needs of testing autonomous driving systems from industry. We further analyzed the limitations of existing testing methods to address those needs and proposed several future directions for software testing researchers.

Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images

  • Authors: Libo Wang, Rui Li, Dongzhi Wang, Chenxi Duan, Teng Wang, Xiaoliang Meng
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.12413
  • Pdf link: https://arxiv.org/pdf/2106.12413
  • Abstract
    Semantic segmentation from very fine resolution (VFR) urban scene images plays a significant role in several application scenarios including autonomous driving, land cover classification, and urban planning, etc. However, the tremendous details contained in the VFR image severely limit the potential of the existing deep learning approaches. More seriously, the considerable variations in scale and appearance of objects further deteriorate the representational capacity of those semantic segmentation methods, leading to the confusion of adjacent objects. Addressing such issues represents a promising research field in the remote sensing community, which paves the way for scene-level landscape pattern analysis and decision making. In this manuscript, we propose a bilateral awareness network (BANet) which contains a dependency path and a texture path to fully capture the long-range relationships and fine-grained details in VFR images. Specifically, the dependency path is conducted based on the ResT, a novel Transformer backbone with memory-efficient multi-head self-attention, while the texture path is built on the stacked convolution operation. Besides, using the linear attention mechanism, a feature aggregation module (FAM) is designed to effectively fuse the dependency features and texture features. Extensive experiments conducted on the three large-scale urban scene image segmentation datasets, i.e., ISPRS Vaihingen dataset, ISPRS Potsdam dataset, and UAVid dataset, demonstrate the effectiveness of our BANet. Specifically, a 64.6% mIoU is achieved on the UAVid dataset.
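
The "linear attention mechanism" used in the feature aggregation module is, in its common form, the kernel trick that computes phi(Q)(phi(K)^T V) instead of softmax(QK^T)V, replacing the quadratic cost in the number of pixels with a linear one. Below is a generic numpy version; it is not the exact FAM, and phi = ELU + 1 is an assumption borrowed from the linear-attention literature.

```python
import numpy as np

def elu_plus_one(x):
    return np.where(x > 0, x + 1.0, np.exp(x))            # positive feature map phi

def linear_attention(Q, K, V):
    """O(N * d^2) attention: phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1)."""
    Qp, Kp = elu_plus_one(Q), elu_plus_one(K)
    context = Kp.T @ V                                     # (d, d_v), independent of N
    norm = Qp @ Kp.sum(axis=0)                             # (N,) normalization term
    return (Qp @ context) / norm[:, None]

N, d = 4096, 64                                            # N = H*W flattened pixels
rng = np.random.default_rng(0)
out = linear_attention(rng.normal(size=(N, d)), rng.normal(size=(N, d)),
                       rng.normal(size=(N, d)))
```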

FusionPainting: Multimodal Fusion with Adaptive Attention for 3D Object Detection

  • Authors: Shaoqing Xu, Dingfu Zhou, Jin Fang, Junbo Yin, Zhou Bin, Liangjun Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.12449
  • Pdf link: https://arxiv.org/pdf/2106.12449
  • Abstract
    Accurate detection of obstacles in 3D is an essential task for autonomous driving and intelligent transportation. In this work, we propose a general multimodal fusion framework FusionPainting to fuse the 2D RGB image and 3D point clouds at a semantic level for boosting the 3D object detection task. Specifically, the FusionPainting framework consists of three main modules: a multi-modal semantic segmentation module, an adaptive attention-based semantic fusion module, and a 3D object detector. First, semantic information is obtained for 2D images and 3D Lidar point clouds based on 2D and 3D segmentation approaches. Then the segmentation results from different sensors are adaptively fused based on the proposed attention-based semantic fusion module. Finally, the point clouds painted with the fused semantic label are sent to the 3D detector for obtaining the 3D object detection results. The effectiveness of the proposed framework has been verified on the large-scale nuScenes detection benchmark by comparing it with three different baselines. The experimental results show that the fusion strategy can significantly improve the detection performance compared to the methods using only point clouds, and the methods using point clouds only painted with 2D segmentation information. Furthermore, the proposed approach outperforms other state-of-the-art methods on the nuScenes testing benchmark.

New submissions for Thu, 28 Oct 21

Keyword: SLAM

There is no result

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: autonomous driving

Learning from demonstrations with SACR2: Soft Actor-Critic with Reward Relabeling

  • Authors: Jesus Bujalance Martin, Raphaël Chekroun, Fabien Moutarde
  • Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2110.14464
  • Pdf link: https://arxiv.org/pdf/2110.14464
  • Abstract
    During recent years, deep reinforcement learning (DRL) has made successful incursions into complex decision-making applications such as robotics, autonomous driving or video games. However, a well-known caveat of DRL algorithms is their inefficiency, requiring huge amounts of data to converge. Off-policy algorithms tend to be more sample-efficient, and can additionally benefit from any off-policy data stored in the replay buffer. Expert demonstrations are a popular source for such data: the agent is exposed to successful states and actions early on, which can accelerate the learning process and improve performance. In the past, multiple ideas have been proposed to make good use of the demonstrations in the buffer, such as pretraining on demonstrations only or minimizing additional cost functions. We carry out a study to evaluate several of these ideas in isolation, to see which of them have the most significant impact. We also present a new method, based on a reward bonus given to demonstrations and successful episodes. First, we give a reward bonus to the transitions coming from demonstrations to encourage the agent to match the demonstrated behaviour. Then, upon collecting a successful episode, we relabel its transitions with the same bonus before adding them to the replay buffer, encouraging the agent to also match its previous successes. The base algorithm for our experiments is the popular Soft Actor-Critic (SAC), a state-of-the-art off-policy algorithm for continuous action spaces. Our experiments focus on robotics, specifically on a reaching task for a robotic arm in simulation. We show that our method SACR2 based on reward relabeling improves the performance on this task, even in the absence of demonstrations.
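
The two reward-bonus rules described above, a bonus on demonstration transitions and relabeling the transitions of successful episodes with the same bonus before storing them, fit in a few lines. The buffer interface and the bonus value are assumptions, not the authors' code.

```python
import random

class RelabelingBuffer:
    """Replay buffer that applies SACR2-style reward bonuses (illustrative only)."""

    def __init__(self, bonus=1.0):
        self.storage, self.bonus = [], bonus

    def add_demo(self, transitions):
        # rule 1: every demonstration transition gets the reward bonus
        for s, a, r, s2, done in transitions:
            self.storage.append((s, a, r + self.bonus, s2, done))

    def add_episode(self, transitions, success):
        # rule 2: successful agent episodes are relabeled with the same bonus
        extra = self.bonus if success else 0.0
        for s, a, r, s2, done in transitions:
            self.storage.append((s, a, r + extra, s2, done))

    def sample(self, batch_size):
        return random.sample(self.storage, min(batch_size, len(self.storage)))

buf = RelabelingBuffer(bonus=1.0)
buf.add_episode([(0, 0, 0.0, 1, True)], success=True)   # stored with reward 1.0
```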

Keyword: mapping

Dual-Mode Synchronization Predictive Control of Robotic Manipulator

  • Authors: Zhu Dachang, Cui Aodong, Du Baolin, Zhu Puchen
  • Subjects: Robotics (cs.RO); Dynamical Systems (math.DS)
  • Arxiv link: https://arxiv.org/abs/2110.14195
  • Pdf link: https://arxiv.org/pdf/2110.14195
  • Abstract
    To reduce the contour error of the end-effector of a robotic manipulator during trajectory tracking, a dual-mode synchronization predictive control is proposed. Firstly, the dynamic model of n-DoF robotic manipulator is discretized by using the Taylor expansion method, and the mapping relationship between the joint error in the joint space and the contour error in task space is constructed. Secondly, the synchronization error and the tracking error in the joint space are defined, and the coupling error of joints is derived through the coupling coefficient . Thirdly, a dual-mode synchronization predictive control is proposed, and the stability of the proposed control system is guaranteed using constraint set conditions. Finally, numerical simulation and experimental results are shown to prove the effectiveness of the proposed control strategy.

Feature and Label Embedding Spaces Matter in Addressing Image Classifier Bias

  • Authors: William Thong, Cees G. M. Snoek
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.14336
  • Pdf link: https://arxiv.org/pdf/2110.14336
  • Abstract
    This paper strives to address image classifier bias, with a focus on both feature and label embedding spaces. Previous works have shown that spurious correlations from protected attributes, such as age, gender, or skin tone, can cause adverse decisions. To balance potential harms, there is a growing need to identify and mitigate image classifier bias. First, we identify in the feature space a bias direction. We compute class prototypes of each protected attribute value for every class, and reveal an existing subspace that captures the maximum variance of the bias. Second, we mitigate biases by mapping image inputs to label embedding spaces. Each value of the protected attribute has its projection head where classes are embedded through a latent vector representation rather than a common one-hot encoding. Once trained, we further reduce in the feature space the bias effect by removing its direction. Evaluation on biased image datasets, for multi-class, multi-label and binary classifications, shows the effectiveness of tackling both feature and label embedding spaces in improving the fairness of the classifier predictions, while preserving classification performance.
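
One plausible reading of the first step, computing attribute-conditioned class prototypes and extracting the subspace that captures most of the bias variance, is a PCA over the prototypes' deviations from their class means; removing the bias direction is then a projection onto its orthogonal complement. The sketch below uses invented dimensions and variable names.

```python
import numpy as np

def bias_subspace(features, labels, attrs, n_dirs=1):
    """features: (N, d), labels: class ids, attrs: protected-attribute ids."""
    residuals = []
    for c in np.unique(labels):
        cls = labels == c
        class_mean = features[cls].mean(axis=0)
        for a in np.unique(attrs):
            mask = cls & (attrs == a)
            if mask.any():
                # deviation of the attribute-conditioned prototype from its class mean
                residuals.append(features[mask].mean(axis=0) - class_mean)
    R = np.stack(residuals)
    _, _, vt = np.linalg.svd(R - R.mean(axis=0), full_matrices=False)
    return vt[:n_dirs]                           # top principal directions of the bias

def remove_direction(features, directions):
    """Project features onto the orthogonal complement of the bias directions."""
    return features - (features @ directions.T) @ directions

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 32))
dirs = bias_subspace(feats, rng.integers(0, 5, 200), rng.integers(0, 2, 200))
debiased = remove_direction(feats, dirs)
```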

Separating Content and Style for Unsupervised Image-to-Image Translation

  • Authors: Yunfei Liu, Haofei Wang, Yang Yue, Feng Lu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.14404
  • Pdf link: https://arxiv.org/pdf/2110.14404
  • Abstract
    Unsupervised image-to-image translation aims to learn the mapping between two visual domains with unpaired samples. Existing works focus on disentangling domain-invariant content code and domain-specific style code individually for multimodal purposes. However, less attention has been paid to interpreting and manipulating the translated image. In this paper, we propose to separate the content code and style code simultaneously in a unified framework. Based on the correlation between the latent features and the high-level domain-invariant tasks, the proposed framework demonstrates superior performance in multimodal translation, interpretability and manipulation of the translated image. Experimental results show that the proposed approach outperforms the existing unsupervised image translation methods in terms of visual quality and diversity.

Keyword: localization

Video-based fully automatic assessment of open surgery suturing skills

  • Authors: Adam Goldbraikh, Anne-Lise D'Angelo, Carla M. Pugh, Shlomi Laufer
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2110.13972
  • Pdf link: https://arxiv.org/pdf/2110.13972
  • Abstract
    The goal of this study was to develop a new, reliable open surgery suturing simulation system for training medical students in situations where resources are limited or in a domestic setup. Namely, we developed an algorithm for tool and hand localization as well as identifying the interactions between them based on simple webcam video data, calculating motion metrics for assessment of surgical skill. Twenty-five participants performed multiple suturing tasks using our simulator. The YOLO network has been modified to a multi-task network, for the purpose of tool localization and tool-hand interaction detection. This was accomplished by splitting the YOLO detection heads so that they supported both tasks with minimal addition to computer run-time. Furthermore, based on the outcome of the system, motion metrics were calculated. These metrics included traditional metrics such as time and path length as well as new metrics assessing the technique participants use for holding the tools. The dual-task network performance was similar to that of two networks, while the computational load was only slightly higher than that of a single network. In addition, the motion metrics showed significant differences between experts and novices. While video capture is an essential part of minimally invasive surgery, it is not an integral component of open surgery. Thus, new algorithms, focusing on the unique challenges open surgery videos present, are required. In this study, a dual-task network was developed to solve both a localization task and a hand-tool interaction task. The dual network may be easily expanded to a multi-task network, which may be useful for images with multiple layers and for evaluating the interaction between these different layers.

Controllable Data Augmentation Through Deep Relighting

  • Authors: George Chogovadze, Rémi Pautrat, Marc Pollefeys
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2110.13996
  • Pdf link: https://arxiv.org/pdf/2110.13996
  • Abstract
    At the heart of the success of deep learning is the quality of the data. Through data augmentation, one can train models with better generalization capabilities and thus achieve greater results in their field of interest. In this work, we explore how to augment a varied set of image datasets through relighting so as to improve the ability of existing models to be invariant to illumination changes, namely for learned descriptors. We develop a tool, based on an encoder-decoder network, that is able to quickly generate multiple variations of the illumination of various input scenes whilst also allowing the user to define parameters such as the angle of incidence and intensity. We demonstrate that by training models on datasets that have been augmented with our pipeline, it is possible to achieve higher performance on localization benchmarks.

From Image to Imuge: Immunized Image Generation

  • Authors: Qichao Ying, Zhenxing Qian, Hang Zhou, Haisheng Xu, Xinpeng Zhang, Siyi Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2110.14196
  • Pdf link: https://arxiv.org/pdf/2110.14196
  • Abstract
    We introduce Imuge, an image tamper-resilient generative scheme for image self-recovery. Traditional methods of concealing image content within the image are inflexible and fragile against diverse digital attacks such as image cropping and JPEG compression. To address this issue, we jointly train a U-Net backboned encoder, a tamper localization network and a decoder for image recovery. Given an original image, the encoder produces a visually indistinguishable immunized image. At the recipient's side, the verifying network localizes the malicious modifications, and the original content can be approximately recovered by the decoder, despite the presence of the attacks. Several strategies are proposed to boost the training efficiency. We demonstrate that our method can recover the details of the tampered regions with high quality despite the presence of various kinds of attacks. Comprehensive ablation studies are conducted to validate our network designs.

Node-wise Localization of Graph Neural Networks

  • Authors: Zemin Liu, Yuan Fang, Chenghao Liu, Steven C.H. Hoi
  • Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
  • Arxiv link: https://arxiv.org/abs/2110.14322
  • Pdf link: https://arxiv.org/pdf/2110.14322
  • Abstract
    Graph neural networks (GNNs) emerge as a powerful family of representation learning models on graphs. To derive node representations, they utilize a global model that recursively aggregates information from the neighboring nodes. However, different nodes reside at different parts of the graph in different local contexts, making their distributions vary across the graph. Ideally, how a node receives its neighborhood information should be a function of its local context, to diverge from the global GNN model shared by all nodes. To utilize node locality without overfitting, we propose a node-wise localization of GNNs by accounting for both global and local aspects of the graph. Globally, all nodes on the graph depend on an underlying global GNN to encode the general patterns across the graph; locally, each node is localized into a unique model as a function of the global model and its local context. Finally, we conduct extensive experiments on four benchmark graphs, and consistently obtain promising performance surpassing the state-of-the-art GNNs.
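
The scale-and-shift view of node-wise localization can be illustrated with a tiny PyTorch layer: a shared linear transform plays the role of the global model, and each node modulates its output with parameters computed from its own features. This is a sketch of the general idea under assumed dimensions, not the paper's exact formulation.

```python
# Sketch of node-wise localization of a shared GNN layer: each node derives a
# scale/shift from its own features and applies it to the globally shared
# transformation of its aggregated local context.
import torch
import torch.nn as nn

class LocalizedGraphLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.global_lin = nn.Linear(in_dim, out_dim)   # shared "global" model
        self.scale = nn.Linear(in_dim, out_dim)        # node-wise modulation
        self.shift = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # adj: dense (N, N) adjacency; mean-aggregate neighbor features as local context
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        context = adj @ x / deg
        h = self.global_lin(context)                   # global message passing
        return torch.relu(self.scale(x) * h + self.shift(x))  # localized output

x = torch.randn(5, 8)                                  # 5 nodes, 8 features each
adj = (torch.rand(5, 5) > 0.5).float()
out = LocalizedGraphLayer(8, 16)(x, adj)
print(out.shape)  # torch.Size([5, 16])
```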

New submissions for Tue, 26 Oct 21

Keyword: SLAM

WOLF: A modular estimation framework for robotics based on factor graphs

  • Authors: Joan Sola, Joan Vallve-Navarro, Joaquim Casals, Jeremie Deray, Mederic Fourmy, Dinesh Atchuthan, Juan Andrade-Cetto
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2110.12919
  • Pdf link: https://arxiv.org/pdf/2110.12919
  • Abstract
    This paper introduces WOLF, a C++ estimation framework based on factor graphs and targeted at mobile robotics. WOLF extends the applications of factor graphs from the typical problems of SLAM and odometry to a general estimation framework able to handle self-calibration, model identification, or the observation of dynamic quantities other than localization. WOLF produces high throughput estimates at sensor rates up to the kHz range, which can be used for feedback control of highly dynamic robots such as humanoids, quadrupeds or aerial manipulators. Departing from the factor graph paradigm, the architecture of WOLF allows for a modular yet tightly-coupled estimator. Modularity is based on plugins that are loaded at runtime. Then, integration is achieved simply through YAML files, allowing users to configure a wide range of applications without the need of writing or compiling code. Synchronization of incoming data and their processing into a unique factor graph is achieved through a decentralized strategy of frame creation and joining. Most algorithmic assets are coded as abstract algorithms in base classes with varying levels of specialization. Overall, these assets allow for coherent processing and favor code reusability and scalability. WOLF can be interfaced with different solvers, and we provide a wrapper to Google Ceres. Likewise, we offer ROS integration, providing a generic ROS node and specialized packages with subscribers and publishers. WOLF is made publicly available and open to collaboration.

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

EgoNN: Egocentric Neural Network for Point Cloud Based 6DoF Relocalization at the City Scale

  • Authors: Jacek Komorowski, Monika Wysoczanska, Tomasz Trzcinski
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.12486
  • Pdf link: https://arxiv.org/pdf/2110.12486
  • Abstract
    The paper presents a deep neural network-based method for global and local descriptors extraction from a point cloud acquired by a rotating 3D LiDAR. The descriptors can be used for two-stage 6DoF relocalization. First, a coarse position is retrieved by finding candidates with the closest global descriptor in the database of geo-tagged point clouds. Then, the 6DoF pose between a query point cloud and a database point cloud is estimated by matching local descriptors and using a robust estimator such as RANSAC. Our method has a simple, fully convolutional architecture based on a sparse voxelized representation. It can efficiently extract a global descriptor and a set of keypoints with local descriptors from large point clouds with tens of thousands of points. Our code and pretrained models are publicly available on the project website.
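
The first (coarse) stage of such a two-stage pipeline reduces to nearest-neighbour search over global descriptors. The snippet below illustrates it with random placeholder descriptors; the second stage (local descriptor matching plus RANSAC pose estimation) is omitted.

```python
# First stage of a two-stage relocalization pipeline: retrieve the closest
# geo-tagged point clouds by global descriptor similarity. Descriptors are
# random placeholders standing in for network outputs.
import numpy as np

rng = np.random.default_rng(0)
database = rng.standard_normal((1000, 256))            # global descriptors of map clouds
database /= np.linalg.norm(database, axis=1, keepdims=True)
query = rng.standard_normal(256)
query /= np.linalg.norm(query)

similarity = database @ query                          # cosine similarity
top_k = np.argsort(-similarity)[:5]                    # candidate map point clouds
print("retrieval candidates:", top_k)
```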

Keyword: loop detection

There is no result

Keyword: autonomous driving

Encoding Integrated Decision and Control for Autonomous Driving with Mixed Traffic Flow

  • Authors: Yangang Ren, Jianhua Jiang, Jingliang Duan, Shengbo Eben Li, Dongjie Yu, Guojian Zhan
  • Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2110.12359
  • Pdf link: https://arxiv.org/pdf/2110.12359
  • Abstract
    Reinforcement learning (RL) has been widely adopted for making intelligent driving policies in autonomous driving due to its self-evolution ability and humanoid learning paradigm. Despite many elegant demonstrations of RL-enabled decision-making, current research mainly focuses on the pure vehicle driving environment while ignoring other traffic participants like bicycles and pedestrians. For urban roads, the interaction of mixed traffic flows leads to a quite dynamic and complex relationship, which makes it difficult to learn a safe and intelligent policy. This paper proposes encoding integrated decision and control (E-IDC) to handle complicated driving tasks with mixed traffic flows, which is composed of an encoding function to construct driving states, a value function to choose the optimal path, and a policy function to output the control command of the ego vehicle. Specifically, the encoding function is capable of dealing with different types and varying numbers of traffic participants and extracting features from the original driving observation. Next, we design the training principle for the functions of E-IDC with RL algorithms by adding gradient-based update rules and refining the safety constraints concerning the otherness of different participants. The verification is conducted on an intersection scenario with mixed traffic flows, and the results show that E-IDC can enhance driving performance, including tracking performance and satisfaction of safety constraints, by a large margin. The online application indicates that E-IDC can realize efficient and smooth driving in the complex intersection, guaranteeing intelligence and safety simultaneously.

Robustness via Uncertainty-aware Cycle Consistency

  • Authors: Uddeshya Upadhyay, Yanbei Chen, Zeynep Akata
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2110.12467
  • Pdf link: https://arxiv.org/pdf/2110.12467
  • Abstract
    Unpaired image-to-image translation refers to learning inter-image-domain mapping without corresponding image pairs. Existing methods learn deterministic mappings without explicitly modelling the robustness to outliers or predictive uncertainty, leading to performance degradation when encountering unseen perturbations at test time. To address this, we propose a novel probabilistic method based on Uncertainty-aware Generalized Adaptive Cycle Consistency (UGAC), which models the per-pixel residual by generalized Gaussian distribution, capable of modelling heavy-tailed distributions. We compare our model with a wide variety of state-of-the-art methods on various challenging tasks including unpaired image translation of natural images, using standard datasets, spanning autonomous driving, maps, facades, and also in medical imaging domain consisting of MRI. Experimental results demonstrate that our method exhibits stronger robustness towards unseen perturbations in test data. Code is released here: https://github.com/ExplainableML/UncertaintyAwareCycleConsistency.
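
One way to realize a heavy-tailed per-pixel residual model is a generalized Gaussian negative log-likelihood, sketched below; the per-pixel scale and shape maps are assumed to be predicted by the network and are replaced here by placeholder tensors.

```python
# Per-pixel negative log-likelihood under a generalized Gaussian residual model.
# mu, alpha (scale > 0) and beta (shape > 0) would be predicted per pixel by the
# network; here they are placeholders. Smaller beta means heavier tails.
import torch

def generalized_gaussian_nll(x, mu, alpha, beta, eps=1e-6):
    alpha = alpha.clamp(min=eps)
    beta = beta.clamp(min=eps)
    nll = (torch.abs(x - mu) / alpha) ** beta \
        + torch.log(2 * alpha) + torch.lgamma(1.0 / beta) - torch.log(beta)
    return nll.mean()

x = torch.randn(2, 3, 64, 64)            # target image
mu = torch.randn(2, 3, 64, 64)           # predicted translation
alpha = torch.rand(2, 3, 64, 64) + 0.1   # predicted per-pixel scale
beta = torch.rand(2, 3, 64, 64) + 0.5    # predicted per-pixel shape
print(generalized_gaussian_nll(x, mu, alpha, beta))
```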

Complete Test of Synthesised Safety Supervisors for Robots and Autonomous Systems

  • Authors: Mario Gleirscher (University of Bremen), Jan Peleska (University of Bremen)
  • Subjects: Software Engineering (cs.SE); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2110.12589
  • Pdf link: https://arxiv.org/pdf/2110.12589
  • Abstract
    Verified controller synthesis uses world models that comprise all potential behaviours of humans, robots, further equipment, and the controller to be synthesised. A world model enables quantitative risk assessment, for example, by stochastic model checking. Such a model describes a range of controller behaviours some of which -- when implemented correctly -- guarantee that the overall risk in the actual world is acceptable, provided that the stochastic assumptions have been made to the safe side. Synthesis then selects an acceptable-risk controller behaviour. However, because of crossing abstraction, formalism, and tool boundaries, verified synthesis for robots and autonomous systems has to be accompanied by rigorous testing. In general, standards and regulations for safety-critical systems require testing as a key element to obtain certification credit before entry into service. This work-in-progress paper presents an approach to the complete testing of synthesised supervisory controllers that enforce safety properties in domains such as human-robot collaboration and autonomous driving. Controller code is generated from the selected controller behaviour. The code generator, however, is hard, if not infeasible, to verify in a formal and comprehensive way. Instead, utilising testing, an abstract test reference is generated, a symbolic finite state machine with simpler semantics than code semantics. From this reference, a complete test suite is derived and applied to demonstrate the observational equivalence between the synthesised abstract test reference and the generated concrete controller code running on a control system platform.

2nd Place Solution for SODA10M Challenge 2021 -- Continual Detection Track

  • Authors: Manoj Acharya, Christopher Kanan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2110.13064
  • Pdf link: https://arxiv.org/pdf/2110.13064
  • Abstract
    In this technical report, we present our approaches for the continual object detection track of the SODA10M challenge. We adapt ResNet50-FPN as the baseline and try several improvements for the final submission model. We find that task-specific replay scheme, learning rate scheduling, model calibration, and using original image scale helps to improve performance for both large and small objects in images. Our team `hypertune28' secured the second position among 52 participants in the challenge. This work will be presented at the ICCV 2021 Workshop on Self-supervised Learning for Next-Generation Industry-level Autonomous Driving (SSLAD).

Keyword: mapping

HWTool: Fully Automatic Mapping of an Extensible C++ Image Processing Language to Hardware

  • Authors: James Hegarty, Omar Eldash, Amr Suleiman, Armin Alaghi
  • Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Programming Languages (cs.PL)
  • Arxiv link: https://arxiv.org/abs/2110.12106
  • Pdf link: https://arxiv.org/pdf/2110.12106
  • Abstract
    Implementing image processing algorithms using FPGAs or ASICs can improve energy efficiency by orders of magnitude over optimized CPU, DSP, or GPU code. These efficiency improvements are crucial for enabling new applications on mobile power-constrained devices, such as cell phones or AR/VR headsets. Unfortunately, custom hardware is commonly implemented using a waterfall process with time-intensive manual mapping and optimization phases. Thus, it can take years for a new algorithm to make it all the way from an algorithm design to shipping silicon. Recent improvements in hardware design tools, such as C-to-gates High-Level Synthesis (HLS), can reduce design time, but still require manual tuning from hardware experts. In this paper, we present HWTool, a novel system for automatically mapping image processing and computer vision algorithms to hardware. Our system maps between two domains: HWImg, an extensible C++ image processing library containing common image processing and parallel computing operators, and Rigel2, a library of optimized hardware implementations of HWImg's operators and backend Verilog compiler. We show how to automatically compile HWImg to Rigel2, by solving for interfaces, hardware sizing, and FIFO buffer allocation. Finally, we map full-scale image processing applications like convolution, optical flow, depth from stereo, and feature descriptors to FPGA using our system. On these examples, HWTool requires on average only 11% more FPGA area than hand-optimized designs (with manual FIFO allocation), and 33% more FPGA area than hand-optimized designs with automatic FIFO allocation, and performs similarly to HLS.

Robustness via Uncertainty-aware Cycle Consistency

  • Authors: Uddeshya Upadhyay, Yanbei Chen, Zeynep Akata
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2110.12467
  • Pdf link: https://arxiv.org/pdf/2110.12467
  • Abstract
    Unpaired image-to-image translation refers to learning inter-image-domain mapping without corresponding image pairs. Existing methods learn deterministic mappings without explicitly modelling the robustness to outliers or predictive uncertainty, leading to performance degradation when encountering unseen perturbations at test time. To address this, we propose a novel probabilistic method based on Uncertainty-aware Generalized Adaptive Cycle Consistency (UGAC), which models the per-pixel residual by generalized Gaussian distribution, capable of modelling heavy-tailed distributions. We compare our model with a wide variety of state-of-the-art methods on various challenging tasks including unpaired image translation of natural images, using standard datasets, spanning autonomous driving, maps, facades, and also in medical imaging domain consisting of MRI. Experimental results demonstrate that our method exhibits stronger robustness towards unseen perturbations in test data. Code is released here: https://github.com/ExplainableML/UncertaintyAwareCycleConsistency.

GeoSneakPique: Visual Autocompletion for Geospatial Queries

  • Authors: Vidya Setlur, Sarah Battersby, Tracy Wong
  • Subjects: Human-Computer Interaction (cs.HC)
  • Arxiv link: https://arxiv.org/abs/2110.12596
  • Pdf link: https://arxiv.org/pdf/2110.12596
  • Abstract
    How many crimes occurred in the city center? And exactly which part of town is the 'city center'? While location is at the heart of many data questions, geographic location can be difficult to specify in natural language (NL) queries. This is especially true when working with fuzzy cognitive regions or regions that may be defined based on data distributions instead of absolute administrative location (e.g., state, country). GeoSneakPique presents a novel method for using a mapping widget to support the NL query process, allowing users to specify location via direct manipulation with data-driven guidance on spatial distributions to help select the area of interest. Users receive feedback to help them evaluate and refine their spatial selection interactively and can save spatial definitions for re-use in subsequent queries. We conduct a qualitative evaluation of the GeoSneakPique that indicates the usefulness of the interface as well as opportunities for better supporting geospatial workflows in visual analysis tasks employing cognitive regions.

Learning Stochastic Shortest Path with Linear Function Approximation

  • Authors: Yifei Min, Jiafan He, Tianhao Wang, Quanquan Gu
  • Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2110.12727
  • Pdf link: https://arxiv.org/pdf/2110.12727
  • Abstract
    We study the stochastic shortest path (SSP) problem in reinforcement learning with linear function approximation, where the transition kernel is represented as a linear mixture of unknown models. We call this class of SSP problems the linear mixture SSP. We propose a novel algorithm for learning the linear mixture SSP, which can attain a $\tilde O(d B_{\star}^{1.5}\sqrt{K/c_{\min}})$ regret. Here $K$ is the number of episodes, $d$ is the dimension of the feature mapping in the mixture model, $B_{\star}$ bounds the expected cumulative cost of the optimal policy, and $c_{\min}>0$ is the lower bound of the cost function. Our algorithm also applies to the case when $c_{\min} = 0$, where a $\tilde O(K^{2/3})$ regret is guaranteed. To the best of our knowledge, this is the first algorithm with a sublinear regret guarantee for learning linear mixture SSP. In complement to the regret upper bounds, we also prove a lower bound of $\Omega(d B_{\star} \sqrt{K})$, which nearly matches our upper bound.

Spread2RML: Constructing Knowledge Graphs by Predicting RML Mappings on Messy Spreadsheets

  • Authors: Markus Schröder, Christian Jilek, Andreas Dengel
  • Subjects: Databases (cs.DB)
  • Arxiv link: https://arxiv.org/abs/2110.12829
  • Pdf link: https://arxiv.org/pdf/2110.12829
  • Abstract
    The RDF Mapping Language (RML) allows to map semi-structured data to RDF knowledge graphs. Besides CSV, JSON and XML, this also includes the mapping of spreadsheet tables. Since spreadsheets have a complex data model and can become rather messy, their mapping creation tends to be very time consuming. In order to reduce such efforts, this paper presents Spread2RML which predicts RML mappings on messy spreadsheets. This is done with an extensible set of RML object map templates which are applied for each column based on heuristics. In our evaluation, three datasets are used ranging from very messy synthetic data to spreadsheets from data.gov which are less messy. We obtained first promising results especially with regard to our approach being fully automatic and dealing with rather messy data.
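
The flavour of heuristic, template-based prediction can be illustrated with a toy datatype guesser for a spreadsheet column; the rules and datatypes below are hypothetical and not Spread2RML's actual templates.

```python
# Toy illustration of heuristic, rule-based datatype prediction for a messy
# spreadsheet column. The rules and XSD datatypes are hypothetical examples.
import re

def predict_object_map_datatype(cells):
    cells = [c.strip() for c in cells if c and c.strip()]
    if all(re.fullmatch(r"-?\d+", c) for c in cells):
        return "xsd:integer"
    if all(re.fullmatch(r"-?\d+[.,]\d+", c) for c in cells):
        return "xsd:decimal"
    if all(re.fullmatch(r"\d{4}-\d{2}-\d{2}", c) for c in cells):
        return "xsd:date"
    if all(c.lower() in {"yes", "no", "true", "false"} for c in cells):
        return "xsd:boolean"
    return "xsd:string"

print(predict_object_map_datatype(["12", "7", " 42 "]))           # xsd:integer
print(predict_object_map_datatype(["2021-10-26", "2021-10-27"]))  # xsd:date
```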

Accelerating Compact Fractals with Tensor Core GPUs

  • Authors: Felipe A. Quezada, Cristóbal A. Navarro
  • Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
  • Arxiv link: https://arxiv.org/abs/2110.12952
  • Pdf link: https://arxiv.org/pdf/2110.12952
  • Abstract
    This work presents a GPU thread mapping approach that allows doing fast parallel stencil-like computations on discrete fractals using their compact representation. The intuition behind is to employ two GPU tensor-core accelerated thread maps, $\lambda(\omega)$ and $\nu(\omega)$, which act as threadspace-to-dataspace and dataspace-to-threadspace functions, respectively. By combining these maps, threads can access compact space and interact with their neighbors. The cost of the maps is $\mathcal{O}(\log \log(n))$ time, with $n$ being the side of a $n \times n$ embedding for the fractal in its expanded form. The technique works on any fractal that belongs to the Non-overlapping-Bounding-Boxes (NBB) class of discrete fractals, and can be extended to three dimensions as well. Results using an A100 GPU on the Sierpinski Triangle as a case study show up to $\sim11\times$ of speedup and a memory usage reduction of $234\times$ with respect to a Bounding Box approach. These results show that the proposed compact approach can allow the scientific community to tackle larger problems that did not fit in GPU memory before, and run even faster than a bounding box approach.

Keyword: localization

Feasibility of Remote Landmark Identification for Cricothyrotomy Using Robotic Palpation

  • Authors: Neel Shihora, Rashid M. Yasin, Ryan Walsh, Nabil Simaan
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2110.12078
  • Pdf link: https://arxiv.org/pdf/2110.12078
  • Abstract
    Cricothyrotomy is a life-saving emergency intervention that secures an alternate airway route after a neck injury or obstruction. The procedure starts with identifying the correct location (the cricothyroid membrane) for creating an incision to insert an endotracheal tube. This location is determined using a combination of visual and palpation cues. Enabling robot-assisted remote cricothyrotomy may extend this life-saving procedure to injured soldiers or patients who may not be readily accessible for on-site intervention during search-and-rescue scenarios. As a first step towards achieving this goal, this paper explores the feasibility of palpation-assisted landmark identification for cricothyrotomy. Using a cricothyrotomy training simulator, we explore several alternatives for in-situ remote localization of the cricothyroid membrane. These alternatives include a) unaided telemanipulation, b) telemanipulation with direct force feedback, c) telemanipulation with superimposed motion excitation for online stiffness estimation and display, and d) fully autonomous palpation scan initialized based on the user's understanding of key anatomical landmarks. Using the manually digitized cricothyroid membrane location as ground truth, we compare these four methods for accuracy and repeatability of identifying the landmark for cricothyrotomy, time of completion, and ease of use. These preliminary results suggest that the accuracy of remote cricothyrotomy landmark identification is improved when the user is aided with visual and force cues. They also show that, with proper user initialization, landmark identification using remote palpation is feasible - therefore satisfying a key pre-requisite for future robotic solutions for remote cricothyrotomy.

EgoNN: Egocentric Neural Network for Point Cloud Based 6DoF Relocalization at the City Scale

  • Authors: Jacek Komorowski, Monika Wysoczanska, Tomasz Trzcinski
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.12486
  • Pdf link: https://arxiv.org/pdf/2110.12486
  • Abstract
    The paper presents a deep neural network-based method for global and local descriptors extraction from a point cloud acquired by a rotating 3D LiDAR. The descriptors can be used for two-stage 6DoF relocalization. First, a coarse position is retrieved by finding candidates with the closest global descriptor in the database of geo-tagged point clouds. Then, the 6DoF pose between a query point cloud and a database point cloud is estimated by matching local descriptors and using a robust estimator such as RANSAC. Our method has a simple, fully convolutional architecture based on a sparse voxelized representation. It can efficiently extract a global descriptor and a set of keypoints with local descriptors from large point clouds with tens of thousands of points. Our code and pretrained models are publicly available on the project website.

HSDB-instrument: Instrument Localization Database for Laparoscopic and Robotic Surgeries

  • Authors: Jihun Yoon, Jiwon Lee, Sunghwan Heo, Hayeong Yu, Jayeon Lim, Chi Hyun Song, SeulGi Hong, Seungbum Hong, Bokyung Park, SungHyun Park, Woo Jin Hyung, Min-Kook Choi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.12555
  • Pdf link: https://arxiv.org/pdf/2110.12555
  • Abstract
    Automated surgical instrument localization is an important technology for understanding the surgical process and analyzing it, in order to provide meaningful guidance during surgery, or a surgical index after surgery, to the surgeon. We introduce a new dataset that reflects the kinematic characteristics of surgical instruments for automated surgical instrument localization in surgical videos. The hSDB (hutom Surgery DataBase)-instrument dataset consists of instrument localization information from 24 cases of laparoscopic cholecystectomy and 24 cases of robotic gastrectomy. Localization information for all instruments is provided in the form of a bounding box for object detection. To handle the class imbalance problem between instruments, synthesized instruments modeled in Unity as 3D models are included as training data. In addition, for the 3D instrument data, polygon annotations are provided to enable instance segmentation of the tool. To reflect the kinematic characteristics of all instruments, they are annotated with head and body parts for laparoscopic instruments, and with head, wrist, and body parts for robotic instruments separately. Annotation data for assistive tools (specimen bag, needle, etc.) that are frequently used in surgery are also included. Moreover, we provide statistical information on the hSDB-instrument dataset, the baseline localization performances of object detection networks trained with the MMDetection library, and the resulting analyses.

Industrial Scene Text Detection with Refined Feature-attentive Network

  • Authors: Tongkun Guan, Chaochen Gu, Changsheng Lu, Jingzheng Tu, Qi Feng, Kaijie Wu, Xinping Guan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.12663
  • Pdf link: https://arxiv.org/pdf/2110.12663
  • Abstract
    Detecting the marking characters of industrial metal parts remains challenging due to low visual contrast, uneven illumination, corroded character structures, and cluttered background of metal part images. Affected by these factors, bounding boxes generated by most existing methods locate low-contrast text areas inaccurately. In this paper, we propose a refined feature-attentive network (RFN) to solve the inaccurate localization problem. Specifically, we design a parallel feature integration mechanism to construct an adaptive feature representation from multi-resolution features, which enhances the perception of multi-scale texts at each scale-specific level to generate a high-quality attention map. Then, an attentive refinement network is developed by the attention map to rectify the location deviation of candidate boxes. In addition, a re-scoring mechanism is designed to select text boxes with the best rectified location. Moreover, we construct two industrial scene text datasets, including a total of 102156 images and 1948809 text instances with various character structures and metal parts. Extensive experiments on our dataset and four public datasets demonstrate that our proposed method achieves the state-of-the-art performance.

Instance-Conditional Knowledge Distillation for Object Detection

  • Authors: Zijian Kang, Peizhen Zhang, Xiangyu Zhang, Jian Sun, Nanning Zheng
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.12724
  • Pdf link: https://arxiv.org/pdf/2110.12724
  • Abstract
    Despite the success of Knowledge Distillation (KD) on image classification, it is still challenging to apply KD on object detection due to the difficulty in locating knowledge. In this paper, we propose an instance-conditional distillation framework to find desired knowledge. To locate knowledge of each instance, we use observed instances as condition information and formulate the retrieval process as an instance-conditional decoding process. Specifically, information of each instance that specifies a condition is encoded as query, and teacher's information is presented as key, we use the attention between query and key to measure the correlation, formulated by the transformer decoder. To guide this module, we further introduce an auxiliary task that directs to instance localization and identification, which are fundamental for detection. Extensive experiments demonstrate the efficacy of our method: we observe impressive improvements under various settings. Notably, we boost RetinaNet with ResNet-50 backbone from 37.4 to 40.7 mAP (+3.3) under 1x schedule, that even surpasses the teacher (40.4 mAP) with ResNet-101 backbone under 3x schedule. Code will be released soon.
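
The query-key retrieval step can be sketched with a standard multi-head attention module: instance embeddings act as queries over flattened teacher features, and the student is pulled towards the retrieved teacher responses. Dimensions and the loss below are assumptions, not the paper's implementation.

```python
# Sketch of instance-conditional feature retrieval for distillation: instances
# are queries, flattened feature maps are keys/values, and the student matches
# the teacher at the attended locations.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 256
attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

instances = torch.randn(2, 5, d)                                  # 2 images, 5 instance queries
teacher = torch.randn(2, d, 16, 16).flatten(2).transpose(1, 2)    # (B, HW, d)
student = torch.randn(2, d, 16, 16).flatten(2).transpose(1, 2)    # (B, HW, d)

retrieved_t, _ = attn(instances, teacher, teacher)   # teacher response per instance
retrieved_s, _ = attn(instances, student, student)   # student response per instance
loss = F.mse_loss(retrieved_s, retrieved_t.detach()) # distill where instances attend
print(loss.item())
```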

A Deep Reinforcement Learning Approach for Audio-based Navigation and Audio Source Localization in Multi-speaker Environments

  • Authors: Petros Giannakopoulos, Aggelos Pikrakis, Yannis Cotronis
  • Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2110.12778
  • Pdf link: https://arxiv.org/pdf/2110.12778
  • Abstract
    In this work we apply deep reinforcement learning to the problems of navigating a three-dimensional environment and inferring the locations of human speaker audio sources within, in the case where the only available information is the raw sound from the environment, as a simulated human listener placed in the environment would hear it. For this purpose we create two virtual environments using the Unity game engine, one presenting an audio-based navigation problem and one presenting an audio source localization problem. We also create an autonomous agent based on PPO online reinforcement learning algorithm and attempt to train it to solve these environments. Our experiments show that our agent achieves adequate performance and generalization ability in both environments, measured by quantitative metrics, even when a limited amount of training data are available or the environment parameters shift in ways not encountered during training. We also show that a degree of agent knowledge transfer is possible between the environments.

Multi-vehicle experiment platform: A Digital Twin Realization Method

  • Authors: Chunying Yang, Jianghong Dong, Qing Xu, Mengchi Cai, Hongmao Qin, Jianqiang Wang, Keqiang Li
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2110.12859
  • Pdf link: https://arxiv.org/pdf/2110.12859
  • Abstract
    With the development of V2X technology, multi-vehicle cooperative control has been widely studied. However, field testing is rarely conducted due to financial and safety considerations. To solve this problem, this study proposes a digital twin method to carry out multi-vehicle experiments, which uses a combination of physical and virtual vehicles to perform coordination tasks. To confirm the effectiveness of this method, a prototype system is developed, which consists of a sand-table testbed, its twin system, and the cloud. Several aspects are quantified to describe system performance, including time delay and localization accuracy. Finally, a vehicle-level experiment in a platoon scenario is carried out, and the experimental results confirm the effectiveness of this method.

WOLF: A modular estimation framework for robotics based on factor graphs

  • Authors: Joan Sola, Joan Vallve-Navarro, Joaquim Casals, Jeremie Deray, Mederic Fourmy, Dinesh Atchuthan, Juan Andrade-Cetto
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2110.12919
  • Pdf link: https://arxiv.org/pdf/2110.12919
  • Abstract
    This paper introduces WOLF, a C++ estimation framework based on factor graphs and targeted at mobile robotics. WOLF extends the applications of factor graphs from the typical problems of SLAM and odometry to a general estimation framework able to handle self-calibration, model identification, or the observation of dynamic quantities other than localization. WOLF produces high throughput estimates at sensor rates up to the kHz range, which can be used for feedback control of highly dynamic robots such as humanoids, quadrupeds or aerial manipulators. Departing from the factor graph paradigm, the architecture of WOLF allows for a modular yet tightly-coupled estimator. Modularity is based on plugins that are loaded at runtime. Then, integration is achieved simply through YAML files, allowing users to configure a wide range of applications without the need of writing or compiling code. Synchronization of incoming data and their processing into a unique factor graph is achieved through a decentralized strategy of frame creation and joining. Most algorithmic assets are coded as abstract algorithms in base classes with varying levels of specialization. Overall, these assets allow for coherent processing and favor code reusability and scalability. WOLF can be interfaced with different solvers, and we provide a wrapper to Google Ceres. Likewise, we offer ROS integration, providing a generic ROS node and specialized packages with subscribers and publishers. WOLF is made publicly available and open to collaboration.

TAPL: Dynamic Part-based Visual Tracking via Attention-guided Part Localization

  • Authors: Wei han, Hantao Huang, Xiaoxi Yu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.13027
  • Pdf link: https://arxiv.org/pdf/2110.13027
  • Abstract
    Holistic object representation-based trackers suffer from performance drop under large appearance change such as deformation and occlusion. In this work, we propose a dynamic part-based tracker and constantly update the target part representation to adapt to object appearance change. Moreover, we design an attention-guided part localization network to directly predict the target part locations, and determine the final bounding box with the distribution of target parts. Our proposed tracker achieves promising results on various benchmarks: VOT2018, OTB100 and GOT-10k.

Diagnosing Errors in Video Relation Detectors

  • Authors: Shuo Chen, Pascal Mettes, Cees G.M. Snoek
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.13110
  • Pdf link: https://arxiv.org/pdf/2110.13110
  • Abstract
    Video relation detection forms a new and challenging problem in computer vision, where subjects and objects need to be localized spatio-temporally and a predicate label needs to be assigned if and only if there is an interaction between the two. Despite recent progress in video relation detection, overall performance is still marginal and it remains unclear what the key factors are towards solving the problem. Following examples set in the object detection and action localization literature, we perform a deep dive into the error diagnosis of current video relation detection approaches. We introduce a diagnostic tool for analyzing the sources of detection errors. Our tool evaluates and compares current approaches beyond the single scalar metric of mean Average Precision by defining different error types specific to video relation detection, used for false positive analyses. Moreover, we examine different factors of influence on the performance in a false negative analysis, including relation length, number of subject/object/predicate instances, and subject/object size. Finally, we present the effect on video relation performance when considering an oracle fix for each error type. On two video relation benchmarks, we show where current approaches excel and fall short, allowing us to pinpoint the most important future directions in the field. The tool is available at \url{https://github.com/shanshuo/DiagnoseVRD}.

New submissions for Wed, 17 Nov 21

Keyword: SLAM

There is no result

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: autonomous driving

Switching Recurrent Kalman Networks

  • Authors: Giao Nguyen-Quynh, Philipp Becker, Chen Qiu, Maja Rudolph, Gerhard Neumann
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2111.08291
  • Pdf link: https://arxiv.org/pdf/2111.08291
  • Abstract
    Forecasting driving behavior or other sensor measurements is an essential component of autonomous driving systems. Often real-world multivariate time series data is hard to model because the underlying dynamics are nonlinear and the observations are noisy. In addition, driving data can often be multimodal in distribution, meaning that there are distinct predictions that are likely, but averaging can hurt model performance. To address this, we propose the Switching Recurrent Kalman Network (SRKN) for efficient inference and prediction on nonlinear and multi-modal time-series data. The model switches among several Kalman filters that model different aspects of the dynamics in a factorized latent state. We empirically test the resulting scalable and interpretable deep state-space model on toy data sets and real driving data from taxis in Porto. In all cases, the model can capture the multimodal nature of the dynamics in the data.
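
The building block being switched among is an ordinary linear Kalman filter. A minimal predict/update step, with an assumed 1-D constant-velocity model, looks as follows; the SRKN itself learns several such filters and a switching distribution over them.

```python
# One predict/update step of a linear Kalman filter, the component a switching
# model would select among. A 1-D constant-velocity model is used as an example.
import numpy as np

def kf_step(x, P, z, F, H, Q, R):
    # Predict
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update with measurement z
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])     # position/velocity transition
H = np.array([[1.0, 0.0]])                # only position is observed
Q = 1e-3 * np.eye(2)
R = np.array([[1e-2]])
x, P = np.zeros(2), np.eye(2)
for z in [0.11, 0.22, 0.35]:
    x, P = kf_step(x, P, np.array([z]), F, H, Q, R)
print(x)   # estimated position and velocity
```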

Robust 3D Scene Segmentation through Hierarchical and Learnable Part-Fusion

  • Authors: Anirud Thyagharajan, Benjamin Ummenhofer, Prashant Laddha, Om J Omer, Sreenivas Subramoney
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.08434
  • Pdf link: https://arxiv.org/pdf/2111.08434
  • Abstract
    3D semantic segmentation is a fundamental building block for several scene understanding applications such as autonomous driving, robotics and AR/VR. Several state-of-the-art semantic segmentation models suffer from the part misclassification problem, wherein parts of the same object are labelled incorrectly. Previous methods have utilized hierarchical, iterative methods to fuse semantic and instance information, but they lack learnability in context fusion, and are computationally complex and heuristic-driven. This paper presents Segment-Fusion, a novel attention-based method for hierarchical fusion of semantic and instance information to address the part misclassifications. The presented method includes a graph segmentation algorithm for grouping points into segments that pools point-wise features into segment-wise features, a learnable attention-based network to fuse these segments based on their semantic and instance features, followed by a simple yet effective connected component labelling algorithm to convert segment features to instance labels. Segment-Fusion can be flexibly employed with any network architecture for semantic/instance segmentation. It improves the qualitative and quantitative performance of several semantic segmentation backbones by up to 5% when evaluated on the ScanNet and S3DIS datasets.

GRI: General Reinforced Imitation and its Application to Vision-Based Autonomous Driving

  • Authors: Raphael Chekroun, Marin Toromanoff, Sascha Hornauer, Fabien Moutarde
  • Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.08575
  • Pdf link: https://arxiv.org/pdf/2111.08575
  • Abstract
    Deep reinforcement learning (DRL) has been demonstrated to be effective for several complex decision-making applications such as autonomous driving and robotics. However, DRL is notoriously limited by its high sample complexity and its lack of stability. Prior knowledge, e.g. as expert demonstrations, is often available but challenging to leverage to mitigate these issues. In this paper, we propose General Reinforced Imitation (GRI), a novel method which combines benefits from exploration and expert data and is straightforward to implement over any off-policy RL algorithm. We make one simplifying hypothesis: expert demonstrations can be seen as perfect data whose underlying policy gets a constant high reward. Based on this assumption, GRI introduces the notion of offline demonstration agents. This agent sends expert data which are processed both concurrently and indistinguishably with the experiences coming from the online RL exploration agent. We show that our approach enables major improvements on vision-based autonomous driving in urban environments. We further validate the GRI method on Mujoco continuous control tasks with different off-policy RL algorithms. Our method ranked first on the CARLA Leaderboard and outperforms World on Rails, the previous state-of-the-art, by 17%.
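
The central simplification, expert transitions treated as ordinary experience with a constant high reward, can be sketched with a toy replay buffer; the buffer layout and reward constant below are illustrative assumptions, not the authors' code.

```python
# Toy replay buffer in the spirit of the approach above: expert demonstrations
# are stored as transitions with a constant high reward and sampled
# indistinguishably from online exploration data.
import random

DEMO_REWARD = 1.0   # constant reward assigned to expert transitions (assumed value)

buffer = []

def add_online(state, action, reward, next_state, done):
    buffer.append((state, action, reward, next_state, done))

def add_demonstration(state, action, next_state, done):
    buffer.append((state, action, DEMO_REWARD, next_state, done))

def sample(batch_size):
    # Online and demonstration transitions are drawn without distinction.
    return random.sample(buffer, min(batch_size, len(buffer)))

for t in range(50):
    add_online(t, 0, 0.1, t + 1, False)          # online exploration agent
for t in range(50):
    add_demonstration(t, 1, t + 1, False)        # offline demonstration agent
print(len(sample(8)), "mixed transitions sampled")
```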

Towards Real-Time Monocular Depth Estimation for Robotics: A Survey

  • Authors: Xingshuai Dong, Matthew A. Garratt, Sreenatha G. Anavatti, Hussein A. Abbass
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.08600
  • Pdf link: https://arxiv.org/pdf/2111.08600
  • Abstract
    As an essential component of many autonomous driving and robotic activities such as ego-motion estimation, obstacle avoidance and scene understanding, monocular depth estimation (MDE) has attracted great attention from the computer vision and robotics communities. Over the past decades, a large number of methods have been developed. To the best of our knowledge, however, there is no comprehensive survey of MDE. This paper aims to bridge this gap by reviewing 197 relevant articles published between 1970 and 2021. In particular, we provide a comprehensive survey of MDE covering various methods, introduce the popular performance evaluation metrics and summarize publicly available datasets. We also summarize available open-source implementations of some representative methods and compare their performances. Furthermore, we review the application of MDE in some important robotic tasks. Finally, we conclude this paper by presenting some promising directions for future research. This survey is expected to assist readers in navigating this research field.

Keyword: mapping

Soft Delivery: Survey on A New Paradigm for Wireless and Mobile Multimedia Streaming

  • Authors: Takuya Fujihashi, Toshiaki Koike-Akino, Takashi Watanabe
  • Subjects: Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/2111.08189
  • Pdf link: https://arxiv.org/pdf/2111.08189
  • Abstract
    The increasing demand for video streaming services is the key driver of modern wireless and mobile communications. For robust and high-quality delivery of video content over wireless and mobile networks, the main challenge is sending image and video signals to single and multiple users over unstable and diverse channel environments. To this end, many studies have designed digital-based video delivery schemes, which mainly consist of a sequence of digital-based coding and transmission schemes. Although digital-based schemes perform well when the channel characteristics are known in advance, significant quality degradation, known as cliff and leveling effects, often occurs owing to the fluctuating channel characteristics. To prevent cliff and leveling effects irrespective of the channel characteristics of each user, a new paradigm for wireless and mobile video streaming has been proposed. Soft delivery schemes skip the digital operations of quantization, entropy coding, and channel coding, while directly mapping the power-assigned frequency-domain coefficients onto the transmission symbols. This modification is based on the fact that the pixel distortion due to communication noise is proportional to the magnitude of the noise, resulting in graceful quality improvement, wherein quality is improved gradually according to the wireless channel quality, without any cliff or leveling effects. Herein, we present a comprehensive summary of soft delivery schemes.
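
A simplified, SoftCast-style sketch of such soft mapping is shown below: DCT coefficients are power-scaled and sent directly as analog I/Q symbols, so distortion scales with channel noise instead of collapsing at a cliff. Chunking and the exact power allocation of real systems are omitted; this is an illustration, not a scheme from the survey.

```python
# Simplified analog "soft" mapping: 2-D DCT coefficients are power-scaled and
# transmitted directly as complex symbols, then inverted at the receiver.
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
image = rng.random((32, 32))

coeffs = dctn(image, norm="ortho")
var = coeffs.var(axis=1, keepdims=True) + 1e-9       # per-row "chunk" variance
g = var ** -0.25                                      # SoftCast-like power allocation
scaled = g * coeffs

# Map pairs of real coefficients to complex I/Q symbols and add channel noise
flat = scaled.ravel()
symbols = flat[0::2] + 1j * flat[1::2]
noisy = symbols + 0.01 * (rng.standard_normal(symbols.shape)
                          + 1j * rng.standard_normal(symbols.shape))

# Receiver: undo the symbol mapping and the power allocation, then inverse DCT
rx = np.empty_like(flat)
rx[0::2], rx[1::2] = noisy.real, noisy.imag
recovered = idctn(rx.reshape(coeffs.shape) / g, norm="ortho")
print("reconstruction MSE:", np.mean((image - recovered) ** 2))
```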

NatiDroid: Cross-Language Android Permission Specification

  • Authors: Chaoran Li, Xiao Chen, Ruoxi Sun, Jason Xue, Sheng Wen, Muhammad Ejaz Ahmed, Seyit Camtepe, Yang Xiang
  • Subjects: Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2111.08217
  • Pdf link: https://arxiv.org/pdf/2111.08217
  • Abstract
    The Android system manages access to sensitive APIs by permission enforcement. An application (app) must declare proper permissions before invoking specific Android APIs. However, there is no official documentation providing the complete list of permission-protected APIs and the corresponding permissions to date. Researchers have spent significant efforts extracting such API protection mapping from the Android API framework, which leverages static code analysis to determine if specific permissions are required before accessing an API. Nevertheless, none of them has attempted to analyze the protection mapping in the native library (i.e., code written in C and C++), an essential component of the Android framework that handles communication with the lower-level hardware, such as cameras and sensors. While the protection mapping can be utilized to detect various security vulnerabilities in Android apps, such as permission over-privilege and component hijacking, imprecise mapping will lead to false results in detecting such security vulnerabilities. To fill this gap, we develop a prototype system, named NatiDroid, to facilitate the cross-language static analysis to benchmark against two state-of-the-art tools, termed Axplorer and Arcade. We evaluate NatiDroid on more than 11,000 Android apps, including system apps from custom Android ROMs and third-party apps from the Google Play. Our NatiDroid can identify up to 464 new API-permission mappings, in contrast to the worst-case results derived from both Axplorer and Arcade, where approximately 71% apps have at least one false positive in permission over-privilege and up to 3.6% apps have at least one false negative in component hijacking. Additionally, we identify that 24 components with at least one Native-triggered component hijacking vulnerability are misidentified by two benchmarks.

Towards Generating Real-World Time Series Data

  • Authors: Hengzhi Pei, Kan Ren, Yuqing Yang, Chang Liu, Tao Qin, Dongsheng Li
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.08386
  • Pdf link: https://arxiv.org/pdf/2111.08386
  • Abstract
    Time series data generation has drawn increasing attention in recent years. Several generative adversarial network (GAN) based methods have been proposed to tackle the problem usually with the assumption that the targeted time series data are well-formatted and complete. However, real-world time series (RTS) data are far away from this utopia, e.g., long sequences with variable lengths and informative missing data raise intractable challenges for designing powerful generation algorithms. In this paper, we propose a novel generative framework for RTS data - RTSGAN to tackle the aforementioned challenges. RTSGAN first learns an encoder-decoder module which provides a mapping between a time series instance and a fixed-dimension latent vector and then learns a generation module to generate vectors in the same latent space. By combining the generator and the decoder, RTSGAN is able to generate RTS which respect the original feature distributions and the temporal dynamics. To generate time series with missing values, we further equip RTSGAN with an observation embedding layer and a decide-and-generate decoder to better utilize the informative missing patterns. Experiments on the four RTS datasets show that the proposed framework outperforms the previous generation methods in terms of synthetic data utility for downstream classification and prediction tasks.

Weakly-supervised fire segmentation by visualizing intermediate CNN layers

  • Authors: Milad Niknejad, Alexandre Bernardino
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.08401
  • Pdf link: https://arxiv.org/pdf/2111.08401
  • Abstract
    Fire localization in images and videos is an important step for an autonomous system to combat fire incidents. State-of-the-art image segmentation methods based on deep neural networks require a large number of pixel-annotated samples to train Convolutional Neural Networks (CNNs) in a fully-supervised manner. In this paper, we consider weakly supervised segmentation of fire in images, in which only image labels are used to train the network. We show that in the case of fire segmentation, which is a binary segmentation problem, the mean value of the features in a mid-layer of a classification CNN can perform better than the conventional Class Activation Mapping (CAM) method. We also propose to further improve the segmentation accuracy by adding a rotation-equivariant regularization loss on the features of the last convolutional layer. Our results show noticeable improvements over the baseline method for weakly-supervised fire segmentation.
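
The mid-layer averaging idea can be sketched with a forward hook: take the channel-wise mean of an intermediate feature map and upsample it to image size. A randomly initialized ResNet-18 stands in for the trained fire classifier here; in practice the network would first be trained with image-level labels.

```python
# Channel-wise mean of a mid-layer feature map as a coarse localization map,
# upsampled to the input resolution and thresholded into a binary mask.
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet18()      # placeholder for a trained classifier
model.eval()
features = {}

def hook(_, __, output):
    features["mid"] = output

model.layer3.register_forward_hook(hook)   # a mid-level layer

img = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    model(img)

mid = features["mid"]                      # (1, 256, 14, 14)
saliency = mid.mean(dim=1, keepdim=True)   # channel-wise mean
saliency = F.interpolate(saliency, size=img.shape[-2:], mode="bilinear",
                         align_corners=False)
mask = (saliency > saliency.mean()).float()  # crude binary segmentation
print(mask.shape)  # torch.Size([1, 1, 224, 224])
```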

Grounding Psychological Shape Space in Convolutional Neural Networks

  • Authors: Lucas Bechberger, Kai-Uwe Kühnberger
  • Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.08409
  • Pdf link: https://arxiv.org/pdf/2111.08409
  • Abstract
    Shape information is crucial for human perception and cognition, and should therefore also play a role in cognitive AI systems. We employ the interdisciplinary framework of conceptual spaces, which proposes a geometric representation of conceptual knowledge through low-dimensional interpretable similarity spaces. These similarity spaces are often based on psychological dissimilarity ratings for a small set of stimuli, which are then transformed into a spatial representation by a technique called multidimensional scaling. Unfortunately, this approach is incapable of generalizing to novel stimuli. In this paper, we use convolutional neural networks to learn a generalizable mapping between perceptual inputs (pixels of grayscale line drawings) and a recently proposed psychological similarity space for the shape domain. We investigate different network architectures (classification network vs. autoencoder) and different training regimes (transfer learning vs. multi-task learning). Our results indicate that a classification-based multi-task learning scenario yields the best results, but that its performance is relatively sensitive to the dimensionality of the similarity space.

Towards Lightweight Controllable Audio Synthesis with Conditional Implicit Neural Representations

  • Authors: Jan Zuiderveld, Marco Federici, Erik J. Bekkers
  • Subjects: Sound (cs.SD); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.08462
  • Pdf link: https://arxiv.org/pdf/2111.08462
  • Abstract
    The high temporal resolution of audio and our perceptual sensitivity to small irregularities in waveforms make synthesizing at high sampling rates a complex and computationally intensive task, prohibiting real-time, controllable synthesis within many approaches. In this work we aim to shed light on the potential of Conditional Implicit Neural Representations (CINRs) as lightweight backbones in generative frameworks for audio synthesis. Implicit neural representations (INRs) are neural networks used to approximate low-dimensional functions, trained to represent a single geometric object by mapping input coordinates to structural information at input locations. In contrast with other neural methods for representing geometric objects, the memory required to parameterize the object is independent of resolution, and only scales with its complexity. A corollary of this is that INRs have infinite resolution, as they can be sampled at arbitrary resolutions. To apply the concept of INRs in the generative domain we frame generative modelling as learning a distribution of continuous functions. This can be achieved by introducing conditioning methods to INRs. Our experiments show that Periodic Conditional INRs (PCINRs) learn faster and generally produce quantitatively better audio reconstructions than Transposed Convolutional Neural Networks with equal parameter counts. However, their performance is very sensitive to activation scaling hyperparameters. When learning to represent more uniform sets, PCINRs tend to introduce artificial high-frequency components in reconstructions. We validate this noise can be minimized by applying standard weight regularization during training or decreasing the compositional depth of PCINRs, and suggest directions for future research.
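
An INR for audio is just a small network from time coordinates to amplitude; a minimal sinusoidal (SIREN-style) variant is sketched below. The conditioning mechanism that makes it a Conditional INR, and the periodic conditioning studied in the paper, are omitted for brevity.

```python
# Minimal sinusoidal implicit neural representation for audio: an MLP with sine
# activations maps time coordinates to waveform amplitude, so the signal can be
# resampled at arbitrary resolution after fitting.
import math
import torch
import torch.nn as nn

class SineINR(nn.Module):
    def __init__(self, hidden=128, w0=30.0):
        super().__init__()
        self.w0 = w0
        self.l1 = nn.Linear(1, hidden)
        self.l2 = nn.Linear(hidden, hidden)
        self.l3 = nn.Linear(hidden, 1)

    def forward(self, t):
        h = torch.sin(self.w0 * self.l1(t))
        h = torch.sin(self.w0 * self.l2(h))
        return self.l3(h)

# Fit one second of a 440 Hz tone sampled at 16 kHz
sr, f = 16000, 440.0
t = torch.linspace(0, 1, sr).unsqueeze(1)
target = torch.sin(2 * math.pi * f * t)
model = SineINR()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for _ in range(200):
    opt.zero_grad()
    loss = ((model(t) - target) ** 2).mean()
    loss.backward()
    opt.step()
print("final MSE:", loss.item())
```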

Remote Memory-Deduplication Attacks

  • Authors: Martin Schwarzl, Erik Kraft, Moritz Lipp, Daniel Gruss
  • Subjects: Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2111.08553
  • Pdf link: https://arxiv.org/pdf/2111.08553
  • Abstract
    Memory utilization can be reduced by merging identical memory blocks into copy-on-write mappings. Previous work showed that this so-called memory deduplication can be exploited in local attacks to break ASLR, spy on other programs, and determine the presence of data, i.e., website images. All these attacks exploit memory deduplication across security domains, which in turn was disabled. However, within a security domain or on an isolated system with no untrusted local access, memory deduplication is still not considered a security risk and was recently re-enabled on Windows by default. In this paper, we present the first fully remote memory-deduplication attacks. Unlike previous attacks, our attacks require no local code execution. Consequently, we can disclose memory contents from a remote server merely by sending and timing HTTP/1 and HTTP/2 network requests. We demonstrate our attacks on deduplication both on Windows and Linux and attack widely used server software such as Memcached and InnoDB. Our side channel leaks up to 34.41 B/h over the internet, making it faster than comparable remote memory-disclosure channels. We showcase our remote memory-deduplication attack in three case studies: First, we show that an attacker can disclose the presence of data in memory on a server running Memcached. We show that this information disclosure channel can also be used for fingerprinting and detect the correct libc version over the internet in 166.51 s. Second, in combination with InnoDB, we present an information disclosure attack to leak MariaDB database records. Third, we demonstrate a fully remote KASLR break in less than 4 minutes, allowing us to derandomize the kernel image of a virtual machine over the Internet, i.e., 14 network hops away. We conclude that memory deduplication must also be considered a security risk if only applied within a single security domain.

A Data-Driven Approach for Linear and Nonlinear Damage Detection Using Variational Mode Decomposition and GARCH Model

  • Authors: Vahid Reza Gharehbaghi, Hashem Kalbkhani, Ehsan Noroozinejad Farsangi, T.Y. Yang, Seyedali Mirjalili
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.08620
  • Pdf link: https://arxiv.org/pdf/2111.08620
  • Abstract
    In this article, an original data-driven approach is proposed to detect both linear and nonlinear damage in structures using output-only responses. The method deploys variational mode decomposition (VMD) and a generalised autoregressive conditional heteroscedasticity (GARCH) model for signal processing and feature extraction. To this end, VMD decomposes the response signals into intrinsic mode functions (IMFs). Afterwards, the GARCH model is utilised to represent the statistics of IMFs. The model coefficients of IMFs construct the primary feature vector. Kernel-based principal component analysis (PCA) and linear discriminant analysis (LDA) are utilised to reduce the redundancy of the primary features by mapping them to the new feature space. The informative features are then fed separately into three supervised classifiers, namely support vector machine (SVM), k-nearest neighbour (kNN), and fine tree. The performance of the proposed method is evaluated on two experimentally scaled models in terms of linear and nonlinear damage assessment. Kurtosis and ARCH tests proved the compatibility of the GARCH model.
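
The GARCH feature-extraction step can be sketched with the third-party `arch` package: fit a GARCH(1,1) to one decomposed mode and take the fitted coefficients as features. The VMD decomposition is assumed to have been done beforehand and is replaced by a synthetic signal; this is an illustration of the idea, not the authors' pipeline.

```python
# Fit a constant-mean GARCH(1,1) model to one (placeholder) intrinsic mode
# function and use the fitted coefficients as damage-sensitive features.
import numpy as np
from arch import arch_model

rng = np.random.default_rng(0)
imf = rng.standard_normal(2000)          # stand-in for one VMD mode

# arch_model defaults to a constant-mean GARCH(1,1) volatility model;
# the signal is scaled up because the library prefers larger magnitudes.
res = arch_model(100 * imf).fit(disp="off")
feature_vector = res.params.values       # mu, omega, alpha, beta -> primary features
print(dict(res.params))
```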

Machine Learning-Assisted Analysis of Small Angle X-ray Scattering

  • Authors: Piotr Tomaszewski, Shun Yu, Markus Borg, Jerk Rönnols
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.08645
  • Pdf link: https://arxiv.org/pdf/2111.08645
  • Abstract
    Small angle X-ray scattering (SAXS) is extensively used in materials science as a way of examining nanostructures. The analysis of experimental SAXS data involves mapping a rather simple data format to a vast amount of structural models. Despite various scientific computing tools to assist the model selection, the activity heavily relies on the SAXS analysts' experience, which is recognized as an efficiency bottleneck by the community. To cope with this decision-making problem, we develop and evaluate the open-source, Machine Learning-based tool SCAN (SCattering Ai aNalysis) to provide recommendations on model selection. SCAN exploits multiple machine learning algorithms and uses models and a simulation tool implemented in the SasView package for generating a well-defined set of datasets. Our evaluation shows that SCAN delivers an overall accuracy of 95%-97%. The XGBoost Classifier has been identified as the most accurate method with a good balance between accuracy and training time. With eleven predefined structural models for common nanostructures and an easy draw-drop function to expand the number and types of training models, SCAN can accelerate the SAXS data analysis workflow.
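
A minimal sketch of the classification step is shown below: an XGBoost classifier is trained on curves labelled by structural model. The synthetic arrays stand in for SasView-simulated scattering curves; this is not the SCAN tool or its data.

```python
# Train a gradient-boosted classifier to pick a structural model from a scattering curve.
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
n_models, n_curves, n_q = 11, 200, 128           # 11 structural models, 128 q-points
X = rng.random((n_models * n_curves, n_q))       # placeholder intensity curves I(q)
y = np.repeat(np.arange(n_models), n_curves)     # structural-model label per curve

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = XGBClassifier(n_estimators=200, max_depth=6, eval_metric="mlogloss")
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```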

Keyword: localization

Weakly-supervised fire segmentation by visualizing intermediate CNN layers

  • Authors: Milad Niknejad, Alexandre Bernardino
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.08401
  • Pdf link: https://arxiv.org/pdf/2111.08401
  • Abstract
    Fire localization in images and videos is an important step for an autonomous system to combat fire incidents. State-of-art image segmentation methods based on deep neural networks require a large number of pixel-annotated samples to train Convolutional Neural Networks (CNNs) in a fully-supervised manner. In this paper, we consider weakly supervised segmentation of fire in images, in which only image labels are used to train the network. We show that in the case of fire segmentation, which is a binary segmentation problem, the mean value of features in a mid-layer of classification CNN can perform better than conventional Class Activation Mapping (CAM) method. We also propose to further improve the segmentation accuracy by adding a rotation equivariant regularization loss on the features of the last convolutional layer. Our results show noticeable improvements over baseline method for weakly-supervised fire segmentation.
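
The core observation lends itself to a short sketch: take the channel-wise mean of an intermediate feature map of a classification CNN and upsample it to image size as a coarse localization map. A stock ResNet-18 and the chosen layer and threshold are assumptions for illustration, not the paper's fire classifier.

```python
# Channel-wise mean of a mid-layer feature map as a coarse localization mask.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18().eval()   # stand-in for a trained fire classifier
features = {}

def hook(_module, _inp, out):
    features["mid"] = out          # intermediate feature map

model.layer3.register_forward_hook(hook)

image = torch.rand(1, 3, 224, 224)  # placeholder input image
with torch.no_grad():
    model(image)

# Mean over channels, upsample to input size, normalize to [0, 1], threshold.
saliency = features["mid"].mean(dim=1, keepdim=True)
saliency = F.interpolate(saliency, size=image.shape[-2:], mode="bilinear", align_corners=False)
saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
mask = (saliency > 0.5).float()
print(mask.shape)  # torch.Size([1, 1, 224, 224])
```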

Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos

  • Authors: Minglang Qiao, Yufan Liu, Mai Xu, Xin Deng, Bing Li, Weiming Hu, Ali Borji
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.08567
  • Pdf link: https://arxiv.org/pdf/2111.08567
  • Abstract
    Visual and audio events simultaneously occur and both attract attention. However, most existing saliency prediction works ignore the influence of audio and only consider vision modality. In this paper, we propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video by leveraging visual, audio and face information. Specifically, we first introduce a large-scale database of multi-face video in visual-audio condition (MVVA), containing eye-tracking data and sound source annotations. Using this database, we find that sound influences human attention, and conversely attention offers a cue to determine sound source on multi-face video. Guided by these findings, a visual-audio multi-task network (VAM-Net) is introduced to predict saliency and locate sound source. VAM-Net consists of three branches corresponding to visual, audio and face modalities. Visual branch has a two-stream architecture to capture spatial and temporal information. Face and audio branches encode audio signals and faces, respectively. Finally, a spatio-temporal multi-modal graph (STMG) is constructed to model the interaction among multiple faces. With joint optimization of these branches, the intrinsic correlation of the tasks of saliency prediction and sound source localization is utilized and their performance is boosted by each other. Experiments show that the proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.

New submissions for Wed, 23 Jun 21

Keyword: SLAM

BEyond observation: an approach for ObjectNav

  • Authors: Daniel V. Ruiz, Eduardo Todt
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.11379
  • Pdf link: https://arxiv.org/pdf/2106.11379
  • Abstract
    With the rise of automation, unmanned vehicles became a hot topic both as commercial products and as a scientific research topic. It composes a multi-disciplinary field of robotics that encompasses embedded systems, control theory, path planning, Simultaneous Localization and Mapping (SLAM), scene reconstruction, and pattern recognition. In this work, we present our exploratory research of how sensor data fusion and state-of-the-art machine learning algorithms can perform the Embodied Artificial Intelligence (E-AI) task called Visual Semantic Navigation. This task, a.k.a Object-Goal Navigation (ObjectNav) consists of autonomous navigation using egocentric visual observations to reach an object belonging to the target semantic class without prior knowledge of the environment. Our method reached fourth place on the Habitat Challenge 2021 ObjectNav on the Minival phase and the Test-Standard Phase.

SA-LOAM: Semantic-aided LiDAR SLAM with Loop Closure

  • Authors: Lin Li, Xin Kong, Xiangrui Zhao, Wanlong Li, Feng Wen, Hongbo Zhang, Yong Liu
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.11516
  • Pdf link: https://arxiv.org/pdf/2106.11516
  • Abstract
    LiDAR-based SLAM system is admittedly more accurate and stable than others, while its loop closure detection is still an open issue. With the development of 3D semantic segmentation for point cloud, semantic information can be obtained conveniently and steadily, essential for high-level intelligence and conductive to SLAM. In this paper, we present a novel semantic-aided LiDAR SLAM with loop closure based on LOAM, named SA-LOAM, which leverages semantics in odometry as well as loop closure detection. Specifically, we propose a semantic-assisted ICP, including semantically matching, downsampling and plane constraint, and integrates a semantic graph-based place recognition method in our loop closure detection module. Benefitting from semantics, we can improve the localization accuracy, detect loop closures effectively, and construct a global consistent semantic map even in large-scale scenes. Extensive experiments on KITTI and Ford Campus dataset show that our system significantly improves baseline performance, has generalization ability to unseen data and achieves competitive results compared with state-of-the-art methods.
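
A minimal sketch of the "semantically matching" idea is given below: nearest-neighbour correspondences are accepted only when source and target points share a semantic label. This is a simplified illustration with synthetic data, not the SA-LOAM implementation.

```python
# Label-consistent nearest-neighbour correspondences for a semantic-assisted ICP step.
import numpy as np
from scipy.spatial import cKDTree

def semantic_correspondences(src_pts, src_labels, tgt_pts, tgt_labels, max_dist=1.0):
    """Return index pairs (i, j) with matching labels and distance below max_dist."""
    pairs = []
    for label in np.unique(src_labels):
        src_idx = np.flatnonzero(src_labels == label)
        tgt_idx = np.flatnonzero(tgt_labels == label)
        if len(src_idx) == 0 or len(tgt_idx) == 0:
            continue
        tree = cKDTree(tgt_pts[tgt_idx])
        dists, nn = tree.query(src_pts[src_idx], k=1)
        keep = dists < max_dist
        pairs.extend(zip(src_idx[keep], tgt_idx[nn[keep]]))
    return np.array(pairs)

rng = np.random.default_rng(0)
src = rng.random((100, 3))
tgt = src + 0.01 * rng.standard_normal((100, 3))
labels = rng.integers(0, 4, size=100)
print(semantic_correspondences(src, labels, tgt, labels).shape)
```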

HybVIO: Pushing the Limits of Real-time Visual-inertial Odometry

  • Authors: Otto Seiskari, Pekka Rantalankila, Juho Kannala, Jerry Ylilammi, Esa Rahtu, Arno Solin
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.11857
  • Pdf link: https://arxiv.org/pdf/2106.11857
  • Abstract
    We present HybVIO, a novel hybrid approach for combining filtering-based visual-inertial odometry (VIO) with optimization-based SLAM. The core of our method is highly robust, independent VIO with improved IMU bias modeling, outlier rejection, stationarity detection, and feature track selection, which is adjustable to run on embedded hardware. Long-term consistency is achieved with a loosely-coupled SLAM module. In academic benchmarks, our solution yields excellent performance in all categories, especially in the real-time use case, where we outperform the current state-of-the-art. We also demonstrate the feasibility of VIO for vehicular tracking on consumer-grade hardware using a custom dataset, and show good performance in comparison to current commercial VISLAM alternatives.

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

SA-LOAM: Semantic-aided LiDAR SLAM with Loop Closure

  • Authors: Lin Li, Xin Kong, Xiangrui Zhao, Wanlong Li, Feng Wen, Hongbo Zhang, Yong Liu
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.11516
  • Pdf link: https://arxiv.org/pdf/2106.11516
  • Abstract
    LiDAR-based SLAM system is admittedly more accurate and stable than others, while its loop closure detection is still an open issue. With the development of 3D semantic segmentation for point cloud, semantic information can be obtained conveniently and steadily, essential for high-level intelligence and conductive to SLAM. In this paper, we present a novel semantic-aided LiDAR SLAM with loop closure based on LOAM, named SA-LOAM, which leverages semantics in odometry as well as loop closure detection. Specifically, we propose a semantic-assisted ICP, including semantically matching, downsampling and plane constraint, and integrates a semantic graph-based place recognition method in our loop closure detection module. Benefitting from semantics, we can improve the localization accuracy, detect loop closures effectively, and construct a global consistent semantic map even in large-scale scenes. Extensive experiments on KITTI and Ford Campus dataset show that our system significantly improves baseline performance, has generalization ability to unseen data and achieves competitive results compared with state-of-the-art methods.

Keyword: lidar

SA-LOAM: Semantic-aided LiDAR SLAM with Loop Closure

  • Authors: Lin Li, Xin Kong, Xiangrui Zhao, Wanlong Li, Feng Wen, Hongbo Zhang, Yong Liu
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.11516
  • Pdf link: https://arxiv.org/pdf/2106.11516
  • Abstract
    LiDAR-based SLAM system is admittedly more accurate and stable than others, while its loop closure detection is still an open issue. With the development of 3D semantic segmentation for point cloud, semantic information can be obtained conveniently and steadily, essential for high-level intelligence and conductive to SLAM. In this paper, we present a novel semantic-aided LiDAR SLAM with loop closure based on LOAM, named SA-LOAM, which leverages semantics in odometry as well as loop closure detection. Specifically, we propose a semantic-assisted ICP, including semantically matching, downsampling and plane constraint, and integrates a semantic graph-based place recognition method in our loop closure detection module. Benefitting from semantics, we can improve the localization accuracy, detect loop closures effectively, and construct a global consistent semantic map even in large-scale scenes. Extensive experiments on KITTI and Ford Campus dataset show that our system significantly improves baseline performance, has generalization ability to unseen data and achieves competitive results compared with state-of-the-art methods.

PALMAR: Towards Adaptive Multi-inhabitant Activity Recognition in Point-Cloud Technology

  • Authors: Mohammad Arif Ul Alam, Md Mahmudur Rahman, Jared Q Widberg
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2106.11902
  • Pdf link: https://arxiv.org/pdf/2106.11902
  • Abstract
    With the advancement of deep neural networks and computer vision-based Human Activity Recognition, employment of Point-Cloud Data technologies (LiDAR, mmWave) has seen a lot interests due to its privacy preserving nature. Given the high promise of accurate PCD technologies, we develop, PALMAR, a multiple-inhabitant activity recognition system by employing efficient signal processing and novel machine learning techniques to track individual person towards developing an adaptive multi-inhabitant tracking and HAR system. More specifically, we propose (i) a voxelized feature representation-based real-time PCD fine-tuning method, (ii) efficient clustering (DBSCAN and BIRCH), Adaptive Order Hidden Markov Model based multi-person tracking and crossover ambiguity reduction techniques and (iii) novel adaptive deep learning-based domain adaptation technique to improve the accuracy of HAR in presence of data scarcity and diversity (device, location and population diversity). We experimentally evaluate our framework and systems using (i) a real-time PCD collected by three devices (3D LiDAR and 79 GHz mmWave) from 6 participants, (ii) one publicly available 3D LiDAR activity data (28 participants) and (iii) an embedded hardware prototype system which provided promising HAR performances in multi-inhabitants (96%) scenario with a 63% improvement of multi-person tracking than state-of-art framework without losing significant system performances in the edge computing device.
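
One of the clustering components mentioned (DBSCAN) is easy to illustrate: sparse point-cloud returns are grouped into person candidates, with isolated returns marked as noise. The parameter values and synthetic clusters below are purely illustrative, not the PALMAR pipeline.

```python
# Group point-cloud returns into person candidates with DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two synthetic "person" clusters plus background noise, in x/y/z metres.
person_a = rng.normal(loc=[1.0, 2.0, 1.0], scale=0.15, size=(80, 3))
person_b = rng.normal(loc=[3.5, 1.0, 1.0], scale=0.15, size=(80, 3))
noise = rng.uniform(low=0.0, high=5.0, size=(40, 3))
points = np.vstack([person_a, person_b, noise])

clustering = DBSCAN(eps=0.3, min_samples=10).fit(points)
labels = clustering.labels_                  # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```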

Keyword: loop detection

There is no result

Keyword: autonomous driving

Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation

  • Authors: Eslam Mohamed, Ahmed El-Sallab
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2106.11401
  • Pdf link: https://arxiv.org/pdf/2106.11401
  • Abstract
    Moving objects have special importance for Autonomous Driving tasks. Detecting moving objects can be posed as Moving Object Segmentation, by segmenting the object pixels, or Moving Object Detection, by generating a bounding box for the moving targets. In this paper, we present a Multi-Task Learning architecture, based on Transformers, to jointly perform both tasks through one network. Due to the importance of the motion features to the task, the whole setup is based on a Spatio-Temporal aggregation. We evaluate the performance of the individual tasks architecture versus the MTL setup, both with early shared encoders, and late shared encoder-decoder transformers. For the latter, we present a novel joint tasks query decoder transformer, that enables us to have tasks dedicated heads out of the shared model. To evaluate our approach, we use the KITTI MOD [29] data set. Results show 1.5% mAP improvement for Moving Object Detection, and 2% IoU improvement for Moving Object Segmentation, over the individual tasks networks.

MODETR: Moving Object Detection with Transformers

  • Authors: Eslam Mohamed, Ahmad El-Sallab
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2106.11422
  • Pdf link: https://arxiv.org/pdf/2106.11422
  • Abstract
    Moving Object Detection (MOD) is a crucial task for the Autonomous Driving pipeline. MOD is usually handled via 2-stream convolutional architectures that incorporate both appearance and motion cues, without considering the inter-relations between the spatial or motion features. In this paper, we tackle this problem through multi-head attention mechanisms, both across the spatial and motion streams. We propose MODETR; a Moving Object DEtection TRansformer network, comprised of multi-stream transformer encoders for both spatial and motion modalities, and an object transformer decoder that produces the moving objects bounding boxes using set predictions. The whole architecture is trained end-to-end using bi-partite loss. Several methods of incorporating motion cues with the Transformer model are explored, including two-stream RGB and Optical Flow (OF) methods, and multi-stream architectures that take advantage of sequence information. To incorporate the temporal information, we propose a new Temporal Positional Encoding (TPE) approach to extend the Spatial Positional Encoding (SPE) in DETR. We explore two architectural choices for that, balancing between speed and time. To evaluate our network, we perform the MOD task on the KITTI MOD [6] data set. Results show a significant 5% mAP improvement of the Transformer network for MOD over the state-of-the-art methods. Moreover, the proposed TPE encoding provides 10% mAP improvement over the SPE baseline.
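
A minimal sketch of a sinusoidal temporal positional encoding is shown below: the standard sine/cosine table is indexed by frame number and added to every token of that frame. This illustrates the general idea of encoding time, not the specific TPE design proposed in the paper.

```python
# Sinusoidal positional codes reused over the frame index as a temporal encoding.
import math
import torch

def sinusoidal_encoding(length: int, dim: int) -> torch.Tensor:
    """Return a (length, dim) table of sine/cosine positional codes."""
    position = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Features: (time, batch, tokens, dim); add a per-frame temporal code to every token.
T, B, N, D = 4, 2, 100, 256
features = torch.rand(T, B, N, D)
tpe = sinusoidal_encoding(T, D).view(T, 1, 1, D)
features = features + tpe
print(features.shape)  # torch.Size([4, 2, 100, 256])
```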

SeqNetVLAD vs PointNetVLAD: Image Sequence vs 3D Point Clouds for Day-Night Place Recognition

  • Authors: Sourav Garg, Michael Milford
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.11481
  • Pdf link: https://arxiv.org/pdf/2106.11481
  • Abstract
    Place Recognition is a crucial capability for mobile robot localization and navigation. Image-based or Visual Place Recognition (VPR) is a challenging problem as scene appearance and camera viewpoint can change significantly when places are revisited. Recent VPR methods based on "sequential representations" have shown promising results as compared to traditional sequence score aggregation or single image based techniques. In parallel to these endeavors, 3D point clouds based place recognition is also being explored following the advances in deep learning based point cloud processing. However, a key question remains: is an explicit 3D structure based place representation always superior to an implicit "spatial" representation based on sequence of RGB images which can inherently learn scene structure. In this extended abstract, we attempt to compare these two types of methods by considering a similar "metric span" to represent places. We compare a 3D point cloud based method (PointNetVLAD) with image sequence based methods (SeqNet and others) and showcase that image sequence based techniques approach, and can even surpass, the performance achieved by point cloud based methods for a given metric span. These performance variations can be attributed to differences in data richness of input sensors as well as data accumulation strategies for a mobile robot. While a perfect apple-to-apple comparison may not be feasible for these two different modalities, the presented comparison takes a step in the direction of answering deeper questions regarding spatial representations, relevant to several applications like Autonomous Driving and Augmented/Virtual Reality. Source code available publicly https://github.com/oravus/seqNet.

nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

  • Authors: Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, Sammy Omari
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.11810
  • Pdf link: https://arxiv.org/pdf/2106.11810
  • Abstract
    In this work, we propose the world's first closed-loop ML-based planning benchmark for autonomous driving. While there is a growing body of ML-based motion planners, the lack of established datasets and metrics has limited the progress in this area. Existing benchmarks for autonomous vehicle motion prediction have focused on short-term motion forecasting, rather than long-term planning. This has led previous works to use open-loop evaluation with L2-based metrics, which are not suitable for fairly evaluating long-term planning. Our benchmark overcomes these limitations by introducing a large-scale driving dataset, lightweight closed-loop simulator, and motion-planning-specific metrics. We provide a high-quality dataset with 1500h of human driving data from 4 cities across the US and Asia with widely varying traffic patterns (Boston, Pittsburgh, Las Vegas and Singapore). We will provide a closed-loop simulation framework with reactive agents and provide a large set of both general and scenario-specific planning metrics. We plan to release the dataset at NeurIPS 2021 and organize benchmark challenges starting in early 2022.

MEAL: Manifold Embedding-based Active Learning

  • Authors: Deepthi Sreenivasaiah, Thomas Wollmann
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2106.11858
  • Pdf link: https://arxiv.org/pdf/2106.11858
  • Abstract
    Image segmentation is a common and challenging task in autonomous driving. Availability of sufficient pixel-level annotations for the training data is a hurdle. Active learning helps learning from small amounts of data by suggesting the most promising samples for labeling. In this work, we propose a new pool-based method for active learning, which proposes promising image regions, in each acquisition step. The problem is framed in an exploration-exploitation framework by combining an embedding based on Uniform Manifold Approximation to model representativeness with entropy as uncertainty measure to model informativeness. We applied our proposed method to the challenging autonomous driving data sets CamVid and Cityscapes and performed a quantitative comparison with state-of-the-art methods. We find that our active learning method achieves better performance on CamVid compared to other methods, while on Cityscapes, the performance lift was negligible.
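
The informativeness half of such an acquisition function can be sketched in a few lines: rank unlabelled samples by the entropy of the model's softmax outputs. The representativeness term based on the manifold embedding is omitted here, so this is only a partial, illustrative sketch of the selection step.

```python
# Entropy-based ranking of unlabelled samples for active learning.
import numpy as np

def entropy_ranking(probabilities: np.ndarray) -> np.ndarray:
    """probabilities: (n_samples, n_classes). Returns indices, most uncertain first."""
    entropy = -np.sum(probabilities * np.log(probabilities + 1e-12), axis=1)
    return np.argsort(entropy)[::-1]

probs = np.array([[0.95, 0.03, 0.02],    # confident -> low entropy
                  [0.34, 0.33, 0.33],    # uncertain -> high entropy
                  [0.70, 0.20, 0.10]])
print(entropy_ranking(probs))  # [1 2 0]
```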

New submissions for Tue, 8 Jun 21

Keyword: SLAM

FedNL: Making Newton-Type Methods Applicable to Federated Learning

  • Authors: Mher Safaryan, Rustem Islamov, Xun Qian, Peter Richtárik
  • Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
  • Arxiv link: https://arxiv.org/abs/2106.02969
  • Pdf link: https://arxiv.org/pdf/2106.02969
  • Abstract
    Inspired by recent work of Islamov et al (2021), we propose a family of Federated Newton Learn (FedNL) methods, which we believe is a marked step in the direction of making second-order methods applicable to FL. In contrast to the aforementioned work, FedNL employs a different Hessian learning technique which i) enhances privacy as it does not rely on the training data to be revealed to the coordinating server, ii) makes it applicable beyond generalized linear models, and iii) provably works with general contractive compression operators for compressing the local Hessians, such as Top-$K$ or Rank-$R$, which are vastly superior in practice. Notably, we do not need to rely on error feedback for our methods to work with contractive compressors. Moreover, we develop FedNL-PP, FedNL-CR and FedNL-LS, which are variants of FedNL that support partial participation, and globalization via cubic regularization and line search, respectively, and FedNL-BC, which is a variant that can further benefit from bidirectional compression of gradients and models, i.e., smart uplink gradient and smart downlink model compression. We prove local convergence rates that are independent of the condition number, the number of training data points, and compression variance. Our communication efficient Hessian learning technique provably learns the Hessian at the optimum. Finally, we perform a variety of numerical experiments that show that our FedNL methods have state-of-the-art communication complexity when compared to key baselines.
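
The Top-K compression operator referred to above is simple to sketch: keep only the K largest-magnitude entries of a matrix and zero the rest. This shows the operator in isolation, not the FedNL algorithm itself.

```python
# Top-K contractive compressor: zero out all but the k largest-magnitude entries.
import numpy as np

def top_k(matrix: np.ndarray, k: int) -> np.ndarray:
    """Return a copy of `matrix` with all but its k largest-magnitude entries zeroed."""
    flat = matrix.ravel()
    if k >= flat.size:
        return matrix.copy()
    keep = np.argpartition(np.abs(flat), -k)[-k:]
    compressed = np.zeros_like(flat)
    compressed[keep] = flat[keep]
    return compressed.reshape(matrix.shape)

rng = np.random.default_rng(0)
H = rng.standard_normal((6, 6))
H = (H + H.T) / 2                        # symmetric, Hessian-like matrix
print(np.count_nonzero(top_k(H, k=5)))   # 5
```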

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

Radar-Camera Pixel Depth Association for Depth Completion

  • Authors: Yunfei Long, Daniel Morris, Xiaoming Liu, Marcos Castro, Punarjay Chakravarty, Praveen Narayanan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.02778
  • Pdf link: https://arxiv.org/pdf/2106.02778
  • Abstract
    While radar and video data can be readily fused at the detection level, fusing them at the pixel level is potentially more beneficial. This is also more challenging in part due to the sparsity of radar, but also because automotive radar beams are much wider than a typical pixel combined with a large baseline between camera and radar, which results in poor association between radar pixels and color pixel. A consequence is that depth completion methods designed for LiDAR and video fare poorly for radar and video. Here we propose a radar-to-pixel association stage which learns a mapping from radar returns to pixels. This mapping also serves to densify radar returns. Using this as a first stage, followed by a more traditional depth completion method, we are able to achieve image-guided depth completion with radar and video. We demonstrate performance superior to camera and radar alone on the nuScenes dataset. Our source code is available at https://github.com/longyunf/rc-pda.

Brno Urban Dataset: Winter Extention

  • Authors: Adam Ligocki, Ales Jelinek, Ludek Zalud
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.02952
  • Pdf link: https://arxiv.org/pdf/2106.02952
  • Abstract
    Research on autonomous driving is advancing dramatically and requires new data and techniques to progress even further. To reflect this pressure, we present an extension of our recent work - the Brno Urban Dataset (BUD). The new data focus on winter conditions in various snow-covered environments and feature additional LiDAR and radar sensors for object detection in front of the vehicle. The improvement affects the old data as well. We provide YOLO detection annotations for all old RGB images in the dataset. The detections are further also transferred by our original algorithm into the infra-red (IR) images, captured by the thermal camera. To our best knowledge, it makes this dataset the largest source of machine-annotated thermal images currently available. The dataset is published under MIT license on https://github.com/Robotics-BUT/Brno-Urban-Dataset.

Stein ICP for Uncertainty Estimation in Point Cloud Matching

  • Authors: Fahira Afzal Maken, Fabio Ramos, Lionel Ott
  • Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2106.03287
  • Pdf link: https://arxiv.org/pdf/2106.03287
  • Abstract
    Quantification of uncertainty in point cloud matching is critical in many tasks such as pose estimation, sensor fusion, and grasping. Iterative closest point (ICP) is a commonly used pose estimation algorithm which provides a point estimate of the transformation between two point clouds. There are many sources of uncertainty in this process that may arise due to sensor noise, ambiguous environment, and occlusion. However, for safety critical problems such as autonomous driving, a point estimate of the pose transformation is not sufficient as it does not provide information about the multiple solutions. Current probabilistic ICP methods usually do not capture all sources of uncertainty and may provide unreliable transformation estimates which can have a detrimental effect in state estimation or decision making tasks that use this information. In this work we propose a new algorithm to align two point clouds that can precisely estimate the uncertainty of ICP's transformation parameters. We develop a Stein variational inference framework with gradient based optimization of ICP's cost function. The method provides a non-parametric estimate of the transformation, can model complex multi-modal distributions, and can be effectively parallelized on a GPU. Experiments using 3D kinect data as well as sparse indoor/outdoor LiDAR data show that our method is capable of efficiently producing accurate pose uncertainty estimates.
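
For context, below is a minimal single iteration of plain point-estimate ICP (nearest neighbours plus an SVD/Kabsch alignment); the Stein variational treatment of uncertainty in the paper extends well beyond this deterministic baseline.

```python
# One ICP iteration: nearest-neighbour association followed by Kabsch alignment.
import numpy as np
from scipy.spatial import cKDTree

def icp_step(source: np.ndarray, target: np.ndarray):
    """Return rotation R and translation t aligning source to target for one iteration."""
    nn = cKDTree(target).query(source, k=1)[1]
    matched = target[nn]
    src_c, tgt_c = source.mean(axis=0), matched.mean(axis=0)
    H = (source - src_c).T @ (matched - tgt_c)   # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                     # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = tgt_c - R @ src_c
    return R, t

rng = np.random.default_rng(0)
target = rng.random((200, 3))
source = target + np.array([0.05, 0.0, 0.0])     # shifted copy of the target cloud
R, t = icp_step(source, target)
print(np.round(t, 3))
```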

Cost-effective Mapping of Mobile Robot Based on the Fusion of UWB and Short-range 2D LiDAR

  • Authors: Ran Liu, Yongping He, Chau Yuen, Billy Pik Lik Lau, Rashid Ali, Wenpeng Fu, Zhiqiang Cao
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.03648
  • Pdf link: https://arxiv.org/pdf/2106.03648
  • Abstract
    Environment mapping is an essential prerequisite for mobile robots to perform different tasks such as navigation and mission planning. With the availability of low-cost 2D LiDARs, there are increasing applications of such 2D LiDARs in industrial environments. However, environment mapping in an unknown and feature-less environment with such low-cost 2D LiDARs remains a challenge. The challenge mainly originates from the short-range of LiDARs and complexities in performing scan matching in these environments. In order to resolve these shortcomings, we propose to fuse the ultra-wideband (UWB) with 2D LiDARs to improve the mapping quality of a mobile robot. The optimization-based approach is utilized for the fusion of UWB ranging information and odometry to first optimize the trajectory. Then the LiDAR-based loop closures are incorporated to improve the accuracy of the trajectory estimation. Finally, the optimized trajectory is combined with the LiDAR scans to produce the occupancy map of the environment. The performance of the proposed approach is evaluated in an indoor feature-less environment with a size of 20m*20m. Obtained results show that the mapping error of the proposed scheme is 85.5% less than that of the conventional GMapping algorithm with short-range LiDAR (for example Hokuyo URG-04LX in our experiment with a maximum range of 5.6m).
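
A toy version of the optimisation-based fusion idea is sketched below: 2D robot positions are adjusted so that they agree with noisy odometry increments and with UWB ranges to fixed anchors. Anchor layout, noise levels, and the dense per-pose ranging are assumptions made for illustration, not the paper's setup.

```python
# Least-squares fusion of odometry increments and UWB ranges to known anchors (2D toy).
import numpy as np
from scipy.optimize import least_squares

anchors = np.array([[0.0, 0.0], [20.0, 0.0], [0.0, 20.0]])
true_path = np.stack([np.linspace(1, 18, 30), np.linspace(1, 18, 30)], axis=1)

rng = np.random.default_rng(0)
odom = np.diff(true_path, axis=0) + 0.05 * rng.standard_normal((29, 2))
ranges = np.linalg.norm(true_path[:, None, :] - anchors[None], axis=2)
ranges += 0.1 * rng.standard_normal(ranges.shape)

def residuals(flat_positions):
    p = flat_positions.reshape(-1, 2)
    r_odom = (np.diff(p, axis=0) - odom).ravel()          # odometry residuals
    r_uwb = (np.linalg.norm(p[:, None, :] - anchors[None], axis=2) - ranges).ravel()
    return np.concatenate([r_odom, r_uwb])

initial = np.cumsum(np.vstack([true_path[0], odom]), axis=0)   # dead-reckoned guess
solution = least_squares(residuals, initial.ravel())
optimized = solution.x.reshape(-1, 2)
print("RMSE vs ground truth:", np.sqrt(np.mean((optimized - true_path) ** 2)))
```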

Keyword: loop detection

There is no result

Keyword: autonomous driving

Constrained Generalized Additive 2 Model with Consideration of High-Order Interactions

  • Authors: Akihisa Watanabe, Michiya Kuramata, Kaito Majima, Haruka Kiyohara, Kensho Kondo, Kazuhide Nakata
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2106.02836
  • Pdf link: https://arxiv.org/pdf/2106.02836
  • Abstract
    In recent years, machine learning and AI have been introduced in many industrial fields. In fields such as finance, medicine, and autonomous driving, where the inference results of a model may have serious consequences, high interpretability as well as prediction accuracy is required. In this study, we propose CGA2M+, which is based on the Generalized Additive 2 Model (GA2M) and differs from it in two major ways. The first is the introduction of monotonicity. Imposing monotonicity on some functions based on an analyst's knowledge is expected to improve not only interpretability but also generalization performance. The second is the introduction of a higher-order term: given that GA2M considers only second-order interactions, we aim to balance interpretability and prediction accuracy by introducing a higher-order term that can capture higher-order interactions. In this way, we can improve prediction performance without compromising interpretability by applying learning innovation. Numerical experiments showed that the proposed model has high predictive performance and interpretability. Furthermore, we confirmed that generalization performance is improved by introducing monotonicity.

Brno Urban Dataset: Winter Extention

  • Authors: Adam Ligocki, Ales Jelinek, Ludek Zalud
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.02952
  • Pdf link: https://arxiv.org/pdf/2106.02952
  • Abstract
    Research on autonomous driving is advancing dramatically and requires new data and techniques to progress even further. To reflect this pressure, we present an extension of our recent work - the Brno Urban Dataset (BUD). The new data focus on winter conditions in various snow-covered environments and feature additional LiDAR and radar sensors for object detection in front of the vehicle. The improvement affects the old data as well. We provide YOLO detection annotations for all old RGB images in the dataset. The detections are further also transferred by our original algorithm into the infra-red (IR) images, captured by the thermal camera. To our best knowledge, it makes this dataset the largest source of machine-annotated thermal images currently available. The dataset is published under MIT license on https://github.com/Robotics-BUT/Brno-Urban-Dataset.

Stein ICP for Uncertainty Estimation in Point Cloud Matching

  • Authors: Fahira Afzal Maken, Fabio Ramos, Lionel Ott
  • Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2106.03287
  • Pdf link: https://arxiv.org/pdf/2106.03287
  • Abstract
    Quantification of uncertainty in point cloud matching is critical in many tasks such as pose estimation, sensor fusion, and grasping. Iterative closest point (ICP) is a commonly used pose estimation algorithm which provides a point estimate of the transformation between two point clouds. There are many sources of uncertainty in this process that may arise due to sensor noise, ambiguous environment, and occlusion. However, for safety critical problems such as autonomous driving, a point estimate of the pose transformation is not sufficient as it does not provide information about the multiple solutions. Current probabilistic ICP methods usually do not capture all sources of uncertainty and may provide unreliable transformation estimates which can have a detrimental effect in state estimation or decision making tasks that use this information. In this work we propose a new algorithm to align two point clouds that can precisely estimate the uncertainty of ICP's transformation parameters. We develop a Stein variational inference framework with gradient based optimization of ICP's cost function. The method provides a non-parametric estimate of the transformation, can model complex multi-modal distributions, and can be effectively parallelized on a GPU. Experiments using 3D kinect data as well as sparse indoor/outdoor LiDAR data show that our method is capable of efficiently producing accurate pose uncertainty estimates.

Self-supervised Depth Estimation Leveraging Global Perception and Geometric Smoothness Using On-board Videos

  • Authors: Shaocheng Jia, Xin Pei, Wei Yao, S.C. Wong
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.03505
  • Pdf link: https://arxiv.org/pdf/2106.03505
  • Abstract
    Self-supervised depth estimation has drawn much attention in recent years as it does not require labeled data but image sequences. Moreover, it can be conveniently used in various applications, such as autonomous driving, robotics, realistic navigation, and smart cities. However, extracting global contextual information from images and predicting a geometrically natural depth map remain challenging. In this paper, we present DLNet for pixel-wise depth estimation, which simultaneously extracts global and local features with the aid of our depth Linformer block. This block consists of the Linformer and innovative soft split multi-layer perceptron blocks. Moreover, a three-dimensional geometry smoothness loss is proposed to predict a geometrically natural depth map by imposing the second-order smoothness constraint on the predicted three-dimensional point clouds, thereby realizing improved performance as a byproduct. Finally, we explore the multi-scale prediction strategy and propose the maximum margin dual-scale prediction strategy for further performance improvement. In experiments on the KITTI and Make3D benchmarks, the proposed DLNet achieves performance competitive to those of the state-of-the-art methods, reducing time and space complexities by more than 62% and 56%, respectively. Extensive testing on various real-world situations further demonstrates the strong practicality and generalization capability of the proposed model.
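
The second-order smoothness idea can be illustrated with a simple image-space penalty that drives finite-difference second derivatives of the predicted depth towards zero; the paper applies its constraint to back-projected 3D point clouds, so this is only the simplified principle, not the proposed loss.

```python
# Second-order smoothness penalty on a predicted depth map.
import torch

def second_order_smoothness(depth: torch.Tensor) -> torch.Tensor:
    """depth: (B, 1, H, W). Returns the mean absolute second derivative along x and y."""
    d2x = depth[:, :, :, 2:] - 2 * depth[:, :, :, 1:-1] + depth[:, :, :, :-2]
    d2y = depth[:, :, 2:, :] - 2 * depth[:, :, 1:-1, :] + depth[:, :, :-2, :]
    return d2x.abs().mean() + d2y.abs().mean()

depth = torch.rand(2, 1, 64, 64, requires_grad=True)
loss = second_order_smoothness(depth)
loss.backward()
print(float(loss))
```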

Learning without Knowing: Unobserved Context in Continuous Transfer Reinforcement Learning

  • Authors: Chenyu Liu, Yan Zhang, Yi Shen, Michael M. Zavlanos
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2106.03833
  • Pdf link: https://arxiv.org/pdf/2106.03833
  • Abstract
    In this paper, we consider a transfer Reinforcement Learning (RL) problem in continuous state and action spaces, under unobserved contextual information. For example, the context can represent the mental view of the world that an expert agent has formed through past interactions with this world. We assume that this context is not accessible to a learner agent who can only observe the expert data. Then, our goal is to use the context-aware expert data to learn an optimal context-unaware policy for the learner using only a few new data samples. Such problems are typically solved using imitation learning that assumes that both the expert and learner agents have access to the same information. However, if the learner does not know the expert context, using the expert data alone will result in a biased learner policy and will require many new data samples to improve. To address this challenge, in this paper, we formulate the learning problem as a causal bound-constrained Multi-Armed-Bandit (MAB) problem. The arms of this MAB correspond to a set of basis policy functions that can be initialized in an unsupervised way using the expert data and represent the different expert behaviors affected by the unobserved context. On the other hand, the MAB constraints correspond to causal bounds on the accumulated rewards of these basis policy functions that we also compute from the expert data. The solution to this MAB allows the learner agent to select the best basis policy and improve it online. And the use of causal bounds reduces the exploration variance and, therefore, improves the learning rate. We provide numerical experiments on an autonomous driving example that show that our proposed transfer RL method improves the learner's policy faster compared to existing imitation learning methods and enjoys much lower variance during training.

Tunable Trajectory Planner Using G3 Curves

  • Authors: Alexander Botros, Stephen L. Smith
  • Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2106.03836
  • Pdf link: https://arxiv.org/pdf/2106.03836
  • Abstract
    Trajectory planning is commonly used as part of a local planner in autonomous driving. This paper considers the problem of planning a continuous-curvature-rate trajectory between fixed start and goal states that minimizes a tunable trade-off between passenger comfort and travel time. The problem is an instance of infinite dimensional optimization over two continuous functions: a path, and a velocity profile. We propose a simplification of this problem that facilitates the discretization of both functions. This paper also proposes a method to quickly generate minimal-length paths between start and goal states based on a single tuning parameter: the second derivative of curvature. Furthermore, we discretize the set of velocity profiles along a given path into a selection of acceleration way-points along the path. Gradient-descent is then employed to minimize cost over feasible choices of the second derivative of curvature, and acceleration way-points, resulting in a method that repeatedly solves the path and velocity profiles in an iterative fashion. Numerical examples are provided to illustrate the benefits of the proposed methods.

New submissions for Fri, 18 Jun 21

Keyword: SLAM

Towards bio-inspired unsupervised representation learning for indoor aerial navigation

  • Authors: Ni Wang, Ozan Catal, Tim Verbelen, Matthias Hartmann, Bart Dhoedt
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2106.09326
  • Pdf link: https://arxiv.org/pdf/2106.09326
  • Abstract
    Aerial navigation in GPS-denied, indoor environments, is still an open challenge. Drones can perceive the environment from a richer set of viewpoints, while having more stringent compute and energy constraints than other autonomous platforms. To tackle that problem, this research displays a biologically inspired deep-learning algorithm for simultaneous localization and mapping (SLAM) and its application in a drone navigation system. We propose an unsupervised representation learning method that yields low-dimensional latent state descriptors, that mitigates the sensitivity to perceptual aliasing, and works on power-efficient, embedded hardware. The designed algorithm is evaluated on a dataset collected in an indoor warehouse environment, and initial results show the feasibility for robust indoor aerial navigation.

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

Invisible for both Camera and LiDAR: Security of Multi-Sensor Fusion based Perception in Autonomous Driving Under Physical-World Attacks

  • Authors: Yulong Cao*, Ningfei Wang*, Chaowei Xiao*, Dawei Yang*, Jin Fang, Ruigang Yang, Qi Alfred Chen, Mingyan Liu, Bo Li (*co-first authors)
  • Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2106.09249
  • Pdf link: https://arxiv.org/pdf/2106.09249
  • Abstract
    In Autonomous Driving (AD) systems, perception is both security and safety critical. Despite various prior studies on its security issues, all of them only consider attacks on camera- or LiDAR-based AD perception alone. However, production AD systems today predominantly adopt a Multi-Sensor Fusion (MSF) based design, which in principle can be more robust against these attacks under the assumption that not all fusion sources are (or can be) attacked at the same time. In this paper, we present the first study of security issues of MSF-based perception in AD systems. We directly challenge the basic MSF design assumption above by exploring the possibility of attacking all fusion sources simultaneously. This allows us for the first time to understand how much security guarantee MSF can fundamentally provide as a general defense strategy for AD perception. We formulate the attack as an optimization problem to generate a physically-realizable, adversarial 3D-printed object that misleads an AD system to fail in detecting it and thus crash into it. We propose a novel attack pipeline that addresses two main design challenges: (1) non-differentiable target camera and LiDAR sensing systems, and (2) non-differentiable cell-level aggregated features popularly used in LiDAR-based AD perception. We evaluate our attack on MSF included in representative open-source industry-grade AD systems in real-world driving scenarios. Our results show that the attack achieves over 90% success rate across different object types and MSF. Our attack is also found stealthy, robust to victim positions, transferable across MSF algorithms, and physical-world realizable after being 3D-printed and captured by LiDAR and camera devices. To concretely assess the end-to-end safety impact, we further perform simulation evaluation and show that it can cause a 100% vehicle collision rate for an industry-grade AD system.

AttDLNet: Attention-based DL Network for 3D LiDAR Place Recognition

  • Authors: Tiago Barros, Luís Garrote, Ricardo Pereira, Cristiano Premebida, Urbano J. Nunes
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.09637
  • Pdf link: https://arxiv.org/pdf/2106.09637
  • Abstract
    Deep networks have been progressively adapted to new sensor modalities, namely to 3D LiDAR, which led to unprecedented achievements in autonomous vehicle-related applications such as place recognition. One of the main challenges of deep models in place recognition is to extract efficient and descriptive feature representations that relate places based on their similarity. To address the problem of place recognition using LiDAR data, this paper proposes a novel 3D LiDAR-based deep learning network (named AttDLNet) that comprises an encoder network and exploits an attention mechanism to selectively focus on long-range context and interfeature relationships. The proposed network is trained and validated on the KITTI dataset, using the cosine loss for training and a retrieval-based place recognition pipeline for validation. Additionally, an ablation study is presented to assess the best network configuration. Results show that the encoder network features are already very descriptive, but adding attention to the network further improves performance. From the ablation study, results indicate that the middle encoder layers have the highest mean performance, while deeper layers are more robust to orientation change. The code is publicly available on the project website: https://github.com/Cybonic/AttDLNet
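
A minimal sketch of putting self-attention on top of encoder features for place recognition is given below: per-segment feature vectors attend to each other and are pooled into a normalized global descriptor. Dimensions and the pooling choice are assumptions; this is not the AttDLNet architecture.

```python
# Self-attention over encoder features, pooled into a global place descriptor.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, feature_dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feature_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feature_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, n_segments, feature_dim) from a point-cloud encoder
        attended, _ = self.attn(features, features, features)
        attended = self.norm(features + attended)        # residual connection
        descriptor = attended.mean(dim=1)                # global descriptor per scan
        return nn.functional.normalize(descriptor, dim=1)

features = torch.rand(2, 64, 256)   # placeholder encoder output for 2 LiDAR scans
print(AttentionPooling()(features).shape)  # torch.Size([2, 256])
```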

Keyword: loop detection

There is no result

Keyword: autonomous driving

Planning on a (Risk) Budget: Safe Non-Conservative Planning in Probabilistic Dynamic Environments

  • Authors: Hung-Jui Huang, Kai-Chi Huang, Michal Čáp, Yibiao Zhao, Ying Nian Wu, Chris L. Baker
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.09127
  • Pdf link: https://arxiv.org/pdf/2106.09127
  • Abstract
    Planning in environments with other agents whose future actions are uncertain often requires compromise between safety and performance. Here our goal is to design efficient planning algorithms with guaranteed bounds on the probability of safety violation, which nonetheless achieve non-conservative performance. To quantify a system's risk, we define a natural criterion called interval risk bounds (IRBs), which provide a parametric upper bound on the probability of safety violation over a given time interval or task. We present a novel receding horizon algorithm, and prove that it can satisfy a desired IRB. Our algorithm maintains a dynamic risk budget which constrains the allowable risk at each iteration, and guarantees recursive feasibility by requiring a safe set to be reachable by a contingency plan within the budget. We empirically demonstrate that our algorithm is both safer and less conservative than strong baselines in two simulated autonomous driving experiments in scenarios involving collision avoidance with other vehicles, and additionally demonstrate our algorithm running on an autonomous class 8 truck.

Invisible for both Camera and LiDAR: Security of Multi-Sensor Fusion based Perception in Autonomous Driving Under Physical-World Attacks

  • Authors: Yulong Cao*, Ningfei Wang*, Chaowei Xiao*, Dawei Yang*, Jin Fang, Ruigang Yang, Qi Alfred Chen, Mingyan Liu, Bo Li (*co-first authors)
  • Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2106.09249
  • Pdf link: https://arxiv.org/pdf/2106.09249
  • Abstract
    In Autonomous Driving (AD) systems, perception is both security and safety critical. Despite various prior studies on its security issues, all of them only consider attacks on camera- or LiDAR-based AD perception alone. However, production AD systems today predominantly adopt a Multi-Sensor Fusion (MSF) based design, which in principle can be more robust against these attacks under the assumption that not all fusion sources are (or can be) attacked at the same time. In this paper, we present the first study of security issues of MSF-based perception in AD systems. We directly challenge the basic MSF design assumption above by exploring the possibility of attacking all fusion sources simultaneously. This allows us for the first time to understand how much security guarantee MSF can fundamentally provide as a general defense strategy for AD perception. We formulate the attack as an optimization problem to generate a physically-realizable, adversarial 3D-printed object that misleads an AD system to fail in detecting it and thus crash into it. We propose a novel attack pipeline that addresses two main design challenges: (1) non-differentiable target camera and LiDAR sensing systems, and (2) non-differentiable cell-level aggregated features popularly used in LiDAR-based AD perception. We evaluate our attack on MSF included in representative open-source industry-grade AD systems in real-world driving scenarios. Our results show that the attack achieves over 90% success rate across different object types and MSF. Our attack is also found stealthy, robust to victim positions, transferable across MSF algorithms, and physical-world realizable after being 3D-printed and captured by LiDAR and camera devices. To concretely assess the end-to-end safety impact, we further perform simulation evaluation and show that it can cause a 100% vehicle collision rate for an industry-grade AD system.

Design of a prototypical platform for autonomous and connected vehicles

  • Authors: Stefano Arrigoni, Simone Mentasti, Federico Cheli, Matteo Matteucci, Francesco Braghin
  • Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2106.09307
  • Pdf link: https://arxiv.org/pdf/2106.09307
  • Abstract
    Self-driving technology is expected to revolutionize different sectors and is seen as the natural evolution of road vehicles. In the last years, real-world validation of designed and virtually tested solutions is growing in importance since simulated environments will never fully replicate all the aspects that can affect results in the real world. To this end, this paper presents our prototype platform for experimental research on connected and autonomous driving projects. In detail, the paper presents the overall architecture of the vehicle focusing both on mechanical aspects related to remote actuation and sensors set-up and software aspects by means of a comprehensive description of the main algorithms required for autonomous driving as ego-localization, environment perception, motion planning, and actuation. Finally, experimental tests conducted in an urban-like environment are reported to validate and assess the performances of the overall system.

KIT Bus: A Shuttle Model for CARLA Simulator

  • Authors: Yusheng Xiang, Shuo Wang, Tianqing Su, Jun Li, Samuel S. Mao, Marcus Geimer
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.09508
  • Pdf link: https://arxiv.org/pdf/2106.09508
  • Abstract
    With the continuous development of science and technology, self-driving vehicles will surely change the nature of transportation and realize the automotive industry's transformation in the future. Compared with self-driving cars, self-driving buses are more efficient in carrying passengers and more environmentally friendly in terms of energy consumption. Therefore, it is speculated that in the future, self-driving buses will become more and more important. As a simulator for autonomous driving research, the CARLA simulator can help people accumulate experience in autonomous driving technology faster and safer. However, a shortcoming is that there is no modern bus model in the CARLA simulator. Consequently, people cannot simulate autonomous driving on buses or the scenarios interacting with buses. Therefore, we built a bus model in 3ds Max software and imported it into the CARLA to fill this gap. Our model, namely KIT bus, is proven to work in the CARLA by testing it with the autopilot simulation. The video demo is shown on our Youtube.

SECANT: Self-Expert Cloning for Zero-Shot Generalization of Visual Policies

  • Authors: Linxi Fan, Guanzhi Wang, De-An Huang, Zhiding Yu, Li Fei-Fei, Yuke Zhu, Anima Anandkumar
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.09678
  • Pdf link: https://arxiv.org/pdf/2106.09678
  • Abstract
    Generalization has been a long-standing challenge for reinforcement learning (RL). Visual RL, in particular, can be easily distracted by irrelevant factors in high-dimensional observation space. In this work, we consider robust policy learning which targets zero-shot generalization to unseen visual environments with large distributional shift. We propose SECANT, a novel self-expert cloning technique that leverages image augmentation in two stages to decouple robust representation learning from policy optimization. Specifically, an expert policy is first trained by RL from scratch with weak augmentations. A student network then learns to mimic the expert policy by supervised learning with strong augmentations, making its representation more robust against visual variations compared to the expert. Extensive experiments demonstrate that SECANT significantly advances the state of the art in zero-shot generalization across 4 challenging domains. Our average reward improvements over prior SOTAs are: DeepMind Control (+26.5%), robotic manipulation (+337.8%), vision-based autonomous driving (+47.7%), and indoor object navigation (+15.8%). Code release and video are available at https://linxifan.github.io/secant-site/.

Simulation study on the fleet performance of shared autonomous bicycles

  • Authors: Naroa Coretti Sánchez, Iñigo Martinez, Luis Alonso Pastor, Kent Larson
  • Subjects: Computers and Society (cs.CY); Multiagent Systems (cs.MA)
  • Arxiv link: https://arxiv.org/abs/2106.09694
  • Pdf link: https://arxiv.org/pdf/2106.09694
  • Abstract
    Rethinking cities is now more imperative than ever, as society is facing challenges such as population growth and climate change. The design of cities can not be abstracted from the design of its mobility system, and, therefore, efficient solutions must be found to transport people and goods throughout the city in an ecological way. An autonomous bicycle-sharing system that combines the benefits of vehicle sharing, electrification, autonomy, and micro-mobility could increase the efficiency and convenience of bicycle-sharing systems incentivizing more people to bike and enjoy their cities in an environmentally friendly way. Due to the uniqueness and radical novelty of introducing autonomous driving technology into bicycle-sharing systems and the inherent complexity of these systems, there is a need to quantify the potential impact of autonomy on fleet performance and user experience. This paper presents an ad-hoc agent-based, discrete event simulator that provides an in-depth understanding of the fleet behavior of autonomous bicycle-sharing systems in the most realistic possible scenarios, including a rebalancing system based on demand prediction. In addition, this work quantifies the extent to which an autonomous system would outperform current bicycle-sharing schemes and describes the impact of different parameters on system efficiency and service quality. This research shows that with a fleet size three and a half times smaller than a station-based system and eight times smaller than a dockless system, an autonomous system can provide overall improved performance and user experience even with no rebalancing. These findings indicate that the remarkable efficiency of an autonomous bicycle-sharing system could compensate for the additional cost of autonomous bicycles.

New submissions for Mon, 29 Nov 21

Keyword: SLAM

There is no result

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

Online Adaptation for Implicit Object Tracking and Shape Reconstruction in the Wild

  • Authors: Jianglong Ye, Yuntao Chen, Naiyan Wang, Xiaolong Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.12728
  • Pdf link: https://arxiv.org/pdf/2111.12728
  • Abstract
    Tracking and reconstructing 3D objects from cluttered scenes are the key components for computer vision, robotics and autonomous driving systems. While recent progress in implicit function (e.g., DeepSDF) has shown encouraging results on high-quality 3D shape reconstruction, it is still very challenging to generalize to cluttered and partially observable LiDAR data. In this paper, we propose to leverage the continuity in video data. We introduce a novel and unified framework which utilizes a DeepSDF model to simultaneously track and reconstruct 3D objects in the wild. We online adapt the DeepSDF model in the video, iteratively improving the shape reconstruction while in return improving the tracking, and vice versa. We experiment with both Waymo and KITTI datasets, and show significant improvements over state-of-the-art methods for both tracking and shape reconstruction.
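
A toy sketch of online adaptation of an implicit shape model is shown below: a small MLP mapping 3D points to signed distance is fine-tuned each frame so that observed surface points lie on its zero level set. This stands in for the DeepSDF-based tracking-and-reconstruction loop only at the level of the general idea.

```python
# Online fine-tuning of a small signed-distance MLP on per-frame surface observations.
import torch
import torch.nn as nn

sdf = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                    nn.Linear(128, 128), nn.ReLU(),
                    nn.Linear(128, 1))
optimizer = torch.optim.Adam(sdf.parameters(), lr=1e-3)

def adapt_on_frame(surface_points: torch.Tensor, steps: int = 20) -> float:
    """Fine-tune the SDF so observed surface points lie on its zero level set."""
    for _ in range(steps):
        optimizer.zero_grad()
        loss = sdf(surface_points).abs().mean()   # |SDF| should be 0 on the surface
        loss.backward()
        optimizer.step()
    return float(loss)

# Simulated stream of partial observations of a unit sphere
for frame in range(3):
    pts = torch.randn(512, 3)
    pts = pts / pts.norm(dim=1, keepdim=True)     # points on the unit sphere
    print(f"frame {frame}: loss {adapt_on_frame(pts):.4f}")
```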

Notebook-as-a-VRE (NaaVRE): from private notebooks to a collaborative cloud virtual research environment

  • Authors: Zhiming Zhao, Spiros Koulouzis, Riccardo Bianchi, Siamak Farshidi, Zeshun Shi, Ruyue Xin, Yuandou Wang, Na Li, Yifang Shi, Joris Timmermans, W. Daniel Kissling
  • Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
  • Arxiv link: https://arxiv.org/abs/2111.12785
  • Pdf link: https://arxiv.org/pdf/2111.12785
  • Abstract
    Virtual Research Environments (VREs) provide user-centric support in the lifecycle of research activities, e.g., discovering and accessing research assets, or composing and executing application workflows. A typical VRE is often implemented as an integrated environment, which includes a catalog of research assets, a workflow management system, a data management framework, and tools for enabling collaboration among users. Notebook environments, such as Jupyter, allow researchers to rapidly prototype scientific code and share their experiments as online accessible notebooks. Jupyter can support several popular languages that are used by data scientists, such as Python, R, and Julia. However, such notebook environments do not have seamless support for running heavy computations on remote infrastructure or finding and accessing software code inside notebooks. This paper investigates the gap between a notebook environment and a VRE and proposes an embedded VRE solution for the Jupyter environment called Notebook-as-a-VRE (NaaVRE). The NaaVRE solution provides functional components via a component marketplace and allows users to create a customized VRE on top of the Jupyter environment. From the VRE, a user can search research assets (data, software, and algorithms), compose workflows, manage the lifecycle of an experiment, and share the results among users in the community. We demonstrate how such a solution can enhance a legacy workflow that uses Light Detection and Ranging (LiDAR) data from country-wide airborne laser scanning surveys for deriving geospatial data products of ecosystem structure at high resolution over broad spatial extents. This enables users to scale out the processing of multi-terabyte LiDAR point clouds for ecological applications to more data sources in a distributed cloud environment.

Keyword: loop detection

There is no result

Keyword: autonomous driving

Online Adaptation for Implicit Object Tracking and Shape Reconstruction in the Wild

  • Authors: Jianglong Ye, Yuntao Chen, Naiyan Wang, Xiaolong Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.12728
  • Pdf link: https://arxiv.org/pdf/2111.12728
  • Abstract
    Tracking and reconstructing 3D objects from cluttered scenes are the key components for computer vision, robotics and autonomous driving systems. While recent progress in implicit function (e.g., DeepSDF) has shown encouraging results on high-quality 3D shape reconstruction, it is still very challenging to generalize to cluttered and partially observable LiDAR data. In this paper, we propose to leverage the continuity in video data. We introduce a novel and unified framework which utilizes a DeepSDF model to simultaneously track and reconstruct 3D objects in the wild. We online adapt the DeepSDF model in the video, iteratively improving the shape reconstruction while in return improving the tracking, and vice versa. We experiment with both Waymo and KITTI datasets, and show significant improvements over state-of-the-art methods for both tracking and shape reconstruction.

Uncertainty Aware Proposal Segmentation for Unknown Object Detection

  • Authors: Yimeng Li, Jana Kosecka
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.12866
  • Pdf link: https://arxiv.org/pdf/2111.12866
  • Abstract
    Recent efforts in deploying Deep Neural Networks for object detection in real world applications, such as autonomous driving, assume that all relevant object classes have been observed during training. Quantifying the performance of these models in settings when the test data is not represented in the training set has mostly focused on pixel-level uncertainty estimation techniques of models trained for semantic segmentation. This paper proposes to exploit additional predictions of semantic segmentation models and quantifying its confidences, followed by classification of object hypotheses as known vs. unknown, out of distribution objects. We use object proposals generated by Region Proposal Network (RPN) and adapt distance aware uncertainty estimation of semantic segmentation using Radial Basis Functions Networks (RBFN) for class agnostic object mask prediction. The augmented object proposals are then used to train a classifier for known vs. unknown objects categories. Experimental results demonstrate that the proposed method achieves parallel performance to state of the art methods for unknown object detection and can also be used effectively for reducing object detectors' false positive rate. Our method is well suited for applications where prediction of non-object background categories obtained by semantic segmentation is reliable.

MegLoc: A Robust and Accurate Visual Localization Pipeline

  • Authors: Shuxue Peng, Zihang He, Haotian Zhang, Ran Yan, Chuting Wang, Qingtian Zhu, Xiao Liu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.13063
  • Pdf link: https://arxiv.org/pdf/2111.13063
  • Abstract
    In this paper, we present a visual localization pipeline, namely MegLoc, for robust and accurate 6-DoF pose estimation under varying scenarios, including indoor and outdoor scenes, different time across a day, different seasons across a year, and even across years. MegLoc achieves state-of-the-art results on a range of challenging datasets, including winning the Outdoor and Indoor Visual Localization Challenge of ICCV 2021 Workshop on Long-term Visual Localization under Changing Conditions, as well as the Re-localization Challenge for Autonomous Driving of ICCV 2021 Workshop on Map-based Localization for Autonomous Driving.

Medial Spectral Coordinates for 3D Shape Analysis

  • Authors: Morteza Rezanejad, Mohammad Khodadad, Hamidreza Mahyar, Herve Lombaert, Michael Gruninger, Dirk B. Walther, Kaleem Siddiqi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.13295
  • Pdf link: https://arxiv.org/pdf/2111.13295
  • Abstract
    In recent years there has been a resurgence of interest in our community in the shape analysis of 3D objects represented by surface meshes, their voxelized interiors, or surface point clouds. In part, this interest has been stimulated by the increased availability of RGBD cameras, and by applications of computer vision to autonomous driving, medical imaging, and robotics. In these settings, spectral coordinates have shown promise for shape representation due to their ability to incorporate both local and global shape properties in a manner that is qualitatively invariant to isometric transformations. Yet, surprisingly, such coordinates have thus far typically considered only local surface positional or derivative information. In the present article, we propose to equip spectral coordinates with medial (object width) information, so as to enrich them. The key idea is to couple surface points that share a medial ball, via the weights of the adjacency matrix. We develop a spectral feature using this idea, and the algorithms to compute it. The incorporation of object width and medial coupling has direct benefits, as illustrated by our experiments on object classification, object part segmentation, and surface point correspondence.

Keyword: mapping

Exploiting Both Domain-specific and Invariant Knowledge via a Win-win Transformer for Unsupervised Domain Adaptation

  • Authors: Wenxuan Ma, Jinming Zhang, Shuang Li, Chi Harold Liu, Yulin Wang, Wei Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.12941
  • Pdf link: https://arxiv.org/pdf/2111.12941
  • Abstract
    Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Most existing UDA approaches enable knowledge transfer via learning domain-invariant representation and sharing one classifier across two domains. However, ignoring the domain-specific information that is related to the task, and forcing a unified classifier to fit both domains will limit the feature expressiveness in each domain. In this paper, by observing that the Transformer architecture with comparable parameters can generate more transferable representations than CNN counterparts, we propose a Win-Win TRansformer framework (WinTR) that separately explores the domain-specific knowledge for each domain and meanwhile interchanges cross-domain knowledge. Specifically, we learn two different mappings using two individual classification tokens in the Transformer, and design for each one a domain-specific classifier. The cross-domain knowledge is transferred via source guided label refinement and single-sided feature alignment with respect to source or target, which keeps the integrity of domain-specific information. Extensive experiments on three benchmark datasets show that our method outperforms the state-of-the-art UDA methods, validating the effectiveness of exploiting both domain-specific and invariant knowledge.

Rotation Equivariant 3D Hand Mesh Generation from a Single RGB Image

  • Authors: Joshua Mitton, Chaitanya Kaul, Roderick Murray-Smith
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.13023
  • Pdf link: https://arxiv.org/pdf/2111.13023
  • Abstract
    We develop a rotation equivariant model for generating 3D hand meshes from 2D RGB images. This guarantees that as the input image of a hand is rotated the generated mesh undergoes a corresponding rotation. Furthermore, this removes undesirable deformations in the meshes often generated by methods without rotation equivariance. By building a rotation equivariant model, through considering symmetries in the problem, we reduce the need for training on very large datasets to achieve good mesh reconstruction. The encoder takes images defined on $\mathbb{Z}^{2}$ and maps these to latent functions defined on the group $C_{8}$. We introduce a novel vector mapping function to map the function defined on $C_{8}$ to a latent point cloud space defined on the group $\mathrm{SO}(2)$. Further, we introduce a 3D projection function that learns a 3D function from the $\mathrm{SO}(2)$ latent space. Finally, we use an $\mathrm{SO}(3)$ equivariant decoder to ensure rotation equivariance. Our rotation equivariant model outperforms state-of-the-art methods on a real-world dataset and we demonstrate that it accurately captures the shape and pose in the generated meshes under rotation of the input hand.

Quasi-Isometric Graph-Simplifications

  • Authors: Bakhadyr Khoussainov, Khí-Uí Soo
  • Subjects: Data Structures and Algorithms (cs.DS)
  • Arxiv link: https://arxiv.org/abs/2111.13238
  • Pdf link: https://arxiv.org/pdf/2111.13238
  • Abstract
    We propose a general framework based on quasi-isometries to study graph simplifications. Quasi-isometries are mappings on metric spaces that preserve the distance functions within an additive and a multiplicative constant. We use them to measure the distance distortion between the original graph and the simplified graph. We also introduce a novel concept called the centre-shift, which quantifies how much a graph simplification affects the location of the graph centre. Given a quasi-isometry, we establish a weak upper bound on the centre-shift. We present methods to construct so-called partition-graphs, which are quasi-isometric graph simplifications. Furthermore, in terms of the centre-shift, we show that partition-graphs constructed in a certain way preserve the centres of trees. Finally, we also show that by storing extra numerical information, partition-graphs preserve the median of trees.
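
For readers who want the precise notion being used, the standard textbook definition of a quasi-isometry is sketched below; it is not quoted from the paper, and the constants A, B, C simply play the roles of the multiplicative and additive distortion bounds mentioned in the abstract.

```latex
% Standard definition (textbook form, not taken from the paper):
% a map f : (X, d_X) \to (Y, d_Y) is an (A, B)-quasi-isometric embedding if, for all x_1, x_2 \in X,
\frac{1}{A}\, d_X(x_1, x_2) - B \;\le\; d_Y\bigl(f(x_1), f(x_2)\bigr) \;\le\; A\, d_X(x_1, x_2) + B
% and a quasi-isometry if, in addition, every point of Y lies within some fixed distance C of the image f(X).
```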

Ensembling of Distilled Models from Multi-task Teachers for Constrained Resource Language Pairs

  • Authors: Amr Hendy, Esraa A. Gad, Mohamed Abdelghaffar, Jailan S. ElMosalami, Mohamed Afify, Ahmed Y. Tawfik, Hany Hassan Awadalla
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2111.13284
  • Pdf link: https://arxiv.org/pdf/2111.13284
  • Abstract
    This paper describes our submission to the constrained track of WMT21 shared news translation task. We focus on the three relatively low resource language pairs Bengali to and from Hindi, English to and from Hausa, and Xhosa to and from Zulu. To overcome the limitation of relatively low parallel data we train a multilingual model using a multitask objective employing both parallel and monolingual data. In addition, we augment the data using back translation. We also train a bilingual model incorporating back translation and knowledge distillation then combine the two models using sequence-to-sequence mapping. We see around 70% relative gain in BLEU point for English to and from Hausa, and around 25% relative improvements for both Bengali to and from Hindi, and Xhosa to and from Zulu compared to bilingual baselines.

NeRF in the Dark: High Dynamic Range View Synthesis from Noisy Raw Images

  • Authors: Ben Mildenhall, Peter Hedman, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T. Barron
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2111.13679
  • Pdf link: https://arxiv.org/pdf/2111.13679
  • Abstract
    Neural Radiance Fields (NeRF) is a technique for high quality novel view synthesis from a collection of posed input images. Like most view synthesis methods, NeRF uses tonemapped low dynamic range (LDR) as input; these images have been processed by a lossy camera pipeline that smooths detail, clips highlights, and distorts the simple noise distribution of raw sensor data. We modify NeRF to instead train directly on linear raw images, preserving the scene's full dynamic range. By rendering raw output images from the resulting NeRF, we can perform novel high dynamic range (HDR) view synthesis tasks. In addition to changing the camera viewpoint, we can manipulate focus, exposure, and tonemapping after the fact. Although a single raw image appears significantly more noisy than a postprocessed one, we show that NeRF is highly robust to the zero-mean distribution of raw noise. When optimized over many noisy raw inputs (25-200), NeRF produces a scene representation so accurate that its rendered novel views outperform dedicated single and multi-image deep raw denoisers run on the same wide baseline input images. As a result, our method, which we call RawNeRF, can reconstruct scenes from extremely noisy images captured in near-darkness.
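
As a side note on why training on linear raw values matters here: a minimal exposure-plus-gamma sketch (a generic tonemapping curve, not the paper's camera pipeline; all constants are illustrative) shows that highlight clipping in LDR destroys information that a linear, raw-space representation keeps available for later re-exposure.

```python
import numpy as np

def simple_tonemap(linear_rgb: np.ndarray, exposure: float = 1.0, gamma: float = 2.2) -> np.ndarray:
    """Map linear (raw-like) radiance to a displayable LDR value via exposure, clipping and gamma.

    Generic illustration only; not the processing used by RawNeRF."""
    scaled = linear_rgb * exposure          # exposure adjustment in linear space
    clipped = np.clip(scaled, 0.0, 1.0)     # highlight clipping: information is lost here
    return clipped ** (1.0 / gamma)         # gamma compression for display

hdr = np.array([5.0, 0.001])                # a bright highlight and a near-black region, in linear radiance
print(simple_tonemap(hdr, exposure=1.0))    # the highlight clips to 1.0, so its detail is gone in LDR
print(simple_tonemap(hdr, exposure=50.0))   # re-exposing the linear values instead recovers the dark region
```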

Keyword: localization

Joint stereo 3D object detection and implicit surface reconstruction

  • Authors: Shichao Li, Kwang-Ting Cheng
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.12924
  • Pdf link: https://arxiv.org/pdf/2111.12924
  • Abstract
    We present the first learning-based framework for category-level 3D object detection and implicit shape estimation based on a pair of stereo RGB images in the wild. Traditional stereo 3D object detection approaches describe the detected objects only with 3D bounding boxes and cannot infer their full surface geometry, which makes creating a realistic outdoor immersive experience difficult. In contrast, we propose a new model S-3D-RCNN that can perform precise localization as well as provide a complete and resolution-agnostic shape description for the detected objects. We first decouple the estimation of object coordinate systems from shape reconstruction using a global-local framework. We then propose a new instance-level network that addresses the unseen surface hallucination problem by extracting point-based representations from stereo region-of-interests, and infers implicit shape codes with predicted complete surface geometry. Extensive experiments validate our approach's superior performance using existing and new metrics on the KITTI benchmark. Code and pre-trained models will be available at this https URL.

MegLoc: A Robust and Accurate Visual Localization Pipeline

  • Authors: Shuxue Peng, Zihang He, Haotian Zhang, Ran Yan, Chuting Wang, Qingtian Zhu, Xiao Liu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.13063
  • Pdf link: https://arxiv.org/pdf/2111.13063
  • Abstract
    In this paper, we present a visual localization pipeline, namely MegLoc, for robust and accurate 6-DoF pose estimation under varying scenarios, including indoor and outdoor scenes, different time across a day, different seasons across a year, and even across years. MegLoc achieves state-of-the-art results on a range of challenging datasets, including winning the Outdoor and Indoor Visual Localization Challenge of ICCV 2021 Workshop on Long-term Visual Localization under Changing Conditions, as well as the Re-localization Challenge for Autonomous Driving of ICCV 2021 Workshop on Map-based Localization for Autonomous Driving.

ArchRepair: Block-Level Architecture-Oriented Repairing for Deep Neural Networks

  • Authors: Hua Qi, Zhijie Wang, Qing Guo, Jianlang Chen, Felix Juefei-Xu, Lei Ma, Jianjun Zhao
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.13330
  • Pdf link: https://arxiv.org/pdf/2111.13330
  • Abstract
    Over the past few years, deep neural networks (DNNs) have achieved tremendous success and have been continuously applied in many application domains. However, during the practical deployment in the industrial tasks, DNNs are found to be erroneous-prone due to various reasons such as overfitting, lacking robustness to real-world corruptions during practical usage. To address these challenges, many recent attempts have been made to repair DNNs for version updates under practical operational contexts by updating weights (i.e., network parameters) through retraining, fine-tuning, or direct weight fixing at a neural level. In this work, as the first attempt, we initiate to repair DNNs by jointly optimizing the architecture and weights at a higher (i.e., block) level. We first perform empirical studies to investigate the limitation of whole network-level and layer-level repairing, which motivates us to explore a novel repairing direction for DNN repair at the block level. To this end, we first propose adversarial-aware spectrum analysis for vulnerable block localization that considers the neurons' status and weights' gradients in blocks during the forward and backward processes, which enables more accurate candidate block localization for repairing even under a few examples. Then, we further propose the architecture-oriented search-based repairing that relaxes the targeted block to a continuous repairing search space at higher deep feature levels. By jointly optimizing the architecture and weights in that space, we can identify a much better block architecture. We implement our proposed repairing techniques as a tool, named ArchRepair, and conduct extensive experiments to validate the proposed method. The results show that our method can not only repair but also enhance accuracy & robustness, outperforming the state-of-the-art DNN repair techniques.

Geometric Multimodal Deep Learning with Multi-Scaled Graph Wavelet Convolutional Network

  • Authors: Maysam Behmanesh, Peyman Adibi, Mohammad Saeed Ehsani, Jocelyn Chanussot
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.13361
  • Pdf link: https://arxiv.org/pdf/2111.13361
  • Abstract
    Multimodal data provide complementary information of a natural phenomenon by integrating data from various domains with very different statistical properties. Capturing the intra-modality and cross-modality information of multimodal data is the essential capability of multimodal learning methods. The geometry-aware data analysis approaches provide these capabilities by implicitly representing data in various modalities based on their geometric underlying structures. Also, in many applications, data are explicitly defined on an intrinsic geometric structure. Generalizing deep learning methods to the non-Euclidean domains is an emerging research field, which has recently been investigated in many studies. Most of those popular methods are developed for unimodal data. In this paper, a multimodal multi-scaled graph wavelet convolutional network (M-GWCN) is proposed as an end-to-end network. M-GWCN simultaneously finds intra-modality representation by applying the multiscale graph wavelet transform to provide helpful localization properties in the graph domain of each modality, and cross-modality representation by learning permutations that encode correlations among various modalities. M-GWCN is not limited to either the homogeneous modalities with the same number of data, or any prior knowledge indicating correspondences between modalities. Several semi-supervised node classification experiments have been conducted on three popular unimodal explicit graph-based datasets and five multimodal implicit ones. The experimental results indicate the superiority and effectiveness of the proposed methods compared with both spectral graph domain convolutional neural networks and state-of-the-art multimodal methods.

TDAN: Top-Down Attention Networks for Enhanced Feature Selectivity in CNNs

  • Authors: Shantanu Jaiswal, Basura Fernando, Cheston Tan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.13470
  • Pdf link: https://arxiv.org/pdf/2111.13470
  • Abstract
    Attention modules for Convolutional Neural Networks (CNNs) are an effective method to enhance performance of networks on multiple computer-vision tasks. While many works focus on building more effective modules through appropriate modelling of channel-, spatial- and self-attention, they primarily operate in a feedforward manner. Consequently, the attention mechanism strongly depends on the representational capacity of a single input feature activation, and can benefit from incorporation of semantically richer higher-level activations that can specify "what and where to look" through top-down information flow. Such feedback connections are also prevalent in the primate visual cortex and recognized by neuroscientists as a key component in primate visual attention. Accordingly, in this work, we propose a lightweight top-down (TD) attention module that iteratively generates a "visual searchlight" to perform top-down channel and spatial modulation of its inputs and consequently outputs more selective feature activations at each computation step. Our experiments indicate that integrating TD in CNNs enhances their performance on ImageNet-1k classification and outperforms prominent attention modules while being more parameter and memory efficient. Further, our models are more robust to changes in input resolution during inference and learn to "shift attention" by localizing individual objects or features at each computation step without any explicit supervision. This capability results in 5% improvement for ResNet50 on weakly-supervised object localization besides improvements in fine-grained and multi-label classification.

Semi-supervised t-SNE for Millimeter-wave Wireless Localization

  • Authors: Junquan Deng, Wei Shi, Jian Hu, Xianlong Jiao
  • Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2111.13573
  • Pdf link: https://arxiv.org/pdf/2111.13573
  • Abstract
    We consider the mobile localization problem in future millimeter-wave wireless networks with distributed Base Stations (BSs) based on multi-antenna channel state information (CSI). For this problem, we propose a Semi-supervised t-distributed Stochastic Neighbor Embedding (St-SNE) algorithm to directly embed the high-dimensional CSI samples into the 2D geographical map. We evaluate the performance of St-SNE in a simulated urban outdoor millimeter-wave radio access network. Our results show that St-SNE achieves a mean localization error of 6.8 m with only 5% of labeled CSI samples in a 200*200 m^2 area with a ray-tracing channel model. St-SNE does not require accurate synchronization among multiple BSs, and is promising for future large-scale millimeter-wave localization.
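
St-SNE itself is not a standard library routine, so the sketch below only runs the plain unsupervised t-SNE baseline from scikit-learn on a hypothetical CSI feature matrix, to make the embedding step concrete; the paper's contribution, anchoring the roughly 5% labeled samples to their known map coordinates during the embedding, is not implemented here.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical CSI features: one row per measurement, flattened multi-antenna CSI (sizes are made up).
rng = np.random.default_rng(0)
csi_features = rng.normal(size=(1000, 128))

# Plain unsupervised t-SNE: embed high-dimensional CSI samples into 2D.
embedding_2d = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(csi_features)
print(embedding_2d.shape)  # (1000, 2)
```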

Towards Low-Cost and Efficient Malaria Detection

  • Authors: Waqas Sultani, Wajahat Nawaz, Syed Javed, Muhammad Sohail Danish, Asma Saadia, Mohsen Ali
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.13656
  • Pdf link: https://arxiv.org/pdf/2111.13656
  • Abstract
    Malaria, a fatal but curable disease claims hundreds of thousands of lives every year. Early and correct diagnosis is vital to avoid health complexities, however, it depends upon the availability of costly microscopes and trained experts to analyze blood-smear slides. Deep learning-based methods have the potential to not only decrease the burden of experts but also improve diagnostic accuracy on low-cost microscopes. However, this is hampered by the absence of a reasonable size dataset. One of the most challenging aspects is the reluctance of the experts to annotate the dataset at low magnification on low-cost microscopes. We present a dataset to further the research on malaria microscopy over the low-cost microscopes at low magnification. Our large-scale dataset consists of images of blood-smear slides from several malaria-infected patients, collected through microscopes at two different cost spectrums and multiple magnifications. Malarial cells are annotated for the localization and life-stage classification task on the images collected through the high-cost microscope at high magnification. We design a mechanism to transfer these annotations from the high-cost microscope at high magnification to the low-cost microscope, at multiple magnifications. Multiple object detectors and domain adaptation methods are presented as the baselines. Furthermore, a partially supervised domain adaptation method is introduced to adapt the object-detector to work on the images collected from the low-cost microscope. The dataset will be made publicly available after publication.

New submissions for Wed, 1 Dec 21

Keyword: SLAM

There is no result

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

RailLoMer: Rail Vehicle Localization and Mapping with LiDAR-IMU-Odometer-GNSS Data Fusion

  • Authors: Yusheng Wang, Yidong Lou, Yi Zhang, Weiwei Song, Fei Huang, Zhiyong Tu, Shimin Zhang
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.15043
  • Pdf link: https://arxiv.org/pdf/2111.15043
  • Abstract
    We present RailLoMer in this article, to achieve real-time accurate and robust odometry and mapping for rail vehicles. RailLoMer receives measurements from two LiDARs, an IMU, train odometer, and a global navigation satellite system (GNSS) receiver. As frontend, the estimated motion from IMU/odometer preintegration de-skews the denoised point clouds and produces initial guess for frame-to-frame LiDAR odometry. As backend, a sliding window based factor graph is formulated to jointly optimize multi-modal information. In addition, we leverage the plane constraints from extracted rail tracks and the structure appearance descriptor to further improve the system robustness against repetitive structures. To ensure a globally-consistent and less blurry mapping result, we develop a two-stage mapping method that first performs scan-to-map in local scale, then utilizes the GNSS information to register the submaps. The proposed method is extensively evaluated on datasets gathered for a long time range over numerous scales and scenarios, and show that RailLoMer delivers decimeter-grade localization accuracy even in large or degenerated environments. We also integrate RailLoMer into an interactive train state and railway monitoring system prototype design, which has already been deployed to an experimental freight traffic railroad.

Aerial Images Meet Crowdsourced Trajectories: A New Approach to Robust Road Extraction

  • Authors: Lingbo Liu, Zewei Yang, Guanbin Li, Kuo Wang, Tianshui Chen, Liang Lin
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.15119
  • Pdf link: https://arxiv.org/pdf/2111.15119
  • Abstract
    Land remote sensing analysis is a crucial research in earth science. In this work, we focus on a challenging task of land analysis, i.e., automatic extraction of traffic roads from remote sensing data, which has widespread applications in urban development and expansion estimation. Nevertheless, conventional methods either only utilized the limited information of aerial images, or simply fused multimodal information (e.g., vehicle trajectories), thus cannot well recognize unconstrained roads. To facilitate this problem, we introduce a novel neural network framework termed Cross-Modal Message Propagation Network (CMMPNet), which fully benefits the complementary different modal data (i.e., aerial images and crowdsourced trajectories). Specifically, CMMPNet is composed of two deep Auto-Encoders for modality-specific representation learning and a tailor-designed Dual Enhancement Module for cross-modal representation refinement. In particular, the complementary information of each modality is comprehensively extracted and dynamically propagated to enhance the representation of another modality. Extensive experiments on three real-world benchmarks demonstrate the effectiveness of our CMMPNet for robust road extraction benefiting from blending different modal data, either using image and trajectory data or image and Lidar data. From the experimental results, we observe that the proposed approach outperforms current state-of-the-art methods by large margins.

NeeDrop: Self-supervised Shape Representation from Sparse Point Clouds using Needle Dropping

  • Authors: Alexandre Boulch, Pierre-Alain Langlois, Gilles Puy, Renaud Marlet
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.15207
  • Pdf link: https://arxiv.org/pdf/2111.15207
  • Abstract
    There has been recently a growing interest for implicit shape representations. Contrary to explicit representations, they have no resolution limitations and they easily deal with a wide variety of surface topologies. To learn these implicit representations, current approaches rely on a certain level of shape supervision (e.g., inside/outside information or distance-to-shape knowledge), or at least require a dense point cloud (to approximate well enough the distance-to-shape). In contrast, we introduce NeeDrop, a self-supervised method for learning shape representations from possibly extremely sparse point clouds. Like in Buffon's needle problem, we "drop" (sample) needles on the point cloud and consider that, statistically, close to the surface, the needle end points lie on opposite sides of the surface. No shape knowledge is required and the point cloud can be highly sparse, e.g., as lidar point clouds acquired by vehicles. Previous self-supervised shape representation approaches fail to produce good-quality results on this kind of data. We obtain quantitative results on par with existing supervised approaches on shape reconstruction datasets and show promising qualitative results on hard autonomous driving datasets such as KITTI.
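
A minimal sketch of the needle-dropping step described above, assuming Gaussian-random needle directions and an arbitrary needle length (neither choice is taken from the paper); the endpoints would then serve as weakly labeled pairs expected to fall on opposite sides of the surface.

```python
import numpy as np

def drop_needles(points: np.ndarray, needle_length: float = 0.05, rng=None):
    """Sample one short random segment ("needle") centered on each input point.

    Returns two (N, 3) endpoint arrays. Illustrative sketch of the sampling idea only,
    not the paper's exact sampling distribution or training loss."""
    if rng is None:
        rng = np.random.default_rng(0)
    directions = rng.normal(size=points.shape)
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)  # random unit directions
    half = 0.5 * needle_length * directions
    return points + half, points - half

sparse_cloud = np.random.default_rng(1).uniform(-1.0, 1.0, size=(256, 3))  # stand-in for a sparse lidar scan
ends_a, ends_b = drop_needles(sparse_cloud)
print(ends_a.shape, ends_b.shape)  # (256, 3) (256, 3)
```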

ConDA: Unsupervised Domain Adaptation for LiDAR Segmentation via Regularized Domain Concatenation

  • Authors: Lingdong Kong, Niamul Quader, Venice Erin Liong
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.15242
  • Pdf link: https://arxiv.org/pdf/2111.15242
  • Abstract
    Transferring knowledge learned from the labeled source domain to the raw target domain for unsupervised domain adaptation (UDA) is essential to the scalable deployment of an autonomous driving system. State-of-the-art approaches in UDA often employ a key concept: utilize joint supervision signals from both the source domain (with ground-truth) and the target domain (with pseudo-labels) for self-training. In this work, we improve and extend on this aspect. We present ConDA, a concatenation-based domain adaptation framework for LiDAR semantic segmentation that: (1) constructs an intermediate domain consisting of fine-grained interchange signals from both source and target domains without destabilizing the semantic coherency of objects and background around the ego-vehicle; and (2) utilizes the intermediate domain for self-training. Additionally, to improve both the network training on the source domain and self-training on the intermediate domain, we propose an anti-aliasing regularizer and an entropy aggregator to reduce the detrimental effects of aliasing artifacts and noisy target predictions. Through extensive experiments, we demonstrate that ConDA is significantly more effective in mitigating the domain gap compared to prior arts.

ARTSeg: Employing Attention for Thermal images Semantic Segmentation

  • Authors: Farzeen Munir, Shoaib Azam, Unse Fatima, Moongu Jeon
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.15257
  • Pdf link: https://arxiv.org/pdf/2111.15257
  • Abstract
    Research advancements have enabled neural network algorithms deployed in autonomous vehicles to perceive the surroundings. The standard exteroceptive sensors that are utilized for the perception of the environment are cameras and Lidar. Therefore, the neural network algorithms developed using these exteroceptive sensors have provided the necessary solution for the autonomous vehicle's perception. One major drawback of these exteroceptive sensors is their operability in adverse weather conditions, for instance, low illumination and night conditions. The usability and affordability of thermal cameras in the sensor suite of the autonomous vehicle provide the necessary improvement in the autonomous vehicle's perception in adverse weather conditions. The semantics of the environment benefits the robust perception, which can be achieved by segmenting different objects in the scene. In this work, we have employed the thermal camera for semantic segmentation. We have designed an attention-based Recurrent Convolution Network (RCNN) encoder-decoder architecture named ARTSeg for thermal semantic segmentation. The main contribution of this work is the design of the encoder-decoder architecture, which employs units of RCNN for each encoder and decoder block. Furthermore, additive attention is employed in the decoder module to retain high-resolution features and improve the localization of features. The efficacy of the proposed method is evaluated on the available public dataset, showing better performance than other state-of-the-art methods in mean intersection over union (IoU).

Semi-Local Convolutions for LiDAR Scan Processing

  • Authors: Larissa T. Triess, David Peter, J. Marius Zöllner
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.15615
  • Pdf link: https://arxiv.org/pdf/2111.15615
  • Abstract
    A number of applications, such as mobile robots or automated vehicles, use LiDAR sensors to obtain detailed information about their three-dimensional surroundings. Many methods use image-like projections to efficiently process these LiDAR measurements and use deep convolutional neural networks to predict semantic classes for each point in the scan. The spatial stationary assumption enables the usage of convolutions. However, LiDAR scans exhibit large differences in appearance over the vertical axis. Therefore, we propose semi-local convolution (SLC), a convolution layer with a reduced amount of weight-sharing along the vertical dimension. We are the first to investigate the usage of such a layer independent of any other model changes. Our experiments did not show any improvement over traditional convolution layers in terms of segmentation IoU or accuracy.
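
One plausible way to realize "reduced weight-sharing along the vertical dimension" is sketched below in PyTorch: the vertical axis of the range image is split into horizontal bands and each band gets its own convolution weights, while weights remain shared horizontally. Band count, kernel size and tensor shapes are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class SemiLocalConv2d(nn.Module):
    """Illustrative semi-local convolution: band-specific weights along the vertical axis."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, num_bands: int = 4):
        super().__init__()
        self.num_bands = num_bands
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
            for _ in range(num_bands)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bands = torch.chunk(x, self.num_bands, dim=2)                 # split along the height axis
        out = [conv(band) for conv, band in zip(self.convs, bands)]   # one convolution per band
        return torch.cat(out, dim=2)                                  # reassemble along the height axis

x = torch.randn(2, 5, 64, 2048)       # (batch, channels, vertical beams, horizontal angles)
layer = SemiLocalConv2d(5, 32)
print(layer(x).shape)                 # torch.Size([2, 32, 64, 2048])
```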

Attentive Prototypes for Source-free Unsupervised Domain Adaptive 3D Object Detection

  • Authors: Deepti Hegde, Vishal Patel
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.15656
  • Pdf link: https://arxiv.org/pdf/2111.15656
  • Abstract
    3D object detection networks tend to be biased towards the data they are trained on. Evaluation on datasets captured in different locations, conditions or sensors than that of the training (source) data results in a drop in model performance due to the gap in distribution with the test (or target) data. Current methods for domain adaptation either assume access to source data during training, which may not be available due to privacy or memory concerns, or require a sequence of lidar frames as an input. We propose a single-frame approach for source-free, unsupervised domain adaptation of lidar-based 3D object detectors that uses class prototypes to mitigate the effect of pseudo-label noise. Addressing the limitations of traditional feature aggregation methods for prototype computation in the presence of noisy labels, we utilize a transformer module to identify outlier ROIs that correspond to incorrect, over-confident annotations, and compute an attentive class prototype. Under an iterative training strategy, the losses associated with noisy pseudo labels are down-weighed and thus refined in the process of self-training. To validate the effectiveness of our proposed approach, we examine the domain shift associated with networks trained on large, label-rich datasets (such as the Waymo Open Dataset and nuScenes) and evaluate on smaller, label-poor datasets (such as KITTI) and vice-versa. We demonstrate our approach on two recent object detectors and achieve results that outperform other domain adaptation works.

Keyword: loop detection

There is no result

Keyword: autonomous driving

MultiPath++: Efficient Information Fusion and Trajectory Aggregation for Behavior Prediction

  • Authors: Balakrishnan Varadarajan, Ahmed Hefny, Avikalp Srivastava, Khaled S. Refaat, Nigamaa Nayakanti, Andre Cornman, Kan Chen, Bertrand Douillard, Chi Pang Lam, Dragomir Anguelov, Benjamin Sapp
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.14973
  • Pdf link: https://arxiv.org/pdf/2111.14973
  • Abstract
    Predicting the future behavior of road users is one of the most challenging and important problems in autonomous driving. Applying deep learning to this problem requires fusing heterogeneous world state in the form of rich perception signals and map information, and inferring highly multi-modal distributions over possible futures. In this paper, we present MultiPath++, a future prediction model that achieves state-of-the-art performance on popular benchmarks. MultiPath++ improves the MultiPath architecture by revisiting many design choices. The first key design difference is a departure from dense image-based encoding of the input world state in favor of a sparse encoding of heterogeneous scene elements: MultiPath++ consumes compact and efficient polylines to describe road features, and raw agent state information directly (e.g., position, velocity, acceleration). We propose a context-aware fusion of these elements and develop a reusable multi-context gating fusion component. Second, we reconsider the choice of pre-defined, static anchors, and develop a way to learn latent anchor embeddings end-to-end in the model. Lastly, we explore ensembling and output aggregation techniques -- common in other ML domains -- and find effective variants for our probabilistic multimodal output representation. We perform an extensive ablation on these design choices, and show that our proposed model achieves state-of-the-art performance on the Argoverse Motion Forecasting Competition and the Waymo Open Dataset Motion Prediction Challenge.

NeeDrop: Self-supervised Shape Representation from Sparse Point Clouds using Needle Dropping

  • Authors: Alexandre Boulch, Pierre-Alain Langlois, Gilles Puy, Renaud Marlet
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.15207
  • Pdf link: https://arxiv.org/pdf/2111.15207
  • Abstract
    There has been recently a growing interest for implicit shape representations. Contrary to explicit representations, they have no resolution limitations and they easily deal with a wide variety of surface topologies. To learn these implicit representations, current approaches rely on a certain level of shape supervision (e.g., inside/outside information or distance-to-shape knowledge), or at least require a dense point cloud (to approximate well enough the distance-to-shape). In contrast, we introduce NeeDrop, a self-supervised method for learning shape representations from possibly extremely sparse point clouds. Like in Buffon's needle problem, we "drop" (sample) needles on the point cloud and consider that, statistically, close to the surface, the needle end points lie on opposite sides of the surface. No shape knowledge is required and the point cloud can be highly sparse, e.g., as lidar point clouds acquired by vehicles. Previous self-supervised shape representation approaches fail to produce good-quality results on this kind of data. We obtain quantitative results on par with existing supervised approaches on shape reconstruction datasets and show promising qualitative results on hard autonomous driving datasets such as KITTI.

ConDA: Unsupervised Domain Adaptation for LiDAR Segmentation via Regularized Domain Concatenation

  • Authors: Lingdong Kong, Niamul Quader, Venice Erin Liong
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.15242
  • Pdf link: https://arxiv.org/pdf/2111.15242
  • Abstract
    Transferring knowledge learned from the labeled source domain to the raw target domain for unsupervised domain adaptation (UDA) is essential to the scalable deployment of an autonomous driving system. State-of-the-art approaches in UDA often employ a key concept: utilize joint supervision signals from both the source domain (with ground-truth) and the target domain (with pseudo-labels) for self-training. In this work, we improve and extend on this aspect. We present ConDA, a concatenation-based domain adaptation framework for LiDAR semantic segmentation that: (1) constructs an intermediate domain consisting of fine-grained interchange signals from both source and target domains without destabilizing the semantic coherency of objects and background around the ego-vehicle; and (2) utilizes the intermediate domain for self-training. Additionally, to improve both the network training on the source domain and self-training on the intermediate domain, we propose an anti-aliasing regularizer and an entropy aggregator to reduce the detrimental effects of aliasing artifacts and noisy target predictions. Through extensive experiments, we demonstrate that ConDA is significantly more effective in mitigating the domain gap compared to prior arts.

Fast and Real-time End to End Control in Autonomous Racing Cars Through Representation Learning

  • Authors: Praveen Venkatesh, Rwik Rana, Harish PM
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.15343
  • Pdf link: https://arxiv.org/pdf/2111.15343
  • Abstract
    The challenges presented in an autonomous racing situation are distinct from those faced in regular autonomous driving and require faster end-to-end algorithms and consideration of a longer horizon in determining optimal current actions keeping in mind upcoming maneuvers and situations. In this paper, we propose an end-to-end method for autonomous racing that takes in as inputs video information from an onboard camera and determines final steering and throttle control actions. We use the following split to construct such a method (1) learning a low dimensional representation of the scene, (2) pre-generating the optimal trajectory for the given scene, and (3) tracking the predicted trajectory using a classical control method. In learning a low-dimensional representation of the scene, we use intermediate representations with a novel unsupervised trajectory planner to generate expert trajectories, and hence utilize them to directly predict race lines from a given front-facing input image. Thus, the proposed algorithm employs the best of two worlds - the robustness of learning-based approaches to perception and the accuracy of optimization-based approaches for trajectory generation in an end-to-end learning-based framework. We deploy and demonstrate our framework on CARLA, a photorealistic simulator for testing self-driving cars in realistic environments.

Large-Scale Video Analytics through Object-Level Consolidation

  • Authors: Daniel Rivas, Francesc Guim, Jordà Polo, David Carrera
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
  • Arxiv link: https://arxiv.org/abs/2111.15451
  • Pdf link: https://arxiv.org/pdf/2111.15451
  • Abstract
    As the number of installed cameras grows, so do the compute resources required to process and analyze all the images captured by these cameras. Video analytics enables new use cases, such as smart cities or autonomous driving. At the same time, it urges service providers to install additional compute resources to cope with the demand while the strict latency requirements push compute towards the end of the network, forming a geographically distributed and heterogeneous set of compute locations, shared and resource-constrained. Such landscape (shared and distributed locations) forces us to design new techniques that can optimize and distribute work among all available locations and, ideally, make compute requirements grow sublinearly with respect to the number of cameras installed. In this paper, we present FoMO (Focus on Moving Objects). This method effectively optimizes multi-camera deployments by preprocessing images for scenes, filtering the empty regions out, and composing regions of interest from multiple cameras into a single image that serves as input for a pre-trained object detection model. Results show that overall system performance can be increased by 8x while accuracy improves 40% as a by-product of the methodology, all using an off-the-shelf pre-trained model with no additional training or fine-tuning.

Consensus Synergizes with Memory: A Simple Approach for Anomaly Segmentation in Urban Scenes

  • Authors: Jiazhong Cen, Zenkun Jiang, Lingxi Xie, Qi Tian, Xiaokang Yang, Wei Shen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.15463
  • Pdf link: https://arxiv.org/pdf/2111.15463
  • Abstract
    Anomaly segmentation is a crucial task for safety-critical applications, such as autonomous driving in urban scenes, where the goal is to detect out-of-distribution (OOD) objects with categories which are unseen during training. The core challenge of this task is how to distinguish hard in-distribution samples from OOD samples, which has not been explicitly discussed yet. In this paper, we propose a novel and simple approach named Consensus Synergizes with Memory (CosMe) to address this challenge, inspired by the psychology finding that groups perform better than individuals on memory tasks. The main idea is 1) building a memory bank which consists of seen prototypes extracted from multiple layers of the pre-trained segmentation model and 2) training an auxiliary model that mimics the behavior of the pre-trained model, and then measuring the consensus of their mid-level features as complementary cues that synergize with the memory bank. CosMe is good at distinguishing between hard in-distribution examples and OOD samples. Experimental results on several urban scene anomaly segmentation datasets show that CosMe outperforms previous approaches by large margins.
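
A generic prototype-matching sketch of the memory-bank half of the idea: score each mid-level feature by its distance to the closest seen prototype. The actual CosMe score additionally synergizes this with the consensus between the pre-trained and auxiliary models, which is not modeled here; shapes and data below are made up.

```python
import numpy as np

def prototype_anomaly_score(features: np.ndarray, memory_bank: np.ndarray) -> np.ndarray:
    """Return 1 - max cosine similarity to any seen prototype (higher = more anomalous).

    features: (N, D) mid-level feature vectors; memory_bank: (K, D) prototypes of seen classes."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = memory_bank / np.linalg.norm(memory_bank, axis=1, keepdims=True)
    return 1.0 - (f @ p.T).max(axis=1)

rng = np.random.default_rng(0)
scores = prototype_anomaly_score(rng.normal(size=(4, 64)), rng.normal(size=(16, 64)))
print(scores.shape)  # (4,)
```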

Keyword: mapping

Image denoising by Super Neurons: Why go deep?

  • Authors: Junaid Malik, Serkan Kiranyaz, Moncef Gabbouj
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.14948
  • Pdf link: https://arxiv.org/pdf/2111.14948
  • Abstract
    Classical image denoising methods utilize the non-local self-similarity principle to effectively recover image content from noisy images. Current state-of-the-art methods use deep convolutional neural networks (CNNs) to effectively learn the mapping from noisy to clean images. Deep denoising CNNs manifest a high learning capacity and integrate non-local information owing to the large receptive field yielded by numerous cascades of hidden layers. However, deep networks are also computationally complex and require large data for training. To address these issues, this study draws the focus on the Self-organized Operational Neural Networks (Self-ONNs) empowered by a novel neuron model that can achieve a similar or better denoising performance with a compact and shallow model. Recently, the concept of super-neurons has been introduced, which augments the non-linear transformations of generative neurons by utilizing non-localized kernel locations for an enhanced receptive field size. This is the key accomplishment that removes the need for a deep network configuration. As the integration of non-local information is known to benefit denoising, in this work we investigate the use of super neurons for both synthetic and real-world image denoising. We also discuss the practical issues in implementing the super neuron model on GPUs and propose a trade-off between the heterogeneity of non-localized operations and computational complexity. Our results demonstrate that with the same width and depth, Self-ONNs with super neurons provide a significant boost of denoising performance over the networks with generative and convolutional neurons for both denoising tasks. Moreover, the results demonstrate that Self-ONNs with super neurons can achieve competitive and superior denoising performance compared to well-known deep CNN denoisers on synthetic and real-world denoising, respectively.

MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels

  • Authors: Robert J. Joyce, Dev Amlani, Charles Nicholas, Edward Raff
  • Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2111.15031
  • Pdf link: https://arxiv.org/pdf/2111.15031
  • Abstract
    Malware family classification is a significant issue with public safety and research implications that has been hindered by the high cost of expert labels. The vast majority of corpora use noisy labeling approaches that obstruct definitive quantification of results and study of deeper interactions. In order to provide the data needed to advance further, we have created the Malware Open-source Threat Intelligence Family (MOTIF) dataset. MOTIF contains 3,095 malware samples from 454 families, making it the largest and most diverse public malware dataset with ground truth family labels to date, nearly 3x larger than any prior expert-labeled corpus and 36x larger than the prior Windows malware corpus. MOTIF also comes with a mapping from malware samples to threat reports published by reputable industry sources, which both validates the labels and opens new research opportunities in connecting opaque malware samples to human-readable descriptions. This enables important evaluations that are normally infeasible due to non-standardized reporting in industry. For example, we provide aliases of the different names used to describe the same malware family, allowing us to benchmark for the first time accuracy of existing tools when names are obtained from differing sources. Evaluation results obtained using the MOTIF dataset indicate that existing tasks have significant room for improvement, with accuracy of antivirus majority voting measured at only 62.10% and the well-known AVClass tool having just 46.78% accuracy. Our findings indicate that malware family classification suffers a type of labeling noise unlike that studied in most ML literature, due to the large open set of classes that may not be known from the sample under consideration
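
For reference, the "antivirus majority voting" baseline mentioned above simply picks the family name most engines agree on for a sample; below is a toy sketch with made-up engine names and family labels, not the MOTIF evaluation code.

```python
from collections import Counter

def majority_vote(av_labels: dict) -> str:
    """Return the family reported by the most engines (ties broken arbitrarily).

    av_labels maps engine name -> reported family for one sample. Generic illustration only."""
    return Counter(av_labels.values()).most_common(1)[0][0]

sample = {"engineA": "emotet", "engineB": "emotet", "engineC": "heodo", "engineD": "emotet"}
print(majority_vote(sample))  # emotet
```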

RailLoMer: Rail Vehicle Localization and Mapping with LiDAR-IMU-Odometer-GNSS Data Fusion

  • Authors: Yusheng Wang, Yidong Lou, Yi Zhang, Weiwei Song, Fei Huang, Zhiyong Tu, Shimin Zhang
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.15043
  • Pdf link: https://arxiv.org/pdf/2111.15043
  • Abstract
    We present RailLoMer in this article, to achieve real-time accurate and robust odometry and mapping for rail vehicles. RailLoMer receives measurements from two LiDARs, an IMU, train odometer, and a global navigation satellite system (GNSS) receiver. As frontend, the estimated motion from IMU/odometer preintegration de-skews the denoised point clouds and produces initial guess for frame-to-frame LiDAR odometry. As backend, a sliding window based factor graph is formulated to jointly optimize multi-modal information. In addition, we leverage the plane constraints from extracted rail tracks and the structure appearance descriptor to further improve the system robustness against repetitive structures. To ensure a globally-consistent and less blurry mapping result, we develop a two-stage mapping method that first performs scan-to-map in local scale, then utilizes the GNSS information to register the submaps. The proposed method is extensively evaluated on datasets gathered for a long time range over numerous scales and scenarios, and show that RailLoMer delivers decimeter-grade localization accuracy even in large or degenerated environments. We also integrate RailLoMer into an interactive train state and railway monitoring system prototype design, which has already been deployed to an experimental freight traffic railroad.

Robust 3D Garment Digitization from Monocular 2D Images for 3D Virtual Try-On Systems

  • Authors: Sahib Majithia, Sandeep N. Parameswaran, Sadbhavana Babar, Vikram Garg, Astitva Srivastava, Avinash Sharma
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.15140
  • Pdf link: https://arxiv.org/pdf/2111.15140
  • Abstract
    In this paper, we develop a robust 3D garment digitization solution that can generalize well on real-world fashion catalog images with cloth texture occlusions and large body pose variations. We assumed fixed topology parametric template mesh models for known types of garments (e.g., T-shirts, Trousers) and perform mapping of high-quality texture from an input catalog image to UV map panels corresponding to the parametric mesh model of the garment. We achieve this by first predicting a sparse set of 2D landmarks on the boundary of the garments. Subsequently, we use these landmarks to perform Thin-Plate-Spline-based texture transfer on UV map panels. Subsequently, we employ a deep texture inpainting network to fill the large holes (due to view variations & self-occlusions) in TPS output to generate consistent UV maps. Furthermore, to train the supervised deep networks for landmark prediction & texture inpainting tasks, we generated a large set of synthetic data with varying texture and lighting imaged from various views with the human present in a wide variety of poses. Additionally, we manually annotated a small set of fashion catalog images crawled from online fashion e-commerce platforms to finetune. We conduct thorough empirical evaluations and show impressive qualitative results of our proposed 3D garment texture solution on fashion catalog images. Such 3D garment digitization helps us solve the challenging task of enabling 3D Virtual Try-on.

Deep Auto-encoder with Neural Response

  • Authors: Xuming Ran, Jie Zhang, Ziyuan Ye, Haiyan Wu, Qi Xu, Huihui Zhou, Quanying Liu
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.15309
  • Pdf link: https://arxiv.org/pdf/2111.15309
  • Abstract
    Artificial intelligence and neuroscience are deeply interactive. Artificial neural networks (ANNs) have been a versatile tool to study the neural representation in the ventral visual stream, and the knowledge in neuroscience in return inspires ANN models to improve performance in the task. However, how to merge these two directions into a unified model has been less studied. Here, we propose a hybrid model, called deep auto-encoder with the neural response (DAE-NR), which incorporates information from the visual cortex into ANNs to achieve better image reconstruction and higher neural representation similarity between biological and artificial neurons. Specifically, the same visual stimuli (i.e., natural images) are input to both the mouse brain and DAE-NR. The DAE-NR jointly learns to map a specific layer of the encoder network to the biological neural responses in the ventral visual stream by a mapping function and to reconstruct the visual input by the decoder. Our experiments demonstrate that, only with the joint learning, DAE-NRs can (i) improve the performance of image reconstruction and (ii) increase the representational similarity between biological neurons and artificial neurons. The DAE-NR offers a new perspective on the integration of computer vision and visual neuroscience.
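
A minimal PyTorch sketch of the joint objective described above: a reconstruction loss plus a loss on a linear mapping from an encoder layer to recorded neural responses. Layer sizes, the mapping choice, and the weighting `lam` are assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DAENR(nn.Module):
    """Toy auto-encoder whose latent code is also mapped to neural responses."""
    def __init__(self, in_dim=784, code_dim=64, n_neurons=100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))
        self.to_neural = nn.Linear(code_dim, n_neurons)  # mapping function

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.to_neural(z)

def joint_loss(model, images, responses, lam=1.0):
    """Reconstruction loss plus loss on predicted neural responses."""
    recon, pred_resp = model(images)
    return F.mse_loss(recon, images) + lam * F.mse_loss(pred_resp, responses)
```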

Probabilistic Estimation of 3D Human Shape and Pose with a Semantic Local Parametric Model

  • Authors: Akash Sengupta, Ignas Budvytis, Roberto Cipolla
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.15404
  • Pdf link: https://arxiv.org/pdf/2111.15404
  • Abstract
    This paper addresses the problem of 3D human body shape and pose estimation from RGB images. Some recent approaches to this task predict probability distributions over human body model parameters conditioned on the input images. This is motivated by the ill-posed nature of the problem wherein multiple 3D reconstructions may match the image evidence, particularly when some parts of the body are locally occluded. However, body shape parameters in widely-used body models (e.g. SMPL) control global deformations over the whole body surface. Distributions over these global shape parameters are unable to meaningfully capture uncertainty in shape estimates associated with locally-occluded body parts. In contrast, we present a method that (i) predicts distributions over local body shape in the form of semantic body measurements and (ii) uses a linear mapping to transform a local distribution over body measurements to a global distribution over SMPL shape parameters. We show that our method outperforms the current state-of-the-art in terms of identity-dependent body shape estimation accuracy on the SSP-3D dataset, and a private dataset of tape-measured humans, by probabilistically-combining local body measurement distributions predicted from multiple images of a subject.
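
Step (ii) amounts to pushing a Gaussian over body measurements through an affine map into SMPL shape space; a short numpy sketch follows, where the matrix `A`, offset `b`, and dimensions are placeholders, not the released model.

```python
import numpy as np

def measurements_to_smpl_shape(mu_m, cov_m, A, b):
    """Push a Gaussian over body measurements m through beta = A m + b,
    giving a Gaussian over SMPL shape parameters beta."""
    mu_beta = A @ mu_m + b
    cov_beta = A @ cov_m @ A.T
    return mu_beta, cov_beta

# Example with placeholder dimensions (10 measurements -> 10 betas).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(10, 10)), np.zeros(10)
mu_beta, cov_beta = measurements_to_smpl_shape(np.zeros(10), 0.01 * np.eye(10), A, b)
```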

A Face Recognition System's Worst Morph Nightmare, Theoretically

  • Authors: Una M. Kelly, Raymond Veldhuis, Luuk Spreeuwers
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.15416
  • Pdf link: https://arxiv.org/pdf/2111.15416
  • Abstract
    It has been shown that Face Recognition Systems (FRSs) are vulnerable to morphing attacks, but most research focuses on landmark-based morphs. A second method for generating morphs uses Generative Adversarial Networks, which results in convincingly real facial images that can be almost as challenging for FRSs as landmark-based attacks. We propose a method to create a third, different type of morph that has the advantage of being easier to train. We introduce the theoretical concept of *worst-case morphs*, which are those morphs that are most challenging for a fixed FRS. For a set of images and corresponding embeddings in an FRS's latent space, we generate images that approximate these worst-case morphs using a mapping from embedding space back to image space. While the resulting images are not yet as challenging as other morphs, they can provide valuable information in future research on Morphing Attack Detection (MAD) methods and on weaknesses of FRSs. Methods for MAD need to be validated on more varied morph databases. Our proposed method contributes to achieving such variation.

Nonlinear Intensity Underwater Sonar Image Matching Method Based on Phase Information and Deep Convolution Features

  • Authors: Xiaoteng Zhou, Changli Yu, Xin Yuan, Haijun Feng, Yang Xu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.15514
  • Pdf link: https://arxiv.org/pdf/2111.15514
  • Abstract
    In the field of deep-sea exploration, sonar is presently the only efficient long-distance sensing device. The complicated underwater environment, such as noise interference, low target intensity or background dynamics, has brought many negative effects on sonar imaging. Among them, the problem of nonlinear intensity is extremely prevalent. It is also known as the anisotropy of acoustic sensor imaging, that is, when autonomous underwater vehicles (AUVs) carry sonar to detect the same target from different angles, the intensity variation between image pairs is sometimes very large, which makes the traditional matching algorithm almost ineffective. However, image matching is the basis of comprehensive tasks such as navigation, positioning, and mapping. Therefore, it is very valuable to obtain robust and accurate matching results. This paper proposes a combined matching method based on phase information and deep convolution features. It has two outstanding advantages: one is that the deep convolution features could be used to measure the similarity of the local and global positions of the sonar image; the other is that local feature matching could be performed at the key target position of the sonar image. This method does not need complex manual designs, and completes the matching task of nonlinear intensity sonar images in a close end-to-end manner. Feature matching experiments are carried out on the deep-sea sonar images captured by AUVs, and the results show that our proposal has preeminent matching accuracy and robustness.
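
The phase-information component of such a matcher is commonly realised as phase correlation, which is largely insensitive to nonlinear intensity changes. The generic numpy sketch below (not the authors' code) recovers the integer translation between two equally sized sonar patches.

```python
import numpy as np

def phase_correlation(img_a, img_b, eps=1e-8):
    """Estimate the (dy, dx) shift between two equally sized grayscale patches.
    Only the phase of the cross-power spectrum is used, so the estimate is
    largely unaffected by nonlinear intensity differences."""
    cross = np.fft.fft2(img_a) * np.conj(np.fft.fft2(img_b))
    cross /= np.abs(cross) + eps              # keep phase only
    corr = np.fft.ifft2(cross).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap shifts larger than half the patch size to negative values.
    if dy > img_a.shape[0] // 2:
        dy -= img_a.shape[0]
    if dx > img_a.shape[1] // 2:
        dx -= img_a.shape[1]
    return dy, dx
```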

The MIS Check-Dam Dataset for Object Detection and Instance Segmentation Tasks

  • Authors: Chintan Tundia, Rajiv Kumar, Om Damani, G. Sivakumar
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.15613
  • Pdf link: https://arxiv.org/pdf/2111.15613
  • Abstract
    Deep learning has led to many recent advances in object detection and instance segmentation, among other computer vision tasks. These advancements have led to wide application of deep learning based methods and related methodologies in object detection tasks for satellite imagery. In this paper, we introduce MIS Check-Dam, a new dataset of check-dams from satellite imagery for building an automated system for the detection and mapping of check-dams, focusing on the importance of irrigation structures used for agriculture. We review some of the most recent object detection and instance segmentation methods and assess their performance on our new dataset. We evaluate several single stage, two-stage and attention based methods under various network configurations and backbone architectures. The dataset and the pre-trained models are available at https://www.cse.iitb.ac.in/gramdrishti/.

Keyword: localization

Real-Time CRLB based Antenna Selection in Planar Antenna Arrays

  • Authors: Masoud Arash, Ivan Stupia, Luc Vandendorpe
  • Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2111.15008
  • Pdf link: https://arxiv.org/pdf/2111.15008
  • Abstract
    Estimation of User Terminals' (UTs') Angle of Arrival (AoA) plays a significant role in the next generation of wireless systems. Due to high demands, energy efficiency concerns, and scarcity of available resources, it is pivotal how these resources are used. Installed antennas and their corresponding hardware at the Base Station (BS) are among these resources. In this paper, we address the problem of antenna selection to minimize the Cramer-Rao Lower Bound (CRLB) of a planar antenna array when fewer antennas than the total available antennas have to be used for a UT. First, the optimal antenna selection strategy to minimize the expected CRLB in a planar antenna array is proposed. Then, using this strategy as a preliminary step, we present a two-step antenna selection method whose goal is to minimize the instantaneous CRLB. Minimizing the instantaneous CRLB through antenna selection is a combinatorial optimization problem, for which we utilize a greedy algorithm. The optimal starting point of the greedy algorithm is presented alongside some methods to reduce the computational complexity of the selection procedure. Numerical results confirm the accuracy of the proposed solutions and highlight the benefits of using antenna selection in the localization phase of a wireless system.
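
The greedy step can be illustrated with a toy numpy sketch that repeatedly adds the antenna whose inclusion most reduces the CRLB, taken here as the trace of the inverse Fisher information; the per-antenna Fisher-information terms are placeholders, not the paper's exact model.

```python
import numpy as np

def greedy_antenna_selection(fim_terms, k, reg=1e-9):
    """Select k antennas minimizing trace(inv(sum of selected FIM terms)).

    fim_terms: (N, d, d) per-antenna Fisher-information contributions
    k:         number of antennas to select
    """
    n, d = fim_terms.shape[0], fim_terms.shape[1]
    selected = []
    fim = reg * np.eye(d)            # tiny regularizer keeps the FIM invertible
    for _ in range(k):
        best, best_crlb = None, np.inf
        for i in range(n):
            if i in selected:
                continue
            crlb = np.trace(np.linalg.inv(fim + fim_terms[i]))
            if crlb < best_crlb:
                best, best_crlb = i, crlb
        selected.append(best)
        fim += fim_terms[best]
    return selected
```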

RailLoMer: Rail Vehicle Localization and Mapping with LiDAR-IMU-Odometer-GNSS Data Fusion

  • Authors: Yusheng Wang, Yidong Lou, Yi Zhang, Weiwei Song, Fei Huang, Zhiyong Tu, Shimin Zhang
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.15043
  • Pdf link: https://arxiv.org/pdf/2111.15043
  • Abstract
    We present RailLoMer in this article to achieve real-time, accurate, and robust odometry and mapping for rail vehicles. RailLoMer receives measurements from two LiDARs, an IMU, a train odometer, and a global navigation satellite system (GNSS) receiver. As the frontend, the estimated motion from IMU/odometer preintegration de-skews the denoised point clouds and produces an initial guess for frame-to-frame LiDAR odometry. As the backend, a sliding-window-based factor graph is formulated to jointly optimize multi-modal information. In addition, we leverage the plane constraints from extracted rail tracks and the structure appearance descriptor to further improve the system robustness against repetitive structures. To ensure a globally-consistent and less blurry mapping result, we develop a two-stage mapping method that first performs scan-to-map registration at the local scale, then utilizes the GNSS information to register the submaps. The proposed method is extensively evaluated on datasets gathered over a long time span across numerous scales and scenarios, and the results show that RailLoMer delivers decimeter-grade localization accuracy even in large or degenerate environments. We also integrate RailLoMer into an interactive train state and railway monitoring system prototype design, which has already been deployed to an experimental freight traffic railroad.

AssistSR: Affordance-centric Question-driven Video Segment Retrieval

  • Authors: Stan Weixian Lei, Yuxuan Wang, Dongxing Mao, Difei Gao, Mike Zheng Shou
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.15050
  • Pdf link: https://arxiv.org/pdf/2111.15050
  • Abstract
    It is still a pipe dream that AI assistants on phone and AR glasses can assist our daily life in addressing our questions like "how to adjust the date for this watch?" and "how to set its heating duration? (while pointing at an oven)". The queries used in conventional tasks (i.e. Video Question Answering, Video Retrieval, Moment Localization) are often factoid and based on pure text. In contrast, we present a new task called Affordance-centric Question-driven Video Segment Retrieval (AQVSR). Each of our questions is an image-box-text query that focuses on affordance of items in our daily life and expects relevant answer segments to be retrieved from a corpus of instructional video-transcript segments. To support the study of this AQVSR task, we construct a new dataset called AssistSR. We design novel guidelines to create high-quality samples. This dataset contains 1.4k multimodal questions on 1k video segments from instructional videos on diverse daily-used items. To address AQVSR, we develop a straightforward yet effective model called Dual Multimodal Encoders (DME) that significantly outperforms several baseline methods while still having large room for improvement in the future. Moreover, we present detailed ablation analyses. Our codes and data are available at https://github.com/StanLei52/AQVSR.

Automatic tracing of mandibular canal pathways using deep learning

  • Authors: Mrinal Kanti Dhar, Zeyun Yu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.15111
  • Pdf link: https://arxiv.org/pdf/2111.15111
  • Abstract
    There is an increasing demand in the medical industry for automated detection and localization systems, since manual approaches are inefficient. In dentistry, it is of great interest to trace the pathway of the mandibular canals accurately. Proper localization of the position of the mandibular canals, which surround the inferior alveolar nerve (IAN), reduces the risk of damaging it during dental implantology. Manual detection of canal paths is not an efficient way in terms of time and labor. Here, we propose a deep learning-based framework to detect mandibular canals from CBCT data. It is a fully automatic, end-to-end, three-stage process. Ground truths are generated in the preprocessing stage. Instead of using the commonly used fixed-diameter tubular-shaped ground truth, we generate centerlines of the mandibular canals and use them as ground truths in the training process. A 3D U-Net architecture is used for model training. An efficient post-processing stage is developed to rectify the initial prediction. The precision, recall, F1-score, and IoU are measured to analyze the voxel-level segmentation performance. For distance-based evaluation, the mean curve distance (MCD), both from ground truth to prediction and from prediction to ground truth, is calculated. Extensive experiments are conducted to demonstrate the effectiveness of the model.
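
The mean curve distance (MCD) used for evaluation can be computed as the average nearest-point distance between two sampled centerlines, taken in each direction; a small sketch with SciPy's KD-tree (names are illustrative):

```python
from scipy.spatial import cKDTree

def mean_curve_distance(curve_a, curve_b):
    """Average nearest-neighbour distance from curve_a to curve_b,
    where each curve is an (N, 3) array of sampled centerline points."""
    return cKDTree(curve_b).query(curve_a)[0].mean()

def symmetric_mcd(pred_centerline, gt_centerline):
    """MCD in both directions: prediction-to-GT and GT-to-prediction."""
    return (mean_curve_distance(pred_centerline, gt_centerline),
            mean_curve_distance(gt_centerline, pred_centerline))
```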

AirObject: A Temporally Evolving Graph Embedding for Object Identification

  • Authors: Nikhil Varma Keetha, Chen Wang, Yuheng Qiu, Kuan Xu, Sebastian Scherer
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.15150
  • Pdf link: https://arxiv.org/pdf/2111.15150
  • Abstract
    Object encoding and identification are vital for robotic tasks such as autonomous exploration, semantic scene understanding, and re-localization. Previous approaches have attempted to either track objects or generate descriptors for object identification. However, such systems are limited to a "fixed" partial object representation from a single viewpoint. In a robot exploration setup, there is a requirement for a temporally "evolving" global object representation built as the robot observes the object from multiple viewpoints. Furthermore, given the vast distribution of unknown novel objects in the real world, the object identification process must be class-agnostic. In this context, we propose a novel temporal 3D object encoding approach, dubbed AirObject, to obtain global keypoint graph-based embeddings of objects. Specifically, the global 3D object embeddings are generated using a temporal convolutional network across structural information of multiple frames obtained from a graph attention-based encoding method. We demonstrate that AirObject achieves the state-of-the-art performance for video object identification and is robust to severe occlusion, perceptual aliasing, viewpoint shift, deformation, and scale transform, outperforming the state-of-the-art single-frame and sequential descriptors. To the best of our knowledge, AirObject is one of the first temporal object encoding methods.

WALK-VIO: Walking-motion-Adaptive Leg Kinematic Constraint Visual-Inertial Odometry for Quadruped Robots

  • Authors: Hyunjun Lim, Byeongho Yu, Yeeun Kim, Joowoong Byun, Soonpyo Kwon, Haewon Park, Hyun Myung
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.15164
  • Pdf link: https://arxiv.org/pdf/2111.15164
  • Abstract
    In this paper, WALK-VIO, a novel visual-inertial odometry (VIO) with walking-motion-adaptive leg kinematic constraints that change with body motion for localization of quadruped robots, is proposed. Quadruped robots primarily use VIO because they require fast localization for control and path planning. However, since quadruped robots are mainly used outdoors, extraneous features extracted from the sky or ground cause tracking failures. In addition, the quadruped robot's walking motion causes wobbling, which lowers the localization accuracy obtained from the camera and inertial measurement unit (IMU). To overcome these limitations, many researchers use VIO with leg kinematic constraints. However, since the quadruped robot's walking motion varies according to the controller, gait, robot velocity, and so on, these factors should be considered in the process of adding leg kinematic constraints. We propose a VIO that can be used regardless of walking motion by adjusting the leg kinematic constraint factor. In order to evaluate WALK-VIO, we create and publish datasets of quadruped robots that move with various types of walking motion in a simulation environment. In addition, we verify the validity of WALK-VIO through comparison with current state-of-the-art algorithms.

SP-SEDT: Self-supervised Pre-training for Sound Event Detection Transformer

  • Authors: Zhirong Ye, Xiangdong Wang, Hong Liu, Yueliang Qian, Rui Tao, Long Yan, Kazushige Ouchi
  • Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2111.15222
  • Pdf link: https://arxiv.org/pdf/2111.15222
  • Abstract
    Recently, an event-based end-to-end model (SEDT) has been proposed for sound event detection (SED) and achieves competitive performance. However, compared with the frame-based model, it requires more training data with temporal annotations to improve the localization ability. Synthetic data is an alternative, but it suffers from a great domain gap with real recordings. Inspired by the great success of UP-DETR in object detection, we propose to self-supervisedly pre-train SEDT (SP-SEDT) by detecting random patches (only cropped along the time axis). Experiments on the DCASE2019 task4 dataset show the proposed SP-SEDT can outperform fine-tuned frame-based model. The ablation study is also conducted to investigate the impact of different loss functions and patch size.

ARTSeg: Employing Attention for Thermal images Semantic Segmentation

  • Authors: Farzeen Munir, Shoaib Azam, Unse Fatima, Moongu Jeon
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.15257
  • Pdf link: https://arxiv.org/pdf/2111.15257
  • Abstract
    Advances in research have enabled the neural network algorithms deployed in autonomous vehicles to perceive their surroundings. The standard exteroceptive sensors that are utilized for the perception of the environment are cameras and LiDAR. Therefore, the neural network algorithms developed using these exteroceptive sensors have provided the necessary solution for the autonomous vehicle's perception. One major drawback of these exteroceptive sensors is their limited operability in adverse weather conditions, for instance, low illumination and night conditions. The usability and affordability of thermal cameras in the sensor suite of the autonomous vehicle provide the necessary improvement in the autonomous vehicle's perception in adverse weather conditions. The semantics of the environment benefit robust perception, which can be achieved by segmenting different objects in the scene. In this work, we have employed the thermal camera for semantic segmentation. We have designed an attention-based Recurrent Convolution Network (RCNN) encoder-decoder architecture named ARTSeg for thermal semantic segmentation. The main contribution of this work is the design of the encoder-decoder architecture, which employs RCNN units in each encoder and decoder block. Furthermore, additive attention is employed in the decoder module to retain high-resolution features and improve the localization of features. The efficacy of the proposed method is evaluated on the available public dataset, showing better performance than other state-of-the-art methods in mean intersection over union (IoU).

Reconstruction Student with Attention for Student-Teacher Pyramid Matching

  • Authors: Shinji Yamada, Kazuhiro Hotta
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2111.15376
  • Pdf link: https://arxiv.org/pdf/2111.15376
  • Abstract
    Anomaly detection and localization are important problems in computer vision. Recently, Convolutional Neural Networks (CNNs) have been used for visual inspection. In particular, the scarcity of anomalous samples increases the difficulty of this task, and unsupervised learning-based methods are attracting attention. We focus on Student-Teacher Feature Pyramid Matching (STPM), which can be trained from only normal images in a small number of epochs. Here we propose a powerful method which compensates for the shortcomings of STPM. The proposed method consists of two students and two teachers; one student-teacher pair is the same as STPM. The other student-teacher network is responsible for reconstructing the features of normal products. By reconstructing the features of normal products from an abnormal image, it is possible to detect abnormalities with higher accuracy by taking the difference between them. The new student-teacher network uses attention modules and a different teacher network from the original STPM. The attention mechanism acts to successfully reconstruct the normal regions in an input image. The different teacher network prevents it from looking at the same regions as the original STPM. Six anomaly maps obtained from the two student-teacher networks are used to calculate the final anomaly map. The student-teacher network for reconstructing features improves AUC scores at the pixel level and image level in comparison with the original STPM.
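
Conceptually, each student-teacher pair contributes a per-pixel discrepancy map that is aggregated across pyramid levels into an anomaly map. The PyTorch sketch below assumes both networks expose a list of feature maps; it is a generic illustration, not the proposed method's exact scoring.

```python
import torch
import torch.nn.functional as F

def anomaly_map(student_feats, teacher_feats, out_size):
    """Combine per-level feature discrepancies into a single anomaly map.

    student_feats, teacher_feats: lists of (B, C, H, W) feature tensors
    out_size: (H, W) of the returned map
    """
    batch = student_feats[0].shape[0]
    total = torch.zeros(batch, 1, *out_size, device=student_feats[0].device)
    for fs, ft in zip(student_feats, teacher_feats):
        # Cosine-style discrepancy: 1 - similarity of channel-normalized features.
        level = 1.0 - (F.normalize(fs, dim=1) * F.normalize(ft, dim=1)).sum(1, keepdim=True)
        total += F.interpolate(level, size=out_size, mode='bilinear',
                               align_corners=False)
    return total / len(student_feats)
```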

New submissions for Tue, 2 Nov 21

Keyword: SLAM

Loop closure detection using local 3D deep descriptors

  • Authors: Youjie Zhou, Yiming Wang, Fabio Poiesi, Qi Qin, Yi Wan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.00440
  • Pdf link: https://arxiv.org/pdf/2111.00440
  • Abstract
    We present a simple yet effective method to address loop closure detection in simultaneous localisation and mapping using local 3D deep descriptors (L3Ds). L3Ds are emerging compact representations of patches extracted from point clouds that are learned from data using a deep learning algorithm. We propose a novel overlap measure for loop detection by computing the metric error between points that correspond to mutually-nearest-neighbour descriptors after registering the loop candidate point cloud by its estimated relative pose. This novel approach enables us to accurately detect loops and estimate six degrees-of-freedom poses in the case of small overlaps. We compare our L3D-based loop closure approach with recent approaches on LiDAR data and achieve state-of-the-art loop closure detection accuracy. Additionally, we embed our loop closure approach in RESLAM, a recent edge-based SLAM system, and perform the evaluation on real-world RGBD-TUM and synthetic ICL datasets. Our approach enables RESLAM to achieve a better localisation accuracy compared to its original loop closure strategy.
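
The overlap measure can be sketched as follows (not the authors' implementation): find mutually-nearest-neighbour descriptor pairs, warp the loop-candidate keypoints by the estimated relative pose, and report the fraction of pairs whose point-to-point error is below a threshold.

```python
import numpy as np
from scipy.spatial import cKDTree

def descriptor_overlap(pts_q, desc_q, pts_c, desc_c, T_cq, inlier_thresh=0.3):
    """Overlap score between a query cloud and a loop candidate.

    pts_q, pts_c:   (N, 3) / (M, 3) keypoint coordinates
    desc_q, desc_c: (N, D) / (M, D) local 3D descriptors
    T_cq:           4x4 estimated pose taking candidate points into the query frame
    """
    nn_qc = cKDTree(desc_c).query(desc_q)[1]      # query -> candidate
    nn_cq = cKDTree(desc_q).query(desc_c)[1]      # candidate -> query
    mutual = [(i, j) for i, j in enumerate(nn_qc) if nn_cq[j] == i]
    if not mutual:
        return 0.0
    qi, cj = np.array(mutual).T
    cand_in_q = (T_cq[:3, :3] @ pts_c[cj].T).T + T_cq[:3, 3]
    err = np.linalg.norm(pts_q[qi] - cand_in_q, axis=1)
    return float((err < inlier_thresh).mean())    # fraction of consistent pairs
```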

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

Loop closure detection using local 3D deep descriptors

  • Authors: Youjie Zhou, Yiming Wang, Fabio Poiesi, Qi Qin, Yi Wan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.00440
  • Pdf link: https://arxiv.org/pdf/2111.00440
  • Abstract
    We present a simple yet effective method to address loop closure detection in simultaneous localisation and mapping using local 3D deep descriptors (L3Ds). L3Ds are emerging compact representations of patches extracted from point clouds that are learned from data using a deep learning algorithm. We propose a novel overlap measure for loop detection by computing the metric error between points that correspond to mutually-nearest-neighbour descriptors after registering the loop candidate point cloud by its estimated relative pose. This novel approach enables us to accurately detect loops and estimate six degrees-of-freedom poses in the case of small overlaps. We compare our L3D-based loop closure approach with recent approaches on LiDAR data and achieve state-of-the-art loop closure detection accuracy. Additionally, we embed our loop closure approach in RESLAM, a recent edge-based SLAM system, and perform the evaluation on real-world RGBD-TUM and synthetic ICL datasets. Our approach enables RESLAM to achieve a better localisation accuracy compared to its original loop closure strategy.

Local Trajectory Planning For UAV Autonomous Landing

  • Authors: Yossi Magrisso, Ehud Rivlin, Hector Rotstein, Oren Salzman
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.00495
  • Pdf link: https://arxiv.org/pdf/2111.00495
  • Abstract
    An important capability of autonomous Unmanned Aerial Vehicles (UAVs) is autonomous landing while avoiding collision with obstacles in the process. Such capability requires real-time local trajectory planning. Although trajectory-planning methods have been introduced for cases such as emergency landing, they have not been evaluated in real-life scenarios where only the surface of obstacles can be sensed and detected. We propose a novel optimization framework using a pre-planned global path and a priority map of the landing area. Several trajectory planning algorithms were implemented and evaluated in a simulator that includes a 3D urban environment, LiDAR-based obstacle-surface sensing and UAV guidance and dynamics. We show that using our proposed optimization criterion can successfully improve the landing-mission success probability while avoiding collisions with obstacles in real-time.

MetroLoc: Metro Vehicle Mapping and Localization with LiDAR-Camera-Inertial Integration

  • Authors: Yusheng Wang, Weiwei Song, Yi Zhang, Fei Huang, Zhiyong Tu, Yidong Lou
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.00762
  • Pdf link: https://arxiv.org/pdf/2111.00762
  • Abstract
    We propose an accurate and robust multi-modal sensor fusion framework, MetroLoc, targeting one of the most extreme scenarios, large-scale metro vehicle localization and mapping. MetroLoc is built atop an IMU-centric state estimator that tightly couples light detection and ranging (LiDAR), visual, and inertial information with the convenience of loosely coupled methods. The proposed framework is composed of three submodules: IMU odometry, LiDAR-inertial odometry (LIO), and visual-inertial odometry (VIO). The IMU is treated as the primary sensor, which uses the observations from LIO and VIO to constrain the accelerometer and gyroscope biases. Compared to previous point-only LIO methods, our approach leverages more geometry information by introducing both line and plane features into motion estimation. The VIO also utilizes the environmental structure information by employing both lines and points. Our proposed method has been extensively tested in long-duration metro environments with a maintenance vehicle. Experimental results show the system to be more accurate and robust than state-of-the-art approaches while running in real time. In addition, we develop a series of Virtual Reality (VR) applications towards efficient, economical, and interactive rail vehicle state and trackside infrastructure monitoring, which has already been deployed to an outdoor testing railroad.

Learning Inertial Odometry for Dynamic Legged Robot State Estimation

  • Authors: Russell Buchanan, Marco Camurri, Frank Dellaert, Maurice Fallon
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.00789
  • Pdf link: https://arxiv.org/pdf/2111.00789
  • Abstract
    This paper introduces a novel proprioceptive state estimator for legged robots based on a learned displacement measurement from IMU data. Recent research in pedestrian tracking has shown that motion can be inferred from inertial data using convolutional neural networks. A learned inertial displacement measurement can improve state estimation in challenging scenarios where leg odometry is unreliable, such as slipping and compressible terrains. Our work learns to estimate a displacement measurement from IMU data which is then fused with traditional leg odometry. Our approach greatly reduces the drift of proprioceptive state estimation, which is critical for legged robots deployed in vision and lidar denied environments such as foggy sewers or dusty mines. We compared results from an EKF and an incremental fixed-lag factor graph estimator using data from several real robot experiments crossing challenging terrains. Our results show a reduction of relative pose error by 37% in challenging scenarios when compared to a traditional kinematic-inertial estimator without learned measurement. We also demonstrate a 22% reduction in error when used with vision systems in visually degraded environments such as an underground mine.

VPFNet: Voxel-Pixel Fusion Network for Multi-class 3D Object Detection

  • Authors: Chia-Hung Wang, Hsueh-Wei Chen, Li-Chen Fu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.00966
  • Pdf link: https://arxiv.org/pdf/2111.00966
  • Abstract
    Many LiDAR-based methods have been claimed to perform quite well at detecting large objects, in single-class object detection, or under easy conditions. However, their performance in detecting small objects or under hard conditions does not surpass that of fusion-based methods, due to their failure to leverage image semantics. In order to elevate the detection performance in a complicated environment, this paper proposes a deep learning (DL)-embedded fusion-based multi-class 3D object detection network which admits both LiDAR and camera sensor data streams, named Voxel-Pixel Fusion Network (VPFNet). Inside this network, a key novel component is the Voxel-Pixel Fusion (VPF) layer, which takes advantage of the geometric relation of a voxel-pixel pair and fuses the voxel features and the pixel features with proper mechanisms. Moreover, several parameters are particularly designed to guide and enhance the fusion effect after considering the characteristics of a voxel-pixel pair. Finally, the proposed method is evaluated on the KITTI benchmark for the multi-class 3D object detection task under multilevel difficulty, and is shown to outperform all state-of-the-art methods in mean average precision (mAP). It is also noteworthy that our approach ranks first on the KITTI leaderboard for the challenging pedestrian class.

Keyword: loop detection

Loop closure detection using local 3D deep descriptors

  • Authors: Youjie Zhou, Yiming Wang, Fabio Poiesi, Qi Qin, Yi Wan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.00440
  • Pdf link: https://arxiv.org/pdf/2111.00440
  • Abstract
    We present a simple yet effective method to address loop closure detection in simultaneous localisation and mapping using local 3D deep descriptors (L3Ds). L3Ds are emerging compact representations of patches extracted from point clouds that are learned from data using a deep learning algorithm. We propose a novel overlap measure for loop detection by computing the metric error between points that correspond to mutually-nearest-neighbour descriptors after registering the loop candidate point cloud by its estimated relative pose. This novel approach enables us to accurately detect loops and estimate six degrees-of-freedom poses in the case of small overlaps. We compare our L3D-based loop closure approach with recent approaches on LiDAR data and achieve state-of-the-art loop closure detection accuracy. Additionally, we embed our loop closure approach in RESLAM, a recent edge-based SLAM system, and perform the evaluation on real-world RGBD-TUM and synthetic ICL datasets. Our approach enables RESLAM to achieve a better localisation accuracy compared to its original loop closure strategy.

Keyword: autonomous driving

There is no result

Keyword: mapping

Multi-Objective Autonomous Exploration on Real-Time Continuous Occupancy Maps

  • Authors: Zheng Chen, Weizhe Chen, Shi Bai, Lantao Liu
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.00067
  • Pdf link: https://arxiv.org/pdf/2111.00067
  • Abstract
    Autonomous exploration in unknown environments using mobile robots is the pillar of many robotic applications. Existing exploration frameworks either select the nearest geometric frontier or the nearest information-theoretic frontier. However, just because a frontier itself is informative does not necessarily mean that the robot will be in an informative area after reaching that frontier. To fill this gap, we propose to use a multi-objective variant of Monte-Carlo tree search that provides a non-myopic Pareto optimal action sequence leading the robot to a frontier with the greatest extent of unknown area uncovering. We also adopted Bayesian Hilbert Map (BHM) for continuous occupancy mapping and made it more applicable to real-time tasks.

Efficient Map Prediction via Low-Rank Matrix Completion

  • Authors: Zheng Chen, Shi Bai, Lantao Liu
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.00075
  • Pdf link: https://arxiv.org/pdf/2111.00075
  • Abstract
    In many autonomous mapping tasks, the maps cannot be accurately constructed due to various reasons such as sparse, noisy, and partial sensor measurements. We propose a novel map prediction method built upon the recent success of Low-Rank Matrix Completion. The proposed map prediction is able to achieve both map interpolation and extrapolation on raw poor-quality maps with missing or noisy observations. We validate with extensive simulated experiments that the approach can achieve real-time computation for large maps, and the performance is superior to the state-of-the-art map prediction approach - Bayesian Hilbert Mapping in terms of mapping accuracy and computation time. Then we demonstrate that with the proposed real-time map prediction framework, the coverage convergence rate (per action step) for a set of representative coverage planning methods commonly used for environmental modeling and monitoring tasks can be significantly improved.
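
Low-rank completion of a partially observed grid map can be sketched with singular value thresholding, a generic algorithm that may differ from the one used in the paper; `tau` and the iteration count are arbitrary choices.

```python
import numpy as np

def complete_map(observed, mask, tau=1.0, iters=200):
    """Fill missing map cells via singular value thresholding.

    observed: (H, W) map values (arbitrary where mask is False)
    mask:     (H, W) boolean array, True where a cell was actually observed
    """
    X = np.where(mask, observed, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s = np.maximum(s - tau, 0.0)        # shrink singular values -> low rank
        X = (U * s) @ Vt
        X[mask] = observed[mask]            # keep observed cells fixed
    return X
```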

Visual Explanations for Convolutional Neural Networks via Latent Traversal

  • Authors: Amil Dravid, Aggelos K. Katsaggelos
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2111.00116
  • Pdf link: https://arxiv.org/pdf/2111.00116
  • Abstract
    Lack of explainability in artificial intelligence, specifically deep neural networks, remains a bottleneck for implementing models in practice. Popular techniques such as Gradient-weighted Class Activation Mapping (Grad-CAM) provide a coarse map of salient features in an image, which rarely tells the whole story of what a convolutional neural network (CNN) learned. Using COVID-19 chest X-rays, we present a method for interpreting what a CNN has learned by utilizing Generative Adversarial Networks (GANs). Our GAN framework disentangles lung structure from COVID-19 features. Using this GAN, we can visualize the transition of a pair of COVID negative lungs in a chest radiograph to a COVID positive pair by interpolating in the latent space of the GAN, which provides fine-grained visualization of how the CNN responds to varying features within the lungs.

A Quasi-Newton method for physically-admissible simulation of Poiseuille flow under fracture propagation

  • Authors: Guotong Ren, Rami M. Younis
  • Subjects: Computational Engineering, Finance, and Science (cs.CE)
  • Arxiv link: https://arxiv.org/abs/2111.00264
  • Pdf link: https://arxiv.org/pdf/2111.00264
  • Abstract
    Coupled hydro-mechanical processes are of great importance to numerous engineering systems, e.g., hydraulic fracturing, geothermal energy, and carbon sequestration. Fluid flow in fractures is modeled after a Poiseuille law that relates the conductivity to the aperture by a cubic relation. Newton's method is commonly employed to solve the resulting discrete, nonlinear algebraic systems. It is demonstrated, however, that Newton's method will likely converge to nonphysical numerical solutions, resulting in estimates with a negative fracture aperture. A Quasi-Newton approach is developed to ensure global convergence to the physical solution. A fixed-point stability analysis demonstrates that both physical and nonphysical solutions are stable for Newton's method, whereas only physical solutions are stable for the proposed Quasi-Newton method. Additionally, it is also demonstrated that the Quasi-Newton method offers a contraction mapping along the iteration path. Numerical examples of fluid-driven fracture propagation demonstrate that the proposed solution method results in robust and computationally efficient performance.

A Novel Linear Power Flow Model

  • Authors: Zhentong Shao, Qiaozhu Zhai, Xiaohong Guan
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2111.00382
  • Pdf link: https://arxiv.org/pdf/2111.00382
  • Abstract
    Linear power flow (LPF) models are important for the solution of large-scale problems in power system analysis. This paper proposes a novel LPF method named data-based LPF (DB-LPF). The DB-LPF is distinct from the data-driven LPF (DD-LPF) model because the DB-LPF formulates the definition set first and then discretizes the set into representative data samples, while the DD-LPF directly mines the variable mappings from historical or measurement data. In this paper, the concept of the LPF definition set (i.e., the set where the LPF performs well) is proposed and an analytical algorithm is provided to obtain the set. Meanwhile, a novel form of AC-PF models is provided, which is helpful in deriving the analytical algorithm and directing the formulations of LPF models. The definition set is discretized by a grid sampling approach and the obtained samples are processed by the least-squares method to formulate the DB-LPF model. Moreover, a model for obtaining the error bound of the DB-LPF is proposed, and the network losses of the DB-LPF are also analyzed. Finally, the DB-LPF model is tested on numerous cases, whose branch parameters are also analyzed. The test results verify the effectiveness and superiority of the proposed method.
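
The least-squares step can be illustrated with a toy sketch: given operating points sampled from the definition set, a linear map from power injections to AC power-flow solutions is fitted in closed form. Variable names and shapes are assumptions.

```python
import numpy as np

def fit_linear_power_flow(injections, solutions):
    """Fit solutions ~ injections @ W + c by ordinary least squares.

    injections: (S, n) sampled active/reactive power injections, one row per sample
    solutions:  (S, m) corresponding AC power-flow results (e.g. angles, magnitudes)
    Returns the coefficient matrix W of shape (n, m) and intercept c of shape (m,).
    """
    X = np.hstack([injections, np.ones((injections.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X, solutions, rcond=None)
    return coef[:-1], coef[-1]

# Usage on a new operating point inside the definition set:
# prediction = new_injections @ W + c
```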

Loop closure detection using local 3D deep descriptors

  • Authors: Youjie Zhou, Yiming Wang, Fabio Poiesi, Qi Qin, Yi Wan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.00440
  • Pdf link: https://arxiv.org/pdf/2111.00440
  • Abstract
    We present a simple yet effective method to address loop closure detection in simultaneous localisation and mapping using local 3D deep descriptors (L3Ds). L3Ds are emerging compact representations of patches extracted from point clouds that are learned from data using a deep learning algorithm. We propose a novel overlap measure for loop detection by computing the metric error between points that correspond to mutually-nearest-neighbour descriptors after registering the loop candidate point cloud by its estimated relative pose. This novel approach enables us to accurately detect loops and estimate six degrees-of-freedom poses in the case of small overlaps. We compare our L3D-based loop closure approach with recent approaches on LiDAR data and achieve state-of-the-art loop closure detection accuracy. Additionally, we embed our loop closure approach in RESLAM, a recent edge-based SLAM system, and perform the evaluation on real-world RGBD-TUM and synthetic ICL datasets. Our approach enables RESLAM to achieve a better localisation accuracy compared to its original loop closure strategy.

Revealing and Protecting Labels in Distributed Training

  • Authors: Trung Dang, Om Thakkar, Swaroop Ramaswamy, Rajiv Mathews, Peter Chin, Françoise Beaufays
  • Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2111.00556
  • Pdf link: https://arxiv.org/pdf/2111.00556
  • Abstract
    Distributed learning paradigms such as federated learning often involve transmission of model updates, or gradients, over a network, thereby avoiding transmission of private data. However, it is possible for sensitive information about the training data to be revealed from such gradients. Prior works have demonstrated that labels can be revealed analytically from the last layer of certain models (e.g., ResNet), or they can be reconstructed jointly with model inputs by using Gradients Matching [Zhu et al'19] with additional knowledge about the current state of the model. In this work, we propose a method to discover the set of labels of training samples from only the gradient of the last layer and the id to label mapping. Our method is applicable to a wide variety of model architectures across multiple domains. We demonstrate the effectiveness of our method for model training in two domains - image classification, and automatic speech recognition. Furthermore, we show that existing reconstruction techniques improve their efficacy when used in conjunction with our method. Conversely, we demonstrate that gradient quantization and sparsification can significantly reduce the success of the attack.

Deep Recursive Embedding for High-Dimensional Data

  • Authors: Zixia Zhou, Xinrui Zu, Yuanyuan Wang, Boudewijn P.F. Lelieveldt, Qian Tao
  • Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
  • Arxiv link: https://arxiv.org/abs/2111.00622
  • Pdf link: https://arxiv.org/pdf/2111.00622
  • Abstract
    Embedding high-dimensional data onto a low-dimensional manifold is of both theoretical and practical value. In this paper, we propose to combine deep neural networks (DNN) with mathematics-guided embedding rules for high-dimensional data embedding. We introduce a generic deep embedding network (DEN) framework, which is able to learn a parametric mapping from high-dimensional space to low-dimensional space, guided by well-established objectives such as Kullback-Leibler (KL) divergence minimization. We further propose a recursive strategy, called deep recursive embedding (DRE), to make use of the latent data representations for boosted embedding performance. We exemplify the flexibility of DRE by different architectures and loss functions, and benchmarked our method against the two most popular embedding methods, namely, t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP). The proposed DRE method can map out-of-sample data and scale to extremely large datasets. Experiments on a range of public datasets demonstrated improved embedding performance in terms of local and global structure preservation, compared with other state-of-the-art embedding methods.

SADGA: Structure-Aware Dual Graph Aggregation Network for Text-to-SQL

  • Authors: Ruichu Cai, Jinjie Yuan, Boyan Xu, Zhifeng Hao
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2111.00653
  • Pdf link: https://arxiv.org/pdf/2111.00653
  • Abstract
    The Text-to-SQL task, aiming to translate the natural language of the questions into SQL queries, has drawn much attention recently. One of the most challenging problems of Text-to-SQL is how to generalize the trained model to the unseen database schemas, also known as the cross-domain Text-to-SQL task. The key lies in the generalizability of (i) the encoding method to model the question and the database schema and (ii) the question-schema linking method to learn the mapping between words in the question and tables/columns in the database schema. Focusing on the above two key issues, we propose a Structure-Aware Dual Graph Aggregation Network (SADGA) for cross-domain Text-to-SQL. In SADGA, we adopt the graph structure to provide a unified encoding model for both the natural language question and database schema. Based on the proposed unified modeling, we further devise a structure-aware aggregation method to learn the mapping between the question-graph and schema-graph. The structure-aware aggregation method is featured with Global Graph Linking, Local Graph Linking, and Dual-Graph Aggregation Mechanism. We not only study the performance of our proposal empirically but also achieved 3rd place on the challenging Text-to-SQL benchmark Spider at the time of writing.

Edge-Level Explanations for Graph Neural Networks by Extending Explainability Methods for Convolutional Neural Networks

  • Authors: Tetsu Kasanishi, Xueting Wang, Toshihiko Yamasaki
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.00722
  • Pdf link: https://arxiv.org/pdf/2111.00722
  • Abstract
    Graph Neural Networks (GNNs) are deep learning models that take graph data as inputs, and they are applied to various tasks such as traffic prediction and molecular property prediction. However, owing to the complexity of the GNNs, it has been difficult to analyze which parts of inputs affect the GNN model's outputs. In this study, we extend explainability methods for Convolutional Neural Networks (CNNs), such as Local Interpretable Model-Agnostic Explanations (LIME), Gradient-Based Saliency Maps, and Gradient-Weighted Class Activation Mapping (Grad-CAM) to GNNs, and predict which edges in the input graphs are important for GNN decisions. The experimental results indicate that the LIME-based approach is the most efficient explainability method for multiple tasks in the real-world situation, outperforming even the state-of-the-art method in GNN explainability.

MetroLoc: Metro Vehicle Mapping and Localization with LiDAR-Camera-Inertial Integration

  • Authors: Yusheng Wang, Weiwei Song, Yi Zhang, Fei Huang, Zhiyong Tu, Yidong Lou
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.00762
  • Pdf link: https://arxiv.org/pdf/2111.00762
  • Abstract
    We propose an accurate and robust multi-modal sensor fusion framework, MetroLoc, targeting one of the most extreme scenarios, large-scale metro vehicle localization and mapping. MetroLoc is built atop an IMU-centric state estimator that tightly couples light detection and ranging (LiDAR), visual, and inertial information with the convenience of loosely coupled methods. The proposed framework is composed of three submodules: IMU odometry, LiDAR-inertial odometry (LIO), and visual-inertial odometry (VIO). The IMU is treated as the primary sensor, which uses the observations from LIO and VIO to constrain the accelerometer and gyroscope biases. Compared to previous point-only LIO methods, our approach leverages more geometry information by introducing both line and plane features into motion estimation. The VIO also utilizes the environmental structure information by employing both lines and points. Our proposed method has been extensively tested in long-duration metro environments with a maintenance vehicle. Experimental results show the system to be more accurate and robust than state-of-the-art approaches while running in real time. In addition, we develop a series of Virtual Reality (VR) applications towards efficient, economical, and interactive rail vehicle state and trackside infrastructure monitoring, which has already been deployed to an outdoor testing railroad.

Dynamic Distances in Hyperbolic Graphs

  • Authors: Eryk Kopczyński, Dorota Celińska-Kopczyńska
  • Subjects: Data Structures and Algorithms (cs.DS)
  • Arxiv link: https://arxiv.org/abs/2111.01019
  • Pdf link: https://arxiv.org/pdf/2111.01019
  • Abstract
    We consider the following dynamic problem: given a fixed (small) template graph with colored vertices C and a large graph with colored vertices G (whose colors can be changed dynamically), how many mappings m are there from the vertices of C to vertices of G in such a way that the colors agree, and the distances between m(v) and m(w) have given values for every edge? We show that this problem can be solved efficiently on triangulations of the hyperbolic plane, as well as other Gromov hyperbolic graphs. For various template graphs C, this result lets us efficiently solve various computational problems which are relevant in applications, such as visualization of hierarchical data and social network analysis.

Keyword: localization

Multi-User Augmented Reality with Infrastructure-free Collaborative Localization

  • Authors: John Miller, Elahe Soltanaghai, Raewyn Duvall, Jeff Chen, Vikram Bhat, Nuno Pereira, Anthony Rowe
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2111.00174
  • Pdf link: https://arxiv.org/pdf/2111.00174
  • Abstract
    Multi-user augmented reality (AR) could someday empower first responders with the ability to see team members around corners and through walls. For this vision of people tracking in dynamic environments to be practical, we need a relative localization system that is nearly instantly available across wide-areas without any existing infrastructure or manual setup. In this paper, we present LocAR, an infrastructure-free 6-degrees-of-freedom (6DoF) localization system for AR applications that uses motion estimates and range measurements between users to establish an accurate relative coordinate system. We show that not only is it possible to perform collaborative localization without infrastructure or global coordinates, but that our approach provides nearly the same level of accuracy as fixed infrastructure approaches for AR teaming applications. LocAR uses visual-inertial odometry (VIO) in conjunction with ultra-wideband (UWB) ranging radios to estimate the relative position of each device in an ad-hoc manner. The system leverages a collaborative 6DoF particle filtering formulation that operates on sporadic messages exchanged between nearby users. Unlike map or landmark sharing approaches, this allows for collaborative AR sessions even if users do not overlap the same spaces. LocAR consists of an open-source UWB firmware and reference mobile phone application that can display the location of team members in real-time using mobile AR. We evaluate LocAR across multiple buildings under a wide-variety of conditions including a contiguous 30,000 square foot region spanning multiple floors and find that it achieves median geometric error in 3D of less than 1 meter between five users freely walking across 3 floors.
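
The collaborative part can be illustrated with the standard range-update step of a particle filter: reweight one user's particles by how well they explain a UWB range to a peer, then resample. This is a generic sketch, not the LocAR implementation.

```python
import numpy as np

def uwb_range_update(particles, weights, peer_pos, measured_range, sigma=0.2):
    """Reweight and resample particles given one UWB range to a peer device.

    particles: (N, 3) hypothesized positions of this user
    weights:   (N,) current particle weights
    peer_pos:  (3,) estimated position of the ranging peer
    """
    predicted = np.linalg.norm(particles - peer_pos, axis=1)
    likelihood = np.exp(-0.5 * ((predicted - measured_range) / sigma) ** 2)
    weights = weights * likelihood
    weights /= weights.sum() + 1e-12
    # Systematic resampling keeps the particle set focused on likely positions.
    n = len(weights)
    positions = (np.arange(n) + np.random.rand()) / n
    idx = np.minimum(np.searchsorted(np.cumsum(weights), positions), n - 1)
    return particles[idx], np.full(n, 1.0 / n)
```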

Personal thermal comfort models using digital twins: Preference prediction with BIM-extracted spatial-temporal proximity data from Build2Vec

  • Authors: Mahmoud Abdelrahman, Adrian Chong, Clayton Miller
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.00199
  • Pdf link: https://arxiv.org/pdf/2111.00199
  • Abstract
    Conventional thermal preference prediction in buildings has limitations due to the difficulty in capturing all environmental and personal factors. New model features can improve the ability of a machine learning model to classify a person's thermal preference. The spatial context of a building can provide information to models about the windows, walls, heating and cooling sources, air diffusers, and other factors that create micro-environments that influence thermal comfort. Due to spatial heterogeneity, it is impractical to position sensors at a high enough resolution to capture all conditions. This research aims to build upon an existing vector-based spatial model, called Build2Vec, for predicting spatial-temporal occupants' indoor environmental preferences. Build2Vec utilizes the spatial data from the Building Information Model (BIM) and indoor localization in a real-world setting. This framework uses longitudinal intensive thermal comfort subjective feedback from smart watch-based ecological momentary assessments (EMA). The aggregation of these data is combined into a graph network structure (i.e., objects and relations) and used as input for a classification model to predict occupant thermal preference. The results of a test implementation show 14-28% accuracy improvement over a set of baselines that use conventional thermal preference prediction input variables.

Hierarchical Deep Residual Reasoning for Temporal Moment Localization

  • Authors: Ziyang Ma, Xianjing Han, Xuemeng Song, Yiran Cui, Liqiang Nie
  • Subjects: Multimedia (cs.MM); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
  • Arxiv link: https://arxiv.org/abs/2111.00417
  • Pdf link: https://arxiv.org/pdf/2111.00417
  • Abstract
    Temporal Moment Localization (TML) in untrimmed videos is a challenging task in the field of multimedia, which aims at localizing the start and end points of the activity in the video, described by a sentence query. Existing methods mainly focus on mining the correlation between video and sentence representations or investigating the fusion manner of the two modalities. These works mainly understand the video and sentence coarsely, ignoring the fact that a sentence can be understood from various semantics, and the dominant words affecting the moment localization in the semantics are the action and object reference. Toward this end, we propose a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics to achieve a finer-grained localization. Furthermore, considering that videos with different resolution and sentences with different length have different difficulty in understanding, we design the simple yet effective Res-BiGRUs for feature fusion, which is able to grasp the useful information in a self-adapting manner. Extensive experiments conducted on Charades-STA and ActivityNet-Captions datasets demonstrate the superiority of our HDRR model compared with other state-of-the-art methods.

Fully convolutional Siamese neural networks for buildings damage assessment from satellite images

  • Authors: Eugene Khvedchenya, Tatiana Gabruseva
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2111.00508
  • Pdf link: https://arxiv.org/pdf/2111.00508
  • Abstract
    Damage assessment after natural disasters is needed to optimally distribute aid and recovery forces. This process involves acquiring satellite imagery for the region of interest, localization of buildings, and classification of the amount of damage caused by nature or urban factors to buildings. In the case of natural disasters, this means processing many square kilometers of the area to judge whether a particular building has suffered from the damaging factors. In this work, we develop a computational approach for an automated comparison of the same region's satellite images before and after the disaster, and classify different levels of damage in buildings. Our solution is based on Siamese neural networks with an encoder-decoder architecture. We include an extensive ablation study and compare different encoders, decoders, loss functions, augmentations, and several methods to combine two images. The solution achieved one of the best results in the Computer Vision for Building Damage Assessment competition.

Feature Aggregation and Refinement Network for 2D Anatomical Landmark Detection

  • Authors: Yueyuan Ao, Hong Wu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.00659
  • Pdf link: https://arxiv.org/pdf/2111.00659
  • Abstract
    Localization of anatomical landmarks is essential for clinical diagnosis, treatment planning, and research. In this paper, we propose a novel deep network, named feature aggregation and refinement network (FARNet), for the automatic detection of anatomical landmarks. To alleviate the problem of limited training data in the medical domain, our network adopts a deep network pre-trained on natural images as the backbone network, and several popular backbones have been compared. Our FARNet also includes a multi-scale feature aggregation module for multi-scale feature fusion and a feature refinement module for high-resolution heatmap regression. Coarse-to-fine supervisions are applied to the two modules to facilitate end-to-end training. We further propose a novel loss function named Exponential Weighted Center loss for accurate heatmap regression, which focuses on the losses from the pixels near landmarks and suppresses those far away. Our network has been evaluated on three publicly available anatomical landmark detection datasets, including cephalometric radiographs, hand radiographs, and spine radiographs, and achieves state-of-the-art performance on all three datasets. Code is available at: https://github.com/JuvenileInWind/FARNet
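
The exponential weighting idea behind that loss — emphasising pixels near a landmark and suppressing distant ones — can be illustrated with a short NumPy sketch. The decay constant and exact weighting form below are assumptions for illustration, not the paper's Exponential Weighted Center loss.

```python
import numpy as np

def weighted_heatmap_loss(pred, target, landmark, alpha=0.05):
    """Squared-error heatmap loss whose per-pixel weight decays
    exponentially with distance from the landmark (illustrative only)."""
    h, w = target.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.sqrt((ys - landmark[0]) ** 2 + (xs - landmark[1]) ** 2)
    weight = np.exp(-alpha * dist)            # near-landmark pixels dominate the loss
    return np.sum(weight * (pred - target) ** 2) / np.sum(weight)

# toy usage: a 64x64 heatmap with a landmark at row 32, column 20
gt = np.zeros((64, 64)); gt[32, 20] = 1.0
noisy = gt + 0.01 * np.random.randn(64, 64)
print(weighted_heatmap_loss(noisy, gt, landmark=(32, 20)))
```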

MetroLoc: Metro Vehicle Mapping and Localization with LiDAR-Camera-Inertial Integration

  • Authors: Yusheng Wang, Weiwei Song, Yi Zhang, Fei Huang, Zhiyong Tu, Yidong Lou
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.00762
  • Pdf link: https://arxiv.org/pdf/2111.00762
  • Abstract
    We propose an accurate and robust multi-modal sensor fusion framework, MetroLoc, towards one of the most extreme scenarios, large-scale metro vehicle localization and mapping. MetroLoc is built atop an IMU-centric state estimator that tightly couples light detection and ranging (LiDAR), visual, and inertial information with the convenience of loosely coupled methods. The proposed framework is composed of three submodules: IMU odometry, LiDAR-inertial odometry (LIO), and visual-inertial odometry (VIO). The IMU is treated as the primary sensor, and the observations from LIO and VIO are used to constrain the accelerometer and gyroscope biases. Compared to previous point-only LIO methods, our approach leverages more geometry information by introducing both line and plane features into motion estimation. The VIO also utilizes environmental structure information by employing both lines and points. Our proposed method has been extensively tested in long-duration metro environments with a maintenance vehicle. Experimental results show that the system is more accurate and robust than state-of-the-art approaches while running in real time. Besides, we develop a series of Virtual Reality (VR) applications towards efficient, economical, and interactive rail vehicle state and trackside infrastructure monitoring, which have already been deployed to an outdoor testing railroad.

New submissions for Tue, 9 Nov 21

Keyword: SLAM

Online Adaptation of Monocular Depth Prediction with Visual SLAM

  • Authors: Shing Yan Loo, Moein Shakeri, Sai Hong Tang, Syamsiah Mashohor, Hong Zhang
  • Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.04096
  • Pdf link: https://arxiv.org/pdf/2111.04096
  • Abstract
    The ability of a CNN to predict depth accurately is a major challenge for its wide use in practical visual SLAM applications, such as enhanced camera tracking and dense mapping. This paper sets out to answer the following question: can we tune a depth prediction CNN with the help of a visual SLAM algorithm, even if the CNN is not trained for the current operating environment, in order to benefit the SLAM performance? To this end, we propose a novel online adaptation framework consisting of two complementary processes: a SLAM algorithm that is used to generate keyframes to fine-tune the depth prediction, and another algorithm that uses the online adapted depth to improve map quality. Once the potentially noisy map points are removed, we perform global photometric bundle adjustment (BA) to improve the overall SLAM performance. Experimental results on both benchmark datasets and a real robot in our own experimental environments show that our proposed method improves the SLAM reconstruction accuracy. We demonstrate the use of regularization in the training loss as an effective means to prevent catastrophic forgetting. In addition, we compare our online adaptation framework against state-of-the-art pre-trained depth prediction CNNs to show that our online adapted CNN outperforms depth prediction CNNs that have been trained on a large collection of datasets.
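
The regularization mentioned above as a guard against catastrophic forgetting is commonly realised as an L2 penalty that anchors the fine-tuned weights to the pre-trained ones. The sketch below follows that generic recipe; the data term, the weight `lam`, and the tiny stand-in network are placeholders, not the paper's exact objective.

```python
import copy
import torch

def adaptation_loss(model, pretrained_model, depth_pred, depth_ref, lam=1e-3):
    """Depth fitting term plus an L2 anchor to the pre-trained weights;
    a generic sketch, not the paper's exact training loss."""
    data_term = torch.nn.functional.l1_loss(depth_pred, depth_ref)
    reg = sum(((p - q.detach()) ** 2).sum()
              for p, q in zip(model.parameters(), pretrained_model.parameters()))
    return data_term + lam * reg

# toy usage with a small stand-in network
net = torch.nn.Conv2d(1, 1, 3, padding=1)
frozen = copy.deepcopy(net)                 # snapshot of the "pre-trained" weights
x = torch.rand(1, 1, 32, 32)
loss = adaptation_loss(net, frozen, net(x), torch.rand(1, 1, 32, 32))
loss.backward()
```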

Hierarchical Segment-based Optimization for SLAM

  • Authors: Yuxin Tian, Yujie Wang, Ming Ouyang, Xuesong Shi
  • Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.04101
  • Pdf link: https://arxiv.org/pdf/2111.04101
  • Abstract
    This paper presents a hierarchical segment-based optimization method for Simultaneous Localization and Mapping (SLAM) systems. First, we propose a reliable trajectory segmentation method that can be used to increase efficiency in the back-end optimization. Then, we propose a buffer mechanism, for the first time, to improve the robustness of the segmentation. During the optimization, we use global information to optimize the frames with large error, and interpolation instead of optimization to update well-estimated frames, thereby hierarchically allocating the amount of computation according to the error of each frame. Comparative experiments on the benchmark show that our method greatly improves the efficiency of optimization with almost no drop in accuracy, and outperforms existing high-efficiency optimization methods by a large margin.
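
The "interpolation instead of optimization" update for well-estimated frames can be illustrated by interpolating an intermediate pose between two optimized segment endpoints: linear interpolation for translation and spherical linear interpolation (slerp) for rotation. This is a generic sketch under those assumptions, not the paper's exact update rule.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

# optimized poses at the two ends of a trajectory segment (toy values)
t0, t1 = np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.5, 0.0])
rots = Rotation.from_euler("z", [0.0, 30.0], degrees=True)   # rotations at both ends
slerp = Slerp([0.0, 1.0], rots)

def interpolate_pose(alpha):
    """Pose of an intermediate frame at fraction alpha along the segment."""
    trans = (1 - alpha) * t0 + alpha * t1      # linear interpolation of translation
    rot = slerp([alpha])[0]                    # spherical interpolation of rotation
    return trans, rot

trans, rot = interpolate_pose(0.25)
print(trans, rot.as_euler("xyz", degrees=True))
```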

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

3D Siamese Voxel-to-BEV Tracker for Sparse Point Clouds

  • Authors: Le Hui, Lingpeng Wang, Mingmei Cheng, Jin Xie, Jian Yang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.04426
  • Pdf link: https://arxiv.org/pdf/2111.04426
  • Abstract
    3D object tracking in point clouds is still a challenging problem due to the sparsity of LiDAR points in dynamic environments. In this work, we propose a Siamese voxel-to-BEV tracker, which can significantly improve the tracking performance in sparse 3D point clouds. Specifically, it consists of a Siamese shape-aware feature learning network and a voxel-to-BEV target localization network. The Siamese shape-aware feature learning network can capture 3D shape information of the object to learn the discriminative features of the object so that the potential target from the background in sparse point clouds can be identified. To this end, we first perform template feature embedding to embed the template's feature into the potential target and then generate a dense 3D shape to characterize the shape information of the potential target. For localizing the tracked target, the voxel-to-BEV target localization network regresses the target's 2D center and the $z$-axis center from the dense bird's eye view (BEV) feature map in an anchor-free manner. Concretely, we compress the voxelized point cloud along $z$-axis through max pooling to obtain a dense BEV feature map, where the regression of the 2D center and the $z$-axis center can be performed more effectively. Extensive evaluation on the KITTI and nuScenes datasets shows that our method significantly outperforms the current state-of-the-art methods by a large margin.
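
The BEV compression step described in the abstract — max pooling the voxelized features along the z-axis to obtain a dense bird's eye view map — is easy to reproduce. The sketch below assumes a dense voxel feature tensor with made-up sizes, whereas a real tracker would typically operate on sparse voxels.

```python
import torch

# voxel features: (batch, channels, depth_z, height_y, width_x) — toy sizes
voxels = torch.rand(1, 16, 10, 128, 128)

# collapse the z dimension with max pooling to obtain a dense BEV feature map
bev, _ = voxels.max(dim=2)       # -> (1, 16, 128, 128)
print(bev.shape)
```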

Keyword: loop detection

There is no result

Keyword: autonomous driving

Towards Learning Generalizable Driving Policies from Restricted Latent Representations

  • Authors: Behrad Toghi, Rodolfo Valiente, Ramtin Pedarsani, Yaser P. Fallah
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.03688
  • Pdf link: https://arxiv.org/pdf/2111.03688
  • Abstract
    Training intelligent agents that can drive autonomously in various urban and highway scenarios has been a hot topic in the robotics society within the last decades. However, the diversity of driving environments in terms of road topology and positioning of the neighboring vehicles makes this problem very challenging. It goes without saying that although scenario-specific driving policies for autonomous driving are promising and can improve transportation safety and efficiency, they are clearly not a universal scalable solution. Instead, we seek decision-making schemes and driving policies that can generalize to novel and unseen environments. In this work, we capitalize on the key idea that human drivers learn abstract representations of their surroundings that are fairly similar among various driving scenarios and environments. Through these representations, human drivers are able to quickly adapt to novel environments and drive in unseen conditions. Formally, through imposing an information bottleneck, we extract a latent representation that minimizes the \textit{distance} -- a quantification that we introduce to gauge the similarity among different driving configurations -- between driving scenarios. This latent space is then employed as the input to a Q-learning module to learn generalizable driving policies. Our experiments revealed that, using this latent representation can reduce the number of crashes to about half.

Get a Model! Model Hijacking Attack Against Machine Learning Models

  • Authors: Ahmed Salem, Michael Backes, Yang Zhang
  • Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.04394
  • Pdf link: https://arxiv.org/pdf/2111.04394
  • Abstract
    Machine learning (ML) has established itself as a cornerstone for various critical applications ranging from autonomous driving to authentication systems. However, with this increasing adoption rate of machine learning models, multiple attacks have emerged. One class of such attacks is training time attack, whereby an adversary executes their attack before or during the machine learning model training. In this work, we propose a new training time attack against computer vision based machine learning models, namely model hijacking attack. The adversary aims to hijack a target model to execute a different task than its original one without the model owner noticing. Model hijacking can cause accountability and security risks since a hijacked model owner can be framed for having their model offering illegal or unethical services. Model hijacking attacks are launched in the same way as existing data poisoning attacks. However, one requirement of the model hijacking attack is to be stealthy, i.e., the data samples used to hijack the target model should look similar to the model's original training dataset. To this end, we propose two different model hijacking attacks, namely Chameleon and Adverse Chameleon, based on a novel encoder-decoder style ML model, namely the Camouflager. Our evaluation shows that both of our model hijacking attacks achieve a high attack success rate, with a negligible drop in model utility.

Keyword: mapping

Disaster mapping from satellites: damage detection with crowdsourced point labels

  • Authors: Danil Kuzin, Olga Isupova, Brooke D. Simmons, Steven Reece
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2111.03693
  • Pdf link: https://arxiv.org/pdf/2111.03693
  • Abstract
    High-resolution satellite imagery available immediately after disaster events is crucial for response planning as it facilitates broad situational awareness of critical infrastructure status such as building damage, flooding, and obstructions to access routes. Damage mapping at this scale would require hundreds of expert person-hours. However, a combination of crowdsourcing and recent advances in deep learning reduces the effort needed to just a few hours in real time. Asking volunteers to place point marks, as opposed to shapes of actual damaged areas, significantly decreases the required analysis time for response during the disaster. However, different volunteers may be inconsistent in their marking. This work presents methods for aggregating potentially inconsistent damage marks to train a neural network damage detector.
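
As a rough illustration of aggregating inconsistent volunteer point marks (a simple clustering baseline, not necessarily the aggregation method proposed in the paper), one can group nearby marks and keep clusters that enough volunteers agree on:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# (x, y) marks placed by different volunteers on the same image tile (toy data)
marks = np.array([[10, 12], [11, 13], [9, 11],   # three marks on one damaged building
                  [40, 42], [41, 40],            # two marks on another
                  [80, 5]])                      # an isolated, likely spurious mark

labels = DBSCAN(eps=5, min_samples=2).fit_predict(marks)

# consensus damage points = centroids of the non-noise clusters
for k in set(labels) - {-1}:
    print(k, marks[labels == k].mean(axis=0))
```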

Spatiotemporal Impact of Hurricanes on a Power Grid

  • Authors: Abodh Poudyal, Vishnu Iyengar, Diego Garcia-Camargo, Anamika Dubey
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2111.03711
  • Pdf link: https://arxiv.org/pdf/2111.03711
  • Abstract
    Almost 90% of the major power outages in the US are caused by hurricanes. Due to the highly uncertain nature of hurricanes in both spatial and temporal dimensions, it is essential to quantify the effect of such hurricanes on a power grid. In this paper, we provide a Monte-Carlo-based framework in which several hurricane scenarios and their impact on a power grid are analyzed in spatiotemporal dimensions. The hurricane simulations are performed using samples from previously occurring hurricanes in the US, whereas probabilistic assessment of the transmission lines is performed through a line fragility model. Finally, a loss metric based on the amount of load disconnected due to hurricanes traveling inland is calculated for each time step. The simulation is performed on ACTIVSg2000, a 2000-bus synthetic Texas grid, while mapping the transmission lines of the test case onto the geographical footprint of Texas. The simulation results show that the loss increases significantly for a few time steps when the wind field of a hurricane is intense and almost saturates as the hurricane's intensity decays while it travels further. The proposed analysis can provide insights for proactive planning strategies to improve the resilience of the power grid.
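
The Monte-Carlo structure described above — sample hurricane wind fields, evaluate a fragility curve per transmission line, and tally the disconnected load — can be sketched generically. The fragility curve, wind distribution, and line loads below are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def fragility(wind_speed, v50=50.0, beta=0.1):
    """Probability that a line fails at the given wind speed
    (an illustrative logistic curve, not the paper's fragility model)."""
    return 1.0 / (1.0 + np.exp(-beta * (wind_speed - v50)))

line_loads = np.array([30.0, 80.0, 45.0, 120.0])   # MW served through each line (toy values)

losses = []
for _ in range(1000):                               # Monte Carlo over hurricane samples
    winds = rng.normal(55, 15, size=line_loads.size)     # sampled wind speed at each line
    failed = rng.random(line_loads.size) < fragility(winds)
    losses.append(line_loads[failed].sum())         # load disconnected in this scenario

print("expected load loss (MW):", np.mean(losses))
```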

String Abstractions for Qubit Mapping

  • Authors: Blake Gerard, Martin Kong
  • Subjects: Emerging Technologies (cs.ET); Quantum Physics (quant-ph)
  • Arxiv link: https://arxiv.org/abs/2111.03716
  • Pdf link: https://arxiv.org/pdf/2111.03716
  • Abstract
    One of the key compilation steps in Quantum Computing (QC) is to determine an initial logical-to-physical mapping of the qubits used in a quantum circuit. The impact of the starting qubit layout can vastly affect later scheduling and placement decisions of QASM operations, yielding higher values on critical performance metrics (gate count and circuit depth) as a result of quantum compilers introducing SWAP operations to meet the underlying physical neighboring and connectivity constraints of the quantum device. In this paper we introduce a novel qubit mapping approach, string-based qubit mapping. The key insight is to prioritize the mapping of logical qubits that appear in the longest repeating non-overlapping substrings of the qubit pairs accessed. This mapping method is complemented by allocating qubits according to their global frequency usage. We evaluate and compare our new mapping scheme against two quantum compilers (QISKIT and TKET) and two device topologies, the IBM Manhattan (65 qubits) and the IBM Kolkata (27 qubits). Our results demonstrate that combining both mapping mechanisms often achieves better results than either one individually, allowing us to best the QISKIT and TKET baselines, yielding between 13% and 17% average improvement in several group sizes, up to 32% circuit depth reduction, and 63% gate volume improvement.
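
The core string step — finding the longest repeating non-overlapping substring of the sequence of accessed qubit pairs so that those logical qubits can be mapped first — can be sketched with a standard O(n^2) dynamic program. The tie-breaking rules and the complementary frequency-based allocation described in the paper are not reproduced here.

```python
def longest_repeating_substring(seq):
    """Longest repeated, non-overlapping contiguous subsequence (O(n^2) DP)."""
    n = len(seq)
    dp = [[0] * (n + 1) for _ in range(n + 1)]
    best_len, best_end = 0, 0
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            # extend a common suffix only while the two occurrences do not overlap
            if seq[i - 1] == seq[j - 1] and dp[i - 1][j - 1] < j - i:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best_len:
                    best_len, best_end = dp[i][j], i
            else:
                dp[i][j] = 0
    return seq[best_end - best_len:best_end]

# sequence of two-qubit gates expressed as logical qubit pairs (toy circuit)
gates = [(0, 1), (1, 2), (0, 1), (1, 2), (2, 3), (0, 1), (1, 2)]
print(longest_repeating_substring(gates))   # qubits in this pattern would be mapped first
```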

Multi-modal land cover mapping of remote sensing images using pyramid attention and gated fusion networks

  • Authors: Qinghui Liu, Michael Kampffmeyer, Robert Jenssen, Arnt-Børre Salberg
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2111.03845
  • Pdf link: https://arxiv.org/pdf/2111.03845
  • Abstract
    Multi-modality data is becoming readily available in remote sensing (RS) and can provide complementary information about the Earth's surface. Effective fusion of multi-modal information is thus important for various applications in RS, but also very challenging due to large domain differences, noise, and redundancies. There is a lack of effective and scalable fusion techniques for bridging multiple modality encoders and fully exploiting complementary information. To this end, we propose a new multi-modality network (MultiModNet) for land cover mapping of multi-modal remote sensing data based on a novel pyramid attention fusion (PAF) module and a gated fusion unit (GFU). The PAF module is designed to efficiently obtain rich fine-grained contextual representations from each modality with a built-in cross-level and cross-view attention fusion mechanism, and the GFU module utilizes a novel gating mechanism for early merging of features, thereby diminishing hidden redundancies and noise. This enables supplementary modalities to effectively extract the most valuable and complementary information for late feature fusion. Extensive experiments on two representative RS benchmark datasets demonstrate the effectiveness, robustness, and superiority of the MultiModNet for multi-modal land cover classification.
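
The gated fusion idea — a learned gate deciding how much each modality contributes before fusion — can be sketched generically in PyTorch. The 1x1-convolution gate and channel sizes below are assumptions; the paper's GFU and PAF modules are more elaborate.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse two modality feature maps with a learned per-pixel gate."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_a, feat_b):
        g = self.gate(torch.cat([feat_a, feat_b], dim=1))
        return g * feat_a + (1 - g) * feat_b   # gate balances the two modalities

optical = torch.rand(1, 32, 64, 64)   # e.g. optical image features
sar = torch.rand(1, 32, 64, 64)       # e.g. SAR or elevation features
print(GatedFusion(32)(optical, sar).shape)
```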

Registration Techniques for Deformable Objects

  • Authors: Alireza Ahmadi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)
  • Arxiv link: https://arxiv.org/abs/2111.04053
  • Pdf link: https://arxiv.org/pdf/2111.04053
  • Abstract
    In general, the problem of non-rigid registration is about matching two different scans of a dynamic object taken at two different points in time. These scans can undergo both rigid motions and non-rigid deformations. Since new parts of the model may come into view and other parts get occluded in between two scans, the region of overlap is a subset of both scans. In the most general setting, no prior template shape is given and no markers or explicit feature point correspondences are available. So, this case is a partial matching problem that takes into account the assumption that consequent scans undergo small deformations while having a significant amount of overlapping area [28]. The problem which this thesis is addressing is mapping deforming objects and localizing cameras in the environment at the same time.

Em-K Indexing for Approximate Query Matching in Large-scale ER

  • Authors: Samudra Herath, Matthew Roughan, Gary Glonek
  • Subjects: Databases (cs.DB); Information Retrieval (cs.IR)
  • Arxiv link: https://arxiv.org/abs/2111.04070
  • Pdf link: https://arxiv.org/pdf/2111.04070
  • Abstract
    Accurate and efficient entity resolution (ER) is a significant challenge in many data mining and analysis projects requiring integrating and processing massive data collections. It is becoming increasingly important in real-world applications to develop ER solutions that produce prompt responses for entity queries on large-scale databases. Some of these applications demand entity query matching against large-scale reference databases within a short time. We define this as the query matching problem in ER in this work. Indexing or blocking techniques reduce the search space and execution time in the ER process. However, approximate indexing techniques that scale to very large-scale datasets remain open to research. In this paper, we investigate the query matching problem in ER to propose an indexing method suitable for approximate and efficient query matching. We first use spatial mappings to embed records in a multidimensional Euclidean space that preserves the domain-specific similarity. Among the various mapping techniques, we choose multidimensional scaling. Then using a Kd-tree and the nearest neighbour search, the method returns a block of records that includes potential matches for a query. Our method can process queries against a large-scale dataset using only a fraction of the data $L$ (given the dataset size is $N$), with a $O(L^2)$ complexity where $L \ll N$. The experiments conducted on several datasets showed the effectiveness of the proposed method.
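
A minimal sketch of the embed-then-index idea: embed records with multidimensional scaling so that a domain-specific dissimilarity is preserved, then answer a query with a Kd-tree nearest-neighbour search that returns a block of candidate matches. The string dissimilarity, embedding dimension, and block size below are placeholders, not the paper's configuration.

```python
import difflib
import numpy as np
from scipy.spatial import cKDTree
from sklearn.manifold import MDS

records = ["john smith", "jon smyth", "jane doe", "j. doe", "peter pan"]

# pairwise dissimilarity (1 - string similarity); any domain metric could be used
n = len(records)
D = np.array([[1 - difflib.SequenceMatcher(None, a, b).ratio()
               for b in records] for a in records])

# embed the records into a low-dimensional Euclidean space that preserves D
emb = MDS(n_components=2, dissimilarity="precomputed",
          random_state=0).fit_transform(D)

# index the embedding and fetch a block of candidate matches for record 0
tree = cKDTree(emb)
_, idx = tree.query(emb[0], k=3)
print([records[i] for i in idx])
```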

Online Adaptation of Monocular Depth Prediction with Visual SLAM

  • Authors: Shing Yan Loo, Moein Shakeri, Sai Hong Tang, Syamsiah Mashohor, Hong Zhang
  • Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.04096
  • Pdf link: https://arxiv.org/pdf/2111.04096
  • Abstract
    The ability of a CNN to predict depth accurately is a major challenge for its wide use in practical visual SLAM applications, such as enhanced camera tracking and dense mapping. This paper sets out to answer the following question: can we tune a depth prediction CNN with the help of a visual SLAM algorithm, even if the CNN is not trained for the current operating environment, in order to benefit the SLAM performance? To this end, we propose a novel online adaptation framework consisting of two complementary processes: a SLAM algorithm that is used to generate keyframes to fine-tune the depth prediction, and another algorithm that uses the online adapted depth to improve map quality. Once the potentially noisy map points are removed, we perform global photometric bundle adjustment (BA) to improve the overall SLAM performance. Experimental results on both benchmark datasets and a real robot in our own experimental environments show that our proposed method improves the SLAM reconstruction accuracy. We demonstrate the use of regularization in the training loss as an effective means to prevent catastrophic forgetting. In addition, we compare our online adaptation framework against state-of-the-art pre-trained depth prediction CNNs to show that our online adapted CNN outperforms depth prediction CNNs that have been trained on a large collection of datasets.

Hierarchical Segment-based Optimization for SLAM

  • Authors: Yuxin Tian, Yujie Wang, Ming Ouyang, Xuesong Shi
  • Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.04101
  • Pdf link: https://arxiv.org/pdf/2111.04101
  • Abstract
    This paper presents a hierarchical segment-based optimization method for Simultaneous Localization and Mapping (SLAM) systems. First, we propose a reliable trajectory segmentation method that can be used to increase efficiency in the back-end optimization. Then, we propose a buffer mechanism, for the first time, to improve the robustness of the segmentation. During the optimization, we use global information to optimize the frames with large error, and interpolation instead of optimization to update well-estimated frames, thereby hierarchically allocating the amount of computation according to the error of each frame. Comparative experiments on the benchmark show that our method greatly improves the efficiency of optimization with almost no drop in accuracy, and outperforms existing high-efficiency optimization methods by a large margin.

NeurInt : Learning to Interpolate through Neural ODEs

  • Authors: Avinandan Bose, Aniket Das, Yatin Dandi, Piyush Rai
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.04123
  • Pdf link: https://arxiv.org/pdf/2111.04123
  • Abstract
    A wide range of applications require learning image generation models whose latent space effectively captures the high-level factors of variation present in the data distribution. The extent to which a model represents such variations through its latent space can be judged by its ability to interpolate between images smoothly. However, most generative models mapping a fixed prior to the generated images lead to interpolation trajectories lacking smoothness and containing images of reduced quality. In this work, we propose a novel generative model that learns a flexible non-parametric prior over interpolation trajectories, conditioned on a pair of source and target images. Instead of relying on deterministic interpolation methods (such as linear or spherical interpolation in latent space), we devise a framework that learns a distribution of trajectories between two given images using Latent Second-Order Neural Ordinary Differential Equations. Through a hybrid combination of reconstruction and adversarial losses, the generator is trained to map the sampled points from these trajectories to sequences of realistic images that smoothly transition from the source to the target image. Through comprehensive qualitative and quantitative experiments, we demonstrate our approach's effectiveness in generating images of improved quality as well as its ability to learn a diverse distribution over smooth interpolation trajectories for any pair of real source and target images.

Not All Fabrics Are Created Equal: Exploring eFPGA Parameters For IP Redaction

  • Authors: Jitendra Bhandari, Abdul Khader Thalakkattu Moosa, Benjamin Tan, Christian Pilato, Ganesh Gore, Xifan Tang, Scott Temple, Pierre-Emmanuel Gaillardo, Ramesh Karri
  • Subjects: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR)
  • Arxiv link: https://arxiv.org/abs/2111.04222
  • Pdf link: https://arxiv.org/pdf/2111.04222
  • Abstract
    Semiconductor design houses rely on third-party foundries to manufacture their integrated circuits (IC). While this trend allows them to tackle fabrication costs, it introduces security concerns as external (and potentially malicious) parties can access critical parts of the designs and steal or modify the Intellectual Property (IP). Embedded FPGA (eFPGA) redaction is a promising technique to protect critical IPs of an ASIC by \textit{redacting} (i.e., removing) critical parts and mapping them onto a custom reconfigurable fabric. Only trusted parties will receive the correct bitstream to restore the redacted functionality. While previous studies imply that using an eFPGA is a sufficient condition to provide security against IP threats like reverse-engineering, whether this truly holds for all eFPGA architectures is unclear, thus motivating the study in this paper. We examine the security of eFPGA fabrics generated by varying different FPGA design parameters. We characterize the power, performance, and area (PPA) characteristics and evaluate each fabric's resistance to SAT-based bitstream recovery. Our results encourage designers to work with custom eFPGA fabrics rather than off-the-shelf commercial FPGAs and reveals that only considering a redaction fabric's bitstream size is inadequate for gauging security.

Adaptive area-preserving parameterization of open and closed anatomical surfaces

  • Authors: Gary P. T. Choi, Amita Giri, Lalan Kumar
  • Subjects: Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
  • Arxiv link: https://arxiv.org/abs/2111.04265
  • Pdf link: https://arxiv.org/pdf/2111.04265
  • Abstract
    The parameterization of open and closed anatomical surfaces is of fundamental importance in many biomedical applications. Spherical harmonics, a set of basis functions defined on the unit sphere, are widely used for anatomical shape description. However, establishing a one-to-one correspondence between the object surface and the entire unit sphere may induce a large geometric distortion in case the shape of the surface is too different from a perfect sphere. In this work, we propose adaptive area-preserving parameterization methods for simply-connected open and closed surfaces with the target of the parameterization being a spherical cap. Our methods optimize the shape of the parameter domain along with the mapping from the object surface to the parameter domain. The object surface will be globally mapped to an optimal spherical cap region of the unit sphere in an area-preserving manner while also exhibiting low conformal distortion. We further develop a set of spherical harmonics-like basis functions defined over the adaptive spherical cap domain, which we call the adaptive harmonics. Experimental results show that the proposed parameterization methods outperform the existing methods for both open and closed anatomical surfaces in terms of area and angle distortion. Surface description of the object surfaces can be effectively achieved using a novel combination of the adaptive parameterization and the adaptive harmonics. Our work provides a novel way of mapping anatomical surfaces with improved accuracy and greater flexibility. More broadly, the idea of using an adaptive parameter domain allows easy handling of a wide range of biomedical shapes.

LMStream: When Distributed Micro-Batch Stream Processing Systems Meet GPU

  • Authors: Suyeon Lee, Sungyong Park
  • Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
  • Arxiv link: https://arxiv.org/abs/2111.04289
  • Pdf link: https://arxiv.org/pdf/2111.04289
  • Abstract
    This paper presents LMStream, which ensures bounded latency while maximizing throughput on GPU-enabled micro-batch streaming systems. The main ideas behind LMStream's design can be summarized as two novel mechanisms: (1) dynamic batching and (2) dynamic operation-level query planning. By controlling the micro-batch size, LMStream significantly reduces the latency of individual datasets because it does not perform unconditional buffering solely to improve GPU utilization. LMStream bounds the latency to an optimal value according to the characteristics of the window operation used in the streaming application. Dynamic mapping between a query and an execution device based on the data size, together with dynamic device preference, improves both the throughput and latency as much as possible. In addition, LMStream proposes a low-overhead online cost model parameter optimization method without interrupting the real-time stream processing. We implemented LMStream on Apache Spark, which supports micro-batch stream processing. Compared to the previous throughput-oriented method, LMStream showed a latency improvement of up to 70.7%, while improving average throughput by up to 1.74x.
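
The dynamic batching idea — shrinking or growing the micro-batch so that observed latency stays under a bound — can be sketched as a simple feedback controller. The multiplicative update rule and thresholds below are assumptions, not LMStream's actual policy.

```python
def next_batch_size(current_size, observed_latency, latency_bound,
                    min_size=64, max_size=65536):
    """Multiplicative-increase / multiplicative-decrease micro-batch controller."""
    if observed_latency > latency_bound:
        current_size = max(min_size, int(current_size * 0.5))   # back off to meet the bound
    elif observed_latency < 0.7 * latency_bound:
        current_size = min(max_size, int(current_size * 1.25))  # use the available slack
    return current_size

size = 1024
for latency in [0.8, 1.3, 0.5, 0.4]:        # observed seconds per micro-batch
    size = next_batch_size(size, latency, latency_bound=1.0)
    print(size)
```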

Keyword: localization

Using Monocular Vision and Human Body Priors for AUVs to Autonomously Approach Divers

  • Authors: Michael Fulton, Jungseok Hong, Junaed Sattar
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.03712
  • Pdf link: https://arxiv.org/pdf/2111.03712
  • Abstract
    Direct communication between humans and autonomous underwater vehicles (AUVs) is a relatively underexplored area in human-robot interaction (HRI) research, although many tasks (e.g., surveillance, inspection, and search-and-rescue) require close diver-robot collaboration. Many core functionalities in this domain are in need of further study to improve robotic capabilities for ease of interaction. One of these is the challenge of autonomous robots approaching and positioning themselves relative to divers to initiate and facilitate interactions. Suboptimal AUV positioning can lead to poor-quality interaction and excessive cognitive and physical load for divers. In this paper, we introduce a novel method for AUVs to autonomously navigate and achieve diver-relative positioning to begin interaction. Our method is based only on monocular vision, requires no global localization, and is computationally efficient. We present our algorithm along with an implementation of said algorithm on board both a simulated and a physical AUV, performing extensive evaluations in the form of closed-water tests in a controlled pool. Analysis of our results shows that the proposed monocular vision-based algorithm performs reliably and efficiently while operating entirely on board the AUV.

Asynchronous Collaborative Localization by Integrating Spatiotemporal Graph Learning with Model-Based Estimation

  • Authors: Peng Gao, Brian Reily, Rui Guo, Hongsheng Lu, Qingzhao Zhu, Hao Zhang
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.03751
  • Pdf link: https://arxiv.org/pdf/2111.03751
  • Abstract
    Collaborative localization is an essential capability for a team of robots such as connected vehicles to collaboratively estimate object locations from multiple perspectives with reliant cooperation. To enable collaborative localization, four key challenges must be addressed, including modeling complex relationships between observed objects, fusing observations from an arbitrary number of collaborating robots, quantifying localization uncertainty, and addressing latency of robot communications. In this paper, we introduce a novel approach that integrates uncertainty-aware spatiotemporal graph learning and model-based state estimation for a team of robots to collaboratively localize objects. Specifically, we introduce a new uncertainty-aware graph learning model that learns spatiotemporal graphs to represent historical motions of the objects observed by each robot over time and provides uncertainties in object localization. Moreover, we propose a novel method for integrated learning and model-based state estimation, which fuses asynchronous observations obtained from an arbitrary number of robots for collaborative localization. We evaluate our approach in two collaborative object localization scenarios in simulations and on real robots. Experimental results show that our approach outperforms previous methods and achieves state-of-the-art performance on asynchronous collaborative localization.

CALText: Contextual Attention Localization for Offline Handwritten Text

  • Authors: Tayaba Anjum, Nazar Khan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
  • Arxiv link: https://arxiv.org/abs/2111.03952
  • Pdf link: https://arxiv.org/pdf/2111.03952
  • Abstract
    Recognition of Arabic-like scripts such as Persian and Urdu is more challenging than Latin-based scripts. This is due to the presence of a two-dimensional structure, context-dependent character shapes, spaces and overlaps, and placement of diacritics. Not much research exists for offline handwritten Urdu script which is the 10th most spoken language in the world. We present an attention based encoder-decoder model that learns to read Urdu in context. A novel localization penalty is introduced to encourage the model to attend only one location at a time when recognizing the next character. In addition, we comprehensively refine the only complete and publicly available handwritten Urdu dataset in terms of ground-truth annotations. We evaluate the model on both Urdu and Arabic datasets and show that contextual attention localization outperforms both simple attention and multi-directional LSTM models.

Hierarchical Segment-based Optimization for SLAM

  • Authors: Yuxin Tian, Yujie Wang, Ming Ouyang, Xuesong Shi
  • Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.04101
  • Pdf link: https://arxiv.org/pdf/2111.04101
  • Abstract
    This paper presents a hierarchical segment-based optimization method for Simultaneous Localization and Mapping (SLAM) systems. First, we propose a reliable trajectory segmentation method that can be used to increase efficiency in the back-end optimization. Then, we propose a buffer mechanism, for the first time, to improve the robustness of the segmentation. During the optimization, we use global information to optimize the frames with large error, and interpolation instead of optimization to update well-estimated frames, thereby hierarchically allocating the amount of computation according to the error of each frame. Comparative experiments on the benchmark show that our method greatly improves the efficiency of optimization with almost no drop in accuracy, and outperforms existing high-efficiency optimization methods by a large margin.

3D Siamese Voxel-to-BEV Tracker for Sparse Point Clouds

  • Authors: Le Hui, Lingpeng Wang, Mingmei Cheng, Jin Xie, Jian Yang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.04426
  • Pdf link: https://arxiv.org/pdf/2111.04426
  • Abstract
    3D object tracking in point clouds is still a challenging problem due to the sparsity of LiDAR points in dynamic environments. In this work, we propose a Siamese voxel-to-BEV tracker, which can significantly improve the tracking performance in sparse 3D point clouds. Specifically, it consists of a Siamese shape-aware feature learning network and a voxel-to-BEV target localization network. The Siamese shape-aware feature learning network can capture 3D shape information of the object to learn the discriminative features of the object so that the potential target from the background in sparse point clouds can be identified. To this end, we first perform template feature embedding to embed the template's feature into the potential target and then generate a dense 3D shape to characterize the shape information of the potential target. For localizing the tracked target, the voxel-to-BEV target localization network regresses the target's 2D center and the $z$-axis center from the dense bird's eye view (BEV) feature map in an anchor-free manner. Concretely, we compress the voxelized point cloud along $z$-axis through max pooling to obtain a dense BEV feature map, where the regression of the 2D center and the $z$-axis center can be performed more effectively. Extensive evaluation on the KITTI and nuScenes datasets shows that our method significantly outperforms the current state-of-the-art methods by a large margin.

New submissions for Tue, 22 Jun 21

Keyword: SLAM

There is no result

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

Place recognition survey: An update on deep learning approaches

  • Authors: Tiago Barros, Ricardo Pereira, Luís Garrote, Cristiano Premebida, Urbano J. Nunes
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.10458
  • Pdf link: https://arxiv.org/pdf/2106.10458
  • Abstract
    Autonomous Vehicles (AV) are becoming more capable of navigating in complex environments with dynamic and changing conditions. A key component that enables these intelligent vehicles to overcome such conditions and become more autonomous is the sophistication of the perception and localization systems. As part of the localization system, place recognition has benefited from recent developments in other perception tasks such as place categorization or object recognition, namely with the emergence of deep learning (DL) frameworks. This paper surveys recent approaches and methods used in place recognition, particularly those based on deep learning. The contributions of this work are twofold: surveying recent sensors such as 3D LiDARs and RADARs, applied in place recognition; and categorizing the various DL-based place recognition works into supervised, unsupervised, semi-supervised, parallel, and hierarchical categories. First, this survey introduces key place recognition concepts to contextualize the reader. Then, sensor characteristics are addressed. This survey proceeds by elaborating on the various DL-based works, presenting summaries for each framework. Some lessons learned from this survey include: the importance of NetVLAD for supervised end-to-end learning; the advantages of unsupervised approaches in place recognition, namely for cross-domain applications; and the increasing tendency of recent works to seek not only higher performance but also higher efficiency.

CenterAtt: Fast 2-stage Center Attention Network

  • Authors: Jianyun Xu, Xin Tang, Jian Dou, Xu Shu, Yushi Zhu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.10493
  • Pdf link: https://arxiv.org/pdf/2106.10493
  • Abstract
    In this technical report, we introduce the methods of HIKVISION_LiDAR_Det in the Waymo Open Dataset real-time 3D detection challenge. Our solution for the competition is built upon the CenterPoint 3D detection framework. Several variants of CenterPoint are explored, including a center attention head and a feature pyramid network neck. In order to achieve real-time detection, methods such as batchnorm merging, half-precision floating point networks, and a GPU-accelerated voxelization process are adopted. By using these methods, our team ranks 6th among all methods in the Waymo Open Dataset real-time 3D detection challenge.

3D Object Detection for Autonomous Driving: A Survey

  • Authors: Rui Qian, Xin Lai, Xirong Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.10823
  • Pdf link: https://arxiv.org/pdf/2106.10823
  • Abstract
    Autonomous driving is regarded as one of the most promising remedies to shield human beings from severe crashes. To this end, 3D object detection serves as the core basis of such a perception system, especially for path planning, motion prediction, collision avoidance, etc. Generally, stereo or monocular images with corresponding 3D point clouds are already the standard layout for 3D object detection, out of which point clouds are increasingly prevalent with accurate depth information being provided. Despite existing efforts, 3D object detection on point clouds is still in its infancy due to the high sparseness and irregularity of point clouds by nature, misalignment between the camera view and the LiDAR bird's-eye view for modality synergies, occlusions and scale variations at long distances, etc. Recently, profound progress has been made in 3D object detection, with a large body of literature being investigated to address this vision task. As such, we present a comprehensive review of the latest progress in this field covering all the main topics including sensors, fundamentals, and the recent state-of-the-art detection methods with their pros and cons. Furthermore, we introduce metrics and provide quantitative comparisons on popular public datasets. The avenues for future work are judiciously identified after an in-depth analysis of the surveyed works. Finally, we conclude this paper.

One Million Scenes for Autonomous Driving: ONCE Dataset

  • Authors: Jiageng Mao, Minzhe Niu, Chenhan Jiang, Hanxue Liang, Xiaodan Liang, Yamin Li, Chaoqiang Ye, Wei Zhang, Zhenguo Li, Jie Yu, Hang Xu, Chunjing Xu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.11037
  • Pdf link: https://arxiv.org/pdf/2106.11037
  • Abstract
    Current perception models in autonomous driving have become notorious for greatly relying on a mass of annotated data to cover unseen cases and address the long-tail problem. On the other hand, learning from unlabeled large-scale collected data and incrementally self-training powerful recognition models have received increasing attention and may become the solutions of next-generation industry-level powerful and robust perception models in autonomous driving. However, the research community generally suffered from data inadequacy of those essential real-world scene data, which hampers the future exploration of fully/semi/self-supervised methods for 3D perception. In this paper, we introduce the ONCE (One millioN sCenEs) dataset for 3D object detection in the autonomous driving scenario. The ONCE dataset consists of 1 million LiDAR scenes and 7 million corresponding camera images. The data is selected from 144 driving hours, which is 20x longer than the largest 3D autonomous driving dataset available (e.g. nuScenes and Waymo), and it is collected across a range of different areas, periods and weather conditions. To facilitate future research on exploiting unlabeled data for 3D detection, we additionally provide a benchmark in which we reproduce and evaluate a variety of self-supervised and semi-supervised methods on the ONCE dataset. We conduct extensive analyses on those methods and provide valuable observations on their performance related to the scale of used data. Data, code, and more information are available at https://once-for-auto-driving.github.io/index.html.

Domain and Modality Gaps for LiDAR-based Person Detection on Mobile Robots

  • Authors: Dan Jia, Alexander Hermans, Bastian Leibe
  • Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.11239
  • Pdf link: https://arxiv.org/pdf/2106.11239
  • Abstract
    Person detection is a crucial task for mobile robots navigating in human-populated environments and LiDAR sensors are promising for this task, given their accurate depth measurements and large field of view. This paper studies existing LiDAR-based person detectors with a particular focus on mobile robot scenarios (e.g. service robot or social robot), where persons are observed more frequently and in much closer ranges, compared to the driving scenarios. We conduct a series of experiments, using the recently released JackRabbot dataset and the state-of-the-art detectors based on 3D or 2D LiDAR sensors (CenterPoint and DR-SPAAM respectively). These experiments revolve around the domain gap between driving and mobile robot scenarios, as well as the modality gap between 3D and 2D LiDAR sensors. For the domain gap, we aim to understand if detectors pretrained on driving datasets can achieve good performance on the mobile robot scenarios, for which there are currently no trained models readily available. For the modality gap, we compare detectors that use 3D or 2D LiDAR, from various aspects, including performance, runtime, localization accuracy, robustness to range and crowdedness. The results from our experiments provide practical insights into LiDAR-based person detection and facilitate informed decisions for relevant mobile robot designs and applications.

Keyword: loop detection

There is no result

Keyword: autonomous driving

3D Object Detection for Autonomous Driving: A Survey

  • Authors: Rui Qian, Xin Lai, Xirong Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.10823
  • Pdf link: https://arxiv.org/pdf/2106.10823
  • Abstract
    Autonomous driving is regarded as one of the most promising remedies to shield human beings from severe crashes. To this end, 3D object detection serves as the core basis of such a perception system, especially for path planning, motion prediction, collision avoidance, etc. Generally, stereo or monocular images with corresponding 3D point clouds are already the standard layout for 3D object detection, out of which point clouds are increasingly prevalent with accurate depth information being provided. Despite existing efforts, 3D object detection on point clouds is still in its infancy due to the high sparseness and irregularity of point clouds by nature, misalignment between the camera view and the LiDAR bird's-eye view for modality synergies, occlusions and scale variations at long distances, etc. Recently, profound progress has been made in 3D object detection, with a large body of literature being investigated to address this vision task. As such, we present a comprehensive review of the latest progress in this field covering all the main topics including sensors, fundamentals, and the recent state-of-the-art detection methods with their pros and cons. Furthermore, we introduce metrics and provide quantitative comparisons on popular public datasets. The avenues for future work are judiciously identified after an in-depth analysis of the surveyed works. Finally, we conclude this paper.

One Million Scenes for Autonomous Driving: ONCE Dataset

  • Authors: Jiageng Mao, Minzhe Niu, Chenhan Jiang, Hanxue Liang, Xiaodan Liang, Yamin Li, Chaoqiang Ye, Wei Zhang, Zhenguo Li, Jie Yu, Hang Xu, Chunjing Xu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.11037
  • Pdf link: https://arxiv.org/pdf/2106.11037
  • Abstract
    Current perception models in autonomous driving have become notorious for greatly relying on a mass of annotated data to cover unseen cases and address the long-tail problem. On the other hand, learning from unlabeled large-scale collected data and incrementally self-training powerful recognition models have received increasing attention and may become the solutions of next-generation industry-level powerful and robust perception models in autonomous driving. However, the research community generally suffered from data inadequacy of those essential real-world scene data, which hampers the future exploration of fully/semi/self-supervised methods for 3D perception. In this paper, we introduce the ONCE (One millioN sCenEs) dataset for 3D object detection in the autonomous driving scenario. The ONCE dataset consists of 1 million LiDAR scenes and 7 million corresponding camera images. The data is selected from 144 driving hours, which is 20x longer than the largest 3D autonomous driving dataset available (e.g. nuScenes and Waymo), and it is collected across a range of different areas, periods and weather conditions. To facilitate future research on exploiting unlabeled data for 3D detection, we additionally provide a benchmark in which we reproduce and evaluate a variety of self-supervised and semi-supervised methods on the ONCE dataset. We conduct extensive analyses on those methods and provide valuable observations on their performance related to the scale of used data. Data, code, and more information are available at https://once-for-auto-driving.github.io/index.html.

SODA10M: Towards Large-Scale Object Detection Benchmark for Autonomous Driving

  • Authors: Jianhua Han, Xiwen Liang, Hang Xu, Kai Chen, Lanqing Hong, Chaoqiang Ye, Wei Zhang, Zhenguo Li, Chunjing Xu, Xiaodan Liang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.11118
  • Pdf link: https://arxiv.org/pdf/2106.11118
  • Abstract
    Aiming at facilitating a real-world, ever-evolving and scalable autonomous driving system, we present a large-scale benchmark for standardizing the evaluation of different self-supervised and semi-supervised approaches by learning from raw data, which is the first and largest benchmark to date. Existing autonomous driving systems heavily rely on `perfect' visual perception models (e.g., detection) trained using extensive annotated data to ensure safety. However, it is unrealistic to elaborately label instances of all scenarios and circumstances (e.g., night, extreme weather, cities) when deploying a robust autonomous driving system. Motivated by recent powerful advances of self-supervised and semi-supervised learning, a promising direction is to learn a robust detection model by collaboratively exploiting large-scale unlabeled data and few labeled data. Existing datasets (e.g., KITTI, Waymo) either provide only a small amount of data or cover limited domains with full annotation, hindering the exploration of large-scale pre-trained models. Here, we release a Large-Scale Object Detection benchmark for Autonomous driving, named SODA10M, containing 10 million unlabeled images and 20K images labeled with 6 representative object categories. To improve diversity, the images are collected every ten seconds per frame within 32 different cities under different weather conditions, periods and location scenes. We provide extensive experiments and deep analyses of existing supervised state-of-the-art detection models, popular self-supervised and semi-supervised approaches, and some insights about how to develop future models. The data and more up-to-date information have been released at https://soda-2d.github.io.

New submissions for Fri, 19 Nov 21

Keyword: SLAM

There is no result

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

See Eye to Eye: A Lidar-Agnostic 3D Detection Framework for Unsupervised Multi-Target Domain Adaptation

  • Authors: Darren Tsai, Julie Stephany Berrio, Mao Shan, Stewart Worrall, Eduardo Nebot
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.09450
  • Pdf link: https://arxiv.org/pdf/2111.09450
  • Abstract
    Sampling discrepancies between different manufacturers and models of lidar sensors result in inconsistent representations of objects. This leads to performance degradation when 3D detectors trained for one lidar are tested on other types of lidars. Remarkable progress in lidar manufacturing has brought about advances in mechanical, solid-state, and recently, adjustable scan pattern lidars. For the latter, existing works often require fine-tuning the model each time scan patterns are adjusted, which is infeasible. We explicitly deal with the sampling discrepancy by proposing a novel unsupervised multi-target domain adaptation framework, SEE, for transferring the performance of state-of-the-art 3D detectors across both fixed and flexible scan pattern lidars without requiring fine-tuning of models by end-users. Our approach interpolates the underlying geometry and normalizes the scan pattern of objects from different lidars before passing them to the detection network. We demonstrate the effectiveness of SEE on public datasets, achieving state-of-the-art results, and additionally provide quantitative results on a novel high-resolution lidar to prove the industry applications of our framework. This dataset and our code will be made publicly available.

Lidar with Velocity: Motion Distortion Correction of Point Clouds from Oscillating Scanning Lidars

  • Authors: Wen Yang, Zheng Gong, Baifu Huang, Xiaoping Hong
  • Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.09497
  • Pdf link: https://arxiv.org/pdf/2111.09497
  • Abstract
    Lidar point cloud distortion from moving objects is an important problem in autonomous driving, and it has recently become even more demanding with the emergence of newer lidars, which feature back-and-forth scanning patterns. Accurately estimating moving object velocity would not only provide a tracking capability but also correct the point cloud distortion with a more accurate description of the moving object. Since lidar measures time-of-flight distance at a sparse angular resolution, the measurement is precise radially but sparse angularly. A camera, on the other hand, provides dense angular resolution. In this paper, Gaussian-based lidar and camera fusion is proposed to estimate the full velocity and correct the lidar distortion. A probabilistic Kalman-filter framework is provided to track the moving objects, estimate their velocities and simultaneously correct the point cloud distortions. The framework is evaluated on real road data and the fusion method outperforms the traditional ICP-based and point-cloud-only methods. The complete working framework is open-sourced (https://github.com/ISEE-Technology/lidar-with-velocity) to accelerate the adoption of the emerging lidars.
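
The probabilistic Kalman-filter tracking mentioned above can be illustrated with a textbook constant-velocity filter that estimates an object's position and velocity from noisy position fixes. The state layout and noise values are illustrative and unrelated to the paper's actual filter.

```python
import numpy as np

dt = 0.1
F = np.array([[1, 0, dt, 0],          # constant-velocity motion model
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],           # we only observe position (x, y)
              [0, 1, 0, 0]], dtype=float)
Q = 0.01 * np.eye(4)                  # process noise covariance
R = 0.25 * np.eye(2)                  # measurement noise covariance

x = np.zeros(4)                       # state: [x, y, vx, vy]
P = np.eye(4)

def kf_step(x, P, z):
    # predict
    x = F @ x
    P = F @ P @ F.T + Q
    # update with the position measurement z
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

for z in [np.array([1.0, 0.5]), np.array([1.1, 0.56]), np.array([1.21, 0.60])]:
    x, P = kf_step(x, P, z)
print("estimated velocity:", x[2:])
```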

RAANet: Range-Aware Attention Network for LiDAR-based 3D Object Detection with Auxiliary Density Level Estimation

  • Authors: Yantao Lu, Xuetao Hao, Shiqi Sun, Weiheng Chai, Muchenxuan Tong, Senem Velipasalar
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.09515
  • Pdf link: https://arxiv.org/pdf/2111.09515
  • Abstract
    3D object detection from LiDAR data for autonomous driving has been making remarkable strides in recent years. Among the state-of-the-art methodologies, encoding point clouds into a bird's-eye view (BEV) has been demonstrated to be both effective and efficient. Different from perspective views, BEV preserves rich spatial and distance information between objects; and while farther objects of the same type do not appear smaller in the BEV, they contain sparser point cloud features. This fact weakens BEV feature extraction using shared-weight convolutional neural networks. In order to address this challenge, we propose the Range-Aware Attention Network (RAANet), which extracts more powerful BEV features and generates superior 3D object detections. The range-aware attention (RAA) convolutions significantly improve feature extraction for near as well as far objects. Moreover, we propose a novel auxiliary loss for density estimation to further enhance the detection accuracy of RAANet for occluded objects. It is worth noting that our proposed RAA convolution is lightweight and can be integrated into any CNN architecture used for BEV detection. Extensive experiments on the nuScenes dataset demonstrate that our proposed approach outperforms the state-of-the-art methods for LiDAR-based 3D object detection, with real-time inference speed of 16 Hz for the full version and 22 Hz for the lite version. The code is publicly available at an anonymous Github repository https://github.com/anonymous0522/RAAN.

LiDAR Cluster First and Camera Inference Later: A New Perspective Towards Autonomous Driving

  • Authors: Jiyang Chen, Simon Yu, Rohan Tabish, Ayoosh Bansal, Shengzhong Liu, Tarek Abdelzaher, Lui Sha
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.09799
  • Pdf link: https://arxiv.org/pdf/2111.09799
  • Abstract
    Object detection in state-of-the-art Autonomous Vehicle (AV) frameworks relies heavily on deep neural networks. Typically, these networks perform object detection uniformly on the entire camera and LiDAR frames. However, this uniformity jeopardizes the safety of the AV by giving the same priority to all objects in the scene regardless of their risk of collision with the AV. In this paper, we present a new end-to-end pipeline for AVs that introduces the concept of LiDAR cluster first and camera inference later to detect and classify objects. The benefits of our proposed framework are twofold. First, our pipeline prioritizes detecting objects that pose a higher risk of collision to the AV, giving more time for the AV to react to unsafe conditions. Second, it also provides, on average, faster inference speeds compared to popular deep neural network pipelines. We design our framework using a real-world dataset, the Waymo Open Dataset, solving challenges arising from the limitations of LiDAR sensors and object detection algorithms. We show that our novel object detection pipeline prioritizes the detection of higher risk objects while simultaneously achieving comparable accuracy and a 25% higher average speed compared to camera-only inference.
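
A minimal sketch of the "cluster first, camera inference later" idea: cluster the raw LiDAR points, then project each cluster into the image to obtain an ROI for later classification. The clustering parameters and the 3x4 projection matrix are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def lidar_clusters_to_image_rois(points_xyz, proj_matrix, eps=0.7, min_samples=10):
    """Cluster lidar points and return one image bounding box per cluster.

    points_xyz:  (N, 3) lidar points in the camera frame
    proj_matrix: (3, 4) camera projection matrix (assumed calibrated)
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points_xyz)
    rois = []
    for k in set(labels) - {-1}:                      # -1 is DBSCAN noise
        pts = points_xyz[labels == k]
        homo = np.hstack([pts, np.ones((len(pts), 1))])
        uvw = homo @ proj_matrix.T                    # project cluster into the image
        uv = uvw[:, :2] / uvw[:, 2:3]
        rois.append((uv[:, 0].min(), uv[:, 1].min(), uv[:, 0].max(), uv[:, 1].max()))
    return rois                                        # boxes to crop for the camera network
```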

Keyword: loop detection

There is no result

Keyword: autonomous driving

Lidar with Velocity: Motion Distortion Correction of Point Clouds from Oscillating Scanning Lidars

  • Authors: Wen Yang, Zheng Gong, Baifu Huang, Xiaoping Hong
  • Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.09497
  • Pdf link: https://arxiv.org/pdf/2111.09497
  • Abstract
    Lidar point cloud distortion from moving objects is an important problem in autonomous driving, and it has recently become even more demanding with the emergence of newer lidars that feature back-and-forth scanning patterns. Accurately estimating a moving object's velocity would not only provide a tracking capability but also correct the point cloud distortion with a more accurate description of the moving object. Since lidar measures the time-of-flight distance but with a sparse angular resolution, the measurement is precise radially but lacks angular precision. A camera, on the other hand, provides a dense angular resolution. In this paper, Gaussian-based lidar and camera fusion is proposed to estimate the full velocity and correct the lidar distortion. A probabilistic Kalman-filter framework is provided to track the moving objects, estimate their velocities and simultaneously correct the point cloud distortions. The framework is evaluated on real road data, and the fusion method outperforms the traditional ICP-based and point-cloud-only methods. The complete working framework is open-sourced (https://github.com/ISEE-Technology/lidar-with-velocity) to accelerate the adoption of the emerging lidars.

RAANet: Range-Aware Attention Network for LiDAR-based 3D Object Detection with Auxiliary Density Level Estimation

  • Authors: Yantao Lu, Xuetao Hao, Shiqi Sun, Weiheng Chai, Muchenxuan Tong, Senem Velipasalar
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.09515
  • Pdf link: https://arxiv.org/pdf/2111.09515
  • Abstract
    3D object detection from LiDAR data for autonomous driving has been making remarkable strides in recent years. Among the state-of-the-art methodologies, encoding point clouds into a bird's-eye view (BEV) has been demonstrated to be both effective and efficient. Different from perspective views, BEV preserves rich spatial and distance information between objects. While farther objects of the same type do not appear smaller in the BEV, they contain sparser point cloud features, which weakens BEV feature extraction using shared-weight convolutional neural networks. In order to address this challenge, we propose Range-Aware Attention Network (RAANet), which extracts more powerful BEV features and generates superior 3D object detections. The range-aware attention (RAA) convolutions significantly improve feature extraction for near as well as far objects. Moreover, we propose a novel auxiliary loss for density estimation to further enhance the detection accuracy of RAANet for occluded objects. It is worth noting that our proposed RAA convolution is lightweight and can be integrated into any CNN architecture used for BEV detection. Extensive experiments on the nuScenes dataset demonstrate that our proposed approach outperforms the state-of-the-art methods for LiDAR-based 3D object detection, with a real-time inference speed of 16 Hz for the full version and 22 Hz for the lite version. The code is publicly available at an anonymous Github repository https://github.com/anonymous0522/RAAN.

DeepGuard: A Framework for Safeguarding Autonomous Driving Systems from Inconsistent Behavior

  • Authors: Manzoor Hussain, Nazakat Ali, Jang-Eui Hong
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.09533
  • Pdf link: https://arxiv.org/pdf/2111.09533
  • Abstract
    Deep neural network (DNN) based autonomous driving systems (ADSs) are expected to reduce road accidents and improve safety in the transportation domain, as they remove the factor of human error from driving tasks. A DNN-based ADS may sometimes exhibit erroneous or unexpected behaviors under unexpected driving conditions, which may cause accidents. It is not possible to generalize DNN model performance to all driving conditions. Therefore, driving conditions that were not considered during the training of the ADS may lead to unpredictable consequences for the safety of autonomous vehicles. This study proposes an autoencoder and time series analysis based anomaly detection system to prevent safety-critical inconsistent behavior of autonomous vehicles at runtime. Our approach, called DeepGuard, consists of two components. The first component, the inconsistent behavior predictor, is based on an autoencoder and time series analysis to reconstruct the driving scenarios. Based on the reconstruction error and a threshold, it distinguishes normal from unexpected driving scenarios and predicts potential inconsistent behavior. The second component provides on-the-fly safety guards; that is, it automatically activates healing strategies to prevent inconsistencies in the behavior. We evaluated the performance of DeepGuard in predicting the injected anomalous driving scenarios using publicly available open-source DNN-based ADSs in the Udacity simulator. Our simulation results show that the best variant of DeepGuard can predict up to 93 percent of inconsistent behaviors on the CHAUFFEUR ADS, 83 percent on the DAVE2 ADS, and 80 percent on the EPOCH ADS model, outperforming SELFORACLE and DeepRoad. Overall, DeepGuard can prevent up to 89 percent of all predicted inconsistent behaviors of ADS by executing predefined safety guards.
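
At its core, such an autoencoder-based monitor is a reconstruction-error threshold test. The sketch below shows that test only; the time-series smoothing and the paper's threshold-fitting procedure are omitted, and `autoencoder.predict` is an assumed Keras-style API.

```python
import numpy as np

def flag_inconsistent_frames(autoencoder, frames, threshold):
    """Flag frames whose reconstruction error exceeds a threshold.

    frames: (N, ...) batch of driving-scenario inputs
    Returns a boolean array; True means a safety guard should be triggered.
    """
    recon = autoencoder.predict(frames)                           # assumed Keras-style call
    errors = np.mean((frames - recon) ** 2, axis=tuple(range(1, frames.ndim)))
    return errors > threshold
```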

Assisted Robust Reward Design

  • Authors: Jerry Zhi-Yang He, Anca D. Dragan
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.09884
  • Pdf link: https://arxiv.org/pdf/2111.09884
  • Abstract
    Real-world robotic tasks require complex reward functions. When we define the problem the robot needs to solve, we pretend that a designer specifies this complex reward exactly, and it is set in stone from then on. In practice, however, reward design is an iterative process: the designer chooses a reward, eventually encounters an "edge-case" environment where the reward incentivizes the wrong behavior, revises the reward, and repeats. What would it mean to rethink robotics problems to formally account for this iterative nature of reward design? We propose that the robot not take the specified reward for granted, but rather have uncertainty about it, and account for the future design iterations as future evidence. We contribute an Assisted Reward Design method that speeds up the design process by anticipating and influencing this future evidence: rather than letting the designer eventually encounter failure cases and revise the reward then, the method actively exposes the designer to such environments during the development phase. We test this method in a simplified autonomous driving task and find that it more quickly improves the car's behavior in held-out environments by proposing environments that are "edge cases" for the current reward.

Keyword: mapping

Efficient deep learning models for land cover image classification

  • Authors: Ioannis Papoutsis, Nikolaos-Ioannis Bountos, Angelos Zavras, Dimitrios Michail, Christos Tryfonopoulos
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.09451
  • Pdf link: https://arxiv.org/pdf/2111.09451
  • Abstract
    The availability of the sheer volume of Copernicus Sentinel imagery has created new opportunities for land use land cover (LULC) mapping at large scales using deep learning. Training on such large datasets though is a non-trivial task. In this work we experiment with the BigEarthNet dataset for LULC image classification and benchmark different state-of-the-art models, including Convolution Neural Networks, Multi-Layer Perceptrons, Visual Transformers, EfficientNets and Wide Residual Networks (WRN) architectures. Our aim is to leverage classification accuracy, training time and inference rate. We propose a framework based on EfficientNets for compound scaling of WRNs in terms of network depth, width and input data resolution, for efficiently training and testing different model setups. We design a novel scaled WRN architecture enhanced with an Efficient Channel Attention mechanism. Our proposed lightweight model has an order of magnitude fewer trainable parameters, achieves 4.5% higher averaged f-score classification accuracy for all 19 LULC classes, and trains two times faster than a state-of-the-art ResNet50 model that we use as a baseline. We provide access to more than 50 trained models, along with our code for distributed training on multiple GPU nodes.
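
For reference, EfficientNet-style compound scaling ties network depth, width and input resolution to a single coefficient. The helper below is a sketch with illustrative coefficients, not the values used in the paper.

```python
def compound_scale(base_depth, base_width, base_resolution, phi,
                   alpha=1.2, beta=1.1, gamma=1.15):
    """Scale depth, width and input resolution together under one coefficient phi.
    alpha, beta, gamma are illustrative scaling bases."""
    depth = round(base_depth * alpha ** phi)
    width = round(base_width * beta ** phi)
    resolution = round(base_resolution * gamma ** phi)
    return depth, width, resolution

# Example: phi = 2 grows a (16-deep, 4-wide, 120 px) baseline to a larger setup.
print(compound_scale(16, 4, 120, phi=2))
```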

Parabolic interface reconstruction for 2D volume of fluid methods

  • Authors: Ronald A. Remmerswaal, Arthur E.P. Veldman
  • Subjects: Numerical Analysis (math.NA)
  • Arxiv link: https://arxiv.org/abs/2111.09627
  • Pdf link: https://arxiv.org/pdf/2111.09627
  • Abstract
    For capillary driven flow the interface curvature is essential in the modelling of surface tension via the imposition of the Young-Laplace jump condition. We show that traditional geometric volume of fluid (VoF) methods, that are based on a piecewise linear approximation of the interface, do not lead to an interface curvature which is convergent under mesh refinement in time-dependent problems. Instead, we propose to use a piecewise parabolic approximation of the interface, resulting in a class of piecewise parabolic interface calculation (PPIC) methods. In particular, we introduce the parabolic LVIRA and MoF methods, PLVIRA and PMoF, respectively. We show that a Lagrangian remapping method is sufficiently accurate for the advection of such a parabolic interface. It is numerically demonstrated that the newly proposed PPIC methods result in an increase of reconstruction accuracy by one order, convergence of the interface curvature in time-dependent advection problems and Weber number independent convergence of a droplet translation problem, where the advection method is coupled to a two-phase Navier--Stokes solver.

ClipCap: CLIP Prefix for Image Captioning

  • Authors: Ron Mokady, Amir Hertz, Amit H. Bermano
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.09734
  • Pdf link: https://arxiv.org/pdf/2111.09734
  • Abstract
    Image captioning is a fundamental task in vision-language understanding, where the model predicts a textual informative caption for a given input image. In this paper, we present a simple approach to address this task. We use CLIP encoding as a prefix to the caption, by employing a simple mapping network, and then fine-tune a language model to generate the image captions. The recently proposed CLIP model contains rich semantic features which were trained with textual context, making it best suited for vision-language perception. Our key idea is that together with a pre-trained language model (GPT2), we obtain a wide understanding of both visual and textual data. Hence, our approach only requires rather quick training to produce a competent captioning model. Without additional annotations or pre-training, it efficiently generates meaningful captions for large-scale and diverse datasets. Surprisingly, our method works well even when only the mapping network is trained, while both CLIP and the language model remain frozen, allowing a lighter architecture with fewer trainable parameters. Through quantitative evaluation, we demonstrate our model achieves comparable results to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets, while it is simpler, faster, and lighter. Our code is available in https://github.com/rmokady/CLIP_prefix_caption.
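
The mapping-network idea can be pictured as a small MLP that turns one CLIP image embedding into a sequence of prefix embeddings fed to a frozen GPT-2. The sizes below are illustrative only; the official repository linked above contains the actual model.

```python
import torch
import torch.nn as nn

class ClipPrefixMapper(nn.Module):
    """Sketch of a CLIP-to-GPT-2 prefix mapper (dimensions are assumptions)."""

    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
        hidden = gpt_dim * prefix_len // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, gpt_dim * prefix_len),
        )

    def forward(self, clip_embedding):                      # (B, clip_dim)
        prefix = self.mlp(clip_embedding)                   # (B, gpt_dim * prefix_len)
        # Reshape into prefix_len pseudo-token embeddings to prepend to GPT-2 inputs.
        return prefix.view(-1, self.prefix_len, self.gpt_dim)
```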

Simple but Effective: CLIP Embeddings for Embodied AI

  • Authors: Apoorv Khandelwal, Luca Weihs, Roozbeh Mottaghi, Aniruddha Kembhavi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.09888
  • Pdf link: https://arxiv.org/pdf/2111.09888
  • Abstract
    Contrastive language image pretraining (CLIP) encoders have been shown to be beneficial for a range of visual tasks from classification and detection to captioning and image manipulation. We investigate the effectiveness of CLIP visual backbones for embodied AI tasks. We build incredibly simple baselines, named EmbCLIP, with no task specific architectures, inductive biases (such as the use of semantic maps), auxiliary tasks during training, or depth maps -- yet we find that our improved baselines perform very well across a range of tasks and simulators. EmbCLIP tops the RoboTHOR ObjectNav leaderboard by a huge margin of 20 pts (Success Rate). It tops the iTHOR 1-Phase Rearrangement leaderboard, beating the next best submission, which employs Active Neural Mapping, and more than doubling the % Fixed Strict metric (0.08 to 0.17). It also beats the winners of the 2021 Habitat ObjectNav Challenge, which employ auxiliary tasks, depth maps, and human demonstrations, and those of the 2019 Habitat PointNav Challenge. We evaluate the ability of CLIP's visual representations at capturing semantic information about input observations -- primitives that are useful for navigation-heavy embodied tasks -- and find that CLIP's representations encode these primitives more effectively than ImageNet-pretrained backbones. Finally, we extend one of our baselines, producing an agent capable of zero-shot object navigation that can navigate to objects that were not used as targets during training.

Keyword: localization

Rethinking Drone-Based Search and Rescue with Aerial Person Detection

  • Authors: Pasi Pyrrö, Hassan Naseri, Alexander Jung
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.09406
  • Pdf link: https://arxiv.org/pdf/2111.09406
  • Abstract
    The visual inspection of aerial drone footage is an integral part of land search and rescue (SAR) operations today. Since this inspection is a slow, tedious and error-prone job for humans, we propose a novel deep learning algorithm to automate this aerial person detection (APD) task. We experiment with model architecture selection, online data augmentation, transfer learning, image tiling and several other techniques to improve the test performance of our method. We present the novel Aerial Inspection RetinaNet (AIR) algorithm as the combination of these contributions. The AIR detector demonstrates state-of-the-art performance on a commonly used SAR test data set in terms of both precision (~21 percentage point increase) and speed. In addition, we provide a new formal definition for the APD problem in SAR missions. That is, we propose a novel evaluation scheme that ranks detectors in terms of real-world SAR localization requirements. Finally, we propose a novel postprocessing method for robust, approximate object localization: the merging of overlapping bounding boxes (MOB) algorithm. This final processing stage used in the AIR detector significantly improves its performance and usability in the face of real-world aerial SAR missions.
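
The "merging of overlapping bounding boxes" postprocessing can be illustrated with a simple IoU-based union merge. This is only a plausible reading of the MOB idea; the paper's exact merging criterion may differ.

```python
import numpy as np

def merge_overlapping_boxes(boxes, iou_thresh=0.3):
    """Repeatedly merge boxes whose IoU exceeds iou_thresh into their union box.

    boxes: iterable of [x1, y1, x2, y2]; returns an (M, 4) array of merged boxes.
    """
    boxes = [np.asarray(b, dtype=float) for b in boxes]
    changed = True
    while changed:                                   # iterate until no more merges happen
        changed = False
        result = []
        while boxes:
            b = boxes.pop()
            remaining = []
            for o in boxes:
                x1, y1 = np.maximum(b[:2], o[:2])
                x2, y2 = np.minimum(b[2:], o[2:])
                inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
                union = ((b[2] - b[0]) * (b[3] - b[1])
                         + (o[2] - o[0]) * (o[3] - o[1]) - inter)
                if union > 0 and inter / union > iou_thresh:
                    b = np.array([min(b[0], o[0]), min(b[1], o[1]),
                                  max(b[2], o[2]), max(b[3], o[3])])  # union box
                    changed = True
                else:
                    remaining.append(o)
            boxes = remaining
            result.append(b)
        boxes = result
    return np.stack(boxes)
```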

Towards Open Vocabulary Object Detection without Human-provided Bounding Boxes

  • Authors: Mingfei Gao, Chen Xing, Juan Carlos Niebles, Junnan Li, Ran Xu, Wenhao Liu, Caiming Xiong
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.09452
  • Pdf link: https://arxiv.org/pdf/2111.09452
  • Abstract
    Despite great progress in object detection, most existing methods are limited to a small set of object categories, due to the tremendous human effort needed for instance-level bounding-box annotation. To alleviate the problem, recent open vocabulary and zero-shot detection methods attempt to detect object categories not seen during training. However, these approaches still rely on manually provided bounding-box annotations on a set of base classes. We propose an open vocabulary detection framework that can be trained without manually provided bounding-box annotations. Our method achieves this by leveraging the localization ability of pre-trained vision-language models and generating pseudo bounding-box labels that can be used directly for training object detectors. Experimental results on COCO, PASCAL VOC, Objects365 and LVIS demonstrate the effectiveness of our method. Specifically, our method outperforms the state-of-the-arts (SOTA) that are trained using human annotated bounding-boxes by 3% AP on COCO novel categories even though our training source is not equipped with manual bounding-box labels. When utilizing the manual bounding-box labels as our baselines do, our method surpasses the SOTA largely by 8% AP.

New submissions for Tue, 30 Nov 21

Keyword: SLAM

Deployment of Aerial Robots after a major fire of an industrial hall with hazardous substances, a report

  • Authors: Hartmut Surmann, Dominik Slomma, Stefan Grobelny, Robert Grafe
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.14542
  • Pdf link: https://arxiv.org/pdf/2111.14542
  • Abstract
    This technical report is about the mission and the experience gained during the reconnaissance of an industrial hall with hazardous substances after a major fire in Berlin. During this operation, only UAVs and cameras were used to obtain information about the site and the building. First, a geo-referenced 3D model of the building was created in order to plan the entry into the hall. Subsequently, the UAVs were used to fly into the heavily damaged interior and take pictures from inside the hall. A 360° camera mounted under the UAV was used to collect images of the surrounding area, especially from sections that were difficult to fly into. Since the collected data set contained similar images as well as blurred images, it was cleaned of non-optimal images using visual SLAM, bundle adjustment and blur detection so that a 3D model and overviews could be calculated. It was shown that the emergency services were not able to extract the necessary information from the 3D model. Therefore, an interactive panorama viewer with links to other 360° images was implemented, where the links to the other images depend on the semi-dense point cloud and the located camera positions of the visual SLAM algorithm, so that the emergency forces could view the surroundings.

An in-depth experimental study of sensor usage and visual reasoning of robots navigating in real environments

  • Authors: Assem Sadek, Guillaume Bono, Boris Chidlovskii, Christian Wolf
  • Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.14666
  • Pdf link: https://arxiv.org/pdf/2111.14666
  • Abstract
    Visual navigation by mobile robots is classically tackled through SLAM plus optimal planning, and more recently through end-to-end training of policies implemented as deep networks. While the former are often limited to waypoint planning, they have proven their efficiency even in real physical environments; the latter solutions are most frequently employed in simulation, but have been shown to be able to learn more complex visual reasoning involving complex semantic regularities. Navigation by real robots in physical environments is still an open problem. End-to-end training approaches have been thoroughly tested in simulation only, with experiments involving real robots being restricted to rare performance evaluations in simplified laboratory conditions. In this work we present an in-depth study of the performance and reasoning capacities of real physical agents, trained in simulation and deployed to two different physical environments. Beyond benchmarking, we provide insights into the generalization capabilities of different agents trained in different conditions. We visualize sensor usage and the importance of the different types of signals. We show that, for the PointGoal task, an agent pre-trained on a wide variety of tasks and fine-tuned on a simulated version of the target environment can reach competitive performance without modelling any sim2real transfer, i.e. by deploying the trained agent directly from simulation to a real physical robot.

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

The Implicit Values of A Good Hand Shake: Handheld Multi-Frame Neural Depth Refinement

  • Authors: Ilya Chugunov, Yuxuan Zhang, Zhihao Xia, Cecilia Zhang, Jiawen Chen, Felix Heide
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.13738
  • Pdf link: https://arxiv.org/pdf/2111.13738
  • Abstract
    Modern smartphones can continuously stream multi-megapixel RGB images at 60 Hz, synchronized with high-quality 3D pose information and low-resolution LiDAR-driven depth estimates. During a snapshot photograph, the natural unsteadiness of the photographer's hands offers millimeter-scale variation in camera pose, which we can capture along with RGB and depth in a circular buffer. In this work we explore how, from a bundle of these measurements acquired during viewfinding, we can combine dense micro-baseline parallax cues with kilopixel LiDAR depth to distill a high-fidelity depth map. We take a test-time optimization approach and train a coordinate MLP to output photometrically and geometrically consistent depth estimates at the continuous coordinates along the path traced by the photographer's natural hand shake. The proposed method brings high-resolution depth estimates to 'point-and-shoot' tabletop photography and requires no additional hardware, artificial hand motion, or user interaction beyond the press of a button.

DSC: Deep Scan Context Descriptor for Large-Scale Place Recognition

  • Authors: Jiafeng Cui, Tengfei Huang, Yingfeng Cai, Junqiao Zhao, Lu Xiong, Zhuoping Yu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.13838
  • Pdf link: https://arxiv.org/pdf/2111.13838
  • Abstract
    LiDAR-based place recognition is an essential and challenging task both in loop closure detection and global relocalization. We propose Deep Scan Context (DSC), a general and discriminative global descriptor that captures the relationship among segments of a point cloud. Unlike previous methods that utilize either semantics or a sequence of adjacent point clouds for better place recognition, we only use raw point clouds to get competitive results. Concretely, we first segment the point cloud egocentrically to acquire centroids and eigenvalues of the segments. Then, we introduce a graph neural network to aggregate these features into an embedding representation. Extensive experiments conducted on the KITTI dataset show that DSC is robust to scene variants and outperforms existing methods.
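
The per-segment geometric features mentioned above (centroids and covariance eigenvalues) can be computed in a few lines; the GNN aggregation used by DSC is omitted, and the function below is only an illustrative sketch assuming each segment has at least two points.

```python
import numpy as np

def segment_descriptors(points, labels):
    """Per-segment centroid and covariance eigenvalues.

    points: (N, 3) lidar points; labels: (N,) segment id per point.
    Returns (num_segments, 6): [centroid_xyz, ascending eigenvalues].
    """
    feats = []
    for seg_id in np.unique(labels):
        seg = points[labels == seg_id]
        centroid = seg.mean(axis=0)
        eigvals = np.linalg.eigvalsh(np.cov(seg.T))   # eigenvalues of the 3x3 covariance
        feats.append(np.concatenate([centroid, eigvals]))
    return np.stack(feats)
```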

VPFNet: Improving 3D Object Detection with Virtual Point based LiDAR and Stereo Data Fusion

  • Authors: Hanqi Zhu, Jiajun Deng, Yu Zhang, Jianmin Ji, Qiuyu Mao, Houqiang Li, Yanyong Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.14382
  • Pdf link: https://arxiv.org/pdf/2111.14382
  • Abstract
    It has been well recognized that fusing the complementary information from depth-aware LiDAR point clouds and semantic-rich stereo images would benefit 3D object detection. Nevertheless, it is not trivial to explore the inherently unnatural interaction between sparse 3D points and dense 2D pixels. To ease this difficulty, the recent proposals generally project the 3D points onto the 2D image plane to sample the image data and then aggregate the data at the points. However, this approach often suffers from the mismatch between the resolution of point clouds and RGB images, leading to sub-optimal performance. Specifically, taking the sparse points as the multi-modal data aggregation locations causes severe information loss for high-resolution images, which in turn undermines the effectiveness of multi-sensor fusion. In this paper, we present VPFNet -- a new architecture that cleverly aligns and aggregates the point cloud and image data at "virtual" points. Particularly, with their density lying between that of the 3D points and 2D pixels, the virtual points can nicely bridge the resolution gap between the two sensors, and thus preserve more information for processing. Moreover, we also investigate the data augmentation techniques that can be applied to both point clouds and RGB images, as data augmentation has made non-negligible contributions towards 3D object detectors to date. We have conducted extensive experiments on the KITTI dataset, and have observed good performance compared to the state-of-the-art methods. Remarkably, our VPFNet achieves 83.21% moderate 3D AP and 91.86% moderate BEV AP on the KITTI test set, ranking 1st since May 21st, 2021. The network design also takes computation efficiency into consideration -- we can achieve an FPS of 15 on a single NVIDIA RTX 2080Ti GPU. The code will be made available for reproduction and further investigation.

Urban Radiance Fields

  • Authors: Konstantinos Rematas, Andrew Liu, Pratul P. Srinivasan, Jonathan T. Barron, Andrea Tagliasacchi, Thomas Funkhouser, Vittorio Ferrari
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
  • Arxiv link: https://arxiv.org/abs/2111.14643
  • Pdf link: https://arxiv.org/pdf/2111.14643
  • Abstract
    The goal of this work is to perform 3D reconstruction and novel view synthesis from data captured by scanning platforms commonly deployed for world mapping in urban outdoor environments (e.g., Street View). Given a sequence of posed RGB images and lidar sweeps acquired by cameras and scanners moving through an outdoor scene, we produce a model from which 3D surfaces can be extracted and novel RGB images can be synthesized. Our approach extends Neural Radiance Fields, which has been demonstrated to synthesize realistic novel images for small scenes in controlled settings, with new methods for leveraging asynchronously captured lidar data, for addressing exposure variation between captured images, and for leveraging predicted image segmentations to supervise densities on rays pointing at the sky. Each of these three extensions provides significant performance improvements in experiments on Street View data. Our system produces state-of-the-art 3D surface reconstructions and synthesizes higher quality novel views in comparison to both traditional methods (e.g., COLMAP) and recent neural representations (e.g., Mip-NeRF).

Semi-supervised Implicit Scene Completion from Sparse LiDAR

  • Authors: Pengfei Li, Yongliang Shi, Tianyu Liu, Hao Zhao, Guyue Zhou, Ya-Qin Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.14798
  • Pdf link: https://arxiv.org/pdf/2111.14798
  • Abstract
    Recent advances show that semi-supervised implicit representation learning can be achieved through physical constraints like Eikonal equations. However, this scheme has not yet been successfully used for LiDAR point cloud data, due to its spatially varying sparsity. In this paper, we develop a novel formulation that conditions the semi-supervised implicit function on localized shape embeddings. It exploits the strong representation learning power of sparse convolutional networks to generate shape-aware dense feature volumes, while still allowing semi-supervised signed distance function learning without knowing its exact values in free space. With extensive quantitative and qualitative results, we demonstrate intrinsic properties of this new learning system and its usefulness in real-world road scenes. Notably, we improve IoU from 26.3% to 51.0% on SemanticKITTI. Moreover, we explore two paradigms to integrate semantic label predictions, achieving implicit semantic completion. Code and models can be accessed at https://github.com/OPEN-AIR-SUN/SISC.
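
The Eikonal constraint referred to above is the standard regularizer that pushes the gradient norm of a signed distance field toward 1 at sampled points. Below is a generic sketch of that loss; the paper's conditioning on localized shape embeddings is not shown.

```python
import torch

def eikonal_loss(sdf_model, points):
    """||grad f(x)|| ≈ 1 regularizer for a signed distance network.

    sdf_model: callable mapping (N, 3) points to (N, 1) signed distances
    points:    (N, 3) sample locations (e.g. free-space samples)
    """
    points = points.clone().requires_grad_(True)
    sdf = sdf_model(points)
    grad = torch.autograd.grad(sdf.sum(), points, create_graph=True)[0]
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()
```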

Keyword: loop detection

There is no result

Keyword: autonomous driving

Energy-Efficient Autonomous Driving Using Cognitive Driver Behavioral Models and Reinforcement Learning

  • Authors: Huayi Li, Nan Li, Ilya Kolmanovsky, Anouck Girard
  • Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2111.13966
  • Pdf link: https://arxiv.org/pdf/2111.13966
  • Abstract
    Autonomous driving technologies are expected to not only improve mobility and road safety but also bring energy efficiency benefits. In the foreseeable future, autonomous vehicles (AVs) will operate on roads shared with human-driven vehicles. To maintain safety and liveness while simultaneously minimizing energy consumption, the AV planning and decision-making process should account for interactions between the autonomous ego vehicle and surrounding human-driven vehicles. In this chapter, we describe a framework for developing energy-efficient autonomous driving policies on shared roads by exploiting human-driver behavior modeling based on cognitive hierarchy theory and reinforcement learning.

Towards Autonomous Driving of Personal Mobility with Small and Noisy Dataset using Tsallis-statistics-based Behavioral Cloning

  • Authors: Taisuke Kobayashi, Takahito Enomoto
  • Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.14294
  • Pdf link: https://arxiv.org/pdf/2111.14294
  • Abstract
    Autonomous driving has made great progress and is being introduced into practical use step by step. On the other hand, the concept of personal mobility is also getting popular, and autonomous driving specialized for individual drivers is expected as a new step. However, it is difficult to collect a large driving dataset, which is basically required for the learning of autonomous driving, from the individual driver of the personal mobility. In addition, when the driver is not familiar with the operation of the personal mobility, the dataset will contain non-optimal data. This study therefore focuses on an autonomous driving method for the personal mobility with such a small and noisy, so-called personal, dataset. Specifically, we introduce a new loss function based on Tsallis statistics that weights gradients depending on the original loss function and allows us to exclude noisy data in the optimization phase. In addition, we improve the visualization technique to verify whether the driver and the controller have the same region of interest. From the experimental results, we found that the conventional autonomous driving method failed to drive properly due to the wrong operations in the personal dataset, and its region of interest was different from that of the driver. In contrast, the proposed method learned robustly against the errors and successfully drove automatically while paying attention to a region similar to the driver's. The accompanying video is uploaded on YouTube: https://youtu.be/KEq8-bOxYQA
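
One plausible way to weight per-sample gradients with Tsallis statistics is a q-exponential of the negative loss, which smoothly down-weights high-loss (likely noisy) demonstrations. This is only an illustration of the general idea, not necessarily the paper's exact formulation.

```python
import torch

def tsallis_weighted_loss(per_sample_loss, q=1.5):
    """Weight each sample by exp_q(-loss) with the q-exponential
    exp_q(x) = max(0, 1 + (1 - q) x)^(1 / (1 - q)) for q != 1 (illustrative choice).
    High-loss samples receive smaller weights and thus contribute smaller gradients."""
    w = torch.clamp(1.0 - (1.0 - q) * per_sample_loss, min=0.0) ** (1.0 / (1.0 - q))
    return (w.detach() * per_sample_loss).mean()   # weights are not back-propagated through
```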

Anomaly-Aware Semantic Segmentation by Leveraging Synthetic-Unknown Data

  • Authors: Guan-Rong Lu, Yueh-Cheng Liu, Tung-I Chen, Hung-Ting Su, Tsung-Han Wu, Winston H. Hsu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.14343
  • Pdf link: https://arxiv.org/pdf/2111.14343
  • Abstract
    Anomaly awareness is an essential capability for safety-critical applications such as autonomous driving. While recent progress in robotics and computer vision has enabled anomaly detection for image classification, anomaly detection on semantic segmentation is less explored. Conventional anomaly-aware systems, which treat other existing classes as out-of-distribution (pseudo-unknown) classes when training a model, suffer from two drawbacks. (1) Unknown classes, which applications need to cope with, might not actually exist during training time. (2) Model performance would strongly rely on the class selection. Observing this, we propose a novel Synthetic-Unknown Data Generation method, intending to tackle the anomaly-aware semantic segmentation task. We design a new Masked Gradient Update (MGU) module to generate auxiliary data along the boundary of in-distribution data points. In addition, we modify the traditional cross-entropy loss to emphasize the border data points. We reach state-of-the-art performance on two anomaly segmentation datasets. Ablation studies also demonstrate the effectiveness of the proposed modules.

Human Performance Capture from Monocular Video in the Wild

  • Authors: Chen Guo, Xu Chen, Jie Song, Otmar Hilliges
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.14672
  • Pdf link: https://arxiv.org/pdf/2111.14672
  • Abstract
    Capturing the dynamically deforming 3D shape of clothed humans is essential for numerous applications, including VR/AR, autonomous driving, and human-computer interaction. Existing methods either require a highly specialized capturing setup, such as expensive multi-view imaging systems, or they lack robustness to challenging body poses. In this work, we propose a method capable of capturing the dynamic 3D human shape from a monocular video featuring challenging body poses, without any additional input. We first build a 3D template human model of the subject based on a learned regression model. We then track this template model's deformation under challenging body articulations based on 2D image observations. Our method outperforms state-of-the-art methods on an in-the-wild human video dataset, 3DPW. Moreover, we demonstrate its efficacy in robustness and generalizability on videos from the iPER dataset.

A Case for a Programmable Edge Storage Middleware

  • Authors: Giulia Frascaria, Animesh Trivedi, Lin Wang
  • Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
  • Arxiv link: https://arxiv.org/abs/2111.14720
  • Pdf link: https://arxiv.org/pdf/2111.14720
  • Abstract
    Edge computing is a fast-growing computing paradigm where data is processed at the local site where it is generated, close to the end-devices. This can benefit a set of disruptive applications like autonomous driving, augmented reality, and collaborative machine learning, which produce incredible amounts of data that need to be shared, processed and stored at the edge to meet low latency requirements. However, edge storage poses new challenges due to the scarcity and heterogeneity of edge infrastructures and the diversity of edge applications. In particular, edge applications may impose conflicting constraints and optimizations that are hard to be reconciled on the limited, hard-to-scale edge resources. In this vision paper we argue that a new middleware for constrained edge resources is needed, providing a unified storage service for diverse edge applications. We identify programmability as a critical feature that should be leveraged to optimize the resource sharing while delivering the specialization needed for edge applications. Following this line, we make a case for eBPF and present the design for Griffin - a flexible, lightweight programmable edge storage middleware powered by eBPF.

Keyword: mapping

Dynamic Analysis of Nonlinear Civil Engineering Structures using Artificial Neural Network with Adaptive Training

  • Authors: Xiao Pan, Zhizhao Wen, T.Y. Yang
  • Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2111.13759
  • Pdf link: https://arxiv.org/pdf/2111.13759
  • Abstract
    Dynamic analysis of structures subjected to earthquake excitation is a time-consuming process, particularly when an extremely small time step is required or in the presence of high geometric and material nonlinearity. Performing parametric studies in such cases is even more tedious. The advancement of computer graphics hardware in recent years enables efficient training of artificial neural networks, which are well known to be capable of learning highly nonlinear mappings. In this study, artificial neural networks are developed with adaptive training algorithms, which enable automatic node generation and layer addition. The hyperbolic tangent function is selected as the activation function. Stochastic Gradient Descent and Back Propagation algorithms are adopted to train the networks. The neural networks initiate with a small number of hidden layers and nodes. During training, the performance of the network is continuously tracked, and new nodes or layers are added to the hidden layers if the neural network reaches its capacity. At the end of the training process, the network with an appropriate architecture is automatically formed. The performance of the networks has been validated for inelastic shear frames, as well as rocking structures, both of which are first built in a finite element program for dynamic analysis to generate training data. Results have shown the developed networks can successfully predict the time-history response of the shear frame and the rocking structure subjected to real ground motion records. The efficiency of the proposed neural networks is also examined, showing that the computational time can be reduced by 43% with the neural network method compared to FE models. This indicates the trained networks can be utilized to generate rocking spectra of structures more efficiently, a task which demands a large number of time-history analyses.

AI-supported Framework of Semi-Automatic Monoplotting for Monocular Oblique Visual Data Analysis

  • Authors: Behzad Golparvar, Ruo-Qian Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.14021
  • Pdf link: https://arxiv.org/pdf/2111.14021
  • Abstract
    In the last decades, the development of smartphones, drones, aerial patrols, and digital cameras has made high-quality photographs available to large populations and thus provides an opportunity to collect massive data of nature and society with global coverage. However, the data collected with new photography tools are usually oblique - they are difficult to georeference, and huge amounts of data are often obsolete. Georeferencing oblique imagery data may be solved by a technique called monoplotting, which only requires a single image and a Digital Elevation Model (DEM). In traditional monoplotting, a human user has to manually choose a series of ground control point (GCP) pairs in the image and DEM and then determine the extrinsic and intrinsic parameters of the camera to establish a pixel-level correspondence between photos and the DEM to enable the mapping and georeferencing of objects in photos. This traditional method is difficult to scale due to several challenges, including the labor-intensive inputs, the need for rich experience to identify well-defined GCPs, and limitations in camera pose estimation. Therefore, existing monoplotting methods are rarely used in analyzing large-scale databases or near-real-time warning systems. In this paper, we propose and demonstrate a novel semi-automatic monoplotting framework that provides pixel-level correspondence between photos and DEMs requiring minimal human interventions. A pipeline of analyses was developed, including key point detection in images and DEM rasters, retrieving georeferenced 3D DEM GCPs, regularized gradient-based optimization, pose estimation, ray tracing, and the correspondence identification between image pixels and real world coordinates. Two numerical experiments show that the framework is superior in georeferencing visual data in 3-D coordinates, paving the way toward a fully automatic monoplotting methodology.

Joint Sensing and Communication for Situational Awareness in Wireless THz Systems

  • Authors: Christina Chaccour, Walid Saad, Omid Semiari, Mehdi Bennis, Petar Popovski
  • Subjects: Information Theory (cs.IT); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2111.14044
  • Pdf link: https://arxiv.org/pdf/2111.14044
  • Abstract
    Next-generation wireless systems are rapidly evolving from communication-only systems to multi-modal systems with integrated sensing and communications. In this paper a novel joint sensing and communication framework is proposed for enabling wireless extended reality (XR) at terahertz (THz) bands. To gather rich sensing information and a higher line-of-sight (LoS) availability, THz-operated reconfigurable intelligent surfaces (RISs) acting as base stations are deployed. The sensing parameters are extracted by leveraging THz's quasi-opticality and opportunistically utilizing uplink communication waveforms. This enables the use of the same waveform, spectrum, and hardware for both sensing and communication purposes. The environmental sensing parameters are then derived by exploiting the sparsity of THz channels via tensor decomposition. Hence, a high-resolution indoor mapping is derived so as to characterize the spatial availability of communications and the mobility of users. Simulation results show that in the proposed framework, the resolution and data rate of the overall system are positively correlated, thus allowing a joint optimization between these metrics with no tradeoffs. Results also show that the proposed framework improves the system reliability in static and mobile systems. In particular, the highest gains of 10% in reliability are achieved in a walking-speed mobile environment compared to communication-only systems with beam tracking.

UAV-based Crowd Surveillance in Post COVID-19 Era

  • Authors: Nizar Masmoudi, Wael Jaafar, Safa Cherif, Jihene Ben Abderrazak, Halim Yanikomeroglu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2111.14176
  • Pdf link: https://arxiv.org/pdf/2111.14176
  • Abstract
    To cope with the current pandemic situation and reinstate pseudo-normal daily life, several measures have been deployed and maintained, such as mask wearing, social distancing, hand sanitizing, etc. Since outdoor cultural events, concerts, and picnics are gradually being allowed, close monitoring of crowd activity is needed to avoid undesired contact and disease transmission. In this context, intelligent unmanned aerial vehicles (UAVs) can occasionally be deployed to surveil these activities, verify that health restriction measures are applied, and trigger alerts when the latter are not respected. Consequently, we propose in this paper a complete UAV framework for intelligent monitoring of post COVID-19 outdoor activities. Specifically, we propose a three-step approach. In the first step, images captured by a UAV are analyzed using machine learning to detect and locate individuals. The second step consists of a novel coordinates mapping approach to evaluate distances among individuals and then cluster them, while the third step provides an energy-efficient and/or reliable UAV trajectory to inspect clusters for restriction violations such as mask wearing. Obtained results provide the following insights: 1) Efficient detection of individuals depends on the angle from which the image was captured, 2) coordinates mapping is very sensitive to the estimation error in individuals' bounding boxes, and 3) the UAV trajectory design algorithm 2-Opt is recommended for practical real-time deployments due to its low complexity and near-optimal performance.

Adaptive Mesh Methods on Compact Manifolds via Optimal Transport and Optimal Information Transport

  • Authors: Axel G. R. Turnquist
  • Subjects: Numerical Analysis (math.NA)
  • Arxiv link: https://arxiv.org/abs/2111.14276
  • Pdf link: https://arxiv.org/pdf/2111.14276
  • Abstract
    Moving mesh methods are designed to redistribute a mesh in a regular way. This applied problem can be considered to overlap with the problem of finding a diffeomorphic mapping between density measures. In applications, an off-the-shelf grid needs to be restructured to have higher grid density in some regions than others. This should be done in a way that avoids tangling, hence, the attractiveness of diffeomorphic mapping techniques. For exact diffeomorphic mapping on the sphere a major tool used is Optimal Transport, which allows for diffeomorphic mapping between even non-continuous source and target densities. However, recently Optimal Information Transport was rigorously developed allowing for exact and inexact diffeomorphic mapping and the solving of a simpler partial differential equation. In this manuscript, we solve adaptive mesh problems using Optimal Transport and Optimal Information Transport on the sphere and introduce how to generalize these computations to more general manifolds. We choose to perform this comparison with provably convergent solvers, which is generally challenging for either problem due to the lack of boundary conditions and lack of comparison principle in the partial differential equation formulation.

HDR-NeRF: High Dynamic Range Neural Radiance Fields

  • Authors: Xin Huang, Qi Zhang, Feng Ying, Hongdong Li, Xuan Wang, Qing Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.14451
  • Pdf link: https://arxiv.org/pdf/2111.14451
  • Abstract
    We present High Dynamic Range Neural Radiance Fields (HDR-NeRF) to recover an HDR radiance field from a set of low dynamic range (LDR) views with different exposures. Using HDR-NeRF, we are able to generate both novel HDR views and novel LDR views under different exposures. The key to our method is to model the physical imaging process, which dictates that the radiance of a scene point transforms to a pixel value in the LDR image with two implicit functions: a radiance field and a tone mapper. The radiance field encodes the scene radiance (values vary from 0 to +infty), outputting the density and radiance of a ray given the corresponding ray origin and ray direction. The tone mapper models the mapping process by which a ray hitting the camera sensor becomes a pixel value. The color of the ray is predicted by feeding the radiance and the corresponding exposure time into the tone mapper. We use the classic volume rendering technique to project the output radiance, colors, and densities into HDR and LDR images, while only the input LDR images are used as supervision. We collect a new forward-facing HDR dataset to evaluate the proposed method. Experimental results on synthetic and real-world scenes validate that our method can not only accurately control the exposures of synthesized views but also render views with a high dynamic range.
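
To make the "implicit tone mapper" idea concrete, one could imagine a tiny network that maps log(radiance × exposure) to an LDR value in [0, 1]. This is purely an illustrative sketch; the paper's network and parameterization are not reproduced here.

```python
import torch
import torch.nn as nn

class ToneMapper(nn.Module):
    """Hypothetical learnable tone mapper operating in the log-exposure domain."""

    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, radiance, exposure):
        # radiance: (..., 1) scene radiance along a ray; exposure: scalar or broadcastable tensor
        x = torch.log(radiance * exposure + 1e-8)   # combine radiance and exposure time
        return self.net(x)                          # LDR value in [0, 1]
```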

Decoupled Low-light Image Enhancement

  • Authors: Shijie Hao, Xu Han, Yanrong Guo, Meng Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2111.14458
  • Pdf link: https://arxiv.org/pdf/2111.14458
  • Abstract
    The visual quality of photographs taken under imperfect lighting conditions can be degraded by multiple factors, e.g., low lightness, imaging noise, color distortion and so on. Current low-light image enhancement models focus on the improvement of low lightness only, or simply deal with all the degradation factors as a whole, therefore leading to sub-optimal performance. In this paper, we propose to decouple the enhancement model into two sequential stages. The first stage focuses on improving the scene visibility based on a pixel-wise non-linear mapping. The second stage focuses on improving the appearance fidelity by suppressing the remaining degradation factors. The decoupled model facilitates the enhancement in two aspects. On the one hand, the whole low-light enhancement can be divided into two easier subtasks. The first one only aims to enhance the visibility. It also helps to bridge the large intensity gap between the low-light and normal-light images. In this way, the second subtask can be shaped as the local appearance adjustment. On the other hand, since the parameter matrix learned from the first stage is aware of the lightness distribution and the scene structure, it can be incorporated into the second stage as complementary information. In the experiments, our model demonstrates state-of-the-art performance in both qualitative and quantitative comparisons, compared with other low-light image enhancement models. In addition, the ablation studies also validate the effectiveness of our model in multiple aspects, such as model structure and loss function. The trained model is available at https://github.com/hanxuhfut/Decoupled-Low-light-Image-Enhancement.

A new Sinkhorn algorithm with Deletion and Insertion operations

  • Authors: Luc Brun, Benoit Gaüzère, Sébastien Bougleux, Florian Yger
  • Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
  • Arxiv link: https://arxiv.org/abs/2111.14565
  • Pdf link: https://arxiv.org/pdf/2111.14565
  • Abstract
    This technical report is devoted to the continuous estimation of an epsilon-assignment. Roughly speaking, an epsilon-assignment between two sets V1 and V2 may be understood as a bijective mapping between a subpart of V1 and a subpart of V2. The remaining elements of V1 (not included in this mapping) are mapped onto an epsilon pseudo element of V2. We say that such elements are deleted. Conversely, the remaining elements of V2 correspond to the image of the epsilon pseudo element of V1. We say that these elements are inserted. As a result, our method provides a result similar to that of the Sinkhorn algorithm, with the additional ability to reject some elements which are either inserted or deleted. It thus naturally handles sets V1 and V2 of different sizes and decides mappings/insertions/deletions in a unified way. Our algorithms are iterative and differentiable and may thus be easily inserted into a backpropagation-based learning framework such as artificial neural networks.
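
A common way to picture deletions and insertions on top of Sinkhorn is to augment the cost matrix with a dummy "epsilon" row and column before the usual scaling iterations. The sketch below uses that standard construction, which is not necessarily the exact operator set or marginals of the report.

```python
import numpy as np

def sinkhorn_with_epsilon(cost, del_cost, ins_cost, reg=0.1, iters=200):
    """Entropic Sinkhorn on a cost matrix augmented with an epsilon row and column.

    cost:     (n1, n2) substitution costs between V1 and V2
    del_cost: (n1,) cost of mapping an element of V1 to the epsilon element of V2 (deletion)
    ins_cost: (n2,) cost of mapping the epsilon element of V1 to an element of V2 (insertion)
    Returns a soft (n1+1, n2+1) epsilon-assignment matrix.
    """
    n1, n2 = cost.shape
    big = np.zeros((n1 + 1, n2 + 1))
    big[:n1, :n2] = cost
    big[:n1, n2] = del_cost
    big[n1, :n2] = ins_cost
    K = np.exp(-big / reg)
    u, v = np.ones(n1 + 1), np.ones(n2 + 1)
    r = np.ones(n1 + 1); r[n1] = n2          # epsilon element of V1 may absorb all of V2
    c = np.ones(n2 + 1); c[n2] = n1          # epsilon element of V2 may absorb all of V1
    for _ in range(iters):                   # alternate row/column scaling
        u = r / (K @ v)
        v = c / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)
```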

Urban Radiance Fields

  • Authors: Konstantinos Rematas, Andrew Liu, Pratul P. Srinivasan, Jonathan T. Barron, Andrea Tagliasacchi, Thomas Funkhouser, Vittorio Ferrari
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
  • Arxiv link: https://arxiv.org/abs/2111.14643
  • Pdf link: https://arxiv.org/pdf/2111.14643
  • Abstract
    The goal of this work is to perform 3D reconstruction and novel view synthesis from data captured by scanning platforms commonly deployed for world mapping in urban outdoor environments (e.g., Street View). Given a sequence of posed RGB images and lidar sweeps acquired by cameras and scanners moving through an outdoor scene, we produce a model from which 3D surfaces can be extracted and novel RGB images can be synthesized. Our approach extends Neural Radiance Fields, which has been demonstrated to synthesize realistic novel images for small scenes in controlled settings, with new methods for leveraging asynchronously captured lidar data, for addressing exposure variation between captured images, and for leveraging predicted image segmentations to supervise densities on rays pointing at the sky. Each of these three extensions provides significant performance improvements in experiments on Street View data. Our system produces state-of-the-art 3D surface reconstructions and synthesizes higher quality novel views in comparison to both traditional methods (e.g., COLMAP) and recent neural representations (e.g., Mip-NeRF).

Optimal No-Regret Learning in General Games: Bounded Regret with Unbounded Step-Sizes via Clairvoyant MWU

  • Authors: Georgios Piliouras, Ryann Sim, Stratis Skoulakis
  • Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Theoretical Economics (econ.TH)
  • Arxiv link: https://arxiv.org/abs/2111.14737
  • Pdf link: https://arxiv.org/pdf/2111.14737
  • Abstract
    In this paper we solve the problem of no-regret learning in general games. Specifically, we provide a simple and practical algorithm that achieves constant regret with fixed step-sizes. The cumulative regret of our algorithm provably decreases linearly as the step-size increases. Our findings depart from the prevailing paradigm that vanishing step-sizes are a prerequisite for low regret as championed by all state-of-the-art methods to date. We shift away from this paradigm by defining a novel algorithm that we call Clairvoyant Multiplicative Weights Updates (CMWU). CMWU is Multiplicative Weights Updates (MWU) equipped with a mental model (jointly shared across all agents) about the state of the system in its next period. Each agent records its mixed strategy, i.e., its belief about what it expects to play in the next period, in this shared mental model, which is internally updated using MWU without any changes to the real-world behavior up until it equilibrates, thus marking its consistency with the next day's real-world outcome. It is then and only then that agents take action in the real world, effectively doing so with the "full knowledge" of the state of the system on the next day, i.e., they are clairvoyant. CMWU effectively acts as MWU with one day look-ahead, achieving bounded regret. At a technical level, we establish that self-consistent mental models exist for any choice of step-sizes and provide bounds on the step-size under which their uniqueness and linear-time computation are guaranteed via contraction mapping arguments. Our arguments extend well beyond normal-form games with little effort.
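
A rough sketch of one CMWU-style step: run MWU updates on the shared mental model until it (approximately) self-equilibrates, then commit the resulting strategies as the real play. This only illustrates the idea; the convergence conditions and step-size bounds are in the paper, and the function signature below is an assumption.

```python
import numpy as np

def clairvoyant_mwu_step(strategies, payoff_fns, eta=0.1, inner_iters=50):
    """One illustrative CMWU-style step.

    strategies: list of probability vectors, one per agent
    payoff_fns: payoff_fns[i](mental) -> payoff vector for agent i given all mental strategies
    """
    mental = [s.copy() for s in strategies]
    for _ in range(inner_iters):                         # fixed-point iteration on the mental model
        updated = []
        for i, s in enumerate(strategies):
            w = s * np.exp(eta * payoff_fns[i](mental))  # MWU against next-period beliefs
            updated.append(w / w.sum())
        mental = updated
    return mental                                         # strategies actually played next period
```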

Keyword: localization

DSC: Deep Scan Context Descriptor for Large-Scale Place Recognition

  • Authors: Jiafeng Cui, Tengfei Huang, Yingfeng Cai, Junqiao Zhao, Lu Xiong, Zhuoping Yu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.13838
  • Pdf link: https://arxiv.org/pdf/2111.13838
  • Abstract
    LiDAR-based place recognition is an essential and challenging task both in loop closure detection and global relocalization. We propose Deep Scan Context (DSC), a general and discriminative global descriptor that captures the relationship among segments of a point cloud. Unlike previous methods that utilize either semantics or a sequence of adjacent point clouds for better place recognition, we only use raw point clouds to get competitive results. Concretely, we first segment the point cloud egocentrically to acquire centroids and eigenvalues of the segments. Then, we introduce a graph neural network to aggregate these features into an embedding representation. Extensive experiments conducted on the KITTI dataset show that DSC is robust to scene variants and outperforms existing methods.

Asymptotic spectra of large (grid) graphs with a uniform local structure (part II): numerical applications

  • Authors: Andrea Adriani, Davide Bianchi, Paola Ferrari, Stefano Serra-Capizzano
  • Subjects: Numerical Analysis (math.NA); Analysis of PDEs (math.AP); Combinatorics (math.CO)
  • Arxiv link: https://arxiv.org/abs/2111.13859
  • Pdf link: https://arxiv.org/pdf/2111.13859
  • Abstract
    In the current work we are concerned with sequences of graphs having a grid geometry, with a uniform local structure in a bounded domain $\Omega\subset {\mathbb R}^d$, $d\ge 1$. When $\Omega=[0,1]$, such graphs include the standard Toeplitz graphs and, for $\Omega=[0,1]^d$, the considered class includes $d$-level Toeplitz graphs. In the general case, the underlying sequence of adjacency matrices has a canonical eigenvalue distribution, in the Weyl sense, and it has been shown in the theoretical part of this work that we can associate to it a symbol $\boldsymbol{\mathfrak{f}}$. The knowledge of the symbol and of its basic analytical features provides key information on the eigenvalue structure in terms of localization, spectral gap, clustering, and global distribution. In the present paper, many different applications are discussed and various numerical examples are presented in order to underline the practical use of the developed theory. Tests and applications are mainly obtained from the approximation of differential operators via numerical schemes such as Finite Differences (FDs), Finite Elements (FEs), and Isogeometric Analysis (IgA). Moreover, we show that more applications can be taken into account, since the results presented here can be applied as well to study the spectral properties of adjacency matrices and Laplacian operators of general large graphs and networks, whenever the involved matrices enjoy a uniform local structure.

Learning a Weight Map for Weakly-Supervised Localization

  • Authors: Tal Shaharabany, Lior Wolf
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.14131
  • Pdf link: https://arxiv.org/pdf/2111.14131
  • Abstract
    In the weakly supervised localization setting, supervision is given as an image-level label. We propose to employ an image classifier $f$ and to train a generative network $g$ that outputs, given the input image, a per-pixel weight map that indicates the location of the object within the image. Network $g$ is trained by minimizing the discrepancy between the output of the classifier $f$ on the original image and its output given the same image weighted by the output of $g$. The scheme requires a regularization term that ensures that $g$ does not provide a uniform weight, and an early stopping criterion in order to prevent $g$ from over-segmenting the image. Our results indicate that the method outperforms existing localization methods by a sizable margin on the challenging fine-grained classification datasets, as well as a generic image recognition dataset. Additionally, the obtained weight map is also state-of-the-art in weakly supervised segmentation in fine-grained categorization datasets.
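
As a rough illustration of the training objective described above, the sketch below (PyTorch, illustrative names only) minimizes the discrepancy between the classifier's output on the original image and on the image weighted by the output of g, plus a simple regularizer; the paper's exact regularizer and early-stopping criterion are not reproduced here.

```python
# Minimal sketch, assuming f_net is a frozen classifier and g_net outputs a
# per-pixel weight map in [0, 1]; names and the regularizer are illustrative.
import torch
import torch.nn.functional as F

def weight_map_loss(f_net, g_net, images, lambda_reg=1.0):
    with torch.no_grad():
        target_logits = f_net(images)            # classifier output on the original image
    weights = g_net(images)                      # (B, 1, H, W) weight map
    masked_logits = f_net(images * weights)      # classifier output on the weighted image
    discrepancy = F.mse_loss(masked_logits, target_logits)
    regularizer = weights.mean()                 # one plausible way to discourage a trivial all-ones map
    return discrepancy + lambda_reg * regularizer
```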

FashionSearchNet-v2: Learning Attribute Representations with Localization for Image Retrieval with Attribute Manipulation

  • Authors: Kenan E. Ak, Joo Hwee Lim, Ying Sun, Jo Yew Tham, Ashraf A. Kassim
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.14145
  • Pdf link: https://arxiv.org/pdf/2111.14145
  • Abstract
    The focus of this paper is on the problem of image retrieval with attribute manipulation. Our proposed work is able to manipulate the desired attributes of the query image while maintaining its other attributes. For example, the collar attribute of the query image can be changed from round to v-neck to retrieve similar images from a large dataset. A key challenge in e-commerce is that images have multiple attributes where users would like to manipulate and it is important to estimate discriminative feature representations for each of these attributes. The proposed FashionSearchNet-v2 architecture is able to learn attribute specific representations by leveraging on its weakly-supervised localization module, which ignores the unrelated features of attributes in the feature space, thus improving the similarity learning. The network is jointly trained with the combination of attribute classification and triplet ranking loss to estimate local representations. These local representations are then merged into a single global representation based on the instructed attribute manipulation where desired images can be retrieved with a distance metric. The proposed method also provides explainability for its retrieval process to help provide additional information on the attention of the network. Experiments performed on several datasets that are rich in terms of the number of attributes show that FashionSearchNet-v2 outperforms the other state-of-the-art attribute manipulation techniques. Different than our earlier work (FashionSearchNet), we propose several improvements in the learning procedure and show that the proposed FashionSearchNet-v2 can be generalized to different domains other than fashion.

Make an Omelette with Breaking Eggs: Zero-Shot Learning for Novel Attribute Synthesis

  • Authors: Yu Hsuan Li, Tzu-Yin Chao, Ching-Chun Huang, Pin-Yu Chen, Wei-Chen Chiu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.14182
  • Pdf link: https://arxiv.org/pdf/2111.14182
  • Abstract
    Most of the existing algorithms for zero-shot classification problems typically rely on the attribute-based semantic relations among categories to realize the classification of novel categories without observing any of their instances. However, training the zero-shot classification models still requires attribute labeling for each class (or even instance) in the training dataset, which is also expensive. To this end, in this paper, we bring up a new problem scenario: "Are we able to derive zero-shot learning for novel attribute detectors/classifiers and use them to automatically annotate the dataset for labeling efficiency?" Basically, given only a small set of detectors that are learned to recognize some manually annotated attributes (i.e., the seen attributes), we aim to synthesize the detectors of novel attributes in a zero-shot learning manner. Our proposed method, Zero Shot Learning for Attributes (ZSLA), which is the first of its kind to the best of our knowledge, tackles this new research problem by applying the set operations to first decompose the seen attributes into their basic attributes and then recombine these basic attributes into the novel ones. Extensive experiments are conducted to verify the capacity of our synthesized detectors for accurately capturing the semantics of the novel attributes and show their superior performance in terms of detection and localization compared to other baseline approaches. Moreover, with using only 32 seen attributes on the Caltech-UCSD Birds-200-2011 dataset, our proposed method is able to synthesize other 207 novel attributes, while various generalized zero-shot classification algorithms trained upon the dataset re-annotated by our synthesized attribute detectors are able to provide comparable performance with those trained with the manual ground-truth annotations.

WiFi-based Multi-task Sensing

  • Authors: Xie Zhang, Chengpei Tang, Yasong An, Kang Yin
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.14619
  • Pdf link: https://arxiv.org/pdf/2111.14619
  • Abstract
    WiFi-based sensing has aroused immense attention over recent years. The rationale is that the signal fluctuations caused by humans carry the information of human behavior which can be extracted from the channel state information of WiFi. Still, the prior studies mainly focus on single-task sensing (STS), e.g., gesture recognition, indoor localization, user identification. Since the fluctuations caused by gestures are highly coupling with body features and the user's location, we propose a WiFi-based multi-task sensing model (Wimuse) to perform gesture recognition, indoor localization, and user identification tasks simultaneously. However, these tasks have different difficulty levels (i.e., imbalance issue) and need task-specific information (i.e., discrepancy issue). To address these issues, the knowledge distillation technique and task-specific residual adaptor are adopted in Wimuse. We first train the STS model for each task. Then, for solving the imbalance issue, the extracted common feature in Wimuse is encouraged to get close to the counterpart features of the STS models. Further, for each task, a task-specific residual adaptor is applied to extract the task-specific compensation feature which is fused with the common feature to address the discrepancy issue. We conduct comprehensive experiments on three public datasets and evaluation suggests that Wimuse achieves state-of-the-art performance with the average accuracy of 85.20%, 98.39%, and 98.725% on the joint task of gesture recognition, indoor localization, and user identification, respectively.
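
A hedged sketch of the two ingredients named in the abstract: a feature-distillation term pulling the shared feature toward the single-task models, and a task-specific residual adaptor. The module sizes and loss weighting are assumptions, not the paper's configuration.

```python
# Illustrative only: distillation toward frozen single-task features plus a
# residual adaptor that adds task-specific compensation to the common feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualAdaptor(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, common_feat):
        return common_feat + self.fc(common_feat)   # common + task-specific compensation

def distill_loss(common_feat, sts_feats):
    # sts_feats: features from the frozen single-task models (one per task)
    return sum(F.mse_loss(common_feat, f.detach()) for f in sts_feats)

adaptor = ResidualAdaptor(128)
common = torch.randn(4, 128)
print(adaptor(common).shape, distill_loss(common, [torch.randn(4, 128)]))
```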

DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion

  • Authors: Peize Sun, Jinkun Cao, Yi Jiang, Zehuan Yuan, Song Bai, Kris Kitani, Ping Luo
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.14690
  • Pdf link: https://arxiv.org/pdf/2111.14690
  • Abstract
    A typical pipeline for multi-object tracking (MOT) is to use a detector for object localization, and following re-identification (re-ID) for object association. This pipeline is partially motivated by recent progress in both object detection and re-ID, and partially motivated by biases in existing tracking datasets, where most objects tend to have distinguishing appearance and re-ID models are sufficient for establishing associations. In response to such bias, we would like to re-emphasize that methods for multi-object tracking should also work when object appearance is not sufficiently discriminative. To this end, we propose a large-scale dataset for multi-human tracking, where humans have similar appearance, diverse motion and extreme articulation. As the dataset contains mostly group dancing videos, we name it "DanceTrack". We expect DanceTrack to provide a better platform to develop more MOT algorithms that rely less on visual discrimination and depend more on motion analysis. We benchmark several state-of-the-art trackers on our dataset and observe a significant performance drop on DanceTrack when compared against existing benchmarks. The dataset, project code and competition server are released at: https://github.com/DanceTrack.

FaceAtlasAR: Atlas of Facial Acupuncture Points in Augmented Reality

  • Authors: Menghe Zhang, Jurgen Schulze, Dong Zhang
  • Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
  • Arxiv link: https://arxiv.org/abs/2111.14755
  • Pdf link: https://arxiv.org/pdf/2111.14755
  • Abstract
    Acupuncture is a technique in which practitioners stimulate specific points on the body. These points, called acupuncture points (or acupoints), anatomically define areas on the skin relative to some landmarks on the body. Traditional acupuncture treatment relies on experienced acupuncturists for precise positioning of acupoints. A novice typically finds it difficult because of the lack of visual cues. This project presents FaceAtlasAR, a prototype system that localizes and visualizes facial acupoints in an augmented reality (AR) context. The system aims to 1) localize facial acupoints and auricular zone map in an anatomical yet feasible way, 2) overlay the requested acupoints by category in AR, and 3) show auricular zone map on the ears. We adopt Mediapipe, a cross-platform machine learning framework, to build the pipeline that runs on desktop and Android phones. We perform experiments on different benchmarks, including "In-the-wild", AMI ear datasets, and our own annotated datasets. Results show the localization accuracy of 95% for facial acupoints, 99% / 97% ("In-the-wild" / AMI) for auricular zone map, and high robustness. With this system, users, even not professionals, can position the acupoints quickly for their self-acupressure treatments.

New submissions for Fri, 4 Jun 21

Keyword: SLAM

There is no result

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: autonomous driving

Multiscale Domain Adaptive YOLO for Cross-Domain Object Detection

  • Authors: Mazin Hnewa, Hayder Radha
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.01483
  • Pdf link: https://arxiv.org/pdf/2106.01483
  • Abstract
    The area of domain adaptation has been instrumental in addressing the domain shift problem encountered by many applications. This problem arises due to the difference between the distributions of source data used for training in comparison with target data used during realistic testing scenarios. In this paper, we introduce a novel MultiScale Domain Adaptive YOLO (MS-DAYOLO) framework that employs multiple domain adaptation paths and corresponding domain classifiers at different scales of the recently introduced YOLOv4 object detector to generate domain-invariant features. We train and test our proposed method using popular datasets. Our experiments show significant improvements in object detection performance when training YOLOv4 using the proposed MS-DAYOLO and when tested on target data representing challenging weather conditions for autonomous driving applications.

DeepCompress: Efficient Point Cloud Geometry Compression

  • Authors: Ryan Killea, Yun Li, Saeed Bastani, Paul McLachlan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2106.01504
  • Pdf link: https://arxiv.org/pdf/2106.01504
  • Abstract
    Point clouds are a basic data type that is increasingly of interest as 3D content becomes more ubiquitous. Applications using point clouds include virtual, augmented, and mixed reality and autonomous driving. We propose a more efficient deep learning-based encoder architecture for point clouds compression that incorporates principles from established 3D object detection and image compression architectures. Through an ablation study, we show that incorporating the learned activation function from Computational Efficient Neural Image Compression (CENIC) and designing more parameter-efficient convolutional blocks yields dramatic gains in efficiency and performance. Our proposed architecture incorporates Generalized Divisive Normalization activations and proposes a spatially separable InceptionV4-inspired block. We then evaluate rate-distortion curves on the standard JPEG Pleno 8i Voxelized Full Bodies dataset to evaluate our model's performance. Our proposed modifications outperform the baseline approaches by a small margin in terms of Bjontegaard delta rate and PSNR values, yet reduces necessary encoder convolution operations by 8 percent and reduces total encoder parameters by 20 percent. Our proposed architecture, when considered on its own, has a small penalty of 0.02 percent in Chamfer's Distance and 0.32 percent increased bit rate in Point to Plane Distance for the same peak signal-to-noise ratio.

Short-term Maintenance Planning of Autonomous Trucks for Minimizing Economic Risk

  • Authors: Xin Tao, Jonas Mårtensson, Håkan Warnquist, Anna Pernestål
  • Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2106.01871
  • Pdf link: https://arxiv.org/pdf/2106.01871
  • Abstract
    New autonomous driving technologies are emerging every day and some of them have been commercially applied in the real world. While benefiting from these technologies, autonomous trucks are facing new challenges in short-term maintenance planning, which directly influences the truck operator's profit. In this paper, we implement a vehicle health management system by addressing the maintenance planning issues of autonomous trucks on a transport mission. We also present a maintenance planning model using a risk-based decision-making method, which identifies the maintenance decision with minimal economic risk of the truck company. Both availability losses and maintenance costs are considered when evaluating the economic risk. We demonstrate the proposed model by numerical experiments illustrating real-world scenarios. In the experiments, compared to three baseline methods, the expected economic risk of the proposed method is reduced by up to 47%. We also conduct sensitivity analyses of different model parameters. The analyses show that the economic risk significantly decreases when the estimation accuracy of remaining useful life, the maximal allowed time of delivery delay before order cancellation, or the number of workshops increases. The experiment results contribute to identifying future research and development attentions of autonomous trucks from an economic perspective.

New submissions for Tue, 8 Jun 21

Keyword: SLAM

FedNL: Making Newton-Type Methods Applicable to Federated Learning

  • Authors: Mher Safaryan, Rustem Islamov, Xun Qian, Peter Richtárik
  • Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
  • Arxiv link: https://arxiv.org/abs/2106.02969
  • Pdf link: https://arxiv.org/pdf/2106.02969
  • Abstract
    Inspired by recent work of Islamov et al (2021), we propose a family of Federated Newton Learn (FedNL) methods, which we believe is a marked step in the direction of making second-order methods applicable to FL. In contrast to the aforementioned work, FedNL employs a different Hessian learning technique which i) enhances privacy as it does not rely on the training data to be revealed to the coordinating server, ii) makes it applicable beyond generalized linear models, and iii) provably works with general contractive compression operators for compressing the local Hessians, such as Top-$K$ or Rank-$R$, which are vastly superior in practice. Notably, we do not need to rely on error feedback for our methods to work with contractive compressors. Moreover, we develop FedNL-PP, FedNL-CR and FedNL-LS, which are variants of FedNL that support partial participation, and globalization via cubic regularization and line search, respectively, and FedNL-BC, which is a variant that can further benefit from bidirectional compression of gradients and models, i.e., smart uplink gradient and smart downlink model compression. We prove local convergence rates that are independent of the condition number, the number of training data points, and compression variance. Our communication efficient Hessian learning technique provably learns the Hessian at the optimum. Finally, we perform a variety of numerical experiments that show that our FedNL methods have state-of-the-art communication complexity when compared to key baselines.
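
For readers unfamiliar with the contractive compressors mentioned here, a generic Top-K operator (keep the k largest-magnitude entries, zero the rest) looks roughly like this; it is a standard construction, not the authors' implementation.

```python
# Generic Top-K compressor of the kind referenced in the abstract (illustrative).
import numpy as np

def top_k_compress(matrix, k):
    """Keep the k largest-magnitude entries of a (Hessian) matrix, zero the rest."""
    flat = matrix.flatten()
    if k >= flat.size:
        return matrix.copy()
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the k largest magnitudes
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(matrix.shape)

if __name__ == "__main__":
    H = np.random.randn(5, 5)
    print(top_k_compress(H, 5))
```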

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

Radar-Camera Pixel Depth Association for Depth Completion

  • Authors: Yunfei Long, Daniel Morris, Xiaoming Liu, Marcos Castro, Punarjay Chakravarty, Praveen Narayanan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.02778
  • Pdf link: https://arxiv.org/pdf/2106.02778
  • Abstract
    While radar and video data can be readily fused at the detection level, fusing them at the pixel level is potentially more beneficial. This is also more challenging in part due to the sparsity of radar, but also because automotive radar beams are much wider than a typical pixel combined with a large baseline between camera and radar, which results in poor association between radar pixels and color pixels. A consequence is that depth completion methods designed for LiDAR and video fare poorly for radar and video. Here we propose a radar-to-pixel association stage which learns a mapping from radar returns to pixels. This mapping also serves to densify radar returns. Using this as a first stage, followed by a more traditional depth completion method, we are able to achieve image-guided depth completion with radar and video. We demonstrate performance superior to camera and radar alone on the nuScenes dataset. Our source code is available at https://github.com/longyunf/rc-pda.

Brno Urban Dataset: Winter Extension

  • Authors: Adam Ligocki, Ales Jelinek, Ludek Zalud
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.02952
  • Pdf link: https://arxiv.org/pdf/2106.02952
  • Abstract
    Research on autonomous driving is advancing dramatically and requires new data and techniques to progress even further. To reflect this pressure, we present an extension of our recent work - the Brno Urban Dataset (BUD). The new data focus on winter conditions in various snow-covered environments and feature additional LiDAR and radar sensors for object detection in front of the vehicle. The improvement affects the old data as well. We provide YOLO detection annotations for all old RGB images in the dataset. The detections are further also transferred by our original algorithm into the infra-red (IR) images, captured by the thermal camera. To our best knowledge, it makes this dataset the largest source of machine-annotated thermal images currently available. The dataset is published under MIT license on https://github.com/Robotics-BUT/Brno-Urban-Dataset.

Stein ICP for Uncertainty Estimation in Point Cloud Matching

  • Authors: Fahira Afzal Maken, Fabio Ramos, Lionel Ott
  • Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2106.03287
  • Pdf link: https://arxiv.org/pdf/2106.03287
  • Abstract
    Quantification of uncertainty in point cloud matching is critical in many tasks such as pose estimation, sensor fusion, and grasping. Iterative closest point (ICP) is a commonly used pose estimation algorithm which provides a point estimate of the transformation between two point clouds. There are many sources of uncertainty in this process that may arise due to sensor noise, ambiguous environment, and occlusion. However, for safety critical problems such as autonomous driving, a point estimate of the pose transformation is not sufficient as it does not provide information about the multiple solutions. Current probabilistic ICP methods usually do not capture all sources of uncertainty and may provide unreliable transformation estimates which can have a detrimental effect in state estimation or decision making tasks that use this information. In this work we propose a new algorithm to align two point clouds that can precisely estimate the uncertainty of ICP's transformation parameters. We develop a Stein variational inference framework with gradient based optimization of ICP's cost function. The method provides a non-parametric estimate of the transformation, can model complex multi-modal distributions, and can be effectively parallelized on a GPU. Experiments using 3D kinect data as well as sparse indoor/outdoor LiDAR data show that our method is capable of efficiently producing accurate pose uncertainty estimates.

Cost-effective Mapping of Mobile Robot Based on the Fusion of UWB and Short-range 2D LiDAR

  • Authors: Ran Liu, Yongping He, Chau Yuen, Billy Pik Lik Lau, Rashid Ali, Wenpeng Fu, Zhiqiang Cao
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.03648
  • Pdf link: https://arxiv.org/pdf/2106.03648
  • Abstract
    Environment mapping is an essential prerequisite for mobile robots to perform different tasks such as navigation and mission planning. With the availability of low-cost 2D LiDARs, there are increasing applications of such 2D LiDARs in industrial environments. However, environment mapping in an unknown and feature-less environment with such low-cost 2D LiDARs remains a challenge. The challenge mainly originates from the short-range of LiDARs and complexities in performing scan matching in these environments. In order to resolve these shortcomings, we propose to fuse the ultra-wideband (UWB) with 2D LiDARs to improve the mapping quality of a mobile robot. The optimization-based approach is utilized for the fusion of UWB ranging information and odometry to first optimize the trajectory. Then the LiDAR-based loop closures are incorporated to improve the accuracy of the trajectory estimation. Finally, the optimized trajectory is combined with the LiDAR scans to produce the occupancy map of the environment. The performance of the proposed approach is evaluated in an indoor feature-less environment with a size of 20m*20m. Obtained results show that the mapping error of the proposed scheme is 85.5% less than that of the conventional GMapping algorithm with short-range LiDAR (for example Hokuyo URG-04LX in our experiment with a maximum range of 5.6m).
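
A toy illustration (assuming scipy is available; anchor layout, noise levels, and weighting are invented) of the general idea of jointly optimizing a trajectory against odometry increments and UWB range measurements with nonlinear least squares:

```python
# Toy sketch of odometry + UWB range fusion by nonlinear least squares.
# Not the paper's formulation; quantities below are placeholders.
import numpy as np
from scipy.optimize import least_squares

anchors = np.array([[0.0, 0.0], [20.0, 0.0], [0.0, 20.0]])   # assumed UWB anchor positions

def residuals(x, odom, ranges):
    poses = x.reshape(-1, 2)                       # 2D positions along the trajectory
    res = [((poses[1:] - poses[:-1]) - odom).ravel()]          # odometry residuals
    for a in range(anchors.shape[0]):                          # UWB range residuals
        d = np.linalg.norm(poses - anchors[a], axis=1)
        res.append(d - ranges[:, a])
    return np.concatenate(res)

if __name__ == "__main__":
    true = np.cumsum(np.tile([1.0, 0.5], (10, 1)), axis=0)     # toy ground-truth path
    odom = np.diff(true, axis=0) + 0.05 * np.random.randn(9, 2)
    ranges = np.linalg.norm(true[:, None, :] - anchors[None], axis=2) + 0.1 * np.random.randn(10, 3)
    sol = least_squares(residuals, x0=np.zeros(20), args=(odom, ranges))
    print(sol.x.reshape(-1, 2))                                # optimized trajectory
```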

Keyword: loop detection

There is no result

Keyword: autonomous driving

Constrained Generalized Additive 2 Model with Consideration of High-Order Interactions

  • Authors: Akihisa Watanabe, Michiya Kuramata, Kaito Majima, Haruka Kiyohara, Kensho Kondo, Kazuhide Nakata
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2106.02836
  • Pdf link: https://arxiv.org/pdf/2106.02836
  • Abstract
    In recent years, machine learning and AI have been introduced in many industrial fields. In fields such as finance, medicine, and autonomous driving, where the inference results of a model may have serious consequences, high interpretability as well as prediction accuracy is required. In this study, we propose CGA2M+, which is based on the Generalized Additive 2 Model (GA2M) and differs from it in two major ways. The first is the introduction of monotonicity. Imposing monotonicity on some functions based on an analyst's knowledge is expected to improve not only interpretability but also generalization performance. The second is the introduction of a higher-order term: given that GA2M considers only second-order interactions, we aim to balance interpretability and prediction accuracy by introducing a higher-order term that can capture higher-order interactions. In this way, we can improve prediction performance without compromising interpretability by applying learning innovation. Numerical experiments showed that the proposed model has high predictive performance and interpretability. Furthermore, we confirmed that generalization performance is improved by introducing monotonicity.

Brno Urban Dataset: Winter Extension

  • Authors: Adam Ligocki, Ales Jelinek, Ludek Zalud
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.02952
  • Pdf link: https://arxiv.org/pdf/2106.02952
  • Abstract
    Research on autonomous driving is advancing dramatically and requires new data and techniques to progress even further. To reflect this pressure, we present an extension of our recent work - the Brno Urban Dataset (BUD). The new data focus on winter conditions in various snow-covered environments and feature additional LiDAR and radar sensors for object detection in front of the vehicle. The improvement affects the old data as well. We provide YOLO detection annotations for all old RGB images in the dataset. The detections are further also transferred by our original algorithm into the infra-red (IR) images, captured by the thermal camera. To our best knowledge, it makes this dataset the largest source of machine-annotated thermal images currently available. The dataset is published under MIT license on https://github.com/Robotics-BUT/Brno-Urban-Dataset.

Stein ICP for Uncertainty Estimation in Point Cloud Matching

  • Authors: Fahira Afzal Maken, Fabio Ramos, Lionel Ott
  • Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2106.03287
  • Pdf link: https://arxiv.org/pdf/2106.03287
  • Abstract
    Quantification of uncertainty in point cloud matching is critical in many tasks such as pose estimation, sensor fusion, and grasping. Iterative closest point (ICP) is a commonly used pose estimation algorithm which provides a point estimate of the transformation between two point clouds. There are many sources of uncertainty in this process that may arise due to sensor noise, ambiguous environment, and occlusion. However, for safety critical problems such as autonomous driving, a point estimate of the pose transformation is not sufficient as it does not provide information about the multiple solutions. Current probabilistic ICP methods usually do not capture all sources of uncertainty and may provide unreliable transformation estimates which can have a detrimental effect in state estimation or decision making tasks that use this information. In this work we propose a new algorithm to align two point clouds that can precisely estimate the uncertainty of ICP's transformation parameters. We develop a Stein variational inference framework with gradient based optimization of ICP's cost function. The method provides a non-parametric estimate of the transformation, can model complex multi-modal distributions, and can be effectively parallelized on a GPU. Experiments using 3D kinect data as well as sparse indoor/outdoor LiDAR data show that our method is capable of efficiently producing accurate pose uncertainty estimates.

Self-supervised Depth Estimation Leveraging Global Perception and Geometric Smoothness Using On-board Videos

  • Authors: Shaocheng Jia, Xin Pei, Wei Yao, S.C. Wong
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.03505
  • Pdf link: https://arxiv.org/pdf/2106.03505
  • Abstract
    Self-supervised depth estimation has drawn much attention in recent years as it does not require labeled data but image sequences. Moreover, it can be conveniently used in various applications, such as autonomous driving, robotics, realistic navigation, and smart cities. However, extracting global contextual information from images and predicting a geometrically natural depth map remain challenging. In this paper, we present DLNet for pixel-wise depth estimation, which simultaneously extracts global and local features with the aid of our depth Linformer block. This block consists of the Linformer and innovative soft split multi-layer perceptron blocks. Moreover, a three-dimensional geometry smoothness loss is proposed to predict a geometrically natural depth map by imposing the second-order smoothness constraint on the predicted three-dimensional point clouds, thereby realizing improved performance as a byproduct. Finally, we explore the multi-scale prediction strategy and propose the maximum margin dual-scale prediction strategy for further performance improvement. In experiments on the KITTI and Make3D benchmarks, the proposed DLNet achieves performance competitive to those of the state-of-the-art methods, reducing time and space complexities by more than 62% and 56%, respectively. Extensive testing on various real-world situations further demonstrates the strong practicality and generalization capability of the proposed model.

Learning without Knowing: Unobserved Context in Continuous Transfer Reinforcement Learning

  • Authors: Chenyu Liu, Yan Zhang, Yi Shen, Michael M. Zavlanos
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2106.03833
  • Pdf link: https://arxiv.org/pdf/2106.03833
  • Abstract
    In this paper, we consider a transfer Reinforcement Learning (RL) problem in continuous state and action spaces, under unobserved contextual information. For example, the context can represent the mental view of the world that an expert agent has formed through past interactions with this world. We assume that this context is not accessible to a learner agent who can only observe the expert data. Then, our goal is to use the context-aware expert data to learn an optimal context-unaware policy for the learner using only a few new data samples. Such problems are typically solved using imitation learning that assumes that both the expert and learner agents have access to the same information. However, if the learner does not know the expert context, using the expert data alone will result in a biased learner policy and will require many new data samples to improve. To address this challenge, in this paper, we formulate the learning problem as a causal bound-constrained Multi-Armed-Bandit (MAB) problem. The arms of this MAB correspond to a set of basis policy functions that can be initialized in an unsupervised way using the expert data and represent the different expert behaviors affected by the unobserved context. On the other hand, the MAB constraints correspond to causal bounds on the accumulated rewards of these basis policy functions that we also compute from the expert data. The solution to this MAB allows the learner agent to select the best basis policy and improve it online. And the use of causal bounds reduces the exploration variance and, therefore, improves the learning rate. We provide numerical experiments on an autonomous driving example that show that our proposed transfer RL method improves the learner's policy faster compared to existing imitation learning methods and enjoys much lower variance during training.

Tunable Trajectory Planner Using G3 Curves

  • Authors: Alexander Botros, Stephen L. Smith
  • Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2106.03836
  • Pdf link: https://arxiv.org/pdf/2106.03836
  • Abstract
    Trajectory planning is commonly used as part of a local planner in autonomous driving. This paper considers the problem of planning a continuous-curvature-rate trajectory between fixed start and goal states that minimizes a tunable trade-off between passenger comfort and travel time. The problem is an instance of infinite dimensional optimization over two continuous functions: a path, and a velocity profile. We propose a simplification of this problem that facilitates the discretization of both functions. This paper also proposes a method to quickly generate minimal-length paths between start and goal states based on a single tuning parameter: the second derivative of curvature. Furthermore, we discretize the set of velocity profiles along a given path into a selection of acceleration way-points along the path. Gradient-descent is then employed to minimize cost over feasible choices of the second derivative of curvature, and acceleration way-points, resulting in a method that repeatedly solves the path and velocity profiles in an iterative fashion. Numerical examples are provided to illustrate the benefits of the proposed methods.

New submissions for Fri, 29 Oct 21

Keyword: SLAM

Efficient Placard Discovery for Semantic Mapping During Frontier Exploration

  • Authors: David Balaban, Harshavardhan Jagannathan, Henry Liu, Justin Hart
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2110.14742
  • Pdf link: https://arxiv.org/pdf/2110.14742
  • Abstract
    Semantic mapping is the task of providing a robot with a map of its environment beyond the open, navigable space of traditional Simultaneous Localization and Mapping (SLAM) algorithms by attaching semantics to locations. The system presented in this work reads door placards to annotate the locations of offices. Whereas prior work on this system developed hand-crafted detectors, this system leverages YOLOv2 for detection and a segmentation network for segmentation. Placards are localized by computing their pose from a homography computed from a segmented quadrilateral outline. This work also introduces an Interruptable Frontier Exploration algorithm, enabling the robot to explore its environment to construct its SLAM map while pausing to inspect placards observed during this process. This allows the robot to autonomously discover room placards without human intervention while speeding up significantly over previous autonomous exploration methods.

Millimeter Wave Wireless-Assisted Robotic Navigation with Link State Classification

  • Authors: Mingsheng Yin (1), Akshaj Veldanda (1), Amee Trivedi (2), Jeff Zhang (3), Kai Pfeiffer (1), Yaqi Hu (1), Siddharth Garg (1), Elza Erkip (1), Ludovic Righetti (1), Sundeep Rangan (1) ((1) NYU Tandon School of Engineering, (2) University of British Columbia, (3) Harvard University)
  • Subjects: Robotics (cs.RO); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2110.14789
  • Pdf link: https://arxiv.org/pdf/2110.14789
  • Abstract
    The millimeter wave (mmWave) bands have attracted considerable attention for high precision localization applications due to the ability to capture high angular and temporal resolution measurements. This paper explores mmWave-based positioning for a target localization problem where a fixed target broadcasts mmWave signals and a mobile robotic agent attempts to listen to the signals to locate and navigate to the target. A three-stage procedure is proposed: First, the mobile agent uses tensor decomposition methods to detect the wireless paths and their angles. Second, a machine-learning trained classifier is then used to predict the link state, meaning if the strongest path is line-of-sight (LOS) or non-LOS (NLOS). For the NLOS case, the link state predictor also determines if the strongest path arrived via one or more reflections. Third, based on the link state, the agent either follows the estimated angles or explores the environment. The method is demonstrated on a large dataset of indoor environments supplemented with ray tracing to simulate the wireless propagation. The path estimation and link state classification are also integrated into a state-of-the-art neural simultaneous localization and mapping (SLAM) module to augment camera and LIDAR-based navigation. It is shown that the link state classifier can successfully generalize to completely new environments outside the training set. In addition, the neural-SLAM module with the wireless path estimation and link state classifier provides rapid navigation to the target, close to a baseline that knows the target location.
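
As a loose illustration of the second stage, a link-state classifier of this kind could be prototyped as below; the features, labels, and model choice are placeholders (scikit-learn assumed available), not the paper's pipeline or data.

```python
# Illustrative link-state classifier (LOS / NLOS one bounce / NLOS multi-bounce)
# trained on random placeholder features; not the paper's model or dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6))          # stand-ins for per-path features (power, delay, angle spread, ...)
y = rng.integers(0, 3, size=600)       # 0 = LOS, 1 = NLOS one reflection, 2 = NLOS multi-bounce

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[:500], y[:500])
print("held-out accuracy (random labels, so ~chance):", clf.score(X[500:], y[500:]))
```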

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

Learning Actions for Drift-Free Navigation in Highly Dynamic Scenes

  • Authors: Mohd Omama, Sundar Sripada V. S., Sandeep Chinchali, K. Madhava Krishna
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2110.14928
  • Pdf link: https://arxiv.org/pdf/2110.14928
  • Abstract
    We embark on a hitherto unreported problem of an autonomous robot (self-driving car) navigating in dynamic scenes in a manner that reduces its localization error and eventual cumulative drift or Absolute Trajectory Error, which is pronounced in such dynamic scenes. With the hugely popular Velodyne-16 3D LIDAR as the main sensing modality, and the accurate LIDAR-based Localization and Mapping algorithm, LOAM, as the state estimation framework, we show that in the absence of a navigation policy, drift rapidly accumulates in the presence of moving objects. To overcome this, we learn actions that lead to drift-minimized navigation through a suitable set of reward and penalty functions. We use Proximal Policy Optimization, a class of Deep Reinforcement Learning methods, to learn the actions that result in drift-minimized trajectories. We show by extensive comparisons on a variety of synthetic, yet photo-realistic scenes made available through the CARLA Simulator the superior performance of the proposed framework vis-a-vis methods that do not adopt such policies.

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

Millimeter Wave Wireless-Assisted Robotic Navigation with Link State Classification

  • Authors: Mingsheng Yin (1), Akshaj Veldanda (1), Amee Trivedi (2), Jeff Zhang (3), Kai Pfeiffer (1), Yaqi Hu (1), Siddharth Garg (1), Elza Erkip (1), Ludovic Righetti (1), Sundeep Rangan (1) ((1) NYU Tandon School of Engineering, (2) University of British Columbia, (3) Harvard University)
  • Subjects: Robotics (cs.RO); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2110.14789
  • Pdf link: https://arxiv.org/pdf/2110.14789
  • Abstract
    The millimeter wave (mmWave) bands have attracted considerable attention for high precision localization applications due to the ability to capture high angular and temporal resolution measurements. This paper explores mmWave-based positioning for a target localization problem where a fixed target broadcasts mmWave signals and a mobile robotic agent attempts to listen to the signals to locate and navigate to the target. A three-stage procedure is proposed: First, the mobile agent uses tensor decomposition methods to detect the wireless paths and their angles. Second, a machine-learning trained classifier is then used to predict the link state, meaning if the strongest path is line-of-sight (LOS) or non-LOS (NLOS). For the NLOS case, the link state predictor also determines if the strongest path arrived via one or more reflections. Third, based on the link state, the agent either follows the estimated angles or explores the environment. The method is demonstrated on a large dataset of indoor environments supplemented with ray tracing to simulate the wireless propagation. The path estimation and link state classification are also integrated into a state-of-the-art neural simultaneous localization and mapping (SLAM) module to augment camera and LIDAR-based navigation. It is shown that the link state classifier can successfully generalize to completely new environments outside the training set. In addition, the neural-SLAM module with the wireless path estimation and link state classifier provides rapid navigation to the target, close to a baseline that knows the target location.

3D Object Tracking with Transformer

  • Authors: Yubo Cui, Zheng Fang, Jiayao Shan, Zuoxu Gu, Sifan Zhou
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.14921
  • Pdf link: https://arxiv.org/pdf/2110.14921
  • Abstract
    Feature fusion and similarity computation are two core problems in 3D object tracking, especially for object tracking using sparse and disordered point clouds. Feature fusion could make similarity computing more efficient by including target object information. However, most existing LiDAR-based approaches directly use the extracted point cloud feature to compute similarity while ignoring the attention changes of object regions during tracking. In this paper, we propose a feature fusion network based on transformer architecture. Benefiting from the self-attention mechanism, the transformer encoder captures the inter- and intra- relations among different regions of the point cloud. By using cross-attention, the transformer decoder fuses features and includes more target cues into the current point cloud feature to compute the region attentions, which makes the similarity computing more efficient. Based on this feature fusion network, we propose an end-to-end point cloud object tracking framework, a simple yet effective method for 3D object tracking using point clouds. Comprehensive experimental results on the KITTI dataset show that our method achieves new state-of-the-art performance. Code is available at: https://github.com/3bobo/lttr.

Learning Actions for Drift-Free Navigation in Highly Dynamic Scenes

  • Authors: Mohd Omama, Sundar Sripada V. S., Sandeep Chinchali, K. Madhava Krishna
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2110.14928
  • Pdf link: https://arxiv.org/pdf/2110.14928
  • Abstract
    We embark on a hitherto unreported problem of an autonomous robot (self-driving car) navigating in dynamic scenes in a manner that reduces its localization error and eventual cumulative drift or Absolute Trajectory Error, which is pronounced in such dynamic scenes. With the hugely popular Velodyne-16 3D LIDAR as the main sensing modality, and the accurate LIDAR-based Localization and Mapping algorithm, LOAM, as the state estimation framework, we show that in the absence of a navigation policy, drift rapidly accumulates in the presence of moving objects. To overcome this, we learn actions that lead to drift-minimized navigation through a suitable set of reward and penalty functions. We use Proximal Policy Optimization, a class of Deep Reinforcement Learning methods, to learn the actions that result in drift-minimized trajectories. We show by extensive comparisons on a variety of synthetic, yet photo-realistic scenes made available through the CARLA Simulator the superior performance of the proposed framework vis-a-vis methods that do not adopt such policies.

Keyword: loop detection

There is no result

Keyword: autonomous driving

DDK: A Deep Koopman Approach for Dynamics Modeling and Trajectory Tracking of Autonomous Vehicles

  • Authors: Yongqian Xiao
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2110.14700
  • Pdf link: https://arxiv.org/pdf/2110.14700
  • Abstract
    Autonomous driving has attracted lots of attention in recent years. An accurate vehicle dynamics is important for autonomous driving techniques, e.g. trajectory prediction, motion planning, and control of trajectory tracking. Although previous works have made some results, the strong nonlinearity, precision, and interpretability of dynamics for autonomous vehicles are open problems worth being studied. In this paper, the approach based on the Koopman operator named deep direct Koopman (DDK) is proposed to identify the model of the autonomous vehicle and the identified model is a linear time-invariant (LTI) version, which is convenient for motion planning and controller design. In the approach, the Koopman eigenvalues and system matrix are considered as trainable tensors with the original states of the autonomous vehicle being concatenated to a part of the Koopman eigenfunctions so that a physically interpretable subsystem can be extracted from the identified latent dynamics. Subsequently, the process of the identification model is trained under the proposed method based on the dataset which consists of about 60km of data collected with a real electric SUV while the effectiveness of the identified model is validated. Meanwhile, a high-fidelity vehicle dynamics is identified in CarSim with DDK, and then, a linear model predictive control (MPC) called DDK-MPC integrating DDK is designed to validate the performance for the control of trajectory tracking. Simulation results illustrate that the model of the nonlinear vehicle dynamics can be identified effectively via the proposed method and that excellent tracking performance can be obtained with the identified model under DDK-MPC.
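
A minimal sketch of the Koopman-style structure the abstract alludes to: an encoder lifts the state into a latent space that evolves under a learned linear (LTI) matrix, with a decoder back to the state. The dimensions are invented, and the handling of control inputs and the concatenation of original states to the eigenfunctions are omitted.

```python
# Illustrative Koopman-style linear latent dynamics (not the DDK implementation).
import torch
import torch.nn as nn

class LinearLatentDynamics(nn.Module):
    def __init__(self, state_dim=4, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.A = nn.Parameter(torch.eye(latent_dim) + 0.01 * torch.randn(latent_dim, latent_dim))
        self.decoder = nn.Linear(latent_dim, state_dim)

    def forward(self, x_t):
        z_t = self.encoder(x_t)
        z_next = z_t @ self.A.T              # linear, time-invariant evolution in the lifted space
        return self.decoder(z_next)          # predicted next state

model = LinearLatentDynamics()
print(model(torch.randn(8, 4)).shape)        # torch.Size([8, 4])
```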

Spatial Constraint Generation for Motion Planning in Dynamic Environments

  • Authors: Han Hu (1), Peyman Yadmellat (2) ((1) University of Toronto, (2) Noah's Ark Lab., Huawei Technologies Canada)
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2110.14786
  • Pdf link: https://arxiv.org/pdf/2110.14786
  • Abstract
    This paper presents a novel method to generate spatial constraints for motion planning in dynamic environments. Motion planning methods for autonomous driving and mobile robots typically need to rely on the spatial constraints imposed by a map-based global planner to generate a collision-free trajectory. These methods may fail without an offline map or where the map is invalid due to dynamic changes in the environment such as road obstruction, construction, and traffic congestion. To address this problem, triangulation-based methods can be used to obtain a spatial constraint. However, the existing methods fall short when dealing with dynamic environments and may lead the motion planner to an unrecoverable state. In this paper, we propose a new method to generate a sequence of channels across different triangulation mesh topologies to serve as the spatial constraints. This can be applied to motion planning of autonomous vehicles or robots in cluttered, unstructured environments. The proposed method is evaluated and compared with other triangulation-based methods in synthetic and complex scenarios collected from a real-world autonomous driving dataset. We have shown that the proposed method results in a more stable, long-term plan with a higher task completion rate, faster arrival time, a higher rate of successful plans, and fewer collisions compared to existing methods.

Sliding Sequential CVAE with Time Variant Socially-aware Rethinking for Trajectory Prediction

  • Authors: Hao Zhou, Dongchun Ren, Xu Yang, Mingyu Fan, Hai Huang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2110.15016
  • Pdf link: https://arxiv.org/pdf/2110.15016
  • Abstract
    Pedestrian trajectory prediction is a key technology in many applications such as video surveillance, social robot navigation, and autonomous driving, and significant progress has been made in this research topic. However, there remain two limitations of previous studies. First, with the continuation of time, the prediction error at each time step increases significantly, causing the final displacement error to be impossible to ignore. Second, the prediction results of multiple pedestrians might be impractical in the prediction horizon, i.e., the predicted trajectories might collide with each other. To overcome these limitations, this work proposes a novel trajectory prediction method called CSR, which consists of a cascaded conditional variational autoencoder (CVAE) module and a socially-aware regression module. The cascaded CVAE module first estimates the future trajectories in a sequential pattern. Specifically, each CVAE concatenates the past trajectories and the predicted points so far as the input and predicts the location at the following time step. Then, the socially-aware regression module generates offsets from the estimated future trajectories to produce the socially compliant final predictions, which are more reasonable and accurate results than the estimated trajectories. Moreover, considering the large model parameters of the cascaded CVAE module, a slide CVAE module is further exploited to improve the model efficiency using one shared CVAE, in a slidable manner. Experiments results demonstrate that the proposed method exhibits improvements over state-of-the-art method on the Stanford Drone Dataset (SDD) and ETH/UCY of approximately 38.0% and 22.2%, respectively.

Meta Guided Metric Learner for Overcoming Class Confusion in Few-Shot Road Object Detection

  • Authors: Anay Majee, Anbumani Subramanian, Kshitij Agrawal
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.15074
  • Pdf link: https://arxiv.org/pdf/2110.15074
  • Abstract
    Localization and recognition of less-occurring road objects have been a challenge in autonomous driving applications due to the scarcity of data samples. Few-Shot Object Detection techniques extend the knowledge from existing base object classes to learn novel road objects given few training examples. Popular techniques in FSOD adopt either meta or metric learning techniques which are prone to class confusion and base class forgetting. In this work, we introduce a novel Meta Guided Metric Learner (MGML) to overcome class confusion in FSOD. We re-weight the features of the novel classes higher than the base classes through a novel Squeeze and Excite module and encourage the learning of truly discriminative class-specific features by applying an Orthogonality Constraint to the meta learner. Our method outperforms State-of-the-Art (SoTA) approaches in FSOD on the India Driving Dataset (IDD) by up to 11 mAP points while suffering from the least class confusion of 20% given only 10 examples of each novel road object. We further show similar improvements on the few-shot splits of PASCAL VOC dataset where we outperform SoTA approaches by up to 5.8 mAP across all splits.
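
The Squeeze-and-Excite module referenced above is a standard channel re-weighting block; a generic version (not the authors' exact module) is sketched below.

```python
# Generic squeeze-and-excitation block: global-average-pool, bottleneck MLP,
# sigmoid gate, per-channel re-weighting (illustrative only).
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))      # squeeze to (B, C), excite to per-channel weights
        return x * w[:, :, None, None]       # re-weight channels

print(SqueezeExcite(64)(torch.randn(2, 64, 8, 8)).shape)
```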

Keyword: mapping

Meta-Learning Sparse Implicit Neural Representations

  • Authors: Jaeho Lee, Jihoon Tack, Namhoon Lee, Jinwoo Shin
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2110.14678
  • Pdf link: https://arxiv.org/pdf/2110.14678
  • Abstract
    Implicit neural representations are a promising new avenue of representing general signals by learning a continuous function that, parameterized as a neural network, maps the domain of a signal to its codomain; the mapping from spatial coordinates of an image to its pixel values, for example. Being capable of conveying fine details in a high dimensional signal, unboundedly of its domain, implicit neural representations ensure many advantages over conventional discrete representations. However, the current approach is difficult to scale for a large number of signals or a data set, since learning a neural representation -- which is parameter heavy by itself -- for each signal individually requires a lot of memory and computations. To address this issue, we propose to leverage a meta-learning approach in combination with network compression under a sparsity constraint, such that it renders a well-initialized sparse parameterization that evolves quickly to represent a set of unseen signals in the subsequent training. We empirically demonstrate that meta-learned sparse neural representations achieve a much smaller loss than dense meta-learned models with the same number of parameters, when trained to fit each signal using the same number of optimization steps.
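
To make the opening idea concrete, here is a toy implicit neural representation: an MLP mapping 2D pixel coordinates to RGB values, fitted to a single random image. This illustrates only the representation itself, not the paper's meta-learning or sparsification; PyTorch 1.10+ is assumed for torch.meshgrid's indexing argument.

```python
# Toy implicit neural representation: coordinates (x, y) -> RGB, fitted to one image.
import torch
import torch.nn as nn

H = W = 32
image = torch.rand(H, W, 3)                                   # the "signal" to represent
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)         # (H*W, 2) input coordinates

mlp = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 3))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
for step in range(200):                                       # fit the MLP to the image
    loss = ((mlp(coords) - image.reshape(-1, 3)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print("final reconstruction MSE:", loss.item())
```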

Efficient Placard Discovery for Semantic Mapping During Frontier Exploration

  • Authors: David Balaban, Harshavardhan Jagannathan, Henry Liu, Justin Hart
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2110.14742
  • Pdf link: https://arxiv.org/pdf/2110.14742
  • Abstract
    Semantic mapping is the task of providing a robot with a map of its environment beyond the open, navigable space of traditional Simultaneous Localization and Mapping (SLAM) algorithms by attaching semantics to locations. The system presented in this work reads door placards to annotate the locations of offices. Whereas prior work on this system developed hand-crafted detectors, this system leverages YOLOv2 for detection and a segmentation network for segmentation. Placards are localized by computing their pose from a homography computed from a segmented quadrilateral outline. This work also introduces an Interruptable Frontier Exploration algorithm, enabling the robot to explore its environment to construct its SLAM map while pausing to inspect placards observed during this process. This allows the robot to autonomously discover room placards without human intervention while speeding up significantly over previous autonomous exploration methods.

Millimeter Wave Wireless-Assisted Robotic Navigation with Link State Classification

  • Authors: Mingsheng Yin (1), Akshaj Veldanda (1), Amee Trivedi (2), Jeff Zhang (3), Kai Pfeiffer (1), Yaqi Hu (1), Siddharth Garg (1), Elza Erkip (1), Ludovic Righetti (1), Sundeep Rangan (1) ((1) NYU Tandon School of Engineering, (2) University of British Columbia, (3) Harvard University)
  • Subjects: Robotics (cs.RO); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2110.14789
  • Pdf link: https://arxiv.org/pdf/2110.14789
  • Abstract
    The millimeter wave (mmWave) bands have attracted considerable attention for high precision localization applications due to the ability to capture high angular and temporal resolution measurements. This paper explores mmWave-based positioning for a target localization problem where a fixed target broadcasts mmWave signals and a mobile robotic agent attempts to listen to the signals to locate and navigate to the target. A three-stage procedure is proposed: First, the mobile agent uses tensor decomposition methods to detect the wireless paths and their angles. Second, a machine-learning trained classifier is then used to predict the link state, meaning if the strongest path is line-of-sight (LOS) or non-LOS (NLOS). For the NLOS case, the link state predictor also determines if the strongest path arrived via one or more reflections. Third, based on the link state, the agent either follows the estimated angles or explores the environment. The method is demonstrated on a large dataset of indoor environments supplemented with ray tracing to simulate the wireless propagation. The path estimation and link state classification are also integrated into a state-of-the-art neural simultaneous localization and mapping (SLAM) module to augment camera and LIDAR-based navigation. It is shown that the link state classifier can successfully generalize to completely new environments outside the training set. In addition, the neural-SLAM module with the wireless path estimation and link state classifier provides rapid navigation to the target, close to a baseline that knows the target location.

L2ight: Enabling On-Chip Learning for Optical Neural Networks via Efficient in-situ Subspace Optimization

  • Authors: Jiaqi Gu, Hanqing Zhu, Chenghao Feng, Zixuan Jiang, Ray T. Chen, David Z. Pan
  • Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Optics (physics.optics)
  • Arxiv link: https://arxiv.org/abs/2110.14807
  • Pdf link: https://arxiv.org/pdf/2110.14807
  • Abstract
    Silicon-photonics-based optical neural network (ONN) is a promising hardware platform that could represent a paradigm shift in efficient AI with its CMOS-compatibility, flexibility, ultra-low execution latency, and high energy efficiency. In-situ training on the online programmable photonic chips is appealing but still encounters challenging issues in on-chip implementability, scalability, and efficiency. In this work, we propose a closed-loop ONN on-chip learning framework L2ight to enable scalable ONN mapping and efficient in-situ learning. L2ight adopts a three-stage learning flow that first calibrates the complicated photonic circuit states under challenging physical constraints, then performs photonic core mapping via combined analytical solving and zeroth-order optimization. A subspace learning procedure with multi-level sparsity is integrated into L2ight to enable in-situ gradient evaluation and fast adaptation, unleashing the power of optics for real on-chip intelligence. Extensive experiments demonstrate our proposed L2ight outperforms prior ONN training protocols with 3-order-of-magnitude higher scalability and over 30X better efficiency, when benchmarked on various models and learning tasks. This synergistic framework is the first scalable on-chip learning solution that pushes this emerging field from intractable to scalable and further to efficient for next-generation self-learnable photonic neural chips. From a co-design perspective, L2ight also provides essential insights for hardware-restricted unitary subspace optimization and efficient sparse training. We open-source our framework at https://github.com/JeremieMelo/L2ight.

Learning Actions for Drift-Free Navigation in Highly Dynamic Scenes

  • Authors: Mohd Omama, Sundar Sripada V. S., Sandeep Chinchali, K. Madhava Krishna
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2110.14928
  • Pdf link: https://arxiv.org/pdf/2110.14928
  • Abstract
    We embark on a hitherto unreported problem of an autonomous robot (self-driving car) navigating in dynamic scenes in a manner that reduces its localization error and eventual cumulative drift or Absolute Trajectory Error, which is pronounced in such dynamic scenes. With the hugely popular Velodyne-16 3D LIDAR as the main sensing modality, and the accurate LIDAR-based Localization and Mapping algorithm, LOAM, as the state estimation framework, we show that in the absence of a navigation policy, drift rapidly accumulates in the presence of moving objects. To overcome this, we learn actions that lead to drift-minimized navigation through a suitable set of reward and penalty functions. We use Proximal Policy Optimization, a class of Deep Reinforcement Learning methods, to learn the actions that result in drift-minimized trajectories. We show by extensive comparisons on a variety of synthetic, yet photo-realistic scenes made available through the CARLA Simulator the superior performance of the proposed framework vis-a-vis methods that do not adopt such policies.

Deep Learning Aided Packet Routing in Aeronautical Ad-Hoc Networks Relying on Real Flight Data: From Single-Objective to Near-Pareto Multi-Objective Optimization

  • Authors: Dong Liu, Jiankang Zhang, Jingjing Cui, Soon-Xin Ng, Robert G. Maunder, Lajos Hanzo
  • Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2110.15145
  • Pdf link: https://arxiv.org/pdf/2110.15145
  • Abstract
    Data packet routing in aeronautical ad-hoc networks (AANETs) is challenging due to their high-dynamic topology. In this paper, we invoke deep learning (DL) to assist routing in AANETs. We set out from the single objective of minimizing the end-to-end (E2E) delay. Specifically, a deep neural network (DNN) is conceived for mapping the local geographic information observed by the forwarding node into the information required for determining the optimal next hop. The DNN is trained by exploiting the regular mobility pattern of commercial passenger airplanes from historical flight data. After training, the DNN is stored by each airplane for assisting their routing decisions during flight relying solely on local geographic information. Furthermore, we extend the DL-aided routing algorithm to a multi-objective scenario, where we aim for simultaneously minimizing the delay, maximizing the path capacity, and maximizing the path lifetime. Our simulation results based on real flight data show that the proposed DL-aided routing outperforms existing position-based routing protocols in terms of its E2E delay, path capacity as well as path lifetime, and it is capable of approaching the Pareto front that is obtained using global link information.

Feature Learning for Neural-Network-Based Positioning with Channel State Information

  • Authors: Emre Gönültaş, Sueda Taner, Howard Huang, Christoph Studer
  • Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2110.15160
  • Pdf link: https://arxiv.org/pdf/2110.15160
  • Abstract
    Recent channel state information (CSI)-based positioning pipelines rely on deep neural networks (DNNs) in order to learn a mapping from estimated CSI to position. Since real-world communication transceivers suffer from hardware impairments, CSI-based positioning systems typically rely on features that are designed by hand. In this paper, we propose a CSI-based positioning pipeline that directly takes raw CSI measurements and learns features using a structured DNN in order to generate probability maps describing the likelihood of the transmitter being at pre-defined grid points. To further improve the positioning accuracy of moving user equipments, we propose to fuse a time-series of learned CSI features or a time-series of probability maps. To demonstrate the efficacy of our methods, we perform experiments with real-world indoor line-of-sight (LoS) and non-LoS channel measurements. We show that CSI feature learning and time-series fusion can reduce the mean distance error by up to 2.5$\boldsymbol\times$ compared to the state-of-the-art.

Word-level confidence estimation for RNN transducers

  • Authors: Mingqiu Wang, Hagen Soltau, Laurent El Shafey, Izhak Shafran
  • Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2110.15222
  • Pdf link: https://arxiv.org/pdf/2110.15222
  • Abstract
    Confidence estimation is an often-requested feature in applications such as medical transcription, where errors can impact patient care and the confidence estimate could be used to alert medical professionals to verify potential errors in recognition. In this paper, we present a lightweight neural confidence model tailored for Automatic Speech Recognition (ASR) systems with Recurrent Neural Network Transducers (RNN-T). Compared to other existing approaches, our model utilizes: (a) the time information associated with recognized words, which reduces the computational complexity, and (b) a simple and elegant trick for mapping between sub-word and word sequences. The mapping addresses the non-unique tokenization and token deletion problems while amplifying differences between confusable words. Through extensive empirical evaluations on two different long-form test sets, we demonstrate that the model achieves a performance of 0.4 Normalized Cross Entropy (NCE) and 0.05 Expected Calibration Error (ECE). It is robust across different ASR configurations, including target types (graphemes vs. morphemes), traffic conditions (streaming vs. non-streaming), and encoder types. We further discuss the importance of evaluation metrics to reflect practical applications and highlight the need for further work in improving Area Under the Curve (AUC) for Negative Predictive Value (NPV) and True Negative Rate (TNR).
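
As a rough illustration of sub-word-to-word confidence mapping (a generic aggregation, not the paper's specific trick), the sketch below groups token confidences into word confidences, assuming SentencePiece-style word-start markers.

```python
# Minimal sketch: aggregate sub-word confidences into word-level confidences by
# grouping tokens until a word-boundary marker. The "▁" word-start convention is
# an assumption borrowed from SentencePiece-style tokenizers.
def word_confidences(tokens, token_confs, word_start="▁"):
    words, confs, current, scores = [], [], "", []
    for tok, c in zip(tokens, token_confs):
        if tok.startswith(word_start) and current:
            words.append(current)
            confs.append(min(scores))   # a word is only as confident as its weakest token
            current, scores = "", []
        current += tok.lstrip(word_start)
        scores.append(c)
    if current:
        words.append(current)
        confs.append(min(scores))
    return list(zip(words, confs))

# Example: word_confidences(["▁hel", "lo", "▁world"], [0.9, 0.7, 0.95])
#          -> [("hello", 0.7), ("world", 0.95)]
```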

Keyword: localization

An Improved Positioning Accuracy Method of a Robot Based on Particle Filter

  • Authors: Rashid Ali, Dil Nawaz Hakro, Yongping He, Wenpeng Fu, Zhiqiang Cao
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2110.14635
  • Pdf link: https://arxiv.org/pdf/2110.14635
  • Abstract
    This paper aims to improve the performance and positioning accuracy of a robot by using the particle filter method. The laser range finder is a wireless navigation sensor mainly used to measure, position, and control autonomous robots. Its localization is more flexible than that of wired guidance systems. However, navigation with the laser range finder suffers from large positioning errors when the robot moves or turns fast. To solve this problem, the paper proposes a method to improve the positioning accuracy of a robot in an indoor environment by using a particle filter, which is robust in nonlinear and non-Gaussian systems. In the experiments, a robot is equipped with a laser range finder, two encoders, and a gyro for navigation to verify the positioning accuracy and performance. The positioning accuracy and performance improve by approximately 85.5% with the proposed method.
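
For readers unfamiliar with the technique, a minimal particle-filter update for 2D positioning might look like the sketch below (not the paper's implementation). The noise levels and the range measurement model are placeholder assumptions.

```python
# Minimal sketch of one particle-filter update: propagate particles with noisy
# odometry, weight them by how well a range measurement matches the map, resample.
# `expected_range(pose)` is a placeholder for ray-casting against a known map.
import numpy as np

def pf_update(particles, weights, odom_delta, measured_range, expected_range,
              motion_std=(0.02, 0.02, 0.01), range_std=0.1):
    n = len(particles)
    # 1) Prediction: apply odometry (dx, dy, dtheta) plus Gaussian motion noise.
    noise = np.random.normal(0.0, motion_std, size=(n, 3))
    particles = particles + odom_delta + noise
    # 2) Correction: weight each particle by the range-measurement likelihood.
    expected = np.array([expected_range(p) for p in particles])
    likelihood = np.exp(-0.5 * ((measured_range - expected) / range_std) ** 2)
    weights = weights * likelihood
    weights = weights / (weights.sum() + 1e-12)
    # 3) Resampling: draw particles in proportion to their weights.
    idx = np.random.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)
```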

Efficient Placard Discovery for Semantic Mapping During Frontier Exploration

  • Authors: David Balaban, Harshavardhan Jagannathan, Henry Liu, Justin Hart
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2110.14742
  • Pdf link: https://arxiv.org/pdf/2110.14742
  • Abstract
    Semantic mapping is the task of providing a robot with a map of its environment beyond the open, navigable space of traditional Simultaneous Localization and Mapping (SLAM) algorithms by attaching semantics to locations. The system presented in this work reads door placards to annotate the locations of offices. Whereas prior work on this system developed hand-crafted detectors, this system leverages YOLOv2 for detection and a segmentation network for segmentation. Placards are localized by computing their pose from a homography computed from a segmented quadrilateral outline. This work also introduces an Interruptable Frontier Exploration algorithm, enabling the robot to explore its environment to construct its SLAM map while pausing to inspect placards observed during this process. This allows the robot to autonomously discover room placards without human intervention while being significantly faster than previous autonomous exploration methods.

Millimeter Wave Wireless-Assisted Robotic Navigation with Link State Classification

  • Authors: Mingsheng Yin (1), Akshaj Veldanda (1), Amee Trivedi (2), Jeff Zhang (3), Kai Pfeiffer (1), Yaqi Hu (1), Siddharth Garg (1), Elza Erkip (1), Ludovic Righetti (1), Sundeep Rangan (1) ((1) NYU Tandon School of Engineering, (2) University of British Columbia, (3) Harvard University)
  • Subjects: Robotics (cs.RO); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2110.14789
  • Pdf link: https://arxiv.org/pdf/2110.14789
  • Abstract
    The millimeter wave (mmWave) bands have attracted considerable attention for high precision localization applications due to the ability to capture high angular and temporal resolution measurements. This paper explores mmWave-based positioning for a target localization problem where a fixed target broadcasts mmWave signals and a mobile robotic agent attempts to listen to the signals to locate and navigate to the target. A three-stage procedure is proposed: First, the mobile agent uses tensor decomposition methods to detect the wireless paths and their angles. Second, a machine-learning trained classifier is then used to predict the link state, meaning if the strongest path is line-of-sight (LOS) or non-LOS (NLOS). For the NLOS case, the link state predictor also determines if the strongest path arrived via one or more reflections. Third, based on the link state, the agent either follows the estimated angles or explores the environment. The method is demonstrated on a large dataset of indoor environments supplemented with ray tracing to simulate the wireless propagation. The path estimation and link state classification are also integrated into a state-of-the-art neural simultaneous localization and mapping (SLAM) module to augment camera and LIDAR-based navigation. It is shown that the link state classifier can successfully generalize to completely new environments outside the training set. In addition, the neural-SLAM module with the wireless path estimation and link state classifier provides rapid navigation to the target, close to a baseline that knows the target location.

Learning Actions for Drift-Free Navigation in Highly Dynamic Scenes

  • Authors: Mohd Omama, Sundar Sripada V. S., Sandeep Chinchali, K. Madhava Krishna
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2110.14928
  • Pdf link: https://arxiv.org/pdf/2110.14928
  • Abstract
    We embark on a hitherto unreported problem of an autonomous robot (self-driving car) navigating in dynamic scenes in a manner that reduces its localization error and eventual cumulative drift or Absolute Trajectory Error, which is pronounced in such dynamic scenes. With the hugely popular Velodyne-16 3D LIDAR as the main sensing modality, and the accurate LIDAR-based Localization and Mapping algorithm, LOAM, as the state estimation framework, we show that in the absence of a navigation policy, drift rapidly accumulates in the presence of moving objects. To overcome this, we learn actions that lead to drift-minimized navigation through a suitable set of reward and penalty functions. We use Proximal Policy Optimization, a class of Deep Reinforcement Learning methods, to learn the actions that result in drift-minimized trajectories. We show by extensive comparisons on a variety of synthetic, yet photo-realistic scenes made available through the CARLA Simulator the superior performance of the proposed framework vis-a-vis methods that do not adopt such policies.

DocScanner: Robust Document Image Rectification with Progressive Learning

  • Authors: Hao Feng, Wengang Zhou, Jiajun Deng, Qi Tian, Houqiang Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.14968
  • Pdf link: https://arxiv.org/pdf/2110.14968
  • Abstract
    Compared to flatbed scanners, portable smartphones are much more convenient for digitizing physical documents. However, such digitized documents are often distorted due to uncontrolled physical deformations, camera positions, and illumination variations. To this end, this work presents DocScanner, a new deep network architecture for document image rectification. Different from existing methods, DocScanner addresses this issue by introducing a progressive learning mechanism. Specifically, DocScanner maintains a single estimate of the rectified image, which is progressively corrected with a recurrent architecture. The iterative refinements make DocScanner converge to a robust and superior performance, and the lightweight recurrent architecture ensures running efficiency. In addition, before the above rectification process, observing the corrupted rectified boundaries existing in prior works, DocScanner exploits a document localization module to explicitly segment the foreground document from the cluttered background environments. To further improve the rectification quality, based on the geometric prior between the distorted and the rectified images, a geometric regularization is introduced during training to further improve performance. Extensive experiments are conducted on the Doc3D dataset and the DocUNet benchmark dataset, and the quantitative and qualitative evaluation results verify the effectiveness of DocScanner, which outperforms previous methods on OCR accuracy, image similarity, and our proposed distortion metric by a considerable margin. Furthermore, our DocScanner shows the highest efficiency in inference time and parameter count.

Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition

  • Authors: Liang Xu, Cuiling Lan, Wenjun Zeng, Cewu Lu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.14994
  • Pdf link: https://arxiv.org/pdf/2110.14994
  • Abstract
    Skeleton data carries valuable motion information and is widely explored in human action recognition. However, not only the motion information but also the interaction with the environment provides discriminative cues to recognize the action of persons. In this paper, we propose a joint learning framework for mutually assisted "interacted object localization" and "human action recognition" based on skeleton data. The two tasks are serialized together and collaborate to promote each other, where preliminary action type derived from skeleton alone helps improve interacted object localization, which in turn provides valuable cues for the final human action recognition. Besides, we explore the temporal consistency of interacted object as constraint to better localize the interacted object with the absence of ground-truth labels. Extensive experiments on the datasets of SYSU-3D, NTU60 RGB+D and Northwestern-UCLA show that our method achieves the best or competitive performance with the state-of-the-art methods for human action recognition. Visualization results show that our method can also provide reasonable interacted object localization results.

Meta Guided Metric Learner for Overcoming Class Confusion in Few-Shot Road Object Detection

  • Authors: Anay Majee, Anbumani Subramanian, Kshitij Agrawal
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.15074
  • Pdf link: https://arxiv.org/pdf/2110.15074
  • Abstract
    Localization and recognition of less-occurring road objects have been a challenge in autonomous driving applications due to the scarcity of data samples. Few-Shot Object Detection techniques extend the knowledge from existing base object classes to learn novel road objects given few training examples. Popular techniques in FSOD adopt either meta or metric learning techniques which are prone to class confusion and base class forgetting. In this work, we introduce a novel Meta Guided Metric Learner (MGML) to overcome class confusion in FSOD. We re-weight the features of the novel classes higher than the base classes through a novel Squeeze and Excite module and encourage the learning of truly discriminative class-specific features by applying an Orthogonality Constraint to the meta learner. Our method outperforms State-of-the-Art (SoTA) approaches in FSOD on the India Driving Dataset (IDD) by up to 11 mAP points while suffering from the least class confusion of 20% given only 10 examples of each novel road object. We further show similar improvements on the few-shot splits of the PASCAL VOC dataset where we outperform SoTA approaches by up to 5.8 mAP across all splits.
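
A common way to impose such an orthogonality constraint (not necessarily the paper's exact formulation) is a soft penalty on the Gram matrix of class-representative vectors, sketched below in PyTorch.

```python
# Minimal sketch: a soft orthogonality penalty that encourages class-representative
# vectors to be mutually orthogonal, one common way to reduce confusion between
# similar classes. The weighting factor lambda_orth is an assumption.
import torch

def orthogonality_penalty(class_weights: torch.Tensor) -> torch.Tensor:
    """class_weights: (num_classes, feat_dim) matrix of class representatives."""
    w = torch.nn.functional.normalize(class_weights, dim=1)   # unit-norm rows
    gram = w @ w.t()                                          # (C, C) cosine similarities
    identity = torch.eye(w.shape[0], device=w.device)
    return ((gram - identity) ** 2).sum()                     # penalize off-diagonal terms

# Typical use: total_loss = detection_loss + lambda_orth * orthogonality_penalty(W)
```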

SpineOne: A One-Stage Detection Framework for Degenerative Discs and Vertebrae

  • Authors: Jiabo He, Wei Liu, Yu Wang, Xingjun Ma, Xian-Sheng Hua
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.15082
  • Pdf link: https://arxiv.org/pdf/2110.15082
  • Abstract
    Spinal degeneration plagues many elders, office workers, and even the younger generations. Effective pharmaceutical or surgical interventions can help relieve degenerative spine conditions. However, the traditional diagnosis procedure is often too laborious. Clinical experts need to detect discs and vertebrae from spinal magnetic resonance imaging (MRI) or computed tomography (CT) images as a preliminary step to perform pathological diagnosis or preoperative evaluation. Machine learning systems have been developed to aid this procedure generally following a two-stage methodology: first perform anatomical localization, then pathological classification. Towards more efficient and accurate diagnosis, we propose a one-stage detection framework termed SpineOne to simultaneously localize and classify degenerative discs and vertebrae from MRI slices. SpineOne is built upon the following three key techniques: 1) a new design of the keypoint heatmap to facilitate simultaneous keypoint localization and classification; 2) the use of attention modules to better differentiate the representations between discs and vertebrae; and 3) a novel gradient-guided objective association mechanism to associate multiple learning objectives at the later training stage. Empirical results on the Spinal Disease Intelligent Diagnosis Tianchi Competition (SDID-TC) dataset of 550 exams demonstrate that our approach surpasses existing methods by a large margin.

Dist2Cycle: A Simplicial Neural Network for Homology Localization

  • Authors: Alexandros Dimitrios Keros, Vidit Nanda, Kartic Subr
  • Subjects: Machine Learning (cs.LG); Algebraic Topology (math.AT)
  • Arxiv link: https://arxiv.org/abs/2110.15182
  • Pdf link: https://arxiv.org/pdf/2110.15182
  • Abstract
    Simplicial complexes can be viewed as high dimensional generalizations of graphs that explicitly encode multi-way ordered relations between vertices at different resolutions, all at once. This concept is central towards detection of higher dimensional topological features of data, features to which graphs, encoding only pairwise relationships, remain oblivious. While attempts have been made to extend Graph Neural Networks (GNNs) to a simplicial complex setting, the methods do not inherently exploit, or reason about, the underlying topological structure of the network. We propose a graph convolutional model for learning functions parametrized by the $k$-homological features of simplicial complexes. By spectrally manipulating their combinatorial $k$-dimensional Hodge Laplacians, the proposed model enables learning topological features of the underlying simplicial complexes, specifically, the distance of each $k$-simplex from the nearest "optimal" $k$-th homology generator, effectively providing an alternative to homology localization.
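
For context, the combinatorial Hodge Laplacian referenced above can be assembled directly from boundary matrices. The NumPy sketch below (not the paper's model) illustrates this on a hollow triangle, whose single 1-dimensional hole shows up in the kernel of L_1.

```python
# Minimal sketch: the combinatorial k-th Hodge Laplacian of a simplicial complex,
# L_k = B_k^T B_k + B_{k+1} B_{k+1}^T, where B_k is the boundary matrix mapping
# k-simplices to (k-1)-simplices. Its kernel dimension equals the k-th Betti number,
# i.e. the number of k-dimensional holes.
import numpy as np

def hodge_laplacian(B_k: np.ndarray, B_k_plus_1: np.ndarray) -> np.ndarray:
    return B_k.T @ B_k + B_k_plus_1 @ B_k_plus_1.T

# Example: a hollow triangle (3 vertices, 3 edges, no filled face) has one 1-dim hole.
B1 = np.array([[-1,  0, -1],
               [ 1, -1,  0],
               [ 0,  1,  1]])          # vertex-edge boundary matrix
B2 = np.zeros((3, 0))                  # no triangles
L1 = hodge_laplacian(B1, B2)
betti_1 = B1.shape[1] - np.linalg.matrix_rank(L1)   # = 1
```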

New submissions for Fri, 11 Jun 21

Keyword: SLAM

There is no result

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: autonomous driving

Visual Sensor Pose Optimisation Using Rendering-based Visibility Models for Robust Cooperative Perception

  • Authors: Eduardo Arnold, Sajjad Mozaffari, Mehrdad Dianati, Paul Jennings
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
  • Arxiv link: https://arxiv.org/abs/2106.05308
  • Pdf link: https://arxiv.org/pdf/2106.05308
  • Abstract
    Visual Sensor Networks can be used in a variety of perception applications such as infrastructure support for autonomous driving in complex road segments. The pose of the sensors in such networks directly determines the coverage of the environment and objects therein, which impacts the performance of applications such as object detection and tracking. Existing sensor pose optimisation methods in the literature either maximise the coverage of ground surfaces, or consider the visibility of the target objects as binary variables, which cannot represent various degrees of visibility. Such formulations cannot guarantee the visibility of the target objects as they fail to consider occlusions. This paper proposes two novel sensor pose optimisation methods, based on gradient-ascent and Integer Programming techniques, which maximise the visibility of multiple target objects in cluttered environments. Both methods consider a realistic visibility model based on a rendering engine that provides pixel-level visibility information about the target objects. The proposed methods are evaluated in a complex environment and compared to existing methods in the literature. The evaluation results indicate that explicitly modelling the visibility of target objects is critical to avoid occlusions in cluttered environments. Furthermore, both methods significantly outperform existing methods in terms of object visibility.

New submissions for Thu, 1 Jul 21

Keyword: SLAM

Deep Multi-Modal Contact Estimation for Invariant Observer Design on Quadruped Robots

  • Authors: Tzu-Yuan Lin, Ray Zhang, Justin Yu, Maani Ghaffari
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.15713
  • Pdf link: https://arxiv.org/pdf/2106.15713
  • Abstract
    This work reports on developing a deep learning-based contact estimator for legged robots that bypasses the need for physical contact sensors and takes multi-modal proprioceptive sensory data from joint encoders, kinematics, and an inertial measurement unit as input. Unlike vision-based state estimators, proprioceptive state estimators are agnostic to perceptually degraded situations such as dark or foggy scenes. For legged robots, reliable kinematics and contact data are necessary to develop a proprioceptive state estimator. While some robots are equipped with dedicated contact sensors or springs to detect contact, some robots do not have dedicated contact sensors, and the addition of such sensors is non-trivial without redesigning the hardware. The trained deep network can accurately estimate contacts on different terrains and robot gaits and is deployed along a contact-aided invariant extended Kalman filter to generate odometry trajectories. The filter performs comparably to a state-of-the-art visual SLAM system.

Robust Inertial-aided Underwater Localization and Navigation based on Imaging Sonar Keyframes

  • Authors: Yang Xu, Ronghao Zheng, Meiqin Liu, Senlin Zhang
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.16032
  • Pdf link: https://arxiv.org/pdf/2106.16032
  • Abstract
    Imaging sonars have shown better flexibility than optical cameras in underwater localization and navigation for autonomous underwater vehicles (AUVs). However, the sparsity of underwater acoustic features and the loss of elevation angle in sonar frames have imposed degeneracy cases, namely under-constrained or unobservable cases according to optimization-based or EKF-based simultaneous localization and mapping (SLAM). In these cases, the relative ambiguous sensor poses and landmarks cannot be triangulated. To handle this, this paper proposes a robust imaging sonar SLAM approach based on sonar keyframes (KFs) and an elastic sliding window. The degeneracy cases are further analyzed and the triangulation property of 2D landmarks in arbitrary motion has been proved. These degeneracy cases are discriminated and the sonar KFs are selected via saliency criteria to extract and save the informative constraints from previous sonar measurements. Incorporating the inertial measurements, an elastic sliding windowed back-end optimization is proposed to mostly utilize the past salient sonar frames and also restrain the optimization scale. Comparative experiments validate the effectiveness of the proposed method and its robustness to outliers from the wrong data association, even without loop closure.

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: autonomous driving

Monocular 3D Object Detection: An Extrinsic Parameter Free Approach

  • Authors: Yunsong Zhou, Yuan He, Hongzi Zhu, Cheng Wang, Hongyang Li, Qinhong Jiang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.15796
  • Pdf link: https://arxiv.org/pdf/2106.15796
  • Abstract
    Monocular 3D object detection is an important task in autonomous driving. It easily becomes intractable when the ego-car pose changes w.r.t. the ground plane, which is common due to the slight fluctuation of road smoothness and slope. Due to the lack of insight in industrial application, existing methods on open datasets neglect the camera pose information, which inevitably results in the detector being susceptible to camera extrinsic parameters. Such perturbation of detected objects is very common in industrial autonomous driving applications. To this end, we propose a novel method to capture camera pose to formulate the detector free from extrinsic perturbation. Specifically, the proposed framework predicts camera extrinsic parameters by detecting vanishing point and horizon change. A converter is designed to rectify perturbative features in the latent space. By doing so, our 3D detector works independently of extrinsic parameter variations and produces accurate results in realistic cases, e.g., potholed and uneven roads, which almost all existing monocular detectors fail to handle. Experiments demonstrate our method yields the best performance compared with other state-of-the-art methods by a large margin on both KITTI 3D and nuScenes datasets.

Mutual-GAN: Towards Unsupervised Cross-Weather Adaptation with Mutual Information Constraint

  • Authors: Jiawei Chen, Yuexiang Li, Kai Ma, Yefeng Zheng
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2106.16000
  • Pdf link: https://arxiv.org/pdf/2106.16000
  • Abstract
    Convolutional neural networks (CNNs) have proven successful for semantic segmentation, which is a core task of emerging industrial applications such as autonomous driving. However, most progress in semantic segmentation of urban scenes is reported on standard scenarios, i.e., daytime scenes with favorable illumination conditions. In practical applications, the outdoor weather and illumination are changeable, e.g., cloudy and nighttime, which results in a significant drop in semantic segmentation accuracy for a CNN trained only with daytime data. In this paper, we propose a novel generative adversarial network (namely Mutual-GAN) to alleviate the accuracy decline when a daytime-trained neural network is applied to videos captured under adverse weather conditions. The proposed Mutual-GAN adopts a mutual information constraint to preserve image-objects during cross-weather adaptation, which is an unsolved problem for most unsupervised image-to-image translation approaches (e.g., CycleGAN). The proposed Mutual-GAN is evaluated on two publicly available driving video datasets (i.e., CamVid and SYNTHIA). The experimental results demonstrate that our Mutual-GAN can yield visually plausible translated images and significantly improve the semantic segmentation accuracy of a daytime-trained deep learning network while processing videos under challenging weather conditions.

New submissions for Thu, 25 Nov 21

Keyword: SLAM

Autonomous bot with ML-based reactive navigation for indoor environment

  • Authors: Yash Srivastava, Saumya Singh, S.P. Syed Ibrahim
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.12542
  • Pdf link: https://arxiv.org/pdf/2111.12542
  • Abstract
    Local or reactive navigation is essential for autonomous mobile robots which operate in an indoor environment. Techniques such as SLAM and computer vision require significant computational power, which increases cost. Similarly, using rudimentary methods makes the robot susceptible to inconsistent behavior. This paper aims to develop a robot that balances cost and accuracy by using machine learning to predict the best obstacle avoidance move based on distance inputs from four ultrasonic sensors that are strategically mounted on the front, front-left, front-right, and back of the robot. The underlying hardware consists of an Arduino Uno and a Raspberry Pi 3B. The machine learning model is first trained on the data collected by the robot. Then the Arduino continuously polls the sensors and calculates the distance values, and in case of critical need for avoidance, a suitable maneuver is made by the Arduino. In other scenarios, sensor data is sent to the Raspberry Pi using a USB connection and the machine learning model generates the best move for navigation, which is sent to the Arduino for driving motors accordingly. The system is mounted on a 2-WD robot chassis and tested in a cluttered indoor setting with impressive results.
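
A minimal sketch of the described sense-predict-act loop (not the authors' code) could look as follows; the maneuver labels, thresholds, and classifier choice are assumptions.

```python
# Minimal sketch: train a classifier on logged sensor rows, then map live readings
# from the four ultrasonic sensors (front, front-left, front-right, back) to a maneuver.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

MANEUVERS = ["forward", "turn_left", "turn_right", "reverse"]  # assumed label set

def train(log_rows):
    """log_rows: iterable of (front, front_left, front_right, back, label_index)."""
    data = np.asarray(list(log_rows), dtype=float)
    X, y = data[:, :4], data[:, 4].astype(int)
    return DecisionTreeClassifier(max_depth=6).fit(X, y)

def decide(model, distances_cm, critical_cm=15.0):
    """Hard-coded avoidance for critical distances, otherwise defer to the model."""
    front, front_left, front_right, back = distances_cm
    if front < critical_cm:                      # mimic the Arduino-side reflex
        return "turn_left" if front_left > front_right else "turn_right"
    return MANEUVERS[int(model.predict([list(distances_cm)])[0])]
```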

Automatic Mapping with Obstacle Identification for Indoor Human Mobility Assessment

  • Authors: V. Ayala-Alfaro, J. A. Vilchis-Mar, F. E. Correa-Tome, J. P. Ramirez-Paredes
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.12690
  • Pdf link: https://arxiv.org/pdf/2111.12690
  • Abstract
    We propose a framework that allows a mobile robot to build a map of an indoor scenario, identifying and highlighting objects that may be considered a hindrance to people with limited mobility. The map is built by combining recent developments in monocular SLAM with information from inertial sensors of the robot platform, resulting in a metric point cloud that can be further processed to obtain a mesh. The images from the monocular camera are simultaneously analyzed with an object recognition neural network, tuned to detect a particular class of targets. This information is then processed and incorporated on the metric map, resulting in a detailed survey of the locations and bounding volumes of the objects of interest. The result can be used to inform policy makers and users with limited mobility of the hazards present in a particular indoor location. Our initial tests were performed using a micro-UAV and will be extended to other robotic platforms.

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

MonoPLFlowNet: Permutohedral Lattice FlowNet for Real-Scale 3D Scene FlowEstimation with Monocular Images

  • Authors: Runfa Li, Truong Nguyen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.12325
  • Pdf link: https://arxiv.org/pdf/2111.12325
  • Abstract
    Real-scale scene flow estimation has become increasingly important for 3D computer vision. Some works successfully estimate real-scale 3D scene flow with LiDAR. However, these ubiquitous and expensive sensors are still unlikely to be equipped widely for real application. Other works use monocular images to estimate scene flow, but their scene flow estimations are normalized with scale ambiguity, where additional depth or point cloud ground truth are required to recover the real scale. Even though they perform well in 2D, these works do not provide accurate and reliable 3D estimates. We present a deep learning architecture on permutohedral lattice - MonoPLFlowNet. Different from all previous works, our MonoPLFlowNet is the first work where only two consecutive monocular images are used as input, while both depth and 3D scene flow are estimated in real scale. Our real-scale scene flow estimation outperforms all state-of-the-art monocular-image based works recovered to real scale by ground truth, and is comparable to LiDAR approaches. As a by-product, our real-scale depth estimation also outperforms other state-of-the-art works.

Fault-Tolerant Perception for Automated Driving A Lightweight Monitoring Approach

  • Authors: Cornelius Buerkle, Florian Geissler, Michael Paulitsch, Kay-Ulrich Scholl
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.12360
  • Pdf link: https://arxiv.org/pdf/2111.12360
  • Abstract
    While the most visible part of the safety verification process of automated vehicles concerns the planning and control system, it is often overlooked that safety of the latter crucially depends on the fault-tolerance of the preceding environment perception. Modern perception systems feature complex and often machine-learning-based components with various failure modes that can jeopardize the overall safety. At the same time, a verification by for example redundant execution is not always feasible due to resource constraints. In this paper, we address the need for feasible and efficient perception monitors and propose a lightweight approach that helps to protect the integrity of the perception system while keeping the additional compute overhead minimal. In contrast to existing solutions, the monitor is realized by a well-balanced combination of sensor checks -- here using LiDAR information -- and plausibility checks on the object motion history. It is designed to detect relevant errors in the distance and velocity of objects in the environment of the automated vehicle. In conjunction with an appropriate planning system, such a monitor can help to make safe automated driving feasible.
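
One simple form such a plausibility check on the object motion history could take (an illustration, not the proposed monitor) is sketched below; the speed and acceleration thresholds are assumptions.

```python
# Minimal sketch: flag physically implausible jumps in an object's position or
# velocity over its tracked motion history.
def plausible(track, dt, max_speed=60.0, max_accel=12.0):
    """track: list of (x, y) positions sampled every dt seconds."""
    speeds = []
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        speed = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 / dt
        if speed > max_speed:                     # teleporting object -> implausible
            return False
        speeds.append(speed)
    for v0, v1 in zip(speeds, speeds[1:]):
        if abs(v1 - v0) / dt > max_accel:         # implausible acceleration
            return False
    return True
```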

SM3D: Simultaneous Monocular Mapping and 3D Detection

  • Authors: Runfa Li, Truong Nguyen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.12643
  • Pdf link: https://arxiv.org/pdf/2111.12643
  • Abstract
    Mapping and 3D detection are two major issues in vision-based robotics and self-driving. While previous works only focus on each task separately, we present an innovative and efficient multi-task deep learning framework (SM3D) for Simultaneous Mapping and 3D Detection by bridging the gap with robust depth estimation and "Pseudo-LiDAR" point cloud for the first time. The Mapping module takes consecutive monocular frames to generate depth and pose estimation. In the 3D Detection module, the depth estimation is projected into 3D space to generate a "Pseudo-LiDAR" point cloud, where a LiDAR-based 3D detector can be leveraged on the point cloud for vehicular 3D detection and localization. By end-to-end training of both modules, the proposed mapping and 3D detection method outperforms the state-of-the-art baseline by 10.0% and 13.2% in accuracy, respectively. While achieving better accuracy, our monocular multi-task SM3D is more than 2 times faster than a pure stereo 3D detector, and 18.3% faster than using two modules separately.
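
The "Pseudo-LiDAR" step, i.e. back-projecting a depth map into a 3D point cloud through the pinhole camera model, can be sketched as follows (an illustration, not the paper's code).

```python
# Minimal sketch: turn a predicted depth map into a "Pseudo-LiDAR" point cloud by
# back-projecting every pixel through the pinhole model. fx, fy, cx, cy are the
# camera intrinsics.
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """depth: (H, W) array of metric depths; returns an (H*W, 3) point cloud."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```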

Keyword: loop detection

There is no result

Keyword: autonomous driving

There is no result

Keyword: mapping

Panoptic Segmentation Meets Remote Sensing

  • Authors: Osmar Luiz Ferreira de Carvalho, Osmar Abílio de Carvalho Júnior, Cristiano Rosa e Silva, Anesmar Olino de Albuquerque, Nickolas Castro Santana, Dibio Leandro Borges, Roberto Arnaldo Trancoso Gomes, Renato Fontes Guimarães
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB)
  • Arxiv link: https://arxiv.org/abs/2111.12126
  • Pdf link: https://arxiv.org/pdf/2111.12126
  • Abstract
    Panoptic segmentation combines instance and semantic predictions, allowing the detection of "things" and "stuff" simultaneously. Effectively approaching panoptic segmentation in remotely sensed data can be auspicious in many challenging problems since it allows continuous mapping and specific target counting. Several difficulties have prevented the growth of this task in remote sensing: (a) most algorithms are designed for traditional images, (b) image labelling must encompass "things" and "stuff" classes, and (c) the annotation format is complex. Thus, aiming to solve and increase the operability of panoptic segmentation in remote sensing, this study has five objectives: (1) create a novel data preparation pipeline for panoptic segmentation, (2) propose an annotation conversion software to generate panoptic annotations, (3) propose a novel dataset on urban areas, (4) modify the Detectron2 for the task, and (5) evaluate difficulties of this task in the urban setting. We used an aerial image with a 0.24-meter spatial resolution, considering 14 classes. Our pipeline considers three image inputs, and the proposed software uses point shapefiles for creating samples in the COCO format. Our study generated 3,400 samples with 512x512 pixel dimensions. We used the Panoptic-FPN with two backbones (ResNet-50 and ResNet-101), and the model evaluation considered semantic, instance, and panoptic metrics. We obtained 93.9, 47.7, and 64.9 for the mean IoU, box AP, and PQ. Our study presents the first effective pipeline for panoptic segmentation and an extensive database for other researchers to use and deal with other data or related problems requiring a thorough scene understanding.

Auto robust relative radiometric normalization via latent change noise modelling

  • Authors: Shiqi Liu, Lu Wang, Jie Lian, Ting chen, Cong Liu, Xuchen Zhan, Jintao Lu, Jie Liu, Ting Wang, Dong Geng, Hongwei Duan, Yuze Tian
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.12406
  • Pdf link: https://arxiv.org/pdf/2111.12406
  • Abstract
    Relative radiometric normalization (RRN) of different satellite images of the same terrain is necessary for change detection, object classification/segmentation, and map-making tasks. However, traditional RRN models are not robust, being disturbed by object change, and RRN models that precisely consider object change cannot robustly obtain the no-change set. This paper proposes auto robust relative radiometric normalization methods via latent change noise modeling. They utilize the prior knowledge that no-change points exhibit small-scale noise under relative radiometric normalization while change points exhibit large-scale radiometric noise after normalization, combined with the stochastic expectation maximization method, to quickly and robustly extract the no-change set and learn the relative radiometric normalization mapping functions. This grounds the model theoretically in probability theory and mathematical deduction. Specifically, when histogram matching is selected as the relative radiometric normalization learning scheme and integrated with a mixture of Gaussian noise (HM-RRN-MoG), the HM-RRN-MoG model achieves the best performance. The model is robust against clouds/fogs/changes. The method naturally yields a robust evaluation indicator for RRN, namely the no-change-set root mean square error. We apply the HM-RRN-MoG model to a subsequent vegetation/water change detection task, where it reduces the radiometric contrast and NDVI/NDWI differences on the no-change set and generates consistent and comparable results. We further utilize the no-change set in a building change detection task, efficiently reducing pseudo-changes and boosting precision.
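
As background, plain histogram matching, the baseline normalization scheme the HM-RRN-MoG model builds on, and the no-change-set RMSE indicator mentioned above can be sketched as follows (an illustration, not the paper's method; inputs are assumed to be single-band arrays).

```python
# Minimal sketch: histogram matching for relative radiometric normalization, plus
# the no-change-set RMSE evaluation indicator. The paper additionally restricts the
# mapping estimation to a robustly extracted no-change set.
import numpy as np
from skimage.exposure import match_histograms

def normalize_band(subject_band: np.ndarray, reference_band: np.ndarray) -> np.ndarray:
    """Remap subject_band so its intensity histogram matches reference_band."""
    return match_histograms(subject_band, reference_band)

def no_change_rmse(normalized: np.ndarray, reference: np.ndarray,
                   no_change_mask: np.ndarray) -> float:
    """RMSE over pixels believed to be unchanged between the two acquisitions."""
    diff = normalized[no_change_mask] - reference[no_change_mask]
    return float(np.sqrt(np.mean(diff ** 2)))
```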

ViCE: Self-Supervised Visual Concept Embeddings as Contextual and Pixel Appearance Invariant Semantic Representations

  • Authors: Robin Karlsson, Tomoki Hayashi, Keisuke Fujii, Alexander Carballo, Kento Ohtani, Kazuya Takeda
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2111.12460
  • Pdf link: https://arxiv.org/pdf/2111.12460
  • Abstract
    This work presents a self-supervised method to learn dense semantically rich visual concept embeddings for images inspired by methods for learning word embeddings in NLP. Our method improves on prior work by generating more expressive embeddings and by being applicable for high-resolution images. Viewing the generation of natural images as a stochastic process where a set of latent visual concepts give rise to observable pixel appearances, our method is formulated to learn the inverse mapping from pixels to concepts. Our method greatly improves the effectiveness of self-supervised learning for dense embedding maps by introducing superpixelization as a natural hierarchical step up from pixels to a small set of visually coherent regions. Additional contributions are regional contextual masking with nonuniform shapes matching visually coherent patches and complexity-based view sampling inspired by masked language models. The enhanced expressiveness of our dense embeddings is demonstrated by significantly improving the state-of-the-art representation quality benchmarks on COCO (+12.94 mIoU, +87.6%) and Cityscapes (+16.52 mIoU, +134.2%). Results show favorable scaling and domain generalization properties not demonstrated by prior work.

SM3D: Simultaneous Monocular Mapping and 3D Detection

  • Authors: Runfa Li, Truong Nguyen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.12643
  • Pdf link: https://arxiv.org/pdf/2111.12643
  • Abstract
    Mapping and 3D detection are two major issues in vision-based robotics and self-driving. While previous works only focus on each task separately, we present an innovative and efficient multi-task deep learning framework (SM3D) for Simultaneous Mapping and 3D Detection by bridging the gap with robust depth estimation and "Pseudo-LiDAR" point cloud for the first time. The Mapping module takes consecutive monocular frames to generate depth and pose estimation. In the 3D Detection module, the depth estimation is projected into 3D space to generate a "Pseudo-LiDAR" point cloud, where a LiDAR-based 3D detector can be leveraged on the point cloud for vehicular 3D detection and localization. By end-to-end training of both modules, the proposed mapping and 3D detection method outperforms the state-of-the-art baseline by 10.0% and 13.2% in accuracy, respectively. While achieving better accuracy, our monocular multi-task SM3D is more than 2 times faster than a pure stereo 3D detector, and 18.3% faster than using two modules separately.

Keyword: localization

Three-Way Deep Neural Network for Radio Frequency Map Generation and Source Localization

  • Authors: Kuldeep S. Gill, Son Nguyen, Myo M. Thein, Alexander M. Wyglinski
  • Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2111.12175
  • Pdf link: https://arxiv.org/pdf/2111.12175
  • Abstract
    In this paper, we present a Generative Adversarial Network (GAN) machine learning model to interpolate irregularly distributed measurements across the spatial domain to construct a smooth radio frequency map (RFMap) and then perform localization using a deep neural network. Monitoring wireless spectrum over spatial, temporal, and frequency domains will become a critical feature in facilitating dynamic spectrum access (DSA) in beyond-5G and 6G communication technologies. Localization, wireless signal detection, and spectrum policy-making are several of the applications where distributed spectrum sensing will play a significant role. Detection and positioning of wireless emitters is a very challenging task in a large spectral and spatial area. In order to construct a smooth RFMap database, a large number of measurements are required which can be very expensive and time consuming. One approach to help realize these systems is to collect finite localized measurements across a given area and then interpolate the measurement values to construct the database. Current methods in the literature employ channel modeling to construct the radio frequency map, which lacks the granularity for accurate localization whereas our proposed approach reconstructs a new generalized RFMap. Localization results are presented and compared with conventional channel models.

MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing

  • Authors: Jiashuo Yu, Ying Cheng, Rui-Wei Zhao, Rui Feng, Yuejie Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.12374
  • Pdf link: https://arxiv.org/pdf/2111.12374
  • Abstract
    Recognizing and localizing events in videos is a fundamental task for video understanding. Since events may occur in auditory and visual modalities, multimodal detailed perception is essential for complete scene comprehension. Most previous works attempted to analyze videos from a holistic perspective. However, they do not consider semantic information at multiple scales, which makes it difficult for the model to localize events of various lengths. In this paper, we present a Multimodal Pyramid Attentional Network (MM-Pyramid) that captures and integrates multi-level temporal features for audio-visual event localization and audio-visual video parsing. Specifically, we first propose the attentive feature pyramid module. This module captures temporal pyramid features via several stacking pyramid units, each of which is composed of a fixed-size attention block and a dilated convolution block. We also design an adaptive semantic fusion module, which leverages a unit-level attention block and a selective fusion block to integrate pyramid features interactively. Extensive experiments on audio-visual event localization and weakly-supervised audio-visual video parsing tasks verify the effectiveness of our approach.

Background-Click Supervision for Temporal Action Localization

  • Authors: Le Yang, Junwei Han, Tao Zhao, Tianwei Lin, Dingwen Zhang, Jianxin Chen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.12449
  • Pdf link: https://arxiv.org/pdf/2111.12449
  • Abstract
    Weakly supervised temporal action localization aims at learning the instance-level action pattern from the video-level labels, where a significant challenge is action-context confusion. To overcome this challenge, one recent work builds an action-click supervision framework. It requires similar annotation costs but can steadily improve the localization performance when compared to the conventional weakly supervised methods. In this paper, by revealing that the performance bottleneck of the existing approaches mainly comes from the background errors, we find that a stronger action localizer can be trained with labels on the background video frames rather than those on the action frames. To this end, we convert the action-click supervision to the background-click supervision and develop a novel method, called BackTAL. Specifically, BackTAL implements two-fold modeling on the background video frames, i.e. the position modeling and the feature modeling. In position modeling, we not only conduct supervised learning on the annotated video frames but also design a score separation module to enlarge the score differences between the potential action frames and backgrounds. In feature modeling, we propose an affinity module to measure frame-specific similarities among neighboring frames and dynamically attend to informative neighbors when calculating temporal convolution. Extensive experiments on three benchmarks are conducted, which demonstrate the high performance of the established BackTAL and the rationality of the proposed background-click supervision. Code is available at https://github.com/VividLe/BackTAL.

FLACOCO: Fault Localization for Java based on Industry-grade Coverage

  • Authors: André Silva, Matias Martinez, Benjamin Danglot, Davide Ginelli, Martin Monperrus
  • Subjects: Software Engineering (cs.SE)
  • Arxiv link: https://arxiv.org/abs/2111.12513
  • Pdf link: https://arxiv.org/pdf/2111.12513
  • Abstract
    Fault localization is an essential step in the debugging process. Spectrum-Based Fault Localization (SBFL) is a popular fault localization family of techniques, utilizing code-coverage to predict suspicious lines of code. In this paper, we present FLACOCO, a new fault localization tool for Java. The key novelty of FLACOCO is that it is built on top of one of the most used and most reliable coverage libraries for Java, JaCoCo. FLACOCO is made available through a well-designed command-line interface and Java API and supports all Java versions. We validate FLACOCO on two use-cases from the automatic program repair domain by reproducing previous scientific experiments. We find it is capable of effectively replacing the state-of-the-art FL library. Overall, we hope that FLACOCO will help research in fault localization as well as industry adoption thanks to being founded on industry-grade code coverage. An introductory video is available at https://youtu.be/RFRyvQuwRYA
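
For context, spectrum-based fault localization ranks code elements by formulas such as Ochiai; the sketch below illustrates the idea on a toy coverage matrix and should not be read as FLACOCO's exact behaviour.

```python
# Minimal sketch of SBFL with the Ochiai formula: lines that are covered mostly by
# failing tests get high suspiciousness scores.
from math import sqrt

def ochiai_scores(coverage, test_results):
    """coverage: {test_name: set(covered_lines)}; test_results: {test_name: passed?}."""
    total_failed = sum(1 for passed in test_results.values() if not passed)
    lines = set().union(*coverage.values())
    scores = {}
    for line in lines:
        failed_cov = sum(1 for t, cov in coverage.items()
                         if line in cov and not test_results[t])
        passed_cov = sum(1 for t, cov in coverage.items()
                         if line in cov and test_results[t])
        denom = sqrt(total_failed * (failed_cov + passed_cov))
        scores[line] = failed_cov / denom if denom else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```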

SM3D: Simultaneous Monocular Mapping and 3D Detection

  • Authors: Runfa Li, Truong Nguyen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.12643
  • Pdf link: https://arxiv.org/pdf/2111.12643
  • Abstract
    Mapping and 3D detection are two major issues in vision-based robotics and self-driving. While previous works only focus on each task separately, we present an innovative and efficient multi-task deep learning framework (SM3D) for Simultaneous Mapping and 3D Detection by bridging the gap with robust depth estimation and "Pseudo-LiDAR" point cloud for the first time. The Mapping module takes consecutive monocular frames to generate depth and pose estimation. In the 3D Detection module, the depth estimation is projected into 3D space to generate a "Pseudo-LiDAR" point cloud, where a LiDAR-based 3D detector can be leveraged on the point cloud for vehicular 3D detection and localization. By end-to-end training of both modules, the proposed mapping and 3D detection method outperforms the state-of-the-art baseline by 10.0% and 13.2% in accuracy, respectively. While achieving better accuracy, our monocular multi-task SM3D is more than 2 times faster than a pure stereo 3D detector, and 18.3% faster than using two modules separately.

New submissions for Mon, 1 Nov 21

Keyword: SLAM

There is no result

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

False Positive Detection and Prediction Quality Estimation for LiDAR Point Cloud Segmentation

  • Authors: Pascal Colling, Matthias Rottmann, Lutz Roese-Koerner, Hanno Gottschalk
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2110.15681
  • Pdf link: https://arxiv.org/pdf/2110.15681
  • Abstract
    We present a novel post-processing tool for semantic segmentation of LiDAR point cloud data, called LidarMetaSeg, which estimates the prediction quality segmentwise. For this purpose we compute dispersion measures based on network probability outputs as well as feature measures based on point cloud input features and aggregate them on segment level. These aggregated measures are used to train a meta classification model to predict whether a predicted segment is a false positive or not and a meta regression model to predict the segmentwise intersection over union. Both models can then be applied to semantic segmentation inferences without knowing the ground truth. In our experiments we use different LiDAR segmentation models and datasets and analyze the power of our method. We show that our results outperform other standard approaches.

Keyword: loop detection

There is no result

Keyword: autonomous driving

A hierarchical behavior prediction framework at signalized intersections

  • Authors: Zhen Yang, Rusheng Zhang, Henry X. Liu
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2110.15465
  • Pdf link: https://arxiv.org/pdf/2110.15465
  • Abstract
    Road user behavior prediction is one of the most critical components in trajectory planning for autonomous driving, especially in urban scenarios involving traffic signals. In this paper, a hierarchical framework is proposed to predict vehicle behaviors at a signalized intersection, using the traffic signal information of the intersection. The framework is composed of two phases: a discrete intention prediction phase and a continuous trajectory prediction phase. In the discrete intention prediction phase, a Bayesian network is adopted to predict the vehicle's high-level intention, after that, maximum entropy inverse reinforcement learning is utilized to learn the human driving model offline; during the online trajectory prediction phase, a driver characteristic is designed and updated to capture the different driving preferences between human drivers. We applied the proposed framework to one of the most challenging scenarios in autonomous driving: the yellow light running scenario. Numerical experiment results are presented in the later part of the paper which show the viability of the method. The accuracy of the Bayesian network for discrete intention prediction is 91.1%, and the prediction results are getting more and more accurate as the yellow time elapses. The average Euclidean distance error in continuous trajectory prediction is only 0.85 m in the yellow light running scenario.

Keyword: mapping

Non-existence results for vectorial bent functions with Dillon exponent

  • Authors: Lucien Lapierre, Petr Lisonek
  • Subjects: Information Theory (cs.IT)
  • Arxiv link: https://arxiv.org/abs/2110.15585
  • Pdf link: https://arxiv.org/pdf/2110.15585
  • Abstract
    We prove new non-existence results for vectorial monomial Dillon type bent functions mapping the field of order $2^{2m}$ to the field of order $2^{m/3}$. When $m$ is odd and $m>3$ we show that there are no such functions. When $m$ is even we derive a condition for the bent coefficient. The latter result allows us to find examples of bent functions with $m=6$ in a simple way.

Towards Comparative Physical Interpretation of Spatial Variability Aware Neural Networks: A Summary of Results

  • Authors: Jayant Gupta, Carl Molnar, Gaoxiang Luo, Joe Knight, Shashi Shekhar
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2110.15866
  • Pdf link: https://arxiv.org/pdf/2110.15866
  • Abstract
    Given Spatial Variability Aware Neural Networks (SVANNs), the goal is to investigate mathematical (or computational) models for comparative physical interpretation towards their transparency (e.g., simulatibility, decomposability and algorithmic transparency). This problem is important due to important use-cases such as reusability, debugging, and explainability to a jury in a court of law. Challenges include a large number of model parameters, vacuous bounds on generalization performance of neural networks, risk of overfitting, sensitivity to noise, etc., which all detract from the ability to interpret the models. Related work on either model-specific or model-agnostic post-hoc interpretation is limited due to a lack of consideration of physical constraints (e.g., mass balance) and properties (e.g., second law of geography). This work investigates physical interpretation of SVANNs using novel comparative approaches based on geographically heterogeneous features. The proposed approach on feature-based physical interpretation is evaluated using a case-study on wetland mapping. The proposed physical interpretation improves the transparency of SVANN models and the analytical results highlight the trade-off between model transparency and model performance (e.g., F1-score). We also describe an interpretation based on geographically heterogeneous processes modeled as partial differential equations (PDEs).

Keyword: localization

PEDENet: Image Anomaly Localization via Patch Embedding and Density Estimation

  • Authors: Kaitai Zhang, Bin Wang, C.-C. Jay Kuo
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2110.15525
  • Pdf link: https://arxiv.org/pdf/2110.15525
  • Abstract
    A neural network targeting at unsupervised image anomaly localization, called the PEDENet, is proposed in this work. PEDENet contains a patch embedding (PE) network, a density estimation (DE) network, and an auxiliary network called the location prediction (LP) network. The PE network takes local image patches as input and performs dimension reduction to get low-dimensional patch embeddings via a deep encoder structure. Being inspired by the Gaussian Mixture Model (GMM), the DE network takes those patch embeddings and then predicts the cluster membership of an embedded patch. The sum of membership probabilities is used as a loss term to guide the learning process. The LP network is a Multi-layer Perception (MLP), which takes embeddings from two neighboring patches as input and predicts their relative location. The performance of the proposed PEDENet is evaluated extensively and benchmarked with that of state-of-the-art methods.
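
The patch-embedding-plus-density-estimation pattern described above can be illustrated with off-the-shelf components, using PCA as a stand-in for the PE network and a Gaussian Mixture Model as a stand-in for the DE network (the LP network and the membership-based training loss are omitted). This is only a rough sketch of the idea, not PEDENet itself.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic "normal" image patches, flattened to vectors (e.g. 16x16 grayscale patches).
normal_patches = rng.normal(0.0, 1.0, size=(2000, 256))

# Patch embedding: map each patch to a low-dimensional code (stand-in for the PE network).
embedder = PCA(n_components=8).fit(normal_patches)
codes = embedder.transform(normal_patches)

# Density estimation over the codes of normal patches (stand-in for the DE network).
density = GaussianMixture(n_components=5, random_state=0).fit(codes)

def anomaly_score(patches):
    """Higher score = patch is less likely under the density learned from normal data."""
    return -density.score_samples(embedder.transform(patches))

# Patches drawn from a shifted distribution should receive clearly higher scores.
anomalous_patches = rng.normal(3.0, 1.0, size=(10, 256))
print(anomaly_score(normal_patches[:10]).mean())
print(anomaly_score(anomalous_patches).mean())
```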

New submissions for Fri, 5 Nov 21

Keyword: SLAM

There is no result

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: autonomous driving

When Neural Networks Using Different Sensors Create Similar Features

  • Authors: Hugues Moreau (CEA-LETI, LIRIS), Andréa Vassilev (CEA-LETI), Liming Chen (LIRIS, ECL)
  • Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.02732
  • Pdf link: https://arxiv.org/pdf/2111.02732
  • Abstract
    Multimodal problems are omnipresent in the real world: autonomous driving, robotic grasping, scene understanding, etc. We draw from the well-developed analysis of similarity to provide an example of a problem where neural networks are trained from different sensors, and where the features extracted from these sensors still carry similar information. More precisely, we demonstrate that for each sensor, the linear combination of the features from the last layer that correlates the most with other sensors corresponds to the classification components of the classification layer.

Online Continual Learning via Multiple Deep Metric Learning and Uncertainty-guided Episodic Memory Replay -- 3rd Place Solution for ICCV 2021 Workshop SSLAD Track 3A Continual Object Classification

  • Authors: Muhammad Rifki Kurniawan, Xing Wei, Yihong Gong
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.02757
  • Pdf link: https://arxiv.org/pdf/2111.02757
  • Abstract
    Online continual learning in the wild is a very difficult task in machine learning. Non-stationarity in online continual learning potentially brings about catastrophic forgetting in neural networks. Specifically, online continual learning for autonomous driving with SODA10M dataset exhibits extra problems on extremely long-tailed distribution with continuous distribution shift. To address these problems, we propose multiple deep metric representation learning via both contrastive and supervised contrastive learning alongside soft labels distillation to improve model generalization. Moreover, we exploit modified class-balanced focal loss for sensitive penalization in class imbalanced and hard-easy samples. We also store some samples under guidance of uncertainty metric for rehearsal and perform online and periodical memory updates. Our proposed method achieves considerable generalization with average mean class accuracy (AMCA) 64.01% on validation and 64.53% AMCA on test set.

Towards Panoptic 3D Parsing for Single Image in the Wild

  • Authors: Sainan Liu, Vincent Nguyen, Yuan Gao, Subarna Tripathi, Zhuowen Tu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.03039
  • Pdf link: https://arxiv.org/pdf/2111.03039
  • Abstract
    Performing single image holistic understanding and 3D reconstruction is a central task in computer vision. This paper presents an integrated system that performs holistic image segmentation, object detection, instance segmentation, depth estimation, and object instance 3D reconstruction for indoor and outdoor scenes from a single RGB image. We name our system panoptic 3D parsing in which panoptic segmentation ("stuff" segmentation and "things" detection/segmentation) with 3D reconstruction is performed. We design a stage-wise system where a complete set of annotations is absent. Additionally, we present an end-to-end pipeline trained on a synthetic dataset with a full set of annotations. We show results on both indoor (3D-FRONT) and outdoor (COCO and Cityscapes) scenes. Our proposed panoptic 3D parsing framework points to a promising direction in computer vision. It can be applied to various applications, including autonomous driving, mapping, robotics, design, computer graphics, human-computer interaction, and augmented reality.

Keyword: mapping

Qimera: Data-free Quantization with Synthetic Boundary Supporting Samples

  • Authors: Kanghyun Choi, Deokki Hong, Noseong Park, Youngsok Kim, Jinho Lee
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.02625
  • Pdf link: https://arxiv.org/pdf/2111.02625
  • Abstract
    Model quantization is known as a promising method to compress deep neural networks, especially for inferences on lightweight mobile or edge devices. However, model quantization usually requires access to the original training data to maintain the accuracy of the full-precision models, which is often infeasible in real-world scenarios for security and privacy issues. A popular approach to perform quantization without access to the original data is to use synthetically generated samples, based on batch-normalization statistics or adversarial learning. However, the drawback of such approaches is that they primarily rely on random noise input to the generator to attain diversity of the synthetic samples. We find that this is often insufficient to capture the distribution of the original data, especially around the decision boundaries. To this end, we propose Qimera, a method that uses superposed latent embeddings to generate synthetic boundary supporting samples. For the superposed embeddings to better reflect the original distribution, we also propose using an additional disentanglement mapping layer and extracting information from the full-precision model. The experimental results show that Qimera achieves state-of-the-art performances for various settings on data-free quantization. Code is available at https://github.com/iamkanghyunchoi/qimera.
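
The superposed latent embeddings mentioned above boil down to convex combinations of class embeddings that steer the generator toward decision boundaries. A toy sketch of just that superposition step (the generator, the disentanglement mapping layer, and the quantization pipeline are left out; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes, embed_dim = 10, 64
# Stand-in for the learned per-class latent embeddings of the generator.
class_embeddings = rng.normal(size=(num_classes, embed_dim))

def superposed_embedding(class_a, class_b, alpha):
    """Blend two class embeddings to target the decision boundary between them."""
    return alpha * class_embeddings[class_a] + (1.0 - alpha) * class_embeddings[class_b]

# Boundary-supporting latent codes between classes 3 and 7; in the full pipeline these
# would be fed to the generator to synthesize calibration samples for quantization.
alphas = rng.uniform(0.3, 0.7, size=16)
boundary_latents = np.stack([superposed_embedding(3, 7, a) for a in alphas])
print(boundary_latents.shape)  # (16, 64)
```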

Defining Gaze Patterns for Process Model Literacy -- Exploring Visual Routines in Process Models with Diverse Mappings

  • Authors: Michael Winter, Heiko Neumann, Rüdiger Pryss, Thomas Probst, Manfred Reichert
  • Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
  • Arxiv link: https://arxiv.org/abs/2111.02881
  • Pdf link: https://arxiv.org/pdf/2111.02881
  • Abstract
    Process models depict crucial artifacts for organizations regarding documentation, communication, and collaboration. The proper comprehension of such models is essential for an effective application. An important aspect in process model literacy constitutes the question how the information presented in process models is extracted and processed by the human visual system? For such visuospatial tasks, the visual system deploys a set of elemental operations, from whose compositions different visual routines are produced. This paper provides insights from an exploratory eye tracking study, in which visual routines during process model comprehension were contemplated. More specifically, n = 29 participants were asked to comprehend n = 18 process models expressed in the Business Process Model and Notation 2.0 reflecting diverse mappings (i.e., straight, upward, downward) and complexity levels. The performance measures indicated that even less complex process models pose a challenge regarding their comprehension. The upward mapping confronted participants' attention with more challenges, whereas the downward mapping was comprehended more effectively. Based on recorded eye movements, three gaze patterns applied during model comprehension were derived. Thereupon, we defined a general model which identifies visual routines and corresponding elemental operations during process model comprehension. Finally, implications for practice as well as research and directions for future work are discussed in this paper.

Using Graph-Theoretic Machine Learning to Predict Human Driver Behavior

  • Authors: Rohan Chandra, Aniket Bera, Dinesh Manocha
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.02964
  • Pdf link: https://arxiv.org/pdf/2111.02964
  • Abstract
    Studies have shown that autonomous vehicles (AVs) behave conservatively in a traffic environment composed of human drivers and do not adapt to local conditions and socio-cultural norms. It is known that socially aware AVs can be designed if there exists a mechanism to understand the behaviors of human drivers. We present an approach that leverages machine learning to predict the behaviors of human drivers. This is similar to how humans implicitly interpret the behaviors of drivers on the road, by only observing the trajectories of their vehicles. We use graph-theoretic tools to extract driver behavior features from the trajectories and machine learning to obtain a computational mapping between the extracted trajectory of a vehicle in traffic and the driver behaviors. Compared to prior approaches in this domain, we prove that our method is robust, general, and extendable to broad-ranging applications such as autonomous navigation. We evaluate our approach on real-world traffic datasets captured in the U.S., India, China, and Singapore, as well as in simulation.

Towards Panoptic 3D Parsing for Single Image in the Wild

  • Authors: Sainan Liu, Vincent Nguyen, Yuan Gao, Subarna Tripathi, Zhuowen Tu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.03039
  • Pdf link: https://arxiv.org/pdf/2111.03039
  • Abstract
    Performing single image holistic understanding and 3D reconstruction is a central task in computer vision. This paper presents an integrated system that performs holistic image segmentation, object detection, instance segmentation, depth estimation, and object instance 3D reconstruction for indoor and outdoor scenes from a single RGB image. We name our system panoptic 3D parsing in which panoptic segmentation ("stuff" segmentation and "things" detection/segmentation) with 3D reconstruction is performed. We design a stage-wise system where a complete set of annotations is absent. Additionally, we present an end-to-end pipeline trained on a synthetic dataset with a full set of annotations. We show results on both indoor (3D-FRONT) and outdoor (COCO and Cityscapes) scenes. Our proposed panoptic 3D parsing framework points to a promising direction in computer vision. It can be applied to various applications, including autonomous driving, mapping, robotics, design, computer graphics, human-computer interaction, and augmented reality.

Keyword: localization

Logically Sound Arguments for the Effectiveness of ML Safety Measures

  • Authors: Chih-Hong Cheng, Tobias Schuster, Simon Burton
  • Subjects: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.02649
  • Pdf link: https://arxiv.org/pdf/2111.02649
  • Abstract
    We investigate the issues of achieving sufficient rigor in the arguments for the safety of machine learning functions. By considering the known weaknesses of DNN-based 2D bounding box detection algorithms, we sharpen the metric of imprecise pedestrian localization by associating it with the safety goal. The sharpening leads to introducing a conservative post-processor after the standard non-max-suppression as a counter-measure. We then propose a semi-formal assurance case for arguing the effectiveness of the post-processor, which is further translated into formal proof obligations for demonstrating the soundness of the arguments. Applying theorem proving not only discovers the need to introduce missing claims and mathematical concepts but also reveals the limitation of Dempster-Shafer's rules used in semi-formal argumentation.

idSTLPy: A Python Toolbox for Active Perception and Control

  • Authors: Rafael Rodrigues da Silva, Kunal Yadav, Hai Lin
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2111.02943
  • Pdf link: https://arxiv.org/pdf/2111.02943
  • Abstract
    This paper describes a Python toolbox for active perception and control synthesis of probabilistic signal temporal logic (PrSTL) formulas of switched linear systems with additive Gaussian disturbances and measurement noises. We implement a counterexample-guided synthesis strategy that combines Bounded Model Checking, linear programming, and sampling-based motion planning techniques. We illustrate our approach and the toolbox throughout the paper with a motion planning example for a vehicle with noisy localization. The code is available at \url{https://codeocean.com/capsule/0013534/tree}.

Bootstrap Your Object Detector via Mixed Training

  • Authors: Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Stephen Lin, Han Hu, Xiang Bai
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.03056
  • Pdf link: https://arxiv.org/pdf/2111.03056
  • Abstract
    We introduce MixTraining, a new training paradigm for object detection that can improve the performance of existing detectors for free. MixTraining enhances data augmentation by utilizing augmentations of different strengths while excluding the strong augmentations of certain training samples that may be detrimental to training. In addition, it addresses localization noise and missing labels in human annotations by incorporating pseudo boxes that can compensate for these errors. Both of these MixTraining capabilities are made possible through bootstrapping on the detector, which can be used to predict the difficulty of training on a strong augmentation, as well as to generate reliable pseudo boxes thanks to the robustness of neural networks to labeling error. MixTraining is found to bring consistent improvements across various detectors on the COCO dataset. In particular, the performance of Faster R-CNN \cite{ren2015faster} with a ResNet-50 \cite{he2016deep} backbone is improved from 41.7 mAP to 44.0 mAP, and the accuracy of Cascade-RCNN \cite{cai2018cascade} with a Swin-Small \cite{liu2021swin} backbone is raised from 50.9 mAP to 52.8 mAP. The code and models will be made publicly available at \url{https://github.com/MendelXu/MixTraining}.

New submissions for Wed, 9 Jun 21

Keyword: SLAM

There is no result

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: autonomous driving

Game-Theoretic Model Predictive Control with Data-Driven Identification of Vehicle Model for Head-to-Head Autonomous Racing

  • Authors: Chanyoung Jung, Seungwook Lee, Hyunki Seong, Andrea Finazzi, David Hyunchul Shim
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.04094
  • Pdf link: https://arxiv.org/pdf/2106.04094
  • Abstract
    As a setting for resolving edge-cases in autonomous driving, head-to-head autonomous racing is attracting a lot of attention from industry and academia. In this study, we propose a game-theoretic model predictive control (MPC) approach for head-to-head autonomous racing and a data-driven model identification method. For the practical estimation of nonlinear model parameters, we adopted the hyperband algorithm, which is used for neural model training in machine learning. The proposed controller comprises three modules: 1) a game-based opponents' trajectory predictor, 2) a high-level race strategy planner, and 3) an MPC-based low-level controller. The game-based predictor was designed to predict the future trajectories of competitors. Based on the prediction results, the high-level race strategy planner plans several behaviors to respond to various race circumstances. Finally, the MPC-based controller computes the optimal control commands to follow the trajectories. The proposed approach was validated under various racing circumstances in an official simulator of the Indy Autonomous Challenge. The experimental results show that the proposed method can effectively overtake competitors, while driving through the track as quickly as possible without collisions.

New submissions for Mon, 15 Nov 21

Keyword: SLAM

There is no result

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

Multimodal Virtual Point 3D Detection

  • Authors: Tianwei Yin, Xingyi Zhou, Philipp Krähenbühl
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.06881
  • Pdf link: https://arxiv.org/pdf/2111.06881
  • Abstract
    Lidar-based sensing drives current autonomous vehicles. Despite rapid progress, current Lidar sensors still lag two decades behind traditional color cameras in terms of resolution and cost. For autonomous driving, this means that large objects close to the sensors are easily visible, but far-away or small objects comprise only one measurement or two. This is an issue, especially when these objects turn out to be driving hazards. On the other hand, these same objects are clearly visible in onboard RGB sensors. In this work, we present an approach to seamlessly fuse RGB sensors into Lidar-based 3D recognition. Our approach takes a set of 2D detections to generate dense 3D virtual points to augment an otherwise sparse 3D point cloud. These virtual points naturally integrate into any standard Lidar-based 3D detectors along with regular Lidar measurements. The resulting multi-modal detector is simple and effective. Experimental results on the large-scale nuScenes dataset show that our framework improves a strong CenterPoint baseline by a significant 6.6 mAP, and outperforms competing fusion approaches. Code and more visualizations are available at https://tianweiy.github.io/mvp/
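
The virtual-point construction summarized above is essentially: sample pixels inside each 2D detection, borrow a depth for each pixel from the nearest projected lidar point, and unproject back to 3D. A simplified numpy sketch of that geometry, assuming a pinhole camera with known intrinsics (the paper's actual sampling and fusion details differ):

```python
import numpy as np

def unproject(pixels, depths, K):
    """Lift pixel coordinates (N, 2) with depths (N,) to 3D points in camera coordinates."""
    ones = np.ones((pixels.shape[0], 1))
    rays = np.linalg.inv(K) @ np.hstack([pixels, ones]).T   # (3, N) viewing rays
    return (rays * depths).T                                # scale each ray by its depth

def virtual_points_for_box(box, lidar_uvd, K, num_samples=50, rng=None):
    """Sample pixels inside a 2D detection and give each the depth of the nearest lidar pixel.

    box:       (x1, y1, x2, y2) 2D detection in pixels
    lidar_uvd: (M, 3) lidar points projected into the image as (u, v, depth)
    """
    if rng is None:
        rng = np.random.default_rng(0)
    x1, y1, x2, y2 = box
    samples = np.column_stack([rng.uniform(x1, x2, num_samples),
                               rng.uniform(y1, y2, num_samples)])
    # The nearest projected lidar point (in image space) lends its depth to the sampled pixel.
    dists = np.linalg.norm(samples[:, None, :] - lidar_uvd[None, :, :2], axis=2)
    depths = lidar_uvd[dists.argmin(axis=1), 2]
    return unproject(samples, depths, K)

# Hypothetical intrinsics, one 2D detection, and a few projected lidar points.
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
lidar_uvd = np.array([[300.0, 220.0, 12.0],
                      [340.0, 250.0, 12.5],
                      [360.0, 230.0, 11.8]])
print(virtual_points_for_box((290, 210, 370, 260), lidar_uvd, K).shape)  # (50, 3)
```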

Keyword: loop detection

There is no result

Keyword: autonomous driving

Multimodal Virtual Point 3D Detection

  • Authors: Tianwei Yin, Xingyi Zhou, Philipp Krähenbühl
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.06881
  • Pdf link: https://arxiv.org/pdf/2111.06881
  • Abstract
    Lidar-based sensing drives current autonomous vehicles. Despite rapid progress, current Lidar sensors still lag two decades behind traditional color cameras in terms of resolution and cost. For autonomous driving, this means that large objects close to the sensors are easily visible, but far-away or small objects comprise only one measurement or two. This is an issue, especially when these objects turn out to be driving hazards. On the other hand, these same objects are clearly visible in onboard RGB sensors. In this work, we present an approach to seamlessly fuse RGB sensors into Lidar-based 3D recognition. Our approach takes a set of 2D detections to generate dense 3D virtual points to augment an otherwise sparse 3D point cloud. These virtual points naturally integrate into any standard Lidar-based 3D detectors along with regular Lidar measurements. The resulting multi-modal detector is simple and effective. Experimental results on the large-scale nuScenes dataset show that our framework improves a strong CenterPoint baseline by a significant 6.6 mAP, and outperforms competing fusion approaches. Code and more visualizations are available at https://tianweiy.github.io/mvp/

Keyword: mapping

Closed-Loop Data Transcription to an LDR via Minimaxing Rate Reduction

  • Authors: Xili Dai, Shengbang Tong, Mingyang Li, Ziyang Wu, Kwan Ho Ryan Chan, Pengyuan Zhai, Yaodong Yu, Michael Psenka, Xiaojun Yuan, Heung Yeung Shum, Yi Ma
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.06636
  • Pdf link: https://arxiv.org/pdf/2111.06636
  • Abstract
    This work proposes a new computational framework for learning an explicit generative model for real-world datasets. In particular we propose to learn {\em a closed-loop transcription} between a multi-class multi-dimensional data distribution and a { linear discriminative representation (LDR)} in the feature space that consists of multiple independent multi-dimensional linear subspaces. In particular, we argue that the optimal encoding and decoding mappings sought can be formulated as the equilibrium point of a {\em two-player minimax game between the encoder and decoder}. A natural utility function for this game is the so-called {\em rate reduction}, a simple information-theoretic measure for distances between mixtures of subspace-like Gaussians in the feature space. Our formulation draws inspiration from closed-loop error feedback from control systems and avoids expensive evaluating and minimizing approximated distances between arbitrary distributions in either the data space or the feature space. To a large extent, this new formulation unifies the concepts and benefits of Auto-Encoding and GAN and naturally extends them to the settings of learning a {\em both discriminative and generative} representation for multi-class and multi-dimensional real-world data. Our extensive experiments on many benchmark imagery datasets demonstrate tremendous potential of this new closed-loop formulation: under fair comparison, visual quality of the learned decoder and classification performance of the encoder is competitive and often better than existing methods based on GAN, VAE, or a combination of both. We notice that the so learned features of different classes are explicitly mapped onto approximately {\em independent principal subspaces} in the feature space; and diverse visual attributes within each class are modeled by the {\em independent principal components} within each subspace.
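
For context, the rate reduction measure referred to in the abstract comes from the maximal coding rate reduction (MCR²) line of work it builds on. Stated from memory, so treat it as a paraphrase rather than the paper's exact formulation, it is the gap between a coding rate for all features and a class-conditional coding rate:

```latex
% Coding rate of features Z = [z_1, ..., z_n] \in R^{d \times n} at distortion \epsilon:
R(Z, \epsilon) = \tfrac{1}{2} \log\det\!\Big( I + \tfrac{d}{n\epsilon^{2}}\, Z Z^{\top} \Big)

% Class-conditional rate, with diagonal membership matrices \Pi_j for the k classes:
R_c(Z, \epsilon \mid \Pi) = \sum_{j=1}^{k} \frac{\operatorname{tr}(\Pi_j)}{2n}
    \log\det\!\Big( I + \tfrac{d}{\operatorname{tr}(\Pi_j)\,\epsilon^{2}}\, Z \Pi_j Z^{\top} \Big)

% Rate reduction: expand the whole feature set while compressing each class.
\Delta R(Z, \Pi, \epsilon) = R(Z, \epsilon) - R_c(Z, \epsilon \mid \Pi)
```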

Robust Analytics for Video-Based Gait Biometrics

  • Authors: Ebenezer R.H.P. Isaac
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2111.06670
  • Pdf link: https://arxiv.org/pdf/2111.06670
  • Abstract
    Gait analysis is the study of the systematic methods that assess and quantify animal locomotion. Gait finds a unique importance among the many state-of-the-art biometric systems since it does not require the subject's cooperation to the extent required by other modalities. Hence by nature, it is an unobtrusive biometric. This thesis discusses both hard and soft biometric characteristics of gait. It shows how to identify gender based on gait alone through the Pose-Based Voting scheme. It then describes improving gait recognition accuracy using Genetic Template Segmentation. Members of a wide population can be authenticated using Multiperson Signature Mapping. Finally, the mapping can be improved in a smaller population using Bayesian Thresholding. All methods proposed in this thesis have outperformed the existing state of the art with adequate experimentation and results.

Keyword: localization

There is no result

New submissions for Wed, 27 Oct 21

Keyword: SLAM

Robust Multi-view Registration of Point Sets with Laplacian Mixture Model

  • Authors: Jin Zhang, Mingyang Zhao, Xin Jiang, Dong-Ming Yan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.13744
  • Pdf link: https://arxiv.org/pdf/2110.13744
  • Abstract
    Point set registration is an essential step in many computer vision applications, such as 3D reconstruction and SLAM. Although many registration algorithms exist for different purposes, this topic is still challenging due to the increasing complexity of various real-world scenarios, such as heavy noise and outlier contamination. In this paper, we propose a novel probabilistic generative method to simultaneously align multiple point sets based on the heavy-tailed Laplacian distribution. The proposed method assumes each data point is generated by a Laplacian Mixture Model (LMM), where its centers are determined by the corresponding points in other point sets. Different from the previous Gaussian Mixture Model (GMM) based method, which minimizes the quadratic distance between points and centers of Gaussian probability density, LMM minimizes the sparsity-induced L1 distance, making it more robust against noise and outliers. We adopt the Expectation-Maximization (EM) framework to solve LMM parameters and rigid transformations. We approximate the L1 optimization as a linear programming problem by exponential mapping in Lie algebra, which can be effectively solved through the interior point method. To improve efficiency, we also solve the L1 optimization by the Alternating Direction Method of Multipliers (ADMM). We demonstrate the advantages of our method by comparing it with representative state-of-the-art approaches on challenging benchmark data sets, in terms of robustness and accuracy.
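
The robustness claim in the abstract (an L1/Laplacian data term resists outliers better than an L2/Gaussian one) can be seen in a one-dimensional toy experiment; this only illustrates the cost functions and does not reproduce the paper's EM, linear-programming, or ADMM machinery:

```python
import numpy as np

rng = np.random.default_rng(0)

# Residuals of a hypothetical alignment: mostly small inlier errors plus a few gross outliers.
residuals = np.concatenate([rng.normal(0.0, 0.05, size=95),
                            rng.uniform(5.0, 10.0, size=5)])

def l2_cost(t):
    """Gaussian (GMM-style) data term: quadratic penalty on the shifted residuals."""
    return np.sum((residuals - t) ** 2)

def l1_cost(t):
    """Laplacian (LMM-style) data term: sparsity-inducing absolute penalty."""
    return np.sum(np.abs(residuals - t))

shifts = np.linspace(-1.0, 2.0, 601)
best_l2 = shifts[np.argmin([l2_cost(t) for t in shifts])]
best_l1 = shifts[np.argmin([l1_cost(t) for t in shifts])]

# The L2 optimum is dragged toward the outliers (the mean), while the L1 optimum stays
# near the inlier mode (the median), which is the robustness property the abstract exploits.
print(f"L2 optimum: {best_l2:.3f}, L1 optimum: {best_l1:.3f}")
```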

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: autonomous driving

AUTO-DISCERN: Autonomous Driving Using Common Sense Reasoning

  • Authors: Suraj Kothawade, Vinaya Khandelwal, Kinjal Basu, Huaduo Wang, Gopal Gupta
  • Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
  • Arxiv link: https://arxiv.org/abs/2110.13606
  • Pdf link: https://arxiv.org/pdf/2110.13606
  • Abstract
    Driving an automobile involves the tasks of observing surroundings, then making a driving decision based on these observations (steer, brake, coast, etc.). In autonomous driving, all these tasks have to be automated. Autonomous driving technology thus far has relied primarily on machine learning techniques. We argue that appropriate technology should be used for the appropriate task. That is, while machine learning technology is good for observing and automatically understanding the surroundings of an automobile, driving decisions are better automated via commonsense reasoning rather than machine learning. In this paper, we discuss (i) how commonsense reasoning can be automated using answer set programming (ASP) and the goal-directed s(CASP) ASP system, and (ii) develop the AUTO-DISCERN system using this technology for automating decision-making in driving. The goal of our research, described in this paper, is to develop an autonomous driving system that works by simulating the mind of a human driver. Since driving decisions are based on human-style reasoning, they are explainable, their ethics can be ensured, and they will always be correct, provided the system modeling and system inputs are correct.

Semantic Segmentation for Urban-Scene Images

  • Authors: Shorya Sharma
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2110.13813
  • Pdf link: https://arxiv.org/pdf/2110.13813
  • Abstract
    Urban-scene Image segmentation is an important and trending topic in computer vision with wide use cases like autonomous driving [1]. Starting with the breakthrough work of Long et al. [2] that introduces Fully Convolutional Networks (FCNs), the development of novel architectures and practical uses of neural networks in semantic segmentation has been expedited in the recent 5 years. Aside from seeking solutions in general model design for information shrinkage due to pooling, urban-scene image itself has intrinsic features like positional patterns [3]. Our project seeks an advanced and integrated solution that specifically targets urban-scene image semantic segmentation among the most novel approaches in the current field. We re-implement the cutting edge model DeepLabv3+ [4] with ResNet-101 [5] backbone as our strong baseline model. Based upon DeepLabv3+, we incorporate HANet [3] to account for the vertical spatial priors in urban-scene image tasks. To boost up model efficiency and performance, we further explore the Atrous Spatial Pooling (ASP) layer in DeepLabv3+ and infuse a computational efficient variation called "Waterfall" Atrous Spatial Pooling (WASP) [6] architecture in our model. We find that our two-step integrated model improves the mean Intersection-Over-Union (mIoU) score gradually from the baseline model. In particular, HANet successfully identifies height-driven patterns and improves per-class IoU of common class labels in urban scenario like fence and bus. We also demonstrate the improvement of model efficiency with help of WASP in terms of computational times during training and parameter reduction from the original ASPP module.

Keyword: mapping

Exposure of occupations to technologies of the fourth industrial revolution

  • Authors: Benjamin Meindl, Morgan R. Frank, Joana Mendonça
  • Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2110.13317
  • Pdf link: https://arxiv.org/pdf/2110.13317
  • Abstract
    The fourth industrial revolution (4IR) is likely to have a substantial impact on the economy. Companies need to build up capabilities to implement new technologies, and automation may make some occupations obsolete. However, where, when, and how the change will happen remain to be determined. Robust empirical indicators of technological progress linked to occupations can help to illuminate this change. With this aim, we provide such an indicator based on patent data. Using natural language processing, we calculate patent exposure scores for more than 900 occupations, which represent the technological progress related to them. To provide a lens on the impact of the 4IR, we differentiate between traditional and 4IR patent exposure. Our method differs from previous approaches in that it both accounts for the diversity of task-level patent exposures within an occupation and reflects work activities more accurately. We find that exposure to 4IR patents differs from traditional patent exposure. Manual tasks, and accordingly occupations such as construction and production, are exposed mainly to traditional (non-4IR) patents but have low exposure to 4IR patents. The analysis suggests that 4IR technologies may have a negative impact on job growth; this impact appears 10 to 20 years after patent filing. Further, we compared the 4IR exposure to other automation and AI exposure scores. Whereas many measures refer to theoretical automation potential, our patent-based indicator reflects actual technology diffusion. Our work not only allows analyses of the impact of 4IR technologies as a whole, but also provides exposure scores for more than 300 technology fields, such as AI and smart office technologies. Finally, the work provides a general mapping of patents to tasks and occupations, which enables future researchers to construct individual exposure measures.

Tensor Network Kalman Filtering for Large-Scale LS-SVMs

  • Authors: Maximilian Lucassen, Johan A.K. Suykens, Kim Batselier
  • Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2110.13501
  • Pdf link: https://arxiv.org/pdf/2110.13501
  • Abstract
    Least squares support vector machines are a commonly used supervised learning method for nonlinear regression and classification. They can be implemented in either their primal or dual form. The latter requires solving a linear system, which can be advantageous as an explicit mapping of the data to a possibly infinite-dimensional feature space is avoided. However, for large-scale applications, current low-rank approximation methods can perform inadequately. For example, current methods are probabilistic due to their sampling procedures, and/or suffer from a poor trade-off between the ranks and approximation power. In this paper, a recursive Bayesian filtering framework based on tensor networks and the Kalman filter is presented to alleviate the demanding memory and computational complexities associated with solving large-scale dual problems. The proposed method is iterative, does not require explicit storage of the kernel matrix, and allows the formulation of early stopping conditions. Additionally, the framework yields confidence estimates of obtained models, unlike alternative methods. The performance is tested on two regression and three classification experiments, and compared to the Nyström and fixed size LS-SVM methods. Results show that our method can achieve high performance and is particularly useful when alternative methods are computationally infeasible due to a slowly decaying kernel matrix spectrum.

Non-Gaussian Gaussian Processes for Few-Shot Regression

  • Authors: Marcin Sendera, Jacek Tabor, Aleksandra Nowak, Andrzej Bedychaj, Massimiliano Patacchiola, Tomasz Trzciński, Przemysław Spurek, Maciej Zięba
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2110.13561
  • Pdf link: https://arxiv.org/pdf/2110.13561
  • Abstract
    Gaussian Processes (GPs) have been widely used in machine learning to model distributions over functions, with applications including multi-modal regression, time-series prediction, and few-shot learning. GPs are particularly useful in the last application since they rely on Normal distributions and enable closed-form computation of the posterior probability function. Unfortunately, because the resulting posterior is not flexible enough to capture complex distributions, GPs assume high similarity between subsequent tasks - a requirement rarely met in real-world conditions. In this work, we address this limitation by leveraging the flexibility of Normalizing Flows to modulate the posterior predictive distribution of the GP. This makes the GP posterior locally non-Gaussian, therefore we name our method Non-Gaussian Gaussian Processes (NGGPs). More precisely, we propose an invertible ODE-based mapping that operates on each component of the random variable vectors and shares the parameters across all of them. We empirically tested the flexibility of NGGPs on various few-shot learning regression datasets, showing that the mapping can incorporate context embedding information to model different noise levels for periodic functions. As a result, our method shares the structure of the problem between subsequent tasks, but the contextualization allows for adaptation to dissimilarities. NGGPs outperform the competing state-of-the-art approaches on a diversified set of benchmarks and applications.

Gradient representations in ReLU networks as similarity functions

  • Authors: Dániel Rácz, Bálint Daróczy
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2110.13581
  • Pdf link: https://arxiv.org/pdf/2110.13581
  • Abstract
    Feed-forward networks can be interpreted as mappings with linear decision surfaces at the level of the last layer. We investigate how the tangent space of the network can be exploited to refine the decision in case of ReLU (Rectified Linear Unit) activations. We show that a simple Riemannian metric parametrized on the parameters of the network forms a similarity function at least as good as the original network and we suggest a sparse metric to increase the similarity gap.

An extension of the order-preserving mapping to the WENO-Z-type schemes

  • Authors: Ruo Li, Wei Zhong
  • Subjects: Numerical Analysis (math.NA)
  • Arxiv link: https://arxiv.org/abs/2110.13607
  • Pdf link: https://arxiv.org/pdf/2110.13607
  • Abstract
    In our latest studies, by introducing the novel order-preserving (OP) criterion, we have successfully addressed the widely concerned issue of the previously published mapped weighted essentially non-oscillatory (WENO) schemes that it is rather difficult to achieve high resolutions on the premise of removing spurious oscillations for long-run simulations of the hyperbolic systems. In the present study, we extend the OP criterion to the WENO-Z-type schemes as the aforementioned issue has also been extensively observed numerically for these schemes. Firstly, we innovatively present the concept of the generalized mapped WENO schemes by rewriting the Z-type weights in a uniform formula from the perspective of the mapping relation. Then, we naturally introduce the OP criterion to improve the WENO-Z-type schemes, and the resultant schemes are denoted as MOP-GMWENO-X. Finally, extensive numerical experiments have been conducted to demonstrate the benefits of these new schemes. We draw the conclusion that the convergence properties of the proposed schemes are equivalent to the corresponding WENO-X schemes. The major benefit of the new schemes is that they have the capacity to achieve high resolutions and simultaneously remove spurious oscillations for long simulations. The new schemes have the additional benefit that they can greatly decrease the post-shock oscillations when solving 2D Euler problems with strong shock waves.

Robust Multi-view Registration of Point Sets with Laplacian Mixture Model

  • Authors: Jin Zhang, Mingyang Zhao, Xin Jiang, Dong-Ming Yan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.13744
  • Pdf link: https://arxiv.org/pdf/2110.13744
  • Abstract
    Point set registration is an essential step in many computer vision applications, such as 3D reconstruction and SLAM. Although many registration algorithms exist for different purposes, this topic is still challenging due to the increasing complexity of various real-world scenarios, such as heavy noise and outlier contamination. In this paper, we propose a novel probabilistic generative method to simultaneously align multiple point sets based on the heavy-tailed Laplacian distribution. The proposed method assumes each data point is generated by a Laplacian Mixture Model (LMM), where its centers are determined by the corresponding points in other point sets. Different from the previous Gaussian Mixture Model (GMM) based method, which minimizes the quadratic distance between points and centers of Gaussian probability density, LMM minimizes the sparsity-induced L1 distance, making it more robust against noise and outliers. We adopt the Expectation-Maximization (EM) framework to solve LMM parameters and rigid transformations. We approximate the L1 optimization as a linear programming problem by exponential mapping in Lie algebra, which can be effectively solved through the interior point method. To improve efficiency, we also solve the L1 optimization by the Alternating Direction Method of Multipliers (ADMM). We demonstrate the advantages of our method by comparing it with representative state-of-the-art approaches on challenging benchmark data sets, in terms of robustness and accuracy.

Keyword: localization

Plug-and-Play Few-shot Object Detection with Meta Strategy and Explicit Localization Inference

  • Authors: Junying Huang, Fan Chen, Liang Lin, Dongyu Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.13377
  • Pdf link: https://arxiv.org/pdf/2110.13377
  • Abstract
    Aiming at recognizing and localizing objects of novel categories from a few reference samples, few-shot object detection is a quite challenging task. Previous works often depend on the fine-tuning process to transfer their model to the novel category and rarely consider the defect of fine-tuning, resulting in many drawbacks. For example, these methods are far from satisfactory in the low-shot or episode-based scenarios since the fine-tuning process in object detection requires much time and high-shot support data. To this end, this paper proposes a plug-and-play few-shot object detection (PnP-FSOD) framework that can accurately and directly detect the objects of novel categories without the fine-tuning process. To accomplish the objective, the PnP-FSOD framework contains two parallel techniques to address the core challenges in few-shot learning, i.e., the across-category task and few-annotation support. Concretely, we first propose two simple but effective meta strategies for the box classifier and RPN module to enable across-category object detection without fine-tuning. Then, we introduce two explicit inferences into the localization process to reduce its dependence on the annotated data, including an explicit localization score and semi-explicit box regression. In addition to the PnP-FSOD framework, we propose a novel one-step tuning method that can avoid the defects in fine-tuning. It is noteworthy that the proposed techniques and tuning method are based on the general object detector without other prior methods, so they are easily compatible with the existing FSOD methods. Extensive experiments show that the PnP-FSOD framework has achieved state-of-the-art few-shot object detection performance without any tuning method. After applying the one-step tuning method, it further shows a significant lead in efficiency, precision, and recall, under varied evaluation protocols.

TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation

  • Authors: Tanzila Rahman, Mengyu Yang, Leonid Sigal
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.13412
  • Pdf link: https://arxiv.org/pdf/2110.13412
  • Abstract
    The recent success of transformer models in language, such as BERT, has motivated the use of such architectures for multi-modal feature learning and tasks. However, most multi-modal variants (e.g., ViLBERT) have limited themselves to visual-linguistic data. Relatively few have explored its use in audio-visual modalities, and none, to our knowledge, illustrate them in the context of granular audio-visual detection or segmentation tasks such as sound source separation and localization. In this work, we introduce TriBERT -- a transformer-based architecture, inspired by ViLBERT, which enables contextual feature learning across three modalities: vision, pose, and audio, with the use of flexible co-attention. The use of pose keypoints is inspired by recent works that illustrate that such representations can significantly boost performance in many audio-visual scenarios where often one or more persons are responsible for the sound explicitly (e.g., talking) or implicitly (e.g., sound produced as a function of human manipulating an object). From a technical perspective, as part of the TriBERT architecture, we introduce a learned visual tokenization scheme based on spatial attention and leverage weak-supervision to allow granular cross-modal interactions for visual and pose modalities. Further, we supplement learning with sound-source separation loss formulated across all three streams. We pre-train our model on the large MUSIC21 dataset and demonstrate improved performance in audio-visual sound source separation on that dataset as well as other datasets through fine-tuning. In addition, we show that the learned TriBERT representations are generic and significantly improve performance on other audio-visual tasks such as cross-modal audio-visual-pose retrieval by as much as 66.7% in top-1 accuracy.

Response-based Distillation for Incremental Object Detection

  • Authors: Tao Feng, Mang Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.13471
  • Pdf link: https://arxiv.org/pdf/2110.13471
  • Abstract
    Traditional object detectors are ill-equipped for incremental learning. However, fine-tuning a well-trained detection model directly on only new data leads to catastrophic forgetting. Knowledge distillation is a straightforward way to mitigate catastrophic forgetting. In Incremental Object Detection (IOD), previous work mainly focuses on feature-level knowledge distillation, but the different responses of the detector have not been fully explored yet. In this paper, we propose a fully response-based incremental distillation method focusing on learning responses from detection bounding boxes and classification predictions. Firstly, our method transfers category knowledge while equipping the student model with the ability to retain localization knowledge during incremental learning. In addition, we further evaluate the quality of all locations and provide valuable responses via adaptive pseudo-label selection (APS) strategies. Finally, we elucidate that knowledge from different responses should be assigned different importance during incremental distillation. Extensive experiments conducted on MS COCO demonstrate significant advantages of our method, which substantially narrows the performance gap towards full training.

Cross-Region Building Counting in Satellite Imagery using Counting Consistency

  • Authors: Muaaz Zakria, Hamza Rawal, Waqas Sultani, Mohsen Ali
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.13558
  • Pdf link: https://arxiv.org/pdf/2110.13558
  • Abstract
    Estimating the number of buildings in any geographical region is a vital component of urban analysis, disaster management, and public policy decision. Deep learning methods for building localization and counting in satellite imagery, can serve as a viable and cheap alternative. However, these algorithms suffer performance degradation when applied to the regions on which they have not been trained. Current large datasets mostly cover the developed regions and collecting such datasets for every region is a costly, time-consuming, and difficult endeavor. In this paper, we propose an unsupervised domain adaptation method for counting buildings where we use a labeled source domain (developed regions) and adapt the trained model on an unlabeled target domain (developing regions). We initially align distribution maps across domains by aligning the output space distribution through adversarial loss. We then exploit counting consistency constraints, within-image count consistency, and across-image count consistency, to decrease the domain shift. Within-image consistency enforces that building count in the whole image should be greater than or equal to count in any of its sub-image. Across-image consistency constraint enforces that if an image contains considerably more buildings than the other image, then their sub-images shall also have the same order. These two constraints encourage the behavior to be consistent across and within the images, regardless of the scale. To evaluate the performance of our proposed approach, we collected and annotated a large-scale dataset consisting of challenging South Asian regions having higher building densities and irregular structures as compared to existing datasets. We perform extensive experiments to verify the efficacy of our approach and report improvements of approximately 7% to 20% over the competitive baseline methods.
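
The two consistency constraints described above translate naturally into hinge-style penalties on predicted counts. A minimal sketch of such penalties (the counting network, the adversarial output-space alignment, and all training details are omitted; the numbers and function names are hypothetical):

```python
import numpy as np

def within_image_consistency(count_whole, counts_sub):
    """Penalize any sub-image whose predicted count exceeds the whole-image count."""
    return float(np.sum(np.maximum(0.0, np.asarray(counts_sub) - count_whole)))

def across_image_consistency(counts_a_sub, counts_b_sub, margin=0.0):
    """Given that image A has considerably more buildings than image B overall,
    penalize sub-image pairs whose predicted counts violate that ordering."""
    gap = np.asarray(counts_b_sub) - np.asarray(counts_a_sub) + margin
    return float(np.sum(np.maximum(0.0, gap)))

# Hypothetical predicted counts from a counting network on unlabeled target images.
count_whole = 42.0
counts_sub = [10.0, 12.0, 9.0, 45.0]   # the last sub-image exceeds the whole image
print(within_image_consistency(count_whole, counts_sub))    # 3.0
print(across_image_consistency([20.0, 18.0], [22.0, 5.0]))  # 2.0
```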

Pyramidal Blur Aware X-Corner Chessboard Detector

  • Authors: Peter Abeles
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.13793
  • Pdf link: https://arxiv.org/pdf/2110.13793
  • Abstract
    With camera resolution ever increasing and the need to rapidly recalibrate robotic platforms in less than ideal environments, there is a need for faster and more robust chessboard fiducial marker detectors. A new chessboard detector is proposed that is specifically designed for: high resolution images, focus/motion blur, harsh lighting conditions, and background clutter. This is accomplished using a new x-corner detector, where for the first time blur is estimated and used in a novel way to enhance corner localization, edge validation, and connectivity. Performance is measured and compared against other libraries using a diverse set of images created by combining multiple third party datasets and including new specially crafted scenarios designed to stress the state-of-the-art. The proposed detector has the best F1-Score of 0.97, runs 1.9x faster than the next fastest, and is a top performer for corner accuracy, while being the only detector to have consistently good performance in all scenarios.

New submissions for Mon, 7 Jun 21

Keyword: SLAM

Flying with Cartographer: Adapting the Cartographer 3D Graph SLAM Stack for UAV Navigation

  • Authors: Juraj Orsulić, Robert Milijas, Ana Batinovic, Lovro Markovic, Antun Ivanovic, Stjepan Bogdan
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.02535
  • Pdf link: https://arxiv.org/pdf/2106.02535
  • Abstract
    This paper describes an application of the Cartographer graph SLAM stack as a pose sensor in a UAV feedback control loop, with certain application-specific changes in the SLAM stack such as smoothing of the optimized pose. Pose estimation is performed by fusing 3D LiDAR/IMU-based proprioception with GPS position measurements by means of pose graph optimisation. Moreover, partial environment maps built from the LiDAR data (submaps) within the Cartographer SLAM stack are marshalled into OctoMap, an Octree-based voxel map implementation. The OctoMap is further used for navigation tasks such as path planning and obstacle avoidance.

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

A Survey on Deep Domain Adaptation for LiDAR Perception

  • Authors: Larissa T. Triess, Mariella Dreissig, Christoph B. Rist, J. Marius Zöllner
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2106.02377
  • Pdf link: https://arxiv.org/pdf/2106.02377
  • Abstract
    Scalable systems for automated driving have to reliably cope with an open-world setting. This means, the perception systems are exposed to drastic domain shifts, like changes in weather conditions, time-dependent aspects, or geographic regions. Covering all domains with annotated data is impossible because of the endless variations of domains and the time-consuming and expensive annotation process. Furthermore, fast development cycles of the system additionally introduce hardware changes, such as sensor types and vehicle setups, and the required knowledge transfer from simulation. To enable scalable automated driving, it is therefore crucial to address these domain shifts in a robust and efficient manner. Over the last years, a vast amount of different domain adaptation techniques evolved. There already exists a number of survey papers for domain adaptation on camera images, however, a survey for LiDAR perception is absent. Nevertheless, LiDAR is a vital sensor for automated driving that provides detailed 3D scans of the vehicle's surroundings. To stimulate future research, this paper presents a comprehensive review of recent progress in domain adaptation methods and formulates interesting research questions specifically targeted towards LiDAR perception.

RoadMap: A Light-Weight Semantic Map for Visual Localization towards Autonomous Driving

  • Authors: Tong Qin, Yuxin Zheng, Tongqing Chen, Yilun Chen, Qing Su
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.02527
  • Pdf link: https://arxiv.org/pdf/2106.02527
  • Abstract
    Accurate localization is of crucial importance for autonomous driving tasks. Nowadays, we have seen a lot of sensor-rich vehicles (e.g. Robo-taxi) driving on the street autonomously, which rely on high-accuracy sensors (e.g. Lidar and RTK GPS) and high-resolution maps. However, low-cost production cars cannot afford such high expenses on sensors and maps. How to reduce costs? How do sensor-rich vehicles benefit low-cost cars? In this paper, we propose a light-weight localization solution, which relies on low-cost cameras and compact visual semantic maps. The map is easily produced and updated by sensor-rich vehicles in a crowd-sourced way. Specifically, the map consists of several semantic elements, such as lane line, crosswalk, ground sign, and stop line on the road surface. We introduce the whole framework of on-vehicle mapping, on-cloud maintenance, and user-end localization. The map data is collected and preprocessed on vehicles. Then, the crowd-sourced data is uploaded to a cloud server. The mass data from multiple vehicles are merged on the cloud so that the semantic map is updated in time. Finally, the semantic map is compressed and distributed to production cars, which use this map for localization. We validate the performance of the proposed map in real-world experiments and compare it against other algorithms. The average size of the semantic map is $36$ kb/km. We highlight that this framework is a reliable and practical localization solution for autonomous driving.

Flying with Cartographer: Adapting the Cartographer 3D Graph SLAM Stack for UAV Navigation

  • Authors: Juraj Orsulić, Robert Milijas, Ana Batinovic, Lovro Markovic, Antun Ivanovic, Stjepan Bogdan
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.02535
  • Pdf link: https://arxiv.org/pdf/2106.02535
  • Abstract
    This paper describes an application of the Cartographer graph SLAM stack as a pose sensor in a UAV feedback control loop, with certain application-specific changes in the SLAM stack such as smoothing of the optimized pose. Pose estimation is performed by fusing 3D LiDAR/IMU-based proprioception with GPS position measurements by means of pose graph optimisation. Moreover, partial environment maps built from the LiDAR data (submaps) within the Cartographer SLAM stack are marshalled into OctoMap, an Octree-based voxel map implementation. The OctoMap is further used for navigation tasks such as path planning and obstacle avoidance.

Keyword: loop detection

There is no result

Keyword: autonomous driving

RoadMap: A Light-Weight Semantic Map for Visual Localization towards Autonomous Driving

  • Authors: Tong Qin, Yuxin Zheng, Tongqing Chen, Yilun Chen, Qing Su
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.02527
  • Pdf link: https://arxiv.org/pdf/2106.02527
  • Abstract
    Accurate localization is of crucial importance for autonomous driving tasks. Nowadays, we have seen a lot of sensor-rich vehicles (e.g. Robo-taxis) driving on the street autonomously, which rely on highly accurate sensors (e.g. Lidar and RTK GPS) and high-resolution maps. However, low-cost production cars cannot afford such high expenses on sensors and maps. How can costs be reduced? How do sensor-rich vehicles benefit low-cost cars? In this paper, we propose a light-weight localization solution that relies on low-cost cameras and compact visual semantic maps. The map is easily produced and updated by sensor-rich vehicles in a crowd-sourced way. Specifically, the map consists of several semantic elements, such as lane lines, crosswalks, ground signs, and stop lines on the road surface. We introduce the whole framework of on-vehicle mapping, on-cloud maintenance, and user-end localization. The map data is collected and preprocessed on vehicles. Then, the crowd-sourced data is uploaded to a cloud server. The mass data from multiple vehicles are merged on the cloud so that the semantic map is updated in time. Finally, the semantic map is compressed and distributed to production cars, which use this map for localization. We validate the performance of the proposed map in real-world experiments and compare it against other algorithms. The average size of the semantic map is $36$ kb/km. We highlight that this framework is a reliable and practical localization solution for autonomous driving.

New submissions for Thu, 25 Nov 21

Keyword: SLAM

Autonomous bot with ML-based reactive navigation for indoor environment

  • Authors: Yash Srivastava, Saumya Singh, S.P. Syed Ibrahim
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.12542
  • Pdf link: https://arxiv.org/pdf/2111.12542
  • Abstract
    Local or reactive navigation is essential for autonomous mobile robots which operate in an indoor environment. Techniques such as SLAM and computer vision require significant computational power, which increases cost. Similarly, using rudimentary methods makes the robot susceptible to inconsistent behavior. This paper aims to develop a robot that balances cost and accuracy by using machine learning to predict the best obstacle avoidance move based on distance inputs from four ultrasonic sensors that are strategically mounted on the front, front-left, front-right, and back of the robot. The underlying hardware consists of an Arduino Uno and a Raspberry Pi 3B. The machine learning model is first trained on the data collected by the robot. Then the Arduino continuously polls the sensors and calculates the distance values, and in case of a critical need for avoidance, a suitable maneuver is made by the Arduino. In other scenarios, sensor data is sent to the Raspberry Pi using a USB connection and the machine learning model generates the best move for navigation, which is sent to the Arduino for driving the motors accordingly. The system is mounted on a 2-WD robot chassis and tested in a cluttered indoor setting with impressive results.
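
The abstract does not spell out which model is used, so below is a minimal, hypothetical sketch of the idea: a small classifier that maps the four ultrasonic distance readings to an avoidance maneuver. The sensor layout follows the abstract, while the training data and maneuver labels are made up for illustration.

```python
# Hypothetical sketch of distance-based maneuver prediction (not the paper's model).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Columns: [front, front_left, front_right, back] distances in cm (made-up data).
X = np.array([
    [120, 90, 95, 100],   # clear ahead
    [25, 80, 90, 100],    # obstacle ahead, right side clearer
    [30, 20, 85, 100],    # obstacles ahead and to the left
    [30, 85, 20, 100],    # obstacles ahead and to the right
])
y = ["forward", "turn_right", "turn_right", "turn_left"]  # assumed maneuver labels

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[28, 78, 18, 100]]))  # predicted maneuver for a new reading
```

In the setup described, a prediction step like this would run on the Raspberry Pi, while the Arduino keeps handling the time-critical emergency maneuvers.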

Automatic Mapping with Obstacle Identification for Indoor Human Mobility Assessment

  • Authors: V. Ayala-Alfaro, J. A. Vilchis-Mar, F. E. Correa-Tome, J. P. Ramirez-Paredes
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.12690
  • Pdf link: https://arxiv.org/pdf/2111.12690
  • Abstract
    We propose a framework that allows a mobile robot to build a map of an indoor scenario, identifying and highlighting objects that may be considered a hindrance to people with limited mobility. The map is built by combining recent developments in monocular SLAM with information from inertial sensors of the robot platform, resulting in a metric point cloud that can be further processed to obtain a mesh. The images from the monocular camera are simultaneously analyzed with an object recognition neural network, tuned to detect a particular class of targets. This information is then processed and incorporated on the metric map, resulting in a detailed survey of the locations and bounding volumes of the objects of interest. The result can be used to inform policy makers and users with limited mobility of the hazards present in a particular indoor location. Our initial tests were performed using a micro-UAV and will be extended to other robotic platforms.

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

MonoPLFlowNet: Permutohedral Lattice FlowNet for Real-Scale 3D Scene Flow Estimation with Monocular Images

  • Authors: Runfa Li, Truong Nguyen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.12325
  • Pdf link: https://arxiv.org/pdf/2111.12325
  • Abstract
    Real-scale scene flow estimation has become increasingly important for 3D computer vision. Some works successfully estimate real-scale 3D scene flow with LiDAR. However, these ubiquitous and expensive sensors are still unlikely to be equipped widely for real application. Other works use monocular images to estimate scene flow, but their scene flow estimations are normalized with scale ambiguity, where additional depth or point cloud ground truth are required to recover the real scale. Even though they perform well in 2D, these works do not provide accurate and reliable 3D estimates. We present a deep learning architecture on permutohedral lattice - MonoPLFlowNet. Different from all previous works, our MonoPLFlowNet is the first work where only two consecutive monocular images are used as input, while both depth and 3D scene flow are estimated in real scale. Our real-scale scene flow estimation outperforms all state-of-the-art monocular-image based works recovered to real scale by ground truth, and is comparable to LiDAR approaches. As a by-product, our real-scale depth estimation also outperforms other state-of-the-art works.

Fault-Tolerant Perception for Automated Driving: A Lightweight Monitoring Approach

  • Authors: Cornelius Buerkle, Florian Geissler, Michael Paulitsch, Kay-Ulrich Scholl
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.12360
  • Pdf link: https://arxiv.org/pdf/2111.12360
  • Abstract
    While the most visible part of the safety verification process of automated vehicles concerns the planning and control system, it is often overlooked that safety of the latter crucially depends on the fault-tolerance of the preceding environment perception. Modern perception systems feature complex and often machine-learning-based components with various failure modes that can jeopardize the overall safety. At the same time, a verification by for example redundant execution is not always feasible due to resource constraints. In this paper, we address the need for feasible and efficient perception monitors and propose a lightweight approach that helps to protect the integrity of the perception system while keeping the additional compute overhead minimal. In contrast to existing solutions, the monitor is realized by a well-balanced combination of sensor checks -- here using LiDAR information -- and plausibility checks on the object motion history. It is designed to detect relevant errors in the distance and velocity of objects in the environment of the automated vehicle. In conjunction with an appropriate planning system, such a monitor can help to make safe automated driving feasible.
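
The plausibility-check side of such a monitor can be illustrated with a toy example: compare an object's reported velocity against the velocity implied by its recent position history and flag large disagreements. This is only a sketch of the general idea with made-up thresholds, not the monitor proposed in the paper.

```python
# Toy plausibility check: flag objects whose reported velocity disagrees with
# the velocity implied by their recent position history. Thresholds are assumptions.
def implied_velocity(positions, dt):
    """Finite-difference velocity from the last two (x, y) positions."""
    (x0, y0), (x1, y1) = positions[-2], positions[-1]
    return ((x1 - x0) / dt, (y1 - y0) / dt)

def velocity_plausible(reported_v, positions, dt, tol=2.0):
    """Return True if the reported velocity is within `tol` m/s of the implied one."""
    vx, vy = implied_velocity(positions, dt)
    rx, ry = reported_v
    return abs(rx - vx) <= tol and abs(ry - vy) <= tol

history = [(0.0, 0.0), (1.0, 0.1), (2.1, 0.2)]            # positions at 0.1 s intervals
print(velocity_plausible((11.0, 1.0), history, dt=0.1))   # True: ~11 m/s in x
print(velocity_plausible((30.0, 0.0), history, dt=0.1))   # False: implausible jump
```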

SM3D: Simultaneous Monocular Mapping and 3D Detection

  • Authors: Runfa Li, Truong Nguyen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.12643
  • Pdf link: https://arxiv.org/pdf/2111.12643
  • Abstract
    Mapping and 3D detection are two major issues in vision-based robotics and self-driving. While previous works only focus on each task separately, we present an innovative and efficient multi-task deep learning framework (SM3D) for Simultaneous Mapping and 3D Detection by bridging the gap with robust depth estimation and "Pseudo-LiDAR" point cloud for the first time. The Mapping module takes consecutive monocular frames to generate depth and pose estimation. In the 3D Detection module, the depth estimation is projected into 3D space to generate a "Pseudo-LiDAR" point cloud, where a LiDAR-based 3D detector can be leveraged on the point cloud for vehicular 3D detection and localization. By end-to-end training of both modules, the proposed mapping and 3D detection method outperforms the state-of-the-art baseline by 10.0% and 13.2% in accuracy, respectively. While achieving better accuracy, our monocular multi-task SM3D is more than 2 times faster than a pure stereo 3D detector, and 18.3% faster than using the two modules separately.
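
The "Pseudo-LiDAR" step itself is a standard pinhole back-projection of a depth map into a 3D point cloud. Below is a minimal numpy sketch of that step only; the camera intrinsics and depth values are placeholders, and the paper's depth network is of course not included.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project an HxW depth map (meters) into an (H*W, 3) point cloud
    in the camera frame using the pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

depth = np.full((4, 6), 10.0)                        # toy 4x6 depth map, 10 m everywhere
pts = depth_to_pseudo_lidar(depth, fx=500.0, fy=500.0, cx=3.0, cy=2.0)
print(pts.shape)                                     # (24, 3)
```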

Keyword: loop detection

There is no result

Keyword: autonomous driving

There is no result

Keyword: mapping

Panoptic Segmentation Meets Remote Sensing

  • Authors: Osmar Luiz Ferreira de Carvalho, Osmar Abílio de Carvalho Júnior, Cristiano Rosa e Silva, Anesmar Olino de Albuquerque, Nickolas Castro Santana, Dibio Leandro Borges, Roberto Arnaldo Trancoso Gomes, Renato Fontes Guimarães
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB)
  • Arxiv link: https://arxiv.org/abs/2111.12126
  • Pdf link: https://arxiv.org/pdf/2111.12126
  • Abstract
    Panoptic segmentation combines instance and semantic predictions, allowing the detection of "things" and "stuff" simultaneously. Effectively approaching panoptic segmentation in remotely sensed data can be auspicious in many challenging problems since it allows continuous mapping and specific target counting. Several difficulties have prevented the growth of this task in remote sensing: (a) most algorithms are designed for traditional images, (b) image labelling must encompass "things" and "stuff" classes, and (c) the annotation format is complex. Thus, aiming to solve and increase the operability of panoptic segmentation in remote sensing, this study has five objectives: (1) create a novel data preparation pipeline for panoptic segmentation, (2) propose annotation conversion software to generate panoptic annotations, (3) propose a novel dataset on urban areas, (4) modify Detectron2 for the task, and (5) evaluate difficulties of this task in the urban setting. We used an aerial image with a 0.24-meter spatial resolution considering 14 classes. Our pipeline considers three image inputs, and the proposed software uses point shapefiles for creating samples in the COCO format. Our study generated 3,400 samples with 512x512 pixel dimensions. We used the Panoptic-FPN with two backbones (ResNet-50 and ResNet-101), and the model evaluation considered semantic, instance, and panoptic metrics. We obtained 93.9, 47.7, and 64.9 for the mean IoU, box AP, and PQ. Our study presents the first effective pipeline for panoptic segmentation and an extensive database for other researchers to use and deal with other data or related problems requiring a thorough scene understanding.

Auto robust relative radiometric normalization via latent change noise modelling

  • Authors: Shiqi Liu, Lu Wang, Jie Lian, Ting chen, Cong Liu, Xuchen Zhan, Jintao Lu, Jie Liu, Ting Wang, Dong Geng, Hongwei Duan, Yuze Tian
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.12406
  • Pdf link: https://arxiv.org/pdf/2111.12406
  • Abstract
    Relative radiometric normalization (RRN) of different satellite images of the same terrain is necessary for change detection, object classification/segmentation, and map-making tasks. However, traditional RRN models are not robust because they are disturbed by object changes, while RRN models that explicitly account for object change cannot robustly obtain the no-change set. This paper proposes automatic, robust relative radiometric normalization methods via latent change noise modelling. They exploit the prior knowledge that no-change points exhibit small-scale noise under relative radiometric normalization whereas change points exhibit large-scale radiometric noise after normalization, and combine this with the stochastic expectation maximization method to quickly and robustly extract the no-change set and learn the relative radiometric normalization mapping functions. This grounds the model in probability theory and mathematical deduction. Specifically, when histogram matching is selected as the relative radiometric normalization learning scheme and integrated with a mixture of Gaussian noise (HM-RRN-MoG), the HM-RRN-MoG model achieves the best performance. The model is robust against clouds, fog, and changes. The method also naturally yields a robust evaluation indicator for RRN, namely the root mean square error on the no-change set. We apply the HM-RRN-MoG model to a subsequent vegetation/water change detection task, where it reduces the radiometric contrast and NDVI/NDWI differences on the no-change set and generates consistent and comparable results. We further use the no-change set in a building change detection task, efficiently reducing pseudo-changes and boosting precision.
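
When histogram matching is chosen as the normalization scheme, the core operation maps the subject image's intensity distribution onto the reference image's. The following is a minimal single-band sketch of plain histogram matching; the no-change-set extraction and the mixture-of-Gaussians noise modelling from the paper are omitted.

```python
import numpy as np

def histogram_match(src, ref):
    """Map `src` values so their empirical CDF matches that of `ref` (single band)."""
    src_flat = src.ravel()
    src_sorted = np.sort(src_flat)
    ref_sorted = np.sort(ref.ravel())
    # Empirical CDF position of each source pixel, then inverse CDF of the reference.
    cdf = np.searchsorted(src_sorted, src_flat, side="right") / src_flat.size
    matched = np.interp(cdf, np.linspace(0, 1, ref_sorted.size), ref_sorted)
    return matched.reshape(src.shape)

rng = np.random.default_rng(0)
src = rng.normal(100, 20, (64, 64))
ref = rng.normal(150, 10, (64, 64))
out = histogram_match(src, ref)
print(round(out.mean(), 1), round(out.std(), 1))   # values track the reference statistics
```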

ViCE: Self-Supervised Visual Concept Embeddings as Contextual and Pixel Appearance Invariant Semantic Representations

  • Authors: Robin Karlsson, Tomoki Hayashi, Keisuke Fujii, Alexander Carballo, Kento Ohtani, Kazuya Takeda
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2111.12460
  • Pdf link: https://arxiv.org/pdf/2111.12460
  • Abstract
    This work presents a self-supervised method to learn dense semantically rich visual concept embeddings for images inspired by methods for learning word embeddings in NLP. Our method improves on prior work by generating more expressive embeddings and by being applicable for high-resolution images. Viewing the generation of natural images as a stochastic process where a set of latent visual concepts give rise to observable pixel appearances, our method is formulated to learn the inverse mapping from pixels to concepts. Our method greatly improves the effectiveness of self-supervised learning for dense embedding maps by introducing superpixelization as a natural hierarchical step up from pixels to a small set of visually coherent regions. Additional contributions are regional contextual masking with nonuniform shapes matching visually coherent patches and complexity-based view sampling inspired by masked language models. The enhanced expressiveness of our dense embeddings is demonstrated by significantly improving the state-of-the-art representation quality benchmarks on COCO (+12.94 mIoU, +87.6%) and Cityscapes (+16.52 mIoU, +134.2%). Results show favorable scaling and domain generalization properties not demonstrated by prior work.

SM3D: Simultaneous Monocular Mapping and 3D Detection

  • Authors: Runfa Li, Truong Nguyen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.12643
  • Pdf link: https://arxiv.org/pdf/2111.12643
  • Abstract
    Mapping and 3D detection are two major issues in vision-based robotics and self-driving. While previous works only focus on each task separately, we present an innovative and efficient multi-task deep learning framework (SM3D) for Simultaneous Mapping and 3D Detection by bridging the gap with robust depth estimation and "Pseudo-LiDAR" point cloud for the first time. The Mapping module takes consecutive monocular frames to generate depth and pose estimation. In the 3D Detection module, the depth estimation is projected into 3D space to generate a "Pseudo-LiDAR" point cloud, where a LiDAR-based 3D detector can be leveraged on the point cloud for vehicular 3D detection and localization. By end-to-end training of both modules, the proposed mapping and 3D detection method outperforms the state-of-the-art baseline by 10.0% and 13.2% in accuracy, respectively. While achieving better accuracy, our monocular multi-task SM3D is more than 2 times faster than a pure stereo 3D detector, and 18.3% faster than using the two modules separately.

Keyword: localization

Three-Way Deep Neural Network for Radio Frequency Map Generation and Source Localization

  • Authors: Kuldeep S. Gill, Son Nguyen, Myo M. Thein, Alexander M. Wyglinski
  • Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2111.12175
  • Pdf link: https://arxiv.org/pdf/2111.12175
  • Abstract
    In this paper, we present a Generative Adversarial Network (GAN) machine learning model to interpolate irregularly distributed measurements across the spatial domain to construct a smooth radio frequency map (RFMap) and then perform localization using a deep neural network. Monitoring wireless spectrum over spatial, temporal, and frequency domains will become a critical feature in facilitating dynamic spectrum access (DSA) in beyond-5G and 6G communication technologies. Localization, wireless signal detection, and spectrum policy-making are several of the applications where distributed spectrum sensing will play a significant role. Detection and positioning of wireless emitters is a very challenging task in a large spectral and spatial area. In order to construct a smooth RFMap database, a large number of measurements are required which can be very expensive and time consuming. One approach to help realize these systems is to collect finite localized measurements across a given area and then interpolate the measurement values to construct the database. Current methods in the literature employ channel modeling to construct the radio frequency map, which lacks the granularity for accurate localization whereas our proposed approach reconstructs a new generalized RFMap. Localization results are presented and compared with conventional channel models.
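
The underlying task is to turn sparse, scattered signal-strength measurements into a dense map over the area. As a simple classical stand-in for the learned GAN interpolator, the sketch below fills a grid with inverse-distance-weighted interpolation; the sensor locations and RSS values are made up.

```python
import numpy as np

def idw_rfmap(xy, rss, grid_x, grid_y, power=2.0, eps=1e-9):
    """Inverse-distance-weighted interpolation of scattered RSS samples onto a grid.
    A classical stand-in for a learned interpolator, not the paper's GAN."""
    gx, gy = np.meshgrid(grid_x, grid_y)
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)               # (G, 2)
    d = np.linalg.norm(grid[:, None, :] - xy[None, :, :], axis=2)   # (G, N)
    w = 1.0 / (d ** power + eps)
    return (w @ rss / w.sum(axis=1)).reshape(gx.shape)

xy = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])   # sensor locations (m)
rss = np.array([-40.0, -70.0, -65.0])                   # measured RSS (dBm)
rfmap = idw_rfmap(xy, rss, np.linspace(0, 10, 11), np.linspace(0, 10, 11))
print(rfmap.shape)                                      # (11, 11)
```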

MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing

  • Authors: Jiashuo Yu, Ying Cheng, Rui-Wei Zhao, Rui Feng, Yuejie Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.12374
  • Pdf link: https://arxiv.org/pdf/2111.12374
  • Abstract
    Recognizing and localizing events in videos is a fundamental task for video understanding. Since events may occur in auditory and visual modalities, multimodal detailed perception is essential for complete scene comprehension. Most previous works attempted to analyze videos from a holistic perspective. However, they do not consider semantic information at multiple scales, which makes the model difficult to localize events in various lengths. In this paper, we present a Multimodal Pyramid Attentional Network (MM-Pyramid) that captures and integrates multi-level temporal features for audio-visual event localization and audio-visual video parsing. Specifically, we first propose the attentive feature pyramid module. This module captures temporal pyramid features via several stacking pyramid units, each of them is composed of a fixed-size attention block and dilated convolution block. We also design an adaptive semantic fusion module, which leverages a unit-level attention block and a selective fusion block to integrate pyramid features interactively. Extensive experiments on audio-visual event localization and weakly-supervised audio-visual video parsing tasks verify the effectiveness of our approach.

Background-Click Supervision for Temporal Action Localization

  • Authors: Le Yang, Junwei Han, Tao Zhao, Tianwei Lin, Dingwen Zhang, Jianxin Chen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.12449
  • Pdf link: https://arxiv.org/pdf/2111.12449
  • Abstract
    Weakly supervised temporal action localization aims at learning the instance-level action pattern from the video-level labels, where a significant challenge is action-context confusion. To overcome this challenge, one recent work builds an action-click supervision framework. It requires similar annotation costs but can steadily improve the localization performance when compared to the conventional weakly supervised methods. In this paper, by revealing that the performance bottleneck of the existing approaches mainly comes from the background errors, we find that a stronger action localizer can be trained with labels on the background video frames rather than those on the action frames. To this end, we convert the action-click supervision to the background-click supervision and develop a novel method, called BackTAL. Specifically, BackTAL implements two-fold modeling on the background video frames, i.e. the position modeling and the feature modeling. In position modeling, we not only conduct supervised learning on the annotated video frames but also design a score separation module to enlarge the score differences between the potential action frames and backgrounds. In feature modeling, we propose an affinity module to measure frame-specific similarities among neighboring frames and dynamically attend to informative neighbors when calculating temporal convolution. Extensive experiments on three benchmarks are conducted, which demonstrate the high performance of the established BackTAL and the rationality of the proposed background-click supervision. Code is available at https://github.com/VividLe/BackTAL.

FLACOCO: Fault Localization for Java based on Industry-grade Coverage

  • Authors: André Silva, Matias Martinez, Benjamin Danglot, Davide Ginelli, Martin Monperrus
  • Subjects: Software Engineering (cs.SE)
  • Arxiv link: https://arxiv.org/abs/2111.12513
  • Pdf link: https://arxiv.org/pdf/2111.12513
  • Abstract
    Fault localization is an essential step in the debugging process. Spectrum-Based Fault Localization (SBFL) is a popular fault localization family of techniques, utilizing code-coverage to predict suspicious lines of code. In this paper, we present FLACOCO, a new fault localization tool for Java. The key novelty of FLACOCO is that it is built on top of one of the most used and most reliable coverage libraries for Java, JaCoCo. FLACOCO is made available through a well-designed command-line interface and Java API and supports all Java versions. We validate FLACOCO on two use-cases from the automatic program repair domain by reproducing previous scientific experiments. We find it is capable of effectively replacing the state-of-the-art FL library. Overall, we hope that FLACOCO will help research in fault localization as well as industry adoption thanks to being founded on industry-grade code coverage. An introductory video is available at https://youtu.be/RFRyvQuwRYA
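
The core of any SBFL technique is turning per-test coverage into a per-line suspiciousness score. Below is a minimal sketch using the Ochiai formula, a common SBFL metric; it illustrates the general computation, not FLACOCO's specific implementation or defaults.

```python
import math

def ochiai(covered_by, failing):
    """covered_by: line -> set of tests covering it; failing: set of failing tests.
    Returns line -> Ochiai suspiciousness."""
    total_fail = len(failing)
    scores = {}
    for line, tests in covered_by.items():
        ef = len(tests & failing)            # failing tests that cover the line
        ep = len(tests - failing)            # passing tests that cover the line
        denom = math.sqrt(total_fail * (ef + ep))
        scores[line] = ef / denom if denom else 0.0
    return scores

coverage = {"Foo.java:12": {"t1", "t2"}, "Foo.java:20": {"t2", "t3"}}
print(ochiai(coverage, failing={"t2"}))   # both lines equally suspicious in this toy case
```

Lines that are covered mostly by failing tests and rarely by passing ones score highest and would be reported first to the developer.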

SM3D: Simultaneous Monocular Mapping and 3D Detection

  • Authors: Runfa Li, Truong Nguyen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.12643
  • Pdf link: https://arxiv.org/pdf/2111.12643
  • Abstract
    Mapping and 3D detection are two major issues in vision-based robotics and self-driving. While previous works only focus on each task separately, we present an innovative and efficient multi-task deep learning framework (SM3D) for Simultaneous Mapping and 3D Detection by bridging the gap with robust depth estimation and "Pseudo-LiDAR" point cloud for the first time. The Mapping module takes consecutive monocular frames to generate depth and pose estimation. In the 3D Detection module, the depth estimation is projected into 3D space to generate a "Pseudo-LiDAR" point cloud, where a LiDAR-based 3D detector can be leveraged on the point cloud for vehicular 3D detection and localization. By end-to-end training of both modules, the proposed mapping and 3D detection method outperforms the state-of-the-art baseline by 10.0% and 13.2% in accuracy, respectively. While achieving better accuracy, our monocular multi-task SM3D is more than 2 times faster than a pure stereo 3D detector, and 18.3% faster than using the two modules separately.

New submissions for Wed, 10 Nov 21

Keyword: SLAM

There is no result

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

Frustum Fusion: Pseudo-LiDAR and LiDAR Fusion for 3D Detection

  • Authors: Farzin Negahbani, Onur Berk Töre, Fatma Güney, Baris Akgun
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.04780
  • Pdf link: https://arxiv.org/pdf/2111.04780
  • Abstract
    Most autonomous vehicles are equipped with LiDAR sensors and stereo cameras. The former is very accurate but generates sparse data, whereas the latter is dense and has rich texture and color information, but it is difficult to extract robust 3D representations from. In this paper, we propose a novel data fusion algorithm to combine accurate point clouds with dense but less accurate point clouds obtained from stereo pairs. We develop a framework to integrate this algorithm into various 3D object detection methods. Our framework starts with 2D detections from both of the RGB images, calculates frustums and their intersection, creates Pseudo-LiDAR data from the stereo images, and fills in the parts of the intersection region where the LiDAR data is lacking with the dense Pseudo-LiDAR points. We train multiple 3D object detection methods and show that our fusion strategy consistently improves the performance of detectors.

LiMoSeg: Real-time Bird's Eye View based LiDAR Motion Segmentation

  • Authors: Sambit Mohapatra, Mona Hodaei, Senthil Yogamani, Stefan Milz, Patrick Maeder, Heinrich Gotzig, Martin Simon, Hazem Rashed
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.04875
  • Pdf link: https://arxiv.org/pdf/2111.04875
  • Abstract
    Moving object detection and segmentation is an essential task in the Autonomous Driving pipeline. Detecting and isolating static and moving components of a vehicle's surroundings are particularly crucial in path planning and localization tasks. This paper proposes a novel real-time architecture for motion segmentation of Light Detection and Ranging (LiDAR) data. We use two successive scans of LiDAR data in 2D Bird's Eye View (BEV) representation to perform pixel-wise classification as static or moving. Furthermore, we propose a novel data augmentation technique to reduce the significant class imbalance between static and moving objects. We achieve this by artificially synthesizing moving objects by cutting and pasting static vehicles. We demonstrate a low latency of 8 ms on a commonly used automotive embedded platform, namely Nvidia Jetson Xavier. To the best of our knowledge, this is the first work directly performing motion segmentation in LiDAR BEV space. We provide quantitative results on the challenging SemanticKITTI dataset, and qualitative results are provided in https://youtu.be/2aJ-cL8b0LI.
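
Rasterizing a LiDAR scan into a 2D bird's-eye-view grid is the input representation this method builds on. Below is a minimal numpy sketch of such a rasterization; the grid extents, resolution, and max-height feature are arbitrary choices for illustration, not the paper's configuration.

```python
import numpy as np

def lidar_to_bev(points, x_range=(0.0, 50.0), y_range=(-25.0, 25.0), res=0.5):
    """Rasterize an (N, 3) point cloud into a BEV grid holding the max height per cell."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    x, y, z = x[keep], y[keep], z[keep]
    cols = ((x - x_range[0]) / res).astype(int)
    rows = ((y - y_range[0]) / res).astype(int)
    h = int((x_range[1] - x_range[0]) / res)
    w = int((y_range[1] - y_range[0]) / res)
    bev = np.full((w, h), -np.inf)          # cells with no points stay at -inf
    np.maximum.at(bev, (rows, cols), z)     # max z per cell
    return bev

pts = np.random.default_rng(0).uniform([0, -25, -2], [50, 25, 2], size=(1000, 3))
print(lidar_to_bev(pts).shape)              # (100, 100)
```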

Keyword: loop detection

There is no result

Keyword: autonomous driving

LiMoSeg: Real-time Bird's Eye View based LiDAR Motion Segmentation

  • Authors: Sambit Mohapatra, Mona Hodaei, Senthil Yogamani, Stefan Milz, Patrick Maeder, Heinrich Gotzig, Martin Simon, Hazem Rashed
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.04875
  • Pdf link: https://arxiv.org/pdf/2111.04875
  • Abstract
    Moving object detection and segmentation is an essential task in the Autonomous Driving pipeline. Detecting and isolating static and moving components of a vehicle's surroundings are particularly crucial in path planning and localization tasks. This paper proposes a novel real-time architecture for motion segmentation of Light Detection and Ranging (LiDAR) data. We use two successive scans of LiDAR data in 2D Bird's Eye View (BEV) representation to perform pixel-wise classification as static or moving. Furthermore, we propose a novel data augmentation technique to reduce the significant class imbalance between static and moving objects. We achieve this by artificially synthesizing moving objects by cutting and pasting static vehicles. We demonstrate a low latency of 8 ms on a commonly used automotive embedded platform, namely Nvidia Jetson Xavier. To the best of our knowledge, this is the first work directly performing motion segmentation in LiDAR BEV space. We provide quantitative results on the challenging SemanticKITTI dataset, and qualitative results are provided in https://youtu.be/2aJ-cL8b0LI.

Convolutional Neural Networks with Radio-Frequency Spintronic Nano-Devices

  • Authors: Nathan Leroux, Arnaud De Riz, Dédalo Sanz-Hernández, Danijela Marković, Alice Mizrahi, Julie Grollier
  • Subjects: Emerging Technologies (cs.ET); Disordered Systems and Neural Networks (cond-mat.dis-nn); Applied Physics (physics.app-ph)
  • Arxiv link: https://arxiv.org/abs/2111.04961
  • Pdf link: https://arxiv.org/pdf/2111.04961
  • Abstract
    Convolutional neural networks are state-of-the-art and ubiquitous in modern signal processing and machine vision. Nowadays, hardware solutions based on emerging nanodevices are designed to reduce the power consumption of these networks. Spintronics devices are promising for information processing because of the various neural and synaptic functionalities they offer. However, due to their low OFF/ON ratio, performing all the multiplications required for convolutions in a single step with a crossbar array of spintronic memories would cause sneak-path currents. Here we present an architecture where synaptic communications have a frequency selectivity that prevents crosstalk caused by sneak-path currents. We first demonstrate how a chain of spintronic resonators can function as synapses and make convolutions by sequentially rectifying radio-frequency signals encoding consecutive sets of inputs. We show that a parallel implementation is possible with multiple chains of spintronic resonators to avoid storing intermediate computational steps in memory. We propose two different spatial arrangements for these chains. For each of them, we explain how to tune many artificial synapses simultaneously, exploiting the synaptic weight sharing specific to convolutions. We show how information can be transmitted between convolutional layers by using spintronic oscillators as artificial microwave neurons. Finally, we simulate a network of these radio-frequency resonators and spintronic oscillators to solve the MNIST handwritten digits dataset, and obtain results comparable to software convolutional neural networks. Since it can run convolutional neural networks fully in parallel in a single step with nano devices, the architecture proposed in this paper is promising for embedded applications requiring machine vision, such as autonomous driving.

Does Thermal data make the detection systems more reliable?

  • Authors: Shruthi Gowda, Bahram Zonooz, Elahe Arani
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2111.05191
  • Pdf link: https://arxiv.org/pdf/2111.05191
  • Abstract
    Deep learning-based detection networks have made remarkable progress in autonomous driving systems (ADS). ADS should have reliable performance across a variety of ambient lighting and adverse weather conditions. However, luminance degradation and visual obstructions (such as glare, fog) result in poor quality images by the visual camera which leads to performance decline. To overcome these challenges, we explore the idea of leveraging a different data modality that is disparate yet complementary to the visual data. We propose a comprehensive detection system based on a multimodal-collaborative framework that learns from both RGB (from visual cameras) and thermal (from Infrared cameras) data. This framework trains two networks collaboratively and provides flexibility in learning optimal features of its own modality while also incorporating the complementary knowledge of the other. Our extensive empirical results show that while the improvement in accuracy is nominal, the value lies in challenging and extremely difficult edge cases which is crucial in safety-critical applications such as AD. We provide a holistic view of both merits and limitations of using a thermal imaging system in detection.

Keyword: mapping

Survey of Deep Learning Methods for Inverse Problems

  • Authors: Shima Kamyab, Zihreh Azimifar, Rasool Sabzi, Paul Fieguth
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.04731
  • Pdf link: https://arxiv.org/pdf/2111.04731
  • Abstract
    In this paper we investigate a variety of deep learning strategies for solving inverse problems. We classify existing deep learning solutions for inverse problems into three categories of Direct Mapping, Data Consistency Optimizer, and Deep Regularizer. We choose a sample of each inverse problem type, so as to compare the robustness of the three categories, and report a statistical analysis of their differences. We perform extensive experiments on the classic problem of linear regression and three well-known inverse problems in computer vision, namely image denoising, 3D human face inverse rendering, and object tracking, selected as representative prototypes for each class of inverse problems. The overall results and the statistical analyses show that the solution categories have a robustness behaviour dependent on the type of inverse problem domain, and specifically dependent on whether or not the problem includes measurement outliers. Based on our experimental results, we conclude by proposing the most robust solution category for each inverse problem class.

Efficient estimates of optimal transport via low-dimensional embeddings

  • Authors: Patric M. Fulop, Vincent Danos
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2111.04838
  • Pdf link: https://arxiv.org/pdf/2111.04838
  • Abstract
    Optimal transport distances (OT) have been widely used in recent work in Machine Learning as ways to compare probability distributions. These are costly to compute when the data lives in high dimension. Recent work by Paty et al., 2019, aims specifically at reducing this cost by computing OT using low-rank projections of the data (seen as discrete measures). We extend this approach and show that one can approximate OT distances by using more general families of maps provided they are 1-Lipschitz. The best estimate is obtained by maximising OT over the given family. As OT calculations are done after mapping data to a lower dimensional space, our method scales well with the original data dimension. We demonstrate the idea with neural networks.
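
The simplest instance of this idea is a unit-norm linear projection, which is 1-Lipschitz: the 1D Wasserstein distance of the projected samples lower-bounds the full distance, and the bound is tightened by maximising over a family of projections. The following is a small sketch of that special case only, not the authors' general method.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def projected_w1(x, y, direction):
    """1D Wasserstein-1 distance of the two samples projected onto a unit vector.
    For a 1-Lipschitz projection this lower-bounds the full W1 distance."""
    d = direction / np.linalg.norm(direction)
    return wasserstein_distance(x @ d, y @ d)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, (500, 10))
y = rng.normal(0.5, 1.0, (500, 10))          # shifted along every coordinate

# Maximise the lower bound over a few random unit directions.
dirs = rng.normal(size=(32, 10))
print(max(projected_w1(x, y, d) for d in dirs))
```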

Graph-Based Depth Denoising & Dequantization for Point Cloud Enhancement

  • Authors: Xue Zhang, Gene Cheung, Jiahao Pang, Yash Sanghvi, Abhiram Gnanasambandam, Stanley H. Chan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2111.04946
  • Pdf link: https://arxiv.org/pdf/2111.04946
  • Abstract
    A 3D point cloud is typically constructed from depth measurements acquired by sensors at one or more viewpoints. The measurements suffer from both quantization and noise corruption. To improve quality, previous works denoise a point cloud *a posteriori* after projecting the imperfect depth data onto 3D space. Instead, we enhance depth measurements directly on the sensed images *a priori*, before synthesizing a 3D point cloud. By enhancing near the physical sensing process, we tailor our optimization to our depth formation model before subsequent processing steps that obscure measurement errors. Specifically, we model depth formation as a combined process of signal-dependent noise addition and non-uniform log-based quantization. The designed model is validated (with parameters fitted) using collected empirical data from an actual depth sensor. To enhance each pixel row in a depth image, we first encode intra-view similarities between available row pixels as edge weights via feature graph learning. We next establish inter-view similarities with another rectified depth image via viewpoint mapping and sparse linear interpolation. This leads to a maximum a posteriori (MAP) graph filtering objective that is convex and differentiable. We optimize the objective efficiently using accelerated gradient descent (AGD), where the optimal step size is approximated via Gershgorin circle theorem (GCT). Experiments show that our method significantly outperformed recent point cloud denoising schemes and state-of-the-art image denoising schemes, in two established point cloud quality metrics.

A research framework for writing differentiable PDE discretizations in JAX

  • Authors: Antonio Stanziola, Simon R. Arridge, Ben T. Cox, Bradley E. Treeby
  • Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
  • Arxiv link: https://arxiv.org/abs/2111.05218
  • Pdf link: https://arxiv.org/pdf/2111.05218
  • Abstract
    Differentiable simulators are an emerging concept with applications in several fields, from reinforcement learning to optimal control. Their distinguishing feature is the ability to calculate analytic gradients with respect to the input parameters. Like neural networks, which are constructed by composing several building blocks called layers, a simulation often requires computing the output of an operator that can itself be decomposed into elementary units chained together. While each layer of a neural network represents a specific discrete operation, the same operator can have multiple representations, depending on the discretization employed and the research question that needs to be addressed. Here, we propose a simple design pattern to construct a library of differentiable operators and discretizations, by representing operators as mappings between families of continuous functions, parametrized by finite vectors. We demonstrate the approach on an acoustic optimization problem, where the Helmholtz equation is discretized using Fourier spectral methods, and differentiability is demonstrated using gradient descent to optimize the speed of sound of an acoustic lens. The proposed framework is open-sourced and available at https://github.com/ucl-bug/jaxdf
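
In the same spirit (though not using the jaxdf API itself), the sketch below discretizes a 1D Laplacian with finite differences in JAX and uses the resulting gradients to fit a scalar physical coefficient. It is a toy illustration of differentiating through a discretized operator, under the stated assumptions.

```python
# Requires the jax package. A toy differentiable discretization, not the jaxdf API.
import jax
import jax.numpy as jnp

def laplacian_1d(u, dx):
    """Second-order finite-difference Laplacian on a 1D grid (boundary set to zero)."""
    lap = (jnp.roll(u, -1) + jnp.roll(u, 1) - 2.0 * u) / dx**2
    return lap.at[0].set(0.0).at[-1].set(0.0)

x = jnp.linspace(0.0, 1.0, 64)
u = jnp.sin(jnp.pi * x)
target = -(jnp.pi ** 2) * u                    # analytic Laplacian of sin(pi * x)

def loss(c):
    # Fit a scalar coefficient c so that c * Lap(u) matches the target field.
    dx = x[1] - x[0]
    return jnp.mean((c * laplacian_1d(u, dx) - target) ** 2)

c = 2.0
for _ in range(100):
    c = c - 0.01 * jax.grad(loss)(c)           # gradients flow through the discretization
print(float(c))                                # converges to roughly 1.0
```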

Keyword: localization

LiMoSeg: Real-time Bird's Eye View based LiDAR Motion Segmentation

  • Authors: Sambit Mohapatra, Mona Hodaei, Senthil Yogamani, Stefan Milz, Patrick Maeder, Heinrich Gotzig, Martin Simon, Hazem Rashed
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.04875
  • Pdf link: https://arxiv.org/pdf/2111.04875
  • Abstract
    Moving object detection and segmentation is an essential task in the Autonomous Driving pipeline. Detecting and isolating static and moving components of a vehicle's surroundings are particularly crucial in path planning and localization tasks. This paper proposes a novel real-time architecture for motion segmentation of Light Detection and Ranging (LiDAR) data. We use two successive scans of LiDAR data in 2D Bird's Eye View (BEV) representation to perform pixel-wise classification as static or moving. Furthermore, we propose a novel data augmentation technique to reduce the significant class imbalance between static and moving objects. We achieve this by artificially synthesizing moving objects by cutting and pasting static vehicles. We demonstrate a low latency of 8 ms on a commonly used automotive embedded platform, namely Nvidia Jetson Xavier. To the best of our knowledge, this is the first work directly performing motion segmentation in LiDAR BEV space. We provide quantitative results on the challenging SemanticKITTI dataset, and qualitative results are provided in https://youtu.be/2aJ-cL8b0LI.

New submissions for Thu, 2 Dec 21

Keyword: SLAM

Research on Event Accumulator Settings for Event-Based SLAM

  • Authors: Kun Xiao, Guohui Wang, Yi Chen, Yongfeng Xie, Hong Li
  • Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2112.00427
  • Pdf link: https://arxiv.org/pdf/2112.00427
  • Abstract
    Event cameras are a new type of sensor that differs from traditional cameras. Each pixel is triggered asynchronously by an event. The trigger event is a change of the brightness irradiating the pixel: if the increment or decrement of brightness is higher than a certain threshold, an event is output. Compared with traditional cameras, event cameras have the advantages of high dynamic range and no motion blur. Accumulating events into frames and using a traditional SLAM algorithm is a direct and efficient way to do event-based SLAM. Different event accumulator settings, such as the slicing method for the event stream, the processing method for no-motion periods, whether polarity is used, the decay function, and the event contribution, can produce quite different accumulation results. We investigate how to accumulate event frames to achieve better event-based SLAM performance. For experimental verification, accumulated event frames are fed to a traditional SLAM system to construct an event-based SLAM system. Our strategy for setting the event accumulator has been evaluated on a public dataset. The experiment results show that our method achieves better performance in most sequences compared with the state-of-the-art event-frame-based SLAM algorithm. In addition, the proposed approach has been tested on a quadrotor UAV to show its potential for applications in real scenarios. Code and results are open sourced to benefit the research community of event cameras.
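
The accumulation step itself can be sketched in a few lines: events within a time slice are summed into a frame, weighted by polarity and an exponential decay towards the slice end. The slicing window, decay constant, and contribution weights below are arbitrary illustrations of the kinds of settings the paper studies, not its recommended configuration.

```python
import numpy as np

def accumulate_events(events, t_ref, tau, shape):
    """Accumulate events (t, x, y, polarity in {-1, +1}) into a frame.
    Each event contributes polarity * exp(-(t_ref - t) / tau)."""
    frame = np.zeros(shape, dtype=np.float32)
    for t, x, y, p in events:
        frame[y, x] += p * np.exp(-(t_ref - t) / tau)
    return frame

events = [(0.000, 3, 2, +1), (0.004, 3, 2, +1), (0.009, 5, 1, -1)]  # (t [s], x, y, pol)
frame = accumulate_events(events, t_ref=0.010, tau=0.005, shape=(4, 8))
print(frame[2, 3], frame[1, 5])   # accumulated positive and negative contributions
```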

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

Pattern-Aware Data Augmentation for LiDAR 3D Object Detection

  • Authors: Jordan S.K. Hu, Steven L. Waslander
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2112.00050
  • Pdf link: https://arxiv.org/pdf/2112.00050
  • Abstract
    Autonomous driving datasets are often skewed and in particular, lack training data for objects at farther distances from the ego vehicle. The imbalance of data causes a performance degradation as the distance of the detected objects increases. In this paper, we propose pattern-aware ground truth sampling, a data augmentation technique that downsamples an object's point cloud based on the LiDAR's characteristics. Specifically, we mimic the natural diverging point pattern variation that occurs for objects at depth to simulate samples at farther distances. Thus, the network has more diverse training examples and can generalize to detecting farther objects more effectively. We evaluate against existing data augmentation techniques that use point removal or perturbation methods and find that our method outperforms all of them. Additionally, we propose using equal element AP bins to evaluate the performance of 3D object detectors across distance. We improve the performance of PV-RCNN on the car class by more than 0.7 percent on the KITTI validation split at distances greater than 25 m.
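
The gist of the augmentation can be sketched with a crude range-based thinning rule: for a spinning LiDAR, the number of returns on an object falls off roughly with the square of its distance, so the points of a nearby object can be randomly dropped to mimic a farther one. The sketch below uses that simple probability rule; the paper models the actual beam pattern, which this does not.

```python
import numpy as np

def simulate_farther(points, d_orig, d_target, rng=None):
    """Randomly keep points with probability (d_orig / d_target)^2 to mimic
    the sparser sampling an object would receive at a larger distance."""
    if rng is None:
        rng = np.random.default_rng()
    keep_prob = min(1.0, (d_orig / d_target) ** 2)
    return points[rng.random(len(points)) < keep_prob]

pts = np.random.default_rng(0).normal(size=(2000, 3))   # toy object point cloud
far = simulate_farther(pts, d_orig=10.0, d_target=40.0, rng=np.random.default_rng(1))
print(len(pts), len(far))   # roughly 1/16 of the points remain
```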

Scalable Primitives for Generalized Sensor Fusion in Autonomous Vehicles

  • Authors: Sammy Sidhu, Linda Wang, Tayyab Naseer, Ashish Malhotra, Jay Chia, Aayush Ahuja, Ella Rasmussen, Qiangui Huang, Ray Gao
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2112.00219
  • Pdf link: https://arxiv.org/pdf/2112.00219
  • Abstract
    In autonomous driving, there has been an explosion in the use of deep neural networks for perception, prediction and planning tasks. As autonomous vehicles (AVs) move closer to production, multi-modal sensor inputs and heterogeneous vehicle fleets with different sets of sensor platforms are becoming increasingly common in the industry. However, neural network architectures typically target specific sensor platforms and are not robust to changes in input, making the problem of scaling and model deployment particularly difficult. Furthermore, most players still treat the problem of optimizing software and hardware as entirely independent problems. We propose a new end to end architecture, Generalized Sensor Fusion (GSF), which is designed in such a way that both sensor inputs and target tasks are modular and modifiable. This enables AV system designers to easily experiment with different sensor configurations and methods and opens up the ability to deploy on heterogeneous fleets using the same models that are shared across a large engineering organization. Using this system, we report experimental results where we demonstrate near-parity of an expensive high-density (HD) LiDAR sensor with a cheap low-density (LD) LiDAR plus camera setup in the 3D object detection task. This paves the way for the industry to jointly design hardware and software architectures as well as large fleets with heterogeneous configurations.

Keyword: loop detection

There is no result

Keyword: autonomous driving

Pattern-Aware Data Augmentation for LiDAR 3D Object Detection

  • Authors: Jordan S.K. Hu, Steven L. Waslander
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2112.00050
  • Pdf link: https://arxiv.org/pdf/2112.00050
  • Abstract
    Autonomous driving datasets are often skewed and in particular, lack training data for objects at farther distances from the ego vehicle. The imbalance of data causes a performance degradation as the distance of the detected objects increases. In this paper, we propose pattern-aware ground truth sampling, a data augmentation technique that downsamples an object's point cloud based on the LiDAR's characteristics. Specifically, we mimic the natural diverging point pattern variation that occurs for objects at depth to simulate samples at farther distances. Thus, the network has more diverse training examples and can generalize to detecting farther objects more effectively. We evaluate against existing data augmentation techniques that use point removal or perturbation methods and find that our method outperforms all of them. Additionally, we propose using equal element AP bins to evaluate the performance of 3D object detectors across distance. We improve the performance of PV-RCNN on the car class by more than 0.7 percent on the KITTI validation split at distances greater than 25 m.

Scalable Primitives for Generalized Sensor Fusion in Autonomous Vehicles

  • Authors: Sammy Sidhu, Linda Wang, Tayyab Naseer, Ashish Malhotra, Jay Chia, Aayush Ahuja, Ella Rasmussen, Qiangui Huang, Ray Gao
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2112.00219
  • Pdf link: https://arxiv.org/pdf/2112.00219
  • Abstract
    In autonomous driving, there has been an explosion in the use of deep neural networks for perception, prediction and planning tasks. As autonomous vehicles (AVs) move closer to production, multi-modal sensor inputs and heterogeneous vehicle fleets with different sets of sensor platforms are becoming increasingly common in the industry. However, neural network architectures typically target specific sensor platforms and are not robust to changes in input, making the problem of scaling and model deployment particularly difficult. Furthermore, most players still treat the problem of optimizing software and hardware as entirely independent problems. We propose a new end to end architecture, Generalized Sensor Fusion (GSF), which is designed in such a way that both sensor inputs and target tasks are modular and modifiable. This enables AV system designers to easily experiment with different sensor configurations and methods and opens up the ability to deploy on heterogeneous fleets using the same models that are shared across a large engineering organization. Using this system, we report experimental results where we demonstrate near-parity of an expensive high-density (HD) LiDAR sensor with a cheap low-density (LD) LiDAR plus camera setup in the 3D object detection task. This paves the way for the industry to jointly design hardware and software architectures as well as large fleets with heterogeneous configurations.

The Norm Must Go On: Dynamic Unsupervised Domain Adaptation by Normalization

  • Authors: M. Jehanzeb Mirza, Jakub Micorek, Horst Possegger, Horst Bischof
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2112.00463
  • Pdf link: https://arxiv.org/pdf/2112.00463
  • Abstract
    Domain adaptation is crucial to adapt a learned model to new scenarios, such as domain shifts or changing data distributions. Current approaches usually require a large amount of labeled or unlabeled data from the shifted domain. This can be a hurdle in fields which require continuous dynamic adaptation or suffer from scarcity of data, e.g. autonomous driving in challenging weather conditions. To address this problem of continuous adaptation to distribution shifts, we propose Dynamic Unsupervised Adaptation (DUA). We modify the feature representations of the model by continuously adapting the statistics of the batch normalization layers. We show that by accessing only a tiny fraction of unlabeled data from the shifted domain and adapting sequentially, a strong performance gain can be achieved. With even less than 1% of unlabeled data from the target domain, DUA already achieves competitive results to strong baselines. In addition, the computational overhead is minimal in contrast to previous approaches. Our approach is simple, yet effective and can be applied to any architecture which uses batch normalization as one of its components. We show the utility of DUA by evaluating it on a variety of domain adaptation datasets and tasks including object recognition, digit recognition and object detection.
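
Mechanically, DUA boils down to updating the running statistics of batch-normalization layers from a handful of unlabeled target batches, with no labels and no gradient steps. Below is a minimal single-channel numpy sketch of that mechanism, using a fixed momentum; the paper's adaptive momentum schedule is not reproduced.

```python
import numpy as np

def adapt_bn_stats(running_mean, running_var, batches, momentum=0.1):
    """Update running BN statistics from unlabeled target-domain batches
    (single channel). No labels and no gradient updates are involved."""
    for batch in batches:
        running_mean = (1 - momentum) * running_mean + momentum * batch.mean()
        running_var = (1 - momentum) * running_var + momentum * batch.var()
    return running_mean, running_var

rng = np.random.default_rng(0)
source_mean, source_var = 0.0, 1.0                               # stats from the source domain
target_batches = [rng.normal(2.0, 0.5, 64) for _ in range(50)]   # shifted target domain
print(adapt_bn_stats(source_mean, source_var, target_batches))   # drifts towards (2.0, 0.25)
```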

Keyword: mapping

Task2Sim : Towards Effective Pre-training and Transfer from Synthetic Data

  • Authors: Samarth Mishra, Rameswar Panda, Cheng Perng Phoo, Chun-Fu Chen, Leonid Karlinsky, Kate Saenko, Venkatesh Saligrama, Rogerio S. Feris
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2112.00054
  • Pdf link: https://arxiv.org/pdf/2112.00054
  • Abstract
    Pre-training models on Imagenet or other massive datasets of real images has led to major advances in computer vision, albeit accompanied with shortcomings related to curation cost, privacy, usage rights, and ethical issues. In this paper, for the first time, we study the transferability of pre-trained models based on synthetic data generated by graphics simulators to downstream tasks from very different domains. In using such synthetic data for pre-training, we find that downstream performance on different tasks are favored by different configurations of simulation parameters (e.g. lighting, object pose, backgrounds, etc.), and that there is no one-size-fits-all solution. It is thus better to tailor synthetic pre-training data to a specific downstream task, for best performance. We introduce Task2Sim, a unified model mapping downstream task representations to optimal simulation parameters to generate synthetic pre-training data for them. Task2Sim learns this mapping by training to find the set of best parameters on a set of "seen" tasks. Once trained, it can then be used to predict best simulation parameters for novel "unseen" tasks in one shot, without requiring additional training. Given a budget in number of images per class, our extensive experiments with 20 diverse downstream tasks show Task2Sim's task-adaptive pre-training data results in significantly better downstream performance than non-adaptively choosing simulation parameters on both seen and unseen tasks. It is even competitive with pre-training on real images from Imagenet.

SAMO: Optimised Mapping of Convolutional Neural Networks to Streaming Architectures

  • Authors: Alexander Montgomerie-Corcoran, Zhewen Yu, Christos-Savvas Bouganis
  • Subjects: Hardware Architecture (cs.AR)
  • Arxiv link: https://arxiv.org/abs/2112.00170
  • Pdf link: https://arxiv.org/pdf/2112.00170
  • Abstract
    Toolflows that map Convolutional Neural Network (CNN) models to Field Programmable Gate Arrays (FPGAs) have been an important tool in accelerating a range of applications across different deployment settings. However, the significance of the problem of finding an optimal mapping is often overlooked, with the expectation that the end user will tune their generated hardware to their desired platform. This is particularly prominent within Streaming Architectures toolflows, where there is a large design space to explore. There have been many Streaming Architectures proposed, however apart from fpgaConvNet, there is limited support for optimisation methods that explore both performance objectives and platform constraints. In this work, we establish a framework, SAMO: a Streaming Architecture Mapping Optimiser, which generalises the optimisation problem of mapping Streaming Architectures to FPGA platforms. We also implement both Brute Force and Simulated Annealing optimisation methods in order to generate valid, high performance designs for a range of target platforms and CNN models. We are able to observe a 4x increase in performance compared to example designs for the popular Streaming Architecture framework FINN.

Forward Operator Estimation in Generative Models with Kernel Transfer Operators

  • Authors: Zhichun Huang, Rudrasis Chakraborty, Vikas Singh
  • Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2112.00305
  • Pdf link: https://arxiv.org/pdf/2112.00305
  • Abstract
    Generative models which use explicit density modeling (e.g., variational autoencoders, flow-based generative models) involve finding a mapping from a known distribution, e.g. Gaussian, to the unknown input distribution. This often requires searching over a class of non-linear functions (e.g., representable by a deep neural network). While effective in practice, the associated runtime/memory costs can increase rapidly, usually as a function of the performance desired in an application. We propose a much cheaper (and simpler) strategy to estimate this mapping based on adapting known results in kernel transfer operators. We show that our formulation enables highly efficient distribution approximation and sampling, and offers surprisingly good empirical performance that compares favorably with powerful baselines, but with significant runtime savings. We show that the algorithm also performs well in small sample size settings (in brain imaging).

Triangle Counting Accelerations: From Algorithm to In-Memory Computing Architecture

  • Authors: Xueyan Wang, Jianlei Yang, Yinglin Zhao, Xiaotao Jia, Rong Yin, Xuhang Chen, Gang Qu, Weisheng Zhao
  • Subjects: Hardware Architecture (cs.AR)
  • Arxiv link: https://arxiv.org/abs/2112.00471
  • Pdf link: https://arxiv.org/pdf/2112.00471
  • Abstract
    Triangles are the basic substructure of networks and triangle counting (TC) has been a fundamental graph computing problem in numerous fields such as social network analysis. Nevertheless, like other graph computing problems, due to the high memory-computation ratio and random memory access pattern, TC involves a large amount of data transfers thus suffers from the bandwidth bottleneck in the traditional Von-Neumann architecture. To overcome this challenge, in this paper, we propose to accelerate TC with the emerging processing-in-memory (PIM) architecture through an algorithm-architecture co-optimization manner. To enable the efficient in-memory implementations, we come up to reformulate TC with bitwise logic operations (such as AND), and develop customized graph compression and mapping techniques for efficient data flow management. With the emerging computational Spin-Transfer Torque Magnetic RAM (STT-MRAM) array, which is one of the most promising PIM enabling techniques, the device-to-architecture co-simulation results demonstrate that the proposed TC in-memory accelerator outperforms the state-of-the-art GPU and FPGA accelerations by 12.2x and 31.8x, respectively, and achieves a 34x energy efficiency improvement over the FPGA accelerator.
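
The bitwise reformulation of triangle counting is easy to mimic in software, even though the paper targets in-memory STT-MRAM arrays rather than a CPU; a minimal sketch using Python integers as adjacency bitsets:

```python
def count_triangles(adj_rows):
    """Count triangles in an undirected graph.

    adj_rows[i] is an integer bitset whose bit j is set iff edge (i, j) exists.
    For every edge (u, v), the common neighbours are obtained with one bitwise
    AND followed by a population count; each triangle is counted three times
    (once per edge), so the total is divided by 3.
    """
    n = len(adj_rows)
    total = 0
    for u in range(n):
        for v in range(u + 1, n):
            if adj_rows[u] >> v & 1:  # edge (u, v) present
                total += bin(adj_rows[u] & adj_rows[v]).count("1")
    return total // 3

# A 4-cycle 0-1-2-3 with chord 0-2 contains exactly two triangles.
rows = [0] * 4
for a, b in [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]:
    rows[a] |= 1 << b
    rows[b] |= 1 << a
print(count_triangles(rows))  # -> 2
```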

Improving gearshift controllers for electric vehicles with reinforcement learning

  • Authors: Marc-Antoine Beaudoin, Benoit Boulet
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2112.00529
  • Pdf link: https://arxiv.org/pdf/2112.00529
  • Abstract
    During a multi-speed transmission development process, the final calibration of the gearshift controller parameters is usually performed on a physical test bench. Engineers typically treat the mapping from the controller parameters to the gearshift quality as a black-box, and use methods rooted in experimental design -- a purely statistical approach -- to infer the parameter combination that will maximize a chosen gearshift performance indicator. This approach unfortunately requires thousands of gearshift trials, ultimately discouraging the exploration of different control strategies. In this work, we calibrate the feedforward and feedback parameters of a gearshift controller using a model-based reinforcement learning algorithm adapted from Pilco. Experimental results show that the method optimizes the controller parameters with few gearshift trials. This approach can accelerate the exploration of gearshift control strategies, which is especially important for the emerging technology of multi-speed transmissions for electric vehicles.

Edge computing for cyber-physical systems: A systematic mapping study emphasizing trustworthiness

  • Authors: José Manuel Gaspar Sánchez, Nils Jörgensen, Martin Törngren, Rafia Inam, Andrii Berezovskyi, Lei Feng, Elena Fersman, Muhammad Rusyadi Ramli, Kaige Tan
  • Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
  • Arxiv link: https://arxiv.org/abs/2112.00619
  • Pdf link: https://arxiv.org/pdf/2112.00619
  • Abstract
    Edge computing is projected to have profound implications in the coming decades, proposed to provide solutions for applications such as augmented reality, predictive functionalities, and collaborative Cyber-Physical Systems (CPS). For such applications, edge computing addresses the new computational needs, as well as privacy, availability, and real-time constraints, by providing local high-performance computing capabilities to deal with the limitations and constraints of cloud and embedded systems. Our interests lie in the applications of edge computing as part of CPS, where several properties (or attributes) of trustworthiness, including safety, security, and predictability/availability are of particular concern, each facing challenges for the introduction of edge-based CPS. We present the results of a systematic mapping study, a kind of systematic literature survey, investigating the use of edge computing for CPS with a special emphasis on trustworthiness. The main contributions of this study are a detailed description of the current research efforts in edge-based CPS and the identification and discussion of trends and research gaps. The results show that the main body of research in edge-based CPS only to a very limited extent consider key attributes of system trustworthiness, despite many efforts referring to critical CPS and applications like intelligent transportation. More research and industrial efforts will be needed on aspects of trustworthiness of future edge-based CPS including their experimental evaluation. Such research needs to consider the multiple interrelated attributes of trustworthiness including safety, security, and predictability, and new methodologies and architectures to address them. It is further important to provide bridges and collaboration between edge computing and CPS disciplines.

MOMO -- Deep Learning-driven classification of external DICOM studies for PACS archivation

  • Authors: Frederic Jonske, Maximilian Dederichs, Moon-Sung Kim, Jan Egger, Lale Umutlu, Michael Forsting, Felix Nensa, Jens Kleesiek
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
  • Arxiv link: https://arxiv.org/abs/2112.00661
  • Pdf link: https://arxiv.org/pdf/2112.00661
  • Abstract
    Patients regularly continue assessment or treatment in other facilities than they began them in, receiving their previous imaging studies as a CD-ROM and requiring clinical staff at the new hospital to import these studies into their local database. However, between different facilities, standards for nomenclature, contents, or even medical procedures may vary, often requiring human intervention to accurately classify the received studies in the context of the recipient hospital's standards. In this study, the authors present MOMO (MOdality Mapping and Orchestration), a deep learning-based approach to automate this mapping process utilizing metadata substring matching and a neural network ensemble, which is trained to recognize the 76 most common imaging studies across seven different modalities. A retrospective study is performed to measure the accuracy that this algorithm can provide. To this end, a set of 11,934 imaging series with existing labels was retrieved from the local hospital's PACS database to train the neural networks. A set of 843 completely anonymized external studies was hand-labeled to assess the performance of our algorithm. Additionally, an ablation study was performed to measure the performance impact of the network ensemble in the algorithm, and a comparative performance test with a commercial product was conducted. In comparison to a commercial product (96.20% predictive power, 82.86% accuracy, 1.36% minor errors), a neural network ensemble alone performs the classification task with less accuracy (99.05% predictive power, 72.69% accuracy, 10.3% minor errors). However, MOMO outperforms either by a large margin in accuracy and with increased predictive power (99.29% predictive power, 92.71% accuracy, 2.63% minor errors).
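
The metadata half of such a pipeline boils down to keyword lookups over DICOM fields; a toy sketch is below (the tag names and the keyword table are invented for illustration and are not MOMO's actual rules, which are additionally backed by a neural network ensemble over the pixel data):

```python
# Illustrative keyword table mapping metadata substrings to local study labels.
KEYWORD_TO_LABEL = {
    "thorax": "X-Ray Thorax",
    "chest": "X-Ray Thorax",
    "skull": "CT Head",
    "cranial": "CT Head",
    "abdomen": "CT Abdomen",
}

def classify_by_metadata(series_metadata):
    """Return (label, matched_keyword), or (None, None) if nothing matches.

    series_metadata is a dict of DICOM-style fields, e.g.
    {"StudyDescription": "Thorax pa und lateral", "BodyPartExamined": "CHEST"}.
    A miss would be handed over to the image-based classifier instead.
    """
    haystack = " ".join(str(v) for v in series_metadata.values()).lower()
    for keyword, label in KEYWORD_TO_LABEL.items():
        if keyword in haystack:
            return label, keyword
    return None, None

print(classify_by_metadata({"StudyDescription": "Thorax pa und lateral",
                            "BodyPartExamined": "CHEST"}))
# -> ('X-Ray Thorax', 'thorax')
```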

CYBORG: Blending Human Saliency Into the Loss Improves Deep Learning

  • Authors: Aidan Boyd, Patrick Tinsley, Kevin Bowyer, Adam Czajka
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2112.00686
  • Pdf link: https://arxiv.org/pdf/2112.00686
  • Abstract
    Can deep learning models achieve greater generalization if their training is guided by reference to human perceptual abilities? And how can we implement this in a practical manner? This paper proposes a first-ever training strategy to ConveY Brain Oversight to Raise Generalization (CYBORG). This new training approach incorporates human-annotated saliency maps into a CYBORG loss function that guides the model towards learning features from image regions that humans find salient when solving a given visual task. The Class Activation Mapping (CAM) mechanism is used to probe the model's current saliency in each training batch, juxtapose model saliency with human saliency, and penalize the model for large differences. Results on the task of synthetic face detection show that the CYBORG loss leads to significant improvement in performance on unseen samples consisting of face images generated from six Generative Adversarial Networks (GANs) across multiple classification network architectures. We also show that scaling to even seven times as much training data with standard loss cannot beat the accuracy of CYBORG loss. As a side effect, we observed that the addition of explicit region annotation to the task of synthetic face detection increased human classification performance. This work opens a new area of research on how to incorporate human visual saliency into loss functions. All data, code and pre-trained models used in this work are offered with this paper.
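
The loss described above combines an ordinary classification term with a penalty on the gap between model and human saliency; a rough NumPy sketch of that idea follows (the L2 penalty and the `alpha` weighting are assumptions made here, not the paper's exact formulation, and the CAM computation is taken as given):

```python
import numpy as np

def cyborg_style_loss(logits, labels, model_cam, human_saliency, alpha=0.5):
    """Cross-entropy plus a penalty on the model/human saliency discrepancy.

    logits:         (batch, n_classes) raw class scores
    labels:         (batch,) integer class labels
    model_cam:      (batch, H, W) class activation maps from the network
    human_saliency: (batch, H, W) human-annotated saliency maps
    """
    # Standard softmax cross-entropy.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()

    # Normalise both maps to unit mass per sample before comparing them.
    def normalise(maps):
        flat = maps.reshape(len(maps), -1)
        return flat / (flat.sum(axis=1, keepdims=True) + 1e-8)

    saliency_gap = ((normalise(model_cam) - normalise(human_saliency)) ** 2).sum(axis=1).mean()
    return (1 - alpha) * ce + alpha * saliency_gap
```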

Keyword: localization

Graph Convolutional Module for Temporal Action Localization in Videos

  • Authors: Runhao Zeng, Wenbing Huang, Mingkui Tan, Yu Rong, Peilin Zhao, Junzhou Huang, Chuang Gan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2112.00302
  • Pdf link: https://arxiv.org/pdf/2112.00302
  • Abstract
    Temporal action localization has long been researched in computer vision. Existing state-of-the-art action localization methods divide each video into multiple action units (i.e., proposals in two-stage methods and segments in one-stage methods) and then perform action recognition/regression on each of them individually, without explicitly exploiting their relations during learning. In this paper, we claim that the relations between action units play an important role in action localization, and a more powerful action detector should not only capture the local content of each action unit but also allow a wider field of view on the context related to it. To this end, we propose a general graph convolutional module (GCM) that can be easily plugged into existing action localization methods, including two-stage and one-stage paradigms. To be specific, we first construct a graph, where each action unit is represented as a node and their relations between two action units as an edge. Here, we use two types of relations, one for capturing the temporal connections between different action units, and the other one for characterizing their semantic relationship. Particularly for the temporal connections in two-stage methods, we further explore two different kinds of edges, one connecting the overlapping action units and the other one connecting surrounding but disjointed units. Upon the graph we built, we then apply graph convolutional networks (GCNs) to model the relations among different action units, which is able to learn more informative representations to enhance action localization. Experimental results show that our GCM consistently improves the performance of existing action localization methods, including two-stage methods (e.g., CBR and R-C3D) and one-stage methods (e.g., D-SSAD), verifying the generality and effectiveness of our GCM.
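
At its core, the module propagates information between action-unit representations over a relation graph; a single symmetrically-normalised graph-convolution step, shown below in NumPy, conveys the mechanism (the edge construction and the one-layer form are simplifications here, not the paper's full GCM):

```python
import numpy as np

def gcn_layer(features, adjacency, weight):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).

    features:  (n_units, d_in)      per-action-unit representations
    adjacency: (n_units, n_units)   0/1 relations (e.g. temporal or semantic edges)
    weight:    (d_in, d_out)        learnable projection
    """
    a_hat = adjacency + np.eye(len(adjacency))          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weight, 0.0)  # ReLU

# Three action units: 0 and 1 overlap in time, 1 and 2 are semantically related.
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
features = rng.normal(size=(3, 8))
weight = rng.normal(size=(8, 4))
print(gcn_layer(features, adjacency, weight).shape)  # -> (3, 4)
```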

Background Activation Suppression for Weakly Supervised Object Localization

  • Authors: Pingyu Wu, Wei Zhai, Yang Cao
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2112.00580
  • Pdf link: https://arxiv.org/pdf/2112.00580
  • Abstract
    Weakly supervised object localization (WSOL) aims to localize the object region using only image-level labels as supervision. Recently a new paradigm has emerged by generating a foreground prediction map (FPM) to achieve the localization task. Existing FPM-based methods use cross-entropy (CE) to evaluate the foreground prediction map and to guide the learning of generator. We argue for using activation value to achieve more efficient learning. It is based on the experimental observation that, for a trained network, CE converges to zero when the foreground mask covers only part of the object region. While activation value increases until the mask expands to the object boundary, which indicates that more object areas can be learned by using activation value. In this paper, we propose a Background Activation Suppression (BAS) method. Specifically, an Activation Map Constraint module (AMC) is designed to facilitate the learning of generator by suppressing the background activation values. Meanwhile, by using the foreground region guidance and the area constraint, BAS can learn the whole region of the object. Furthermore, in the inference phase, we consider the prediction maps of different categories together to obtain the final localization results. Extensive experiments show that BAS achieves significant and consistent improvement over the baseline methods on the CUB-200-2011 and ILSVRC datasets.

Siamese Neural Encoders for Long-Term Indoor Localization with Mobile Devices

  • Authors: Saideep Tiku, Sudeep Pasricha
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2112.00654
  • Pdf link: https://arxiv.org/pdf/2112.00654
  • Abstract
    Fingerprinting-based indoor localization is an emerging application domain for enhanced positioning and tracking of people and assets within indoor locales. The superior pairing of ubiquitously available WiFi signals with computationally capable smartphones is set to revolutionize the area of indoor localization. However, the observed signal characteristics from independently maintained WiFi access points vary greatly over time. Moreover, some of the WiFi access points visible at the initial deployment phase may be replaced or removed over time. These factors are often ignored in indoor localization frameworks and cause gradual and catastrophic degradation of localization accuracy post-deployment (over weeks and months). To overcome these challenges, we propose a Siamese neural encoder-based framework that offers up to 40% reduction in degradation of localization accuracy over time compared to the state-of-the-art in the area, without requiring any retraining.
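
At inference time, a fingerprinting framework of this kind reduces to a nearest-neighbour lookup in the encoder's embedding space; the sketch below uses a stand-in random linear projection where the trained Siamese encoder would be plugged in (all names and numbers are illustrative):

```python
import numpy as np

def locate(query_rss, fingerprint_rss, fingerprint_positions, encoder):
    """Return the reference position whose embedded fingerprint is closest.

    query_rss:             (n_aps,)        RSS vector observed by the device
    fingerprint_rss:       (n_ref, n_aps)  offline RSS fingerprints
    fingerprint_positions: (n_ref, 2)      known (x, y) of each fingerprint
    encoder:               callable mapping RSS vectors to embeddings
    """
    query_emb = encoder(query_rss[None, :])
    ref_emb = encoder(fingerprint_rss)
    nearest = np.argmin(np.linalg.norm(ref_emb - query_emb, axis=1))
    return fingerprint_positions[nearest]

rng = np.random.default_rng(42)
projection = rng.normal(size=(6, 3))            # stand-in for a trained encoder
encoder = lambda rss: rss @ projection

fingerprints = rng.uniform(-90, -30, size=(4, 6))   # 4 reference points, 6 APs (dBm)
positions = np.array([[0, 0], [0, 5], [5, 0], [5, 5]], dtype=float)
noisy_query = fingerprints[2] + rng.normal(0, 1, 6)
print(locate(noisy_query, fingerprints, positions, encoder))  # most likely [5. 0.]
```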

New submissions for Thu, 10 Jun 21

Keyword: SLAM

There is no result

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: autonomous driving

There is no result

New submissions for Fri, 4 Jun 21

Keyword: SLAM

There is no result

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: autonomous driving

Multiscale Domain Adaptive YOLO for Cross-Domain Object Detection

  • Authors: Mazin Hnewa, Hayder Radha
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.01483
  • Pdf link: https://arxiv.org/pdf/2106.01483
  • Abstract
    The area of domain adaptation has been instrumental in addressing the domain shift problem encountered by many applications. This problem arises due to the difference between the distributions of source data used for training in comparison with target data used during realistic testing scenarios. In this paper, we introduce a novel MultiScale Domain Adaptive YOLO (MS-DAYOLO) framework that employs multiple domain adaptation paths and corresponding domain classifiers at different scales of the recently introduced YOLOv4 object detector to generate domain-invariant features. We train and test our proposed method using popular datasets. Our experiments show significant improvements in object detection performance when training YOLOv4 using the proposed MS-DAYOLO and when tested on target data representing challenging weather conditions for autonomous driving applications.

DeepCompress: Efficient Point Cloud Geometry Compression

  • Authors: Ryan Killea, Yun Li, Saeed Bastani, Paul McLachlan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2106.01504
  • Pdf link: https://arxiv.org/pdf/2106.01504
  • Abstract
    Point clouds are a basic data type that is increasingly of interest as 3D content becomes more ubiquitous. Applications using point clouds include virtual, augmented, and mixed reality and autonomous driving. We propose a more efficient deep learning-based encoder architecture for point cloud compression that incorporates principles from established 3D object detection and image compression architectures. Through an ablation study, we show that incorporating the learned activation function from Computational Efficient Neural Image Compression (CENIC) and designing more parameter-efficient convolutional blocks yields dramatic gains in efficiency and performance. Our proposed architecture incorporates Generalized Divisive Normalization activations and proposes a spatially separable InceptionV4-inspired block. We then evaluate rate-distortion curves on the standard JPEG Pleno 8i Voxelized Full Bodies dataset to evaluate our model's performance. Our proposed modifications outperform the baseline approaches by a small margin in terms of Bjontegard delta rate and PSNR values, yet reduce necessary encoder convolution operations by 8 percent and total encoder parameters by 20 percent. Our proposed architecture, when considered on its own, has a small penalty of 0.02 percent in Chamfer's Distance and 0.32 percent increased bit rate in Point to Plane Distance for the same peak signal-to-noise ratio.

Short-term Maintenance Planning of Autonomous Trucks for Minimizing Economic Risk

  • Authors: Xin Tao, Jonas Mårtensson, Håkan Warnquist, Anna Pernestål
  • Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2106.01871
  • Pdf link: https://arxiv.org/pdf/2106.01871
  • Abstract
    New autonomous driving technologies are emerging every day and some of them have been commercially applied in the real world. While benefiting from these technologies, autonomous trucks are facing new challenges in short-term maintenance planning, which directly influences the truck operator's profit. In this paper, we implement a vehicle health management system by addressing the maintenance planning issues of autonomous trucks on a transport mission. We also present a maintenance planning model using a risk-based decision-making method, which identifies the maintenance decision with minimal economic risk for the truck company. Both availability losses and maintenance costs are considered when evaluating the economic risk. We demonstrate the proposed model by numerical experiments illustrating real-world scenarios. In the experiments, compared to three baseline methods, the expected economic risk of the proposed method is reduced by up to 47%. We also conduct sensitivity analyses of different model parameters. The analyses show that the economic risk significantly decreases when the estimation accuracy of remaining useful life, the maximal allowed time of delivery delay before order cancellation, or the number of workshops increases. The experimental results contribute to identifying future research and development directions for autonomous trucks from an economic perspective.

New submissions for Thu, 4 Nov 21

Keyword: SLAM

There is no result

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

Efficient 3D Deep LiDAR Odometry

  • Authors: Guangming Wang, Xinrui Wu, Shuyang Jiang, Zhe Liu, Hesheng Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.02135
  • Pdf link: https://arxiv.org/pdf/2111.02135
  • Abstract
    An efficient 3D point cloud learning architecture, named PWCLO-Net, for LiDAR odometry is first proposed in this paper. In this architecture, the projection-aware representation of the 3D point cloud is proposed to organize the raw 3D point cloud into an ordered data form to achieve efficiency. The Pyramid, Warping, and Cost volume (PWC) structure for the LiDAR odometry task is built to estimate and refine the pose in a coarse-to-fine approach hierarchically and efficiently. A projection-aware attentive cost volume is built to directly associate two discrete point clouds and obtain embedding motion patterns. Then, a trainable embedding mask is proposed to weigh the local motion patterns to regress the overall pose and filter outlier points. The trainable pose warp-refinement module is iteratively used with embedding mask optimized hierarchically to make the pose estimation more robust for outliers. The entire architecture is holistically optimized end-to-end to achieve adaptive learning of cost volume and mask, and all operations involving point cloud sampling and grouping are accelerated by projection-aware 3D feature learning methods. The superior performance and effectiveness of our LiDAR odometry architecture are demonstrated on KITTI odometry dataset. Our method outperforms all recent learning-based methods and even the geometry-based approach, LOAM with mapping optimization, on most sequences of KITTI odometry dataset.

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

Efficient 3D Deep LiDAR Odometry

  • Authors: Guangming Wang, Xinrui Wu, Shuyang Jiang, Zhe Liu, Hesheng Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.02135
  • Pdf link: https://arxiv.org/pdf/2111.02135
  • Abstract
    An efficient 3D point cloud learning architecture, named PWCLO-Net, for LiDAR odometry is first proposed in this paper. In this architecture, the projection-aware representation of the 3D point cloud is proposed to organize the raw 3D point cloud into an ordered data form to achieve efficiency. The Pyramid, Warping, and Cost volume (PWC) structure for the LiDAR odometry task is built to estimate and refine the pose in a coarse-to-fine approach hierarchically and efficiently. A projection-aware attentive cost volume is built to directly associate two discrete point clouds and obtain embedding motion patterns. Then, a trainable embedding mask is proposed to weigh the local motion patterns to regress the overall pose and filter outlier points. The trainable pose warp-refinement module is iteratively used with embedding mask optimized hierarchically to make the pose estimation more robust for outliers. The entire architecture is holistically optimized end-to-end to achieve adaptive learning of cost volume and mask, and all operations involving point cloud sampling and grouping are accelerated by projection-aware 3D feature learning methods. The superior performance and effectiveness of our LiDAR odometry architecture are demonstrated on KITTI odometry dataset. Our method outperforms all recent learning-based methods and even the geometry-based approach, LOAM with mapping optimization, on most sequences of KITTI odometry dataset.

Keyword: loop detection

There is no result

Keyword: autonomous driving

Deep Point Set Resampling via Gradient Fields

  • Authors: Haolan Chen, Bi'an Du, Shitong Luo, Wei Hu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.02045
  • Pdf link: https://arxiv.org/pdf/2111.02045
  • Abstract
    3D point clouds acquired by scanning real-world objects or scenes have found a wide range of applications including immersive telepresence, autonomous driving, surveillance, etc. They are often perturbed by noise or suffer from low density, which obstructs downstream tasks such as surface reconstruction and understanding. In this paper, we propose a novel paradigm of point set resampling for restoration, which learns continuous gradient fields of point clouds that converge points towards the underlying surface. In particular, we represent a point cloud via its gradient field -- the gradient of the log-probability density function, and enforce the gradient field to be continuous, thus guaranteeing the continuity of the model for solvable optimization. Based on the continuous gradient fields estimated via a proposed neural network, resampling a point cloud amounts to performing gradient-based Markov Chain Monte Carlo (MCMC) on the input noisy or sparse point cloud. Further, we propose to introduce regularization into the gradient-based MCMC during point cloud restoration, which essentially refines the intermediate resampled point cloud iteratively and accommodates various priors in the resampling process. Extensive experimental results demonstrate that the proposed point set resampling achieves the state-of-the-art performance in representative restoration tasks including point cloud denoising and upsampling.
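
Stripped of the learned network, the resampling loop amounts to repeatedly nudging every point along an estimated gradient of the log-density; the sketch below substitutes a hand-written Gaussian kernel density score for the learned gradient field, so it illustrates the update rule rather than the paper's model:

```python
import numpy as np

def kde_score(points, bandwidth=0.1):
    """Gradient of log KDE at each point: a crude stand-in for a learned score."""
    diff = points[None, :, :] - points[:, None, :]        # diff[i, j] = x_j - x_i
    weights = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * bandwidth ** 2))
    return (weights[:, :, None] * diff).sum(axis=1) / (
        bandwidth ** 2 * weights.sum(axis=1, keepdims=True))

def resample(points, steps=50, step_size=1e-3):
    """Gradient-ascent resampling: push points towards high-density regions."""
    pts = points.copy()
    for _ in range(steps):
        pts = pts + step_size * kde_score(pts)
    return pts

noisy = np.random.default_rng(0).normal(scale=0.05, size=(200, 3))
print(noisy.std(), resample(noisy).std())   # the spread shrinks towards the mode
```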

Deep-Learning-Based Single-Image Height Reconstruction from Very-High-Resolution SAR Intensity Data

  • Authors: Michael Recla, Michael Schmitt
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2111.02061
  • Pdf link: https://arxiv.org/pdf/2111.02061
  • Abstract
    Originally developed in fields such as robotics and autonomous driving with image-based navigation in mind, deep learning-based single-image depth estimation (SIDE) has found great interest in the wider image analysis community. Remote sensing is no exception, as the possibility to estimate height maps from single aerial or satellite imagery bears great potential in the context of topographic reconstruction. A few pioneering investigations have demonstrated the general feasibility of single image height prediction from optical remote sensing images and motivate further studies in that direction. With this paper, we present the first-ever demonstration of deep learning-based single image height prediction for the other important sensor modality in remote sensing: synthetic aperture radar (SAR) data. Besides the adaptation of a convolutional neural network (CNN) architecture for SAR intensity images, we present a workflow for the generation of training data, and extensive experimental results for different SAR imaging modes and test sites. Since we put a particular emphasis on transferability, we are able to confirm that deep learning-based single-image height estimation is not only possible, but also transfers quite well to unseen data, even if acquired by different imaging modes and imaging parameters.

Keyword: mapping

Private Interdependent Valuations

  • Authors: Alon Eden, Kira Goldner, Shuran Zheng
  • Subjects: Computer Science and Game Theory (cs.GT); Data Structures and Algorithms (cs.DS)
  • Arxiv link: https://arxiv.org/abs/2111.01851
  • Pdf link: https://arxiv.org/pdf/2111.01851
  • Abstract
    We consider the single-item interdependent value setting, where there is a monopolist, $n$ buyers, and each buyer has a private signal $s_i$ describing a piece of information about the item. Each bidder $i$ also has a valuation function $v_i(s_1,\ldots,s_n)$ mapping the (private) signals of all buyers to a positive real number representing their value for the item. This setting captures scenarios where the item's information is asymmetric or dispersed among agents, such as in competitions for oil drilling rights, or in auctions for art pieces. Due to the increased complexity of this model compared to standard private values, it is generally assumed that each bidder's valuation function $v_i$ is public knowledge. But in many situations, the seller may not know how a bidder aggregates signals into a valuation. In this paper, we design mechanisms that guarantee approximately-optimal social welfare while satisfying ex-post incentive compatibility and individual rationality for the case where the valuation functions are private to the bidders. When the valuations are public, it is possible for optimal social welfare to be attained by a deterministic mechanism under a single-crossing condition. In contrast, when the valuations are the bidders' private information, we show that no finite bound can be achieved by any deterministic mechanism even under single-crossing. Moreover, no randomized mechanism can guarantee better than an $n$-approximation. We thus consider valuation functions that are submodular over signals (SOS), introduced in the context of combinatorial auctions in a recent breakthrough paper by Eden et al. [EC'19]. Our main result is an $O(\log^2 n)$-approximation for buyers with private signals and valuations under the SOS condition. We also give a tight $\Theta(k)$-approximation for the case each agent's valuation depends on at most $k$ other signals even for unknown $k$.

Accelerating Genome Sequence Analysis via Efficient Hardware/Algorithm Co-Design

  • Authors: Damla Senol Cali
  • Subjects: Hardware Architecture (cs.AR); Genomics (q-bio.GN)
  • Arxiv link: https://arxiv.org/abs/2111.01916
  • Pdf link: https://arxiv.org/pdf/2111.01916
  • Abstract
    Genome sequence analysis plays a pivotal role in enabling many medical and scientific advancements in personalized medicine, outbreak tracing, and forensics. However, the analysis of genome sequencing data is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems. In this dissertation, we propose four major works, where we characterize the real-system behavior of the genome sequence analysis pipeline and its associated tools, expose the bottlenecks and tradeoffs, and co-design fast and efficient algorithms along with scalable and energy-efficient customized hardware accelerators for the key bottlenecks to enable faster genome sequence analysis. First, we comprehensively analyze the tools in the genome assembly pipeline for long reads in multiple dimensions, uncovering bottlenecks and tradeoffs that different combinations of tools and different underlying systems lead to. Second, we propose GenASM, an acceleration framework that builds upon bitvector-based approximate string matching to accelerate multiple steps of the genome sequence analysis pipeline. We co-design our highly-parallel, scalable and memory-efficient algorithms with low-power and area-efficient hardware accelerators. Third, we implement an FPGA-based prototype for GenASM, where state-of-the-art 3D-stacked memory offers high memory bandwidth and FPGA resources offer high parallelism. Fourth, we propose SeGraM, the first hardware acceleration framework for sequence-to-graph mapping and alignment. We co-design algorithms and accelerators for memory-efficient minimizer-based seeding and bitvector-based, highly-parallel sequence-to-graph alignment. Overall, we demonstrate that genome sequence analysis can be accelerated by co-designing scalable and energy-efficient customized accelerators along with efficient algorithms for the key steps of genome sequence analysis.
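
The bitvector primitive that GenASM builds on can be illustrated with the classic Shift-And/Bitap recurrence for matching a short pattern with up to k substitutions; this is the textbook algorithm, not GenASM's modified (edit-distance, traceback-capable) version:

```python
from collections import defaultdict

def bitap_k_mismatches(text, pattern, k):
    """End positions where `pattern` occurs in `text` with <= k substitutions."""
    m = len(pattern)
    masks = defaultdict(int)            # masks[c]: bit i set iff pattern[i] == c
    for i, c in enumerate(pattern):
        masks[c] |= 1 << i

    states = [0] * (k + 1)              # states[d]: prefixes alive with <= d mismatches
    hits = []
    for pos, c in enumerate(text):
        prev = states[:]                # the recurrence uses the previous step's values
        states[0] = ((prev[0] << 1) | 1) & masks[c]
        for d in range(1, k + 1):
            match_branch = ((prev[d] << 1) | 1) & masks[c]
            subst_branch = (prev[d - 1] << 1) | 1     # consume c as a substitution
            states[d] = match_branch | subst_branch
        if states[k] & (1 << (m - 1)):
            hits.append(pos)            # the pattern ends here with <= k mismatches
    return hits

print(bitap_k_mismatches("GATTAGA", "TACA", 1))  # -> [6]  ("TAGA" vs "TACA", one mismatch)
```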

Multistep traffic speed prediction: A deep learning based approach using latent space mapping considering spatio-temporal dependencies

  • Authors: Shatrughan Modi, Jhilik Bhattacharya, Prasenjit Basak
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.02115
  • Pdf link: https://arxiv.org/pdf/2111.02115
  • Abstract
    Traffic management in a city has become a major problem due to the increasing number of vehicles on roads. Intelligent Transportation System (ITS) can help the city traffic managers to tackle the problem by providing accurate traffic forecasts. For this, ITS requires a reliable traffic prediction algorithm that can provide accurate traffic prediction at multiple time steps based on past and current traffic data. In recent years, a number of different methods for traffic prediction have been proposed which have proved their effectiveness in terms of accuracy. However, most of these methods have either considered spatial information or temporal information only and overlooked the effect of other. In this paper, to address the above problem a deep learning based approach has been developed using both the spatial and temporal dependencies. To consider spatio-temporal dependencies, nearby road sensors at a particular instant are selected based on the attributes like traffic similarity and distance. Two pre-trained deep auto-encoders were cross-connected using the concept of latent space mapping and the resultant model was trained using the traffic data from the selected nearby sensors as input. The proposed deep learning based approach was trained using the real-world traffic data collected from loop detector sensors installed on different highways of Los Angeles and Bay Area. The traffic data is freely available from the web portal of the California Department of Transportation Performance Measurement System (PeMS). The effectiveness of the proposed approach was verified by comparing it with a number of machine/deep learning approaches. It has been found that the proposed approach provides accurate traffic prediction results even for 60-min ahead prediction with least error than other techniques.

Efficient 3D Deep LiDAR Odometry

  • Authors: Guangming Wang, Xinrui Wu, Shuyang Jiang, Zhe Liu, Hesheng Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.02135
  • Pdf link: https://arxiv.org/pdf/2111.02135
  • Abstract
    An efficient 3D point cloud learning architecture, named PWCLO-Net, for LiDAR odometry is first proposed in this paper. In this architecture, the projection-aware representation of the 3D point cloud is proposed to organize the raw 3D point cloud into an ordered data form to achieve efficiency. The Pyramid, Warping, and Cost volume (PWC) structure for the LiDAR odometry task is built to estimate and refine the pose in a coarse-to-fine approach hierarchically and efficiently. A projection-aware attentive cost volume is built to directly associate two discrete point clouds and obtain embedding motion patterns. Then, a trainable embedding mask is proposed to weigh the local motion patterns to regress the overall pose and filter outlier points. The trainable pose warp-refinement module is iteratively used with embedding mask optimized hierarchically to make the pose estimation more robust for outliers. The entire architecture is holistically optimized end-to-end to achieve adaptive learning of cost volume and mask, and all operations involving point cloud sampling and grouping are accelerated by projection-aware 3D feature learning methods. The superior performance and effectiveness of our LiDAR odometry architecture are demonstrated on KITTI odometry dataset. Our method outperforms all recent learning-based methods and even the geometry-based approach, LOAM with mapping optimization, on most sequences of KITTI odometry dataset.

A Johnson--Lindenstrauss Framework for Randomly Initialized CNNs

  • Authors: Ido Nachum, Jan Hązła, Michael Gastpar, Anatoly Khina
  • Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Probability (math.PR)
  • Arxiv link: https://arxiv.org/abs/2111.02155
  • Pdf link: https://arxiv.org/pdf/2111.02155
  • Abstract
    How does the geometric representation of a dataset change after the application of each randomly initialized layer of a neural network? The celebrated Johnson--Lindenstrauss lemma answers this question for linear fully-connected neural networks (FNNs), stating that the geometry is essentially preserved. For FNNs with the ReLU activation, the angle between two inputs contracts according to a known mapping. The question for non-linear convolutional neural networks (CNNs) becomes much more intricate. To answer this question, we introduce a geometric framework. For linear CNNs, we show that the Johnson--Lindenstrauss lemma continues to hold, namely, that the angle between two inputs is preserved. For CNNs with ReLU activation, on the other hand, the behavior is richer: The angle between the outputs contracts, where the level of contraction depends on the nature of the inputs. In particular, after one layer, the geometry of natural images is essentially preserved, whereas for Gaussian correlated inputs, CNNs exhibit the same contracting behavior as FNNs with ReLU activation.
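
For reference, the "known mapping" mentioned above for ReLU fully-connected layers is commonly stated through the degree-one arc-cosine kernel: if two inputs meet at angle $\theta$, a wide random ReLU layer maps them, in expectation, to outputs whose angle $\theta'$ satisfies $\cos\theta' = \frac{1}{\pi}\left(\sin\theta + (\pi - \theta)\cos\theta\right)$, so $\theta' \le \theta$ on $[0, \pi]$; an input angle of $90^\circ$, for instance, contracts to roughly $71.4^\circ$. (This is the standard result the abstract appears to reference, not a formula quoted from the paper itself.)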

Keyword: localization

Autonomous Magnetic Navigation Framework for Active Wireless Capsule Endoscopy Inspired by Conventional Colonoscopy Procedures

  • Authors: Yangxin Xu, Keyu Li, Ziqi Zhao, Max Q.-H. Meng
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.01977
  • Pdf link: https://arxiv.org/pdf/2111.01977
  • Abstract
    In recent years, simultaneous magnetic actuation and localization (SMAL) for active wireless capsule endoscopy (WCE) has been intensively studied to improve the efficiency and accuracy of the examination. In this paper, we propose an autonomous magnetic navigation framework for active WCE that mimics the "insertion" and "withdrawal" procedures performed by an expert physician in conventional colonoscopy, thereby enabling efficient and accurate navigation of a robotic capsule endoscope in the intestine with minimal user effort. First, the capsule is automatically propelled through the unknown intestinal environment and generate a viable path to represent the environment. Then, the capsule is autonomously navigated towards any point selected on the intestinal trajectory to allow accurate and repeated inspections of suspicious lesions. Moreover, we implement the navigation framework on a robotic system incorporated with advanced SMAL algorithms, and validate it in the navigation in various tubular environments using phantoms and an ex-vivo pig colon. Our results demonstrate that the proposed autonomous navigation framework can effectively navigate the capsule in unknown, complex tubular environments with a satisfactory accuracy, repeatability and efficiency compared with manual operation.

A Strongly-Labelled Polyphonic Dataset of Urban Sounds with Spatiotemporal Context

  • Authors: Kenneth Ooi (1), Karn N. Watcharasupat (1), Santi Peksi (1), Furi Andi Karnapi (1), Zhen-Ting Ong (1), Danny Chua (1), Hui-Wen Leow (1), Li-Long Kwok (1), Xin-Lei Ng (1), Zhen-Ann Loh (1), Woon-Seng Gan (1) ((1) School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore)
  • Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2111.02006
  • Pdf link: https://arxiv.org/pdf/2111.02006
  • Abstract
    This paper introduces SINGA:PURA, a strongly labelled polyphonic urban sound dataset with spatiotemporal context. The data were collected via several recording units deployed across Singapore as a part of a wireless acoustic sensor network. These recordings were made as part of a project to identify and mitigate noise sources in Singapore, but also possess a wider applicability to sound event detection, classification, and localization. This paper introduces an accompanying hierarchical label taxonomy, which has been designed to be compatible with other existing datasets for urban sound tagging while also able to capture sound events unique to the Singaporean context. This paper details the data collection, annotation, and processing methodologies for the creation of the dataset. We further perform exploratory data analysis and include the performance of a baseline model on the dataset as a benchmark.

Three-dimensional Cooperative Localization of Commercial-Off-The-Shelf Sensors

  • Authors: Yulong Wang, Shenghong Li, Wei Ni, David Abbott, Mark Johnson, Guangyu Pei, Mark Hedley
  • Subjects: Information Retrieval (cs.IR)
  • Arxiv link: https://arxiv.org/abs/2111.02040
  • Pdf link: https://arxiv.org/pdf/2111.02040
  • Abstract
    Many location-based services use Received Signal Strength (RSS) measurements due to their universal availability. In this paper, we study the association of a large number of low-cost Internet-of-Things (IoT) sensors and their possible installation locations, which can enable various sensing and automation-related applications. We propose an efficient approach to solve the corresponding permutation combinatorial optimization problem, which integrates continuous space cooperative localization and permutation space likelihood ascent search. A convex relaxation-based optimization is designed to estimate the coarse locations of blindfolded devices in continuous 3D spaces, which are then projected to the feasible permutation space. An efficient Cramér-Rao Lower Bound based likelihood ascent search algorithm is proposed to refine the solution. Extensive experiments were conducted to evaluate the performance of the proposed approach, which show that the proposed approach significantly outperforms state-of-the-art combinatorial optimization algorithms and achieves close-to-100% accuracy with affordable execution time.

Dual Progressive Prototype Network for Generalized Zero-Shot Learning

  • Authors: Chaoqun Wang, Shaobo Min, Xuejin Chen, Xiaoyan Sun, Houqiang Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.02073
  • Pdf link: https://arxiv.org/pdf/2111.02073
  • Abstract
    Generalized Zero-Shot Learning (GZSL) aims to recognize new categories with auxiliary semantic information, e.g., category attributes. In this paper, we handle the critical issue of domain shift problem, i.e., confusion between seen and unseen categories, by progressively improving cross-domain transferability and category discriminability of visual representations. Our approach, named Dual Progressive Prototype Network (DPPN), constructs two types of prototypes that record prototypical visual patterns for attributes and categories, respectively. With attribute prototypes, DPPN alternately searches attribute-related local regions and updates corresponding attribute prototypes to progressively explore accurate attribute-region correspondence. This enables DPPN to produce visual representations with accurate attribute localization ability, which benefits the semantic-visual alignment and representation transferability. Besides, along with progressive attribute localization, DPPN further projects category prototypes into multiple spaces to progressively repel visual representations from different categories, which boosts category discriminability. Both attribute and category prototypes are collaboratively learned in a unified framework, which makes visual representations of DPPN transferable and distinctive. Experiments on four benchmarks prove that DPPN effectively alleviates the domain shift problem in GZSL.

Multi-Cue Adaptive Emotion Recognition Network

  • Authors: Willams Costa, David Macêdo, Cleber Zanchettin, Lucas S. Figueiredo, Veronica Teichrieb
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/2111.02273
  • Pdf link: https://arxiv.org/pdf/2111.02273
  • Abstract
    Expressing and identifying emotions through facial and physical expressions is a significant part of social interaction. Emotion recognition is an essential task in computer vision due to its various applications and mainly for allowing a more natural interaction between humans and machines. The common approaches for emotion recognition focus on analyzing facial expressions and requires the automatic localization of the face in the image. Although these methods can correctly classify emotion in controlled scenarios, such techniques are limited when dealing with unconstrained daily interactions. We propose a new deep learning approach for emotion recognition based on adaptive multi-cues that extract information from context and body poses, which humans commonly use in social interaction and communication. We compare the proposed approach with the state-of-art approaches in the CAER-S dataset, evaluating different components in a pipeline that reached an accuracy of 89.30%

Subpixel Heatmap Regression for Facial Landmark Localization

  • Authors: Adrian Bulat, Enrique Sanchez, Georgios Tzimiropoulos
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.02360
  • Pdf link: https://arxiv.org/pdf/2111.02360
  • Abstract
    Deep Learning models based on heatmap regression have revolutionized the task of facial landmark localization with existing models working robustly under large poses, non-uniform illumination and shadows, occlusions and self-occlusions, low resolution and blur. However, despite their wide adoption, heatmap regression approaches suffer from discretization-induced errors related to both the heatmap encoding and decoding process. In this work we show that these errors have a surprisingly large negative impact on facial alignment accuracy. To alleviate this problem, we propose a new approach for the heatmap encoding and decoding process by leveraging the underlying continuous distribution. To take full advantage of the newly proposed encoding-decoding mechanism, we also introduce a Siamese-based training that enforces heatmap consistency across various geometric image transformations. Our approach offers noticeable gains across multiple datasets setting a new state-of-the-art result in facial landmark localization. Code alongside the pretrained models will be made available at https://www.adrianbulat.com/face-alignment
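
One common way to avoid the discretisation error in heatmap decoding, in the same spirit as the continuous decoding described above (though not necessarily the authors' exact decoder), is to take the softmax-weighted expectation of the pixel grid instead of a hard argmax:

```python
import numpy as np

def soft_argmax_2d(heatmap, temperature=1.0):
    """Continuous (subpixel) landmark coordinate from a single-channel heatmap."""
    h, w = heatmap.shape
    logits = heatmap.reshape(-1) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    x = float((probs * xs.reshape(-1)).sum())   # expectation of the column index
    y = float((probs * ys.reshape(-1)).sum())   # expectation of the row index
    return x, y

# A peak shared equally by pixels (3, 4) and (3, 5) decodes to x ~ 4.5, not 4 or 5.
heatmap = np.zeros((8, 8))
heatmap[3, 4] = heatmap[3, 5] = 10.0
print(soft_argmax_2d(heatmap))   # approximately (4.5, 3.0)
```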

New submissions for Fri, 4 Jun 21

Keyword: SLAM

There is no result

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: autonomous driving

Multiscale Domain Adaptive YOLO for Cross-Domain Object Detection

  • Authors: Mazin Hnewa, Hayder Radha
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.01483
  • Pdf link: https://arxiv.org/pdf/2106.01483
  • Abstract
    The area of domain adaptation has been instrumental in addressing the domain shift problem encountered by many applications. This problem arises due to the difference between the distributions of source data used for training in comparison with target data used during realistic testing scenarios. In this paper, we introduce a novel MultiScale Domain Adaptive YOLO (MS-DAYOLO) framework that employs multiple domain adaptation paths and corresponding domain classifiers at different scales of the recently introduced YOLOv4 object detector to generate domain-invariant features. We train and test our proposed method using popular datasets. Our experiments show significant improvements in object detection performance when training YOLOv4 using the proposed MS-DAYOLO and when tested on target data representing challenging weather conditions for autonomous driving applications.

DeepCompress: Efficient Point Cloud Geometry Compression

  • Authors: Ryan Killea, Yun Li, Saeed Bastani, Paul McLachlan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2106.01504
  • Pdf link: https://arxiv.org/pdf/2106.01504
  • Abstract
    Point clouds are a basic data type that is increasingly of interest as 3D content becomes more ubiquitous. Applications using point clouds include virtual, augmented, and mixed reality and autonomous driving. We propose a more efficient deep learning-based encoder architecture for point clouds compression that incorporates principles from established 3D object detection and image compression architectures. Through an ablation study, we show that incorporating the learned activation function from Computational Efficient Neural Image Compression (CENIC) and designing more parameter-efficient convolutional blocks yields dramatic gains in efficiency and performance. Our proposed architecture incorporates Generalized Divisive Normalization activations and propose a spatially separable InceptionV4-inspired block. We then evaluate rate-distortion curves on the standard JPEG Pleno 8i Voxelized Full Bodies dataset to evaluate our model's performance. Our proposed modifications outperform the baseline approaches by a small margin in terms of Bjontegard delta rate and PSNR values, yet reduces necessary encoder convolution operations by 8 percent and reduces total encoder parameters by 20 percent. Our proposed architecture, when considered on its own, has a small penalty of 0.02 percent in Chamfer's Distance and 0.32 percent increased bit rate in Point to Plane Distance for the same peak signal-to-noise ratio.

Short-term Maintenance Planning of Autonomous Trucks for Minimizing Economic Risk

  • Authors: Xin Tao, Jonas Mårtensson, Håkan Warnquist, Anna Pernestål
  • Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2106.01871
  • Pdf link: https://arxiv.org/pdf/2106.01871
  • Abstract
    New autonomous driving technologies are emerging every day and some of them have been commercially applied in the real world. While benefiting from these technologies, autonomous trucks are facing new challenges in short-term maintenance planning, which directly influences the truck operator's profit. In this paper, we implement a vehicle health management system by addressing the maintenance planning issues of autonomous trucks on a transport mission. We also present a maintenance planning model using a risk-based decision-making method, which identifies the maintenance decision with minimal economic risk of the truck company. Both availability losses and maintenance costs are considered when evaluating the economic risk. We demonstrate the proposed model by numerical experiments illustrating real-world scenarios. In the experiments, compared to three baseline methods, the expected economic risk of the proposed method is reduced by up to $47%$. We also conduct sensitivity analyses of different model parameters. The analyses show that the economic risk significantly decreases when the estimation accuracy of remaining useful life, the maximal allowed time of delivery delay before order cancellation, or the number of workshops increases. The experiment results contribute to identifying future research and development attentions of autonomous trucks from an economic perspective.

New submissions for Fri, 4 Jun 21

Keyword: SLAM

There is no result

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: autonomous driving

Multiscale Domain Adaptive YOLO for Cross-Domain Object Detection

  • Authors: Mazin Hnewa, Hayder Radha
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.01483
  • Pdf link: https://arxiv.org/pdf/2106.01483
  • Abstract
    The area of domain adaptation has been instrumental in addressing the domain shift problem encountered by many applications. This problem arises due to the difference between the distributions of source data used for training in comparison with target data used during realistic testing scenarios. In this paper, we introduce a novel MultiScale Domain Adaptive YOLO (MS-DAYOLO) framework that employs multiple domain adaptation paths and corresponding domain classifiers at different scales of the recently introduced YOLOv4 object detector to generate domain-invariant features. We train and test our proposed method using popular datasets. Our experiments show significant improvements in object detection performance when training YOLOv4 using the proposed MS-DAYOLO and when tested on target data representing challenging weather conditions for autonomous driving applications.

DeepCompress: Efficient Point Cloud Geometry Compression

  • Authors: Ryan Killea, Yun Li, Saeed Bastani, Paul McLachlan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2106.01504
  • Pdf link: https://arxiv.org/pdf/2106.01504
  • Abstract
    Point clouds are a basic data type that is increasingly of interest as 3D content becomes more ubiquitous. Applications using point clouds include virtual, augmented, and mixed reality and autonomous driving. We propose a more efficient deep learning-based encoder architecture for point cloud compression that incorporates principles from established 3D object detection and image compression architectures. Through an ablation study, we show that incorporating the learned activation function from Computationally Efficient Neural Image Compression (CENIC) and designing more parameter-efficient convolutional blocks yields dramatic gains in efficiency and performance. Our proposed architecture incorporates Generalized Divisive Normalization activations and a spatially separable InceptionV4-inspired block. We then evaluate rate-distortion curves on the standard JPEG Pleno 8i Voxelized Full Bodies dataset to assess our model's performance. Our proposed modifications outperform the baseline approaches by a small margin in terms of Bjontegaard delta rate and PSNR values, yet reduce necessary encoder convolution operations by 8 percent and total encoder parameters by 20 percent. Our proposed architecture, when considered on its own, has a small penalty of 0.02 percent in Chamfer's Distance and 0.32 percent increased bit rate in Point to Plane Distance for the same peak signal-to-noise ratio.
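
As context for the Generalized Divisive Normalization (GDN) activation mentioned above, here is a minimal NumPy rendition of the standard GDN formula for a channels-last feature map. The parameter values are purely illustrative; this is not the authors' trained layer.

```python
import numpy as np

def gdn(x, beta, gamma, eps=1e-6):
    """Generalized Divisive Normalization over the channel axis.

    x:     (..., C) feature map (channels last)
    beta:  (C,)     per-channel offsets, positive
    gamma: (C, C)   cross-channel weights, non-negative
    """
    # Denominator: sqrt(beta_i + sum_j gamma_ij * x_j^2) at every spatial location
    denom = np.sqrt(beta + np.einsum('...j,ij->...i', x ** 2, gamma) + eps)
    return x / denom

# Toy usage with illustrative parameters
C = 4
x = np.random.randn(2, 8, 8, C)           # batch of 8x8 feature maps, 4 channels
beta = np.ones(C) * 0.1
gamma = np.eye(C) * 0.5 + 0.01            # mostly self-normalization
y = gdn(x, beta, gamma)
print(y.shape)                            # (2, 8, 8, 4)
```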

Short-term Maintenance Planning of Autonomous Trucks for Minimizing Economic Risk

  • Authors: Xin Tao, Jonas Mårtensson, Håkan Warnquist, Anna Pernestål
  • Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2106.01871
  • Pdf link: https://arxiv.org/pdf/2106.01871
  • Abstract
    New autonomous driving technologies are emerging every day and some of them have been commercially applied in the real world. While benefiting from these technologies, autonomous trucks are facing new challenges in short-term maintenance planning, which directly influences the truck operator's profit. In this paper, we implement a vehicle health management system by addressing the maintenance planning issues of autonomous trucks on a transport mission. We also present a maintenance planning model using a risk-based decision-making method, which identifies the maintenance decision with minimal economic risk of the truck company. Both availability losses and maintenance costs are considered when evaluating the economic risk. We demonstrate the proposed model by numerical experiments illustrating real-world scenarios. In the experiments, compared to three baseline methods, the expected economic risk of the proposed method is reduced by up to 47%. We also conduct sensitivity analyses of different model parameters. The analyses show that the economic risk significantly decreases when the estimation accuracy of remaining useful life, the maximal allowed time of delivery delay before order cancellation, or the number of workshops increases. The experimental results contribute to identifying future research and development priorities for autonomous trucks from an economic perspective.

New submissions for Thu, 11 Nov 21

Keyword: SLAM

TomoSLAM: factor graph optimization for rotation angle refinement in microtomography

  • Authors: Mark Griguletskii, Mikhail Chekanov, Oleg Shipitko
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.05562
  • Pdf link: https://arxiv.org/pdf/2111.05562
  • Abstract
    In computed tomography (CT), the relative trajectories of a sample, a detector, and a signal source are traditionally considered to be known, since they are caused by the intentional preprogrammed movement of the instrument parts. However, due to mechanical backlashes, rotation sensor measurement errors, and thermal deformations, the real trajectory differs from the desired one. This negatively affects the resulting quality of tomographic reconstruction. Neither calibration nor preliminary adjustment of the device completely eliminates the inaccuracy of the trajectory, but both significantly increase the cost of instrument maintenance. A number of approaches to this problem are based on an automatic refinement of the source and sensor position estimate relative to the sample for each projection (at each time step) during the reconstruction process. A similar problem of position refinement while observing different images of an object from different angles is well known in robotics (particularly in mobile robots and self-driving vehicles) and is called Simultaneous Localization And Mapping (SLAM). The scientific novelty of this work is to consider the problem of trajectory refinement in microtomography as a SLAM problem. This is achieved by extracting Speeded Up Robust Features (SURF) from X-ray projections, filtering matches with Random Sample Consensus (RANSAC), calculating angles between projections, and using them in a factor graph in combination with stepper motor control signals in order to refine rotation angles.
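
As a rough illustration of the feature-matching-plus-RANSAC step described in this abstract, the sketch below matches two projections with OpenCV and keeps only geometrically consistent correspondences. The paper uses SURF (available only via opencv-contrib), so ORB is substituted here as a freely available detector, and the image file names are placeholders.

```python
import cv2
import numpy as np

# Placeholder paths to two X-ray projections
img1 = cv2.imread("projection_000.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("projection_001.png", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and descriptors (ORB stands in for SURF here)
orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force Hamming matching with cross-check
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(des1, des2)

# Filter matches geometrically with RANSAC on a homography model
src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, mask = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=3.0)
inliers = [m for m, ok in zip(matches, mask.ravel()) if ok]
print(f"{len(inliers)} inlier matches out of {len(matches)}")
```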

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: autonomous driving

Spatially and Seamlessly Hierarchical Reinforcement Learning for State Space and Policy space in Autonomous Driving

  • Authors: Jaehyun Kim, Jaeseung Jeong
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.05479
  • Pdf link: https://arxiv.org/pdf/2111.05479
  • Abstract
    Despite advances in hierarchical reinforcement learning, its applications to path planning in autonomous driving on highways are challenging. One reason is that conventional hierarchical reinforcement learning approaches are not amenable to autonomous driving due to its riskiness: the agent must move avoiding multiple obstacles such as other agents that are highly unpredictable, thus safe regions are small, scattered, and changeable over time. To overcome this challenge, we propose a spatially hierarchical reinforcement learning method for state space and policy space. The high-level policy selects not only behavioral sub-policy but also regions to pay mind to in state space and for outline in policy space. Subsequently, the low-level policy elaborates the short-term goal position of the agent within the outline of the region selected by the high-level command. The network structure and optimization suggested in our method are as concise as those of single-level methods. Experiments on the environment with various shapes of roads showed that our method finds the nearly optimal policies from early episodes, outperforming a baseline hierarchical reinforcement learning method, especially in narrow and complex roads. The resulting trajectories on the roads were similar to those of human strategies on the behavioral planning level.

Keyword: mapping

TomoSLAM: factor graph optimization for rotation angle refinement in microtomography

  • Authors: Mark Griguletskii, Mikhail Chekanov, Oleg Shipitko
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.05562
  • Pdf link: https://arxiv.org/pdf/2111.05562
  • Abstract
    In computed tomography (CT), the relative trajectories of a sample, a detector, and a signal source are traditionally considered to be known, since they are caused by the intentional preprogrammed movement of the instrument parts. However, due to mechanical backlashes, rotation sensor measurement errors, and thermal deformations, the real trajectory differs from the desired one. This negatively affects the resulting quality of tomographic reconstruction. Neither calibration nor preliminary adjustment of the device completely eliminates the inaccuracy of the trajectory, but both significantly increase the cost of instrument maintenance. A number of approaches to this problem are based on an automatic refinement of the source and sensor position estimate relative to the sample for each projection (at each time step) during the reconstruction process. A similar problem of position refinement while observing different images of an object from different angles is well known in robotics (particularly in mobile robots and self-driving vehicles) and is called Simultaneous Localization And Mapping (SLAM). The scientific novelty of this work is to consider the problem of trajectory refinement in microtomography as a SLAM problem. This is achieved by extracting Speeded Up Robust Features (SURF) from X-ray projections, filtering matches with Random Sample Consensus (RANSAC), calculating angles between projections, and using them in a factor graph in combination with stepper motor control signals in order to refine rotation angles.

Porting incompressible flow matrix assembly to FPGAs for accelerating HPC engineering simulations

  • Authors: Nick Brown
  • Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
  • Arxiv link: https://arxiv.org/abs/2111.05651
  • Pdf link: https://arxiv.org/pdf/2111.05651
  • Abstract
    Engineering is an important domain for supercomputing, with the Alya model being a popular code for undertaking such simulations. With ever increasing demand from users to model larger, more complex systems at reduced time to solution it is important to explore the role that novel hardware technologies, such as FPGAs, can play in accelerating these workloads on future exascale systems. In this paper we explore the porting of Alya's incompressible flow matrix assembly kernel, which accounts for a large proportion of the model runtime, onto FPGAs. After describing in detail successful strategies for optimisation at the kernel level, we then explore sharing the workload between the FPGA and host CPU, mapping most appropriate parts of the kernel between these technologies, enabling us to more effectively exploit the FPGA. We then compare the performance of our approach on a Xilinx Alveo U280 against a 24-core Xeon Platinum CPU and Nvidia V100 GPU, with the FPGA significantly out-performing the CPU and performing comparably against the GPU, whilst drawing substantially less power. The result of this work is both an experience report describing appropriate dataflow optimisations which we believe can be applied more widely as a case-study across HPC codes, and a performance comparison for this specific workload that demonstrates the potential for FPGAs in accelerating HPC engineering simulations.

STNN-DDI: A Substructure-aware Tensor Neural Network to Predict Drug-Drug Interactions

  • Authors: Hui Yu, ShiYu Zhao, JianYu Shi
  • Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2111.05708
  • Pdf link: https://arxiv.org/pdf/2111.05708
  • Abstract
    Motivation: Computational prediction of multiple-type drug-drug interaction (DDI) helps reduce unexpected side effects in poly-drug treatments. Although existing computational approaches achieve inspiring results, they ignore that the action of a drug is mainly caused by its chemical substructures. In addition, their interpretability is still weak. Results: In this paper, by supposing that the interactions between two given drugs are caused by their local chemical structures (substructures) and their DDI types are determined by the linkages between different substructure sets, we design a novel Substructure-aware Tensor Neural Network model for DDI prediction (STNN-DDI). The proposed model learns a 3-D tensor of (substructure, interaction type, substructure) triplets, which characterizes a substructure-substructure interaction (SSI) space. According to a list of predefined substructures with specific chemical meanings, the mapping of drugs into this SSI space enables STNN-DDI to perform the multiple-type DDI prediction in both transductive and inductive scenarios in a unified form with an explicable manner. The comparison with deep learning-based state-of-the-art baselines demonstrates the superiority of STNN-DDI with the significant improvement of AUC, AUPR, Accuracy, and Precision. More importantly, case studies illustrate its interpretability by both revealing a crucial substructure pair across drugs regarding a DDI type of interest and uncovering interaction type-specific substructure pairs in a given DDI. In summary, STNN-DDI provides an effective approach to predicting DDIs as well as explaining the interaction mechanisms among drugs.
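
To make the (substructure, interaction type, substructure) tensor idea concrete, here is a toy NumPy scoring routine. The tensor, fingerprints, and dimensions are all hypothetical stand-ins, not the trained STNN-DDI model.

```python
import numpy as np

rng = np.random.default_rng(0)

n_sub, n_types = 50, 8                     # hypothetical sizes
T = rng.random((n_sub, n_types, n_sub))    # (substructure, interaction type, substructure) tensor

# Binary substructure fingerprints for two drugs (hypothetical)
drug_a = rng.integers(0, 2, n_sub).astype(float)
drug_b = rng.integers(0, 2, n_sub).astype(float)

# Score every interaction type by contracting both substructure axes
scores = np.einsum('i,itj,j->t', drug_a, T, drug_b)
probs = 1.0 / (1.0 + np.exp(-scores))      # squash to pseudo-probabilities
print(probs.round(3))
```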

Keyword: localization

MNet-Sim: A Multi-layered Semantic Similarity Network to Evaluate Sentence Similarity

  • Authors: Manuela Nayantara Jeyaraj, Dharshana Kasthurirathna
  • Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2111.05412
  • Pdf link: https://arxiv.org/pdf/2111.05412
  • Abstract
    Similarity is a comparative-subjective measure that varies with the domain within which it is considered. In several NLP applications such as document classification, pattern recognition, chatbot question-answering, sentiment analysis, etc., identifying an accurate similarity score for sentence pairs has become a crucial area of research. In the existing models that assess similarity, the limitation of effectively computing this similarity based on contextual comparisons, the localization due to the centering theory, and the lack of non-semantic textual comparisons have proven to be drawbacks. Hence, this paper presents a multi-layered semantic similarity network model built upon multiple similarity measures that render an overall sentence similarity score based on the principles of Network Science, neighboring weighted relational edges, and a proposed extended node similarity computation formula. The proposed multi-layered network model was evaluated and tested against established state-of-the-art models and is shown to have demonstrated better performance scores in assessing sentence similarity.

Towards Active Vision for Action Localization with Reactive Control and Predictive Learning

  • Authors: Shubham Trehan, Sathyanarayanan N. Aakur
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.05448
  • Pdf link: https://arxiv.org/pdf/2111.05448
  • Abstract
    Visual event perception tasks such as action localization have primarily focused on supervised learning settings under a static observer, i.e., the camera is static and cannot be controlled by an algorithm. They are often restricted by the quality, quantity, and diversity of \textit{annotated} training data and do not often generalize to out-of-domain samples. In this work, we tackle the problem of active action localization where the goal is to localize an action while controlling the geometric and physical parameters of an active camera to keep the action in the field of view without training data. We formulate an energy-based mechanism that combines predictive learning and reactive control to perform active action localization without rewards, which can be sparse or non-existent in real-world environments. We perform extensive experiments in both simulated and real-world environments on two tasks - active object tracking and active action localization. We demonstrate that the proposed approach can generalize to different tasks and environments in a streaming fashion, without explicit rewards or training. We show that the proposed approach outperforms unsupervised baselines and obtains competitive performance compared to those trained with reinforcement learning.

Automated Pulmonary Embolism Detection from CTPA Images Using an End-to-End Convolutional Neural Network

  • Authors: Yi Lin, Jianchao Su, Xiang Wang, Xiang Li, Jingen Liu, Kwang-Ting Cheng, Xin Yang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.05506
  • Pdf link: https://arxiv.org/pdf/2111.05506
  • Abstract
    Automated methods for detecting pulmonary embolisms (PEs) on CT pulmonary angiography (CTPA) images are of high demand. Existing methods typically employ separate steps for PE candidate detection and false positive removal, without considering the ability of the other step. As a result, most existing methods usually suffer from a high false positive rate in order to achieve an acceptable sensitivity. This study presents an end-to-end trainable convolutional neural network (CNN) where the two steps are optimized jointly. The proposed CNN consists of three concatenated subnets: 1) a novel 3D candidate proposal network for detecting cubes containing suspected PEs, 2) a 3D spatial transformation subnet for generating fixed-sized vessel-aligned image representation for candidates, and 3) a 2D classification network which takes the three cross-sections of the transformed cubes as input and eliminates false positives. We have evaluated our approach using the 20 CTPA test dataset from the PE challenge, achieving a sensitivity of 78.9%, 80.7% and 80.7% at 2 false positives per volume at 0mm, 2mm and 5mm localization error, which is superior to the state-of-the-art methods. We have further evaluated our system on our own dataset consisting of 129 CTPA data with a total of 269 emboli. Our system achieves a sensitivity of 63.2%, 78.9% and 86.8% at 2 false positives per volume at 0mm, 2mm and 5mm localization error.

Space-Time Memory Network for Sounding Object Localization in Videos

  • Authors: Sizhe Li, Yapeng Tian, Chenliang Xu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.05526
  • Pdf link: https://arxiv.org/pdf/2111.05526
  • Abstract
    Leveraging temporal synchronization and association within sight and sound is an essential step towards robust localization of sounding objects. To this end, we propose a space-time memory network for sounding object localization in videos. It can simultaneously learn spatio-temporal attention over both uni-modal and cross-modal representations from audio and visual modalities. We show and analyze both quantitatively and qualitatively the effectiveness of incorporating spatio-temporal learning in localizing audio-visual objects. We demonstrate that our approach generalizes over various complex audio-visual scenes and outperforms recent state-of-the-art methods.

TomoSLAM: factor graph optimization for rotation angle refinement in microtomography

  • Authors: Mark Griguletskii, Mikhail Chekanov, Oleg Shipitko
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.05562
  • Pdf link: https://arxiv.org/pdf/2111.05562
  • Abstract
    In computed tomography (CT), the relative trajectories of a sample, a detector, and a signal source are traditionally considered to be known, since they are caused by the intentional preprogrammed movement of the instrument parts. However, due to mechanical backlashes, rotation sensor measurement errors, and thermal deformations, the real trajectory differs from the desired one. This negatively affects the resulting quality of tomographic reconstruction. Neither calibration nor preliminary adjustment of the device completely eliminates the inaccuracy of the trajectory, but both significantly increase the cost of instrument maintenance. A number of approaches to this problem are based on an automatic refinement of the source and sensor position estimate relative to the sample for each projection (at each time step) during the reconstruction process. A similar problem of position refinement while observing different images of an object from different angles is well known in robotics (particularly in mobile robots and self-driving vehicles) and is called Simultaneous Localization And Mapping (SLAM). The scientific novelty of this work is to consider the problem of trajectory refinement in microtomography as a SLAM problem. This is achieved by extracting Speeded Up Robust Features (SURF) from X-ray projections, filtering matches with Random Sample Consensus (RANSAC), calculating angles between projections, and using them in a factor graph in combination with stepper motor control signals in order to refine rotation angles.

New submissions for Tue, 16 Nov 21

Keyword: SLAM

There is no result

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

Observation Contribution Theory for Pose Estimation Accuracy

  • Authors: Zeyu Wan, Yu Zhang, Bin He, Zhuofan Cui, Weichen Dai, Lipu Zhou, Guoquan Huang
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.07723
  • Pdf link: https://arxiv.org/pdf/2111.07723
  • Abstract
    The improvement of pose estimation accuracy is currently the fundamental problem in mobile robots. This study aims to improve the use of observations to enhance accuracy. The selection of feature points affects the accuracy of pose estimation, leading to the question of how the contribution of observation influences the system. Accordingly, the contribution of information to the pose estimation process is analyzed. Moreover, the uncertainty model, sensitivity model, and contribution theory are formulated, providing a method for calculating the contribution of every residual term. The proposed selection method has been theoretically proven capable of achieving a global statistical optimum. The proposed method is tested on artificial data simulations and compared with the KITTI benchmark. The experiments revealed superior results in contrast to ALOAM and MLOAM. The proposed algorithm is implemented in LiDAR odometry and LiDAR Inertial odometry both indoors and outdoors using diverse LiDAR sensors with different scan modes, demonstrating its effectiveness in improving pose estimation accuracy. A new configuration of two laser scan sensors is subsequently inferred. The configuration is valid for three-dimensional pose localization in a prior map and yields results at the centimeter level.

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

Background-Aware 3D Point Cloud Segmentation with Dynamic Point Feature Aggregation

  • Authors: Jiajing Chen, Burak Kakillioglu, Senem Velipasalar
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.07248
  • Pdf link: https://arxiv.org/pdf/2111.07248
  • Abstract
    With the proliferation of Lidar sensors and 3D vision cameras, 3D point cloud analysis has attracted significant attention in recent years. After the success of the pioneer work PointNet, deep learning-based methods have been increasingly applied to various tasks, including 3D point cloud segmentation and 3D object classification. In this paper, we propose a novel 3D point cloud learning network, referred to as Dynamic Point Feature Aggregation Network (DPFA-Net), by selectively performing the neighborhood feature aggregation with dynamic pooling and an attention mechanism. DPFA-Net has two variants for semantic segmentation and classification of 3D point clouds. As the core module of the DPFA-Net, we propose a Feature Aggregation layer, in which features of the dynamic neighborhood of each point are aggregated via a self-attention mechanism. In contrast to other segmentation models, which aggregate features from fixed neighborhoods, our approach can aggregate features from different neighbors in different layers providing a more selective and broader view to the query points, and focusing more on the relevant features in a local neighborhood. In addition, to further improve the performance of the proposed semantic segmentation model, we present two novel approaches, namely Two-Stage BF-Net and BF-Regularization to exploit the background-foreground information. Experimental results show that the proposed DPFA-Net achieves the state-of-the-art overall accuracy score for semantic segmentation on the S3DIS dataset, and provides a consistently satisfactory performance across different tasks of semantic segmentation, part segmentation, and 3D object classification. It is also computationally more efficient compared to other methods.
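
The core idea of aggregating each point's neighborhood features with attention can be sketched in a few lines. The toy below is a generic single-head version with brute-force neighbor search, not the actual DPFA-Net layer, and the sizes are arbitrary.

```python
import numpy as np

def knn_attention_aggregate(xyz, feats, k=8):
    """For each point, aggregate the features of its k nearest neighbors
    with dot-product softmax attention (toy, single-head)."""
    n = xyz.shape[0]
    # Pairwise squared distances and k nearest neighbors (brute force)
    d2 = ((xyz[:, None, :] - xyz[None, :, :]) ** 2).sum(-1)
    nbrs = np.argsort(d2, axis=1)[:, :k]                     # (n, k), includes the point itself

    out = np.empty_like(feats)
    for i in range(n):
        neigh = feats[nbrs[i]]                               # (k, c)
        logits = neigh @ feats[i]                            # (k,)
        w = np.exp(logits - logits.max())
        w /= w.sum()
        out[i] = w @ neigh                                   # attention-weighted sum
    return out

pts = np.random.randn(128, 3)
f = np.random.randn(128, 16)
agg = knn_attention_aggregate(pts, f)
print(agg.shape)   # (128, 16)
```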

Observation Contribution Theory for Pose Estimation Accuracy

  • Authors: Zeyu Wan, Yu Zhang, Bin He, Zhuofan Cui, Weichen Dai, Lipu Zhou, Guoquan Huang
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.07723
  • Pdf link: https://arxiv.org/pdf/2111.07723
  • Abstract
    The improvement of pose estimation accuracy is currently the fundamental problem in mobile robots. This study aims to improve the use of observations to enhance accuracy. The selection of feature points affects the accuracy of pose estimation, leading to the question of how the contribution of observation influences the system. Accordingly, the contribution of information to the pose estimation process is analyzed. Moreover, the uncertainty model, sensitivity model, and contribution theory are formulated, providing a method for calculating the contribution of every residual term. The proposed selection method has been theoretically proven capable of achieving a global statistical optimum. The proposed method is tested on artificial data simulations and compared with the KITTI benchmark. The experiments revealed superior results in contrast to ALOAM and MLOAM. The proposed algorithm is implemented in LiDAR odometry and LiDAR Inertial odometry both indoors and outdoors using diverse LiDAR sensors with different scan modes, demonstrating its effectiveness in improving pose estimation accuracy. A new configuration of two laser scan sensors is subsequently inferred. The configuration is valid for three-dimensional pose localization in a prior map and yields results at the centimeter level.

Towards Optimal Strategies for Training Self-Driving Perception Models in Simulation

  • Authors: David Acuna, Jonah Philion, Sanja Fidler
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.07971
  • Pdf link: https://arxiv.org/pdf/2111.07971
  • Abstract
    Autonomous driving relies on a huge volume of real-world data to be labeled to high precision. Alternative solutions seek to exploit driving simulators that can generate large amounts of labeled data with a plethora of content variations. However, the domain gap between the synthetic and real data remains, raising the following important question: What are the best ways to utilize a self-driving simulator for perception tasks? In this work, we build on top of recent advances in domain-adaptation theory, and from this perspective, propose ways to minimize the reality gap. We primarily focus on the use of labels in the synthetic domain alone. Our approach introduces both a principled way to learn neural-invariant representations and a theoretically inspired view on how to sample the data from the simulator. Our method is easy to implement in practice as it is agnostic of the network architecture and the choice of the simulator. We showcase our approach on the bird's-eye-view vehicle segmentation task with multi-sensor data (cameras, lidar) using an open-source simulator (CARLA), and evaluate the entire framework on a real-world dataset (nuScenes). Last but not least, we show what types of variations (e.g. weather conditions, number of assets, map design, and color diversity) matter to perception networks when trained with driving simulators, and which ones can be compensated for with our domain adaptation technique.

Keyword: loop detection

There is no result

Keyword: autonomous driving

DriverGym: Democratising Reinforcement Learning for Autonomous Driving

  • Authors: Parth Kothari, Christian Perone, Luca Bergamini, Alexandre Alahi, Peter Ondruska
  • Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.06889
  • Pdf link: https://arxiv.org/pdf/2111.06889
  • Abstract
    Despite promising progress in reinforcement learning (RL), developing algorithms for autonomous driving (AD) remains challenging: one of the critical issues being the absence of an open-source platform capable of training and effectively validating the RL policies on real-world data. We propose DriverGym, an open-source OpenAI Gym-compatible environment specifically tailored for developing RL algorithms for autonomous driving. DriverGym provides access to more than 1000 hours of expert logged data and also supports reactive and data-driven agent behavior. The performance of an RL policy can be easily validated on real-world data using our extensive and flexible closed-loop evaluation protocol. In this work, we also provide behavior cloning baselines using supervised learning and RL, trained in DriverGym. We make DriverGym code, as well as all the baselines publicly available to further stimulate development from the community.

Lifelong Vehicle Trajectory Prediction Framework Based on Generative Replay

  • Authors: Peng Bao, Zonghai Chen, Jikai Wang, Deyun Dai, Hao Zhao
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.07511
  • Pdf link: https://arxiv.org/pdf/2111.07511
  • Abstract
    Accurate trajectory prediction of vehicles is essential for reliable autonomous driving. To maintain consistent performance as a vehicle drives around different cities, it is crucial to adapt to changing traffic circumstances and achieve a lifelong trajectory prediction model. To realize this, catastrophic forgetting is the main problem to be addressed. In this paper, a divergence measurement method based on conditional Kullback-Leibler divergence is proposed first to evaluate spatiotemporal dependency differences among varied driving circumstances. Then, based on generative replay, a novel lifelong vehicle trajectory prediction framework is developed. The framework consists of a conditional generation model and a vehicle trajectory prediction model. The conditional generation model is a generative adversarial network conditioned on the position configuration of vehicles. After learning and merging the trajectory distribution of vehicles across different cities, the generation model replays trajectories with prior samplings as inputs, which alleviates catastrophic forgetting. The vehicle trajectory prediction model is trained on the replayed trajectories and achieves consistent prediction performance on visited cities. A lifelong experiment setup is established on four open datasets including five tasks. Spatiotemporal dependency divergence is calculated for different tasks. Despite these divergences, the proposed framework exhibits lifelong learning ability and achieves consistent performance on all tasks.

Towards Optimal Strategies for Training Self-Driving Perception Models in Simulation

  • Authors: David Acuna, Jonah Philion, Sanja Fidler
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.07971
  • Pdf link: https://arxiv.org/pdf/2111.07971
  • Abstract
    Autonomous driving relies on a huge volume of real-world data to be labeled to high precision. Alternative solutions seek to exploit driving simulators that can generate large amounts of labeled data with a plethora of content variations. However, the domain gap between the synthetic and real data remains, raising the following important question: What are the best ways to utilize a self-driving simulator for perception tasks? In this work, we build on top of recent advances in domain-adaptation theory, and from this perspective, propose ways to minimize the reality gap. We primarily focus on the use of labels in the synthetic domain alone. Our approach introduces both a principled way to learn neural-invariant representations and a theoretically inspired view on how to sample the data from the simulator. Our method is easy to implement in practice as it is agnostic of the network architecture and the choice of the simulator. We showcase our approach on the bird's-eye-view vehicle segmentation task with multi-sensor data (cameras, lidar) using an open-source simulator (CARLA), and evaluate the entire framework on a real-world dataset (nuScenes). Last but not least, we show what types of variations (e.g. weather conditions, number of assets, map design, and color diversity) matter to perception networks when trained with driving simulators, and which ones can be compensated for with our domain adaptation technique.

Keyword: mapping

Learning Generalized Gumbel-max Causal Mechanisms

  • Authors: Guy Lorberbom, Daniel D. Johnson, Chris J. Maddison, Daniel Tarlow, Tamir Hazan
  • Subjects: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2111.06888
  • Pdf link: https://arxiv.org/pdf/2111.06888
  • Abstract
    To perform counterfactual reasoning in Structural Causal Models (SCMs), one needs to know the causal mechanisms, which provide factorizations of conditional distributions into noise sources and deterministic functions mapping realizations of noise to samples. Unfortunately, the causal mechanism is not uniquely identified by data that can be gathered by observing and interacting with the world, so there remains the question of how to choose causal mechanisms. In recent work, Oberst & Sontag (2019) propose Gumbel-max SCMs, which use Gumbel-max reparameterizations as the causal mechanism due to an intuitively appealing counterfactual stability property. In this work, we instead argue for choosing a causal mechanism that is best under a quantitative criteria such as minimizing variance when estimating counterfactual treatment effects. We propose a parameterized family of causal mechanisms that generalize Gumbel-max. We show that they can be trained to minimize counterfactual effect variance and other losses on a distribution of queries of interest, yielding lower variance estimates of counterfactual treatment effect than fixed alternatives, also generalizing to queries not seen at training time.
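
For readers unfamiliar with the Gumbel-max reparameterization that this work generalizes, the sketch below shows the standard trick of sampling a categorical variable as an argmax over Gumbel-perturbed log-probabilities. It is the textbook mechanism, not the learned generalization proposed in the paper.

```python
import numpy as np

def gumbel_max_sample(logits, rng):
    """Sample a category via argmax(logits + Gumbel noise)."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    return int(np.argmax(logits + g))

rng = np.random.default_rng(42)
p = np.array([0.1, 0.6, 0.3])
logits = np.log(p)

samples = [gumbel_max_sample(logits, rng) for _ in range(10000)]
print(np.bincount(samples) / len(samples))   # should be close to [0.1, 0.6, 0.3]
```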

Reliably-stabilizing piecewise-affine neural network controllers

  • Authors: Filippo Fabiani, Paul J. Goulart
  • Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
  • Arxiv link: https://arxiv.org/abs/2111.07183
  • Pdf link: https://arxiv.org/pdf/2111.07183
  • Abstract
    A common problem affecting neural network (NN) approximations of model predictive control (MPC) policies is the lack of analytical tools to assess the stability of the closed-loop system under the action of the NN-based controller. We present a general procedure to quantify the performance of such a controller, or to design minimum complexity NNs with rectified linear units (ReLUs) that preserve the desirable properties of a given MPC scheme. By quantifying the approximation error between NN-based and MPC-based state-to-input mappings, we first establish suitable conditions involving two key quantities, the worst-case error and the Lipschitz constant, guaranteeing the stability of the closed-loop system. We then develop an offline, mixed-integer optimization-based method to compute those quantities exactly. Together these techniques provide conditions sufficient to certify the stability and performance of a ReLU-based approximation of an MPC control law.
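
The paper computes the worst-case approximation error and Lipschitz constant exactly with mixed-integer optimization. As a much coarser point of reference, the sketch below evaluates the standard upper bound on a ReLU network's Lipschitz constant given by the product of layer spectral norms; the weights are random placeholders, not an actual controller.

```python
import numpy as np

def lipschitz_upper_bound(weights):
    """Product of spectral norms: an upper bound on the Lipschitz
    constant of a ReLU network with the given weight matrices."""
    bound = 1.0
    for W in weights:
        bound *= np.linalg.norm(W, ord=2)   # largest singular value
    return bound

rng = np.random.default_rng(0)
layers = [rng.standard_normal((16, 4)),     # placeholder controller weights (state dim 4)
          rng.standard_normal((16, 16)),
          rng.standard_normal((2, 16))]     # 2 control inputs
print(f"Lipschitz upper bound: {lipschitz_upper_bound(layers):.2f}")
```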

PhysXNet: A Customizable Approach for Learning Cloth Dynamics on Dressed People

  • Authors: Jordi Sanchez-Riera, Albert Pumarola, Francesc Moreno-Noguer
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.07195
  • Pdf link: https://arxiv.org/pdf/2111.07195
  • Abstract
    We introduce PhysXNet, a learning-based approach to predict the dynamics of deformable clothes given 3D skeleton motion sequences of humans wearing these clothes. The proposed model is adaptable to a large variety of garments and changing topologies, without need of being retrained. Such simulations are typically carried out by physics engines that require manual human expertise and are subject to computationally intensive computations. PhysXNet, by contrast, is a fully differentiable deep network that at inference is able to estimate the geometry of dense cloth meshes in a matter of milliseconds, and thus, can be readily deployed as a layer of a larger deep learning architecture. This efficiency is achieved thanks to the specific parameterization of the clothes we consider, based on 3D UV maps encoding spatial garment displacements. The problem is then formulated as a mapping between the human kinematics space (represented also by 3D UV maps of the undressed body mesh) into the clothes displacement UV maps, which we learn using a conditional GAN with a discriminator that enforces feasible deformations. We train our model simultaneously for three garment templates, tops, bottoms and dresses, for which we simulate deformations under 50 different human actions. Nevertheless, the UV map representation we consider allows encapsulating many different cloth topologies, and at test time we can simulate garments even if we did not specifically train for them. A thorough evaluation demonstrates that PhysXNet delivers cloth deformations very close to those computed with the physical engine, opening the door to be effectively integrated within deep learning pipelines.

Color Mapping Functions For HDR Panorama Imaging: Weighted Histogram Averaging

  • Authors: Yilun Xu, Zhengguo Li, Weihai Chen, Changyun Wen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.07283
  • Pdf link: https://arxiv.org/pdf/2111.07283
  • Abstract
    It is challenging to stitch multiple images with different exposures due to possible color distortion and loss of details in the brightest and darkest regions of input images. In this paper, a novel color mapping algorithm is first proposed by introducing a new concept of weighted histogram averaging (WHA). The proposed WHA algorithm leverages the correspondence between the histogram bins of two images which are built up by using the non-decreasing property of the color mapping functions (CMFs). The WHA algorithm is then adopted to synthesize a set of differently exposed panorama images. The intermediate panorama images are finally fused via a state-of-the-art multi-scale exposure fusion (MEF) algorithm to produce the final panorama image. Extensive experiments indicate that the proposed WHA algorithm significantly surpasses the related state-of-the-art color mapping methods. The proposed high dynamic range (HDR) stitching algorithm based on MEF also preserves details in the brightest and darkest regions of the input images well. The related materials will be publicly accessible at https://github.com/yilun-xu/WHA for reproducible research.
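
The WHA algorithm itself is not reproduced here, but the sketch below shows classic histogram matching, which likewise produces a non-decreasing color mapping function between two differently exposed images. It is included only to make the notion of a monotone CMF concrete; the images are toy arrays.

```python
import numpy as np

def histogram_matching_cmf(src, ref):
    """Return a monotone 256-entry lookup table that maps src intensities
    so their histogram matches the reference image (classic histogram
    matching, not the paper's weighted histogram averaging)."""
    src_hist = np.bincount(src.ravel(), minlength=256).astype(float)
    ref_hist = np.bincount(ref.ravel(), minlength=256).astype(float)
    src_cdf = np.cumsum(src_hist) / src_hist.sum()
    ref_cdf = np.cumsum(ref_hist) / ref_hist.sum()
    # For each source level, pick the reference level with the closest CDF value
    return np.interp(src_cdf, ref_cdf, np.arange(256)).astype(np.uint8)

rng = np.random.default_rng(1)
dark = rng.integers(0, 128, (64, 64), dtype=np.uint8)     # toy under-exposed image
bright = rng.integers(64, 256, (64, 64), dtype=np.uint8)  # toy well-exposed image
lut = histogram_matching_cmf(dark, bright)
mapped = lut[dark]
print(mapped.min(), mapped.max())
```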

TANDEM: Tracking and Dense Mapping in Real-time using Deep Multi-view Stereo

  • Authors: Lukas Koestler, Nan Yang, Niclas Zeller, Daniel Cremers
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.07418
  • Pdf link: https://arxiv.org/pdf/2111.07418
  • Abstract
    In this paper, we present TANDEM a real-time monocular tracking and dense mapping framework. For pose estimation, TANDEM performs photometric bundle adjustment based on a sliding window of keyframes. To increase the robustness, we propose a novel tracking front-end that performs dense direct image alignment using depth maps rendered from a global model that is built incrementally from dense depth predictions. To predict the dense depth maps, we propose Cascade View-Aggregation MVSNet (CVA-MVSNet) that utilizes the entire active keyframe window by hierarchically constructing 3D cost volumes with adaptive view aggregation to balance the different stereo baselines between the keyframes. Finally, the predicted depth maps are fused into a consistent global map represented as a truncated signed distance function (TSDF) voxel grid. Our experimental results show that TANDEM outperforms other state-of-the-art traditional and learning-based monocular visual odometry (VO) methods in terms of camera tracking. Moreover, TANDEM shows state-of-the-art real-time 3D reconstruction performance.
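
As a minimal illustration of fusing per-frame depth into a truncated signed distance function (TSDF) voxel grid, the sketch below applies the usual weighted running-average update. It is the generic formulation with toy sizes, not TANDEM's implementation.

```python
import numpy as np

def tsdf_update(tsdf, weight, sdf_obs, trunc=0.05, max_weight=50.0):
    """Fuse one frame of observed signed distances into the global grid
    with the standard weighted running average (generic TSDF fusion)."""
    d = np.clip(sdf_obs / trunc, -1.0, 1.0)          # truncate and normalize
    valid = sdf_obs > -trunc                         # ignore voxels far behind the surface
    w_new = np.where(valid, 1.0, 0.0)
    fused = (tsdf * weight + d * w_new) / np.maximum(weight + w_new, 1e-6)
    tsdf = np.where(valid, fused, tsdf)
    weight = np.minimum(weight + w_new, max_weight)
    return tsdf, weight

grid = np.ones((32, 32, 32))                       # initialize TSDF to "far"
w = np.zeros_like(grid)
obs = np.random.uniform(-0.1, 0.1, grid.shape)     # toy per-voxel signed distances
grid, w = tsdf_update(grid, w, obs)
print(grid.min(), grid.max())
```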

PatchGraph: In-hand tactile tracking with learned surface normals

  • Authors: Paloma Sodhi, Michael Kaess, Mustafa Mukadam, Stuart Anderson
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.07524
  • Pdf link: https://arxiv.org/pdf/2111.07524
  • Abstract
    We address the problem of tracking 3D object poses from touch during in-hand manipulations. Specifically, we look at tracking small objects using vision-based tactile sensors that provide high-dimensional tactile image measurements at the point of contact. While prior work has relied on a-priori information about the object being localized, we remove this requirement. Our key insight is that an object is composed of several local surface patches, each informative enough to achieve reliable object tracking. Moreover, we can recover the geometry of this local patch online by extracting local surface normal information embedded in each tactile image. We propose a novel two-stage approach. First, we learn a mapping from tactile images to surface normals using an image translation network. Second, we use these surface normals within a factor graph to both reconstruct a local patch map and use it to infer 3D object poses. We demonstrate reliable object tracking for over 100 contact sequences across unique shapes with four objects in simulation and two objects in the real-world. Supplementary video: https://youtu.be/JwNTC9_nh8M

AutoGMap: Learning to Map Large-scale Sparse Graphs on Memristive Crossbars

  • Authors: Bo Lyu, Shengbo Wang, Shiping Wen, Kaibo Shi, Yin Yang, Tingwen Huang
  • Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
  • Arxiv link: https://arxiv.org/abs/2111.07684
  • Pdf link: https://arxiv.org/pdf/2111.07684
  • Abstract
    The sparse representation of graphs has shown its great potential for accelerating the computation of the graph applications (e.g. Social Networks, Knowledge Graphs) on traditional computing architectures (CPU, GPU, or TPU). But the exploration of the large-scale sparse graph computing on processing-in-memory (PIM) platforms (typically with memristive crossbars) is still in its infancy. As we look to implement the computation or storage of large-scale or batch graphs on memristive crossbars, a natural assumption would be that we need a large-scale crossbar, but with low utilization. Some recent works have questioned this assumption to avoid the waste of the storage and computational resource by "block partition", which is fixed-size, progressively scheduled, or coarse-grained, thus is not effectively sparsity-aware in our view. This work proposes the dynamic sparsity-aware mapping scheme generating method that models the problem as a sequential decision-making problem which is solved by reinforcement learning (RL) algorithm (REINFORCE). Our generating model (LSTM, combined with our dynamic-fill mechanism) generates remarkable mapping performance on a small-scale typical graph/matrix data (43% area of the original matrix with fully mapping), and two large-scale matrix data (22.5% area on qh882, and 17.1% area on qh1484). Moreover, our coding framework of the scheme is intuitive and has promising adaptability with the deployment or compilation system.

Volumetric Parameterization of the Placenta to a Flattened Template

  • Authors: S. Mazdak Abulnaga, Esra Abaci Turk, Mikhail Bessmeltsev, P. Ellen Grant, Justin Solomon, Polina Golland
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.07900
  • Pdf link: https://arxiv.org/pdf/2111.07900
  • Abstract
    We present a volumetric mesh-based algorithm for parameterizing the placenta to a flattened template to enable effective visualization of local anatomy and function. MRI shows potential as a research tool as it provides signals directly related to placental function. However, due to the curved and highly variable in vivo shape of the placenta, interpreting and visualizing these images is difficult. We address interpretation challenges by mapping the placenta so that it resembles the familiar ex vivo shape. We formulate the parameterization as an optimization problem for mapping the placental shape represented by a volumetric mesh to a flattened template. We employ the symmetric Dirichlet energy to control local distortion throughout the volume. Local injectivity in the mapping is enforced by a constrained line search during the gradient descent optimization. We validate our method using a research study of 111 placental shapes extracted from BOLD MRI images. Our mapping achieves sub-voxel accuracy in matching the template while maintaining low distortion throughout the volume. We demonstrate how the resulting flattening of the placenta improves visualization of anatomy and function. Our code is freely available at https://github.com/mabulnaga/placenta-flattening .
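
The symmetric Dirichlet energy used to control distortion has a simple per-element form. The sketch below evaluates it for a single 3x3 Jacobian purely to make the formula concrete, whereas the paper minimizes it over a full volumetric mesh under injectivity constraints.

```python
import numpy as np

def symmetric_dirichlet(J):
    """Symmetric Dirichlet distortion energy of a mapping Jacobian:
    ||J||_F^2 + ||J^{-1}||_F^2 (minimized, with value 6 in 3D, by rotations)."""
    J_inv = np.linalg.inv(J)
    return np.sum(J ** 2) + np.sum(J_inv ** 2)

identity = np.eye(3)
stretched = np.diag([2.0, 1.0, 0.5])          # anisotropic stretch
print(symmetric_dirichlet(identity))          # 6.0  (no distortion)
print(symmetric_dirichlet(stretched))         # 10.5 (penalized)
```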

Deep Semantic Manipulation of Facial Videos

  • Authors: Girish Kumar Solanki, Anastasios Roussos
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.07902
  • Pdf link: https://arxiv.org/pdf/2111.07902
  • Abstract
    Editing and manipulating facial features in videos is an interesting and important field of research with a plethora of applications, ranging from movie post-production and visual effects to realistic avatars for video games and virtual assistants. To the best of our knowledge, this paper proposes the first method to perform photorealistic manipulation of facial expressions in videos. Our method supports semantic video manipulation based on neural rendering and 3D-based facial expression modelling. We focus on interactive manipulation of the videos by altering and controlling the facial expressions, achieving promising photorealistic results. The proposed method is based on a disentangled representation and estimation of the 3D facial shape and activity, providing the user with intuitive and easy-to-use control of the facial expressions in the input video. We also introduce a user-friendly, interactive AI tool that processes human-readable semantic labels about the desired emotion manipulations in specific parts of the input video and synthesizes photorealistic manipulated videos. We achieve that by mapping the emotion labels to valence-arousal (VA) values, which in turn are mapped to disentangled 3D facial expressions through an especially designed and trained expression decoder network. The paper presents detailed qualitative and quantitative experiments, which demonstrate the effectiveness of our system and the promising results it achieves. Additional results and videos can be found at the supplementary material (https://github.com/Girish-03/DeepSemManipulation).

Keyword: localization

Commodity Wi-Fi Sensing in 10 Years: Current Status, Challenges, and Opportunities

  • Authors: Sheng Tan, Jie Yang
  • Subjects: Networking and Internet Architecture (cs.NI); Human-Computer Interaction (cs.HC)
  • Arxiv link: https://arxiv.org/abs/2111.07038
  • Pdf link: https://arxiv.org/pdf/2111.07038
  • Abstract
    The prevalence of WiFi devices and ubiquitous coverage of WiFi networks provide us the opportunity to extend WiFi capabilities beyond communication, particularly in sensing the physical environment. In this paper, we survey the evolution of WiFi sensing systems utilizing commodity devices over the past decade. The survey groups WiFi sensing systems into three main categories: activity recognition (large-scale and small-scale), object sensing, and localization. We highlight the milestone work in each category and the underlying techniques they adopted. Next, this work presents the challenges faced by existing WiFi sensing systems. Lastly, we comprehensively discuss future trends in commodity WiFi sensing.

Unsupervised Action Localization Crop in Video Retargeting for 3D ConvNets

  • Authors: Prithwish Jana, Swarnabja Bhaumik, Partha Pratim Mohanta
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.07426
  • Pdf link: https://arxiv.org/pdf/2111.07426
  • Abstract
    Untrimmed videos on social media or those captured by robots and surveillance cameras are of varied aspect ratios. However, 3D CNNs require a square-shaped video whose spatial dimension is smaller than the original one. Random or center-cropping techniques in use may leave out the video's subject altogether. To address this, we propose an unsupervised video cropping approach by shaping this as a retargeting and video-to-video synthesis problem. The synthesized video maintains 1:1 aspect ratio, smaller in size and is targeted at the video-subject throughout the whole duration. First, action localization on the individual frames is performed by identifying patches with homogeneous motion patterns and a single salient patch is pin-pointed. To avoid viewpoint jitters and flickering artifacts, any inter-frame scale or position changes among the patches is performed gradually over time. This issue is addressed with a poly-Bezier fitting in 3D space that passes through some chosen pivot timestamps and its shape is influenced by in-between control timestamps. To corroborate the effectiveness of the proposed method, we evaluate the video classification task by comparing our dynamic cropping with static random on three benchmark datasets: UCF-101, HMDB-51 and ActivityNet v1.3. The clip accuracy and top-1 accuracy for video classification after our cropping, outperform 3D CNN performances for same-sized inputs with random crop; sometimes even surpassing larger random crop sizes.
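
To make the poly-Bezier smoothing idea concrete, the sketch below evaluates one cubic Bezier segment for a crop-center coordinate over time. The control points are made up, and the paper fits a full poly-Bezier in 3D (position and scale) across pivot timestamps rather than this single toy segment.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameters t in [0, 1]."""
    t = np.asarray(t)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# Hypothetical crop-center positions (x, y) at two pivot timestamps
p0, p3 = np.array([100.0, 80.0]), np.array([220.0, 140.0])
p1, p2 = np.array([140.0, 60.0]), np.array([200.0, 160.0])   # in-between control points

ts = np.linspace(0.0, 1.0, 30)            # 30 frames between the two pivots
centers = cubic_bezier(p0, p1, p2, p3, ts)
print(centers[0], centers[-1])            # starts at p0, ends at p3
```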

FastFlow: Unsupervised Anomaly Detection and Localization via 2D Normalizing Flows

  • Authors: Jiawei Yu, Ye Zheng, Xiang Wang, Wei Li, Yushuang Wu, Rui Zhao, Liwei Wu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.07677
  • Pdf link: https://arxiv.org/pdf/2111.07677
  • Abstract
    Unsupervised anomaly detection and localization is crucial to the practical application when collecting and labeling sufficient anomaly data is infeasible. Most existing representation-based approaches extract normal image features with a deep convolutional neural network and characterize the corresponding distribution through non-parametric distribution estimation methods. The anomaly score is calculated by measuring the distance between the feature of the test image and the estimated distribution. However, current methods can not effectively map image features to a tractable base distribution and ignore the relationship between local and global features which are important to identify anomalies. To this end, we propose FastFlow implemented with 2D normalizing flows and use it as the probability distribution estimator. Our FastFlow can be used as a plug-in module with arbitrary deep feature extractors such as ResNet and vision transformer for unsupervised anomaly detection and localization. In training phase, FastFlow learns to transform the input visual feature into a tractable distribution and obtains the likelihood to recognize anomalies in inference phase. Extensive experimental results on the MVTec AD dataset show that FastFlow surpasses previous state-of-the-art methods in terms of accuracy and inference efficiency with various backbone networks. Our approach achieves 99.4% AUC in anomaly detection with high inference efficiency.
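
The principle of scoring anomalies by likelihood under a normalizing flow can be sketched with a single untrained affine coupling layer, as below. This toy only illustrates the change-of-variables computation and is far from FastFlow's 2D multi-scale architecture; all weights and feature dimensions are made up.

```python
import numpy as np

def affine_coupling_forward(x, W_s, W_t):
    """One RealNVP-style affine coupling layer (toy stand-in for a flow block).
    Returns latent z and log|det Jacobian|."""
    d = x.shape[1] // 2
    x1, x2 = x[:, :d], x[:, d:]
    s = np.tanh(x1 @ W_s)                 # scale "network": one linear layer + tanh
    t = x1 @ W_t                          # translation "network"
    z = np.concatenate([x1, x2 * np.exp(s) + t], axis=1)
    log_det = s.sum(axis=1)
    return z, log_det

def anomaly_score(x, W_s, W_t):
    """Negative log-likelihood under a standard normal base distribution."""
    z, log_det = affine_coupling_forward(x, W_s, W_t)
    log_pz = -0.5 * (z ** 2).sum(axis=1) - 0.5 * z.shape[1] * np.log(2 * np.pi)
    return -(log_pz + log_det)            # higher = more anomalous

rng = np.random.default_rng(0)
feat_dim = 8
W_s = rng.standard_normal((feat_dim // 2, feat_dim // 2)) * 0.1
W_t = rng.standard_normal((feat_dim // 2, feat_dim // 2)) * 0.1
features = rng.standard_normal((16, feat_dim))     # toy per-location backbone features
print(anomaly_score(features, W_s, W_t))
```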

Observation Contribution Theory for Pose Estimation Accuracy

  • Authors: Zeyu Wan, Yu Zhang, Bin He, Zhuofan Cui, Weichen Dai, Lipu Zhou, Guoquan Huang
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.07723
  • Pdf link: https://arxiv.org/pdf/2111.07723
  • Abstract
    The improvement of pose estimation accuracy is currently the fundamental problem in mobile robots. This study aims to improve the use of observations to enhance accuracy. The selection of feature points affects the accuracy of pose estimation, leading to the question of how the contribution of observation influences the system. Accordingly, the contribution of information to the pose estimation process is analyzed. Moreover, the uncertainty model, sensitivity model, and contribution theory are formulated, providing a method for calculating the contribution of every residual term. The proposed selection method has been theoretically proven capable of achieving a global statistical optimum. The proposed method is tested on artificial data simulations and compared with the KITTI benchmark. The experiments revealed superior results in contrast to ALOAM and MLOAM. The proposed algorithm is implemented in LiDAR odometry and LiDAR Inertial odometry both indoors and outdoors using diverse LiDAR sensors with different scan modes, demonstrating its effectiveness in improving pose estimation accuracy. A new configuration of two laser scan sensors is subsequently inferred. The configuration is valid for three-dimensional pose localization in a prior map and yields results at the centimeter level.

Beep: Fine-grained Fix Localization by Learning to Predict Buggy Code Elements

  • Authors: Shangwen Wang, Kui Liu, Bo Lin, Li Li, Jacques Klein, Xiaoguang Mao, Tegawendé F. Bissyandé
  • Subjects: Software Engineering (cs.SE)
  • Arxiv link: https://arxiv.org/abs/2111.07739
  • Pdf link: https://arxiv.org/pdf/2111.07739
  • Abstract
    Software Fault Localization refers to the activity of finding code elements (e.g., statements) that are related to a software failure. The state-of-the-art fault localization techniques, however, produce coarse-grained results that can deter manual debugging or mislead automated repair tools. In this work, we focus specifically on the fine-grained identification of code elements (i.e., tokens) that must be changed to fix a buggy program: we refer to it as fix localization. This paper introduces a neural network architecture (named Beep) that builds on AST paths to predict the buggy code element as well as the change action that must be applied to repair a program. Leveraging massive data of bugs and patches within the CoCoNut dataset, we trained a model that was (1) effective in localizing the buggy tokens with the Mean First Rank significantly higher than a statistics based baseline and a machine learning-based baseline, and (2) effective in predicting the repair operators (with the associated buggy code elements) with a Recall@1= 30-45% and the Mean First Rank=7-12 (evaluated by CoCoNut, ManySStuBs4J, and Defects4J datasets). To showcase how fine-grained fix localization can help program repair, we employ it in two repair pipelines where we use either a code completion engine to predict the correct token or a set of heuristics to search for the suitable donor code. A key strength of accurate fix localization for program repair is that it reduces the chance of patch overfitting, a challenge in generate-and-validate automated program repair: both two repair pipelines achieve a correctness ratio of 100%, i.e., all generated patches are found to be correct. Moreover, accurate fix localization helps enhance the efficiency of program repair.

FILIP: Fine-grained Interactive Language-Image Pre-Training

  • Authors: Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, Chunjing Xu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.07783
  • Pdf link: https://arxiv.org/pdf/2111.07783
  • Abstract
    Unsupervised large-scale vision-language pre-training has shown promising advances on various downstream tasks. Existing methods often model the cross-modal interaction either via the similarity of the global feature of each modality which misses sufficient information, or finer-grained interactions using cross/self-attention upon visual and textual tokens. However, cross/self-attention suffers from inferior efficiency in both training and inference. In this paper, we introduce a large-scale Fine-grained Interactive Language-Image Pre-training (FILIP) to achieve finer-level alignment through a cross-modal late interaction mechanism, which uses a token-wise maximum similarity between visual and textual tokens to guide the contrastive objective. FILIP successfully leverages the finer-grained expressiveness between image patches and textual words by modifying only contrastive loss, while simultaneously gaining the ability to pre-compute image and text representations offline at inference, keeping both large-scale training and inference efficient. Furthermore, we construct a new large-scale image-text pair dataset called FILIP300M for pre-training. Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks including zero-shot image classification and image-text retrieval. The visualization on word-patch alignment further shows that FILIP can learn meaningful fine-grained features with promising localization ability.
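
The token-wise late interaction described above boils down to a small computation on token embeddings. The sketch below shows one plausible reading of it (for each image patch, take the maximum cosine similarity over text tokens, then average), with random embeddings standing in for the actual encoders.

```python
import numpy as np

def late_interaction_similarity(img_tokens, txt_tokens):
    """Image-to-text similarity: mean over patches of the maximum cosine
    similarity against all text tokens (one reading of the late interaction)."""
    img = img_tokens / np.linalg.norm(img_tokens, axis=1, keepdims=True)
    txt = txt_tokens / np.linalg.norm(txt_tokens, axis=1, keepdims=True)
    sim = img @ txt.T                      # (num_patches, num_words)
    return sim.max(axis=1).mean()

rng = np.random.default_rng(0)
patches = rng.standard_normal((49, 64))    # 7x7 patch embeddings (toy)
words = rng.standard_normal((12, 64))      # 12 word embeddings (toy)
print(late_interaction_similarity(patches, words))
```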

New submissions for Thu, 18 Nov 21

Keyword: SLAM

Probabilistic Spatial Distribution Prior Based Attentional Keypoints Matching Network

  • Authors: Xiaoming Zhao, Jingmeng Liu, Xingming Wu, Weihai Chen, Fanghong Guo, Zhengguo Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.09006
  • Pdf link: https://arxiv.org/pdf/2111.09006
  • Abstract
    Keypoints matching is a pivotal component for many image-relevant applications such as image stitching, visual simultaneous localization and mapping (SLAM), and so on. Both handcrafted-based and recently emerged deep learning-based keypoints matching methods merely rely on keypoints and local features, while losing sight of other available sensors such as inertial measurement unit (IMU) in the above applications. In this paper, we demonstrate that the motion estimation from IMU integration can be used to exploit the spatial distribution prior of keypoints between images. To this end, a probabilistic perspective of attention formulation is proposed to integrate the spatial distribution prior into the attentional graph neural network naturally. With the assistance of spatial distribution prior, the effort of the network for modeling the hidden features can be reduced. Furthermore, we present a projection loss for the proposed keypoints matching network, which gives a smooth edge between matching and un-matching keypoints. Image matching experiments on visual SLAM datasets indicate the effectiveness and efficiency of the presented method.
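
As a rough illustration of how an IMU-derived motion prior could bias keypoint matching, the sketch below builds a Gaussian weight matrix over candidate keypoint pairs. The constant `predicted_flow` displacement and the `sigma` value are simplifying assumptions for illustration; this is not the attention formulation used in the paper.

```python
import numpy as np

def spatial_prior_weights(kpts_a, kpts_b, predicted_flow, sigma=20.0):
    """Hypothetical Gaussian prior over candidate matches.

    kpts_a: (N, 2) keypoint locations in image A
    kpts_b: (M, 2) keypoint locations in image B
    predicted_flow: (2,) displacement of image A keypoints predicted from
        IMU integration (an assumed, simplified motion model)
    Returns an (N, M) weight matrix that favours matches close to where
    the IMU says each keypoint should have moved.
    """
    expected_b = kpts_a + predicted_flow                         # where A's keypoints should land in B
    d2 = ((expected_b[:, None, :] - kpts_b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))                      # Gaussian falloff with pixel distance

# Toy usage
kpts_a = np.array([[100.0, 120.0], [300.0, 80.0]])
kpts_b = np.array([[112.0, 121.0], [310.0, 85.0], [50.0, 200.0]])
print(spatial_prior_weights(kpts_a, kpts_b, predicted_flow=np.array([10.0, 2.0])))
```

Such weights could multiply an attention or matching-score matrix before an assignment step (e.g., Sinkhorn or Hungarian matching).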

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

Learning Scene Dynamics from Point Cloud Sequences

  • Authors: Pan He, Patrick Emami, Sanjay Ranka, Anand Rangarajan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.08755
  • Pdf link: https://arxiv.org/pdf/2111.08755
  • Abstract
    Understanding 3D scenes is a critical prerequisite for autonomous agents. Recently, LiDAR and other sensors have made large amounts of data available in the form of temporal sequences of point cloud frames. In this work, we propose a novel problem -- sequential scene flow estimation (SSFE) -- that aims to predict 3D scene flow for all pairs of point clouds in a given sequence. This is unlike the previously studied problem of scene flow estimation which focuses on two frames. We introduce the SPCM-Net architecture, which solves this problem by computing multi-scale spatiotemporal correlations between neighboring point clouds and then aggregating the correlation across time with an order-invariant recurrent unit. Our experimental evaluation confirms that recurrent processing of point cloud sequences results in significantly better SSFE compared to using only two frames. Additionally, we demonstrate that this approach can be effectively modified for sequential point cloud forecasting (SPF), a related problem that demands forecasting future point cloud frames. Our experimental results are evaluated using a new benchmark for both SSFE and SPF consisting of synthetic and real datasets. Previously, datasets for scene flow estimation have been limited to two frames. We provide non-trivial extensions to these datasets for multi-frame estimation and prediction. Due to the difficulty of obtaining ground truth motion for real-world datasets, we use self-supervised training and evaluation metrics. We believe that this benchmark will be pivotal to future research in this area. All code for benchmark and models will be made accessible.

ARKitScenes -- A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

  • Authors: Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, Elad Shulman
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.08897
  • Pdf link: https://arxiv.org/pdf/2111.08897
  • Abstract
    Scene understanding is an active research area. Commercial depth sensors, such as Kinect, have enabled the release of several RGB-D datasets over the past few years which spawned novel methods in 3D scene understanding. More recently with the launch of the LiDAR sensor in Apple's iPads and iPhones, high quality RGB-D data is accessible to millions of people on a device they commonly use. This opens a whole new era in scene understanding for the Computer Vision community as well as app developers. The fundamental research in scene understanding together with the advances in machine learning can now impact people's everyday experiences. However, transforming these scene understanding methods to real-world experiences requires additional innovation and development. In this paper we introduce ARKitScenes. It is not only the first RGB-D dataset that is captured with a now widely available depth sensor, but to our best knowledge, it also is the largest indoor scene understanding data released. In addition to the raw and processed data from the mobile device, ARKitScenes includes high resolution depth maps captured using a stationary laser scanner, as well as manually labeled 3D oriented bounding boxes for a large taxonomy of furniture. We further analyze the usefulness of the data for two downstream tasks: 3D object detection and color-guided depth upsampling. We demonstrate that our dataset can help push the boundaries of existing state-of-the-art methods and it introduces new challenges that better represent real-world scenarios.

Keyword: loop detection

There is no result

Keyword: autonomous driving

There is no result

Keyword: mapping

Synthesis-Guided Feature Learning for Cross-Spectral Periocular Recognition

  • Authors: Domenick Poster, Nasser Nasrabadi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.08738
  • Pdf link: https://arxiv.org/pdf/2111.08738
  • Abstract
    A common yet challenging scenario in periocular biometrics is cross-spectral matching - in particular, the matching of visible wavelength against near-infrared (NIR) periocular images. We propose a novel approach to cross-spectral periocular verification that primarily focuses on learning a mapping from visible and NIR periocular images to a shared latent representational subspace, and supports this effort by simultaneously learning intra-spectral image reconstruction. We show the auxiliary image reconstruction task (and in particular the reconstruction of high-level, semantic features) results in learning a more discriminative, domain-invariant subspace compared to the baseline while incurring no additional computational or memory costs at test-time. The proposed Coupled Conditional Generative Adversarial Network (CoGAN) architecture uses paired generator networks (one operating on visible images and the other on NIR) composed of U-Nets with ResNet-18 encoders trained for feature learning via contrastive loss and for intra-spectral image reconstruction with adversarial, pixel-based, and perceptual reconstruction losses. Moreover, the proposed CoGAN model beats the current state-of-the-art (SotA) in cross-spectral periocular recognition. On the Hong Kong PolyU benchmark dataset, we achieve 98.65% AUC and 5.14% EER compared to the SotA EER of 8.02%. On the Cross-Eyed dataset, we achieve 99.31% AUC and 3.99% EER versus SotA EER of 4.39%.

Achieving Short-Blocklength RCU bound via CRC List Decoding of TCM with Probabilistic Shaping

  • Authors: Linfang Wang, Dan Song, Felipe Areces, Richard D. Wesel
  • Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2111.08756
  • Pdf link: https://arxiv.org/pdf/2111.08756
  • Abstract
    This paper applies probabilistic amplitude shaping (PAS) to a cyclic redundancy check (CRC) aided trellis coded modulation (TCM) to achieve the short-blocklength random coding union (RCU) bound. In the transmitter, the equally likely message bits are first encoded by distribution matcher to generate amplitude symbols with the desired distribution. The binary representations of the distribution matcher outputs are then encoded by a CRC. Finally, the CRC-encoded bits are encoded and modulated by Ungerboeck's TCM scheme, which consists of a $\frac{k_0}{k_0+1}$ systematic tail-biting convolutional code and a mapping function that maps coded bits to channel signals with capacity-achieving distribution. This paper proves that, for the proposed transmitter, the CRC bits have uniform distribution and that the channel signals have symmetric distribution. In the receiver, the serial list Viterbi decoding (S-LVD) is used to estimate the information bits. Simulation results show that, for the proposed CRC-TCM-PAS system with 87 input bits and 65-67 8-AM coded output symbols, the decoding performance under additive white Gaussian noise channel achieves the RCU bound with properly designed CRC and convolutional codes.

Search and Rescue in a Maze-like Environment with Ant and Dijkstra Algorithms

  • Authors: Z. Husain, A. Al Zaabi, H. Hildmann, F. Saffre, D. Ruta, A. F. Isakovic
  • Subjects: Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
  • Arxiv link: https://arxiv.org/abs/2111.08882
  • Pdf link: https://arxiv.org/pdf/2111.08882
  • Abstract
    With the growing reliability of modern Ad Hoc Networks, it is encouraging to analyze potential involvement of autonomous Ad Hoc agents in critical situations where human involvement could be perilous. One such critical scenario is the Search and Rescue effort in the event of a disaster where timely discovery and help deployment is of utmost importance. This paper demonstrates the applicability of a bio-inspired technique, namely Ant Algorithms (AA), in optimizing the search time for a near optimal path to a trapped victim, followed by the application of Dijkstra's algorithm in the rescue phase. The inherent exploratory nature of AA is put to use for a faster mapping and coverage of the unknown search space. Four different AA are implemented, with different effects of the pheromone in play. An inverted AA, with repulsive pheromones, was found to be the best fit for this particular application. After considerable exploration, upon discovery of the victim, the autonomous agents further facilitate the rescue process by forming a relay network, using the already deployed resources. Hence, the paper discusses a detailed decision making model of the swarm, segmented into two primary phases, responsible for the search and rescue respectively. Different aspects of the performance of the agent swarm are analyzed, as a function of the spatial dimensions, the complexity of the search space, the deployed search group size, and the signal permeability of the obstacles in the area.
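
For readers unfamiliar with the rescue-phase planner mentioned in the abstract, below is a self-contained Dijkstra shortest-path sketch on a small occupancy-grid maze (uniform step cost, 4-connectivity). The grid, start, and goal are made up for illustration; in the paper the agents plan over the map built during the ant-based search phase.

```python
import heapq

def dijkstra_grid(grid, start, goal):
    """Shortest path on a 4-connected occupancy grid (1 = obstacle).

    Used here only to illustrate the rescue phase described in the abstract:
    once the victim is found, plan the shortest route to it.
    """
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    prev = {}
    pq = [(0, start)]
    while pq:
        d, (r, c) = heapq.heappop(pq)
        if (r, c) == goal:
            break
        if d > dist.get((r, c), float("inf")):
            continue
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(pq, (nd, (nr, nc)))
    # Reconstruct the path by walking predecessors back from the goal
    path, node = [], goal
    while node != start:
        path.append(node)
        node = prev[node]
    return [start] + path[::-1]

maze = [[0, 0, 1], [1, 0, 1], [0, 0, 0]]
print(dijkstra_grid(maze, (0, 0), (2, 2)))
```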

Random Graph-Based Neuromorphic Learning with a Layer-Weaken Structure

  • Authors: Ruiqi Mao, Rongxin Cui
  • Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Neural and Evolutionary Computing (cs.NE)
  • Arxiv link: https://arxiv.org/abs/2111.08888
  • Pdf link: https://arxiv.org/pdf/2111.08888
  • Abstract
    A unified understanding of neural networks (NNs) is difficult to achieve because it remains unclear what rules should be followed to optimize the internal structure of NNs. Considering the potential capability of random graphs to alter how computation is performed, we demonstrate that they can serve as architecture generators to optimize the internal structure of NNs. To transform the random graph theory into an NN model with practical meaning and based on clarifying the input-output relationship of each neuron, we complete data feature mapping by calculating Fourier Random Features (FRFs). Under this low-operation-cost approach, neurons are assigned to several groups whose connection relationships can be regarded as uniform representations of the random graphs they belong to, and a random arrangement fuses those neurons to establish the pattern matrix, markedly reducing manual participation and computational cost without a fixed and deep architecture. Leveraging this single neuromorphic learning model, termed random graph-based neuro network (RGNN), we develop a joint classification mechanism involving information interaction between multiple RGNNs and realize significant performance improvements in supervised learning on three benchmark tasks, effectively avoiding the adverse impact of the interpretability of NNs on structure design and engineering practice.

Nonlinear Intensity Sonar Image Matching based on Deep Convolution Features

  • Authors: Xiaoteng Zhou, Changli Yu, Xin Yuan, Yi Wu, Haijun Feng, Citong Luo
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.08994
  • Pdf link: https://arxiv.org/pdf/2111.08994
  • Abstract
    In the field of deep-sea exploration, sonar is presently the only efficient long-distance sensing device. The complicated underwater environment, such as noise interference, low target intensity or background dynamics, has brought many negative effects on sonar imaging. Among them, the problem of nonlinear intensity is extremely prevalent. It is also known as the anisotropy of acoustic imaging, that is, when AUVs carry sonar to detect the same target from different angles, the intensity difference between image pairs is sometimes very large, which makes the traditional matching algorithm almost ineffective. However, image matching is the basis of comprehensive tasks such as navigation, positioning, and mapping. Therefore, it is very valuable to obtain robust and accurate matching results. This paper proposes a combined matching method based on phase information and deep convolution features. It has two outstanding advantages: one is that deep convolution features could be used to measure the similarity of the local and global positions of the sonar image; the other is that local feature matching could be performed at the key target position of the sonar image. This method does not need complex manual design, and completes the matching task of nonlinear intensity sonar images in a close end-to-end manner. Feature matching experiments are carried out on the deep-sea sonar images captured by AUVs, and the results show that our proposal has good matching accuracy and robustness.

Probabilistic Spatial Distribution Prior Based Attentional Keypoints Matching Network

  • Authors: Xiaoming Zhao, Jingmeng Liu, Xingming Wu, Weihai Chen, Fanghong Guo, Zhengguo Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.09006
  • Pdf link: https://arxiv.org/pdf/2111.09006
  • Abstract
    Keypoints matching is a pivotal component for many image-relevant applications such as image stitching, visual simultaneous localization and mapping (SLAM), and so on. Both handcrafted-based and recently emerged deep learning-based keypoints matching methods merely rely on keypoints and local features, while losing sight of other available sensors such as inertial measurement unit (IMU) in the above applications. In this paper, we demonstrate that the motion estimation from IMU integration can be used to exploit the spatial distribution prior of keypoints between images. To this end, a probabilistic perspective of attention formulation is proposed to integrate the spatial distribution prior into the attentional graph neural network naturally. With the assistance of spatial distribution prior, the effort of the network for modeling the hidden features can be reduced. Furthermore, we present a projection loss for the proposed keypoints matching network, which gives a smooth edge between matching and un-matching keypoints. Image matching experiments on visual SLAM datasets indicate the effectiveness and efficiency of the presented method.

Multi-Attribute Relation Extraction (MARE) -- Simplifying the Application of Relation Extraction

  • Authors: Lars Klöser, Philipp Kohl, Bodo Kraft, Albert Zündorf
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.09035
  • Pdf link: https://arxiv.org/pdf/2111.09035
  • Abstract
    Natural language understanding's relation extraction makes innovative and encouraging novel business concepts possible and facilitates new digitalized decision-making processes. Current approaches allow the extraction of relations with a fixed number of entities as attributes. Extracting relations with an arbitrary number of attributes requires complex systems and costly relation-trigger annotations to assist these systems. We introduce multi-attribute relation extraction (MARE) as an assumption-less problem formulation with two approaches, facilitating an explicit mapping from business use cases to the data annotations. Avoiding elaborated annotation constraints simplifies the application of relation extraction approaches. The evaluation compares our models to current state-of-the-art event extraction and binary relation extraction methods. Our approaches show improvement compared to these on the extraction of general multi-attribute relations.

Unifying Heterogenous Electronic Health Records Systems via Text-Based Code Embedding

  • Authors: Kyunghoon Hur, Jiyoung Lee, Jungwoo Oh, Wesley Price, Young-Hak Kim, Edward Choi
  • Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.09098
  • Pdf link: https://arxiv.org/pdf/2111.09098
  • Abstract
    EHR systems lack a unified code system for representing medical concepts, which acts as a barrier for the deployment of deep learning models in large scale to multiple clinics and hospitals. To overcome this problem, we introduce Description-based Embedding, DescEmb, a code-agnostic representation learning framework for EHR. DescEmb takes advantage of the flexibility of neural language understanding models to embed clinical events using their textual descriptions rather than directly mapping each event to a dedicated embedding. DescEmb outperformed traditional code-based embedding in extensive experiments, especially in a zero-shot transfer task (one hospital to another), and was able to train a single unified model for heterogeneous EHR datasets.

Learning to Align Sequential Actions in the Wild

  • Authors: Weizhe Liu, Bugra Tekin, Huseyin Coskun, Vibhav Vineet, Pascal Fua, Marc Pollefeys
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.09301
  • Pdf link: https://arxiv.org/pdf/2111.09301
  • Abstract
    State-of-the-art methods for self-supervised sequential action alignment rely on deep networks that find correspondences across videos in time. They either learn frame-to-frame mapping across sequences, which does not leverage temporal information, or assume monotonic alignment between each video pair, which ignores variations in the order of actions. As such, these methods are not able to deal with common real-world scenarios that involve background frames or videos that contain non-monotonic sequence of actions. In this paper, we propose an approach to align sequential actions in the wild that involve diverse temporal variations. To this end, we propose an approach to enforce temporal priors on the optimal transport matrix, which leverages temporal consistency, while allowing for variations in the order of actions. Our model accounts for both monotonic and non-monotonic sequences and handles background frames that should not be aligned. We demonstrate that our approach consistently outperforms the state-of-the-art in self-supervised sequential action representation learning on four different benchmark datasets.

Keyword: localization

Probabilistic Spatial Distribution Prior Based Attentional Keypoints Matching Network

  • Authors: Xiaoming Zhao, Jingmeng Liu, Xingming Wu, Weihai Chen, Fanghong Guo, Zhengguo Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.09006
  • Pdf link: https://arxiv.org/pdf/2111.09006
  • Abstract
    Keypoints matching is a pivotal component for many image-relevant applications such as image stitching, visual simultaneous localization and mapping (SLAM), and so on. Both handcrafted-based and recently emerged deep learning-based keypoints matching methods merely rely on keypoints and local features, while losing sight of other available sensors such as inertial measurement unit (IMU) in the above applications. In this paper, we demonstrate that the motion estimation from IMU integration can be used to exploit the spatial distribution prior of keypoints between images. To this end, a probabilistic perspective of attention formulation is proposed to integrate the spatial distribution prior into the attentional graph neural network naturally. With the assistance of spatial distribution prior, the effort of the network for modeling the hidden features can be reduced. Furthermore, we present a projection loss for the proposed keypoints matching network, which gives a smooth edge between matching and un-matching keypoints. Image matching experiments on visual SLAM datasets indicate the effectiveness and efficiency of the presented method.

Multi-Mobile Robot Localization and Navigation based on Visible Light Positioning

  • Authors: Yanyi Chen, Zhiqing Zhong, Shangsheng Wen, Weipeng Guan
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.09050
  • Pdf link: https://arxiv.org/pdf/2111.09050
  • Abstract
    We demonstrated multi-mobile robot navigation based on Visible Light Positioning (VLP) localization. Our experiments show that VLP can accurately locate the robots' positions during navigation.

DA-LMR: A Robust Lane Markings Representation for Data Association Methods

  • Authors: Miguel Ángel Muñoz-Bañón, Jan-Hendrik Pauls, Haohao Hu, Christoph Stiller
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.09230
  • Pdf link: https://arxiv.org/pdf/2111.09230
  • Abstract
    While complete localization approaches are widely studied in the literature, their data association and data representation subprocesses usually go unnoticed. However, both are a key part of the final pose estimation. In this work, we present DA-LMR (Delta-Angle Lane Markings Representation), a robust data representation in the context of localization approaches. We propose a representation of lane markings that encodes how a curve changes in each point and includes this information in an additional dimension, thus providing a more detailed geometric structure description of the data. We also propose DC-SAC (Distance-Compatible Sample Consensus), a data association method. This is a heuristic version of RANSAC that dramatically reduces the hypothesis space by distance compatibility restrictions. We compare the presented methods with some state-of-the-art data representation and data association approaches in different noisy scenarios. The DA-LMR and DC-SAC produce the most promising combination among those compared, reaching 98.1% in precision and 99.7% in recall for noisy data with 0.5m of standard deviation.
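
The distance-compatibility idea can be sketched as a small pre-check inside an otherwise ordinary RANSAC loop: a sampled pair of correspondences is discarded immediately if a rigid motion could not explain it. The 2-D rigid model, tolerances, and iteration count below are assumptions for illustration and not the authors' implementation.

```python
import numpy as np

def dc_sac_rigid_2d(src, dst, n_iter=500, pair_tol=0.5, inlier_tol=0.5, rng=None):
    """A RANSAC-style sketch with a distance-compatibility pre-check.

    src, dst: (N, 2) tentatively corresponding points. Before fitting a rigid
    2-D transform to a random pair of correspondences, the pair is rejected
    unless the intra-set distances agree (a rigid motion preserves them),
    which is the pruning idea sketched in the abstract.
    """
    rng = rng or np.random.default_rng(0)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        i, j = rng.choice(len(src), size=2, replace=False)
        # Distance-compatibility check drastically shrinks the hypothesis space
        if abs(np.linalg.norm(src[i] - src[j]) - np.linalg.norm(dst[i] - dst[j])) > pair_tol:
            continue
        # Fit rotation + translation from the two correspondences
        a, b = src[j] - src[i], dst[j] - dst[i]
        theta = np.arctan2(b[1], b[0]) - np.arctan2(a[1], a[0])
        R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
        t = dst[i] - R @ src[i]
        residuals = np.linalg.norm((src @ R.T + t) - dst, axis=1)
        inliers = residuals < inlier_tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

# Toy usage: rotate a square by 30 degrees and corrupt one correspondence
src = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [2.0, 2.0]])
ang = np.deg2rad(30)
R_true = np.array([[np.cos(ang), -np.sin(ang)], [np.sin(ang), np.cos(ang)]])
dst = src @ R_true.T
dst[-1] = [9.0, 9.0]                                 # corrupted correspondence
print(dc_sac_rigid_2d(src, dst, inlier_tol=0.1))     # outlier flagged as False
```

Rejecting incompatible samples before model fitting is what shrinks the hypothesis space, which is the effect the abstract attributes to DC-SAC.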

New submissions for Wed, 24 Nov 21

Keyword: SLAM

There is no result

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

PointCrack3D: Crack Detection in Unstructured Environments using a 3D-Point-Cloud-Based Deep Neural Network

  • Authors: Faris Azhari, Charlotte Sennersten, Michael Milford, Thierry Peynot
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.11615
  • Pdf link: https://arxiv.org/pdf/2111.11615
  • Abstract
    Surface cracks on buildings, natural walls and underground mine tunnels can indicate serious structural integrity issues that threaten the safety of the structure and people in the environment. Timely detection and monitoring of cracks are crucial to managing these risks, especially if the systems can be made highly automated through robots. Vision-based crack detection algorithms using deep neural networks have exhibited promise for structured surfaces such as walls or civil engineering tunnels, but little work has addressed highly unstructured environments such as rock cliffs and bare mining tunnels. To address this challenge, this paper presents PointCrack3D, a new 3D-point-cloud-based crack detection algorithm for unstructured surfaces. The method comprises three key components: an adaptive down-sampling method that maintains sufficient crack point density, a DNN that classifies each point as crack or non-crack, and a post-processing clustering method that groups crack points into crack instances. The method was validated experimentally on a new large natural rock dataset, comprising coloured LIDAR point clouds spanning more than 900 m^2 and 412 individual cracks. Results demonstrate a crack detection rate of 97% overall and 100% for cracks with a maximum width of more than 3 cm, significantly outperforming the state of the art. Furthermore, for cross-validation, PointCrack3D was applied to an entirely new dataset acquired in different locations and not used at all in training and shown to detect 100% of its crack instances. We also characterise the relationship between detection performance, crack width and number of points per crack, providing a foundation upon which to make decisions about both practical deployments and future research directions.
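
The post-processing step described in the abstract, grouping per-point crack predictions into crack instances, can be sketched with an off-the-shelf density-based clustering step. The classifier itself is not reproduced here; the score threshold, `eps`, and `min_samples` values are illustrative assumptions, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def group_crack_points(points, crack_scores, threshold=0.5, eps=0.05, min_samples=10):
    """Post-processing sketch: turn per-point crack scores into crack instances.

    points: (N, 3) point cloud; crack_scores: (N,) per-point probabilities from
    some classifier (the DNN in the abstract is not reproduced here). Nearby
    predicted crack points are grouped into instances with DBSCAN; eps is in
    the same metric units as the point cloud.
    """
    crack_mask = crack_scores > threshold
    crack_pts = points[crack_mask]
    if len(crack_pts) == 0:
        return crack_mask, np.array([])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(crack_pts)
    return crack_mask, labels   # label -1 marks isolated, likely spurious points
```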

AdaFusion: Visual-LiDAR Fusion with Adaptive Weights for Place Recognition

  • Authors: Haowen Lai, Peng Yin, Sebastian Scherer
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.11739
  • Pdf link: https://arxiv.org/pdf/2111.11739
  • Abstract
    Recent years have witnessed the increasing application of place recognition in various environments, such as city roads, large buildings, and a mix of indoor and outdoor places. This task, however, still remains challenging due to the limitations of different sensors and the changing appearance of environments. Current works only consider the use of individual sensors, or simply combine different sensors, ignoring the fact that the importance of different sensors varies as the environment changes. In this paper, an adaptive weighting visual-LiDAR fusion method, named AdaFusion, is proposed to learn the weights for both images and point cloud features. Features of these two modalities are thus contributed differently according to the current environmental situation. The learning of weights is achieved by the attention branch of the network, which is then fused with the multi-modality feature extraction branch. Furthermore, to better utilize the potential relationship between images and point clouds, we design a two-stage fusion approach to combine the 2D and 3D attention. Our work is tested on two public datasets, and experiments show that the adaptive weights help improve recognition accuracy and system robustness to varying environments.
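
A toy version of the adaptive weighting idea is sketched below: two modality descriptors are scored by small linear heads standing in for the attention branch, converted to weights with a softmax, and fused into one place descriptor. Everything here (the linear heads, the concatenation-based fusion) is a hypothetical stand-in rather than the published network.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_fusion(img_feat, pcd_feat, w_img, w_pcd):
    """Toy environment-dependent weighting of two modality descriptors.

    img_feat, pcd_feat: 1-D global descriptors from the image and point cloud
    branches. w_img, w_pcd: weight vectors standing in for the attention branch
    described in the abstract (purely illustrative).
    """
    scores = np.array([w_img @ img_feat, w_pcd @ pcd_feat])
    alpha = softmax(scores)                      # environment-dependent weights
    fused = np.concatenate([alpha[0] * img_feat, alpha[1] * pcd_feat])
    return fused / np.linalg.norm(fused)         # normalised place descriptor

# Toy usage with random descriptors
rng = np.random.default_rng(0)
img_feat, pcd_feat = rng.normal(size=256), rng.normal(size=256)
w_img, w_pcd = rng.normal(size=256), rng.normal(size=256)
print(adaptive_fusion(img_feat, pcd_feat, w_img, w_pcd).shape)   # (512,)
```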

VISTA 2.0: An Open, Data-driven Simulator for Multimodal Sensing and Policy Learning for Autonomous Vehicles

  • Authors: Alexander Amini, Tsun-Hsuan Wang, Igor Gilitschenski, Wilko Schwarting, Zhijian Liu, Song Han, Sertac Karaman, Daniela Rus
  • Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.12083
  • Pdf link: https://arxiv.org/pdf/2111.12083
  • Abstract
    Simulation has the potential to transform the development of robust algorithms for mobile agents deployed in safety-critical scenarios. However, the poor photorealism and lack of diverse sensor modalities of existing simulation engines remain key hurdles towards realizing this potential. Here, we present VISTA, an open source, data-driven simulator that integrates multiple types of sensors for autonomous vehicles. Using high fidelity, real-world datasets, VISTA represents and simulates RGB cameras, 3D LiDAR, and event-based cameras, enabling the rapid generation of novel viewpoints in simulation and thereby enriching the data available for policy learning with corner cases that are difficult to capture in the physical world. Using VISTA, we demonstrate the ability to train and test perception-to-control policies across each of the sensor types and showcase the power of this approach via deployment on a full scale autonomous vehicle. The policies learned in VISTA exhibit sim-to-real transfer without modification and greater robustness than those trained exclusively on real-world data.

Keyword: loop detection

There is no result

Keyword: autonomous driving

Integrating Imitation Learning with Human Driving Data into Reinforcement Learning to Improve Training Efficiency for Autonomous Driving

  • Authors: Heidi Lu
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2111.11673
  • Pdf link: https://arxiv.org/pdf/2111.11673
  • Abstract
    Two current methods used to train autonomous cars are reinforcement learning and imitation learning. This research develops a new learning methodology and systematic approach in both a simulated and a smaller real world environment by integrating supervised imitation learning into reinforcement learning to make the RL training data collection process more effective and efficient. By combining the two methods, the proposed research successfully leverages the advantages of both RL and IL methods. First, a real mini-scale robot car was assembled and trained on a 6 feet by 9 feet real world track using imitation learning. During the process, a handle controller was used to control the mini-scale robot car to drive on the track by imitating a human expert driver, and the actions were manually recorded using Microsoft AirSim's API. 331 accurate human-like reward training samples were generated and collected. Then, an agent was trained in the Microsoft AirSim simulator using reinforcement learning for 6 hours with the 331 initial reward samples from the imitation learning stage. After a 6-hour training period, the mini-scale robot car was able to successfully drive full laps around the 6 feet by 9 feet track autonomously, whereas it was unable to complete a single full lap around the track even after 30 hours of pure RL training. With 80% less training time, the new methodology produced significantly more average rewards per hour. Thus, the new methodology was able to save a significant amount of training time and can be used to accelerate the adoption of RL in autonomous driving, which would help produce more efficient and better results in the long run when applied to real life scenarios. Key Words: Reinforcement Learning (RL), Imitation Learning (IL), Autonomous Driving, Human Driving Data, CNN

Independent Learning in Stochastic Games

  • Authors: Asuman Ozdaglar, Muhammed O. Sayin, Kaiqing Zhang
  • Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Dynamical Systems (math.DS)
  • Arxiv link: https://arxiv.org/abs/2111.11743
  • Pdf link: https://arxiv.org/pdf/2111.11743
  • Abstract
    Reinforcement learning (RL) has recently achieved tremendous successes in many artificial intelligence applications. Many of the forefront applications of RL involve multiple agents, e.g., playing chess and Go games, autonomous driving, and robotics. Unfortunately, the framework upon which classical RL builds is inappropriate for multi-agent learning, as it assumes an agent's environment is stationary and does not take into account the adaptivity of other agents. In this review paper, we present the model of stochastic games for multi-agent learning in dynamic environments. We focus on the development of simple and independent learning dynamics for stochastic games: each agent is myopic and chooses best-response type actions to other agents' strategy without any coordination with her opponent. There has been limited progress on developing convergent best-response type independent learning dynamics for stochastic games. We present our recently proposed simple and independent learning dynamics that guarantee convergence in zero-sum stochastic games, together with a review of other contemporaneous algorithms for dynamic multi-agent learning in this setting. Along the way, we also reexamine some classical results from both the game theory and RL literature, to situate both the conceptual contributions of our independent learning dynamics, and the mathematical novelties of our analysis. We hope this review paper serves as an impetus for the resurgence of studying independent and natural learning dynamics in game theory, for the more challenging settings with a dynamic environment.

Keyword: mapping

Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time Mobile Acceleration

  • Authors: Yifan Gong, Geng Yuan, Zheng Zhan, Wei Niu, Zhengang Li, Pu Zhao, Yuxuan Cai, Sijia Liu, Bin Ren, Xue Lin, Xulong Tang, Yanzhi Wang
  • Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
  • Arxiv link: https://arxiv.org/abs/2111.11581
  • Pdf link: https://arxiv.org/pdf/2111.11581
  • Abstract
    Weight pruning is an effective model compression technique to tackle the challenges of achieving real-time deep neural network (DNN) inference on mobile devices. However, prior pruning schemes have limited application scenarios due to accuracy degradation, difficulty in leveraging hardware acceleration, and/or restriction on certain types of DNN layers. In this paper, we propose a general, fine-grained structured pruning scheme and corresponding compiler optimizations that are applicable to any type of DNN layer while achieving high accuracy and hardware inference performance. With the flexibility of applying different pruning schemes to different layers enabled by our compiler optimizations, we further probe into the new problem of determining the best-suited pruning scheme considering the different acceleration and accuracy performance of various pruning schemes. Two pruning scheme mapping methods, one search-based and the other rule-based, are proposed to automatically derive the best-suited pruning regularity and block size for each layer of any given DNN. Experimental results demonstrate that our pruning scheme mapping methods, together with the general fine-grained structured pruning scheme, outperform the state-of-the-art DNN optimization framework with up to 2.48$\times$ and 1.73$\times$ DNN inference acceleration on the CIFAR-10 and ImageNet datasets without accuracy loss.

A Logical Semantics for PDDL+

  • Authors: Vitaliy Batusov, Mikhail Soutchanski
  • Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
  • Arxiv link: https://arxiv.org/abs/2111.11588
  • Pdf link: https://arxiv.org/pdf/2111.11588
  • Abstract
    PDDL+ is an extension of PDDL2.1 which incorporates fully-featured autonomous processes and allows for better modelling of mixed discrete-continuous domains. Unlike PDDL2.1, PDDL+ lacks a logical semantics, relying instead on state-transitional semantics enriched with hybrid automata semantics for the continuous states. This complex semantics makes analysis and comparisons to other action formalisms difficult. In this paper, we propose a natural extension of Reiter's situation calculus theories inspired by hybrid automata. The kinship between PDDL+ and hybrid automata allows us to develop a direct mapping between PDDL+ and situation calculus, thereby supplying PDDL+ with a logical semantics and the situation calculus with a modern way of representing autonomous processes. We outline the potential benefits of the mapping by suggesting a new approach to effective planning in PDDL+.

Using mixup as regularization and tuning hyper-parameters for ResNets

  • Authors: Venkata Bhanu Teja Pallakonda
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.11616
  • Pdf link: https://arxiv.org/pdf/2111.11616
  • Abstract
    While novel computer vision architectures are gaining traction, the impact of model architectures is often tied to changes in or exploration of training methods. Identity-mapping-based architectures such as ResNets and DenseNets have promised path-breaking results in the image classification task and remain go-to methods even now when the available data is fairly limited. Considering the ease of training with limited resources, this work revisits ResNets and improves ResNet50 by using mixup data augmentation as regularization and by tuning the hyper-parameters.
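
Since the abstract leans on mixup as its main regularizer, here is the generic mixup recipe in NumPy for reference. The Beta(0.2, 0.2) parameter and batch shape are placeholders; the authors' exact training configuration may differ.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Standard mixup data augmentation, used here as the regularizer the
    abstract refers to; this is the generic recipe, not the authors' exact
    training script.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)               # mixing coefficient
    perm = rng.permutation(len(x))             # pair each sample with a random partner
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix

# Example: mix a batch of 32 CIFAR-like images with one-hot labels
x = np.random.rand(32, 32, 32, 3).astype(np.float32)
y = np.eye(10)[np.random.randint(0, 10, size=32)]
xm, ym = mixup_batch(x, y)
```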

A Customized NoC Architecture to Enable Highly Localized Computing-On-the-Move DNN Dataflow

  • Authors: Kaining Zhou, Yangshuo He, Rui Xiao, Jiayi Liu, Kejie Huang
  • Subjects: Hardware Architecture (cs.AR)
  • Arxiv link: https://arxiv.org/abs/2111.11744
  • Pdf link: https://arxiv.org/pdf/2111.11744
  • Abstract
    The ever-increasing computation complexity of fast-growing Deep Neural Networks (DNNs) has called for new computing paradigms to overcome the memory wall in conventional Von Neumann computing architectures. The emerging Computing-In-Memory (CIM) architecture has been a promising candidate to accelerate neural network computing. However, data movement between CIM arrays may still dominate the total power consumption in conventional designs. This paper proposes a flexible CIM processor architecture named Domino and a "Computing-On-the-Move" (COM) dataflow, to enable stream computing and local data access to significantly reduce data movement energy. Meanwhile, Domino employs customized distributed instruction scheduling within Network-on-Chip (NoC) to implement inter-memory computing and attain mapping flexibility. The evaluation with prevailing DNN models shows that Domino achieves 1.77-to-2.37$\times$ power efficiency over several state-of-the-art CIM accelerators and improves the throughput by 1.28-to-13.16$\times$.

Time Series Prediction about Air Quality using LSTM-Based Models: A Systematic Mapping

  • Authors: Lucas L. S. Sachetti, Vinicius F. S. Mota
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.11848
  • Pdf link: https://arxiv.org/pdf/2111.11848
  • Abstract
    This systematic mapping study investigates the use of Long short-term memory networks to predict time series data about air quality, trying to understand the reasons, characteristics and methods available in the scientific literature, identify gaps in the researched area and potential approaches that can be exploited on later studies.
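
For context on the kind of model the mapping study surveys, below is a minimal PyTorch LSTM regressor that reads a window of past pollutant measurements and predicts the next value. The feature count, hidden size, and window length are placeholders, not values taken from the reviewed papers.

```python
import torch
import torch.nn as nn

class AirQualityLSTM(nn.Module):
    """Minimal LSTM regressor of the kind surveyed in the mapping study:
    it reads a window of past measurements and predicts the next value.
    Hyper-parameters here are placeholders, not recommendations from the paper.
    """
    def __init__(self, n_features=4, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])  # predict from the last time step

model = AirQualityLSTM()
window = torch.randn(8, 24, 4)           # 8 sequences of 24 hourly readings
print(model(window).shape)               # torch.Size([8, 1])
```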

Keyword: localization

Learning Dynamic Compact Memory Embedding for Deformable Visual Object Tracking

  • Authors: Pengfei Zhu, Hongtao Yu, Kaihua Zhang, Yu Wang, Shuai Zhao, Lei Wang, Tianzhu Zhang, Qinghua Hu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.11625
  • Pdf link: https://arxiv.org/pdf/2111.11625
  • Abstract
    Recently, template-based trackers have become the leading tracking algorithms with promising performance in terms of efficiency and accuracy. However, the correlation operation between query feature and the given template only exploits accurate target localization, leading to state estimation error especially when the target suffers from severe deformable variations. To address this issue, segmentation-based trackers have been proposed that employ per-pixel matching to improve the tracking performance of deformable objects effectively. However, most of existing trackers only refer to the target features in the initial frame, thereby lacking the discriminative capacity to handle challenging factors, e.g., similar distractors, background clutter, appearance change, etc. To this end, we propose a dynamic compact memory embedding to enhance the discrimination of the segmentation-based deformable visual tracking method. Specifically, we initialize a memory embedding with the target features in the first frame. During the tracking process, the current target features that have high correlation with existing memory are updated to the memory embedding online. To further improve the segmentation accuracy for deformable objects, we employ a point-to-global matching strategy to measure the correlation between the pixel-wise query features and the whole template, so as to capture more detailed deformation information. Extensive evaluations on six challenging tracking benchmarks including VOT2016, VOT2018, VOT2019, GOT-10K, TrackingNet, and LaSOT demonstrate the superiority of our method over recent remarkable trackers. Besides, our method outperforms the excellent segmentation-based trackers, i.e., D3S and SiamMask on DAVIS2017 benchmark.

Reliable Deep Learning based Localization with CSI Fingerprints and Multiple Base Stations

  • Authors: Anastasios Foliadis, Mario H. Castañeda Garcia, Richard A. Stirling-Gallacher, Reiner S. Thomä
  • Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2111.11839
  • Pdf link: https://arxiv.org/pdf/2111.11839
  • Abstract
    Deep learning (DL) methods have been recently proposed for user equipment (UE) localization in wireless communication networks, based on the channel state information (CSI) between a UE and each base station (BS) in the uplink. With the CSI from the available BSs, UE localization can be performed in different ways. On the one hand, a single neural network (NN) can be trained for the UE localization by considering the CSI from all the available BSs as one overall fingerprint of the user's location. On the other hand, the CSI at each BS can be used to obtain an estimate of the UE's position with a separate NN at each BS, and then the position estimates of all BSs are combined to obtain an overall estimate of the UE position. In this work, we show that UE localization with the latter approach can achieve a higher positioning accuracy. We propose to consider the uncertainty in the UE localization at each BS, such that the overall UE position is determined by combining the position estimates of the different BSs based on the uncertainty at each BS. With this approach, a more reliable position estimate can be obtained in case of variations in the channel.
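
One simple way to realize the "combine per-BS estimates according to their uncertainty" idea is inverse-variance weighting, sketched below; the paper may use a different combination rule, so treat this as an illustrative assumption.

```python
import numpy as np

def fuse_position_estimates(positions, variances):
    """Inverse-variance weighting of per-base-station position estimates.

    positions: (K, 2) x/y estimates from K base stations
    variances: (K,) predicted uncertainty of each estimate
    Estimates from base stations with high uncertainty contribute less
    to the fused position.
    """
    w = 1.0 / np.asarray(variances)
    w = w / w.sum()
    return (w[:, None] * np.asarray(positions)).sum(axis=0)

print(fuse_position_estimates([[1.0, 2.0], [1.4, 2.2], [5.0, 5.0]],
                              [0.1, 0.2, 5.0]))   # the uncertain BS barely contributes
```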

Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling

  • Authors: Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, Lijuan Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.12085
  • Pdf link: https://arxiv.org/pdf/2111.12085
  • Abstract
    In this paper, we propose UNICORN, a vision-language (VL) model that unifies text generation and bounding box prediction into a single architecture. Specifically, we quantize each box into four discrete box tokens and serialize them as a sequence, which can be integrated with text tokens. We formulate all VL problems as a generation task, where the target sequence consists of the integrated text and box tokens. We then train a transformer encoder-decoder to predict the target in an auto-regressive manner. With such a unified framework and input-output format, UNICORN achieves comparable performance to task-specific state of the art on 7 VL benchmarks, covering the visual grounding, grounded captioning, visual question answering, and image captioning tasks. When trained with multi-task finetuning, UNICORN can approach different VL tasks with a single set of parameters, thus crossing downstream task boundary. We show that having a single model not only saves parameters, but also further boosts the model performance on certain tasks. Finally, UNICORN shows the capability of generalizing to new tasks such as ImageNet object localization.

New submissions for Wed, 16 Jun 21

Keyword: SLAM

There is no result

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

Temporal Consistency Checks to Detect LiDAR Spoofing Attacks on Autonomous Vehicle Perception

  • Authors: Chengzeng You, Zhongyuan Hau, Soteris Demetriou
  • Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.07833
  • Pdf link: https://arxiv.org/pdf/2106.07833
  • Abstract
    LiDAR sensors are used widely in Autonomous Vehicles for better perceiving the environment which enables safer driving decisions. Recent work has demonstrated serious LiDAR spoofing attacks with alarming consequences. In particular, model-level LiDAR spoofing attacks aim to inject fake depth measurements to elicit ghost objects that are erroneously detected by 3D Object Detectors, resulting in hazardous driving decisions. In this work, we explore the use of motion as a physical invariant of genuine objects for detecting such attacks. Based on this, we propose a general methodology, 3D Temporal Consistency Check (3D-TC2), which leverages spatio-temporal information from motion prediction to verify objects detected by 3D Object Detectors. Our preliminary design and implementation of a 3D-TC2 prototype demonstrates very promising performance, providing more than a 98% attack detection rate with a recall of 91% for detecting spoofed Vehicle (Car) objects, and is able to achieve real-time detection at 41 Hz.
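
The core temporal-consistency idea can be illustrated with a constant-velocity check: an object detected in the previous frame should reappear close to its motion-predicted position, while an injected ghost object usually will not. The motion model and the gating distance below are simplifying assumptions, not the 3D-TC2 prototype.

```python
import numpy as np

def temporally_consistent(prev_center, velocity, dt, new_center, gate=1.0):
    """Toy temporal-consistency check in the spirit of 3D-TC2.

    A genuine object detected at prev_center and moving with velocity should
    reappear near prev_center + velocity * dt; a spoofed ghost object injected
    into the LiDAR stream typically violates this. The gate (in metres) and the
    constant-velocity model are illustrative assumptions.
    """
    predicted = np.asarray(prev_center) + np.asarray(velocity) * dt
    return np.linalg.norm(predicted - np.asarray(new_center)) < gate

print(temporally_consistent([10.0, 2.0, 0.0], [-5.0, 0.0, 0.0], 0.1, [9.5, 2.0, 0.0]))
```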

Keyword: loop detection

There is no result

Keyword: autonomous driving

There is no result

New submissions for Thu, 17 Jun 21

Keyword: SLAM

There is no result

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: autonomous driving

A Multi-Layered Approach for Measuring the Simulation-to-Reality Gap of Radar Perception for Autonomous Driving

  • Authors: Anthony Ngo, Max Paul Bauer, Michael Resch
  • Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2106.08372
  • Pdf link: https://arxiv.org/pdf/2106.08372
  • Abstract
    With the increasing safety validation requirements for the release of a self-driving car, alternative approaches, such as simulation-based testing, are emerging in addition to conventional real-world testing. In order to rely on virtual tests the employed sensor models have to be validated. For this reason, it is necessary to quantify the discrepancy between simulation and reality in order to determine whether a certain fidelity is sufficient for a desired intended use. There exists no sound method to measure this simulation-to-reality gap of radar perception for autonomous driving. We address this problem by introducing a multi-layered evaluation approach, which consists of a combination of an explicit and an implicit sensor model evaluation. The former directly evaluates the realism of the synthetically generated sensor data, while the latter refers to an evaluation of a downstream target application. In order to demonstrate the method, we evaluated the fidelity of three typical radar model types (ideal, data-driven, ray tracing-based) and their applicability for virtually testing radar-based multi-object tracking. We have shown the effectiveness of the proposed approach in terms of providing an in-depth sensor model assessment that renders existing disparities visible and enables a realistic estimation of the overall model fidelity across different scenarios.

Scene Transformer: A unified multi-task model for behavior prediction and planning

  • Authors: Jiquan Ngiam, Benjamin Caine, Vijay Vasudevan, Zhengdong Zhang, Hao-Tien Lewis Chiang, Jeffrey Ling, Rebecca Roelofs, Alex Bewley, Chenxi Liu, Ashish Venugopal, David Weiss, Ben Sapp, Zhifeng Chen, Jonathon Shlens
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2106.08417
  • Pdf link: https://arxiv.org/pdf/2106.08417
  • Abstract
    Predicting the future motion of multiple agents is necessary for planning in dynamic environments. This task is challenging for autonomous driving since agents (e.g., vehicles and pedestrians) and their associated behaviors may be diverse and influence each other. Most prior work has focused on first predicting independent futures for each agent based on all past motion, and then planning against these independent predictions. However, planning against fixed predictions can suffer from the inability to represent the future interaction possibilities between different agents, leading to sub-optimal planning. In this work, we formulate a model for predicting the behavior of all agents jointly in real-world driving environments in a unified manner. Inspired by recent language modeling approaches, we use a masking strategy as the query to our model, enabling one to invoke a single model to predict agent behavior in many ways, such as potentially conditioned on the goal or full future trajectory of the autonomous vehicle or the behavior of other agents in the environment. Our model architecture fuses heterogeneous world state in a unified Transformer architecture by employing attention across road elements, agent interactions and time steps. We evaluate our approach on autonomous driving datasets for behavior prediction, and achieve state-of-the-art performance. Our work demonstrates that formulating the problem of behavior prediction in a unified architecture with a masking strategy may allow us to have a single model that can perform multiple motion prediction and planning related tasks effectively.

EdgeConv with Attention Module for Monocular Depth Estimation

  • Authors: Minhyeok Lee, Sangwon Hwang, Chaewon Park, Sangyoun Lee
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.08615
  • Pdf link: https://arxiv.org/pdf/2106.08615
  • Abstract
    Monocular depth estimation is an especially important task in robotics and autonomous driving, where 3D structural information is essential. However, extreme lighting conditions and complex surface objects make it difficult to predict depth in a single image. Therefore, to generate accurate depth maps, it is important for the model to learn structural information about the scene. We propose a novel Patch-Wise EdgeConv Module (PEM) and EdgeConv Attention Module (EAM) to solve the difficulty of monocular depth estimation. The proposed modules extract structural information by learning the relationship between image patches close to each other in space using edge convolution. Our method is evaluated on two popular datasets, the NYU Depth V2 and the KITTI Eigen split, achieving state-of-the-art performance. We prove that the proposed model predicts depth robustly in challenging scenes through various comparative experiments.

2nd Place Solution for Waymo Open Dataset Challenge - Real-time 2D Object Detection

  • Authors: Yueming Zhang, Xiaolin Song, Bing Bai, Tengfei Xing, Chao Liu, Xin Gao, Zhihui Wang, Yawei Wen, Haojin Liao, Guoshan Zhang, Pengfei Xu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.08713
  • Pdf link: https://arxiv.org/pdf/2106.08713
  • Abstract
    In an autonomous driving system, it is essential to recognize vehicles, pedestrians and cyclists from images. Besides the high accuracy of the prediction, the requirement of real-time running brings new challenges for convolutional network models. In this report, we introduce a real-time method to detect the 2D objects from images. We aggregate several popular one-stage object detectors and train the models of variety input strategies independently, to yield better performance for accurate multi-scale detection of each category, especially for small objects. For model acceleration, we leverage TensorRT to optimize the inference time of our detection pipeline. As shown in the leaderboard, our proposed detection framework ranks the 2nd place with 75.00% L1 mAP and 69.72% L2 mAP in the real-time 2D detection track of the Waymo Open Dataset Challenges, while our framework achieves the latency of 45.8ms/frame on an Nvidia Tesla V100 GPU.

Robustness of Object Detectors in Degrading Weather Conditions

  • Authors: Muhammad Jehanzeb Mirza, Cornelius Buerkle, Julio Jarquin, Michael Opitz, Fabian Oboril, Kay-Ulrich Scholl, Horst Bischof
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.08795
  • Pdf link: https://arxiv.org/pdf/2106.08795
  • Abstract
    State-of-the-art object detection systems for autonomous driving achieve promising results in clear weather conditions. However, such autonomous safety critical systems also need to work in degrading weather conditions, such as rain, fog and snow. Unfortunately, most approaches evaluate only on the KITTI dataset, which consists only of clear weather scenes. In this paper we address this issue and perform one of the most detailed evaluations of single and dual modality architectures on data captured in real weather conditions. We analyse the performance degradation of these architectures in degrading weather conditions. We demonstrate that an object detection architecture performing well in clear weather might not be able to handle degrading weather conditions. We also perform ablation studies on the dual modality architectures and show their limitations.

New submissions for Fri, 4 Jun 21

Keyword: SLAM

There is no result

Keyword: VINS

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: autonomous driving

Multiscale Domain Adaptive YOLO for Cross-Domain Object Detection

  • Authors: Mazin Hnewa, Hayder Radha
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2106.01483
  • Pdf link: https://arxiv.org/pdf/2106.01483
  • Abstract
    The area of domain adaptation has been instrumental in addressing the domain shift problem encountered by many applications. This problem arises due to the difference between the distributions of source data used for training in comparison with target data used during realistic testing scenarios. In this paper, we introduce a novel MultiScale Domain Adaptive YOLO (MS-DAYOLO) framework that employs multiple domain adaptation paths and corresponding domain classifiers at different scales of the recently introduced YOLOv4 object detector to generate domain-invariant features. We train and test our proposed method using popular datasets. Our experiments show significant improvements in object detection performance when training YOLOv4 using the proposed MS-DAYOLO and when tested on target data representing challenging weather conditions for autonomous driving applications.

DeepCompress: Efficient Point Cloud Geometry Compression

  • Authors: Ryan Killea, Yun Li, Saeed Bastani, Paul McLachlan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2106.01504
  • Pdf link: https://arxiv.org/pdf/2106.01504
  • Abstract
    Point clouds are a basic data type that is increasingly of interest as 3D content becomes more ubiquitous. Applications using point clouds include virtual, augmented, and mixed reality and autonomous driving. We propose a more efficient deep learning-based encoder architecture for point clouds compression that incorporates principles from established 3D object detection and image compression architectures. Through an ablation study, we show that incorporating the learned activation function from Computational Efficient Neural Image Compression (CENIC) and designing more parameter-efficient convolutional blocks yields dramatic gains in efficiency and performance. Our proposed architecture incorporates Generalized Divisive Normalization activations and proposes a spatially separable InceptionV4-inspired block. We then compute rate-distortion curves on the standard JPEG Pleno 8i Voxelized Full Bodies dataset to evaluate our model's performance. Our proposed modifications outperform the baseline approaches by a small margin in terms of Bjontegard delta rate and PSNR values, yet reduce necessary encoder convolution operations by 8 percent and reduce total encoder parameters by 20 percent. Our proposed architecture, when considered on its own, has a small penalty of 0.02 percent in Chamfer's Distance and 0.32 percent increased bit rate in Point to Plane Distance for the same peak signal-to-noise ratio.

Short-term Maintenance Planning of Autonomous Trucks for Minimizing Economic Risk

  • Authors: Xin Tao, Jonas Mårtensson, Håkan Warnquist, Anna Pernestål
  • Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2106.01871
  • Pdf link: https://arxiv.org/pdf/2106.01871
  • Abstract
    New autonomous driving technologies are emerging every day and some of them have been commercially applied in the real world. While benefiting from these technologies, autonomous trucks are facing new challenges in short-term maintenance planning, which directly influences the truck operator's profit. In this paper, we implement a vehicle health management system by addressing the maintenance planning issues of autonomous trucks on a transport mission. We also present a maintenance planning model using a risk-based decision-making method, which identifies the maintenance decision with minimal economic risk of the truck company. Both availability losses and maintenance costs are considered when evaluating the economic risk. We demonstrate the proposed model by numerical experiments illustrating real-world scenarios. In the experiments, compared to three baseline methods, the expected economic risk of the proposed method is reduced by up to 47%. We also conduct sensitivity analyses of different model parameters. The analyses show that the economic risk significantly decreases when the estimation accuracy of remaining useful life, the maximal allowed time of delivery delay before order cancellation, or the number of workshops increases. The experiment results contribute to identifying future research and development attentions of autonomous trucks from an economic perspective.
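
As a hedged illustration of the risk-based idea only (not the paper's actual model), the expected economic risk of a maintenance option can be written as the failure probability times the availability loss plus the maintenance cost, and the option with minimal expected risk is chosen. All names and numbers below are hypothetical.

```python
# Toy risk-based maintenance decision (illustrative; not the paper's model).
# expected_risk = P(failure before next stop) * availability_loss + maintenance_cost

def expected_risk(p_failure, availability_loss, maintenance_cost):
    return p_failure * availability_loss + maintenance_cost


decisions = {
    # hypothetical options for a truck on a transport mission
    "repair_now":       expected_risk(p_failure=0.02, availability_loss=5000.0, maintenance_cost=800.0),
    "repair_next_stop": expected_risk(p_failure=0.10, availability_loss=5000.0, maintenance_cost=500.0),
    "skip_maintenance": expected_risk(p_failure=0.35, availability_loss=5000.0, maintenance_cost=0.0),
}

best = min(decisions, key=decisions.get)
print(best, decisions[best])  # picks the option with minimal expected economic risk
```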

New submissions for Mon, 22 Nov 21

Keyword: SLAM

There is no result

Keyword: Visual inertial

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: Visual inertial odometry

There is no result

Keyword: lidar

There is no result

Keyword: loop detection

There is no result

Keyword: autonomous driving

Learning Robust Output Control Barrier Functions from Safe Expert Demonstrations

  • Authors: Lars Lindemann, Alexander Robey, Lejun Jiang, Stephen Tu, Nikolai Matni
  • Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.09971
  • Pdf link: https://arxiv.org/pdf/2111.09971
  • Abstract
    This paper addresses learning safe control laws from expert demonstrations. We assume that appropriate models of the system dynamics and the output measurement map are available, along with corresponding error bounds. We first propose robust output control barrier functions (ROCBFs) as a means to guarantee safety, as defined through controlled forward invariance of a safe set. We then present an optimization problem to learn ROCBFs from expert demonstrations that exhibit safe system behavior, e.g., data collected from a human operator. Along with the optimization problem, we provide verifiable conditions that guarantee validity of the obtained ROCBF. These conditions are stated in terms of the density of the data and on Lipschitz and boundedness constants of the learned function and the models of the system dynamics and the output measurement map. When the parametrization of the ROCBF is linear, then, under mild assumptions, the optimization problem is convex. We validate our findings in the autonomous driving simulator CARLA and show how to learn safe control laws from RGB camera images.
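
For readers new to control barrier functions: safety is encoded by a set {x : h(x) >= 0}, and a controller is certified safe when it keeps dh/dt + alpha * h(x) >= 0 along trajectories. The sketch below applies this condition as a simple safety filter on a 1-D integrator; it is a generic CBF illustration under stated assumptions, not the learned robust output CBF of the paper.

```python
# Generic CBF safety filter on a toy 1-D integrator x_dot = u (illustration
# only; the paper learns robust output CBFs from demonstrations, this is not that).
# Safe set: h(x) = x_max - x >= 0.
# CBF condition: dh/dx * u + alpha * h(x) >= 0  =>  u <= alpha * (x_max - x)

def cbf_filter(x, u_nominal, x_max=1.0, alpha=2.0):
    u_bound = alpha * (x_max - x)   # largest control input still certified safe
    return min(u_nominal, u_bound)


x, dt = 0.0, 0.05
for _ in range(200):
    u = cbf_filter(x, u_nominal=1.0)  # the nominal controller keeps pushing right
    x += dt * u
print(round(x, 3))  # approaches but never exceeds x_max = 1.0
```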

Panoptic Segmentation: A Review

  • Authors: Omar Elharrouss, Somaya Al-Maadeed, Nandhini Subramanian, Najmath Ottakath, Noor Almaadeed, Yassine Himeur
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.10250
  • Pdf link: https://arxiv.org/pdf/2111.10250
  • Abstract
    Image segmentation for video analysis plays an essential role in different research fields such as smart city, healthcare, computer vision and geoscience, and remote sensing applications. In this regard, a significant effort has been devoted recently to developing novel segmentation strategies; one of the latest outstanding achievements is panoptic segmentation. The latter has resulted from the fusion of semantic and instance segmentation. Explicitly, panoptic segmentation is currently under study to help gain a more nuanced knowledge of the image scenes for video surveillance, crowd counting, self-autonomous driving, medical image analysis, and a deeper understanding of the scenes in general. To that end, we present in this paper the first comprehensive review of existing panoptic segmentation methods to the best of the authors' knowledge. Accordingly, a well-defined taxonomy of existing panoptic techniques is performed based on the nature of the adopted algorithms, application scenarios, and primary objectives. Moreover, the use of panoptic segmentation for annotating new datasets by pseudo-labeling is discussed. Moving on, ablation studies are carried out to understand the panoptic methods from different perspectives. Moreover, evaluation metrics suitable for panoptic segmentation are discussed, and a comparison of the performance of existing solutions is provided to inform the state-of-the-art and identify their limitations and strengths. Lastly, the current challenges the subject technology faces and the future trends attracting considerable interest in the near future are elaborated, which can be a starting point for the upcoming research studies. The papers provided with code are available at: https://github.com/elharroussomar/Awesome-Panoptic-Segmentation
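
One of the standard metrics such a review covers is Panoptic Quality (PQ), which divides the summed IoU of matched (true-positive) segment pairs by |TP| + 0.5|FP| + 0.5|FN|. The snippet below computes PQ for made-up counts, purely as a worked example.

```python
# Panoptic Quality (PQ), the standard panoptic segmentation metric:
#   PQ = sum(IoU of matched segment pairs) / (|TP| + 0.5*|FP| + 0.5*|FN|)
# Segments are matched when IoU > 0.5. The numbers here are illustrative.

def panoptic_quality(matched_ious, num_fp, num_fn):
    tp = len(matched_ious)
    if tp + num_fp + num_fn == 0:
        return 0.0
    return sum(matched_ious) / (tp + 0.5 * num_fp + 0.5 * num_fn)


print(panoptic_quality([0.9, 0.75, 0.6], num_fp=1, num_fn=2))  # 0.5 for this toy case
```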

Unsupervised Visual Time-Series Representation Learning and Clustering

  • Authors: Gaurangi Anand, Richi Nayak
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2111.10309
  • Pdf link: https://arxiv.org/pdf/2111.10309
  • Abstract
    Time-series data is generated ubiquitously and in enormous volumes from Internet-of-Things (IoT) infrastructure, connected and wearable devices, remote sensing, autonomous driving research, and audio-video communications. This paper investigates the potential of unsupervised representation learning for these time-series. We use a novel data transformation along with a novel unsupervised learning regime to transfer learning from other domains, which have extensive models heavily trained on very large labelled datasets, to time-series. We conduct extensive experiments to demonstrate the potential of the proposed approach through time-series clustering.
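
One common family of transformations for reusing image models on time-series is to render the series as a 2-D matrix, for example a recurrence plot. The sketch below shows that generic idea; it is not claimed to be the specific transformation proposed in the paper.

```python
# Recurrence-plot style transformation of a 1-D series into a 2-D "image",
# a generic way to reuse pretrained vision models on time-series
# (illustration only; not necessarily the paper's transformation).
import numpy as np


def recurrence_plot(series, eps=0.1):
    series = np.asarray(series, dtype=float)
    dist = np.abs(series[:, None] - series[None, :])  # pairwise distances
    return (dist < eps).astype(np.uint8)              # binary recurrence matrix


t = np.linspace(0, 4 * np.pi, 128)
img = recurrence_plot(np.sin(t), eps=0.2)
print(img.shape)  # (128, 128) image that can be fed to a pretrained CNN
```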

Bi-Mix: Bidirectional Mixing for Domain Adaptive Nighttime Semantic Segmentation

  • Authors: Guanglei Yang, Zhun Zhong, Hao Tang, Mingli Ding, Nicu Sebe, Elisa Ricci
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.10339
  • Pdf link: https://arxiv.org/pdf/2111.10339
  • Abstract
    In autonomous driving, learning a segmentation model that can adapt to various environmental conditions is crucial. In particular, coping with severe illumination changes is an impelling need, as models trained on daylight data will perform poorly at nighttime. In this paper, we study the problem of Domain Adaptive Nighttime Semantic Segmentation (DANSS), which aims to learn a discriminative nighttime model with a labeled daytime dataset and an unlabeled dataset, including coarsely aligned day-night image pairs. To this end, we propose a novel Bidirectional Mixing (Bi-Mix) framework for DANSS, which can contribute to both image translation and segmentation adaptation processes. Specifically, in the image translation stage, Bi-Mix leverages the knowledge of day-night image pairs to improve the quality of nighttime image relighting. On the other hand, in the segmentation adaptation stage, Bi-Mix effectively bridges the distribution gap between day and night domains for adapting the model to the night domain. In both processes, Bi-Mix simply operates by mixing two samples without extra hyper-parameters, thus it is easy to implement. Extensive experiments on Dark Zurich and Nighttime Driving datasets demonstrate the advantage of the proposed Bi-Mix and show that our approach obtains state-of-the-art performance in DANSS. Our code is available at https://github.com/ygjwd12345/BiMix.
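
The mixing referred to here belongs to the family of mask-based sample mixing (CutMix/ClassMix style). The sketch below shows the generic operation of combining two images and their label maps with a binary mask; it is only a sketch of that family, not the paper's exact Bi-Mix procedure.

```python
# Generic mask-based mixing of two samples (CutMix/ClassMix family).
# Illustrative sketch only; not the exact Bi-Mix procedure of the paper.
import numpy as np


def mix(img_a, img_b, label_a, label_b, mask):
    """mask: binary (H, W) array. Pixels where mask == 1 come from sample A,
    all other pixels come from sample B; labels are mixed the same way."""
    mixed_img = mask[..., None] * img_a + (1 - mask[..., None]) * img_b
    mixed_lab = mask * label_a + (1 - mask) * label_b
    return mixed_img, mixed_lab


h, w = 64, 64
rng = np.random.default_rng(0)
day, night = rng.random((h, w, 3)), rng.random((h, w, 3))
lab_day, lab_night = rng.integers(0, 19, (h, w)), rng.integers(0, 19, (h, w))
mask = np.zeros((h, w), dtype=np.uint8)
mask[:, : w // 2] = 1                      # left half taken from the day sample
mixed_img, mixed_lab = mix(day, night, lab_day, lab_night, mask)
```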

Keyword: mapping

Evaluating Self and Semi-Supervised Methods for Remote Sensing Segmentation Tasks

  • Authors: Chaitanya Patel, Shashank Sharma, Varun Gulshan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.10079
  • Pdf link: https://arxiv.org/pdf/2111.10079
  • Abstract
    We perform a rigorous evaluation of recent self and semi-supervised ML techniques that leverage unlabeled data for improving downstream task performance, on three remote sensing tasks of riverbed segmentation, land cover mapping and flood mapping. These methods are especially valuable for remote sensing tasks since there is easy access to unlabeled imagery and getting ground truth labels can often be expensive. We quantify performance improvements one can expect on these remote sensing segmentation tasks when unlabeled imagery (outside of the labeled dataset) is made available for training. We also design experiments to test the effectiveness of these techniques when the test set has a domain shift relative to the training and validation sets.

An Index for Single Source All Destinations Distance Queries in Temporal Graphs

  • Authors: Lutz Oettershagen, Petra Mutzel
  • Subjects: Databases (cs.DB); Data Structures and Algorithms (cs.DS)
  • Arxiv link: https://arxiv.org/abs/2111.10095
  • Pdf link: https://arxiv.org/pdf/2111.10095
  • Abstract
    Typical tasks in analyzing temporal graphs are single-source-all-destination (SSAD) temporal distance queries, which are, e.g., common during the computation of centrality measures in temporal social networks. An SSAD query starting at a vertex $v$ asks for the temporal distances, e.g., durations, earliest arrival times, or the number of hops, between $v$ and all other reachable vertices. We introduce a new index to speed up SSAD temporal distance queries. The indexing is based on the construction of $k$ subgraphs and a mapping from the vertices to the subgraphs. Each subgraph contains the temporal edges sufficient to answer queries starting from any vertex mapped to the subgraph. We answer a query starting at a vertex $v$ with a single pass over the edges of the subgraph. The new index supports dynamic updates, i.e., efficient insertion and deletion of temporal edges. We call our index Substream index and show that deciding if there exists a Substream index of a given size is NP-complete. We provide a greedy approximation that constructs an index at most $k/\delta$ times larger than an optimal index where $\delta$, with $1\leq\delta\leq k$, depends on the temporal and spatial structure of the graph. Moreover, we improve the running time of the approximation in three ways. First, we use a secondary index called Time Skip index. It speeds up the construction and queries by skipping edges that do not need to be considered. Next, we apply min-hashing to avoid costly union operations. Finally, we use parallelization to take the parallel processing capabilities of modern processors into account. Our extensive evaluation using real-world temporal networks shows the efficiency and effectiveness of our indices.
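
For context, the SSAD query itself (here, earliest-arrival times) can be answered without any index by a single pass over the temporal edges sorted by departure time, which is the kind of per-subgraph pass described above. The sketch below shows that standard streaming computation; the edge format and the assumption that the source is available at time 0 are illustrative choices, and this is the query, not the paper's Substream index.

```python
# Single-source earliest-arrival times in a temporal graph via one pass over
# the edges sorted by departure time (standard streaming formulation).
import math


def earliest_arrival(edges, source):
    """edges: iterable of (u, v, t, lam), an edge from u to v departing at
    time t and taking lam time units; must be sorted by nondecreasing t."""
    arrival = {source: 0.0}                          # source available at time 0
    for u, v, t, lam in edges:
        if arrival.get(u, math.inf) <= t:            # we can catch this edge
            arrival[v] = min(arrival.get(v, math.inf), t + lam)
    return arrival


edges = sorted(
    [("a", "b", 1, 1), ("b", "c", 3, 2), ("a", "c", 4, 5), ("c", "d", 6, 1)],
    key=lambda e: e[2],
)
print(earliest_arrival(edges, "a"))  # {'a': 0.0, 'b': 2, 'c': 5, 'd': 7}
```

Other temporal distances (durations, hop counts) follow the same single-pass pattern with different update rules.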

On the Download Rate of Homomorphic Secret Sharing

  • Authors: Ingerid Fosli, Yuval Ishai, Victor I. Kolobov, Mary Wootters
  • Subjects: Information Theory (cs.IT); Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2111.10126
  • Pdf link: https://arxiv.org/pdf/2111.10126
  • Abstract
    A homomorphic secret sharing (HSS) scheme is a secret sharing scheme that supports evaluating functions on shared secrets by means of a local mapping from input shares to output shares. We initiate the study of the download rate of HSS, namely, the achievable ratio between the length of the output shares and the output length when amortized over $\ell$ function evaluations. We obtain the following results. * In the case of linear information-theoretic HSS schemes for degree-$d$ multivariate polynomials, we characterize the optimal download rate in terms of the optimal minimal distance of a linear code with related parameters. We further show that for sufficiently large $\ell$ (polynomial in all problem parameters), the optimal rate can be realized using Shamir's scheme, even with secrets over $\mathbb{F}_2$. * We present a general rate-amplification technique for HSS that improves the download rate at the cost of requiring more shares. As a corollary, we get high-rate variants of computationally secure HSS schemes and efficient private information retrieval protocols from the literature. * We show that, in some cases, one can beat the best download rate of linear HSS by allowing nonlinear output reconstruction and $2^{-\Omega(\ell)}$ error probability.
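
To make the "local mapping from input shares to output shares" concrete, the toy sketch below shares two secrets with degree-1 Shamir polynomials and lets each party locally evaluate a linear function on its own shares; any two output shares then reconstruct the result. This is only a generic illustration of linear homomorphic evaluation, not the schemes or download rates analysed in the paper.

```python
# Local evaluation of a linear function on Shamir shares -- the basic mechanism
# behind linear homomorphic secret sharing (toy illustration of the concept).
import random

P = 2**61 - 1          # prime modulus
T = 1                  # degree-1 polynomials => any 2 shares reconstruct


def share(secret, n=3):
    coeffs = [secret] + [random.randrange(P) for _ in range(T)]
    return [(i, sum(c * pow(i, k, P) for k, c in enumerate(coeffs)) % P)
            for i in range(1, n + 1)]


def reconstruct(shares):
    # Lagrange interpolation at x = 0
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total


s1, s2 = 10, 20
sh1, sh2 = share(s1), share(s2)
# Each party locally evaluates f(s1, s2) = s1 + 2*s2 on its own shares ...
out_shares = [(x, (y1 + 2 * y2) % P) for (x, y1), (_, y2) in zip(sh1, sh2)]
# ... and the output is recovered from any T+1 = 2 output shares.
print(reconstruct(out_shares[:2]))   # 50
```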

Keyword: localization

Grounded Situation Recognition with Transformers

  • Authors: Junhyeong Cho, Youngseok Yoon, Hyeonjun Lee, Suha Kwak
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2111.10135
  • Pdf link: https://arxiv.org/pdf/2111.10135
  • Abstract
    Grounded Situation Recognition (GSR) is the task that not only classifies a salient action (verb), but also predicts entities (nouns) associated with semantic roles and their locations in the given image. Inspired by the remarkable success of Transformers in vision tasks, we propose a GSR model based on a Transformer encoder-decoder architecture. The attention mechanism of our model enables accurate verb classification by capturing high-level semantic feature of an image effectively, and allows the model to flexibly deal with the complicated and image-dependent relations between entities for improved noun classification and localization. Our model is the first Transformer architecture for GSR, and achieves the state of the art in every evaluation metric on the SWiG benchmark. Our code is available at https://github.com/jhcho99/gsrtr .

Learning to Detect Instance-level Salient Objects Using Complementary Image Labels

  • Authors: Xin Tian, Ke Xu, Xin Yang, Baocai Yin, Rynson W.H. Lau
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2111.10137
  • Pdf link: https://arxiv.org/pdf/2111.10137
  • Abstract
    Existing salient instance detection (SID) methods typically learn from pixel-level annotated datasets. In this paper, we present the first weakly-supervised approach to the SID problem. Although weak supervision has been considered in general saliency detection, it is mainly based on using class labels for object localization. However, it is non-trivial to use only class labels to learn instance-aware saliency information, as salient instances with high semantic affinities may not be easily separated by the labels. As the subitizing information provides an instant judgement on the number of salient items, it is naturally related to detecting salient instances and may help separate instances of the same class while grouping different parts of the same instance. Inspired by this observation, we propose to use class and subitizing labels as weak supervision for the SID problem. We propose a novel weakly-supervised network with three branches: a Saliency Detection Branch leveraging class consistency information to locate candidate objects; a Boundary Detection Branch exploiting class discrepancy information to delineate object boundaries; and a Centroid Detection Branch using subitizing information to detect salient instance centroids. This complementary information is then fused to produce a salient instance map. To facilitate the learning process, we further propose a progressive training scheme to reduce label noise and the corresponding noise learned by the model, via reciprocating the model with progressive salient instance prediction and model refreshing. Our extensive evaluations show that the proposed method plays favorably against carefully designed baseline methods adapted from related tasks.
