| A Stitch in Time Saves Nine: A Train-Time Regularizing Loss for Improved Neural Network Calibration [supp] |
| The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy [supp] |
| Deep Image-Based Illumination Harmonization [supp] |
| ViM: Out-of-Distribution With Virtual-Logit Matching [supp] |
| Active Learning by Feature Mixing [supp] |
| Towards Accurate Facial Landmark Detection via Cascaded Transformers [supp] |
| Class-Aware Contrastive Semi-Supervised Learning [supp] |
| Long-Term Visual Map Sparsification With Heterogeneous GNN [supp] |
| Debiased Learning From Naturally Imbalanced Pseudo-Labels |
| RNNPose: Recurrent 6-DoF Object Pose Refinement With Robust Correspondence Field Estimation and Pose Optimization [supp] |
| Ditto: Building Digital Twins of Articulated Objects From Interaction [supp] |
| Dual-AI: Dual-Path Actor Interaction Learning for Group Activity Recognition [supp] |
| Harmony: A Generic Unsupervised Approach for Disentangling Semantic Content From Parameterized Transformations [supp] |
| Talking Face Generation With Multilingual TTS |
| A Brand New Dance Partner: Music-Conditioned Pluralistic Dancing Controlled by Multiple Dance Genres [supp] |
| Kernelized Few-Shot Object Detection With Efficient Integral Aggregation [supp] |
| Transformer Based Line Segment Classifier With Image Context for Real-Time Vanishing Point Detection in Manhattan World |
| Self-Sustaining Representation Expansion for Non-Exemplar Class-Incremental Learning [supp] |
| Adaptive Early-Learning Correction for Segmentation From Noisy Annotations [supp] |
| Cross-Domain Correlation Distillation for Unsupervised Domain Adaptation in Nighttime Semantic Segmentation [supp] |
| Context-Aware Video Reconstruction for Rolling Shutter Cameras [supp] |
| Towards Efficient Data Free Black-Box Adversarial Attack |
| Robust Contrastive Learning Against Noisy Views [supp] |
| More Than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech [supp] |
| Cross-Modal Perceptionist: Can Face Geometry Be Gleaned From Voices? [supp] |
| On Generalizing Beyond Domains in Cross-Domain Continual Learning [supp] |
| RSTT: Real-Time Spatial Temporal Transformer for Space-Time Video Super-Resolution [supp] |
| Learning Memory-Augmented Unidirectional Metrics for Cross-Modality Person Re-Identification [supp] |
| A Closer Look at Few-Shot Image Generation [supp] |
| Depth-Supervised NeRF: Fewer Views and Faster Training for Free |
| Unsupervised Domain Generalization by Learning a Bridge Across Domains [supp] |
| Partial Class Activation Attention for Semantic Segmentation |
| Multi-Scale Memory-Based Video Deblurring [supp] |
| SkinningNet: Two-Stream Graph Convolutional Neural Network for Skinning Prediction of Synthetic Characters [supp] |
| A Scalable Combinatorial Solver for Elastic Geometrically Consistent 3D Shape Matching [supp] |
| Learning Trajectory-Aware Transformer for Video Super-Resolution [supp] |
| Differentiable Dynamics for Articulated 3D Human Motion Reconstruction [supp] |
| Geometric Structure Preserving Warp for Natural Image Stitching [supp] |
| GOAL: Generating 4D Whole-Body Motion for Hand-Object Grasping [supp] |
| Multi-Robot Active Mapping via Neural Bipartite Graph Matching [supp] |
| Adversarial Texture for Fooling Person Detectors in the Physical World [supp] |
| Focal Length and Object Pose Estimation via Render and Compare [supp] |
| TO-FLOW: Efficient Continuous Normalizing Flows With Temporal Optimization Adjoint With Moving Speed [supp] |
| Arbitrary-Scale Image Synthesis [supp] |
| Cross-Modal Representation Learning for Zero-Shot Action Recognition [supp] |
| Conditional Prompt Learning for Vision-Language Models |
| Graph Sampling Based Deep Metric Learning for Generalizable Person Re-Identification [supp] |
| Retrieval-Based Spatially Adaptive Normalization for Semantic Image Synthesis [supp] |
| Undoing the Damage of Label Shift for Cross-Domain Semantic Segmentation |
| GPV-Pose: Category-Level Object Pose Estimation via Geometry-Guided Point-Wise Voting [supp] |
| Dynamic 3D Gaze From Afar: Deep Gaze Estimation From Temporal Eye-Head-Body Coordination [supp] |
| Expressive Talking Head Generation With Granular Audio-Visual Control [supp] |
| Trustworthy Long-Tailed Classification [supp] |
| Primitive3D: 3D Object Dataset Synthesis From Randomly Assembled Primitives [supp] |
| Mix and Localize: Localizing Sound Sources in Mixtures |
| FisherMatch: Semi-Supervised Rotation Regression via Entropy-Based Filtering [supp] |
| NPBG++: Accelerating Neural Point-Based Graphics [supp] |
| SphericGAN: Semi-Supervised Hyper-Spherical Generative Adversarial Networks for Fine-Grained Image Synthesis |
| HairMapper: Removing Hair From Portraits Using GANs [supp] |
| Affine Medical Image Registration With Coarse-To-Fine Vision Transformer [supp] |
| SMPL-A: Modeling Person-Specific Deformable Anatomy [supp] |
| Image Dehazing Transformer With Transmission-Aware 3D Position Embedding [supp] |
| Out-of-Distribution Generalization With Causal Invariant Transformations [supp] |
| Panoptic-PHNet: Towards Real-Time and High-Precision LiDAR Panoptic Segmentation via Clustering Pseudo Heatmap [supp] |
| Dual-Key Multimodal Backdoors for Visual Question Answering [supp] |
| A Differentiable Two-Stage Alignment Scheme for Burst Image Reconstruction With Large Shift [supp] |
| Unifying Panoptic Segmentation for Autonomous Driving [supp] |
| Learning Motion-Dependent Appearance for High-Fidelity Rendering of Dynamic Humans From a Single Camera [supp] |
| On the Road to Online Adaptation for Semantic Image Segmentation [supp] |
| Deformable ProtoPNet: An Interpretable Image Classifier Using Deformable Prototypes [supp] |
| Context-Aware Sequence Alignment Using 4D Skeletal Augmentation [supp] |
| Perturbed and Strict Mean Teachers for Semi-Supervised Semantic Segmentation [supp] |
| Motion-Modulated Temporal Fragment Alignment Network for Few-Shot Action Recognition |
| Focal Sparse Convolutional Networks for 3D Object Detection [supp] |
| Masked Autoencoders Are Scalable Vision Learners [supp] |
| Point-BERT: Pre-Training 3D Point Cloud Transformers With Masked Point Modeling [supp] |
| Nested Collaborative Learning for Long-Tailed Visual Recognition [supp] |
| Crowd Counting in the Frequency Domain [supp] |
| Restormer: Efficient Transformer for High-Resolution Image Restoration [supp] |
| STRPM: A Spatiotemporal Residual Predictive Model for High-Resolution Video Prediction |
| Learning From Untrimmed Videos: Self-Supervised Video Representation Learning With Hierarchical Consistency [supp] |
| Aladdin: Joint Atlas Building and Diffeomorphic Registration Learning With Pairwise Alignment [supp] |
| IFRNet: Intermediate Feature Refine Network for Efficient Frame Interpolation [supp] |
| Large Loss Matters in Weakly Supervised Multi-Label Classification [supp] |
| Toward Practical Monocular Indoor Depth Estimation [supp] |
| Attention Concatenation Volume for Accurate and Efficient Stereo Matching |
| Learning Distinctive Margin Toward Active Domain Adaptation [supp] |
| Zero-Query Transfer Attacks on Context-Aware Object Detectors [supp] |
| Neural Inertial Localization [supp] |
| Speed Up Object Detection on Gigapixel-Level Images With Patch Arrangement |
| Finding Fallen Objects via Asynchronous Audio-Visual Integration |
| Learning sRGB-to-Raw-RGB De-Rendering With Content-Aware Metadata [supp] |
| GraftNet: Towards Domain Generalized Stereo Matching With a Broad-Spectrum and Task-Oriented Feature [supp] |
| Towards Total Recall in Industrial Anomaly Detection [supp] |
| DTA: Physical Camouflage Attacks Using Differentiable Transformation Network [supp] |
| Neural Recognition of Dashed Curves With Gestalt Law of Continuity [supp] |
| Semi-Supervised Object Detection via Multi-Instance Alignment With Global Class Prototypes [supp] |
| HODOR: High-Level Object Descriptors for Object Re-Segmentation in Video Learned From Static Images [supp] |
| Point Cloud Color Constancy [supp] |
| VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [supp] |
| Catching Both Gray and Black Swans: Open-Set Supervised Anomaly Detection [supp] |
| MLSLT: Towards Multilingual Sign Language Translation [supp] |
| Towards an End-to-End Framework for Flow-Guided Video Inpainting [supp] |
| Contrastive Test-Time Adaptation |
| Multimodal Colored Point Cloud to Image Alignment [supp] |
| MotionAug: Augmentation With Physical Correction for Human Motion Prediction [supp] |
| Active Teacher for Semi-Supervised Object Detection |
| CrossLoc: Scalable Aerial Localization Assisted by Multimodal Synthetic Data [supp] |
| Audio-Adaptive Activity Recognition Across Video Domains [supp] |
| Collaborative Learning for Hand and Object Reconstruction With Attention-Guided Graph Convolution [supp] |
| On Learning Contrastive Representations for Learning With Noisy Labels [supp] |
| Unsupervised Deraining: Where Contrastive Learning Meets Self-Similarity [supp] |
| Modeling Indirect Illumination for Inverse Rendering |
| BACON: Band-Limited Coordinate Networks for Multiscale Scene Representation [supp] |
| Regional Semantic Contrast and Aggregation for Weakly Supervised Semantic Segmentation [supp] |
| Class Re-Activation Maps for Weakly-Supervised Semantic Segmentation |
| TransWeather: Transformer-Based Restoration of Images Degraded by Adverse Weather Conditions |
| Merry Go Round: Rotate a Frame and Fool a DNN [supp] |
| H2FA R-CNN: Holistic and Hierarchical Feature Alignment for Cross-Domain Weakly Supervised Object Detection [supp] |
| Modeling sRGB Camera Noise With Normalizing Flows [supp] |
| A ConvNet for the 2020s [supp] |
| Reference-Based Video Super-Resolution Using Multi-Camera Video Triplets [supp] |
| Self-Supervised Image Representation Learning With Geometric Set Consistency [supp] |
| Deep Anomaly Discovery From Unlabeled Videos via Normality Advantage and Self-Paced Refinement [supp] |
| P3Depth: Monocular Depth Estimation With a Piecewise Planarity Prior [supp] |
| GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection |
| Simple Multi-Dataset Detection [supp] |
| MLP-3D: A MLP-Like 3D Architecture With Grouped Time Mixing |
| Proactive Image Manipulation Detection [supp] |
| Sketch3T: Test-Time Training for Zero-Shot SBIR [supp] |
| BANMo: Building Animatable 3D Neural Models From Many Casual Videos [supp] |
| StyTr2: Image Style Transfer With Transformers [supp] |
| Towards Discriminative Representation: Multi-View Trajectory Contrastive Learning for Online Multi-Object Tracking |
| Global Matching With Overlapping Attention for Optical Flow Estimation [supp] |
| Language As Queries for Referring Video Object Segmentation [supp] |
| Investigating the Impact of Multi-LiDAR Placement on Object Detection for Autonomous Driving [supp] |
| MViTv2: Improved Multiscale Vision Transformers for Classification and Detection [supp] |
| Audio-Visual Generalised Zero-Shot Learning With Cross-Modal Attention and Language [supp] |
| Rethinking Efficient Lane Detection via Curve Modeling [supp] |
| GreedyNASv2: Greedier Search With a Greedy Path Filter [supp] |
| Self-Supervised Arbitrary-Scale Point Clouds Upsampling via Implicit Neural Representation |
| Co-Advise: Cross Inductive Bias Distillation |
| AdaMixer: A Fast-Converging Query-Based Object Detector [supp] |
| DTFD-MIL: Double-Tier Feature Distillation Multiple Instance Learning for Histopathology Whole Slide Image Classification [supp] |
| BEVT: BERT Pretraining of Video Transformers [supp] |
| Deep Generalized Unfolding Networks for Image Restoration |
| Automatic Relation-Aware Graph Network Proliferation [supp] |
| AIM: An Auto-Augmenter for Images and Meshes |
| VISOLO: Grid-Based Space-Time Aggregation for Efficient Online Video Instance Segmentation [supp] |
| Deep Unlearning via Randomized Conditionally Independent Hessians [supp] |
| Patch-Level Representation Learning for Self-Supervised Vision Transformers [supp] |
| Sylph: A Hypernetwork Framework for Incremental Few-Shot Object Detection |
| Incremental Learning in Semantic Segmentation From Image Labels [supp] |
| Playable Environments: Video Manipulation in Space and Time [supp] |
| Robust Cross-Modal Representation Learning With Progressive Self-Distillation [supp] |
| What To Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions [supp] |
| Compressive Single-Photon 3D Cameras [supp] |
| Stereo Magnification With Multi-Layer Images [supp] |
| CO-SNE: Dimensionality Reduction and Visualization for Hyperbolic Data [supp] |
| Revisiting Skeleton-Based Action Recognition [supp] |
| Rethinking Controllable Variational Autoencoders [supp] |
| Contextual Instance Decoupling for Robust Multi-Person Pose Estimation |
| LMGP: Lifted Multicut Meets Geometry Projections for Multi-Camera Multi-Object Tracking [supp] |
| Boosting Crowd Counting via Multifaceted Attention |
| Stereo Depth From Events Cameras: Concentrate and Focus on the Future [supp] |
| A Probabilistic Graphical Model Based on Neural-Symbolic Reasoning for Visual Relationship Detection |
| A Simple Data Mixing Prior for Improving Self-Supervised Learning |
| Knowledge Distillation As Efficient Pre-Training: Faster Convergence, Higher Data-Efficiency, and Better Transferability [supp] |
| LOLNeRF: Learn From One Look [supp] |
| Geometry-Aware Guided Loss for Deep Crack Recognition |
| Multi-Modal Alignment Using Representation Codebook |
| Maintaining Reasoning Consistency in Compositional Visual Question Answering [supp] |
| Structure-Aware Motion Transfer With Deformable Anchor Model [supp] |
| BigDL 2.0: Seamless Scaling of AI Pipelines From Laptops to Distributed Cluster [supp] |
| Integrative Few-Shot Learning for Classification and Segmentation [supp] |
| Acquiring a Dynamic Light Field Through a Single-Shot Coded Image [supp] |
| Attentive Fine-Grained Structured Sparsity for Image Restoration [supp] |
| Pix2NeRF: Unsupervised Conditional π-GAN for Single Image to Neural Radiance Fields Translation [supp] |
| HARA: A Hierarchical Approach for Robust Rotation Averaging [supp] |
| Diffusion Autoencoders: Toward a Meaningful and Decodable Representation [supp] |
| Learning Fair Classifiers With Partially Annotated Group Labels [supp] |
| StylizedNeRF: Consistent 3D Scene Stylization As Stylized NeRF via 2D-3D Mutual Learning [supp] |
| NightLab: A Dual-Level Architecture With Hardness Detection for Segmentation at Night [supp] |
| Knowledge Distillation With the Reused Teacher Classifier [supp] |
| Contrastive Learning for Unsupervised Video Highlight Detection |
| InfoGCN: Representation Learning for Human Skeleton-Based Action Recognition [supp] |
| Rethinking Image Cropping: Exploring Diverse Compositions From Global Views [supp] |
| Constrained Few-Shot Class-Incremental Learning [supp] |
| Self-Supervised Material and Texture Representation Learning for Remote Sensing Tasks [supp] |
| Threshold Matters in WSSS: Manipulating the Activation for the Robust and Accurate Segmentation Model Against Thresholds [supp] |
| Data-Free Network Compression via Parametric Non-Uniform Mixed Precision Quantization [supp] |
| Sparse to Dense Dynamic 3D Facial Expression Generation [supp] |
| Think Twice Before Detecting GAN-Generated Fake Images From Their Spectral Domain Imprints [supp] |
| Crafting Better Contrastive Views for Siamese Representation Learning |
| RSCFed: Random Sampling Consensus Federated Semi-Supervised Learning [supp] |
| TransMVSNet: Global Context-Aware Multi-View Stereo Network With Transformers [supp] |
| ROCA: Robust CAD Model Retrieval and Alignment From a Single Image [supp] |
| Continual Learning for Visual Search With Backward Consistent Feature Embedding [supp] |
| iFS-RCNN: An Incremental Few-Shot Instance Segmenter [supp] |
| DPGEN: Differentially Private Generative Energy-Guided Network for Natural Image Synthesis [supp] |
| MetaFSCIL: A Meta-Learning Approach for Few-Shot Class Incremental Learning [supp] |
| The Majority Can Help the Minority: Context-Rich Minority Oversampling for Long-Tailed Classification [supp] |
| Dense Depth Priors for Neural Radiance Fields From Sparse Input Views [supp] |
| EyePAD++: A Distillation-Based Approach for Joint Eye Authentication and Presentation Attack Detection Using Periocular Images [supp] |
| IntentVizor: Towards Generic Query Guided Interactive Video Summarization [supp] |
| Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks [supp] |
| Camera Pose Estimation Using Implicit Distortion Models [supp] |
| Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations [supp] |
| Shape-Invariant 3D Adversarial Point Clouds [supp] |
| LAS-AT: Adversarial Training With Learnable Attack Strategy [supp] |
| Bootstrapping ViTs: Towards Liberating Vision Transformers From Pre-Training [supp] |
| PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents |
| Styleformer: Transformer Based Generative Adversarial Networks With Style Vector [supp] |
| Efficient Two-Stage Detection of Human-Object Interactions With a Novel Unary-Pairwise Transformer [supp] |
| ELSR: Efficient Line Segment Reconstruction With Planes and Points Guidance [supp] |
| Meta-Attention for ViT-Backed Continual Learning [supp] |
| DST: Dynamic Substitute Training for Data-Free Black-Box Attack |
| Photorealistic Monocular 3D Reconstruction of Humans Wearing Clothing [supp] |
| A Low-Cost & Real-Time Motion Capture System |
| Unified Contrastive Learning in Image-Text-Label Space [supp] |
| Unifying Motion Deblurring and Frame Interpolation With Events [supp] |
| Generalizing Interactive Backpropagating Refinement for Dense Prediction Networks [supp] |
| Unsupervised Pre-Training for Temporal Action Localization Tasks [supp] |
| Light Field Neural Rendering [supp] |
| Fast Point Transformer [supp] |
| Look Outside the Room: Synthesizing a Consistent Long-Term 3D Scene Video From a Single Image [supp] |
| Unimodal-Concentrated Loss: Fully Adaptive Label Distribution Learning for Ordinal Regression [supp] |
| Augmented Geometric Distillation for Data-Free Incremental Person ReID [supp] |
| Deep Stereo Image Compression via Bi-Directional Coding |
| Come-Closer-Diffuse-Faster: Accelerating Conditional Diffusion Models for Inverse Problems Through Stochastic Contraction [supp] |
| Smooth-Swap: A Simple Enhancement for Face-Swapping With Smoothness [supp] |
| Full-Range Virtual Try-On With Recurrent Tri-Level Transform [supp] |
| Style Neophile: Constantly Seeking Novel Styles for Domain Generalization |
| High-Fidelity Human Avatars From a Single RGB Camera [supp] |
| ADAPT: Vision-Language Navigation With Modality-Aligned Action Prompts [supp] |
| Multiview Transformers for Video Recognition [supp] |
| RIO: Rotation-Equivariance Supervised Learning of Robust Inertial Odometry [supp] |
| How Good Is Aesthetic Ability of a Fashion Model? [supp] |
| Mining Multi-View Information: A Strong Self-Supervised Framework for Depth-Based 3D Hand Pose and Mesh Estimation [supp] |
| Automated Progressive Learning for Efficient Training of Vision Transformers [supp] |
| BTS: A Bi-Lingual Benchmark for Text Segmentation in the Wild [supp] |
| Learning Structured Gaussians To Approximate Deep Ensembles [supp] |
| Adaptive Trajectory Prediction via Transferable GNN [supp] |
| Total Variation Optimization Layers for Computer Vision |
| Defensive Patches for Robust Recognition in the Physical World [supp] |
| Single-Stage Is Enough: Multi-Person Absolute 3D Pose Estimation [supp] |
| Deformation and Correspondence Aware Unsupervised Synthetic-to-Real Scene Flow Estimation for Point Clouds [supp] |
| Learn From Others and Be Yourself in Heterogeneous Federated Learning |
| Sequential Voting With Relational Box Fields for Active Object Detection [supp] |
| Semantic-Aware Auto-Encoders for Self-Supervised Representation Learning |
| Learning Transferable Human-Object Interaction Detector With Natural Language Supervision |
| Fourier Document Restoration for Robust Document Dewarping and Recognition |
| Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection [supp] |
| Consistent Explanations by Contrastive Learning [supp] |
| Text2Pos: Text-to-Point-Cloud Cross-Modal Localization [supp] |
| MulT: An End-to-End Multitask Learning Transformer [supp] |
| Hierarchical Modular Network for Video Captioning [supp] |
| Learning With Neighbor Consistency for Noisy Labels [supp] |
| Depth Estimation by Combining Binocular Stereo and Monocular Structured-Light [supp] |
| Salient-to-Broad Transition for Video Person Re-Identification [supp] |
| Object-Region Video Transformers [supp] |
| DeeCap: Dynamic Early Exiting for Efficient Image Captioning |
| AME: Attention and Memory Enhancement in Hyper-Parameter Optimization [supp] |
| Alignment-Uniformity Aware Representation Learning for Zero-Shot Video Classification [supp] |
| RepMLPNet: Hierarchical Vision MLP With Re-Parameterized Locality [supp] |
| DR.VIC: Decomposition and Reasoning for Video Individual Counting [supp] |
| LiDARCap: Long-Range Marker-Less 3D Human Motion Capture With LiDAR Point Clouds [supp] |
| GeoEngine: A Platform for Production-Ready Geospatial Research |
| Revisiting Document Image Dewarping by Grid Regularization [supp] |
| Semi-Supervised Few-Shot Learning via Multi-Factor Clustering [supp] |
| CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation [supp] |
| Weakly-Supervised Generation and Grounding of Visual Descriptions With Conditional Generative Models [supp] |
| Novel Class Discovery in Semantic Segmentation [supp] |
| ARCS: Accurate Rotation and Correspondence Search [supp] |
| Learning To Anticipate Future With Dynamic Context Removal [supp] |
| GCFSR: A Generative and Controllable Face Super Resolution Method Without Facial and GAN Priors [supp] |
| Perception Prioritized Training of Diffusion Models [supp] |
| Using 3D Topological Connectivity for Ghost Particle Reduction in Flow Reconstruction [supp] |
| On the Integration of Self-Attention and Convolution [supp] |
| Progressively Generating Better Initial Guesses Towards Next Stages for High-Quality Human Motion Prediction [supp] |
| CHEX: CHannel EXploration for CNN Model Compression [supp] |
| M2I: From Factored Marginal Trajectory Prediction to Interactive Prediction |
| Domain Adaptation on Point Clouds via Geometry-Aware Implicits [supp] |
| Consistency Driven Sequential Transformers Attention Model for Partially Observable Scenes [supp] |
| GroupViT: Semantic Segmentation Emerges From Text Supervision [supp] |
| NeuralHOFusion: Neural Volumetric Rendering Under Human-Object Interactions [supp] |
| Generalizable Human Pose Triangulation [supp] |
| DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation [supp] |
| Occlusion-Aware Cost Constructor for Light Field Depth Estimation [supp] |
| SmartPortraits: Depth Powered Handheld Smartphone Dataset of Human Portraits for State Estimation, Reconstruction and Synthesis |
| BppAttack: Stealthy and Efficient Trojan Attacks Against Deep Neural Networks via Image Quantization and Contrastive Adversarial Learning [supp] |
| GlideNet: Global, Local and Intrinsic Based Dense Embedding NETwork for Multi-Category Attributes Prediction [supp] |
| Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation [supp] |
| Ensembling Off-the-Shelf Models for GAN Training |
| Towards Better Plasticity-Stability Trade-Off in Incremental Learning: A Simple Linear Connector |
| Topology-Preserving Shape Reconstruction and Registration via Neural Diffeomorphic Flow [supp] |
| Segment and Complete: Defending Object Detectors Against Adversarial Patch Attacks With Robust Patch Detection [supp] |
| Cross-Domain Few-Shot Learning With Task-Specific Adapters [supp] |
| MAXIM: Multi-Axis MLP for Image Processing [supp] |
| Learning Part Segmentation Through Unsupervised Domain Adaptation From Synthetic Vehicles [supp] |
| Delving Into the Estimation Shift of Batch Normalization in a Network [supp] |
| Towards Better Understanding Attribution Methods [supp] |
| Learning Object Context for Novel-View Scene Layout Generation |
| PSTR: End-to-End One-Step Person Search With Transformers |
| Neural Fields As Learnable Kernels for 3D Reconstruction [supp] |
| A Deeper Dive Into What Deep Spatiotemporal Networks Encode: Quantifying Static vs. Dynamic Information [supp] |
| Detector-Free Weakly Supervised Group Activity Recognition [supp] |
| NFormer: Robust Person Re-Identification With Neighbor Transformer [supp] |
| Joint Forecasting of Panoptic Segmentations With Difference Attention [supp] |
| HairCLIP: Design Your Hair by Text and Reference Image [supp] |
| Imposing Consistency for Optical Flow Estimation [supp] |
| Style Transformer for Image Inversion and Editing [supp] |
| OakInk: A Large-Scale Knowledge Repository for Understanding Hand-Object Interaction [supp] |
| Pyramid Adversarial Training Improves ViT Performance [supp] |
| Bridging Global Context Interactions for High-Fidelity Image Completion [supp] |
| SwinBERT: End-to-End Transformers With Sparse Attention for Video Captioning [supp] |
| Maximum Spatial Perturbation Consistency for Unpaired Image-to-Image Translation [supp] |
| Unseen Classes at a Later Time? No Problem [supp] |
| InfoNeRF: Ray Entropy Minimization for Few-Shot Neural Volume Rendering [supp] |
| Learning the Degradation Distribution for Blind Image Super-Resolution |
| Dist-PU: Positive-Unlabeled Learning From a Label Distribution Perspective [supp] |
| SC2-PCR: A Second Order Spatial Compatibility for Efficient and Robust Point Cloud Registration [supp] |
| Relative Pose From a Calibrated and an Uncalibrated Smartphone Image [supp] |
| Towards Robust and Reproducible Active Learning Using Neural Networks [supp] |
| Retrieval Augmented Classification for Long-Tail Visual Recognition [supp] |
| Not All Tokens Are Equal: Human-Centric Visual Analysis via Token Clustering Transformer [supp] |
| Temporally Efficient Vision Transformer for Video Instance Segmentation |
| The Devil Is in the Margin: Margin-Based Label Smoothing for Network Calibration [supp] |
| NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks [supp] |
| Bringing Old Films Back to Life [supp] |
| Sound and Visual Representation Learning With Multiple Pretraining Tasks |
| WarpingGAN: Warping Multiple Uniform Priors for Adversarial 3D Point Cloud Generation [supp] |
| RePaint: Inpainting Using Denoising Diffusion Probabilistic Models [supp] |
| Revealing Occlusions With 4D Neural Fields [supp] |
| Meta Agent Teaming Active Learning for Pose Estimation [supp] |
| Forward Propagation, Backward Regression, and Pose Association for Hand Tracking in the Wild |
| Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding [supp] |
| E2(GO)MOTION: Motion Augmented Event Stream for Egocentric Action Recognition [supp] |
| ES6D: A Computation Efficient and Symmetry-Aware 6D Pose Regression Framework [supp] |
| Self-Supervised Deep Image Restoration via Adaptive Stochastic Gradient Langevin Dynamics [supp] |
| Towards Discovering the Effectiveness of Moderately Confident Samples for Semi-Supervised Learning [supp] |
| OoD-Bench: Quantifying and Understanding Two Dimensions of Out-of-Distribution Generalization [supp] |
| An Empirical Study of Training End-to-End Vision-and-Language Transformers [supp] |
| Multimodal Dynamics: Dynamical Fusion for Trustworthy Multimodal Classification [supp] |
| The Neurally-Guided Shape Parser: Grammar-Based Labeling of 3D Shape Regions With Approximate Inference [supp] |
| Unsupervised Homography Estimation With Coplanarity-Aware GAN [supp] |
| LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection [supp] |
| AutoLoss-Zero: Searching Loss Functions From Scratch for Generic Tasks [supp] |
| PatchNet: A Simple Face Anti-Spoofing Framework via Fine-Grained Patch Recognition [supp] |
| OnePose: One-Shot Object Pose Estimation Without CAD Models |
| Weakly-Supervised Online Action Segmentation in Multi-View Instructional Videos [supp] |
| Rethinking Minimal Sufficient Representation in Contrastive Learning [supp] |
| Disentangling Visual Embeddings for Attributes and Objects [supp] |
| Scalable Penalized Regression for Noise Detection in Learning With Noisy Labels |
| Effective Conditioned and Composed Image Retrieval Combining CLIP-Based Features |
| Registering Explicit to Implicit: Towards High-Fidelity Garment Mesh Reconstruction From Single Images [supp] |
| Federated Class-Incremental Learning [supp] |
| MiniViT: Compressing Vision Transformers With Weight Multiplexing [supp] |
| Practical Stereo Matching via Cascaded Recurrent Network With Adaptive Correlation [supp] |
| D-Grasp: Physically Plausible Dynamic Grasp Synthesis for Hand-Object Interactions [supp] |
| Show, Deconfound and Tell: Image Captioning With Causal Inference [supp] |
| Extracting Triangular 3D Models, Materials, and Lighting From Images [supp] |
| Weakly Supervised Segmentation on Outdoor 4D Point Clouds With Temporal Matching and Spatial Graph Propagation [supp] |
| ImFace: A Nonlinear 3D Morphable Face Model With Implicit Neural Representations [supp] |
| MobRecon: Mobile-Friendly Hand Mesh Reconstruction From Monocular Image [supp] |
| Layered Depth Refinement With Mask Guidance [supp] |
| Parameter-Free Online Test-Time Adaptation [supp] |
| SIGMA: Semantic-Complete Graph Matching for Domain Adaptive Object Detection [supp] |
| Global Convergence of MAML and Theory-Inspired Neural Architecture Search for Few-Shot Learning [supp] |
| LAKe-Net: Topology-Aware Point Cloud Completion by Localizing Aligned Keypoints [supp] |
| Scribble-Supervised LiDAR Semantic Segmentation [supp] |
| AlignMixup: Improving Representations by Interpolating Aligned Features [supp] |
| No Pain, Big Gain: Classify Dynamic Point Cloud Sequences With Static Models by Fitting Feature-Level Space-Time Surfaces [supp] |
| HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction [supp] |
| HerosNet: Hyperspectral Explicable Reconstruction and Optimal Sampling Deep Network for Snapshot Compressive Imaging |
| Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space |
| Brain-Inspired Multilayer Perceptron With Spiking Neurons |
| Learning To Estimate Robust 3D Human Mesh From In-the-Wild Crowded Scenes [supp] |
| ObjectFormer for Image Manipulation Detection and Localization |
| Detecting Deepfakes With Self-Blended Images [supp] |
| Correlation-Aware Deep Tracking [supp] |
| Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos [supp] |
| NeurMiPs: Neural Mixture of Planar Experts for View Synthesis [supp] |
| Implicit Sample Extension for Unsupervised Person Re-Identification |
| Energy-Based Latent Aligner for Incremental Learning [supp] |
| Towards Semi-Supervised Deep Facial Expression Recognition With an Adaptive Confidence Margin [supp] |
| GanOrCon: Are Generative Models Useful for Few-Shot Segmentation? [supp] |
| Bi-Level Doubly Variational Learning for Energy-Based Latent Variable Models [supp] |
| SplitNets: Designing Neural Architectures for Efficient Distributed Computing on Head-Mounted Systems [supp] |
| Masked-Attention Mask Transformer for Universal Image Segmentation [supp] |
| Reading To Listen at the Cocktail Party: Multi-Modal Speech Separation |
| AxIoU: An Axiomatically Justified Measure for Video Moment Retrieval [supp] |
| NOC-REK: Novel Object Captioning With Retrieved Vocabulary From External Knowledge [supp] |
| Boosting Robustness of Image Matting With Context Assembling and Strong Data Augmentation [supp] |
| Group R-CNN for Weakly Semi-Supervised Object Detection With Points [supp] |
| Weakly-Supervised Action Transition Learning for Stochastic Human Motion Prediction [supp] |
| Speech Driven Tongue Animation [supp] |
| Hybrid Relation Guided Set Matching for Few-Shot Action Recognition [supp] |
| Self-Supervised Spatial Reasoning on Multi-View Line Drawings [supp] |
| Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation |
| Cross-Patch Dense Contrastive Learning for Semi-Supervised Segmentation of Cellular Nuclei in Histopathologic Images [supp] |
| Frame-Wise Action Representations for Long Videos via Sequence Contrastive Learning [supp] |
| Coarse-To-Fine Deep Video Coding With Hyperprior-Guided Mode Prediction |
| Generalized Binary Search Network for Highly-Efficient Multi-View Stereo [supp] |
| SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation [supp] |
| Adaptive Hierarchical Representation Learning for Long-Tailed Object Detection [supp] |
| FlexIT: Towards Flexible Semantic Image Translation [supp] |
| Face2Exp: Combating Data Biases for Facial Expression Recognition |
| SAR-Net: Shape Alignment and Recovery Network for Category-Level 6D Object Pose and Size Estimation [supp] |
| Whose Hands Are These? Hand Detection and Hand-Body Association in the Wild [supp] |
| Mega-NeRF: Scalable Construction of Large-Scale NeRFs for Virtual Fly-Throughs [supp] |
| PINA: Learning a Personalized Implicit Neural Avatar From a Single RGB-D Video Sequence [supp] |
| Forecasting From LiDAR via Future Object Detection [supp] |
| CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow [supp] |
| Adversarial Eigen Attack on Black-Box Models [supp] |
| Training Quantised Neural Networks With STE Variants: The Additive Noise Annealing Algorithm [supp] |
| Split Hierarchical Variational Compression [supp] |
| Video Swin Transformer |
| Privacy Preserving Partial Localization [supp] |
| Cross-Modal Background Suppression for Audio-Visual Event Localization |
| Mutual Quantization for Cross-Modal Search With Noisy Labels |
| Lagrange Motion Analysis and View Embeddings for Improved Gait Recognition [supp] |
| SphereSR: 360° Image Super-Resolution With Arbitrary Projection via Continuous Spherical Image Representation [supp] |
| Neural Mesh Simplification [supp] |
| Cloth-Changing Person Re-Identification From a Single Image With Gait Prediction and Regularization [supp] |
| BoxeR: Box-Attention for 2D and 3D Transformers [supp] |
| Neural Architecture Search With Representation Mutual Information [supp] |
| Deep Hyperspectral-Depth Reconstruction Using Single Color-Dot Projection [supp] |
| M3T: Three-Dimensional Medical Image Classifier Using Multi-Plane and Multi-Slice Transformer [supp] |
| 3MASSIV: Multilingual, Multimodal and Multi-Aspect Dataset of Social Media Short Videos [supp] |
| Can Neural Nets Learn the Same Model Twice? Investigating Reproducibility and Double Descent From the Decision Boundary Perspective [supp] |
| Cross Domain Object Detection by Target-Perceived Dual Branch Distillation [supp] |
| A Proposal-Based Paradigm for Self-Supervised Sound Source Localization in Videos |
| Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation |
| GroupNet: Multiscale Hypergraph Neural Networks for Trajectory Prediction With Relational Reasoning [supp] |
| Unbiased Subclass Regularization for Semi-Supervised Semantic Segmentation |
| P3IV: Probabilistic Procedure Planning From Instructional Videos With Weak Supervision [supp] |
| Hierarchical Nearest Neighbor Graph Embedding for Efficient Dimensionality Reduction [supp] |
| Coupled Iterative Refinement for 6D Multi-Object Pose Estimation [supp] |
| Multi-View Transformer for 3D Visual Grounding |
| Structured Sparse R-CNN for Direct Scene Graph Generation [supp] |
| Multi-Grained Spatio-Temporal Features Perceived Network for Event-Based Lip-Reading [supp] |
| Semi-Supervised Video Paragraph Grounding With Contrastive Encoder |
| Continual Predictive Learning From Videos [supp] |
| Weakly Paired Associative Learning for Sound and Image Representations via Bimodal Associative Memory [supp] |
| BARC: Learning To Regress 3D Dog Shape From Images by Exploiting Breed Information [supp] |
| Knowledge Distillation: A Good Teacher Is Patient and Consistent [supp] |
| PCA-Based Knowledge Distillation Towards Lightweight and Content-Style Balanced Photorealistic Style Transfer Models [supp] |
| Frame Averaging for Equivariant Shape Space Learning [supp] |
| Transformer Tracking With Cyclic Shifting Window Attention [supp] |
| ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues |
| Towards Understanding Adversarial Robustness of Optical Flow Networks [supp] |
| Panoptic SegFormer: Delving Deeper Into Panoptic Segmentation With Transformers [supp] |
| Training High-Performance Low-Latency Spiking Neural Networks by Differentiation on Spike Representation [supp] |
| AnyFace: Free-Style Text-To-Face Synthesis and Manipulation |
| HL-Net: Heterophily Learning Network for Scene Graph Generation [supp] |
| Lifelong Graph Learning [supp] |
| Hypergraph-Induced Semantic Tuplet Loss for Deep Metric Learning [supp] |
| Computing Wasserstein-p Distance Between Images With Linear Cost [supp] |
| DLFormer: Discrete Latent Transformer for Video Inpainting |
| Unsupervised Representation Learning for Binary Networks by Joint Classifier Learning [supp] |
| High Quality Segmentation for Ultra High-Resolution Images [supp] |
| Investigating Tradeoffs in Real-World Video Super-Resolution [supp] |
| MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound [supp] |
| Differentiable Stereopsis: Meshes From Multiple Views Using Differentiable Rendering [supp] |
| Towards Practical Certifiable Patch Defense With Vision Transformer |
| A Conservative Approach for Unbiased Learning on Unknown Biases [supp] |
| Large-Scale Video Panoptic Segmentation in the Wild: A Benchmark [supp] |
| Label, Verify, Correct: A Simple Few Shot Object Detection Method |
| Aesthetic Text Logo Synthesis via Content-Aware Layout Inferring [supp] |
| Global Tracking via Ensemble of Local Trackers [supp] |
| Autoregressive Image Generation Using Residual Quantization [supp] |
| MPC: Multi-View Probabilistic Clustering [supp] |
| End-to-End Compressed Video Representation Learning for Generic Event Boundary Detection |
| GrainSpace: A Large-Scale Dataset for Fine-Grained and Domain-Adaptive Recognition of Cereal Grains [supp] |
| BokehMe: When Neural Rendering Meets Classical Rendering [supp] |
| Learning Modal-Invariant and Temporal-Memory for Video-Based Visible-Infrared Person Re-Identification [supp] |
| MSDN: Mutually Semantic Distillation Network for Zero-Shot Learning |
| Oriented RepPoints for Aerial Object Detection |
| OccAM's Laser: Occlusion-Based Attribution Maps for 3D Object Detectors on LiDAR Data [supp] |
| BigDatasetGAN: Synthesizing ImageNet With Pixel-Wise Annotations [supp] |
| Align Representations With Base: A New Approach to Self-Supervised Learning |
| Exploring Denoised Cross-Video Contrast for Weakly-Supervised Temporal Action Localization |
| Pre-Train, Self-Train, Distill: A Simple Recipe for Supersizing 3D Reconstruction [supp] |
| Meta Distribution Alignment for Generalizable Person Re-Identification |
| TeachAugment: Data Augmentation Optimization Using Teacher Knowledge [supp] |
| SVIP: Sequence VerIfication for Procedures in Videos [supp] |
| Weakly Supervised Temporal Sentence Grounding With Gaussian-Based Contrastive Proposal Learning |
| Low-Resource Adaptation for Personalized Co-Speech Gesture Generation [supp] |
| BoosterNet: Improving Domain Generalization of Deep Neural Nets Using Culpability-Ranked Features |
| Task-Specific Inconsistency Alignment for Domain Adaptive Object Detection [supp] |
| HDR-NeRF: High Dynamic Range Neural Radiance Fields [supp] |
| MS2DG-Net: Progressive Correspondence Learning via Multiple Sparse Semantics Dynamic Graph |
| Neural Emotion Director: Speech-Preserving Semantic Control of Facial Expressions in "In-the-Wild" Videos [supp] |
| Learning To Listen: Modeling Non-Deterministic Dyadic Facial Motion |
| 3PSDF: Three-Pole Signed Distance Function for Learning Surfaces With Arbitrary Topologies [supp] |
| Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation From Monocular Video [supp] |
| MixFormer: End-to-End Tracking With Iterative Mixed Attention [supp] |