iOPENCap/awesome-change-caption

awesome-change-caption

The project is currently under construction

Papers

Lite Version

Full Version

Paper info Description
Least Squares Twin Extreme Learning Machine for Pattern Classification
Rastogi, R. ; Bharti, A.
> South Asian University, New Delhi, 110021, Delhi, India
> ICIIF 2018
> Extreme learning machine, Twin support vector machine, Classification, recognition
> Cited by 2594

The paper proposes the Least Squares Twin Extreme Learning Machine (LSTELM) for pattern classification. The LSTELM formulation solves the Extreme Learning Machine (ELM) problem in a twin framework.
A Survey of Computer Vision-Based Human Motion Capture
Thomas B. Moeslund and Erik Granum
> Laboratory of Computer Vision and Media Technology, Aalborg University, Niels Jernes Vej 14, Aalborg, 9220, Denmark
> ScienceDirect 2001
> initialization, tracking, pose estimation, Least squares
> Cited by 6

The paper presents a comprehensive survey of the computer vision-based human motion capture literature from the past two decades. It gives a general overview based on a taxonomy of system functionalities, broken down into four processes: initialization, tracking, pose estimation, and recognition.
A Statistical Approach to Texture Classification from Single Images
Manik Varma / Andrew Zisserman
> Robotics Research Group, Department of Engineering Science, University of Oxford, Oxford, OX1 3PJ, UK
> ICIIF 2018
> Extreme learning machine, Twin support vector machine, Classification, recognition
> Cited by 2594

The paper addresses texture classification from single images obtained under unknown viewpoint and illumination. It develops a statistical approach in which textures are modelled by the joint probability distribution of filter responses.
Twenty Years of Document Image Analysis in PAMI
George Nagy
> South Asian University, New Delhi, 110021, Delhi, India
> IEEE 2000
> Extreme learning machine, Twin support vector machine, Classification, Least squares
> Cited by 6

The contributions to document image analysis of 99 papers published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) are clustered, summarized, interpolated, interpreted, and tactfully evaluated.

Learning Distinct and Representative Modes for Image Captioning
Qi Chen, Chaorui Deng, Qi Wu
> NeurIPS 2022
> Computer Science, Computer Vision and Pattern Recognition
> Cited by 1

This paper proposes a method for learning discrete control signals from the training corpus. The authors assume that each caption corresponds to a mode, with the number of modes set as a hyperparameter. Each mode learned during training serves as a control signal at test time, and the embeddings of all control signals constitute the codebook.
Scene Graph with 3D Information for Change Captioning
Zeming Liao, Qingbao Huang, Yu Liang, Mingyi Fu, Yi Cai, Qing Li
> School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China; School of Software Engineering, South China University of Technology, Guangzhou, China; Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China; Key Laboratory of Big Data and Intelligent Robot (SCUT), MOE of China; Guangxi Key Laboratory of Multimedia Communications and Network Technology; Institute of Artificial Intellige
> ACM 2021
> Computing methodologies, Scene understanding, Natural language generation
> Cited by 1

This paper proposes a 3D-aware Scene Graph-based Change Captioning (SGCC) model that extracts object semantics and 3D information. It constructs scene graphs for image pairs and aggregates node representations using graph convolution. SGCC helps observers quickly identify changes and is partially immune to viewpoint shifts. Experiments on the CLEVR-Change and Spot-the-Diff datasets show competitive performance.
ChaLearn Looking at People: IsoGD and ConGD Large-Scale RGB-D Gesture Recognition
Wan et al
> Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
> IEEE 2022
> Gesture recognition, Measurement, Task analysis, Training, Conferences, Computer vision, Bidirectional long short-term memory (Bi-LSTM), RGB-D
> Cited by 7

The paper describes the creation of two benchmark datasets, IsoGD and ConGD, and analyzes the advances in large-scale gesture recognition based on them.
MVTN: Multi-View Transformation Network for 3D Shape Recognition
Hamdi et al
> King Abdullah Univ Sci & Technol KAUST, Thuwal, Saudi Arabia
> IEEE 2021
> NEURAL-NETWORK,
> Cited by 23

The main contribution of this paper is the introduction of a Multi-View Transformation Network (MVTN) that can predict optimal viewpoints to improve 3D shape recognition performance.
GLiT: Neural Architecture Search for Global and Local Image Transformer
Chen et al
> Univ Sydney, Sydney, NSW, Australia
> IEEE 2021
> GLiT, Image Recognition, transformer, Neural architecture Search, Local information, Global Information, Self-attention, multi-head attention mechanism, ImageNet
> Cited by 23

This paper introduces GLiT, the first Neural Architecture Search (NAS) method to find a better transformer architecture specifically for image recognition.
Learning-Based Point Cloud Registration for 6D Object Pose Estimation in the Real World
Dang et al
>Ecole Polytech Fed Lausanne, CVLab, Lausanne, Switzerland
> IEEE 2022
> 6D object pose estimation, Point cloud registration
> Cited by 0

The paper addresses the challenges that learning-based 3D object registration algorithms face when estimating the 6D pose of an object from real-world point cloud data.
An End-to-End Transformer Model for Crowd Localization
Liang et al
>Huazhong Univ Sci & Technol, Wuhan 430074, Peoples R China
> IEEE 2022
> Crowd localization, Crowd counting, Transformer
> Cited by 3

This paper proposes an end-to-end Crowd Localization Transformer (CLTR) for crowd localization, which predicts the location of each instance (head position) in a crowd scene.
Large-Scale Pre-training for Person Re-identification with Noisy Labels
Fu et al
>Microsoft Res, Redmond, WA 98052 USA
> IEEE 2022
> person re-identification, pre-training, noisy labels, prototype-based contrastive learning, label-guided contrastive learning, deep learning, computer vision,
> Cited by 2

This paper proposes a framework for large-scale pre-training for person re-identification (Re-ID) with noisy labels. The framework, called PNL, consists of three learning modules: supervised Re-ID learning, prototype-based contrastive learning, and label-guided contrastive learning.
Clipped Hyperbolic Classifiers Are Super-Hyperbolic Classifiers
Guo et al
>UC Berkeley ICSI, Berkeley, CA 94720 USA
> IEEE 2022
> person re-identification, pre-training, noisy labels, prototype-based contrastive learning, label-guided contrastive learning, deep learning, computer vision,
> Cited by 41

This paper introduces a solution to the vanishing gradient problem in training Hyperbolic Neural Networks (HNNs), which is caused by the hybrid architecture connecting Euclidean features to a hyperbolic classifier.
CO-SNE: Dimensionality Reduction and Visualization for Hyperbolic Data
Guo et al
>UC Berkeley ICSI, Berkeley, CA 94720 USA
> IEEE 2022
> CO-SNE, hyper-likelihood distribution, high-dimensional data visualization, non-Euclidean space, hierarchical structure,
> Cited by 0

This paper proposes CO-SNE, a method for visualizing high-dimensional hyperbolic data in a low-dimensional hyperbolic space.
Latent Memory-augmented Graph Transformer for Visual Storytelling
Qi et al
>School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China
> IEEE 2021
> Visual Storytelling, Transformer, Scene Graph, Memory Network,
> Cited by 0

This paper addresses visual storytelling, the task of automatically generating a narrative story for an image stream.
Group-based Distinctive Image Captioning with Memory Attention
Wang et al
>Department of Computer Science, City University of Hong Kong
> IEEE 2021
> Image Caption, Distinctiveness, Memory Attention, Similar Image,
> Cited by 18

This paper proposes a Group-based Distinctive Captioning Model (GdisCap) that generates distinctive captions by comparing each image with the other images in a group of similar images and highlighting what is unique to it.
Human-like Controllable Image Captioning with Verb-specific Semantic Roles
Chen et al
>Zhejiang Univ, Hangzhou, Peoples R China
> IEEE 2021
> Controllable Image Captioning, Verb-specific Semantic Roles (VSR), Semantic Role Labeling (SRL), Semantic Structure Planner (SSP), Role-shift Caption Generation, Diverse Image Captioning, Visual Grounding,
> Cited by 17

This paper proposes a new control signal for Controllable Image Captioning (CIC), Verb-specific Semantic Roles (VSR), which accounts for both event-compatibility and sample-suitability requirements to make control more human-like.
Towards Accurate Text-based Image Captioning with Content Diversity Exploration
Xu et al
>South China Univ Technol, Guangzhou, Peoples R China
> IEEE 2021
> text-based image captioning, content diversity exploration, anchor proposal module, anchor captioning module, anchor-centred graph, OCR tokens,
> Cited by 15

This paper proposes Anchor-Captioner, a method for text-based image captioning that generates multiple captions, each describing a different part of an image in detail.
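
The SGCC entry above mentions node representations being aggregated with graph convolution. As a rough illustration of that general idea, the sketch below applies one simplified mean-aggregation graph-convolution step to a toy scene graph. It is not the SGCC architecture: the function name `graph_conv`, the toy nodes, and the identity weight matrix are all assumptions made for this example.

```python
from typing import List, Tuple

Vector = List[float]


def graph_conv(features: List[Vector],
               edges: List[Tuple[int, int]],
               weight: List[Vector]) -> List[Vector]:
    """One simplified graph-convolution step (hypothetical helper).

    Each node averages its own feature vector with those of its
    neighbours, then applies a shared linear map followed by ReLU.
    """
    n = len(features)
    # Undirected neighbourhoods, each including a self-loop.
    neighbours = [{i} for i in range(n)]
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    out = []
    for i in range(n):
        dim = len(features[i])
        # Mean-aggregate the features of the node and its neighbours.
        mean = [sum(features[j][d] for j in neighbours[i]) / len(neighbours[i])
                for d in range(dim)]
        # Shared linear projection; weight has shape (dim, out_dim).
        projected = [sum(mean[d] * weight[d][k] for d in range(dim))
                     for k in range(len(weight[0]))]
        out.append([max(0.0, v) for v in projected])  # ReLU
    return out


# Toy scene graph (assumed): 0 = "cube", 1 = "sphere", 2 = "left of" relation.
node_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scene_edges = [(0, 2), (2, 1)]
identity = [[1.0, 0.0], [0.0, 1.0]]
updated = graph_conv(node_features, scene_edges, identity)
```

With the identity weight, each updated vector is simply the neighbourhood mean, so the relation node ends up blending the features of both objects it connects.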

Datasets

Dataset info Description
CLEVR-Change
> Stanford University
>English
> 2019
>N/A
> 16,000 images

The CLEVR-Change dataset extends CLEVR for change captioning: it pairs images of a scene before and after a change to its objects with captions describing that change.
TextCaps
> Peking University
>English
> 2021
>N/A
> 28,000 images

The TextCaps dataset targets image captioning with reading comprehension: captions must describe an image while taking into account the text that appears in it.
COCO
> Microsoft
>English
> 2015
> 80 categories
> 328,000 images

The COCO dataset is widely used for image captioning and object detection tasks. It comprises images from various scenes, often with multiple objects and complex backgrounds, making it an ideal choice for multimodal tasks.
Conceptual Captions
> Google Research
>English
> 2018
> N/A
> 12,000,000 images

The Conceptual Captions (CC) dataset contains (image URL, caption) pairs for training and evaluating machine learning image captioning systems. It is available in two versions, with about 3.3 million images (CC3M) and 12 million images (CC12M); both automatically collect weakly correlated descriptions from the web through a simple filtering procedure.
SBU Captions
> Stony Brook University
>English
> 2011

The SBU Captions dataset was originally used to treat image captioning as a retrieval task; it contains 1 million (image URL, caption) pairs.

Popular Implementations

| Code | Paper | Framework |
| --- | --- | --- |
| Meshed-Memory Transformer for Image Captioning | Meshed-Memory Transformer for Image Captioning | PyTorch |
| CLIP | Contrastive Language-Image Pre-training | PyTorch |
| Oscar | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | PyTorch |
| UNITER | Learning Universal Image-Text Representations | PyTorch |
| UNIMO | Towards Unified-Modal Understanding and Generation via Text-Image Vision-Language Pre-training | PyTorch |
| VL-BERT | VL-BERT: Pre-training of Generic Visual-Linguistic Representations | PyTorch |

SOTA

Dataset:

CLEVR-Change

| Metric | SOTA | Method | Paper |
| --- | --- | --- | --- |
| BLEU-4 | 47.3 | DUDA | Robust Change Captioning |
| METEOR | 33.9 | DUDA | Robust Change Captioning |
| SPICE | 24.5 | DUDA | Robust Change Captioning |
| CIDEr | 112.3 | DUDA | Robust Change Captioning |

TextCaps

| Metric | SOTA | Method | Paper |
| --- | --- | --- | --- |
| BLEU-4 | 23.30 | M4C-Captioner | TextCaps: a Dataset for Image Captioning with Reading Comprehension |
| METEOR | 22.00 | M4C-Captioner | TextCaps: a Dataset for Image Captioning with Reading Comprehension |
| ROUGE | 46.20 | M4C-Captioner | TextCaps: a Dataset for Image Captioning with Reading Comprehension |
| CIDEr | 89.60 | M4C-Captioner | TextCaps: a Dataset for Image Captioning with Reading Comprehension |

COCO

| Metric | SOTA | Method | Paper |
| --- | --- | --- | --- |
| Overall mAP | 49.6 | Dual-Curriculum Teacher | Dual-Curriculum Teacher for Domain-Inconsistent Object Detection in Autonomous Driving |

Conceptual Captions

| Metric | SOTA | Method | Paper |
| --- | --- | --- | --- |
| Params (M) | 156 | MLP + GPT2 tuning | ClipCap: CLIP Prefix for Image Captioning |
| SPICE | 18.5 | MLP + GPT2 tuning | ClipCap: CLIP Prefix for Image Captioning |
| ROUGE-L | 26.7 | MLP + GPT2 tuning | ClipCap: CLIP Prefix for Image Captioning |
| CIDEr | 87.26 | MLP + GPT2 tuning | ClipCap: CLIP Prefix for Image Captioning |

SBU Captions

| Metric | SOTA | Method | Paper |
| --- | --- | --- | --- |
| BLEU | 0.1259 | Global + Content Matching (linear SVM) | Im2Text: Describing Images Using 1 Million Captioned Photographs |
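
Several of the SOTA entries above are reported in BLEU. For readers unfamiliar with the metric, the following is a minimal sketch of sentence-level BLEU with uniform n-gram weights, clipped n-gram precision, a brevity penalty, and no smoothing. It is an illustration only: the listed papers may use corpus-level BLEU, different tokenization, or smoothing, and the helper names `ngrams` and `bleu` as well as the example captions are assumptions.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: List[str], references: List[List[str]], max_n: int = 4) -> float:
    """Sentence-level BLEU with uniform weights and no smoothing (sketch)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref: Counter = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # unsmoothed BLEU is zero when any precision is zero
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty, using the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


# Made-up change captions for illustration.
reference = "the red cube moved to the left".split()
perfect = bleu(reference, [reference])
partial = bleu("the red cube moved left".split(), [reference])
```

The returned value lies in [0, 1]; some of the tables above appear to report scores on a 0 to 100 scale, so compare values with that in mind.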
