awesome-change-caption

The project is currently under construction

Papers

Lite Version

[ICIIF 2018] Least Squares Twin Extreme Learning Machine for Pattern Classification Amisha et al. [paper]
[ScienceDirect 2001] A Survey of Computer Vision-Based Human Motion Capture Thomas et al. [paper]
[ICIIF 2018] A Statistical Approach to Texture Classification from Single Images Manik et al. [paper]
[IEEE 2000] Twenty Years of Document Image Analysis in PAMI George Nagy. [paper]
[NeurIPS 2022] Learning Distinct and Representative Modes for Image Captioning Chen et al. [paper]
[ACM 2021] Scene Graph with 3D Information for Change Captioning Liao et al. [paper]
[IEEE 2022] ChaLearn Looking at People: IsoGD and ConGD Large-Scale RGB-D Gesture Recognition Wan et al. [paper]
[IEEE 2021] MVTN: Multi-View Transformation Network for 3D Shape Recognition Hamdi et al. [paper]
[IEEE 2021] GLiT: Neural Architecture Search for Global and Local Image Transformer. Chen et al. [paper]
[IEEE 2022] Learning-Based Point Cloud Registration for 6D Object Pose Estimation in the Real World Dang et al. [paper]
[IEEE 2022] An End-to-End Transformer Model for Crowd Localization Liang et al. [paper]
[IEEE 2022] Large-Scale Pre-training for Person Re-identification with Noisy Labels. Fu et al. [paper]
[IEEE 2022] Clipped Hyperbolic Classifiers Are Super-Hyperbolic Classifiers Guo et al. [paper]
[IEEE 2022] CO-SNE: Dimensionality Reduction and Visualization for Hyperbolic Data Guo et al. [paper]
[ACM 2021] Latent Memory-augmented Graph Transformer for Visual Storytelling Qi et al. [paper]
[IEEE 2021] Group-based Distinctive Image Captioning with Memory Attention Wang et al. [paper]
[IEEE 2021] Human-like Controllable Image Captioning with Verb-specific Semantic Roles Chen et al. [paper]
[IEEE 2021] Towards Accurate Text-based Image Captioning with Content Diversity Exploration Xu et al. [paper]

Full Version

Paper info	Description
Least Squares Twin Extreme Learning Machine for Pattern Classification Rastogi, R. ; Bharti, A. > South Asian University, New Delhi, 110021, Delhi, India > ICIIF 2018 > Extreme learning machine``Twin support vector machine``Classification``recognition, > Cited by 2594	The paper proposes the Least Squares Twin Extreme Learning Machine (LSTELM) for pattern classification. The LSTELM formulation solves the Extreme Learning Machine (ELM) problem in a twin framework.
A Survey of Computer Vision-Based Human Motion Capture Thomas B. Moeslund and Erik Granum > Laboratory of Computer Vision and Media Technology, Aalborg University, Niels Jernes Vej 14, Aalborg, 9220, Denmark > ScienceDirect 2001 > initialization``tracking``pose estimation``Least squares, > Cited by 6	The paper presents a comprehensive survey of the computer vision-based human motion capture literature from the past two decades. The focus is on a general overview based on a taxonomy of system functionalities, broken down into four processes: initialization, tracking, pose estimation, and recognition.
A Statistical Approach to Texture Classification from Single Images Manik Varma / Andrew Zisserman > Robotics Research Group, Department of Engineering Science, University of Oxford, Oxford, OX1 3PJ, UK > ICIIF 2018 > Extreme learning machine``Twin support vector machine``Classification``recognition, > Cited by 2594	The paper proposes texture classification from single images obtained under unknown viewpoint and illumination. A statistical approach is developed where textures are modelled by the joint probability distribution of filter responses.
Twenty Years of Document Image Analysis in PAMI George Nagy > South Asian University, New Delhi, 110021, Delhi, India > IEEE 2000 > Extreme learning machine``Twin support vector machine``Classification``Least squares, > Cited by 6	The contributions to document image analysis of 99 papers published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) are clustered, summarized, interpolated, interpreted, and tactfully evaluated.

Learning Distinct and Representative Modes for Image Captioning Qi Chen, Chaorui Deng, Qi Wu > NeurIPS 2022 > Computer Science``Computer Vision and Pattern Recognition > Cited by 1	This paper proposes a method for learning discrete control signals from the training corpus. The authors assume that each text corresponds to a mode, and the number of modes is a hyperparameter. Each mode in the training phase corresponds to a control signal in the testing phase. The embedded representations of all control signals constitute the codebook.
Scene Graph with 3D Information for Change Captioning Zeming Liao, Qingbao Huang, Yu Liang, Mingyi Fu, Yi Cai, Qing Li > School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China; School of Software Engineering, South China University of Technology, Guangzhou, China; Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China; Key Laboratory of Big Data and Intelligent Robot (SCUT), MOE of China; Guangxi Key Laboratory of Multimedia Communications and Network Technology; Institute of Artificial Intellige > ACM 2021 > Computing methodologies``Scene understanding``Natural language generation > Cited by 1	This paper proposes a 3D-aware Scene Graph-based Change Captioning (SGCC) model that extracts object semantics and 3D information and constructs scene graphs for image pairs, with node representations aggregated using graph convolution. SGCC helps observers identify changes quickly and is partially immune to viewpoint shifts. Extensive experiments show that SGCC achieves competitive performance on the CLEVR-Change and Spot-the-Diff datasets.
ChaLearn Looking at People: IsoGD and ConGD Large-Scale RGB-D Gesture Recognition Wan et al > Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China > IEEE 2022 > Gesture recognition``Measurement``Task analysis``Training``Conferences``Computer vision``Bidirectional long short-term memory (Bi-LSTM)``gesture recognition``RGB-D, > Cited by 7	The paper describes the creation of the two benchmark datasets, IsoGD and ConGD, and analyzes the advances in large-scale gesture recognition based on them.
MVTN: Multi-View Transformation Network for 3D Shape Recognition Hamdi et al > King Abdullah Univ Sci & Technol KAUST, Thuwal, Saudi Arabia > IEEE 2021 > `NEURAL-NETWORK`, > Cited by 23	The main contribution of this paper is the introduction of a Multi-View Transformation Network (MVTN) that can predict optimal viewpoints to improve 3D shape recognition performance.
GLiT: Neural Architecture Search for Global and Local Image Transformer Chen et al > Univ Sydney, Sydney, NSW, Australia > IEEE 2021 > GLiT``Image Recognition``transformer``Neural architecture Search``Local information ``Global Information``Self-attention``multi-head attention mechanism``ImageNet, > Cited by 23	This paper introduces GLiT, the first Neural Architecture Search (NAS) method to find a better transformer architecture specifically for image recognition.
Learning-Based Point Cloud Registration for 6D Object Pose Estimation in the Real World Dang et al >Ecole Polytech Fed Lausanne, CVLab, Lausanne, Switzerland > IEEE 2022 > 6D object pose estimation``Point cloud registration, > Cited by 0	The main content of the paper is about addressing the challenges faced by learning-based 3D object registration algorithms to estimate the 6D pose of an object from point cloud data in the presence of real-world data.
An End-to-End Transformer Model for Crowd Localization Liang et al >Huazhong Univ Sci & Technol, Wuhan 430074, Peoples R China > IEEE 2022 > Crowd localization``Crowd counting``Transformer, > Cited by 3	The main content of this paper is the proposal of an end-to-end Crowd Localization Transformer (CLTR) model for the task of crowd localization, which aims to predict the location of each instance (head positions) in crowd scenes.
Large-Scale Pre-training for Person Re-identification with Noisy Labels Fu et al >Microsoft Res, Redmond, WA 98052 USA > IEEE 2022 > `person re-identification, pre-training, noisy labels, prototype-based contrastive learning, label-guided contrastive learning, deep learning, computer vision`, > Cited by 2	This paper proposes a framework for large-scale pre-training for person re-identification (Re-ID) with noisy labels. The framework, called PNL, consists of three learning modules: supervised Re-ID learning, prototype-based contrastive learning, and label-guided contrastive learning.
Clipped Hyperbolic Classifiers Are Super-Hyperbolic Classifiers Guo et al > UC Berkeley ICSI, Berkeley, CA 94720 USA > IEEE 2022 > `person re-identification, pre-training, noisy labels, prototype-based contrastive learning, label-guided contrastive learning, deep learning, computer vision`, > Cited by 41	This paper introduces a solution to the vanishing gradient problem in training Hyperbolic Neural Networks (HNNs), which is caused by the hybrid architecture connecting Euclidean features to a hyperbolic classifier.
CO-SNE: Dimensionality Reduction and Visualization for Hyperbolic Data Guo et al > UC Berkeley ICSI, Berkeley, CA 94720 USA > IEEE 2022 > `CO-SNE, hyper-likelihood distribution, high-dimensional data visualization, non-Euclidean space, hierarchical structure`, > Cited by 0	The main content of this article is the proposal of CO-SNE, a method for visualizing high-dimensional hyperbolic data in a low-dimensional hyperbolic space.
Latent Memory-augmented Graph Transformer for Visual Storytelling Qi et al > School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China > IEEE 2021 > `Visual Storytelling, Transformer, Scene Graph, Memory Network`, > Cited by 0	The main focus of this paper is on the task of visual storytelling, which involves automatically generating a narrative story for an image stream.
Group-based Distinctive Image Captioning with Memory Attention Wang et al > Department of Computer Science, City University of Hong Kong > IEEE 2021 > `Image Caption, Distinctiveness, Memory Attention, Similar Image`, > Cited by 18	The main content of this paper is the proposal of a Group-based Distinctive Captioning Model (GdisCap) that generates distinctive captions for images by comparing each image with other images in a similar group and highlighting the uniqueness of each image.
Human-like Controllable Image Captioning with Verb-specific Semantic Roles Chen et al > Zhejiang Univ, Hangzhou, Peoples R China > IEEE 2021 > `Controllable Image Captioning, Verb-specific Semantic Roles (VSR), Semantic Role Labeling (SRL), Semantic Structure Planner (SSP), Role-shift Caption Generation, Diverse Image Captioning, Visual Grounding`, > Cited by 17	The main content of the article is the proposal and implementation of a new control signal for Controllable Image Captioning (CIC) called Verb-specific Semantic Roles (VSR), which considers both event-compatibility and sample-suitability requirements for more human-like controllability.
Towards Accurate Text-based Image Captioning with Content Diversity Exploration Xu et al > South China Univ Technol, Guangzhou, Peoples R China > IEEE 2021 > `text-based image captioning, content diversity exploration, anchor proposal module, anchor captioning module, anchor-centred graph, OCR tokens`, > Cited by 15	The main content of this paper is the proposal of a new method called Anchor-Captioner for text-based image captioning, which aims to generate multiple captions that accurately describe different parts of an image in detail.

Datasets

Dataset info	Description
CLEVR-Change > Stanford University >English > 2019 >N/A > 16000 images	The CLEVR-Change dataset is an extension of the CLEVR dataset for change captioning. It contains pairs of before and after images of a scene, with captions describing the changes made to objects in the scene.
TextCaps > Peking University >English > 2021 >N/A > 28,000 images	The TextCaps dataset targets image captioning with reading comprehension: captions describe images from different domains and must take into account the text that appears in them.
COCO > Microsoft >English > 2015 > 80 categories > 328000 images	The COCO dataset is widely used for image captioning and object detection tasks. It comprises images from various scenes, often with multiple objects and complex backgrounds, making it an ideal choice for multimodal tasks.
Conceptual Captions > Google Research >English > 2018 > N/A > 12000000 images	The Conceptual Captions (CC) dataset contains (image URL, caption) pairs for training and evaluating machine learning image captioning systems. It is available in two versions, with about 3.3 million images (CC3M) and 12 million images (CC12M). The captions are weakly correlated descriptions collected automatically from the web through a simple filtering procedure.
SBU Captions > Stony Brook University >English > 2011	The SBU Captions dataset was originally used to treat image captioning as a retrieval task. It contains 1 million (image URL, caption) pairs.

Popular Implementations

Code	Paper	Framework
Meshed-Memory Transformer for Image Captioning	Meshed-Memory Transformer for Image Captioning	PyTorch
CLIP	Contrastive Language-Image Pre-training	PyTorch
Oscar	Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks	PyTorch
UNITER	Learning Universal Image-Text Representations	PyTorch
UNIMO	Towards Unified-Modal Understanding and Generation via Text-Image Vision-Language Pre-training	PyTorch
VL-BERT	VL-BERT: Pre-training of Generic Visual-Linguistic Representations	PyTorch

SOTA

DataSet:

CLEVR-Change

metrics	sota	method	paper
BLEU-4	47.3	DUDA	Robust Change Captioning
METEOR	33.9	DUDA	Robust Change Captioning
SPICE	24.5	DUDA	Robust Change Captioning
CIDEr	112.3	DUDA	Robust Change Captioning

TextCaps

metrics	sota	method	paper
BLEU-4	23.30	M4C-Captioner	TextCaps: a Dataset for Image Captioning with Reading Comprehension
METEOR	22.00	M4C-Captioner	TextCaps: a Dataset for Image Captioning with Reading Comprehension
ROUGE	46.20	M4C-Captioner	TextCaps: a Dataset for Image Captioning with Reading Comprehension
CIDEr	89.60	M4C-Captioner	TextCaps: a Dataset for Image Captioning with Reading Comprehension

COCO

metrics	sota	method	paper
Overall mAP	49.6	Dual-Curriculum Teacher	Dual-Curriculum Teacher for Domain-Inconsistent Object Detection in Autonomous Driving

Conceptual Captions

metrics	sota	method	paper
Params (M)	156	MLP + GPT2 tuning	ClipCap: CLIP Prefix for Image Captioning
SPICE	18.5	MLP + GPT2 tuning	ClipCap: CLIP Prefix for Image Captioning
ROUGE-L	26.7	MLP + GPT2 tuning	ClipCap: CLIP Prefix for Image Captioning
CIDEr	87.26	MLP + GPT2 tuning	ClipCap: CLIP Prefix for Image Captioning

SBU Captions

metrics	sota	method	paper
BLEU	0.1259	Global + Content Matching (linear SVM)	Im2Text: Describing Images Using 1 Million Captioned Photographs
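Several of the tables above report BLEU scores. As a rough illustration of what the metric measures, here is a minimal sketch of unsmoothed sentence-level BLEU: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. The function names `ngrams` and `sentence_bleu` are hypothetical and not taken from any listed repository. The papers above typically report corpus-level scores computed with their own evaluation toolkits, so this sketch is not expected to reproduce the numbers in the tables. The scores also appear to use different scales: most entries look like a 0-100 scale, while the SBU Captions entry looks like a 0-1 scale.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU in [0, 1] for tokenized input."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_precision_sum / max_n)


if __name__ == "__main__":
    reference = "the red cube moved to the left".split()
    candidate = "the red cube moved to the right".split()
    print(sentence_bleu(candidate, [reference]))
```

Multiply the result by 100 to compare it with scores reported on a 0-100 scale.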
