Related papers: Using Deep Object Features for Image Descriptions

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a): a multimodal joint embedding space with images and text and (b): a novel language model for decoding…

Machine Learning · Computer Science 2014-11-11 Ryan Kiros , Ruslan Salakhutdinov , Richard S. Zemel

Image Captioning: Transforming Objects into Words

Image captioning models typically follow an encoder-decoder architecture which uses abstract image feature vectors as input to the encoder. One of the most successful algorithms uses feature vectors extracted from the region proposals…

Computer Vision and Pattern Recognition · Computer Science 2020-01-14 Simao Herdade , Armin Kappeler , Kofi Boakye , Joao Soares

Image Captioning based on Feature Refinement and Reflective Decoding

Image captioning is the process of automatically generating a description of an image in natural language. Image captioning is one of the significant challenges in image understanding since it requires not only recognizing salient objects…

Computer Vision and Pattern Recognition · Computer Science 2022-07-26 Ghadah Alabduljabbar , Hafida Benhidour , Said Kerrache

Personalized Image Descriptions from Attention Sequences

People can view the same image differently: they focus on different regions, objects, and details in varying orders and describe them in distinct linguistic styles. This leads to substantial variability in image descriptions. However,…

Computer Vision and Pattern Recognition · Computer Science 2025-12-09 Ruoyu Xue , Hieu Le , Jingyi Xu , Sounak Mondal , Abe Leite , Gregory Zelinsky , Minh Hoai , Dimitris Samaras

Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models

This paper proposes a method for generating images of customized objects specified by users. The method is based on a general framework that bypasses the lengthy optimization required by previous approaches, which often employ a per-object…

Computer Vision and Pattern Recognition · Computer Science 2023-04-06 Xuhui Jia , Yang Zhao , Kelvin C. K. Chan , Yandong Li , Han Zhang , Boqing Gong , Tingbo Hou , Huisheng Wang , Yu-Chuan Su

Object Recognition as Next Token Prediction

We present an approach to pose object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels. To ground this prediction process in…

Computer Vision and Pattern Recognition · Computer Science 2024-04-02 Kaiyu Yue , Bor-Chun Chen , Jonas Geiping , Hengduo Li , Tom Goldstein , Ser-Nam Lim

Image Captioning based on Deep Learning Methods: A Survey

Image captioning is a challenging task and attracting more and more attention in the field of Artificial Intelligence, and which can be applied to efficient image retrieval, intelligent blind guidance and human-computer interaction, etc. In…

Computer Vision and Pattern Recognition · Computer Science 2019-05-21 Yiyu Wang , Jungang Xu , Yingfei Sun , Ben He

Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions

Understanding 3D scenes goes beyond simply recognizing objects; it requires reasoning about the spatial and semantic relationships between them. Current 3D scene-language models often struggle with this relational understanding,…

Computer Vision and Pattern Recognition · Computer Science 2025-12-09 Jintang Xue , Ganning Zhao , Jie-En Yao , Hong-En Chen , Yue Hu , Meida Chen , Suya You , C. -C. Jay Kuo

Image Captioning using Deep Stacked LSTMs, Contextual Word Embeddings and Data Augmentation

Image Captioning, or the automatic generation of descriptions for images, is one of the core problems in Computer Vision and has seen considerable progress using Deep Learning Techniques. We propose to use Inception-ResNet Convolutional…

Computer Vision and Pattern Recognition · Computer Science 2021-02-23 Sulabh Katiyar , Samir Kumar Borgohain

A Deep Decoder Structure Based on WordEmbedding Regression for An Encoder-Decoder Based Model for Image Captioning

Generating textual descriptions for images has been an attractive problem for the computer vision and natural language processing researchers in recent years. Dozens of models based on deep learning have been proposed to solve this problem.…

Computer Vision and Pattern Recognition · Computer Science 2019-07-01 Ahmad Asadi , Reza Safabakhsh

Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning

Automatic transcription of scene understanding in images and videos is a step towards artificial general intelligence. Image captioning is a nomenclature for describing meaningful information in an image using computer vision techniques.…

Computer Vision and Pattern Recognition · Computer Science 2021-09-17 Shikha Dubey , Farrukh Olimov , Muhammad Aasim Rafique , Joonmo Kim , Moongu Jeon

Empirical Autopsy of Deep Video Captioning Frameworks

Contemporary deep learning based video captioning follows encoder-decoder framework. In encoder, visual features are extracted with 2D/3D Convolutional Neural Networks (CNNs) and a transformed version of those features is passed to the…

Computer Vision and Pattern Recognition · Computer Science 2019-11-22 Nayyer Aafaq , Naveed Akhtar , Wei Liu , Ajmal Mian

Image Captioning with Object Detection and Localization

Automatically generating a natural language description of an image is a task close to the heart of image understanding. In this paper, we present a multi-model neural network method closely related to the human visual system that…

Computer Vision and Pattern Recognition · Computer Science 2017-06-09 Zhongliang Yang , Yu-Jin Zhang , Sadaqat ur Rehman , Yongfeng Huang

Captioning Images with Diverse Objects

Recent captioning models are limited in their ability to scale and describe concepts unseen in paired image-text corpora. We propose the Novel Object Captioner (NOC), a deep visual semantic captioning model that can describe a large number…

Computer Vision and Pattern Recognition · Computer Science 2017-07-24 Subhashini Venugopalan , Lisa Anne Hendricks , Marcus Rohrbach , Raymond Mooney , Trevor Darrell , Kate Saenko

An Improvement of Object Detection Performance using Multi-step Machine Learnings

Connecting multiple machine learning models into a pipeline is effective for handling complex problems. By breaking down the problem into steps, each tackled by a specific component model of the pipeline, the overall solution can be made…

Computer Vision and Pattern Recognition · Computer Science 2021-01-20 Tomoe Kishimoto , Masahiko Saito , Junichi Tanaka , Yutaro Iiyama , Ryu Sawada , Koji Terashi

Object Counts! Bringing Explicit Detections Back into Image Captioning

The use of explicit object detectors as an intermediate step to image captioning - which used to constitute an essential stage in early work - is often bypassed in the currently dominant end-to-end approaches, where the language model is…

Computer Vision and Pattern Recognition · Computer Science 2018-05-02 Josiah Wang , Pranava Madhyastha , Lucia Specia

An image structure model for exact edge detection

The paper presents a new model for single channel images low-level interpretation. The image is decomposed into a graph which captures a complete set of structural features. The description allows to accurately identify every edge location…

Computer Vision and Pattern Recognition · Computer Science 2019-04-23 Alessandro Dal Palu'

A Comprehensive Review of Modern Object Segmentation Approaches

Image segmentation is the task of associating pixels in an image with their respective object class labels. It has a wide range of applications in many industries including healthcare, transportation, robotics, fashion, home improvement,…

Computer Vision and Pattern Recognition · Computer Science 2023-01-19 Yuanbo Wang , Unaiza Ahsan , Hanyan Li , Matthew Hagen

Dense Captioning with Joint Inference and Visual Context

Dense captioning is a newly emerging computer vision topic for understanding images with dense language descriptions. The goal is to densely detect visual concepts (e.g., objects, object parts, and interactions between them) from images,…

Computer Vision and Pattern Recognition · Computer Science 2017-08-09 Linjie Yang , Kevin Tang , Jianchao Yang , Li-Jia Li

Enhancing Embodied Object Detection through Language-Image Pre-training and Implicit Object Memory

Deep-learning and large scale language-image training have produced image object detectors that generalise well to diverse environments and semantic classes. However, single-image object detectors trained on internet data are not optimally…

Robotics · Computer Science 2024-02-07 Nicolas Harvey Chapman , Feras Dayoub , Will Browne , Chris Lehnert