Related papers: Visual Speech Language Models

Comparing phonemes and visemes with DNN-based lipreading

There is debate if phoneme or viseme units are the most effective for a lipreading system. Some studies use phoneme units even though phonemes describe unique short sounds; other studies tried to improve lipreading accuracy by focusing on…

Computer Vision and Pattern Recognition · Computer Science 2018-05-09 Kwanchiva Thangthai, Helen L Bear, Richard Harvey

Which phoneme-to-viseme maps best improve visual-only computer lip-reading?

A critical assumption of all current visual speech recognition systems is that there are visual speech units called visemes which can be mapped to units of acoustic speech, the phonemes. Despite there being a number of published maps it is…

Computer Vision and Pattern Recognition · Computer Science 2018-04-26 Helen L. Bear, Richard W. Harvey, Barry-John Theobald, Yuxuan Lan

Visual Grounding Helps Learn Word Meanings in Low-Data Regimes

Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension, and their internal representations are remarkably well-aligned with representations of language in the human brain. But to…

Computation and Language · Computer Science 2024-03-27 Chengxu Zhuang, Evelina Fedorenko, Jacob Andreas

Seeing isn't Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms

With the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities in tasks that fuse the modalities of vision and language. In this work, we benchmark the extent…

Computation and Language · Computer Science 2025-11-18 Tyler Loakman, Joseph James, Chenghua Lin

Visual speech recognition: aligning terminologies for better understanding

We are at an exciting time for machine lipreading. Traditional research stemmed from the adaptation of audio recognition systems. But now, the computer vision community is also participating. This joining of two previously disparate areas…

Computer Vision and Pattern Recognition · Computer Science 2018-04-26 Helen L Bear, Sarah Taylor

Alternative Visual Units for an Optimized Phoneme-Based Lipreading System

Lipreading is understanding speech from observed lip movements. An observed series of lip motions is an ordered sequence of visual lip gestures. These gestures are commonly known, but as yet are not formally defined, as 'visemes'. In this…

Image and Video Processing · Electrical Eng. & Systems 2019-09-17 Helen Bear, Richard Harvey

Visual Language Models show widespread visual deficits on neuropsychological tests

Visual Language Models (VLMs) show remarkable performance in visual reasoning tasks, successfully tackling college-level challenges that require high-level understanding of images. However, some recent reports of VLMs struggling to reason…

Computer Vision and Pattern Recognition · Computer Science 2025-04-17 Gene Tangtartharakul, Katherine R. Storrs

Some observations on computer lip-reading: moving from the dream to the reality

In the quest for greater computer lip-reading performance there are a number of tacit assumptions which are either present in the datasets (high resolution for example) or in the methods (recognition of spoken visual units called visemes…

Computer Vision and Pattern Recognition · Computer Science 2018-04-26 Helen L. Bear, Gari Owen, Richard Harvey, Barry-John Theobald

Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing

In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements. For example, homophenes, words that share identical lip movements but produce different sounds,…

Computer Vision and Pattern Recognition · Computer Science 2024-05-15 Jeong Hun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro

Visual gesture variability between talkers in continuous visual speech

Recent adoption of deep learning methods to the field of machine lipreading research gives us two options to pursue to improve system performance. Either, we develop end-to-end systems holistically or, we experiment to further our…

Computer Vision and Pattern Recognition · Computer Science 2018-04-26 Helen L Bear

Learn an Effective Lip Reading Model without Pains

Lip reading, also known as visual speech recognition, aims to recognize the speech content from videos by analyzing the lip dynamics. There have been several appealing progress in recent years, benefiting much from the rapidly developed…

Computer Vision and Pattern Recognition · Computer Science 2020-11-17 Dalu Feng, Shuang Yang, Shiguang Shan, Xilin Chen

Finding phonemes: improving machine lip-reading

In machine lip-reading there is continued debate and research around the correct classes to be used for recognition. In this paper we use a structured approach for devising speaker-dependent viseme classes, which enables the creation of a…

Computer Vision and Pattern Recognition · Computer Science 2018-04-26 Helen L. Bear, Richard W. Harvey, Yuxuan Lan

An Introduction to Vision-Language Modeling

Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models…

Machine Learning · Computer Science 2024-05-28 Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C. Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, Mark Ibrahim, Melissa Hall, Yunyang Xiong, Jonathan Lebensold, Candace Ross, Srihari Jayakumar, Chuan Guo, Diane Bouchacourt, Haider Al-Tahan, Karthik Padthe, Vasu Sharma, Hu Xu, Xiaoqing Ellen Tan, Megan Richards, Samuel Lavoie, Pietro Astolfi, Reyhane Askari Hemmat, Jun Chen, Kushal Tirumala, Rim Assouel, Mazda Moayeri, Arjang Talattof, Kamalika Chaudhuri, Zechun Liu, Xilun Chen, Quentin Garrido, Karen Ullrich, Aishwarya Agrawal, Kate Saenko, Asli Celikyilmaz, Vikas Chandra

Vision language models are unreliable at trivial spatial cognition

Vision language models (VLMs) are designed to extract relevant visuospatial information from images. Some research suggests that VLMs can exhibit humanlike scene understanding, while other investigations reveal difficulties in their ability…

Computer Vision and Pattern Recognition · Computer Science 2025-04-23 Sangeet Khemlani, Tyler Tran, Nathaniel Gyory, Anthony M. Harrison, Wallace E. Lawson, Ravenna Thielstrom, Hunter Thompson, Taaren Singh, J. Gregory Trafton

What Vision-Language Models 'See' when they See Scenes

Images can be described in terms of the objects they contain, or in terms of the types of scene or place that they instantiate. In this paper we address to what extent pretrained Vision and Language models can learn to align descriptions of…

Computation and Language · Computer Science 2021-09-16 Michele Cafagna, Kees van Deemter, Albert Gatt

Automatic Viseme Vocabulary Construction to Enhance Continuous Lip-reading

Speech is the most common communication method between humans and involves the perception of both auditory and visual channels. Automatic speech recognition focuses on interpreting the audio signals, but it has been demonstrated that video…

Computer Vision and Pattern Recognition · Computer Science 2017-04-27 Adriana Fernandez-Lopez, Federico M. Sukno

Speaker-independent machine lip-reading with speaker-dependent viseme classifiers

In machine lip-reading, which is identification of speech from visual-only information, there is evidence to show that visual speech is highly dependent upon the speaker [1]. Here, we use a phoneme-clustering method to form new…

Computer Vision and Pattern Recognition · Computer Science 2018-04-26 Helen L. Bear, Stephen J. Cox, Richard W. Harvey

Towards Estimating the Upper Bound of Visual-Speech Recognition: The Visual Lip-Reading Feasibility Database

Speech is the most used communication method between humans and it involves the perception of auditory and visual channels. Automatic speech recognition focuses on interpreting the audio signals, although the video can provide information…

Computer Vision and Pattern Recognition · Computer Science 2017-04-27 Adriana Fernandez-Lopez, Oriol Martinez, Federico M. Sukno

Multi-Grained Spatio-temporal Modeling for Lip-reading

Lip-reading aims to recognize speech content from videos via visual analysis of speakers' lip movements. This is a challenging task due to the existence of homophemes, words which involve identical or highly similar lip movements, as well as…

Computer Vision and Pattern Recognition · Computer Science 2019-09-04 Chenhao Wang

Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach

Recent progress in Spoken Language Modeling has shown that learning language directly from speech is feasible. Generating speech through a pipeline that operates at the text level typically loses nuances, intonations, and non-verbal…

Computation and Language · Computer Science 2024-10-31 Maxime Poli, Emmanuel Chemla, Emmanuel Dupoux