English
Related papers

Related papers: Visual Speech Language Models

200 papers

There is debate if phoneme or viseme units are the most effective for a lipreading system. Some studies use phoneme units even though phonemes describe unique short sounds; other studies tried to improve lipreading accuracy by focusing on…

Computer Vision and Pattern Recognition · Computer Science 2018-05-09 Kwanchiva Thangthai , Helen L Bear , Richard Harvey

A critical assumption of all current visual speech recognition systems is that there are visual speech units called visemes which can be mapped to units of acoustic speech, the phonemes. Despite there being a number of published maps it is…

Computer Vision and Pattern Recognition · Computer Science 2018-04-26 Helen L. Bear , Richard W. Harvey , Barry-John Theobald , Yuxuan Lan

Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension, and their internal representations are remarkably well-aligned with representations of language in the human brain. But to…

Computation and Language · Computer Science 2024-03-27 Chengxu Zhuang , Evelina Fedorenko , Jacob Andreas

With the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities in tasks that fuse the modalities of vision and language. In this work, we benchmark the extent…

Computation and Language · Computer Science 2025-11-18 Tyler Loakman , Joseph James , Chenghua Lin

We are at an exciting time for machine lipreading. Traditional research stemmed from the adaptation of audio recognition systems. But now, the computer vision community is also participating. This joining of two previously disparate areas…

Computer Vision and Pattern Recognition · Computer Science 2018-04-26 Helen L Bear , Sarah Taylor

Lipreading is understanding speech from observed lip movements. An observed series of lip motions is an ordered sequence of visual lip gestures. These gestures are commonly known, but as yet are not formally defined, as `visemes'. In this…

Image and Video Processing · Electrical Eng. & Systems 2019-09-17 Helen Bear , Richard Harvey

Visual Language Models (VLMs) show remarkable performance in visual reasoning tasks, successfully tackling college-level challenges that require high-level understanding of images. However, some recent reports of VLMs struggling to reason…

Computer Vision and Pattern Recognition · Computer Science 2025-04-17 Gene Tangtartharakul , Katherine R. Storrs

In the quest for greater computer lip-reading performance there are a number of tacit assumptions which are either present in the datasets (high resolution for example) or in the methods (recognition of spoken visual units called visemes…

Computer Vision and Pattern Recognition · Computer Science 2018-04-26 Helen L. Bear , Gari Owen , Richard Harvey , Barry-John Theobald

In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements. For example, homophenes, words that share identical lip movements but produce different sounds,…

Computer Vision and Pattern Recognition · Computer Science 2024-05-15 Jeong Hun Yeo , Seunghee Han , Minsu Kim , Yong Man Ro

Recent adoption of deep learning methods to the field of machine lipreading research gives us two options to pursue to improve system performance. Either, we develop end-to-end systems holistically or, we experiment to further our…

Computer Vision and Pattern Recognition · Computer Science 2018-04-26 Helen L Bear

Lip reading, also known as visual speech recognition, aims to recognize the speech content from videos by analyzing the lip dynamics. There have been several appealing progress in recent years, benefiting much from the rapidly developed…

Computer Vision and Pattern Recognition · Computer Science 2020-11-17 Dalu Feng , Shuang Yang , Shiguang Shan , Xilin Chen

In machine lip-reading there is continued debate and research around the correct classes to be used for recognition. In this paper we use a structured approach for devising speaker-dependent viseme classes, which enables the creation of a…

Computer Vision and Pattern Recognition · Computer Science 2018-04-26 Helen L. Bear , Richard W. Harvey , Yuxuan Lan

Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models…

Vision language models (VLMs) are designed to extract relevant visuospatial information from images. Some research suggests that VLMs can exhibit humanlike scene understanding, while other investigations reveal difficulties in their ability…

Computer Vision and Pattern Recognition · Computer Science 2025-04-23 Sangeet Khemlani , Tyler Tran , Nathaniel Gyory , Anthony M. Harrison , Wallace E. Lawson , Ravenna Thielstrom , Hunter Thompson , Taaren Singh , J. Gregory Trafton

Images can be described in terms of the objects they contain, or in terms of the types of scene or place that they instantiate. In this paper we address to what extent pretrained Vision and Language models can learn to align descriptions of…

Computation and Language · Computer Science 2021-09-16 Michele Cafagna , Kees van Deemter , Albert Gatt

Speech is the most common communication method between humans and involves the perception of both auditory and visual channels. Automatic speech recognition focuses on interpreting the audio signals, but it has been demonstrated that video…

Computer Vision and Pattern Recognition · Computer Science 2017-04-27 Adriana Fernandez-Lopez , Federico M. Sukno

In machine lip-reading, which is identification of speech from visual-only information, there is evidence to show that visual speech is highly dependent upon the speaker [1]. Here, we use a phoneme-clustering method to form new…

Computer Vision and Pattern Recognition · Computer Science 2018-04-26 Helen L. Bear , Stephen J. Cox , Richard W. Harvey

Speech is the most used communication method between humans and it involves the perception of auditory and visual channels. Automatic speech recognition focuses on interpreting the audio signals, although the video can provide information…

Computer Vision and Pattern Recognition · Computer Science 2017-04-27 Adriana Fernandez-Lopez , Oriol Martinez , Federico M. Sukno

Lip-reading aims to recognize speech content from videos via visual analysis of speakers' lip movements. This is a challenging task due to the existence of homophemes-words which involve identical or highly similar lip movements, as well as…

Computer Vision and Pattern Recognition · Computer Science 2019-09-04 Chenhao Wang

Recent progress in Spoken Language Modeling has shown that learning language directly from speech is feasible. Generating speech through a pipeline that operates at the text level typically loses nuances, intonations, and non-verbal…

Computation and Language · Computer Science 2024-10-31 Maxime Poli , Emmanuel Chemla , Emmanuel Dupoux
‹ Prev 1 2 3 10 Next ›