Related papers: How really effective are Multimodal Hints in enhan…
Input multimodality combining speech and hand gestures has motivated numerous usability studies. Contrastingly, issues relating to the design and ergonomic evaluation of multimodal output messages combining speech with visual modalities…
A preliminary experimental study is presented, that aims at eliciting the contribution of oral messages to facilitating visual search tasks on crowded visual displays. Results of quantitative and qualitative analyses suggest that…
This paper describes an experimental study that aims at assessing the actual contribution of voice system messages to visual search efficiency and comfort. Messages which include spatial information on the target location are meant to…
Humans sense of distance depends on the integration of multi sensory cues. The incoming visual luminance, auditory pitch and tactile vibration could all contribute to the ability of distance judgement. This ability can be enhanced if the…
Multi-modal word semantics aims to enhance embeddings with perceptual input, assuming that human meaning representation is grounded in sensory experience. Most research focuses on evaluation involving direct visual input, however, visual…
This paper explores the effectiveness of Multimodal Large Language models (MLLMs) as assistive technologies for visually impaired individuals. We conduct a user survey to identify adoption patterns and key challenges users face with such…
Localization is a key requirement for mobile robot autonomy and human-robot interaction. Vision-based localization is accurate and flexible, however, it incurs a high computational burden which limits its application on many…
With the aim of promoting and understanding the multilingual version of image search, we leverage visual object detection and propose a model with diverse multi-head attention to learn grounded multilingual multimodal representations.…
Multi-modal visual understanding of images with prompts involves using various visual and textual cues to enhance the semantic understanding of images. This approach combines both vision and language processing to generate more accurate…
Interaction plays a vital role during visual network exploration as users need to engage with both elements in the view (e.g., nodes, links) and interface controls (e.g., sliders, dropdown menus). Particularly as the size and complexity of…
Usability evaluation is an essential method to support the design of effective and intuitive user interfaces (UIs). However, it commonly relies on resource-intensive, expert-driven methods, which limit its accessibility, especially for…
We introduce visual hints expansion for guiding stereo matching to improve generalization. Our work is motivated by the robustness of Visual Inertial Odometry (VIO) in computer vision and robotics, where a sparse and unevenly distributed…
Research in multi-modal interfaces aims to provide solutions to immersion and increase overall human performance. A promising direction is combining auditory, visual and haptic interaction between the user and the simulated environment.…
Augmented reality (AR) is emerging in visual search tasks for increasingly immersive interactions with virtual objects. We propose an AR approach providing visual and audio hints along with gaze-assisted instant post-task feedback for…
Multimodal machine learning algorithms aim to learn visual-textual correspondences. Previous work suggests that concepts with concrete visual manifestations may be easier to learn than concepts with abstract ones. We give an algorithm for…
Hateful meme detection is a new multimodal task that has gained significant traction in academic and industry research communities. Recently, researchers have applied pre-trained visual-linguistic models to perform the multimodal…
Humans possess multimodal literacy, allowing them to actively integrate information from various modalities to form reasoning. Faced with challenges like lexical ambiguity in text, we supplement this with other modalities, such as thumbnail…
Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms. While being actively researched, multimodal models that can additionally handle images as input have yet to…
Visual cues such as structure, emphasis, and icons play an important role in efficient information foraging by sighted individuals and make for a pleasurable reading experience. Blind, low-vision and other print-disabled individuals miss…
Finding a particular object in a display is important for viewers in many visualizations, for example, when reacting to brushing or to a highlighted object. This can be enabled by making the target object different in one of the visual…