Authors: Ricky Gonzalez, Jazmin Collins, Shiri Azenkot, Cindy Bennett

Cornell Tech PhD students Ricky Gonzalez and Jazmin Collins, XR Access co-founder Shiri Azenkot, and Google accessibility researcher Cindy Bennett are investigating the potential of AI to describe scenes to blind and low vision (BLV) people. The team developed an iOS application called Eyevisor that simulated the use of Seeing AI to collect data about why, when, and how BLV people use AI to describe visuals to them.

The study’s results point to a variety of unique use cases for which BLV users would prefer AI over human assistance, such as:

  • Detecting disgusting, dirty, or potentially dangerous things in their environment
  • Settling disputes between blind and low vision friends that require visual information
  • Avoiding awkward situations (e.g., touching a sleeping person)
18 images randomly sampled from the diary study, illustrating four of the most common participant goals, listed below.

During the diary study, participants submitted entries with different goals. These examples represent four of the most common goals:

  • (A) Getting a scene description
  • (B) Identifying features of objects
  • (C) Obtaining the identity of a subject in the scene
  • (D) Learning about the application
The scene description application we used to collect data. Screenshots show the flow of using the application and submitting a diary entry across five screens: the photo submission, the photo description, and three diary entry question screens. After receiving a description (for example, “a dog lying on the floor”), participants answered where they were when they took the photo, what information they wanted from the description, how satisfied they were with it, and how much they trusted it, with a final screen for additional comments. The interface was designed to group similar questions while minimizing the number of elements on each screen.
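For readers curious how a diary flow like this might be structured, below is a minimal SwiftUI sketch of a five-screen entry flow, assuming an iOS implementation. All type and property names are hypothetical, and the scene description is stubbed rather than produced by a real model; this is an illustration of the flow described above, not the actual Eyevisor code.

```swift
import SwiftUI

// Hypothetical sketch of a five-screen diary-entry flow; names are illustrative only.

/// One diary entry: the AI-generated description plus the participant's answers.
struct DiaryEntry {
    var description: String = ""       // e.g. "a dog lying on the floor"
    var location: String = ""          // Q1: Where were you when you took the photo?
    var informationWanted: String = "" // Q2: What information did you want to get?
    var satisfaction: Int = 3          // Q3: satisfaction rating (1-5)
    var trust: Int = 3                 // Q4: trust rating (1-5)
    var comments: String = ""          // Optional additional comments
}

/// The five screens in the flow, in order.
enum Step: Int {
    case photoSubmission, photoDescription, questions1, questions2, comments
}

struct DiaryFlowView: View {
    @State private var step: Step = .photoSubmission
    @State private var entry = DiaryEntry()

    var body: some View {
        VStack(spacing: 16) {
            switch step {
            case .photoSubmission:
                Button("Pick photo from album") {
                    // A real app would present a photo picker and send the image
                    // to a scene-description model; here the result is stubbed.
                    entry.description = "a dog lying on the floor"
                    step = .photoDescription
                }
            case .photoDescription:
                Text(entry.description)
                navigation
            case .questions1:
                TextField("Where were you when you took the photo?", text: $entry.location)
                TextField("What information did you want to get?", text: $entry.informationWanted)
                navigation
            case .questions2:
                Stepper("Satisfaction: \(entry.satisfaction)", value: $entry.satisfaction, in: 1...5)
                Stepper("Trust: \(entry.trust)", value: $entry.trust, in: 1...5)
                navigation
            case .comments:
                TextField("Additional comments", text: $entry.comments)
                Button("Submit") {
                    // Submitting the entry to a study backend is omitted here.
                    print(entry)
                }
            }
        }
        .padding()
    }

    /// "Back" and "Next" buttons shared by the intermediate screens.
    private var navigation: some View {
        HStack {
            Button("Back") { step = Step(rawValue: step.rawValue - 1) ?? .photoSubmission }
            Button("Next") { step = Step(rawValue: step.rawValue + 1) ?? .comments }
        }
    }
}
```

Grouping the four diary questions onto two screens, as the caption notes, keeps each screen small enough to navigate quickly with a screen reader while still collecting the same information per entry.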

Scene description applications and the AI powering them are changing rapidly, especially with continuing advances in generative and non-generative AI. As these technologies grow, it remains important to investigate how users make use of them, what their needs are, and what their goals are when using AI. Our work guides advancements in scene description and establishes a more useful baseline of visual interpretation for BLV users and their daily needs.

This work will be presented at the ACM CHI conference in Hawaii on May 11-16, 2024. Contact Ricardo Gonzalez at reg258 [at] cornell.edu for more information.