Picture a simple, everyday activity such as meeting a friend for dinner. For most of us, it’s a relaxing and enjoyable occasion, but for a visually impaired person, it can entail a myriad of potential pitfalls.
From figuring out transportation to the restaurant, to finding your friend through a crowd of tables, to scanning the menu — these are just a few examples of the challenges blind people can face while navigating their environment. Worldwide, these challenges are encountered by 1.3 billion people.
How can technological progress be used to not just benefit businesses but actually reduce inequalities and improve the lives of people with disabilities? Assistive technologies powered by artificial intelligence (AI) — such as object recognition, scene understanding, and visual question answering (VQA) — can help pave the way toward a more equal and just world.
Nearly 10 years ago, researchers developed the VizWiz app, which enables blind users to take pictures of objects with their phones, ask questions about those objects, and receive near-real-time spoken answers and descriptions from remotely situated sighted workers. A blind person could, for example, upload a picture of the packaging of an instant meal and receive spoken cooking instructions almost in real time.
Despite enormous technological advancements in the past decade, efforts to improve visual question answering models like the one behind the VizWiz app have stagnated. To address this, the European Conference on Computer Vision (ECCV) featured a VizWiz workshop, challenging the research community to join forces and come up with new approaches that meet the needs of blind people. As a global leader, SAP stands for the higher purpose of improving people’s lives beyond economic success. As part of its commitment to the 17 United Nations (UN) Global Goals, the SAP Machine Learning Research team participated in the challenge, contributing to the purpose of creating a world with reduced inequalities, good health, and well-being for everyone.
VizWiz Grand Challenge
Computer vision researchers took advantage of the data collected by the VizWiz application over the past 10 years and assembled the VizWiz dataset, comprising 31,000 questions posed to the app by visually impaired users. The VizWiz Grand Challenge included two main tasks: predicting the answer to a visual question, and predicting whether a visual question can be answered at all.
With the help of this dataset, the SAP team — consisting of Tassilo Klein, Moin Nabi, Sandro Pezzelle, and Denis Dushi — examined various shortcomings and limitations of VQA models and evaluation metrics. They found that VQA models and metrics usually perform well on artificially curated datasets with high-quality, clear images and directly phrased questions that an algorithm can easily identify and respond to. The data that users actually submit to the system, however, is often of poor quality: pictures are frequently out of focus, and the spoken questions suffer from audio-recording issues.
With the goal of improving the app’s functionality, the SAP team analyzed the distribution of answers in the dataset and showed that it is heavily skewed toward a handful of frequent answers — meaning a model can score well simply by predicting the most frequent answers, without actually learning to understand the images and questions. Moreover, they found that VQA evaluation metrics suffer from further flaws, such as failing to recognize when semantically related answers refer to the same thing — a chihuahua being a dog, for example.
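Both issues can be illustrated with a small sketch. The answer counts below are invented toy numbers, not actual VizWiz statistics; they simply show how a trivial majority-class baseline can look deceptively accurate on a skewed answer distribution, and how exact-match scoring treats related answers like “chihuahua” and “dog” as entirely different.

```python
from collections import Counter

# Toy answer distribution (hypothetical counts for illustration only --
# not the actual VizWiz dataset statistics).
answers = (["unanswerable"] * 50 + ["yes"] * 20 + ["no"] * 15
           + ["chihuahua"] * 5 + ["dog"] * 5 + ["blue"] * 5)

# A majority-class "model" that always predicts the most frequent answer
# scores 50% here without looking at a single image or question.
most_common_answer, count = Counter(answers).most_common(1)[0]
baseline_accuracy = count / len(answers)
print(most_common_answer, baseline_accuracy)  # unanswerable 0.5

# Exact-match scoring gives no credit for semantically related answers,
# even though a chihuahua is a dog.
def exact_match(prediction, ground_truth):
    return prediction.strip().lower() == ground_truth.strip().lower()

print(exact_match("chihuahua", "dog"))  # False
```

A metric that rewarded semantic similarity between answers, rather than exact string equality, would not penalize a model for this kind of near-miss.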
The SAP team ended up among the three top-performing teams by demonstrating major shortcomings of the application and the improvements needed to address them. These technological improvements are necessary to develop more robust and accurate VQA models that can be relied on in real-world scenarios, thereby helping the world and its citizens to run better.
Could Computer Vision Become an Eye for Blind People?
Since visual question answering technology is still in its infancy, a significant breakthrough is yet to come. The VizWiz challenge initiated dialogue among the research community and stimulated further research on how to develop and advance VQA systems for disadvantaged people. Ultimately, the goal must be to develop algorithms that perform well under the limiting factors often found in real-life situations, such as data scarcity, label imbalance, and noisy labels. This would lead to cutting-edge visual assistive technologies that promise to someday transform the lives of blind people by facilitating the simple tasks of everyday life and granting them more independence and freedom.
Current research and development efforts, however, point to a future where machines could lend an eye to visually impaired people and help them see and understand the world around them. In this regard, a number of promising visual assistive technologies, such as Seeing AI and OrCam, have emerged. Building on the ideas behind the VizWiz app, they assist blind users with reading printed and handwritten text, identifying products and people, and making entire scenes understandable. Moreover, AI-powered technologies can not only improve the lives of the visually impaired: they have also been shown to give a voice to non-verbal people (Bright Sign), support the cognitive abilities of people with autism and dementia (Content Clarifier), and lend an ear to the 360 million people who are hearing impaired (Roger Voice).
Reducing inequality worldwide is one of the UN’s global goals. SAP understands that data and technology are important tools for achieving that goal, and we believe that we have a responsibility to embrace collaboration and innovation with purpose in order to create a just and equal world for everyone.
Ivona Crnoja is a communications associate for SAP Machine Learning Research.