
    Research shows that visual models cannot process queries with negative words

By Daniel68 | May 19, 2025

Imagine a radiologist examining a new patient's chest X-ray. She notices the patient has swelling in the tissue but does not have an enlarged heart. Hoping to speed up diagnosis, she might use a vision-language machine-learning model to search for reports from similar patients.

But if the model mistakenly identifies reports with both conditions, the most likely diagnosis could be quite different: if a patient has tissue swelling and an enlarged heart, the condition is very likely cardiac-related, but with no enlarged heart there could be several underlying causes.

In a new study, MIT researchers found that vision-language models are highly likely to make such a mistake in real-world settings because they do not understand negation: words like "no" and "not" that specify what is false or absent.

"These negation words can have a very significant impact, and if we just use these models blindly, we can run into catastrophic consequences," said Kumail Alhamoud, a graduate student at MIT and lead author of the study.

The researchers tested the ability of vision-language models to identify negation in image captions. The models often performed no better than a random guess. Building on those findings, the team created a dataset of images with corresponding captions that include negation words describing missing objects.

They showed that fine-tuning a vision-language model with this dataset leads to performance improvements when the model is asked to retrieve images that do not contain certain objects. It also boosts accuracy on multiple-choice question answering with negated captions.

But the researchers caution that more work is needed to address the root causes of the problem. They hope their research alerts potential users to a previously unnoticed shortcoming that could have serious implications in the high-stakes settings where these models are currently used, from determining which patients receive certain treatments to identifying product defects in manufacturing plants.

"This is a technical paper, but there are bigger issues to consider," said senior author Marzyeh Ghassemi.

Ghassemi and Alhamoud are joined on the study by MIT graduate student Shaden Alshammari; Yonglong Tian of OpenAI; Guohao Li, a former postdoc at Oxford University; Philip H.S. Torr, a professor at Oxford; and Yoon Kim, an assistant professor of EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. The research will be presented at the Conference on Computer Vision and Pattern Recognition.

Ignoring negation

Vision-language models (VLMs) are trained on huge collections of images and corresponding captions, learning to encode both as sets of numbers called vector representations. The models use these vectors to distinguish between different images.

A VLM uses two separate encoders, one for text and one for images, and the encoders learn to output similar vectors for an image and its corresponding text caption.
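A minimal sketch of this dual-encoder matching idea, with random vectors standing in for the learned encoders (a real VLM such as CLIP learns these mappings from millions of image-caption pairs; the vectors here are purely illustrative):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity score used to compare an image vector with a caption vector."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)

# Stand-ins for encoder outputs: training pushes a matching image-caption
# pair close together in vector space, and unrelated pairs far apart.
image_vec = rng.normal(size=128)
matching_caption_vec = image_vec + rng.normal(scale=0.1, size=128)
unrelated_caption_vec = rng.normal(size=128)

print(cosine_similarity(image_vec, matching_caption_vec) >
      cosine_similarity(image_vec, unrelated_caption_vec))  # True
```

The matching caption's vector is a small perturbation of the image vector, so its similarity score is near 1, while an unrelated caption scores near 0 in high dimensions.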

"The captions express what is in the images; they are a positive label. And that is actually the whole problem. No one looks at an image of a dog jumping over a fence and captions it by saying 'a dog jumping over a fence, with no helicopters,'" Ghassemi says.

Because image-caption datasets do not contain examples of negation, VLMs never learn to identify it.

To dig deeper into this problem, the researchers designed two benchmark tasks that test VLMs' ability to understand negation.

For the first, they used a large language model (LLM) to re-caption images in an existing dataset, asking the LLM to consider related objects not present in an image and write them into its caption. Then they tested the models by prompting them with negation words to retrieve images that contain certain objects but not others.

For the second task, they designed multiple-choice questions that ask a VLM to select the most appropriate caption from a list of closely related options. These captions differ only by adding a reference to an object that does not appear in the image or by negating an object that does appear in the image.
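Under the same embedding-similarity assumptions as above, this multiple-choice evaluation reduces to picking the candidate caption whose vector is closest to the image's vector. A hypothetical sketch (again with random stand-in vectors, not the paper's actual models):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_caption(image_vec: np.ndarray, candidate_vecs: list) -> int:
    """Return the index of the candidate caption most similar to the image."""
    scores = [cosine_similarity(image_vec, c) for c in candidate_vecs]
    return int(np.argmax(scores))

rng = np.random.default_rng(7)
image_vec = rng.normal(size=64)
candidates = [
    rng.normal(size=64),                          # wrong caption
    image_vec + rng.normal(scale=0.1, size=64),   # correct caption
    rng.normal(size=64),                          # wrong caption
]
print(pick_caption(image_vec, candidates))  # 1
```

The benchmark's difficulty comes from the candidates differing only in a negated or added object, so a model that ignores negation assigns them nearly identical scores.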

The models often failed at both tasks, with image-retrieval performance dropping by nearly 25 percent with negated captions. In answering multiple-choice questions, the best models achieved only about 39 percent accuracy, with several models performing at or even below random chance.

One reason for this failure is a shortcut the researchers call affirmation bias: VLMs ignore negation words and focus on the objects in the images instead.

"This does not just happen for words like 'no' and 'not.' Regardless of how you express negation or exclusion, the models will simply ignore it," Alhamoud says.

This was consistent across every VLM they tested.
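The affirmation bias can be illustrated with a deliberately naive text representation (my own toy example, not the paper's method). The bag-of-words "encoder" below discards function words, including "no", so a caption asserting an object and one negating it become indistinguishable, which is exactly the shortcut the researchers describe:

```python
from collections import Counter

# Words the toy encoder throws away; crucially, negation cues are among them.
STOP_WORDS = {"a", "an", "the", "with", "no", "not", "without"}

def naive_encode(caption: str) -> Counter:
    """Bag of content words; negation is discarded along with stop words."""
    return Counter(w for w in caption.lower().split() if w not in STOP_WORDS)

affirmed = naive_encode("a dog jumping over a fence with a helicopter")
negated = naive_encode("a dog jumping over a fence with no helicopter")

print(affirmed == negated)  # True: opposite meanings, identical representation
```

A learned encoder fails more subtly than this, but the measured behavior is analogous: captions that differ only by a negation word land on nearly the same vector.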

A "solvable problem"

Since VLMs are typically not trained on image captions containing negation, the researchers developed datasets with negation words as a first step toward solving the problem.

Using a dataset of 10 million image-text caption pairs, they prompted an LLM to propose related captions that specify what is excluded from each image, yielding new captions containing negation words.

They had to be especially careful that these synthetic captions still read naturally, or the VLMs could fail in the real world when faced with more complex captions written by humans.
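In the paper this rewriting is done by an LLM; as a rough, self-contained stand-in, a template-based generator can show the shape of the augmented data. The object lists, template phrasings, and function names below are my own illustration, not the authors' pipeline:

```python
import random

# Related-but-absent objects per caption; in the actual pipeline an LLM
# proposes these and writes the negated caption in natural language.
RELATED_OBJECTS = {
    "a dog jumping over a fence": ["helicopter", "cat", "frisbee"],
}

NEGATION_TEMPLATES = [
    "{caption}, with no {absent} in sight",
    "{caption}; there is no {absent}",
    "{caption}, but no {absent}",
]

def negated_caption(caption: str, rng: random.Random) -> str:
    """Produce one synthetic caption that negates a plausible absent object."""
    absent = rng.choice(RELATED_OBJECTS[caption])
    template = rng.choice(NEGATION_TEMPLATES)
    return template.format(caption=caption, absent=absent)

print(negated_caption("a dog jumping over a fence", random.Random(0)))
```

Fixed templates like these would quickly become repetitive, which is presumably why the researchers relied on an LLM to keep the synthetic captions reading naturally.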

They found that fine-tuning VLMs with their dataset led to performance gains across the board. It improved the models' image-retrieval abilities by about 10 percent, while also boosting performance on the multiple-choice question-answering task by about 30 percent.

"But our solution is not perfect. We are just recaptioning datasets, a form of data augmentation. We have not even touched how these models work, but we hope this is a signal that this is a solvable problem, and that others can take our solution and improve it," Alhamoud says.

At the same time, he hopes their work encourages more users to think carefully about the problem they want to use a VLM to solve, and to design examples to test it before deployment.

In the future, researchers could expand upon this work by teaching VLMs to process text and images separately, which may improve their ability to understand negation. In addition, they could develop additional datasets that include image-caption pairs for specific applications, such as health care.
