A Novel Approach to Fine-Tuning Vision-Language Models

Reviewed by Bethan Davies, Dec 10, 2024

A research team from Tokyo University of Science (TUS), Japan, led by Associate Professor Go Irie, has developed a novel technique called "black-box forgetting." This method enables the iterative optimization of text prompts provided to a black-box vision-language classifier model, allowing it to selectively "forget" specific classes it can recognize.

The capabilities of large-scale pre-trained AI models have advanced rapidly, as seen in models like CLIP and ChatGPT. These versatile generalist models perform well across a wide range of tasks, driving their widespread public adoption. However, such versatility comes with trade-offs.

Training and operating these models demand significant energy and time, conflicting with sustainability goals and limiting their deployment on standard computing systems. Moreover, in many real-world applications, users require AI models to serve specific roles rather than operate as generalists. In such cases, a model's broad capabilities can be redundant or even detrimental, reducing task-specific accuracy. This raises an important question: could large-scale pre-trained models be used more efficiently by enabling them to "forget" unnecessary information?

In a paper set to be presented at the Annual Conference on Neural Information Processing Systems (NeurIPS 2024), the TUS team led by Dr. Irie tackles this challenge with its "black-box forgetting" technique, which iteratively optimizes the text prompts given to a black-box vision-language classifier so that the model selectively "forgets" certain classes it recognizes.

The study, co-authored by Yusuke Kuwana and Yuta Goto from TUS and Dr. Takashi Shibata from NEC Corporation, introduces a novel approach to tailoring AI models for specific applications.

In practical applications, the classification of all kinds of object classes is rarely required. For example, in an autonomous driving system, it would be sufficient to recognize limited classes of objects such as cars, pedestrians, and traffic signs. We would not need to recognize food, furniture, or animal species. Retaining the classes that do not need to be recognized may decrease overall classification accuracy, as well as cause operational disadvantages such as the waste of computational resources and the risk of information leakage.

Go Irie, Associate Professor, Tokyo University of Science

Although some techniques for selective forgetting in pre-trained models exist, they typically require a white-box setting, where users have access to the model's internal parameters and architecture. In most cases, however, users interact with black-box models, where access to internal details is restricted due to commercial or ethical considerations. To address this challenge, the researchers implemented a derivative-free optimization strategy, which does not rely on the model's gradients.

For this purpose, the team adapted CMA-ES (Covariance Matrix Adaptation Evolution Strategy), an evolutionary algorithm, using the image classifier CLIP as their test subject. The approach repeatedly samples candidate prompts to feed to the model, evaluates them with predefined objective functions, and updates a multivariate normal distribution over candidates based on the results.
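This sample-evaluate-update loop maps naturally onto the standard ask/tell interface of evolution strategies. The following minimal sketch uses the open-source Python package cma; the dimensionality and the stand-in objective are illustrative assumptions, since the article does not detail the paper's actual prompt encoding or scoring function:

    import cma
    import numpy as np

    DIM = 64  # size of the latent context vector being searched (illustrative)

    def score_prompt(z: np.ndarray) -> float:
        # Stand-in objective so the sketch runs end to end. A real
        # implementation would build prompt-token embeddings from z, query
        # the black-box classifier, and return a loss that rewards accuracy
        # on retained classes and confusion on classes to be forgotten.
        return float(np.sum(z ** 2))

    # Start the search at the zero vector with an initial step size of 0.5.
    es = cma.CMAEvolutionStrategy(np.zeros(DIM), 0.5, {"maxiter": 50, "verbose": -9})
    while not es.stop():
        candidates = es.ask()  # sample candidate latent contexts
        es.tell(candidates, [score_prompt(np.asarray(c)) for c in candidates])
    best_context = es.result.xbest  # best latent context found

Each ask/tell cycle needs only loss values computed from the model's outputs, never gradients, which is what makes the approach viable against a black-box API.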

However, derivative-free optimization becomes increasingly inefficient as problems grow. As the number of classes to be forgotten increases, the "latent context" used to optimize the input prompts expands to an impractical size.

To overcome this limitation, the researchers introduced a technique called "latent context sharing." It decomposes the latent context into smaller components, some unique to individual prompt tokens and others shared across multiple tokens. Optimizing these smaller elements instead of the entire latent context dramatically shrinks the search space, making the optimization tractable.
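A rough sketch of the decomposition, under assumed sizes: each of 16 prompt tokens gets a small token-specific latent, all tokens share one common latent, and a fixed random projection maps the result up to the embedding size. The exact split and projection in the paper may differ; this only illustrates the dimensionality savings:

    import numpy as np

    N_TOKENS, TOKEN_DIM = 16, 512   # prompt tokens and embedding size (assumed)
    UNIQUE_DIM, SHARED_DIM = 8, 8   # small per-token and shared latent sizes

    rng = np.random.default_rng(0)
    # Fixed random projection from the small latent space to embedding size,
    # so only the small latents ever need to be searched.
    proj = rng.standard_normal((UNIQUE_DIM + SHARED_DIM, TOKEN_DIM))

    def assemble_contexts(unique: np.ndarray, shared: np.ndarray) -> np.ndarray:
        """Build per-token contexts from token-specific and shared parts."""
        tiled = np.tile(shared, (N_TOKENS, 1))  # same shared part for every token
        return np.concatenate([unique, tiled], axis=1) @ proj  # (N_TOKENS, TOKEN_DIM)

    unique = rng.standard_normal((N_TOKENS, UNIQUE_DIM))  # optimized per token
    shared = rng.standard_normal(SHARED_DIM)              # optimized once, shared

    contexts = assemble_contexts(unique, shared)
    print("search dims:", N_TOKENS * TOKEN_DIM, "->", N_TOKENS * UNIQUE_DIM + SHARED_DIM)
    # prints: search dims: 8192 -> 136

With these assumed sizes, the optimizer searches 136 numbers instead of 8,192, which keeps the derivative-free search feasible even as the number of prompt tokens grows.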

The researchers validated their method on several benchmark image-classification datasets, aiming to make CLIP "forget" 40% of the classes in each dataset. This is the first study to explore selective forgetting in a pre-trained vision-language model under black-box conditions, and the results were highly promising against established performance baselines.
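The article does not spell out the objective that drives forgetting. One plausible formulation, assuming the black-box API returns class probabilities, combines ordinary cross-entropy on the classes to keep with an entropy-maximization term on the classes to forget. The sketch below illustrates that idea and is not the paper's published objective:

    import numpy as np

    def forgetting_loss(probs: np.ndarray, labels: np.ndarray,
                        forget: np.ndarray) -> float:
        """probs: (batch, n_classes) class probabilities from the black-box
        model; labels: (batch,) true class ids; forget: (batch,) True where
        the sample belongs to a class the model should forget."""
        eps = 1e-12
        keep = ~forget
        # Cross-entropy on retained classes keeps predictions accurate.
        ce = -np.log(probs[keep, labels[keep]] + eps).mean() if keep.any() else 0.0
        # Negative entropy on forgotten classes: minimizing it pushes the
        # model toward maximally confused (near-uniform) predictions.
        p = probs[forget]
        neg_ent = (p * np.log(p + eps)).sum(axis=1).mean() if forget.any() else 0.0
        return float(ce + neg_ent)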

This breakthrough has significant implications for artificial intelligence and machine learning. It could enable large-scale models to excel in specialized tasks, enhancing their applicability. Another potential use case is controlling image generation models by making them 'forget' specific visual contexts, thereby preventing the creation of undesirable content.

Furthermore, the proposed strategy has the potential to help address privacy concerns, which are becoming increasingly prevalent in this field.

"If a service provider is asked to remove certain information from a model, this can be accomplished by retraining the model from scratch by removing problematic samples from the training data. However, retraining a large-scale model consumes enormous amounts of energy. Selective forgetting, or so-called machine unlearning, may provide an efficient solution to this problem," Dr. Irie added.

In other words, it could aid in the development of solutions to protect the so-called "Right to be Forgotten," a particularly sensitive issue in healthcare and banking.

This breakthrough technique not only empowers large-scale AI models but also protects end users, paving the way for AI's seamless integration into daily life.

Source:

Tokyo University of Science
