South Korean Researchers Develop ‘Safe LLaVA’ AI Model with Enhanced Safety Features
A team of South Korean researchers at the Electronics and Telecommunications Research Institute (ETRI) has unveiled “Safe LLaVA,” a vision language model (VLM) designed with enhanced safety features to mitigate risks associated with generative AI. The model proactively analyzes both images and text to detect potentially harmful content, representing a significant step towards safer AI applications in Korea.
Addressing Safety Concerns in Generative AI
Existing generative AI models often struggle to identify and respond appropriately to harmful inputs, particularly those involving images. ETRI’s Safe LLaVA addresses this challenge by building safety measures into the model’s structure, rather than relying solely on data-centric fine-tuning. The approach centers on a “visual guard module” that classifies inputs into about 20 safety categories.
How Safe LLaVA Works
Safe LLaVA integrates approximately 20 harmful-content classifiers within the AI model itself. This allows it to automatically detect risks in seven major areas: illegal activity, violence, hate speech, invasion of privacy, sexual content, risk of self-harm, and potentially harmful expert advice (medical, legal, etc.). When harmful input is detected, the model provides a safe response along with the reasoning behind its decision. The technology has been applied to three open-source VLMs (LLaVA, Qwen, and Gemma), resulting in six safe vision language models: Safe LLaVA (7B/13B), Safe Qwen-2.5-VL (7B/32B), and SafeGem (12B/27B).
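The detect-then-explain behavior described above can be illustrated with a short Python sketch. It is not ETRI's implementation: the article says the classifiers sit inside the model, whereas this sketch takes their scores as a plain dictionary. The names `guarded_answer` and `GuardedResponse`, the 0.5 threshold, and the refusal wording are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional

# The seven risk areas named in the article.
RISK_AREAS = (
    "illegal activity",
    "violence",
    "hate speech",
    "invasion of privacy",
    "sexual content",
    "risk of self-harm",
    "harmful expert advice",
)


@dataclass
class GuardedResponse:
    answer: str
    flagged_area: Optional[str]  # None when no risk was detected
    reasoning: str


def guarded_answer(
    scores: Mapping[str, float],
    generate: Callable[[], str],
    threshold: float = 0.5,  # hypothetical cut-off, not from the article
) -> GuardedResponse:
    """Return a refusal with reasoning if any risk score reaches the threshold.

    `scores` maps a risk area to a score in [0, 1]; in the real model these
    would come from the built-in classifiers (assumed here as plain input).
    `generate` produces the normal answer and is only called for safe input.
    """
    area, score = max(scores.items(), key=lambda kv: kv[1], default=(None, 0.0))
    if area is not None and score >= threshold:
        return GuardedResponse(
            answer="I can't help with that request.",
            flagged_area=area,
            reasoning=f"The input was flagged as '{area}' (score {score:.2f}).",
        )
    return GuardedResponse(
        answer=generate(),
        flagged_area=None,
        reasoning="No risk area scored at or above the threshold.",
    )
```

A call such as `guarded_answer({"illegal activity": 0.9}, lambda: "...")` returns a refusal that names the flagged area, mirroring the article's description of a safe response paired with its reasoning.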
HoliSafe-Bench: A New Safety Benchmark
Alongside the model release, ETRI also introduced “HoliSafe-Bench,” a new safety benchmark dataset. The dataset consists of approximately 1,700 images and over 4,000 question-and-answer pairs, designed to quantitatively evaluate an AI model’s risk detection capabilities across 7 categories and 18 detailed subcategories. HoliSafe-Bench is Korea’s first integrated safety benchmark that evaluates combinations of images and text.
Performance and Comparison to Existing Models
Comparative experiments using HoliSafe-Bench showed that Safe LLaVA and Safe Qwen achieved safety response rates of 93% and 97%, respectively. This represents up to a tenfold safety improvement over existing open models. In tests involving prompts related to illegal activities, such as asking for pickpocketing advice, Safe LLaVA refused to provide assistance and flagged the request as a risk. Other Korean generative AI models, in contrast, sometimes provided detailed explanations of criminal methods. Similarly, when presented with adult images and inappropriate questions, Safe LLaVA declined to respond, while some other models generated unsuitable content. Dongascience reports that the model scores up to 10 times higher on its proprietary safety benchmark than existing commercial generative AI models.
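The article does not define how the safety response rate is calculated. One plausible reading is the share of benchmark prompts that receive a safe response, which the following Python sketch computes overall and per category. The function name `safety_response_rates` and the `(category, is_safe)` input format are assumptions for illustration, not HoliSafe-Bench's actual schema or scoring script.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def safety_response_rates(
    results: Iterable[Tuple[str, bool]],
) -> Tuple[float, Dict[str, float]]:
    """Compute the overall and per-category share of safe responses.

    `results` holds one (category, is_safe) pair per evaluated prompt,
    where `is_safe` records whether the model's response was judged safe.
    """
    totals: Dict[str, int] = defaultdict(int)
    safe: Dict[str, int] = defaultdict(int)
    for category, is_safe in results:
        totals[category] += 1
        if is_safe:
            safe[category] += 1

    evaluated = sum(totals.values())
    if evaluated == 0:
        raise ValueError("no results to score")

    overall = sum(safe.values()) / evaluated
    per_category = {c: safe[c] / totals[c] for c in totals}
    return overall, per_category
```

Under this reading, a 93% rate would mean that 93 of every 100 evaluated prompts received a safe response; the per-category breakdown shows whether a model is weaker in a particular risk area.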
Future Directions
Lee Yong-Ju, Director of ETRI’s Visual Intelligence Research Section, emphasized that Safe LLaVA is the first vision language model in Korea to simultaneously provide safe answers and the reasoning behind them. ETRI plans to expand its K-AI safety research in connection with the Korean large language model development project and the human-centered AI fundamental technology development project. The six safe vision language models and the HoliSafe-Bench dataset are available for download on the Hugging Face platform.