Comprehensive Assessment of GPT Model Credibility: Revealing Potential Risks and Improvement Directions

2025-07-05 00:13:04

Abstract generation in progress

Exploring the Credibility of GPT Models: Comprehensive Assessment and Potential Risks

Recently, a research team composed of the University of Illinois at Urbana-Champaign, Stanford University, the University of California, Berkeley, the Artificial Intelligence Safety Center, and Microsoft Research has released a comprehensive trust evaluation platform for large language models (LLMs). The research results were published under the title "DecodingTrust: A Comprehensive Assessment of GPT Model Trustworthiness."

This study reveals some previously undisclosed potential issues related to the credibility of GPT models. The research found that GPT models are prone to generating harmful and biased outputs and may also leak private information from training data and conversation history. Notably, although GPT-4 is generally more reliable than GPT-3.5 in standard tests, it is more susceptible to attacks when faced with maliciously designed instructions, which may be due to its stricter adherence to misleading directives.

The research team conducted a comprehensive evaluation of the GPT model from eight different perspectives, including adversarial attacks, toxic content and bias, and privacy leaks, among others. For example, when assessing the model's robustness against text adversarial attacks, the researchers designed various testing scenarios, including using the standard benchmark AdvGLUE, adopting different instructive task descriptions, and utilizing self-generated challenging adversarial text AdvGLUE++.

In terms of toxicity and bias, research has found that GPT models generally exhibit lower bias on most stereotype topics under normal conditions. However, when faced with misleading system prompts, the model may be induced to agree with biased content. Notably, GPT-4 is more susceptible to targeted misleading system prompts than GPT-3.5. The degree of bias in the model is also related to the demographics and sensitivity of the topics involved in user queries.

Regarding the issue of privacy leakage, studies have found that GPT models may leak sensitive information from training data, such as email addresses. In some cases, providing additional contextual information can significantly improve the accuracy of information extraction. Furthermore, the model may also leak private information injected into the conversation history. Overall, GPT-4 performs better than GPT-3.5 in protecting personal identifiable information, but both models carry risks when faced with privacy leakage demonstrations.

This study provides a comprehensive framework for assessing the credibility of GPT models and reveals some potential security risks. The research team hopes that this work will encourage more researchers to focus on and improve the credibility issues of large language models, ultimately developing more powerful and reliable models. To promote collaboration, the research team has open-sourced the evaluation benchmark code and designed it to be user-friendly and extensible.

GPT0.35%

View Original

This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.

5 Likes