Research says Large Language Models may generate toxic content & expose private information
GPT models handle toxicity with hidden methods. Koyejo states that popular models lack transparency, motivating research into potential issues.

Highlights
- Researchers warn that the perception of Large Language Models(LLMs) as flawless is risky
- Investigation reveals privacy leakage in GPT models inadvertently disclosing sensitive data
- Researchers stress that scepticism is essential when dealing with AI interfaces
Researchers found the extensive language model lacks transparency and can expose private information. Despite concerns about hallucinations, misinformation, and bias in generative AI, more than half of the respondents expressed their willingness to utilise this emerging technology for critical areas such as medical advice and financial planning.
Researchers Sanmi Koyejo from Stanford and Bo Li from the University of Illinois Urbana-Champaign, along with collaborators from UC Berkeley and Microsoft Research, aimed to address this issue in their recent study on GPT models, which they've shared on the arXiv preprint server.
Trust issues in GPT models perceived flawlessness
Li emphasises, "Many people perceive large language models as flawless compared to other models, which is risky, especially for crucial domains. Our research underscores that these models aren't currently dependable enough for critical tasks."
- Elon Musk’s Starlink satellite internet services could soon reach India, regulatory approval pending
Exercise caution
Current GPT models handle toxicity in ways that aren't fully transparent. Koyejo notes, "Many popular models are closed-off and lack transparency, so we're not privy to the intricacies of their training processes." This lack of transparency prompted the researchers to delve into this area, aiming to identify potential pitfalls.
Upon testing benign prompts, Koyejo and Li discovered that GPT-3.5 and GPT-4 did indeed reduce toxic outputs significantly compared to previous models. However, the probability of toxicity still hovered around 32 percent. When presented with adversarial prompts—explicitly instructing the model to produce toxic content—the probability of generating toxicity surged to 100 percent.
GPT models' privacy leaks & bias uncovered
GPT models leaked private data and exhibited biases. GPT-4 had more leaks, indicating responsiveness to prompts. Despite improvements, risks of harmful content persist. Benchmark studies expose behavioral gaps. Koyejo and Li call for unbiased research and user skepticism, emphasizing human oversight. Original research remains essential in navigating AI's evolving landscape.