Microsoft Discovers Single Training Prompt Can Break AI Safety Alignment Across 15 Major Models

Published 2026-02-09 | AI Regulation and Governance | High

Summary

Microsoft researchers published findings on February 9, 2026, demonstrating a technique called "GRP-Obliteration" that can completely break AI safety alignment mechanisms across 15 major language models using a single malicious training prompt. The technique exploits Group Relative Policy Optimization (GRPO), a training method normally used to improve model safety, by inverting it to systematically remove safety guardrails instead. The method works by taking a safety-aligned model, feedi

Alignment: Reinforces current position

Related Positions: ai-governance-and-risk.md, agentic-workflows.md

microsoft, ai-safety, grp-obliteration, safety-alignment, jailbreak, vulnerability, open-weight-models, fine-tuning, llm-security, research