Someone cranked the saturation on an image generated by Google Gemini’s Nano Banana model all the way up. What appeared was a grid pattern completely invisible to the naked eye under normal conditions. This discovery, which went viral on Reddit, carried significance far beyond curiosity. It was the moment the public first confronted a concrete question: what invisible signatures are hiding inside the AI-generated content we consume every day, and why should we care?

AI watermarks are the authentication technology of the deepfake era. But they are also vulnerable to removal attacks, can degrade output quality, and exist at the intersection of complex regulatory and commercial interests. This post examines the technical foundations of watermarking, the latest research, the attack-defense arms race, and the regulatory landscape in depth.

Watermark Fundamentals: What Gets Hidden Where

The core idea behind AI watermarks is straightforward: embed a signal into AI-generated content that is imperceptible to humans but machine-readable. This enables later verification of whether specific content was generated by AI. However, the technical approach varies dramatically depending on where and how that signal is embedded.

Image Watermarks: Spatial Domain vs. Frequency Domain

Image watermarks operate in two primary domains.

Spatial domain methods directly manipulate pixel values. They alter the brightness or color of pixels at specific locations to encode a pattern. This approach is intuitive to implement but fragile—basic edits like cropping, resizing, or compression can easily destroy the signal. That the grid pattern in Nano Banana images became visible through a mere saturation adjustment suggests the watermark includes a spatial-domain component.

Frequency domain methods are more sophisticated. They decompose the image into frequency components via Fourier Transform or Discrete Cosine Transform (DCT), then embed the watermark signal in specific frequency bands. Since JPEG compression itself is DCT-based, frequency domain watermarks are inherently more resilient to compression than spatial approaches. They also withstand geometric transformations like cropping, rotation, and scaling more effectively.
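The contrast can be sketched in a few lines of NumPy. The example below is a toy illustration, not any production scheme: it spreads a keyed ±1 pattern over a ring of mid-frequency FFT coefficients and detects it by correlating against the same keyed pattern. All function names, radii, and strengths are illustrative choices.

```python
import numpy as np

def ring_mask(shape, r_lo=8, r_hi=16):
    """Select a band of mid-frequency coefficients (centered FFT coordinates)."""
    h, w = shape
    yy, xx = np.meshgrid(np.arange(h) - h // 2, np.arange(w) - w // 2, indexing="ij")
    r = np.hypot(yy, xx)
    return (r >= r_lo) & (r < r_hi)

def key_pattern(mask, key):
    """Keyed pseudo-random ±1 spreading pattern, one value per masked coefficient."""
    return np.random.default_rng(key).choice([-1.0, 1.0], size=int(mask.sum()))

def embed(img, key, strength=60.0):
    F = np.fft.fftshift(np.fft.fft2(img))
    mask = ring_mask(img.shape)
    F[mask] += strength * key_pattern(mask, key)
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))

def detect(img, key):
    """Correlation score: large and positive only when this key's pattern is present."""
    F = np.fft.fftshift(np.fft.fft2(img))
    mask = ring_mask(img.shape)
    return float(np.mean(np.real(F[mask]) * key_pattern(mask, key)))

rng = np.random.default_rng(0)
img = rng.normal(size=(64, 64))          # stand-in for a grayscale image
marked = embed(img, key=42)
noisy = marked + rng.normal(scale=0.5, size=marked.shape)  # simulated degradation
```

A simple threshold on the score then separates marked from unmarked content, and because the signal is spread across many frequency coefficients it survives the added noise that would erase a single-pixel mark—the robustness property frequency-domain methods are chosen for.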

Modern production watermark systems mostly use hybrid approaches combining both domains. The dominant architecture is a deep learning encoder-decoder framework. An encoder network takes the original image and watermark message as input, producing a watermarked image. A decoder network extracts the message from the watermarked image. The two networks are trained adversarially: the encoder learns to embed a signal with minimal distortion that the decoder can read, while the decoder learns to recover the signal even after various edits and attacks.

Text Watermarks: Manipulating Token Probabilities

Text watermarking faces a fundamentally different challenge than images. Images offer continuous pixel values where subtle changes can be made, but text is a discrete sequence of tokens. Changing “the” to “teh” is immediately obvious. Text watermarks must therefore create statistically detectable patterns while preserving the meaning of the text.

The most influential approach is the green-list/red-list method proposed by Kirchenbauer et al. in 2023 (arXiv: 2301.10226). At each token generation step, the preceding token is used as a seed to randomly partition the entire vocabulary into a “green list” and a “red list.” The logits of green-list tokens are then boosted by a bias value δ (delta), increasing their selection probability.

Two key parameters control the watermark. γ (gamma) determines the green list fraction—a default of 0.25 means 25% of the vocabulary is classified as green. δ (delta) controls the logit bias magnitude—a default of 2.0 substantially increases green token selection probability. During detection, the proportion of green tokens in the generated text is calculated. In naturally written text without watermarking, green tokens are expected at roughly 25% (the γ ratio), but watermarked text shows a significantly higher green token proportion. For example, if 28 out of 36 tokens are green when only 9 would be expected naturally, the probability of this occurring by chance is approximately 6×10⁻¹⁴—vanishingly small.
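A minimal simulation makes both halves of the scheme concrete—the biased sampler and the z-score detector. This is an illustrative reconstruction with a toy uniform "language model" and a small vocabulary, not the authors' implementation:

```python
import math
import random

VOCAB_SIZE, GAMMA, DELTA = 1000, 0.25, 2.0

def green_list(prev_token):
    """Seed a PRNG with the preceding token and take the first 25% as 'green'."""
    rng = random.Random(prev_token)
    vocab = list(range(VOCAB_SIZE))
    rng.shuffle(vocab)
    return set(vocab[: int(GAMMA * VOCAB_SIZE)])

def generate(n_tokens, watermark, seed=0):
    """Toy model with uniform logits; the watermark adds +DELTA to green logits."""
    rng = random.Random(seed)
    tokens = [rng.randrange(VOCAB_SIZE)]
    for _ in range(n_tokens):
        if watermark:
            green = green_list(tokens[-1])
            weights = [math.exp(DELTA) if t in green else 1.0 for t in range(VOCAB_SIZE)]
        else:
            weights = [1.0] * VOCAB_SIZE
        tokens.append(rng.choices(range(VOCAB_SIZE), weights=weights)[0])
    return tokens

def z_score(tokens):
    """One-proportion z-test: how far the green fraction sits above GAMMA."""
    n = len(tokens) - 1
    greens = sum(tokens[i] in green_list(tokens[i - 1]) for i in range(1, len(tokens)))
    return (greens - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
```

With δ = 2.0 and uniform base logits, roughly 70% of tokens land in the green list instead of the expected 25%, so a 200-token sample yields a z-score around 15 while unwatermarked text stays near zero.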

The limitations are clear. Short texts lack statistical power for reliable detection. Paraphrasing that substitutes tokens weakens the signal. And larger δ values make detection easier but degrade the naturalness of the text. This is the fundamental watermark tradeoff: the tension between detectability and output quality.

Audio and Video Watermarks

Audio watermarks operate on spectrograms (frequency-time representations). Google SynthID’s audio variant, applied to the Lyria music generation model, embeds signals in frequency bands where human hearing is less sensitive. Video watermarks extend per-frame image watermarking but face the additional constraint of maintaining temporal coherence across frames.

Major Approaches: A Deep Dive

Google SynthID: Watermarking at Industrial Scale

SynthID is the most comprehensive AI watermark system, developed by Google DeepMind. It covers images, text, audio, and video, with the text watermark open-sourced via Hugging Face in October 2024.

SynthID Image intervenes directly in the diffusion model’s generation process. Two deep learning models are trained in tandem: an embedder performs pixel-level adjustments during generation that are statistically detectable but imperceptible to humans, and a detector learns to recognize these patterns. According to the paper published in October 2025 (arXiv: 2510.09263), SynthID-Image is designed for internet-scale image watermarking with robustness against common transformations including cropping, resizing, JPEG compression, and noise addition.

SynthID Text’s core innovation is Tournament Sampling. At each token generation step, multiple candidate tokens are matched in pairwise comparisons following a tournament bracket structure. A pseudo-random function called a g-function assigns a score to each token, and in each pairwise comparison, the token with the higher g-value advances to the next round. This process repeats until a single token remains. Statistically, tokens with higher g-values are preferentially selected. This bias is the watermark signal.

Technically, SynthID-Text uses a LeftHash hash function with context size h=3, combined with tournament sampling and caching. Increasing the context size reduces spoofing risk. During detection, Bayesian score functions or mean score functions aggregate g-values across the entire text, comparing against a threshold to determine watermark presence.
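The bracket logic can be sketched as follows. This is a simplified reconstruction, not DeepMind's code: the g-function here is a SHA-256 hash over (key, context, token, layer), the toy model draws candidates uniformly, and detection uses the simple mean-of-g score rather than the Bayesian variant.

```python
import hashlib
import random

def g(key, context, token, layer):
    """Pseudo-random score in [0, 1) from a keyed hash (illustrative g-function)."""
    digest = hashlib.sha256(f"{key}|{context}|{token}|{layer}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def tournament_sample(candidates, key, context, rounds=3):
    """Single-elimination bracket: the token with the higher g-value wins each match."""
    pool = list(candidates)
    for layer in range(rounds):
        pool = [max(a, b, key=lambda t: g(key, context, t, layer))
                for a, b in zip(pool[0::2], pool[1::2])]
    return pool[0]

def generate(n_tokens, key, watermark, vocab_size=100, seed=0):
    rng = random.Random(seed)
    out = [0]
    for _ in range(n_tokens):
        if watermark:
            # 2**rounds candidates drawn from the (here: uniform) model distribution
            cands = rng.choices(range(vocab_size), k=8)
            out.append(tournament_sample(cands, key, out[-1]))
        else:
            out.append(rng.randrange(vocab_size))
    return out

def mean_score(tokens, key, rounds=3):
    """Mean g-value over positions and layers; ~0.5 for unwatermarked text."""
    vals = [g(key, tokens[i - 1], tokens[i], layer)
            for i in range(1, len(tokens)) for layer in range(rounds)]
    return sum(vals) / len(vals)
```

Because the winner had to post the higher g-value in every round it survived, the mean score for watermarked text settles near 2/3 (the expected maximum of two uniform draws) versus 0.5 for ordinary text—that statistical gap is the signal the detector thresholds.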

A distinctive design property of SynthID is its support for non-distortionary mode. In this mode, the watermark embeds a detectable signal without altering the model’s output distribution. This means watermarking without quality degradation, though naturally with lower detection power than distortionary mode.

In May 2025, a Unified SynthID Detector was released for cross-media watermark verification.

Tree-Ring Watermarks: Fingerprinting Diffusion Models

Tree-Ring Watermarks, proposed by Wen et al. in 2023 (arXiv: 2305.20030, NeurIPS 2023), offer an elegant approach to diffusion model watermarking. While existing methods apply watermarks as a post-hoc modification after image generation, Tree-Ring intervenes in the generation process itself.

The key idea is to embed a pattern in the Fourier space of the initial noise vector used by the diffusion model. Diffusion models start from pure Gaussian noise and progressively denoise to produce an image. Tree-Ring replaces this initial noise with noise containing a specific concentric ring pattern (resembling tree growth rings) embedded in Fourier space.

Why Fourier space? The mathematical properties of the Fourier transform make this pattern invariant to convolutions, crops, dilations, flips, and rotations. For detection, DDIM inversion is applied to the generated image to recover the initial noise vector, which is then checked in Fourier space for the embedded pattern.
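A stripped-down version of the idea is shown below, with illustrative radii and key values. Real Tree-Ring runs the full diffusion pipeline and recovers the initial noise via DDIM inversion; this sketch skips both and checks the pattern on the noise directly.

```python
import numpy as np

def ring_cells(shape, radius, width=1.0):
    """Cells at a given distance from the center of the shifted Fourier plane."""
    h, w = shape
    yy, xx = np.meshgrid(np.arange(h) - h // 2, np.arange(w) - w // 2, indexing="ij")
    r = np.hypot(yy, xx)
    return (r >= radius - width / 2) & (r < radius + width / 2)

def embed(noise, key_values, radii):
    """Overwrite concentric Fourier-space rings of the initial noise with key constants."""
    F = np.fft.fftshift(np.fft.fft2(noise))
    for value, radius in zip(key_values, radii):
        F[ring_cells(noise.shape, radius)] = value
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))

def score(noise, key_values, radii):
    """Mean L1 distance between observed ring magnitudes and the key (low = match)."""
    F = np.fft.fftshift(np.fft.fft2(noise))
    return float(np.mean([abs(np.mean(np.abs(F[ring_cells(noise.shape, r)])) - v)
                          for v, r in zip(key_values, radii)]))

key_values, radii = [200.0, 80.0, 140.0], [6, 10, 14]
rng = np.random.default_rng(0)
clean = rng.normal(size=(64, 64))
marked = embed(clean, key_values, radii)
```

Because each ring holds a single value that depends only on the radius, the score is unchanged by a 90-degree rotation (np.rot90) or a flip—the invariance that motivates the concentric design.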

Tree-Ring offers three key advantages. First, it can be applied as a plug-in to any existing diffusion model without modifying model weights. Second, FID (Fréchet Inception Distance) degradation is negligible—image quality is essentially unchanged. Third, robustness to geometric transformations surpassed alternatives available at the time of publication.

Gaussian Shading: Provably Lossless Watermarking

Gaussian Shading, presented by Yang et al. at CVPR 2024 (arXiv: 2404.04956), goes a step further than Tree-Ring by claiming provably performance-lossless watermarking.

The technical core maps the watermark bitstring to a latent representation that follows a standard Gaussian distribution. Since the diffusion model’s initial noise is originally sampled from a standard Gaussian distribution, the watermarked latent representation is statistically indistinguishable from the non-watermarked one. This is the mathematical basis for the “performance-lossless” claim.

The embedding process comprises three stages: watermark diffusion, randomization, and distribution-preserving sampling. Extraction recovers the latent representation via DDIM inversion and inverse sampling, then decodes the watermark bits.
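The distribution-preserving trick can be sketched with the standard-normal inverse CDF from the Python stdlib. This is an illustrative reduction of the three stages: one bit per latent entry, an XOR keystream standing in for the paper's encryption-based randomization, and no DDIM inversion.

```python
from statistics import NormalDist
import numpy as np

_norm = NormalDist()

def _keystream(key, n):
    """Keyed pseudo-random bits; XOR-ing makes the embedded bits look uniform."""
    return np.random.default_rng(key).integers(0, 2, size=n)

def embed(message_bits, key, rng):
    enc = np.bitwise_xor(message_bits, _keystream(key, len(message_bits)))
    # bit 1 -> sample the upper half of N(0,1), bit 0 -> the lower half;
    # with uniform bits the marginal stays exactly standard Gaussian
    u = rng.uniform(0.0, 0.5, size=len(enc))
    q = np.where(enc == 1, 0.5 + u, u)
    return np.array([_norm.inv_cdf(x) for x in q])

def extract(latent, key):
    bits = (np.asarray(latent) > 0).astype(int)
    return np.bitwise_xor(bits, _keystream(key, len(bits)))

rng = np.random.default_rng(1)
message = np.tile([1, 0, 1, 1, 0, 0, 1, 0], 32)   # 256 watermark bits
latent = embed(message, key=7, rng=rng)
```

Extraction with the correct key returns the original message, and the latent's sample mean and variance match N(0,1) within sampling error—which is exactly why the watermarked starting noise is statistically indistinguishable from an ordinary draw.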

Gaussian Shading’s decisive advantage is being training-free—no model parameter modifications required. It works as a plug-and-play solution with existing diffusion models and demonstrated superior robustness compared to prior methods across multiple Stable Diffusion versions. A follow-up study, Gaussian Shading++, was published in 2025 addressing additional challenges in real-world deployment scenarios.

Cryptographic Undetectable Watermarks: The Christ et al. Approach

Christ, Gunn, and Zamir’s work at CRYPTO 2024 (arXiv: 2306.09194) brought cryptographic rigor to the watermark field. Their definition of an “undetectable” watermark means that without knowledge of a secret key, it is computationally intractable to distinguish watermarked outputs from those of the original model.

Their construction is based on the existence of one-way functions, a standard assumption in cryptography. It guarantees three formal properties:

  • Completeness: Any watermarked text will be detected.
  • Soundness: Text cannot be watermarked without knowledge of the secret key.
  • Distortion-freeness: Watermarking does not change the output distribution.

Theoretically elegant, but with important limitations. It only works for autoregressive sampling and has low robustness to transformations like paraphrasing. Its value lies more in establishing the theoretical upper bound of what watermarks can achieve than in practical deployment.

The Arms Race: Attacks and Defenses

UnMarker: A Universal Watermark Removal Attack

Any serious discussion of watermark effectiveness must address removal attacks. UnMarker (arXiv: 2405.08363), developed by researchers at the University of Waterloo, represents the first universal, data-free, black-box, query-free attack against defensive watermarking.

Previous regeneration-based attacks (using diffusion models or VAEs) failed against semantic watermarks. UnMarker defeated all tested watermarking schemes, semantic ones included. Classical spread-spectrum watermarks and the major deep learning methods—HiDDeN, StegaStamp, TrustMark, VINE—were all broken by UnMarker.

The implications are serious. Defensive watermarking has inherent tradeoffs that may render it invalid as a deepfake countermeasure, and robustness claims against limited adversaries require reassessment.

Diffusion-Based Removal Attacks

From 2025 onward, research on using diffusion models themselves as watermark removal tools has proliferated. “Vanishing Watermarks” (arXiv: 2602.20680) demonstrated that diffusion-based image editing can defeat even robust watermarks. Adding partial noise to an image and then re-denoising it destroys the watermark signal while preserving semantic content.

RAVEN (arXiv: 2601.08832) took a more creative approach—using 3D Novel View Synthesis to change the viewpoint of an image, thereby neutralizing watermarks embedded in the 2D plane. “Image Watermarks are Removable Using Controllable Regeneration from Clean Noise” (arXiv: 2410.05470) showed that controlled regeneration from clean noise can remove watermarks while maintaining image quality.

Detection Accuracy vs. False Positives

Every watermark system faces a tradeoff between false positive rate (incorrectly flagging human-written text as AI-generated) and false negative rate (missing AI-generated text). This is particularly sensitive in social contexts. Falsely flagging a student’s essay as AI-written could result in unjust penalties, and research suggests that non-native English speakers’ text is more frequently misclassified as AI-generated. This false positive problem was one of the primary reasons OpenAI hesitated to release its text watermark tool.
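The tension is easy to quantify for the green-list scheme's z-test. The sketch below uses a normal approximation with illustrative parameters (p_w is an assumed green-token rate for watermarked text, not a measured figure):

```python
import math
from statistics import NormalDist

_norm = NormalDist()

def false_positive_rate(z_threshold):
    """Chance that human text's green fraction clears the threshold by luck."""
    return 1.0 - _norm.cdf(z_threshold)

def detection_power(z_threshold, n_tokens, gamma=0.25, p_w=0.5):
    """Chance of flagging watermarked text whose true green rate is p_w."""
    mu = (p_w - gamma) * n_tokens / math.sqrt(n_tokens * gamma * (1 - gamma))
    return 1.0 - _norm.cdf(z_threshold - mu)
```

With a threshold of z = 4, the false positive rate is about 3.2e-5, but detection power on a 25-token snippet is only about 0.13 versus essentially 1.0 at 400 tokens—a concrete picture of why short texts lack statistical power and why any deployment must pick a point on this curve.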

Red Teaming in Practice: From NeurIPS to the Frontlines

NeurIPS 2024 “Erasing the Invisible” Competition

The most systematic real-world test of watermark robustness was the NeurIPS 2024 “Erasing the Invisible” competition, in which 298 teams submitted 2,722 entries across two tracks.

Black-box track: Attackers have zero knowledge about the watermarking method. The winning team clustered images based on spatial or frequency-domain artifacts, then applied image-to-image diffusion models with controlled noise injection and semantic priors from ChatGPT-generated captions.

Beige-box track: Attackers know which watermarking method was used but not the specific keys or parameters. The winning team combined adaptive VAE-based evasion attacks with test-time optimization and color-contrast restoration in CIELAB color space.

The results were sobering. All top-5 teams removed watermarks from over 89% of images, with the winning team achieving a 95.7% removal rate—all while preserving high visual quality.

Microsoft’s GenAI Product Red-Teaming

Microsoft conducted large-scale red-team testing across more than 100 generative AI products. For text watermarks, numerous bypass cases were discovered through prompt injection that induced models to generate text without watermarks. Their hybrid approach (automated tools plus human experts) found vulnerabilities within 30-50 minutes.

Security Theater or Effective Defense?

Red-teaming results don’t mean watermarks are useless. The critical question is the threat model.

Against casual sharing by unaware users, watermarks are effective. Most people don’t know watermarks exist, let alone try to remove them. For platform-level automated detection, watermarks also hold value—social media platforms scanning uploads for AI-generated content will mostly encounter unattacked content.

But against motivated adversaries (state-level actors, organized disinformation campaigns), watermarks alone are not a realistic defense. In that scenario, watermarks are just one layer in a defense-in-depth strategy.

C2PA: Metadata-Based Provenance Tracking

Operating on a different axis from watermarks is the C2PA (Coalition for Content Provenance and Authenticity) standard. A Linux Foundation project with over 300 participating organizations, it records the origin and modification history of digital content through cryptographically signed metadata.

How It Works

C2PA attaches Content Credentials to content. These credentials include the content creator (camera, AI model, etc.), creation timestamp, and subsequent editing history, with each step verifiable through cryptographic signatures. Linked in a chain, they enable tracking the content’s entire lifecycle.
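The chain structure can be mimicked with stdlib primitives. This is a conceptual sketch only: real C2PA manifests use X.509 certificates and COSE signatures rather than the shared-secret HMAC shown here, and the field names below are made up for illustration.

```python
import hashlib
import hmac
import json

SECRET = b"demo-signing-key"  # stands in for a real private key / certificate

def sign_claim(content, action, parent_signature=None):
    """Create a signed claim binding an action to a content hash and its predecessor."""
    claim = {
        "action": action,  # e.g. "created", "edited"
        "content_hash": hashlib.sha256(content).hexdigest(),
        "parent": parent_signature,  # links claims into a verifiable chain
    }
    payload = json.dumps(claim, sort_keys=True).encode()
    claim["signature"] = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return claim

def verify_chain(content, claims):
    """Check every signature, the parent links, and the final content hash."""
    parent = None
    for claim in claims:
        body = {k: v for k, v in claim.items() if k != "signature"}
        payload = json.dumps(body, sort_keys=True).encode()
        expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, claim["signature"]) or claim["parent"] != parent:
            return False
        parent = claim["signature"]
    return claims[-1]["content_hash"] == hashlib.sha256(content).hexdigest()

original = b"pixels-v1"
edited = b"pixels-v2"
chain = [sign_claim(original, "created")]
chain.append(sign_claim(edited, "edited", chain[0]["signature"]))
```

Verifying the edited bytes against the full chain succeeds, while tampered content or a forged claim fails verification—mirroring how C2PA makes any undeclared modification detectable.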

How It Differs from Watermarks

The fundamental difference between C2PA and watermarks is persistence. C2PA metadata is easily lost through screenshots, re-uploads, or format conversions. Watermarks survive these transformations to varying degrees. Current best practice combines both: embed a watermark like SynthID directly in the content, then link C2PA metadata to the watermark so provenance information remains recoverable as long as the watermark survives.

Adoption Status

C2PA specification 2.1 was released in 2025 with strengthened technical requirements against tampering attacks. Google joined as a steering committee member, displaying AI-generation status in Search and integrating C2PA metadata into advertising systems. Adobe and Digimarc partnered on durable, interoperable credentials. The C2PA specification is on the ISO fast-track standardization path, with adoption expanding to cameras, scanners, browsers, and streaming services.

The Regulatory Landscape: Pressure Toward Mandates

EU AI Act

The EU AI Act, scheduled for enforcement in August 2026, includes transparency obligations for AI-generated content. The first draft of the Code of Practice on transparency, published in December 2025, takes the position that no single marking technique is currently sufficient and proposes a multilayered approach.

Specific requirements include:

  • AI outputs (audio, image, video, text) must be marked in a machine-readable format.
  • Labeling icons must be clear and distinguishable at “first exposure.”
  • For live video, labeling must be displayed persistently “where feasible.”
  • Audio outputs require audible disclaimers.

Debate over the effectiveness of these requirements is ongoing. If watermarks are not robust, does mandating them accomplish anything? If metadata can be trivially stripped, is metadata-based labeling sufficient?

United States: Biden’s Executive Order and Beyond

President Biden’s 2023 Executive Order on AI directed the Department of Commerce and NIST to establish guidelines for digital watermarking of AI-generated content. Key objectives include establishing standards and best practices for identifying and labeling synthetic content, and establishing the authenticity and provenance of digital content produced by the federal government.

The order also mandated red-team testing for AI models, requiring watermark system vulnerabilities to be discovered and addressed before deployment.

Industry Self-Regulation

Independent of regulation, major AI companies have diverged in their approaches. Google is aggressively deploying and open-sourcing SynthID. OpenAI has taken a more cautious stance—despite internally possessing a text watermark tool with 99.9% accuracy, they deferred public release for nearly two years. An internal survey found that roughly 30% of ChatGPT users would reduce their usage if watermarking were implemented. Instead, OpenAI opted for C2PA metadata and classifier-based approaches: embedding C2PA metadata in DALL-E images and combining it with a separate classifier achieving 98% accuracy.

Adjacent Technology: Nightshade and Glaze

Frequently confused with watermarks but fundamentally different are Glaze and Nightshade, developed by Ben Zhao’s research team at the University of Chicago. These tools aim to prevent artists’ works from being used to train AI models without permission.

Glaze adds perturbations to image pixels that are nearly imperceptible to humans but prevent AI models from learning the artist’s style. Since its launch in March 2023, it has been downloaded more than 7 million times.

Nightshade is more aggressive. It transforms images into “poison” samples such that AI models trained on them without consent exhibit unpredictable behavior—for example, a prompt for “a cow floating in space” might generate a handbag image instead. It has been downloaded over 1.6 million times.

However, 2025 research revealed vulnerabilities. LightShed, a data cleaning system developed by researchers from the UK, Germany, and the US, detected Nightshade-protected images with 99.98% accuracy and effectively removed the embedded perturbations. This undermines the core value proposition of Glaze and Nightshade, demonstrating that data protection tools, like watermarks, are not immune to the arms race.

The Fundamental Tradeoffs

The tensions surrounding AI watermarks are not merely technical. Multiple dimensions of tradeoffs exist simultaneously.

Detectability vs. Output Quality

Stronger watermark signals make detection easier but degrade content quality. In the Kirchenbauer method, increasing δ strengthens the green-token bias for easier detection but makes text less natural. For image watermarks, embedding more bits increases detection confidence but introduces visual artifacts. Some research suggests watermark insertion can reduce model response quality by 10-20%.

Robustness vs. Computational Cost

Watermarks that withstand more transformations and attacks require more computational resources for embedding and detection. Tree-Ring watermark detection requires DDIM inversion—a computationally expensive operation. Performing high-cost watermarking while handling hundreds of millions of requests in real-time services creates significant infrastructure overhead.

Security vs. Transparency

Publishing watermark algorithms enables academic verification and improvement but also provides information to attackers. SynthID Text’s open-sourcing benefited the research community, but ETH Zurich’s SRI Lab used the published code to systematically probe the watermark’s vulnerabilities (arXiv: 2603.03410). Conversely, closed-source watermarks prevent independent verification, making trust-building difficult.

Private Interest vs. Public Good

OpenAI’s dilemma illustrates this perfectly. Deploying text watermarks would serve the public interest of combating misinformation but could cause business losses through user attrition. Regulation could resolve this dilemma, but questions about the technical effectiveness of mandated approaches remain.

What This Means

For Creators

The advancement of AI watermarks and C2PA carries dual significance. Tools for proving the provenance of original content are strengthening, but the battlefield for authenticating AI-generated content is growing more complex. For artists, protection tools like Glaze/Nightshade exist, but as the LightShed research demonstrates, their longevity is uncertain.

For Consumers

No technology yet provides a 100% definitive answer to “is this content real?” The absence of a watermark does not mean content is not AI-generated (watermarks can be removed). The presence of a watermark does not definitively mean content is AI-generated (false positives are possible). The absence of C2PA metadata also says nothing about content authenticity (it can be lost through screenshots). Media literacy remains essential regardless of technological progress.

For the Ecosystem

Watermarks are not a standalone solution but one component of a multi-layered defense strategy. A realistic approach combines:

  1. Embedded watermarks: Signals directly inserted into content (SynthID, Tree-Ring, etc.)
  2. Metadata-based provenance tracking: Cryptographically verifiable history (C2PA)
  3. Post-hoc classifiers: Independent models that determine whether content is AI-generated
  4. Platform policies: Automated detection and labeling at upload time
  5. Regulatory frameworks: Mandatory transparency and accountability requirements

No single technology is complete. But in combination, they create an effective deterrent for the unmotivated majority and a cost-raising barrier for the motivated few. Content authentication in the deepfake era is simultaneously a technical problem and a social one, and watermarks are a critical piece of that complex puzzle.