
Injecting Image Guidance into Diffusion Models

Agata Zywot, Iason Skylitsis, Thijmen Nijdam, Zoe Tzifa-Kratira, Derck Prinzhorn, Konrad Szewczyk, Aritra Bhowmik

Figure: Visual Concept Fusion pipeline overview.

Summary

Visual Concept Fusion (VCF) enables dual conditioning of Stable Diffusion on both a text prompt and a reference image at inference time, without retraining the model. It trains a lightweight aligner network (a 2-layer MLP) that translates CLIP image embeddings into the CLIP text embedding space, allowing the frozen diffusion model to process visual and textual guidance together.

An optional Prompt-Noise Optimization (PNO) step refines the generation at test time by iteratively adjusting the conditioning and initial noise to maximize visual similarity to the reference. The result is generated images that respect the text prompt while inheriting style, color palette, composition, and texture from the reference image.

Key Contributions

Method

The pipeline consists of three components. The first is the Image Aligner, a small two-layer MLP that translates CLIP image features into the same representation space as CLIP text features. It is trained with a combined objective: a contrastive loss (InfoNCE) for global alignment and a cross-attention loss for preserving local detail.
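To make the aligner and its global objective concrete, here is a minimal pure-Python sketch of a two-layer MLP and a one-directional InfoNCE loss. The hidden width, the ReLU activation, the temperature of 0.07 and all function names are assumptions for illustration, not the project's actual settings, and the cross-attention loss is omitted. A real implementation would use a tensor library with automatic differentiation.

```python
import math
import random


def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Randomly initialise a two-layer MLP as plain nested lists."""
    rng = random.Random(seed)

    def layer(n_in, n_out):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        return weights, [0.0] * n_out

    return layer(in_dim, hidden_dim), layer(hidden_dim, out_dim)


def linear(x, weights, bias):
    """Affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def aligner_forward(mlp, image_embedding):
    """Map an image embedding towards the text embedding space."""
    (w1, b1), (w2, b2) = mlp
    hidden = [max(0.0, h) for h in linear(image_embedding, w1, b1)]  # ReLU (assumed)
    return linear(hidden, w2, b2)


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(aligned_images, text_embeddings, temperature=0.07):
    """Mean cross-entropy in which the i-th image should match the i-th text."""
    imgs = [normalize(v) for v in aligned_images]
    txts = [normalize(v) for v in text_embeddings]
    loss = 0.0
    for i, img in enumerate(imgs):
        logits = [sum(a * b for a, b in zip(img, t)) / temperature for t in txts]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(imgs)
```

The loss is small when each aligned image embedding is closest to its own caption and large when the pairs are mismatched, which is the behaviour the contrastive term relies on.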

Second, Text-Image Fusion concatenates the translated image features with the text features, so the diffusion model receives both as a single combined input. Because neither set of features is averaged or overwritten, both modalities are preserved without destructive blending.
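The fusion step can be sketched as a concatenation along the token axis. In this illustration the function name, the toy embedding width of 4 and the use of a single image token are assumptions. The 77-token text length is CLIP's standard context length.

```python
def fuse_conditioning(text_tokens, image_tokens):
    """Concatenate along the token axis; each token is a list of floats."""
    widths = {len(token) for token in text_tokens + image_tokens}
    if len(widths) != 1:
        raise ValueError("text and image tokens must share one embedding width")
    # Text tokens are kept untouched; image tokens are added after them.
    return text_tokens + image_tokens


text_tokens = [[0.1, 0.2, 0.3, 0.4] for _ in range(77)]  # toy width of 4 (assumed)
image_tokens = [[0.5, 0.6, 0.7, 0.8]]                    # one aligned image token (assumed)
fused = fuse_conditioning(text_tokens, image_tokens)
print(len(fused))
```

Since cross-attention accepts a variable number of conditioning tokens, a longer sequence needs no change to the frozen model, whereas averaging the two embeddings would alter the text features themselves.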

Third, optional Prompt-Noise Optimization runs 10-50 optimization steps at test time, adjusting both the combined input and the initial random noise to push the generated image closer to the reference. The base model is Stable Diffusion v2.1 at 768×768 resolution with DDIM sampling (50 steps).
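The optimization loop described above can be illustrated with a toy example. Everything here is a stand-in: `generate` is a hypothetical differentiable function replacing the frozen diffusion sampler, mean squared error replaces the visual similarity objective, and finite differences replace backpropagation through the sampler. Only the overall shape (jointly updating conditioning and initial noise for a fixed number of steps) follows the description.

```python
import math


def generate(conditioning, noise):
    """Hypothetical stand-in for the frozen diffusion sampler."""
    return [math.tanh(c + 0.5 * n) for c, n in zip(conditioning, noise)]


def similarity_loss(image, reference):
    """Mean squared error, used here in place of a perceptual similarity."""
    return sum((a - b) ** 2 for a, b in zip(image, reference)) / len(image)


def numerical_grad(fn, x, eps=1e-5):
    """Central finite differences; a real pipeline would backpropagate."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((fn(up) - fn(down)) / (2 * eps))
    return grad


def prompt_noise_optimization(conditioning, noise, reference, steps=30, lr=0.5):
    """Jointly update conditioning and initial noise to reduce the loss."""
    cond, z = list(conditioning), list(noise)
    history = []
    for _ in range(steps):
        history.append(similarity_loss(generate(cond, z), reference))
        # Both gradients are taken at the current (cond, z) before either moves.
        g_cond = numerical_grad(
            lambda c: similarity_loss(generate(c, z), reference), cond)
        g_z = numerical_grad(
            lambda n: similarity_loss(generate(cond, n), reference), z)
        cond = [c - lr * g for c, g in zip(cond, g_cond)]
        z = [n - lr * g for n, g in zip(z, g_z)]
    history.append(similarity_loss(generate(cond, z), reference))
    return cond, z, history


reference = [0.5, -0.3, 0.1]
cond, z, history = prompt_noise_optimization(
    [0.0, 0.0, 0.0], [0.2, -0.1, 0.4], reference)
print(history[0], history[-1])
```

The model weights never appear among the optimized variables, which mirrors the point that only the inputs to the frozen model are adjusted at test time.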

Results

Method               CLIP Score ↑    LPIPS (perceptual distance) ↓
SD v2 (text-only)    0.29            0.78
Naive fusion         0.28            0.77
VCF (Ours)           0.27            0.76

CLIP Score measures how well the generated image matches the text prompt (higher is better); LPIPS measures perceptual distance to the reference image (lower means more similar). VCF achieves the lowest LPIPS (0.76, versus 0.78 for text-only SD v2), meaning the closest perceptual match to the reference images, at the cost of a small drop in CLIP Score (0.27 versus 0.29). PNO consistently improves both structural alignment and color fidelity. The method transfers high-level qualities (artistic style, background elements) and low-level details (color palette, shading) from reference images.
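As a rough guide to reading the CLIP Score column, the sketch below computes a cosine similarity between an image embedding and a text embedding, which is a common way such scores are defined. Whether the reported values apply any extra scaling or clipping is not stated, so treat this as an assumption. LPIPS is not sketched because it depends on a pretrained network.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


# Toy embeddings (hypothetical values, not real CLIP outputs).
image_embedding = [3.0, 4.0]
text_embedding = [4.0, 3.0]
score = cosine_similarity(image_embedding, text_embedding)
```

With a score defined this way, values sit in the range -1 to 1, so a difference of 0.02 between methods is small but not negligible relative to a baseline of 0.29.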