For the first part of this project, I explored diffusion models to generate, restore, and transform images by iteratively denoising them from pure noise to realistic visuals. Using the DeepFloyd IF model, I implemented tasks like inpainting, hybrid imagery, and visual anagrams.
I sampled using the prompt 'a man wearing a hat' and experimented with the number of inference steps. I found that fewer steps were much quicker but produced noticeably less realistic results, while more steps took longer but looked much more realistic.
In this task, I implemented the forward diffusion process to add controlled Gaussian noise to a clean image at specific timesteps, simulating how images degrade over time. Using this noisy data, I explored two denoising approaches: classical Gaussian blur filtering and a pretrained UNet model. The Gaussian blur provided a basic but limited ability to reduce noise, while the UNet, leveraging its pretrained knowledge, reconstructed the original image more effectively by estimating and removing the added noise. I visualized the results for each timestep to compare the original, noisy, and reconstructed images, highlighting the power of diffusion models in reversing noise for high-quality restoration.
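As a minimal sketch, the forward process can be written as a single closed-form step; here `alphas_cumprod` is assumed to hold the cumulative products of the noise schedule, and the helper name and signature are my own:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image to timestep t: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)                             # fresh Gaussian noise
    x_t = abar_t.sqrt() * im + (1 - abar_t).sqrt() * eps   # closed-form forward process
    return x_t, eps
```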
This code implements an iterative denoising process to reconstruct a clean image from a noisy one using a diffusion model. Starting with a noisy image at a specified timestep, the iterative_denoise function progressively reduces noise step by step, leveraging the pretrained UNet model to estimate the clean image (x0_estimate) and compute the predicted image for the next (less noisy) timestep. The process follows equations from Denoising Diffusion Probabilistic Models (DDPM), iteratively refining the image until it approaches the original clean version. To enhance realism, variance is added at each step using the add_variance function. The function also visualizes the denoising progress at regular intervals, showcasing the gradual reduction of noise.
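A rough sketch of a single step of this update, assuming `alphas_cumprod` is the cumulative noise schedule; the exact signature of `add_variance` is a guess based on the description above:

```python
def denoise_step(x_t, t, t_prime, eps_pred, alphas_cumprod, add_variance):
    """One iterative-denoising step from timestep t to a less noisy t' < t."""
    abar_t = alphas_cumprod[t]
    abar_tp = alphas_cumprod[t_prime]
    alpha = abar_t / abar_tp                 # effective alpha between the two timesteps
    beta = 1 - alpha

    # Clean-image estimate implied by the predicted noise
    x0_estimate = (x_t - (1 - abar_t).sqrt() * eps_pred) / abar_t.sqrt()

    # DDPM-style interpolation between the clean estimate and the current noisy image
    x_tp = (abar_tp.sqrt() * beta / (1 - abar_t)) * x0_estimate \
         + (alpha.sqrt() * (1 - abar_tp) / (1 - abar_t)) * x_t

    # Add the variance term back in for realism (signature assumed)
    return add_variance(x_tp, t), x0_estimate
```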
I started with random noise and used a diffusion model guided by the prompt "a high quality photo" to generate five unique images. By iteratively denoising each sample, I gradually transformed the noise into coherent, realistic visuals aligned with the given prompt.
I apply noise to the input image at different levels, based on specified start indices, to create progressively noisier versions. For each noisy image, I use Classifier-Free Guidance (CFG) to iteratively denoise it, aligning the output with the prompt "a high quality photo." This process generates unique edits of the original image, depending on the amount of initial noise I introduce. I then compare these edits with the original image to observe how varying noise levels influence the final results.
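The core of the CFG step looks roughly like this; `noise_model` stands in for however the UNet is called with a prompt embedding, and the guidance scale `gamma` is a value I chose for illustration:

```python
def cfg_noise_estimate(noise_model, x_t, t, cond_emb, uncond_emb, gamma=7.0):
    """Classifier-free guidance: blend conditional and unconditional noise estimates."""
    eps_uncond = noise_model(x_t, t, uncond_emb)   # null / empty prompt
    eps_cond = noise_model(x_t, t, cond_emb)       # e.g. "a high quality photo"
    # gamma > 1 pushes the estimate (and thus the sample) toward the prompt
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```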
In this step, I use the same CFG-based denoising procedure on an image taken from the web and on two hand-drawn images.
I add noise to the original image at different levels and use CFG to iteratively denoise it, guided by the text prompt "a rocket ship." The process generates multiple variations of the image, where higher noise levels lead to more significant changes aligned with the rocket ship prompt. By displaying the original image alongside these variations, I demonstrate how the diffusion model transforms the image differently based on the initial noise level while adhering to the provided prompt.
In this part, I create a visual anagram that appears as one image in one orientation and as a different image when viewed upside down. At each timestep of the denoising process, I calculate noise estimates for both prompts, one for the upright image and one for the flipped image, and combine them with a weighted average, giving slightly more emphasis to the "old man" prompt for balance. I repeat the process for the other prompt pairs.
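In sketch form, the per-step noise estimate for the anagram looks like this; the helper names and the weight `w` are my own, and the key detail is flipping before and after the second UNet call:

```python
import torch

def anagram_noise_estimate(noise_model, x_t, t, emb_a, emb_b, w=0.55):
    """Blend noise estimates for the upright and flipped views of the same image."""
    eps_a = noise_model(x_t, t, emb_a)                    # e.g. the "old man" prompt, upright
    x_flipped = torch.flip(x_t, dims=[-2])                # flip along the height axis
    eps_b = torch.flip(noise_model(x_flipped, t, emb_b), dims=[-2])  # flip the estimate back
    return w * eps_a + (1 - w) * eps_b                    # slight bias toward the first prompt
```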
I implemented a function to create hybrid images by blending two distinct concepts: "a lithograph of a skull" and "a lithograph of waterfalls." At each timestep of the denoising process, I generated noise estimates for both prompts and applied a low-pass filter to the first estimate (skull) to emphasize its broader features, while applying a high-pass filter to the second estimate (waterfalls) to preserve its finer details. By combining these filtered frequencies, I constructed an image that visually shifts between the two concepts depending on viewing distance or focus. I tested different random seeds to fine-tune the result.
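A sketch of the per-step combination, using a Gaussian blur as the low-pass filter; the kernel size and sigma here are illustrative guesses rather than the values I tuned:

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(noise_model, x_t, t, emb_low, emb_high, kernel_size=33, sigma=2.0):
    """Combine low frequencies of one prompt's noise estimate with high frequencies of the other's."""
    eps_low = noise_model(x_t, t, emb_low)     # e.g. "a lithograph of a skull"
    eps_high = noise_model(x_t, t, emb_high)   # e.g. "a lithograph of waterfalls"
    low_pass = TF.gaussian_blur(eps_low, kernel_size, sigma)
    high_pass = eps_high - TF.gaussian_blur(eps_high, kernel_size, sigma)
    return low_pass + high_pass
```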
In this section of the project, we aim to train a U-Net model to denoise noisy MNIST digits. To prepare the training dataset, we first introduce noise to the MNIST images. Below are examples showcasing different noise levels:
Using a noise level of sigma = 0.5, I trained the UNet so it learns to map noisy digits back to clean images of the original number.
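A minimal training loop for this single-step denoiser might look like the following; the optimizer choice and learning rate are my assumptions rather than the exact settings used for the results below:

```python
import torch
import torch.nn.functional as F

def train_denoiser(unet, dataloader, sigma=0.5, epochs=5, device="cuda"):
    """Train a UNet to map noisy MNIST digits directly back to clean digits."""
    opt = torch.optim.Adam(unet.parameters(), lr=1e-4)
    losses = []
    for _ in range(epochs):
        for x, _ in dataloader:                       # class labels are unused here
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)       # noise the clean digits
            loss = F.mse_loss(unet(z), x)             # L2 loss against the clean image
            opt.zero_grad()
            loss.backward()
            opt.step()
            losses.append(loss.item())
    return losses
```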
This is an example result after 1 epoch.
This is an example result after 5 epochs.
While training the model, I recorded the loss after every iteration; this is the resulting graph:
Finally, I ran the trained model on inputs at various noise levels and recorded the results:
In this part, I modified my U-Net so it predicts the noise in an image instead of directly denoising it. This change is based on what I learned in Part A: iterative denoising works much better than trying to denoise everything in one step. I also added timestep conditioning to the U-Net, so the model knows exactly which step of the diffusion process it’s working on. This helps it adapt its predictions to the specific amount of noise present at each stage.
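To make the conditioning concrete, here is a sketch of the kind of timestep-embedding block I mean; the layer sizes and the exact injection point are illustrative rather than my exact architecture:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP that embeds the (normalized) timestep for injection into the U-Net."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, t):
        return self.net(t)

# Conceptually, inside the U-Net forward pass the embedding modulates
# intermediate feature maps, e.g. something along the lines of:
#   t_emb = self.t_block(t / T)                 # T = 300 total timesteps
#   feats = feats + t_emb[..., None, None]      # broadcast over spatial dims
```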
To train the model, I took random batches of images, added noise to each image using a random timestep (from 0 to 299), and passed the noisy images through the U-Net. A timestep of 0 means no noise at all, while 299 means the image is pure noise. The model then predicted the noise that was added, and I calculated the loss by comparing the predicted noise to the actual noise that was applied. This way, the U-Net learns to understand how to work with all levels of noise in the diffusion process.
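One training step, roughly, under the assumption that the U-Net takes a normalized timestep as its second argument and that `alphas_cumprod` is the precomputed cumulative noise schedule:

```python
import torch
import torch.nn.functional as F

def train_step(unet, x0, alphas_cumprod, T=300, device="cuda"):
    """Noise a batch at random timesteps and train the U-Net to predict that noise."""
    x0 = x0.to(device)
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=device)            # a random timestep per image
    abar = alphas_cumprod[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps         # forward process
    eps_pred = unet(x_t, t.float() / T)                      # normalized timestep conditioning
    return F.mse_loss(eps_pred, eps)                         # compare predicted vs. actual noise
```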
For sampling, I followed the iterative denoising process I used in Part A. I started with an image made entirely of noise and ran it through the U-Net multiple times, step by step, reducing the noise in stages. Since the model was trained on all noise levels, it could handle the denoising process effectively and recover a clean image by the end of the iterations. A rough sketch of this sampling loop is shown below, followed by some of the results from my time-conditioned U-Net, which show how it transforms pure noise into recognizable images:
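This sketch uses the standard DDPM ancestral update; my actual implementation may differ in small details such as how the variance term is added, and the schedule tensors (`alphas`, `betas`, `alphas_cumprod`) are assumed to be precomputed:

```python
import torch

@torch.no_grad()
def sample(unet, alphas, betas, alphas_cumprod, T=300, shape=(16, 1, 28, 28), device="cuda"):
    """Start from pure noise and iteratively denoise down to t = 0."""
    x = torch.randn(shape, device=device)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device)
        eps = unet(x, t_batch.float() / T)                    # predicted noise at this step
        coef = betas[t] / (1 - alphas_cumprod[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * z                        # add variance except at the last step
    return x
```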
Here is my MSE Loss Graph:
Now, I will enhance the UNet by incorporating class conditioning. This modification allows the model to utilize class-specific information during training, enabling it to denoise images more effectively. By training the UNet on specific digit images along with their corresponding labels, the model learns to better reconstruct and generate images of numbers. This also gives me the ability to generate images of specific digits by conditioning the UNet on the desired class. This approach is similar to providing a text prompt to a generative model and obtaining an image aligned with the prompt. In this case, instead of text prompts, I use class labels (e.g., digits 0–9) to guide the image generation process. Below, I present the results demonstrating the improvement achieved by adding class conditioning to the UNet.
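Before showing those results, here is a rough sketch of what one class-conditioned training step might look like. The one-hot class vector is my assumed input format, and dropping the conditioning a small fraction of the time (`p_uncond`) is a common recipe for enabling classifier-free guidance at sampling time; the exact probability here is an assumption:

```python
import torch
import torch.nn.functional as F

def class_conditioned_step(unet, x0, labels, alphas_cumprod, T=300, p_uncond=0.1, device="cuda"):
    """Train the U-Net to predict noise given both the timestep and a (sometimes dropped) class vector."""
    x0, labels = x0.to(device), labels.to(device)
    b = x0.shape[0]
    c = F.one_hot(labels, num_classes=10).float()             # class conditioning as a one-hot vector
    drop = (torch.rand(b, device=device) < p_uncond).float().unsqueeze(1)
    c = c * (1 - drop)                                        # occasionally zero out the class info
    t = torch.randint(0, T, (b,), device=device)
    abar = alphas_cumprod[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    eps_pred = unet(x_t, t.float() / T, c)                    # model sees timestep and class
    return F.mse_loss(eps_pred, eps)
```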
Here is my MSE Loss Graph: