Compressing images with Stable Diffusion

You get the gist

Images are just too big. A 3 MB bitmap compresses down to a 500 KB JPEG, which, don’t get me wrong, 16% of the original size is great, but why 500 KB? That’s still pretty large.

This is 2022, we shouldn’t have to put up with large images. Our websites might load 60 MB of stuff for a pageview, but that stuff shouldn’t be images, it should be JavaScript, as Brendan Eich intended.

We shouldn’t have to put up with fat images, but, until now, we had no choice.

Now we do.

The solution

a computer compressing data, by Caspar David Friedrich, matte painting trending on artstation HQ

A week or so ago, Stable Diffusion was released, and the world went crazy, and for good reason. Stable Diffusion, if you haven’t heard, is a new AI that generates realistic images from a text prompt. You basically give it a description of the image you want, and it generates it.

Now, this alone would be revolutionary, but we got double the revolution this time: This thing can also take an image and tell you the prompt you can use to generate it.

Are you thinking what I’m thinking?

That’s right, why compress an image to 500 KB when you can compress it to 50 bytes, where the bytes are the prompt that can be used to generate the exact same image again?

You wouldn’t, of course not.

Instead, you would ask the image-describing AI to describe the image, take the resulting (very small) prompt, and transmit it over the wire, where the recipient would then use it to generate the image again.

I call this technique STAV, or Stable Transcription and Artistic Validation. Yes, the acronym might not contain any of the words “image”, “compression”, “reconstruction”, or “diffusion”, but Philip Katzip isn’t going to be the only one giving his name to compression techniques.
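As a minimal sketch of that describe, transmit, regenerate round trip: the two model calls are passed in as plain functions, and both `describe` and `generate` are hypothetical placeholders of mine, not the real img2prompt or Stable Diffusion APIs. The only thing the sketch itself does is turn the prompt into bytes and back.

```python
from typing import Callable

# Hypothetical stand-ins, NOT real library calls:
#   describe: image bytes -> text prompt
#   generate: text prompt -> image bytes
Describer = Callable[[bytes], str]
Generator = Callable[[str], bytes]


def stav_compress(image: bytes, describe: Describer) -> bytes:
    """Compress an image down to the UTF-8 bytes of its prompt."""
    return describe(image).encode("utf-8")


def stav_decompress(payload: bytes, generate: Generator) -> bytes:
    """Regenerate an image from the prompt bytes that came over the wire."""
    return generate(payload.decode("utf-8"))
```

Passing the models in as arguments keeps the sketch independent of any particular image-describing or image-generating tool; sender and recipient only have to agree on which ones they use.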

Expected gains

As is widely known, a picture is worth 1000 words. At an average English word length of 4.7 characters, we can expect each image to take up to 4.7 KB, regardless of its original size. The corollary here is that we can use this method to also upscale images without any loss in quality, which I have accepted as a very fortunate side-effect of my technique.

Sure, this may have some loss of quality, but it would generally depend on the number of iterations you ran when generating the image.

Based on the numbers above, here are some rough estimates on the gains we can expect to see:

Algorithm	Max size	Size compared to STAV
JPEG	∞	∞
AVIF	∞	∞
STAV	4.7 KB	1

As we can see, because STAV has a fixed size, it is easily potentially infinitely smaller than both AVIF and JPEG, which is good.
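The arithmetic behind these estimates can be sketched in a few lines. This assumes one byte per character, 1 KB = 1000 bytes, and no separators between words; the function names are mine and purely illustrative.

```python
import math

WORDS_PER_PICTURE = 1000   # "a picture is worth 1000 words"
AVG_WORD_LENGTH = 4.7      # average English word length, in characters


def stav_max_size_bytes(words: int = WORDS_PER_PICTURE,
                        avg_word_length: float = AVG_WORD_LENGTH) -> float:
    """Upper bound on a STAV payload, assuming one byte per character
    and (an assumption of this sketch) no spaces between words."""
    return words * avg_word_length


def size_relative_to_stav(max_size_bytes: float) -> float:
    """How many times larger a format's worst case is than STAV's."""
    return max_size_bytes / stav_max_size_bytes()


if __name__ == "__main__":
    # STAV's worst case, in KB
    print(stav_max_size_bytes() / 1000, "KB")
    # JPEG and AVIF have no upper bound on file size
    print(size_relative_to_stav(math.inf))
```

Since an unbounded size divided by any fixed size is still unbounded, the ratio for JPEG and AVIF comes out as infinity no matter what constant STAV settles on.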

Real-world benchmarks

Of course, no new compression method is complete without real-world benchmark data to back up its claims. This is why I’ve compiled an extensive analysis of sample images from Unsplash, and am presenting it here.

In the images that follow, the leftmost is the uncompressed (raw) image, the middle image is compressed with JPEG, and the rightmost image is compressed with STAV. I haven’t bothered to include the raw and JPEG sizes, as they’re thousands of times larger than the sizes of the STAV images.

For your edification, I have also included the entire STAV-compressed data below each image, in the form of the prompt that was recognized by img2prompt. Let’s analyze them one by one.

Objects in shot

a person sitting in a chair holding a book and a pen, a stock photo by Chinwe Chukwuogo-Roy, trending on unsplash, art & language, stock photo, stockphoto, depth of field

As we can see, the compressor deals with objects in the shot excellently. There is no visible degradation at all, and the final image is sharp and vibrant.

One interesting note here: img2prompt has correctly intuited that the image is from Unsplash, and has mentioned that in its generated prompt. This will doubtless improve compression even further.

People

a woman with tattoos and a hat on, a tattoo by James Baynes, featured on unsplash, neo-romanticism, anamorphic lens flare, tattoo, backlight

Another excellent performance here. The lighting is impeccable, the hairs are sharp and well-defined, and the hat looks great on the lady.

Interior shots

a living room with a green chair next to a window, a stock photo by Aaron Bohrod, trending on unsplash, light and space, studio light, studio lighting, volumetric lighting

Performance here isn’t as stellar as in the other shots, as the colors are imperceptibly more muted than the original, but overall there is almost no difference. The original and the STAV-compressed images are nearly indistinguishable. JPEG is disappointing, as there are visible artifacts.

Food

three bowls of food on a white table, a stock photo by Kelly Sueda, trending on pinterest, mingei, shallow depth of field, pixel perfect, intricate patterns

Somehow, the food in the STAV-compressed image looks even more delicious than the original. Otherwise, there is no perceptible quality difference.

Food and people

a couple of women standing in a kitchen preparing food, a stock photo by Meredith Garniss, pinterest contest winner, private press, stock photo, stockphoto, film grain

This particular image posed a challenge for the compressor, with its sharp detail and subtle blur, but the compressor pulled through. Details are preserved and vibrant, and even the blur is visible. Why we’d want to keep the blur, I don’t know, but a compressor must be faithful above all.

Nature

a pink flower with green leaves on a white background, a macro photograph by Ikuo Hirayama, featured on unsplash, minimalism, shallow depth of field, depth of field, soft light

There isn’t much to say here. STAV blows JPEG out of the water, the flower looks almost alive, even though the original image contains no flower. If anything, this enhancement showcases a strength of this technique.

a person standing on top of a mountain, a tilt shift photo by Paul Bodmer, trending on unsplash, naturalism, sense of awe, shallow depth of field, photo taken with ektachrome

Exterior architecture

an aerial view of a pond in the middle of a forest, a tilt shift photo by Stanley Twardowicz, trending on unsplash, ecological art, high dynamic range, photo taken with nikon d750, isometric

We can see here that the compressor has preserved every tiny detail of the original image, except the house, which was, admittedly, kind of ugly. It’s heartening to see this method go from strength to strength as it even enhances images.

Conclusions

As you can see, there is basically no loss in quality, even though the images’ sizes are around a ten-thousandth of the originals’. This is an absolutely astonishing result, and will definitely herald a new era of compression. There are even some cases where quality is better than the original, and it is astonishing for a compressor to achieve 100%+ quality.

There are some minor kinks that need to be worked out, such as the fact that each image takes around a day to generate on mobile, but this is more than acceptable in certain domains. Website visitors, for example, are well-accustomed to such loading times, and would barely notice any difference.

Epilogue

In conclusion, I really believe that this method can help lower file sizes and make a significant difference in various niches, e.g. the web, or games that come on multiple floppy disks. I urge you to give it a try and see what kind of results you get.

If you have any feedback, please tweet or toot at me, or email me directly. I would especially like to hear of any pathological edge cases where the final image is somehow significantly different from the original, so I can investigate.

Thank you!

Stavros' Stuff

On programming and other things.

Conceived on Aug 31, 2022
