SuperGeekery: A blog probably of interest only to nerds by John F Morton.


06Nov2022

Stable Diffusion: Exploring trained models and unique tokens for more accurate text-to-image results

Stable diffusion trained models

Before I get started, I’ve got a preamble to this post. I’m only in the early discovery process of AI-powered text-to-image technology. There’s a good chance you know more than I do about this topic! I’m writing this to document what I’ve learned. If you’re reading this, you might be on the same journey. If so, I hope you find this helpful.

If you peeked ahead in this post, you’d see all kinds of images based on the prompt “a tiger sitting on a log,” and some results are not that impressive. I’ve used this basic prompt intentionally because it helps illustrate two thoughts I wanted to share.

The first takeaway is that the prompt you give these tools is incredibly important. For example, take a peek at a collection of sea otter images created with DALL‑E 2. If you hover over the images on that page, you’ll see the text that generated the images.

DALL-E 2 - Sea Otters prompt example

The text you provide will significantly impact the image you get out. That makes sense, and this “sea otter” page offers ample proof.

The second takeaway I want to address here is that the model is even more important. This is a concept I didn’t grasp until I started experimenting with the same basic prompt and different data models. Stable Diffusion, which I will get to in a moment, is how I’ve been experimenting with models.

In short, there’s a lot I don’t know, but I’m learning. That’s the end of the preamble. Let’s get started.

Before Stable Diffusion, DALL‑E

In early 2022, text-to-image creation tools seemed to be everywhere online. I couldn’t wait to try them myself. The first version I tried was DALL‑E, aka DALL‑E version 1. You can easily try DALL‑E 1 in your browser at https://www.craiyon.com/. If you type in “a tiger sitting on a log,” you will get something like what you see below.

Tiger dalle1 image3
Tiger dalle1 image2
Tiger dalle1 image1

Prompt: "a tiger sitting on a log" | Engine: DALL-E 1

Technology has raced beyond this early version of DALL‑E, but these images are still impressive. As if by magic, images appear that are rough representations of what you typed. Yes, they look a little wonky, but they didn’t exist until you typed in that phrase. You are a powerful wizard. 🧙‍♂️

In 2022, OpenAI released the successor to DALL‑E. DALL‑E 2 was a giant leap forward, with improved image quality and a better text interpreter. The same prompt, “a tiger sitting on a log,” produced greatly improved results.

DALLE2 2022 11 04 12 53 16 A tiger sitting on a log
DALLE2 2022 11 04 12 53 05 A tiger sitting on a log
DALLE2 2022 11 04 12 52 56 A tiger sitting on a log

Prompt: "a tiger sitting on a log" | Engine: DALL-E 2

Stable Diffusion

Last week I had time to experiment with Stable Diffusion. What’s the difference? I’ll quote from the Stable Diffusion article on Wikipedia.

Unlike models like DALL‑E, Stable Diffusion makes its source code available, along with pretrained weights. […] The user owns the rights to their generated output images, and is free to use them commercially.

In other words, you can run Stable Diffusion on your computer and tinker with it. Stable Diffusion claims no copyright on the images you create.

During DALL‑E 2’s private beta, the images were not available for commercial use, but when DALL‑E 2 opened up into public beta, its creator, OpenAI, updated the terms of use, giving users ownership of the images they create. Their press release said users get “full usage rights to commercialize the images they create with DALL·E, including the right to reprint, sell, and merchandise.”

DALL‑E 2 has an easy-to-use web interface and generates wonderful results, but I don’t see any way to manipulate the model. (Could I be wrong about that? Absolutely!)

Stable Diffusion, being open, has a growing community of hackers pushing it forward. I installed Stable Diffusion on my Mac to explore some of these experiments.

Installing Stable Diffusion

There are a variety of ways to install Stable Diffusion on your computer. As of November 2022, the moment I’m writing this, my advice is not to do this unless you are comfortable working on your computer’s command line. Installing this is experimental at best, and you could break things. Seriously. Proceed with caution. If you can wait, these tools will become easier to use over time.

If you want to experiment without installing anything on your machine, a search engine called Lexica will let you search for images others have created, and if you make a free account, you can use Stable Diffusion on their site. You won’t be able to use any custom models, but you won’t have to install anything locally.

Lexica tiger sitting on a log search

Lexica, a Stable Diffusion search engine

Ok, now that you’ve decided to install this stuff anyway, good luck! 😜

I’ve installed several versions of Stable Diffusion during my exploration, and InvokeAI is the one with the easiest installation process. There are instructions for many platforms. Once installed, Invoke can be used from the command line or via the web interface it provides. (Since writing this post, I’ve heard about Charl‑e, for the Mac. I’ve not tried it, though. Thanks for the tip, Rudy.)

Invoke ai web interface

The Invoke AI web interface

What does Stable Diffusion do with the same prompt we tried with DALL‑E? Here are three results.

Tiger Sitting on a log - stable-diffusion-1.5
000022 d80db867 1727558165
000021 f8dbdb9b 747648489

Prompt: "a tiger sitting on a log" | model: sd-v1-5.ckpt | Engine: Stable Diffusion

I think these are impressive results. Notice I’ve included an additional note about the “model” above. The model here, sd-v1-5.ckpt, is the checkpoint file containing the weights the AI learned from its text and image training data. This model lives at https://huggingface.co/runwayml/stable-diffusion-v1-5. You can see exactly what this model is supposed to do.

Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.

It’s trained for photo-realism. That’s probably why the basic prompt I’ve used is returning something that looks like a photo instead of an illustration.

If you want a deep dive into what models do, check out Stable Diffusion with Diffusers, where the authors explain how the model works and “finally dive a bit deeper into how diffusers allows one to customize the image generation pipeline.” I’ll let them explain the technical details because they’re good at it.
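For a concrete sense of what "using a model" means in code, here is a minimal sketch of loading that same checkpoint through the diffusers library. This is my own example, not code from InvokeAI or the post; it assumes the diffusers, transformers, and torch packages are installed and that you have a GPU (or Apple Silicon) with enough memory to hold the weights.

```python
# Minimal sketch: text-to-image with Hugging Face's diffusers library.
# Assumes `diffusers`, `transformers`, and `torch` are installed.

def generate(prompt: str = "a tiger sitting on a log",
             model_id: str = "runwayml/stable-diffusion-v1-5"):
    """Load a pretrained Stable Diffusion checkpoint and render one image."""
    import torch
    from diffusers import StableDiffusionPipeline

    # from_pretrained downloads the model weights (several GB) on first use.
    pipe = StableDiffusionPipeline.from_pretrained(model_id)

    # Use "cuda" on an NVIDIA GPU, or "mps" on Apple Silicon Macs.
    device = "cuda" if torch.cuda.is_available() else "mps"
    pipe = pipe.to(device)

    # The pipeline returns a batch of PIL images; take the first one.
    return pipe(prompt).images[0]

# Usage (commented out because it is slow and needs a GPU):
# generate().save("tiger.png")
```

Swapping in a different fine-tuned checkpoint is just a matter of changing the `model_id` argument, which is essentially what the experiments below do through InvokeAI's interface instead.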

Adding style detail to the prompt

As you saw earlier with the sea otter example from DALL‑E, you can tailor your prompts with more descriptive words to get better results. If we don’t want photo-realism as the style of our images, we need to be more detailed with our request.

Let’s say that instead of photo-realistic tigers, you wanted your tigers to look as if Hollie Mengert had drawn them. Below is an example of her character design.

2022 11 04 12 33 50

Piper Coyote by Hollie Mengert

Ownership and ethics

Before we go any further, we must acknowledge that this artwork is Hollie’s profession. Creating her artwork is how she makes a living. For context, please read Invasive Diffusion: How one unwilling illustrator found herself turned into an AI model. I’m writing another piece regarding the topic, which I plan to post to my blog. I will update this post with a link when it’s ready.

Adding style detail to the prompt (cont.)

Using the same photo-realistic model we’ve been using so far, changing our prompt to “hollie mengert artstyle tiger sitting on a log” creates a very different outcome.

000036 f01a8272 1435949227
000038 085484bb 181047119
000037 882f52a0 1244771978

Prompt: "hollie mengert artstyle tiger sitting on a log" | model: sd-v1-5.ckpt | Engine: Stable Diffusion

Although I like that first tiger’s hand-sketched style, these images do not look like Hollie’s work. The sd-v1-5.ckpt model doesn’t know what Hollie’s style is beyond the fact that it’s not a photographic style. That leads to an obvious question. Can you teach an old model new tricks? Yes, you can.

DreamBooth and training models

There’s a project called DreamBooth that offers “a new approach for ‘personalization’ of text-to-image diffusion models” to “fine-tune a pre-trained text-to-image model.”

The Illustration-Diffusion model took some of Hollie’s work and trained the base model to mimic her style. The documentation for this model says, “the correct token is holliemengert artstyle.”

You might wonder if there is a typo in the token. Is it missing a space? It’s not. The missing space is intentional. The token is holliemengert, not hollie mengert. This model contains data derived from images tagged with the made-up word holliemengert. You’ll see many of these made-up words if you use custom models. Using a unique word keeps the new data isolated so it doesn’t get polluted with other data. Imagine the images had been tagged with the word “drawing,” which already carries lots of data. You’d have no way to tell Stable Diffusion that you meant the “hollie mengert” style without it being polluted with many other drawing styles.
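The pollution idea can be sketched with a toy example. This is entirely my own illustration, not code from any Stable Diffusion project: the dictionary below stands in for the associations a model has learned, and the names in it are hypothetical.

```python
# Toy illustration: why DreamBooth-style fine-tunes use made-up tokens.
# The dictionary stands in for a model's learned associations; the
# entries are hypothetical, purely to show the isolation argument.

learned_associations = {
    # A common word already carries many competing styles...
    "drawing": ["pencil sketch", "technical diagram", "children's book art"],
    # ...while a made-up token maps only to the newly trained style.
    "holliemengert": ["Hollie Mengert fine-tune images"],
}

def styles_for(token: str) -> list[str]:
    """Return every style a token would pull into the generation."""
    return learned_associations.get(token, [])

print(styles_for("holliemengert"))  # exactly one style, no pollution
print(styles_for("drawing"))        # several competing styles at once
```

A real model blends associations statistically rather than looking them up, but the intuition holds: a rare token has nothing else competing for its meaning.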

In the images above, where we asked for “hollie mengert artstyle,” I’m a little surprised we didn’t end up with some holly (not Hollie) in the images.

Holly Christmas card from NLI

Christmas card with an illustration of holly

Let’s run our new prompt, “holliemengert artstyle tiger sitting on a log,” through Stable Diffusion using the Illustration-Diffusion model, aka hollie-mengert.ckpt.

00012
000002 13066be5 1573212009
000024 0307c07f 4067289833

Prompt: "holliemengert artstyle tiger sitting on a log" | model: hollie-mengert.ckpt | Engine: Stable Diffusion

The results are remarkably different. These images are not the same quality as Hollie would have created, but they have a much more distinctive style. These results are heavily influenced by the new model we’re using.
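For reference, using a second checkpoint like this in InvokeAI meant registering it in the configs/models.yaml file so it could be selected alongside the default model. Below is a sketch of what such an entry might look like; the paths and descriptions are illustrative guesses, so check InvokeAI's own documentation for the current format.

```yaml
# Sketch of an InvokeAI configs/models.yaml with a custom checkpoint added.
# Paths and descriptions are illustrative, not copied from a real install.
stable-diffusion-1.5:
  config: configs/stable-diffusion/v1-inference.yaml
  weights: models/ldm/stable-diffusion-v1/sd-v1-5.ckpt
  description: The base Stable Diffusion 1.5 model
  width: 512
  height: 512
hollie-mengert:
  config: configs/stable-diffusion/v1-inference.yaml
  weights: models/ldm/stable-diffusion-v1/hollie-mengert.ckpt
  description: Illustration-Diffusion fine-tune of Hollie Mengert's style
  width: 512
  height: 512
```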

Using the model with the “wrong token”

How important is the token holliemengert artstyle in the prompt now that we’re using this newly trained model? What if you wanted the tiger to be influenced by classic Disney movies instead of Hollie’s work?

Here’s what happens if we try the token classic disney style as part of the prompt but still use the Illustration-Diffusion model trained on the Hollie Mengert art.

00018
000026 d850c0e4 3024695362
000025 f412cdb8 1735182626

Prompt: "classic disney style tiger sitting on a log" | model: hollie-mengert.ckpt | Engine: Stable Diffusion

The results are not exactly in the style of Hollie Mengert, but they’re not really in the Disney style either. The model picks up some Disney influence but doesn’t quite get there.

Classic Animation Diffusion

Let’s take a look at a different model. The classic-anim-diffusion model, a model trained on “screenshots from a popular animation studio” (cough, cough), uses the token classic disney style as a style prompt. We’ll switch to this Disney-trained model using the same prompt that generated the previous images.

00016
00013
000027 1731e82a 330713985

Prompt: "classic disney artstyle tiger sitting on a log" | model: classicAnim-v1.ckpt | Engine: Stable Diffusion

Once again, there is an obvious shift in the style. The first two images are not keepers, and the third is merely acceptable, but they look like Disney tigers, so the experiment worked.

Superhero Diffusion

To wrap up these experiments, we’ll try another model called Superhero-Diffusion, trained on Pepe Larraz’s work (you can see real examples of his work with this Google Image search), which uses the token comicmay artstyle. Again, we’re keeping the same basic prompt and changing only the token and the model.

00015
000032 a6413a1d 97441445
000030 a811c752 2154701584

Prompt: "comicmay artstyle tiger sitting on a log" | model: superhero-diffusion.ckpt | Engine: Stable Diffusion

Whoa. These look cool to me. I notice that the tiger we asked to be sitting is actually standing, but it is at least perched on a log. I’m sure Pepe Larraz could create a sitting tiger on his first attempt, but that’s a different discussion entirely.

What about Midjourney?

There is another text-to-image project that I’m aware of called Midjourney. I have seen impressive results posted on Twitter with the hashtag #midjourney, but I have not used Midjourney during my exploration. If you’re exploring, you may want to check it out also.

Wrapping up

I would be surprised if you’ve made it this far and haven’t felt the urge to try making some images of your own. Lexica is probably the easiest place to begin. Experiment with your prompt, and you’ll be amazed at your results even with the base model. Also, realize that version 1.5 of the model is current as I write this, but I expect that to evolve quickly. The pace of change with these models is incredibly rapid, and the results will only improve with time.

If you want to experiment with dead artists whose work has seeped deeply into the culture, and probably into the data models, try adding “Leonardo da Vinci art style” or “frida kahlo art style” and see what happens.

Lastly, I would encourage you to think about living artists whose work influences data models. Are you comfortable creating art based on a living artist? Does it matter if it’s only for personal exploration? Does it cross an ethical line for you if you use this artwork in a way that makes you money? Would you pay for access to a model?