SuperGeekery: A blog probably of interest only to nerds by John F Morton.


Stable Diffusion: Exploring trained models and unique tokens for more accurate text-to-image results

Stable diffusion trained models

Before I get started, I’ve got a preamble to this post. I’m only in the early discovery process of AI-powered text-to-image technology. There’s a good chance you know more than I do about this topic! I’m writing this to document what I’ve learned. If you’re reading this, you might be on the same journey. If so, I hope you find this helpful.

If you peeked ahead in this post, you’d see all kinds of images based on the prompt “a tiger sitting on a log,” and some results are not that impressive. I’ve used this basic prompt intentionally because it helps illustrate two thoughts I wanted to share.

The first takeaway is that the prompt you give these tools is incredibly important. For example, take a peek at a collection of sea otter images created with DALL-E 2. If you hover over the images on that page, you’ll see the text that generated the images.

DALL-E 2 - Sea Otters prompt example


The text you provide will significantly impact the image you get out. That makes sense, and this “sea otter” page offers ample proof.

The second takeaway I want to address here is that the model is even more important. This is a concept I didn’t grasp until I started experimenting with the same basic prompt and different data models. Stable Diffusion, which I will get to in a moment, is how I’ve been experimenting with models.

In short, there’s a lot I don’t know, but I’m learning. That’s the end of the preamble. Let’s get started.

Before Stable Diffusion, DALL-E #

In early 2022, text-to-image creation tools seemed to be everywhere online. I couldn’t wait to try them myself. The first version I tried was DALL-E, aka DALL-E version 1. You can easily try DALL-E 1 in your browser at https://www.craiyon.com/. If you type in “a tiger sitting on a log,” you will get something like what you see below.

Prompt: "a tiger sitting on a log" | Engine: DALL-E 1

Technology has raced beyond this early version of DALL-E, but these images are still impressive. Seemingly like magic, images appear that are rough representations of what you typed. Yes, they look a little wonky, but they didn’t exist until you typed in that phrase. You are a powerful wizard. 🧙‍♂️

In 2022, OpenAI released DALL-E 2, the successor to DALL-E. It was a giant leap forward, with improved image quality and a better text interpreter. The same prompt, “a tiger sitting on a log,” produced greatly improved results.

Prompt: "a tiger sitting on a log" | Engine: DALL-E 2

Stable Diffusion #

Last week I had time to experiment with Stable Diffusion. What’s the difference? I’ll quote from the Stable Diffusion article on Wikipedia.

Unlike models like DALL-E, Stable Diffusion makes its source code available, along with pretrained weights. […] The user owns the rights to their generated output images, and is free to use them commercially.

In other words, you can run Stable Diffusion on your own computer and tinker with it. Stable Diffusion claims no copyright on the images you create.

During the DALL-E 2 private beta, the images were not available for commercial use, but when DALL-E 2 opened up into public beta, its creator, OpenAI, updated the terms of use, giving users ownership of the images they create. The press release said users “get full usage rights to commercialize the images they create with DALL·E, including the right to reprint, sell, and merchandise.”

DALL-E 2 has an easy-to-use web interface and generates wonderful results, but I don’t see any way to manipulate the model. (Could I be wrong about that? Absolutely!)

Stable Diffusion, being open, has a growing community of hackers pushing it forward. I installed Stable Diffusion on my Mac to explore some of these experiments.

Installing Stable Diffusion #

There are a variety of ways to install Stable Diffusion on your computer. As of November 2022, the moment I’m writing this, my advice is not to do this unless you are comfortable working on your computer’s command line. Installing this is experimental at best, and you could break things. Seriously. Proceed with caution. If you can wait, these tools will become easier to use over time.

If you want to experiment without installing anything on your machine, a search engine called Lexica will let you search for images others have created, and if you make a free account, you can use Stable Diffusion on their site. You won’t be able to use any custom models, but you also won’t risk breaking your local setup.

Lexica tiger sitting on a log search

Lexica, a Stable Diffusion search engine

OK, now that you’ve decided to install this stuff anyway, good luck! 😜

I’ve installed several versions of Stable Diffusion during my exploration, and InvokeAI is the one with the easiest installation process. There are instructions for many platforms. Once installed, Invoke can be used from the command line or via the web interface it provides. (Since writing this post, I’ve heard about Charl-e for the Mac. I’ve not tried it, though. Thanks for the tip, Rudy.)

Invoke ai web interface

The Invoke AI web interface

What does Stable Diffusion do with the same prompt we tried with DALL-E? Here are three results.

Prompt: "a tiger sitting on a log" | model: sd-v1-5.ckpt | Engine: Stable Diffusion

I think these are impressive results. Notice that I’ve included an additional note about the “model” above. The model here, sd-v1-5.ckpt, is the file containing the AI’s text and image training data. This model is published on Hugging Face, where you can see exactly what it is supposed to do.

Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.

It’s trained for photo-realism. That’s probably why the basic prompt I’ve used is returning something that looks like a photo instead of an illustration.

If you want a deep dive into what models do, check out Stable Diffusion with Diffusers, where the authors explain how the model works and finally “dive a bit deeper into how diffusers allow one to customize the image generation pipeline.” I’ll let them explain the technical details because they’re good at it.
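To give you a feel for what one of these runs looks like in code, here is a minimal Python sketch using the diffusers library (not InvokeAI itself). The model id is my assumption based on the v1.5 checkpoint name; downloading the weights is a multi-gigabyte step, so treat this as a sketch rather than a turnkey script.

```python
def generate(prompt, model_id="runwayml/stable-diffusion-v1-5"):
    """Run one text-to-image pass and return a PIL image.

    The model_id is an assumption; swap in whatever checkpoint you use.
    Imports are inside the function so the sketch only needs
    torch/diffusers installed when you actually run a generation.
    """
    import torch
    from diffusers import StableDiffusionPipeline

    # Load the pretrained weights, then move the pipeline to the GPU
    # if one is available (generation on CPU is very slow).
    device = "cuda" if torch.cuda.is_available() else "cpu"
    pipe = StableDiffusionPipeline.from_pretrained(model_id)
    pipe = pipe.to(device)

    # A single call runs the full text-to-image diffusion loop.
    return pipe(prompt).images[0]

if __name__ == "__main__":
    image = generate("a tiger sitting on a log")
    image.save("tiger.png")
```

Swapping in a custom .ckpt-derived model is, conceptually, just changing that one model id, which is exactly the experiment the rest of this post runs through InvokeAI.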

Adding style detail to the prompt #

As you saw earlier with the sea otter example from DALL-E, you can tailor your prompts with more descriptive words to get better results. If we don’t want photo-realism as the style of our images, we need to be more detailed with our request.

Let’s say that instead of photo-realistic tigers, you wanted your tigers to look as if Hollie Mengert had drawn them. Below is an example of her character design.


Piper Coyote by Hollie Mengert

Ownership and ethics #

Before we go any further, we must acknowledge that this artwork is Hollie’s profession. Creating her artwork is how she makes a living. For context, please read Invasive Diffusion: How one unwilling illustrator found herself turned into an AI model. I’m writing another piece on the topic, which I plan to post to my blog. I will update this post with a link when it’s ready.

Adding style detail to the prompt (cont.) #

Using the same photo-realistic model we’ve been using so far, changing our prompt to “hollie mengert artstyle tiger sitting on a log” creates a very different outcome.

Prompt: "hollie mengert artstyle tiger sitting on a log" | model: sd-v1-5.ckpt | Engine: Stable Diffusion

Although I like that first tiger’s hand-sketched style, these images do not look like Hollie’s work. The sd-v1-5.ckpt model doesn’t know what Hollie’s style is beyond the fact that it’s not a photographic style. That leads to an obvious question. Can you teach an old model new tricks? Yes, you can.

DreamBooth and training models #

There’s a project called DreamBooth that offers “a new approach for ‘personalization’ of text-to-image diffusion models” by fine-tuning a pre-trained text-to-image model.

The Illustration-Diffusion model was trained on some of Hollie’s work to mimic her style. The documentation for this model says, “the correct token is holliemengert artstyle.”

You might wonder if there is a typo in the token. Is it missing a space? It’s not. The missing space is intentional. The token is holliemengert, not hollie mengert. This model contains data derived from images tagged with the made-up word holliemengert. You’ll see many of these made-up words if you use custom models. Using unique words allows the new data to be isolated and not get polluted with other data. Imagine if the images had been tagged with the word “drawing,” which already carries lots of data. You’d have no way to tell Stable Diffusion that you meant the “hollie mengert” style without it being polluted with many other drawing styles.
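The “pollution” idea can be illustrated with a toy sketch. This is not how Stable Diffusion actually stores its training data; the caption counts below are made-up numbers purely to show why a rare, invented token stays unambiguous while a common word like “drawing” does not.

```python
# Toy illustration only: pretend the training data is a simple
# mapping of caption word -> how many images carry that word.
# The counts here are invented for the sake of the example.
training_captions = {
    "drawing": 1_000_000,    # common word: a huge, mixed bag of styles
    "holliemengert": 30,     # made-up token: only the new fine-tune images
}

def style_ambiguity(token: str) -> int:
    """How many pre-existing images already compete for this token's meaning?"""
    return training_captions.get(token, 0)

print(style_ambiguity("drawing"))        # 1000000 — countless competing styles
print(style_ambiguity("holliemengert"))  # 30 — only Hollie's images
```

A DreamBooth fine-tune exploits exactly this: because holliemengert appears nowhere else in the training data, every image tagged with it pulls the token toward one single style.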

In the images above, where we asked for “hollie mengert artstyle,” I’m a little surprised we didn’t end up with some holly (not Hollie) in the images.

Holly Christmas card from NLI

Christmas card with an illustration of holly

Let’s run our new prompt, “holliemengert artstyle tiger sitting on a log,” through Stable Diffusion using the Illustration-Diffusion model, aka hollie-mengert.ckpt.

Prompt: "holliemengert artstyle tiger sitting on a log" | model: hollie-mengert.ckpt | Engine: Stable Diffusion

The results are remarkably different. These images are not the quality Hollie herself would produce, but they have a much more distinctive style, heavily influenced by the new model we’re using.

Using the model with the “wrong token” #

How important is the token holliemengert artstyle in the prompt now that we’re using this newly trained model? What if you wanted the tiger to be influenced by classic Disney movies instead of Hollie’s work?

Here’s what happens if we try the token classic disney style as part of the prompt but still use the Illustration-Diffusion model trained on Hollie Mengert’s art.

Prompt: "classic disney style tiger sitting on a log" | model: hollie-mengert.ckpt | Engine: Stable Diffusion

The results are not exactly in the style of Hollie Mengert, but they’re not really in the Disney style either. The model picks up some Disney influences, but the output doesn’t seem quite right.

Classic Animation Diffusion #

Let’s take a look at a different model. The classic-anim-diffusion model, “a model trained on screenshots from a popular animation studio” (cough, cough), uses the token classic disney style as a style prompt. We’ll switch to this Disney-trained model while keeping the same prompt that generated the previous images.

Prompt: "classic disney artstyle tiger sitting on a log" | model: classicAnim-v1.ckpt | Engine: Stable Diffusion

Once again, there is an obvious shift in style. The first two images are not keepers, and the third is merely acceptable, but they all look like Disney tigers, so the experiment worked.

Superhero Diffusion #

To wrap up these experiments, we’ll try another model, Superhero-Diffusion, trained on Pepe Larraz’s work (you can see real examples of his work with a Google Image search), which uses the token comicmay artstyle. Again, we’re keeping the same basic prompt and changing the token and the model.

Prompt: "comicmay artstyle tiger sitting on a log" | model: superhero-diffusion.ckpt | Engine: Stable Diffusion

Whoa. These look cool to me. I notice that the tiger we asked to be sitting is actually standing, but at least it’s perched on a log. I’m sure Pepe Larraz could create a sitting tiger on his first attempt, but that’s a different discussion entirely.

What about Midjourney? #

There is another text-to-image project I’m aware of called Midjourney. I have seen impressive results posted on Twitter with the hashtag #midjourney, but I have not used Midjourney during my exploration. If you’re exploring, you may want to check it out as well.

Wrapping up #

I would be surprised if you’ve made it this far and haven’t felt the urge to try making some images of your own. Lexica is probably the easiest place to begin. Experiment with your prompt, and you’ll be amazed at your results even with the base model. Also, realize that version 1.5 of the model is the current one as I write this, but I expect that to evolve quickly. The pace of change with these models is incredibly rapid, and the results will only improve with time.

If you want to experiment with dead artists whose work has seeped deeply into the culture, and probably into the data models, try adding “Leonardo da Vinci art style” or “Frida Kahlo art style” and see what happens.

Lastly, I would encourage you to think about living artists whose work influences data models. Are you comfortable creating art based on a living artist? Does it matter if it’s only for personal exploration? Does it cross an ethical line for you if you use this artwork in a way that makes you money? Would you pay for access to a model?