Hit a stall on The Early Fox. Possibly due to the new meds — which are promising, at least.
I was reading up (well, mostly listening to a podcast series) on the Apache, and looking at videos of Cloudcroft, NM. Sigh. Cloudcroft really doesn’t fit the vibe I was going for.
Took several days to figure out that this could be a good direction after all. And now the cast living inside my head has adjusted to their new status and come to life again.
But they are no longer searching for Doc Noss’s lost treasure. That pulled the narrative too far off course. Pity, because I’d even worked out a clue they could enlist Penny into helping with. (See, an old letter is written in bad schoolboy Greek, but Penny recognizes it as a paraphrase of Xenophon, because that’s the schoolboy lesson. And her Greek is equally bad and she’s making the same kinds of mistakes, so she understands what she’s looking at...)
Anyhow.
We’ve got this deeply ingrained instinct to learn shit. And once we learn a new thing, we get proprietary about it. We want to get better, and we want to boast. Yesterday at work I got into a conversation about 7400 series chip codes. It is hard not to remember once being good at a thing, and wanting to pick it up again and polish those old skills.
We get into Jeff Goldblum territory too easily, where we start pushing at a thing because we’ve gotten intrigued by the technical challenge. And we lose track of why (if there ever was a reason) we wanted to do it in the first place.
So, Stable Diffusion. AI image creation is moving with lightning speed. This is more of a tech bubble thing where the industry is visibly trying to excite people, and throwing a ton of money at it (which hides much of the true cost), but nobody has quite answered what it is they are actually trying to solve.
They’ve got a cool thing, and someone must be willing to pay money for it. Dot dot dot profit.
All of us down much lower on the tech pyramid are chasing around trying to learn about it, trying to figure out how it will affect us. And in some cases, playing with the thing. A project that started out as fun but is now increasingly just about the technical challenge.
I’m still on the aging AUTOMATIC1111 web UI front end. Mostly because I already know where everything is. And my hardware might not be able to take advantage of the modular structure of ComfyUI.
The WebUI SD implementation was originally built around the SD1 model, trained on the LAION data set at 512x512 pixels. SD1.5 proved the most popular and long-lasting of that line.
I’ve never been particularly lucky with AI upscaling. Probably because I’ve been generating with a variety of LoRAs, each with narrower and more specialized training data and focus; lacking those resources, the upscalers tend to turn everything into a variation of whatever it is they expect to see.
A basic and perennial problem with AI. Even the more recent data sets are mass scrapes of largely copyrighted archives. Poses, for instance, are dominated by advertising, fashion, and news, meaning they default to the standard upright-and-facing (with a 20-something, good-looking, white model, too). The AI borks when asked to do fighting poses because those are a much smaller part of its resources. Even if it starts with the right pose, it drifts off (or it fleshes out its equivalent of a gestural drawing with the wrong muscle groups and clothing details, all of them belonging on a model in a more familiar pose).
And you may ask, how can you generate at a higher resolution in the first place? Because the source images weren’t all taken at the same distance. One might be a full-length person, one might be a close-up of hands. It uses the latter to fill in detail during the later passes.
Theoretically. Since it is looking for any resemblance that fits the guidelines, you can (and sometimes do) find a clear kneecap instead of a knuckle. Because the dice have no memory; there is no underlying plan. At every sequential numbered step between the original gaussian blur and the final render, it is treating it as a new problem of “what is this blurred image and what in my training data might look like it?” Modified of course by prompt and other weighting such as ControlNet.
This relates to what is seen as the problem of hands, which isn’t really a problem in itself; it is a diagnostic. But I’ll get back to that.
The next model was SDXL, which used a 1024x1024 set of sources, with some curation towards representation et al (yet still massive copyright violations). So with that as a base you can generate at 1024 native, and up to at least 2048 with a low level of artifacts.
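(If you wanted to do the equivalent of a native-resolution generation outside the WebUI, a minimal sketch with the Hugging Face diffusers library looks something like this; the stock SDXL checkpoint here is just a stand-in for whatever model you actually run, and the prompt is arbitrary.)

```python
# Minimal txt2img sketch at SDXL's native 1024x1024, using diffusers rather
# than the A1111 WebUI. Checkpoint, prompt, and output name are stand-ins.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # stock SDXL base checkpoint
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

image = pipe(
    "wide shot of a high desert valley, layered sedimentary cliffs",
    width=1024,   # SDXL's native training resolution
    height=1024,
    num_inference_steps=30,
).images[0]
image.save("desert_1024.png")
```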
For me personally, I couldn’t get XL to run correctly. There’s a fork called Pony which added a ton of anime images (2.5 million scrapes of anime and furry art -- or couldn’t you guess?). That biases the model, so there are forks of Pony that push back towards more realistic images.
I’m using one of those as the base model now. Each model has its own peculiarities: in the variety of training data, in the weighting of parts of that data, and in the prompts it recognizes. One model might completely ignore “Mazda,” another immediately spit out four-door compacts.
(Or ancient sun gods).
This is the basic and endemic problem of AI; it converges on the norm. More than that, it produces a convincing simulacrum of that norm.
Which is not to say people aren’t able to explore personal visions. But that convergence means, among other things, that the dice-have-no-memory effect gets amplified. The AI does not understand it is supposed to be a steampunk dirigible. At every step of the render it will be attempting to relate what is in the image to what it finds familiar.
LoRAs attempt to swamp this effect by having their own pool of training images which are heavily weighted. But since that is a smaller number of images, they can’t handle the variety that might appear in the final image. So it starts to render a brass gear, but runs out of reference material matching what is currently in the render and swaps it out for a gold foil star.
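(In code terms, again a diffusers sketch rather than the WebUI, and with a made-up LoRA file, the swamping is just a weighted layer over the base model; the scale value is the dial that decides how hard it leans.)

```python
# Sketch of stacking a LoRA onto a base SD 1.5 model. The checkpoint and the
# LoRA file name are stand-ins; 0.8 is an arbitrary strength.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The LoRA's small, heavily weighted training pool is layered over the base
# model's weights for this generation.
pipe.load_lora_weights("./loras/steampunk_machinery.safetensors")  # hypothetical

image = pipe(
    "brass gears and rivets on a dirigible gondola",
    num_inference_steps=30,
    cross_attention_kwargs={"scale": 0.8},  # how hard the LoRA swamps the base
).images[0]
image.save("gears.png")
```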
But back to my current process.
Inpainting is the key. Inpainting is basically the img2img process with a mask.
When you are generating from scratch, the engine fills a block of the requested image size with gaussian noise. It then progressively looks for patterns in what is first noise, then a noisy image. In img2img mode the starting point is a different image. A selectable amount of noise is added; basically, the AI blurs the original, then tries to construct what it has been told (by prompt and other weighting) to expect to see.
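In code terms (a rough sketch with the Hugging Face diffusers library rather than the WebUI; the checkpoint name, file names, and numbers are all stand-ins), that img2img pass boils down to a start image, a prompt, and a strength dial that sets how much noise gets added back:

```python
# img2img sketch with diffusers. `strength` is the "blur" dial: the fraction
# of the denoising schedule the starting image is pushed back through.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # stand-in SD 1.5 checkpoint
    torch_dtype=torch.float16,
).to("cuda")

init = Image.open("rough_sketch.png").convert("RGB").resize((512, 512))

# Low strength keeps the composition; high strength lets the model reinvent it.
result = pipe(
    prompt="oil painting of a desert canyon at noon",
    image=init,
    strength=0.45,           # how much noise is added to the start image
    num_inference_steps=40,  # steps over the full schedule
    guidance_scale=7.0,
).images[0]
result.save("canyon_pass1.png")
```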
Inpainting mode further restricts this with a mask, meaning only certain parts are corrected. In a typical render-from-scratch workflow, the area of a badly rendered hand is selected, then that part of the image is re-rendered until a decent hand appears. (Not picking on hands here, regardless of how meme-able those have been. It just makes an easy example.)
For my process, I select the dirigible (made-up example; not sure I’ve ever attempted a proper steampunk image) and load up a specific LoRA and rewrite the prompt to focus attention on what I need to see. Then I switch to the guy with the sword, inpainting again with a pirate LoRA and prompt, and so on until all the image elements of this hybrid idea are present.
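For the curious, here is a hedged sketch of one of those per-region passes using diffusers instead of the WebUI canvas. The checkpoint, mask, and LoRA files are hypothetical; the shape of the move is the point: mask the element, load its LoRA, rewrite the prompt, re-render just that patch.

```python
# Masked inpainting sketch: one region at a time, each with its own LoRA and a
# prompt rewritten to focus on that element. All file names are hypothetical.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # stand-in inpainting checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("scene.png").convert("RGB")
mask = Image.open("dirigible_mask.png").convert("L")  # white = area to repaint

# Swap in the LoRA for this element before re-rendering the masked patch.
pipe.load_lora_weights("./loras/steampunk_airship.safetensors")

result = pipe(
    prompt="brass and canvas steampunk dirigible, riveted gondola",
    image=image,
    mask_image=mask,
    strength=0.6,
    num_inference_steps=40,
).images[0]

pipe.unload_lora_weights()  # then repeat with the pirate LoRA on the swordsman
result.save("scene_after_dirigible_pass.png")
```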
I want to get back to this, but first: the idea of a dinosaur in Times Square is easy to achieve with any of the various AI implementations, if only casually. It will not be a good dinosaur, or a good Times Square, and the ideas will get contaminated. The dinosaur will get Art Deco architecture and the buildings will sprout vines. At a casual glance, it is fun, but this is why AI is and will probably remain unsatisfying.
When you drill at all deeply, it is getting it wrong.
I just tried to do some desert landscapes and at first glance, sure, it does all the desert things. Sand, rocks, wonderful sky. Except. I’m no geologist, but look any longer than a second or two and the geology looks really, really wrong. And there’s a reason for that besides a lack of sufficiently specific references forcing it to repurpose more generalized resources.
That reason is that this is entirely built on casual resemblances. There’s nothing in the process resembling the rules that underlie the appearance of almost all things. It doesn’t put two hands on a person because it is working through the denoising process assembling a person of standard anatomy; it does this because most of the training examples present it with more than one and fewer than three hands.
It has no concept of hand. It finds hands in proximity to arms but, like a baby, there’s no object permanence. An arm that goes behind something else now no longer carries forward the assumption of a hand being involved.
That's why you will hear the AI bros shouting that hands are a solved issue. They aren't "solved"; the symptom was attacked brute-force style by giving it more reference images of hands until the statistical probability of looking like a normal hand rose sufficiently. The underlying problem remains.
If you look closely, and especially if you have any subject-matter competence, the details are always wrong. No matter what it is -- and the more out of the mainstream the subject, the more likely it will be wrong.
The big models were trained on a shitload of human beings and the average of that mass is a thing that looks to the casual eye like a human. We apes are trained to pay attention to apes so one of the ways AI images convince for a moment (before Uncanny Valley yawns wide) is a nice smile and a pair of eyes you can make contact with...and it is only with a longer glance that you see the scenery behind this particular Mona Lisa is a worse fantasy than whatever Leo was painting there.
That's the trick of AI. It has this glossy, convincing look that, up until AI came along, took a lot of labor and a lot of skill to achieve. Just like LLMs can convince us with four-dollar words and flawless grammar that the facts contained in that text are also correct. But there's no connection. The kinds of details that took a photograph or a really dedicated painter are achieved without effort, because these are just surface artifacts.
In a slightly different context, Hans Moravec talked about why we overestimated computer intelligence for so long: because we humans find math hard. Adding up large numbers is hard for us because we are general-purpose analog machines, and the reality of the elaborate calculus we are doing just to catch a ball in one hand is hidden from us. So the machine, by adding big numbers, looks smart. And we can't understand intuitively why recognizing a face should, then, be so hard for it.
So AI images have this same apparent competence. It takes an artist’s eye, or an anatomist’s, to see the pose is fucked up, the muscle groups are wrong. The more you know what to look for, the more you notice the absence of the things a Rodin was able to carve into clay and stone. How this flexed finger means this muscle is tensed. There are reasons for things. There are underlying structures.
It’s not just a pile of rocks in the desert. It is the underlying rock partially covered by weathered material.
But back to the art process. I am well beyond inpainting the bad hand. I am doing this inpainting cycle right down to the basic composition, because what I am after lies too far outside any trained concepts or available references.
Part of that fault lies in my base model of choice. The 2.5-million-image Pony set especially is character art, so very presentational: a single large posed figure. It doesn't want to do a long camera view of three people having a conversation.
I usually start with another image. It might be a generated image, but one made with a different model entirely. And even that will be so far off that the image goes outside to be painted on with a tablet and digital brush.
Other times I've started with a photograph that is close in some way. Or a rough sketch. Or once (and it went so well I mean to continue the experiments!) a posed artist's mannequin.
Multiple passes at different levels of blur and different prompt focuses are needed to get the thing to move in the direction I’m envisioning. For this, another useful dial to tweak is the step count. A low blur and a low step count means it doesn’t change much, but what it does change bears very little resemblance to the original.
A high step count means it moves conservatively from the blurred image, refining a little bit at a time, and thus tends to preserve details in a way that a low step count doesn’t.
High blur, on the other hand, frees the engine to make radical changes in shape and color; changes that are not the same as the conceptual changes aimed at with a low step count.
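(To see the two dials interact, a quick sweep over the same starting image, sketched with diffusers and arbitrary values, makes it concrete: strength is how far back into noise the image is pushed, steps is how finely the walk back out is divided.)

```python
# Sweep blur (strength) against step count on one start image and compare the
# four results. File names, prompt, and values are arbitrary.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init = Image.open("working_image.png").convert("RGB").resize((512, 512))
prompt = "three figures talking in a sunlit adobe courtyard"

for strength in (0.3, 0.6):      # low vs high blur
    for steps in (20, 60):       # coarse vs conservative refinement
        out = pipe(
            prompt=prompt,
            image=init,
            strength=strength,
            num_inference_steps=steps,
        ).images[0]
        out.save(f"pass_s{strength}_n{steps}.png")
```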
Often, though, the AI needs a more direct hint. So back through an external paint application.
This part is peculiarly fascinating to me because, a bit like the Moravec example, it requires me to think in a very different way. For one very basic lesson, the AI responds to value, not lines. That's one of the tough things for many young artists to learn because from the first moment we pick up a pen we tend to think in terms of lines. Of outlines, of borders. Seeing things in shading planes is a further step. But seeing just raw tones, divorced from other clues; that's an unfamiliar way of looking.
That pride in learning a new skill? I have a certain pride in knowing the shortcuts that communicate to the AI. It isn’t about realism; it is about certain tricks it recognizes. And on the flip side, about avoiding the things I have learned confuse it. This happens in prompting, too, have no doubt about that, but there is a particular joy in being able to use those skills of seeing and visualizing that I used back when I was designing for the stage. Or trying to learn how to draw comic books.
In any case, the last steps are performed on the whole image, using a more conservative LoRA and prompt, low blur, and a high step count. This emphasizes refining, cleaning up what is already there. The final pass is done with a resolution multiplier on; my graphics card can handle up to 2.5x the working resolution without tiling (and I’ve gone as high as 4x).
I know the upscaler is supposed to use the prompt and LoRA from the image in question but this method gets much, much closer to conserving the details that are peculiar to that LoRA.
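(As a sketch, diffusers again with made-up file names and a 2x factor standing in for the multiplier: enlarge first, then run one conservative low-strength, high-step img2img pass over the whole image with the same LoRA and prompt, so the LoRA-specific details survive the enlargement.)

```python
# Final-pass sketch: upscale the composite, then one low-strength, high-step
# img2img pass with the same LoRA and prompt. All names and values illustrative.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("./loras/final_style.safetensors")  # hypothetical LoRA

work = Image.open("composite.png").convert("RGB")
big = work.resize((work.width * 2, work.height * 2), Image.LANCZOS)

final = pipe(
    prompt="the same prompt used for the composite passes",
    image=big,
    strength=0.25,           # low blur: refine, don't reinvent
    num_inference_steps=80,  # high step count: small, conservative moves
).images[0]
final.save("final_2x.png")
```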
And at the end of it I look at it, say, “That looks cool” and then close the file. Because there’s really little purpose in it otherwise. The goal was learning something technical.