Sunday, April 9, 2023

Dancing about architecture

That is actually the state of the art of AI right now. It will get better, and quickly. Those hard-learned skills people are using today will pass out of use as the software gets over certain hurdles.

Some things are much harder, though, and there will always be this frustrating indirect aspect to it. That's why I said dancing about architecture; this is more like trying to communicate a building design through interpretive dance. Or chanting memorized spells that worked for someone else, worked last week, and that you almost, but not quite, understand why they work at all...

So first there's the learning curve. As with every field, it is a whole complicated world, and it is hard to get over that first hurdle, get oriented, and start understanding what the processes are and what the issues are.

I'm currently running Stable Diffusion -- a model trained on (a subset of) the LAION-5B dataset -- through the AUTOMATIC1111 front end, a web GUI that, like the Stable Diffusion package itself, runs on Python (including torch, which pipes down to CUDA, and xformers, which I currently have disabled), because honestly my python-fu is weak and I needed to invoke the arg "--lowvram" thanks to my not-quite-up-to-all-that-parallel-processing graphics card.
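
For the curious, here is roughly what that stack boils down to if you skip the webui entirely and drive it from Python -- a minimal sketch using Hugging Face's diffusers library rather than AUTOMATIC1111, with the attention-slicing call standing in for what "--lowvram" does for me. This is an assumption-laden illustration, not a transcript of my setup:

    # Text-to-image with Stable Diffusion v1.5 through the diffusers library
    # (not the AUTOMATIC1111 webui this post is about). Assumes torch and
    # diffusers are installed and a CUDA-capable card is present.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",   # the usual v1.5 starter checkpoint
        torch_dtype=torch.float16,          # half precision to fit a modest card
    ).to("cuda")
    pipe.enable_attention_slicing()         # trade a little speed for a lot of VRAM

    image = pipe("dramatic clouds over a cornfield, volumetric lighting").images[0]
    image.save("clouds.png")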

So, yeah, command-line stuff, installing off GitHub, all that good stuff. And a starter checkpoint model -- the usual v1.5, though I don't actually remember which flavor -- although I have four other checkpoints, a handful of LoRAs, and even a textual inversion or two in my folder now.
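
If you wonder what "loading a LoRA or a textual inversion" actually amounts to outside the webui's drop-it-in-a-folder ritual, diffusers exposes it as a couple of calls layered onto the same pipeline. The paths, file names, and the concept token here are invented for illustration; only the method names are real:

    # Layering extras onto a base checkpoint. Paths, file names, and the
    # "my_concept" token below are hypothetical examples, not real files.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    pipe.load_lora_weights("path/to/loras", weight_name="some_style_lora.safetensors")
    pipe.load_textual_inversion("path/to/embeddings/some_concept.pt", token="my_concept")

    image = pipe("a portrait of my_concept, dramatic lighting").images[0]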

But mostly I'm messing around with upscaling, trying out the difference between, say, ESRGAN_4x and the R-ESRGANs, and trying to remember to grab a newer VAE, and what is all this about ControlNet and OpenPose?
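
The diffusion world even has its own answer to the ESRGAN family (which are GANs, a different animal entirely): a diffusion-based 4x upscaler you can run through the same library. A sketch, with the file names as placeholders:

    # Diffusion-based 4x upscaling via diffusers -- not ESRGAN_4x or R-ESRGAN,
    # which are GAN upscalers, but it shows the shape of the operation.
    import torch
    from PIL import Image
    from diffusers import StableDiffusionUpscalePipeline

    upscaler = StableDiffusionUpscalePipeline.from_pretrained(
        "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
    ).to("cuda")

    low_res = Image.open("brick_wall.png").convert("RGB").resize((128, 128))
    big = upscaler(prompt="weathered brick wall, sharp detail", image=low_res).images[0]
    big.save("brick_wall_4x.png")   # four times the pixels, hallucinated detail included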

But what does this all mean?

So the current state of the art of a theoretical AI pipeline is not really "push the button, get art." More like "study the magic words huddled alone in a garret; sometimes they work, sometimes demons pop out." I had one of those weird demons... I got a sort of totem pole of Michelle Trachtenberg (yes -- celebrity names are one of the big Invokes in the Deep Magic of chanting Prompts towards the keyboard).

Here's PART of one found in the wild: ..glamorous pose, trending on ArtStation, dramatic lighting, ice, fire and smoke, orthodox symbolism Diesel punk, mist, ambient occlusion, volumetric lighting...

And it gets even more fun when you start seeing modifiers: [jennifer connelly:jennifer aniston:0.8], (((long hair))), (bokeh)...
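
Those parentheses, for the record, are a weighting syntax: in the AUTOMATIC1111 webui each level of nesting nudges up the attention paid to that phrase, conventionally by about 1.1x per parenthesis. Here's a toy sketch of the idea -- emphatically not the webui's actual parser, which also handles (word:1.4) weights, [de-emphasis], and more:

    import re

    def toy_emphasis(prompt, per_paren=1.1):
        """Toy illustration of parenthesis weighting, not the real parser."""
        weights = []
        for m in re.finditer(r"(\(+)([^()]+)(\)+)", prompt):
            depth = min(len(m.group(1)), len(m.group(3)))
            weights.append((m.group(2).strip(), round(per_paren ** depth, 2)))
        return weights

    print(toy_emphasis("portrait, (((long hair))), (bokeh)"))
    # [('long hair', 1.33), ('bokeh', 1.1)]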

Yeah, what did I say about demonology? The engine goes crazy on celebrities. Even more so artists. I was trying out text to image and I put a "william turner" in there and I got the most lovely clouds and dramatic lights...and the damned Fighting Temeraire hovering around the cornfields, refusing to be banished.

And that leads to why this is arcane. We don't know what the machine is thinking. We can guess. Put "steve jobs" in there and an apple logo will show up like a buddy, because (like Turner) the one is associated with the other in the majority of materials the model was trained on.

It doesn't understand the world. Of course. It is the basic problem of AI. Plus, the things that look difficult for us... well, as pioneer Hans Moravec put it, a computer can beat a man at chess, but a five-year-old kid can pick up a toppled king and put it back on the board.

Machine vision is making progress, but it is brute-force stuff. People lie awake at night wondering what we are actually teaching the self-driving cars to avoid. (This goes way back... a wartime story, possibly a tall tale, about dogs trained to run under tanks. On the day of the battle, they had bombs strapped to their backs. Unfortunately, in all the noise and confusion they vastly preferred to run under the tanks they'd been trained on...)

We can't talk to the machines. Not the Stable Diffusion trained checkpoints, not Google's search engine, not Amazon's algorithm. We can only make educated guesses about what the world looks like to them. You might be trying to render "elephant in a church" and you keep getting Roman soldiers. Why? Oh, turns out one person wrote a prompt to reconstruct a campaign against Carthage for a personal render, it ended up on a website popular with beginners, and there are only three elephants in the images shown to the v1 engine, so it always assumes you are going to want Romans, too.

Totally made up. Actually, the AI is worse than Doctor Brundle's teleporter. It isn't even certain how many heads one person has (it can never get fingers right) so with two people in there...weird things happen.

And it is the text-to-image that is the current hot button. The idea on the one hand that someone could skip all those years in art school. The idea on the other hand that you are stealing J.M.W. Turner's wonderful clouds. And worse, the internet has fads, and the AI world is practically a world of script-kiddies, borrowing arcane words of power from the old masters and using them with abandon. So one person discovers that "Sarah Andersen" is as powerful a cry as calling on the All-Seeing Eye of Agamotto, and pretty soon half the user base is doing it to crank out that image flavor of the week.

But upscaling -- the reason, honestly, I looked at this stuff in the first place -- leverages many of the same tools and assets. You upscale a brick wall and the AI is not just using the photograph you took, the Photoshop hand-painting you did, and the freely-given math and programming the whole edifice rests on; it is also drawing on a stash of images from, well, Getty among others.

And they are probably not recognizable, not in this use. And I have both sympathy and schadenfreude, because Getty charges hard for the use of the materials they hold. And this is, in the end, probably not a war that anyone will win. Like the fight against mix tapes, which became music sharing, which became large-scale piracy... and the music industry still is not back in control the way they want to be (and never will be).

But the moral aspect is still back there.

And there is the practical problem. I am on the fence about whether it "counts" as art if you are still having to, basically, work just as hard. Are we going to tell people they aren't real artists, real musicians, real writers, because they lack innate talent and had to slog through the trenches instead?

My small encounter with the stuff suggests that at least some of the (few) art skills I came by -- via that same long round of study and practice -- are still useful here. Half of it is understanding the world well enough to understand what the AI might be looking at: those skills of being able to both abstract the idea and be specific about the details of a cat or a building or a face. And the AI does share some of that language; notice how the terms photographers developed to talk about their approaches are recognized by the AI and used by the prompt-chanters.

Another is that repainting is the hidden skill. I don't think more than a handful of people are at the leading edge of popularity in AI art without being able to work Photoshop themselves.

As I mentioned, the AI can't be relied on to count heads or know how hems work, and hands are a nightmare all their own. Far too often, the cycle is actually to get close, then drop into a paint program and throw a dash of corrective paint on that blot or scar the AI has gotten all worked up about and insists on reworking into a miniature cathedral flanked by grotesques.

This is where both sides of that artistic training come into play, as you can lead the AI in the directions you'd rather it went, with a little understanding and artistic skill and a good hand on the brush.
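
That correct-and-regenerate loop has a name, inpainting, and mechanically it's the same machinery fed your touched-up image plus a mask that says "only rework this bit." A rough sketch via diffusers, with the file names invented for illustration:

    # Inpainting: regenerate only the masked region of an already-corrected image.
    # The checkpoint is the stock SD inpainting model; file names are placeholders.
    import torch
    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    image = Image.open("portrait_touched_up.png").convert("RGB").resize((512, 512))
    mask = Image.open("hand_mask.png").convert("L").resize((512, 512))  # white = redo this area

    fixed = pipe(prompt="a relaxed human hand, five fingers",
                 image=image, mask_image=mask).images[0]
    fixed.save("portrait_fixed.png")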

And that's all I'm going to say at the moment. Sometimes a Fox takes place in 2019 and AI hasn't blown up big, not yet, not for them. And it is also just a bit too bleeding-edge for a wider readership. Sure, I might slip a reference to "Team Catradora" in here and there, but there was a reason why Penny was still using Facebook...
