Thursday, September 18, 2025

Painting with a potato

There is an artist's method. Like the scientific method or the engineering method, it is a base process. This idea of largely application-independent processes underlies the Maker movement -- which has merged and morphed and is now more like the kind of "hacker" the Hackaday Blog was built for.


This is why there are so many multi-instrumentalists in home-studio music circles, and why visual artists change media. There's cross-over, too. Not to in any way diminish the immense investment in skills with specific mediums and tools, but there are commonalities at more fractal levels than just the generalized artistic method.

In any case, the thinking is basically that tools is tools. I've designed sound and lighting for the stage. I've designed props. I've done book covers. I've done music. I'm not saying I do any of these things well, but there is a hacker approach to it all: understand the new medium and its tools, then apply the core skills to it. It all feels sort of the same -- reaching for personal aesthetics and an understanding of the real world while learning about the tools and traditions of the specific medium.

From a process point of view, from this higher level "making art" point of view, AI is just another tool set.

It tends to be a toolset that is relatively easy to learn, but also less responsive. It really is like painting with a potato (and not to make stamps, either!) You can't really repurpose the hand-and-eye skills learned in inking a comic book panel with a nibbed pen or painting a clean border on a set wall with a sash brush. Because it is a potato. You hold it, you dip it in paint, it makes a big messy mark on the canvas.

It is mostly the upper-level skills and that general artistic approach that you bring to bear. Of the things I can imagine and want to visualize, which will this tool let me achieve? You can't get a melody out of a sculpture or paint the fluting on a hand-knapped obsidian point, and you learn quickly which subjects and styles and so forth the AI tools can support, and which are not really what they are meant for.

This will change. This will change quickly. Already, I've discovered even my potato computer could have done those short videos with animation synchronized to the music.

That's been the latest "Hey, I found this new kind of ink marker at the art store and I just had to try it out on something" for me: S2V, with an audio-to-video model built around the WAN2.2 (14B) AI image model, in a spaghetti workflow hosted in ComfyUI.
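Since the whole thing lives in ComfyUI, the spaghetti can at least be driven from a script. A minimal sketch, assuming ComfyUI is running on its default local port and the workflow has been exported in API format; the filename and the node ID are placeholders from my own export, not anything standard:

    # Queue a saved ComfyUI workflow from a script instead of the browser.
    # Assumes ComfyUI is up on its default port (8188) and the workflow was
    # exported via "Save (API Format)" to workflow_api.json. The node ID "12"
    # is a placeholder from my own export; yours will differ.
    import json
    import urllib.request

    with open("workflow_api.json", "r", encoding="utf-8") as f:
        workflow = json.load(f)

    # Point an audio-loading node at a new file before queueing.
    workflow["12"]["inputs"]["audio"] = "narration_take3.wav"

    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(
        "http://127.0.0.1:8188/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode("utf-8"))  # includes a prompt_id you can poll

The response comes back with a prompt ID, so a batch of overnight renders can be queued without babysitting the browser tab.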

On a potato. (An i5 with a 3060 Ti. I want to upgrade to a 3090 -- the GPU generation doesn't matter as much as that raw 24GB of VRAM -- but I can't stomach another thousand bucks on the proper host of an i9 machine with the bus for DDR5.)

Render time for ten seconds of video at 480x640 is around an hour. Hello, 1994; I'm back to Bryce3D!
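The back-of-the-envelope on that hour is sobering. A quick sketch, with the 16 fps output rate being my assumption (it's a common default for these video models); swap in whatever your workflow actually emits:

    # Where does "about an hour for ten seconds" come from?
    fps = 16                        # assumed output frame rate
    clip_seconds = 10
    total_render_seconds = 60 * 60  # roughly an hour on the 3060 Ti

    frames = fps * clip_seconds
    per_frame = total_render_seconds / frames
    print(f"{frames} frames, ~{per_frame:.0f} seconds of compute per frame")
    # -> 160 frames, ~22 seconds of compute per frame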


Actually, right now longer clips lose coherence. That's the reason all that stuff on YouTube right now runs 30 seconds and under. It's the same problem that underlies why GPT-4 loses the plot a chapter into a novel and ACE-Step forgets the tune or even the key.

I suspect there's a fix in adding another agent in there, so it won't surprise me to find this is a temporary problem. But it also won't surprise me if it proves more difficult than that, as it grows out of something that sort of underlies the whole generative approach. Even on a single still image, I've seen Stable Diffusion forget what it was rendering and start detailing up something other than what it started with. It's particularly a problem with inpainting. "Add a hand here." "Now detail that hand." Responds SD, "What hand?"
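A toy way to picture it -- an analogy only, not the actual diffusion math: if each new chunk is conditioned only on the tail of the previous chunk, small per-step errors compound like a random walk, while re-anchoring every chunk to the original reference keeps the drift bounded. The numbers below are made up purely to show the shape of the problem:

    # Toy drift analogy: chain-only conditioning vs. re-anchoring each chunk.
    import random

    random.seed(0)
    trials, steps, step_err = 2000, 30, 0.05

    def drift(anchored: bool) -> float:
        x = 0.0
        for _ in range(steps):
            e = random.uniform(-step_err, step_err)
            # chain-only carries the whole error forward;
            # anchored decays back toward the original reference each step
            x = (0.8 * x if anchored else x) + e
        return abs(x)

    chain_avg = sum(drift(False) for _ in range(trials)) / trials
    anchor_avg = sum(drift(True) for _ in range(trials)) / trials
    print(f"average drift, chain-only:  {chain_avg:.3f}")
    print(f"average drift, re-anchored: {anchor_avg:.3f}")

On average the chain-only drift comes out several times larger, which is roughly the intuition behind hoping a supervising agent (or any persistent anchor) can patch the problem.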

(This also may change with the move towards natural language prompting. The traditional approach was spelling things out, as if the AI was an idiot. This is the "Bear, brown, with fur, holding a spatula in one paw..." approach. The WAN (and I assume Flux) tools claim better results with "A bear tending a barbecue" and let the AI figure out what that entails.)
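For what it's worth, here are the two registers side by side, as you'd type them into a positive-prompt field (the extra quality tags in the first are just the usual boilerplate of that style):

    # Old-school tag prompt versus the caption-style prompt WAN claims to prefer.
    tag_style = (
        "bear, brown, with fur, holding a spatula in one paw, "
        "outdoors, barbecue grill, smoke, detailed, best quality"
    )
    caption_style = "A brown bear tending a barbecue, smoke rising from the grill."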

Anyhow. Been experimenting with a wacky pipeline, but this time using speech instead of music.

Rip the audio track off a video with VLC. Extract and clean up a voice sample in Audacity. Drop the voice sample and a script into Chatterbox (I've been messing with both the TTS and VC branches). The other pipeline uses a basic paint program to compose, then Stable Diffusion (with Automatic1111 as a front end, because while it may be aging, the all-in-one canvas approach is much more suitable for inpainting. Plus, I know this one pretty well by now).
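The front half of that is scriptable. A sketch under some assumptions: ffmpeg stands in for the VLC rip (the Audacity cleanup is still by hand), and the ChatterboxTTS calls follow the upstream README as I remember it, so treat the exact API as something to double-check against the repo:

    # Scriptable stand-in for the voice half of the pipeline.
    import subprocess

    import torchaudio
    from chatterbox.tts import ChatterboxTTS

    # 1. Rip the audio track off the source video to a mono WAV
    #    (ffmpeg here instead of VLC).
    subprocess.run(
        ["ffmpeg", "-y", "-i", "source_clip.mp4", "-vn", "-ac", "1", "voice_raw.wav"],
        check=True,
    )

    # 2. (Clean voice_raw.wav up into voice_sample.wav in Audacity by hand.)

    # 3. Synthesize the script in the sampled voice.
    model = ChatterboxTTS.from_pretrained(device="cuda")
    script = "Line of dialogue for the character to read."
    wav = model.generate(script, audio_prompt_path="voice_sample.wav")
    torchaudio.save("narration.wav", wav, model.sr)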

Throw both into the WAN2.2-base S2V running under ComfyUI. It does decent lip sync. I haven't tried it out on instrumentals yet, but apparently it can parse some of that, too. It also has a different architecture than the WAN2.1 I was doing I2V with before, yet similar model consistency, which is a nice surprise. And the workflow I'm using leverages an "extend video" hack, which means my potato -- as much as it struggles fitting a 20GB model into 12GB of VRAM -- can get out to at least 14 seconds at HD.
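The shape of that "extend video" trick, as I understand it: generate in short chunks the card can handle, seed each chunk with the tail frame of the previous one, and feed it the matching slice of audio. A structural sketch only; generate_chunk() is a placeholder for the actual S2V sampler nodes, not a real API:

    # Structural sketch of chunked "extend video" generation; generate_chunk()
    # is a stand-in for the real S2V sampler nodes, not an actual API.
    from typing import List


    def generate_chunk(seed_frame, audio_slice) -> List[object]:
        """Placeholder for one WAN2.2 S2V sampling pass; returns a list of frames."""
        raise NotImplementedError


    def extend_video(first_frame, audio_slices) -> List[object]:
        frames: List[object] = []
        seed = first_frame
        for audio in audio_slices:
            chunk = generate_chunk(seed, audio)  # short clip the 12GB card can fit
            frames.extend(chunk)
            seed = chunk[-1]                     # next chunk starts where this ended
        return frames

The win is that the card only has to hold one chunk's worth of work at a time, which is how 14 seconds squeaks out of 12GB; the cost is exactly the chunk-to-chunk drift described above.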

As usual, the fun part is trying out edge cases. Always, the artist wants to mess around and see what the tools can do. And at some point you are smacking that sculpture with a drum stick, recording the result and programming it into a sample keyboard just to see if it can be done.

There is an image-to-mesh template already on my ComfyUI install. The AI can already fill in outside the picture frame, which is what it is doing -- almost seamlessly! -- in the image2video workflow. So it makes sense that it can guess at the back side of an object in an image, then produce a surface and export it as an STL file you can send to your physical printer!
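Inside ComfyUI the template handles the export node itself; outside of it, the final step is genuinely tiny. A sketch with trimesh standing in as my choice of library (the template may well use something else), and a dummy tetrahedron in place of the generated geometry:

    # Write a mesh out as an STL ready for the slicer.
    import numpy as np
    import trimesh

    # Dummy tetrahedron -- substitute the vertices/faces from the
    # image-to-mesh model.
    vertices = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
    faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])

    mesh = trimesh.Trimesh(vertices=vertices, faces=faces)
    mesh.export("model.stl")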

So there are many things the tools can do. The underlying paradox remains -- perhaps even more of one as the tools rely on the AI reading your intent and executing it; an understanding that relies on there being a substantial number of people who asked for the same thing using similar words.

It is the new digital version of the tragedy of the commons. It is the search algorithm turned into image generation. It has never been easier to look like everyone else (well, maybe in Paris in 1900, if you were trying to get selected for show by the Académie).

And that indeed is my new fear. I think there is this thought in some of the big companies that this could be a new way to stream content. Give everyone tools to "create art" so, like the nightmare the Incredibles faced, nobody is special. The actual artists starve out even faster than they have been already. The consumer believes they are now the creator, but they aren't; they are only the users of a service that spits out variations on what they have already been sold.

Right now, AI art is in the hacker domain, where academics and tinkerers and volunteers are creating the tools. Soon enough, the walls will close in. Copyright and DMCA, aggressive "anti-pornography" stances. None of that is actually wrong by itself, but it will be applied surgically to make sure the tools are no longer in your hands but only rented from the company that owns them.

The big software companies have often been oddly hostile to content creators. "Why would you need to create and share when we've made it so easy to watch DVDs?" This just accelerates in that direction, where the end-user doesn't create and doesn't own.

We're back to streaming content; it is just that the buttons have changed labels. Instead of "play me something that's like what I usually listen to," it will be "create me something that sounds like what I usually listen to."

Doesn't stop me from playing around with the stuff myself.


(Call to Power players will recognize this one.)
