Tuesday, October 7, 2025

Anaconda

My go-to ComfyUI workflow now has more spaghetti than my most recent factory.


(Not mine; some guy on Reddit.)

The VRAM crunch for long videos seems to rest primarily in the KSampler. There's an s2v workflow in the templates of a standard ComfyUI install that uses a tricky little module that picks up the latent and renders another chunk of video, with all the chunks stitched together at the end. With that approach, the major VRAM crunch is the size of the image, not the length of the video.
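
Roughly, the pattern looks something like this. Just a sketch of the idea, not the actual ComfyUI nodes; render_chunk here is a dummy stand-in for whatever the sampler-plus-decode stage really does.

# Sketch of the chunk-and-stitch idea only -- NOT the real ComfyUI nodes.
# render_chunk() is a placeholder that fabricates numbered "frames" so the
# loop structure is visible.

def render_chunk(carry, num_frames):
    base = 0 if carry is None else carry
    return [base + i for i in range(1, num_frames + 1)]

def render_long_video(total_frames, chunk_size):
    frames = []
    carry = None                              # tail handed to the next chunk
    while len(frames) < total_frames:
        chunk = render_chunk(carry, min(chunk_size, total_frames - len(frames)))
        frames.extend(chunk)                  # stitched together at the end
        carry = chunk[-1]                     # only the tail crosses the boundary,
                                              # so VRAM tracks frame size, not length
    return frames

print(len(render_long_video(total_frames=640, chunk_size=64)))   # 40 s at 16 fps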

Of course there's still the decoherence issue. I've been running 40-second tests to see how badly the image decomposes over that many frames. I also found the quality acceptable when rendering at 720 and upscaling to 1024 via a simple frame-by-frame lanczos upscaler (nothing AI about it). And I'm rather proud I figured that out all by myself. At 16 fps and with Steps set down at 4, I can get a second of video for every minute the floor heater is running.
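
For what it's worth, the upscale pass is nothing more than this sort of thing. A sketch, assuming the clip has already been dumped to PNG frames; the folder names are made up, and Pillow's LANCZOS filter does all the work.

from pathlib import Path
from PIL import Image

SRC = Path("frames_720")      # rendered frames (illustrative path)
DST = Path("frames_1024")     # where the upscaled frames go
DST.mkdir(exist_ok=True)

for frame in sorted(SRC.glob("*.png")):
    img = Image.open(frame)
    # Scale so the long edge lands on 1024, keeping the aspect ratio.
    scale = 1024 / max(img.size)
    new_size = (round(img.width * scale), round(img.height * scale))
    img.resize(new_size, Image.LANCZOS).save(DST / frame.name)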

Scripting is still a big unknown. I've been experimenting with s2v (sound to video) and, as usual, there are surprises. AI, after all, is an exercise in probabilities: "these things are often found with those things." It is, below the layers of agents and control nets and weighting, a next-word autocomplete.

Which means that when it seems to have an uncanny ability to extract emotional and semantic meaning from speech, it is strictly associational: videos in the training material tended to show a person pointing when the vocal patterns of "look over there" occurred. More emergence. Cat logic, even.



So anyhow, I broke Automatic1111. Sure, it had a venv folder, but somehow PATH got pointed in the wrong direction. Fortunately I was able to delete Python and do a clean install of 3.10.9 inside the SD folder; Automatic1111 came back up, and ComfyUI was still safe in its own sandbox. And now to try to install Kohya.


Experimenting with the tech has led to thinking about shots, and that in turn has circled back to the same thing I identified earlier, a thing that becomes particularly visible when talking about AI.

We all have an urge to create. And we all have our desires and internal landscapes that, when given the chance, will attempt to shape the work. Well, okay, writing forums talk about the person who wants to have written a book; the book itself is of no import, just as the nature of the film they starred in has nothing to do with the desire to be a famous actor. It is the fame and fortune that is the object.

In any case, the difference between the stereotype of push-button art (paint by numeric control) and the application of actual skills that took time and effort to learn is, in relation to the process of creation itself, just a matter of how granular you are getting about it.

Music has long had chance music and aleatoric music. Some artists throw paint at a canvas. And some people hire or collaborate. Is a composer not a composer if they hire an arranger?

That said, I feel that in video, the approach taken by many in AI is getting in the way of achieving a meaningful goal. As it exists right now, AI video is poorly scriptable, and its cinematography -- the choice of shots and cutting in order to tell the story -- is lacking. This, as with all things AI, will change.

But right now a lot of people getting into AI are crowding the subreddits asking how to generate longer videos.

I'm sorry, but that's the wrong approach. In today's cinematography, 15 seconds is considered a long take; many movies are cut at a faster tempo than that. Now, there is the issue of coverage...but I'll get there. In any case, this is just another side of the AI approach that wants nothing more than to press buttons. In fact, it isn't even the time, effort, or artistic skills or tools that are being avoided. It is the burden of creativity. People are using AI to create the prompts to create AI images. And not just sometimes; there are workflows designed to automate the terribly challenging chore of getting ChatGPT to spit out a string of words that can be plugged into ComfyUI.

Art and purposes change. New forms arise. A sonnet is not a haiku. There is an argument for recognizing as a form the short-form AI video that stitches together semi-related clips in a montage style.

But even here, the AI is going to do poorly at generating it all in one go. It will do better if each shot is rendered separately, and something (a human editor, even!) splices the shots together. And, especially if the target is TikTok or the equivalent, the individual shots are rarely going to be more than five seconds in length.
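
The splice itself doesn't need anything fancy. Something along these lines works, assuming the shots were rendered to files named shot_01.mp4 and so on with matching encoding settings, and that ffmpeg is on the PATH; the file names are just for illustration.

import subprocess
from pathlib import Path

shots = sorted(Path(".").glob("shot_*.mp4"))

# The concat demuxer wants a list file: one "file 'name'" line per clip.
Path("shots.txt").write_text("".join(f"file '{s.name}'\n" for s in shots))

# -c copy joins the clips without re-encoding; the clips therefore need to
# share resolution, codec, and frame rate.
subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "shots.txt",
     "-c", "copy", "montage.mp4"],
    check=True,
)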


Cutting to develop a story, using language similar to modern filmic language, is a different beast entirely. The challenge I'm thinking a lot about now is consistency. Consistency of character, consistency of set. There are also challenges in matching camera motions and angles if you want to apply the language correctly. For the shot-reverse-shot of which the OTS (over-the-shoulder) is often part, you have to obey the 180-degree rule or the results become confusing.

One basic approach is image to video. With i2v, every shot has the same starting point, although they diverge from there. As a specific example, imagine a render of a car driving off. In one render, the departing car reveals a fire hydrant; in a second render from the same start point, a mailbox. The AI rolled the dice each time because that part of the background wasn't in the original reference.

One weird problem as well. In editing, various kinds of buffer shots are inserted to hide the cuts made to the master shot. Say the interview subject coughed. If you just cut, there'd be a stutter in the film. So cut away to the interviewer nodding as if listening (those are usually filmed at a different time, and without the subject present at all!). Then cut back.

In the case of an i2v workflow, a cutaway done like this would create a strange déjà vu; after the cut, the main shot seems to have reset in time.

So this might actually be an argument for a longer clip, though not to be used as the final output; rather, as a master shot to be cut into for story beats.

Only we run into another problem: AI video is poorly scriptable at present. In the workflows I am currently using, there's essentially one idea per clip. So a simple idea such as "he sees the gun and starts talking rapidly" doesn't work with this process.

What you need is to create two clips with different prompts, and to steal the last frame from the first clip and use it as the starting image of the second clip. Only this too has problems; the degradation over the length of a clip means that even if you add a node to the workflow to automatically save the target frame, it will need to be cleaned up, corrected back to being on-model, and have its resolution increased back to the original.
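
Grabbing that frame outside the workflow is trivial, for what it's worth. Here's a rough sketch using OpenCV, with made-up file names; the cleanup and re-upscale still have to happen afterward.

import cv2

cap = cv2.VideoCapture("shot_01.mp4")    # output of the first clip (illustrative)
last_frame = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    last_frame = frame                    # keep overwriting until the stream ends
cap.release()

if last_frame is not None:
    # Still needs cleanup, on-model correction, and upscaling before it can
    # serve as the start image of the second clip.
    cv2.imwrite("shot_02_start.png", last_frame)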

And, yes, I've seen a workflow that automates all of that, right down to a preset noise setting in the AI model that regenerates a fresh and higher-resolution image.

My, what a tangled web we weave.
