I got the PC I built up and running, after the usual 22H2 hassle (tip: don't use the internal updater. Run the web installer at Microsoft. For as long as that lasts!)
ComfyUI is sandboxed (and a one-click install), and Automatic1111, though now an abandoned project, also installs a venv folder within the stable_diffusion folder, meaning it can run on Python 3.10.6. Now I'm trying to get Kohya running, and learning venv so I can get that on 3.10.11 or higher...without breaking everything else.
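For my own notes, here is a minimal sketch of that isolation step, using nothing but the standard-library venv module; the kohya_ss path is just a placeholder for wherever the trainer ends up living.

```python
# A minimal sketch, standard-library only. The venv inherits whichever
# interpreter runs this script, so launch it with the Python you want baked
# in (e.g. the 3.10.11 install, not A1111's 3.10.6).
# The kohya_ss path below is a placeholder, not a real install location.
import venv
from pathlib import Path

target = Path("kohya_ss") / "venv"
venv.EnvBuilder(with_pip=True, clear=False).create(target)
print(f"Created venv at {target}; activate it before installing requirements.")
```

It's the same thing as running python -m venv from the target interpreter's command line; the script form just makes it harder to grab the wrong Python by accident.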
I still like the primitive but functional GUI of Automatic1111 for stills. But ComfyUI opens up video as well. Motion.
And that got me thinking about linear narrative.
There does exist a form called "non-linear narrative." But that refers to the relationship between the narrative and some other chronology. The latter may be shifted around. A writer can at any point refer to a different time, including such techniques as the flashback and flash-forward. But the narrative itself remains linear. One reads one word at a time.
(Arguably, from our understanding of the process of reading, we parse chunks of text, and thus multiple words may be included in what is experienced as a single unit of extracted meaning.)
This means it is extremely difficult to capture the near-simultaneous flow of information that a person experiencing events directly, rather than reading an account of them in a book, would take in. In our old gaming circles, the joke was the monster quietly waiting until the room description was finished. It is a basic problem in writing; you can't tell it all at the same time. And the order you choose influences the relative weight given.
Again, arguably, our attention can't be split too many ways. In most cases, the realization you had while you were in the middle of doing something else arrives as a discrete event. You may have heard the voice behind you, and been processing it, but the moment of understanding that cry of "Stop!" can be treated narratively at the moment it becomes the focus of attention. And the observations that led to that moment of realization can be back-filled at that point, as they, too, rise to the top of consciousness.
Or, to put it another way, a narrative is an alias of the stream of consciousness, and the order of presentation can be taken as the order of items brought into focus.
This idea of the sequential scroll of attention has been used in artwork. We normally absorb a piece of art by moving from one focus to another (in a matrix of probable interest including size, color, position, human faces, etc.). The artist can construct a narrative through this shifting of focus.
This one sneaks up in stages. The first impression is very calming. The next impressions are not. Especially in some periods, there could be subtler and subtler clues and symbols that you don't notice until you've been looking for a while.
Or there are works, from the triptych to the Bayeux Tapestry, that arrange distinct framed panels in a sequential order.
Motion controls this flow of narrative more tightly. Not to say there can't be the same slow realizations. But it means thinking sequentially.
In comic book terminology, the words "Closure" and "Encapsulation" are used to describe the concepts I've been talking about. "Closure" is the mental act of bringing together information that had been presented over a sequence of panels in order to extract the idea of a single thing or event. "Encapsulation" is a single panel that is both the highlight of and a reference pointer or stand-in for that event.
In text, narrative, especially immersive narrative that is keyed to a strong POV or, worse, a first-person POV, has a bias towards moving chronologically. Especially in first-person, this will lead the unwary writer into documenting every moment from waking to sleep (which is why I call it "Day-Planner Syndrome.")
I've been more and more conscious of the advantages and drawbacks of jumping into a scene at a more interesting point and rapidly back-filling (tell not show) the context of what that moment came out of. I don't like these little loops and how they disturb the illusion of a continuous consciousness that the reader is merely eavesdropping on as they go about their day, but I like even less spending pages on every breakfast.
And speaking of time. The best way to experience the passage of time is to have time pass. That is, if you want the reader to feel that long drive through the desert, you have to make them spend some time reading it. There's really no shortcut.
I decided for The Early Fox I wanted to present Penny as more of a blank slate, and to keep the focus within New Mexico. So no talking about her past experiences, comparisons to other places she's been, comparison or discussion of the histories of other places, technical discussions that bring in questions of where Penny learned geology or Latin or whatever, or quite so many pop-culture references.
And that means I am seriously running out of ways I can describe yucca.
In any case, I spent a chunk of the weekend doing test runs with WAN2.2 and 2.1 on the subject of "will it move?" Which is basically the process of interrogating an AI model to see what it understands and what form the answer will take.
My first test on any new install is the prompt "bird." Just the one word. Across a number of checkpoints the result is a bird on the ground, usually grass. A strange and yet almost specific and describable bird; it is sort of a combination of bluebird and puffin with a large hooked beak, black/white mask, blue plumage and yellow chicken legs.
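If you want to repeat that smoke test outside of a GUI, here is a minimal sketch using the diffusers library rather than the front ends I actually run; the checkpoint name is just a placeholder for whatever model is being interrogated.

```python
# A minimal sketch of the one-word smoke test, using diffusers rather than
# the A1111/ComfyUI front ends; the checkpoint name is a placeholder.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder checkpoint
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# Just the one word. Run it on each new install and compare what comes out.
image = pipe("bird").images[0]
image.save("bird_test.png")
```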
In investigating motion in video, I discovered there are two major things going on under the hood. The first is that when you get out of the mainstream ("person talking") and into a more specific motion ("person climbing a cliff") you run into the paucity of training data problem. When there is a variety of data, the AI can synthesize something that appears original. When the selection is too small, the AI recaps that bit of data in a way that becomes recognizable. Oh, that climbing move where he steps up with his left foot, then nods his head twice.
The other is subject-background detection. AI video works now (more-or-less) because of subject consistency. The person walking remains in the same clothing from the first frame to the last. It does interpolate, creating its own synthesized 3D version, but it can be thought of as, basically, detaching the subject and then sliding it around on the background.
We've re-invented Flash.
Now, because the AI is detaching then interpolating, and the interpolation makes use of the training data of what the back of a coat or the rest of a shoe looks like (and, for video models, moves like), it does have the ability to animate things like hair appropriately when that subject is in motion. But AI is pretty good at not recognizing stuff, too. In this case, it takes the details it doesn't quite understand and basically turns them into a game skin.
Whether this is something the programmers intended, or an emergent behavior in which the AI is discovering ways of approximating reality similar to those game creators have been using, the subject becomes basically a surface mesh that gets the large-scale movements right but can reveal that things like the pauldrons on a suit of armor are just surface details, parts of the "mesh."
It can help to think of AI animation as Flash in 3D. The identified subjects move around a background, with both given consistency from frame to frame. And think of the subject, whether it is a cat or a planet, as a single object that can be folded and stretched with the surface details more-or-less following.
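To make the analogy concrete, here is a toy version of that Flash-style compositing done in Pillow. It is emphatically not what the video models compute internally; it's just the mental model, with placeholder filenames: one consistent subject sprite slid across a fixed background, frame after frame.

```python
# A toy illustration of the "Flash in 3D" mental model, not how the video
# models actually work. Assumes subject.png is a cut-out with transparency
# that fits inside background.png; both filenames are placeholders.
from PIL import Image

background = Image.open("background.png").convert("RGBA")
subject = Image.open("subject.png").convert("RGBA")

# Slide the subject left to right across the top of the background.
step = max(1, (background.width - subject.width) // 32)

frames = []
for i in range(33):  # a typical short starter-clip length
    frame = background.copy()
    frame.alpha_composite(subject, dest=(step * i, 0))
    frames.append(frame.convert("RGB"))

frames[0].save("walk_cycle.gif", save_all=True,
               append_images=frames[1:], duration=66, loop=0)
```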
But back to that consistency thing. For various reasons, video renders are limited to the low hundreds of frames (the default starter, depending on model, is 33 to 77 frames). And each render is a fresh roll of the dice.
It is a strange paradox, possibly unavoidable in the way we are currently doing this thing we call "AI." In order to have something with the appearance of novelty, it has to fold in the larger bulk of training data. In order to have consistency, it has to ignore most of that data. And since we've decided to interrogate the black box of the engine with a text prompt, we are basically left with "make me a bird" and the engine spitting out a fresh interpretation every time.
That plays hell on making an actual narrative. Replace the comic-book panel with the film-terminology "shot," and have that "Closure" built on things developed over multiple shots, and you are confronted with the problem that the actors and setting are based on concepts, not on a stable model that exists outside the world of an individual render. If you construct "Bird walking," "Bird flies off," and "Bird in the sky" with each interpreting the conceptual idea of "Bird" in a different way, it is going to be a harder story to understand.
That is going to change. There are going to be character turn-arounds or virtual set building soon enough. As I understand it, though, the necessary randomness means the paradox is baked into the process. No matter what the model or template, it is treated the same as a prompt or a LoRA or any other weighting: as a suggestion. One that gets interpreted in the light of what that roll of the dice spat out that run.
And that's why the majority of those AI videos currently clogging YouTube go for conceptual snippets arranged in a narrative order, not a tight sequence of shots in close chronological time. You can easily prompt the AI to render the hero walking into a spaceport, and the hero piloting his spacecraft...now wearing a spacesuit and with a visibly different haircut.
For now, the best work-around appears to be using the "I2V" (image-to-video) subset. That generates a video from a reference image. The downside is that anything that isn't in the image -- the back of the head, say -- is interpolated, and thus will be different in every render. It also requires creating starter images that are themselves on-model.
A related trick is pulling the last frame of the first render and using that as the starter image for a second render. The problem this runs into is the Xerox Effect: the same problem that is part of why there is a soft limit on the number of frames of animation that can be rendered in a single run.
(The bigger problem in render length is memory management, though I am not entirely clear why.)
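Mechanically, the last-frame trick is trivial. Here is a sketch using OpenCV, with the filenames standing in for whatever the first render actually produced:

```python
# A sketch of the "pull the last frame, feed it back in" step, using OpenCV.
# Filenames are placeholders for the first render and the next starter image.
import cv2

cap = cv2.VideoCapture("render_01.mp4")
last_frame = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    last_frame = frame  # keep overwriting; whatever survives the loop is the last frame
cap.release()

if last_frame is not None:
    # Starter image for the next I2V run.
    cv2.imwrite("render_02_start.png", last_frame)
```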
As with most things AI, or 3D for that matter, it turns into the Compile Dance. Since each run is a roll of the dice, you often can't tell if there is a basic error of setup (bad prompt, a mistake in the reference image, a node connected backwards) or just a bad draw from the deck. You have to render a couple of times. Tweak a setting. Render a couple times to see if that change was in the right direction. Lather, rinse.
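When I get tired of clicking, the dice-rolling can be scripted against ComfyUI's HTTP API. A rough sketch, assuming ComfyUI is running locally on the default port and the workflow has been exported with "Save (API Format)"; the file name and the seed-field names are assumptions that may need adjusting for a particular workflow:

```python
# A rough sketch of scripting the dice rolls through ComfyUI's HTTP API.
# Assumes a local ComfyUI on the default port and a workflow exported via
# "Save (API Format)"; the seed-field names below are guesses.
import json
import random
import urllib.request

COMFY_URL = "http://127.0.0.1:8188/prompt"

with open("workflow_api.json", "r", encoding="utf-8") as f:
    workflow = json.load(f)

def queue_run(seed: int) -> None:
    """Stamp the seed into every node that takes one, then queue the workflow."""
    for node in workflow.values():
        inputs = node.get("inputs", {})
        for key in ("seed", "noise_seed"):
            if key in inputs and isinstance(inputs[key], int):
                inputs[key] = seed
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(COMFY_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

# Queue a few runs per tweak so one bad draw doesn't mislead you.
for _ in range(3):
    queue_run(random.randint(0, 2**32 - 1))
```

Queueing a few runs per tweak at least tells you whether you are looking at a setup error or just a bad draw from the deck.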
With my new GPU and the convenient test size I have been working with, render times fall into the sour spot: 1-3 minutes. Not long enough to do something else, but long enough that it is annoying to wait it out.
I still don't have an application, but it is an amusing enough technical problem to keep chasing for a bit longer. The discussions on the main subreddit seem to show a majority of questioners who just want "longer video" and hope that by crafting the right prompt, they can build a narrative in an interesting way.
The small minority is there, however, explaining that cutting together shorter clips better approaches how the movies have been doing it for a long time: a narrative approach that seems to work for the viewer. But that really throws things back towards the problem of consistency between clips.
And that's why I'm neck-deep in Python, trying not to break the rest of the tool kit in adding a LoRA trainer to the mix.