Yeah, I've gone to the dark side.
I said I was messing around with AI music creation via ACE-Step. That one leans more toward a pure diffusion model, so it is (at the moment) less controllable than Suno, but potentially more flexible. They boast about the number of languages it can handle...but at the same time admit that they really haven't trained it on opera, or (as my own small experiments seemed to show) much of anything at all outside of pop music.
(They also side-step the sound quality issue, with the user base developing magic incantations that might be improving things if you squint, or might just be random "sometimes it doesn't sound quite so bad" luck that has nothing to do with their "and set cfg to 1.14" stuff.)
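For anyone who hasn't played with these, "cfg" is the classifier-free guidance scale: how hard the diffusion process gets steered toward your prompt. I'm not quoting ACE-Step's own interface here (I haven't dug into its API enough to do that honestly); this is just a sketch of where that knob usually lives, using a plain Stable Diffusion pipeline from diffusers as a stand-in, with the model id and prompt as made-up examples.

```python
# Illustration only: guidance_scale is the generic "cfg" knob in diffusers
# pipelines. The model id and prompt are placeholders, not anything from
# the ACE-Step experiments above.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Very low values (around 1-2) mostly ignore the prompt; typical image work
# sits around 5-9. Whether 1.14 is magic or superstition is left to the reader.
image = pipe("a brass airship over a harbor", guidance_scale=1.14).images[0]
image.save("cfg_test.png")
```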
The one that I actually wanted to talk about was the attempt to do a villain song/"I want" song. The result is less Broadway, more Billie Eilish. Among other things, ACE-Step doesn't know how to belt. Or like female singers in a range other than light soprano. Its understanding of "chorus" vs. "verse" is "sing louder!" It also treats lyrics, like everything else, as suggestions.
But on the fourth or fifth run it came out with this weird melisma that's almost Sondheim or something. Pity it couldn't carry it through. I could improvise a better chorus (this one is an absolute natural to sequence, half-stepping higher and higher). Plus, at least one previous run was smart enough to do the intro on solo piano. I like the "is that supposed to be an oboe?" line, but it should enter later.
Like I said. Not really trained on showtunes. That's why it surprised me that this run, instead of repeating the chorus over and over and drifting further and further from the melody as it goes (without even a Truck Driver Gear Shift!), went for an extended instrumental break. Cool. On stage, this would be a dance, or perhaps an aerial dance.
The less said about the Wan2.1 animations, the better. At some point up the scale of increasingly large and power-hungry tools you can script better, and start using LoRAs to keep the animator from going completely off-model. At this level it is just as well I've got a 3060 Ti with 12 GB that is practically a potato when it comes to video -- because Wan2.1 also loses the plot after 60-90 frames of video. So no Bryce-length render times. It also has some memory leaks, so even if I could automate a batch, the second run would fail.
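If I ever did want to batch it, the blunt workaround for a leaky pipeline is to give every clip its own process so the GPU memory gets torn down between runs. A minimal sketch, assuming a hypothetical wan_render.py driver script with made-up arguments:

```python
# Sketch of a batch harness that survives a leaky generator: each clip runs
# in a fresh Python process, so a leak in run N can't starve run N+1.
# "wan_render.py" and its flags are hypothetical stand-ins for whatever
# script actually drives the model.
import subprocess
import sys

prompts = [
    "villain at a grand piano, slow dolly in",
    "aerial dancer on silks, stage fog, spotlight",
]

for i, prompt in enumerate(prompts):
    result = subprocess.run(
        [sys.executable, "wan_render.py",
         "--prompt", prompt,
         "--frames", "60",
         "--out", f"clip_{i:02d}.mp4"],
        check=False,  # keep going even if one clip falls over
    )
    if result.returncode != 0:
        print(f"clip {i} failed (exit {result.returncode}); moving on")
```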
But on the other hand, this gave me a nice chance to demonstrate using img2img mode to compose a Stable Diffusion image. I did those roughs in Paint 3D, using my knowledge of how SD "sees" an image. The airship took a bit of inpainting, as the AI really wanted to make either a blimp or a clipper rather than try to get both into the same machine. For the others I used simple convergence: rendered a half-dozen versions, picked one, then re-rendered from that one to zero in.
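For anyone who wants to try the convergence trick, here's a minimal sketch using the stock diffusers img2img pipeline. The model id, file names, strength values, and prompt are placeholders for illustration, not what I actually used.

```python
# Rough sketch of the img2img "convergence" loop: start loose on a painted
# comp, pick the best candidate by eye, then tighten up on that pick.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "steampunk airship with rigid hull and sails, concept art"
comp = Image.open("paint3d_rough.png").convert("RGB").resize((768, 512))

# Pass 1: high strength, so SD is free to reinterpret the crude comp.
candidates = [
    pipe(prompt, image=comp, strength=0.7, guidance_scale=7.0).images[0]
    for _ in range(6)
]
pick = candidates[2]  # whichever of the half-dozen reads best

# Pass 2: lower strength on the chosen candidate, which tightens detail
# without letting the composition drift.
final = pipe(prompt, image=pick, strength=0.35, guidance_scale=7.0).images[0]
final.save("airship_final.png")
```

The inpainting pass on the airship would be the same idea with a mask and the inpainting pipeline, and SDXL works the same way through its own img2img variant.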
At least with SDXL you can prompt a little. ACE-Step and Wan2.1 go even more nuts over stray words -- my first attempt at the airship had "flying" in the prompt and the AI added wings to it. For the ACE-Step renders, my prompts were as short as four words.
The other lesson here is that the AI will willingly change lots of things, but it is a real stickler about proportions. What it saw in the comp, it will keep. A big-headed guy with a short-necked guitar is going to stay that way through the generations, and it is really difficult to fix once you are in the clean-up render phase.