Egyptian hieroglyphs contain determinatives: special characters used to indicate whether characters are meant to be taken for their semantic value or their phonetic value (or for a non-literal value that has become common).
Something like this is true of every logographic writing system. Learning how to make a rebus is the innovation that allows such a system to become a full written language, capable of capturing new words and foreign concepts.
In many other written languages, forms have grown up that allow one to disambiguate, from the typography of computer languages that distinguishes a string from a command to the use of balloon shapes to give context to the text floating over a comic book page.
AI doesn't have this.
I think it is a philosophical choice, driven by this Apple Computer-like idea that the machine should be seamless. The illusion of the power of "natural language." No special grammar like the nested parentheses of Lisp, and no technical vocabulary such as is used in law, science, and every trade.
No, an ordinary unspecialized person can just talk to the computer and it will magically understand what they need.
And yet such a grammar is already creeping around the corners. In the Stable Diffusion community, there is a primitive grammar, such as using two types of bracket to emphasize or de-emphasize keywords. It is more straightforward, and less risky, to use ((big)) rather than attempt to rank synonyms, guessing whether the particular combination of checkpoint and LoRAs will weight "gigantic" or "gargantuan" as bigger.
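To make the bracket grammar concrete, here is a minimal sketch of how such a prompt might be read. It assumes the convention I know from the popular web UIs: each layer of parentheses multiplies the weight by 1.1 and each layer of square brackets divides it by 1.1. The function name `parse_emphasis` and the `step` parameter are my own inventions for illustration, not anybody's actual API, and real front ends handle more syntax than this.

```python
def parse_emphasis(prompt, step=1.1):
    """Split a prompt into (text, weight) pairs.

    Assumed convention: "(" raises emphasis one level, "[" lowers it one
    level, and the weight is step ** (net nesting level).
    """
    chunks = []
    stack = []   # currently open brackets, innermost last
    depth = 0    # net level: +1 per open "(", -1 per open "["
    buf = []     # characters of the chunk being collected

    def flush():
        text = "".join(buf).strip(" ,")
        if text:
            chunks.append((text, round(step ** depth, 4)))
        buf.clear()

    closers = {")": "(", "]": "["}
    for ch in prompt:
        if ch in "([":
            flush()
            stack.append(ch)
            depth += 1 if ch == "(" else -1
        elif ch in closers and stack and stack[-1] == closers[ch]:
            flush()
            stack.pop()
            depth -= 1 if ch == ")" else -1
        else:
            # Ordinary text (or an unmatched closer) stays in the chunk.
            buf.append(ch)
    flush()
    return chunks


if __name__ == "__main__":
    for text, weight in parse_emphasis("a ((big)) dog, [blurry]"):
        print(f"{weight:>6}  {text}")
```

The point is not the arithmetic. It is that ((big)) carries an explicit, machine-readable instruction, where "gargantuan" only carries a hope.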
And there is a specialist vocabulary that crossed over from the Danbooru image board, appearing in enough third-party anime-friendly checkpoints that shorthand like the ubiquitous "1girl" has migrated into the major models.
The point is, this natural language thinking is crap. There's a reason we have specialist language. Jargon. And grammar to make distinctions that can be unclear even in face-to-face conversation.
This is already obvious in our search engines. But then, the primary driver is not to satisfy your search, but to satisfy the need of the advertisers to sell you something. AI, as currently implemented in so many customer-facing applications, from phone trees to Siri to search queries, is an over-eager salesperson with a short attention span. It isn't listening for context. It is looking for keywords it can jump on.
It is democratization in the worst way. "It just works," but only if you were looking for whatever the mainstream is at the moment. This is why the "help" systems at something like the phone company are entirely designed around assuming you have made one of three common mistakes. Not that, maybe, something slightly less usual happened, like their equipment being broken.
What I'm saying is that when I play around with generative AI at home and for my own amusement, I am seeing just how terrible an idea all this AI crap is. When it wants to find "the crook" somehow, it will grab on to any recognizable and repeatable element in the camera feed it monitors. The agentic systems aren't properly adversarial, the way a court proceeding is. They are incentivized for results at any cost, even if the result is a whole bunch of false positives.
This means the entire history of metaphor and simile is out the window. Our speech is littered with turns of phrase and terms of art, phrases that are each made up of individual words. The underlying token philosophy of AI, and the way the limitations of data handling mean language needs to be broken into the smallest possible pieces, together mean that words are more likely to be taken out of context than in context.
I'm not saying this couldn't be addressed. After all, that's the point of agentic AI. Assign some agents to look for phrases and to weight them higher than the other agents looking at words. You can tokenize any semantic or conceptual item, after all. It doesn't have to be a word. But it isn't being done now, because it isn't useful. What is being sold to us now is something that will always answer, confidently, even if it is wrong. That will always find the right line to connect you to or the evil-doer on the security footage or the product you need to buy (even if you weren't searching for that in the first place).
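As a toy illustration of treating a phrase as a single item, here is a minimal sketch of greedy longest-match tokenization against a phrase list. This is emphatically not how the subword tokenizers inside current models work, and it is not an agentic system either. The function name `tokenize_with_phrases`, the `max_len` parameter, and the phrase list are all hypothetical, made up for the example.

```python
def tokenize_with_phrases(text, phrases, max_len=4):
    """Tokenize on whitespace, but keep known multi-word phrases whole.

    At each position the longest matching phrase (up to max_len words)
    wins; otherwise the single word becomes its own token.
    """
    words = text.lower().split()
    known = {tuple(p.lower().split()) for p in phrases}
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, down to two words.
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = tuple(words[i:i + n])
            if candidate in known:
                tokens.append(" ".join(candidate))
                i += n
                break
        else:
            # No phrase starts here, so fall back to the bare word.
            tokens.append(words[i])
            i += 1
    return tokens


if __name__ == "__main__":
    idioms = ["kicked the bucket", "no man's land"]  # hypothetical lexicon
    print(tokenize_with_phrases("He kicked the bucket in no man's land", idioms))
```

With the phrase list in place, "kicked the bucket" arrives as one unit instead of a verb, an article, and a pail. The hard part, of course, is the lexicon, which is exactly the specialist vocabulary the "natural language" pitch pretends we don't need.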
Where movement is happening is in smaller user bases, more technically oriented user bases. Not the people stuck with whatever stupidity the iPhone is pushing with the latest mandatory update. (Speech to text gets worse with every iteration. It is so afraid of not having an answer that it prioritizes whatever the common answer is. Invite a friend for pizza, and it will be your pal. Take notes from a history lecture, and it will make a complete mess of it. No, Celine Dion did not speak at the Council of Ephesus. Or perhaps that will be the "Council of ePayments," since after all commercial interest is the only thing important here.)
AI, and generative AI in particular, has potential. I still think it is a brute-force attack on problems that are much subtler, and that it is being grossly oversold, but when we wake up and start using it sensibly it can be useful.
I've been wasting time again with image generation. Hey, I got a novel concept out of it. (But that has nothing to do with how "creative" AI is -- I have done exactly the same thing with pen and paper.)
There were times when it felt collaborative. The downside to using generative AI this way is that it is too easy to chase new discoveries. AI is a lootbox (especially now, as the price of tokens is beginning to turn towards approximating actual cost). At any point, it can throw out something that totally isn't what you were working on, but is so "Oooh, shiny!" you want to stop and work on that instead.
(The rest of the analogy of lootboxes is that the house usually wins, giving you not-quite-right results that give you enough of the endorphin boost to keep playing, but also force you to pay up more tokens and roll again. Even with a free generator, it turns too easily into doomscrolling. With a home rig, a doomscrolling that takes fifteen minutes per generation and blasts hot air at your feet. Not a good way to waste a weekend.)
But when I started with a concept that lay inside the models I had available, and I was careful of the no man's land of things the model wasn't trained on and wouldn't understand, I could iterate and repaint until it was pretty close to what I had visualized.
And yes, the trick is to be relentlessly literal. This is a good trick in a lot of design anyhow; don't do concept, do appearance. Write out not the back story, but what you want to appear on the screen. And always be aware of keywords. Negative phrases, for instance, are counter-productive. "It is cloudy but not raining" is a terrible prompt; you will get rain. "The monster is as large as a house" is likely to get you a house-shaped monster. Or a random house.
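The two failure cases above are regular enough that a crude checker can catch them before you burn a generation. What follows is a minimal sketch, assuming nothing about any particular model: it only pattern-matches English negations and comparisons and warns about the noun or verb that is likely to show up literally. The name `lint_prompt` and both regular expressions are my own, and they will miss plenty of phrasings.

```python
import re

# A negation word followed by the thing being negated.
NEGATION = re.compile(r"\b(?:not|no|never|without)\s+(\w+)", re.IGNORECASE)

# "as X as (a) Y" or "like (a) Y": the Y is what tends to get drawn.
SIMILE = re.compile(
    r"\b(?:as\s+\w+\s+as|like)\s+(?:an?\s+|the\s+)?(\w+)", re.IGNORECASE
)


def lint_prompt(prompt):
    """Return warnings for wording an image model may take literally."""
    warnings = []
    for match in NEGATION.finditer(prompt):
        warnings.append(f"negation may still summon '{match.group(1)}'")
    for match in SIMILE.finditer(prompt):
        warnings.append(f"comparison may summon a literal '{match.group(1)}'")
    return warnings


if __name__ == "__main__":
    for prompt in ("It is cloudy but not raining",
                   "The monster is as large as a house"):
        print(prompt, "->", lint_prompt(prompt))
```

The fix for either warning is the same as the advice above: describe what should be on the screen ("overcast sky, dry street") and move anything unwanted into the negative prompt, if the tool has one.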
Using the image-to-video pipeline (Wan2.2 mostly, lately the Smoothmix checkpoint), a key is that the AI can create anything that isn't currently in the frame but could be. It has great difficulty changing what is already in the frame -- and that includes difficulty not interpreting what it sees, including the smallest artifacts, as something that needs to be there.
More than once a stray fleck on the wall has turned into a fly buzzing around the scene. The negative prompts get longer and longer. But want a lion to walk in through the door? No problem.
Still a waste of a good weekend...but fun.