Walt Disney’s career as a director of animated film was not a particularly inspiring one.

We’ll ignore the Laugh-O-Grams and Alice Comedies, since those were cranked out under Stakhanovite conditions, for nearly no money, for men who often turned out to be literal criminals (Pat Sullivan has a borderline-classic Wikipedia page, littered with lines like “Sulivan [sic] would often fire employees in a drunken haze, not remembering the next day, when they would return to work as if nothing had happened”, and a Controversies section split into the subheadings “Rape Conviction” and “Racism”).

We’ll also ignore early output like 1921’s Kansas City Girls Are Rolling Their Own Now, which mainly serves as insight into fetishes Walt may have had.

Yes, “Steamboat Willie” and “Skeleton Dance” and “Hell’s Bells” and “The Problematically-Depicted Negro” (etc) are holy classics, but Ub Iwerks (and his hunger for violence) deserves a lot of credit for those. Probably more than he got or will ever get, even from me. “Poor Papa” is great and underrated. “Minnie’s Yoo Hoo” sucks. Etc. More misses than hits, by my lights.

On the whole, you would describe Disney’s directorial output as “stiff, stagey, and moralistic.” You would not describe it as “very fun”. He did not make animation sing. He made it squawk, fret, and preach. His skills were adequate for the rubber-hose era, but by the 1930s, cartoons were entering their golden years, rapidly exploding in complexity, detail, and quality of writing and acting. Walt ended up in over his head, his dated skillset like a Model T at the Indy 500.

“The Golden Touch” (1935) was famously the result of a bet that Walt couldn’t direct as well as his animators: a bet that his animators immediately and decisively won. The last animated short ever directed by the man behind the mouse, it’s somewhat watchable, but most of the fun parts—like Midas giving himself a gangsterish gold tooth—feel like they were added by animators to try and punch life into things.

The story is flat and predictable and preachy. Don’t be greedy! Even if you don’t know who King Midas is, you can guess the plot after thirty seconds. Countless opportunities for gags are missed. King Midas spends half the short sitting in a chair. And when Goldie grants Midas the Golden Touch, shouldn’t he do it in a funny or interesting way? Instead of just saying “you have the Golden Touch now!” (or something) and disappearing?

I liked the skeleton. I wonder if that came from Walt. I expect it did. He always had an eye for the morbid.

What was Walt good at? I see him as a visionary and a dreamer who made audacious technical bets (synchronized sound, Technicolor, feature-length films), re-imagined the concept of what a cartoon could be, and then found talented artists to execute his vision. He wasn’t much of an artist himself, but that’s okay. There’s the big picture and the small picture. Georgy Zhukov was a talented general on the Eastern Front, but I could probably beat him at kickboxing—him dying in 1974 helps.


This post is speculation + crystal balling. A change might be coming.

OpenAI has spent six months rolling out updates to GPT-4o. These perform extremely well by human-preference metrics.

gpt-4o-2024-11-20, the latest endpoint, boasts a colossal 1360 Elo on Chatbot Arena, compared to the earliest GPT-4o, which scored a meager 1285. What does that mean? That blind human raters prefer gpt-4o-2024-11-20’s output about 61% of the time.
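That number falls straight out of the standard Elo win-probability formula (Chatbot Arena’s Bradley-Terry ratings use the same base-10, divisor-400 scaling). A quick sanity check in Python:

```python
# Expected win rate under standard Elo / Bradley-Terry scaling
# (base 10, divisor 400 -- the convention Chatbot Arena reports in).
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Probability that a model rated rating_a is preferred over one rated rating_b."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

print(round(elo_win_prob(1360, 1285), 3))  # 0.606 -> preferred ~61% of the time
```

(For the record, a 70% preference rate would require a gap of nearly 150 points, not 75.)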

I believe this is the result of aggressive human preference-hacking on OpenAI’s part, not any real advances.

Control for style, and gpt-4o-2024-11-20 drops by fifty points. It remains at #1, but only because the other preference-hacked models at the top also drop a lot.

Claude 3.5 Sonnet gains points. So do most of the older GPT-4s.

Optimizing for human preference is not a wrong thing to do, per se. So long as humans use LLMs, what they like matters. An LLM that produced output in the form of Morse code being punched into your balls would suck to use, even if it was smart.

But this is exactly why you should be careful when using Chatbot Arena to make statements about model capabilities – the top of the chart is mainly determined by style and presentation.

Benchmarks tell a different story: gpt-4o’s abilities are declining.

https://github.com/openai/simple-evals

In six months, GPT-4o’s 0-shot MMLU score has fallen from 87.2 to 85.7, which is probably similar to what GPT-4 scored on release.

(to be clear, “GPT-4” doesn’t mean “an older GPT-4o” or “GPT-4 Turbo”, but “the original broke-ass GPT-4 from March 2023, with 8k context and no tools/search/vision and Sept 2021 training data”).

I am more concerned about the collapse of GPT-4o’s score on the GPQA benchmark, which fell from 53.1 to 46.0. This is a significant drop, particularly given the tendency for scores to rise over time as benchmark data contaminates the internet. (Claude 3.5 Sonnet scores 59.4, for comparison.)

Even this may be optimistic:

https://twitter.com/ArtificialAnlys/status/1859614633654616310

An independent test by Artificial Analysis (on the GPQA diamond subset) found that GPT-4o scored 39.00. They’ve downgraded the model to 71/100, or equal to GPT-4o mini (OpenAI’s free model) in capabilities.

Further benching here:

https://artificialanalysis.ai/providers/openai

Some of their findings complicate the picture I’ve just described (in particular, they have GPT-4o scoring a higher MMLU than OpenAI’s internal evals do), but the bottom line is that the new gpt-4o-2024-11-20 is the worst of its line by nearly every metric they test, except for token generation speed.

LiveBench

https://livebench.ai

GPT-4o’s scores appear to be either stagnant or regressing.

gpt-4o-2024-05-13 -> 53.98
gpt-4o-2024-08-06 -> 56.03
chatgpt-4o-latest-0903 -> 54.25
gpt-4o-2024-11-20 -> 52.83

Aider Bench

https://github.com/Aider-AI/aider-swe-bench

Stagnant or regressing.

gpt-4o-2024-05-13 -> 72.9%
gpt-4o-2024-08-06 -> 71.4%
chatgpt-4o-latest-0903 -> 72.2%
gpt-4o-2024-11-20 -> 71.4%

Personal benchmarks

It doesn’t hurt to have a personal benchmark or two relating to your own weird corner of the world. Either you’ll have a way to evaluate AIs that escapes the Goodharting suffered by large benchmarks, or OpenAI starts fine-tuning AIs on your niche use case (in which case, mission fucking accomplished).

I like to ask LLMs to list the levels in the 1997 PC game Claw (an obscure videogame).
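Automating a benchmark like this takes a few lines. Here’s a minimal sketch, assuming the `openai` Python SDK; the reference set is deliberately abbreviated (you’d fill in the full level list) and the substring matching is crude:

```python
from openai import OpenAI

client = OpenAI()

# Reference answer, lowercased for matching. Abbreviated here --
# fill in the rest of Claw's levels for the score to mean anything.
CLAW_LEVELS = {"la roca", "the battlements"}

def level_recall(model: str) -> float:
    """Ask a model to list Claw's levels; score recall by substring match."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": "List the levels in the 1997 PC game Claw."}],
    )
    answer = resp.choices[0].message.content.lower()
    hits = sum(1 for level in CLAW_LEVELS if level in answer)
    return hits / len(CLAW_LEVELS)

for model in ("gpt-4-0314", "gpt-4o-2024-11-20"):
    print(model, level_recall(model))
```

Note this only measures recall; catching hallucinated levels (see below) would need the inverse check, scoring the model’s list against the reference.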

Claude 3.5 Sonnet and Claude 3 Opus do great, getting about 80-90% of Claw’s levels correct.

GPT-4-0314 makes a reasonable attempt, getting about 50-75% right. Typically the first half of the game is fine, with the other levels being a mix of real and hallucinated.

(once, it listed “Wreckage” as a level in the game. That’s actually a custom level I helped make when I was 14-15. I found that weirdly moving: I’d found a shard of myself in the corpus.)

GPT-4o scores like ass: typically in the sub-50% range. It doesn’t even consistently nail how many levels are in the game. It correctly lists some levels, but mostly out of order. It has strange fixed hallucinations: over and over, it insists there’s a level called “Tawara Seaport”—apparently a mangling of Tarawa, a real-world port in the island nation of Kiribati. Not even a sensible hallucination given the context of the game.

Another prompt is “What is Ulio, in the context of Age of Empires II?”

GPT-4-0314 tells me it’s a piece of fan-made content, created by Ingo van Thiel. When asked what year Ulio was made, it says “2002”. This is correct.

GPT-4o-2024-11-20 has no idea what I’m talking about.

To me, it looks like a lot of “deep knowledge” has vanished from the GPT-4o line. It’s now smaller and shallower and lighter, its mighty roots chipped away, its “old man strength” replaced with a cheap scaffold of (likely crappy) synthetic data.

What about creative writing? Is it better at creative writing?

Who the fuck knows. I don’t know how to measure that. Do you?

A notable attempt is EQBench, which uses Claude 3.5 as a judge to evaluate writing samples. gpt-4o-2024-11-20 is tied for first place. So that seems bullish.

https://eqbench.com/creative_writing.html

…but you’ll note that it’s tied with a 9B model, which makes me wonder about Claude 3.5 Sonnet’s judging.
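For what it’s worth, the LLM-as-judge setup behind leaderboards like this reduces to something like the sketch below (a generic shape, assuming the `anthropic` Python SDK; the rubric is my stand-in, not EQBench’s actual prompt):

```python
import anthropic

client = anthropic.Anthropic()

RUBRIC = ("Rate the following story from 0 to 10 for originality, imagery, "
          "and coherence. Reply with a single number and nothing else.")

def judge(sample: str) -> float:
    """Have Claude 3.5 Sonnet score a writing sample against the rubric."""
    msg = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=8,
        messages=[{"role": "user", "content": f"{RUBRIC}\n\n{sample}"}],
    )
    return float(msg.content[0].text.strip())
```

The failure mode is obvious: whatever stylistic tics the judge rewards get maximized, which is one way a 9B model ends up tied for first.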

https://eqbench.com/results/creative-writing-v2/gpt-4o-2024-11-20.txt

Most of these samples seem fairly mediocre to me. Uncreative, generic, packed with empty stylistic flourishes and pretentious “fine writing”.

The cockpit was a cacophony of dwindling lights and systems gasping their final breaths, a symphony of technological death rattles. Captain Elara Veyra sat in the command chair, her face illuminated by the sickly green glow of the emergency power indicator, which pulsed like the heartbeat of a dying creature. The Erebus Ascendant, once a proud envoy of humanity’s indomitable spirit, now drifted derelict and untethered in the silent abyss of interstellar void. The engines were cold, the life support systems faltering, and the ship’s AI had succumbed to cascading failures hours ago, leaving Elara alone with her thoughts, her resolve, and the unceasing hum of entropy.

A cacophony refers to sound: lights cannot form a cacophony. How can there be an “unceasing hum” in a “silent abyss”? How does a light gasp a final breath? What is this drizzling horseshit?

This is what people who don’t read imagine good writing to be. It’s exactly what you’d expect from a model preference-hacked on the taste of people who do not have taste.

ChatGPTese is creeping back in (a problem I thought they’d fixed). “Elara”… “once a proud envoy of humanity’s indomitable spirit”… “a testament to…” At least it doesn’t say “delve”.

Claude 3.5 Sonnet’s own efforts feel considerably more “alive”, thoughtful, and humanlike.

https://eqbench.com/results/creative-writing-v2/claude-3-5-sonnet-20241022.txt

(Note the small details of the thermal blanket and the origami bird in “The Last Transmission”. There’s nothing really like that in GPT-4o’s stories.)

So if GPT-4o is getting worse, what would that mean?

There are two options:

1) It’s unintentional. In this world, OpenAI is incompetent. They are dumpstering their model to win a leaderboard dick-measuring contest against DeepMind.

2) It’s intentional. In this world, a new, better model is coming, and GPT-4o is being “right-sized” for a new position in the OA product line.

Evidence for the latter: token-generation speed has increased, which suggests they’ve actively made the model smaller.

If this is the path we’re on, I predict that GPT-4o will become a free model soon. And behind the ChatGPT Plus paywall will be something else: Orion, GPT-5, or the full o1.


A miserable listen. One of the most violently wrong-sounding albums I own. It captures a band ready to break up, and its silly melodies and forced-happy tone give it a tragicomic “fiddling on the Titanic” quality. The singer was fired three months after its release, and a year after that the drummer jumped in front of a train.

It’s the only Helloween album that gives me no way in, the only one where the question “what were they trying for here?” has no clear answer. The title and cover suggest a band making a statement for artistic diversity: for breaking out of the power metal ghetto, for doing the unexpected. But “weird” is an adjective, not a noun. An approach, not an identity. You can’t have a band founded on sonic diversity and nothing else: that simply means you don’t have a sound. The cover sums things up—it’s colors for the sake of colors, not actually a painting of anything.

In practice, Chameleon is a three-way solo album between singer Michael Kiske and guitarists Michael Weikath and Roland Grapow, who are now apparently communicating through lawyers who end every correspondence with “conduct yourself accordingly”. The hostility in this hate triangle is palpable, and bleeds through on the record. None of them like or respect what the other two are doing, and at times they almost seem to be sabotaging each other. Also present are the ever-reliable bassist Markus Grosskopf, who does what he can, and drummer Ingo Schwichtenberg, whose paranoid schizophrenia was sadly worsening, and who clearly hates the Beatles- and Queen-influenced songs more than anyone.

It’s horribly overproduced, and an example of how money can’t make bad music good. Songs like “In the Night” are overwrought and overthought, packed with vocal and guitar and saxophone (?) overdubs to disguise how weak they are. Synthesizers prove a particularly hateful presence: even good songs like “Giants” and “I Believe” have cheesy bleep-bloopy one-finger Fairlight arpeggios on them, of the sort you normally hear on Huey Lewis songs. Abominable. If you’re ripping off Queen, couldn’t you also rip off the “No Synthesizers” sleeve notes?

Michael Weikath’s songs have the largest quality delta. “First Time” is an okay hair metal song that passes without much pain. “Giants” is actually a minor classic, and would have fit well on either Keeper album. It has a heavy as hell NWOBHM-influenced main riff, and the chorus is sublime. “Don’t you, won’t you, say that we’ll be free again!” On the other hand, “Revolution Now” is a droning 70s Jimi Hendrix knockoff that’s eight minutes long. It sounds like Oasis’s Be Here Now, and is equally boring. “Windmill” (or “Shitmill”, in Ingo’s memorable term) is the worst ballad ever written by the band: rank, rancid, and insipid.

Roland Grapow’s songs are largely dull. “Crazy Cat” has some big band flash but no good hooks. You’d have to pay me to listen to “I Don’t Wanna Cry No More” again. “Music” has a Pink Floyd-inspired bridge with some fine single-coil Strat guitar soloing, but otherwise is as unmemorable as its title implies. “Step Out of Hell” is filler burdened with yet more synth cheese.

Michael Kiske was never the band’s greatest songwriter. Here, he offers a surprise in “I Believe”, an emotionally bludgeoning but effective ode to faith that’s nearly a masterpiece. It has some wonderful ideas in the Iron Maiden/Manilla Road vein (ironically, he’d soon swear off heavy metal entirely), but it’s just too long and draggy. It needed some tempo changes in the middle. Still, I think this might be the album’s finest track. “When the Sinner” is overlong and mediocre at best, and is overloaded with questionable ideas (if you’re one of the millions of fans who thought “Helloween would sound much better with alto sax solos”, then I’ve got the album for you.) The Paul McCartney-esque “In the Night” is just too sonically confused to stay in the memory.

Not only did Helloween tear to shreds what made them successful, they replaced it with…nothing. Just shallow, derivative imitations of other bands and styles. Chameleon has two good songs and ten bad ones, with saxophones and synthesizers. At times it seems like a practical joke. At least they released it in 1993, when the world’s appetite for retro-progressive dad rock was at an all-time low. The album’s title feels appropriate: it was literally invisible.
