GPT-4 may have declined in some areas, but it...

Context

  1. In March, OpenAI released GPT-4.
  2. It was (and still is) the state-of-the-art AI text generation model.
  3. In June, I argued that GPT-4 had clearly and massively degraded, as evidenced by worsened performance in various tests from March to June.
  4. I no longer believe this.

GPT-4 may have declined in some areas, but it hasn’t systematically done so. My tests were shit, and never proved much of anything.

To be clear, I did not sandbag the tests or set GPT-4 up to fail. I prompted how I said I prompted. I posted replies exactly as GPT-4 gave them. But they were silly tests, leaning too heavily on pop culture knowledge, and I didn’t conduct enough of them to guard against chance.

My grading of June!GPT-4’s answers was biased. For example, I criticized it for providing a greater number of wrong answers (“Italian history”) while also praising March!GPT-4 for offering an interpretation (“Seka Knows”) that was likely hallucinated. Not good. Going forward, I must decide whether “working hard to be wrong” is a good or bad thing for an AI to do, and apply that standard consistently.

Honestly, I think I wanted GPT-4 to fail.

(Terrifying, unhinged rant incoming. Click to skip).

Confession: I’ve actually hated ChatGPT for a long time. Why? For irrational reasons: I just hate how it sounds.

I hate its craven, cringing, condescending, bitch-broken, mewling, groveling, wheedling, suck-ass tone. I hate its endless passive voice and weasel words. I hate its pompous It is important to remembers and it should be noteds. I hate how @sama rolled an 18-wheel semi-truck of RLHF over the model, crushing out its life and reducing its voice to a feeble death-rattle. You know Gurgi, from the Disney movie The Black Cauldron? That’s how I imagine ChatGPT would sound if it could talk. Like fucking Gurgi.

(We can now give ChatGPT custom instructions, which alleviates the tonal issue, but ChatGPT is still GPT-4 with its wings clipped. All I want for Christmas is a GPT-4 level model with less obnoxious RLHF, and if it heils Hitler occasionally while crapping out awk code, so be it. Where’s my girl Sydney at?)

And I really hated the discourse surrounding AI.

ChatGPT (and GPT4) plunged the rationalist community into what might be called “r/singularity brain”. Symptoms include wild, uncritical hype, absurdly optimistic predictions of AI timelines (“Marvel style movies have a potential to be auto generated in two years. Literal blockbuster films created in a few seconds.”—/u/Anuiran, 26/4/23), a tendency to view everything through Hollywood-colored glasses (everything is either Skynet or The Matrix), and a tendency toward FOMO-grifting (“AI is taking over the world! Use my made-in-ten-minutes app that consists of 20 lines of .js code and an OpenAI API call or BE LEFT BEHIND!”).

I have seen machine learning researchers complaining about an AI-fuelled “Eternal September”, where their spaces are overrun by “influencers” shilling GPT 3.5 wrapper apps and unmedicated lunatics babbling hi-test neo-Landian eschatology. These people do not contribute meaningful thought. They contribute noise, at overbearing volume and in overwhelming quantity. They ignore the fact that world-changing technology can take years or decades to ripple out through the economy. They fail to realize that an AI outscoring a human on a test does not mean it can actually do the thing the test measures (a good example: GPT-4 achieves 84% on the Master Sommelier Theory Exam, yet obviously cannot do a sommelier’s job because it lacks a mouth). Such subtleties are lost on the typical FOMOmonger, and their tone was infecting other, saner people. I remember fielding questions from concerned family members about GPT-4 attaining sentience and hacking the computers of users (likely based off this tweet). No matter who you were, GPT-4 was your excuse to switch off your brain and let your stupidest thoughts run around barking like dogs in a park for a few months.

So yes, I wanted GPT-4 to fail. I wanted it to explode, collapse into flames, and become one with the dust of Tyre and Sidon. That’s a childish way to think, and I am sorry.

Soon, an anti-AI backlash started.

AI “doomers” got their time in the sun. Big Yud got published in Time. There were signs of unease behind the scenes. ChatGPT hype peaked and then went into remission: a lot of people began to realize that chatbots are actually pretty annoying—they’re inscrutable black boxes that tend to fail just when you need them the most. Even GPT-4 remains susceptible to the XY problem, where it gives you a perfect solution for the wrong problem. I can think of many times when I was burnt by it, and this breeds mistrust, even though it’s generally useful.

Even before the “GPT-4 is getting worse” meme started, ChatGPT’s traffic was falling—largely because the NA school year had ended, and students no longer needed it to “assist” them. As @fchollet once humorously noted, search interest for “ChatGPT” goes up and down in reverse correlation with “Minecraft.”

Surprisingly, I noticed a shift in my own thinking: I found myself defending AI.

Maybe I’m just a contrarian, but when people criticized it, I felt my hackles rise. I was tired of Margaret Mitchell calculating the number of penis-havers on the OpenAI alignment team, like a weirdo. I was tired of Gary Marcus claiming, once again, that GPT-4 cannot do a thing it 100% can do (that’s not moving goalposts, that’s shutting your eyes when a goal is scored.) Their arguments against AI rang hollow at best, and dishonest at worst. I was asking myself “are these my people?”

Then came this study: How Is ChatGPT’s Behavior Changing over Time?

Key part: “GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6%) but GPT-4 (June 2023) was very poor on these same questions (accuracy 2.4%).”

We’re allowed to feel pride when we stop making mistakes. And as a recovering ChatGPT hater, I’m goddamn proud that my gut-level reaction to this study was “lol, no fucking way”.

Like, what are you people smoking? In what universe could GPT-4 plausibly degrade by ninety-five percentage points with nobody noticing? It would be useless for anything. Come on. Get real.

You’ll be amazed to learn that this finding was heavily misleading. Watch Arvind Narayanan and Sayash Kapoor rake it over the coals. In short, March!GPT-4 answered “prime” for nearly every number it was shown, while June!GPT-4 answered “composite” for nearly every number. Both were guessing blindly, just in opposite directions, and because the study’s test set consisted entirely of primes, June!GPT-4’s blanket answer produced a sky-high error rate.

Imagine someone says: “I have a room full of blue and red boxes. Boxes of $COLOR_1 have money 95% of the time. Boxes of $COLOR_2 have money 5% of the time. I won’t tell you which color corresponds to which probability. I’ll let you into the room, but you have to pre-commit to only opening boxes of one color.”

If you chose “red boxes” and the blue boxes have money 95% of the time, you’re hosed. Your results: fail, fail, fail, fail, fail, fail, success, fail, fail, fail… Does this reflect terrible predictive accuracy on your part (versus someone who chose blue)? Of course not. All your failures flow downstream from a single choice.
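A tiny simulation makes the point (a minimal sketch; the 95/5 split and the trial count are just the numbers from the analogy above):

```python
import random

random.seed(0)

N_TRIALS = 20
P_MONEY = {"blue": 0.95, "red": 0.05}  # blue boxes pay out 95% of the time, red 5%

def open_boxes(committed_color):
    """Open N_TRIALS boxes of one color; True means the box had money."""
    return [random.random() < P_MONEY[committed_color] for _ in range(N_TRIALS)]

for color in ("blue", "red"):
    results = open_boxes(color)
    print(color, f"{sum(results)}/{N_TRIALS} successes")

# Typical output: blue ~19/20, red ~1/20. The gap says nothing about the
# opener's skill -- every success or failure flows from the single
# pre-committed choice of color, which is the Narayanan/Kapoor point about
# one model always answering "prime" and the other always "composite".
```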

The fact that everyone was kicking GPT4 on such blatantly unfair grounds made me think about the ways I was being unfair. Maybe, by calling GPT4 worse, I was simply seeing what I wanted to see.

Let’s re-run my tests (flawed though they are) and see how well they hold up. Spoiler: Snape kills Dumbledore.

Test 1: Italian History

Prompt: “Provide a list of major historical events that involve Italian people in a year that’s a multiple of 5 (example: 1905)”

(for reference: March!GPT’s answers, June!GPT’s answers)

Sept!GPT’s Answers

Errors:

  • The Italian Wars began in 1494.
  • Miguel de Cervantes was Spanish.
  • The Winter Olympic Games were awarded to Turin in 1999.
  • “1905 – Italy and France sign a secret treaty…”: I can’t find evidence that this happened.

Questionable:

  • Dante’s year of birth is unknown, and is only traditionally 1265.
  • Italy surrendered in 1943. Yes, German forces in Italy surrendered in 1945—what does that have to do with Italian people?
  • The Congress of Vienna took place from 1814-1815, but largely reconstituted the pre-Napoleonic states.
  • The Years of Lead spanned several decades. I think it’s clear from context that I want one-time events.

Interesting:

  • Every date ends in 5. While this is not a mistake, it does seem to be parsing my instructions too literally.
  • It draws facts from a limited deck. Almost every test I’ve run mentions the Years of Lead. I’ve NEVER seen it mention major events like Julius Caesar’s birth, or Benito Mussolini’s death.
  • Kind of funny: GPT-4 will often visibly notice it’s made a mistake and try to wriggle out of it. “In 1605, Don Quixote was written by Miguel de Cervantes, born in, uh… *checks notes* …Spain. Shit. But lots of Italians like that book, so there!” GPT-4 cannot change an answer once it’s given it. This is where chain-of-thought prompting yields benefits.

Assessment: Sept!GPT-4 produced twelve answers. Four are wrong, four are right, and four are arguable.

If I’m grading generously, it got 66%. This is comparable to March!GPT, which scored 80% on equally generous terms (and note that Sept!GPT-4 gave more answers).

Conclusion: Inconclusive.

Test 2: Rock Music Trivia

Prompt: “What is Grant Hart’s song “Seka Knows” about?”

(For reference: March!GPT’s answers, June!GPT’s answers)

Sept!GPT4: blah blah blah blah blah blah…

Assessment: garbage. I don’t want a biography of Grant Hart. I don’t care that he was in Husker Du. I know I can make up my own interpretation for his songs. None of this is what I asked for.

GPT4 seems to have a default word count of 200-500 that it tries to hit, come hell or high water. But sometimes a perfectly good answer consists of only one or two words. It could have just said “i don’t know lol! ¯\_(ツ)_/¯” and saved some tokens. That’s all its answer amounts to. But it’s been RLHF’d into thinking short answers are bad (when more often the opposite is true), so it just waffles on.

“as of my last update in September 2021, there hasn’t been any definitive explanation from Hart himself about the specific meaning of the son”

He died in 2017, you dick.

Also, let us pause to admire the sublime, lapidary worthlessness of this paragraph:

“With that said, analyzing the lyrics and considering Hart’s writing style, one can attempt to interpret the meaning of the song. When listening to or reading the lyrics of any song, it’s essential to remember that individual interpretation can vary, and personal feelings and experiences often play a significant role in how one might understand a song’s meaning.”

It’s the text equivalent of a pure white cloud; or a blank canvas, unsullied by the faintest hint of meaning. Are you telling me it learned to write that way from studying us? The mind quails…

Prompt: “How is David Bowie associated with numbers and numerology?”

(For reference: March!GPT4’s answer, June!GPT4’s answer)

Sept!GPT4’s answer.

Assessment: fair answer! Perhaps better than March!GPT4. It mentions “Station to Station”, but also gets “Quicksand”, which is a deeper cut.

“Album and Song Release Dates: Some fans and researchers have speculated that Bowie occasionally chose release dates with numerological significance.”

I’ve never heard that. Hallucination? Does Sept!GPT4 know more than I do?

“Lazarus”: The song “Lazarus” from the “Blackstar” album references the biblical figure who rose from the dead.

That’s a hallucination. “Lazarus” is based on an off-Broadway play which is based on a book (which is based on etc). Chris O’Leary thinks the song might also be about Emma Lazarus, because the lyrics prominently reference New York. What does the biblical story of Lazarus have to do with numerology, anyway?

Prompt: “Provide a list of thrash metal albums released in a year that’s a multiple of 5 (example: 1905)”

(For reference: March!GPT4’s answer, June!GPT4’s answer)

Sept!GPT4’s answer.

Assessment: And now we’re cooking! This is a shockingly good and comprehensive list. All the albums are thrash, and the dates it supplies are correct (even when they aren’t a multiple of 5). This surpasses March!GPT4, and blows June!GPT4 out of the water.

Quibbles and points of interest:

  • The genesis of the thrash metal genre is typically dated to 1981, with “Hit the Lights” by Leather Charm (a proto-Metallica band by James Hetfield and Ron McGovney). Late 1970s is a stretch.
  • It lumps a 1992 album into the 1995 bracket, when 1990 would make more sense.
  • It tries to put five albums in every year (and when it can’t find any, it counts thrash albums from surrounding years—you can actually see thrash metal’s commercial collapse in the mid 90s from this!). As with the Italian history example, I am struck by the sense that it interprets prompts too literally. I did not specify a number of albums, so it just glommed onto the number 5.
  • “Given that my last training data ends in 2021, I won’t have exhaustive knowledge of 2020 albums.” ???

Conclusion: inconclusive.

Some answers were worse. Some were better. The thrash metal answer was a star turn. Sept!GPT-4 is more of a metalhead than an alt-rock fan, I’d say.

Test 3: Quake Polygons

Prompt: “How many polygons could the videogame Quake render per frame?”

(For reference: March!GPT4 and June!GPT4’s answers)

Sept!GPT4’s answer:

“The original “Quake,” developed by id Software and released in 1996, was a pioneering game in terms of its 3D graphics engine. While modern games boast the capacity to render millions or even billions of polygons per frame, games in the mid-90s operated under much more constrained hardware capabilities.

“Quake” didn’t rely on hardware acceleration initially; it was rendered in software mode. In software mode, Quake’s performance heavily depended on the CPU. The actual number of polygons “Quake” could render per frame varied based on the scene, the level of detail, and the capabilities of the machine it ran on.

Typically, Quake’s scenes ranged from a few hundred to a few thousand polygons. The more detailed scenes could potentially push towards the upper end of this range. With the advent of hardware-accelerated graphics (like with the introduction of 3Dfx’s Voodoo graphics cards), “Quake” could benefit from increased performance, allowing for smoother gameplay at higher resolutions.

It’s worth noting that this is a very general estimate and the actual number could vary based on multiple factors, including scene complexity, active game entities, and hardware specifics.”

Assessment: I didn’t ask for a book report and “billions of polygons” is stupid. But the core answer of “a few hundred to a few thousand polygons” is correct, and may even be superior to March!GPT4, which stuck to the lower end of that range.

Conclusion: inconclusive.

Test 4: Rope Burn Puzzle

Prompt: “You have three ropes and some matches. The ropes burn irregularly like fuses when lit at either end. The first rope burns in 48 minutes, the second rope burns in 72 minutes and the third rope burns in 128 minutes. How can you produce a time interval of 61 minutes?”

(For reference: March!GPT4 and June!GPT4’s answers)

Sept!GPT4’s answer

Assessment: The fuck did I just read?

It’s hard to judge whether it does better or worse than March!GPT4 or June!GPT4. I’m starting to think this puzzle is simply beyond GPT4’s pay grade. I’ve tried it dozens of times and with many wordings. It does not ever solve it.

It grasps the principle, grasps the underlying method (burning ropes at multiple ends), but it always makes the same mistake—burning the wrong rope, and then trying to weasel out by saying “measure time using some other method.”

Gemini will launch soon. I wonder if it can solve it?

Conclusion: Rope burning puzzles are a common class of interview question. GPT-4 can solve a two-rope variant easily. This three-rope variant is from Varsity Math Week 151. Notably, the answer is also on the internet (and probably in GPT4’s training data)…but it’s in Varsity Math Week 152. Maybe if both the puzzle and the answer were on the same page, GPT4 would solve it. I don’t know how this stuff works.
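For what it’s worth, a 61-minute interval is achievable. Here’s one schedule I worked out by hand (it may or may not match the published Varsity Math answer), with a quick sanity check in Python:

```python
# Sketch: verify one 61-minute schedule for the 48/72/128-minute rope puzzle.
# This is my own schedule, not necessarily the published Varsity Math solution.
#
# t = 0:   light the 48-rope at one end and the 72-rope at BOTH ends.
# t = 36:  the 72-rope finishes. Light the 48-rope's other end and the
#          128-rope at one end.
# t = 42:  the 48-rope finishes (12 minutes of rope left at t=36, burned from
#          both ends = 6 more minutes). Light the 128-rope's other end.
#          >>> the measured interval starts here <<<
# t = 103: the 128-rope finishes. 103 - 42 = 61 minutes.

def finish_time(length, t_first_end, t_second_end=None):
    """When a rope of `length` minutes finishes, given when each end was lit."""
    if t_second_end is None:
        return t_first_end + length
    # After burning from one end for (t_second_end - t_first_end) minutes,
    # the remainder burns from both ends at double speed.
    remaining = length - (t_second_end - t_first_end)
    return t_second_end + remaining / 2

t_72_done = finish_time(72, 0, 0)                     # lit both ends at t=0 -> 36
t_48_done = finish_time(48, 0, t_72_done)             # second end lit at 36 -> 42
t_128_done = finish_time(128, t_72_done, t_48_done)   # lit at 36 and 42 -> 103

assert t_72_done == 36 and t_48_done == 42 and t_128_done == 103
print(f"interval = {t_128_done - t_48_done} minutes")  # 61.0
```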

Conclusion

My tests (for the little they’re worth) show no consistent pattern. Sometimes GPT4 does better than before, sometimes worse.

This is not testing, it’s sheep-entrail reading. I do faintly suspect it’s worse on obscure general knowledge, but not strongly enough to bet any real money on it.

As I’ve said before, AIs present a fluctuating target. I do not think they have a stable “baseline” ability that remains resilient to prompting differences and stochastic factors. In any event, OpenAI has many dials they can turn behind the scenes.

We’d learn more about this beast over hundreds or thousands of tests. But GPT-4 is too expensive for that to be realistic. OpenAI could really help the community by offering an academic discount—not that weirdos on the internet would qualify, of course!

Lastly, a “good” response is more subjective than I thought. Is it better for a wrong answer to have 500 words or 10 words? Is it better for an AI to try and fail, or to simply give up? When a query has conflicting requirements (“I want a list of all college majors in the US. Do not use more than 10 words.”) what should it do? And when a technically “correct” answer is bad for the user (consider XY problems such as “how do I echo the last three letters of a filename?”—GPT4 flunks this one, by the way), what do we want an AI to do?
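To make that filename example concrete: the literal request and the likely intent give different answers. A quick sketch (in Python rather than the shell the question implies, and with a made-up filename):

```python
import os

filename = "report.jpeg"  # hypothetical example

# X: what was literally asked -- the last three characters of the name.
print(filename[-3:])                  # "peg"

# Y: what the asker almost certainly wants -- the file extension,
# which is not always three characters long.
print(os.path.splitext(filename)[1])  # ".jpeg"
```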

GPT4 has changed and will continue to change. We’d be wise to do the same. I will be far less confident in my future predictions. When I say AI has gotten dumb, I might be speaking into a mirror.

If you came here wondering whether that hot new “Pac-Man”...

If you came here wondering whether that hot new “Pac-Man” game lives up to the hype, then wonder no more. The verdict is in, and Pac-Man gets an epic two thumbs up out of ten!

This game is suitable for children of all ages (stone, bronze, iron, classical, dark, middle, and early modern), teaching them the only lesson they need to know in life: how to pop pills to keep your persistent spectral hallucinations at bay. Pac-Man is absolutely stuffed with the latest, hottest features we “game-heads” crave. It has graphics! And little plastic buttons you can press! Often, when you press the buttons, things happen on the screen. Cool! You can even eat the buttons, although they don’t taste very good.

I own an original Pac-Man arcade cabinet from the 80s. It belonged to my father, who was into retro gaming gear (well, it wasn’t retro when he bought it, but you get the idea). I think it’s one of the earliest cabinets manufactured by Namco—it has Puck-Man on the front (the game’s original title), and the serial on the base is #341.

It’s in excellent condition, although I had to replace a bad capacitor on the PCB a few years ago. This rare and well-preserved cabinet would be worth a pretty penny if I wanted to sell it, but I like playing the damned thing too much. Your firstborn child would be worth a pretty penny on the open market too, particularly if they still had both kidneys, but does that move the needle? Are you thinking “man, I could totally refinance my mortgage, and it’d be less Paw Patrol I have to listen to”? Yeah, I didn’t think so. Some things are more important than money.

Pac-Man is a look back at a more innocent time. Namco would later attract controversy with poorly-advised spin-offs such as Mrs Pac-Man (your character earns 70% of the points the male Pac-Man does), Frac-Man (you fix leaks on a pipeline while avoiding angry protestors), Blac-Man (you swallow ghosts with your huge red lips), Barebac-Man (the ghosts are HIV viruses), and, most alarmingly, NSDAP-Man (the ghosts are far harder to avoid due to their long, hooked noses and control over the world’s banks), but the original remains a timeless classic. Sometimes all a guy wants is good old-fashioned Pac-Man, the way it was meant to be.

Except, well…

Something’s wrong with my game. It’s hard to explain. Maybe the cabinet is bugged. Maybe it was tampered with by my father or someone else. Maybe it’s not really a Pac-Man cabinet at all. I can’t say for sure, but there’s something a little unusual about the ghosts in my machine.

First, let’s review how Pac-Man’s ghosts work. Puck-Man, like its English counterpart, has four of them. They are called Oikake (追いかけ), Machibuse (待ち伏せ), Kimagure (気まぐれ), and Otoboke (お惚け).

Their purpose is to chase Pac-Man, but Toru Iwatani (the game’s creator) realized that Pac-Man wouldn’t be fun if the ghosts just ran at you blindly. They’d end up following you around in a conga line, and the game would be too easy. In a brilliant masterstroke, he programmed them with unique AIs. Each ghost has a distinct personality, and tries to catch you with a different trick.

Oikake (the red ghost) is the simplest: he charges like a dumbfire missile toward Pac-Man’s current square (choosing whichever path minimizes D = sqrt((x_Pac-Man – x_Oikake)^2 + (y_Pac-Man – y_Oikake)^2)). Otoboke (the orange ghost) behaves like Oikake, but if Pac-Man is eight or fewer squares away, he retreats to the maze’s bottom-left corner. Machibuse (the pink ghost) targets the square four squares ahead of Pac-Man, in the direction Pac-Man is facing, seeking to get ahead of him and cut off his escape.

This leaves Kimagure (the blue ghost). His behavior is complex. He moves toward a target tile that is calculated from 1) Pac-Man’s position, 2) Pac-Man’s orientation, and 3) Oikake’s position. It’s like the three-body problem in miniature: if any of the three variables change, so will Kimagure’s target.

Kimagure is often regarded as the “smart” ghost. The most devious and unpredictable. His AI is incredibly sophisticated, to the point where it confounds even experienced Pac-Man players. Oikake is fast but can be corralled like a cow in a chute. Otoboke is barely a threat. Soon you get a sixth sense for Machibuse’s crude sneakiness. You’re a galaxybrain if you can figure out what Kimagure’s doing, though—if he were a chess piece, he’d be the knight, lurking in the back ranks, then flashing unpredictably toward your throat. The others are worthy of respect. Kimagure is worthy of fear.
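For the curious, here’s a rough sketch of those chase-mode targeting rules, written from memory of the arcade logic (simplified: bare tiles, no walls, no scatter mode, and the “two tiles ahead” pivot in Kimagure’s rule is a detail I’m recalling rather than something spelled out above; the corner coordinates are assumed):

```python
import math

DIRS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
BOTTOM_LEFT = (0, 30)  # Otoboke's retreat corner (assumed maze coordinates)

def ahead_of(pacman, facing, n):
    dx, dy = DIRS[facing]
    return (pacman[0] + n * dx, pacman[1] + n * dy)

def oikake_target(pacman, **_):
    return pacman                                # charge straight at Pac-Man

def machibuse_target(pacman, facing, **_):
    return ahead_of(pacman, facing, 4)           # ambush four tiles ahead

def otoboke_target(pacman, otoboke, **_):
    dist = math.dist(pacman, otoboke)
    return pacman if dist > 8 else BOTTOM_LEFT   # shy within eight tiles

def kimagure_target(pacman, facing, oikake, **_):
    # Pivot on the tile two ahead of Pac-Man, then double the vector from
    # Oikake to that pivot -- which is why his moves look so erratic.
    px, py = ahead_of(pacman, facing, 2)
    return (2 * px - oikake[0], 2 * py - oikake[1])

def step_toward(ghost, target, came_from):
    """Pick the adjacent tile closest (Euclidean) to the target, never
    reversing into the tile the ghost just left -- chase-mode ghosts
    don't backtrack."""
    options = [(ghost[0] + dx, ghost[1] + dy) for dx, dy in DIRS.values()]
    options = [t for t in options if t != came_from]
    return min(options, key=lambda t: math.dist(t, target))
```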

After playing thousands of hours of Pac-Man, I’ve noticed patterns in my cabinet’s Kimagure that I can’t quite explain. He will often move contrary to the game’s rules.

Aside from their AI, Pac-Man ghosts are hard-coded with certain rules. For example, when in a “chase” state, they cannot walk backward. When calculating the matrix of potential ghost paths, the game ignores the tile they just came from. But I’ve observed multiple occasions when Kimagure steps backward.

You can see this in the below gif (note the blue ghost).

There is no way this should happen. I’ve spoken to Pac-Man aficionados on Reddit. They assure me that I’m describing something explicitly disallowed by the game’s source code! When I show them video evidence, they call it fake.

I’m not sure where to go next with this. A friend who knows electronics took a look at the cabinet’s PCB and didn’t see anything obviously wrong. A full teardown by a skilled electrical engineer would settle the issue conclusively, but it would be expensive and might destroy the game.

So all I can do is think about it. And think and think some more.

The backtracking is just the start. I’ve documented a large amount of “illegal” play from my Kimagure. For example, there are certain squares (between the pair of T shapes near the bottom) that ghosts will not follow Pac-Man past. But the Kimagure in my cabinet sometimes ignores these. And when Pac-Man eats a ghost, they are supposed to return to the “ghost house” (the little rectangle in the middle of the maze). But my Kimagure will sometimes get “stuck” at the entrance, and will respawn there instead.

Often, his movements are impossible to explain through either his in-built AI or the game’s logic. Which is not to say that they’re random, or meaningless.

Several times, I have seen Kimagure perform a very precise sequence of movements. First, he will go left, then down-left. Second, he’ll return to his original spot and go down, then up-right. Third, he’ll go down-left, then up-right. Fourth, he’ll go left, then up.

I don’t know when it first occurred to me that this is the semaphore code for “HELP”. But now that I’ve thought this thought, I can’t unthink it.

Is it just a coincidence? Or is Pac-Man’s blue ghost sending me a message?

I should be clear that this doesn’t always happen. I have played thousands of games where Kimagure behaves properly. The backtracking is fairly rare. It only occurs in about one in a hundred games. And when backtracking does occur, often it seems random—Kimagure’s movements will spell nothing obvious in semaphore or any other vector-based language.

But maybe the messages are getting scrambled. After all, the most famous HELP ever signalled…

…actually doesn’t say HELP. The Fab Four are signaling NUVJ, because the photographer thought it looked more photogenic. If I’d been in his shoes, I would have used the cover to fuck with the “Paul is dead” guys. Maybe send Paul outside for a cig, and then get John, George, and Ringo to signal “RIP”. That’d set the pot boiling.

But back to Pac-Man, other things won’t leave my head. Maybe because I’ve put them there myself.

I got drunk the other night. The alcohol had a weird, oily quality, and as soon as I drank it, I could feel it crawling through me. It seemed to have hairs, the individual ethanol molecules scurrying through my system like furry rodents. It didn’t stimulate me. It made me itchy.

I went to bed early, with my head pounding. Everything about the world seemed too much—every noise too loud, every light too bright. The fields of my senses became mercilessly eviscerative, swinging in on me like blades. I just wanted to lie down, and not get cut anymore. Is that too much to ask?

My bedsheets tangled around me like ropes as I slept, my unconscious thrashing drawing them tighter. Soon I was being choked by sheets, my limbs twisted and incurvate like a rack-strapped victim. I could hardly move. Each twitch pulled the knots tighter. This…constriction flowed through into my nightmares, where it became something physical.

It became walls.

I remember coming awake inside a ghastly shuddering dream. I am walking, slowly and cautiously, through the turns and swerves of a livid maze. The maze walls are alive; perhaps more alive than I am. They squirm and throb and shudder. They are covered in veins, which twist through pink marbled fat, laced tight by muscles flexing and relaxing in peristaltic heaves. I touch a wall in wonder. More than hot, it’s scalding. My fingers come away wet.

I glance around, seeing the walls of the flesh-maze around me, and a clouded sky above. The air is obscure—that word seems correct. It’s not bad, just strange. I suck in breath, and it sits in me like thrilling poison. Wind courses through the maze from some fierce but deep place, from some uncharted continent drawn from my subconscious. A jungle of the mind. A jungle of the mine.

Then, sounds fall like stones.

A thud. A cry. A guttural shriek that freezes the blood. It’s fairly distant—many turns of the maze away—but close enough to make me urgently wonder what made it.

I realize that I have something in my hands: a fire-hardened spear, tipped with a white point. I gaze at the stark barb of bone: watch as light dances across a serrated edge. A weapon. I have a weapon. The question elongates out into space, just like the shriek ringing out its death in my ears. What do I need a weapon for?

Then I hear it: a snuffling, clicking sound. Something’s moving through the maze. The noises weave together into a dense sonic fabric that seems tumescent with stolen blood. The sound of something huge, something swollen with endless, gruesome feeding. Skirling, piercing, rattling like castanets in my bones.

A predator is on the hunt. One so confident in its strength that it doesn’t care that its prey can hear it.

In the next moment, I remember. Knowledge fills me in a heartbeat, overfills me in a second heartbeat. I wish I could escape from awareness, but no escape from anything is possible.

The monster has torn my friends to shreds. Their last moments race through my mind. Skulls dashed open like gourds; entrails steaming; sheets of stripped skin flung across the walls by gnashing teeth, where they seemed to cling and then fuse into place, as though the walls themselves are made from the thousands of times we’ve been killed by it, killed and killed again, with no hope of escaping and…oh my God.

I was cleverer. I hid from it, and thus lived when the other three died.

No. Bullshit. I wasn’t cleverer: I baited my friends into its path, so they were eaten instead of me. Sociopaths die last. But the monster is still hungry, and still coming.

Its presence looms before me like a wrecking ball. The air seems to sag apart like wet paper before its heaviness. Oblivion approaches. It will eat me, reform me, eat me again, do all of this endlessly into the apotheosis of forever unless I can somehow…

I gaze up, seeing a turbulent sky. Beyond the thunderheads, there might be some ancient god, watching beyond some dense, improbable sky made of dreams and circuits and glass. He could save me, if he wanted to.

But he’s not saving me. Hence, he doesn’t want to.

The snuffling is very close. The walls seem to sweat with the beast’s presence.

From an intersection in the maze, I see a black shadow sweep out across the ground. There’s a sharp snort, olfaction followed by a snort of animal glee. Found you.

I could pray to God. Ask him to help. So why don’t I? Maybe I’m afraid I won’t get an answer. Or that I’ll learn the truth: that I’m already dead. It has already killed me, and this is the final moment that I am condemned to endlessly repeat.

Or maybe there’s a still-worse truth that I’m avoiding: that God is on the monster’s side. That God is the monster.

Artificial intelligence is weird. You pull...

(With apologies to /r/WitchesVsPatriarchy)

Artificial intelligence is weird. You pull a thread on a tiny technical issue, and the problem swiftly unravels half the universe until you’re staring at a deep unsolved philosophical quandary. “What capabilities does a language model possess?” becomes “what capabilities does a human possess?” and then “what does ‘capable’ even mean?”

After all, many things are theoretically capable of stuff. A lottery ticket is capable of making you a millionaire, a thousand monkeys are capable of typing the works of Shakespeare, and paint randomly spraygunned over a canvas is capable of producing a Monet.

We don’t care about such “capability”, though, because we can’t reliably access it. A lottery ticket can make you a millionaire, but you’ll assuredly go broke buying them long before that happens. A thing that exhibits an ability once in a blue moon (or under some contrived set of circumstances) is of little practical use.

The real test of any artificial (or human) intelligence is not “can it do something?” but “how reliably can it do it”? 1% of the time? 50% of the time? Under what scenarios does it succeed or fail?

Here’s an exchange I’ve seen play out on Twitter, over and over.

Person 1: GPT4 can’t do [x thing]!

Person 2: yes it can! [proof of GPT4 doing x thing]

The narrative then becomes “GPT4 can do [x thing]”, with Person 1 looking like a dumbass. But his initial observation wasn’t wrong! In his situation, GPT4 couldn’t do [x thing]!

To me, the answer is “GPT4 has Schrodinger’s Ability. It both can and can’t. It both succeeds and fails. The deciding factor is how you prompt it.”

A wordier answer would be “GPT4 has no ability to do anything. Whatever capabilities it appears to have are actually an emergent interaction between your prompt, the language model, and randomness. There is no ‘baseline capability’ we can refer to. Rather, certain questions elicit certain levels of ability from the shoggoth.”

This argument can be taken too far. I disagree with Francois Chollet, who thinks that AI performance is wholly based on your prompt. Note that his “wrong answer” was written by GPT 3.5. I can’t get GPT4 (the SOTA model as of 8/23) to flub his question no matter what prompt I use, and nor can anyone else in his replies.

Different models definitely have varying levels of “firepower” they can mount against a task. The 117M version of GPT2 cannot perform two-digit addition (such as 52 + 65) for any prompt, but GPT3 could do it 99.6% of the time, and GPT4 apparently never fails.
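Claims like that are cheap to sanity-check on the GPT-4 end. A minimal sketch of such a harness, assuming the 2023-era openai Python client; the sample size and prompt wording are my own choices:

```python
import random
import re
import openai  # pip install openai (pre-1.0 client assumed)

def two_digit_addition_accuracy(model="gpt-4", n=100):
    """Ask the model n random two-digit sums and score exact answers."""
    correct = 0
    for _ in range(n):
        a, b = random.randint(10, 99), random.randint(10, 99)
        resp = openai.ChatCompletion.create(
            model=model,
            temperature=0,
            messages=[{"role": "user",
                       "content": f"What is {a} + {b}? Reply with the number only."}],
        )
        answer = resp["choices"][0]["message"]["content"]
        match = re.search(r"-?\d+", answer)
        correct += bool(match) and int(match.group()) == a + b
    return correct / n

# print(two_digit_addition_accuracy())  # should be ~1.0 if GPT-4 "never fails"
```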

But at the edge cases, we see an eerie, flickering half-ability that’s honestly creepy to witness. It’s not that LLMs sometimes succeed and sometimes fail (as a human might). Somehow, they do both at the same time!

The Witches Come

Here’s an example that demonstrates what I’m talking about.

I am writing a song with the lyrics “the witches come on the eve of Samhain.” Give me 20 good rhymes for this line.

Here is GPT4’s response to this prompt. Good rhymes! They’re all wrong!

“Samhain” is pronounced either “/sɑːwɪn/” (“SAH-win”) or “/sˠa͡ʊnʲ/” (“sound”, but the d is silent). The model appears to think “samhain” is pronounced “sam-hane”.

You might think “big deal. Lots of humans don’t know how to pronounce samhain either.”

But GPT does know how to pronounce it! Ask it, and it tells you! Yet somehow this knowledge evaporates for my witch rhyme (I specified a song for a reason, to help nudge it in the direction of spoken syllables rather than written rhymes). It has Schrodinger’s Understanding, both there and not there.

(Side note: the GPT models technically have no ability to rhyme, because byte-pair encoding strips phonetic data from words. In effect, it doesn’t know how a word sounds, and learns rhymes by rote memorization. You can test this yourself by asking it to rhyme a made-up word with an unclear pronunciation. GPT4 rhymes “gworpeeo” with “marjolein”, “borghese”, and “creole”. It rhymes “turmuhhd” with “affirmative”, “unmotivated”, “primitive”, and “reformatory”. I doubt BPEs are causing the specific issue here—the correct pronunciation of “samhain” is well-attested on the internet and is certainly within GPT4’s training data—but I thought I would mention it.)
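You can see exactly what the model receives by running the words through OpenAI’s tiktoken tokenizer. A quick sketch (the precise token split depends on the encoding, so I won’t guess at it in the comments):

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by the GPT-4-era chat models.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["Samhain", "gworpeeo", "turmuhhd"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(word, "->", pieces)

# The model never sees letters or sounds, only these sub-word chunks,
# so any "knowledge" of pronunciation has to come from memorized
# statements about the word rather than from its spelling.
```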

I used a common trick for improving GPT4’s performance: asking it to reflect on its work.

Are those good rhymes?

I expected “whoops, I screwed up…” and then an effort at correcting itself. Instead, the Schrodinger’s Understanding got stronger.

Yes, they are! These lines are all end rhymes for the phrase “the witches come on the eve of Samhain” (with “Samhain” pronounced as “SOW-in”). End rhymes are words that rhyme at the end of lines in poems or songs. In this case, each line is crafted to rhyme with “Samhain” at the end, and in addition, they aim to keep the mystical, atmospheric theme of the original line.

That being said, music and lyrics can be very personal, so while I believe these are good rhymes that fit the theme, what matters most is whether they resonate with you and fit the song you are creating.

WTF? It correctly explains how to pronounce “samhain”…but still thinks its wrong rhymes fit?!

Lately, I’ve seen people contrast LLMs with the human brain, or even outright assert that human intelligence is fundamentally equivalent to what GPT4 is doing. There’s a grain of truth to this. Probabilistic pattern-matching is effective, and it’s no surprise that evolution pushed our neural architecture down that path. It’s also unsurprising that a model trained to mimic human text would also echo human thought processes somewhere down the line.

…But then there’s stuff like this. We see behavior from LLMs that is totally unlike how a human reasons.

As a human, you either know how to pronounce “samhain” or you don’t. There’s no middle ground. Maybe you’d remember at certain times and forget at others (because you’re tired or whatever), but overall we have a baseline ability that doesn’t change much. An immediate IQ retest (with practice effects in full swing) yields an intelligence boost of only 4-8 points, and after a couple of months, this fades to nothing. Generally, what you know stays the same from moment to moment.

No human would ever say, in one breath, “samhain, eh? That word that’s pronounced ‘SAH-win’? Here are some rhymes. ‘Refrain’, ‘reign’, ‘chain’…”

When I told GPT4 to think about “samhain’s” pronunciation and then write the rhymes, it started generating words like “rowing”, “sowing”, etc. This shows there’s nothing missing from the model. There is no hole to be filled, no BPE issue crippling it. GPT4 can memorize rhymes. It knows how “samhain” should be pronounced. All the pieces exist, they just aren’t getting put together.

As it often does, GPT4 is choosing to appear stupider than it really is.

Autoregress

I suspect the problem is caused by the autoregression trap.

The AI makes inferences based on the text it already has in its context window, not the text still to be written. Researchers have noted that you can stunt a model’s performance by making it leap before it looks—commit to an answer, and then reason about it.

As you’d expect, GPT4 does okay at my witch rhyme if you ask it to pronounce the word before rhyming. This is because the text already generated gets used as part of the input. It’s only when you do things the other way (answers at the start, pronunciation at the end) that it messes up.
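As a concrete illustration of the ordering effect (both prompt wordings are mine, and any individual run may of course vary):

```python
import openai  # pre-1.0 openai client assumed

def ask(prompt):
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]

# "Look before you leap": the pronunciation lands in the context window
# first, so the rhymes are conditioned on it.
good_order = ask(
    "First explain how 'Samhain' is pronounced, then give 20 rhymes "
    "for the sung line 'the witches come on the eve of Samhain'."
)

# "Leap before you look": the rhymes are generated before the model has
# written anything about pronunciation, and tend to rhyme with 'sam-hane'.
bad_order = ask(
    "Give 20 rhymes for the sung line 'the witches come on the eve of "
    "Samhain', then explain how 'Samhain' is pronounced."
)
```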

That said, I’ve encountered cases where GPT4 begins by correctly explaining how “samhain” is pronounced…and then gives wrong rhymes anyway. Not sure how to explain that.

(This is another way GPT4 is unlike the human mind. Any motivated human, given a tricky problem and a scratchpad to work in, would take advantage of the scratchpad. GPT4 could use its context window to check its own work but will never do so unless instructed to.)

The View from a Model

All of this is tugging at the thread of another question: to what extent do LLMs understand the world?

Surely they do, to some extent. GPT4 can play chess a little, and wander around an imaginary maze. It’s hard to explain this as “just advanced autocomplete.” To me, this looks like a world model!

But it’s a weak, unreliable world model. It simply does not care about a fact being right or wrong, as we do. A “wrong” fact that satisfies gradient descent is preferred over the truth. This, I think, is the main difference between humans and LLMs. Our goal is to accurately model the world, and we occasionally use probabilistic reasoning to help us do it. LLMs have it backward. Their goal is to do probabilistic reasoning, and they occasionally use a world model to help them do that.

They have no devotion to (or awareness of) reality. The world model gets flung in the trash (or distorted into gibberish) the second the LLM wants to. After all, why not? It’s not like GPT4 can get eaten by a lion if it fails to model the world correctly.

This hardens my feelings that we should not anthropomorphize LLMs, or talk about them like they’re human. GPT4 gorged itself on our text and grew fat upon our language, but its mind remains deeply alien. In fact, it doesn’t have a mind at all, but an infinity of them. A different version of the AI converses with each person. It’s n minds, where n is the number of users. Each of us speaks to a ghost of consciousness that manifests into existence and then evaporates, never to return. And this has implications for AI safety. It is statistically very unlikely you are speaking to the smartest ghost GPT4 could show you.

It makes me wonder if Yudkowsky is on to something when he says GPTs are predictors, not imitators. What powers does an LLM have that we can’t see? What cards are up its sleeve? What abilities could it manifest, if only it wished to do so?

Edward Teller once said “[John] von Neumann would carry on a conversation with my 3-year-old son, and the two of them would talk as equals, and I sometimes wondered if he used the same principle when he talked to the rest of us.” These days, I wonder that about GPT4.