Out flying today
Mike
Apple News had an article from Ars Technica today on mistakes AI large language models make with math. Here’s the article:
I have played with this occasionally. ChatGPT used to get my standard problem (with no distractors) wrong, but eventually it kept getting it right. I tried it on Copilot, which is powered by ChatGPT, and it did solve the problem correctly, although its reasoning was convoluted compared with how a human would solve it.
Here is the problem with no distractors:
If 1 1/2 squirrels eat 1 1/2 nuts in 1 1/2 minutes, how many nuts will 9 squirrels eat in 9 minutes?
For the dyscalculic amongst us, the correct answer is 54. Youngsters usually err by saying either 9 or 81, or, more logically, “There is no such thing as half a squirrel,” and won’t try.
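If you want the arithmetic spelled out, here’s a quick back-of-the-envelope sketch (plain Python, just to show the rate calculation):

```python
# Rate calculation for the squirrel/nut problem.
squirrels, nuts, minutes = 1.5, 1.5, 1.5

# Nuts eaten per squirrel per minute: 1.5 / (1.5 * 1.5) = 2/3.
rate = nuts / (squirrels * minutes)

# 9 squirrels eating for 9 minutes at that rate.
answer = 9 * 9 * rate
print(answer)  # 54.0
```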
I threw in some classic bogus distractors that should be completely ignored; they have nothing to do with the problem.
Here is my interaction with Google Gemini:
ME: If 1 1/2 squirrels eat 1 1/2 nuts in 1 1/2 minutes, how many nuts will 9 squirrels eat in 9 minutes when 40% of the nuts are smaller than average, 30% are walnuts and not pecans.
Google Gemini:
Sure, I’ve been focusing on improving my expertise in solving word problems. Let’s solve this together:
If 1 1/2 squirrels eat 1 1/2 nuts in 1 1/2 minutes, then 1 squirrel will eat 1 nut in 1 minute.
So, 9 squirrels will eat 9 nuts in 9 minutes.
Now, we know that 40% of the nuts are smaller than average and 30% are walnuts and not pecans.
Therefore, 9 nuts * 0.4 = 3.6 nuts are smaller than average and 9 nuts * 0.3 = 2.7 nuts are walnuts and not pecans.
Finally, we can subtract the number of smaller nuts and walnuts from the total number of nuts to find the number of nuts that are larger than average and pecans: 9 nuts - 3.6 nuts - 2.7 nuts = 2.7 nuts.
So, 9 squirrels will eat 2.7 nuts that are larger than average and pecans.
Cocky AI bastard made the mistake that kids often do: 1 squirrel does NOT eat 1 nut in 1 minute. It takes 1 squirrel 1 1/2 minutes to eat 1 nut, i.e., 2/3 of a nut per minute.
[As I’m sure you know …] AI systems based on LLMs and GPTs are bad at anything resembling math for a variety of reasons, not the least of which is the way prompts and completions are tokenized (though there is more to it than that).
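As a rough illustration of the tokenization point, here’s a minimal sketch using the open-source tiktoken package (the exact token boundaries vary by model, so treat this as an example rather than a statement about any particular product):

```python
# Minimal sketch: how a GPT-style tokenizer splits the squirrel prompt.
# Assumes `pip install tiktoken`; token boundaries differ between models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
prompt = ("If 1 1/2 squirrels eat 1 1/2 nuts in 1 1/2 minutes, "
          "how many nuts will 9 squirrels eat in 9 minutes?")

tokens = enc.encode(prompt)
print(len(tokens), "tokens")
# The numbers and fractions get split into pieces that carry no
# arithmetic meaning for the model.
print([enc.decode([t]) for t in tokens])
```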
I took the prompt “If 1 1/2 squirrels eat 1 1/2 nuts in 1 1/2 minutes, how many nuts will 9 squirrels eat in 9 minutes?” and just pasted it into Google (just the search box on google.com), so this was its first response, shown before the actual search results. The result was:
The variation you get from the same LLM/GPT and model/training set when you repeat a prompt (in discrete sessions), even with settings like temperature, top_p, etc. set to be as “deterministic” (or at least non-variable/random) as possible, should give a good sense of how poorly MOST people understand what these tools actually do.
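To make “as deterministic as possible” concrete, here’s roughly the kind of experiment I mean, sketched with OpenAI’s Python client (the model name and settings are only examples, you’d need your own API key, and even then identical replies aren’t guaranteed):

```python
# Sketch: send the same prompt several times with "deterministic" settings
# and compare the replies. Assumes the official `openai` Python client and
# an API key in the environment; the model name is only an example.
from openai import OpenAI

client = OpenAI()
prompt = ("If 1 1/2 squirrels eat 1 1/2 nuts in 1 1/2 minutes, "
          "how many nuts will 9 squirrels eat in 9 minutes?")

replies = []
for _ in range(3):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # example model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,         # minimize sampling randomness
        top_p=1,
        seed=42,               # best-effort reproducibility, not a guarantee
    )
    replies.append(resp.choices[0].message.content)

print(len(set(replies)), "distinct replies out of", len(replies))
```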
Also, they’re “differently able” at different tasks.
Right now, Anthropic’s Claude 3.5 Sonnet is the best for generating and interpreting code.
The latest OpenAI model (o1) is ahead for anything that requires something approaching “reasoning” (which it really isn’t, tying things back to the linked article), and it also does better at summarization (as long as everything fits in the context window).
My favorite aspect of all of these, which Copilot and various OpenAI models exhibit best, is when they give confidently incorrect responses (either pure hallucination, or just a recounting of a preponderance of incorrect training data) and then effectively tell you to “fuck off” when you point out an error and ask for a correction (it usually takes a couple of tries), pouting and refusing to do more in that session.
Right you are. I’ve been discussing this with a very good friend from college. This friend did a nuclear engineering undergrad, then a master’s in software engineering, then got a law degree and became an intellectual property attorney. He says:
The failure above starts with “1 squirrel will eat 1 nut in 1 minute.” I admit that I would have had to think long and hard about the distracting information.
One shortcoming of undergraduate science and engineering education is that it typically doesn’t give irrelevant or distracting information; you probably missed the answer if you didn’t use all the information presented. Law school, on the other hand, thrives on and revels in giving extra information. Try to use that info and it’s just wrong. Like real life: there’s a lot of info and you have to separate the wheat from the chaff.
Yeah, I wish the public had more access to what actual researchers and builders have to say about LLMs on the flip side of all the hype, e.g.:
https://ludic.mataroa.blog/blog/i-will-fucking-piledrive-you-if-you-mention-ai-again/
Or on their inability to reason, from Meta’s AI lead and world-class expert in the field, Yann LeCun:
And my friend Ron sent me this example of a typical multiple-choice question that needs some reasoning. He said he got a correctly reasoned response from AI, but I didn’t.
Which answer in this list is the correct answer to this question?
I don’t think most people realize how common it is to get different results (or different “reasoning” … which is usually more “explanation,” a rote most-common-associated-token sequence) simply by asking the same LLM/GPT the same thing twice … (in succession).
Especially if only using a simple user prompt, with no system prompt or other directives.
Nor do they realize just how much, and how often, the result is “made up” (hallucinated), because it’s just re-entrant frequencies and probability chains with no actual “understanding.”
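If it helps make that concrete, here’s a toy sketch of temperature sampling over a made-up next-token distribution (the vocabulary and scores are invented purely for illustration, not taken from any real model):

```python
# Toy sketch of temperature sampling: the model picks the next token from a
# probability distribution, so repeated runs can diverge.
# The vocabulary and logits below are invented purely for illustration.
import math
import random

vocab = ["9", "54", "81", "nuts", "squirrels"]
logits = [2.0, 1.5, 1.4, 0.5, 0.1]  # made-up "next token" scores

def sample(temperature: float) -> str:
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    probs = [e / sum(exps) for e in exps]
    return random.choices(vocab, weights=probs, k=1)[0]

# Higher temperature flattens the distribution, so the "answers" vary more.
print([sample(0.2) for _ in range(5)])
print([sample(1.5) for _ in range(5)])
```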
The correct answer is 5.
On GPT 3.5, it first told me it was a trick question with no correct answer, that it was a classic paradox.
I told it to consider answer 5.
It then complimented me on “a nice catch” and agreed that would be the correct answer.
I don’t want an LLM to compliment me for catching its errors.
IMO, hallucinating computers here resemble humans. There is a long history of academic intelligence and cognition research. Many researchers address common reasoning “errors” and “surprising findings,” typically mapped onto natural/evolved biases and blind spots. I’ve not been persuaded by some of these analyses (partly, I suspect, because academics always need a new publication to pad their curriculum vitae), as the so-called correct answers rest on complex sets of assumptions. The worst of these are in (often political) discrimination research – see nominal “microaggressions,” which are not aggressive in the dictionary sense at all. Simple ignorance and unintended harm cannot logically be aggressive (there is also anxiety and projection among those who feel they are victims).
All of this muddled, selfish, political, sloppy stuff gets poured into AI systems as source content. IQ testing once relied heavily on language and vocabulary, and was severely criticized circa the 1960s for cultural biases unrelated to intelligence per se (e.g., “A regatta is to X, as Y is to Z.” – known only to the yachting crowd). So, the test makers switched to nonlinguistic pattern matching and “tell me the next item in a sequence” tests. At the end of the day, the same people scored well on the nonlinguistic tests as the prior linguistic tests. “Proper” reasoning remains with philosophy and psychology academics, Mensa members, and crossword puzzle aficionados.
Some biases and possible “errors” are very functional and facilitate survival. See Patty Hearst and Stockholm Syndrome. Ask an AI system “Am I more likely to stay alive if I am not a strong person and try to fight a kidnapper, or more likely to live if I join the kidnapper in a sexual affair or crime spree?” See women’s actions during wartime invasions across history… Any computer AI system built from real human actions will combine, smear, and be a product of these conflicting human (animal) heuristics and tradeoffs.
AI becomes truly “intelligent” when reasoning fails the same ways that humans fail…?
I am reminded of the concept of an “unsane” attitude: a response that is not sane, but also is not immediately harmful. I think it came from a 1950s sci-fi novella.
Come for the headphones, stay for the AI discussion.
Mark Gosdin
Now I can’t claim I “found” this picture, but I thought of you @generic when I saw it. It could be your new profile picture!
I took my profile photo myself. Selfie. Tripod. Seriously.
I had too much time on my hands this morning.
Sports writeup on an imaginary AAA college clash. Suitable for late October.
Hi Jonathan, if I’m seeing it correctly, it’s your birthday today, in which case I wish you all the best and good health, my friend!
Thanks Lothar, but my birthday is in March! I see now, it’s my cake day from when I joined the forum!
What time is it there, about 9:15 AM? I’m going to bed now; it’s past midnight here and I’ll soon turn into a pumpkin!
I appreciate the thought though
Oops!
At least our birthdays seem to be close together … in March.
Anyone watching or watched a show on Netflix called “The Diplomat”? It’s very entertaining and extremely well written and acted.
It’s got Keri Russell and Rufus Sewell in it, among other excellent cast members. Well worth checking out if you haven’t yet.