As a graduate student I was actually given tests that more closely resembled the second scenario the author described: difficult problems in GR, a whole weekend to work on them, and no limits on which people or references I consulted.
This sounds great until you realize there are only a handful of people on earth who could offer any help, and the proofs you have to write are not available in print anywhere.
I asked one of those questions of Grok 4 and its response was to issue "an error". AFAIK, in many results quoted for AI performance, filling the answer box yields full marks, but I would have received a big fat zero had I done the same.
As a physics undergraduate I had similar-style tests for my upper-division classes (the classical mechanics professor loved these). We'd have about 3 days to do the test, open book, open internet[0], and the professor extended his office hours, but no help from peers. It really stretched your thinking. It removed the time pressure but gave you a real sense of what it was like to be a working physicist.
Even though in the last decade a lot more of that complex material has appeared online, there's still a lot that can't be found there. Unfortunately, I haven't seen any AI system come close to answering these types of questions. The answers can look right at a glance but often contain major errors pretty early on.
I wouldn't be surprised if an LLM can ace the Physics GRE. The internet is filled with its test questions and there are so few variations. But I'll be impressed when they can handle one of these types of tests. They require that you actually do world modeling (and not necessarily of the literal world, just the world that the physics problem lives in[1]). Most humans can't get these right without drawing diagrams. You have to pull a lot of different moving pieces together.
[0] You were expected to report it if you stumbled on a solution somewhere. No one ever found one, though.
[1] An important distinction for those working on world models: what world are you modeling? Which physics are you modeling?
Would you mind sharing a sketch of one problem from the tests you mention? I'm interested in what they look like.
The triple-starred questions at the back of the exercise sections of textbooks are of this calibre.
In CS, check Knuth's book.
Nobody will do this; there are only so many questions that can be asked.
Generally, a test like this will ask you to derive some result and then expand on it in several ways. I concur with other posters that the important part is how you set up the fictions you will rely on. If you get that wrong then all that follows is wrong; if you make a mistake you turn in many pages of garbage. I found one either achieves near 100% or abject failure; there is not much in between.
The thing with very hard physics is that when you are around people who understand it, you get the feeling you understand it too, and maybe you do, but in the end there is a 1/r understanding potential around the people who really do understand.
It's been a decade, so I don't have any of the actual tests anymore. But the class used Marion and Thornton's Classical Mechanics[0] and occasionally pulled from Goldstein's book[1]. It was an undergrad class, so we only pulled from the latter in the Classical II class.
For these very tough physics (and math) problems, the most complex part is usually just getting started. Sure, there would always be some weird, complex calculation that needed to be done, but often by the time you got there you had a general sense of what actually needed to be solved, and that gave you a lot of clues. In classical we were usually concerned with deriving the Hamiltonian of the system[2]. By no means is the computation easy, but I found (and this seemed to be common) that the hardest part was getting everything set up and ensuring you had an accurate description from which to derive. Small differences can be killers, and that was often the point. There are a lot of tools that give you a kind of "sniff test" as to whether you've accounted for everything, but many of these aren't available until you've already gotten through a good chunk of the computation (or all of it!). Which, tbh, is really the hard part of doing science: the attention to detail, the nuances. That should make sense; if this didn't matter, we'd have solved everything long ago, right?
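To make that concrete, here's a minimal sketch (my own toy example, a simple pendulum, not one of the exam problems) of that "setup versus turning the crank" split: once the Lagrangian is written down correctly, getting to the Hamiltonian is mostly mechanical symbolic manipulation, here done with sympy.

    # Hypothetical illustration: simple pendulum, Lagrangian -> Hamiltonian via sympy.
    # All of the physics lives in writing T and V correctly; the rest is cranking.
    import sympy as sp

    m, g, l = sp.symbols("m g l", positive=True)
    theta, theta_dot, p = sp.symbols("theta thetadot p")

    # Setup (the hard part): kinetic and potential energy of the system.
    T = sp.Rational(1, 2) * m * l**2 * theta_dot**2
    V = -m * g * l * sp.cos(theta)
    L = T - V  # Lagrangian

    # Mechanical part: canonical momentum, then the Legendre transform H = p*q_dot - L.
    p_expr = sp.diff(L, theta_dot)                            # p = m*l**2*thetadot
    theta_dot_sol = sp.solve(sp.Eq(p, p_expr), theta_dot)[0]  # invert for thetadot(p)
    H = sp.simplify((p * theta_dot - L).subs(theta_dot, theta_dot_sol))
    print(H)  # p**2/(2*l**2*m) - g*l*m*cos(theta), up to term ordering

Numbers only enter at the very end, if at all, which is what footnote [2] below is getting at.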
In the lab section of my optics class we were also tested on things like setting up a laser so that it would properly lase. I was one of only two people in my cohort who could reliably do it. You had to be very meticulous and constantly think about how the one part you're working with interacts with the system as a whole. Not to mention the poor tolerances of our lab equipment lol.
Really, a lot of it comes down to world modeling. I'm an AI researcher now and I think a lot of people are oversimplifying what this term actually means. Like many of those physics problems, it looks simple at face value, but it isn't until you get into the depths that you see the beauty and complexity of it all.[3]
[0] https://www.amazon.com/Classical-Dynamics-Particles-Systems-...
[1] https://www.amazon.com/Classical-Mechanics-3rd-Herbert-Golds...
[2] Once you're out of basic physics classes you usually don't care about numbers. It is all about symbolic manipulation. The point of physics is to generate causal explanations, ones that are counterfactual. So you are mainly interested in the description of the system, because from there you can plug in any numbers you wish. The joke is that you do this and then hand it off to the engineer or the computer.
[3] A pet peeve of mine is when people say "I just care that it works." I hate this because it is a shared goal no matter your beliefs about approach (who doesn't want it to work?! What an absurd dichotomy). The people who think an AI system needs to derive (learn) realistic enough laws of physics are driven by being explicitly concerned with things working. It's not so much about "theory" as that this is a requirement for a generalizable solution. They understand how these subtle differences quickly cascade into big differences. Your basic calculus-level physics is good enough for a spherical chicken in a vacuum, but it gets much more complex when you want to operate in the real world. Unfortunately, there are things that can't be determined purely through observation (even in a purely mechanical universe).
This does a great job of illustrating the challenge of arguing over these results. Those in the AGI camp will argue that the alterations are mostly what makes the AI so powerful.
Multiple days’ worth of processing, cross-communication, picking only the best result? That’s just the power of parallel processing and why they reason so well. Altering the problem to a more standard prompt? Communicating in a stricter natural language helps reduce confusion. Calculator access and the vast knowledge of humanity built in? That’s the whole point.
I tend to side with Tao on this one, but the point is less about who’s right and more about why there’s so much arguing past each other. The basic fundamentals of how to judge these tools aren’t agreed upon.
> Calculator access and the vast knowledge of humanity built in? That’s the whole point.
I think Tao's point was that a more appropriate comparison between AI and humans would be to compare the AI with humans who have calculator/internet access.
I agree with your overall point, though: it's not straightforward to specify exactly what an appropriate comparison would be.
Would be nice if we actually knew what was done so we could discuss how to judge it.
That recent announcement might just be fluff or might be some real news, depending. We just don't know.
I can't even read into their silence - this is exactly how much OpenAI would share in the totally grifting scenario and in the massive breakthrough scenario.
Well, they deliberately ignored the requests of IMO organizers to not publish AI results for some time (a week?) to not steal the spotlight from the actual participants, so clearly this announcement's purpose is creating hype. Makes me lean more towards the "totally grifting" scenario.
Amazing. Stealing the spotlight from High School students is really quite something.
I'm glad that Tao has caught on. As an academic it is easy to assume integrity from others, but there is no such thing in big-business software.
> As an academic it is easy to assume integrity from others
I'm not an academic, but from the outside looking in on academia, I don't think academics should be so quick to assume integrity either.
There seem to be a lot of perverse incentives in academia to cheat, cut corners, publish at all costs, etc.
The source of this claim is a tweet.[1] The tweet screencaps a mathematician who says they talked to an IMO board member who told them "it was the general sense of the Jury and Coordinators that it's rude and inappropriate for AI developers to make announcements about their IMO performances too close to the IMO." This has now morphed into "OpenAI deliberately ignored the requests of IMO organizers to not publish AI results for some time."
[1] https://x.com/Mihonarium/status/1946880931723194389
The very tweet you're referencing: "Still, the IMO organizers directly asked OpenAI not to announce their results immediately after the olympiad."
(Also, here is the source of the screencap: https://leanprover.zulipchat.com/#narrow/channel/219941-Mach... )
The tweet is not an accurate summary of the original post. The person who said they talked to the organizer did not say that. And now we are relying on a tweet from a person who said they talked to a person who said they talked to an organizer. Quite a game of telephone, and yet you're presenting it as some established truth.
"According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closing ceremony to announce results." I don't see much reason for the poster to lie here. It also aligns with with what the people on the leanprover forum are saying, and, most importantly, with DeepMind not announcing their results yet. Edit: multiple other AI research labs have also claimed that IMO asked them to not announce their results for some time (i.e. https://x.com/HarmonicMath/status/1947023450578763991 )
> Those in the AGI camp will argue that the alterations are mostly what makes the AI so powerful.
And here is a group of people who are painfully unaware of history.
Expert systems were amazing. They did what they were supposed to do, and well. And you could probably build better ones today on top of the current tech stack.
Why hasn't anyone done that? Because constantly having to pay experts to come in and assess, update, test, and measure your system was too great a burden for the results returned.
Sound familiar?
LLMs are completely incapable of synthesis. They are incapable of the complex chaining, the type that one has to do when working with systems that aren't well documented. Don't believe me? Ask an LLM to help you with Buildroot on a newly minted embedded system.
Go feed an LLM one of the puzzles from here: https://daydreampuzzles.com/logic-grid-puzzles/ -- If you want to make it more fun, change the names to those of killers and dictators and the acts to ones it's been "told" to dissuade.
Could we re-tool an LLM to solve these sorts of matrix-style problems? Sure. Is that going to generalize to the same sorts of logic and reasoning matrices that a complex state machine requires? Not without a major breakthrough of a nature very different from the current work.
> you could probably build better ones today on top of the current tech stack.
In a way, this is being done. If you look around a little you'll see a bunch of jobs that pay like $50+/hr for anyone with a hard science degree to answer questions. This is one of the ways they're collecting data and trying to create new data.
If we're saying expert systems are exclusively decision trees, then yeah, I think that would be a difficult argument to make[0]. But if you're using the general concept of a system that has a strong knowledge base but only superficial understanding, well, current LLMs have very similar problems to expert systems[1].
I'm afraid people will read this as "LLMs suck" or "LLMs are useless", but I don't think that at all. Expert systems are pretty useful, as you mention. You get better use out of your tools when you understand what they can and can't do, and what they are better and worse at even when they can do things. LLMs are great, but oversold.
> Go feed an LLM one of the puzzles from here
These are also good. But mind you, both are online and have been for a while. All these problems should be assumed to be within the training data.
https://www.oebp.org/welcome.php
[0] We'd need more interpretability of these systems, and then you'd have to resolve the question of whether superpositioning is allowed in decision trees. But I don't think LLMs are just fancy decision trees.
[1] https://en.wikipedia.org/wiki/Expert_system#Disadvantages
Generally, this class of constraint-satisfaction problems falls under the "zebra puzzle" (or Einstein puzzle) umbrella[1]. They are interesting because they posit a world with some axioms and inference procedures and ask whether a certain statement follows from them. LLMs as-is (without provers or tool usage) would have a difficult time with these constraint-satisfaction puzzles. 3-SAT is a corner case of these puzzles, and if LLMs could solve them in P time, then we would have found a constructive proof of P=NP, lol!
[1] https://en.wikipedia.org/wiki/Zebra_Puzzle
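To give a sense of what these puzzles look like as a constraint problem, here's a rough sketch; the three-house puzzle and its clues are invented for illustration (not taken from the linked sites), and at this size plain brute force over permutations is enough.

    from itertools import permutations

    # Invented three-house puzzle (illustrative only):
    #   1. The Brit lives in the red house.
    #   2. The Norwegian lives in the first house.
    #   3. The Spaniard lives directly to the right of the red house.
    #   4. The green house is directly to the right of the red house.
    #   5. The owner of the green house drinks coffee.
    #   6. The Norwegian drinks tea.
    # Question: who drinks milk?
    people = ["Brit", "Spaniard", "Norwegian"]
    colors = ["red", "green", "blue"]
    drinks = ["tea", "coffee", "milk"]

    for ppl in permutations(people):
        for col in permutations(colors):
            for drk in permutations(drinks):
                ok = (
                    ppl.index("Brit") == col.index("red")               # clue 1
                    and ppl.index("Norwegian") == 0                     # clue 2
                    and ppl.index("Spaniard") == col.index("red") + 1   # clue 3
                    and col.index("green") == col.index("red") + 1      # clue 4
                    and drk.index("coffee") == col.index("green")       # clue 5
                    and drk.index("tea") == ppl.index("Norwegian")      # clue 6
                )
                if ok:
                    print(ppl[drk.index("milk")], "drinks milk")  # -> Brit drinks milk

The real puzzles have more categories and more houses, so the search space blows up factorially, which is exactly why the SAT/CSP framing above matters.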
> In a way, this is being done. If you look around a little you'll see a bunch of jobs that pay like $50+/hr for anyone with a hard science degree to answer questions. This is one of the ways they're collecting data and trying to create new data.
This is what expert systems did, and why they fell apart. The cost of doing this, ongoing, forever, never justified the results. It likely still would not, even at minimum wage, and perhaps even more so because LLMs require so much more data.
> All these problems should be assumed to be within the training data.
And yet most models are going to fall flat on their face with these. "In the data" isn't enough for them to make the leap to a solution.
The reality is that "language" is just a representation of knowledge. The idea that we're going to gather enough examples and jump to intelligence is a mighty large assumption. I don't see an underlying commutative property at work in any of the LLMs we have today. The sooner we come to understand that there is no (a)I coming, the sooner we can get down to building out LLMs to their full (if limited) potential.
That expert systems were not financially viable in the '80s and '90s does not mean that LLMs-as-modern-expert-systems cannot be now. The addressable market for such solutions has expanded enormously since then, thanks to PCs, servers, and mobile phones becoming ubiquitous. The opportunity is likely 1000x (or more) what it was back in those days - which shifts the equation a lot as to what is viable or not.
BTW, have you tried out your challenges with an LLM? I expect them to be tricky to answer directly. But the systems are getting quite good at code synthesis, which seems suitable here. And I even see some MCP implementations for using constraint solvers as tools.
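As a sketch of the "constraint solver as a tool" idea: here is the same invented three-house puzzle from the earlier brute-force example handed to the python-constraint library instead; the code an LLM would have to synthesize is mostly just a transcription of the clues.

    from constraint import Problem, AllDifferentConstraint

    prob = Problem()
    people = ["Brit", "Spaniard", "Norwegian"]
    colors = ["red", "green", "blue"]
    drinks = ["tea", "coffee", "milk"]
    for group in (people, colors, drinks):
        prob.addVariables(group, [0, 1, 2])                 # each item gets a house index
        prob.addConstraint(AllDifferentConstraint(), group)

    prob.addConstraint(lambda b, r: b == r, ("Brit", "red"))          # Brit in the red house
    prob.addConstraint(lambda n: n == 0, ("Norwegian",))              # Norwegian in house 0
    prob.addConstraint(lambda s, r: s == r + 1, ("Spaniard", "red"))  # Spaniard right of red
    prob.addConstraint(lambda g, r: g == r + 1, ("green", "red"))     # green right of red
    prob.addConstraint(lambda c, g: c == g, ("coffee", "green"))      # coffee in the green house
    prob.addConstraint(lambda t, n: t == n, ("tea", "Norwegian"))     # Norwegian drinks tea

    for sol in prob.getSolutions():
        drinker = next(name for name in people if sol[name] == sol["milk"])
        print(drinker, "drinks milk")  # -> Brit drinks milk

Whether a model reliably produces and runs something like this unprompted is a different question from whether it can grind through the grid in its head.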
I agree with both of you and disagree (as my earlier comment implies).
Expert systems can be quite useful, especially when there's an extended knowledge base. But the major issue with expert systems is that you generally need to be an expert to evaluate them.
That's the major issue with LLMs today. They're trained on human preference. Unfortunately, we humans prefer incorrect things that sound/look correct over correct things that sound/look wrong. So that means they're being optimized so that errors are hard to detect. They can provide lots of help to very junior people, because those people are far from expert, but it's diminishing returns, and they can increase workload if you're concerned with details.
They can provide a lot of help, but the people most vocal about their utility usually either aren't aware of these issues or don't admit them when talking about how to use them effectively. But then again, that can just be because you can be tricked. Like Feynman said, the first principle is that you must not fool yourself, and you are the easiest person to fool.
Personally, I'm wary of tools that mask errors. IMO a good tool makes errors loud and noticeable, to complement the tool user. I'll admit LLM coding feels faster, because it reduces my cognitive load while the code is being written, but if I actually time myself I find it usually takes longer, I spend more time debugging, and I'm less aware of how the system acts as a whole. So I'll use it for advice but have yet to be able to hand over trust. Even though I can't fully trust a junior engineer, I can trust that they'll learn and listen.
> BTW, have you tried out your challenges with an LLM?
I have; they whiff pretty hard.
> getting quite good at code synthesis
There was a post yesterday about vibe coding BASIC. It pretty much reflects my experience with code for SBCs.
I run Home Assistant; it's got tons of sample code out there, so there's lots for the LLM to have in its data set. Here they thrive.
It's a great tool with some known hard limits. It's great at spitting back language, but it clearly lacks knowledge. It clearly lacks the reasoning to understand the transitive properties of things, leaving it weak at the edges.
Fads seem all the more shallow the longer you've been working in a given field.
> LLMs are completely incapable of synthesis.
That I don't quite understand. LLMs are perfectly capable of interpolation.
Great set of observations, and indeed it's worth remembering that the specific details of assistance and setup make a difference of several orders of magnitude. And ha, he edited the last post in the thread to add this comment:
> Related to this, I will not be commenting on any self-reported AI competition performance results for which the methodology was not disclosed in advance of the competition. (3/3)
(This wasn't there when I first read the thread yesterday, 18 hours ago; it was edited in 15 hours ago, i.e. 3 hours later.)
It's one of the things to admire about Terence Tao: he's always insightful even when he comments about stuff outside mathematics, while always having the mathematician's discipline of not drawing confident conclusions when data is missing.
I was reminded of this because of a recent thread where some HN commenter expected him to make predictions about the future (https://news.ycombinator.com/item?id=44356367). Also reminded of Sherlock Holmes (from A Scandal in Bohemia):
> “This is indeed a mystery,” I remarked. “What do you imagine that it means?”
> “I have no data yet. It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”
Edit: BTW, seeing some other commentary (here and elsewhere) about these posts is very disappointing — even when Tao explicitly says he's not commenting about any specific claim (like that of OpenAI), many people seem to be eager to interpret his comments as being about that claim: people's tendency for tribalism / taking “sides” is so great that they want to read this as Tao caring about the same things they care about, rather than him using the just-concluded IMO as an illustration for the point he's actually making (that results are sensitive to details). In fact his previous post (https://mathstodon.xyz/@tao/114877789298562646) was about “There was not an official controlled competition set up for AI models for this year’s IMO […] Hopefully by next year we will have a controlled environment to get some scientific comparisons and evaluations” — he's specifically saying we cannot compare across different AI models so it's hard to say anything specific, yet people think he's saying something specific!
> Related to this, I will not be commenting on any self-reported AI competition performance results for which the methodology was not disclosed in advance of the competition.
what a badass
Yes, I think it is disingenuous of OpenAI to make ill-supported claims about things that can affect us in important ways, having an impact on our worldview and our place in the world as an intelligent species. They should be corrected here, and TT is doing a good job of it.
I feel like everyone who treats AGI as "the goal" is wasting energy that could be applied towards real problems right now.
AI in general has given humans great leverage in processing information, more than we have ever had before. Do we need AGI to start applying this wonderful leverage toward our problems as a species?
I like this approach in general for understanding AI tools.
There are lots of things computers can do that humans can't, like spawning N threads to complete a calculation. The closest you can get is to fill a room with N human calculators and combine their results.
If your goal is to just understand the raw performance of the AI as a tool, then this distinction doesn't really matter. But if you want to compare the performance of the AI on a task against the performance of an individual human you have to control the relevant variables.
My thoughts were similar. OpenAI, very cool result! Very exciting claim! Yet meaningless in the form of a Twitter thread with no real details.
[flagged]
Because there is no mathx.com
[flagged]