AI has all the answers. Even the wrong ones - FT Chinese (FT中文网)


ChatGPT has the appearance of a brilliant logician and that’s a problem
An inquiry into the accuracy and trustworthiness of large language models when they solve logic puzzles.

Can large language models solve logic puzzles? There’s one way to find out, which is to ask. That’s what Fernando Perez-Cruz and Hyun Song Shin recently did. (Perez-Cruz is an engineer; Shin is the head of research at the Bank for International Settlements as well as the man who, in the early 1990s, taught me some of the more mathematical pieces of economic theory.)

The puzzle in question is commonly known as the “Cheryl’s birthday puzzle”. Cheryl challenges her friends Albert and Bernard to guess her birthday, and for puzzle-reasons they know it’s one of 10 dates: May 15, 16 or 19; June 17 or 18; July 14 or 16; or August 14, 15 or 17. To speed up the guessing, Cheryl tells Albert her birth month, and tells Bernard the day of the month, but not the month itself.

Albert and Bernard think for a while. Then Albert announces, “I don’t know your birthday, and I know that Bernard doesn’t either.” Bernard replies, “In that case, I now know your birthday.” Albert responds, “Now I know your birthday too.” What is Cheryl’s birthday?* More to the point, what do we learn by asking GPT-4?

The puzzle is a challenging one. Solving it requires eliminating possibilities step by step while pondering questions such as “what is it that Albert must know, given what he knows that Bernard does not know?” It is, therefore, hugely impressive that when Perez-Cruz and Shin repeatedly asked GPT-4 to solve the puzzle, the large language model got the answer right every time, fluently elaborating varied and accurate explanations of the logic of the problem. Yet this bravura performance of logical mastery was nothing more than a clever illusion. The illusion fell apart when Perez-Cruz and Shin asked the computer a trivially modified version of the puzzle, changing the names of the characters and of the months.

GPT-4 continued to produce fluent, plausible explanations of the logic, so fluent, in fact, it takes real concentration to spot the moments when those explanations dissolve into nonsense. Both the original problem and its answer are available online, so presumably the computer had learnt to rephrase this text in a sophisticated way, giving the appearance of a brilliant logician.

When I tried the same thing, preserving the formal structure of the puzzle but changing the names to Juliet, Bill and Ted, and the months to January, February, March and April, I got the same disastrous result. GPT-4 and the new GPT-4o both authoritatively worked through the structure of the argument but reached false conclusions at several steps, including the final one. (I also realised that in my first attempt I introduced a fatal typo into the puzzle, making it unsolvable. GPT-4 didn’t bat an eyelid and “solved” it anyway.)


Curious, I tried another famous puzzle. A game show contestant is trying to find a prize behind one of three doors. The quizmaster, Monty Hall, allows a provisional pick, opens another door to reveal no grand prize, and then offers the contestant the chance to switch doors. Should they switch?

The Monty Hall problem is actually much simpler than Cheryl’s Birthday, but bewilderingly counterintuitive. I made things harder for GPT-4o by adding some complications. I introduced a fourth door and asked not whether the contestant should switch (they should), but whether it was worth paying $3,500 to switch if two doors were open and the grand prize were $10,000.**

GPT-4’s response was remarkable. It avoided the cognitive trap in this puzzle, clearly articulating the logic of every step. Then it fumbled at the finishing line, adding a nonsensical assumption and deriving the wrong answer as a result.

What should we make of all this? In some ways, Perez-Cruz and Shin have merely found a twist on the familiar problem that large language models sometimes insert believable fiction into their answers. Instead of plausible errors of fact, here the computer served up plausible errors of logic.

Defenders of large language models might respond that with a cleverly designed prompt, the computer may do better (which is true, although the word “may” is doing a lot of work). It is also almost certain that future models will do better. But as Perez-Cruz and Shin argue, that may be beside the point. A computer that is capable of seeming so right yet being so wrong is a risky tool to use. It’s as though we were relying on a spreadsheet for our analysis (hazardous enough already) and the spreadsheet would occasionally and sporadically forget how multiplication worked.

Not for the first time, we learn that large language models can be phenomenal bullshit engines. The difficulty here is that the bullshit is so terribly plausible. We have seen falsehoods before, and errors, and goodness knows we have seen fluent bluffers. But this? This is something new.

*If Bernard had been told the 18th (or the 19th), he would know the birthday was June 18 (or May 19). So when Albert says he knows that Bernard doesn’t know the answer, that rules out these possibilities: Albert must have been told July or August rather than May or June. Bernard’s response that he now knows the answer for certain reveals that it can’t be the 14th (which would have left him guessing between July and August). The remaining dates are August 15 or 17, or July 16. Albert knows which month, and his statement that he now knows the answer reveals that the month must be July (had it been August, he would still be choosing between the 15th and the 17th) and that Cheryl’s birthday is July 16.
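For readers who would rather check the elimination than trust it, the same three steps can be sketched as a brute-force filter over the ten dates. This is only an illustration of the footnote's reasoning: the function names `unique_by` and `solve` are my own, and the code has nothing to do with how Perez-Cruz and Shin ran their experiment.

```python
from collections import Counter

# The ten candidate dates as (month, day) pairs, taken from the puzzle statement.
DATES = [
    ("May", 15), ("May", 16), ("May", 19),
    ("June", 17), ("June", 18),
    ("July", 14), ("July", 16),
    ("August", 14), ("August", 15), ("August", 17),
]


def unique_by(dates, index):
    """Return the dates whose field at `index` (0 = month, 1 = day) occurs exactly once."""
    counts = Counter(d[index] for d in dates)
    return [d for d in dates if counts[d[index]] == 1]


def solve(dates):
    """Apply the three statements in order and return the surviving dates."""
    # Statement 1 (the month-holder): "I don't know, and I know the day-holder
    # doesn't either." The month must have several dates, none with a unique day.
    month_counts = Counter(month for month, _ in dates)
    months_with_unique_day = {month for month, _ in unique_by(dates, 1)}
    step1 = [
        d for d in dates
        if month_counts[d[0]] > 1 and d[0] not in months_with_unique_day
    ]

    # Statement 2 (the day-holder): "Now I know." The day is unique among survivors.
    step2 = unique_by(step1, 1)

    # Statement 3 (the month-holder): "Now I know too." The month is unique among survivors.
    return unique_by(step2, 0)


if __name__ == "__main__":
    print(solve(DATES))
```

Because the filter only counts how often each month and day appears, swapping May to August for January to April changes nothing about the logic, which is the sense in which the renamed puzzle is a trivial modification.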

**The chance of initially picking the correct door is 25 per cent, and that is not changed when Monty Hall opens two empty doors. Therefore the chance of winning $10,000 is 75 per cent if you switch to the remaining door, and 25 per cent if you stick with your initial choice. For a sufficiently steely risk-taker, it is worth paying up to $5,000 to switch.
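The arithmetic behind that $5,000 figure can be checked in a few lines. The sketch below rests on assumptions the footnote leaves implicit: Monty knows where the prize is, always opens two empty doors other than the contestant's pick, and the contestant is risk-neutral, which is how I read "steely". The names `exact_break_even_fee` and `simulate_switch_win_rate` are illustrative.

```python
import random
from fractions import Fraction

DOORS = 4        # the four-door variant described above
PRIZE = 10_000   # grand prize in dollars
FEE = 3_500      # price quoted for the right to switch


def exact_break_even_fee(doors=DOORS, prize=PRIZE):
    """Largest fee a risk-neutral contestant should pay to switch."""
    p_stick = Fraction(1, doors)
    # Holds only because Monty leaves exactly one other door closed.
    p_switch = 1 - p_stick
    return (p_switch - p_stick) * prize


def simulate_switch_win_rate(trials=100_000, doors=DOORS, seed=0):
    """Estimate how often switching wins, assuming Monty never reveals the prize."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prize_door = rng.randrange(doors)
        pick = rng.randrange(doors)
        if pick == prize_door:
            # Monty can leave any other door closed; it is empty whichever he chooses.
            remaining = rng.choice([d for d in range(doors) if d != pick])
        else:
            # Monty must leave the prize door closed, since he only opens empty doors.
            remaining = prize_door
        wins += remaining == prize_door
    return wins / trials


if __name__ == "__main__":
    break_even = exact_break_even_fee()
    print("Break-even fee:", break_even)
    print("Expected net gain from paying the fee:", break_even - FEE)
    print("Simulated win rate when switching:", simulate_switch_win_rate())
```

On these assumptions the break-even fee is $5,000, so paying $3,500 leaves an expected gain of $1,500, and the simulated win rate for switching should sit close to 75 per cent.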

