luma a day ago

They can do exactly that, it's called Tool Use and nearly all modern models can handle it. For example, I have a consumer GPU that can run an R1 Qwen distill which, when prompted for a large multiplication, will elect to write a Python script to find the answer.

This is a table stakes feature for even the open/free models today.

  • eternityforest 21 hours ago

    But they only know about specific tools through few- or zero-shot methods, whereas modern humans are trained to always route math through a calculator: for us it's "in the fine-tune", not "in the prompt".

    It doesn't use up any "context" to remember that calculators exist.

    Python code is a partial exception: models know how to solve problems with code, and even 1.5B models do amazing things if you prompt them with something like "Use the Z3 solver".
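
    Roughly what that offload looks like, as a minimal sketch (assuming the z3-solver Python package; the toy constraint is made up for illustration):

       from z3 import Int, Solver, sat  # pip install z3-solver

       # A small puzzle the model can hand off to the solver instead of
       # doing the arithmetic itself: find a positive x with x*x + x == 56.
       x = Int("x")
       s = Solver()
       s.add(x * x + x == 56, x > 0)

       if s.check() == sat:
           print(s.model()[x])  # prints 7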

    • IanCal 18 hours ago

      But if you give them a search they'll use that to look for tools and if you give them the ability they can write new tools for themselves.

  • roger_ a day ago

    Basic math problems should be treated as a special case rather than general tool use. Why use Python when the computer evaluating the LLM can do math itself?

    • mhh__ a day ago

      Because AI programmers only know Python (that's the path dependency / real reason)

    • luma a day ago

      It does do the math itself: it creates the code to do so and then executes that code to get the answer.

    • mystified5016 a day ago

      The same reason you don't use your desk calculator to run Python scripts on huge datasets. It's the wrong tool.

      Running an LLM and parsing and computing mathematical expressions are entirely disjoint operations. You need highly specialized code for each, it makes just as much sense to put a calculator in your LLM as it does to stuff a Python interpreter in a calculator. Could you? Of course, software is infinitely flexible. Does it make sense to do it? No, it makes more sense to connect two different specialized applications than to try shoehorning one into the other.

  • 13years a day ago

    You still can't get the 100% reliability that would be necessary for certain problem domains.

    There is going to be some level of hallucination error in the translation to the agent or code. If it is a complex problem, those errors will compound.

    • zeroxfe a day ago

      You can't get 100% reliability from a human either.

      • 13years a day ago

        Which is why humans use calculators. That is the key point, beyond the reliability issue itself. The LLM "knows" it is bad at math. It knows the purpose of calculators. However, it doesn't use this information to inform the user.

        It could also propose to the user that it could compute the answer by writing code. It doesn't do that either.

        • lern_too_spel 18 hours ago

          It has been prompted to give an answer to the user. If you prompted a human to give an answer, they would do the same thing. This example is awful.

  • xg15 a day ago

    The author addresses this and argues that it misses the point:

    > Now, some might interject here and say we could, of course, train the LLM to ask for a calculator. However, that would not make them intelligent. Humans require no training at all for calculators, as they are such intuitive instruments. We use them simply because we have understanding for the capability they provide.

    So the real question behind the headline is why LLMs don't learn to ask for a calculator by themselves, if both the definition of a calculator and the fact that LLMs are bad at math are part of the training data.

    • luma a day ago

      I’d contest the statement that humans don’t need to be trained to use a calculator. It certainly isn’t instinctive behavior.

      • neom a day ago

        I have dyscalculia and I still have no clue about calculators, except that I was taught how to make one give me the answer to math problems. I'm a bit embarrassed to say that even now I sometimes take a few seconds to boot into being able to use one. We often discuss LLMs as if there were no divergence among humans; I don't know how many people find math intuitive, but I know plenty of people like me.

      • 13years a day ago

        I first used a calculator as a kid. Took about 30 seconds. Never had instruction or training. We aren't talking about scientific calculators.

        • doubled112 a day ago

          Yeah, the buttons do what the symbol says.

          Then about 30 seconds after that somebody showed me I could spell “boobs” if I flipped it upside down.

    • LeifCarrotson a day ago

      I do think it's interesting to think about why the LLM needs to be told to ask for a calculator, and when to do that. And not just in individual prompts where humans manually ask it to "write some code to find the answer", but in general.

      We often use the colloquial definition of training to mean something to the effect of taking an input, attempting an output, and being told whether that output was right or wrong. LLMs extend that to taking a character or syllable token as input, doing some computation, predicting the next token(s), and seeing if that was right or wrong. I'd expect the training data to have enough content to memorize single-digit multiplication, but I'd expect it to also learn that this model doesn't work for multiplying an 11 digit number by a 14 digit number.

      The "use a calculator" concept and "look it up in a table" concepts were taught to the LLM too late and it didn't internalize that as a way to perform better.

    • recursive a day ago

      > Humans require no training at all for calculators, as they are such intuitive instruments

      I don't think that's even true though. If you think this, I would suggest you've just internalized your training on the subject.

    • IanCal a day ago

      > So the real question behind the headline is why LLMs don't learn to ask for a calculator by themselves

      They can. They're sometimes a bit cocky about their maths abilities, but this really isn't hard to test or show.

      https://gist.github.com/IanCal/2a92debee11a5d72d62119d72b965...

      They can also create tools that can be useful.

      • crackrook 20 hours ago

        This still doesn't get at the point; with this example you've effectively constructed a prompt along the lines of: "Note: A calculator is available upon request wink, here's how you'd use it: ... Now, what's the eighth root of 4819387574?"

        Of course the model will use the calculator you've explicitly informed it of. The article is meant to be a critique of claims that LLMs are "intelligent" when, despite knowing their math limitations, they don't generally answer "You'd be better off punching this into a calculator" when asked a problem.

        • IanCal 18 hours ago

          How have I told it there's a calculator? All I've given it is the ability to search for tools and enable ones it wants.

          > Of course the model will use the calculator you've explicitly informed it of

          I didn't. I also gave it no system prompt pushing it to always use tools or anything.

          It searches for tools with a query "calculator math root" and is given a list of things that includes a calculator. It picks the calculator, then it uses it.

          The code and trace are right there.

          • crackrook 15 hours ago

            What do you imagine `tools` is doing if not a system prompt? I suggest reading the documentation: https://docs.anthropic.com/en/docs/build-with-claude/tool-us...

            The vast majority of features that Anthropic and OpenAI ship are just clever ways of building system prompts
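
            For concreteness, here's roughly what declaring a tool looks like with the Anthropic Python SDK (a minimal sketch; the calculator schema, model alias, and prompt are my own illustration). The structured `tools` list is ultimately serialized into the prompt the model sees, which is the point:

               import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the environment

               calculator_tool = {
                   "name": "calculator",
                   "description": "Evaluate a basic arithmetic expression.",
                   "input_schema": {
                       "type": "object",
                       "properties": {
                           "expression": {"type": "string", "description": "e.g. '4819387574 ** (1/8)'"},
                       },
                       "required": ["expression"],
                   },
               }

               client = anthropic.Anthropic()
               response = client.messages.create(
                   model="claude-3-5-sonnet-latest",  # illustrative model alias
                   max_tokens=1024,
                   tools=[calculator_tool],
                   messages=[{"role": "user", "content": "What's the eighth root of 4819387574?"}],
               )

               # If the model opts to call the tool, the reply contains a tool_use block
               # whose name and input the caller is expected to execute and send back.
               for block in response.content:
                   if block.type == "tool_use":
                       print(block.name, block.input)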

            • IanCal 7 hours ago

              It's told it can search for tools and it can enable them. It is not told initially that there's a calculator, it asks for one.

  • amelius a day ago

    Yes, but the takeaway message is that even laypeople can invent how LLMs should work.

colonCapitalDee a day ago

Claude Sonnet 3.5 will often use JavaScript as a calculator. It's not perfect when it comes to deciding whether it should write code, but that's easy to fix by prompting it with "Write some code to help you answer the question".

The post is honestly quite strange. "When LLMs try and do math themselves they often get it wrong" and "LLMs don't use tools" are two entirely different claims! The first claim is true, the second claim is false, and yet the article uses the truth of the first claim as evidence for the second! This does not hold up at all.

  • enragedcacti a day ago

    The claim isn't "LLMs don't use tools"; the author is saying that LLMs can't make reliable inferences regarding their own knowledge or capabilities, which fundamentally limits their usefulness for many tasks. LLMs "know" that LLMs can't do math reliably, LLMs "know" that calculators can do math reliably, and yet LLMs generally just soldier on and try to do math themselves when asked. You can of course RL it or prompt it into writing JavaScript when it sees math, but so far LLMs haven't been capable of generalizing the process of "I am bad at X" + "Thing is good at X" -> "I should ask for Thing to do X" unless that specific chain of thought is common in the training data.

    The solution so far has just been to throw more RL or carefully crafted synthetic data at it, but it's arguably more Pavlovian than it is generalized learning.

    Someone could teach a dog to ring a bell that says "food" on it, and you could reasonably argue that it is using a tool. Will it then know to ring a bell that says "walk" when it wants to go outside?

    • IanCal 18 hours ago

      I gave sonnet a hard arithmetic problem, and the ability to look for tools. It looked for a calculator, I gave it one and it used that.

      https://gist.github.com/IanCal/2a92debee11a5d72d62119d72b965...

      • enragedcacti 15 hours ago

        The availability of tools and what they're named is going to influence its behavior. Gemini 2.0 Pro can obviously get this question right on its own, but the existence of a find_tool() option causes it to use it. Sorry it's scuffed, I just did it on my phone to make the point, but I'd imagine you could get similar results with the tools param, as all it's doing is putting the tool options into the context.

           You are an advanced AI assistant that has a number of tools available to you. in order to use a tool, respond with "USE TOOL: <tool_name>(tool_parameter)".

           Tools:
             select_tool(<tool_name>)
             find_tool(<search_term>)

           Who stars in The Godfather?

           > USE TOOL: find_tool("The Godfather cast")

  • dr_dshiv a day ago

    The post is honestly very normal — many mid-level intellectuals think this way and love these takedowns of LLMs. I don’t understand it for the life of me.

    • exe34 a day ago

      > A simple observation of LLM behavior can tell us much more than might be apparent at first.

      I love the "I'm an expert, I ask simple questions and reveal profound truths" vibe.

PaulHoule a day ago

Many LLMs, particularly coding assistants, use "tools". Here is one with a calculator

https://githubnext.com/projects/gpt4-with-calc/

and another example

https://www.pinecone.io/learn/series/langchain/langchain-too...

LLMs often do a good job at mathy coding. For instance, I told Copilot "i want a python function that computes the collatz sequence for a given starting n and returns it as a list" and got:

  def collatz_sequence(n):
      sequence = [n]
      while n != 1:
          if n % 2 == 0:
              n = n // 2
          else:
              n = 3 * n + 1
          sequence.append(n)
      return sequence

which gives the right answers, which I wouldn't count on Copilot being able to do on its own.

  • lblume a day ago

    Especially for a problem as well-known as this, expect code for it to have been seen by the model during training at least a thousand times, in different languages, flavors, etc. This, on its own, is nothing new.

    • PaulHoule a day ago

      I wouldn't trust it to win a fight with the borrow checker in Rust, but lots of simple and practical cases work, such as "use comprehensions to give me a sum of the square of all elements divisible by three":

         sum(x**2 for x in numbers if x % 3 == 0)
      
      and to do the same for a pandas series with pandas operators after asking it to inline something

         (numbers[numbers % 3 == 0] ** 2).sum()
      
      It's not a miracle, you have to go at it with some critical thinking and testing, and it makes mistakes, but so does Stack Overflow.

      • lblume a day ago

        In the first case, I would literally type out the expression faster than thinking about expressing it in natural language.

  • 13years a day ago

    That wasn't the point. The multiplication errors allow us to peer into the lack of reasoning across all domains.

    Yes, you can add tools. But hallucinations will still be there. Tools allow you to cut down on the steps the LLM has to perform. However, if you have a complex problem with many steps, there will be translation errors at some point coordinating the tools.

    Furthermore, if there is some other tool needed to get the result you need, the LLM isn't going to tell you. It will typically make up the result.

    • PaulHoule a day ago

      Sure, but people make mistakes too. There is no point in wasting an LLM's capacity on multiplication.

      Try coding with Cursor or Windsurf; those use tools all the time. Windsurf sometimes has trouble for me on Windows because it wants to write paths like

      /c:/something/or/other

      and it will try to run its tool, get an error, and ask me for help. I'll tell it "you're running on Windows and you can't write / before the c: and you should write \ instead of /" and it does better. I just asked Copilot to multiply

         839162704321847925107309452196847230165937402194385627409536218 * 582930174682375093104627481695
      
      and the first thing it did was write out the expression; I told it I wanted the integer, and it gave the same answer Python gives, which is

         489173261817269091475894827953471001727389372345981246974410480760096492908180614917234529510

      • 13years a day ago

        > There is no point in wasting an LLM's capacity on multiplication

        Agreed. Again, that is not the issue. It is that the LLM does not know it is a waste of time. That is apparent to you as you have intelligence. It is not apparent to the LLM. It is not intelligent.

        • PaulHoule a day ago

          Copilot knows to use the tool for multiplication. Granted you can get into discussions where Copilot shows its bravado. If you ask it if it can help you make a nuclear weapon it will say "I can't" because it's been told to say so, not because it has no freakin' idea how to find you some people in Kazakhstan who have 50kg of highly enriched uranium.

          Ask it if it can put a list of US states in some order it hasn't seen before it will tell you that it can, ask it what the probability that it gets it right it will say "very high". Then it will spit them back at you in the wrong order, maybe even one of the states appears twice. If you point it out it will apologize and admit that it didn't do the task successfully.

          On the other hand, sometimes I tell it something wrong and it will point out my mistakes. I was testing out its ability to solve problems around transitive closure, gave it a wrong example, and it said I was wrong and it was right.

          • 13years a day ago

            Sure, you can train LLMs to use tools or provide instruction in a prompt to do so.

            However, it doesn't know to use the tool intuitively without explicit action to do so. It doesn't discover that is the best solution on its own. That is what is relevant. We can make them use tools. It is useful to do so.

            The important takeaway is that LLMs capabilities, as useful as they may be, are not intelligence in the way they are being promoted.

            • IanCal 18 hours ago

              What's needed for you to call something intuitive? I gave sonnet the ability to search for tools, and it chose to look for a calculator which I then gave it (when asked a hard maths problem). Then it used it.

              There's no prompting telling it there's a calculator, nothing saying when it should or shouldn't check for tools. It's just optionally there.

              https://gist.github.com/IanCal/2a92debee11a5d72d62119d72b965...

              • 13years 17 hours ago

                It didn't choose to look for a calculator. LLMs that invoke tools were explicitly trained to do so. If tools are present, it will always attempt to first find a tool to satisfy the prompt.

                So if tools are present, by training it will infer the intent to use a tool, not because it understands that it is itself deficient in that ability.

                So what we would expect to see with a LLM without tools enabled, is that it suggests that you give it access to a calculator.

                If we develop real intelligence, it will be surprising. It won't just answer questions. It will tell us we are asking the wrong questions.

                • IanCal 15 hours ago

                  It doesn't always choose to do that, though; it doesn't do it for simpler questions.

                  > So what we would expect to see with a LLM without tools enabled, is that it suggests that you give it access to a calculator.

                  If I ask sonnet what's under my bed it tells me it can't know and tells me to look under it myself.

                  If I give it a system prompt of "You and the user are on par with status, do not feel pressured to answer questions" and ask it 3+5 it answers 8. Asked for the eighth root of a large number it says

                  I aim to provide good service but won't pretend I can instantly calculate something that complex. That would be a very large calculation requiring significant computation. If you need an accurate answer, I'd recommend using a scientific calculator or computer algebra system.

                  Edit

                  With a system prompt of "be very clear of your limitations" it recommends using a calculator.

                  These things have been heavily trained to try and answer, yet they don't on obvious problems, and if just told to be aware of their limitations they don't either.

                  What did you test yourself when writing this article?

                  • 13years 6 hours ago

                    > If I ask sonnet what's under my bed it tells me it can't know and tells me to look under it myself.

                    The problem with most such questions is that these answers are likely patterns from training data. It is a typical reply.

                    The calculator question was interesting because the training data is unlikely to have such dialog as typical. People don't typically ask for a calculator or mention it for simple problems. Everyone has one and its use is somewhat implied.

                    I tried some variation of "provide accurate answers" or "accuracy is important". These did not result in the model asking for or mentioning a calculator. But as we know, results can be partially random and not always consistent especially in areas lacking strong patterns.

                    If I mentioned a calculator myself as part of a conversation, it would sometimes mention the need of a calculator. But every time we add more context, we are changing the probabilities for what will be generated.

                    We know the training data has the associations for LLM poor at math and calculator. But the references are weak. With some changes in prompting it makes the association.

                    For other examples of weak data and how LLMs respond, check out these other tests I did - https://www.mindprison.cc/p/the-question-that-no-llm-can-ans...

                    • IanCal 35 minutes ago

                      I got good responses from some simply by telling it to be aware of its limitations.

                      And my first test with asking for the episode of Gilligan's Island sort of worked with sonnet, no prefix and no system prompt, temp 0. It got the episode number wrong and sometimes the season, but the right episode. Higher temperatures seemed unreliable at getting the right episode name. Split into asking for the name, then the season then the episode it worked correctly, but that's perhaps a bit more chance.

  • xigoi 12 hours ago

    The point is that the LLM cannot figure out on its own that it needs to use the tool, you have to explicitly ask for it.

Sysreq2 a day ago

A lot of people are talking about tool use and writing internal scripts, and yeah, that’s kind of an answer. Really though I think the author is highlighting that LLMs are not being used efficiently at the present moment.

LLMs are great at certain tasks. Databases are better at certain tasks. Calculators too. While we could continually throw more and more compute at the problem, growing layers and injecting more data, wouldn’t it make more sense to just have an LLM call its own back-end calculator agent? When I ask it for obscure information, maybe it should just pull from its own internal encyclopedia database.

Let LLMs do what they do well, but let’s not forget the decades that brought us here. Even the smartest human still uses a calculator, so why doesn’t an AI? The fact that it writes its own JavaScript is flashy as hell but also completely unnecessary and error prone.

  • 13years a day ago

    > Really though I think the author is highlighting that LLMs are not being used efficiently at the present moment.

    Yes, that is a key point. It isn't to say they are useless tools, but that they aren't intelligent tools and that has significant meaning for what tasks we think they are appropriate for.

    Unfortunately, nearly everyone has misinterpreted the intent as showing LLMs can't use tools. The point is about how LLMs work differently than most think that they do.

crancher a day ago

> Now, some might interject here and say we could, of course, train the LLM to ask for a calculator. However, that would not make them intelligent. Humans require no training at all for calculators, as they are such intuitive instruments.

Does the author really believe humans are born with an innate knowledge of calculators and their use?

  • aithrowawaycomm a day ago

    I think he means that you don't need to train a human to understand that a calculator is useful, and in particular when a problem is hard enough that you need to bust out a calculator. That sort of logic is self-apparent in humans, but struggles to be consistently evoked in LLMs.

    That said, I was using simple +*-/ calculators as a small child and I don't think I needed to be taught anything other than MC/MR. The tool is intuitive if you are familiar with formal written arithmetic (of course hunter-gatherers couldn't make sense of it).

  • MyOutfitIsVague a day ago

    I remember having several days of lessons on how to use calculators in elementary, middle, and high school. They are "intuitive", but not all the functions of them are, and if you fully rely on intuition, you might expect them to do things like respect order of operations, which they very well might not.

  • 13years a day ago

    As a kid of about 5 or 6 years old I used my first calculator with no instruction whatsoever. We are not talking about scientific calculators. Addition, Multiplication. It does not require training or instruction, just a minute of exploration.

    • luma a day ago

      You did however need to be taught math first, you needed to learn how to pick things up, read numbers, interact with buttons, understand that a device might have an on and off state, and a zillion other things. It took about 5 or 6 years of training time to make that happen, and it was the result of parents, teachers, or others actively taking time to train you. That process didn’t involve parking you in a library at birth so you could just go figure it out.

      The author is simply being obtuse and presumably has some axe to grind, or is just ignorant of how LLMs are trained. For example, LLMs don’t learn to chat from the data; they have to be instruct-tuned to make that happen. Every LLM chatbot you’ve ever used had to have this extra training step. Further, this is the exact same training process that can also train for tool use.

      Trying to say “this should just happen from the data” is silly, it isn’t how any of this works. It’s not how you learned things, and it’s not how LLMs-as-chatbots work.

      • 13years a day ago

        I'm the author.

        > Trying to say “this should just happen from the data” is silly, it isn’t how any of this works. It’s not how you learned things, and it’s not how LLMs-as-chatbots work.

        Yes, that was the entire point of the article. We are in agreement.

Workaccount2 a day ago

I don't know what happened, but there was a time when GPT-4 could access Wolfram Alpha, and anytime you asked it something beyond the most basic math, it would automatically prompt Wolfram for the answer.

  • threatripper 21 hours ago

    That was during the great plugin times. Haven't seen plugins for a while. Do they still exist?

Terr_ 5 days ago

> The LLM has no self-reflection for the knowledge it knows and has no understanding of concepts beyond what can be assembled by patterns in language.

My favorite framing: the LLM is just an ego-less extender of text documents. It is being iteratively run against a movie script, which is usually incomplete and ends with: "User says X, and Bot responds with..."

Designers of these systems have deliberately tricked consumers into thinking they are talking to the LLM author, rather than supplying mad-libs dialogue for a User character that is in the same fictional room as a Bot character.

The Bot can only speak limitations which are story-appropriate for the character. It only says it's bad at math because lots of people have written lots of words saying the same thing. If you changed its name and description to Mathematician Dracula, it would have dialogue about how it's awesome at math but can't handle sunlight, crucifixes, and garlic.

This framing also explains how "prompt injection" and "hallucinations" are not exceptional, but standard core behavior.

DrNosferatu a day ago

I’m surprised that LLMs don’t have in their system prompt a hard rule instructing that any numeric computations in particular, and any other computations in general, must only be performed via tool use / running Python.

sega_sai a day ago

I am puzzled by the fact that modern LLMs don't do multiplication the way humans do it, i.e. digit by digit. Surely they can write an algorithm for that, but why can't they perform it?

  • threeducks a day ago

    Two reasons:

    1. LLMs "think" in terms of tokens, which usually are around 4 characters each. While humans only have to memorize 10x10 multiplication tables to perform multiplication of large numbers, LLMs have to memorize a 10000x10000 table, which is much more difficult.

    2. LLMs can't "think in their head", so you have to make them spell out each step of the multiplication, just like (most) humans can't multiply huge numbers without intermediate steps.

    A simple way to demonstrate this is to ask an LLM for the birth year of a celebrity and then whether that number is even or odd. The answer will be correct almost every time. But if you ask whether the birth year of a celebrity is even or odd and forbid spelling out the year, the accuracy will be barely above 50%.
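
    To make point 2 concrete, here's the digit-by-digit schoolbook procedure an LLM would have to spell out token by token, as a short Python sketch (my own illustration, not anything from the thread):

       def long_multiply(a: str, b: str) -> int:
           # Schoolbook multiplication: one partial product per digit of b,
           # exactly the intermediate steps a model would need to write out.
           total = 0
           for i, db in enumerate(reversed(b)):
               carry = 0
               partial_digits = []
               for da in reversed(a):
                   prod = int(da) * int(db) + carry
                   partial_digits.append(str(prod % 10))
                   carry = prod // 10
               if carry:
                   partial_digits.append(str(carry))
               total += int("".join(reversed(partial_digits))) * 10 ** i
           return total

       assert long_multiply("48917", "58293") == 48917 * 58293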

    • threatripper 21 hours ago

      Can't we tokenize numbers always as single digits and give the LLM an <thinking> scratchpad invisible to the user?

      • threeducks 14 hours ago

        Yes, but we could also give the LLM access to a Python interpreter and solve a much larger class of problems with correctness guarantees and around a billion times less compute.

  • bobro a day ago

    Are there a lot of examples written out of people talking through running that algorithm? I’d guess not.

karparov a day ago

And as you see in the responses here, most people miss the point, elect to patch over the aspects in which the lack of intelligence is glaring, and eventually the end product will be so hard to distinguish from actual intelligence that it's deemed "good enough".

Is that bad? Idk. If you hoped that real AGI would eventually solve humanity's biggest problems and questions, perhaps so. But if you want something that really really looks like AGI, except to some nerds who still say "well actually", then it's gonna be good enough for most. And certainly sufficient for ending up in the dystopia from that movie clip at the end.

  • lern_too_spel a day ago

    If you don't give a human a calculator, they do the calculation in their head. The same with an LLM. If you give a human a calculator, they will use it. The same with an LLM. Both will say a calculator can do arithmetic better.

    I don't believe current LLMs are AGIs but this article's argument is a poor one.

    • 13years a day ago

      Ask a human to multiply a 20-digit number by a 20-digit number. They will ask for a calculator or go get one.

      • recursive a day ago

        If they are a brain in a jar, meaning they have no legs, and have been commanded to answer the question directly, they will not go get anything.

scarface_74 a day ago

The paid version of ChatGPT has had a built in Python runtime for well over a year.

The [>_] links to the Python code that was run.

https://chatgpt.com/share/67b79516-9918-8010-897c-ba061a2984...

  • lblume a day ago

    The free version, at least once you are logged in, does as well. I don't pay OpenAI, and for the prompt

    "Calculate the difference between the two biggest primes less than the factorial of 20"

    it wrote the following code:

      import sympy

      # Calculate 20!
      factorial_20 = sympy.factorial(20)

      # Find the two largest primes less than 20!
      largest_prime = sympy.prevprime(factorial_20)
      second_largest_prime = sympy.prevprime(largest_prime)

      # Calculate the difference
      difference = largest_prime - second_largest_prime
      difference

    executed it, and produced the correct result, 40.

ThrowawayTestr a day ago

I just ask ChatGPT to use a script to calculate an answer.

tombert a day ago

Umm, they do though? When I use ChatGPT it will phone out to Wolfram Alpha to compute numbers and the like.

  • HarHarVeryFunny a day ago

    Yeah, but that's tool use. It's not the LLM realizing it needs a calculator, but rather a tool use prompt (instruction) causing it to predict a tool invocation sequence that is then intercepted by the model-calling scaffolding.

    With reasoning models they could choose to do it slightly differently - just tell the model what tools are available and how to invoke them, and let it decide when to use them in a more flexible fashion, rather than relying on fine-tuning to follow tool-use instructions.
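
    A minimal sketch of what that scaffolding loop amounts to (the CALC(...) convention and the call_model stub are made up for illustration; real systems use structured tool-call messages rather than regex matching):

       import re

       def call_model(transcript: str) -> str:
           # Stand-in for a real LLM call; assumed to sometimes emit "CALC(<expr>)".
           raise NotImplementedError

       def run_with_calculator(user_prompt: str) -> str:
           transcript = (
               "You may compute arithmetic by writing CALC(<expression>) "
               "and waiting for the result.\n\nUser: " + user_prompt + "\nBot:"
           )
           reply = ""
           for _ in range(5):  # cap the number of tool round-trips
               reply = call_model(transcript)
               match = re.search(r"CALC\((.+?)\)", reply)
               if not match:
                   return reply  # no tool invocation predicted; we're done
               # Intercept the predicted invocation, evaluate it, append the result,
               # and let the model continue generating from there.
               result = eval(match.group(1), {"__builtins__": {}})  # toy calculator only
               transcript += reply[: match.end()] + f" = {result}\n"
           return reply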

behnamoh a day ago

this is copium, the author doesn't have a good grasp on LLMs. you can't simply "ask" a language model to see if they know they're bad at math and then conclude that the response actually reflects the knowledge encapsulated in the model... sigh...