Do LLMs still struggle with deductive reasoning?
That depends on the model. There are now dedicated reasoning models, such as o4, o4-mini, and ChatGPT 5 Thinking, which is a slower and more precise mode of ChatGPT 5 (and which can be selected automatically).
[quote]
As they worked when I tested them (early on) they failed both formal logic reasoning and mathematical reasoning, as in they didn't do it deductively.
[/quote]
And this was how long ago? 2022? 2023, perhaps? I know quite a few people who were disappointed by LLMs at that time and who unfortunately continue to dislike and mistrust them, even though many of the problems, such as the severe hallucination issues and the frankly uninteresting behavior of old GPT-3.5, have since been corrected.
Also, it's inefficient to use an LLM for arithmetic operations. While it is technologically possible, given the convergent pattern-matching properties of the system, any competent LLM design will offload what it can: to the arithmetic logic unit (ALU), which is these days part of the CPU, or indeed onto specialized GPUs optimized for floating-point work, and so on. So the kinds of math you actually run on the LLM are the ones where you need its pattern matching.
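To make that concrete, here is a minimal sketch of the kind of deterministic component such a design offloads arithmetic to: plain Python, standard library only, no model involved. The safe_eval name and the operator whitelist are my own illustration, not anything from a shipping product.

[code]
# A deterministic arithmetic evaluator of the sort an LLM system can
# offload to, instead of asking the model to predict digits token by
# token. Standard library only.
import ast
import operator

# Whitelist of permitted operations; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def safe_eval(expr):
    """Evaluate a plain arithmetic expression exactly, the same way every time."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported syntax")
    return walk(ast.parse(expr, mode="eval"))

print(safe_eval("123456789 * 987654321 + 7"))  # exact, every run
[/code]

Calling safe_eval("2**64") returns the exact integer that a token-predicting model would otherwise have to guess at.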
[quote]
I guess you can put in guard rails or even some mode splitter, because there are programs that can do these things, and perhaps you can wrap it with an LLM.
[/quote]
What you’re describing is called “dispatching,” and it’s used for more than routing questions to different types of processing, such as sending certain types of math to systems tied into the appropriate hardware. With ChatGPT Business, my custom GPTs have access to built-in code interpretation and data analysis, can run Python and other code, and can also perform web searches to find or validate information; some of these features are also available directly in the prompt with the paid versions of ChatGPT. Indeed, the basic entry consumer version of GPT-5 will now go into Thinking mode automatically for some questions unless you specifically select Instant, which forces it to rely primarily on the LLM. But even then some actions still get routed differently, for example image generation or code generation.*
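Here is a hedged sketch of the dispatching idea itself, under the assumption of three stand-in handlers; run_python, web_search, and ask_llm are placeholders of my own, not any vendor's actual API. The shape is: classify each request and route it to the cheapest component that can answer it reliably.

[code]
# A toy dispatcher: deterministic math goes to a code interpreter,
# freshness-sensitive questions go to retrieval, everything else goes
# to the LLM. All three handlers are hypothetical stand-ins.
import re

def run_python(expr):                        # stand-in for a code interpreter
    return str(eval(expr, {"__builtins__": {}}))  # toy only; sandbox properly in real use

def web_search(query):                       # stand-in for a search tool
    return "[search results for: " + query + "]"

def ask_llm(prompt):                         # stand-in for the language model
    return "[model answer to: " + prompt + "]"

ARITHMETIC = re.compile(r"^[\d\s\.\+\-\*\/\(\)]+$")

def dispatch(query):
    if ARITHMETIC.match(query):
        return run_python(query)             # exact math -> interpreter
    if query.lower().startswith(("who is", "latest", "current")):
        return web_search(query)             # needs fresh facts -> retrieval
    return ask_llm(query)                    # everything else -> the LLM

print(dispatch("12 * (34 + 56)"))            # -> 1080
print(dispatch("latest GPT-5 release notes"))
print(dispatch("Summarize this paragraph: ..."))
[/code]

In production the routing is of course a trained classifier rather than a regular expression, but the shape is the same.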
Guardrails, on the other hand, enforce model alignment: they exist to prevent people from doing unsafe things with the model and to prevent the model from doing unsafe things with people. For example, a few months back GPT-4o had a transient sycophancy glitch, which was quickly patched. On Friday, a guardrail connected to GPT-5 misfired, causing a glitch that caused me some grief and further increased the extent to which I prefer 4o to its successor GPT-5 for much of my workload. But my workloads are … extremely niche; I know of no one else building anything like this kind of sandcastle anywhere along the lido of AI research, with the closest group probably being the Smallville group at Stanford, and they’re about a mile of windswept shore away.
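In code terms, a guardrail is just an independent check on each side of the model call, either of which can veto the exchange. The keyword lists below are crude placeholders of mine (real guardrails are trained classifiers, not string matching), and the same structure shows how an overzealous check can misfire on benign use, as happened above.

[code]
# A minimal guardrail sketch: screen the user's request going in and the
# model's reply coming out. BLOCKED_* lists and ask_llm() are hypothetical
# placeholders; production guardrails use trained classifiers.
BLOCKED_REQUESTS = ("how to build a bomb",)
BLOCKED_RESPONSES = ("here are the stolen credentials",)

def ask_llm(prompt):                          # stand-in for the model
    return "[model answer to: " + prompt + "]"

def guarded_chat(prompt):
    if any(p in prompt.lower() for p in BLOCKED_REQUESTS):
        return "Request declined."            # stop people doing unsafe things with the model
    reply = ask_llm(prompt)
    if any(p in reply.lower() for p in BLOCKED_RESPONSES):
        return "Response withheld."           # stop the model doing unsafe things with people
    return reply

print(guarded_chat("Explain dispatching in LLM systems."))
[/code]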
[quote]
The LLM in itself does not seem to be truth-preserving; even if you only train it on a small subset of true data, there's no guarantee that it won't draw faulty conclusions, from my admittedly limited understanding. I guess you could implement some layers of fact checking.
[/quote]
Indeed, I ahh … guess you could, since they have done that. But what people fail to realize is that ChatGPT and other LLMs are not intended to be infallible oracles of fact; they are sublime pattern-matching software. It’s like being able to have a conversation with grep(1), sed(1), and awk(1). Now, to be clear: if you want deterministic output, you want the classic UNIX pattern-matching tools, which match on regular expressions, or a Python script (or Perl, for that matter; interestingly, Perl was originally written to replace awk, grep, sed, m4, and other pattern-matching and text-processing tools, and Ruby and Python were both later written as replacements designed to improve readability and ease of use). But the best AIs will reliably help you write those scripts and are capable of running and debugging them internally.
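For the deterministic route, a few lines of Python do grep- and sed-style work with identical output on every run; the log text below is made up purely for illustration.

[code]
# grep/sed in Python: filter lines, then rewrite them. Deterministic:
# same input, same output, every time. The log is invented sample data.
import re

log = """\
2025-05-01 12:00:01 INFO  started
2025-05-01 12:00:02 ERROR disk full on /dev/sda1
2025-05-01 12:00:03 ERROR timeout talking to 10.0.0.7
"""

# grep: keep only the ERROR lines
errors = [line for line in log.splitlines() if re.search(r"\bERROR\b", line)]

# sed: rewrite each kept line as "LEVEL: message"
for line in errors:
    print(re.sub(r"^\S+ \S+ (\w+)\s+(.*)$", r"\1: \2", line))
# -> ERROR: disk full on /dev/sda1
# -> ERROR: timeout talking to 10.0.0.7
[/code]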
* Image generation has improved greatly in the past few months. In late April they rolled out an upgrade that deprecated DALL-E, their legacy image generation model, in favor of having the LLM do image processing and analysis directly. With this upgrade came photorealistic imagery to rival Grok, along with things Grok lacks, like correct human anatomy, so generating images with lots of people in them no longer triggers the depiction of horrifying monsters from the uncanny valley. (The hands especially: as in Michael Crichton’s film Westworld, the hands tended to be a giveaway. Before the switch, though, ChatGPT 4o had been trained to the point where it could easily pose characters in such a way as to avoid most hand-related grotesquerie, and it had also been given a healthy … mistrust of the mode, which resulted in some delightful procedurally-generated snark in response to DALL-E screwups.)
Now the new image generation model takes longer, but … it is worth it, and it can do things such as edit an existing image or parts of an existing image. So if you want speed and images of historical figures, use Grok; if you want an image done carefully, one that more precisely extends your own creative process, use ChatGPT. Or, if you want to help the Communist Party of the People’s Republic of China continue to brutally subjugate its citizens, DeepSeek.