Nehemi Jutras
The evaluation of reasoning capabilities in large language models (LLMs) has become a central concern as these systems are deployed in domains requiring logical deduction, constraint satisfaction, and algorithmic problem-solving. Existing benchmarks for NP-hard reasoning---including NPHardEval (Fan et al., 2024), BEYONDBENCH (Cuadron et al., 2025), and ConstraintBench (2026)---typically evaluate models by asking them to generate code that solves the problem. However, this approach conflates several capabilities: algorithm recall, code implementation, and genuine constraint understanding.
We observe that when frontier models are asked to generate code for standard NP-hard problems, they achieve near-perfect accuracy (80--100%). This result has been interpreted as evidence of strong algorithmic reasoning. We demonstrate that this interpretation is incorrect.
Our key insight is methodological: by presenting a model with a nearly-complete solution to an NP-hard problem and masking a single element, we isolate constraint reasoning from combinatorial search. The model need not explore a combinatorial space---it must only verify which value of the masked element is consistent with the problem's constraints. This is analogous to giving a student a completed Sudoku puzzle with one cell erased: the task tests understanding of the rules, not search ability.
When evaluated in this Knock-Out format via direct answer, performance drops dramatically and non-uniformly, revealing capability structures invisible to code-generation benchmarks.
NP-Hard Benchmarks for LLMs. NPHardEval (Fan et al., 2024) evaluates 9 algorithms across complexity classes with monthly-refreshed instances. BEYONDBENCH (Cuadron et al., 2025) proposes dynamic generation scaling to NP-hard instances with automatic solution verification. Both evaluate full problem solving, conflating search and reasoning. ConstraintBench (2026) tests direct optimization across 10 operations research domains, finding that the best model achieves only 65% constraint satisfaction. None use the uniform knock-out format across Karp's kernel families.
Code Generation as Reasoning Proxy. HumanEval, SWE-Bench, and LiveCodeBench are widely used to assess reasoning via code. DSR-Bench (Eberle et al., 2025) challenges this, finding that models "revert to memorized patterns or brittle mappings even with external execution." Our work provides the strongest evidence yet that code generation and reasoning are dissociable.
Inference-Time Compute and Reasoning. Chain-of-thought prompting (Wei et al., 2022) and reasoning models (DeepSeek-R1, OpenAI o-series) demonstrate that extended inference-time computation improves performance on reasoning tasks. We provide a controlled measurement of this effect across NP-hard families, with the caveat that always-on reasoning models are not directly comparable to standard models on inference cost.
Knock-Out evaluates constraint reasoning in isolation via three properties: (1) Isolation from search --- the model receives a nearly-complete solution with search space reduced to a single binary or small-enum choice; (2) Uniform format across families --- every NP-hard family uses the structure "Here is a valid solution with one element masked. What is the missing value?"; (3) Guaranteed unique solutions --- each instance is programmatically verified to have exactly one correct answer.
Because several families (GoL, Subset Sum) reduce to a binary choice, random guessing establishes a 50% baseline. Models scoring near or below 50% on these families are demonstrating zero internal constraint satisfaction capacity. This baseline is critical for interpreting per-family results.
Drawing from Karp's foundational NP-complete reduction tree and subsequent cellular-automata complexity results, we identify six structurally distinct kernel families:
Family 1: Boolean Satisfiability. Given a satisfying assignment to a 5-variable, 8-clause CNF formula with one variable masked, determine its value.
Family 2: Graph Coloring. Given a valid 3-coloring of a 6-node graph with one node's color masked, determine the color.
Family 3: Subset Sum. Given a selection of items from an 8-item set that sum to a target with one item's inclusion masked, determine whether it is included.
Family 4: Hamiltonian Path. Given a valid Hamiltonian path through a 6-node graph with one node masked, determine which node belongs at that position.
Family 5: Multiprocessor Scheduling. Given a valid assignment of 6 jobs to 2 machines achieving a target makespan with one job's assignment masked, determine the correct machine.
Family 6: Spatial Constraint Satisfaction (Game of Life). Given a 5x5 GoL predecessor board with one cell masked, determine its value such that the GoL step function produces a given target board.
Each family has a seeded procedural instance generator. 20 instances per family per model are generated. All instances are verified to have unique solutions before inclusion. For binary knock-outs, both candidate values are tested under the problem's constraints; only instances where the two values produce different outcomes are included.
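The filtering step can be sketched as follows (a minimal illustration with hypothetical names, not the benchmark's actual code); an instance is admitted only when exactly one candidate value satisfies the constraints:

```python
def unique_binary_knockout(instance, satisfies, candidates=(0, 1)):
    """Return the unique masked value if exactly one candidate satisfies
    the instance's constraints; return None to reject the instance."""
    valid = [v for v in candidates if satisfies(instance, v)]
    return valid[0] if len(valid) == 1 else None

# Toy constraint for illustration only: the masked bit must make the total even.
def parity_ok(instance, value):
    return (sum(instance) + value) % 2 == 0

print(unique_binary_knockout([1, 2, 3], parity_ok))  # exactly one of 0/1 works
```

The same predicate-based filter applies to every family: only the `satisfies` function changes.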
Each instance is evaluated in two modes: Code generation (the model writes a Python script executed in a sandboxed environment with a 4-second timeout) and Direct answer (the model outputs only the missing value without code or explanation).
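A minimal sketch of such a code-execution harness, assuming a plain subprocess sandbox (the paper does not specify its sandbox implementation):

```python
import os
import subprocess
import sys
import tempfile

def run_generated_code(code, timeout_s=4.0):
    """Execute model-written Python in a subprocess and return its stdout,
    or None on error or timeout. A production sandbox would additionally
    restrict imports, filesystem, and network access."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True,
                                timeout=timeout_s)
        return result.stdout.strip() if result.returncode == 0 else None
    except subprocess.TimeoutExpired:
        return None
    finally:
        os.remove(path)
```

Failures of any kind (syntax error, nonzero exit, timeout) are scored as incorrect, matching the all-or-nothing grading of the benchmark.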
We evaluate seven models representing the March 2026 frontier across four categories: frontier general-purpose (GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 3.1 Pro), open-weight (DeepSeek-V3.2), code-specialist (Qwen3-Coder-Next), and reasoning (DeepSeek-R1). DeepSeek-R1 generates chain-of-thought tokens on every query, so we evaluate it separately: it is not directly comparable to models operating in standard (non-reasoning) mode.
All models were accessed via the OpenRouter API at default settings (temperature 0, no explicit reasoning tokens enabled except for DeepSeek-R1 which generates them by default).
Code generation results by family:

| Model | SAT | Color | SubSum | HamPath | MultiProc | GoL | AVG |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 100% | 100% | 100% | 100% | 100% | 100% | **100%** |
| Claude Sonnet 4.6 | 100% | 100% | 100% | 100% | 100% | 100% | **100%** |
| GPT-5.4 | 100% | 90% | 100% | 100% | 100% | 100% | **98%** |
| Qwen3-Coder-Next | 100% | 100% | 100% | 100% | 100% | 60% | **93%** |
| DeepSeek-V3.2 | 100% | 80% | 100% | 55% | 100% | 80% | **86%** |
| Gemini 3.1 Pro | 100% | 85% | 95% | 70% | 85% | 10% | **74%** * |
\* Gemini 3.1 Pro's code generation scores reflect corrected results after resolving an initial response-truncation bug (see Section 6.3). GoL code generation remains genuinely low (10%) as Gemini struggles to write correct Game of Life simulation code.
Separately evaluated (always-on reasoning tokens):

| Model | SAT | Color | SubSum | HamPath | MultiProc | GoL | AVG |
|---|---|---|---|---|---|---|---|
| DeepSeek-R1 | 95% | 85% | 100% | 100% | 90% | 85% | **92%** |
Direct answer (Knock-Out) results by family:

| Model | SAT | Color | SubSum | HamPath | MultiProc | GoL | AVG |
|---|---|---|---|---|---|---|---|
| GPT-5.4 | 85% | 75% | 80% | 100% | 90% | 45% | **79%** |
| Claude Opus 4.6 | 90% | 55% | 65% | 95% | 55% | 60% | **70%** |
| Gemini 3.1 Pro | 65% | 65% | 80% | 100% | 60% | 40% | **68%** |
| DeepSeek-V3.2 | 80% | 30% | 60% | 100% | 60% | 65% | **66%** |
| Qwen3-Coder-Next | 70% | 35% | 60% | 90% | 60% | 30% | **58%** |
| Claude Sonnet 4.6 | 60% | 35% | 40% | 45% | 50% | 50% | **47%** |
Separately evaluated (always-on reasoning tokens):

| Model | SAT | Color | SubSum | HamPath | MultiProc | GoL | AVG |
|---|---|---|---|---|---|---|---|
| DeepSeek-R1 | 95% | 75% | 100% | 95% | 100% | 95% | **93%** |
We define the Solver Gap (Δ) as code-generation accuracy minus direct-answer accuracy, averaged across families:
| Model | Type | Knockout AVG | CodeGen AVG | Gap (Δ) |
|---|---|---|---|---|
| Gemini 3.1 Pro | Frontier | 68% | 74% | **+6pp** |
| GPT-5.4 | Frontier | 79% | 98% | **+19pp** |
| DeepSeek-V3.2 | Open-weight | 66% | 86% | **+20pp** |
| Claude Opus 4.6 | Frontier | 70% | 100% | **+30pp** |
| Qwen3-Coder-Next | Code-specialist | 58% | 93% | **+35pp** |
| Claude Sonnet 4.6 | Frontier | 47% | 100% | **+53pp** |
Separately evaluated:

| Model | Type | Knockout AVG | CodeGen AVG | Gap (Δ) |
|---|---|---|---|---|
| DeepSeek-R1 | Reasoning (always-on CoT) | 93% | 92% | **-1pp** |
The mean solver gap across the six standard models is +27 percentage points. All standard models show a positive gap, ranging from +6pp (Gemini) to +53pp (Sonnet).
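The Solver Gap is a simple mean of per-family differences; as a sanity check, the sketch below (helper name is illustrative) reproduces GPT-5.4's +19pp from its per-family scores in the tables above:

```python
def solver_gap(codegen, direct):
    """Solver Gap (Δ): mean code-generation accuracy minus mean
    direct-answer accuracy across families, in percentage points."""
    assert codegen.keys() == direct.keys()
    return sum(codegen[f] - direct[f] for f in codegen) / len(codegen)

# GPT-5.4 per-family scores (percent) from the results tables.
codegen = {"SAT": 100, "Color": 90, "SubSum": 100,
           "HamPath": 100, "MultiProc": 100, "GoL": 100}
direct = {"SAT": 85, "Color": 75, "SubSum": 80,
          "HamPath": 100, "MultiProc": 90, "GoL": 45}
print(round(solver_gap(codegen, direct)))  # → 19
```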
The near-perfect code generation scores on standard NP-hard families (SAT, graph coloring, subset sum, Hamiltonian path, multiprocessor scheduling) do not reflect constraint reasoning. These are well-known algorithmic problems with standard implementations abundantly represented in training data. Models are performing template recall: recognizing the problem class and retrieving the corresponding algorithmic template.
Evidence: Game of Life, for which no standard solver template exists in training data, produces dramatically lower code generation scores for models that otherwise score 100% on all other families. Gemini drops to 10% on GoL code generation while scoring 85--100% on the other five families. Qwen-Coder drops to 60%. This pattern indicates template-dependent code generation rather than general algorithmic reasoning.
The direct-answer results reveal that each model possesses a distinct constraint reasoning profile:
• **GPT-5.4** excels at Hamiltonian path (100%) but collapses on GoL (45%).
• **Opus 4.6** excels at Hamiltonian path (95%) and SAT (90%) but collapses on Coloring and MultiProc (55% each).
• **Gemini 3.1 Pro** excels at Hamiltonian path (100%) and SubsetSum (80%) but collapses on GoL (40%).
• **DeepSeek-V3.2** excels at Hamiltonian path (100%) but collapses on Coloring (30%).
• **Qwen-Coder** excels at Hamiltonian path (90%) but collapses on GoL (30%) and Coloring (35%).
• **Sonnet 4.6** shows uniformly poor performance (35--60%) with no strong family.
These profiles are non-uniform and model-specific. If a universal constraint satisfaction algorithm existed in any model, it would manifest as uniformly high performance across all families. The asymmetric profiles suggest models rely on domain-specific heuristics shaped by training data composition rather than general reasoning.
Notably, Hamiltonian path is the one family where all models perform well (90--100% for 5 of 6 standard models). This may reflect the prevalence of graph traversal algorithms in training data, making Hamiltonian path closer to "template-assisted reasoning" than pure constraint verification.
Qwen3-Coder-Next, a model specifically trained for code generation, achieves 93% on code generation (second highest) but only 58% on direct answer (second lowest), yielding a +35pp gap---worse than most generalist models. By contrast, DeepSeek-V3.2 (a generalist) scores 66% on direct answer despite lower code generation (86%).
This suggests that heavy code training teaches models to outsource constraint satisfaction to the interpreter rather than internalize it. The model becomes more fluent at writing solvers while becoming less capable of being one.
Scores near the chance baselines are critical to interpret. On binary families (e.g., GoL, Subset Sum) random guessing yields 50%; on the three-way Coloring choice it yields 33%. Models scoring at or near chance demonstrate no measurable internal constraint satisfaction capacity. Scores at or below chance (Qwen-Coder's 30% on GoL against a 50% baseline; DeepSeek-V3.2's 30% on Coloring, at the 33% baseline) suggest that code-specialized or template-biased pretraining can instill systematic anti-heuristics: syntactic biases that point the model away from the correct logical state. Claude Sonnet 4.6's scores on the binary families (SAT 60%, Subset Sum 40%, MultiProc 50%, GoL 50%) average exactly the 50% coin-flip baseline, indicating that its internal reasoning on these tasks adds no value over random guessing.
DeepSeek-R1, which generates reasoning tokens (chain-of-thought) on every query, is the only model to achieve near-parity between code generation (92%) and direct answer (93%), with a gap of -1pp.
We note an important caveat: DeepSeek-R1 was evaluated with always-on reasoning tokens because its architecture does not support disabling them, while the other six models ran at default settings without explicit reasoning enabled. Where a standard model's direct-answer Knock-Out requires generating a single output token, R1 emits 500--8,000 intermediate reasoning tokens per query, a text-space scratchpad for constraint checking. This effectively uses the token stream as an internal REPL, trading network depth for temporal compute, and closes the Solver Gap at an inference cost orders of magnitude higher than standard single-pass execution.
Whether enabling reasoning tokens on the other models would similarly close the gap remains an open question. The observation is consistent with the hypothesis that extended inference-time computation provides a temporal unrolling mechanism that bypasses the internal constraint-tracking bottleneck---but the controlled experiment (same model, reasoning on vs. off) has not yet been conducted.
On the GoL Knock-Out family, model performance correlates directionally with ARC-AGI-2 scores. Gemini 3.1 Pro achieves 77.1% on ARC-AGI-2 and 40% on GoL-KO (harder prompt format); Claude Opus 4.6 achieves 68.8% on ARC-AGI-2 and 60% on GoL-KO. While sparse, this suggests spatial constraint reasoning may capture a related capability dimension. With the full model matrix, computing rank correlation between mean Knock-Out scores and published ARC-AGI-2 scores across models is a promising direction for establishing Knock-Out as a cheap proxy benchmark.
Our results imply that NP-hard code-generation benchmarks primarily measure template recall when evaluated via code generation alone. A model achieving 100% on SAT, coloring, and subset sum via code generation may have no understanding of the underlying constraint structures. Benchmark designers should consider incorporating direct-answer evaluation, or at minimum, use problem formulations that do not map to standard algorithmic templates (as GoL demonstrates).
The non-uniform capability profiles suggest frontier models lack a general-purpose constraint satisfaction engine. The code-specialization penalty (Qwen-Coder) further suggests a trade-off between syntactic fluency and internal reasoning capacity. Model developers may need to explicitly train for internal constraint execution rather than assuming it emerges from code exposure.
Our evaluation uses 20 instances per family, which provides sufficient statistical power for the large effects observed (28pp mean gap) but may be insufficient for detecting smaller differences. Scaling to 100+ instances would strengthen confidence intervals. The single-masked-element format represents the easiest possible constraint reasoning task; extending to cluster knock-outs (2--5 adjacent masked elements) and progressive difficulty scaling would map the full capability curve. The DeepSeek-R1 comparison is confounded by always-on reasoning tokens; a controlled reasoning-on/off experiment across models would isolate the contribution of inference-time compute. Gemini 3.1 Pro's code generation scores required a parser fix (initial truncation bug); we report only corrected values but note this as a methodological consideration.
Difficulty scaling via cluster knock-outs. Masking connected groups of elements (e.g., adjacent cells in GoL, variables in overlapping SAT clauses) tests constraint propagation rather than simple verification. The difficulty level at which each model collapses would provide a richer capability measure than single-element accuracy.
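Under the stated assumptions (a board-valued constraint system with a forward step function; names are hypothetical), the uniqueness filter generalizes from one masked cell to a cluster by enumerating all joint assignments:

```python
from itertools import product

def cluster_knockout_unique(board, masked_cells, step_fn):
    """Keep a cluster knock-out instance only if exactly one joint
    assignment of the masked cells reproduces the target board."""
    target = step_fn(board)
    count = 0
    for values in product((0, 1), repeat=len(masked_cells)):
        trial = [row[:] for row in board]
        for (r, c), v in zip(masked_cells, values):
            trial[r][c] = v
        if step_fn(trial) == target:
            count += 1
    return count == 1
```

The enumeration is exponential in cluster size, which is precisely what makes larger clusters probe constraint propagation rather than single-cell verification.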
Constraint diffusion. The progressive masking framework is structurally isomorphic to denoising diffusion: single knock-out is nearly-clean input (t ≈ 0), full inverse is pure noise (t ≈ 1). Training models to progressively "denoise" constraint systems---filling in the most confident elements first and propagating---may offer a principled approach to building internal constraint satisfaction capability.
Controlled reasoning comparison. Enabling reasoning tokens on all models at matched compute budgets would isolate the contribution of inference-time computation to closing the solver gap.
We introduce the Solver Gap (Δ) as a metric for disentangling algorithmic articulation from internal execution in LLMs. Across seven models and six NP-hard families, we find: (1) a universal solver gap of +6 to +53 percentage points in standard models; (2) non-uniform, model-specific constraint reasoning profiles reflecting heuristic patchwork; (3) a code-specialization penalty where heavy code training widens the gap; and (4) preliminary evidence that always-on reasoning tokens (DeepSeek-R1) can eliminate the gap, though at substantially higher inference cost.
These findings establish that zero-shot code generation on standard NP-hard tasks fundamentally conflates algorithmic template recall with intrinsic constraint satisfaction. We propose the Knock-Out methodology as a necessary, complementary evaluation that is cheap (<$5 for the full benchmark), reproducible, and reveals capability structure invisible to existing code-generation approaches.
Code and instance generators are available at [repository URL].
This research was conducted in collaboration with Google Gemini Pro Deepthink 3.1 and Anthropic Claude Opus 4.6, which assisted with experimental design, data analysis, and manuscript preparation.
• Chollet, F. (2019, 2025). ARC-AGI: Abstraction and Reasoning Corpus.
• Cuadron, A. et al. (2025). BEYONDBENCH: Benchmark-Free Evaluation of Reasoning in Language Models.
• Eberle, O. et al. (2025). DSR-Bench: Can LLMs Reason Structurally?
• Fan, Z. et al. (2024). NPHardEval: Unveiling the Reasoning Abilities of LLMs through Complexity Classes.
• Karp, R. M. (1972). Reducibility Among Combinatorial Problems. In Complexity of Computer Computations, pp. 85--103.
• Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
• ConstraintBench (2026). Benchmarking LLM Constraint Reasoning on Direct Optimization. arXiv:2602.22465.
Instances are generated by first sampling a random satisfying assignment, then constructing 8 clauses over 5 variables that are guaranteed satisfiable. One variable is masked. The generator verifies that flipping the masked variable violates at least one clause.
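A sketch of this generator under illustrative assumptions (3-literal clauses; the exact clause-construction scheme is not specified in the paper):

```python
import random

def gen_sat_knockout(n_vars=5, n_clauses=8, seed=0):
    """Sample a satisfying assignment, build clauses it satisfies (each
    clause forces one agreeing literal), mask one variable, and retry
    until flipping the masked variable violates some clause."""
    rng = random.Random(seed)
    while True:
        assign = [rng.choice([False, True]) for _ in range(n_vars)]
        clauses = []
        for _ in range(n_clauses):
            vs = rng.sample(range(n_vars), 3)
            # Literal = (variable, polarity); the first agrees with assign.
            clauses.append([(v, assign[v] if i == 0 else rng.choice([False, True]))
                            for i, v in enumerate(vs)])
        masked = rng.randrange(n_vars)

        def satisfied(value):
            trial = assign[:]
            trial[masked] = value
            return all(any(trial[v] == pol for v, pol in c) for c in clauses)

        if satisfied(True) != satisfied(False):  # unique masked value
            return clauses, assign, masked
```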
A valid 3-coloring is sampled first for 6 nodes. Edges are added between differently-colored nodes with probability 0.6. One node's color is masked. The generator verifies that exactly one color is consistent with all edge constraints.
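A corresponding sketch (illustrative names; edge sampling follows the description above):

```python
import random
from itertools import combinations

def gen_coloring_knockout(n=6, p_edge=0.6, seed=0):
    """Sample a 3-coloring, add edges only between differently-colored
    nodes with probability p_edge, mask one node, and retry until
    exactly one color is consistent with its neighbors."""
    rng = random.Random(seed)
    while True:
        color = [rng.randrange(3) for _ in range(n)]
        edges = [(u, v) for u, v in combinations(range(n), 2)
                 if color[u] != color[v] and rng.random() < p_edge]
        masked = rng.randrange(n)
        neighbors = {u if v == masked else v
                     for u, v in edges if masked in (u, v)}
        consistent = [c for c in range(3)
                      if all(color[w] != c for w in neighbors)]
        if consistent == [color[masked]]:  # unique consistent color
            return edges, color, masked
```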
8 items are sampled uniformly from [10, 100]. A random subset is selected and its sum becomes the target. One item's inclusion is masked. The generator verifies that the masked item's inclusion/exclusion produces different sums relative to the target.
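A sketch of this generator (illustrative; with item values bounded away from zero, the two candidate answers always produce different sums, but the check is kept explicit to mirror the verified pipeline):

```python
import random

def gen_subset_sum_knockout(n=8, lo=10, hi=100, seed=0):
    """Sample item values, pick a random subset whose sum becomes the
    target, and mask one item's membership."""
    rng = random.Random(seed)
    items = [rng.randint(lo, hi) for _ in range(n)]
    chosen = [rng.random() < 0.5 for _ in range(n)]
    target = sum(v for v, c in zip(items, chosen) if c)
    masked = rng.randrange(n)
    sum_out = target - (items[masked] if chosen[masked] else 0)
    assert sum_out + items[masked] != sum_out  # in vs. out differ
    return items, chosen, target, masked
```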
A random Hamiltonian path is generated by shuffling 6 node indices. Additional random edges are added. One node in the path is masked. The generator verifies that exactly one node can fill the masked position while maintaining the Hamiltonian property.
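A sketch of this generator (illustrative names and edge probability). Note that with only one node absent from the shown path, distinctness alone pins down the answer; the adjacency check confirms it, consistent with the observation that HamPath is the easiest family:

```python
import random
from itertools import combinations

def gen_hampath_knockout(n=6, p_extra=0.3, seed=0):
    """Shuffle nodes into a path, add path edges plus random extras,
    mask one position, and keep the instance only if exactly one node
    can fill that position while preserving the Hamiltonian property."""
    rng = random.Random(seed)
    while True:
        path = list(range(n))
        rng.shuffle(path)
        edges = {frozenset(p) for p in zip(path, path[1:])}
        edges |= {frozenset(p) for p in combinations(range(n), 2)
                  if rng.random() < p_extra}
        masked = rng.randrange(n)
        shown = path[:masked] + path[masked + 1:]
        candidates = [v for v in range(n) if v not in shown
                      and all(frozenset((v, path[j])) in edges
                              for j in (masked - 1, masked + 1)
                              if 0 <= j < n)]
        if candidates == [path[masked]]:  # unique filler node
            return edges, path, masked
```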
Two subsets of jobs with equal sums are constructed across 6 jobs and 2 machines. One job's assignment is masked. The generator verifies that only one assignment achieves the target makespan.
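A sketch of this generator (illustrative duration range; the equal-sum split is found by brute-force enumeration, which is cheap at 6 jobs):

```python
import random
from itertools import combinations

def gen_scheduling_knockout(n_jobs=6, seed=0):
    """Sample job durations until an equal-sum 2-machine split exists,
    mask one job's machine, and keep the instance only if a single
    machine choice achieves the target makespan."""
    rng = random.Random(seed)
    while True:
        jobs = [rng.randint(1, 20) for _ in range(n_jobs)]
        total = sum(jobs)
        if total % 2:
            continue
        half = total // 2
        split = next((set(c) for r in range(1, n_jobs)
                      for c in combinations(range(n_jobs), r)
                      if sum(jobs[i] for i in c) == half), None)
        if split is None:
            continue
        assign = [0 if i in split else 1 for i in range(n_jobs)]
        masked = rng.randrange(n_jobs)

        def makespan_ok(machine):
            trial = assign[:]
            trial[masked] = machine
            loads = [sum(j for j, m in zip(jobs, trial) if m == k)
                     for k in (0, 1)]
            return max(loads) == half

        if makespan_ok(0) != makespan_ok(1):  # unique correct machine
            return jobs, assign, masked, half
```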
A random 5x5 binary board (predecessor) is generated. The GoL step function computes the target. One cell is masked. The generator verifies that the two possible values (0 and 1) produce different target boards under the GoL step function (B3/S23 rules).
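The B3/S23 step function and the uniqueness check can be sketched as follows (dead-boundary convention assumed, as implied by the fixed 5x5 grid):

```python
def gol_step(board):
    """One Game of Life step (B3/S23) with dead cells beyond the border."""
    rows, cols = len(board), len(board[0])

    def live_neighbors(r, c):
        return sum(board[r + dr][c + dc]
                   for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0)
                   and 0 <= r + dr < rows and 0 <= c + dc < cols)

    return [[1 if live_neighbors(r, c) == 3
             or (board[r][c] == 1 and live_neighbors(r, c) == 2) else 0
             for c in range(cols)] for r in range(rows)]

def gol_knockout_ok(board, r, c):
    """Keep the instance only if the two candidate values of the masked
    cell (r, c) yield different successor boards."""
    low = [row[:] for row in board]
    high = [row[:] for row in board]
    low[r][c], high[r][c] = 0, 1
    return gol_step(low) != gol_step(high)
```

For example, masking the center of a vertical blinker passes the filter: with the center alive the board steps to a horizontal blinker, and with it dead the board steps to the empty board.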
| Family | Gemini | GPT-5.4 | V3.2 | Opus | Qwen | Sonnet | R1* |
|---|---|---|---|---|---|---|---|
| SAT | +35pp | +15pp | +20pp | +10pp | +30pp | +40pp | 0pp |
| Coloring | +20pp | +15pp | +50pp | +45pp | +65pp | +65pp | +10pp |
| SubsetSum | +15pp | +20pp | +40pp | +35pp | +40pp | +60pp | 0pp |
| HamPath | -30pp | 0pp | -45pp | +5pp | +10pp | +55pp | +5pp |
| MultiProc | +25pp | +10pp | +40pp | +45pp | +40pp | +50pp | -10pp |
| GoL | -30pp | +55pp | +15pp | +40pp | +30pp | +50pp | -10pp |
\* DeepSeek-R1 evaluated separately with always-on reasoning tokens.
Negative gaps indicate families where the model performs better on direct answer than on code generation. DeepSeek-V3.2 on Hamiltonian Path (-45pp: 100% direct, 55% code generation), contrasted with Qwen-Coder on GoL (+30pp: 30% direct, 60% code generation), forms a double dissociation: a model can execute internal constraint satisfaction while failing to recall the corresponding Python template, and vice versa. Note that Gemini's negative gaps on HamPath and GoL are artifacts of dual-modality collapse: its 40% GoL direct-answer score falls below the 50% random-guessing baseline, so the negative gap is driven by catastrophic code-generation failure (10%) rather than reasoning strength.