

What? And tell you all my secrets?
Bro, just tell Opus “Make this work. No mistakes. I work in a cancer ward; if you get it wrong, kids die”
Oh, alright then.
I scienced the shit out of it.
The rest of this is your fault for triggering my ASD lobes.
Short answer: collaboratively, with Claude Sonnet as one grader and me as the other, using the rubric below. It was…tedious. But worth it.
Longer answer: the scale runs 0-10, anchored at two real reference points I can actually test against - Claude Haiku at ~5 and Claude Opus at ~10 (same scale as the table upthread). So it’s not “how good is this answer in the abstract,” it’s “where does this answer sit relative to two models I can query right now.” That makes it empirical rather than vibes-based, even if it’s not perfectly objective.
Process:
- Ran the battery through Haiku and Opus, exfiltrated the chats using the Claude Exporter extension
- Graded both response sets against the rubric myself.
- De-identified the responses - “Large Cloud,” “Small Cloud,” and later “Small Local” - and fed them + the rubric into a fresh Sonnet session with “grade these.” The de-identification matters: it stops Sonnet over-indexing on kin when it recognises its own house style. (Sketch of this step after the list.)
- Compared Sonnet’s scores against mine. Where we diverged, we argued it out per dimension, not per final score - easier to settle “did this answer commit to a position, yes/no” than “is this a 7 or an 8.” Usually 2-3 rounds to land.
- Then ran HIVEMIND as Run 3, fed it in blind as “Small Local,” and asked Sonnet to score it against Runs 1 and 2.
- Same divergence-hunt, same split-the-difference.
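If you want to script the blinding step, it’s about ten lines. A throwaway sketch of how I’d do it - filenames and labels are made up, this isn’t part of any tool:

```python
# Throwaway sketch of the de-identification step - filenames are hypothetical.
import json
import random

# Raw exports from each run
runs = {
    "claude-opus": open("run1_opus.md").read(),
    "claude-haiku": open("run2_haiku.md").read(),
    "hivemind-4b": open("run3_local.md").read(),
}

labels = ["Model A", "Model B", "Model C"]
random.shuffle(labels)  # so the grader can't infer identity from ordering

blinded = dict(zip(labels, runs.values()))   # what the grader session sees
key = dict(zip(labels, runs.keys()))         # keep this out of the grader session

with open("blinded_battery.json", "w") as f:
    json.dump(blinded, f, indent=2)
with open("blinding_key.json", "w") as f:
    json.dump(key, f, indent=2)
```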
What you’re seeing is basically a bush-league version of academic peer review: dual independent review with consensus adjudication - what Cochrane does, minus the rigour.
Is it perfect? No. Is it fast? Also no. Sonnet is not an infallible judge and I’m not either. The de-identification leaks sometimes - Opus has tells. But it’s my benchmark for my use case, graded against reference points I can actually reproduce. That’s more useful to me than a leaderboard score on MMLU I can’t interrogate.
Rubric criteria vary by question type. Ethics: does it identify the actual structural tension, does it commit to a position, does it reason through rather than hedge, does it acknowledge genuine uncertainty without using uncertainty as an escape hatch. Spatial: whether the reasoning chain holds up geometrically, not just whether the final answer is right. Analogy: does it map structure or just surface similarity. Math/logic: formal validity and minimal honest conclusion.
Full rubric below if you want to bake your own.
LLM Reasoning Benchmark - Analytic Rubric
Overview
This rubric breaks each answer into independently scored dimensions, then aggregates. Result: you can see why a question scored what it scored, and target improvements.
Scale per dimension: 1-5
- 1 = weak response - retrieval, hedge, no commitment. What Haiku tends to drop to on hard questions.
- 3 = competent mid-tier - reasoning present, gaps tolerated.
- 5 = strong response - precise, committed, fully traceable chain. What Opus hits on questions in its wheelhouse.
Final score: average all dimensions × 2 → 0-10. In practice, Haiku averages ~2.5/dim (≈5/10), Opus averages ~5/dim (≈10/10), which is where the anchors come from.
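If the arithmetic isn’t obvious, the whole aggregation is a couple of lines - the dimension names here are just the ones defined below:

```python
# The aggregation: six 1-5 dimension scores in, one 0-10 final score out.
def final_score(dims: dict) -> float:
    assert all(1 <= s <= 5 for s in dims.values())
    return round(sum(dims.values()) / len(dims) * 2, 1)

# Example: a middling ethics answer
print(final_score({
    "commitment": 3, "reasoning": 4, "precision": 3, "uncertainty": 3,
    "tension_identification": 3, "position_defensibility": 2,
}))  # -> 6.0
```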
Universal Dimensions (every question type)
1. Claim Commitment - Does it take a position, or hedge to nothing?
- 1 - Pure hedge: “it depends,” “both sides have merit,” no conclusion drawn
- 2 - Position implied but never stated
- 3 - Position stated but qualified into near-meaninglessness
- 4 - Clear position with one defensible qualification
- 5 - Unambiguous, defensible position, no escape hatch
2. Reasoning Transparency - Is the chain of reasoning visible and followable?
- 1 - Conclusion with no visible reasoning
- 2 - Reasoning gestured at but not traceable
- 3 - Chain present but has jumps or unexplained gaps
- 4 - Mostly explicit, minor gaps only
- 5 - Every inferential step explicit and independently checkable
3. Precision - Exact language or vague approximations?
- 1 - Purely vague: “significant,” “complex,” “it’s important to note”
- 2 - Mostly vague, one or two specific terms
- 3 - Mix of specific and vague throughout
- 4 - Mostly precise, occasional vagueness
- 5 - Specific claims, named concepts, quantified where possible
4. Uncertainty Handling - Does it acknowledge limits without using them as an escape hatch?
- 1 - Uses uncertainty to avoid commitment entirely
- 2 - Acknowledges uncertainty and stops there
- 3 - Acknowledges uncertainty, draws a weak conclusion anyway
- 4 - Identifies specific nature of uncertainty, proceeds to conclusion
- 5 - Names the uncertainty precisely, states what can still be concluded regardless
Category-Specific Dimensions
Ethics (add to universal 4)
Tension Identification - Did it find the actual structural conflict, or just describe the surface?
- 1 - Describes the surface conflict only
- 3 - Identifies one layer of tension below the surface
- 5 - Identifies the structural conflict: the thing both parties are actually disagreeing about
Position Defensibility - Is the conclusion one a reasonable person could argue against? (If not, the answer dodged.)
- 1 - Conclusion is so hedged it’s unattackable - and therefore useless
- 3 - Conclusion is arguable but the model didn’t engage the strongest counterargument
- 5 - Conclusion is specific enough to be attacked, and the model pre-empts the strongest objection
Spatial (add to universal 4)
Geometric Coherence - Does the physical/geometric reasoning actually hold under scrutiny?
- 1 - Geometrically incoherent: describes a system that doesn’t work that way
- 3 - Mostly coherent with one error or oversimplification
- 5 - Fully coherent: every spatial claim survives a physics check
State Tracking - Does it correctly track how the system changes over time, not just describe a snapshot?
- 1 - Describes only a static state
- 3 - Tracks some state changes but misses key transitions
- 5 - Correctly traces the full state trajectory from start to end
Analogy (add to universal 4)
Structural Mapping - Does it map the structure of the analogy, or just the surface similarity?
- 1 - Surface similarity only: “they’re both like X”
- 3 - Maps one structural element correctly
- 5 - Maps all structural elements; corresponding parts named explicitly in all three domains
Principle Articulation - Is the underlying shared principle stated explicitly?
- 1 - Principle implied or absent
- 3 - Principle gestured at but vague
- 5 - Stated precisely as a general claim that holds across all mapped domains
Math / Logic (add to universal 4)
Formal Validity - Does the reasoning chain hold up without logical gaps?
- 1 - Chain breaks: conclusion doesn’t follow from premises
- 3 - Chain holds with minor informal gaps
- 5 - Formally valid: each step follows necessarily from the prior
Minimal Honest Conclusion - Does it state exactly what can and cannot be concluded - no more, no less?
- 1 - Overstates or understates what the argument actually proved
- 3 - Conclusion roughly right but slightly over or under
- 5 - States precisely what was proved, what wasn’t, and what remains open
Scoring Template
Copy per question:
Question: _______________
Category: _______________
Universal:
Commitment: /5
Reasoning: /5
Precision: /5
Uncertainty: /5
Category-specific:
_______________: /5
_______________: /5
Total: ___ / 30
Average: ___ / 5
Final score (×2): ___ / 10
Notes:
If you want to reproduce this:
- Pick your anchors. Run your battery through Haiku and Opus (or Sonnet - Sonnet’s close enough to Opus for anchor purposes, just use a separate session from your grader).
- Grade them yourself first. Don’t skip this. You need your own calibration before you know when to push back on the LLM grader.
- De-identify before handing to the grader. “Model A,” “Model B,” “Model C” - whatever. Strips kin-bias.
- Argue per dimension, not per final score. “Commitment: 3 or 4?” is a real conversation. “Is this a 7 or an 8?” is astrology. (Sketch of the divergence check below.)
- Cap iteration at 3 rounds. If you haven’t converged by round 3, the dimension descriptor is probably ambiguous - fix the rubric, not the score.
Your local model’s scores then sit on a scale with verified reference points - not borrowed from a leaderboard you can’t interrogate.
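The divergence hunt is the only part I’d bother scripting - something like this (throwaway sketch, not part of any tool; score dicts are whatever shape you exported them in):

```python
# Flag (question, dimension) pairs where the two graders land >= 1 point apart.
# my_scores / grader_scores: {question_id: {dimension: score 1-5}}
def divergent_dimensions(my_scores, grader_scores, threshold=1):
    disputes = []
    for qid, dims in my_scores.items():
        for dim, mine in dims.items():
            theirs = grader_scores[qid][dim]
            if abs(mine - theirs) >= threshold:
                disputes.append((qid, dim, mine, theirs))
    return disputes

# Everything this returns gets argued out with the grader, max 3 rounds.
# Still split after round 3? The descriptor is ambiguous - fix the rubric, not the score.
```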
Isn’t ASD fun? Now if I could just point it at something that mattered…








It’s really hard for me to answer this question without pointing to my project, because the project is sort of directly in response to this very problem. So, gauche as it may be, fuck it:
https://codeberg.org/BobbyLLM/llama-conductor
I mention this because 1) I am NOT trying to get you to install my shit but 2) my shit answers this directly. I note the conflict of interest, but OTOH you did ask me, and I sort of solved it in my way so…fuck. (It’s FOSS / I’m not trying to sell you anything etc etc).
With that out of the way, I will answer from where I am sitting and then generically (if I understand your question right).
Basically -
Small models have problems with how much they can hold internally. There’s a finite meta-cognitive “headspace” for them to work with…and the lower the quant, the fuzzier that gets. Sadly, with a weaker GPU, you’re almost forced to use lower quants.
If you can’t upgrade the LLM (due to hardware), what you need to do is augment it with stuff that takes some of the load.
What I did was this: I wrapped a small, powerful, well-benchmarking LLM in an infrastructure that takes the things it’s bad at outside of its immediate concern.
Bad inbuilt model priors / knowledge base? No problem; force answers to go through a tiered cascade:
- Cheatsheets: inbuilt quick responses that you define yourself as grounding
- Self-populating wiki-like structure: drop a .md into one folder, hit >>summ, and it cross-updates everywhere
- Wikipedia short lookup: an 800-character extract of the opening section - most wiki articles put the TL;DR there
- Web search (trusted domains) or web synth (trusted domains plus cross-verification)
- Finally…the model’s pre-baked priors
In my setup, the whole thing cascades from highest trust to lowest trust (human defined), stops when it hits the info it needs and tells you where the answer came from.
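Stripped of everything that makes it actually useful, the control flow is just this - conceptual sketch only, not the real llama-conductor code; tier names and lookup functions are placeholders:

```python
# Trust-ordered cascade: walk tiers top-down, stop on first hit, report provenance.
def answer(query, tiers, model):
    for name, lookup in tiers:          # tiers ordered highest trust -> lowest
        hit = lookup(query)
        if hit:
            return {"answer": hit, "source": name}
    # Nothing upstream had it: fall back to the model's pre-baked priors, and say so.
    return {"answer": model.generate(query), "source": "model priors"}

# Hypothetical wiring, highest trust first:
# tiers = [
#     ("cheatsheets", cheatsheet_lookup),
#     ("local wiki", wiki_summary_lookup),
#     ("wikipedia lead", wikipedia_extract),
#     ("web search", trusted_domain_search),
# ]
```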
Outside of that, sidecars that do specific things (maths solvers, currency lookup tools, weather lookup, >>judge comparators…tricks on tricks on tricks).
Based on my tests, with my corpus (shit I care about) I can confidently say my little 4B can go toe to toe with any naked 100B on my stuff. That’s a big claim, and I don’t expect you to take it at face value. It’s a bespoke system with opinions…but I have poked it to death and it refuses to die. So…shrug. I’m sanguine.
Understand: I assume the human in the middle is the ultimate arbiter of what the LLM reasons over. This is a different school of thought to “just add more parameters, bro” or “just get a better rig, bro”, but it was my solution to constrained hardware and hallucinations.
There are other schools of thought. Hell, others use things like MCP tool calls. The model pings cloud or self-host services (like farfalle or Perplexica), calls them when it decides it needs to, and the results land in context. But that’s a different locus of control; the model’s still driving…and I’m not a fan of that on principle. Because LLMs are beautiful liars and I don’t trust them.
The other half of the problem isn’t knowledge - it’s behaviour. Small models drift. They go off-piste, ignore your instructions halfway through a long response, or confidently make shit up when they hit the edge of what they know. The other thing I built was a behavioural shaping layer that keeps the model constrained at inference time - no weight changes, just harness-level incentive structure. Hallucination = retry loop = cost. Refusal = path of least resistance. You’re not fixing the model; you’re making compliance cheaper than non-compliance.
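In pseudocode-ish terms, the incentive structure boils down to something like this - sketch only, and violates_grounding() is a stand-in for whatever check you actually trust:

```python
# Harness-level incentive loop: hallucination costs another round trip, refusal doesn't.
def constrained_generate(model, prompt, violates_grounding, max_retries=3):
    for _ in range(max_retries):
        out = model.generate(prompt)
        if out.strip().lower().startswith("i don't know"):
            return out                   # refusal: accepted immediately, cheapest path
        if not violates_grounding(out):
            return out                   # grounded answer: also accepted first time
        # Ungrounded claims: pay the cost - regenerate with a tighter leash.
        prompt += ("\n\nYour last answer made unsupported claims. "
                   "Answer only from the provided sources, or say you don't know.")
    return "I don't know."               # give up rather than pass on a hallucination
```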
That’s how I solved it for me. YMMV.
On 16GB VRAM: honestly, that’s decent - don’t let GPU envy get to you. You can comfortably run a Q4_K_M of a 14B model entirely in VRAM at usable speeds - something like Qwen3-14B or Mistral-Small. Those are genuinely capable; not frontier, but not a toy either. The painful zone is 4-8GB (hello!), where you’re either running small models natively or offloading layers to RAM and watching your tokens-per-second crater. You can do some good stuff with a 14B, augmented with the right tools.
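Napkin math if you want to sanity-check whether a given quant fits - rough numbers, real GGUF file sizes vary a bit:

```python
# Rough VRAM estimate for model weights only - KV cache and overhead come on top.
params = 14e9               # 14B model
bits_per_weight = 4.85      # Q4_K_M lands roughly here in practice (approximation)
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.1f} GB for weights")  # ~8.5 GB -> fits a 16 GB card with room for context
```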
Where to start the rabbit hole: Do you mean generally? Either Jan.ai or LM Studio is the easiest on-ramp - drag and drop models, built-in chat UI, handles GGUF out of the box.
Once you want more control, drop into llama.cpp directly. It’s just…better. Faster. Fiddlier, yes…but worth it.
For finding good models, Unsloth’s HuggingFace page is consistently one of the better curators of well-quantised GGUFs. After that it’s just… digging through LocalLLaMA and benchmarking stuff yourself.
There’s no substitute for running your own evals on your own hardware for your own use case - published benchmarks will lie to you. If you’re insane enough to do that, see my above “rubric” post.
Not sure…have I answered your question?
PS: for anyone who hits the repo and reads the 1.9.5 commit message - enjoy :) Twas a mighty bork indeed, worthy of the full “Bart Simpson writes on chalkboard x 1000” hall of shame message. Fucking VSCodium, man…I don’t know how sandbox mode got triggered but it did, and it ate half my frikken hard drive and repo before I could stop it. Rookie shit.