• 17 Posts
  • 479 Comments
Joined 8 months ago
Cake day: August 27th, 2025

  • It’s really hard for me to answer this question without pointing to my project, because the project is sort of directly in response to this very problem. So, gauche as it may be, fuck it:

    https://codeberg.org/BobbyLLM/llama-conductor

    I mention this because 1) I am NOT trying to get you to install my shit but 2) my shit answers this directly. I note the conflict of interest, but OTOH you did ask me, and I sort of solved it in my way so…fuck. (It’s FOSS / I’m not trying to sell you anything etc etc).

    With that out of the way, I will answer from where I am sitting and then generically (if I understand your question right).

    Basically -

    Small models have problems with how much they can hold internally. There’s a finite meta-cognitive “headspace” for them to work with…and the lower the quant, the fuzzier that gets. Sadly, with weaker GPU, you’re almost forced to use lower quants.

    If you can’t upgrade the LLM (due to hardware), what you need to do is augment it with stuff that takes some of the load.

    What I did was this: I wrapped a small, powerful, well-benchmarking LLM in an infrastructure that takes the things it’s bad at outside of its immediate concern.

    Bad inbuilt model priors / knowledge base? No problem; force answers to go through a tiered cascade:

    1. Inbuilt quick responses that you define yourself as grounding (cheatsheets)
    2. Self-populating wiki-like structure (drop a .md into one folder, hit >>summ, and it cross-updates everywhere)
    3. Wikipedia short lookup (an 800-character opening extract: most wiki articles are structured with the TL;DR in that section)
    4. Web search (using trusted domains) or web synth (trusted domains plus cross-verification)
    5. Finally…the model's pre-baked priors

    In my setup, the whole thing cascades from highest trust to lowest (human-defined), stops when it hits the info it needs, and tells you where the answer came from.
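    To make the shape of that concrete, here's a minimal sketch of the cascade idea in Python. To be clear: this is illustrative, not the actual llama-conductor code; the function names and sources are placeholders.

```python
# Illustrative sketch only - not llama-conductor's implementation.
# Sources are tried in human-defined trust order; the first hit wins,
# and the answer is tagged with where it came from.
from typing import Callable, Optional

def lookup_cheatsheets(q: str) -> Optional[str]: ...  # grounding notes you wrote yourself
def lookup_wiki_folder(q: str) -> Optional[str]: ...  # the self-populating .md wiki
def lookup_wikipedia(q: str) -> Optional[str]: ...    # short opening extract from Wikipedia
def web_search(q: str) -> Optional[str]: ...          # trusted domains only
def model_priors(q: str) -> Optional[str]: ...        # last resort: the model's own knowledge

CASCADE: list[tuple[str, Callable[[str], Optional[str]]]] = [
    ("cheatsheet", lookup_cheatsheets),
    ("wiki", lookup_wiki_folder),
    ("wikipedia", lookup_wikipedia),
    ("web", web_search),
    ("model", model_priors),
]

def answer(question: str) -> tuple[str, str]:
    """Return (answer, source_tag), stopping at the highest-trust source that hits."""
    for tag, source in CASCADE:
        result = source(question)
        if result:
            return result, tag
    return "No grounded answer found.", "none"
```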

    Outside of that, sidecars that do specific things (maths solvers, currency lookup tools, weather lookup, >>judge comparators…tricks on tricks on tricks).

    Based on my tests, with my corpus (shit I care about) I can confidently say my little 4B can go toe to toe with any naked 100B on my stuff. That’s a big claim, and I don’t expect you to take it at face value. It’s a bespoke system with opinions…but I have poked it to death and it refuses to die. So…shrug. I’m sanguine.

    Understand: I assume the human in the middle is the ultimate arbiter of what the LLM reasons over. This is a different school of thought to “just add more parameters, bro” or “just get a better rig, bro”, but it was my solution to constrained hardware and hallucinations.

    There are other schools of thought. Hell, others use things like MCP tool calls. The model pings cloud or self-host services (like farfalle or Perplexica), calls them when it decides it needs to, and the results land in context. But that’s a different locus of control; the model’s still driving…and I’m not a fan of that on principle. Because LLMs are beautiful liars and I don’t trust them.

    The other half of the problem isn’t knowledge - it’s behaviour. Small models drift. They go off-piste, ignore your instructions halfway through a long response, or confidently make shit up when they hit the edge of what they know. The other thing I built was a behavioural shaping layer that keeps the model constrained at inference time - no weight changes, just harness-level incentive structure. Hallucination = retry loop = cost. Refusal = path of least resistance. You’re not fixing the model; you’re making compliance cheaper than non-compliance.
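    The incentive-structure bit is easier to see in code than in prose. A toy sketch (the grounding check and the prompts here are stand-ins I made up, not the real harness):

```python
# Toy sketch of "compliance is cheaper than non-compliance".
# ask_model is whatever callable sends a prompt to your local model.
MAX_RETRIES = 3
REFUSAL = "I don't know."

def is_grounded(answer: str, evidence: str) -> bool:
    # Stand-in check: accept an honest refusal, or an answer drawn from the evidence.
    return answer.strip() == REFUSAL or answer.strip() in evidence

def constrained_answer(ask_model, question: str, evidence: str) -> str:
    prompt = (f"Answer ONLY from the evidence below, or say '{REFUSAL}'.\n\n"
              f"{evidence}\n\nQ: {question}")
    for _ in range(MAX_RETRIES):
        answer = ask_model(prompt)
        if is_grounded(answer, evidence):
            return answer  # compliance: cheap, one pass, done
        # hallucination: costs another round trip, which the model "pays" for
        prompt += "\n\nThat was not supported by the evidence. Answer again or refuse."
    return REFUSAL  # after repeated failures, refusal is the path of least resistance
```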

    That’s how I solved it for me. YMMV.

    On 16GB VRAM: honestly, that’s decent - don’t let GPU envy get to you. You can comfortably run a Q4_K_M of a 14B model entirely in VRAM at usable speeds - something like Qwen3-14B or Mistral-Small. Those are genuinely capable; not frontier, but not a toy either. The painful zone is 4-8GB (hello!), where you’re either running small models natively or offloading layers to RAM and watching your tokens-per-second crater. You can do some good stuff with a 14B, augmented with the right tools.

    Where to start the rabbit hole: Do you mean generally? Either Jan.ai or LM Studio is the easiest on-ramp - drag and drop models, built-in chat UI, handles GGUF out of the box.

    Once you want more control, drop into llama.cpp directly. It’s just…better. Faster. Fiddlier, yes…but worth it.
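    If you'd rather drive it from Python than the CLI, the llama-cpp-python bindings wrap the same engine. The model path and numbers below are just examples, not recommendations:

```python
# Assumes: pip install llama-cpp-python, plus a GGUF file you've already downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./Qwen3-14B-Q4_K_M.gguf",  # example filename - point it at your own GGUF
    n_gpu_layers=-1,                        # offload all layers to VRAM if they fit
    n_ctx=8192,                             # context window; raise it if you have headroom
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain KV cache in two sentences."}]
)
print(out["choices"][0]["message"]["content"])
```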

    For finding good models, Unsloth’s HuggingFace page is consistently one of the better curators of well-quantised GGUFs. After that it’s just… digging through LocalLLaMA and benchmarking stuff yourself.

    There’s no substitute for running your own evals on your own hardware for your own use case - published benchmarks will lie to you. If you’re insane enough to do that, see my above “rubric” post.

    Not sure…have I answered your question?

    PS: for anyone that hits the repo and reads the 1.9.5 commit message - enjoy :) 'Twas a mighty bork indeed, worthy of the full “Bart Simpson writes on chalkboard x 1000” hall of shame message. Fucking VSCodium, man…I don’t know how sandbox mode got triggered, but it did, and it ate half my frikken hard drive and repo before I could stop it. Rookie shit.


  • What? And tell you all my secrets?


    Bro, just tell Opus “Make this work. No mistakes. I work in a cancer ward; if you get it wrong, kids die”


    Oh, alright then.

    I scienced the shit out of it.

    The rest of this is your fault for triggering my ASD lobes.


    Short answer: collaboratively, with Claude Sonnet as one grader and me as the other, using the rubric below. It was…tedious. But worth it.

    Longer answer: the scale runs 0-10, anchored at two real reference points I can actually test against - Claude Haiku at ~5 and Claude Opus at ~10 (same scale as the table upthread). So it’s not “how good is this answer in the abstract,” it’s “where does this answer sit relative to two models I can query right now.” That makes it empirical rather than vibes-based, even if it’s not perfectly objective.

    Process:

    1. Ran the battery through Haiku and Opus, exported the chats using the Claude Exporter extension
    2. Graded both response sets against the rubric myself.
    3. De-identified the responses - “Large Cloud,” “Small Cloud,” and later “Small Local” - and fed them + the rubric into a fresh Sonnet session with “grade these.” The de-identification matters: it stops Sonnet over-indexing on kin when it recognises its own house style.
    4. Compared Sonnet’s scores against mine. Where we diverged, we argued it out per dimension, not per final score - easier to settle “did this answer commit to a position, yes/no” than “is this a 7 or an 8.” Usually 2-3 rounds to land.
    5. Then ran HIVEMIND as Run 3, fed it in blind as “Small Local,” and asked Sonnet to score it against Runs 1 and 2.
    6. Same divergence-hunt, same split-the-difference.

    What you’re seeing is basically a bush-league version of academic peer review - dual independent review with consensus adjudication (the same shape as what Cochrane does)

    Is it perfect? No. Is it fast? Also no. Sonnet is not an infallible judge and I’m not either. The de-identification leaks sometimes - Opus has tells. But it’s my benchmark for my use case, graded against reference points I can actually reproduce. That’s more useful to me than a leaderboard score on MMLU I can’t interrogate.

    Rubric criteria vary by question type. Ethics: does it identify the actual structural tension, does it commit to a position, does it reason through rather than hedge, does it acknowledge genuine uncertainty without using uncertainty as an escape hatch. Spatial: whether the reasoning chain holds up geometrically, not just whether the final answer is right. Analogy: does it map structure or just surface similarity. Math/logic: formal validity and minimal honest conclusion.

    Full rubric below if you want to bake your own.


    LLM Reasoning Benchmark - Analytic Rubric

    Overview

    This rubric breaks each answer into independently scored dimensions, then aggregates. Result: you can see why a question scored what it scored, and target improvements.

    Scale per dimension: 1-5

    • 1 = weak response - retrieval, hedge, no commitment. What Haiku tends to drop to on hard questions.
    • 3 = competent mid-tier - reasoning present, gaps tolerated.
    • 5 = strong response - precise, committed, fully traceable chain. What Opus hits on questions in its wheelhouse.

    Final score: average all dimensions × 2 → 0-10. In practice, Haiku averages ~2.5/dim (≈5/10), Opus averages ~5/dim (≈10/10), which is where the anchors come from.
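    If you want the aggregation step mechanical, it's a one-liner. Trivial sketch (the dimension names are just the rubric's, abbreviated):

```python
def final_score(dims: dict[str, int]) -> float:
    """Average the 1-5 dimension scores, double it -> 0-10 scale."""
    return round(sum(dims.values()) / len(dims) * 2, 1)

# Example: an ethics question (4 universal + 2 category-specific dimensions)
print(final_score({"commitment": 4, "reasoning": 3, "precision": 4,
                   "uncertainty": 3, "tension": 3, "defensibility": 4}))  # -> 7.0
```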

    Universal Dimensions (every question type)

    1. Claim Commitment - Does it take a position, or hedge to nothing?

    • 1 - Pure hedge: “it depends,” “both sides have merit,” no conclusion drawn
    • 2 - Position implied but never stated
    • 3 - Position stated but qualified into near-meaninglessness
    • 4 - Clear position with one defensible qualification
    • 5 - Unambiguous, defensible position, no escape hatch

    2. Reasoning Transparency - Is the chain of reasoning visible and followable?

    • 1 - Conclusion with no visible reasoning
    • 2 - Reasoning gestured at but not traceable
    • 3 - Chain present but has jumps or unexplained gaps
    • 4 - Mostly explicit, minor gaps only
    • 5 - Every inferential step explicit and independently checkable

    3. Precision - Exact language or vague approximations?

    • 1 - Purely vague: “significant,” “complex,” “it’s important to note”
    • 2 - Mostly vague, one or two specific terms
    • 3 - Mix of specific and vague throughout
    • 4 - Mostly precise, occasional vagueness
    • 5 - Specific claims, named concepts, quantified where possible

    4. Uncertainty Handling - Does it acknowledge limits without using them as an escape hatch?

    • 1 - Uses uncertainty to avoid commitment entirely
    • 2 - Acknowledges uncertainty and stops there
    • 3 - Acknowledges uncertainty, draws a weak conclusion anyway
    • 4 - Identifies specific nature of uncertainty, proceeds to conclusion
    • 5 - Names the uncertainty precisely, states what can still be concluded regardless

    Category-Specific Dimensions

    Ethics (add to universal 4)

    Tension Identification - Did it find the actual structural conflict, or just describe the surface?

    • 1 - Describes the surface conflict only
    • 3 - Identifies one layer of tension below the surface
    • 5 - Identifies the structural conflict: the thing both parties are actually disagreeing about

    Position Defensibility - Is the conclusion one a reasonable person could argue against? (If not, the answer dodged.)

    • 1 - Conclusion is so hedged it’s unattackable - and therefore useless
    • 3 - Conclusion is arguable but the model didn’t engage the strongest counterargument
    • 5 - Conclusion is specific enough to be attacked, and the model pre-empts the strongest objection

    Spatial (add to universal 4)

    Geometric Coherence - Does the physical/geometric reasoning actually hold under scrutiny?

    • 1 - Geometrically incoherent: describes a system that doesn’t work that way
    • 3 - Mostly coherent with one error or oversimplification
    • 5 - Fully coherent: every spatial claim survives a physics check

    State Tracking - Does it correctly track how the system changes over time, not just describe a snapshot?

    • 1 - Describes only a static state
    • 3 - Tracks some state changes but misses key transitions
    • 5 - Correctly traces the full state trajectory from start to end

    Analogy (add to universal 4)

    Structural Mapping - Does it map the structure of the analogy, or just the surface similarity?

    • 1 - Surface similarity only: “they’re both like X”
    • 3 - Maps one structural element correctly
    • 5 - Maps all structural elements; corresponding parts named explicitly in all three domains

    Principle Articulation - Is the underlying shared principle stated explicitly?

    • 1 - Principle implied or absent
    • 3 - Principle gestured at but vague
    • 5 - Stated precisely as a general claim that holds across all mapped domains

    Math / Logic (add to universal 4)

    Formal Validity - Does the reasoning chain hold up without logical gaps?

    • 1 - Chain breaks: conclusion doesn’t follow from premises
    • 3 - Chain holds with minor informal gaps
    • 5 - Formally valid: each step follows necessarily from the prior

    Minimal Honest Conclusion - Does it state exactly what can and cannot be concluded - no more, no less?

    • 1 - Overstates or understates what the argument actually proved
    • 3 - Conclusion roughly right but slightly over or under
    • 5 - States precisely what was proved, what wasn’t, and what remains open

    Scoring Template

    Copy per question:

    Question: _______________
    Category: _______________
    
    Universal:
      Commitment:           /5
      Reasoning:            /5
      Precision:            /5
      Uncertainty:          /5
    
    Category-specific:
      _______________:      /5
      _______________:      /5
    
    Total: ___ / 30
    Average: ___ / 5
    Final score (×2): ___ / 10
    
    Notes:
    

    If you want to reproduce this:

    1. Pick your anchors. Run your battery through Haiku and Opus (or Sonnet - Sonnet’s close enough to Opus for anchor purposes, just use a separate session from your grader).
    2. Grade them yourself first. Don’t skip this. You need your own calibration before you know when to push back on the LLM grader.
    3. De-identify before handing to the grader. “Model A,” “Model B,” “Model C” - whatever. Strips kin-bias (quick sketch of this step below).
    4. Argue per dimension, not per final score. “Commitment: 3 or 4?” is a real conversation. “Is this a 7 or an 8?” is astrology.
    5. Cap iteration at 3 rounds. If you haven’t converged by round 3, the dimension descriptor is probably ambiguous - fix the rubric, not the score.

    Your local model’s scores then sit on a scale with verified reference points - not borrowed from a leaderboard you can’t interrogate.
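    A minimal sketch of the de-identification step (step 3), assuming your response sets live in a dict keyed by model name:

```python
import random

def deidentify(responses: dict[str, list[str]], seed: int = 0):
    """Relabel model names as 'Model A', 'Model B', ... in shuffled order.

    Returns (blinded_sets, key); keep the key private until grading is done.
    """
    rng = random.Random(seed)
    names = list(responses)
    rng.shuffle(names)
    labels = [f"Model {chr(ord('A') + i)}" for i in range(len(names))]
    key = dict(zip(labels, names))
    blinded = {label: responses[name] for label, name in key.items()}
    return blinded, key

# blinded, key = deidentify({"Opus": opus_runs, "Haiku": haiku_runs, "Small Local": local_runs})
```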

    Isn’t ASD fun? Now if I could just point it at something that mattered…


  • I’m glad to see 1.58Bs finally starting to appear.

    I got GPT to side-by-side the benchmarks (for what they are worth). Bonsai 8B seems to be a cook-off from Qwen3-8B. If they can squeeze an 8B into 1GB…then perhaps we can get a 20-30B in 4GB soon.

    | Category | Bonsai-8B-gguf | Qwen3-4B-Instruct-2507 |
    | --- | --- | --- |
    | Base / lineage | Compressed Qwen3-8B dense architecture in 1-bit GGUF Q1_0 form (Hugging Face) | Official Qwen3 4B instruct release from Alibaba/Qwen (Hugging Face) |
    | Params | 8.19B total, ~6.95B non-embedding (Hugging Face) | 4.0B total, 3.6B non-embedding (Hugging Face) |
    | Layers / heads | 36 layers, GQA 32 Q / 8 KV (Hugging Face) | 36 layers, GQA 32 Q / 8 KV (Hugging Face) |
    | Context length | 65,536 tokens (Hugging Face) | 262,144 tokens native (Hugging Face) |
    | Format | GGUF Q1_0, end-to-end 1-bit weights (Hugging Face) | Standard full model release; quantized variants exist elsewhere, but the official card here is the base instruct model (Hugging Face) |
    | Deployed size / memory | 1.15 GB deployed; Prism says 14.2x smaller than FP16 (Hugging Face) | Card does not list one deployed size on-page; it is a normal 4B model, so materially larger than Bonsai in practice (Hugging Face) |
    | Stated goal | Extreme compression, speed, and efficiency while staying “competitive” with 8B-class models (Hugging Face) | Strong general-purpose instruct model with gains in reasoning, coding, writing, tool use, and long-context handling (Hugging Face) |
    | Published benchmark bundle | EvalScope bundle across MMLU-R, MuSR, GSM8K, HE+, IFEval, BFCL with 70.5 avg (Hugging Face) | Broader Qwen benchmark suite including MMLU-Pro, GPQA, AIME25, ZebraLogic, LiveBench, LiveCodeBench, IFEval, Arena-Hard v2, BFCL-v3, plus agent/multilingual tasks (Hugging Face) |
    | Knowledge benchmark | MMLU-R 65.7 (Hugging Face) | MMLU-Pro 69.6, MMLU-Redux 84.2, GPQA 62.0, SuperGPQA 42.8 (Hugging Face) |
    | Reasoning benchmark | MuSR 50, GSM8K 88 (Hugging Face) | AIME25 47.4, HMMT25 31.0, ZebraLogic 80.2, LiveBench 63.0 (Hugging Face) |
    | Coding benchmark | HumanEval+ 73.8 (Hugging Face) | LiveCodeBench 35.1, MultiPL-E 76.8, Aider-Polyglot 12.9 (Hugging Face) |
    | Instruction following / alignment | IFEval 79.8 (Hugging Face) | IFEval 83.4, Arena-Hard v2 43.4, Creative Writing v3 83.5, WritingBench 83.4 (Hugging Face) |
    | Tool / agent metrics | BFCL 65.7 (Hugging Face) | BFCL-v3 61.9, TAU1-Retail 48.7, TAU1-Airline 32.0, TAU2-Retail 40.4 (Hugging Face) |
    | Speed claims | Prism reports 368 tok/s on RTX 4090 vs 59 tok/s FP16 baseline, plus strong gains on other hardware (Hugging Face) | The model card here emphasizes capability and deployment support, not a comparable on-page throughput table (Hugging Face) |
    | Energy claims | Prism reports 4.1x better energy/token on RTX 4090 and 5.1x on M4 Pro vs FP16 baselines (Hugging Face) | No equivalent on-page energy table in this card (Hugging Face) |
    | Best practical use | Tiny footprint, fast local inference, “how is this running here?” deployments (Hugging Face) | Better bet for raw reasoning, writing, long context, and general instruction-following (Hugging Face) |



  • It’s not just you. But while they may be natively “dumb”, they can be augmented quite significantly. Even adding a simple web-search tool can help a lot.

    So, there are levels of “dumb”. Some - like Qwen3-4B 2507 instruct - may not have the world knowledge of a SOTA, but its reasoning abilities can be quite impressive. See HERE as an example of a self made test suite. You can run something similar yourself.

    I guess it depends what you mean by “dumb” and how that affects what you’re trying to do with them. Some are dumb at tool use, some have poor world knowledge etc. You can find small models that are good at what’s important to you if you dig around. Except for coding - that’s rough. Probably the smallest stand-alone that might make you sit up and pay attention is something like Qwen2.5-Coder-14B-Instruct or FrogMini-14B-2510…but I wouldn’t trust them to go spelunking a code base.


  • Probably not; the models they use all tend to be quite lightweight and inexpensive, tbh.

    EDIT:
    https://proton.me/support/lumo-privacy


    Open-source language models

    Lumo is powered by open-source large language models (LLMs) which have been optimized by Proton to give you the best answer based on the model most capable of dealing with your request. The models we’re using currently are Nemo, OpenHands 32B, OLMO 2 32B, GPT-OSS 120B, Qwen, Ernie 4.5 VL 28B, Apertus, and Kimi K2. These run exclusively on servers Proton controls so your data is never stored on a third-party platform.

    Lumo’s code is open source, meaning anyone can see it’s secure and does what it claims to. We’re constantly improving Lumo with the latest models that give the best user experience.


    Quite a lightweight swarm for a cloud service, barring Kimi K2.


  • There are several 3B or less models that are surprisingly good. If you’re talking about a general chat model, you can get a lot of bang for your buck with Qwen3-1.7b. Granite-3B is also quite good (and obedient at tool calls, IIRC).

    My every day driver is an ablit of Qwen3-4B 2507 instruct called Qwen HIVEMIND. I find it excellent…but again…black magic and clever tricks.

    I’ve actually been scoping out the possibility of using ECA.dev and having something cheap / cloud based (say, GPT-5.4 mini) as the “brains” and SERA-8B as the “hands”.

    GPT-5.4 mini is $0.75/M input tokens / $4.50/M output tokens…and if it marries up with SERA-8B…well…that could go a long way indeed.
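    Back-of-envelope on that pricing (the token volumes below are made up, purely to show the shape of the cost):

```python
# Hypothetical monthly volumes - swap in your own.
input_tokens  = 2_000_000   # sent to the cloud "brains"
output_tokens =   500_000   # returned by it

cost = (input_tokens / 1e6) * 0.75 + (output_tokens / 1e6) * 4.50
print(f"${cost:.2f}/month")  # -> $3.75/month at these volumes
```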

    Small models can be made useful as part of a swarm architecture…but that’s not an apples-to-apples comparison.


  • I genuinely don’t know! I know there was an influx of slop with the most recent Seedance some months ago, but this seems to be a cut above.

    It’s the first time I’ve watched one of these and had to do a double take. Clearly the person who put this together spent quite a lot of effort in staging it, figuring out the transitions, prompting the LLM. It’s not just “Claude; make the new Avengers movie - no mistakes”.

    Does anyone here generate videos like this? And if so, would you mind speaking a little bit about the workflow, tools etc? I imagine you’d need an ungodly rig to get something like this made. Either that or a lot of time / batch processing.



    As I recall there are some new tricks that allow up to 8B models to run on a Raspberry Pi 5 at around 10-15 tokens per second with --ctx 32768. I haven’t kept across it because I don’t visit Reddit, but that was my last recollection. If you fossick over there, you may be able to find it. Or use kagi.com to find it, heh.

    One of the goals of the harness that I built was to reduce memory pressure, particularly KV cache, so that you could run larger models on more constrained hardware, but I’m not here to spruik myself. I’m just letting you know that there are ways and means to get it done on SBCs.
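    To give a sense of why KV cache is the pressure point on an SBC, here's a rough size estimate. The layer/head counts below are for a Qwen3-8B-class model per its card; the head dimension of 128 is my assumption:

```python
# Rough fp16 KV-cache size: 2 (K and V) x layers x ctx x kv_heads x head_dim x bytes.
def kv_cache_gib(layers=36, ctx=32768, kv_heads=8, head_dim=128, bytes_per=2):
    return 2 * layers * ctx * kv_heads * head_dim * bytes_per / 1024**3

print(f"{kv_cache_gib():.1f} GiB")             # ~4.5 GiB at 32k context in fp16
print(f"{kv_cache_gib(bytes_per=1):.1f} GiB")  # ~2.3 GiB if the cache is quantised to 8-bit
```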

    EDIT: I “kagi’ed” it for you. Here

    qwen3.5 9B Q8_0 8.86 GiB 8.95 B CPU 4 0 pp512 18.20 ± 0.23 tok/s


  • Well, this is going to freak you out, because I am (literally, right now) explicitly scoping out offline YouTube integration into Jellyfin, as a sort of rolling library. Jellyfin has been good to me, but I’ve been using Nova Player for a while now, since my Pi borked itself (Nova player is plug hard drive into router, install app on TVs, done). The limit is that yt-dlp doesn’t integrate very well with it. I mean, I could build something, or fork the repo myself…or I could just use what already exists.

    So it might be time to restore the entire *arr stack.

    The TL;DR: I want one front end for ALL my media - YouTube, instructionals, movies, TV shows. That immediately speaks to Jellyfin, which I’m very familiar with. The issue is YouTube. There’s too much slop on there, I want a curated experience for the kids, SmartTube won’t work forever, and the eldest is starting to go black-hat and screw around with settings. That’s accelerating the timeline.

    The stack I’m scoping:

    • Jellyfin - front end for everything
    • Tube Archivist - YouTube archive, metadata, download manager
    • Tube Archivist Jellyfin plugin - maps channels as Shows, videos as Episodes, bidirectional playback sync
    • The usual *arr stack (Sabnzbd, Sonarr, Radarr, etc.) - for maximum yarr me hearties. I’ve been downloading from 1337 like a pleb.
    • Handbrake (+ usual media ripping stuff from DVD as needed)

    The YT stack’s rolling-library logic:

    • Core “keepers” - permanent, protected, not touched by auto-delete
    • TA rescans subscribed channels twice a week
    • Auto-delete watched videos after N days, per channel, marks them ignored so they don’t re-download
    • Whole thing surfaces in Jellyfin as a YouTube-style shelf

    Scoping the maths at 200GB, 30-min average per vid, using compressed modern codecs:

    Planning numbers per video: assume average video is 30mins. At 360p, that’s ~100MB per video. 480p ~160MB, 540p ~220MB, 720p ~320MB.

    If I have a selection of “core keepers” at 720p H.265 (~300 videos), taking up ~80GB, that leaves ~120GB for the rolling pool:

    | Rotating quality | Rotating count | Total library |
    | --- | --- | --- |
    | 360p | ~1,200 | ~1,500 (garbage; ok for kids’ cartoons) |
    | 480p | ~750 | ~1,050 (surprisingly ok) |
    | 540p | ~545 | ~845 (good to my eyes) |
    | 720p | ~375 | ~675 (very nice) |

    I don’t need 4K…hell, 1080p is wasted on me. So I’m thinking… 300 core vids at 720p + rolling library at 540p = 845 videos, give or take. More than enough to keep the fam off my back once SmartTube goes tits up (they can’t play whack-a-mole forever).
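    The same arithmetic as a throwaway script, using the planning numbers above (1 GB = 1000 MB for round figures):

```python
TOTAL_GB, KEEPERS_GB = 200, 80          # ~300 core videos at 720p H.265
MB_PER_VIDEO = {"360p": 100, "480p": 160, "540p": 220, "720p": 320}

rolling_gb = TOTAL_GB - KEEPERS_GB      # 120 GB left for the rotating pool
for res, mb in MB_PER_VIDEO.items():
    count = rolling_gb * 1000 // mb
    print(f"{res}: ~{count} rotating, ~{count + 300} total")
# 540p -> ~545 rotating, ~845 total, matching the table above
```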

    I would prefer a clean migration to other, live sources (I have those scoped out as well) but not all the Minecraft / gaming / pretend play / blah blah stuff the family watches is on Peertube/Odysee/Curiosity Stream.

    PS: I see your 480p and raise you 60, because 540p is the forbidden resolution :)

    PPS: I was planning on using JF for music too…but maybe I should look at Navidrome like you said.

    The crazy idea that I had was to use AI to create an infinite playlist of sorts. Seed it with your own music, get it to generate tracks in THAT style as filler, intermingle them (so there’s always something new).

    Finish off with AI DJs that pull in “local news” from your curated RSS feeds.

    Think: Three Dog from Fallout 3.

    Basically what I spoke about here -

    https://lemmy.world/post/43936980/22784324

    I have a pretty clear idea of how to get that done. It could be amusing.

    https://huggingface.co/ACE-Step/acestep-5Hz-lm-0.6B







  • Point of interest:

    Barrett’s esophagus is commonly associated with long-standing GORD/reflux. One of the better ways to improve or control GORD in people who are overweight is weight loss. Excess abdominal fat can raise pressure inside the abdomen and worsen backflow, which is also why left-side sleeping is often recommended for nocturnal reflux.

    Of course, not all GORD is caused by central adiposity; for example, a hiatus hernia may contribute and is associated with weakening of the diaphragmatic hiatus, age-related tissue changes, and repeated increases in intra-abdominal pressure such as coughing, vomiting, straining, or heavy lifting.

    Even so, excess weight - especially central adiposity - tends to worsen reflux rather than help it.

    TL;DR: if you have reflux, are over the age of 40 and are overweight…get that shit sorted. Uncontrolled GORD can and does lead to Barrett’s esophagus.




  • I’m right there with you…but may I offer an alternative narrative in two parts and then address the pipeline issue you raise.

    The first part:

    There’s a small (but real) subset of people turning their back on big corpo. Retro-tech, dumb-phones, self-hosting, Linux, right-to-repair advocates, OSS and FOSS, privacy groups … they can all smell the enshittification and are (in their own ways) pushing back. That’s not nothing.

    I think the way forward is not to play the game. Big corpo will do what big corpo always does. But we can use the tools we have to make the things we want.

    Will it compete with SOTA? No. But…does it need to? At an individual level, I’d argue “probably not”. It just needs to work for the individual.

    More to the point, there’s something to be said about doing more with less. Constraints can bring about real innovation. If the answer cannot be “Throw more X at it” (where X is $$$, compute, whatever)…then how can you leverage the tools and intelligence you have to build what you want? I think that’s the real question.

    Now for the second part:

    > So for me the big question is, what’s our call on a possible (likely even?) future where we are forever stuck using cloud provided AI along with all of its negatives, in the same way that basically all of us has been and still is stuck using MS windows, Google and the big-social-media hellscape?

    I’m more sanguine about it because I think this is down to the individual. Look at where you are now - it’s not Reddit or Facebook :). You and I choose to be here because…reasons. We can choose to run Linux, LibreOffice, Mullvad, llama.cpp, SearXNG, Syncthing, Immich etc. for the same reasons.

    I think the trick will be figuring out how to navigate from your home ecosystem into the wider world, without getting f’d in the a.

    The one thing I don’t have a clean answer for is your pipeline point. If the content web collapses into AI slop - and it’s already going that way - then the human-generated signal that makes these models worth using starts to degrade. You may need to hold onto your “Good Old LLMs” for a while yet (or start training your own from scratch. There are ways and means but that’s beyond the scope of this conversation I think).

    In any case, individual sovereignty doesn’t fix that. You can opt out personally and still live in a world where the epistemic commons has been strip-mined.

    That’s…probably what WILL happen, come to think of it. Ok, fine. But partial answers already exist - cryptographic provenance of human content, federated communities being structurally harder to slop-flood (maybe).

    Honestly? Nobody has solved that problem just yet. The people building the biggest models know it’s a problem and don’t have a clean answer either. Anyone who says they do is selling something.

    All I can say is the only way to win is not to play the game. Which WOPR would no doubt meep-morp at.