[{"data":1,"prerenderedAt":609},["ShallowReactive",2],{"navigation":3,"\u002Fblog\u002Flocal-ai-upgrade":30,"posts":578},[4],{"title":5,"path":6,"stem":7,"children":8,"page":29},"Blog","\u002Fblog","blog",[9,13,17,21,25],{"title":10,"path":11,"stem":12},"I Spent $5K on GPUs Just to Learn That One GPU Was Enough","\u002Fblog\u002Flocal-ai-rig","blog\u002Flocal-ai-rig",{"title":14,"path":15,"stem":16},"I Spent a Saturday Trying to Replace My Local AI to Save on Tokens. The Server Voted No.","\u002Fblog\u002Flocal-ai-upgrade","blog\u002Flocal-ai-upgrade",{"title":18,"path":19,"stem":20},"Building a Roguelike Game with Amazon Q","\u002Fblog\u002Fq-roguelike","blog\u002Fq-roguelike",{"title":22,"path":23,"stem":24},"I Spent 22 Years Programming Just to Fail at Making a Skeleton Swing a Sword","\u002Fblog\u002Fthundoria-architecture","blog\u002Fthundoria-architecture",{"title":26,"path":27,"stem":28},"Build First - Learn Later","\u002Fblog\u002Fvectly-scaling","blog\u002Fvectly-scaling",false,{"id":31,"title":14,"author":32,"body":33,"date":564,"description":565,"draft":29,"extension":566,"image":567,"meta":568,"navigation":569,"path":15,"seo":570,"sitemap":571,"stem":16,"tags":573,"updated":576,"__hash__":577},"blog\u002Fblog\u002Flocal-ai-upgrade.md","Tony Costanzo",{"type":34,"value":35,"toc":549},"minimark",[36,41,68,72,76,84,87,90,93,97,100,103,136,139,142,146,149,162,165,169,176,186,202,205,209,216,219,222,229,232,235,238,242,245,263,266,273,277,280,283,286,289,292,295,298,302,314,317,320,340,348,351,355,358,492,495,499,502,522,529,532,536,539,546],[37,38,40],"h2",{"id":39},"links","Links",[42,43,44,54,61],"ul",{},[45,46,47],"li",{},[48,49,53],"a",{"href":50,"rel":51},"https:\u002F\u002Fwww.techhivelabs.net\u002Fblog\u002Flocal-ai-rig",[52],"nofollow","The original rig writeup",[45,55,56],{},[48,57,60],{"href":58,"rel":59},"https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm\u002Fissues\u002F38643",[52],"vLLM hybrid attention bug (#38643)",[45,62,63],{},[48,64,67],{"href":65,"rel":66},"https:\u002F\u002Fgithub.com\u002Fnoonghunna\u002Fqwen36-dual-3090",[52],"noonghunna\u002Fqwen36-dual-3090",[37,69,71],{"id":70},"the-cope-that-started-it-all","The Cope That Started It All",[73,74,75],"p",{},"I run my entire AI stack locally to save on token costs.",[73,77,78,79,83],{},"That sentence is technically true and economically deranged. I dropped five grand on a 4x RTX 3090 EPYC server (a story for ",[48,80,82],{"href":50,"rel":81},[52],"another post","), I burn enough power to heat my office better than the actual heating system, and I tell my wife with a straight face that the math works out. The math does not work out. The math will never work out. I would have to refuse Claude tokens for the rest of my natural life to break even on this thing.",[73,85,86],{},"But the model that's been quietly carrying my agent backend for weeks -- Qwen3.5-122B-A10B, 122 billion total parameters with 10 billion active per token, 256K context, multimodal, ~26 tok\u002Fs decode -- runs flawlessly on my closet jet engine. And that genuinely rules.",[73,88,89],{},"Then Qwen3.6 dropped.",[73,91,92],{},"And I made the classic mistake of assuming the year-old model in production needed to be replaced.",[37,94,96],{"id":95},"the-plan-past-tense","The Plan (Past Tense)",[73,98,99],{},"Qwen3.6-27B's benchmarks looked filthy. 77.2% on SWE-Bench Verified, within shouting distance of Claude Opus 4.6. Smaller, faster, more efficient. Designed for exactly the kind of homelab shenanigans I keep getting myself into.",[73,101,102],{},"The migration plan was clean:",[42,104,105,112,118,124,130],{},[45,106,107,111],{},[108,109,110],"strong",{},"Main brain:"," Qwen3.6-27B in FP8 via vLLM, tensor parallel across two 3090s",[45,113,114,117],{},[108,115,116],{},"Vision sidecar:"," Qwen3.6-35B-A3B for the Frigate camera analysis",[45,119,120,123],{},[108,121,122],{},"Frees a GPU"," vs the 122B's CPU-offload pattern",[45,125,126,129],{},[108,127,128],{},"Same multimodal coverage"," as the current stack",[45,131,132,135],{},[108,133,134],{},"Faster decode"," in theory",[73,137,138],{},"I had a Saturday. I had a freshly downloaded model. I had coffee.",[73,140,141],{},"What's the worst that could happen.",[37,143,145],{"id":144},"wall-one-the-hybrid-attention-tax","Wall One: The Hybrid Attention Tax",[73,147,148],{},"Every modern Qwen3 model uses hybrid attention, a 3:1 ratio of Gated DeltaNet layers (linear attention, fast, light) to standard Gated Attention layers. On paper it's elegant. In practice, inference engines are still catching up.",[73,150,151,152,156,157,161],{},"vLLM 0.19.0 + Qwen3.6-27B FP8 hits a ",[48,153,155],{"href":58,"rel":154},[52],"tensor format mismatch in the FLA path",". The bug doesn't crash gracefully, it just produces gibberish. You ask the model what 2+2 is and it confidently answers in slashes. 0.19.1 ships.. still broken. Adding ",[158,159,160],"code",{},"--enforce-eager"," makes it stable, then drops decode to 3-4 tok\u002Fs because cudagraphs get disabled. At that speed I might as well type the response myself.",[73,163,164],{},"Yikes.",[37,166,168],{"id":167},"wall-two-24gb-is-a-lie","Wall Two: 24GB Is a Lie",[73,170,171,172,175],{},"Tried loading ",[158,173,174],{},"Qwen3.6-35B-A3B-AWQ"," on a single 3090 for the Frigate sidecar. AWQ weights are ~17.5GB. Should fit comfortably in 24GB. It doesn't.",[177,178,183],"pre",{"className":179,"code":181,"language":182},[180],"language-text","torch.OutOfMemoryError: CUDA out of memory.\nGPU 0 has a total capacity of 23.80 GiB of which 182.19 MiB is free.\n","text",[158,184,181],{"__ignoreMap":185},"",[73,187,188,189,192,193,192,196,192,199,201],{},"vLLM's profile_run pre-allocates worst-case multimodal buffers during warmup, and the vision encoder balloons to ~23GB before the model even starts running. I tweaked every flag I knew.. ",[158,190,191],{},"--gpu-memory-utilization 0.85",", ",[158,194,195],{},"--max-model-len 16384",[158,197,198],{},"--max-num-seqs 8",[158,200,160],{},". Same answer every time: 24GB consumer card, AWQ, vision encoder.. pick two.",[73,203,204],{},"Fell back to llama.cpp + GGUF for that one (which fit fine at ~22GB). Then realized Frigate genai analysis was a \"nice to have\" rather than a must-have, and disabled it entirely. Who needs to know when the UPS guy is spending too much time at my frontdoor anyway.",[37,206,208],{"id":207},"wall-three-quantized-kv-cache-speaks-only-in-slashes","Wall Three: Quantized KV Cache Speaks Only in Slashes",[73,210,211,212,215],{},"Tried llama.cpp on Qwen3.6-27B Q5_K_M with my usual flags: ",[158,213,214],{},"-ctk q4_0 -ctv q4_0",". Same flags I run on the 122B every day with zero issues.",[73,217,218],{},"Asked the model \"What is 2+2?\"",[73,220,221],{},"The decode rate was perfect, the tokenization was clean, the entire pipeline was confidently producing garbage.",[73,223,224,225,228],{},"It replied: ",[158,226,227],{},"\u002F\u002F\u002F\u002F\u002F\u002F\u002F\u002F\u002F\u002F\u002F\u002F\u002F\u002F\u002F\u002F\u002F\u002F"," at a brisk 37 tok\u002Fs.",[73,230,231],{},"Q8 KV did the same thing. Turns out Qwen3.6 only rotates 64 of its 256 head dimensions, and the scoring math the KV-quant code uses leans on exactly the rotated portion. Quantize what holds 75% of the cell content and the model goes blind.",[73,233,234],{},"Switched to BF16 KV. Output came back as English. Decode held up at small context, but at 65K+ I started catching CUDA errors during prefill activations. Practical ceiling on dual 3090s with BF16 KV: about 128K context... half of what the 122B handles without breaking a sweat.",[73,236,237],{},"Strike three.",[37,239,241],{"id":240},"wall-four-a-hermes-naming-gotcha","Wall Four: A Hermes Naming Gotcha",[73,243,244],{},"Nous Research recommends their Hermes 4 fine-tunes for the kind of agent loop I'm running. There are two of them:",[42,246,247,253],{},[45,248,249,252],{},[108,250,251],{},"Hermes 4 35B-A3B"," -- actual MoE, 3B active, fine-tuned from Qwen3.5-35B-A3B",[45,254,255,258,259,262],{},[108,256,257],{},"Hermes 4.3 36B"," -- dense, 36 billion parameters, fine-tuned from ",[108,260,261],{},"ByteDance Seed"," (not Qwen, surprise)",[73,264,265],{},"The \"A3B\" doesn't appear in the file or repo names, and I assumed wrong. Pulled the dense one. 22GB Q4_K_M. Loaded fine. Worked great at \u003C2K context, hung the moment I tried 8K.",[73,267,268,269,272],{},"But the real lesson here was the KV cache math. 65 layers all running full attention means ~33GB of KV at 128K context regardless of FFN sparsity. ",[108,270,271],{},"MoE saves compute, not KV memory."," I'd been quietly wrong about that for months. Embarrassing.",[37,274,276],{"id":275},"wall-five-the-ipmi-lockup","Wall Five: The IPMI Lockup",[73,278,279],{},"This is the one I'd like to forget.",[73,281,282],{},"I'd seen MiniMax M2.5 doing the rounds.. 230B-A10B agentic model, 80.2% on SWE-Bench Verified, designed end-to-end for tool use. Different model family entirely, no Qwen3 hybrid attention bugs to inherit. The 10B active footprint meant the CPU-offload pattern would still work on my 128GB of DDR4. 95GB download. I got more coffee.",[73,284,285],{},"I forgot to stop the 122B service before launching M2.5.",[73,287,288],{},"Two simultaneous CPU-offload models meant 62GB (122B experts) + 99GB (M2.5 experts) trying to share 128GB of physical RAM. The kernel started swap-thrashing. SSH stopped responding. The fans got loud. Then the fans got quiet, which is somehow worse.",[73,290,291],{},"I had to IPMI-reboot the box.",[73,293,294],{},"(For the non-homelab readers: IPMI is the \"remote-press the physical reset button\" interface. Using it is not a flex. Using it because you stacked two MoE models on top of each other and ate all the RAM is, frankly, embarrassing.)",[73,296,297],{},"After a clean reboot with the 122B properly stopped, M2.5 ran. Decode at 1K context: 26 tok\u002Fs. Decode at 128K context: 5.9 tok\u002Fs. That's about one word per second. Useful for crossword puzzles, less useful for a multi-turn agent loop. Plus, I confirmed this the hard way, the open weights are text-only. So much for replacing the multimodal pipeline.",[37,299,301],{"id":300},"wall-six-the-community-recipe-speedrun","Wall Six: The Community Recipe Speedrun",[73,303,304,305,309,310,313],{},"There's a ",[48,306,308],{"href":65,"rel":307},[52],"GitHub repo"," by ",[158,311,312],{},"noonghunna",", a community-engineered stack with custom vLLM patches that allegedly gets Qwen3.6-27B running on dual 3090s at ~50 tok\u002Fs with 100K context. DigitalSpaceport showed it running in a YouTube video. I had to try.",[73,315,316],{},"Required prerequisite: NVIDIA driver 580.x (I was on 570). Did the upgrade in-place via apt. Clean swap, all five GPUs detected on first boot, every existing service came back. The driver upgrade was the one universally good thing to come out of the entire day. Hold that thought.",[73,318,319],{},"Then I tried each compose variant the repo ships:",[42,321,322,328,334],{},[45,323,324,327],{},[108,325,326],{},"Default fp8 build:"," booted, threw the same vLLM hybrid attention bug at 65K+ context. Turns out the default variant doesn't apply the patches that fix it.",[45,329,330,333],{},[108,331,332],{},"Turbo build with Genesis patches:"," mounted a file the repo doesn't actually contain (Docker silently created an empty placeholder, lol). Once I no-op stubbed it, the engine crashed importing a vLLM symbol that no longer exists in current nightly.",[45,335,336,339],{},[108,337,338],{},"DFlash speculative decoding build:"," the drafter model is gated on HuggingFace, the setup script doesn't tell you, the first download silently produces 36KB of metadata, you accept the gate, redownload, then watch the engine wedge for forty straight minutes during cudagraph capture before you give up and kill it.",[73,341,342,343,347],{},"The repo's README is honest about all of this: ",[344,345,346],"em",{},"\"this stack tracks vllm:nightly rather than pinning to a tested digest.\""," Translation: this worked end-to-end for about 48 hours after each commit. By the time you find it, half the patches reference symbols that have moved.",[73,349,350],{},"Community recipes rot. Fast.",[37,352,354],{"id":353},"what-i-actually-tested","What I Actually Tested",[73,356,357],{},"The day's burn-down chart, in the form of a table I'd rather not look at again:",[359,360,361,377],"table",{},[362,363,364],"thead",{},[365,366,367,371,374],"tr",{},[368,369,370],"th",{},"Stack",[368,372,373],{},"Engine",[368,375,376],{},"Result",[378,379,380,396,407,418,428,439,449,459,470,481],"tbody",{},[365,381,382,388,391],{},[383,384,385],"td",{},[108,386,387],{},"Qwen3.5-122B-A10B (production)",[383,389,390],{},"llama.cpp + cpu-moe",[383,392,393],{},[108,394,395],{},"Stable. 26 tok\u002Fs @ 256K. Untouched.",[365,397,398,401,404],{},[383,399,400],{},"Qwen3.6-27B FP8",[383,402,403],{},"vLLM 0.19.x",[383,405,406],{},"Gibberish, or 3-4 tok\u002Fs with eager mode",[365,408,409,412,415],{},[383,410,411],{},"Qwen3.6-27B Q5_K_M",[383,413,414],{},"llama.cpp BF16 KV",[383,416,417],{},"OOM at 65K",[365,419,420,422,425],{},[383,421,411],{},[383,423,424],{},"llama.cpp q4_0 KV",[383,426,427],{},"Confident gibberish",[365,429,430,433,436],{},[383,431,432],{},"Qwen3.6-35B-A3B AWQ",[383,434,435],{},"vLLM single 3090",[383,437,438],{},"Vision encoder OOM",[365,440,441,444,446],{},[383,442,443],{},"Hermes 4.3-36B (dense, not Qwen)",[383,445,414],{},[383,447,448],{},"Hangs at 8K",[365,450,451,454,456],{},[383,452,453],{},"MiniMax M2.5 230B-A10B",[383,455,390],{},[383,457,458],{},"Slow at long ctx, text-only",[365,460,461,464,467],{},[383,462,463],{},"Qwen3.6-27B AutoRound (default)",[383,465,466],{},"vLLM nightly",[383,468,469],{},"Same FLA bug",[365,471,472,475,478],{},[383,473,474],{},"Qwen3.6-27B AutoRound (Turbo)",[383,476,477],{},"vLLM nightly + Genesis",[383,479,480],{},"Genesis ImportError",[365,482,483,486,489],{},[383,484,485],{},"Qwen3.6-27B AutoRound (DFlash)",[383,487,488],{},"vLLM nightly + DFlash",[383,490,491],{},"Wedged in compile",[73,493,494],{},"Six hours. ~155GB of disk burned. One driver upgrade kept. One server briefly bricked. Production: unchanged.",[37,496,498],{"id":497},"what-i-actually-learned","What I Actually Learned",[73,500,501],{},"Three things came out of this day clean:",[42,503,504,510,516],{},[45,505,506,509],{},[108,507,508],{},"Hybrid attention is the new bug magnet."," Every Qwen3-family model and Gemma 4 use hybrid layouts, and every mainline inference engine is still chasing the bugs. Qwen3.6-27B is widely beloved. You can find a dozen people on X right now telling you it's their daily driver, but the runtime story at long context is still not there if you need stability. The bleeding edge actually bleeds.",[45,511,512,515],{},[108,513,514],{},"MoE saves compute, not KV cache memory."," All attention layers compute KV for every token regardless of expert routing. A 36B dense model and a 36B-A3B MoE model have the same KV footprint at the same context length. I'd been wrong about this for months and didn't notice until the math broke me.",[45,517,518,521],{},[108,519,520],{},"Don't run two CPU-offload models at the same time."," Or do, and reserve some quality time to chat with your IPMI interface like an old friend.",[73,523,524,525,528],{},"The CPU-offload pattern (",[158,526,527],{},"--cpu-moe"," in llama.cpp, attention on the GPU, MoE experts in DDR4) remains the only stable path I've found to 256K+ context on consumer 3090s for this model family. The 122B works because experts go to RAM (~46GB at IQ4_XS) and only the attention slice touches the 24GB GPU. That dodges most of the GPU-side bugs the new stuff trips on.",[73,530,531],{},"Marketing context numbers are usually YaRN max, by the way. MiniMax sells \"1M context\", native is 204,800. Qwen3.6 sells \"262K native, 1M with YaRN\" (at least that one is honest). Always check the model card, not the launch blog.",[37,533,535],{"id":534},"the-cope-revisited","The Cope, Revisited",[73,537,538],{},"So I'm back on Qwen3.5-122B-A10B. Same model, same 26 tok\u002Fs, same 256K context, same multimodal, same boring stability. The only thing that changed today is my driver version and my newfound respect for swap thrashing.",[73,540,541,542,545],{},"The token-cost cope still doesn't pencil out. I knew that going in. But six hours of failure left me with the one thing the API never gives you: ",[108,543,544],{},"definitive evidence"," that the obvious upgrade isn't ready for my use case. Not \"I keep meaning to try it.\" Not \"maybe Qwen3.6 fixes everything.\" Burned in. Documented. Done.",[73,547,548],{},"I'll check back in a few weeks. The best part is.. this is exactly how I ended up on a 122B model in the first place.",{"title":185,"searchDepth":550,"depth":550,"links":551},2,[552,553,554,555,556,557,558,559,560,561,562,563],{"id":39,"depth":550,"text":40},{"id":70,"depth":550,"text":71},{"id":95,"depth":550,"text":96},{"id":144,"depth":550,"text":145},{"id":167,"depth":550,"text":168},{"id":207,"depth":550,"text":208},{"id":240,"depth":550,"text":241},{"id":275,"depth":550,"text":276},{"id":300,"depth":550,"text":301},{"id":353,"depth":550,"text":354},{"id":497,"depth":550,"text":498},{"id":534,"depth":550,"text":535},"2026-04-28T00:00:00.000Z","Six hours, five models, one IPMI reboot, zero upgrades -- a field report on why the bleeding edge of local LLM hosting is mostly bleeding.","md","\u002Fimages\u002Fblog\u002Flocal-ai-upgrade\u002Fhero.webp",{},true,{"title":14,"description":565},{"loc":15,"lastmod":572},"2026-04-28",[574,575],"AI","Self Hosting",null,"hc2mRitx-iB44JVuWRPrODaHG2ceA5U71PHZ4RSbZ3o",[579,581,587,595,603],{"title":14,"author":32,"date":564,"draft":29,"description":565,"image":567,"tags":580,"navigation":569,"path":15,"stem":16,"id":31},[574,575],{"title":10,"author":32,"date":582,"draft":29,"description":583,"image":584,"tags":585,"navigation":569,"path":11,"stem":12,"id":586},"2026-03-25T00:00:00.000Z","Four RTX 3090s, an EPYC server, and a 122B parameter model -- the journey from underwhelming to overkill and back to surprisingly elegant.","\u002Fimages\u002Fblog\u002Fai-server\u002Fhero.webp",[574,575],"blog\u002Fblog\u002Flocal-ai-rig.md",{"title":22,"author":32,"date":588,"draft":29,"description":589,"image":590,"tags":591,"navigation":569,"path":23,"stem":24,"id":594},"2025-08-27T00:00:00.000Z","REST APIs to roguelikes: How I'm using a smaller game to learn the fundamentals before building my dream MMORPG. A honest indie gamedev journey.","\u002Fimages\u002Fblog\u002Fthundoria\u002Fhero.webp",[592,593],"Game Development","Nuxt","blog\u002Fblog\u002Fthundoria-architecture.md",{"title":26,"author":32,"date":596,"draft":29,"description":597,"image":598,"tags":599,"navigation":569,"path":27,"stem":28,"id":602},"2025-06-25T00:00:00.000Z","Building an AI app with zero knowledge of AI, hitting deployment walls, and evolving architecture to meet real-world needs.","\u002Fimages\u002Fblog\u002Fvectly\u002Fhero.webp",[574,600,601,593],"Web Development","Side Projects","blog\u002Fblog\u002Fvectly-scaling.md",{"title":18,"author":32,"date":604,"draft":29,"description":605,"image":606,"tags":607,"navigation":569,"path":19,"stem":20,"id":608},"2025-06-17T00:00:00.000Z","A deep dive into using Amazon's AI coding assistant, Q, to build a browser-based roguelike game with Nuxt.js. From initial concept to implementation challenges, this post explores the potential and limitations of AI-assisted development.","\u002Fimages\u002Fblog\u002Fq-roguelike\u002Fhero.webp",[574,592,593],"blog\u002Fblog\u002Fq-roguelike.md",1777393892094]