Choosing Your Stack — A Decision Guide for Puente
This document helps you decide how to deploy, what to run, and why it matters — before you touch any configuration.
Start Here: Strategy, Not Tools
Before picking hardware or models, it helps to understand what you are actually doing when you use an LLM.
LLMs do not retrieve facts. They interpolate. Every response is a statistically predicted continuation — the next most plausible token given everything before it. This is extraordinary for certain tasks and unreliable for others. Knowing which is which determines whether the tool amplifies you or misleads you.
Two modes of use
| Mode | Description | Risk |
|---|---|---|
| Cognitive offloading | Using the model as a thinking partner — brainstorming, ideating, iterating, drafting, critiquing | Low. You stay in the loop. Hallucinations are caught because you are still thinking. |
| Cognitive surrender | Treating the model as an oracle — asking it for facts, trusting its output without verification | High. Hallucinations get amplified, not filtered. |
The framing matters: a conversation with an AI that amplifies your thinking is fundamentally different from outsourcing your thinking to one. The first is a productivity multiplier. The second transfers your judgment — and your errors — to a system that cannot know what it does not know.
Practical heuristic: Use local AI for brainstorming, ideating, iterating, drafting, and cognitive offloading of simple tasks. Stay close to frontier models for tasks where factual precision matters, and verify outputs independently. The gap between a good local model and a frontier model is much smaller when you are the one doing the thinking.
Why local at all?
- Data sovereignty — your documents, conversations, and ideas stay on your hardware
- No subscription model — pay once for hardware (or a VPS), not monthly per seat
- No usage throttling — run as many queries as your hardware allows
- Reproducibility — pin a model version; it does not change under you
- Skill transferability — understanding the local stack prepares you for the edge AI wave (see below)
The Minimal Viable Stack
Three services give you roughly 90% of what commercial AI providers offer:
Open WebUI + ComfyUI + Open Terminal

| What you get | Commercial equivalent |
|---|---|
| Chat with any local or remote LLM | ChatGPT, Claude, Gemini web UI |
| Image generation (SD, SDXL, FLUX) | Midjourney, DALL·E |
| Terminal with LLM-aware tooling | Codex CLI, GitHub Copilot CLI |
| Built-in audio in/out (Open WebUI) | Voice mode on ChatGPT / Gemini |
What you do not get (yet): video generation. That requires significantly more VRAM and is still maturing. ComfyUI supports video workflows (AnimateDiff, Wan, CogVideoX) but treat them as experimental unless you have 24 GB+ VRAM.
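To sanity-check a fresh deployment, it is enough to confirm each service answers HTTP on its port. The sketch below is illustrative: it assumes the upstream defaults (Open WebUI on 8080, ComfyUI on 8188, Ollama on 11434), which may differ from the ports your Puente configuration maps.

```python
import urllib.request
import urllib.error

# Upstream default ports -- these are assumptions; your Puente configuration
# may map services to different host ports, so adjust as needed.
SERVICES = {
    "Open WebUI": "http://localhost:8080",
    "ComfyUI": "http://localhost:8188",
    "Ollama (model backend)": "http://localhost:11434",
}

def check(name: str, url: str) -> None:
    """Report whether a service answers HTTP on its expected port."""
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            print(f"[ok]   {name:25s} {url} (HTTP {resp.status})")
    except (urllib.error.URLError, OSError) as exc:
        print(f"[down] {name:25s} {url} ({exc})")

if __name__ == "__main__":
    for name, url in SERVICES.items():
        check(name, url)
```

A "down" line usually means the container is not running or the port mapping differs from the defaults assumed here.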
Deployment Options
Option A — Fully Local (on-premise GPU)
You own the hardware; the data never leaves your network.
| VRAM | Realistic capability | Notes |
|---|---|---|
| 4–6 GB | 7B models at Q4, SD 1.5 image gen | Tight but functional. CPU offload helps. |
| 8 GB | 13B at Q4, SDXL image gen | Good daily driver tier |
| 12 GB | 13–30B at Q4, SDXL comfortable | Solid research workstation |
| 16–24 GB | 34–70B at Q4, FLUX image gen | Near-frontier local capability |
| 24 GB+ (RTX 3090/4090) | 70B+ at Q4, video workflows | Full stack, no meaningful compromise |
| 2× 24 GB | 70B+ at Q5/Q8, multi-GPU split | Effectively on par with frontier APIs for many tasks |
Run LocoBench on your hardware to get actual throughput numbers (tokens/sec) rather than relying on this table. Model performance varies significantly by architecture, quantisation method, and generation.
Note on older GPUs: NVIDIA CUDA Compute Capability 5.x (Maxwell architecture — GTX 900 series and equivalents) is officially in maintenance mode. PyTorch and related frameworks are progressively dropping support for Compute < 6.0/7.0. If you are on a Compute 5.x card, your current software stack will work until it does not — plan for a hardware refresh or cloud fallback rather than assuming indefinite support.
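To place your own card in the table above, you can read VRAM and CUDA compute capability programmatically. A minimal sketch, assuming a CUDA-enabled PyTorch install; the 6.0 threshold mirrors the maintenance-mode note above.

```python
import torch

# Requires a CUDA-enabled PyTorch build; on CPU-only installs this reports
# that no GPU is available rather than failing.
if not torch.cuda.is_available():
    print("No CUDA GPU detected -- consider CPU-only models or a cloud VPS.")
else:
    for idx in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(idx)
        major, minor = torch.cuda.get_device_capability(idx)
        vram_gb = props.total_memory / 1024**3
        print(f"GPU {idx}: {props.name}, {vram_gb:.1f} GB VRAM, "
              f"compute capability {major}.{minor}")
        if major < 6:
            print("  -> Maxwell-era or older: expect framework support to drop; "
                  "plan a refresh or cloud fallback.")
```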
Option B — Cloud VPS with GPU (RunPod, Vast.ai, AWS, etc.)
You rent the hardware; you still control the software and data.
| Provider | Notes |
|---|---|
| RunPod | Best GPU-per-dollar for AI workloads. Spot and on-demand. Community templates available. |
| Vast.ai | Peer-to-peer GPU rental, cheapest option. Variable reliability. |
| Lambda Labs | Reserved GPU instances, more stable than spot. |
| AWS / GCP / Azure | Enterprise SLAs, higher cost, best for institutional compliance requirements. |
When cloud makes sense:
- You need more VRAM than you own for a specific job (large model inference, FLUX, video)
- You want a persistent server accessible to multiple users without maintaining local hardware
- Your institution has compliance requirements around where data is processed
Data sovereignty note: a VPS you control (with a model you run) is categorically different from a managed AI API. Your data goes to your rented server, not through a third party’s inference infrastructure. Both VPS and local give you sovereignty; managed APIs do not.
Option C — Hybrid
Run fast, lightweight tasks locally (chat, coding assistance) and offload VRAM-heavy tasks (image generation, large models) to a cloud GPU pod on demand. Puente’s configuration supports pointing services at external endpoints, so the two can coexist cleanly.
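As an illustration of the hybrid pattern, a local backend and a rented GPU pod can both be addressed through the same OpenAI-compatible API; only the base URL changes. The sketch assumes Ollama's OpenAI-compatible endpoint locally; the remote hostname, API key, and model tags are placeholders, not Puente defaults.

```python
from openai import OpenAI  # pip install openai

# Local endpoint: Ollama's OpenAI-compatible API (default port 11434).
local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Remote endpoint: a GPU pod you rent and control; hostname is a placeholder.
remote = OpenAI(base_url="https://your-gpu-pod.example.com/v1", api_key="your-key")

def ask(client: OpenAI, model: str, prompt: str) -> str:
    """Send one chat request and return the reply text."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Lightweight task stays local; a VRAM-heavy model runs on the rented pod.
print(ask(local, "llama3.1:8b", "Summarise this paragraph in one sentence: ..."))
print(ask(remote, "llama3.1:70b", "Draft a detailed review of this proposal: ..."))
```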
Choosing a Model
The right model depends on your task, your VRAM, and your tolerance for latency. General guidance:
| Task | Model family to consider |
|---|---|
| General chat, writing, reasoning | Llama 3.x, Qwen 2.5, Mistral |
| Coding | Qwen 2.5 Coder, DeepSeek Coder, CodeLlama |
| Long context (documents, research) | Gemma 3, Qwen 2.5 (128k context variants) |
| Multimodal (image + text) | LLaVA, Qwen2-VL, Gemma 3 |
| Image generation | SD 1.5 (lowest VRAM), SDXL, FLUX.1 |
For precise model-to-VRAM-tier recommendations, run LocoBench. It benchmarks candidate models against your actual hardware and surfaces the optimal quality/speed tradeoff for your specific setup. This table will go out of date; LocoBench results will not.
Quantisation: Q4_K_M is a reliable default — good quality at roughly a third of the VRAM of the same model in 16-bit precision. Q5_K_M and Q8 give higher quality if your VRAM allows. Q2/Q3 quantisations are useful on very constrained hardware, but quality degrades noticeably at that level.
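For a rough back-of-the-envelope check before downloading anything: weight memory is roughly parameter count times bits per weight, before KV cache and activation overhead. The bits-per-weight figures in the sketch below are approximate averages for common GGUF quantisation types, not exact file sizes.

```python
# Rough weight-memory estimate for quantised models.
# Bits-per-weight values are approximate averages for GGUF quant types;
# real files vary slightly, and KV cache / activations add more on top.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate GB needed just for the model weights."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1024**3

for quant in ("Q4_K_M", "Q5_K_M", "Q8_0", "FP16"):
    print(f"7B  @ {quant:7s} ~ {weight_gb(7, quant):5.1f} GB")
    print(f"70B @ {quant:7s} ~ {weight_gb(70, quant):5.1f} GB")
```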
The Edge AI Trend — Why This Skill Matters Beyond Your Current GPU
Model efficiency is improving faster than hardware is. A few developments worth watching:
- Aggressive quantisation (GPTQ, AWQ, GGUF and successors) has already pushed 7B models onto consumer GPUs that would have required a data centre three years ago
- 1.58-bit ternary weight models (Microsoft BitNet and related work) use only addition — no multiplication — making inference viable on CPU at useful speeds and drastically reducing VRAM requirements for a given parameter count (see the sketch after this list)
- Google and others are actively publishing quantisation research that promises to fit significantly larger models into the same VRAM envelope
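To make the ternary-weight point concrete: when every weight is -1, 0, or +1, a matrix-vector product collapses into selecting, adding, and subtracting activations. This is a toy NumPy illustration of the idea, not BitNet's actual kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ternary weight matrix: every entry is -1, 0, or +1.
W = rng.integers(-1, 2, size=(4, 8))
x = rng.standard_normal(8)

# Multiplication-free evaluation: for each output row, add the activations
# where the weight is +1 and subtract those where it is -1.
y_add_only = np.array([x[row == 1].sum() - x[row == -1].sum() for row in W])

# Same result as an ordinary matrix-vector product.
assert np.allclose(y_add_only, W @ x)
print(y_add_only)
```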
The implication: the hardware you own today may run meaningfully better models in two years without any upgrade. More importantly, the architectural shift toward edge inference means that understanding how to deploy and operate local AI is a durable skill — the same mental model that lets you run Llama 3 on a workstation today will let you run its successors on a laptop or embedded device as the ecosystem matures.
Running locally now is not just about the current capability. It is practice for a future where capable AI runs at the edge by default.
Quick Decision Flowchart
Do you have a GPU with 8 GB+ VRAM?
├── Yes → Start with the minimal stack (Open WebUI + ComfyUI + Open Terminal)
│         Run LocoBench to find your optimal model tier
│         Add services as you need them
└── No  → Consider a cloud VPS (RunPod spot is cheapest)
          Or run CPU-only with smaller quantised models (7B Q4 is usable)
          Watch 1.58-bit model developments — CPU viability improving fast

Is your data sensitive?
├── Yes → Local or VPS only. Do not use managed AI APIs for this data.
└── No  → Managed APIs fine for non-sensitive tasks; local for everything else

Do you want to supplement rather than replace commercial tools?
└── Yes → Minimal stack. Use local for iteration and drafting, frontier API for
          final-pass precision tasks where you verify output.

For Institutions and Organisations
The case for local AI looks different at institutional scale. The key framing shift is CapEx vs. OpEx.
Cloud AI is operational expenditure: recurring per-seat costs (typically $200+/month per user for enterprise tiers), rate limits, token quotas, and zero data sovereignty. Local inference is capital expenditure: hardware purchased once, depreciating over time, with uncapped throughput and full data control.
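A crude break-even calculation makes the framing tangible. The hardware price and seat count below are illustrative assumptions; the $200/month figure is the enterprise-tier estimate quoted above.

```python
# Illustrative CapEx vs. OpEx break-even -- all inputs are assumptions to edit.
hardware_cost = 8_000        # one-off workstation / server purchase (assumed)
cloud_per_seat_month = 200   # enterprise cloud AI subscription, per the text
seats = 10                   # number of users served by the local deployment

monthly_cloud_spend = cloud_per_seat_month * seats
break_even_months = hardware_cost / monthly_cloud_spend

print(f"Cloud spend: ${monthly_cloud_spend}/month for {seats} seats")
print(f"Local hardware pays for itself after ~{break_even_months:.1f} months")
```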
The compliance argument is often decisive. Healthcare, legal, financial, and educational institutions handling sensitive data may be unable to use cloud AI APIs regardless of preference. On-premises inference is not a budget decision in these contexts — it is a regulatory requirement.
The throughput argument matters at scale. Enterprise cloud subscriptions impose rate limits and quotas. Local hardware transforms AI from a metered external utility into an internal resource — running continuously, without per-query cost, on institutional data that never leaves the network.
Recommended approach for institutions:
| Scenario | Recommendation |
|---|---|
| Sensitive data, compliance requirements | Full local deployment — no cloud APIs for regulated data |
| Mixed workloads, some public data | Hybrid: local for sensitive/high-volume, cloud for complex reasoning |
| Evaluation / pilot phase | Start with the minimal stack (Open WebUI + Ollama), run LocoBench on available hardware, assess before committing to hardware procurement |
Implementation sequence:
1. Audit data sensitivity — determine what can and cannot leave institutional infrastructure
2. Run LocoBench on available hardware to establish a capability baseline
3. Deploy the minimal stack; validate workflows before expanding
4. Procure hardware based on actual measured throughput requirements, not theoretical specs
See Hardware for selection and procurement guidance.
What Puente Is Not
- A replacement for critical thinking
- A fact retrieval system (use SearXNG for that — it is in the stack for a reason)
- A substitute for domain expertise
What it is: a self-hosted environment that lets you experiment, build intuition, and develop a working relationship with these tools on your own terms — without giving your data or your cognitive autonomy to a third party.
For hardware benchmarks specific to your machine, see LocoBench. For deployment configuration, see the project README.