Proof of Concept
LocoPuente is the “closing the gap” PoC: a minimal, credible demonstration that local AI on one secondhand consumer GPU can deliver the student-facing capabilities an institution currently leans on frontier cloud AI for. One workstation, one card, three backend inference services, one tool-augmented general chat front end, and three purpose-built research and study front ends.
OpenWebUI, augmented with ComfyUI (image generation), Speaches (voice), and OpenTerminal (coding and terminal workflow), provides functional equivalence to commercial chat interfaces — chat, voice in/out, image generation, code assistance — in a single browser tab. The three companion front ends cover deep research, tutoring, and NotebookLM-style research-to-media. The claim is not that local small models match frontier models on every benchmark; the claim is that for the way students actually use AI — dialogically, in conversation — they are close enough to close the gap.
Hardware
Section titled “Hardware”| Machine | Puente |
| Chassis | AMD Ryzen 5 2600 desktop tower |
| GPU | NVIDIA RTX 3090 24 GB GDDR6X |
| Memory bandwidth | 936 GB/s |
| CUDA compute | 8.6 |
| System RAM | 32 GB DDR4 (minimum) |
| OS | Ubuntu 22.04 LTS |
The entire PoC runs on a single RTX 3090. The 24 GB of VRAM is what makes the minimal PoC work — it absorbs LLM inference, image generation, and voice services concurrently.
Backend Services
Section titled “Backend Services”Three services run on the RTX 3090, each exposing a clean API that the front-end apps consume:
| Service | Role | Port |
|---|---|---|
| Ollama | LLM inference (OpenAI-compatible chat/completions API) | 11434 |
| ComfyUI | Image generation (backend API + optional direct UI) | 8188 |
| Speaches | Audio in/out — STT (faster-whisper) and TTS (Kokoro, Piper fallback) | 8000 |
Every capability the PoC demonstrates routes through one of these three services. OpenTerminal, a coding-and-terminal tool, is configured as an additional OpenWebUI tool/service rather than as a stand-alone front end.
Front-End Apps
Section titled “Front-End Apps”Four purpose-built front ends sit on top of the backend services:
| App | Purpose | Consumes |
|---|---|---|
| OpenWebUI | General-purpose chat interface with tool augmentation — text chat, voice in/out (Speaches), image generation (ComfyUI), and coding assistance (OpenTerminal). Functional equivalent of commercial chat UIs. | Ollama, Speaches, ComfyUI, OpenTerminal |
| Vane | Deep research | Ollama |
| DeepTutor | Research and tutoring | Ollama |
| OpenNotebook | Podcast generation, quizzes, structured notes — a NotebookLM clone without video | Ollama, Speaches |
OpenWebUI carries the general-purpose chat envelope a student would otherwise get from a commercial service. The three companion apps sit beside it for deep research, tutoring, and notebook-style research-to-media. Each is an existing open-source project; the PoC is the integration and the hardware, not novel app code.
VRAM Budget (RTX 3090, 24 GB)
Section titled “VRAM Budget (RTX 3090, 24 GB)”Headline: the full stack fits concurrently with comfortable headroom.
| Service | Model | VRAM |
|---|---|---|
| Ollama — primary LLM | Llama 3.1 8B Q4_K_M (or Qwen 2.5 7B) | ~5 GB |
| Speaches STT + TTS | Whisper base/small + Kokoro 82M | ~0.7 GB |
| ComfyUI — image generation | SDXL base + refiner | ~8-10 GB |
| Full stack concurrent | ~14-16 GB — comfortable |
Larger models fit when the card is not doing image generation at the same time:
- Llama 3.1 13B Q4_K_M: ~8 GB
- 30B-class Q4: ~18 GB (image gen idle)
- FLUX.1 Dev FP16: ~16-24 GB (standalone image run)
PoC Capabilities
Section titled “PoC Capabilities”| Capability | Provided by | Front-end |
|---|---|---|
| General chat | Ollama | OpenWebUI |
| Voice input / voice output | Speaches | OpenWebUI (via tool integration) |
| Image generation | ComfyUI | OpenWebUI (via tool integration) or ComfyUI directly |
| Coding assistant and terminal workflow | Ollama + OpenTerminal | OpenWebUI (via tool integration) |
| Deep research | Ollama | Vane |
| Research and tutoring | Ollama | DeepTutor |
| Podcast generation from notes | Ollama + Speaches | OpenNotebook |
| Quizzes and structured summaries | Ollama | OpenNotebook |
All services expose OpenAI-compatible APIs. All run without internet access. All student data stays on the machine.
Known Constraints
Section titled “Known Constraints”- All services share the one 24 GB card. Concurrent LLM + SDXL + voice runs comfortably. FLUX.1 Dev FP16 at full quality consumes most of the card on its own and is best run when other workloads are idle.
- System RAM should be 32 GB minimum to avoid model paging to disk.
What Is Deliberately Out of Scope
Section titled “What Is Deliberately Out of Scope”The minimal PoC is the four front ends above. The broader LocoLabo ecosystem — the Keep Asking research chat tool, AnythingLLM unit RAG chatbots, Perplexica, Stirling PDF, Excalidraw, CiteSight, LocoEnsayo rehearsal chatbots, and the TalkBuddy / StuddyBuddy / Career Compass desktop clients — also runs against Puente’s Ollama and Speaches endpoints where relevant. Those are documented in their own project docs. Keeping the PoC itself narrow is the point of “closing the gap”: prove the core on one card, then expand.
The Cost Argument
Section titled “The Cost Argument”One secondhand AMD Ryzen 5 2600 desktop. One secondhand RTX 3090. The full student-facing stack. For less than a year of frontier cloud AI subscriptions for a small team.