The LocoPuente PoC is running today on two consumer GPUs already in the LocoLabo fleet. This is not a simulation. These machines are running the full stack.
| | GPU 0 (Primary) | GPU 1 (Secondary) |
|---|---|---|
| Machine | Pulpo | Puente |
| Card | NVIDIA RTX 3060 | NVIDIA RTX 2060 Super |
| VRAM | 12 GB GDDR6 | 8 GB GDDR6 |
| Memory bandwidth | 360 GB/s | 448 GB/s |
| CUDA compute capability | 8.6 | 7.5 |
| Role | Primary LLM + image generation | Voice (TTS/STT) + secondary LLM |
The dual-GPU arrangement removes the sequential model switching that a single-GPU setup forces. Voice and LLM inference run concurrently on separate cards, so neither has to unload to make room for the other.
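
The hardware table above can be checked against a live machine. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH and the driver is recent enough to report the `compute_cap` query field; the helper names (`parse_gpu_line`, `matches_table`, `query_gpus`) are illustrative and not part of the LocoPuente stack.

```python
import subprocess

# Expected values copied from the hardware table above.
EXPECTED = {
    "RTX 3060": {"vram_gb": 12, "compute": "8.6"},
    "RTX 2060 SUPER": {"vram_gb": 8, "compute": "7.5"},
}


def parse_gpu_line(line):
    """Parse one CSV line: name, memory.total (MiB), compute_cap."""
    name, mem_mib, compute = [field.strip() for field in line.split(",")]
    return {
        "name": name,
        "vram_gb": round(int(mem_mib) / 1024),
        "compute": compute,
    }


def matches_table(gpu):
    """Return True if the reported card agrees with the table's VRAM and compute values."""
    for card, spec in EXPECTED.items():
        if card.lower() in gpu["name"].lower():
            return (
                gpu["vram_gb"] == spec["vram_gb"]
                and gpu["compute"] == spec["compute"]
            )
    return False  # card is not one of the two listed in the table


def query_gpus():
    """Ask nvidia-smi for every GPU on this machine and parse the result."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,memory.total,compute_cap",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_gpu_line(l) for l in result.stdout.splitlines() if l.strip()]
```

On either machine, `all(matches_table(g) for g in query_gpus())` should then hold if the installed card matches the table.
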
| Service (Pulpo, GPU 0) | Model | VRAM |
|---|---|---|
| Ollama instance 0 | Llama 3.1 8B Q4_K_M | ~5 GB |
| ComfyUI (SDXL) | SDXL 1.0 base | ~6.5 GB |
| Total, LLM only | | ~5 GB |
| Total, image gen only | | ~6.5 GB |
| Total, LLM + image gen | | ~11.5 GB (tight; concurrent use not recommended) |
| Service (Puente, GPU 1) | Model | VRAM |
|---|---|---|
| Speaches STT | Whisper base/small | ~0.5 GB |
| Speaches TTS | Kokoro 82M | ~0.2 GB |
| Ollama instance 1 | Mistral 7B / Phi-3 Mini Q4 | ~4.5 GB |
| Total concurrent | | ~5.2 GB — comfortable headroom |
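
The totals in the two VRAM tables are simple sums, and the same arithmetic can be used to test other service combinations. This is a rough sketch using the approximate figures from the tables; the 1 GB safety margin and the service keys are assumptions made for illustration, not measured values.

```python
# Approximate VRAM figures (GB) copied from the tables above.
CARDS = {
    "Pulpo": {
        "capacity": 12.0,
        "services": {"ollama_0": 5.0, "comfyui_sdxl": 6.5},
    },
    "Puente": {
        "capacity": 8.0,
        "services": {"stt": 0.5, "tts": 0.2, "ollama_1": 4.5},
    },
}

# Assumption: keep about 1 GB free for the CUDA context and cache growth.
MIN_HEADROOM_GB = 1.0


def headroom(card, active):
    """Return free VRAM (GB) on `card` when the `active` services are loaded."""
    spec = CARDS[card]
    used = sum(spec["services"][name] for name in active)
    return round(spec["capacity"] - used, 1)


def fits(card, active, margin=MIN_HEADROOM_GB):
    """Return True if the active services leave at least `margin` GB free."""
    return headroom(card, active) >= margin
```

With these numbers, `headroom("Puente", ["stt", "tts", "ollama_1"])` leaves about 2.8 GB free, while `headroom("Pulpo", ["ollama_0", "comfyui_sdxl"])` leaves only about 0.5 GB, which is why the combined Pulpo load is flagged as tight.
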
| Capability | Tool | GPU | Status |
|---|---|---|---|
| LLM chat — general | Open WebUI + Ollama | Pulpo | Ready |
| LLM chat — secondary | Open WebUI + Ollama | Puente | Ready |
| Web search in chat | Open WebUI + SearXNG | — | Ready |
| Cited AI web search | Perplexica + SearXNG | Pulpo | Ready |
| Research nudge intervention | Custom chat | Pulpo | Ready |
| Unit RAG chatbot (Blackboard) | AnythingLLM | Pulpo | Ready |
| Voice input (STT) | Speaches + Whisper | Puente | Ready |
| Voice output (TTS) | Speaches + Kokoro | Puente | Ready |
| Research assistant + podcast | Open Notebook AI | Pulpo | Ready |
| Image generation (in-chat) | Open WebUI + ComfyUI | Pulpo | Ready |
| Image generation (direct UI) | ComfyUI | Pulpo | Ready |
| PDF tools | Stirling PDF | — | Ready |
| Collaborative whiteboard | Excalidraw | — | Ready |
| Citation + writing check | CiteSight | External | Ready |
| Voice + LLM concurrent | All services | Both cards | Ready |
- LLM inference and SDXL image generation on Pulpo should not run simultaneously: together they need about 11.5 GB, close to the 12 GB ceiling. In practice, Ollama unloads an idle model (after five minutes by default), which normally frees the VRAM before image generation is triggered.
- Puente’s 8 GB of VRAM is sufficient for voice plus the secondary LLM (~5.2 GB) but leaves too little room for SDXL (~6.5 GB). Image generation stays on Pulpo only.
- System RAM should be 32 GB minimum to avoid model paging to disk.
- The custom chat tool is the only interface with research consent and logging. Do not route research participants through other interfaces.
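
The first constraint relies on Ollama's idle timeout to free VRAM on Pulpo. Where that is too slow, a model can be unloaded explicitly by sending Ollama's `/api/generate` endpoint a request with `keep_alive` set to `0`. The sketch below shows one way to do that; the address (Ollama's default port on localhost) and the model tag `llama3.1:8b` are assumptions, and the helper names are illustrative rather than part of the existing stack.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default address
MODEL_TAG = "llama3.1:8b"              # assumption: tag of the primary model


def build_unload_request(model=MODEL_TAG, base_url=OLLAMA_URL):
    """Build a request asking Ollama to unload `model` immediately."""
    # No prompt plus keep_alive=0 tells Ollama to drop the model from memory.
    payload = {"model": model, "keep_alive": 0, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unload_model(model=MODEL_TAG, base_url=OLLAMA_URL, timeout=30):
    """Send the unload request and return Ollama's JSON reply."""
    request = build_unload_request(model, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `unload_model()` just before queuing an SDXL job is one option; whether that is worth the reload delay on the next chat request depends on how often the two workloads alternate.
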
The PoC hardware costs less than a single semester of commercial AI subscriptions for a cohort of students. Two secondhand consumer GPUs. Running the full stack. Today.