Proof of Concept

The LocoPuente PoC is running today on two consumer GPUs already in the LocoLabo fleet. This is not a simulation. These machines are running the full stack.


|              | GPU 0 (Primary)                | GPU 1 (Secondary)               |
|--------------|--------------------------------|---------------------------------|
| Machine      | Pulpo                          | Puente                          |
| Card         | NVIDIA RTX 3060                | NVIDIA RTX 2060 Super           |
| VRAM         | 12 GB GDDR6                    | 8 GB GDDR6                      |
| Bandwidth    | 360 GB/s                       | 448 GB/s                        |
| CUDA Compute | 8.6                            | 7.5                             |
| Role         | Primary LLM + image generation | Voice (TTS/STT) + secondary LLM |

The dual-GPU arrangement, split across the Pulpo and Puente machines, eliminates the sequential switching constraint of a single-GPU setup: voice services and LLM inference run simultaneously on separate cards.
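Because the two cards live in separate machines, the split amounts to a capability-to-endpoint map. A minimal sketch of that routing follows; the hostnames and ports are illustrative assumptions, not the PoC's actual addresses:

```python
# Sketch: route each capability to the machine that hosts it.
# Hostnames and ports are assumptions for illustration only.
SERVICES = {
    "llm-primary":   "http://pulpo.local:11434",   # Ollama instance 0 (RTX 3060)
    "image-gen":     "http://pulpo.local:8188",    # ComfyUI / SDXL
    "llm-secondary": "http://puente.local:11434",  # Ollama instance 1 (RTX 2060 Super)
    "stt":           "http://puente.local:8000",   # Speaches + Whisper
    "tts":           "http://puente.local:8000",   # Speaches + Kokoro
}

def endpoint(capability: str) -> str:
    """Return the base URL for a capability; raises KeyError if unknown."""
    return SERVICES[capability]
```

Voice and chat stay concurrent precisely because `stt`/`tts` and `llm-primary` resolve to different hosts, so neither request queues behind the other.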


Pulpo (GPU 0, 12 GB):

| Service           | Model               | VRAM    |
|-------------------|---------------------|---------|
| Ollama instance 0 | Llama 3.1 8B Q4_K_M | ~5 GB   |
| ComfyUI (SDXL)    | SDXL 1.0 base       | ~6.5 GB |

| Scenario        | VRAM                                          |
|-----------------|-----------------------------------------------|
| LLM only        | ~5 GB                                         |
| Image gen only  | ~6.5 GB                                       |
| LLM + image gen | ~11.5 GB (tight; not recommended concurrently) |

Puente (GPU 1, 8 GB):

| Service           | Model                     | VRAM    |
|-------------------|---------------------------|---------|
| Speaches STT      | Whisper base/small        | ~0.5 GB |
| Speaches TTS      | Kokoro 82M                | ~0.2 GB |
| Ollama instance 1 | Mistral 7B / Phi-3 Mini Q4 | ~4.5 GB |
| Total concurrent  |                           | ~5.2 GB (comfortable headroom) |
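The budget arithmetic above can be totalled directly as a sanity check. The figures are the tables' own estimates; the dict keys are shorthand labels, not exact model tags:

```python
# Sanity-check the per-card VRAM budgets (all values in GB, from the tables).
PULPO_VRAM, PUENTE_VRAM = 12.0, 8.0

pulpo = {"llama-3.1-8b-q4": 5.0, "sdxl-base": 6.5}
puente = {"whisper": 0.5, "kokoro": 0.2, "mistral-7b-q4": 4.5}

def concurrent_total(budget: dict) -> float:
    """Sum the VRAM of all services if loaded at once."""
    return sum(budget.values())

# Pulpo: both services together reach ~11.5 GB, leaving only ~0.5 GB of
# the 12 GB card free -- hence the "not recommended concurrently" note.
assert concurrent_total(pulpo) == 11.5

# Puente: all three voice/LLM services total ~5.2 GB, ~2.8 GB headroom.
assert abs(PUENTE_VRAM - concurrent_total(puente) - 2.8) < 1e-9
```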

| Capability                    | Tool                  | GPU        | Status |
|-------------------------------|-----------------------|------------|--------|
| LLM chat — general            | Open WebUI + Ollama   | Pulpo      | Ready  |
| LLM chat — secondary          | Open WebUI + Ollama   | Puente     | Ready  |
| Web search in chat            | Open WebUI + SearXNG  |            | Ready  |
| Cited AI web search           | Perplexica + SearXNG  | Pulpo      | Ready  |
| Research nudge intervention   | Custom chat           | Pulpo      | Ready  |
| Unit RAG chatbot (Blackboard) | AnythingLLM           | Pulpo      | Ready  |
| Voice input (STT)             | Speaches + Whisper    | Puente     | Ready  |
| Voice output (TTS)            | Speaches + Kokoro     | Puente     | Ready  |
| Research assistant + podcast  | Open Notebook AI      | Pulpo      | Ready  |
| Image generation (in-chat)    | Open WebUI + ComfyUI  | Pulpo      | Ready  |
| Image generation (direct UI)  | ComfyUI               | Pulpo      | Ready  |
| PDF tools                     | Stirling PDF          |            | Ready  |
| Collaborative whiteboard      | Excalidraw            |            | Ready  |
| Citation + writing check      | CiteSight             | External   | Ready  |
| Voice + LLM concurrent        | All services          | Both cards | Ready  |

  • LLM inference and SDXL image generation on Pulpo should not run simultaneously — both together approach the 12 GB ceiling. In practice, Ollama unloads after inactivity before image generation is triggered.
  • Puente’s 8 GB VRAM is sufficient for voice + secondary LLM but cannot run SDXL. Image generation stays on Pulpo only.
  • System RAM should be 32 GB minimum to avoid model paging to disk.
  • The custom chat tool is the only interface with research consent and logging. Do not route research participants through other interfaces.
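The unload behaviour in the first bullet need not be left to defaults: Ollama's generate/chat API accepts a `keep_alive` field (and the server honours an `OLLAMA_KEEP_ALIVE` environment variable) controlling how long a model stays resident in VRAM. A minimal sketch of such a request body follows; the model tag and the 5-minute window are assumptions, not the PoC's configured values:

```python
import json

# Sketch: ask Ollama to unload the model a few minutes after the last
# request, freeing Pulpo's VRAM for SDXL. "keep_alive" is Ollama's
# standard request field; model tag and window are illustrative.
payload = {
    "model": "llama3.1:8b",   # assumed tag for the Q4_K_M quant
    "prompt": "...",
    "keep_alive": "5m",       # unload 5 minutes after the last request
}
body = json.dumps(payload)    # POST this to the Pulpo Ollama /api/generate
```

Setting `keep_alive` to `0` would unload immediately after each response, at the cost of reload latency on the next chat turn.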

The PoC hardware costs less than a single semester of commercial AI subscriptions for a cohort of students. Two secondhand consumer GPUs. Running the full stack. Today.