⚡ Local Inference Layer
BlindSpot × LLM
Your data. Your hardware. Your models.
Every BlindSpot engine generates structured data. The LLM layer reads that data and tells you what it means — pre-session briefs, pattern interpretation, convergent analysis. Runs on your GPU. Zero cloud. Zero API costs.
BlindSpot · llm · ⚡ LOCAL INFERENCE
● ollama running · gpu: detected · models: 3 loaded
Runtime
Ollama
local · no cloud
Inference Speed
~25
tokens/sec · GPU
Models Loaded
3
primary · secondary · compact
API Cost
$0
forever · your hardware
Privacy
100%
nothing leaves your machine
📊
→Engine Data
Fact tables, dimensions
📝
→Prompt Template
Governed schema, rules
🧠
→Local LLM
GPU-accelerated inference
📋
→Intelligence Brief
Flags, recommendations
🎯
Action
You decide. Model interprets.
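The pipeline above can be sketched in a few lines. The function name and CSV fields here are illustrative, not BlindSpot's actual schema — the point is the division of labor: the engine emits the numbers, the template wraps them, and the model only reads.

```python
# Minimal sketch of the engine-data -> prompt step. build_brief_prompt and
# the trade fields (date, strategy, pnl) are placeholder names, not the
# real BlindSpot schema.

def build_brief_prompt(trades: list[dict]) -> str:
    """Render engine fact rows into a governed prompt for the local model."""
    header = (
        "You are an analyst. Interpret the numbers below; do not recompute them.\n"
        "Return exactly three flags, one per line, prefixed 'Flag N:'.\n\n"
    )
    rows = "\n".join(
        f"{t['date']},{t['strategy']},{t['pnl']:+.2f}" for t in trades
    )
    return header + "date,strategy,pnl\n" + rows

prompt = build_brief_prompt([
    {"date": "2024-05-02", "strategy": "momentum", "pnl": -412.50},
    {"date": "2024-05-03", "strategy": "mean_reversion", "pnl": 180.00},
])
print(prompt)
```

The model never sees raw engine internals — only a flat, pre-computed table and explicit output rules.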
Model · Roster
Primary · Parsing + Briefs
Llama 3.1 8B
Meta · Q4 quantization · Instruct-tuned
VRAM
~5 GB
Speed
~25 t/s
Download
4.7 GB
Context
128K
structured output · CSV parsing · brief generation
Secondary · Convergent Analysis
Mistral 7B
Mistral AI · Q4 quantization · Instruct v0.3
VRAM
~4.5 GB
Speed
~30 t/s
Download
4.1 GB
Context
32K
second opinion · reasoning · comparison
Compact · Fast Parsing
Phi-3.5 Mini
Microsoft · 3.8B parameters · Long context
VRAM
~2.5 GB
Speed
~45 t/s
Download
2.2 GB
Context
128K
fast extraction · long documents · lightweight
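With the roster's approximate VRAM figures, model selection reduces to a filter. The tags follow Ollama's naming; the gigabyte numbers are the rough estimates from the cards above, not measured values:

```python
# Illustrative picker over the roster's approximate VRAM requirements.
# Tags are Ollama-style names; sizes are the ~estimates quoted above.
MODELS = [
    ("llama3.1:8b", 5.0),  # primary: parsing + briefs
    ("mistral:7b", 4.5),   # secondary: convergent analysis
    ("phi3.5", 2.5),       # compact: fast parsing
]

def pick_models(free_vram_gb: float) -> list[str]:
    """Return the roster models that fit in the available VRAM."""
    return [name for name, need in MODELS if need <= free_vram_gb]

print(pick_models(6.0))  # -> ['llama3.1:8b', 'mistral:7b', 'phi3.5']
print(pick_models(3.0))  # -> ['phi3.5']
```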
Convergent · Analysis Demo
DEMO · Prompt: "Analyze this week’s trading performance and flag blind spots"
Llama 3.1 8B
Flag 1: Win rate dropped from 68% to 52% on Thursday-Friday trades.
Flag 2: Avg loss size increased 40% this week. Stop discipline may be slipping.
Flag 3: 3 of 4 losses were momentum plays. Mean reversion outperformed.
VS
Mistral 7B
Flag 1: Late-week performance degradation. Thu-Fri win rate 50% vs Mon-Wed 71%.
Flag 2: Loss magnitude expanding — risk/reward ratio inverted on 2 trades.
Flag 3: Sector concentration: 80% of trades in tech. Diversification blind spot.
✓
Convergence: Both flag late-week decay and loss magnitude expansion. High confidence signals. Mistral uniquely catches sector concentration — a blind spot the primary model missed.
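A toy sketch of that convergence check, assuming flags arrive as plain strings and agreement is measured by shared content words. The tokenization and overlap threshold are illustrative; the real comparison logic is not specified here.

```python
# Toy convergence check: two models "agree" on a flag when their wording
# shares enough content words. Tokenizer and threshold are illustrative.

def _words(flag: str) -> set[str]:
    return {w.strip(".,:%").lower() for w in flag.split() if len(w) > 3}

def convergent_flags(a: list[str], b: list[str], overlap: int = 2) -> list[tuple[str, str]]:
    """Pair up flags from two models that share >= `overlap` content words."""
    return [(fa, fb) for fa in a for fb in b if len(_words(fa) & _words(fb)) >= overlap]

primary = ["Win rate dropped on Thursday-Friday trades",
           "Avg loss size increased 40% this week"]
secondary = ["Late-week win rate degradation on Thursday-Friday",
             "Loss magnitude expanding this week on two trades"]
print(convergent_flags(primary, secondary))  # two convergent pairs
```

Pairs that both models surface are high-confidence signals; unmatched flags (like Mistral's sector concentration) go to you as judgment calls.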
Hardware · Compatibility Matrix
| GPU | VRAM | 8B | 7B | 3.8B | Dual Model | Speed |
|---|---|---|---|---|---|---|
| RTX 4090 | 24 GB | Full | Full | Full | Both in VRAM | ~50+ t/s |
| RTX 4070 Ti | 12 GB | Full | Full | Full | Swap required | ~35 t/s |
| RTX 3070 | 8 GB | Full | Full | Full | Swap required | ~25 t/s |
| RTX 3060 | 12 GB | Full | Full | Full | Swap required | ~20 t/s |
| GTX 1660 | 6 GB | Tight | Full | Full | No | ~12 t/s |
| CPU Only | — | 8GB+ RAM | 8GB+ RAM | 4GB+ RAM | Slow | ~3-5 t/s |
| Apple M1/M2/M3 | Shared | Full | Full | Full | Unified memory | ~20-40 t/s |
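A rough classifier matching the matrix's dual-model column. The ~3 GB per-model headroom for KV cache and runtime overhead is an assumption chosen to reproduce the table, not a measured figure:

```python
def dual_model_mode(vram_gb: float, primary_gb: float = 5.0,
                    secondary_gb: float = 4.5, headroom_gb: float = 3.0) -> str:
    """Classify a card: both models resident, swap between them, or single only.

    headroom_gb is an ASSUMED per-model allowance for KV cache and runtime
    overhead -- illustrative, not benchmarked.
    """
    if vram_gb >= primary_gb + secondary_gb + 2 * headroom_gb:
        return "both in VRAM"
    if vram_gb >= max(primary_gb, secondary_gb) + headroom_gb:
        return "swap required"
    return "single model only"

print(dual_model_mode(24))  # RTX 4090  -> both in VRAM
print(dual_model_mode(12))  # RTX 4070 Ti / 3060 -> swap required
print(dual_model_mode(6))   # GTX 1660 -> single model only
```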
Setup · Readiness
✓
Install Ollama runtime
~2 min
✓
Verify GPU detection
ollama serve logs
✓
Pull primary model (8B)
~5 min download
Pull secondary model (7B)
~4 min download
Run smoke test
structured output
Create prompt templates
parse + brief
First real inference run
engine data → brief
Convergent analysis run
dual-model compare
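The smoke-test step might look like this against Ollama's /api/generate endpoint (a real Ollama API; the model tag and prompt here are placeholders). The network call is defined but not invoked — only the payload is printed:

```python
# Smoke-test sketch: ask the local Ollama server for strict JSON output.
# Endpoint and payload shape follow Ollama's /api/generate API; the model
# tag and prompt are placeholders.
import json
import urllib.request

def smoke_payload(model: str = "llama3.1:8b") -> dict:
    return {
        "model": model,
        "prompt": 'Return {"status": "ok"} and nothing else.',
        "format": "json",  # Ollama constrains the reply to valid JSON
        "stream": False,
    }

def run_smoke_test(url: str = "http://localhost:11434/api/generate") -> bool:
    """POST the payload and verify the reply parses as JSON."""
    req = urllib.request.Request(
        url,
        data=json.dumps(smoke_payload()).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    json.loads(body["response"])  # raises if the model broke the contract
    return True

print(json.dumps(smoke_payload(), indent=2))
```

Pass criterion: the reply parses as JSON. If it doesn't, the prompt template — not the model — is the first thing to fix.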
Architecture · Principles
→
The LLM interprets. It doesn't compute.
Numbers come from the engine. The model reads them.
→
Convergent > singular.
Two models agreeing = signal. Divergence = judgment call.
→
Prompt templates are the contract.
Structure > intelligence.
→
Local means sovereign.
Privacy isn't a feature. It's the architecture.
→
Models are swappable.
The engine doesn't care which model reads the data.
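The last two principles in code: a fixed output schema, validated before anything reaches you, regardless of which model produced the reply. Field names are illustrative, not BlindSpot's actual contract:

```python
# "Templates are the contract" + "models are swappable": one schema check
# applied to every model's output. Field names are illustrative.
import json

BRIEF_SCHEMA = {"flags": list, "confidence": str}

def validate_brief(raw: str) -> dict:
    """Parse a model reply and enforce the contract, model-agnostic."""
    brief = json.loads(raw)
    for field, typ in BRIEF_SCHEMA.items():
        if not isinstance(brief.get(field), typ):
            raise ValueError(f"contract violation: {field}")
    return brief

ok = validate_brief('{"flags": ["late-week decay"], "confidence": "high"}')
print(ok["flags"])  # -> ['late-week decay']
```

Swapping Llama for Mistral (or anything else Ollama serves) changes nothing downstream: the brief either satisfies the contract or it is rejected.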
BlindSpot · Local LLM Integration v1.0
ollama · gpu-accelerated · zero cloud · governed prompts