An offline LLM inference engine that runs ultra-large, state-of-the-art 100-600B+ parameter MoE models on consumer-grade hardware instead of GPU clusters, achieved through advanced memory management, online quantization, and caching algorithms that exploit MoE sparsity.
Professionals and SMEs face a critical dilemma. Cloud AI (Google, OpenAI, DeepSeek) offers state-of-the-art (SOTA) performance, but at the expense of data privacy, high and unpredictable costs, API rate limits, artificial context-window caps, and network latency. Local AI inference, by contrast, preserves privacy but is constrained by hardware to significantly smaller models whose quality falls short of professional workflow demands.
Our vision is to democratize state-of-the-art AI by making it private, affordable, and universally accessible—eliminating the need for cloud dependence. Monadd-AI introduces a virtual memory system for neural network weights. It is a Mixed-Precision Expert Offloading system that intelligently manages expert weights across a memory hierarchy—from high-speed GPU VRAM to system RAM and SSD storage. The architecture is engineered for flexibility and ease of use, eliminating the need for custom model file formats or pre-processing steps by working directly with standard FP16 GGUF models and performing on-demand quantization at runtime.
Works directly with standard FP16 model files. The system maps expert tensor locations at load time without immediately loading their data into memory.
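As an illustration, the lazy-mapping idea can be sketched in a few lines of Python. The GGUF header parsing is omitted, and the `tensor_index` mapping of tensor names to (offset, shape) pairs is an assumed input rather than part of the real loader.

```python
import mmap
import numpy as np

class LazyExpertStore:
    """Memory-maps a model file and records where each expert tensor
    lives, without reading any weight data until it is requested."""

    def __init__(self, path, tensor_index):
        # tensor_index: {name: (byte_offset, shape)} -- assumed to come
        # from a separate GGUF header parser (not shown here).
        self.file = open(path, "rb")
        self.mm = mmap.mmap(self.file.fileno(), 0, access=mmap.ACCESS_READ)
        self.index = tensor_index

    def load_fp16(self, name):
        """Materialize a single expert tensor on demand."""
        offset, shape = self.index[name]
        count = int(np.prod(shape))
        # Only this slice of the file is actually paged in by the OS.
        return np.frombuffer(self.mm, dtype=np.float16,
                             count=count, offset=offset).reshape(shape)

# At load time the engine keeps only the index in RAM; load_fp16() is
# called when the router selects an expert.
```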
Quantizes expert weights on the fly from FP16 to lower precisions such as INT4/INT2, based on each expert's importance for the token currently being processed or generated.
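A simplified sketch of this on-demand quantization: a symmetric, group-wise INT4 quantizer plus an importance-to-precision mapping. The group size, thresholds, and function names are illustrative placeholders, not the production kernels.

```python
import numpy as np

def quantize_int4(weights_fp16, group_size=64):
    """Symmetric per-group INT4 quantization of an FP16 weight matrix.
    Assumes the element count is divisible by group_size; codes are kept
    unpacked (one int8 per element) for clarity."""
    w = weights_fp16.astype(np.float32).reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0  # int4 range: -7..7
    scales[scales == 0] = 1.0
    q = np.clip(np.round(w / scales), -7, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_int4(q, scales):
    return (q.astype(np.float32) * scales).astype(np.float16)

def precision_for_expert(importance, hi=0.10, lo=0.01):
    """Map a gating-derived importance score to a target precision.
    The thresholds are placeholders for the tuned policy."""
    if importance >= hi:
        return "fp16"   # keep the most important experts at full precision
    if importance >= lo:
        return "int4"
    return "int2"
```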
Provides unlimited inference with no RPM, TPM, or artificial context window constraints. Scale with your hardware, not API quotas.
Our system is built on three cooperating components that work across the token, layer, and sequence levels to optimize inference.
The mixed-precision strategy determines expert handling at the token level. It uses a "Dynamic Cumulative Top-p" analysis based on gating scores, which approximate expert importance with high accuracy.
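Conceptually, the selection looks like the following sketch: experts are sorted by gating probability and kept at high precision until the cumulative mass reaches p (the threshold value here is illustrative).

```python
import numpy as np

def cumulative_top_p_experts(gate_logits, p=0.9):
    """Select the smallest set of experts whose normalized gating
    probabilities sum to at least p; the remainder are treated as
    low-importance and served from lower-precision copies."""
    probs = np.exp(gate_logits - gate_logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # most important first
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1    # experts kept at high precision
    return order[:cutoff], order[cutoff:], probs

# Example: per-token routing scores over 8 experts (illustrative values).
logits = np.array([2.0, 1.5, 0.2, -1.0, 0.1, -0.5, 1.8, -2.0])
hot, cold, probs = cumulative_top_p_experts(logits, p=0.9)
```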
To hide I/O latency, the system predicts which experts will be needed in upcoming layers (N+1). This is possible due to the high cosine similarity of gating inputs across consecutive layers.
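A minimal sketch of the predict-then-prefetch idea, assuming the current hidden state can be fed through the next layer's router; the thread-pool prefetch and `fetch_fn` hook are illustrative stand-ins for the real I/O pipeline.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

prefetch_pool = ThreadPoolExecutor(max_workers=2)

def predict_next_layer_experts(hidden_state, next_router_weights, top_k=8):
    """Because gating inputs are highly similar across consecutive layers,
    feeding the current hidden state to layer N+1's router gives a good
    guess of which experts that layer will need."""
    logits = hidden_state @ next_router_weights.T
    return np.argsort(logits)[::-1][:top_k]

def prefetch_experts(expert_ids, fetch_fn):
    """Kick off asynchronous loads so expert weights are resident by the
    time layer N+1 executes. fetch_fn is whatever loads and quantizes a
    single expert (e.g. LazyExpertStore.load_fp16 from the earlier sketch)."""
    return [prefetch_pool.submit(fetch_fn, eid) for eid in expert_ids]
```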
A sophisticated cache manager governs the GPU memory pool holding the "hot" expert weights, minimizing cache-miss penalties through a hybrid eviction policy.
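One plausible form of such a hybrid policy blends recency with hit frequency; the class below is a sketch, and the scoring weights are illustrative rather than tuned production values.

```python
from collections import OrderedDict

class HybridExpertCache:
    """VRAM cache for quantized expert weights. Eviction blends recency
    (LRU order) with hit frequency, so frequently reused experts survive
    brief idle periods."""

    def __init__(self, capacity, recency_weight=0.5):
        self.capacity = capacity
        self.recency_weight = recency_weight
        self.entries = OrderedDict()   # expert_id -> weights (most recent last)
        self.hits = {}                 # expert_id -> access count

    def get(self, expert_id):
        if expert_id in self.entries:
            self.entries.move_to_end(expert_id)
            self.hits[expert_id] += 1
            return self.entries[expert_id]
        return None                    # cache miss: caller loads from RAM/SSD

    def put(self, expert_id, weights):
        if expert_id in self.entries:
            self.entries.move_to_end(expert_id)
        elif len(self.entries) >= self.capacity:
            self._evict()
        self.entries[expert_id] = weights
        self.hits.setdefault(expert_id, 0)

    def _evict(self):
        # Score = recency rank (older = lower) blended with hit count;
        # the entry with the lowest score is evicted.
        ranked = {eid: i for i, eid in enumerate(self.entries)}
        victim = min(self.entries,
                     key=lambda e: self.recency_weight * ranked[e]
                                   + (1 - self.recency_weight) * self.hits[e])
        self.entries.pop(victim)
```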
A deeper look at the implementation that powers Monadd-AI.
A fundamental conflict exists between a static computation graph and Monadd's need for dynamic, on-the-fly decisions. This is resolved by refactoring MoE layer execution into two stages: a gating stage, which computes the router scores so the offloading engine can decide which experts to fetch and at what precision, and an expert-execution stage, which runs only after the selected weights are resident on the GPU.
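A rough sketch of this split, reusing the `cumulative_top_p_experts` helper from the earlier sketch; the `scheduler` object and its `prepare`/`resident_experts` interface are hypothetical stand-ins for the offloading engine.

```python
import numpy as np

def moe_layer_two_stage(hidden, router_w, scheduler):
    """Stage 1 runs only the router so the offloading engine can decide,
    outside the compute graph, which experts to fetch and at what precision.
    Stage 2 runs the selected experts once their weights are resident."""
    # --- Stage 1: gating ---
    gate_logits = hidden @ router_w.T
    hot, cold, probs = cumulative_top_p_experts(gate_logits)   # earlier sketch
    plan = scheduler.prepare(hot_experts=hot, cold_experts=cold, probs=probs)

    # --- Stage 2: expert computation ---
    out = np.zeros_like(hidden)
    for expert_id, weight in plan.resident_experts():   # weights now in VRAM
        # `weight` is a square-matrix stand-in for the expert's FFN, so the
        # sketch stays shape-compatible with `hidden`.
        out += probs[expert_id] * (hidden @ weight)
    return out
```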
PagedAttention applies OS virtual-memory concepts to manage the KV cache in non-contiguous blocks, cutting memory waste from as much as 80% to near zero. Offloading cold cache blocks to RAM/SSD further frees GPU VRAM, enabling massive context windows (10M+ tokens).
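PagedAttention is the technique introduced by vLLM; the sketch below shows only the block-table bookkeeping that maps logical token positions onto non-contiguous physical blocks (the block size and storage layout are illustrative).

```python
import numpy as np

BLOCK_TOKENS = 16   # tokens per KV block (illustrative)

class PagedKVCache:
    """Maps each sequence's logical token positions onto non-contiguous
    physical blocks, so KV memory is allocated in block-sized units with
    no per-sequence over-reservation. Cold blocks could equally live in
    system RAM or on SSD instead of this in-memory array."""

    def __init__(self, num_blocks, head_dim):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}     # seq_id -> list of physical block ids
        self.kv = np.zeros((num_blocks, BLOCK_TOKENS, head_dim), dtype=np.float16)

    def append_token(self, seq_id, pos, kv_vec):
        """Write the KV vector for token `pos`, allocating a new block
        only when the previous one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // BLOCK_TOKENS >= len(table):
            table.append(self.free_blocks.pop())
        block = table[pos // BLOCK_TOKENS]
        self.kv[block, pos % BLOCK_TOKENS] = kv_vec

    def read_token(self, seq_id, pos):
        block = self.block_tables[seq_id][pos // BLOCK_TOKENS]
        return self.kv[block, pos % BLOCK_TOKENS]
```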
Our public beta will target the models listed below, with the full commercial release expanding to support all major Hugging Face MoE models. The inference engine is designed for systems equipped with a minimum of 16GB of RAM and a 24GB VRAM graphics card (e.g., NVIDIA RTX 4090, AMD Radeon RX 7900 XTX) or better.
Model | Total / Activated Params | Experts per layer / Active experts per token | Native Context Window |
---|---|---|---|
DeepSeek V3/R1 (671 B) | 671 B / 37 B | 257 (256 routed + 1 shared) / 9 (8 routed + 1 shared) | 128 K tokens |
Llama 4 Scout (109 B) | 109 B / 17 B | 16 routed + 1 shared / 1 routed + 1 shared | 10 M tokens |
Llama 4 Maverick (400 B) | 400 B / 17 B | 128 routed + 1 shared / 1 routed + 1 shared | 1 M tokens |
Qwen 3 (235 B) | 235 B / 22 B | 128 / 8 | 32 K tokens |
*Unquantized weight tensors for all models above are in FP16 or BF16 precision.
We are seeking investors, partners, and talent to build the future of democratized AI.
Invest in our $0.5M pre-seed round to capitalize on the multi-billion dollar shift to private, powerful AI with our breakthrough inference technology.
Partner with us to build and monetize a new class of deeply integrated AI solutions for high-value industries.
Join our founding team as a key engineer or scientist to build a world-first inference engine and solve some of the toughest challenges in AI.