The Monadd-AI Engine: Capability without compromise

An offline LLM inference engine that runs ultra-large, state-of-the-art (SOTA) 100B-600B+ parameter MoE models on consumer-grade hardware instead of GPU clusters, using advanced memory management, online quantization, and caching algorithms that exploit MoE sparsity.

Problem: The AI Privacy-Capability Paradox

Professionals and SMEs face a critical dilemma. Cloud AI (Google, OpenAI, DeepSeek) offers state-of-the-art (SOTA) performance, but at the expense of data privacy, high and unpredictable costs, API rate limiting, artificial context window constraints, and network latency. Conversely, current local AI inference systems preserve privacy but deliver subpar model quality, because hardware constraints force them onto significantly smaller models that fail to meet professional workflow demands.

Solution: The Monadd-AI Engine

Our vision is to democratize state-of-the-art AI by making it private, affordable, and universally accessible, eliminating cloud dependence. Monadd-AI introduces a virtual memory system for neural network weights: a Mixed-Precision Expert Offloading system that intelligently manages expert weights across a memory hierarchy, from high-speed GPU VRAM down to system RAM and SSD storage. The architecture is engineered for flexibility and ease of use; it works directly with standard FP16 GGUF models and performs on-demand quantization at runtime, with no custom model file formats or pre-processing steps.

Standard GGUF Compatibility

Works directly with standard FP16 model files. The system maps expert tensor locations at load time without immediately loading their data into memory.
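
A minimal sketch of this lazy indexing, in Python: tensor names, offsets, and sizes are assumed to have already been parsed from the GGUF header (ExpertIndex, TensorLoc, and parse_gguf_tensor_infos are illustrative names, not the engine's actual API).

```python
import mmap
from dataclasses import dataclass

@dataclass
class TensorLoc:
    offset: int    # byte offset of the tensor's data within the GGUF file
    nbytes: int    # size of the raw FP16 payload in bytes
    shape: tuple   # logical tensor shape

class ExpertIndex:
    """Maps expert tensor names to on-disk locations without loading their data."""

    def __init__(self, path: str, tensor_infos: dict[str, TensorLoc]):
        # tensor_infos comes from parsing the GGUF header / tensor-info section
        # (parse_gguf_tensor_infos below is a hypothetical helper, not a real API).
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._locs = tensor_infos

    def fetch(self, name: str) -> memoryview:
        # The OS only pages in the bytes backing this expert when they are touched.
        loc = self._locs[name]
        return memoryview(self._mm)[loc.offset:loc.offset + loc.nbytes]

# Usage sketch: build the index at load time, fetch lazily when the router
# actually selects an expert.
# index = ExpertIndex(path, parse_gguf_tensor_infos(path))
# raw = index.fetch("blk.12.ffn_gate_exps.weight")   # illustrative tensor name
```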

Runtime Mixed-Precision

Quantizes expert weights on the fly from FP16 to lower precisions such as INT4/INT2, based on each expert's importance for the token currently being processed or generated.

Zero Limiting

Provides unlimited inference with no requests-per-minute (RPM) or tokens-per-minute (TPM) limits and no artificial context window constraints. Scale with your hardware, not API quotas.

Core Technological Principles

Our system is built on three cooperating components that work across the token, layer, and sequence levels to optimize inference.

1. Token-Level: Dynamic Expert Loading

The mixed-precision strategy determines expert handling at the token level. It uses a "Dynamic Cumulative Top-p" analysis based on gating scores, which approximate expert importance with high accuracy (a sketch of the selection logic follows the list below).

  • Experts are sorted by importance for the current token.
  • High-Precision (FP16): Loaded if the expert falls within the top ~60% of importance.
  • Low-Precision (INT4/2): Loaded if importance is between ~60-90%.
  • Skipped: The least important experts (beyond 90%) are skipped entirely to save memory and compute.
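
A minimal sketch of this selection logic, assuming the router's gating logits for the current token are available as a NumPy array; the 60%/90% thresholds and the function name are illustrative defaults, not the engine's tuned values.

```python
import numpy as np

def select_expert_precisions(gate_logits: np.ndarray,
                             fp16_p: float = 0.60,
                             low_p: float = 0.90) -> dict[int, str]:
    """Assign a precision tier to each expert for the current token."""
    # Convert gating logits into a probability distribution over experts.
    probs = np.exp(gate_logits - gate_logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]            # experts ranked by importance
    ranked = probs[order]
    mass_before = np.cumsum(ranked) - ranked   # cumulative mass of higher-ranked experts

    tiers: dict[int, str] = {}
    for rank, expert_id in enumerate(order):
        if mass_before[rank] < fp16_p:
            tiers[int(expert_id)] = "fp16"     # top ~60% of importance: full precision
        elif mass_before[rank] < low_p:
            tiers[int(expert_id)] = "int4"     # next ~30%: low precision (INT4/INT2)
        else:
            tiers[int(expert_id)] = "skip"     # least important tail: not loaded at all
    return tiers

# Example with 8 routed experts:
# select_expert_precisions(np.array([2.1, 1.7, 1.2, 0.9, 0.4, 0.1, -0.3, -0.8]))
```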

2. Layer-Level: Adaptive Expert Prefetching

To hide I/O latency, the system predicts which experts will be needed in the next layer (N+1). This is possible because gating inputs have high cosine similarity across consecutive layers (a sketch of the prefetch flow follows the list below).

  • The next layer's gating module is computed in parallel to predict which experts it will need.
  • Non-blocking, low-priority prefetch tasks are issued to a background scheduler.
  • Experts are prefetched at low precision to minimize the bandwidth cost and performance penalty of a misprediction.
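
A minimal sketch of the prefetch flow under two assumptions: the next layer's routing is approximated by feeding that layer's gate the current layer's gating input, and a small thread pool stands in for the engine's background scheduler (the cache and loader interfaces are hypothetical).

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def predict_next_layer_experts(gating_input: np.ndarray,
                               next_gate_weight: np.ndarray,
                               top_k: int) -> list[int]:
    """Approximate layer N+1's routing by feeding it layer N's gating input.

    This works because gating inputs are highly similar (cosine-wise) across
    consecutive layers, so the prediction is right most of the time.
    """
    logits = next_gate_weight @ gating_input
    return [int(i) for i in np.argsort(logits)[::-1][:top_k]]

class ExpertPrefetcher:
    """Issues non-blocking, low-priority prefetches for the next layer's experts."""

    def __init__(self, cache, loader, max_workers: int = 2):
        self.cache = cache      # GPU-resident expert cache; assumed interface: contains()
        self.loader = loader    # assumed interface: load(layer, expert_id, precision)
        self.pool = ThreadPoolExecutor(max_workers=max_workers)

    def prefetch(self, next_layer: int, predicted_expert_ids: list[int]) -> None:
        for expert_id in predicted_expert_ids:
            if self.cache.contains(next_layer, expert_id):
                continue                      # already resident, nothing to do
            # Low precision keeps the bandwidth cost of a misprediction small;
            # fire-and-forget so the decode path never blocks on the prefetch.
            self.pool.submit(self.loader.load, next_layer, expert_id, "int4")
```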

3. Sequence-Level: Multidimensional Caching

A cache manager governs the GPU memory that holds the "hot" expert weights, minimizing cache-miss penalties through a hybrid eviction policy (a sketch of the eviction score follows the list below).

  • The policy score is a weighted sum of multiple metrics: LHU (Least High-Precision Used), LFU (Least Frequently Used), LRU (Least Recently Used), and FLD (Far Layer Distance).
  • The cache is structured with separate high- and low-precision sections to minimize miss penalties.
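
A minimal sketch of how such a hybrid score could be combined; the metric definitions, normalizations, and weights below are illustrative guesses, not the engine's published policy.

```python
from dataclasses import dataclass

@dataclass
class CachedExpert:
    layer: int                 # transformer layer this expert belongs to
    high_precision_hits: int   # times served at high precision (LHU signal)
    total_hits: int            # how often it was used at all (LFU signal)
    last_used_step: int        # decoding step of the most recent use (LRU signal)

def eviction_score(e: CachedExpert, current_layer: int, current_step: int,
                   num_layers: int,
                   w_lhu: float = 0.3, w_lfu: float = 0.3,
                   w_lru: float = 0.2, w_fld: float = 0.2) -> float:
    """Weighted sum of LHU/LFU/LRU/FLD terms; higher = better eviction candidate."""
    lhu = 1.0 / (1.0 + e.high_precision_hits)                    # rarely served at high precision
    lfu = 1.0 / (1.0 + e.total_hits)                             # rarely used overall
    age = current_step - e.last_used_step
    lru = age / (1.0 + age)                                      # not used recently (normalized)
    fld = ((e.layer - current_layer) % num_layers) / num_layers  # far from the layer up next
    return w_lhu * lhu + w_lfu * lfu + w_lru * lru + w_fld * fld

# Eviction picks the resident expert with the highest score:
# victim = max(resident_experts, key=lambda e: eviction_score(e, layer, step, n_layers))
```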

Engine Components & Key Innovations

A deeper look at the implementation that powers Monadd-AI.

Online Quantization & Fidelity Preservation

  • On-demand quantization from FP16 to lower bit-widths is CPU-intensive. To ensure high performance, our engine uses a CPU-side LRU (Least Recently Used) cache to store recent quantization results, avoiding repeated, expensive work.
  • To preserve model fidelity, the engine applies Adaptive Percentile Clipping to minimize the effect of outliers during low bit-width quantization (see the sketch after this list).
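
A minimal sketch of both ideas together: symmetric INT4 quantization clipped at an illustrative percentile, wrapped in Python's functools.lru_cache as a stand-in for the engine's CPU-side result cache (load_expert_fp16 is a hypothetical loader, and the percentile value is not the engine's actual setting).

```python
import numpy as np
from functools import lru_cache

def quantize_int4_percentile_clip(weights: np.ndarray, pct: float = 99.5):
    """Symmetric INT4 quantization with percentile-based outlier clipping.

    Clipping at the pct-th percentile of |w| (instead of the absolute max)
    keeps a handful of outliers from inflating the scale and crushing the
    resolution available to the bulk of the weights.
    """
    w = weights.astype(np.float32)
    clip = max(float(np.percentile(np.abs(w), pct)), 1e-8)  # adaptive clipping threshold
    w = np.clip(w, -clip, clip)
    scale = clip / 7.0                                       # symmetric int4 grid uses [-7, 7]
    q = np.round(w / scale).astype(np.int8)
    return q, scale                                          # dequantize as q * scale

@lru_cache(maxsize=256)
def quantized_expert(layer: int, expert_id: int):
    """CPU-side LRU cache: repeat requests for the same expert reuse prior work."""
    fp16_weights = load_expert_fp16(layer, expert_id)        # hypothetical FP16 loader
    return quantize_int4_percentile_clip(fp16_weights)
```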

Two-Stage Dynamic Execution

A fundamental conflict exists between a static computation graph and Monadd's need for dynamic, on-the-fly decisions. This is solved by refactoring MoE layer execution into two stages (sketched after the list below):

  • 1 - Gating Evaluation: A small, partial graph is computed to acquire only the gating weights for the current token/layer.
  • 2 - Expert Evaluation: The engine logic runs, enqueues loading tasks, and waits. Once experts are in the GPU cache, the main graph for expert computation is built and executed.
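
A minimal sketch of the two-stage control flow for a single MoE layer; the engine.* calls are hypothetical placeholders for the underlying graph API, and select_expert_precisions refers to the token-level sketch above.

```python
def run_moe_layer(layer: int, hidden_state, engine):
    """Two-stage execution of one MoE layer (engine.* calls are placeholders)."""
    # --- Stage 1: gating evaluation -----------------------------------------
    # A small partial graph computes only the gating scores for this token/layer.
    gate_logits = engine.compute_gating(layer, hidden_state)

    # Engine logic runs between the stages: pick precisions, enqueue loads, wait.
    tiers = select_expert_precisions(gate_logits)            # token-level sketch above
    pending = [engine.enqueue_load(layer, eid, prec)
               for eid, prec in tiers.items() if prec != "skip"]
    engine.wait(pending)                                     # experts now sit in the GPU cache

    # --- Stage 2: expert evaluation ------------------------------------------
    # Only now is the main graph for the selected experts built and executed.
    active = {eid: prec for eid, prec in tiers.items() if prec != "skip"}
    return engine.compute_experts(layer, hidden_state, active)
```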

PagedAttention for KV-Cache Offloading

PagedAttention applies OS virtual-memory concepts to manage the KV cache in non-contiguous, fixed-size blocks, reducing the memory waste of contiguous allocation (often 60-80%) to only a few percent. This frees GPU VRAM by offloading the cache to RAM/SSD, enabling massive context windows (10M+ tokens). A sketch of the block-table bookkeeping follows the list below.

  • Copy-on-Write: Shares memory blocks between sequences until a divergence occurs.
  • Swapping or Recomputation: To efficiently recover evicted blocks.
  • Batched Processing: Optimized for aggressive, asynchronous batching to enable parallel agentic workflows with minimal memory overhead.
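
A minimal sketch of block-table bookkeeping in the spirit of PagedAttention: fixed-size blocks, per-sequence block tables, and reference counts for copy-on-write sharing (the block size and class name are illustrative, not the engine's implementation).

```python
class PagedKVCache:
    """Toy block-table manager: KV memory is handed out in fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.ref_count = [0] * num_blocks
        self.block_tables: dict[int, list[int]] = {}   # seq_id -> physical block ids

    def append_token(self, seq_id: int, pos: int) -> int:
        """Return the physical block holding token `pos`, allocating lazily."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % self.block_size == 0:                 # last block is full (or sequence is new)
            block = self.free_blocks.pop()
            self.ref_count[block] += 1
            table.append(block)
        return table[-1]

    def fork(self, parent_seq: int, child_seq: int) -> None:
        """Copy-on-write sharing: the child reuses the parent's blocks by reference."""
        shared = list(self.block_tables[parent_seq])
        self.block_tables[child_seq] = shared
        for b in shared:
            self.ref_count[b] += 1
        # A real implementation copies a block only when a sequence writes to a
        # block whose ref_count > 1 (the actual copy-on-write step, omitted here).
```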

Supported Models & Hardware Requirements

Our public beta will target the models listed below, with the full commercial release expanding to support all major Hugging Face MoE models. The inference engine is designed for systems with a minimum of 16 GB of system RAM and a graphics card with 24 GB of VRAM (e.g., NVIDIA RTX 4090, AMD Radeon RX 7900 XTX) or better.

Model | Total / Activated Params | Experts per Layer / Active Experts per Token | Native Context Window
DeepSeek V3/R1 | 671 B / 37 B | 257 (256 routed + 1 shared) / 9 (8 routed + 1 shared) | 128 K tokens
Llama 4 Scout | 109 B / 17 B | 16 routed + 1 shared / 1 routed + 1 shared | 10 M tokens
Llama 4 Maverick | 400 B / 17 B | 128 routed + 1 shared / 1 routed + 1 shared | 1 M tokens
Qwen 3 | 235 B / 22 B | 128 / 8 | 32 K tokens

*Unquantized weight tensors for all models above are in FP16 or BF16 precision.

Join Us

We are seeking investors, partners, and talent to build the future of democratized AI.

For Investors

Invest in our $0.5M pre-seed round to capitalize on the multi-billion dollar shift to private, powerful AI with our breakthrough inference technology.

For Partners

Partner with us to build and monetize a new class of deeply integrated AI solutions for high-value industries.

For Talent

Join our founding team as a key engineer or scientist to build a world-first inference engine and solve some of the toughest challenges in AI.

Get in touch