# Eval-Based System

### Overview

Our Adaptive Reasoning Models represent a breakthrough in large-scale inference optimization and real-time decision support systems. Built on our Service Fabric architecture, these models dynamically adjust their reasoning processes based on context, computational constraints, and performance requirements.

The system combines multi-tiered processing, speculative decoding, and runtime self-calibration to deliver low-latency inference at scale while maintaining accuracy and reliability in production environments.

### Core Architecture

#### Multi-Tiered Processing System

Our adaptive reasoning system employs a sophisticated multi-tiered architecture designed for real-time analysis and decision support:

**Processing Intervals:**

* **1-second checks**: Immediate response validation and basic reasoning
* **5-second checks**: Intermediate reasoning with context evaluation
* **60-second checks**: Deep reasoning and comprehensive analysis
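
The interval tiers above can be modeled as a simple modulo schedule. The sketch below is illustrative only (the tier names and structure are assumptions, not the production implementation):

```python
# Hypothetical tier table mirroring the intervals above (names are illustrative).
TIERS = {
    "fast": 1,    # 1-second: immediate response validation / basic reasoning
    "medium": 5,  # 5-second: intermediate reasoning with context evaluation
    "deep": 60,   # 60-second: deep reasoning and comprehensive analysis
}

def due_tiers(elapsed_seconds: int) -> list[str]:
    """Return the tiers whose check fires at this whole-second tick."""
    return [
        name
        for name, interval in TIERS.items()
        if elapsed_seconds > 0 and elapsed_seconds % interval == 0
    ]
```

For example, `due_tiers(5)` returns `["fast", "medium"]`, and at the 60-second tick all three tiers fire together.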

**Concurrent Processing Capabilities:**

* Support for up to **144,000 concurrent streaming sessions**
* **224 CPU cores** processing approximately **650 checks per core per second** at peak
* Memory-efficient design with **~20 MB of embeddings per 2-hour stream**
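
As a sanity check, the implied totals behind these figures can be made explicit. The arithmetic below is a back-of-the-envelope illustration only, not production code:

```python
# Illustrative arithmetic from the figures quoted above.
cores = 224
checks_per_core_per_sec = 650
peak_checks_per_sec = cores * checks_per_core_per_sec   # 145,600 checks/s

sessions = 144_000
mb_per_stream = 20                                      # per 2-hour stream
total_embedding_gb = sessions * mb_per_stream / 1024    # ~2,800 GB at full load
```

At one 1-second check per session, 144,000 concurrent sessions consume most of the ~145,600 checks/s peak budget, which is why the slower 5-second and 60-second tiers are run at much lower frequency.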

#### Speculative Decoding Architecture

Our implementation uses an innovative speculative decoding approach:

* **2.8B parameter model** generates 5 possible continuations
* **400B parameter model** (LLaMA-4) verifies and selects optimal predictions
* **Custom kernels** optimize the verification process
* **Self-calibration system** adjusts weights during runtime
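
The draft/verify flow above can be sketched as follows. This is a greedy-acceptance simplification of speculative decoding, reading "5 possible continuations" as a 5-token draft; the model callables are placeholders, and the real system uses probabilistic acceptance criteria and custom verification kernels:

```python
from typing import Callable

def speculative_step(
    draft_next: Callable[[list[int]], int],   # small draft model (e.g. 2.8B)
    verify_next: Callable[[list[int]], int],  # large verifier (e.g. 400B)
    context: list[int],
    k: int = 5,
) -> list[int]:
    """Draft k tokens cheaply, then keep the longest prefix the verifier agrees with."""
    # 1. Draft phase: the small model proposes k continuation tokens.
    draft = []
    ctx = list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. Verify phase: the large model checks each proposal in order.
    accepted = []
    ctx = list(context)
    for tok in draft:
        if verify_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            # First disagreement: take the verifier's token instead and stop.
            accepted.append(verify_next(ctx))
            break
    return accepted
```

The payoff is that accepted draft tokens cost only one verifier pass for the whole batch, while a disagreement still yields one correct token from the large model.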

### Infrastructure & Cost Analysis

#### Current Usage & Projections

**Baseline Costs:**

* Current Anthropic usage: **~$900 every 1-2 days** ($5.4k month-to-date)
* Significant cost optimization opportunity through self-hosted infrastructure

**Projected Infrastructure Savings:**

* **$35k/month** for two nodes at peak capacity
* **$20k reduction** in GCP costs
* Potential to **absorb entire OpenAI bill** through optimized self-hosting
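
Converting the quoted figures to comparable monthly numbers makes the trade-off concrete. This is an illustrative back-of-the-envelope calculation using only the figures stated above:

```python
# "$900 every 1-2 days" expressed as a monthly range (30-day month).
low_monthly = 900 * 30 / 2    # $900 every 2 days -> $13,500/month
high_monthly = 900 * 30 / 1   # $900 every day   -> $27,000/month

# Self-hosted cost net of the projected GCP reduction.
self_hosted = 35_000          # two nodes at peak capacity
gcp_reduction = 20_000        # projected GCP savings
net_self_hosted = self_hosted - gcp_reduction  # $15,000/month effective
```

On these numbers, the effective $15k/month self-hosted cost sits within the $13.5k-$27k/month range of current API spend, before counting any absorbed OpenAI usage.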

#### Hardware Configuration

**Production Setup:**

* **48 NVL-72s** running 24/7 for maximum availability
* **Docker image size**: 46GB (compiled binary for optimization)
* Multi-node architecture supporting horizontal scaling

