Inference Benchmark History

All benchmarks were run on an NVIDIA Jetson AGX Orin 64GB.

Date         Model                 Avg Tok/s      TTFT
2026-04-06   ariaos-forge:latest        21.5   1192 ms
2026-03-18   aya:8b                     21.0   2071 ms
2026-03-18   deepseek-r1:1.5b           41.0   1389 ms
2026-03-18   phi3.5:3.8b                39.6    112 ms
2026-03-18   llama3.1:8b                 6.7  17649 ms

Model A/B Comparison Results

April 2026

Winner: ariaos-forge:latest (vs qwen2.5-coder:7b)

Head-to-head inference comparison on an NVIDIA Jetson AGX Orin 64GB. The A/B Compare module evaluates response quality, latency, and token throughput side by side.

How A/B Compare Works

  • Send the same prompt to two models simultaneously
  • Measure tok/s, TTFT, and total generation time for each
  • Compare response quality with structured evaluation criteria
  • Track win rates over time with persistent history
  • Export results to CSV for further analysis
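The timing side of the steps above can be sketched as a small harness. This is a minimal illustration, not PraetorianMind's actual implementation: `fake_stream` is a hypothetical stand-in for a local model's streaming generate call, and the model names are taken from the comparison above.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_stream(model, prompt):
    """Hypothetical stand-in for a local streaming generate call.
    A real harness would stream tokens from the inference server."""
    for token in ["The", " answer", " is", " 42", "."]:
        time.sleep(0.005)  # simulate per-token generation latency
        yield token

def measure(model, prompt, stream_fn=fake_stream):
    """Time one generation: TTFT, sustained tok/s, and total wall time."""
    start = time.perf_counter()
    ttft_ms = None
    tokens = 0
    for _ in stream_fn(model, prompt):
        if ttft_ms is None:  # first token arrived
            ttft_ms = (time.perf_counter() - start) * 1000.0
        tokens += 1
    total_s = time.perf_counter() - start
    return {"model": model, "ttft_ms": ttft_ms,
            "tok_s": tokens / total_s, "total_s": total_s}

def ab_compare(prompt, model_a, model_b):
    """Send the same prompt to both models at the same time."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_a = pool.submit(measure, model_a, prompt)
        fut_b = pool.submit(measure, model_b, prompt)
        return fut_a.result(), fut_b.result()

a, b = ab_compare("Explain TTFT.", "ariaos-forge:latest", "qwen2.5-coder:7b")
print(a["model"], round(a["tok_s"], 1), "tok/s")
```

Running both generations concurrently keeps the comparison fair on a shared device, since background load affects both runs equally.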

Benchmark Methodology

All benchmarks are run locally in PraetorianMind's Inference Bench module. The hardware platform is an NVIDIA Jetson AGX Orin 64GB running JetPack with CUDA acceleration. Each benchmark run captures:

  • Average Tokens per Second (tok/s) — sustained generation throughput
  • Time to First Token (TTFT) — latency from prompt submission to first token
  • CUDA Memory Bandwidth — GPU memory utilization during inference
  • Efficiency Score — composite metric of throughput vs. model size
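A benchmark record with the derived metrics above might be modeled as follows. This is an illustrative sketch only: the field names are invented, and the Efficiency Score formula shown (throughput scaled by parameter count) is one plausible composite, not the documented one.

```python
from dataclasses import dataclass

@dataclass
class BenchRun:
    """One Inference Bench record (field names are illustrative)."""
    model: str
    tokens_generated: int
    gen_seconds: float    # generation time after the first token
    ttft_ms: float        # time to first token
    param_count_b: float  # model size in billions of parameters

    @property
    def tok_s(self) -> float:
        """Sustained generation throughput."""
        return self.tokens_generated / self.gen_seconds

    @property
    def efficiency(self) -> float:
        # Hypothetical composite: tokens/s weighted by model size,
        # so a larger model at equal throughput scores higher.
        return self.tok_s * self.param_count_b

run = BenchRun("phi3.5:3.8b", 396, 10.0, 112.0, 3.8)
print(round(run.tok_s, 1), round(run.efficiency, 2))
```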

Results are stored in PraetorianMind's local database and can be exported as CSV. No data leaves the device.
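The CSV export can be sketched with the standard library alone, which keeps everything on-device. The rows below mirror the history table above; the column names are illustrative, not the module's actual schema.

```python
import csv
import io

# Illustrative rows mirroring the benchmark history table.
rows = [
    {"date": "2026-04-06", "model": "ariaos-forge:latest", "tok_s": 21.5, "ttft_ms": 1192},
    {"date": "2026-03-18", "model": "deepseek-r1:1.5b", "tok_s": 41.0, "ttft_ms": 1389},
]

def export_csv(rows, fh):
    """Write benchmark rows to a file-like object as CSV with a header."""
    writer = csv.DictWriter(fh, fieldnames=["date", "model", "tok_s", "ttft_ms"])
    writer.writeheader()
    writer.writerows(rows)

buf = io.StringIO()  # in a real export this would be an open file
export_csv(rows, buf)
print(buf.getvalue().splitlines()[0])  # prints the header line
```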