- Send the same prompt to two models simultaneously
- Measure tok/s, TTFT, and total generation time for each
- Compare response quality with structured evaluation criteria
- Track win rates over time with persistent history
- Export results to CSV for further analysis
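The timing side of this loop can be sketched in a few lines. The helper below is a hypothetical illustration (not PraetorianMind's actual implementation): it consumes a token stream from any backend and records the three latency metrics listed above.

```python
import time
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class RunMetrics:
    ttft_ms: float    # time from submission to first token
    tok_per_s: float  # sustained generation throughput
    total_s: float    # total generation time

def measure_stream(tokens: Iterable[str]) -> RunMetrics:
    """Consume a token stream and record TTFT, tok/s, and total time."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in tokens:
        if first is None:
            first = time.perf_counter()  # first token arrived
        count += 1
    end = time.perf_counter()
    ttft_ms = (first - start) * 1000.0 if first is not None else float("inf")
    total_s = end - start
    tok_per_s = count / total_s if total_s > 0 else 0.0
    return RunMetrics(ttft_ms, tok_per_s, total_s)

def simulated_model(n_tokens: int, delay_s: float) -> Iterator[str]:
    """Stand-in for a streaming model response, for demonstration."""
    for _ in range(n_tokens):
        time.sleep(delay_s)
        yield "tok"
```

Running `measure_stream` once per model on the same prompt, and comparing the two `RunMetrics`, is the essence of an A/B pass; a real harness would stream tokens from the two local inference endpoints instead of `simulated_model`.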
Inference Benchmark History
All benchmarks run on NVIDIA Jetson AGX Orin 64GB
| Date | Model | Avg Tok/s | TTFT (ms) |
|---|---|---|---|
| 2026-04-06 | ariaos-forge:latest | 21.5 | 1192 |
| 2026-03-18 | aya:8b | 21.0 | 2071 |
| 2026-03-18 | deepseek-r1:1.5b | 41.0 | 1389 |
| 2026-03-18 | phi3.5:3.8b | 39.6 | 112 |
| 2026-03-18 | llama3.1:8b | 6.7 | 17649 |
Model A/B Comparison Results
April 2026: ariaos-forge:latest vs. qwen2.5-coder:7b — winner: ariaos-forge:latest.
Head-to-head inference comparison on NVIDIA Jetson AGX Orin 64GB. The A/B Compare module evaluates response quality, latency, and token throughput side by side.
How A/B Compare Works
Benchmark Methodology
All benchmarks are run locally in PraetorianMind's Inference Bench module. The hardware platform is an NVIDIA Jetson AGX Orin 64GB running JetPack with CUDA acceleration. Each benchmark run captures:
- Average Tokens per Second (tok/s) — sustained generation throughput
- Time to First Token (TTFT) — latency from prompt submission to first token
- CUDA Memory Bandwidth — GPU memory utilization during inference
- Efficiency Score — composite metric of throughput vs. model size
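The exact formula behind the Efficiency Score is not spelled out here; one plausible form of a "throughput vs. model size" composite is throughput normalized by parameter count, sketched below as an assumption, using figures from the benchmark table above.

```python
def efficiency_score(tok_per_s: float, params_billions: float) -> float:
    """Hypothetical composite: sustained tok/s per billion parameters.

    Higher is better; a small model must be proportionally faster
    than a large one to score the same.
    """
    if params_billions <= 0:
        raise ValueError("model size must be positive")
    return tok_per_s / params_billions

# From the history table: deepseek-r1:1.5b at 41.0 tok/s vs llama3.1:8b at 6.7 tok/s.
small = efficiency_score(41.0, 1.5)  # ~27.3 tok/s per B params
large = efficiency_score(6.7, 8.0)   # ~0.84 tok/s per B params
```

Under this definition the 1.5B model dominates on efficiency even before its raw throughput lead is considered.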
Results are stored in PraetorianMind's local database and can be exported as CSV. No data leaves the device.
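A minimal sketch of the store-then-export path, assuming the local database is SQLite and a hypothetical `benchmarks` table with the columns shown in the history table above (the real schema and table name may differ):

```python
import csv
import sqlite3

def export_benchmarks(db_path: str, csv_path: str) -> int:
    """Dump the assumed benchmarks table to a CSV file; return row count."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT date, model, avg_tok_s, ttft_ms FROM benchmarks ORDER BY date"
        ).fetchall()
    finally:
        conn.close()
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["date", "model", "avg_tok_s", "ttft_ms"])  # header
        writer.writerows(rows)
    return len(rows)
```

Because both the database file and the CSV live on the device's filesystem, nothing in this path requires a network connection, which is consistent with the no-data-leaves-the-device guarantee.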