AQEA Gen-5 · Shader

1.85× faster than FAISS-GPU.
Exact, not approximate.

Hyperscale vector search on the AQEA Gen-5 Substrate — bit-identical across CUDA, Vulkan, Metal and WebGPU.

msmarco-10M on H100 PCIe1.85× faster than FAISS-GPU23× smaller index100% Bit-Identity4/4 GPU backends

Section 01

Six numbers, none of them hedged.

1.85×

faster than Float-FAISS-GPU at msmarco-10M (H100 PCIe)

23×

smaller index storage at equal retrieval recall

100%

Bit-Identity verified vs CPU brute-force reference

4 / 4

GPU backends bit-identical (Vulkan · Metal · WebGPU · CUDA)

99.68%

reversible decoding at 1M scale

13 / 13

cross-modal validation domains above 80% floor

WHY CROSS-VENDOR MATTERS

Compile once. Run everywhere. Identically.

Today's production retrieval stacks (FAISS-GPU, NeMo Retriever, RAPIDS cuVS) are CUDA-only by construction. AQEA Shader is the only system that delivers exact (not approximate) results across NVIDIA, AMD, Apple, Intel, ARM and browser hardware — from a single shader source.

Section 02

Not “high recall.” Exact equality.

Most ANN systems trade recall for speed. AQEA Shader does not. Bit-Identity means the Top-K set we return equals the brute-force Top-K — same documents, same order, same distance values, byte-identical against a brute-force reference.

We verified this against an independent CPU reference at both 1M and 10M corpus scales, and across NVIDIA-Vulkan vs Apple-Metal hardware.

Section 03

Two targets, one shader source.

CPU Pipeline

AVX-512 / NEON

  • Hetzner AX102 production hardware
  • 13.7× higher throughput vs FAISS-CPU-Float (msmarco-100k)
  • 6.6× lower energy per query (37.66 mJ vs 248.63 mJ)
  • 97.4% recall retention

GPU Pipeline

WGSL via wgpu

  • Lambda H100 PCIe production hardware
  • 1.85× faster than Torch-FlatIP-GPU at 10M
  • 1.73× more energy-efficient (103 mJ vs 179 mJ per query at 1M)
  • Single shader source runs on Vulkan / Metal / WebGPU / DirectX-12

Section 04

CUDA-only is a problem.

FAISS-GPU, NeMo Retriever, RAPIDS cuVS — all CUDA-only by construction. For hyperscalers diversifying away from single-vendor compute (AMD MI-series, custom silicon, Apple-based edge), for enterprises with multi-cloud or sovereign-cloud requirements, and for edge/browser deployments where CUDA isn't present — there is no portable, exact alternative. We are it.

Section 05

msmarco detail results.

WorkloadAQEA R@10Float R@10RatioQPS (batch)Energy Ratio
msmarco-100k (CPU)0.94720.972597.4%13.7× higher6.6× lower
msmarco-1M (GPU)1.63× faster (p50)1.73× lower
msmarco-10M (GPU)100% Bit-Identity1.85× faster (p50)~2.5× lower (est.)

Section 06

Engineering engagement.

Step 01

Engineering Eval

4–6 weeks

Joint benchmark on your retrieval workload and hardware. Recommended starting point.

Step 02

Integration Pilot

8–12 weeks

Shadow-mode deployment in your production retrieval stack.

Step 03

Co-Development

Engagement

Hardware-specific tuning, encoder-family extensions, bespoke deployment.

Bit-Identity is verifiable today.

Run our test fixtures on your own hardware — Apple M3 to NVIDIA H100 to AMD MI300X.