Wayfinder

Tune your stack before you deploy.

Operating an inference stack involves making a huge number of decisions. Wayfinder helps navigate that space to find the most cost-effective deployment for your model and show you how close your stack is to the hardware's theoretical limits.

Every new release is a potential tripwire.

The inference ecosystem is evolving rapidly, with updated model architectures and runtimes shipping every few weeks. Each release risks a regression: for example, hybrid models can silently break prefix caching on certain runtimes, resulting in a 5× performance drop for some workloads. Even in the absence of regressions, operators face a huge number of choices: whether to disaggregate, whether to offload the KV cache to remote memory or disk, which kernels to run, and so on.

Wayfinder sweeps the configuration space, comparing different runtimes and their tunable parameters for your workload, to find the most cost-effective way of hosting the model on available hardware.

Sweep. Prune. Vet.

01 · Sweep

Programmable testing

Wayfinder runs experiments that are defined in a single declarative configuration file and sweep hundreds of parameters in parallel. It runs these experiments in reproducible environments across available hardware and tracks the provenance of the results.

Storing experiments as code allows for better reproducibility: experiments can be run with identical workloads and configurations across new runtimes or releases.
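
As a rough illustration of what "experiments as code" can look like, the sketch below expands a declarative sweep into one experiment per parameter combination. The spec format, the parameter names, and the `expand` helper are hypothetical and are not Wayfinder's actual schema or API.

```python
from itertools import product

# Hypothetical sweep spec: keys and values are illustrative only,
# not Wayfinder's configuration schema.
SWEEP = {
    "runtime": ["runtime_a", "runtime_b"],
    "tensor_parallel": [1, 2, 4],
    "prefix_caching": [True, False],
}


def expand(sweep):
    """Return one dict per combination of parameter values (a full grid)."""
    keys = sorted(sweep)  # fixed key order keeps the expansion reproducible
    value_lists = [sweep[k] for k in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


experiments = expand(SWEEP)
print(len(experiments))  # 2 * 3 * 2 = 12 combinations
```

Because the grid is derived from a single spec, rerunning the same file against a new runtime release yields the same set of experiments, which is what makes results comparable over time.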

02 · Prune

Comparative analysis

Centralizing experiment results and parameters in a database makes it possible to compare results across different hardware. It also makes it possible to track performance across releases on the same hardware and catch regressions before they ship.
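
A regression check across releases can be sketched as a comparison of per-workload throughput against a baseline. The function name, the 5% tolerance, and the numbers below are assumptions made up for illustration; they are not measured results or Wayfinder's implementation.

```python
def find_regressions(baseline, candidate, tolerance=0.05):
    """Return workloads whose throughput fell by more than `tolerance`.

    `baseline` and `candidate` map workload name -> tokens per second.
    The returned dict maps workload name -> fractional change (negative).
    """
    regressions = {}
    for workload, base in baseline.items():
        new = candidate.get(workload)
        if new is None or base <= 0:
            continue  # nothing to compare against
        change = (new - base) / base
        if change < -tolerance:
            regressions[workload] = change
    return regressions


# Illustrative numbers only: same hardware, two runtime releases.
baseline = {"chat": 1000.0, "long_context": 400.0}
candidate = {"chat": 990.0, "long_context": 80.0}

regressions = find_regressions(baseline, candidate)
print(sorted(regressions))  # ['long_context']
```

Here the 1% dip on `chat` stays within tolerance, while the drop from 400 to 80 tokens per second on `long_context` is flagged.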

Wayfinder includes a constraint-aware calculator which surfaces the most cost-effective hosting options for a model-workload combination given specific performance objectives and available hardware.
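
The idea behind a constraint-aware calculator can be sketched as "filter by the objectives, then minimize cost". Everything in the block below (the `Result` fields, the `cheapest` helper, the hardware names, and the prices) is hypothetical and only meant to show the shape of the calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    config: str
    hardware: str
    p99_latency_ms: float
    tokens_per_second: float
    dollars_per_hour: float

    @property
    def dollars_per_million_tokens(self):
        tokens_per_hour = self.tokens_per_second * 3600
        return self.dollars_per_hour / tokens_per_hour * 1e6


def cheapest(results, max_p99_ms, available_hardware):
    """Pick the lowest-cost result that meets the latency objective
    and runs on hardware that is actually available."""
    feasible = [
        r for r in results
        if r.p99_latency_ms <= max_p99_ms and r.hardware in available_hardware
    ]
    return min(feasible, key=lambda r: r.dollars_per_million_tokens, default=None)


# Illustrative data only, not benchmark results.
results = [
    Result("A", "gpu_x", 180.0, 2000.0, 4.0),
    Result("B", "gpu_x", 450.0, 5000.0, 4.0),  # cheapest per token, but misses the SLO
    Result("C", "gpu_y", 150.0, 3000.0, 9.0),
]

best = cheapest(results, max_p99_ms=200.0, available_hardware={"gpu_x", "gpu_y"})
print(best.config)  # A
```

Configuration B has the lowest cost per token but violates the 200 ms p99 objective, so it is excluded before costs are compared.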

03 · Vet

Theoretical upper bounds

LLM inference performance is bimodal: primarily compute-bound during prefill and largely memory-bandwidth-bound during decode. Because each phase is limited by a different resource, an upper bound on performance can be modeled for any hardware given a specific model architecture, workload, and the hardware's peak compute throughput (FLOPS) and memory bandwidth.

Wayfinder builds a roofline model for your hardware and workloads so that performance can be framed against a concrete target, rather than being driven purely by intuition.
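
For readers unfamiliar with the term, a standard roofline model caps attainable throughput at the lower of peak compute and memory bandwidth multiplied by arithmetic intensity (FLOPs performed per byte moved). The sketch below uses made-up hardware figures and intensities to show the two regimes; it is not Wayfinder's model.

```python
def attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Roofline bound in FLOP/s.

    peak_flops: peak compute throughput, FLOP/s
    mem_bandwidth: peak memory bandwidth, bytes/s
    arithmetic_intensity: FLOPs performed per byte moved
    """
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


# Hypothetical accelerator, chosen for round numbers.
PEAK_FLOPS = 1e15      # 1 PFLOP/s
MEM_BANDWIDTH = 2e12   # 2 TB/s

# Intensity at which the bound switches from memory to compute.
ridge_point = PEAK_FLOPS / MEM_BANDWIDTH  # 500 FLOP/byte

# Illustrative intensities: decode sits far below the ridge, prefill above it.
decode_bound = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, 2.0)
prefill_bound = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, 1000.0)

print(decode_bound < PEAK_FLOPS)    # True: limited by memory bandwidth
print(prefill_bound == PEAK_FLOPS)  # True: limited by compute
```

Dividing measured throughput by the bound for the same intensity gives a single number for how close a deployment is to the hardware limit.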

IOP Systems created the ability to understand how to optimize for what matters most — p99 latency — while improving overall cost and resource utilization… The ability to visualize and make data-based decisions is a big step up from what we accomplish on our own.

Kelly Hammond · Sr. Director, Intel