Benchmark Performance

The purpose of this vignette is to demonstrate the inference performance of onnxruntime via nativeORT, including its CoreML capabilities. Specifically, it shows that nativeORT can run inference in real time (29.97 fps, i.e., under ~33.4 ms per frame).

The benchmark runs 100 inferences on a 1x3x256x256 array on an Apple M1 machine, simulating an incoming video stream.
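At that real-time target, the per-frame latency budget follows directly from the frame rate:

```r
# Latency budget per frame at NTSC real-time (29.97 fps)
fps_target <- 29.97
budget_ms  <- 1000 / fps_target
budget_ms  # ~33.37 ms per frame
```

Any inference pipeline whose end-to-end latency stays below this budget keeps up with the stream.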

nativeORT CPU & CoreML

# typical normalized RGB 256x256 image in NCHW layout (batch, channels, height, width)
input <- array(
  runif(1 * 3 * 256 * 256),
  dim = c(1L, 3L, 256L, 256L)
)

session <- nativeORT::ort_session(model_path,
                                  threads = 0L,     # 0 = let onnxruntime choose
                                  opt_level = 99L)  # enable all graph optimizations

times_cpu <- numeric(100)
for (i in seq_len(100)) {
  times_cpu[i] <- system.time(
    nativeORT::ort_infer_raw(session, input)
  )["elapsed"] * 1000  # seconds -> milliseconds
}

# CoreML
dir.create(path.expand("~/.nativeORT/cache"),
           recursive = TRUE, showWarnings = FALSE)
session <- nativeORT::ort_session(model_path,
                                  provider = 'coreml',
                                  cache_dir = path.expand("~/.nativeORT/cache"),
                                  threads = 0L,
                                  opt_level = 99L)

times_coreml <- numeric(100)
for (i in seq_len(100)) {
  times_coreml[i] <- system.time(
    nativeORT::ort_infer_raw(session, input)
  )["elapsed"] * 1000  # seconds -> milliseconds
}
results <- data.frame(
  run = rep(seq_along(times_cpu), 2),
  provider = c(
    rep("CPU (nativeORT)", length(times_cpu)),
    rep("CoreML (nativeORT)", length(times_coreml))
  ),
  latency_ms = c(times_cpu, times_coreml)
)
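Before plotting, the raw latencies can be condensed into per-provider summaries. A minimal sketch (`latency_summary` is a hypothetical helper, not part of nativeORT), shown here on a small fixed vector for illustration; in the vignette it would be applied to `times_cpu` and `times_coreml`:

```r
# Hypothetical helper: summarize a vector of per-run latencies (in ms)
latency_summary <- function(ms) {
  c(median_ms = median(ms),
    p95_ms    = unname(quantile(ms, 0.95)),
    max_fps   = 1000 / median(ms))
}

# Illustrative fixed input; real usage: latency_summary(times_cpu)
latency_summary(c(8, 7, 9, 8, 7.5))
```

The median is a more robust headline number than the mean here, since warm-up spikes would otherwise skew the result.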

library(ggplot2)

ggplot(results, aes(x = run, y = latency_ms, color = provider)) +
  geom_line() +
  geom_hline(yintercept = 1000 / 29.97, linetype = "dashed", color = "red") +
  annotate("text", x = 85, y = 40, label = "29.97 fps threshold") +
  labs(
    title = "Inference Latency Across Inference Engines",
    subtitle = "YOLOv11n, 256x256 Images, Apple M1",
    x = "Run",
    y = "Latency (ms)"
  ) +
  theme_minimal()

Results

Notably, nativeORT runs substantially faster than the real-time requirement. Thanks to optimizations in the C++ bindings, CPU and CoreML latency are near parity; the CoreML runs are more stable, however, because they execute on dedicated hardware, whereas the CPU path is subject to slowdowns when other processes compete for cores.

CoreML does require a warm-up (visible as the initial spike), but after one or two inferences it becomes real-time performant. At a median latency of 7-8 ms on the Apple M1, there is ample time left to run post-processing and still stay under the target latency.
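As a quick sanity check on that headroom claim, the per-frame budget at 29.97 fps minus the observed median leaves roughly 26 ms for post-processing (7.5 ms is taken as the midpoint of the 7-8 ms median reported above):

```r
budget_ms   <- 1000 / 29.97  # per-frame budget at 29.97 fps (~33.37 ms)
median_ms   <- 7.5           # midpoint of the 7-8 ms median reported above
headroom_ms <- budget_ms - median_ms
c(budget_ms      = budget_ms,
  headroom_ms    = headroom_ms,
  achievable_fps = 1000 / median_ms)
```

At ~133 fps of raw inference throughput, inference itself consumes under a quarter of the real-time budget.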