Containerizing the Edge: Building a High-Performance llama.cpp Docker Stack for NVIDIA Spark (arm64)

The NVIDIA Spark is a marvel of edge engineering. By bringing the Grace Blackwell (GB10) architecture into a compact form factor, it offers something previously unheard of in local AI: a massive 128GB unified memory pool shared between 20 custom ARM64 cores and a Blackwell GPU.

Image generated via Gemini’s nano banana

While llama.cpp is legendary for its efficiency on bare metal, I’ve always found that running AI services directly on a host OS can lead to a management nightmare. I wanted the best of both worlds: the raw inference speed of Blackwell and the clean isolation of Docker.

This is the story of how I bypassed the “out-of-the-box” failures of official images to build a ground-up, optimized container environment for the next generation of ARM64 AI.

Why I Chose Docker: Beyond Bare Metal

Efficiency is vital, but I prioritize running my services in containers for two main reasons:

  1. Isolation & Stability: llama.cpp requires specific versions of CUDA toolkits and libraries. By containerizing it, I ensure that my AI stack never interferes with my host system’s drivers or other services.
  2. Easy Management: Upgrading the inference engine or swapping out models becomes a single-line command. I can experiment with different builds without ever “polluting” my primary work environment.

The “Out-of-the-Box” Gap

When I started, I hit a frustrating wall: the official llama.cpp container registry does not publish CUDA-enabled ARM64 images. The pre-built CUDA images are strictly x86_64, so if you are running an NVIDIA Spark or a Jetson, the official containers simply won’t start.

I tried using the Dockerfiles provided in the repository, but they failed on the Spark. The Blackwell SoC requires a specific sm_121 target (compute capability 12.1) and unique pathing for its compatibility libraries. The standard scripts couldn’t find the drivers, and the default configurations weren’t tuned for the ARMv9 vector instructions I had at my disposal.
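Before fighting the build system, it is worth confirming what the hardware actually reports. A quick sanity check on the Spark host looks something like this (the compute_cap query needs a reasonably recent nvidia-smi; the expected values in the comments reflect the GB10’s documented specs rather than captured output):

# Confirm the CPU architecture the image has to be built for
$ uname -m          # expected: aarch64

# Ask the driver which compute capability the GPU reports
$ nvidia-smi --query-gpu=compute_cap --format=csv,noheader
# expected: 12.1 on the GB10, i.e. an sm_121 build target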

Engineering My Stack from Scratch

To solve this, I threw out the templates and engineered a build system tailored specifically for the Spark.

I designed a two-stage Dockerfile to ensure the final image was as lean as possible without sacrificing power.

1. The Build Stage (The Forge)

I started with nvidia/cuda:12.2.0-devel-ubuntu22.04. This stage is where the heavy lifting happens. To ensure a perfectly clean environment, I implemented a “nuclear” build step:

  • Clean Slate: My script runs rm -rf build before every compilation. This prevents CMakeCache.txt errors that often occur when switching environments.
  • The Linking Secret: I used -DCMAKE_LIBRARY_PATH=/usr/local/cuda/lib64/stubs and -Wl,--allow-shlib-undefined. This is the “secret sauce” that lets the linker resolve against the CUDA driver stubs at build time, deferring the real driver symbols until the GPU is actually present at runtime.
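To see what that stub trick is actually working around, you can peek inside the devel image yourself. This is just an illustrative check, assuming the standard layout of NVIDIA’s CUDA images:

# The stubs directory ships placeholder driver libraries (libcuda.so, etc.)
# that satisfy the linker at build time, since no real GPU driver is
# available inside the build container
$ docker run --rm nvidia/cuda:12.2.0-devel-ubuntu22.04 \
    ls /usr/local/cuda/lib64/stubs/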

2. The Runtime Stage (The Sword)

The second stage uses nvidia/cuda:12.2.0-runtime-ubuntu22.04. I don’t need compilers here—only the results.

  • Surgical Precision: I only copy the llama-server binary and the necessary .so shared libraries from the build stage.
  • Library Path Injection: This was the critical “Eureka” moment. I set LD_LIBRARY_PATH to include /app, /usr/local/cuda/lib64, and, importantly, /usr/local/cuda/compat. This tells the isolated container exactly where to find the compatibility libraries for the Spark’s Blackwell driver stack.

The final Dockerfile looks like this:

# Use NVIDIA's official CUDA devel image for ARM64
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04 AS build

# Install build dependencies
RUN apt-get update && apt-get install -y \
    cmake \
    build-essential \
    libcurl4-openssl-dev \
    git \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy the llama.cpp repository
COPY llama.cpp/ .

# Build llama.cpp with CUDA and cURL support from scratch
# remove any existing build directory to ensure a clean build
RUN rm -rf build && \
    cmake -B build \
    -DGGML_CUDA=ON \
    -DLLAMA_CURL=ON \
    -DLLAMA_BUILD_EXAMPLES=OFF \
    -DCMAKE_LIBRARY_PATH=/usr/local/cuda/lib64/stubs \
    -DCMAKE_EXE_LINKER_FLAGS="-Wl,--allow-shlib-undefined" \
    . && \
    cmake --build build --config Release -j $(nproc)

# Final stage: Create a slim runtime image
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y \
    libcurl4 \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy only the compiled binaries from the build stage
COPY --from=build /app/build/bin/llama-server /app/llama-server
COPY --from=build /app/build/bin/*.so* /app/

# Create a directory for models to be mounted
RUN mkdir /models

# Set environment variables for CUDA
#ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
ENV LD_LIBRARY_PATH=/app:/usr/local/cuda/lib64:/usr/local/cuda/compat:$LD_LIBRARY_PATH

EXPOSE 8033

# The entrypoint allows you to pass any model/params at runtime
ENTRYPOINT ["/app/llama-server"]
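With the Dockerfile in place, building the image and sanity-checking the linkage looks roughly like this. The image name is just a placeholder of mine, and the ldd check simply confirms that the CUDA libraries resolve through the LD_LIBRARY_PATH set above:

# Build the image directly on the Spark (aarch64 host)
$ docker build -t llama-server-arm64-cuda .

# Verify the server binary resolves its CUDA/ggml libraries through
# /app, /usr/local/cuda/lib64 and /usr/local/cuda/compat
$ docker run --rm --gpus all --entrypoint ldd \
    llama-server-arm64-cuda /app/llama-server | grep -i -E "cuda|cublas|ggml"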

Unleashing the Unified Memory

By containerizing llama.cpp this way, I didn’t lose the Spark’s greatest strength: its unified memory. Because the container has direct access to the CUDA driver, it treats the system’s LPDDR5x as a single massive pool. I can run large models with high context windows, all while keeping the service isolated and easy to manage.
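As a concrete illustration (paths, port, and context size here are examples rather than a prescription), a typical launch looks like the sketch below. The -ngl 99 flag asks llama.cpp to offload every layer to the GPU, which on the Spark simply means the same unified LPDDR5x pool:

# Launch the containerized server with a mounted model directory
# (the model filename is a placeholder for whatever GGUF you have downloaded)
$ docker run --rm --gpus all \
    -v $(pwd)/models:/models \
    -p 8033:8033 \
    llama-server-arm64-cuda \
    -m /models/model.gguf \
    --host 0.0.0.0 \
    --port 8033 \
    -ngl 99 \
    -c 32768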

The Multimodal Pipeline: Beyond Text

While standard LLMs only process text, Vision-Language Models (VLMs) like Qwen2.5-VL require a more sophisticated loading strategy. I had to bridge the gap between “seeing” pixels and “reading” tokens.

1. Seamless Downloads with huggingface-hub

To keep my setup dynamic, I don’t hardcode models into the image. Instead, I use the huggingface-hub Python CLI to pull exactly what I need on demand. This allows me to keep the Docker image slim while having access to the millions of models on the Hub.
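The only host-side dependency for this is the huggingface_hub package itself; recent releases ship the hf entry point used below (older ones expose the same download subcommand via huggingface-cli):

# Install or upgrade the Hugging Face Hub client that provides the hf CLI
$ pip install -U huggingface_hub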

I use a simple one-liner to fetch specific GGUF files:

# Example: Downloading a Qwen2.5-VL 3B model
$ hf download ggml-org/Qwen2.5-VL-3B-Instruct-GGUF \
  Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
  --local-dir ./models/

By using the --local-dir flag, I ensure the models land in the exact volume I’ve mounted to my container, bypassing the default hidden cache and making model management visible and easy.

# Example: Downloading a Qwen2.5-VL 3B model's mmproj file
$ hf download ggml-org/Qwen2.5-VL-3B-Instruct-GGUF \
  mmproj-Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
  --local-dir ./models/
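With both files downloaded, the last step is wiring them together at launch. The sketch below reuses the image and port from earlier and assumes a recent llama.cpp build, where --mmproj is the llama-server flag that loads the multimodal projector alongside the language model:

# Serve the VLM: the language model plus its multimodal projector
$ docker run --rm --gpus all \
    -v $(pwd)/models:/models \
    -p 8033:8033 \
    llama-server-arm64-cuda \
    -m /models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
    --mmproj /models/mmproj-Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
    --host 0.0.0.0 \
    --port 8033 \
    -ngl 99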

Original post: https://medium.com/@cslev/containerizing-the-edge-building-a-high-performance-llama-cpp-883f60a461de