Swift LLM Training: Matrix Mult From Gflop/s to Tflop/s
A deep dive into optimizing matrix multiplication in Swift for LLM training, with lessons on CPU vs GPU performance and the CUDA software moat.
Matt Gallagher recently published an article on CocoaWithLove titled "Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s." It's a deep dive into optimizing matrix multiplication in Swift for LLM training. By carefully tuning matrix multiplication—the bread and butter of neural networks—Gallagher shows how to push an M3 Max's CPU from a few Gflop/s to over a Tflop/s. It's not just about LLMs; it's a lesson in how far you can go when you understand the hardware.
What the Article Covers: From Naive to Tflop/s
The article walks through writing an optimized matrix multiplication routine in Swift, targeting Apple Silicon's CPU (including its AMX coprocessor). Gallagher starts with a naive implementation that achieves only 0.2 Gflop/s and incrementally applies optimizations: better loop ordering, pointer arithmetic, vectorization via Accelerate, and finally direct use of AMX instructions. The result: 1.1 Tflop/s for float16 matrix multiplication. He notes that the GPU on the same chip can theoretically hit 15 Tflop/s, but that in practice the realistic ceiling for this task is closer to 3-5 Tflop/s once overhead is accounted for.
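To make the starting point concrete, here's a minimal sketch (my own code, not Gallagher's) of a naive row-major matmul in Swift, followed by the same computation with the loops reordered so the inner loop streams through memory contiguously, which is the flavor of the first optimizations the article describes:

func matmulNaive(_ a: [Float], _ b: [Float], _ c: inout [Float], m: Int, n: Int, k: Int) {
    // C[i*n + j] = sum over p of A[i*k + p] * B[p*n + j], all row-major
    for i in 0..<m {
        for j in 0..<n {
            var sum: Float = 0
            for p in 0..<k {
                sum += a[i * k + p] * b[p * n + j]   // B is read with stride n: cache-unfriendly
            }
            c[i * n + j] = sum
        }
    }
}

func matmulReordered(_ a: [Float], _ b: [Float], _ c: inout [Float], m: Int, n: Int, k: Int) {
    // Same result in i-p-j order: the inner loop now walks a row of B and a row of C
    // contiguously, which the compiler can auto-vectorize. Assumes c starts zeroed.
    for i in 0..<m {
        for p in 0..<k {
            let aip = a[i * k + p]
            for j in 0..<n {
                c[i * n + j] += aip * b[p * n + j]
            }
        }
    }
}

Reordering alone usually helps substantially because the inner loop becomes vectorizable; the remaining distance to Tflop/s comes from the further steps the article covers.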
The key takeaway: Apple's AMX coprocessor is a hidden gem. It is completely undocumented, yet reachable through hand-written assembly and reverse-engineered instruction encodings. Gallagher cracks it open, showing how to achieve GPU-like performance on the CPU, which is crucial for tasks like LLM inference where GPU launch latency can be a bottleneck.
Why the HN Community Is Buzzing
The HN community is buzzing for good reason. As one commenter wrote:
"This is a pretty phenomenal article. Even for those who don't care about LLM use, this is just a great article on optimizing Swift performance, which is sadly something that doesn't have a lot of written material for."
Another commenter highlighted the broader implications:
"This is so true. And also why people should not take basic GPU benchmarks so seriously. Getting peak performance out of a GPU is much more complex than it is with a CPU."
The article fills a gap: Swift documentation on low-level hardware optimization is sparse. The fact that Gallagher is willing to reverse-engineer Apple's AMX instructions—a piece of hardware that Apple has never publicly documented—makes this a landmark piece for anyone serious about performance on Apple Silicon.
Key Takeaways for Builders
This article matters far beyond LLMs. It's a case study in the diminishing returns of GPU hype for many real-world workloads. Everyone assumes that because a GPU has massive theoretical peak FLOPS, it's the best tool for every matrix multiplication. In practice, the cost of moving data and launching kernels often eats into that peak, especially for small-to-medium matrices typical in transformer inference. Gallagher's results show that a carefully tuned CPU can be competitive—and for on-device ML where GPU may be busy or power-hungry, that's a huge win.
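As a rough illustration (the numbers below are assumptions of mine, not measurements from the article), consider pushing a single token through a 4096x4096 weight matrix on a GPU with a 15 Tflop/s peak and tens of microseconds of dispatch overhead:

// Back-of-the-envelope sketch: how launch overhead erodes GPU peak for small matmuls.
let m = 1, n = 4096, k = 4096
let flops = 2.0 * Double(m) * Double(n) * Double(k)   // ~33.5 MFLOP of useful work
let gpuPeakFlops = 15e12                // hypothetical 15 Tflop/s peak
let launchOverhead = 50e-6              // assume ~50 µs to dispatch and synchronize a kernel
let computeTime = flops / gpuPeakFlops  // ~2.2 µs of actual math
let effective = flops / (computeTime + launchOverhead)
print(effective / 1e12)                 // ~0.6 Tflop/s: overhead dominates

The math itself takes a couple of microseconds, so effective throughput collapses to well under a Tflop/s, which is exactly why a well-fed CPU path can compete.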
The fact that he used Swift is also noteworthy. Swift has been gaining traction in server-side and systems programming, but its performance potential is often overshadowed by C++ and Rust. This article proves Swift can hold its own when you get close to the metal.
That said, I have mixed feelings about relying on undocumented AMX instructions. Apple could change them in the next chip without notice, and your code might break. For production systems, that's a risk. The Accelerate framework is safer, but Gallagher shows it's not as fast as raw AMX. It's a trade-off between performance and maintainability.
Practical Code: Using Accelerate and AMX
If you're building ML apps that run on Apple devices—especially on-device inference or fine-tuning—this is directly relevant. You can now write Swift code that leverages AMX to get GPU-like throughput without the overhead of Metal.
Here's an example of how to call the Accelerate framework for matrix multiplication in Swift (a step before the full AMX approach):
import Accelerate

// Single-precision GEMM via Accelerate's BLAS: C = alpha * A * B + beta * C.
// A is m x k, B is k x n, C is m x n, all row-major.
func gemm(A: [Float], B: [Float], m: Int32, n: Int32, k: Int32) -> [Float] {
    var C = [Float](repeating: 0, count: Int(m) * Int(n))
    let alpha: Float = 1.0
    let beta: Float = 0.0
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                alpha, A, k,   // lda: row stride of A
                B, n,          // ldb: row stride of B
                beta, &C, n)   // ldc: row stride of C
    return C
}
For ultimate speed, you'd have to drop down to assembly and use the AMX instruction encodings that Gallagher reverse-engineered. That's more complex, but achievable.
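Before going that far, it's worth measuring what the Accelerate path already delivers on your own machine. Here's a rough timing harness for the gemm wrapper above (matrix sizes and iteration count are arbitrary choices of mine):

import Accelerate
import Dispatch

let m: Int32 = 1024, n: Int32 = 1024, k: Int32 = 1024
let a = (0..<Int(m * k)).map { _ in Float.random(in: -1...1) }
let b = (0..<Int(k * n)).map { _ in Float.random(in: -1...1) }

_ = gemm(A: a, B: b, m: m, n: n, k: k)        // warm-up call

let iterations = 20
let start = DispatchTime.now()
for _ in 0..<iterations {
    _ = gemm(A: a, B: b, m: m, n: n, k: k)
}
let elapsed = Double(DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds) / 1e9

// 2*m*n*k floating-point operations per multiply
let flopsPerCall = 2.0 * Double(m) * Double(n) * Double(k)
print("\(flopsPerCall * Double(iterations) / elapsed / 1e9) Gflop/s")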
The Software Moat of CUDA and Apple Silicon
Another implication: the software moat of CUDA is real. Nvidia has spent years building libraries like cuBLAS and cuDNN that are hand-tuned for its hardware, and no other vendor has equivalent tooling; Apple's Metal Performance Shaders, for instance, often fall short of what the silicon can theoretically deliver. This article is a small step toward closing that gap, showing that Apple's hardware can compete if you're willing to invest the engineering effort.
The M3 Max pairs a powerful CPU and GPU, but extracting peak performance from either is non-trivial. Gallagher's work demonstrates that the CPU, via AMX, can rival the GPU for certain workloads. That matters for on-device ML, where GPU resources are shared or power-constrained.
Should you care? If you're optimizing ML inference on Apple Silicon, yes—this article gives you a concrete path to significant speedups. If you write performance-critical Swift code (e.g., for games or audio processing), the optimization techniques are broadly applicable. If you only work on cloud GPUs or high-level frameworks like PyTorch, you can probably skip it—but it's still a fascinating look at what's possible when you go deep.