Reverse engineering Apple's simdgroup async copy on M4
A while ago I read a great post on percisely.xyz/gemm about writing fast GEMM kernels on Apple GPUs. The author described an undocumented Metal instruction called simdgroup_async_copy, used the intrinsic __metal_simdgroup_async_copy_2d to move tiles from device memory into threadgroup memory, and reported a counterintuitive finding: a single simdgroup doing the load was much faster than splitting the work across many.
I wanted to replicate that on my M4 Mac. Things did not go as planned, and the path to a working benchmark turned into a small archaeology project. I worked through it with Claude. This is the kind of problem where an LLM that can hold dozens of binaries and symbol tables in its head at once turns out to be the right shape of tool.
Setup
Everything below is on this exact configuration. If your toolchain differs, the symbol names and lowering behavior may shift, since this is all undocumented and Apple changes it between Xcode releases.
- Apple M4, GPU family AGXMetalG16G_B0
- macOS 26.0 (Darwin 25.2.0)
- Xcode 26.4 SDK (MacOSX26.4.sdk)
- Metal toolchain 32023.883, AIR version v28, MSL Metal 4
- Max threadgroup memory per workgroup, 32768 bytes
The first attempt
The percisely.xyz article gives a clear declaration of the intrinsic.
struct _simdgroup_event_t;
thread _simdgroup_event_t* __metal_simdgroup_async_copy_2d(
ulong, ulong,
threadgroup void *,
ulong, ulong, ulong2,
const device void *,
ulong, ulong, ulong2,
long2, int)
__asm("air.simdgroup_async_copy_2d.p3i8.p1i8");
I pasted that into a Metal kernel on Xcode 26 with the Metal 4 toolchain, on an M4 Mac. The compiler said
error: illegal string literal in 'asm'
__asm("air.simdgroup_async_copy_2d.p3i8.p1i8");
Strange. The article was clearly working code when it was written. Something had changed underneath it.
What is actually in the compiler today
Working with Claude, I started inspecting the Metal toolchain binaries directly. The compiler driver lives at Metal.xctoolchain/usr/metal/<ver>/bin/air-nt, and running strings on it shows all the AIR symbols the frontend recognizes. The relevant ones are
air.async_wg_copy
air.async_wg_strided_copy
air.wait_wg_events
air.get_null_wg_event
No simdgroup_async_copy at all. Apple appears to be migrating from the simdgroup level async copy API the article uses to a new workgroup level one. So I wrote a kernel using the new builtins.
__metal_threadgroup_event_t prev;
__metal_threadgroup_event_t e = __metal_async_wg_copy(p, s, 64, prev);
__metal_wait_wg_events(1, &e);
It compiled. Then pipeline state creation failed with
Encountered unlowered function call to air.async_wg_copy.p3f32.p1f32
The frontend knows about the new API. The M4 GPU backend has no implementation for it. The wg variant is a dead end. It exists in the compiler driver but not in libapplegpu-nt.dylib, which is where the actual GPU lowerings live. PSO creation fails for every form I tried (read, write, strided, float, float4, uchar). The wg path is genuinely broken on M4 right now.
Why the original asm label trick stopped working
The article's declaration uses an asm label string "air.simdgroup_async_copy_2d.p3i8.p1i8". I tried a fresh version of the same trick, with various tweaks like escaping the dot as \56 (octal for .) to see if the filter was just a literal text match.
extern "C" void f() __asm__("foo\56bar");
extern "C" void f() __asm__("air\56xx");
The first one compiles. The second one does not. The frontend rejects asm label strings that begin with air followed by a dot, even when the dot is written as an escape, so the filter must run after escape processing. This looks deliberate, presumably to keep people out of internal AIR symbols.
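The two probes above are consistent with a check that decodes escapes first and then tests the prefix. Here is a minimal Python sketch of that model. The function names are hypothetical, and this only reproduces the behavior I observed, not Apple's actual implementation.

```python
import re

def decode_octal_escapes(label: str) -> str:
    """Replace C-style octal escapes such as \\56 with the character they encode."""
    return re.sub(r"\\([0-7]{1,3})", lambda m: chr(int(m.group(1), 8)), label)

def is_rejected(label_source: str) -> bool:
    """Model of the observed filter: decode escapes, then test for the 'air.' prefix."""
    return decode_octal_escapes(label_source).startswith("air.")

# The two probes from the text, written as they appear in source code.
print(is_rejected(r"foo\56bar"))  # False: decodes to "foo.bar"
print(is_rejected(r"air\56xx"))   # True: decodes to "air.xx"
print(is_rejected("air.simdgroup_async_copy_2d.p3i8.p1i8"))  # True
```

A filter that matched the raw source text would have let the escaped form through, which is why the escape experiment is informative.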
So the asm label route is closed. The wg builtins are unlowered. The simdgroup builtins are not exposed.
Looking at the backend library
If the frontend will not call into air.simdgroup_async_copy_*, maybe the GPU backend still knows how to lower it. I ran strings on libapplegpu-nt.dylib.
air.simdgroup_async_copy_1d
air.simdgroup_async_copy_1d.read
air.simdgroup_async_copy_1d.write
air.simdgroup_async_copy_2d
air.simdgroup_async_copy_2d.read
air.simdgroup_async_copy_2d.read.unchecked
air.simdgroup_async_copy_2d.write
air.simdgroup_async_copy_2d.write.unchecked
air.wait_simdgroup_events
air.get_null_simdgroup_event
The old family is alive in the backend. The frontend just pretends it does not exist. Apple seems to be in the middle of an API transition where the old lowerings still work but the new builtins are not yet wired up. Each binary holds one half of a working path: the frontend has the new declarations and the backend has the old lowerings. They just have not been connected.
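The comparison I was doing by eye with strings can be sketched in a few lines of Python. This is a rough stand-in for strings piped through grep, and the helper name air_symbols is hypothetical. The byte blobs below are synthetic. For real use you would read the two binaries with Path(...).read_bytes().

```python
import re

def air_symbols(blob: bytes, min_len: int = 6) -> set:
    """Return printable ASCII runs in blob that start with 'air.'."""
    runs = re.findall(rb"[\x20-\x7e]{%d,}" % min_len, blob)
    return {run.decode("ascii") for run in runs if run.startswith(b"air.")}

# Synthetic stand-ins for the frontend and backend binaries.
frontend = b"\x00air.async_wg_copy\x00air.wait_wg_events\x00clang\x00"
backend = b"\x00air.simdgroup_async_copy_2d.read\x00air.wait_simdgroup_events\x00"

only_frontend = air_symbols(frontend) - air_symbols(backend)
only_backend = air_symbols(backend) - air_symbols(frontend)
print("declared but not lowered:", sorted(only_frontend))
print("lowered but not declared:", sorted(only_backend))
```

The two set differences are the seam described above: names the frontend accepts with no lowering behind them, and lowerings the frontend will not let you name.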
The path forward was to bypass the frontend filter entirely.
Generating IR by hand
The Apple Metal toolchain ships metal -emit-llvm -S to produce LLVM IR from MSL source, plus metal-as to assemble textual IR into AIR bitcode, plus metallib to pack bitcode into a metallib that Metal can load at runtime. The frontend filter on asm labels lives in metal. The other tools do not care what AIR symbols you call.
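The back half of that pipeline is easy to script. The sketch below builds the metal-as and metallib command lines for a given IR file and optionally runs them. It assumes both tools are on PATH (on many setups they are reached through xcrun instead), and the helper names are hypothetical.

```python
import subprocess
from pathlib import Path

def build_commands(ir_path: str) -> list:
    """Return the commands that turn textual IR into a loadable metallib."""
    ir = Path(ir_path)
    bitcode = ir.with_suffix(".bc")
    library = ir.with_suffix(".metallib")
    return [
        ["metal-as", str(ir), "-o", str(bitcode)],       # textual IR -> AIR bitcode
        ["metallib", str(bitcode), "-o", str(library)],  # bitcode -> metallib
    ]

def build(ir_path: str) -> None:
    """Run each step, stopping at the first failure."""
    for cmd in build_commands(ir_path):
        subprocess.run(cmd, check=True)

for cmd in build_commands("probe_sg.ll"):
    print(" ".join(cmd))
```

Keeping command construction separate from execution makes it easy to print the exact invocation when a step fails.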
My first try was a sed substitution. Compile a wg version of the kernel to IR, then rewrite the symbol name.
sed -e 's/air\.async_wg_copy/air.simdgroup_async_copy_1d.read/g' \
-e 's/air\.wait_wg_events/air.wait_simdgroup_events/g' \
-e 's/_work_group_event_t/_simdgroup_event_t/g' \
probe.ll > probe_sg.ll
metal-as probe_sg.ll -o probe_sg.bc
metallib probe_sg.bc -o probe_sg.metallib
The metallib built. PSO creation succeeded. I dispatched the kernel and looked at the output.
Every value was zero.
The first no op trap
The substituted kernel ran. It took GPU time. The threadgroup memory was untouched. Reading more carefully, I realized the wg copy I borrowed the IR shape from has a 4 argument signature, while the 1D simdgroup variant takes 5. The backend lowering accepted my call structurally but produced something that did not actually move any data. The wrong shape silently lowered to a no op.
This is worth saying clearly because there are two different failure modes in this story, and only one of them is loud.
The wg builtin (air.async_wg_copy, air.async_wg_strided_copy) does not lower at all on M4. PSO creation fails. That is the easy failure to spot.
The simdgroup builtin called with the wrong argument shape does lower. PSO creation succeeds. The kernel runs. The data does not move. That is the silent failure, and it is what got me for a few hours.
The right shape
The article's 2D form has 12 arguments, not 4. I went back to it and wrote the IR by hand with exactly the shape it documents.
%struct._simdgroup_event_t = type opaque
declare %struct._simdgroup_event_t addrspace(3)*
@"air.simdgroup_async_copy_2d.read.p3i8.p1i8"(
i64, i64,
i8 addrspace(3)*,
i64, i64, <2 x i64>,
i8 addrspace(1)*,
i64, i64, <2 x i64>,
<2 x i64>, i32)
declare void @"air.wait_simdgroup_events"(i32, %struct._simdgroup_event_t addrspace(3)**)
The arguments are sizeof(element), alignof(element), destination pointer in threadgroup address space, destination elements per row, destination elements per layer (use 1 for 2D), tile dimensions as a <cols, rows> vector, source pointer in device address space, source elements per row, source elements per layer, source matrix shape as <cols, rows>, source position as <col, row>, and a transpose flag.
A working call looks like this.
%ev = call %struct._simdgroup_event_t addrspace(3)*
@"air.simdgroup_async_copy_2d.read.p3i8.p1i8"(
i64 4, i64 4, i8 addrspace(3)* %tile_i8,
i64 64, i64 1, <2 x i64> <i64 64, i64 8>,
i8 addrspace(1)* %src_i8,
i64 64, i64 1, <2 x i64> %src_shape,
<2 x i64> %pos, i32 0)
store %struct._simdgroup_event_t addrspace(3)* %ev,
%struct._simdgroup_event_t addrspace(3)** %ev_slot, align 8
call void @"air.wait_simdgroup_events"(i32 1,
%struct._simdgroup_event_t addrspace(3)** %ev_slot)
This assembles, packages into a metallib, creates a pipeline state, and actually moves the data.
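To sanity check the numbers in that call, here is a small Python sketch that names the scalar arguments and computes how many bytes the copy moves. The class and field names are my own labels for the positions described above, not anything from Apple.

```python
from dataclasses import dataclass

@dataclass
class Copy2D:
    elem_size: int            # sizeof(element), bytes
    elem_align: int           # alignof(element), bytes
    dst_elems_per_row: int    # destination row stride, in elements
    dst_elems_per_layer: int  # 1 for a plain 2D copy
    tile_cols: int
    tile_rows: int
    src_elems_per_row: int    # source row stride, in elements
    src_elems_per_layer: int

    def tile_bytes(self) -> int:
        """Payload moved by one copy."""
        return self.tile_cols * self.tile_rows * self.elem_size

    def dst_span_bytes(self) -> int:
        """Threadgroup bytes spanned from the first to the last element written."""
        elems = (self.tile_rows - 1) * self.dst_elems_per_row + self.tile_cols
        return elems * self.elem_size

# The working call above: float elements, a 64 x 8 tile, row stride 64 on both sides.
call = Copy2D(4, 4, 64, 1, 64, 8, 64, 1)
print(call.tile_bytes())      # 2048
print(call.dst_span_bytes())  # 2048
```

When the destination stride equals the tile width, as it does here, the two numbers match and the tile is packed densely in threadgroup memory.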
The intrinsics that lower on M4
After more probing, this is the full set of AIR symbols that work on M4 with the Xcode 26 toolchain. Use them via hand written LLVM IR plus metal-as plus metallib.
air.simdgroup_async_copy_2d.read.p3i8.p1i8 with the 12 argument signature shown above is the workhorse for tile loads from device into threadgroup memory.
air.simdgroup_async_copy_2d.write.p3i8.p1i8 is the same shape, opposite direction, for storing a tile from threadgroup back to device.
air.simdgroup_async_copy_2d.read.unchecked.p3i8.p1i8 and the matching .write.unchecked skip bounds checking against the source matrix shape. Slightly faster at low concurrency.
The 1D forms take a shorter 5 argument signature.
declare %struct._simdgroup_event_t addrspace(3)*
@"air.simdgroup_async_copy_1d.read.p3i8.p1i8"(
i64, i64,
i8 addrspace(3)*,
i8 addrspace(1)*,
i64)
Arguments are sizeof(element), alignof(element), destination pointer, source pointer, element count. The matching .write form exists.
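Because a wrong argument count lowers silently, a pre-flight check on the textual IR is cheap insurance. The sketch below counts top-level arguments at each simdgroup async copy call site and compares them with the 12 and 5 argument shapes described above. It is a hypothetical helper, not a real IR parser. It assumes quoted symbol names as in the snippets here, and it says nothing about whether the argument types are right.

```python
import re

EXPECTED_ARITY = {
    "air.simdgroup_async_copy_1d": 5,
    "air.simdgroup_async_copy_2d": 12,
}

def split_top_level(args: str) -> list:
    """Split on commas that are not nested inside (), <>, [] or {}."""
    parts, depth, cur = [], 0, []
    for ch in args:
        if ch in "(<[{":
            depth += 1
        elif ch in ")>]}":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(cur).strip())
            cur = []
        else:
            cur.append(ch)
    tail = "".join(cur).strip()
    if tail:
        parts.append(tail)
    return parts

def check_arity(ir_text: str) -> list:
    """Return (family, found, expected) for every call or declare with the wrong count."""
    problems = []
    pattern = r'@"?(air\.simdgroup_async_copy_[12]d)[\w.]*"?\s*\('
    for m in re.finditer(pattern, ir_text):
        start, depth, i = m.end(), 1, m.end()
        while i < len(ir_text) and depth:  # walk to the matching ')'
            depth += {"(": 1, ")": -1}.get(ir_text[i], 0)
            i += 1
        found = len(split_top_level(ir_text[start:i - 1]))
        expected = EXPECTED_ARITY[m.group(1)]
        if found != expected:
            problems.append((m.group(1), found, expected))
    return problems

bad = ('call void @"air.simdgroup_async_copy_1d.read.p3f32.p1f32"'
       '(float addrspace(3)* %p, float addrspace(1)* %s, i64 64, i8* %prev)')
print(check_arity(bad))  # [('air.simdgroup_async_copy_1d', 4, 5)]
```

Running this over a sed-rewritten file would have flagged the 4 argument call before it ever reached the GPU.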
The wait builtin takes a count and a pointer to an array of event pointers.
declare void @"air.wait_simdgroup_events"(i32, %struct._simdgroup_event_t addrspace(3)**)
A single workgroup can have multiple events in flight at once. You allocate an array of event slots, issue K copies into different regions of threadgroup memory, then call wait with the count and the array pointer.
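One way to lay out the destination regions for K copies per simdgroup is to pack them contiguously and check the total against the 32768 byte threadgroup limit from the setup list. This is a sketch under that packing assumption. The function name is hypothetical, and a real kernel may lay out its slots differently.

```python
TG_LIMIT = 32768  # bytes of threadgroup memory per workgroup, from the setup list

def slot_offsets(simdgroups: int, k: int, bytes_per_copy: int, limit: int = TG_LIMIT) -> list:
    """Byte offset of each copy's destination, indexed as [simdgroup][slot]."""
    total = simdgroups * k * bytes_per_copy
    if total > limit:
        raise ValueError(f"{total} bytes in flight exceeds the {limit} byte limit")
    return [
        [(sg * k + slot) * bytes_per_copy for slot in range(k)]
        for sg in range(simdgroups)
    ]

# 8 simdgroups, 2 copies each, 1 KB per copy: 16 KB in flight.
layout = slot_offsets(8, 2, 1024)
print(layout[0])  # [0, 1024]
print(layout[1])  # [2048, 3072]
```

With 8 simdgroups and 2 KB copies, K=2 lands exactly on the limit and K=4 no longer fits, so pipeline depth and copy size trade off directly against each other.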
air.async_wg_copy, air.async_wg_strided_copy, and air.wait_wg_events do not lower on M4. They are visible in the metal driver but they are not in the GPU backend library. Skip them.
Benchmarking it
With the intrinsic working, I built a sweep in the style of the SemiAnalysis Blackwell TMA piece, varying bytes per copy and pipeline depth, with a manual cooperative float4 load as a sync baseline.
Setup. Workgroup of 256 threads, which is 8 simdgroups of 32 threads each. 1024 workgroups dispatched. Each outer iteration the kernel issues K async copies per simdgroup before waiting on all K events. Source position shifts by one row each iteration so the L2 cache does not just serve every load. Total work per dispatch scales with the bytes in flight.
For each configuration I record bytes per copy, total bytes in flight per workgroup, pipeline depth K, and measured throughput in GB/s. A separate single workgroup, single tile, single iteration probe measures end to end latency.
What the plot shows. The colored lines are async copy throughput at various per copy sizes, from 256 bytes up to 4 KB. Each point on a line is a different pipeline depth K. The dashed black line is a manual cooperative float4 load of the same tile size, run with all 256 threads. The latency annotation at the top left is the round trip time for one minimal async copy from a single workgroup.
A few things stand out.
Per copy size dominates throughput. A 256 byte async copy runs at 70 GB/s. A 2 KB async copy runs at 951 GB/s. There is real fixed overhead per issue. Pipelining helps but does not erase the gap. K=8 at 256 bytes per copy reaches 323 GB/s, which is much better than K=1 at 70 GB/s, but still a long way from where the larger copies sit.
Async throughput soft caps around 950 GB/s. The 1 KB and 2 KB curves climb toward that ceiling regardless of pipeline depth. Beyond 16 KB in flight, throughput drops, because we hit the M4 per workgroup threadgroup memory limit of 32 KB and occupancy collapses.
The manual sync load wins on raw throughput at every comparable size. The dashed line is above the async lines everywhere they coexist, and it peaks at 1454 GB/s with a 16 KB tile. That is more than 50 percent faster than the best async result. If pure device to threadgroup bandwidth is your goal, write a cooperative float4 loop and skip the async machinery.
Latency floor sits at 4.7 microseconds for one minimal round trip. That is the cost to issue an async copy and wait on it, before any data movement bandwidth matters.
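The ratios behind those claims are quick to check. The sketch below uses only the throughput figures quoted above, and the dictionary keys are just labels I made up for them.

```python
results_gbps = {
    "async 256 B, K=1": 70,
    "async 256 B, K=8": 323,
    "async 2 KB": 951,
    "manual float4, 16 KB tile": 1454,
}

def ratio(a: str, b: str) -> float:
    """Throughput of configuration a relative to configuration b."""
    return results_gbps[a] / results_gbps[b]

print(f"{ratio('async 256 B, K=8', 'async 256 B, K=1'):.1f}x")    # 4.6x from pipelining
print(f"{ratio('async 2 KB', 'async 256 B, K=1'):.1f}x")          # 13.6x from larger copies
print(f"{ratio('manual float4, 16 KB tile', 'async 2 KB'):.2f}x")  # 1.53x manual vs async
```

Pipelining eight deep buys about 4.6x at 256 bytes, while going to 2 KB copies buys about 13.6x, which is the sense in which per copy size dominates.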
When async copy actually pays
Async copy on M4 is not faster than manual loads on raw throughput. What it gives you is the ability to fire and forget. If you can overlap a load of tile N+1 with the matmul on tile N, the async path lets the load happen on the dedicated copy path while your simdgroup matrix units stay busy on compute. A manual load competes for issue slots on the same lanes the matmul wants.
For matmul that is probably the right pattern. Issue an async load of the next K tile. Run the simdgroup matrix multiply on the current K tile. Wait. Swap buffers. Loop.
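The ordering of that loop is easy to get subtly wrong, so here is a sketch in Python that only generates the schedule. It does not touch a GPU. The step names and the two buffer layout are my own assumptions about how the pattern would be written.

```python
def double_buffered_schedule(num_tiles: int) -> list:
    """Ordered (op, tile, buffer) steps for a two-buffer load/compute pipeline."""
    steps = []
    if num_tiles == 0:
        return steps
    # Prologue: the first tile has nothing to overlap with.
    steps.append(("issue_load", 0, 0))
    steps.append(("wait", 0, 0))
    for t in range(num_tiles):
        cur, nxt = t % 2, (t + 1) % 2
        has_next = t + 1 < num_tiles
        if has_next:
            steps.append(("issue_load", t + 1, nxt))  # async copy of the next K tile
        steps.append(("compute", t, cur))              # matmul on the current K tile
        if has_next:
            steps.append(("wait", t + 1, nxt))         # next tile must land before the swap
    return steps

for step in double_buffered_schedule(3):
    print(step)
```

Each load is issued into the buffer the current compute step is not reading, and each wait sits after the compute it overlaps with, which is the whole point of using the async path.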
For anything else, I would just use the manual cooperative load.
How this differs from the article
The article ran on an M2 Mac with an older Xcode. The asm label trick worked because the frontend at the time accepted dotted symbol names, so a single C declaration was enough to call into the intrinsic. The hardware behavior the author found, that a single simdgroup doing the load was faster than collaborative loading, appears to be specific to that hardware, since it does not hold on M4.
On M4 with Xcode 26, three things have changed. The frontend now blocks asm labels that start with "air." after escape processing. Apple has built a new workgroup level API to replace the simdgroup one, declared in the metal driver but not yet wired up to the M4 GPU backend. And the M4 hardware itself prefers more simdgroups participating, not fewer. In my benchmarks the 8 simdgroup case is about 6 times faster than the 1 simdgroup case, the opposite of the article's finding.
The reverse engineering needed to get to a working call is also new. You cannot do it from MSL anymore. The hand written IR plus assembler plus packager path is the only way through on this toolchain.
Closing
The article was right about how the intrinsic worked. The world around it changed. The frontend filter, the API migration, and the silent no op trap all conspired to make this look broken when it is not. With a small detour through LLVM IR the simdgroup async copy on M4 is alive and well, just not as fast as Apple's marketing would suggest, and not as fast as just rolling your own load.
The thing that made this tractable was being able to enumerate symbol tables across several binaries and reason about the seam between the frontend and the backend, with Claude helping me hold all the pieces at once. The clues were spread across the metal driver, the GPU backend library, the runtime metallibs for different GPU families, and the error messages from PSO creation. None of them were enough by themselves. Together they pointed at a single explanation that no amount of reading the public Apple docs would have suggested.
