Subho's research at your service 🫡

21 May, 2026 Reverse engineering Apple's simdgroup async copy on M4
23 Mar, 2026 My 2 cents on Fusing GEMM + Top-K + Softmax on SM100
18 Apr, 2025 Optimizing 3D Square Convolution for cuDNN-like Performance - A Worklog
30 Jan, 2025 Preconditioned SGD can level up your training game
25 Jan, 2025 Understanding Lightning Attention: A Breakthrough in Linear Attention Efficiency
17 Nov, 2024 From 10 to 1000 Tokens/Second: Cursor AI's Secret Weapon Revealed