TongmingLAIC / AKO4ALL

ako4all

Drive an agentic loop that iteratively optimizes a GPU kernel for maximum speedup. Use this skill whenever the user wants to optimize / speed up / benchmark a GPU kernel (CUDA, Triton, TileLang, C++, Python), mentions AKO / AKO4ALL / AKO4X / agentic kernel optimization, asks to "make this kernel faster", or has a kernel they want measured against a PyTorch reference. The skill handles setup, profiling (ncu), correctness checking, iteration logging, and git commits. Bootstraps a workspace in any directory the user points at.

Drive a profile → modify → benchmark → log → commit loop on a GPU kernel until it runs faster than the reference. The user provides at minimum a kernel; everything else (reference, inputs, bench script, hints) is optional.
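The loop described above can be sketched in plain Python. This is only an illustrative skeleton, not the skill's actual implementation: `run_loop`, `Iteration`, `benchmark`, `check`, `max_iters`, and `target_speedup` are hypothetical names, and the profiling, kernel modification, and git commit steps are reduced to comments because the source does not specify how they are invoked.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Iteration:
    index: int
    runtime_ms: float
    speedup: float      # reference_ms / runtime_ms; > 1.0 means faster than the reference
    correct: bool
    committed: bool


def run_loop(
    reference_ms: float,
    benchmark: Callable[[int], float],   # hypothetical: runtime in ms of candidate i
    check: Callable[[int], bool],        # hypothetical: True if candidate i matches the reference
    max_iters: int = 5,
    target_speedup: float = 1.0,         # assumption: stop once a kept candidate beats this
) -> List[Iteration]:
    """Simplified profile -> modify -> benchmark -> log -> commit loop."""
    log: List[Iteration] = []
    best_ms = float("inf")
    for i in range(max_iters):
        # A real run would profile (e.g. with ncu) and modify the kernel here.
        runtime_ms = benchmark(i)
        correct = check(i)
        # Keep a candidate only if it is correct and faster than the best so far.
        committed = correct and runtime_ms < best_ms
        if committed:
            best_ms = runtime_ms  # a real run would `git commit` at this point
        speedup = reference_ms / runtime_ms
        log.append(Iteration(i, runtime_ms, speedup, correct, committed))
        if committed and speedup > target_speedup:
            break
    return log


if __name__ == "__main__":
    runtimes = [12.0, 9.0, 13.0, 4.0]  # made-up timings for illustration only
    for entry in run_loop(10.0, lambda i: runtimes[i], lambda i: i != 1, max_iters=4):
        print(entry)
```

In this sketch every iteration is logged, including failed ones, while only correct and faster candidates are kept. That separation is a design choice of the example; the real loop may log and commit differently.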

When this skill applies

  • "optimize this kernel" / "speed up this CUDA / Triton / TileLang kernel"
  • "run AKO / AKO4ALL on ..."
  • "benchmark this kernel against PyTorch"
  • "iterate on this kernel until it's faster"
  • mentions of ncu, kernel profiling, or a GPU speedup target

Does NOT apply when:

  • User wants to write a new kernel from scratch with no optimization target — just write code, no loop.
  • User wants Codex / GPT to review or implement — use codex:rescue instead.
  • User wants generic performance advice for code that isn't a GPU kernel.

First action

Before doing anything else, establish the workspace — the directory the loop runs in. It is typically the user's CWD, or a subdirectory / path they name in the prompt.
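One way to picture the workspace choice is the small helper below. It is a sketch under assumptions: `resolve_workspace` and its parameters are hypothetical names, and the rule it applies (use the named path if one is given, otherwise fall back to the current working directory) is just a direct reading of the sentence above.

```python
from pathlib import Path
from typing import Optional


def resolve_workspace(named_path: Optional[str] = None,
                      cwd: Optional[Path] = None) -> Path:
    """Return the directory the loop should run in.

    named_path: a path the user mentioned in the prompt, if any (hypothetical input).
    cwd: the directory to fall back to; defaults to the process CWD.
    """
    base = (cwd or Path.cwd()).resolve()
    if not named_path:
        return base  # no path named in the prompt: use the CWD
    candidate = Path(named_path).expanduser()
    if not candidate.is_absolute():
        candidate = base / candidate  # treat relative paths as subdirectories of the CWD
    return candidate.resolve()


if __name__ == "__main__":
    print(resolve_workspace())           # the current working directory
    print(resolve_workspace("kernels"))  # <cwd>/kernels
```

The helper does not create or validate the directory; whether a missing workspace should be created is left to the caller.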

Inventory the workspace + prompt

Browse the workspace (don't run a fixed checklist — look around) and read the user's prompt to identify what the loop needs:
