# Kernel Generation

Papers on using LLMs or agents for kernel generation, tensor program generation, and compiler optimization.

## Agent-Based Kernel Generation

* KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta (ISCA 2026) \[[arXiv](https://arxiv.org/abs/2512.23236)] \[[Blog](https://engineering.fb.com/2026/04/02/developer-tools/kernelevolve-how-metas-ranking-engineer-agent-optimizes-ai-infrastructure/)]
  * Meta
  * Present **KernelEvolve**, an agentic coding framework that automates kernel generation and optimization from kernel specifications, targeting recommendation workloads across heterogeneous accelerators.
  * Explore candidate kernels across multiple programming abstractions, from Triton and CuTe DSL to low-level hardware-agnostic languages, via graph-based search with runtime-aware retrieval-augmented prompt synthesis.
  * Demonstrate 100% pass rates on all 250 KernelBench problems and on 160 PyTorch ATen operators across NVIDIA GPUs, AMD GPUs, and Meta accelerators, while reducing kernel development time from weeks to hours.
* AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search (arXiv:2603.21331) \[[arXiv](https://arxiv.org/abs/2603.21331)] \[[Code](https://github.com/RightNow-AI/autokernel)]
  * RightNow AI
  * Introduce an autonomous GPU kernel optimizer that starts from end-to-end PyTorch models and iteratively optimizes the most impactful kernels.
  * Use Amdahl's law to rank kernel optimization opportunities by their end-to-end impact (see the Amdahl's-law sketch after this list) and a five-stage correctness pipeline to validate candidate optimizations.
  * Report strong kernel-level and end-to-end model speedups across matmul, attention, convolution, and MLP workloads on H100 GPUs.
* Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis (arXiv:2603.10846) \[[arXiv](https://arxiv.org/abs/2603.10846)] \[[Homepage](https://evokernel.zhuo.li/)] \[[Dataset](https://huggingface.co/datasets/noahli/EvoKernel)]
  * SJTU & Shanghai AI Lab & MemTensor
  * Introduce **EvoKernel**, a self-evolving agentic framework for NPU kernel synthesis in data-scarce programming domains.
  * Formulate synthesis as a memory-based reinforcement learning task with value-driven retrieval and cross-task memory sharing for cold-start drafting and continual latency refinement (a minimal retrieval sketch follows this list).
  * Improve frontier models' correctness from 11.0% to 83.0% and achieve a median speedup of 3.60x over initial drafts on an NPU variant of KernelBench.
* CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation (arXiv:2602.24286) \[[arXiv](https://arxiv.org/abs/2602.24286)] \[[Code](https://github.com/BytedTsinghua-SIA/CUDA-Agent)] \[[Homepage](https://cuda-agent.github.io/)]
  * ByteDance Seed & Tsinghua AIR
  * Present a large-scale agentic RL system for high-performance CUDA kernel generation.
  * Combine scalable data synthesis, a skill-augmented CUDA development environment, and stable RL training for verifiable kernel optimization (a reward-shaping sketch follows this list).
  * Outperform `torch.compile` by 100%, 100%, and 92% on the three KernelBench levels.
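
As a concrete illustration of the Amdahl's-law ranking used by AutoKernel, the sketch below scores each kernel by the end-to-end speedup that would result from accelerating it alone. The profile contents, kernel names, and the uniform assumed per-kernel speedup are all hypothetical; the paper's actual ranking heuristic may differ.

```python
def amdahl_speedup(fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup if a kernel taking `fraction` of total runtime
    is accelerated by `kernel_speedup`x (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def rank_kernels(profile: dict[str, float], assumed_speedup: float = 3.0):
    """Rank kernels by the end-to-end gain from optimizing each one alone.

    `profile` maps kernel name -> time in ms; `assumed_speedup` is a
    hypothetical uniform estimate of the achievable per-kernel speedup.
    """
    total = sum(profile.values())
    gains = {
        name: amdahl_speedup(t / total, assumed_speedup)
        for name, t in profile.items()
    }
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical profile of an end-to-end model (times in ms).
profile = {"attention": 4.2, "matmul": 3.1, "layernorm": 0.6, "gelu": 0.3}
for name, gain in rank_kernels(profile):
    print(f"{name}: up to {gain:.2f}x end-to-end")
```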
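
EvoKernel's value-driven memory is described only at a high level, so the following is a minimal sketch of one way such retrieval could work, assuming each memory entry stores a task embedding, a kernel draft, and a running value estimate, and that retrieval blends similarity with value. The class, scoring rule, and update rule here are hypothetical, not the paper's definitions.

```python
import numpy as np


class KernelMemory:
    """Toy value-driven memory over (embedding, kernel draft, value) entries."""

    def __init__(self):
        self.entries: list[tuple[np.ndarray, str, float]] = []

    def add(self, emb: np.ndarray, draft: str, value: float = 0.0) -> None:
        self.entries.append((emb / np.linalg.norm(emb), draft, value))

    def update_value(self, idx: int, reward: float, lr: float = 0.3) -> None:
        # Nudge the stored value estimate toward the observed reward,
        # so drafts that keep paying off are retrieved more often.
        emb, draft, value = self.entries[idx]
        self.entries[idx] = (emb, draft, value + lr * (reward - value))

    def retrieve(self, query: np.ndarray, k: int = 3) -> list[str]:
        # Score = cosine similarity to the query task, weighted by the
        # learned value, so high-value drafts from related tasks surface first.
        q = query / np.linalg.norm(query)
        scored = sorted(
            ((float(emb @ q) * (1.0 + value), i)
             for i, (emb, _, value) in enumerate(self.entries)),
            reverse=True,
        )
        return [self.entries[i][1] for _, i in scored[:k]]
```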
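
CUDA Agent frames kernel optimization as RL with verifiable rewards; a natural reward shape for that setting, shown here as an assumption rather than the paper's actual definition, gates any speedup reward behind a functional-correctness check so the policy cannot profit from fast-but-wrong kernels.

```python
import torch


def kernel_reward(candidate, reference, inputs, baseline_ms: float,
                  candidate_ms: float, atol: float = 1e-4) -> float:
    """Correctness-gated speedup reward (hypothetical shaping).

    Returns 0 for incorrect kernels; otherwise the speedup over the
    baseline, so only verified-correct code is ever rewarded.
    """
    with torch.no_grad():
        out, ref = candidate(*inputs), reference(*inputs)
    if not torch.allclose(out, ref, atol=atol):
        return 0.0  # hard gate: wrong answers earn nothing
    return baseline_ms / candidate_ms  # >1 means faster than the baseline
```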

## Compiler Optimization

* Meta Large Language Model Compiler: Foundation Models of Compiler Optimization \[[Paper](https://ai.meta.com/research/publications/meta-large-language-model-compiler-foundation-models-of-compiler-optimization/)]
  * Meta AI
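  * Release **LLM Compiler**, a family of models built on Code Llama and further pretrained on 546 billion tokens of LLVM-IR and assembly code.
  * Provide fine-tuned variants for flag tuning (predicting optimization pass lists that minimize code size) and for disassembling assembly back into LLVM-IR.
  * Ship 7B and 13B parameter models under a license permitting both research and commercial use.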

## Benchmarks

* SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits (arXiv:2603.19173) \[[arXiv](https://arxiv.org/abs/2603.19173)] \[[Code](https://github.com/NVIDIA/SOL-ExecBench)] \[[Benchmark](https://research.nvidia.com/benchmarks/sol-execbench)]
  * NVIDIA
  * Present a benchmark of 235 CUDA kernel optimization problems extracted from 124 production and emerging AI models, targeting NVIDIA Blackwell GPUs.
  * Measure candidate kernels against analytically derived Speed-of-Light bounds via **SOLAR**, using a SOL Score that quantifies how much of the remaining gap to hardware-efficient execution a candidate closes (one plausible formulation is sketched after this list).
  * Provide a sandboxed evaluation harness with GPU clock locking, L2 cache clearing, isolated subprocess execution, and static-analysis-based checks against reward hacking.
* TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators (arXiv:2502.14752) \[[arXiv](https://arxiv.org/abs/2502.14752)] \[[Benchmark](https://github.com/thunlp/TritonBench)]
  * THU-NLP
  * Present the first comprehensive benchmark for Triton operator generation.
  * Cover two evaluation channels: 184 real-world operators from GitHub and a collection of operators aligned with PyTorch interfaces.
  * Evaluate both functional correctness and efficiency on widely deployed GPUs.
* KernelBench: Can LLMs Write Efficient GPU Kernels? (arXiv:2502.10517) \[[arXiv](https://arxiv.org/abs/2502.10517)] \[[Benchmark](https://github.com/ScalingIntelligence/KernelBench)] \[[Homepage](https://scalingintelligence.stanford.edu/blogs/kernelbench/)]
  * Stanford
  * Provide an open-source benchmark for evaluating whether LMs can generate fast and correct GPU kernels for PyTorch ML workloads.
  * Cover 250 tasks spanning single-kernel operators, simple fusion patterns, and full model architectures.
  * Introduce `fast_p`, a metric measuring the fraction of generated kernels that are both correct and faster than the baseline by more than a configurable threshold `p` (see the `fast_p` sketch after this list).
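
The SOL Score is described as measuring how much of the remaining gap to the Speed-of-Light bound a candidate closes. One plausible formulation consistent with that description, though not necessarily SOLAR's exact definition, is:

```python
def sol_score(baseline_ms: float, candidate_ms: float, sol_ms: float) -> float:
    """Fraction of the baseline's gap to the Speed-of-Light (SOL) bound
    that a candidate kernel closes (hypothetical formulation).

    1.0: the candidate reached the analytical hardware limit.
    0.0: no improvement over the baseline.  Negative: a slowdown.
    """
    gap = baseline_ms - sol_ms
    if gap <= 0:  # baseline already at (or past) the analytical bound
        return 1.0
    return (baseline_ms - candidate_ms) / gap


# Hypothetical numbers: baseline 2.0 ms, SOL bound 0.5 ms, candidate 0.8 ms.
print(sol_score(2.0, 0.8, 0.5))  # 0.8 -> 80% of the remaining gap closed
```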
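
KernelBench's `fast_p` metric has a direct implementation: a task counts only if the generated kernel is correct *and* beats the baseline by more than the threshold `p`. The sketch below assumes per-task records of correctness and measured speedup; the record type and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    correct: bool   # passed the functional-correctness check
    speedup: float  # baseline_time / generated_kernel_time


def fast_p(results: list[TaskResult], p: float = 1.0) -> float:
    """Fraction of tasks whose generated kernel is both correct and
    more than p-times faster than the baseline (KernelBench's fast_p)."""
    if not results:
        return 0.0
    hits = sum(1 for r in results if r.correct and r.speedup > p)
    return hits / len(results)


# fast_1 counts correct kernels that beat the baseline at all;
# fast_2 would require at least a 2x speedup.
results = [TaskResult(True, 1.4), TaskResult(True, 0.9), TaskResult(False, 3.0)]
print(fast_p(results, p=1.0))  # 1/3: one task is both correct and faster
```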
