CuTeDSL at Perplexity
Summary
Perplexity AI has integrated CuTeDSL kernels into its Runtime-Optimized Serving Engine (ROSE) to reach peak performance on NVIDIA Hopper and Blackwell GPUs for models ranging from embeddings to trillion-parameter LLMs. ROSE, an in-house inference engine, supports custom Llama models, ranking, classification, and other transformer-based architectures. Initially built with Triton, ROSE moved to CuTeDSL as its primary GPU programming environment because CuTeDSL compiles just-in-time to optimized PTX, offers fine-grained hardware control, and has significantly shorter compilation times than CUDA and CUTLASS C++. This shift enables aggressive compile-time specialization across a wide range of model configurations, including hidden dimensions, data types, and quantization schemes, without slowing development. CuTeDSL also facilitates specialized kernel implementations for the prefill and decode stages, grid synchronization, and optimized QK Norm and Mixture-of-Experts (MoE) dispatch/combine operations, leading to substantial performance gains.
Key takeaway
For AI engineers optimizing GPU inference, adopting CuTeDSL can improve both kernel performance and development velocity. Teams should consider migrating from CUDA or Triton to CuTeDSL: its just-in-time compilation and fine-grained hardware control enable more aggressive compile-time specialization for diverse model architectures and shorter iteration times. This approach can yield higher throughput and lower latency across workloads ranging from embedding models to large language models.
Key insights
CuTeDSL provides fine-grained GPU control and JIT compilation, enabling rapid optimization and specialization for high-performance inference.
Principles
- JIT compilation reduces development-time overhead.
- Compile-time specialization improves kernel performance.
- Fine-grained hardware control is crucial for peak performance.
Method
CuTeDSL compiles Python-based DSL kernels just-in-time to optimized PTX, leveraging CuTe layout algebra and MLIR for aggressive specialization across diverse model configurations and hardware primitives.
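The idea of compile-time specialization can be sketched in plain Python. This is only an illustration of the pattern, not CuTeDSL code: `specialize_scale_kernel`, its parameters, and the `"fp32"`/`"fp16"` labels are hypothetical names, and `functools.lru_cache` stands in for a JIT cache that builds one kernel per configuration and reuses it afterwards.

```python
import struct
from functools import lru_cache


def _round_to_fp16(x: float) -> float:
    # Round-trip through IEEE half precision to emulate a narrower data type.
    return struct.unpack("<e", struct.pack("<e", x))[0]


@lru_cache(maxsize=None)
def specialize_scale_kernel(hidden_dim: int, dtype: str):
    """Hypothetical stand-in for a JIT step: one kernel per (hidden_dim, dtype)."""
    # Everything below runs once per configuration, at "compile time".
    scale = 1.0 / (hidden_dim ** 0.5)
    if dtype == "fp32":
        cast = float
    elif dtype == "fp16":
        cast = _round_to_fp16
    else:
        raise ValueError(f"unsupported dtype: {dtype}")

    def kernel(row):
        # The specialized kernel has hidden_dim, scale, and cast baked in.
        if len(row) != hidden_dim:
            raise ValueError("row length does not match the specialized hidden_dim")
        return [cast(x * scale) for x in row]

    return kernel


# First call builds the kernel; the second call with the same config reuses it.
k32 = specialize_scale_kernel(4, "fp32")
assert specialize_scale_kernel(4, "fp32") is k32
```

In a real JIT setting the cached object would be compiled device code rather than a Python closure, but the trade-off is the same: the more configuration values are fixed at specialization time, the less work remains per call, at the cost of one compilation per distinct configuration.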
In practice
- Use CuTeDSL for specialized GPU kernel development.
- Implement distinct prefill and decode kernels for optimal performance.
- Employ grid barriers for efficient synchronization in decode kernels.
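The role of a grid barrier can be illustrated with a CPU analogy. The sketch below is not GPU or CuTeDSL code: Python threads play the part of thread blocks, `threading.Barrier` stands in for a grid-wide barrier, and `run_two_phase` is a hypothetical name. It shows why a barrier lets two dependent phases live in one "kernel": no block starts phase 2 until every block has published its phase 1 result.

```python
import threading


def run_two_phase(values, num_blocks=4):
    """Normalize values by their global sum using two phases in one launch."""
    partial = [0.0] * num_blocks             # one slot per block, written in phase 1
    out = [0.0] * len(values)                # written in phase 2
    barrier = threading.Barrier(num_blocks)  # stand-in for a grid-wide barrier

    def block(block_id):
        # Each block owns a strided slice of the input.
        idx = range(block_id, len(values), num_blocks)
        # Phase 1: local reduction over this block's slice.
        partial[block_id] = sum(values[i] for i in idx)
        # Barrier: wait until every block has written its partial sum.
        barrier.wait()
        # Phase 2: the full reduction is now safe to read.
        total = sum(partial)
        for i in idx:
            out[i] = values[i] / total

    threads = [threading.Thread(target=block, args=(b,)) for b in range(num_blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


result = run_two_phase([1.0, 2.0, 3.0, 4.0], num_blocks=2)
```

Without the barrier, a block could read `partial` before other blocks had written to it. The alternative is to split the phases into two separate launches, which is the overhead a grid barrier avoids in latency-sensitive decode kernels.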
Topics
- CuTeDSL
- Perplexity ROSE Engine
- GPU Kernel Optimization
- Mixture-of-Experts
- NVIDIA Hopper GPUs
Best for: AI Engineer, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by perplexity.ai via Google News.