CuTeDSL at Perplexity

· Source: perplexity.ai via Google News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

Perplexity AI has integrated CuTeDSL kernels into its Runtime-Optimized Serving Engine (ROSE) to maximize performance on NVIDIA Hopper and Blackwell GPUs for models ranging from embeddings to trillion-parameter LLMs. ROSE, Perplexity's in-house inference engine, supports custom Llama models, ranking, classification, and other transformer-based architectures. ROSE was initially built with Triton and later moved to CuTeDSL as its primary GPU programming environment, because CuTeDSL compiles just-in-time to optimized PTX, offers fine-grained hardware control, and compiles significantly faster than CUDA and CUTLASS C++. The shift enables aggressive compile-time specialization over model configuration parameters such as hidden dimensions, data types, and quantization schemes, without slowing development. CuTeDSL also supports specialized kernel implementations for the prefill and decode stages, grid synchronization, and optimized QK Norm and Mixture-of-Experts (MoE) dispatch/combine operations, leading to substantial performance gains.
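To make the MoE dispatch/combine step concrete, here is a minimal plain-Python reference sketch. It is not Perplexity's kernel and does not use CuTeDSL; the names `dispatch`, `combine`, `Token`, and `Route` are hypothetical, and the routing decisions are assumed to be given as (expert, gate weight) pairs per token. It only shows the data movement an optimized kernel has to perform: group tokens by expert, run each expert, then scatter the weighted results back to the original token order.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Token = List[float]
Route = List[Tuple[int, float]]  # (expert_id, gate_weight) pairs for one token


def dispatch(tokens: List[Token], routes: List[Route]) -> Dict[int, List[Tuple[int, float, Token]]]:
    """Group tokens by destination expert, keeping origin index and gate weight."""
    per_expert = defaultdict(list)
    for token_idx, (token, route) in enumerate(zip(tokens, routes)):
        for expert_id, weight in route:
            per_expert[expert_id].append((token_idx, weight, token))
    return dict(per_expert)


def combine(
    per_expert: Dict[int, List[Tuple[int, float, Token]]],
    experts: Dict[int, Callable[[Token], Token]],
    num_tokens: int,
    dim: int,
) -> List[Token]:
    """Run each expert on its batch and accumulate weighted outputs per token."""
    out = [[0.0] * dim for _ in range(num_tokens)]
    for expert_id, batch in per_expert.items():
        for token_idx, weight, token in batch:
            result = experts[expert_id](token)
            for d in range(dim):
                out[token_idx][d] += weight * result[d]
    return out
```

In a real serving engine the two phases are typically separated by communication between GPUs, which is why they are worth optimizing as dedicated kernels; this sketch keeps everything in one process for clarity.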

Key takeaway

For AI engineers optimizing GPU inference, adopting CuTeDSL can improve both kernel performance and development velocity. Teams currently writing kernels in CUDA or Triton should consider migrating to CuTeDSL: its just-in-time compilation and fine-grained hardware control allow more aggressive compile-time specialization across model architectures and shorten iteration times. The expected result is higher throughput and lower latency across workloads ranging from embedding models to large language models.

Key insights

CuTeDSL provides fine-grained GPU control and JIT compilation, enabling rapid optimization and specialization for high-performance inference.

Principles

Method

CuTeDSL compiles Python-based DSL kernels just-in-time to optimized PTX, leveraging CuTe layout algebra and MLIR for aggressive specialization across diverse model configurations and hardware primitives.
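The idea of JIT compile-time specialization can be sketched in plain Python. This is an illustration only and does not use the CuTeDSL API: `KernelConfig`, `compile_rmsnorm`, and `rmsnorm` are hypothetical names, and the "compile" step simply builds a Python closure where a real JIT would emit PTX. What it shows is the pattern: configuration values are baked into a specialized kernel once per distinct configuration, and later calls reuse the cached result.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import Callable, List


@dataclass(frozen=True)
class KernelConfig:
    """Hypothetical compile-time parameters; field names are illustrative only."""
    hidden_dim: int
    dtype: str          # e.g. "bf16" or "fp8"; unused by the math here, part of the cache key
    eps: float = 1e-6


@lru_cache(maxsize=None)
def compile_rmsnorm(config: KernelConfig) -> Callable[[List[float]], List[float]]:
    """Stand-in for a JIT compile step: build a function specialized to config.

    hidden_dim and eps are captured as constants, analogous to a JIT baking
    them into generated code instead of reading them at run time.
    """
    hidden_dim = config.hidden_dim
    eps = config.eps

    def kernel(row: List[float]) -> List[float]:
        if len(row) != hidden_dim:
            raise ValueError(f"expected {hidden_dim} elements, got {len(row)}")
        mean_sq = sum(x * x for x in row) / hidden_dim
        inv_rms = (mean_sq + eps) ** -0.5
        return [x * inv_rms for x in row]

    return kernel


def rmsnorm(row: List[float], dtype: str = "bf16") -> List[float]:
    """Dispatch to the specialization matching this input's shape and dtype."""
    config = KernelConfig(hidden_dim=len(row), dtype=dtype)
    return compile_rmsnorm(config)(row)
```

Calling `rmsnorm` repeatedly with rows of the same length and dtype triggers one "compile"; a new hidden dimension or dtype triggers another. The cost of each compile is what makes fast JIT compilation matter when many configurations are specialized this way.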

Best for: AI Engineer, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by perplexity.ai via Google News.