| Management number | 231975251 |
|---|---|
| Release Date | 2026/06/18 |
| List Price | US$11.58 |
| Model Number | 231975251 |
| Category | |

A hands-on guide to making AI systems fast, from GPU kernels to production LLM inference.

Most AI systems run well below the speed their hardware allows: GPUs idle waiting on data, LLMs serve a fraction of their throughput, and adding hardware sometimes makes things slower. AI Performance Engineering: From GPU Kernels to LLM Inference is a practitioner's guide to diagnosing, profiling, and fixing those bottlenecks systematically, with real tools and runnable code, from hardware first principles to production LLM serving.

What You Will Learn

- GPU architecture and the roofline model: classify any kernel as compute- or memory-bound, from first principles.
- Professional profiling: Nsight Systems and Compute, torch.profiler, Linux perf, eBPF, and CPU flame graphs.
- PyTorch optimization: mixed precision, quantization, torch.compile, CUDA Graphs, and DataLoader tuning.
- LLM inference: prefill vs decode, the KV cache and grouped-query attention, PagedAttention, continuous batching, and speculative decoding.
- Distributed inference and training: tensor and pipeline parallelism, NCCL cost, FSDP, Mixture-of-Experts, and disaggregated serving.
- Honest benchmarking: avoid the five common mistakes and build throughput-latency curves that survive review.
- 2024-2026 hardware: NVIDIA Blackwell, AMD MI300X/ROCm, Intel Gaudi 3, AWS Trainium, Apple Silicon, and CXL memory.
- Production operation: vLLM serving, observability with DCGM/Prometheus/Grafana, multi-GPU scaling, and cost per token.

Hands-On From Start to Finish

Every chapter pairs concepts with runnable Python, with no toy examples. Seven end-to-end capstone projects mirror real production work, and the companion repository ships 82 runnable exercises, most with CPU fallbacks.

Interview Preparation

Appendix C provides 50 interview questions with model answers across GPU architecture, profiling, LLM inference, distributed systems, and benchmarking, organized by domain for targeted study.

Inside the Book

Nine parts, 31 chapters, seven capstone projects, six appendices, and a glossary: roughly 330 pages, from CPU caches and NUMA through the CUDA execution model and LLM inference internals to production fleet economics.

Who This Book Is For

ML engineers, AI infrastructure and platform engineers, and senior software and systems engineers who profile and optimize AI workloads in Python and PyTorch. GPU experience helps but is not required; a CUDA-capable GPU is needed for the GPU-programming chapters, and the rest run on CPU.

| ASIN | B0H2ZC9JGM |
|---|---|
| ISBN13 | 979-8198692480 |
| Language | English |
| Publisher | Independently published |
| Dimensions | 7 x 0.76 x 10 inches |
| Item Weight | 1.61 pounds |
| Print length | 336 pages |
| Publication date | May 26, 2026 |