LLM Inference Engineering: Quantization, KV-Cache Optimization, and High-Throughput Serving: A Production Engineer's Guide to INT4/INT8 Quantization, ... Speculative Decoding, and Cost Optimization - Paperback

Book 11 of 20: Production AI Engineering Series

Team, ChatVariety

ISBN 13: 9798180985187

Publisher: Independently published, 2026

0 Used

5 New

From: EUR 13,59

Master the Art of Low-Latency, High-Throughput LLM Serving

In 2026, the defining challenge of production AI is no longer training—it is cost-effective inference. LLM Inference Engineering is the definitive production guide for software engineers, ML developers, and DevOps professionals tasked with deploying large language models at scale without breaking the bank.

This hands-on manual strips away the theoretical academic jargon and delivers practical, production-ready strategies to cut your GPU and cloud serving costs by 50% to 70% while maintaining absolute response quality.

What You Will Master:

Advanced Quantization: Hands-on implementation of INT4/INT8 quantization using the AWQ and GPTQ algorithms and the GGUF model format without destroying model accuracy.
High-Throughput Architectures: Deep dives into PagedAttention, continuous batching, and GPU memory management to maximize hardware utilization.
Serving Frameworks: Configuration recipes and production tuning guidelines for vLLM, TGI (Text Generation Inference), and llama.cpp.
Speed Optimization: Implement speculative decoding to achieve 2x to 4x latency reduction with mathematically guaranteed quality.
Scaling to 70B+ Models: Configure multi-GPU setups using tensor parallelism to distribute memory footprints efficiently.
Rigorous Benchmarking: Establish robust metrics for latency, cost-per-token, and throughput to justify infrastructure decisions.

Written specifically for practicing engineers, this guide assumes familiarity with Python and basic PyTorch. Inside, you will find real-world deployment examples, benchmarking code, and architectural breakdowns that bridge the gap between model training and highly scalable production deployments. Equip yourself with the skills to architect the next generation of AI infrastructure. Stop wasting expensive GPU cycles—optimize your inference pipeline today.

The information in the "Summary" section may refer to different editions of this title.

Publisher: Independently published
Publication date: 2026
Language: English
ISBN 13: 9798180985187
Binding: Softcover
Number of pages: 82
Manufacturer contact: Manufactured by Amazon on behalf of the author
https://www.amazon.it/hz/contact-us

c/o Amazon Media EU S.à r.l., 38 Avenue John F. Kennedy
Luxembourg
L-1855
Luxembourg

Search results for LLM Inference Engineering: Quantization, KV-Cache Optimizati...

LLM Inference Engineering: Quantization, KV-Cache Optimization, and High-Throughput Serving: A Production Engineer's Guide to INT4/INT8 Quantization, ... (Production AI Engineering Series)

Team, ChatVariety

Publisher: Independently published, 2026

ISBN 13: 9798180985187

New Paperback

Print on Demand

From: California Books, Miami, FL, U.S.A.

Seller rating: 4 out of 5 stars

Condition: New. Print on Demand. Item code I-9798180985187

Buy new

EUR 13,59

Free shipping
Ships within U.S.A.

Quantity: More than 20 available

LLM Inference Engineering

Team, Chatvariety

Publisher: Independently published, 2026

ISBN 13: 9798180985187

New PAP

From: PBShop.store US, Wood Dale, IL, U.S.A.

Seller rating: 5 out of 5 stars

PAP. Condition: New. New Book. Shipped from UK. Established seller since 2000. Item code L2-9798180985187

Buy new

EUR 14,12

Free shipping
Ships within U.S.A.

Quantity: More than 20 available

LLM Inference Engineering

Team, Chatvariety

Publisher: Branching Plot Books, 2026

ISBN 13: 9798180985187

New PAP

From: PBShop.store UK, Fairford, GLOS, United Kingdom

Seller rating: 5 out of 5 stars

PAP. Condition: New. New Book. Shipped from UK. Established seller since 2000. Item code L2-9798180985187

Buy new

EUR 13,49

Shipping: EUR 3,85
Ships from United Kingdom to U.S.A.

Quantity: More than 20 available

LLM Inference Engineering (Paperback)

Chatvariety Team

Publisher: Independently Published, 2026

ISBN 13: 9798180985187

New Paperback

Print on Demand

From: CitiRetail, Stevenage, United Kingdom

Seller rating: 5 out of 5 stars

Paperback. Condition: New. This item is printed on demand. Shipping may be from our UK warehouse or from our Australian or US warehouses, depending on stock availability. Item code 9798180985187

Buy new

EUR 16,84

Shipping: EUR 43,25
Ships from United Kingdom to U.S.A.

Quantity: 1 available

LLM Inference Engineering: Quantization, KV-Cache Optimization, and High-Throughput Serving: A Production Engineer's Guide to INT4/INT8 Quantization, vLLM, TGI, Speculative Decoding, and Cost Optimization

Chatvariety Team

Publisher: Independently Published, Jun 2026

ISBN 13: 9798180985187

New Paperback

From: AHA-BUCH GmbH, Einbeck, Germany

Seller rating: 5 out of 5 stars

Paperback. Condition: New. New stock. Item code 9798180985187

Buy new

EUR 13,00

Shipping: EUR 60,63
Ships from Germany to U.S.A.

Quantity: 2 available
