Build and deploy data-intensive applications at scale using the combined capability of Python and Spark 2.3

About This Book
* Build ETL pipelines with PySpark and Spark MLlib
* Apply Spark Streaming and Spark SQL with Python
* Perform distributed machine learning and work with Gradient Boosted Trees and Random Forests

Who This Book Is For
Learning PySpark is for big data professionals and data scientists who want to accelerate their data tasks and deliver real-time data analytics. This book is also a good starting point for Python programmers who want to enter the data analytics field and get up and running with Apache Spark and its Python interface.

What You Will Learn
* Get to grips with Apache Spark and the Spark 2.3 architecture
* Build and interact with Spark DataFrames using Spark SQL
* Solve graph and deep learning problems using GraphFrames and TensorFrames respectively
* Read, transform, and understand data, and use it to train machine learning models
* Build machine learning models with MLlib and ML
* Submit your applications using the spark-submit command
* Deploy locally built applications to a cluster
* Run Spark on AWS, Azure, and Google Cloud Platform

In Detail
Apache Spark is an open source analytics engine for big data processing applications, with built-in modules for streaming, SQL, machine learning, and graph processing. This second edition of Learning PySpark teaches you how to use the PySpark API effectively to handle big data processing and live streaming applications. To start with, you'll discover how to use Apache Spark capabilities without learning Scala or Java, and execute simple batch and real-time stream processing tasks. The book focuses on performing machine learning tasks using the PySpark API. You'll explore the latest features of PySpark 2.3, and then examine the challenges faced in building real-time data processing applications.
The book also teaches you how to leverage the benefits of Spark DataFrames and address your day-to-day big data problems. It offers practical coverage of other Python libraries, such as NumPy, Pandas, and Matplotlib, as applied in streaming applications. By the end of this book, you will have a firm understanding of the Spark Python API and how it can be used to build data-intensive applications.