Data Analysis With Python and PySpark - Softcover

Rioux, Jonathan

 
9781617297205: Data Analysis With Python and PySpark

Synopsis

<b>Think big about your data! PySpark brings the powerful Spark big data processing engine to the Python ecosystem, letting you seamlessly scale up your data tasks and create lightning-fast pipelines.</b><br><br>In <i>Data Analysis with Python and PySpark</i> you will learn how to:<br> <br> &#160;&#160;&#160; Manage your data as it scales across multiple machines<br> &#160;&#160;&#160; Scale up your data programs with full confidence<br> &#160;&#160;&#160; Read and write data to and from a variety of sources and formats<br> &#160;&#160;&#160; Deal with messy data with PySpark&#8217;s data manipulation functionality<br> &#160;&#160;&#160; Discover new data sets and perform exploratory data analysis<br> &#160;&#160;&#160; Build automated data pipelines that transform, summarize, and get insights from data<br> &#160;&#160;&#160; Troubleshoot common PySpark errors<br> &#160;&#160;&#160; Create reliable long-running jobs<br> <br> <i>Data Analysis with Python and PySpark</i> is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you&#8217;ve learned and rapidly start integrating PySpark into your data systems. No previous knowledge of Spark is required.<br> <br> Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.<br> <br> About the technology<br> The Spark data processing engine is an amazing analytics factory: raw data comes in, insight comes out. PySpark wraps Spark&#8217;s core engine with a Python-based API. 
It helps simplify Spark&#8217;s steep learning curve and makes this powerful tool available to anyone working in the Python data ecosystem.<br> <br> About the book<br> <i>Data Analysis with Python and PySpark</i> helps you solve the daily challenges of data science with PySpark. You&#8217;ll learn how to scale your processing capabilities across multiple machines while ingesting data from any source, whether that&#8217;s Hadoop clusters, cloud data storage, or local data files. Once you&#8217;ve covered the fundamentals, you&#8217;ll explore the full versatility of PySpark by building machine learning pipelines and blending Python, pandas, and PySpark code.<br> <br> What&#39;s inside<br> <br> &#160;&#160;&#160; Organizing your PySpark code<br> &#160;&#160;&#160; Managing your data, no matter the size<br> &#160;&#160;&#160; Scaling up your data programs with full confidence<br> &#160;&#160;&#160; Troubleshooting common data pipeline problems<br> &#160;&#160;&#160; Creating reliable long-running jobs<br> <br> About the reader<br> Written for data scientists and data engineers comfortable with Python.<br> <br> About the author<br> As an ML director for a data-driven software company, <b>Jonathan Rioux</b> uses PySpark daily. 
He teaches the software to data scientists, engineers, and data-savvy business analysts.<br> <br> Table of Contents<br> <br> 1 Introduction<br> PART 1 GET ACQUAINTED: FIRST STEPS IN PYSPARK<br> 2 Your first data program in PySpark<br> 3 Submitting and scaling your first PySpark program<br> 4 Analyzing tabular data with pyspark.sql<br> 5 Data frame gymnastics: Joining and grouping<br> PART 2 GET PROFICIENT: TRANSLATE YOUR IDEAS INTO CODE<br> 6 Multidimensional data frames: Using PySpark with JSON data<br> 7 Bilingual PySpark: Blending Python and SQL code<br> 8 Extending PySpark with Python: RDD and UDFs<br> 9 Big data is just a lot of small data: Using pandas UDFs<br> 10 Your data under a different lens: Window functions<br> 11 Faster PySpark: Understanding Spark&#8217;s query planning<br> PART 3 GET CONFIDENT: USING MACHINE LEARNING WITH PYSPARK<br> 12 Setting the stage: Preparing features for machine learning<br> 13 Robust machine learning with ML Pipelines<br> 14 Building custom ML transformers and estimators

The information in the "Synopsis" section may refer to different editions of this title.

About the author

As a data scientist for an engineering consultancy, <b>Jonathan Rioux</b> uses PySpark daily. He teaches the software to data scientists, engineers, and data-savvy business analysts.

The information in the "About this book" section may refer to different editions of this title.