Learning Spark SQL

Last updated: 2021-07-02 18:24:31

Latest chapter: Summary

Cover

Title Page

Learning Spark SQL

Credits

About the Author

About the Reviewer

www.PacktPub.com

Why subscribe?

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Getting Started with Spark SQL

What is Spark SQL?

Introducing SparkSession

Understanding Spark SQL concepts

Understanding Resilient Distributed Datasets (RDDs)

Understanding DataFrames and Datasets

Understanding the Catalyst optimizer

Understanding Catalyst optimizations

Understanding Catalyst transformations

Introducing Project Tungsten

Using Spark SQL in streaming applications

Understanding Structured Streaming internals

Summary

Using Spark SQL for Processing Structured and Semistructured Data

Understanding data sources in Spark applications

Selecting Spark data sources

Using Spark with relational databases

Using Spark with MongoDB (NoSQL database)

Using Spark with JSON data

Using Spark with Avro files

Using Spark with Parquet files

Defining and using custom data sources in Spark

Summary

Using Spark SQL for Data Exploration

Introducing Exploratory Data Analysis (EDA)

Using Spark SQL for basic data analysis

Identifying missing data

Computing basic statistics

Identifying data outliers

Visualizing data with Apache Zeppelin

Sampling data with Spark SQL APIs

Sampling with the DataFrame/Dataset API

Sampling with the RDD API

Using Spark SQL for creating pivot tables

Summary

Using Spark SQL for Data Munging

Introducing data munging

Exploring data munging techniques

Pre-processing of the household electric consumption Dataset

Computing basic statistics and aggregations

Augmenting the Dataset

Executing other miscellaneous processing steps

Pre-processing of the weather Dataset

Analyzing missing data

Combining data using a JOIN operation

Munging textual data

Processing multiple input data files

Removing stop words

Munging time series data

Pre-processing of the time-series Dataset

Processing date fields

Persisting and loading data

Defining a date-time index

Using the TimeSeriesRDD object

Handling missing time-series data

Computing basic statistics

Dealing with variable-length records

Converting variable-length records to fixed-length records

Extracting data from "messy" columns

Preparing data for machine learning

Pre-processing data for machine learning

Creating and running a machine learning pipeline

Summary

Using Spark SQL in Streaming Applications

Introducing streaming data applications

Building Spark streaming applications

Implementing sliding window-based functionality

Joining a streaming Dataset with a static Dataset

Using the Dataset API in Structured Streaming

Using output sinks

Using the Foreach Sink for arbitrary computations on output

Using the Memory Sink to save output to a table

Using the File Sink to save output to a partitioned table

Monitoring streaming queries

Using Kafka with Spark Structured Streaming

Introducing Kafka concepts

Introducing ZooKeeper concepts

Introducing Kafka-Spark integration