更新时间:2021-07-02 18:24:31
cover
Title Page
Copyright
Learning Spark SQL
Credits
About the Author
About the Reviewer
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Getting Started with Spark SQL
What is Spark SQL?
Introducing SparkSession
Understanding Spark SQL concepts
Understanding Resilient Distributed Datasets (RDDs)
Understanding DataFrames and Datasets
Understanding the Catalyst optimizer
Understanding Catalyst optimizations
Understanding Catalyst transformations
Introducing Project Tungsten
Using Spark SQL in streaming applications
Understanding Structured Streaming internals
Summary
Using Spark SQL for Processing Structured and Semistructured Data
Understanding data sources in Spark applications
Selecting Spark data sources
Using Spark with relational databases
Using Spark with MongoDB (NoSQL database)
Using Spark with JSON data
Using Spark with Avro files
Using Spark with Parquet files
Defining and using custom data sources in Spark
Using Spark SQL for Data Exploration
Introducing Exploratory Data Analysis (EDA)
Using Spark SQL for basic data analysis
Identifying missing data
Computing basic statistics
Identifying data outliers
Visualizing data with Apache Zeppelin
Sampling data with Spark SQL APIs
Sampling with the DataFrame/Dataset API
Sampling with the RDD API
Using Spark SQL for creating pivot tables
Using Spark SQL for Data Munging
Introducing data munging
Exploring data munging techniques
Pre-processing of the household electric consumption Dataset
Computing basic statistics and aggregations
Augmenting the Dataset
Executing other miscellaneous processing steps
Pre-processing of the weather Dataset
Analyzing missing data
Combining data using a JOIN operation
Munging textual data
Processing multiple input data files
Removing stop words
Munging time series data
Pre-processing of the time-series Dataset
Processing date fields
Persisting and loading data
Defining a date-time index
Using the TimeSeriesRDD object
Handling missing time-series data
Dealing with variable length records
Converting variable-length records to fixed-length records
Extracting data from "messy" columns
Preparing data for machine learning
Pre-processing data for machine learning
Creating and running a machine learning pipeline
Using Spark SQL in Streaming Applications
Introducing streaming data applications
Building Spark streaming applications
Implementing sliding window-based functionality
Joining a streaming Dataset with a static Dataset
Using the Dataset API in Structured Streaming
Using output sinks
Using the Foreach Sink for arbitrary computations on output
Using the Memory Sink to save output to a table
Using the File Sink to save output to a partitioned table
Monitoring streaming queries
Using Kafka with Spark Structured Streaming
Introducing Kafka concepts
Introducing ZooKeeper concepts
Introducing Kafka-Spark integration