Chapter 1: Getting Started with Apache Spark 1
Introduction 1
Installing Spark from binaries 3
Building the Spark source code with Maven 5
Launching Spark on Amazon EC2 7
Deploying on a cluster in standalone mode 12
Deploying on a cluster with Mesos 16
Deploying on a cluster with YARN 18
Using Tachyon as an off-heap storage layer 21
Chapter 2: Developing Applications with Spark 27
Introduction 27
Exploring the Spark shell 27
Developing Spark applications in Eclipse with Maven 29
Developing Spark applications in Eclipse with SBT 33
Developing a Spark application in IntelliJ IDEA with Maven 34
Developing a Spark application in IntelliJ IDEA with SBT 36
Chapter 3: External Data Sources 39
Introduction 39
Loading data from the local filesystem 40
Loading data from HDFS 41
Loading data from HDFS using a custom InputFormat 45
Loading data from Amazon S3 47
Loading data from Apache Cassandra 49
Loading data from relational databases 54
Table of Contents
Chapter 4: Spark SQL 57
Introduction 57
Understanding the Catalyst optimizer 60
Creating HiveContext 63
Inferring schema using case classes 65
Programmatically specifying the schema 66
Loading and saving data using the Parquet format 69
Loading and saving data using the JSON format 72
Loading and saving data from relational databases 74
Loading and saving data from an arbitrary source 76
Chapter 5: Spark Streaming 79
Introduction 79
Word count using Streaming 82
Streaming Twitter data 83
Streaming using Kafka 88
Chapter 6: Getting Started with Machine Learning Using MLlib 95
Introduction 95
Creating vectors 96
Creating a labeled point 98
Creating matrices 99
Calculating summary statistics 101
Calculating correlation 102
Doing hypothesis testing 104
Creating machine learning pipelines using ML 105
Chapter 7: Supervised Learning with MLlib - Regression 109
Introduction 109
Using linear regression 111
Understanding cost function 113
Doing linear regression with lasso 118
Doing ridge regression 120
Chapter 8: Supervised Learning with MLlib - Classification 121
Introduction 121
Doing classification using logistic regression 122
Doing binary classification using SVM 128
Doing classification using decision trees 131
Doing classification using Random Forests 138
Doing classification using Gradient Boosted Trees 143
Doing classification with Naïve Bayes 145
Chapter 9: Unsupervised Learning with MLlib 147
Introduction 147
Clustering using k-means 148
Dimensionality reduction with principal component analysis 155
Dimensionality reduction with singular value decomposition 161
Chapter 10: Recommender Systems 167
Introduction 167
Collaborative filtering using explicit feedback 169
Collaborative filtering using implicit feedback 172
Chapter 11: Graph Processing Using GraphX 177
Introduction 177
Fundamental operations on graphs 178
Using PageRank 179
Finding connected components 181
Performing neighborhood aggregation 184
Chapter 12: Optimizations and Performance Tuning 187
Introduction 187
Optimizing memory 190
Using compression to improve performance 193
Using serialization to improve performance 193
Optimizing garbage collection 194
Optimizing the level of parallelism 195
Understanding the future of optimization - project Tungsten 196
Index 199