# [E-book] Spark for Data Science PDF Download

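Yesterday I shared "[E-book] Apache Spark 2 for Beginners PDF Download". That book is a good way to start learning Spark, although, despite the "Apache Spark 2" in its title, its content has almost nothing to do with Spark 2. Today's book is another introductory Spark e-book, also published by Packt. It was released in September 2016, runs to 339 pages, and is aimed at data scientists. It covers the Spark programming model, an introduction to DataFrames, unified data access, machine learning, structured data analysis, big data visualization, and more.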

## Chapters

```
Chapter 1: Big Data and Data Science – An Introduction
Chapter 2: The Spark Programming Model
Chapter 3: Introduction to DataFrames
Chapter 4: Unified Data Access
Chapter 5: Data Analysis on Spark
Chapter 6: Machine Learning
Chapter 7: Extending Spark with SparkR
Chapter 8: Analyzing Unstructured Data
Chapter 9: Visualizing Big Data
Chapter 10: Putting It All Together
Chapter 11: Building Data Science Applications
```

## Detailed Table of Contents

```
Preface
Chapter 1: Big Data and Data Science – An Introduction
Big data overview
Challenges with big data analytics
Computational challenges
Analytical challenges
Evolution of big data analytics
Spark for data analytics
The Spark stack
Spark core
Spark SQL
Spark streaming
MLlib
GraphX
SparkR
Summary
References
Chapter 2: The Spark Programming Model
Supported programming languages
Scala
Java
Python
R
Choosing the right language
The Spark engine
Driver program
The Spark shell
SparkContext
Worker nodes
Executors
Shared variables
Flow of execution
The RDD API
RDD basics
Persistence
RDD operations
Creating RDDs
Transformations on normal RDDs
The filter operation
The distinct operation
The intersection operation
The union operation
The map operation
The flatMap operation
The keys operation
The cartesian operation
Transformations on pair RDDs
The groupByKey operation
The join operation
The reduceByKey operation
The aggregate operation
Actions
The collect() function
The count() function
The take(n) function
The first() function
The takeSample() function
The countByKey() function
Summary
References
Chapter 3: Introduction to DataFrames
Why DataFrames?
Spark SQL
The Catalyst optimizer
The DataFrame API
DataFrame basics
RDDs versus DataFrames
Similarities
Differences
Creating DataFrames
Creating DataFrames from RDDs
Creating DataFrames from JSON
Creating DataFrames from databases using JDBC
Creating DataFrames from Apache Parquet
Creating DataFrames from other data sources
DataFrame operations
Under the hood
Summary
References
Chapter 4: Unified Data Access
Data abstractions in Apache Spark
Datasets
Working with Datasets
Creating Datasets from JSON
Datasets API's limitations
Spark SQL
SQL operations
Under the hood
Structured Streaming
The Spark streaming programming model
Under the hood
Comparison with other streaming engines
Continuous applications
Summary
References
Chapter 5: Data Analysis on Spark
Data analytics life cycle
Data acquisition
Data preparation
Data consolidation
Data cleansing
Missing value treatment
Outlier treatment
Duplicate values treatment
Data transformation
Basics of statistics
Sampling
Simple random sample
Systematic sampling
Stratified sampling
Data distributions
Frequency distributions
Probability distributions
Descriptive statistics
Measures of location
Mean
Median
Mode
Range
Variance
Standard deviation
Summary statistics
Graphical techniques
Inferential statistics
Discrete probability distributions
Bernoulli distribution
Binomial distribution
Sample problem
Poisson distribution
Sample problem
Continuous probability distributions
Normal distribution
Standard normal distribution
Chi-square distribution
Sample problem
Student's t-distribution
F-distribution
Standard error
Confidence level
Margin of error and confidence interval
Variability in the population
Estimating sample size
Hypothesis testing
Null and alternate hypotheses
Chi-square test
F-test
Problem:
Correlations
Summary
References
Chapter 6: Machine Learning
Introduction
The evolution
Supervised learning
Unsupervised learning
MLlib and the Pipeline API
MLlib
ML pipeline
Transformer
Estimator
Introduction to machine learning
Parametric methods
Non-parametric methods
Regression methods
Linear regression
Loss function
Optimization
Regularizations on regression
Ridge regression
Lasso regression
Elastic net regression
Classification methods
Logistic regression
Linear Support Vector Machines (SVM)
Linear kernel
Polynomial kernel
Sigmoid kernel
Training an SVM
Decision trees
Impurity measures
Gini Index
Entropy
Variance
Stopping rule
Split candidates
Categorical features
Continuous features
Example
Ensembles
Random forests
Multilayer perceptron classifier
Clustering techniques
K-means clustering
Example
Summary
References
Chapter 7: Extending Spark with SparkR
SparkR basics
Accessing SparkR from the R environment
RDDs and DataFrames
Getting started
Programming with SparkR
Subsetting data
Column functions
Grouped data
SparkR DataFrames
SQL operations
Set operations
Merging DataFrames
Machine learning
The Naive Bayes model
The Gaussian GLM model
Summary
References
Chapter 8: Analyzing Unstructured Data
Sources of unstructured data
Processing unstructured data
Count vectorizer
TF-IDF
Stop-word removal
Normalization/scaling
Word2Vec
n-gram modelling
Text classification
Naive Bayes classifier
Text clustering
K-means
Dimensionality reduction
Singular Value Decomposition
Principal Component Analysis
Summary
References
Chapter 9: Visualizing Big Data
Why visualize data?
A data engineer's perspective
A data scientist's perspective
Data visualization tools
IPython notebook
Apache Zeppelin
Third-party tools
Data visualization techniques
Summarizing and visualizing
Subsetting and visualizing
Sampling and visualizing
Modeling and visualizing
Summary
References
Data source citations
Chapter 10: Putting It All Together
A quick recap
Introducing a case study
Data acquisition and data cleansing
Developing the hypothesis
Data exploration
Data preparation
Too many levels in a categorical variable
Numerical variables with too much variation
Missing data
Continuous data
Categorical data
Preparing the data
Model building
Data visualization
Communicating the results to business users
Summary
References
Chapter 11: Building Data Science Applications
Scope of development
Expectations
Presentation options
Interactive notebooks
References
Web API
References
PMML and PFA
References
Development and testing
References
Data quality management
Spark development status
Spark 2.0's features and enhancements
Unifying Datasets and DataFrames
Structured Streaming
Project Tungsten phase 2
What's in store?
The big data trends
Summary
References
Index
```