
[E-book] Spark for Data Science PDF Download

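  Yesterday I shared "[E-book] Apache Spark 2 for Beginners PDF Download". That book is well suited to getting started with Spark, although despite the "Apache Spark 2" in its title, its content has almost nothing to do with Spark 2. Today's book is another introductory Spark e-book, also published by Packt. It was released in September 2016, runs to 339 pages, and is aimed at data scientists. It covers the Spark programming model, an introduction to DataFrames, unified data access, machine learning, structured data analysis, and big data visualization.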


To keep up with new articles on Spark, Hadoop, or HBase, follow the WeChat official account iteblog_hadoop.

Chapters

Chapter 1: Big Data and Data Science – An Introduction
Chapter 2: The Spark Programming Model
Chapter 3: Introduction to DataFrames
Chapter 4: Unified Data Access
Chapter 5: Data Analysis on Spark
Chapter 6: Machine Learning
Chapter 7: Extending Spark with SparkR
Chapter 8: Analyzing Unstructured Data
Chapter 9: Visualizing Big Data
Chapter 10: Putting It All Together
Chapter 11: Building Data Science Applications

Detailed table of contents

Preface
Chapter 1: Big Data and Data Science – An Introduction
  Big data overview
  Challenges with big data analytics
    Computational challenges
    Analytical challenges
  Evolution of big data analytics
  Spark for data analytics
  The Spark stack
    Spark core
    Spark SQL
    Spark streaming
    MLlib
    GraphX
    SparkR
  Summary
  References
Chapter 2: The Spark Programming Model
  The programming paradigm
    Supported programming languages
      Scala
      Java
      Python
      R
    Choosing the right language
  The Spark engine
    Driver program
    The Spark shell
    SparkContext
    Worker nodes
    Executors
    Shared variables
    Flow of execution
  The RDD API
    RDD basics
    Persistence
  RDD operations
    Creating RDDs
    Transformations on normal RDDs
      The filter operation
      The distinct operation
      The intersection operation
      The union operation
      The map operation
      The flatMap operation
      The keys operation
      The cartesian operation
    Transformations on pair RDDs
      The groupByKey operation
      The join operation
      The reduceByKey operation
      The aggregate operation
    Actions
      The collect() function
      The count() function
      The take(n) function
      The first() function
      The takeSample() function
      The countByKey() function
  Summary
  References
Chapter 3: Introduction to DataFrames
  Why DataFrames?
  Spark SQL
    The Catalyst optimizer
  The DataFrame API
    DataFrame basics
    RDDs versus DataFrames
      Similarities
      Differences
  Creating DataFrames
    Creating DataFrames from RDDs
    Creating DataFrames from JSON
    Creating DataFrames from databases using JDBC
    Creating DataFrames from Apache Parquet
    Creating DataFrames from other data sources
  DataFrame operations
    Under the hood
  Summary
  References
Chapter 4: Unified Data Access
  Data abstractions in Apache Spark
  Datasets
    Working with Datasets
      Creating Datasets from JSON
    Datasets API's limitations
  Spark SQL
    SQL operations
    Under the hood
  Structured Streaming
    The Spark streaming programming model
    Under the hood
    Comparison with other streaming engines
  Continuous applications
  Summary
  References
Chapter 5: Data Analysis on Spark
  Data analytics life cycle
  Data acquisition
  Data preparation
    Data consolidation
    Data cleansing
      Missing value treatment
      Outlier treatment
      Duplicate values treatment
    Data transformation
  Basics of statistics
    Sampling
      Simple random sample
      Systematic sampling
      Stratified sampling
    Data distributions
      Frequency distributions
      Probability distributions
  Descriptive statistics
    Measures of location
      Mean
      Median
      Mode
    Measures of spread
      Range
      Variance
      Standard deviation
    Summary statistics
    Graphical techniques
  Inferential statistics
    Discrete probability distributions
      Bernoulli distribution
      Binomial distribution
        Sample problem
      Poisson distribution
        Sample problem
    Continuous probability distributions
      Normal distribution
      Standard normal distribution
      Chi-square distribution
        Sample problem
      Student's t-distribution
      F-distribution
    Standard error
    Confidence level
    Margin of error and confidence interval
    Variability in the population
    Estimating sample size
    Hypothesis testing
      Null and alternate hypotheses
      Chi-square test
      F-test
        Problem:
      Correlations
  Summary
  References
Chapter 6: Machine Learning
  Introduction
    The evolution
    Supervised learning
    Unsupervised learning
  MLlib and the Pipeline API
    MLlib
    ML pipeline
      Transformer
      Estimator
  Introduction to machine learning
    Parametric methods
    Non-parametric methods
  Regression methods
    Linear regression
      Loss function
      Optimization
    Regularizations on regression
      Ridge regression
      Lasso regression
      Elastic net regression
  Classification methods
    Logistic regression
  Linear Support Vector Machines (SVM)
    Linear kernel
    Polynomial kernel
    Radial Basis Function kernel
    Sigmoid kernel
    Training an SVM
  Decision trees
    Impurity measures
      Gini Index
      Entropy
      Variance
    Stopping rule
    Split candidates
      Categorical features
      Continuous features
    Advantages of decision trees
    Disadvantages of decision trees
    Example
  Ensembles
    Random forests
      Advantages of random forests
    Gradient-Boosted Trees
  Multilayer perceptron classifier
  Clustering techniques
    K-means clustering
      Disadvantages of k-means
      Example
  Summary
  References
Chapter 7: Extending Spark with SparkR
  SparkR basics
    Accessing SparkR from the R environment
    RDDs and DataFrames
    Getting started
  Advantages and limitations
  Programming with SparkR
    Function name masking
    Subsetting data
    Column functions
    Grouped data
  SparkR DataFrames
    SQL operations
    Set operations
    Merging DataFrames
  Machine learning
    The Naive Bayes model
    The Gaussian GLM model
  Summary
  References
Chapter 8: Analyzing Unstructured Data
  Sources of unstructured data
  Processing unstructured data
    Count vectorizer
    TF-IDF
    Stop-word removal
    Normalization/scaling
    Word2Vec
    n-gram modelling
  Text classification
    Naive Bayes classifier
  Text clustering
    K-means
  Dimensionality reduction
    Singular Value Decomposition
    Principal Component Analysis
  Summary
  References
Chapter 9: Visualizing Big Data
  Why visualize data?
    A data engineer's perspective
    A data scientist's perspective
    A business user's perspective
  Data visualization tools
    IPython notebook
    Apache Zeppelin
    Third-party tools
  Data visualization techniques
    Summarizing and visualizing
    Subsetting and visualizing
    Sampling and visualizing
    Modeling and visualizing
  Summary
  References
    Data source citations
Chapter 10: Putting It All Together
  A quick recap
  Introducing a case study
  The business problem
  Data acquisition and data cleansing
  Developing the hypothesis
  Data exploration
  Data preparation
    Too many levels in a categorical variable
    Numerical variables with too much variation
      Missing data
      Continuous data
      Categorical data
      Preparing the data
  Model building
  Data visualization
  Communicating the results to business users
  Summary
  References
Chapter 11: Building Data Science Applications
  Scope of development
    Expectations
    Presentation options
      Interactive notebooks
        References
      Web API
        References
      PMML and PFA
        References
    Development and testing
      References
    Data quality management
  The Scala advantage
  Spark development status
    Spark 2.0's features and enhancements
      Unifying Datasets and DataFrames
      Structured Streaming
      Project Tungsten phase 2
    What's in store?
  The big data trends
  Summary
  References
Index

Download link

Follow the WeChat official account iteblog_hadoop and reply with Spark2_data to get the download link for this book, or go to the download page.

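Unless otherwise stated, all articles on this blog are original.
When reposting this article, please add: reposted from 过往记忆 (https://www.iteblog.com/)
Link to this article: [E-book] Spark for Data Science PDF Download (https://www.iteblog.com/archives/1854.html)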