
[Ebook] Big Data Analytics PDF download

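This book was written by Venkat Ankam and published by Packt Publishing in September 2016; it runs to 326 pages. It is based on Spark 2.0 and Hadoop 2.7 and serves as a reference manual for data analysts and data scientists, though it also suits readers who are just getting started.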


Chapters

Chapter 1: Big Data Analytics at a 10,000-Foot View
Chapter 2: Getting Started with Apache Hadoop and Apache Spark
Chapter 3: Deep Dive into Apache Spark
Chapter 4: Big Data Analytics with Spark SQL, DataFrames, and Datasets
Chapter 5: Real-Time Analytics with Spark Streaming and Structured Streaming
Chapter 6: Notebooks and Dataflows with Spark and Hadoop
Chapter 7: Machine Learning with Spark and Hadoop
Chapter 8: Building Recommendation Systems with Spark and Mahout
Chapter 9: Graph Analytics with GraphX
Chapter 10: Interactive Analytics with SparkR

Detailed table of contents

Preface
Chapter 1: Big Data Analytics at a 10,000-Foot View
  Big Data analytics and the role of Hadoop and Spark
    A typical Big Data analytics project life cycle
      Identifying the problem and outcomes
      Identifying the necessary data
      Data collection
      Preprocessing data and ETL
      Performing analytics
      Visualizing data
    The role of Hadoop and Spark
  Big Data science and the role of Hadoop and Spark
    A fundamental shift from data analytics to data science
      Data scientists versus software engineers
      Data scientists versus data analysts
      Data scientists versus business analysts
    A typical data science project life cycle
      Hypothesis and modeling
      Measuring the effectiveness
      Making improvements
      Communicating the results
    The role of Hadoop and Spark
  Tools and techniques
  Real-life use cases
  Summary
Chapter 2: Getting Started with Apache Hadoop and Apache Spark
  Introducing Apache Hadoop
    Hadoop Distributed File System
    Features of HDFS
    MapReduce
    MapReduce features
    MapReduce v1 versus MapReduce v2
      MapReduce v1 challenges
    YARN
    Storage options on Hadoop
      File formats
      Compression formats
  Introducing Apache Spark
    Spark history
    What is Apache Spark?
    What Apache Spark is not
    MapReduce issues
    Spark's stack
  Why Hadoop plus Spark?
    Hadoop features
    Spark features
      Frequently asked questions about Spark
  Installing Hadoop plus Spark clusters
  Summary
Chapter 3: Deep Dive into Apache Spark
  Starting Spark daemons
    Working with CDH
    Working with HDP, MapR, and Spark pre-built packages
  Learning Spark core concepts
    Ways to work with Spark
      Spark Shell
      Spark applications
    Resilient Distributed Dataset
      Method 1 – parallelizing a collection
      Method 2 – reading from a file
    Spark context
    Transformations and actions
    Parallelism in RDDs
    Lazy evaluation
    Lineage Graph
    Serialization
    Leveraging Hadoop file formats in Spark
    Data locality
    Shared variables
    Pair RDDs
  Lifecycle of Spark program
    Pipelining
    Spark execution summary
  Spark applications
    Spark Shell versus Spark applications
    Creating a Spark context
    SparkConf
    SparkSubmit
    Spark Conf precedence order
    Important application configurations
  Persistence and caching
    Storage levels
    What level to choose?
  Spark resource managers – Standalone, YARN, and Mesos
    Local versus cluster mode
    Cluster resource managers
      Standalone
      YARN
      Mesos
      Which resource manager to use?
  Summary
Chapter 4: Big Data Analytics with Spark SQL, DataFrames, and Datasets
  History of Spark SQL
  Architecture of Spark SQL
  Introducing SQL, Datasources, DataFrame, and Dataset APIs
  Evolution of DataFrames and Datasets
    What's wrong with RDDs?
    RDD Transformations versus Dataset and DataFrames Transformations
  Why Datasets and DataFrames?
    Optimization
    Speed
    Automatic Schema Discovery
    Multiple sources, multiple languages
    Interoperability between RDDs and others
    Select and read necessary data only
  When to use RDDs, Datasets, and DataFrames?
  Analytics with DataFrames
    Creating SparkSession
    Creating DataFrames
      Creating DataFrames from structured data files
      Creating DataFrames from RDDs
      Creating DataFrames from tables in Hive
      Creating DataFrames from external databases
    Converting DataFrames to RDDs
    Common Dataset/DataFrame operations
      Input and Output Operations
      Basic Dataset/DataFrame functions
      DSL functions
      Built-in functions, aggregate functions, and window functions
      Actions
      RDD operations
    Caching data
    Performance optimizations
  Analytics with the Dataset API
    Creating Datasets
    Converting a DataFrame to a Dataset
      Converting a Dataset to a DataFrame
    Accessing metadata using Catalog
  Data Sources API
    Read and write functions
    Built-in sources
      Working with text files
      Working with JSON
      Working with Parquet
      Working with ORC
      Working with JDBC
      Working with CSV
    External sources
      Working with AVRO
      Working with XML
      Working with Pandas
      DataFrame based Spark-on-HBase connector
  Spark SQL as a distributed SQL engine
    Spark SQL's Thrift server for JDBC/ODBC access
    Querying data using beeline client
    Querying data from Hive using spark-sql CLI
    Integration with BI tools
  Hive on Spark
  Summary
Chapter 5: Real-Time Analytics with Spark Streaming and Structured Streaming
  Introducing real-time processing
    Pros and cons of Spark Streaming
    History of Spark Streaming
  Architecture of Spark Streaming
    Spark Streaming application flow
    Stateless and stateful stream processing
  Spark Streaming transformations and actions
    Union
    Join
    Transform operation
    updateStateByKey
    mapWithState
    Window operations
    Output operations
  Input sources and output stores
    Basic sources
    Advanced sources
    Custom sources
    Receiver reliability
    Output stores
  Spark Streaming with Kafka and HBase
    Receiver-based approach
      Role of Zookeeper
    Direct approach (no receivers)
    Integration with HBase
  Advanced concepts of Spark Streaming
    Using DataFrames
    MLlib operations
    Caching/persistence
    Fault-tolerance in Spark Streaming
      Failure of executor
      Failure of driver
    Performance tuning of Spark Streaming applications
  Monitoring applications
  Introducing Structured Streaming
    Structured Streaming application flow
      When to use Structured Streaming?
    Streaming Datasets and Streaming DataFrames
      Input sources and output sinks
    Operations on Streaming Datasets and Streaming DataFrames
  Summary
Chapter 6: Notebooks and Dataflows with Spark and Hadoop
  Introducing web-based notebooks
  Introducing Jupyter
    Installing Jupyter
    Analytics with Jupyter
  Introducing Apache Zeppelin
    Jupyter versus Zeppelin
    Installing Apache Zeppelin
      Ambari service
      The manual method
    Analytics with Zeppelin
  The Livy REST job server and Hue Notebooks
    Installing and configuring the Livy server and Hue
    Using the Livy server
      An interactive session
      A batch session
      Sharing SparkContexts and RDDs
    Using Livy with Hue Notebook
    Using Livy with Zeppelin
  Introducing Apache NiFi for dataflows
    Installing Apache NiFi
    Dataflows and analytics with NiFi
  Summary
Chapter 7: Machine Learning with Spark and Hadoop
  Introducing machine learning
  Machine learning on Spark and Hadoop
  Machine learning algorithms
    Supervised learning
    Unsupervised learning
    Recommender systems
    Feature extraction and transformation
    Optimization
    Spark MLlib data types
  An example of machine learning algorithms
    Logistic regression for spam detection
  Building machine learning pipelines
    An example of a pipeline workflow
    Building an ML pipeline
    Saving and loading models
  Machine learning with H2O and Spark
    Why Sparkling Water?
    An application flow on YARN
    Getting started with Sparkling Water
  Introducing Hivemall
  Introducing Hivemall for Spark
  Summary
Chapter 8: Building Recommendation Systems with Spark and Mahout
  Building recommendation systems
    Content-based filtering
    Collaborative filtering
      User-based collaborative filtering
      Item-based collaborative filtering
  Limitations of a recommendation system
  A recommendation system with MLlib
    Preparing the environment
    Creating RDDs
    Exploring the data with DataFrames
    Creating training and testing datasets
    Creating a model
    Making predictions
    Evaluating the model with testing data
    Checking the accuracy of the model
    Explicit versus implicit feedback
  The Mahout and Spark integration
    Installing Mahout
    Exploring the Mahout shell
    Building a universal recommendation system with Mahout and search tool
  Summary
Chapter 9: Graph Analytics with GraphX
  Introducing graph processing
    What is a graph?
    Graph databases versus graph processing systems
    Introducing GraphX
    Graph algorithms
  Getting started with GraphX
    Basic operations of GraphX
      Creating a graph
      Counting
      Filtering
      inDegrees, outDegrees, and degrees
      Triplets
    Transforming graphs
      Transforming attributes
      Modifying graphs
      Joining graphs
      VertexRDD and EdgeRDD operations
    GraphX algorithms
      Triangle counting
      Connected components
  Analyzing flight data using GraphX
    Pregel API
  Introducing GraphFrames
    Motif finding
    Loading and saving GraphFrames
  Summary
Chapter 10: Interactive Analytics with SparkR
  Introducing R and SparkR
    What is R?
    Introducing SparkR
    Architecture of SparkR
  Getting started with SparkR
    Installing and configuring R
    Using SparkR shell
      Local mode
      Standalone mode
      Yarn mode
      Creating a local DataFrame
      Creating a DataFrame from a DataSources API
      Creating a DataFrame from Hive
    Using SparkR scripts
  Using DataFrames with SparkR
  Using SparkR with RStudio
  Machine learning with SparkR
    Using the Naive Bayes model
    Using the k-means model
  Using SparkR with Zeppelin
  Summary
Index

Download link

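To get the download link for this book, follow the WeChat official account iteblog_hadoop and reply with the keyword 大数据分析 ("Big Data Analytics").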



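Unless otherwise stated, all articles on this blog are original.
When reposting this article, please include: Reposted from 过往记忆 (https://www.iteblog.com/)
Link to this article: [Ebook] Big Data Analytics PDF download (https://www.iteblog.com/archives/1880.html)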