21 Big Data
This week we’ll demystify the term “big data” and give you some hands-on experience working with it. We’ll start by reviewing Hadoop and its ecosystem, and within that context we’ll cover MapReduce and how it improved the process of handling big data. We’ll then move on to PySpark, the Python API for Apache Spark, a leading technology for big data processing. Finally, we’ll explore advanced features of PySpark and learn how to optimize query execution times on large datasets by storing data in Parquet format and by partitioning and caching data.
21.1 Introduction to Big Data
Overview
For today’s lesson, you’ll learn how to identify parts of the Apache Hadoop ecosystem. Then, you’ll learn how to write Python scripts that implement the Apache MapReduce programming model. Next, you’ll learn the differences between the Apache Hadoop and Apache Spark environments. Finally, you’ll learn how to create and filter DataFrames using PySpark.
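To make the MapReduce model concrete before the exercises, here is a minimal word-count sketch in plain Python. The map, shuffle, and reduce steps are simulated in a single process; in Hadoop each step would run across many nodes, and the sample lines are made up for illustration.

```python
from collections import defaultdict

# Map step: emit a (word, 1) pair for every word in a line.
def mapper(line):
    for word in line.strip().lower().split():
        yield (word, 1)

# Reduce step: sum the counts that were grouped under each word.
def reducer(word, counts):
    return (word, sum(counts))

lines = [
    "big data is more than a buzzword",
    "hadoop and spark both process big data",
]

# Shuffle step: group all mapper output by key (the word).
grouped = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        grouped[word].append(count)

# Apply the reducer to every group and print the totals.
word_counts = [reducer(word, counts) for word, counts in grouped.items()]
print(sorted(word_counts, key=lambda pair: pair[1], reverse=True))
```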
What You’ll Learn
By the end of this lesson, you will be able to:
Identify the parts of the Hadoop ecosystem.
Write a Python script that implements the MapReduce programming model.
Identify the differences between the Hadoop and Spark environments.
Create a DataFrame by using PySpark.
Filter and order a DataFrame by using Spark (a short sketch follows this list).
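As a preview of the DataFrame work in this lesson, the sketch below creates a small DataFrame from an in-memory list, then filters and orders it. It assumes a local PySpark installation; the column names and values are invented for illustration.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("intro-dataframes").getOrCreate()

# Create a DataFrame from an in-memory list of rows.
data = [("Ava", 34, "NY"), ("Ben", 41, "CA"), ("Cho", 28, "CA")]
df = spark.createDataFrame(data, schema=["name", "age", "state"])

# Filter rows, then order the result by a column.
ca_by_age = df.filter(df["state"] == "CA").orderBy("age", ascending=False)
ca_by_age.show()

spark.stop()
```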
21.2 Querying Big Data with PySpark
Overview
For today’s lesson, you’ll learn how to group, aggregate, and order data, and how to parse dates, format them as timestamps, and plot time series data. Then, you’ll learn how to create temporary views to run Spark SQL queries faster against large datasets. Finally, you’ll combine PySpark and Spark SQL to run SQL queries.
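For a first look at the grouping and date-handling functions covered today, here is a minimal sketch that aggregates sales by region and parses a date string into a timestamp. The data and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, count, to_timestamp, date_format

spark = SparkSession.builder.appName("group-and-dates").getOrCreate()

data = [("2019-03-01 10:15:00", "East", 120.0),
        ("2019-03-01 11:30:00", "West", 80.0),
        ("2019-03-02 09:45:00", "East", 95.0)]
df = spark.createDataFrame(data, schema=["sale_date", "region", "amount"])

# Group by region and compute aggregate statistics.
df.groupBy("region").agg(count("*").alias("num_sales"),
                         avg("amount").alias("avg_amount")).show()

# Parse the string column into a timestamp, then reformat it.
df = df.withColumn("sale_ts", to_timestamp("sale_date", "yyyy-MM-dd HH:mm:ss"))
df.select(date_format("sale_ts", "yyyy-MM").alias("sale_month"), "amount").show()

spark.stop()
```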
What You’ll Learn
By the end of this lesson, you will be able to:
Apply grouping and aggregation functions to a dataset by using Spark.
Parse and format date timestamps by using Spark.
Use temporary views to prepare data for Spark SQL queries.
Combine PySpark and SQL to run queries (see the sketch after this list).
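Combining PySpark and SQL might look like the following sketch, where a DataFrame is registered as a temporary view and then queried with Spark SQL. Table and column names are invented for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("temp-views").getOrCreate()

trips = spark.createDataFrame(
    [("2019-07-01", "Brooklyn", 14.5), ("2019-07-01", "Queens", 9.0),
     ("2019-07-02", "Brooklyn", 22.0)],
    schema=["trip_date", "borough", "fare"])

# Register the DataFrame as a temporary view so SQL can reference it by name.
trips.createOrReplaceTempView("trips")

# Run a Spark SQL query against the view and get a DataFrame back.
daily_fares = spark.sql("""
    SELECT trip_date, borough, ROUND(AVG(fare), 2) AS avg_fare
    FROM trips
    GROUP BY trip_date, borough
    ORDER BY trip_date
""")
daily_fares.show()

spark.stop()
```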
21.3 Optimizing Spark: Storage, Partitioning, and Caching
Overview
For today’s lesson, you’ll learn how to store data in Parquet format and partition the Parquet data to optimize query execution times. Then you’ll practice caching data and compare query execution times between partitioned and cached data.
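As a preview, the sketch below writes a DataFrame to Parquet partitioned by one column, reads it back, and caches it before repeated queries. The output path, data, and column names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-and-caching").getOrCreate()

df = spark.createDataFrame(
    [("2019-01-05", "East", 120.0), ("2019-02-11", "West", 80.0),
     ("2019-02-20", "East", 95.0)],
    schema=["sale_date", "region", "amount"])

# Write to Parquet, partitioned on disk by region (one folder per value).
df.write.mode("overwrite").partitionBy("region").parquet("sales_parquet")

# Read the partitioned data back; filters on region can now skip whole folders.
sales = spark.read.parquet("sales_parquet")

# Cache the DataFrame so repeated queries reuse the in-memory copy.
sales.cache()
sales.filter(sales["region"] == "East").count()   # first action materializes the cache
sales.filter(sales["region"] == "East").show()    # later actions reuse the cached data

spark.stop()
```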
What You’ll Learn
By the end of this lesson, you will be able to:
Compare the file storage formats (other than tabular) that work best with Spark.
Understand how partitioning affects Spark performance.
Explain the cause of shuffling and limit it when possible.
Identify when caching is the best option.
Explain how to broadcast a lookup table, and force the broadcast when it doesn’t happen automatically.
Set the shuffle partitions to an appropriate value and demonstrate how to cache data (broadcasting and shuffle-partition settings are sketched after this list).
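The broadcast and shuffle-partition objectives could look like the sketch below. The tables, sizes, and the value of 8 partitions are illustrative only; an appropriate setting depends on your cluster and data.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-and-shuffle").getOrCreate()

# Lower the default of 200 shuffle partitions for a small local dataset.
spark.conf.set("spark.sql.shuffle.partitions", "8")

orders = spark.createDataFrame(
    [(1, "US", 120.0), (2, "CA", 80.0), (3, "US", 95.0)],
    schema=["order_id", "country_code", "amount"])

# A small lookup table: a good candidate for a broadcast join.
countries = spark.createDataFrame(
    [("US", "United States"), ("CA", "Canada")],
    schema=["country_code", "country_name"])

# Force a broadcast of the lookup table so the join avoids a shuffle.
joined = orders.join(broadcast(countries), on="country_code", how="left")
joined.show()

spark.stop()
```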
21.4 Databricks
Overview
In today’s class, you will use Spark on Databricks to perform data analysis in the cloud. Through a series of exercises, you will gain hands-on experience with the Python and SQL interfaces of Databricks, with an emphasis on using the SQL interface for increasingly complex queries. The class will conclude with a group activity in which you will query a database in SQL, create a brief report with recommendations, and report your findings to the class.
What You’ll Learn
By the end of this lesson, you will be able to:
Explain the purpose, key features, and applications of Databricks.
Set up a Databricks environment.
Identify the key components of a Databricks environment.
Navigate the Databricks workspace using dbutils.
Import data into a new notebook by using Parquet files, CSV files, and S3.
Explain the advantage of Parquet as a big data storage format.
Perform complex data analysis, including joins, using the Python and SQL interfaces (a brief sketch follows this list).
Describe two advantages of using Databricks over PySpark for data analysis.
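In a Databricks notebook, spark and dbutils are predefined, so cells like the sketch below would run without imports. The file paths and column names are placeholders, except for /databricks-datasets, which Databricks mounts by default; the join mirrors the kind of analysis in the group activity.

```python
# List files with dbutils (available only inside a Databricks notebook).
display(dbutils.fs.ls("/databricks-datasets"))

# Read a CSV file and a Parquet file into DataFrames (paths are placeholders).
customers = spark.read.csv("/mnt/demo/customers.csv", header=True, inferSchema=True)
orders = spark.read.parquet("/mnt/demo/orders_parquet")

# Join the two tables with the Python interface.
customer_orders = orders.join(customers, on="customer_id", how="inner")

# Register a view so the same data can also be queried from a SQL cell.
customer_orders.createOrReplaceTempView("customer_orders")
spark.sql("""
    SELECT customer_id, COUNT(*) AS num_orders, SUM(amount) AS total_spent
    FROM customer_orders
    GROUP BY customer_id
    ORDER BY total_spent DESC
""").show()
```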