1 What is Data Analysis?
1.1 Talking About Data
When people talk about “data”, what exactly do they mean? And what is data analysis? Who is a data analyst and what do they do? To round out our understanding of data, the next few chapters will also discuss the following three questions:
- What are some of the things that analysts tend to think when discussing data?
- How do we store and organize data?
- How do programs handle data?
But first we can start by defining some of these terms:
- Data
- In general, data are facts and statistics collected together for reference or analysis.
- Data Analysis
- Data analysis is the collection, transformation, and organization of data in order to draw conclusions, make predictions, and drive informed decision-making.
- Data Analyst
- A data analyst is someone who collects, transforms, and organizes data in order to help make informed decisions.
- Data Analytics
- Data analytics is the science of data.
Let’s take a look at the following example:
- Example:
- Luke Skywalker grew up on Tatooine. He is a Force-sensitive male, 1.72 meters tall, and uses a lightsaber.
There are numerous ways to store this data:
- Pen and paper
- Photo
- Audio
- Word Document
- Plain text file
On a computer, this data will be stored in a file.
- File
- A file is a named block of data on a computer's storage. It’s a collection of related data, typically organized as records.
- can be a few bytes to gigabytes in size
- a file format is a way of interpreting the bytes in a file
- File Extension
- A file extension is the suffix at the end of a file name. It indicates the type of file, that is, its file format.
- .txt, .pdf, .docx, .mp3, .jpg are all examples of file extensions.
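To make these definitions a little more concrete, here is a minimal Python sketch that stores a shortened version of the example sentence in a plain text file, then inspects the file's size in bytes and its extension. The file name `luke.txt` and the use of a temporary directory are assumptions made only for this illustration.

```python
import tempfile
from pathlib import Path

# A shortened version of the example above (plain ASCII characters only).
record = "Luke Skywalker is 1.72 meters tall and uses a lightsaber."

# Hypothetical file name, placed in a temporary directory so that
# nothing on your computer is overwritten.
path = Path(tempfile.mkdtemp()) / "luke.txt"
path.write_text(record, encoding="utf-8")

raw_bytes = path.read_bytes()        # the file's contents as raw bytes
size_in_bytes = path.stat().st_size  # how many bytes the file occupies
extension = path.suffix              # the file extension, here ".txt"

# A file format is a way of interpreting the bytes; here we read them as UTF-8 text.
text = raw_bytes.decode("utf-8")

print(f"{path.name} is {size_in_bytes} bytes with extension {extension}")
print(text)
```

Reading the same bytes under a different format (for example, treating them as an image) would not produce anything meaningful, which is why the extension is a useful hint about how a file should be interpreted.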
1.2 The Data Analysis Process
Data analysis can be broken down into multiple parts and each plays a key role. These parts together define what’s often referred to as the data analysis process.
- Identify the question or problem
- Collect the data
- Clean the data
- Analyze the data
- Interpret the results
1. Identify
The data analysis process begins with identifying the questions that we want to answer or the problems that we want to solve. In this step we want to ask smart and effective questions and put things into context. It’s also important to manage team and stakeholder expectations.
2. Collect
Once we’ve identified our questions and/or problems, we can begin collecting the data that will help us answer them. In this book we’ll learn the tools that can be used to collect data and the best practices for storing it.
3. Clean
In this step we clean the data to prepare it for analysis. Clean data also makes for a much more efficient and organized analysis. Most collected data is dirty, so in this book we will learn about tools in both Python and R for cleaning it.
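As a rough idea of what cleaning can involve, the sketch below tidies a handful of made-up records using only built-in Python. The records, the field names `name` and `height_m`, and the `clean` function are all hypothetical. This is one possible approach under those assumptions, not the specific tooling covered later in the book.

```python
# Hypothetical raw records: inconsistent spacing and casing,
# a missing height, and a duplicate entry.
raw_records = [
    {"name": " luke skywalker ", "height_m": "1.72"},
    {"name": "Leia Organa", "height_m": "1.50"},
    {"name": "LUKE SKYWALKER", "height_m": "1.72"},
    {"name": "Han Solo", "height_m": ""},
]


def clean(records):
    """Return tidy records: normalized names, numeric heights, no duplicates."""
    cleaned = []
    seen = set()
    for row in records:
        # Collapse extra whitespace and use consistent capitalization.
        name = " ".join(row["name"].split()).title()
        height_text = row["height_m"].strip()
        # Keep a missing height as None rather than guessing a value.
        height = float(height_text) if height_text else None
        key = (name, height)
        if key in seen:
            continue  # skip records that are duplicates after normalization
        seen.add(key)
        cleaned.append({"name": name, "height_m": height})
    return cleaned


clean_records = clean(raw_records)
for row in clean_records:
    print(row)
```

Note that the missing height is kept as an explicit `None` instead of being dropped or filled in. How to treat missing values is a judgment call that depends on the question being asked.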
4. Analyze
The analysis step is where we begin to answer our questions and solve our problems. Many methods can be used for analysis here, such as mathematical methods, statistical methods, and visualizations. This book will cover a few of these while also covering some very important preliminary concepts for visualizing data.
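A very small example of a statistical method is summarizing a set of measurements. The sketch below uses Python's standard `statistics` module on a hypothetical list of heights. The values are invented for illustration, and `None` stands in for a missing measurement.

```python
import statistics

# Hypothetical heights in meters; None marks a missing measurement.
heights_m = [1.72, 1.50, None, 1.80, 2.02]

# Only the known values can be summarized.
known = [h for h in heights_m if h is not None]

mean_height = statistics.mean(known)      # arithmetic average
median_height = statistics.median(known)  # middle value of the sorted data
spread = statistics.stdev(known)          # sample standard deviation

print(f"n = {len(known)}")
print(f"mean = {mean_height:.2f} m")
print(f"median = {median_height:.2f} m")
print(f"standard deviation = {spread:.2f} m")
```

Summaries like these are usually only a starting point. They tell us where to look, and visualizations often reveal patterns that a single number hides.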
5. Interpret
The final step is to interpret and share our findings. Communication is highly important here because we have to be very considerate of who our audience is. This can be our team, our stakeholders, the public, or even our family and friends. Each can require a different approach to sharing our results, which presents a challenge unlike those in the other steps, but a highly important and highly rewarding one.