Recitation 5: Data Exploration
In this recitation, you will explore a dataset of your choice using Excel.
Preparation
Prepare by watching at least the first two sections of the course “Excel: Introduction to Charts and Graphs” (i.e., the sections titled “Understand Excel Chart Concepts” and “Create Basic Charts”). The course is hosted on LinkedIn Learning, which is available for free to Clark students.
Note: I highly recommend that you use Excel for this activity. Excel is very widely used, so there are lots of educational resources available for it. Excel is available for free to Clark students. However, Google Sheets, Apple Numbers, and LibreOffice do offer similar functionality.
At Recitation
You can complete this activity either individually or with a partner.
Task 1: Select and Download a Dataset
You can choose to analyze any of the datasets listed here: https://vincentarelbundock.github.io/Rdatasets/datasets.html
Click on the “DOC” link to read more about each dataset. Right-click on the “CSV” link to download the data itself.
Note: I’ve used the “Do Workplace Smoking Bans Reduce Smoking?” dataset in the examples below, so please choose a different dataset.
Task 2: Open the Data in Excel
The file you downloaded contains the dataset in a “comma-separated values” (CSV) format. This is the most common format for sharing datasets, since it is easy for programs to read. CSV files are easy to recognize, since they have the .csv
file extension.
Next, you will open this file in Excel, either by dragging the file into Excel, or using the “Open” option in Excel. After opening the data, you should see something like this:
Note: Open a CSV file in a plain-text editor (e.g., Notepad on Windows, or TextEdit on macOS), and you will see how this format got its name.
Task 3: Explore the Data
The first thing scientists do when analyzing a dataset is perform an exploratory data analysis. Typically, scientists calculate summary statistics like counts, averages, medians, etc., and generate some simple graphs. The goal is to confirm that the data was imported properly, and to get an overview of the data.
You can choose which statistics and graphs to generate for your data. I have given some examples from my dataset below.
Average Age
My dataset has 10,000 rows, and I like to calculate summary statistics at the bottom of the data. The first thing I’ll do is freeze the header row, so I can see it after I scroll down:
After scrolling down, I click on the cell under the age data, then type: =AVERAGE(D2:D10001)
. This tells Excel to apply the AVERAGE
function to the specified cells. Each cell is specified by its column letter and row number.
After pressing enter, I get the result: the average age of people in this dataset is 38.6932
years old.
Pie Chart
Next, I want to understand how many people smoke. This is easy to do using Excel’s “PivotTables” feature. First, select the column of interest by clicking on the column letter. Then, choose “Insert”, “Tables”, “PivotTable.”
A dialogue box will open. Leave all the options as defaults, click “OK,” and a new worksheet (AKA, a tab) containing the PivotTable will be created. Check the column of interest, then drag it to the “Rows” area, and you should see something like this:
As you can see, the PivotTable counts the number of times each value appears in the column of interest. This summary data will make it easy to create a pie chart. Select the data in the pivot table, then select the pie chart from the “Insert” ribbon.
The pie chart shows that about 1/4 of people were smokers.
Of course, it’s a good idea to give the pie chart an informative title (e.g., “Proportion of Smokers”).
Other Ideas
There are many other ways I could visualize the data. For example:
- I could look for a connection between whether people worked at a company with a smoking ban and whether they smoked. This could be done by creating a PivotTable from multiple columns (also referred to as a contingency table).
- I could create column chart of the age data (also referred to as a histogram). This would make it possible to see how the ages are distributed.
- I could repeat some of these analyses on different columns.
Submit
Ensure that both partners submit to Moodle. You should both upload a PDF containing:
- You and your partner’s names
- The date and time of the recitation session you attended
- If you could not attend a recitation session, simply explain why you could not attend
- The name of the dataset you analyzed
- The results of your exploratory data analysis. You should include:
- A short paragraph explaining your dataset and what you learned from your analysis
- At least one chart
- At least one summary statistic (e.g., average, minimum, maximum, etc.)