Recitation 5: Data Exploration

In this recitation, you will explore a dataset of your choice using Excel.

Preparation

Prepare by watching at least the first two sections of the course “Excel: Introduction to Charts and Graphs” (i.e., the sections titled “Understand Excel Chart Concepts” and “Create Basic Charts”). The course is hosted on LinkedIn Learning, which is available for free to Clark students.

Note: I highly recommend that you use Excel for this activity. Excel is very widely used, so there are lots of educational resources available for it. Excel is available for free to Clark students. However, Google Sheets, Apple Numbers, and LibreOffice do offer similar functionality.

At Recitation

You can complete this activity either individually or with a partner.

Task 1: Select and Download a Dataset

You can choose to analyze any of the datasets listed here: https://vincentarelbundock.github.io/Rdatasets/datasets.html

Click on the “DOC” link to read more about each dataset. Right-click on the “CSV” link to download the data itself.

Note: I’ve used the “Do Workplace Smoking Bans Reduce Smoking?” dataset in the examples below, so please choose a different dataset.

Task 2: Open the Data in Excel

The file you downloaded contains the dataset in a “comma-separated values” (CSV) format. This is the most common format for sharing datasets, since it is easy for programs to read. CSV files are easy to recognize, since they have the .csv file extension.

Next, you will open this file in Excel, either by dragging the file into Excel, or using the “Open” option in Excel. After opening the data, you should see something like this:

Rows and columns of an Excel spreadsheet

Note: Open a CSV file in a plain-text editor (e.g., Notepad on Windows, or TextEdit on macOS), and you will see how this format got its name.

Task 3: Explore the Data

The first thing scientists do when analyzing a dataset is perform an exploratory data analysis. Typically, scientists calculate summary statistics like counts, averages, medians, etc., and generate some simple graphs. The goal is to confirm that the data was imported properly, and to get an overview of the data.

You can choose which statistics and graphs to generate for your data. I have given some examples from my dataset below.

Average Age

My dataset has 10,000 rows, and I like to calculate summary statistics at the bottom of the data. The first thing I’ll do is freeze the header row, so I can see it after I scroll down:

View -> Window -> Freeze Top Row

After scrolling down, I click on the cell under the age data, then type: =AVERAGE(D2:D10001). This tells Excel to apply the AVERAGE function to the specified cells. Each cell is specified by its column letter and row number.

Type: =AVERAGE(D2:D10001)

After pressing enter, I get the result: the average age of people in this dataset is 38.6932 years old.

Pie Chart

Next, I want to understand how many people smoke. This is easy to do using Excel’s “PivotTables” feature. First, select the column of interest by clicking on the column letter. Then, choose “Insert”, “Tables”, “PivotTable.”

Insert -> Tables -> PivotTable

A dialogue box will open. Leave all the options as defaults, click “OK,” and a new worksheet (AKA, a tab) containing the PivotTable will be created. Check the column of interest, then drag it to the “Rows” area, and you should see something like this:

PivotTable showing the number of rows with 'yes' and 'no' values in the 'smoker' column

As you can see, the PivotTable counts the number of times each value appears in the column of interest. This summary data will make it easy to create a pie chart. Select the data in the pivot table, then select the pie chart from the “Insert” ribbon.

Select the pie chart from the 'Insert' ribbon

The pie chart shows that about 1/4 of people were smokers.

A pie chart showing that about 1/4 of people were smokers

Of course, it’s a good idea to give the pie chart an informative title (e.g., “Proportion of Smokers”).

Other Ideas

There are many other ways I could visualize the data. For example:

Submit

Ensure that both partners submit to Moodle. You should both upload a PDF containing: