This workshop illustrates the basics of Excel, as well as a few intermediate techniques that you might encounter in the context of a GIS or data science project. As a tutorial and a case study, we investigate Airbnb listings in the Greater Boston Area.
As planners work with increasingly growing variety and volume of data, it is critical to know the typology and dimension of your dataset before any calculation, manipulation, or analysis. Generally, the datasets you work in Excel are structured as tabular data—in other words, data stored in columns (usually representing attributes) and rows (usually representing observations). Thus, the Dimension of the dataset equals…
Let’s open the file airbnb_units.xlsx
. In Excel you can
simply select the columns (usually the first row) or the index (usually
the first column) to inspect the dimensions of the dataset. Select row 1
to check the total number of columns:
How about the rows? Select the column A (the first column) to see how many rows in total. Then select Column K to see if the count of rows keeps the same number. Why does Excel return a different number of rows for Column K? It turns out that Excel is counting values, not strictly rows and columns. This means that using this method you should be careful to select rows and columns that do not have any null observations (like an index column or the first row, which usually contains field names).
Metadata means "data about data". It provides a high-level summary of one or a series of datasets, such as the author, means of the data generation, purpose of the data, data format, file size, data quality, and other necessary guidance/explanations for the users. In particular, structural metadata describes the structure of a dataset or database such as tables, columns, keys, and indexes.
Now open the Excel file metadata.xlsx
. Is this
comprehensive metadata for airbnb_units.xlsx
? What is the
possible reason for this discrepancy? Columns have been renamed here!
(And because they’re all 8 characters or under, we can assume this was
exported from a Shapefile, which caps column names at 8 characters.)
After checking with your colleague who processed this data, you now confirm the following column names and their meaning:
crt_date
: The date the vacation rental listing was
created.
pid
: Property id.
rp_type
: Types of accommodations.
avgdly_u
: Average daily rate in USD.
annrev_u
: Annual revenue in USD.
occ_rate
: Occupancy rate.
bed
: Number of bedrooms.
max_gs
: The maximum number of guests the vacation
rental property can accommodate.
canc_pol
: Cancellation policy for the vacation
rental listing.
Data values in Excel can be stored as different types with related
methods for calculating. Look at crt_date
, or the date the
vacation rental listing was created. It appears as a date—if you open up
the format pane (cmd-1
or ctrl-1
on Macs and
PCs, respectively) you can adjust how it is displayed.
The Sort & Filter tool enables you to…
Create a filter for each column by selecting all columns and clicking
Data > Filter
in the ribbon. Note that each column now
has a small triangle next to the name, indicating that you can click to
set up a filter using a helper screen. Set up a filter on the
city
column such that the sheet only displays listings in
Cambridge and Boston. Check out the row index (i.e., the numbers on the
far left)—these are telling you that this is a subset! See how they
skip? That indicates that we’re excluding some rows from our view.
Finally, filter out cases where the average daily rate is 0 or missing (a greater than zero criterion will filter out rows with missing values as well).
Let’s create a new sheet, name it Cambridge and Boston
.
Then copy and paste the following columns from the
airbnb_units
sheet with associated conditions specified
below.
id
, pid
, prp_type
,
lst_type
, city
, avgdly_u
,
bed
Now check at the row number (a.k.a. index) on this new sheet—above, these indicated a subset. This is no longer the case! We’ve created a copy of the subset and placed it into a new table.
Use functions AVG( )
, MAX( )
, and
MIN( )
to calculate some simple descriptive statistics:
We can also use some slightly more sophisticated tools if we load Excel’s Analysis toolpack—click “Analysis tools” in the ribbon and check “Analysis ToolPak.”
Click Data > Data Analysis
in the ribbon and set the
following parameters:
Running the tool, you’ll see the following results:
This is a quick way to get a sense of your data using common summary statistics! There may be a few you’re unfamiliar with—kurtosis and skewness, for example. These are measures of how long the tails are on your frequency distribution! (This data is very positively skewed—i.e., it has a long right tail.)
You can use a Pivot Table to calculate, summarize, and analyze data to compare by groups and explore patterns.
Let's create a pivot table on a new sheet named
Pivot_Table
.
Cambridge and Boston
sheet.Now click the table area, you will see a
PivotTable Fields
pane. This is the interface for you to
choose what to calculate/report on your pivot table. From the
Field Name
panel…
bed
to Columns;city
to Rows;avgdly_u
to Values;The newly generated column summarizes the sum of average daily rate by cities. This is not very informative—the average would be far more informative!
Click the small i
icon next to the metrics calculated in
Values
pane. Change from Sum to Average. You can also
change the filed name. Let’s call it
Average Rate ($/day)
Let’s work together to answer this question:
What are the average daily rates in Cambridge and Boston, grouped by the number of bedrooms (bedroom <= 3) and property type (apartment and house).
To answer this question, you’ll need to filter the pivot table. One of the quirks of pivot tables is that they don’t allow you to use a column as a filter and a row or column at the same time. (I KNOW, RIGHT?) But! You can filter first, then drag back into a row or column to see your changes reflected.
As such, move bed
into the columns and remove all except
0, 1, 2, and 3 bedroom units. Then, move it back into your rows. Do the
same for prp_type
, filtering for apartment and house.
city
should be your column.
The resulting column compares daily rates for apartments and houses in Boston and Cambridge by bedroom count. Now, we’ll plot a figure!
Let’s add a plot! Click on the pivot table, and from the ribbon,
select Insert > Pivot Chart
. Excel will automatically
plot a bar chart of average daily rates, comparing Boston to
Cambridge—play with changing the Axis order! This will affect how bars
are grouped.
From here, play with your chart—under the ‘design’ tab, you can choose a range of different defaults. You can also add titles, change colors, and modify other attributes of your plot.
Of course, we may want to produce statistical plots as well—for example, box and whisker plots! These are a great way to measure the spread, or dispersion of the price for one-bedroom apartment listings between Boston and Cambridge.
Create a new sheet and copy your all rows that are in Boston or Cambridge, have 3 or fewer bedrooms, and that are apartments or houses. This should be familiar! Once you’ve created this new sheet…
Amazing! You’ve produced a graphic to analyze the spread of listing prices for Airbnbs in Cambridge and Boston.
In this workshop, we go through the following key components of Excel: