4. Learning about your data

Author

Camille Seaberry

Modified

February 8, 2024

From Wickham et al. (2023), Exploratory Data Analysis chapter:

Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd edition). O’Reilly. https://r4ds.hadley.nz/

There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as: - What type of variation occurs within my variables? - What type of covariation occurs between my variables?

We’ll follow the steps of the EDA chapter using the acs dataset in the {justviz} package. For simplicity, we’ll focus on Maryland census tracts and just a few variables dealing with housing and income.

library(dplyr)
library(ggplot2)
library(justviz)

acs_tr <- acs |>
  filter(level == "tract") |>
  select(county, name, total_pop, total_hh,
         homeownership, total_cost_burden, renter_cost_burden,
         owner_cost_burden, no_vehicle_hh, median_hh_income, pop_density)
knitr::kable(head(acs_tr))

county	name	total_pop	total_hh	homeownership	total_cost_burden	renter_cost_burden	owner_cost_burden	no_vehicle_hh	median_hh_income	pop_density
Allegany County	24001000100	3474	1577	0.78	0.18	0.39	0.12	0.06	56232	18.48533
Allegany County	24001000200	4052	1390	0.86	0.20	0.33	0.18	0.04	66596	84.29020
Allegany County	24001000500	2304	683	0.60	0.20	0.36	0.09	0.13	47781	520.91879
Allegany County	24001000600	3005	1374	0.70	0.20	0.36	0.14	0.10	48607	1898.93501
Allegany County	24001000700	3233	1462	0.49	0.37	0.48	0.25	0.25	36090	4538.49941
Allegany County	24001000800	1932	786	0.48	0.52	0.68	0.35	0.17	27130	1529.51773

summary(acs_tr)

    county              name             total_pop        total_hh   
 Length:1460        Length:1460        Min.   :    5   Min.   :   0  
 Class :character   Class :character   1st Qu.: 2960   1st Qu.:1120  
 Mode  :character   Mode  :character   Median : 4026   Median :1528  
                                       Mean   : 4220   Mean   :1588  
                                       3rd Qu.: 5358   3rd Qu.:1999  
                                       Max.   :14109   Max.   :4644  
                                                                     
 homeownership    total_cost_burden renter_cost_burden owner_cost_burden
 Min.   :0.0000   Min.   :0.0400    Min.   :0.00       Min.   :0.0000   
 1st Qu.:0.5000   1st Qu.:0.2100    1st Qu.:0.31       1st Qu.:0.1600   
 Median :0.7400   Median :0.2800    Median :0.44       Median :0.2100   
 Mean   :0.6752   Mean   :0.2932    Mean   :0.43       Mean   :0.2181   
 3rd Qu.:0.8800   3rd Qu.:0.3700    3rd Qu.:0.55       3rd Qu.:0.2700   
 Max.   :1.0000   Max.   :0.7400    Max.   :1.00       Max.   :1.0000   
 NA's   :4        NA's   :4         NA's   :7          NA's   :12       
 no_vehicle_hh     median_hh_income  pop_density      
 Min.   :0.00000   Min.   : 10000   Min.   :    1.05  
 1st Qu.:0.02000   1st Qu.: 70525   1st Qu.:  984.33  
 Median :0.05000   Median : 98164   Median : 3413.10  
 Mean   :0.09468   Mean   :104585   Mean   : 4968.47  
 3rd Qu.:0.12000   3rd Qu.:132052   3rd Qu.: 6822.62  
 Max.   :0.85000   Max.   :250001   Max.   :57424.37  
 NA's   :4         NA's   :6

What types of values do each of these variables represent? Why are their scales so different?

Variation

First a histogram of median household income values:

ggplot(acs_tr, aes(x = median_hh_income)) +
  geom_histogram(color = "white")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).

There’s a message and a warning: the message suggests being intentional about the number of bins, and the warning calls our attention to missing values in this column.

Use the next few chunks of code to experiment with bin specifications. Does your understanding of the data’s distribution change?

ggplot(acs_tr, aes(x = median_hh_income)) +
  geom_histogram(bins = 50) # bins can be determined by setting the number of bins

Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).

ggplot(acs_tr, aes(x = median_hh_income)) +
  geom_histogram(binwidth = 10000) # or by the width of bins, with a scale corresponding to the x-axis

Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).

What are some values of bins or binwidth that seem reasonable? At what point do either of them start to obscure data?

As for the missing values:

acs_tr |>
  filter(is.na(median_hh_income))

county	name	total_pop	total_hh	homeownership	total_cost_burden	renter_cost_burden	owner_cost_burden	no_vehicle_hh	median_hh_income	pop_density
Anne Arundel County	24003740400	4241	0	NA	NA	NA	NA	NA	NA	6000.504539
Anne Arundel County	24003980000	5	0	NA	NA	NA	NA	NA	NA	1.048504
Washington County	24043011000	5212	0	NA	NA	NA	NA	NA	NA	3049.954209
Baltimore city	24510100300	1999	0	NA	NA	NA	NA	NA	NA	33122.764732
Baltimore city	24510190300	2122	839	0.20	0.48	0.52	0.34	0.68	NA	15572.409097
Baltimore city	24510250600	11	11	0.55	0.55	0.00	1.00	0.00	NA	5.033091

Even though we’re probably not going to use the total population and total household variables for any analysis here, I kept them because those sorts of variables that define what your observational unit is are important for checking what’s going on in your data. By which I mean a census tract is made up of a bunch of people (usually about 4,000) in a contiguous area who mostly live in households. But if you work with census data enough, you’ll know that some places have population but few households, or only very small populations altogether—a tract might actually be a jail or a set of college dorms, or maybe the majority of a tract is those sorts of group quarters, and the remainder is too small to reliably calculate some of the data. What we want to do with those tracts can depend on context, but I’ll drop them here.

acs_tr2 <- filter(acs_tr, !is.na(median_hh_income))

Typical values

Does anything seem weird about the median household income values? Look back at Figure 1 where it may be more apparent. (We’ll talk about this anomaly in the data.)

Switching to cost burden rates:

ggplot(acs_tr2, aes(x = total_cost_burden)) +
  geom_histogram(binwidth = 0.03)

This approaches a normal curve, but is skewed. From the histogram, the mean looks to be around 0.3 (looking back at the summary, this is correct), but with quite a few tracts with higher rates. Because this is a proportion, we don’t expect there to be any values below 0 or above 1.

A boxplot can make it a little easier to figure out what’s typical in your distribution.

# use a dummy value for x because ggplot expects boxplots to be done by a discrete variable
ggplot(acs_tr2, aes(x = 1, y = total_cost_burden)) +
  geom_boxplot() +
  coord_flip()

Unusual values

ggplot(acs_tr2, aes(x = pop_density)) +
  geom_histogram(binwidth = 1000)

There are a few tracts that are extremely dense. If we wanted to get a sense of more typical tracts, we could filter those, either from the data or within the limits of the chart:

acs_tr2 |>
  filter(pop_density < 30000) |>
  ggplot(aes(x = pop_density)) +
  geom_histogram(binwidth = 1000)

# if you want bars to be between tick marks instead of centered over them, set boundary = TRUE
acs_tr2 |>
  ggplot(aes(x = pop_density)) +
  geom_histogram(binwidth = 1000, boundary = TRUE, color = "white") +
  scale_x_continuous(limits = c(NA, 30000))

Warning: Removed 8 rows containing non-finite outside the scale range
(`stat_bin()`).

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_bar()`).

We could decide to investigate those high-density tracts. For example, if we’re interested in housing costs, we might drop tracts that seem to mostly be dorms. However, at least these tracts in Montgomery County are actually high-rise condos bordering DC, so we should keep them in.

Covariation

Especially when we talk about housing and socio-economic data, we expect things to be correlated—probably even more so than with naturally occurring phenomena, since so much of where we live and what resources we have are determined by history and policy decisions. So it shouldn’t surprise you to find correlations in data like this. In fact, the CDC PLACES dataset uses demographic data to model health measures where they don’t have direct measurements available, so in cases like that you actually want to lean away from digging into correlations too much, or you might end up just confirming the makeup of the model, not finding anything new.

A categorical and a numerical variable

I’ll reshape the data to get housing tenure into one categorical variable. (If this code doesn’t make sense it’s okay.)

cost_burden <- acs_tr2 |>
  tidyr::pivot_longer(cols = matches("cost_burden"), 
                      names_to = c("tenure", ".value"), # split column names into tenure and the name of the measure (cost_burden)
                      names_pattern = "(^[a-z]+)_(\\w+$)", # use regex to match a set of lowercase letters at the start of the string, followed by an underscore, then match word characters until the end
                      names_ptypes = list(tenure = factor())) |>
  filter(!is.na(cost_burden))

ggplot(cost_burden, aes(x = cost_burden, color = tenure)) +
  geom_freqpoly(binwidth = 0.02)

The bit about calling after_stat in the book chapter doesn’t apply here, since we have the same number of observations for each tenure.

ggplot(cost_burden, aes(x = tenure, y = cost_burden)) +
  geom_boxplot()

ggplot(cost_burden, aes(x = forcats::fct_reorder(tenure, cost_burden), 
                        y = cost_burden)) +
  geom_boxplot()

Two categorical variables

This is a pretty contrived example to match section 10.5.2, but I’ll bin homeownership and housing cost burden into categorical variables, and look at these by county.

acs_tr2 |>
  mutate(ownership_brk = cut(homeownership, 
                             breaks = c(0, 0.25, 0.5, 0.75, 1), 
                             include.lowest = TRUE, right = FALSE)) |>
  mutate(income_brk = cut(median_hh_income, 
                          breaks = c(0, 5e4, 1e5, 1.5e5, Inf),
                          include.lowest = TRUE, right = FALSE)) |>
  count(income_brk, ownership_brk) |>
  ggplot(aes(x = income_brk, y = ownership_brk)) +
  geom_point(aes(size = n)) +
  scale_size_area()

See if you can adjust the code to use tiles and a color (fill) scale.

Two numerical variables

Same variables, without the binning

ggplot(acs_tr2, aes(x = median_hh_income, y = homeownership)) +
  geom_point()

Use the methods in the book (changing alpha or using a 2D density) to deal with overplotting here.

ggplot(acs_tr2, aes(x = median_hh_income, y = homeownership)) +
  geom_point(alpha = 0.4)

ggplot(acs_tr2, aes(x = median_hh_income, y = homeownership)) +
  geom_point(shape = "circle open")

ggplot(acs_tr2, aes(x = median_hh_income, y = homeownership)) +
  geom_bin2d()