13. Mapping data to space

Author

Camille Seaberry

Modified

April 10, 2024

Space

Think back to the first principles we talked about with mapping data onto visual encodings (reading lengths of bar charts, reading darkness of color). With spatial data, we have many of the same principles, but we’re confined by location. We can’t just, say, flip our axes to make them easier to read, or change the aspect ratio to fit in a tight space on a page. We also can’t just arbitrarily change the size of things we’re drawing—for example, in a US state map, we can’t just make Texas less huge.

That said, there are a lot of decisions we do make about how we encode data to space or manipulate it, and these can change the nature of what we know about our data. That makes it important to think really hard about those decisions, and situate those decisions in the purpose of your visualization, your audience, the scope of your work, domain knowledge, etc etc. Often there are no perfectly right answers.

library(dplyr)
library(justviz)
library(sf)
library(ggplot2)

# justviz::tracts_sf is only baltimore & neighboring counties
all_tracts_sf <- tigris::tracts(state = "24", cb = TRUE, progress_bar = FALSE) |>
  select(geoid = GEOID, county = NAMELSADCO)
counties_sf <- all_tracts_sf |>
  group_by(county) |>
  summarise()

When we looked at Hans Rosling’s bubble plots, the size (area) of points showed the population of each country. That makes sense perceptually—your brain is going to see things of different sizes, pick up on that difference, and think it means something. But in spatial data, it might not; different sizes might actually interfere with our ability to understand what the visual is meant to show. That is, the area of a location might have something to do with the data, but oftentimes it doesn’t (sometimes if anything it shows the opposite—some of the largest US states have among the smallest populations, same goes for Maryland counties). Census tracts are designed to have similar populations, but counties are arbitrary political boundaries.

acs |>
  filter(level == "county") |>
  ggplot(aes(x = area_sqmi, y = total_pop)) +
  geom_point() +
  ggrepel::geom_text_repel(aes(label = name), 
                           data = ~filter(., between(total_pop, 5e5, 7e5))) +
  labs(title = "Total population vs area, Maryland counties")

acs |>
  filter(level %in% c("tract", "county")) |>
  ggplot(aes(x = total_pop)) +
  geom_histogram() +
  facet_wrap(vars(level), scales = "free") +
  labs(title = "Total population, Maryland counties and tracts")

Can you guess what this is a map of?

Areal units

One major issue that comes up in mapping is the modifiable areal unit problem:

The MAUP occurs when statistical results or visual patterns embedded in maps differ according to changes in the scale or zoning of the partitioning scheme used to aggregate spatial data (Openshaw, 1984). The problem commonly occurs when data on individual objects, such as people, are disseminated in data sets in aggregated form, such that the actual spatial distribution of the original individual objects is unobtainable….

One consequence of the MAUP is that spatial patterns that may be observed in a map using one partitioning scheme may not be visible using another partitioning scheme where the scale or zoning differs. –Mennis (2019)

Mennis, J. (2019). Problems of Scale and Zoning. Geographic Information Science & Technology Body of Knowledge, 2019(Q1). https://doi.org/10.22224/gistbok/2019.1.2

Sometimes the units you use for analysis are ones you can decide on, but sometimes you don’t have a choice (e.g. data would be most useful at the tract level but is only available by town; in a survey you ask for people’s ZIP code because you know they probably don’t know their tract number).

Similar to this is the ecological fallacy, where you assume that patterns or phenomena that you see in aggregate are true for individuals or individual areas within that population.

MAUP and ecological fallacy

When we break a geography into areas that are too large, we lose the patterns that occur within them. If we break into areas that are too small, we might no longer be able to do the analysis we want, or we might be looking at something so closely that patterns aren’t able to develop. For example, counting trees across an entire city won’t tell you anything about tree density in people’s neighborhoods if most of those trees are within one large park; counting trees per parcel in that same city won’t be that useful if you only expect one or two trees at most in front of any house, and if residents get the same benefits of having a tree in front of their neighbor’s house as they do their own.

For example, Baltimore city is 27 percent white. That’s the lowest share of all the counties in the state except Prince George’s. But that population is very concentrated within the city.

acs_county <- acs |>
  filter(level == "county") |>
  mutate(name = forcats::as_factor(name) |>
           forcats::fct_reorder(white, .desc = FALSE))

w1 <- acs_county |>
  ggplot(aes(x = name, y = white)) +
  geom_col(width = 0.8) +
  coord_flip() +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(x = NULL)

w2 <- acs |>
  filter(level == "tract") |>
  mutate(county = forcats::as_factor(county) |>
           forcats::fct_relevel(levels(acs_county$name))) |>
  ggplot(aes(x = county, y = white)) +
  geom_boxplot() +
  coord_flip() +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(x = NULL)

library(patchwork)

w1 + w2

Even though a much lower share of Baltimore city overall is white than some of the smaller or less diverse counties like Garrett and Carroll, the city is so segregated that we have neighborhoods that do in fact rival those counties’ averages. We lose that fact when we only look at averages or aggregates, but it’s a major pattern underlying much of how the city and region work (hot take: that’s by design).

Joining spatial and nonspatial data: a detour

You can join sf data frames with other data frames with dplyr and other tools like we’ve done before. Both tracts_sf and acs inherit the data.frame object type, but tract_sf also has the sf object type. You always know a sf object by how it prints, starting with “Simple feature collection …”.

class(tracts_sf)
[1] "sf"         "data.frame"
class(acs)
[1] "tbl_df"     "tbl"        "data.frame"
head(tracts_sf)
county geoid geometry
Baltimore city 24510160300 POLYGON ((-76.64612 39.3031…
Baltimore city 24510020100 POLYGON ((-76.59047 39.2916…
Baltimore County 24005402602 POLYGON ((-76.76732 39.3673…
Baltimore city 24510190300 POLYGON ((-76.64696 39.2881…
Baltimore city 24510271801 POLYGON ((-76.68336 39.3517…
Anne Arundel County 24003730403 POLYGON ((-76.62029 39.1510…
head(acs)
level county name total_pop white black latino asian other_race diversity_idx foreign_born total_hh homeownership total_cost_burden total_severe_cost_burden owner_cost_burden owner_severe_cost_burden renter_cost_burden renter_severe_cost_burden no_vehicle_hh median_hh_income ages25plus less_than_high_school high_school_grad some_college_or_aa bachelors grad_degree pov_status_determined poverty low_income area_sqmi pop_density
us NA United States 331097593 0.59 0.12 0.19 0.06 0.05 0.7376547 0.14 125736353 0.65 0.29 0.14 0.21 0.09 0.45 0.23 0.08 75149 226600992 0.11 0.26 0.28 0.21 0.13 323275448 0.13 0.29 NA NA
msa NA Abilene, TX Metro Area 176656 0.63 0.07 0.25 0.02 0.03 0.6353998 0.06 65075 0.63 0.27 0.11 0.18 0.05 0.43 0.20 0.04 61924 111552 0.11 0.31 0.32 0.17 0.09 164028 0.14 0.35 2743.514 64.39042
msa NA Akron, OH Metro Area 700578 0.78 0.12 0.02 0.03 0.05 0.4932053 0.05 292223 0.68 0.25 0.12 0.16 0.06 0.43 0.22 0.08 68647 487995 0.07 0.31 0.28 0.21 0.13 684199 0.12 0.28 900.198 778.24875
msa NA Albany, GA Metro Area 147716 0.41 0.52 0.03 0.01 0.03 0.5978239 0.03 57043 0.57 0.30 0.16 0.18 0.09 0.46 0.27 0.09 53173 97010 0.14 0.32 0.31 0.13 0.10 142224 0.22 0.41 1591.030 92.84302
msa NA Albany-Lebanon, OR Metro Area 128598 0.83 0.00 0.10 0.01 0.06 0.3838219 0.04 49944 0.67 0.31 0.14 0.23 0.10 0.45 0.20 0.04 69523 89890 0.09 0.29 0.41 0.13 0.07 126135 0.12 0.33 2289.276 56.17408
msa NA Albany-Schenectady-Troy, NY Metro Area 901729 0.77 0.07 0.06 0.05 0.05 0.5241427 0.08 372919 0.64 0.25 0.12 0.17 0.07 0.41 0.22 0.10 83360 627234 0.07 0.25 0.28 0.22 0.18 871777 0.11 0.23 2811.804 320.69412

A data frame can have a geometry column without the data frame being an sf object. If I convert tracts_sf to a base data frame, it still has a geometry column (the type is shown as sfc_POLYGON, and it’s a list-column of polygons), but it’s no longer an sf object.

tracts_df <- as.data.frame(tracts_sf)
head(tracts_df)
county geoid geometry
Baltimore city 24510160300 POLYGON ((-76.64612 39.3031…
Baltimore city 24510020100 POLYGON ((-76.59047 39.2916…
Baltimore County 24005402602 POLYGON ((-76.76732 39.3673…
Baltimore city 24510190300 POLYGON ((-76.64696 39.2881…
Baltimore city 24510271801 POLYGON ((-76.68336 39.3517…
Anne Arundel County 24003730403 POLYGON ((-76.62029 39.1510…
class(tracts_df)
[1] "data.frame"

Why this matters to you is this gotcha: when you join an sf data frame with a regular one, the “left” data frame in the join takes priority in setting the object type.

sf_on_left <- inner_join(tracts_sf, acs, by = c("geoid" = "name"))
sf_on_right <- inner_join(acs, tracts_sf, by = c("name" = "geoid"))

class(sf_on_left)
[1] "sf"         "data.frame"
class(sf_on_right)
[1] "tbl_df"     "tbl"        "data.frame"

Remember that nice plot method for sf objects? It only applies to sf objects, i.e. sf_on_left but not sf_on_right.

plot(sf_on_left[, "median_hh_income"])

plot(sf_on_right[, "median_hh_income"])

Similarly, ggplot2::geom_sf expects an sf object, not just a geometry column: 1

1 There is a workaround but it also requires telling ggplot your data’s CRS, and why not just do it the expected way?

# this runs fine
ggplot(sf_on_left) +
  geom_sf()

# this throws an error:
# `stat_sf()` requires the following missing aesthetics: geometry
ggplot(sf_on_right) +
  geom_sf(aes(geometry = geometry))

You’ll probably make this mistake anyway—I do it all the time—but now you have a heads up about it.

You can also just convert your data frame with an sf column back into a sf data frame.

sf_once_again <- sf_on_right |>
  sf::st_as_sf()

class(sf_once_again)
[1] "sf"         "tbl_df"     "tbl"        "data.frame"

Joining spatial and nonspatial data for real

Now that we know what to watch out for, we can actually join our spatial and nonspatial data and make a choropleth. This is a type of thematic map (probably the most common type) where you shade areas in a map based on some additional value. These can be very useful but also easily misleading.

all_tracts_sf |>
  left_join(acs, by = c("geoid" = "name", "county")) |>
  ggplot(aes(fill = white)) +
  geom_sf(color = "white") +
  scale_fill_distiller(palette = "YlGnBu", direction = 1) +
  labs(title = "Percent white, Maryland tracts, 2022")

There’s a minor problem with this map, but it’s a byproduct of the very different areas of tracts here, particularly in dense urban tracts: some of those values in Baltimore or PG County look very low. Is it really because their values are close to 0, or is some of it just the white borders around tracts?

all_tracts_sf |>
  left_join(acs, by = c("geoid" = "name", "county")) |>
  ggplot(aes(fill = white)) +
  # set linewidth to adjust thickness of borders
  geom_sf(color = "white", linewidth = 0) +
  scale_fill_distiller(palette = "YlGnBu", direction = 1) +
  labs(title = "Percent white, Maryland tracts, 2022")

It’s both! I’m endlessly amazed by how often I see this in published work. Sometimes the borders are useful, but here they weren’t really.

Note also that these are percentages, not the number of white people per tract. With tracts, since they’re of standardized sizes, this wouldn’t be as big a deal as with other geographies, but if we showed populations in this way, we’d just be displaying population density. That’s because what you’re reading is a value presumably spread over an area. You pretty much always want a choropleth to be a value that’s been scaled by some population measure, or something not based on a raw count:

  • percentage of population ages 65+
  • median household income
  • percentage of households without a car
  • polling places per 1,000 registered voters
  • GDP per capita

In most cases, a choropleth would not be appropriate for count of people ages 65+ or total GDP, etc.

Exercises

Time to make a choropleth! Join the ACS data to one of the tract shapefiles (justviz::tracts_sf for just the city and its neighboring counties, or all_tracts_sf defined here for the whole state). Find a different variable of interest and make a choropleth of it like the ones above.

Filter for just Baltimore city or a single county (use filter(county == "county name")), and make another choropleth of the same variable. Compare what you can learn from looking at that choropleth versus looking at the county-wide value. In some places, the tracts will be very different from the county aggregate; in a more homogenous area, they might not be.

Back to top