Merging data

Author

Camille Seaberry

Modified

April 10, 2024

library(dplyr)
library(justviz)
library(ggplot2)

Here are some notes on merging data from different data frames. A lot of the functions here come from dplyr, including all the *_join ones.

Types of joins

There are different types of joins that are defined by what data you want to keep and under what circumstances. These are consistent across many different languages (e.g. same terminology in R should apply in most/all SQL variants). The ones you’ll use most often are left joins and inner joins; when in doubt, a left join is safer than an inner join.

There’s an overly complicated chapter in R for Data Science on joins. There are some less complicated examples in the dplyr docs.

A diagram from LearnSQL.com illustrating left joins, right joins, inner joins, and full joins as Venn diagrams between two tables. — An illustration of joins

Imagine we’re joining two tables of data for counties A, B, C, D, and E, one row per county. The left table, housing, has housing information for each county but is missing County B. The right table, income, has income information for counties A, B, and E. That means there are a total of 5 counties, but only 2 of them are in both tables.

Left join will include every county that’s in housing, regardless of whether it’s also in income. There will be a row for income variables, but their values will be NA.
Inner join will include every county that’s in both housing and income.
Right join is like left join: it will include every county that’s in income, regardless of whether it’s also in housing.
Full join will include every county in either table.

set.seed(1)
housing <- data.frame(county = c("A", "C", "D", "E"), 
                      homeownership = runif(4),
                      vacancy = runif(4, min = 0, max = 0.1))
income <- data.frame(county = c("A", "B", "E"), 
                     poverty = runif(3))

left_join(housing, income, by = "county")

county	homeownership	vacancy	poverty
A	0.2655087	0.0201682	0.6291140
C	0.3721239	0.0898390	NA
D	0.5728534	0.0944675	NA
E	0.9082078	0.0660798	0.2059746

inner_join(housing, income, by = "county")

county	homeownership	vacancy	poverty
A	0.2655087	0.0201682	0.6291140
E	0.9082078	0.0660798	0.2059746

right_join(housing, income, by = "county")

county	homeownership	vacancy	poverty
A	0.2655087	0.0201682	0.6291140
E	0.9082078	0.0660798	0.2059746
B	NA	NA	0.0617863

full_join(housing, income, by = "county")

county	homeownership	vacancy	poverty
A	0.2655087	0.0201682	0.6291140
C	0.3721239	0.0898390	NA
D	0.5728534	0.0944675	NA
E	0.9082078	0.0660798	0.2059746
B	NA	NA	0.0617863

There are other joins that might be useful for filtering, but that don’t add any new columns. Semi joins return the rows of the left table that have a match in the right table, and anti joins return the rows of the left table that do not have a match in the right table. If you were making separate charts on housing and income, but wanted your housing chart to only include counties that are also in your income data, semi join would help.

semi_join(housing, income, by = "county")

county	homeownership	vacancy
A	0.2655087	0.0201682
E	0.9082078	0.0660798

Joining justviz datasets

acs_tract <- acs |> filter(level == "tract")

head(acs_tract)

level	county	name	total_pop	white	black	latino	asian	other_race	diversity_idx	foreign_born	total_hh	homeownership	total_cost_burden	total_severe_cost_burden	owner_cost_burden	owner_severe_cost_burden	renter_cost_burden	renter_severe_cost_burden	no_vehicle_hh	median_hh_income	ages25plus	less_than_high_school	high_school_grad	some_college_or_aa	bachelors	grad_degree	pov_status_determined	poverty	low_income	area_sqmi	pop_density
tract	Allegany County	24001000100	3474	0.98	0.00	0.01	0.00	0.01	0.0826466	0.01	1577	0.78	0.18	0.08	0.12	0.07	0.39	0.14	0.06	56232	2671	0.09	0.47	0.28	0.07	0.08	3461	0.12	0.35	187.932766	18.48533
tract	Allegany County	24001000200	4052	0.75	0.19	0.02	0.00	0.03	0.4674189	0.03	1390	0.86	0.20	0.12	0.18	0.11	0.33	0.18	0.04	66596	3255	0.15	0.49	0.24	0.08	0.05	2949	0.11	0.30	48.072019	84.29020
tract	Allegany County	24001000500	2304	0.66	0.19	0.05	0.01	0.07	0.6236263	0.04	683	0.60	0.20	0.04	0.09	0.02	0.36	0.07	0.13	47781	1283	0.09	0.37	0.38	0.11	0.06	1777	0.27	0.51	4.422954	520.91879
tract	Allegany County	24001000600	3005	0.91	0.02	0.01	0.00	0.07	0.2348727	0.00	1374	0.70	0.20	0.09	0.14	0.04	0.36	0.23	0.10	48607	2141	0.07	0.35	0.35	0.13	0.09	2910	0.14	0.37	1.582466	1898.93501
tract	Allegany County	24001000700	3233	0.93	0.02	0.01	0.00	0.04	0.1844055	0.02	1462	0.49	0.37	0.16	0.25	0.07	0.48	0.24	0.25	36090	2045	0.13	0.38	0.33	0.08	0.08	3217	0.28	0.58	0.712350	4538.49941
tract	Allegany County	24001000800	1932	0.89	0.01	0.06	0.01	0.03	0.2934502	0.02	786	0.48	0.52	0.25	0.35	0.19	0.68	0.30	0.17	27130	1253	0.14	0.48	0.28	0.06	0.04	1926	0.35	0.61	1.263143	1529.51773

head(ej_natl)

tract	indicator	value_ptile	d2_ptile	d5_ptile
24001000100	pm25	20	18	25
24001000100	ozone	7	7	9
24001000100	diesel	19	18	24
24001000100	air_cancer	6	20	31
24001000100	air_respiratory	4	14	20
24001000100	releases_to_air	44	34	50

ACS data has several geographies, including census tracts (I’ve subset for just tract data). Their ID (GEOID, or FIPS codes) are in the column name. The EPA data is only by tract, and its column of IDs is labeled tract. So we’ll be joining name from acs_tract with tract from ej_natl (this is the nationwide rankings, not the state ones).

n_distinct(acs_tract$name)

[1] 1460

n_distinct(ej_natl$tract)

[1] 1475

There are 15 tracts that are included in the EPA data but not the ACS data. That’s because those are tracts with no population that I dropped from the ACS table when I made it. I can check up on that with an anti-join (not running this here but it confirms that these are all zero-population tracts).

pop <- tidycensus::get_acs("tract", table = "B01003", state = "MD", year = 2022)

anti_join(ej_natl, acs_tract, by = c("tract" = "name")) |>
  distinct(tract) |>
  inner_join(pop, by = c("tract" = "GEOID"))

There’s another hiccup for merging data here: the ACS data is in a wide format (each variable has its own column), while the EPA data is in a long format (one column gives the indicator, then different types of values have their own columns). Those formatting differences could be awkward because you’d end up with some values repeated. The easiest thing to do is select just the data you’re interested in, either by selecting certain columns or filtering rows, then reshape, then join.

Let’s say I’m interested in the relationship, if any, between demographics and a few waste-related risk factors (proximity to wastewater, hazardous waste, and superfund sites). I’ll filter ej_natl for just those 2 indicators and reshape it so the columns have the value percentiles for each of those two risk factors (not the adjusted percentiles). Then I’ll select the columns I want from acs, then join them.

The tidyr::pivot_wider and tidyr::pivot_longer functions can be confusing, but there are some good examples in the docs and a lot of Stack Overflow posts on them. Basically here I’m reshaping from a long shape to a wide shape, so I’ll use pivot_wider.

# in practice I would do this all at once, but want to keep the steps separate
# so they're more visible
waste_long <- ej_natl |>
  filter(indicator %in% c("haz_waste", "superfund", "wastewater"))

head(waste_long)

tract	indicator	value_ptile	d2_ptile	d5_ptile
24001000100	superfund	54	43	62
24001000100	haz_waste	10	10	12
24001000100	wastewater	53	40	58
24001000200	superfund	89	73	81
24001000200	haz_waste	35	43	44
24001000200	wastewater	66	65	69

# id_cols are the anchor of the pivoting
# only using value_ptile as a value column, not scaled ones
waste_wide <- waste_long |> 
  tidyr::pivot_wider(id_cols = tract, 
                     names_from = indicator, 
                     values_from = value_ptile)

head(waste_wide)

tract	superfund	haz_waste	wastewater
24001000100	54	10	53
24001000200	89	35	66
24001000500	90	42	7
24001000600	93	50	63
24001000700	92	60	68
24001000800	89	78	75

Then the columns I’m interested in from the ACS data (For good measure, I’ll also add a variable to indicate that a tract is in Baltimore city):

acs_demo <- acs_tract |>
  select(name, county, total_pop, white, poverty, foreign_born, homeownership) |>
  mutate(is_baltimore = ifelse(county == "Baltimore city", "Baltimore", "Other counties"))

head(acs_demo)

name	county	total_pop	white	poverty	foreign_born	homeownership	is_baltimore
24001000100	Allegany County	3474	0.98	0.12	0.01	0.78	Other counties
24001000200	Allegany County	4052	0.75	0.11	0.03	0.86	Other counties
24001000500	Allegany County	2304	0.66	0.27	0.04	0.60	Other counties
24001000600	Allegany County	3005	0.91	0.14	0.00	0.70	Other counties
24001000700	Allegany County	3233	0.93	0.28	0.02	0.49	Other counties
24001000800	Allegany County	1932	0.89	0.35	0.02	0.48	Other counties

So each of these two data frames has a column of tract IDs, and several columns of relevant values. I only want tracts that are in both datasets, so I’ll use an inner join.

waste_x_demo <- inner_join(acs_demo, waste_wide, by = c("name" = "tract"))

head(waste_x_demo)

name	county	total_pop	white	poverty	foreign_born	homeownership	is_baltimore	superfund	haz_waste	wastewater
24001000100	Allegany County	3474	0.98	0.12	0.01	0.78	Other counties	54	10	53
24001000200	Allegany County	4052	0.75	0.11	0.03	0.86	Other counties	89	35	66
24001000500	Allegany County	2304	0.66	0.27	0.04	0.60	Other counties	90	42	7
24001000600	Allegany County	3005	0.91	0.14	0.00	0.70	Other counties	93	50	63
24001000700	Allegany County	3233	0.93	0.28	0.02	0.49	Other counties	92	60	68
24001000800	Allegany County	1932	0.89	0.35	0.02	0.48	Other counties	89	78	75

ggplot(waste_x_demo, aes(x = poverty, y = haz_waste)) +
  geom_point(aes(color = is_baltimore, shape = is_baltimore), alpha = 0.7) +
  scale_color_manual(values = c("purple", "gray60")) +
  scale_shape_manual(values = c("circle", "triangle"))

A scatterplot of data by tract in Maryland. The x-axis shows poverty rates, ranging from 0 percent to 100 percent. The y-axis shows hazardous waste percentiles within the state, ranging from 0 to 100. Points are colored to show whether the tract is in Baltimore city. There isn't a clear correlation between the two variables, but points in the city are clustered at the high end of hazardous waste percentiles, as well as higher poverty rates.

Is there a pattern? Maybe, maybe not, but now we know how to investigate it. There’s definitely something up with Baltimore though.

Joining with a crosswalk

This follows the example in the docs for the justviz::xwalk_tract_10_to_20 dataset. This has columns for 2010 tract GEOIDs, 2020 GEOIDs, and weights to use for converting values based on 2010 geographies to those based on 2020 geographies. This is necessary if you want to join the CDC health data with the ACS or EJSCREEN data, because the CDC data uses 2010 geographies. Let’s expand the example from the docs to merge health indicators with the previous demographic and environmental ones—say we want to know if there’s a relationship between adult asthma rates and waste-related risks.

# these are based on 2010 geographies
asthma10 <- cdc |>
  filter(level == "tract",
         indicator == "Current asthma")

head(asthma10)

level	year	location	indicator	value	pop
tract	2021	24005408200	Current asthma	8.6	1853
tract	2021	24003706700	Current asthma	8.8	4890
tract	2021	24015030506	Current asthma	11.1	8223
tract	2021	24005401301	Current asthma	11.3	3891
tract	2021	24005411302	Current asthma	9.3	3310
tract	2021	24005402403	Current asthma	11.3	2283

# calculate a weighted mean of 2010 tract rates, weighted sum of 2010 counts (pop)
# both grouped by 2020 tract
asthma20 <- asthma10 |>
  inner_join(xwalk_tract_10_to_20, by = c("location" = "tract10")) |>
  group_by(tract20) |>
  # don't know why cdc rates aren't divided by 100 so do it now
  summarise(asthma = weighted.mean(value, weight) / 100,
            adult_pop = sum(pop * weight)) |>
  ungroup()

waste_x_demo_x_cdc <- waste_x_demo |>
  left_join(asthma20, by = c("name" = "tract20"))

head(waste_x_demo_x_cdc)

name	county	total_pop	white	poverty	foreign_born	homeownership	is_baltimore	superfund	haz_waste	wastewater	asthma	adult_pop
24001000100	Allegany County	3474	0.98	0.12	0.01	0.78	Other counties	54	10	53	0.1040000	3718.000
24001000200	Allegany County	4052	0.75	0.11	0.03	0.86	Other counties	89	35	66	0.0920000	4564.000
24001000500	Allegany County	2304	0.66	0.27	0.04	0.60	Other counties	90	42	7	0.1149507	2735.419
24001000600	Allegany County	3005	0.91	0.14	0.00	0.70	Other counties	93	50	63	0.1038916	2979.443
24001000700	Allegany County	3233	0.93	0.28	0.02	0.49	Other counties	92	60	68	0.1160000	3387.000
24001000800	Allegany County	1932	0.89	0.35	0.02	0.48	Other counties	89	78	75	0.1190000	2213.000

So now we’ve got a data frame with some demographic indicators, some environmental risks, and rates of a major health outcome. Expanding the previous question to include asthma rates:

ggplot(waste_x_demo_x_cdc, aes(x = haz_waste, y = asthma)) +
  geom_point(aes(color = is_baltimore, shape = is_baltimore), alpha = 0.7) +
  scale_color_manual(values = c("purple", "gray60")) +
  scale_shape_manual(values = c("circle", "triangle"))

A scatterplot of data by tract in Maryland. The x-axis shows hazardous waste percentiles within the state, ranging from 0 to 100. The y-axis shows adult asthma rates, ranging from about 7 percent to 18 percent. Points are colored to show whether the tract is in Baltimore city. There is a positive correlation between the two variables. Points in the city are clustered at the high end of hazardous waste percentiles, as well as higher asthma rates.

So there does seem to be a correlation between the two variables, with Baltimore city tracts standing out on both. ¹

¹ It’s worth noting that the CDC data uses crude rates, not age-adjusted ones, which we don’t have to get into but it becomes important with age-related health conditions like cancer. That’s why I chose asthma instead.

Joining data with different shapes

Last, an example of a situation where you might want to join one long data frame with one wide one. If I want to compare poverty rate to each of the waste risk factors, and create small multiple charts, I’ll need a variable to facet on. So I’ll go back to waste_long and join that to the ACS data. Because of those differing shapes, the ACS data will be repeated, once for each waste indicator.

waste_long_x_demo <- waste_long |>
  inner_join(acs_demo, by = c("tract" = "name")) |>
  select(tract, is_baltimore, indicator, value_ptile, poverty)

head(waste_long_x_demo)

tract	is_baltimore	indicator	value_ptile	poverty
24001000100	Other counties	superfund	54	0.12
24001000100	Other counties	haz_waste	10	0.12
24001000100	Other counties	wastewater	53	0.12
24001000200	Other counties	superfund	89	0.11
24001000200	Other counties	haz_waste	35	0.11
24001000200	Other counties	wastewater	66	0.11

As you can see, the first of these tracts (24001000100) takes up 3 rows, one for each of the 3 waste indicators. To match it, that tract’s poverty rate (0.12) is repeated 3 times.

ggplot(waste_long_x_demo, aes(x = value_ptile, y = poverty)) +
  geom_point(aes(color = is_baltimore, shape = is_baltimore), alpha = 0.7) +
  scale_color_manual(values = c("purple", "gray60")) +
  scale_shape_manual(values = c("circle", "triangle")) +
  facet_wrap(vars(indicator), nrow = 1) +
  theme(legend.position = "bottom")

Small multiple panels of 3 scatterplots of data by tract in Maryland. For each panel, the x-axis shows percentile rankings of one of 3 environmental risk factors (superfunds, hazardous waste, and wastewater). Each of these ranges from 0 to 100. The y-axis shows poverty rates, ranging from 0 percent to 100 percent. Points are colored to show whether the tract is in Baltimore city. There aren't clear correlations between the variables, but Baltimore city tracts are clustered at the higher ends of superfund and hazardous waste rankings.