Fuzzy Matching with Texas High School Academic Competition Results and SAT/ACT Scores

Introduction

As a follow-up to [a previous post](https://tonyelhabr.rbind.io/posts/correlations-texas-high-school-academics/) about correlations between Texas high school academic UIL competition scores and SAT/ACT scores, I wanted to explore some of the alternatives for joining the two data sets, which come from different sources. In that post, I simply performed an inner_join() using the school and city names as keys. While this decision ensures that the data integrity is “high”, there are potentially many unmatched schools that could have been included in the analysis with some sound “fuzzy matching”. For my exploration here, I’ll leverage David Robinson’s excellent {fuzzyjoin} package.

My dissatisfaction with the results of my various fuzzy-matching attempts ultimately led me to “settle” upon the exact-matching approach, an inner_join() on school and city names, that I used in the previous post.

Fuzzy Matching

In its raw form, the UIL data, scraped from https://www.hpscience.net/ (the only site that I found to have the academic competition scores), is much less clean than the SAT and ACT data from the TEA website. In particular, many of the city, school, and individual participant names are inconsistent across different webpages. 1 Even after eliminating much of the “self-inconsistency” of the UIL data and creating a suitably “clean” data set to use as the basis for my series of posts exploring the UIL data exclusively, there are more than a few differences in the names of schools between the UIL and TEA data.

To aid in my experimentation with “fuzzy joining”, I use a function from my {tetidy} package—join_fuzzily(), which is essentially a wrapper for the stringdist_*_join() functions provided by the {fuzzyjoin} package. With this function, I was able to evaluate different values for max_dist and different join columns to help me make a judgement regarding the quality of joins. I primarily considered counts of joined and unjoined rows computed with the summarise_join_stats() function—also from my {tetidy} package—to make a decision on how I should join the UIL and SAT/ACT school data. 2
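The stringdist_*_join() functions match keys whose string distance is at most max_dist. To make that concrete, here is a minimal sketch of the Levenshtein edit distance (one of the metrics that {fuzzyjoin} supports), written in Python purely for illustration; the school-name pairs below are hypothetical examples, not taken from the data.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance: the minimum number of
    # single-character insertions, deletions, and substitutions needed
    # to turn string a into string b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

levenshtein("ABBOTT", "ABBOT")  # -> 1: a plausible misspelling, matched at max_dist = 1
levenshtein("OLTON", "ALTON")   # -> 1: two different names, also matched at max_dist = 1
```

Note the double edge of max_dist = 1: it recovers one-character misspellings, but it equally matches genuinely distinct names that happen to be a single edit apart.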

What follows are the results of some of my experimentation. Of course, let’s first have a look at the data sets that we’re working with. (See my other posts to see how I created these data sets.)

schools_tea
| test | year | school | district | county | city | math | reading | writing | english | science | total |
|------|------|--------|----------|--------|------|------|---------|---------|---------|---------|-------|
| ACT | 2011 | A C JONES | BEEVILLE ISD | BEE | CORPUS CHRISTI | 19 | 18 | NA | 17 | 19 | 18 |
| ACT | 2011 | A J MOORE ACAD | WACO ISD | MCLENNAN | WACO | 19 | 18 | NA | 16 | 18 | 18 |
| ACT | 2011 | A M CONS | COLLEGE STATION ISD | BRAZOS | HUNTSVILLE | 26 | 24 | NA | 23 | 24 | 24 |
| ACT | 2011 | A MACEO SMITH HIGH SCHOOL | DALLAS ISD | DALLAS | RICHARDSON | 16 | 14 | NA | 13 | 15 | 14 |
| ACT | 2011 | ABBOTT SCHOOL | ABBOTT ISD | HILL | WACO | 20 | 20 | NA | 19 | 21 | 20 |
| ACT | 2011 | ABERNATHY | ABERNATHY ISD | HALE | LUBBOCK | 22 | 20 | NA | 19 | 21 | 21 |
| ACT | 2011 | ABILENE | ABILENE ISD | TAYLOR | ABILENE | 21 | 21 | NA | 20 | 21 | 21 |
| ACT | 2011 | ACADEMY | ACADEMY ISD | BELL | WACO | 24 | 23 | NA | 21 | 24 | 23 |
| ACT | 2011 | ACADEMY HIGH SCHOOL | HAYS CISD | HAYS | AUSTIN | NA | NA | NA | NA | NA | NA |
| ACT | 2011 | ACADEMY OF CAREERS AND TECHNOLOGIE | ACADEMY OF CAREERS AND TECHNOLOGIE | BEXAR | SAN ANTONIO | 15 | 14 | NA | 12 | 14 | 14 |
| ACT | 2011 | ACADEMY OF CREATIVE ED | NORTH EAST ISD | BEXAR | SAN ANTONIO | NA | NA | NA | NA | NA | NA |
| ACT | 2011 | ADRIAN SCHOOL | ADRIAN ISD | OLDHAM | AMARILLO | 19 | 18 | NA | 20 | 19 | 19 |
| ACT | 2011 | ADVANTAGE ACADEMY | ADVANTAGE ACADEMY | DALLAS | RICHARDSON | 18 | 20 | NA | 19 | 16 | 18 |
| ACT | 2011 | AGUA DULCE | AGUA DULCE ISD | NUECES | CORPUS CHRISTI | 21 | 19 | NA | 18 | 20 | 19 |
| ACT | 2011 | AIM CENTER | VIDOR ISD | ORANGE | BEAUMONT | NA | NA | NA | NA | NA | NA |
| ACT | 2011 | AKINS | AUSTIN ISD | TRAVIS | AUSTIN | 19 | 17 | NA | 16 | 17 | 17 |
| ACT | 2011 | ALAMO HEIGHTS | ALAMO HEIGHTS ISD | BEXAR | SAN ANTONIO | 25 | 24 | NA | 24 | 24 | 24 |
| ACT | 2011 | ALBA-GOLDEN | ALBA-GOLDEN ISD | WOOD | KILGORE | 20 | 19 | NA | 18 | 20 | 19 |
| ACT | 2011 | ALBANY JR-SR | ALBANY ISD | SHACKELFORD | ABILENE | 24 | 22 | NA | 21 | 22 | 22 |
| ACT | 2011 | ALDINE | ALDINE ISD | HARRIS | HOUSTON | 19 | 17 | NA | 16 | 18 | 18 |

# of total rows: 15,073
schools_uil
| school | city | complvl_num | score | year | conf | complvl | comp | advanced | n_state | n_bycomp | prnk | n_defeat | w |
|--------|------|-------------|-------|------|------|---------|------|----------|---------|----------|------|----------|---|
| HASKELL | HASKELL | 13 | 616 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 8 | 1.00 | 7 | TRUE |
| POOLVILLE | POOLVILLE | 13 | 609 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 8 | 0.86 | 6 | FALSE |
| LINDSAY | LINDSAY | 17 | 553 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
| PLAINS | PLAINS | 3 | 537 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |
| SAN ISIDRO | SAN ISIDRO | 32 | 534 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 4 | 1.00 | 3 | TRUE |
| CANADIAN | CANADIAN | 7 | 527 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
| GARDEN CITY | GARDEN CITY | 10 | 518 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 8 | 1.00 | 7 | TRUE |
| WATER VALLEY | WATER VALLEY | 10 | 478 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 8 | 0.86 | 6 | FALSE |
| GRUVER | GRUVER | 7 | 464 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.83 | 5 | FALSE |
| YANTIS | YANTIS | 19 | 451 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |
| SHINER | SHINER | 27 | 450 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 9 | 1.00 | 8 | TRUE |
| WEST TEXAS | STINNETT | 7 | 443 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.67 | 4 | FALSE |
| HONEY GROVE | HONEY GROVE | 17 | 440 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 0.83 | 5 | FALSE |
| LATEXO | LATEXO | 23 | 439 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |
| MUENSTER | MUENSTER | 17 | 436 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.67 | 4 | FALSE |
| VAN HORN | VAN HORN | 1 | 436 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
| SLOCUM | ELKHART | 23 | 415 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 10 | 0.89 | 8 | FALSE |
| ERA | ERA | 17 | 415 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.50 | 3 | FALSE |
| GOLDTHWAITE | GOLDTHWAITE | 15 | 413 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
| NEWCASTLE | NEWCASTLE | 12 | 408 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |

# of total rows: 27,359

For my first attempt at fuzzy matching, I tried joining using only school as a key column and setting max_dist = 1. (Note that I do a “full” join because my tetidy::summarise_join_stats() function works best with this kind of input.)

library("tidyverse")
schools_uil_distinct <-
  schools_uil %>%
  distinct(school, city)
summ_schools_joinfuzzy_1 <-
  schools_tea %>%
  tetidy::join_fuzzily(
    schools_uil_distinct,
    mode = "full",
    max_dist = 1,
    cols_join = c("school"),
    suffix_x = "_tea",
    suffix_y = "_uil"
  ) %>%
  tetidy::summarise_join_stats(school_uil, school_tea) %>% 
  select(-x, -y) %>% 
  gather(metric, value)
summ_schools_joinfuzzy_1
| metric | value |
|--------|-------|
| n_x | 12,367 |
| n_y | 17,542 |
| n_joined | 12,119 |
| n_x_unjoined | 5,423 |
| n_y_unjoined | 248 |
| x_in_y_pct | 98 |
| y_in_x_pct | 69 |
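For intuition about these metrics, here is a rough sketch, in Python with made-up keys, of how such coverage counts can be computed from the two sides of a full join. These are my inferred definitions (x = the UIL side, y = the TEA side), not necessarily {tetidy}'s actual implementation.

```python
# Toy illustration of join-coverage stats from two sets of join keys.
x_keys = {"ABBOTT", "ABERNATHY", "ABILENE", "ACADEMY"}  # e.g. UIL schools
y_keys = {"ABBOTT", "ABILENE", "ALDINE"}                # e.g. TEA schools

joined = x_keys & y_keys      # keys present on both sides
x_unjoined = x_keys - y_keys  # x keys with no match in y
y_unjoined = y_keys - x_keys  # y keys with no match in x

stats = {
    "n_x": len(x_keys),
    "n_y": len(y_keys),
    "n_joined": len(joined),
    "n_x_unjoined": len(x_unjoined),
    "n_y_unjoined": len(y_unjoined),
    # Share of each side's keys covered by the join, as a rounded percentage.
    "x_in_y_pct": round(100 * len(joined) / len(x_keys)),
    "y_in_x_pct": round(100 * len(joined) / len(y_keys)),
}
```

Under these definitions the table above is internally consistent: 12,119 joined out of 12,367 x keys is the reported 98%, and 12,119 out of 17,542 y keys is the reported 69%.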

This kind of join represents a very “optimistic” or “over-zealous” implementation that likely results in matches that are not “true”. Nonetheless, it’s hard to really conclude anything about the quality of the join without comparing it to the results of another attempt.

Next, I tried the same join, only setting max_dist = 0 this time. Note that this is really like “exact” string matching, meaning that there is not really any fuzzy matching going on. (Nonetheless, it was worthwhile to evaluate the counts of matches and mis-matches with a full join.)

summ_schools_joinfuzzy_2 <-
  schools_tea %>%
  tetidy::join_fuzzily(
    schools_uil_distinct,
    mode = "full",
    max_dist = 0,
    cols_join = c("school"),
    suffix_x = "_tea",
    suffix_y = "_uil"
  ) %>%
  tetidy::summarise_join_stats(school_uil, school_tea) %>% 
  select(-x, -y) %>% 
  gather(metric, value)
summ_schools_joinfuzzy_2
| metric | value |
|--------|-------|
| n_x | 11,666 |
| n_y | 16,898 |
| n_joined | 11,399 |
| n_x_unjoined | 5,499 |
| n_y_unjoined | 267 |
| x_in_y_pct | 98 |
| y_in_x_pct | 67 |

As we should expect, this results in a lower number of joins, but my feeling is that the join is still too naive as a result of using only one key column for joining—school.

What about changing the key columns to school and city and setting max_dist = 1? (Note that I unite() the two columns only for my tetidy::summarise_join_stats() function, which can only work with a single “key” column. If only joining the data, the two join columns could be kept separate.)

summ_schools_joinfuzzy_3 <-
  schools_tea %>%
  unite(school_city, school, city, remove = FALSE) %>%
  tetidy::join_fuzzily(
    schools_uil_distinct %>%
      unite(school_city, school, city, remove = FALSE),
    mode = "full",
    max_dist = 1,
    cols_join = c("school_city"),
    suffix_x = "_tea",
    suffix_y = "_uil"
  ) %>%
  tetidy::summarise_join_stats(school_city_uil, school_city_tea) %>% 
  select(-x, -y) %>% 
  gather(metric, value)
summ_schools_joinfuzzy_3
| metric | value |
|--------|-------|
| n_x | 2,842 |
| n_y | 15,076 |
| n_joined | 1,697 |
| n_x_unjoined | 13,379 |
| n_y_unjoined | 1,145 |
| x_in_y_pct | 60 |
| y_in_x_pct | 11 |

This is probably the best option of the many alternatives that I tested—including many not shown here that try different columns for joining and increasing the value of max_dist. In fact, the number of rows that would be returned with an inner join and the same settings (i.e. max_dist = 1 and the key columns)—as indicated by the value n_joined—is only slightly greater than that which you get when you combine the two data sets with an inner_join() on school and city, as I ended up doing. 3

Conclusion

In some ways, I think this is not too unlike the classic “bias vs. variance” trade-off in machine learning. To what extent do you try to minimize error with your training data, at the risk of overfitting your model? Or do you try to make a simpler model that generalizes better to data outside the training set, potentially creating a model that is “underfit” because it is too simple? In this case, the question is: to what extent do you try to maximize the number of observations joined, perhaps creating some false matches that compromise the data and the results? And, from the other perspective, if you exclude data solely because of poor matching, how can you guarantee that your data are “representative” and that your results are legitimate? It’s a tricky trade-off, as with many things in the realm of data analysis. And, as with most things, the best solution depends heavily on the context. In this case, I think erring on the side of caution with a “conservative” joining policy is a logical choice: plenty of rows are matched anyway, and one can reasonably assume that there is no underlying “trend” or “bias” among the schools that go unmatched. (That is, the unmatched schools are just as likely to be superior or inferior academically, relative to all other schools, as those that are matched.)

Something that I did not consider here, but which is certainly worthy of exploration, is neural network text classification. With this approach, I could quantify the probability that one string is “equivalent” to another, and choose the match with the highest probability for any given string. I have a feeling that this kind of approach would be more successful, although it does not seem as easy to implement. (I may come back and visit this idea in a future post.)
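As a much cruder stand-in for that idea (no neural network involved), one could score candidate pairs with a normalized similarity and keep the best-scoring match per name. The sketch below uses Python's standard-library SequenceMatcher and made-up school names purely for illustration.

```python
from difflib import SequenceMatcher

def best_match(name: str, candidates: list[str]) -> tuple[float, str]:
    # Score each candidate in [0, 1] (1 = identical) and keep the
    # highest-scoring one.
    scored = [(SequenceMatcher(None, name, c).ratio(), c) for c in candidates]
    return max(scored)

# Hypothetical example: resolve an abbreviated name against a reference list.
candidates = ["A C JONES", "A J MOORE ACAD", "ALAMO HEIGHTS"]
score, match = best_match("ALAMO HTS", candidates)  # match -> "ALAMO HEIGHTS"
```

A natural refinement would be to reject best matches whose score falls below some threshold, which mimics the role that max_dist plays in the {fuzzyjoin} approach.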


  1. In fact, the rigor involved in cleaning the UIL data obliged me to completely hide it from the reader in my write-ups on the topic.

  2. Check out my post on the “DRY” principle and its application to R packages. Creating packages for actions that you perform across projects is a real time-saver!

  3. Actually, I also joined on year in the previous post (in order to make sure that school scores across years were “aligned” properly).
