# Correlations Between Texas High School Academic Competition Results and SAT/ACT Scores

## Introduction

I wanted to do a follow-up on my series of posts about Texas high school University Interscholastic League (UIL) academic competitions to more closely evaluate the relationship between the school performance in those competitions with school-wide SAT) and ACT scores. For those who may not be familiar with these tests, these are the two most popular standardized tests used for college admission in the United States.

In my introduction to that series, I stated the following: School-wide … scores on state- and national-standardized tests (e.g. the SAT) certainly are the most common measure of academic strength, but I think rankings by academic competitions may be more indicative.

Essentially, I was implying that the academic UIL scores may not correspond well, or at all, with standardized test scores. However, I did not attempt to prove this hypothesis, which is what I set out to do here. While I’m at it, I’ll show the code and provide some commentary to explain my process.

## Data Collection

While I already have collected and cleaned the UIL data that I’ll need by virtue of my work for my series of posts analyzing the UIL competitions, I did not retrieve data for standardized test scores. To my delight, the Texas Education Agency’s website publishes Texas high school SAT and ACT scores for the years 2011 through 2015. The task of scraping from this source is a perfect use-case for the super-handy {xml2} and {rvest} packages, as well the well-known awesome {stringr} and {purrr} packages in the {tidyverse}.

library("tidyverse")
library("rlang")
library("teplot") # Personal package.

urls_tea <-
"https://tea.texas.gov/acctres/sat_act_index.html" %>%
rvest::html_nodes(xpath = "//tr //td //a") %>%
rvest::html_attr("href") %>%
str_subset("^\\/acctres\\/[:alpha:]{3}_[Cc]ampus_[Dd]ata")
urls_tea

create_path_tea <-
function(url_suffix, dir = "data-raw", ext = "csv") {
if(!dir.exists(dir)) {
dir.create(dir)
}
url_suffix %>%
str_remove_all("acctres|\\/") %>%
paste0(".", ext) %>%
file.path(dir, .)
}

# NOTE(s):
# + urls_tea_dl is actually the same as url_tea because purrr::walk() returns its first argument.
# + mode = "wb" is important! Otherwise, the downloaded files have empty lines every other line
# (due to the way that CR and LFs are handled.
urls_tea_dl <-
urls_tea %>%
walk(
url = paste0("https://tea.texas.gov/", .x),
destfile = create_path_tea(url_suffix = .x),
mode = "wb"
)
)


## Data Cleaning

Next, I bind the data from all of the downloaded files together and do some cleaning. I put these actions in function(s) because I plan on re-using them in future posts where I explore this data set in other ways.

One relatively significant choice that I make here is to only include the data for the school-wide level (via the "All Students" filtering criteria), although data for different demographics within each school is provided. The other data set that I am evaluating—the academic UIL data— does not have demographci-specific information, so I want to treat the two set as “equally” as possible.

Additionally, in order to better understand the resulting data set, the reader should be made aware of some of the details of the tests. The SAT has math, reading, and writing sections, each having minimum and maximum scores of 200 and 800, meaning that the total can range from 600 to 2400. The ACT has math, reading, english, and science sections, each having a minimum and maximum score of 1 and 36, combined for a single compos score also ranging from 1 to 36. To eliminate duplicate columns representing the same underlying “thing”. I don’t distinguish the math and reading section scores for each test in separate columns, I rename the ACT’s compos score to total, following the convention used for the SAT’s cumulative score. The other sections—writing for the SAT and english and science for the ACT— are not really analogous to sections in the other test, so they are filled with NAs appropriately.

Finally, for the interEsted reader, there are some details regarding the code implementation that I document in comments (both for explaining actions for myself and for the reader).

import_tea_data <-
function(path, rgx_grp) {
res <-
path %>%
rename_all(funs(tolower))

if(!is.null(rgx_grp)) {
res <-
res %>%
filter(group %>% str_detect(rgx_grp))
}
res <-
res %>%
select(
matches(
"^group$|name$|math|reading|writing|total|english|science|compos"
)
)
res
}

import_tea_data_cleanly <-
function(urls, rgx_grp, ...) {

res <-
urls %>%
create_path_tea(...) %>%
tibble(path = .) %>%
mutate(
test = stringr::str_extract(path, "([Ss][Aa][Tt])|([Aa][Cc][Tt])") %>% toupper(),
year = stringr::str_extract(path, "[:digit:]+") %>% as.integer()
) %>%
mutate(contents = purrr::map(path, ~import_tea_data(.x, rgx_grp = rgx_grp))) %>%
unnest() %>%
# NOTE: No longer need this columns(s) any more.
select(-path) %>%
mutate_at(vars(total), funs(ifelse(test == "ACT", compos, .))) %>%
# NOTE: No longer need this column(s) any more.
select(-compos) %>%
# NOTE: Rearranging score columns in a more logical fashion.
select(-total, everything(), total) %>%
# NOTE: Renaming "important" columns.
rename(school = campname,
district = distname,
county = cntyname,
city = regnname) %>%
mutate_if(is.character, funs(str_replace_all(., "=|\"", ""))) %>%
mutate_at(vars(school, district, county, city), funs(toupper)) %>%
# NOTE: Some county names are truncated and end with COUN or COUNT.
# (The max seems to be 18 characters).
# Fortunately, ther are no county names with COUN in their names, so the following
# regular expression is sufficient.
mutate_at(vars(county), funs(str_remove_all(., "\\s+COUN.*$"))) %>% # NOTE: Remove all HS/H S at the end of school names, as well as ampersands. # This seems to improve join percentages with other data sets. mutate_at(vars(school), funs(str_remove_all(., "([H]\\s*[S]$)|(\\s+\\&)") %>% str_trim())) %>%
# NOTE: This is (try to) to resolve duplicates in raw data.
# group_by_at(vars(matches("test|year|school|district|county|city"))) %>%
# summarise_all(funs(max(., na.rm = TRUE))) %>%
# ungroup() %>%
arrange(test, year, school)
res
}

schools_tea <-
urls_tea %>%
import_tea_data_cleanly(rgx_grp = "All Students")
schools_tea

test year school district county city math reading writing english science total
ACT 2011 A C JONES BEEVILLE ISD BEE CORPUS CHRISTI 19 18 NA 17 19 18
ACT 2011 A J MOORE ACAD WACO ISD MCLENNAN WACO 19 18 NA 16 18 18
ACT 2011 A M CONS COLLEGE STATION ISD BRAZOS HUNTSVILLE 26 24 NA 23 24 24
ACT 2011 A MACEO SMITH HIGH SCHOOL DALLAS ISD DALLAS RICHARDSON 16 14 NA 13 15 14
ACT 2011 ABBOTT SCHOOL ABBOTT ISD HILL WACO 20 20 NA 19 21 20
ACT 2011 ABERNATHY ABERNATHY ISD HALE LUBBOCK 22 20 NA 19 21 21
ACT 2011 ABILENE ABILENE ISD TAYLOR ABILENE 21 21 NA 20 21 21
ACT 2011 ACADEMY ACADEMY ISD BELL WACO 24 23 NA 21 24 23
ACT 2011 ACADEMY HIGH SCHOOL HAYS CISD HAYS AUSTIN NA NA NA NA NA NA
ACT 2011 ACADEMY OF CAREERS AND TECHNOLOGIE ACADEMY OF CAREERS AND TECHNOLOGIE BEXAR SAN ANTONIO 15 14 NA 12 14 14
ACT 2011 ACADEMY OF CREATIVE ED NORTH EAST ISD BEXAR SAN ANTONIO NA NA NA NA NA NA
ACT 2011 ADRIAN SCHOOL ADRIAN ISD OLDHAM AMARILLO 19 18 NA 20 19 19
ACT 2011 AGUA DULCE AGUA DULCE ISD NUECES CORPUS CHRISTI 21 19 NA 18 20 19
ACT 2011 AIM CENTER VIDOR ISD ORANGE BEAUMONT NA NA NA NA NA NA
ACT 2011 AKINS AUSTIN ISD TRAVIS AUSTIN 19 17 NA 16 17 17
ACT 2011 ALAMO HEIGHTS ALAMO HEIGHTS ISD BEXAR SAN ANTONIO 25 24 NA 24 24 24
ACT 2011 ALBA-GOLDEN ALBA-GOLDEN ISD WOOD KILGORE 20 19 NA 18 20 19
ACT 2011 ALBANY JR-SR ALBANY ISD SHACKELFORD ABILENE 24 22 NA 21 22 22
ACT 2011 ALDINE ALDINE ISD HARRIS HOUSTON 19 17 NA 16 18 18
1 # of total rows: 15,073

### EDA: Year-to-Year Correlations

First, before evaluating the primary concern at hand—the relationship between the academic UIL scores and the SAT/ACT scores (available in the schools_tea data created above)—I want to verify that there is some non-trivial relationship among the scores for a given school on a given test across years. (I would be surprised if this were not shown to be true.)

schools_tea_cors_byyear <-
schools_tea %>%
distinct(test, year, school, .keep_all = TRUE) %>%
filter(!is.na(total)) %>%
unite(test_school, test, school) %>%
widyr::pairwise_cor(
feature = test_school,
item = year,
value = total
) %>%
rename(year1 = item1, year2 = item2, cor = correlation)
schools_tea_cors_byyear %>%
filter(year1 <= year2)

year1 year2 cor
2011 2012 0.80
2011 2013 0.76
2012 2013 0.86
2011 2014 0.69
2012 2014 0.78
2013 2014 0.83
2011 2015 0.64
2012 2015 0.74
2013 2015 0.78
2014 2015 0.86
![](viz_schools_tea_cors_byyear_show-1.png) As expected, there are some strong correlations among the years for school-wide scores on these tests. Ok, now let’s bring in the “cleaned” school data (schools_uil) that I collected and cleaned in my UIL analysis. I’ll subset the data to include only the same years found in schools_tea—2011 through 2015.
school city complvl_num score year conf complvl comp advanced n_state n_bycomp prnk n_defeat w
HASKELL HASKELL 13 616 2011 1 District Calculator Applications 1 0 8 1.00 7 TRUE
POOLVILLE POOLVILLE 13 609 2011 1 District Calculator Applications 1 0 8 0.86 6 FALSE
LINDSAY LINDSAY 17 553 2011 1 District Calculator Applications 1 0 7 1.00 6 TRUE
PLAINS PLAINS 3 537 2011 1 District Calculator Applications 1 0 10 1.00 9 TRUE
SAN ISIDRO SAN ISIDRO 32 534 2011 1 District Calculator Applications 1 0 4 1.00 3 TRUE
CANADIAN CANADIAN 7 527 2011 1 District Calculator Applications 1 0 7 1.00 6 TRUE
GARDEN CITY GARDEN CITY 10 518 2011 1 District Calculator Applications 1 0 8 1.00 7 TRUE
WATER VALLEY WATER VALLEY 10 478 2011 1 District Calculator Applications 0 0 8 0.86 6 FALSE
GRUVER GRUVER 7 464 2011 1 District Calculator Applications 0 0 7 0.83 5 FALSE
YANTIS YANTIS 19 451 2011 1 District Calculator Applications 1 0 10 1.00 9 TRUE
SHINER SHINER 27 450 2011 1 District Calculator Applications 1 0 9 1.00 8 TRUE
WEST TEXAS STINNETT 7 443 2011 1 District Calculator Applications 0 0 7 0.67 4 FALSE
HONEY GROVE HONEY GROVE 17 440 2011 1 District Calculator Applications 1 0 7 0.83 5 FALSE
LATEXO LATEXO 23 439 2011 1 District Calculator Applications 1 0 10 1.00 9 TRUE
MUENSTER MUENSTER 17 436 2011 1 District Calculator Applications 0 0 7 0.67 4 FALSE
VAN HORN VAN HORN 1 436 2011 1 District Calculator Applications 1 0 7 1.00 6 TRUE
SLOCUM ELKHART 23 415 2011 1 District Calculator Applications 0 0 10 0.89 8 FALSE
ERA ERA 17 415 2011 1 District Calculator Applications 0 0 7 0.50 3 FALSE
GOLDTHWAITE GOLDTHWAITE 15 413 2011 1 District Calculator Applications 1 0 7 1.00 6 TRUE
NEWCASTLE NEWCASTLE 12 408 2011 1 District Calculator Applications 1 0 10 1.00 9 TRUE
1 # of total rows: 27,359

Now let’s try to evaluate whether or not year-to-year correlations also exist with this data set.

Importantly, some choice about how to quantify performance needs to be made. As I discussed in my long-form series of posts exploring the UIL academic data, the evaluation of performance is somewhat subjective. Should we use number of times a school advanced to the next level of competition in a given year? (Note that there are three competition levels—District, Region, and State.) What about the number the number of other schools it “defeated” in head-to-head competitions? In that separate analysis, I made the choice to use the percentile rank (prnk) of the school’s placings across all competition levels for a given competition type (comp). I believe this measure bests represent a school’s quality of performance (where a higher value indicates better performance). As I stated there when explaining my choice to use percent rank for identifying “dominant” individual“, ”I choose to use percent rank—which is a always a value between 0 and 1—because it inherently accounts for the wide range of number of competitors across all competitions. (For this context, a percent rank of 1 corresponds to the highest score in a given competition, and, conversely, a value of 0 corresponds to the lowest score.)”

Aside from this decision regarding performance evaluation in academic UIL competitions, note that I treat the competition type (comp) in schools_uil as analogous to the test variable indicating SAT or ACT score in the schools_tea data set. For those who have not read through my UIL analysis, note that scores for five different competition types was collected—Calculator Applications, Computer Science, Mathematics, Number Sense, and Science.

schools_uil_cors_byyear <-
schools_uil %>%
select(year, school, city, comp, prnk) %>%
group_by(year, school, city, comp) %>%
summarise(prnk_sum = sum(prnk, na.rm = TRUE)) %>%
ungroup() %>%
unite(comp_school, comp, school) %>%
widyr::pairwise_cor(
feature = comp_school,
item = year,
value = prnk_sum
) %>%
rename(year1 = item1, year2 = item2, cor = correlation)

schools_uil_cors_byyear %>%
filter(year1 <= year2)


table class=“table” style=“width: auto !important; margin-left: auto; margin-right: auto;“> year1 year2 cor 2011 2012 0.74 2011 2013 0.63 2012 2013 0.72 2011 2014 0.53 2012 2014 0.60 2013 2014 0.75 2011 2015 0.48 2012 2015 0.52 2013 2015 0.61 2014 2015 0.70

We can see that correlations among years do exist, as we would expect. The strength of the correlations decrease for years that are farther apart, which is also what we might expect.

### “Final” Correlation Analysis

So, at this point, I have set myself up to do that which I set out to do—evaluate the relationship between the academic UIL competition scores and the national SAT/ACT scores.

In order to put the two sets of data on “equal grounds”, I only evaluate math scores. In particular, I filter comp in the UIL data to just the mathematically-based competitions—Calculator Applications, Mathematics, and Number Sense—excluding Science and Computer Science. And, for the SAT/ACT data, I select only the math score, which is available fore both tests, excluding the total and reading scores also available for each and the writing, english, and science scores available for one or the other. (Perhaps the ACT’s science score could be compared to the Science UIL scores, but I choose not to do so here.)

schools_uil_math <-
schools_uil %>%
filter(str_detect(comp, "Calculator|Math|Number")) %>%
group_by(year, school, city) %>%
summarise(prnk_sum = sum(prnk, na.rm = TRUE)) %>%
ungroup() %>%
# NOTE: "Renormalize" prnk_sum.
mutate(math_prnk = percent_rank(prnk_sum)) %>%
select(-prnk_sum)
schools_uil_math

year school city math_prnk
2011 ABBOTT ABBOTT 0.82
2011 ABERNATHY ABERNATHY 0.59
2011 ABILENE ABILENE 0.00
2011 ACADEMY OF FINE ARTS FORT WORTH 0.55
2011 AGUA DULCE AGUA DULCE 0.57
2011 ALAMO HEIGHTS SAN ANTONIO 0.70
2011 ALBA-GOLDEN ALBA 0.72
2011 ALBANY ALBANY 0.95
2011 ALEDO ALEDO 0.89
2011 ALEXANDER LAREDO 0.56
2011 ALICE ALICE 0.10
2011 ALLEN ALLEN 0.85
2011 ALPINE ALPINE 0.57
2011 ALTO ALTO 0.19
1 # of total rows: 5,596
schools_tea_math <-
schools_tea %>%
select(test, year, school, city, math) %>%
filter(!is.na(math)) %>%
group_by(test) %>%
mutate(math_prnk = percent_rank(math)) %>%
ungroup() %>%
group_by(year, school, city) %>%
summarise_at(vars(math_prnk), funs(mean(., na.rm = TRUE))) %>%
ungroup()
schools_tea_math

year school city math_prnk
2011 A C JONES CORPUS CHRISTI 0.51
2011 A J MOORE ACAD WACO 0.24
2011 A M CONS HUNTSVILLE 0.97
2011 A MACEO SMITH HIGH SCHOOL RICHARDSON 0.03
2011 ABBOTT SCHOOL WACO 0.72
2011 ABERNATHY LUBBOCK 0.63
2011 ABILENE ABILENE 0.60
2011 ACADEMY HIGH SCHOOL AUSTIN 0.32
2011 ACADEMY OF CAREERS AND TECHNOLOGIE SAN ANTONIO 0.03
2011 ACADEMY OF CREATIVE ED SAN ANTONIO 0.48
2011 AGUA DULCE CORPUS CHRISTI 0.66
2011 AKINS AUSTIN 0.25
2011 ALAMO HEIGHTS SAN ANTONIO 0.95
2011 ALBA-GOLDEN KILGORE 0.52
2011 ALBANY JR-SR ABILENE 0.83
2011 ALDINE HOUSTON 0.23
1 # of total rows: 7,730
schools_join_math <-
schools_tea_math %>%
rename_at(vars(matches("^math")), funs(paste0("tea_", .))) %>%
inner_join(schools_uil_math %>%
rename_at(vars(matches("^math")), funs(paste0("uil_", .))),
by = c("year", "school", "city")) %>%
select(year, school, city, matches("math"))
schools_join_math

year school city tea_math_prnk uil_math_prnk
2011 ABILENE ABILENE 0.60 0.00
2011 ALAMO HEIGHTS SAN ANTONIO 0.95 0.70
2011 AMERICAS EL PASO 0.31 0.69
2011 ANDERSON AUSTIN 0.98 0.64
2011 ANDRESS EL PASO 0.15 0.56
2011 ARLINGTON HEIGHTS FORT WORTH 0.63 0.49
2011 AUSTIN AUSTIN 0.89 0.22
2011 AUSTIN EL PASO 0.22 0.68
2011 AUSTIN HOUSTON 0.09 0.85
2011 BEL AIR EL PASO 0.17 0.49
2011 BERKNER RICHARDSON 0.80 0.50
2011 BOSWELL FORT WORTH 0.83 0.22
2011 BOWIE AUSTIN 0.97 0.70
2011 BOWIE EL PASO 0.15 0.15
2011 BRANDEIS SAN ANTONIO 0.79 0.39
2011 BREWER FORT WORTH 0.48 0.19
2011 BURBANK SAN ANTONIO 0.22 0.76
2011 BURGES EL PASO 0.57 0.70
2011 CALALLEN CORPUS CHRISTI 0.47 0.93
2011 CANUTILLO EL PASO 0.18 0.81
1 # of total rows: 699
schools_join_math_cors <-
schools_join_math %>%
select(-year) %>%
select_if(is.numeric) %>%
corrr::correlate()
schools_join_math_cors

rowname tea_math_prnk uil_math_prnk
tea_math_prnk NA 0.36
uil_math_prnk 0.36 NA

So, this correlation value—0.36—seems fairly low. At face value, it certainly does not provide any basis to claim that schools that do well in the academic UIL competitions also do well with SAT/ACT tests. However, perhaps if I used a different methodology, the result would be different. Other metrics used to quantify academic UIL performance could be tested in some kind of sensitivity analysis.

### EDA: Year-to-Year Correlations, Cont.

While I won’t do any kind of rigorous second evaluation here, I do want to try to quantify the impact of the “missing” data dropped due to mismatched school names. If all possible data had been used, would the final correlation value have increased (or decreased) with more (or less) data? Although finding direct answer to this question is impossible, we can evaluate the difference in the year-to-year correlations of scores from the schools that are joined with the correlations calculated for all in “unjoined” schools_tea and schools_uil data sets. If we find that there are large discrepancies (one way or the other), then we may have some reason to believe that the 0.36 number found above is misleading.

To perform this task, I create a couple of intermediary data sets, as well as some functions.

schools_postjoin_math_tidy <-
schools_join_math %>%
unite(school_city, school, city) %>%
gather(metric, value, matches("prnk"))

pairwise_cor_f1 <-
function(data, which = c("tea", "uil")) {
which <- match.arg(which)
data %>%
filter(metric %>% str_detect(which)) %>%
# filter_at(vars(value), all_vars(!is.nan(.))) %>%
widyr::pairwise_cor(
feature = school_city,
item = year,
value = value
) %>%
rename(year1 = item1, year2 = item2, cor = correlation) %>%
mutate(source = which %>% toupper())
}

pairwise_cor_f2 <-
function(data, which = c("tea", "uil")) {
which <- match.arg(which)
col <-
data %>%
names() %>%
str_subset("math")
data %>%
unite(school_city, school, city) %>%
rename(value = !!rlang::sym(col)) %>%
mutate(source = which %>% toupper()) %>%
widyr::pairwise_cor(
feature = school_city,
item = year,
value = value
) %>%
rename(year1 = item1, year2 = item2, cor = correlation) %>%
mutate(source = which %>% toupper())
}

schools_postjoin_math_cors_byyear <-
bind_rows(
schools_postjoin_math_tidy %>%
pairwise_cor_f1("tea"),
schools_postjoin_math_tidy %>%
pairwise_cor_f1("uil")
)

schools_prejoin_math_cors_byyear <-
bind_rows(
schools_tea_math %>%
pairwise_cor_f2("tea"),
schools_uil_math %>%
pairwise_cor_f2("uil")
)

schools_math_cors_byyear_diffs <-
schools_postjoin_math_cors_byyear %>%
inner_join(schools_prejoin_math_cors_byyear,
by = c("year1", "year2", "source"),
suffix = c("_join", "_unjoin")) %>%
mutate(cor_diff = cor_join - cor_unjoin)


Ok, enough of the data munging—let’s review the results!

schools_math_cors_byyear_diffs_wide <-
schools_math_cors_byyear_diffs %>%
filter(year1 <= year2) %>%
select(-matches("join\$")) %>%
unite(year_pair, year1, year2) %>%
schools_math_cors_byyear_diffs_wide

year_pair TEA UIL
2011_2012 0.09 -0.02
2011_2013 0.05 -0.04
2011_2014 0.11 -0.06
2011_2015 0.24 -0.03
2012_2013 0.01 0.00
2012_2014 0.06 0.04
2012_2015 0.15 0.00
2013_2014 -0.01 -0.03
2013_2015 0.08 0.00
2014_2015 0.05 -0.01

Note that the correlations in the joined data are a bit “stronger”—in the sense that they are more positive—among the TEA SAT/ACT data, although not in any kind of magnificent way. Additionally, the differences for the UIL data are trivial. Thus, we might say that the additional data that could have possibly increased (or decreased) the singular correlation value found—0.36—would not have changed much at all.

## Conclusion

So, my initial inclination in my analysis of academic UIL competitions) seems correct—there is no significant relationship between Texas high school academic competition scores and standardized test scores (for math, between 2011 and 2015). And, with that question answered, I intend to explore this rich data set in other ways in future blog posts.

##### Tony ElHabr
###### Data person

Passionate mostly about energy markets and sports analytics.