Exploring Diversity in California Schools

Last updated on May 28, 2021 14 min read

library(tidyverse)
library(rvest)
library(knitr)

Keeping with the theme of EDA, in this post I will explore the ethnic diversity of the student population in schools in California. We will utilize the data provided by California Department of Education that has Enrollment by School which includes “school-level enrollment by racial/ethnic designation, gender, and grade.”

We will combine this data with another dataset that contains income information on the schools’ location. Hopefully we will be able to draw some neat inference and plots using these datasets.

Importing the Data

Enrollment Data

We start by downloading the dataset to our local disk and reading it from there. We mainly do this since the download speed of these files are slow. To get an idea of what the data is lets get a glimpse.

data <- readr::read_tsv("filesenr.asp.txt")
head(data) %>% kable()

CDS_CODE	COUNTY	DISTRICT	SCHOOL	ETHNIC	GENDER	KDGN	GR_1	GR_2	GR_3	GR_4	GR_5	GR_6	GR_7	GR_8	GR_9	GR_10	GR_11	GR_12	ENR_TOTAL
08100820830059	Del Norte	Del Norte County Office of Education	Castle Rock	1	F	0	1	0	0	0	2	1	1	2	1	1	5	6	20
08100820830059	Del Norte	Del Norte County Office of Education	Castle Rock	1	M	1	0	0	0	0	1	1	1	2	6	3	4	5	24
08100820830059	Del Norte	Del Norte County Office of Education	Castle Rock	9	M	1	0	2	0	1	0	0	1	0	3	3	2	6	19
08100820830059	Del Norte	Del Norte County Office of Education	Castle Rock	2	M	0	0	0	0	0	0	0	0	0	0	0	0	1	1
08100820830059	Del Norte	Del Norte County Office of Education	Castle Rock	7	M	0	0	2	4	1	1	3	4	6	7	15	10	26	79
08100820830059	Del Norte	Del Norte County Office of Education	Castle Rock	5	M	0	1	0	0	1	0	1	0	0	4	4	1	11	23

One important thing to note is that the CDS_CODE was read as a character, which is better for us since we are using it as an ID. If it was read as in integer, we would lose leading zeros and it could have gotten us in some trouble.

Longitude and Latitude Data

The longitude and latitude of the schools are also available on California Department of Education’s website here. This dataset includes a lot of variables, most of them are uninteresting for this project. We can check it out here:

longlat_raw <- read_tsv("pubschls.txt")
longlat <- longlat_raw %>%
  select(CDSCode, Latitude, Longitude, School, City)
head(longlat) %>% kable()

CDSCode	Latitude	Longitude	School	City
01100170000000	37.658212	-122.09713	No Data	Hayward
01100170109835	37.521436	-121.99391	FAME Public Charter	Newark
01100170112607	37.804520	-122.26815	Envision Academy for Arts & Technology	Oakland
01100170118489	37.868991	-122.27844	Aspire California College Preparatory Academy	Berkeley
01100170123968	37.784648	-122.23863	Community School for Creative Education	Oakland
01100170124172	37.847375	-122.28356	Yu Ming Charter	Oakland

We see that the the columns have been read in in appropriate ways, thankfully. We recognize the CDS_CODE from before as CDSCode in this dataset which we will use to combine the datasets later.

Income Data

Lastly, we will get some income data. I found some fairly good data on Wikipedia. Here is what it looks like:

url <- 'https://en.wikipedia.org/wiki/List_of_California_locations_by_income'
income_data <- read_html(url) %>%
  html_nodes("table") %>%
  .[2] %>%
  html_table() %>%
  .[[1]] %>%
  set_names(c("county", "population", "population_density", 
              "per_capita_income", "median_household_income",
              "median_family_income"))
head(income_data) %>% kable()

county	population	population_density	per_capita_income	median_household_income	median_family_income
Alameda	1,559,308	2,109.8	$36,439	$73,775	$90,822
Alpine	1,202	1.6	$24,375	$61,343	$71,932
Amador	37,159	62.5	$27,373	$52,964	$68,765
Butte	221,578	135.4	$24,430	$43,165	$56,934
Calaveras	44,921	44.0	$29,296	$54,936	$67,100
Colusa	21,424	18.6	$22,211	$50,503	$56,472

Here we see a couple of weird things. All of the columns were read in as characters, which they shouldn’t be since 5 out of the 6 variables are numerical. Lets fix that quickly. Now we have:

income_data$population <- as.character(income_data$population)
income_data$population <- parse_number(income_data$population, locale = locale(grouping_mark = ","))

income_data$population_density <- as.character(income_data$population_density)
income_data$population_density <- parse_number(income_data$population_density, locale = locale(grouping_mark = ","))

income_data$per_capita_income <- as.character(income_data$per_capita_income)
income_data$per_capita_income <- parse_number(income_data$per_capita_income, locale = locale(grouping_mark = ","))

income_data$median_household_income <- as.character(income_data$median_household_income)
income_data$median_household_income <- parse_number(income_data$median_household_income, locale = locale(grouping_mark = ","))

income_data$median_family_income <- as.character(income_data$median_family_income)
income_data$median_family_income <- parse_number(income_data$median_family_income, locale = locale(grouping_mark = ","))

head(income_data) %>% kable()

county	population	population_density	per_capita_income	median_household_income	median_family_income
Alameda	1559308	2109.8	36439	73775	90822
Alpine	1202	1.6	24375	61343	71932
Amador	37159	62.5	27373	52964	68765
Butte	221578	135.4	24430	43165	56934
Calaveras	44921	44.0	29296	54936	67100
Colusa	21424	18.6	22211	50503	56472

We are good to go now.

The Diversity Index

To be able to compare the ethnic diversity between schools we need a measure that describes diversity to begin with. I’ll use a diversity index developed in 1991 by a researcher at the University of North Carolina at Chapel Hill. The index simply asks: “What is the probability that two people in a population picked at random will be from different ethnicity?”

To make sense of this measure, we will look at a plot:

Above, we have a population of 10 people split into two ethnicities, with 5 in each. The colored squares represent random picks with the same ethnicity, light grey squares are picks with different ethnicities, and dark grey indicates impossible picks (the same person being picked twice). Now the diversity index can be calculated by dividing the number of light grey squares with the sum of light grey and colored squares.

First, lets go back to the enrollemnt data. We would like to have it in a different format where each row represents a school. Another thing we want to do is look across each ethnic category within each school. The new dataset that we will be working with is:

small_data <- data %>%
  select(CDS_CODE, ETHNIC, ENR_TOTAL) %>%
  group_by(CDS_CODE, ETHNIC) %>%
  summarize(ENR_TOTAL = sum(ENR_TOTAL)) %>%
  spread(ETHNIC, ENR_TOTAL, fill = 0) %>%
  ungroup()
head(small_data) %>% kable()

CDS_CODE	0	1	2	3	4	5	6	7	9
01100170112607	7	3	3	3	1	198	157	19	21
01100170123968	3	0	26	1	2	164	51	9	11
01100170124172	0	1	227	1	12	34	37	38	137
01100170125567	7	1	24	1	4	107	80	116	53
01100170129403	1	4	7	1	3	191	32	1	6
01100170130401	0	0	0	2	0	24	42	4	1

Before we move on, lets reference the data documentation to see what each of the ethnic numbers mean. According to the File Structure:

Code 0 = Not reported
Code 1 = American Indian or Alaska Native, Not Hispanic
Code 2 = Asian, Not Hispanic
Code 3 = Pacific Islander, Not Hispanic
Code 4 = Filipino, Not Hispanic
Code 5 = Hispanic or Latino
Code 6 = African American, not Hispanic
Code 7 = White, not Hispanic
Code 9 = Two or More Races, Not Hispanic

Now we have two decisions to make before moving on. How shoule we take care of the “Not reported” cases, and “Two or More Races, Not Hispanic”? (These are the types of decisions you make during the EDA phase, which then set up your analysis later on)

Lets start with the case of code 0. We notice that the category generally is quite small compared to the other groups, so we need to reason about what would happen if we drop them. If we treat them as a separate group, it would inflate the diversity index so we will drop them.

Dealing with multi-ethnicity is slighlty harder. We have four options:

Assume everyone in “Two or More Races, Not Hispanic” is the same ethnicity
Assume everyone in “Two or More Races, Not Hispanic” is all different ethnicities
Ignore the group “Two or More Races, Not Hispanic”
Distribute the group “Two or More Races, Not Hispanic” evenly into the other groups

Options 3 and 4 will both yield low diversity indices considering that the choice of “Two or More Races, Not Hispanic” is because these students are different enough not to be put into one of the more precisely defined groups. The first option, while appealling from a computational standpoint, would treat a black-white mixture and an Asian-American Indian as members of the same group even though they have no racial identity in common. Therefore, we will work with the second option which might overestimate the diversity of any given population as it ignores people with “Two or More Races, Not Hispanic” with identical mixes (siblings would be a prime example).

After getting that out of the way, we can thus create a function to calculate the diversity index by using the geometric interpretation from before.

diversity <- function(..., y) {
 x <- sapply(list(...), cbind)
 total <- cbind(x, y)
 1 - (rowSums(x ^ 2) - rowSums(x)) / (rowSums(total) ^ 2 - rowSums(total))
}

data_diversity <- small_data %>%
  mutate(diversity = diversity(`1`, `2`, `3`, `4`, `5`, `6`, `7`, y = `9`))

Now we are ready to bring back the data and start to calculate diversity indices for the schools. Lets run a summary to see the ranges we get and make sure everything is working correctly.

summary(data_diversity$diversity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.3140  0.5205  0.4711  0.6425  1.0000     108

All the indices are between 0 and 1 which is good! This means that the diversity index worked as intended. However there are 108 schools that have NA valued diversity. Lets look at those schools:

filter(data_diversity, is.na(diversity)) %>% head() %>% kable()

CDS_CODE	5	6	7	diversity
05615720000001	0	0	1	NaN
05615800000001	0	0	1	NaN
06615980000001	1	0	0	NaN
06616140000001	0	1	0	NaN
07616630000001	0	0	1	NaN
07617130000001	0	0	1	NaN

It seems that these schools do not have missing values but rather have such a low number of kids that the calculations break down. We will exclude these schools for now.

data_diversity <- data_diversity %>%
  filter(!is.na(diversity))

The first thing we should do with the data is look at the distribution of the diversity indices.

ggplot(data_diversity, aes(x=diversity)) +
  geom_histogram(binwidth = 0.01) +
  xlab("Diversity Index") +
  ylab(NULL) +
  ggtitle("Diversity at California Schools") +
  theme(plot.title = element_text(hjust = 0.5))

We notice that most of the schools follow a nice shape with a median above 0.5. The only weird thing are the spikes at 0 and 1. Lets look at the 0’s first:

data_diversity %>%
  filter(diversity == 0) %>%
  head(20) %>% kable()

CDS_CODE	0	5	7
02613330137125	1	0	3
03100330330035	0	0	7
06100660000000	0	10	0
09619290000001	0	0	4
09619520000001	0	0	2
09619866005722	0	0	8
09737830000001	0	0	4
10621251030873	0	6	0
10622650000000	0	5	0
10622650000001	0	2	0
10623640126870	0	70	0
10624300000001	0	2	0
10738090123950	0	3	0
10738091030147	0	11	0
10751271030261	0	21	0
10751271030725	0	3	0
11101160000001	0	2	0
12630401230069	0	0	2
13101321330117	0	9	0
13630811330141	0	14	0

It seems that all of the schools with a 0 diversity index is because they have students from one ethnicity, which is predominantly category 5 which is “Hispanic or Latino”. Now lets look at the 1’s:

data_diversity %>%
  filter(diversity == 1) %>%
  head(20) %>% kable()

CDS_CODE	0	1	2	5	6	7	9	diversity
01611270130294	1	0	1	1	0	0	0	1
04614240000001	0	1	0	1	0	1	0	1
05615806115000	0	0	0	0	1	1	0	1
09619600000001	0	0	0	1	0	1	0	1
10621170101949	0	0	0	1	0	1	0	1
14766870118299	0	1	0	1	0	1	0	1
14766871430107	0	1	0	1	0	0	1	1
16638910110858	0	0	0	1	1	0	0	1
19643370000000	0	0	0	1	0	1	1	1
19646670000001	0	0	0	1	1	0	0	1
19646910000000	0	0	0	1	0	1	0	1
19647090000001	1	0	1	1	1	0	0	1
19648320000001	0	0	1	1	1	0	0	1
20764142030070	0	0	0	1	0	0	2	1
21653340000001	0	0	0	1	0	1	0	1
22655320000001	0	0	0	0	0	1	1	1
23102310000001	0	1	0	1	0	0	0	1
30665630000001	0	0	0	1	0	1	0	1
32103220108001	0	0	0	0	1	1	0	1
33671570000001	0	0	0	1	0	0	1	1

Intuitively, the only way a school can have a diveristy index of 1 is if the maximum number of students from each ethnicity is 1. Therefore, these schools are rather small.

Lets take a look at the total enrollment in the schools. We have seen a couple of instances with low enrollment and we wouldn’t want those to distort our view.

total_students <- data %>%
  select(CDS_CODE:SCHOOL, ENR_TOTAL) %>%
  group_by(CDS_CODE) %>%
  summarise(ENR_TOTAL = sum(ENR_TOTAL)) %>%
  ungroup()

meta <- data %>%
  select(CDS_CODE:SCHOOL) %>%
  distinct() 

ggplot(total_students, aes(ENR_TOTAL)) +
  geom_histogram(bins = 1000) +
  xlab("Total Enrollment") +
  ylab(NULL) +
  ggtitle("Total Enrollment at California Schools") +
  theme(plot.title = element_text(hjust = 0.5))

We see a hump at the 600-700 mark, but a very large spike closer to 0. We should take a closer look at these small values to see what is happening.

total_students %>%
  filter(ENR_TOTAL < 250) %>%
  ggplot(aes(ENR_TOTAL)) +
  geom_histogram(bins = 250) +
  xlab("Total Enrollment") +
  ylab(NULL) +
  ggtitle("Total Enrollment at Smaller California Schools") +
  theme(plot.title = element_text(hjust = 0.5))

Here, we will make another decision that will change the results of the analysis, but we will restrict the investigation to only include schools with total enrollments of 50 or larger. We do this so that these smaller schools do not skew the distribution of the diversity index.

data_big <- total_students %>%
  left_join(meta, by = "CDS_CODE") %>%
  filter(ENR_TOTAL > 50)
data_enrollment <- data_diversity %>%
  right_join(data_big, by = "CDS_CODE")

One thing we can look at is the relationship between enrollment and diversity. We will fit a simple linear model to start with. I expect larger schools to have more diversity, but lets see:

data_enrollment %>%
  ggplot(aes(ENR_TOTAL, diversity)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm") +
  theme_minimal() +
  ylim(0,1) +
  labs(x="Enrollment", y="Diversity Index", title = "Relationship between Enrollment and Diversity") +
  theme(plot.title = element_text(hjust = 0.5))

The slope seems quite small but still about a 5% increase in diversity per 2000 students. Lets check the null hypothesis to make sure for statistical significance.

data_enrollment %>%
  mutate(ENR_TOTAL = ENR_TOTAL / 1000) %>%
  lm(diversity ~ ENR_TOTAL, data = .) %>%
  summary()

## 
## Call:
## lm(formula = diversity ~ ENR_TOTAL, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51078 -0.15751  0.05026  0.17040  0.39595 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.449955   0.003429 131.223  < 2e-16 ***
## ENR_TOTAL   0.024282   0.004063   5.976 2.37e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2136 on 9472 degrees of freedom
## Multiple R-squared:  0.003756,   Adjusted R-squared:  0.003651 
## F-statistic: 35.71 on 1 and 9472 DF,  p-value: 2.371e-09

The results show that we can reject the null hypothesis of no correlation and we get a better estimate of the slope of 2.3%. Its important to keep in mind that we have a miniscule R-squared, so we dont think there is much of a relationship here.

Does this have anything to do with money?

I will end this post by bringing in the income data and checking if there is a correlation between median household income in the county in which the school is located and the schools’ diversity indices.

data_enrollment %>%
  left_join(income_data, by = c("COUNTY" = "county")) %>%
  ggplot(aes(x=median_household_income, diversity)) +
  geom_jitter(alpha = 0.1, width = 500) +
  geom_smooth(method = "lm") +
  ylim(0,1) +
  labs(x="Median Household Income", y = "Diversity Index", title = "Correlation between Diversity Index and Median Household Income \n(in CA counties)") +
  theme(plot.title = element_text(hjust = 0.5))

As always, if you have a question or a suggestion related to the topic covered in this article, please feel free to contact me!