Recently there was a new post on Revolution Analytics’ blog titled “Statistics: Losing Ground to CS, Losing Image Among Students.”
I actually agree with many of the blog post’s criticisms of CS-driven machine learning. In fact, it has been a common theme in many recent discussions with other data scientists (not from CS backgrounds, mind you) that the machine learning fetish for curve fitting is reaching ever greater heights of ridiculousness, edging wildly towards blind empiricism. If that is all that is necessary, then let’s just program massive parameter sweeps across all known models and be done with it already. But I digress; that is another blog post for another time.
Where I found myself scratching my head was where Matloff asserts that the key to making statistics more attractive to students is to get rid of the AP Stats program. Now, I will admit I am not in statistics education, so I do not know how much merit there is to the claim that AP Stats is having a negative impact on students who might otherwise have pursued additional studies in statistics. I do remember hearing about college calculus professors disgruntled with gaps in the skill sets of students coming from AP Calculus, but that doesn’t seem to be the issue here: the complaint isn’t about skill set, it is about level of interest in the field.
As someone who ended up studying statistics at the university level, works in the field, and took the AP exam (admittedly 14 years ago), I feel that calling AP Stats “destructive” does not quite pass the smell test. If anyone can point me towards studies or data that support Matloff’s assertion, please enlighten me. Matloff’s blog post links to an article by Xiao-Li Meng that says “the most frequent reason for not considering a statistical major was a ‘turn-off’ experience from an AP statistics course.” That article, though, was mainly about statistics at Harvard and statistics education in general; AP Stats is only briefly mentioned, and no hard data was presented about a cascading negative effect.
From Matloff’s post: “One of the major problems is that AP Stat is usually taught by people who lack depth in the subject matter.” While I cannot speak for all AP Stats teachers, the one I had (and who is still teaching) was trained at UIUC, which has a well-ranked statistics program. That may be my privilege showing in a comment like that, but the more I unwind this, the more I think that is actually a recurring theme in this so-called “problem.”
Another comment from the blog: “A typical example is that a student complained to me that his AP Stat teacher could not answer his question as to why it is customary to use n-1 rather than n in the denominator of s^2 , even though he had attended a top-quality high school in the heart of Silicon Valley.” If true, that is certainly embarrassing, and not the sort of teacher I would want teaching my (theoretical) children AP Stats. On the other hand, I have interviewed enough grads from top programs who could give me all sorts of proofs about sigma algebras and properties of various stochastic processes, but who did not know how to perform linear regression or had never even analyzed a real data set in their programs, to know that the problem exists at both extremes. And if that student made it to Matloff’s classes, he obviously wasn’t scared off by his negative AP Stats experience.
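For what it is worth, the n-1 question in that anecdote can be illustrated with a quick simulation. This is only a sketch under my own assumptions (standard normal data, a sample size of 5, and a fixed random seed, none of which come from Matloff’s post): it averages the variance estimate over many small samples, once dividing by n and once by n-1.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable (my choice, not from the post)

n = 5            # small sample size, where the bias is easiest to see
trials = 20_000  # number of simulated samples
# Draws come from a standard normal, so the true population variance is 1.0.

biased = []    # estimates that divide by n
unbiased = []  # estimates that divide by n - 1
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    biased.append(statistics.pvariance(sample))   # denominator n
    unbiased.append(statistics.variance(sample))  # denominator n - 1

mean_biased = statistics.fmean(biased)
mean_unbiased = statistics.fmean(unbiased)
print(f"average estimate dividing by n:     {mean_biased:.3f}")
print(f"average estimate dividing by n - 1: {mean_unbiased:.3f}")
```

Under these assumptions the first average should land near (n-1)/n = 0.8 and the second near 1.0, which is the short version of the answer: dividing by n systematically underestimates the variance because the deviations are measured from the sample mean rather than the true mean.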
Now, while I am sure many incoming freshmen at high-caliber schools take AP Stats, let’s take a step back and consider: do that many students even have access to AP Stats in the first place, such that they could be negatively impacted by the curriculum or the quality of instruction? Of course they do not.
By the College Board’s own data summary we can see the following (sad to see that data from my AP years, 2000-2001, are not included in this link; I’m apparently too old to be worth studying at this point, but for the record I got an A in the class and a 4 on the AP exam):
33.2 percent of public high school graduates in the class of 2013 took an AP Exam, compared to 18.9 percent of graduates in the class of 2003.
So we might be tempted to think that about 1 in 3 students takes the test now versus about 1 in 5 ten years ago. And while there is truth to that numerically, is it uniformly distributed across the entire student population? Of course not!
Scrolling down through the summary, we come across an item that says “Approximately 132,500 teachers taught AP classes in nearly 14,000 public high schools.” Using numbers from the Department of Education (admittedly for 2011, not 2013), we see that there are 67,086 public secondary institutions in the US. That means roughly 20% of the institutions have access to AP classes. So a more accurate statement might be that a higher proportion of students at public high schools that offer AP classes take those classes, though we would need to know how many high schools offered the exams back in 2003 to really be able to test that. This fits intuitions about socioeconomic factors, since wealthier neighborhoods would have an easier time developing and expanding AP programs than those with fewer resources. So it seems an achievement gap may be widening due to limited access, despite the growing numbers.
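The access figure is simple enough to check directly. A minimal sketch, using only the two numbers quoted above and treating “nearly 14,000” as exactly 14,000 (an approximation on my part; the variable names are just for illustration):

```python
# Figures quoted in the text; the first is rounded ("nearly 14,000").
schools_with_ap = 14_000           # public high schools with AP classes (College Board, 2013)
public_secondary_schools = 67_086  # public secondary institutions (Department of Education, 2011)

# Share of public secondary institutions where AP classes were taught.
share = schools_with_ap / public_secondary_schools
print(f"Share of public secondary schools with AP classes: {share:.1%}")
# prints "Share of public secondary schools with AP classes: 20.9%"
```

The two counts come from different years and different sources, so this should be read as a ballpark figure rather than a precise rate.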
On the AP Stats exam in 2013, around 58% of students got a 3 or higher, and only about 33.1% got a 4 or higher. Considering the general success rate tops out under 30%, it seems Stats students do slightly better on average. Nearly 170,000 students took the Stats exam in 2013. If we limit ourselves to just high school seniors in 2013 (roughly 4.1 million according to Census Bureau estimates), that tells us around 4% took the test, though of course it does not tell us how many took the class.
But let us take it to an extreme just to make a point. Assume 1 in 3 students takes an AP class and that everyone who takes a class takes a test, with the added requirement that all of these students take the Stats test, and assume they all fail. That might explain a negative impact on those students, but the majority of students, the other 2 out of 3, would not be affected by this, so what is diverting them away from statistics? Additionally, this extreme scenario is surely far removed from reality, so what are the real factors dissuading students from studying more statistics or going into the field?
If I had to guess, statistics is suffering from people’s more general math anxiety, which one university in Spain estimated to affect about 47% of the male students and 62% of the female students. The fact that the Mathematics Anxiety Rating Scale is even a thing tells me the problem lies somewhere else, not in the mathematically driven algorithm development of machine learning in computer science departments.
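Returning to the exam numbers for a moment, here is a rough sketch of the arithmetic behind the 4% figure and the extreme scenario. It uses the rounded figures quoted above and treats every exam taker as a 2013 senior, which is a simplifying assumption; the variable names are mine.

```python
# Rounded figures quoted in the text.
stats_exam_takers = 170_000  # "nearly 170,000" AP Stats exam takers in 2013
seniors_2013 = 4_100_000     # Census Bureau estimate of high school seniors, roughly 4.1 million

# Simplifying assumption: every exam taker is counted as a 2013 senior.
share_took_stats_exam = stats_exam_takers / seniors_2013
print(f"Share of seniors who took the AP Stats exam: {share_took_stats_exam:.1%}")
# prints "Share of seniors who took the AP Stats exam: 4.1%"

# Extreme hypothetical: 1 in 3 students takes an AP class, all of them
# take the Stats exam, and all of them fail.
share_in_ap = 1 / 3
share_unaffected = 1 - share_in_ap
print(f"Share of students untouched even in the extreme case: {share_unaffected:.0%}")
# prints "Share of students untouched even in the extreme case: 67%"
```

Even with every assumption stacked against AP Stats, about two thirds of students never come into contact with it.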
Let us examine the underlying assumptions, though. Matloff claimed there are fewer students studying in the field and that the loss was going to computer science. Is it really the case that there are fewer students studying in the field? Are they really going to computer science? Let’s try to find out.
UCLA surveys college freshmen every year, and their survey is a fairly reliable measure, so much so that the NSF uses it as their indicator to track enrollments. Here are some percentages for CS and Mathematics/Statistics from 1995 to 2010.
So we can see the level of enrollment in Mathematics/Statistics has been fairly stable. The gap between the majors was widest at the height of the dot-com bubble, which appears to be the only time in this data that enrollment went down modestly; the gap has been decreasing since and is now the smallest it has been in recent years.
Admittedly, data that separated mathematics from statistics would have been better, and data since 2010 would be more useful, since the hype around machine learning and data science has taken its massive upswing since that time (though remember Google recommended statistics back in 2009). But at least as far as this data is concerned, statistics does not seem to be losing anyone to anything, least of all to computer science, at the undergraduate level.
One or two more points as an aside. The NSF has some data that breaks down fields in more detail at the graduate level, but at the time of this writing I could only find 2012. However, since Matloff’s comments were pointed at AP Stats, understanding the dynamics at the undergraduate level seems more relevant, since any effect would show up immediately at the level following high school. In this single year’s data there are many more students in CS than in Statistics. An anecdote I would like to call out from my own experience, though, is that my master’s cohort was the largest at my school at the time, and every subsequent cohort has been larger still. So does a gap between CS and Statistics matter if both, or at least the Statistics side, show growth?