Powers and dangers of statistics: Simpson’s paradox
In this article, we are going to talk about a phenomenon that shows the powers and dangers of statistics: Simpson’s paradox. Sometimes, just grouping your data differently for your analysis can make your conclusions disappear or even reverse.
Simpson’s paradox: A trend or result that is present when data is put into groups, but that reverses or disappears when the data is combined.
Curious? Let’s walk through an interesting case study that demonstrates this.
Let’s start by looking at this dataset. It’s a simple dataset with four columns of information about students. The columns represent:
- student_id: an ID for each student who has enrolled in the course
- gender: the student’s gender
- major: the course the student has enrolled in
- admitted: whether or not the student was admitted to the respective course
To begin to understand what is going on here, let’s do some descriptive analysis to try to extract information and insights from this dataset.
Proportion and admission rate for each gender
First, let’s look at some frequencies and proportions in this dataset for each gender.
The proportion of male and female students who tried to enroll in a course at the school is as follows: 51.4% are female and 48.6% are male.
Now let’s look at the proportions of students who were admitted or not admitted by gender.
Of the women who applied to any course, 71.2% were not admitted and 28.8% were admitted.
To finish this first step: of the men who applied to any course, 51.4% were not admitted and 48.6% were admitted.
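The two calculations in this first step (the share of each gender among applicants, and the admission rate per gender) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts in `TOY_COUNTS` are hypothetical and are not the article’s dataset, the helper names are made up for this example, and only the column names follow the list above.

```python
from collections import Counter

# Hypothetical toy counts, NOT the article's dataset:
# (gender, major) -> (number of applicants, number admitted)
TOY_COUNTS = {
    ("female", "Physics"): (20, 15),
    ("male", "Physics"): (100, 50),
    ("female", "Chemistry"): (100, 20),
    ("male", "Chemistry"): (20, 2),
}


def build_rows(counts):
    """Expand the counts into one record per student, using the article's columns."""
    rows = []
    for (gender, major), (applicants, admitted) in counts.items():
        for i in range(applicants):
            rows.append({
                "student_id": len(rows) + 1,
                "gender": gender,
                "major": major,
                "admitted": i < admitted,  # the first `admitted` students get in
            })
    return rows


def admission_rate_by_gender(rows):
    """Return {gender: fraction admitted}, ignoring the major entirely."""
    applicants = Counter(r["gender"] for r in rows)
    admitted = Counter(r["gender"] for r in rows if r["admitted"])
    return {g: admitted[g] / applicants[g] for g in applicants}


rows = build_rows(TOY_COUNTS)

# Share of each gender among all applicants.
gender_share = {g: n / len(rows) for g, n in Counter(r["gender"] for r in rows).items()}

# Admission rate per gender over all courses combined.
overall = admission_rate_by_gender(rows)

print(gender_share)
print(overall)
```

With these toy counts, men come out ahead in the combined data, which is the same first impression the real dataset gives.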
If we were to stop our analysis now, could we say, based on historical data, that men are more likely to be admitted to school courses? Is there a gender bias?
Well, that’s what it looks like, isn’t it? But let’s not stop here. Let’s dive deeper and see if we can find out more.
Proportion and admission rate for physics majors of each gender
Moving on, let’s look at some frequencies and proportions specifically for the physics course.
For the physics course, there were many more male applicants than female ones.
Let us now analyze the frequency with which each gender was admitted to this course.
It seems that this time the women have the upper hand: 74.2% of women were admitted and 25.8% were not admitted.
And for men: 51.6% were admitted and 48.4% were not admitted.
As we can see now, women had a higher admission rate than men for the physics course, which is just the opposite of what we observed in the previous section.
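The per-course breakdown amounts to filtering the records by major before computing the same rates. Here is a minimal sketch, again with hypothetical toy counts rather than the article’s data; `rate_by_gender` is a name invented for this example.

```python
from collections import Counter

# Hypothetical toy counts, NOT the article's dataset:
# (gender, major) -> (number of applicants, number admitted)
TOY_COUNTS = {
    ("female", "Physics"): (20, 15),
    ("male", "Physics"): (100, 50),
    ("female", "Chemistry"): (100, 20),
    ("male", "Chemistry"): (20, 2),
}

# One (gender, major, admitted) tuple per student.
rows = [
    (gender, major, i < admitted)
    for (gender, major), (applicants, admitted) in TOY_COUNTS.items()
    for i in range(applicants)
]


def rate_by_gender(rows, major=None):
    """Admission rate per gender, optionally restricted to a single major."""
    totals, hits = Counter(), Counter()
    for gender, row_major, was_admitted in rows:
        if major is not None and row_major != major:
            continue  # skip students who applied to another course
        totals[gender] += 1
        hits[gender] += was_admitted  # True counts as 1, False as 0
    return {g: hits[g] / totals[g] for g in totals}


physics = rate_by_gender(rows, major="Physics")
print(physics)
```

In this toy data, women are admitted to physics at 75% and men at 50%, so the ordering flips compared with the combined view, just as it does in the article’s numbers.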
Now, let’s do the same thing for the chemistry course.
Proportion and admission rate for chemistry majors of each gender
Let’s look at some frequencies and proportions specifically for the chemistry course.
For the chemistry course, a lot more women than men applied.
Let us now analyze the frequency with which each gender was admitted to this course.
This time, according to the data: 77.4% of women were not admitted and 22.6% were admitted.
And for men: 88.9% were not admitted and 11.1% were admitted.
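What are we seeing here? The same pattern emerged as in the physics course: women were admitted at a higher rate than men. But at the beginning of the analysis, didn’t we see that men in general have a higher admission rate than women? And now, analyzing the courses separately, women have the higher admission rate in both?
To sum up all these numbers, this is what happened here:
- Overall: 28.8% of women and 48.6% of men were admitted
- Physics: 74.2% of women and 51.6% of men were admitted
- Chemistry: 22.6% of women and 11.1% of men were admitted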
This is precisely what we wanted to get to. This is Simpson’s paradox.
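One way to see why the reversal can happen: each gender’s overall admission rate is a weighted average of its per-course rates, weighted by the share of that gender’s applications going to each course. If most women apply to the course with the lower admission rate and most men to the one with the higher rate, the combined numbers can favor men even though women do better within every course. The sketch below shows this arithmetic on hypothetical toy counts (not the article’s dataset); `decompose` is an illustrative helper.

```python
# Hypothetical toy counts, NOT the article's dataset:
# (gender, major) -> (number of applicants, number admitted)
TOY_COUNTS = {
    ("female", "Physics"): (20, 15),
    ("male", "Physics"): (100, 50),
    ("female", "Chemistry"): (100, 20),
    ("male", "Chemistry"): (20, 2),
}
MAJORS = ("Physics", "Chemistry")


def decompose(counts, gender):
    """Return (overall_rate, [(major, share_of_applications, rate_in_major), ...])."""
    total = sum(counts[(gender, m)][0] for m in MAJORS)
    parts = []
    for m in MAJORS:
        applicants, admitted = counts[(gender, m)]
        parts.append((m, applicants / total, admitted / applicants))
    # Overall rate = sum over majors of (share of applications) * (rate in major).
    overall = sum(share * rate for _, share, rate in parts)
    return overall, parts


for gender in ("female", "male"):
    overall, parts = decompose(TOY_COUNTS, gender)
    for major, share, rate in parts:
        print(f"{gender:6} {major:9} share={share:.1%} rate={rate:.1%}")
    print(f"{gender:6} overall   rate={overall:.1%}")
```

In the toy data, women have the higher rate in both courses, but most of their applications go to the course that admits few people, which pulls their combined rate below the men’s.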
Simpson’s paradox reminds us of what we should always keep in mind when analyzing data: the importance of skepticism, and the powers and dangers of statistics. We always have to be very careful when trying to simplify a complex truth. A story often has different points of view, and it is up to whoever is analyzing the data to see them and make everything clear!
One of the best-known examples of Simpson’s paradox comes from a study of gender bias in graduate school admissions at the University of California, Berkeley.
Finally, I would like to say: always try to understand the general context and all the points of view, looking for a fair representation of the truth based on evidence! That way, you will be able to make a more accurate and well-founded decision about whether to group or split the data, depending on the context. That’s the essence!
Hope this helps! Take care! :)