Powers and dangers of statistics: Simpson’s paradox

Pandas Couple
5 min read · May 9, 2022

In this article, we are going to talk about a phenomenon that shows the powers and dangers of statistics: Simpson’s paradox. Sometimes, just grouping your data differently for your analysis can make your conclusions disappear or even be reversed.

Simpson’s paradox: A trend or result that is present when data is put into groups that reverses or disappears when the data is combined.

https://en.wikipedia.org

Curious? Let’s walk through an interesting case study that demonstrates this.

Dataset we are going to study

Let’s start by looking at this dataset. It’s a simple dataset with four columns of student information. The columns represent:

  • student_id: an ID for each student who has enrolled in the course
  • gender: the student’s gender
  • major: the course the student has enrolled in
  • admitted: column that shows whether or not the student was able to enter the respective course

To begin to understand what is going on here, let’s do some descriptive analysis to try to extract information and insights from this dataset.

Proportion and admission rate for each gender

First, let’s look at some frequencies and proportions in this dataset for each gender.

Proportion of male and female students

The proportion of male and female students who tried to enroll in a course at the school is as follows: 51.4% are female and 48.6% are male.

Now let’s look at the proportions of students who were admitted or not admitted by gender.

Female admissions

The proportion of women who were admitted to any course is as shown above: 71.2% were not admitted and 28.8% were admitted.

Male admissions

To finish this first step, the proportion of men who were admitted to any course is as shown above: 51.4% were not admitted and 48.6% were admitted.
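The overall numbers above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the original file is not included in this article, so the counts in `COUNTS` are hypothetical values chosen to reproduce the percentages quoted here, and the helper `admission_rate` is a name introduced for illustration.

```python
from collections import Counter

# Hypothetical counts: (gender, major, admitted) -> number of students.
# Chosen to reproduce the percentages quoted in the article; they are
# NOT read from the original dataset.
COUNTS = {
    ("female", "Physics", True): 23,
    ("female", "Physics", False): 8,
    ("female", "Chemistry", True): 51,
    ("female", "Chemistry", False): 175,
    ("male", "Physics", True): 116,
    ("male", "Physics", False): 109,
    ("male", "Chemistry", True): 2,
    ("male", "Chemistry", False): 16,
}

# Expand the counts into one record per student, mirroring the four columns.
students = []
for (gender, major, admitted), n in COUNTS.items():
    for _ in range(n):
        students.append({
            "student_id": len(students) + 1,
            "gender": gender,
            "major": major,
            "admitted": admitted,
        })


def admission_rate(rows):
    """Fraction of the given student records with admitted == True."""
    rows = list(rows)
    return sum(r["admitted"] for r in rows) / len(rows)


# Share of each gender among all applicants.
gender_counts = Counter(r["gender"] for r in students)
gender_share = {g: n / len(students) for g, n in gender_counts.items()}

# Admission rate per gender, ignoring the major.
overall_rate = {
    g: admission_rate(r for r in students if r["gender"] == g)
    for g in gender_counts
}

for g in gender_counts:
    print(f"{g}: {gender_share[g]:.1%} of applicants, {overall_rate[g]:.1%} admitted")
```

With these assumed counts, the script prints 51.4% of applicants and 28.8% admitted for women, and 48.6% of applicants and 48.6% admitted for men, matching the figures above.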

If we were to stop our analysis now, could we say, based on historical data, that men are more likely to be admitted to school courses? Is there a gender bias?

Well, that’s what it looks like, isn’t it? But let’s not stop here. Let’s dive deeper and see if we can find out more.

Proportion and admission rate for physics majors of each gender

Moving on, let’s look at some frequencies and proportions specifically for the physics course.

Female and male proportion who enrolled in the physics course

For the physics course there were many more male applicants than female ones, as we can see above.

Let us now analyze the frequency with which each gender was admitted to this course.

Physics female admissions

It seems that this time the women have the upper hand: 74.2% of women were admitted and 25.8% were not admitted.

Physics male admissions

And for men: 51.6% were admitted and 48.4% were not admitted.

As we can see now, women had a higher admission rate than men in the physics course, just the opposite of what we observed in the previous section.
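Restricting the analysis to a single course is just a filter applied before computing the rate. The sketch below illustrates this for physics; as before, the (admitted, rejected) counts are hypothetical values picked to match the percentages in this article, and the function names are illustrative.

```python
# Hypothetical (admitted, rejected) counts per (gender, major), chosen to
# reproduce the article's percentages. Not taken from the original file.
COUNTS = {
    ("female", "Physics"): (23, 8),
    ("male", "Physics"): (116, 109),
    ("female", "Chemistry"): (51, 175),
    ("male", "Chemistry"): (2, 16),
}


def applicants(gender, major):
    """Number of students of this gender who applied to this major."""
    admitted, rejected = COUNTS[(gender, major)]
    return admitted + rejected


def major_share(gender, major):
    """Fraction of a gender's applicants who chose the given major."""
    total = sum(applicants(g, m) for (g, m) in COUNTS if g == gender)
    return applicants(gender, major) / total


def admission_rate(gender, major):
    """Admission rate of a gender within one major."""
    admitted, _ = COUNTS[(gender, major)]
    return admitted / applicants(gender, major)


for gender in ("female", "male"):
    share = major_share(gender, "Physics")
    rate = admission_rate(gender, "Physics")
    print(f"{gender}: {share:.1%} applied to Physics, {rate:.1%} of them admitted")
```

Under these assumed counts the physics admission rates come out to 74.2% for women and 51.6% for men, the same values reported above.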

Now, let’s do the same thing for the chemistry course.

Proportion and admission rate for chemistry majors of each gender

Let’s look at some frequencies and proportions specifically for the chemistry course.

Female and male proportion who enrolled in the chemistry course

As we can see above, many more women than men applied to the chemistry course.

Let us now analyze the frequency with which each gender was admitted to this course.

Chemistry female admissions

This time, according to the data: 77.4% of women were not admitted and 22.6% were admitted.

Chemistry male admissions

And for men: 88.9% were not admitted and 11.1% were admitted.

What are we seeing here? The same pattern we saw for the physics course has emerged again: women had a higher admission rate than men. But at the beginning of the analysis, didn’t we see that men were admitted more often than women overall? Now that we analyze the courses separately, women have the higher admission rate in both.

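To sum up all these numbers: overall, 28.8% of women and 48.6% of men were admitted, yet women had the higher admission rate in both physics (74.2% vs. 51.6%) and chemistry (22.6% vs. 11.1%).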

This is precisely what we wanted to get to. This is Simpson’s paradox.
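Why does the reversal happen? Each gender’s overall admission rate is a weighted average of its per-course rates, where the weights are the shares of that gender’s applicants in each course. Most women applied to chemistry, where few applicants of either gender were admitted, while most men applied to physics. The sketch below checks this arithmetic; the counts are hypothetical values chosen to match the percentages in this article, and `rate` and `weight` are illustrative helper names.

```python
# Hypothetical (admitted, rejected) counts per (gender, major), chosen to
# reproduce the article's percentages. Not taken from the original file.
COUNTS = {
    ("female", "Physics"): (23, 8),
    ("male", "Physics"): (116, 109),
    ("female", "Chemistry"): (51, 175),
    ("male", "Chemistry"): (2, 16),
}
MAJORS = ("Physics", "Chemistry")
GENDERS = ("female", "male")


def rate(gender, major=None):
    """Admission rate for a gender within one major, or overall if major is None."""
    majors = MAJORS if major is None else (major,)
    admitted = sum(COUNTS[(gender, m)][0] for m in majors)
    total = sum(sum(COUNTS[(gender, m)]) for m in majors)
    return admitted / total


def weight(gender, major):
    """Share of a gender's applicants who applied to the given major."""
    total = sum(sum(COUNTS[(gender, m)]) for m in MAJORS)
    return sum(COUNTS[(gender, major)]) / total


# Per-major and overall admission rates, side by side.
print(f"{'group':<10}{'female':>8}{'male':>8}")
for major in MAJORS:
    print(f"{major:<10}{rate('female', major):>8.1%}{rate('male', major):>8.1%}")
print(f"{'Overall':<10}{rate('female'):>8.1%}{rate('male'):>8.1%}")

# The overall rate equals the weighted average of the per-major rates.
for gender in GENDERS:
    weighted = sum(weight(gender, m) * rate(gender, m) for m in MAJORS)
    assert abs(weighted - rate(gender)) < 1e-12
    shares = ", ".join(f"{weight(gender, m):.1%} {m}" for m in MAJORS)
    print(f"{gender} applicants: {shares}")
```

The comparison flips between the per-course rows and the overall row only because the two genders are distributed so differently across the courses; no individual rate changes.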

Simpson’s paradox reminds us of what we should always keep in mind when analyzing data: the importance of skepticism, and the powers and dangers of statistics. We have to be very careful when we try to simplify a complex truth. A story often has more than one point of view, and it is up to whoever is analyzing the data to see them all and make everything clear!

One of the best-known examples of Simpson’s paradox comes from a study of gender bias in graduate school admissions at the University of California, Berkeley.

Finally, always try to understand the general context and consider every point of view, looking for a fair representation of the truth based on evidence! That way, you will be able to make a better-informed decision about whether to group or separate the data, depending on the context. That’s the essence!

Hope this helps! Take care! :)

Pandas Couple

A couple of data scientists, contributing to the Data Science community.