Beware of Selection Bias

Let's travel back in time to World War II. You are an analyst, and you have discovered that your aircraft have a low survival rate: only 10% of them make it back from missions. You examine the returning aircraft and find that the enemy has good marksmen, because every shot landed in the cockpit. 10% of the bullet holes are on the left side of the cockpit and 90% are on the right. Because of weight constraints, you can armor plate only one side of the aircraft. Which side would you armor plate?

This is a simplified version of a problem from a real case study (see the work of Bomber Command's Operational Research Section, BC-ORS). As the title of this article implies, the answer is counter-intuitive: you ought to armor plate the left side of the cockpit, the side with fewer bullet holes. But what is selection bias, and how does it affect our answer?

Selection Bias In Statistics

Selection bias is one of the most frequently discussed topics in statistics. Findings from a random sample can be generalized because the sample is, as the name implies, random. When units are chosen at random, there is a good chance that the sample is representative of the population. When the sample is not random, however, the bias is systematic and the results can be badly skewed.

So why are the bullet-hole statistics skewed in our story? The answer lies in which aircraft we measured: the survivors. No data was collected from the aircraft that did not return, so there is a huge non-response bias, a form of selection bias often called survivorship bias.

Let us simplify the problem. Suppose that a shot on the left has a 90% chance of bringing the aircraft down and a shot on the right has a 10% chance. Let LEFT be the number of aircraft shot on the left that returned, and RIGHT the number of aircraft shot on the right that returned.

LEFT = 0.1 × all aircraft shot on the left
RIGHT = 0.9 × all aircraft shot on the right

Let us assume that 100 aircraft went out and that each was hit once. The shots land on the left and right sides at random, so on average 50 aircraft are hit on the left and 50 on the right.

Therefore, LEFT = 0.1 × 50 = 5 and RIGHT = 0.9 × 50 = 45.

Looking at the data, 10% of the returning aircraft have bullet holes on the left and 90% have them on the right, which is consistent with our observation. Note that this result rests on the assumption that an aircraft is equally likely to be hit on either side. So the fact that we see more bullet holes on the right does not imply that the right side is hit more often. In this scenario both sides are equally likely to be hit; the left is simply far more lethal. Reinforcing the left would therefore raise the survival rate, and more aircraft would return with bullet holes on the left. The data is skewed because of the non-response from the aircraft that were shot down.
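The arithmetic above can be reproduced in a few lines of Python. This is only a sketch of the simplified model: the function name, its parameters, and in particular the two "armored" scenarios (armor on the left lowers the left-side loss rate to 10%; armor on the right lowers the right-side loss rate to 0%) are illustrative assumptions, not figures from the case study.

```python
def returning_aircraft(n_aircraft=100, p_hit_left=0.5,
                       p_down_left=0.9, p_down_right=0.1):
    """Expected returning aircraft by hit side in the simplified model.

    Assumes every aircraft is hit exactly once, on the left or the right.
    """
    hit_left = n_aircraft * p_hit_left
    hit_right = n_aircraft * (1 - p_hit_left)
    left_returned = hit_left * (1 - p_down_left)      # LEFT in the text
    right_returned = hit_right * (1 - p_down_right)   # RIGHT in the text
    returned = left_returned + right_returned
    return {
        "left_returned": left_returned,
        "right_returned": right_returned,
        "share_left": left_returned / returned,    # share of survivors hit on the left
        "share_right": right_returned / returned,  # share of survivors hit on the right
        "survival_rate": returned / n_aircraft,
    }


baseline = returning_aircraft()
# Hypothetical armor effects, chosen only for illustration.
armored_left = returning_aircraft(p_down_left=0.1)
armored_right = returning_aircraft(p_down_right=0.0)

for name, result in [("baseline", baseline),
                     ("armor on left", armored_left),
                     ("armor on right", armored_right)]:
    print(f"{name}: survival rate {result['survival_rate']:.0%}, "
          f"left share among survivors {result['share_left']:.0%}")
```

Under these assumptions the baseline reproduces the 5 and 45 returning aircraft, and armoring the left lifts the survival rate far more than armoring the right, even though the survivors show most of their bullet holes on the right.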

What does non-response bias have to do with marketing?

You may argue that such mathematical exercises are better left to hardcore statisticians, researchers, or analysts. But they matter to marketers as well. Can you name a frequently used metric that is affected by a similar non-response bias? Take a few seconds to think.

There are many candidates, but the most obvious one is NPS (Net Promoter Score). If you didn't think of it, can you reason out why NPS is so easily affected by non-response bias? Here is a scenario. You run a hotel and survey guests for NPS at the end of each stay. Dissatisfied customers do not return to the hotel, and they probably tell their friends not to book a room with you. Over time the NPS appears to improve, but only because the guests who keep coming back are the ones more tolerant of bad service, while the detractors are gone (selection bias).
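The hotel scenario can be sketched numerically. NPS is the percentage of promoters minus the percentage of detractors among respondents. Everything else below is a made-up illustration: the cohort sizes, the return probabilities, and the assumption that no new guests arrive and nobody changes their opinion between stays.

```python
def nps(promoters, passives, detractors):
    """Net Promoter Score: % promoters minus % detractors among respondents."""
    total = promoters + passives + detractors
    return 100.0 * (promoters - detractors) / total


def nps_over_stays(cohort, return_prob, n_stays):
    """NPS measured at each successive stay of a fixed cohort.

    After every stay, each group returns with its own probability.
    Opinions never change; only the mix of respondents does.
    """
    scores = []
    current = dict(cohort)
    for _ in range(n_stays):
        scores.append(nps(current["promoters"],
                          current["passives"],
                          current["detractors"]))
        # Expected number of guests from each group who come back.
        current = {group: count * return_prob[group]
                   for group, count in current.items()}
    return scores


# Hypothetical numbers, chosen only for illustration.
cohort = {"promoters": 300, "passives": 300, "detractors": 400}
return_prob = {"promoters": 0.9, "passives": 0.7, "detractors": 0.2}

scores = nps_over_stays(cohort, return_prob, n_stays=4)
for stay, score in enumerate(scores, start=1):
    print(f"stay {stay}: NPS = {score:.1f}")
```

With these numbers the first survey gives an NPS of -10, and every later survey scores higher, although the service and every guest's opinion stay exactly the same. The improvement comes entirely from who is left to answer.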

Segments, segments and more segments

The key to understanding customer analytics is segments, segments and more segments. Bias and skewed results usually arise when the measurement you are taking comes from a sub-segment rather than the entire population. The data in the aircraft example comes from a segment known as survivors. The data in the NPS example likewise comes from a segment: people who are tolerant of bad service.

Results from a segment may be skewed, so we cannot generalize them to the entire customer base, let alone to non-customers (the market). But understanding a segment helps us identify its traits. This is where we can turn a skewed result to our advantage and use it to study our customers.

We often segment customers according to how we intend to target them. Yet the data we collect may sometimes reveal previously unknown segments. The goal of a good marketer is to sniff out these hidden segments and learn more about them.