Can Big Data Be Used To Establish Causality?

Causality is the relationship between cause and effect.  In many settings that relationship is probabilistic rather than deterministic.  For example, smoking does not lead to lung cancer in every smoker, but smoking increases the probability that a person will get lung cancer.  In other words, smoking is a probabilistic cause of lung cancer.

To establish probabilistic causality from data, R. A. Fisher introduced randomized experimentation in 1935.  Fisher’s ingenious idea was to isolate cause and effect by randomly assigning subjects to groups with and without the cause, then comparing the effect across the groups.  For example, suppose a company wants to know whether an e-mail coupon increases sales.  Create two groups by flipping a coin for each person, so that everyone has a 50:50 chance of belonging to either group.  If the coin is heads, the person is assigned to the group that receives an e-mail ad with a coupon for a discount; if the coin is tails, the person receives the same e-mail but without a coupon.  After the coupon has expired, the click-through rate (CTR) and sales of the two groups are compared.  Because the groups were formed by a simple coin toss, they should be similar on average in all respects (age, sex, previous purchase history) except for receipt of the coupon.  So, if there is a difference in CTR or sales, it was probably caused by the coupon.
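The coin-toss design can be sketched as a small simulation.  This is only an illustration: the function name `run_coupon_experiment`, the population size, and every click probability below are assumptions made up for the example, not figures from any real campaign.

```python
import random

def run_coupon_experiment(n_people=20_000, seed=0):
    """Simulate a coin-flip experiment. All rates are assumed for illustration."""
    rng = random.Random(seed)
    stats = {g: {"n": 0, "women": 0, "clicks": 0} for g in ("coupon", "no_coupon")}
    for _ in range(n_people):
        # A background trait that also affects clicking (assumed 50% women).
        is_woman = rng.random() < 0.5
        # The coin toss: heads -> e-mail with coupon, tails -> same e-mail without.
        group = "coupon" if rng.random() < 0.5 else "no_coupon"
        # Hypothetical click probability: baseline + gender effect + coupon effect.
        p_click = 0.08 + (0.04 if is_woman else 0.0) + (0.04 if group == "coupon" else 0.0)
        s = stats[group]
        s["n"] += 1
        s["women"] += is_woman
        s["clicks"] += rng.random() < p_click
    return {
        g: {"share_women": s["women"] / s["n"], "ctr": s["clicks"] / s["n"]}
        for g, s in stats.items()
    }

results = run_coupon_experiment()
for group, summary in results.items():
    print(group, summary)
```

Because assignment ignores gender, the share of women comes out nearly equal in both groups, so the gap in CTR reflects the simulated coupon effect rather than a difference in who received the coupon.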

In many big data sets, such as those based on social networking sites, people’s connections are observed along with mentions, posts, and other data that they share.  Social networking data from sites such as Facebook, Twitter, and Google+ are observational: nobody randomly assigned users to see one ad or another.  So if an online ad that has a discount results in a higher CTR and sales compared to the same ad without a discount, the causal question is:  did the discount cause the increased CTR and sales (effects), or are there other differences between the groups causing the increase?  For example, did both groups contain similar proportions of men and women?  Are men just as likely to buy from this company as women?  If there are more men in the group that saw the ad without the discount, and men are less likely to buy, then perhaps it’s gender driving the difference in CTR and sales, and not the discount.
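To make the gender example concrete, here is a minimal simulation of observational data in which the discount has no effect at all, yet a naive comparison still shows a gap.  The function names and all probabilities are hypothetical, chosen only so that women are both more likely to see the discount ad and more likely to click.

```python
import random

def simulate_observational(n_people=100_000, seed=1):
    """Observational data where the discount does nothing (assumed rates)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_people):
        is_woman = rng.random() < 0.5
        # Assumed: women are more likely to have seen the ad with the discount.
        saw_discount = rng.random() < (0.7 if is_woman else 0.3)
        # Assumed: clicking depends on gender only, not on the discount.
        clicked = rng.random() < (0.15 if is_woman else 0.05)
        rows.append((is_woman, saw_discount, clicked))
    return rows

def ctr_gap(rows):
    """CTR of the discount group minus CTR of the no-discount group."""
    with_discount = [c for _, d, c in rows if d]
    without_discount = [c for _, d, c in rows if not d]
    return sum(with_discount) / len(with_discount) - sum(without_discount) / len(without_discount)

rows = simulate_observational()
naive_gap = ctr_gap(rows)                               # all users pooled together
gap_women = ctr_gap([r for r in rows if r[0]])          # women only
gap_men = ctr_gap([r for r in rows if not r[0]])        # men only

print(f"naive gap: {naive_gap:.3f}")
print(f"gap among women: {gap_women:.3f}, gap among men: {gap_men:.3f}")
```

The pooled comparison suggests the discount helps, but within each gender the gap is close to zero.  The apparent effect comes entirely from the different mix of men and women in the two groups.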

Three statistical methodologies that can be applied to observational data to explore whether the ad with a discount caused an increase in CTR and sales are propensity scores, multivariable regression modeling, and instrumental variable analysis.  Under their respective assumptions (for example, that all important confounders have been measured, or that a valid instrument is available), these methods allow causation to be evaluated in observational big data sets.
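As one illustration of the first of these methods, the sketch below applies inverse probability weighting with a propensity score estimated from a single measured confounder (gender).  It is a toy under stated assumptions, not a recipe for real data: the data are simulated, the true discount effect is set to 0.03 by construction, the function names are hypothetical, and gender is assumed to be the only confounder.

```python
import random

def simulate(n_people=100_000, seed=2):
    """Simulated observational data; the true discount effect is 0.03 (assumed)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_people):
        is_woman = rng.random() < 0.5
        # Assumed: exposure to the discount depends on gender (confounding).
        treated = rng.random() < (0.7 if is_woman else 0.3)
        p_click = (0.15 if is_woman else 0.05) + (0.03 if treated else 0.0)
        rows.append((is_woman, treated, rng.random() < p_click))
    return rows

def naive_difference(rows):
    """Unadjusted CTR difference between discount and no-discount users."""
    treated = [c for _, t, c in rows if t]
    control = [c for _, t, c in rows if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)

def ipw_difference(rows):
    """CTR difference after weighting each user by the inverse propensity score."""
    # Propensity score: estimated P(saw discount | gender), one value per gender.
    propensity = {}
    for g in (True, False):
        exposures = [t for w, t, _ in rows if w == g]
        propensity[g] = sum(exposures) / len(exposures)
    num1 = den1 = num0 = den0 = 0.0
    for is_woman, treated, clicked in rows:
        e = propensity[is_woman]
        if treated:
            num1 += clicked / e
            den1 += 1 / e
        else:
            num0 += clicked / (1 - e)
            den0 += 1 / (1 - e)
    # Normalized weighted means for the treated and untreated groups.
    return num1 / den1 - num0 / den0

rows = simulate()
naive = naive_difference(rows)
weighted = ipw_difference(rows)
print(f"naive: {naive:.3f}, propensity-weighted: {weighted:.3f}")
```

In this simulation the naive difference overstates the effect, while the weighted estimate lands near the 0.03 built into the data.  With real data the adjustment is only as good as the assumption that every important confounder has been measured and included in the propensity model.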

Big data has big possibilities, to be sure, but age-old questions of causation still loom large.
