This post examines the relationship between media reports and casualty estimation. Counting the total number of people in a large, stable population during peacetime is difficult for many reasons, most notably because everyone in the target population has to be enumerated.
Now suppose the people we want to count have been injured or killed in a conflict zone, or as part of a complex humanitarian emergency such as a flood or earthquake. This change of context makes it even harder to obtain an accurate count of the number of people in the population.
Below are a few general points about estimating casualties using media reports (and observational data more generally) in conflict situations:
- Media reports are easy to obtain and relatively simple and inexpensive to convert into a database.
- The number of casualties will almost surely be underestimated, and the nature of what is reported will depend on the identity of the victim and suspected perpetrator.
- Media reports may omit key data that is useful in understanding associations and patterns in armed violence research.
- Collecting only events that resulted in death makes it impossible to estimate probabilities, because the denominator (everyone at risk) is never observed.
- Statistical associations that are significant within media reports may not be significant when non-media reports are included, and vice versa.
- Conclusion: Generalizing frequencies and associations based on media reports beyond newsworthy events can result in erroneous statements.
Suppose we conduct a study to assess whether marathon finishing times are associated with diet while controlling for age (this example was inspired by Gretchen Reynolds's recent NYT article). The extraneous variable is age. Can we ignore age in our analysis and still assess the association between finishing time and diet?
Confounding exists when the relationship between finishing time and diet has different interpretations depending on whether age is ignored or included in the data analysis. The assessment requires a comparison between a crude estimate of the association and an adjusted estimate.
Interaction exists when the relationship between finishing time and diet is different for different age groups. The assessment requires describing the relationship between finishing time and diet for different age groups.
Interaction and confounding can exist in the same data set. A variable can be a confounder and can also be involved in an interaction. If a strong interaction is found, then an adjustment for confounding is inappropriate.
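To make the distinction concrete, here is a minimal simulation (the numbers are invented, not from any real study) in which diet has no true effect on finishing time, but younger runners are both faster and more likely to follow the diet. The crude comparison suggests a large diet effect; the age-adjusted estimate does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: younger runners are more likely to follow the diet,
# and age slows finishing time. Diet itself has no true effect here.
age = rng.uniform(20, 60, n)
diet = (rng.uniform(0, 1, n) < (60 - age) / 40).astype(float)
time = 240 + 1.5 * age + 0.0 * diet + rng.normal(0, 10, n)  # minutes

# Crude estimate: difference in mean finishing time by diet group (age ignored).
crude = time[diet == 1].mean() - time[diet == 0].mean()

# Adjusted estimate: coefficient on diet in a regression that includes age.
X = np.column_stack([np.ones(n), diet, age])
beta, *_ = np.linalg.lstsq(X, time, rcond=None)
adjusted = beta[1]

print(f"crude diet effect:    {crude:.1f} min")     # strongly negative
print(f"adjusted diet effect: {adjusted:.1f} min")  # near zero once age is controlled
```

The crude comparison confuses the diet effect with the age effect; including age in the regression separates the two.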
Data is the raw material used to derive statistical information about the past. People often hope that if we collect data on important topics and present it in beautiful charts, tables, or maps then something good will happen in the future. For example, will data on the number of civilians killed in the Iraq conflict help prevent future deaths in similar circumstances? Data on civilian deaths during a conflict is important. But, there is no evidence that stand-alone data on civilian casualties has helped prevent future deaths in conflict. Statistical information should be used to make meaningful qualitative arguments about important topics. Statistical information is usually the beginning of a story. Data and statistical information is often one small piece of a puzzle in trying to make good stuff happen in the future.
“There is a deep gap between our thinking about statistics and our thinking about individual cases … On the other hand, surprising individual cases have a powerful impact … because the incongruity must be resolved and embedded in a causal story.”
Daniel Kahneman. Thinking, Fast and Slow, p. 174
How many “friends” do you have on Facebook? How many followers do you have on Twitter? What is the probability that the treatment for a disease will be effective? Data is setting the tone and character of many modern conversations.
Seth Godin launched a successful Kickstarter project for his new book today. If the project is funded, his publisher has agreed to publish and promote the book. The project was funded within a few hours of launch. Seth Godin now has data (and money) to show the publisher about the potential market for his new book. But is the response to this Kickstarter project an indicator of future book sales? I hope he publishes the data.
Using data to predict an event that has yet to occur is statistical prediction.
Inferring the value of a population quantity such as the average income of a country or the proportion of eligible voters who say they will vote ‘yes’ is statistical inference.
Prediction and inference answer different types of statistical questions.
The following are examples of predictions because the events have not occurred at the time of writing this post.
The probability that the Miami Heat will win the 2012 NBA playoffs is ____.
The probability that Barack Obama will win the 2012 Presidential election is ____.
The following are examples of inferences because the questions involve estimating a population value.
The proportion of NBA fans that currently believe the Miami Heat will win the 2012 playoffs is ____.
The proportion of eligible voters that currently state they will vote for Barack Obama in the 2012 Presidential election is ____.
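As a sketch of the inference case, here is how a poll estimate and a normal-approximation 95% confidence interval for such a proportion might be computed (the counts are made up for illustration):

```python
import math

# Hypothetical poll: 540 of 1,000 sampled NBA fans say the Heat will win.
n, successes = 1000, 540
p_hat = successes / n

# 95% confidence interval using the normal approximation.
se = math.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"estimate: {p_hat:.3f}, 95% CI: ({lo:.3f}, {hi:.3f})")
```

The interval quantifies uncertainty about the current population proportion; it says nothing about whether the Heat will actually win, which is the prediction question.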
Daniel Kahneman's recent book Thinking, Fast and Slow has many examples of our difficulties with probabilistic and statistical reasoning.
Statistical software and spreadsheet programs have made it relatively straightforward to carry out the science of data analysis. The art of data analysis involves answering questions such as: How should I frame my question quantitatively? What statistics should I use that will provide a convincing qualitative answer? Answers to these questions still challenge data analysis veterans in every field.
Some key questions to consider are:
1. How good are the data?
2. Could chance or bias explain the findings?
3. How do the current results compare with what is already known?
4. What theory or process might account for the findings?
5. What are the business or scientific implications?
1. Don’t rely on only one statistical measure such as the average. Examine several measures of central tendency such as the mean and median.
2. Don’t report a measure of central tendency without a measure of variability such as the standard deviation or interquartile range. Make sure to examine several measures of variability.
3. Large samples can’t fix poor quality data or data that was not collected.
4. If you don’t find a result in a small sample, even when the data is high quality, then you might need more data to see the result. Statisticians sometimes say that the data is too noisy to see the signal.
5. Don’t be overconfident in your interpretations and seek counter interpretations. If you don’t then others might.
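Points 1 and 2 can be illustrated with a small sketch using Python's standard statistics module. The income figures below are invented, with one outlier included to show how the mean and standard deviation can mislead on their own:

```python
import statistics

# Hypothetical annual incomes (thousands), with one outlier.
incomes = [32, 35, 38, 40, 41, 44, 250]

mean = statistics.mean(incomes)      # pulled upward by the outlier
median = statistics.median(incomes)  # robust to the outlier
sd = statistics.stdev(incomes)       # inflated by the outlier
q1, q2, q3 = statistics.quantiles(incomes, n=4)
iqr = q3 - q1                        # robust spread measure

print(f"mean={mean:.1f} median={median} sd={sd:.1f} IQR={iqr}")
```

Reporting only the mean and standard deviation here would paint a very different picture of a "typical" income than the median and IQR do.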
Causality is the relationship between cause and effect. For example, smoking does not lead to lung cancer in every smoker, but smoking increases the probability that a person will get lung cancer. In other words, smoking is a probabilistic cause of lung cancer.
To establish probabilistic causality using data, R. A. Fisher introduced randomized experimentation in 1935. Fisher’s ingenious idea was to isolate cause and effect by randomly assigning subjects to groups with and without the cause, and then comparing the effects. To do this, create two groups by flipping a coin so there is a 50:50 chance of belonging to either group. If the coin lands heads, the person is assigned to the group that will receive an e-mail ad including a coupon for a discount; if tails, the person receives the same e-mail without a coupon. After the coupon has expired, the click-through rate (CTR) and sales of the two groups are compared. Because the groups were formed by a simple coin toss, they should be similar in all respects (age, sex, previous purchase history) except receipt of the coupon. So, if there is a difference in CTR or sales, it was probably caused by the coupon.
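A minimal simulation of this design (the list size and click-through rates are assumed for illustration) might look like:

```python
import random

random.seed(42)
n = 10_000  # hypothetical mailing list

coupon_clicks = no_coupon_clicks = 0
coupon_n = no_coupon_n = 0

for _ in range(n):
    if random.random() < 0.5:  # heads: coupon group
        coupon_n += 1
        coupon_clicks += random.random() < 0.08   # assumed 8% CTR with coupon
    else:                      # tails: control group
        no_coupon_n += 1
        no_coupon_clicks += random.random() < 0.05  # assumed 5% CTR without

ctr_coupon = coupon_clicks / coupon_n
ctr_control = no_coupon_clicks / no_coupon_n
print(f"CTR with coupon: {ctr_coupon:.3f}, without: {ctr_control:.3f}")
```

Because the coin flip, not the people themselves, decides who gets the coupon, any sizeable difference in CTR between the groups can be attributed to the coupon.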
In many big data sets, such as those based on social networking sites, people’s connections are observed along with mentions, posts, and other data that they share. Social networking data from sites such as Facebook, Twitter, and Google+ are observational. So if an online ad with a discount results in a higher CTR and sales than the same ad without a discount, the causal question is: did the discount cause the increased CTR and sales (the effects), or are there other differences between the groups causing them? For example, did both groups contain similar numbers of men and women? Are men just as likely to buy from this company as women? If there are more men in the group without the coupon, then perhaps it’s gender driving the difference in CTR and sales, not the discount.
Three statistical methodologies that can be applied to observational data to explore if the ad with a discount caused an increased CTR and sales are: propensity scores; multivariable regression modeling; and instrumental variable analysis. These methods allow the evaluation of causation in observational big data sets.
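As a rough sketch of the adjustment idea behind these methods, the simulation below builds observational data in which gender confounds the discount–CTR relationship, then compares the crude estimate with a stratified estimate that averages the effect within each gender. Stratification here is a simple stand-in for propensity-score or regression adjustment, and all the numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical observational data: women are more likely both to see the
# discount ad AND to click, so the crude comparison is confounded by gender.
female = rng.random(n) < 0.5
saw_discount = rng.random(n) < np.where(female, 0.7, 0.3)
p_click = 0.03 + 0.04 * female + 0.02 * saw_discount  # true ad effect: +2 points
clicked = rng.random(n) < p_click

# Crude comparison: ignores gender entirely.
crude = clicked[saw_discount].mean() - clicked[~saw_discount].mean()

# Adjusted comparison: estimate the effect within each gender, then average.
effects = []
for g in (True, False):
    grp = female == g
    effects.append(clicked[grp & saw_discount].mean()
                   - clicked[grp & ~saw_discount].mean())
adjusted = sum(effects) / 2

print(f"crude effect: {crude:.3f}, adjusted effect: {adjusted:.3f}")
```

The crude estimate overstates the discount's effect because it absorbs the gender difference; the within-gender estimate recovers something close to the true +2 percentage points.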
Big data has big possibilities, to be sure, but age-old questions of causation still loom large.
Big data usually refers to very large data sets such as 1 million hourly retail transactions at Wal-Mart, mentions of a political candidate on Facebook, or crowdsourced information about deaths in a conflict. In these examples the samples could be classified as non-probability samples. This basically means that Wal-Mart sales, mentions of a candidate on Facebook, and counts of conflict deaths might not be representative of the population. What population? The populations of retail stores, eligible voters, and people at risk of death during the conflict, respectively.
One of the perils of using a non-probability sample is that it’s hard to accurately assess how popular a political candidate is among eligible voters who do and do not use Facebook when all you have is a large sample from Facebook. But it’s so tempting to extrapolate to the people outside of your sample, since there is so much data. Another pitfall is that with a big data set it’s possible to make a large number of false discoveries. How can an association be false if it’s based on thousands of data points? These dangers are counterintuitive at best and unimaginable at worst!
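The false-discovery pitfall is easy to demonstrate: test enough pure-noise variables against an outcome and roughly 5% of them will look "significant" by chance alone. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(7)
n, n_tests = 1000, 500

# Pure noise: an outcome and 500 completely unrelated predictors.
y = rng.normal(size=n)
X = rng.normal(size=(n, n_tests))

# Pearson correlation of each predictor with the outcome.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
ys = (y - y.mean()) / y.std()
r = Xs.T @ ys / n

# Approximate two-sided 5% threshold for a correlation under the null.
cutoff = 1.96 / np.sqrt(n)
false_hits = int((np.abs(r) > cutoff).sum())
print(f"{false_hits} of {n_tests} noise variables look 'significant'")
```

Each of these "discoveries" rests on a thousand data points, yet every single one is false; sample size alone does not protect against testing many hypotheses.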
Big complicated data is here to stay regardless of these caveats. Dealing with the limitations is one of the next frontiers in statistical science.
There are some techniques currently available for dealing with selection bias in non-probability samples. An overview of these statistical methods is available here.