Big data usually refers to very large data sets, such as 1 million hourly retail transactions at Wal-Mart, mentions of a political candidate on Facebook, or crowdsourced information about deaths in a conflict. In each of these examples the sample would be classified as a non-probability sample, which means that Wal-Mart sales, mentions of a candidate on Facebook, and counts of conflict deaths might not be representative of the population. What population? The populations of, respectively: retail stores; eligible voters; and people at risk of death during the conflict.
One peril of using a non-probability sample is that it is hard to accurately assess how popular a political candidate is among eligible voters who do and do not use Facebook when all you have is a sample from Facebook, however large. Yet it is tempting to extrapolate to people outside of the sample precisely because there is so much data. Another pitfall is that a big data set makes it possible to make a large number of false discoveries. How can an association be false if it is based on thousands of data points? These dangers are counterintuitive at best and unimaginable at worst!
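To see how false discoveries arise, consider the following sketch. It generates an outcome and hundreds of predictor variables that are all pure, independent noise, then checks how many predictors appear "significantly" correlated with the outcome anyway. The sample size, number of features, and random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_features = 1000, 500  # hypothetical sizes for illustration

# Outcome and features are independent noise, so every "discovery" is false.
y = rng.normal(size=n_obs)
X = rng.normal(size=(n_obs, n_features))

# Sample correlation of each feature with the outcome.
yc = (y - y.mean()) / y.std()
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
r = Xc.T @ yc / n_obs

# Under the null hypothesis, each r is approximately Normal(0, 1/sqrt(n_obs)),
# so |r| > 1.96/sqrt(n_obs) mimics a two-sided test at the 0.05 level.
n_false = int(np.sum(np.abs(r) > 1.96 / np.sqrt(n_obs)))
print(f"{n_false} of {n_features} pure-noise features look 'significant'")
```

With a 0.05 significance threshold, roughly 5% of the noise features (about 25 of 500) will look significant by chance alone, and screening thousands of variables inflates that count proportionally.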
Big, complicated data is here to stay regardless of these caveats, and dealing with its limitations is one of the next frontiers of statistical science.
There are some techniques currently available for dealing with selection bias in non-probability samples. An overview of these statistical methods is available here.
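One such technique is post-stratification weighting, sketched below under hypothetical assumptions: suppose platform users and non-users support a candidate at different rates, and a convenience sample over-represents users. Reweighting each group to its known population share can recover an estimate close to the truth, while the naive sample mean stays biased no matter how large the sample grows. All rates and shares here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical truth: users and non-users differ in candidate support.
support = {"user": 0.60, "non_user": 0.40}
pop_share = {"user": 0.40, "non_user": 0.60}   # known population shares

# A convenience sample over-represents users: 90% of it vs 40% of the population.
n = 100_000
groups = rng.choice(["user", "non_user"], size=n, p=[0.9, 0.1])
responses = rng.random(n) < np.where(groups == "user",
                                     support["user"], support["non_user"])

naive = responses.mean()  # pulled toward the users' opinion

# Post-stratification: weight each respondent by (population share / sample share).
weights = np.where(groups == "user",
                   pop_share["user"] / 0.9, pop_share["non_user"] / 0.1)
adjusted = np.average(responses, weights=weights)

true_support = sum(support[g] * pop_share[g] for g in support)
print(f"true {true_support:.2f}  naive {naive:.3f}  adjusted {adjusted:.3f}")
```

The naive estimate settles near 0.58 while the true support is 0.48; the weighted estimate lands close to the truth. The catch, of course, is that this adjustment requires knowing the population shares and assumes respondents within each group are representative of that group, which is exactly what a non-probability sample cannot guarantee.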