Big data and large sample size: a cautionary note on the potential for bias

Robert M Kaplan; David A Chambers; Russell E Glasgow

doi:10.1111/cts.12178

Clin Transl Sci. 2014 Aug;7(4):342-6. doi: 10.1111/cts.12178. Epub 2014 Jul 15.

Authors

Robert M Kaplan¹, David A Chambers, Russell E Glasgow

Affiliation

¹ Office of Behavioral and Social Sciences Research and Department of Rehabilitation Medicine, National Institutes of Health, Bethesda, Maryland, USA.

Abstract

A number of commentaries have suggested that large studies are more reliable than smaller studies, and there is growing interest in the analysis of "big data" that integrates information from many thousands of persons and/or different data sources. We consider a variety of biases that are likely in the era of big data, including sampling error, measurement error, multiple comparisons errors, aggregation error, and errors associated with the systematic exclusion of information. Using examples from epidemiology, health services research, studies on determinants of health, and clinical trials, we conclude that greater caution is needed to ensure that big sample size does not lead to big inferential errors. Despite the advantages of big studies, large sample size can magnify the bias associated with error resulting from sampling or study design.
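
The closing claim of the abstract can be illustrated with a small simulation. The sketch below is not taken from the article; it assumes a hypothetical population with true mean 0 and a sampling process that systematically shifts observed values by 0.5 (the names `biased_sample_ci`, `selection_shift`, and all numeric values are illustrative assumptions). Under those assumptions, a larger sample leaves the bias unchanged while narrowing the confidence interval, so the interval becomes more precisely centered on the wrong value.

```python
import math
import random
import statistics


def biased_sample_ci(n, true_mean=0.0, selection_shift=0.5, seed=0):
    """Return (mean, low, high) for a sample of size n drawn with a systematic shift.

    `selection_shift` is a hypothetical stand-in for bias from sampling or
    study design: every observation is drawn from a distribution centered at
    true_mean + selection_shift rather than at true_mean.
    """
    rng = random.Random(seed)
    sample = [rng.gauss(true_mean + selection_shift, 1.0) for _ in range(n)]
    mean = statistics.fmean(sample)
    # 95% interval using the normal approximation for the standard error.
    half_width = 1.96 * statistics.stdev(sample) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


if __name__ == "__main__":
    true_mean = 0.0
    for n in (10, 100, 10_000, 100_000):
        mean, low, high = biased_sample_ci(n, true_mean=true_mean)
        covered = low <= true_mean <= high
        print(f"n={n:>7}  mean={mean:.3f}  95% CI=({low:.3f}, {high:.3f})  "
              f"covers true mean: {covered}")
```

In this toy setting the interval width shrinks roughly with the square root of the sample size, whereas the offset from the true mean stays near the assumed shift. Real data sets involve biases that are neither constant nor known, so this is only a caricature of the mechanism the abstract describes.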

Keywords: bias; big data; research methods; sampling.

Published 2014. This article is a U.S. Government work and is in the public domain in the USA.

MeSH terms

Bias*
Databases as Topic* / standards
Electronic Health Records
Epidemiologic Methods
Humans
Randomized Controlled Trials as Topic
Reproducibility of Results
Sample Size