Tuesday, February 4, 2014
It seems that everyone is talking about “big data.” The New York Times deemed this the “Age of Big Data” in a 2012 article and a Google search for the term yields over 14 million hits. I am spending a sabbatical semester at the Centers for Disease Control and Prevention, and here too there is much enthusiastic discussion of big data. Large databases containing electronic health records (EHR) and genomic data from millions of patients have the potential to facilitate medical discoveries, improve health care outcomes, and promote public health. Big data resources are being established by government agencies such as the Centers for Medicare and Medicaid Services, the Food and Drug Administration, and the Department of Veterans Affairs. They are also being launched by private entities such as Geisinger Health Systems, Explorys, the Electronic Medical Records and Genomics Network (eMERGE), and many others.
While the promise of big data is considerable, so are the challenges it poses. First, big data collections may be populated by data that is of poor quality, incomplete, or even deliberately distorted. Second, very careful attention must be paid to study design in order to avoid biases and other flaws.
Entering data into an EHR takes a lot of work, and some clinicians will be sloppy, rushed, or not adequately adept at using the system. For example, treating physicians may transpose numbers when typing quickly, click on the wrong menu items, forget to edit narrative that they have copied and pasted from a prior visit, or exaggerate the services provided during a visit in order to maximize billing. Yes, that means that your EHR may contain significant mistakes, and when it is included in a research database, it will contribute inaccurate information. Several studies have focused on EHR error rates and have estimated them to range from 2.3% to 26.9%. In addition, outcome data for treatments are sometimes missing from records. To illustrate, patients who are given antibiotics often do not return for follow-up. Therefore, the physician’s record will not reflect whether the patient got better or got worse and went to a specialist for other treatments. The lack of outcome data can further erode the reliability of big data collections.
Second, only skilled experts can properly design and analyze research studies, and even they make mistakes at times. One has to ensure that the sample of records studied is representative of the population of interest; otherwise, a risk of selection bias exists. In addition, investigators must account for all relevant variables so that omitted confounders do not distort the results.
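The confounding problem described above can be seen in miniature with a classic Simpson’s-paradox table. In the sketch below, the counts are invented purely for illustration: a “severity” variable drives both which treatment a patient receives and how well the patient does, so the naive overall comparison reverses the within-stratum comparison.

```python
# Hypothetical recovery counts (invented for illustration) showing how an
# omitted confounder -- disease severity -- can flip a naive comparison.
# Sicker patients disproportionately receive treatment A, dragging down
# A's overall average even though A does better within each severity group.

# (recoveries, patients) for each (treatment, severity) stratum
data = {
    ("A", "mild"):   (81, 87),
    ("A", "severe"): (192, 263),
    ("B", "mild"):   (234, 270),
    ("B", "severe"): (55, 80),
}

def rate(pairs):
    """Pooled recovery rate over a list of (recoveries, patients) pairs."""
    recovered = sum(r for r, n in pairs)
    total = sum(n for r, n in pairs)
    return recovered / total

# Naive comparison, ignoring severity: B looks better overall.
for t in ("A", "B"):
    overall = rate([v for (tt, s), v in data.items() if tt == t])
    print(f"Treatment {t} overall recovery: {overall:.0%}")

# Stratified comparison: A is better in BOTH severity groups.
for s in ("mild", "severe"):
    for t in ("A", "B"):
        r, n = data[(t, s)]
        print(f"  {s:>6}, treatment {t}: {r/n:.0%}")
```

A study that failed to record severity would conclude that treatment B is superior, which is exactly backwards; this is the kind of design flaw skilled analysts must guard against.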
A stark example of big data gone wrong is a study published in a peer-reviewed journal in 2009 that posited a causal link between abortion and psychiatric disorders. The conclusion was debunked in 2012 following a review that revealed the study’s many flaws, such as a failure to exclude women who had mental health problems before they had an abortion. However, in the interim, several states passed laws requiring doctors to warn women who sought abortions that they may suffer mental health problems.
Big data holds great promise for the fields of medicine and public health. Nevertheless, we must approach its use with caution and with a deep understanding of its potential pitfalls. You can read more about these issues in my articles:
Big Bad Data: Law, Public Health, and Biomedical Databases at http://www.ncvhs.hhs.gov/130430b6.pdf, and
The Use and Misuse of Biomedical Data: Is Bigger Really Better? at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2235267
-Guest Blogger Sharona Hoffman