In data science, research design means planning the processes that will be followed in analyzing large data sets. The whole idea behind taking a thoughtful approach to research design is to ensure the integrity and accuracy of the findings. It covers everything from ethical considerations to the methods by which data is collected, all with the aim of avoiding inaccurate or irrelevant findings.
When it comes to big data, almost everyone focuses solely on the results of the analysis. It’s natural to look for quick answers, and big data has been promising some genuinely groundbreaking insights into problems that more conventional analytics have never been able to crack.
For data scientists themselves, however, the issue is considerably more subtle. When looking at a research report, what they want to know is the process used to perform the analysis. They’ll ask questions like:
- How large was the data set?
- What variables were included?
- What hypotheses were formulated before and after the analysis?
- What mechanisms were used to test the validity of the results?
This is because data scientists understand that the results can be influenced by the way the original questions were asked. The headlines are meaningless without an understanding of how the answers were arrived at. To avoid improper or misleading conclusions, data scientists design their data research projects carefully before placing their faith in the results.
Phrasing the Question Correctly Is Half the Battle in Research Design
For data scientists searching for insights and meaningful solutions from their analysis of very large data sets, half of the problem is deciding how to ask their questions in the first place.
In one well-known and widely reported study, two data analysts declared that their analysis of sales and outcomes data from the purchases of used cars had revealed the surprising insight that cars with an orange paint job were the least likely to have defects. As it turned out, that wasn’t actually the case at all.
The finding was astonishing because there is no apparent logical connection between paint color and a car's mechanical condition. Observers theorized that orange was often a custom color that only car buffs would choose, so of course those cars were more meticulously maintained, or that perhaps orange was used more often by a certain manufacturer, or on certain models of higher quality. None of these hypotheses held up to scrutiny, however. Mathematically, everything suggested that the finding was meaningless: nothing more than a coincidence that used cars with an orange paint job had fewer problems.
It turned out that the “big data” behind the study wasn't actually all that big. The flaw in the design, which the math couldn't reveal when calculated for the color comparison alone, was that the researchers had screened more than 70 factors, not just paint color, while examining too few car-sale records to support that many comparisons. With a relatively small data set, the odds became extremely high that at least one of the 70 factors would show a false correlation. The percentage difference in defects between orange cars and other colors was simply statistical noise, an artifact of the fewer than 75,000 records that had been examined.
Better research design could have prevented the original analytical error from ever seeing the light of day. By accounting for the possible effects of limited sample size, the data scientists who announced the results of the study could’ve saved their reputation, not to mention saving a lot of people from going out and buying orange cars.
The so-called Vast Search Effect is a real danger in big data analysis. When you are looking at millions of data points, you have to expect that a one-in-a-million result is going to appear. The numbers involved in many modern data research projects are so large that some particular finding will stand out, whether it is simply a random artifact or a genuine reflection of an unimagined reality.
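To make the effect concrete, here is a minimal simulation sketch in Python. The record and factor counts are hypothetical, loosely echoing the used-car study: every "factor" below is pure random noise, yet screening enough of them against a modest sample will almost always turn up one that looks predictive.

```python
# A minimal sketch of the Vast Search Effect: with enough candidate
# factors, a "significant" correlation appears by chance alone.
# All data here is pure noise, so any finding is a false positive.
import numpy as np

rng = np.random.default_rng(seed=42)
n_records = 1_000    # hypothetical small sample of car-sale records
n_factors = 70       # number of factors screened, as in the study

# A random binary outcome (defective / not defective) with no real
# relationship to any factor.
defective = rng.integers(0, 2, size=n_records)
factors = rng.normal(size=(n_records, n_factors))

# Correlate every factor with the outcome and keep the strongest.
correlations = [
    abs(np.corrcoef(factors[:, j], defective)[0, 1])
    for j in range(n_factors)
]
best = int(np.argmax(correlations))
print(f"Factor {best} looks 'predictive' (|r| = {correlations[best]:.3f}) "
      "even though every factor is pure noise.")
```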
Using Data Science Techniques To Avoid Common Pitfalls in Big Data Analytics
To combat these and other potential pitfalls in big data analysis, data scientists are trained to design their research projects to test and re-test their hypotheses in statistically rigorous ways. These methods can include:
- Partitioning the data into subsets to test the hypotheses.
- Testing hypotheses apparently confirmed in one analysis against a fresh data set to see if they prove to be predictive (see the sketch after this list).
- Using mathematical inference to generalize results and compare them against the findings of a specific case.
- Using data simulation to create a truly random target set to compare to genuine datasets.
- Running comparative visualizations to see results in different formats.
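As a concrete illustration of the first two items, here is a minimal sketch using hypothetical synthetic data in place of real car-sale records: the data is partitioned, an apparent effect is measured on one partition, and then re-checked on the held-out partition.

```python
# A minimal sketch of testing a hypothesis against a fresh partition:
# measure an apparent effect on one half of the data, then check
# whether it survives on records the analysis has never seen.
import numpy as np

rng = np.random.default_rng(seed=0)
n = 10_000
is_orange = rng.random(n) < 0.02   # rare paint color
defective = rng.random(n) < 0.10   # defect rate independent of color

# Partition the records into an exploration set and a holdout set.
half = n // 2
explore = slice(0, half)
holdout = slice(half, n)

def defect_rate_gap(idx):
    """Difference in defect rate: orange cars vs. all others."""
    orange_rate = defective[idx][is_orange[idx]].mean()
    other_rate = defective[idx][~is_orange[idx]].mean()
    return orange_rate - other_rate

print(f"Exploration-set gap: {defect_rate_gap(explore):+.3f}")
print(f"Holdout-set gap:     {defect_rate_gap(holdout):+.3f}")
# A gap that shrinks toward zero on the holdout set is a sign the
# original "insight" was statistical noise.
```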
Although it’s not yet common in data science generally, one method for ensuring good research design is peer-reviewed publication of research projects. This is the gold standard in other scientific fields. In cases where data science overlaps with medicine, physics, or other hard-science research, researchers are expected to provide enough information for others to reproduce their results on demand.
This is a challenge in some areas of data science, both because the massive data sets may not be either open or portable, and because they are sometimes composed of proprietary data. Nonetheless, journals like Scientific Data are beginning to emerge as forums where data scientists can publish peer-reviewed research projects.
Validating Data Sources
The same movement is pushing adoption of the Digital Object Identifier (DOI), an International Organization for Standardization (ISO) standard, as a consistent way for data scientists to cite the sources of their data.
Consistent citation is only one part of validating data sources. It’s necessary for reproducibility, but ensuring that the underlying data is clean, or at least that its muddy aspects are known and documented, is also an important part of research design.
Frequently, data scrubbing is necessary, either as part of the research process or before it starts. Carefully designed collection methods help ensure that the information coming out of an analysis is valid for answering the questions being asked. “Garbage in, garbage out” is a problem as old as computer science, and it has become a particular concern for data scientists.
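As a sketch of what that scrubbing step might look like in practice, a short pandas pipeline can make the cleaning rules explicit and repeatable. The file name and column names below are hypothetical, and the rules themselves are examples of the kinds of checks a research design might specify up front.

```python
# A minimal data-scrubbing sketch using pandas. The file and column
# names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("car_sales.csv")

# Drop exact duplicate records.
df = df.drop_duplicates()

# Normalize a categorical field so "Orange", "orange ", etc. match.
df["color"] = df["color"].str.strip().str.lower()

# Remove records with impossible values rather than guessing at them.
df = df[(df["price"] > 0) & (df["mileage"] >= 0)]

# Make missing data explicit and report it, instead of silently
# letting it skew the analysis.
missing = df.isna().sum()
print("Missing values per column:\n", missing[missing > 0])
```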
Considering Data Research Ethics
There is an increasing focus on ethical considerations in the research design component of data science projects. Facebook’s 2014 experiment in manipulating the emotions of users through alterations in their news feed, for the purposes of collecting data on them, was a wake-up call for both the industry and society in general.
The principle of informed consent as it pertains to research subjects is now a consideration for data scientists as they design the collection methods for their research projects. It may also apply to projects using previously collected information; if that data was expressly collected for another purpose, studying it for results the subjects never consented to can be unethical.
These considerations are increasingly built into research design today. Data scientists have had to learn as they are going, but by making mistakes and analyzing them, as with the used car example, they learn better design skills and produce more interesting and relevant results.