December 23, 2022
In the insights industry, experts have described 2022 as the Year of Data Quality. There is no doubt that it has been a hot topic of discussion and debate throughout the year. There is also common ground: most agree there is no silver bullet for data quality issues in surveys.
As the Swiss cheese model suggests, to have the best chance of preventing survey fraud and poor data quality, we need to approach the problem as layers of protection implemented throughout the research process.
To this end, the Insights Association Data Integrity Initiative Council has published a hands-on toolkit. It includes a Checks of Integrity Framework with concrete data integrity measures for all phases of survey research: pre-survey, in-survey, and post-survey.
What constitutes good data quality remains nebulous. We can agree on what is very bad data, such as gibberish open-ended responses. However, identifying poor-quality data is rarely so simple. Deciding which responses to keep or remove from a dataset is often a tough call, and those calls are often based on our own assumptions and our tolerance for imperfection.
Because objectively defining data quality is difficult, researchers have developed a wide range of in-survey checks, including instructional manipulation checks, low-incidence questions, speeder flags, straightlining detection, red-herring questions, and open-end review, that act as predictors of poor-quality participants. But, like data quality itself, these predictors are subjective in nature.
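As a rough illustration, a few of these checks can be scored programmatically. The Python sketch below uses hypothetical column names, thresholds, and a made-up "correct" attention-check answer; it is not the toolkit's implementation, and any real cutoffs would need to be justified for the study at hand.

```python
# A minimal sketch of scoring a few common in-survey checks.
# Column names, thresholds, and the attention-check answer are assumptions.
import pandas as pd

def straightlining_share(grid: pd.DataFrame) -> pd.Series:
    """Share of grid items answered with the respondent's most-used scale point."""
    mode_counts = grid.apply(lambda row: row.value_counts().iloc[0], axis=1)
    return mode_counts / grid.shape[1]

def flag_checks(df: pd.DataFrame) -> pd.DataFrame:
    grid_cols = [c for c in df.columns if c.startswith("grid_")]  # assumed naming
    flags = pd.DataFrame(index=df.index)
    # Straightlining: nearly identical answers across a rating grid.
    flags["straightliner"] = straightlining_share(df[grid_cols]) >= 0.9
    # Red herring: failed an attention-check item with a known correct answer.
    flags["failed_red_herring"] = df["attention_check"] != 3  # assumed correct answer
    # Open end: suspiciously short or single-repeated-character responses.
    oe = df["open_end"].fillna("").str.strip()
    flags["poor_open_end"] = (oe.str.len() < 5) | oe.str.fullmatch(r"(.)\1*")
    return flags

# Toy data for illustration
df = pd.DataFrame({
    "grid_1": [5, 2, 4], "grid_2": [5, 3, 4], "grid_3": [5, 1, 4],
    "attention_check": [3, 5, 3],
    "open_end": ["Liked the taste and price.", "asdfgh", "aaaa"],
})
print(flag_checks(df))
```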
The in-survey checks typically built into surveys inadvertently produce both false positives (valid respondents incorrectly flagged as problematic) and false negatives (problematic respondents incorrectly accepted as valid).
In fact, these in-survey checks may penalize human error too harshly while making it too easy for experienced participants, whether fraudsters or professional survey takers, to fall through the cracks. For example, most surveys exclude speeders: participants who complete the survey too quickly to have provided thoughtful responses.
While researchers are likely to agree on what is unreasonably fast (or bot-fast!), there is no consensus on what is just a little too fast. Is it the fastest 10% of the sample? Those who complete in less than 33% of the median duration?
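To make the ambiguity concrete, here is a minimal sketch of both rules, assuming a duration-in-seconds field and purely illustrative cutoffs:

```python
# A minimal sketch of two common speeder rules. The data and both thresholds
# are illustrative assumptions, not recommendations.
import pandas as pd

def flag_speeders(durations: pd.Series) -> pd.DataFrame:
    flags = pd.DataFrame(index=durations.index)
    # Rule 1: the fastest 10% of the sample.
    flags["fastest_decile"] = durations <= durations.quantile(0.10)
    # Rule 2: completing in less than 33% of the median duration.
    flags["below_33pct_of_median"] = durations < 0.33 * durations.median()
    return flags

durations = pd.Series([60, 150, 400, 450, 480, 500, 520, 540, 560, 600],
                      name="duration_seconds")
print(flag_speeders(durations).sum())
```

Even on this toy data, the two rules disagree about who counts as a speeder.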
The subjectivity baked into these rules can result in researchers flagging honest participants who simply read and process information faster, or who are less engaged with the category. At the same time, researchers might not flag participants with excessively long response times: the crawlers who may be translating the survey or fraudulently filling out more than one survey at a time.
These errors have a serious impact on the research. On the one hand, false positives can have negative consequences such as providing a poor survey experience and alienating honest participants.
If that is not a compelling enough reason to avoid false positives, consider the extra days of fieldwork needed to replace removed participants. On the other hand, false negatives can cause researchers to draw conclusions from dubious data, leading to bad business decisions.
Our ultimate goal as responsible researchers is to minimize these errors. To achieve this, it is critical that we shift our focus to understanding which data integrity measures are most effective at flagging the right participants. With this in mind, using advanced analytics (e.g., Root Likelihood in conjoint or MaxDiff) to identify randomly answering, poor-quality participants presents a huge opportunity.
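As a rough sketch of the idea (not any particular vendor's implementation): Root Likelihood (RLH) is the geometric mean of the probabilities an estimated choice model assigns to the answers a respondent actually gave, and purely random answering in tasks with k alternatives lands near 1/k. The probabilities and cutoff below are made up for illustration; in practice they would come from a fitted model such as a hierarchical Bayes logit.

```python
# A minimal sketch of the Root Likelihood (RLH) idea. The per-task
# probabilities and the flagging cutoff are illustrative assumptions.
import numpy as np

def root_likelihood(chosen_probs: np.ndarray) -> float:
    """Geometric mean of the model probabilities of the chosen alternatives."""
    return float(np.exp(np.mean(np.log(chosen_probs))))

n_alternatives = 4                    # alternatives shown per choice task
chance_rlh = 1.0 / n_alternatives     # RLH floor for purely random answers

respondents = {
    "engaged": np.array([0.62, 0.55, 0.71, 0.48, 0.66]),  # model fits well
    "random":  np.array([0.26, 0.24, 0.27, 0.23, 0.25]),  # hovers near chance
}

for rid, probs in respondents.items():
    rlh = root_likelihood(probs)
    flagged = rlh < 1.3 * chance_rlh  # illustrative cutoff, not a standard
    print(f"{rid}: RLH={rlh:.2f}, flagged={flagged}")
```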
In 2022, much worthwhile effort was devoted to raising awareness and educating insights professionals, especially on how to identify and mitigate issues in survey response quality. Moving forward, researchers need a better understanding of which data integrity measures are most effective at objectively identifying problematic respondents in order to minimize false positives and false negatives.