Synthetic data is a powerful research tool when used wisely. Learn where it delivers value, where human research remains essential, and how to use it responsibly.
Synthetic data is the newest tool in the armory of insights professionals. It promises significant time and cost savings and has divided the insights industry into two camps. One camp is seduced by the spectacle of disruption and embraces hyperbolic declarations of the death of all other methods. The other camp is obstinately resistant to adopting a tool that challenges time-tested beliefs. As with most things, there are nuances, and both camps are partly right and partly wrong.
This white paper has been written to enable insights professionals to navigate the emerging and changing landscape of Synthetic Data.
Summary: Synthetic data is a useful tool in the armory of insights professionals, but its validity is a function of the quality of the training data set used.
Where to use: It is useful for hypothesis generation, refinement of concepts, screening of stimuli for testing with consumers, and augmenting human data for final testing and brand tracking.
Where not to use: It is insufficient for high-stakes decisions. It should not be used for pricing decisions, for fundamental category understanding, or in situations where emotional responses are being evaluated.
ESOMAR defines Synthetic data as “information that has been generated to replicate the characteristics of real-world data”. It encompasses artificially created information designed to act in place of data that would have been collected from real human participants.
What are the types of Synthetic Data: There are broadly two types of Synthetic data
A key difference is that synthetic persona produces dynamic responses to questions while synthetic data is static.
There are two key characteristics of good Synthetic Data – (1) Grounded in Solid training (2) Has Predictive Power
Synthetic data is particularly susceptible to the principle of GIGO: Garbage In, Garbage Out. Good synthetic data is trained on data that is contextually relevant, recent and robust.
Synthetic data trained on generic data, e.g., data available on the open web, is particularly poor. For instance, a Harvard study established that the more culturally distant a country was from the United States, the lower the correspondence between AI responses and human responses.

To ensure that the training data is contextually relevant, it should include a range of inputs. At a minimum, this should include habits data (source: Usage and Attitudes studies), behavioral data (source: consumer panels or other sources), attitudinal data (sources include brand equity data and Barriers and Triggers studies) and product preference data (sources include product testing, UX testing, ratings and reviews, etc.). It should also include data from secondary sources, including social media data and market intelligence, e.g., NielsenIQ, Mintel, Euromonitor. The training data should not be restricted to the category alone: broader context such as macro-economic data, e.g., inflation, should be included. There is increasing evidence that training data that is narrow can lead to errors.
It is important that the data used in training is robust. Fraud in market research is unfortunately not rare with estimates putting fraud rates at between 15-30%. Kantar reported that researchers discard up to 38% of online data due to fraud. If the training dataset is not robust then the effects of the fraud seep into the synthetic responses as well.
To ensure the robustness of the training data, it is imperative that it represents the variance in the underlying population. For instance, if synthetic data were being created on students in Grammar schools in the UK, it should reflect the fact that Asian pupils over-index in Grammar schools: 15-20% of Grammar school students are Asian, while only 9.6% of the population is Asian. It should also reflect the variance across schools, with schools like Queen Elizabeth’s School in Barnet (London) having up to 70% Asian pupils.
The training data needs to be recent. If there are significant changes in the market, changes in consumer habits or changes in factors like inflation which might impact responses then the Synthetic data will not be accurate.
The strongest pre-test tools have been validated to predict in-market results. For instance, Kantar has validated the Short-Term Sales Likelihood metric on Link (ad pre-testing) to be predictive of volume sales.
The predictive power of these tools rests on the principle of random sampling, which remains one of the fundamental bedrocks of market research. Random sampling gives every member of a population an equal and independent chance of being selected. Following this process allows the findings to be extrapolated to the universe and allows the findings to be replicated.
Good quality synthetic data should be able to predict what people will say and not merely replicate what people have said in the past. Synthetic data is not built off random sampling, and hence its results should be validated against in-market results. Validating against survey results is an intermediate step but has the problem of compounding errors, i.e., the difference between survey and in-market results is compounded by the difference between synthetic and survey results.

The decision on when and how to use Synthetic data should be based on (1) the level of risk associated with the decision being taken and (2) whether the question asked is in the realm of the experiential / emotional or of System 2, i.e., deliberate, rational processing.
A white paper done by Panoplai and twenty44 showed a descriptive accuracy of 91% and a predictive correlation of 0.9. IPSOS found that across 184 product tests the results were consistent across human and human-synthetic groups with an average correlation of +0.9. A study by Bain found that the “digital twins” replicated about 90% of the key outcomes of the original research.
All of these point to the fact that synthetic data is directionally correct and that in about 90% of cases the decision taken would have been correct. However, for decisions that involve high risks, e.g., investing tens of millions of dollars in starting a new plant, getting the decision wrong in 10% of cases would be very costly and unacceptable.
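The risk argument above can be made concrete with a simple expected-loss calculation. This is a minimal sketch, assuming the roughly 10% miss rate implied by the studies cited; the function name and both cost figures are hypothetical and chosen only for illustration.

```python
def expected_loss(p_wrong: float, cost_if_wrong: float) -> float:
    """Expected cost of acting on a signal that misleads with probability p_wrong."""
    if not 0.0 <= p_wrong <= 1.0:
        raise ValueError("p_wrong must be between 0 and 1")
    return p_wrong * cost_if_wrong


# Assumption: about 1 decision in 10 based on synthetic data goes the wrong way.
ERROR_RATE = 0.10

# Hypothetical costs of a wrong call (not taken from the article's sources).
low_stakes = expected_loss(ERROR_RATE, 50_000)        # e.g. re-running a concept screen
high_stakes = expected_loss(ERROR_RATE, 30_000_000)   # e.g. building a new plant

print(f"Low-stakes expected loss:  {low_stakes:,.0f}")
print(f"High-stakes expected loss: {high_stakes:,.0f}")
```

The error rate is identical in both cases; only the cost of being wrong changes, which is why the same tool can be acceptable for one decision and not for the other.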
The recent Bain study created synthetic data using a training set consisting of historical respondent-level data and ran the same tasks used in the original study, excluding the study itself from the training inputs. The results are directionally correct: the top two features are predicted correctly and the predicted importances are broadly accurate. It is worth noting, however, that the importance of feature 9 doubles from 5 among human respondents to 10 among synthetic respondents. If the objective were to find the top two features in order to develop communication around them, the results would be valid. If an entire new product line were being developed with a big R&D budget and a big investment in a production facility, then over-weighting the importance of feature 9 could be a costly misstep.

Synthetic data is good to use for low-risk decisions but not for high-risk decisions.
For questions in the realm of a System 2 response, i.e., a considered and rational decision, Synthetic data is directionally correct. Emotion, however, resists simulation.
The distinction between System 2 and emotional processing is not merely a research taxonomy; it reflects something fundamental about the architecture of human experience. Kahneman's dual-process model established that much of human decision-making bypasses deliberate reasoning entirely. The neuroscientist Antonio Damasio demonstrated that emotion is not a distortion of rational thought but its very substrate: patients with damage to the emotional centres of the brain, despite intact reasoning faculties, became catastrophically unable to make decisions. Emotion, in other words, is not noise in the signal. It is the signal.
This has profound implications for synthetic data. An AI model can be trained to report emotional responses, to say, in effect, "a person like me would feel excited by this." But reporting an emotion and having one are categorically different things. Phenomenologists like Merleau-Ponty went further still, arguing that human experience is irreducibly embodied: we do not merely think about the world but encounter it through a body with a history, a nervous system, and senses that no dataset can fully encode. The feeling of holding a pack, the involuntary recoil from a colour that reminds you of illness, the warmth triggered by a brand that featured in your childhood: these are not propositional states that can be retrieved from training data. They are lived. Synthetic data trained on what people say cannot reliably recover what people do when emotion is in the room.
This is the deeper reason — beyond methodological caution — why emotional questions sit outside the safe zone for synthetic data. It is not that the models are not yet good enough. It is that the phenomenon they are trying to simulate is, in important respects, irreducibly human.
At ESOMAR 2026 in Japan, a paper presented by Imran (MARS) and Gayatri (Toluna) found that synthetic respondents were able to mimic human responses when the processing of the ad was structural. However, synthetic respondents were unable to mimic human responses when the processing of the ad was experiential.
Human beings respond to stimuli using all five senses and are guided by context. Packaging is one area where small adjustments in font, color, position of logo, etc. can change the subconscious processing of the pack, making it more difficult (or easier) to identify on shelf.
Hence, Synthetic data should not be used for questions that are in the realm of the emotional or experiential.
The decision framework proposed is as follows:
| Risk Associated with the Decision | Type of Question: System 2 | Type of Question: Emotional / Experiential |
|---|---|---|
| High | Augment with Human Data | Do NOT use |
| Low | Use for directional information | Do NOT use |
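The decision framework above can be expressed as a small lookup, which may be handy when screening a list of research briefs. This is a sketch only: the enum and function names are hypothetical, and the four recommendations are copied from the framework as stated.

```python
from enum import Enum


class Risk(Enum):
    HIGH = "high"
    LOW = "low"


class QuestionType(Enum):
    SYSTEM_2 = "system 2"
    EMOTIONAL_EXPERIENTIAL = "emotional / experiential"


# One entry per cell of the framework.
FRAMEWORK = {
    (Risk.HIGH, QuestionType.SYSTEM_2): "Augment with human data",
    (Risk.HIGH, QuestionType.EMOTIONAL_EXPERIENTIAL): "Do NOT use",
    (Risk.LOW, QuestionType.SYSTEM_2): "Use for directional information",
    (Risk.LOW, QuestionType.EMOTIONAL_EXPERIENTIAL): "Do NOT use",
}


def recommend(risk: Risk, question_type: QuestionType) -> str:
    """Return the framework's recommendation for a given brief."""
    return FRAMEWORK[(risk, question_type)]


if __name__ == "__main__":
    for risk in Risk:
        for question_type in QuestionType:
            print(f"{risk.value:>4} risk, {question_type.value}: "
                  f"{recommend(risk, question_type)}")
```

Note that the question type dominates: an emotional or experiential question returns "Do NOT use" regardless of the level of risk.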
With this framework in mind, we can now explore various use cases where Synthetic data is fit for purpose.
Watch out – Using Synthetic Persona to generate persona runs the risk of losing out on “exploratory surprises” which come from meeting consumers and observing usage in real life contexts.
Watch out – Synthetic Persona will not be able to provide feedback on emotional aspects of the concept.
Watch out – Synthetic data is not good for evaluating new-to-the-world concepts and products. It is also not good for evaluating concepts that are reliant on emotions or products that evoke a new sensation.
Watch out – Standard significance testing was built to define confidence levels assuming random sampling. Synthetic data does not follow the principles of random sampling and so does not add to the statistical power in the same manner as an increase in human sample size does. This means that standard significance testing will reduce Type 2 errors and increase Type 1 errors i.e. probability of finding a difference where none exists. New and more sophisticated significance testing is required to account for different types of Synthetic data.
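To illustrate the point about significance testing, the sketch below runs a standard two-proportion z-test twice: once on the human sample alone, and once after naively counting synthetic rows as if they were additional independent respondents. The sample sizes and proportions are hypothetical, and the sketch makes a deliberately simple assumption: the synthetic rows mirror the human proportions exactly and so add no independent information.

```python
import math


def two_proportion_z_test(p1: float, n1: int, p2: float, n2: int) -> tuple[float, float]:
    """Standard pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Hypothetical study: top-two-box scores for concepts A and B.
HUMAN_N = 200       # human respondents per concept
SYNTHETIC_N = 800   # synthetic rows per concept, assumed to copy the human proportions
P_A, P_B = 0.50, 0.44

z_human, p_human = two_proportion_z_test(P_A, HUMAN_N, P_B, HUMAN_N)
z_naive, p_naive = two_proportion_z_test(
    P_A, HUMAN_N + SYNTHETIC_N, P_B, HUMAN_N + SYNTHETIC_N
)

print(f"Human only:      z = {z_human:.2f}, p = {p_human:.3f}")
print(f"Naively boosted: z = {z_naive:.2f}, p = {p_naive:.3f}")
```

Under these assumptions the same six-point gap is not significant at the 5% level on the human sample but appears significant once the synthetic rows are counted, even though no new evidence has been collected. Real synthetic data may carry some information, so the true effective sample size lies somewhere between the two cases.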
Watch out – The Synthetic data needs to be augmented and retrained in each fieldwork interval. The results are only valid for the scope of the survey, i.e., the brands and attributes being covered. It should not be used to extrapolate to brands or attributes not covered in the survey.
For tracking studies, it is important to ensure that the Synthetic data not just replicates the “average” but also preserves the skewness and kurtosis in the data.
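A simple way to check this is to compare the higher moments of the human and synthetic distributions, not just their means. The sketch below uses population (biased) moment formulas and two hypothetical sets of 1 to 5 ratings that were constructed to share the same average; the function name and the data are illustrative assumptions.

```python
from statistics import fmean


def shape_summary(values: list[float]) -> dict[str, float]:
    """Mean, skewness and excess kurtosis using population (biased) moments."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    if m2 == 0:
        raise ValueError("values have zero variance; shape is undefined")
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "mean": mean,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3,
    }


# Hypothetical 1-5 ratings with the same mean but very different spread.
human = [5] * 40 + [4] * 30 + [3] * 15 + [2] * 10 + [1] * 5
synthetic = [4] * 90 + [3] * 10

human_shape = shape_summary(human)
synthetic_shape = shape_summary(synthetic)

for name in ("mean", "skewness", "excess_kurtosis"):
    print(f"{name:>16}: human = {human_shape[name]:.2f}, "
          f"synthetic = {synthetic_shape[name]:.2f}")
```

In this example the two means match, so a tracker reporting only averages would show no difference, while the skewness and kurtosis reveal that the synthetic ratings have collapsed toward the middle of the scale.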
We already noted that Synthetic data is unsuitable for measuring emotional or experiential responses and that it is insufficient for high-stakes decisions. There are other use cases where Synthetic data is unsuitable.
Synthetic data, when properly constructed, can preserve the statistical properties of a dataset without retaining any identifiable individual, thereby providing reassurance on privacy. This is particularly valuable in categories where research touches sensitive territories like health, finance, sexual behavior, political beliefs etc.
However, the ethical picture is more complicated. At the time of fieldwork, respondents should be made aware that their responses will be used to generate further answers, and this is not always the case. Moreover, Synthetic data does not automatically confer privacy. If the training dataset is small or poorly anonymized, it is possible to reverse-engineer individual-level information from synthetic outputs, a vulnerability known as membership inference. Responsible use therefore requires that the training data itself be properly anonymized before it is used to generate synthetic responses. Another ethical consideration is the under-representation of certain sections of society in the training data sets.
The following is a checklist of questions that any insights professional should be asking partners seeking to provide synthetic data:
Synthetic data has a role to play in the armory of the Insights professional. It offers the possibility of significant time and cost savings. However, it is not the “silver bullet” or the “one stop shop” for all insights needs. It also requires significant thought on how to train the models and where to use the models.
1 Source: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs | https://arxiv.org/abs/2502.17424
2 Source: Greenbook | Fraud in Market Research and How to Fight It: Lessons From Rare Patient Voice | https://www.greenbook.org/insights/research-methodologies/fraud-in-market-research-and-how-to-fight-it-lessons-from-rare-patient-voice
3 Source: Comprehensive Future
4 Source: Kantar | https://www.kantar.com/inspiration/advertising-media/can-ad-testing-really-predict-sales-impact
5 Source: Panoplai and twenty44 | The Future of Customer Understanding. A New Framework for Digital Twin and Synthetic Data Validation.
6 Source: IPSOS | The Power of Product Testing with Synthetic Data. Humanising AI series, Part two.
7 Source: Bain | Synthetic Customers Earn Their Stripes | https://www.bain.com/insights/synthetic-customers-earn-their-stripes/
8 Source: Bain | Synthetic Customers Earn Their Stripes | https://www.bain.com/insights/synthetic-customers-earn-their-stripes/
9 Source: ESOMAR | Rethinking Pre-Testing. Synthetic Data?
Disclaimer
The views, opinions, data, and methodologies expressed above are those of the contributor(s) and do not necessarily reflect or represent the official policies, positions, or beliefs of Greenbook.