Synthetic Data: A White Paper on Fundamentals

Synthetic data is a powerful research tool when used wisely. Learn where it delivers value, where human research remains essential, and how to use it responsibly.

by Vijay Raj

Independent Consultant at Incite Growth

Synthetic data is the newest tool in the armory of insights professionals. It promises significant time and cost savings and has divided the Insights industry into two camps. One camp is seduced by the spectacle of disruption and embraces hyperbolic declarations of the death of all other methods. The other camp is obstinately resistant to adopting a tool that challenges time-tested beliefs. As with most things, there are nuances, and both camps are partly right and partly wrong.

This white paper has been written to enable insights professionals to navigate the emerging and changing landscape of Synthetic Data.

Summary: Synthetic data is a useful tool in the armory of insights professionals but the validity of the synthetic data is a function of the quality of the training data set used.

Where to use: It is useful for hypothesis generation, refinement of concepts, screening of stimuli for testing with consumers, and augmenting human data in final testing and brand tracking.

Where not to use: It is insufficient for high-stakes decisions. It should not be used for pricing decisions, in situations where emotional responses are being evaluated, or for fundamental category understanding.

What is Synthetic Data

ESOMAR defines Synthetic data as “information that has been generated to replicate the characteristics of real-world data”. It encompasses artificially created information designed to stand in for data that would have been collected from real human participants.

What are the types of Synthetic Data: There are broadly two types of Synthetic data.

Synthetic Persona: These are agents that can engage in conversations and generate answers. They are designed to replicate the attitudes and behaviors of real individuals or groups of individuals, i.e. segments.
Synthetic Data: This is data generated to simulate how human beings would have responded to a survey.

A key difference is that a synthetic persona produces dynamic responses to questions, while synthetic data is static.

What are the Characteristics of Good Synthetic Data

There are two key characteristics of good Synthetic Data: (1) Grounded in Solid Training (2) Has Predictive Power

1. Grounded in Solid Training

Synthetic data is particularly susceptible to the principle of GIGO – Garbage In Garbage Out. Good synthetic data is trained on data which is contextually relevant, recent and robust.

Synthetic data trained on generic data, e.g. data available on the open web, is particularly poor. For instance, a Harvard study established that the more culturally distant a country was from the United States, the lower the correspondence between AI responses and human responses.

To ensure that the training data is contextually relevant, it should include a range of inputs. At a minimum, this should include habits data (source: Usage and Attitudes studies), behavioral data (source: consumer panels or other sources), attitudinal data (sources include brand equity data and Barriers and Triggers studies) and product preference data (sources include product testing, UX testing, ratings and reviews). It should also include data from secondary sources, including social media data and market intelligence, e.g. NielsenIQ, Mintel and Euromonitor. The training data should not be restricted to the category alone; broader context such as macro-economic data, e.g. inflation, should be included. There is increasing evidence that narrow training data can lead to errors.

It is important that the data used in training is robust. Fraud in market research is unfortunately not rare, with estimates putting fraud rates at between 15% and 30%. Kantar reported that researchers discard up to 38% of online data due to fraud. If the training dataset is not robust, the effects of the fraud seep into the synthetic responses as well.

To ensure the robustness of the training data, it is imperative that it represents the variance in the population. For instance, if synthetic data were being created on students in Grammar schools in the UK, it should reflect the fact that Asian pupils over-index in Grammar schools: 15-20% of Grammar school students are Asian, while only 9.6% of the population is Asian. It should also reflect the variance across schools, with schools like Queen Elizabeth’s School in Barnet (London) having up to 70% Asian pupils.
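The degree of over-indexing in this example can be expressed as a simple index. The convention assumed here, in which 100 means a group is represented exactly in line with its population share, is an illustration and not something the cited source prescribes; the function name is likewise hypothetical.

```python
def representation_index(share_in_group: float, share_in_population: float) -> float:
    """Index of 100 = in line with the population; above 100 = over-indexing."""
    return share_in_group / share_in_population * 100


# Figures quoted in the text: 15-20% of Grammar school pupils vs. 9.6% of the population.
low = representation_index(15.0, 9.6)
high = representation_index(20.0, 9.6)
print(f"Index range: {low:.0f} to {high:.0f}")
```

On the quoted figures the index runs from roughly 156 to 208. A training set whose index for this group sits near 100 would therefore understate the group's real presence in Grammar schools.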

The training data needs to be recent. If there have been significant changes in the market, in consumer habits or in factors like inflation that might affect responses, Synthetic data trained on older data will not be accurate.

2. Has Predictive Power

The strongest pre-test tools have been validated to predict in-market results. For instance, Kantar has validated the Short-Term Sales Likelihood metric on Link (ad pre-testing) as predictive of volume sales.

The predictive power of these tools rests on the principle of random sampling, which remains one of the bedrocks of market research. Random sampling gives every member of a population an equal and independent chance of being selected. Following this process allows the findings to be extrapolated to the universe and makes them replicable.

Good-quality synthetic data should be able to predict what people will say, not merely replicate what people have said in the past. Synthetic data is not built on random sampling, and hence its results should be validated against in-market results. Validating against survey results is an intermediate step but has the problem of compounding errors, i.e. the difference between survey and in-market results is compounded by the difference between synthetic and survey results.
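The compounding of errors can be illustrated with a small simulation. This is a sketch under stated assumptions, not a validation method: the noise levels are hypothetical, and the survey error and the synthetic error are assumed to be independent and normally distributed.

```python
import math
import random

random.seed(7)

N_PRODUCTS = 5000
SURVEY_NOISE_SD = 1.0      # hypothetical survey-vs-market error
SYNTHETIC_NOISE_SD = 1.0   # hypothetical synthetic-vs-survey error


def rmse(a, b):
    """Root mean squared difference between two equal-length lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# In-market result, then a survey that measures it with error,
# then synthetic data that reproduces the survey with further error.
market = [random.gauss(50, 10) for _ in range(N_PRODUCTS)]
survey = [m + random.gauss(0, SURVEY_NOISE_SD) for m in market]
synthetic = [s + random.gauss(0, SYNTHETIC_NOISE_SD) for s in survey]

survey_vs_market = rmse(survey, market)
synthetic_vs_survey = rmse(synthetic, survey)
synthetic_vs_market = rmse(synthetic, market)

print(f"survey vs. market:    {survey_vs_market:.2f}")
print(f"synthetic vs. survey: {synthetic_vs_survey:.2f}")
print(f"synthetic vs. market: {synthetic_vs_market:.2f}")
```

Under these assumptions the synthetic-versus-market error is larger than either component error, because independent errors add in variance. A synthetic model that looks accurate against survey data can therefore still sit further from in-market results than the survey itself does.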

When to Use Synthetic Data

The decision on when and how to use Synthetic data should be based on (1) the level of risk associated with the decision being taken and (2) whether the question asked is in the realm of the experiential / emotional or of System 2, i.e. deliberate, rational processing.

1. Level of risk associated with the decision being taken

A white paper by Panoplai and twenty44 showed a descriptive accuracy of 91% and a predictive correlation of 0.9. IPSOS found that, across 184 product tests, the results were consistent across human and human-synthetic groups, with an average correlation of +0.9. A study by Bain found that the “digital twins” replicated about 90% of the key outcomes of the original research.

Taken together, these results indicate that synthetic data is directionally correct; read loosely, they suggest that in roughly 90% of cases the decision taken would have been correct. However, for decisions that involve high risk, e.g. investing tens of millions of dollars in a new plant, getting the decision wrong in 10% of cases would be very costly and unacceptable.
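The asymmetry between low-risk and high-risk decisions can be made concrete with a simple expected-loss calculation. The 10% error rate is taken from the reasoning above; the two cost figures are hypothetical placeholders and do not come from any of the cited studies.

```python
P_WRONG = 0.10  # share of cases where the synthetic-based decision would be wrong


def expected_loss(cost_if_wrong: float, p_wrong: float = P_WRONG) -> float:
    """Expected cost of acting on a decision that is wrong with probability p_wrong."""
    return p_wrong * cost_if_wrong


# Hypothetical stakes, in dollars.
decisions = {
    "shortlisting concepts": 50_000,
    "building a new plant": 30_000_000,
}

for name, cost in decisions.items():
    print(f"{name}: expected loss ${expected_loss(cost):,.0f}")
```

The same error rate that costs a few thousand dollars on a screening decision costs millions on a capital investment, which is why the acceptable level of accuracy depends on what is at stake.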

The recent Bain study created synthetic data using a training set of historical respondent-level data and ran the same tasks used in the original study, excluding the study itself from the training inputs. The results are directionally correct: the top two features are predicted correctly and the predicted importances are broadly accurate. It is worth noting, however, that the importance of feature 9 doubles from 5 among human respondents to 10 among synthetic respondents. If the objective were to find the top two features in order to develop communication around them, the results would be valid. If an entire new product line were being developed with a big R&D budget and a big investment in a production facility, then over-weighting the importance of feature 9 could be a costly misstep.

Synthetic data is good to use for low-risk decisions but not for high-risk decisions.

2. Whether the question asked is in the realm of experiential / emotional or System 2

For questions that are in the realm of a System 2 response, i.e. a considered and rational decision, Synthetic data is directionally correct. Emotion, however, resists simulation.

The distinction between System 2 and emotional processing is not merely a research taxonomy; it reflects something fundamental about the architecture of human experience. Kahneman's dual-process model established that much of human decision-making bypasses deliberate reasoning entirely. The neuroscientist Antonio Damasio demonstrated that emotion is not a distortion of rational thought but its very substrate: patients with damage to the emotional centres of the brain, despite intact reasoning faculties, became catastrophically unable to make decisions. Emotion, in other words, is not noise in the signal. It is the signal.

This has profound implications for synthetic data. An AI model can be trained to report emotional responses, to say, in effect, "a person like me would feel excited by this." But reporting an emotion and having one are categorically different things. Phenomenologists like Merleau-Ponty went further still, arguing that human experience is irreducibly embodied: we do not merely think about the world but encounter it through a body with a history, a nervous system, and senses that no dataset can fully encode. The feeling of holding a pack, the involuntary recoil from a colour that reminds you of illness, the warmth triggered by a brand that featured in your childhood: these are not propositional states that can be retrieved from training data. They are lived. Synthetic data trained on what people say cannot reliably recover what people do when emotion is in the room.

This is the deeper reason — beyond methodological caution — why emotional questions sit outside the safe zone for synthetic data. It is not that the models are not yet good enough. It is that the phenomenon they are trying to simulate is, in important respects, irreducibly human.

At ESOMAR 2026 in Japan, a paper presented by Imran (MARS) and Gayatri (Toluna) found that synthetic respondents were able to mimic human responses when the processing of the ad was structural. However, synthetic respondents were unable to mimic human responses when the processing of the ad was experiential.

Human beings respond to a stimulus using all five senses and are guided by the context. Packaging is one area where small adjustments in font, color, position of logo etc. can change the subconscious processing of the pack, making it more difficult (or easier) to identify the pack on shelf.

Hence, Synthetic data should not be used for questions that are in the realm of the emotional or experiential.

The decision framework proposed is as follows:

Risk Associated with the Decision	Type of Question: System 2	Type of Question: Emotional / Experiential
High	Augment with Human Data	Do NOT use
Low	Use for directional information	Do NOT use
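The framework above can be encoded as a small lookup, for instance as a gate in a research-briefing workflow. This is a minimal sketch: the enum and function names are hypothetical, and only the four recommendations come from the framework itself.

```python
from enum import Enum


class Risk(Enum):
    HIGH = "high"
    LOW = "low"


class QuestionType(Enum):
    SYSTEM_2 = "system 2"
    EMOTIONAL_EXPERIENTIAL = "emotional / experiential"


def recommend(risk: Risk, question: QuestionType) -> str:
    """Return the framework's recommendation for a risk level and question type."""
    # Emotional / experiential questions are excluded regardless of risk.
    if question is QuestionType.EMOTIONAL_EXPERIENTIAL:
        return "Do NOT use"
    if risk is Risk.HIGH:
        return "Augment with human data"
    return "Use for directional information"


print(recommend(Risk.LOW, QuestionType.SYSTEM_2))
```

Checking the question type first mirrors the framework's logic: the type of question acts as a hard filter, and risk only matters once a question is known to be in System 2 territory.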

How to Use Synthetic Data

With this framework in mind, we can now explore the use cases where Synthetic data is fit for purpose.

Generation of hypotheses – This is a good use case for Synthetic Persona. This step can be invaluable in identifying insights or refining hypotheses prior to research among humans.

Watch out – Using Synthetic Persona to generate hypotheses runs the risk of losing out on “exploratory surprises”, which come from meeting consumers and observing usage in real-life contexts.

Refinement of concepts – This is a good use case for Synthetic Persona and can help save time and money. The Synthetic Persona can provide quick early feedback that is useful to refine concepts. This can be particularly useful in situations that involve a workshop or brainstorming sessions where the feedback can help in iteratively improving the concepts.

Watch out – Synthetic Persona will not be able to provide feedback on emotional aspects of the concept.

Screening of stimulus – Often teams have multiple concepts or products to be tested. Synthetic data can be used to shortlist concepts and products to a more manageable number to be tested with consumers. This use case of Synthetic data can help save costs.

Watch out – Synthetic data is not good for evaluating new-to-the-world concepts and products. It is also not good for evaluating concepts that are reliant on emotions or products that evoke a new sensation.

Augment with human data for final testing of concepts and products – In most final testing, there are boosters for specific target groups to better understand their responses, but the decision is based on the random sample. In such cases, it is possible to use synthetic data in lieu of the booster quotas. This use case can help reduce the cost associated with the boosters and the time required to find these respondents.

Watch out – Standard significance testing was built to define confidence levels assuming random sampling. Synthetic data does not follow the principles of random sampling and so does not add to statistical power in the same way that an increase in human sample size does. Applying standard significance tests to an augmented sample therefore overstates the evidence: it reduces Type 2 errors but increases Type 1 errors, i.e. the probability of finding a difference where none exists. New and more sophisticated significance testing is required to account for different types of Synthetic data.
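The inflation of Type 1 errors can be demonstrated with a small simulation. This is a sketch under loudly stated assumptions: the sample sizes are hypothetical, and the way synthetic rows are generated here (a normal fit to the human sample) is a deliberately simple stand-in, not how any vendor builds synthetic data.

```python
import math
import random
import statistics

random.seed(42)

N_HUMAN = 50        # human respondents per cell (hypothetical)
N_SYNTHETIC = 200   # synthetic respondents added per cell (hypothetical)
N_TRIALS = 1000
Z_CRITICAL = 1.96   # two-sided test at the 5% level


def is_significant(a, b):
    """Naive two-sample z-test that treats every row as an independent respondent."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return abs(statistics.mean(a) - statistics.mean(b)) / se > Z_CRITICAL


def augment(human):
    """Add synthetic rows drawn from a normal fit to the human sample (an assumption)."""
    mu, sd = statistics.mean(human), statistics.stdev(human)
    return human + [random.gauss(mu, sd) for _ in range(N_SYNTHETIC)]


human_hits = augmented_hits = 0
for _ in range(N_TRIALS):
    # Both cells come from the SAME population, so any "difference" is a false positive.
    cell_a = [random.gauss(0, 1) for _ in range(N_HUMAN)]
    cell_b = [random.gauss(0, 1) for _ in range(N_HUMAN)]
    human_hits += is_significant(cell_a, cell_b)
    augmented_hits += is_significant(augment(cell_a), augment(cell_b))

human_rate = human_hits / N_TRIALS
augmented_rate = augmented_hits / N_TRIALS
print(f"False-positive rate, human only: {human_rate:.1%}")
print(f"False-positive rate, augmented:  {augmented_rate:.1%}")
```

In this setup the human-only test stays close to its nominal 5% false-positive rate, while the augmented test rejects far more often. The synthetic rows echo the sampling noise of the human rows they were fitted to, yet the test counts them as fresh, independent evidence.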

Augment with human data for brand tracking – Brand tracking is designed to keep a pulse on changing brand equity across the spectrum of awareness, usage and perception. Synthetic data trained only on past data will not be able to pick up changes resulting from changes in elements of the marketing mix. For instance, the impact of Cracker Barrel removing its iconic “Old Timer” mascot and bringing it back would not be captured by data trained on the past. It is, however, possible to use the current wave of human data to train Synthetic data and use it to augment the human data. This could result in a saving in costs.

Watch out – The Synthetic data needs to be retrained and augmented in each fieldwork interval. The results are only valid for the scope of the survey, i.e. the brands and attributes covered. It should not be used to extrapolate to brands or attributes not covered in the survey.

For tracking studies, it is important to ensure that the Synthetic data not only replicates the “average” but also preserves the skewness and kurtosis of the data.
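A minimal sketch of such a check follows. The two sets of ratings are hypothetical and constructed so that their means match exactly; the moment formulas are the standard population forms.

```python
import statistics


def skewness(values):
    """Third standardised moment (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


def excess_kurtosis(values):
    """Fourth standardised moment minus 3 (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 4 for v in values) / (len(values) * sd ** 4) - 3


# Hypothetical 5-point ratings from 100 respondents; both sets average 3.9.
human = [5] * 50 + [4] * 20 + [3] * 10 + [2] * 10 + [1] * 10
synthetic = [4] * 90 + [3] * 10

for label, data in (("human", human), ("synthetic", synthetic)):
    print(
        f"{label}: mean={statistics.fmean(data):.2f} "
        f"skewness={skewness(data):.2f} excess kurtosis={excess_kurtosis(data):.2f}"
    )
```

Both sets report the same average, yet the synthetic set has lost the enthusiasts and the detractors that the human data contains. Comparing only means would pass this synthetic data; comparing the shape of the distribution would not.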

Where to NOT use Synthetic Data

We already noted that Synthetic data is unsuitable for measuring emotional or experiential responses and that it is insufficient for high-stakes decisions. There are other use cases where Synthetic data is unsuitable.

Decisions on pricing - Pricing is one element of the mix where Synthetic data is unsuitable. This is because willingness to pay is a highly irrational decision that is neatly post-rationalised, and Synthetic data struggles with this “say-do gap”. For instance, in today’s world, the “social currency” of a brand, i.e. my willingness to pose with it on Instagram, is a determinant of how much I am willing to pay. This is an example of the phenomenon of “desire” at play. Desire is a deeply nonconscious impulse to have, do or experience something that drives a person to pay a premium over what the product rationally should command.
Foundational Strategic Understanding - Synthetic data should not be used for foundational strategic understanding pieces like Habits and Attitudes, Segmentation, Barriers and Triggers etc. Human attitudes and behaviour evolve over time as a result of changes in culture, in the overall context of living or in category dynamics. These changes cannot be measured using past data.

What are the Privacy and Ethical Considerations

Synthetic data, when properly constructed, can preserve the statistical properties of a dataset without retaining any identifiable individual, thereby providing reassurance on privacy. This is particularly valuable in categories where research touches sensitive territories like health, finance, sexual behavior, political beliefs etc.

However, the ethical picture is more complicated. At the time of fieldwork, respondents should be made aware that their responses will be used to generate further answers, and this is not always the case. Moreover, Synthetic data does not automatically confer privacy. If the training dataset is small or poorly anonymized, it is possible to recover individual-level information from synthetic outputs, for example through membership inference, which reveals whether a specific individual's record was part of the training data. Responsible use therefore requires that the training data itself be properly anonymized before it is used to generate synthetic responses. Another ethical consideration is the under-representation of certain sections of society in the training datasets.
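One crude screening heuristic for this kind of leakage, sometimes called distance to closest record, is to check how close each synthetic row sits to the nearest real respondent. The sketch below is an illustration only: the data is hypothetical, the function names are invented for the example, and passing such a check does not establish privacy or rule out membership inference. It only flags the most obvious copies.

```python
def hamming(a, b):
    """Number of questions on which two respondents' answers differ."""
    return sum(x != y for x, y in zip(a, b))


def distance_to_closest_record(synthetic_row, training_rows):
    """Smallest Hamming distance from a synthetic row to any training row."""
    return min(hamming(synthetic_row, t) for t in training_rows)


# Hypothetical coded survey answers (one tuple per respondent).
training = [
    ("F", "25-34", "urban", "weekly"),
    ("M", "45-54", "rural", "monthly"),
    ("F", "55-64", "urban", "never"),
]
synthetic = [
    ("F", "25-34", "urban", "weekly"),   # identical to a real respondent
    ("M", "35-44", "urban", "monthly"),
]

distances = [distance_to_closest_record(row, training) for row in synthetic]
copies = [row for row, d in zip(synthetic, distances) if d == 0]
print(distances)
print(f"{len(copies)} synthetic row(s) replicate a training respondent exactly")
```

A distance of zero means the synthetic row reproduces a real respondent's full answer pattern. With small training sets and many questions, such a pattern can be enough to single out an individual.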

What Should you Ask Potential Partners

The following is a checklist of questions that any insights professional should ask partners seeking to provide synthetic data:

What data are you using to train the model, and how often do you retrain it?
How are you evaluating the validity of your data?
What significance tests are you conducting?
How are you preventing the commingling of client data when training the model in future?
How do you protect privacy?

Conclusion

Synthetic data has a role to play in the armory of the Insights professional. It offers the possibility of significant time and cost savings. However, it is not the “silver bullet” or the “one stop shop” for all insights needs. It also requires significant thought on how to train the models and where to use the models.

References

¹ Source: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs | https://arxiv.org/abs/2502.17424

² Source: Greenbook | Fraud in Market Research and How to Fight It: Lessons From Rare Patient Voice | https://www.greenbook.org/insights/research-methodologies/fraud-in-market-research-and-how-to-fight-it-lessons-from-rare-patient-voice

³ Source: Comprehensive Future

⁴ Source: Kantar | https://www.kantar.com/inspiration/advertising-media/can-ad-testing-really-predict-sales-impact

⁵ Source: Panoplai and twenty44 | The Future of Customer Understanding. A New Framework for Digital Twin and Synthetic Data Validation.

⁶ Source: IPSOS | The Power of Product Testing with Synthetic Data. Humanising AI series, Part two.

⁷ Source: Bain | Synthetic Customers Earn Their Stripes | https://www.bain.com/insights/synthetic-customers-earn-their-stripes/

⁸ Source: Bain | Synthetic Customers Earn Their Stripes | https://www.bain.com/insights/synthetic-customers-earn-their-stripes/

⁹ Source: ESOMAR | Rethinking Pre-Testing. Synthetic Data?

Disclaimer

The views, opinions, data, and methodologies expressed above are those of the contributor(s) and do not necessarily reflect or represent the official policies, positions, or beliefs of Greenbook.
