LevelUP Your Research

March 14, 2024

Are You Using Synthetic Data for Analytics?

Explore the use of synthetic data to bridge the gap between sales and ad exposure data. Learn how it can enhance targeting and validate ad effectiveness models.

by Joel Rubinson

President at Rubinson Partners Inc

I am sensing more buzz around synthetic data as a way to answer important marketing questions. A big motivator for synthetic data approaches is having to pull different pieces of data from different sources while wanting to bring them together into a holistic view. Increasingly, this is true in marketing, especially as walled gardens create their own clean rooms for integrating sales and ad exposure. These clean rooms do not talk to each other. For example, good luck trying to integrate ad effectiveness results from Meta and YouTube.

I’m going to give you three use cases that benefit from synthetic data:

  • Discovering new targeting principles
  • Estimating the reach of your advertising
  • Validating models of ad effectiveness

Discovering new ad targeting principles

This is how we discovered the power of targeting the Movable Middle. Consulting with the MMA trade association, I worked with TransUnion’s (fka Neustar) data science team to create a synthetic database with rules for choice probabilities that followed known distributions, response curves that followed MMM functions with diminishing returns, and media exposure with time-based decay.

No dataset existed with all of these elements, so we created a synthetic data set of over 600,000 synthetic consumers in order to simulate the differential impact of advertising on consumer segments.
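For readers who like to see the mechanics, here is a minimal sketch in Python of what such a synthetic consumer file could look like. The distribution choices (a Beta for baseline choice probabilities, Poisson exposures with geometric adstock-style decay, ad lift entering on the log-odds scale) and every parameter value are my own illustrative assumptions, not the actual rules used in the MMA/TransUnion work.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 600_000          # synthetic consumers
WEEKS = 26           # simulated exposure history

# 1. Baseline probability of choosing the brand, drawn from a known
#    distribution (a Beta here, purely illustrative).
baseline_p = rng.beta(a=0.8, b=3.0, size=N)

# 2. Weekly ad exposures with geometric (adstock-style) decay.
exposures = rng.poisson(lam=0.6, size=(N, WEEKS))
decay = 0.7
adstock = np.zeros(N)
for week in range(WEEKS):
    adstock = decay * adstock + exposures[:, week]

# 3. Response curve with diminishing returns: lift enters on the
#    log-odds scale, so incremental impact depends on baseline_p.
beta_ad = 0.15
log_odds = np.log(baseline_p / (1 - baseline_p)) + beta_ad * np.log1p(adstock)
p_with_ads = 1 / (1 + np.exp(-log_odds))

# 4. Compare responsiveness by segment: non-buyers, Movable Middle, loyals.
lift = p_with_ads - baseline_p
segments = np.digitize(baseline_p, [0.2, 0.8])   # 0: <20%, 1: 20-80%, 2: >80%
for name, seg in zip(["<20% (non-buyers)", "20-80% (Movable Middle)", ">80% (loyals)"], range(3)):
    print(f"{name:26s} mean lift: {lift[segments == seg].mean():.4f}")
```

Even with made-up parameters, the segment comparison at the end reproduces the qualitative pattern described next: the midrange segment shows the largest lift from advertising.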

Then we saw the pattern…those synthetic consumers with midrange probabilities of choosing the brand of interest were five times more responsive to advertising in our simulations. Non-buyers (75% of consumers) were the least responsive, challenging the call for broad reach in media planning from Ehrenberg-Bass and Les Binet.

As we dug into the reason, a professor at UCLA pointed out to me something so simple and obvious…the first derivative of the logistic function (the curve used in most MTA work to model conversions) is p*(1-p), which is maximized at p = .5. In English, that means that those with a 50/50 shot at choosing your brand SHOULD have the greatest responsiveness/rate of change to advertising over and above their baseline probability of converting, as our simulations revealed.
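To spell that out (this is standard calculus, not anything specific to a particular attribution vendor's model): if the probability of conversion is modeled with the logistic curve, then

```latex
p = \sigma(x) = \frac{1}{1 + e^{-x}},
\qquad
\frac{dp}{dx} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr) = p\,(1 - p),
```

and p(1-p) reaches its maximum of 0.25 at p = 0.5, so a marginal increase in ad stimulus moves consumers with a roughly 50/50 baseline probability the most.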


Then, for practical targeting purposes, we broadened this to the Movable Middle segment…those with a 20-80% probability of choosing the brand of interest. Since then, we have published the results in a peer-reviewed journal, and the mathematical principle has been validated in 10 out of 10 case studies…consumers in the Movable Middle for your brand are, in fact, hyper-sensitive to your advertising (up to 23X).

Estimating the reach of your advertising

Marketers want to plan for and measure the “reach” of their media plans (i.e., what percent of consumers will see your ads across any of the platforms you are using). Unfortunately, privacy concerns are pushing walled-garden walls higher, more silos are appearing, and Google is killing off cookies. As a result, the ability to actually measure the reach of your advertising from a single dataset is fast evaporating.

The alternative is to measure the reach of each platform individually and then put the pieces together via synthetic data, so total reach can be measured from the synthesized data set. I have seen models that do this, but it is tricky because it is hard to get the covariance patterns right unless you choose the right type of model. For example, exposure to CTV and display will be highly correlated when the tech stack uses the same DSP for both.

50% reach on each won't deliver 75% reach across the two…it is more like 55-60% reach. If you use something ex-Googlers use…Dirac mixture models…or if you use a Dirichlet model, your efforts will fail, because those approaches assume no covariance patterns. The better way to create synthetic data is to capture the covariance patterns and build the data using copulas, which are made for exactly that. (…just so happens I do such modeling…hint, hint.)
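As a rough illustration of that arithmetic, the sketch below compares the independence assumption with a Gaussian copula that encodes correlated exposure between two platforms, each at 50% reach. The 0.8 correlation is an assumption I chose to stand in for "same DSP buying both"; it is not a measured value.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 1_000_000               # simulated consumers
reach_a = reach_b = 0.50    # single-platform reach

# Independence: combined reach = 1 - (1 - 0.5) * (1 - 0.5) = 75%
print("independent:", 1 - (1 - reach_a) * (1 - reach_b))

# Gaussian copula: draw correlated latent scores, then threshold each
# margin so that exactly 50% of consumers are reached on each platform.
rho = 0.8   # illustrative exposure correlation
cov = [[1.0, rho], [rho, 1.0]]
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=N)
seen_a = z[:, 0] > np.quantile(z[:, 0], 1 - reach_a)
seen_b = z[:, 1] > np.quantile(z[:, 1], 1 - reach_b)

combined = np.mean(seen_a | seen_b)
print(f"copula (rho={rho}): {combined:.3f}")   # roughly 0.60, not 0.75
```

With rho = 0.8 the combined reach comes out near 60% rather than the 75% implied by independence, which is exactly the gap a covariance-aware model is meant to capture.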

Validating models of ad effectiveness

One of the challenges for attribution models is proving that they are really delivering results about incrementality, not sales or conversions you would have had anyway. It’s the old “correlation does not equal causation” argument. To some extent, MMM has this challenge too. So, what some MTA approaches do to validate their model is create synthetic data. Here’s how it works. You identify the channels, partners, etc. that are in the actual media plan.

You then create a proposed set of effectiveness parameters for all of these factors but “hide them from the model.” Next, you synthesize a data set of ad exposure and ad response from these factors with some random noise thrown in. To an MTA model, this will look like data in the wild, but you actually know the answer because you synthesized the data set from the effectiveness parameters you created.

The “hackathon-like” challenge is to see if the MTA model is smart enough to “reverse engineer” these parameters of relative effectiveness. If the model can get the right answer on synthetic data, it adds to the confidence that it is returning the right answers for the advertising that was actually run.
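A stripped-down version of that test might look like the sketch below: pick “true” effectiveness parameters, synthesize exposure and conversion data from them with some noise, then check whether a simple logistic model (standing in here for a full MTA model) can recover them. The channel names, coefficients, and noise level are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N = 200_000
channels = ["ctv", "display", "social", "search"]

# "True" effectiveness parameters, hidden from the model.
true_beta = np.array([0.40, 0.10, 0.25, 0.60])
intercept = -2.0

# Synthesize exposure counts and conversions, with random noise thrown in.
X = rng.poisson(lam=1.0, size=(N, len(channels)))
log_odds = intercept + X @ true_beta + rng.normal(scale=0.3, size=N)
converted = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# Can the model reverse-engineer the hidden parameters?
model = LogisticRegression().fit(X, converted)
for name, truth, est in zip(channels, true_beta, model.coef_[0]):
    print(f"{name:8s} true={truth:.2f}  recovered={est:.2f}")
```

In a real validation exercise you would compare the recovered parameters (and the incrementality they imply) against the hidden ones, and treat large gaps as a red flag for the attribution methodology.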

Synthetic data are rising in prominence in the analytics and data science toolkit. Are you getting on board?


