March 14, 2016
Analytics means different things to different people, and sometimes different things to even the same person. Here is my take.
“Definitions are hard. Especially when it comes to what they mean.” – Yogi Berra
Okay, I made that up. But definitions are hard. Analytics, for example, means different things to different people, and sometimes different things to the same person on different occasions. Here is my own definition, with no official sanction, though heavily inspired by Wikipedia and other knowledgeable sources:
Analytics is the discovery and communication of meaningful patterns in data. It makes use of information technology, statistics and mathematical algorithms to develop knowledge, to quantify performance or to make predictions. It uses the insights gained from this process to recommend action or to guide decision making. Analytics is best thought of as a research procedure for decision making, not simply as isolated tools or steps in a process.
Basic components
There are many steps in the analytics process that can come into play in quantitative MR projects:
1. Defining Objectives
2. Data Collection
3. Data Preparation and Cleaning
4. Model Building
5. Model Evaluation
6. Interpretation
7. Scoring New Data or Simulations Using the Model
8. Communication of Results and Implications to Decision Makers
9. Implementation and Monitoring Effectiveness
In this article I’ll only be able to touch on one of these – Model Building. However, the vast bulk of an analyst’s time is spent on the other components and each is critical to success. Getting the objectives straight and how the results are used – the bookends of the process – are usually most critical. A sophisticated answer to the wrong question is not a good answer, and a sophisticated answer to the right question can be used inappropriately or simply ignored.
An Analytics Toolkit
There are countless methods for analyzing data and, for simplicity, here are some broad ways of categorizing analytics tools, with brief illustrations shown in parentheses:
Important Considerations
Big Data: There has been Big Hype about Big Data in the past few years but the term is still not used consistently to mean any one thing. The size and type of data we analyze obviously have an impact on the analytics tools we use, though size by itself is less important than the hype about Big Data might suggest. Some marketing scientists (with assorted job titles) have worked with huge databases for many years.
Most of the time, to build and evaluate a model, we actually don’t need to use all our data at once. Samples normally are sufficient. Remember that most of our “old” statistical tools were designed to generalize from very small samples to the population, and these methods are widely used in Big Data analytics as well. Many marketing scientists are also accustomed to very wide data – hundreds or thousands of variables – so this is not a new challenge to us either.
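To illustrate why samples are normally sufficient, here is a minimal sketch using only the Python standard library. The "population" is simulated, not real data, and the sizes, seed and helper name sample_mean_and_se are assumptions chosen for illustration.

```python
import random
import statistics

# Hypothetical population: 200,000 simulated purchase amounts (not real data).
random.seed(42)
population = [random.gauss(50.0, 15.0) for _ in range(200_000)]


def sample_mean_and_se(values, n, seed=0):
    """Draw a simple random sample of size n; return its mean and standard error."""
    rng = random.Random(seed)
    sample = rng.sample(values, n)
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / n ** 0.5
    return mean, standard_error


population_mean = statistics.fmean(population)
mean, se = sample_mean_and_se(population, 2_000)
print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {mean:.2f} (standard error {se:.2f})")
```

A sample of 1% of the records should land within a few standard errors of the full-data answer. Real data are rarely this well behaved, so treat this as a sketch of the idea rather than a sampling plan.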
Prediction versus Interpretation: Perhaps more important is whether we need to predict or need to explain. A lot of data mining and predictive analytics is mainly concerned with prediction, for example, predicting whether or not a consumer will purchase a product. However, in marketing, understanding why consumers behave one way or another is a big plus. If we also have clues as to Why or Why Not, they can provide important insights for branding, creative and execution as well as for new product development.
Statistics versus Machine Learning: Machine Learning has also been getting a lot of press and, like Big Data, has no single universally accepted definition. Broadly speaking, machine learners are computer algorithms designed for pattern recognition, curve fitting, classification and clustering. The word learning in the term stems from the ability to learn from data. Confusingly, machine learning is also used to refer to very familiar statistical methods such as regression, cluster analysis and principal components analysis. Better ask for specifics if you’re not sure how the term is being used.
Macro Categories
A moment ago I showed you about a dozen ways to categorize analytics tools. You can also collapse this down into two macro categories, supervised and unsupervised. Supervised methods are used when there is a dependent variable – something you are trying to explain or predict from one or more independent variables. Regression is one example of a supervised method. Unsupervised methods are used when there is no distinction between dependent and independent variables. Factor analysis is an unsupervised method.
Dependence and interdependence are also used to mean supervised and unsupervised, respectively, and label, target variable and criterion variable are sometimes used in place of dependent variable. (Aren’t you glad Yogi wasn’t a statistician?)
Popular Supervised Methods
Generalized Linear Models (GLM) are a large family of statistical techniques that are extremely versatile and can be used when the dependent variable is:
– Continuous (OLS regression)
– Categorical (binary logistic and multinomial logistic regression)
– Ordinal (ordered logistic regression)
– Count (Poisson regression)
– Repeated over time (longitudinal analysis)
– Clustered (e.g., departments within divisions of a company)
GLM are widely used in key driver analysis, choice modeling and conjoint analysis. “Linear” does not mean that curvilinear relationships and interactions cannot be handled by GLM – they can and often are. Both Frequentist and Bayesian estimation can be used – yet another “macro” way of classifying analytics.
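As one concrete member of the family, here is a minimal sketch of binary logistic regression with a single predictor, fitted by plain gradient ascent on the log-likelihood. The data are made up for illustration, and the function names, learning rate and iteration count are assumptions. In practice you would use a statistics package rather than hand-rolled code.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, n_iter=5000):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = y - sigmoid(b0 + b1 * x)  # observed minus predicted
            grad0 += error
            grad1 += error * x
        b0 += learning_rate * grad0 / n
        b1 += learning_rate * grad1 / n
    return b0, b1


# Made-up toy data: satisfaction rating (1-7) and whether the customer repurchased.
ratings = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7]
repurchased = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]

# Center the predictor at its mean (4) so the intercept and slope converge quickly.
centered = [r - 4 for r in ratings]
b0, b1 = fit_logistic(centered, repurchased)

print(f"odds ratio per rating point: {math.exp(b1):.2f}")
print(f"P(repurchase | rating 2): {sigmoid(b0 + b1 * (2 - 4)):.2f}")
print(f"P(repurchase | rating 6): {sigmoid(b0 + b1 * (6 - 4)):.2f}")
```

The slope is interpretable as a log odds ratio, which is part of what makes GLM attractive when we need to explain as well as predict.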
Here are some methods widely used in data mining and predictive analytics that may be less familiar to you:
– K-nearest neighbors
– Artificial Neural Networks (“neural nets”)
– Support Vector Machines
– Boosting (e.g. AdaBoost)
– Bagging (e.g. Random Forests)
– MARS
– GAM
These tools are sometimes easier to use and more accurate for prediction but not always. Their downsides are that they’re more black box than GLM techniques and less informative about how the variables interrelate. This is a drawback in Key Driver Analysis and some other analytics. Fortunately, it’s often possible to use one method for prediction and another to help us understand the why and how.
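To give a feel for the first method in the list, here is a minimal sketch of K-nearest neighbors classification in plain Python. The training points, labels and the function name knn_predict are invented for illustration. Note that the prediction comes with no coefficients to interpret, which is the black box trade-off described above.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Invented toy data: (store visits per month, items viewed per visit).
train_points = [(1, 2), (2, 1), (2, 3), (8, 9), (9, 7), (7, 8)]
train_labels = ["non-buyer", "non-buyer", "non-buyer", "buyer", "buyer", "buyer"]

print(knn_predict(train_points, train_labels, query=(8, 8)))    # buyer
print(knn_predict(train_points, train_labels, query=(1.5, 2)))  # non-buyer
```

Because the method relies on distances, variables measured on very different scales would normally be standardized first. That step is omitted here to keep the sketch short.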
Popular Unsupervised Methods
Here are some familiar faces:
– Principal Components Analysis
– Factor Analysis
– Correspondence Analysis
– Biplots
– MDPREF
– Canonical Correlation
These methods are very useful for brand and user mapping. As many of you know, big brands can dominate perceptual maps and when that happens mapping may not be very informative since it merely shows us graphically what we already knew – that the market leaders score highest on most attributes. Correspondence Analysis is one way – but not the only way – to reduce brand size effect when that is a concern.
Principal Components Analysis is also widely used for pre-processing data, for example, prior to cluster analysis or regression. Canonical Correlation, a kind of multi-barreled Principal Components Analysis, and Multiple Correspondence Analysis can also be used this way.
As an aside, when marketing researchers say “Factor Analysis” we usually mean Principal Components. However, the two methods have different origins, Psychology and Statistics, respectively, and are not identical.
The methods shown next are frequently used for “discovered” (post hoc) segmentation.
– K-means Cluster Analysis
– Agglomerative Hierarchical Clustering (AHC)
– Partitioning Around Medoids (PAM)
– Self-Organizing Maps (Kohonen networks)
– Mixture Modelling and Latent Class
– Frequent Pattern Mining (e.g., Apriori, FP-Growth)
Mixture Modeling, which subsumes Latent Class, is highly versatile and also used when there are dependent variables. Frequent Pattern Mining and Association Rules are used for market basket analysis and recommender systems, to name two examples of how they are put into play.
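For market basket analysis, the basic quantities behind an association rule are support, confidence and lift. Here is a minimal sketch that computes them directly for one rule. The baskets are invented, the function name rule_metrics is an assumption, and a real project would use an algorithm such as Apriori or FP-Growth to search over many candidate rules.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_antecedent = sum(antecedent <= basket for basket in transactions)
    n_consequent = sum(consequent <= basket for basket in transactions)
    n_both = sum((antecedent | consequent) <= basket for basket in transactions)
    if n_antecedent == 0 or n_consequent == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_both / n                   # share of baskets containing both sides
    confidence = n_both / n_antecedent     # P(consequent | antecedent)
    lift = confidence / (n_consequent / n) # above 1 means bought together more than chance
    return support, confidence, lift


# Invented baskets for illustration only.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
    {"milk", "butter"},
    {"cereal", "milk", "bread"},
    {"bread", "butter", "cereal"},
]

support, confidence, lift = rule_metrics(baskets, {"bread"}, {"butter"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# support=0.50 confidence=0.67 lift=1.07
```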
Structural Equation Modeling (SEM)
SEM is arguably the most multitalented of the bunch. It is certainly the most versatile – uniting GLM and Factor Analysis – and is invaluable in attitudinal research. A Mixture Modeling variant (SEMM) combines SEM with Cluster Analysis, and SEM can also be used with longitudinal and clustered (i.e., multi-level, hierarchical) data.
Some SEM software is very abuser-friendly and, unfortunately, SEM may be the most misunderstood and misused statistical technique of them all. When I think of SEM I sometimes imagine a wonderful all-around athlete who’s injured a lot… Partial Least Squares (PLS) is a competitor to SEM and there are often heated debates between these rival camps but I will not join the fray here.
Time-Series Analysis
Time is an important dimension in some kinds of analytics. Time-series analysis is a very large group of methods that originated in Operations Research, Econometrics, Statistics and other disciplines. It’s used when data have been collected at many points in time (e.g., weekly sales). In marketing research it’s used most often for sales forecasting and Marketing Mix Modeling, which is also known as Market Response Modeling. We employ these methods to try to find out which marketing activities have the biggest payoff and sometimes for forecasting sales under various marketing scenarios.
Some of the most commonly used methods are:
– Exponential Smoothing
– ARIMA
– ARMAX
– VAR, VEC
– State-Space Modelling
– Dynamic Factor Models
– GARCH
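Simple exponential smoothing, the most basic form of the first method listed above, fits in a few lines. This is a minimal sketch with invented weekly sales figures. The smoothing weight alpha=0.5 is an arbitrary assumption, whereas in practice it would be estimated from the data, and trend and seasonality are ignored.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed levels; the last level is the one-step-ahead forecast."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # initialize the level at the first observation
    levels = [level]
    for value in series[1:]:
        # New level = weighted average of the latest value and the previous level.
        level = alpha * value + (1 - alpha) * level
        levels.append(level)
    return levels


weekly_sales = [100, 104, 98, 110, 107, 115]  # invented figures
levels = simple_exponential_smoothing(weekly_sales, alpha=0.5)
forecast = levels[-1]
print(levels)    # [100, 102.0, 100.0, 105.0, 106.0, 110.5]
print(forecast)  # 110.5
```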
Definitions can be tricky, as I’ve noted. Longitudinal is a generic term used to mean data collected over time. Typically it implies fewer points in time than time-series, which can have dozens or even thousands of data points. I personally would call two years of quarterly tracking figures longitudinal data and weekly sales figures over a two-year period time-series data. Some researchers would maintain that a consumer tracking study is not truly longitudinal since we (normally) are not following the same consumers from wave to wave. We do follow the same brands, however.
Time-To-Event Modelling
This kind of analytics comes in handy when analyzing the expected time until one or more events happen and is also known as Survival, Duration or Event History Analysis. It’s quite complex and is heavily used by Medical researchers and also by Economists, Engineers and in Operations Research. Kaplan-Meier, Cox regression and parametric models are the main methods and recent variations can include a segmentation component instead of assuming all customers, patients, etc. are the same.
Some of the ways it’s used in marketing research are:
– To find out what factors cause customer churn
– To predict how long a customer will remain (“survive” as) a customer
– In analysis of purchase behavior and website usage
These methods are very useful but very easy to get very wrong. Many dense textbooks have been written about this one topic or include it as a major section. The admonition “Don’t try this at home, folks” comes to mind.
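With that warning in mind, here is a minimal sketch of the Kaplan-Meier estimator, only to show what the method computes. The customer records are invented and the function name kaplan_meier is an assumption. Confidence intervals, covariates and everything else that makes real survival analysis hard are left out.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time with at least one event.

    durations: months until churn, or until the customer was last seen.
    observed:  True if churn was observed, False if the record is censored
               (the customer had not churned when observation ended).
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1 - events / at_risk  # conditional survival at time t
        curve.append((t, survival))
    return curve


# Invented records for eight customers.
months = [2, 3, 3, 5, 6, 8, 8, 10]
churned = [True, True, False, True, False, True, True, False]

curve = kaplan_meier(months, churned)
for month, probability in curve:
    print(f"month {month}: estimated share still customers = {probability:.3f}")
```

Censored customers still count in the at-risk group until they drop out of observation. That is the key difference from simply averaging the observed durations.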
Why Use Advanced Analytics?
This topic may strike you as complicated and many of you may be wondering “Why bother?” There are many reasons. Advanced analytics adds value to data – it can help data speak to us. For example, in consumer attitude studies we can use psychometric tools to take respondent scale usage patterns and background characteristics into account. Doing so will provide us with a deeper and more accurate understanding of consumer attitudes and behaviors and how they connect with each other.
In any kind of research totals and crosstabs only show us the surface and are just the first steps in exploratory data analysis. Moreover, running lots of crosstabs increases the risk of fluke “findings”, and clients can make bad decisions based on these chance results.
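A rough calculation shows how quickly fluke findings accumulate. The sketch below assumes every significance test is independent and run at the 5% level. Crosstabs from a single survey are not truly independent, so the numbers are indicative only, and the function name chance_of_fluke is hypothetical.

```python
def chance_of_fluke(n_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 10, 50, 100):
    print(f"{n_tests:>3} tests: {chance_of_fluke(n_tests):.0%}")
```

Under these assumptions, ten tests already give roughly a 40% chance of at least one spurious "significant" result even when nothing real is going on.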
Looking at variables two-at-a-time also can be very misleading. Older consumers, for instance, may seem to be heavier users of a particular category but, after taking gender, income and other characteristics into account, we may find that category usage actually declines with age!
Appearances can be deceiving!
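The age example can be reproduced with a handful of invented records. In this minimal sketch, income is the only background characteristic taken into account and every number is made up. Older consumers look like heavier users overall, yet usage is lower for older consumers within each income group.

```python
from collections import defaultdict
from statistics import fmean

# Invented records for illustration only:
# (age group, income group, purchases per month).
records = (
    [("younger", "lower income", 4)] * 8
    + [("younger", "higher income", 10)] * 2
    + [("older", "lower income", 3)] * 2
    + [("older", "higher income", 9)] * 8
)


def mean_usage_by(records, key):
    """Average purchases per month within each group defined by key(record)."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record[2])
    return {group: fmean(values) for group, values in groups.items()}


by_age = mean_usage_by(records, key=lambda r: r[0])
by_age_and_income = mean_usage_by(records, key=lambda r: (r[0], r[1]))

print(by_age)             # older 7.8 vs younger 5.2: older looks heavier
print(by_age_and_income)  # within each income group, older is lower
```

The reversal happens because the older group in this toy data is mostly higher income, and higher-income consumers buy more. A multivariate model does the same adjustment for many characteristics at once.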
What I’ve described above are only some of the tools a Marketing Scientist may wish to include in his or her toolbox. Standardized methodologies also have their place – I used to help design them – but for many projects customized analytics are the better route, and also faster and less expensive when done competently.
More than the tools, though, it is how they are used that’s most important – making advanced analytics work involves much more than math and programming…You need to put the patient before the cure!
Future Directions
IoT, Artificial Intelligence, Quantum Computing and unforeseen innovations will likely have a profound impact on our lives in the future. This means they will also impact MR and analytics. Some kinds of analytics will be largely automated in the not-so-distant future, though human judgment will remain essential.
Further ahead, Marketing and MR may be extensively automated…but in that sort of world will they still be necessary? I wonder.
A version of this article was presented at the Festival of the NewMR February 2nd, 2016.