Regression techniques correlate advertising activity with conversion results, returning a contribution value for each campaign (or channel) grouping in the form of a coefficient.
We employ a logistic regression probability model to predict binary outcomes (did convert or didn’t convert) from a large sample of paths. Even across a fairly wide series of advertising campaigns and a couple of million cases, run time is generally quick on modern computers, which allows the analyst to experiment with including and excluding channels from the mix of variables in order to achieve the best model.
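As a rough illustration of what “a coefficient per channel” means, here is a minimal sketch in plain Python. Everything in it is an assumption for demonstration purposes: the channel names, the made-up path counts and the simple gradient-ascent fitting routine. In practice a statistics package would do the fitting; the point is only to show binary exposure flags going in and one coefficient per channel coming out.

```python
import math

# Hypothetical channel groupings and illustrative (made-up) path counts.
CHANNELS = ["display", "paid_search", "email"]

# Each row: (exposure flag per channel, number of paths, number that converted).
GROUPED_PATHS = [
    ((1, 0, 0), 50000, 250),
    ((0, 1, 0), 8000, 400),
    ((0, 0, 1), 6000, 180),
    ((1, 1, 0), 4000, 240),
    ((1, 0, 1), 3000, 105),
    ((0, 1, 1), 1500, 150),
    ((1, 1, 1), 1000, 110),
]


def sigmoid(z):
    """Logistic function mapping a log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(grouped_paths, n_features, n_iter=50000, step=1.0):
    """Fit an intercept plus one coefficient per channel by gradient ascent
    on the mean log-likelihood of the grouped binary outcomes."""
    weights = [0.0] * (n_features + 1)  # weights[0] is the intercept
    total_paths = sum(n for _, n, _ in grouped_paths)
    for _ in range(n_iter):
        gradient = [0.0] * (n_features + 1)
        for flags, n, converted in grouped_paths:
            score = weights[0] + sum(w * x for w, x in zip(weights[1:], flags))
            residual = converted - n * sigmoid(score)  # observed minus expected
            gradient[0] += residual
            for j, x in enumerate(flags):
                gradient[j + 1] += residual * x
        weights = [w + step * g / total_paths for w, g in zip(weights, gradient)]
    return weights


weights = fit_logistic(GROUPED_PATHS, len(CHANNELS))
coefficients = dict(zip(CHANNELS, weights[1:]))
for channel, value in coefficients.items():
    print(f"{channel}: {value:+.3f}")
```

Grouping identical exposure patterns keeps the sketch fast; dropping a channel from the model amounts to removing its column and refitting.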
Excluding a channel may seem counter-intuitive: after all, surely all our advertising should be generating a positive result? In theory, probably yes. In practice, however, some campaigns do need lifting out of the model in order to avoid nonsensical results. This effect is caused by limitations of the tracking data, in particular the lack of “unseen, non-converter” data to represent unexposed users. To a degree, we may compensate by creating an artificial baseline for everyone we don’t know about (in the tens of millions of cookies), but this approach is rather arbitrary.
To explain further:
Consider that for the vast majority of our data, “not-converted” is the dominant state. In most client data sets there will be massively more internet users to whom an advert has been delivered without them going on to become purchasing customers, particularly if a general awareness (“reach”) campaign is active. By far the most common case will therefore be “Impression Viewing = 1 (seen), converted = 0 (no)”. No surprise there: a cold campaign is for the most part about generating the initial seeds of awareness rather than short-term customer conversion-to-sale.
Contrast this with a paid search term/advert, where a customer has actively searched for your brand or product and then clicked. While the ratio of conversions to non-conversions may still be in the low percentages, it is likely far higher than in the previous case, since here the customer is clearly in market and/or proactively researching.
Let us now consider an example of mixing both channels, in particular the mechanic of serving an impression after a site visit. A random customer landing on your site from a search click is fairly ‘warm’, and those who convert there and then, or return through another search click, contribute to a positive coefficient in the regression.
However, what if a user doesn’t convert there and then? By not converting they have indicated (for whatever reason) that they are probabilistically less likely to convert. Even with some level of exclusion filtering, there will still be a proportion to whom we serve another advert; after all, likely as not they’ve ticked the general boxes that make them a good target opportunity.
In terms of the regression modelling, we’re comparing someone with a high probability to convert via a search click alone with someone who has both a search click and a viewed impression and who is LESS likely to convert. The model looks at this naively and simply correlates the serving of an impression with a negative effect on sales (i.e. a negative coefficient).
Applying some common sense here we acknowledge this is unlikely to be the case. It is unintuitive to assume that the advert has a negative effect on sales unless it is abhorrent in some way. The issue is that we have no data to represent a ‘control’ pool against which we can then calculate uplift.
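To make the mechanism concrete, here is a small worked example using entirely hypothetical figures. It assumes the follow-up impression has no real effect at all: users who did not convert on their first search visit go on to convert at the same 5% rate whether or not they are served the advert. For a single binary variable, the fitted logistic regression coefficient equals the log odds ratio between the two groups, so that is what the sketch computes.

```python
import math

# Hypothetical figures: 1,000 search clickers, of whom 100 convert immediately.
immediate_converters = 100

# The 900 who did not convert are split: 300 are served an impression, 600 are not.
# Assumption: the impression has NO real effect, so both groups convert at 5%.
retargeted_users, retargeted_conversions = 300, 15
not_retargeted_users, not_retargeted_conversions = 600, 30


def log_odds(conversions, users):
    """Log of the odds of converting within a group."""
    return math.log(conversions / (users - conversions))


# "Search only" as the model sees it: immediate converters plus the non-retargeted.
search_only_users = immediate_converters + not_retargeted_users
search_only_conversions = immediate_converters + not_retargeted_conversions

# Coefficient the regression would assign to the impression variable.
impression_coefficient = (
    log_odds(retargeted_conversions, retargeted_users)
    - log_odds(search_only_conversions, search_only_users)
)
print(round(impression_coefficient, 3))
```

The result is clearly negative (about -1.47) even though, by construction, the advert did nothing. The sign comes purely from who was selected to see it.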
One viable solution then is to generate independent pools based on activity: for example, modelling “impression only” users, “click only” users, and then some ensemble model on those users with exposure to both. Regardless, it’s difficult (maybe impossible) to fully unpick the collinearity that occurs in the mixed exposure set.
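A sketch of how that pooling step might look, assuming each user record carries two hypothetical boolean flags (`had_impression` and `had_click`). The pool names are illustrative, and each pool would then be modelled separately.

```python
from collections import defaultdict


def assign_pool(had_impression, had_click):
    """Label a user by the kind of advertising exposure they have had."""
    if had_impression and had_click:
        return "mixed"
    if had_impression:
        return "impression_only"
    if had_click:
        return "click_only"
    return "unexposed"


def split_into_pools(users):
    """Group user records into independent pools keyed by exposure type."""
    pools = defaultdict(list)
    for user in users:
        pools[assign_pool(user["had_impression"], user["had_click"])].append(user)
    return dict(pools)


# Hypothetical user records.
users = [
    {"id": "a", "had_impression": True, "had_click": False},
    {"id": "b", "had_impression": False, "had_click": True},
    {"id": "c", "had_impression": True, "had_click": True},
    {"id": "d", "had_impression": True, "had_click": False},
]
pools = split_into_pools(users)
print({name: len(members) for name, members in pools.items()})
```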
There are alternative methods that allow us to work out incrementality from a control (A/B testing for one), but within a logistic regression framework alone and against raw data, testing and removing variables that don’t register a strong enough correlation is part and parcel of the process.
There are also some fairly general assumptions applied in the practical application of this model.
Firstly, is the order of interaction with advertising irrelevant? The regression algorithm itself has no concept of sequence ordering, so any desired influence needs to be precalculated and presented to the algorithm as part of the data.
I have experimented with recency caps, for example excluding interactions over 7/14/21 days old, albeit on client request rather than having seen any evidence that this is a fair assumption. In terms of applying a scaling factor for step sequence, to be honest I am unclear how this could be introduced fairly and accurately. Perhaps a different approach should be adopted if sequence is considered a major component?
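A recency cap of this kind can be applied as a simple filter before the path is turned into model variables. The sketch below assumes each touchpoint records how many days before the end of the path it occurred; the field names and the sample path are hypothetical.

```python
def apply_recency_cap(touchpoints, max_age_days):
    """Keep only touchpoints that occurred within max_age_days of the path end."""
    return [tp for tp in touchpoints if tp["days_before_end"] <= max_age_days]


# Hypothetical path: a display impression, then an email click, then a search click.
path = [
    {"channel": "display", "days_before_end": 30},
    {"channel": "email", "days_before_end": 10},
    {"channel": "paid_search", "days_before_end": 2},
]

# Re-run the preparation under each cap to compare the resulting models.
for cap in (7, 14, 21):
    kept = [tp["channel"] for tp in apply_recency_cap(path, cap)]
    print(cap, kept)
```

Because the filtering happens before the regression, comparing caps simply means preparing the data once per cap and refitting.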
Secondly, do you consider the number of interactions via each advertising channel to be important? Do three advert clicks through the same campaign indicate a higher likelihood to convert than a single click? Arguably it could indicate a lower likelihood, for the same reasons as in our previous example. This is an area where repeating model runs with different frequency caps may yield deeper insight and a stronger model.
In this vein, I always make a point of running a version where a simple binary placeholder indicates presence in the case, e.g.
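A minimal sketch of that encoding, assuming a hypothetical table of per-channel interaction counts for each path. A frequency cap of 1 gives the binary presence version; higher caps, or no cap, keep more of the count information.

```python
def encode_path(channel_counts, channels, frequency_cap=None):
    """Turn per-channel interaction counts into one model variable per channel.

    frequency_cap=1 yields binary presence flags; None keeps the raw counts.
    """
    features = []
    for channel in channels:
        count = channel_counts.get(channel, 0)
        if frequency_cap is not None:
            count = min(count, frequency_cap)
        features.append(count)
    return features


# Hypothetical channel list and a single path with repeated search clicks.
channels = ["display", "paid_search", "email"]
path_counts = {"paid_search": 3, "display": 1}

print(encode_path(path_counts, channels))                   # raw counts
print(encode_path(path_counts, channels, frequency_cap=2))  # capped at 2
print(encode_path(path_counts, channels, frequency_cap=1))  # binary presence
```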
My last point here is to note that channel detail grouping (a “taxonomy”) is a significant step in the data preparation. With all models, the more cases that you have, generally the better supported your findings will be and the more confident you can be that whatever trends have emerged will hold true with subsequent activity.
However, to capture subtleties in search terms and to fit the multitude of different display advertising sizes and creatives, there are often thousands of unique combinations to accommodate. This detail is far too great to model with accuracy: as the granularity goes up, the number of supporting cases comes down. Likely as not, there will also be operational limitations that make overly granular detail difficult to manage efficiently.
Grouping finer campaign detail into a more actionable level is therefore an important step in producing usable results. An example may be rounding up the various client Brand terms and renaming occurrences with a single generic “Brand” placeholder, but allowing a more detailed structure under one of the other channels: for other paid search, let’s say Product Type, Product Make, Search Location, and so on.
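A taxonomy of this sort can be expressed as a small set of mapping rules. The sketch below is illustrative only: the brand terms, product keywords and group labels are all hypothetical, and a real taxonomy would be agreed with the client and be considerably longer.

```python
# Hypothetical client brand terms and product keywords.
BRAND_TERMS = {"acme"}
PRODUCT_TYPES = {"sofa": "Product Type: Sofas", "table": "Product Type: Tables"}


def group_campaign(channel, detail):
    """Map a raw channel plus campaign detail to a coarser modelling group."""
    text = detail.lower()
    if channel == "paid_search":
        # All brand variants collapse into one generic placeholder.
        if any(term in text for term in BRAND_TERMS):
            return "Paid Search / Brand"
        # Other paid search keeps one further level of detail.
        for keyword, label in PRODUCT_TYPES.items():
            if keyword in text:
                return f"Paid Search / {label}"
        return "Paid Search / Other"
    if channel == "display":
        # Sizes and creatives are deliberately collapsed.
        return "Display"
    return channel.title()


print(group_campaign("paid_search", "ACME sofas sale"))
print(group_campaign("paid_search", "cheap corner sofa"))
print(group_campaign("display", "300x250 spring creative B"))
```

Checking the brand rule first means a term that mentions both the brand and a product lands in the single “Brand” group, which is one possible design choice rather than a requirement.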
All considerations taken, your output will hopefully be a fairly robust model, by which I mean one that produces a similar set of results across repeated random samplings of the data. Applying the regression coefficients back against the data generates a ‘campaign grouping level’ conversion attribution for subsequent review.
There will be some campaigns that you may have had to exclude from the model, and most likely a small pot of unexplained conversions. It’s reasonable to assume the ‘missing’ channels may stake a claim to some of these, while a proportion may also be down to other offline/external factors, depending on whether you’ve introduced a baseline component. Ideally, most of the conversions are allocated, which then gives more credence to any subsequent return on investment calculations.
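There is more than one way to apply coefficients back to the data. The sketch below shows one simple scheme, which is an assumption for illustration rather than a prescribed method: each converting path shares its single conversion among the channels present, in proportion to their positive coefficients, and any path with no positively weighted channel falls into an “unattributed” pot. The channel names and coefficient values are hypothetical.

```python
from collections import defaultdict


def attribute_conversions(converting_paths, coefficients):
    """Share each conversion among the channels present in the path,
    weighted by positive coefficient; the rest is left unattributed."""
    totals = defaultdict(float)
    for channels_in_path in converting_paths:
        positive = {
            channel: coefficients[channel]
            for channel in channels_in_path
            if coefficients.get(channel, 0.0) > 0
        }
        weight_sum = sum(positive.values())
        if weight_sum == 0:
            totals["unattributed"] += 1.0
            continue
        for channel, weight in positive.items():
            totals[channel] += weight / weight_sum
    return dict(totals)


# Hypothetical coefficients and three converting paths.
example_coefficients = {"paid_search": 1.5, "email": 0.5, "display": -0.2}
example_paths = [
    {"paid_search", "email"},
    {"display"},
    {"paid_search", "display"},
]
attribution = attribute_conversions(example_paths, example_coefficients)
print(attribution)
```

Every conversion is counted exactly once, so the channel totals plus the unattributed pot always sum to the number of converting paths.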
My experience so far with this method is a fairly mixed reception across internal business teams and their clients. There’s an acceptance that a statistical approach is a ‘proper’ method, and trust in the results being fair, or at least not humanly biased. But there are also frustrations where the effect of a campaign cannot be discerned from the data or, due to data limitations, cannot be fairly evaluated in the same manner as the others.