Game Theory (Shapley) revisited

Setting some time aside for development work, I finally decided to try to write a script to evaluate game theory solutions. Truthfully, I’d previously been put off by the apparent complexity of generating the set structure needed to evaluate the regression.

For n campaigns, a fully defined game needs a value for every one of its 2^n coalitions, and the classic Shapley calculation averages marginal contributions over all n! orderings of those campaigns. As seen in the previous post, with three channels this is relatively trivial and can be done by hand. With a campaign taxonomy of even 10 campaigns, however, that is 1,024 coalitions and roughly 3.6 million orderings to work through. Most of these coalitions would be empty anyway, as real user data doesn’t come close to containing examples of every combination.

However, when I actually spent some time reading through the maths, I soon realised that there is a formulation which avoids evaluating ‘empty’ sets. I found a handy explanation of the formula:

\phi_i(v) = \sum_{S \subseteq N,\; i \in S} \frac{(|S|-1)!\,(n-|S|)!}{n!} \bigl( v(S) - v(S \setminus \{i\}) \bigr)

here: https://linguisticcapital.wordpress.com/2015/06/09/the-shapley-value-an-extremely-short-introduction/. The upshot is that the formula can be evaluated in two stages.

Imagine a step-wise process, which loops through each unique combination (“coalition”) of campaigns you have. If you refer to my previous post (https://thedataanalyst.wordpress.com/2016/08/30/shapley-value-regression-game-theory-as-an-attribution-solution/) then the equivalent would be looping through each row of the table presented there.

For each unique combination of campaigns, |S| is the number of campaigns in that combination and n is the total number of unique campaigns you have; the first part of the formula is then just the weight (|S|−1)!(n−|S|)!/n!.

So for a 3 campaign taxonomy, and a set containing 2 campaigns (e.g. row 4, PPC Brand and SEO), that first bit is

Factorial(2-1) * Factorial(3-2) / Factorial(3) =   1/6   = 0.1667

 

The second bit is simply the value difference with and without the campaign. For example, if we’re evaluating SEO in the set {PPCBrand, SEO}, and we know from the value estimation stage that {PPCBrand, SEO} = 424 and {PPCBrand} = 270, then the credit SEO receives for this combination is (424 – 270) * 0.1667 = 25.7

 

Evaluating {SEO} = 199, then PPC Brand receives (424-199) * 0.1667 = 37.5

You can then move on to the next known unique combination of campaigns. When all have been calculated, a sum of credit across each set by campaign yields an attributed share.
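To make the loop concrete, here is a minimal R sketch of that credit calculation (base R only), using the illustrative coalition values quoted above (270, 199 and 424):

# Value estimates for each observed coalition (from the worked example above)
coalition_values <- list(
  "PPC Brand"     = 270,
  "SEO"           = 199,
  "PPC Brand|SEO" = 424
)

n <- 3  # total number of unique campaigns in the taxonomy

# Shapley weight for a coalition of size s out of n campaigns
shapley_weight <- function(s, n) factorial(s - 1) * factorial(n - s) / factorial(n)

# Credit one campaign for one coalition: weight * (value with - value without).
# The value of the empty coalition is assumed to be zero here.
credit_for <- function(campaign, coalition, values, n) {
  members <- strsplit(coalition, "|", fixed = TRUE)[[1]]
  without <- paste(setdiff(members, campaign), collapse = "|")
  v_with    <- values[[coalition]]
  v_without <- if (nzchar(without)) values[[without]] else 0
  shapley_weight(length(members), n) * (v_with - v_without)
}

credit_for("SEO", "PPC Brand|SEO", coalition_values, n)        # ~25.7
credit_for("PPC Brand", "PPC Brand|SEO", coalition_values, n)  # ~37.5

Summing credit_for over every observed coalition a campaign appears in gives that campaign’s attributed share.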

 

So far so good. In theory.

 

Except: unless you have a trivial case-study, you won’t have all your combinations described in your customer path data, so you are missing vast swathes of the sub-sets you need to evaluate uplift.

Real-world data being what it is, there’s also every chance that a key assumption has been broken: monotonicity (often loosely described as additivity), which states that in this cooperative game adding a channel to a coalition should not reduce its value. Yet at least in the data I’ve worked with, it is not uncommon for (say) a single SEM click to have a higher conversion rate than an SEM click accompanied by a series of prospecting display adverts.

How does this impact on the results?

 

In the case of partially defined coalitions, real-world data fortunately aids us: it is not uncommon for a typical data set to show one-step paths accounting for ~50% of conversions, two steps or fewer ~80%, and three steps or fewer ~90%, with the remainder tailing away over longer paths.

Chances are you have combinations of 3 channels described fully: at least for campaigns that make up the bulk of conversions.  This leaves only a small proportion of conversions caught up in non-described games.

From reading around, evaluating ‘partially defined games’ appears to be an unsolved problem with active investigation (https://www.goshen.edu/wp-content/uploads/sites/27/2015/05/Linear.pdf).

If you are determining value by some kind of algorithm, then it may be possible to generate these in-situ (e.g. logistic regression as your model, ref: https://huayin.wordpress.com/tag/attribution-modeling/).

For pre-modelled values, though, I’ve not yet worked out an answer (*makes note – this would be a good question to put to vendors..!*). For a simplistic resolution you could adjust for these undescribed games by ignoring them and simply scaling the known credit back up. A hybrid approach might also work for sets missing subsets: use the known shares where available and split the remainder equally between the remaining channels.
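For what it’s worth, a sketch of that first, simplistic adjustment (credit figures invented for illustration): drop the undescribed games and rescale the credit you could calculate so that it sums back to the observed total.

# Credit summed across the coalitions we *could* evaluate (illustrative figures)
known_credit <- c("PPC Brand" = 250, "SEO" = 120, "Display" = 10)

observed_conversions <- 400

# Scale the known credit so the attributed total matches what actually converted
scaled_credit <- known_credit * observed_conversions / sum(known_credit)
round(scaled_credit, 1)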

 

In the latter case of monotonicity, data partitioning (as described previously) has been suggested as a means to separate upper-funnel and lower-funnel activity. By modelling those returning clicks close to conversion differently from brand/product awareness activity, and adjusting credit between the pools, there is an implicit push of value back up the funnel.

I’ve no doubt that there will still be cases where a channel appears to have a negative effect. Setting a lower boundary of zero credit is a blunt way of approaching this, though it necessitates some modest rescaling of results: with the floor in place your model will inevitably allocate more conversions than actually occurred.

 

And so work continues. A welcome addition to the portfolio of approaches I can apply even if it isn’t 100% there yet. Though, what model is?


Logistic Regression revisited

I recently came across a different method using logistic regression to yield attributed credit, and having tested it for a few months am relatively happy with the results generated.

As previously, a standard glm logit model is run against an instance data table and coefficients determined for each campaign (plus an intercept). These coefficients are then resolved to odds via an exponential transformation, and these act as stakes in a proportional claim of conversion.

 

So for example, assume we have two campaigns yielding logodds coefficients as follows:

Intercept             Coefficient = -5.5

Campaign A        Coefficient = +1.6

Campaign B        Coefficient = +1.0

 

We translate these into odds:

Intercept = exp(-5.5) = 0.004

A = exp(1.6) = 4.95

B = exp(1.0) = 2.72

 

So for a conversion where both channels are present, the claim for channel A is 4.95 / (4.95 + 2.72 + 0.004) = 0.65. Channel B claims 0.35, and the intercept takes a very small fraction, ~0, in this instance.
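For completeness, a minimal sketch of that allocation in R, using the illustrative coefficients above (in practice the coefficients come from the glm logit fit on the instance table, as described):

# Illustrative log-odds coefficients from the example above
coefs <- c(Intercept = -5.5, CampaignA = 1.6, CampaignB = 1.0)

# Resolve to odds
odds <- exp(coefs)

# For a conversion where both campaigns (and the ever-present intercept) appear,
# each stake claims its proportional share
present <- c("Intercept", "CampaignA", "CampaignB")
claims  <- odds[present] / sum(odds[present])
round(claims, 2)   # Intercept ~0.00, CampaignA ~0.65, CampaignB ~0.35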

Quite a neat method, I think, which avoids the negative contributions that can occur when coefficients are resolved through the probability formula instead.

 

Touched upon in the prior post was the technique of partitioning data prior to running the regression in an attempt to reduce collinearity effects (i.e. impression-only paths / click-only / mixed). As part of our testing, we have also partitioned data into activity occurring pre- and post- first site landing, yielding models with two, three and six separate partitions respectively, which we combine and, in the case of pre/post splits, weight by unique converters.

Partitioning data by site visit isn’t a new approach, and the logic behind it is to separate activity from users who are likely in quite distinct decision states: pre-site advertising’s designated function is to raise brand and product awareness, while post-site advertising is likely functioning as retargeting or in a purely navigational role (where a user follows a brand search click rather than entering the URL).

Shapley Value regression (Game Theory) as an Attribution solution

Shapley Value regression (referred to commonly as Game Theory) is at its core quite an elegant solution, and one I’ve left until quite late on in the series to post about as it is an approach I haven’t implemented, despite understanding its principles.

[ I therefore offer no real practicable insight, short of that from discussions with 3rd party practitioners and published papers. ]

 

At the core of this algorithm sits a fairly simple two-stage approach:

1              Identify a baseline ‘importance’ value for each campaign that represents the expected conversion performance (number of conversions, or anticipated conversion rate)

2              Run a series of regressions comparing the importance value of each campaign in turn with each of the others as a pair, triplet, or higher order combination. Allocate the conversions observed when these combinations occur in the attribution data based on the Shapley Value regression approach

 

In essence, the approach simulates a series of multivariate (A/B) test uplift comparisons. I first came across this approach when assisting in pilot testing with GoogleNY shortly after their acquisition of Adometry. At the time I understood little of the approach, focusing more on the credibility of the attributed outputs.

I came across it again when research uncovered the Shao and Li paper “Data-driven Multi-touch Attribution Models”, which details a 2nd order Shapley regression almost as an aside to their ensemble logistic regression approach (the latter of which, frustratingly, I cannot get to produce valid results using our data!).

A more comprehensive step by step detailing of the technique is incorporated in to the Abakus patent (http://www.google.co.uk/patents/US8775248). Incidentally, Abakus provide a series of short videos on their site explaining how the technique is applied, though the algorithm (as far as I could see) isn’t covered.

 

Hence, a simplified example of how stages 1 and 2 are executed (as I understand it). Note that the below approach is based upon the Abakus method described in their patent, but other methods for determining the stage 1 ‘player values’ are alluded to, if not detailed, in other articles (http://huayin.wordpress.com/tag/attribution-modeling/), such as logistic regression, time-decay based models, etc.

 

Imagine this core result set obtained from a hypothetical attribution data set:

S1        | S2      | S3      | Desc   | Unique Users | Conversions | cr%
PPC Brand |         |         | p1     | 800          | 60          | 7.500%
SEO       |         |         | p2     | 850          | 45          | 5.294%
Display   |         |         | p3     | 150000       | 10          | 0.007%
PPC Brand | SEO     |         | p1,2   | 300          | 25          | 8.333%
PPC Brand | Display |         | p1,3   | 1800         | 65          | 3.611%
SEO       | Display |         | p2,3   | 1900         | 55          | 2.895%
PPC Brand | SEO     | Display | p1,2,3 | 700          | 40          | 5.714%
P         |         |         | p0     | 3000000      | 100         | 0.003%

 

 

From this aggregated data set, we ascertain each campaign’s ‘playing value’ – i.e. stage 1 above. Starting with PPC Brand, we multiply its conversion rate in isolation (7.5%) against all the combinations in which it appears (800+300+1800+700) and add to that the rate that would occur anyway (the baseline of 0.003% against 850+150000+1900+3000000). This gives a playing value for V1 (PPC Brand only) of 375. This is repeated for each of the single-campaign combinations (V2 and V3).

For the two-campaign sets, each channel’s base conversion rate is applied against the combinations it occurs in but the OTHER campaign does not, plus the joint rate where they occur together. So for PPC + SEO, this set (V1,V2) is worth =

7.5% * (800+1800)                           (the PPC component of the set)

+ 5.294% * (850+1900)                   (the SEO component of the set)

+8.333% * (300+700)                      (the combined PPC+SEO components)

+0.003% * (150000+3000000)      (the baseline * what’s left)

= 529

 

The last remaining combination of all sets (V1,V2,V3) is effectively the ‘game’ total – a sum of all conversions (400). Then we account for the baseline set, V0 (no media), which is 105 (the baseline rate of ~0.003% applied to all users).
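A minimal R sketch of this stage 1 calculation, as I read it: for any coalition S, each observed combination contributes its users multiplied by the conversion rate of whichever part of that combination falls inside S (or by the baseline rate if none does). The data is the example table above.

# Aggregated result set from the example table above
sets <- list(
  list(members = c("PPC Brand"),                   users = 800,     conv = 60),
  list(members = c("SEO"),                         users = 850,     conv = 45),
  list(members = c("Display"),                     users = 150000,  conv = 10),
  list(members = c("PPC Brand", "SEO"),            users = 300,     conv = 25),
  list(members = c("PPC Brand", "Display"),        users = 1800,    conv = 65),
  list(members = c("SEO", "Display"),              users = 1900,    conv = 55),
  list(members = c("PPC Brand", "SEO", "Display"), users = 700,     conv = 40),
  list(members = character(0),                     users = 3000000, conv = 100)  # baseline p0
)

# Conversion rate for a given (possibly empty) combination of campaigns
rate_of <- function(members) {
  for (s in sets) if (setequal(s$members, members)) return(s$conv / s$users)
  stop("combination not observed: ", paste(members, collapse = ", "))
}

# 'Playing value' of a coalition S
playing_value <- function(S) {
  sum(sapply(sets, function(s) s$users * rate_of(intersect(s$members, S))))
}

playing_value(c("PPC Brand"))                    # ~375 (V1)
playing_value(c("PPC Brand", "SEO"))             # ~529 (V1,V2)
playing_value(c("PPC Brand", "SEO", "Display"))  # 400  (the 'game' total)
playing_value(character(0))                      # ~105 (V0, the baseline)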

This yields a ‘game’ as follows against which we enter each ‘campaign value’ in the order seen (starting with the special value, V0 = 105). On adding each subsequent set, we evaluate the ‘incremental’ difference.

E.g. in the first combination, PPC Brand, we know its V1 score is 375. Therefore after V0 has claimed 105, only 270 is left.

In the second combination, PPC Brand+SEO, we apply 105, then 270, then 529-270-105 = 154

And so on as follows:

 

Marginal Improvement Values
Step 1    | Step 2    | Step 3    | V0 (Base) | PPC Brand | SEO   | Display
PPC Brand |           |           | 105       | 270       |       |
PPC Brand | SEO       |           | 105       | 270       | 154   |
PPC Brand | Display   |           | 105       | 270       |       | -92
PPC Brand | SEO       | Display   | 105       | 270       | 154   | -129
PPC Brand | Display   | SEO       | 105       | 270       | 117   | -92
SEO       |           |           | 105       |           | 198   |
SEO       | PPC Brand |           | 105       | 225       | 198   |
SEO       | Display   |           | 105       |           | 198   | -57
SEO       | PPC Brand | Display   | 105       | 225       | 198   | -129
SEO       | Display   | PPC Brand | 105       | 154       | 198   | -57
Display   |           |           | 105       |           |       | 5
Display   | PPC Brand |           | 105       | 173       |       | 5
Display   | SEO       |           | 105       |           | 136   | 5
Display   | PPC Brand | SEO       | 105       | 173       | 117   | 5
Display   | SEO       | PPC Brand | 105       | 154       | 136   | 5
Attribution:          |           | 105       | 223       | 164   | 5
Target is 400 conversions:        | 105       | 167.7     | 123.5 | 3.9

 

Each campaign’s attribution score is the average contribution down its column. The combined total typically exceeds the actual number of conversions, and hence is scaled back to a target number of conversions.

 

Negative values occur because, while the model is based on cooperative game theory (that is, the worth of a combination of participants’ activity cannot be worse than that of the best ‘player’ alone), in reality two channels occurring together may well have a conversion rate LESS than the parts achieve “independently”. This effect is observed in the Anderl et al. result set (http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2343077). Does this mean that a channel has a negative effect? Likely not. It’s much as we observe in logistic regression, where the likelihood is that two different customer states are being pooled.

 

Hence we again turn to data pre-processing, whereupon I can offer no real practical advice short of that recommended in the Abakus patent and elsewhere (such as Archak et al., “Mining Advertiser-specific User Behaviour Using Adfactors”): pre- and post-site-visit segmentation of data. The logic behind this is to ascertain the clicks that are most likely ‘navigational’ in nature, and thus push credit the other side of this divide into ‘influencing’ activity.

In addition, a proportion is extracted from the influencing bracket and allocated to a baseline of non-digital media conversions. I am admittedly unclear how these proportions are decided given the exact number of non-converting, non-exposed users is not directly measurable, but trust that an estimate of cookies in market can at least give a steer. I imagine also a calibration of the outputs is enacted in order to bring the results in line with those expected (again from independent tests).

 

So in summary – game theory as an approach isn’t overly complex. The difficulty, though, is the efficient reduction of the data into distinct sets, and then the stepwise allocation of ‘values’ as seen in stage 1. Stage 2, as it happens, has been solved in a variety of freely available R packages (and is quick to execute).

 

If you’ve access to GA Premium, then ultimately it is a version of this approach that is deployed in the Data-Driven Algorithm available under the attribution section. It’s not easy though to determine what data manipulation has been employed (if any?) given that the algorithm is locked up in a black box click-and-go.

Even so, I find the results interesting if only for the ability to see the ‘direct conversion’ land grab. Display campaigns increase against a last click view of the world, but maybe not quite so much as we see in say the Markov approach. In addition, some non-brand PPC terms also seem to increase – the opposite from that seen in other approaches I’ve tried. This might be via collection from the direct conversion pool, but either way, it is always interesting to have another model, especially one requiring little effort from the user.

__

Note for reference: An independent comparison of results is available in the aforementioned Anderl paper and is well worth a read given the differences observed between models.

Markov Chain approach to digital attribution

I’ve been using a Markov Chain approach for some time now, refocusing on this approach  soon after the release of Davide Altomare’s excellent ChannelAttribution package in R back in January (’16). We’ve had a basic implementation of this method built in R for some time, but having the speed (from C libraries) and the simplicity of its front end has really allowed extensive testing of variables and input data.

Over the past few months it has been pleasing to see this approach gain wider exposure (http://www.lunametrics.com/blog/2016/06/30/marketing-channel-attribution-markov-models-r/ and http://analyzecore.com/2016/08/03/attribution-model-r-part-1/ ), though these working examples predominantly use data from Google Analytics which can have gaps when it comes to display impression tracking.

Rather than recite the methodology here (both the above posts cover these well, as does Davide’s original slideshare), I’d prefer to share my experience with this approach. As is the case with other attribution techniques, the model is sensitive to the data you feed in to it, and importantly I feel (so far) this hasn’t been discussed in detail.

 

If you are using purely path to conversion data, that is only successfully converting journeys, then you will achieve a very different result from the model than if you include non-converting journeys (and apply the var_null parameter). Principally this is because your starting and subsequent transition probabilities will vary hugely if you capture and incorporate all display impressions.

Objectively having only conversions isn’t a true representation of the underlying data, but whether or not including the non-converting data improves the results is more subjective. As mentioned in previous posts, running independent A/B uplift tests can help give a target with which to select an approach.
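For reference, a minimal sketch of the comparison I mean, using the ChannelAttribution package on aggregated path data (the data frame, its column names and the figures are purely illustrative; check your installed version’s documentation for the exact arguments):

library(ChannelAttribution)

# Illustrative aggregated journeys: one row per unique path, with counts of
# converting and non-converting (null) journeys
paths <- data.frame(
  path              = c("Display>SEO", "SEO", "Display>PPC Brand>SEO"),
  total_conversions = c(120, 540, 60),
  total_null        = c(8000, 2100, 450)
)

# Converting journeys only...
m_conv_only <- markov_model(paths, var_path = "path",
                            var_conv = "total_conversions", order = 2)

# ...versus converting plus non-converting journeys via var_null
m_with_null <- markov_model(paths, var_path = "path",
                            var_conv = "total_conversions",
                            var_null = "total_null", order = 2)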

 

Secondly, changing the order of your Markov model also has a significant effect on your results. If you read the paper on which the package is based (Eva Anderl et al., http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2343077), the authors settled on a third-order model based on the model fit to the data, but I have read results from other sources presenting best fit with second-order Markov chains.

These decisions are predominantly based on a few statistical tests: how well the model predicts the original data’s successes and failures, how consistent the results are (stability), and top-decile lift. These tests are fairly arbitrary to my mind, and from a practitioner’s point of view I am more concerned with how the results match up with other, independently sourced real-world results. While I can respect and appreciate that having the model replicate the structure of the data is a good thing, having result consensus with independent tests is, for me, a better indicator of ‘goodness’.

As a result, I generally work with second- and third-order models and try to match external test results.

I think first order Markov is too simplistic, and suffers from a passive serving effect: that is, display impressions that appear in a user’s path but which likely haven’t influenced the conversion. Fourth order loses the finer detail; when the majority of conversions occur within a few steps then four-tuple sequences lose the transitional detail.

Having mentioned passive touchpoints such as impression serving (i.e. advertising that is dictated by a targeting algorithm rather than a user’s choice to interact such as that seen with a click), it is worth a further comment on this issue.

Where regression techniques pick out incremental correlation, the Markov approach rewards any touchpoint that occurs in a successful tuple. Our initial findings indicate that, without prior distinction, display impressions in particular receive more credit than a ‘causation’-based test would suggest. Model order selection does vary this credit, but the thornier problem here is impression viewability.

Wider reading can be found in many Media industry sites, but the quick version of this is that not all served impressions are, or can be, actually seen. Without data to ascertain the viewability of each impression, there is a danger you reward a campaign for its ability to identify warm targets rather than its ability to influence the conversion outcome.

We therefore exclude non-viewable impressions whenever we have the option to mitigate this eventuality.

 

Our work on this approach continues; despite some of the limitations above the results are not unreasonable when compared against a range of sense checks. A key advantage is that all channels are described in this model; though a downside remains that no offline/exogenous baseline contribution is identified.

Most interesting for us is how the contribution of display impressions appears to increase notably vs. last-touch models, and we see much closer comparisons with A/B test results from this modelling approach than from others we’ve deployed.

We are investigating further methods to refine the modelling approach. Redistribution of credit for single-click and direct-to-site conversions is one such idea: where these are driven by online activity, may we assume that incomplete path data has been a factor? If so, we already have a probabilistic model built that can suggest what the previous step may have been.

There may be some merit in using VLMCs (Variable length Markov chains) rather than fixed length chains in order to accommodate the very different path lengths observed. Also, the possibility of integrating some offline touchpoints (such as a TV advert) based on time sequencing is an appealing proposal for further development.

Hopefully as adoption of this approach widens a broader set of ‘data pre-processing’ approaches will be uncovered and tested.

Interaction Sequence Models

 

Sequence models apply the premise that there is an implicit ‘worth’ to when an interaction occurs in a customer’s research journey. The complexity of these models may vary, but two main strands of approach involve:

  • How distant was an interaction from the conversion point, and
  • Does interaction “A” have more or less worth when either preceding, or following, another interaction “B”?

 

The first proposal really is one of recency where there is greater importance placed on interactions either nearer the end or the start of a customer journey. There is a recall consideration implied within these models, along the lines of: “Did an impression served two weeks ago really have any bearing on a customer converting now?” And conversely, “Was the original impression a key component in putting the customer on their current track?”

I find the second strand an intriguing proposal, if only for the creativity I’ve seen in examples of manually created rules. Individual cases of imaginative thinking aside, there is merit to the principle that certain combinations of campaigns outperform others, and certainly outperform isolated activity. Likewise, we may also assume that an awareness campaign occurring after a user has visited your website several times is of limited value.

 

Both these approaches are offered in limited fashion within the Google reporting platforms. The downside is that it is nigh impossible to see exactly what weights are being applied to your data, and there is little in the way of guidance or a testing mechanism to see whether these are even remotely close to accurately fitting your data.

In practice, I would suggest that these models are likely a poor fit to your data and should be taken with a hefty pinch of salt. Even assuming (for example) the time-decay model is a fair reflection of the overall effect, broadly applying it to every channel is nonsensical. The brand recall effect of a search click from a week ago will in the vast majority of cases be stronger than that of a display impression from the same time.

In addition, these models are dropped on top of only successful paths and there is no consideration of ‘failure’ in their calculation. For any given simple model, consider these cases:

 

Search Click → Convert    (×50 examples)

Display Impression → Convert    (×15 examples)

 

If we now see a couple of paths like this:

Display Impression → Search Click → Convert

Search Click → Display Impression → Convert

 

Then given our available information, we would probably end up giving the search click and display impression equal worth in the above two paths. After all, why not? There’s nothing in our data to suggest one is better than the other. Simple frequency of occurrence might just be down to user exposure, volume of spend, or some campaign aggregation effect.

 

If however you now introduce these new facts:

Search Click → Does Not Convert    (×200 examples)

Display Impression → Does Not Convert    (×2000 examples)

 

Then suddenly there’s an indication that the display impression may not share the same influence on conversion as the search click, and given that the user may well have converted without the impression, our impression should only receive a fraction of the value we would previously have awarded.
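Putting the same counts into R (a crude, purely illustrative split, not a formal method):

# Single-channel conversion rates once the failed paths are counted
search_cr  <- 50 / (50 + 200)    # = 0.20
display_cr <- 15 / (15 + 2000)   # ~= 0.0074

# A crude split of credit for the two mixed paths, in proportion to those rates
search_share  <- search_cr / (search_cr + display_cr)  # ~0.96
display_share <- 1 - search_share                       # ~0.04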

 

It’s clear then that a more accurate approach is to include, if not all, then at least a large sample of failed paths, so as to mitigate an overly generous attribution. By hand it would be impossible to calculate the fractional weights we should apply to each campaign’s interactions, and within each campaign to each position in the path, and so we turn to the power of computers to solve this algorithmically.

 

Solver Algorithms sift through the vast quantities of path data adjusting a series of variables that represent the positional weight of each activity. A ‘goodness of fit’ is measured by calculating the difference between each individual case’s predicted outcome (a fraction between 0 and 1), and the actual outcome (0 = not converted, 1 = converted).

The algorithm then iterates to minimise this difference across the entire data set and, after a given time or a set number of iterations without improvement, the best solution is returned. We apply these final weights to give us the share of conversions for each campaign grouping.
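As a hedged illustration of those mechanics only (not any vendor’s algorithm), here is a toy R sketch: a single decay parameter weights each touchpoint by its distance from conversion, the weighted sum is squashed to a predicted probability, and optim() searches for the parameter values minimising the squared error against the observed outcomes.

# Illustrative path data: each row is a journey; steps_from_conv lists, per touch,
# how many steps before the end of the path it occurred; converted is 0/1
journeys <- list(
  list(steps_from_conv = c(2, 1, 0), converted = 1),
  list(steps_from_conv = c(3, 0),    converted = 0),
  list(steps_from_conv = c(0),       converted = 1),
  list(steps_from_conv = c(4, 2),    converted = 0)
)

# Predicted conversion propensity: exponential decay by position, squashed to (0, 1)
predict_one <- function(journey, intercept, decay) {
  score <- intercept + sum(exp(-decay * journey$steps_from_conv))
  1 / (1 + exp(-score))
}

# Goodness of fit: squared difference between prediction and actual outcome
sse <- function(par) {
  preds  <- sapply(journeys, predict_one, intercept = par[1], decay = par[2])
  actual <- sapply(journeys, `[[`, "converted")
  sum((preds - actual)^2)
}

fit <- optim(par = c(intercept = -1, decay = 0.5), fn = sse)
fit$par   # best-fitting intercept and decay weight under this pre-supposed model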

All very clever.

However, for all the complexity of this approach, the solver simply returns the best solution to our pre-supposed model, which is itself a human construct. It is our assumption that positional values decay over time, and that this decay follows a power series, an exponential, or a simple 1/n proportionality in n [steps] or t [hours]. If we’ve chosen a poor model to begin with, then we’ve merely achieved a very precise level of wrongness. For this reason, my preference is to avoid decay-curve-only modelling.

___

(Worth a view here is this presentation: https://www.youtube.com/watch?v=AZtLZn34IuY which brings together decay modelling with logistic regression.)

Logistic Regression as a digital attribution modelling approach

 

Regression techniques correlate advertising activity with conversion results, returning a contribution value for each campaign (or channel) grouping in the form of a coefficient.

We employ a logistic regression probability model to predict binary outcomes (did convert or didn’t convert) from a large sample of paths. Even across a fairly wide series of advertising campaigns and a couple of million cases run time is generally quite quick on modern computers, so this allows the analyst to experiment with including and excluding channels from the mix of variables in order to achieve the best model.

Excluding a channel may seem counter-intuitive: after all, surely all our advertising should be generating a positive result? Well, in theory probably yes; however, in practice some campaigns do need lifting out of the model in order to avoid nonsensical results. This effect is caused by limitations of the tracking data and a lack of “unseen, non-converter” data to represent unexposed users. To a degree, we may compensate by creating an artificial baseline for everyone we don’t know about (in the tens of millions of cookies), but this approach is rather arbitrary.

To explain further:

Consider that for the vast majority of our data, “not-converted” is the dominant state. In most client data sets there will be massively more internet users to whom an advert has been delivered without them going on to become purchasing customers, particularly if a general awareness (“reach”) campaign is active. By far the most common case therefore will be “Impression Viewing = 1 (seen), converted = 0 (no)”. No surprise there: a cold campaign is about generating the initial seeds of awareness for the most part rather than short-term customer conversion-to-sale.

Contrast this with a paid search term/advert, where a customer has actively searched for your brand or product then clicked. While the ratio of conversions:non-conversions may still be in the low percentages, this ratio is likely far higher than the previous case since here the customer is clearly in market and/or pro-actively researching.

Let us now also consider an example of mixing both channels, in particular the mechanic of serving an impression after a site visit. A random customer landing on your site from a search click is fairly ‘warm’, and those that convert there and then, or return through another search click, contribute to a positive relationship coefficient through the regression.

However, what if a user doesn’t convert there and then? Clearly they have given an indication (for whatever reason) that they are probabilistically less likely to convert. Even with some level of exclusion filtering, there will still be a proportion to whom we serve another advert; after all, likely as not they’ve ticked the general boxes that make them a good target opportunity.

In terms of the regression modelling, we’re comparing someone with a high probability to convert via a search click with someone who has a search click and view impression who is LESS likely to convert (i.e. a negative coefficient). The model looks at this naively and just correlates the serving of an impression with a negative effect on sales.

 

Applying some common sense here we acknowledge this is unlikely to be the case. It is unintuitive to assume that the advert has a negative effect on sales unless it is abhorrent in some way. The issue is that we have no data to represent a ‘control’ pool against which we can then calculate uplift.

One viable solution then is to generate independent pools based on activity; for example modelling “impression only” users, “click only” users, and then some ensemble model on those users with exposure to both. Regardless; it’s difficult (maybe impossible) to fully unpick the collinearity that occurs in the mixed exposure set.

 

There are alternative methods that allow us to work out incrementality from a control (A/B testing for one), but within just a logistic regression framework and against raw data, testing and removing variables that don’t register a strong enough correlation is part and parcel of the process.
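Returning to the artificial baseline mentioned earlier, a rough sketch of what I mean (all figures and column names invented for illustration): append an aggregated block of unexposed, non-converting cookies and fit the regression on success/failure counts.

# Aggregated exposure combinations with conversion counts (illustrative)
agg <- data.frame(
  display = c(1, 0, 1, 0),
  ppc     = c(0, 1, 1, 0),
  conv    = c(10, 60, 25, 100),
  users   = c(150000, 800, 300, 30e6)  # final row: artificial baseline of
)                                      # unexposed cookies (an arbitrary figure)

# Logistic regression on successes vs. failures for each exposure combination
fit <- glm(cbind(conv, users - conv) ~ display + ppc,
           family = binomial, data = agg)
coef(fit)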

There are also some fairly general assumptions applied in the practical application of this model.

Firstly; is the order of interaction with advertising irrelevant? The regression algorithm itself has no concept of sequence ordering, so any desired influence needs to be precalculated and presented to the algorithm as part of the data.

I have experimented with recency caps, for example excluding interactions over 7/14/21 days, albeit on client request rather than having seen any evidence that this is a fair assumption. In terms of applying a scaling factor for step sequence, to be honest I am unclear how these could be introduced fairly and accurately. Perhaps a different approach should be adopted if sequence is considered a major component?

Secondly; do you consider the number of interactions via each advertising channel to have import? Do three advert clicks through the same campaign indicate a higher likelihood to convert than a single click? Arguably it could indicate a lower likelihood for the same reasons as in our previous example?  This is an area where repeating model runs with different frequency caps may yield deeper insight and a stronger model.

In this vein, I always make a point of running a version where a simple binary placeholder indicates presence in the case, e.g.

 

[Table: example instance rows with a 1/0 column per campaign indicating presence in the path, alongside the converted outcome]
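As a toy illustration of the two versions (hypothetical data and column names), with interaction counts alongside the binary-presence recoding:

# Hypothetical instance table: interaction counts per campaign, plus the outcome
instances <- data.frame(
  brand_ppc = c(2, 1, 0, 0, 3, 0, 1, 0),
  display   = c(0, 4, 1, 0, 2, 1, 0, 5),
  seo       = c(1, 0, 0, 2, 0, 0, 1, 1),
  converted = c(1, 1, 0, 0, 0, 1, 1, 0)
)

# Binary-presence version: 1 if the campaign appears in the path at all
binary <- instances
for (col in c("brand_ppc", "display", "seo")) {
  binary[[col]] <- as.integer(binary[[col]] > 0)
}

fit_counts <- glm(converted ~ brand_ppc + display + seo,
                  family = binomial, data = instances)
fit_binary <- glm(converted ~ brand_ppc + display + seo,
                  family = binomial, data = binary)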

 

My last point here is to note that channel detail grouping (a “taxonomy”) is a significant step in the data preparation. With all models, the more cases that you have, generally the better supported your findings will be and the more confident you can be that whatever trends have emerged will hold true with subsequent activity.

However, to capture subtleties in search terms and to fit the multitude of different display advertising sizes and creatives, there are often thousands of unique combinations to accommodate. This detail is far too great to model with accuracy: as the granularity goes up so the number of supporting cases comes down. Likely as not too there will be operational limitations that make overly granular detail difficult to manage in an efficient manner.

Grouping finer campaign detail into a more actionable level is therefore an important step in producing usable results. An example may be rounding up the various client Brand terms and renaming occurrences with a single generic “Brand” placeholder, but allowing a more detailed structure under one of the other channels – for other paid search let’s say Product Type, Product Make, Search Location, and so on.
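As a toy sketch of that roll-up (the brand name and patterns are invented purely for illustration):

# Raw paid-search terms rolled up to an actionable taxonomy
raw <- c("acme shoes", "acme trainers sale", "buy acme",
         "running shoes london", "trainers cheap")

grouped <- ifelse(grepl("acme", raw), "PPC Brand",
           ifelse(grepl("london|near me", raw), "PPC Generic - Location",
                                                "PPC Generic - Product"))
table(grouped)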

 

All considerations taken, then, your output will hopefully be a fairly robust model – by which I mean one that produces a similar set of results across repeated random data samplings. Applying the regression coefficients back against the data generates a ‘campaign grouping level’ conversion attribution for subsequent review.

There will be some campaigns that you may have had to exclude from the model, and most likely a small pot of unexplained conversions. It’s reasonable to assume the ‘missing’ channels may stake a claim to some of these, while a proportion may also be other offline/external factors – depending on whether you’ve introduced a baseline component. Ideally, most of the conversions are allocated which then gives more credence to any subsequent return on investment calculations.

My experience so far with this method is a fairly mixed reception across internal business teams and their clients. There’s an acceptance that a statistical approach is a ‘proper’ method and trust in the results being fair – or at least not humanly biased. But there are also frustrations where the effect of a campaign cannot be discerned from the data, or due to data limitations, cannot be fairly evaluated in the same manner as the others.

Attribution in the Media and Advertising Industry

I work as an agency analyst in the media and advertising sector, and by far the most commonly requested work is attribution analysis.

For the uninitiated, this is the process of modelling which adverts have helped generate incremental business revenue, be that in the form of new customer interest as a precursor to sale or the sale itself, and then in some manner awarding nominal credit to that advert. The ratio between cost and credit is used to understand which strategies are generating the best return on investment for the next planning cycle.

An example then: let us imagine a customer who responded to seeing our TV advert by entering our brand name into an internet search engine. Clicking on a resulting paid-for link, they landed on our web site where they completed an order. Our customer also happened to have read a magazine in which one of our adverts had been placed, but it failed to resonate with them and they skipped over it.

Before bandying numbers and models around, it’s worth just remembering that the true purpose of attribution lies in answering these questions:

  • How do we know the customer saw the TV advert and responded to it?
  • How do we know the magazine advert failed to resonate?
  • What role did the search click play?
  • What was the relative importance of each advert to the sale, and ultimately, was it worth me spending the money on it?

I feel these basic concepts are often not given due appreciation in the rush to produce “a model” – merely having the latter seemingly more important than the question it is intended to answer.

Simply exposing a potential customer to an advert, for example, doesn’t mean it was effective. Likewise, how a customer navigates their web browser to your site might just be a route of convenience rather than an opinion-influencing step. And lastly; at what point does the customer journey cease to be influenced by advertising, and instead move over to being driven by the customer-service experience?

For an individual customer, these questions are admittedly impossible to answer. But across a larger sample size we can use data to identify repeat patterns between advertising and customer response. We can identify the channels that typically favour successful sales over missed ones. Crucially though, in the process of quantifying this influence we have to apply some assumptions about how we believe our customers respond. This is where the art mixes with the science: having strong insight into your customer journey can really shape the solution for the better.