Interaction Sequence Models

 

Sequence models apply the premise that there is an implicit ‘worth’ to when an interaction occurs in a customer’s research journey. The complexity of these models may vary, but two main strands of approach are:

  • How distant was an interaction from the conversion point, and
  • Does interaction “A” have more or less worth when either preceding or following another interaction “B”?

 

The first proposal is really one of recency, where greater importance is placed on interactions nearer either the start or the end of a customer journey. There is a recall consideration implied within these models, along the lines of: “Did an impression served two weeks ago really have any bearing on a customer converting now?” And conversely, “Was the original impression a key component in putting the customer on their current track?”

I find the second strand an intriguing proposal, if only for the creativity I’ve seen in examples of manually created rules. Individual cases of imaginative thinking aside, there is merit to the principle that certain combinations of campaigns outperform others, and certainly outperform isolated activity. Likewise, we may also assume that an awareness campaign occurring after a user has visited your website several times is of limited value.

 

Both these approaches are offered in limited fashion within the Google reporting platforms. The downside is that it is nigh on impossible to see exactly what weights are being applied to your data, and there is little in the way of guidance or a testing mechanism to see whether these are even remotely close to accurately fitting your data.

In practice, I would suggest that these models are likely a poor fit to your data and should be taken with a hefty pinch of salt. Even assuming (for example) the time decay model is a fair reflection of overall effect, broadly applying it to every channel is nonsensical. The brand recall effect of a Search Click from a week ago will, in the vast majority of cases, be stronger than that of a Display impression from the same time.

In addition, these models are dropped on top of only successful paths and there is no consideration of ‘failure’ in their calculation. For any given simple model, consider these cases:

 

Search Click → Convert (×50 examples)

Display Impression → Convert (×15 examples)

 

If we now see a couple of paths like this:

Display Impression → Search Click → Convert

Search Click → Display Impression → Convert

 

Then given our available information, we would probably end up giving the search click and display impression equal worth in the above two paths. After all, why not? There’s nothing in our data to suggest one is better than the other. Simple frequency of occurrence might just be down to user exposure, volume of spend, or some campaign aggregation effect.

 

If however you now introduce these new facts:

Search Click → Does Not Convert (×200 examples)

Display Impression → Does Not Convert (×2000 examples)

 

Then suddenly there’s an indication that the display impression may not share the same influence on conversion as the search click, and given that the user may well have converted without the impression, our impression should only receive a fraction of the value we would previously have awarded.
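
A minimal sketch of this arithmetic, using the hypothetical path counts above, shows how much the picture shifts once the failed paths are counted:

```python
# Hypothetical counts from the example above - not real campaign data.
converted = {"Search Click": 50, "Display Impression": 15}
not_converted = {"Search Click": 200, "Display Impression": 2000}

for channel in converted:
    total = converted[channel] + not_converted[channel]
    rate = converted[channel] / total
    print(f"{channel}: {converted[channel]}/{total} = {rate:.1%} conversion rate")

# Search Click:       50/250  = 20.0%
# Display Impression: 15/2015 =  0.7%
```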

 

It’s clear then that a more accurate approach is to include, if not all, then at least a large sample of failed paths so as to mitigate an overly generous attribution. By hand, it would be impossible to calculate the fractional weights we should apply to each campaign’s interactions and, within each campaign, to each position in the path, and so we turn to the power of computers to solve this algorithmically.

 

Solver Algorithms sift through the vast quantities of path data, adjusting a series of variables that represent the positional weight of each activity. A ‘goodness of fit’ is measured by calculating the difference between each individual case’s predicted outcome (a fraction between 0 and 1) and the actual outcome (0 = not converted, 1 = converted).

The algorithm then iterates to minimise this difference across the entire data set and, after a given time has elapsed or a number of iterations pass with no improvement, the best solution is returned. We apply these final weights to give us the share of conversions for each campaign grouping.
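
As an illustration of the general shape of such a solver, the sketch below uses scipy’s optimiser on a handful of toy paths; the (channel, position) buckets, starting weights and squared-error loss are assumptions chosen for the example, not a recommended setup.

```python
import numpy as np
from scipy.optimize import minimize

# Toy cases: each is (list of (channel, steps_before_conversion), outcome 0/1).
cases = [
    ([("search", 0)], 1),
    ([("display", 1), ("search", 0)], 1),
    ([("display", 0)], 0),
    ([("display", 1), ("display", 0)], 0),
]

# One weight per (channel, position) bucket.
buckets = sorted({tp for path, _ in cases for tp in path})
index = {b: i for i, b in enumerate(buckets)}

def loss(weights):
    # Sum of squared differences between predicted (0..1) and actual outcomes.
    err = 0.0
    for path, outcome in cases:
        predicted = np.clip(sum(weights[index[tp]] for tp in path), 0, 1)
        err += (predicted - outcome) ** 2
    return err

result = minimize(loss, x0=np.full(len(buckets), 0.5), bounds=[(0, 1)] * len(buckets))
for bucket, weight in zip(buckets, result.x):
    print(bucket, round(float(weight), 3))
```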

All very clever.

However, for all the complexity of this approach, the solver simply returns the best solution to our pre-supposed model, which is itself a human construct. It is our assumption that positional values decay over time, and that this decay follows a power law, an exponential, or a simple 1/n rule in n (steps) or t (hours). If we’ve chosen a poor model to begin with, then we’ve merely achieved a very precise level of wrongness. For this reason, my preference is to avoid decay-curve-only modelling.
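
For reference, the decay shapes mentioned above are simple to express as weighting functions; the sketch below is purely illustrative and the parameter values (exponent, half-life) are assumptions rather than recommendations.

```python
# Illustrative weighting curves for an interaction t hours (or n steps) before
# conversion. Parameter values are assumptions for the sake of the example.
def power_decay(t, alpha=0.5):
    return (t + 1) ** -alpha          # power-law decay

def exponential_decay(t, half_life=168):
    return 0.5 ** (t / half_life)     # half-life of one week (168 hours)

def reciprocal_decay(n):
    return 1 / (n + 1)                # simple 1/n over path steps

for t in (0, 24, 72, 168):
    print(t, round(power_decay(t), 3), round(exponential_decay(t), 3))
```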

___

(Worth a view here is this presentation: https://www.youtube.com/watch?v=AZtLZn34IuY which brings together decay modelling with logistic regression.)


Logistic Regression as a digital attribution modelling approach

 

Regression techniques correlate advertising activity with conversion results, returning a contribution value for each campaign (or channel) grouping in the form of a coefficient.

We employ a logistic regression probability model to predict binary outcomes (did convert or didn’t convert) from a large sample of paths. Even across a fairly wide series of advertising campaigns and a couple of million cases, run time is generally quite quick on modern computers, which allows the analyst to experiment with including and excluding channels from the mix of variables in order to achieve the best model.
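
A minimal sketch of that regression step, assuming the path data has already been rolled up to one row per cookie with exposure counts per channel grouping (the column names and values here are invented for illustration):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# One row per cookie/path: channel exposure counts plus the binary outcome.
paths = pd.DataFrame({
    "paid_search_brand":   [1, 0, 1, 0, 1, 0],
    "paid_search_generic": [0, 1, 1, 0, 0, 1],
    "display_prospecting": [0, 1, 0, 1, 1, 1],
    "converted":           [1, 1, 1, 0, 0, 0],
})

X = paths.drop(columns="converted")
y = paths["converted"]

model = LogisticRegression().fit(X, y)
for channel, coef in zip(X.columns, model.coef_[0]):
    print(f"{channel}: {coef:+.2f}")   # contribution coefficient per channel
```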

Excluding a channel may seem counter-intuitive: after all, surely all our advertising should be generating a positive result? Well, in theory probably yes; however, in practice some campaigns do need lifting out of the model in order to avoid nonsensical results. This effect is caused by limitations of the tracking data and a lack of “unseen, non-converter” data to represent unexposed users. To a degree, we may compensate by creating an artificial baseline for everyone we don’t know about (in the tens of millions of cookies), but this approach is rather arbitrary.

To explain further:

Consider that for the vast majority of our data, “not-converted” is the dominant state. In most client data sets there will be massively more internet users to whom an advert has been delivered without them going on to become purchasing customers, particularly if a general awareness (“reach”) campaign is active. By far the most common case therefore will be “Impression Viewing = 1 (seen), converted = 0 (no)”. No surprise there: a cold campaign is about generating the initial seeds of awareness for the most part rather than short-term customer conversion-to-sale.

Contrast this with a paid search term/advert, where a customer has actively searched for your brand or product then clicked. While the ratio of conversions:non-conversions may still be in the low percentages, this ratio is likely far higher than the previous case since here the customer is clearly in market and/or pro-actively researching.

Let us now also consider an example of mixing both channels, in particular the mechanic of serving an impression after a site visit. A random customer landing on your site from a search click is fairly ‘warm’, and those that convert there and then, or return through another search click, contribute to a positive relationship coefficient through the regression.

However, what if a user doesn’t convert there and then? Clearly they have given an indication (for whatever reason) that they are probabilistically less likely to convert. Even with some level of exclusion filtering, there will still be a proportion to whom we serve another advert; after all, likely as not they’ve ticked the general boxes that make them a good target opportunity.

In terms of the regression modelling, we’re comparing someone with a high probability to convert via a search click against someone with a search click and a viewed impression who is LESS likely to convert (i.e. a negative coefficient). The model looks at this naively and simply correlates the serving of an impression with a negative effect on sales.

 

Applying some common sense here we acknowledge this is unlikely to be the case. It is unintuitive to assume that the advert has a negative effect on sales unless it is abhorrent in some way. The issue is that we have no data to represent a ‘control’ pool against which we can then calculate uplift.

One viable solution then is to generate independent pools based on activity; for example modelling “impression only” users, “click only” users, and then some ensemble model on those users with exposure to both. Regardless; it’s difficult (maybe impossible) to fully unpick the collinearity that occurs in the mixed exposure set.

 

There are alternative methods that allow us to work out incrementality from a control (A/B testing for one), but from just a logistic regression framework and against raw data, testing and removing variables that don’t register a strong enough correlation is part and parcel of the process.

There are also some fairly general assumptions applied in the practical application of this model.

Firstly; is the order of interaction with advertising irrelevant? The regression algorithm itself has no concept of sequence ordering, so any desired influence needs to be precalculated and presented to the algorithm as part of the data.

I have experimented with recency caps, for example excluding interactions over 7/14/21 days, albeit on client request rather than having seen any evidence that this is a fair assumption. In terms of applying a scaling factor for step sequence, to be honest I am unclear how these could be introduced fairly and accurately. Perhaps a different approach should be adopted if sequence is considered a major component?
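
For what it’s worth, a recency cap of this sort is straightforward to pre-compute before the exposure features are built; the sketch below assumes a hypothetical touchpoint log with per-cookie timestamps and a 14-day cut-off.

```python
from datetime import timedelta
import pandas as pd

RECENCY_CAP = timedelta(days=14)   # hypothetical cut-off; 7/21 days tried the same way

# Toy touchpoint log: one row per interaction, with the path's conversion
# (or cut-off) time alongside for convenience. Field names are illustrative.
touchpoints = pd.DataFrame({
    "cookie_id": ["a", "a", "b"],
    "channel": ["display_prospecting", "paid_search_brand", "paid_search_brand"],
    "timestamp": pd.to_datetime(["2024-05-01", "2024-05-20", "2024-05-18"]),
    "conversion_time": pd.to_datetime(["2024-05-21", "2024-05-21", "2024-05-21"]),
})

# Drop anything older than the cap, then build per-cookie exposure counts.
recent = touchpoints[touchpoints["conversion_time"] - touchpoints["timestamp"] <= RECENCY_CAP]
exposure_counts = pd.crosstab(recent["cookie_id"], recent["channel"])
print(exposure_counts)
```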

Secondly; do you consider the number of interactions via each advertising channel to have import? Do three advert clicks through the same campaign indicate a higher likelihood to convert than a single click? Arguably it could indicate a lower likelihood for the same reasons as in our previous example?  This is an area where repeating model runs with different frequency caps may yield deeper insight and a stronger model.

In this vein, I always make a point of running a version where a simple binary placeholder indicates presence in the case, e.g.

 

[Table: example logistic regression input where each channel is represented by a simple 1/0 presence flag per case]
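
In code terms, the binary-presence run is simply a matter of collapsing whatever exposure counts feed the model down to 1/0 flags; a minimal sketch with invented counts:

```python
import pandas as pd

# Illustrative per-path exposure counts.
exposure_counts = pd.DataFrame({
    "paid_search_brand":   [3, 0, 1],
    "display_prospecting": [12, 5, 0],
})

# Collapse frequency to simple presence: 1 if the channel appears at all, else 0.
binary_presence = (exposure_counts > 0).astype(int)
print(binary_presence)
```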

 

My last point here is to note that channel detail grouping (a “taxonomy”) is a significant step in the data preparation. With all models, the more cases that you have, generally the better supported your findings will be and the more confident you can be that whatever trends have emerged will hold true with subsequent activity.

However, to capture subtleties in search terms and to fit the multitude of different display advertising sizes and creatives, there are often thousands of unique combinations to accommodate. This detail is far too great to model with accuracy: as the granularity goes up so the number of supporting cases comes down. Likely as not too there will be operational limitations that make overly granular detail difficult to manage in an efficient manner.

Grouping finer campaign detail into a more actionable level is therefore an important step in producing usable results. An example may be rounding up the various client Brand terms and renaming occurrences with a single generic “Brand” placeholder, while allowing a more detailed structure under one of the other channels – for other paid search, let’s say Product Type, Product Make, Search Location, and so on.
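
As a sketch of how that grouping step might look in practice – the patterns and group names below are invented for illustration, not a recommended taxonomy:

```python
import re

# Ordered list of (pattern, group): the first match wins.
TAXONOMY = [
    (r"(?i)brand", "Brand"),                                 # any brand-term variant
    (r"(?i)generic.*(sofa|chair|table)", "Generic - Product Type"),
    (r"(?i)display.*(prospect|awareness)", "Display - Prospecting"),
]

def map_to_group(campaign_name):
    for pattern, group in TAXONOMY:
        if re.search(pattern, campaign_name):
            return group
    return "Other"

print(map_to_group("UK_Brand_Exact"))       # Brand
print(map_to_group("Generic_Sofa_Broad"))   # Generic - Product Type
```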

 

All considerations taken, then: your output will hopefully be a fairly robust model, by which I mean one that produces a similar set of results across repeated random data samplings. Applying the regression coefficients back against the data generates a ‘campaign grouping level’ conversion attribution for subsequent review.

There will be some campaigns that you may have had to exclude from the model, and most likely a small pot of unexplained conversions. It’s reasonable to assume the ‘missing’ channels may stake a claim to some of these, while a proportion may also be other offline/external factors – depending on whether you’ve introduced a baseline component. Ideally, most of the conversions are allocated which then gives more credence to any subsequent return on investment calculations.
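
One simple way of pushing the fitted coefficients back onto the converted paths is to share each conversion across the channels present in that path, in proportion to their positive coefficients; the apportioning rule and numbers below are assumptions for illustration only.

```python
# A sketch only: share each conversion across the channels present in its path,
# in proportion to their positive coefficients. The rule and the numbers are
# illustrative assumptions, not a definitive allocation method.
coefs = {"paid_search_brand": 1.8, "paid_search_generic": 0.9, "display_prospecting": 0.3}

converted_paths = [
    ["paid_search_brand"],
    ["display_prospecting", "paid_search_generic"],
]

credit = {channel: 0.0 for channel in coefs}
for path in converted_paths:
    weights = {c: max(coefs[c], 0) for c in set(path)}
    total = sum(weights.values())
    if total == 0:
        continue                      # path contains only excluded/negative channels
    for channel, w in weights.items():
        credit[channel] += w / total  # each conversion sums to 1 across channels

print(credit)
```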

My experience so far with this method is a fairly mixed reception across internal business teams and their clients. There’s an acceptance that a statistical approach is a ‘proper’ method and trust in the results being fair – or at least not humanly biased. But there are also frustrations where the effect of a campaign cannot be discerned from the data, or due to data limitations, cannot be fairly evaluated in the same manner as the others.

Digital Attribution Modelling

Having a feel for your typical customer journey is an integral part of the attribution model design process.

For example, understanding the number of steps your average user takes, and over what time frame, can massively influence the assumptions you make and hence your choice of model. Fortunately, almost all digital reporting platforms provide you with this kind of insight as part of their standard reporting toolkit, under a name like “time lag to conversion”, “path length”, or something similar.

If most customer journeys are only a single step and a customer ‘converts’ there and then, you need to make few assumptions: whatever model you’ve chosen, there are few ways to distribute the credit.

In my experience, short paths are common in insurance products. A customer’s mindset is one of speed, convenience and price – often on a renewal deadline. With the prominence of price comparison sites acting as one-stop shops, a customer has few opportunities to be exposed to advertising and as such generates few data “touch points”.

As the research phase extends, so we collect more advertising interactions and the complexity of our path evaluation increases. Where a purchase decision may take weeks or months, we must now decide not only on advert resonance but also on how time plays a part.

Early on in their journey, a customer may just be getting a feel for the marketplace: which brands are out there, what are their perceptions or previous experiences of these, do they trust the site, and does it have what they are in the market for?

As they progress: maybe their activity is more about deciding between a shortlist of products or brands? We expect refined search terms and specific information sought on dedicated web sites. Towards the point of conversion have they in effect made up their mind, and are now looking simply for reassurance through second opinions on social media and product review sites?

I see longer journeys more commonly with high value purchases and/or ‘desirable’ goods – homes, holidays and mobile phones for example: areas where customers are happy to spend more time browsing and researching to get a specific deal or set of product features.

From an analytical viewpoint then: does our understanding of the customer journey extend to allowing us to identify which stages are more important than others? Are they equally important? Do we believe that as the user nears decision then so the advert’s influence becomes more important? Or do we assume the early (or first) brand association step was key to the chain? Is a particular advert channel known to be strong in this sector?

In some cases we can answer these questions with data, in others we must make an educated choice.

If you’ve already dipped your toe into the Google Analytics world, you will probably have seen their [relatively] simple modelling tools. These reports allow a simple, low-cost entry into envisaging the results of a particular credit attribution model:

[Screenshot: the standard attribution model comparison reports available within Google Analytics]

I would say if you have fairly short conversion paths of one or two steps, and/or your online budget levels are fairly small, then these are perfectly adequate tools to begin exploring and testing models.

As a recommendation, my advice would be to try and steer clear of the ‘absolute’ models (First interaction and Last interaction). If you’ve read any of the previous posts, you’ll hopefully agree that giving all credit to just one interaction is like giving all the credit in a football match to your goal scorer. Assuming you optimise on the results of your model (if not – why bother at all?), you’d end up with a team purely of forwards, which is more than likely “a bad strategy™”.

There are some other notable limitations to these positional models, and I would advocate being aware of them, even though they don’t necessarily prevent you from drawing useful insight.

Firstly: technological limitations. Online tracking technology is not perfect. The primary tracking mechanism is a cookie, which is device specific; by that I mean tied to your mobile, tablet, laptop or computer. Many customers will use multiple devices depending on where they are and whom they are with: assuming they don’t identify themselves to the browser (such as signing in to Google+), they will generate independent cookie data on each device.

While you can see the device on which the user completes a purchase (last click), you have no idea where they really started their journey. Putting all your eggs on the first known step is, I’d suggest, at best imprecise and at worst horribly wrong.

Secondly: these models are digital domain only, and make no accounting for offline advertising such as TV and Radio, or even plain old word-of-mouth. You will with certainty be over-estimating the contribution from your online advertising, particularly through any brand name search advertising which is commonly seen as a means to site navigation, rather than advertising per se. You might be able to apply some gut feel scale-factor to adjust these numbers particularly if you’ve been monitoring offline campaigns for a while, but unfortunately at this level of complexity there’s not much more science you can apply.

Thirdly: in the GA implementation at least, the data used to feed these models only contains the paths of successful ‘converters’ (i.e. customers who complete a certain phase such as quote request or sale). Each occurrence of a particular advert is rewarded with incremental credit regardless of its realistic influence at a given step, and there is no negative feedback from failed conversions against which this incrementality may be offset. In more complex models, we observe that repeat advert appearances can in fact be an indicator of decreased effectiveness.

Still, no model is perfect and given that the results return quickly it is worth comparing a couple of these models side by side. If the results come out pretty similar for a period of a month, then it’s probably not worth getting too hung up on exactly which is best – take one, and try it out for a while (i.e. see if making the changes the results imply actually result in better returns).

When your online advertising expenditure sits in the “higher” category, or your customer journeys are more protracted, then you are probably at a stage where either a more complex model, or certainly a better fitted model, is worth investment.

I’ll get on to techniques later, but a good rule of thumb here is that these models shouldn’t just apply more complex rules, but more complex techniques. This means that rather than making more assumptions, you make fewer.

By more complex techniques; we’re really talking about utilising more complete data sets, either pushing it through statistical models (for example, logistic regression or decision trees), crunching it brute-force style with algorithmic solvers to produce weighted positional or Markov chain models, or using mathematical game theory solutions (such as obtained via Shapley Value techniques).
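
To give a flavour of the Markov chain variant: credit is typically derived from a ‘removal effect’ – how much the modelled conversion probability drops when a channel is taken out of the chain. The transition probabilities in the sketch below are invented; in practice they come from counting step-to-step moves across the full set of converting and non-converting paths, and the removal effects are then normalised to share out the observed conversions.

```python
# Transition probabilities between states, in practice derived from counting
# observed step-to-step moves in the path data. These numbers are illustrative.
transitions = {
    "start":   {"search": 0.5, "display": 0.5},
    "search":  {"conv": 0.3, "display": 0.1, "null": 0.6},
    "display": {"conv": 0.05, "search": 0.25, "null": 0.7},
}

def conversion_probability(transitions, removed=None):
    # Iterate P(eventually reach "conv") for every state; a removed channel
    # contributes nothing (its onward probability is forced to zero).
    states = list(transitions)
    p = {s: 0.0 for s in states}
    for _ in range(200):
        for s in states:
            if s == removed:
                p[s] = 0.0
                continue
            total = 0.0
            for nxt, prob in transitions[s].items():
                if nxt == "conv":
                    total += prob
                elif nxt in p:
                    total += prob * p[nxt]
            p[s] = total
    return p["start"]

base = conversion_probability(transitions)
for channel in ("search", "display"):
    removal_effect = 1 - conversion_probability(transitions, removed=channel) / base
    print(channel, round(removal_effect, 3))
```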

I’ve seen, and admittedly produced (under protest!), rule-based attribution models that apply an inordinate number of “if-then” conditions, all of which were based entirely on conjecture. These results were reported upon, but with an ever changing strategy landscape and high seasonal volatility no feasible testing structure would ever be possible. In other words, it was guess work.

Testing is a basic requirement for any model development lifecycle. If you make changes based on a model that doesn’t result in greater success (all other things being equal), then you haven’t got a good model and you’re likely wasting money. Moreover, if you can’t work back from results to understand which part of your model might be at fault, you’re doomed to repeat the failings.

The best way to test your model is, of course, to spend some money. If it looks like doing more of A is a cost effective solution, then do some more “A” and see if your sales increase and efficiency remains roughly consistent. Utilise A/B tests to help calibrate your model.

However, it is often difficult to convince a stakeholder to ‘gamble’ on a model, and frequently the marketplace differs significantly from one month to the next, making the ‘all other things being equal’ stipulation awkwardly un-equal. If you have a regular periodic measurement framework in place, though, and maybe a second model running in parallel for comparison, you can begin to see which particular adverts in a campaign are consistently performing well, and certainly those that aren’t, and from this begin to make some campaign decisions.

Choosing an attribution approach that is right for you

Let me start this section by saying: there is no ‘right’ attribution model. There are no off-the-shelf solutions out there that you can buy, drop on your data, and hey presto: instant, accurate campaign evaluations.

In order to get to a view of campaign performance there are concessions you must make, the biggest being whether to split your views of the online and offline world into independent, mutually exclusive existences, or whether to try and integrate them. No: I take that back. The biggest concession is how much time (or money) you’re willing to invest in improving your understanding of how your advertising works. More advanced techniques may take weeks to generate, and then need ongoing support and tweaks.

The underlying complexity driving this decision is a result of the vastly different data sets that arise from the offline and online worlds. The former are generally traditional media channels such as radio and TV, to which you may physically be unable to respond in kind (at least until touchscreen TV technology and its ilk is widely deployed). Adverts are served at set times in the programming, and are ‘slow’, typically 10-30 seconds in duration. Whether an individual has seen your advert is an unknown.

Contrast this with online advertising where adverts are shown across most pages, at any time, are renewed with each refresh and sometimes with each scroll. A user browsing around may be exposed to tens or hundreds of adverts in a very short space of time, with every viewing and interaction tracked to the nth degree. Advert content (‘creative’) can be generated programmatically on the fly, leading to a multitude of different versions and, significantly, there is (with some exceptions) a cookie record tracking the sequence of what you’ve seen and done.

In essence: you have offline data that needs (let’s say) weekly aggregations in order to provide sufficient sample volumes for a robust pattern to emerge. You then have online data that comes thick and fast, but where scale is thinned by high variety and fine detail.

This gives rise to two main techniques, Econometrics and Digital Attribution. And occasionally: an attempted mash of the two.

Econometrics attempts the holistic view, giving up some of the finer detail in order to give an all-encompassing view. Models will vary in complexity (often determined by the amount of time invested in developing the model), but the basic premise is one of linear relationships: do more of advertising type A, get a proportional increase in sales back x weeks later.

Digital Attribution foregoes explaining the offline activity and instead makes use of online tracking technologies to try and piece together a customer’s journey. Broadly speaking, these techniques fall into two camps. The first correlates customer interactions with sales activities, and leaves sales unexplained by patterns in the digital data in an unallocated ‘pot’. The second fully apportions the known sales across digital advertising channels. Both may choose (or not) to make sequencing and chronology a factor, but both techniques make use of case-wise customer journeys as opposed to the ‘activity volumes’ used in econometrics.

I’m not an econometrician, and my work falls into this second set of techniques: digital attribution. In agencies this is often a complementary piece to an econometrics project, and is used to provide further, more granular insight into the finer detail of online campaign performance.

So, having set up the background, let’s talk more about this latter field of analysis.

Attribution in the Media and Advertising Industry

I work as an agency analyst in the media and advertising sector, and by far the most commonly requested work is attribution analysis.

For the uninitiated, this is the process of modelling which adverts have helped generate incremental business revenue, be that in the form of new customer interest as a precursor to sale or the sale itself, and then in some manner awarding nominal credit to that advert. The ratio between cost and credit is used to understand which strategies are generating the best return on investment for the next planning cycle.

An example then: let us imagine a customer who responded to seeing our TV advert by entering our brand name in to an internet search engine. Clicking on a resulting paid-for link they landed on our web site where they completed an order. Our customer also happened to have read a magazine in which one of our adverts had been placed, but it failed to resonate with them and they skipped over it.

Before bandying numbers and models around, it’s worth just remembering that the true purpose of attribution lies in answering these questions:

  • How do we know the customer saw the TV advert and responded to it?
  • How do we know the magazine advert failed to resonate?
  • What role did the search click play?
  • What was the relative importance of each advert to the sale, and ultimately, was it worth me spending the money on it?

I feel these basic concepts are often not given due appreciation in the rush to produce “a model” – merely having the latter seemingly matters more than the questions it is intended to answer.

Simply exposing a potential customer to an advert, for example, doesn’t mean it was effective. Likewise, how a customer navigates their web browser to your site might just be a route of convenience rather than an opinion-influencing step. And lastly: at what point does the customer journey cease to be influenced by advertising, and instead move over to being driven by the customer-service experience?

For an individual customer, these questions are admittedly impossible to answer. But across a larger sample size we can use data to identify repeat patterns between advertising and customer response. We can identify the channels that typically favour successful sales over missed ones. Crucially though, in the process of quantifying this influence we have to apply some assumptions about how we believe our customers respond. This is where the art mixes with the science: having strong insight into your customer journey can really shape the solution for the better.