Growth Leaders Forum: roundup on Testing (culture and methodology)

Manfredi Sassoli
May 19, 2020
Updated: May 27, 2020

Last week, at our Growth Leaders’ virtual aperitivo, we had a great (and informal) chat on the topic of testing. The group was made up of me and four exceptional growth and marketing experts: Oren Greenberg, Yara Paoli, Ian Howie and Gianluca Binelli. Katia Damer also joined us; she is the founder and CEO of her company and has a Ph.D. in social psychology, so she knows a thing or two about testing.

In February I read an extended feature in Harvard Business Review on testing and experimentation. The feature had a strong focus on digital ventures and included a case study on Booking.com (link here), which describes constant testing in a completely democratised way: one where anyone can run tests.

I have always believed that speed of learning is the single most important metric to optimise for, so running thousands of tests sounds like the optimal solution. Unfortunately it’s not that simple (although I have come across a firm whose main KPI was the number of tests run).

LEARNING NOT TESTING

What we all agreed on was that testing designed in order to increase conversion rate or revenue is not particularly beneficial. This sounds counterintuitive, but there is a strong logic behind it.

Any hypothesis worth testing can turn out to be either true or false. For any working website, the likelihood is that the hypothesis will turn out to be false, so in order to encourage testing one must be able to embrace failure. So how does one embrace failure?

The key here is to run experiments that allow you to learn. If a company is constantly learning about its customers, it will inevitably move forward in the medium and long term.

IT’S ABOUT THE QUESTION

Before running a test one must have a clear hypothesis. This should follow a period of observation (or analysis). The hypothesis should then state which variable will be changed, what impact on the outcome is expected, and why.

Coming up with the right questions and hypotheses will, over time, always give you a better understanding of your market, so the outcome of the test will always be useful.

CULTURE AND BOLDNESS

The hardest part in establishing a culture of testing is balancing the friction between embracing failure and accountability.

If everyone can test anything and failing is accepted, how is the team motivated to deliver wins? (Failure is OK as part of a process that brings long-term success.)

People need to be accountable for the tests they run: each test needs to move the organisation forward in terms of learning. If that happens, the test should be seen as a success even if the hypothesis turns out to be false. On the other hand, a test that improves conversion rate but doesn’t deliver learning should not be incentivised.

While studying innovation I came across the concept of celebrating failure. During our conversation it came up that many of the largest and most successful tech companies where members of the group have worked gave prizes for the best failure of the month.

Ian stressed how failing fast is absolutely accepted in the US. Particularly in California, the ability to fail fast is appreciated. Much less so in the UK, even less in Europe.

Accepting failure to accelerate learning has one distinct effect: it encourages bold ideas. Bold tests not only give you big insights, but they also tend to have a big impact on performance, good or bad. This impact means that there is likely to be a big delta between the two variants, which in turn leads to statistically significant results quickly.
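The link between a bigger delta and a faster result can be made concrete with the standard sample-size approximation for comparing two conversion rates. This is a textbook formula rather than something we discussed on the call, and the symbols are introduced here purely for illustration: $p_1$ and $p_2$ are the conversion rates of control and variant, $\delta = p_2 - p_1$ is the delta, $\alpha$ is one minus the confidence level, $1-\beta$ is the statistical power, and $n$ is the number of visitors needed per variant.

```latex
n \approx \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\bigl[\,p_1(1-p_1) + p_2(1-p_2)\,\bigr]}{\delta^{2}},
\qquad \delta = p_2 - p_1
```

Because $\delta$ is squared in the denominator, doubling the delta cuts the required traffic to roughly a quarter, which is why bold tests tend to reach significance sooner.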

Jeff Bezos famously told his team not to worry for one minute about the failure of the Fire phone, as big bets (and big mistakes) are part of the game if a firm wants to grow fast (you can read about it here).

STATISTICS, TIME AND RIGOUR

A testing culture is also a culture of optimisation. Often teams invest a lot of time thinking about how to maximise outcome and little about how to maximise speed.

Before starting each test there should be a prioritisation process, scoring tests on predicted impact, confidence in the outcome, cost and ease of implementation, and time to learning (impact, confidence and ease make up the ICE methodology, but T (time) is often forgotten).
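As a rough illustration of how a backlog could be ranked with time included, here is a minimal Python sketch. ICE is taken in its usual reading of impact, confidence and ease, each scored from 1 to 10. The `Idea` class, the `ice_t_score` formula (average ICE divided by weeks to learn) and the example backlog are all hypothetical; any team would want to tune the weighting to its own context.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int            # predicted impact, 1 (low) to 10 (high)
    confidence: int        # confidence in the hypothesis, 1 to 10
    ease: int              # ease (and cheapness) of implementation, 1 to 10
    weeks_to_learn: float  # T: expected weeks until a readable result


def ice_t_score(idea: Idea) -> float:
    """Hypothetical ICE+T score: average of I, C and E, divided by time to learn."""
    ice = (idea.impact + idea.confidence + idea.ease) / 3
    return ice / idea.weeks_to_learn


def prioritise(ideas):
    """Return ideas sorted from highest to lowest ICE+T score."""
    return sorted(ideas, key=ice_t_score, reverse=True)


# Made-up backlog, for illustration only.
backlog = [
    Idea("Small button on low-traffic page", impact=2, confidence=5, ease=9, weeks_to_learn=12),
    Idea("New pricing page layout", impact=8, confidence=5, ease=4, weeks_to_learn=3),
    Idea("Checkout copy rewrite", impact=5, confidence=6, ease=8, weeks_to_learn=2),
]

for idea in prioritise(backlog):
    print(f"{idea.name}: {ice_t_score(idea):.2f}")
```

Dividing by time is only one possible choice; it simply makes a slow test pay for its slowness, so an easy tweak that takes three months to read drops to the bottom of the list.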

Planning for T also requires agreeing on the confidence level needed to declare the end of a test. Is the team happy with 80%, 95% or 99% confidence? Either way, tests should also be prioritised by the time they take to show results. Testing a small button on a low-traffic page will take a long time, as traffic volume will be low and so will the delta in performance.
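To get a feel for how confidence level, traffic and delta interact, the sketch below estimates how many days a test would need. It assumes a two-sided two-proportion z-test, 80% statistical power and an even 50/50 traffic split; the function names, traffic figures and conversion rates are invented for illustration and are not from any real test.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(p_base, p_variant, confidence=0.95, power=0.80):
    """Approximate visitors needed per variant (two-sided two-proportion z-test)."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    delta = p_variant - p_base
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


def days_to_result(daily_visitors, p_base, p_variant, confidence=0.95):
    """Days needed to fill both variants when traffic is split 50/50."""
    n = sample_size_per_variant(p_base, p_variant, confidence)
    return math.ceil(2 * n / daily_visitors)


for confidence in (0.80, 0.95, 0.99):
    # Bold change on a busy page vs. small tweak on a quiet page (assumed numbers).
    busy = days_to_result(20_000, 0.030, 0.036, confidence)
    quiet = days_to_result(500, 0.030, 0.031, confidence)
    print(f"{confidence:.0%} confidence: busy page {busy} days, quiet page {quiet} days")
```

Running an estimate like this before committing dev time makes the cost of a stricter confidence level visible, and shows quickly which tests a low-traffic page can realistically support.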

Once the confidence level is agreed, the team should also agree on a time limit: if after X days or weeks no variant has shown superior performance, the test is considered null and the team moves on to the next test.

CHEATING

What if one requires quick wins?

Testing is expensive, so one may need to make a business case for it, and that may require showing a win, or a few wins.

This is where the concept of exploration vs validation comes in. Exploration can be carried out online. Historical data analysis can create insight, which can then be tested directly with users through qualitative research, using very specific but open-ended questions, or even with a Wizard of Oz prototype. This will then give the team the intel to design tests for validation, rather than for exploration.

Exploration tests will likely deliver a false result, while validation tests are likely to deliver a true one. This approach may be relatively time-intensive, but it’s a good way to start when the website/product doesn’t yet have the infrastructure for testing at scale and excessive dev time is required to run a test.

MULTIVARIATE TESTING AND COMPETITIVE ADVANTAGE

What about multivariate testing? Here the matter gets complicated.

There is little doubt that multivariate testing can increase the speed of learning; the challenge is the difficulty of implementing it, particularly at scale. If our goal is learning rather than optimising, learning across three variables simultaneously will prove tricky.
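A quick way to see why three variables get tricky is to count the combinations. The sketch below enumerates a full-factorial design for three hypothetical page elements with two versions each; the element names and the monthly traffic figure are assumptions made up for the example.

```python
from itertools import product

# Hypothetical page elements, each with two versions.
factors = {
    "headline": ["current", "benefit-led"],
    "cta_colour": ["blue", "green"],
    "hero_image": ["product", "people"],
}

# Every combination of versions is one cell of the multivariate test.
cells = [dict(zip(factors, combo)) for combo in product(*factors.values())]

monthly_visitors = 40_000  # assumed traffic
visitors_per_cell = monthly_visitors // len(cells)

print(len(cells), "combinations,", visitors_per_cell, "visitors each per month")
```

Each extra two-version variable doubles the number of cells and halves the traffic each one receives, so the time needed to read any single combination grows quickly.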

Testing multiple variables can also lead to confounding, when a change in one variable impacts the other. This can partly be solved through the design of orthogonal tests (link here for more info). Things can be further accelerated using an ML technique called a multi-armed bandit (for more info link here).
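For readers who have not met a multi-armed bandit before, here is a minimal sketch of one common flavour, Thompson sampling, run against simulated visitors. It is not how any particular company does it: the conversion rates are invented, the function names are hypothetical, and a production system would need far more care around logging and stopping.

```python
import random


def thompson_pick(successes, failures, rng):
    """Sample a plausible conversion rate per arm (Beta(1, 1) prior) and pick the best."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return samples.index(max(samples))


def run_bandit(true_rates, visitors, seed=0):
    """Simulate visitors and return how many were sent to each variant."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(visitors):
        arm = thompson_pick(successes, failures, rng)
        # Simulated visitor: converts with the (unknown to the bandit) true rate.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Invented conversion rates for three variants.
pulls = run_bandit(true_rates=[0.02, 0.04, 0.12], visitors=5_000)
print("Visitors sent to each variant:", pulls)
```

Unlike a fixed 50/50 split, the bandit shifts traffic towards the variant that looks best as evidence accumulates. That favours optimising over learning, which is worth bearing in mind given the argument earlier in this piece.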

Yara had particularly deep insight into how experimentation is carried out at a couple of big tech companies famous for their testing capabilities: Booking.com and Netflix. Both firms have developed internal software to facilitate the process, democratise it, improve rigour and maximise knowledge sharing internationally.

Of course most firms can’t invest millions to build an in-house testing solution, as their traffic volume doesn’t justify it. This showcases the power of a data competitive advantage: higher traffic volume allows for faster and more learning, which delivers a higher conversion rate, driving more revenue that can be invested in ad spend to deliver even more traffic. (When a 5% uplift in conversion rate delivers £10 million, it’s easy to make a business case to invest £1 million to develop an in-house testing platform.)
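The arithmetic behind that business case can be spelled out in a few lines. The sketch assumes revenue scales linearly with conversion rate, which is a simplification, and the function name `business_case` is made up for the example.

```python
def business_case(uplift_pct, uplift_value, platform_cost):
    """Return (implied baseline revenue, net return as a multiple of the cost).

    Assumes revenue is proportional to conversion rate.
    """
    baseline_revenue = uplift_value * 100 / uplift_pct
    return_multiple = (uplift_value - platform_cost) / platform_cost
    return baseline_revenue, return_multiple


baseline, multiple = business_case(
    uplift_pct=5, uplift_value=10_000_000, platform_cost=1_000_000
)
print(f"Implied baseline revenue: GBP {baseline:,.0f}")
print(f"Net return: {multiple:.0f}x the platform cost")
```

Under that assumption, a 5% uplift worth £10 million implies a baseline of around £200 million in revenue, which is exactly the scale most firms do not have.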

The diagram above shows the data flywheel, or how the volume of data can deliver a competitive advantage (or moat) that compounds over time (not to be confused with a data network effect).

Designing a growth loop is extremely challenging; in fact I’d argue that for certain products it’s impossible. A data flywheel, on the other hand, is a competitive advantage that is accessible to many digital firms that can develop the right learning culture.

Most firms can’t run thousands of tests a year like Amazon or Booking.com; sometimes traffic only allows for 10 significant tests over 12 months. So how does one fill the gap? By thinking very carefully about what to test, based on user psychology. One great hypothesis can unlock huge value.