
Making Informed Product Decisions: A Comprehensive Guide to Analyzing Painted Door Tests with Two Variants

Identifying the Right Statistical Tests for Analyzing Behavioural Market Research Based on A/B Testing Methodology

Introduction

In the rapidly evolving business landscape, innovation transcends the mere creation of new products. True innovation is about designing solutions that not only captivate but also comprehensively meet consumer needs. This is where the art of consumer validation becomes crucial, not just for launching groundbreaking products but also for refining the ones already in the market.

Enter the realm of pretotyping. Originating from the traditional A/B testing framework, which compares two different groups' responses to varied offerings, pretotyping has revolutionized the way businesses validate market demand before full-scale product development. Among its most strategic techniques is the painted door test. With painted door tests, it is possible to measure and compare the market demand for different product characteristics and use this information to make a decision in favour of one over the other. This method effectively gauges the demand for a (new) brand, feature, or value proposition, identifying the most appropriate target audience and optimal pricing strategy.

With this method, you can collect large volumes of behavioural consumer data.

But, how do you make sense of all the data?

In this comprehensive guide, we will delve into the statistical tests that underpin pretotyping tests, helping you navigate the intricate world of behavioural-based decision-making. Before we lead you through the analysis of painted door tests step-by-step, we briefly introduce the metrics and statistical tests used for painted door tests.

Key Metrics For Painted Door Tests

When implementing painted door tests, choosing the right metrics is crucial for accurate analysis. First, you need to determine whether your primary metric will be quantitative (numerical) or qualitative (categorical). Quantitative variables are further classified as either continuous, which can assume any value within a range (e.g. time spent on page (seconds)), or discrete, limited to specific numeric values, typically integers (e.g. page views per visit). On the other hand, qualitative variables are categorized as nominal, where the categories have no inherent order (e.g. colour scheme of button), or ordinal, where the categories do have a logical sequence (e.g. user satisfaction rating (1-5)).

Understanding these distinctions is vital because the type of variable you are measuring dictates the statistical test you should employ. For instance, common Key Performance Indicators (KPIs) like click-through rate, CTA clicks, conversion rate, and email confirmations are integral for gauging purchase intent. Each of these metrics can offer a snapshot of user engagement and the effectiveness of the tested elements, guiding strategic decisions in product development and marketing.

2-Group Comparisons

When comparing two groups (for testing two variants against each other) selecting the appropriate statistical test is essential. There are three commonly used options: the Chi-Square Test, the Z-test, and the T-test.

The Chi-Square Test handles categorical data and evaluates whether there are significant associations between categories. Meanwhile, the Z-test is applicable when comparing proportions to determine if there are differences in percentage outcomes between the two groups. Lastly, the T-test is employed to evaluate the statistical differences in means between two sets of data, making it invaluable for assessing variations in continuous variables.

A practical approach

To illustrate the practical application of the statistical tests, let's delve into a fictional example within the home appliance industry, conducting a painted door test for a smart vacuum cleaner. 

We will guide you through the entire 7-step process using this example.

Overview of the proven 7-step process to make data-driven decisions

Consider a scenario where the current pricing for the smart vacuum cleaner was set at 150 euros, and the team is contemplating a price increase to 180 euros.

Step 1: Pose your research question

Our primary query revolves around understanding whether the contemplated price increase will influence market demand, so essentially, the purchase intent of consumers.

Thus, our research question may sound like this:

Is there a difference in the purchase intent for the smart vacuum cleaner at a price of 150 euros compared to a price of 180 euros?

Step 2: Define hypotheses and set proxies

To analyze a problem in a structured manner, it's important to first establish hypotheses. Typically, you'll want to create a null hypothesis (H0) and an alternative hypothesis (HA). The null hypothesis is the default assumption that there's no difference or effect, while the alternative hypothesis suggests the opposite, that there is a difference or effect present in the population. To test these hypotheses, we need to determine a Key Performance Indicator (KPI) that represents purchase intent. For our case, the number of clicks on the call-to-action (CTA) button, such as "order now," can be considered a reliable proxy to understand the level of purchase intent.

Thus, we formulate the following hypotheses:

H0: There is no significant association between the pricing change (150 euros to 180 euros) and the frequency of CTA clicks.

HA: The pricing change to 180 euros results in a significant difference in the frequency of CTA clicks.

Now in practical application, we do not only want to understand if the purchase intent is different, but if the consumers’ purchase intent decreases with the introduction of a higher price. To dig even deeper into the data, we can use the average number of CTA clicks per day on either of the landing pages. This leads us to the following hypotheses: 

H0: There is no significant difference in the average number of CTA clicks between the current price (150 euros) and the proposed increased price (180 euros).

HA: The price increase to 180 euros will result in a decrease in the average number of CTA clicks for the smart vacuum cleaner.

Although total buying intents (= CTA clicks) per landing page can be an adequate indicator of a product's success, it may also be beneficial to consider the ratio of people who showed purchase intent (clicked the CTA) to the total number of visitors to the landing page. Generally, this ratio provides a more accurate insight into the landing page's effectiveness in converting visitors into customers. And since both landing pages are alike except for the price, it also reflects the persuasiveness of each price point.

From this, we derive the following hypotheses:

H0: There is no significant difference in conversion rates on the CTA button clicks between the current price (150 euros) and the proposed increased price (180 euros).

HA: The price increase to 180 euros will result in a decrease in conversion rates on the CTA button clicks for the smart vacuum cleaner.

While these three hypotheses and their respective proxies do not yet provide you with a holistic overview of your consumers' true purchase intent, they are important. We picked these three hypotheses on purpose to lead you through different possible statistical tests. In practice, different test setups can demand different hypotheses. If you want to learn more about how to create a hypothesis for your Pretotyping tests, consider reading the following article.

So, here we are. Ready to set up the test. Almost.

Step 3: Calculate sample size

Prior to conducting the test, it's essential to define the minimum effect size you aim to detect between the two landing page variants, be it in terms of the difference in total CTA clicks or conversion rates. Establishing this parameter allows you to determine the sample size required to detect the specified effect size with statistical significance.

For this specific research, we calculated the need for a sample size of around 6,800 people in total, split across the two landing pages, to detect a minimum relative difference of 25% in their conversion rates at an estimated conversion rate of 7%. This means you would be able to detect conversion rates that differ from 7% by at least 1.75 percentage points, i.e. rates below 5.25% or above 8.75%. Feel free to reach out to understand more about calculating sample sizes.
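As a rough illustration of how such a number can be derived, here is a minimal sketch of a standard sample size approximation for comparing two proportions. The article does not state which significance level, power, or formula were used, so the two-sided α of 0.05, the power of 80%, and the simplification of using the baseline rate for both groups are assumptions, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_base, rel_diff, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant (normal approximation).

    p_base   : estimated baseline conversion rate (0.07 in the article)
    rel_diff : minimum relative difference to detect (0.25 in the article)
    alpha    : two-sided significance level (ASSUMED, not from the article)
    power    : desired statistical power (ASSUMED, not from the article)
    """
    delta = p_base * rel_diff                      # absolute difference, here 0.0175
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    # Simplification: both groups are given the baseline variance p(1 - p).
    variance = 2 * p_base * (1 - p_base)
    return ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


n = sample_size_per_group(p_base=0.07, rel_diff=0.25)
print("per variant:", n, "| total:", 2 * n)
```

Under these assumptions the sketch lands at roughly 6,700 visitors in total, in the same range as the figure of around 6,800 quoted above. Different power settings or variance formulas shift the result by a few hundred visitors.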

To acquire the sample, Horizon usually uses Meta and Google Ads, leading the target audience to the different landing page variants.

 

Step 4: Collect data

To effectively gather data for your experiment, we start by setting up two identical landing pages that only differ in their pricing strategies. This method allows us to directly compare how different price points influence visitor behaviour and conversion rates.

Example of two landing page variants for a painted door test - differing only in one variable, in this case: price

Before we tap into the data itself, it's crucial to pay attention to your technical setup and ensure that you're collecting data correctly. There's nothing worse than setting up a test and realizing later on that the data was not collected accurately.

Learn more about how to set up a painted door test correctly in this article.

This will help you avoid common mistakes and optimize your test setup, whether you're using Horizon software or a custom configuration.

In this scenario, let’s assume you collected the following data over two weeks of running an A/B test with two landing pages:

Zoomed in on the data over the 14-day period, the table looks like this:

The data appears accurate and leads us directly into the next step, where the actual magic happens.

Step 5: Perform the statistical tests

You probably learned in school or university that before running a statistical test, it is crucial to determine the level of significance.

In statistical hypothesis testing, the significance level (α) is the threshold at which you decide whether to reject the null hypothesis. A common choice is 0.05, meaning that you are willing to accept a 5% chance of making a Type I error — rejecting a true null hypothesis. This level helps set the critical value or critical region for the test.

Hypothesis 1

Let’s have a look back at our hypotheses. The first hypothesis was as follows:

H0: There is no significant association between the pricing change (150 euros to 180 euros) and the frequency of CTA clicks.

HA: The pricing change to 180 euros results in a significant difference in the frequency of CTA clicks.

We made it fairly simple in this one. We put in a signal word: frequency.

In other words, or in more generalized scientific language, our hypotheses can be rephrased as follows:

H0: The distribution of the observed frequencies is equal to the distribution of the expected frequencies under independence.

HA: The distribution of the observed frequencies is not equal to the distribution of the expected frequencies under independence.

Chi-Square Test

When it comes to frequencies or categorical data, the Chi-Square Test of independence is our go-to test. This statistical method is particularly useful for analyzing categorical data and determining whether the distribution of one categorical variable is independent of another.

By constructing a contingency table, we can compare the observed and expected frequencies associated with the pricing change (Landing Page A (150 €) or Landing Page B (180 €)) and CTA click outcomes (Yes or No). By analyzing this relationship, we can determine whether the observed distribution of CTA clicks is consistent with what we would expect if there was no correlation between the pricing change and CTA click outcomes.

To investigate it further, we will set up a 2x2 contingency table with the categories "Landing Page" (A (150€) or B (180€)) and "CTA Click" (Yes or No):

To fill in the table, you can take the number of CTA clicks per landing page from the given data. To calculate the number of “No CTA clicks”, you simply subtract the number of CTA clicks on a landing page from the total number of visitors to that landing page.

Next, we'll use the Chi-Square test formula to calculate the test statistic.

No worries about the maths in the following part. You can also just use software like SPSS, R, Stata or any other statistical program you are familiar with. If you are using the Horizon software, you are even able to see directly in the software the results of your test without the need to use further statistical software.

But if none of this is accessible to you, or you want to impress your colleagues with your knowledge, feel free to stick around for the following part. Otherwise, just skip to the ‘Results Interpretation’ of this test.

The formula for the Chi-Square statistic is:

$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Where:

$O_{ij}$ is the observed frequency in each cell of the contingency table

$E_{ij}$ is the expected frequency in each cell

Before you can fill in the values in the formula, you need to calculate the expected frequency for each cell. The formula is as follows:

$$E_{ij} = \frac{R_i \times C_j}{N}$$

Where:

$R_i$ is the total for row i

$C_j$ is the total for column j

$N$ is the grand total of all observations

This would give us the following expected values:

For Landing Page A, CTA Click: E11=207.5

For Landing Page A, No CTA Click: E12=2792.5

For Landing Page B, CTA Click: E21=207.5

For Landing Page B, No CTA Click: E22=2792.5

Let’s calculate χ2 by filling in the values from the 2x2 contingency table.

Now that we have the Chi-Square statistic, the next step is to compare it with the critical value from the Chi-Square distribution with appropriate degrees of freedom to determine the p-value and assess statistical significance. The degrees of freedom in this case would be (rows-1)(columns-1)=1. You can use a Chi-Square distribution table or a statistical software package to find the critical value and determine the p-value. If the p-value is below your chosen significance level (commonly 0.05), you would reject the null hypothesis.

Using a Chi-Square distribution table (you can find our very own one here) or statistical software, you find that the critical value at a 0.05 significance level with 1 degree of freedom is approximately 3.84.

Now, we compare the calculated Chi-Square statistic with the critical value. We observe that the calculated Chi-Square statistic (18.7) is greater than the critical value (3.84), which leads us to reject the null hypothesis. This means there is a significant association between the landing page and CTA clicks at the 0.05 significance level.
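For readers who prefer code to hand calculation, the steps above can be sketched in a few lines of Python. The observed counts are not copied from the article's data table; they are inferred from figures quoted elsewhere in this article (expected values of 207.5 and 2792.5 per page, and conversion rates of 8.33% and 5.50%), which imply 3,000 visitors per landing page with 250 and 165 CTA clicks. Treat them, and the function name, as assumptions.

```python
# Observed counts INFERRED from the article's expected values and conversion
# rates (3,000 visitors per page); they are an assumption, not the raw data.
observed = {
    "A": {"click": 250, "no_click": 2750},  # Landing Page A (150 EUR)
    "B": {"click": 165, "no_click": 2835},  # Landing Page B (180 EUR)
}


def chi_square_independence(table):
    """Return (chi2, expected) for a contingency table given as nested dicts."""
    rows = list(table)
    cols = list(table[rows[0]])
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    grand_total = sum(row_totals.values())

    chi2 = 0.0
    expected = {}
    for r in rows:
        for c in cols:
            # E_ij = R_i * C_j / N
            e = row_totals[r] * col_totals[c] / grand_total
            expected[(r, c)] = e
            # Add (O_ij - E_ij)^2 / E_ij for this cell
            chi2 += (table[r][c] - e) ** 2 / e
    return chi2, expected


chi2, expected = chi_square_independence(observed)
CRITICAL_VALUE = 3.84  # alpha = 0.05, 1 degree of freedom (from the article)
print("chi2:", round(chi2, 2), "| reject H0:", chi2 > CRITICAL_VALUE)
```

Note that this sketch applies no continuity correction. Some statistical packages apply Yates' correction to 2x2 tables by default, which yields a slightly smaller statistic than the 18.7 reported here.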

Results Interpretation

In conclusion, based on the Chi-Square test results, there is evidence to suggest that the pricing change from 150 € to 180 € has a significant impact on the frequency of CTA clicks. However, the Chi-Square test itself doesn't determine the direction of the association, meaning whether it is an increase or decrease. For this, you would need to have a deeper look at the observations again or simply run a further analysis.

The chi-square test for independence assumes certain conditions, including that the observations are independent and that the expected frequency in each cell of the contingency table is not too small (a common rule of thumb is at least 5 per cell). If your data violates these assumptions, you may need to consider alternative tests or adjustments.

Hypothesis 2

Now, let's shift our focus to the second hypothesis. Up until now, we only know that there is a significant difference between the number of CTA clicks for the two prices; we do not know in which direction it is. Thus, we wanted to investigate the following hypothesis: 

H0: There is no significant difference in the average number of CTA clicks between the current price (150 euros) and the proposed increased price (180 euros).

HA: The price increase to 180 euros will result in a decrease in the average number of CTA clicks for the smart vacuum cleaner.

Since we wanted to look more in detail into the daily averages, we now do not deal with total frequencies anymore but with means.

Let’s generalize the hypothesis again:

H0: The population mean of one group equals the population mean of the other group.

HA: The population mean of one group is greater than the population mean of the other group.

When it comes to statistical methods comparing the means of two independent groups, the t-test is a suitable choice.

T-Test

A t-test is commonly used to compare the means of two groups and assess whether there is a statistically significant difference between them. It can be bi-directional, also called two-tailed (tests whether the means are significantly different in either direction), or unidirectional, also called one-tailed (tests whether the means are significantly different in only one direction). It can also be paired, meaning it compares the means of two related groups, or unpaired, meaning it compares the means of two independent groups.

In our test, we have two independent groups, since each person was randomly assigned to either Landing Page A (150€) or Landing Page B (180€), never to both. Also, our hypothesis is unidirectional, looking at whether the mean number of purchase intents is significantly lower for the 180€ variant.

Let’s dive right back into the maths. As described, we suggest using software like SPSS or R to calculate the statistics, or simply check out the Horizon software dashboard to get real-time insights.

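The formula for the t-test is as follows:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

where

$\bar{x}_1$ and $\bar{x}_2$ are the sample means of the two groups (here, the average daily CTA clicks per landing page)

$s_1^2$ and $s_2^2$ are the sample variances of the two groups

$n_1$ and $n_2$ are the number of observations in each sample (here, 14 days per landing page)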

Let’s fill in our numbers. First, we need to calculate the mean number of daily CTA clicks over the 14-day period for each landing page. I am confident everybody reading this text has already calculated the average for their grades in high school. So this should not be a big problem. Looking at the results table provided, we can sum up all the CTA clicks per day per landing page and divide the sum by the number of days. Mathematically, it is written as follows:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Here, $x_i$ is the number of CTA clicks on day i and $n$ is the number of days (14).

Filling in the observed numbers, we calculate the following for the daily average CTA clicks per landing page variant:

The next step is to calculate the sample variances with the following formula:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Filling in the observed numbers, we calculate the following variances:

Now, we have all the values to fill in our t-test formula:

To determine if this t-statistic is statistically significant, we compare it against a critical value from the t-distribution, considering our chosen significance level of 5% and our degrees of freedom.

The corresponding p-value for our t-statistic was calculated to be 3.31 × 10⁻⁹. (Typically, this calculation is performed using statistical software, as it involves complex integration that is not practically done by hand.)

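The degrees of freedom (df) for the t-test when comparing two samples with potentially unequal variances (as per the Welch-Satterthwaite equation) are calculated as:

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2 / n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2 / n_2\right)^2}{n_2 - 1}}$$

where

$s_1^2$ and $s_2^2$ are the sample variances of the two independent groups

$n_1$ and $n_2$ are the number of observations in each sample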

By filling in the values, we receive:

Using a Student’s t table (you can find our very own one here) or statistical software, we can determine the significance of our findings. Since our t-statistic is positive and large, and the p-value is significantly small, we have strong evidence against the null hypothesis. This implies that there is a statistically significant difference in the average number of purchases or buying intents between the two price points.
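The calculation chain described above (means, sample variances, t-statistic, Welch-Satterthwaite degrees of freedom) can be sketched with Python's standard library. The daily click counts below are made-up placeholder values for illustration only; they are not the article's 14-day data, so the output will not match the t-statistic reported in this article. The function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t_test(sample_1, sample_2):
    """Return (t, df) for two independent samples with unequal variances."""
    n1, n2 = len(sample_1), len(sample_2)
    # statistics.variance uses the n - 1 denominator (sample variance)
    v1, v2 = variance(sample_1) / n1, variance(sample_2) / n2
    t = (mean(sample_1) - mean(sample_2)) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# HYPOTHETICAL daily CTA clicks (placeholders, NOT the article's data)
example_clicks_a = [18, 17, 19, 16, 20]  # Landing Page A (150 EUR)
example_clicks_b = [12, 11, 13, 12, 10]  # Landing Page B (180 EUR)

t_stat, dof = welch_t_test(example_clicks_a, example_clicks_b)
print("t:", round(t_stat, 3), "| df:", round(dof, 2))
```

A positive t-statistic here means the first sample has the higher mean. For the one-tailed test, the statistic is then compared against the critical value for the computed degrees of freedom, or converted to a p-value with statistical software.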

Results Interpretation

Given that our alternative hypothesis (HA) specifies a decrease in purchases with the price increase, and our test results are statistically significant, we would conclude in favour of the alternative hypothesis. This means that the data supports the claim that increasing the price to 180 euros results in a decrease in the average number of purchases or buying intents for the smart vacuum cleaner.

Box plot showing that confidence intervals do not overlap between the two variants - evidence for a significant difference

We are not only able to measure the difference between the two groups but also the size and direction of this difference. The t-statistic is a measure of the difference between the two group means relative to the variability of their scores. A high t-statistic value (far from zero) indicates a large difference between the groups. In our case, the value of 8.929 is relatively large, indicating a significant difference between the means of the two landing pages. The sign of the t-statistic indicates the direction of this difference. In this case, the positive sign suggests that the mean of Landing Page A is higher than that of Landing Page B.

The t-test can be used for A/B testing in most cases, but certain assumptions must be met. These assumptions include random samples, independent observations, and each group's population following a normal distribution. The classic Student's t-test additionally assumes equal population variances, whereas the Welch-Satterthwaite degrees of freedom used here relax that requirement. If your data violates these assumptions, you may need to consider alternative tests or adjustments.

Hypothesis 3

Until now, we only looked at the absolute number of CTA clicks. In a lot of scenarios, for instance, with highly different numbers of unique page visitors, you might want to compare the conversion rates of the two variants with each other. For this reason, we derived our third hypothesis as follows:

H0: There is no significant difference in conversion rates between the current price (150 euros) and the proposed increased price (180 euros).

HA: The price increase to 180 euros will result in a decrease in conversion rates for the smart vacuum cleaner.

To scrutinize this hypothesis, we turn to the two-sample z-test, a statistical method ideal for comparing proportions from two different samples. Before we get into the nitty-gritty, let's understand the rationale behind this test.

Z-Test

The z-test is a reliable method for analyzing binary outcomes between two groups. It is tailored for proportion comparisons and appropriate when the sample sizes are large enough for the normal approximation to hold. The Chi-Square test also works on categorical outcomes, but it only tells us whether an association exists, not in which direction, and the t-test is designed for means of continuous data. The z-test, by contrast, lets us test a directional hypothesis about two proportions.

Let’s jump into the math. As suggested, feel free to use software like SPSS or R to calculate the statistics, or simply check out the Horizon software dashboard to get real-time insights.

The formula for the z-test is as follows:

$$z = \frac{p_A - p_B}{\sqrt{p(1 - p)\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}$$

Where

$p_A$ and $p_B$ are the sample proportions of the two groups (= conversion rates)

$n_A$ and $n_B$ are the sample sizes of the two groups

$p$ is the pooled sample proportion

Before we can substitute all values into the formula, we calculate the pooled sample proportion:

$$p = \frac{p_A n_A + p_B n_B}{n_A + n_B}$$

Let's substitute now the values in the z-test formula:

The calculated z-score for the comparison of conversion rates between Landing Page A (150€) and Landing Page B (180€) is approximately 4.32. In hypothesis testing, we compare this z-score with critical values from the standard normal distribution. Given our uni-directional hypothesis that the conversion rate of Landing Page A is greater than that of Landing Page B, we are interested in the z-score corresponding to a one-tailed test at a chosen significance level (commonly 0.05).

Comparing it to the critical value in the table (approximately 1.645 for a one-tailed test at the 0.05 level), we find that our z-score is considerably larger. This indicates that the observed difference in conversion rates between the two landing pages is statistically significant at the 0.05 significance level.
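The same comparison can be sketched in Python using only the standard library. As before, the click and visitor counts are inferred from figures quoted in this article (conversion rates of 8.33% and 5.50% and expected values implying 3,000 visitors per page); they are an assumption rather than the raw data table, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """One-tailed two-sample z-test for H_A: rate of A > rate of B."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    # Pooled proportion: all clicks over all visitors
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    std_error = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_error
    # Upper-tail probability of the standard normal distribution
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Counts INFERRED from the article's reported rates; an assumption.
z, p_value = two_proportion_z_test(clicks_a=250, n_a=3000, clicks_b=165, n_b=3000)
CRITICAL_Z = 1.645  # one-tailed, alpha = 0.05
print("z:", round(z, 2), "| reject H0:", z > CRITICAL_Z)
```

For a 2x2 table, the square of this z-score equals the uncorrected Chi-Square statistic (4.32² ≈ 18.7), so the two tests agree on the two-sided question. The z-test adds the ability to test the direction.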

Results interpretation

Based on our statistical analysis, we have found that there is sufficient evidence to reject the null hypothesis. This means that the difference in price from 150€ to 180€ does seem to have a significant impact on the conversion rates for the smart vacuum cleaner. The higher conversion rate on Landing Page A suggests that customers are more likely to engage with the CTA at the lower price point.

The z-test is a powerful tool for analyzing binary outcomes between two groups. However, it is essential to acknowledge certain assumptions when using this test. These assumptions include independent observations and sample sizes large enough for the normal approximation to hold. If these assumptions are not met, alternative tests or adjustments may be necessary to ensure the validity of the results.

Step 6: Interpret the results

We aimed to find out if the proposed price increase would affect market demand. Based on our statistical analyses, we found that the higher price point of 180€ led to a significant decrease in consumer interest compared to the original price of 150€. This was reflected in the frequency of clicks and the average number of purchases as well as the on-page conversion rates. 

The higher rates for the lower price point indicate a stronger consumer response to the more favourable pricing, reinforcing the conclusion that overall market demand is sensitive to price increases in this context.

Thus, we might want to conclude that our smart vacuum cleaner follows the law of demand, which states that as the price of a given commodity increases, the quantity demanded decreases as long as all else is equal.

Illustration of a demand curve following the law of demand

This seems logical. However, when testing with Horizon, we also observe price elasticity that goes against intuition, with demand increasing as the price increases. Bosch, for example, detected this for one of its smart gardening products and was able to increase the price from 179€ to 199€ with the help of Horizon. (Download the case study here)

One example of this is the Snob effect for Veblen goods. The term refers to the tendency for individuals to desire unique or exclusive products or experiences simply because they are rare or difficult to obtain. This bias can influence consumer behavior by causing individuals to place higher value on items simply because they are perceived as exclusive, rather than objectively assessing their utility or quality. As a result, the Snob effect can lead to irrational decision-making.

Illustration of a demand curve for normal goods vs Veblen goods

This leads us to the last aspect of the test. How is the decision made?

Step 7: Decision-making

After carefully conducting the statistical analysis, we have a foundation for making the decision. It would be fairly simple to say that the price of the smart vacuum cleaner should not be increased due to decreasing demand. But this decision would be made without considering the overall business case.

For simplicity, let’s assume the observed conversion rates on the CTA per price point are the conversion rates of a purchase. We observed average conversion rates on the CTA of 8.33% (Landing Page A (150€)) and 5.50% (Landing Page B (180€)).

We further assume that 10,000 people end up on the landing page per price point and buy the smart vacuum cleaner at the conversion rates observed.

This would lead to the following potential revenues:

Landing Page A (150€): Potential Revenue = 10,000 x 8.33% x 150€ = 124,950€

Landing Page B (180€): Potential Revenue = 10,000 x 5.50% x 180€ = 99,000€

In this extremely simplified business case, assuming the costs of the good remain the same regardless of the price point, the potential revenue for the lower-priced variant exceeds that of the more expensive one. This is a further indicator in favour of keeping the price at 150€.
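The revenue comparison above is simple enough to script, which also makes it easy to ask a follow-up question: what conversion rate would the 180€ variant need in order to match the revenue of the 150€ variant? The sketch below uses the figures from this example. The function name and the break-even calculation are illustrative additions, not part of the original analysis.

```python
def potential_revenue(visitors, conversion_rate, price_eur):
    """Revenue if visitors convert to buyers at the given rate and price."""
    return visitors * conversion_rate * price_eur


VISITORS = 10_000  # assumed traffic per price point, as in the example above

revenue_a = potential_revenue(VISITORS, 0.0833, 150)  # Landing Page A
revenue_b = potential_revenue(VISITORS, 0.0550, 180)  # Landing Page B

# Conversion rate at 180 EUR that would yield the same revenue as variant A
break_even_rate_b = revenue_a / (VISITORS * 180)

print("A:", round(revenue_a), "| B:", round(revenue_b))
print("break-even rate for B:", round(break_even_rate_b * 100, 2), "%")
```

In this simplified setting, the higher price would only pay off in revenue terms if its conversion rate stayed above roughly 6.9%, well above the 5.50% observed.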

In reality, the decision-making process is more complex than demonstrated. That’s why we work on a guide on how to make decisions based on painted door tests. Stay updated by subscribing to our newsletter to ensure you don't miss its release.

Conclusion

This comprehensive guide delved into the intricate application of statistical tests within the context of painted door tests, essential for making informed product decisions. By dissecting a case study of a smart vacuum cleaner, we explored various statistical methods, including the Chi-Square test, T-test, and Z-test, each tailored to measure different aspects of consumer behaviour and purchase intent. 

This analysis underscores the importance of choosing appropriate statistical tests based on the type of data and the specific hypotheses under investigation. It also highlights the complexity of interpreting these tests, where understanding the nature of the data and the market dynamics plays a crucial role. While our results supported the decision not to increase the price based on reduced demand, the exploration of price elasticity in similar products suggested that higher prices might paradoxically boost demand in certain contexts.

Employing the right statistical tools and a thorough understanding of market psychology are paramount in navigating the challenges of product innovation.

At Horizon, we understand the significance of making informed decisions, and that's why we helped Blue Chip’s product and innovation teams in real-market environments to glean valuable insights into consumer behaviour, uncovering consumers’ real purchase intent. If you want to learn more about behavioural-based market research, sign up for our newsletter.

Written by
Florian Haberler
Florian is a Research Manager at Horizon with a rich background in consulting, marketing, and advertising. In Horizon's service department, he spearheads strategic market analysis, leveraging clients' use of Horizon to predict the market success of their product decisions. With his expertise, Florian ensures that consumer insights are meticulously analysed, enabling clients to confidently make the right product decisions pre-market.