A/B Testing Emails: What to Test and How to Read Results

Subject lines, send times, CTAs, and layout. Statistical significance and the minimum sample sizes you need.

Why Most Email A/B Tests Are Reading Tea Leaves

Every email platform makes A/B testing look effortless: split your list, send two versions, crown a winner. The mechanics are genuinely easy. What the platforms don’t tell you is that most of the tests run through them are statistically meaningless — the lists are too small, the differences too tiny, and the metrics too noisy for the “winner” to be anything more than a coin flip dressed up in a dashboard.

This matters because bad tests are worse than no tests. A team that flips a coin knows it’s guessing. A team that runs an underpowered test on three thousand subscribers and sees Version B “win” by four percent believes it has learned something — and then builds future campaigns on noise. Multiply that across a year and you get an email program steered by superstition with a veneer of data.

Rigorous email testing doesn’t require a statistics degree. It requires three things: testing variables big enough to produce detectable differences, measuring them with metrics that still mean something in a post-privacy world, and being honest about what your list size can and cannot tell you. This guide covers all three — what to test, in what order, and how to read the results without fooling yourself.

Subject Lines: The Highest-Leverage Test, with a Catch

Subject lines are where almost everyone starts testing, for a sensible reason: the subject line is the single biggest factor in whether a message gets opened at all, and you can vary it without touching anything else in the email. The catch — which we’ll return to in detail — is that the metric used to judge subject lines, the open rate, has been badly degraded by privacy changes. Test subject lines, but know that the scoreboard is blurry.

The subject-line variables worth testing are the ones that change the reader’s reason to open, not the decoration. Specificity versus curiosity is the classic matchup: a subject that states exactly what’s inside against one that withholds it to provoke a click. Both can win — curiosity tends to spike opens but can depress clicks when the email doesn’t pay off the tease, which is exactly why you should judge subject-line tests on clicks too, not opens alone. Length is worth one good test, framed as short-and-punchy versus complete-thought rather than counting characters. Personalization beyond the first name — the subscriber’s city, past purchase, or stated interest — is usually a stronger lever than the first name itself, which most inboxes have learned to ignore. Urgency framing is testable, but only when the deadline is real; manufactured urgency trains your list to ignore it.

What’s mostly not worth your limited testing budget: emoji versus no emoji, sentence case versus title case, and other cosmetic tweaks. These rarely produce differences large enough to detect at typical list sizes, so the test ends in noise no matter which version is “better.” One structural tip instead: make your two subject lines genuinely different angles on the same content — a benefit frame against a problem frame, a number against a story. Two phrasings of the same idea will almost never separate; two different ideas can.

Send Times and Frequency: Test Less Than You Think

Send-time testing has a devoted following and a weak evidence base. The industry folklore — typically that mid-morning, midweek performs best — comes from aggregate studies averaging millions of sends across every industry, and the dirty secret of those studies is that the differences between slots are usually small and the “best” slot varies by audience. A B2B list of office workers and a list of restaurant-goers live on different clocks. Folklore is a starting point, not a finding about your list.

If you test send times, respect two traps. First, a send-time test is really a test of two different audiences-in-the-moment: the people who check email at 8 a.m. and the people who check at 8 p.m. overlap but aren’t identical, so results wobble more than subject-line tests do. Run the same matchup across several campaigns before believing it. Second, if your platform’s send-time optimization is already staggering delivery per subscriber, a blunt morning-versus-evening split tells you nothing.

Frequency is the more valuable and less tested variable. Whether your list performs better with one email a week or three is worth far more than whether Tuesday beats Thursday, because frequency compounds: it affects revenue per subscriber, unsubscribe rate, complaint rate, and long-term list fatigue all at once. Frequency tests need a different design — split the list into stable cohorts, hold the frequency difference for a month or more, and judge on revenue or conversions per subscriber alongside unsubscribes and complaints, not on any single campaign’s performance. It’s slower and less satisfying than a subject-line test, and it will teach you more than a year of send-time experiments.
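One way to get stable cohorts is to derive each subscriber's group from a hash of their ID, so the same person lands in the same group on every send without a lookup table. What follows is a minimal sketch, not a platform feature: the function name `assign_cohort`, the test name, the cohort labels, and the `user-N` IDs are all hypothetical.

```python
import hashlib

def assign_cohort(subscriber_id: str, test_name: str, cohorts: tuple) -> str:
    """Deterministically map a subscriber to one cohort for the life of a test."""
    # Hash the test name together with the ID so that different tests
    # shuffle the list differently, while any one test stays stable.
    key = f"{test_name}:{subscriber_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return cohorts[int(digest, 16) % len(cohorts)]

COHORTS = ("one_per_week", "three_per_week")          # hypothetical labels
subscribers = [f"user-{i}" for i in range(10_000)]    # hypothetical IDs

assignment = {s: assign_cohort(s, "frequency-test", COHORTS) for s in subscribers}
counts = {c: sum(1 for v in assignment.values() if v == c) for c in COHORTS}
print(counts)  # the two cohorts should come out close to an even split
```

Because the assignment depends only on the ID and the test name, a subscriber who joins mid-test gets a cohort immediately, and rerunning the script never moves anyone between groups.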

CTAs and Layout: Where Clicks Are Actually Won

Once the email is opened, the call to action and the layout decide whether anyone clicks — and these tests have a quiet advantage over subject-line tests: they’re judged on clicks, a metric that privacy changes haven’t corrupted. If your list is on the smaller side, this is an argument for spending more of your testing energy below the subject line, not less.

For CTAs, the highest-leverage variable is the copy itself: generic labels like “Learn more” against specific, outcome-shaped labels like “Get my quote” or “See the fall menu.” Specific CTAs tell the reader what happens on the other side of the click, and that clarity usually beats any visual tweak. Next is the number of links: a single CTA repeated against a menu of choices. Fewer choices winning is the right default hypothesis, but newsletters and digests are a legitimate exception, so test it on your own format. Button versus text link, color, and size are real but smaller levers; test them only after copy and count, and remember that contrast against the surrounding design matters more than the specific color.

For layout, the matchups worth running are structural. Long email versus short email that pushes the detail to the landing page — in most commercial sends the email’s job is the click, not the full pitch. Image-led versus text-led design — heavily designed emails look polished, while plain-text-style emails can read as personal correspondence and often earn surprising click and reply rates, especially in B2B. CTA position — first screen versus after the case has been made — interacts with length, so don’t test both at once.

That last clause is the rule that makes all of this work: one variable per test. Change the CTA copy and the layout together and a winner tells you nothing about why it won, so you can’t reuse the lesson. The entire point of testing is transferable knowledge, and transferable knowledge requires isolating the variable.

The Open-Rate Problem: Testing in the Mail Privacy Protection Era

Here is the caveat that should be stapled to every subject-line test run since 2021: a large share of the opens in your analytics never happened. Apple’s Mail Privacy Protection preloads email content on Apple’s servers for Apple Mail users whether or not the person ever looks at the message, and each preload registers as an open. Corporate security scanners that pre-fetch links and images add more phantom activity on top. Apple Mail’s client share is large enough that this isn’t a rounding error — it’s a structural distortion of the metric.

For A/B testing, the damage is specific. Phantom opens fire for both variants regardless of subject line; a machine doesn’t read your copy before preloading it. That pads both arms with the same noise, diluting the real difference and making open-rate gaps look smaller and less significant than they truly are. A subject-line test judged on opens is therefore biased toward false ties: real winners get buried under machine-generated sameness. And any absolute open-rate comparison, whether against industry benchmarks or your own pre-2021 history, sets an inflated number against one inflated by a different amount or not at all.
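The dilution is easy to see with a toy model. The sketch below assumes a fixed share of recipients always registers an open (the machine opens) and that everyone else opens at the subject line's true rate. The forty percent machine share and the twenty versus twenty-four percent open rates are invented for illustration, not measurements.

```python
def observed_open_rate(true_rate: float, machine_share: float) -> float:
    """Open rate a dashboard would report when some recipients auto-open."""
    # Machine-opened recipients always count as an open;
    # the remaining recipients open at the true (human) rate.
    return machine_share + (1.0 - machine_share) * true_rate

machine_share = 0.40          # assumed share of recipients with machine opens
true_a, true_b = 0.20, 0.24   # assumed human open rates for subject lines A and B

obs_a = observed_open_rate(true_a, machine_share)
obs_b = observed_open_rate(true_b, machine_share)

print(f"true gap:       {true_b - true_a:.3f}")
print(f"observed gap:   {obs_b - obs_a:.3f}")
print(f"observed rates: A={obs_a:.3f}, B={obs_b:.3f}")
```

Under this model a four-point gap shows up as 2.4 points, and both observed rates sit near fifty percent, which is where sampling noise in a rate is at its largest. The real gap is both shrunk and harder to see.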

What to do about it, in order of preference. First, judge subject-line tests on clicks wherever volume allows — a subject line’s real job is to get the right people into the email, and clicks measure that end-to-end. Second, if you must use opens, check whether your platform can segment out machine opens; several now flag Apple-proxy opens explicitly. Third, treat open rate as directional, not precise: a collapse still signals a deliverability or relevance problem, and a large, repeated gap between variants still probably means something. A two-point gap on a single send means nothing.

The broader principle: the further down the funnel your metric, the more trustworthy it is and the fewer events you’ll have. Opens are plentiful and polluted; clicks are cleaner but scarcer; conversions and revenue are the truth but arrive in tiny numbers. Picking the right metric for a test is a trade between cleanliness and sample size — which is exactly the next problem.

Sample Size Honesty: What Your List Can and Cannot Detect

The most common failure in email testing isn’t bad creative — it’s asking a small list to detect a small difference. The statistics here reduce to one blunt relationship: the smaller the difference you’re trying to detect, and the rarer the event you’re measuring, the more subscribers you need in each arm of the test. The numbers get big faster than intuition suggests.

Some illustrative math, using standard power calculations (ninety-five percent confidence, eighty percent power) rather than anyone’s benchmark data. To reliably detect a two-percentage-point difference in open rate, say distinguishing twenty percent from twenty-two percent, you need roughly sixty-five hundred recipients in each variant. To detect a half-point difference in click rate on a base of two or three percent, you need roughly fourteen to twenty thousand per arm. To detect anything at the conversion or revenue level on a typical campaign, you generally need either an enormous list or weeks of accumulated sends. These aren’t platform limitations; they’re the mathematics of separating signal from random variation, and no tool can waive them.
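For readers who want to run the numbers on their own rates, here is a minimal sketch of the textbook sample-size formula for comparing two proportions. It assumes a two-sided test at ninety-five percent confidence and eighty percent power; those defaults and the function name are choices made for this example, and a dedicated calculator may use a slightly different variant of the formula.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p1: float, p2: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Recipients needed in each variant to tell rate p1 from rate p2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

print(sample_size_per_arm(0.20, 0.22))    # open rate, 20% vs 22%
print(sample_size_per_arm(0.02, 0.025))   # click rate, 2.0% vs 2.5%
print(sample_size_per_arm(0.03, 0.035))   # click rate, 3.0% vs 3.5%
```

Under these assumptions the three cases come out at roughly 6,500, 13,800, and 19,700 recipients per arm. Raising the power to ninety percent pushes each figure up by about a third.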

The practical translation by list size. Under roughly ten thousand subscribers, single-campaign tests can only detect dramatic differences — radically different offers, formats, or angles, not phrasing tweaks. Test big swings, and accumulate evidence by repeating the same matchup across campaigns and looking for a consistent direction. From ten to fifty thousand, subject-line and CTA tests on click-based metrics become workable for moderate differences, but per-campaign conversion testing is still mostly out of reach. Above that, you can test most variables properly — and your discipline problem shifts from sample size to testing too many things at once.
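The same arithmetic can be turned around to ask what a given list can detect. This sketch assumes a fifty-fifty split, a 2.5 percent baseline click rate, ninety-five percent confidence, and eighty percent power. The baseline and the list sizes are illustrative, and the formula is the usual approximation that treats both arms as having the baseline's variance.

```python
from math import sqrt
from statistics import NormalDist

def minimum_detectable_difference(base_rate: float, n_per_arm: int,
                                  alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate smallest absolute gap a two-arm test can reliably detect."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    # Approximation: both arms are given the baseline rate's variance.
    return z * sqrt(2 * base_rate * (1 - base_rate) / n_per_arm)

base_rate = 0.025  # assumed baseline click rate
for list_size in (3_000, 10_000, 50_000, 200_000):
    mdd = minimum_detectable_difference(base_rate, list_size // 2)
    print(f"{list_size:>7} subscribers: about {mdd * 100:.2f} points "
          f"({mdd / base_rate:.0%} relative lift)")
```

On these assumptions a ten-thousand-subscriber list split in half can only reliably detect a gap of about 0.9 points on a 2.5 percent click rate, a relative lift of roughly a third. That is the territory of different offers and angles, not phrasing tweaks.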

Two honest corollaries. First, the popular winner-takes-remainder feature — test on twenty percent of the list, auto-send the winner to the rest — inherits all of this: if your twenty percent is too small to detect the difference, the “winner” was chosen by noise, and usually by opens, the most polluted metric available. On a small list, it’s often better to split the whole list fifty-fifty and bank the learning. Second, a small test’s result isn’t invalidated by these limits — it’s just unproven. Treat it as a hypothesis to retest, not a finding to build on.
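To put a rough number on how often a small test group picks the right winner, the sketch below uses a normal approximation to estimate the chance that the truly better variant shows the higher observed rate. The five-thousand-subscriber list and the 2.0 versus 2.5 percent click rates are hypothetical, and the approximation ignores exact ties, which do occur when click counts are this small.

```python
from math import sqrt
from statistics import NormalDist

def chance_better_variant_wins(p_worse: float, p_better: float, n_per_arm: int) -> float:
    """Normal-approximation probability that the better arm shows the higher rate."""
    se = sqrt(p_worse * (1 - p_worse) / n_per_arm
              + p_better * (1 - p_better) / n_per_arm)
    return NormalDist().cdf((p_better - p_worse) / se)

list_size = 5_000     # hypothetical list
test_share = 0.20     # 20% test group, split evenly between two variants
n_test_arm = round(list_size * test_share / 2)

small = chance_better_variant_wins(0.020, 0.025, n_test_arm)
full = chance_better_variant_wins(0.020, 0.025, list_size // 2)
print(f"{n_test_arm} per arm (20% test group): {small:.2f}")
print(f"{list_size // 2} per arm (full 50/50 split): {full:.2f}")
```

With these inputs the twenty percent test group crowns the genuinely better variant only about seventy percent of the time. Splitting the whole list raises that to roughly eighty-eight percent, still short of proof but a far better basis for a retest.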

Reading Results: Significance, Peeking, and Other Self-Deceptions

Statistical significance answers one narrow question: if the two variants were actually identical, how often would random chance alone produce a gap this large? The conventional threshold, ninety-five percent confidence, means a gap big enough that chance would produce it less than one time in twenty. Most email platforms compute this for you, or you can use any free significance calculator; the arithmetic is not the hard part. The hard part is the discipline around it.
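For the curious, the arithmetic behind most of those calculators is a two-proportion z-test. A minimal sketch follows; the counts are invented, and a given platform may use a different test (chi-square, Bayesian, or something proprietary), so treat this as one standard way to do it rather than a description of any particular tool.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(successes_a: int, n_a: int,
                           successes_b: int, n_b: int) -> float:
    """Two-sided p-value for the gap between two rates (pooled z-test)."""
    rate_a, rate_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical send: 1,500 recipients per arm, 45 clicks for A and 57 for B.
p = two_proportion_p_value(45, 1500, 57, 1500)
print(f"p-value: {p:.3f}")
print("significant at 95%" if p < 0.05 else "not significant at 95%")
```

In this example Version B shows 3.8 percent against 3.0 percent, a relative lift of more than a quarter, and the p-value still comes out around 0.23. On a three-thousand-subscriber list, a gap that size is well within what chance produces on its own.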

The first self-deception is peeking. Calling a winner a few hours into a send goes wrong for two stacked reasons: early responders aren’t representative of your list, and repeatedly checking for significance until it appears virtually guarantees you’ll eventually find it by chance. Decide the evaluation window before sending — for most lists, somewhere between twenty-four and seventy-two hours captures the meaningful response — and read the result once.
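A small simulation makes the cost of peeking concrete. The sketch below runs A/A tests, in which both arms share the same true click rate and every declared winner is therefore false, and compares checking for significance at ten interim looks against reading the result once at the end. The five percent rate, the arm size, the number of looks, and the seed are all arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def is_significant(clicks_a: int, clicks_b: int, n: int, z_crit: float) -> bool:
    """Pooled two-proportion z-test for two arms of equal size n."""
    pooled = (clicks_a + clicks_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return abs(clicks_a - clicks_b) / n / se > z_crit

def simulate(n_tests=400, n_per_arm=2000, rate=0.05, looks=10, seed=1):
    """Return (false-winner rate with peeking, false-winner rate with one look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)   # 95% confidence, two-sided
    step = n_per_arm // looks
    peeking_hits = final_hits = 0
    for _ in range(n_tests):
        a = b = 0
        flagged_at_any_look = False
        for look in range(1, looks + 1):
            # Both arms draw from the same true rate: there is no real winner.
            a += sum(rng.random() < rate for _ in range(step))
            b += sum(rng.random() < rate for _ in range(step))
            if is_significant(a, b, look * step, z_crit):
                flagged_at_any_look = True   # a peeker would have stopped here
        peeking_hits += flagged_at_any_look
        final_hits += is_significant(a, b, n_per_arm, z_crit)
    return peeking_hits / n_tests, final_hits / n_tests

peeking_rate, single_look_rate = simulate()
print(f"false winners when peeking at every look:   {peeking_rate:.1%}")
print(f"false winners when reading once at the end: {single_look_rate:.1%}")
```

The single-look rate should hover near the nominal five percent, while the peeking rate typically lands several times higher. The exact figures depend on the seed and the number of looks.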

The second is multiplicity. If you judge one test on opens, clicks, conversions, unsubscribes, and revenue simultaneously, the odds that something crosses the significance line by chance climb with every metric you add: with five independent metrics at the ninety-five percent threshold, there is a better than one-in-five chance that at least one of them “wins” on noise alone. Declare a primary metric before the test; everything else is context. Relatedly, slicing results after the fact (“it lost overall but won with mobile users in Ontario”) is how noise gets laundered into insight. Subgroup findings are hypotheses for a future test, never conclusions.
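The growth in false alarms is simple to compute. This sketch assumes the metrics are statistically independent, which real email metrics are not (clicks and conversions move together), so read the output as an upper-end illustration rather than an exact figure for any campaign.

```python
def chance_of_false_alarm(n_metrics: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent metrics looks significant by luck."""
    # Each metric stays quiet with probability (1 - alpha); all must stay quiet.
    return 1 - (1 - alpha) ** n_metrics

for k in (1, 2, 3, 5, 10):
    print(f"{k:>2} metrics: {chance_of_false_alarm(k):.1%}")
```

One metric carries the advertised five percent risk; five carry about twenty-three percent. Correlation between metrics pulls the true number down somewhat, but not back to five.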

The third is ignoring effect size. With a big enough list, a trivial difference can be statistically significant — real, but too small to matter. Significance tells you the difference probably exists; it doesn’t tell you it’s worth anything. A result worth acting on clears both bars: unlikely to be chance, and large enough to change a decision.

Finally, expect ties and report them honestly. Most well-run email tests end without a significant winner — that’s the normal texture of a real testing program. A tie on a big swing is itself information: that variable doesn’t matter much for your audience, and you can stop arguing about it in planning meetings. The only true failure is calling a winner that isn’t there.

From One-Off Tests to a Testing Program

The difference between teams that get compounding value from testing and teams that just generate dashboard screenshots is rarely statistical sophistication — it’s memory and sequencing. A testing program is just one-off tests plus a written record plus an order of operations.

Sequence by leverage. Test the variables with the largest plausible effect first: the offer itself, the angle or framing of the message, frequency, and list segmentation. Then the message-level variables: subject-line approach, CTA copy, long versus short. Save cosmetic variables — button color, emoji, minor wording — for last, or for never, because at most list sizes they’re undetectable anyway. A common and costly inversion is spending a year on subject-line phrasing without ever testing whether the offer or the cadence is right.

Keep a test log: hypothesis, variants, primary metric, sample sizes, evaluation window, result, and decision — a simple spreadsheet is enough. The log does three jobs. It stops the team from unknowingly rerunning settled questions. It turns one-off results into patterns: one specificity-beats-curiosity result is anecdote, but the same direction across five campaigns is a real finding even when no single test was conclusive — which, for smaller lists, is the only reliable way to learn anything. And it forces every test to state its hypothesis up front, which quietly kills the vague “let’s see what happens” tests. Retest settled beliefs occasionally, too — audiences shift, lists turn over, and a two-year-old result may simply be stale.
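A spreadsheet is all the log needs, but the pattern-spotting job can be sketched in a few lines. The record below mirrors the fields listed above, with a `leading_variant` field standing in for the result. The helper asks how likely it would be for that many tests to lean the hypothesized way if each were a coin flip, which is a one-sided sign test. The class name, field names, and five sample entries are hypothetical.

```python
from dataclasses import dataclass
from math import comb

@dataclass
class LogEntry:
    hypothesis: str
    variants: tuple
    primary_metric: str
    sample_size_per_arm: int
    evaluation_window_hours: int
    leading_variant: str   # the result: which variant had the higher primary metric
    decision: str

def chance_of_agreement(wins: int, total: int) -> float:
    """Probability of at least `wins` of `total` tests leaning one pre-named way by luck."""
    return sum(comb(total, k) for k in range(wins, total + 1)) / 2 ** total

log = [
    LogEntry(
        hypothesis="Specific subject lines earn more clicks than curiosity ones",
        variants=("specific", "curiosity"),
        primary_metric="click rate",
        sample_size_per_arm=2_500,
        evaluation_window_hours=48,
        leading_variant="specific",
        decision="inconclusive alone; retest",
    )
    for _ in range(5)   # five hypothetical campaigns, all leaning the same way
]

wins = sum(entry.leading_variant == "specific" for entry in log)
print(f"{wins} of {len(log)} lean 'specific'; "
      f"chance by luck alone: {chance_of_agreement(wins, len(log)):.3f}")
```

Five for five in a direction named in advance would happen by luck about three percent of the time, which is why the pattern carries weight even when no single test did. Four of five, at nearly nineteen percent, would not, so the hypothesis has to be written down before the results come in.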

And keep perspective on what testing is for. A/B testing optimizes — it cannot rescue a weak offer, a tired list, or the deliverability fundamentals that determine whether your mail gets seen at all. Get those right first. Then run fewer tests, run them bigger, judge them on clicks and conversions rather than polluted opens, and write down what you learn. That modest discipline beats a high-volume testing program run on vibes — every time, and by more than two percent.
