April 29, 2026

When A/B Testing Your Outreach Helps, and When It's a Distraction

A/B testing sounds like rigor. For most letter-writing programs it’s a way to feel busy.

I love a good test. I’ve also watched teams spend a campaign cycle running comparisons that couldn’t possibly produce a real finding, then declare the winner of a coin flip and roll the result into next year’s playbook. That’s worse than not testing. Being confidently wrong is more expensive than being uncertain.

If your program has gotten into the habit of A/B testing, or wants to start, it’s worth being honest about when this work pays off and when it doesn’t.

The Sample Size Problem

Most letter-writing programs run too small to A/B test the things they want to test.

The math is unforgiving. If you’re sending 1,000 postcards split 500/500 between two scripts, and your baseline turnout in the universe is 35%, you’d need a turnout difference of about 6 percentage points between the two halves to be reasonably sure you’re seeing a real effect rather than noise. A 6-point lift on a postcard variant is enormous. Real-world effects on persuasion-mode mail in a competitive electorate land in the 0.5 to 1 point range. You’d need something closer to 30,000 to 50,000 postcards per variant to detect that.
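If you want to sanity-check those numbers yourself, here’s a back-of-envelope sketch in Python, using the standard normal approximation for a difference between two proportions. (The roughly 6-point figure above corresponds to bare p < .05 significance at 500 per arm; holding out for the conventional 80% power raises the bar to about 8.5 points.)

```python
# Back-of-envelope power math for a two-arm test, using the normal
# approximation for the difference between two proportions.
from math import ceil, sqrt

from scipy.stats import norm

def min_detectable_effect(n_per_arm, baseline, alpha=0.05, power=0.80):
    """Smallest true lift the test can reliably detect (absolute difference)."""
    se = sqrt(2 * baseline * (1 - baseline) / n_per_arm)
    return (norm.ppf(1 - alpha / 2) + norm.ppf(power)) * se

def n_per_arm_required(lift, baseline, alpha=0.05, power=0.80):
    """Per-arm sample size needed to detect a given absolute lift."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * baseline * (1 - baseline) * (z / lift) ** 2)

# 500 postcards per arm at 35% baseline turnout:
print(f"{min_detectable_effect(500, 0.35):.1%}")   # ~8.5 points at 80% power
# Per-arm volume needed to detect a 1-point lift:
print(n_per_arm_required(0.01, 0.35))              # ~35,700 postcards
```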

This isn’t a reason not to send postcards. The Center for Common Ground evidence is good and well-powered. It’s a reason to be careful about what you call a test result. If your campaign sent 1,200 postcards and the “blue paper” group voted 2 points higher than the “white paper” group, the most likely explanation is randomness. Switching to blue paper next year because of that result is exactly the kind of confident-wrong move that hardens a bad practice into a tradition.
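You can check a result like that in a couple of lines. Here’s the 1,200-postcard example as a two-proportion z-test, with illustrative counts (600 per arm, 37% versus 35% turnout):

```python
# Is a 2-point gap on 1,200 postcards distinguishable from noise?
# (Illustrative counts: 600 per arm, 37% vs. 35% turnout.)
from statsmodels.stats.proportion import proportions_ztest

voted = [222, 210]   # voters in the blue-paper and white-paper arms
sent = [600, 600]    # postcards per arm

z, p = proportions_ztest(voted, sent)
print(f"z = {z:.2f}, p = {p:.2f}")   # z ~ 0.72, p ~ 0.47: well within noise
```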

What’s Actually Worth Testing

A useful test changes one variable that could plausibly move turnout by a meaningful amount. In my experience, three variables clear that bar:

The opening sentence. First sentences carry more weight than any other line on the postcard. Genuinely different openings can produce real differences in response: one that leads with the date, another with a question, a third with a story fragment. This is testable.

The ask. “Vote on May 19” versus “Make your plan to vote by mail” versus “Check that your registration is current at vote.pa.gov” are different actions. They will produce different outcomes. This is testable.

The teaser on the envelope. For sealed mail (less common in our world but increasing), what’s printed on the outside drives the open rate, which drives everything else. This is testable, and the effect sizes tend to be large enough that even modest volumes show signal.

What’s not worth testing in a typical program: font, color, paper weight, signature placement, whether the writer’s first name is on the front or the back, exact word count of the body. The effect sizes for these variables are smaller than the noise in your sample. You can test them; you won’t learn anything. The hours are better spent elsewhere.

How to Run a Test You’ll Actually Learn From

If you’ve decided a variable is worth testing and your volume is high enough to detect a real effect, four practices separate useful tests from theater:

Test one thing. Two different openings, identical everything else. If you change the opening and the ask in the same test, you don’t know which one moved the result.

Pre-register what you expect. Before the test runs, write down what you think will happen and how big the effect will need to be to count. “I expect Opening A to outperform Opening B by 2 points or more” gives you a way to evaluate the result honestly. Without this step, every result becomes “interesting” and gets used to justify whatever the team already wanted to do. (A sketch of this, paired with the one-variable split, follows these four practices.)

Run it on a single state, single campaign, single period. Don’t pool data across campaigns to get the sample size up. Different lists, different states, different timing all introduce variables you can’t control for.

Be willing to learn nothing. “The result was inconclusive” is a real and useful finding. It tells you the variable is smaller than you thought, or your volume is too small to detect it. Both pieces of information should change what you test next.
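To make the first two practices concrete, here’s a minimal sketch of a one-variable split plus a pre-registration record, written down before any mail goes out. The field names and file name are illustrative, not any standard format:

```python
# Sketch of practices one and two: a clean two-arm split, plus a
# prediction committed to disk before the mail drops.
import json
import random
from datetime import date

def split_two_arms(voter_ids, seed=2026):
    """Randomly assign each voter to arm A or B, changing one variable."""
    rng = random.Random(seed)   # fixed seed makes the assignment reproducible
    ids = list(voter_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {"A": ids[:half], "B": ids[half:]}

# Pre-registration: the prediction and decision rule, written down up front.
prereg = {
    "registered_on": date.today().isoformat(),
    "variable": "opening sentence",            # the ONE thing that differs
    "arm_A": "leads with the election date",
    "arm_B": "leads with a question",
    "prediction": "A outperforms B by 2 points or more",
    "scope": "single state, single campaign, single mail drop",
}

arms = split_two_arms(range(1, 1201))          # a 1,200-voter list
with open("prereg_opening_test.json", "w") as f:
    json.dump(prereg, f, indent=2)
print(len(arms["A"]), len(arms["B"]))          # 600 600
```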

A Two-Question Filter

Before your team launches another A/B test, run the proposal through two questions:

  1. Could changing this variable plausibly move turnout by a meaningful amount? If the honest answer is “probably not, but maybe a fraction of a point,” skip it. Even if you’re right, you won’t be able to detect it.
  2. Do we have enough volume for a meaningful result to be visible? If you’re sending under 5,000 pieces per variant, you almost certainly don’t. Use the time to write more letters instead.
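Both questions collapse into a single comparison: is the lift you plausibly expect at least as large as the smallest lift your volume can detect? A hypothetical filter, reusing the same power math as above:

```python
# Two-question filter as arithmetic: test only when the lift you plausibly
# expect clears the minimum detectable effect at your volume.
from math import sqrt

from scipy.stats import norm

def worth_testing(n_per_arm, plausible_lift, baseline,
                  alpha=0.05, power=0.80):
    se = sqrt(2 * baseline * (1 - baseline) / n_per_arm)
    mde = (norm.ppf(1 - alpha / 2) + norm.ppf(power)) * se
    return plausible_lift >= mde

print(worth_testing(600, 0.005, 0.35))     # False: paper color, small list
print(worth_testing(40_000, 0.01, 0.35))   # True: big list, 1-point question
```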

If a proposed test fails either question, it’s not a test. It’s a ritual. There’s nothing wrong with rituals; they have their own benefits. But call them what they are and stop using their results to make decisions.

When to Stop Testing

Programs that test well share a habit: they ship the winner and move on. They don’t run the same test again next year “just to confirm.” They don’t add a third variant to “really make sure.” They take what they learned, write it into the script template, and spend the next cycle’s testing budget on a different question.

The point of testing isn’t to keep testing. It’s to learn enough to stop testing this thing and start testing something else. Treat your testing budget like your volunteer hours. It’s finite, and most programs have less of it than they think.

Liked this article?

Discover what Sincere can do for your campaign. Create personalized postcarding and letter-writing campaigns that engage voters with authentic, handwritten outreach.

Get Started with Sincere