The Labor Illusion Is the Wrong Bet for Creative Testing

Two ways to make an AI feel trustworthy. The labor illusion makes you watch each word land while progress creeps from zero to one hundred. The honest version stays at zero until the answer is ready, then delivers the full phrase at once.

A spinner that knows nothing

You upload an ad to a creative-testing tool. A spinner appears. A progress bar crawls: 40 percent, “reading the frame,” 68 percent, “modeling attention,” 92 percent, “scoring engagement.” A few seconds later the verdict lands, and you believe it a little more than you would have if it had answered instantly. It looked like it worked for the answer.

Here is the uncomfortable part. The wait was designed. The model finished in a few hundred milliseconds. The spinner, the crawling bar, the little status messages naming each step are set dressing, added on purpose so the output would feel considered. You were not watching a machine think. You were watching a machine perform thinking, for you.

We trust machines more when they appear to struggle. Most of the time that instinct is harmless, and occasionally it is useful. In creative testing it is exactly backwards.

The idea has been circulating in AI product circles under a borrowed name: the labor illusion. It is worth taking seriously, because the people building audience-intelligence tools are now deciding whether to lean into it. I think leaning in is a mistake, and I want to lay out why, from both the psychology and the measurement side.

The labor illusion, and why it works

The phrase comes from a 2011 paper by Ryan Buell and Michael Norton at Harvard Business School, published in Management Science. They studied what they called operational transparency: what happens when a service visibly shows you the work it is doing. Their finding was counterintuitive. Across five experiments simulating online travel and online dating, people valued a service more when it signaled effort, and in some cases preferred a site that made them wait over an instant one that returned identical results.

The mechanism they proposed is a short causal chain. Visible effort reads as effort. Perceived effort triggers a sense of reciprocity and a quiet quality inference (“it must be working hard for me, so the result must be good”), and that inflates how much we value the outcome. Show the labor, raise the perceived value, even when the labor changes nothing about what you get.

Underneath that sits an older cognitive shortcut: the effort heuristic, documented by Justin Kruger and colleagues in 2004. We use perceived effort as a proxy for quality, and we lean on it hardest exactly when quality is difficult to judge on its own terms. Tell people a poem or a painting took longer to make and they rate it as better. The heuristic fills the gap when we cannot evaluate the thing directly.

The same machinery shows up one step further along, in what Michael Norton, Daniel Mochon, and Dan Ariely named the IKEA effect: we place a higher value on things we have watched ourselves labor over, down to badly assembled furniture and clumsy origami. Labor leads to love. The detail that matters here is the boundary condition they found. The effect only holds when the labor actually succeeds. When people built something and then took it apart, or failed to finish, the inflated valuation vanished. Effort earns its bonus by producing something real. A spinner is labor that produces nothing for you, which makes it the cleanest possible case of effort that has not earned the credit it collects.

There is a competing instinct, and it is worth naming because it is the one creative testing should be recruiting. Processing fluency: things that are easy to perceive and quick to process feel good, and Rolf Reber, Norbert Schwarz, and Piotr Winkielman argued in 2004 that aesthetic pleasure itself is largely a readout of how fluently the mind handles something. So we carry two opposed reflexes at once. Effort can read as quality. Ease can read as good. Which one fires depends on framing, and the labor illusion is a deliberate nudge toward effort-as-quality, away from the speed a creative test should be rewarding.

⚠️ The trap in one sentence

Designed latency raises how much you trust a verdict without changing whether the verdict is correct. In a creative test, that is the entire failure mode: it moves your confidence and leaves your accuracy exactly where it was.

One honesty note, because this blog is not in the business of citing a single study as law. The effort heuristic is a lab effect, and a 2023 replication in Collabra: Psychology got mixed results on the original experiments, which tells you the effect is real but context-dependent rather than universal. The operational-transparency finding is the better-supported of the two and has shown up across retail queues, government dashboards, and online services. So treat the building blocks honestly: the labor illusion is a robust operations result resting partly on a more fragile psychological one. That nuance matters for where it should and should not be deployed.

The interface is a designed object, too

There is a second bias stacked on top of the first, and it comes from the design of the testing tool itself. People reliably judge attractive interfaces as more usable, and more trustworthy, than plain ones, whether or not they actually work better. This is the aesthetic-usability effect, first measured by Masaaki Kurosu and Kaori Kashimura at Hitachi in 1995, who had 252 people rate variations of an ATM screen and found that perceived ease of use tracked beauty more tightly than it tracked real ease of use. Noam Tractinsky replicated it across cultures in 2000 under a blunt title, “what is beautiful is usable,” and reported that the link got stronger after people actually used the machine. Don Norman built much of Emotional Design on the same observation: attractive things are perceived to work better, and are forgiven more when they do not.

Put that next to processing fluency and the problem compounds. A creative-testing tool is itself a designed object, and the more polished, confident, and effortful it looks, the more its verdicts inherit a credibility that has nothing to do with whether they are correct. The labor illusion is just the most deliberate lever on a whole console of them. A clean layout, a decisive single number, a knowing pause: each nudges the operator toward belief, and none of them touch accuracy. Neuroscience offers a tidy reason this is so hard to resist. Work in neuroaesthetics by Anjan Chatterjee and Oshin Vartanian places aesthetic response partly in the brain’s emotion-valuation circuitry, the same reward system that tags things as wanted and good. A beautiful, busy-looking interface is, at the neural level, talking to the part of you that assigns value, not the part that audits it.

Why AI reached for the trick

Conversational products learned this early. A chatbot that answers the instant you hit enter feels robotic. Add a typing indicator and a short pause and the same reply feels like consideration. Those artificial delays have been shown to raise perceived empathy and user satisfaction, which is why almost every assistant now hesitates before it speaks, on purpose.

Then reasoning models made the illusion literal. The current generation of “thinking” models exposes a stream of intermediate reasoning while it works, and research on visible thinking in chatbots finds that the stream functions as a social cue: watching a model deliberate raises perceived effort, competence, and even warmth. The scrolling “thinking” text is doing reputational work, separate from whatever the model actually computed.

The catch is a live debate in the field called chain-of-thought faithfulness. A model’s visible reasoning trace is not guaranteed to reflect the computation that produced the answer. Models can emit fluent, plausible-looking reasoning that has little to do with how they actually arrived at the output. So in the worst case the “thinking” you are watching is not a window into the work. It is a second performance, layered on top of the first.

Why does a pause read as thought at all? Because we extend social rules to machines without meaning to. Clifford Nass and Youngme Moon showed decades ago that people mindlessly apply human social scripts to computers, responding to politeness, personality, and apparent effort even while insisting they know better. A hesitation triggers the same script a hesitating colleague would: she must be weighing it. The machine inherits the courtesy.

That courtesy lands on ground that already tilts toward over-trust. Research on automation bias finds that people lean on automated cues and stop checking them once a system looks competent, and work on algorithm appreciation shows we will often take a machine’s advice over a human’s. The tilt is not unconditional. It flips to algorithm aversion the moment we watch the system err, sometimes punishing the machine more harshly than we would punish a person for the same mistake. But before errors are visible, the default is to give the confident-looking system the benefit of the doubt, and designed effort is a way of buying that benefit on purpose. The sharpest finding here is about explanations. Studies of human-AI teams have found that attaching an explanation to a recommendation raises how often people accept it whether or not it is correct, producing blind reliance rather than the calibrated kind. A reasoning trace, then, is not only a window into the work. It is also a trust lever that moves independently of accuracy, which is precisely the property you do not want on a surface whose only job is to be right.

If that sounds familiar, it should. This is the ELIZA effect wearing a 2026 suit. Weizenbaum’s 1966 chatbot created an illusion of understanding out of simple pattern matching, and people supplied the comprehension themselves. The labor illusion creates an illusion of diligence out of latency, and we supply the diligence. Both tricks work because the human in the loop does the meaning-making for free.

Creative testing is a different job

Here is the pivot, and it is the whole argument.

The labor illusion is a tool for shaping a feeling. That is legitimate when the AI’s job genuinely is the relationship: a support bot, a companion app, an onboarding flow where patience, trust, and a sense of being heard are the actual deliverable. If the point is how the interaction feels, then a little theatre that makes it feel more attentive is, arguably, doing its job.

Creative testing is not that job. When you test an ad, a thumbnail, the first second of a video, or a packshot, you are not trying to feel good about the tool. You are trying to settle a question with money riding on it: which creative earns attention, and where do the eyes actually go. The output is a verdict, not an experience.

For a verdict, only two properties matter. Is it fast, and is it right? Designed latency makes it slower and adds nothing to whether it is right. The wait is pure cost wearing the costume of credibility.

It is actually worse than neutral, because the theatre corrupts the read. A confident, effortful-looking interface is a textbook System 1 nudge: it invites your fast, automatic brain to accept the call without scrutiny. The entire promise of audience intelligence is to replace gut judgment with measurement. A tool that performs effort is quietly re-installing the exact bias you bought it to remove, and dressing the re-install up as rigor. It is the same blind spot that makes teams celebrate a statistically significant result with no behavioral meaning: the number looks authoritative, so nobody asks whether it matters.

Match the tactic to the job. Performing effort can warm up a conversation, but a creative test exists to settle a question. There, speed is the product and the only credibility that counts is evidence: where attention actually went.

What proving the call actually looks like

So if not theatre, then what? The honest version of confidence in creative testing is evidence, and in this domain the evidence is attention.

A class of tools now predicts where a viewer will look at a frame before a single dollar of media is spent. Predictive-attention models, sometimes called AI eye-tracking, are trained on large corpora of real gaze data, millions of recorded fixations from lab studies, and they return the things a verdict can be built from: a heatmap of predicted attention, fixation and dwell estimates, scanpath order, and saliency. With that you can ask concrete questions in seconds. Does the eye reach the logo before the cut? Does the call to action survive the first second? Is the face stealing attention away from the message it is supposed to support? Vendors in this space include Attention Insight, built on several million eye-tracking fixations, and Realeyes, which trains neural networks on large gaze and reaction datasets, among others.
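The first of those questions, whether the eye reaches the logo before the cut, becomes a mechanical check once a scanpath is available. Here is a minimal sketch in Python. The `Fixation` and `Region` types, the normalized coordinates, and the sample values are all assumptions made for illustration; real tools export their own schemas, and the numbers below are not real gaze data.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Fixation:
    x: float         # horizontal position, 0.0 (left) to 1.0 (right)
    y: float         # vertical position, 0.0 (top) to 1.0 (bottom)
    onset_ms: float  # when the fixation starts, measured from the first frame


@dataclass(frozen=True)
class Region:
    """Hypothetical rectangular area of interest, e.g. the logo."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, f: Fixation) -> bool:
        return self.left <= f.x <= self.right and self.top <= f.y <= self.bottom


def time_to_first_fixation(scanpath: Sequence[Fixation], region: Region) -> Optional[float]:
    """Onset of the earliest fixation inside the region, or None if it is never reached."""
    hits = [f.onset_ms for f in scanpath if region.contains(f)]
    return min(hits) if hits else None


def reached_before(scanpath: Sequence[Fixation], region: Region, cut_ms: float) -> bool:
    """True only if some fixation lands in the region strictly before the cut."""
    t = time_to_first_fixation(scanpath, region)
    return t is not None and t < cut_ms


# Illustrative values only.
scanpath = [
    Fixation(0.50, 0.40, 0.0),     # center of frame
    Fixation(0.30, 0.35, 420.0),   # a face on the left
    Fixation(0.85, 0.10, 1150.0),  # top-right corner, where the logo sits
]
logo = Region(left=0.75, top=0.0, right=1.0, bottom=0.2)

print(time_to_first_fixation(scanpath, logo))        # 1150.0
print(reached_before(scanpath, logo, cut_ms=1000.0)) # False
```

In this made-up case the logo is reached 150 ms after a cut at one second, which is the kind of specific, checkable statement a spinner cannot produce.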

It helps to understand why attention is a real signal and not just another number. Human vision is not a camera. It is a sequence of sharp fixations stitched together by rapid jumps called saccades, with only a small foveal window in focus at any instant, so where the eye goes is, quite literally, where the mind is. What pulls it there comes in two flavors. Bottom-up, stimulus-driven attention is grabbed by raw conspicuity, the high-contrast edge, the face, the sudden motion, formalized in Laurent Itti and Christof Koch’s saliency-map model and prefigured by Anne Treisman’s feature-integration theory, in which simple features register in parallel before attention binds them into objects. Top-down, goal-directed attention is steered by what the viewer is trying to do, and Maurizio Corbetta and Gordon Shulman traced it to a separate dorsal control network in the brain. Alfred Yarbus made the consequence vivid back in the 1960s: the same painting draws completely different scanpaths depending on the question in the viewer’s head. That is the deep reason a measured attention map carries information a spinner never can. It is sampling a real, structured, physiological process rather than narrating one. It is also the reason for the caution that follows, because predictive models are trained mostly on the bottom-up half. They approximate raw saliency reasonably well but capture the top-down, task-driven half poorly, and original creative lives exactly where intent and context bend the gaze.

Now the skeptic’s footnote, because predicted attention is a model, not a measurement, and the difference is the whole game.

ℹ️ Treat vendor accuracy as a claim, not a fact

Numbers like “94 percent accurate” or “78 percent predictive” are vendor-reported and benchmarked against the vendor’s own studies. Predicted attention correlates with real gaze on average, but it degrades on novel layouts, heavy motion, and culturally specific imagery, the exact cases where original creative tends to live. Before predicted attention is allowed to steer spend, validate it against real eye-tracking on a sample of your own work, and keep the headline accuracy figure as a hypothesis until you have.
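One way to run that validation is to score agreement between the predicted map and a measured fixation-density map for each creative in your sample. The sketch below uses the Pearson correlation between the two maps, a metric commonly reported as CC in saliency benchmarking. It is a sketch under assumptions, not a standard procedure: the grid format, the creative names, the toy 2×2 maps, and the 0.7 pass threshold are all placeholders, and the right threshold is something your own team has to decide.

```python
import math
from typing import Dict, List, Tuple

Grid = List[List[float]]  # attention map as rows of non-negative weights


def pearson_cc(predicted: Grid, measured: Grid) -> float:
    """Pearson correlation between two same-sized attention maps."""
    p = [v for row in predicted for v in row]
    m = [v for row in measured for v in row]
    if not p or len(p) != len(m):
        raise ValueError("maps must be non-empty and the same size")
    mean_p, mean_m = sum(p) / len(p), sum(m) / len(m)
    cov = sum((a - mean_p) * (b - mean_m) for a, b in zip(p, m))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_m = sum((b - mean_m) ** 2 for b in m)
    if var_p == 0 or var_m == 0:
        raise ValueError("a constant map has no defined correlation")
    return cov / math.sqrt(var_p * var_m)


def validate(
    pairs: Dict[str, Tuple[Grid, Grid]], threshold: float
) -> Tuple[Dict[str, float], List[str]]:
    """Score every creative and list the ones whose prediction falls below threshold."""
    scores = {name: pearson_cc(pred, real) for name, (pred, real) in pairs.items()}
    failures = sorted(name for name, score in scores.items() if score < threshold)
    return scores, failures


# Toy maps for illustration only: (predicted, measured) per creative.
sample = {
    "familiar_layout": ([[0.9, 0.1], [0.0, 0.0]], [[0.8, 0.2], [0.0, 0.0]]),
    "novel_layout":    ([[0.9, 0.1], [0.0, 0.0]], [[0.0, 0.0], [0.1, 0.9]]),
}

scores, failures = validate(sample, threshold=0.7)  # 0.7 is an assumed cutoff
print(failures)  # ['novel_layout']
```

The useful output is the per-creative breakdown rather than the average: a model can post a strong headline score and still miss on exactly the novel layouts where the decision matters.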

The point is not any particular vendor. It is the shape of the answer. Fast, and backed by a signal you can open up and inspect. That shape is the opposite of a spinner, which is slow and backed by nothing you can examine.

At North AI we build attention analytics for video, and the discipline we try to hold is easy to say and hard to keep: measure attention, do not perform it. A heatmap showing the eye missing your brand for the first two seconds is worth more than any interface, however considered it looks, that merely feels like it thought hard. One of those changes the creative. The other changes only your mood about it.

Real transparency beats fake transparency

It would be easy to read all of this as “hide the processing, just show the number.” That is not the lesson, and it is worth correcting, because Buell and Norton’s deeper finding actually points the other way.

Operational transparency works when the disclosed labor is real and informative. The spinner was never the active ingredient. Visible, genuine work was. Which means the fix for creative testing is not to conceal the computation. It is to replace fake work with real work.

Do not animate a fake progress bar that names steps it is not really taking. Show the heatmap. Show the fixation order and where attention falls off. Show the model’s confidence, the second-best cut, and the reason it lost. Let the user watch the actual reasoning: here is where eyes go, here is how sure we are, here is what would have to change to move the result. That is transparency that earns trust because inspecting it makes the decision better, not because it made you wait. It is the difference between a tool that shows its work and a tool that performs the appearance of work.
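As a rough illustration of what a verdict that shows its work could carry, the sketch below returns the winner together with the runner-up, each cut's confidence, and a plain-language reason the runner-up lost. Every name here (`Candidate`, `Verdict`, `decide`, the `attention_on_brand` field) is hypothetical, and ranking on a single brand-attention share is a simplifying assumption, not a description of how any particular tool scores creative.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    attention_on_brand: float  # share of predicted attention on the brand, 0 to 1
    confidence: float          # the model's own confidence in that estimate, 0 to 1


@dataclass(frozen=True)
class Verdict:
    winner: Candidate
    runner_up: Candidate
    reason: str  # why the runner-up lost, in words the user can check


def decide(candidates: List[Candidate]) -> Verdict:
    """Pick a winner, but keep the evidence a user needs to question the pick."""
    if len(candidates) < 2:
        raise ValueError("need at least two cuts to compare")
    ranked = sorted(candidates, key=lambda c: c.attention_on_brand, reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    gap_points = (winner.attention_on_brand - runner_up.attention_on_brand) * 100
    reason = (
        f"{runner_up.name} put {gap_points:.0f} points less predicted attention "
        f"on the brand than {winner.name} "
        f"(confidence {runner_up.confidence:.0%} vs {winner.confidence:.0%})"
    )
    return Verdict(winner=winner, runner_up=runner_up, reason=reason)


# Illustrative numbers only.
verdict = decide([
    Candidate("cut_a", attention_on_brand=0.22, confidence=0.81),
    Candidate("cut_b", attention_on_brand=0.34, confidence=0.64),
])
print(verdict.winner.name)
print(verdict.reason)
```

Note what the example surfaces: the winning cut has the lower confidence. A single decisive number would hide that; an inspectable verdict puts it in front of the person about to spend the money.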

✅ The bottom line

The labor illusion sells the feeling of a good decision. Creative testing needs the decision itself. Buy the measurement, not the performance, and reserve the theatre for the surfaces whose job is genuinely how they feel.

What to do this sprint

If you are buying or building audience-intelligence or creative-testing tools, five concrete moves:

1. Treat time-to-answer as a feature. If a tool adds latency that adds no information, log it as a bug, not as a sign of rigor. Speed is part of the product, not a compromise on it.
2. Separate the two jobs explicitly. For every AI surface, decide up front whether it exists to shape a feeling or to settle a question. Only the first category is allowed to perform effort.
3. Demand inspectable evidence. A verdict should arrive with the attention signal behind it: heatmap, fixations, scanpath, confidence. “Trust me, I analyzed it” is just the spinner rendered in words.
4. Validate predicted attention against real gaze. Before predicted attention drives media spend, check it against actual eye-tracking on a sample of your own creative. Hold the vendor’s accuracy claim as a hypothesis until your own test clears it.
5. Watch your own System 1. If a slicker, slower, more effortful-looking interface raises your confidence in the same underlying number, that reaction is the bias, not the data. Notice it, and discount it.
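The first move, treating uninformative latency as a bug, can be made concrete by comparing how long the model actually computed with how long the user was made to wait. A minimal sketch, assuming you can measure both numbers for your own tool; the function names and the 20 percent tolerance are illustrative choices, not an established standard.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn and return its result with the wall-clock seconds it took."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def padding_ratio(compute_s: float, displayed_s: float) -> float:
    """Fraction of the user's wait that was not spent computing."""
    if displayed_s <= 0:
        raise ValueError("displayed_s must be positive")
    return max(0.0, displayed_s - compute_s) / displayed_s


def is_designed_latency(compute_s: float, displayed_s: float, tolerance: float = 0.2) -> bool:
    """Flag a wait whose non-compute share exceeds the tolerance (assumed 20%)."""
    return padding_ratio(compute_s, displayed_s) > tolerance


# A model that finishes in 0.3 s behind a 4 s progress bar.
print(is_designed_latency(compute_s=0.3, displayed_s=4.0))   # True
# The same model shown after 0.35 s of total wait.
print(is_designed_latency(compute_s=0.3, displayed_s=0.35))  # False

# timed() gives the compute side of the comparison for any callable.
result, elapsed = timed(lambda: sum(range(1000)))
```

Some gap between the two numbers is legitimate (network, rendering), which is why the check uses a tolerance rather than demanding zero. What it catches is the case described at the top of this piece: a few hundred milliseconds of work dressed up as several seconds of deliberation.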

The spinner can spin as long as it likes. It still does not know where anyone looked.

Lucas Cazelli is CPO and Co-founder at North AI, where he builds neuroscience-inspired attention analytics for video. He writes about decision-making, audience intelligence, and the intersection of cognitive science and product strategy, usually filtered through a civil-engineering background he is still in the process of unlearning.

Connect: LinkedIn | North AI | lucas@north-ai.com

References and further reading

Psychology and decision-making

Buell, R. W., and Norton, M. I. (2011). The Labor Illusion: How Operational Transparency Increases Perceived Value. Management Science, 57(9), 1564–1579.
Kruger, J., Wirtz, D., Van Boven, L., and Altermatt, T. W. (2004). The effort heuristic. Journal of Experimental Social Psychology, 40(1), 91–98.
Tomicic, A., et al. (2023). “The Effort Heuristic” Revisited: Mixed Results for Replications of Kruger et al. (2004). Collabra: Psychology, 9(1).
Norton, M. I., Mochon, D., and Ariely, D. (2012). The IKEA effect: When labor leads to love. Journal of Consumer Psychology, 22(3), 453–460.
Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.

The psychology of design

Kurosu, M., and Kashimura, K. (1995). Apparent usability vs. inherent usability: experimental analysis on the determinants of the apparent usability. CHI ’95 Conference Companion, 292–293.
Tractinsky, N., Katz, A. S., and Ikar, D. (2000). What is beautiful is usable. Interacting with Computers, 13(2), 127–145.
Reber, R., Schwarz, N., and Winkielman, P. (2004). Processing fluency and aesthetic pleasure: Is beauty in the perceiver’s processing experience? Personality and Social Psychology Review, 8(4), 364–382.
Norman, D. A. (2004). Emotional Design: Why We Love (or Hate) Everyday Things. Basic Books.
Nielsen Norman Group: The Aesthetic-Usability Effect, a practitioner overview of the Kurosu, Kashimura, and Tractinsky work.

AI bias and human-AI trust

Nass, C., and Moon, Y. (2000). Machines and mindlessness: Social responses to computers. Journal of Social Issues, 56(1), 81–103.
Dietvorst, B. J., Simmons, J. P., and Massey, C. (2015). Algorithm aversion: People erroneously avoid algorithms after seeing them err. Journal of Experimental Psychology: General, 144(1), 114–126.
Logg, J. M., Minson, J. A., and Moore, D. A. (2019). Algorithm appreciation: People prefer algorithmic to human judgment. Organizational Behavior and Human Decision Processes, 151, 90–103.
Parasuraman, R., and Manzey, D. H. (2010). Complacency and bias in human use of automation: An attentional integration. Human Factors, 52(3), 381–410.
Bansal, G., et al. (2021). Does the whole exceed its parts? The effect of AI explanations on complementary team performance. Proceedings of CHI 2021. See also Microsoft Research’s Overreliance on AI literature review (2022).
Watching AI Think: User Perceptions of Visible Thinking in Chatbots (2026). arXiv preprint.
Processing fluency: an overview of how ease of processing shapes judgment.

The neuroscience of attention

Treisman, A. M., and Gelade, G. (1980). A feature-integration theory of attention. Cognitive Psychology, 12(1), 97–136.
Itti, L., and Koch, C. (2001). Computational modelling of visual attention. Nature Reviews Neuroscience, 2(3), 194–203.
Corbetta, M., and Shulman, G. L. (2002). Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience, 3(3), 201–215.
Yarbus, A. L. (1967). Eye Movements and Vision. Plenum Press. See also the 2013 retrospective, Yarbus, eye movements, and vision.
Chatterjee, A., and Vartanian, O. (2014). Neuroaesthetics. Trends in Cognitive Sciences, 18(7), 370–375.

Predictive attention in creative testing

Attention Insight: predictive attention heatmaps trained on eye-tracking data.
Realeyes: attention and gaze prediction for video and static creative.

Frequently asked questions

What is the labor illusion?

The labor illusion is a finding from operations and consumer psychology, introduced by Ryan Buell and Michael Norton in 2011, that people place higher value on a service when it visibly signals effort, even when that effort does not change the result. A website that shows itself “searching” can be valued more than one that returns the same answer instantly. It rests on a related cognitive shortcut, the effort heuristic, in which perceived effort is used as a proxy for quality, especially when quality is hard to judge directly.

Does the labor illusion help AI products?

Sometimes. For AI surfaces whose job is the relationship, such as support bots, companions, and onboarding flows, a small amount of designed latency or visible “thinking” can raise perceived empathy, competence, and satisfaction. The benefit is real, but it is a benefit to how the interaction feels, not to the accuracy of any decision the AI produces.

Why is the labor illusion bad for creative testing?

Because creative testing exists to settle a question, not to shape a feeling. The output is a verdict about which creative earns attention, with media spend riding on it. Designed latency makes that verdict slower and does nothing to make it more correct, so the wait is pure cost. Worse, an effortful-looking interface nudges the viewer’s fast, intuitive judgment into accepting the call without scrutiny, which reintroduces the very bias that audience-intelligence tools are supposed to remove.

What should replace designed latency in creative testing?

Real evidence, delivered fast. In creative testing the evidence is attention: predictive-attention or eye-tracking models that show where viewers look, in the form of heatmaps, fixation and dwell, scanpath order, and saliency, along with the model’s confidence. Show the actual signal and let users inspect it, rather than performing effort with a spinner. Predicted attention should also be validated against real gaze on your own creative before it is trusted to guide spend.

Does a better-looking creative-testing tool give better results?

Not necessarily, and that is the trap. The aesthetic-usability effect, documented by Kurosu and Kashimura in 1995 and replicated by Tractinsky in 2000, shows that people judge attractive interfaces as more usable and more trustworthy independent of how well they actually perform. A polished, confident-looking tool collects belief its verdicts may not have earned. Judge a creative-testing tool by whether its attention predictions hold up against real eye-tracking on your own creative, not by how considered the interface feels.