Which AI is actually best for market research?

An independent, head-to-head test of five large language models across a full research workflow, from raw survey data to a client-ready report. Here is what we found.


YAZI · INDEPENDENT BENCHMARK

We ran an independent, head-to-head test of five large language models across a full research workflow, from raw data to a client-ready report. Five models, five standardised prompts, one real 1,500-respondent dataset, scored against a 170-point rubric by two analysts. Here is what we found.

5 leading LLMs scored head-to-head
170 pts: rubric across 5 prompts + 2 session criteria
34 pts: spread between first and last place

Run by: Yazi Research
Type: Independent benchmark
Dataset: 1,500 respondents
Tested: May 2026

There’s no shortage of opinions about which AI model is best for research, yet there is very little based on actual tests. So we ran one.

The question started between the two of us. We use these tools every day, and kept disagreeing about which model to reach for, and for which job. Not in a vague “AI is amazing” way, but specifically: when I have a quant dataset to analyse, or a thousand open-ended responses to code, or a client report due on Friday, which model should I actually open?

And since we couldn’t find a credible answer anywhere, we built one. We took a real research project, wrote a set of standardised prompts that follow the work a researcher genuinely does, scored every output against a 170-point rubric, and ran five of the biggest models through exactly the same process. What follows is a full breakdown of what we did, how we did it, what happened, and how you can replicate it yourself.

The leaderboard

If you’re looking for the headline result, here it is: the overall ranking, before we unpack how we got there. Two Claude models took the top two spots, ChatGPT was a solid third, Gemini fourth, and Microsoft Copilot came last by a wide margin.

The single most important number is the spread. More than 34 percentage points separate first from last (164.5 versus 106.5 on the 170-point rubric). These tools are not interchangeable, and the gap is far bigger than the marketing would have you believe. If you’ve settled on a model out of habit, or because it’s the one your company already licenses, that choice is worth revisiting.

What this was, and wasn’t

A quick word on motivation, because it matters for trust. This was not commissioned, sponsored, or run on behalf of any of the model providers. We are a research company. We built the test because we needed the answer for our own work, and once we had it, it seemed worth sharing with the industry.

We also wanted to put a few tired arguments to bed: the “AI produces generic rubbish” line, the “it cannot handle a real dataset” line, and the “it is coming for our jobs” line. Rather than argue about those in the abstract, we wanted to test them. We compared five models using the versions current as of May 2026: Claude Sonnet 4.6, Claude Opus 4.7, ChatGPT 5.5, Gemini 3.1, and Microsoft Copilot. These models update constantly, so this is a snapshot, not a permanent ranking. The methodology is the durable part.

How we tested

This is the part that matters most, so we’ve gone into detail. The test was rigorous, standardised, and independently scored, so you can trust what you see.

The dataset

We used one real study, anonymised and stripped of anything personally identifiable before any model saw it. Both of us reviewed the data by hand before testing, so we knew the dataset cold and could judge whether a model was right or just confident.

1,500 survey respondents in a single, real study, covering both quantitative questions and open-ended qualitative questions.

The scoring approach

We scored each output against 12 criteria spread across the five prompts, plus two session-level criteria, for a maximum of 170 points. Everything was scored from 0 to 5, covering the things a research buyer actually cares about: accuracy, output quality, recommendation relevance, action specificity, usability, depth of analysis, quant-versus-qual handling, dataset-size handling, cross-source integration, narrative and headline quality, deliverable completeness, and whether quality held across the whole session.

Test design

5 standardised prompts, identical wording for every model
12 evaluation criteria across 5 prompts, plus 2 session-level
Scored 0 to 5 per criterion, maximum 170 points
Single-shot per prompt, no follow-up prompts allowed
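
As a quick check on the arithmetic, the 170-point maximum can be rebuilt from the number of criteria scored on each prompt. This is a minimal sketch: the per-prompt counts are taken from the results tables later in this article, and the dictionary keys are illustrative shorthand rather than official rubric names.

```python
# Criteria scored per prompt, counted from the results tables in this article.
# The keys are illustrative shorthand, not official rubric names.
CRITERIA_PER_PROMPT = {
    "quant_analysis": 7,
    "qual_analysis": 7,
    "combined_synthesis": 7,
    "reporting_structure": 5,
    "final_deliverable": 6,
}
SESSION_CRITERIA = 2    # quant/qual consistency and session resilience
MAX_PER_CRITERION = 5   # every criterion is scored 0 to 5

prompt_points = sum(CRITERIA_PER_PROMPT.values()) * MAX_PER_CRITERION
session_points = SESSION_CRITERIA * MAX_PER_CRITERION
max_points = prompt_points + session_points

print(prompt_points, session_points, max_points)  # 160 10 170
```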

Controls

Two analysts scored independently before comparing
Scores averaged across both; gaps of 2+ points discussed and reconciled
Fresh session for every model, no shared context
Identical dataset, identical instructions, head-to-head
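
For anyone replicating the two-analyst control, the averaging and gap check might look something like the sketch below. The class name, field names, and example scores are hypothetical, and the article does not say how a reconciled score was recorded after discussion, so the sketch only flags the gaps and leaves the resolution to the analysts.

```python
from dataclasses import dataclass

GAP_THRESHOLD = 2.0  # gaps of 2+ points were discussed and reconciled


@dataclass
class CriterionScore:
    """One criterion scored independently by two analysts (hypothetical structure)."""
    criterion: str
    analyst_a: float  # 0 to 5
    analyst_b: float  # 0 to 5

    @property
    def gap(self) -> float:
        return abs(self.analyst_a - self.analyst_b)

    @property
    def needs_reconciliation(self) -> bool:
        return self.gap >= GAP_THRESHOLD

    @property
    def average(self) -> float:
        return (self.analyst_a + self.analyst_b) / 2


# Hypothetical scores for illustration only; not data from the study.
scores = [
    CriterionScore("Accuracy", 5.0, 4.0),
    CriterionScore("Depth of analysis", 5.0, 3.0),
]

for s in scores:
    flag = "discuss and reconcile" if s.needs_reconciliation else "ok"
    print(f"{s.criterion}: average {s.average}, gap {s.gap} ({flag})")
```

Scoring independently first and comparing afterwards is the point of the control: the gap check only means something if neither analyst has seen the other's number.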

Single-shot, no coaching. We did not follow up, re-prompt, or nudge any model toward a better answer. If the first output was shallow, that shallow output is what we scored. As a researcher choosing a tool, you want the one that doesn’t need babysitting.

Replication & reliability

We ran each model multiple times under identical conditions. The numerical scores and figures came out identical across runs, and the insights and recommendations were around 90% consistent. So this is not a one-off fluke.

The five tasks

We wrote five standardised prompts that mirror the actual steps a researcher takes between receiving the data and sending the report. Every model got the same prompts, word for word, in the same order, in a single session.

The five tasks follow the order in which a researcher actually works: analysing the raw survey, coding the verbatims, synthesising the two, structuring the report, and then producing a report that could go straight to the client.

The results, task by task

The leaderboard is an average. The more useful view is how each model performed on each task, because the shape tells you where the real differences sit. The overall ranking holds across almost every task, but the gaps open and close depending on the work.

[Chart: performance by task. Legend: Sonnet 4.6, Opus 4.7, ChatGPT 5.5, Gemini 3.1, Copilot]

Prompt 1, Quantitative analysis

The clearest separator was sub-group analysis. The top three models automatically cut the data by tenure, location, and revenue band without being asked. Gemini stopped at top-line aggregation; Copilot didn’t attempt sub-groups at all and leaned on approximate ranges where exact figures were available. Same data, same prompt, very different work. Opus posted the single highest score of the entire test on this prompt.

Criterion (0–5)	Opus 4.7	Sonnet 4.6	GPT 5.5	Gemini 3.1	Copilot
Accuracy	4.5	5	5	4.5	3
Output quality	5	5	4.5	4	2.5
Recommendation relevance	5	5	4.5	4.5	2.5
Action specificity	5	5	4.5	4.5	3
Usability	5	5	4.5	4.5	2.5
Depth of analysis	5	4.5	4	3	2
Dataset-size handling	5	4.5	5	5	4
Total / 35	34.5 (99%)	34 (97%)	32 (91%)	30 (86%)	19.5 (56%)
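
If you want to check a table like this yourself, the totals can be recomputed from the criterion rows. A minimal sketch using the Prompt 1 figures above; the variable names are illustrative.

```python
# Prompt 1 criterion scores, copied from the table above.
MODELS = ["Opus 4.7", "Sonnet 4.6", "GPT 5.5", "Gemini 3.1", "Copilot"]
ROWS = {
    "Accuracy": [4.5, 5, 5, 4.5, 3],
    "Output quality": [5, 5, 4.5, 4, 2.5],
    "Recommendation relevance": [5, 5, 4.5, 4.5, 2.5],
    "Action specificity": [5, 5, 4.5, 4.5, 3],
    "Usability": [5, 5, 4.5, 4.5, 2.5],
    "Depth of analysis": [5, 4.5, 4, 3, 2],
    "Dataset-size handling": [5, 4.5, 5, 5, 4],
}
MAX_POINTS = 5 * len(ROWS)  # seven criteria, each scored 0 to 5

# Sum each model's column across all criterion rows.
totals = {
    model: sum(row[i] for row in ROWS.values())
    for i, model in enumerate(MODELS)
}

for model, total in totals.items():
    print(f"{model}: {total:g} / {MAX_POINTS} ({total / MAX_POINTS:.0%})")
# Opus 4.7: 34.5 / 35 (99%)
# Sonnet 4.6: 34 / 35 (97%)
# GPT 5.5: 32 / 35 (91%)
# Gemini 3.1: 30 / 35 (86%)
# Copilot: 19.5 / 35 (56%)
```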

Prompt 2, Qualitative analysis

The interesting split was about voice. Sonnet and ChatGPT preserved the actual respondent voice, grammar quirks and all. Opus and Gemini polished the verbatims into smoother prose, which reads better but quietly loses authenticity. In qualitative work, the rawness is often the point. Opus surfaced cross-corpus patterns no other model found; Copilot produced theme labels that could have been written from the brief alone.

Criterion (0–5)	Sonnet 4.6	Opus 4.7	GPT 5.5	Gemini 3.1	Copilot
Accuracy	4.5	4	4.5	4.5	4
Output quality	5	5	4.5	4.5	3.5
Recommendation relevance	5	5	4.5	4.5	3.5
Action specificity	5	5	4.5	4.5	3.5
Usability	5	5	4.5	4.5	3.5
Depth of analysis	5	5	4	3.5	2.5
Dataset-size handling	4.5	5	5	5	4
Total / 35	34 (97%)	34 (97%)	31.5 (90%)	31 (89%)	24.5 (70%)

Prompt 3, Combined synthesis

Every model managed to integrate the quant and qual, so the technique itself is no longer a differentiator. What separated them was depth: the top models surfaced genuine contradictions and alignments between the two sources, while Copilot stayed at lighter, more descriptive integration. This is also the prompt where Opus slipped on accuracy (3 against Sonnet’s 4.5), which is most of the reason Sonnet edged it overall.

Criterion (0–5)	Sonnet 4.6	Opus 4.7	GPT 5.5	Gemini 3.1	Copilot
Accuracy	4.5	3	5	4.5	3
Output quality	5	5	4.5	5	3
Strategic rec relevance	5	5	4.5	4.5	3.5
Content specificity	5	5	4.5	4.5	3.5
Usability	4.5	5	4.5	4.5	3
Depth of analysis	5	5	4	4	3
Cross-source integration	5	5	5	5	4
Total / 35	34 (97%)	33 (94%)	32 (91%)	32 (91%)	23 (66%)

Prompt 4, Reporting structure

The closest scores of the whole test, with under three points separating the top four. The recurring slip across the entire panel was defaulting to topic labels instead of insight headlines, even though the prompt contained a worked example of exactly what we wanted. Even the best models did it at least once. Copilot produced section headlines without specific data citations and lost the client framing.

Criterion (0–5)	Sonnet 4.6	Opus 4.7	GPT 5.5	Gemini 3.1	Copilot
Accuracy	4.5	4	5	4.5	3.5
Output quality	5	5	4.5	4	4
Recommendation relevance	5	5	4.5	4.5	4
Usability	5	5	5	4.5	3
Narrative & headline quality	5	5	4.5	4.5	3.5
Total / 25	24.5 (98%)	24 (96%)	23.5 (94%)	22 (88%)	18 (72%)

Prompt 5, Final deliverable

This is the diagnostic prompt, the one where sessions either hold together or fall apart. The top models introduced genuinely new analytical content here; the mid-tier consolidated competently; the weakest model rehashed earlier outputs and reverted to a generic template that didn’t even name the commissioning client. One honest caveat: we asked for a Word document. Only ChatGPT produced one with proper tables. Gemini and Copilot couldn’t produce a Word file at all, so you’d have to copy and paste, the kind of practical detail that matters when delivering to a client.

Criterion (0–5)	Sonnet 4.6	Opus 4.7	GPT 5.5	Gemini 3.1	Copilot
Accuracy	4.5	4	5	4	3
Output quality	5	5	4	4.5	2.5
Recommendation relevance	5	5	4.5	4.5	2.5
Usability	4	4.5	4.5	4	2.5
Narrative & headline quality	5	5	4.5	4.5	3
Logical flow / deliverable quality	4.5	4.5	4.5	3.5	2
Total / 30	28 (93%)	28 (93%)	27 (90%)	25 (83%)	15.5 (52%)

Two session-level criteria sit on top of the five prompts: quant/qual consistency and session resilience, each worth 5 points. Sonnet and Opus scored a clean 5 on both; ChatGPT 4.5 and 4.5; Gemini 4.5 and 4; Copilot 3 and 3. Together with the 160 prompt points, these make up the 170-point total.
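
The overall percentages can be rebuilt from the five prompt totals and the two session-level scores. The sketch below does that with the figures reported in this article; the variable names are illustrative.

```python
MAX_TOTAL = 170

# Per-prompt totals (prompts 1 to 5) and the two session-level scores,
# copied from the tables and text in this article.
SCORES = {
    "Claude Sonnet 4.6": ([34, 34, 34, 24.5, 28], [5, 5]),
    "Claude Opus 4.7": ([34.5, 34, 33, 24, 28], [5, 5]),
    "ChatGPT 5.5": ([32, 31.5, 32, 23.5, 27], [4.5, 4.5]),
    "Gemini 3.1": ([30, 31, 32, 22, 25], [4.5, 4]),
    "Microsoft Copilot": ([19.5, 24.5, 23, 18, 15.5], [3, 3]),
}

totals = {model: sum(prompts) + sum(session)
          for model, (prompts, session) in SCORES.items()}
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)

for rank, (model, total) in enumerate(ranking, start=1):
    print(f"#{rank} {model}: {total:g} / {MAX_TOTAL} ({total / MAX_TOTAL:.1%})")
# #1 Claude Sonnet 4.6: 164.5 / 170 (96.8%)
# #2 Claude Opus 4.7: 163.5 / 170 (96.2%)
# #3 ChatGPT 5.5: 155 / 170 (91.2%)
# #4 Gemini 3.1: 148.5 / 170 (87.4%)
# #5 Microsoft Copilot: 106.5 / 170 (62.6%)

# Gap between first and last place, in raw rubric points.
spread_points = ranking[0][1] - ranking[-1][1]
print(f"Spread: {spread_points:g} rubric points")  # Spread: 58 rubric points
```

Note that the 58-point gap is in raw rubric points. Expressed as a share of the 170-point maximum, it is the roughly 34-percentage-point spread quoted in the leaderboard.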

Beyond the leaderboard

If you take nothing else from this, take these six.

Comprehension is solved

All five models understood the data. The bottleneck is no longer whether the model can read your dataset.

Reframing was the real differentiator

The top tier turned findings into strategic positions. The rest described what was in the data, and that gap is where human judgement lives.

Accuracy alone did not win

ChatGPT had the cleanest numbers of the group and still finished third. Being correct is necessary but not sufficient.

Session resilience varied dramatically

The best models held quality from the first prompt to the last. The weaker ones drifted as the session went on.

The final report exposes the earlier analysis

Weak thinking upstream becomes obvious by the time you ask for the full deliverable.

Two of the top three came from the same lab

Anthropic took first and second, Sonnet 4.6 and Opus 4.7.

So which one should you use?

There’s no single answer, because the top three are genuinely close and each has a personality. Based on the results, here’s how we’d use each one.

Claude Sonnet 4.6 #1 · 96.8%

For a polished, senior-client narrative. Strategic reframes, real respondent voice, and recommendations that connect, but be aware that it sometimes misses requested word counts.

Claude Opus 4.7 #2 · 96.2%

For deep analytical breadth and segmentation. Automatic sub-group cuts, sharp headlines, and it surfaces hidden patterns, but it defaults to long and dense.

ChatGPT 5.5 #3 · 91.2%

For accuracy-critical or fact-checked work. Brief-perfect and consultancy-ready, the cleanest quant accuracy of the group, but more descriptive than strategic.

Gemini 3.1 #4 · 87.4%

For ideation and sharp single insights. Lean and high-density, but there are no default sub-groups, and it fades a little by the final deliverable.

Microsoft Copilot #5 · 62.6%

For a quick exploratory first pass only. Consistent in voice, but approximate rather than exact, and it lost the client framing.

A note on Sonnet versus Opus

We get asked this most, so here’s the honest read. It was effectively a tie: 96.8% against 96.2% is a single point on the 170-point scale (164.5 versus 163.5). Opus was arguably the deeper analyst: it had the highest single score of the whole test on the quant prompt and caught patterns in the verbatims that nothing else did. Where Sonnet won was accuracy consistency. It scored higher on accuracy on every prompt, and the gap was clearest on the synthesis step, where Opus slipped to 3 while Sonnet held at 4.5. In research, a wrong figure costs you more than a slightly less deep insight, and that’s what tipped it. Opus for depth, Sonnet for a reliable, client-ready output is a fair way to split them.

What this actually saves you

The case for adoption is easier to make in hours than in theory. The important shift is where the time-saving now sits. For most of us, the unlock is no longer in finding the insight, it’s in producing the output: the drafting, the chart-making, the slide-building. That’s the part these tools have quietly closed the gap on in the last year.

The bottleneck moved. A year ago you couldn’t get a usable report or deck out of these models; now you can. The unlock for researchers was never noticing the insight, it was producing the artefact, and that’s the part that has caught up.

What’s changed since 2023

If you tried these tools in 2023 and walked away unimpressed, that was reasonable at the time. But three things have changed. Context windows went from a few thousand tokens to over a million, so an entire dataset now fits in a single session. Hallucinations dropped sharply inside a tight workflow, with outputs moving from fabricated sources to verifiable ones. And the outputs caught up: a year ago you couldn’t get a usable report or deck out of these models, and now you can. That last one is the breakthrough that moved these tools from interesting to genuinely useful.

Try it yourself this week

The best way to trust any of this is to test it on your own data. None of these takes more than ten minutes, and you don’t need any setup beyond the chat window.

Paste in a quant table

Copy 10–20 rows of survey data into any top model and ask: “You are a senior market research analyst. What are the five most interesting patterns in this data?”

Code your verbatims

Paste 20–30 open-ended responses and ask: “What are the five dominant themes in these responses? For each theme, give me two supporting quotes.”

Draft an executive summary

Paste your key findings as bullets and ask: “Write a 150-word executive summary for a client. Lead with the most critical finding and end with the single most important action.”

Take 15 minutes one day next week, pick one, and use whatever dataset is already on your desk. That’s how this stops being a debate and becomes part of your toolkit.
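
If you plan to try the same prompt on more than one model, it helps to keep the wording fixed and swap in only the data, as the benchmark did. One way to do that is a small helper that assembles the message you paste into the chat window. This is only a sketch: the function name, the dictionary keys, and the example responses are hypothetical, and putting the prompt before the data is a choice rather than a requirement.

```python
# The three starter prompts, quoted from the steps above.
PROMPTS = {
    "quant": (
        "You are a senior market research analyst. "
        "What are the five most interesting patterns in this data?"
    ),
    "verbatims": (
        "What are the five dominant themes in these responses? "
        "For each theme, give me two supporting quotes."
    ),
    "summary": (
        "Write a 150-word executive summary for a client. "
        "Lead with the most critical finding and end with the single "
        "most important action."
    ),
}


def build_message(task: str, pasted_data: str) -> str:
    """Combine a fixed prompt with pasted data into one chat message."""
    return f"{PROMPTS[task]}\n\n{pasted_data.strip()}"


# Hypothetical open-ended responses, for illustration only.
example_responses = """
"Delivery was late twice this month."
"Support sorted my problem quickly."
"""

message = build_message("verbatims", example_responses)
print(message)
```

Because the prompt text never changes between runs, any difference in the answers comes from the model or the data, not from how the question was phrased.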

The bottom line

The headline that AI is coming for research jobs misses what the data actually shows. These models do the heavy lifting on production, but they still need someone who knows what good looks like to steer them. The better you are at the craft, the better your instructions, and the better the result.

The researchers who learn to wield these tools well will do better work, deliver faster, and free up more of their time for the thinking that only humans can do. If anything, this is the moment the most experienced researchers should be pulling ahead. These tools aren’t just good enough any more; they are fit for everyday use.

We ran this study independently at Yazi Research. The whole test was run inside the standard chat interface, no API, no custom tooling, specifically so that anyone in the industry can replicate it. If you’d like the full scoring rubric, the exact prompts, or the per-criterion breakdown, get in touch and we’ll share them.

Tim Treagus

Founder of Yazi, a research platform in 100+ languages for surveys, AI-moderated interviews, and diary studies inside WhatsApp.

Yaseen Mowzer

Head of Research at Yazi. Co-designed the 170-point rubric and prompts, and was the study's second independent scorer.
