An independent, head-to-head test of five large language models across a full research workflow, from raw survey data to a client-ready report. Here is what we found.

We ran an independent, head-to-head test of five large language models across a full research workflow, from raw data to a client-ready report. Five models, five standardised prompts, one real 1,500-respondent dataset, scored against a 170-point rubric by two analysts. Here is what we found.
There’s no shortage of opinions about which AI model is best for research, yet there is very little based on actual tests. So we ran one.
The question started between the two of us. We use these tools every day, and kept disagreeing about which model to reach for, and for which job. Not in a vague “AI is amazing” way, but specifically: when I have a quant dataset to analyse, or a thousand open-ended responses to code, or a client report due on Friday, which model should I actually open?
And since we couldn’t find a credible answer anywhere, we built one. We took a real research project, wrote a set of standardised prompts that follow the work a researcher genuinely does, scored every output against a 170-point rubric, and ran five of the biggest models through exactly the same process. What follows is a full breakdown of what we did, how we did it, what happened, and how you can replicate it yourself.
If you’re looking for the headline result, here it is, the overall ranking before we unpack how we got there. Two Claude models took the top two spots, ChatGPT was a solid third, Gemini fourth, and Microsoft Copilot came last by a wide margin.
A quick word on motivation, because it matters for trust. This was not commissioned, sponsored, or run on behalf of any of the model providers. We are a research company. We built the test because we needed the answer for our own work, and once we had it, it seemed worth sharing with the industry.
We also wanted to put a few tired arguments to bed: the “AI produces generic rubbish” line, the “it cannot handle a real dataset” line, and the “it is coming for our jobs” line. Rather than argue about those in the abstract, we wanted to test them. We compared five models using the versions current as of May 2026: Claude Sonnet 4.6, Claude Opus 4.7, ChatGPT 5.5, Gemini 3.1, and Microsoft Copilot. These models update constantly, so this is a snapshot, not a permanent ranking. The methodology is the durable part.
This is the part that matters most, so we’ve gone into detail. Rigorous, standardised, independently scored, so you can trust what you see.
We used one real study, anonymised and stripped of anything personally identifiable before any model saw it. Both of us reviewed the data by hand before testing, so we knew the dataset cold and could judge whether a model was right or just confident.
We scored each output against 12 criteria spread across the five prompts, plus two session-level criteria, for a maximum of 170 points. Everything was scored from 0 to 5, covering the things a research buyer actually cares about: accuracy, output quality, recommendation relevance, action specificity, usability, depth of analysis, quant-versus-qual handling, dataset-size handling, cross-source integration, narrative and headline quality, deliverable completeness, and whether quality held across the whole session.
We ran each model multiple times under identical conditions. The numerical scores and figures came out identical across runs, and the insights and recommendations were around 90% consistent. So this is not a one-off fluke.
We wrote five standardised prompts that mirror the actual steps a researcher takes between receiving the data and sending the report. Every model got the same prompts, word for word, in the same order, in a single session.
The leaderboard is an average. The more useful view is how each model performed on each task, because the shape tells you where the real differences sit. The overall ranking holds across almost every task, but the gaps open and close depending on the work.
The clearest separator was sub-group analysis. The top three models automatically cut the data by tenure, location, and revenue band without being asked. Gemini stopped at top-line aggregation; Copilot didn’t attempt sub-groups at all and leaned on approximate ranges where exact figures were available. Same data, same prompt, very different work. Opus posted the single highest score of the entire test on this prompt.
| Criterion (0–5) | Opus 4.7 | Sonnet 4.6 | GPT 5.5 | Gemini 3.1 | Copilot |
|---|---|---|---|---|---|
| Accuracy | 4.5 | 5 | 5 | 4.5 | 3 |
| Output quality | 5 | 5 | 4.5 | 4 | 2.5 |
| Recommendation relevance | 5 | 5 | 4.5 | 4.5 | 2.5 |
| Action specificity | 5 | 5 | 4.5 | 4.5 | 3 |
| Usability | 5 | 5 | 4.5 | 4.5 | 2.5 |
| Depth of analysis | 5 | 4.5 | 4 | 3 | 2 |
| Dataset-size handling | 5 | 4.5 | 5 | 5 | 4 |
| Total / 35 | 34.5 99% | 34 97% | 32 91% | 30 86% | 19.5 56% |
The interesting split was about voice. Sonnet and ChatGPT preserved the actual respondent voice, grammar quirks and all. Opus and Gemini polished the verbatims into smoother prose, which reads better but quietly loses authenticity. In qualitative work, the rawness is often the point. Opus surfaced cross-corpus patterns no other model found; Copilot produced theme labels that could have been written from the brief alone.
| Criterion (0–5) | Sonnet 4.6 | Opus 4.7 | GPT 5.5 | Gemini 3.1 | Copilot |
|---|---|---|---|---|---|
| Accuracy | 4.5 | 4 | 4.5 | 4.5 | 4 |
| Output quality | 5 | 5 | 4.5 | 4.5 | 3.5 |
| Recommendation relevance | 5 | 5 | 4.5 | 4.5 | 3.5 |
| Action specificity | 5 | 5 | 4.5 | 4.5 | 3.5 |
| Usability | 5 | 5 | 4.5 | 4.5 | 3.5 |
| Depth of analysis | 5 | 5 | 4 | 3.5 | 2.5 |
| Dataset-size handling | 4.5 | 5 | 5 | 5 | 4 |
| Total / 35 | 34 97% | 34 97% | 31.5 90% | 31 89% | 24.5 70% |
Every model managed to integrate the quant and qual, so the technique itself is no longer a differentiator. What separated them was depth: the top models surfaced genuine contradictions and alignments between the two sources, while Copilot stayed at lighter, more descriptive integration. This is also the prompt where Opus slipped slightly on accuracy, most of the reason Sonnet edged it overall.
| Criterion (0–5) | Sonnet 4.6 | Opus 4.7 | GPT 5.5 | Gemini 3.1 | Copilot |
|---|---|---|---|---|---|
| Accuracy | 4.5 | 3 | 5 | 4.5 | 3 |
| Output quality | 5 | 5 | 4.5 | 5 | 3 |
| Strategic rec relevance | 5 | 5 | 4.5 | 4.5 | 3.5 |
| Content specificity | 5 | 5 | 4.5 | 4.5 | 3.5 |
| Usability | 4.5 | 5 | 4.5 | 4.5 | 3 |
| Depth of analysis | 5 | 5 | 4 | 4 | 3 |
| Cross-source integration | 5 | 5 | 5 | 5 | 4 |
| Total / 35 | 34 97% | 33 94% | 32 91% | 32 91% | 23 66% |
The closest scores of the whole test, with under three points separating the top four. The recurring slip across the entire panel was defaulting to topic labels instead of insight headlines, even though the prompt contained a worked example of exactly what we wanted. Even the best models did it at least once. Copilot produced section headlines without specific data citations and lost the client framing.
| Criterion (0–5) | Sonnet 4.6 | Opus 4.7 | GPT 5.5 | Gemini 3.1 | Copilot |
|---|---|---|---|---|---|
| Accuracy | 4.5 | 4 | 5 | 4.5 | 3.5 |
| Output quality | 5 | 5 | 4.5 | 4 | 4 |
| Recommendation relevance | 5 | 5 | 4.5 | 4.5 | 4 |
| Usability | 5 | 5 | 5 | 4.5 | 3 |
| Narrative & headline quality | 5 | 5 | 4.5 | 4.5 | 3.5 |
| Total / 25 | 24.5 98% | 24 96% | 23.5 94% | 22 88% | 18 72% |
This is the diagnostic prompt, the one where sessions either hold together or fall apart. The top models introduced genuinely new analytical content here; the mid-tier consolidated competently; the weakest model rehashed earlier outputs and reverted to a generic template that didn’t even name the commissioning client. One honest caveat: we asked for a Word document. Only ChatGPT produced one with proper tables. Gemini and Copilot couldn’t produce a Word file at all, so you’d have to copy and paste, the kind of practical detail that matters when delivering to a client.
| Criterion (0–5) | Sonnet 4.6 | Opus 4.7 | GPT 5.5 | Gemini 3.1 | Copilot |
|---|---|---|---|---|---|
| Accuracy | 4.5 | 4 | 5 | 4 | 3 |
| Output quality | 5 | 5 | 4 | 4.5 | 2.5 |
| Recommendation relevance | 5 | 5 | 4.5 | 4.5 | 2.5 |
| Usability | 4 | 4.5 | 4.5 | 4 | 2.5 |
| Narrative & headline quality | 5 | 5 | 4.5 | 4.5 | 3 |
| Logical flow / deliverable quality | 4.5 | 4.5 | 4.5 | 3.5 | 2 |
| Total / 30 | 28 93% | 28 93% | 27 90% | 25 83% | 15.5 52% |
If you take nothing else from this, take these six.
All five models understood the data. The bottleneck is no longer whether the model can read your dataset.
The top tier turned findings into strategic positions. The rest described what was in the data, and that gap is where human judgement lives.
ChatGPT had the cleanest numbers of the group and still finished third. Being correct is necessary but not sufficient.
The best models held quality from the first prompt to the last. The weaker ones drifted as the session went on.
Weak thinking upstream becomes obvious by the time you ask for the full deliverable.
Anthropic took first and second, Sonnet 4.6 and Opus 4.7.
There’s no single answer, because the top three are genuinely close and each has a personality. Based on the results, here’s how we’d use each one.
We get asked this most, so here’s the honest read. It was effectively a tie, 96.8 against 96.2 is about a single point on a 170-point scale. Opus was arguably the deeper analyst: it had the highest single score of the whole test on the quant prompt and caught patterns in the verbatims that nothing else did. Where Sonnet won was accuracy consistency. It was tighter on the raw numbers across almost every prompt, and the gap was clearest on the synthesis step, where Opus slipped slightly while Sonnet held. In research, a wrong figure costs you more than a slightly less deep insight, and that’s what tipped it. Opus for depth, Sonnet for a reliable, client-ready output is a fair way to split them.
The case for adoption is easier to make in hours than in theory. The important shift is where the time-saving now sits. For most of us, the unlock is no longer in finding the insight, it’s in producing the output: the drafting, the chart-making, the slide-building. That’s the part these tools have quietly closed the gap on in the last year.
If you tried these tools in 2023 and walked away unimpressed, that was reasonable at the time. But three things have changed. Context windows went from a few thousand tokens to over a million, so an entire dataset now fits in a single session. Hallucinations dropped sharply inside a tight workflow, with outputs moving from fabricated sources to verifiable ones. And the outputs caught up, a year ago you couldn’t get a usable report or deck out of these models, and now you can. That last one is the breakthrough that moved these tools from interesting to genuinely useful.
The best way to trust any of this is to test it on your own data. None of these takes more than ten minutes, and you don’t need any setup beyond the chat window.
Copy 10–20 rows of survey data into any top model and ask: “You are a senior market research analyst. What are the five most interesting patterns in this data?”
Paste 20–30 open-ended responses and ask: “What are the five dominant themes in these responses? For each theme, give me two supporting quotes.”
Paste your key findings as bullets and ask: “Write a 150-word executive summary for a client. Lead with the most critical finding and end with the single most important action.”
Take 15 minutes one day next week, pick one, and use whatever dataset is already on your desk. That’s how this stops being a debate and becomes part of your toolkit.
The headline that AI is coming for research jobs misses what the data actually shows. These models do the heavy lifting on production, but they still need someone who knows what good looks like to steer them. The better you are at the craft, the better your instructions, and the better the result.
The researchers who learn to wield these tools well will do better work, deliver faster, and free up more of their time for the thinking that only humans can do. If anything, this is the moment the most experienced researchers should be pulling ahead. These tools aren’t just good enough any more, they are fit for everyday use.
Want the full rubric, the exact prompts, and the per-criterion breakdown to run this on your own data?
Book a Demo →