Analysis · AI · Machine Learning

PewDiePie Beats
ChatGPT?
An AI Researcher
Explains.

By Jessenth Ebenezer
Category AI · Machine Learning
Subject Felix Kjellberg · Qwen 32B · Aider Polyglot
Reading Time ~10 minutes
Felix humorously explains using Qwen as the base model for his fine-tune.

In 2020, KSI publicly roasted PewDiePie for still using Bandicam to record his videos. Not OBS, not any of the dozen free and better alternatives that every other creator had switched to years earlier. Bandicam. The kind of software your cousin installed in 2012 and never uninstalled. Felix laughed it off, which is very on brand, but it was a genuine window into where he was at the time with technology — a guy who had been making videos for over a decade and had never particularly needed to care about what was under the hood.

Fast forward five or six years and the same person has built a custom 10-GPU workstation, migrated to Linux, written his own local AI interface called ChatOS, and published a video documenting a months-long attempt to fine-tune a 32 billion parameter language model to beat frontier AI systems on a coding benchmark. That arc, from someone using outdated freeware because it gets the job done to someone who genuinely understands what benchmark contamination is and why it invalidates a training run, is one of the more lowkey remarkable things I've seen from a public figure in recent memory. And I grew up watching him, so I was paying attention.

When the video dropped I wasn't surprised that it had merit. I was surprised by how much.

01 // What he actually did — and what he didn't

Every headline about this story got the framing slightly wrong, so let's be precise.

PewDiePie did not train an AI from scratch. He fine-tuned an existing open-source model called Qwen 2.5-32B, a 32 billion parameter model developed by Alibaba that was already strong at coding tasks. Fine-tuning means taking a model that has already been trained on enormous amounts of data and continuing to train it on a specific, curated dataset to sharpen its performance on a narrower task. It is a completely standard and widely used technique in professional AI development. Companies and research labs do this constantly. What you are doing is essentially giving a very capable generalist a concentrated crash course in one specific thing.
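The concept fits in a few lines even though his actual run involved real GPU training on Qwen. Here is a toy sketch in plain Python where "training" is just accumulating bigram counts rather than gradient updates on network weights; the corpora are invented, but the principle it shows (continue training a general model on narrow data and its behavior specializes) is the same:

```python
from collections import Counter, defaultdict

def train(model, corpus):
    # "Training" here is accumulating bigram counts; real fine-tuning
    # updates neural network weights with gradient descent, but the
    # principle is identical: keep learning on new, narrower data.
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
    return model

def predict(model, word):
    # Most likely next token under the current counts.
    return model[word].most_common(1)[0][0]

# "Pretraining": broad, general-purpose text.
model = defaultdict(Counter)
train(model, ["the cat sat on the mat", "the dog ran in the park"])
print(predict(model, "the"))  # "cat" (general-domain guess)

# "Fine-tuning": continue training on a curated, domain-specific corpus.
train(model, ["the function returns a value"] * 5)
print(predict(model, "the"))  # "function" (specialized behavior)
```

The generalist's knowledge is still in there; the narrow data has just reweighted what it reaches for first, which is roughly what a coding fine-tune does to a base model.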

He chose the Aider Polyglot benchmark as his target — a coding test that measures a model's ability to solve programming problems across six languages. His stated goal was to beat ChatGPT-4o's score on this benchmark, which at the time sat at around 16 to 18 percent. Frontier models performing at 16 percent on a coding benchmark sounds alarming until you understand what the benchmark is actually testing, which I will get to.
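For context on what a number like 16 percent means: Aider-style benchmarks count an exercise as solved only if the model's edit makes that exercise's test suite pass, and the reported score is simply solved over total. A minimal sketch (the 225-exercise count matches Aider Polyglot's Exercism problem set; the particular run below is hypothetical):

```python
def benchmark_score(results):
    # Percentage of problems solved. On Aider-style benchmarks a
    # "solve" means the model's edit makes the exercise's tests pass;
    # partial credit for almost-working code does not exist.
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)

# Hypothetical run: 36 of 225 exercises solved.
run = [True] * 36 + [False] * 189
print(f"{benchmark_score(run):.1f}%")  # 16.0%
```

This all-or-nothing scoring is part of why frontier models post numbers that look shockingly low: a solution that is 95 percent correct scores exactly zero.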

His hardware setup was not modest. He assembled a workstation with 2x RTX 4000 Ada cards and 8x modded RTX 4090s, each with 48GB of VRAM, for a total memory pool of roughly 256GB. The whole rig cost somewhere around $41,000. Power consumption exceeded 2,000 watts, which caused power cables to catch fire on more than one occasion. One GPU was destroyed entirely. This is what serious home AI infrastructure looks like in 2026, apparently.
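Why does a 32 billion parameter model need that kind of memory pool? Back-of-envelope arithmetic with the standard rules of thumb makes it obvious. None of this is from the video; it is just the usual accounting (2 bytes per parameter for bf16 weights, roughly 16 bytes per parameter for full fine-tuning with mixed-precision Adam once gradients and fp32 optimizer states are included):

```python
def vram_gb(params_billions, bytes_per_param):
    # Back-of-envelope VRAM estimate: parameter count times bytes
    # per parameter, reported in gigabytes (1e9 bytes).
    return params_billions * 1e9 * bytes_per_param / 1e9

PARAMS_B = 32  # Qwen 2.5-32B

# Weights alone in bf16 (2 bytes/param): fits inference, barely.
print(vram_gb(PARAMS_B, 2))   # 64.0 GB

# Full fine-tuning with mixed-precision Adam: ~16 bytes/param once
# gradients and fp32 master weights/moments are counted.
print(vram_gb(PARAMS_B, 16))  # 512.0 GB
```

By that accounting, naive full fine-tuning overshoots even a 256GB pool, which is why parameter-efficient methods and sharding exist, and why the rig still counts as modest by datacenter standards despite the price tag.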

· · ·

02 // The format insight nobody explained properly

Here is where the story gets technically interesting and where most of the coverage missed the real point.

The Aider Polyglot benchmark tests two different formats for code editing. The DIFF format asks the model to output only the specific lines that changed, like a git diff — precise, minimal, surgical. The WHOLE format asks the model to rewrite the entire file with the changes incorporated. Most frontier models default to something closer to DIFF behavior, and they were performing poorly on this benchmark partly because of it.
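To make the two formats concrete, here is a loose illustration. The line shapes below are only modeled on generic diff conventions, not Aider's exact edit syntax; the point is the output-size gap:

```python
# A 40-line file where the fix touches only 2 lines.
file_lines = [f"line {i}" for i in range(40)]
fixed = list(file_lines)
fixed[10] = "line 10  # fixed off-by-one"
fixed[11] = "line 11  # fixed off-by-one"

# WHOLE format: the model re-emits the entire updated file.
whole_answer = fixed

# DIFF format: the model emits only the changed hunk
# (loosely diff-shaped; not Aider's exact syntax).
diff_answer = [
    "@@ lines 10-11 @@",
    "-line 10",
    "-line 11",
    "+line 10  # fixed off-by-one",
    "+line 11  # fixed off-by-one",
]

print(len(whole_answer), len(diff_answer))  # 40 vs 5 lines of output
```

WHOLE is easier to get right because the model never has to locate lines precisely; it just writes the file it thinks should exist. DIFF demands exact anchoring, and a single mislocated hunk means the edit fails to apply.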

PewDiePie's first major jump — from 8% to 16% — did not come from any training at all. It came from simply switching which format the model was prompted to answer in. That alone nearly doubled the score. Which tells you something important: the benchmark was testing format adherence as much as actual coding ability.

My honest opinion here, and I say this as someone who has worked with these systems, is that DIFF is conceptually the correct approach. Rewriting an entire file to make a small change is wasteful. Transformers decode output tokens one at a time; the input context can be cached, but every line of the rewritten file still has to be generated token by token, which means WHOLE-format edits burn decode compute on content that didn't need to change. DIFF is harder to get right, which is exactly why frontier models struggle with it, but it is the right problem to be solving. The benchmark inadvertently rewarded the lazier solution.

As someone wise once said, statistics are like a bikini — what they reveal is suggestive, but what they conceal is vital.

· · ·

03 // The full progression

His path to the final score was anything but straight. Here is what the journey actually looked like:

Stage                                  Score    What happened
Base Qwen 32B, wrong format            8%       Wrong output format for the benchmark
Format switch to WHOLE                 16%      No training, just a format change
Early fine-tuning runs                 ~19.6%   Benchmark contamination found; run invalidated
Retrain from scratch                   36%      Clean dataset, coder-specific base model
Post-training adjustments              39.1%    Final result
ChatGPT-4o                             23.1%    Beaten
Gemini 2.0 Pro Exp                     35.6%    Beaten
Qwen 3 (released immediately after)    40%      Pushed his result to second place

The benchmark contamination discovery is worth highlighting because it says a lot about the seriousness of the project. When he found that some of his training data overlapped with benchmark questions, inflating the score, he didn't take the number and post the headline. He restarted. That is not how a clout chaser behaves. That is how a researcher behaves.
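The video doesn't detail exactly how he detected the overlap, but a common decontamination heuristic is checking n-gram overlap between training examples and benchmark problems. A minimal sketch, where the 8-gram window and 0.5 threshold are illustrative choices rather than anything from his pipeline:

```python
def ngrams(text, n=8):
    # Set of all contiguous n-token windows in the text.
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(example, benchmark_items, n=8, threshold=0.5):
    # Flag a training example if a large fraction of its n-grams also
    # appears in any benchmark problem. n-gram overlap is a common
    # decontamination heuristic; window size and threshold vary in
    # practice, and these values are illustrative.
    ex = ngrams(example, n)
    if not ex:
        return False
    for item in benchmark_items:
        if len(ex & ngrams(item, n)) / len(ex) >= threshold:
            return True
    return False

bench = ["def add(a, b): return a + b  # exercise: implement addition"]
leak = "def add(a, b): return a + b  # exercise: implement addition"
clean = "def mul(a, b): return a * b  # unrelated training sample"
print(contaminated(leak, bench), contaminated(clean, bench))  # True False
```

Training on data that overlaps the test set inflates the score without improving the model, which is why finding contamination forces a restart rather than a footnote.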

· · ·

04 // The real story: small models, specific tasks

Here is the part of PewDiePie's project that actually matters beyond the headline.

What he stumbled into, without necessarily framing it this way, is one of the most important trends in applied AI right now. Fine-tuning small, open-source models on targeted datasets for specific tasks consistently outperforms prompting large frontier models for those same tasks. A Qwen 32B model fine-tuned specifically on coding data in a specific format will beat GPT-4o on that specific task. Not because it is a better model in any general sense, but because specialization beats generalization when the task is narrow enough.

This is what serious AI teams are doing at scale. They are not just calling the OpenAI API and hoping for the best. They are identifying the specific capability they need, curating training data for it, fine-tuning an open-source base model, and deploying something that is smaller, faster, cheaper, and better at that one thing than any frontier model. PewDiePie arrived at this conclusion by accident, through months of failed training runs, benchmark contamination issues, model version misalignment errors, and near electrical fires, but he arrived at it.

The broader point

The gap between a fine-tuned specialist and a general-purpose model is exactly what the broader industry is discovering as it moves toward smaller, task-specific AI. PewDiePie stumbled into a technique that many AI companies are already deploying at scale. The difference is that he documented every failure publicly and in the most entertainingly self-aware way possible.

His final score was 39.1% on the Aider Polyglot benchmark. ChatGPT-4o scored 23.1%. Gemini 2.0 Pro Exp scored 35.6%. He beat both. The moment he finished, Qwen 3 was released and scored 40%, immediately pushing his model into second place, which is perhaps the most brutally perfect ending to a months-long project that anyone could have written.

· · ·

05 // What I actually think about all of this

The headline claim — PewDiePie beats ChatGPT — is technically accurate and contextually misleading, which is exactly how benchmarks work and why you should always read past the number.

But I think the more interesting story is the one that doesn't fit in a headline. A person with no formal ML background, access to serious compute, genuine curiosity and an enormous amount of free time just documented the full process of fine-tuning a large language model, including every mistake he made along the way. He understood benchmark contamination when he found it and restarted rather than taking the inflated score. He identified model version misalignment as a source of error. He built his own inference interface. He ran quantized versions of 235 billion parameter models on consumer hardware.

The privilege here is real — not everyone has $41,000 for a GPU rig and months of unstructured time. But the arc from Bandicam in 2020 to this in 2026 is genuinely worth appreciating. He said it himself in the video: "I like running AI more than using AI." That sentence, more than any benchmark result, explains the whole project.

There is a version of AI enthusiasm that is just vibes and hype and asking ChatGPT to write your emails. And then there is the version where you actually try to understand how it works, accept that you will fail repeatedly, and build something real out of the failures. PewDiePie did the second one. That is worth acknowledging even if the benchmark is niche and the headline is overblown.

· · ·
Jessenth Ebenezer
Computer scientist and MS CS student at New York University, working at the intersection of machine learning, HCI and extended reality. Has fine-tuned a few models himself, none of which nearly burned his house down.