Gpt2-chatbot: The Mysterious New Model That’s Baffling the AI Community

But what is it and where did it come from?

Around this time yesterday, initial reports of a brand-new large language model started making waves across Twitter. But this wasn’t a typical release. Nor was it announced by one of the large, established AI labs, like OpenAI, Meta, Mistral, Cohere, Anthropic, or others.

Nope. In fact, it was nothing like any model release we’ve seen so far. And that’s precisely why it has driven such massive engagement in the twenty-four hours since it appeared. Because this isn’t just another model. “Gpt2-chatbot” is good — incredibly good.

Gpt2-chatbot

Okay, so what is this mysterious new LLM? Who’s responsible for it, and how good is it really? Well, if you’ve had the opportunity to play around with it already, you’ll quickly find out — it’s very good. We’re talking on par with, if not better than, GPT-4 Turbo. Yes, that good. Let’s dive into the details.

Without any announcement, the new model, “gpt2-chatbot,” was made available on the LMSYS Chatbot Arena. Once discovered, users began testing and tinkering with the chatbot, and to their surprise, it proved remarkably capable. By now, many more have joined in.

Across the AI community, the reaction was swift: this model, whatever it is, is on the level of the best model on the market today — GPT-4 Turbo. Many have gone a step further, calling “gpt2” even better. Either way, this puts the new chatbot at the upper echelon of LLMs, but unlike the rest, we know almost nothing about it — who developed it, where did it come from, what architecture is it based upon, is it a fine-tune of another highly capable model?

The questions go on and on. So let’s run through what we know.

What Do We Know?

What we know for certain, as highlighted by @futuristflower, is that while gpt2-chatbot rivals or exceeds GPT-4's performance, it likely demands higher operational costs. Right now, users are limited to just 8 messages per day.

Furthermore, interacting with the model is an additional challenge given what seems to be an hourly rate limit attached to this particular model. For example, when I attempted to use gpt2 in the process of writing this article, I was met with the following error: “MODEL_HOURLY_LIMIT (gpt2-chatbot): 1000.”

Clearly, the model behind gpt2-chatbot is much more expensive to run than other frontier models at this point. This would make sense if some of the potential hypotheses about its origin and architecture are true (we’ll get to these shortly).

As for its capabilities, it’s comparable to GPT-4 on many levels - plenty of people have even pointed out it seems to have a similar ‘personality’ - but where it seems to excel is in its much-improved reasoning and planning capabilities. Others have noted a similar jump up from GPT-4 in coding.

And what do we know about its origins? While it’s impossible to say with any certainty, it would seem that “the model has the same weaknesses to certain special tokens as other OpenAI models and it appears to be trained with the openai family of tokenizers.” Now, even if gpt2 was indeed trained with the same OpenAI tokenizer family, that’s no smoking gun.

And no, neither is just asking it. When asked, gpt2 will tell you that it was made by OpenAI, but frankly, that means nothing. As @itsandrewgao put it, “this is a weak signal though because of data contamination (a lot of models are trained on OpenAI chats and thus, think they were made by OpenAI).”

So, to summarize: gpt2-chatbot is…

  • highly capable - on par with the best of the best LLMs

  • expensive - relatively slow inference and low rate limit

  • great at planning, reasoning, and code generation (and apparently also ASCII art and JSON mode?)

But there’s still so much more that we don’t know.

Potential Theories

By now, everyone’s got their own working theory of what gpt2-chatbot truly is. At first, many were quick to theorize that it was the long-awaited GPT-4.5 or GPT-5. However, I think we can already rule that out. With performance merely comparable to (or slightly better than) GPT-4 Turbo, it would make a sorry upgrade, not at all in line with the rhetoric from Sam Altman thus far. So, instead, let’s walk through some more realistic possibilities.

Some have speculated that gpt2-chatbot could be a retrained version of OpenAI’s original GPT-2 model - hence the naming convention - updated with modern datasets. This theory suggests that the foundational pre-training done back in 2019 was so robust that, even years later, it competes with the latest models. This would essentially mean even back then, OpenAI was far ahead of the game. That’s a narrative many already believe.

And so, while appealing for its simplicity and keeping in line with OpenAI’s dominance, this idea doesn't quite hold up.

Under scrutiny, it’s far from likely. Given the leaps in AI capabilities since GPT-2 and the distinct improvements gpt2-chatbot demonstrates in reasoning and planning, it seems implausible that simply retraining on modern datasets could produce such a result. Nor does this account for the supposed ‘Knowledge Cutoff’ of gpt2, which, when prompted, it reveals to be November 2023. Overall, this theory is almost certainly not true.

However, there is one theory that the community seems to be coalescing around. And it involves our old friend, Q*.

A more compelling theory, stated most simply by Siqi Chen (@blader), posits that gpt2-chatbot likely blends the extensive general knowledge of GPT-4 with a sophisticated ‘Q*’ search reasoning algorithm, potentially as part of a tree-of-thought search approach.

This would account for the bot's advanced reasoning capabilities and its higher operational costs due to more complex processing demands. The seamless success of gpt2-chatbot in handling prompts that stump other models—even when token limitations are tight—suggests a deeper, more integrated reasoning capacity.
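To make the “GPT-4 plus search” hypothesis concrete, here is a toy sketch of a tree-of-thought-style beam search. Everything in it is illustrative: the hypothetical `propose` and `score` functions stand in for LLM calls (proposing candidate next “thoughts” and judging partial solutions), and the puzzle — pick 4 numbers from 1–9 summing to exactly 23 — stands in for a real reasoning task. The point is the shape of the computation: many candidates are generated and scored at every step, which is exactly why such a system would cost far more per query than a single greedy decode.

```python
# Toy tree-of-thought search: at each depth, expand every surviving state
# into candidate "thoughts", score them with a heuristic, and keep only the
# best few (a beam). In the hypothesized design, an LLM would play both
# roles; here, hypothetical stand-ins solve a toy puzzle instead.
import heapq

TARGET, STEPS, BEAM = 23, 4, 3  # pick 4 numbers from 1..9 summing to 23

def propose(state):
    """Stand-in for an LLM proposing next-step thoughts."""
    return [state + [n] for n in range(1, 10)]

def score(state):
    """Stand-in for an LLM judging a partial solution (higher is better)."""
    remaining = STEPS - len(state)
    gap = TARGET - sum(state)
    # Prune branches that can no longer reach the target.
    if gap < remaining * 1 or gap > remaining * 9:
        return float("-inf")
    return -abs(gap - remaining * 5)  # prefer states "on pace" to the target

frontier = [[]]
for _ in range(STEPS):
    candidates = [c for s in frontier for c in propose(s)]
    frontier = heapq.nlargest(BEAM, candidates, key=score)

best = frontier[0]
print(best, sum(best))  # a 4-number combination summing to 23
```

Note how the cost scales: each query does `STEPS × BEAM × branching` evaluations rather than one forward pass per token, which fits the low message caps and hourly rate limits observed on gpt2-chatbot.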

And, interestingly, this may also lend some insight into the naming of the bot, hinting that it’s neither an incremental update (continuing on the GPT-3, GPT-4, etc naming convention) nor a full-fledged new generation model, but instead, a special test bed for next-level AI reasoning technologies.

Altogether, this would make it a likely precursor to the anticipated GPT-5, which is expected to further capitalize on these advanced reasoning enhancements. But, to be clear, this is still just speculation. At this point, there are far more unknowns than knowns. And until someone steps forward with more convincing evidence, we may have to content ourselves with just not knowing.

Though, it’s exciting nonetheless. A new age of LLMs could be upon us. We’ll just have to wait and find out.

Postscript

Make of that what you will.