How DeepSeek stacks up against popular AI models, in three charts
DeepSeek, until recently a little-known Chinese artificial intelligence company, has made itself the talk of the tech industry after it rolled out a series of large language models that outshone many of the worldâs top AI developers.
DeepSeek released its buzziest large language model, R1, on Jan. 20. The AI assistant hit No. 1 on the Apple App Store in recent days, bumping OpenAIâs long-dominant ChatGPT down to No. 2.
Its sudden dominance â and its ability to outperform top U.S. models across a variety of benchmarks â have both sent Silicon Valley into a frenzy, especially as the Chinese company touts that its model was developed at a fraction of the cost.
The shock within U.S. tech circles has ignited a reckoning in the industry, showing that perhaps AI developers donât need exorbitant amounts of money and resources in order to improve their models. Instead, researchers are realizing, it may be possible to make these processes efficient, both in terms of cost and energy consumption, without compromising ability.
R1 came on the heels of its previous model V3, which launched in late December. But Monday, DeepSeek released yet another high-performing AI model, Janus-Pro-7B, which is multimodal in that it can process various types of media.
Here are some features that make DeepSeekâs large language models appear so unique.
Size
Despite being developed by a smaller team with drastically less funding than the top American tech giants, DeepSeek is punching above its weight with a large, powerful model that runs just as well on fewer resources.
Thatâs because the AI assistant relies on a âmixture-of-expertsâ system to divide its large model into numerous small submodels, or âexperts,â with each one specializing in handling a specific type of task or data. In contrast to the traditional approach, which uses every part of the model for every input, each submodel is activated only when its particular knowledge is relevant.
So even though V3 has a total of 671 billion parameters, or settings inside the AI model that it adjusts as it learns, it is actually only using 37 billion at a time, according to a technical report its developers published.
The company also developed a unique load-bearing strategy to ensure that no one expert is being overloaded or underloaded with work, by using more dynamic adjustments rather than a traditional penalty-based approach that can lead to worsened performance.
All these enable DeepSeek to employ a robust team of âexpertsâ and to keep adding more, without slowing down the whole model.
It also uses a technique called inference-time compute scaling, which allows the model to adjust its computational effort up or down depending on the task at hand, rather than always running at full power. A straightforward question, for example, might only require a few metaphorical gears to turn, whereas asking for a more complex analysis might make use of the full model.
Together, these techniques make it easier to use such a large model in a much more efficient way than before.
Training cost
DeepSeekâs design also makes its models cheaper and faster to train than those of its competitors.
Even as leading tech companies in the United States continue to spend billions of dollars a year on AI, DeepSeek claims that V3 â which served as a foundation for the development of R1 â took less than $6 million and only two months to build. And due to U.S. export restrictions that limited access to the best AI computing chips, namely Nvidiaâs H100s, DeepSeek was forced to build its models with Nvidiaâs less-powerful H800s.
One of the companyâs biggest breakthroughs is its development of a âmixed precisionâ framework, which uses a combination of full-precision 32-bit floating point numbers (FP32) and low-precision 8-bit numbers (FP8). The latter uses up less memory and is faster to process, but can also be less accurate.
Rather than relying only on one or the other, DeepSeek saves memory, time and money by using FP8 for most calculations, and switching to FP32 for a few key operations in which accuracy is paramount.Â
Some in the field have noted that the limited resources are perhaps what forced DeepSeek to innovate, paving a path that potentially proves AI developers could be doing more with less.
Performance
Despite its relatively modest means, DeepSeekâs scores on benchmarks keep pace with the latest cutting-edge models from top AI developers in the United States.
R1 is nearly neck and neck with OpenAIâs o1 model in the artificial analysis quality index, an independent AI analysis ranking. R1 is already beating a range of other models including Googleâs Gemini 2.0 Flash, Anthropicâs Claude 3.5 Sonnet, Metaâs Llama 3.3-70B and OpenAIâs GPT-4o.
One of its core features is its ability to explain its thinking through chain-of-thought reasoning, which is intended to break complex tasks into smaller steps. This method enables the model to backtrack and revise earlier steps â mimicking human thinking â while allowing users to also follow its rationale.
V3 was also performing on par with Claude 3.5 Sonnet upon its release last month. The model, which preceded R1, had outscored GPT-4o, Llama 3.3-70B and Alibabaâs Qwen2.5-72B, Chinaâs previous leading AI model.
Meanwhile, DeepSeek claims its newest Janus-Pro-7B surpassed OpenAIâs DALL-E and Stable Diffusionâs 3 Medium in multiple benchmarks.
Source link