Last week, Chinese lab DeepSeek released an updated version of its R1 reasoning AI model that performs well on a number of math and coding benchmarks. The company didn't reveal the source of the data it used to train the model, but some AI researchers speculate that at least a portion came from Google's Gemini family of AI.
Sam Paech, a Melbourne-based developer who creates "emotional intelligence" evaluations for AI, published what he claims is evidence that DeepSeek's latest model was trained on outputs from Gemini. DeepSeek's model, called R1-0528, prefers words and expressions similar to those Google's Gemini 2.5 Pro favors, said Paech in an X post.
That's not a smoking gun. But another developer, the pseudonymous creator of a "free speech eval" for AI called SpeechMap, noted that the DeepSeek model's traces (the "thoughts" the model generates as it works toward a conclusion) "read like Gemini traces."
DeepSeek has been accused of training on data from rival AI models before. In December, developers noticed that DeepSeek's V3 model often identified itself as ChatGPT, OpenAI's AI-powered chatbot platform, suggesting that it may have been trained on ChatGPT chat logs.
Earlier this year, OpenAI told the Financial Times it found evidence linking DeepSeek to the use of distillation, a technique for training AI models by extracting data from bigger, more capable ones. According to Bloomberg, Microsoft, a close OpenAI collaborator and investor, detected that large amounts of data were being exfiltrated through OpenAI developer accounts in late 2024, accounts OpenAI believes are affiliated with DeepSeek.
Distillation isn't an uncommon practice, but OpenAI's terms of service prohibit customers from using the company's model outputs to build competing AI.
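In rough sketch, distillation means training a smaller "student" model to mimic outputs harvested from a larger "teacher" model. The toy below is purely illustrative (the `teacher` function and its canned answers are hypothetical stand-ins; real distillation fine-tunes a neural network on millions of teacher-generated prompt/response pairs, not a lookup table), but it shows the basic loop of querying a teacher to build a synthetic training set:

```python
def teacher(prompt: str) -> str:
    # Stand-in for a large, capable API model (hypothetical).
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")

def distill(prompts):
    # Step 1: query the teacher to build a synthetic (prompt, response) dataset.
    dataset = [(p, teacher(p)) for p in prompts]
    # Step 2: "train" the student on those pairs -- here, trivially, by memorizing them.
    return dict(dataset)

student = distill(["2+2", "capital of France"])
print(student["2+2"])  # the student now reproduces the teacher's answer: 4
```

The key point is that the student never sees original human-labeled data, only the teacher's outputs, which is why terms of service like OpenAI's target exactly this harvesting step.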
To be clear, many models misidentify themselves and converge on the same words and turns of phrase. That's because the open web, which is where AI companies source the bulk of their training data, is becoming littered with AI slop. Content farms are using AI to create clickbait, and bots are flooding Reddit and X.
This "contamination," if you will, has made it quite difficult to thoroughly filter AI outputs from training datasets.
Still, AI experts like Nathan Lambert, a researcher at the nonprofit AI research institute AI2, don't think it's out of the question that DeepSeek trained on data from Google's Gemini.
"If I was DeepSeek, I would definitely create a ton of synthetic data from the best API model out there," Lambert wrote in a post on X. "[DeepSeek is] short on GPUs and flush with cash. It's literally effectively more compute for them."
Partly in an effort to prevent distillation, AI companies have been ramping up security measures.
In April, OpenAI began requiring organizations to complete an ID verification process in order to access certain advanced models. The process requires a government-issued ID from one of the countries supported by OpenAI's API; China isn't on the list.
Elsewhere, Google recently began "summarizing" the traces generated by models available through its AI Studio developer platform, a step that makes it more difficult to train performant rival models on Gemini traces. In May, Anthropic said it would begin to summarize its own models' traces, citing a need to protect its "competitive advantages."
We've reached out to Google for comment and will update this piece if we hear back.