Meta has rejected claims that benchmark test sets were used in the training of its Llama 4 models. "We've heard claims that we used test sets in the training process," Ahmad Al-Dahle, Meta's VP of generative AI, said in a post. "This claim is completely false, and we would never do so."
He added that the models were released as soon as they were ready and that it may take several days for all public implementations to become fully stable. Meta also attributed the uneven performance of the models to stability issues in those implementations, not to a flaw in the training process.
New Llama 4 models
Meta recently released two new models in the Llama 4 family, called Scout and Maverick. The Maverick model quickly climbed to second place on LMArena, a platform that ranks AI models. On this platform, users vote for the best responses by comparing models' outputs directly.
In its press statement, Meta pointed to Maverick's Elo score of 1417, which placed it above OpenAI's GPT-4o and slightly below Gemini 2.5 Pro.
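Arena-style rankings like this are derived from pairwise user votes. As a rough illustration only (LMArena's actual leaderboard is computed differently, with a statistical model fit over all battles at once), a minimal online Elo update from a single head-to-head vote can be sketched as:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32):
    """Return both models' updated ratings after one user vote."""
    e_a = expected_score(r_a, r_b)          # A's expected win probability
    s_a = 1.0 if a_wins else 0.0            # actual outcome of the vote
    delta = k * (s_a - e_a)                 # zero-sum rating transfer
    return r_a + delta, r_b - delta

# Illustrative numbers only: a 1417-rated model winning one vote
# against a hypothetical 1400-rated opponent.
new_a, new_b = elo_update(1417, 1400, a_wins=True)
```

Because the update is zero-sum, total rating is conserved across each vote; a higher-rated model gains less from beating a lower-rated one than vice versa.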
Experimental model and transparency of results
The version of Maverick evaluated on LMArena is not exactly the same as the version Meta released publicly. In a blog post, Meta said it had used a custom experimental version designed to improve conversational capabilities.
The Chatbot Arena platform, run by lmarena.ai (previously lmsys.org), responded to community concerns by releasing more than 2,000 head-to-head comparison results for public review. These results include user prompts, model responses, and user preferences. The organization said it released the data to ensure full transparency, and it has also updated its leaderboard policies to make future evaluations fairer and more reproducible. It announced that the Hugging Face version of Llama-4-Maverick will be added to the Arena soon, with ranking results to follow.
Rumors around Llama 4
The Llama 4 models became controversial when a viral Reddit post, citing a Chinese report, claimed that a Meta employee had faced internal pressure to mix test sets into the post-training process. According to the report, company leaders had suggested blending various benchmark test sets into post-training in order to hit performance targets across different benchmarks.
The post also claimed that the employee had resigned and asked to be removed from the technical report. Meta sources, however, said the person had not left the company and that the Chinese report was fake.
Differences in evaluation results
Still, some AI researchers have pointed to discrepancies between the results Meta reported and the results they observed. One user on X wrote:
"The Llama 4 on LMSYS is quite different from the other Llama 4 versions, even if you use the suggested system prompt. I tried several different prompts myself."
Susan Zhang, a senior research engineer at Google DeepMind, wrote: "A four-dimensional chess move: use an experimental Llama 4 version to game LMSYS, disclose inaccurate preferences, and ultimately discredit the entire ranking system."
Pressure to release Llama 4
Questions were also raised about Llama 4's weekend release, since large technology companies usually publish important releases on business days. Meta is also said to have been under pressure to ship Llama 4 before DeepSeek releases its next reasoning model, reportedly named R2.
Meanwhile, Meta has announced that it will release its own reasoning model soon. Before Llama 4 shipped, it was reported that Meta had delayed the release date at least twice because the model did not perform as expected on technical benchmarks, especially on reasoning and math tasks. There were also concerns that Llama 4 was less capable than OpenAI's models at human-like conversation.
Meta continues to defend its models and is trying to ease community concerns by providing clarification and further information.
