In a new study, researchers found that current artificial intelligence models are still weaker than humans at describing and interpreting social interactions in a moving scene. This skill is essential for cars, robots, and other technologies that rely on artificial intelligence systems to interact in the real world.
Researchers at Johns Hopkins University say artificial intelligence systems fail to understand social interactions and dynamics, the groundwork for interacting with people. According to them, this problem may be rooted in the underlying architecture of artificial intelligence systems.
Poor performance of artificial intelligence in understanding social interactions
Leyla Isik, lead author of the study and assistant professor of cognitive science at Johns Hopkins University, says:
“For example, the artificial intelligence in a car must identify the intentions, goals, and actions of drivers and pedestrians. It needs to know whether a pedestrian is about to cross the street, or whether two people are talking. In fact, whenever artificial intelligence is to interact with humans, it must recognize what people are doing. I think these systems are unable to recognize it.”

The researchers say it is not enough to see an image and identify objects and faces. That was the first step, and it made early artificial intelligence models enormously successful, but real life is not static, and artificial intelligence must be able to understand what is happening in a scene. Ultimately, this technology must understand the relationships, contexts, and dynamics of social interactions. In short, the researchers point to a blind spot in the development of artificial intelligence models.
To compare the performance of artificial intelligence models with humans, the researchers asked human participants to watch three-second video clips. These clips showed people interacting with each other, performing activities side by side, or acting alone, and participants had to evaluate these interactions. The researchers then asked about 350 language, video, and image artificial intelligence models to comment on the clips. For the large language models, the researchers asked the artificial intelligence to evaluate short, human-written descriptions of the videos.
In most cases, the participants agreed on the content of the videos, but the artificial intelligence models, regardless of their size or the data they were trained on, showed no such agreement. Video models could not describe what people were doing in the videos, and image models failed to predict whether people were communicating. Language models, however, were better at predicting human behavior.
The researchers believe this happens because artificial intelligence neural networks are inspired by the part of the brain responsible for processing static images, which is different from the area of the brain that processes dynamic social scenes.
The findings of this study were presented at the International Conference on Learning Representations.