In the earliest stages of life on Earth, organisms were incredibly simple: microscopic, single-celled beings with little capacity for coordination. Yet billions of years of evolution through competition and natural selection shaped the complex life forms, and the human intelligence, we see today. Drawing inspiration from this natural process, the renowned AI non-profit OpenAI is exploring whether similar competition in virtual environments can yield more complex artificial intelligence.
The Power of Simulation
In a recently published paper, OpenAI shared its preliminary results. Over hundreds of millions of rounds of the simple game of hide and seek, two opposing teams of AI agents discovered complex strategies, including tool use and teamwork.
The results suggest that head-to-head competition drives both teams to improve far faster than a single agent learning on its own. According to the paper's co-authors, the same dynamic could be harnessed in other AI domains to boost learning efficiency.
The research also sheds light on OpenAI's dominant research strategy: scaling up existing AI techniques to see what new behaviors emerge. The hide-and-seek environment joins a growing list of open-source research environments from OpenAI, DeepMind, and DeepMind's sister company Google, including CoinRun, Neural MMO, Google Research Football, and OpenSpiel.
The Six Stages of Strategy Evolution
The experiment builds on two existing concepts in artificial intelligence: multi-agent learning and reinforcement learning. Multi-agent learning places multiple algorithms in competitive or cooperative scenarios to elicit emergent behaviors, while reinforcement learning is a machine-learning technique in which an agent learns to achieve a goal through repeated trial and error guided by rewards; AlphaGo is a prime example. A minimal sketch of that trial-and-error loop is shown below.
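To make the reinforcement-learning idea concrete, here is a minimal, self-contained sketch of learning from trial and error with rewards: a tabular Q-learning agent on a toy five-cell corridor. The environment, hyperparameters, and algorithm are illustrative assumptions only; OpenAI's agents are trained with large-scale deep policy-gradient methods, not this toy setup.

```python
import random

# Toy illustration of reinforcement learning by trial and error:
# an agent on a 1-D corridor of 5 cells learns, purely from reward,
# to walk right toward the goal cell.

N_STATES = 5          # cells 0..4, goal at cell 4
ACTIONS = [-1, +1]    # step left or step right
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration rate

for episode in range(500):
    state = 0
    while state != N_STATES - 1:
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        # temporal-difference update: adjust the estimate based on the trial's outcome
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state

# Print the learned policy for each non-goal cell
print({s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)})
```

After training, the learned policy chooses "move right" in every cell, simply because that behavior accumulated the most reward; no one ever told the agent which direction the goal was.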
The virtual environment designed by the researchers is a closed space containing objects such as boxes, ramps, and movable and fixed obstacles. The agents inside it are controlled by reinforcement-learning algorithms. In each match, they are divided into two teams: hiders (blue) and seekers (red). Hiders are rewarded for successfully evading the seekers and penalized when they are spotted, while seekers are rewarded for finding hiders and penalized when they fail; a simple sketch of this reward rule follows below.
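As a rough illustration of that reward scheme, the following sketch assigns zero-sum team rewards based on whether any hider is visible to any seeker. The Agent class, the `sees` visibility test, and all field names are hypothetical stand-ins for illustration, not code from OpenAI's environment.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Agent:
    team: str        # "hider" or "seeker"
    position: tuple  # (x, y), used only by the toy visibility test below

def any_hider_seen(hiders: List[Agent], seekers: List[Agent],
                   sees: Callable[[Agent, Agent], bool]) -> bool:
    """`sees(seeker, hider)` is a stand-in for the real line-of-sight test."""
    return any(sees(s, h) for s in seekers for h in hiders)

def team_rewards(agents: List[Agent], sees) -> dict:
    hiders = [a for a in agents if a.team == "hider"]
    seekers = [a for a in agents if a.team == "seeker"]
    caught = any_hider_seen(hiders, seekers, sees)
    hider_reward = -1.0 if caught else 1.0   # hiders are rewarded for evading
    return {
        "hider": hider_reward,
        "seeker": -hider_reward,             # zero-sum: seekers get the opposite
    }

if __name__ == "__main__":
    agents = [Agent("hider", (0, 0)), Agent("seeker", (5, 5))]
    # Toy visibility rule (an assumption): a seeker "sees" a hider within Manhattan distance 3
    sees = lambda s, h: abs(s.position[0] - h.position[0]) + abs(s.position[1] - h.position[1]) <= 3
    print(team_rewards(agents, sees))  # hider evades here, so {"hider": 1.0, "seeker": -1.0}
```

Because the reward is purely about being seen or not, every strategy described below, from fort building to ramp locking, emerges only as a means to that end.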
With no further instructions, the agents developed a range of strategies and tactics over hundreds of millions of rounds of play. In the initial stage, hiders and seekers developed simple evasion and pursuit behaviors, moving around the space without manipulating any objects.
After 25 million matches, however, the game became more complex. Hiders learned to move boxes and obstacles and lock them in place, building forts around themselves so that seekers could not see them. They even developed coordination strategies, such as passing objects to one another to speed up construction.
After 75 million matches, the seekers discovered a counter that pushed the game into its third stage: they learned to drag ramps up to the forts and use them to climb over the walls. Roughly 10 million rounds later, hiders learned to lock the ramps in place before building their forts.
Stage Three: Agents Learn to Use Ramps to Climb Over Walls (Source: OpenAI)
OpenAI's researchers initially believed the game had reached its final stage, but after 380 million matches two more strategies emerged. Seekers learned to use a locked ramp to climb on top of a box and then "surf" the box over to the hiders' fort. In the final stage, hiders responded by locking all of the ramps and boxes in place before building their forts.
Agents Learn to Surf (Source: OpenAI)
Scale played a significant role in these results. With the default settings, it takes more than 34 hours of training for agents to reach the ramp-defense stage, and larger batch sizes lead to shorter training times. Separately, increasing the number of model parameters (the values the model learns from its training data) from 500,000 to 5.8 million improved sample efficiency by a factor of 2.2.
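To make the sample-efficiency figure concrete, the short calculation below shows what a 2.2x gain implies: the larger model needs roughly 1/2.2 of the training episodes to reach the same stage of play. The parameter counts and the 2.2x factor come from the text above; everything else is simple arithmetic.

```python
# Sample-efficiency arithmetic for the scaling claim above.
# A 2.2x gain means the larger model reaches the same behavior
# with roughly 1 / 2.2 of the training episodes.

params_small, params_large = 500_000, 5_800_000
efficiency_gain = 2.2

print(f"Parameter increase: {params_large / params_small:.1f}x")            # ~11.6x more parameters
print(f"Episodes needed vs. the smaller model: {1 / efficiency_gain:.0%}")  # ~45% of the episodes
```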
Robustness Testing
To evaluate the robustness of the trained agents, the researchers designed five benchmark tests, grouped into two categories: cognition and memory. All of the tests use the same action space, observation space, and object types as the hide-and-seek environment:
Five Benchmark Intelligence Tests (Source: OpenAI)
The object counting test measures whether agents have a sense of object permanence, that is, an understanding that objects continue to exist even when they move out of sight.