Pixels2Play V0.3: Text-conditioned multi-modal game player
A multimodal foundation model that plays 3D video games with and without text instructions
Announcing Our Multimodal 3D Gaming Agent
We’re excited to share our ICCV MMRAgI workshop paper: “Learning to play: A Multimodal Agent for 3D Game-Play.” [link to full-text paper]
This blog post breaks down our work on a novel multimodal agent that learns to play a variety of 3D first-person games. It takes both video (what it “sees”) and text (what it’s “told”) as input and outputs keyboard and mouse actions. The model is also efficient at inference: it runs at 20 FPS on a single RTX 5090 GPU, which lets it interact with games reliably in real time.
What’s New? From Cog2025 to Now
This work significantly extends our previous Cog2025 paper. The four biggest improvements are:
Massive data expansion: We collected more than 10x the labelled data and 20x the unlabelled data compared with the prior work, with a significant expansion in game categories as well.
Text-augmented data: We annotated all the labelled and unlabelled data with a VLM, so frames carry a text instruction whenever one is applicable (e.g., some frames are annotated with the instruction “approach the skull gate”); a minimal sketch of this annotation loop appears after this list.
Text conditioning: With the help of the new dataset, the agent can now follow natural-language instructions. This means you can tell it what to do during gameplay (e.g., “pick up the key”), and it will adjust its actions accordingly. To achieve this, we added a text token to the backbone transformer so the action prediction can attend to it (see the architecture sketch after this list).
Lightweight action decoder: We introduced a separate, lightweight action decoder in addition to the main backbone transformer. This small, decoder-only model is responsible for autoregressively generating the final actions. By offloading this task from the large backbone transformer, this design greatly reduces our model’s inference latency.
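To give a feel for the text-augmentation step, here is a minimal sketch of an annotation loop of this kind. It is an illustration only: the prompt wording, the window size, and the `vlm` callable are placeholders, not the exact setup behind our dataset.

```python
from typing import Callable, Optional, Sequence

# Example prompt; the phrasing is illustrative, not the one used for the dataset.
PROMPT = ("In one short imperative sentence, state what the player is trying to do "
          "in these frames (e.g. 'approach the skull gate'). "
          "Reply 'none' if no clear objective is visible.")

def annotate_clip(frames: Sequence, vlm: Callable[[Sequence, str], str],
                  window: int = 32) -> list[Optional[str]]:
    """Attach an instruction string to every frame by querying a VLM once per
    window of frames; frames with no applicable instruction stay unlabelled."""
    labels: list[Optional[str]] = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        caption = vlm(chunk, PROMPT).strip()
        instruction = None if caption.lower() == "none" else caption
        labels.extend([instruction] * len(chunk))
    return labels
```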
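To make the last two changes concrete, the sketch below shows the overall shape of the architecture in PyTorch: the instruction is embedded as a single text token that is concatenated with the frame tokens at the backbone input, and a small decoder-only head autoregressively emits discrete action tokens. Every dimension, layer count, and the action vocabulary here is a placeholder; this is an illustration, not our actual implementation.

```python
import torch
import torch.nn as nn

class TextConditionedPolicy(nn.Module):
    """Illustrative sketch only: all sizes, token counts, and the action
    vocabulary are placeholders, not the values used in the real model."""

    def __init__(self, d_model=512, vis_dim=768, txt_dim=512,
                 n_actions=256, n_action_steps=4):
        super().__init__()
        self.frame_proj = nn.Linear(vis_dim, d_model)   # per-frame visual features
        self.text_proj = nn.Linear(txt_dim, d_model)    # pooled instruction embedding
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=12)
        # Lightweight decoder-only action head: far fewer layers than the backbone,
        # so autoregressive action generation adds little latency.
        self.action_embed = nn.Embedding(n_actions + 1, d_model)  # +1 for BOS
        self.action_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        self.action_head = nn.Linear(d_model, n_actions)
        self.n_action_steps = n_action_steps
        self.bos_id = n_actions

    def forward(self, frame_feats, text_feat):
        # frame_feats: (B, T, vis_dim); text_feat: (B, txt_dim) pooled instruction.
        text_token = self.text_proj(text_feat).unsqueeze(1)        # (B, 1, d_model)
        ctx = torch.cat([self.frame_proj(frame_feats), text_token], dim=1)
        memory = self.backbone(ctx)   # action prediction can attend to the text token

        # Autoregressively decode a short sequence of discrete action tokens
        # (key presses, quantised mouse deltas, etc.) against the backbone output.
        tokens = torch.full((frame_feats.size(0), 1), self.bos_id,
                            dtype=torch.long, device=frame_feats.device)
        step_logits = []
        for _ in range(self.n_action_steps):
            h = self.action_decoder(self.action_embed(tokens), memory)
            logits = self.action_head(h[:, -1])
            step_logits.append(logits)
            tokens = torch.cat([tokens, logits.argmax(-1, keepdim=True)], dim=1)
        return torch.stack(step_logits, dim=1)  # (B, n_action_steps, n_actions)
```

Because the action head is only a couple of layers, the expensive backbone runs once per step while the cheap decoder handles the autoregressive loop, which is what keeps inference fast enough for real-time play.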
Our Three-Stage Training Recipe
We developed a three-stage training recipe so that we could leverage the unlabelled dataset (which is now 10x bigger than our labelled dataset and growing rapidly!).
The key was to train an Inverse Dynamics Model (IDM) on our labelled dataset. This IDM learns to watch video frames and predict the action the player took. We then used this trained IDM to automatically generate pseudo-labels for the massive unlabelled video dataset.
This resulted in our final training pipeline:
Stage 1: IDM Training. We train the IDM on our high-quality labelled data to teach it to predict actions from video.
Stage 2: Policy Pre-training. We pre-train the main policy agent on a mixture of our labelled data and the much larger unlabelled dataset (which now has pseudo-labels from the IDM).
Stage 3: Policy Fine-tuning. Finally, we fine-tune the policy exclusively on the original, high-quality labelled data to refine its performance and ensure accuracy.
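The sketch below illustrates Stage 1 and the pseudo-labelling step that feeds Stage 2. It is a toy, self-contained version: the real IDM is a video model rather than an MLP over pre-extracted frame features, and all sizes and the discrete action vocabulary are placeholders.

```python
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Toy IDM: predicts the discrete action taken between two consecutive frames
    from their (pre-extracted) visual features. Sizes are placeholders."""

    def __init__(self, feat_dim=768, n_actions=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.ReLU(), nn.Linear(512, n_actions))

    def forward(self, feats_t, feats_tp1):
        return self.net(torch.cat([feats_t, feats_tp1], dim=-1))


def train_idm(idm, labelled_loader, epochs=1, lr=1e-4):
    """Stage 1: fit the IDM on human-labelled (frame pair, action) examples."""
    opt = torch.optim.AdamW(idm.parameters(), lr=lr)
    for _ in range(epochs):
        for feats_t, feats_tp1, actions in labelled_loader:
            loss = nn.functional.cross_entropy(idm(feats_t, feats_tp1), actions)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return idm


@torch.no_grad()
def pseudo_label(idm, unlabelled_loader):
    """Run the trained IDM over the unlabelled videos to produce pseudo action
    labels, which are mixed with the real labels for Stage 2 pre-training."""
    idm.eval()
    pseudo = []
    for feats_t, feats_tp1 in unlabelled_loader:
        actions = idm(feats_t, feats_tp1).argmax(dim=-1)
        pseudo.append((feats_t, feats_tp1, actions))
    return pseudo
```

Stage 2 then behaviour-clones the policy on the real labels mixed with these pseudo-labels, and Stage 3 repeats the training on the high-quality labelled data alone.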
We’re excited about this new model and believe it’s a promising step toward agents that can learn from the vast amount of video content available online. Be sure to check out the full paper for all the details!
Watch our model play games below! 🎮
Non-text-conditioned game-playing
Overall, the new model significantly outperforms our previous version (see our earlier blog post) — it can handle more complex tasks, play with clearer intent, and even complete early levels of classic games like Doom and Quake with minimal human intervention. However, it still struggles with long-horizon gameplay where planning and memory are critical.
Below, we break down its performance game by game:
Hovercraft (internal test game):
The new model plays the Hovercraft game almost like a human player — a major improvement over the previous model!
Simple-FPS (internal test game):
In Simple-FPS, the model performs reasonably well. It shoots enemies with good accuracy and navigates effectively to new targets.
msdos/doom:
We’re excited to see the model complete the first level of Doom with only two brief human interventions (about one second each, at 1:33 and 3:19 to trigger switches). The model consistently aims well, eliminates enemies, and completes objectives with clear intent.
msdos/quake:
The model also completes the first level of Quake with two short interventions (at 1:18 and 2:03, around three seconds each). It occasionally gets stuck when the trajectory goes out of distribution or re-explores the same area. Still, this represents a significant improvement over the prior model — it completes more objectives and shows better spatial reasoning.
Roblox / be a shark
In simpler games like Be a Shark, the new model demonstrates precise aiming and purposeful movement, showing better control.
Roblox / natural disaster survival
The model successfully survives easier disaster scenarios, reacting appropriately when the threats are straightforward to avoid.
msdos/need for speed
The new model drives much more like a human player, maintaining control and finishing with a higher rank compared to the older version.
need for speed rivals
Similarly, the model shows improved driving behavior, though it still struggles — especially since this game was not included in training.
goat simulator 3
The model struggles with long-horizon, open-ended games like Goat Simulator 3 (and Left 4 Dead 2 below) when not guided by text instructions.
left 4 dead 2
Text-conditioned game-playing
We evaluate the model on Quake and Doom, as these games allow checkpoint saving — ensuring that each run starts from the exact same scenario for fair comparison.
In Doom, we begin from a scene where the model can either pick up a shotgun or explore other areas. We test two text instructions:
“Pick up the shotgun.”
“Proceed to the red cross gate.”
The model follows the first instruction consistently well, while performance on the second is slightly weaker. Nonetheless, both cases show that text conditioning improves behavior compared to runs without any instruction.
no text input:
pick up shotgun:
proceed to the red cross gate:
In Quake, we start from a scene where the model must press a wall switch to lower a bridge. Without any text prompt, the model typically fails to trigger the switch. However, with the instruction “move to the wall and press the red button,” it successfully presses the switch in some trials — as shown below:
no text input:
move to the wall and press the red button: