Tag
2 articles
Build a lightweight embodied agent, inspired by vision-language-action (VLA) models, that learns to perceive, plan, predict, and replan directly from pixel observations in a grid-world environment.
This explainer explores MEM, a multi-scale memory system that extends the context window of Vision-Language-Action (VLA) models to 15 minutes, enabling robots to perform complex, multi-step tasks.