Cortex: A Bidirectionally Aligned Embodied Agent Framework for Long-horizon Manipulation

Cortex aligns a high-level VLM cognitive orchestrator with a low-level VLA executor through executable subtasks, compact memory, and continuous progress verification.

Jiaqi Peng*1,2 Xiqian Yu*2 Delin Feng*2 Yuqiang Yang2 Wenzhe Cai2 Jing Xiong2,3 Ganlin Yang2,4 Jinliang Zheng1,2 Jiafei Cao2 Xueyuan Wei2 Jiangmiao Pang2 Yuan Shen†1 Tai Wang†2

1Tsinghua University  ·  2Shanghai AI Laboratory  ·  3Peking University  ·  4USTC

*Equal contribution    †Corresponding author

Cortex framework teaser comparing monolithic VLA, previous dual systems, and Cortex on the beaker washing task.
Cortex closes the loop between long-horizon planning and short-horizon execution by continuously streaming verified subtasks and memory to the VLA executor.
Beaker washing. Cortex tracks bottle state, cap state, beaker placement, and water-transfer progress across a long manipulation chain.
MOF chemistry workflow. The agent preserves task order and advances only after visual evidence supports each subtask transition.
4k+ h annotated long-horizon videos
30 h simulation data with structural priors
32 canonical skill primitives
95.5% LIBERO-Long zero-shot success
86.8% RoboTwin 2.0 overall success
65% real-world chemical task success

Video

Cortex in action

Cortex coordinates long-horizon manipulation through subtask planning, compact memory, grounded execution, and online progress verification.

Closed-loop long-horizon manipulation. Cortex keeps task progress explicit, routes executable subtasks, and verifies transitions before moving forward.

Method

A dual-system interface that both planners and executors can trust.

Cortex treats the subtask-memory pair as the contract between System-2 planning and System-1 execution. The planner is constrained to executable skills, and the executor receives local, physically grounded commands instead of a brittle global task description.

Cortex architecture with instruction, observation, memory, VLM orchestration, and VLA execution.
Bidirectionally aligned subtask interface. The VLM updates memory and streams subtasks; the VLA consumes the active subtask as a grounded local objective.
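As a rough illustration of the subtask-memory contract described above, the following minimal sketch models the pair as plain data and rejects any subtask whose skill falls outside a canonical set. All names here (`CANONICAL_SKILLS`, `Subtask`, `Memory`, `PlannerOutput`, `to_executor_prompt`) and the six example skills are hypothetical; the page does not list the 32 primitives or specify a schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical stand-in for the 32 canonical primitives (the real names are not given).
CANONICAL_SKILLS = frozenset({"pick", "place", "pour", "open", "close", "stir"})


@dataclass(frozen=True)
class Subtask:
    skill: str                         # must be one of the canonical primitives
    target: str                        # object the skill acts on
    destination: Optional[str] = None  # optional goal location or container

    def __post_init__(self):
        # Constrain the planner to executable skills only.
        if self.skill not in CANONICAL_SKILLS:
            raise ValueError(f"non-executable skill: {self.skill!r}")


@dataclass
class Memory:
    completed: List[str] = field(default_factory=list)        # finished subtasks, in order
    object_state: Dict[str, str] = field(default_factory=dict)  # e.g. {"cap": "off"}


@dataclass
class PlannerOutput:
    subtask: Subtask
    memory: Memory


def to_executor_prompt(out: PlannerOutput) -> str:
    """Render the active subtask as a short local command for the executor."""
    s = out.subtask
    text = f"{s.skill} {s.target}"
    if s.destination:
        text += f" -> {s.destination}"
    return text


out = PlannerOutput(
    Subtask("pour", "bottle", "beaker"),
    Memory(completed=["open bottle"], object_state={"cap": "off"}),
)
print(to_executor_prompt(out))
```

In this sketch the executor sees only the local command, while the memory stays on the planner side; whether Cortex also passes memory to the executor in this form is not something the sketch settles.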
1. Executable Skill Space

Free-form instructions are standardized into 32 canonical primitives, reducing kinematic hallucinations and making planner outputs routable by the VLA harness.

2. Tractable Metadata

Subtasks include object attributes, spatial relations, counts, and reachability priors so the high-level plan matches what the robot can physically execute.

3. Event-balanced Training

Training balances ongoing execution frames and boundary transition frames, teaching the planner when to hold a command and when to update memory.

4. Asynchronous Loop

System-2 runs at a slower reasoning rate while System-1 executes continuously, with harness logic for command mapping, holding, and timeout recovery.
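The asynchronous loop with holding and timeout recovery can be sketched as a small tick-based simulation. This is only an illustration under stated assumptions: the `Harness` class, the `run` helper, the `replan_every` and `timeout_ticks` parameters, and the choice to recover by re-issuing the same subtask are all hypothetical, and a fixed plan list stands in for live planner output.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Harness:
    plan: List[str]            # ordered subtasks (stand-in for planner output)
    replan_every: int = 5      # System-1 ticks per System-2 update (assumed value)
    timeout_ticks: int = 20    # ticks before a stuck subtask triggers recovery (assumed value)
    index: int = 0
    ticks_on_subtask: int = 0
    recoveries: int = 0

    def active(self) -> Optional[str]:
        return self.plan[self.index] if self.index < len(self.plan) else None

    def tick(self, t: int, verified_done: bool) -> Optional[str]:
        """Advance one System-1 control tick and return the command to execute."""
        if self.active() is None:
            return None
        self.ticks_on_subtask += 1
        # System-2 only weighs in every `replan_every` ticks; otherwise the command is held.
        if t % self.replan_every == 0:
            if verified_done:
                self.index += 1
                self.ticks_on_subtask = 0
            elif self.ticks_on_subtask >= self.timeout_ticks:
                # Timeout recovery: re-issue the same subtask with a fresh tick budget.
                self.recoveries += 1
                self.ticks_on_subtask = 0
        return self.active()


def run(harness: Harness, done_at: Dict[str, int], max_ticks: int = 100) -> List[Tuple[int, Optional[str]]]:
    """Drive the harness; done_at[subtask] is the tick from which completion is verifiable."""
    log = []
    for t in range(1, max_ticks + 1):
        current = harness.active()
        if current is None:
            break
        log.append((t, harness.tick(t, verified_done=t >= done_at[current])))
    return log


harness = Harness(plan=["grasp stopper", "insert stopper"], replan_every=5, timeout_ticks=10)
log = run(harness, done_at={"grasp stopper": 12, "insert stopper": 18})
# The grasp times out once at tick 10, then is verified at the next System-2 update (tick 15).
print(harness.recoveries, log[13], log[14])
```

The point of the sketch is that the executor keeps receiving the held command on every tick, while transitions and recovery decisions happen only at the slower planner rate.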

Data

Scalable metadata construction for long-horizon manipulation.

Cortex builds a standardized interface from public real-world data, public simulation data, self-collected robot demonstrations, and procedural generation. The pipeline annotates subtask sequences, aligns boundaries, and injects execution priors.

Cortex dataset construction with public real-world data, simulation data, self-collected data, procedural data, annotation pipeline, and interface properties.
Metadata and interface standardization. The dataset pipeline aligns raw trajectories to executable subtasks and adds grounding signals for both executability and tractability.
Automatic annotation pipeline for subtask segmentation and temporal alignment.
Automatic subtask annotation and temporal alignment over long-horizon demonstrations.
Event-balanced sampling improves average total score while using fewer training samples.
Event-balanced sampling improves planning score while reducing sample count near redundant intra-task frames.
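One way to picture event-balanced sampling is to keep every frame near a subtask boundary and subsample the redundant frames in between. The sketch below assumes a 1:1 ratio of ongoing to boundary frames and a fixed window around each boundary; the function name `event_balanced_sample` and both parameters are hypothetical, since the page does not state the actual ratio or window.

```python
import random


def event_balanced_sample(boundaries, num_frames, ongoing_per_boundary=1, window=2, seed=0):
    """Return (boundary_frames, sampled_ongoing_frames) as sorted lists of frame indices.

    boundaries: frame indices where one subtask ends and the next begins.
    window: frames on each side of a boundary counted as transition frames (assumed).
    ongoing_per_boundary: ongoing frames kept per transition frame (assumed 1:1).
    """
    rng = random.Random(seed)
    boundary_frames = set()
    for b in boundaries:
        boundary_frames.update(range(max(0, b - window), min(num_frames, b + window + 1)))
    # Everything else is an "ongoing execution" frame, which tends to be redundant.
    ongoing = [i for i in range(num_frames) if i not in boundary_frames]
    k = min(len(ongoing), ongoing_per_boundary * len(boundary_frames))
    kept_ongoing = rng.sample(ongoing, k)
    return sorted(boundary_frames), sorted(kept_ongoing)


transition, ongoing = event_balanced_sample(boundaries=[100, 250], num_frames=400)
print(len(transition), len(ongoing))  # 10 transition frames and 10 ongoing frames out of 400
```

Under these assumptions a 400-frame episode contributes 20 training frames instead of 400, which is the kind of sample reduction the caption describes.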

Results

State-of-the-art long-horizon autonomy across planning and control.

Cortex improves both open-loop System-2 planning quality and closed-loop task execution, with the strongest gains on tasks that require memory, subtask transitions, and physical grounding.

Step-level avg. 8.32 out of 10
Episode-level avg. 7.81 closed-loop planning
LIBERO-Long 95.5% zero-shot success
RoboTwin 2.0 86.8% overall success

System-2 Planning

Structured subtasks keep long-horizon reasoning grounded.

Average total score across spatial, long-horizon, and counting evaluations.

Model                             Step   Episode
Cortex (subtask interface)        8.32   7.81
GPT-5 (foundation VLM)            6.27   7.23
Gemini 3.1 Pro (foundation VLM)   6.92   6.86
Qwen3-VL-8B (foundation VLM)      6.74   6.29

Counting 8.74  ·  Long-horizon 8.16  ·  Spatial 8.05
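The three category scores above appear consistent with the reported step-level average of 8.32 if one assumes an unweighted mean over the categories. That aggregation rule is an assumption, not something stated here; the check below only shows that the numbers agree under it.

```python
# Category scores as reported; the unweighted mean is an assumed aggregation rule.
category_scores = {"Counting": 8.74, "Long-horizon": 8.16, "Spatial": 8.05}
mean_score = sum(category_scores.values()) / len(category_scores)
print(f"{mean_score:.2f}")  # prints 8.32
```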

Closed-loop Simulation

Progress verification improves full-task success.

Cortex preserves high success as task horizon grows, especially on long-horizon splits.

LIBERO-Long zero-shot success rate (%)

Method           Success rate
Cortex           95.5
OpenVLA-OFT      94.5
MemoryVLA        93.4
π0.5             92.4
Gemini 3.1 Pro   91.0

RoboTwin 2.0 success rate (%): short / long / overall

Method   Short   Long   Overall
Cortex   86.0    88.0   86.8
π0.5     82.6    83.0   82.7
X-VLA    77.1    66.3   72.8
π0       61.5    72.6   65.9

Memory-heavy Manipulation

Cortex keeps object order, counts, and prior state across task memory.

RMBench success rates over seven manipulation tasks, each evaluated with 100 rollouts.

7-task average success rate (%)

Method   Success rate
Cortex   61.9
Mem-0    41.7
π0.5     12.6
X-VLA    12.1
ACT      7.6
DP       6.0

Per-task Cortex success rate

Task                  Cortex   Comparison
Observe and Pick Up   14%      +10 vs Mem-0
Rearrange Blocks      100%     +11 vs Mem-0
Put Back Block        100%     +10 vs Mem-0
Swap Blocks           99%      +32 vs Mem-0
Swap T                63%      +49 vs Mem-0
Battery Try           37%      +9 vs Mem-0
Press Button          20%      only non-zero

Real World

Zero-shot deployment on complex physical workflows.

Cortex transfers to an ARX ACONE dual-arm setup and enables long-horizon chemistry and washing workflows by combining a generalist VLM planner with a short-horizon subtask-conditioned VLA executor.

Chemical liquid stirring real-world rollout with fourteen subtasks.
Chemical liquid stirring. Cortex preserves procedure order over fourteen stages and switches only after visual evidence supports completion.
Beaker washing real-world rollout with subtask prediction and execution process.
Beaker washing requires remembering bottle state, cap state, beaker location, and washing progress.
Local execution failure where Cortex keeps retrying the stopper grasp until success.
When a stopper grasp fails, Cortex holds the current subtask and memory until the grasp is verified.

20-trial Real-world Average

Method            Chemical Task        Washing Task
π0.5              0% SR, 2.5 / 14      0% SR, 3.7 / 14
πmem              0% SR, 4.1 / 14      0% SR, 6.5 / 14
Cortex            65% SR, 11.0 / 14    55% SR, 10.5 / 14
Human + πmemsub   75% SR, 12.2 / 14    70% SR, 11.6 / 14

Citation

BibTeX

@misc{peng2026cortex,
  title={Cortex: A Bidirectionally Aligned Embodied Agent Framework for Long-horizon Manipulation},
  author={Jiaqi Peng and Xiqian Yu and Delin Feng and Yuqiang Yang and Wenzhe Cai and Jing Xiong and Ganlin Yang and Jinliang Zheng and Jiafei Cao and Xueyuan Wei and Jiangmiao Pang and Yuan Shen and Tai Wang},
  year={2026},
  url={https://steinate.github.io/cortex.github.io}
}