Dual-Process Atomic Skill Learning for Long-Horizon Manipulation Tasks

1 University of Electronic Science and Technology of China
2 Huazhong University of Science and Technology
3 Southern University of Science and Technology
4 South China University of Technology
5 EbTech Co. Ltd.

* Indicates equal contribution. Corresponding authors.
Under review

Abstract

Language-conditioned Imitation Learning (IL) is essential for enabling robots to perform complex tasks following natural language instructions. However, executing long-horizon tasks sequentially remains a significant challenge. While hierarchical approaches attempt to address this by decomposing tasks into atomic skills, existing methods often suffer from training instability and codebook collapse due to the tight coupling between high-level skill reasoning and low-level action generation. Inspired by the Dual-Process Theory of cognition, we propose DASL, a novel asynchronous hierarchical imitation learning framework that effectively decouples slow, semantic reasoning from fast, real-time motion control. DASL comprises a Slow-Frequency Policy that predicts interpretable, discrete skills via Vector Quantization, and a High-Frequency Policy that leverages a diffusion model and Decision Transformer to generate precise actions conditioned on these latent skills. By asynchronously coordinating these modules, our framework mitigates the interference common in synchronous co-training without relying on complex auxiliary regularization. Extensive evaluations on robotic manipulation and grid-world navigation benchmarks demonstrate that DASL significantly outperforms state-of-the-art baselines, particularly in skill acquisition and compositional generalization to unseen instructions. We will share our source code on GitHub.
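The asynchronous coordination described in the abstract can be sketched as a two-timescale control loop: the slow policy re-plans a discrete skill only every few steps, while the fast policy acts at every step conditioned on the current skill. This is an illustrative sketch, not the paper's implementation; all names (`run_episode`, `CountEnv`, `skill_every`) are hypothetical.

```python
def run_episode(slow_policy, fast_policy, env, instruction, horizon=100, skill_every=10):
    """Two-timescale loop: the slow policy re-plans a discrete skill every
    `skill_every` steps; the fast policy produces an action at every step."""
    obs = env.reset()
    skill = None
    for t in range(horizon):
        if t % skill_every == 0:                  # slow, sparse skill reasoning
            skill = slow_policy(instruction, obs)
        obs = env.step(fast_policy(obs, skill))   # fast, skill-conditioned control
    return obs

class CountEnv:
    """Toy environment whose state is simply the step count."""
    def reset(self):
        self.t = 0
        return self.t
    def step(self, action):
        self.t += 1
        return self.t

# toy run: the slow policy fires 10 times over a 100-step episode
calls = []
slow = lambda instr, obs: calls.append(obs) or 0   # always picks skill id 0
fast = lambda obs, skill: skill                    # trivial skill-conditioned action
run_episode(slow, fast, CountEnv(), "open drawer", horizon=100, skill_every=10)
```

The key point of the sketch is that high-level reasoning cost scales with `horizon / skill_every` rather than `horizon`, which is what makes real-time low-level control feasible.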

Framework of DASL

DASL framework overview

Overview of DASL, an asynchronous hierarchical imitation learning framework. The high-level policy operates at a slow timescale to generate discrete semantic skills from language instructions and sparse observations, while the low-level policy executes skill-conditioned actions at a fast timescale. A latent diffusion module regularizes the latent trajectory space during training and is removed at inference for efficient real-time control.
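The Vector Quantization step in the slow-frequency policy amounts to a nearest-neighbor lookup into a learned codebook of skill embeddings. A minimal numpy sketch, with illustrative names and toy values (not the paper's code or actual codebook size):

```python
import numpy as np

def quantize_skill(z, codebook):
    """Map a continuous skill embedding z to its nearest codebook entry.

    z: (d,) latent produced by the slow-frequency policy encoder.
    codebook: (K, d) learnable skill embeddings.
    Returns the discrete skill index and the quantized vector.
    """
    dists = np.linalg.norm(codebook - z, axis=1)  # distance to each of the K skills
    k = int(np.argmin(dists))                     # discrete, interpretable skill index
    return k, codebook[k]

# toy example: 4 skills in a 2-D latent space
codebook = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
k, z_q = quantize_skill(np.array([0.9, 0.1]), codebook)
```

During training, a straight-through gradient estimator is the standard way to pass gradients through the non-differentiable `argmin`; the quantized vector `z_q` is what conditions the high-frequency policy.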

Results

Performance on LOReL Sawyer

Table.1 Rephrasal-wise success rates (%) on LOReL Sawyer
Fig.1 LOReL State (panels a–e): open drawer and move black mug right
Fig.2 LOReL Image (panels a–e): turn faucet right and close drawer

Performance on Franka Kitchen

Table.2 N-rates of different methods on seen and unseen instructions in Kitchen with state and image observations.
Fig.3 K-rates on seen and unseen tasks in Kitchen (image)
Fig.4 Kitchen State (panels a–g): activate bottom burner and activate top burner and turn on light switch and open sliding cabinet
Fig.5 Kitchen Image (panels a–g): activate bottom burner and activate top burner and turn on light switch and open sliding cabinet

Performance on BabyAI

Fig.6 Success rates (%) on BabyAI GoToSeq task with varying numbers of demonstrations.

Interpretability Analysis

Latent distribution visualization
Fig.7 Visualization of Latent Skill Distributions
Word clouds comparison
Fig.8 Word cloud of skills learned in LOReL Sawyer (state) compositional task
Correlation matrix
Fig.9 Skill heatmap visualization for LOReL on Sawyer (state) compositional tasks
Option frequency matrix
Fig.10 Skill heatmap visualization for LOReL on Sawyer (state) compositional tasks (normalized row-wise)
Word frequency matrix
Fig.11 Skill heatmap visualization for LOReL on Sawyer (state) compositional tasks (normalized column-wise)
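Figs. 10 and 11 differ only in how the same skill-by-word co-occurrence matrix is normalized: row-wise normalization reads as "which words does each skill attend to," column-wise as "which skills does each word activate." A minimal numpy sketch with hypothetical counts:

```python
import numpy as np

# hypothetical skill-by-word count matrix (rows: skills, cols: instruction words)
counts = np.array([[8.0, 2.0],
                   [1.0, 9.0]])

# row-wise (Fig.10-style): each row sums to 1, i.e. P(word | skill)
row_norm = counts / counts.sum(axis=1, keepdims=True)

# column-wise (Fig.11-style): each column sums to 1, i.e. P(skill | word)
col_norm = counts / counts.sum(axis=0, keepdims=True)
```

The two views are complementary: a word can dominate a skill's row while that skill contributes little to the word's column, so showing both normalizations avoids misreading the raw frequencies.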

Bibtex

@inproceedings{Chen2026DASL,
  title     = {Dual-Process Atomic Skill Learning for Long-Horizon Manipulation Tasks},
  author    = {Jun Chen and Erdemt Bao and Wenlong Dong and Jierui Liu and Hao Wan and Shaopeng Li and Weijun Qin and Jing Liang and Huiping Zhuang},
  booktitle = {},
  year      = {2026}
}