Learning Native Continuation for Action Chunking Flow Policies

Yufeng Liu1,2, Hang Yu2,4, Juntu Zhao1,2, Bocheng Li2,5, Di Zhang2,4, Mingzhu Li2, Wenxuan Wu2,
Yingdong Hu3, Junyuan Xie2, Junliang Guo2‡, Dequan Wang1†, Yang Gao2,3†
1Shanghai Jiao Tong University   2Spirit AI   3Tsinghua University   4Tongji University   5University of Science and Technology of China
† Corresponding authors   ‡ Project leader   Work done during internship at Spirit AI.

Abstract

Action chunking enables Vision Language Action (VLA) models to run in real time, but naive chunked execution often exhibits discontinuities at chunk boundaries. Real-Time Chunking (RTC) alleviates this issue but is external to the policy, leading to spurious multimodal switching and trajectories that are not intrinsically smooth. We propose Legato, a training-time continuation method for action-chunked flow-based VLA policies. Specifically, Legato initializes denoising from a schedule-shaped mixture of known actions and noise, exposing the model to partial action information. Moreover, Legato reshapes the learned flow dynamics to ensure that the denoising process remains consistent between training and inference under per-step guidance. Legato further uses randomized schedule conditioning during training to support varying inference delays and achieve controllable smoothness. Empirically, Legato produces smoother trajectories and reduces spurious multimodal switching during execution, leading to less hesitation and shorter task completion time. Extensive real-world experiments show that Legato consistently outperforms RTC across five manipulation tasks, achieving approximately 10% improvements in both trajectory smoothness and task completion time.


Legato reduces task completion time while improving trajectory smoothness compared to RTC. Across five real-world manipulation tasks, Legato consistently achieves shorter execution time and lower NSPARC (indicating smoother trajectories). The bottom plot shows an example execution trace on the pour task, where Legato produces smoother action trajectories with fewer hesitation-induced slowdowns than RTC.

Overview Video

Method

Standard action chunking policies generate a fixed-length action sequence (chunk) at each decision step. Due to inference delay and the intrinsic multimodality of flow-based policies, transitions between consecutive chunks are often not smooth, leading to visible discontinuities during execution. Real-Time Chunking (RTC) alleviates this through inference-time inpainting, but its continuation mechanism is applied only at inference time and is not learned as part of the policy, leaving it prone to spurious multimodal switching.

Action-Noise Mixture

Legato introduces a horizon-wise continuation vector $\boldsymbol{\omega} \in [0,1]^H$ encoding the guidance schedule. Using $\boldsymbol{\omega}$, we define an action-noise mixture as the effective noise initialization, where $\mathbf{A}$ denotes the known reference actions (the overlapping portion of the previous chunk) and $\boldsymbol{\epsilon}$ is standard Gaussian noise:

$$\boldsymbol{\epsilon}_{\mathrm{eff}} = (\mathbf{1}-\boldsymbol{\omega}) \odot \boldsymbol{\epsilon} + \boldsymbol{\omega} \odot \mathbf{A}$$

The interpolation path and corresponding flow-matching velocity are:

$$\mathbf{Y}_t = (1-t)\,\boldsymbol{\epsilon}_{\mathrm{eff}} + t\,\mathbf{A}, \qquad \mathbf{u}^{\mathrm{FM}}(\mathbf{Y}_t,t) = (\mathbf{1}-\boldsymbol{\omega}) \odot (\mathbf{A}-\boldsymbol{\epsilon})$$
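As a concrete illustration, the sketch below builds the action-noise mixture and evaluates the interpolation path and flow-matching velocity at a single time $t$. The tensor shapes, the example linear schedule, and the names (`H`, `action_dim`, `A`, `eps`, `omega`) are illustrative assumptions, not the released implementation.

```python
import torch

H, action_dim = 16, 7                                # chunk horizon and action dimension (assumed)
A = torch.randn(H, action_dim)                       # known reference actions (e.g., overlap with the previous chunk)
eps = torch.randn(H, action_dim)                     # standard Gaussian noise
omega = torch.linspace(1.0, 0.0, H).unsqueeze(-1)    # example horizon-wise schedule in [0, 1], shape (H, 1)

# Schedule-shaped action-noise mixture used as the effective noise initialization.
eps_eff = (1.0 - omega) * eps + omega * A

# Linear interpolation path at time t and the corresponding flow-matching velocity.
t = 0.3
Y_t = (1.0 - t) * eps_eff + t * A
u_fm = (1.0 - omega) * (A - eps)                     # elementwise equal to A - eps_eff
```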

Per-Step Guided Dynamics

At each denoising step, the current noisy action is guided toward the reference action, then updated:

$$\mathbf{Y}_k = (\mathbf{1}-\boldsymbol{\omega}) \odot \mathbf{X}_k + \boldsymbol{\omega} \odot \mathbf{A}, \qquad \mathbf{X}_{k+1} = \mathbf{Y}_k + \Delta t\, f_\theta(\mathbf{Y}_k,t_k)$$

Taking the continuous-time limit yields the guided ODE:

$$\dot{\mathbf{Y}}(t) = (\mathbf{1}-\boldsymbol{\omega}) \odot f_\theta(\mathbf{Y}(t),t) - \boldsymbol{\kappa} \odot (\mathbf{Y}(t)-\mathbf{A}), \quad \boldsymbol{\kappa} = \boldsymbol{\omega}/\Delta t$$
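A minimal sketch of the per-step guided Euler update at inference time is given below; `f_theta` stands in for the learned velocity network and `n_steps` is an assumed number of denoising steps.

```python
import torch

def guided_denoise(f_theta, A, omega, eps, n_steps=10):
    """Euler integration of the guided dynamics: guide toward A, then apply the learned velocity."""
    dt = 1.0 / n_steps
    X = eps.clone()                          # start from noise; the first guidance step reproduces eps_eff
    for k in range(n_steps):
        t_k = k * dt
        Y = (1.0 - omega) * X + omega * A    # per-step guidance toward the reference actions
        X = Y + dt * f_theta(Y, t_k)         # Euler step with the learned velocity field
    return X                                 # denoised action chunk
```

With $\boldsymbol{\omega} = \mathbf{0}$ the loop reduces to a standard Euler flow-matching sampler, while larger schedule values pull the corresponding horizon steps toward the reference actions at every iteration. For a quick dry run, `f_theta` can be any callable with the same signature, e.g. `lambda Y, t: torch.zeros_like(Y)`.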

Legato Velocity Field

To ensure training-inference consistency, we require the guided dynamics $\dot{\mathbf{Y}}(t)$ to match the flow-matching velocity $\mathbf{u}^{\mathrm{FM}}$. Solving for $f_\theta$ gives the Legato velocity field:

$$f_\theta(\mathbf{Y},t) = (\mathbf{1}-\boldsymbol{\omega})^{-1} \odot \bigl[\mathbf{u}^{\mathrm{FM}}(\mathbf{Y},t) + \boldsymbol{\kappa} \odot (\mathbf{Y}-\mathbf{A})\bigr]$$

Substituting the path and velocity yields the closed-form training target:

$$\mathbf{v}_{\text{target}}(t,\mathbf{A},\boldsymbol{\epsilon},\boldsymbol{\omega}) = \bigl(\mathbf{1} - (1-t)\,\boldsymbol{\kappa}\bigr) \odot (\mathbf{A}-\boldsymbol{\epsilon})$$
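As a quick sanity check on this substitution, the sketch below evaluates the general velocity field from the previous equation on the interpolation path and confirms numerically that it matches the closed form; the shapes and the value of $\Delta t$ are illustrative assumptions.

```python
import torch

H, action_dim, dt, t = 16, 7, 0.1, 0.3
A, eps = torch.randn(H, action_dim), torch.randn(H, action_dim)
omega = torch.rand(H, 1) * 0.9                  # keep omega < 1 so (1 - omega) is invertible
kappa = omega / dt

eps_eff = (1.0 - omega) * eps + omega * A       # action-noise mixture
Y_t = (1.0 - t) * eps_eff + t * A               # point on the interpolation path
u_fm = (1.0 - omega) * (A - eps)                # flow-matching velocity

# General Legato velocity field evaluated on the path ...
f_general = (1.0 / (1.0 - omega)) * (u_fm + kappa * (Y_t - A))
# ... matches the closed-form training target.
v_target = (1.0 - kappa * (1.0 - t)) * (A - eps)
assert torch.allclose(f_general, v_target, atol=1e-5)
```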

Legato preserves the geometric direction of standard flow matching while reshaping the velocity magnitude to internalize the continuation dynamics. The network is trained by regressing $f_\theta(\mathbf{Y}_t, o, t, \boldsymbol{\omega})$, conditioned on the observation $o$ and the schedule $\boldsymbol{\omega}$, onto $\mathbf{v}_{\text{target}}$.
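A minimal training-loss sketch under these definitions follows. `model` stands in for $f_\theta(\mathbf{Y}_t, o, t, \boldsymbol{\omega})$; its conditioning interface, the batching, and the choice of $\Delta t$ are assumptions made for illustration, not the released implementation.

```python
import torch

def legato_target(A, eps, omega, t, dt):
    """Closed-form target: (1 - kappa * (1 - t)) * (A - eps), with kappa = omega / dt."""
    kappa = omega / dt
    return (1.0 - kappa * (1.0 - t)) * (A - eps)

def legato_loss(model, obs, A, eps, omega, t, dt):
    """L2 regression of the predicted velocity onto the Legato target along the mixture path."""
    eps_eff = (1.0 - omega) * eps + omega * A       # action-noise mixture
    Y_t = (1.0 - t) * eps_eff + t * A               # point on the interpolation path
    v_target = legato_target(A, eps, omega, t, dt)
    return torch.mean((model(Y_t, obs, t, omega) - v_target) ** 2)
```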


Overview of Legato with schedule-shaped continuation dynamics. The schedule is parameterized by s, the number of actions executed per cycle; d, the length of the fully guided prefix (covering the inference delay); and r, the ramp-down length. Given the schedule, Legato initializes actions via an action-noise mixture and learns a reshaped velocity field, so that the guidance schedule is realized natively during multi-step denoising.
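One plausible construction of the continuation vector $\boldsymbol{\omega}$ from the parameters above is sketched below: fully guided for the first d steps, a linear ramp-down over the next r steps, and unguided afterward. The linear ramp shape and the helper name `make_schedule` are assumptions; how s and the randomized schedule conditioning enter the exact schedule used in the paper is not specified here.

```python
import torch

def make_schedule(H, d, r):
    """One plausible omega: 1 for the first d steps (inference delay), a linear
    ramp-down over the next r steps, 0 afterward. Assumes d + r <= H."""
    omega = torch.zeros(H)
    omega[:d] = 1.0
    omega[d:d + r] = torch.linspace(1.0, 0.0, r + 2)[1:-1]   # r values strictly inside (0, 1)
    return omega.unsqueeze(-1)                               # shape (H, 1), broadcasts over action dims

# Example: a 16-step chunk with a 3-step fully guided prefix and a 5-step ramp.
omega = make_schedule(H=16, d=3, r=5)
```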

Tasks and Environments


Real-world evaluation tasks on a dual-arm robot. We consider five manipulation tasks (stack bowls, pour things, pick and place, fold towel, and open drawer) covering diverse motion patterns and multimodal choices, such as alternative grasp goals and left/right arm selection.

Video Comparisons: RTC vs. Legato

Each pair shows the same task executed by the RTC baseline (left) and Legato (right).
Legato produces smoother trajectories with less hesitation and shorter completion time.

Stack the Bowls (5× speed)

RTC Baseline — Trial 1

Legato (Ours) — Trial 1

RTC Baseline — Trial 2

Legato (Ours) — Trial 2

Open the Drawer (2× speed)

RTC Baseline

Legato (Ours)

Pour Things into the Bowl (3× speed)

RTC Baseline — Trial 1

Legato (Ours) — Trial 1

RTC Baseline — Trial 2

Legato (Ours) — Trial 2

Put All Items into the Box (3× speed)

RTC Baseline — Trial 1

Legato (Ours) — Trial 1

RTC Baseline — Trial 2

Legato (Ours) — Trial 2

Fold the Towel (2× speed)

RTC Baseline

Legato (Ours)

Results

Comparison with RTC

We evaluate Legato and RTC across five real-world manipulation tasks under strictly controlled settings. Both methods are initialized from the same pretrained checkpoint, trained on identical datasets, and optimized with the same hyperparameters. We report task score, completion time, and three smoothness metrics (NLDLJ, NSPARC, and overlap RMSE). Values are mean ± standard error.

Each cell reports RTC / Legato.

| Task | Score ↑ | Time (s) ↓ | NLDLJ ↓ | NSPARC ↓ | Overlap RMSE (×10³) ↓ |
|---|---|---|---|---|---|
| Bowls | 8.68 ± 0.35 / 9.08 ± 0.33 | 52.88 ± 3.54 / 42.66 ± 2.68 | 36.00 ± 0.34 / 35.86 ± 0.38 | 1.82 ± 0.04 / 1.63 ± 0.02 | 6.83 ± 0.50 / 4.58 ± 0.17 |
| Pour | 9.34 ± 0.18 / 9.72 ± 0.13 | 95.07 ± 2.86 / 75.73 ± 1.51 | 39.82 ± 0.15 / 39.50 ± 0.13 | 2.85 ± 0.24 / 1.65 ± 0.08 | 7.64 ± 0.70 / 5.14 ± 0.17 |
| PickPlace | 9.47 ± 0.15 / 9.53 ± 0.12 | 35.53 ± 1.24 / 30.37 ± 0.65 | 34.42 ± 0.18 / 34.34 ± 0.14 | 2.10 ± 0.08 / 1.89 ± 0.05 | 10.17 ± 0.66 / 5.98 ± 0.40 |
| Drawer | 9.20 ± 0.16 / 9.50 ± 0.13 | 25.97 ± 0.74 / 21.80 ± 0.72 | 32.73 ± 0.13 / 28.55 ± 0.26 | 2.24 ± 0.05 / 1.99 ± 0.08 | 12.11 ± 0.66 / 11.74 ± 0.55 |
| Towel | 7.33 ± 0.62 / 8.17 ± 0.56 | 25.93 ± 0.98 / 20.00 ± 0.78 | 32.79 ± 0.20 / 32.43 ± 0.24 | 2.17 ± 0.07 / 1.97 ± 0.05 | 11.28 ± 0.55 / 6.22 ± 0.66 |

Legato consistently outperforms RTC across all tasks and metrics. It achieves shorter task completion time by suppressing spurious multimodal switching, and produces smoother trajectories as measured by NLDLJ, NSPARC, and overlap RMSE.

Comparison with Training-Time RTC

We also compare Legato with Training-Time RTC on the pour task. Values are mean ± standard error.

| Metric | Training-Time RTC | Legato |
|---|---|---|
| Score ↑ | 9.46 ± 0.16 | 9.72 ± 0.13 |
| Completion Time (s) ↓ | 81.73 ± 1.12 | 75.73 ± 1.51 |
| NSPARC ↓ | 2.46 ± 0.14 | 1.65 ± 0.08 |
| NLDLJ ↓ | 39.95 ± 0.13 | 39.50 ± 0.13 |

For detailed ablation studies, additional experiments, and implementation details, please refer to the paper.

BibTeX

@misc{liu2026learningnativecontinuationaction,
      title={Learning Native Continuation for Action Chunking Flow Policies},
      author={Yufeng Liu and Hang Yu and Juntu Zhao and Bocheng Li and Di Zhang and Mingzhu Li and Wenxuan Wu and Yingdong Hu and Junyuan Xie and Junliang Guo and Dequan Wang and Yang Gao},
      year={2026},
      eprint={2602.12978},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2602.12978},
}