Audio Demo

Demo Overview

This work addresses the Stability-Expressivity Gap in low-resource Spoken Language Models (SLMs). We identify a phenomenon we call "Synthetic Erosion", in which scaling synthetic data improves stability but causes prosodic naturalness to collapse.

To bridge this gap, we propose two self-alignment frameworks: Disentanglement-Guided Self-Alignment (DGSA) and Temperature-Driven Self-Critique (TDSC). This demo is organized as follows:

SOTA Applications: High-fidelity Zero-Shot Voice Cloning for Thai and the first-ever implementation for the Lao language.
The Stability-Expressivity Gap: Illustrating the emergence of "Synthetic Erosion" across different data configurations.
Methodological Solutions: Demonstrating how our self-alignment strategies (DGSA & TDSC) restore prosodic expressiveness.

Listeners are encouraged to evaluate examples by jointly considering pronunciation clarity and prosodic naturalness.

Section 1: Comparison with Existing Systems

Section 2: Zero-shot Voice Cloning

Section 3: The Stability-Expressivity Gap

Section 4: Disentanglement-Guided Self-Alignment (DGSA)

Section 5: Temperature-Driven Self-Critique (TDSC)

Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models
