Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

Demo Overview

This work addresses the Stability-Expressivity Gap in low-resource Spoken Language Models (SLMs). We identify a phenomenon we call "Synthetic Erosion", in which scaling synthetic data improves stability but causes prosodic naturalness to collapse.

To bridge this gap, we propose two self-alignment frameworks: DGSA and TDSC. This demo is organized as follows:

  • SOTA Applications: High-fidelity Zero-Shot Voice Cloning for Thai and the first-ever Zero-Shot Voice Cloning implementation for the Lao language.
  • The Stability-Expressivity Gap: Illustrating the emergence of "Synthetic Erosion" across different data configurations.
  • Methodological Solutions: Demonstrating how our self-alignment strategies (DGSA & TDSC) restore prosodic expressiveness.

Listeners are encouraged to evaluate examples by jointly considering pronunciation clarity and prosodic naturalness.

Table of Contents