AI-DimSum launches as world's first Cantonese corpus platform

The AI-DimSum multimodal Cantonese corpus platform, the world's first of its kind designed to be culturally authentic, secure, trustworthy, and AI-friendly, has been officially launched. It features a processed text corpus of over one million words and more than 3,000 hours of high-fidelity annotated speech data, marking a significant milestone in the preservation and digital application of Cantonese language resources.

(Video: Guangzhou Federation of Social Science Circles) 

AI-DimSum was unveiled at the Tenth Advanced Forum on Language Services, held at Guangzhou University on December 6–7. It comprises seven integrated subsystems spanning corpus collection, annotation, management, large-model integration, and application deployment, forming a strong foundation for Cantonese preservation, research, and technological innovation.

The platform is notably comprehensive: beyond its million-word text corpus, it includes more than 1 TB of audiovisual materials, 10,000 images of Lingnan culture, and 10,000 utterances from daily life scenarios. It also features an authoritative Cantonese safety lexicon and over 200,000 safety assessment questions, fully supporting multimodal and large language model (LLM) training needs.

At the same event, the national language resource service platform released its "6+1" resource package, which brings together nearly 100 language services from about 50 institutions, such as cultural term translation and emergency language support.

The forum drew over 120 experts, scholars, and industry representatives to discuss the vital role of language resources in cultural preservation and emergency language services.

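The Tenth Advanced Forum on Language Services and the 2025 Annual Academic Conference of the National Emergency Language Service Corps was held at Guangzhou University. (Photo provided by the interviewee)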

Author | He Fengyu

Photo | Nanfang Plus

Editor | Liu Lingzhi, James Campion, Shen He
