Arabic (Saudi) in-studio speech dataset

Featuring over 1,300 prompts, this dataset supports wake word detection and command phrase recognition. Recorded in a studio environment by voice actors and native speakers of Arabic (Saudi dialect), it delivers clear, consistent mono-channel audio at 44.1 kHz and 24-bit depth.

Specifications

Modalities: Audio
Language: Arabic (Saudi) [ar-SA]
Total prompts: 1,392
Total audio length: 1 h 04 min
Average recording length: 2.76 s
Participants: 29
Group: Adults
Task category: Scripted prompts
Data type: In-studio speech
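
As a rough illustration, a delivered recording could be checked against the listed specifications with a short script. This is only a sketch: it assumes the files arrive as plain PCM WAV (the page does not state the container format), and `check_recording` is a hypothetical helper name, not part of any official tooling.

```python
import wave

# Expected values taken from the specifications listed on this page.
EXPECTED_CHANNELS = 1        # mono
EXPECTED_RATE_HZ = 44_100    # 44.1 kHz
EXPECTED_SAMPLE_WIDTH = 3    # 24-bit = 3 bytes per sample

# Cross-check of the listed figures: 1,392 prompts x 2.76 s on average.
expected_total_minutes = 1392 * 2.76 / 60  # about 64 minutes, i.e. 1 h 04 min


def check_recording(path):
    """Return (matches_spec, duration_in_seconds) for one WAV file.

    Assumption: the file is uncompressed PCM WAV, which the standard
    library's wave module can read.
    """
    with wave.open(path, "rb") as wav:
        matches = (
            wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH
        )
        duration = wav.getnframes() / wav.getframerate()
    return matches, duration
```

Summing the returned durations over all files should land near the listed total audio length, which is a quick way to confirm a delivery is complete.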

Accelerate model development & training processes

  • High-fidelity, studio-recorded audio

    Captured in a controlled professional studio environment to minimize background noise, echo and other acoustic distortions. Each recording includes standardized 1-second silence padding to help with utterance segmentation and noise analysis.

  • Scripted prompts with comprehensive coverage

    Prompts are carefully scripted for consistent phrasing while covering a wide range of wake words and command expressions, including multiple variations of how they may be naturally phrased or used in different contexts.

  • Authentic Saudi dialect with phonetic details

    Recorded exclusively by native Saudi Arabic speakers to capture authentic pronunciation, intonation and regional linguistic nuances, enabling models to detect subtle phonetic variations across varied speakers and acoustic environments.
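
The standardized 1-second silence padding mentioned in the first point above makes it simple to isolate the spoken part of each recording. The sketch below assumes PCM WAV files and assumes the padding sits at both the start and the end of each file; the page does not say where the padding is placed, so verify this against a sample first. `trim_padding` is a hypothetical helper name.

```python
import wave

# 1 second comes from the dataset description. Applying it to BOTH ends
# of the file is an assumption, not something the description states.
PADDING_SECONDS = 1.0


def trim_padding(src_path, dst_path, padding_seconds=PADDING_SECONDS):
    """Write a copy of src_path without leading and trailing padding.

    Returns the duration of the trimmed audio in seconds.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        pad_frames = int(round(padding_seconds * src.getframerate()))
        keep_frames = src.getnframes() - 2 * pad_frames
        if keep_frames <= 0:
            raise ValueError("recording is shorter than the assumed padding")
        src.setpos(pad_frames)               # skip the leading padding
        frames = src.readframes(keep_frames)  # stop before the trailing padding

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)   # same channels, sample width and rate
        dst.writeframes(frames)  # header frame count is fixed up on close

    return keep_frames / params.framerate
```

The padded originals remain useful on their own, since the silent stretches can serve as a per-recording noise-floor reference.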

Still searching for the right dataset? We can help.

Reach out and we’ll guide you to the right solution.

Case Studies

Explore our success stories

  • Evaluating a conversational AI model with a highly complex multimodal STEM dataset

    Discover how our off-the-shelf science, technology, engineering and mathematics (STEM) dataset contributed to enhancing scientific reasoning and visual processing capabilities in a chatbot model crafted by a leading-edge tech and AI company.


    • 4,485 physics prompt-response pairs
    • 9,606 math prompt-response pairs

  • Improving large language model logic and reasoning with a specialized fine-tuning dataset

    Explore how TELUS Digital created an off-the-shelf dataset to advance the capabilities of large language models (LLMs).


    • 50K STEM-based prompt-response pairs created
    • 300 highly skilled contributors


Access the Arabic (Saudi) in-studio speech dataset

Connect with our experts for pricing and samples.