NEW · RESEARCH: Human-1, the first open full-duplex conversational model for Hindi

Voice AI Infrastructure · India

Infrastructure for
Voice AI in India.

Research-grade datasets and evaluations for the world's top AI labs and technology companies — built on the real languages, accents, and conversations of India.

Built for the teams advancing speech & conversational AI

Customers across our portfolio of products

Labs · Platforms · Institutions

OpenAI

We build the voice infrastructure AI needs.

Josh Talks enables AI labs and enterprise teams to train, evaluate, and scale voice technologies that truly understand India's linguistic diversity. We collect, curate, and deliver research-grade conversational and multi-speaker voice datasets across Indian languages, accents, and real contexts — with rigorous quality, compliance, and traceability.

Indian languages across our datasets

10 million hours of voice data produced & labeled every year

9 distinct emotions annotated in conversational data

5 levels of human-in-the-loop annotation & QA

Why teams trust us

Data you can trust.

Grassroots diversity

Collected from real speakers across states, socioeconomic tiers, and dialect regions — exactly where your product will be used.

Measurable quality

Five-level, human-in-the-loop annotation with automated anomaly detection keeps label error rates exceptionally low.

Ethical by design

Consent workflows that meet global standards, automated PII redaction, and contributor revenue-share models.

Enterprise-grade security

Air-gapped labs, ISO 27001–aligned cloud practices, and full per-file audit trails for compliance teams.

Production at scale

From scarcity to abundance.

Our patented data-production and annotation pipeline lets us generate and label 10 million hours of voice data every year — channel-separated conversational audio sourced from the grassroots of India, through Josh Talks' network of Training Data Specialists.

01 · Multilingual

Multilingual datasets

Multilingual and code-switching voice datasets
Low-resource, rare-language, and dialect datasets
Accented English speech datasets

Languages: English, Hindi, Tamil, Marathi, Telugu, Bengali, Kannada, Malayalam, Punjabi, Odia, Gujarati, Assamese

02 · Multi-speaker

Multi-speaker datasets

Multi-speaker spontaneous conversations
Diarized speaker stems for up to 16 speakers
Multi-speaker debate datasets

03 · Emotion

Emotion-rich data

Emotionally aware and annotated conversations
9 emotions: Neutral, Angry, Happy, Sad, Fear, Anxious, Surprised, Confused, Excited

04 · Specialized

Specialized datasets

Noisy and adverse-environment datasets
Privacy-preserving, highly personalized datasets
Voice datasets for accessibility
Child speech datasets

Featured · ASR

Two-channel separated conversational voice datasets

Large-scale, multi-topic, natural dialogues in Indian languages — built for training Automatic Speech Recognition (ASR) models. Each dataset captures real conversational patterns, diverse accents, and natural speech variability to improve model robustness and generalization.

Off-the-shelf volumes span tens of thousands of hours per language, with per-speaker metadata and contextual labels.

The Josh Talks AI ecosystem

Datasets, evaluations & research — under one roof.

ASR Datasets · TTS Evals · Voice of India · Human-1 · Research

Partner with us

The voice infrastructure layer for India's languages.

Work with Josh Talks AI to train, evaluate, and scale voice models on the real languages and conversations of India.
