Kala — Nepali TTS

A Nepali VITS voice with its own hand-crafted Devanagari G2P (no eSpeak), 5 speakers, and real-time synthesis on CPU.

open source · CC-BY-SA 4.0 · CPU-only · 22050 Hz · ONNX

0.020 RTF on laptop CPU (50× real-time)

48 distinct phones + geminated variants

48 k curated lexicon entries

99.5 % minimal-pair contrast preservation (NepTTS-Bench)


Try it

The demo runs on a Hugging Face Spaces CPU instance, with the same ONNX model and the same latency. Source: ampixa/real-nepali-tts.

TTS comparison

Kala uses its own Devanagari G2P instead of delegating phonology to eSpeak the way Piper does. All six systems run on the same sentence set. Use the speaker selector to switch the Kala voice.

Only the Kala voice is selectable; the other systems use fixed voices. Audio loads on click.

Why a new G2P?

eSpeak-ng's ne voice was designed for phoneme coverage, not phonological accuracy. It maps Nepali affricates to alveolar labels (ts, tsh) that do not match how Kathmandu speakers produce च and छ. It also often loses gemination without warning, does not handle Latin code-switching, and has no lexicon.

The real_nepali G2P was built from the ground up on Khatiwada (2009), the Journal of the International Phonetic Association illustration of Nepali, and refined by native-speaker listening review of the 48 000 lexicon entries.

Feature	eSpeak `ne`	real_nepali G2P
च / छ (palatal affricates)	ts / tsh (alveolar)	ch / chh (palatal)
Gemination	often lost	explicit ː tokens
Schwa deletion	heuristic	rule-based, audited
Latin code-switch	undefined	letter-by-letter + override lexicon
Phone inventory	~35	48 phones + geminated variants (101 IDs)
Lexicon	none	48 000-entry curated lexicon (CC-BY-SA)
NepTTS-Bench score	—	99.5 % minimal-pair contrast

The affricate distinction matters perceptually. छोरा (son) and चोरा (thief) form a minimal pair: the initial consonants share a palatal place of articulation and differ only in aspiration. eSpeak maps them to alveolar tsh / ts, while the Kala model maps them to palatal chh / ch, matching mainstream Kathmandu Nepali.
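The two behaviours in the table (palatal labels for च / छ and an explicit ː token for gemination) can be illustrated with a toy sketch. This is not the real_nepali implementation: `toy_g2p` is a hypothetical name, the tables cover only five consonants and two vowel signs, every phone label other than ch / chh is a placeholder chosen for this illustration, and schwa deletion is ignored entirely.

```python
# Toy illustration only: a tiny Devanagari-to-phone mapper.
# All names and most phone labels here are assumptions, not the project's API.

CONSONANTS = {
    "क": "k",
    "च": "ch",    # palatal affricate, unaspirated
    "छ": "chh",   # palatal affricate, aspirated
    "प": "p",
    "र": "r",
}
VOWEL_SIGNS = {
    "\u093e": "aa",  # vowel sign AA
    "\u094b": "o",   # vowel sign O
}
VIRAMA = "\u094d"       # suppresses the inherent vowel
LENGTH_MARK = "\u02d0"  # the IPA length mark used as the gemination token
SCHWA = "a"             # inherent vowel; real schwa deletion is not modelled


def toy_g2p(word):
    """Map a word to a list of phones using the tiny tables above."""
    phones = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch not in CONSONANTS:
            raise ValueError(f"character not covered by the toy tables: {ch!r}")
        phone = CONSONANTS[ch]
        i += 1
        # Gemination: consonant + virama + the same consonant again.
        if word[i:i + 2] == VIRAMA + ch:
            phone += LENGTH_MARK
            i += 2
        phones.append(phone)
        if i < len(word) and word[i] in VOWEL_SIGNS:
            phones.append(VOWEL_SIGNS[word[i]])
            i += 1
        elif i < len(word) and word[i] == VIRAMA:
            i += 1  # explicit virama: no vowel follows this consonant
        else:
            phones.append(SCHWA)
    return phones
```

With these tables, छोरा and चोरा come out as phone lists that differ only in the first phone (chh versus ch), and a doubled consonant such as the क्क in पक्का becomes a single geminated token instead of being dropped.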

The phonological foundation

Every phone assignment traces back to a primary source. The baseline authority is Khatiwada (2009), the Journal of the International Phonetic Association illustration of Nepali, read cover-to-cover, with direct quotes captured in the project policy document. Contested phones (aspirated laterals, the w/v distinction, retroflex nasals) are documented with the evidence for each decision.

The lexicon seeds from the Google language-resources/ne/ wordlist (54 000 entries, CC-BY 4.0), then passes through automated rule expansion and native-speaker listening review to produce a 48 000-entry gold lexicon.

Benchmark

System	Type	CPU RTF	G2P	License
Kala v0.2 (this model)	VITS ONNX	0.020	real_nepali (hand-crafted)	CC-BY-SA 4.0
eSpeak-ng ne	formant	< 0.001	eSpeak rules	GPL 3
Meta MMS-TTS (npi)	VITS	~0.12	eSpeak	CC-BY-NC 4.0
Piper Chitwan	VITS ONNX	~0.03	eSpeak ne	MIT
Microsoft Edge TTS (Hemkala)	neural cloud	cloud API	proprietary	commercial
Google TTS (ne-NP)	neural cloud	cloud API	proprietary	commercial

RTF (real-time factor: synthesis time divided by the duration of the audio produced) was measured on a 4-core laptop CPU, including for MMS-TTS. The cloud APIs are network-bound, so CPU RTF does not apply to them. Listening comparison: see the comparison panel above.
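For readers who want to reproduce the RTF column, here is a minimal sketch of the arithmetic, assuming the usual definition of RTF as synthesis time divided by audio duration. The helper names are hypothetical and the exact measurement protocol behind the table (warm-up, sentence set, number of runs) is not stated here.

```python
import time


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def speedup(rtf):
    """How many times faster than real time a given RTF is."""
    return 1.0 / rtf


def best_wall_time(fn, repeats=3):
    """Call fn() several times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical usage with the package shown in the install section:
#   elapsed = best_wall_time(lambda: kala_tts.synthesize(text, speaker="kala"))
#   rtf = real_time_factor(elapsed, audio_seconds)
```

An RTF of 0.020 means 5 seconds of audio take about 0.1 seconds to synthesize, which is the 50× real-time figure quoted at the top of the page.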

What the G2P difference sounds like: eSpeak and MMS-TTS (which uses eSpeak) both produce robotic alveolar affricates for च/छ. Edge TTS and Google TTS have natural-sounding output but are cloud-only, closed-source, and require API credentials. Of the six systems compared here, Kala is the only open-source model with correct palatal affricates that runs offline on CPU.

Speakers

Name	ID	Data	Hours	Notes
kala	2	human studio	0.37 h	namesake voice; recommended for demos
barsha	1	human recording	1.62 h	most training data among human voices
slr143_F	3	OpenSLR-143 corpus	1.01 h	read speech; neutral prosody
slr43_0546	4	OpenSLR-43 corpus	0.62 h	read speech
slr43_2099	5	OpenSLR-43 corpus	0.51 h	read speech

Install and use

pip install kala-tts

import kala_tts

# synthesize returns WAV bytes (16-bit PCM, 22050 Hz mono)
wav = kala_tts.synthesize("नमस्कार, कसरी हुनुहुन्छ?", speaker="kala")

# write directly to disk
kala_tts.synthesize_to_file(
    "नेपाल सुन्दर देश हो।",
    "output.wav",
    speaker="barsha",
)

# speed control (0.8 = slower, 1.3 = faster)
wav = kala_tts.synthesize("राम्रो दिन!", speaker="kala", speed=0.9)

# list available speakers
print(kala_tts.list_speakers())
# ('kala', 'barsha', 'slr143_F', 'slr43_0546', 'slr43_2099')
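Since synthesize is documented above as returning WAV bytes (16-bit PCM, 22050 Hz mono), the result can be checked with the standard library alone. The sketch below is not part of kala_tts; `describe_wav` is a hypothetical helper that only reads the WAV header.

```python
import io
import wave


def describe_wav(wav_bytes):
    """Read the header of an in-memory WAV file and summarize its format."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


# Hypothetical usage with the call shown above:
#   info = describe_wav(kala_tts.synthesize("नमस्कार", speaker="kala"))
#   assert info["sample_rate_hz"] == 22050 and info["channels"] == 1
```

The duration reported here is also the denominator needed for an RTF measurement.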

CLI:

kala-tts "नमस्कार" --speaker kala -o out.wav
kala-tts --list-speakers

# pipe from stdin
echo "नेपाल सुन्दर देश हो।" | kala-tts -o out.wav

The first call downloads the ONNX model (~60 MB) from the Hugging Face Hub and caches it locally. After that, no internet connection is needed. No GPU or eSpeak installation is required.

From source: pip install git+https://github.com/Ampixa/nepa-newa-text-frontend onnxruntime huggingface_hub numpy

Known limitations

Naturalness: The Kala voice was trained on 200 utterances (~22 min), so prosody is often flat on long sentences. Barsha and slr143_F have more data and more consistent prosody.
Punctuation: Pauses are inserted deterministically at sentence boundaries. The model does not learn intonation contours from punctuation.
OOV words: Unknown Devanagari words fall back to letter-by-letter rules. The lexicon covers ~95 % of common vocabulary.
Numbers: Digit sequences are normalized to Nepali word order; mixed Nepali/English numerals may produce unexpected output.
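The OOV behaviour above amounts to a lexicon-first lookup with a rule fallback. The sketch below shows that pattern in isolation, together with a simple coverage measure. All names are hypothetical, the lexicon is a plain dict standing in for the curated one, and `rule_g2p` is whatever fallback function the caller supplies; the real pipeline may differ.

```python
def phonemize_word(word, lexicon, rule_g2p):
    """Return (phones, source): the lexicon entry if present, else the rule output."""
    entry = lexicon.get(word)
    if entry is not None:
        return list(entry), "lexicon"
    return list(rule_g2p(word)), "rules"


def lexicon_coverage(words, lexicon):
    """Fraction of the given words that have a lexicon entry."""
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in lexicon)
    return hits / len(words)
```

Returning the source alongside the phones makes it easy to log which words fell through to the rules, which is one way to find candidates for new lexicon entries.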

Model card

Full training details, checkpoint SHA-256, phoneme ID map, and licensing: huggingface.co/ampixa/real-nepali-v0.2-kala

Reproducible training recipe — patched piper-plus, data prep, the exact training command, checkpoint selection, and ONNX export: github.com/Ampixa/nepal-tts-training (see TRAINING_GUIDE.md).

Citation

@misc{ampixa2026kala,
  title  = {Kala: CPU-native Nepali Text-to-Speech with a hand-crafted G2P},
  author = {Ampixa},
  year   = {2026},
  url    = {https://huggingface.co/ampixa/real-nepali-v0.2-kala},
}

Phonological foundation: Khatiwada (2009), Nepali, Journal of the International Phonetic Association, 39(3), 373–380.