A Nepali VITS voice with its own Devanagari G2P — no eSpeak. A hand-crafted G2P, 5 speakers, real-time on CPU.
The demo runs on Hugging Face Spaces CPU: same ONNX model, same latency. Source: ampixa/real-nepali-tts.
Kala uses its own Devanagari G2P instead of delegating phonology to eSpeak the way Piper does. All six systems run on the same sentence set. Use the speaker selector to switch the Kala voice.
eSpeak-ng's ne voice was designed for phoneme coverage, not
phonological accuracy. It maps Nepali affricates to alveolar labels
(ts, tsh) that do not match how Kathmandu speakers
produce च and छ. It silently drops gemination, handles no
Latin code-switching, and has no lexicon.
The real_nepali G2P was built from the ground up on Khatiwada (2009),
the Journal of the International Phonetic Association illustration of Nepali,
and refined through native-speaker listening review of 48 000 lexicon entries.
| Feature | eSpeak ne | real_nepali G2P |
|---|---|---|
| च / छ (palatal affricates) | ts / tsh (alveolar) | ch / chh (palatal) |
| Gemination | often lost | explicit ː tokens |
| Schwa deletion | heuristic | rule-based, audited |
| Latin code-switch | undefined | letter-by-letter + override lexicon |
| Phone inventory | ~35 | 48 phones + geminated variants (101 IDs) |
| Lexicon | none | 48 000-entry curated lexicon (CC-BY-SA) |
| NepTTS-Bench score | — | 99.5 % minimal-pair contrast |
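Two rows of the table, gemination and Latin code-switching, can be illustrated with a minimal sketch. It assumes a geminate is written as consonant + virama + the same consonant, and that Latin words are spelled letter by letter unless an override entry exists. The helper names, the example word, and the `<G>`-style placeholder tokens are hypothetical; this is not the real_nepali implementation or its phone inventory.

```python
VIRAMA = "\u094d"  # Devanagari virama (halant)


def is_consonant(ch: str) -> bool:
    """True for the core Devanagari consonant range (क..ह)."""
    return "\u0915" <= ch <= "\u0939"


def mark_geminates(word: str) -> list[str]:
    """Collapse consonant + virama + same consonant into consonant + 'ː'."""
    out: list[str] = []
    i = 0
    while i < len(word):
        ch = word[i]
        doubled = (
            is_consonant(ch)
            and i + 2 < len(word)
            and word[i + 1] == VIRAMA
            and word[i + 2] == ch
        )
        if doubled:
            out += [ch, "ː"]  # one explicit length token instead of a second consonant
            i += 3
        else:
            out.append(ch)
            i += 1
    return out


def latin_to_tokens(word: str, overrides: dict[str, list[str]]) -> list[str]:
    """Use an override entry if present, otherwise spell the word letter by letter."""
    key = word.lower()
    if key in overrides:
        return overrides[key]
    # Placeholder letter-name tokens, not real phone IDs.
    return [f"<{c.upper()}>" for c in key if c.isascii() and c.isalpha()]


print(mark_geminates("पक्का"))      # illustrative word containing a doubled क
print(latin_to_tokens("GPU", {}))
```

The point of the explicit `ː` token is that length survives into the model input instead of being silently merged or dropped.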
The affricate distinction matters perceptually. छोरा (son) and
चोरा (thief) form a minimal pair that differs only in the aspiration
of the initial affricate, so the place of articulation assigned to that
affricate colors both words. eSpeak maps the pair to alveolar tsh / ts;
the Kala model maps it to palatal chh / ch,
matching mainstream Kathmandu Nepali.
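To make the contrast concrete, the sketch below looks up the word-initial consonant of each word in the two label sets from the table. Only the label strings (ts, tsh, ch, chh) come from the comparison above; the dictionaries and the `first_affricate` helper are hypothetical and do not reflect how either G2P is implemented.

```python
# Hypothetical label tables; only the label strings come from the comparison table.
ESPEAK_LABELS = {"च": "ts", "छ": "tsh"}
KALA_LABELS = {"च": "ch", "छ": "chh"}


def first_affricate(word: str, labels: dict[str, str]) -> str:
    """Return the label for the word-initial consonant, or '' if it is not च/छ."""
    return labels.get(word[0], "") if word else ""


for word in ("छोरा", "चोरा"):
    print(word, first_affricate(word, ESPEAK_LABELS), first_affricate(word, KALA_LABELS))
```

Both label sets keep the aspiration contrast; what changes is the place label the acoustic model is trained on.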
Every phone assignment traces back to a primary source. The baseline
authority is Khatiwada (2009), the Journal of the International Phonetic
Association illustration of Nepali, read in full, with direct quotes
captured in the project policy document. Contested phones (aspirated
laterals, the w/v distinction, retroflex nasals) are documented with the
evidence for each decision.
The lexicon is seeded from the Google language-resources/ne/
wordlist (54 000 entries, CC-BY 4.0). Automated rule expansion and
native-speaker listening review then turn it into a 48 000-entry
gold lexicon.
| System | Type | CPU RTF | G2P | License |
|---|---|---|---|---|
| Kala v0.2 (this model) | VITS ONNX | 0.020 | real_nepali (hand-crafted) | CC-BY-SA 4.0 |
| eSpeak-ng ne | formant | < 0.001 | eSpeak rules | GPL 3 |
| Meta MMS-TTS (npi) | VITS | ~0.12 | eSpeak | CC-BY-NC 4.0 |
| Piper Chitwan | VITS ONNX | ~0.03 | eSpeak ne | MIT |
| Microsoft Edge TTS (Hemkala) | neural cloud | n/a (cloud API) | proprietary | commercial |
| Google TTS (ne-NP) | neural cloud | n/a (cloud API) | proprietary | commercial |
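RTF (real-time factor: synthesis time divided by the duration of the audio produced; lower is faster) was measured on a 4-core laptop CPU, including for MMS-TTS. Cloud APIs are network-bound, so CPU RTF does not apply to them. For the listening comparison, see the comparison panel above.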
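For readers who want to reproduce an RTF figure, here is a minimal sketch. It assumes the synthesizer returns complete WAV bytes, as the usage section describes for `kala_tts.synthesize`; the helper names and the `silent_wav` stand-in are hypothetical, and this is not the benchmark script behind the table.

```python
import io
import time
import wave
from typing import Callable


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Audio duration computed from the WAV header: frames / sample rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def real_time_factor(synthesize: Callable[[str], bytes], text: str) -> float:
    """Wall-clock synthesis time divided by the duration of the audio produced."""
    start = time.perf_counter()
    wav_bytes = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / wav_duration_seconds(wav_bytes)


def silent_wav(seconds: float, rate: int = 22050) -> bytes:
    """Stand-in synthesizer output: 16-bit mono silence, so the sketch runs without a model."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


print(real_time_factor(lambda text: silent_wav(2.0), "placeholder text"))
```

To time the real model, pass something like `lambda t: kala_tts.synthesize(t, speaker="kala")` and average over several sentences after a warm-up call, since the first call also includes the model download and load.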
What the G2P difference sounds like: eSpeak and MMS-TTS (which uses eSpeak) both produce robotic alveolar affricates for च/छ. Edge TTS and Google TTS have natural-sounding output but are cloud-only, closed-source, and require API credentials. Among the systems compared here, Kala is the only open-source model with correct palatal affricates that runs offline on CPU.
| Name | ID | Data | Hours | Notes |
|---|---|---|---|---|
| kala | 2 | human studio | 0.37 h | namesake voice; recommended for demos |
| barsha | 1 | human recording | 1.62 h | most training data among human voices |
| slr143_F | 3 | OpenSLR-143 corpus | 1.01 h | read speech; neutral prosody |
| slr43_0546 | 4 | OpenSLR-43 corpus | 0.62 h | read speech |
| slr43_2099 | 5 | OpenSLR-43 corpus | 0.51 h | read speech |
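The name-to-ID mapping in the table can be written down as a small lookup. The IDs are copied from the table; the `speaker_id` helper is a hypothetical illustration, and whether the `kala_tts` package exposes such a mapping is an assumption (the usage examples pass speaker names, not IDs).

```python
# IDs copied from the speaker table; the helper itself is hypothetical.
SPEAKER_IDS = {
    "kala": 2,
    "barsha": 1,
    "slr143_F": 3,
    "slr43_0546": 4,
    "slr43_2099": 5,
}


def speaker_id(name: str) -> int:
    """Resolve a speaker name to its numeric ID, with a readable error for typos."""
    try:
        return SPEAKER_IDS[name]
    except KeyError:
        raise ValueError(
            f"unknown speaker {name!r}; choose from {sorted(SPEAKER_IDS)}"
        ) from None


print(speaker_id("kala"))
```

Note that the numeric ID is not the speaker's position in the table: kala is listed first but has ID 2.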
```shell
pip install kala-tts
```

```python
import kala_tts

# synthesize returns WAV bytes (16-bit PCM, 22050 Hz mono)
wav = kala_tts.synthesize("नमस्कार, कसरी हुनुहुन्छ?", speaker="kala")

# write directly to disk
kala_tts.synthesize_to_file(
    "नेपाल सुन्दर देश हो।",
    "output.wav",
    speaker="barsha",
)

# speed control (0.8 = slower, 1.3 = faster)
wav = kala_tts.synthesize("राम्रो दिन!", speaker="kala", speed=0.9)

# list available speakers
print(kala_tts.list_speakers())
# ('kala', 'barsha', 'slr143_F', 'slr43_0546', 'slr43_2099')
```
CLI:
```shell
kala-tts "नमस्कार" --speaker kala -o out.wav
kala-tts --list-speakers

# pipe from stdin
echo "नेपाल सुन्दर देश हो।" | kala-tts -o out.wav
```
The first call downloads the ONNX model (~60 MB) from the Hugging Face Hub and caches it locally. After that, no internet connection is needed. No GPU and no eSpeak installation are required.
From source:
```shell
pip install git+https://github.com/Ampixa/nepa-newa-text-frontend onnxruntime huggingface_hub numpy
```
Full training details, checkpoint SHA-256, phoneme ID map, and licensing: huggingface.co/ampixa/real-nepali-v0.2-kala
Reproducible training recipe — patched piper-plus, data prep, the exact training command, checkpoint selection, and ONNX export: github.com/Ampixa/nepal-tts-training (see TRAINING_GUIDE.md).
```bibtex
@misc{ampixa2026kala,
  title  = {Kala: CPU-native Nepali Text-to-Speech with a hand-crafted G2P},
  author = {Ampixa},
  year   = {2026},
  url    = {https://huggingface.co/ampixa/real-nepali-v0.2-kala},
}
```
Phonological foundation: Khatiwada (2009), Nepali, Journal of the International Phonetic Association, 39(3), 373–380.