
Kala — Nepali TTS

A Nepali VITS voice with its own hand-crafted Devanagari G2P and no eSpeak dependency. Five speakers, real-time on CPU.

open source  ·  CC-BY-SA 4.0  ·  CPU-only  ·  22050 Hz  ·  ONNX

0.020 RTF on laptop CPU (50× real-time)
48 distinct phones + geminated variants
48 k curated lexicon entries
99.5 % minimal-pair contrast preservation (NepTTS-Bench)

Model on Hugging Face  ·  Source on GitHub  ·  Training recipe  ·  pip install  ·  comparison  ·  G2P details


Try it

The demo runs on a Hugging Face Spaces CPU instance with the same ONNX model and the same latency. Source: ampixa/real-nepali-tts.


TTS comparison

Kala uses its own Devanagari G2P instead of delegating phonology to eSpeak the way Piper does. All six systems run on the same sentence set. Use the speaker selector to switch the Kala voice.

Other systems are fixed voices. Audio loads on click.

Why a new G2P?

eSpeak-ng's ne voice was designed for phoneme coverage, not phonological accuracy. It maps Nepali affricates to alveolar labels (ts, tsh) that do not match how Kathmandu speakers produce च and छ. It silently drops gemination, does not handle Latin code-switching, and has no lexicon.

The real_nepali G2P was built from the ground up on Khatiwada (2009), the IPA Handbook entry for Nepali, and refined through native-speaker listening review of 48 000 lexicon entries.

| Feature | eSpeak ne | real_nepali G2P |
|---|---|---|
| च / छ (palatal affricates) | ts / tsh (alveolar) | ch / chh (palatal) |
| Gemination | often lost | explicit ː tokens |
| Schwa deletion | heuristic | rule-based, audited |
| Latin code-switch | undefined | letter-by-letter + override lexicon |
| Phone inventory | ~35 | 48 phones + geminated variants (101 IDs) |
| Lexicon | none | 48 000-entry curated lexicon (CC-BY-SA) |
| NepTTS-Bench score | | 99.5 % minimal-pair contrast |
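One way to read the minimal-pair figure in the last row: a pair counts as preserved when the G2P assigns its two words different phone sequences. The sketch below assumes that definition, which may not be how NepTTS-Bench actually scores systems. Both toy G2P functions and the pair list are hypothetical stand-ins, not the real_nepali G2P or the benchmark data.

```python
from typing import Callable, Iterable, List, Tuple


def contrast_preservation(
    pairs: Iterable[Tuple[str, str]],
    g2p: Callable[[str], List[str]],
) -> float:
    """Percentage of word pairs whose phone sequences differ under g2p."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    kept = sum(1 for a, b in pairs if list(g2p(a)) != list(g2p(b)))
    return 100.0 * kept / len(pairs)


def faithful_g2p(word: str) -> List[str]:
    # Hypothetical: one label per code point, so every spelling contrast survives.
    return list(word)


def lossy_g2p(word: str) -> List[str]:
    # Hypothetical: collapses the aspiration contrast between च and छ.
    return ["ch" if c in "चछ" else c for c in word]


# Illustrative pairs only; not taken from NepTTS-Bench.
pairs = [("छोरा", "चोरा"), ("काल", "खाल")]
print(contrast_preservation(pairs, faithful_g2p))  # 100.0
print(contrast_preservation(pairs, lossy_g2p))     # 50.0
```

The lossy variant loses the छोरा / चोरा pair but keeps the क / ख pair, which is why it scores half.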

The affricate distinction matters perceptually. छोरा (son) and चोरा (thief) differ only in the aspiration of the initial affricate, so the place of articulation a G2P assigns applies to both words. eSpeak maps them to alveolar tsh / ts; the Kala model maps them to palatal chh / ch, matching mainstream Kathmandu Nepali.
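The difference between the two labelling schemes, and the idea of marking gemination explicitly, can be sketched with a toy function. This is a minimal illustration under stated assumptions: the two tables cover only च and छ, the function name is hypothetical, and appending ː to the label is just one possible encoding. It is not the real_nepali implementation.

```python
VIRAMA = "\u094d"  # Devanagari halant, joins two consonants into a cluster

# Hypothetical two-entry tables; the real inventories are much larger.
PALATAL = {"च": "ch", "छ": "chh"}    # labels in the style of real_nepali
ALVEOLAR = {"च": "ts", "छ": "tsh"}   # labels in the style of eSpeak ne


def affricate_labels(word: str, table: dict) -> list:
    """Return the affricate labels found in word, marking geminates with ː."""
    labels = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch in table:
            # Consonant + virama + same consonant is treated as a geminate.
            if word[i + 1 : i + 3] == VIRAMA + ch:
                labels.append(table[ch] + "ː")
                i += 3
                continue
            labels.append(table[ch])
        i += 1
    return labels


for word in ("छोरा", "चोरा", "बच्चा"):
    print(word, affricate_labels(word, ALVEOLAR), affricate_labels(word, PALATAL))
```

Both tables keep the aspiration contrast; they differ only in the place label they attach to it.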

The phonological foundation

Every phone assignment traces back to a primary source. The baseline authority is Khatiwada (2009), the IPA Handbook entry for Nepali, read cover-to-cover, with direct quotes captured in the project policy document. Contested phones (aspirated laterals, the w/v distinction, retroflex nasals) are documented with the evidence for each decision.

The lexicon is seeded from the Google language-resources/ne/ wordlist (54 000 entries, CC-BY 4.0), then passes through automated rule expansion and native-speaker listening review to produce a 48 000-entry gold lexicon.


Benchmark

| System | Type | CPU RTF | G2P | License |
|---|---|---|---|---|
| Kala v0.2 (this model) | VITS ONNX | 0.020 | real_nepali (hand-crafted) | CC-BY-SA 4.0 |
| eSpeak-ng ne | formant | < 0.001 | eSpeak rules | GPL 3 |
| Meta MMS-TTS (npi) | VITS | ~0.12 | eSpeak | CC-BY-NC 4.0 |
| Piper Chitwan | VITS ONNX | ~0.03 | eSpeak ne | MIT |
| Microsoft Edge TTS (Hemkala) | neural cloud | cloud API | proprietary | commercial |
| Google TTS (ne-NP) | neural cloud | cloud API | proprietary | commercial |

RTF was measured on a 4-core laptop CPU; the MMS-TTS figure comes from the same laptop. Cloud APIs are network-bound, so CPU RTF does not apply to them. For a listening comparison, see the comparison panel above.
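Real-time factor is synthesis time divided by the duration of the audio produced, so an RTF of 0.020 means one second of speech takes about 20 ms to generate (50× real-time). The helper below is a sketch of how such a number could be measured for any function that returns WAV bytes; `measure_rtf` and its `synthesize` parameter are hypothetical names, and this is not the benchmark script used for the table.

```python
import io
import time
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Duration of a PCM WAV payload, read from its header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def real_time_factor(synth_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synth_seconds / audio_seconds


def measure_rtf(synthesize, text: str) -> float:
    """Time one call to synthesize(text), which must return WAV bytes."""
    start = time.perf_counter()
    wav_bytes = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, wav_duration_seconds(wav_bytes))


# Worked arithmetic: 0.1 s of compute for 5 s of audio.
rtf = real_time_factor(0.1, 5.0)
print(f"RTF {rtf:.3f}, {1 / rtf:.0f}x real-time")
```

A single call is noisy and may include model loading, so a real measurement would usually warm up first and average over a sentence set.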

What the G2P difference sounds like: eSpeak and MMS-TTS (which uses eSpeak) both produce robotic alveolar affricates for च/छ. Edge TTS and Google TTS sound natural but are cloud-only, closed-source, and require API credentials. Of the systems compared here, Kala is the only open-source model with correct palatal affricates that runs offline on CPU.


Speakers

| Name | ID | Data | Hours | Notes |
|---|---|---|---|---|
| kala | 2 | human studio | 0.37 h | namesake voice; recommended for demos |
| barsha | 1 | human recording | 1.62 h | most training data among human voices |
| slr143_F | 3 | OpenSLR-143 corpus | 1.01 h | read speech; neutral prosody |
| slr43_0546 | 4 | OpenSLR-43 corpus | 0.62 h | read speech |
| slr43_2099 | 5 | OpenSLR-43 corpus | 0.51 h | read speech |
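The speaker table can be kept as a small lookup, for example to resolve a name to its numeric ID or to total the training audio. The values below are copied from the table; the `speaker_id` helper is hypothetical, and treating the ID column as the integer a multi-speaker model expects is an assumption, not something this page states.

```python
# name -> (speaker ID, hours of training audio), copied from the speaker table
SPEAKERS = {
    "kala": (2, 0.37),
    "barsha": (1, 1.62),
    "slr143_F": (3, 1.01),
    "slr43_0546": (4, 0.62),
    "slr43_2099": (5, 0.51),
}


def speaker_id(name: str) -> int:
    """Resolve a speaker name to its numeric ID, with a readable error."""
    try:
        return SPEAKERS[name][0]
    except KeyError:
        known = ", ".join(sorted(SPEAKERS))
        raise ValueError(f"unknown speaker {name!r}; expected one of: {known}") from None


total_hours = sum(hours for _, hours in SPEAKERS.values())
print(f"{total_hours:.2f} h across {len(SPEAKERS)} speakers")  # 4.13 h across 5 speakers
```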

Install and use

pip install kala-tts

Python:

import kala_tts

# synthesize returns WAV bytes (16-bit PCM, 22050 Hz mono)
wav = kala_tts.synthesize("नमस्कार, कसरी हुनुहुन्छ?", speaker="kala")

# write directly to disk
kala_tts.synthesize_to_file(
    "नेपाल सुन्दर देश हो।",
    "output.wav",
    speaker="barsha",
)

# speed control (0.8 = slower, 1.3 = faster)
wav = kala_tts.synthesize("राम्रो दिन!", speaker="kala", speed=0.9)

# list available speakers
print(kala_tts.list_speakers())
# ('kala', 'barsha', 'slr143_F', 'slr43_0546', 'slr43_2099')

CLI:

kala-tts "नमस्कार" --speaker kala -o out.wav
kala-tts --list-speakers

# pipe from stdin
echo "नेपाल सुन्दर देश हो।" | kala-tts -o out.wav

The first call downloads the ONNX model (~60 MB) from Hugging Face Hub and caches it locally. No internet connection is needed after the first use, and neither a GPU nor eSpeak is required.
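Because synthesize returns WAV bytes, rendering the same sentence with every voice only takes a short loop. The helper below is a sketch: `render_all_speakers` is a hypothetical name, and it takes the synthesis function as a parameter so it makes no assumptions about the package beyond the `speaker=` keyword shown above.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def render_all_speakers(
    text: str,
    speakers: Iterable[str],
    synthesize: Callable[..., bytes],
    out_dir: str,
) -> List[Path]:
    """Write one WAV file per speaker into out_dir and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name in speakers:
        # synthesize is expected to return complete WAV bytes for this speaker.
        wav_bytes = synthesize(text, speaker=name)
        path = out / f"{name}.wav"
        path.write_bytes(wav_bytes)
        written.append(path)
    return written
```

With the package installed, calling `render_all_speakers("नमस्कार", kala_tts.list_speakers(), kala_tts.synthesize, "samples")` should leave one file per voice in `samples/`.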

From source: pip install git+https://github.com/Ampixa/nepa-newa-text-frontend onnxruntime huggingface_hub numpy


Known limitations


Model card

Full training details, checkpoint SHA-256, phoneme ID map, and licensing: huggingface.co/ampixa/real-nepali-v0.2-kala

Reproducible training recipe (patched piper-plus, data prep, the exact training command, checkpoint selection, and ONNX export): github.com/Ampixa/nepal-tts-training (see TRAINING_GUIDE.md).

Citation

@misc{ampixa2026kala,
  title  = {Kala: CPU-native Nepali Text-to-Speech with a hand-crafted G2P},
  author = {Ampixa},
  year   = {2026},
  url    = {https://huggingface.co/ampixa/real-nepali-v0.2-kala},
}

Phonological foundation: Khatiwada (2009), Nepali, Journal of the International Phonetic Association, 39(3), 373–380.