Volunteering in North-East India for open speech data

Looking forward to more volunteering...


From mid‑March to mid‑May last year, I volunteered with the Patkai Himalaya Foundation, an NGO working on the digitisation and safeguarding of intangible heritage, currently active across North‑East India, under its folk.exchange initiative. I’m writing this to document what I saw on the ground, and what I think would make future language‑data efforts more respectful, accurate, and sustainable.

Why speech data, and why now?

India’s public digital infrastructure is increasingly being imagined not just as apps, but as capabilities. Bhashini is one such effort: a broad platform building AI language tools (translation, ASR) as Digital Public Goods. But speech recognition and translation are only as good as the data they learn from.

That’s where Project Vaani comes in—led by IISc and Google—focused on building large, open speech datasets for Indian languages that can strengthen models powering platforms like Bhashini.

Through this ecosystem, we were connected to a survey‑platform partner (I won’t name them) that pays participants for contributing speech samples. Patkai Himalaya Foundation was offered 18 districts across the North‑East to collect speech datasets using that partner’s platform. I happened to be free from work, and the weather was perfect for field travel—so I joined.

Fieldwork reality: networks, nuance, and trust

Over those weeks, we moved across universities and colleges in four states, and into remote hilltop villages where connectivity was often patchy or nonexistent. We travelled 4,000+ km by road. At times, the work felt like equal parts community outreach, operational planning, and improvisation.

One incident stayed with me: in a Nishi/Nyishi context, I learned how fragile “direct translation” can be—where the same word can point to meanings as far apart as “rice” and “skin.” It’s a small example, but it captures a big truth: language isn’t a spreadsheet column. When we treat it like one, the harm isn’t abstract—it becomes social embarrassment, misunderstanding, or mistrust.

What I’d improve next time (constructively)

  1. Time horizons must match linguistic diversity
    The North‑East has 200+ languages (often with dialect continua). Survey‑style collection may scale faster elsewhere, but here it needs more time, deeper local partnerships, and realistic targets—otherwise the effort risks becoming superficial.

  2. Conversational AI ≠ literary AI
    Everyday speech borrows heavily across languages. If “correctness” is judged by a formal, literary standard, we defeat the purpose of capturing how people actually speak. Borrowed words are not “errors”; they are living language. If a community historically didn’t have windows, it’s natural that a word for “window” gets borrowed from a dominant contact language.

  3. Gamification and friction reduction matter
    Small incentives help, but participation drops when recording feels confusing or slow. People’s attention is limited; the UX must make it easy to complete a full session without getting lost, stuck, or distracted.

  4. Automation should protect quality early
    The system should flag mismatches early—wrong language, repeated prompts, suspicious patterns—before large‑scale collection compounds errors. Pauses and filler words are natural; models can learn them, and tooling can mark or skip them intelligently.

  5. Prioritise breadth over repeating what’s already abundant
    The point of this work is not to over‑collect for widely available languages; it’s to expand coverage where data is thin. That requires deliberate diversity targets and transparent reporting.

  6. Trust is a product requirement, not a PR line
    Prompt payments matter. But so does how the app feels: it must look and behave like a legitimate civic/academic initiative—not something that resembles a crypto or “money‑making scheme.” In communities where trust is earned slowly, UI/UX and communication aren’t secondary.

  7. Hardware support needs to be real, not assumed
    The Android market is diverse. Technical issues can become a major drop‑off point when people use older phones or custom ROMs. A lightweight web form (or a low‑spec mode) would be a useful fallback for corner cases.
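To make point 4 concrete, here is a minimal sketch of what early automated quality gating could look like. All names, thresholds, and fields here are illustrative assumptions (not the actual partner platform’s design), and the language label is presumed to come from an upstream language‑ID model:

```python
from dataclasses import dataclass


@dataclass
class Recording:
    """One submitted speech clip with (assumed) upstream metadata."""
    speaker_id: str
    prompt_id: str
    duration_s: float
    lang_label: str       # label from a hypothetical language-ID model
    lang_confidence: float


def flag_recording(rec, expected_lang, seen,
                   min_s=1.0, max_s=60.0, min_conf=0.7):
    """Return quality flags for one clip; an empty list means it passes.

    `seen` is a mutable set of (speaker_id, prompt_id) pairs used to
    catch the same speaker re-recording the same prompt.
    """
    flags = []
    # Clips that are implausibly short or long are suspicious patterns.
    if not (min_s <= rec.duration_s <= max_s):
        flags.append("suspicious_duration")
    # Wrong language should be caught before collection scales up.
    if rec.lang_label != expected_lang:
        flags.append("language_mismatch")
    elif rec.lang_confidence < min_conf:
        flags.append("low_language_confidence")
    # Repeated prompts from the same speaker add little coverage.
    key = (rec.speaker_id, rec.prompt_id)
    if key in seen:
        flags.append("repeated_prompt")
    seen.add(key)
    return flags
```

Running such checks at submission time, rather than in a later cleanup pass, keeps errors from compounding; pauses and filler words would deliberately not be flagged, since they are part of natural speech.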

Why the future is offline, small, and conversational

After watching how often the internet disappears once you leave a city, I’m convinced the future of inclusive language tech in places like the North‑East is small language models for conversational AI that can run offline.

Offline capability isn’t only a technical preference—it’s dignity (works anywhere), privacy (no constant uploads), and reliability (no network, no problem). Research is increasingly aligned with this direction, from studies on offline mobile conversational agents to broader work on small language models and how they can be made capable enough for real‑world use.

Final remarks

I’m deeply grateful for the chance to contribute to this work. While much of my time was spent on the logistical grind of fieldwork, the real reward was the human connection: speaking with thousands of people about the project, introducing many to the idea of AI for the first time, and having candid conversations about the ethics of voice and data.

I’d welcome the chance to volunteer with the Patkai Himalaya Foundation again—continuing the work of ensuring every voice is heard, and understood.

Language is the most intimate thing we own. As we build the datasets of the future, let’s make sure we aren’t just collecting voices—we’re actually listening to what they’re telling us.

On a personal note: this initiative didn’t feel like “work.” I didn’t burn out. I think I finally understand nishkaam karma—how it shifts attention away from doership and toward the work itself.

Finally, if you care about cultural preservation, consider supporting the Patkai Himalaya Foundation. Their work to protect and digitise the intangible cultural heritage of North‑East India is vital, and their growing volunteer community offers a rare chance to contribute meaningfully on the ground.

Research pointers (for anyone building in this space)

Note: AI has been used to structure this blog post.
