Project Overview

The language of a region is an intrinsic expression of its culture. As has been aptly said, “Language is not just words. It’s a culture, a tradition, a unification of a community, a whole history that creates what a community is.” All of this is embodied in a language.

Guru Angad Dev Ji developed the Gurmukhi script for recording sacred literature. Gurmukhi means “from the mouth of the Guru.” Owing to increasing Western influence, Gurmukhi is losing its identity, and protecting Punjabi culture, its language in particular, is imperative.

Project Goals

Punjabi Speech to Text

Among Indic languages, the most studied in speech recognition include Hindi, Bengali, and a few others; Punjabi ranks near the bottom of this list. Moreover, speech corpora for those languages are readily available in the public domain, which is not the case for Punjabi. With this in mind, Sabudh introduced a Speech Data Collection app that lets volunteers contribute towards building a rich resource by recording short utterances of Punjabi literature.

The aim of this platform is to build a publicly available “speech to text” corpus to fuel research and development of Automatic Speech Recognition (ASR) models for the Punjabi/Gurmukhi language. The web application can be accessed at panjabi.ai.

Along with data acquisition, state-of-the-art methods are being incorporated into the application to build an optimal speech-to-text pipeline specifically for Punjabi.
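As an illustration, the following is a minimal, hypothetical sketch of how a CTC-based acoustic model (for example Wav2Vec2 via the Hugging Face transformers library) could transcribe a single Punjabi utterance. The checkpoint name and file paths are placeholders, not artifacts produced by this project.

```python
# Minimal sketch: transcribing one Punjabi audio clip with a CTC acoustic model.
# "path/to/punjabi-wav2vec2" is a placeholder for whatever fine-tuned checkpoint
# the pipeline ends up using.
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("path/to/punjabi-wav2vec2")
model = Wav2Vec2ForCTC.from_pretrained("path/to/punjabi-wav2vec2")

waveform, sample_rate = torchaudio.load("utterance.wav")
if sample_rate != 16_000:  # Wav2Vec2-style models expect 16 kHz audio
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

# First channel only; recordings are assumed to be mono.
inputs = processor(waveform[0].numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits  # frame-level character probabilities

predicted_ids = torch.argmax(logits, dim=-1)    # greedy CTC decoding
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)                            # Gurmukhi text for the utterance
```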

Natural Language Processing Toolkit

Natural Language Processing (NLP) is an umbrella term for the automated processing of human language. India is undoubtedly rich in languages but relatively poor in digitized resources for them, which makes building such an NLP toolkit all the more challenging. The toolkit aims to include language models trained on Punjabi text corpora, such as a named entity recognizer and a relationship classifier.

Training such models requires a wealth of annotated datasets, customized for each module. Doccano is one of the most versatile annotation tools for Natural Language Processing datasets, and an instance has been set up on the server for custom annotation.
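To illustrate how Doccano annotations can feed downstream training, the hypothetical sketch below converts a Doccano JSONL export into spaCy's binary training format. The file names and the exact export keys are assumptions and may differ across Doccano versions.

```python
# Minimal sketch: converting a Doccano JSONL export into spaCy training data.
# Assumes each line looks like {"text": "...", "label": [[start, end, "PER"], ...]};
# adjust the key names if your Doccano version exports a different span format.
import json
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("xx")   # multi-language pipeline used only for tokenization
db = DocBin()

with open("doccano_export.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        doc = nlp.make_doc(record["text"])
        spans = []
        for start, end, label in record["label"]:
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:  # skip spans that do not align with token boundaries
                spans.append(span)
        doc.ents = spans
        db.add(doc)

db.to_disk("train.spacy")  # ready to be used as a spaCy training corpus
```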

Data collection: The data currently available is sourced from Punjabi news articles, a Punjabi grammar book, and a couple of FIR reports (image data that was converted to text using Optical Character Recognition).
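For the FIR reports, a minimal OCR step along the following lines could be used, assuming Tesseract with its Punjabi (pan) traineddata is installed; the file names are placeholders.

```python
# Minimal sketch: extracting Gurmukhi text from a scanned FIR page with Tesseract OCR.
from PIL import Image
import pytesseract

page = Image.open("fir_page_01.png")
text = pytesseract.image_to_string(page, lang="pan")  # "pan" = Punjabi traineddata

with open("fir_page_01.txt", "w", encoding="utf-8") as out:
    out.write(text)
```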

Aim

We aim to identify named entities in Punjabi text. Punjabi articles are annotated in Doccano, and the output is fed into a sequence model to train a Named Entity Recognizer for Punjabi. This in turn requires Punjabi word vectors, which are being built with a custom fastText model (a minimal training sketch follows below). An AllenNLP model with Bi-LSTM and CRF layers is used to identify named entities, and a custom spaCy model with a span-based approach to Named Entity Recognition is also being developed for the Punjabi language.
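The following is a minimal sketch, using gensim, of how custom fastText word vectors for Punjabi might be trained. The corpus file, hyper-parameters, and example tokens are illustrative assumptions rather than the project's actual configuration.

```python
# Minimal sketch: training custom Punjabi word vectors with fastText (gensim).
# "punjabi_corpus.txt" is a placeholder: one sentence per line, whitespace-tokenized.
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

sentences = LineSentence("punjabi_corpus.txt")   # streams the corpus line by line
model = FastText(
    sentences,
    vector_size=300,   # dimensionality of the word vectors
    window=5,          # context window size
    min_count=2,       # ignore very rare tokens
    sg=1,              # skip-gram objective
    epochs=10,
)
model.save("punjabi_fasttext.model")

# Because fastText builds vectors from character n-grams, it can still produce
# embeddings for unseen or inflected Gurmukhi word forms.
vector = model.wv["ਪੰਜਾਬੀ"]
similar = model.wv.most_similar("ਪੰਜਾਬੀ", topn=5)
```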

Contextual Search Engine on Gurbani

A context-based search system for Punjabi text uses a deep learning approach to search a collection of Punjabi documents and find those that are similar in context to the search query. The results of such a search are more relevant to users than those of systems that rely solely on word-to-word matching.

A custom lemmatizer is needed for Gurbani because of its rich vocabulary, which draws words from various languages and dialects. The goal is to develop an algorithm that uses multi-layer artificial neural networks to convert text documents into embeddings, which in turn can be used to calculate similarity between documents.

Punjabi Lemmatizer

Available dictionaries are used to build a database of words and their corresponding inflected forms. This database is then used to lemmatize documents with a brute-force lemmatizer that performs word-to-word matching and replacement. Stop-words are also identified and removed from the documents in the preprocessing phase. Doc2Vec, a well-known document-to-vector model, is then used with a deep learning approach to create document vectors. These vectors retain the context of the documents by learning word associations among their terms, and are stored in the database for later calculations. Search results are ranked by the cosine similarity between the input document vector and the stored document vectors.
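The sketch below illustrates this pipeline end to end: brute-force dictionary lemmatization, stop-word removal, Doc2Vec training, and cosine-similarity ranking. The lemma dictionary, stop-word list, corpus file, and hyper-parameters are placeholder assumptions, not the project's actual resources.

```python
# Minimal sketch of the retrieval pipeline: dictionary-based lemmatization,
# Doc2Vec document vectors, and cosine-similarity ranking (gensim).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

lemma_map = {"ਲਿਖਿਆ": "ਲਿਖ"}      # inflected form -> lemma (built from dictionaries)
stop_words = {"ਦੇ", "ਦੀ", "ਹੈ"}    # example stop-words removed during preprocessing

def preprocess(text):
    tokens = text.split()
    tokens = [lemma_map.get(t, t) for t in tokens]   # brute-force lemma lookup
    return [t for t in tokens if t not in stop_words]

with open("gurbani_documents.txt", encoding="utf-8") as f:
    docs = [preprocess(line) for line in f if line.strip()]

tagged = [TaggedDocument(words, [i]) for i, words in enumerate(docs)]
model = Doc2Vec(tagged, vector_size=100, window=5, min_count=2, epochs=40)

def search(query, topn=5):
    query_vec = model.infer_vector(preprocess(query))
    # most_similar with a raw vector ranks stored document vectors by cosine similarity
    return model.dv.most_similar([query_vec], topn=topn)

print(search("ਸਤਿਗੁਰ"))
```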
The Lingua Franca project aims at building a rich reservoir of Gurmukhi resources.