Hindi to Punjabi Machine Translation System

The Hindi to Punjabi Machine Translation System was developed by Dr. Vishal Goyal and Dr. G.S. Lehal using a direct, rule-based approach. It uses several large lexical resources to map source-language words to target-language words.

In general, if two languages are structurally similar, in particular as regards lexical correspondences, morphology and word order, the case for abstract syntactic analysis is less convincing. Since the present research work deals with a pair of closely related languages, a direct translation system is the obvious choice. The overall system architecture, shown below, is adopted for the Hindi to Punjabi Machine Translation System. The system is divided into three stages: Preprocessing, Translation Engine, and Post Processing. The steps of this architecture are described below.

Preprocessing

The preprocessing stage is a collection of operations applied to the input data to make it processable by the translation engine. In our current work, we have performed the following preprocessing steps:

Text Normalization
Replacing Collocations
Replacing Proper Nouns
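
The three preprocessing steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the system's implementation: the post does not say what text normalization involves, so the sketch assumes Unicode NFC normalization plus whitespace cleanup, and the tables COLLOCATIONS and PROPER_NOUNS, with their single entries, are hypothetical stand-ins for the real resources.

```python
import re
import unicodedata
from typing import List, Mapping

# Hypothetical sample tables (Hindi -> Punjabi). The real system's resources
# are much larger and are not reproduced in the post.
COLLOCATIONS = {"उत्तर प्रदेश": "ਉੱਤਰ ਪ੍ਰਦੇਸ਼"}
PROPER_NOUNS = {"दिल्ली": "ਦਿੱਲੀ"}


def normalize(text: str) -> str:
    """Assumed normalization: Unicode NFC, then collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def replace_phrases(text: str, table: Mapping[str, str]) -> str:
    """Replace whole whitespace-delimited phrases, longest phrase first."""
    for source in sorted(table, key=len, reverse=True):
        key = unicodedata.normalize("NFC", source)
        # (?<!\S) and (?!\S) require whitespace or a string edge on both sides,
        # so a phrase is never replaced inside a longer word.
        pattern = r"(?<!\S)" + re.escape(key) + r"(?!\S)"
        text = re.sub(pattern, lambda _m, target=table[source]: target, text)
    return text


def preprocess(text: str) -> List[str]:
    """Normalize, replace collocations and proper nouns, then tokenize."""
    text = normalize(text)
    text = replace_phrases(text, COLLOCATIONS)
    text = replace_phrases(text, PROPER_NOUNS)
    return text.split()
```

Replacing collocations before tokenization keeps a multi-word unit from being translated word by word later on; tokens that are already in the target script can simply be passed through by the translation engine.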

Translation Engine

The translation engine is responsible for translating each token obtained from the previous step. It uses various lexical resources to find a match for a given token in the target language. The following describes how a token is passed through the various modules.

Analyzing the word for Translation/Transliteration: Each token obtained in the previous stage is passed through the following steps.

Identifying Titles: The token is checked to see whether it is a title such as प्रो (prō), श्रीमती (shrīmtī), etc. If the current token is a title, the token that follows it should be transliterated rather than translated.
Identifying Surnames: The token is checked to see whether it is a surname such as अग्रवाल (agrvāl), ओबेरॉय (ōbērāy), etc. If the current token is a surname, the token that precedes it should be transliterated rather than translated.
Lexicon Lookup: If the token is neither a title nor a surname, it is looked up in the lexicon for a direct word-to-word translation.
Resolving Ambiguity: If the token is not present in the lexicon for direct translation, it is looked up in the database of ambiguous words. If the token is found to be ambiguous, the ambiguity is resolved with the help of n-gram language modeling. The system uses bigram and trigram databases, which contain one and two words, respectively, from the vicinity of an ambiguous word, together with the corresponding meaning for that context.
Unknown Words: If all of the above modules fail to analyze the token, it is considered a foreign or unknown word. Such words first pass through a morphological analysis phase based on the rules for inflections in Hindi words. The morphological generator generates a transliterated word using the inflectional rules and then checks the generated word against the Punjabi unigram database to confirm that it is a genuine word. If the generated word is found among the Punjabi unigrams, it is used as the translation; otherwise the token is sent to the transliteration module. The transliteration module is a major module of the system and uses rules designed specifically from the translation point of view.
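
The order in which a token moves through these modules can be sketched as below. This is a minimal illustration under several assumptions, not the system's code. All tables hold hypothetical sample entries. The n-gram context is taken to be the one or two words that follow the ambiguous word, with the trigram tried first and a default sense used when nothing matches; the post specifies none of these details. The morphological step for unknown words is omitted, and mark_for_transliteration is a placeholder that only tags the token instead of transliterating it.

```python
from typing import Callable, List

# Hypothetical miniature resources; the real lexicons are far larger.
TITLES = {"प्रो", "श्रीमती"}
SURNAMES = {"अग्रवाल"}
LEXICON = {"घर": "ਘਰ", "पानी": "ਪਾਣੀ"}
# Ambiguous word -> context-specific senses plus an assumed default sense.
AMBIGUOUS = {
    "और": {
        "default": "ਅਤੇ",
        "bigrams": {"भी": "ਹੋਰ"},             # one neighbouring word
        "trigrams": {("भी", "अधिक"): "ਹੋਰ"},  # two neighbouring words
    }
}


def mark_for_transliteration(token: str) -> str:
    """Placeholder for the transliteration module, which is not sketched here."""
    return f"[translit:{token}]"


def disambiguate(tokens: List[str], i: int) -> str:
    """Pick a sense from the following words: trigram, then bigram, then default."""
    entry = AMBIGUOUS[tokens[i]]
    following = tuple(tokens[i + 1 : i + 3])
    if len(following) == 2 and following in entry["trigrams"]:
        return entry["trigrams"][following]
    if following and following[0] in entry["bigrams"]:
        return entry["bigrams"][following[0]]
    return entry["default"]


def translate_tokens(
    tokens: List[str],
    transliterate: Callable[[str], str] = mark_for_transliteration,
) -> List[str]:
    # First pass: a title forces the next token, and a surname forces the
    # previous token, to be transliterated instead of translated.
    forced = set()
    for i, token in enumerate(tokens):
        if token in TITLES and i + 1 < len(tokens):
            forced.add(i + 1)
        if token in SURNAMES and i > 0:
            forced.add(i - 1)

    # Second pass: lexicon lookup, then ambiguity resolution, then fallback.
    output = []
    for i, token in enumerate(tokens):
        if i in forced:
            output.append(transliterate(token))
        elif token in LEXICON:
            output.append(LEXICON[token])
        elif token in AMBIGUOUS:
            output.append(disambiguate(tokens, i))
        else:
            output.append(transliterate(token))  # unknown word
    return output
```

Computing the forced positions in a separate pass keeps the surname rule simple, since it affects a token that comes before the one being examined.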

Post Processing

After all the source text has been converted to target text, some grammatical errors remain that need to be corrected. For this purpose, we have formulated rules for correcting these errors in the generated output. The rules have been implemented using regular expressions and pattern matching.
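
A rule-driven correction pass of this kind might look like the sketch below. The single rule is purely illustrative and is not taken from the system, whose actual rules the post does not list. It assumes that a word-for-word rendering can leave the postposition ਨੇ after the pronouns ਮੈਂ or ਤੂੰ, where standard written Punjabi normally omits it.

```python
import re

# Hypothetical (pattern, replacement) rules applied in order.
# (?<!\S) and (?!\S) anchor the match to whole whitespace-delimited words,
# which is more reliable than \b for scripts that use combining vowel signs.
RULES = [
    (re.compile(r"(?<!\S)(ਮੈਂ|ਤੂੰ) ਨੇ(?!\S)"), r"\1"),
]


def postprocess(text: str) -> str:
    """Apply every correction rule to the translated text, in order."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Keeping the rules in a plain list makes it easy to add or reorder corrections without touching the function that applies them.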

GUI Features of the System

Text translation from Hindi to Punjabi
Text transliteration from Hindi to Punjabi
Translating websites
Sending email in Punjabi that was originally written in Hindi

The system has been rigorously evaluated: it scored 94% on the intelligibility test and 90.84% on the accuracy test.

Architecture of Hindi To Punjabi Machine Translation System

The system is freely available to use. Web link to access the Machine Translation System: h2p.learnpunjabi.org
