
Urdu Stemmer - Rule Based




Stemming is the process of reducing inflected words to their stem or root. Many different inflected forms can be reduced to the same stem.

For example, in English:
1) The root act has inflected forms such as actor, acted and acting.
2) Words like fishing, fished and fisher can be reduced to the root word fish.

Similarly, various such possibilities have been identified in Urdu, and rules have been developed accordingly:

Inflected Word         Root Word
لڑکیاں                 لڑکی
بستیاں                 بستی
گاڑیاں                 گاڑی
کتابیں                 کتاب
میلے                   میلا
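
The pairs above suggest a small set of suffix rules. The following is only a minimal sketch in Python: the three rules and the names `SUFFIX_RULES` and `stem` are assumptions inferred from these five examples, not the actual rule set of the stemmer described here.

```python
# Hypothetical suffix rules inferred only from the five example pairs above.
# Each rule is (suffix to strip, replacement); longer suffixes are tried first.
SUFFIX_RULES = [
    ("یاں", "ی"),  # لڑکیاں -> لڑکی
    ("یں", ""),    # کتابیں -> کتاب
    ("ے", "ا"),    # میلے -> میلا
]


def stem(word: str) -> str:
    """Apply the first matching suffix rule; leave the word unchanged otherwise."""
    for suffix, replacement in SUFFIX_RULES:
        # Require at least one character to remain before the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    return word


examples = ["لڑکیاں", "بستیاں", "گاڑیاں", "کتابیں", "میلے"]
stems = [stem(w) for w in examples]

for inflected, root in zip(examples, stems):
    print(inflected, "->", root)
```

A real rule set would need many more rules, and words that merely happen to end in one of these suffixes would be stemmed incorrectly by this sketch.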

Approaches
Stemming algorithms fall into three categories: Rule Based, Statistical and Hybrid.
1) Rule Based approach - This approach applies a set of transformation rules to inflected words in order to strip prefixes or suffixes. E.g. if the word ends in 'ed', remove the 'ed'.
2) Statistical approach - The major drawback of the Rule Based approach is that it depends on a database. Statistical algorithms overcome this problem by learning the distribution of root elements from a collection of text, so no hand-built database has to be maintained.
3) Hybrid approach - It is a combination of the affix removal (Rule Based) and Statistical approaches.
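
The 'ed' rule from the Rule Based approach can be sketched in a few lines of Python. The suffix list, the function name `strip_suffix` and the `min_stem` length check are illustrative assumptions, chosen to cover the English examples given earlier (acted, acting, actor, fishing, fished, fisher).

```python
# Illustrative suffix list; a real rule-based stemmer uses a much larger rule set.
SUFFIXES = ("ing", "ed", "er", "or")


def strip_suffix(word: str, suffixes=SUFFIXES, min_stem: int = 3) -> str:
    """Remove the first matching suffix if at least min_stem characters remain."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["acted", "acting", "actor", "fishing", "fished", "fisher", "red"]:
    print(w, "->", strip_suffix(w))
```

The minimum stem length keeps short words such as "red" from being cut down to a single letter, which is one simple way to limit over-stemming.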

Stemming is useful in Natural Language Processing applications such as search engines, word processing and information retrieval. In this stemmer we have used the Rule Based approach: rules are applied to the various possible forms of inflected words to remove suffixes or prefixes. In Urdu, the only stemmer available to us is Assas-Band, developed by NUCES, Pakistan, which maintains an Affix Exception List and removes inflections according to its algorithm.
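
To illustrate why an exception list helps a rule-based stemmer, here is a minimal Python sketch. It is not the Assas-Band algorithm and not the implementation of the stemmer described here: the rules, the two exception words and the name `stem_with_exceptions` are assumptions chosen for illustration.

```python
from typing import FrozenSet, List, Tuple

Rule = Tuple[str, str]  # (suffix to strip, replacement)

# Hypothetical rules and exceptions, for illustration only.
RULES: List[Rule] = [("یاں", "ی"), ("یں", "")]
# Words that end in a listed suffix but are not inflected forms.
EXCEPTIONS: FrozenSet[str] = frozenset({"میاں", "نہیں"})


def stem_with_exceptions(word: str, rules: List[Rule], exceptions: FrozenSet[str]) -> str:
    """Return exception words unchanged; otherwise apply the first matching rule."""
    if word in exceptions:
        return word
    for suffix, replacement in rules:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    return word


word = "میاں"
protected = stem_with_exceptions(word, RULES, EXCEPTIONS)     # left unchanged
unprotected = stem_with_exceptions(word, RULES, frozenset())  # wrongly truncated
print(protected, unprotected)
```

Checking the exception list before any rule fires keeps the rules themselves simple, at the cost of having to maintain the list by hand.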

For more details, you may read our research paper.

To test or use our Urdu Stemmer, please follow this link.


Comments

  1. Sir, please help me: how can I connect to a server, and how can I make an Urdu dictionary? This is my project and I am working on it.

