Urdu Stemmer - Rule Based

Stemming is the process of reducing inflected words to their stem or root. Many different
inflected forms can be reduced to the same stem.

For example, in English:
1) The word act has inflected forms such as actor, acted and acting.
2) Words like fishing, fished and fisher can be reduced to the root word fish.

Similarly, in Urdu various possibilities have been identified and appropriate rules have been
developed for them:

Inflected Word    Root Word
لڑکیاں    لڑکی
بستیاں    بستی
گاڑیاں    گاڑی
کتابیں    کتاب
میلے    میلا
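
The pairs in the table suggest simple suffix rules. The Python sketch below encodes three such rules, inferred only from the rows above. The rule list, the name `URDU_SUFFIX_RULES` and the function `stem_urdu` are illustrative assumptions and are not the actual rule set of our stemmer.

```python
# Illustrative suffix rules inferred from the example table only (an assumption,
# not the stemmer's real rule set). Each rule is (suffix, replacement);
# longer suffixes are listed first so they are tried before shorter ones.
URDU_SUFFIX_RULES = [
    ("یاں", "ی"),  # e.g. لڑکیاں becomes لڑکی
    ("یں", ""),    # e.g. کتابیں becomes کتاب
    ("ے", "ا"),    # e.g. میلے becomes میلا
]


def stem_urdu(word):
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    for suffix, replacement in URDU_SUFFIX_RULES:
        # Require at least one character to remain before the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for inflected in ["لڑکیاں", "بستیاں", "گاڑیاں", "کتابیں", "میلے"]:
    print(inflected, stem_urdu(inflected))
```

Rules this short will also fire on words that merely happen to end in the same letters, which is why a practical rule-based stemmer needs many more rules and a way to handle exceptions.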

Approaches

Stemming algorithms are classified under three categories: Rule Based, Statistical and Hybrid.

1) Rule Based approach - This approach applies a set of transformation rules to inflected words
in order to cut prefixes or suffixes. E.g. if the word ends in 'ed', remove the 'ed'.

2) Statistical approach - The major drawback of the Rule Based approach is that it depends on a
database that has to be maintained. Statistical algorithms overcome this problem by finding the
distribution of root elements in the data itself, so there is no need to maintain such a database.

3) Hybrid approach - It is a combination of the Rule Based (affix removal) and Statistical approaches.
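
As a minimal sketch of the Rule Based idea in item 1, the Python snippet below strips a few English suffixes, including the 'ed' rule. The suffix list, the `MIN_STEM_LENGTH` guard and the function name `stem_english` are assumptions made for illustration and do not come from any particular stemmer.

```python
# Hypothetical suffix list for illustration; suffixes are checked in this order.
ENGLISH_SUFFIXES = ("ing", "ed", "er")

# Assumed guard: do not strip if fewer than this many characters would remain,
# so that short words such as "red" are left alone.
MIN_STEM_LENGTH = 3


def stem_english(word):
    """Remove the first matching suffix if the remaining stem is long enough."""
    for suffix in ENGLISH_SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word


for word in ["fishing", "fished", "fisher", "acted", "red"]:
    print(word, stem_english(word))
```

The length guard is one simple way to limit over-stemming. It does not replace a proper list of exceptions.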

Stemming is useful in Natural Language Processing applications such as search engines, word
processing and information retrieval. In this stemmer we have applied the Rule Based approach, in
which rules are applied to the various possible forms of inflected words to remove suffixes or
prefixes. For Urdu, the only stemmer available to us is Assas-Band, developed by NUCES, Pakistan,
which maintains an Affix Exception List and removes inflections according to its algorithm.
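
The general idea of an exception list can be sketched in Python as follows. This is not the Assas-Band algorithm and not our own implementation. The English word list, the single 'ed' suffix and the function name `stem_with_exceptions` are hypothetical, chosen only to show how a lookup can stop a rule from firing on words that are already roots.

```python
# Hypothetical exception list: words that end in a known suffix
# but are already roots and must not be stripped.
SUFFIX_EXCEPTIONS = {"hundred", "sacred", "seed"}

# Hypothetical suffix list, kept to one entry for clarity.
SUFFIXES = ("ed",)


def stem_with_exceptions(word):
    """Return the word unchanged if it is a listed exception, else strip a suffix."""
    if word in SUFFIX_EXCEPTIONS:
        return word
    for suffix in SUFFIXES:
        # Require at least one character to remain before the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


for word in ["acted", "fished", "hundred", "seed"]:
    print(word, stem_with_exceptions(word))
```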

For more details, you may read our research paper:

http://aclweb.org/anthology//C/C12/C12-3034.pdf

To test or use our Urdu Stemmer, please follow this link:

http://h2p.learnpunjabi.org/ust.aspx
