Named Entity Recognition System for Urdu
Named Entity Recognition (NER) is the task of finding named entities in text, such as person names, location names, brand names, abbreviations, dates and times, and classifying them into predefined categories. NER plays a major role in Natural Language Processing (NLP) applications such as Information Extraction, Machine Translation and Question Answering, so the accuracy of an NER system is very important. An NER system can also serve individual needs: for example, a company manager may want to find all the names mentioned in a specific text document. We used the Rule Based approach and developed a set of rules to extract the named entities from a given Urdu text.


Approaches to NER

1 Rule Based approach: Rules are developed to identify named entities (NEs) in text. The rules are heuristic and language specific, so the developer needs good knowledge of the target language, and building such a system is time consuming. The quality of the results depends directly on the quality of the rules.
2 Statistical approach: The Statistical approach is also known as the Machine Learning approach. It is a faster way to develop an NER system. The system is trained on an annotated training data set in a specified format, and its accuracy depends on that training data, so the system is usually trained on a large set of annotated data. Machine Learning models such as HMM, CRF and MaxEnt are used for NER systems.
3 Hybrid system: A Hybrid system combines the Rule Based approach and the Statistical approach. It is developed using statistical tools as well as linguistic rules. Combining both approaches can make a system more accurate and efficient.
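
To make the Rule Based approach above more concrete, here is a minimal sketch in Python. It is not the rule set used by the system described in this post: the title list, the location gazetteer, the date pattern and the function name `tag_entities` are all hypothetical and chosen only for illustration. It shows three typical kinds of rule: a pattern rule for dates, a gazetteer lookup for locations, and a context rule that tags the token following a title as a person name.

```python
import re

# Hypothetical cue lists for illustration only; these are NOT the rules
# or lexicons of the system described in this post.
PERSON_TITLES = {"ڈاکٹر", "جناب"}         # "Doctor", "Mr."
LOCATION_GAZETTEER = {"لاہور", "کراچی"}   # "Lahore", "Karachi"
DATE_PATTERN = re.compile(r"^\d{1,2}[-/]\d{1,2}[-/]\d{4}$")


def tag_entities(text):
    """Return a list of (token, tag) pairs using three simple rules."""
    tokens = text.split()
    tags = ["O"] * len(tokens)  # "O" marks a token outside any entity
    for i, token in enumerate(tokens):
        if DATE_PATTERN.match(token):
            # Pattern rule: tokens shaped like 12/03/2012 are dates.
            tags[i] = "DATE"
        elif token in LOCATION_GAZETTEER:
            # Gazetteer rule: known place names are locations.
            tags[i] = "LOCATION"
        elif token in PERSON_TITLES and i + 1 < len(tokens):
            # Context rule: the token right after a title is a person name.
            tags[i + 1] = "PERSON"
    return list(zip(tokens, tags))
```

A real rule set needs many more rules and much larger word lists, which is why this approach takes so long to develop.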

We have used Rule Based Approach:
Developing an NER system with the Rule Based approach is a time consuming task. The approach is suitable only when you know the target language well and have sufficient knowledge of its linguistic rules, such as its grammar. A system built on well-crafted rules generally yields good results. On the other hand, the Statistical approach provides many tools for developing an NER system, such as HMM, CRF, SVM and MaxEnt, and with the help of these tools development is rapid compared to the Rule Based approach.
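
For contrast, the sketch below shows why statistical development can be faster: the tagger is learned from annotated data instead of being written by hand. It is only a most-frequent-tag baseline, far simpler than HMM, CRF, SVM or MaxEnt, and the toy training data, the (token, tag) format and the function names are assumptions made for this example.

```python
from collections import Counter, defaultdict


def train_baseline(annotated_sentences):
    """Learn each token's most frequent tag from (token, tag) pairs."""
    counts = defaultdict(Counter)
    for sentence in annotated_sentences:
        for token, tag in sentence:
            counts[token][tag] += 1
    return {token: tag_counts.most_common(1)[0][0]
            for token, tag_counts in counts.items()}


def predict(model, tokens, default_tag="O"):
    """Tag tokens with the learned tag, falling back to default_tag."""
    return [model.get(token, default_tag) for token in tokens]


# Toy annotated data in a hypothetical (token, tag) format.
training_data = [
    [("Ahmed", "PERSON"), ("visited", "O"), ("Lahore", "LOCATION")],
    [("Lahore", "LOCATION"), ("is", "O"), ("large", "O")],
]
model = train_baseline(training_data)
print(predict(model, ["Ahmed", "saw", "Lahore"]))
```

No linguistic rule is written anywhere in this code; everything the model knows comes from the annotations, which is also why its accuracy depends so heavily on the size and quality of the training data.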

To know more about this system, please see my research paper published at COLING 2012.

I developed this system in C# with ASP.NET using Visual Studio 2010. It is free to use, so please check it out and give me your valuable feedback.
