Text Categorizing System for Indian Languages

 

GUIDE: Dr. Rajeev Sangal.

 

This project was my first step into the realm of research. It was a course project of the Programming Utilities and Scripts course in the third semester of Undergraduate study. I continued working on this project as an overload project for the 4th semester. The product developed in course of this project was bought by the Language Technologies Research Center (LTRC), which is headed by Prof. Rajeev Sangal.

PROBLEM DEFINITION

Text categorization involves categorizing input text into one of previously defined categories. Each category has a set of pre-classified files, which are used as the basis for classification, by extracting category-defining features from them. These category-defining features are then used for the classification of unknown input texts. If the categories are fairly distinct, the system performs well and gives accurate results. Search engines to classify web sites and compile web directories use such systems.

DESCRIPTION

This system passes through two phases - a learning phase and a classification phase. In the learning phase category specific data is processed one category at a time. The texts of a category are processed to extract its key terms - the key words and phrases that are representative of the texts that belong to that category. The extraction of key terms should allow for errors in the input files, therefore a term should be considered to be a key term only if it has a certain minimum number of occurrences. However, frequency of occurrence should not be the only criterion, for, commonly occurring words of a language such as prepositions and articles have a very high rate of occurrence, but have no categorization value and hence should be discarded. The key terms are then assigned weights. This is to emphasize the importance of some terms in the classification of a category. Terms like quasars, galaxies and black holes are terms that one would find only in texts on Astronomy and hence a lot of importance should be given to these terms. These weights are computed using heuristics that take into consideration factors like the number of occurrences of the term and their respective density of occurrence. The final list of words and their weights, compiled out of a category form the characterization set for that category. The classification phase involves finding the category that has characteristics similar to the text to be classified. The quality of a match was computed using a similarity measure.

IMPLEMENTATION

The system was implemented in Perl by a team of three students. The system was trained on a corpus of Hindi and Telugu texts. Several approaches had been tried out. One of them is described below.

 

 

 

 

METHODOLOGY

1.      A list of all key terms and their statistics was constructed to get a table of the following form

Term

Number of times the term has occurred in the corpus.

Cumulative Frequency.

Cumulative Frequency: The percentage of the corpus covered by a term and the ones above it in the list.

2.      The stop and rare terms were removed to compile a list of terms that would be considered for categorization.

Stop Terms: Certain terms occurred frequently in all documents of all categories and have no classification value. 65% of the corpus was covered by these stop terms. Hence these terms were ignored.

Ignored Terms: Certain terms occurred very rarely, about once or twice in the corpus. These terms were also ignored.

Considered Terms: These are the terms, which were used for categorization. Their cumulative coverage lay between 65 % and 95 %.

3.      The corpus texts were then separated into two sets - one for training and the other for testing. In each category 90% of texts were used as Training data and the remaining 10 % for Testing.

4.      The Normalized Frequency of a considered term in each input text was computed using the expression -

            NFt =  (f * Fs) / S

Where, NFt: Normalized Frequency.

Fs:    Average size of a text in the corpus,

f:       Frequency of the term in the text,

                        S:      Size of the text,

For each considered term, its Normalized Frequency in each category was computed and a table of the following form was constructed.

Term

TNfc

NumTextsinCatt.

Where, TNfc: Sum of the normalized frequencies of a term in all texts of that category,

             NumTextsinCatt: Number of texts of the category in which the term occurs.  

5.      The weights for each term for a category were computed using the following expression.

            Wct =   (TNfc * (k - 1) * NumTextsinCatt)/ Oft

            Where Wct: Weight of a term in a category,

 k: Number of categories in which the term occurs,

Oft: Number of texts outside the category in which the term occurs.

6.      The Normalized Weight of each term in a category was then computed.

            NWct = Wct / RMSc

            where   NWct: Normalized Weight of a Term in a category.

                        RMSc: root mean square of the weights of all the terms in the category.

7.      Categorization

            To categorize an unknown text the following steps were performed.

i)                    The normalized frequencies (NFt-u) of the terms in the input text were computed as in step 4.

ii)                   For each category, the category score ( Sc ) was calculated as

                        Sc = ∑ (Nft-u) * NWct

iii)                 The category with highest category score is declared as the one to which the unknown text belongs.