GUIDE: Dr. Rajeev Sangal.
This project was my first step into the realm of research. It was a course project of the Programming Utilities and Scripts course in the third semester of Undergraduate study. I continued working on this project as an overload project for the 4th semester. The product developed in course of this project was bought by the Language Technologies Research Center (LTRC), which is headed by Prof. Rajeev Sangal.
Text categorization involves categorizing input text into one of previously defined categories. Each category has a set of pre-classified files, which are used as the basis for classification, by extracting category-defining features from them. These category-defining features are then used for the classification of unknown input texts. If the categories are fairly distinct, the system performs well and gives accurate results. Search engines to classify web sites and compile web directories use such systems.
This system passes through two phases - a learning phase and a classification phase. In the learning
phase category specific data is processed one category at a time. The texts of
a category are processed to extract its key terms - the key words and phrases
that are representative of the texts that belong to that category. The
extraction of key terms should allow for errors in the input files, therefore a
term should be considered to be a key term only if it has a certain minimum
number of occurrences. However, frequency of occurrence should not be the only
criterion, for, commonly occurring words of a language such as prepositions and
articles have a very high rate of occurrence, but have no categorization value
and hence should be discarded. The key terms are then assigned weights. This is
to emphasize the importance of some terms in the classification of a category.
Terms like quasars, galaxies and black holes are terms that one would find only
in texts on Astronomy and hence a lot of importance should be given to these
terms. These weights are computed using heuristics that take into consideration
factors like the number of occurrences of the term and their respective density
of occurrence. The final list of words and their weights, compiled out of a
category form the characterization set for that category. The classification
phase involves finding the category that has characteristics similar to the
text to be classified. The quality of a match was computed using a similarity
measure.
The system was implemented in Perl by a team of three students. The system was trained on a corpus of Hindi and Telugu texts. Several approaches had been tried out. One of them is described below.
METHODOLOGY
1. A list of all key terms and their statistics was constructed to get a table of the following form
|
Term |
Number of times the term has occurred in the corpus. |
Cumulative Frequency. |
Cumulative Frequency: The percentage of the corpus covered by a term and the ones above it in the list.
2. The stop and rare terms were removed to compile a list of terms that would be considered for categorization.
Stop Terms: Certain terms occurred frequently in all documents of all categories
and have no classification value. 65% of the corpus was covered by these stop
terms. Hence these terms were ignored.
Ignored Terms: Certain terms occurred very rarely, about once or twice in the corpus.
These terms were also ignored.
Considered Terms: These are the terms, which were used for categorization. Their
cumulative coverage lay between 65 % and 95 %.
3. The corpus texts were then separated into two sets - one for training and the other for testing. In each category 90% of texts were used as Training data and the remaining 10 % for Testing.
4. The Normalized Frequency of a considered term in each input text was computed using the expression -
NFt = (f * Fs) / S
Where, NFt: Normalized Frequency.
Fs: Average size of a text in the corpus,
f: Frequency of the term in the text,
S: Size of the text,
For each considered term, its Normalized Frequency in each category was computed and a table of the following form was constructed.
|
Term |
TNfc |
NumTextsinCatt. |
Where, TNfc: Sum of the normalized frequencies of a term in all texts of that category,
NumTextsinCatt: Number of texts of the category in which the term occurs.
5.
The
weights for each term for a category were computed using the following
expression.
Wct = (TNfc * (k - 1) * NumTextsinCatt)/ Oft
Where Wct: Weight of a term in a category,
k: Number of categories in which the term occurs,
Oft: Number of texts outside the
category in which the term occurs.
6. The Normalized Weight of each term in a category was then computed.
NWct = Wct
/ RMSc
where NWct: Normalized
Weight of a Term in a category.
RMSc: root mean square of the weights
of all the terms in the category.
7. Categorization
To categorize an unknown text the following steps were performed.
i) The normalized frequencies (NFt-u) of the terms in the input text were computed as in step 4.
ii) For each category, the category score ( Sc ) was calculated as
Sc
= ∑ (Nft-u) * NWct
iii) The category with highest category score is declared as the one to which the unknown text belongs.