Sunday, June 12, 2011

My Proposal

Introduction
Text clustering is the process of grouping opinions, reviews, or comments on a particular topic. Clustering can be done with different methods such as k-means, vector space models, or other machine learning algorithms used in language processing tasks. Opinion clustering draws on ideas from linear algebra as well as the vector space model, which is a well-known method in text processing.

The clustering task has been studied by different researchers, such as [1], [2], and [3]. [1] used a modified vector space model in which the inverse document frequency is replaced with the document frequency only.

Similarly, [2] used the vector space model for blog analysis. They combined different clustering algorithms with the vector space model and found that the fuzzy-based model performed best.

[3] used the vector space model for clustering and applied the kNN method to the clustering problem.

Thus far, scoring has hinged on whether or not a query term is present in a zone within a document. We take the next logical step: a document or zone that mentions a query term more often has more to do with that query and therefore should receive a higher score. [4]

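Denoting as usual the total number of documents in a collection by $N$, define the inverse document frequency (idf) of a term $t$ as follows:
$\mathrm{idf}_t = \log\left(\frac{N}{\mathrm{df}_t}\right)$
where $\mathrm{df}_t$ is the number of documents that contain $t$. The tf-idf weighting scheme assigns to term $t$ a weight in document $d$ given by
$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$ [5]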
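
As a rough illustration of the two formulas above, the following Python sketch computes idf and tf-idf for the four sample sentences used as test documents later in this proposal. Whitespace tokenization, the natural logarithm, and the function names `idf` and `tf_idf` are assumptions made for this example; the proposal itself does not fix any of them.

```python
import math
from collections import Counter


def idf(term, docs):
    """idf_t = log(N / df_t), where docs is a list of token lists."""
    n_docs = len(docs)
    df = sum(1 for doc in docs if term in doc)  # documents containing the term
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log(n_docs / df)  # natural log is an assumption


def tf_idf(term, doc, docs):
    """tf-idf_{t,d} = tf_{t,d} * idf_t, with tf as the raw term count."""
    tf = Counter(doc)[term]
    return tf * idf(term, docs)


# Sample sentences, split on whitespace (an assumed tokenization).
docs = [
    "यो घर राम्रो छ".split(),
    "यो घर सुन्दर छ".split(),
    "यो घर मनमोहक छ".split(),
    "म भात खान्छु".split(),
]

if __name__ == "__main__":
    # "घर" occurs in 3 of 4 documents, "राम्रो" in only 1, so the rarer
    # term receives the larger weight.
    print(tf_idf("घर", docs[0], docs))
    print(tf_idf("राम्रो", docs[0], docs))
```

A term that appears in every document would get an idf of zero under this definition, which is why common function words contribute little to the document vectors.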

The tf-idf value is updated by adding the membership value from the Gaussian membership function, and clustering is then applied using a threshold value.
For example: TFIDF(new) = TFIDF(traditional) + semantic score
The semantic score is obtained from the Gaussian membership function and is added to the weight only if the term belongs to the fuzzy set.
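
The update rule can be sketched in Python as follows. The proposal does not say what quantity is fed into the Gaussian membership function or how membership in the fuzzy set is decided, so everything beyond the rule itself is a hypothetical choice for illustration: the input `x` (imagined here as a semantic distance between a term and a reference concept), the parameters `center` and `sigma`, and the membership `cutoff`.

```python
import math


def gaussian_membership(x, center=0.0, sigma=1.0):
    """Standard Gaussian membership function: exp(-(x - c)^2 / (2 * sigma^2))."""
    return math.exp(-((x - center) ** 2) / (2 * sigma ** 2))


def enhanced_tfidf(tfidf, x, center=0.0, sigma=1.0, cutoff=0.5):
    """TFIDF(new) = TFIDF(traditional) + semantic score.

    The semantic score is the membership value, added only when the term is
    treated as belonging to the fuzzy set (membership >= cutoff, an assumed rule).
    """
    membership = gaussian_membership(x, center, sigma)
    semantic_score = membership if membership >= cutoff else 0.0
    return tfidf + semantic_score


if __name__ == "__main__":
    print(enhanced_tfidf(0.3, x=0.0))  # full membership, weight is raised
    print(enhanced_tfidf(0.3, x=3.0))  # outside the fuzzy set, weight unchanged
```

With these assumed parameters, a term at the center of the fuzzy set gains the maximum score of 1, while a term far from it keeps its traditional tf-idf weight.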

Suppose we have the following test documents to be clustered using this method:



Documents    Content
1    यो घर राम्रो छ (This house is nice)
2    यो घर सुन्दर छ (This house is beautiful)
3    यो घर मनमोहक छ (This house is charming)
4    म भात खान्छु (I eat rice)
Figure 1 sample table for clustering








Figure 2  Resultant clustering


Figure 3 Flow chart of how the operation works

     The process above is divided into five steps:
1)    Calculate the TF of each term in each document.
2)    Add the semantic score to each TF and perform the tf*idf operation.
3)    Compute the cosine similarity of each document with the query vector.
4)    Apply the grouping rule.
5)    Test the condition of the grouping rule and check whether the document falls in the same cluster or not.
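
The five steps can be sketched in Python as below. This is only a minimal outline under several assumptions that the proposal does not spell out: documents are tokenized on whitespace, the query vector is taken to be one of the documents in the collection, the grouping rule is a simple threshold on cosine similarity, and `semantic_score` is a hypothetical callback that defaults to zero (which reduces the method to the traditional vector space model).

```python
import math
from collections import Counter


def build_vectors(docs, semantic_score=lambda term: 0.0):
    """Steps 1 and 2: term frequency, plus semantic score, times idf."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # step 1: TF of each term in the document
        vectors.append({
            term: (tf[term] + semantic_score(term)) * math.log(n_docs / df[term])
            for term in tf  # step 2: (tf + semantic score) * idf
        })
    return vectors


def cosine(u, v):
    """Step 3: cosine similarity between two sparse vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_by_query(docs, query_index, threshold, semantic_score=lambda term: 0.0):
    """Steps 4 and 5: put documents similar enough to the query in one group."""
    vectors = build_vectors(docs, semantic_score)
    query = vectors[query_index]
    same_cluster, other = [], []
    for number, vector in enumerate(vectors, start=1):
        if cosine(query, vector) >= threshold:  # assumed grouping rule
            same_cluster.append(number)
        else:
            other.append(number)
    return same_cluster, other


docs = [
    "यो घर राम्रो छ".split(),
    "यो घर सुन्दर छ".split(),
    "यो घर मनमोहक छ".split(),
    "म भात खान्छु".split(),
]

if __name__ == "__main__":
    # Document 1 is used as the query; 0.1 is an arbitrary threshold.
    print(group_by_query(docs, query_index=0, threshold=0.1))
    # → ([1, 2, 3], [4])
```

Without a semantic score, documents 1 to 3 are similar only through their shared words, so their cosine similarity stays low (about 0.11 here). That gap is what the added semantic score is meant to close.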




Problem Definition
Clustering text with the normal vector space model cannot handle the semantic relevancy of words. Because the traditional vector space model lacks this feature, an enhanced vector space model is proposed. Research on opinion mining has not yet been performed in Nepal, which makes it a leading task for Nepali researchers who want to work on opinion clustering in the Nepali language. Algorithms that work for English may not work for other languages. Clustering lets the analyst focus on the clusters that contain the largest number of documents, which saves time when analyzing opinions.

Objective
    The main objectives of this research work are:

To cluster Nepali texts
To find the semantic and syntactic relevancy of the text
To observe the relation between the semantic score and the number of clusters







Research Methodology
4.1    Data preparation
For research purposes, the test data are written using Nepali Unicode software and are tested using PHP/MySQL.

       
4.2    Performance evaluation
For the performance evaluation, measures such as the Rand index, precision, and recall are taken into consideration.
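
One common way to compute these measures for a clustering result is to count pairs of documents, as in the Python sketch below. The pairwise definitions of precision and recall and the example labels are assumptions chosen for illustration; the proposal does not state which variant it will use.

```python
from itertools import combinations


def pair_counts(true_labels, pred_labels):
    """Count document pairs as TP, FP, FN, TN by comparing cluster membership."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1  # together in both the reference and the result
        elif same_pred:
            fp += 1  # wrongly grouped together
        elif same_true:
            fn += 1  # wrongly separated
        else:
            tn += 1  # apart in both
    return tp, fp, fn, tn


def evaluate(true_labels, pred_labels):
    """Return (Rand index, pairwise precision, pairwise recall)."""
    tp, fp, fn, tn = pair_counts(true_labels, pred_labels)
    total = tp + fp + fn + tn
    rand_index = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return rand_index, precision, recall


if __name__ == "__main__":
    reference = [0, 0, 0, 1]  # hypothetical gold clusters for four documents
    predicted = [0, 0, 1, 1]  # hypothetical clustering output
    print(evaluate(reference, predicted))
```

For the hypothetical labels above, one pair is correctly grouped, one is wrongly grouped, two are wrongly separated, and two are correctly separated, giving a Rand index of 0.5, a precision of 0.5, and a recall of 1/3.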
   
Expected output

The expected output will look like this.
Suppose we are given the following documents:

Documents    Words
1    विशाल (huge)
2    ठुलो (big)
3    नराम्रो (bad)
4    खराव (bad)

    Figure 4 sample table for clustering

The enhanced vector space model will group these documents semantically into at most two clusters.



Figure 5 resultant clustering





Working Schedule

Activities (weeks)

    Study and analysis
    Data collection
    Implementation
    Testing
    Documentation
    Review
    Presentation


Figure 6 Working Schedule






References
[1] Abdul-Rub, Mohammed Said, "A modified vector space model for protein retrieval", IJCSNS, Vol. 7, No. 9, 2007.
[2] Ho, Chi-Shu, "Blog analysis with Fuzzy TFIDF", Master's Project, San Jose State University, 2007.
[3] Jaiswal, Mayank Prakash, "Clustering Blog Information", Master's Project, Paper 36, 2007.
[4] Shin, Kwangcheol, Abraham, Ajith, and Han, Sang Yong, "Improving kNN Text Categorization by Removing Outliers from Training Set", 2006.
[5] Emre Esin Yunus, "Improvement of corpus-based word similarity using vector space model", Master's Thesis, Middle East University, 2009.