Tuesday, May 24, 2011

Counting the number of documents in PHP

<?php
echo countFiles('/usr/local/'); // outputs 27

function countFiles($dir){
    $files = array();
    $directory = opendir($dir);
    // readdir() returns false once there are no more entries
    while(($item = readdir($directory)) !== false){
        // We filter the elements that we don't want to appear ".", ".." and ".svn"
        if(($item != ".") && ($item != "..") && ($item != ".svn")){
            $files[] = $item;
        }
    }
    closedir($directory);
    $numFiles = count($files);
    return $numFiles;
}
?>
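
The function above counts every directory entry, so subdirectories are included in the total. If only regular files should count as documents, a minimal sketch along the following lines might work. The name countRegularFiles is an assumption made for illustration and is not part of the original code.

```php
<?php
// Hypothetical helper: counts only regular files directly inside $dir.
// Subdirectories and the "." and ".." entries are skipped.
function countRegularFiles($dir) {
    if (!is_dir($dir)) {
        return 0; // not a directory, nothing to count
    }
    $entries = scandir($dir);
    if ($entries === false) {
        return 0; // directory could not be read
    }
    $count = 0;
    foreach ($entries as $item) {
        if ($item === '.' || $item === '..') {
            continue;
        }
        // is_file() needs the full path, not just the entry name
        if (is_file($dir . DIRECTORY_SEPARATOR . $item)) {
            $count++;
        }
    }
    return $count;
}
```

It can be called the same way as countFiles, for example echo countRegularFiles('/usr/local/'); the number printed depends on the machine.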
 

Information retrieval and tokenization in PHP

Information retrieval (IR) is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching relational databases and the World Wide Web. There is overlap in the usage of the terms data retrieval, document retrieval, information retrieval, and text retrieval, but each also has its own body of literature, theory, praxis, and technologies. IR is interdisciplinary, based on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, and statistics.[Wiki]
Tokenization is the first preprocessing step in information retrieval. Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is useful both in linguistics (where it is a form of text segmentation), and in computer science, where it forms part of lexical analysis.[Wiki]
The following code shows the basic steps of tokenization in PHP. It is easy to follow and can be run directly, but the file must be placed in the web server's document root, e.g. htdocs for XAMPP or LAMPP.
OK, this is the code.
<?php
// Tokenization Function
 function tokenization($text){
  // Removing punctuation in the text.
  $text = preg_replace('/[?!.,()*]|[-]|\'/','', $text);

  // Convert text to lower case
  $text = strtolower(trim($text));

  // Tokenization: split on whitespace and skip empty tokens
  $word = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

  // Count how many times each token occurs
  $freq = array_count_values($word);

  // Sort the results of tokenization based on the largest frequency
  arsort($freq);

  // Returns the result of Tokenization
  return $freq;

 }

 $news = "Information retrieval (IR) is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching relational databases and the World Wide Web. There is overlap in the usage of the terms data retrieval, document retrieval, information retrieval, and text retrieval, but each also has its own body of literature, theory, praxis, and technologies. IR is interdisciplinary, based on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, and statistics.";

 $result = tokenization($news);

 // The result in table
 echo "<table>
<tr><th>Result</th><th>News</th></tr>
<tr><td><ol>";
 foreach($result as $key => $val) {
  echo "<li>$key = $val</li>";
 }
 echo "</ol></td>
<td>$news</td></tr>
</table>";
?>
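
A shorter input makes the result easier to check by hand. The sketch below applies the same steps (strip punctuation, lower-case, split into tokens, count, sort by frequency) to a single sentence. The helper name tokenFrequencies and the sample sentence are assumptions for illustration only.

```php
<?php
// Hypothetical helper (not from the original post): strips the same
// punctuation as the code above, lower-cases the text, splits it on
// whitespace, and counts how often each token occurs.
function tokenFrequencies($text) {
    $text = preg_replace('/[?!.,()*\'-]/', '', $text);
    $text = strtolower(trim($text));
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $freq = array_count_values($tokens);
    arsort($freq); // highest frequency first
    return $freq;
}

$freq = tokenFrequencies("To be, or not to be.");
foreach ($freq as $token => $count) {
    echo "$token = $count\n";
}
```

For this sentence, "to" and "be" each occur twice and "or" and "not" once. The relative order of tokens with equal counts may differ between PHP versions.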

Monday, May 16, 2011

Preparing the proposal for my thesis

I am writing the proposal for my thesis. I am working hard on it because the proposal is the main thing that gives the thesis great value.