
std::UnigramTextClassifier Class Reference

A text classifier based on single characters.

#include <UnigramTextClassifier.h>

List of all members.

Public Member Functions

 UnigramTextClassifier ()
 Constructor. The classification name defaults to 'Unknown.'.

 UnigramTextClassifier (const string classification)
 Constructor taking the name of the classification.

frequency_map freqs ()
unsigned long corpus_total ()
unsigned long total ()
string classification ()
void setClassification (string &classification)
void learn (istream &in)
 Learn the frequencies of characters in a corpus read from a stream; may be called multiple times.

void learn (char *in)
 Learn the frequencies of characters in a corpus read from a file; may be called multiple times.

void dump (ostream &out)
 Dump the frequencies of characters in a corpus to a stream.

void dump (char *out)
 Dump the frequencies of characters in a corpus to a file.

void read (istream &in)
 Read the frequencies of characters in a corpus from a stream; may be called multiple times.

void read (char *in)
 Read the frequencies of characters in a corpus from a file; may be called multiple times.

float score (istream &in)
float score (char *in)
float bits_required (unsigned char ch)
float bits_required (istream &in)
float bits_required (char *in)

Private Member Functions

float lg (float n)
float info_value (float n)
string ctime_string ()

Private Attributes

frequency_map _freqs
unsigned long _corpus_total
unsigned long _total
string _classification


Detailed Description

A text classifier based on single characters. The basic idea: texts from the same class tend to have similar character (byte) frequencies. In information-theoretic terms, texts from the same class should require a similar number of bits to encode under a perfect encoding derived from that class's character frequencies. The encoding itself never has to be constructed; only the number of bits it would use is needed. The basic methods are learn (read a corpus and count the frequencies), dump (save the frequencies to a stream), and read (load the frequencies from a stream).
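
Written as a formula, the quantity described above is the total information content of a text under the corpus frequencies. The symbols are assumptions introduced here for illustration and do not appear in the class: f(c) is the corpus count of byte c, N is the corpus total, and T = t_1 ... t_n is the text being measured.

```latex
\[
  \mathrm{bits}(T) \;=\; \sum_{i=1}^{n} -\lg \frac{f(t_i)}{N},
  \qquad N = \sum_{c} f(c)
\]
```

Each term is the length, in bits, that an ideal code would assign to one byte. The sum is finite only when every byte of T occurred in the corpus, that is, f(t_i) > 0 for all i.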


Constructor & Destructor Documentation

std::UnigramTextClassifier::UnigramTextClassifier ()

Constructor. The classification name defaults to 'Unknown.'.

std::UnigramTextClassifier::UnigramTextClassifier (const string classification)

Constructor taking the name of the classification.

Parameters:
classification The name of the classification (e.g., "Spam").


Member Function Documentation

string std::UnigramTextClassifier::classification () [inline]

Returns:
the name of the classifier.

unsigned long std::UnigramTextClassifier::corpus_total () [inline]

Returns:
the total number of characters in the corpus.

frequency_map std::UnigramTextClassifier::freqs () [inline]

Returns:
the map of characters and their frequencies.

void std::UnigramTextClassifier::setClassification (string &classification) [inline]

Parameters:
classification the name of the classifier.

unsigned long std::UnigramTextClassifier::total () [inline]

Returns:
the total number of characters in the text.

float std::UnigramTextClassifier::bits_required (char *in)

How many bits would it take to encode the characters in a file?

Parameters:
in The file in question
Returns:
Number of bits required.

float std::UnigramTextClassifier::bits_required (istream &in)

How many bits would it take to encode the characters in a stream?

Parameters:
in The stream in question
Returns:
Number of bits required.

float std::UnigramTextClassifier::bits_required (unsigned char ch)

How many bits would it take to encode a character?

Parameters:
ch The character in question.
Returns:
Number of bits required.

string std::UnigramTextClassifier::ctime_string () [private]

Internal: the current time as a string.

void std::UnigramTextClassifier::dump (char *out)

Dump the frequencies of characters in a corpus to a file.

Parameters:
out the output filename.

void std::UnigramTextClassifier::dump (ostream &out)

Dump the frequencies of characters in a corpus to a stream.

Parameters:
out the output stream, which must be open.

float std::UnigramTextClassifier::info_value (float n) [private]

Internal information value function, -lg(n).
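
The two private helpers are small enough to sketch directly. The version below assumes that n is a relative frequency in (0, 1] and uses the change-of-base identity for the logarithm; the class may compute either one differently.

```cpp
#include <cmath>

// Base-2 logarithm via the change-of-base identity lg(n) = ln(n) / ln(2).
float lg(float n) {
    return std::log(n) / std::log(2.0f);
}

// Information value, in bits, of an event with probability n: -lg(n).
// Rarer events (smaller n) carry more bits.
float info_value(float n) {
    return -lg(n);
}
```

Under this definition a character that makes up half the corpus costs one bit, and one that makes up a quarter costs two.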

void std::UnigramTextClassifier::learn (char *in)

Learn the frequencies of characters in a corpus read from a file; may be called multiple times.

Parameters:
in a filename.

void std::UnigramTextClassifier::learn (istream &in)

Learn the frequencies of characters in a corpus read from a stream; may be called multiple times.

Parameters:
in an input stream, which must be open.

float std::UnigramTextClassifier::lg (float n) [private]

Internal base-2 logarithm.

void std::UnigramTextClassifier::read (char *in)

Read the frequencies of characters in a corpus from a file; may be called multiple times.

Parameters:
in a filename.

void std::UnigramTextClassifier::read (istream &in)

Read the frequencies of characters in a corpus from a stream; may be called multiple times.

Parameters:
in an input stream, which must be open.

float std::UnigramTextClassifier::score (char *in)

What's the score? How many bits would it take to encode the characters in a file?

Parameters:
in The file in question
Returns:
Number of bits required.

float std::UnigramTextClassifier::score (istream &in)

What's the score? How many bits would it take to encode the characters in a stream?

Parameters:
in an input stream, which must be open.
Returns:
Number of bits required.
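
The reference does not say how scores from several classifiers are meant to be combined. One plausible use, sketched here under that caveat, is to score the same text against one model per class and pick the class that needs the fewest bits. All names in the sketch (`ClassModel`, `train`, `bits`, `best_class`) are hypothetical, and the cost charged for unseen bytes is an assumption.

```cpp
#include <cmath>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-class model: a name plus byte counts from its corpus.
struct ClassModel {
    std::string classification;
    std::map<unsigned char, unsigned long> freqs;
    unsigned long corpus_total;
};

// Build a model by counting the bytes of a training string.
ClassModel train(const std::string &name, const std::string &corpus) {
    ClassModel m;
    m.classification = name;
    m.corpus_total = corpus.size();
    for (unsigned char ch : corpus) ++m.freqs[ch];
    return m;
}

// Bits needed for `text` under `m`. Assumption: a byte absent from the
// corpus costs log2(corpus_total + 1) bits so that the sum stays finite.
double bits(const ClassModel &m, const std::string &text) {
    double total = 0.0;
    for (unsigned char ch : text) {
        auto it = m.freqs.find(ch);
        total += (it == m.freqs.end())
            ? std::log2(m.corpus_total + 1.0)
            : -std::log2(static_cast<double>(it->second) / m.corpus_total);
    }
    return total;
}

// Name of the model under which `text` is cheapest to encode.
// Assumes `models` is non-empty.
std::string best_class(const std::vector<ClassModel> &models,
                       const std::string &text) {
    const ClassModel *best = &models.front();
    for (const ClassModel &m : models)
        if (bits(m, text) < bits(*best, text)) best = &m;
    return best->classification;
}
```

Because every model scores the same text, the raw bit totals can be compared directly without normalising by length.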


Member Data Documentation

string std::UnigramTextClassifier::_classification [private]
 

internal name of classifier

unsigned long std::UnigramTextClassifier::_corpus_total [private]
 

internal total number of characters in corpus

frequency_map std::UnigramTextClassifier::_freqs [private]
 

internal character->frequency map

unsigned long std::UnigramTextClassifier::_total [private]
 

internal total number of characters in text


The documentation for this class was generated from the following file:
Generated on Fri Aug 8 15:44:40 2003 for UnigramTextClassifier by doxygen 1.3.3