Dbacl

The correct title of this article is dbacl. The initial letter is capitalized due to technical restrictions.

dbacl is a digramic Bayesian text classifier. Given some text, it calculates the posterior probabilities that the input resembles one of any number of previously learned document collections. It can be used to sort incoming email into arbitrary categories such as spam, work, and play, or simply to distinguish an English text from a French text. It fully supports international character sets, and uses sophisticated statistical models based on the Maximum Entropy Principle.

Usage

Simple example

dbacl has two major modes of operation. The first is learning mode, where one or more text documents are analysed to find out what makes them look the way they do. At the shell prompt, type (without the leading %)

% dbacl -l one sample1.txt
% dbacl -l two sample2.txt

This creates two files named one and two, which contain the important features of each sample document.
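
A category can also be learned from several sample documents at once, since dbacl accepts a list of input files after the category name (and reads standard input when no files are given). For example, assuming a further sample named sample1b.txt:

% dbacl -l one sample1.txt sample1b.txt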

The second major mode is classification mode. Let's say that you want to see if sample3.txt is closer to sample1.txt or sample2.txt; type

% dbacl -c one -c two sample3.txt -v
  one   # <-- output

and dbacl should tell you it thinks sample3.txt is less like two (which is the category learned from sample2.txt) and more like one (which is the category learned from sample1.txt).

Extended usage

Note: Taken directly from the man page.

To create two category files in the current directory from two ASCII text files named Mark_Twain.txt and William_Shakespeare.txt respectively, type:

% dbacl -l twain Mark_Twain.txt
% dbacl -l shake William_Shakespeare.txt

Now you can classify input text, for example:

% echo "howdy" | dbacl -v -c twain -c shake
  twain
% echo "to be or not to be" | dbacl -v -c twain -c shake
  shake

Note that at least the -v option is necessary; otherwise, dbacl does not print anything. The return value is 1 in the first case and 2 in the second.
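
Because the return value identifies the winning category by its position in the list of -c options, dbacl can drive a shell script even without printed output. A minimal sketch (the catch-all branch is defensive, for the case where dbacl reports an error):

% echo "to be or not to be" | dbacl -c twain -c shake
% case $? in
    1) echo "classified as twain" ;;
    2) echo "classified as shake" ;;
    *) echo "error or no classification" ;;
  esac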

% echo "to be or not to be" | dbacl -v -N -c twain -c shake
  twain 22.63% shake 77.37%
% echo "to be or not to be" | dbacl -v -n -c twain -c shake
  twain 7.04 * 6.0 shake 6.74 * 6.0

These invocations are equivalent. The numbers 6.74 and 7.04 represent how close the average token is to each category, and 6.0 is the number of tokens observed. If you want to print a simple confidence value together with the best category, replace -v with -U.
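
As a quick consistency check (assuming, as the numbers suggest, that the scores are base-2 logarithms): shake is favoured by (7.04 - 6.74) * 6.0 = 1.8 bits in total, so the odds in favour of shake are about 2^1.8, or 3.5 to 1, giving P(shake) of roughly 3.5/4.5 = 78% and P(twain) of roughly 22%, in line with the -N output above (the small discrepancy comes from rounding in the printed scores).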

% echo "to be or not to be" | dbacl -U -c twain -c shake
  shake # 34%

Note that the true probability of category shake versus category twain is 77.37%, but the calculation is somewhat ambiguous, and 34% is the confidence out of 100% that the calculation is qualitatively correct.

Suppose a file document.txt contains English text lines interspersed with noise lines. To filter out the noise lines from the English lines, assuming you have an existing category, say shake, type:

% dbacl -c shake -f shake -R document.txt > document.txt_eng
% dbacl -c shake -f random -R document.txt > document.txt_rnd

Note that the quality of the results will vary depending on how well the categories shake and random represent each input line. It is sometimes useful to see the posterior probabilities for each line without filtering:

% dbacl -c shake -f shake -RN document.txt > document.txt_probs

You can now postprocess the posterior probabilities for each line of text with another script, to implement an arbitrary Bayesian decision rule of your choice, e.g. as in the sketch below.
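
For instance, a threshold rule could keep only those lines whose posterior probability for shake exceeds 90%. The sketch below assumes that, as in the -N output earlier, each line of document.txt_probs begins with alternating category names and percentages; the exact layout may differ between dbacl versions, so inspect the file before relying on it:

% awk '$1 == "shake" && $2 + 0 > 90 { print }' document.txt_probs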

In the special case of exactly two categories, the optimal Bayesian decision procedure can be implemented for documents as follows: let p1 be the prior probability that the input text belongs to category1. Consequently, the prior probability that it belongs to category2 is 1 - p1. Let u12 be the cost of misclassifying a category1 input text as belonging to category2, and vice versa for u21. We assume there is no cost for classifying correctly. Then the following command implements the optimal Bayesian decision:

% dbacl -n -c category1 -c category2 | \
        awk '{ if($2 * p1 * u12 > $4 * (1 - p1) * u21) \
        { print $1; } else { print $3; } }'
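
Here p1, u12, and u21 are placeholders to be filled in with actual numbers, for instance via awk's -v switch. As an illustration only, with a prior of 0.7 for category1 and equal misclassification costs (sample.txt standing for whatever input is being classified):

% dbacl -n -c category1 -c category2 sample.txt | \
        awk -v p1=0.7 -v u12=1 -v u21=1 \
        '{ if($2 * p1 * u12 > $4 * (1 - p1) * u21) \
        { print $1; } else { print $3; } }'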

dbacl can also be used in conjunction with procmail(1) to implement a simple Bayesian email classification system. Assume that incoming mail should be automatically delivered to one of three mail folders located in $MAILDIR and named work, personal, and spam. Initially, these must be created and filled with appropriate sample emails. A crontab file can be used to learn the three categories once a day, e.g.

CATS=$HOME/.dbacl
5  0 * * * dbacl -T email -l $CATS/work $MAILDIR/work
10 0 * * * dbacl -T email -l $CATS/personal $MAILDIR/personal
15 0 * * * dbacl -T email -l $CATS/spam $MAILDIR/spam
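
Note that the $CATS directory must already exist when these jobs first run; it can be created once beforehand with something like:

% mkdir -p $HOME/.dbacl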

To automatically deliver each incoming email into the appropriate folder, the following procmailrc(5) recipe fragment could be used:

CATS=$HOME/.dbacl

# run the spam classifier
:0 c
YAY=| dbacl -vT email -c $CATS/work -c $CATS/personal -c $CATS/spam

# send to the appropriate mailbox
:0:
* ? test -n "$YAY"
$MAILDIR/$YAY

:0:
$DEFAULT
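
Before wiring the recipe into procmail, the classifier can be tried by hand on a saved message (message.txt here stands for any email file on disk):

% dbacl -vT email -c $CATS/work -c $CATS/personal -c $CATS/spam < message.txt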

Sometimes, dbacl will send the email to the wrong mailbox. In that case, the misclassified message should be removed from its wrong destination and placed in the correct mailbox. The error will be corrected the next time your messages are learned. If it is left in the wrong category, dbacl will learn the wrong corpus statistics.

The default text features (tokens) read by dbacl are purely alphabetic strings, which minimizes memory requirements but can be unrealistic in some cases. To construct models based on alphanumeric tokens, use the -e switch. The example below also uses the optional -D switch, which prints a list of actual tokens found in the document:

% dbacl -e alnum -D -l twain Mark_Twain.txt | less
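
Since -D prints the features as they are encountered, the output can also be piped through standard tools to get a rough picture of the learned vocabulary. A sketch (note that the -D output may contain diagnostic lines besides the tokens themselves):

% dbacl -e alnum -D -l twain Mark_Twain.txt | sort | uniq -c | sort -rn | head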

It is also possible to override the default feature selection method used to learn the category model by means of regular expressions. For example, the following duplicates the default feature selection method in the C locale, while being much slower:

% dbacl -l twain -g '^([[:alpha:]]+)' -g '[^[:alpha:]]([[:alpha:]]+)' Mark_Twain.txt

The category twain which is obtained depends only on single alphabetic words in the text file Mark_Twain.txt (plus computed digram statistics for prediction). For a second example, the following command builds a smoothed Markovian (word bigram) model which depends on pairs of consecutive words within each line (pairs cannot straddle a line break). The digits after the || tag select which parenthesized subexpressions are combined into a feature, so ||23 joins the second and third capture groups into a word pair:

% dbacl -l twain2 -g '(^|[^[:alpha:]])([[:alpha:]]+)||2' \
        -g '(^|[^[:alpha:]])([[:alpha:]]+)[^[:alpha:]]+([[:alpha:]]+)||23' \
        Mark_Twain.txt

More general, line-based n-gram models of all orders (up to 7) can be built in a similar way. To construct paragraph-based models, you should first reformat the input corpora with awk(1) or sed(1) to obtain one paragraph per line, as in the sketch below. Line size is limited by available memory, but note that regex performance degrades quickly on long lines.
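
For example, the standard awk paragraph-mode idiom joins each blank-line-separated paragraph into a single line (corpus.txt and corpus_lines.txt are illustrative names):

% awk 'BEGIN { RS=""; ORS="\n" } { gsub(/\n/, " "); print }' corpus.txt > corpus_lines.txt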
