Dbacl

From Christoph's Personal Wiki
Revision as of 09:15, 11 February 2007 by Christoph (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search
The correct title of this article is dbacl. The initial letter is capitalized due to technical restrictions.

dbacl is a digramic Bayesian text classifier. Given some text, it calculates the posterior probabilities that the input resembles one of any number of previously learned document collections. It can be used to sort incoming email into arbitrary categories such as spam, work, and play, or simply to distinguish an English text from a French text. It fully supports international character sets, and uses sophisticated statistical models based on the Maximum Entropy Principle.

Usage

Simple example

dbacl has two major modes of operation. The first is learning mode, where one or more text documents are analysed to find out what make them look the way they do. At the shell prompt, type (without the leading %)

% dbacl -l one sample1.txt
% dbacl -l two sample2.txt

This creates two files named one and two, which contain the important features of each sample document.

The second major mode is classification mode. Let's say that you want to see if sample3.txt is closer to sample1.txt or sample2.txt; type

% dbacl -c one -c two sample3.txt -v
one   # <-- output

and dbacl should tell you it thinks sample3.txt is less like two (which is the category learned from sample2.txt) and more like one (which is the category learned from sample1.txt).

External links