What Is AutoClass
AutoClass is an unsupervised Bayesian classification system that seeks a maximum posterior probability classification.
Key features:
AutoClass uses only vector-valued data, in which each instance to be classified is represented by a vector of values, each value characterizing some attribute of the instance. Values can be either real numbers, normally representing a measurement of the attribute, or discrete, drawn from a countable, attribute-dependent set of values, normally representing some aspect of the attribute.
AutoClass models the data as a mixture of conditionally independent classes. Each class is defined in terms of a probability distribution over the meta-space defined by the attributes. AutoClass uses Gaussian distributions over the real-valued attributes, and Bernoulli distributions over the discrete attributes. Default class models are provided.
AutoClass finds the set of classes that is maximally probable with respect to the data and model. The output is a set of class descriptions, and partial membership of the instances in the classes.
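The partial membership idea described above can be illustrated with a minimal Python sketch. This is not AutoClass's own code. The function names `gaussian_pdf` and `class_memberships` are hypothetical, the class parameters are invented for illustration, attributes are assumed to be independent within each class, and each discrete attribute is given a simple probability table. The sketch only shows how membership probabilities follow from fixed class descriptions; the search for the most probable set of classes is not shown.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of a normal distribution with the given mean and sigma at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def class_memberships(instance, classes):
    """Return the normalized membership probabilities of one instance.

    Each class is a dict holding a mixing weight, (mean, sigma) pairs for
    real attributes, and value -> probability tables for discrete
    attributes. Attributes are assumed independent within a class.
    """
    joint = []
    for c in classes:
        p = c["weight"]
        for name, (mean, sigma) in c["real"].items():
            p *= gaussian_pdf(instance[name], mean, sigma)
        for name, table in c["discrete"].items():
            p *= table[instance[name]]
        joint.append(p)
    total = sum(joint)
    return [p / total for p in joint]


# Hypothetical two-class model; every number here is invented.
classes = [
    {"weight": 0.5,
     "real": {"length": (1.0, 0.5)},
     "discrete": {"color": {"red": 0.9, "blue": 0.1}}},
    {"weight": 0.5,
     "real": {"length": (3.0, 0.5)},
     "discrete": {"color": {"red": 0.2, "blue": 0.8}}},
]
memberships = class_memberships({"length": 1.2, "color": "red"}, classes)
```

Here `memberships` holds one probability per class, and the probabilities sum to one, so an instance can belong partly to several classes rather than being assigned to exactly one.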
For more details, see "Bayesian Classification (AutoClass): Theory and Results" (1995, PostScript, 220k), "Bayesian Classification Theory" (1991, PostScript, 200k), and the list of references.
AutoClass C is a public domain version of
AutoClass III, with some improvements from
AutoClass X, implemented in the C language.
It was programmed by Dr. Diane Cook (cook@centauri.uta.edu) and Joseph Potts
(potts@cse.uta.edu) of the University of Texas at Arlington. Will Taylor
(william.m.taylor@nasa.gov) "productized" the software through extensive testing,
addition of sample data bases, and re-working the user documentation.
It provides four models:
Additional models were done in Lisp for AutoClass X, and may be
implemented in C at some later time. These models are:
The C implementation also does not provide single_multinomial model
value translations or canonical model group/attribute ordering.
AutoClass C was written in ANSI C using the GNU gcc compiler
version 2.6.3 running on a SunSparc under SunOS 4.1.3.
It has also been ported to and tested on:
Considerations for porting to other platforms, operating systems,
and compilers:
AutoClass C is limited by memory requirements that are roughly in
proportion to the number of data, times the number of attributes (the
data space); plus the number of classes, times number of modeled
attributes (the model space); plus a fixed program space. Thus there
should be no limit on the number of attributes beyond the program
addressable memory, but there are definite tradeoffs with respect to
the model space, and performance degradations as paging requirements
increase.
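The proportionality described above can be written as a rough estimate. The Python sketch below is only an illustration: the function name `estimate_memory_bytes` is hypothetical, and the per-item byte counts and the fixed program size are placeholder assumptions, not measured AutoClass C figures.

```python
def estimate_memory_bytes(n_data, n_attributes, n_classes, n_modeled_attributes,
                          bytes_per_datum=4, bytes_per_model_term=64,
                          fixed_program_bytes=2_000_000):
    """Rough memory estimate: data space + model space + fixed program space.

    All three default constants are placeholders chosen for illustration.
    """
    data_space = n_data * n_attributes * bytes_per_datum
    model_space = n_classes * n_modeled_attributes * bytes_per_model_term
    return data_space + model_space + fixed_program_bytes


# Example: 100,000 cases with 20 attributes, 10 classes, all attributes modeled.
estimate = estimate_memory_bytes(100_000, 20, 10, 20)
```

With these placeholder constants the data space dominates, which matches the point above that paging becomes the practical limit as the data set grows.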
For very large data sets, you may well find that even if you can handle
the data, the processing time is excessive. If that is the case, it may
be worthwhile to try class generation on random subsets of the data set.
This should pick out the major classes, although it will miss small
ones that are only vaguely represented in the random subsets. You can
then switch to prediction mode to classify the entire data set.
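Drawing the random subset is done outside AutoClass. One way to do it is sketched below in Python. The function name `random_subset` and the sample cases are hypothetical. The sketch only selects the cases; the search on the subset and the later prediction run over the full data set are still performed with AutoClass C itself.

```python
import random


def random_subset(cases, fraction, seed=0):
    """Return a random subset of cases, keeping their original order.

    A fixed seed makes the selection repeatable between runs.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not cases:
        return []
    k = max(1, round(len(cases) * fraction))
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(cases)), k))
    return [cases[i] for i in keep]


# Hypothetical data: each string stands for one line of a data file.
cases = [f"case-{i}" for i in range(1000)]
subset = random_subset(cases, 0.1)
```

Running the search on several different subsets (different seeds) gives some indication of whether the major classes found are stable.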
Contact Peter Cheeseman if you have questions concerning the theoretical aspects of AutoClass.
Contact John Stutz if you have questions concerning the applicability of AutoClass to your data analysis situation.
Contact Will Taylor if you have questions concerning the implementation, installation, and running of AutoClass C, including "bugs" and features you may add to the existing code.
AutoClass C is available free here as a "gzipped" tar file. Note that the
anonymous ftp site csr.uta.edu will no longer provide the latest version
of AutoClass C.
The AutoClass C files include source code, user documentation, two theory
papers (in postscript), a sample run, and five test data bases.
The uncompressed files are about 5.7 megabytes. When built, this becomes
about 7.5 megabytes.
Click on one of the following to download the AutoClass C files to your host --
Click on the following to download the new (06May02) version 3.3.4
Windows stand-alone executable to your host --
Execute Autoclass.exe in an "MS-DOS" window (Win98) or in a "Command Prompt" window (Win2000).
Then read autoclass-c/read-me.text or autoclass-c-win\read-me.text,
and you are off to the races!
Information on SNOB, a related but independently developed classification
program written in FORTRAN, is also available.
Last updated February 28, 2003
What Is AutoClass C
Significant new features of the C implementation are:
Update History
Version: 1.0 15 Apr 95 initial version of AutoClass C
Version: 1.5 08 May 95 ported to Sun Solaris 2.4; corrected string
overwrite problems; compilation of file
search-control.c is now optimized; & added binary data file input
option.
Version: 2.0 08 Jun 95 ported to SGI IRIX version 5.2; converted
binary i/o from non-standard (open/close/
read/write) to ANSI (fopen/fclose/fread/fwrite); converted from
srand/rand to srand48/lrand48 for random number generation; added
prediction capability which uses a "training" classification to
predict probabilistic class membership for the cases of a "test"
data file; added new ".s-params" parameter "screen_output_p"; added
output of real and discrete attribute statistics when data base is
initially read; corrected garbage output when ".r-params" parameter
"xref_class_report_att_list" contains mixed real and discrete
attributes; corrected the handling of unknown real values in reports
output; and corrected an error in function "output_warning_msgs"
which caused an abort condition.
Version: 2.5 28 Jul 95 Influence values report has been
significantly revised and reformatted;
add SunOS/Solaris C compiler support; correct segmentation fault
which occurs when more than 25 type = real, subtype = scalar
attributes are defined; correct "LOG domain" errors in generation
of influence values for model "single_multinomial"; and added mods
for port to Linux operating system using gcc compiler.
Version: 2.6 02 Aug 95 Correct segmentation fault which occurs
when more than 50 type = real, subtype =
scalar attributes are defined; add function safe_log to prevent
"log: SING error" error messages; and require user to confirm
search runs using test settings for .s-params file parameters:
start_fn_type and randomize_random_p.
Version: 2.7 16 Aug 95 Add search parameter to allow AutoClass
to be run as a background task.
Version: 2.8 03 Sep 96 Add search parameter "read_compact_p",
which directs AutoClass to read the "results" and "checkpoint"
files in either binary format or ascii format; redefine make
files with -I and -L parameters for SunOS 4.1.3; change make
file naming conventions; prevent corruption of discrete data
translation tables when translations are longer than 40
characters; increase from 3000 to 20000 the value of
VERY_LONG_STRING_LENGTH to handle very large datum lines;
increase DATA_ALLOC_INCREMENT from 100 to 1000 for reading very
large datasets; add DATA_ALLOC_INCREMENT logic of READ_DATA
to XREF_GET_DATA -- this will prevent segmentation faults
encountered when reading very large .db2 files into the
reports processing function of AutoClass; in
FORMAT_DISCRETE_ATTRIBUTE, do not process attributes with
warning or error messages -- this prevents segmentation faults;
in XREF_GET_DATA, free database allocated memory after it is
transferred into report data structures --this reduces the
amount of memory required when generating reports for very
large data bases, and prevents running out of memory; in all
functions calling malloc/realloc for dynamic memory allocation,
checks have been added to notify the user if memory is exhausted;
and port the "make" file for HP-UX operating system using the
bundled "cc" compiler.
Version: 2.9 17 Oct 96 Correct bugs which occur when generating
reports of discrete type data -- these were introduced in version
2.8. Added new parameter for both ".s-params" & ".r-params"
files: break_on_warnings_p.
Version: 3.0 15 Apr 97 New parameter for .r-params files:
report_mode -- "text" (current report output) or "data"
(parsable format for further processing); correct minor bugs;
improve input checking for .hd2 file; correct segmentation
fault which occurred in prediction runs when the size of the
"test" database was larger than that of the "training"
database; and new parameter for .s-params & .r-params files:
free_storage_p.
Version: 3.1 04 Jul 97 New parameters for .r-params files:
comment_data_headers_p, max_num_xref_class_probs,
start_sigma_contours_att, & stop_sigma_contours_att. Allow
checkpoint files to be loaded for reconvergence. Allow
reports to be generated for data sets of 100,000 cases and
more, without causing a segmentation fault. For "-predict"
runs, handle "test" cases which are not predicted to be in
any of the "training" classes. When there is more than one
covariant normal correlation matrix, print all of them.
In the case cross-reference report (report_type = "xref_case")
generated with the data option (report_mode = "data"), other class
probabilities are now printed. In the case and class cross-
reference reports, the print out of probabilities has increased
by one significant digit (0.04 => 0.041), and the minimum value
printed is now 0.001, rather than 0.01. Add capability to
compute sigma class contour values for specified pairs of
real valued attributes.
Version: 3.2 13 Apr 98 Changed the behavior of search
parameter force_new_search_p; amplified some documentation
sections; corrected several segmentation faults in reports
generation; corrected several errors in sigma contours output;
correct problem with cross-reference reports class assignment
when there are more than five marginal probabilities; change
layout of influence values report to print matrices after all
class attributes are listed; warn user when default start_j_list
may not find the correct number of classes in data set; warn
user of search trials which do not converge and print
convergence summary at the end of each run; the multi-normal
model was corrected to prevent oscillation in the expectation
maximization calculations; and allow non-contiguous groups of
attributes to be specified for sigma contours calculations.
Version: 3.2.1 04 Jun 98 Minor documentation changes.
Version: 3.2.2 02 Jul 98 Minor documentation changes.
Version: 3.3 23 Sep 98 Integrated source port of version
3.2.2 to Windows NT/95/98. Update sample AutoClass C run files
contained in autoclass-c/sample.
Version: 3.3.1 30 Nov 98 Correct incompatibility with
.results[-bin] files written by AutoClass C versions prior
to version 3.3.
Version: 3.3.2 13 Sep 99 In all situations warning and error
messages are now written to the log file.
Version: 3.3.3 01 May 00 Add Dec Alpha support; correct Dec
Alpha crashes when attempting to free memory at the end of
search runs; conditionalize two warning tests to fail in
batch mode; and separate log files are now written for
"-search" (.log) and "-reports" (.rlog).
Version: 3.3.4 24 Jan 02 Correct bugs in -predict and -report
modes; correct "safe_log" function for range near 0; and
minor code cleanup. Update sample AutoClass C run files
contained in autoclass-c/sample.
Compatibility and Requirements
Limitations
Theoretical Questions
Technical Questions
Implementation Questions
Obtaining AutoClass C
(Corrects the "required .dll file, MSVCRTD.dll, was not found" problem.)
not in a "Run Command" window.
Information on SNOB