AutoClass C - General Information

What Is AutoClass

AutoClass is an unsupervised Bayesian classification system that seeks a maximum posterior probability classification.

Key features:

AutoClass uses only vector-valued data, in which each instance to be classified is represented by a vector of values, each value characterizing some attribute of the instance. Values can be either real numbers, normally representing a measurement of the attribute, or discrete, that is, one of a countable, attribute-dependent set of values, normally representing some aspect of the attribute.

AutoClass models the data as a mixture of conditionally independent classes. Each class is defined in terms of a probability distribution over the meta-space defined by the attributes. AutoClass uses Gaussian distributions over the real valued attributes, and Bernoulli distributions over the discrete attributes. Default class models are provided.

AutoClass finds the set of classes that is maximally probable with respect to the data and model. The output is a set of class descriptions, and partial membership of the instances in the classes.

For more details, see "Bayesian Classification (AutoClass): Theory and Results" (1995, postscript, 220k), "Bayesian Classification Theory" (1991, postscript, 200k), and the list of references.

What Is AutoClass C

AutoClass C is a public domain version of AutoClass III, with some improvements from AutoClass X, implemented in the C language.

It was programmed by Dr. Diane Cook (cook@centauri.uta.edu) and Joseph Potts (potts@cse.uta.edu) of the University of Texas at Arlington. Will Taylor (william.m.taylor@nasa.gov) "productized" the software through extensive testing, addition of sample data bases, and re-working the user documentation.

Significant new features of the C implementation are:

It provides four models:

Additional models were done in Lisp for AutoClass X, and may be implemented in C at some later time. These models are:

The C implementation also does not provide single_multinomial model value translations or canonical model group/attribute ordering.

Update History

   Version: 1.0	   15 Apr 95    initial version of AutoClass C

   Version: 1.5	   08 May 95    ported to Sun Solaris 2.4; corrected string
                                overwrite problems; compilation of file
        search-control.c is now optimized; & added binary data file input 
        option.

   Version: 2.0	   08 Jun 95    ported to SGI IRIX version 5.2; converted
                                binary i/o from non-standard (open/close/
        read/write) to ANSI (fopen/fclose/fread/fwrite); converted from 
        srand/rand to srand48/lrand48 for random number generation; added
        prediction capability which uses a "training" classification to
        predict probabilistic class membership for the cases of a "test"
        data file; added new ".s-params" parameter "screen_output_p"; added
        output of real and discrete attribute statistics when data base is
        initially read; corrected garbage output when ".r-params" parameter
        "xref_class_report_att_list" contains mixed real and discrete
        attributes; corrected the handling of unknown real values in reports
        output; and corrected an error in function "output_warning_msgs"
        which caused an abort condition.

   Version: 2.5	   28 Jul 95    Influence values report has been
                                significantly revised and reformatted; 
        add SunOS/Solaris C compiler support; correct segmentation fault
        which occurs when more than 25 type = real, subtype = scalar
        attributes are defined; correct "LOG domain" errors in generation
        of influence values for model "single_multinomial"; and added mods 
        for port to Linux operating system using gcc compiler. 

   Version: 2.6	   02 Aug 95    Correct segmentation fault which occurs
                                when more than 50 type = real, subtype = 
        scalar attributes are defined; add function safe_log to prevent 
        "log: SING error" error messages; and require user to confirm 
        search runs using test settings for .s-params file parameters: 
        start_fn_type and randomize_random_p.

   Version: 2.7	   16 Aug 95    Add search parameter to allow AutoClass
                                to be run as a background task.

   Version: 2.8	   03 Sep 96    Add search parameter "read_compact_p",
        which directs AutoClass to read the "results" and "checkpoint"
        files in either binary format or ascii format; redefine make
        files with -I and -L parameters for SunOS 4.1.3; change make
        file naming conventions; prevent corruption of discrete data 
        translation tables when translations are longer than 40
        characters; increase from 3000 to 20000 the value of 
        VERY_LONG_STRING_LENGTH to handle very large datum lines;
        increase DATA_ALLOC_INCREMENT from 100 to 1000 for reading very
        large datasets; add DATA_ALLOC_INCREMENT logic of READ_DATA
        to XREF_GET_DATA -- this will prevent segmentation faults
        encountered when reading very large .db2 files into the 
        reports processing function of AutoClass; in
        FORMAT_DISCRETE_ATTRIBUTE, do not process attributes with
        warning or error messages -- this prevents segmentation faults;
        in XREF_GET_DATA, free database allocated memory after it is 
        transferred into report data structures --this reduces the
        amount of memory required when generating reports for very
        large data bases, and prevents running out of memory; in all 
        functions calling malloc/realloc for dynamic memory allocation, 
        checks have been added to notify the user if memory is exhausted;
        and port the "make" file for HP-UX operating system using the
        bundled "cc" compiler. 

   Version: 2.9	   17 Oct 96    Correct bugs which occur when generating
        reports of discrete type data -- these were introduced in version 
        2.8.  Added new parameter for both ".s-params" & ".r-params"
        files: break_on_warnings_p.

   Version: 3.0    15 Apr 97  New parameter for .r-params files:
        report_mode -- "text" (current report output) or "data"
        (parsable format for further processing); correct minor bugs;
        improve input checking for .hd2 file; correct segmentation
        fault which occurred in prediction runs when the size of the
        "test" database was larger than that of the "training"
        database; and new parameter for .s-params & .r-params files: 
        free_storage_p.

   Version: 3.1    04 Jul 97  New parameters for .r-params files:
        comment_data_headers_p, max_num_xref_class_probs,
        start_sigma_contours_att, & stop_sigma_contours_att.  Allow
        checkpoint files to be loaded for reconvergence.  Allow
        reports to be generated for data sets of 100,000 cases and 
        more, without causing a segmentation fault.  For "-predict"
        runs, handle "test" cases which are not predicted to be in 
        any of the "training" classes.  When there is more than one
        covariant normal correlation matrix, print all of them.
        In the case cross-reference report (report_type = "xref_case")
        generated with the data option (report_mode = "data"), other class
        probabilities are now printed.  In the case and class cross-
        reference reports, the print out of probabilities has increased
        by one significant digit (0.04 => 0.041), and the minimum value 
        printed is now 0.001, rather than 0.01.  Add capability to
        compute sigma class contour values for specified pairs of 
        real valued attributes.

   Version: 3.2    13 Apr 98    Changed the behavior of search
        parameter force_new_search_p; amplified some documentation
        sections; corrected several segmentation faults in reports
        generation; corrected several errors in sigma contours output;
        correct problem with cross-reference reports class assignment
        when there are more than five marginal probabilities; change
        layout of influence values report to print matrices after all 
        class attributes are listed; warn user when default start_j_list 
        may not find the correct number of classes in data set; warn 
        user of search trials which do not converge and print 
        convergence summary at the end of each run; the multi-normal 
        model was corrected to prevent oscillation in the expectation 
        maximization calculations; and allow non-contiguous groups of 
        attributes to be specified for sigma contours calculations.

   Version: 3.2.1  04 Jun 98    Minor documentation changes. 

   Version: 3.2.2  02 Jul 98    Minor documentation changes. 

   Version: 3.3    23 Sep 98    Integrated source port of version
        3.2.2 to Windows NT/95/98. Update sample AutoClass C run files
        contained in autoclass-c/sample.

   Version: 3.3.1  30 Nov 98    Correct incompatibility with
        .results[-bin] files written by AutoClass C versions prior 
        to version 3.3.  

   Version: 3.3.2  13 Sep 99    In all situations warning and error 
        messages are now written to the log file. 

   Version: 3.3.3  01 May 00    Add DEC Alpha support; correct DEC
        Alpha crashes when attempting to free memory at the end of
        search runs; conditionalize two warning tests to fail in
        batch mode; and separate log files are now written for
        "-search" (.log) and "-reports" (.rlog).

   Version: 3.3.4  24 Jan 02    Correct bugs in -predict and -report
        modes; correct "safe_log" function for range near 0; and
        minor code cleanup.  Update sample AutoClass C run files
        contained in autoclass-c/sample.
 

Compatibility and Requirements

AutoClass C was written in ANSI C using the GNU gcc compiler version 2.6.3 running on a SunSparc under SunOS 4.1.3.

It has also been ported to and tested on:

Considerations for porting to other platforms, operating systems, and compilers:

Limitations

AutoClass C is limited by memory requirements that are roughly proportional to the number of data instances times the number of attributes (the data space), plus the number of classes times the number of modeled attributes (the model space), plus a fixed program space. Thus there should be no limit on the number of attributes beyond the program addressable memory, but there are definite tradeoffs with respect to the model space, and performance degrades as paging requirements increase.
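The proportionality above can be turned into a rough planning calculation. The sketch below is only an illustration: the names SpaceEstimate and estimate_space are hypothetical, the result is counted in stored values rather than bytes, and the fixed program space must be supplied by the caller because this document does not state it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical estimate, counted in stored values, not bytes. */
typedef struct {
    double data_space;   /* instances times attributes */
    double model_space;  /* classes times modeled attributes */
    double total;        /* data space + model space + fixed space */
} SpaceEstimate;

SpaceEstimate estimate_space(size_t n_data, size_t n_attributes,
                             size_t n_classes, size_t n_modeled_attributes,
                             double fixed_program_space)
{
    SpaceEstimate e;

    assert(fixed_program_space >= 0.0);

    /* Use doubles so very large data sets do not overflow size_t. */
    e.data_space = (double)n_data * (double)n_attributes;
    e.model_space = (double)n_classes * (double)n_modeled_attributes;
    e.total = e.data_space + e.model_space + fixed_program_space;
    return e;
}
```

For example, 100,000 instances with 20 attributes give a data space of 2,000,000 values, while 10 classes over the same 20 attributes add only 200 values of model space, so in this case the data space dominates.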

For very large data sets, you may well find that even if you can handle the data, the processing time is excessive. If that is the case, it may be worthwhile to try class generation on random subsets of the data set. This should pick out the major classes, although it will miss small ones that are only vaguely represented in the random subsets. You can then switch to prediction mode to classify the entire data set.
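One way to draw such a random subset is sketched below. It is an assumption-laden illustration rather than part of AutoClass C: the function name choose_random_subset and the keep array are hypothetical, and the standard rand function is used for simplicity. The function marks exactly n_keep of n_cases case indices using selection sampling, in which each case is chosen with probability equal to the number of cases still needed divided by the number of cases still remaining. Writing the marked cases to a smaller data file is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Set keep[i] to 1 for exactly n_keep of the n_cases indices, chosen
 * uniformly at random, and to 0 for the rest. Returns the number chosen. */
size_t choose_random_subset(unsigned char *keep, size_t n_cases,
                            size_t n_keep, unsigned int seed)
{
    size_t i, chosen = 0;

    assert(n_keep <= n_cases);
    srand(seed);

    for (i = 0; i < n_cases; i++) {
        size_t remaining = n_cases - i;   /* cases not yet examined */
        size_t needed = n_keep - chosen;  /* cases still to be selected */
        /* u is uniform in [0, 1) */
        double u = rand() / ((double)RAND_MAX + 1.0);

        /* Select case i with probability needed / remaining. */
        if (u * (double)remaining < (double)needed) {
            keep[i] = 1;
            chosen++;
        } else {
            keep[i] = 0;
        }
    }
    return chosen;
}
```

The classes found on the subset can then serve as the "training" classification when prediction mode is run on the full data set. Repeating the procedure with a different seed gives a quick check on whether the major classes are stable across subsets.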

Theoretical Questions

Contact Peter Cheeseman if you have questions concerning the theoretical aspects of AutoClass.

Technical Questions

Contact John Stutz if you have questions concerning the applicability of AutoClass to your data analysis situation.

Implementation Questions

Contact Will Taylor if you have questions concerning the implementation, installation, and running of AutoClass C, including "bugs" and features you may add to the existing code.

Obtaining AutoClass C

AutoClass C is available free here as a "gzipped" tar file. Note that the anonymous ftp site csr.uta.edu will no longer provide the latest version of AutoClass C.

The AutoClass C files include source code, user documentation, two theory papers (in postscript), a sample run, and five test data bases. The uncompressed files are about 5.7 megabytes. When built, this becomes about 7.5 megabytes.

Click on one of the following to download the AutoClass C files to your host --

Click on the following to download the new (06May02) version 3.3.4 Windows stand-alone executable to your host --
(This version corrects the problem in which the required file MSVCRTD.dll was not found.)

Execute Autoclass.exe in an "MS-DOS" window (Win98) or in a "Command Prompt" window (Win2000),
not in a "Run Command" window.

Then, read autoclass-c/read-me.text or autoclass-c-win\read-me.text, and you are off to the races!

Information on SNOB

Information on a related, but independently developed, classification program -- SNOB -- written in FORTRAN, is available here.


Last updated February 28, 2003
