Data Mining - Bayesian Approaches
Data Mining Course [CISC 873, School of Computing, Queen's University]
Bayesian Approaches Tutorials and Applying Results [Henry 2004]
Data Mining (DM)
Introduction
DM Definitions
DM Web Pages
Bayesian Tutorials
Overview
Naïve Bayesian Classifiers
Gaussian Bayesian Classifiers
Bayesian Networks
Applying Bayesian Approaches on Datasets
Dataset #1
Dataset #2
Dataset #3
Dataset Additional
Mining Software
Weka
MatLab
Resource | Definition |
Two Crows | An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis. |
Gotcha | A type of application with built-in proprietary algorithms that sort, rank, and perform calculations on a specified and often large data set, producing visualizations that reveal patterns which may not have been evident from mere listings or summaries. |
The OLAP Report | The process of using statistical techniques to discover subtle relationships between data items, and the construction of predictive models based on them. The process is not the same as just using an OLAP tool to find exceptional items. Generally, data mining is a very different and more specialist application than OLAP, and uses different tools from different vendors. Normally the users are different, too. |
Data Mining Web Pages:
Statistical Data Mining Tutorials (by Andrew Moore) - Highly recommended! Excellent introductions to the DM techniques.
An Introduction Student Notes - Good materials to accompany the course.
An Introduction to Data Mining (by Kurt Thearling) - General ideas of why we need to do DM and how DM works.
The Data Mine - Launched in April 1994, it provides information about DM. A nice place to find DM topics.
CISC873 Data Mining - Finally, our course page, which is obviously necessary here.
[To Index]
Bayesian approaches are a fundamentally important DM technique. Given the probability distribution, the Bayes classifier can provably achieve the optimal result. The Bayesian method is grounded in probability theory: Bayes' rule is applied to calculate the posterior from the prior and the likelihood, because the latter two are generally easier to obtain from a probability model.
One limitation that Bayesian approaches cannot get around is the need to estimate probabilities from the training dataset. Note that in some situations, such as when the decision is clearly based on fixed criteria or the dataset has a high degree of randomness, Bayesian approaches will not be a good choice.
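As a reminder, Bayes' rule for a class C and observed attributes X can be written as:

P(C | X) = P(X | C) * P(C) / P(X)

Since P(X) is the same for every class, classification only needs the prior P(C) and the likelihood P(X | C): the class with the largest product of the two wins.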
My introduction slides of Bayesian Approaches. [pdf]
The Naïve Bayes Classifier technique is particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods. The following example is a simple demonstration of applying the Naïve Bayes Classifier from StatSoft.
As indicated in Figure 1, the objects can be classified as either GREEN or RED. Our task is to classify new cases as they arrive (i.e., decide to which class label they belong, based on the currently existing objects).
Figure 1. Objects are classified as GREEN or RED.
We can then calculate the priors (i.e., the probability of each class among all objects) based on previous experience. Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class membership are 40/60 for GREEN and 20/60 for RED.
Having formulated our prior probabilities, we are now ready to classify a new object (the WHITE circle in Figure 2). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects in the vicinity of X, the more likely it is that the new case belongs to that particular color. To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points irrespective of their class labels. Then we count the number of points in the circle belonging to each class label.
Figure 2. Classifying the WHITE circle (the new object X).
We can now calculate the likelihoods. In Figure 2 the circle encompasses 1 GREEN object and 3 RED ones, so the likelihood of X given GREEN is 1/40 and the likelihood of X given RED is 3/20; the likelihood given RED is clearly the larger of the two.
Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN compared to RED) the likelihood indicates otherwise; that the class membership of X is RED (given that there are more RED objects in the vicinity of X than GREEN). In the Bayesian analysis, the final classification is produced by combining both sources of information (i.e. the prior and the likelihood) to form a posterior probability using Bayes Rule.
Combining them, the posterior for GREEN is proportional to 40/60 × 1/40 = 1/60, while the posterior for RED is proportional to 20/60 × 3/20 = 3/60. Finally, we classify X as RED, since that class membership achieves the largest posterior probability.
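A minimal Python sketch of this calculation, using the counts from the example above (40 GREEN, 20 RED, and a circle containing 1 GREEN and 3 RED objects):

# Counts from the GREEN/RED example.
n_green, n_red = 40, 20
n_total = n_green + n_red

prior = {"GREEN": n_green / n_total, "RED": n_red / n_total}

# The circle drawn around X contains 1 GREEN and 3 RED objects.
likelihood = {"GREEN": 1 / n_green, "RED": 3 / n_red}

# Unnormalized posteriors (Bayes' rule without the common denominator).
posterior = {c: prior[c] * likelihood[c] for c in prior}

# Pick the class with the largest posterior -- "RED" here.
print(posterior, max(posterior, key=posterior.get))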
[To Index]
The problem with the Naïve Bayes Classifier is that it assumes all attributes are independent of each other, which in general does not hold. A Gaussian PDF can be plugged in here to estimate each attribute's probability density function (PDF). Because Gaussian PDF theory is well developed, we can classify new objects more easily through the same Bayes classifier model, but with some degree of recognition of the covariance. This normally gives a more accurate classification result.
I guess one question to be asked is: why Gaussian? There are many other PDFs that could be applied. But from a statistical point of view, many real-world distributions are more likely to be well approximated by a Gaussian PDF than by others. If you are familiar with information theory, the Gaussian gives the maximum entropy over an unbounded range for a fixed variance, which means the Gaussian is better able to model the randomness.
How to apply the Gaussian to the Bayes Classifier?
The application here is very intuitive. We assume the density estimate follows a Gaussian distribution; then the prior and the likelihood can be calculated through the Gaussian PDF. The critical step is to identify the Gaussian distribution (i.e., to find its mean and variance). The following 5 steps are a general procedure for fitting the Gaussian distribution to our input dataset (a minimal code sketch follows the list):
Choose a probability estimator form (Gaussian)
Choose an initial set of parameters for the estimator (Gaussian mean and variance)
Given parameters, compute posterior estimates for hidden variable
Given posterior estimates, find the distributional parameters that maximize the expectation (mean) of the joint density for the data and the hidden variable (this is guaranteed to also improve the likelihood).
Assess the goodness of fit (i.e., the log likelihood). If the stopping criterion is not met, return to step 3.
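A minimal Python sketch of the simplest, fully supervised case: each class's Gaussian is fit directly by maximum likelihood (per-attribute mean and variance, i.e. a diagonal covariance), and new points are scored with the Gaussian density. The iterative posterior/maximization loop above is only needed when hidden variables are involved.

import numpy as np

def fit_gaussian_bayes(X, y):
    # Estimate a prior, mean and variance per class (one Gaussian per attribute).
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),       # prior P(c)
                     Xc.mean(axis=0),        # per-attribute mean
                     Xc.var(axis=0) + 1e-9)  # per-attribute variance (small floor)
    return params

def log_gaussian(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def predict(params, x):
    # Return the class with the largest posterior (log prior + log likelihood).
    scores = {c: np.log(p) + log_gaussian(x, m, v).sum()
              for c, (p, m, v) in params.items()}
    return max(scores, key=scores.get)

# Toy usage with two 2-D classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(predict(fit_gaussian_bayes(X, y), np.array([2.5, 2.8])))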
From a research perspective, the Gaussian may not be the only PDF that can be applied to the Bayes classifier, although it has very strong theoretical support and nice properties. The general model for applying other PDFs is the same; the estimation results depend heavily on whether, or how closely, a PDF can model the given dataset.
Some commonly used PDFs are listed below (just to refresh our statistics):
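For example, three standard densities:

Gaussian:    p(x) = 1/(sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2*sigma^2))
Uniform:     p(x) = 1/(b - a) for a <= x <= b, and 0 otherwise
Exponential: p(x) = lambda * exp(-lambda * x) for x >= 0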
[To Index]
First of all, there is a nice introduction, Bayesian Networks and their Contemporary Applications by Daryle Niedermayer [web page]. It is hard for me to come up with a better one, so the following may just be a simpler version of Daryle's introduction.
From the previous tutorial on the Gaussian Bayes classifier, we noticed that the Gaussian model helps to integrate some correlation, which improves classification performance over the Naïve model's independence assumption. However, the Gaussian model with the Bayes classifier is still limited in the correlations it can represent. This is where Bayesian Networks (Bayes nets) come in.
A Bayes net is a model that makes use of the conditional probabilities among different variables. It is generally impossible to generate all conditional probabilities from a given dataset; our task is to pick the important ones and use them in the classification process. Essentially, a Bayes net is a set of "generalized probability density functions" (gpdfs).
Informally, we can define a Bayes net as an augmented directed acyclic graph, represented by a vertex set V and a directed edge set E. Each vertex in V represents an attribute, and each edge in E represents a correlation between two attributes. An example Bayes net over five attributes is shown in Figure 3.
Figure 3. A Bayes net for 5 attributes.
Some important observations here:
There is no loop in the graph representation since it is acyclic.
Two variables vi and vj may still be correlated even if they are not connected.
Each variable vi is conditionally independent of all non-descendants, given its parents.
Now, let's suffer through the formal mathematical definition.
Consider a domain U of n variables, x1, ..., xn. Each variable may be discrete, having a finite or countable number of states, or continuous. Given a subset X of variables xi, where xi ∈ U, if one can observe the state of every variable in X, then this observation is called an instance of X and is denoted X = x for the observations x. The "joint space" of U is the set of all instances of U. p(X = x | Y = y, ξ) denotes the "generalized probability density" that X = x given Y = y for a person with current state information ξ. p(X | Y, ξ) then denotes the gpdf for X, given all possible observations of Y. The joint gpdf over U is the gpdf for U.
A Bayesian network for domain U represents a joint gpdf over U. This representation consists of a set of local conditional gpdfs combined with a set of conditional independence assertions that allow the construction of a global gpdf from the local gpdfs. These values can then be ascertained as:

p(x1, ..., xn | ξ) = ∏i p(xi | Πi, ξ)     (Equation 1)

where Πi denotes the parents of xi. A Bayesian network structure then encodes the assertions of conditional independence in Equation 1 above. Essentially, a Bayesian network structure Bs is a directed acyclic graph such that (1) each variable in U corresponds to a node in Bs, and (2) the parents of the node corresponding to xi are the nodes corresponding to the variables in Πi.
It is not hard to argue the advantage of using the Bayes net model here, but a formal analysis needs probability inference theory, which we discuss briefly in the next section. Readers who are not interested in the mathematical analysis can skip that section to avoid confusion (sometimes not a bad thing).
How do we build a Bayes net? A general procedure is given below (a minimal code sketch follows the list):
Choose a set of relevant variables.
Choose an ordering for the variables.
Assume the variables are X1, X2, ..., Xn (where X1 is the first, and Xi is the ith).
for i = 1 to n:
Add the Xi vertex to the network
Set Parents(Xi) to be a minimal subset of X1, ..., Xi-1 such that Xi is conditionally independent of all other members of X1, ..., Xi-1 given Parents(Xi).
Define the probability table of P(Xi = k | assignments of Parents(Xi)).
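A minimal Python sketch of the resulting representation for a hypothetical two-node network (Rain -> WetGrass), storing one conditional probability table per vertex and evaluating the joint probability with the factorization of Equation 1:

# Each node stores P(node | parents) as a table keyed by the parents' values.
# This toy network (Rain -> WetGrass) is purely illustrative.
p_rain = {True: 0.2, False: 0.8}
p_wet_given_rain = {True: {True: 0.9, False: 0.1},   # P(WetGrass | Rain=True)
                    False: {True: 0.2, False: 0.8}}  # P(WetGrass | Rain=False)

def joint(rain, wet):
    # p(Rain, WetGrass) = P(Rain) * P(WetGrass | Rain), i.e. Equation 1.
    return p_rain[rain] * p_wet_given_rain[rain][wet]

# Posterior of Rain given WetGrass=True, via Bayes' rule over the joint.
evidence = joint(True, True) + joint(False, True)
print(joint(True, True) / evidence)   # P(Rain=True | WetGrass=True) ~ 0.53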
There are many choices for how to select the relevant variables, as well as for how to estimate the conditional probabilities. If we imagine the network as a connection of Bayes classifiers, then the probability estimation can be done by applying a PDF such as the Gaussian. In some cases, the design of the network can be rather complicated. There are some efficient ways of getting relevant variables from the dataset attributes: assuming the incoming signal is stochastic gives a nice way of extracting the signal attributes, and likelihood weighting is another way of obtaining attributes.
[To Index]
Applying Bayesian Approaches on Datasets
Dataset Preliminary Analysis [pdf]
Attribute Selection results:
CfsSubsetEval with BestFirst (15 attributes)[txt]
InfoGainAttributeEval with Ranker (12 attributes with score 0.1000 or above) [txt]
ClassifierSubsetEval (NaiveBayes) with BestFirst (11 attributes) [txt]
ClassifierSubsetEval (BayesNet) with BestFirst [txt]
Preliminary Mining results (selected):
All attributes with NaiveBayes (82% training) [txt] - 80%
15 attributes (CfsSubsetEval) with AODE (82% training) [txt] - 84%
12 attributes (InfoGainAttributeEval) with BayesNet (82% training) [txt] - 84%
11 attributes (ClassifierSubsetEval-NaiveBayes) with BayesNet (65% training) [txt] - 87.7551%
12 attributes (ClassifierSubsetEval-BayesNet) with BayesNet (78% training) [txt] - 87.0968%
Record "optimal" result:
12 attributes (InfoGainAttributeEval) with AODE (82% training) [txt] - 92%
Attribute subsets:
CfsSubsetEval: {10, 11, 14, 24, 25, 27, 65, 71, 74, 83, 88, 92, 107, 115, 119}
InfoGainAttributeEval: {11, 15, 19, 24, 27, 29, 71, 81, 88, 92, 115, 119}
ClassifierSubsetEval(NaiveBayes): {6, 9, 10, 19, 24, 49, 59, 65, 72, 100, 115}
ClassifierSubsetEval(BayesNet): {7, 10, 28, 37, 46, 49, 57, 64, 81, 88, 91, 115}
What are the "important" attributes?
*Please note that the attribute order is different from the original .csv file. We removed the 6 attributes and discretized all attributes. Please use the file here to see the order.
Mining Results [pdf]
We mainly look at the Bayes net in our final mining process, since the Bayes network demonstrates statistical properties such as independence and inference, which may help us select attributes. Some Bayes networks built by Weka using the attribute sets from the preliminary analysis are shown below.
InfoGainAttributeEval set (12) | ClassifierSubsetEval(NaiveBayes) set (11) | ClassifierSubsetEval(BayesNet) set (12) |
The attribute set from InfoGainAttributeEval shows more correlations in the generated network. This may or may not be good news for the Bayes net, since some correlations may distract the classification process. The attribute sets from both ClassifierSubsetEval(NaiveBayes) and ClassifierSubsetEval(BayesNet) show some independence between the "target" attribute and other attributes. We considered removing the attributes that are independent of "target", since they are to some degree irrelevant to our classification. We further tried combining the "target"-correlated attributes of the two sets, and obtained a set of 8 attributes, listed below.
Combined Attribute Set: {AQR-B, AQR-Mg, AQR-Pb, CHX-Fe, SDA-Cr, SDA-Na, SDA-Se, SPF-Mg}
The Bayes network constructed by Weka is shown in the figure below, and the classification result is here [txt].
The training/testing split is 66%, and we can see from the network that all attributes are correlated. Some discussion of the results is given in the first dataset's mining-result slides.
Finally, here is a nice description of Weka's Bayes net implementation by its developer, which just came out this September. [pdf]
We plotted some attributes against the AMGN attribute, and they do appear to form some clusters. However, we have not figured out how to exploit this to improve our mining results. [pdf] (plot)
[To Index]
Dataset Preliminary Analysis [pdf]
Microarray Data Cleaning
We remove the "Gene Description" column first, and rename "Gene Accession Number" to "ID".
We normalize the data by clamping each value to a minimum of 20 and a maximum of 16,000 (i.e., expression values below 20 or above 16,000 were considered unreliable by biologists for this experiment).
We transpose the data so that each column represents an attribute and each row represents a record (the transpose is needed to make the CSV file compatible with Weka, and was done in MatLab).
We further add a "Class" attribute to indicate the kind of leukemia (i.e., Class ∈ {ALL, AML}).
The cleaned training data can be downloaded here [csv].
An internet tip suggests also removing the initial records whose Gene Description contains "control" (those are Affymetrix controls, not human genes); this has not been applied in our cleaning steps.
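A minimal pandas sketch of these cleaning steps (the file names, the label file, and the column labels are assumptions, not the actual course files):

import pandas as pd

# Hypothetical input file names; the real files come from the course dataset.
df = pd.read_csv("leukemia_train.csv")

df = df.drop(columns=["Gene Description"])                # 1. drop the description column
df = df.rename(columns={"Gene Accession Number": "ID"})   #    and rename the accession column

expr = df.set_index("ID").clip(lower=20, upper=16000)     # 2. clamp values to [20, 16000]

samples = expr.T                                          # 3. transpose: rows = samples, columns = genes
samples["Class"] = pd.read_csv("labels.csv")["Class"].values  # 4. add the ALL/AML class column

samples.to_csv("leukemia_train_clean.csv", index_label="Sample")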
Attribute Selection / Feature Reduction
There are in total 7070 genes (attributes) in the dataset, so attribute selection is critical. We use the standard signal-to-noise (S2N) ratio and the T-value to obtain a significant attribute set. (Note that it is impractical to run any attribute selection algorithm from Weka here, since the memory requirement would be huge.)
Let Avg1, Avg2 be the average expression values and Stdev1, Stdev2 the sample standard deviations for the ALL and AML samples, respectively. Then:

S2N = (Avg1 - Avg2) / (Stdev1 + Stdev2)

T-value = (Avg1 - Avg2) / sqrt(Stdev1*Stdev1/N1 + Stdev2*Stdev2/N2)

where N1 is the number of ALL observations and N2 is the number of AML observations.
T-value
The T-value is the observed value of the T-statistic that is used to test the hypothesis that two attributes are correlated. The T-value can range between -infinity and +infinity. A T-value near 0 is evidence for the null hypothesis that there is no correlation between the attributes; a T-value far from 0 (either positive or negative) is evidence for the alternative hypothesis that there is correlation between the attributes.
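A minimal numpy sketch of this ranking; expr is assumed to be a samples-by-genes array and labels an array of "ALL"/"AML" strings:

import numpy as np

def rank_genes(expr, labels, top=50):
    # Rank genes by S2N ratio and by T-value, as defined above.
    all_x = expr[labels == "ALL"]
    aml_x = expr[labels == "AML"]
    n1, n2 = len(all_x), len(aml_x)

    avg1, avg2 = all_x.mean(axis=0), aml_x.mean(axis=0)
    std1, std2 = all_x.std(axis=0, ddof=1), aml_x.std(axis=0, ddof=1)

    s2n = (avg1 - avg2) / (std1 + std2)
    t_value = (avg1 - avg2) / np.sqrt(std1**2 / n1 + std2**2 / n2)

    # Indices of the `top` genes with the highest scores under each measure.
    return np.argsort(-s2n)[:top], np.argsort(-t_value)[:top]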
Top 50 genes with the highest S2N ratio:
Rank | Gene Name | S2N | T-value | Rank | Gene Name | S2N | T-value |
1 | M55150_at | 1.467641 | 8.091951 | 26 | L08246_at | 1.034784 | 4.506267 |
2 | X95735_at | 1.444531 | 5.727643 | 27 | X74262_at | 1.027773 | 6.016389 |
3 | U50136_rna1_at | 1.421708 | 6.435952 | 28 | M62762_at | 1.023393 | 5.122153 |
4 | U22376_cds2_s_at | 1.339308 | 7.9043 | 29 | M31211_s_at | 1.022881 | 6.294207 |
5 | M81933_at | 1.204042 | 6.164965 | 30 | M28130_rna1_s_at | 1.001286 | 3.911006 |
6 | M16038_at | 1.203218 | 4.930437 | 31 | D26156_s_at | 0.989098 | 6.097113 |
7 | M84526_at | 1.20142 | 4.057042 | 32 | M63138_at | 0.983628 | 4.048325 |
8 | M23197_at | 1.195974 | 4.737778 | 33 | M31523_at | 0.971228 | 5.677543 |
9 | U82759_at | 1.192556 | 6.24302 | 34 | M57710_at | 0.967686 | 3.69931 |
10 | Y12670_at | 1.184737 | 5.324928 | 35 | X15949_at | 0.96009 | 5.389656 |
11 | D49950_at | 1.143704 | 5.588063 | 36 | M69043_at | 0.95886 | 4.708053 |
12 | M27891_at | 1.133427 | 3.986204 | 37 | S50223_at | 0.956733 | 5.858626 |
13 | X59417_at | 1.124637 | 6.803106 | 38 | U32944_at | 0.954875 | 5.442992 |
14 | X52142_at | 1.122589 | 5.833144 | 39 | M81695_s_at | 0.953066 | 4.195556 |
15 | M28170_at | 1.116756 | 6.253971 | 40 | L47738_at | 0.95138 | 5.715039 |
16 | X17042_at | 1.105975 | 5.388623 | 41 | M83652_s_at | 0.947504 | 3.721478 |
17 | U05259_rna1_at | 1.103966 | 6.175126 | 42 | X85116_rna1_s_at | 0.946372 | 3.923369 |
18 | Y00787_s_at | 1.081995 | 4.701085 | 43 | M11147_at | 0.945755 | 5.588645 |
19 | M96326_rna1_at | 1.07719 | 3.869518 | 44 | Z15115_at | 0.9452 | 5.615471 |
20 | U12471_cds1_at | 1.069731 | 6.146299 | 45 | M21551_rna1_at | 0.941981 | 5.377292 |
21 | U46751_at | 1.064078 | 4.127982 | 46 | M19045_f_at | 0.938076 | 3.834549 |
22 | M80254_at | 1.044395 | 4.271131 | 47 | X04085_rna1_at | 0.930499 | 4.931191 |
23 | M92287_at | 1.043056 | 6.217365 | 48 | L49229_f_at | 0.920745 | 5.398203 |
24 | L13278_at | 1.042032 | 6.021342 | 49 | X14008_rna1_f_at | 0.914954 | 3.655554 |
25 | U09087_s_at | 1.036257 | 6.182425 | 50 | M91432_at | 0.913521 | 5.300882 |
The S2N training dataset (top 50 genes) can be downloaded here [csv] (transposed).
The S2N testing dataset (top 50 genes) can be downloaded here [csv] (transposed).
Top 50 genes with the highest T-value:
Rank | Gene Name | S2N | T-value | Rank | Gene Name | S2N | T-value |
1 | M55150_at | 1.467641 | 8.091951 | 26 | U32944_at | 0.954875 | 5.442992 |
2 | U22376_cds2_s_at | 1.339308 | 7.9043 | 27 | U26266_s_at | 0.887813 | 5.425205 |
3 | X59417_at | 1.124637 | 6.803106 | 28 | J05243_at | 0.886777 | 5.406254 |
4 | U50136_rna1_at | 1.421708 | 6.435952 | 29 | L49229_f_at | 0.920745 | 5.398203 |
5 | M31211_s_at | 1.022881 | 6.294207 | 30 | X15949_at | 0.96009 | 5.389656 |
6 | M28170_at | 1.116756 | 6.253971 | 31 | X17042_at | 1.105975 | 5.388623 |
7 | U82759_at | 1.192556 | 6.24302 | 32 | M21551_rna1_at | 0.941981 | 5.377292 |
8 | M92287_at | 1.043056 | 6.217365 | 33 | M31303_rna1_at | 0.875318 | 5.370571 |
9 | U09087_s_at | 1.036257 | 6.182425 | 34 | Y08612_at | 0.877482 | 5.366653 |
10 | U05259_rna1_at | 1.103966 | 6.175126 | 35 | U20998_at | 0.896103 | 5.361239 |
11 | M81933_at | 1.204042 | 6.164965 | 36 | AF012024_s_at | 0.873658 | 5.333337 |
12 | U12471_cds1_at | 1.069731 | 6.146299 | 37 | X56411_rna1_at | 0.870983 | 5.327392 |
13 | D26156_s_at | 0.989098 | 6.097113 | 38 | Y12670_at | 1.184737 | 5.324928 |
14 | L13278_at | 1.042032 | 6.021342 | 39 | U29175_at | 0.908033 | 5.322716 |
15 | X74262_at | 1.027773 | 6.016389 | 40 | M91432_at | 0.913521 | 5.300882 |
16 | S50223_at | 0.956733 | 5.858626 | 41 | HG1612-HT1612_at | 0.856338 | 5.276869 |
17 | X52142_at | 1.122589 | 5.833144 | 42 | M13792_at | 0.866312 | 5.222356 |
18 | X95735_at | 1.444531 | 5.727643 | 43 | D63874_at | 0.844716 | 5.206861 |
19 | L47738_at | 0.95138 | 5.715039 | 44 | U72342_at | 0.848128 | 5.206761 |
20 | M31523_at | 0.971228 | 5.677543 | 45 | X97267_rna1_s_at | 0.844922 | 5.203003 |
21 | Z15115_at | 0.9452 | 5.615471 | 46 | X76648_at | 0.861562 | 5.195203 |
22 | M11147_at | 0.945755 | 5.588645 | 47 | U35451_at | 0.842047 | 5.144391 |
23 | D49950_at | 1.143704 | 5.588063 | 48 | Z69881_at | 0.896518 | 5.143816 |
24 | X63469_at | 0.905744 | 5.583296 | 49 | D63880_at | 0.833577 | 5.125102 |
25 | D38073_at | 0.902946 | 5.562983 | 50 | M62762_at | 1.023393 | 5.122153 |
The T-value training dataset (top 50 genes) can be downloaded here [csv] (transposed).
The T-value testing dataset (top 50 genes) can be downloaded here [csv] (transposed).
Common genes from the above two sets:
31 attributes in the common set: {D26156_s_at, D49950_at, L13278_at, L47738_at, L49229_f_at, M11147_at, M21551_rna1_at, M28170_at, M31211_s_at, M31523_at, M55150_at, M62762_at, M81933_at, M91432_at, M92287_at, S50223_at, U05259_rna1_at, U09087_s_at, U12471_cds1_at, U22376_cds2_s_at, U32944_at, U50136_rna1_at, U82759_at, X15949_at, X17042_at, X52142_at, X59417_at, X74262_at, X95735_at, Y12670_at, Z15115_at}
Common set attributes can be visualized here.
From a statistical standpoint, the common attribute set contains the attributes that have high discriminating ability for distinguishing ALL from AML leukemia.
The Common training dataset (31 common genes) can be downloaded here [csv] (transposed).
The Common testing dataset (31 common genes) can be downloaded here [csv] (transposed).
Preliminary result with NaiveBayes and BayesNet
ID | NB(S2N) | BN(S2N) | NB(T) | BN(T) | NB(C) | BN(C) |
39 | ALL | ALL | ALL | ALL | ALL | ALL |
40 | ALL | ALL | ALL | ALL | ALL | ALL |
42 | ALL | ALL | ALL | ALL | ALL | ALL |
47 | ALL | ALL | ALL | ALL | ALL | ALL |
48 | ALL | ALL | ALL | ALL | ALL | ALL |
49 | ALL | ALL | ALL | ALL | ALL | ALL |
41 | ALL | ALL | ALL | ALL | ALL | ALL |
43 | ALL | ALL | ALL | ALL | ALL | ALL |
44 | ALL | ALL | ALL | ALL | ALL | ALL |
45 | ALL | ALL | ALL | ALL | ALL | ALL |
46 | ALL | ALL | ALL | ALL | ALL | ALL |
70 | ALL | ALL | ALL | ALL | ALL | ALL |
71 | ALL | ALL | ALL | ALL | ALL | ALL |
72 | ALL | ALL | ALL | ALL | ALL | ALL |
68 | ALL | ALL | ALL | ALL | ALL | ALL |
69 | ALL | ALL | ALL | ALL | ALL | ALL |
67 | AML | AML | ALL | ALL | AML | AML |
55 | ALL | ALL | ALL | ALL | ALL | ALL |
56 | ALL | ALL | ALL | ALL | ALL | ALL |
59 | ALL | ALL | ALL | ALL | ALL | ALL |
52 | AML | AML | AML | AML | AML | AML |
53 | AML | AML | AML | AML | AML | AML |
51 | AML | AML | ALL | AML | ALL | AML |
50 | AML | AML | AML | AML | AML | AML |
54 | AML | ALL | ALL | ALL | ALL | ALL |
57 | AML | AML | AML | AML | AML | AML |
58 | AML | AML | AML | AML | AML | AML |
60 | AML | ALL | ALL | ALL | ALL | ALL |
61 | AML | ALL | AML | ALL | AML | ALL |
65 | AML | AML | AML | AML | AML | AML |
66 | ALL | ALL | ALL | ALL | ALL | ALL |
63 | AML | AML | ALL | AML | AML | AML |
64 | AML | AML | ALL | AML | AML | AML |
62 | AML | AML | ALL | ALL | ALL | ALL |
Total ALL | 20 | 23 | 27 | 25 | 24 | 24 |
Total AML | 14 | 11 | 7 | 9 | 10 | 10 |
* NB: NaiveBayes; BN: BayesNet.
* S2N uses the top 50 S2N-ratio gene set; T uses the top 50 T-value gene set; C uses the common 31-gene set from the previous two.
* The ID field uses the original sample sequence number.
* Red entries mark the samples that are classified differently depending on the attribute set or technique used.
Mining Results [pdf]
Given the limitations of the Bayes methods on this dataset, we do not try to push the mining further in the hope of a "better" result. The major problems with the Bayes methods here are listed below:
It is hard to judge validation performance (on either the training set or the testing set).
Comparison between the methods is generally not applicable.
Detecting a new type of leukemia is impossible with the given training set.
Nevertheless, we still put some effort into selecting a better attribute/gene set, which can be found in our final results [pdf].
[To Index]
Dataset Preliminary Analysis [pdf]
Statistical and Syntactic (preamble):
We propose to use rule-based systems, besides some Bayes methods, to mine this dataset. The interesting point to examine here is how these two techniques work and compare. Clearly, the Bayes approach is based on a statistical model built from the dataset, while a rule-based system is a syntactic approach, in some sense more like our own thinking process. Theoretically, statistical techniques have well-founded mathematical support and are thus usually computationally inexpensive to apply. On the other hand, syntactic techniques give nice structural descriptions/rules and are thus simple to understand and validate.
Regardless of the above remarks, for this dataset from the Sloan Digital Sky Survey (release 3), intuitive observation really cannot tell us which technique will perform better. We would like to try both techniques on the dataset and compare them based on the mining outcomes.
Data Preprocessing:
We add a top row with descriptions of the columns. The last column is normalized in two ways, to make a two-class list and a three-class list.
Two-class: "8" -> "1" (i.e. class 1) and "4" and "7" -> "2" (i.e. class 2).
Three-class: "8" -> "1" (i.e. class1) ,"7" -> "2" (i.e. class 2), and "4" -> "3" (i.e. class3).
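A minimal pandas sketch of this remapping (the file and column names are assumptions):

import pandas as pd

df = pd.read_csv("sdss_r3.csv")  # hypothetical file name

# Two-class list: 8 -> 1; 4 and 7 -> 2.
df_two = df.copy()
df_two["class"] = df["class"].map({8: 1, 4: 2, 7: 2})

# Three-class list: 8 -> 1, 7 -> 2, 4 -> 3.
df_three = df.copy()
df_three["class"] = df["class"].map({8: 1, 7: 2, 4: 3})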
Preliminary Analysis:
The two-class case seems very optimistic if we look at the plots of the attributes against the "class" attribute below:
Attributes 7, 8, 9, and 10 separate the two classes completely, which means any of these attributes can be used to give a promising mining outcome. The two-class case is thus of less interest and is not discussed further.
The three-class case is, unfortunately and fortunately, not as simple as the two-class case. The plots of the attributes demonstrate the difficulties.
The main problem is classes "2" and "3" (i.e., red and light blue in the graph), as hardly any attribute can clearly separate them. Above all, attribute 10 is probably the most useful one in the three-class case.
Some Results:
The two-class case is trivial: both techniques guarantee 100% correctness in the mining. This can be seen clearly from the rules output by PRISM and DecisionTable (i.e., the rule-based systems):
Rules:
================================
attr_7           class
================================
'(-inf-3.047]'   1
'(3.047-inf)'    2
================================
Note that Weka probably runs the algorithm sequentially through all attributes, so attribute 7 is picked up first.
The three-class case can be mined directly by the two techniques (attributes need to be discretized when using PRISM). The confusion matrices are listed in the table:
| BayesNet - 95.2497% | DecisionTable - 98.9521% | PRISM - 98.7426% |
Class => | a | b | c | a | b | c | a | b | c |
a = 1 | 330 | 0 | 0 | 330 | 0 | 0 | 330 | 0 | 0 |
b = 2 | 0 | 2328 | 126 | 0 | 2438 | 16 | 0 | 2437 | 11 |
c = 3 | 0 | 10 | 69 | 0 | 14 | 65 | 0 | 17 | 60 |
DecisionTable uses the attribute set {1, 4, 7, 10}. PRISM uses all attributes, ranked by information gain.
The information gain ranking is: 10, 9, 7, 8, 6, 4, 5, 3, 2, 1 (based on attribute sequence).
Other results are available in the Preliminary Analysis report [pdf].
Mining Results [pdf]
Visualization through Clustering
Some interesting information can be gained by clustering the dataset without predefined knowledge (i.e., without the class). The following plots show different clustering results from applying the k-means algorithm in Weka.
Class 1 is generally clustered together (shown in red), which is a very pleasant case; classes 2 and 3 are confused in the cluster plots.
The parallel plot and scatter plot are also very interesting to see here; both demonstrate some important properties of this dataset.
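A minimal sketch of the same clustering step using scikit-learn's k-means instead of Weka's clusterer (the file and column names are assumptions):

import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("sdss_r3_three_class.csv")      # hypothetical file name
X = df.drop(columns=["class"])                   # cluster without the class labels

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
df["cluster"] = kmeans.labels_

# Cross-tabulate clusters against the true classes to see how well they line up.
print(pd.crosstab(df["cluster"], df["class"]))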
[To Index]
Dataset (Additional set with missing values from dataset #3)
Dataset Recovery
The most intuitive way to deal with the missing values is simply to ignore them. The plots of the attributes that ignore the missing values show distributions very similar to those of the previous dataset.
We also try to recover the missing values with the means. The following table lists the means of the 10 attributes for each class (i.e., "4", "7", or "8"), omitting the missing cells.
Class/Mean | attr_1 | attr_2 | attr_3 | attr_4 | attr_5 | attr_6 | attr_7 | attr_8 | attr_9 | attr_10 | Total Sample |
4 | -0.098 | 1.317 | 4.051 | 5.958 | 1.475 | 4.147 | 6.047 | 2.657 | 4.542 | 1.888 | 256 |
7 | 1.498 | 3.081 | 5.198 | 6.370 | 1.594 | 3.701 | 4.871 | 2.108 | 3.277 | 1.170 | 7186 |
8 | 0.559 | 0.478 | 0.325 | 0.150 | -0.060 | -0.200 | -0.367 | -0.145 | -0.314 | -0.169 | 976 |
The process is then straightforward: replace each missing value with the mean value corresponding to its class and attribute.
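A minimal pandas sketch of this per-class mean imputation (the file and column names are assumptions):

import pandas as pd

df = pd.read_csv("additional_with_missing.csv")  # hypothetical file name

# Replace each missing cell with the mean of its attribute within the same class.
attrs = [c for c in df.columns if c.startswith("attr_")]
df[attrs] = df.groupby("class")[attrs].transform(lambda col: col.fillna(col.mean()))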
The plots of the attributes after recovery are given below:
The parallel plot gives basically the same graph; note that there is a difference caused by sorting with respect to the class number.
General Result [pdf]
The dataset recovered with the means generally helps the Bayes methods to classify the dwarfs, but is somewhat misleading for the rule-based systems.
The standard deviation of each attribute under the two preprocessing methods, shown in the table below, illustrates the point to some extent.
Method | Class | attr_1 | attr_2 | attr_3 | attr_4 | attr_5 | attr_6 | attr_7 | attr_8 | attr_9 | attr_10 |
Ignore | 4 | 1.216 | 1.250 | 1.164 | 1.068 | 0.773 | 0.707 | 0.717 | 0.446 | 0.568 | 0.374 |
Replace | 4 | 1.216 | 1.188 | 1.105 | 0.999 | 0.734 | 0.692 | 0.699 | 0.439 | 0.553 | 0.365 |
Ignore | 7 | 1.082 | 1.120 | 1.048 | 1.020 | 0.278 | 0.333 | 0.431 | 0.267 | 0.398 | 0.158 |
Replace | 7 | 1.081 | 1.056 | 0.994 | 0.975 | 0.267 | 0.321 | 0.418 | 0.260 | 0.388 | 0.154 |
Ignore | 8 | 0.699 | 1.151 | 1.312 | 1.493 | 0.457 | 0.657 | 0.850 | 0.201 | 0.408 | 0.240 |
Replace | 8 | 0.699 | 1.114 | 1.294 | 1.475 | 0.453 | 0.652 | 0.845 | 0.200 | 0.405 | 0.238 |
[To Index]
Weka Tutorial (Ulrike Sattler's introduction) - The best tutorial I could find online, though still somewhat limited in usefulness.
Weka Explorer Guide (University of Waikato) - I downloaded it from somewhere; it is not for the newest version, but usable.
Weka Java Class Document (University of Waikato) - Java style documents for every class in the .jar of the package.
MatLab Introduction Tutorial (UF Mathematics) - I commonly refer to it to find commands.
The Pattern Classification Toolbox - A nice toolbox for visualizing Bayes classifiers. [Computer Manual in MATLAB to accompany Pattern Classification (by David G. Stork and Elad Yom-Tov)]
[To Index]