Data Mining  Bayesian Approaches
Data Mining Course [CISC 873, School of Computing, Queen's University] 
Bayesian Approaches Tutorials and Applying Results [Henry 2004] 
Data Mining (DM)
Introduction
DM Definitions
DM Web Pages
Bayesian Tutorials
Overview
Naïve Bayesian Classifiers
Gaussian Bayesian Classifiers
Bayesian Networks
Applying Bayesian Approach
on Datasets
Dataset #1
Dataset #2
Dataset #3
Dataset
Additional
Mining Software
Weka
MatLab
Resource | Definition
Two Crows | An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis.
Gotcha | A type of application with built-in proprietary algorithms that sort, rank, and perform calculations on a specified and often large data set, producing visualizations that reveal patterns which may not have been evident from mere listings or summaries.
The OLAP Report | The process of using statistical techniques to discover subtle relationships between data items, and the construction of predictive models based on them. The process is not the same as just using an OLAP tool to find exceptional items. Generally, data mining is a very different and more specialist application than OLAP, and uses different tools from different vendors. Normally the users are different, too.
Data Mining Web Pages:
Statistical Data Mining Tutorials (by Andrew Moore) - Highly recommended! Excellent introductions to the DM techniques.
An Introduction Student Notes - Good material to accompany the course.
An Introduction to Data Mining (by Kurt Thearling) - General ideas of why we need to do DM and how DM works.
The Data Mine - Launched in April 1994, providing information about DM. A nice place to find topics in DM.
CISC873 Data Mining - Finally, our course page, which is obviously necessary here.
[To Index]
Bayesian approaches are a fundamentally important DM technique. Given the true probability distribution, the Bayes classifier provably achieves the optimal result. The Bayesian method is based on probability theory. Bayes Rule is applied to calculate the posterior from the prior and the likelihood, because the latter two are generally easier to calculate from a probability model.
One limitation that the Bayesian approaches cannot escape is the need to estimate probabilities from the training dataset. In some situations, such as when the decision is clearly based on certain criteria, or when the dataset has a high degree of randomness, the Bayesian approaches will not be a good choice.
My introduction slides of Bayesian Approaches. [pdf]
The Naïve Bayes Classifier technique is particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods. The following example is a simple demonstration of applying the Naïve Bayes Classifier from StatSoft.
As indicated in Figure 1, the objects can be classified as either GREEN or RED. Our task is to classify new cases as they arrive (i.e., decide to which class label they belong, based on the currently existing objects).
Figure 1. Objects are classified as GREEN or RED.
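We can then calculate the priors (i.e., the proportion of each class among all objects) based on previous experience. Thus:
Prior probability for GREEN ∝ number of GREEN objects / total number of objects
Prior probability for RED ∝ number of RED objects / total number of objects
Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class membership are:
Prior probability for GREEN = 40/60
Prior probability for RED = 20/60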
Having formulated our prior probability, we are now ready to classify a new object X (the WHITE circle in Figure 2). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects there are in the vicinity of X, the more likely it is that the new case belongs to that particular color. To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points irrespective of their class labels. Then we calculate the number of points in the circle belonging to each class label.
Figure 2. classify the WHITE circle.
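We can calculate the likelihood:
Likelihood of X given GREEN ∝ number of GREEN objects in the vicinity of X / total number of GREEN objects
Likelihood of X given RED ∝ number of RED objects in the vicinity of X / total number of RED objects
In Figure 2, it is clear that the likelihood of X given RED is larger than the likelihood of X given GREEN, since the circle encompasses 1 GREEN object and 3 RED ones. Thus:
Likelihood of X given GREEN = 1/40
Likelihood of X given RED = 3/20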
Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN compared to RED) the likelihood indicates otherwise; that the class membership of X is RED (given that there are more RED objects in the vicinity of X than GREEN). In the Bayesian analysis, the final classification is produced by combining both sources of information (i.e. the prior and the likelihood) to form a posterior probability using Bayes Rule.
Finally, we classify X as RED since its class membership achieves the largest posterior probability.
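The arithmetic of this example can be checked with a short sketch. It assumes the usual reading of the example, namely that the likelihood of X given a class is the number of that class's objects inside the circle divided by the class size, and that the posterior is compared up to the common normalizing constant. The variable names are illustrative only.

```python
from fractions import Fraction

# Counts from the example: 60 objects in total, 40 GREEN and 20 RED.
class_counts = {"GREEN": 40, "RED": 20}
# Objects of each class inside the circle drawn around X (Figure 2).
neighbours = {"GREEN": 1, "RED": 3}

total = sum(class_counts.values())

# Prior: share of each class among all objects.
prior = {c: Fraction(n, total) for c, n in class_counts.items()}

# Likelihood of X given a class (assumed definition):
# neighbours of that class / size of that class.
likelihood = {c: Fraction(neighbours[c], class_counts[c]) for c in class_counts}

# Unnormalized posterior: prior * likelihood (Bayes Rule without the evidence term).
posterior = {c: prior[c] * likelihood[c] for c in class_counts}

predicted = max(posterior, key=posterior.get)
print(posterior, predicted)
```

Under these assumptions the posterior for GREEN is 4/6 × 1/40 = 1/60 and for RED is 2/6 × 3/20 = 1/20, so RED wins even though GREEN has the larger prior.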
[To Index]
The problem with the Naïve Bayes Classifier is that it assumes all attributes are independent of each other, which in general does not hold. A Gaussian PDF can be plugged in here to estimate the attribute probability density function (PDF). Because Gaussian PDF theory is well developed, we can classify a new object more easily through the same Bayes Classifier model while recognizing the covariance between attributes to some degree. Normally, this gives a more accurate classification result.
One question to ask is: why Gaussian? Many other PDFs could be applied. But from a statistical point of view, many real-world distributions are better approximated by a Gaussian PDF than by others. If you are familiar with Information Theory, the Gaussian has the maximum entropy among distributions over an unbounded range with a given mean and variance, which means it makes the fewest additional assumptions about the randomness in the data.
How to apply the Gaussian to the Bayes Classifier?
The application here is very intuitive. We assume the Density Estimation follows a Gaussian distribution. Then the prior and the likelihood can be calculated through the Gaussian PDF. The critical thing here is to identify the Gaussian distribution (i.e. find the mean and variance of the Gaussian). The following 5 steps are a general model to initialize the Gaussian distribution to fit our input dataset.
1. Choose a probability estimator form (Gaussian).
2. Choose an initial set of parameters for the estimator (Gaussian mean and variance).
3. Given the parameters, compute posterior estimates for the hidden variable.
4. Given the posterior estimates, find the distributional parameters that maximize the expectation (mean) of the joint density for the data and the hidden variable (this is guaranteed to also improve the likelihood).
5. Assess the goodness of fit (i.e. the log likelihood). If the stopping criterion is not met, return to (3).
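The five steps above read like an expectation-maximization (EM) iteration. The following is a minimal sketch under that reading, for a one-dimensional mixture of two Gaussians where the hidden variable is taken to be the component that generated each point. The function names, the synthetic data, and the stopping tolerance are all assumptions made for illustration, not part of the original tutorial.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(data, means, variances, weights, tol=1e-6, max_iter=200):
    """Fit a two-component 1-D Gaussian mixture with EM (steps 2 to 5)."""
    means, variances, weights = list(means), list(variances), list(weights)
    previous = -math.inf
    for _ in range(max_iter):
        # Step 3: posterior of the hidden component label for every point.
        posteriors = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            norm = sum(joint)
            posteriors.append([j / norm for j in joint])
        # Step 4: re-estimate the parameters from the posterior-weighted data.
        for k in range(2):
            mass = sum(p[k] for p in posteriors)
            weights[k] = mass / len(data)
            means[k] = sum(p[k] * x for p, x in zip(posteriors, data)) / mass
            spread = sum(p[k] * (x - means[k]) ** 2 for p, x in zip(posteriors, data)) / mass
            variances[k] = max(spread, 1e-6)  # guard against a collapsing component
        # Step 5: log likelihood as goodness of fit; stop once it no longer improves.
        log_likelihood = sum(
            math.log(sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
            for x in data
        )
        if log_likelihood - previous < tol:
            break
        previous = log_likelihood
    return means, variances, weights, log_likelihood


random.seed(0)  # synthetic data: two clusters around 0 and 5 (made up for illustration)
sample = [random.gauss(0.0, 1.0) for _ in range(200)] + [random.gauss(5.0, 1.0) for _ in range(200)]
means, variances, weights, log_likelihood = fit_two_gaussians(
    sample, [-1.0, 6.0], [1.0, 1.0], [0.5, 0.5]
)
print(means, variances, weights)
```

For a Bayes classifier with fully labelled training data there is no hidden variable, and the mean and variance of each class can simply be estimated directly from that class's records.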
From a research perspective, the Gaussian may not be the only PDF that can be applied to the Bayes Classifier, although it has very strong theoretical support and nice properties. The general model for applying other PDFs should be the same. The estimation results depend heavily on how closely a PDF can approximate the given dataset.
Some commonly used PDFs are listed below (just to refresh our statistics):
[To Index]
First of all, there is a nice introduction to Bayesian Networks and their Contemporary Applications by Daryle Niedermayer [web page]. It is hard for me to come up with a better one here, so the following may be just a simpler version of Daryle's introduction.
From the previous tutorial on the Gaussian Bayes Classifier, we notice that the Gaussian model helps to integrate some correlation, which improves the classification performance over the Naïve model that assumes independence. However, using the Gaussian model with the Bayes Classifier is still limited in how it captures correlations. This is where Bayesian Networks (Bayes Nets) come in.
A Bayes Net is a model that utilizes the conditional probabilities among different variables. It is generally impossible to generate all conditional probabilities from a given dataset. Our task is to pick the important ones and use them in the classification process. So essentially, a Bayes net is a set of "generalized probability density functions" (gpdfs).
Informally, we can define a Bayes net as an augmented directed acyclic graph, represented by the vertex set V and directed edge set E. Each vertex in V represents an attribute, and each edge in E represents a correlation between two attributes. An example of a Bayes net with five attributes is shown in Figure 3.
Figure 3. A Bayes net for 5 attributes.
Some important observations here:
There is no loop in the graph representation since it is acyclic.
Two variables vi and vj may still be correlated even if they are not connected.
Each variable vi is conditionally independent of all nondescendants, given its parents.
Now, let's suffer through the formal mathematical definition.
Consider a domain U of n variables, x_{1}, ..., x_{n}. Each variable may be discrete, having a finite or countable number of states, or continuous. Given a subset X of variables x_{i}, where x_{i} ∈ U, if one can observe the state of every variable in X, then this observation is called an instance of X and is denoted as X = k_{X} for the observations x_{i} = k_{i}, x_{i} ∈ X. The "joint space" of U is the set of all instances of U. p(X = k_{X} | Y = k_{Y}, ξ) denotes the "generalized probability density" that X = k_{X} given Y = k_{Y} for a person with current state information ξ. p(X | Y, ξ) then denotes the gpdf for X, given all possible observations of Y. The joint gpdf over U is the gpdf for U.
A Bayesian network for domain U represents a joint gpdf over U. This representation consists of a set of local conditional gpdfs combined with a set of conditional independence assertions that allow the construction of a global gpdf from the local gpdfs. These values can then be ascertained as:
$$p(x_1, \ldots, x_n \mid \xi) = \prod_{i=1}^{n} p(x_i \mid \Pi_i, \xi) \qquad \text{(Equation 1)}$$
where ξ is the current state information and Π_{i} ⊆ {x_{1}, ..., x_{i-1}} is the set of variables on which x_{i} directly depends.
A Bayesian Network Structure then encodes the assertions of conditional independence in Equation 1 above. Essentially then, a Bayesian Network Structure B_{s} is a directed acyclic graph such that (1) each variable in U corresponds to a node in B_{s}, and (2) the parents of the node corresponding to x_{i} are the nodes corresponding to the variables in Π_{i}.
It is not hard to argue for the advantage of using the Bayes net model here. A formal analysis, however, needs probabilistic inference theory, which is briefly discussed in the next section. Readers who are not interested in the mathematical analysis can skip the next section to avoid confusion (sometimes not a bad idea).
How to build a Bayes Net? A general model to follow is given below:
1. Choose a set of relevant variables.
2. Choose an ordering for the variables.
3. Assume the variables are X_{1}, X_{2}, ..., X_{n} (where X_{1} is the first, and X_{i} is the i-th).
4. For i = 1 to n:
   a. Add the X_{i} vertex to the network.
   b. Set Parents(X_{i}) to be a minimal subset of X_{1}, ..., X_{i-1}, such that X_{i} is conditionally independent of all other members of X_{1}, ..., X_{i-1} given Parents(X_{i}).
   c. Define the probability table of P(X_{i} = k | assignments of Parents(X_{i})).
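Once the parents and probability tables have been chosen as in the steps above, the joint probability of any full assignment is the product of the local conditional probabilities. The sketch below illustrates this on a hypothetical three-variable network (A -> C <- B) with made-up binary probability tables; neither the structure nor the numbers come from the datasets discussed on this page.

```python
from itertools import product

# Hypothetical network: A and B have no parents, C has parents (A, B).
# The insertion order of this dict is the chosen variable ordering.
parents = {"A": (), "B": (), "C": ("A", "B")}

# cpt[var][parent_values] = P(var = True | parents = parent_values).
# All numbers are invented for illustration.
cpt = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {
        (True, True): 0.9,
        (True, False): 0.7,
        (False, True): 0.4,
        (False, False): 0.1,
    },
}


def joint_probability(assignment):
    """Multiply P(X_i | Parents(X_i)) over all variables of the network."""
    probability = 1.0
    for var, pars in parents.items():
        p_true = cpt[var][tuple(assignment[p] for p in pars)]
        probability *= p_true if assignment[var] else 1.0 - p_true
    return probability


# Summing the joint over every assignment should give 1.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product([True, False], repeat=len(parents))
)
print(joint_probability({"A": True, "B": True, "C": True}), total)
```

The dictionary-of-tables layout is only one possible design; it keeps one table per vertex, which mirrors step 4c directly.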
There are many choices of how to select relevant variables, as well as how to estimate the conditional probabilities. If we imagine the network as a connection of Bayes classifiers, then the probability estimation can be done by applying some PDF such as the Gaussian. In some cases, the design of the network can be rather complicated. There are some efficient ways of getting relevant variables from the dataset attributes. Assuming the incoming signal to be stochastic gives a nice way of extracting the signal attributes. Likelihood weighting is another common way of getting attributes.
[To Index]
Applying Bayesian Approach on Datasets
Dataset Preliminary Analysis [pdf]
Attribute Selection results:
CfsSubsetEval with BestFirst (15 attributes)[txt]
InfoGainAttributeEval with Ranker (12 attributes ranked above 0.1000) [txt]
ClassifierSubsetEval (NaiveBayes) with BestFirst (11 attributes) [txt]
ClassifierSubsetEval (BayesNet) with BestFirst [txt]
Preliminary Mining results (selected):
All attributes with NaiveBayes (82% training) [txt] - 80%
15 attributes (CfsSubsetEval) with AODE (82% training) [txt] - 84%
12 attributes (InfoGainAttributeEval) with BayesNet (82% training) [txt] - 84%
11 attributes (ClassifierSubsetEval, NaiveBayes) with BayesNet (65% training) [txt] - 87.7551%
12 attributes (ClassifierSubsetEval, BayesNet) with BayesNet (78% training) [txt] - 87.0968%
Record "optimal" result:
12 attributes (InfoGainAttributeEval) with AODE (82% training) [txt] - 92%
Attribute subsets:
CfsSubsetEval - {10, 11, 14, 24, 25, 27, 65, 71, 74, 83, 88, 92, 107, 115, 119}
InfoGainAttributeEval - {11, 15, 19, 24, 27, 29, 71, 81, 88, 92, 115, 119}
ClassifierSubsetEval(NaiveBayes) - {6, 9, 10, 19, 24, 49, 59, 65, 72, 100, 115}
ClassifierSubsetEval(BayesNet) - {7, 10, 28, 37, 46, 49, 57, 64, 81, 88, 91, 115}
What are the "important" attributes?
*Please note that the attribute order is different from the original .csv file. We removed 6 attributes and discretized all remaining attributes. Please use the file here to see the order.
Mining Results [pdf]
We mainly look at the Bayes Net in our final mining process, since the Bayes network exposes some statistical properties, such as independence and inference, which may help us select attributes. Some Bayes networks built by Weka using the attribute sets from the preliminary analysis are shown below.
InfoGainAttributeEval Set (12)
ClassifierSubsetEval(NaiveBayes) (11)
ClassifierSubsetEval(BayesNet) (12)
The attribute set from InfoGainAttributeEval shows more correlations in the generated network. This may or may not be good news for the Bayes Net, since some correlations may distract the classification process. The attribute sets from both ClassifierSubsetEval(NaiveBayes) and ClassifierSubsetEval(BayesNet) contain some attributes that are independent of the "target" attribute. We consider removing those attributes, since they are to some degree irrelevant to our classification. We further tried combining the "target"-correlated attributes of the two sets, and got a set of 8 attributes as listed below.
Combined Attribute Set: {AQRB, AQRMg, AQRPb, CHXFe, SDACr, SDANa, SDASe, SPFMg}
The Bayes network constructed by Weka is shown in the figure below, and the classification result is here [txt].
The training/testing split is at 66%, and we can see from the network that all attributes are correlated. Some discussion of the results is given in the first dataset mining result slides.
Finally, here is a nice description of Bayes Nets in Weka by the developer, which just came out this September. [pdf]
We plot some attributes against the AMGN attribute, and they do appear to form some clusters. However, we have not figured out how to employ this to improve our mining result. [pdf] (plot)
[To Index]
Dataset Preliminary Analysis [pdf]
Microarray Data Cleaning
We first remove the "Gene Description" column, and rename "Gene Accession Number" to "ID".
We normalize the data by thresholding each value to a minimum of 20 and a maximum of 16,000 (expression values below 20 or above 16,000 were considered by biologists to be unreliable for this experiment).
We transpose the data so that each column represents an attribute and each row represents a record (the transpose is needed to make the CSV file compatible with Weka, and is done in MatLab).
We further add a "Class" attribute to indicate the kind of leukemia (i.e. Class {ALL, AML}).
The cleaned training data can be downloaded here [csv].
One tip found on the internet suggests removing the initial records whose Gene Description contains "control)" (those are Affymetrix controls, not human genes); this has not been applied in our cleaning steps.
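The thresholding and transpose steps can be sketched as follows. This assumes that "normalize" means clipping every expression value into the range [20, 16000]; the function names and the tiny input matrix are made up for illustration (the original transpose was done in MatLab).

```python
def clip_value(value, low=20, high=16000):
    """Clamp one expression value into the range considered reliable."""
    return max(low, min(high, value))


def clean_and_transpose(genes_by_samples):
    """Clip every value, then turn gene rows into sample rows."""
    clipped = [[clip_value(v) for v in row] for row in genes_by_samples]
    # zip(*rows) transposes: each output row is one sample (record),
    # each column one gene (attribute).
    return [list(column) for column in zip(*clipped)]


# Two hypothetical genes measured on three samples.
genes_by_samples = [
    [5, 300, 20000],
    [150, -12, 16000],
]
samples_by_genes = clean_and_transpose(genes_by_samples)
print(samples_by_genes)
```

After this step a "Class" column can be appended to each sample row before writing the CSV file for Weka.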
Attribute Selection / Feature Reduction
There are in total 7070 genes (attributes) in the dataset, so attribute selection is critical. We use the standard Signal-to-Noise (S2N) ratio and the T-value to get a significant attribute set. (Note that it is impractical to run any attribute selection algorithm from Weka since the memory requirement would be huge.)
Let Avg1, Avg2 be the average expression values of a gene over the ALL and AML samples, respectively.
Let Stdev1, Stdev2 be the corresponding sample standard deviations.
S2N = (Avg1 - Avg2) / (Stdev1 + Stdev2)
T-value = (Avg1 - Avg2) / sqrt(Stdev1*Stdev1/N1 + Stdev2*Stdev2/N2)
where N1 is the number of ALL observations, and N2 is the number of AML observations.
T-value
The T-value is the observed value of the T-statistic that is used to test the hypothesis that two attributes are correlated (here, a gene's expression level and the class label). The T-value can range between -infinity and +infinity. A T-value near 0 is evidence for the null hypothesis that there is no correlation between the attributes. A T-value far from 0 (either positive or negative) is evidence for the alternative hypothesis that there is correlation between the attributes.
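As a sketch of how the two scores are computed for one gene, the function below follows the S2N and T-value formulas given above, using the sample standard deviation. The function name and the example expression values are hypothetical.

```python
import math
from statistics import mean, stdev


def s2n_and_t_value(all_values, aml_values):
    """Return (S2N, T-value) for one gene from its ALL and AML expression values."""
    avg1, avg2 = mean(all_values), mean(aml_values)
    stdev1, stdev2 = stdev(all_values), stdev(aml_values)  # sample standard deviations
    n1, n2 = len(all_values), len(aml_values)
    s2n = (avg1 - avg2) / (stdev1 + stdev2)
    t_value = (avg1 - avg2) / math.sqrt(stdev1 * stdev1 / n1 + stdev2 * stdev2 / n2)
    return s2n, t_value


# Made-up expression values for a single gene.
s2n, t_value = s2n_and_t_value([10.0, 12.0, 14.0], [1.0, 3.0, 5.0])
print(s2n, t_value)
```

Ranking all genes by the absolute value of either score and keeping the top 50 would give gene lists of the kind tabulated in this section.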
Top 50 genes with the highest S2N ratio:
Rank | Gene Name | S2N | T-value | Rank | Gene Name | S2N | T-value
1 | M55150_at | 1.467641 | 8.091951 | 26 | L08246_at | 1.034784 | 4.506267
2 | X95735_at | 1.444531 | 5.727643 | 27 | X74262_at | 1.027773 | 6.016389
3 | U50136_rna1_at | 1.421708 | 6.435952 | 28 | M62762_at | 1.023393 | 5.122153
4 | U22376_cds2_s_at | 1.339308 | 7.9043 | 29 | M31211_s_at | 1.022881 | 6.294207
5 | M81933_at | 1.204042 | 6.164965 | 30 | M28130_rna1_s_at | 1.001286 | 3.911006
6 | M16038_at | 1.203218 | 4.930437 | 31 | D26156_s_at | 0.989098 | 6.097113
7 | M84526_at | 1.20142 | 4.057042 | 32 | M63138_at | 0.983628 | 4.048325
8 | M23197_at | 1.195974 | 4.737778 | 33 | M31523_at | 0.971228 | 5.677543
9 | U82759_at | 1.192556 | 6.24302 | 34 | M57710_at | 0.967686 | 3.69931
10 | Y12670_at | 1.184737 | 5.324928 | 35 | X15949_at | 0.96009 | 5.389656
11 | D49950_at | 1.143704 | 5.588063 | 36 | M69043_at | 0.95886 | 4.708053
12 | M27891_at | 1.133427 | 3.986204 | 37 | S50223_at | 0.956733 | 5.858626
13 | X59417_at | 1.124637 | 6.803106 | 38 | U32944_at | 0.954875 | 5.442992
14 | X52142_at | 1.122589 | 5.833144 | 39 | M81695_s_at | 0.953066 | 4.195556
15 | M28170_at | 1.116756 | 6.253971 | 40 | L47738_at | 0.95138 | 5.715039
16 | X17042_at | 1.105975 | 5.388623 | 41 | M83652_s_at | 0.947504 | 3.721478
17 | U05259_rna1_at | 1.103966 | 6.175126 | 42 | X85116_rna1_s_at | 0.946372 | 3.923369
18 | Y00787_s_at | 1.081995 | 4.701085 | 43 | M11147_at | 0.945755 | 5.588645
19 | M96326_rna1_at | 1.07719 | 3.869518 | 44 | Z15115_at | 0.9452 | 5.615471
20 | U12471_cds1_at | 1.069731 | 6.146299 | 45 | M21551_rna1_at | 0.941981 | 5.377292
21 | U46751_at | 1.064078 | 4.127982 | 46 | M19045_f_at | 0.938076 | 3.834549
22 | M80254_at | 1.044395 | 4.271131 | 47 | X04085_rna1_at | 0.930499 | 4.931191
23 | M92287_at | 1.043056 | 6.217365 | 48 | L49229_f_at | 0.920745 | 5.398203
24 | L13278_at | 1.042032 | 6.021342 | 49 | X14008_rna1_f_at | 0.914954 | 3.655554
25 | U09087_s_at | 1.036257 | 6.182425 | 50 | M91432_at | 0.913521 | 5.300882
The S2N training dataset (50 top genes) can be downloaded here [csv] (transposed).
The S2N testing dataset (50 top genes) can be downloaded here [csv] (transposed).
Top 50 genes with the highest T-value:
Rank | Gene Name | S2N | T-value | Rank | Gene Name | S2N | T-value
1 | M55150_at | 1.467641 | 8.091951 | 26 | U32944_at | 0.954875 | 5.442992
2 | U22376_cds2_s_at | 1.339308 | 7.9043 | 27 | U26266_s_at | 0.887813 | 5.425205
3 | X59417_at | 1.124637 | 6.803106 | 28 | J05243_at | 0.886777 | 5.406254
4 | U50136_rna1_at | 1.421708 | 6.435952 | 29 | L49229_f_at | 0.920745 | 5.398203
5 | M31211_s_at | 1.022881 | 6.294207 | 30 | X15949_at | 0.96009 | 5.389656
6 | M28170_at | 1.116756 | 6.253971 | 31 | X17042_at | 1.105975 | 5.388623
7 | U82759_at | 1.192556 | 6.24302 | 32 | M21551_rna1_at | 0.941981 | 5.377292
8 | M92287_at | 1.043056 | 6.217365 | 33 | M31303_rna1_at | 0.875318 | 5.370571
9 | U09087_s_at | 1.036257 | 6.182425 | 34 | Y08612_at | 0.877482 | 5.366653
10 | U05259_rna1_at | 1.103966 | 6.175126 | 35 | U20998_at | 0.896103 | 5.361239
11 | M81933_at | 1.204042 | 6.164965 | 36 | AF012024_s_at | 0.873658 | 5.333337
12 | U12471_cds1_at | 1.069731 | 6.146299 | 37 | X56411_rna1_at | 0.870983 | 5.327392
13 | D26156_s_at | 0.989098 | 6.097113 | 38 | Y12670_at | 1.184737 | 5.324928
14 | L13278_at | 1.042032 | 6.021342 | 39 | U29175_at | 0.908033 | 5.322716
15 | X74262_at | 1.027773 | 6.016389 | 40 | M91432_at | 0.913521 | 5.300882
16 | S50223_at | 0.956733 | 5.858626 | 41 | HG1612HT1612_at | 0.856338 | 5.276869
17 | X52142_at | 1.122589 | 5.833144 | 42 | M13792_at | 0.866312 | 5.222356
18 | X95735_at | 1.444531 | 5.727643 | 43 | D63874_at | 0.844716 | 5.206861
19 | L47738_at | 0.95138 | 5.715039 | 44 | U72342_at | 0.848128 | 5.206761
20 | M31523_at | 0.971228 | 5.677543 | 45 | X97267_rna1_s_at | 0.844922 | 5.203003
21 | Z15115_at | 0.9452 | 5.615471 | 46 | X76648_at | 0.861562 | 5.195203
22 | M11147_at | 0.945755 | 5.588645 | 47 | U35451_at | 0.842047 | 5.144391
23 | D49950_at | 1.143704 | 5.588063 | 48 | Z69881_at | 0.896518 | 5.143816
24 | X63469_at | 0.905744 | 5.583296 | 49 | D63880_at | 0.833577 | 5.125102
25 | D38073_at | 0.902946 | 5.562983 | 50 | M62762_at | 1.023393 | 5.122153
The T-value training dataset (50 top genes) can be downloaded here [csv] (transposed).
The T-value testing dataset (50 top genes) can be downloaded here [csv] (transposed).
Common genes from the above two sets:
31 attributes in the common set: {D26156_s_at, D49950_at, L13278_at, L47738_at, L49229_f_at, M11147_at, M21551_rna1_at, M28170_at, M31211_s_at, M31523_at, M55150_at, M62762_at, M81933_at, M91432_at, M92287_at, S50223_at, U05259_rna1_at, U09087_s_at, U12471_cds1_at, U22376_cds2_s_at, U32944_at, U50136_rna1_at, U82759_at, X15949_at, X17042_at, X52142_at, X59417_at, X74262_at, X95735_at, Y12670_at, Z15115_at}
Common set attributes can be visualized here.
From a statistical point of view, the common attribute set contains the attributes with the highest ability to discriminate between ALL and AML leukemia.
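The common set is simply the intersection of the two top-50 lists. As a minimal sketch, the lists below are truncated to the top three genes from each table purely to keep the example short; with the full 50-gene lists the same call would produce the common set described above.

```python
# Top three genes by S2N and by T-value, copied from the two tables
# (truncated for illustration only).
top_s2n = ["M55150_at", "X95735_at", "U50136_rna1_at"]
top_t_value = ["M55150_at", "U22376_cds2_s_at", "X59417_at"]

# Genes that rank highly under both criteria, in sorted order.
common = sorted(set(top_s2n) & set(top_t_value))
print(common)
```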
The Common training dataset (31 common genes) can be downloaded here [csv] (transposed).
The Common testing dataset (31 common genes) can be downloaded here [csv] (transposed).
Preliminary result with NaiveBayes and BayesNet
ID | NB(S2N) | BN(S2N) | NB(T) | BN(T) | NB(C) | BN(C)
39 | ALL | ALL | ALL | ALL | ALL | ALL
40 | ALL | ALL | ALL | ALL | ALL | ALL
42 | ALL | ALL | ALL | ALL | ALL | ALL
47 | ALL | ALL | ALL | ALL | ALL | ALL
48 | ALL | ALL | ALL | ALL | ALL | ALL
49 | ALL | ALL | ALL | ALL | ALL | ALL
41 | ALL | ALL | ALL | ALL | ALL | ALL
43 | ALL | ALL | ALL | ALL | ALL | ALL
44 | ALL | ALL | ALL | ALL | ALL | ALL
45 | ALL | ALL | ALL | ALL | ALL | ALL
46 | ALL | ALL | ALL | ALL | ALL | ALL
70 | ALL | ALL | ALL | ALL | ALL | ALL
71 | ALL | ALL | ALL | ALL | ALL | ALL
72 | ALL | ALL | ALL | ALL | ALL | ALL
68 | ALL | ALL | ALL | ALL | ALL | ALL
69 | ALL | ALL | ALL | ALL | ALL | ALL
67 | AML | AML | ALL | ALL | AML | AML
55 | ALL | ALL | ALL | ALL | ALL | ALL
56 | ALL | ALL | ALL | ALL | ALL | ALL
59 | ALL | ALL | ALL | ALL | ALL | ALL
52 | AML | AML | AML | AML | AML | AML
53 | AML | AML | AML | AML | AML | AML
51 | AML | AML | ALL | AML | ALL | AML
50 | AML | AML | AML | AML | AML | AML
54 | AML | ALL | ALL | ALL | ALL | ALL
57 | AML | AML | AML | AML | AML | AML
58 | AML | AML | AML | AML | AML | AML
60 | AML | ALL | ALL | ALL | ALL | ALL
61 | AML | ALL | AML | ALL | AML | ALL
65 | AML | AML | AML | AML | AML | AML
66 | ALL | ALL | ALL | ALL | ALL | ALL
63 | AML | AML | ALL | AML | AML | AML
64 | AML | AML | ALL | AML | AML | AML
62 | AML | AML | ALL | ALL | ALL | ALL
Total ALL | 20 | 23 | 27 | 25 | 24 | 24
Total AML | 14 | 11 | 7 | 9 | 10 | 10
* NB: NaiveBayes; BN: BayesNet.
* S2N uses the top 50 S2N ratio gene set; T uses the top 50 T-value gene set; C uses the common 31-gene set from the previous two.
* The ID field uses the original sample sequence number.
* Red columns show the samples classified differently under different attribute sets or different techniques.
Mining Results [pdf]
Given the limitations of the Bayes methods on this dataset, we do not push the mining further in the hope of some "better" result. Some major problems with the Bayes methods here are listed below:
Validation (on either the training set or the testing set) is hard to assess.
Comparison between the methods is generally not applicable.
Detection of a new type of leukemia is impossible with the given training set.
Nevertheless, we still put some effort into selecting a better attribute/gene set, which can be found in our final results [pdf].
[To Index]
Dataset Preliminary Analysis [pdf]
Statistical and Syntactic (preamble):
We propose to use rule-based systems to mine this dataset in addition to some Bayes methods. The interesting point to examine here is how these two techniques work and compare. Clearly, the Bayes approach is based on a statistical model built from the dataset, while a rule-based system is a syntactic approach that is in some sense more like our own thinking process. Theoretically, the statistical techniques have well-founded mathematical support and are thus usually computationally inexpensive to apply. On the other hand, syntactic techniques give nice structural descriptions/rules and are thus simple to understand and validate.
Regardless of the above remarks, for this dataset from the Sloan Digital Sky Survey, release 3, intuitive observation really cannot tell us which technique may perform better than the other. We would like to try both techniques on the dataset and compare them mainly by their mining outcomes.
Data Preprocessing:
We add a top row for descriptions of the columns. The last column is normalized in two ways to make a two-class list and a three-class list.
Two-class: "8" -> "1" (i.e. class 1), and "4" and "7" -> "2" (i.e. class 2).
Three-class: "8" -> "1" (i.e. class 1), "7" -> "2" (i.e. class 2), and "4" -> "3" (i.e. class 3).
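The two relabelings can be written as simple lookup tables. This sketch assumes the mappings read as "original label -> new class" exactly as listed above; the dictionary and function names are illustrative.

```python
# Original class codes in the last column mapped to the new class labels.
TWO_CLASS = {"8": "1", "4": "2", "7": "2"}
THREE_CLASS = {"8": "1", "7": "2", "4": "3"}


def relabel(labels, mapping):
    """Replace each original class code with its mapped class label."""
    return [mapping[label] for label in labels]


original = ["8", "7", "4", "7"]  # made-up sample of the last column
print(relabel(original, TWO_CLASS))
print(relabel(original, THREE_CLASS))
```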
Preliminary Analysis:
The two-class case looks very promising if we take a look at the graph of attributes plotted with respect to the "class" below:
Attributes 7, 8, 9, and 10 separate the two classes completely, which means any one of these attributes can be used to give a promising mining outcome. The two-class case is thus of less interest for further discussion.
The three-class case is, unfortunately and fortunately, not as simple as the two-class case. The plots of attributes demonstrate the difficulties.
The main problem is classes "2" and "3" (i.e. red and light blue in the graph), as hardly any attribute can clearly separate them. Above all, attribute 10 is probably the most useful one in the three-class case.
Some Results:
The two-class case is trivial: both techniques achieve 100% correctness. This can be seen clearly from the rule output by PRISM and DecisionTable (i.e. the rule-based systems):
Rules:
================================
attr_7 class
================================
'(-inf-3.047]' 1
'(3.047-inf)' 2
================================
Note that Weka probably runs the algorithm sequentially through all attributes, so attribute 7 is picked up first.
The three-class case can be directly mined by the two techniques (the attributes need to be discretized when using PRISM). The confusion matrices are listed in the table:
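Classifier (accuracy) | BayesNet (95.2497%) | | | DecisionTable (98.9521%) | | | PRISM (98.7426%) | |
Classified as => | a | b | c | a | b | c | a | b | c
a = 1 | 330 | 0 | 0 | 330 | 0 | 0 | 330 | 0 | 0
b = 2 | 0 | 2328 | 126 | 0 | 2438 | 16 | 0 | 2437 | 11
c = 3 | 0 | 10 | 69 | 0 | 14 | 65 | 0 | 17 | 60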
DecisionTable uses an attribute set of {1,4,7,10}. PRISM uses all attributes, with their importance ranked by information gain.
The information gain ranking is: 10, 9, 7, 8, 6, 4, 5, 3, 2, 1 (based on attribute sequence).
Other results are available in the Preliminary Analysis report [pdf].
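The accuracies quoted with the confusion matrices can be recomputed from the matrix entries. In the sketch below the function name and the optional `total` argument are assumptions for illustration. The PRISM matrix sums to 2855 rather than 2863, which suggests that 8 instances were left unclassified, so its accuracy is computed against the full 2863 instances.

```python
def accuracy(matrix, total=None):
    """Fraction of instances on the diagonal of a confusion matrix.

    `total` overrides the instance count when some instances are missing
    from the matrix (e.g. left unclassified).
    """
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    if total is None:
        total = sum(sum(row) for row in matrix)
    return correct / total


# Rows are actual classes 1, 2, 3; columns are predicted classes a, b, c.
bayes_net = [[330, 0, 0], [0, 2328, 126], [0, 10, 69]]
decision_table = [[330, 0, 0], [0, 2438, 16], [0, 14, 65]]
prism = [[330, 0, 0], [0, 2437, 11], [0, 17, 60]]

instances = sum(sum(row) for row in bayes_net)  # 2863 instances in total
print(100 * accuracy(bayes_net))
print(100 * accuracy(decision_table))
print(100 * accuracy(prism, total=instances))
```

Most of the BayesNet errors come from 126 class 2 instances being predicted as class 3, which matches the earlier observation that classes 2 and 3 are hard to separate.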
Mining Results [pdf]
Visualization through Clustering
Some interesting information can be gained by clustering the dataset without predefined knowledge (i.e. the class). The following plots show different clustering results from applying the k-means algorithm using Weka.
Class 1 is generally clustered in red, which is a very pleasant case. Classes 2 and 3 are hard to tell apart in the cluster plots.
The parallel plot and scatter plot are also very interesting to see here. Both of them demonstrate some important properties of this dataset.
[To Index]
Dataset (Additional set with missing values from dataset #3)
Dataset Recovery
The most intuitive method of dealing with the missing values is simply to ignore them. The plots of attributes ignoring the missing values show distributions very similar to those of the previous dataset.
We also try to recover the missing values with the means. The following table lists the means of the 10 attributes for each class (i.e., "4", "7", or "8"), omitting the missing cells.
Class/Mean | attr_1 | attr_2 | attr_3 | attr_4 | attr_5 | attr_6 | attr_7 | attr_8 | attr_9 | attr_10 | Total Sample
4 | 0.098 | 1.317 | 4.051 | 5.958 | 1.475 | 4.147 | 6.047 | 2.657 | 4.542 | 1.888 | 256
7 | 1.498 | 3.081 | 5.198 | 6.370 | 1.594 | 3.701 | 4.871 | 2.108 | 3.277 | 1.170 | 7186
8 | 0.559 | 0.478 | 0.325 | 0.150 | 0.060 | 0.200 | 0.367 | 0.145 | 0.314 | 0.169 | 976
The process is then straightforward: each missing value is replaced with the mean for the corresponding class and attribute.
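A minimal sketch of this class-wise mean replacement is given below. Missing cells are represented as `None`, and the function name and sample rows are made up for illustration; the sketch assumes every class has at least one observed value for each attribute.

```python
from collections import defaultdict


def impute_class_means(rows, labels):
    """Replace None cells with the mean of that attribute within the row's class."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            if value is not None:  # missing cells are omitted from the mean
                sums[(label, j)] += value
                counts[(label, j)] += 1
    means = {key: sums[key] / counts[key] for key in sums}
    return [
        [means[(label, j)] if value is None else value for j, value in enumerate(row)]
        for row, label in zip(rows, labels)
    ]


# Made-up records with two attributes and classes "7" and "8".
rows = [[1.0, None], [3.0, 4.0], [None, 10.0], [6.0, 12.0]]
labels = ["7", "7", "8", "8"]
recovered = impute_class_means(rows, labels)
print(recovered)
```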
The plots of the attributes after recovery are given below:
The parallel plot gives basically the same graph.
Note that there is a difference caused by sorting with respect to the class number.
General Result [pdf]
Recovering the dataset with the class means in general helps the Bayes methods to classify the dwarfs, but is somewhat misleading for the rule-based systems.
The standard deviation table for each attribute under the two preprocessing methods illustrates the point.
Method | Class | attr_1 | attr_2 | attr_3 | attr_4 | attr_5 | attr_6 | attr_7 | attr_8 | attr_9 | attr_10
Ignore | 4 | 1.216 | 1.250 | 1.164 | 1.068 | 0.773 | 0.707 | 0.717 | 0.446 | 0.568 | 0.374
Replace | 4 | 1.216 | 1.188 | 1.105 | 0.999 | 0.734 | 0.692 | 0.699 | 0.439 | 0.553 | 0.365
Ignore | 7 | 1.082 | 1.120 | 1.048 | 1.020 | 0.278 | 0.333 | 0.431 | 0.267 | 0.398 | 0.158
Replace | 7 | 1.081 | 1.056 | 0.994 | 0.975 | 0.267 | 0.321 | 0.418 | 0.260 | 0.388 | 0.154
Ignore | 8 | 0.699 | 1.151 | 1.312 | 1.493 | 0.457 | 0.657 | 0.850 | 0.201 | 0.408 | 0.240
Replace | 8 | 0.699 | 1.114 | 1.294 | 1.475 | 0.453 | 0.652 | 0.845 | 0.200 | 0.405 | 0.238
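One way to see why the "Replace" rows are never larger than the "Ignore" rows: filling a missing cell with the mean adds a point with zero deviation, so the sample standard deviation can only shrink. The values below are made up for illustration.

```python
from statistics import mean, stdev

observed = [1.0, 2.0, 3.0]  # hypothetical attribute values, one cell missing

# "Ignore": standard deviation over the observed cells only.
ignored = stdev(observed)

# "Replace": the missing cell is filled with the mean of the observed cells.
replaced = stdev(observed + [mean(observed)])

print(ignored, replaced)
```

The spread of each class therefore looks artificially tighter after replacement, which may be part of why the recovered data affects the two kinds of method differently.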
[To Index]
Weka Tutorial (Ulrike Sattler's introduction) - It is the best tutorial I could find online, but it is still not that useful.
Weka Explorer Guide (University of Waikato) - I downloaded it from somewhere. Not for the newest version, but usable.
Weka Java Class Documentation (University of Waikato) - Javadoc-style documentation for every class in the package's .jar.
MatLab Introduction Tutorial (UF Mathematics) - I commonly refer to it to find commands.
The Pattern Classification Toolbox - Nice toolbox to visualize Bayes Classifiers. [Computer Manual in MATLAB to accompany Pattern Classification (by David G. Stork and Elad Yom-Tov)]
[To Index]