Data Mining - Bayesian Approaches

Data Mining Course [CISC 873, School of Computing, Queen's University]

Bayesian Approaches Tutorials and Applying Results [Henry 2004]

Tutorial Topics:

    Data Mining (DM) Introduction
        DM Definitions
        DM Web Pages
    Bayesian Tutorials
        Overview
        Naïve Bayesian Classifiers
        Gaussian Bayesian Classifiers
        Bayesian Networks
    Applying Bayesian Approach on Datasets
        Dataset #1
        Dataset #2
        Dataset #3
        Dataset Additional
    Mining Software
        Weka
        MatLab


Data Mining:

Some Data Mining (DM) Definitions

Two Crows: An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis.

Gotcha: A type of application with built-in proprietary algorithms that sort, rank, and perform calculations on a specified and often large data set, producing visualizations that reveal patterns which may not have been evident from mere listings or summaries.

The OLAP Report: The process of using statistical techniques to discover subtle relationships between data items, and the construction of predictive models based on them. The process is not the same as just using an OLAP tool to find exceptional items. Generally, data mining is a very different and more specialist application than OLAP, and uses different tools from different vendors. Normally the users are different, too.

    Data Mining Web Pages:

        [To Index]


Bayesian Tutorials:

    Overview

        Bayesian approaches are a fundamentally important DM technique. Given the true probability distribution, the Bayes classifier can provably achieve the optimal result. The Bayesian method is based on probability theory. Bayes Rule is applied to calculate the posterior from the prior and the likelihood, because the latter two are generally easier to calculate from a probability model.
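        In symbols, writing P(C) for the prior of class C, P(X | C) for the likelihood of the observed attributes X, and P(C | X) for the posterior:

            P(C | X) = P(X | C) * P(C) / P(X),   where   P(X) = sum over all classes C' of P(X | C') * P(C')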

        One limitation that the Bayesian approaches cannot get around is the need to estimate probabilities from the training dataset. It is worth noting that in some situations, such as when the decision is clearly based on fixed criteria, or when the dataset has a high degree of randomness, the Bayesian approaches will not be a good choice.

        My introductory slides on Bayesian Approaches. [pdf]

    Naïve Bayesian Classifiers

        The Naïve Bayes Classifier technique is particularly suited to problems where the dimensionality of the inputs is high. Despite its simplicity, Naïve Bayes can often outperform more sophisticated classification methods. The following example is a simple demonstration of the Naïve Bayes Classifier, adapted from StatSoft.

        As indicated in Figure 1, the objects can be classified as either GREEN or RED. Our task is to classify new cases as they arrive (i.e., decide to which class label they belong, based on the currently existing objects).
            Figure 1. Objects are classified as GREEN or RED.

        We can then calculate the priors (i.e. the probability of each class among all objects) based on previous experience. Thus:

            Prior probability of GREEN = Number of GREEN objects / Total number of objects
            Prior probability of RED   = Number of RED objects / Total number of objects

        Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class membership are:

            Prior probability of GREEN = 40/60
            Prior probability of RED   = 20/60

        Having formulated our prior probabilities, we are now ready to classify a new object (the WHITE circle in Figure 2). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects in the vicinity of X, the more likely the new case belongs to that particular color. To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points irrespective of their class labels. Then we calculate the number of points in the circle belonging to each class label.

        Figure 2. Classifying the new object (WHITE circle).

        We can calculate the likelihood:     

            Likelihood of X given GREEN = Number of GREEN in the vicinity of X / Total number of GREEN objects
            Likelihood of X given RED   = Number of RED in the vicinity of X / Total number of RED objects

        In Figure 2, it is clear that Likelihood of X given RED is larger than Likelihood of X given GREEN, since the circle encompasses 1 GREEN object and 3 RED ones. Thus:

            Likelihood of X given GREEN = 1/40
            Likelihood of X given RED   = 3/20

        Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN objects as RED), the likelihood indicates otherwise: the class membership of X is more likely RED (given that there are more RED objects in the vicinity of X than GREEN). In Bayesian analysis, the final classification is produced by combining both sources of information (i.e. the prior and the likelihood) to form a posterior probability using Bayes Rule.

            Posterior of GREEN = Prior of GREEN x Likelihood of X given GREEN = 4/6 x 1/40 = 1/60
            Posterior of RED   = Prior of RED x Likelihood of X given RED     = 2/6 x 3/20 = 1/20

        Finally, we classify X as RED since its class membership achieves the largest posterior probability.
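        The calculation above is small enough to check by hand. As a minimal sketch (our own illustration, not code from the tutorial), the same numbers can be reproduced in a few lines of Python:

            def classify(priors, likelihoods):
                """Pick the class with the largest unnormalized posterior = prior * likelihood."""
                posteriors = {c: priors[c] * likelihoods[c] for c in priors}
                return max(posteriors, key=posteriors.get), posteriors

            # 40 GREEN and 20 RED objects overall; the circle around X contains
            # 1 of the 40 GREEN objects and 3 of the 20 RED objects.
            priors = {"GREEN": 40 / 60, "RED": 20 / 60}
            likelihoods = {"GREEN": 1 / 40, "RED": 3 / 20}

            label, posteriors = classify(priors, likelihoods)
            print(label, posteriors)   # RED, with posteriors 1/60 (GREEN) vs 1/20 (RED)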

        [To Index]

 

    Gaussian Bayesian Classifiers

        The problem with the Naïve Bayes Classifier is that it assumes all attributes are independent of each other, which in general does not hold. A Gaussian PDF can be plugged in here to estimate the attribute probability density function (PDF). Because of the well-developed theory of the Gaussian PDF, we can classify the new object through the same Bayes Classifier model but with some recognition of the covariance. Normally, this gives a more accurate classification result.

        I guess one question to be asked is: why Gaussian? There are many other PDFs that could be applied, but from a statistical point of view, many real-world distributions are well approximated by a Gaussian PDF. If you are familiar with Information Theory, the Gaussian has the maximum entropy of any distribution with a given variance over an unbounded range, which makes it the least committal way to model the remaining randomness.

        How to apply the Gaussian to the Bayes Classifier?

        The application here is very intuitive. We assume the density follows a Gaussian distribution; then the prior and the likelihood can be calculated through the Gaussian PDF. The critical thing is to identify the Gaussian distribution (i.e. find the mean and variance of the Gaussian). The following 5 steps are a general model for fitting the Gaussian distribution to our input dataset.

  1. Choose a probability estimator form (Gaussian)

  2. Choose an initial set of parameters for the estimator (Gaussian mean and variance)

  3. Given parameters, compute posterior estimates for hidden variable

  4. Given posterior estimates, find the distributional parameters that maximize the expectation (mean) of the joint density for the data and the hidden variable (guaranteed to also improve the likelihood)

  5. Assess the goodness of fit (i.e. the log likelihood). If the stopping criterion is not met, return to (3).
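        As a minimal sketch of the supervised case (our own illustration, not the course code): when the class labels are observed, steps 3-5 collapse into a single maximum-likelihood fit rather than an iteration. The Python below fits one multivariate Gaussian (mean vector and covariance matrix) per class and classifies by the largest log prior plus log likelihood; the data and names are made up.

            import numpy as np

            def fit(X, y):
                """Estimate prior, mean vector and covariance matrix for each class."""
                model = {}
                for c in np.unique(y):
                    Xc = X[y == c]
                    prior = len(Xc) / len(X)
                    mean = Xc.mean(axis=0)
                    # a small ridge on the diagonal keeps the covariance invertible
                    cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])
                    model[c] = (prior, mean, cov)
                return model

            def log_posterior(x, prior, mean, cov):
                """log prior + log multivariate Gaussian density at x (up to a constant)."""
                diff = x - mean
                return (np.log(prior)
                        - 0.5 * (np.log(np.linalg.det(cov)) + diff @ np.linalg.inv(cov) @ diff))

            def predict(model, x):
                return max(model, key=lambda c: log_posterior(x, *model[c]))

            # Tiny made-up example with two correlated attributes and two classes.
            X = np.array([[1.0, 2.1], [0.9, 1.9], [1.2, 2.3], [3.0, 0.9], [3.2, 1.1], [2.9, 0.8]])
            y = np.array(["A", "A", "A", "B", "B", "B"])
            print(predict(fit(X, y), np.array([1.1, 2.0])))   # expected: A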

        From a research perspective, the Gaussian may not be the only PDF that can be applied to the Bayes Classifier, although it has very strong theoretical support and nice properties. The general model of applying other PDFs should be the same. The estimation result depends heavily on whether, and how closely, a PDF can approximate the given dataset.

        Some commonly used PDFs are listed below (just to refresh our statistics):

       
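            For instance, three frequently used densities are:

                Gaussian:     f(x) = (1 / (sigma * sqrt(2*pi))) * exp(-(x - mu)^2 / (2*sigma^2))
                Uniform:      f(x) = 1 / (b - a)                  for a <= x <= b, and 0 otherwise
                Exponential:  f(x) = lambda * exp(-lambda * x)    for x >= 0, and 0 otherwise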

        [To Index]

 

    Bayesian Networks

        First of all, there is a nice introduction to Bayesian Networks and their Contemporary Applications by Daryle Niedermayer [web page]. It is hard for me to come up with a better one here, so the following is essentially a simpler version of Daryle's introduction.

        From the previous tutorial on the Gaussian Bayes Classifier, we noticed that the Gaussian model helps to integrate some correlation, which improves the classification performance compared with the Naïve model that assumes independence. However, using the Gaussian model with the Bayes Classifier still has limitations in modelling the correlations. This is where Bayesian Networks (Bayes nets) come in.

        A Bayes net is a model for utilizing the conditional probabilities among different variables. It is generally impossible to generate all conditional probabilities from a given dataset; our task is to pick the important ones and use them in the classification process. So essentially, a Bayes net is a set of "generalized probability density functions" (gpdfs).

        Informally, we can define a Bayes net as an augmented directed acyclic graph, represented by the vertex set V and directed edge set E. Each vertex in V represents an attribute, and each edge in E represents a correlation between two attributes. An example Bayes net over five attributes is shown in Figure 3.

        Figure 3. A Bayes net for 5 attributes.

        Some important observations here:

        Now, let's suffer through the formal mathematical definition.

        Consider a domain U of n variables, x1, ..., xn. Each variable may be discrete, having a finite or countable number of states, or continuous. Given a subset X of variables xi, where xi is an element of U, if one can observe the state of every variable in X, then this observation is called an instance of X and is denoted X = x for the observations x. The "joint space" of U is the set of all instances of U. p(X = x | Y = y, ξ) denotes the "generalized probability density" that X = x given Y = y for a person with current state information ξ. p(X | Y, ξ) then denotes the gpdf for X, given all possible observations of Y. The joint gpdf over U is the gpdf for U.

        A Bayesian network for domain U represents a joint gpdf over U. This representation consists of a set of local conditional gpdfs combined with a set of conditional independence assertions that allow the construction of a global gpdf from the local gpdfs. These values can then be ascertained as:

            p(x1, ..., xn | ξ) = ∏_{i=1..n} p(xi | Πi, ξ)                   (Equation 1)

        where Πi denotes the set of parent variables of xi.

        A Bayesian Network Structure then encodes the assertions of conditional independence in Equation 1 above. Essentially, a Bayesian Network Structure Bs is a directed acyclic graph such that (1) each variable in U corresponds to a node in Bs, and (2) the parents of the node corresponding to xi are the nodes corresponding to the variables in Πi.
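        As a concrete (made-up) illustration of Equation 1, the sketch below stores one conditional probability table per node and evaluates the joint probability of a full instance as the product of the local terms p(xi | Πi); the three-node network and its numbers are invented for illustration.

            # A hypothetical three-node network: Cloudy -> Rain -> WetGrass.
            # Each node stores its parents and a conditional probability table (CPT)
            # giving P(node = True | parent values); False takes the complement.
            network = {
                "Cloudy":   {"parents": [],         "cpt": {(): 0.5}},
                "Rain":     {"parents": ["Cloudy"], "cpt": {(True,): 0.8, (False,): 0.2}},
                "WetGrass": {"parents": ["Rain"],   "cpt": {(True,): 0.9, (False,): 0.1}},
            }

            def joint(instance):
                """Evaluate the joint probability of a full assignment via Equation 1."""
                prob = 1.0
                for name, node in network.items():
                    parent_vals = tuple(instance[p] for p in node["parents"])
                    p_true = node["cpt"][parent_vals]
                    prob *= p_true if instance[name] else (1.0 - p_true)
                return prob

            print(joint({"Cloudy": True, "Rain": True, "WetGrass": True}))   # 0.5 * 0.8 * 0.9 = 0.36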

        It is not hard to argue for the advantage of using the Bayes net model here, but a formal analysis requires probabilistic inference theory, which is briefly discussed in the next section. Readers not interested in the mathematical analysis can skip the next section to avoid confusion (which is sometimes not a bad thing).

        How to build a Bayes net? A general model is outlined below:

        There are many choices of how to select relevant variables, as well as how to estimate the conditional probabilities. If we imagine the network as a connection of Bayes classifiers, then the probability estimation can be done by applying a PDF such as the Gaussian. In some cases, the design of the network can be rather complicated. There are some efficient ways of extracting relevant variables from the dataset attributes: assuming the incoming signal is stochastic gives a nice way of extracting the signal attributes, and likelihood weighting is another way of selecting attributes.

        [To Index]


Applying Bayesian Approach on Datasets

    Dataset #1

            Attribute Selection results:

                    CfsSubsetEval with BestFirst (15 attributes)[txt]
                    InfoGainAttributeEval with Ranker (12 attributes scoring 0.1000 or above) [txt]
                    ClassifierSubsetEval (NaiveBayes) with BestFirst (11 attributes) [txt]
                    ClassifierSubsetEval (BayesNet) with BestFirst [txt]

            Preliminary Mining results (selected):

                    All attributes with NaiveBayes (82% training) [txt] - 80%
                    15 attributes (CfsSubsetEval) with AODE (82% training) [txt] - 84%
                    12 attributes (InfoGainAttributeEval) with BayesNets (82% training) [txt] - 84%
                    11 attributes (ClassifierSubsetEval-NaiveBayes) with BayesNets (65% training) [txt] - 87.7551%
                    12 attributes (ClassifierSubsetEval-BayesNet) with BayesNets (78% training) [txt] - 87.0968%

            Record "optimal" result:

                    12 attributes (InfoGainAttributeEval) with AODE (82% training) [txt] - 92%

            Attribute subsets:

                    CfsSubsetEval -------------------------- {10, 11, 14, 24, 25, 27, 65, 71, 74, 83, 88, 92, 107, 115, 119}
                    InfoGainAttributeEval ------------------- {11, 15, 19, 24, 27, 29, 71, 81, 88, 92, 115, 119}
                    ClassifierSubsetEval(NaiveBayes) -------- {6, 9, 10, 19, 24, 49, 59, 65, 72, 100, 115}
                    ClassifierSubsetEval(BayesNet) ---------- {7, 10, 28, 37, 46, 49, 57, 64, 81, 88, 91, 115}

                What are the "important" attributes?

           *Please note that the attribute order is different from the original .csv file. We removed the 6 attributes and discretized all attributes. Please use the file here to see the order.
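           As a rough illustration of what InfoGainAttributeEval computes (a sketch of the idea, not Weka's implementation), the information gain of a discretized attribute is the drop in class entropy after splitting on that attribute:

               import math
               from collections import Counter

               def entropy(labels):
                   n = len(labels)
                   return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

               def info_gain(values, labels):
                   """Class entropy minus the value-weighted entropy after splitting on the attribute."""
                   n = len(labels)
                   after = 0.0
                   for v in set(values):
                       subset = [l for x, l in zip(values, labels) if x == v]
                       after += len(subset) / n * entropy(subset)
                   return entropy(labels) - after

               # Made-up example: attribute a1 predicts the class perfectly, a2 does not.
               a1     = ["low", "low", "high", "high"]
               a2     = ["x",   "y",   "x",    "y"]
               labels = ["no",  "no",  "yes",  "yes"]
               print(info_gain(a1, labels), info_gain(a2, labels))   # 1.0 0.0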

 

        We mainly look at the Bayes net in our final mining process, since the Bayes network demonstrates some statistical properties, such as independence and inference, which may help us select attributes. Some Bayes networks built by Weka using the attribute sets from the preliminary analysis are shown below.

InfoGainAttributeEval Set (12)

ClassifierSubsetEval(NaiveBayes) (11)

ClassifierSubsetEval(BayesNet) (12)

        The attribute set from InfoGainAttributeEval shows more correlations in the generated network. This may or may not be good news for the Bayes net, since some correlations may distract the classification process. The attribute sets from both ClassifierSubsetEval(NaiveBayes) and ClassifierSubsetEval(BayesNet) show some independences between the "target" attribute and other attributes. We consider removing the attributes that are independent of "target", since they are to some degree irrelevant to our classification. We further tried combining the "target"-correlated attributes of the two sets, and got a set of 8 attributes as listed below.

    Combined Attribute Set: {AQR-B, AQR-Mg, AQR-Pb, CHX-Fe, SDA-Cr, SDA-Na, SDA-Se, SPF-Mg}

        The Bayes network constructed by Weka is shown in the figure below, and the classification result is here [txt].

       

        The training and testing split is at 66%, and we can see from the network that all attributes are correlated. Some discussion of the results is given in the first dataset's mining result slides.

        Finally, here is a nice description of Weka's Bayes net support by the developer, which just came out this September. [pdf]

        We plotted some attributes against the AMGN attribute, and they do appear to form some clusters. However, we have not figured out how to use this to improve our mining result. [pdf] (plot)

        [To Index]

 

    Dataset #2

            Microarray Data Cleaning

        The cleaned training data can be downloaded here [csv]. Some internet tips suggest removing the initial records whose Gene Description contains "control)" (those are Affymetrix controls, not human genes); this has not been applied in our cleaning steps.

        Attribute Selection / Feature Reduction

        There are in total 7070 genes (attributes) in the dataset, so it is critical to do attribute selection. We take the standard Signal-to-Noise (S2N) ratio and the T-value to get a significant attribute set. (Note that it is impractical to run any attribute selection algorithm in Weka, since the memory requirement would be huge.)

            Let Avg1, Avg2 be the average expression values of a gene over the ALL and AML samples, respectively.
            Let Stdev1, Stdev2 be the corresponding sample standard deviations.
                S2N = (Avg1 - Avg2)/(Stdev1 + Stdev2)
                T-value = (Avg1 - Avg2)/sqrt(Stdev1*Stdev1/N1+Stdev2*Stdev2/N2)
            Where N1 is the number of ALL observations, and N2 is the number of AML observations.

            T-value
                The T-value is the observed value of the T-statistic used to test whether a gene's expression differs between (i.e., is correlated with) the two classes. The T-value can range between -infinity and +infinity. A T-value near 0 is evidence for the null hypothesis that there is no difference between the classes. A T-value far from 0 (either positive or negative) is evidence that the gene is expressed differently in ALL and AML samples.
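            A small sketch of the ranking step (our own illustration; the per-gene ALL/AML layout is an assumption about how the data is organized), computing S2N and T-value per gene and keeping the top-scoring genes:

                import statistics

                def s2n(all_vals, aml_vals):
                    return ((statistics.mean(all_vals) - statistics.mean(aml_vals))
                            / (statistics.stdev(all_vals) + statistics.stdev(aml_vals)))

                def t_value(all_vals, aml_vals):
                    n1, n2 = len(all_vals), len(aml_vals)
                    se = (statistics.variance(all_vals) / n1 + statistics.variance(aml_vals) / n2) ** 0.5
                    return (statistics.mean(all_vals) - statistics.mean(aml_vals)) / se

                def top_genes(expressions, k=50, score=s2n):
                    """expressions maps gene name -> (ALL values, AML values); return the top-k genes by score."""
                    ranked = sorted(expressions, key=lambda g: score(*expressions[g]), reverse=True)
                    return ranked[:k]

                # Made-up toy data with two "genes" and four samples per class.
                data = {
                    "gene_A": ([5.1, 4.9, 5.3, 5.0], [1.0, 1.2, 0.9, 1.1]),
                    "gene_B": ([2.0, 2.2, 1.9, 2.1], [2.1, 1.9, 2.0, 2.2]),
                }
                print(top_genes(data, k=1))                  # ['gene_A']
                print(top_genes(data, k=1, score=t_value))   # ['gene_A']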

        Top 50 genes with the highest S2N ratio:

Rank  Gene Name           S2N       T-value       Rank  Gene Name           S2N       T-value
   1  M55150_at           1.467641  8.091951        26  L08246_at           1.034784  4.506267
   2  X95735_at           1.444531  5.727643        27  X74262_at           1.027773  6.016389
   3  U50136_rna1_at      1.421708  6.435952        28  M62762_at           1.023393  5.122153
   4  U22376_cds2_s_at    1.339308  7.9043          29  M31211_s_at         1.022881  6.294207
   5  M81933_at           1.204042  6.164965        30  M28130_rna1_s_at    1.001286  3.911006
   6  M16038_at           1.203218  4.930437        31  D26156_s_at         0.989098  6.097113
   7  M84526_at           1.20142   4.057042        32  M63138_at           0.983628  4.048325
   8  M23197_at           1.195974  4.737778        33  M31523_at           0.971228  5.677543
   9  U82759_at           1.192556  6.24302         34  M57710_at           0.967686  3.69931
  10  Y12670_at           1.184737  5.324928        35  X15949_at           0.96009   5.389656
  11  D49950_at           1.143704  5.588063        36  M69043_at           0.95886   4.708053
  12  M27891_at           1.133427  3.986204        37  S50223_at           0.956733  5.858626
  13  X59417_at           1.124637  6.803106        38  U32944_at           0.954875  5.442992
  14  X52142_at           1.122589  5.833144        39  M81695_s_at         0.953066  4.195556
  15  M28170_at           1.116756  6.253971        40  L47738_at           0.95138   5.715039
  16  X17042_at           1.105975  5.388623        41  M83652_s_at         0.947504  3.721478
  17  U05259_rna1_at      1.103966  6.175126        42  X85116_rna1_s_at    0.946372  3.923369
  18  Y00787_s_at         1.081995  4.701085        43  M11147_at           0.945755  5.588645
  19  M96326_rna1_at      1.07719   3.869518        44  Z15115_at           0.9452    5.615471
  20  U12471_cds1_at      1.069731  6.146299        45  M21551_rna1_at      0.941981  5.377292
  21  U46751_at           1.064078  4.127982        46  M19045_f_at         0.938076  3.834549
  22  M80254_at           1.044395  4.271131        47  X04085_rna1_at      0.930499  4.931191
  23  M92287_at           1.043056  6.217365        48  L49229_f_at         0.920745  5.398203
  24  L13278_at           1.042032  6.021342        49  X14008_rna1_f_at    0.914954  3.655554
  25  U09087_s_at         1.036257  6.182425        50  M91432_at           0.913521  5.300882

                                The S2N training dataset (50 top genes) can be downloaded here [csv] (transposed).
                                The S2N testing dataset (50 top genes) can be downloaded here [csv] (transposed).

        Top 50 genes with the highest T-value:

Rank  Gene Name           S2N       T-value       Rank  Gene Name           S2N       T-value
   1  M55150_at           1.467641  8.091951        26  U32944_at           0.954875  5.442992
   2  U22376_cds2_s_at    1.339308  7.9043          27  U26266_s_at         0.887813  5.425205
   3  X59417_at           1.124637  6.803106        28  J05243_at           0.886777  5.406254
   4  U50136_rna1_at      1.421708  6.435952        29  L49229_f_at         0.920745  5.398203
   5  M31211_s_at         1.022881  6.294207        30  X15949_at           0.96009   5.389656
   6  M28170_at           1.116756  6.253971        31  X17042_at           1.105975  5.388623
   7  U82759_at           1.192556  6.24302         32  M21551_rna1_at      0.941981  5.377292
   8  M92287_at           1.043056  6.217365        33  M31303_rna1_at      0.875318  5.370571
   9  U09087_s_at         1.036257  6.182425        34  Y08612_at           0.877482  5.366653
  10  U05259_rna1_at      1.103966  6.175126        35  U20998_at           0.896103  5.361239
  11  M81933_at           1.204042  6.164965        36  AF012024_s_at       0.873658  5.333337
  12  U12471_cds1_at      1.069731  6.146299        37  X56411_rna1_at      0.870983  5.327392
  13  D26156_s_at         0.989098  6.097113        38  Y12670_at           1.184737  5.324928
  14  L13278_at           1.042032  6.021342        39  U29175_at           0.908033  5.322716
  15  X74262_at           1.027773  6.016389        40  M91432_at           0.913521  5.300882
  16  S50223_at           0.956733  5.858626        41  HG1612-HT1612_at    0.856338  5.276869
  17  X52142_at           1.122589  5.833144        42  M13792_at           0.866312  5.222356
  18  X95735_at           1.444531  5.727643        43  D63874_at           0.844716  5.206861
  19  L47738_at           0.95138   5.715039        44  U72342_at           0.848128  5.206761
  20  M31523_at           0.971228  5.677543        45  X97267_rna1_s_at    0.844922  5.203003
  21  Z15115_at           0.9452    5.615471        46  X76648_at           0.861562  5.195203
  22  M11147_at           0.945755  5.588645        47  U35451_at           0.842047  5.144391
  23  D49950_at           1.143704  5.588063        48  Z69881_at           0.896518  5.143816
  24  X63469_at           0.905744  5.583296        49  D63880_at           0.833577  5.125102
  25  D38073_at           0.902946  5.562983        50  M62762_at           1.023393  5.122153

                                The T-value training dataset (50 top genes) can be downloaded here [csv] (transposed).
                                The T-value testing dataset (50 top genes) can be downloaded here [csv] (transposed).

        Common genes from the above two sets:

            31 attributes in the common set: {D26156_s_at, D49950_at, L13278_at, L47738_at, L49229_f_at, M11147_at, M21551_rna1_at, M28170_at, M31211_s_at, M31523_at, M55150_at, M62762_at, M81933_at, M91432_at, M92287_at, S50223_at, U05259_rna1_at, U09087_s_at, U12471_cds1_at, U22376_cds2_s_at, U32944_at, U50136_rna1_at, U82759_at, X15949_at, X17042_at, X52142_at, X59417_at, X74262_at, X95735_at, Y12670_at, Z15115_at}

            Common set attributes can be visualized here.

            From a statistical perspective, the common attribute set contains those attributes that have a high discriminating ability to distinguish ALL from AML leukemia.
                                The Common training dataset (31 common genes) can be downloaded here [csv] (transposed).
                                The Common testing dataset (31 common genes) can be downloaded here [csv] (transposed).

    Preliminary result with NaiveBayes and BayesNet

ID   NB(S2N)  BN(S2N)  NB(T)  BN(T)  NB(C)  BN(C)
39   ALL      ALL      ALL    ALL    ALL    ALL
40   ALL      ALL      ALL    ALL    ALL    ALL
42   ALL      ALL      ALL    ALL    ALL    ALL
47   ALL      ALL      ALL    ALL    ALL    ALL
48   ALL      ALL      ALL    ALL    ALL    ALL
49   ALL      ALL      ALL    ALL    ALL    ALL
41   ALL      ALL      ALL    ALL    ALL    ALL
43   ALL      ALL      ALL    ALL    ALL    ALL
44   ALL      ALL      ALL    ALL    ALL    ALL
45   ALL      ALL      ALL    ALL    ALL    ALL
46   ALL      ALL      ALL    ALL    ALL    ALL
70   ALL      ALL      ALL    ALL    ALL    ALL
71   ALL      ALL      ALL    ALL    ALL    ALL
72   ALL      ALL      ALL    ALL    ALL    ALL
68   ALL      ALL      ALL    ALL    ALL    ALL
69   ALL      ALL      ALL    ALL    ALL    ALL
67   AML      AML      ALL    ALL    AML    AML
55   ALL      ALL      ALL    ALL    ALL    ALL
56   ALL      ALL      ALL    ALL    ALL    ALL
59   ALL      ALL      ALL    ALL    ALL    ALL
52   AML      AML      AML    AML    AML    AML
53   AML      AML      AML    AML    AML    AML
51   AML      AML      ALL    AML    ALL    AML
50   AML      AML      AML    AML    AML    AML
54   AML      ALL      ALL    ALL    ALL    ALL
57   AML      AML      AML    AML    AML    AML
58   AML      AML      AML    AML    AML    AML
60   AML      ALL      ALL    ALL    ALL    ALL
61   AML      ALL      AML    ALL    AML    ALL
65   AML      AML      AML    AML    AML    AML
66   ALL      ALL      ALL    ALL    ALL    ALL
63   AML      AML      ALL    AML    AML    AML
64   AML      AML      ALL    AML    AML    AML
62   AML      AML      ALL    ALL    ALL    ALL

Total ALL  20       23       27     25     24     24
Total AML  14       11        7      9     10     10

                * NB: NaiveBayes; BN: BayesNet.
                * S2N uses the top 50 S2N-ratio gene set; T uses the top 50 T-value gene set; C uses the common 31-gene set from the previous two.
                * ID field uses original sample sequence number.
                * Red highlighting (in the original table) marks samples that are classified differently depending on the attribute set or technique used.

 

        Given the limitations of Bayes methods on this dataset, we do not try to push the mining further in the hope of a "better" result. Some major problems with the Bayes methods here are listed below:

        Nevertheless, we still put some effort into selecting a better attribute/gene set, which can be found in our final results [pdf].

        [To Index]

 

    Dataset #3

       Statistical and Syntactic (preamble):

        We propose to use rule-based systems to mine this dataset, in addition to some Bayes methods. The interesting point to examine here is how these two techniques work and compare. Clearly, the Bayes approach is based on a statistical model built from the dataset, while a rule-based system is a syntactic approach, in some sense more like our own thinking process. Theoretically, statistical techniques have well-founded mathematical support and are thus usually computationally inexpensive to apply. On the other hand, syntactic techniques give nice structural descriptions/rules and are thus simple to understand and validate.

        Regardless of the above remarks, for this dataset from the Sloan Digital Sky Survey, release 3, intuitive observation really cannot tell us which technique will perform better than the other. We would like to try both techniques on the dataset and compare them based on the mining outcomes.

        Data Preprocessing:

            We add a top row with descriptions of the columns. The last column is recoded in two ways to make a two-class list and a three-class list.

            Two-class: "8" -> "1" (i.e. class 1) and "4" and "7" -> "2" (i.e. class 2).

            Three-class: "8" -> "1" (i.e. class 1), "7" -> "2" (i.e. class 2), and "4" -> "3" (i.e. class 3).
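            A small sketch of this recoding (pandas and the column name "class" are our assumptions, not the original preprocessing script):

                import pandas as pd

                def two_class(df):
                    out = df.copy()
                    out["class"] = out["class"].map({8: 1, 4: 2, 7: 2})
                    return out

                def three_class(df):
                    out = df.copy()
                    out["class"] = out["class"].map({8: 1, 7: 2, 4: 3})
                    return out

                df = pd.DataFrame({"attr_10": [0.1, 1.5, 2.0], "class": [8, 7, 4]})
                print(two_class(df)["class"].tolist())    # [1, 2, 2]
                print(three_class(df)["class"].tolist())  # [1, 2, 3]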

        Preliminary Analysis:

            The two-class case seems very optimistic if we take a look at the graph of attributes plotted with respect to the "class" below:

        

            Attributes 7, 8, 9, and 10 separate the two classes completely, which means any one of these attributes can be used to give a promising mining outcome. It is thus of little further interest to discuss here.

            The three-class case is unfortunately and fortunately not as simple as the two-class case. The plots of attributes demonstrate the difficulties.

       

            The main problem is classes "2" and "3" (i.e. red and light blue in the graph), since hardly any attribute can clearly separate them. Above all, attribute 10 is probably the most useful one in the three-class case.

        Some Results:

            The two-class case is trivial; both techniques achieve 100% correctness. This can be seen clearly from the rules output by PRISM and DecisionTable (i.e. the rule-based systems):

            Rules:
            ================================
            attr_7 class
            ================================
            '(-inf-3.047]' 1
            '(3.047-inf)' 2
            ================================

            Note that Weka probably runs the algorithm sequentially through all attributes, so attribute 7 is picked up first.

            The three-class case can be directly mined by the two techniques (attributes need to be discretized when using PRISM). The confusion matrices are listed in the table:

                 BayesNet - 95.2497%        DecisionTable - 98.9521%    PRISM - 98.7426%
  Class =>       a      b      c            a      b      c             a      b      c
  a = 1        330      0      0          330      0      0           330      0      0
  b = 2          0   2328    126            0   2438     16             0   2437     11
  c = 3          0     10     69            0     14     65             0     17     60
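            The reported accuracies are just the diagonal sums divided by the total number of test instances (2863, the sum of the row totals above): BayesNet gives (330 + 2328 + 69) / 2863 ≈ 95.25%, DecisionTable gives (330 + 2438 + 65) / 2863 ≈ 98.95%, and PRISM gives (330 + 2437 + 60) / 2863 ≈ 98.74%, even though PRISM apparently leaves a few instances unclassified (its rows sum to 2855 rather than 2863).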

            DecisionTable uses the attribute set {1, 4, 7, 10}. PRISM uses all attributes, but ranks their importance based on information gain.

            The information gain ranking is: 10, 9, 7, 8, 6, 4, 5, 3, 2, 1 (based on attribute sequence).

            Other results are available in the Preliminary Analysis report [pdf].

 

            Visualization through Clustering

                Some interesting information can be gained by clustering the dataset without predefined knowledge (i.e. without the class). The following plots show different clustering results from applying the k-means algorithm in Weka.


                Class 1 is generally clustered in red, which is a very pleasant case. Classes 2 and 3 are confused in the cluster plots.

                The parallel plot and scatter plot are also very interesting to see here. Both of them demonstrate some important properties of this dataset.

            Parallel Plot

            Scatter Plot

        [To Index]

 

     Dataset (Additional set with missing values from dataset #3)

            Dataset Recovery

            The most intuitive way to deal with the missing values is simply to ignore them. The plots of the attributes, ignoring the missing values, show distributions very similar to those of the previous dataset.

       

            We also try to recover the missing values using class means. The following table lists the means of the 10 attributes for each class (i.e., "4, 7, or 8"), omitting the missing cells.

Class/Mean attr_1 attr_2 attr_3 attr_4 attr_5 attr_6 attr_7 attr_8 attr_9 attr_10 Total Sample
4 -0.098 1.317 4.051 5.958 1.475 4.147 6.047 2.657 4.542 1.888 256
7 1.498 3.081 5.198 6.370 1.594 3.701 4.871 2.108 3.277 1.170 7186
8 0.559 0.478 0.325 0.150 -0.060 -0.200 -0.367 -0.145 -0.314 -0.169 976

                The process is then straightforward: each missing value is replaced with the mean value for the corresponding class and attribute.
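                A small sketch of this replacement (pandas and the column name "class" are assumptions on our part; missing cells are NaN):

                    import numpy as np
                    import pandas as pd

                    def impute_class_means(df, class_col="class"):
                        """Replace each missing cell with the mean of its attribute within its class."""
                        out = df.copy()
                        attrs = [c for c in out.columns if c != class_col]
                        out[attrs] = out.groupby(class_col)[attrs].transform(lambda col: col.fillna(col.mean()))
                        return out

                    df = pd.DataFrame({
                        "attr_1": [1.0, np.nan, 3.0, 10.0, np.nan],
                        "class":  [4,   4,      4,   7,    7],
                    })
                    print(impute_class_means(df)["attr_1"].tolist())   # [1.0, 2.0, 3.0, 10.0, 10.0]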

                The plots of the attributes after recovery are given below:

       

                The parallel plot gives basically the same graph.

        Parallel Plot

                Note that there is a difference caused by sorting with respect to the class number.

        General Result [pdf]

                Recovering the dataset with class means in general helps the Bayes methods to classify the dwarfs, but is somewhat misleading for the rule-based systems.

                The standard deviation table for each attribute under the two preprocessing methods somewhat supports this point.

Method Class attr_1 attr_2 attr_3 attr_4 attr_5 attr_6 attr_7 attr_8 attr_9 attr_10
Ignore 4 1.216 1.250 1.164 1.068 0.773 0.707 0.717 0.446 0.568 0.374
Replace 4 1.216 1.188 1.105 0.999 0.734 0.692 0.699 0.439 0.553 0.365
Ignore 7 1.082 1.120 1.048 1.020 0.278 0.333 0.431 0.267 0.398 0.158
Replace 7 1.081 1.056 0.994 0.975 0.267 0.321 0.418 0.260 0.388 0.154
Ignore 8 0.699 1.151 1.312 1.493 0.457 0.657 0.850 0.201 0.408 0.240
Replace 8 0.699 1.114 1.294 1.475 0.453 0.652 0.845 0.200 0.405 0.238

 

        [To Index]


 

Mining Software

    Weka

        Weka Tutorial (Ulrike Sattler's introduction) - It is the best tutorial I could find online, but it is still somewhat limited in usefulness.

        Weka Explorer Guide (University of Waikato) - I downloaded it from somewhere. It is not for the newest version, but still usable.

        Weka Java Class Document (University of Waikato) - Java style documents for every class in the .jar of the package.

    MatLab

        MatLab Introduction Tutorial (UF Mathematics) - I commonly refer to it to find commands.

        The Pattern Classification Toolbox - A nice toolbox for visualizing Bayes classifiers. [Computer Manual in MATLAB to accompany Pattern Classification (by David G. Stork and Elad Yom-Tov)]

 

        [To Index]