CISC333 Data Mining

CISC 333 Weka Tutorial - Part 1

Introduction

This is a tutorial for those who are not familiar with Weka, the data mining package we'll be using in CISC 333, which was built at the University of Waikato in New Zealand. Weka is an open source collection of data mining tools which you can use in a number of different ways. It comes with a Graphical User Interface (GUI), but can also be called from your own Java code. You can even write your own batch files for tasks that you need to execute more than once, perhaps with slightly different parameters each time.

For now, we'll start with the GUI. Weka is installed on all the machines in Caslab (check), but you can also download it to use at home. There are a number of different versions, depending on which operating system you're using. The downloads and installation instructions can be found at Weka's homepage.

First Steps

Start up Weka. If you installed it on your own machine, it's listed under All Programs; in Caslab, it can be found under databases. Note that there is also a tutorial available here. You'll have a choice between the command line interface (CLI), the Experimenter, the Explorer and the Knowledge Flow. Initially, we'll stick with the Explorer. Once you click on that, you'll see the (initially empty) main GUI.

You now have a number of choices, but before you can work with any data, you'll have to load it into Weka. For now, we'll use one of the datasets that are included, but later on you'll have to get any file you want to use into the right format (more about that below). Open a file from the data subfolder, for example the Iris data, to find the following screen (the default should be the Preprocess tab).

You'll notice that Weka now provides some information about the data, for example the number of instances, the number of attributes, and some statistical information about the attributes, one at a time. Figure out how to switch between the attributes for which this statistical information is displayed.
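To make the per-attribute summary more concrete, here is a minimal Python sketch (not Weka's own code) that computes the kind of information described above: the number of instances, the number of attributes, and a few statistics for one selected numeric attribute. The five rows and the helper name `summarize` are made up for illustration, and the choice of statistics (minimum, maximum, mean, standard deviation) is an assumption about what is worth looking at, not a statement of exactly what Weka displays.

```python
import statistics

# Hypothetical mini data set: four numeric attributes plus a class label.
# These rows are illustrative only; they are not read from Weka's Iris file.
ATTRIBUTES = ["sepallength", "sepalwidth", "petallength", "petalwidth", "class"]
ROWS = [
    (5.0, 3.5, 1.5, 0.2, "Iris-setosa"),
    (4.5, 3.0, 1.5, 0.2, "Iris-setosa"),
    (7.0, 3.0, 4.5, 1.5, "Iris-versicolor"),
    (6.5, 3.0, 4.5, 1.5, "Iris-versicolor"),
    (6.0, 3.5, 6.0, 2.5, "Iris-virginica"),
]


def summarize(rows, attributes, name):
    """Return summary statistics for one numeric attribute, selected by name."""
    index = attributes.index(name)
    values = [row[index] for row in rows]
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


if __name__ == "__main__":
    print("instances:", len(ROWS))
    print("attributes:", len(ATTRIBUTES))
    # Look at the numeric attributes one at a time (the last one is the class).
    for name in ATTRIBUTES[:-1]:
        print(name, summarize(ROWS, ATTRIBUTES, name))
```

Switching the `name` argument plays the same role as clicking on a different attribute in the GUI: the data stays the same, only the attribute being summarized changes.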
Visualization

There are a number of ways in which you can use Weka to visualize your data. The main GUI shows a histogram of the distribution of a single selected attribute at a time; by default, this is the class attribute. Note that the individual colors indicate the individual classes (the Iris dataset has 3). If you move the mouse over the histogram, it will show you the ranges and how many samples fall in each range. The button VISUALIZE ALL will bring up a screen showing all distributions at once, as in the picture below. Take some time to look at the image and what it tells you about the attributes.

There is also a tab called VISUALIZE. Clicking on that will open the scatterplots for all attribute pairs:

From these scatterplots, we can infer a number of interesting things. For example, in the picture above we can see that in some plots the clusters (for now, think of clusters as collections of points that are physically close to each other on the screen) and the different colors correspond to each other, for example in the plots for class/(any attribute) pairs and the petalwidth/petallength attribute pair, whereas for other pairs (sepalwidth/sepallength, for example) it's much harder to separate the clusters by color.

By default, the colors indicate the different classes; in this case we used red and two shades of blue. Left clicking on any of the highlighted class names towards the bottom of the screenshot allows you to set your own color for the classes. Also, by default, the color is used in conjunction with the class attribute, but it can be useful to color by the other attributes as well.
For example, changing the color to the fourth attribute by clicking on the arrow next to the bar that currently reads Color: class (Nom) and selecting petalwidth enables us to observe even more about the data, for example, in the class/sepallength plot, which range of petalwidth values (indicated by different colors) tends to go along with which class.

Filters

There are also a number of filters available, which apply different criteria to select either objects (the rows in your data matrix) or attributes (the columns in your data matrix). This allows you to discard parts of your matrix without having to manipulate your original data file (which is a bad idea in any case, and time consuming). For example, you can look at subsets of attributes, discard the first 20 rows, normalize or discretize attributes, and so on.

To apply a filter, you first have to select which type of filter you'd like by clicking on the CHOOSE button right underneath Filter in your main GUI. Double clicking on the FILTER folder that appears will expand the window to show two folders named supervised and unsupervised, both of which you can expand again (you can collapse a folder by clicking on the minus sign which appears to the left of it once it's expanded). Both unsupervised and supervised filters can be applied to objects and attributes.

Once you have chosen a filter, the selected option will show up in the bar next to FILTER, but at this stage nothing has happened to your data yet. You then have to press APPLY to actually filter your data. There is also a SAVE button which allows you to save any changes you made to your data. Make sure you don't overwrite your original data file!

At this point, note that whatever shows up next to Filter is the actual Java command you'd need to apply the filter to your data through the command line interface, which we'll get to later.

The log file

The log file is used to keep track of what you did.
Clicking on LOG in your main GUI will bring up another window which shows exactly what you did; in this case, it shows that we loaded the Iris data set and applied a filter.

Note: to get a screenshot of the active window, press ALT + PrtSc. You can then paste it from the clipboard into another application, for example Paint.

Making your own arff file

To use Weka with a data set that isn't already included in the package, the data you want to examine needs to be in a certain format. If you open the Iris data using a text editor, for example, you'll notice that in addition to the data, it contains information about the data. Part of the Iris data file is shown below.

Comments can be included via the percentage sign to give additional information about the data. If you make up your own data, this allows you to make sure that you'll always know which data you're dealing with, for example where you found it and what it represents. You can also include some summary statistics, as in the example above. More importantly, you'll notice that there are a header and a data section, including certain keywords such as @ATTRIBUTE, @DATA and @RELATION. If you make up the data file yourself, you need to make sure that

the data is comma separated, with the class as the last attribute
the data section is prefaced by @DATA
the header section is prefaced by @RELATION
each attribute is indicated by @ATTRIBUTE

Each attribute is described by its name and the type of values it is made of. The different types supported by Weka are: Numeric (REAL or INTEGER), nominal (lists of possible values, as in the case of the class attribute), String and Date. Missing values are indicated with a question mark.
For detailed information about Weka's ARFF file format, go to Additional information about Weka's data format ARFF.

Note also that the ARFF file needs to be saved with the extension .arff.

Go to Tutorial Part 2

Additional Resources

Cisc 333 Home Page
The Weka homepage
Weka's User Guide
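As an illustration of the ARFF rules listed under "Making your own arff file", here is a minimal Python sketch that builds the text of a small ARFF file: a comment line, the @RELATION header, one @ATTRIBUTE line per attribute, and a comma-separated @DATA section with the class last and a question mark for a missing value. The function name `to_arff`, the relation name `weather` and the two data rows are made up for illustration; the sketch does no quoting, so it assumes names and values without spaces or commas.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text from a relation name, attribute specs and data rows.

    attributes: list of (name, type) pairs; type is either a string such as
    "REAL" or a list of allowed values for a nominal attribute.
    rows: list of tuples in attribute order; None marks a missing value.
    """
    lines = [
        "% Hypothetical example data, written by a tutorial sketch",
        f"@RELATION {relation}",
        "",
    ]
    for name, kind in attributes:
        if isinstance(kind, list):
            # Nominal attribute: list the possible values in braces.
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@ATTRIBUTE {name} {kind}")
    lines += ["", "@DATA"]
    for row in rows:
        # Missing values are written as a question mark.
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = to_arff(
        "weather",
        [("temperature", "REAL"), ("humidity", "REAL"), ("play", ["yes", "no"])],
        [(20.5, 65.0, "yes"), (30.0, None, "no")],
    )
    print(text)
    # To load the result in Weka, save it with the extension .arff.
```

Generating the file from a script rather than editing it by hand also keeps your original data untouched, in line with the advice in the Filters section.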
