CISC 333 Weka Tutorial - Part 1
Introduction

This is a tutorial for those who are not familiar with Weka, the data mining package we'll be using in CISC 333, which was built at the University of Waikato in New Zealand. Weka is an open source collection of tools for data mining tasks, which you can use in a number of different ways. It comes with a Graphical User Interface (GUI), but can also be called from your own Java code. You can even write your own batch files for tasks that you need to execute more than once, maybe with slightly different parameters each time.

For now, we'll start with the GUI. Weka is installed on all the machines in Caslab (check), but you can also download it to use at home. There are a number of different versions, depending on which operating system you're using. The downloads and installation instructions can be found at Weka's homepage.

First Steps

Start up Weka. If you install it on your own machine, it's listed under All Programs; in Caslab, it can be found under databases. Note that there is also a tutorial available here. You'll have a choice between the command line interface (CLI), the Experimenter, the Explorer and the Knowledge Flow. Initially, we'll stick with the Explorer. Once you click on that, you'll see the (initially empty) main GUI.
You'll notice that Weka now provides some information about the data, such as the number of instances, the number of attributes, and some statistical information about the attributes, one at a time. Figure out how to switch between the attributes for which this statistical information is displayed.
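To get a feel for the kind of summary the main GUI displays, here is a minimal Python sketch that computes similar figures by hand. It is an illustration only, not how Weka computes them: the function name `summarize` and the six-row sample (rows taken from the Iris data) are assumptions made for this example.

```python
import statistics

# Six rows in the layout of the Iris data:
# sepallength, sepalwidth, petallength, petalwidth, class
ATTRIBUTES = ["sepallength", "sepalwidth", "petallength", "petalwidth", "class"]
ROWS = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
    (6.4, 3.2, 4.5, 1.5, "Iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
    (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
]


def summarize(rows, attributes):
    """Return instance count, attribute count, and one summary per attribute."""
    summary = {
        "instances": len(rows),
        "attributes": len(attributes),
        "per_attribute": {},
    }
    for index, name in enumerate(attributes):
        values = [row[index] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric attribute: minimum, maximum, mean, standard deviation.
            summary["per_attribute"][name] = {
                "min": min(values),
                "max": max(values),
                "mean": statistics.mean(values),
                "stdev": statistics.stdev(values),
            }
        else:
            # Nominal attribute: number of instances per label.
            counts = {}
            for v in values:
                counts[v] = counts.get(v, 0) + 1
            summary["per_attribute"][name] = counts
    return summary


summary = summarize(ROWS, ATTRIBUTES)
print(summary["instances"], "instances,", summary["attributes"], "attributes")
for name, stats in summary["per_attribute"].items():
    print(name, stats)
```

Numeric attributes get a minimum, maximum, mean and standard deviation, while the nominal class attribute gets a count per label, which mirrors the one-attribute-at-a-time view described above.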
Visualization

There are a number of ways in which you can use Weka to visualize your data. The main GUI shows a histogram of the distribution of a single selected attribute at a time; by default this is the class attribute. Note that the individual colors indicate the individual classes (the Iris dataset has 3). If you move the mouse over the histogram, it will show you the ranges and how many samples fall in each range. The button VISUALIZE ALL brings up a screen showing all distributions at once, as in the picture below. Take some time to look at the image and what it tells you about the attributes.

There is also a tab called VISUALIZE. Clicking on that will open the scatterplots for all attribute pairs:
From these scatterplots, we can infer a number of interesting things. For example, in the picture above we can see that for some attribute pairs the clusters (for now, think of clusters as collections of points that are physically close to each other on the screen) and the different colors correspond to each other, for example in the plots for class/(any attribute) pairs and for the petalwidth/petallength attribute pair, whereas for other pairs (sepalwidth/sepallength, for example) it's much harder to separate the clusters by color.
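The idea that some attributes separate the classes better than others can also be checked numerically. The sketch below is a hypothetical helper, not part of Weka: it computes the range of values each class takes for one attribute and reports whether any two classes overlap. It uses six rows taken from the Iris data, so the result holds for this small sample only and may differ on the full data set.

```python
# Six rows in the layout of the Iris data:
# sepallength, sepalwidth, petallength, petalwidth, class
ROWS = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
    (6.4, 3.2, 4.5, 1.5, "Iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
    (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
]
SEPALWIDTH, PETALLENGTH = 1, 2  # column indices


def class_ranges(rows, column):
    """Map each class label to the (min, max) of the chosen column."""
    ranges = {}
    for row in rows:
        label, value = row[-1], row[column]
        low, high = ranges.get(label, (value, value))
        ranges[label] = (min(low, value), max(high, value))
    return ranges


def ranges_overlap(ranges):
    """True if any two classes share part of their value range."""
    # After sorting by lower bound, checking neighbours is sufficient.
    spans = sorted(ranges.values())
    return any(
        prev_high >= next_low
        for (_, prev_high), (next_low, _) in zip(spans, spans[1:])
    )


for name, column in [("petallength", PETALLENGTH), ("sepalwidth", SEPALWIDTH)]:
    ranges = class_ranges(ROWS, column)
    print(name, ranges, "overlap:", ranges_overlap(ranges))
```

On this sample the petallength ranges of the three classes do not overlap, while the sepalwidth ranges do, which matches what the scatterplots suggest visually.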
By default, the colors indicate the different classes; in this case we used red and two shades of blue. Left-clicking on any of the highlighted class names towards the bottom of the screenshot allows you to set your own color for the classes. Also by default, the color is tied to the class attribute, but it can be useful to color by the other attributes as well. For example, changing the color to the fourth attribute, by clicking on the arrow next to the bar that currently reads Color: class (Num) and selecting petalwidth, enables us to observe even more about the data: for the class/sepallength attribute pair, for example, we can see which range of petalwidth values (indicated by different colors) tends to go along with which class.
Filters

There are also a number of filters available, which apply different criteria to select either objects (the rows in your data matrix) or attributes (the columns in your data matrix). This allows you to discard parts of your matrix without having to manipulate your original data file (which is a bad idea in any case, and time consuming). For example, you can look at subsets of attributes, discard the first 20 rows, normalize or discretize attributes, and so on.

To apply a filter, you first have to select which type of filter you'd like by clicking on the CHOOSE button right underneath Filter in your main GUI. Double-clicking on the FILTER folder that appears will expand the window to show two folders named supervised and unsupervised, both of which you can expand again (you can collapse a folder by clicking on the minus sign that appears to the left of it once it's expanded). Both unsupervised and supervised filters can be applied to objects and attributes.

Once you have chosen a filter, the selected option will show up in the bar next to FILTER, but at this stage nothing has happened to your data yet. You then have to press APPLY to actually filter your data. There is also a SAVE button, which allows you to save any changes you made to your data. Make sure you don't overwrite your original data file!

At this point, note that whatever shows up next to Filter is the actual Java command you'd need to apply the filter to your data through the command line interface, which we'll get to later.
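To make concrete what an object (row) filter and an attribute (column) filter do, here is a minimal Python sketch. It is not Weka's implementation, and the function names `remove_first` and `normalize` are made up for this example; the point is only that each filter returns a new, modified copy and leaves the original data untouched.

```python
# Six rows in the layout of the Iris data:
# sepallength, sepalwidth, petallength, petalwidth, class
ROWS = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
    (6.4, 3.2, 4.5, 1.5, "Iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
    (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
]
PETALLENGTH = 2  # column index


def remove_first(rows, count):
    """Object filter: drop the first `count` rows; the input list is not modified."""
    return rows[count:]


def normalize(rows, column):
    """Attribute filter: rescale one numeric column to [0, 1] (min-max scaling)."""
    values = [row[column] for row in rows]
    low, high = min(values), max(values)
    span = high - low
    result = []
    for row in rows:
        # A constant column has no spread, so map it to 0.0.
        scaled = 0.0 if span == 0 else (row[column] - low) / span
        result.append(row[:column] + (scaled,) + row[column + 1:])
    return result


without_first_two = remove_first(ROWS, 2)
normalized = normalize(ROWS, PETALLENGTH)

print(len(ROWS), "rows before,", len(without_first_two), "rows after removal")
for row in normalized:
    print(row)
```

As in the GUI, nothing happens to the data until a filter is actually applied, and saving the filtered result under a new name keeps the original file intact.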
The log file

The log file is used to keep track of what you did. Clicking on LOG in your main GUI will bring up another window that shows exactly what you did; in this case it shows that we loaded the Iris data set and applied a filter.

Note: to get a screenshot of the active window, press ALT + PrtSc. You can then paste it from the clipboard into another application, for example Paint.

Making your own ARFF file

To use Weka with a data set that isn't already included in the package, the data you want to examine needs to be in a certain format. If you open the Iris data in a text editor, for example, you'll notice that in addition to the data, it contains information about the data. Part of the Iris data file is shown below.

Comments can be included via the percentage sign to give additional information about the data. If you make up your own data, this allows you to make sure that you'll always know which data you're dealing with, for example where you found it and what it represents. You can also include some summary statistics as in the example above.
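As a rough illustration of that layout, the Python sketch below assembles a small ARFF text with comment lines, a header and a data section. The helper `to_arff` is hypothetical and written for this example (Weka does not provide it); the comment text is made up, and the rows are a small sample taken from the Iris data rather than the full file.

```python
def to_arff(relation, attributes, rows, comments=()):
    """Build ARFF text: % comments, @RELATION, @ATTRIBUTE lines, then @DATA."""
    lines = [f"% {comment}" for comment in comments]
    lines.append(f"@RELATION {relation}")
    lines.append("")
    for name, kind in attributes:
        # A list of labels describes a nominal attribute: {a,b,c}
        if isinstance(kind, (list, tuple)):
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@ATTRIBUTE {name} {kind}")
    lines.append("")
    lines.append("@DATA")
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


text = to_arff(
    relation="iris",
    attributes=[
        ("sepallength", "NUMERIC"),
        ("sepalwidth", "NUMERIC"),
        ("petallength", "NUMERIC"),
        ("petalwidth", "NUMERIC"),
        ("class", ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]),
    ],
    rows=[
        (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
        (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
        (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
    ],
    comments=["Small sample in the layout of the Iris data", "3 instances, 5 attributes"],
)
print(text)

# Save under a new name with the .arff extension so Weka can open it.
with open("iris_sample.arff", "w") as handle:
    handle.write(text)
```

The file name `iris_sample.arff` is only an example; any name works as long as it does not overwrite a data file you want to keep.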
More importantly, you'll notice that there are a header and a data section, marked by certain keywords such as @ATTRIBUTE, @DATA and @RELATION. If you make up the data file yourself, you need to make sure that it follows this structure.
For detailed information about Weka's ARFF file format, go to Additional information about Weka's data format ARFF. Note also that the ARFF file needs to be saved with the extension .arff.

Go to Tutorial Part 2 | Additional Resources