CPSC 633 Machine Learning (Fall 2000)

Luis Francisco-Revilla

Code

Report

DBInstances.java
Attribute.java
Instance.java
brcancer.txt
census.txt
credit.txt
diabetes.txt
glass.txt
heart.txt
iris.txt
promoters.txt
rotate.txt
secondary.txt
voting.txt
 
Unmil Karadkar

Code

Report

AttribDetails.java
Classifier.java
ConstantValues.java
DataFormat.java
Proj1.java
ReadDataFile.java
output.txt


Majority Classifier Summary
Database        Examples   Attributes   Target Index   Target Concept   Majority Classifier Value   Accuracy
brcancer.dat       277         10             1         CLASS           no-recurrence-events        0.7075812274368231
census.dat        1000         15            15         CLASS           <=50K                       0.75
credit.dat         653         16            16         CREDIT          -                           0.5467075038284839
diabetes.dat       768          9             9         DIABETES        0                           0.6510416666666666
glass.dat          214         10            10         FLOAT           0                           0.5934579439252337
heart.dat          296         14            14         HEALTH          healthy                     0.5405405405405406
iris.dat           150          5             1         DX              3                           0.3333333333333333
promoters.dat      106         58             1         PROMOTER        +                           0.5
rotate.dat        1000          3             3         F               1                           0.765
secondary.dat    19836         16            16         STRUCT          _                           0.5361464004839686
voting.dat         435         17             1         PARTY           democrat                    0.6137931034482759
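
The majority classifier used as a baseline above simply predicts the most frequent target value in the data set, so its accuracy is the relative frequency of that value (for example, 0.75 of the census examples are labeled <=50K). A minimal sketch of the idea in Java; the class and method names are illustrative and do not come from either submission:

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only, not taken from the submitted code.
// Minimal majority classifier: remember the most frequent target value seen
// during training and predict it for every instance.
public class MajorityClassifier {
    private String majorityValue;

    public void train(String[] targetValues) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : targetValues) {
            counts.merge(v, 1, Integer::sum);
        }
        int best = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > best) {
                best = e.getValue();
                majorityValue = e.getKey();
            }
        }
    }

    // Every instance is assigned the majority value, regardless of its attributes.
    public String classify(String[] attributeValues) {
        return majorityValue;
    }

    // Fraction of target values that match the majority value; this is the
    // quantity reported in the Accuracy column of the table above.
    public double accuracy(String[] targetValues) {
        int correct = 0;
        for (String v : targetValues) {
            if (v.equals(majorityValue)) {
                correct++;
            }
        }
        return (double) correct / targetValues.length;
    }
}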
Luis Francisco-Revilla
Results

 

 

Accuracy (%) with the Information Gain vs. Gain Ratio splitting criteria, and before (Initial) vs. after (Pruned) pruning; m = mean, s = standard deviation.

                    Information   Gain                    Initial     Pruned
Database            Gain          Ratio      Delta        Accuracy    Accuracy    Delta
brcancer.dat    m   73.5483       69.2472      4.301      66.7741     66.0214     -0.7526
                s    4.2166        2.9643      3.4378      4.4895      4.0614      1.831
census.dat      m   63.6936       75.9759    -12.2822     74.8047     73.6335     -1.1711
                s    2.2403        2.193       1.9536      3.3182      3.3493      1.6895
credit.dat      m   56.4219       74.6788    -18.2568     78.1651     80.1834      2.0183
                s    4.0281        4.6166      7.3801      1.9244      1.5259      1.3188
diabetes.dat    m   56.5234       63.7109     -7.1874     59.0624     56.3671     -2.6953
                s    2.4981        3.4521      2.8119      5.9606      4.5369      3.4026
glass.dat       m   72.361        67.361       4.9999     66.5277     64.3055     -2.2221
                s    3.8428        6.5227      6.4549      4.0598      4.9017      3.1535
heart.dat       m   57.4746       58.9898     -1.5151     61.414      59.3938     -2.0201
                s    2.584         5.557       4.885       7.8502      7.4653      4.2589
iris.dat        m   95            93.8         1.2        94.6        94.2        -0.4
                s    2.8674        3.5839      2.3475      3.4058      4.3665      2.7968
promoters.dat   m   70.5555       74.9999     -4.4443     73.8888     76.9444      3.0555
                s    4.9344        6.6769      6.7025      6.3071      4.3528      4.0253
rotate.dat      m   85.105        85.105       0          86.3362     86.126      -0.2101
                s    1.5259        1.5259      0           1.4173      1.4901      0.6178
voting.dat      m   95.2413       95.1723      0.0689     94.4827     93.6551     -0.8275
                s    2.1917        1.7507      1.3185      1.7806      1.9722      1.9721

Discussion

While the current implementation of decision trees performs relatively well on some databases (with accuracies over 90%), on others its performance is noticeably lower. Comparing these results with my teammate's and with those of other students in the class, it was clear that for some databases (e.g., credit.dat) multi-way branching on continuous attributes performed worse than binary branching. The Gain Ratio splitting criterion appears to alleviate the problem, but does not compensate for it completely.
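
For reference, the two splitting criteria compared in the table are information gain and gain ratio, where the latter divides the gain by the split information and therefore penalizes splits with many branches, such as a multi-way split on a continuous attribute. A sketch of both computations, assuming the class counts for the parent node and for each branch have already been tallied (the method names are illustrative, not taken from the submitted code):

// Illustrative sketch of the two splitting criteria, assuming the examples in
// each branch have already been grouped and their class counts tallied.
public class SplitCriteria {

    // Entropy of a class distribution given raw counts.
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Information gain of a split: parent entropy minus the weighted entropy of
    // the branches. branchCounts[i] holds the class counts of branch i.
    static double informationGain(int[] parentCounts, int[][] branchCounts) {
        int total = 0;
        for (int c : parentCounts) total += c;
        double remainder = 0.0;
        for (int[] branch : branchCounts) {
            int n = 0;
            for (int c : branch) n += c;
            remainder += ((double) n / total) * entropy(branch);
        }
        return entropy(parentCounts) - remainder;
    }

    // Gain ratio: information gain divided by the split information, which grows
    // with the number of (evenly populated) branches and so penalizes wide
    // multi-way splits.
    static double gainRatio(int[] parentCounts, int[][] branchCounts) {
        int total = 0;
        for (int c : parentCounts) total += c;
        double splitInfo = 0.0;
        for (int[] branch : branchCounts) {
            int n = 0;
            for (int c : branch) n += c;
            if (n == 0) continue;
            double p = (double) n / total;
            splitInfo -= p * (Math.log(p) / Math.log(2));
        }
        return splitInfo == 0.0 ? 0.0
                : informationGain(parentCounts, branchCounts) / splitInfo;
    }
}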

Additionally, pruning has less of an impact in the current implementation because of the use of multi-way branching. As a result of multi-way branching, trees tend to be wide and shallow, with few nodes, so pruning a node implies removing many splits and merging their examples, especially in the case of continuous attributes. In a binary-split architecture, pruning only means merging two of the possible values into a single node. Also, in a binary architecture, splits on an attribute with multiple values may be distributed all over the tree, whereas in multi-way branching an attribute tends to be split only once, in a single place in the tree. This effect is even more pronounced for continuous attributes than for discrete ones.

Although the decision tree effectively learns different concepts, it has relevant drawbacks. An interesting improvement to the current implementation would be to support binary splits, so that the results of pruning with binary splitting could be evaluated against multi-way splitting.
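
A binary split on a continuous attribute, as suggested above, would pick a single threshold rather than creating one branch per observed value. A rough sketch of how such a threshold could be chosen by information gain, assuming a two-class problem with boolean labels (all names here are illustrative, not from the submitted code):

import java.util.Arrays;

// Illustrative sketch: sort the examples by the continuous attribute and test a
// threshold at each midpoint between distinct values, keeping the one with the
// highest information gain. Examples are just (value, label) pairs here.
public class BinarySplit {

    static double entropy(int pos, int neg) {
        int total = pos + neg;
        double h = 0.0;
        for (int c : new int[] { pos, neg }) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // values[i] is the continuous attribute of example i, positive[i] its class.
    // Returns NaN if the attribute is constant and no split is possible.
    static double bestThreshold(double[] values, boolean[] positive) {
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

        int totalPos = 0;
        for (boolean p : positive) if (p) totalPos++;
        int totalNeg = values.length - totalPos;
        double parentEntropy = entropy(totalPos, totalNeg);

        int leftPos = 0, leftNeg = 0;
        double bestGain = -1.0, bestThreshold = Double.NaN;
        for (int k = 0; k < order.length - 1; k++) {
            if (positive[order[k]]) leftPos++; else leftNeg++;
            double v1 = values[order[k]], v2 = values[order[k + 1]];
            if (v1 == v2) continue;               // only split between distinct values
            int rightPos = totalPos - leftPos, rightNeg = totalNeg - leftNeg;
            int left = leftPos + leftNeg, right = rightPos + rightNeg;
            double gain = parentEntropy
                    - ((double) left / values.length) * entropy(leftPos, leftNeg)
                    - ((double) right / values.length) * entropy(rightPos, rightNeg);
            if (gain > bestGain) {
                bestGain = gain;
                bestThreshold = (v1 + v2) / 2.0;  // split at the midpoint
            }
        }
        return bestThreshold;
    }
}

With such a split, a continuous attribute can reappear deeper in the tree with a different threshold, which is exactly the flexibility that a single multi-way split on that attribute gives up.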

Unmil Karadkar
Results

Tree depth before and after pruning, data split (Training : Testing : Validation), and accuracy (%) on the validation and testing sets before and after pruning.

            Tree Depth        Data Split %                    Validation Set Accuracy %     Testing Set Accuracy %
Database    Before   After    (Training:Testing:Validation)   Before Prune   After Prune    Before Prune   After Prune
Iris           5       5      50:25:25                           97.36          97.36          97.36          97.36
               5       5      34:33:33                           90.00          90.00          79.59          79.59
Glass          8       7      50:25:25                           83.33          83.33          79.24          83.00
               6       4      34:33:33                           74.64          81.69          84.28          82.85
Diabetes      17       7      50:25:25                           64.58          71.87          73.43          69.27
              15       9      40:30:30                           68.96          78.16          64.50          69.70
Heart          9       6      50:25:25                           70.27          79.73          68.91          75.67
               9       4      60:20:20                           80.00          80.00          72.88          72.88
Rotate         9       8      50:25:25                           97.60          98.00          96.80          97.60
               8       7      34:33:33                           93.63          95.15          95.45          93.93
Voting         7       4      50:25:25                           94.49          95.41          96.29          96.29
               8       4      80:10:10                           90.90          93.18          93.18          88.63


Discussion

In general, we can make a few observations:

  • The accuracy on the testing set does not decrease drastically due to pruning. Since the testing set is an independent data set, this observation should generalize to most unseen data (a sketch of the validation-set pruning loop follows this list).
  • The depth of the tree is reduced, in some cases drastically, without a loss of accuracy, and in some cases with an actual gain.
  • It is easier to prune bigger trees (Diabetes, Heart) than small trees (Iris), and there is much greater benefit in pruning deeper trees, since a shallower tree reaches a classification in fewer comparisons.
  • In some cases, attempting to improve the accuracy on the validation set results in a decrease in accuracy on the testing set. However, since the two are independent data sets, there is no necessary relation between them.
  • In general, trees generated from a higher percentage of training data (e.g., 60:20:20) are more balanced than trees generated from a lower percentage (e.g., 34:33:33).
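
The pruning referred to in these observations collapses an internal node into a leaf whenever doing so does not hurt accuracy on the validation set, which is why tree depth can drop sharply while testing accuracy stays roughly flat. A rough sketch of such a reduced-error-pruning loop; the Node structure and the accuracy callback are assumptions for illustration and do not correspond to the submitted classes:

import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Illustrative sketch of reduced-error pruning: repeatedly turn an internal node
// into a leaf (labeled with its majority training class) whenever doing so does
// not lower accuracy on the validation set. Node and the accuracy callback are
// assumptions, not the submitted classes.
public class ReducedErrorPruning {

    static class Node {
        final List<Node> children = new ArrayList<>();
        String leafLabel;      // non-null once the node is a leaf
        String majorityLabel;  // majority class of the training examples reaching this node

        boolean isLeaf() { return leafLabel != null; }
    }

    // validationAccuracy maps the current root to its accuracy on the held-out
    // validation set; the caller supplies the actual classification logic.
    static void prune(Node root, ToDoubleFunction<Node> validationAccuracy) {
        boolean improved = true;
        while (improved) {
            improved = false;
            for (Node node : internalNodes(root)) {
                double before = validationAccuracy.applyAsDouble(root);
                node.leafLabel = node.majorityLabel;   // tentatively collapse to a leaf
                double after = validationAccuracy.applyAsDouble(root);
                if (after >= before) {
                    node.children.clear();             // keep the prune
                    improved = after > before || improved;
                } else {
                    node.leafLabel = null;             // undo: pruning hurt validation accuracy
                }
            }
        }
    }

    private static List<Node> internalNodes(Node root) {
        List<Node> result = new ArrayList<>();
        collect(root, result);
        return result;
    }

    private static void collect(Node node, List<Node> result) {
        if (node.isLeaf()) return;
        result.add(node);
        for (Node child : node.children) collect(child, result);
    }
}

The loop stops once no collapse improves validation accuracy, consistent with the observation that depth shrinks while testing-set accuracy stays roughly constant.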