(b) There are no remaining attributes on which the samples can be further partitioned (step 4). In this case, majority voting is used (step 5): the given node is converted into a leaf and labeled with the class that holds the majority among samples. Alternatively, the class distribution of the node's samples can be stored. (c) The branch test_attribute = a_i contains no samples. In this case, a leaf is created and labeled with the majority class in samples (step 12).

Algorithm: Decision_Tree(samples, attribute_list)
Input: samples, a set of training samples described by discrete-valued attributes; attribute_list, the set of candidate attributes.
Output: a decision tree.
(1) create a node N;
(2) if all samples belong to the same class C then
(3) return N as a leaf node labeled with class C;
(4) if attribute_list is empty then
(5) return N as a leaf node labeled with the most common class in samples;  // majority voting
(6) select test_attribute, the attribute in attribute_list with the highest information gain;
(7) label node N with test_attribute;
(8) for each known value v of test_attribute  // partition samples
(9) grow a branch from node N for the condition test_attribute = v;
(10) let Sv be the set of samples in samples for which test_attribute = v;  // one partition block
(11) if Sv is empty then
(12) attach a leaf labeled with the most common class in samples;
(13) else attach the node returned by Decision_Tree(Sv, attribute_list - test_attribute).
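Before the worked example, it may help to see the pseudocode written out as code. The class below is a self-contained, hypothetical transcription of steps (1)-(13); it is not the DecisionTree class exercised by the test driver later in this section, and all names in it (Id3Sketch, Node, build, bestGainAttribute, and so on) are chosen here only for illustration. A sample is represented as a List<String> whose last element is the class label, and attributes are referred to by column index.

import java.util.*;

// Hypothetical, self-contained transcription of steps (1)-(13).
// A sample is a List<String> whose last element is the class label;
// attributes are identified by their column index.
public class Id3Sketch {

    static class Node {
        String attribute;                                    // test attribute at an internal node
        String label;                                        // class label at a leaf
        Map<String, Node> children = new LinkedHashMap<>();  // branch value -> subtree
    }

    // samples is assumed to be non-empty.
    static Node build(List<List<String>> samples, List<Integer> attributeList) {
        Node n = new Node();                                  // (1) create node N
        if (sameClass(samples)) {                             // (2)-(3) all samples in one class C
            n.label = classOf(samples.get(0));
            return n;
        }
        String majority = majorityClass(samples);
        if (attributeList.isEmpty()) {                        // (4)-(5) no attributes left: majority vote
            n.label = majority;
            return n;
        }
        int test = bestGainAttribute(samples, attributeList); // (6) highest information gain
        n.attribute = "attr#" + test;                         // (7) label N with the test attribute
        List<Integer> rest = new ArrayList<>(attributeList);
        rest.remove(Integer.valueOf(test));                   // attribute_list - test_attribute
        for (String v : valuesOf(samples, test)) {            // (8)-(9) one branch per known value v
            List<List<String>> sv = new ArrayList<>();        // (10) Sv: samples with test_attribute = v
            for (List<String> s : samples) if (s.get(test).equals(v)) sv.add(s);
            Node child;
            if (sv.isEmpty()) {                               // (11)-(12) empty partition: majority leaf
                child = new Node();
                child.label = majority;
            } else {                                          // (13) recurse on the partition
                child = build(sv, rest);
            }
            n.children.put(v, child);
        }
        return n;
    }

    static String classOf(List<String> sample) { return sample.get(sample.size() - 1); }

    static boolean sameClass(List<List<String>> samples) {
        for (List<String> s : samples)
            if (!classOf(s).equals(classOf(samples.get(0)))) return false;
        return true;
    }

    static String majorityClass(List<List<String>> samples) {
        Map<String, Integer> count = new HashMap<>();
        for (List<String> s : samples) count.merge(classOf(s), 1, Integer::sum);
        return Collections.max(count.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    static Set<String> valuesOf(List<List<String>> samples, int attr) {
        Set<String> values = new LinkedHashSet<>();
        for (List<String> s : samples) values.add(s.get(attr));
        return values;
    }

    // E(S), with class proportions taken from the last column.
    static double entropy(List<List<String>> samples) {
        Map<String, Integer> count = new HashMap<>();
        for (List<String> s : samples) count.merge(classOf(s), 1, Integer::sum);
        double e = 0;
        for (int c : count.values()) {
            double p = (double) c / samples.size();
            e -= p * Math.log(p) / Math.log(2);
        }
        return e;
    }

    // Gain(S, a) = E(S) - E(S, a); returns the attribute with the largest gain.
    static int bestGainAttribute(List<List<String>> samples, List<Integer> attributeList) {
        int best = attributeList.get(0);
        double bestGain = -1;
        for (int a : attributeList) {
            double expected = 0;                              // E(S, a): weighted entropy of the partitions
            for (String v : valuesOf(samples, a)) {
                List<List<String>> sv = new ArrayList<>();
                for (List<String> s : samples) if (s.get(a).equals(v)) sv.add(s);
                expected += (double) sv.size() / samples.size() * entropy(sv);
            }
            double gain = entropy(samples) - expected;
            if (gain > bestGain) { bestGain = gain; best = a; }
        }
        return best;
    }
}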
Worked example. The training set S contains 15 samples, 9 of one class and 6 of the other.

E(S) = -(9/15)log2(9/15) - (6/15)log2(6/15) = 0.971

Values(income range 收入范围) = {20-30K, 30-40K, 40-50K, 50-60K}
E(S(20-30K)) = -(2/4)log2(2/4) - (2/4)log2(2/4) = 1
E(S(30-40K)) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
E(S(40-50K)) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113
E(S(50-60K)) = -(2/2)log2(2/2) - (0/2)log2(0/2) = 0
E(S, income range) = (4/15)E(S(20-30K)) + (5/15)E(S(30-40K)) + (4/15)E(S(40-50K)) + (2/15)E(S(50-60K)) = 0.7236
Gain(S, income range) = 0.971 - 0.7236 = 0.2474

The information gains of insurance (保险), sex (性别), and age (年龄) are computed in the same way:

Values(insurance) = {yes, no}
E(S(yes)) = -(3/3)log2(3/3) - (0/3)log2(0/3) = 0
E(S(no)) = -(6/12)log2(6/12) - (6/12)log2(6/12) = 1
E(S, insurance) = (3/15)E(S(yes)) + (12/15)E(S(no)) = 0.8
Gain(S, insurance) = 0.971 - 0.8 = 0.171

Values(sex) = {male, female}
E(S(male)) = -(3/7)log2(3/7) - (4/7)log2(4/7) = 0.9852
E(S(female)) = -(6/8)log2(6/8) - (2/8)log2(2/8) = 0.8113
E(S, sex) = (7/15)E(S(male)) + (8/15)E(S(female)) = 0.8925
Gain(S, sex) = 0.971 - 0.8925 = 0.0785

Values(age) = {15-40, 41-60}
E(S(15-40)) = -(6/7)log2(6/7) - (1/7)log2(1/7) = 0.5917
E(S(41-60)) = -(3/8)log2(3/8) - (5/8)log2(5/8) = 0.9544
E(S, age) = (7/15)E(S(15-40)) + (8/15)E(S(41-60)) = 0.7851
Gain(S, age) = 0.971 - 0.7851 = 0.1859

Income range has the highest information gain (0.2474), so it is selected as the test attribute for this node.
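As a quick arithmetic check, the short program below recomputes the four gains directly from the per-partition class counts that appear in the fractions above (for example, the 20-30K partition holds 2 samples of one class and 2 of the other). It is only a sketch written for this section; the class GainCheck and its helpers entropy/gain are names invented here and are unrelated to the Gain class listed further down.

public class GainCheck {

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Entropy of a two-class set with p samples of one class and n of the other.
    static double entropy(int p, int n) {
        double e = 0, total = p + n;
        if (p > 0) e -= p / total * log2(p / total);
        if (n > 0) e -= n / total * log2(n / total);
        return e;
    }

    // Gain(S, A) = E(S) - sum_v |Sv|/|S| * E(Sv); each partition is a {p, n} pair.
    static double gain(int[][] partitions) {
        int p = 0, n = 0;
        for (int[] part : partitions) { p += part[0]; n += part[1]; }
        double total = p + n, expected = 0;
        for (int[] part : partitions)
            expected += (part[0] + part[1]) / total * entropy(part[0], part[1]);
        return entropy(p, n) - expected;
    }

    public static void main(String[] args) {
        // The text rounds intermediate values, so the last digit may differ slightly:
        // income range ~0.247, insurance ~0.171, sex ~0.0785, age ~0.186.
        System.out.printf("income range: %.4f%n", gain(new int[][] {{2, 2}, {4, 1}, {1, 3}, {2, 0}}));
        System.out.printf("insurance:    %.4f%n", gain(new int[][] {{3, 0}, {6, 6}}));
        System.out.printf("sex:          %.4f%n", gain(new int[][] {{3, 4}, {6, 2}}));
        System.out.printf("age:          %.4f%n", gain(new int[][] {{6, 1}, {3, 5}}));
    }
}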
Code:

package DecisionTree;

import java.util.ArrayList;

/**
 * Decision tree node.
 */
public class TreeNode {

    private String name;                          // node name (the split attribute at this node)
    private ArrayList<String> rule;               // split rules of this node (one per child branch)
    private ArrayList<TreeNode> child;            // child nodes
    private ArrayList<ArrayList<String>> datas;   // training tuples partitioned to this node
    private ArrayList<String> candAttr;           // candidate attributes at this node

    public TreeNode() {
        this.rule = new ArrayList<String>();
        this.child = new ArrayList<TreeNode>();
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public ArrayList<TreeNode> getChild() {
        return child;
    }

    public void setChild(ArrayList<TreeNode> child) {
        this.child = child;
    }

    public ArrayList<String> getRule() {
        return rule;
    }

    public void setRule(ArrayList<String> rule) {
        this.rule = rule;
    }

    public ArrayList<ArrayList<String>> getDatas() {
        return datas;
    }

    public void setDatas(ArrayList<ArrayList<String>> datas) {
        this.datas = datas;
    }

    public ArrayList<String> getCandAttr() {
        return candAttr;
    }

    public void setCandAttr(ArrayList<String> candAttr) {
        this.candAttr = candAttr;
    }
}
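As a small illustration of how this node type represents a tree, the hypothetical fragment below wires one branch together by hand: rule.get(i) is the branch value that leads to child.get(i), and (as the printTree output below suggests) a leaf simply carries its class label in name. The class TreeNodeDemo and the concrete values are made up; in real use these fields are filled in by the tree-building code, not by hand.

package DecisionTree;

import java.util.ArrayList;

public class TreeNodeDemo {
    public static void main(String[] args) {
        TreeNode root = new TreeNode();
        root.setName("income range");            // split attribute at the root

        TreeNode leaf = new TreeNode();
        leaf.setName("yes");                     // assumed: a leaf stores its class label in name

        ArrayList<String> rules = new ArrayList<String>();
        rules.add("20-30K");                     // branch taken when income range = 20-30K
        ArrayList<TreeNode> children = new ArrayList<TreeNode>();
        children.add(leaf);

        root.setRule(rules);
        root.setChild(children);

        // prints: income range -> 20-30K -> yes
        System.out.println(root.getName() + " -> " + root.getRule().get(0)
                + " -> " + root.getChild().get(0).getName());
    }
}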
package DecisionTree;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.StringTokenizer;

/**
 * Test driver for the decision tree algorithm.
 */
public class TestDecisionTree {

    /**
     * Reads the candidate attributes from standard input (terminated by an empty line).
     * @return the set of candidate attributes
     * @throws IOException
     */
    public ArrayList<String> readCandAttr() throws IOException {
        ArrayList<String> candAttr = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
        String str = "";
        while (!(str = reader.readLine()).equals("")) {
            StringTokenizer tokenizer = new StringTokenizer(str);
            while (tokenizer.hasMoreTokens()) {
                candAttr.add(tokenizer.nextToken());
            }
        }
        return candAttr;
    }

    /**
     * Reads the training tuples from standard input (one tuple per line, terminated by an empty line).
     * @return the set of training tuples
     * @throws IOException
     */
    public ArrayList<ArrayList<String>> readData() throws IOException {
        ArrayList<ArrayList<String>> datas = new ArrayList<ArrayList<String>>();
        BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
        String str = "";
        while (!(str = reader.readLine()).equals("")) {
            StringTokenizer tokenizer = new StringTokenizer(str);
            ArrayList<String> s = new ArrayList<String>();
            while (tokenizer.hasMoreTokens()) {
                s.add(tokenizer.nextToken());
            }
            datas.add(s);
        }
        return datas;
    }

    /**
     * Recursively prints the tree structure.
     * @param root the node whose information is to be printed
     */
    public void printTree(TreeNode root) {
        System.out.println("name: " + root.getName());
        ArrayList<String> rules = root.getRule();
        System.out.print("node rules: ");
        for (int i = 0; i < rules.size(); i++) {
            System.out.print(rules.get(i) + " ");
        }
        System.out.println();
        ArrayList<TreeNode> children = root.getChild();
        int size = children.size();
        if (size == 0) {
            System.out.println("-- leaf node! --");
        } else {
            System.out.println("size of children: " + children.size());
            for (int i = 0; i < children.size(); i++) {
                System.out.print("child " + (i + 1) + " of node " + root.getName() + ": ");
                printTree(children.get(i));
            }
        }
    }

    /**
     * Main entry point.
     * @param args
     */
    public static void main(String[] args) {
        TestDecisionTree tdt = new TestDecisionTree();
        ArrayList<String> candAttr = null;
        ArrayList<ArrayList<String>> datas = null;
        try {
            System.out.println("Please enter the candidate attributes:");
            candAttr = tdt.readCandAttr();
            System.out.println("Please enter the training data:");
            datas = tdt.readData();
        } catch (IOException e) {
            e.printStackTrace();
        }
        DecisionTree tree = new DecisionTree();
        TreeNode root = tree.buildTree(datas, candAttr);
        tdt.printTree(root);
    }
}
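For reference, a console session for this test class might look like the transcript below. Both blocks are read from standard input, each terminated by an empty line, and tokens are whitespace-separated, so attribute names must be single tokens. The English attribute names and the two tuples are invented for illustration; the only structural assumption is that the class label sits in the column right after the candidate attributes, which is where the Gain class below reads it from (column attrList.size()).

Please enter the candidate attributes:
income_range insurance sex age
(empty line ends the attribute block)
Please enter the training data:
30-40K yes male 15-40 yes
40-50K no female 41-60 no
(empty line ends the data block)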
package DecisionTree;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/**
 * Selects the best split attribute (the one with the highest information gain).
 */
public class Gain {

    private ArrayList<ArrayList<String>> D = null;   // training tuples
    private ArrayList<String> attrList = null;       // candidate attribute set

    public Gain(ArrayList<ArrayList<String>> datas, ArrayList<String> attrList) {
        this.D = datas;
        this.attrList = attrList;
    }

    /**
     * Gets the domain of values in the specified attribute column
     * (all attribute values are assumed to be finite nominal/categorical values).
     * @param datas the training tuples
     * @param attrIndex the index of the attribute column
     * @return the set of values occurring in that column
     */
    public ArrayList<String> getValues(ArrayList<ArrayList<String>> datas, int attrIndex) {
        ArrayList<String> values = new ArrayList<String>();
        String r = "";
        for (int i = 0; i < datas.size(); i++) {
            r = datas.get(i).get(attrIndex);
            if (!values.contains(r)) {
                values.add(r);
            }
        }
        return values;
    }

    /**
     * Counts how often each value occurs in the specified column.
     */
    public Map<String, Integer> valueCounts(ArrayList<ArrayList<String>> datas, int attrIndex) {
        Map<String, Integer> valueCount = new HashMap<String, Integer>();
        String c = "";
        ArrayList<String> tuple = null;
        for (int i = 0; i < datas.size(); i++) {
            tuple = datas.get(i);
            c = tuple.get(attrIndex);
            if (valueCount.containsKey(c)) {
                valueCount.put(c, valueCount.get(c) + 1);
            } else {
                valueCount.put(c, 1);
            }
        }
        return valueCount;
    }

    /**
     * The expected information needed to classify the tuples in datas,
     * i.e. the entropy of datas (the class label is the column after the candidate attributes).
     * @param datas the training tuples
     * @return the entropy of datas
     */
    public double infoD(ArrayList<ArrayList<String>> datas) {
        double info = 0.000;
        int total = datas.size();
        Map<String, Integer> classes = valueCounts(datas, attrList.size());
        Iterator<Map.Entry<String, Integer>> iter = classes.entrySet().iterator();
        Integer[] counts = new Integer[classes.size()];
        for (int i = 0; iter.hasNext(); i++) {
            Map.Entry<String, Integer> entry = iter.next();
            counts[i] = entry.getValue();
        }
        for (int i = 0; i < counts.length; i++) {
            // counts[i]/total rounded to 3 decimal places (DecimalCalculate is a decimal-arithmetic helper)
            double base = DecimalCalculate.div(counts[i], total, 3);
            info += (-1) * base * Math.log(base) / Math.log(2);  // log base 2, matching E(S) above
        }
        return info;
    }

    /**
     * Returns all tuples whose value in the specified attribute column equals the given value.
     * @param attrIndex the index of the attribute column
     * @param value the attribute value
     * @return the matching tuples
     */
    public ArrayList<ArrayList<String>> datasOfValue(int attrIndex, String value) {
        ArrayList<ArrayList<String>> Di = new ArrayList<ArrayList<String>>();
        ArrayList<String> t = null;
        for (int i = 0; i < D.size(); i++) {
            t = D.get(i);
            if (t.get(attrIndex).equals(value)) {
                Di.add(t);
            }
        }
        return Di;
    }

    /**
     * The expected information needed to classify the tuples of D after
     * partitioning on the specified attribute.
     * @param attrIndex the index of the attribute
     * @return the expected information for that partitioning
     */
    public double infoAttr(int attrIndex) {
        double info = 0.000;
        ArrayList<String> values = getValues(D, attrIndex);
        for (int i