- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
and R1 p0: number of positive instances covered by R0 n0: number of negative instances covered by R0 p1: number of positive instances covered by R1 n1: number of negative instances covered by R1
N a m eB l o o d T y p e G i v e B i r t h C a n F l y L i v e i n W a t e rC l a s s
t u r t l e c o l d
n on o s o m e t i m e s?
规则排序
基于规则的排序
• 根据规则的质量进行排序
直接方法: 顺序覆盖
顺序覆盖(Sequential Covering)
(1) 初始值为空规则集 (2) 使用Learn-One-Rule函数得到一条新规则 (3) 从训练集中删去被新产生的规则所覆盖的实例 (4) 重复步骤(2)和步骤(3),直到满足停止标准为止。
示例
(i) Original Data
Single 125K No
Married 100K No
Single 70K
No
Married 120K No
Divorced 95K
Yes
Married 60K
No
Divorced 220K No
Single 85K
Yes
Married 75K
No
Single 90K
Yes
(Status=Single) No
Can Fly
no no no no no no yes yes no no no no no no no no no yes no yes
Live in Water
Class
no
mammals
no
reptiles
yes
fishes
yes
mammals
sometimes amphibians
no
reptiles
分类规则的类别
互斥规则(Mutually exclusive rules)
• 若规则互相独立,则称分类器包含互斥规则 • 每条记录最多被一条规则所覆盖
无遗漏规则(Exhaustive rules)
• 若分类器考虑了所有可能的属性值的组合,则 该分类器具有无遗漏的覆盖
• 每条记录至少被一条规则所覆盖
基于类别的排序
• 根据规则的类别进行排序
Rule-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
... Status =
Income
Married
> 80K
Yes: 2 No: 1
Yes: 1 No: 0
Yes: 0 No: 3
(a) General-to-specific
Yes: 3 No: 1
Refund=No, Status=Single, Income=85K
(Class=Yes)
Refund=No, Status=Single, Income=90K
Blood Type
warm cold cold warm cold cold warm warm warm cold cold warm warm cold cold cold warm warm warm warm
Give Birth
yes no no yes no no yes no yes yes no no yes no no no no no yes no
Yes
10
Initial Rule:
(Refund=No) (Status=Married) No
Simplified Rule: (Status=Married) No
规则约简的效果
规则有可能不再互斥 • 一条记录有可能调用多条规则 • 解决方案
• 对规则集进行排序 • 使用投票的方式
规则有可能存在遗漏 • 一条记录可能不满足任何一条规则 • 解决方案
Coverage = 40%, Accuracy = 50%
构造分类规则
直接方法:
• 直接从数据中提取规则 • e.g. RIPPER, CN2, Holte’s 1R
间接方法:
• 从其它分类模型中提取规则 、 • e.g. decision trees, neural networks, etc
消除实例
R3
R2
class = + class = -
R1
++
+ ++
+
+ ++
+
+++ ++ +
++
+ +
-
-
-- -
-
-
-
-
-
-
-
-
+
++
+
+
+
+
++
--
-
-
-
-
-
不消除实例? 不消除正例? 不消除负例?
d o g f is h s h a r kc o ld
y e s n o y e s
?
A lemur triggers rule R3, so it is classified as a mammal A turtle triggers both R4 and R5 A dogfish shark triggers none of the rules
• 使用默认类别
利用规则进行分类
R1: (Give Birth = no) (Can Fly = yes) Birds R2: (Give Birth = no) (Live in Water = yes) Fishes R3: (Give Birth = yes) (Blood Type = warm) Mammals R4: (Give Birth = no) (Can Fly = no) Reptiles R5: (Live in Water = sometimes) Amphibians
(Refund=No, Marital Status={Married}) ==> No
Class-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
measure:
• R0: {} => class (initial rule) • R1: {A} => class (rule after adding conjunct) • Gain(R0, R1) = t [ log (p1/(p1+n1)) – log (p0/(p0 + n0)) ] • where t: number of positive instances covered by both R0
有序规则集
根据优先权对规则进行排序 对一个待分类的记录
• 若满足多条规则,则使用排在最前面的对其进行分类。 • 若不满足任何规则,则使用默认类别。
R1: (Give Birth = no) (Can Fly = yes) Birds R2: (Give Birth = no) (Live in Water = yes) Fishes R3: (Give Birth = yes) (Blood Type = warm) Mammals R4: (Give Birth = no) (Can Fly = no) Reptiles R5: (Live in Water = sometimes) Amphibians
(ii) Step 1
R1
(iii) Step 2
示例
R1
R2
(iv) Step 3
顺序覆盖的要点
产生规则 消除实例 规则评价 停止标准 规则的剪枝
产生规则
两种常用方法
Yes: 3
{}
No: 4
Refund= No
Yes: 3 No: 4
Status = Single
Status = Divorced
(Class=Yes)
Refund=No, Status = Single (Class = Yes)
(b) Specific-to-general
RIPPER算法
Start from an empty rule: {} => class Add conjuncts that maximizes FOIL’s information gain
M a rrie d 1 2 0 K
No
5 No
D iv o rc e d 9 5 K
Yes
6 No
M a rrie d 6 0 K
No
7 Yes
D iv o rc e d 2 2 0 K
No
8 No
S in g le 8 5 K
Yes
9 No
Fra Baidu bibliotek
M a rrie d 7 5 K
No
10 No
S in g le 9 0 K
N a m e B l o o d T y p e G i v e B i r t h C a n F l yL i v e i n W a t e r C l a s s
le m u r w a r m
y e s n o n o
?
t u r t le
c o ld
n o n o s o m e t im e s ?
no
mammals
no
birds
no
mammals
yes
fishes
sometimes reptiles
sometimes birds
no
mammals
yes
fishes
sometimes amphibians
no
reptiles
no
mammals
no
birds
yes
mammals
no
birds
R1: (Give Birth = no) (Can Fly = yes) Birds R2: (Give Birth = no) (Live in Water = yes) Fishes R3: (Give Birth = yes) (Blood Type = warm) Mammals R4: (Give Birth = no) (Can Fly = no) Reptiles R5: (Live in Water = sometimes) Amphibians
数据挖掘技术
第九课 常用分类方法
主要内容
基于规则的分类 基于实例的分类
基于规则的分类(Rule-Based Classifier)
使用形如“if…then…” 的规则集对记录进行分类。 规则: (Condition) y
• 其中: • Condition 是属性-值对的合取 • y 是类标记
规则可以约简
T id R e fu n d M a rita l T a x a b le S ta tu s In c o m e C h e a t
1 Yes
S in g le 1 2 5 K
No
2 No
M a rrie d 1 0 0 K
No
3 No
S in g le 7 0 K
No
4 Yes
(Refund=No, Marital Status={Married}) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
规则的覆盖度与正确性
规则的覆盖度(Coverage):
• 满足规则条件的记录的百 分比
规则的正确性(Accuracy) :
• 在满足规则条件的记录中, 也满足规则结论的记录的 百分比
Tid Refund Marital Taxable Status Income Class
1 Yes 2 No 3 No 4 Yes 5 No 6 No 7 Yes 8 No 9 No 10 No
10
• 分类规则的例子: • (Blood Type=Warm) (Lay Eggs=Yes) Birds • (Taxable Income < 50K) (Refund=Yes) Evade=No
示例
Name
human python salmon whale frog komodo bat pigeon cat leopard shark turtle penguin porcupine eel salamander gila monster platypus owl dolphin eagle