Discriminant Feature Selection for Texture Classification
Data Dimensionality Reduction: A Concise Overview

Data dimensionality reduction techniques play a pivotal role in data analysis and machine learning. They simplify high-dimensional data by projecting it into a lower-dimensional space while preserving the essential information. The basic process typically involves the following steps:

1. Data preprocessing: prepare the data for dimensionality reduction by cleaning it, removing outliers, and normalizing the features so that they are on a comparable scale.
2. Feature selection: identify the most informative and relevant features in the original dataset, using filter methods (e.g., correlation, information gain), wrapper methods (e.g., forward selection, backward elimination), or embedded methods (e.g., L1 regularization).
3. Feature transformation: transform the original features into a new set of features that is better suited to dimensionality reduction. Common techniques include principal component analysis (PCA), singular value decomposition (SVD), and linear discriminant analysis (LDA).
4. Dimensionality reduction: project the data into a lower-dimensional space. Commonly used algorithms include PCA, t-SNE (t-distributed stochastic neighbor embedding), UMAP (Uniform Manifold Approximation and Projection), and autoencoders.
5. Evaluation: assess the effectiveness of the reduction, for example by comparing the performance of models trained on the original data against models trained on the reduced data, using metrics such as accuracy, precision, recall, and F1-score.
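As a concrete illustration of steps 3-5, here is a minimal sketch in R; the iris data, the 95% explained-variance cut-off, and the simple logistic check are arbitrary illustrative choices, not part of the workflow itself.

## Minimal PCA-based dimensionality reduction sketch (assumptions: iris data,
## 95% explained-variance cut-off, a plain glm as the downstream model).
data(iris)
X <- scale(iris[, 1:4])                      # step 1: centre and scale
pca <- prcomp(X)                             # steps 3-4: PCA projection
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.95)[1]               # smallest k keeping >= 95% of the variance
Z <- pca$x[, 1:k, drop = FALSE]              # reduced representation

## step 5: crude evaluation -- compare a classifier on original vs reduced data
y <- as.integer(iris$Species == "virginica")
acc <- function(features) {
  d <- data.frame(features, y = y)
  fit <- glm(y ~ ., data = d, family = binomial)
  mean((predict(fit, type = "response") > 0.5) == (y == 1))
}
c(original = acc(X), reduced = acc(Z), components = k)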
Retrieval language for external document features

A retrieval language for external features refers to the specific terms or keywords used in literature searches to find documents related to the external features of a particular topic or research area. These external features may concern the shape, texture, colour, specific structure, or other properties of an object's surface, or particular characteristics or behaviours of the human body. Some commonly used retrieval terms for external features are listed below:

1. Surface shape features: uniformity, curvature, smoothness, roughness, geometric shape, surface topography, surface morphology.
2. Surface texture features: texture features, texture descriptors, texture analysis, texture discrimination, texture recognition, texture extraction, texture model, texture classification.
3. Surface colour features: color features, color distribution, color model, color space, color histogram, color feature extraction, color descriptor.
4. Specific structural features: cellular structure, lattice structure, molecular structure, biological tissue structure, surface structure, microstructure.
5. Human-body features: facial features, body shape, hand features, skeletal structure, gait analysis, visual attention, pose recognition, facial expression recognition.

These are only examples of retrieval terms for external features; in practice the vocabulary should be adjusted and extended according to the specific research field and purpose.
The SRCFS feature selection method

SRCFS (Subspace Regularized Feature Selection) is a feature selection method that combines the ideas of subspace regularization and feature selection. Its main goal is to find the most relevant subset of features in a high-dimensional dataset in order to improve the performance of classification or regression models.

The method proceeds as follows:

1. Data preprocessing: standardize or normalize the raw data so that the features are on comparable scales.
2. Build a correlation matrix: compute the correlations between the original features, for example with the Pearson correlation coefficient or mutual information.
3. Subspace learning: project the features into a low-dimensional subspace using principal component analysis (PCA) or another dimensionality reduction method.
4. Feature selection: within that subspace, use L1-norm regularization or another sparsity-inducing technique to drive the weights of irrelevant features to zero, thereby selecting features.
5. Model training and evaluation: train a classification or regression model on the selected feature subset and evaluate it with cross-validation or another validation scheme.

Advantages of SRCFS include:

- It takes the correlations between features into account and avoids selecting redundant features.
- By combining subspace learning with regularization, it can extract more informative features from high-dimensional data.
- The solution is sparse, yielding a compact feature subset.

Note that concrete implementations of SRCFS may differ depending on the application scenario and the algorithmic choices made.
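A loose sketch of this kind of pipeline in R follows. It is not a reference implementation of SRCFS; it simply strings together the ingredients listed above (standardization, a correlation check, a PCA subspace, and an L1-penalized model via the glmnet package) on synthetic data.

## Sketch of a standardize / correlate / subspace / L1 pipeline (illustrative only).
library(glmnet)

set.seed(1)
n <- 200; p <- 30
X <- matrix(rnorm(n * p), n, p)                       # toy high-dimensional data
y <- X[, 1] - 2 * X[, 2] + 0.5 * X[, 3] + rnorm(n)    # only three features matter

Xs <- scale(X)                                        # 1. standardize
R  <- cor(Xs)                                         # 2. feature correlation matrix
pc <- prcomp(Xs)                                      # 3. subspace via PCA (for inspection here)

## 4. L1-regularized regression drives irrelevant weights to zero
fit <- cv.glmnet(Xs, y, alpha = 1)
w   <- coef(fit, s = "lambda.1se")[-1]                # drop the intercept
selected <- which(w != 0)
selected                                              # 5. use these columns downstream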
Feature Selection Method Based on Domain-specific Term Extraction

Sun Lin, Niu Junyu (Department of Computer Science and Engineering, Fudan University, Shanghai 200433). Journal of Chinese Computer Systems, Vol. 28, No. 5, May 2007, pp. 895-899. Received 2006-03-08; supported by the National Natural Science Foundation of China (grant 60305006).

Abstract: Traditional document representation for text classification is generally based on full-text (bag-of-words) analysis. Because it ignores domain-specific semantic features, it does not fit domain-specific text classification tasks well. This paper describes a feature selection method based on domain-specific term extraction using corpus comparison, and a text classification system that combines this method with an SVM classifier and can be applied to any domain with little effort. The system obtained the highest score among the runs from 19 groups in the evaluation of the TREC 2005 Genomics Track Categorization Task (TREC: Text REtrieval Conference).

Keywords: text classification; document representation; feature selection; domain-specific.

1 Introduction

The text classification task assigns a predefined category to a document. A document is usually represented with the vector space model (VSM), but the ever-growing vocabulary of texts makes the vector space too high-dimensional and too sparse. Feature selection is one way to address this problem [1-3]; representative statistical methods such as Information Gain [4] and Chi-square [5,6] reduce dimensionality by choosing the most suitable feature subset. However, these methods cannot capture semantic information, so general-purpose text classification systems do not perform well on domain-specific classification tasks that need semantic support. Although many ontological dictionaries have appeared in recent years that could provide such support, they have clear weaknesses [7]: (1) building an ontology is time-consuming and labour-intensive; (2) newly coined terms are not covered; (3) dictionary lookups slow the system down considerably; (4) a lot of noise is introduced; (5) it is hard to cover every domain, and because different ontologies differ in semantic hierarchy and structure, it is difficult to devise a unified method that works across domains, so generality suffers. To address these problems, Penas et al. proposed extracting domain-specific terms by comparing corpora [7-10], a shortcut to semantic support. However, they select domain-specific terms only from changes in term frequency between corpora, ignoring the context in which the terms occur and thereby weakening the semantic value of the selection.

The feature selection method proposed in this paper addresses this issue by taking the context of each word into account. It selects domain-specific terms according to how the distribution of a word's collocations differs between corpora, and then uses these terms and the words around them as document features. A text classification system built from this feature selection method and the SVM Light classifier [11] took first place in the evaluation of the TREC 2005 Genomics Track Categorization Task [12,13]. The rest of this paper describes the feature selection method, the domain-specific text classification system that combines it with an SVM classifier, and the evaluation results at TREC, and finally summarizes the method and the system.
2 Feature selection based on domain-specific term extraction

When people read the literature of a particular domain, they pay close attention to the terms specific to that domain and grasp the gist of a document from the words around those terms. We therefore make the following assumption H1.

H1: Domain-relevant document features occur around the domain-specific terms of that domain.

Based on this assumption, the proposed feature selection method has two steps: (1) find the domain-specific terms by comparing corpora; (2) take the words within a window of fixed size around these terms as the document features.

2.1 Extracting domain-specific terms by corpus comparison

The basic idea of corpus-comparison-based term extraction is to compare some property of a word between a corpus of the specific domain and a general corpus covering a wide range of domains, and to select the domain-specific terms of the former according to that difference. One therefore first chooses a corpus of the specific domain as the analysis corpus AC and a broad-coverage corpus as the reference corpus RC, and then, for every word appearing in AC, compares its properties in the two corpora and selects the words most specific to the domain of AC.

Traditional corpus-comparison methods [7-10], represented by Penas et al. [7], select domain-specific terms by comparing word frequencies between a domain corpus and a general corpus. These methods ignore the context of the words. In general, a domain-specific term differs between domains not only in frequency: the words it collocates with usually change substantially as well. For example, in the catering domain "disk" typically combines with "china" or "washer" to form bigrams such as "china disk" and "disk washer", whereas in the computing domain it combines with "hard" or "driver" to form "hard disk" and "disk driver". We therefore make assumption H2.

H2: The more the distribution of a word's collocations changes across domains, the more relevant the word is to the domain.

Words that occur only in AC or only in RC offer nothing to compare and are excluded from the analysis. For every word term_i that occurs in both AC and RC, suppose there are m_i distinct words term_j that follow term_i in AC and RC combined. As illustrated in Figure 1, an m_i-dimensional bigram vector space can be constructed for term_i to describe how term_i connects to these m_i words in AC and RC.

Figure 1: The bigram vectors V_A and V_R of term_i in AC and RC.

V_A and V_R are two m_i-dimensional vectors describing the collocation distribution of term_i in AC and RC, respectively; we call them bigram vectors. Each component is the phrase frequency of the bigram that starts with term_i and ends with term_j (j = 1, 2, ..., m_i) in the corresponding corpus: pf_ij^A is the count of that bigram in AC and pf_ij^R its count in RC. If a bigram does not occur in a corpus, the corresponding component is 0. On top of this vector space model, the speciality of term_i in the domain of AC is computed by formula F1:

F1:  Speciality_i = S_i^A / S_i^R,  where  S_i^A = (1/m_i) * sum_{j=1..m_i} pf_ij^A / tf_i^R  and  S_i^R = (1/m_i) * sum_{j=1..m_i} pf_ij^R / tf_i^A.

Here Speciality_i denotes the speciality of term_i in the domain of AC, tf_i^A is the term frequency of term_i in AC, and tf_i^R its term frequency in RC. Words with m_i = 0 are ignored, and words with S_i^R = 0 and S_i^A != 0 have Speciality_i set to -1. Words with a high Speciality score are selected as the domain-specific terms of AC. Table 1 lists the words with the highest Speciality scores (after Porter stemming).

Table 1: The 20 top-ranked words according to the Speciality score.
1 phosphoryl    11 upregul
2 overexpress   12 terminu
3 tyrosin       13 histon
4 isoform       14 subcellular
5 exon          15 stain
6 cleavag       16 endoth
7 downregul     17 cytosol
8 plasmid       18 repress
9 fulllength    19 transmembran
10 fibroblast   20 bead

2.2 Selecting document features from the extracted domain-specific terms

By assumption H1, features closely related to the document and its domain lie near the domain-specific terms, while words, sentences, or even paragraphs far from these terms are less likely to discuss domain-relevant topics. Once the domain-specific terms of the corpus have been obtained, the words within a window of fixed length centred on these terms are taken as document features. Different feature sets arise depending on how many domain-specific terms are chosen and how large the window is: too many terms or too long a window may cover most or all of a document, while too few terms or too short a window may leave too few features. Both situations lower the quality of the document representation, so a suitable number of terms and window size must be chosen.

2.3 Framework of the feature selection system

In the feature selection part, the input to the system is the raw documents of the corpus and the output is a feature file in SVM Light input format. First, AC and RC are converted to plain text with the corresponding parsers; the Porter stemmer maps the different forms of a word to a single stem, and pure numbers and decimals are removed. The domain-specific terms of AC are then computed, and finally the words around these terms are selected as document features. Figure 2 shows the framework of the feature selection part.

Figure 2: The framework of the feature selection system.
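A small numeric sketch of the Speciality score of formula F1 follows; the term, the collocates, and all counts are invented for illustration. pf_A and pf_R are counts of the bigram "term_i term_j" in the analysis and reference corpora, and tf_A and tf_R are the term frequencies of term_i itself.

## Toy computation of the Speciality score for one term (illustrative numbers only).
speciality <- function(pf_A, pf_R, tf_A, tf_R) {
  m <- length(pf_A)                       # number of distinct right-collocates
  if (m == 0) return(NA)                  # terms with no collocates are ignored
  S_A <- sum(pf_A / tf_R) / m
  S_R <- sum(pf_R / tf_A) / m
  if (S_R == 0 && S_A != 0) return(-1)    # special flag described in the paper
  S_A / S_R
}

## Suppose "disk" is followed by "hard", "driver", "washer" with these counts:
pf_A <- c(hard = 40, driver = 25, washer = 0)   # analysis corpus (computing domain)
pf_R <- c(hard = 5,  driver = 2,  washer = 8)   # broad reference corpus
speciality(pf_A, pf_R, tf_A = 80, tf_R = 60)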
3 Experiments and evaluation

To evaluate the feature selection method, we combined it with the SVM Light classifier and took part in the evaluation of the TREC 2005 Genomics Track Categorization Task.

3.1 The task

The task is derived from the Triage subtask of the TREC 2004 Genomics Track Categorization Task [14]. It contains four subtasks; the goal of each is to find, in the corpus, the documents belonging to one of four categories: Alleles of mutant phenotypes (A), Embryologic gene expression (E), GO annotation (G), and Tumor biology (T). The corpus, provided by Highwire Press, consists of 11,880 full-text articles from the years 2002 and 2003 of three biochemistry journals: the Journal of Biological Chemistry (JBC), the Journal of Cell Biology (JCB), and the Proceedings of the National Academy of Sciences (PNAS). The full-text documents are in SGML format based on the Highwire document type definition (DTD). The 2002 articles are used for training and the 2003 articles for testing.

3.2 Evaluation measure

The TREC Genomics Track Categorization Task uses utility as the evaluation measure for every subtask; this measure is commonly used in text classification and was also used in earlier TREC filtering tracks. Here the normalized utility is used, computed by formula F2:

F2:  U_norm = U_raw / U_max,  U_raw = (u_r * TP) - FP,  U_max = u_r * AP,  u_r = AN / AP,

where TP is the number of true positives among the documents classified as positive, FP the number of false positives among them, AN the total number of negative examples, and AP the total number of positive examples. Because AP and AN differ between subtasks, so do the u_r values, as shown in Table 2.

Table 2: The value of u_r and the distribution of positive and negative samples.
Subtask      AP    AN     u_r      overall u_r
A  training  338   5499   16.27    17
   test      332   5711   17.20
E  training  81    5756   71.06    64
   test      105   5938   56.55
G  training  462   5375   11.63    11
   test      518   5525   10.67
T  training  36    5801   161.14   231
   test      20    6023   301.15

3.3 Choice of corpora

In the experiments, AC is the corpus used for the TREC Genomics Track Categorization Task. RC should cover as many domains as possible; we use the .GOV corpus of the TREC Web Track, a crawl of .gov web sites from early 2002 containing 1,247,753 documents (including 1,053,372 text/html documents) and 18.1 GB in total, which spans a fairly wide range of domains. To reduce the influence of corpus size on the comparison, we use a subset of the .GOV corpus comparable in size to AC.

3.4 The classification system

Combining the SVM Light classifier with the feature selection stage, we built a text classification system for the evaluation task: as shown in Figure 3, the feature files produced by the feature selection part are passed to the classifier, which outputs the classification results. The SVM Light parameters are the best parameters found by Fujita [15] in the Triage task of the TREC 2004 Genomics Track, namely C = 0.0001505 and J = u_r.

Figure 3: The framework of the classification system.

As described above, different numbers of domain-specific terms and different window sizes produce different feature sets, and a suitable combination has to be learned. In the learning phase we split the training corpus into two equal halves containing the same numbers of positive and negative examples, train on one half and test on the other, and use the average of the results on the two halves to compare feature sets and pick the best-performing one. We also tried adding the lowest-scoring words of the domain-specific term list to the stop-word list, but this did not perform well in the learning phase.

Two runs based on this system, MarsI and MarsII, were submitted to the evaluation. For each subtask, MarsI uses the feature set that performed best for that subtask; MarsII uses a single feature set, the one with the highest average test result over all subtasks. Table 3 lists the parameters of the two runs.

Table 3: The parameters of the two submitted runs.
Run     Subtask  Number of domain terms               Window size  Added stop words
MarsI   A        2000                                 4            no
        E        500                                  2            no
        G        2000                                 4            no
        T        25000 plus the words with score -1   0            yes
MarsII  all      500                                  2            no

3.5 Evaluation results

Table 4: The evaluation scores of the systems (P: precision, R: recall, NU: normalized utility).

In 2005, 19 organizations took part in the evaluation of this task, including three IBM research laboratories, UIUC, the University of Wisconsin, California State University San Marcos, Queens University, Tsinghua University, Fudan University, Dalian University of Technology, the Chinese University of Hong Kong, and National Taiwan University. Our system ranked first on subtask E, first on G, third on T, and fifth on A. Table 5 places our best scores among all evaluation scores.

Table 5: Fudan WIM's best results and their position among all TREC 2005 Genomics runs.
Subtask  Highest  Median  Lowest   Our best  Runs evaluated
A        0.8710   0.7785  0.2009   0.8439    48
E        0.8711   0.6548  -0.0074  0.8711    46
G        0.5870   0.4575  -0.0342  0.5870    47
T        0.9433   0.7610  0.0413   0.9154    51

4 Conclusions

The experiments show that a small number of domain-specific terms and a small window work best; 500 terms and a window size of 2 gave good classification results. Adding the lowest-scoring words of the domain-specific term list to the stop-word list did not help, which indicates that this term extraction method is not suitable for selecting stop words. Judging from the TREC evaluation results, the feature selection method based on domain-specific term extraction by corpus comparison is well suited to domain-specific text classification: it distils domain-relevant features, overcomes the shortcomings of feature selection methods that depend on ontological dictionaries, and can be applied to different domains with little effort. In future work we plan to study (1) combining the method with ontological dictionaries, (2) constructing ontological dictionaries automatically, and (3) mining the relations between domain-specific terms.
References:
[1] Ron Kohavi, George H. John. Wrappers for feature subset selection. Artificial Intelligence, 1997, 97(1-2): 273-324.
[2] Avrim L. Blum, Pat Langley. Selection of relevant features and examples in machine learning. In: AAAI Fall Symposium on Relevance, 1994, 140-144.
[3] Yang Yi-ming, Jan O. Pedersen. A comparative study on feature selection in text categorization. In: Proceedings of the 14th International Conference on Machine Learning, 1997, 412-420.
[4] Lewis D D, Ringuette M. A comparison of two learning algorithms for text categorization. In: Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, 1994.
[5] Wiener E, Pedersen J O, Weigend A S. A neural network approach to topic spotting. In: Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval, 1995, 317-332.
[6] Schutze H, Hull D A, Pedersen J O. A comparison of classifiers and document representations for the routing problem. In: 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995, 229-237.
[7] Penas A, Verdejo F, Gonzalo J, et al. Corpus-based terminology extraction applied to information access. In: Proceedings of Corpus Linguistics, 2001.
[8] David. Using generic corpora to learn domain-specific terminology. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
[9] Teresa Mihwa Chung. A corpus comparison approach for terminology extraction. Terminology, 2003, (9): 221-246.
[10] Patrick Drouin. Detection of domain specific terminology using corpora comparison. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal, 2004.
[11] Joachims T. Making large-scale SVM learning practical. In: Advances in Kernel Methods - Support Vector Learning, B. Scholkopf, C. Burges and A. Smola (eds.). MIT Press, 1999.
[12] William Hersh. TREC 2005 Genomics Track overview. In: 14th Text Retrieval Conference, 2005. To appear.
[13] Niu Jun-yu, Sun Lin, et al. WIM at TREC 2005. In: 14th Text Retrieval Conference, 2005. To appear.
[14] William Hersh. TREC 2004 Genomics Track overview. In: 13th Text Retrieval Conference, 2004.
[15] Fujita S. Revisiting again document length hypotheses: TREC 2004 Genomics Track experiments at Patolis. In: 13th Text Retrieval Conference, 2004.
Feature Point Selection Method

Section 1: Introduction

The feature point selection method is a technique commonly used in image processing and computer vision to identify distinct points that represent important features in an image. These points are crucial for applications such as image recognition, object detection, and image registration. The main objective of the method is to accurately detect and extract these feature points, which can then be used for further processing tasks.

Section 2: Feature Detection

Feature detection is the first step of the feature point selection method. It involves identifying points in an image that have distinct properties or characteristics. This can be achieved with algorithms such as SIFT (Scale-Invariant Feature Transform), SURF (Speeded Up Robust Features), or ORB (Oriented FAST and Rotated BRIEF), which analyse image intensities and gradients to detect points with high contrast and uniqueness.
A very brief guide to using MXMMichail Tsagris,Vincenzo Lagani,Ioannis Tsamardinos1IntroductionMXM is an R package which contains functions for feature selection,cross-validation and Bayesian Networks.The main functionalities focus on feature selection for different types of data.We highlight the option for parallel computing and the fact that some of the functions have been either partially or fully implemented in C++.As for the other ones,we always try to make them faster.2Feature selection related functionsMXM offers many feature selection algorithms,namely MMPC,SES,MMMB,FBED,forward and backward regression.The target set of variables to be selected,ideally what we want to discover, is called Markov Blanket and it consists of the parents,children and parents of children(spouses) of the variable of interest assuming a Bayesian Network for all variables.MMPC stands for Max-Min Parents and Children.The idea is to use the Max-Min heuristic when choosing variables to put in the selected variables set and proceed in this way.Parents and Children comes from the fact that the algorithm will identify the parents and children of the variable of interest assuming a Bayesian Network.What it will not recover is the spouses of the children of the variable of interest.For more information the reader is addressed to[23].MMMB(Max-Min Markov Blanket)extends the MMPC to discovering the spouses of the variable of interest[19].SES(Statistically Equivalent Signatures)on the other hand extends MMPC to discovering statistically equivalent sets of the selected variables[18,9].Forward and Backward selection are the two classical procedures.The functionalities or the flexibility offered by all these algorithms is their ability to handle many types of dependent variables,such as continuous,survival,categorical(ordinal,nominal, binary),longitudinal.Let us now see all of them one by one.The relevant functions are1.MMPC and SES.SES uses MMPC to return multiple statistically equivalent sets of vari-ables.MMPC returns only one set of variables.In all cases,the log-likelihood ratio test is used to assess the significance of a variable.These algorithms accept categorical only, continuous only or mixed data in the predictor variables side.2.wald.mmpc and wald.ses.SES uses MMPC using the Wald test.These two algorithmsaccept continuous predictor variables only.3.perm.mmpc and perm.ses.SES uses MMPC where the p-value is obtained using per-mutations.Similarly to the Wald versions,these two algorithms accept continuous predictor variables only.4.ma.mmpc and ma.ses.MMPC and SES for multiple datasets measuring the same variables(dependent and predictors).5.MMPC.temporal and SES.temporal.Both of these algorithms are the usual SES andMMPC modified for correlated data,such as clustered or longitudinal.The predictor vari-ables can only be continuous.6.fbed.reg.The FBED feature selection method[2].The log-likelihood ratio test or the eBIC(BIC is a special case)can be used.7.fbed.glmm.reg.FBED with generalised linear mixed models for repeated measures orclustered data.8.fbed.ge.reg.FBED with GEE for repeated measures or clustered data.9.ebic.bsreg.Backward selection method using the eBIC.10.fs.reg.Forward regression method for all types of predictor variables and for most of theavailable tests below.11.glm.fsreg Forward regression method for logistic and Poisson regression in specific.Theuser can call this directly if he knows his data.12.lm.fsreg.Forward regression method for normal linear regression.The user can call thisdirectly if 
he knows his data.13.bic.fsreg.Forward regression using BIC only to add a new variable.No statistical test isperformed.14.bic.glm.fsreg.The same as before but for linear,logistic and Poisson regression(GLMs).15.bs.reg.Backward regression method for all types of predictor variables and for most of theavailable tests below.16.glm.bsreg.Backward regression method for linear,logistic and Poisson regression(GLMs).17.iamb.The IAMB algorithm[20]which stands for Incremental Association Markov Blanket.The algorithm performs a forward regression at first,followed by a backward regression offering two options.Either the usual backward regression is performed or a faster variation, but perhaps less correct variation.In the usual backward regression,at every step the least significant variable is removed.In the IAMB original version all non significant variables are removed at every step.18.mmmb.This algorithm works for continuous or categorical data only.After applying theMMPC algorithm one can go to the selected variables and perform MMPC on each of them.A list with the available options for this argument is given below.Make sure you include the test name within””when you supply it.Most of these tests come in their Wald and perm (permutation based)versions.In their Wald or perm versions,they may have slightly different acronyms,for example waldBinary or WaldOrdinal denote the logistic and ordinal regression respectively.1.testIndFisher.This is a standard test of independence when both the target and the setof predictor variables are continuous(continuous-continuous).2.testIndSpearman.This is a non-parametric alternative to testIndFisher test[6].3.testIndReg.In the case of target-predictors being continuous-mixed or continuous-categorical,the suggested test is via the standard linear regression.If the robust option is selected,M estimators[11]are used.If the target variable consists of proportions or percentages(within the(0,1)interval),the logit transformation is applied beforehand.4.testIndRQ.Another robust alternative to testIndReg for the case of continuous-mixed(or continuous-continuous)variables is the testIndRQ.If the target variable consists of proportions or percentages(within the(0,1)interval),the logit transformation is applied beforehand.5.testIndBeta.When the target is proportion(or percentage,i.e.,between0and1,notinclusive)the user can fit a regression model assuming a beta distribution[5].The predictor variables can be either continuous,categorical or mixed.6.testIndPois.When the target is discrete,and in specific count data,the default test isvia the Poisson regression.The predictor variables can be either continuous,categorical or mixed.7.testIndNB.As an alternative to the Poisson regression,we have included the Negativebinomial regression to capture cases of overdispersion[8].The predictor variables can be either continuous,categorical or mixed.8.testIndZIP.When the number of zeros is more than expected under a Poisson model,thezero inflated poisson regression is to be employed[10].The predictor variables can be either continuous,categorical or mixed.9.testIndLogistic.When the target is categorical with only two outcomes,success or failurefor example,then a binary logistic regression is to be used.Whether regression or classifi-cation is the task of interest,this method is applicable.The advantage of this over a linear or quadratic discriminant analysis is that it allows for categorical predictor variables as well and for mixed types of predictors.10.testIndMultinom.If the target has 
more than two outcomes,but it is of nominal type(political party,nationality,preferred basketball team),there is no ordering of the outcomes,multinomial logistic regression will be employed.Again,this regression is suitable for clas-sification purposes as well and it to allows for categorical predictor variables.The predictor variables can be either continuous,categorical or mixed.11.testIndOrdinal.This is a special case of multinomial regression,in which case the outcomeshave an ordering,such as not satisfied,neutral,satisfied.The appropriate method is ordinal logistic regression.The predictor variables can be either continuous,categorical or mixed.12.testIndTobit(Tobit regression for left censored data).Suppose you have measurements forwhich values below some value were not recorded.These are left censored values and by using a normal distribution we can by pass this difficulty.The predictor variables can be either continuous,categorical or mixed.13.testIndBinom.When the target variable is a matrix of two columns,where the first one isthe number of successes and the second one is the number of trials,binomial regression is to be used.The predictor variables can be either continuous,categorical or mixed.14.gSquare.If all variables,both the target and predictors are categorical the default test isthe G2test of independence.An alternative to the gSquare test is the testIndLogistic.With the latter,depending on the nature of the target,binary,un-ordered multinomial or ordered multinomial the appropriate regression model is fitted.The predictor variables can be either continuous,categorical or mixed.15.censIndCR.For the case of time-to-event data,a Cox regression model[4]is employed.Thepredictor variables can be either continuous,categorical or mixed.16.censIndWR.A second model for the case of time-to-event data,a Weibull regression modelis employed[14,13].Unlike the semi-parametric Cox model,the Weibull model is fully parametric.The predictor variables can be either continuous,categorical or mixed.17.censIndER.A third model for the case of time-to-event data,an exponential regressionmodel is employed.The predictor variables can be either continuous,categorical or mixed.This is a special case of the Weibull model.18.testIndIGreg.When you have non negative data,i.e.the target variable takes positivevalues(including0),a suggested regression is based on the the inverse Gaussian distribution.The link function is not the inverse of the square root as expected,but the logarithm.This is to ensure that the fitted values will be always be non negative.An alternative model is the Weibull regression(censIndWR).The predictor variables can be either continuous, categorical or mixed.19.testIndGamma(Gamma regression).Gamma distribution is designed for strictly positivedata(greater than zero).It is used in reliability analysis,as an alternative to the Weibull regression.This test however does not accept censored data,just the usual numeric data.The predictor variables can be either continuous,categorical or mixed.20.testIndNormLog(Gaussian regression with a log link).Gaussian regression using the loglink(instead of the identity)allows non negative data to be handled naturally.Unlike the gamma or the inverse gaussian regression zeros are allowed.The predictor variables can be either continuous,categorical or mixed.21.testIndClogit.When the data come from a case-control study,the suitable test is via con-ditional logistic regression[7].The predictor variables can be either continuous,categorical or 
mixed.22.testIndMVReg.In the case of multivariate continuous target,the suggested test is viaa multivariate linear regression.The target variable can be compositional data as well[1].These are positive data,whose vectors sum to1.They can sum to any constant,as long as it the same,but for convenience reasons we assume that they are normalised to sum to1.In this case the additive log-ratio transformation(multivariate logit transformation)is applied beforehand.The predictor variables can be either continuous,categorical or mixed.23.testIndGLMMReg.In the case of a longitudinal or clustered target(continuous,propor-tions within0and1(not inclusive)),the suggested test is via a(generalised)linear mixed model[12].The predictor variables can only be continuous.This test is only applicable in SES.temporal and MMPC.temporal.24.testIndGLMMPois.In the case of a longitudinal or clustered target(counts),the suggestedtest is via a(generalised)linear mixed model[12].The predictor variables can only be continuous.This test is only applicable in SES.temporal and MMPC.temporal.25.testIndGLMMLogistic.In the case of a longitudinal or clustered target(binary),thesuggested test is via a(generalised)linear mixed model[12].The predictor variables can only be continuous.This test is only applicable in SES.temporal and MMPC.temporal.To avoid any mistakes or wrongly selected test by the algorithms you are advised to select the test you want to use.All of these tests can be used with SES and MMPC,forward and backward regression methods.MMMB accepts only testIndFisher,testIndSpearman and gSquare.The reason for this is that MMMB was designed for variables(dependent and predictors)of the same type.For more info the user should see the help page of each function.2.1A more detailed look at some arguments of the feature selection algorithmsSES,MMPC,MMMB,forward and backward regression offer the option for robust tests(the argument robust).This is currently supported for the case of Pearson correlation coefficient and linear regression at the moment.We plan to extend this option to binary logistic and Poisson regression as well.These algorithms have an argument user test.In the case that the user wants to use his own test,for example,mytest,he can supply it in this argument as is,without””. 
For all previously mentioned regression based conditional independence tests,the argument works as test=”testIndFisher”.In the case of the user test it works as user test=mytest.The max kargument must always be at least1for SES,MMPC and MMMB,otherwise it is a simple filtering of the variables.The argument ncores offers the option for parallel implementation of the first step of the algorithms.The filtering step,where the significance of each predictor is assessed.If you have a few thousands of variables,maybe this option will do no significant improvement.But, if you have more and a”difficult”regression test,such as quantile regression(testIndRQ),then with4cores this could reduce the computational time of the first step up to nearly50%.For the Poisson,logistic and normal linear regression we have included C++codes to speed up this process,without the use of parallel.The FBED(Forward Backward Early Dropping)is a variant of the Forward selection is per-formed in the first phase followed by the usual backward regression.In some,the variation is that every non significant variable is dropped until no mre significant variables are found or there is no variable left.The forward and backward regression methods have a few different arguments.For example stopping which can be either”BIC”or”adjrsq”,with the latter being used only in the linear regression case.Every time a variable is significant it is added in the selected variables set.But, it may be the case,that it is actually not necessary and for this reason we also calculate the BIC of the relevant model at each step.If the difference BIC is less than the tol(argument)threshold value the variable does not enter the set and the algorithm stops.The forward and backward regression methods can proceed via the BIC as well.At every step of the algorithm,the BIC of the relevant model is calculated and if the BIC of the model including a candidate variable is reduced by more that the tol(argument)threshold value that variable is added.Otherwise the variable is not included and the algorithm stops.2.2Other relevant functionsOnce SES or MMPC are finished,the user might want to see the model produced.For this reason the functions ses.model and mmpc.model can be used.If the user wants to get some summarised results with MMPC for many combinations of max k and treshold values he can use the mmpc.path function.Ridge regression(ridge.reg and ridge.cv)have been implemented. 
Note that ridge regression is currently offered only for linear regression with continuous predictor variables.As for some miscellaneous,we have implemented the zero inflated Poisson and beta regression models,should the user want to use them.2.3Cross-validationcv.ses and cv.mmpc perform a K-fold cross validation for most of the aforementioned regression models.There are many metric functions to be used,appropriate for each case.The folds can be generated in a stratified fashion when the dependent variable is categorical.3NetworksCurrently three algorithms for constructing Bayesian Networks(or their skeleton)are offered,plus modifications.MMHC(Max-Min Hill-Climbing)[23],(mmhc.skel)which constructs the skeleton of the Bayesian Network(BN).This has the option of running SES[18]instead.MMHC(Max-Min Hill-Climbing)[23],(local.mmhc.skel)which constructs the skeleton around a selected node.It identifies the Parents and Children of that node and then finds their Parents and Children.MMPC followed by the PC rules.This is the command mmpc.or.PC algorithm[15](pc.skel for which the orientation rules(pc.or)have been implemented as well.Both of these algorithms accept continuous only,categorical data only or a mix of continuous,multinomial and ordinal.The skeleton of the PC algorithm has the option for permutation based conditional independence tests[21].The functions ci.mm and ci.fast perform a symmetric test with mixed data(continuous, ordinal and binary data)[17].This is employed by the PC algorithm as well.Bootstrap of the PC algorithm to estimate the confidence of the edges(pc.skel.boot).PC skeleton with repeated measures(glmm.pc.skel).This uses the symetric test proposed by[17]with generalised linear models.Skeleton of a network with continuous data using forward selection.The command work does a similar to MMHC task.It goes to every variable and instead applying the MMPC algorithm it applies the forward selection regression.All data must be continuous,since the Pearson correlation is used.The algorithm is fast,since the forward regression with the Pearson correlation is very fast.We also have utility functions,such as1.rdag and rdag2.Data simulation assuming a BN[3].2.findDescendants and findAncestors.Descendants and ancestors of a node(variable)ina given Bayesian Network.3.dag2eg.Transforming a DAG into an essential(mixed)graph,its class of equivalent DAGs.4.equivdags.Checking whether two DAGs are equivalent.5.is.dag.In fact this checks whether cycles are present by trying to topologically sort theedges.BNs do not allow for cycles.6.mb.The Markov Blanket of a node(variable)given a Bayesian Network.7.nei.The neighbours of a node(variable)given an undirected graph.8.undir.path.All paths between two nodes in an undirected graph.9.transitiveClosure.The transitive closure of an adjacency matrix,with and without arrow-heads.10.bn.skel.utils.Estimation of false discovery rate[22],plus AUC and ROC curves based onthe p-values.11.bn.skel.utils2.Estimation of the confidence of the edges[16],plus AUC and ROC curvesbased on the confidences.12.plotnetwork.Interactive plot of a graph.4AcknowledgmentsThe research leading to these results has received funding from the European Research Coun-cil under the European Union’s Seventh Framework Programme(FP/2007-2013)/ERC Grant Agreement n.617393.References[1]John Aitchison.The statistical analysis of compositional data.Chapman and Hall London,1986.[2]Giorgos Borboudakis and Ioannis Tsamardinos.Forward-Backward Selection with Early Drop-ping,2017.[3]Diego Colombo and 
Marloes H Maathuis.Order-independent constraint-based causal structurelearning.Journal of Machine Learning Research,15(1):3741–3782,2014.[4]David Henry Cox.Regression Models and Life-Tables.Journal of the Royal Statistical Society,34(2):187–220,1972.[5]Silvia Ferrari and Francisco Cribari-Neto.Beta regression for modelling rates and proportions.Journal of Applied Statistics,31(7):799–815,2004.[6]Edgar C Fieller and Egon S Pearson.Tests for rank correlation coefficients:II.Biometrika,48:29–40,1961.[7]Mitchell H Gail,Jay H Lubin,and Lawrence V Rubinstein.Likelihood calculations for matchedcase-control studies and survival studies with tied death times.Biometrika,68(3):703–707, 1981.[8]Joseph M Hilbe.Negative binomial regression.Cambridge University Press,2011.[9]Vincenzo Lagani,Giorgos Athineou,Alessio Farcomeni,Michail Tsagris,and IoannisTsamardinos.Feature Selection with the R Package MXM:Discovering Statistically-Equivalent Feature Subsets.Journal of Statistical Software,80(7),2017.[10]Diane Lambert.Zero-inflated Poisson regression,with an application to defects in manufac-turing.Technometrics,34(1):1–14,1992.[11]RARD Maronna,Douglas Martin,and Victor Yohai.Robust statistics.John Wiley&Sons,Chichester.ISBN,2006.[12]Jose Pinheiro and Douglas Bates.Mixed-effects models in S and S-PLUS.Springer Science&Business Media,2006.[13]FW Scholz.Maximum likelihood estimation for type I censored Weibull data including co-variates,1996.[14]Richard L Smith.Weibull regression models for reliability data.Reliability Engineering&System Safety,34(1):55–76,1991.[15]Peter Spirtes,Clark Glymour,and Richard Scheines.Causation,Prediction,and Search.TheMIT Press,second edi edition,12001.[16]Sofia Triantafillou,Ioannis Tsamardinos,and Anna Roumpelaki.Learning neighborhoods ofhigh confidence in constraint-based causal discovery.In European Workshop on Probabilistic Graphical Models,pages487–502.Springer,2014.[17]Michail Tsagris,Giorgos Borboudakis,Vincenzo Lagani,and Ioannis Tsamardinos.Constraint-based Causal Discovery with Mixed Data.In The2017ACM SIGKDD Work-shop on Causal Discovery,14/8/2017,Halifax,Nova Scotia,Canada,2017.[18]I.Tsamardinos,gani,and D.Pappas.Discovering multiple,equivalent biomarker sig-natures.In In Proceedings of the7th conference of the Hellenic Society for Computational Biology&Bioinformatics,Heraklion,Crete,Greece,2012.[19]Ioannis Tsamardinos,Constantin F Aliferis,and Alexander Statnikov.Time and sampleefficient discovery of Markov blankets and direct causal relations.In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining,pages673–678.ACM,2003.[20]Ioannis Tsamardinos,Constantin F Aliferis,Alexander R Statnikov,and Er Statnikov.Al-gorithms for Large Scale Markov Blanket Discovery.In FLAIRS conference,volume2,pages 376–380,2003.[21]Ioannis Tsamardinos and Giorgos Borboudakis.Permutation testing improves Bayesian net-work learning.In ECML PKDD’10Proceedings of the2010European conference on Machine learning and knowledge discovery in databases,pages322–337.Springer-Verlag,2010.[22]Ioannis Tsamardinos and Laura E Brown.Bounding the False Discovery Rate in LocalBayesian Network Learning.In AAAI,pages1100–1105,2008.[23]Ioannis Tsamardinos,Laura E.Brown,and Constantin F.Aliferis.The Max-Min Hill-ClimbingBayesian Network Structure Learning Algorithm.Machine Learning,65(1):31–78,2006.。
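To make the MXM guide above concrete, here is a minimal usage sketch of MMPC and SES on toy continuous data. The argument names (max_k, threshold, test) follow the text, while the slot names (@selectedVars, @signatures) are my recollection of the package's S4 output; treat all of them as assumptions and check the MXM help pages before relying on them.

## Minimal MMPC/SES sketch on toy continuous data (argument and slot names assumed).
library(MXM)

set.seed(1)
x <- matrix(rnorm(200 * 30), 200, 30)          # 30 candidate predictors
y <- 2 * x[, 1] - x[, 5] + rnorm(200)          # continuous target

m1 <- MMPC(target = y, dataset = x, max_k = 3, threshold = 0.05,
           test = "testIndFisher")
m1@selectedVars                                 # single selected variable set

m2 <- SES(target = y, dataset = x, max_k = 3, threshold = 0.05,
          test = "testIndFisher")
m2@signatures                                   # statistically equivalent sets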
Feature selection

What is feature selection? Why do we need it? What are the common feature selection methods? How do we choose the most suitable one? What challenges does feature selection face in practice? This article answers these questions step by step.

Step 1: What is feature selection? In machine learning and data mining, feature selection is an important data preprocessing step. It means selecting the most valuable subset of features from the original feature set for building an effective machine learning model. In feature selection we try to reduce dimensionality by removing redundant features and keeping the features related to the target variable, while preserving the most useful information. The main benefits are lower model complexity, better generalization, and less overfitting. By selecting the most relevant features we can also understand and explain the data better; in addition, feature selection improves model interpretability and reduces the cost and time of data collection and processing.

Step 2: Why do we need feature selection? In practice we often face a large number of feature variables, many of which are redundant or irrelevant. Redundant and irrelevant features increase model complexity and degrade model performance. A large number of features can also lead to the curse of dimensionality and make training too slow. Feature selection therefore helps us extract the most relevant and useful features, improving both predictive performance and efficiency.

Step 3: What are the common feature selection methods? Feature selection methods fall into three broad families: filter methods, wrapper methods, and embedded methods.

1. Filter methods evaluate and rank features using statistics or correlation measures such as correlation coefficients, the chi-squared test, or information gain; these methods score each feature's relationship with the target by a numerical calculation. A small correlation-filter sketch follows this list.
2. Wrapper methods treat feature selection as a search problem: candidate feature subsets are evaluated by running a machine learning model on them, and feature importance is judged from the model's performance. Typical wrapper methods are recursive feature elimination (RFE) and genetic-algorithm feature selection; wrappers evaluate feature contributions more accurately but are computationally expensive.
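A minimal sketch of a filter method in base R: rank the candidate features by their absolute Pearson correlation with the target and keep the top k. The synthetic data, the choice of Pearson correlation as the criterion, and k = 5 are all arbitrary illustrations.

## Simple correlation filter: score each feature, keep the top k.
set.seed(42)
n <- 300; p <- 20
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
y <- 3 * X[, "V1"] - 2 * X[, "V4"] + rnorm(n)

scores <- abs(cor(X, y))[, 1]             # filter criterion: |Pearson r| with the target
ranked <- sort(scores, decreasing = TRUE)
keep_k <- names(ranked)[1:5]              # keep the 5 highest-scoring features
keep_k
X_selected <- X[, keep_k]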
A survey of common feature selection algorithms

1. What is feature selection? Feature selection, also called feature subset selection (FSS), means choosing N features out of an existing set of M features so that some criterion of the system is optimized. Feature selection must be distinguished from feature extraction: feature extraction computes a more abstract feature set from the existing features, or refers to the algorithm that computes a particular feature. Both feature extraction and feature selection reduce the dimensionality of the feature set.

An evaluation function (objective function) measures how good a feature subset is. We write it J(Y), where Y is a feature set; the larger J(Y), the better the set Y. By the way they work, evaluation functions fall into two classes, the so-called filters and wrappers:

- Filter: measures the quality of a feature subset by analysing information internal to the subset, for example the degree of mutual dependence between features. A filter is essentially an unsupervised procedure.
- Wrapper: the evaluation function is a classifier; the samples are classified using the candidate feature subset, and the classification result measures the quality of that subset. A wrapper is essentially a supervised procedure.

2. Why perform feature selection? Computing some features can be expensive, so smaller feature sets are preferred. Features can also be correlated: if feature A depends entirely on feature B and B is already in the feature set, is it still necessary to include A? Arguably not. Moreover, the larger the feature set, the more complex the classifier, and the consequence is reduced generalization capability. Choosing a smaller feature set lowers complexity and may improve generalization. Less is more!

3. Classes of feature selection algorithms. Solving the feature subset selection problem exactly is an exponential problem. Common feature selection algorithms fall into three classes. The first class is exponential algorithms: these search the feature space exhaustively (usually with optimizations such as pruning), and the feature set found is optimal for the sample set. A brute-force sketch of this idea follows.
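A tiny illustration of exhaustive (exponential) subset search in base R: every subset of up to 3 features is evaluated with a wrapper-style criterion, here the adjusted R-squared of a linear model. The data, the subset-size cap, and the criterion are all illustrative choices.

## Exhaustive search over small feature subsets, scored by adjusted R-squared.
set.seed(7)
n <- 150; p <- 8
X <- data.frame(matrix(rnorm(n * p), n, p))
y <- 1.5 * X$X1 - X$X3 + rnorm(n)

J <- function(vars) {                      # evaluation function J(Y): adjusted R^2
  fit <- lm(y ~ ., data = cbind(X[, vars, drop = FALSE], y = y))
  summary(fit)$adj.r.squared
}

subsets <- unlist(lapply(1:3, function(k) combn(names(X), k, simplify = FALSE)),
                  recursive = FALSE)
scores  <- sapply(subsets, J)
subsets[[which.max(scores)]]               # best subset found by brute force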
Feature selection with RFE

What is feature selection? Feature selection means choosing, from the raw data, the features with the most predictive power, in order to improve the performance and efficiency of machine learning algorithms. On large datasets, feature selection reduces the number of features, simplifies the problem, and improves model interpretability.

Why is feature selection needed? Real-world datasets often contain many redundant and irrelevant features. These useless features increase computational complexity, reduce the interpretability of the model, and can cause overfitting. Feature selection addresses these problems by picking out the most informative and most discriminative features.

What feature selection methods exist? They can be grouped into three families: filter methods, wrapper methods, and embedded methods.

1. Filter methods evaluate and rank each feature by its statistical properties or its relationship with the target, for example with correlation coefficients, the chi-squared test, or information gain.
2. Wrapper methods run a machine learning algorithm on candidate feature subsets and judge feature importance from model performance. Typical wrapper methods are recursive feature elimination (RFE) and genetic algorithms; wrappers evaluate feature contributions more accurately but are computationally expensive.
3. Embedded methods select features automatically during model training, by introducing regularization terms or constraints that penalize features. Common embedded methods are L1 regularization and decision tree pruning.

How is RFE used for feature selection? Recursive feature elimination (RFE) is a wrapper method, often used with linear models and support vector machines. It is a recursive procedure that repeatedly removes the least important feature until a target number of features is reached or model performance is optimal. The steps of RFE are:

1. Choose a base model: first pick a machine learning model as the base estimator, typically a linear model or a support vector machine.
2. Fit the model: fit the base model to the data and compute the importance of each feature (a minimal sketch of the full elimination loop follows).
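A compact sketch of the recursive elimination loop, using random-forest importance as the ranking criterion. The base model, the importance measure, and the stopping point of 3 remaining features are illustrative choices, not part of the method's definition.

## Hand-rolled RFE: repeatedly drop the least important feature.
library(randomForest)

set.seed(3)
n <- 200; p <- 12
X <- data.frame(matrix(rnorm(n * p), n, p))
y <- factor(ifelse(X$X2 + X$X7 - X$X9 > 0, "pos", "neg"))

remaining <- names(X)
while (length(remaining) > 3) {                      # stop at 3 features
  rf  <- randomForest(X[, remaining, drop = FALSE], y, importance = TRUE)
  imp <- importance(rf, type = 1)[, 1]               # mean decrease in accuracy
  remaining <- setdiff(remaining, names(which.min(imp)))  # drop the weakest feature
}
remaining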
Feature selection methods

I. Introduction. Feature selection means picking out the relevant, effective features from the original feature set and discarding the unimportant ones, in order to improve the training effectiveness and prediction accuracy of a model. In machine learning and data mining, feature selection plays a crucial role: it helps us avoid the curse of dimensionality, improves the interpretability and generalization ability of models, speeds up training and prediction, and lowers the risk of overfitting. This article introduces several commonly used feature selection methods and compares them, so that readers can better understand and apply feature selection techniques.

II. Filter methods. Filter methods are simple and efficient: they evaluate and rank the features, then screen them by some rule to obtain the best feature subset. Common filter methods include variance-based selection, correlation-based selection, and mutual-information-based selection.

1. Variance-based selection computes the variance of each feature as a measure of how much it varies, and keeps the features with the largest variance. It is suitable for continuous features but not for categorical ones.
2. Correlation-based selection computes the correlation between each feature and the target variable as a measure of importance, and keeps the features with the largest correlation. It suits models with strong linear relationships, but not nonlinear ones.
3. Mutual-information-based selection computes the mutual information between each feature and the target as a measure of importance, and keeps the features with the largest mutual information. It suits nonlinear models, but is less appropriate when the relationships are strongly linear.

III. Wrapper methods. Wrapper methods are greedier: they search for the best feature subset by trial and error. Commonly used algorithms are recursive feature elimination and iterative feature selection.

1. Recursive feature elimination finds the best subset by repeatedly training the model and deleting features: train the model, rank the features by importance, remove the least important feature, retrain, and repeat until the desired number of features remains.
2. Iterative feature selection finds the best subset by repeatedly adding and removing features: train the model, rank the features by importance, add the most important feature, retrain, and repeat until the desired number of features is reached.

IV. Embedded methods. Embedded methods are more involved: they select features automatically during model training, which improves the performance of the whole model. Common embedded methods include L1 regularization, decision-tree feature importance, and neural-network feature selection. A short decision-tree importance example follows.
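A brief example of the embedded idea, using decision-tree feature importance from the rpart package. The synthetic data and the 10% importance threshold are illustrative; L1 regularization (for example via glmnet) would be the other common route mentioned above.

## Embedded selection via decision-tree variable importance.
library(rpart)

set.seed(11)
n <- 400; p <- 10
X <- data.frame(matrix(rnorm(n * p), n, p))
y <- factor(X$X3 + 2 * X$X6 + rnorm(n, sd = 0.5) > 0)

fit <- rpart(y ~ ., data = cbind(X, y = y))
imp <- fit$variable.importance                 # importance accumulated while fitting
imp
selected <- names(imp)[imp > 0.1 * max(imp)]   # keep features above 10% of the top score
selected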
Feature selection in threes

Summary:
1. Why feature selection matters
2. Feature selection methods
3. A rule-of-three selection strategy
4. Practical examples
5. Conclusion

Feature selection is an important step in machine learning and data mining. Choosing suitable features improves model performance and efficiency while reducing the risk of overfitting. This article describes a simple selection strategy, "in threes", and discusses how it performs in practice.

The "in threes" strategy is a rule-of-three selection heuristic. Concretely, it consists of the following steps:

1. From the dataset, pick the three features most correlated with the target. Correlation can be measured with the Pearson correlation coefficient, the Spearman rank correlation, or another similarity measure.
2. Pick the three features with the largest variance. Variance is a proxy for importance: features with larger variance may contribute more to the model.
3. Combine the results of the two steps and keep the three features that are both correlated with the target and important.
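A base-R sketch of this heuristic on synthetic data. The way the two rankings are combined (intersect first, then fill up by correlation rank) is one possible reading of step 3, not a fixed rule.

## "In threes": top 3 by |correlation with y|, top 3 by variance, then combine.
set.seed(5)
n <- 250; p <- 15
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
X[, "V2"] <- X[, "V2"] * 3                       # give one feature a larger variance
y <- X[, "V1"] + 0.8 * X[, "V2"] - X[, "V6"] + rnorm(n)

top_cor <- names(sort(abs(cor(X, y))[, 1], decreasing = TRUE))[1:3]
top_var <- names(sort(apply(X, 2, var),      decreasing = TRUE))[1:3]

chosen <- intersect(top_cor, top_var)            # features strong on both criteria
chosen <- unique(c(chosen, top_cor))[1:3]        # fill up to three by correlation rank
chosen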
The advantage of this strategy is that it is simple, easy to understand, and easy to apply. However, it may overlook some useful features, so it should be adjusted to the problem and the data at hand. In practice the "in threes" strategy often works well. For example, in an image classification task we can use it to pick colour features that are highly correlated with the class; in a text classification task we can pick keywords or phrases highly correlated with the class. In short, feature selection is an important step in machine learning and data mining, and the "in threes" strategy is a quick way to obtain a small, representative set of features and thereby improve model performance and efficiency.
Criteria for feature selection

Feature selection is a very important step in machine learning; its purpose is to pick the most representative features out of the raw data for model building and prediction. During feature selection, several criteria are used to evaluate and choose features, chiefly the following.

1. Relevance. Relevance is the strength of the relationship between a feature and the target variable. Feature selection should keep the features highly related to the target and include them in the model. Relevance can be assessed with the Pearson correlation coefficient or the Spearman rank correlation coefficient.

2. Variance. Variance measures how spread out the data are. Feature selection should keep the features with larger variance; computing the variance of each feature shows how much the features differ.

3. Mutual information. Mutual information measures the mutual dependence between two random variables. Feature selection should keep the features with high mutual information with the target; computing mutual information assesses the dependence between each feature and the target.

4. Bias. Bias refers to how well the model fits the data. Feature selection should keep the features that give the model a small bias with respect to the target; the fit can be assessed with the model's mean squared error or mean absolute error.

5. Multicollinearity. Multicollinearity means strong correlation among the features themselves. Feature selection should keep features that are highly correlated with the target but not collinear with the other features; whether multicollinearity exists can be checked from the correlation matrix of the features.

In short, all of these criteria influence the outcome of feature selection. In practice they should be weighed together according to the situation, combined with domain knowledge and experience, and adjusted to achieve the best selection result. The sketch below computes several of these diagnostics.
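A small base-R sketch that computes three of the criteria above on toy data: correlation with the target (criterion 1), per-feature variance (criterion 2), and a multicollinearity check via the feature correlation matrix (criterion 5). The 0.8 collinearity cut-off is an arbitrary illustration.

## Relevance, variance, and multicollinearity diagnostics for candidate features.
set.seed(9)
n <- 300
X <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
X$d <- X$a * 0.95 + rnorm(n, sd = 0.1)          # 'd' is nearly collinear with 'a'
y <- 2 * X$a - X$b + rnorm(n)

relevance <- sapply(X, function(f) cor(f, y))   # criterion 1: correlation with target
variances <- sapply(X, var)                     # criterion 2: spread of each feature

R <- cor(X)                                     # criterion 5: pairwise feature correlation
diag(R) <- 0
pairs <- which(abs(R) > 0.8, arr.ind = TRUE)
pairs <- pairs[pairs[, 1] < pairs[, 2], , drop = FALSE]

list(relevance = round(relevance, 2),
     variance  = round(variances, 2),
     collinear = apply(pairs, 1, function(i) paste(colnames(R)[i], collapse = " ~ ")))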
A multi-label feature selection algorithm based on neighborhood rough sets

As data volumes keep growing, multi-label classification problems are becoming common in many areas such as bioinformatics and social network analysis. When dealing with multi-label classification, feature selection is an important issue: almost all features in the data are correlated, so features must be selected to improve both the efficiency and the accuracy of classification. This article describes a multi-label feature selection algorithm based on neighborhood rough sets.

First, a word on neighborhood rough sets. A neighborhood rough set is a rough set built on granular relations; it takes both the similarity and the discernibility of samples into account. In a neighborhood rough set, the neighbors of a target object are the objects that are similar or close to it under some similarity measure; the more similar two objects are, the more neighbors they share. Neighborhood rough sets are therefore a similarity-based approach to feature selection.

The algorithm has three main steps.

Step 1: use the neighborhood rough set to compute the discriminative power of each feature. Concretely, treat every label as binary; for each feature, binarize it by marking samples whose value is at least the feature mean as "1" and the others as "0". Then, for each label, compute its mutual information with the other labels and take the sum as that label's neighbor density. Finally, take the label with the smallest neighbor density as the measure of that feature's discriminative power.

Step 2: use the discriminative power to determine the neighbor range of each feature. Sort all features by discriminative power in ascending order, and select the neighbor range according to neighbor density, namely the k labels with the largest neighbor density among that feature's neighbors.

Step 3: use the neighbor range to compute an evaluation score for each feature. For each feature, split the samples within its neighbor range into those that contain the feature value and those that do not, and use information gain to score the feature within its neighbor range. Finally, rank all features by this score and keep the top n features as the final feature set.

Experiments reported for the method show that it outperforms other multi-label feature selection algorithms on several datasets. Its advantages are that it accounts for the influence of neighbors and needs no parameters to be set in advance.
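The description above is high-level, so as a loose illustration of the kind of quantity involved in step 1, here is a base-R helper that binarizes a feature at its mean and computes its mutual information with each binary label. This only illustrates the ingredients (mean-binarization and mutual information); it is not an implementation of the paper's algorithm.

## Mutual information between a mean-binarized feature and each binary label.
mutual_info <- function(a, b) {               # MI of two discrete vectors, in nats
  p_ab <- table(a, b) / length(a)
  p_a  <- rowSums(p_ab); p_b <- colSums(p_ab)
  mi <- 0
  for (i in seq_along(p_a)) for (j in seq_along(p_b))
    if (p_ab[i, j] > 0) mi <- mi + p_ab[i, j] * log(p_ab[i, j] / (p_a[i] * p_b[j]))
  mi
}

set.seed(13)
n <- 500
feature <- rnorm(n)
labels  <- cbind(L1 = as.integer(feature + rnorm(n) > 0),   # label related to the feature
                 L2 = rbinom(n, 1, 0.5))                    # unrelated label
f_bin <- as.integer(feature >= mean(feature))               # binarize at the mean
apply(labels, 2, function(l) mutual_info(f_bin, l))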
Two Solution Methods for the Trace-Ratio Problem and Their Applications

Zhao Shouming (School of Mathematical Sciences, Ocean University of China, Qingdao 266100). Journal of Harbin University of Commerce (Natural Sciences Edition), 2019, 35(5): 626-629. Keywords: dimensionality reduction; linear discriminant analysis; trace ratio; initial point; successive solution; Newton's method.

Feature extraction is a very effective way to reduce the dimensionality of data, and a series of feature extraction methods have been proposed. Among them the Foley-Sammon transform (FST) [1], a discriminant method based on the Fisher criterion [2], is one of the most effective. In 1970 Sammon proposed the optimal discriminant plane based on Fisher's linear discriminant [3]; in 1975 Foley and Sammon extended it to an optimal set of discriminant vectors, obtained by applying the Foley-Sammon transform step by step. The method attracted wide attention in pattern recognition [4] and has been applied extensively to image classification and face recognition; solution methods have been proposed under various conditions [5-7], the most effective being that of Liu [8]. Building on the Foley-Sammon transform, Guo proposed the generalized Foley-Sammon transform together with a new algorithm for the trace-ratio problem [9], the bisection method, but it converges slowly and is not stable. Wang et al. proposed a Newton-type algorithm, ITR [10], which converges stably but can be slow in some settings [11], and every iteration requires computing the principal eigen-subspace of a large matrix.

1 The successive solution

An approximate solution of the trace-ratio problem can be constructed as follows. First solve

max_{q in R^n, ||q||_2 = 1}  (q'Aq) / (q'Bq),

whose solution is clearly the unit eigenvector q_1 of the pencil {A, B} associated with its largest eigenvalue lambda_1 [12,13]. Next extend q_1 to an orthogonal matrix [q_1, Q_1]: construct a Householder transformation H with H q_1 = e_1 (the first column of the identity); since H' = H, we have H = [q_1, Q_1], and in fact Q_1 never needs to be formed explicitly. Setting A^(1) = Q_1' A Q_1 and B^(1) = Q_1' B Q_1, solve

max_{beta in R^(n-1), ||beta||_2 = 1}  (beta' A^(1) beta) / (beta' B^(1) beta);

its solution beta^(1) is the unit eigenvector of {A^(1), B^(1)} for the largest eigenvalue, and we set q_2 = Q_1 beta^(1). Proceeding in this way for r steps yields q_1, ..., q_r, and writing P(k) for the value attained at step k we have

P(r) <= P(r-1) <= ... <= P(1) = lambda_1.   (1)

Take Z^(0) = [q_1, ..., q_r] as an approximate solution of the trace-ratio problem; since q_1, ..., q_r are obtained one after another, we call it the successive solution.

Proposition 1:  P(r) <= P(Z^(0)) <= P_max <= P(1) = lambda_1.   (2)

Consequently, if P(r) is close to P(1), inequality (2) shows that Z^(0) must be a good approximate solution, and even when P(r) deviates from P(1), Z^(0) may still be a good approximation.

2 Newton's method

The optimization problem

max_{X in R^(n x r), X'X = I_r}  tr(X'AX) / tr(X'BX)   (4)

is a fractional optimization problem. Define

f(lambda) = max_{X'X = I_r}  tr(X'(A - lambda B)X).   (5)

This function has two important properties. Lemma 1 [14]: the equation f(lambda) = 0 has a unique root lambda*, and it is a simple root. Lemma 2 [15]: lambda < lambda* if and only if f(lambda) > 0, and lambda > lambda* if and only if f(lambda) < 0.

To solve (4), let alpha_k be the current approximation; then
1) compute V_k, an orthonormal basis of the invariant subspace of A - alpha_k B associated with its r largest eigenvalues;
2) set alpha_{k+1} = tr(V_k' A V_k) / tr(V_k' B V_k).   (6)

This is the fixed-point iteration proposed by Wang et al., and it is in essence Newton's method applied to f(lambda) = 0. The algorithm is:

Algorithm 2.1
1) Take X^(0) in R^(n x r) with X^(0)'X^(0) = I_r and compute alpha_0 = tr(X^(0)'AX^(0)) / tr(X^(0)'BX^(0)).
2) Iterate: compute V_k and alpha_{k+1} as in (6), and set X^(k+1) = V_k.
3) If P_{k+1} - P_k <= epsilon, output the approximate solution X^(k+1), the orthonormal basis of the eigen-subspace of A - alpha_k B for its r largest eigenvalues.

Theorem: let {alpha_k} be generated by (6). Then (1) {alpha_k} is monotonically increasing and converges to lambda*, and (2) the convergence is quadratic. Since lambda* is a simple root of f(lambda) = 0 and f is twice continuously differentiable, the classical convergence theory of Newton's method gives (2). For (1), one shows alpha_{k+1} >= alpha_k, so {alpha_k} is monotonically increasing; moreover alpha_k <= lambda*, so the limit alpha* exists and f(alpha*) >= 0. If f(alpha*) > 0 then, with mu_1 >= ... >= mu_n > 0 the eigenvalues of B, a contradiction with the convergence of {alpha_k} is derived; hence f(alpha*) = 0 and alpha_k converges to lambda*.

3 Numerical experiments

This section reports two experiments on a face recognition data set used by Guo, Yang et al. in 2003. Experiment 1 solves the trace-ratio problem with the successive method and with Newton's method; as summarized in Table 1, the solution computed by Newton's method attains a larger objective value than the successive solution and is therefore more accurate. Experiment 2 uses the successive solution as the initial point of the Newton iteration; Table 2 compares the iteration counts.

Table 1: Solutions computed by the successive method and by Newton's method (the Newton solution attains the larger maximum).

Table 2: Newton's method with and without the improved initial point.
Algorithm                                          Iterations
Newton's method                                    10
Newton's method started from the successive solution   5

The experiments show that using the successive solution as the initial point greatly reduces the number of Newton iterations.

4 Conclusion

In general, Newton's method solves the trace-ratio problem more accurately than the successive method, but its iteration can be slow. Feeding the successive solution into Newton's method as the initial point makes the Newton iteration converge noticeably faster.
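A minimal sketch of the Newton (ITR-style) iteration (6) in base R: at each step take the top-r eigenvectors of A - alpha*B and update alpha with the trace ratio. The random symmetric test matrices and the tolerance are illustrative; B is made positive definite so the ratio is well defined.

## Newton / fixed-point iteration for max tr(X'AX)/tr(X'BX) over orthonormal X.
trace_ratio <- function(A, B, r, tol = 1e-10, max_iter = 100) {
  n <- nrow(A)
  X <- qr.Q(qr(matrix(rnorm(n * r), n, r)))          # random orthonormal start
  alpha <- sum(diag(t(X) %*% A %*% X)) / sum(diag(t(X) %*% B %*% X))
  for (k in seq_len(max_iter)) {
    V <- eigen(A - alpha * B, symmetric = TRUE)$vectors[, 1:r, drop = FALSE]
    alpha_new <- sum(diag(t(V) %*% A %*% V)) / sum(diag(t(V) %*% B %*% V))
    X <- V
    if (abs(alpha_new - alpha) < tol) { alpha <- alpha_new; break }
    alpha <- alpha_new
  }
  list(X = X, value = alpha, iterations = k)
}

set.seed(2)
n <- 20; r <- 3
M <- matrix(rnorm(n * n), n); A <- crossprod(M)            # symmetric test matrix
N <- matrix(rnorm(n * n), n); B <- crossprod(N) + diag(n)  # positive definite
trace_ratio(A, B, r)$value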
References:
[1] Foley D H, Sammon J W. An optimal set of discriminant vectors. IEEE Transactions on Computers, 1975, 24(3): 281-289.
[2] Kittler J. On the discriminant vector method of feature selection. IEEE Transactions on Computers, 1977, 26(6): 604-606.
[3] Sammon J W. An optimal discriminant plane. IEEE Transactions on Computers, 1970, C-19(9): 826-829.
[4] Lai Fuhui. Subspace dimensionality reduction algorithms based on the trace ratio problem [D]. Beijing: Tsinghua University, 2017.
[5] Friedman J H. Regularized discriminant analysis. Journal of the American Statistical Association, 1989, 84: 165-175.
[6] Hastie T, Buja A, Tibshirani R. Penalized discriminant analysis. Annals of Statistics, 1995, 23: 73-102.
[7] Ye J. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research, 2005, 6: 483-502.
[8] Liu K, Cheng Y Q, Yang J Y. An efficient algorithm for the Foley-Sammon optimal set of discriminant vectors by algebraic method. International Journal of Pattern Recognition and Artificial Intelligence, 1992, 6(5): 817-819.
[9] Guo Y, Li S J, Yang T S, et al. A generalized Foley-Sammon transform based on generalized Fisher discriminant criterion and its application to face recognition. Pattern Recognition Letters, 2003, 24: 147-158.
[10] Wang H, Yan S, Xu D, et al. Trace ratio vs. ratio trace for dimensionality reduction. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2007: 1-8.
[11] Zhang Jianwen, Liu Xinguo. An iterative method for linear discriminant analysis and its applications. Periodical of Ocean University of China, 2015, 45(11): 119-124.
[12] Golub G H, Van Loan C F. Matrix Computations. Baltimore: The Johns Hopkins University Press, 1996.
[13] Sun Jiguang. Matrix Perturbation Analysis. 2nd ed. Beijing: Science Press, 2001.
[14] Zhang A L, Hayashi S. Celis-Dennis-Tapia based approach to quadratic fractional programming problems with two quadratic constraints. Numerical Algebra, Control and Optimization, 2011, 1: 83-98.
[15] Guo Y F, Yang J Y. An iterative algorithm for the generalized optimal set of discriminant vectors and its application to face recognition. Chinese Journal of Computers, 2000, 23(11): 1189-1195.
Feature selection packages in R: Boruta and caret

July 15, 2011. For the analysis of large data sets, feature selection and dimensionality reduction are indispensable, and R has many packages for feature selection. Here I reproduce two English blog posts that explain in detail how to use the Boruta and caret packages and what to watch out for, and that analyse the strengths and weaknesses of the two packages. I have not translated them; the code below works well as given.

Feature selection: All-relevant selection with the Boruta package

Feature selection is an important step for practical commercial data mining, which is often characterised by data sets with far too many variables for model building. There are two main approaches to selecting the features (variables) we will use for the analysis: minimal-optimal feature selection, which identifies a small (ideally minimal) set of variables that gives the best possible classification result (for a class of classification models), and all-relevant feature selection, which identifies all variables that are in some circumstances relevant for the classification.

In this article we take a first look at the problem of all-relevant feature selection using the Boruta package by Miron B. Kursa and Witold R. Rudnicki. This package is developed for the R statistical computing and analysis platform.

Background

All-relevant feature selection is extremely useful for commercial data miners. We deploy it when we want to understand the mechanisms behind the behaviour or subject of interest, rather than just building a black-box predictive model. This understanding leads us to a better appreciation of our customers (or other subject under investigation) and not just how, but why, they behave as they do, which is useful for all areas of the business, including strategy and product development. More narrowly, it also helps us define the variables that we want to observe, which is what will really make a difference in our ability to predict behaviour (as opposed to, say, running the data mining application a little longer).

I really like the theoretical approach that the Boruta package tries to implement. It is based on the more general idea that by adding randomness to a system and then collecting results from random samples of the bigger system, one can actually reduce the misleading impact of randomness in the original sample.

For the implementation, the Boruta package relies on a random forest classification algorithm. This provides an intrinsic measure of the importance of each feature, known as the Z score. While this score is not directly a statistical measure of the significance of the feature, we can compare it to random permutations of (a selection of) the variables to test if it is higher than the scores from random variables. This is the essence of the implementation in Boruta.

The tests

This article is a first investigation into the performance of the Boruta package. For this initial examination we will use a test data sample that we can control so we know what is important and what is not. We will consider 200 observations of 20 normally distributed random variables:

 <- "feature-1"
library("Boruta")
set.seed(1)
## Set up artificial test data for our analysis
n.var <- 20
n.obs <- 200
x <- data.frame(V=matrix(rnorm(n.var*n.obs), n.obs, n.var))

Normal distribution has the advantage of simplicity, but for commercial applications, where highly non-normally distributed features like money spent are important, it may not be the best test.
Nevertheless, we will use it for now and define a simple utility function before we get on to the tests:## Utility function to make plots of Boruta test resultsmake.plots <- function(b, num,true.var = NA,main = paste("Boruta feature selection for test", num)) {write.text <- function(b, true.var) {if ( !is.na(true.var) ) {text(1, max(attStats(b)$meanZ), pos = 4,labels = paste("True vars are V.1-V.",true.var, sep = ""))}}plot(b, main = main, las = 3, xlab = "")write.text(b, true.var)png(paste(, num, "png", sep = "."), width = 8, height = 8, units = "cm", res = 300, pointsize = 4)plot(b, main = main, lwd = 0.5, las = 3, xlab = "")write.text(b, true.var)dev.off()}Test 1: Simple test of single significant variableFor a simple classification based on a single variable, Boruta performs well: while it identifies three variables as being potentially important, this does include the true variable (V.1) and the plot clearly shows it as being by far the most significant.## 1. Simple test of single variabley.1 <- factor( ifelse( x$V.1 >= 0, 'A', 'B' ) )b.1 <- Boruta(x, y.1, doTrace = 2)make.plots(b.1, 1)[Example 1]Figure 1: Simple test of Boruta feature selection with single variable.Test 2: Simple test of linear combination of variablesWith a test of a linear combination of the first four variables where the weights are decreasing from 4 to 1, we begin to get closer to the limitations of the approach.## 2. Simple test of linear combinationn.dep <- floor(n.var/5)print(n.dep)m <- diag(n.dep:1)y.2 <- ifelse( rowSums(as.matrix(x[, 1:n.dep]) %*% m) >= 0, "A", "B" ) y.2 <- factor(y.2)b.2 <- Boruta(x, y.2, doTrace = 2)make.plots(b.2, 2, n.dep)[Example 2]Figure 2: Simple test of Boruta feature selection with linear combination of four variables.The implementation correctly identified the first three variables (with weights 4, 3, and 2, respectively) as being important, but it had the fourth variable as possible along with the two random variables V.8 and V.9. Still, six variables are more approachable than twenty.Test 3: Simple test of less-linear combination of four variablesFor this text and the following we consider less obvious combinations of the first four variables. If we just count how many of them are positive, then we get to a situation where Boruta excels (because random forests excel at this type of problem).## 3. Simple test of less-linear combinationy.3 <- factor(rowSums(x[, 1:n.dep] >= 0))print(summary(y.3))b.3 <- Boruta(x, y.3, doTrace = 2)print(b.3)make.plots(b.3, 3, n.dep)[Example 3]Figure 3: Simple test of Boruta feature selection counting the positives of four variables.Test 4: Simple test of non-linear combinationFor a spectacular fail of the Boruta approach we will have to consider a classification in the hyperplane of the four variables. For this simple example, we simply count if there are an even or odd number of positive values among the first four variables:## 4. Simple test of non-linear combinationy.4 <- factor(rowSums(x[, 1:n.dep] >= 0) %% 2)b.4 <- Boruta(x, y.4, doTrace = 2)print(b.4)make.plots(b.4, 4, n.dep)Example 4Figure 4: Simple test of Boruta feature selection with non-linear combination of four variablesOuch. The package rejects the four known significant variables. It is too hard for the random forest approach. 
Increasing the number of observations to 1,000 does not help though at 5,000 observations Boruta identifies the four variables right.LimitationsSome limitations of the Boruta package are worth highlighting:It only works with classification (factor) target variables. I am not sure why: as far as I remember, the random forest algorithm also provides a variable significance score when it is used as a predictor, not just when it is run as a classifier.It does not handle missing (NA) values at all. This is quite a problem when working with real data sets, and a shame as random forests are in principle very good at handling missing values. A simple re-write of the package using the party package instead of randomForest should be able to fix this issue.It does not seem to be completely stable. I have crashed it on several real-world data sets and am working on a minimal set to send to the authors.But this is a really promising approach, if somewhat slow on large sets. I will have a look at some real-world data in a future post.Feature selection: Using the caret packageFeature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked atall-relevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.The caret package provides a very flexible framework for the analysis as we shall see, but first we set up the artificial test data set as in the previous article.## Feature-bc.R - Compare Boruta and caret feature selection## Copyright © 2010 Allan Engelhardt (/) <- "feature-bc"library("caret")## Load early to get the warnings out of the way:library("randomForest")library("ipred")library("gbm")set.seed(1)## Set up artificial test data for our analysisn.var <- 20n.obs <- 200x <- data.frame(V = matrix(rnorm(n.var*n.obs), n.obs, n.var))n.dep <- floor(n.var/5)cat( "Number of dependent variables is", n.dep, "\n")m <- diag(n.dep:1)## These are our four test targetsy.1 <- factor( ifelse( x$V.1 >= 0, 'A', 'B' ) )y.2 <- ifelse( rowSums(as.matrix(x[, 1:n.dep]) %*% m) >= 0, "A", "B" ) y.2 <- factor(y.2)y.3 <- factor(rowSums(x[, 1:n.dep] >= 0))y.4 <- factor(rowSums(x[, 1:n.dep] >= 0) %% 2)The flexibility of the caret package is to a large extent implemented by using control objects. 
Here we specify to use the randomForest classification algorithm (which is also what Boruta uses), and if the multicore package is available then we use that for extra performance (you can also use MPI etc – see the documentation):

control <- rfeControl(functions = rfFuncs, method = "boot", verbose = FALSE,
                      returnResamp = "final", number = 50)
if ( require("multicore", quietly = TRUE, warn.conflicts = FALSE) ) {
  control$workers <- multicore:::detectCores()
  control$computeFunction <- mclapply
  control$computeArgs <- list(mc.preschedule = FALSE, mc.set.seed = FALSE)
}

We will consider from one to six features (using the sizes variable) and then we simply let it loose:

sizes <- 1:6

## Use randomForest for prediction
profile.1 <- rfe(x, y.1, sizes = sizes, rfeControl = control)
cat( "rf : Profile 1 predictors:", predictors(profile.1), fill = TRUE )
profile.2 <- rfe(x, y.2, sizes = sizes, rfeControl = control)
cat( "rf : Profile 2 predictors:", predictors(profile.2), fill = TRUE )
profile.3 <- rfe(x, y.3, sizes = sizes, rfeControl = control)
cat( "rf : Profile 3 predictors:", predictors(profile.3), fill = TRUE )
profile.4 <- rfe(x, y.4, sizes = sizes, rfeControl = control)
cat( "rf : Profile 4 predictors:", predictors(profile.4), fill = TRUE )

The results are:

rf : Profile 1 predictors: V.1 V.16 V.6
rf : Profile 2 predictors: V.1 V.2
rf : Profile 3 predictors: V.4 V.1 V.2
rf : Profile 4 predictors: V.10 V.11 V.7

If you recall the feature selection with Boruta article, then the results there were:

Profile 1: V.1, (V.16, V.17)
Profile 2: V.1, V.2, V.3, (V.8, V.9, V.4)
Profile 3: V.1, V.4, V.3, V.2, (V.7, V.6)
Profile 4: V.10, (V.11, V.13)

To show the flexibility of caret, we can run the analysis with another of the built-in classifiers:

## Use ipred::ipredbagg for prediction
control$functions <- treebagFuncs
profile.1 <- rfe(x, y.1, sizes = sizes, rfeControl = control)
cat( "treebag: Profile 1 predictors:", predictors(profile.1), fill = TRUE )
profile.2 <- rfe(x, y.2, sizes = sizes, rfeControl = control)
cat( "treebag: Profile 2 predictors:", predictors(profile.2), fill = TRUE )
profile.3 <- rfe(x, y.3, sizes = sizes, rfeControl = control)
cat( "treebag: Profile 3 predictors:", predictors(profile.3), fill = TRUE )
profile.4 <- rfe(x, y.4, sizes = sizes, rfeControl = control)
cat( "treebag: Profile 4 predictors:", predictors(profile.4), fill = TRUE )

This gives:

treebag: Profile 1 predictors: V.1 V.16
treebag: Profile 2 predictors: V.2 V.1
treebag: Profile 3 predictors: V.1 V.3 V.2
treebag: Profile 4 predictors: V.10 V.11 V.1 V.7 V.13

And of course, if you have your own favourite model class that is not already implemented, then you can easily do that yourself. We like gbm from the package of the same name, which is kind of silly to use here because it provides variable importance automatically as part of the fitting process, but it may still be useful. It needs a numeric response, so we do:

## Use gbm for prediction
y.1 <- as.numeric(y.1)-1
y.2 <- as.numeric(y.2)-1
y.3 <- as.numeric(y.3)-1
y.4 <- as.numeric(y.4)-1

gbmFuncs <- treebagFuncs
gbmFuncs$fit <- function (x, y, first, last, ...)
{
  library("gbm")
  n.levels <- length(unique(y))
  if ( n.levels == 2 ) {
    distribution = "bernoulli"
  } else {
    distribution = "gaussian"
  }
  gbm.fit(x, y, distribution = distribution, ...)
}

gbmFuncs$pred <- function (object, x) {
  n.trees <- suppressWarnings(gbm.perf(object,
                                       plot.it = FALSE,
                                       method = "OOB"))
  if ( n.trees <= 0 ) n.trees <- object$n.trees
  predict(object, x, n.trees = n.trees, type = "link")
}

control$functions <- gbmFuncs
n.trees <- 1e2  # Default value for gbm is 100

profile.1 <- rfe(x, y.1, sizes = sizes, rfeControl = control, verbose = FALSE,
                 n.trees = n.trees)
cat( "gbm : Profile 1 predictors:", predictors(profile.1), fill = TRUE )
profile.2 <- rfe(x, y.2, sizes = sizes, rfeControl = control, verbose = FALSE,
                 n.trees = n.trees)
cat( "gbm : Profile 2 predictors:", predictors(profile.2), fill = TRUE )
profile.3 <- rfe(x, y.3, sizes = sizes, rfeControl = control, verbose = FALSE,
                 n.trees = n.trees)
cat( "gbm : Profile 3 predictors:", predictors(profile.3), fill = TRUE )
profile.4 <- rfe(x, y.4, sizes = sizes, rfeControl = control, verbose = FALSE,
                 n.trees = n.trees)
cat( "gbm : Profile 4 predictors:", predictors(profile.4), fill = TRUE )

And we get the results below:

gbm : Profile 1 predictors: V.1 V.10 V.11 V.12 V.13
gbm : Profile 2 predictors: V.1 V.2
gbm : Profile 3 predictors: V.4 V.1 V.2 V.3 V.7
gbm : Profile 4 predictors: V.11 V.10 V.1 V.6 V.7 V.18

It is all good and very flexible, for sure, but I can't really say it is better than the Boruta approach for these simple examples.

【1】/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html
【2】/Blogs/Data/Feature-selection-Using-the-caret-package.html
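As a quick sanity check on any of the rfe profiles computed above, one might also inspect how the resampled performance changes with subset size. The two calls below use standard caret accessors; the exact numbers will of course depend on the run.

## Inspect one of the rfe profiles from above (illustrative only)
print(profile.1$results)             # resampled performance by subset size
plot(profile.1, type = c("g", "o"))  # performance profile as a lattice plot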
Feature selection: Shapley values
In machine learning, the Shapley value is one way of measuring feature importance, and it can be used for feature selection. The Shapley value was originally proposed by Lloyd Shapley in game theory and was later brought into machine learning to explain model predictions and to quantify the contribution of individual features. The core idea is to use the game-theoretic Shapley value to measure how much each feature contributes to the model's predictions. In a feature selection task, Shapley values can identify which features have an important influence on model performance and so help select the most important feature subset.

Feature Selection Shapley refers to methods that use Shapley values to carry out feature selection. The importance of each feature is determined from the computed Shapley values, and the best feature subset is then selected accordingly. This can improve a model's interpretability and generalisation ability, help avoid overfitting, and shed light on the feature interactions and influences behind the model. In short, Feature Selection Shapley is a Shapley-value-based approach to feature selection that can improve both the performance and the interpretability of machine learning models.
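A minimal sketch of the idea in R: the Shapley value of a feature is its average marginal contribution to a model's performance over random orderings of the features. Everything below (the toy data, the logistic-regression scoring function and the function names) is illustrative only and is not tied to any particular Shapley package.

## Monte-Carlo Shapley estimate of feature importance (illustrative sketch).
## "dat", "target", "score" and "shapley.importance" are made-up names.
set.seed(42)
n <- 200
dat <- data.frame(V1 = rnorm(n), V2 = rnorm(n), V3 = rnorm(n), V4 = rnorm(n))
target <- factor(ifelse(dat$V1 + 0.5 * dat$V2 + rnorm(n, sd = 0.5) >= 0, "A", "B"))

## Performance of a feature subset: training accuracy of a logistic model fitted on it.
score <- function(features) {
  if (length(features) == 0) return(max(table(target)) / length(target))  # majority-class baseline
  fit <- glm(target ~ ., data = dat[, features, drop = FALSE], family = binomial)
  pred <- ifelse(predict(fit, type = "response") >= 0.5, levels(target)[2], levels(target)[1])
  mean(pred == target)
}

## Average marginal gain in score when a feature joins a random ordering of the others.
shapley.importance <- function(n.perm = 50) {
  vars <- names(dat)
  phi <- setNames(numeric(length(vars)), vars)
  for (p in seq_len(n.perm)) {
    ord <- sample(vars)
    prev <- score(character(0))
    for (i in seq_along(ord)) {
      cur <- score(ord[1:i])
      phi[ord[i]] <- phi[ord[i]] + (cur - prev)
      prev <- cur
    }
  }
  sort(phi / n.perm, decreasing = TRUE)
}

print(shapley.importance())  # features ranked by estimated Shapley contribution

Packages such as iml or fastshap provide more efficient estimators, but the Monte-Carlo loop above is enough to see why highly ranked features are good candidates for the selected subset.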
Discriminant Feature Selection for Texture Classification

Abhir H. Bhalerao and Nasir M. Rajpoot
Department of Computer Science, University of Warwick, UK
{abhir|nasir}@

Abstract

The computational complexity of a texture classification algorithm is limited by the dimensionality of the feature space. Although finding the optimal feature subset is an NP-hard problem [1], a feature selection algorithm that can reduce the dimensionality of the problem is often desirable. In this paper, we report work on a feature selection algorithm for texture classification using two subband filtering methods: a full wavelet packet decomposition and a Gabor type decomposition. The value of a cost function associated with a subband (feature) is used as a measure of relevance of that subband for classification purposes. This leads to a fast feature selection algorithm which ranks the features according to their measure of relevance. Experiments on a range of test images and both filtering methods provide results that are promising.

1 Introduction

In many real-world classification applications, a number of features are often computed to aid the supervised learning. Some of the candidate features, however, may be irrelevant, in that they may be adding noise to useful information, or redundant in the presence of other relevant features. A feature selection algorithm addresses this problem by selecting a subset of features which are directly relevant to the target concept, which is texture classification here. Eigenspace methods like the principal component analysis (PCA) can be used to transform the feature space to a set of independent and orthogonal axes and rank those axes by the extent of variation. This is one way to reduce the set of features with regard to the global data covariance. Fisher's linear discriminant analysis (LDA), on the other hand, finds the feature space mapping which maximises the ratio of between-class to within-class variation jointly for each feature (dimension). PCA can be further applied to find a compact subspace to reduce the feature dimensionality.

The central theme of almost all feature subset selection methods is to reduce the computational complexity and improve the classification performance by discarding those features which may be irrelevant or redundant (see, for instance, [2] for a survey of such algorithms). Despite the advances in feature selection, texture classification algorithms have often used as many features as possible [3] without consideration of whether the features are independent and discriminating, giving rise to unnecessary computation and sometimes hindering the subsequent classification process. In this paper, we present a fast feature selection scheme that uses a cost function associated with each feature as a measure of relevance of that feature to texture classification. Features are ranked so that the problem of selecting a subset of features reduces to picking the first m features only. Such features can also be termed as discriminant features, offering the benefit that only the discriminant feature subset needs to be computed on the observation data, unlike PCA or LDA methods. This has a significant advantage when dealing with the texture classification problem in 3D [4] because of reduced complexity. We note that Boz [1] has proposed a similar algorithm which uses Quinlan's gain ratio [5] to sort the features according to their relevance.
It is also worth noting that our approach to the problem separates feature selection from the classification stage, therefore putting it in the filter category of Kohavi et al. [2].

The paper is organised as follows. In the next section, a statement of the problem is provided with a description of the cost functions that can be used to measure the feature relevance. In Section 3, two multiresolution feature space mappings investigated in this work and based on subband decomposition are described. Experimental results are presented and discussed in Section 4, and the paper concludes with a summary and future directions.

2 Discriminant Feature Selection

Texture classification can be cast into the general signal classification problem of finding the best mapping, d : X → Y, of a set of input signals, X ⊆ R^n, on to the class labels, Y = {1, ..., K}. Even given a set of training samples, in most interesting problems, the dimensionality of the input space, n, is too great to estimate the optimal classifier, such as the Bayes classifier. Feature extraction allows us to discard much of the redundant information. Multivariate data often contain important structures (features) at lower dimensions, and that the problem has some intrinsic dimension m < n is of course goal dependent [6]. It is therefore prudent to employ a classifier, k : F → Y, on a subspace F of X. We can write the feature space mapping s : X → F. Furthermore, we can try to extract a discriminant subspace of F using a feature selector Θ_m that picks a set of dimensions m which we would hope is the intrinsic dimension of the problem:

d = k \circ \Theta_m \circ s    (1)

The focus of our study here is to investigate two popular multiresolution feature space mappings for textural data: a special kind of wavelet packet transform and a Fourier transform subband filtering akin to a Gabor decomposition, and to employ a "greedy" feature selector which operates sequentially on the feature dimensions (subbands).

Let us define the feature selection problem in the context of subband decomposition as follows. Given a test image x that has been decomposed into n subbands, each of which can be regarded as a feature, the goal is to select m subbands such that the resulting misclassification error is minimal. Here we show that it is still possible to gauge the discrimination power of a subband independent of the classifier, conditioned on K representative texture samples, T = {(x_1, y_1), ..., (x_K, y_K)}. Given the nature of subband decompositions under consideration, a subband can be regarded as being highly discriminant if it highlights the frequency characteristics of one class but not the other. In other words, if the coefficients of a particular subband light up (ie, are higher in magnitude) for one class but are relatively insignificant for the other class, the subband can prove to be helpful in terms of classification performance.

Rajpoot [7] has previously proposed to use a symmetric version of the Kullback-Leibler distance, between the normalised energies of a subband for training images, as a measure of relevance or an estimate of the discrimination power of a subband. In this paper, we investigate the effectiveness of a list of cost functions described below for our feature selection algorithm using two feature space mappings and a range of test images. Let F and G denote the transform coefficients of a particular subband for training images x_1 and x_2, belonging to two different texture classes, respectively. We considered three ways of forming pseudo-distributions f from the subband coefficients F:

f_1(x,y) = F(x,y), \quad f_2(x,y) = |F(x,y)|^2, \quad f_3(x,y) = |F(x,y)|^2 / \|x_1\|^2    (2)
Similarly, the pseudo-distributions g_i (i = 1, 2, 3) can be obtained by using the subband coefficients G(x,y) and texture image x_2. Four discriminant measures were used in our experiments: symmetric Kullback-Leibler (KL) divergence, Fisher distance (FD), Bhattacharyya distance (BD), and Euclidean distance (ED) – denoted respectively by KL_i, FD_i, BD_i, and ED_i – defined as follows.

KL_i(F,G) = I(f_i, g_i) + I(g_i, f_i)    (3)

FD_i(F,G) = \frac{|\mu(f_i) - \mu(g_i)|^2}{\sigma^2(f_i) + \sigma^2(g_i)}    (4)

BD_i(F,G) = \frac{1}{4}\log\left[\frac{(\sigma^2(f_i) + \sigma^2(g_i))^2}{4\,\sigma^2(f_i)\,\sigma^2(g_i)}\right] + \frac{1}{4}\,\frac{|\mu(f_i) - \mu(g_i)|^2}{\sigma^2(f_i) + \sigma^2(g_i)}    (5)

ED_i(F,G) = \|f_i - g_i\|_2    (6)

where I(f,g) = \sum_{x,y} f(x,y)\log[f(x,y)/g(x,y)] is the relative entropy between f and g, \|\cdot\|_2 denotes the l_2-norm, and \mu(f) and \sigma^2(f) respectively are the mean and variance of f. The feature selector, Θ_m, operates by selecting the m subbands with the largest discrimination power using one of the above four measures. Saito et al. [6] warn that the approach of sequentially measuring the efficacy of each dimension of the feature space independently may be "too greedy", as 2D and higher dimensional structures in the feature space may be missed. The principal advantage, however, is that we can make a feature selection solely on training data, which reduces the complexity of the final classifier on test samples. When compared with traditional multivariate feature projection methods like PCA or LDA, this advantage is significant.

3 Subband Decompositions

3.1 Full Wavelet Packet Decomposition

The wavelet packet (WP) decomposition [8] extends the standard dyadic discrete wavelet transform such that the high frequency wavelet subbands are decomposed as well as the low frequency subbands, to an arbitrary number of levels. The frequency intervals of varying bandwidths can be adaptively selected, e.g. by picking the most energetic subbands, to extract specific frequencies present in the given signal from this overcomplete representation.

Figure 1: Test images created from Brodatz album examples show combinations of granular, oriented and periodic textures (natural and synthetic). (a) D9D19 grass/paper, (b) D15D9f straw/grass, (c) D77D15 cloth/straw, (d) D9D3 grass/reptile, (e) D4D12 cork/bark, (f) D65D65r fence/fence.

The full wavelet packet decomposition (FWP), sometimes also known as the uniform wavelet subband decomposition, is a special type of wavelet packet transform wherein all subbands are decomposed up to a given number of levels. The discrete wavelet packet transform of a 1D discrete signal f(x), x = 0, 1, ..., N-1, can be computed as follows. The wavelet packet coefficients are defined as

w_0^0(x) = f(x)    (7)

w_{2j}^l(x) = \sum_k g(k - 2x)\, w_j^{l-1}(k), \quad w_{2j+1}^l(x) = \sum_k h(k - 2x)\, w_j^{l-1}(k), \quad x = 0, \ldots, N2^{-l} - 1    (8)

where l = 1, 2, ..., L; L = \log_2 N, and w_j^l(x) is the transform coefficient corresponding to the wavelet packet function which has relative support size 2^l, frequency j2^{-l}, and is located at x2^l. In other words, l, j and x can be regarded as the scale, frequency and position indices of the corresponding wavelet packet function respectively. The coefficients {h_x} and {g_x} correspond to the lowpass and highpass filters respectively for a two-channel filter bank, and the transform is invertible if appropriate dual filters {\tilde{h}_x}, {\tilde{g}_x} are used on the synthesis side.

The FWP can be used as a feature space mapping by generating a feature image from each of its subbands at a given level. A feature image corresponding to a subband can be generated by computing the magnitude of contribution of the coefficients of that subband to the input image, and smoothing it with a Gaussian filter. As can be seen from Figure 2(b), which shows sixteen features for the grass/straw image computed from the subbands of a two-level FWP, some of the features may be adding merely noise to the performance of a texture classification algorithm.
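To make the link between the measures of Section 2 and the subband features concrete, the sketch below scores a single subband from the coefficient matrices F and G of the two training textures. It is a minimal illustration, not the authors' code: the function name is made up, the epsilon guard is an implementation convenience, and using the f_3-style pseudo-distribution for every measure (normalised, by default, by the subband energy rather than the full image energy) is a simplification of the paper's pairing of ED with f_3, KL and FD with f_2, and BD with f_1.

## Illustrative sketch: relevance of one subband from coefficient matrices F and G.
subband.relevance <- function(F, G,
                              energy.1 = sum(abs(F)^2),
                              energy.2 = sum(abs(G)^2)) {
  eps <- 1e-12                          # guard against log(0) and division by zero
  f <- abs(F)^2 / energy.1              # pseudo-distribution for texture 1
  g <- abs(G)^2 / energy.2              # pseudo-distribution for texture 2
  f <- f / sum(f);  g <- g / sum(g)     # normalise so each sums to one
  I <- function(p, q) sum(p * log((p + eps) / (q + eps)))   # relative entropy I(p, q)
  mu.f <- mean(f);  mu.g <- mean(g)
  v.f  <- var(as.vector(f));  v.g <- var(as.vector(g))
  c(KL = I(f, g) + I(g, f),                                  # eq. (3)
    FD = (mu.f - mu.g)^2 / (v.f + v.g),                      # eq. (4)
    BD = 0.25 * log((v.f + v.g)^2 / (4 * v.f * v.g)) +
         0.25 * (mu.f - mu.g)^2 / (v.f + v.g),               # eq. (5)
    ED = sqrt(sum((f - g)^2)))                               # eq. (6)
}

Applying this to every subband and sorting the resulting scores in decreasing order gives the ranking used by the selector Θ_m.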
3.2 Wilson-Spann Decomposition

The Wilson-Spann decomposition [9] uses a set of real, band-limited basis functions which cover the Fourier half-plane. The decomposition is multiresolution, resembles Gabor filtering, and has been shown to perform well in representing image features exhibiting granularity and local orientation. In [9], the authors proposed using finite prolate spheroidal sequences (FPSS) because they are optimal in maximising energy in both spatial and spatial-frequency domains. For ease of implementation, we use instead truncated Gaussian functions of similar shape, centred in the frequency subband. The subband coefficients s_j^l(u) for subband j at level (or scale) l of the decomposition of a 2D signal f(x), x = (x, y)^T, x, y = 0, 1, ..., N-1, can be computed by applying a set of band-limiting operators Λ_j^l to the filtered discrete Fourier transform coefficients:

s_j^l(u) = \Lambda_j^l\, g(u; m_j^l, \Sigma_j^l)\, F(u)    (9)

where g(u) is an isotropic Gaussian function with centre frequency m_j^l and bandwidth determined by Σ_j^l, and F(u) is the DFT of f(x). Various arrangements of these filters are possible: a centre-surround tessellation with a low-pass region on the interval [-π/2, π/2] and 6 bands of size [π/4, π/4] creates a set of subbands that have been demonstrated to capture well textural variation in frequency and orientation. At the next level of the decomposition, the central low-pass region is itself tessellated by a low-pass region of size [π/4, π/4] surrounded by oriented filters of size [π/8, π/8], and so on (Figure 2(a)).
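As an illustration of the Fourier-domain filtering in equation (9), the sketch below computes one Gabor-type subband feature image in base R. The centre frequency, bandwidth and toy test image are assumptions chosen for the example, not the filter tessellation used in the paper, and the Gaussian smoothing of the feature image is omitted for brevity.

## Illustrative sketch: one Gabor-type subband feature image in the spirit of eq. (9).
subband.feature <- function(img, m = c(pi/2, 0), sigma = pi/8) {
  N <- nrow(img)
  ## Discrete frequency grid mapped to [-pi, pi), matching the ordering of fft().
  w <- 2 * pi * (0:(N - 1)) / N
  w <- ifelse(w >= pi, w - 2 * pi, w)
  U <- matrix(w, N, N)    # frequency along the first (row) dimension
  V <- t(U)               # frequency along the second (column) dimension
  ## Isotropic Gaussian weighting g(u; m, sigma) centred on the subband.
  G <- exp(-((U - m[1])^2 + (V - m[2])^2) / (2 * sigma^2))
  Fu <- fft(img)                                   # 2D DFT of the image
  s <- fft(G * Fu, inverse = TRUE) / length(img)   # back to the spatial domain
  Mod(s)                                           # magnitude as the feature image
}

## Toy usage: a texture oscillating along the rows lights up this subband.
img  <- outer(1:64, 1:64, function(i, j) sin(pi/2 * i)) + matrix(rnorm(64^2, sd = 0.1), 64, 64)
feat <- subband.feature(img)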
4 Experiments and Discussion

The discriminant feature selection algorithm described in Section 2 was tested on texture examples taken from the Brodatz texture catalogue. We created two-class images containing pairs of textures with granularity (D9 = grass, D4 = cork, D12 = bark), principal orientation (D15 = straw) and periodicity (D3 = reptile, D103 = burlap, D65 = fence), as shown in Figure 1. Subband decompositions of these images using both the FWP (2 levels using Daubechies-8 filters) and the Wilson-Spann scheme (for 3 and 5 levels) were made. Given a pair of training samples, a set of discriminant values were calculated using ED_3, KL_2, FD_2 and BD_1 for the corresponding subband decomposition. These values were then sorted into decreasing order. We then used a k-means classifier to classify the two-class test image, taking increasing numbers of subband features in the order defined by the discriminant value of each feature. The percentage misclassification error was then calculated given a binary mask for the true result. Figure 3 shows error plots for the test images D15D9f (grass/straw), D9D3 (reptile/grass), and D65D65r (fence/fence) for both subband decompositions. These plots are representative of the results overall, and we can begin to draw some conclusions on the effectiveness of particular cost functions and subband decompositions given a pair of textures.

For textures containing a combination of granularity and orientation, such as straw/grass, the plots show FD and BD perform better as a feature relevance measure than KL or ED, which performs least well (Figures 3(a) and 3(b)). BD is able to bring the misclassification error to about 5% for 4 bands. The error then reduces marginally with the addition of more features to the classifier. Images containing a natural periodic texture, like reptile skin, against a granular/random texture like grass (Figure 1(d)), show the mean-variance cost measures, FD and BD, to significantly outperform the point-wise measures, KL and ED. But again, a relatively small subset of features is able to achieve a good classification. Note that in the case of the FWP, Figure 3(c), the plots for BD and FD show poorer results beyond 10 features, indicating that bands 11-16 do not have useful discriminating power and are working against the classifier.

Figure 3(f) reveals that FD does less well when a periodic texture is paired with another periodic texture (Figure 1(f), which has fence against rotated fence). In this case, both the point-wise measures, KL and ED, can identify the subbands where the spectral distributions vary (likely to be a periodic pattern in the Fourier domain), whereas the average energy difference, which is what FD estimates, cannot reveal whether a band is discriminating. The first term of the BD measure takes into account the variability, which explains its better performance for these images. For the FWP (Figure 3(e)), KL and ED continue to perform worse, which can be explained by the transform coefficients having spatial coordinates. A point-wise, pseudo-distribution measure like KL is fooled by the energy voids in a periodic texture aligning with energy peaks from the other texture at the same coordinate. KL, however, is widely used in WP basis selection [10] because it is additive and can be used to compare the relative energy when choosing child subbands over the parent for an image region. Figure 4 shows the four most discriminant feature images and the corresponding classification results for the straw/grass image. It can be seen from this figure that the features selected by our algorithm prove to be relevant to the task at hand, ie texture classification.

As regards the optimality or otherwise of the sequential ranking, we have experimented with reversing the discriminant ranking order and comparing with random feature selection orders. Figure 5 shows a plot of the misclassification curve for an example image with different feature selections. Note that the 'best' performance is obtained by the decreasing discriminant ranking of our feature selection, whereas the 'worst' is given by reversing the best ranking order. The random feature selections fall somewhere in between. This indicates that the misclassification curves are approximately bounded by the two decreasing and increasing discriminant rank orders. It is not surprising to see all orders converge rapidly beyond n/2 features, as the number of possible ways to select the remaining features decreases: there are n! possible rank orders at the start, (n/2)! half way along, and so on. If it is to be believed that the random orders fall along the centroid of this distribution, then it would appear that the likelihood that our 'best' rank ordering could have been obtained simply by chance is small. The decreasing rank order is therefore probably close to the optimal feature selection order.
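To make the evaluation procedure of this section concrete, the following sketch ranks subband features by a relevance score and reports the k-means misclassification error as more features are added. It is an illustration only: the feature matrix, score vector, ground-truth mask and function name are hypothetical inputs, and the label-matching step is a simplification for the two-class case.

## Illustrative sketch: greedy evaluation of a feature ranking with k-means.
## 'features' is an n.pixels x n.subbands matrix of smoothed feature images,
## 'scores' holds the per-subband relevance values, 'truth' is a vector of 1/2 labels.
evaluate.ranking <- function(features, scores, truth, max.m = ncol(features)) {
  ord <- order(scores, decreasing = TRUE)        # most discriminant subbands first
  sapply(seq_len(max.m), function(m) {
    cl <- kmeans(features[, ord[1:m], drop = FALSE], centers = 2, nstart = 5)$cluster
    ## Cluster labels are arbitrary, so take the better of the two possible matchings.
    100 * min(mean(cl != truth), mean(cl == truth))
  })
}

Plotting the returned error curve against the number of retained features reproduces the kind of comparison shown in Figures 3 and 5.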
5 Conclusions

The primary goal of a feature selection algorithm in supervised classification applications is to reduce the dimensionality of the problem by selecting only relevant features and discarding features that are irrelevant or redundant. In this paper, we described a simple yet effective feature selection algorithm for supervised texture classification problems by associating a measure of relevance with each of the features. Although the algorithm can be used with any type of features, its effectiveness was demonstrated by applying it to features extracted from two subband decompositions: a full wavelet packet decomposition and a Gabor type representation. From our experiments, we can conclude that measures based on BD (or FD) seem to be the best for feature selection from FWP subbands. For Fourier domain subbands, like Wilson-Spann, BD is best, but KL can work marginally better when the energy distribution is periodic. Experimental results also show that the ranking of features according to a sorted measure of relevance produces plausible results even using a simple k-means classifier. Investigation into a more sophisticated classifier and the extension of the measures to multi-class problems are matters of further work. Combining a discriminant feature selection method with a basis selection and compact representation also remains to be investigated.

References

[1] O. Boz. Feature Subset Selection by Using Sorted Feature Relevance. In Proc. Intl. Conf. on Machine Learning and Applications, June 2002.
[2] R. Kohavi and G. H. John. Wrappers for Feature Subset Selection. Artificial Intelligence, 97(1-2):273-324, 1997.
[3] T. Randen and J. H. Husoy. Filtering for Texture Classification: A Comparative Study. IEEE Trans. on PAMI, 21(4), April 1999.
[4] C. C. Reyes-Aldasoro and A. H. Bhalerao. Volumetric Texture Classification and Discriminant Feature Selection for MRI. In Information Processing In Medical Imaging (IPMI'03), June 2003.
[5] J. R. Quinlan. Learning Decision Tree Classifiers. ACM Computing Surveys, 28:71-2, 1996. (Invited contribution to 50th anniversary issue.)
[6] N. Saito, R. Coifman, F. B. Geshwind, and F. Warner. Discriminant Feature Extraction Using Empirical Probability Density Estimation and a Local Basis Library. Pattern Recognition, 35:2841-2852, 2002.
[7] N. M. Rajpoot. Texture Classification Using Discriminant Wavelet Packet Subbands. In Proc. IEEE Midwest Symposium on Circuits and Systems, Aug. 2002.
[8] R. R. Coifman and Y. Meyer. Orthonormal Wave Packet Bases. Preprint, Dept. of Maths., Yale University, USA, 1990.
[9] R. Wilson and M. Spann. Finite Prolate Spheroidal Sequences and Their Applications: Image Feature Description and Segmentation. IEEE Trans. PAMI, 10(2):193-203, 1988.
[10] N. Saito and R. R. Coifman. Local Discriminant Bases. In A. F. Laine and M. A. Unser, editors, Mathematical Imaging: Wavelet Applications in Signal and Image Processing II, volume 2303, 1994.

Figure 2: Wilson-Spann and FWP subband decompositions of the example texture image D15D9f (straw/grass) in Figure 1(b). (a) Three levels of the Wilson-Spann subband decomposition using a centre-surround arrangement of subbands across the Fourier domain. (b) Two levels of a Full Wavelet Packet decomposition. Note how in both cases certain subbands clearly contrast the two textures.
Figure 3: Classification results for FWP and Wilson-Spann subbands; panels (a)-(f) plot misclassification error (%) against the number of features.

Figure 4: (a)-(d) The first 4 most discriminant subband feature images from a 3rd-order Wilson-Spann decomposition of the grass/straw image D15D9f. (e) k-means classification result using the first 4 features only. (f)-(i) The first 4 most discriminant subband feature images from 2 levels of FWP on the grass/straw image D15D9f. (j) k-means classification result using the first 4 features only. (Panel labels: (a) Feature 6, (b) Feature 12, (c) Feature 11, (d) Feature 10, (e) Classified, (f) Feature 7, (g) Feature 8, (h) Feature 2, (i) Feature 14, (j) Classified.)

Figure 5: Comparing different discriminant rank orderings to assess the optimality of the decreasing rank order ('Best') against the increasing rank order ('Worst') and three random orderings. The best and worst orderings are the boundaries of the distribution of all n! rank orderings, and the random orderings appear to cluster near the centroid of this distribution.